
NUMA [...]

1. [...]
Moore's law [...] 18 [...] 1Hz [...] CPU [...]

2. [...]
Register [...] RAM [...] Level 1 (L1), L2, L3, and L4 caches [...] TLB (translation look-aside buffer) [...] OS [...] TLB [...]

3. NUMA [...]
NUMA (Non-uniform memory access) [...]

Copyright © by ORSJ. Unauthorized reproduction of this article is prohibited.

Figure 2 shows a system with two Intel Xeon X5460 (Harpertown) CPUs. Each CPU has 4 cores and each core runs 1 thread, giving 8 (= 2 × 4 × 1) threads in total. This system is not NUMA but UMA (Uniform memory access) [...]

Figure 2: 2-way Intel Xeon X5460 (UMA).

[...] NUMA and UMA [...] On the 2-CPU Intel Xeon X5460 system [...] CPU [...] (RAM) [...]. On a NUMA system [...] CPU [...]. Figure 3 shows a NUMA system with 4 Intel Xeon E5-4640 CPUs [...]

Figure 3: 4-way Intel Xeon E5-4640 (NUMA).

4. STREAM

[...] STREAM (STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia. ... stream/) [...] The Intel Xeon E5-4640 (SandyBridge-EP) system has 4 CPUs [...] (= 4 × 8 × 2) [...]

STREAM [...] 4 [...] Triad. For $n$-dimensional vectors $a, b, c \in \mathbb{R}^n$ and a scalar $r$, Triad computes $a \leftarrow b + rc$ [...] bytes [...]. Figure 4 shows Triad parallelized with OpenMP in C/C++.

Figure 4: Triad with OpenMP.

Figure 5 shows the Triad bandwidth (GB/s) on the 4-way Intel Xeon E5-4640 for $n = \{2^{10}, \ldots, 2^{30}\}$. [...] $2^{20}$ [...] with 16, 32, and 64 threads the bandwidth is 95, 98, and 92 GB/s [...]
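The Triad kernel and the way a bandwidth figure is derived from it can be sketched as follows. This is a minimal single-threaded Python sketch, not the OpenMP C/C++ code of Figure 4. The names `triad`, `triad_bandwidth_gbs`, and `measure` are hypothetical, and the count of 24 bytes per element is an assumption (8-byte doubles, two reads and one write per element, which is how STREAM counts Triad traffic).

```python
import time
from array import array

BYTES_PER_ELEMENT = 3 * 8  # assumption: 8-byte doubles; read b, read c, write a


def triad(a, b, c, r):
    """Compute a[i] = b[i] + r * c[i] in place for every index i."""
    for i in range(len(a)):
        a[i] = b[i] + r * c[i]


def triad_bandwidth_gbs(n, seconds):
    """Bandwidth in GB/s (10^9 bytes per second) for one Triad pass over n elements."""
    return BYTES_PER_ELEMENT * n / seconds / 1e9


def measure(n, r=3.0, repeats=5):
    """Run Triad `repeats` times and report the bandwidth of the fastest run."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, r)
        best = min(best, time.perf_counter() - start)
    return a, triad_bandwidth_gbs(n, best)


if __name__ == "__main__":
    a, gbs = measure(2 ** 16)
    print(f"a[0] = {a[0]}, Triad bandwidth = {gbs:.3f} GB/s")
```

Interpreter overhead dominates in pure Python, so the absolute numbers from this sketch are not comparable with the GB/s values reported for the compiled OpenMP kernel. Only the byte accounting carries over.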

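Figure 5: STREAM TRIAD [...]

[...] Hyper-threading [...]

[...] On Linux, the numactl command [...] NUMA [...]. With numactl, [...] NUMA node [...] NUMA node [...] (Figure 3) [...] the options --physcpubind and --membind [...] NUMA ID [...] NUMA ID [...]. On Linux, /proc/cpuinfo lists for each processor a processor ID (processor) and a NUMA ID (physical id). [...] Portable Hardware Locality (HWLOC) [1] [...]

Figure 6 shows the Triad bandwidth (GB/s) for $n = \{2^{10}, 2^{11}, \ldots, 2^{30}\}$ [...] NUMA [...] NUMA node 0 [...]: 12 GB/s [...] NUMA [...] 3 GB/s [...] NUMA [...] 1/4 [...]

Figure 6: [...] NUMA 0 [...] NUMA [...] (GB/s).

4.2 [...]

numactl --localalloc [...] 32 [...] NUMA 0, [...] KBytes [...] NUMA [...] NUMA [...]

numactl --interleave [...] 32 [...] NUMA 0, [...]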
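The lookup of processor IDs per NUMA ID from /proc/cpuinfo, and the construction of a numactl invocation from it, can be sketched as follows. This is a hedged sketch, not the article's procedure: the function names `cpus_by_physical_id` and `numactl_command` and the program name `./stream` are hypothetical. Treating `physical id` (the socket number) as the NUMA ID is an assumption that holds only when each socket forms exactly one NUMA node.

```python
from collections import defaultdict


def cpus_by_physical_id(cpuinfo_text):
    """Map each physical id to the sorted list of processor IDs that report it."""
    groups = defaultdict(list)
    processor = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
        elif key == "physical id" and processor is not None:
            groups[int(value)].append(processor)
    return {pid: sorted(cpus) for pid, cpus in groups.items()}


def numactl_command(node, cpus, program):
    """Build a numactl argument list binding `program` to `cpus` and to memory on `node`."""
    cpu_list = ",".join(str(cpu) for cpu in cpus)
    return ["numactl", f"--physcpubind={cpu_list}", f"--membind={node}"] + list(program)


if __name__ == "__main__":
    # Hypothetical excerpt in the layout of /proc/cpuinfo (other fields omitted).
    sample = (
        "processor\t: 0\nphysical id\t: 0\n\n"
        "processor\t: 1\nphysical id\t: 1\n\n"
        "processor\t: 2\nphysical id\t: 0\n"
    )
    nodes = cpus_by_physical_id(sample)
    print(nodes)
    print(" ".join(numactl_command(0, nodes[0], ["./stream"])))
```

On a real machine the text would come from reading /proc/cpuinfo. The sketch only builds the command line and does not run numactl.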

[...] NUMA 0, 1 [...] Local allocation [...]

4.4 [...]

Figures 7(a) and 7(b) show the STREAM TRIAD bandwidth (GB/s) with 1, 2, and 4 NUMA nodes for $n = \{2^{10}, \ldots, 2^{30}\}$ under the two NUMA memory policies, Local-allocation and Interleaving. [...] $2^{20}$ [...] NUMA [...] Local allocation reaches 13 GB/s, 21 GB/s, and 24 GB/s, whereas Interleaving reaches 13 GB/s, 6 GB/s, and 8 GB/s. [...] Interleaving [...] TRIAD [...] STREAM [...] Local allocation [...] Interleaving [...]

Figure 7: STREAM TRIAD: [...] (GB/s).

5. NUMA [...]

[...] NUMA [...] numactl [...] Linux provides sched_setaffinity(), sched_getaffinity(), and mbind(). [...] sched_setaffinity() [...] sched_setaffinity() [...] mbind() [...] NUMA [...] NUMA [...]

5.1 STREAM TRIAD

[...] TRIAD [...] $a, b, c$ [...] NUMA [...]
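Pinning with sched_setaffinity() and reading the result back with sched_getaffinity() can be sketched through the wrappers in Python's `os` module, which are available on Linux. This is a minimal sketch under that assumption. The helper name `pin_to_cpu` is hypothetical, and mbind() is not shown because the Python standard library has no wrapper for it.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU and return the previous CPU set."""
    previous = os.sched_getaffinity(0)  # pid 0 means the calling process
    os.sched_setaffinity(0, {cpu})
    return previous


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):  # these calls exist only on some platforms
        allowed = os.sched_getaffinity(0)
        target = min(allowed)  # pick a CPU the process is already allowed to use
        previous = pin_to_cpu(target)
        print("pinned to", os.sched_getaffinity(0))
        os.sched_setaffinity(0, previous)  # restore the original CPU set
    else:
        print("sched_setaffinity is not available on this platform")
```

Choosing the target from the set returned by sched_getaffinity() avoids requesting a CPU that a container or batch scheduler has excluded, which would make the call fail.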

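Figure 8: [...] TRIAD [...]

Figure 9: [...] NUMA [...]

Figure 8 [...] TRIAD [...] with 1, 2, and 4 NUMA nodes reaches 24, 48, and 96 GB/s [...]

[...] Breadth-first search (BFS) [...]. For a graph $G = (V, E)$ with $n = |V|$ and $m = |E|$, BFS [...] $O(n + m)$ [...]. In HPC, [...] Graph500 [...]

Graph500 [...] BFS [...] is parameterized by SCALE and edgefactor $= m/n$ (16 [...]) and consists of three steps, (a), (b), and (c):

(a) [...] a Kronecker graph with $n = 2^{\mathrm{SCALE}}$ and $m = n \cdot \mathrm{edgefactor}$ [...]
(b) [...]
(c) 64 BFS [...] traversed edges per second (TEPS) [...]

[...] (c) [...] 64 TEPS [...]. Green Graph500 [...] Graph500 [...] TEPS [...] TEPS/W [...]

Figure 9 [...] BFS [...] (Level) [...] Level-synchronized BFS [...]. Beamer et al. [3] [...] Top-down [...] Bottom-up [...] Small-world [...] Top-down [...] Bottom-up [...]. Beamer [...] Kronecker graph [...] 4-way Intel Xeon E[...] [...] GTEPS ($10^9$ TEPS) [...] NUMA [...] GTEPS [4]. [...] Bottom-up [...] Small-world [...] 2.68 [...] [5]. [...] [4, 5] [...] CSR (Compressed Sparse Row) [...]

2 Graph500: [...]
3 Green Graph500: [...]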
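A level-synchronized BFS over a CSR (Compressed Sparse Row) graph, together with a TEPS figure, can be sketched as follows. This is a minimal sequential, Top-down-only sketch and not the NUMA-aware implementation of [4, 5]; the Top-down/Bottom-up switching of Beamer et al. [3] is not implemented. The function names are hypothetical, and the TEPS count used here (input edges whose endpoint was reached) is a simplifying assumption.

```python
def build_csr(n, edges):
    """Build CSR arrays (row offsets, column indices) for an undirected graph."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]
    columns = [0] * offsets[n]
    cursor = offsets[:n]  # next free slot of each row (slicing makes a copy)
    for u, v in edges:
        columns[cursor[u]] = v
        cursor[u] += 1
        columns[cursor[v]] = u
        cursor[v] += 1
    return offsets, columns


def bfs_levels(offsets, columns, root):
    """Level-synchronized top-down BFS; returns the level of each vertex (-1 if unreached)."""
    n = len(offsets) - 1
    level = [-1] * n
    level[root] = 0
    frontier = [root]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:
            for w in columns[offsets[v]:offsets[v + 1]]:
                if level[w] == -1:  # w is visited for the first time
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier  # all of one level is finished before the next starts
    return level


def teps(edges, level, seconds):
    """Input edges with a reached endpoint, divided by the BFS time in seconds."""
    traversed = sum(1 for u, _ in edges if level[u] != -1)
    return traversed / seconds


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5)]
    offsets, columns = build_csr(6, edges)
    level = bfs_levels(offsets, columns, 0)
    print(level)                    # vertices 4 and 5 are not reachable from 0
    print(teps(edges, level, 2.0))  # 4 traversed edges in a hypothetical 2.0 seconds
```

CSR stores every adjacency list in one contiguous array, so scanning the neighbors of a vertex is a sequential read of `columns`. That is the access pattern a memory-hierarchy-aware BFS relies on.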

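Table 1: [...] $(n, m)$ [...] TEPS

Authors        Machine                          (n, m)           TEPS
Madduri        Cray MTA-2 (40 procs)            (2^21, 2^30)     0.5 G
Agarwal [2]    Intel Xeon X7560, 4 [...]        (2^20, 2^26)     1.3 G
Beamer [3]     Intel Xeon E7-8870, 4 [...]      (2^28, 2^32)     5.1 G
Yasui [4]      Intel Xeon E[...]                (2^26, 2^30)     11.1 G
Yasui [5]      Intel Xeon E[...]                (2^27, 2^31)     29.0 G

[...] The vertex set $V$ is divided into $\ell$ subsets $V_k$:

$$V_k = \left\{ v_j \in V \;\middle|\; j \in \left[ \frac{kn}{\ell}, \frac{(k+1)n}{\ell} \right) \right\}$$

[...] $\ell$ [...] $A$ [...]. In the Top-down direction, for $v \in V$ [...] $A^F_k(v)$; in the Bottom-up direction, for $w \in V_k$ [...] $A^B_k(w)$. [...] $\ell - 1$ [...] $A^F_k(v)$ and $A^B_k(w)$ [...]:

$$A^F_k(v) = \{\, w \mid w \in \{V_k \cap A(v)\} \,\}, \quad v \in V,$$
$$A^B_k(w) = \{\, v \mid v \in A(w) \,\}, \quad w \in V_k.$$

[...] NUMA [...] Graph [...] NUMA [...] BFS [...] Graph500 [...]. Figure 10(a) [...] NUMA [...]; Figure 10(b) [...] NUMA [...]. With $\ell$ [...], the graph $G$ is divided into $\ell$ subgraphs $\{G_k\}$ $(k = \{0, 1, \ldots, \ell - 1\})$, and NUMA node $k$ [...] $V_k$ [...] $A_k$ [...] $V_k$ [...]

Figure 10: (a) [...] NUMA [...] (b) [...] NUMA [...]

[...] SGI UV [...] Kronecker [...] GTEPS [...]. Green Graph500 [...] Big Data category [...] 4-way Intel Xeon E[...] [...] GTEPS [...] 59.1 MTEPS/W [...] 1 [...]. UV [...] SDPARA (SemiDefinite Programming Algorithm PARAllel version) [6] [...] SDPA (Semidefinite Programming Algorithms) [...] ZDD (Zero-suppressed decision diagram) [...] [7] [8] [...] NUMA [...] ULIBC (Ubiquity Library for Intelligently Binding Cores) [...]

4 [...] jun [...] isc.php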
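The partition of the vertex set into index ranges $[kn/\ell, (k+1)n/\ell)$ and the two derived adjacency lists, $A^F_k(v) = V_k \cap A(v)$ for $v \in V$ and $A^B_k(w) = A(w)$ for $w \in V_k$, can be sketched as follows. This is a minimal sketch with hypothetical function names and a dictionary-based adjacency structure. Rounding the range bounds down with integer division is an assumption for the case where $\ell$ does not divide $n$.

```python
def partition_bounds(n, l, k):
    """Half-open index range [k*n/l, (k+1)*n/l) of V_k, rounded down to integers."""
    return (k * n) // l, ((k + 1) * n) // l


def forward_lists(adjacency, n, l, k):
    """A^F_k(v): the neighbors of v that lie in V_k, for every vertex v in V."""
    lo, hi = partition_bounds(n, l, k)
    return {v: [w for w in adjacency[v] if lo <= w < hi] for v in range(n)}


def backward_lists(adjacency, n, l, k):
    """A^B_k(w): all neighbors of w, for every vertex w in V_k."""
    lo, hi = partition_bounds(n, l, k)
    return {w: list(adjacency[w]) for w in range(lo, hi)}


if __name__ == "__main__":
    # Hypothetical 4-cycle 0-1-3-2-0 split across l = 2 partitions.
    adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    n, l = 4, 2
    for k in range(l):
        print(k, partition_bounds(n, l, k))
        print("  forward :", forward_lists(adjacency, n, l, k))
        print("  backward:", backward_lists(adjacency, n, l, k))
```

In both lists, the vertices written during a search step (the targets $w$ in Top-down, the sources $w \in V_k$ in Bottom-up) belong to $V_k$, so partition $k$ only updates state for its own vertex range.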

6. [...]

[...] NUMA [...] NUMA [...]

[...] (JST) CREST [...] SGI [...] Silicon Graphics International Corp. [...]

References

[1] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault and R. Namyst, hwloc: A generic framework for managing hardware affinities in HPC applications, Proc. IEEE Int. Conf. PDP2010.
[2] V. Agarwal, F. Petrini, D. Pasetto and D. A. Bader, Scalable graph exploration on multicore processors, Proc. ACM/IEEE Int. Conf. SC10.
[3] S. Beamer, K. Asanović and D. A. Patterson, Direction-optimizing breadth-first search, Proc. ACM/IEEE Int. Conf. SC12.
[4] Y. Yasui, K. Fujisawa and K. Goto, NUMA-optimized parallel breadth-first search on multicore single-node system, Proc. IEEE Int. Conf. BigData 2013.
[5] Y. Yasui, K. Fujisawa and Y. Sato, Fast and energy-efficient breadth-first search on a single NUMA system, Proc. IEEE Int. Conf. ISC'14.
[6] K. Fujisawa, T. Endo, Y. Yasui, H. Sato, N. Matsuzawa, S. Matsuoka and H. Waki, Peta-scale general solver for semidefinite programming problems with over two million constraints, Proc. IEEE Int. Conf. IPDPS 2014.
[7] [...] ULIBC [...] 2014 (HPCS2014) [...]
[8] Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama and M. Takamatsu, NETAL: High-performance implementation of network analysis library considering computer memory hierarchy, J. Oper. Res. Soc. Jpn., 54.
