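[...] NUMA [...]

1. [...]

[...] 18 months (Moore's law) [...] 1 Hz [...] CPU [...]

2. [...]

Figure 1: [...] registers (Register), main memory (RAM) [...]

[...] Level 1 (L1) cache [...] L2, L3, L4 [...] TLB (translation look-aside buffer) [...] operating system (OS) [...] TLB [...]

3. NUMA

NUMA (Non-uniform memory access) [...]

Author address: [...] 819-0395 [...] 744 [...]

October 2014. Copyright © by ORSJ. Unauthorized reproduction of this article is prohibited.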
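[...] Intel Xeon X5460 (Harpertown) [...] 2 CPUs, 4 cores per CPU, and 1 thread per core, for a total of 8 (= 2 × 4 × 1) [...]

Figure 2: 2-way Intel Xeon X5460

[...] NUMA [...] UMA (Uniform memory access) [...] Figure 2 shows UMA and Figure 3 shows NUMA [...] In the UMA system with 2 Intel Xeon X5460 CPUs, the 2 CPUs [...] main memory (RAM) [...] In NUMA [...] each CPU [...] RAM [...] NUMA node [...] In Figure 3, the system with 4 Intel Xeon E5-4640 CPUs has 4 NUMA nodes [...]

[...] Intel Xeon E5-4640 (SandyBridge-EP) [...] 4 CPUs, 8 cores per CPU, and 2 threads per core, for a total of 64 (= 4 × 8 × 2) [...]

Figure 3: 4-way Intel Xeon E5-4640

4. STREAM [...]

[...] STREAM (footnote 1) [...] Triad, one of the 4 STREAM kernels, takes $n$-dimensional vectors $a, b, c \in \mathbb{R}^n$ and a scalar $r$ and computes

\[ a \leftarrow b + r c \]

[...] bytes [...]

Figure 4: OpenMP [...] Triad [...] (C/C++)

Figure 4 [...] OpenMP [...] Triad [...] Figure 5 [...] 4-way Intel Xeon E5-4640 [...] for $n = \{2^{10}, \ldots, 2^{30}\}$ [...] Triad [...] (GB/s) [...] $2^{20}$ [...] STREAM [...] 20 [...] with 16, 32, and 64 [...]: 95, 98, and 92 GB/s [...] 64 [...]

Footnote 1: STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/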
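A Triad kernel of the kind described in Section 4 can be sketched in C with OpenMP as follows. This is an illustrative sketch, not the listing from Figure 4. The function names `triad` and `triad_gbytes_per_sec` are hypothetical, and the bandwidth formula assumes the usual STREAM accounting of 24 bytes per element (two 8-byte loads and one 8-byte store).

```c
#include <assert.h>
#include <stddef.h>

/* Triad kernel: a[i] = b[i] + r * c[i] for every i in [0, n). */
void triad(double *a, const double *b, const double *c, double r, size_t n)
{
    /* The pragma is ignored when the compiler is run without OpenMP. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        a[i] = b[i] + r * c[i];
    }
}

/*
 * Bandwidth in GB/s (1 GB = 1e9 bytes) for one Triad pass over n elements
 * that took `seconds`, counting 3 doubles moved per element (assumption:
 * STREAM-style accounting).
 */
double triad_gbytes_per_sec(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 3.0 * sizeof(double) * (double)n / seconds / 1.0e9;
}
```

With GCC, OpenMP is typically enabled with `-fopenmp`; the elapsed time passed to `triad_gbytes_per_sec` would come from a wall-clock timer placed around the call to `triad`.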
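Figure 5: STREAM TRIAD [...]

[...] 2 [...] Hyper-threading [...] 32 [...]

4.1 [...]

On Linux, numactl [...] NUMA [...] With numactl [...] the 16 [...] of NUMA node 0 [...] 16 [...] NUMA node 3 [...] --physcpubind [...] --membind [...] NUMA ID [...] NUMA ID [...] On Linux, /proc/cpuinfo [...] processor [...] ID [...] physical id [...] NUMA ID [...] Portable Hardware Locality (HWLOC) [1] [...]

Figure 6 [...] for $n = \{2^{10}, 2^{11}, \ldots, 2^{30}\}$ [...] NUMA [...] 16 [...] of NUMA node 0 [...] 16 [...] Triad [...] NUMA [...]

Figure 6: [...] NUMA 0 [...] NUMA [...] (GB/s)

[...] 12 GB/s [...] NUMA [...] 3 GB/s [...] NUMA [...] 1/4 [...]

4.2 [...]

[...] numactl --localalloc [...] 32 [...] NUMA 0, 1 [...] 32 [...]

4.3 [...]

[...] 4 KBytes [...] NUMA [...] NUMA [...] numactl --interleave [...] 32 [...] NUMA 0, 1 [...] 32 [...]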
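Section 4.1 points to the `processor` and `physical id` fields of /proc/cpuinfo as a way to relate logical core IDs to NUMA IDs. The following C sketch reads those two fields. The function name `read_physical_ids` is hypothetical, and the sketch assumes that each `physical id` line follows the `processor` line of the same entry, as in the usual x86 Linux layout. Note that `physical id` is a socket ID; treating it as the NUMA ID is an assumption that holds only on machines with one NUMA node per socket.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read /proc/cpuinfo-style text from fp and record, for each "processor"
 * entry, the "physical id" that follows it. Entries without a "physical id"
 * line keep the value -1, and processor IDs outside [0, max_cpus) are
 * counted but not stored. Returns the number of "processor" lines seen.
 */
int read_physical_ids(FILE *fp, int *physical_id, int max_cpus)
{
    assert(fp != NULL && physical_id != NULL && max_cpus > 0);
    for (int i = 0; i < max_cpus; ++i) {
        physical_id[i] = -1;
    }

    char line[1024];
    int current = -1;  /* processor ID of the entry being read */
    int count = 0;
    int value;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* A space in the format matches any run of blanks or tabs. */
        if (sscanf(line, "processor : %d", &value) == 1) {
            current = value;
            ++count;
        } else if (sscanf(line, "physical id : %d", &value) == 1) {
            if (current >= 0 && current < max_cpus) {
                physical_id[current] = value;
            }
        }
    }
    return count;
}
```

In real use, `fp` would be the result of `fopen("/proc/cpuinfo", "r")`. A library such as HWLOC [1] reports the topology more reliably than this kind of text parsing.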
[...] NUMA 0, 1 [...] 4 [...] Local allocation [...]

4.4 [...]

Figures 7(a) and 7(b) show the STREAM TRIAD [...] (GB/s) with 1, 2, and 4 NUMA nodes [...] for $n = \{2^{10}, \ldots, 2^{30}\}$ [...] NUMA [...] (Local-allocation) [...] (Interleaving) [...] $2^{20}$ [...] with 1 NUMA node (16 [...]), 2 NUMA nodes (32 [...]), and 4 NUMA nodes (64 [...]) [...] Local allocation: 13 GB/s, 21 GB/s, 24 GB/s [...] Interleaving: 13 GB/s, 6 GB/s, 8 GB/s [...] Interleaving [...] TRIAD [...] STREAM [...] 4 [...] Local allocation [...] Interleaving [...] 6 [...]

Figure 7: STREAM TRIAD: [...] (GB/s)

5. NUMA [...]

[...] NUMA [...] numactl [...] On Linux, sched_setaffinity(), sched_getaffinity(), and mbind() [...] sched_setaffinity() [...] sched_setaffinity() [...] mbind() [...] NUMA [...] NUMA [...]

5.1 STREAM TRIAD

[...] TRIAD [...] $a, b, c$ [...] 1 [...] NUMA [...]
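Section 5 names sched_setaffinity() and sched_getaffinity() as the Linux calls for controlling thread placement from inside a program. The following C sketch shows one common way to use them on Linux with glibc. The helper names `pin_to_cpu` and `first_allowed_cpu` are hypothetical and are not taken from the article. Memory placement with mbind() is not shown, since it needs the separate `numaif.h` header from libnuma.

```c
#define _GNU_SOURCE  /* must precede all includes to expose the CPU_* macros */
#include <assert.h>
#include <sched.h>

/*
 * Pin the calling thread to a single logical core.
 * Returns 0 on success and -1 on error (errno is set by the system call).
 */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Return the lowest-numbered logical core the calling thread is currently
 * allowed to run on, or -1 if the affinity mask cannot be read.
 */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}
```

In an OpenMP program, each thread would typically call `pin_to_cpu` once at the start of a parallel region, with a core ID chosen from the NUMA node that holds the data it works on.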
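Figure 8: [...] TRIAD [...]

Figure 9: [...] NUMA [...]

[...] 8 [...] TRIAD [...] with 1, 2, and 4 NUMA nodes [...] 24, 48, and 96 GB/s [...] 5 [...]

5.2 [...] Breadth-first search (BFS)

[...] BFS [...] for a graph $G = (V, E)$ with $n = |V|$ and $m = |E|$ [...] $O(n + m)$ [...] HPC [...] Graph500 [...] 1 [...] Graph500 (footnote 2) [...] BFS [...] November 2010 [...] 2 [...] SCALE [...] edgefactor $= m/n$ [...] 16 [...] (a) [...] (b) [...] (c) [...]

(a) [...] a Kronecker graph with $n = 2^{\mathrm{SCALE}}$ and $m = n \cdot \mathrm{edgefactor}$ [...]
(b) [...]
(c) [...] 64 BFS [...] 1 [...] (traversed edges per second; TEPS) [...]

[...] (c) [...] 64 TEPS [...] Green Graph500 (footnote 3) [...] Graph500 [...] TEPS [...] TEPS/W [...]

[...] 9 [...] 1 [...] BFS [...] (Level) [...] Level-synchronized BFS [...] Beamer et al. [3] [...] Top-down [...] Bottom-up [...] Small-world [...] Top-down [...] Bottom-up [...] Beamer et al. [...] a Kronecker graph with $n = 2^{28}$, $m = 2^{32}$ [...] on a 4-way Intel Xeon E7-8870 [...] 5.1 GTEPS ($10^9$ TEPS) [...] NUMA [...] 2.2 [...] 11.15 GTEPS [4] [...] Bottom-up [...] Small-world [...] 2.68 [...] [5] [...] [4, 5] [...] CSR (Compressed Sparse Row) [...]

Footnote 2: Graph500: http://www.graph500.org.
Footnote 3: Green Graph500: http://green.graph500.org.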
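A level-synchronized BFS over a graph stored in CSR form, as referred to in Section 5.2, can be sketched in C as follows. This is a sequential sketch of the top-down direction only; it does not implement the bottom-up direction of Beamer et al. [3] or the NUMA-aware variants of [4, 5]. The function name `bfs_levels` and the array names `xadj` and `adj` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * CSR layout (assumed): the neighbors of vertex v are
 * adj[xadj[v]], ..., adj[xadj[v + 1] - 1].
 *
 * Level-synchronized top-down BFS from `root`. On return, level[v] is the
 * BFS level of v, or -1 if v is unreachable. Returns the number of levels
 * processed (largest level + 1), or -1 if memory allocation fails.
 */
int bfs_levels(int n, const int *xadj, const int *adj, int root, int *level)
{
    assert(n > 0 && root >= 0 && root < n);
    int *frontier = malloc((size_t)n * sizeof *frontier);
    int *next = malloc((size_t)n * sizeof *next);
    if (frontier == NULL || next == NULL) {
        free(frontier);
        free(next);
        return -1;
    }
    for (int v = 0; v < n; ++v) {
        level[v] = -1;
    }

    int frontier_size = 1;
    int depth = 0;
    frontier[0] = root;
    level[root] = 0;
    while (frontier_size > 0) {
        int next_size = 0;
        /* Expand every vertex of the current level before moving on. */
        for (int i = 0; i < frontier_size; ++i) {
            int v = frontier[i];
            for (int e = xadj[v]; e < xadj[v + 1]; ++e) {
                int w = adj[e];
                if (level[w] < 0) {  /* w has not been visited yet */
                    level[w] = depth + 1;
                    next[next_size++] = w;
                }
            }
        }
        int *tmp = frontier;
        frontier = next;
        next = tmp;
        frontier_size = next_size;
        ++depth;
    }
    free(frontier);
    free(next);
    return depth;
}
```

A TEPS figure for one run would then be the number of traversed edges divided by the elapsed time of the call.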
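Table 1: [...]

Authors       Machine                    (n, m)              TEPS
Madduri       Cray MTA-2 (40 procs)      (2^21, 2^30)        0.5 G
Agarwal [2]   Intel Xeon X7560 × 4       (2^20, 2^26)        1.3 G
Beamer [3]    Intel Xeon E7-8870 × 4     (2^28, 2^32)        5.1 G
Yasui [4]     Intel Xeon E5-4640 × 4     (2^26, 2^30)        11.1 G
Yasui [5]     Intel Xeon E5-4640 × 4     (2^27, 2^31)        29.0 G

[...]

\[ V_k = \left\{ v_j \in V \;\middle|\; j \in \left[ \frac{kn}{l}, \frac{(k+1)n}{l} \right) \right\} \]

[...] $A$ [...] for Top-down, $A^F_k(v)$ for $v \in V$ [...] for Bottom-up, $A^B_k(w)$ for $w \in V_k$ [...] $l$ [...] 1 [...] $A^F_k(v)$ [...] $A^B_k(w)$ [...]

\[ A^F_k(v) = \{ w \mid w \in \{ V_k \cap A(v) \} \}, \quad v \in V, \]
\[ A^B_k(w) = \{ v \mid v \in A(w) \}, \quad w \in V_k. \]

[...] NUMA [...] Graph500 [...] June 2014 (footnote 4) [...]

Figure 10: [...] NUMA [...] BFS [...] (Graph500 [...])

Figure 10(a) [...] NUMA [...] Figure 10(b) [...] NUMA [...] $l$ [...] $G$ [...] $l$ [...] $\{G_k\}$, $(k = \{0, 1, \ldots, l-1\})$ [...] NUMA [...] $k$ [...] $V_k$ [...] $A_k$ [...] $V_k$ [...] SGI UV2000 [...] a Kronecker graph with $2^{32}$ [...] $2^{36}$ [...] 640 [...] 131.4 GTEPS [...] Green Graph500 [...] June 2014 (footnote 5) [...] Big Data category [...] 4-way Intel Xeon E5-4640 [...] $2^{30}$, $2^{34}$ [...] 28.5 GTEPS [...] 59.1 MTEPS/W [...] 1 [...] UV 2000 [...]

5.3 [...]

[...] SDPARA (SemiDefinite Programming Algorithm PARAllel version) [6] [...] SDPA (Semidefinite Programming Algorithms) [...] ZDD (Zero-suppressed decision diagram) [...] [7] [...] [8] [...] NUMA [...] ULIBC (Ubiquity Library for Intelligently Binding Cores) [...]

Footnote 4: http://www.graph500.org/results jun 2014
Footnote 5: http://green.graph500.org/list 2014 06 isc.php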
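The vertex partition $V_k$ and the forward neighbor set $A^F_k(v)$ used in the NUMA-aware BFS of Section 5.2 can be illustrated with a short C sketch. The function names `part_begin` and `forward_neighbors` are hypothetical, and rounding the range bounds $kn/l$ down by integer division is an assumption made here for the case where $l$ does not divide $n$.

```c
#include <assert.h>
#include <stdint.h>

/*
 * First vertex index of V_k when n vertices are split into l contiguous
 * ranges: V_k = { v_j : k*n/l <= j < (k+1)*n/l } (bounds rounded down).
 */
int64_t part_begin(int64_t k, int64_t n, int64_t l)
{
    assert(l > 0 && n >= 0 && k >= 0 && k <= l);
    return k * n / l;
}

/*
 * Forward neighbor set A^F_k(v): copy into `out` the neighbors of v
 * (given as adj_v[0 .. degree-1]) that belong to V_k.
 * `out` must have room for `degree` entries. Returns the number copied.
 */
int64_t forward_neighbors(const int64_t *adj_v, int64_t degree,
                          int64_t k, int64_t n, int64_t l, int64_t *out)
{
    assert(k >= 0 && k < l);
    const int64_t lo = part_begin(k, n, l);
    const int64_t hi = part_begin(k + 1, n, l);
    int64_t count = 0;
    for (int64_t i = 0; i < degree; ++i) {
        if (adj_v[i] >= lo && adj_v[i] < hi) {
            out[count++] = adj_v[i];
        }
    }
    return count;
}
```

For example, with $n = 10$ and $l = 4$ the ranges are $\{0, 1\}$, $\{2, 3, 4\}$, $\{5, 6\}$, and $\{7, 8, 9\}$. The backward set $A^B_k(w)$ needs no filtering, because it is simply the full adjacency list of each $w \in V_k$.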
6. [...]

[...] NUMA [...] NUMA [...]

Acknowledgments: [...] (JST) CREST [...] SGI [...] Silicon Graphics International Corp. [...]

References

[1] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault and R. Namyst, hwloc: A generic framework for managing hardware affinities in HPC applications, Proc. IEEE Int. Conf. PDP2010, 2010.
[2] V. Agarwal, F. Petrini, D. Pasetto and D. A. Bader, Scalable graph exploration on multicore processors, Proc. ACM/IEEE Int. Conf. SC10, 2010.
[3] S. Beamer, K. Asanović and D. A. Patterson, Direction-optimizing breadth-first search, Proc. ACM/IEEE Int. Conf. SC12, 2012.
[4] Y. Yasui, K. Fujisawa and K. Goto, NUMA-optimized parallel breadth-first search on multicore single-node system, Proc. IEEE Int. Conf. BigData 2013, 2013.
[5] Y. Yasui, K. Fujisawa and Y. Sato, Fast and energy-efficient breadth-first search on a single NUMA system, Proc. IEEE Int. Conf. ISC'14, 2014.
[6] K. Fujisawa, T. Endo, Y. Yasui, H. Sato, N. Matsuzawa, S. Matsuoka and H. Waki, Peta-scale general solver for semidefinite programming problems with over two million constraints, Proc. IEEE Int. Conf. IPDPS 2014, 2014.
[7] [...] ULIBC [...] 2014 [...] (HPCS2014) [...] HPCS2014 [...] 2014.
[8] Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama and M. Takamatsu, NETAL: High-performance implementation of network analysis library considering computer memory hierarchy, J. Oper. Res. Soc. Jpn., 54, 259-280, 2011.