IPSJ SIG Technical Report

Performance Evaluation of Oakleaf-FX (Fujitsu PRIMEHPC FX10) Supercomputer System

Satoshi OHSHIMA 1,a)  Hideyuki JITSUMOTO 1  Yoshikazu KAMOSHIDA 1  Takahiro KATAGIRI 1  Kenjiro TAURA 1,2  Kengo NAKAJIMA 1

Abstract: We report on the performance of the Oakleaf-FX (Fujitsu PRIMEHPC FX10) supercomputer system, which began operation in April 2012 at the Kashiwa campus of the Information Technology Center, The University of Tokyo. The system is a large-scale parallel computer built around the SPARC64 IXfx CPU and the FEFS file system, with a theoretical peak performance of 1.13 PFLOPS. Moreover, the system is compatible with the K computer and is expected to contribute greatly to the progress of computer and computational science. In this paper, we report results of a performance evaluation of this supercomputer system.

1. Introduction

Until 2011, the Information Technology Center of the University of Tokyo operated the Hitachi SR11000/J2 as its SMP supercomputer system. In October 2011 the SR11000/J2 was succeeded by the Hitachi SR16000 M1, nicknamed "Yayoi" *1 [1][2]. In April 2012, a Fujitsu PRIMEHPC FX10 system, nicknamed "Oakleaf-FX" *1 [3], began operation [4].

1  Information Technology Center, The University of Tokyo
2  Graduate School of Information Science and Technology, The University of Tokyo
a) ohshima@cc.u-tokyo.ac.jp
*1 "Yayoi" and "Oakleaf-FX" are the official nicknames of the SR16000 M1 and the PRIMEHPC FX10 systems, respectively.

(c) 2012 Information Processing Society of Japan

This paper evaluates Oakleaf-FX with six benchmarks: (1) STREAM, (2) HPL, (3) MPIFFT,
(4) GeoFEM, (5) MDTEST, and (6) IOR. The remainder of this paper is organized as follows: Section 2 gives an overview of Oakleaf-FX, Section 3 reports the benchmark results on Oakleaf-FX, and Section 4 summarizes.

2. Overview of Oakleaf-FX

2.1 System Overview
Figure 1 shows the overall configuration of Oakleaf-FX, and Table 1 compares its specifications with those of the other systems operated by the center: the Hitachi SR11000/J2 (the previous system), the HA8000 cluster (T2K), and the SR16000/M1 (Yayoi). Oakleaf-FX consists of 4,800 compute nodes connected by the Tofu interconnect. It contains no GPUs or other accelerators; its theoretical peak performance of 1.13 PFLOPS comes from general-purpose CPUs alone, and its Linpack performance is reported in Section 3.2. Its power consumption is 1.40 MW (2.0 MWh).

2.2 CPU and Compute Node
Figure 2 shows the configuration of a compute node. The CPU is the Fujitsu SPARC64 IXfx, a 16-core processor based on the SPARC64 architecture (the SPARC-V9 instruction set with the HPC-ACE extension). In Oakleaf-FX the SPARC64 IXfx runs at 1.848 GHz, so one CPU delivers 236.5 GFLOPS (1.848 GHz x 8 floating-point operations per cycle x 16 cores). SMT is not supported. Each core has 32 KB L1 caches, the 16 cores share a 12 MB L2 cache per CPU, and there is no L3 cache. The CPU also provides VISIMPACT, Fujitsu's hardware support (such as inter-core barriers) for efficient multithreaded execution. Each compute node consists of one SPARC64 IXfx CPU, 32 GB of ECC-protected DDR3 memory, and an Interconnect Controller (ICC). With 4,800 nodes, the full system has 16 x 4,800 = 76,800 cores, a theoretical peak of 1.13 PFLOPS, and 150 TByte of main memory.

2.3 Interconnect
The compute nodes of Oakleaf-FX are connected by Tofu *1, a six-dimensional mesh/torus network. Each ICC has 10 links, of which up to 4 can communicate simultaneously. The six axes are X (X+, X-), Y (Y+, Y-), Z (Z+, Z-), A, B (B+, B-), and C: X, Y, and Z form a large torus, while A, B, and C form a small mesh/torus among neighboring nodes (A and C have length 2 and B has length 3, which is why those three axes account for only four of the ICC's ten links).

*1 Tofu stands for "Torus fusion".
Fig. 1  Overall configuration of Oakleaf-FX.

Table 1  Specifications of Oakleaf-FX and of the center's other systems.

|                                | PRIMEHPC FX10 (Oakleaf-FX) | SR16000/M1 (Yayoi) | SR11000/J2 (previous system) | HA8000 (T2K Todai) |
| CPU                            | SPARC64 IXfx  | Power7           | Power5+         | Opteron 8356 |
| Clock frequency                | 1.848 GHz     | 3.83 GHz         | 2.3 GHz         | 2.3 GHz |
| Total compute nodes            | 4,800         | 56               | 128             | 952 |
| Cores per node                 | 16            | 32               | 16              | 16 |
| Peak performance per core      | 14.784 GFLOPS | 30.64 GFLOPS     | 9.2 GFLOPS      | 9.2 GFLOPS |
| Peak performance per node      | 236.5 GFLOPS  | 980.48 GFLOPS    | 147.2 GFLOPS    | 147.2 GFLOPS |
| Peak performance, total        | 1.13 PFLOPS   | 54,906.88 GFLOPS | 18,841.6 GFLOPS | 140.1344 TFLOPS |
| Main memory per node (usable)  | 32 GByte (28 GByte) | 200 GByte (170 GByte) | 128 GByte (112 GByte) | 32 GByte (28 GByte) |
| Main memory, total             | 150 TByte     | 11,200 GByte     | 16,384 GByte    | 31.25 TByte |
| CPU-memory bandwidth per node  | 85 GByte/sec  | 512 GByte/sec    | 204.6 GByte/sec | 42 GByte/sec |
| B/F                            | 0.36          | 0.52             | 1.39            | 0.29 |
| SMT                            | not supported | up to 4 threads/core (2 in operation) | not supported | not supported |
| Internode network              | 6D mesh/torus (Tofu) | hierarchical full interconnection | 3-stage crossbar | fat tree (full bisection bandwidth) |
| Internode transfer rate        | 5 GByte/sec bidirectional, 4 directions simultaneously | 96 GByte/sec bidirectional | 12 GByte/sec bidirectional | group A: 5 GByte/sec, group B: 2.5 GByte/sec bidirectional |
| Storage capacity               | 1.1 PByte + 2.1 PByte (+ 3.6 PByte) | 556 TByte | 94.2 TByte | 1 PByte |

Seen from a user job, the network is at most a three-dimensional torus space, so consistently high network performance can be obtained without being strongly conscious of the complex six-dimensional shape.

2.4 Storage
Oakleaf-FX has two storage systems, a local file system and a shared file system. The local file system is provided for staging; it is built from PRIMERGY RX300 S6 servers and ETERNUS DX80 S2 disk arrays, and provides 1.1 PByte of capacity with 131 GByte/sec of aggregate performance. The shared file system, in contrast, is accessible not only from all compute nodes but also from the login nodes.
Fig. 3  Tofu interconnect of Oakleaf-FX.

The shared file system is built from PRIMERGY RX300 S6 servers with ETERNUS DX80 S2 and ETERNUS DX410 S2 disk arrays, and provides 2.1 PByte of capacity with 136 GByte/sec of aggregate performance. Both storage systems run FEFS (Fujitsu Exabyte File System), a parallel file system based on Lustre; differences between FEFS and Lustre as seen on Oakleaf-FX are touched on in Section 3.6. An additional 3.6 PByte of capacity is planned (Table 1).

3. Performance Evaluation

3.1 STREAM
STREAM [6] is a well-known memory bandwidth benchmark. It reports the sustained bandwidth, in MB/s, of four kernels:

  Copy:  c[j] = a[j]
  Scale: b[j] = scalar*c[j]
  Add:   c[j] = a[j]+b[j]
  Triad: a[j] = b[j]+scalar*c[j]

Table 2  STREAM results in MB/sec (ratio to peak memory bandwidth in parentheses).

|       | Oakleaf-FX (PRIMEHPC FX10) | Yayoi (SR16000/M1) |
| Copy  | 59,987.3012 (68.9%) | 224,825.3361 (42.9%) |
| Scale | 59,768.9227 (68.7%) | 226,349.5329 (43.2%) |
| Add   | 64,640.5627 (74.3%) | 256,364.6680 (48.9%) |
| Triad | 64,712.2441 (74.3%) | 255,192.6583 (48.7%) |

We ran the Fortran version of STREAM on one node with 16 OpenMP threads, compiled with the options -Kopenmp -Kfast -KXFILL -Kprefetch_sequential=soft -Kprefetch_double_line_L2 -Kprefetch_line_L2=64 -Koptmsg -Qt. The array size (N) was 80,000,512, the iteration count (NTIMES) was left at its default (10), and OMP_NUM_THREADS and PARALLEL were set to 16. The Yayoi figures are from runs with 32 threads on one SR16000/M1 node [2].

As Table 2 shows, Oakleaf-FX attains roughly a quarter of Yayoi's absolute memory bandwidth. In terms of efficiency, however, Yayoi stays below 50% of its peak memory bandwidth, whereas Oakleaf-FX reaches 68% or more.
3.2 HPL
We measured HPL from the HPC Challenge benchmark suite (HPCC 1.4.0) [7]. HPL solves a dense system of linear equations by LU decomposition; the bulk of its runtime is spent in the BLAS level-3 routine DGEMM. The benchmark is C code calling Fortran-interface BLAS; we linked against the vendor-supplied BLAS and compiled with -O3 -Kopenmp,parallel,fast -Nsrc,sta -Koptmsg for C and -Kopenmp,parallel,ocl,fast -Koptmsg -Qt for Fortran.

3.2.1 Single-node performance
On one node (one 16-core CPU) we ran 1 MPI process with 16 threads. The main parameters in hpccinf.txt were Ns = 56000, NBs = 448, Ps = 1, Qs = 1. The result was 0.21 TFLOPS, 90.59% of the node's peak. For comparison, one node of Yayoi achieves 0.83 TFLOPS, 84.65% of peak [2].

3.2.2 Full-system performance
Using all 4,800 nodes, again with 1 MPI process of 16 threads per node, and hpccinf.txt parameters Ns = 4058880, NBs = 448, Ps = 30, Qs = 160, the result was 1.04 PFLOPS, 91.89% of peak. With this result, Oakleaf-FX was ranked 18th in the June 2012 TOP500 list [14], with a reported power consumption of 1176.80 kW.

3.3 MPIFFT
We also measured MPIFFT from HPCC 1.4.0, a distributed one-dimensional FFT whose communication is dominated by MPI_Alltoall. The code is written in C and Fortran and can use FFTW or Fujitsu's SSL II; we used SSL II together with the vendor BLAS/LAPACK, compiling with -Kfast -Kopenmp -Nsrc,sta -Koptmsg for C and -Kfast -Kopenmp -mlcmain=main -SSL2BLAMP for Fortran. The HPCC build used -DHPCC_FFT_235, -DHPCC_MEMALLCTR, and -DRA_SANDIA_NOPT (with mpifft.o, wrapmpifftw.o, and pzfft1d.o built with -DUSING_FFTW).

The measurement used 8 nodes with 1 MPI process of 16 threads per node (128 cores in total) and hpccinf.txt parameters Ns = 160000, NBs = 80, Ps = 1, Qs = 8. The resulting vector size was 3,200,000,000, and the measured performance was 30.213 GFLOPS, 1.59% of peak. On Yayoi, 8 nodes achieve 151.121 GFLOPS, 1.92% of peak [2]. In both cases the efficiency is low: FFT performance is limited by memory bandwidth (B/F) and all-to-all communication rather than by peak FLOPS, and Oakleaf-FX's relatively low B/F value shows up here.

3.4 GeoFEM
3.4.1 Overview
GeoFEM-Cube [9] is a benchmark code derived from the parallel finite-element platform GeoFEM [8][10]; it solves a solid-mechanics problem on a cube-shaped domain and reports solver performance in GFLOPS.
Fig. 4  Cube geometry of GeoFEM-Cube.

The code is written in Fortran90 with MPI and OpenMP. Like GeoFEM itself [8], it supports both pure-MPI ("Flat MPI") and hybrid OpenMP+MPI programming models for SMP cluster architectures [10]; hybrid parallelization is particularly relevant on multi-socket cc-NUMA nodes such as those of HA8000.

The solver of GeoFEM-Cube is a Conjugate Gradient (CG) iteration preconditioned with Symmetric Gauss-Seidel (SGS), i.e. SGS/CG [10]. Two matrix-storage formats are evaluated:
(a) CRS (Compressed Row Storage), and
(b) DJDS (Descending-order Jagged Diagonal Storage), which provides long innermost loops.
Because the forward/backward substitutions of the SGS preconditioner carry dependencies between rows, the unknowns are reordered to extract parallelism; Multicoloring (MC), Reverse Cuthill-McKee (RCM), and cyclic multicoloring applied on top of RCM (CM-RCM) are compared [10].

Three parallel programming models are compared: Flat MPI and Hybrid a x b ("HB a x b"), where a is the number of OpenMP threads per MPI process and b is the number of MPI processes per node. Each core is assigned 40^3 = 64,000 elements (192,000 degrees of freedom).

Table 3 summarizes the systems compared — the Hitachi SR11000/J2 (SR11K/J2), Hitachi SR16000/M1 (SR16K/M1), Hitachi HA8000 (T2K), Fujitsu PRIMEHPC FX10 (Oakleaf-FX), and the SPARC64 VIIIfx — together with the measured GeoFEM-Cube performance. With Flat MPI, Oakleaf-FX attains 6.77% of its peak, lower than the 8.59% of the SPARC64 VIIIfx: in moving from the 8-core SPARC64 VIIIfx to the 16-core SPARC64 IXfx of Oakleaf-FX, per-node performance rises but efficiency drops. SR16K/M1 (Power7), whose Byte/Flop value is close to that of Oakleaf-FX, achieves a higher fraction of peak (7.41%), presumably helped by its much larger last-level cache per core (Table 3).

3.5 MDTEST
MDTEST is a metadata benchmark developed at the Livermore Computing Center of Lawrence Livermore National Laboratory (LLNL); it measures the rates (operations per second) of file and directory metadata operations such as creation, stat, and removal [13]. In our runs, each process operated on the order of 10,000 files and 5,000 directories.
Table 3  GeoFEM-Cube performance with Flat MPI on a single node (40^3 = 64,000 elements, 192,000 DOF per core).

|                            | Hitachi SR11K/J2 | Hitachi SR16K/M1 | T2K (HA8000) | Fujitsu FX10 (Oakleaf-FX) | SPARC64 VIIIfx |
| Processor                  | IBM Power5+ | IBM Power7 | AMD Opteron 8356 | SPARC64 IXfx | SPARC64 VIIIfx |
| Clock frequency            | 2.3 GHz | 3.83 GHz | 2.3 GHz | 1.848 GHz | 2.0 GHz |
| Core #/Node                | 16    | 32    | 16    | 16    | 8     |
| Peak Performance (GFLOPS)  | 147.2 | 980.5 | 147.2 | 236.5 | 128.0 |
| STREAM Triad (GB/s)        | 101.0 | 264.2 | 20.0  | 64.7  | 43.3  |
| Byte/Flop                  | 0.686 | 0.269 | 0.136 | 0.274 | 0.338 |
| GeoFEM-Cube (GFLOPS)       | 19.0  | 72.7  | 4.69  | 16.0  | 11.0  |
| % to Peak                  | 12.9  | 7.41  | 3.18  | 6.77  | 8.59  |
| Last Level Cache/core (MB) | 18.0  | 4.00  | 2.00  | 0.75  | 0.75  |

Figures 5 and 6 show the MDTEST results (operations per second) for 1 node and 32 nodes, respectively. For example, Yayoi with 1 node and 8 processes achieved 5,892 / 8,302 / 7,044 / 5,796 operations per second for the measured operation types, while Oakleaf-FX achieved only about a quarter of Yayoi's rates. One likely cause is the difference in file systems: Yayoi uses GPFS, whereas Oakleaf-FX uses the Lustre-based FEFS.

Fig. 5  MDTEST results (1 node).
Fig. 6  MDTEST results (32 nodes).

3.6 IOR
IOR, like MDTEST, is an I/O benchmark from LLNL's Livermore Computing Center; it measures file read/write bandwidth [13].

Table 4  IOR results.

|                         | write (MB/sec) | read (MB/sec) |
| 1 node (ior-multi)      | 4,023.40       | 3,964.92      |
| full scale (ior-multi)  | 139,008.00     | 134,734.62    |
| full scale (ior-single) | N/A            | 80,724.43     |
We ran IOR in two modes: "ior-multi", in which each process reads and writes its own file, and "ior-single", in which all processes share a single file. Both modes use the POSIX I/O interface with a 1 MiB transfer size; the results are shown in Table 4.

In the single-node ior-multi run, 16 processes each handled 256 GiB, 4 TiB in total, and achieved about 4 GB/sec. This is close to the 5 GB/sec bandwidth of the node's Tofu link, which bounds single-node I/O.

In full-scale ior-multi runs over roughly 32 TiB, IOR achieved 139 GB/sec for writes and 134 GB/sec for reads; the configurations were 1,200 processes of 26.86 GiB each (32.2 TiB in total) and 1,872 processes of 17.09 GiB each (32.00 TiB in total). An ior-single run with 1,920 processes of 17.09 GiB each (32.81 TiB in total) achieved 80.7 GB/sec. The ior-single result benefits from a FEFS extension to Lustre: Lustre 1.8 allows a file to be striped over at most 160 OSTs, whereas FEFS raises the limit to 20,000, so a single file can be striped over all 480 OSTs of the shared file system (stripe sizes of up to 4095 MiB can be specified).

For comparison, the I/O bandwidth of Yayoi is about 10 GB/sec, so the full-scale ior-multi result of Oakleaf-FX is roughly 13 times higher. As with MDTEST, differences between GPFS (Yayoi) and the Lustre-based FEFS (Oakleaf-FX) also appear in these I/O results.

4. Summary
We have described Oakleaf-FX (Fujitsu PRIMEHPC FX10), the supercomputer system introduced at the Information Technology Center, The University of Tokyo, and reported its basic performance on six benchmarks. We hope that these results will be useful both to users of Oakleaf-FX and to the progress of computer and computational science.

References
[1] Yayoi (Hitachi SR16000/M1) supercomputer system, Information Technology Center, The University of Tokyo: http://www.cc.u-tokyo.ac.jp/system/smp/
[2] Performance evaluation of the SMP supercomputer system (HITACHI SR16000 M1), IPSJ SIG Technical Report (HPC-133) (2012).
[3] PRIMEHPC FX10, Fujitsu: http://jp.fujitsu.com/solutions/hpc/products/primehpc/
[4] Oakleaf-FX (Fujitsu PRIMEHPC FX10) supercomputer system, Information Technology Center, The University of Tokyo: http://www.cc.u-tokyo.ac.jp/system/fx10/
[5] HA8000 cluster system (T2K), Information Technology Center, The University of Tokyo: http://www.cc.u-tokyo.ac.jp/system/ha8000/
[6] STREAM BENCHMARK: http://www.cs.virginia.edu/stream/
[7] HPC Challenge Benchmark: http://icl.cs.utk.edu/hpcc/
[8] GeoFEM: http://geofem.tokyo.rist.or.jp/
[9] UT-HPC benchmark: http://www.cspp.cc.u-tokyo.ac.jp/ut-hpc-benchmark/
[10] IPSJ SIG Technical Report, HPC-120-6 (2009).
[11] Mattson, T.G., Sanders, B.A. and Massingill, B.L.: Patterns for Parallel Programming, Software Patterns Series (SPS), Addison-Wesley (2005).
[12] Nakajima, K.: New Strategy for Coarse Grid Solvers in Parallel Multigrid Methods using OpenMP/MPI Hybrid Programming Models, ACM Proceedings of PPoPP/PMAM 2012, New Orleans, LA, USA (2012).
[13] Scalable I/O Benchmark Downloads, Lawrence Livermore National Laboratory: https://computing.llnl.gov/?set=code&page=sio_downloads
[14] TOP500 List - June 2012, TOP500 Supercomputing Sites: http://www.top500.org/list/2012/06/100