
Performance Evaluation of Oakleaf-FX (Fujitsu PRIMEHPC FX10) Supercomputer System

Satoshi OHSHIMA 1,a), Hideyuki JITSUMOTO 1, Yoshikazu KAMOSHIDA 1, Takahiro KATAGIRI 1, Kenjiro TAURA 1,2, Kengo NAKAJIMA 1

1 Information Technology Center, The University of Tokyo
2 Graduate School of Information Science and Technology, The University of Tokyo
a) ohshima@cc.u-tokyo.ac.jp

Abstract: We report the performance of the Oakleaf-FX (Fujitsu PRIMEHPC FX10) supercomputer system, which began operation in April 2012 at the Kashiwa campus of the Information Technology Center, The University of Tokyo. The system is a large-scale parallel computer with SPARC64IXfx CPUs and the FEFS file system, and its peak performance is 1.13 PFLOPS. It is also compatible with the K computer and is expected to contribute substantially to the progress of computer and computational science. In this paper, we report the results of several performance evaluations on this system.

1. Introduction

In 2011, the SR11000/J2 SMP system was replaced by the SR16000 M1 system (Yayoi) [1][2]. The PRIMEHPC FX10 system (Oakleaf-FX) [3] began operation in April 2012 [4]. This paper evaluates Oakleaf-FX with the following six benchmarks:

( 1 ) STREAM
( 2 ) HPL
( 3 ) MPIFFT

© 2012 Information Processing Society of Japan

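( 4 ) GeoFEM
( 5 ) MDTEST
( 6 ) IOR

Section 2 describes the Oakleaf-FX system, Section 3 reports the benchmark results on Oakleaf-FX, and Section 4 summarizes the paper.

2. Overview of Oakleaf-FX

2.1 Overall configuration of Oakleaf-FX

Fig. 1 shows the overall configuration of Oakleaf-FX, and Table 1 compares its specifications with those of the SR11000/J2, the HA8000 (T2K) and the SR16000/M1. Oakleaf-FX consists of 4,800 compute nodes connected by the Tofu interconnect and has a theoretical peak performance of 1.13 PFLOPS.

2.2 Compute node

Fig. 2 shows a compute node of Oakleaf-FX. The CPU is the SPARC64IXfx, a 16-core SPARC64 processor (SPARC64V9 + HPC-ACE) that runs at 1.848 GHz on Oakleaf-FX. The SPARC64IXfx does not support SMT, and the peak performance of one CPU is 236.5 GFLOPS (1.848 GHz × 8 IPC × 16 cores). The L1 cache is 32 KB, the L2 cache is 12 MB per CPU, and the CPU provides VISIMPACT. Each node contains one SPARC64IXfx CPU, 32 GB of ECC DDR3 memory and an Inter Connect Controller (ICC). The whole system has 16 × 4,800 = 76,800 cores, a peak performance of 1.13 PFLOPS and 150 TByte of main memory.

2.3 Interconnect

The compute nodes of Oakleaf-FX are connected by a six-dimensional mesh/torus network (Tofu *1). Fig. 3 shows the Tofu network of Oakleaf-FX. Each ICC has 10 links: two each on the X (X+, X-), Y (Y+, Y-), Z (Z+, Z-) and B (B+, B-) axes, and one each on the A and C axes.

*1 Torus fusion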
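The peak figures quoted in Section 2.2 follow from simple arithmetic, which the short Python sketch below reproduces. The constants are the values printed in the text (1.848 GHz, the factor of 8 quoted as "8 IPC", 16 cores, 4,800 nodes); the function and variable names are illustrative and do not come from the paper.

```python
# Reproduce the peak-performance arithmetic quoted for Oakleaf-FX.
# All names here are illustrative; the constants come from the text.

CLOCK_GHZ = 1.848        # SPARC64IXfx clock frequency on Oakleaf-FX
FLOPS_PER_CYCLE = 8      # the "8 IPC" factor quoted with the peak figure
CORES_PER_NODE = 16      # one 16-core CPU per node
NODES = 4800             # total compute nodes


def peak_gflops_per_core() -> float:
    """Theoretical peak of one core in GFLOPS."""
    return CLOCK_GHZ * FLOPS_PER_CYCLE


def peak_gflops_per_node() -> float:
    """Theoretical peak of one node (one CPU) in GFLOPS."""
    return peak_gflops_per_core() * CORES_PER_NODE


def peak_pflops_system() -> float:
    """Theoretical peak of all compute nodes in PFLOPS."""
    return peak_gflops_per_node() * NODES / 1.0e6


def total_cores() -> int:
    """Total number of cores in the system."""
    return CORES_PER_NODE * NODES


if __name__ == "__main__":
    print(f"per core : {peak_gflops_per_core():.3f} GFLOPS")
    print(f"per node : {peak_gflops_per_node():.3f} GFLOPS")
    print(f"system   : {peak_pflops_system():.4f} PFLOPS")
    print(f"cores    : {total_cores()}")
```

The per-node value comes out slightly above 236.5 GFLOPS and the system value slightly above 1.13 PFLOPS, so the figures in the text appear to be truncated rather than rounded.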

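IPSJ SIG Technical Report

Fig. 1 Overall configuration of Oakleaf-FX

Table 1 Performance specifications of Oakleaf-FX

| Item | PRIMEHPC FX10 (Oakleaf-FX) | SR16000/M1 (Yayoi) | SR11000/J2 (former system) | HA8000 (T2K, University of Tokyo version) |
|---|---|---|---|---|
| CPU | SPARC64IXfx, 1.848 GHz | Power7, 3.83 GHz | Power5+, 2.3 GHz | Opteron8356, 2.3 GHz |
| Total compute nodes | 4800 | 56 | 128 | 952 |
| Cores per compute node | 16 | 32 | 16 | 16 |
| Theoretical performance per core | 14.784 GFLOPS | 30.64 GFLOPS | 9.2 GFLOPS | 9.2 GFLOPS |
| Theoretical performance per compute node | 236.5 GFLOPS | 980.48 GFLOPS | 147.2 GFLOPS | 147.2 GFLOPS |
| Theoretical performance, all compute nodes | 1.13 PFLOPS | 54906.88 GFLOPS | 18841.6 GFLOPS | 140.1344 TFLOPS |
| Main memory per compute node (usable capacity) | 32 GByte (28 GByte) | 200 GByte (170 GByte) | 128 GByte (112 GByte) | 32 GByte (28 GByte) |
| Main memory, all compute nodes | 150 TByte | 11200 GByte | 16384 GByte | 31.25 TByte |
| Physical CPU-to-memory transfer performance per compute node | 85 GByte/sec | 512 GByte/sec | 204.6 GByte/sec | 42 GByte/sec |
| B/F value | 0.36 | 0.52 | 1.39 | 0.29 |
| SMT | Not supported | Up to 4 threads/core (up to 2 threads/core in operation) | Not supported | Not supported |
| Network between compute nodes | 6-dimensional mesh/torus (Tofu network) | Hierarchical full connection | 3-stage crossbar | Full-bisection-bandwidth FatTree |
| Transfer performance between compute nodes | 20 GByte/sec bidirectional (4 directions simultaneously) | 96 GByte/sec bidirectional | 12 GByte/sec bidirectional | Group A: 5 GByte/sec bidirectional; Group B: 2.5 GByte/sec bidirectional |
| Storage capacity | 1.1 PByte + 2.1 PByte (+ 3.6 PByte) | 556 TByte | 94.2 TByte | 1 PByte |

The network is presented as a torus space of at most three dimensions, so high network performance can always be obtained without paying close attention to the complex six-dimensional topology.

2.4 Storage

Oakleaf-FX has two storage systems: a local file system and a shared file system. The local file system is provided for staging. It consists of PRIMERGY RX300 S6 and ETERNUS DX80 S2 and provides 1.1 PByte of capacity and 131 GByte/sec of performance. The shared file system, in contrast, is accessible from the login nodes in addition to all compute nodes.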

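Fig. 3 Tofu network of Oakleaf-FX

The shared file system consists of PRIMERGY RX300 S6, ETERNUS DX80 S2 and ETERNUS DX410 S2, and provides 2.1 PByte of capacity and 136 GByte/sec of performance. Both file systems use FEFS (Fujitsu Exabyte File System), which is based on Lustre. A further 3.6 PByte of Lustre-based storage is also provided for Oakleaf-FX.

3. Performance evaluation

3.1 STREAM

STREAM [6] is a benchmark that measures sustained memory bandwidth. It reports the bandwidth in MB/s for the following four kernels:

Copy:  c[j] = a[j]
Scale: b[j] = scalar*c[j]
Add:   c[j] = a[j]+b[j]
Triad: a[j] = b[j]+scalar*c[j]

Table 2 STREAM results (MB/sec; the percentage is the ratio to the physical CPU-to-memory transfer performance in Table 1)

| Kernel | Oakleaf-FX (PRIMEHPC FX10) | Yayoi (SR16000/M1) |
|---|---|---|
| Copy | 59987.3012 (68.9%) | 224825.3361 (42.9%) |
| Scale | 59768.9227 (68.7%) | 226349.5329 (43.2%) |
| Add | 64640.5627 (74.3%) | 256364.6680 (48.9%) |
| Triad | 64712.2441 (74.3%) | 255192.6583 (48.7%) |

The measurement used 1 node with 16 OpenMP threads. The Fortran version was compiled with -Kopenmp -Kfast -KXFILL -Kprefetch_sequential=soft -Kprefetch_double_line_L2 -Kprefetch_line_L2=64 -Koptmsg -Qt. The array size (N) was 80,000,512 and the number of repetitions (NTIMES) was the default (10). OMP_NUM_THREADS and PARALLEL were set to 16. The Yayoi (SR16000/M1) results were obtained with 32 threads [2].

In Table 2, the bandwidth of Oakleaf-FX is about 25% of that of Yayoi. Relative to the physical transfer performance, however, Yayoi stays below 50%, whereas Oakleaf-FX reaches 68% or more.

3.2 HPL

We used the HPL implementation included in HPCC (HPCC 1.4.0) [7].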
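As a minimal illustration of the four STREAM kernels listed in Section 3.1, the Python sketch below applies them in the order Copy, Scale, Add, Triad to small arrays, and adds a helper that converts a measured bandwidth into a percentage of the physical transfer performance. This is not the STREAM source code. The initial values, the function names and the tiny array size are illustrative choices, and the factor of 1024 MB per GByte in `efficiency_percent` is an assumption, chosen because it reproduces the percentages printed in Table 2.

```python
# Sketch of the four STREAM kernels and of the efficiency figures in Table 2.
# Illustrative only: names, initial values and array size are assumptions.

def stream_kernels(n: int = 1000, scalar: float = 3.0):
    """Run Copy, Scale, Add and Triad once on arrays of length n."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n
    for j in range(n):          # Copy:  c[j] = a[j]
        c[j] = a[j]
    for j in range(n):          # Scale: b[j] = scalar*c[j]
        b[j] = scalar * c[j]
    for j in range(n):          # Add:   c[j] = a[j]+b[j]
        c[j] = a[j] + b[j]
    for j in range(n):          # Triad: a[j] = b[j]+scalar*c[j]
        a[j] = b[j] + scalar * c[j]
    return a, b, c


def efficiency_percent(measured_mb_per_s: float, physical_gbyte_per_s: float) -> float:
    """Measured bandwidth as a percentage of the physical transfer performance.

    Assumption: 1 GByte/sec is taken as 1024 MB/sec, which matches the
    percentages printed in Table 2.
    """
    return 100.0 * measured_mb_per_s / (physical_gbyte_per_s * 1024.0)


if __name__ == "__main__":
    a, b, c = stream_kernels(n=8)
    print(a[0], b[0], c[0])
    # Oakleaf-FX Copy and Yayoi Triad values from Table 2
    print(round(efficiency_percent(59987.3012, 85.0), 1))
    print(round(efficiency_percent(255192.6583, 512.0), 1))
```

With the physical transfer performances from Table 1 (85 GByte/sec for Oakleaf-FX and 512 GByte/sec for Yayoi), the helper returns the percentages shown next to each bandwidth in Table 2.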

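HPL solves a dense system of linear equations by LU factorization, and most of its computation is performed by the BLAS3 routine DGEMM. The compile options were -O3 -Kopenmp,parallel,fast -Nsrc,sta -Koptmsg for C and -Kopenmp,parallel,ocl,fast -Koptmsg -Qt for Fortran.

3.2.1 1 node

On 1 node, we ran 1 MPI process with 16 threads, using all 16 cores of the CPU. The parameters in hpccinf.txt were Ns = 56000, NBs = 448, Ps = 1, Qs = 1. The result on 1 node was 0.21 TFLOPS (90.59% of peak). For comparison, Yayoi achieved 0.83 TFLOPS (84.65%) on 1 node [2].

3.2.2 4800 nodes

On 4,800 nodes, we ran 1 MPI process with 16 threads per node, 4,800 MPI processes in total. The parameters in hpccinf.txt were Ns = 4058880, NBs = 448, Ps = 30, Qs = 160. The result on 4,800 nodes was 1.04 PFLOPS (91.89% of peak), which ranked 18th in the June 2012 TOP500 List [14]. The power consumption was 1176.80 kW.

3.3 MPIFFT

MPIFFT in HPCC (HPCC 1.4.0) is an FFT benchmark that relies on MPI Alltoall communication. FFTW and SSLII are available as FFT libraries, and SSLII was used for BLAS/LAPACK. The compile options were -Kfast -Kopenmp -Nsrc,sta -Koptmsg for C and -Kfast -Kopenmp -mlcmain=main -SSL2BLAMP for Fortran. The HPCC build options were -DHPCC_FFT_235 -DHPCC_MEMALLCTR -DRA_SANDIA_NOPT (with -DUSING_FFTW for mpifft.o, wrapmpifftw.o and pzfft1d.o).

We used 8 nodes with 1 thread per MPI process and 16 MPI processes per node, 128 MPI processes in total. The parameters in hpccinf.txt were Ns = 160000, NBs = 80, Ps = 1, Qs = 8, and the vector size was 3,200,000,000. The result was 30.213 GFLOPS (1.59% of peak). For comparison, Yayoi achieved 151.121 GFLOPS (1.92%) on 8 nodes [2].

3.4 GeoFEM

3.4.1 GeoFEM-Cube

GeoFEM-Cube [9] is a benchmark based on the parallel finite-element framework GeoFEM [8][10]. It solves a problem on a cube-shaped model, and its performance is reported in GFLOPS.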
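The HPL parameters quoted in Section 3.2 can be sanity-checked with a few lines of arithmetic. The Python sketch below checks that the process grid covers all nodes, that the block size divides the problem size, how much memory the double-precision matrix needs per node, and what fraction of peak a given result represents. The helper names are illustrative. The per-node peak uses the unrounded product 1.848 GHz × 8 × 16, so the efficiency computed from the rounded 1.04 PFLOPS comes out slightly lower than the 91.89% reported in the text.

```python
# Sanity checks on the HPL parameters quoted for the 4,800-node run.
# Helper names are illustrative; the constants come from the text.

NODES = 4800
PEAK_NODE_GFLOPS = 1.848 * 8 * 16      # per-node peak, unrounded
NS, NBS, PS, QS = 4058880, 448, 30, 160


def grid_covers_all_nodes() -> bool:
    """With one MPI process per node, Ps x Qs must equal the node count."""
    return PS * QS == NODES


def block_size_divides_problem() -> bool:
    """True when the problem size is a multiple of the block size."""
    return NS % NBS == 0


def matrix_gib_per_node(n: int = NS, nodes: int = NODES) -> float:
    """Memory per node for an n x n matrix of 8-byte values, in GiB."""
    return 8 * n * n / nodes / 2**30


def fraction_of_peak(result_pflops: float, nodes: int = NODES) -> float:
    """Measured performance divided by the theoretical peak."""
    peak_pflops = PEAK_NODE_GFLOPS * nodes / 1.0e6
    return result_pflops / peak_pflops


if __name__ == "__main__":
    print("grid covers nodes :", grid_covers_all_nodes())
    print("NBs divides Ns    :", block_size_divides_problem())
    print(f"matrix per node   : {matrix_gib_per_node():.2f} GiB")
    print(f"fraction of peak  : {fraction_of_peak(1.04):.4f}")
```

The matrix footprint per node stays below the 28 GByte of usable memory listed in Table 1, which is consistent with choosing Ns to fill most of the available memory.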

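Fig. 4 Cube model

The program is written in FORTRAN90 and parallelized with MPI and OpenMP. GeoFEM [8] targets SMP cluster systems, and both the flat MPI and the OpenMP/MPI hybrid parallel programming models are supported [10]. Placement of threads and data matters on cc-NUMA architectures such as the HA8000.

GeoFEM-Cube solves the linear system with the conjugate gradient (CG) method preconditioned by SGS (Symmetric Gauss-Seidel) [10], referred to as SGS/CG. Two matrix storage formats are available:

(a) CRS (Compressed Row Storage)
(b) DJDS (Descending order Jagged Diagonal Storage)

The reordering methods used with GeoFEM are multicoloring (MC), Reverse Cuthill-McKee (RCM), and cyclic multicoloring combined with RCM (CM-RCM) [10].

Three kinds of parallel programming model are evaluated: Flat MPI and Hybrid configurations. A Hybrid configuration is written HB a×b, where a is the number of OpenMP threads per MPI process and b is the number of MPI processes per node. The problem size is 40^3 = 64,000 nodes with 3 degrees of freedom each, that is, 192,000 degrees of freedom.

Table 3 shows the results on 1 node. In the table, the Hitachi SR11000/J2 is abbreviated as Hitachi SR11K/J2, the Hitachi SR16000/M1 as Hitachi SR16K/M1, the Hitachi HA8000 as T2K, and the Fujitsu PRIMEHPC FX10 as FX10 (Oakleaf-FX). With Flat MPI, Oakleaf-FX achieves 6.77% of peak, compared with 8.59% for the SPARC64 VIIIfx. Compared with the SPARC64 VIIIfx, the SPARC64 IXfx of Oakleaf-FX has 16 cores instead of 8, a clock frequency about 8% lower, and, from the STREAM Triad values in Table 3, about 25% less memory bandwidth per core. Oakleaf-FX and the SR16K/M1 (Power7) have similar Byte/Flop values, while the SR16K/M1 reaches a higher percentage of peak.

3.5 MDTEST

MDTEST is a benchmark for metadata I/O performance published by the Livermore Computing Center of Lawrence Livermore National Laboratory (LLNL) [13].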

Table 3 GeoFEM-Cube results on 1 node (Flat MPI, 40^3 = 64,000 nodes, 192,000 degrees of freedom)

| Item | Hitachi SR11K/J2 | Hitachi SR16K/M1 | T2K | Fujitsu FX10 (Oakleaf-FX) | (SPARC64 VIIIfx) |
|---|---|---|---|---|---|
| Processor | IBM Power5+ | IBM Power7 | AMD Opteron8356 | SPARC64 IXfx | SPARC64 VIIIfx |
| Clock | 2.3 GHz | 3.83 GHz | 2.3 GHz | 1.848 GHz | 2.0 GHz |
| Core #/Node | 16 | 32 | 16 | 16 | 8 |
| Peak Performance (GFLOPS) | 147.2 | 980.5 | 147.2 | 236.5 | 128.0 |
| STREAM Triad (GB/s) | 101.0 | 264.2 | 20.0 | 64.7 | 43.3 |
| Byte/Flop | 0.686 | 0.269 | 0.136 | 0.274 | 0.338 |
| GeoFEM-Cube (GFLOPS) | 19.0 | 72.7 | 4.69 | 16.0 | 11.0 |
| % to Peak | 12.9 | 7.41 | 3.18 | 6.77 | 8.59 |
| Last Level Cache/core (MB) | 18.0 | 4.00 | 2.00 | 0.75 | 0.75 |

MDTEST reports its results in operations per second.

Fig. 5 MDTEST results (1)
Fig. 6 MDTEST results (32)

For Yayoi, the quoted rates are 5,892, 8,302, 7,044 and 5,796 operations per second. The Oakleaf-FX results are about 1/4 of those of Yayoi. Yayoi uses GPFS, whereas the FEFS of Oakleaf-FX is based on Lustre.

3.6 IOR

Like MDTEST, IOR is an I/O benchmark published by the Livermore Computing Center of LLNL.

Table 4 IOR results (MB/sec)

| Measurement | | |
|---|---|---|
| 1 node (ior-multi) | 4,023.40 | 3,964.92 |
| ior-multi | 139,008.00 | 134,734.62 |
| ior-single | N/A | 80,724.43 |
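The two derived rows of Table 3, Byte/Flop and % to Peak, can be recomputed from the other rows, which makes the relationship between memory bandwidth and sustained GeoFEM-Cube performance easier to inspect. The Python sketch below does this under the assumption that Byte/Flop is STREAM Triad bandwidth divided by peak performance and that % to Peak is GeoFEM-Cube GFLOPS divided by peak performance. The dictionary layout and function names are illustrative, and the last column is keyed by its processor name because its system name is not given in the table.

```python
# Recompute the derived rows of Table 3 from the measured rows.
# Layout and names are illustrative; the numbers come from Table 3.

TABLE3 = {
    "SR11K/J2":        {"peak_gflops": 147.2, "triad_gb_s": 101.0, "geofem_gflops": 19.0},
    "SR16K/M1":        {"peak_gflops": 980.5, "triad_gb_s": 264.2, "geofem_gflops": 72.7},
    "T2K":             {"peak_gflops": 147.2, "triad_gb_s": 20.0,  "geofem_gflops": 4.69},
    "FX10 Oakleaf-FX": {"peak_gflops": 236.5, "triad_gb_s": 64.7,  "geofem_gflops": 16.0},
    "SPARC64 VIIIfx":  {"peak_gflops": 128.0, "triad_gb_s": 43.3,  "geofem_gflops": 11.0},
}


def byte_per_flop(system: str) -> float:
    """STREAM Triad bandwidth (GB/s) divided by peak performance (GFLOPS)."""
    row = TABLE3[system]
    return row["triad_gb_s"] / row["peak_gflops"]


def percent_of_peak(system: str) -> float:
    """GeoFEM-Cube performance as a percentage of peak performance."""
    row = TABLE3[system]
    return 100.0 * row["geofem_gflops"] / row["peak_gflops"]


if __name__ == "__main__":
    for name in TABLE3:
        print(f"{name:16s} B/F = {byte_per_flop(name):.3f}  "
              f"% to peak = {percent_of_peak(name):.2f}")
```

Because the inputs in Table 3 are already rounded, a recomputed value can differ from the printed one in the last digit, as happens for the T2K percentage.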

Two measurements were made, ior-multi and ior-single, both using POSIX I/O with a transfer size of 1 MiB. Table 4 shows the results.

On 1 node, ior-multi was run with 16 processes and 256 GiB per process (4 TiB in total), and reached about 4 GB/sec, against 5 GB/sec for a Tofu link. For the larger ior-multi runs the total data size was about 32 TiB, and IOR reached 139 GB/sec and 134 GB/sec; the corresponding data sizes are 1,200 × 26.86 GiB (32.2 TiB) and 1,872 × 17.09 GiB (32.00 TiB). With ior-single, 1,920 × 17.09 GiB (32.81 TiB) gave 80.7 GB/sec.

FEFS extends the limits of Lustre on OSTs: Lustre 1.8 supports up to 160 OSTs, whereas FEFS supports up to 20,000. The ior-single measurement is described with 480 OSTs and 4095 MiB.

The IOR performance of Yayoi is about 10 GB/sec, so Oakleaf-FX is about 13 times faster than Yayoi. As with MDTEST, the file systems differ: Yayoi uses GPFS and Oakleaf-FX uses the Lustre-based FEFS.

4. Conclusion

This paper reported the results of a performance evaluation of Oakleaf-FX (Fujitsu PRIMEHPC FX10).

References

[1] SR16000 SMP Yayoi, http://www.cc.u-tokyo.ac.jp/system/smp/.
[2] SMP (HITACHI SR16000 M1), (HPC-133) (2012).
[3] PRIMEHPC FX10, http://jp.fujitsu.com/solutions/hpc/products/primehpc/.
[4] FX10 Oakleaf-FX, http://www.cc.u-tokyo.ac.jp/system/fx10/.
[5] HA8000 T2K, http://www.cc.u-tokyo.ac.jp/system/ha8000/.
[6] STREAM BENCHMARK, http://www.cs.virginia.edu/stream/.
[7] HPC Challenge Benchmark, http://icl.cs.utk.edu/hpcc/.
[8] GeoFEM, http://geofem.tokyo.rist.or.jp/.
[9] UT-HPC benchmark, http://www.cspp.cc.u-tokyo.ac.jp/ut-hpc-benchmark/.
[10] HPC-120-6 (2009).
[11] Mattson, T.G., Sanders, B.A., Massingill, B.L.: Patterns for Parallel Programming, Software Patterns Series (SPS), Addison-Wesley (2005).
[12] Nakajima, K.: New Strategy for Coarse Grid Solvers in Parallel Multigrid Methods using OpenMP/MPI Hybrid Programming Models, ACM Proceedings of PPoPP/PMAM 2012, New Orleans, LA, USA (2012).
[13] Scalable I/O Benchmark Downloads, Lawrence Livermore National Laboratory, https://computing.llnl.gov/?set=code&page=sio_downloads.
[14] TOP500 List - June 2012 (1-100), TOP500 Supercomputing Sites, http://www.top500.org/list/2012/06/100.