Achievement of Linpack Performance of over 1PFlops on TSUBAME 2.0 Supercomputer

Toshio Endo, Akira Nukada, and Satoshi Matsuoka
Tokyo Institute of Technology; JST, CREST; National Institute of Informatics

© 2011 Information Processing Society of Japan

We report Linpack benchmark results on the TSUBAME 2.0 supercomputer, a large-scale heterogeneous system with Intel processors and NVIDIA GPUs, which began operation in November 2010. The main part of the system consists of about 1400 compute nodes, each equipped with two CPUs and three GPUs. The nodes are connected via a full-bisection fat-tree network of Dual-Rail QDR InfiniBand. The theoretical peak performance reaches 2.4PFlops, 30 times that of its predecessor TSUBAME 1.0, while its power consumption is similar. We improved and tuned the Linpack benchmark for the characteristics of large-scale systems with GPUs, and achieved a Linpack performance of 1.192PFlops. This is the first result in Japan to exceed 1PFlops, and it ranked 4th in the latest Top500 supercomputer ranking.

1. Introduction

Accelerators such as GPUs are increasingly used in HPC. In 2008, LANL's RoadRunner, which combines Opteron CPUs with Sony/Toshiba/IBM PowerXCell 8i processors 8), reached 1PFlops on the Top500 list 2). TSUBAME 2.0 takes a similar heterogeneous approach, combining Intel CPUs with NVIDIA Tesla M2050 GPUs, and this paper reports its Linpack results.


Using 4071 GPUs, TSUBAME 2.0 achieved a Linpack performance of 1.192PFlops, exceeding 1PFlops. Its power efficiency, measured in MFlops/W, was also submitted to the Green500 1).

2. TSUBAME 2.0

TSUBAME 2.0 consists of 1408 Thin nodes, 24 Medium nodes, and 10 Fat nodes. The nodes are connected by a Dual-Rail QDR InfiniBand network built with Voltaire GridDirector switches, and the system provides 7.1PBytes of storage. The nodes run SUSE Linux Enterprise Server 11 and Windows HPC Server.

Thin node: A Hewlett-Packard Proliant SL390s G7 with two six-core Intel Xeon X5670 2.93GHz CPUs, three NVIDIA Tesla M2050 GPUs, 54GB of DDR3 memory, and two 40Gbps QDR InfiniBand host channel adapters (HCAs). The two HCAs and the three GPUs are attached through two IO Hubs (IOHs) by PCI-Express (PCIe) Gen2 links.

Tesla M2050 GPU: An NVIDIA Fermi GPU programmed with CUDA. It has 14 streaming multiprocessors (SMs), each with 32 SIMD CUDA cores, and 3GB of GDDR5 device memory with a bandwidth of 150GBytes/s.

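Fig. 1 TSUBAME 2.0 Thin node (Socket 0, Socket 1, two IO hubs, PCIe Gen2 x16 and x8 links, HCAs, GPUs)

3. High performance Linpack

High performance Linpack (HPL) 10) is an MPI-parallel benchmark that solves a dense linear system of order N by LU decomposition and reports the achieved speed in Flops. The N × N matrix is divided into blocks of size B × B (B is the HPL parameter NB) and distributed over a P × Q process grid in a two-dimensional block-cyclic fashion; with P = 2 and Q = 3, for example, six processes share the matrix. The k-th step of the decomposition consists of panel factorization, which factorizes the k-th block column L; panel broadcast, which sends L to the other processes with MPI; row exchange, which communicates the k-th block row U; and update of the remaining matrix.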
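As an illustration of the two-dimensional block-cyclic distribution over a P × Q process grid, the following minimal sketch maps each B × B block to the grid coordinates of the process that owns it. The function names and the 6 × 6 block example are illustrative assumptions and are not taken from the HPL source; P = 2 and Q = 3 give a six-process layout.

```python
from collections import Counter


def block_owner(i_block, j_block, P, Q):
    """Grid coordinates (p, q) of the process that owns block (i_block, j_block)."""
    return (i_block % P, j_block % Q)


def ownership_map(n_blocks, P, Q):
    """Owner of every block in an n_blocks x n_blocks block matrix."""
    return [[block_owner(i, j, P, Q) for j in range(n_blocks)]
            for i in range(n_blocks)]


def blocks_per_process(n_blocks, P, Q):
    """Number of blocks assigned to each process in the grid."""
    return Counter(block_owner(i, j, P, Q)
                   for i in range(n_blocks)
                   for j in range(n_blocks))


if __name__ == "__main__":
    # Hypothetical example: 6 x 6 blocks on a 2 x 3 process grid.
    for row in ownership_map(6, P=2, Q=3):
        print(" ".join(f"{p}{q}" for p, q in row))
    print(dict(blocks_per_process(6, P=2, Q=3)))
```

Because ownership cycles in both dimensions, every process keeps a share of the trailing submatrix as the factorization proceeds, which keeps the update work balanced across the grid.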

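In the update, U is first computed from L with a triangular solve (DTRSM), and the trailing submatrix A_k is then updated with a matrix multiply (DGEMM):

A_k = A_k - L U

HPL also supports lookahead, in which the panel factorization of step k + 1 is overlapped with the update of step k.

4.2 HPL on TSUBAME 2.0

Among the costs of HPL, panel factorization is O(N^2 B), MPI communication is O(N^2 (P + Q)), and the DGEMM/DTRSM update is O(N^3). Since the update dominates for large N, it is executed by BLAS kernels on the GPUs, with the operands transferred over PCIe, as in RoadRunner 8) and TSUBAME 1.2 5). A TSUBAME 2.0 node has a peak of about 1.7TFlops against 8GB/s of network bandwidth, so communication must be hidden behind computation.

To overlap communication and computation, U is divided into U_0, U_1, U_2 and A_k into A_0, A_1, A_2, and the parts are processed in a pipeline: one thread (thread1) performs MPI communication while another (thread2) performs PCIe transfers and GPU DGEMM.

On TSUBAME 2.0 the GPUs account for 92% of a node's peak performance and the Xeon CPUs for 8%. The matrix is kept in the 54GB host memory, because the three GPUs of a node have only 9GB of device memory in total.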
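The structure of one HPL step (panel factorization, a DTRSM for U, then the DGEMM update A_k = A_k - L U) can be sketched on a single process as follows. This is a minimal sketch under loud simplifications that are not in the paper: it omits pivoting and MPI entirely, and plain Python loops stand in for the BLAS kernels. The function names are hypothetical.

```python
def blocked_lu(a, nb):
    """Right-looking blocked LU without pivoting, in place.

    On return, the strict lower triangle of `a` holds L (unit diagonal
    implied) and the upper triangle holds U. Real HPL uses partial
    pivoting; it is omitted here to keep the sketch short.
    """
    n = len(a)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel factorization: unblocked LU on block column k..e-1.
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # DTRSM: forward substitution with the unit lower triangle of the
        # panel's diagonal block to obtain the block row U.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # DGEMM: trailing update A_k = A_k - L U.
        for i in range(e, n):
            for c in range(e, n):
                a[i][c] -= sum(a[i][j] * a[j][c] for j in range(k, e))
    return a


def multiply_factors(lu):
    """Recompute L * U from the packed factors, for checking."""
    n = len(lu)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            result[i][j] = total
    return result


if __name__ == "__main__":
    # Hypothetical 4 x 4 positive definite matrix, so no pivoting is needed.
    original = [[4.0, 3.0, 2.0, 1.0],
                [3.0, 4.0, 3.0, 2.0],
                [2.0, 3.0, 4.0, 3.0],
                [1.0, 2.0, 3.0, 4.0]]
    packed = blocked_lu([row[:] for row in original], nb=2)
    rebuilt = multiply_factors(packed)
    error = max(abs(rebuilt[i][j] - original[i][j])
                for i in range(4) for j in range(4))
    print("max reconstruction error:", error)
```

The innermost DGEMM loop is where almost all of the O(N^3) work sits, which is why that step is the natural one to hand to the GPUs.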

Table 1 Per-node performance and bandwidth of systems running HPL

System         Node performance (GFlops)   Network (GB/s)   PCI (GB/s)
x86 cluster    100-300                     1-8              -
RoadRunner     450                         2                4
TSUBAME 1.2    157-330                     2                1-3
TSUBAME 2.0    1685                        8                24

5.1 Environment on TSUBAME 2.0

The software environment is SUSE Linux Enterprise 11, OpenMPI 1.4.2, GCC 4.3, and CUDA 3.1. For BLAS, GotoBLAS2 is used on the Xeon CPUs. On the Tesla GPUs, the DGEMM/DTRSM kernels provided by NVIDIA 6) are used; these are separate from CUBLAS, NVIDIA's standard BLAS library. TurboBoost is available on the Xeon CPUs.

Each node runs three MPI processes, one per GPU. Because the GPUs are attached through IO hubs connected to different sockets, the placement of each process on Socket 0 or Socket 1, and the first-touch placement of its host memory, affect the PCIe transfers between CPU and GPU.

Fig. 7 DGEMM performance of one M2050 GPU with the NVIDIA kernel, (M × B) × (B × M)
Fig. 8 DGEMM performance of one M2050 GPU with CUBLAS, (M × B) × (B × M)

DGEMM performance depends on B and M. With B = 1024, the block size used for Linpack, DGEMM on one M2050 Fermi GPU reaches about 350GFlops including PCIe transfers, compared with about 80GFlops for DGEMM on a Tesla S1070 GPU.

5.2 Linpack

The Linpack run used 4071 GPUs (one MPI process per GPU) with N = 2,490,368 and B = 1024, using 35.4GB of memory per node (three processes). The achieved performance is 1.192PFlops, which is 52.1% of the peak performance. This result ranked 4th in the Top500; Tianhe-1A and third-ranked Nebulae are also equipped with NVIDIA GPUs.
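The 52.1% efficiency and the 92%/8% split between GPU and CPU peak can be cross-checked with simple arithmetic from the node configuration. The sketch below relies on two values that do not survive in the text and are labeled as assumptions: 4 double-precision flops per cycle per Xeon X5670 core, and a double-precision peak of 515 GFlops per Tesla M2050. It also assumes that the 4071 GPUs correspond to 1357 full nodes.

```python
# Values taken from the text.
GPUS_USED = 4071
GPUS_PER_NODE = 3
CPUS_PER_NODE = 2
CORES_PER_CPU = 6
CPU_CLOCK_GHZ = 2.93
LINPACK_PFLOPS = 1.192

# Assumptions (not stated in the surviving text).
CPU_FLOPS_PER_CYCLE = 4    # double-precision flops per cycle per core
GPU_PEAK_GFLOPS = 515.0    # double-precision peak of one Tesla M2050


def cpu_peak_gflops():
    """Peak of the two Xeon sockets in one node."""
    return CPUS_PER_NODE * CORES_PER_CPU * CPU_CLOCK_GHZ * CPU_FLOPS_PER_CYCLE


def node_peak_gflops():
    """Peak of one Thin node: CPUs plus three GPUs."""
    return cpu_peak_gflops() + GPUS_PER_NODE * GPU_PEAK_GFLOPS


def gpu_share():
    """Fraction of the node peak contributed by the GPUs."""
    return GPUS_PER_NODE * GPU_PEAK_GFLOPS / node_peak_gflops()


def linpack_efficiency():
    """Linpack result divided by the peak of the nodes that took part."""
    nodes = GPUS_USED // GPUS_PER_NODE
    peak_pflops = nodes * node_peak_gflops() / 1e6
    return LINPACK_PFLOPS / peak_pflops


if __name__ == "__main__":
    print(f"node peak : {node_peak_gflops():.2f} GFlops")
    print(f"GPU share : {gpu_share():.1%}")
    print(f"efficiency: {linpack_efficiency():.1%}")
```

Under these assumptions the node peak comes to about 1685 GFlops, the GPU share to about 92%, and the efficiency to about 52.1%, in line with the figures reported in the paper.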

Fig. 9 Linpack performance (256)
Fig. 10 Comparison of TSUBAME 2.0 and TSUBAME 1.2

TSUBAME 1.2 combined Opteron CPUs, ClearSpeed accelerators, and Tesla S1070 GPUs 5). The comparison takes Peak as 100% and reports three further levels relative to it: Elem-DGEMM, the DGEMM performance of the individual CPUs and GPUs; Node-DGEMM, the DGEMM performance of a whole node, which on TSUBAME 1.2 and TSUBAME 2.0 includes PCI transfers to the GPUs; and Linpack. Elem-DGEMM reflects the DGEMM performance of the Fermi GPU discussed in 5.1.

Power consumption during the Linpack run was measured for the Green500 1). TSUBAME 2.0 achieved 958MFlops/W, ranked 2nd in the Green500, and was recognized as "the Greenest Production Supercomputer in the World".
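The relation between the Linpack result and the Green500 metric is a single division. As a small sketch, the average power implied by the two reported figures (1.192PFlops and 958MFlops/W) can be computed as below; the function name is hypothetical, and the result is only what these two numbers imply, not a measured value quoted from the paper.

```python
def implied_power_kw(linpack_pflops, mflops_per_watt):
    """Average power in kW implied by a Linpack result and a MFlops/W figure."""
    linpack_mflops = linpack_pflops * 1e9   # 1 PFlops = 1e9 MFlops
    watts = linpack_mflops / mflops_per_watt
    return watts / 1e3


if __name__ == "__main__":
    print(f"{implied_power_kw(1.192, 958.0):.0f} kW")
```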

On TSUBAME 2.0, Linpack achieved 1.192PFlops with a power efficiency of 958MFlops/W.

Programming CPU/GPU systems currently requires combining MPI and CUDA. Related work includes the DAG-based runtimes StarPU 3) and DPLASMA 4) and the hybrid SMPSs/MPI approach 9), which overlaps send/recv communication with computation. Unlike Cholesky factorization, Linpack requires pivoting.

Acknowledgments: NVIDIA, Voltaire, DDN; COE, JST-CREST, JST-ANR.

References

1) The GREEN500 list.
2) TOP500 supercomputer sites.
3) Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In Proceedings of International Euro-Par Conference on Parallel Processing.
4) G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar, T. Herault, J. Kurzak, J. Langou, P. Lemarinier, H. Ltaief, P. Luszczek, A. Yarkhan, and J. Dongarra. Distributed dense numerical linear algebra algorithms on massively parallel architectures: DPLASMA. Technical Report UT-CS, University of Tennessee Computer Science.
5) Toshio Endo, Akira Nukada, Satoshi Matsuoka, and Naoya Maruyama. Linpack evaluation on a supercomputer with heterogeneous accelerators. In Proceedings of IEEE IPDPS10, 8 pages.
6) Massimiliano Fatica. Accelerating Linpack with CUDA on heterogeneous clusters. In Proceedings of Workshop on General-purpose Computation on Graphics Processing Units (GPGPU 09).
7) K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software, 34(3):1-25.
8) Michael Kistler, John Gunnels, Daniel Brokenshire, and Brad Benton. Petascale computing with accelerators. In Proceedings of ACM Symposium on Principles and Practice of Parallel Computing (PPoPP09).
9) Vladimir Marjanovi, Jesus Labarta, Eduard Ayguade, and Mateo Valero. Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In Proceedings of ACM ICS 10, pages 5-16.
10) A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers.
11) TSUBAME 2.0 Linpack. pages 1-6 (HOKKE-18).
