Achievement of Linpack Performance of over 1PFlops on TSUBAME 2.0 Supercomputer

Toshio Endo, Akira Nukada, and Satoshi Matsuoka
(Tokyo Institute of Technology / JST CREST / National Institute of Informatics)

Abstract: We report Linpack benchmark results on the TSUBAME 2.0 supercomputer, a large-scale heterogeneous system with Intel processors and NVIDIA GPUs, operation of which started in November 2010. The main part of this system consists of about 1400 compute nodes, each of which is equipped with two CPUs and three GPUs. The nodes are connected via a full-bisection fat-tree network of dual-rail QDR InfiniBand. The theoretical peak performance reaches 2.4PFlops, 30 times larger than that of the predecessor TSUBAME 1.0, while its power consumption is similar to that of TSUBAME 1.0. We conducted improvement and tuning of the Linpack benchmark considering the characteristics of large-scale systems with GPUs, and achieved a Linpack performance of 1.192PFlops. This is the first result that exceeds 1PFlops in Japan, and it is ranked 4th in the latest Top500 supercomputer ranking.

1. Introduction

HPC systems that combine general-purpose CPUs with accelerators such as GPUs have become a major trend. The first system to exceed 1PFlops on the Top500 list 2), LANL's RoadRunner in 2008, is a heterogeneous machine that couples Opteron CPUs with Sony/Toshiba/IBM PowerXCell 8i processors 8). Tokyo Institute of Technology has operated the TSUBAME series since 2006, and started operation of TSUBAME 2.0 in November 2010. TSUBAME 2.0 is a heterogeneous supercomputer whose theoretical peak performance of 2.4PFlops comes mostly from NVIDIA Tesla M2050 GPUs installed in each compute node alongside Intel Xeon CPUs, and it is cooled by a Modular Cooling System (MCS). This paper describes the implementation and tuning of the Linpack benchmark for TSUBAME 2.0 and reports the achieved performance together with its Top500 and Green500 rankings.

We measured Linpack on 1357 compute nodes with 4071 GPUs and achieved 1.192PFlops, which corresponds to 52.1% of the aggregate theoretical peak of the nodes used (2.288PFlops). This is the first result in Japan to exceed 1PFlops, and it was ranked 4th on the November 2010 Top500 list. The power consumption during the run was 1243.8kW, or 958.35MFlops/W, as submitted to the Green500 list 1).

2. Overview of TSUBAME 2.0

TSUBAME 2.0 consists of about 1400 compute nodes, a storage system of 7.1PBytes, and a full-bisection fat-tree interconnect of dual-rail 40Gbps QDR InfiniBand. The compute nodes comprise 1408 Thin nodes, 24 Medium nodes, and 10 Fat nodes; the Linpack measurements in this paper use only Thin nodes.

Thin node: Each Thin node is a Hewlett-Packard ProLiant SL390s G7 server (Fig. 1) equipped with two 6-core Intel Xeon X5670 processors (2.93GHz), three NVIDIA Tesla M2050 GPUs, 54GB of DDR3 memory, and two 40Gbps QDR InfiniBand host channel adapters (HCAs). The two CPU sockets (Socket 0 and Socket 1) are connected to two I/O hubs (IOH), and the GPUs and HCAs are attached to the I/O hubs via PCI-Express (PCIe) Gen 2 x16 and x8 links.

(Figure 1: Block diagram of a TSUBAME 2.0 Thin node: two CPU sockets, two I/O hubs, three Tesla M2050 GPUs and two QDR InfiniBand HCAs connected via PCI-Express.)

Tesla M2050 GPU: The Tesla M2050 is a Fermi-generation GPU with 14 streaming multiprocessors (SMs), each containing 32 CUDA cores that execute in SIMD fashion. Its peak double-precision performance is 515GFlops (1.03TFlops in single precision), and it carries 3GB of GDDR5 device memory with a bandwidth of 150GBytes/s. Programs for the GPU are written with CUDA. The device memory is far smaller than the 54GB of host memory per node, which strongly influences the Linpack implementation described below.

Interconnect: The nodes are connected by a dual-rail QDR InfiniBand fat tree built from Voltaire Grid Director 4036 edge switches and Grid Director 4700 core switches, providing full bisection bandwidth.

Software: Each node can run 64-bit SuSE Linux Enterprise Server 11 or Windows HPC Server 2008 R2; the Linux environment is used in this work.
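These specifications are consistent with the quoted peak figures. As a rough cross-check (assuming 4 double-precision flops per cycle per Xeon core):

  Xeon X5670: 6 cores x 4 flops/cycle x 2.93GHz ≈ 70.3GFlops, so two CPUs ≈ 140.6GFlops per node
  Tesla M2050: 3 GPUs x 515GFlops = 1545GFlops per node
  Thin node total: 140.6 + 1545 ≈ 1686GFlops ≈ 1.7TFlops
  1408 Thin nodes: 1408 x 1686GFlops ≈ 2.37PFlops, i.e. roughly the quoted 2.4PFlops system peak
  (the Medium and Fat nodes contribute the small remainder).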

3. High Performance Linpack

High Performance Linpack (HPL) 10) is a widely used parallel implementation of the Linpack benchmark. It solves a dense N x N system of linear equations by LU factorization with partial pivoting, using MPI, and reports the achieved performance in Flops. The coefficient matrix is distributed over a P x Q grid of processes in a two-dimensional block-cyclic fashion with block size B (called NB in HPL); the block-cyclic layout balances the load as the factorization shrinks the active part of the matrix.

(Figure: example of the two-dimensional block-cyclic distribution of an N x N matrix over a P x Q = 2 x 3 process grid.)

The factorization proceeds in about N/B steps. Step k consists of the following operations:
- Panel factorization: the B-column-wide panel containing the k-th diagonal block is factorized with partial pivoting, producing a block column of L.
- Panel broadcast: the factorized panel is broadcast to the other process columns.
- Update of U: the corresponding block row of U is computed by a triangular solve (DTRSM) with the panel.
- Update of the trailing matrix: the remaining submatrix is updated as A_{k+1} = A_k - L U by matrix multiplication (DGEMM).
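The overall structure of these steps, ignoring pivoting, the process grid, and all communication, is that of a blocked right-looking LU factorization. The following self-contained C program is an illustrative sketch of that structure only, not HPL itself; the routine names and sizes are placeholders.

/* lu_sketch.c: blocked right-looking LU in the style of HPL's per-step
 * structure. Single process, no pivoting, naive loops instead of BLAS;
 * for illustration only. Build: cc -O2 lu_sketch.c -o lu_sketch */
#include <stdio.h>
#include <stdlib.h>

#define IDX(i, j, n) ((size_t)(j) * (n) + (i))      /* column-major, as in HPL */

/* Panel factorization: unblocked LU of the B-wide panel A(k:n-1, k:k+B-1). */
static void panel_factorize(double *A, int n, int k, int B) {
    for (int j = k; j < k + B; j++)
        for (int i = j + 1; i < n; i++) {
            A[IDX(i, j, n)] /= A[IDX(j, j, n)];              /* multiplier (column of L) */
            for (int c = j + 1; c < k + B; c++)              /* update the rest of the panel */
                A[IDX(i, c, n)] -= A[IDX(i, j, n)] * A[IDX(j, c, n)];
        }
}

/* Update of U: forward-substitute the unit-lower triangle of the panel
 * into the block row A(k:k+B-1, k+B:n-1)  (the DTRSM part). */
static void update_U(double *A, int n, int k, int B) {
    for (int j = k + B; j < n; j++)
        for (int r = k; r < k + B; r++)
            for (int i = r + 1; i < k + B; i++)
                A[IDX(i, j, n)] -= A[IDX(i, r, n)] * A[IDX(r, j, n)];
}

/* Trailing matrix update: A22 <- A22 - L21 * U12  (the DGEMM part). */
static void update_trailing(double *A, int n, int k, int B) {
    for (int j = k + B; j < n; j++)
        for (int i = k + B; i < n; i++)
            for (int r = k; r < k + B; r++)
                A[IDX(i, j, n)] -= A[IDX(i, r, n)] * A[IDX(r, j, n)];
}

int main(void) {
    int n = 512, B = 64;                                     /* B plays the role of HPL's NB */
    double *A = (double *)malloc((size_t)n * n * sizeof(double));
    for (size_t i = 0; i < (size_t)n * n; i++)               /* random, diagonally dominant matrix */
        A[i] = (double)rand() / RAND_MAX + (i % (n + 1) == 0 ? n : 0);

    for (int k = 0; k < n; k += B) {                         /* one pass = one HPL step k */
        panel_factorize(A, n, k, B);                         /* followed by a broadcast in the parallel case */
        update_U(A, n, k, B);
        update_trailing(A, n, k, B);                         /* the part offloaded to GPUs on TSUBAME 2.0 */
    }
    printf("done, U(0,0) = %f\n", A[0]);
    free(A);
    return 0;
}

In HPL the trailing-matrix update, which asymptotically dominates, is performed by optimized BLAS; on TSUBAME 2.0 it is this part that is offloaded to the GPUs, as described next.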

The cost of these operations differs greatly. Over a whole run, panel factorization costs O(N^2 B) floating-point operations and the communication volume is O(N^2 (P + Q)), whereas the trailing matrix updates (DGEMM/DTRSM) cost O(N^3). For large N the update therefore dominates, so, as on RoadRunner 8), the key to performance on an accelerated system is to execute the update on the accelerators. HPL additionally uses lookahead: the panel factorization of step k+1 is performed concurrently with the update of step k, so that the panel factorization and its communication are hidden behind the DGEMM.

4. Implementation and Tuning for TSUBAME 2.0

4.1 Offloading the update to GPUs

Our implementation is based on our earlier Linpack work for TSUBAME 1.2 5),11). The matrix is kept in host memory: each node has 54GB of host memory, while the three GPUs together provide only 9GB of device memory, which is too small to hold the local portion of the matrix. The data required by each update is therefore sent to the GPUs over PCI-Express (about 8GB/s per link) and the results are returned to host memory. The trailing matrix update of each process is divided so that the GPU computes roughly 92% of it and the Xeon cores compute the remaining 8%, reflecting the ratio of their DGEMM performance (a Thin node has an aggregate peak of about 1.7TFlops, most of which is on the GPUs).

To hide the PCIe transfers behind computation, the block row U and the trailing matrix A_k are divided into pieces U_0, U_1, U_2, ... and A_0, A_1, A_2, ..., and the host-to-device transfers, DGEMM calls, and device-to-host transfers for successive pieces are pipelined. In addition, each process dedicates one thread to MPI communication (thread1) and one thread to PCIe transfers (thread2), so that MPI communication, PCIe transfers, and computation on both the CPUs and the GPUs overlap.

4.2 Problem size

The problem size N is limited by the host memory capacity rather than by GPU memory. Since the computation grows as O(N^3) while communication and PCIe traffic grow as O(N^2), N is chosen as large as host memory permits, leaving room for MPI buffers and the operating system.
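The pipelining of PCIe transfers and GPU DGEMM described in 4.1 can be illustrated by the following CUDA sketch. This is a minimal illustration under assumptions of my choosing, not the authors' code: it uses the CUBLAS v2 API (the paper's runs used CUDA 3.1 with DGEMM kernels supplied by NVIDIA), two streams, and placeholder sizes; the 92%/8% GPU/CPU split is shown only as a column split, and the CPU-side DGEMM and the MPI thread are omitted.

/* overlap_sketch.cu: stream the trailing-matrix update C(MxN) -= L(MxB) * U(BxN)
 * through GPU memory in column chunks, overlapping PCIe transfers with DGEMM
 * using two CUDA streams. Illustrative only. Build: nvcc overlap_sketch.cu -lcublas */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NSTREAMS 2

int main(void) {
    const int M = 4096, N = 4096, B = 1024;          /* B corresponds to HPL's block size */
    const int N_gpu = (int)(N * 0.92);               /* ~92% of the columns go to the GPU  */
    const int chunk = 512;                           /* columns per pipelined DGEMM        */

    /* Pinned host buffers so that cudaMemcpyAsync really overlaps with compute. */
    double *L, *U, *C;
    cudaMallocHost((void **)&L, (size_t)M * B * sizeof(double));
    cudaMallocHost((void **)&U, (size_t)B * N * sizeof(double));
    cudaMallocHost((void **)&C, (size_t)M * N * sizeof(double));
    for (size_t i = 0; i < (size_t)M * B; i++) L[i] = 1e-3;
    for (size_t i = 0; i < (size_t)B * N; i++) U[i] = 1e-3;
    for (size_t i = 0; i < (size_t)M * N; i++) C[i] = 1.0;

    double *dL, *dU[NSTREAMS], *dC[NSTREAMS];
    cudaStream_t st[NSTREAMS];
    cublasHandle_t hd[NSTREAMS];
    cudaMalloc((void **)&dL, (size_t)M * B * sizeof(double));
    for (int s = 0; s < NSTREAMS; s++) {
        cudaStreamCreate(&st[s]);
        cublasCreate(&hd[s]);
        cublasSetStream(hd[s], st[s]);
        cudaMalloc((void **)&dU[s], (size_t)B * chunk * sizeof(double));
        cudaMalloc((void **)&dC[s], (size_t)M * chunk * sizeof(double));
    }
    cudaMemcpy(dL, L, (size_t)M * B * sizeof(double), cudaMemcpyHostToDevice); /* panel L: copied once */

    const double alpha = -1.0, beta = 1.0;
    for (int j = 0; j < N_gpu; j += chunk) {
        int w = (j + chunk <= N_gpu) ? chunk : N_gpu - j;
        int s = (j / chunk) % NSTREAMS;              /* round-robin over the streams */
        /* host -> device for this chunk of U and C (column-major layout) */
        cudaMemcpyAsync(dU[s], U + (size_t)j * B, (size_t)B * w * sizeof(double),
                        cudaMemcpyHostToDevice, st[s]);
        cudaMemcpyAsync(dC[s], C + (size_t)j * M, (size_t)M * w * sizeof(double),
                        cudaMemcpyHostToDevice, st[s]);
        /* C_chunk <- C_chunk - L * U_chunk on the GPU */
        cublasDgemm(hd[s], CUBLAS_OP_N, CUBLAS_OP_N, M, w, B,
                    &alpha, dL, M, dU[s], B, &beta, dC[s], M);
        /* device -> host; chunks on the other stream overlap with this one */
        cudaMemcpyAsync(C + (size_t)j * M, dC[s], (size_t)M * w * sizeof(double),
                        cudaMemcpyDeviceToHost, st[s]);
    }
    /* Columns j = N_gpu .. N-1 (about 8%) would be updated concurrently on the
     * CPU with a host DGEMM (GotoBLAS2 in the paper); omitted here. */
    cudaDeviceSynchronize();
    printf("C[0] = %f (expected %f)\n", C[0], 1.0 - B * 1e-6);

    for (int s = 0; s < NSTREAMS; s++) {
        cublasDestroy(hd[s]); cudaStreamDestroy(st[s]);
        cudaFree(dU[s]); cudaFree(dC[s]);
    }
    cudaFree(dL); cudaFreeHost(L); cudaFreeHost(U); cudaFreeHost(C);
    return 0;
}

In the real implementation the chunking is applied to both the U_i and A_i pieces and is coordinated with the dedicated MPI and PCIe threads, so that the network, the PCIe links, and the processors stay busy simultaneously.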

5. Evaluation

5.1 Evaluation environment and configuration

(Table 1: per-node characteristics, such as node peak performance in GFlops and PCI bandwidth in GB/s, of an ordinary x86 cluster, RoadRunner, TSUBAME 1.2, and TSUBAME 2.0. A TSUBAME 2.0 Thin node has by far the highest node performance, about 1685GFlops.)

The software environment is SuSE Linux Enterprise Server 11, OpenMPI 1.4.2, GCC 4.3, and CUDA 3.1. For BLAS on the Xeon CPUs we use GotoBLAS2 1.13 7); for DGEMM/DTRSM on the Tesla GPUs we use kernels optimized by NVIDIA 6), which outperform the standard CUBLAS library for the matrix shapes that appear in Linpack.

Process configuration: Each node runs three MPI processes, one per GPU, and each process additionally uses the CPU cores of the socket to which it is bound. The assignment of processes, GPUs, and HCAs to Socket 0 and Socket 1 is chosen with the node topology in mind, and matrix memory is allocated with the first-touch policy so that each process's data resides in the DRAM of its own socket; this locality is important for Linpack performance.

(Figure: placement of the three MPI processes, the GPUs, and the HCAs over Socket 0 and Socket 1 within a Thin node.)
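The per-process GPU assignment and first-touch placement described above can be sketched as follows. This is a minimal illustration under assumed tools, not the configuration actually used: it obtains a node-local rank with the MPI-3 shared-memory communicator, whereas the runs in the paper used OpenMPI 1.4.2 with socket binding done at job launch; the buffer size is a placeholder.

/* gpu_binding_sketch.cu: one MPI process per GPU, with first-touch
 * initialization of the host-resident matrix portion. Illustrative only.
 * Build (assumption): nvcc gpu_binding_sketch.cu -I$MPI_INC -L$MPI_LIB -lmpi */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Node-local rank: 0, 1, 2 on a TSUBAME 2.0 Thin node (3 processes per node). */
    MPI_Comm node_comm;
    int local_rank, ngpus;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* One GPU per process. CPU-core and HCA affinity (Socket 0 / Socket 1)
     * is set by the launcher, e.g. with numactl, and is not shown here. */
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);

    /* First touch: the process that will later compute on this memory writes it
     * first, so the pages end up in the DRAM attached to its own socket. */
    size_t nelem = (size_t)256 << 20;                /* placeholder: 2GB of the local matrix */
    double *local_part = (double *)malloc(nelem * sizeof(double));
    for (size_t i = 0; i < nelem; i++) local_part[i] = 0.0;

    printf("node-local rank %d uses GPU %d of %d\n", local_rank, local_rank % ngpus, ngpus);

    free(local_part);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}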

(Figures 7 and 8: DGEMM performance of a single Tesla M2050 for (M x B) x (B x M) matrix multiplications, using NVIDIA's optimized kernel and using CUBLAS, respectively.)

5.2 DGEMM performance

On a single M2050, NVIDIA's DGEMM kernel reaches about 75% of the 515GFlops double-precision peak (=386GFlops) for large matrices, and about 350GFlops for the (M x B) x (B x M) shapes with B = 1024 that appear in our Linpack; it outperforms CUBLAS for these shapes. This is several times the roughly 80GFlops DGEMM performance of the Tesla S1070 GPUs (86.4GFlops peak per GPU) used in TSUBAME 1.2, which makes hiding the PCIe transfers even more important than before. The block size B = 1024 is chosen so that GPU DGEMM runs efficiently without making the panel factorization too expensive.

5.3 Full-system Linpack

The full-scale measurement was carried out in October 2010 on 1357 of the 1408 Thin nodes, i.e. 4071 GPUs and the same number of MPI processes. The process grid is P x Q = 59 x 69, the block size is B = 1024, and the problem size is N = 2,490,368, which corresponds to roughly 35.4GB of matrix data per node (three processes). The achieved performance is 1.192PFlops, about 13.7 times the Linpack performance of TSUBAME 1.2, and 52.1% of the aggregate peak of the nodes used (2.288PFlops). This is the first result in Japan to exceed 1PFlops and was ranked 4th on the November 2010 Top500 list; the 1st-ranked Tianhe-1A and the 3rd-ranked Nebulae also obtain most of their performance from NVIDIA GPUs. We also measured Linpack on smaller partitions (128 and 256 nodes) of TSUBAME 2.0 and, for comparison, on TSUBAME 1.2, as discussed below.
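These figures are mutually consistent. As a quick cross-check (HPL credits a run with roughly (2/3)N^3 + (3/2)N^2 floating-point operations):

  2.288 PFlops / 1357 nodes ≈ 1686 GFlops per node, the Thin-node peak derived in Section 2;
  1.192 PFlops / 2.288 PFlops ≈ 0.521, i.e. the quoted 52.1% efficiency;
  1.192 PFlops / 1357 nodes ≈ 878 GFlops of Linpack performance per node.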

(Figure 9: Linpack performance on a 256-node partition. Figure 10: per-node comparison of TSUBAME 2.0, TSUBAME 1.2, and TSUBAME 1.2 using its CPUs only.)

Each TSUBAME 1.2 node had 16 Opteron cores, a ClearSpeed accelerator and, on part of the nodes, two Tesla S1070 GPU devices; its full-system Linpack efficiency was about 53% 5). To see where performance is lost, we compare the measured per-node Linpack performance (about 880GFlops on TSUBAME 2.0) with the aggregate DGEMM performance of the individual CPUs and GPUs (Elem-DGEMM), with the DGEMM performance achievable at the node level when the matrix has to be moved over PCI-Express (Node-DGEMM), and with the theoretical peak. On both systems the gap between Elem-DGEMM and Node-DGEMM mainly reflects the PCIe transfers, while the remaining gap down to the Linpack figure reflects MPI communication, panel factorization, and load imbalance.

5.4 Power consumption

The power consumption during the Linpack run, measured according to the Green500 rules 1), was 1243.8kW, which corresponds to 958.35MFlops/W. With this figure TSUBAME 2.0 was ranked 2nd on the November 2010 Green500 list and was recognized as "the Greenest Production Supercomputer in the World".
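The Green500 figure follows directly from the two measured quantities:

  1.192 PFlops / 1243.8 kW ≈ 958 MFlops/W,

which matches the reported 958.35MFlops/W up to the rounding of the Linpack result to 1.192PFlops.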

6. Concluding Remarks

We described the implementation and tuning of the Linpack benchmark for TSUBAME 2.0, a large-scale heterogeneous supercomputer combining CPUs and GPUs, and reported a performance of 1.192PFlops with a power efficiency of 958MFlops/W. The key points are to keep the matrix in host memory and stream it to the GPUs over PCI-Express, to divide the trailing matrix update between GPUs and CPUs, and to overlap MPI communication, PCIe transfers, and computation using dedicated threads.

Several directions remain for future work. Generic runtime systems for heterogeneous machines, such as the DAG-based task schedulers StarPU 3) and DPLASMA 4), and the hybrid MPI/SMPSs approach 9) for overlapping communication and computation, may subsume hand-written overlapping such as ours; applying them at this scale, and extending the techniques to other factorizations such as Cholesky and to other pivoting strategies, are left as future work.

Acknowledgments: We thank NVIDIA, Voltaire, and DDN for their support of TSUBAME 2.0. This work was supported in part by a Global COE program, JST CREST, JST-ANR, and a Grant-in-Aid for Scientific Research (No. 18049028).

References
1) The Green500 List. http://www.green500.org/.
2) TOP500 Supercomputer Sites. http://www.top500.org/.
3) Augonnet, C., Thibault, S., Namyst, R. and Wacrenier, P.-A.: StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, Proc. International Euro-Par Conference on Parallel Processing, pp. 863-874 (2009).
4) Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A. and Dongarra, J.: Distributed dense numerical linear algebra algorithms on massively parallel architectures: DPLASMA, Technical Report UT-CS-10-660, University of Tennessee Computer Science (2010).
5) Endo, T., Nukada, A., Matsuoka, S. and Maruyama, N.: Linpack evaluation on a supercomputer with heterogeneous accelerators, Proc. IEEE IPDPS 2010, 8 pages (2010).
6) Fatica, M.: Accelerating Linpack with CUDA on heterogeneous clusters, Proc. Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU'09) (2009).
7) Goto, K. and van de Geijn, R. A.: Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, Vol. 34, No. 3, pp. 1-25 (2008).
8) Kistler, M., Gunnels, J., Brokenshire, D. and Benton, B.: Petascale computing with accelerators, Proc. ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'09), pp. 241-250 (2009).
9) Marjanovic, V., Labarta, J., Ayguade, E. and Valero, M.: Overlapping communication and computation by using a hybrid MPI/SMPSs approach, Proc. ACM ICS'10, pp. 5-16 (2010).
10) Petitet, A., Whaley, R. C., Dongarra, J. and Cleary, A.: HPL - A portable implementation of the High-Performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/.
11) Linpack evaluation on the TSUBAME 2.0 supercomputer (in Japanese), IPSJ SIG Technical Report (HOKKE-18), pp. 1-6 (2010).