040312研究会HPC2500.ppt

Transcription:

March 12, 2004    e-mail: m-aoki@jp.fujitsu.com    1

2

Product roadmap chart, 1998-2003: VX/VPP300, VPP700, GP7000, AP3000, VPP5000, PRIMEPOWER 2000, PRIMEPOWER HPC2500.    3

System comparison:
- VPP5000: 1 VU = 9.6 GFLOPS, 16 GB memory; 128 PE = 1.22 TFLOPS.
- PRIMEPOWER HPC2500: 1 SMP node = 128 CPUs (6.2 GFLOPS each), 798.7 GFLOPS, 512 GB memory; 128 nodes = 102.2 TFLOPS.
4

Interconnect diagram: nodes (each up to 128 CPUs) connected through DTUs and adapters (DTU: Data Transfer Unit), with I/O attached; 16 DTUs shown.    5

VPP5000 82 21 6

HPC2500 processor features: two sets of floating-point units (multiply-and-add / multiply / add / divide); up to 16 outstanding memory requests.    7

Timing diagram: without prefetch, load X,fr4 misses in the cache and the following add fr4 waits for memory (MEM); with prefetch X issued ahead of time, load X,fr4 hits and add fr4 proceeds without waiting.    8

JAXA Central Numerical Simulation System (CeNSS)
A PRIMEPOWER HPC2500 system was installed at the Japan Aerospace Exploration Agency (JAXA) in October 2002 as its main compute engine.
Configuration of CeNSS, PRIMEPOWER HPC2500 (14 compute cabinets):
- Peak performance: 9.3 TFLOPS
- Memory (total): 3.6 TB
HPC2500 (1 cabinet):
- CPU: SPARC64 V (1.3 GHz) x 128
- Memory: 256 GB
Interconnect:
- Crossbar switch: 4 GB/s (bi-directional), node-to-node communication
9

Kyoto University
The largest class of supercomputer system in the world; the largest supercomputer system among Japanese university centers.
Configuration [PRIMEPOWER HPC2500]:
- 128 CPUs/node x 11 cabinets (compute nodes)
- 64 CPUs/node x 1 cabinet (I/O node)
- 9.185 TFLOPS, memory: 5.75 TB
System diagram: the PRIMEPOWER HPC2500 nodes are connected by a high-speed optical interconnect; the I/O node attaches RAID storage (ETERNUS6000 Model 600, 8.0 TB, RAID5), a tape library, a network router, and pre/post-processing.
10

11

Software stack diagram: Parallelnavi, DTU (BLASTBAND HPC), GFS/GDS, and SRFS, on the Solaris Operating Environment.    12

Programming environment:
- Languages: Fortran, C, C++
- Parallel programming models: MPI, OpenMP, XPFortran *1
- Tools: Parallelnavi Workbench *2
- Libraries: SSL II, C-SSL II, BLAS, LAPACK, ScaLAPACK, SSL II/XPF *3
*1: eXtended Parallel Fortran (successor to VPP Fortran).
*2: Provided with Parallelnavi.
*3: SSL II/VPP for XPFortran.
13

14

Supported language standards:
- Fortran: ISO/IEC 1539-1:1997, JIS X3001-1:1998 (Fortran 95); FORTRAN77/Fortran90 also accepted
- C: ISO/IEC 9899:1999 (C99), X3.159-1989 (ANSI C), K&R C
- C++: ISO/IEC 14882:1998 (with Rogue Wave Tools.h++ V8)
- OpenMP: OpenMP Fortran Application Program Interface Version 2.0; OpenMP C and C++ Application Program Interface Version 2.0
- MPI: MPI-2: Extension to the Message-Passing Interface (July 18, 1997)
15

Comparison of parallel programming models: OpenMP, XPFortran (successor to VPP Fortran), and MPI.    16

XPFortran MPI MPI 17

MPI 18

The same difference loop written with OpenMP, XPFortran, and MPI. In the MPI version each rank owns a block of u and dif, and neighbouring ranks exchange one boundary element of u in each direction (MPI_SENDRECV) before the local loop runs.

OpenMP:
      program main
      dimension dif(1000),u(1000)
      :
      c = 2.0
!$OMP PARALLEL DO
      do i = 2, 999
        dif(i) = u(i+1) - c*u(i) + u(i-1)
      end do
      :
      end program main

XPFortran:
      program main
!XOCL PROCESSOR P(4)
      dimension u(1000),dif(1000)
!XOCL INDEX PARTITION Q=(P,INDEX=1:1000)
!XOCL GLOBAL u(/Q(OVERLAP=(1,1))),dif(/Q)
!XOCL PARALLEL REGION
      c = 2.0
!XOCL OVERLAPFIX(u)(id)
!XOCL MOVE WAIT(id)
!XOCL SPREAD DO REGIDENT(u,dif) /Q
      do i = 2, 999
        dif(i) = u(i+1) - c*u(i) + u(i-1)
      end do
!XOCL END SPREAD
!XOCL END PARALLEL
      end program

MPI:
      program main
      include "mpif.h"
      real(kind=4),dimension(:),allocatable :: dif,u
      integer STATUS(MPI_STATUS_SIZE)
      :
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,npe,ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,myrank,ierr)
      im = 1000
      ilen = (im + npe - 1)/npe
      ist = myrank*250 + 1
      iend = ist + ilen - 1
      allocate( u(ist-1:iend+1), dif(ist:iend) )
      nright = myrank + 1
      nleft = myrank - 1
      if (myrank == 0) then
        nleft = MPI_PROC_NULL
      else if (myrank == npe-1) then
        nright = MPI_PROC_NULL
      end if
      call MPI_SENDRECV( u(iend  ),1,MPI_REAL,nright,0, &
                         u(ist-1 ),1,MPI_REAL, nleft,0, &
                         MPI_COMM_WORLD,STATUS,ierr )
      call MPI_SENDRECV( u(ist   ),1,MPI_REAL, nleft,1, &
                         u(iend+1),1,MPI_REAL,nright,1, &
                         MPI_COMM_WORLD,STATUS,ierr )
      c = 2.0
      ist_do  = max(  2,ist )
      iend_do = min(999,iend)
      do i = ist_do, iend_do
        dif(i) = u(i+1) - c*u(i) + u(i-1)
      end do
      call MPI_FINALIZE(ierr)
      end program main
19

Compiler optimization example: prefetch instructions are generated for a summation loop so that data for later iterations is already on its way from memory when it is needed.

Source loop:
      DO I=1,N
        SUM=SUM+A(I)
      END DO

After optimization:
      DO I=1,N
        Prefetch1 A(I+1)  :1
        SUM = SUM + A(I)
        Prefetch2 A(I+17) :2
      END DO
20
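For reference, a minimal self-contained sketch (not from the slides) of how the same kind of summation loop is thread-parallelized with an OpenMP reduction clause; the program name, array size, and initial values are arbitrary choices.

      ! Minimal sketch (not from the slides): each thread accumulates a private
      ! partial sum, and OpenMP combines the partial sums when the loop ends.
      program sum_reduction
        implicit none
        integer, parameter :: n = 1000000
        real(kind=8), allocatable :: a(:)
        real(kind=8) :: s
        integer :: i
        allocate(a(n))
        a = 1.0d0
        s = 0.0d0
      !$omp parallel do reduction(+:s)
        do i = 1, n
          s = s + a(i)
        end do
      !$omp end parallel do
        print *, 'sum =', s   ! expected: 1000000.0
      end program sum_reduction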

SSL II    21

VPP5000 22

A loop and its OpenMP version: each iteration writes a different element of B and only reads A, so the iterations are independent and the loop parallelizes directly with a PARALLEL DO directive.

Serial:
      :
      DO I=1,1000
        B(I)=(A(I)+A(I+1))/2.0
      END DO
      :

OpenMP:
      :
!$OMP PARALLEL DO
      DO I=1,1000
        B(I)=(A(I)+A(I+1))/2.0
      END DO
!$OMP END PARALLEL DO
      :
23

Chart: barrier synchronization time (microseconds, 0-12) for a software barrier vs. the hardware barrier, at 1 to 128 threads.    24
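As a rough companion to the chart (not from the slides), per-barrier cost can be measured with a simple OpenMP loop; the iteration count and output format are arbitrary, and the runtime decides which barrier implementation is actually used.

      ! Minimal sketch (not from the slides): every thread passes through the
      ! barrier niter times, so (t1 - t0)/niter approximates one barrier cost.
      program omp_barrier_timing
        use omp_lib
        implicit none
        integer, parameter :: niter = 100000
        integer :: k
        double precision :: t0, t1
        t0 = omp_get_wtime()
      !$omp parallel private(k)
        do k = 1, niter
      !$omp barrier
        end do
      !$omp end parallel
        t1 = omp_get_wtime()
        print '(a,f8.3,a)', 'barrier cost: ', (t1 - t0)/niter*1.0d6, ' microseconds'
      end program omp_barrier_timing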

Chart: NAS Parallel Benchmarks BT, Class B; scaling factor (up to ~100) vs. number of CPUs (up to ~120) for HPC2500 1.3 GHz with OpenMP, shown against linear scaling and VPP5000/1 (6.7 GFLOPS).    25

Chart: SPEC OMPM2001 (OpenMP benchmark) results, SPEC rate vs. number of threads (up to ~140): Parallelnavi 2.3/HPC2500 at 1.3 GHz and at 1.5 GHz, compared with HP Superdome (Itanium 2, 1.5 GHz), SGI Altix 3000 (Itanium 2, 1.5 GHz), and others.    26

MPI 27

Chart: MPI_Barrier time (microseconds, 0-250) vs. number of processes (up to 512) for HPC2500-H and HPC2500-S.    28
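As an illustration (not from the slides), MPI_Barrier latency of the kind shown above can be measured with a small program; the iteration count and output format are arbitrary, and any standard MPI implementation is assumed.

      ! Minimal sketch (not from the slides): all ranks repeat MPI_Barrier and
      ! rank 0 reports the average time per barrier in microseconds.
      program mpi_barrier_timing
        use mpi
        implicit none
        integer, parameter :: niter = 1000
        integer :: ierr, rank, nprocs, k
        double precision :: t0, t1
        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
        call MPI_Barrier(MPI_COMM_WORLD, ierr)     ! align the start
        t0 = MPI_Wtime()
        do k = 1, niter
          call MPI_Barrier(MPI_COMM_WORLD, ierr)
        end do
        t1 = MPI_Wtime()
        if (rank == 0) then
          print '(a,i6,a,f10.3,a)', '# of process =', nprocs, &
                '   MPI_Barrier =', (t1 - t0)/niter*1.0d6, ' microseconds'
        end if
        call MPI_Finalize(ierr)
      end program mpi_barrier_timing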

MPI 29

30

Table (entries include MIPS figures).    31

Load-imbalance example: the inner loop runs from j to 4096, so with the default block distribution of the j loop the first threads get far more work than the last ones, as the thread-balance report shows.

      common a,b,c,d
      real*8 a(4097,4096),b(4097,4096),c(4097,4096)
!$omp PARALLEL DO
      do j=1,4096
        do i=j,4096
          a(i,j)=b(i,j)+c(i,j)
        enddo
      enddo

Performance Analysis
  Elapsed         User            System
  1.563679e+01    4.050000e+00    3.630000e+00
  Process 0-0
  *******************   + 77%   1.420000e+02   Thread 0
  **********            + 38%   1.110000e+02   Thread 1
                         -  0%   8.000000e+01   Thread 2
  **********            - 39%   4.900000e+01   Thread 3
  *******************   - 76%   1.900000e+01   Thread 4
  Balance against average time per Thread
32

Tuned version: SCHEDULE(STATIC,1) assigns the j iterations cyclically, so long and short columns are spread evenly over the threads and the elapsed time drops from about 15.6 s to 9.9 s.

      common a,b,c,d
      real*8 a(4097,4096),b(4097,4096),c(4097,4096)
!$omp PARALLEL DO SCHEDULE(STATIC,1)
      do j=1,4096
        do i=j,4096
          a(i,j)=b(i,j)+c(i,j)
        enddo
      enddo

Performance Analysis
  Elapsed         User            System
  9.884062e+00    4.180000e+00    4.470000e+00
  Process 0-0
  - 1%   8.200000e+01   Thread 0
    0%   8.300000e+01   Thread 1
  + 1%   8.400000e+01   Thread 2
    0%   8.300000e+01   Thread 3
    0%   8.300000e+01   Thread 4
  Balance against average time per Thread
33
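To make the effect easy to reproduce, here is a minimal self-contained sketch (not from the slides) that times both schedules with omp_get_wtime; the program name and the reduced array size (2048 instead of the slide's 4097x4096) are my own choices.

      ! Minimal sketch (not from the slides): times the triangular loop with the
      ! default schedule and with SCHEDULE(STATIC,1), as on slides 32-33.
      program imbalance_demo
        use omp_lib
        implicit none
        integer, parameter :: n = 2048
        real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
        double precision :: t0, t1
        integer :: i, j

        allocate(a(n+1,n), b(n+1,n), c(n+1,n))
        a = 0.0d0
        b = 1.0d0
        c = 2.0d0

        ! Schedule left to the default (typically a block distribution of j):
        ! small j means a long inner loop, so the first threads do most work.
        t0 = omp_get_wtime()
      !$omp parallel do private(i)
        do j = 1, n
          do i = j, n
            a(i,j) = b(i,j) + c(i,j)
          end do
        end do
      !$omp end parallel do
        t1 = omp_get_wtime()
        print '(a,f8.4,a)', 'default schedule:   ', t1 - t0, ' s'

        ! Cyclic schedule: long and short columns are interleaved across
        ! threads, which balances the work.
        t0 = omp_get_wtime()
      !$omp parallel do schedule(static,1) private(i)
        do j = 1, n
          do i = j, n
            a(i,j) = b(i,j) + c(i,j)
          end do
        end do
      !$omp end parallel do
        t1 = omp_get_wtime()
        print '(a,f8.4,a)', 'schedule(static,1): ', t1 - t0, ' s'
      end program imbalance_demo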

34

35