2004.3.12   e-mail: m-aoki@jp.fujitsu.com
[Figure: Fujitsu supercomputer product history, 1998-2003: VX/VPP300, VPP700, GP7000, AP3000, VPP5000, PRIMEPOWER 2000, PRIMEPOWER HPC2500]
VPP5000 vs. PRIMEPOWER HPC2500
- VPP5000: 1 PE (vector unit) = 9.6 GF, 16 GB memory; up to 128 PE = 1.22 TF
- PRIMEPOWER HPC2500: 1 SMP node = 128 CPUs (6.2 GF/CPU), 798.7 GF, 512 GB memory; up to 128 nodes (798.7 GF x 128 nodes = 102.2 TF)
[Figure: PRIMEPOWER HPC2500 system architecture: up to 128 SMP nodes connected through crossbar adapters and DTUs (Data Transfer Unit, 16 per node) with attached I/O.]
[Slide: HPC2500 processor features: arithmetic units (multiply-and-add / multiply / add / divide) x 2; up to 16 outstanding memory operations.]
[Figure: Latency hiding by prefetch. Without prefetch, "load X, fr4" misses in cache and the following "add fr4" waits for memory; with a prior "prefetch X", the load hits in cache and the add proceeds without waiting.]
JAXA Central Numerical Simulation System (CeNSS)
A PRIMEPOWER HPC2500 system was installed at the Japan Aerospace Exploration Agency (JAXA) in Oct. 2002 as the main compute engine.

Configuration of CeNSS (PRIMEPOWER HPC2500, 14 compute cabinets):
- Peak performance: 9.3 TFlops
- Memory (total): 3.6 TB
HPC2500 (1 cabinet):
- CPU: SPARC64 V (1.3 GHz) x 128
- Memory: 256 GB
Interconnect:
- Crossbar switch: 4 GB/s (bi-directional), node-to-node communication
Kyoto University
A supercomputer system of the largest class in the world, and the largest supercomputer system among Japanese university centers.

Configuration (PRIMEPOWER HPC2500): 9.185 TFLOPS, Memory: 5.75 TB
- Compute nodes: 128 CPUs/node x 11 cabinets
- I/O node: 64 CPUs/node x 1 cabinet
- Nodes connected by a high-speed optical interconnect
- Storage: RAID ETERNUS6000 Model 600, 8.0 TB (RAID5); tape library
- Network router; pre/post-processing servers
[Slide: Software stack on the Solaris(TM) Operating Environment: Parallelnavi, DTU interconnect support (BLASTBAND HPC), and file systems (GFS/GDS, SRFS).]
Programming languages and libraries (Parallelnavi):
- Languages: Fortran, C, C++; parallel programming with MPI, OpenMP, and XPFortran *1
- Libraries: SSL II, C-SSL II, BLAS, LAPACK, ScaLAPACK, SSL II/XPF *2 (a small BLAS usage sketch follows below)
- Tools: Parallelnavi Workbench
*1: eXtended Parallel Fortran (compatible with VPP Fortran); provided with Parallelnavi
*2: SSL II for XPFortran (corresponding to SSL II/VPP)
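As one illustration of using these libraries, here is a minimal sketch (not from the original material) of calling the standard BLAS level-3 routine DGEMM from Fortran to compute C = A*B; the matrix size and the use of random data are illustrative assumptions.

      program dgemm_example
      implicit none
      integer, parameter :: n = 512              ! illustrative matrix size
      real(kind=8) :: a(n,n), b(n,n), c(n,n)
      ! fill A and B with arbitrary values
      call random_number(a)
      call random_number(b)
      c = 0.0d0
      ! C := 1.0*A*B + 0.0*C  (standard BLAS DGEMM interface)
      call dgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      print *, 'c(1,1) =', c(1,1)
      end program dgemm_example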
Language standards supported:
- Fortran: ISO/IEC 1539-1:1997, JIS X3001-1:1998 (Fortran 95); FORTRAN77/Fortran90 compatibility
- C: ISO/IEC 9899:1999 (C99), X3.159-1989 (ANSI C), K&R
- C++: ISO/IEC 14882:1998 (with Rogue Wave Tools.h++ V8)
- OpenMP: OpenMP Fortran Application Program Interface Version 2.0; OpenMP C and C++ Application Program Interface Version 2.0
- MPI: MPI-2: Extension to the Message-Passing Interface (July 18, 1997)
[Slide: Comparison of parallel programming models: OpenMP, XPFortran (VPP Fortran compatible), and MPI.]
Example: the same difference loop written with OpenMP, MPI, and XPFortran.

OpenMP version:

      program main
      dimension dif(1000),u(1000)
      :
      c = 2.0
!$OMP PARALLEL DO
      do i = 2, 999
        dif(i) = u(i+1) - c*u(i) + u(i-1)
      end do
      :
      end program main

MPI version:

      program main
      include "mpif.h"
      real(kind=4),dimension(:),allocatable :: dif,u
      integer STATUS(MPI_STATUS_SIZE)
      :
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,npe,ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,myrank,ierr)
      im   = 1000
      ilen = (im + npe - 1)/npe            ! block size per process
      ist  = myrank*ilen + 1               ! first local index
      iend = ist + ilen - 1                ! last local index
      allocate( u(ist-1:iend+1), dif(ist:iend) )   ! one halo cell on each side
      nright = myrank + 1
      nleft  = myrank - 1
      if(myrank == 0) then
        nleft  = MPI_PROC_NULL
      else if(myrank == npe-1) then
        nright = MPI_PROC_NULL
      end if
!     exchange halo cells with the neighboring processes
      call MPI_SENDRECV( u(iend  ),1,MPI_REAL,nright,0,            &
                         u(ist-1 ),1,MPI_REAL,nleft ,0,            &
                         MPI_COMM_WORLD,STATUS,ierr )
      call MPI_SENDRECV( u(ist   ),1,MPI_REAL,nleft ,1,            &
                         u(iend+1),1,MPI_REAL,nright,1,            &
                         MPI_COMM_WORLD,STATUS,ierr )
      c = 2.0
      ist_do  = max(  2,ist )
      iend_do = min(999,iend)
      do i = ist_do, iend_do
        dif(i) = u(i+1) - c*u(i) + u(i-1)
      end do
      :
      call MPI_FINALIZE(ierr)
      end program main

XPFortran version:

      program main
!XOCL PROCESSOR P(4)
      dimension u(1000),dif(1000)
!XOCL INDEX PARTITION Q=(P,INDEX=1:1000)
!XOCL GLOBAL u(/Q(OVERLAP=(1,1))),dif(/Q)
      :
!XOCL PARALLEL REGION
      c = 2.0
!XOCL OVERLAPFIX(u)(id)
!XOCL MOVE WAIT(id)
!XOCL SPREAD DO REGIDENT(u,dif) /Q
      do i = 2, 999
        dif(i) = u(i+1) - c*u(i) + u(i-1)
      end do
!XOCL END SPREAD
!XOCL END PARALLEL
      :
      end program main

In the MPI version each process owns a block of u plus one halo cell on each side, and the two MPI_SENDRECV calls fill u(ist-1) and u(iend+1) from the neighboring processes before the loop runs; the OpenMP and XPFortran versions express the same loop with directives and leave the data movement to the runtime system or compiler.
Software prefetch inserted by the compiler (example):

Original loop:
      DO I=1,N
        SUM = SUM + A(I)
      END DO

Loop with compiler-inserted prefetch:
      DO I=1,N
        Prefetch1  A(I+1)       ! prefetch for the next iteration
        SUM = SUM + A(I)
        Prefetch2  A(I+17)      ! prefetch data farther ahead of its use
      END DO
[Slide: SSL II (Scientific Subroutine Library II).]
Loop parallelization example: a serial loop and its OpenMP parallel form (the iterations are independent, so the transformation is safe; see the sketch after this example for a loop where it is not).

Serial source:
      :
      DO I=1,1000
        B(I) = (A(I)+A(I+1))/2.0
      END DO
      :

Equivalent OpenMP code:
      :
!$OMP PARALLEL DO
      DO I=1,1000
        B(I) = (A(I)+A(I+1))/2.0
      END DO
!$OMP END PARALLEL DO
      :
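As a contrast to the independent loop above, here is a minimal sketch (not from the original slides) of a loop with a carried dependence; applying !$OMP PARALLEL DO to it would change the result, which is why only dependence-free loops can be parallelized this way.

      ! A recurrence: iteration I reads B(I-1), which is written by iteration I-1,
      ! so the iterations cannot run in parallel without changing the result.
      B(1) = A(1)
      DO I = 2, 1000
        B(I) = B(I-1) + A(I)
      END DO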
[Figure: Barrier synchronization time in microseconds (0-12) versus number of threads (1, 2, 4, ..., 128), comparing software and hardware barrier implementations.]
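A minimal sketch of how such a per-barrier cost could be measured with standard OpenMP routines (omp_get_wtime and the barrier directive); the iteration count and output format are illustrative assumptions, not taken from the slides.

      program barrier_time
      use omp_lib
      implicit none
      integer, parameter :: niter = 10000       ! number of barriers timed (assumed value)
      integer :: i
      real(kind=8) :: t0, t1
!$omp parallel private(i)
!$omp barrier                                   ! make sure every thread has started
!$omp master
      t0 = omp_get_wtime()
!$omp end master
      do i = 1, niter
!$omp barrier
      end do
!$omp master
      t1 = omp_get_wtime()
      print *, 'threads =', omp_get_num_threads(),                &
               '  barrier =', (t1-t0)/niter*1.0d6, ' microseconds'
!$omp end master
!$omp end parallel
      end program barrier_time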
[Figure: NAS Parallel Benchmarks BT Class B scaling on PRIMEPOWER HPC2500 (1.3 GHz, OpenMP). Scaling factor (0-100) versus number of CPUs (up to about 128), compared with linear scaling and with the VPP5000/1 result (6.7 Gflops).]
[Figure: SPEC OMPM2001 (OpenMP benchmark) results. SPEC rate (0-45000) versus number of threads (0-140) for Parallelnavi 2.3/HPC2500 (1.3 GHz), Parallelnavi 2.3/HPC2500 (1.5 GHz), HP Superdome (Itanium2, 1.5 GHz), SGI Altix 3000 (Itanium2, 1.5 GHz), and others.]
[Figure: MPI_Barrier time in microseconds (0-250) versus number of processes (0-512), comparing HPC2500-H and HPC2500-S.]
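A minimal sketch of how an MPI_Barrier cost of this kind could be measured with standard MPI calls (MPI_Wtime, MPI_Barrier); the iteration count and reporting are illustrative assumptions, not taken from the slides.

      program mpi_barrier_time
      implicit none
      include "mpif.h"
      integer, parameter :: niter = 10000       ! number of barriers timed (assumed value)
      integer :: ierr, myrank, npe, i
      real(kind=8) :: t0, t1
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, npe, ierr)
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)    ! synchronize before timing starts
      t0 = MPI_WTIME()
      do i = 1, niter
        call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      end do
      t1 = MPI_WTIME()
      if (myrank == 0) then
        print *, '# of processes =', npe,                          &
                 '  MPI_Barrier =', (t1-t0)/niter*1.0d6, ' microseconds'
      end if
      call MPI_FINALIZE(ierr)
      end program mpi_barrier_time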
Example: load imbalance with a plain PARALLEL DO. The inner loop length shrinks as j grows, so a block distribution of the j iterations gives the threads unequal amounts of work.

      common a,b,c,d
      real*8 a(4097,4096),b(4097,4096),c(4097,4096)
!$omp PARALLEL DO
      do j=1,4096
        do i=j,4096
          a(i,j)=b(i,j)+c(i,j)
        enddo
      enddo

Performance Analysis
  Elapsed: 1.563679e+01   User: 4.050000e+00   System: 3.630000e+00   (Process 0-0)
  Balance against average time per thread:
    Thread 0: +77%   1.420000e+02
    Thread 1: +38%   1.110000e+02
    Thread 2:  -0%   8.000000e+01
    Thread 3: -39%   4.900000e+01
    Thread 4: -76%   1.900000e+01
With SCHEDULE(STATIC,1) the j iterations are dealt out cyclically, the work per thread evens out, and the elapsed time drops (see also the scheduling note after this example):

      common a,b,c,d
      real*8 a(4097,4096),b(4097,4096),c(4097,4096)
!$omp PARALLEL DO SCHEDULE(STATIC,1)
      do j=1,4096
        do i=j,4096
          a(i,j)=b(i,j)+c(i,j)
        enddo
      enddo

Performance Analysis
  Elapsed: 9.884062e+00   User: 4.180000e+00   System: 4.470000e+00   (Process 0-0)
  Balance against average time per thread:
    Thread 0: -1%   8.200000e+01
    Thread 1:  0%   8.300000e+01
    Thread 2: +1%   8.400000e+01
    Thread 3:  0%   8.300000e+01
    Thread 4:  0%   8.300000e+01
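Another standard OpenMP way to even out such a triangular loop, shown here as a minimal sketch that mirrors the example above rather than anything from the slides, is a dynamic schedule, where each thread takes the next chunk of j values as it becomes idle; the chunk size 16 is an illustrative choice.

      real*8 a(4097,4096),b(4097,4096),c(4097,4096)
!$omp PARALLEL DO SCHEDULE(DYNAMIC,16)
      do j=1,4096
        do i=j,4096
          a(i,j)=b(i,j)+c(i,j)
        enddo
      enddo

SCHEDULE(STATIC,1) has no scheduling overhead at run time, while a dynamic schedule also adapts to imbalance that is not known in advance; for a regular triangular loop like this one, either choice removes most of the imbalance.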