Porting an MPS Simulation Code to GPUs with OpenACC

Takaaki Miyajima 1,a)

1  Japan Aerospace Exploration Agency, 7-44-1, 182-8522, Japan
a) miyajima.takaaki@jaxa.jp

Abstract: The Moving Particle Semi-implicit (MPS) method is a particle method for incompressible free-surface flows. This report describes the porting of the in-house MPS simulation code NSRU-MPS to GPUs using OpenACC. Two hot-spot functions were offloaded; five implementations of the first and three of the second were written and evaluated for two problem sizes on four NVIDIA GPUs: Tesla K20c, GeForce GTX1080, Tesla P100 (PCIe) and Tesla P100 (NVLink). Relative to the original Fortran code, speed-ups of 29.0 and 74.5 times were obtained on the GPU.

1. Introduction

The MPS method [1] is a particle method for simulating incompressible flows with free surfaces (Fig. 1). Because the computational cost of MPS simulations is large, accelerating them with CUDA on GPUs has been studied by several groups [2][3][4].

OpenACC is a directive-based API for offloading computations to accelerators such as GPUs [5]. Directives and clauses are added to an existing program and the compiler generates the GPU code, so an application can be ported with considerably less effort than with CUDA or OpenCL.

In this work, the in-house MPS simulation code NSRU-MPS is ported to GPUs with OpenACC. NSRU-MPS is written in Fortran 95 and parallelized with MPI and OpenMP. Two hot-spot functions are annotated with OpenACC directives and evaluated against the CPU for two problem sizes on four GPUs.

2. The MPS method and NSRU-MPS

The MPS method is used for large-scale free-surface flow simulations, for example tsunami analyses covering an area of 4.0 km x 3.5 km [6]. This section summarizes the explicit MPS algorithm and the structure of NSRU-MPS.

2.1 The explicit MPS method

NSRU-MPS implements the explicit MPS scheme. Figure 2 illustrates a particle i (position r_i) and its neighbour particles j (positions r_j) within the effective radius. One simulation step consists of the following procedures:

Proc 0) computation of the initial constants λ^0 and n^0
Proc 1) computation of the viscosity and gravity terms
Proc 2) update of the temporary particle positions
Proc 3) computation of the particle number density and the pressure
Proc 4) computation of the pressure gradient
Proc 5) correction of the velocities and positions
Proc 6) forward time step

Fig. 1  The MPS method.

Proc 1) to Proc 6) are repeated every time step; Proc 0) is executed only once, before the time loop.

Proc 0) The constants λ^0 and n^0 are computed from the initial particle arrangement:

\lambda^0 = \frac{\sum_{j \neq i} |\mathbf{r}^0_j - \mathbf{r}^0_i|^2 \,\omega(|\mathbf{r}^0_j - \mathbf{r}^0_i|)}{\sum_{j \neq i} \omega(|\mathbf{r}^0_j - \mathbf{r}^0_i|)}   (1)

n^0 = \sum_{j \neq i} \omega(|\mathbf{r}^0_j - \mathbf{r}^0_i|)   (2)

Proc 1) The intermediate velocity \mathbf{u}^*_i is obtained by applying the viscosity and gravity terms explicitly:

\mathbf{u}^*_i = \mathbf{u}^k_i + \Delta t \left( \nu \frac{2d}{\lambda^0 n^0} \sum_{j \neq i} (\mathbf{u}^k_j - \mathbf{u}^k_i)\,\omega(|\mathbf{r}^k_j - \mathbf{r}^k_i|) + \mathbf{g} \right)   (3)

where \Delta t is the time increment, \nu the kinematic viscosity, d the number of space dimensions, \mathbf{g} the gravitational acceleration and \mathbf{u}^k_i the velocity of particle i at time step k.

Proc 2) The temporary positions are computed with the intermediate velocities:

\mathbf{r}^*_i = \mathbf{r}^k_i + \Delta t\, \mathbf{u}^*_i   (4)

where \mathbf{r}^k_i is the position of particle i at step k.

Proc 3) The particle number density is evaluated at the temporary positions,

n^*_i = \sum_{j \neq i} \omega(|\mathbf{r}^*_j - \mathbf{r}^*_i|)   (5)

and the pressure is computed from it:

P^{k+1}_i = \frac{c^2 \rho^0}{n^0}\,(n^*_i - n^0)   (6)

where P^{k+1}_i is the pressure of particle i at step k+1, c the speed of sound and \rho^0 the fluid density.

Proc 4) The pressure gradient is evaluated as

\langle \nabla P \rangle^{k+1}_i = \frac{d}{n^0} \sum_{j \neq i} \frac{(P^{k+1}_i + P^{k+1}_j)(\mathbf{r}^*_j - \mathbf{r}^*_i)}{|\mathbf{r}^*_j - \mathbf{r}^*_i|^2}\,\omega_{grad}(|\mathbf{r}^*_j - \mathbf{r}^*_i|)   (7)

where \omega_{grad} is the weight function of the gradient model, given in Sec. 2.3.

Proc 5) Using the pressure gradient, the velocities and positions are corrected:

\mathbf{u}^{k+1}_i = \mathbf{u}^*_i - \frac{\Delta t}{\rho^0}\,\langle \nabla P \rangle^{k+1}_i   (8)

\mathbf{r}^{k+1}_i = \mathbf{r}^*_i - \frac{\Delta t^2}{\rho^0}\,\langle \nabla P \rangle^{k+1}_i   (9)

where \mathbf{r}^{k+1}_i is the position of particle i at step k+1.

Proc 6) Forward time step: k and the simulation time t are advanced, and \mathbf{u}^{k+1}_i and \mathbf{r}^{k+1}_i are carried over to the next step. A compact Fortran sketch of Proc 1) to 6) is given at the end of Sec. 2.2.

Fig. 2  Particle i and its neighbour particles j.

2.2 Neighbour-particle search

The sums over j in Eqs. (1) to (3), (5) and (7) run only over the neighbour particles within the effective radius, so every step requires a neighbour-particle search. As in Smoothed Particle Hydrodynamics (SPH), a brute-force search over all N particles costs O(N^2) and dominates the run time, and bucket (cell) based search methods are therefore used [2][7][3]. NSRU-MPS also sorts the particles into buckets and searches only the adjacent buckets (Fig. 2).
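The following is a minimal, self-contained Fortran sketch of one explicit MPS step following Eqs. (3) to (9). It is an illustration only, not the NSRU-MPS source: all names and constants are placeholders chosen for this sketch, and an all-pairs neighbour loop is used instead of the bucket search described above.

! Sketch only: one explicit MPS step, Eqs. (3)-(9), with an all-pairs
! neighbour loop instead of the bucket search of Sec. 2.2.
module explicit_mps_sketch
  implicit none
  real, parameter :: re  = 2.1        ! effective radius (placeholder value)
  real, parameter :: rho = 1000.0     ! fluid density (placeholder value)
  real, parameter :: nu  = 1.0e-6     ! kinematic viscosity (placeholder value)
  real, parameter :: cs  = 20.0       ! speed of sound c (placeholder value)
  integer, parameter :: d = 3         ! number of space dimensions
contains
  real function w(r)                  ! weight function, Eq. (11)
    real, intent(in) :: r
    w = 0.0
    if (r > 0.0 .and. r < re) w = re/r - (r/re)**2
  end function w

  real function wgrad(r)              ! gradient weight function, Eq. (10)
    real, intent(in) :: r
    wgrad = 0.0
    if (r > 0.0 .and. r < re) wgrad = re/r - r/re
  end function wgrad

  subroutine step(np, dt, g, lam0, n0, r, u, p)
    integer, intent(in) :: np         ! number of particles
    real, intent(in)    :: dt, g(3)   ! time increment, gravity
    real, intent(in)    :: lam0, n0   ! lambda^0 and n^0 from Eqs. (1)-(2)
    real, intent(inout) :: r(3,np), u(3,np), p(np)
    real :: rs(3,np), us(3,np), ns(np), gradp(3,np), dr(3), dist
    integer :: i, j

    do i = 1, np                      ! Proc 1: viscosity + gravity, Eq. (3)
      us(:,i) = 0.0
      do j = 1, np
        if (j == i) cycle
        dr = r(:,j) - r(:,i);  dist = sqrt(sum(dr*dr))
        us(:,i) = us(:,i) + (u(:,j) - u(:,i)) * w(dist)
      end do
      us(:,i) = u(:,i) + dt * ( nu*2.0*d/(lam0*n0) * us(:,i) + g )
    end do

    rs = r + dt*us                    ! Proc 2: temporary positions, Eq. (4)

    do i = 1, np                      ! Proc 3: number density and pressure, Eqs. (5)-(6)
      ns(i) = 0.0
      do j = 1, np
        if (j == i) cycle
        dr = rs(:,j) - rs(:,i);  ns(i) = ns(i) + w(sqrt(sum(dr*dr)))
      end do
      p(i) = cs*cs*rho/n0 * (ns(i) - n0)
    end do

    do i = 1, np                      ! Proc 4: pressure gradient, Eq. (7)
      gradp(:,i) = 0.0
      do j = 1, np
        if (j == i) cycle
        dr = rs(:,j) - rs(:,i);  dist = sqrt(sum(dr*dr))
        if (dist > 0.0 .and. dist < re) &
          gradp(:,i) = gradp(:,i) + (p(i) + p(j)) * dr / dist**2 * wgrad(dist)
      end do
      gradp(:,i) = real(d)/n0 * gradp(:,i)
    end do

    u = us - dt/rho * gradp           ! Proc 5: corrections, Eqs. (8)-(9)
    r = rs - dt*dt/rho * gradp        ! Proc 6 (advancing k and t) is left to the caller
  end subroutine step
end module explicit_mps_sketch

In NSRU-MPS these steps are implemented as separate functions (Fig. 3), and the sums over j use the bucket search instead of the O(N^2) loops shown here.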
Fig. 3  Structure of NSRU-MPS.

In NSRU-MPS the neighbour-particle loops are contained in the functions Laplacian_u, cal_nden and grad_p_explicit (Fig. 3). Two consecutive procedures have a Read-After-Write (RAW) dependence (each procedure consumes values produced by the one before it), so their loops cannot be fused and must be executed in order.

2.3 Weight functions

The weight functions used in the summations of Sec. 2.1 are given by Eqs. (10) and (11): \omega_{grad} appears in the gradient model of Eq. (7) and \omega in the other sums.

\omega_{grad}(r) = \frac{r_e}{r} - \frac{r}{r_e}   (10)

\omega(r) = \frac{r_e}{r} - \left( \frac{r}{r_e} \right)^2   (11)

where r_e is the effective radius and r = |\mathbf{r}_j - \mathbf{r}_i|; both weights are zero for r \geq r_e.

2.4 NSRU-MPS

NSRU-MPS is an in-house simulation code based on the MPS method; besides the explicit scheme of Sec. 2.1 it also implements the semi-implicit MPS scheme, and it is parallelized with MPI. Each particle carries nine per-particle values, among them the position r_i, the velocity u_i, the intermediate values u*_i and r*_i, the pressure P_i, the pressure gradient and the particle number density n_i; they are stored in an Array-of-Structures (AoS) layout.

The evaluation environment is a node with two Intel Xeon E5-2697 v2 CPUs @ 2.7 GHz (12 cores / 24 threads each) and 128 GB of DDR3-12800 memory.

Fig. 4  CPU-GPU connections (PCIe Gen3 and NVLink).

Two problem sizes are used: a 35 x 35 x 7 domain with 25,704 particles and a 70 x 70 x 14 domain with 224,910 particles (8.75 times as many). Execution times are measured with MPI_Wtime over 200 measurements. With 2 MPI processes the large problem gives 112,455 particles per process; for the small problem, rank 0 holds 14,688 particles and rank 1 holds 11,016.

Profiling NSRU-MPS shows that the three functions Laplacian_u, grad_p_explicit and cal_nden, which implement Proc 1), 4) and 3), account for 86% to 89% of the total execution time. Laplacian_u is the largest contributor at 252.8 ms; elastic_collision accounts for a further 9% to 10%, and the remaining functions for about 2%.

3. Porting with OpenACC

Two functions, timestep and Laplacian_u, are ported with OpenACC and evaluated on four GPUs: Tesla K20c, GeForce GTX1080, Tesla P100 (NVLink) and Tesla P100 (PCIe) (Table 1). Although some of the machines have two CPUs and two GPUs, a single GPU and a single CPU are used in the measurements.

The PGI Fortran compiler is used: the x86-64 edition (ver. 16.10) for the K20c, GTX1080 and P100 (PCIe) machines and the linuxpower edition (ver. 16.10) for the P100 (NVLink) machine. The compile options are -acc -ta=nvidia,cuda8.0,fastmath,cc60; for the K20c, cc35 is used instead of cc60. OpenMPI 1.10.2 is used for MPI.

The CPU baseline is a single core running one MPI process, and the GPU measurements use one GPU. The execution times of timestep and Laplacian_u are measured in the environment of Sec. 2.4; the CPU-GPU connection of each machine is shown in Fig. 4.
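To make the data handling in the next sections concrete, the following is a small illustrative sketch, not the NSRU-MPS source, of an Array-of-Structures particle record as described in Sec. 2.4, kept resident on the GPU with an OpenACC data region. The type name, the field names and the per-step host update are assumptions made for this sketch.

! Sketch only: AoS particle layout (Sec. 2.4) kept on the GPU with a data
! region, so that per-step CPU-GPU traffic is limited to what the host needs.
module particle_data_sketch
  implicit none
  type :: particle                    ! field names are assumptions
    real :: r(3), u(3)                ! position, velocity
    real :: rs(3), us(3)              ! intermediate position r*, velocity u*
    real :: gradp(3)                  ! pressure gradient
    real :: p, n                      ! pressure, particle number density
  end type particle
contains
  subroutine run(ptcl, nstep, dt)
    type(particle), intent(inout) :: ptcl(:)
    integer, intent(in) :: nstep
    real, intent(in)    :: dt
    integer :: k, i
    !$acc data copy(ptcl)             ! one transfer in, one transfer out
    do k = 1, nstep
      ! the offloaded kernels (Secs. 3.1-3.2) work on the resident copy;
      ! the trivial position update of Proc 2 stands in for them here
      !$acc parallel loop present(ptcl)
      do i = 1, size(ptcl)
        ptcl(i)%rs = ptcl(i)%r + dt * ptcl(i)%us
      end do
      !$acc update self(ptcl)         ! refresh the host copy when it is needed
    end do
    !$acc end data
  end subroutine run
end module particle_data_sketch

In the measurements of Sec. 3.1, the per-step traffic is limited to the 3 x Nm array used by the timestep function; the measured transfer times are given in Table 2.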
Table 1  GPUs used in the evaluation.

GPU            Peak perf. [TFLOPS]  Clock [MHz]  CUDA cores  Memory BW [GB/s]  CPU-GPU connection (peak)   Host CPU
K20c           3.5                  706          2,496       208               PCIe Gen2 x16 (8 GB/s)      Intel Xeon E5-2697 v2
GTX1080        8.8                  1,733        2,560       320               PCIe Gen3 x16 (16 GB/s)     Intel Xeon E5-2697 v2
P100 (PCIe)    9.3                  1,303        3,584       732               PCIe Gen3 x16 (16 GB/s)     Intel Xeon E5-2630L v3
P100 (NVLink)  10.6                 1,406        3,584       732               NVLink (40 GB/s)            IBM POWER8 NVL

The Tesla K20c is connected to its host by PCI-Express Generation 2 x16, the GTX1080 and the P100 (PCIe) by PCI-Express Generation 3 x16, and the P100 (NVLink) by NVLink (Fig. 4). NVLink offers by far the highest CPU-GPU bandwidth of the four systems. The GTX1080 runs at a core clock roughly 330 MHz higher than that of the P100.

3.1 Porting of the timestep function

The timestep function implements Proc 6); its core is the computation of the maximum of a per-particle quantity c(3,Nm) (Listings 1 to 4). Five GPU versions were written, four with OpenACC and one with CUDA Fortran, plus a multicore CPU build of the same source (Sec. 3.1.6). Per step, 44,064 array elements (3 x 14,688) for the small problem and 337,365 elements (3 x 112,455) for the large problem are transferred between CPU and GPU, i.e. 176,256 bytes (44,064 x 4 bytes) and 1,349,460 bytes (337,365 x 4 bytes) of single-precision data. Table 2 lists the measured transfer times and bandwidths; the kernel results are shown in Figs. 5 and 6.

Table 2  CPU-GPU transfer time and bandwidth for the 172.13 KB and 1.29 MB transfers.

                 172.13 KB              1.29 MB
GPU              Time [ms]  BW [GB/s]   Time [ms]  BW [GB/s]
K20c             0.030      5.41        0.216      5.81
GTX1080          0.018      9.09        0.126      9.94
P100 (PCIe)      0.023      6.96        0.167      7.50
P100 (NVLink)    0.007      22.90       0.044      28.25

3.1.1 acc kernels version

Listing 1 shows the acc kernels version, in which the whole block is simply enclosed in an acc kernels region. The Fortran intrinsic maxval() in lines 2 to 4 is offloaded by the PGI compiler as CUDA kernels; since a reduction cannot be completed in a single kernel launch on an NVIDIA GPU [8], nine kernel launches in total are generated for this region. The intermediate results cxmax, cymax and czmax are combined by the max() operations in lines 5 and 6 on the CPU. The code generated by the linuxpower edition (P100 NVLink) differs from that of the x86-64 edition used for the other machines.

Listing 1  acc kernels version.
1  !$acc kernels copyin(Nm)
2  cxmax = maxval(c(1,1:Nm))   ! maximum of the x components
3  cymax = maxval(c(2,1:Nm))   ! maximum of the y components
4  czmax = maxval(c(3,1:Nm))   ! maximum of the z components
5  cmax = max(cxmax,cymax)     ! combine the partial maxima
6  cmax = max(cmax,czmax)
7  !$acc end kernels

3.1.2 maxval version

In the maxval version (Listing 2), the three maxval() calls and the two max() operations are replaced by a single maxval() over the whole array c(:3,:Nm). Only one reduction remains inside the acc kernels region, fewer CUDA kernels are launched (three instead of nine), and the intermediate results no longer have to be combined on the CPU. The generated kernels again differ between the linuxpower and x86-64 editions. The difference from the acc kernels version is most visible on the K20c and GTX1080.

Listing 2  maxval version (a single maxval()).
1  !$acc kernels copyin(Nm)
2  cmax = maxval(c(:3,:Nm))
3  !$acc end kernels

3.1.3 reduction version

In the reduction version, maxval() and max() are rewritten as explicit loops with a reduction clause; PGI supports max reductions on the GPU. An acc loop reduction(max:cmax) directive is placed on the outer loop and acc loop vector(32) on the inner loop (Listing 3). Two CUDA kernels are generated, fewer than in the maxval version. On the GTX1080 this version is 3.2 and 1.6 times faster than the acc kernels version for the two problem sizes.
Fig. 5  GPU processing times of the timestep versions on the GTX1080.
Fig. 6  GPU processing times of the timestep versions on the P100 (NVLink).

Listing 3  reduction version using max().
1  !$acc parallel copyin(Nm)
2  !$acc loop reduction(max:cmax)
3  do row=1,3
4  !$acc loop vector(32)
5    do col=1,Nm
6      cmax = max(cmax, c(row,col))   ! c is dimensioned c(3,Nm)
7    end do
8  end do
9  !$acc end parallel

3.1.4 unroll version

The unroll version adds collapse(2) to the reduction version: collapse(n) merges the following n tightly nested loops into a single iteration space, so the two loops of Listing 3 are executed by one CUDA kernel. This version is 1.6 to 3.5 times faster than the acc kernels version and 1.4 to 1.9 times faster than the reduction version; on the P100 (NVLink) it is the fastest of the OpenACC versions. A sketch of this variant is given at the end of Sec. 3.1.

3.1.5 CUDA version

The CUDA version calls a CUDA Fortran routine, dev_maxval, from the OpenACC code (Listing 4). The device address of c is passed with acc host_data use_device(c), and dev_maxval applies the CUDA Fortran maxval to the device array; the reduction itself runs in two CUDA kernels. With the linuxpower edition the difference from the unroll version was roughly 35%.

Listing 4  CUDA Fortran version.
1  !$acc data copyin(c(:3,:Nm))
2  !$acc host_data use_device(c)
3  cmax = dev_maxval(c, 3, Nm)
4  !$acc end host_data
5  !$acc end data

1  attributes(device) real function dev_maxval(gdata, x, y)
2    use cudafor, gpu_maxval => maxval
3    integer,value :: x, y
4    real,device :: gdata(x,y)
5    dev_maxval = gpu_maxval(gdata)
6  end function dev_maxval

3.1.6 Multicore version

With PGI's -ta=multicore option the same OpenACC source is compiled for the multicore CPU instead of CUDA. The measurements use the Xeon E5-2697 v2 and are run through MPI (mpiexec -bind-to none -n 3). Table 3 shows the processing times: the acc kernels version is by far the slowest, as it also is on the K20c, while the other three versions take on the order of 1 ms.

Table 3  Processing times of the multicore versions [ms] (one row per problem size).

acc kernels   maxval   reduction   unroll
15.180        1.029    1.101       1.062
14.603        1.511    1.184       0.827
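The paper gives no listing for the unroll version of Sec. 3.1.4; the following fragment sketches what a collapse(2) variant of Listing 3 could look like. It is an assumption made for illustration (the loop schedule is left to the compiler) and not the code that was measured.

! Sketch of the unroll (collapse) variant of Sec. 3.1.4; cmax is assumed to be
! initialised beforehand and c is dimensioned c(3,Nm) as in Listings 1 to 3.
!$acc parallel copyin(Nm)
!$acc loop collapse(2) reduction(max:cmax)
do row = 1,3
  do col = 1,Nm
    cmax = max(cmax, c(row,col))
  end do
end do
!$acc end parallel

Collapsing the two loops exposes the full 3 x Nm iteration space to a single kernel.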
3.2 Porting of the Laplacian_u function

Listing 5 shows the loop structure of Laplacian_u. The outermost loop over all target particles (do-loop1, line 2) contains a triple loop over the adjacent buckets (do-loop2,3,4, lines 5 to 7), which visits 3 x 3 x 3 = 27 buckets in three dimensions. The innermost loop (do-loop5, line 11) runs over the particles stored in each bucket; its trip count is not known at compile time. Lines 13 and 15 load the neighbour particle data with random accesses, line 17 computes the distance, and lines 19 and 20 accumulate the contributions into the target particle. On one core of the Xeon CPU, Laplacian_u takes 257.66 ms for the small problem and 1924.98 ms for the large problem. Three OpenACC implementations, Naive, Atomic and 3-D thread, were written and evaluated for the two problem sizes on the four GPUs (Fig. 10); a multicore CPU build is discussed in Sec. 3.2.4.

Fig. 7  Naive version.
Fig. 8  Atomic version.
Fig. 9  3-D thread version.

Listing 5  Loop structure of Laplacian_u (pseudocode).
1   ! for all the particles
2   do loop1: target_ptcl = 1,all_ptcl
3     b = bucket_num[m]
4     ! traverse adjacent buckets (3 dim: 3x3x3=27)
5     do loop2: x=x1,x2
6     do loop3: y=y1,y2
7     do loop4: z=z1,z2
8       bb = get_adj_bucket_num(x,y,z)
9       num_of_ptcl = get_num_of_ptcl_in_bucket(bb)
10      ! accumulate all the neighbour particles
11      do loop5: np = 1,num_of_ptcl   ! indefinite loop
12        if (ptcl is in halo)
13          lcr = ptcl_halo[np]   ! random access
14        else
15          lcr = ptcl[np]        ! random access
16        end if
17        dst = sqrt(dot_product(m, lcr))   ! get distance
18        weight = get_weight(dst)
19        accum = accum + phys(weight)      ! aggregation
20    m_phys[m] = m_phys[m] + accum         ! in-place add

3.2.1 Naive version

In the Naive version, one GPU thread handles one target particle (Fig. 7). do-loop1 is parallelized with acc kernels and acc loop gang vector(128) (Listing 6); the three bucket loops carry acc loop collapse(3) seq and do-loop5 carries acc loop seq, so each thread traverses its 27 adjacent buckets sequentially. With a vector length of 128, the 14,688-particle problem produces 115 (= 14,688 / 128, rounded up) thread blocks, and the theoretical occupancy is 100%. Vector lengths of 64, 256 and 512 were also tried. On the P100 (NVLink), this version runs up to roughly 451 times faster than one core of the Xeon CPU.

Listing 6  Naive version.
1   !$acc kernels
2   !$acc loop gang vector(128)
3   do loop1: target_ptcl = 1,all_ptcl
4   ...
5   !$acc loop collapse(3) seq
6   do loop2: x=x1,x2
7   do loop3: y=y1,y2
8   do loop4: z=z1,z2
9   ...
10  !$acc loop seq
11  do loop5: np = 1,num_of_ptcl
12  ...

3.2.2 Atomic version

In the Atomic version (Fig. 8), the 27 (3 x 3 x 3) adjacent buckets of one target particle are handled by 27 different GPU threads, i.e. one CUDA thread per (particle, bucket) pair, 27 times as many threads as in the Naive version. For the 14,688-particle problem with a vector length of 128, 396,576 (= 14,688 x 27) threads in 3,099 (= 14,688 x 27 / 128, rounded up) thread blocks are generated; as in the Naive version, the theoretical occupancy is 100%. To achieve this, do-loop1 through do-loop4 are parallelized together: acc parallel with acc loop collapse(4) independent gang vector(128) is applied at do-loop1 (Listing 7), and statements that depended on the original loop nesting are moved inside. Because the threads handling the 27 buckets of one particle update the same element, the in-place addition is protected with acc atomic update (lines 15 to 17). On the P100 (PCIe), the speed-up over one CPU core reaches roughly 220 times.
Listing 7  Atomic version.
1   !$acc parallel
2   !$acc loop collapse(4) independent gang vector(128)
3   do loop1: target_ptcl = 1,all_ptcl
4   ...
5   do loop2: x=x1,x2
6   do loop3: y=y1,y2
7   do loop4: z=z1,z2
8   ...
9   ! moved here from do loop1
10  b = bucket_num[m]
11  !$acc loop seq
12  do loop5: np = 1,num_of_ptcl
13  ...
14  ! moved here from do loop1
15  !$acc atomic update
16  m_phys[m] = m_phys[m] + accum   ! in-place add
17  !$acc end atomic

3.2.3 3-D thread version

The 3-D thread version maps the bucket structure of the MPS method onto the three-dimensional CUDA thread indices threadIdx.x, threadIdx.y and threadIdx.z (Fig. 9): the 27 adjacent buckets of one target particle are handled by a 3 x 3 x 3 block of threads. As in the Atomic version, 27 CUDA threads work on one particle, so for the 14,688-particle problem 396,576 (= 14,688 x 27) threads are generated, but now in 14,688 (= 396,576 / 27) thread blocks of 27 threads each. In Listing 8, do-loop2, 3 and 4 each carry acc loop vector(3), and the in-place addition is again protected with acc atomic update. Compared with the Naive and Atomic versions, the occupancy is lower because of the small thread blocks.

Listing 8  3-D thread version.
1   !$acc kernels
2   !$acc loop independent
3   do loop1: target_ptcl = 1,all_ptcl
4   ...
5   !$acc loop vector(3)
6   do loop2: x=x1,x2
7   !$acc loop vector(3)
8   do loop3: y=y1,y2
9   !$acc loop vector(3)
10  do loop4: z=z1,z2
11  ...
12  ! moved here from do loop1
13  b = bucket_num[m]
14  !$acc loop seq
15  do loop5: np = 1,num_of_ptcl
16  ...
17  ! moved here from do loop1
18  !$acc atomic update
19  m_phys[m] = m_phys[m] + accum   ! in-place add
20  !$acc end atomic

3.2.4 Multicore version

The Multicore version compiles the Naive version with PGI's -ta=multicore option. The compiler reports "Loop not vectorized/parallelized: too deeply nested" for the loop nest, so the parallelism comes from MPI: the program is run on the Xeon CPUs with mpiexec -bind-to none -n 2. mpstat showed that about 20 of the 24 cores (2 CPUs) were busy at roughly 60% utilisation. The processing times are 34.03 ms for the small problem and 318.16 ms for the large problem, i.e. 7.5 and 6.1 times faster than a single core. Table 4 shows how the small-problem time changes with the number of MPI processes; beyond 2 processes the performance degrades.

Table 4  Multicore version with varying numbers of MPI processes.

Number of MPI processes    2       4       8       16
Processing time [ms]       34.03   72.48   141.75  315.43
Speed-up                   7.57    3.55    1.82    0.82

3.3 Summary of results

This section summarizes the measurements of Secs. 3.1 and 3.2. For the timestep function (Sec. 3.1), Fig. 5 shows the GTX1080 results: the GTX1080 is 18% to 27% faster than the P100 (PCIe), which is attributed to its core clock being roughly 330 MHz higher than that of the P100. Figure 6 shows the P100 (NVLink) results: the P100 (NVLink) outperforms the P100 (PCIe) by 20% to 44%, and by 14% in one case.

For the Laplacian_u function (Sec. 3.2), Fig. 10 compares the three versions on each GPU. The Naive version gives the best results, ahead of the 3-D thread version, on the P100 (PCIe) and P100 (NVLink) as well. The Naive version assigns one thread per particle and needs no atomic operations, whereas the Atomic and 3-D thread versions assign one thread per (particle, bucket) pair and pay for the atomic updates, which serialize the 27 threads contributing to the same particle, and, in the 3-D thread case, for the lower occupancy. Even through OpenACC, the GTX1080 benefits from its core clock being roughly 330 MHz higher than that of the P100.
Fig. 10  GPU processing times of the Naive, Atomic and 3-D thread versions of Laplacian_u on each GPU.

4. Conclusion

We ported two hot-spot functions of the MPS simulation code NSRU-MPS to GPUs with OpenACC: five implementations of the timestep function and three of the Laplacian_u function were written and evaluated for two problem sizes on four GPUs. For Laplacian_u, speed-ups of roughly 220 times (P100 PCIe) and 451 times (P100 NVLink) over a single core of the Xeon CPU were obtained, and relative to the original Fortran code the GPU versions reached factors of 29.0 and 74.5. Extending the GPU port to the remaining functions of the MPS code is left as future work.

References

[1] Koshizuka, S. and Oka, Y.: Moving particle semi-implicit method for fragmentation of incompressible fluid, Nuclear Science and Engineering, Vol. 123, pp. 421-434 (1996).
[2] Seiya, W., Takayuki, A., Satori, T. and Takashi, S.: Neighbor-particle Searching Method for Particle Simulation Based on Contact Interaction Model for GPU Computing, IPSJ Transactions on Advanced Computing Systems, Vol. 8, No. 4, pp. 50-60 (2015).
[3] Murotani, K., Masaie, I., Matsunaga, T., Koshizuka, S., Shioya, R., Ogino, M. and Fujisawa, T.: Performance improvements of differential operators code for MPS method on GPU, Computational Particle Mechanics, Vol. 2, No. 3, pp. 261-272 (online), DOI: 10.1007/s40571-015-0059-2 (2015).
[4] Sota, Y., Watanabe, A. and Kojima, T.: Acceleration of the moving particle semi-implicit method through multi-GPU parallel computing with dynamic domain decomposition, Journal of Japan Society of Civil Engineers, Ser. A2 (Applied Mechanics (AM)), Vol. 69, No. 2 (2013).
[5] OpenACC: OpenACC Home, http://www.openacc.org/.
[6] Murotani, K., Koshizuka, S., Tamai, T., Shibata, K., Mitsume, N., Yoshimura, S., Tanaka, S., Hasegawa, K., Nagai, E. and Fujisawa, T.: Development of Hierarchical Domain Decomposition Explicit MPS Method and Application to Large-scale Tsunami Analysis with Floating Objects, Journal of Advanced Simulation in Science and Engineering, Vol. 1, No. 1, pp. 16-35 (online), DOI: 10.15748/jasse.1.16 (2014).
[7] Sun, H., Tian, Y., Zhang, Y., Wu, J., Wang, S., Yang, Q. and Zhou, Q.: A Special Sorting Method for Neighbor Search Procedure in Smoothed Particle Hydrodynamics on GPUs, Parallel Processing Workshops (ICPPW), 2015 44th International Conference on, pp. 81-85 (online), DOI: 10.1109/ICPPW.2015.46 (2015).
[8] Woolley, C.: Professional CUDA C Programming (2014).