IPSJ SIG Technical Report Vol.2017-HPC-158 No /3/9 OpenACC MPS 1,a) 1 Moving Particle Semi-implicit (MPS) MPS MPS OpenACC GPU 2 4 GPU NVIDIA K2


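OpenACC MPS

a) miyajima.takaaki@jaxa.jp

This report accelerates a Moving Particle Semi-implicit (MPS) code with OpenACC on GPUs. Two processes of the MPS code are ported and evaluated on four NVIDIA GPUs: K20c, GTX1080, P100 (PCIe) and P100 (NVLink). The implementation is written in OpenACC Fortran.

1.

The MPS method is described in [1]. GPU implementations of MPS and related particle methods with CUDA are reported in [2], [3] and [4]. OpenACC [5] is an API for GPU programming in which directives and clauses are added to existing code, as an alternative to CUDA or OpenCL. The target application is NSRU-MPS, an MPS code written in Fortran 95 with MPI and OpenMP. Two of its processes are ported with OpenACC and compared between the CPU and four GPUs.

2. MPS

MPS has been applied to large-scale analyses, for example a region of 4.0 km by 3.5 km [6].

2.1 MPS (Explicit MPS)

In the following, r_i is the position of particle i and r_j the position of a neighboring particle j. The processing starts with Proc 0).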

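Figure 1: MPS processing flow: Proc 1), Proc 2), Proc 3), Proc 4), Proc 5), Proc 6).

Proc 1 to Proc 6 are repeated for every time step. Proc 0) is executed once and prepares the constants \lambda^0, n^0 and \rho^0 from the initial particle positions:

$$ \lambda^0 = \frac{\sum_{j \ne i} |\mathbf{r}_j^0 - \mathbf{r}_i^0|^2 \, \omega(|\mathbf{r}_j^0 - \mathbf{r}_i^0|)}{\sum_{j \ne i} \omega(|\mathbf{r}_j^0 - \mathbf{r}_i^0|)} \qquad (1) $$

$$ n^0 = \sum_{j \ne i} \omega(|\mathbf{r}_j^0 - \mathbf{r}_i^0|) \qquad (2) $$

Proc 1) computes the temporary velocity:

$$ \mathbf{u}_i^* = \mathbf{u}_i^k + \Delta t \left( \nu \frac{2d}{\lambda^0 n^0} \sum_{j \ne i} (\mathbf{u}_j^k - \mathbf{u}_i^k) \, \omega(|\mathbf{r}_j^k - \mathbf{r}_i^k|) + \mathbf{g} \right) \qquad (3) $$

Here j is the index of a neighboring particle, k the time step, \Delta t the time increment, \nu the kinematic viscosity coefficient, d the number of spatial dimensions, \mathbf{g} the gravitational acceleration, and \mathbf{u}_i^k the velocity of particle i at step k.

Proc 2) computes the temporary position from the result of Proc 1, where \mathbf{r}_i^k is the position at step k:

$$ \mathbf{r}_i^* = \mathbf{r}_i^k + \Delta t \, \mathbf{u}_i^* \qquad (4) $$

Proc 3) computes the particle number density at the temporary positions and, from it, the pressure:

$$ n_i^* = \sum_{j \ne i} \omega(|\mathbf{r}_j^* - \mathbf{r}_i^*|) \qquad (5) $$

[Figure 2]

$$ P_i^{k+1} = \frac{c^2 \rho^0}{n^0} \left( n_i^* - n^0 \right) \qquad (6) $$

P_i^{k+1} is the pressure at step k+1 and c is the speed of sound.

Proc 4) computes the pressure gradient. The weight function \omega_{grad} is defined in Section 2.3:

$$ \langle \nabla P \rangle_i^{k+1} = \frac{d}{n^0} \sum_{j \ne i} \left( \frac{(P_i^{k+1} + P_j^{k+1})(\mathbf{r}_j^* - \mathbf{r}_i^*)}{|\mathbf{r}_j^* - \mathbf{r}_i^*|^2} \, \omega_{grad}(|\mathbf{r}_j^* - \mathbf{r}_i^*|) \right) \qquad (7) $$

Proc 5) corrects the velocity and the position to obtain the values at step k+1:

$$ \mathbf{u}_i^{k+1} = \mathbf{u}_i^* - \Delta t \left( \frac{1}{\rho^0} \langle \nabla P \rangle_i \right)^{k+1} \qquad (8) $$

$$ \mathbf{r}_i^{k+1} = \mathbf{r}_i^* - \Delta t^2 \left( \frac{1}{\rho^0} \langle \nabla P \rangle_i \right)^{k+1} \qquad (9) $$

Proc 6) Forward time step: the time step is advanced using \mathbf{u} and \Delta t.

2.2

MPS, Smoothed Particle Hydrodynamics (SPH), N particles: [2] [7] [3]. NSRU-MPS.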
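Equations (5) and (6) can be illustrated with a small sketch. It is written in plain Python rather than the report's Fortran, and it is not the NSRU-MPS implementation: the function names `weight`, `number_density` and `pressure` and all sample values are assumptions made for illustration. The weight function is the one given as Eq. (11) in Section 2.3, taken to be zero at and beyond the effective radius.

```python
import math

def weight(r, re):
    """Weight w(r) = re/r + r/re - 2 inside the effective radius re, otherwise 0."""
    if r <= 0.0 or r >= re:
        return 0.0
    return re / r + r / re - 2.0

def number_density(positions, re):
    """Eq. (5): n_i = sum over j != i of w(|r_j - r_i|)."""
    dens = []
    for i, ri in enumerate(positions):
        total = 0.0
        for j, rj in enumerate(positions):
            if j != i:
                total += weight(math.dist(ri, rj), re)
        dens.append(total)
    return dens

def pressure(n_star, n0, rho0, c):
    """Eq. (6): P = c^2 * rho0 / n0 * (n* - n0)."""
    return c * c * rho0 / n0 * (n_star - n0)

# Hypothetical sample: three particles on a line, effective radius 1.0.
positions = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
n_star = number_density(positions, re=1.0)
# n0, rho0 and c are made-up constants, not values from the report.
p = [pressure(n, n0=0.25, rho0=1000.0, c=10.0) for n in n_star]
```

The double loop over all pairs is quadratic in the number of particles; it only serves to make the two formulas concrete.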

Figure 3: NSRU-MPS: Laplacian_u, cal_nden, grad_p, explicit. Read-After-Write (RAW) dependency.

2.3 MPS

The weight functions are

$$ \omega_{grad}(r) = \begin{cases} \dfrac{r_e}{r} - \dfrac{r}{r_e} & (r < r_e) \\ 0 & (r \ge r_e) \end{cases} \qquad (10) $$

$$ \omega(r) = \begin{cases} \dfrac{r_e}{r} + \dfrac{r}{r_e} - 2 & (r < r_e) \\ 0 & (r \ge r_e) \end{cases} \qquad (11) $$

where r_e is the effective radius and r is the distance to a neighboring particle j.

2.4 NSRU-MPS

NSRU-MPS is an MPS code (Semi-implicit MPS) parallelized with MPI. Particle quantities such as r, u, P and n are held as an Array of Structure (AoS).

The CPU environment is an Intel Xeon E5-2697 v2 (12 cores, 24 threads) with 128 GB of DDR memory.

Figure 4: CPU-GPU connection (NVLink, PCIe).

Execution time is measured with MPI_Wtime. 200. MPI: 2. 112,455; 14,688; 11,016.

In NSRU-MPS, Laplacian_u, grad_p, explicit and cal_nden account for 86% to 89% of the execution time. Laplacian_u takes 252.8 [ms]. elastic collision: 9%, 10%, 2%.

3. OpenACC

Two processes, timestep and Laplacian_u, are ported with OpenACC and evaluated on four GPUs: Tesla K20c, GeForce GTX1080, Tesla P100 (NVLink) and P100 (PCIe).

The code is compiled with the PGI Fortran Compiler: the x86-64 edition (ver 16.10) for K20c, GTX1080 and P100 (PCIe), and the linuxpower edition (ver 16.10) for P100 (NVLink). The compile options are -acc -ta=nvidia,cuda8.0,fastmath,cc60; for K20c, cc60 is replaced with cc35. The MPI library is OpenMPI. The CPU-GPU connection of each environment is described in Section 2.4 (Figure 4).

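Table 1: GPUs used in the evaluation. Cells whose values did not survive are left blank; truncated values are shown as they survive.

GPU | Performance [TFLOPS] | GPU clock [MHz] | CUDA Cores | Memory [Gbps] | CPU-GPU connection (bandwidth) | CPU
K20c | | | , | | PCIe Gen2 x16 (8GB/s) | Intel Xeon E v2
GTX1080 | | ,733 | 2, | | PCIe Gen3 x16 (16GB/s) | Intel Xeon E v2
P100 (PCIe) | 9.3 | 1,303 | 3, | | PCIe Gen3 x16 (16GB/s) | Intel Xeon E5-2630Lv3
P100 (NVLink) | | ,406 | 3, | | NVLink (40GB/s) | IBM POWER8 NVL

Tesla K20c is connected through PCI-Express Generation 2 x16, GTX1080 and P100 (PCIe) through PCI-Express Generation 3 x16, and P100 (NVLink) through NVLink. NVLink: 82.7%; PCIe Gen3 x16. GTX1080 and P100 differ in clock frequency [MHz].

3.1 timestep

timestep corresponds to Proc 6 and is ported to the GPU with OpenACC. The input array has 44,064 (= 3 × 14,688) or 337,365 (= 3 × 112,455) elements, so the data transferred from the CPU to the GPU is 176,256 bytes (44,064 × 4 byte) or 1,349,460 bytes (337,365 × 4 byte).

Table 2: CPU to GPU transfer time [ms] and bandwidth BW [GB/s] for the KB-sized and the 1.29MB arrays on K20c, GTX1080, P100 (PCIe) and P100 (NVLink). The measured values did not survive.

acc kernels

Listing 1 encloses the computation in acc kernels. maxval() is called once for each of the three components, and each maxval() is translated from Fortran into 3 CUDA Kernels (9 in total). On NVIDIA GPUs a reduction is carried out in 2 stages [8]. The three results cxmax, cymax and czmax are returned to the CPU. P100 (NVLink): linuxpower; x86-64: GTX1080, P100.

Listing 1: acc kernels

    !$acc kernels copyin(nm)
    cxmax = maxval(c(1,1:nm))
    cymax = maxval(c(2,1:nm))
    czmax = maxval(c(3,1:nm))
    cmax = max(cxmax,cymax)
    cmax = max(cmax,czmax)
    !$acc end kernels

maxval

In Listing 2 a single maxval() inside acc kernels replaces the three calls, so 3 CUDA Kernels are generated instead of 9 and the maximum is no longer combined on the CPU. linuxpower, x86-64: Kernel 9, 3. K20c, GTX1080: acc kernels.

Listing 2: maxval() called once

    !$acc kernels copyin(nm)
    cxmax = maxval(c(:3,:nm))
    !$acc end kernels

reduction

Listing 3 replaces maxval with max and a reduction clause. With the PGI compiler, the max reduction is executed on the GPU by acc loop reduction and acc loop vector(32), and 2 CUDA Kernels are generated. maxval, Kernel, GTX1080, acc kernels.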
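The transfer sizes quoted above follow from simple arithmetic, which the sketch below reproduces in Python. The function names are hypothetical, and `elapsed_ms` is a placeholder for a measured copy time: the report's measured values are not available here, so no bandwidth figure is claimed.

```python
BYTES_PER_ELEMENT = 4  # 4-byte element, as in "x 4 byte" above

def transfer_bytes(num_particles, components=3, elem_bytes=BYTES_PER_ELEMENT):
    """Bytes copied from CPU to GPU for one array with 3 components per particle."""
    return num_particles * components * elem_bytes

def bandwidth_gb_per_s(num_bytes, elapsed_ms):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured copy time."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9

small = transfer_bytes(14_688)     # 176,256 bytes
large = transfer_bytes(112_455)    # 1,349,460 bytes
large_mib = large / 1024 ** 2      # about 1.29 when 1 MB is read as 2**20 bytes
```

Given a measured time, `bandwidth_gb_per_s(large, elapsed_ms)` yields the kind of figure that the BW [GB/s] columns of the transfer table report.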

Figure 5: GTX1080 (GPU).

Figure 6: P100 (NVLink) (GPU).

Listing 3: max() with reduction

    !$acc parallel copyin(nm)
    !$acc loop reduction(max:cmax)
    do row=1,3
    !$acc loop vector(32)
      do col=1,nm
        ! c is dimensioned (3,nm), as in Listing 1
        cmax = max(cmax, c(row,col))
      end do
    end do
    !$acc end parallel

unroll

The loops of the reduction version are unrolled and merged with collapse(2); collapse(n) merges N nested loops into one. CUDA Kernel, acc kernels. reduction: 1.4, 1.9. P100 (NVLink), OpenACC.

3.1.5 CUDA

dev_maxval is written in CUDA Fortran and called from OpenACC, which results in 2 CUDA Kernels. Inside acc host_data use_device, the device address of c is passed to dev_maxval, which calls the CUDA Fortran maxval. unroll: 35%. linuxpower.

Listing 4: CUDA Fortran

    !$acc data copyin(c(:3,:nm))
    !$acc host_data use_device(c)
    cmax = dev_maxval(c, 3, Nm)
    !$acc end host_data
    !$acc end data

    attributes(device) real function dev_maxval(gdata, x, y)
      use cudafor, gpu_maxval => maxval
      integer,value :: x, y
      real,device :: gdata(x,y)
      dev_maxval = gpu_maxval(gdata)
    end function dev_maxval

Multicore

With the PGI option -ta=multicore the same OpenACC code runs on the CPU instead of through CUDA. The CPU is a Xeon E v2, and MPI is started with mpiexec -bind-to none -n 3. acc kernels, K20c.

Table 3: Multicore ([ms]). Columns: acc kernels, maxval, reduction, unroll. The measured values did not survive.

Laplacian_u

Laplacian_u is outlined in Listing 5. It consists of a loop over all particles (do-loop1), three loops over the adjacent buckets (do-loop2,3,4; 3 × 3 × 3 = 27 buckets) and a loop over the particles in each bucket. [ms], [ms], OpenACC, GPU.
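Listings 1 to 3 compute the same maximum in different ways. The Python sketch below mirrors the three variants on a small 3 × nm array and adds a chunked version that shows how a parallel reduction combines partial maxima. The chunk width of 32 only echoes the vector(32) clause; it is an illustration and not a description of the code the PGI compiler generates, and the sample data is made up.

```python
def max_per_component(c):
    """Listing 1 style: one maximum per component, then combine the three results."""
    cxmax, cymax, czmax = max(c[0]), max(c[1]), max(c[2])
    return max(max(cxmax, cymax), czmax)

def max_whole_array(c):
    """Listing 2 style: a single maximum over every element."""
    return max(v for row in c for v in row)

def max_loop_reduction(c):
    """Listing 3 style: nested loops carrying a running maximum."""
    cmax = float("-inf")
    for row in c:
        for v in row:
            cmax = max(cmax, v)
    return cmax

def max_chunked(c, width=32):
    """Parallel-style reduction: each chunk of `width` elements gives a partial
    maximum, and the partial maxima are then reduced once more."""
    flat = [v for row in c for v in row]
    partial = [max(flat[i:i + width]) for i in range(0, len(flat), width)]
    return max(partial)

# Hypothetical 3 x nm input.
nm = 100
c = [[((7 * k + 13 * r) % 101) / 10.0 for k in range(nm)] for r in range(3)]
results = [f(c) for f in (max_per_component, max_whole_array,
                          max_loop_reduction, max_chunked)]
```

All four functions return the same value; the variants differ only in how the work is split, which is what changes the number of generated kernels in the OpenACC versions.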

Figure 7: Naive.

Figure 8: Atomic.

Figure 9: 3-D thread.

Listing 5: Outline of Laplacian_u (pseudocode)

    ! for all the particles
    do loop1: target_ptcl = 1,all_ptcl
      b = bucket_num[m]
      ! traverse adjacent buckets (3 dim: 3x3x3=27)
      do loop2: x=x1,x2
        do loop3: y=y1,y2
          do loop4: z=z1,z2
            bb = get_adj_bucket_num(x,y,z)
            num_of_ptcl = get_num_of_ptcl_in_bucket(bb)
            ! accumulate all the neighbour particles
            do loop5: np = 1,num_of_ptcl   ! indefinite loop
              if (ptcl is in halo)
                lcr = ptcl_halo[np]   ! random access
              else
                lcr = ptcl[np]   ! random access
              end if
              dist = sqrt(dot_product(m, lcr))   ! get distance
              weight = get_weight(dist)
              accum = accum + phys(weight)   ! aggregation
      m_phys[m] = m_phys[m] + accum   ! in place add

Naive

Naive (Figure 7) assigns one particle to one GPU thread. In OpenACC, do-loop1 is enclosed in acc kernels and given acc loop gang vector, the three loops do-loop2,3,4 are given acc loop collapse(3) seq, and do-loop5 is given acc loop seq. One CUDA thread handles one particle, and the occupancy is 100%. Vector lengths of 64, 256 and 512 were also tried. P100 (NVLink): 451.

Listing 6: Naive

    !$acc kernels
    !$acc loop gang vector(128)
    do loop1: target_ptcl = 1,all_ptcl
      !$acc loop collapse(3) seq
      do loop2: x=x1,x2
        do loop3: y=y1,y2
          do loop4: z=z1,z2
            !$acc loop seq
            do loop5: np = 1,num_of_ptcl

Atomic

Atomic (Figure 8) assigns each of the 27 (3 × 3 × 3) adjacent buckets of a particle to one GPU thread, so one CUDA thread handles one bucket. Compared with Naive, the number of threads grows to 396,576 (= 14,688 × 27), in 3,099 blocks; as with Naive, the occupancy is 100%. In Atomic, do-loop1 to do-loop4 are merged: do-loop1 is enclosed in acc parallel and given acc loop collapse(4) gang vector. Because several GPU threads now add to the same particle, the in-place add is placed inside acc atomic update. P100 (PCIe): 220.
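The loop structure of Listing 5 can be sketched in runnable form as follows. This is a Python illustration, not the NSRU-MPS routine: the bucket edge is assumed equal to the effective radius, halo particles are left out, and `phys(weight)` is replaced by a weighted value difference in the manner of the sum in Eq. (3). The names `build_buckets` and `accumulate_naive` are hypothetical.

```python
import math
from collections import defaultdict

def weight(r, re):
    """Weight of Eq. (11): re/r + r/re - 2 inside the effective radius, else 0."""
    if r <= 0.0 or r >= re:
        return 0.0
    return re / r + r / re - 2.0

def bucket_of(p, cell):
    """Integer bucket coordinate of a position."""
    return tuple(int(math.floor(x / cell)) for x in p)

def build_buckets(positions, cell):
    """Map each bucket coordinate to the indices of the particles inside it."""
    buckets = defaultdict(list)
    for idx, p in enumerate(positions):
        buckets[bucket_of(p, cell)].append(idx)
    return buckets

def accumulate_naive(positions, values, re):
    """One pass per target particle (do-loop1) over its 3x3x3 adjacent buckets
    (do-loop2,3,4) and over the particles in each bucket (do-loop5)."""
    buckets = build_buckets(positions, re)   # assumption: bucket edge == re
    result = [0.0] * len(positions)
    for i, pi in enumerate(positions):
        bx, by, bz = bucket_of(pi, re)
        accum = 0.0
        for x in (bx - 1, bx, bx + 1):
            for y in (by - 1, by, by + 1):
                for z in (bz - 1, bz, bz + 1):
                    for j in buckets.get((x, y, z), ()):
                        if j != i:
                            w = weight(math.dist(pi, positions[j]), re)
                            accum += (values[j] - values[i]) * w
        result[i] += accum                   # in-place add, once per particle
    return result
```

Because every particle within the effective radius lies in one of the 27 adjacent buckets, the result matches an all-pairs sum while visiting far fewer candidates. In the Naive mapping the outer loop body is what a single GPU thread executes.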

Listing 7: Atomic

    !$acc parallel
    !$acc loop collapse(4) independent gang vector(128)
    do loop1: target_ptcl = 1,all_ptcl
      do loop2: x=x1,x2
        do loop3: y=y1,y2
          do loop4: z=z1,z2
            ! moved here from do loop1
            b = bucket_num[m]
            !$acc loop seq
            do loop5: np = 1,num_of_ptcl
            ! moved here from do loop1
            !$acc atomic update
            m_phys[m] = m_phys[m] + accum   ! in place add
            !$acc end atomic

3-D thread

3-D thread (Figure 9) follows the three-dimensional bucket structure of MPS: the 27 adjacent buckets are mapped to CUDA threadIdx.x, threadIdx.y and threadIdx.z, so that, as in Atomic, one CUDA thread handles one bucket and a block of 27 threads handles one particle. The number of threads is 396,576 (= 14,688 × 27) and the number of blocks is 14,688. Each of do-loop2,3,4 is given acc loop vector(3). The occupancy differs from that of Naive and Atomic.

Listing 8: 3-D thread

    !$acc kernels
    !$acc loop independent
    do loop1: target_ptcl = 1,all_ptcl
      !$acc loop vector(3)
      do loop2: x=x1,x2
        !$acc loop vector(3)
        do loop3: y=y1,y2
          !$acc loop vector(3)
          do loop4: z=z1,z2
            ! moved here from do loop1
            b = bucket_num[m]
            !$acc loop seq
            do loop5: np = 1,num_of_ptcl
            ! moved here from do loop1
            !$acc atomic update
            m_phys[m] = m_phys[m] + accum   ! in place add
            !$acc end atomic

Multicore

Multicore uses the PGI option -ta=multicore. For Naive the compiler reports "Loop not vectorized/parallelized: too deeply nested". On the Xeon CPU, MPI is started with mpiexec -bind-to none -n 2, and CPU utilization [%] is checked with mpstat. 34.03 [ms].

Table: Multicore. Columns: Processing time [ms], Speed-up. The measured values did not survive.

GTX1080: GTX1080, P100 (PCIe), 18%, 27%; GTX1080 and P100 differ in clock frequency [MHz] (Figure 6).

P100 (NVLink): P100 (NVLink), P100 (PCIe), 20%, 44%, 14%. Naive, GPU, P100 (PCIe), P100 (NVLink), Naive, 3-D Thread.

Section 3.1: Naive uses no atomic operation and processes 1 particle per thread; Atomic and 3-D Thread process 1 bucket per thread with atomic operations. atomic, occupancy, GPU, OpenACC; GTX1080 and P100 differ in clock frequency [MHz].
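The thread counts of the Atomic mapping, and the way a merged collapse(4) index relates to a particle and one of its 27 adjacent buckets, can be checked with the Python sketch below. The function names are hypothetical. Reading the figure 3,099 as the number of 128-thread blocks is an assumption based on vector(128) in the listings.

```python
import math

ADJACENT_BUCKETS = 27  # 3 x 3 x 3

def work_items(num_particles):
    """Threads needed when each (particle, adjacent bucket) pair gets one thread."""
    return num_particles * ADJACENT_BUCKETS

def blocks(num_threads, threads_per_block=128):
    """Thread blocks needed for a given vector length."""
    return math.ceil(num_threads / threads_per_block)

def decode(flat):
    """Split a merged index into (particle, dx, dy, dz), offsets in {-1, 0, 1}."""
    particle, cell = divmod(flat, ADJACENT_BUCKETS)
    dx, rest = divmod(cell, 9)
    dy, dz = divmod(rest, 3)
    return particle, dx - 1, dy - 1, dz - 1

def accumulate_flat(partial, num_particles):
    """Sum per-bucket contributions into per-particle totals.
    partial(particle, dx, dy, dz) returns the contribution of one bucket."""
    totals = [0.0] * num_particles
    for flat in range(work_items(num_particles)):
        p, dx, dy, dz = decode(flat)
        # On the GPU, 27 threads update totals[p] concurrently, which is why
        # this add sits inside "acc atomic update".
        totals[p] += partial(p, dx, dy, dz)
    return totals
```

With 14,688 particles, `work_items` gives 396,576 and `blocks` gives 3,099 for a vector length of 128, whereas one thread per particle would need 115 blocks.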

Figure 10: Naive, GPU, atomic; P100, GTX1080.

4. Conclusion

Two processes of an MPS code were ported with OpenACC and evaluated on four GPUs. 5, 3; 3.5; Xeon CPU; 220, 451; Fortran; 29.0, 74.5; GPU, MPS, NVIDIA.

[1] Koshizuka, S. and Oka, Y.: Moving particle semi-implicit method for fragmentation of incompressible fluid, Nuclear Science and Engineering, Vol. 123 (1996).
[2] Seiya, W., Takayuki, A., Satori, T. and Takashi, S.: Neighbor-particle Searching Method for Particle Simulation Based on Contact Interaction Model for GPU Computing, IPSJ Transactions on Advanced Computing Systems, Vol. 8, No. 4 (2015).
[3] Murotani, K., Masaie, I., Matsunaga, T., Koshizuka, S., Shioya, R., Ogino, M. and Fujisawa, T.: Performance improvements of differential operators code for MPS method on GPU, Computational Particle Mechanics, Vol. 2, No. 3 (online), DOI: /s (2015).
[4] Sota, Y., Watanabe, A. and Kojima, T.: Acceleration of the moving particle semi-implicit method through multi-GPU parallel computing with dynamic domain decomposition, Journal of Japan Society of Civil Engineers, Ser. A2 (Applied Mechanics (AM)), Vol. 69, No. 2 (2013).
[5] OpenACC Home.
[6] Murotani, K., Koshizuka, S., Tamai, T., Shibata, K., Mitsume, N., Yoshimura, S., Tanaka, S., Hasegawa, K., Nagai, E. and Fujisawa, T.: Development of Hierarchical Domain Decomposition Explicit MPS Method and Application to Large-scale Tsunami Analysis with Floating Objects, Journal of Advanced Simulation in Science and Engineering, Vol. 1, No. 1 (online), DOI: /jasse.1.16 (2014).
[7] Sun, H., Tan, Y., Zhang, Y., Wu, J., Wang, S., Yang, Q. and Zhou, Q.: A Special Sorting Method for Neighbor Search Procedure in Smoothed Particle Hydrodynamics on GPUs, Parallel Processing Workshops (ICPPW), th International Conference on (online), DOI: /ICPPW (2015).
[8] Woolley, C.: Professional CUDA C Programming (2014).
