IPSJ SIG Technical Report Vol.2017-HPC-158

GPU Acceleration of an MPS Simulation Code Using OpenACC

Takaaki Miyajima 1,a)

1  Japan Aerospace Exploration Agency (JAXA), 7-44-1, Chofu, Tokyo 182-8522, Japan
a) miyajima.takaaki@jaxa.jp

Abstract: The Moving Particle Semi-implicit (MPS) method is a particle method for incompressible free-surface flow. In this report, kernels of an MPS simulation code are ported to GPUs with OpenACC and evaluated on four GPUs from two generations: NVIDIA K20c, GTX 1080, P100 (PCIe) and P100 (NVLink). Five implementations of the timestep kernel and three implementations of the Laplacian_u kernel are compared. The best timestep implementation is up to 3.5 times faster than a straightforward acc kernels version, and the best Laplacian_u implementation runs 29.0 to 74.5 times faster than the parallel Fortran execution on the CPU.

1. Introduction

The MPS method [1] is a particle method widely used for incompressible free-surface flow. Several GPU implementations of MPS based on CUDA have been reported [2], [3], [4]. In this work we use OpenACC instead. OpenACC is a directive-based API for accelerator programming [5]: the programmer annotates loops with directives and clauses and the compiler generates the GPU code, so an existing program can be ported with far less effort than rewriting it in CUDA or OpenCL. The target program is NSRU-MPS, an MPS simulation code written in Fortran 95 and parallelised with MPI and OpenMP. Section 2 describes the MPS method and NSRU-MPS, and Section 3 reports the OpenACC implementations and their performance on the multicore CPU and on the four GPUs.

2. The MPS Method

NSRU-MPS targets large-scale free-surface flow such as tsunami run-up; an explicit MPS simulation of a 4.0 km × 3.5 km coastal region is reported in [6].

2.1 Explicit MPS Algorithm

In the explicit MPS method, each particle i interacts with the neighbour particles j that lie within an effective radius of its position r_i (Fig. 2). One time step consists of the procedures Proc 0) to Proc 6) below, summarised in Fig. 1.

Fig. 1: Procedures of the explicit MPS algorithm (Proc 0 to Proc 6).
Fig. 2: A particle i and its neighbour particles j.

Proc 0) Before the time loop, the model constants \lambda^0 and n^0 are computed from the initial particle arrangement:

  \lambda^0 = \frac{\sum_{j \neq i} |r^0_j - r^0_i|^2 \, \omega(|r^0_j - r^0_i|)}{\sum_{j \neq i} \omega(|r^0_j - r^0_i|)}    (1)

  n^0 = \sum_{j \neq i} \omega(|r^0_j - r^0_i|)    (2)

where r^0_i is the initial position of particle i and \omega is the weight function defined in Sec. 2.3.

Proc 1) The viscosity and gravity terms are evaluated explicitly to obtain the predicted velocity u^*_i:

  u^*_i = u^k_i + \Delta t \left( \nu \frac{2d}{\lambda^0 n^0} \sum_{j \neq i} (u^k_j - u^k_i)\, \omega(|r_j - r_i|) + g \right)    (3)

where \Delta t is the time step width, \nu the kinematic viscosity, d the number of spatial dimensions, g the gravitational acceleration and u^k_i the velocity of particle i at step k.

Proc 2) The predicted position r^*_i is obtained from the velocity predicted in Proc 1):

  r^*_i = r^k_i + \Delta t\, u^*_i    (4)

where r^k_i is the position of particle i at step k.

Proc 3) The particle number density n^*_i is computed for the predicted positions, and the pressure P^{k+1}_i is obtained from the equation of state:

  n^*_i = \sum_{j \neq i} \omega(|r^*_j - r^*_i|)    (5)

  P^{k+1}_i = \frac{c^2 \rho^0}{n^0} (n^*_i - n^0)    (6)

where c is the speed of sound and \rho^0 the reference density.

Proc 4) The pressure gradient is evaluated:

  \nabla P^{k+1}_i = \frac{d}{n^0} \sum_{j \neq i} \frac{(P^{k+1}_i + P^{k+1}_j)(r_j - r_i)}{|r_j - r_i|^2} \, \omega_{grad}(|r_j - r_i|)    (7)

where \omega_{grad} is the weight function for the gradient model (Sec. 2.3).

Proc 5) The velocity and position are corrected with the pressure gradient:

  u^{k+1}_i = u^*_i - \frac{\Delta t}{\rho} \nabla P^{k+1}_i    (8)

  r^{k+1}_i = r^*_i - \frac{\Delta t^2}{\rho} \nabla P^{k+1}_i    (9)

where u^{k+1}_i and r^{k+1}_i are the velocity and position at step k+1.

Proc 6) Forward time step: the simulation time is advanced and the next time step width is determined from the maximum particle velocity.
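To make the control flow of Proc 0) to Proc 6) concrete, the following is a minimal Fortran sketch of one explicit MPS run. The routine names Laplacian_u, cal_nden, grad_p_explicit and timestep are taken from the NSRU-MPS profile in Sec. 2.4; their mapping to the procedures, the remaining routine names and the empty argument lists are assumptions, not the authors' code.

    ! Minimal sketch of the explicit MPS time loop (assumed structure).
    program explicit_mps_sketch
       implicit none
       integer :: k, num_steps
       num_steps = 200                 ! the evaluation in Sec. 2.4 uses 200 steps
       do k = 1, num_steps
          call Laplacian_u()           ! Proc 1): viscosity + gravity -> predicted u*
          call move_particle()         ! Proc 2): r* = r^k + dt u*      (assumed name)
          call cal_nden()              ! Proc 3): number density n*, pressure P^{k+1}
          call grad_p_explicit()       ! Proc 4): pressure gradient, Eq. (7)
          call correct_velocity()      ! Proc 5): corrected u^{k+1}, r^{k+1} (assumed name)
          call timestep()              ! Proc 6): advance t, choose the next dt
       end do
    contains                           ! empty stubs so that the sketch compiles
       subroutine Laplacian_u()
       end subroutine Laplacian_u
       subroutine move_particle()
       end subroutine move_particle
       subroutine cal_nden()
       end subroutine cal_nden
       subroutine grad_p_explicit()
       end subroutine grad_p_explicit
       subroutine correct_velocity()
       end subroutine correct_velocity
       subroutine timestep()
       end subroutine timestep
    end program explicit_mps_sketch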

2.2 Neighbour-Particle Interactions

Proc 1) to Proc 5) all contain summations over the neighbour particles j. As in Smoothed Particle Hydrodynamics (SPH), a naive all-pairs neighbour search costs O(N^2) for N particles, so bucket (cell) based neighbour search methods are used in practice [2], [7], [3], and NSRU-MPS follows this approach. The summations are implemented in the routines Laplacian_u, cal_nden and grad_p_explicit; successive procedures have Read-After-Write (RAW) dependences on each other's results, so they are executed in order.

2.3 Weight Functions

The weight functions used in Eqs. (1) to (7) are

  \omega_{grad}(r) = \frac{r_e}{r} - \frac{r}{r_e}    (10)

  \omega(r) = \left( \frac{r_e}{r} - \frac{r}{r_e} \right)^2    (11)

where r_e is the effective radius and r = |r_j - r_i| is the distance between particles i and j.

2.4 NSRU-MPS

NSRU-MPS implements the explicit MPS method described above (rather than the original semi-implicit MPS) and is parallelised with MPI. For each particle it stores nine physical quantities, such as the position r, velocity u, the predicted values u* and r*, the pressure P and the particle number density n, in an AoS (Array of Structures) layout (a sketch of such a layout is given at the end of this section).

The CPU measurements in this report use a node with two Intel Xeon E5-2697 v2 processors @ 2.7 GHz (12 cores / 24 threads each) and 128 GB of DDR3-12800 memory.

Fig. 4: CPU-GPU connection of the evaluated systems (NVLink and PCIe Gen3).

Two problem sizes are used: a 35 × 35 × 7 domain with 25,704 particles and a 70 × 70 × 14 domain with 224,910 particles, about 8.75 times as many. Execution times are measured with MPI_Wtime over 200 time steps. The code is run with two MPI processes: for the large problem each rank holds 112,455 particles, and for the small problem rank 0 holds 14,688 particles and rank 1 holds 11,016 particles.

Fig. 3: Execution-time breakdown of NSRU-MPS.

Figure 3 shows the execution-time breakdown of NSRU-MPS. Three routines, Laplacian_u, grad_p_explicit and cal_nden, account for 86% to 89% of the execution time; Laplacian_u alone takes about 252.8 ms. The elastic_collision routine accounts for a further 9% to 10%, and the remaining routines for about 2% to 3%.
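The following is a minimal sketch of an AoS particle layout of the kind described above. The derived-type and member names are illustrative assumptions; NSRU-MPS's actual declarations are not shown in this report.

    ! Minimal sketch of an AoS (Array of Structures) particle layout.
    ! Type and member names are assumptions for illustration only.
    module particle_aos_sketch
       implicit none
       type :: particle
          real :: r(3)       ! position r
          real :: u(3)       ! velocity u
          real :: u_star(3)  ! predicted velocity u*
          real :: r_star(3)  ! predicted position r*
          real :: p          ! pressure P
          real :: n          ! particle number density n
       end type particle
       type(particle), allocatable :: ptcl(:)   ! one element per particle
    end module particle_aos_sketch

In such a layout all quantities of one particle are contiguous in memory, which is the property that characterises AoS storage.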

3. Porting to GPUs with OpenACC

We port the timestep and Laplacian_u kernels with OpenACC and evaluate them on four GPUs: Tesla K20c, GeForce GTX 1080, Tesla P100 (NVLink) and P100 (PCIe). Table 1 summarises the GPUs and their host systems. The PGI Fortran compiler is used: the x86-64 edition (ver. 16.10) for K20c, GTX 1080 and P100 (PCIe) and the linuxpower edition (ver. 16.10) for P100 (NVLink), with the options -acc -ta=nvidia,cuda8.0,fastmath,cc60 (cc35 instead of cc60 for K20c). OpenMPI 1.10.2 is used for MPI, with one MPI process per GPU. The problem sizes and the MPI configuration are those of Sec. 2.4; the CPU-GPU connections are shown in Fig. 4.

Table 1: Evaluated GPUs.

  GPU            Peak [TFLOPS]  Clock [MHz]  CUDA cores  Memory BW [GB/s]  CPU-GPU connection        Host CPU
  K20c           3.5            706          2,496       208               PCIe Gen2 x16 (8 GB/s)    Intel Xeon E5-2697 v2
  GTX 1080       8.8            1,733        2,560       320               PCIe Gen3 x16 (16 GB/s)   Intel Xeon E5-2697 v2
  P100 (PCIe)    9.3            1,303        3,584       732               PCIe Gen3 x16 (16 GB/s)   Intel Xeon E5-2630L v3
  P100 (NVLink)  10.6           1,406        3,584       732               NVLink (40 GB/s)          IBM POWER8 NVL

The Tesla K20c is connected to its host over PCI-Express Gen2 x16, the GTX 1080 and P100 (PCIe) over PCI-Express Gen3 x16, and the P100 (NVLink) over NVLink. Note that the GTX 1080 runs at a roughly 330 MHz higher clock than the P100.

3.1 The timestep Kernel

The timestep kernel corresponds to Proc 6): it reduces a per-particle array c to its maximum value in order to control the time step width. We compare five implementations: four OpenACC variants and one that calls CUDA Fortran. The array holds 44,064 elements (3 × 14,688) for the small problem and 337,365 elements (3 × 112,455) for the large problem, so the CPU-to-GPU transfer is 176,256 bytes (44,064 × 4 bytes) and 1,349,460 bytes (337,365 × 4 bytes), respectively. Table 2 lists the measured transfer times and effective bandwidths; on P100 (NVLink) the transfer is several times faster than over PCIe Gen3 x16. Figures 5 and 6 show the kernel performance on GTX 1080 and P100 (NVLink).

Table 2: CPU-to-GPU transfer time and effective bandwidth.

                 172.13 KB                1.29 MB
  GPU            Time [ms]  BW [GB/s]     Time [ms]  BW [GB/s]
  K20c           0.030      5.41          0.216      5.81
  GTX 1080       0.018      9.09          0.126      9.94
  P100 (PCIe)    0.023      6.96          0.167      7.50
  P100 (NVLink)  0.007      22.90         0.044      28.25
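As an illustration of how a host-to-device transfer of the sizes in Table 2 can be timed, the following sketch measures an explicit update of the c array with MPI_Wtime. The array and variable names and the use of an acc update directive are assumptions; the report does not show the authors' measurement code.

    ! Sketch: timing one host-to-GPU transfer of c(3,Nm) with MPI_Wtime.
    program transfer_timing_sketch
       use mpi
       implicit none
       integer, parameter :: Nm = 14688        ! particles of rank 0, small problem
       real :: c(3,Nm)
       real(8) :: t0, t1, gbs
       integer :: ierr
       call MPI_Init(ierr)
       c = 1.0
       !$acc data create(c)
       t0 = MPI_Wtime()
       !$acc update device(c)                  ! host-to-device copy being timed
       t1 = MPI_Wtime()
       gbs = real(3*Nm,8)*4.0d0 / (t1 - t0) / 1.0d9   ! 4-byte reals -> GB/s
       print *, 'transfer time [ms]:', (t1 - t0)*1.0d3, '  BW [GB/s]:', gbs
       !$acc end data
       call MPI_Finalize(ierr)
    end program transfer_timing_sketch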

Fig. 5: Performance of the timestep implementations on GTX 1080.
Fig. 6: Performance of the timestep implementations on P100 (NVLink).

3.1.1 acc kernels

Listing 1 shows the original code wrapped in an acc kernels region. The maximum of each row of c is taken with Fortran's intrinsic maxval() in lines 2 to 4, and the final max operations in lines 5 and 6 are executed on the CPU using the three partial results cxmax, cymax and czmax. Each maxval() call is compiled into its own CUDA reduction kernel; a parallel reduction on an NVIDIA GPU is typically implemented in two stages [8]. The code generated by the linuxpower edition (used for P100 (NVLink)) differs from that of the x86-64 edition.

Listing 1: acc kernels
  1  !$acc kernels copyin(Nm)
  2  cxmax = maxval(c(1,1:Nm))   ! x component
  3  cymax = maxval(c(2,1:Nm))   ! y component
  4  czmax = maxval(c(3,1:Nm))   ! z component
  5  cmax = max(cxmax,cymax)     ! on the CPU
  6  cmax = max(cmax,czmax)      ! on the CPU
  7  !$acc end kernels

3.1.2 maxval

In the maxval version the three maxval() calls are replaced by a single maxval() over the whole array (Listing 2). Fewer CUDA kernels are generated and the final max operations on the CPU disappear. On K20c and GTX 1080 this version outperforms the acc kernels version.

Listing 2: single maxval()
  1  !$acc kernels copyin(Nm)
  2  cmax = maxval(c(:3,:Nm))
  3  !$acc end kernels

3.1.3 reduction

In the reduction version, maxval is written as an explicit loop nest with max and an OpenACC reduction clause, which the PGI compiler maps onto a GPU reduction (Listing 3). The outer loop carries acc loop reduction(max:cmax) and the inner loop acc loop vector(32); two CUDA kernels are generated. On GTX 1080 this version again outperforms the acc kernels version.

Listing 3: max() with a reduction clause
  1  !$acc parallel copyin(Nm)
  2  !$acc loop reduction(max:cmax)
  3  do row=1,3
  4  !$acc loop vector(32)
  5    do col=1,Nm
  6      cmax = max(cmax, c(col,row))
  7    end do
  8  end do
  9  !$acc end parallel

3.1.4 unroll

In the unroll version the reduction loop nest is collapsed with collapse(2); collapse(n) merges n tightly nested loops into a single loop before parallelisation, reducing the number of generated CUDA kernels (a possible form of this variant is sketched at the end of Sec. 3.1). This version is 1.6 to 3.5 times faster than the acc kernels version and 1.4 to 1.9 times faster than the reduction version; on P100 (NVLink) it is the fastest of the OpenACC variants.

3.1.5 CUDA Fortran (dev_maxval)

In this version the reduction is performed by a CUDA Fortran function dev_maxval, which calls the CUDA Fortran maxval on the device array (Listing 4). The OpenACC code passes the device copy of c to dev_maxval through acc host_data use_device. With the linuxpower edition this version is about 35% faster than the unroll version.

Listing 4: calling CUDA Fortran from OpenACC
  1  !$acc data copyin(c(:3,:Nm))
  2  !$acc host_data use_device(c)
  3  cmax = dev_maxval(c, 3, Nm)
  4  !$acc end host_data
  5  !$acc end data

  1  attributes(device) real function dev_maxval(gdata, x, y)
  2    use cudafor, gpu_maxval => maxval
  3    integer,value :: x, y
  4    real,device :: gdata(x,y)
  5    dev_maxval = gpu_maxval(gdata)
  6  end function dev_maxval

3.1.6 Multicore

For comparison, the OpenACC variants are also compiled for the multicore CPU with PGI's -ta=multicore option (the CUDA Fortran version cannot be built this way) and run on the Xeon E5-2697 v2 under MPI with mpiexec -bind-to none -n 3. Table 3 shows the processing times: the acc kernels version is by far the slowest, slower even than the K20c results, while the other variants run in about 1 ms.

Table 3: Processing time of the multicore (-ta=multicore) versions [ms].

  Input      acc kernels  maxval  reduction  unroll
  172.13 KB  15.180       1.029   1.101      1.062
  1.29 MB    14.603       1.511   1.184      0.827
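As referenced in Sec. 3.1.4, a possible form of the unroll variant is shown below: collapse(2) applied to the reduction loop nest of Listing 3. This is a sketch under that assumption, not the authors' exact code.

    ! Possible form of the "unroll" variant (Sec. 3.1.4): the two loops of
    ! Listing 3 are collapsed into one before the max reduction.
    !$acc parallel copyin(Nm)
    !$acc loop collapse(2) reduction(max:cmax)
    do row = 1, 3
       do col = 1, Nm
          cmax = max(cmax, c(col,row))
       end do
    end do
    !$acc end parallel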

Fig. 7: Naive mapping.
Fig. 8: Atomic mapping.
Fig. 9: 3-D thread mapping.

3.2 The Laplacian_u Kernel

Listing 5 shows the structure of the Laplacian_u kernel as pseudocode. It is a five-fold nested loop: the outermost loop (do_loop1) runs over all particles, do_loop2, 3 and 4 traverse the 3 × 3 × 3 (= 27) adjacent buckets, and do_loop5 (line 11) runs over the particles in each bucket, whose number is not known at compile time. The particle data are read with random accesses (lines 13 and 15), the inter-particle distance is computed in line 17, and the contributions are accumulated in line 19 and written back in line 20. On the CPU this kernel takes 257.66 ms for the small problem and 1,924.98 ms for the large problem. We compare three OpenACC implementations, Naive, Atomic and 3-D thread, on the four GPUs; Fig. 10 shows the results.

Listing 5: structure of Laplacian_u (pseudocode)
  1   ! for all the particles
  2   do_loop1: target_ptcl = 1,all_ptcl
  3     ib = bucket_num[im]
  4     ! traverse adjacent buckets (3 dim: 3x3x3=27)
  5     do_loop2: x=x1,x2
  6      do_loop3: y=y1,y2
  7       do_loop4: z=z1,z2
  8         ibb = get_adj_bucket_num(x,y,z)
  9         num_of_ptcl = get_num_of_ptcl_in_bucket(ibb)
  10        ! accumulate all the neighbour particles
  11        do_loop5: inp = 1,num_of_ptcl     ! indefinite loop
  12          if (ptcl_is_in_halo)
  13            lcr = ptcl_halo[inp]          ! random access
  14          else
  15            lcr = ptcl[inp]               ! random access
  16          end if
  17          dist = sqrt(dot_product(im, lcr))  ! get distance
  18          weight = get_weight(dist)
  19          accum = accum + phys(weight)    ! aggregation
  20    m_phys[im] = m_phys[im] + accum       ! in place add

3.2.1 Naive

In the Naive version one GPU thread processes one particle (Fig. 7). do_loop1 is annotated with acc kernels and acc loop gang vector(128); do_loop2 is annotated with acc loop collapse(3) seq so that the three bucket loops (do_loop2, 3, 4) are executed sequentially by the same thread, and do_loop5 with acc loop seq (Listing 6). With a vector length of 128, the 14,688 particles of rank 0 map to 115 thread blocks (= 14,688 / 128) and the theoretical occupancy is 100%. Vector lengths of 64, 256 and 512 were also tried. The best result for this version is a speed-up of about 451 over the sequential CPU execution, obtained on P100 (NVLink).

Listing 6: Naive
  1   !$acc kernels
  2   !$acc loop gang vector(128)
  3   do_loop1: target_ptcl = 1,all_ptcl
  4   ...
  5   !$acc loop collapse(3) seq
  6   do_loop2: x=x1,x2
  7    do_loop3: y=y1,y2
  8     do_loop4: z=z1,z2
  9   ...
  10  !$acc loop seq
  11  do_loop5: inp = 1,num_of_ptcl
  12  ...

3.2.2 Atomic

In the Atomic version (Fig. 8) one GPU thread processes one (particle, adjacent bucket) pair, i.e. 27 (3 × 3 × 3) threads work on each particle, 27 times as many threads as in Naive. For the 14,688 particles of rank 0 and a vector length of 128 this gives 396,576 threads (= 14,688 × 27) in 3,099 thread blocks (= 14,688 × 27 / 128); as in Naive, the theoretical occupancy is 100%. do_loop1 to do_loop4 are collapsed by annotating do_loop1 with acc parallel and acc loop collapse(4) independent gang vector(128), the bucket lookup is moved inside the collapsed loops, and the in-place accumulation is protected with acc atomic update (lines 15 to 17 of Listing 7). The best result for this version is a speed-up of about 220 over the sequential CPU execution, obtained on P100 (PCIe).

Listing 7: Atomic
  1   !$acc parallel
  2   !$acc loop collapse(4) independent gang vector(128)
  3   do_loop1: target_ptcl = 1,all_ptcl
  4   ...
  5   do_loop2: x=x1,x2
  6    do_loop3: y=y1,y2
  7     do_loop4: z=z1,z2
  8   ...
  9   ! moved here from do_loop1
  10  ib = bucket_num[im]
  11  !$acc loop seq
  12  do_loop5: inp = 1,num_of_ptcl
  13  ...
  14  ! moved here from do_loop1
  15  !$acc atomic update
  16  m_phys[im] = m_phys[im] + accum   ! in place add
  17  !$acc end atomic

3.2.3 3-D thread

The 3-D thread version (Fig. 9) maps the 27 adjacent buckets of a particle onto a 3 × 3 × 3 block of CUDA threads, i.e. onto threadIdx.x, threadIdx.y and threadIdx.z, by annotating do_loop2, 3 and 4 with acc loop vector(3) (Listing 8). As in Atomic, 27 threads work on one particle and the accumulation uses acc atomic update, but each thread block now contains only 27 threads: the 14,688 particles give 396,576 threads (= 14,688 × 27) in 14,688 thread blocks (= 14,688 × 27 / 27). The occupancy is therefore lower than in the Naive and Atomic versions.

Listing 8: 3-D thread
  1   !$acc kernels
  2   !$acc loop independent
  3   do_loop1: target_ptcl = 1,all_ptcl
  4   ...
  5   !$acc loop vector(3)
  6   do_loop2: x=x1,x2
  7   !$acc loop vector(3)
  8    do_loop3: y=y1,y2
  9   !$acc loop vector(3)
  10    do_loop4: z=z1,z2
  11  ...
  12  ! moved here from do_loop1
  13  ib = bucket_num[im]
  14  !$acc loop seq
  15  do_loop5: inp = 1,num_of_ptcl
  16  ...
  17  ! moved here from do_loop1
  18  !$acc atomic update
  19  m_phys[im] = m_phys[im] + accum   ! in place add
  20  !$acc end atomic

3.2.4 Multicore

Compiling the Naive version with PGI's -ta=multicore option does not parallelise the loop nest; the compiler reports "Loop not vectorized/parallelized: too deeply nested". As the CPU reference we therefore use the MPI-parallel execution on the Xeon CPU, launched with mpiexec -bind-to none -n 2. According to mpstat, the utilisation of the 24 logical cores is roughly 20% to 60%. With two processes the kernel takes 34.03 ms for the small problem and 318.16 ms for the large problem, a speed-up of 7.5 and 6.1 over the sequential execution. Table 4 shows how the small-problem time changes with the number of processes: adding processes beyond two only slows the kernel down.

Table 4: Multicore (MPI) execution of Laplacian_u on the Xeon CPU (small problem).

  Processes             2      4      8       16
  Processing time [ms]  34.03  72.48  141.75  315.43
  Speed-up              7.57   3.55   1.82    0.82

3.3 Comparison of the Results

For the timestep kernel (Sec. 3.1), Fig. 5 shows the results on GTX 1080: it is 18% to 27% faster than P100 (PCIe), which is consistent with its roughly 330 MHz higher clock. Fig. 6 shows the results on P100 (NVLink): it is 20% to 44% faster than P100 (PCIe) and about 14% faster than GTX 1080. For Laplacian_u (Sec. 3.2), Fig. 10 compares the three implementations on each GPU. The Naive version is the fastest, in particular on P100 (PCIe) and P100 (NVLink); unlike in Sec. 3.1, the straightforward mapping wins because it needs no atomic update, whereas Atomic and 3-D thread serialise on the atomic accumulation and 3-D thread additionally suffers from its low occupancy. For the versions that rely on atomic update, P100 outperforms GTX 1080 despite its roughly 330 MHz lower clock.

Fig. 10: Performance of the Naive, Atomic and 3-D thread implementations of Laplacian_u on each GPU.

4. Conclusion

We ported the timestep and Laplacian_u kernels of the MPS simulation code NSRU-MPS to GPUs with OpenACC and compared five implementations of timestep and three implementations of Laplacian_u on four GPUs from two generations. The best timestep implementation is up to 3.5 times faster than the straightforward acc kernels version. For Laplacian_u, the best GPU implementations run 220 times (small problem) and 451 times (large problem) faster than the sequential Fortran execution on the Xeon CPU, and 29.0 and 74.5 times faster than the two-process MPI execution. As future work we plan to port the remaining kernels of the MPS code to the GPU.

References
[1] Koshizuka, S. and Oka, Y.: Moving particle semi-implicit method for fragmentation of incompressible fluid, Nuclear Science and Engineering, Vol. 123, pp. 421-434 (1996).
[2] Seiya, W., Takayuki, A., Satori, T. and Takashi, S.: Neighbor-particle Searching Method for Particle Simulation Based on Contact Interaction Model for GPU Computing, IPSJ Transactions on Advanced Computing Systems, Vol. 8, No. 4, pp. 50-60 (2015).
[3] Murotani, K., Masaie, I., Matsunaga, T., Koshizuka, S., Shioya, R., Ogino, M. and Fujisawa, T.: Performance improvements of differential operators code for MPS method on GPU, Computational Particle Mechanics, Vol. 2, No. 3, pp. 261-272 (online), DOI: 10.1007/s40571-015-0059-2 (2015).
[4] Sota, Y., Watanabe, A. and Kojima, T.: Acceleration of the moving particle semi-implicit method through multi-GPU parallel computing with dynamic domain decomposition, Journal of Japan Society of Civil Engineers, Ser. A2 (Applied Mechanics (AM)), Vol. 69, No. 2 (2013).
[5] OpenACC: OpenACC Home, http://www.openacc.org/.
[6] Murotani, K., Koshizuka, S., Tamai, T., Shibata, K., Mitsume, N., Yoshimura, S., Tanaka, S., Hasegawa, K., Nagai, E. and Fujisawa, T.: Development of Hierarchical Domain Decomposition Explicit MPS Method and Application to Large-scale Tsunami Analysis with Floating Objects, Journal of Advanced Simulation in Science and Engineering, Vol. 1, No. 1, pp. 16-35 (online), DOI: 10.15748/jasse.1.16 (2014).
[7] Sun, H., Tian, Y., Zhang, Y., Wu, J., Wang, S., Yang, Q. and Zhou, Q.: A Special Sorting Method for Neighbor Search Procedure in Smoothed Particle Hydrodynamics on GPUs, Parallel Processing Workshops (ICPPW), 2015 44th International Conference on, pp. 81-85 (online), DOI: 10.1109/ICPPW.2015.46 (2015).
[8] Woolley, C.: Professional CUDA C Programming (2014).