
GPGPU ... LES ... ASUCA ... LES ... NVIDIA ... CUDA ... LES ...

1. ...
GPU (Graphics Processing Unit) ... GPGPU (General-Purpose ...) ... SIMT (Single Instruction Multiple Threads) ... 1),2) ... LES (Large Eddy Simulation) ... 3) ... ASUCA 4) ... LES ... LES ...

2. LES ...
LES ... LES 5),6) ...

3. LES ...
LES ... Raasch and Schroter (2001) ... Chow et al. (2006) ... LES ... T2K-Tsukuba ... CFD ... LES 7) ... LES ... LES ... SMAC ... Adams-Bashforth ... Crank-Nicolson ... Bi-CGStab ... LES ...

© 2011 Information Processing Society of Japan

Fig. 1  LES ... (flowchart; processing step: subroutine)

    DO time integration START
      Compute contravariant velocity: contravariant_velocity
      Compute advection term: advection_adams_bashforth_2nd
      DO implicit loop (implicit method)
        Compute velocity and temperature gradients: gradient_cell_center_surface
        Compute velocity-gradient scale: gradient_scale
        Compute pressure gradient: gradient_press
        Compute pressure gradient (cell interfaces): gradient_cell_surface
        Compute Smagorinsky constant Cs: sgs_smagrinsky
        Compute surface friction stress: tau_u
        Compute diffusion term: diffusion_crank_nicolson
        Correct potential temperature (E)
        Correct physical velocity and contravariant velocity
        Boundary conditions for velocity and contravariant velocity: smac
        Compute pressure correction (solve the Poisson equation): cgstab
      END DO implicit loop (implicit method)
      Compute mean pressure
      Correct pressure so that its mean is zero
    END DO time integration

Fig. 2  LES ... (module... MOD ...)

    Each sample counts as 0.01 seconds.
      %    cumulative      self               self     total
     time    seconds      seconds     calls  Ks/call  Ks/call  name
     25.80   35022.13    35022.13     38233     0.00     0.00  module_bicgstab_mod_cgstab
     24.56   68357.84    33335.71    191165     0.00     0.00  module_dynamics_mod_gradient_cell_center_surface
     16.44   90682.76    22324.92         1    22.32   135.76  module_run_mod_run
     11.55  106368.40    15685.64     76466     0.00     0.00  module_dynamics_mod_gradient_cell_surface
      6.62  115356.29     8987.89     38233     0.00     0.00  module_sgs_mod_sgs_stress_vec
      2.98  119395.75     4039.46     38233     0.00     0.00  module_smac_mod_smac
      2.41  122667.01     3271.26     20000     0.00     0.00  module_addition_inst_value_mod_addition_inst_value
      2.23  125691.93     3024.93     38233     0.00     0.00  module_sgs_mod_sgs_stress_sca
      2.00  128406.13     2714.19     38233     0.00     0.00  module_dynamics_mod_tke_flux
      1.34  130228.95     1822.82    191165     0.00     0.00  module_dynamics_mod_diffusion_crank_nicolson
      0.86  131390.48     1161.53     38233     0.00     0.00  module_dynamics_mod_gradient_pres
      0.84  132535.98     1145.50    100000     0.00     0.00  module_dynamics_mod_advection_adams_bashforth_2nd
      0.81  133630.44     1094.46     20000     0.00     0.00  module_dynamics_mod_contravariant_velocity
      0.35  134103.40      472.96     38233     0.00     0.00  module_dynamics_mod_gradient_scale

4. ...
GPGPU ... An NVIDIA GPU programmed with CUDA is organized into SMs (Streaming Multiprocessors) 8); each SM contains 8 SPs (Streaming Processors) ... In the Fermi architecture, each SM contains 32 SPs, and L1 and L2 caches are provided 9),10).

5. ...
LES ... problem size N = imax × jmax × kmax, with imax = jmax = kmax = 102 ... Intel Xeon E5630 (Westmere-EP, 2.53GHz, 4-core) × 2, 24 Gbyte ... LES ... max time step 20000 ... Fig. 2 ... cgstab (Bi-CGStab) ... addition_inst_value ... 70% ... cgstab ... gradient_cell_center_surface ... cgstab ... gradient_cell_center_surface ... gradient_cell_center_surface ...
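The gprof listing in Fig. 2 can be read as a budget: the share of run time covered by a set of routines limits what porting those routines can gain. The following is a minimal sketch of that arithmetic in C, not a procedure taken from the paper. The function names are hypothetical, and the choice of cgstab, gradient_cell_center_surface, and gradient_cell_surface as the set is an assumption made only for illustration. The bound uses Amdahl's law.

```c
#include <assert.h>
#include <stddef.h>

/* Self seconds from the gprof listing in Fig. 2 for cgstab,
 * gradient_cell_center_surface and gradient_cell_surface.
 * Picking exactly these three routines is an illustrative assumption. */
static const double HOTSPOT_SELF_SECONDS[3] = {35022.13, 33335.71, 15685.64};

/* Total run time: 135.76 Ks/call for module_run_mod_run in Fig. 2. */
static const double TOTAL_RUN_SECONDS = 135.76e3;

/* Fraction of the total run time spent in a set of routines. */
double time_share(const double *self_seconds, size_t n, double total_seconds)
{
    double sum = 0.0;
    assert(self_seconds != NULL && total_seconds > 0.0);
    for (size_t i = 0; i < n; ++i)
        sum += self_seconds[i];
    return sum / total_seconds;
}

/* Share of the run time covered by the three routines above. */
double hotspot_share(void)
{
    return time_share(HOTSPOT_SELF_SECONDS, 3, TOTAL_RUN_SECONDS);
}

/* Amdahl's law: overall speedup when a fraction p of the run time
 * is made s times faster and the rest is unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With the Fig. 2 numbers, hotspot_share() is about 0.62, so speeding up only these three routines would bound the overall gain at about 2.6 times however fast they become. Under this reading, the remaining routines also need attention for a larger whole-program speedup.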

... gradient_cell_surface ... gradient_cell_surface ... bicgstab ... gradient_cell_center_surface ... gradient_cell_surface ... Fig. 3(a), Fig. 4(a) ... gradient_cell_center_surface ... gradient_cell_surface ... (sec) ... N ... Fig. 3(a), Fig. 4(a) ... Fig. 3(b), Fig. 4(b) ...

Fig. 3  gradient_cell_center_surface ... (a) Execution time (sec). (b) Breakdown (0% to 100%): CPU -> GPU data transfer time, GPU -> CPU data transfer time, computation time.

Fig. 4  gradient_cell_surface ... (a) Execution time (sec). (b) Breakdown (0% to 100%): CPU -> GPU data transfer time, GPU -> CPU data transfer time, computation time.

6. LES ...

6.1 ...
NVIDIA Tesla M2050 (Fermi) ... CUDA ... LES ... run ... run ... Fig. 5 ...

Fig. 5  ...

    ! Fortran side
    call gpu_initialize(size)
    call gpu_memdata(f, ..., size)

    subroutine run()
      call gradient_cell_surface(f, ...)
    end subroutine

    call gpu_finalize()

    // gpu_run.cu
    double *d_f, *d_xix, *d_xiy, *d_xiz /* , ... */;

    // Called from Fortran as gpu_initialize(size): allocate the device arrays.
    extern "C" void gpu_initialize_(int *size) {
      cudaMalloc((void**)&d_f,   sizeof(double)*(*size));
      cudaMalloc((void**)&d_xix, sizeof(double)*(*size));
      /* ... */
    }

    // Called from Fortran as gpu_memdata(f, ..., size): copy host arrays to the device.
    extern "C" void gpu_memdata_(double *f, /* ..., */ int *size) {
      cudaMemcpy(d_f, f, sizeof(double)*(*size), cudaMemcpyHostToDevice);
      /* ... */
    }

    // Same name as the Fortran routine it replaces: launch the kernel.
    extern "C" void gradient_cell_surface_(double *f /* , ... */) {
      gpu_gradient_cell_surface<<<dg, db>>>(d_f /* , ... */);
    }

    // Called from Fortran as gpu_finalize(): release the device arrays.
    extern "C" void gpu_finalize_() {
      cudaFree(d_f);
      cudaFree(d_xix);
      /* ... */
      cudaFree(d_zez);
    }
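The wrappers in Fig. 5 rely on the GNU Fortran calling convention: an external procedure called as gpu_initialize(size) resolves to the C symbol gpu_initialize_ (lowercase, one trailing underscore), and every argument arrives as a pointer. The sketch below shows that convention with host memory only, so it runs without a GPU: malloc, memcpy, and free stand in for cudaMalloc, cudaMemcpy, and cudaFree. The elided arguments of Fig. 5 are left out, and gpu_peek_ is a hypothetical helper that is not in the paper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-memory stand-in for the device buffer d_f of Fig. 5. */
static double *d_f = NULL;
static int d_size = 0;

/* Fortran: call gpu_initialize(size) */
void gpu_initialize_(int *size)
{
    assert(size != NULL && *size > 0);
    d_f = malloc(sizeof(double) * (size_t)(*size)); /* cudaMalloc in Fig. 5 */
    assert(d_f != NULL);
    d_size = *size;
}

/* Fortran: call gpu_memdata(f, size) */
void gpu_memdata_(double *f, int *size)
{
    assert(f != NULL && size != NULL && d_f != NULL && *size <= d_size);
    /* Stands in for the cudaMemcpy that copies f into d_f. */
    memcpy(d_f, f, sizeof(double) * (size_t)(*size));
}

/* Hypothetical helper (not in the paper): read back element `index`,
 * counted from 1 as a Fortran caller would. */
double gpu_peek_(int *index)
{
    assert(index != NULL && d_f != NULL && *index >= 1 && *index <= d_size);
    return d_f[*index - 1];
}

/* Fortran: call gpu_finalize() */
void gpu_finalize_(void)
{
    free(d_f); /* cudaFree in Fig. 5 */
    d_f = NULL;
    d_size = 0;
}
```

Keeping the buffers in file-scope variables, as Fig. 5 does, lets the Fortran side stay unaware of device pointers: it passes only its own arrays and sizes.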

... run ... global ... gpu_initialize ... global ... gpu_memdata ... run ... run ... global ... gpu_finalize ...

Fig. 6  CUDA ... (labels: imax, jmax, blockDim.x, blockDim.y, block(0,0), block(0,1), block(1,0), block(1,1), ijk index)

Fig. 7  Fortran ...

    do k = 2, kmax-1
      do j = 2, jmax-1
        do i = 2, imax-1
          fx1(i,j,k) = ( xix(i+1,j,k)*f(i+1,j,k) - xix(i,j,k)*f(i,j,k) &
                       + ( etx(i+1,j+1,k)*f(i+1,j+1,k) &
                         - etx(i+1,j-1,k)*f(i+1,j-1,k) &
                         + etx(i,j+1,k)*f(i,j+1,k) &
                         - etx(i,j-1,k)*f(i,j-1,k) )*0.25d0 &
                       + ( zex(i+1,j,k+1)*f(i+1,j,k+1) &
                         - zex(i+1,j,k-1)*f(i+1,j,k-1) &
                         + zex(i,j,k+1)*f(i,j,k+1) &
                         - zex(i,j,k-1)*f(i,j,k-1) )*0.25d0 &
                       )*hjac1(i,j,k)
        enddo
      enddo
    enddo

6.2 ...
LES ... i, j ... imax, jmax, kmax ... N = imax × jmax × kmax ... In the CUDA version, i and j are obtained from the block ID and thread ID (Fig. 6), and each thread loops over k. The Fortran original is shown in Fig. 7 and the CUDA version in Fig. 8.

7. ...

Table 1  ...

    CPU       Intel Xeon E5630 2.53GHz 4cores × 2
    RAM       DDR3 SDRAM 1066MHz 4GB × 6 (CPU)
              GDDR5 SDRAM 1.55GHz 3GB (ECC on) (GPU)
    GPU       NVIDIA Tesla M2050 1.15GHz
    OS        CentOS Linux release 6.0 (Final)
    Compiler  GNU Fortran GCC 4.4.4
              nvcc 4.0 (-arch sm_20) for GPU code

Table 1 ... cgstab ... addition_inst_value ... run ... Fig. 9 ... Tesla M2050 ... 16KB/L1 ...
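The step from the Fortran loop of Fig. 7 to the kernel of Fig. 8 rests on one index identity: with i varying fastest, point (i, j, k) counted from 0 sits at offset i + j*imax + k*imax*jmax, so the i, j, and k neighbours are at strides 1, imax, and imax*jmax. The sketch below is a CPU reference in C written under the assumption that the Fortran arrays are dimensioned (imax, jmax, kmax) with lower bound 1. The names flat_index, blocks_needed, and gradient_x_reference are hypothetical and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of the 0-based point (i, j, k); i varies fastest, as in Fortran.
 * Fortran f(2,2,2) corresponds to flat_index(1, 1, 1, imax, jmax). */
static size_t flat_index(int i, int j, int k, int imax, int jmax)
{
    return (size_t)i + (size_t)j * (size_t)imax
         + (size_t)k * (size_t)imax * (size_t)jmax;
}

/* Number of thread blocks needed to cover n points with `block` threads each. */
int blocks_needed(int n, int block)
{
    assert(n > 0 && block > 0);
    return (n + block - 1) / block;
}

/* CPU reference for the stencil of Fig. 7, written on flat arrays the way
 * the kernel of Fig. 8 addresses them. Only interior points are written. */
void gradient_x_reference(const double *f, const double *xix,
                          const double *etx, const double *zex,
                          const double *hjac1, double *fx1,
                          int imax, int jmax, int kmax)
{
    const size_t sj = (size_t)imax;                /* stride of j */
    const size_t sk = (size_t)imax * (size_t)jmax; /* stride of k */

    for (int k = 1; k < kmax - 1; ++k)
        for (int j = 1; j < jmax - 1; ++j)
            for (int i = 1; i < imax - 1; ++i) {
                const size_t c = flat_index(i, j, k, imax, jmax);
                fx1[c] = (xix[c + 1] * f[c + 1] - xix[c] * f[c]
                        + (etx[c + sj + 1] * f[c + sj + 1]
                         - etx[c - sj + 1] * f[c - sj + 1]
                         + etx[c + sj] * f[c + sj]
                         - etx[c - sj] * f[c - sj]) * 0.25
                        + (zex[c + sk + 1] * f[c + sk + 1]
                         - zex[c - sk + 1] * f[c - sk + 1]
                         + zex[c + sk] * f[c + sk]
                         - zex[c - sk] * f[c - sk]) * 0.25
                        ) * hjac1[c];
            }
}
```

A reference like this can be compared element by element with the array copied back from the GPU. For the launch, blocks_needed(imax - 2, bx) and blocks_needed(jmax - 2, by) would give grid dimensions that cover the interior points for a block of bx by by threads; when the division is not exact, the kernel also needs a bounds check on i and j.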

Fig. 8  CUDA ...

    int ijk;
    // i and j come from the 2-D block and thread IDs; +1 skips the boundary point.
    int i = blockDim.x*blockIdx.x + threadIdx.x + 1;
    int j = blockDim.y*blockIdx.y + threadIdx.y + 1;

    // Each thread walks the k direction for its own (i, j).
    for (int k = 1; k < kmax-1; k++) {
      ijk = i + j*imax + k*imax*jmax;

      d_fx1[ijk] = ( d_xix[ijk+1]*d_f[ijk+1] - d_xix[ijk]*d_f[ijk]
                   + ( d_etx[ijk+imax+1]*d_f[ijk+imax+1]
                     - d_etx[ijk-imax+1]*d_f[ijk-imax+1]
                     + d_etx[ijk+imax]*d_f[ijk+imax]
                     - d_etx[ijk-imax]*d_f[ijk-imax] )*0.25
                   + ( d_zex[ijk+imax*jmax+1]*d_f[ijk+imax*jmax+1]
                     - d_zex[ijk-imax*jmax+1]*d_f[ijk-imax*jmax+1]
                     + d_zex[ijk+imax*jmax]*d_f[ijk+imax*jmax]
                     - d_zex[ijk-imax*jmax]*d_f[ijk-imax*jmax] )*0.25
                   )*d_hjac1[ijk];
    }

Fig. 9  Execution time (sec) ... I, J ...

... 48KB ... 48KB/L1 ... 16KB ... N ... imax, jmax, kmax ... kmax = 102 ... imax, jmax ... global memory 3GB ... imax, jmax ... 132 ... Fig. 9 ... imax, jmax ... 102 ... 132 ... 7.9 ... 8.4 ... LES ...

8. ...
LES ... 8.4 ... 70% ... global memory ... LES ... MPI ... OpenMP ...

References

1) ... CUDA ... Vol 20, No.2, pp.37-43 (Jun 2010).
2) ... TSUBAME ... (May 2009).
3) ... 2010 ... (Dec 2010).
4) ... ASUCA ... TSUBAME2.0 ... 24 ... (Dec 2010).
5) ... LES ... 2011 ... (May 2011).
6) Ryosaku Ikeda, Hiroyuki Kusaka, Satoru Iizuka, Taisuke Boku: Development of Local Meteorological Model based on CFD, 5th International Symposium on Wind Effects on Buildings and Urban Environment (ISWE5) (Mar 2011).
7) Iizuka, S., Kondo, H.: Large-eddy simulations of turbulent flow over complex terrain using modified static eddy viscosity models, Atmospheric Environment, 40, pp.925-935 (Feb 2006).
8) NVIDIA Corporation: CUDA ZONE, http://www.nvidia.com/object/cuda_home.html
9) Peter Glaskowsky: NVIDIA's Fermi: The First Complete Computing Architecture.
10) Dave Patterson: The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges.