DO time integration START: compute contravariant velocity (contravariant_velocity), compute advection term (advection_adams_bashforth_2nd), DO implicit loop: compute velocity and temperature gradients (gradient_cell_center_surface)


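Keywords recovered from the abstract: GPGPU, LES, ASUCA, NVIDIA CUDA.

1. [Section text lost in extraction.] Terms defined: GPU (Graphics Processing Unit); GPGPU (General-Purpose computing on GPUs); SIMT (Single Instruction, Multiple Threads) 1),2); LES (Large Eddy Simulation) 3); ASUCA 4).

2. [Section text lost in extraction.] The section concerns the LES model 5),6).

3. [Section text lost in extraction.] Works and systems named: Raasch and Schroter (2001); Chow et al. (2006); T2K-Tsukuba; a CFD-based LES 7). Numerical methods named: SMAC, Adams-Bashforth, Crank-Nicolson, Bi-CGStab.

© 2011 Information Processing Society of Japan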

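Figure 1: Flow of the LES code. Steps are listed in extraction order; the two-column layout of the original chart is lost, so the position of the last four steps relative to the implicit loop is uncertain.

    DO time integration (START)
      Compute contravariant velocity: contravariant_velocity
      Compute advection term: advection_adams_bashforth_2nd
      DO implicit loop
        Compute velocity and temperature gradients: gradient_cell_center_surface
        Compute velocity-gradient scale: gradient_scale
        Compute pressure gradient: gradient_press
        Compute pressure gradient (cell interface): gradient_cell_surface
        Compute Smagorinsky constant Cs: sgs_smagrinsky
        Correct potential temperature (E), physical velocity and contravariant velocity;
          apply boundary conditions for velocity and contravariant velocity;
          compute the corrected pressure (solve the Poisson equation): smac, cgstab
      END DO implicit loop
      Compute the mean pressure
      Compute surface friction stress: tau_u
      Compute diffusion term: diffusion_crank_nicolson
      Correct the pressure so that its mean is 0
    END DO time integration

Figure 2: gprof flat profile of the LES code. Names of the form module..._MOD_... are subroutines. The numeric columns were lost in extraction; routines appear in the listed order.

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  Ks/call  Ks/call  name
        module_bicgstab_mod_cgstab
        module_dynamics_mod_gradient_cell_center_surface
        module_run_mod_run
        module_dynamics_mod_gradient_cell_surface
        module_sgs_mod_sgs_stress_vec
        module_smac_mod_smac
        module_addition_inst_value_mod_addition_inst_value
        module_sgs_mod_sgs_stress_sca
        module_dynamics_mod_tke_flux
        module_dynamics_mod_diffusion_crank_nicolson
        module_dynamics_mod_gradient_pres
        module_dynamics_mod_advection_adams_bashforth_2nd
        module_dynamics_mod_contravariant_velocity
        module_dynamics_mod_gradient_scale

An NVIDIA CUDA GPU is built from SMs (Streaming Multiprocessors) 8). Each SM contains 8 SPs (Streaming Processors); in the Fermi generation each SM contains 32 SPs, and L1 and L2 caches are added 9),10).

The problem size of the LES is N = imax × jmax × kmax. The CPU measurements use Intel Xeon E5630 (Westmere-EP) 2.53 GHz 4-core × 2 with 24 GByte of memory. [Most of the profile discussion is lost in extraction.] Surviving terms: max time step; cgstab (Bi-CGStab); addition_inst_value; 70%; cgstab and gradient_cell_center_surface.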

[Discussion of gradient_cell_center_surface and gradient_cell_surface with reference to Figures 3(a), 4(a), 3(b) and 4(b); text lost in extraction.]

Figure 3: gradient_cell_center_surface, panels (a) and (b). Axes: execution time (sec) against N; percentage scale from 0% to 100%. Legend: data transfer time (two entries, one per direction; the endpoint names are lost) and computation time.

Figure 4: gradient_cell_surface, panels (a) and (b). Axes and legend as in Figure 3.

6. [Section text lost in extraction.] Surviving terms: LES; NVIDIA Tesla M2050 (Fermi); CUDA; run.

Figure 5: Calling the CUDA code from the Fortran routine run

Fortran side:

    call gpu_initialize(size)
    call gpu_memdata(f, ..., size)

    subroutine run()
      ...
      call gradient_cell_surface(f, ...)
      ...
    end subroutine

    call gpu_finalize()

CUDA side:

    // gpu_run.cu
    // Device pointers kept at file scope so that every wrapper can reach them.
    double *d_f, *d_xix, *d_xiy, *d_xiz, ...;

    // Allocate the device arrays once, before the time loop.
    extern "C" void gpu_initialize_(int *size)
    {
        cudaMalloc((void**)&d_f,   sizeof(double) * (*size));
        cudaMalloc((void**)&d_xix, sizeof(double) * (*size));
        ...
    }

    // Upload the host arrays to the device (host to device).
    extern "C" void gpu_memdata_(double *f, ..., int *size)
    {
        cudaMemcpy(d_f, f, sizeof(double) * (*size), cudaMemcpyHostToDevice);
        ...
    }

    // Fortran-callable wrapper that launches the kernel on the device arrays.
    extern "C" void gradient_cell_surface_(double *f, ...)
    {
        gpu_gradient_cell_surface<<<dg, db>>>(d_f, ...);
    }

    // Release the device arrays after the time loop.
    extern "C" void gpu_finalize_()
    {
        cudaFree(d_f);
        cudaFree(d_xix);
        ...
        cudaFree(d_zez);
    }
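Each cudaMalloc call in gpu_initialize_ reserves sizeof(double) * size bytes, so the device footprint grows linearly with the number of fields kept resident. A rough sizing check could be sketched in C as follows. The function names device_bytes and fits_on_device, and the field count passed to them, are hypothetical and not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to keep n_fields double-precision arrays of
 * imax * jmax * kmax cells resident on the device.
 * n_fields is an assumption; the paper's full field list is elided. */
uint64_t device_bytes(uint64_t imax, uint64_t jmax, uint64_t kmax,
                      uint64_t n_fields)
{
    assert(imax > 0 && jmax > 0 && kmax > 0);
    return n_fields * imax * jmax * kmax * (uint64_t)sizeof(double);
}

/* Returns 1 when the fields fit in capacity_bytes, 0 otherwise. */
int fits_on_device(uint64_t imax, uint64_t jmax, uint64_t kmax,
                   uint64_t n_fields, uint64_t capacity_bytes)
{
    return device_bytes(imax, jmax, kmax, n_fields) <= capacity_bytes;
}
```

With an assumed 20 fields and kmax = 102, a 64 × 64 horizontal grid needs about 67 MB, while a 512 × 512 grid needs about 4.3 GB and would not fit in 3 GB of global memory.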

As Figure 5 shows, gpu_initialize allocates the arrays in GPU global memory and gpu_memdata fills them before run is called; gpu_finalize frees them after run returns.

Figure 6: Assignment of CUDA thread blocks to the i-j plane. The plane of imax × jmax points is tiled by blocks block(0,0), block(0,1), block(1,0), block(1,1), each of blockDim.x × blockDim.y threads, and every thread computes its own ijk index.

Figure 7: Fortran code

    do k = 2, kmax-1
      do j = 2, jmax-1
        do i = 2, imax-1
          fx1(i,j,k) = ( xix(i+1,j,k)*f(i+1,j,k) - xix(i,j,k)*f(i,j,k)   &
                       + ( etx(i+1,j+1,k)*f(i+1,j+1,k)                    &
                         - etx(i+1,j-1,k)*f(i+1,j-1,k)                    &
                         + etx(i,j+1,k)*f(i,j+1,k)                        &
                         - etx(i,j-1,k)*f(i,j-1,k) )*0.25d0               &
                       + ( zex(i+1,j,k+1)*f(i+1,j,k+1)                    &
                         - zex(i+1,j,k-1)*f(i+1,j,k-1)                    &
                         + zex(i,j,k+1)*f(i,j,k+1)                        &
                         - zex(i,j,k-1)*f(i,j,k-1) )*0.25d0               &
                       )*hjac1(i,j,k)
        enddo
      enddo
    enddo

6.2 [Section text partly lost in extraction.] The LES arrays have N = imax × jmax × kmax points. In the CUDA version the i and j loops are replaced by the block ID and thread ID (Figure 6): each thread derives its i and j from those IDs and loops over k. Figure 7 is the Fortran loop and Figure 8 the corresponding CUDA code.

7. [Section text lost in extraction.]

Table 1: Evaluation environment

    CPU        Intel Xeon E5630 2.53 GHz, 4 cores × 2
    RAM        DDR3 SDRAM 1066 MHz, 4 GB × 6
    GPU        NVIDIA Tesla M2050, … GHz
    GPU RAM    GDDR5 SDRAM 1.55 GHz, 3 GB (ECC on)
    OS         CentOS Linux release 6.0 (Final)
    Compiler   GNU Fortran (GCC); nvcc 4.0 (-arch sm_20)

[Discussion of Table 1 and Figure 9 lost in extraction.] Surviving terms: cgstab; addition_inst_value; run; Tesla M2050; KB/L1.
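The mapping from the Fortran loop nest of Figure 7 to a single flat offset (i fastest, then j, then k) can be checked on the CPU before moving to the GPU. The following C sketch is a reference version under stated assumptions: the names flat_index, blocks_needed and gradient_cell_surface_ref are hypothetical, indices are zero-based, and the arrays are assumed contiguous in Fortran order.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of zero-based (i, j, k); i varies fastest, as in Fortran. */
size_t flat_index(int i, int j, int k, int imax, int jmax)
{
    assert(i >= 0 && i < imax && j >= 0 && j < jmax && k >= 0);
    return (size_t)i + (size_t)imax * ((size_t)j + (size_t)jmax * (size_t)k);
}

/* Thread blocks needed so that blocks * block_dim covers n_points. */
int blocks_needed(int n_points, int block_dim)
{
    assert(n_points >= 0 && block_dim > 0);
    return (n_points + block_dim - 1) / block_dim;
}

/* CPU reference for the Figure 7 loop nest, written on flat arrays.
 * Only interior points are written; boundary entries of fx1 are untouched. */
void gradient_cell_surface_ref(const double *f, const double *xix,
                               const double *etx, const double *zex,
                               const double *hjac1, double *fx1,
                               int imax, int jmax, int kmax)
{
    const size_t dj = (size_t)imax;                 /* one step in j */
    const size_t dk = (size_t)imax * (size_t)jmax;  /* one step in k */

    for (int k = 1; k < kmax - 1; k++)
        for (int j = 1; j < jmax - 1; j++)
            for (int i = 1; i < imax - 1; i++) {
                const size_t c = flat_index(i, j, k, imax, jmax);
                fx1[c] = (xix[c + 1] * f[c + 1] - xix[c] * f[c]
                          + (etx[c + dj + 1] * f[c + dj + 1]
                             - etx[c - dj + 1] * f[c - dj + 1]
                             + etx[c + dj] * f[c + dj]
                             - etx[c - dj] * f[c - dj]) * 0.25
                          + (zex[c + dk + 1] * f[c + dk + 1]
                             - zex[c - dk + 1] * f[c - dk + 1]
                             + zex[c + dk] * f[c + dk]
                             - zex[c - dk] * f[c - dk]) * 0.25)
                         * hjac1[c];
            }
}
```

For a 256 × 256 horizontal grid, 254 interior points remain in each direction, so with an assumed 16 × 16 thread block blocks_needed(254, 16) gives 16 blocks per direction. Because the last block may extend past the interior in other configurations, a kernel built on this mapping would normally also test i and j against imax - 1 and jmax - 1.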

Figure 8: CUDA version of the code in Figure 7

    int ijk;
    // Zero-based interior indices; the +1 skips the boundary cell.
    int i = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int j = blockDim.y * blockIdx.y + threadIdx.y + 1;

    // Each thread handles one (i, j) column and walks the interior k levels.
    for (int k = 1; k < kmax - 1; k++) {
        ijk = i + j * imax + k * imax * jmax;

        d_fx1[ijk] = ( d_xix[ijk + 1] * d_f[ijk + 1] - d_xix[ijk] * d_f[ijk]
                     + ( d_etx[ijk + imax + 1] * d_f[ijk + imax + 1]
                       - d_etx[ijk - imax + 1] * d_f[ijk - imax + 1]
                       + d_etx[ijk + imax] * d_f[ijk + imax]
                       - d_etx[ijk - imax] * d_f[ijk - imax] ) * 0.25
                     + ( d_zex[ijk + imax * jmax + 1] * d_f[ijk + imax * jmax + 1]
                       - d_zex[ijk - imax * jmax + 1] * d_f[ijk - imax * jmax + 1]
                       + d_zex[ijk + imax * jmax] * d_f[ijk + imax * jmax]
                       - d_zex[ijk - imax * jmax] * d_f[ijk - imax * jmax] ) * 0.25
                     ) * d_hjac1[ijk];
    }

Figure 9: Execution time (sec); the surviving axis labels are I and J.

[Discussion of the results lost in extraction.] Surviving terms: 48KB; 48KB/L1 16KB; N = imax × jmax × kmax with kmax = 102 and imax, jmax varied; global memory 3GB.

8. [Section text lost in extraction.] Surviving terms: LES; global memory; MPI; OpenMP.

References

1) CUDA …, Vol. 20, No. 2, pp. …, Jun. …
2) … TSUBAME …, May …
3) … 2010 …, Dec. …

4) … ASUCA … TSUBAME …, Dec. …
5) … LES … 2011 …, May …
6) Ryosaku Ikeda, Hiroyuki Kusaka, Satoru Iizuka, Taisuke Boku: Development of Local Meteorological Model based on CFD, 5th International Symposium on Wind Effects on Buildings and Urban Environment (ISWE5), Mar. …
7) Iizuka, S., Kondo, H.: Large-eddy simulations of turbulent flow over complex terrain using modified static eddy viscosity models, Atmospheric Environment, 40, pp. …, Feb. …
8) NVIDIA Corporation: CUDA ZONE, home.html
9) Peter Glaskowsky: NVIDIA's Fermi: The First Complete Computing Architecture
10) Dave Patterson: The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges
