CUDA GPGPU 2012 UDX 12/5/24 p. 1
FDTD GPU FDTD GPU FDTD FDTD FDTD PGI Accelerator CUDA OpenMP Fermi GPU (Tesla C2075/C2070, GTX 580) GT200 GPU (Tesla C1060, GTX 285) PC
FDTD CIP 1 PC / PC FPGA Cell/B.E. GPU MPI Verilog/HDL CUDA/OpenCL
GPU NVIDIA CUDA OpenCL CUDA CPU/GPU GPU CPU/GPU FDTD (PGI Accelerator) CUDA OpenMP
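FDTD (Finite-Difference Time-Domain) method
Maxwell's equations:
$\nabla \times \mathbf{E} = -\dfrac{\partial \mathbf{B}}{\partial t}, \qquad \nabla \times \mathbf{H} = \mathbf{J} + \dfrac{\partial \mathbf{D}}{\partial t}$
Second-order central difference:
$\dfrac{\partial F(x, y, z, t)}{\partial x} = \dfrac{F^{n}(i + \frac{1}{2}, j, k) - F^{n}(i - \frac{1}{2}, j, k)}{\Delta x} + O(\Delta x^{2})$
for loops over x, y, z; 6 field components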
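The second-order accuracy of the staggered central difference can be checked numerically. The sketch below is only an illustration: the function names (`central_diff`, `cube`, `central_diff_error`) and the test function F(x) = x^3 are assumptions made for this example and do not come from the slides. For this F the error is exactly Δx²/4, so halving Δx should reduce it by a factor of 4.

```c
/* Second-order central difference on a staggered grid:
   dF/dx at x is approximated from samples at x - dx/2 and x + dx/2. */
double central_diff(double (*f)(double), double x, double dx)
{
    return (f(x + 0.5 * dx) - f(x - 0.5 * dx)) / dx;
}

/* Hypothetical test function F(x) = x^3; its exact derivative is 3 x^2. */
double cube(double x)
{
    return x * x * x;
}

/* Absolute error of the central difference for F(x) = x^3. */
double central_diff_error(double x, double dx)
{
    double e = central_diff(cube, x, dx) - 3.0 * x * x;
    return e < 0.0 ? -e : e;
}
```

Calling `central_diff_error(1.0, 0.1)` and `central_diff_error(1.0, 0.05)` and taking the ratio should give a value close to 4, which is the O(Δx²) behaviour stated above.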
FDTD FDTD
FDTD MPI/OpenMP GPU CUDA/OpenCL GPU GPU PCI Express
GPU
[Figure: block diagram of host and device]
  Host (CPU): CPU and host memory, over 10 GB/s
  Host to device: PCI Express 2.0, 16 GB/s
  Device (GPU): control unit and multiprocessors (MPs), each with SPs, registers, and SM/cache; device memory, over 100 GB/s
  GT200: 30 MPs, 8 SPs; Fermi: 16 MPs, 32 SPs
  InfiniBand QDR: 5 GB/s
GPU
                    C2075     GTX 580   C1060     GTX 285
  Number of cores   448       512       240       240
  GFLOPS (single)   1030      1581      622       720
  Memory (MB)       6144      3072      4096      2048
  Bandwidth (GB/s)  144       192       102       159
  SM/Caches (KB)    64 L1+SM, 768 L2    SM 16
Fermi: 512; GT200: 240; 1 TFLOPS
Core i7: 100 GFLOPS; 100 GB/s
FDTD GPU GPU FDTD CUDA
  1. CPU GPU
  2. GPU CPU
  3. CPU
FDTD GPU GPU GPU C2075/C2070 2
GPU CUDA/OpenCL CUDA/OpenCL C/C++ Fortran PGI CUDA Fortran OpenMP C/Fortran CUDA NVIDIA OpenACC PGI Accelerator OpenACC GPU CPU/GPU CUDA
OpenMP FDTD
    for (t = 0.0; t < Te; t += dt) {
        #pragma omp parallel
        {
            // Ex
            #pragma omp for private(i, j, k)
            for (i = 0; i < Ni - 1; i++) {
                for (j = 1; j < Nj - 1; j++) {
                    for (k = 1; k < Nk - 1; k++) {
                        Ex[i][j][k] = c1 * Ex[i][j][k]
                            + c2 * (Hz[i][j][k] - Hz[i][j - 1] ...
                            - Hy[i][j][k] + Hy[i][j][k - ...
    ...
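The listing on this slide is cut off at the right edge, so the following is only a sketch of how one Ex update might look as a self-contained function. It assumes the truncated terms complete as the usual Yee update (`Hz[i][j-1][k]` and `Hy[i][j][k-1]`), a uniform cell size folded into `c2`, and flattened row-major arrays. The names `update_ex` and `IDX` are hypothetical and introduced only for this example.

```c
#include <stddef.h>

/* Flattened index into an Ni x Nj x Nk array stored in row-major order. */
#define IDX(i, j, k, Nj, Nk) (((size_t)(i) * (Nj) + (j)) * (Nk) + (k))

/* One Ex update over the same loop bounds as the slide.
   Assumption: the truncated stencil terms are Hz[i][j-1][k] and Hy[i][j][k-1],
   and the spatial step is already folded into c2 (uniform grid). */
void update_ex(double *Ex, const double *Hy, const double *Hz,
               double c1, double c2, int Ni, int Nj, int Nk)
{
    int i, j, k;
    /* The pragma is ignored when the code is compiled without OpenMP. */
    #pragma omp parallel for private(i, j, k)
    for (i = 0; i < Ni - 1; i++) {
        for (j = 1; j < Nj - 1; j++) {
            for (k = 1; k < Nk - 1; k++) {
                Ex[IDX(i, j, k, Nj, Nk)] = c1 * Ex[IDX(i, j, k, Nj, Nk)]
                    + c2 * (Hz[IDX(i, j, k, Nj, Nk)] - Hz[IDX(i, j - 1, k, Nj, Nk)]
                          - Hy[IDX(i, j, k, Nj, Nk)] + Hy[IDX(i, j, k - 1, Nj, Nk)]);
            }
        }
    }
}
```

With spatially uniform H fields the difference terms cancel and the update reduces to `Ex = c1 * Ex`, which gives a quick sanity check. Cells outside the loop bounds are left untouched.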
PGI Accelerator FDTD
    #pragma acc data region copy(Ex[0:Ni][0:Nj][0:Nk]),
        copyin(Ey[0:Ni][0:Nj][0:Nk], Ez[0:Ni][0:Nj] ...
               Hx[0:Ni][0:Nj][0:Nk], Hy[0:Ni][0:Nj] ...
               ep[0:Ni][0:Nj][0:Nk], sig[0:Ni][0:Nj ...
    {
        for (t = 0.0; t < Te; t += dt) {
            #pragma acc region
            {
                // Ex
                #pragma acc for parallel
                for (i = 0; i < Ni - 1; i++) {
                    #pragma acc for parallel, vector(256)
                    for (j = 1; j < Nj - 1; j++) {
                        #pragma acc for vector(512)
                        for (k = 1; k < Nk - 1; k++) {
                            Ex[i][j][k] = c1 * Ex[i][j][k]
                                + c2 * (Hz[i][j][k] - Hz[i][j - 1] ...
                                - Hy[i][j][k] + Hy[i][j][k - ...
    ...
1 Fermi GPU
  GPU: Tesla C2075/C2070, GeForce GTX 580
       PGI Accelerator C/C++ Workstation 12.2, CUDA 4.0
  CPU: Intel Core i7 980X (3.33 GHz)
       gcc 4.4.3 -O3, OpenMP
  OS: 64-bit Linux (Ubuntu 10.04 LTS server)
256^3, J_x, 1.0 m, E_x, CPU
[Figure: electric field Ex (V/m) versus time (ns), 3.0 to 7.0 ns; curves: Exact, CPU, GPU]
CPU 8 versus GPU
256 × 256 × 256, 1000, 5
  GPU      precision  CPU t_C1 (s)  CPU t_C8 (s)  GPU t_GD (s)  t_C8/t_GD
  GTX 580  float      1410.55       330.52        32.67         10.12
           double     1124.13       349.03        39.79          8.77
  C2075    float      1410.55       330.52        48.75          6.78
           double     1124.13       349.03        65.03          5.37
Core i7 980X: 10; GTX580: 5; C2075: 20
Relative to CPU 8: GTX 580: 10 (float), 9 (double); C2075: 7 (float), 5 (double)
Relative to CPU 1: GTX 580: 43 (float), 28 (double); C2075: 29 (float), 17 (double)
Comparison with CUDA
256 × 256 × 256, 1000, 5
  GPU      precision  GPU t_GD (s)  GPU t_GC (s)  t_GC/t_GD
  GTX 580  float      32.67         10.10         0.309
           double     39.79         21.25         0.534
  C2075    float      48.75         19.36         0.397
           double     65.03         39.28         0.604
t_GC/t_GD: GTX 580: 31 % (float), 53 % (double); C2075: 40 % (float), 60 % (double)
PC
320 × 480 × 320, CUDA, 5000
  GPU      CPU t_C1 (s)  GPU t_GD (s)  GPU t_GC (s)  t_GC/t_GD
  GTX 580  18140.44      484.02        222.87        0.460
  C2070    18140.44      724.21        406.62        0.561
(a) 3 ns later (b) 6 ns later (c) 9 ns later (d) (c)
2 FDTD FDTD 1/10 GPU 5/29-31
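4th-order FDTD
FDTD, 2nd order in space and time:
$\dfrac{\partial F(x, y, z, t)}{\partial x} = \dfrac{F^{n}(i + \frac{1}{2}, j, k) - F^{n}(i - \frac{1}{2}, j, k)}{\Delta x} + O(\Delta x^{2})$
$\dfrac{\partial F(x, y, z, t)}{\partial t} = \dfrac{F^{n + \frac{1}{2}}(i, j, k) - F^{n - \frac{1}{2}}(i, j, k)}{\Delta t} + O(\Delta t^{2})$
FDTD(2,4), 4th order in space, 2nd order in time:
$\dfrac{\partial F(x, y, z, t)}{\partial x} = \dfrac{9}{8}\,\dfrac{F^{n}(i + \frac{1}{2}, j, k) - F^{n}(i - \frac{1}{2}, j, k)}{\Delta x} - \dfrac{1}{24}\,\dfrac{F^{n}(i + \frac{3}{2}, j, k) - F^{n}(i - \frac{3}{2}, j, k)}{\Delta x} + O(\Delta x^{4})$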
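The fourth-order spatial accuracy of the FDTD(2,4) stencil (weights 9/8 and -1/24 on the half-cell and three-half-cell differences) can be checked with a small numerical experiment. This is a sketch under stated assumptions: the names `diff4`, `quintic`, and `diff4_error` and the test function F(x) = x^5 are hypothetical choices for illustration. For this F the stencil error works out to (9/16)Δx⁴, so halving Δx should reduce it by a factor of 16.

```c
/* Fourth-order staggered difference used in FDTD(2,4):
   9/8 times the half-cell difference minus 1/24 times the
   three-half-cell difference, both divided by dx. */
double diff4(double (*f)(double), double x, double dx)
{
    double d1 = (f(x + 0.5 * dx) - f(x - 0.5 * dx)) / dx;
    double d3 = (f(x + 1.5 * dx) - f(x - 1.5 * dx)) / dx;
    return 9.0 / 8.0 * d1 - 1.0 / 24.0 * d3;
}

/* Hypothetical test function F(x) = x^5; its exact derivative is 5 x^4. */
double quintic(double x)
{
    return x * x * x * x * x;
}

/* Absolute error of the fourth-order stencil for F(x) = x^5. */
double diff4_error(double x, double dx)
{
    double e = diff4(quintic, x, dx) - 5.0 * x * x * x * x;
    return e < 0.0 ? -e : e;
}
```

Comparing `diff4_error(1.0, 0.2)` with `diff4_error(1.0, 0.1)` should give a ratio close to 16, against a ratio of 4 for the second-order difference.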
CPU 8 versus GPU, FDTD(2,4)
256 × 256 × 256, 1000, 5
  GPU      precision  CPU t^H_C8 (s)  GPU t^H_GD (s)  GPU t^F_GD (s)  t^H_C8/t^H_GD
  GTX 580  float      391.30          34.67           32.67           11.98
           double     431.84          46.82           39.79            9.22
  C2075    float      391.30          52.39           48.75            7.47
           double     431.84          74.73           65.03            5.78
Relative to CPU 8: GTX 580: 12 (float), 9 (double); C2075: 7 (float), 6 (double)
Relative to FDTD: GTX 580: 1.1 (float), 1.2 (double); C2075: 1.1 (float), 1.1 (double)
Comparison with CUDA, FDTD(2,4)
256 × 256 × 256, 1000, 5
  GPU      precision  GPU t^H_GD (s)  GPU t^H_GC (s)  t^H_GC/t^H_GD  t^F_GC/t^F_GD
  GTX 580  float      34.67           18.83           0.543          0.309
           double     46.82           40.32           0.861          0.534
  C2075    float      52.39           24.09           0.460          0.397
           double     74.73           70.90           0.945          0.604
t^H_GC/t^H_GD: GTX 580: 54 % (float), 86 % (double); C2075: 46 % (float), 95 % (double)
3 GT 200 GPU
2011 3.14
  GPU: GT 200: GeForce GTX 285, Tesla C1060
       PGI Accelerator Workstation C/C++ 10.9, CUDA 3.1
  CPU: Intel Core i7 980X (3.33 GHz)
       gcc 4.4.3 -O3, OpenMP
CPU 8 versus GPU
256 × 256 × 256, 1000, 5
  GPU      precision  CPU t_C1 (s)  CPU t_C8 (s)  GPU t_GD (s)  t_C8/t_GD
  GTX 285  float      1410.55       330.52        115.83        2.85
  C1060    float      1410.55       330.52        122.63        2.70
  C2070    float      1410.55       330.52         65.01        5.08
Relative to CPU 8: GTX 285: 3; C1060: 3; C2070: 5
Comparison with CUDA
256 × 256 × 256, 1000, 5
  GPU      precision  GPU t_GD (s)  GPU t_GC (s)  t_GC/t_GD
  GTX 285  float      115.83        22.49         0.194
  C1060    float      122.63        24.56         0.200
  C2070    float       65.01        20.81         0.320
t_GC/t_GD: GTX 285: 20 %; C1060: 20 %; C2070: 32 %
4 PC
2005 PC
                supercomputer SX-7            our PC cluster
                at Tohoku Univ. (NEC)         Pentium 4 3.0 GHz × 16 (handmade)
  # of CPUs     240                           16
  memory        1920 Gbyte                    8 Gbyte
  job class     max 32 CPU, 256 Gbyte         16 CPU, 8 Gbyte
  accounting    0.4 yen/sec                   0
  parallelize   auto (sxcc -Pauto)            Message Passing (MPI)
PC, FDTD
160 × 160 × 160, 1000, 5
computation time [s]
  architecture              FDTD     FDTD(2,4)
  NEC SX-7                    5.24      8.02
  Pentium 4 2.8GHz × 16     642.80   2816.94
  C2075 (PGI 12.2)           21.16     23.59
  C2075 (CUDA 4.0)            9.34     14.33
FDTD GPU
  Fermi GPU
    relative to CPU 8: GTX 580: 10; C2075: 6
    relative to CUDA: GTX 580: 30 to 50 %; C2075: 40 to 60 %
    CUDA: 50 %
  FDTD(2,4)
    relative to FDTD: 1.2
    relative to CUDA: 90 %
  GT 200 GPU
    relative to CPU 8: 3
    relative to CUDA: 20 %
  NEC SX-7: C2075 1/4
X Maxwell 989-3128 16 1 Jun SONODA E-mail: sonoda@sendai-nct.ac.jp
1. FDTD H21 H22,23 Cell/B.E. FDTD Cell Challenge 2009 1 IPv6 PC H21 23 2. GPU H23 NTT H23 JST A-STEP H20 H22 H19 H21
1.
FDTD (Finite-Difference Time-Domain) CIP (Constrained Interpolation Profile) FDTD CIP Maxwell FDTD CIP
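FDTD
Numerical dispersion relation:
$\left[\dfrac{1}{v\,\Delta t}\sin\!\left(\dfrac{\omega\,\Delta t}{2}\right)\right]^{2} = \sum_{\zeta = x, y, z}\left[\dfrac{1}{\Delta\zeta}\sin\!\left(\dfrac{k_{\zeta}\,\Delta\zeta}{2}\right)\right]^{2}$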
[Figure: maximum dispersion error (c_0 - c_n)/c_0 (%) versus propagation distance (λ), logarithmic axes; curves for Δ = λ/10, λ/20, λ/40, λ/60, λ/80, λ/100]
Δ = λ/m, R = nλ
$e_R \simeq \dfrac{1.7\,n}{100^{\log_{10}(m) - 1}}$ (%)
  model  Δ      R      e_Rmax (%) by our eq.  e_FDTD (%) by FDTD
  2-D    λ/10   30λ     51                     51
         λ/10   60λ    102                    102
         λ/20   30λ     13                     13
         λ/20   120λ    51                     50
  3-D    λ/10   15λ     26                     25
         λ/10   30λ     51                     51
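The estimate on this slide can be evaluated directly. The sketch below reads the formula as e_R ≈ 1.7 n / 100^(log10(m) - 1), which simplifies to 170 n / m² percent; this reading is an interpretation of the extracted text, chosen because it reproduces the "by our eq." column of the table. The function name `dispersion_error_percent` is hypothetical.

```c
/* Estimated maximum dispersion error in percent for cell size lambda/m
   after propagating a distance of n wavelengths.
   Assumed reading of the slide: 1.7 n / 100^(log10(m) - 1) = 170 n / m^2. */
double dispersion_error_percent(double m, double n)
{
    return 170.0 * n / (m * m);
}
```

With this reading, m = 10 and n = 30 gives 51 %, and m = 20 and n = 30 gives 12.75 %, which rounds to the 13 % listed in the table. Halving the cell size therefore cuts the error to a quarter, while the error grows linearly with propagation distance.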
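FDTD, N = 2, M = 2
$\dfrac{\partial f(x)}{\partial x} = a_1\,\dfrac{f(x + \frac{1}{2}\Delta) - f(x - \frac{1}{2}\Delta)}{\Delta} + \dfrac{1 - a_1}{3}\,\dfrac{f(x + \frac{3}{2}\Delta) - f(x - \frac{3}{2}\Delta)}{\Delta} + O(\Delta^{2})$
FDTD, $a_1$, $k$, $\tilde{k}$
$\Theta = \int_{-\beta}^{\beta} (k\Delta - \tilde{k}\Delta)^{2}\, d(k\Delta)$
$\tilde{k} = \dfrac{2}{\Delta}\left[a_1\left(\sin\dfrac{k\Delta}{2} - \dfrac{1}{3}\sin\dfrac{3k\Delta}{2}\right) + \dfrac{1}{3}\sin\dfrac{3k\Delta}{2}\right]$
$-\pi \le \beta \le \pi$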
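FDTD, $\beta$, $a_1$
  1. $\beta$, $a_1$
  2. $a_1$, $k$, $\tilde{k}$
$\partial\Theta / \partial a_1 = 0$:
$a_1 = \dfrac{8\left(27\sin\frac{\beta}{2} - \sin\frac{3\beta}{2}\right) - 12\beta\left(9\cos\frac{\beta}{2} - \cos\frac{3\beta}{2}\right)}{60\beta - 90\sin\beta + 18\sin 2\beta - 2\sin 3\beta} + \dfrac{6\beta - 18\sin\beta + 9\sin 2\beta - 2\sin 3\beta}{60\beta - 90\sin\beta + 18\sin 2\beta - 2\sin 3\beta}$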
[Figure: dispersion error e_θφ (10^-5 to 10^-1) versus Courant number (0.01 to 1); curves: FDTD(2,2), FDTD(2,4), Tam 1993, Wang 1996, Proposed]
[Figure: two panels of electric field Ex [V/m] versus time [ns], 13 to 19 ns, comparing measurement with FDTD and with Opt.FDTD]
FDTD FDTD PC FPGA (Field-Programmable Gate Array) Cell Broadband Engine (Cell/B.E.) GPU (Graphics Processing Unit) FDTD CIP
Cell/B.E. FDTD Cell/B.E. SONY IBM CPU PS3 PS3 Cell/B.E. 1 8
[Figure: time-space diagrams of the FDTD update between main memory and the SPEs (SPE 1, SPE 2); time levels n to n + 2, grid positions i - 2 to i + 5/2 along z]
PS3 FDTD
[Figure: speedup ratio versus number of SPE(s), 1 to 6; curves: TSP Large, TSP Small, Parallel Large, Parallel Small, Ideal]
Xeon 2.8GHz MacPro 10
PC PC PC SCore Clustermatic Los Alamos National Lab. Windows HPC Server 2008 (Microsoft ) OS Live Linux PC KNOPPIX PC DHCP
IPv6 PC USB/CD/DVD 1 PC PC PC Live Linux USB/CD/DVD Linux IPv6 PC Live Linux OS PC PC
HTTP-FUSE-KNOPPIX PC Live Linux USB/CD/DVD PC PC /home block file kernel magic packet NFS HTTP TFTP WOL PC 01 PC 02 PC 03 PC 04 PC n server client client client client Live Linux HTTP-FUSE-KNOPPIX USB or CD for server system for client boot loader, kernel, blockfile PC PC Live Linux
PC
[Figure: ratio versus # of PCs, 0 to 100; curves: our system, NFS_server * 3; annotated times 71.2 [s], 89.5 [s], 108.5 [s], 144.9 [s]]
1/2
IPv6 PC DHCP IP DHCP IPv6 MAC IP
PC
[Figure: boot time [s] versus number of PCs, 4 to 20; curves: IPv4 NFS, IPv4 SSHFS, IPv6 SSHFS]
IPv4 NFS
NPB EP-D
[Figure: EP Class D [Mop/s] versus number of cores, 0 to 80; curves: IPv4 NFS, IPv6 SSHFS, IPv4 SSHFS]
IPv4 NFS
2.
GPR (Ground Penetrating Radar) FDTD 1990 FDTD FPGA Cell/B.E. FDTD
2D/3D
[Figure: analysis model with axes x, y, z and origin O; air above ground (ε_r = 4.0, σ = 0.001 S/m); source J; cylinder (ε_r = 1.0, σ = 0.0 S/m); dimensions 0.1 m, 1.0 m, 0.1 m]
                    2D                                  3D
  problem size      1024 x 1024                         256 x 256 x 256
  source            line current                        point current
  pulse             Gaussian (-3 dB width: 0.5 ns)
  scan range        -2.5 ≤ x ≤ +2.5 (Δx = 0.05 m)       -1.0 ≤ x/y ≤ +1.0 (Δx = Δy = 0.1 m)
  # of scannings    100                                 400
  ground            ε_r = 4.0, σ = 0.001 S/m
  cylinder          ε_r = 1.0, σ = 0.0 S/m
  increments        Δ = 0.01 m, Δs = 0.01 × 10^-6 s
  # of time steps   3000
  ABC               1st. Mur
  compiler          CUDA 4.0 (gcc 4.4.5 -O3)
GPU GeForce GTX 580 10 PC 1 2 GPU 10
3
3 CPU/GPU CPU 980X x10 65 GTX 580 x10 30
FDTD X
[Figure: structure at the 0th, 1st, and 2nd stages over 0 to L along x, with d_0 = L, d_1, d_2; media (ε_1, μ_1) and (ε_2, μ_2)]
[Figure: transmission coefficient (dB) versus d/λ, 0 to 0.5, with ε_2/ε_1 = 4.0; one panel with curves 1st, 2nd, 3rd; one panel with curves 3 layers, 7 layers, 15 layers]
[Figure: minimum transmission (dB) and Q value versus stage number, 1 to 3; curves: peak, Q]
[Figure: maximum resonance (dB) and Q value versus stage number, 1 to 3; curves: Peak, Q]
SiO2-TiO2 2 400 800 nm
[Figure: transmission versus d_2/λ_2, 0.14 to 0.24; curves: measured, FDTD]
1 6 GHz S21 FDTD
ε_r = 2.25
FDTD 2
[Figure: transmission coefficient (dB), 0.0 to -8.0, versus frequency (GHz), 1 to 6; curves: measured, FDTD]
LLS FDTD 3 MW-FDTD PC PC
MW-FDTD Moving Window FDTD MW-FDTD FDTD MW-FDTD MW-FDTD LLS
3 MW-FDTD PC MW-FDTD 4 MW-FDTD F=1/16 FDTD 47 1/64 FDTD 20 1/16
Electromagnetic field analysis of lightning discharge using terrain models
GPU (Graphics Processing Unit)
SfM FDTD FDTD SfM (Structure from Motion) 1 3 1 3 FDTD
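Realistic visualization of radio-wave environments with AR
  FDTD method: the computed results are time responses of the 6 electromagnetic field components
  The results should be displayed in an easily understandable way
  3D visualization software such as AVS: high cost, no sense of realism
  AR (Augmented Reality) technology: a technique for mapping artificial objects onto real video
  Visualization of the Poynting vector distribution with AR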