Background: pay-as-you-go HPC — renting HPC resources on demand over the Web.
Goal: run GPU-accelerated OpenFOAM on Amazon EC2 HPC instances.
Outline: HPC in the cloud, MPI communication performance, and GPGPU acceleration of an MPI-parallel flow solver.
AMAZON EC2 GPU CLUSTER COMPUTE INSTANCE
EC2 GPU instance (cg1.4xlarge), region: US East (N. Virginia)
- CPU: quad-core Intel Xeon X5570 2.93 GHz x 2 (8 cores)
- Memory: 22 GB
- GPU: NVIDIA Tesla M2050 (2687 MB) x 2
- Interconnect: 10 Gb Ethernet
- OS: Amazon Linux AMI 2012.03 (RHEL-based)
- Price: $2.10 / hour / node
EC2 cluster setup: following the YouTube tutorial "Building a Cluster in Less Than Ten Minutes", the CUDA SDK, OpenFOAM, and the GPU environment were installed (via sudo) and saved as a custom Machine Image; storing the image costs $0.10 / GB / month.
EC2 WEB CONSOLE [screenshot]
PCC-GPU: APPRO GPU CLUSTER (IN-HOUSE)
In-house GPU cluster (pcc-gpu)
- CPU: octo-core AMD Opteron 6136 @ 2.4 GHz x 2 (16 cores)
- Memory: 32 GB
- GPU: NVIDIA Tesla M2050 (2687 MB) x 2
- Interconnect: InfiniBand QDR
- OS: CentOS 6.2
- Nodes: 8 (9)
MPI benchmark: Intel MPI Benchmarks (IMB) 3.2.3. Two tests relevant to OpenFOAM's communication pattern were run: PingPong (between 2 nodes) and Allreduce (MPI_SUM, 8 bytes).
IMB: PINGPONG (2 NODES)
[Figure: IMB PingPong between 2 nodes, cg1.4xlarge vs. pcc-gpu; elapsed time [μsec] (0–900) vs. message size [byte] (0–300,000)]
IMB: ALLREDUCE (SUM, 8 BYTES)
[Figure: IMB Allreduce (MPI_SUM, 8 bytes), cg1.4xlarge vs. pcc-gpu; elapsed time [μsec] (0–350) vs. number of nodes (1–9)]
NS equations and the pressure equation:
\[
\nabla \cdot U = 0, \qquad
\nabla \cdot (UU) - \nabla \cdot (\nu \nabla U) = -\nabla p
\]
Discretizing the momentum equation gives \(a_p U_p = H(U) - \nabla p\), so
\[
U_p = \frac{H(U)}{a_p} - \frac{1}{a_p} \nabla p .
\]
Interpolating to cell faces,
\[
U_f = \left(\frac{H(U)}{a_p}\right)_f - \left(\frac{1}{a_p}\right)_f (\nabla p)_f ,
\]
and substituting into the continuity equation yields the pressure Poisson equation
\[
\nabla \cdot \left( \frac{1}{a_p} \nabla p \right) = \nabla \cdot \left( \frac{H(U)}{a_p} \right) .
\]
SIMPLE
Algorithm 1: SIMPLE
1: initialize U and p
2: repeat
3:    assemble the momentum equation
4:    solve the momentum predictor
5:    solve the pressure equation with PCG
6:    correct the face fluxes
7:    correct the velocity
8:    apply under-relaxation
9: until converged
PRECONDITIONED CG
The multi-GPU preconditioned CG combines three components, with MPI for inter-node communication:
- vector operations: cuBLAS
- SpMV (sparse matrix-vector product): CUDA ITSOL (Li and Saad, 2012), JAD format
- preconditioners: CUDA ITSOL, NVIDIA CUSP
GPU libraries: CUDA ITSOL (Li and Saad, 2011) provides the JAD-format SpMV (sparse matrix-vector product) on the GPU; NVIDIA CUSP provides the AMG preconditioner. Both are linked into MPI-parallel OpenFOAM.
JAD: SPARSE MATRIX STORAGE
Compressed Row Storage (CSR) vs. JAgged Diagonal storage (JAD), both with wavefront ordering. JAD stores the k-th nonzero of every row contiguously, so consecutive GPU threads access consecutive memory — this coalesced access pattern makes JAD well suited to CUDA.
JAD SPMV PERFORMANCE
[Figure: SpMV throughput in Gflop/s for CPU-CSR, GPU-CSR, and GPU-JAD on three test matrices: bones01 (127,224 rows, 6,715,152 nonzeros), parabolic_fem (525,825 rows, 3,674,625 nonzeros), thermal2 (1,228,045 rows, 8,580,313 nonzeros); GPU-JAD reaches up to 16.07 Gflop/s]
In the OpenFOAM GPU solver, the matrix is converted to JAD format once before the solve.
SPMV: MPI GHOST-CELL EXCHANGE
Before each distributed SpMV, ghost-cell values must be exchanged: boundary values are copied from GPU to CPU (Device2Host), exchanged between ranks with MPI, copied back to the GPU (Host2Device), and only then does the CUDA SpMV kernel run.
AMG (ALGEBRAIC MULTIGRID) PRECONDITIONER
The AMG preconditioner is taken from the NVIDIA CUSP library (smoothed_aggregation).
Image: https://commons.wikimedia.org/wiki/file:gray505.png
Test case: the geometry was reconstructed from MRI data, meshed with Gambit, and converted to the OpenFOAM (OF) mesh format.
Simulation setup: simpleFoam (OpenFOAM-2.1.1)
- ν = 3.33 × 10^-6 m²/s, V = 0.461 m/s (Re = 6500)
- P_inlet = 76 Pa, P_outlet = 0 Pa
- outer-loop convergence: ‖δp‖₁ ≤ 1.0 × 10^-6 and ‖δv‖₁ ≤ 1.0 × 10^-6
- pressure solvers compared: GPU-AMG-CG and ILU-BiCG, inner tolerance ‖r‖₁ ≤ 1.0 × 10^-8
Convergence required 1778 SIMPLE iterations.
Mesh:   SMALL      MEDIUM     LARGE
Cells:  1,912,272  2,980,302  5,144,730
Size:   155 MB     311 MB     543 MB
Configurations compared:
- EC2, CPU-ICCG
- EC2, GPU-AMGCG
- JAIST GPU cluster, GPU-AMGCG
CG, EC2 VS. IN-HOUSE: AMG-PCG INNER LOOP
[Figure: elapsed time [sec] (0–0.7) of the AMG-PCG inner loop for cg1.4xlarge (CPU-DIC), cg1.4xlarge (GPU-AMG), pcc-gpu (CPU-DIC), pcc-gpu (GPU-AMG) on the SMALL, MEDIUM, and LARGE meshes]
PCG, EC2 VS. IN-HOUSE: CG LOOP (LARGE)
[Figure: elapsed time [sec] (0–0.18) of the CG loop on the LARGE mesh for cg1.4xlarge (ICCG), cg1.4xlarge (AMGCG), pcc-gpu (AMGCG) vs. number of nodes (1, 2, 4, 8)]
SIMPLE LOOP (LARGE), EC2 VS. IN-HOUSE
[Figure: elapsed time [sec] (0–250) of the SIMPLE loop on the LARGE mesh for pcc-gpu and cg1.4xlarge vs. number of threads (1, 2, 4, 8)]
Iterations to reach ‖r‖₁ / ‖r₀‖₁ ≤ 1.0 × 10^-8:

Nodes:    1     2     4     8
ICCG:     1005  1356  1362  1373
AMG-CG:   41    94    139   198
Conclusions: GPU-accelerated OpenFOAM, using the CUDA ITSOL SpMV and the NVIDIA CUSP AMG preconditioner, was run on Amazon EC2 on up to 8 nodes.