Performance Evaluation of GPU-Accelerated OpenFOAM on Amazon EC2

Akihiko Saijo 1   Yasushi Inoguchi 1,2   Teruo Matsuzawa 1,3
1 School of Information Science, Japan Advanced Institute of Science and Technology

Abstract: Cloud providers now rent virtual machines (VMs) aimed at HPC workloads. We evaluate the MPI and GPGPU performance of Amazon EC2 Cluster GPU instances by running OpenFOAM with a GPU-accelerated linear solver on up to 8 instances, and compare the results against an in-house GPU cluster.
Keywords: EC2, GPU, Cloud, CFD

1. Introduction
Large-scale numerical simulation has traditionally been run on in-house HPC clusters, which carry high procurement and operating costs. IaaS (Infrastructure as a Service) providers instead rent virtualized compute resources on demand, which raises the question of whether rented cloud nodes can stand in for an in-house cluster when running MPI-parallel HPC applications, whose performance depends not only on per-CPU speed but also on inter-node communication.
Amazon EC2 (Elastic Compute Cloud) addresses HPC users with the CCI (Cluster Compute Instance) family, and the Cluster GPU variant of the CCI adds NVIDIA CUDA GPUs. Whether EC2 CCIs are a practical HPC platform is best judged on a real application.
In this paper we evaluate Amazon EC2 GPU CCIs by running OpenFOAM with a GPU-accelerated pressure solver and comparing elapsed times against an in-house GPU cluster, in order to assess Amazon EC2 as an HPC platform.
2. Amazon GPU Cluster Compute Instance
Amazon EC2 offers CCIs for HPC workloads; the Cluster GPU Quadruple Extra Large Instance (cg1.4xlarge) [6] adds GPUs to the CCI family and can be launched on demand. We compare cg1.4xlarge against our in-house Infiniband GPU cluster (pcc-gpu); the specifications of both systems are listed in Table 1.

3. MPI benchmark
3.1 Benchmark environment
GPU CCIs were launched in the US East (Virginia) region; each instance has 2 GPUs. Instances are VMs created and controlled through the EC2 API. The OS image is the Cluster GPU Amazon Linux AMI 2012.03, an Amazon-provided Red Hat Enterprise Linux derivative with MPI and GPU support preinstalled. Because CUDA needs direct access to the physical GPU, Cluster GPU instances run under HVM (Hardware Virtual Machine) virtualization.
We provisioned the EC2 CCI cluster with StarCluster [7]; Cloud-Flu [8] is a similar tool that deploys OpenFOAM on EC2 VMs. OpenFOAM was built with GCC against OpenMPI. HyperThreading on the Xeon processors is visible to the guest OS (Table 1 counts physical cores). The case directories are shared over NFS on EBS (Elastic Block Store) volumes, which carry OpenFOAM's file I/O.

3.2 MPI benchmark results
We measured MPI performance with the Intel MPI Benchmarks (IMB). The solver's MPI traffic consists of point-to-point ghost-cell exchanges and small collective reductions, so we report IMB PingPong between 2 nodes and IMB Allreduce on an 8-byte message. Figure 1 shows PingPong latency for the CCI (cg1.4xlarge) and the in-house cluster (pcc-gpu): at 8 bytes the CCI latency is roughly 6 times that of the in-house cluster. Figure 2 shows Allreduce at 8 bytes over increasing node counts: the EC2 CCI's latency grows faster with node count than the in-house cluster's.

[Figure 1: IMB PingPong (2 nodes): elapsed time [μsec] vs. message size [byte]; EC2 CCI (cg1.4xlarge) vs. Inhouse Cluster (pcc-gpu)]
[Figure 2: IMB Allreduce (8 bytes): elapsed time [μsec] vs. number of nodes; EC2 CCI (cg1.4xlarge) vs. Inhouse Cluster (pcc-gpu)]
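The PingPong measurement in Figure 1 is easy to reproduce outside IMB. Listing 1 is a minimal sketch in C++ with MPI, not the IMB source: two ranks bounce a buffer and the round-trip time is halved, as IMB PingPong does; the message sizes and repetition count here are our own choices.

Listing 1: pingpong.cpp
// Minimal 2-rank latency microbenchmark (IMB-PingPong style).
// Build: mpicxx -O2 pingpong.cpp -o pingpong ; run: mpirun -np 2 ./pingpong
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;                        // repetitions per message size
    for (int size = 8; size <= (1 << 20); size *= 2) {
        std::vector<char> buf(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {                     // rank 0: send, then wait for echo
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {              // rank 1: echo the message back
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / reps / 2.0;  // one-way time, as IMB reports
        if (rank == 0)
            std::printf("%8d bytes  %10.2f usec\n", size, t * 1e6);
    }
    MPI_Finalize();
    return 0;
}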
Table 1: Specifications of the EC2 GPU Cluster Instance and the in-house GPU cluster.

                  cg1.4xlarge                            pcc-gpu
CPU               Intel Xeon X5570 2.93 GHz              AMD Opteron 6136 2.4 GHz
CPUs (cores)      2 (8 w/o HyperThreading)               2 (16)
Memory            22 GB                                  32 GB
GPUs              NVIDIA Tesla M2050 x 2                 NVIDIA Tesla M2050 x 2
Interconnect      10 Gigabit Ethernet                    Infiniband QDR
OS                Cluster GPU Amazon Linux AMI 2012.03   CentOS 6.2
Compiler          GNU GCC 4.4.6 (options: -O2 -fpic)     GNU GCC 4.4.6 (options: -O2 -fpic)
CUDA version      NVIDIA CUDA 4.2                        CUDA 4.1
MPI library       Open MPI 1.5.3                         MVAPICH2 1.7

4. OpenFOAM and the GPU solver
OpenFOAM solves the steady incompressible Navier-Stokes equations with the SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) algorithm:

  \nabla \cdot (\rho U) = 0, \qquad (U \cdot \nabla) U - \nabla \cdot (\nu \nabla U) = -\nabla P.      (1)

These are discretized with the finite volume method (FVM) [3]. For each cell (node) p the discretized momentum equation reads

  a_p U_p = H(U) - \nabla P, \qquad U_p = \frac{H(U) - \nabla P}{a_p},      (2)

where H(U) = -\sum_{n \in NEIGH(p)} a_n U_n collects the off-diagonal (neighbor) contributions and a_p is the diagonal coefficient of cell p. The divergence of U is evaluated over the cell faces,

  \nabla \cdot U = \sum_{f \in FACES} S \cdot U_f,      (3)

with S the face area vector of the FVM cell and U_f the velocity interpolated to face f. Interpolating (2) to the faces gives

  U_f = \left( \frac{H(U)}{a_p} \right)_f - \frac{(\nabla P)_f}{(a_p)_f}.      (4)

Substituting (4) into (3) and enforcing \nabla \cdot U = 0 yields the pressure equation

  \nabla \cdot \left( \frac{1}{a_p} \nabla P \right) = \sum_f S \cdot \left( \frac{H(U)}{a_p} \right)_f.      (5)

Algorithm 1 SIMPLE
1: Initialize the fields.
2: repeat
3:   Solve the momentum equations (2) for an intermediate velocity.
4:   Assemble H(U) and the pressure equation (5).
5:   Solve the pressure equation with PCG.
6:   Correct the face fluxes with (4).
7:   Correct the cell velocities with (2).
8:   Apply under-relaxation.
9: until (converged)

Discretizing (5) produces a linear system A x = b, where x = [P_1, P_2, ..., P_N] holds the cell pressures and b is the right-hand side of (5). A is symmetric, so the system is solved with the conjugate gradient (CG) method, and this solve dominates the run time of SIMPLE [4]; it is therefore the part we accelerate.

4.1 CG on the GPU
The parallel preconditioned CG (PCG) used for the pressure equation is shown in Algorithm 2: the ghost cells of the search direction p are exchanged between MPI processes around the sparse matrix-vector product (SpMV), which we offload to the GPU following the GPU-accelerated preconditioned iterative solvers of Li and Saad [2]. A serial sketch of the same iteration is given in Listing 2 below; Algorithm 2 then adds the MPI steps.
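Listing 2 is a minimal stand-in for this iteration in C++: it uses a tiny dense SPD matrix and a Jacobi (diagonal) preconditioner in place of the solver's actual sparse format and AMG preconditioner, but the update formulas are the ones in Algorithm 2.

Listing 2: pcg.cpp
// Minimal serial preconditioned CG (the structure of Algorithm 2, without MPI).
// The 4x4 SPD matrix, Jacobi preconditioner and tolerance are illustrative only.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Dense matvec stand-in: q = A p (a real solver would use a sparse format).
Vec matvec(const std::vector<Vec>& A, const Vec& p) {
    Vec q(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A.size(); ++j)
            q[i] += A[i][j] * p[j];
    return q;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    // Small SPD test system A x = b.
    std::vector<Vec> A = {{4, 1, 0, 0}, {1, 4, 1, 0}, {0, 1, 4, 1}, {0, 0, 1, 4}};
    Vec b = {1, 2, 3, 4}, x(4, 0.0);

    Vec r = b;                                     // r0 = b - A x0 with x0 = 0
    Vec z(4), p(4);
    for (int i = 0; i < 4; ++i) z[i] = r[i] / A[i][i];   // z0 = M^-1 r0 (Jacobi)
    p = z;
    double rnorm0 = std::sqrt(dot(r, r));

    for (int k = 0; k < 100; ++k) {
        Vec q = matvec(A, p);                      // q_k = A p_k
        double pq = dot(p, q);
        double alpha = dot(p, r) / pq;             // alpha_k = p.r / p.q
        for (int i = 0; i < 4; ++i) x[i] += alpha * p[i];
        for (int i = 0; i < 4; ++i) r[i] -= alpha * q[i];
        if (std::sqrt(dot(r, r)) / rnorm0 <= 1e-10) break;
        for (int i = 0; i < 4; ++i) z[i] = r[i] / A[i][i];  // z = M^-1 r
        double beta = -dot(z, q) / pq;             // beta_k = -z.q / p.q
        for (int i = 0; i < 4; ++i) p[i] = z[i] + beta * p[i];
    }
    std::printf("x = %f %f %f %f\n", x[0], x[1], x[2], x[3]);
    return 0;
}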
Algorithm 2 Parallel Preconditioned Conjugate Gradient
1: Given x_0.
2: Let r_0 = b - A x_0, z_0 = M^{-1} r_0, p_0 = z_0, k = 0.
3: repeat
4:   MPI Send ghost cells of p_k.
5:   q_k = A p_k
6:   MPI Recv ghost cells into q_k.
7:   α_k = p_k^T r_k / p_k^T q_k
8:   MPI Allreduce (SUM) for α_k.
9:   x_{k+1} = x_k + α_k p_k
10:  r_{k+1} = r_k - α_k q_k
11:  z_{k+1} = M^{-1} r_{k+1}
12:  β_k = -z_{k+1}^T q_k / p_k^T q_k
13:  MPI Allreduce (SUM) for β_k.
14:  p_{k+1} = z_{k+1} + β_k p_k
15:  k = k + 1
16: until (||r_{k+1}|| / ||r_0|| ≤ ε)

Our solver is built on CUDA ITSOL [2], whose SpMV uses the JAD (JAgged Diagonal) sparse format (a JAD SpMV sketch appears in Listing 3, after Section 4.2). The coefficient matrix A of (5) assembled by OpenFOAM is converted to JAD, and the preconditioner M is rebuilt, once per SIMPLE iteration, since SIMPLE updates the coefficients of (5) at every outer iteration.

4.1.1 AMG preconditioner
As the preconditioner M in Algorithm 2 we run an AMG (Algebraic MultiGrid) cycle on the GPU, built with the CUDA library CUSP [9] as a single-precision (float) smoothed-aggregation AMG (a CUSP usage sketch is given in Listing 4). One MPI process drives one GPU, so a node with two GPUs hosts two MPI processes; within a node, data can move between the two GPUs by P2P (peer-to-peer) copy.

4.1.2 Overlapping MPI communication
OpenFOAM overlaps MPI communication with computation. With a GPU solver, the ghost-cell values must additionally cross the PCIe bus between GPU and CPU before being sent and after being received, which adds to the MPI cost. We therefore overlap the PCIe transfers and the MPI exchange with the SpMV over the interior cells using CUDA streams (see the overlap sketch in Listing 5).

4.2 Test case
The benchmark case is steady flow in a thoracic aorta. The geometry was reconstructed from MRI images and meshed with ANSYS Gambit at three resolutions, SMALL, MEDIUM and LARGE (Table 2); Figure 3 shows the SMALL mesh. For MPI runs the mesh is partitioned with OpenFOAM's Scotch decomposition [10]; Figure 4 shows a decomposition into 4 subdomains. The simulation parameters are given in Table 3.
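The JAD format used in Section 4.1 sorts rows by decreasing nonzero count and then stores the k-th nonzero of every row contiguously, so consecutive GPU threads read consecutive val/col entries. Listing 3 is a CUDA C++ sketch of a JAD SpMV: the kernel follows the usual JAD description rather than ITSOL's actual code, and the hand-converted 3x3 matrix, array names and host driver are ours.

Listing 3: jad_spmv.cu
// SpMV q = A p in JAD (JAgged Diagonal) format. Build: nvcc -O2 jad_spmv.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spmv_jad(int nrows, int ndiags, const int* diag_ptr,
                         const int* col, const float* val, const int* perm,
                         const float* p, float* q) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // row in sorted order
    if (row >= nrows) return;
    float sum = 0.0f;
    for (int d = 0; d < ndiags; ++d) {
        int start = diag_ptr[d];
        int len = diag_ptr[d + 1] - start;    // how many rows reach diagonal d
        if (row >= len) break;                // rows are sorted by nonzero count
        int idx = start + row;                // consecutive threads -> coalesced
        sum += val[idx] * p[col[idx]];
    }
    q[perm[row]] = sum;                       // scatter back to original row order
}

int main() {
    // A = [[4,1,0],[1,4,1],[0,1,4]]; rows sorted by nonzero count: 1, 0, 2.
    const int nrows = 3, ndiags = 3;
    int   h_diag_ptr[] = {0, 3, 6, 7};        // extent of each jagged diagonal
    int   h_col[]      = {0, 0, 1, 1, 1, 2, 2};
    float h_val[]      = {1, 4, 1, 4, 1, 4, 1};
    int   h_perm[]     = {1, 0, 2};           // sorted position -> original row
    float h_p[]        = {1, 2, 3};

    int *d_diag_ptr, *d_col, *d_perm; float *d_val, *d_p, *d_q;
    cudaMalloc(&d_diag_ptr, sizeof(h_diag_ptr));
    cudaMalloc(&d_col, sizeof(h_col));
    cudaMalloc(&d_val, sizeof(h_val));
    cudaMalloc(&d_perm, sizeof(h_perm));
    cudaMalloc(&d_p, sizeof(h_p));
    cudaMalloc(&d_q, nrows * sizeof(float));
    cudaMemcpy(d_diag_ptr, h_diag_ptr, sizeof(h_diag_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_perm, h_perm, sizeof(h_perm), cudaMemcpyHostToDevice);
    cudaMemcpy(d_p, h_p, sizeof(h_p), cudaMemcpyHostToDevice);

    spmv_jad<<<1, 32>>>(nrows, ndiags, d_diag_ptr, d_col, d_val, d_perm, d_p, d_q);

    float h_q[3];
    cudaMemcpy(h_q, d_q, sizeof(h_q), cudaMemcpyDeviceToHost);
    std::printf("q = %g %g %g\n", h_q[0], h_q[1], h_q[2]);
    return 0;
}

For this 3x3 example the program prints q = 6 12 14, matching A p computed by hand.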
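The smoothed-aggregation AMG preconditioner of Section 4.1.1 can be exercised on a single GPU with CUSP alone. Listing 4 assumes a 2012-era CUSP (around version 0.3), where the preconditioner lives in cusp/precond/smoothed_aggregation.h; the Poisson test matrix is a stand-in for the aorta pressure systems, and the tolerance mirrors Table 3.

Listing 4: amg_cg.cu
// Single-GPU smoothed-aggregation AMG preconditioned CG with CUSP.
// Build (assumed CUSP ~0.3): nvcc -O2 -I/path/to/cusplibrary amg_cg.cu
#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/gallery/poisson.h>
#include <cusp/krylov/cg.h>
#include <cusp/monitor.h>
#include <cusp/precond/smoothed_aggregation.h>
#include <cstdio>

int main() {
    // Stand-in SPD system: 2D Poisson on a 256x256 grid (not the aorta mesh).
    cusp::csr_matrix<int, float, cusp::device_memory> A;
    cusp::gallery::poisson5pt(A, 256, 256);

    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0.0f);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1.0f);

    // Stop when ||r||/||r0|| <= 1e-8, as in Table 3, or after 1000 iterations.
    cusp::default_monitor<float> monitor(b, 1000, 1e-8f);

    // Single-precision smoothed-aggregation AMG hierarchy, built on the GPU.
    cusp::precond::smoothed_aggregation<int, float, cusp::device_memory> M(A);

    cusp::krylov::cg(A, x, b, monitor, M);   // AMG-preconditioned CG
    std::printf("iterations: %d\n", (int)monitor.iteration_count());
    return 0;
}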
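The stream overlap of Section 4.1.2 reduces to a small amount of ordering logic. Listing 5 is structural only: pack_boundary and spmv_interior are trivial stand-in kernels (not the solver's), and a pairwise MPI_Sendrecv stands in for OpenFOAM's neighbor exchange. The device-to-host copy of the packed ghost cells runs on one stream while the interior SpMV runs on another.

Listing 5: overlap.cu
// Overlap ghost-cell PCIe transfer + MPI with interior work via CUDA streams.
// Build: nvcc -O2 overlap.cu -lmpi ; run: mpirun -np 2 ./a.out
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pack_boundary(const float* p, const int* ids, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[ids[i]];            // gather ghost-cell values
}

__global__ void spmv_interior(const float* p, float* q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] = 2.0f * p[i];            // stand-in for the interior SpMV
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int peer = rank ^ 1;                      // pair ranks 0-1, 2-3, ...

    const int n = 1 << 20, nb = 1024;         // interior cells, ghost cells
    float *d_p, *d_q, *d_bnd; int *d_ids;
    cudaMalloc(&d_p, n * sizeof(float));   cudaMalloc(&d_q, n * sizeof(float));
    cudaMalloc(&d_bnd, nb * sizeof(float)); cudaMalloc(&d_ids, nb * sizeof(int));
    cudaMemset(d_p, 0, n * sizeof(float));    // dummy field values
    cudaMemset(d_ids, 0, nb * sizeof(int));   // dummy (valid) gather indices
    float *h_send, *h_recv;                   // pinned host buffers for MPI
    cudaMallocHost(&h_send, nb * sizeof(float));
    cudaMallocHost(&h_recv, nb * sizeof(float));

    cudaStream_t comm, comp;
    cudaStreamCreate(&comm); cudaStreamCreate(&comp);

    // 1. Pack ghost cells and start the device-to-host copy on the comm stream.
    pack_boundary<<<(nb + 255) / 256, 256, 0, comm>>>(d_p, d_ids, d_bnd, nb);
    cudaMemcpyAsync(h_send, d_bnd, nb * sizeof(float),
                    cudaMemcpyDeviceToHost, comm);

    // 2. The interior SpMV runs concurrently on the compute stream.
    spmv_interior<<<(n + 255) / 256, 256, 0, comp>>>(d_p, d_q, n);

    // 3. When the copy is done, exchange ghost cells over MPI.
    cudaStreamSynchronize(comm);
    if (peer < nprocs)
        MPI_Sendrecv(h_send, nb, MPI_FLOAT, peer, 0,
                     h_recv, nb, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 4. Upload received values; adding their contribution to q would follow.
    cudaMemcpyAsync(d_bnd, h_recv, nb * sizeof(float),
                    cudaMemcpyHostToDevice, comm);
    cudaDeviceSynchronize();
    if (rank == 0) std::printf("overlap step done\n");

    MPI_Finalize();
    return 0;
}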
Table 2: Meshes.
             SMALL        MEDIUM      LARGE
Cells        1,912,272    2,98,32     5,144,73
Faces        3,874,336    6,31,26     10,382,979
Size [MB]    155          311         543

Table 3: Simulation parameters.
Solver             simpleFoam (OpenFOAM-2.1.1)
Viscosity          ν = 3.33 x 10^-6 [m^2/s] (blood)
Inlet velocity     V = 0.263 [m/s] (Re = 3)
Initial pressure   P = 0 [Pa]
Convergence        δp ≤ 1.0 x 10^-6 and δV ≤ 1.0 x 10^-6
Linear solvers     GPU-AMG-CG / ILU-BiCG, ||r|| ≤ 1.0 x 10^-8

[Figure 5: AMG-PCG inner loop, single node: elapsed time [sec] for SMALL/MEDIUM/LARGE; series cg1.4xlarge (CPU-DIC), pcc-gpu (CPU-DIC), cg1.4xlarge (GPU-AMG), pcc-gpu (GPU-AMG)]
[Figure 6: AMG-CG inner loop on LARGE: elapsed time [sec] vs. nodes (1, 2, 4, 8); series cg1.4xlarge (CPU), cg1.4xlarge, pcc-gpu]
[Figure 7: SIMPLE outer loop on LARGE: elapsed time [sec] vs. nodes (1, 2, 4, 8); series cg1.4xlarge (CPU), cg1.4xlarge, pcc-gpu]

4.3 Results
Figure 5 compares single-node elapsed times per pressure-solver (inner) iteration on the three meshes, for the CPU DIC-CG solver and the GPU AMG-CG solver, each on the GPU CCI and on the in-house cluster. On one node the GPU AMG-CG is faster than the CPU DIC-CG on both systems. Figures 6 and 7 show, for the LARGE mesh, the elapsed time per AMG-CG inner iteration and per SIMPLE outer iteration on 1 to 8 nodes. As the node count grows, the in-house Infiniband cluster continues to gain, while on the EC2 CCI the MPI communication overhead observed in Section 3 increasingly limits the GPU speedup at 4 to 8 nodes.

5. Related work
Studies of EC2 as an HPC platform include Zhai et al. [11], who evaluated Amazon cluster compute instances against an in-house cluster using IMB and the NAS Parallel Benchmarks (NPB).

6. Conclusion
We evaluated Amazon EC2 GPU CCIs with the IMB MPI benchmarks and with OpenFOAM running a GPU-AMG-CG pressure solver, comparing against an in-house GPU cluster. EC2 GPU CCIs deliver competitive single-node performance, but EC2's interconnect limits multi-node scaling.

References
[1] Malecha Ziemowit M, Miroslaw Lukasz, Tomczak Tadeusz, Koza Zbigniew, Matyka Maciej, Tarnawski Wojciech, Szczerba Dominik. GPU-based simulation of 3D blood flow in abdominal aorta using OpenFOAM. Archives of Mechanics, 2011, vol. 63, no. 2, pp. 137-161.
[2] R. Li, Y. Saad. GPU-accelerated preconditioned iterative linear solvers. Report umsi-2010-112, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 2010.
[3] The SIMPLE algorithm in OpenFOAM - OpenFOAMWiki. http://openfoamwiki.net/index.php/The_SIMPLE_algorithm_in_OpenFOAM
[4] J.H. Ferziger, M. Peric. Computational Methods for Fluid Dynamics. Springer-Verlag Berlin, Heidelberg, 1996.
[5] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Co., Boston, MA, 2000.
[6] Amazon: EC2 Instance Types (online). https://aws.amazon.com/ec2/instance-types/
[7] StarCluster. http://web.mit.edu/star/cluster/
[8] Alexey Petrov, Andrey Simurzin. CloudFlu. http://sourceforge.net/apps/mediawiki/cloudflu/index.php?title=Main_Page
[9] Nathan Bell and Michael Garland. Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations, 2012. http://cusplibrary.googlecode.com
[10] F. Pellegrini and J. Roman. SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs. Proceedings of HPCN'96, Brussels, Belgium. LNCS 1067, pages 493-498. Springer, April 1996. www.labri.fr/perso/pelegrin/scotch/
[11] Yan Zhai, Mingliang Liu, Jidong Zhai, Xiaosong Ma, and Wenguang Chen. Cloud versus in-house cluster: evaluating Amazon cluster compute instances for running MPI applications. In State of the Practice Reports (SC '11), ACM, New York, NY, USA, Article 11, 10 pages, 2011.