Evaluation of GPU Computing Based on an Automatic Program Generation Technology

Makoto Sugawara,†1 Katsuto Sato,†1 Kazuhiko Komatsu,†2 Hiroyuki Takizawa†1,†3 and Hiroaki Kobayashi†2,†3

†1 Graduate School of Information Sciences, Tohoku University
†2 Cyberscience Center, Tohoku University
†3 Japan Science and Technology Agency, Core Research for Evolutional Science and Technology

Recently, heterogeneous computing systems that achieve high-performance computing by using Graphics Processing Units (GPUs) as accelerators draw much attention in the area of computational science. However, a problem in the use of GPUs is that an existing program must be ported to a GPU program. To reduce the porting effort, this paper focuses on a technology that automatically generates a GPU program by inserting directives into an existing sequential code, and evaluates the sustained performance of the automatically generated program. In addition, we show the code optimizations achievable by using directives. A simple matrix multiplication program is used for the evaluation to demonstrate that the automatically generated code can achieve a high sustained performance.

1. Introduction

Heterogeneous computing systems that use Graphics Processing Units (GPUs) as accelerators have recently attracted much attention in computational science. To program NVIDIA GPUs, the Compute Unified Device Architecture (CUDA)1) and the Open Computing Language (OpenCL)2) are widely used. However, porting an existing application to CUDA or OpenCL requires a large programming effort, because the programmer must rewrite the code for the GPU programming model and explicitly manage the data transfers between the CPU and the GPU. To reduce this burden, directive-based automatic program generation technologies such as CAPS HMPP3) have been proposed, in which a GPU program is generated simply by inserting directives into an existing sequential code. This paper evaluates the sustained performance of the code automatically generated in this way.

2. GPU Programming Environments

2.1 OpenCL

OpenCL2) is an open standard for parallel programming of heterogeneous computing systems defined by the Khronos Group. This section briefly reviews the OpenCL platform, execution, and memory models.
Figure 1 illustrates the OpenCL platform model. An OpenCL platform consists of a host and one or more OpenCL devices, such as GPUs. Each device is composed of Compute Units, and each Compute Unit is composed of Processing Elements.

Fig. 1  OpenCL platform model.

OpenCL adopts an SPMD (Single-Program, Multiple-Data) execution model. A kernel is executed in parallel by work-items, which are grouped into work-groups over an index space called an NDRange. In the SPMD style, every work-item executes the same kernel code; each work-item obtains its own global ID, work-group ID, and local ID, and uses these IDs to determine the data that it processes.

The OpenCL memory model defines several distinct memory regions: Private Memory, which is private to each work-item; Local Memory, which is shared by the work-items in a work-group and can be used as a software-managed cache; and Global Memory, which is accessible from all work-items.

2.2 HMPP

Several directive-based programming environments that generate accelerator code from C and Fortran programs have been proposed4),5),6). Among them, this paper uses the Hybrid Multicore Parallel Programming workbench (HMPP)3) provided by CAPS. In HMPP, the programmer annotates a function to be offloaded, called a codelet, with a directive. The HMPP compiler then generates the accelerator code for the codelet together with the code that manages the data transfers between the CPU and the accelerator. Because the directives are ignored by a standard compiler, the same source code can still be compiled and executed on the CPU when no accelerator is available. In this paper, OpenCL is used as the code generation target of HMPP.
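To make this programming style concrete, the following minimal sketch marks a function as a codelet and invokes it at a call site. It is an illustrative assumption rather than code from this paper: the function vec_add and its arguments are hypothetical, only the codelet directive form (also used later in Fig. 3) appears in the paper, and the callsite directive is assumed to follow the usual HMPP usage.

#include <stdio.h>

/* Hypothetical example: vec_add is an illustrative codelet, not code from
   this paper. The codelet directive follows the form shown in Fig. 3; the
   callsite directive is assumed to follow the usual HMPP usage. */
#pragma hmpp vadd codelet, target=opencl, args[c].io=inout
void vec_add(int n, float a[n], float b[n], float c[n])
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];     /* loop to be offloaded to the accelerator */
}

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
        c[i] = 0.0f;
    }

    /* The callsite directive asks HMPP to execute the codelet on the
       accelerator and to generate the host-device data transfers. */
    #pragma hmpp vadd callsite
    vec_add(N, a, b, c);

    printf("c[10] = %f\n", c[10]);   /* expected: 30.0 */
    return 0;
}

When this file is compiled by an ordinary C compiler, the pragmas are simply ignored and the loop runs on the CPU, which corresponds to the CPU fallback behavior described above.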
HMPP directives are written in a form similar to OpenMP directives7). In addition, whereas an OpenCL program must explicitly allocate device memory, transfer data between the CPU and the GPU, and launch the kernel, HMPP generates such host code automatically from the directives. In this paper, the OpenCL backend of HMPP is used so that the generated GPU code can be compared directly with a hand-written OpenCL code.

3. Implementations of Matrix Multiplication with HMPP and OpenCL

3.1 OpenCL Implementation

In OpenCL, a computation to be offloaded to the GPU is written as a kernel in OpenCL C, an extension of the C language. Fig. 2 shows the OpenCL kernel of the matrix multiplication used in this paper. Each work-item computes one element of matrix C. On lines 7 and 8, a work-item obtains its global IDs in the two dimensions of the NDRange, and these IDs determine the element of C that the work-item computes.

 1 ////////////////////////////////////////////////////////////////////////////
 2 // MatrixMul : C = alpha * A * B + beta * C
 3 // m is A's width, n is A's height and k is B's height
 4 ////////////////////////////////////////////////////////////////////////////
 5 __kernel void MatrixMul( int m, int n, __global float* A, __global float* B, __global float* C, float alpha, float beta )
 6 {
 7     int i = get_global_id(0);   // work-item ID in dimension 0
 8     int j = get_global_id(1);   // work-item ID in dimension 1
 9     int l;                      // Induction variable
10     float AB = 0.0f;            // Temporary result
11     for( l = 0; l < n; ++l ){
12         AB += A[ j*m + l ] * B[ l*n + i ];
13     }
14     C[ j*m + i ] = alpha * AB + beta * C[ j*m + i ];
15 }

Fig. 2  OpenCL kernel of the matrix multiplication.

In contrast, to generate a GPU code with HMPP, it is sufficient to insert the single directive shown on line 5 of Fig. 3 into the sequential C function. The host code that allocates device memory, transfers the matrices, and launches the generated kernel is also produced by HMPP, whereas in OpenCL the programmer must write it explicitly, as illustrated by the host-side sketch below.
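For comparison with the directive-based approach, the following sketch outlines the host code that a programmer has to write to launch the kernel of Fig. 2 through the OpenCL API. It is an illustrative example, not code from this paper: the square matrix size N, the initialization values, and the omission of error checking and resource release are simplifying assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* OpenCL C source of a kernel equivalent to Fig. 2 (square matrices). */
static const char *src =
    "__kernel void MatrixMul(int m, int n, __global const float* A,\n"
    "                        __global const float* B, __global float* C,\n"
    "                        float alpha, float beta)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    int j = get_global_id(1);\n"
    "    float AB = 0.0f;\n"
    "    for (int l = 0; l < n; ++l)\n"
    "        AB += A[j*m + l] * B[l*n + i];\n"
    "    C[j*m + i] = alpha*AB + beta*C[j*m + i];\n"
    "}\n";

int main(void)
{
    enum { N = 1024 };                       /* assumed square matrix size */
    size_t bytes = sizeof(float) * N * N;
    float *A = malloc(bytes), *B = malloc(bytes), *C = malloc(bytes);
    float alpha = 1.0f, beta = 0.0f;
    for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    /* Select a GPU device, create a context and a command queue. */
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel from source. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel  k    = clCreateKernel(prog, "MatrixMul", NULL);

    /* Allocate device buffers and copy the input matrices. */
    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, bytes, A, NULL);
    cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, bytes, B, NULL);
    cl_mem dC = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, bytes, C, NULL);

    int m = N, n = N;
    clSetKernelArg(k, 0, sizeof(int),    &m);
    clSetKernelArg(k, 1, sizeof(int),    &n);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dA);
    clSetKernelArg(k, 3, sizeof(cl_mem), &dB);
    clSetKernelArg(k, 4, sizeof(cl_mem), &dC);
    clSetKernelArg(k, 5, sizeof(float),  &alpha);
    clSetKernelArg(k, 6, sizeof(float),  &beta);

    /* 2-D NDRange: one work-item per element of C, 16x16 work-groups. */
    size_t gsize[2] = { N, N };
    size_t lsize[2] = { 16, 16 };
    clEnqueueNDRangeKernel(q, k, 2, NULL, gsize, lsize, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);

    printf("C[0] = %f\n", C[0]);             /* expected: 2048.0 */
    return 0;                                /* resource release omitted */
}

Even for this simple kernel, the host side amounts to several dozen lines of boilerplate, which is exactly the part that HMPP generates automatically from the codelet directive.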
 1 ////////////////////////////////////////////////////////////////////////////
 2 // MatrixMul : C = alpha * A * B + beta * C
 3 // m is A's width, n is A's height and k is B's height
 4 ////////////////////////////////////////////////////////////////////////////
 5 #pragma hmpp MatrixMul codelet, target=opencl, args[C].io=inout
 6 void MatrixMul( int m, int n, int k, float* A, float* B, float* C, float alpha, float beta )
 7 {
 8     int i, j, l;    // Induction variables
 9     float AB;       // Temporary result
10     for( j = 0; j < m; j++ ) {
11         for( i = 0; i < k; i++ ) {
12             AB = 0.0f;
13             for( l = 0; l < n; l++ ){
14                 AB += A[ j*m + l ] * B[ l*n + i ];
15             }
16             C[ j*m + i ] = alpha * AB + beta * C[ j*m + i ];
17         }
18     }
19 }

Fig. 3  Sequential C function of the matrix multiplication annotated with an HMPP codelet directive.

3.2 Optimizations for GPUs

To achieve high sustained performance on a GPU, the memory access pattern of the generated kernel must fit the GPU architecture. On NVIDIA GPUs, the global memory accesses of work-items are coalesced into a small number of memory transactions only when neighboring work-items access contiguous addresses; on the Tesla architecture, such accesses are issued for groups of 16 work-items at a time8),9). The loop nest of the matrix multiplication therefore has to be mapped to the NDRange so that consecutive work-items access consecutive elements of the matrices. In addition, blocking is effective: the matrices are divided into 16x16 blocks, each block is staged into the Local Memory used as a software-managed cache, and the work-items of a work-group reuse the staged elements8). With HMPP, such transformations can be applied to the generated kernel through additional code-generation directives, without rewriting the original C code by hand. An illustrative blocked kernel is sketched below.
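The following kernel is an illustrative sketch of the 16x16 blocking described above, written here as a hand-coded OpenCL kernel; it is neither the code generated by HMPP nor code from this paper. It assumes square n x n matrices whose size is a multiple of 16 and a 16x16 work-group, so that each work-group stages one tile of A and one tile of B into the Local Memory at a time.

/* Illustrative blocked (tiled) matrix multiplication kernel.
   Assumptions: square n x n matrices, n a multiple of 16, 16x16 work-groups. */
#define BS 16

__kernel void MatrixMulBlocked(int n,
                               __global const float* A,
                               __global const float* B,
                               __global float* C,
                               float alpha, float beta)
{
    __local float Asub[BS][BS];     /* tile of A in Local Memory */
    __local float Bsub[BS][BS];     /* tile of B in Local Memory */

    int i  = get_global_id(0);      /* column of C computed by this work-item */
    int j  = get_global_id(1);      /* row of C computed by this work-item */
    int li = get_local_id(0);
    int lj = get_local_id(1);

    float AB = 0.0f;
    for (int t = 0; t < n; t += BS) {
        /* Cooperatively stage one 16x16 tile of A and of B. */
        Asub[lj][li] = A[j * n + (t + li)];
        Bsub[lj][li] = B[(t + lj) * n + i];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Each work-item reuses the staged tiles BS times. */
        for (int l = 0; l < BS; ++l)
            AB += Asub[lj][l] * Bsub[l][li];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[j * n + i] = alpha * AB + beta * C[j * n + i];
}

Because every element staged in the Local Memory is read 16 times by the work-items of a work-group, the number of Global Memory accesses is reduced by roughly a factor of 16 compared with the kernel of Fig. 2.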
4. Performance Evaluation

Table 1 summarizes the evaluation environment. Two GPUs, an NVIDIA Tesla C1060 and an NVIDIA Tesla C2070, are used as accelerators, and the matrix multiplication described in Section 3 is executed on each of them.

Table 1  Evaluation environment.
  CPU       Intel Core i7 920
  GPU       NVIDIA Tesla C1060 / NVIDIA Tesla C2070
  OS        CentOS 5.5 (Linux 2.6.18)
  Compiler  GCC 4.1.2
  HMPP      version 2.4.0

The following implementations are compared: the CPU code parallelized with OpenMP (OpenMP), GotoBLAS10) on the CPU (GotoBLAS), CUBLAS11) on the GPU (CUBLAS), a hand-written OpenCL code (OpenCL), and the code automatically generated by HMPP (HMPP), the latter both with blocking (blocking) and without blocking (unblocking). The OpenCL and HMPP versions use 16x16 work-groups, and the OpenMP version uses 8 threads. The matrix size is varied from 256 in steps of 256 (256, 512, 768, 1024, and larger sizes).

Figures 6 and 7 show the sustained performance on the Core i7 + Tesla C1060 system and on the Core i7 + Tesla C2070 system, respectively.

Fig. 6  Sustained performance on the Core i7 + Tesla C1060 system.
Fig. 7  Sustained performance on the Core i7 + Tesla C2070 system.

As shown in Fig. 7, the code generated by HMPP achieves up to about 73 times the performance of the OpenMP-parallelized CPU code on the Tesla C2070, and Fig. 6 shows that the HMPP and OpenCL codes achieve up to about 55 times on the Tesla C1060. On the Tesla C2070, the HMPP-generated code also outperforms GotoBLAS running on the CPU.
On the Tesla C1060, the difference between the OpenCL code and GotoBLAS on the CPU is about 20%. GotoBLAS is a highly optimized BLAS implementation for the CPU, and its performance is thus a demanding baseline for the GPU implementations, including the HMPP-generated code. Compared with CUBLAS, the HMPP-generated code is slower; CUBLAS is a library hand-tuned for NVIDIA GPUs, and its matrix multiplication routine is far more aggressively optimized than the automatically generated kernel.

On both the Tesla C1060 and the Tesla C2070, the HMPP-generated code achieves performance comparable to that of the hand-written OpenCL code. The kernel generated by HMPP contains additional for and if statements that do not appear in the hand-written OpenCL kernel, which introduces a small overhead, but the automatically generated OpenCL code can still exploit the GPU effectively.

5. Conclusions

This paper has evaluated GPU computing based on an automatic program generation technology. Using a simple matrix multiplication program, the GPU code generated by HMPP from a sequential C code was compared with an OpenMP-parallelized CPU code, a hand-written OpenCL code, CUBLAS on the GPU, and GotoBLAS on the CPU. The evaluation results show that the code automatically generated from a sequential code annotated with directives can achieve a high sustained performance comparable to that of the hand-written OpenCL code.

Acknowledgments  The authors would like to thank JCC and CAPS. This work was partially supported by a Grant-in-Aid (B) (No. 23700028) and by JST CREST.
References
1) NVIDIA Corporation. NVIDIA CUDA Programming Guide 3.0, 2010.
2) Khronos OpenCL Working Group. The OpenCL Specification, Version 1.1.
3) R. Dolbeau et al. HMPP: A Hybrid Multicore Parallel Programming Environment. Workshop on General Purpose Processing on Graphics Processing Units (GPGPU 2007), 2007.
4) The Portland Group. PGI Accelerator Programming Model for Fortran & C. http://www.softek.co.jp/spg/pgi/accel/, 2010.
5) Seyong Lee and R. Eigenmann. OpenMPC: Extended OpenMP Programming and Tuning for GPUs. Proceedings of SC '10, pp. 1-11, Nov. 2010.
6) T. D. Han and T. S. Abdelrahman. hiCUDA: High-Level GPGPU Programming. IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 1, pp. 78-90, Jan. 2011.
7) OpenMP.org. OpenMP Application Program Interface. http://openmp.org/wp/, 2008.
8) NVIDIA Corporation. NVIDIA OpenCL Best Practices Guide 2.3, 2009.
9) Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, Vol. 28, pp. 39-55, 2008.
10) Texas Advanced Computing Center. GotoBLAS. http://www.tacc.utexas.edu/.
11) NVIDIA Corporation. CUDA Toolkit 4.0 CUBLAS Library. http://developer.nvidia.com/nvidia-gpu-computing-documentation, 2011.