Vol.214-HPC-145 No.45 214/7/3 OpenACC 1 3,1,2 1,2 GPU CUDA OpenCL OpenACC OpenACC High-level OpenACC CPU Intex Xeon Phi K2X GPU Intel Xeon Phi 27% K2X GPU 24% 1. TSUBAME2.5 CPU GPU CUDA OpenCL CPU OpenMP 1 Tokyo Institute of Technology 2 CREST JST CREST 3 RIKEN AICS c 214 Information Processing Society of Japan OpenACC[6] OpenACC CPU CPU CPU GPU ([3]) CUDA OpenACC Array of Structure (AoS) Structure of Array (SoA) - OpenACC 1
Vol.214-HPC-145 No.45 214/7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] clause] ] structured block!$acc end directive-name 1 OpenACC 2. Shuai [1] CUDA OpenCL API Dymaxion++ Dymaxion++ 2 Reshape P lace Reshape 3 trasnpose diagonal indirect PCI-E P lace GPU on-chip High-level 4. 2.1 OpenACC CUDA GPU CUDA OpenCL OpenACC OpenACC NVIDIA Cray PGI C/C++ Fortran OpenMP GPU CUDA OpenCL GPU GPU OpenACC GPU OpenACC hmpp[2], PGI [9] OpenMP CUDA OpenMPC[5] OpenACC 1 data kernels 3. Sung [8] GPU Array-of-Structure-of-Tiled-Array(ASTA) CUDA OpenCL Low-level Array of Structures ASTA High-level c 214 Information Processing Society of Japan 4.1 2 1 CPU TSUBAME CPU2 ( 12 ) Intel Xeon Phi 24 Xeon Phi KMP AFFINITY=compact OpenMP 4.1.1 Intel CPU, Intel Xeon Phi, NVIDIA Kepler [4] OpenACC 1 2 ( 5) 2 row major 5 2 Structure of Arrays (SoA) 2 Array of Structures (AoS) CPU, Xeon Phi, Kepler 2
Vol.214-HPC-145 No.45 214/7/3 CPU GPU MIC 1 Intel Xeon X567 6cores 2.93 GHz 2 sockets 54 GB Memory NVIDIA Kepler K2X 2688 CUDA cores 6GB Memory Intel Xeon Phi 712X 61 cores 16GB Memory MB/s 35 3 25 2 15 1 5 AoS Copy: AoS Scale: AoS Add: AoS Triad: SoA Copy: SoA Scale: 2 CPU icc -O3 -openmp GPU pgcc -O3 -ta=nvidia,cc35,kepler MIC icc -O3 -mmic -openmp -opt-prefetch-distance=4,1 -opt-streaming-stores always -opt-streaming-cache-evict= 2, 3, 4 Copy : C = A Scale : B = scalar C Add : C = A+B T riad : A = B +scalar C A B C 1M double 2 SoA Array AoS Structure GPU AoS 16 9% MIC SoA AoS CPU AoS 4.1.2 [11] C static allocation OpenMP GPU OpenACC 4 P MIC [1] 4 a, b, c c 214 Information Processing Society of Japan 2 3 5 MB/s MB/s 4 1 1 1 AoS: Structure 要素数 SoA: Array の本数 SoA Add: SoA Triad: Intel Xeon CPU 12 18 16 14 12 1 8 6 4 2 1 1 1 AoS: Structure 要素数 SoA: Array の本数 AoS Copy: AoS Scale: AoS Add: AoS Triad: SoA Copy: SoA Scale: SoA Add: SoA Triad: Intel Xeon Phi 24 2 18 16 14 12 1 8 6 4 2 1 1 1 AoS: Structure 要素数 SoA: Array の本数 AoS Copy: AoS Scale: AoS Add: AoS Triad: SoA Copy: SoA Scale: SoA Add: SoA Triad: NVIDIA K2X GPU 1 2 3 4 M SoA 1 3 2 4 M オリジナル AoS 1 2 3 4 M AoS SoA row major 6 SOAOS 4 3 4 AoS 3 3
Vol.214-HPC-145 No.45 214/7/3 #if SOAOS static float a[mimax][mjmax][mkmax][4], b[mimax][mjmax][mkmax][3], c[mimax][mjmax][mkmax][3]; #endif #if SOSOA static float a[4][mimax][mjmax][mkmax], b[3][mimax][mjmax][mkmax], c[3][mimax][mjmax][mkmax]; #endif #if AOS static float abc[mimax][mjmax][mkmax][1]; #endif #if SOA static float abc[1][mimax][mjmax][mkmax]; #endif 6 8 7 6 5 SOAOS GFlops 4 3 2 1 OMP MIC GPU SOSOA AOS SOA 7 AoS Structure of Array of Structure (SOAOS) 3 AoS SoA SOSOA 3 AoS SoA 7 OpenMP 12 MIC 24 GPU PGI 64 4 OpenMP MIC GPU 1% 4% 4.1.3 UPACS UPACS c 214 Information Processing Society of Japan 8 ( ) ( ) UPACS 2 (Convection) (Viscosity) 8 9 cellfacetype UPACS 25.% 37.7% 2 UPACS 9 1, 11 AoS, SoA CPU GPU 12, 13 AoS UPACS cellfacetype CPU GPU 12, 13 GPU SoA 4
Vol.214-HPC-145 No.45 214/7/3 type cellfacetype real(8) end type real(8), dimension(3) :: real(8), dimension(5) :: :: area, nt nv q_r, q_l, flux real(8) :: shockfix type(cellfacetype), dimension(:,:,:), pointer :: cface allocate(cface(-1:in+1,-1:jn+1,-1:kn+1)) 9 real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: area,nt real(8),dimension(3,-1:in+1,-1:jn+1,-1:kn+1) :: nv real(8),dimension(5,-1:in+1,-1:jn+1,-1:kn+1) :: q_r,q_l,flux real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: shockfix 1 Array of Structures real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: area,nt real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1,3) :: nv real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1,5) :: q_r,q_l,flux real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: shockfix Elapsed )me[sec] 12 Elapsed )me[sec] 13 11 1.9.8.7.6.5.4.3.2.1.35.3.25.2.15.1.5 Structure of Arrays Original AoS SoA Viscosity Convec?on UPACS CPU 5. Original AoS SoA Viscosity Convec:on UPACS GPU c 214 Information Processing Society of Japan CPU - OpenACC CPU 5.1 (1) (2) 2 (1) acc data acc kernels acc data acc data - acc kernels 5
Vol.214-HPC-145 No.45 214/7/3 #pragma acc trans transpose array_name \ [start : length][start : length][start : length], [1, 3, 2] structured block 14 acc trans (2) acc loop acc loop GPU gang, worker, vector clause 5.2 acc trans( 14) acc trans (1) (2) (3) (optional) (2) C 1 (3) array[i][j][k] 14 [1,3,2] 1 1 2 3 3 2 array[i][j][k] array[i][k][j] 6. acc trans -to- ROSE Compiler Infrastructure[7] ( 1 ) Rose c 214 Information Processing Society of Japan #pragma bcc trans transpose(foo_a[:1][:1][:3],[1,3,2]) #pragma acc data copy (foo_a[:1][:1][:3], \ foo_b[:1][:1][:3]) #pragma acc kernels #pragma acc loop gang independent for(k = ;k < 1;k++) #pragma acc loop vector independent for(j = ;j < 1;j++) for(i = ;i < 3;i++) foo_b[k][j][i] = foo_a[k][j][i]; 15 acc trans AST ( 2 ) #pragma acc trans ( 3 ) ( 4 ) 1 ( 5 ) / ( 6 ) #pragma acc trans acc data ( 7 ) 15 16 acc data data acc trans 7. 6 SOAOS AOS 6
Vol.214-HPC-145 No.45 214/7/3 #pragma bcc trans transpose \ ( foo_a [ : 1 ] [ : 1 ] [ : 3 ], [ 1, 3, 2 ] ) double *foo_a_generated 1_3_2; foo_a_generated 1_3_2 = ((void *)\ (malloc(sizeof(double ) * 1 * 1 * 3))); transpos_foo_a_1_3_2(((double *)\ foo_a_generated 1_3_2),((double *)foo_a)); MFlops 84 82 8 78 76 74 A[i][j][k][4], B[i][j][k][3], C[i][j][k][3] ABC[i][j][k][1] #pragma acc data copy (foo_a_generated 1_3_2[:1 * 1 * 3]\, foo_b[:1][:1][:3]) #pragma acc kernels #pragma acc loop gang independent for (k = ; k < 1; k++) #pragma acc loop vector independent for (j = ; j < 1; j++) for (i = ; i < 3; i++) foo_b[k][j][i] = foo_a_generated 1_3_2 \ [(( * 1 + k) * 3 + i) * 1 + j]; retranspos_foo_a_1_3_2(((double *)foo_a), \ ((double *)foo_a_generated 1_3_2)); free(foo_a_generated 1_3_2); 16 acc trans 2 CPU, Xeon Phi, GPU 17, 18, 19 4 24 Xeon Phi K2X GPU [1,2,4,3] A[I][J][K][4] A[I][J][4][K] Xeon Phi GPU 27% 24% CPU acc trans Xeon Phi 7 7GFlops static c 214 Information Processing Society of Japan 17 MFlops MFlops 72 7 [1234] A[i][j][k][4] [1243] A[i][j][4][k] [1423] A[i][4][j][k] [4123] A[4][i][j][k] on CPU ( ) 4 35 3 25 2 15 1 5 45 4 35 3 25 2 15 1 5 [1234] A[i][j][k][4] 18 [1234] A[i][j][k][4] 19 [1243] A[i][j][4][k] [1423] A[i][4][j][k] [4123] A[4][i][j][k] on Intel Xeon Phi [1243] A[i][j][4][k] [1423] A[i][4][j][k] [4123] A[4][i][j][k] on K2X A[i][j][k][4], B[i][j][k][3], C[i][j][k][3] ABC[i][j][k][1] A[i][j][k][4], B[i][j][k][3], C[i][j][k][3] ABC[i][j][k][1] 1 UPACS 8. OpenACC Intel 7
Xeon Phi 27% K2X GPU 24% Array of Structures, Structure of Arrays Vol.214-HPC-145 No.45 214/7/3 [1] Che, S., Sheaffer, J. W. and Skadron, K.: Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems, Proceedings of 211 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 11, New York, NY, USA, ACM, pp. 13:1 13:11 (online), DOI: 1.1145/263384.26341 (211). [2] Dolbeau, R., Bihan, S. and Bodin, F.: A Hybrid Multicore Parallel Programming Environment, High Performance Computing (Valero, M., Joe, K., Kitsuregawa, M. and Tanaka, H., eds.), Lecture Notes in Computer Science, Vol. 194, Springer Berlin / Heidelberg, pp. 182 19 (27). [3] Hoshino, T., Maruyama, N., Matsuoka, S. and Takaki, R.: CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application, Cluster Computing and the Grid, IEEE International Symposium on, Vol., pp. 136 143 (online), DOI: http://doi.ieeecomputersociety.org/1.119/ccgrid.213.12 (213). [4] in High Performance Computers, S. S. M. B.: http://www.cs.virginia.edu/stream/. [5] Lee, S. and Eigenmann, R.: OpenMPC: Extended OpenMP Programming and Tuning for GPUs, Proceedings of the 21 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 1, Washington, DC, USA, IEEE Computer Society, pp. 1 11 (online), DOI: 1.119/SC.21.36 (21). [6] OpenACC-standard.org: The OpenACC Application Programming Interface, (online), available from http://www.openacc.org/sites/default/files/openacc.1..pdf (211). [7] Schordan, M. and Quinlan, D.: A Source-To-Source Architecture for User-Defined Optimizations, Modular Programming Languages (Böszörményi, L. and Schojer, P., eds.), Lecture Notes in Computer Science, Vol. 2789, Springer Berlin Heidelberg, pp. 214 223 (23). [8] Sung, I.-J., Liu, G. and Hwu, W.-M.: DL: A data layout transformation system for heterogeneous computing, Innovative Parallel Computing (InPar), 212, pp. 1 11 (online), DOI: 1.119/InPar.212.633966 (212). [9] Wolfe, M.: Implementing the PGI Accelerator model, Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU 1, New York, NY, USA, ACM, pp. 43 5 (online), DOI: http://doi.acm.org/1.1145/1735688.1735697 (21). [1] Xeon Phi (213). [11] http://accc.riken.jp/2145.htm. c 214 Information Processing Society of Japan 8