Vol.214-HPC-145 No /7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] c

Size: px

Start display at page:

Download "Vol.214-HPC-145 No /7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] c"

ようたえんの
7 years ago
Views:

1 Vol.214-HPC-145 No /7/3 OpenACC 1 3,1,2 1,2 GPU CUDA OpenCL OpenACC OpenACC High-level OpenACC CPU Intex Xeon Phi K2X GPU Intel Xeon Phi 27% K2X GPU 24% 1. TSUBAME2.5 CPU GPU CUDA OpenCL CPU OpenMP 1 Tokyo Institute of Technology 2 CREST JST CREST 3 RIKEN AICS c 214 Information Processing Society of Japan OpenACC[6] OpenACC CPU CPU CPU GPU ([3]) CUDA OpenACC Array of Structure (AoS) Structure of Array (SoA) - OpenACC 1

2 Vol.214-HPC-145 No /7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] clause] ] structured block!$acc end directive-name 1 OpenACC 2. Shuai [1] CUDA OpenCL API Dymaxion++ Dymaxion++ 2 Reshape P lace Reshape 3 trasnpose diagonal indirect PCI-E P lace GPU on-chip High-level OpenACC CUDA GPU CUDA OpenCL OpenACC OpenACC NVIDIA Cray PGI C/C++ Fortran OpenMP GPU CUDA OpenCL GPU GPU OpenACC GPU OpenACC hmpp[2], PGI [9] OpenMP CUDA OpenMPC[5] OpenACC 1 data kernels 3. Sung [8] GPU Array-of-Structure-of-Tiled-Array(ASTA) CUDA OpenCL Low-level Array of Structures ASTA High-level c 214 Information Processing Society of Japan CPU TSUBAME CPU2 ( 12 ) Intel Xeon Phi 24 Xeon Phi KMP AFFINITY=compact OpenMP Intel CPU, Intel Xeon Phi, NVIDIA Kepler [4] OpenACC 1 2 ( 5) 2 row major 5 2 Structure of Arrays (SoA) 2 Array of Structures (AoS) CPU, Xeon Phi, Kepler 2

3 Vol.214-HPC-145 No /7/3 CPU GPU MIC 1 Intel Xeon X567 6cores 2.93 GHz 2 sockets 54 GB Memory NVIDIA Kepler K2X 2688 CUDA cores 6GB Memory Intel Xeon Phi 712X 61 cores 16GB Memory MB/s AoS Copy: AoS Scale: AoS Add: AoS Triad: SoA Copy: SoA Scale: 2 CPU icc -O3 -openmp GPU pgcc -O3 -ta=nvidia,cc35,kepler MIC icc -O3 -mmic -openmp -opt-prefetch-distance=4,1 -opt-streaming-stores always -opt-streaming-cache-evict= 2, 3, 4 Copy : C = A Scale : B = scalar C Add : C = A+B T riad : A = B +scalar C A B C 1M double 2 SoA Array AoS Structure GPU AoS 16 9% MIC SoA AoS CPU AoS [11] C static allocation OpenMP GPU OpenACC 4 P MIC [1] 4 a, b, c c 214 Information Processing Society of Japan MB/s MB/s AoS: Structure 要素数 SoA: Array の本数 SoA Add: SoA Triad: Intel Xeon CPU AoS: Structure 要素数 SoA: Array の本数 AoS Copy: AoS Scale: AoS Add: AoS Triad: SoA Copy: SoA Scale: SoA Add: SoA Triad: Intel Xeon Phi AoS: Structure 要素数 SoA: Array の本数 AoS Copy: AoS Scale: AoS Add: AoS Triad: SoA Copy: SoA Scale: SoA Add: SoA Triad: NVIDIA K2X GPU M SoA M オリジナル AoS M AoS SoA row major 6 SOAOS AoS 3 3

4 Vol.214-HPC-145 No /7/3 #if SOAOS static float a[mimax][mjmax][mkmax][4], b[mimax][mjmax][mkmax][3], c[mimax][mjmax][mkmax][3]; #endif #if SOSOA static float a[4][mimax][mjmax][mkmax], b[3][mimax][mjmax][mkmax], c[3][mimax][mjmax][mkmax]; #endif #if AOS static float abc[mimax][mjmax][mkmax][1]; #endif #if SOA static float abc[1][mimax][mjmax][mkmax]; #endif SOAOS GFlops OMP MIC GPU SOSOA AOS SOA 7 AoS Structure of Array of Structure (SOAOS) 3 AoS SoA SOSOA 3 AoS SoA 7 OpenMP 12 MIC 24 GPU PGI 64 4 OpenMP MIC GPU 1% 4% UPACS UPACS c 214 Information Processing Society of Japan 8 ( ) ( ) UPACS 2 (Convection) (Viscosity) 8 9 cellfacetype UPACS 25.% 37.7% 2 UPACS 9 1, 11 AoS, SoA CPU GPU 12, 13 AoS UPACS cellfacetype CPU GPU 12, 13 GPU SoA 4

5 Vol.214-HPC-145 No /7/3 type cellfacetype real(8) end type real(8), dimension(3) :: real(8), dimension(5) :: :: area, nt nv q_r, q_l, flux real(8) :: shockfix type(cellfacetype), dimension(:,:,:), pointer :: cface allocate(cface(-1:in+1,-1:jn+1,-1:kn+1)) 9 real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: area,nt real(8),dimension(3,-1:in+1,-1:jn+1,-1:kn+1) :: nv real(8),dimension(5,-1:in+1,-1:jn+1,-1:kn+1) :: q_r,q_l,flux real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: shockfix 1 Array of Structures real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: area,nt real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1,3) :: nv real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1,5) :: q_r,q_l,flux real(8),dimension(-1:in+1,-1:jn+1,-1:kn+1) :: shockfix Elapsed )me[sec] 12 Elapsed )me[sec] Structure of Arrays Original AoS SoA Viscosity Convec?on UPACS CPU 5. Original AoS SoA Viscosity Convec:on UPACS GPU c 214 Information Processing Society of Japan CPU - OpenACC CPU 5.1 (1) (2) 2 (1) acc data acc kernels acc data acc data - acc kernels 5

6 Vol.214-HPC-145 No /7/3 #pragma acc trans transpose array_name \ [start : length][start : length][start : length], [1, 3, 2] structured block 14 acc trans (2) acc loop acc loop GPU gang, worker, vector clause 5.2 acc trans( 14) acc trans (1) (2) (3) (optional) (2) C 1 (3) array[i][j][k] 14 [1,3,2] array[i][j][k] array[i][k][j] 6. acc trans -to- ROSE Compiler Infrastructure[7] ( 1 ) Rose c 214 Information Processing Society of Japan #pragma bcc trans transpose(foo_a[:1][:1][:3],[1,3,2]) #pragma acc data copy (foo_a[:1][:1][:3], \ foo_b[:1][:1][:3]) #pragma acc kernels #pragma acc loop gang independent for(k = ;k < 1;k++) #pragma acc loop vector independent for(j = ;j < 1;j++) for(i = ;i < 3;i++) foo_b[k][j][i] = foo_a[k][j][i]; 15 acc trans AST ( 2 ) #pragma acc trans ( 3 ) ( 4 ) 1 ( 5 ) / ( 6 ) #pragma acc trans acc data ( 7 ) acc data data acc trans 7. 6 SOAOS AOS 6

7 Vol.214-HPC-145 No /7/3 #pragma bcc trans transpose \ ( foo_a [ : 1 ] [ : 1 ] [ : 3 ], [ 1, 3, 2 ] ) double *foo_a_generated 1_3_2; foo_a_generated 1_3_2 = ((void *)\ (malloc(sizeof(double ) * 1 * 1 * 3))); transpos_foo_a_1_3_2(((double *)\ foo_a_generated 1_3_2),((double *)foo_a)); MFlops A[i][j][k][4], B[i][j][k][3], C[i][j][k][3] ABC[i][j][k][1] #pragma acc data copy (foo_a_generated 1_3_2[:1 * 1 * 3]\, foo_b[:1][:1][:3]) #pragma acc kernels #pragma acc loop gang independent for (k = ; k < 1; k++) #pragma acc loop vector independent for (j = ; j < 1; j++) for (i = ; i < 3; i++) foo_b[k][j][i] = foo_a_generated 1_3_2 \ [(( * 1 + k) * 3 + i) * 1 + j]; retranspos_foo_a_1_3_2(((double *)foo_a), \ ((double *)foo_a_generated 1_3_2)); free(foo_a_generated 1_3_2); 16 acc trans 2 CPU, Xeon Phi, GPU 17, 18, Xeon Phi K2X GPU [1,2,4,3] A[I][J][K][4] A[I][J][4][K] Xeon Phi GPU 27% 24% CPU acc trans Xeon Phi 7 7GFlops static c 214 Information Processing Society of Japan 17 MFlops MFlops 72 7 [1234] A[i][j][k][4] [1243] A[i][j][4][k] [1423] A[i][4][j][k] [4123] A[4][i][j][k] on CPU ( ) [1234] A[i][j][k][4] 18 [1234] A[i][j][k][4] 19 [1243] A[i][j][4][k] [1423] A[i][4][j][k] [4123] A[4][i][j][k] on Intel Xeon Phi [1243] A[i][j][4][k] [1423] A[i][4][j][k] [4123] A[4][i][j][k] on K2X A[i][j][k][4], B[i][j][k][3], C[i][j][k][3] ABC[i][j][k][1] A[i][j][k][4], B[i][j][k][3], C[i][j][k][3] ABC[i][j][k][1] 1 UPACS 8. OpenACC Intel 7

8 Xeon Phi 27% K2X GPU 24% Array of Structures, Structure of Arrays Vol.214-HPC-145 No /7/3 [1] Che, S., Sheaffer, J. W. and Skadron, K.: Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems, Proceedings of 211 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 11, New York, NY, USA, ACM, pp. 13:1 13:11 (online), DOI: / (211). [2] Dolbeau, R., Bihan, S. and Bodin, F.: A Hybrid Multicore Parallel Programming Environment, High Performance Computing (Valero, M., Joe, K., Kitsuregawa, M. and Tanaka, H., eds.), Lecture Notes in Computer Science, Vol. 194, Springer Berlin / Heidelberg, pp (27). [3] Hoshino, T., Maruyama, N., Matsuoka, S. and Takaki, R.: CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application, Cluster Computing and the Grid, IEEE International Symposium on, Vol., pp (online), DOI: (213). [4] in High Performance Computers, S. S. M. B.: [5] Lee, S. and Eigenmann, R.: OpenMPC: Extended OpenMP Programming and Tuning for GPUs, Proceedings of the 21 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 1, Washington, DC, USA, IEEE Computer Society, pp (online), DOI: 1.119/SC (21). [6] OpenACC-standard.org: The OpenACC Application Programming Interface, (online), available from (211). [7] Schordan, M. and Quinlan, D.: A Source-To-Source Architecture for User-Defined Optimizations, Modular Programming Languages (Böszörményi, L. and Schojer, P., eds.), Lecture Notes in Computer Science, Vol. 2789, Springer Berlin Heidelberg, pp (23). [8] Sung, I.-J., Liu, G. and Hwu, W.-M.: DL: A data layout transformation system for heterogeneous computing, Innovative Parallel Computing (InPar), 212, pp (online), DOI: 1.119/InPar (212). [9] Wolfe, M.: Implementing the PGI Accelerator model, Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU 1, New York, NY, USA, ACM, pp (online), DOI: (21). [1] Xeon Phi (213). [11] c 214 Information Processing Society of Japan 8

GPU n Graphics Processing Unit CG CAD

GPU 2016/06/27 第 20 回 GPU コンピューティング講習会 ( 東京工業大学 ) 1 GPU n Graphics Processing Unit CG CAD www.nvidia.co.jp www.autodesk.co.jp www.pixar.com GPU n GPU ü n NVIDIA CUDA ü NVIDIA GPU ü OS Linux, Windows, Mac