Slide 1

Size: px

Start display at page:

Download "Slide 1"

とよみおおはし
7 years ago
Views:

1 OpenACC CUDA による GPU コンピューティング Akira Naruse, 19 th Jul. 2018

2 成瀬彰 (Naruse, Akira) 自己紹介 2013 年 ~: NVIDIA シニアデベローパーテクノロジーエンジニア 1996~2013 年 : 富士通研究所研究員など専門興味 : 並列処理性能最適化スパコン HPC GPU コンピューティング DeepLearning 詳しくは github.com/anaruse 2

3 AGENDA GPU Computing Volta GPUs OpenACC CUDA 3

4 GPU Computing 4

5 GPU コンピューティング Low latency + High throughput CPU GPU 5

6 アプリケーション実行アプリケーションコード CPU Do i=1,n 計算の重い部分 GPU 逐次部分は CPU 上で実行 End do 並列部分は GPU で実行 6

7 GPU の構造 (TESLA P100) 大量の CUDA コア並列性が鍵 64 CUDA core/sm 3584 CUDA core/chip 56 SM/chip 7

8 GPU APPLICATIONS 数百のアプリケーションが GPU に対応 8

9 LEADING APPLICATIONS 9

10 DL FRAMEWORKS 10

11 アプリを GPU 対応する方法 Application Library OpenACC CUDA GPU 対応ライブラリにチェンジ簡単に開始既存コードにディレクティブを挿入簡単に加速主要処理を CUDA で記述高い自由度 11

12 NVIDIA LIBRARIES cublas cusparse cusolver AmgX cufft curand nvgraph Thrust Performance Primitives cudnn TensorRT NCCL 12

13 PARTNER LIBRARIES Computer Vision Audio and Video Matrix, Signal and Image Linear Algebra Math, Signal and Image Graph Sparse direct solvers Linear Algebra Linear Algebra Computational Geometry Sparse Iterative Methods Real-time visual simulation 13

14 アプリを GPU 対応する方法 Application Library OpenACC CUDA GPU 対応ライブラリにチェンジ簡単に開始既存コードにディレクティブを挿入簡単に加速主要処理を CUDA で記述高い自由度 14

15 SAXPY (Y=A*X+Y) OpenMP void saxpy(int n, float a, float *x, float *restrict y) { #pragma omp parallel for for (int i = 0; i < n; ++i) y[i] += a*x[i]; saxpy(n, 3.0, x, y); void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc parallel copy(y[:n]) copyin(x[:n]) for (int i = 0; i < n; ++i) y[i] += a*x[i]; saxpy(n, 3.0, x, y); OpenACC 15

16 アプリを GPU 対応する方法 Application Library OpenACC CUDA GPU 対応ライブラリにチェンジ簡単に開始既存コードにディレクティブを挿入簡単に加速主要処理を CUDA で記述高い自由度 16

17 CPU SAXPY (Y=A*X+Y) CUDA void saxpy(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] += a*x[i]; global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx; if (i < n) y[i] += a*x[i]; saxpy(n, 3.0, x, y); size_t size = sizeof(float) * N; cudamemcpy(d_x, x, size, cudamemcpyhosttodevice); cudamemcpy(d_y, y, size, cudamemcpyhosttodevice); saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y); cudamemcpy(y, d_y, size, cudamemcpydevicetohost); 17

18 Volta GPUs 18

19 Performance / W GPU ロードマップ Volta Pascal Maxwell Tesla Fermi Kepler

20 TESLA V100 の概要 Volta Architecture Improved NVLink & HBM2 Volta MPS Improved SIMT Model Tensor Core Most Productive GPU Efficient Bandwidth Inference Utilization New Algorithms 125 Programmable TFLOPS Deep Learning Deep Learning と HPC 両方に最適な GPU 20

21 P100 に対する相対性能 VOLTA HPC 性能を大きく向上 HPC アプリケーション性能 System Config Info: 2X Xeon E v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware. Summit Supercomputer 200+ PetaFlops ~3,400 Nodes 10 Megawatts 21

QMCPACK RAPTOR SPECFEM XGC 3+EFLOPS Tensor

22 VOLTA 米国トップスパコンのエンジン SUMMIT AI Exascale Today Performance Leadership 200 PF Accelerated Science ACME DIRAC FLASH GTC HACC LSDALTON NAMD 20 PF NUCCOR NWCHEM QMCPACK RAPTOR SPECFEM XGC 3+EFLOPS Tensor Ops 10X Perf Over Titan 5-10X Application Perf Over Titan 22

23 TESLA V100 トランジスタ数 :21B 815 mm 2 80 SM 5120 CUDA コア 640 Tensor コア HBM2 32 GB, 900 GB/s NVLink 300 GB/s *full GV100 chip contains 84 SMs 23

24 GPU ピーク性能比較 : P100 vs v100 P100 V100 性能 UP トレーニング性能 10 TOPS 125 TOPS 12x インファレンス性能 21 TFLOPS 125 TOPS 6x FP64/FP32 5/10 TFLOPS 7.8/15.6 TFLOPS 1.5x HBM2 バンド幅 720 GB/s 900 GB/s 1.2x NVLink バンド幅 160 GB/s 300 GB/s 1.9x L2 キャッシュ 4 MB 6 MB 1.5x L1 キャッシュ 1.3 MB 10 MB 7.7x 24

25 STREAM: Triad- Delivered GB/s HBM2 メモリ使用効率 UP 実効バンド幅 1.5 倍 V100 measured on pre-production hardware. HBM2 stack P100 V100 76% 95% 使用効率 25

26 VOLTA NVLINK P100 V100 リンク数 4 6 バンド幅 / リンクトータルバンド幅 (*) バンド幅は双方向 40 GB/s 50 GB/s 160 GB/s 300 GB/s DGX1V 26

27 VOLTA GV100 SM GV100 FP32 ユニット 64 FP64 ユニット 32 INT32 ユニット 64 Tensor コア 8 レジスタファイル統合 L1 共有メモリ 256 KB 128 KB Active スレッド 2048 (*) SM あたり 27

28 VOLTA GV100 SM 生産性の向上命令セットを一新スケジューラを2 倍命令発行機構をシンプルに L1キャッシュの大容量高速化 SIMTモデルの改善テンソル計算の加速最もプログラミングの簡単な SM 28

29 OpenACC 29

30 OPENACC プログラミング概要紹介プログラムの OpenACC 化 OpenACC 化事例 30

31 CPU GPU OPENACC 簡単 : 既存のコードにコンパイラへのヒントを追加 Program myscience serial code!$acc kernels do k = 1,n1 do i = 1,n2 parallel code enddo enddo!$acc end kernels serial code End Program myscience 既存の C/Fortran コードヒントの追加強力 : 相応の労力でコンパイラがコードを自動で並列化オープン : 複数コンパイラベンダが様々なプロセッサをサポート NVIDIA GPU, AMD GPU, x86 CPU, 31

32 GPU コンピューティングアプリケーションコード CPU Do i=1,n 計算の重い部分 OpenACC GPU 逐次部分は CPU 上で実行 End do 並列部分は GPU で実行 32

x * blockidx; if (i < n) y[i] += a*x[i]; size_t size = sizeof(float) * N; cudamemcpy(d_x, x, size, cudamemcpyhosttodevice);

33 CPU void saxpy(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] += a*x[i]; saxpy(n, 3.0, x, y); SAXPY (Y=A*X+Y) CUDA global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx; if (i < n) y[i] += a*x[i]; size_t size = sizeof(float) * N; cudamemcpy(d_x, x, size, cudamemcpyhosttodevice); cudamemcpy(d_y, y, size, cudamemcpyhosttodevice); saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y); cudamemcpy(y, d_y, size, cudamemcpydevicetohost); 33

34 SAXPY (y=a*x+y) OpenMP void saxpy(int n, float a, float *x, float *restrict y) { #pragma omp parallel for for (int i = 0; i < n; ++i) y[i] += a*x[i]; saxpy(n, 3.0, x, y); void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc parallel copy(y[:n]) copyin(x[:n]) for (int i = 0; i < n; ++i) y[i] += a*x[i]; saxpy(n, 3.0, x, y); OpenACC omp acc データの移動 34

$omp end parallel do end subroutine saxpy OpenACC subroutine saxpy(n, a, X, Y) real :: a, Y(:), Y(:) integer

35 SAXPY (y=a*x+y, FORTRAN) OpenMP subroutine saxpy(n, a, X, Y) real :: a, X(:), Y(:) integer :: n, i!$omp parallel do do i=1,n Y(i) = a*x(i)+y(i) enddo!$omp end parallel do end subroutine saxpy OpenACC subroutine saxpy(n, a, X, Y) real :: a, Y(:), Y(:) integer :: n, i!$acc parallel copy(y(:)) copyin(x(:)) do i=1,n Y(i) = a*x(i)+y(i) enddo!$acc end parallel end subroutine saxpy call saxpy(n, 3.0, x, y) call saxpy(n, 3.0, x, y) FORTRAN も同様 35

36 簡単にコンパイル OpenMP / OpenACC void saxpy(int n, float a, float *x, float *restrict y) $ pgcc -acc { Minfo=acc saxpy.c saxpy: #pragma acc parallel copy(y[:n]) copyin(x[:n]) 16, Generating #pragma present_or_copy(y[:n]) omp parallel for Generating for present_or_copyin(x[:n]) (int i = 0; i < n; ++i) Generating y[i] Tesla += code a*x[i]; 19, Loop is parallelizable Accelerator kernel generated 19, #pragma acc loop gang, vector(128) /* blockidx.x threadidx.x */ saxpy(n, 3.0, x, y); 36

37 簡単に実行 OpenMP / OpenACC void saxpy(int n, float a, float *x, float *restrict y) $ pgcc -Minfo { -acc saxpy.c $ nvprof./a.out saxpy: #pragma acc parallel copy(y[:n]) copyin(x[:n]) ==10302== NVPROF #pragma is profiling omp parallel process for 10302, command:./a.out 16, Generating present_or_copy(y[:n]) ==10302== Profiling for (int application: Generating present_or_copyin(x[:n]) i = 0; i./a.out < n; ++i) ==10302== Profiling Generating y[i] result: Tesla += code a*x[i]; Time(%) Time 19, Loop is Calls Avg Min Max Name parallelizable 62.95% ms ms ms ms [CUDA memcpy HtoD] Accelerator kernel generated 31.48% ms 19, #pragma ms ms ms [CUDA memcpy DtoH] acc loop gang, vector(128) /* blockidx.x threadidx.x */ 5.56% us saxpy(n, 3.0, 1 x, us y); us us saxpy_19_gpu 37

38 OPENACC プログラミング概要紹介プログラムの OpenACC 化 OpenACC 化事例 38

39 事例 : ヤコビ反復法 while ( error > tol ) { error = 0.0; A(i,j+1) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i])); A(i-1,j) A(i,j) A(i,j-1) A(i+1,j) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; 39

40 並列領域の指定 (parallel/kernels ディレクティブ ) while ( error > tol ) { error = 0.0; Parallel と Kernels #pragma acc kernels for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; Parallel OpenMP と親和性開発者主体 Kernels 複数 kernel の生成コンパイラ主体 40

41 [PGI tips] コンパイラメッセージ $ pgcc acc Minfo=accel jacobi.c jacobi: 44, Generating copyout(anew[1:4094][1:4094]) Generating copyin(a[:][:]) Generating Tesla code 45, Loop is parallelizable 46, Loop is parallelizable Accelerator kernel generated 45, #pragma acc loop gang /* blockidx.y */ 46, #pragma acc loop gang, vector(128) /* blockidx.x threadidx.x */ 49, Max reduction generated for error 41

42 並列領域の指定 (kernels) while ( error > tol ) { error = 0.0; #pragma acc kernels for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; $ pgcc error -Minfo=acc = max(error, abs(anew[j][i] -acc jacobi.c - A[j][i]); jacobi: 並列領域の指定 Parallels と Kernels Parallels OpenMP と親和性 59, Generating present_or_copyout(anew[1:4094][1:4094]) #pragma acc Generating kernels present_or_copyin(a[:][:]) 開発者主体 for (int j Generating = 1; j < N-1; Tesla j++) code { for (int i = 1; i < M-1; i++) { Kernels 61, Loop is parallelizable A[j][i] = Anew[j][i]; 63, Loop is parallelizable 複数 kernelの生成 Accelerator kernel generated コンパイラ主体 61, #pragma acc loop gang /* blockidx.y */ 63, #pragma acc loop gang, vector(128) /* blockidx.x threadidx.x */ Max reduction generated for error 42

25; $ pgcc error -Minfo=acc = max(error, abs(anew[j][i] -acc jacobi.

43 データの転送 (data clause) while ( error > tol ) { error = 0.0; #pragma acc kernels for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; $ pgcc error -Minfo=acc = max(error, abs(anew[j][i] -acc jacobi.c - A[j][i]); jacobi: 並列領域の指定 Parallels と Kernels Parallels OpenMP と親和性 59, Generating present_or_copyout(anew[1:4094][1:4094]) #pragma acc Generating kernels present_or_copyin(a[:][:]) 開発者主体 for (int j Generating = 1; j < N-1; Tesla j++) code { for (int i = 1; i < M-1; i++) { Kernels 61, Loop is parallelizable A[j][i] = Anew[j][i]; 63, Loop is parallelizable 複数 kernelの生成 Accelerator kernel generated コンパイラ主体 61, #pragma acc loop gang /* blockidx.y */ 63, #pragma acc loop gang, vector(128) /* blockidx.x threadidx.x */ Max reduction generated for error 43

$0; #pragma acc kernels \ pcopyout(anew[1:4094][1:4094]) pcopyin(a[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) *$

44 データの転送 (data clause) while ( error > tol ) { error = 0.0; #pragma acc kernels \ pcopyout(anew[1:4094][1:4094]) pcopyin(a[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels \ pcopyout(a[1:4094][1:4094]) pcopyin(anew[1:4094][1:4094]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; N=M=4096 copyin (Host GPU) copyout (Host GPU) copy create present 44

45 配列形状の指定配列は全要素でなく一部だけ指定して転送することも可能注意 : C/C++ と Fortran では指定方法が異なる C/C++: array[ start : size ] float Anew[4096][4096] pcopyout( Anew[1:4094][1:4094]) pcopyin( A[:][:]) ) Fortran: array( start : end ) real Anew(4096,4096) pcopyout( Anew(2:4095, 2:4095) ) pcopyin( A(:,:) ) 45

46 データの転送 (data clause) while ( error > tol ) { error = 0.0; #pragma acc kernels \ pcopy(anew[:][:]) pcopyin(a[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels \ pcopy(a[:][:]) pcopyin(anew[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; N=M=4096 copyin (Host GPU) copyout (Host GPU) copy create present 46

47 [PGI tips] PGI_ACC_TIME $ PGI_ACC_TIME=1./a.out Accelerator Kernel Timing data /home/anaruse/src/openacc/jacobi/c/task1-solution/jacobi.c jacobi NVIDIA devicenum=0 time(us): 649,886 44: data region reached 200 times 44: data copyin transfers: 800 device time(us): total=14,048 max=41 min=15 avg=17 53: data copyout transfers: 800 device time(us): total=11,731 max=43 min=6 avg=14 44: compute region reached 200 times 46: kernel launched 200 times grid: [32x4094] block: [128] device time(us): total=382,798 max=1,918 min=1,911 avg=1,913 elapsed time(us): total=391,408 max=1,972 min=1,953 avg=1,957 46: reduction kernel launched 200 times grid: [1] block: [256] device time(us): total=48,235 max=242 min=241 avg=241 elapsed time(us): total=53,510 max=280 min=266 avg=267 統計データ 47

48 [PGI tips] PGI_ACC_NOTIFY $ PGI_ACC_NOTIFY=3./a.out upload CUDA data file=/home/anaruse/src/openacc/jacobi/c/task1- solution/jacobi.c function=jacobi line=44 device=0 variable=a bytes= launch CUDA kernel file=/home/anaruse/src/openacc/jacobi/c/task1- solution/jacobi.c function=jacobi line=46 device=0 num_gangs= num_workers=1 vector_length=128 grid=32x4094 block=128 shared memory=1024 download CUDA data file=/home/anaruse/src/openacc/jacobi/c/task1- solution/jacobi.c function=jacobi line=53 device=0 variable=anew bytes= トレースデータ 48

49 NVIDIA VISUAL PROFILER (NVVP) 49

50 NVVP による解析 : データ転送がボトルネック 1 cycle 利用率 : 低い GPU kernel GPU kernel 50

51 過剰なデータ転送 while ( error > tol ) { error = 0.0; #pragma acc kernels \ pcopy(anew[:][:]) pcopyin(a[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels \ pcopy(a[:][:]) pcopyin(anew[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; 51

$0; 過剰なデータ転送 GPU #pragma acc kernels \ pcopy(anew[:][:]) \ pcopyin(a[:][:]) { copyin copyout #pragma acc loop$ reduction(max:error) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] +

reduction(max:error) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] +

52 Host while ( error > tol ) { error = 0.0; 過剰なデータ転送 GPU #pragma acc kernels \ pcopy(anew[:][:]) \ pcopyin(a[:][:]) { copyin copyout #pragma acc loop reduction(max:error) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels \ pcopy(a[:][:]) \ pcopyin(anew[:][:]) { copyin copyout for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; 52

53 データ領域の指定 (data ディレクティブ ) #pragma acc data pcopy(a, Anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); copyin (CPU GPU) copyout (CPU GPU) copy create present #pragma acc kernels pcopy(a[:][:]) pcopyin(anew[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; 53

54 データ領域の指定 (data ディレクティブ ) #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); copyin (CPU GPU) copyout (CPU GPU) copy create present #pragma acc kernels pcopy(a[:][:]) pcopyin(anew[:][:]) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; 54

$Host #pragma acc data \ pcopy(a) create(anew) while ( error > tol ) { error = 0.$ $25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels \ pcopy(a[:][:]) \ pcopyin(anew[:][:]) { copyout for$

55 Host #pragma acc data \ pcopy(a) create(anew) while ( error > tol ) { error = 0.0; 適正なデータ転送 copyin GPU #pragma acc kernels \ pcopy(anew[:][:]) \ pcopyin(a[:][:]) { for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); #pragma acc kernels \ pcopy(a[:][:]) \ pcopyin(anew[:][:]) { copyout for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { A[j][i] = Anew[j][i]; 55

56 データ転送が減少 (NVVP) 1 cycle 利用率 : 高い 56

57 2 つの処理 CPU Memory データ転送 GPU Memory PCI 計算オフロード計算オフロードデータ転送両方を考慮する必要がある 57

$その他のデータ管理方法 float *array; Init( ) { array = (float*)malloc( ); input_array( array ); #pragma enter data copyin(array) Fin( )$

58 その他のデータ管理方法 float *array; Init( ) { array = (float*)malloc( ); input_array( array ); #pragma enter data copyin(array) Fin( ) { #pragma exit data copyout(array) output_array( array ); free( array ); Enter data Copyin Create Exit data Copyout Delete 58

59 その他のデータ管理方法 #pragma acc data pcopy(a,b) for (k=0; k<loop; k++) { #pragma acc kernels present(a,b) for (i=0; i<n; i++) { A[i] = suba(i,a,b); #pragma acc update self(a[0:1]) output[k] = A[0]; Update self CPU GPU Update device CPU GPU A[N-1] = input[k]; #pragma acc update device(a[n-1:1]) #pragma acc kernels present(a,b) for (i=0; i<n; i++) { B[i] = subb(i,a,b); 59

60 Unified Memory の利用その他のデータ管理方法 PGI コンパイラオプション : -acc ta=,managed, プログラム実行中に動的に確保される配列は (C: malloc, Fortran: allocate) Unified Memory で管理される OpenACC のデータディレクティブで移動を指示する必要がない 60

61 リダクション ( 縮約計算 ) while ( error > tol ) { error = 0.0; #pragma acc kernels for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); 61

62 リダクション ( 縮約計算 ) 演算の種類 while ( error > tol ) { error = 0.0; + 和 * 積 Max 最大 Min 最小 jacobi: ビット和 & ビット積 ^ XOR 論理和 && 論理積 #pragma acc kernels for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; $ pgcc -Minfo=acc error = max(error, -acc jacobi.c abs(anew[j][i] - A[j][i]); 59, Generating present_or_copyout(anew[1:4094][1:4094]) Generating present_or_copyin(a[:][:]) #pragma acc kernels for (int Generating j = 1; j < Tesla N-1; j++) code{ 61, for (int Loop i is = 1; parallelizable i < M-1; i++) { 63, A[j][i] Loop is = Anew[j][i]; parallelizable Accelerator kernel generated 61, #pragma acc loop gang /* blockidx.y */ 63, #pragma acc loop gang, vector(128) /* blockidx.x threadidx.x */ Max reduction generated for error 62

63 リダクション (REDUCTION CLAUSE) while ( error > tol ) { error = 0.0; #pragma acc kernels #pragma acc loop reduction(max:error) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); 演算種類 (C/C++) + 和 * 積 max 最大 min 最小ビット和 & ビット積 ^ XOR 論理和 && 論理積 63

64 並列方法の指示 #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop reduction(max:error) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); 64

65 並列方法の指示 #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop reduction(max:error) for (int j = 1; j < N-1; j++) { $ pgcc -Minfo=acc -acc jacobi.c #pragma acc loop reduction(max:error) jacobi: for (int i = 1; i < M-1; i++) { 59, Anew[j][i] Generating = (A[j][i+1] present_or_copyout(anew[1:4094][1:4094]) + A[j][i-1] + Generating A[j-1][i] present_or_copyin(a[:][:]) + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); Generating Tesla code 61, Loop is parallelizable 63, Loop is parallelizable Accelerator kernel generated 61, #pragma acc loop gang /* blockidx.y */ 63, #pragma acc loop gang, vector(128) /* blockidx.x threadidx.x */ Max reduction generated for error 65

66 並列方法の指示 (loop ディレクティブ ) #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop gang vector(1) reduction(max:error) for (int j = 1; j < N-1; j++) { #pragma acc loop gang vector(128) for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); Gang Worker Vector SIMD 幅 Collapse Independent Seq Cache Tile 66

67 実行条件設定 (gang, vector) #pragma acc loop gang vector(4) for (j = 0; j < 16; j++) { #pragma acc loop gang vector(16) for (i = 0; i < 16; i++) { #pragma acc loop gang vector(8) for (j = 1; j < 16; j++) { #pragma acc loop gang vector(8) for (i = 0; i < 16; i++) { i i 4 x 16 4 x 16 8 x 8 8 x 8 j 4 x 16 j 8 x 8 8 x 8 67

$0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop reduction(max:error) \ collapse(2) gang vector(128) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) {$

68 ループを融合 (collapse) #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop reduction(max:error) \ collapse(2) gang vector(128) for (int j = 1; j < N-1; j++) { for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); Gang Worker Vector SIMD 幅 Collapse Independent Seq Cache Tile 68

69 ループを融合 (collapse) #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop reduction(max:error) gang vector(128) for (int ji = 0; ji < (N-2)*(M-2); ji++) { j = (ji / (M-2)) + 1; i = (ji % (M-2)) + 1; Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); Gang Worker Vector SIMD 幅 Collapse Independent Seq Cache Tile 69

70 並列実行可能 (independent) #pragma acc data pcopy(a) create(anew) while ( error > tol ) { error = 0.0; #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop reduction(max:error) independent for (int jj = 1; jj < NN-1; jj++) { int j = list_j[jj]; for (int i = 1; i < M-1; i++) { Anew[j][i] = (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]) * 0.25; error = max(error, abs(anew[j][i] - A[j][i]); Gang Worker Vector SIMD 幅 Collapse Independent Seq Cache Tile 70

71 逐次に実行 (seq) #pragma acc kernels pcopy(anew[:][:]) pcopyin(a[:][:]) #pragma acc loop seq for (int k = 3; k < NK-3; k++) { #pragma acc loop for (int j = 0; j < NJ; j++) { #pragma acc loop for (int i = 0; i < NI; i++) { Anew[k][j][i] = func( A[k-1][j][i], A[k-2][j][i], A[k-3][j][i], A[k+1][j][i], A[k+2][j][i], A[k+3][j][i], ); Gang Worker Vector SIMD 幅 Collapse Independent Seq Cache Tile 71

72 OPENACC プログラミング概要紹介プログラムの OpenACC 化 OpenACC 化事例 72

OPENACC ACCELERATES COMPUTATIONAL SCIENCE LSDalton PowerGrid CloverLeaf INCOMP3D

Illinois 40 days to 2 hours Comp Hydrodynamics AWE 4X speedup Single CPU/GPU code CFD

73 OPENACC ACCELERATES COMPUTATIONAL SCIENCE LSDalton PowerGrid CloverLeaf INCOMP3D Quantum Chemistry Aarhus University 12X speedup 1 week Medical Imaging University of Illinois 40 days to 2 hours Comp Hydrodynamics AWE 4X speedup Single CPU/GPU code CFD NC State University 4X speedup NekCEM MAESTRO CASTRO FINE/Turbo Comp Electromagnetics Argonne National Lab 2.5X speedup 60% less energy Astrophysics Stony Brook University 4.4X speedup 4 weeks effort CFD NUMECA International 10X faster routines 2X faster app 73

Speedup vs CPU LSDALTON Large-scale application for calculating highaccuracy molecular energies Lines of Code Modified

Module Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X) 11.

Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU

74 Speedup vs CPU LSDALTON Large-scale application for calculating highaccuracy molecular energies Lines of Code Modified Minimal Effort # of Weeks Required # of Codes to Maintain <100 Lines 1 Week 1 Source Big Performance LS-DALTON CCSD(T) Module Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X) 11.7x OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation. Janus Juul Eriksen, PhD Fellow qleap Center for Theoretical Chemistry, Aarhus University 7.9x ALANINE-1 13 ATOMS 8.9x ALANINE-2 23 ATOMS ALANINE-3 33 ATOMS 74

NUMECA FINE/TURBO Commercial CFD Application CHALLENGE Accelerate 20 year old highly optimized code on GPUs OpenACC enabled us to target routines for GPU acceleration without rewriting code, allowing

75 NUMECA FINE/TURBO Commercial CFD Application CHALLENGE Accelerate 20 year old highly optimized code on GPUs OpenACC enabled us to target routines for GPU acceleration without rewriting code, allowing us to maintain portability on a code that is 20-years old David Gutzwiller, Head of HPC NUMECA International SOLUTION Accelerated computationally intensive routines with OpenACC RESULTS Achieved 10x or higher speed-up on key routines Full app speedup of up to 2x on the Oak Ridge Titan supercomputer Total time spent on optimizing various routines with OpenACC was just five person-months 75

POWERGRID Advanced MRI Reconstruction Model CHALLENGE Produce detailed and accurate brain images by applying computationally intensive algorithms to MRI data Reduce reconstruction time to make

Professor of Bioengineering and Technical Director of the Biomedical Imaging Center University of Illinois at Urbana-Champaign SOLUTION Accelerated MRI reconstruction application with OpenACC using

76 POWERGRID Advanced MRI Reconstruction Model CHALLENGE Produce detailed and accurate brain images by applying computationally intensive algorithms to MRI data Reduce reconstruction time to make diagnostic use possible Now that we ve seen how easy it is to program the GPU using OpenACC and the PGI compiler, we re looking forward to translating more of our projects Brad Sutton, Associate Professor of Bioengineering and Technical Director of the Biomedical Imaging Center University of Illinois at Urbana-Champaign SOLUTION Accelerated MRI reconstruction application with OpenACC using NVIDIA GPUs RESULTS Reduced reconstruction time for a single high-resolution MRI scan from 40 days to a couple of hours Scaled on Blue Waters at NCSA to reconstruct 3000 images in under 24 hours 76

INCOMP3D 3D Fully Implicit CFD Solver OpenACC is a highly effective tool for programming fully implicit CFD solvers on GPU to achieve true 4X speedup Lixiang Luo, Researcher Aerospace Engineering

77 INCOMP3D 3D Fully Implicit CFD Solver OpenACC is a highly effective tool for programming fully implicit CFD solvers on GPU to achieve true 4X speedup Lixiang Luo, Researcher Aerospace Engineering Computational Fluid Dynamics Laboratory North Carolina State University (NCSU) CHALLENGE Accelerate a complex implicit CFD solver on GPU SOLUTION Used OpenACC to run solver on GPU with minimal code changes. RESULTS Achieved up to 4X speedups on Tesla GPU over parallel MPI implementation on x86 multicore processor Structured approach to parallelism provided by OpenACC allowed better algorithm design without worrying about GPU architecture * CPU Speedup on 6 cores of Xeon E5645, with additional cores performance reduces due to partitioning and MPI overheads 77

NEKCEM Computational Electromagnetics Application CHALLENGE Enable NekCEM to deliver strong scaling on next-generation architectures while maintaining portability The most significant result from our

78 NEKCEM Computational Electromagnetics Application CHALLENGE Enable NekCEM to deliver strong scaling on next-generation architectures while maintaining portability The most significant result from our performance studies is faster computation with less energy consumption compared with our CPU-only runs. Dr. Misun Min, Computation Scientist Argonne National Laboratory SOLUTION Use OpenACC to port the entire program to GPUs RESULTS 2.5x speedup over a highly tuned CPU-only version GPU used only 39 percent of the energy needed by 16 CPUs to do the same computation in the same amount of time 78

MAESTRO & CASTRO Astrophysics 3D Simulation On the reactions side, accelerated calculations allow us to model larger networks of nuclear reactions for similar computational costs as the

Pursuing strong scaling and portability to run code on GPU-powered supercomputers SOLUTION OpenACC compiler on OLCF s Titan supercomputer RESULTS 4.

79 MAESTRO & CASTRO Astrophysics 3D Simulation On the reactions side, accelerated calculations allow us to model larger networks of nuclear reactions for similar computational costs as the simple networks we model now 2 weeks to learn OpenACC 2 weeks to modify code Adam Jacobs, PhD candidate in the Department of Physics and Astronomy at Stony Brook University CHALLENGE Pursuing strong scaling and portability to run code on GPU-powered supercomputers SOLUTION OpenACC compiler on OLCF s Titan supercomputer RESULTS 4.4x faster reactions than on a multi-core computer with 16 cores Accelerated calculations allow modeling larger networks of nuclear reactions for similar computational costs as simple networks 79

Speedup vs 1 CPU Core CLOVERLEAF Performance Portability for a Hydrodynamics Application

We were extremely impressed that we can run OpenACC on a CPU with no code change and get

Wayne Gaudin and Oliver Perks Atomic Weapons Establishment, UK https://developer.nvidia.

then recompile and run the same source code on multi-core CPUs RESULTS Same performance as

80 Speedup vs 1 CPU Core CLOVERLEAF Performance Portability for a Hydrodynamics Application CHALLENGE Application code that runs across architectures without compromising on performance We were extremely impressed that we can run OpenACC on a CPU with no code change and get equivalent performance to our OpenMP/MPI implementation. Wayne Gaudin and Oliver Perks Atomic Weapons Establishment, UK SOLUTION Use OpenACC to port the CloverLeaf mini app to GPUs and then recompile and run the same source code on multi-core CPUs RESULTS Same performance as optimized OpenMP version on x86 CPU 4x faster performance using the same code on a GPU Benchmarked Intel(R) Xeon(R) CPU E GHz, Accelerator: Tesla K80 80

81 Earthquake Simulation ( 理研東大地震研 ) WACCPD 2016(SC16 併催 WS): Best Paper Award より引用 81

82 NICAM 気象気候モデル by 理研 AICS/ 東大膨大なコード ( 数十万行 ) ホットスポットがない ( パレートの法則 ) 特性の異なる 2 種類の処理力学系メモリバンド幅ネック物理系演算ネック 82

83 Performance (GFLOPS) NICAM: 力学系 (NICAM-DC) OpenACC による GPU 化主要サブルーチンは全て GPU 上で動作 (50 以上 ) MPI 対応済み 2 週間良好なスケーラビリティ Tsubame 2.5, 最大 2560 GPUs Scaling factor: E+05 1.E+04 1.E+03 1.E+02 1.E+01 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 Number of CPUs or GPUs Weak scaling Tsubame 2.5 (GPU:K20X) K computer Tsubame 2.5 (CPU:WSM) (*) weak scaling 83 Courtesy of Dr. Yashiro from RIKEN AICS

84 Measured Performance (GFLOPS) NICAM: 力学系 (NICAM-DC) 1.E+05 1.E+04 Tsubame 2.5 (GPU:K20X) K computer Tsubame 2.5 (CPU:WSM) 1.E+03 1.E+02 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 Aggregate Peak Memory Bandwidth (GB/s) 84 Courtesy of Dr. Yashiro from RIKEN AICS

85 Speedup vs. CPU 1-core NICAM: 物理系 (SCALE-LES) Atmospheric radiation transfer 物理系の中で最も重い計算 OpenACC による GPU 対応済 Better core 2 core 4 core 10 core 1 GPU 2 GPUs 4 GPUs Xeon E5-2690v2(3.0GHz,10-core) Tesla K40 85 (*) PCIデータ転送時間込み, グリッドサイズ :1256x32x32

86 Time (sec) SEISM3D 地震シミュレーション by 古村教授 ( 東大地震研 ) 主要サブルーチンの GPU 対応が完了メモリバンド幅ネック 3 次元モデル (2 次元分割 ) 隣接プロセス間通信 SEISM3D (480x480x1024, 1K steps) x speedup ( アプリ全体 ) GPU の実行時間内訳 Others (CPU, MPI and so on) [CUDA memcpy DtoH] [CUDA memcpy HtoD] Better K: 8x SPARC64 VIIIfx CPU: 8x Xeon E5-2690v2 134 GPU: 8x Tesla K GPU: 8x Tesla K40 (other subroutines) update_vel_pml update_vel update_stress_pml update_stress diff3d_* 86

87 Speedup vs. 1 CPU core FFR/BCM ( 仮称 ) 次世代 CFD コード by 坪倉准教授 ( 理研 AICS/ 北大 ) MUSCL_bench: MUSCL スキームに基づく Flux 計算 ( とても複雑な計算 ) CFD 計算の主要部分 (60-70%) OpenACC による GPU 対応完了 Better core 2 core 5 core 10 core 1 GPU Xeon E5-2690v2(3.0GHz,10-core) Tesla K40 87 (*) PCIデータ転送時間込みサイズ :80x32x32x32

88 Time (second) CM-RCM IC-CG IC-CG 法のベンチマークコード by 中島教授 ( 東大 ) CM-RCM 法 (Cyclic Multi-coloring Reverse Cuthill-Mckee) を使用メインループ内のサブルーチンを全て OpenACC で GPU 化 CM-RCM ICCG (100x100x100) Better 0 OpenMP OpenMP OpenMP OpenMP OpenACC Opteron 6386SE (2.8GHz,16core) SPARC64 Ixfx (1.85GHz,16core) Xeon E5-2680v2 (2.8GHz,10core) Xeon-Phi 5110P Tesla K40 88 Courtesy of Dr. Ohshima from U-Tokyo

89 GFLOPS CCS-QCD QCD コード by 石川准教授 ( 広島大 ) BiCGStab 計算を全て OpenACC で GPU 化データレイアウトを変更 : AoS SoA Better CCS-QCD: BiCGStab Total FLOPS Xeon E5-2690v2(3.0GHz,10core,OpenMP) 53.3 Tesla K40(OpenACC) x8x8x32 8x8x8x64 8x8x16x64 8x16x16x64 16x16x16x64 Problem Size 89

90 CUDA 90

91 プログラミングモデルアーキテクチャ性能 Tips CUDA プログラミング 91

92 GPU コンピューティング CPU CPU Memory GPU Memory GPU PCI 高スループット指向のプロセッサ分離されたメモリ空間 92

93 CPU GPU プログラム GPU void saxpy(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] += a*x[i]; global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx; if (i < n) y[i] += a*x[i]; saxpy(n, 3.0, x, y); size_t size = sizeof(float) * N; cudamemcpy(d_x, x, size, cudamemcpyhosttodevice); cudamemcpy(d_y, y, size, cudamemcpyhosttodevice); saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y); cudadevicesynchronize(); cudamemcpy(y, d_y, size, cudamemcpydevicetohost); 93

94 GPU 実行の基本的な流れ CPU 入力データ転送 GPU GPU は CPU からの制御で動作 GPU カーネル投入同期 GPU 上で演算入力データ : CPU から GPU に転送 (H2D) GPU カーネル : CPU から投入出力データ : GPU から CPU に転送 (D2H) 出力データ転送 94

95 CPU GPU プログラム GPU void saxpy(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] += a*x[i]; 入力データ転送 global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx; if (i < n) y[i] += a*x[i]; saxpy(n, 3.0, x, y); カーネル起動同期出力データ転送 size_t size = sizeof(float) * N; cudamemcpy(d_x, x, size, cudamemcpyhosttodevice); cudamemcpy(d_y, y, size, cudamemcpyhosttodevice); saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y); cudadevicesynchronize(); cudamemcpy(y, d_y, size, cudamemcpydevicetohost); 95

96 GPU プログラム (Unified Memory) CPU GPU void saxpy(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] += a*x[i]; global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx; if (i < n) y[i] += a*x[i]; カーネル起動 saxpy(n, 3.0, x, y); 同期 saxpy<<< N/128, 128 >>>(N, 3.0, x, y); cudadevicesynchronize(); 96

97 void saxpy(int n, float a, float *x, float *y) { CPU for (int i = 0; i < n; ++i) y[i] += a*x[i]; saxpy(n, 3.0, x, y); GPU カーネル GPU global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx.x; if (i < n) y[i] += a*x[i]; saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y); Global スレッド ID GPU カーネル : 1 つの GPU スレッドの処理内容を記述基本 : 1 つの GPU スレッドが 1 つの配列要素を担当 97

98 Execution Configuration ( ブロック数とブロックサイズ ) スレッド ID global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx.x; if (i < n) y[i] += a*x[i]; ブロックサイズ saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y); ブロック ID ブロック数ブロックサイズブロック数 x ブロックサイズ = 配列要素数 98

99 y[i] = a*x[i] + y[i] スレッド階層 ( スレッドブロックグリッド ) x[] y[] スレッド (global) (local) ブロック0 ブロック1 ブロック2 グリッドブロックサイズ ( スレッド数 / ブロック ) はカーネル毎に設定可能推奨 : 128 or 256 スレッド 99

100 Execution Configuration ( ブロック数とブロックサイズ ) global void saxpy(int n, float a, float *x, float *y) { int i = threadidx.x + blodkdim.x * blockidx.x; if (i < n) y[i] += a*x[i]; saxpy<<< N/256, N/128, 64, >>>(N, 3.0, d_x, d_y); ブロック数ブロックサイズブロック数 x ブロックサイズ = 配列要素数 100

101 2D 配列の GPU カーネル例 global void MatAdd(float A[N][N], float B[N][N], float C[N][N]) { int i = threadidx.x + blodkdim.x * blockidx.x; int j = threadidx.y + blodkdim.y * blockidy.y; if ( i < N && j < N ) C[i][i] = A[i][j] + B[i][j]; Global スレッド ID (x) Global スレッド ID (y) ブロックサイズ (x,y) dim3 sizeblock( 32, 64, 16, ); dim3 numblocks( N/sizeBlock.x, N/sizeBlock.y ); MatAdd<<< numblocks, sizeblock >>>(A, B, C); ブロック数 (x,y) ブロックサイズ ( ブロック形状 ) は 1D~3D で表現可能 101

102 ブロックマッピングスレッドマッピング (0,0) dim3 sizeblock(16,16) (0,0) dim3 sizeblock(32,8) (0,0) (1,0) (0,0) (1,0) (2,0) (31,7) (0,1) (1,1) (15,15) (0,2) (1,2) (0,1) (1,1) (2,1) (0,3) (1,3) (0,2) (1,2) (2,2) (0,4) (1,4) ブロック ID(blockIdx) スレッド ID(threadIdx) 102

103 プログラミングモデルアーキテクチャ性能 Tips CUDA プログラミング 103

104 GPU アーキテクチャ概要 Pascal GP100 PCI I/F ホスト接続インタフェース Giga Thread Engine SM に処理を割り振るスケジューラ DRAM I/F (HBM2) 全 SM PCI I/F からアクセス可能なメモリ ( デバイスメモリ, フレームバッファ ) L2 cache (4MB) 全 SM からアクセス可能な R/W キャッシュ SM (Streaming Multiprocessor) 並列プロセッサ GP100: 最多

105 SM (Stream Multi-Processor) CUDA コア GPU スレッドはこの上で動作 GP100: 64 個 /SM Other units LD/ST, SFU, etc Pascal GP100 レジスタ (32bit): 64K 個共有メモリ : 64KB Tex/L1 キャッシュ 105

106 GPU カーネル実行の流れ CPU が GPU にグリッドを投入具体的な投入先は Giga Thread Engine グリッドブロックスレッドスレッドブロック 106

107 GPU カーネル実行の流れ Giga Thread Engine(GTE) が SM にブロックを投入 GTE はブロックスケジューラグリッドをブロックに分解してブロックを空いている SM に割当てるグリッドブロックスレッドスレッドブロック 107

108 ブロックを SM に割り当て各ブロックは互いに独立に実行ブロック間では同期しない実行順序の保証なし 1 つのブロックは複数 SM にまたがらない 1 つの SM に複数ブロックが割当てられることはあるグリッドブロックブロックブロックブロック 108

109 GPU カーネル実行の流れ SM 内のスケジューラがスレッドを CUDA コアに投入グリッドブロックスレッドスレッドブロック 109

110 GPU カーネル実行の流れ SM 内のスケジューラがワープを CUDA コアに投入ワープ : 32 スレッドの塊ブロックをワープに分割実行可能なワープを空 CUDA コアに割当てるグリッドブロックワープワープブロック 110

111 ワープの CUDA コアへの割り当てワープ内の32スレッドは同じ命令を同期して実行各ワープは互いに独立して実行同じブロック内のワープは明示的に同期可能 ( syncthreads()) グリッド SIMT (Single Instruction Multiple Threads) ブロックワープワープブロックワープ 111

112 GPU アーキの変化を問題としないプログラミングモデル Pascal, CC cores /SM Kepler, CC cores /SM Maxwell, CC cores /SM 112

113 プログラミングモデルアーキテクチャ性能 Tips CUDA プログラミング 113

114 リソース使用率 (Occupancy) SM の利用効率を上げる SM に割当て可能なスレッド数を上限に近づけるレジスタ使用量 (/ スレッド ) できる限り減らす DP(64bit) は 2 レジスタ消費レジスタ割り当て単位は 8 個レジスタ使用量と割当て可能なスレッド数の関係 32 レジスタ : 2048(100%), 64 レジスタ : 1024(50%) 128 レジスタ :512(25%), 256 レジスタ : 256(12.5%) CUDA コア数 : 64 最大スレッド数 : 2048 最大ブロック数 : 32 共有メモリ : 64KB レジスタ数 (32-bit): 64K 個リソース量 /SM (GP100) 114

115 リソース使用率 (Occupancy) SM の利用効率を上げる SM に割当て可能なスレッド数を上限に近づけるスレッド数 (/ ブロック ) 64 以上にする 64 未満だと最大ブロック数がネックになる CUDA コア数 : 64 最大スレッド数 : 2048 最大ブロック数 : 32 共有メモリ : 64KB レジスタ数 (32-bit): 64K 個リソース量 /SM (GP100) 共有メモリ使用量 (/ ブロック ) できる限り減らす共有メモリ使用量と割当て可能なブロック数の関係 32KB:2 ブロック, 16KB:4 ブロック, 8KB:8 ブロック 115

116 空き時間を埋めるリソース使用率 (Occupancy) (*) 操作 : GPUカーネル, データ転送 CUDAストリーム ( キュー) 同じCUDAストリームに投入した操作 : 投入順に実行別のCUDAストリームに投入した操作 : 非同期に実行 ( オーバラップ実行 ) [CUDA ストリームの効果例 ] GPU カーネルとデータ転送がオーバラップして同時に実行されている 116

117 デバイスメモリへのアクセスはまとめてコアレスアクセス 32 スレッド ( ワープ ) のロード / ストアをまとめてメモリトランザクションを発行トランザクションサイズ : 32B, 64B or 128B トランザクション数は少ないほど良い配列 128B 境界 128B 境界 Best 128B x Good B x2 Bad 32B x

118 分岐を減らすワープ内のスレッドが別パスを選択すると遅くなるワープ内のスレッドは命令を共有 (SIMT) ワープ内のスレッドが選んだ全パスの命令を実行あるパスの命令を実行中そのパスにいないスレッドは inactive 状態 Path divergence を減らすできる限り同ワープ内のスレッドは同じパスを選択させる true 0 処理 X 1 A[id] > 処理 Y false 118

119 まとめ GPU Computing Volta GPUs OpenACC CUDA 119

120

Images per Second Images per Second VOLTA: ディープラーニングにおける大きな飛躍 ResNet-50 トレーニング 2.4x faster ResNet-50 推論 TensorRT - 7ms レイテンシ 3.7x faster P100 V100 P10

NVIDIA TESLA V100 CUDA 9 のご紹介森野慎也, シニアソリューションアーキテクト (GPU-Computing) NVIDIA Images per Second Images per Second VOLTA: ディープラーニングにおける大きな飛躍 ResNet-50 トレーニング 2.4x faster ResNet-50 推論 TensorRT - 7ms レイテンシ