Microsoft PowerPoint - GTC2012-SofTek.pptx

Size: px

Start display at page:

Download "Microsoft PowerPoint - GTC2012-SofTek.pptx"

みりあおいもり
7 years ago
Views:

1 GTC Japan 2012 PGI Accelerator Compiler 実践! PGI OpenACC ディレクティブを使用したポーティング 2012 年 7 月加藤努株式会社ソフテック

2 本日の話 OpenACC によるポーティングの実際 OpenACC ディレクティブ概略説明 Accelerator Programming Model Fortran プログラムによるポーティングステップ三つのディレクティブの利用性能チューニング PGI Accelerator Compiler 製品を使用 1

3 PGI OpenACC 対応コンパイラを使用 PGI Accelerator Compiler 製品 (x64+gpu) 内に実装 PGI アクセラレータコンパイラ製品 (PGI Accelerator Fortran/C/C++) 1. OpenACC コンパイラ (Fortran, C99) 2. PGI Accelerator Programming Model (directiveベース) 3. PGI CUDA Fortran 4. PGI CUDA-x86 for C/C++ compatible& superset 2012 年 7 月 OpenACC 正式版リリース PGI アクセラレータコンパイラソフテック情報サイト 2

4 OpenACC Standard とは何か? 2011 年 11 月 NVIDIA, Cray, PGI, CAPS Accelerators 用のプログラミング API の標準仕様 Fortran, C/C++ 言語上で指定するコンパイラディレクティブ群ユーザサイド開発者がアクセラレータで実行するコード部分をディレクティブで指定する ( コンパイラに対してヒントを与える ) OpenACC コンパイラホスト側の処理をアクセラレータ (GPU) にオフロードするコード生成ホスト -- GPU 間のデータ転送コードの生成 2009 年リリース以来実績を積んだ PGI Accelerator Compiler(directives) がベースとなっている 3

5 Accelerator Programming Model ホスト側ハイブリッド構成 (CPU + Accelerator) GPU 側 CPU Main Memory Host_A(100) 重い計算部分の処理をオフロード使用データを送る結果データを戻す Overhead GPU Device Memory Device_A(100) Host GPU 間のメモリデータの転送が伴うデータ転送のオーバーヘッド時間が伴う 4

6 OpenACC ディレクティブの主な構成ホスト ( 処理 ) Accelerator 1 CPU 重い計算部分の処理をオフロード 3 GPGPU Main Memory 2 ( データ ) Device Memory 1 Accelerate Compute 構文 (offload 領域指示 ) 2 Data 構文 ( データ移動指示 ) 3 Loop 構文 (Mapping for parallel/vector, Tuning) 5

7 program main integer :: n! size of the vector real,dimension(:),allocatable :: a! the vector real,dimension(:),allocatable :: r! the results integer :: i n = allocate(a(n)) allocate(r(n)) do i = 1,n a(i) = i*2.0!$acc kernels do i = 1,n r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2!$acc end kernels print *, r(1000) end 2 行のディレクティブ挿入でコード生成 $ pgfortran -acc -Minfo test.f90 main: 12, Generating copyout(r(1:100000)) Generating copyin(a(1:100000)) Generating compute capability 1.0 binary Generating compute capability 2.0 binary 13, Loop is parallelizable Accelerator kernel generated 13,!$acc loop gang, vector(256)! blockidx%x threadidx%x オフロードする並列対象領域の指 ( 一般にループ部分 ) GPU 側へのデータコピー GPU 用の並列化 Host 側へデータバック自動的かつ Implicit に行う 6

8 1 Accelerate Compute 構文 program main integer :: n! size of the vector real,dimension(:),allocatable :: a! the vector real,dimension(:),allocatable :: r! the results integer :: i n = allocate(a(n)) allocate(r(n)) do i = 1,n a(i) = i*2.0!$acc data copyin(a(1:n)),copyout(r)!$acc kernels!$acc loop gang(32),vector(64) do i = 1,n r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2!$acc end kernels!$acc end data -- Fortran -- 主要な三つのディレクティブ 1 並列実行 kernel 部分の指定オフロードする並列対象領域の指 ( 一般にループ部分 ) 7

9 2 Data 構文 program main integer :: n! size of the vector real,dimension(:),allocatable :: a! the vector real,dimension(:),allocatable :: r! the results integer :: i n = allocate(a(n)) allocate(r(n)) do i = 1,n a(i) = i*2.0!$acc data copyin(a(1:n)),copyout(r)!$acc kernels!$acc loop gang(32),vector(64) do i = 1,n r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2!$acc end kernels!$acc end data -- Fortran -- 主要な三つのディレクティブ 2 データ移動指示 1 並列実行 kernel 部分の指定オフロードする並列対象領域の指 ( 一般にループ部分 ) 8

10 3 Loop 構文 program main integer :: n! size of the vector real,dimension(:),allocatable :: a! the vector real,dimension(:),allocatable :: r! the results integer :: i n = allocate(a(n)) allocate(r(n)) do i = 1,n a(i) = i*2.0!$acc data copyin(a(1:n)),copyout(r)!$acc kernels!$acc loop gang(32),vector(64) do i = 1,n r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2!$acc end kernels!$acc end data -- Fortran -- 主要な三つのディレクティブ 2 データ移動指示 1 並列実行 kernel 部分の指定 3mapping for para/vector オフロードする並列対象領域の指 ( 一般にループ部分 ) 9

11 三つの構文を使用してポーティング!$acc data copyin(a(1:n)),copyout(r)!$acc kernels!$acc loop gang(32),vector(64) do i = 1,n r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2!$acc end kernels!$acc end data Data 構文 Accelerate Compute 構文 Loop 構文オフロードする並列対象領域 ( 一般にループ部分 ) 10

12 OpenACCを使用して GPU 上での実行を行うまでのプログラムポーティングを行う (Fortran) 11

13 subroutine driver (u,f) * dx - grid spacing in x direction * dy - grid spacing in y direction ( 配列宣等は省略 ) * Initialize data cpu0 = second() call initialize (n,m,alpha,dx,dy,u,f) * Solve Helmholtz equation ヤコビ反復プログラム call jacobi (n,m,dx,dy,alpha,relax,u,f,tol,mits) * Check error between exact solution call error_check (n,m,alpha,dx,dy,u,f) cpu1 = second() * Printout Elapsed time elapsed = (cpu1 -cpu0) * t_ac print '(/,1x,a,F10.3/)', & Elpased Time (Initialize + Jacobi solver + Check) : ',elapsed return end 三つのサブルーチンコール手続間で配列の受渡し有り (u,f) 各ルーチン内で高速化を図る時間の掛かっている場所は? 時間の掛かるループ内で使用されている配列は何か? 手続間のデータの受渡しの状況を見る 12

14 Subroutine Jacobi の核心部分 error = 10.0 * tol k = 1 do while (k.le.maxit.and. error.gt. tol) error = 0.0!$omp parallel default(shared)!$omp do do j=1,m do i=1,n uold(i,j) = u(i,j)!$omp do private(resid) reduction(+:error) do j = 2,m-1 do i = 2,n-1 resid = (ax*(uold(i-1,j) + uold(i+1,j)) & + ay*(uold(i,j-1) + uold(i,j+1)) & + b * uold(i,j) - f(i,j))/b u(i,j) = uold(i,j) - omega * resid error = error + resid*resid end do!$omp nowait!$omp end parallel * Error check k = k + 1 error = sqrt(error)/dble(n*m)! End iteration loop Do while ループ内でステンシル計算収束条件を満たしたら終了内部は 2 次元の nested loop uold(i,j) 配列は並列実行依存性なし u(i,j) 配列も依存性無しストアのみ f(i,j) 配列も依存性無し参照のみ error 変数はリダクション演算並列依存性とは無し : u(i) = u(i) 有り : u(i) =u(i-1) 同じ配列で定義 ~ 参照関係があるとき依存性の検討要 13

15 Jacobi ルーチンへの OpenMP directives error = 10.0 * tol k = 1 do while (k.le.maxit.and. error.gt. tol) error = 0.0!$omp parallel default(shared)!$omp do do j=1,m do i=1,n uold(i,j) = u(i,j)!$omp do private(resid) reduction(+:error) do j = 2,m-1 do i = 2,n-1 resid = (ax*(uold(i-1,j) + uold(i+1,j)) & + ay*(uold(i,j-1) + uold(i,j+1)) & + b * uold(i,j) - f(i,j))/b u(i,j) = uold(i,j) - omega * resid error = error + resid*resid end do!$omp nowait!$omp end parallel * Error check k = k + 1 error = sqrt(error)/dble(n*m)! End iteration loop $ pgfortran -fastsse mp Minfo jacobi.f jacobi: 204, Parallel region activated 206, Parallel loop activated with static block schedule 207, Memory copy idiom, loop replaced by call to c_mcopy8 214, Barrier 215, Parallel loop activated with static block schedule 216, Generated 4 alternate versions of the loop Generated vector sse code for the loop Generated 4 prefetch instructions for the loop 223, Begin critical section End critical section Parallel region terminated 行番号コンパイラメッセージ実際に並列化とベクトル化を実装している 14

16 シリアル実行用コンパイルとその実行 OpenACC]$ pgfortran -O3 openmp.f -Minfo initialize: 139, Invariant if transformation 140, Invariant assignments hoisted out of loop jacobi: 207, Memory copy idiom, loop replaced by call to c_mcopy8 error_check: 262, Invariant if transformation 263, Invariant assignments hoisted out of loop OpenACC]$ a.out Input n,m - grid real*8 in x,y direction N= 5120 M= 5000 Input alpha - Helmholts constant Input relax - Successive over-relaxation parameter Input tol - error tolerance for iterative solver Input mits - Maximum iterations for solver Time measurement accuracy :.10000E-05 Total Number of Iterations 101 Residual E-011 Solution Error : E-004 コンパイル実行 Elpased Time (Initialize + Jacobi solver + Check) :

17 シリアル実行用最適化コンパイルとその実行 OpenACC]$ pgfortran -fastsse openmp.f -Minfo initialize: 139, Invariant if transformation 140, Invariant assignments hoisted out of loop Loop not vectorized: mixed data types Unrolled inner loop 4 times jacobi: 207, Memory copy idiom, loop replaced by call to c_mcopy8 216, Generated 4 alternate versions of the loop Generated vector sse code for the loop Generated 4 prefetch instructions for the loop error_check: 262, Invariant if transformation 263, Invariant assignments hoisted out of loop Generated 2 alternate versions of the loop Generated vector sse code for the loop Generated a prefetch instruction for the loop [kato@photon29 OpenACC]$ a.out ( 省略 ) Residual E-011 Solution Error : E-004 SSE ベクトル化コンパイル実行 Elpased Time (Initialize + Jacobi solver + Check) :

18 OpenMP 並列実行用コンパイル OpenACC]$ pgf90 -fastsse -mp openmp.f -Minfo initialize: 138, Parallel region activated 139, Parallel loop activated with static block schedule 140, Loop not vectorized: mixed data types Unrolled inner loop 4 times 147, Parallel region terminated スレッド並列化 jacobi: 204, Parallel region activated 206, Parallel loop activated with static block schedule 207, Memory copy idiom, loop replaced by call to c_mcopy8 214, Barrier 215, Parallel loop activated with static block schedule 216, Generated 4 alternate versions of the loop Generated vector sse code for the loop Generated 4 prefetch instructions for the loop 226, Begin critical section SSE ベクトル化 End critical section Parallel region terminated error_check: 261, Parallel region activated 262, Parallel loop activated with static block schedule 263, Generated 2 alternate versions of the loop Generated vector sse code for the loop Generated a prefetch instruction for the loop 269, Begin critical section End critical section Parallel region terminated 17

19 OpenMP 並列実行 OpenACC]$ export OMP_NUM_THREADS=4 (4 スレッド実行 ) [kato@photon29 OpenACC]$ a.out Input n,m - grid real*8 in x,y direction N= 5120 M= 5000 Input alpha - Helmholts constant Input relax - Successive over-relaxation parameter Input tol - error tolerance for iterative solver Input mits - Maximum iterations for solver Time measurement accuracy :.10000E-05 Total Number of Iterations 101 Residual E-011 Solution Error : E-004 Elpased Time (Initialize + Jacobi solver + Check) :

20 シリアル OpenMP 並列実行性能 ( 倍精度演算 ) OpenMP と OpenACC 時間 ( 秒 ) 倍率 1 core スレッド (without SSE vector) -O core スレッド (with SSE vector) -fastsse x 2.0 OpenMP 4 core スレッド並列性能 -mp -fastsse 8.74 x 3.0 OpenMP 用オプションベクトル最適化用オプション OpenMP 性能 OpenACC 性能 : Intel(R) Core(TM) i7 CPU 2.67GHz (Nehalem) 4core : (Host) 同上 : (GPU) NVIDIA GeForce GTX 580 PGI 12.5 を使用 19

21 プログラムのプロファイリング取得コンパイル ( シリアル実用 ) [kato@photon29]$ pgfortran -fastsse -Minfo=ccff jacobi.f -o jacobi コンパイル (OpenMP 並列実用 ) [kato@photon29]$ pgfortran -fastsse -Minfo=ccff mp jacobi.f -o jacobi プロファイルデータの取得 (pgcollect utility でサンプリング ) [kato@photon29]$ pgcollect -time jacobi Input n,m - grid real*8 in x,y direction N= 5120 M= 5000 Input alpha - Helmholts constant Input relax - Successive over-relaxation parameter Input tol - error tolerance for iterative solver Input mits - Maximum iterations for solver Time measurement accuracy :.10000E-05 Total Number of Iterations 101 Residual E-011 Solution Error : E-004 実際の実行 Elpased Time (Initialize + Jacobi solver + Check) : target process has terminated, writing profile data プロファイリングツール pgprof の実 [kato@photon29]$ pgprof -exe jacobi ( プロファイラツールの起動 ) 20

22 1 スレッドシリアル実行のプロファイル PGPROF PGI プロファイラ計算コスト 99% Jacobi ルーチン 67% c_mcopy8 ルーチン 32% メモリアクセスのための PGI 組込ルーチン 21

23 4 スレッド並列実行のプロファイルバリア同期計算コスト 97% 4 スレッドの各時間コスト Jacobi ルーチン 49% c_mcopy8 ルーチン 35% OpenMP バリア同期 13% 22

24 コンパイラフィードバック情報 click! ループこのループは 8.91 秒コンパイラフィードバック情報 Compute Intensity このループは 2.43 ベクトル化最適化実施のメッセージ 23

25 ポーティングでの作業方針 1. Jacobi ルーチンの時間コストが 99% 占める最初にこのルーチンの中の GPU 実行部分を特定して OpenACC ディレクティブを挿入 (targeting) 2. ホストと GPU 間のデータ移動を最小化する (GPU 上に計算に必要なデータを常駐化させる ) 3. NVIDIA GPU 用の Grid サイズ Block サイズのチューニングを行う 4. Jacobi ルーチン以外のルーチンに対しても OpenACC ディレクティブを適用する 5. プログラム全体にスコープ範囲を広げホストと GPU 間のデータ移動を最小化する 24

26 三つの構文を使って GPU 用に並列化する OpenACC ディレクティブを使用する!$acc data!$acc kernels!$acc loop do i = 1, n { 並列化可能なループ } end do!$acc end kernels!$acc end data Data 構文 Accelerate Compute 構文 Loop 構文 25

27 PGI コンパイラ OpenACC 用オプション OpenACC directive を認識する Fortran $ pgfortran acc Minfo fast {source}.f90 あるいは $ pgfortran ta=nvidia Minfo fast {source}.f90 (PGI Accelerator directives あるいは OpenACC directives を認識 ) C (C99) 現在 C++ には実装していない $ pgcc acc Minfo fast {source}.c あるいは $ pgcc ta=nvidia Minfo fast {source}.c (PGI Accelerator directives あるいは OpenACC directives 認識 ) 26

28 まず Kernels directive を挿入してみる error = 10.0 * tol k = 1 収束判定ループ do while (k.le.maxit.and. error.gt. tol) error = 0.0!$acc kernels do j=1,m do i=1,n uold(i,j) = u(i,j) 1 2 Accelerator 領域の開始 3 do j = 2,m-1 do i = 2,n-1 resid = (ax*(uold(i-1,j) + uold(i+1,j)) & + ay*(uold(i,j-1) + uold(i,j+1)) & + b * uold(i,j) - f(i,j))/b u(i,j) = uold(i,j) - omega * resid 3 error = error + resid*resid end do!$acc end kernels Accelerator 領域の終了 * Error check k = k + 1 error = sqrt(error)/dble(n*m)! End iteration loop コンパイラは以下のコードを自動生成 GPU 上のメモリに配列データエリアをアロケートホスト側のデータをGPU 側へコピーするホスト側から kernel プログラムを起動する GPU 上で計算した結果をホスト側に戻す GPU 上のデータをデアロケート問題は? データ転送回数 27

29 PGI コンパイラフィードバック情報 (-Minfo) pgfortran acc -fastsse Minfo=accel -o jacobi1.exe jacobi1.f jacobi: 行番号 204, Generating copyout(uold(1:n,1:m)) Generating copyin(u(:n,:m)) Generating copyout(u(2:n-1,2:m-1)) Generating copyin(f(2:n-1,2:m-1)) Generating compute capability 1.3 binary Generating compute capability 2.0 binary Accelerator kernel generated 213,!$acc loop gang, vector(8)! blockidx%y threadidx%y 214,!$acc loop gang, vector(8)! blockidx%x threadidx%x 使用レジスタ数使用 shared Mem 使用 const. Mem Occupancy per SM CC 1.3 : 32 registers; 640 shared, 28 constant; 50% occupancy CC 2.0 : 28 registers; 520 shared, 160 constant; 33% occupancy 222, Sum reduction generated for error 配列名 Host GPU 間配列データの転送命令生成ネストループの並列分割の様子 (Grid/Block) 総和リダクション検出しリダクションコード生成 NVIDIA H/W Compute capability 使用特性 28

30 実行コンパイル & 実行モジュール作成 make jacobi1.exe pgfortran -o jacobi1.exe jacobi1.f -fastsse -Minfo=accel acc jacobi1.exe と言うモジュールには Host 用コード +GPU 用コードが含まれる実行 jacobi1.exe Input n,m - grid real*8 in x,y direction N= 5120 M= 5000 Input alpha - Helmholts constant Input relax - Successive over-relaxation parameter Input tol - error tolerance for iterative solver Input mits - Maximum iterations for solver Time measurement accuracy :.10000E-05 Total Number of Iterations 101 Residual E-011 Solution Error : E-004 Elpased Time (Initialize + Jacobi solver + Check) : FORTRAN STOP 29

31 PGI 環境変数 (Accelerator Profile) PGI_ACC_TIME $ export PGI_ACC_TIME=1 実行時に OpenACC 領域の実行プロファイル情報を出力する Accelerator Kernel Timing data プロファイル時間の単位 :μ 秒 jacobi( ルーチン名 ) 204: region entered 100 times time(us): total= init= region= Kernelの実時間 (1.89 秒 ) kernels= data= w/o init: total= max= min= avg= データ転送時間 (15.52 秒 ) 番号 206: kernel launched 100 times grid: [640x625] block: [8x8] time(us): total= max=5204 min=5165 avg= : kernel launched 100 times grid: [640x625] block: [8x8] time(us): total= max=13200 min=13180 avg= : kernel launched 100 times grid: [1] block: [256] time(us): total=56786 max=570 min=565 avg= ループの Kernel の実時間 (1.31 秒 ) Grid/Block 分割のサイズ 30

32 OpenACC 実行性能サマリー ( 倍精度演算 ) OpenMP 性能と OpenACC 性能時間 ( 秒 ) 倍率 1 core スレッド (without SSE vector) -O core スレッド (with SSE vector) -fastsse OpenMP 4 core スレッド並列性能 -mp -fastsse 8.74 x 1.0 OpenACC ( 対象ループに kernels 構文のみ挿入 ) OpenMP 性能 OpenACC 性能 : Intel(R) Core(TM) i7 CPU 2.67GHz (Nehalem) 4core : (Host) 同上 : (GPU) NVIDIA GeForce GTX

33 ループ内にデータ転送があると転送の嵐 do while (k.le.maxit.and. error.gt. tol)!$acc kernels DO ループ群!$acc end kernels 収束するまで繰り返す外側ループ Host GPU データコピー場所 GPU 上の計算処理 GPU Host データバック場所 end do! End iteration loop Data 構文の利用 : 並列処理とデータ移動の指示を分離する 32

34 ループの外でデータ転送を行う明示的にデータ構文で転送指示!!$acc data copy(u) copyin(f)... do while (k.le.maxit.and. error.gt. tol)!$acc kernels DO ループ群 Host GPU データコピー場所 GPU 内のデータは常駐化 GPU 上の計算処理!$acc end kernels end do! End iteration loop!$acc end data GPU Host データバック場所 33

35 Data Directive を使用する error = 10.0 * tol k = 1!$acc data copy(u)!$acc+ copyin(f) create(uold) do while (k.le.maxit.and. error.gt. tol) error = 0.0 * Copy new solution into old!$acc kernels kernels 並列領域の開始 do j=1,m do i=1,n uold(i,j) = u(i,j) do j = 2,m-1 do i = 2,n-1 resid = (ax*(uold(i-1,j) + uold(i+1,j)) & + ay*(uold(i,j-1) + uold(i,j+1)) & + b * uold(i,j) - f(i,j))/b u(i,j) = uold(i,j) - omega * resid error = error + resid*resid end do!$acc end kernels 1 2 kernels 並列領域の終了 3 3 * Error check k = k + 1 error = sqrt(error)/dble(n*m)! End iteration loop!$acc end data Acc データ領域 4 5 収束判定ループの外側でデータ領域を指定 GPU 上に使用データを常駐させる収束ループが終了時にデータをホストに戻す Host-GPU 間のデータ転送の削減 34

36 データ転送を 1 回だけにした場合のプロファイル Accelerator Kernel Timing data /home/kato/gpgpu/openmp/double/openacc/jacobi2.f jacobi 205: region entered 100 times 3つのkernel の存在 kernels 構の領域の情報 time(us): total= init=3 region= kernels= data=0 w/o init: total= max=19442 min=19148 avg= : kernel launched 100 times grid: [640x625] block: [8x8] time(us): total= max=5197 min=5169 avg= : kernel launched 100 times grid: [640x625] block: [8x8] time(us): total= max=13186 min=13172 avg= : kernel launched 100 times grid: [1] block: [256] time(us): total=56761 max=569 min=566 avg=567 データ転送時間 (0 秒 ) 次はこの時間をチューニングする /home/kato/gpgpu/openmp/double/openacc/jacobi2.f jacobi 199: region entered 1 time データ構の領域のプロファイル情報 1 回のみ time(us): total= init=87485 region= data= データ転送時間 (0.11 秒 ) w/o init: total= max= min= avg=

37 OpenACC 実行性能サマリー ( 倍精度演算 ) OpenMP 性能と OpenACC 性能時間 ( 秒 ) 倍率 1 core スレッド (without SSE vector) -O core スレッド (with SSE vector) -fastsse OpenMP 4 core スレッド並列性能 -mp -fastsse 8.74 x 1.0 OpenACC ( 対象ループに kernels 構文のみ挿入 ) OpenACC ( 繰返ループの外側に data 構文を挿入 ) 2.32 x 3.7 OpenMP 性能 OpenACC 性能 : Intel(R) Core(TM) i7 CPU 2.67GHz (Nehalem) 4core : (Host) 同上 : (GPU) NVIDIA GeForce GTX

38 Loop Directives で並列動作を調整 error = 10.0 * tol k = 1!$acc data copy(u(1:n,1:m))!$acc+ copyin(f(1:n,1:m)) create(uold(1:n,1:m)) do while (k.le.maxit.and. error.gt. tol) error = 0.0 * Copy new solution into old!$acc kernels Accelerator 並列領域の開始 do j=1,m do i=1,n uold(i,j) = u(i,j)!$acc loop gang, vector(8) do j = 2,m-1!$acc loop gang, vector(8) do i = 2,n-1 resid = (ax*(uold(i-1,j) + uold(i+1,j)) & + ay*(uold(i,j-1) + uold(i,j+1)) & + b * uold(i,j) - f(i,j))/b u(i,j) = uold(i,j) - omega * resid error = error + resid*resid end do!$acc end kernels Accelerator 並列領域の終了 * Error check k = k + 1 error = sqrt(error)/dble(n*m)! End iteration loop!$acc end data コンパイラは自動的に対象並列ループを CUDA の Thread-block/Grid に分割マッピングするブロック分割等の mapping を明的に変更することが可能より良い性能を出すには gang, vector の並列スケジューリングを変えて試行錯誤が必要 37

Accelerator ループマッピングを変更する例えば Grid size (16 x16) Block size (16 x16) jacobi: 217, Generating local(uold(:,:)) Generating local(resid) Generating copyin(f(:n,:m)) Generating copy(u(:n,:m)) 235, Loop

39 Accelerator ループマッピングを変更する例えば Grid size (16 x16) Block size (16 x16) jacobi: 217, Generating local(uold(:,:)) Generating local(resid) Generating copyin(f(:n,:m)) Generating copy(u(:n,:m)) 235, Loop is parallelizable 237, Loop is parallelizable Accelerator kernel generated 235,!$acc loop gang(16), vector(16)! blockidx%y threadidx%y 237,!$acc loop gang(16), vector(16)! blockidx%y threadidx%y loop scheduling 節を変更!$acc loop gang(16) vector(16) ( 235) do j = 2,m-1!$acc loop gang(16) vector(16) ( 237) do i= 2,n-1 ( 238) resid = (ax*(uold(i-1,j) + uold(i+1,j)) ( 239) & + ay*(uold(i,j-1) + uold(i,j+1)) ( 240) & + b * uold(i,j) - f(i,j))*b1b ( 241) u(i,j) = uold(i,j) - omega * resid ( 242) error = error + resid*resid ( 243) end do ( 244)!$acc end region CC 1.3 : 26 registers; 2176 shared, 36 constant, 0 local memory bytes; 50% occupancy CC 2.0 : 26 registers; 2056 shared, 144 constant, 0 local memory bytes; 66% occupancy 242, Sum reduction generated for error 38

40 実行プロファイル情報で性能評価 loop scheduling(grid/block size) の変更で性能が変わる 235,!$acc loop gang, vector(8)! blockidx%y threadidx%y 237,!$acc loop gang, vector(8)! blockidx%x threadidx%x 237: kernel launched 100 times grid: [640x625] block: [8x8] time(us): total= max=13189 min=13176 avg= ,!$acc loop gang(16), vector(16)! blockidx%y threadidx%y 237,!$acc loop gang(16), vector(16)! blockidx%x threadidx%x 237: kernel launched 100 times grid: [16x16] block: [16x16] time(us): total= max=7365 min=7223 avg=7293 Device Name: GeForce GTX 580 ( 上記は倍精度計算 ) μ 秒全体の実行時間 :2.32 秒全体の実行時間 :1.32 秒 39

41 OpenACC 実行性能サマリー ( 倍精度演算 ) OpenMP 性能と OpenACC 性能時間 ( 秒 ) 倍率 1 core スレッド (without SSE vector) -O core スレッド (with SSE vector) -fastsse OpenMP 4 core スレッド並列性能 -mp -fastsse 8.74 x 1.0 OpenACC ( 対象ループに kernels 構文のみ挿入 ) OpenACC ( 繰返ループの外側に data 構文を挿入 ) 2.32 x 3.7 OpenACC ( 対象ループを loop 節で並列 mapping 調整 ) 1.32 x 6.6 OpenMP 性能 OpenACC 性能 : Intel(R) Core(TM) i7 CPU 2.67GHz (Nehalem) 4core : (Host) 同上 : (GPU) NVIDIA GeForce GTX

42 プログラム全体にスコープ範囲を広げる subroutine driver (u,f) * dx - grid spacing in x direction * dy - grid spacing in y direction ( 配列宣等は省略 ) * Initialize data cpu0 = second()!$acc data copy (u,f) call initialize (n,m,alpha,dx,dy,u,f) * Solve Helmholtz equation call jacobi (n,m,dx,dy,alpha,relax,u,f,tol,mits) * Check error between exact solution call error_check (n,m,alpha,dx,dy,u,f)!$acc end data cpu1 = second() * Printout Elapsed time elapsed = (cpu1 -cpu0) * t_ac print '(/,1x,a,F10.3/)', & Elpased Time (Initialize + Jacobi solver + Check) : ',elapsed return end u() と f() 配列がプログラム全体で使用される u() と f() 配列を Copyin to GPU 各手続上では u, f 配列に係わる計算処理を GPU kernel 化するだけ u() と f() 配列を Copyout to Host 41

43 コンパイラフィードバック情報 ( データ構文に関するもののみ抽出 ) subroutine initialize (n,m,alpha,dx,dy,u,f) real*8 u(n,m),f(n,m),dx,dy,alpha!$acc kernels copyin(dx,dy,alpha) present(u,f) driver:!$acc loop gang private(xx,yy) 108, Generating copy(f(:,:)) do j = 1,m Generating copy(u(:,:))!$acc loop vector(256) ( 以下省略 ) do i = 1,n initialize: xx = dx * real(i-1)! -1 < x < 1 152, Generating present(u(:,:)) yy = dy * real(j-1)! -1 < y < 1 Generating present(f(:,:)) u(i,j) = 0.0 ( 以下省略 ) f(i,j) = -alpha *(1.0-xx*xx)*(1.0-yy*yy) jacobi: & - 2.0*(1.0-xx*xx)-2.0*(1.0-yy*yy) 215, Generating present_or_copyin(f(:,:)) Generating present_or_copy(u(:,:)) Generating local(resid)!$acc end kernels Generating local(uold(1:n,1:m)) ( 以下省略 ) return error_check: 275, Generating present(u(:,:)) ( 以下省略 ) present 節の意味 u() と f() 配列に関しては既に GPU 上に存在していると言う意味 42

44 OpenACC 実行性能サマリー ( 倍精度演算 ) OpenMP 性能と OpenACC 性能時間 ( 秒 ) 倍率 1 core スレッド (without SSE vector) -O core スレッド (with SSE vector) -fastsse OpenMP 4 core スレッド並列性能 -mp -fastsse 8.74 x 1.0 OpenACC ( 対象ループに kernels 構文のみ挿入 ) OpenACC ( 繰返ループの外側に data 構文を挿入 ) 2.32 x 3.7 OpenACC ( 対象ループを loop 節で並列 mapping 調整 ) 1.32 x 6.6 OpenACC (mainプログラム上に data 構文 & present 節使用 ) 1.23 x 7.1 OpenMP 性能 OpenACC 性能 : Intel(R) Core(TM) i7 CPU 2.67GHz (Nehalem) 4core : (Host) 同上 : (GPU) NVIDIA GeForce GTX

45 ポーティング開発時における便利なツール等 NVIDIA Visual Profiler ストリームイベント等挙動の視覚的な把握詳細なカーネル特性の把握 PGI 環境変数 ACC_NOTIFY kernel 動作の実行時履歴の出力 PGI_ACC_DEBUG 実行時の CUDA システムコールのイベントログ出力 44

NVIDIA Visual Profiler を使う [kato@photon29]$ make jacobi4.exe コンパイル pgfortran -o jacobi4.exe jacobi4.f -fastsse -acc -ta=nvidia:cuda4.

46 NVIDIA Visual Profiler を使う make jacobi4.exe コンパイル pgfortran -o jacobi4.exe jacobi4.f -fastsse -acc -ta=nvidia:cuda4.1 which nvvp /usr/local/cuda/bin/nvvp CUDA toolkit 4.1 を使用するように指示する nvvp (NVIDIA Visual Profiler の起動 ) 起動実行モジュール jacobi4.exe の指定 NVIDIA Visual Profiler 4.1 を使用する 45

47 NVIDIA CUDA Visual Profiler(1) データコピーが頻繁! GPU 特性全体性能特性この stream 全体の挙動が色別で分かるデータ転送 ( カーキー色 ) が卓越 46

48 NVIDIA CUDA Visual Profiler(2) Compute kernels の実行が主体この stream 全体の挙動が色別で分かるカーネル実行 ( ピーコックブルー色 ) が卓越 Kernel の実行特性 47

49 NVIDIA CUDA Visual Profiler(3) Timeline の詳細 Kernel の実行特性個々のイベント特性の詳細 48

50 PGI 環境変数 ( カーネル起動のログ ) ACC_NOTIFY $export ACC_NOTIFY=1 実行中アクセラレータ上のkernel 動作実行履歴を出力する launch kernel file=/home/kato/jacobi4.f function=initialize line=154 device=0 grid=20 block=256 launch kernel file=/home/kato/jacobi4.f function=jacobi line=227 device=0 grid=5000 block=256 launch kernel file=/home/kato/jacobi4.f function=jacobi line=235 device=0 grid=320x16 block=16x16 launch kernel file=/home/kato/jacobi4.f function=jacobi line=240 device=0 grid=1 block=256 Kernel 実行が行われているか Kernel はどのような並列分割 (grid, thread block) で実行されているか確認できる 49

51 PGI 環境変数 ( 実行時のイベントログ ) PGI_ACC_DEBUG (PGI 2013 以降 ) $export PGI_ACC_DEBUG=1 (disable したい場合は 0) 実行時のPGIのCUDAシステムコールのイベントログを出力 [kato@photon29]$ export PGI_ACC_DEBUG=1 [kato@photon29]$ jacobi4.exe pgi_cu_init() found 2 devices pgi_cu_init( file=acc_init.c, function=acc_init, line=41, startline=1, endline=-1 ) pgi_cu_init() will use device 0 (V2.0) pgi_cu_init() compute context created initialize nvidia pgi_cu_init( file=/home/kato/gpgpu/openmp/double/openacc/jacobi4.f, function=driver, line=107, startline=69, endline=129 ) pgi_acc_dataon(devptr=0x1,hostptr=0x7ff535c48230,offset=0,0,stride=1,5120,size=5120x5000, extent=5120x5000,eltsize=8,lineno=107,name=f,flags=0xf00=create+present+copyin+copyout) NO map for host:0x7ff535c48230 pgi_cu_alloc(size= ,lineno=107,name=f) pgi_cu_alloc( ) returns 0x map dev:0x host:0x7ff535c48230 size: offset:0 data[dev:0x7ff535c48230 host:0x size: ] (line:107 name:f) pgi_cu_launch_a(func=0xaf6f40, params=0x7fff8d2e1190, bytes=72, sharedbytes=0) First arguments are:

52 終わり 51

Microsoft PowerPoint - GDEP-GPG_softek_May24-2.pptx

Microsoft PowerPoint - GDEP-GPG_softek_May24-2.pptx G-DEP 第 3 回セミナー PGI OpenACC Compiler PGIコンパイラ使用の実際新しい OpenACC によるプログラミング 2012 年 5 月加藤努株式会社ソフテック OpenACC によるプログラミング GPU / Accelerator Computing Model のデファクトスタンダードへ OpenACC Standard 概略説明 Accelerator Programming