
GTC Japan 2012
PGI Accelerator Compiler
Hands-on: Porting with PGI OpenACC Directives
July 2012
Tsutomu Kato (加藤努), SofTek (株式会社ソフテック)

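Today's Talk
Porting with OpenACC in practice
- Overview of the OpenACC directives (Accelerator Programming Model)
- Porting steps for a Fortran program
    - Using the three directives
    - Performance tuning
- The PGI Accelerator Compiler product is used throughout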

Using the PGI OpenACC Compiler
OpenACC is implemented inside the PGI Accelerator Compiler products (x64+GPU).
The PGI Accelerator Compiler products (PGI Accelerator Fortran/C/C++) contain:
1. OpenACC compiler (Fortran, C99)
2. PGI Accelerator Programming Model (directive-based; compatible & superset)
3. PGI CUDA Fortran
4. PGI CUDA-x86 for C/C++
July 2012: official OpenACC release.
SofTek information site for the PGI Accelerator Compilers: http://www.softek.co.jp/spg/pgi/accel/index.html

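What Is the OpenACC Standard?
- Announced in November 2011 by NVIDIA, Cray, PGI, and CAPS.
- A standard specification of a programming API for accelerators.
- A set of compiler directives written in Fortran and C/C++ source code.
User-side developers
- Mark the parts of the code that should run on the accelerator with directives (that is, give hints to the compiler).
OpenACC compiler
- Generates code that offloads host-side processing to the accelerator (GPU).
- Generates the code for data transfers between the host and the GPU.
OpenACC is based on the PGI Accelerator Compiler (directives), which has built up a track record since its release in 2009.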

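Accelerator Programming Model
Hybrid configuration (CPU + accelerator)
    Host side: CPU, Main Memory, Host_A(100)
    GPU side:  GPU, Device Memory, Device_A(100)
- The heavy computation is offloaded to the GPU.
- The data it uses is sent to the GPU, and the result data is returned to the host.
- Overhead: every offload involves memory transfers between host and GPU, and those transfers cost time.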

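Main Components of the OpenACC Directives
    Host:        CPU (processing), Main Memory (data)
    Accelerator: GPGPU, Device Memory
The heavy computation is offloaded from the host to the accelerator.
(1) Accelerate Compute construct: specifies the offload region
(2) Data construct: specifies data movement
(3) Loop construct: mapping for parallel/vector execution, tuning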

Code generation by inserting two directive lines

    program main
      integer :: n                         ! size of the vector
      real,dimension(:),allocatable :: a   ! the vector
      real,dimension(:),allocatable :: r   ! the results
      integer :: i
      n = 100000
      allocate(a(n))
      allocate(r(n))
      do i = 1,n
        a(i) = i*2.0
      enddo
    !$acc kernels
      do i = 1,n
        r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
    !$acc end kernels
      print *, r(1000)
    end

    $ pgfortran -acc -Minfo test.f90
    main:
         12, Generating copyout(r(1:100000))
             Generating copyin(a(1:100000))
             Generating compute capability 1.0 binary
             Generating compute capability 2.0 binary
         13, Loop is parallelizable
             Accelerator kernel generated
             13, !$acc loop gang, vector(256) ! blockidx%x threadidx%x

The kernels directive marks the parallel region to offload (usually a loop). The compiler then does the following automatically and implicitly:
- copies the data to the GPU
- parallelizes the loop for the GPU
- copies the results back to the host
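
For readers working in C (which the deck lists as supported, C99), the same two-step pattern might look like the minimal sketch below. The function name `square_plus_one` and the simplified loop body are illustrative assumptions, not part of the original slides. One difference from the Fortran version: a C pointer carries no extent, so the sketch states the array shapes in `copyin`/`copyout` clauses instead of relying on implicit data movement. When no OpenACC compiler is used, the pragma is ignored and the loop simply runs on the host.

```c
#include <stdlib.h>

/* Hypothetical helper: fills r[i] = a[i]*a[i] + 1 with a[i] = 2*(i+1).
 * Returns 0 on success, -1 if the temporary array cannot be allocated. */
int square_plus_one(int n, float *restrict r)
{
    float *restrict a = malloc((size_t)n * sizeof *a);
    if (a == NULL)
        return -1;

    /* Host-side initialization, as in the Fortran example. */
    for (int i = 0; i < n; ++i)
        a[i] = (float)(i + 1) * 2.0f;

    /* Offload region: a is copied in, r is copied out. */
    #pragma acc kernels copyin(a[0:n]) copyout(r[0:n])
    {
        for (int i = 0; i < n; ++i)
            r[i] = a[i] * a[i] + 1.0f;
    }

    free(a);
    return 0;
}
```

With a PGI compiler of that generation, such a file would be built with `pgcc -acc -Minfo`, as described on the compiler-options slide.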

(1) Accelerate Compute Construct
The three main directives (Fortran)

    program main
      integer :: n                         ! size of the vector
      real,dimension(:),allocatable :: a   ! the vector
      real,dimension(:),allocatable :: r   ! the results
      integer :: i
      n = 100000
      allocate(a(n))
      allocate(r(n))
      do i = 1,n
        a(i) = i*2.0
      enddo
    !$acc data copyin(a(1:n)),copyout(r)
    !$acc kernels
    !$acc loop gang(32),vector(64)
      do i = 1,n
        r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
    !$acc end kernels
    !$acc end data
    end

(1) !$acc kernels ... !$acc end kernels specifies the part to execute in parallel as a kernel, that is, the parallel region to offload (usually a loop).

(2) Data Construct
The three main directives (Fortran). The program is the one shown on the Accelerate Compute construct slide; only the directive region is repeated here.

    !$acc data copyin(a(1:n)),copyout(r)
    !$acc kernels
    !$acc loop gang(32),vector(64)
      do i = 1,n
        r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
    !$acc end kernels
    !$acc end data

(2) !$acc data ... !$acc end data specifies data movement.
(1) !$acc kernels ... !$acc end kernels specifies the part to execute in parallel as a kernel, that is, the parallel region to offload (usually a loop).

(3) Loop Construct
The three main directives (Fortran). The program is the one shown on the Accelerate Compute construct slide; only the directive region is repeated here.

    !$acc data copyin(a(1:n)),copyout(r)
    !$acc kernels
    !$acc loop gang(32),vector(64)
      do i = 1,n
        r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
    !$acc end kernels
    !$acc end data

(2) !$acc data ... !$acc end data specifies data movement.
(1) !$acc kernels ... !$acc end kernels specifies the part to execute in parallel as a kernel, that is, the parallel region to offload (usually a loop).
(3) !$acc loop specifies the mapping for parallel/vector execution.

Porting with the Three Constructs

    !$acc data copyin(a(1:n)),copyout(r)     ! Data construct
    !$acc kernels                            ! Accelerate Compute construct
    !$acc loop gang(32),vector(64)           ! Loop construct
      do i = 1,n
        r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
      enddo
    !$acc end kernels
    !$acc end data

The kernels region encloses the parallel region to offload (usually a loop).

Porting a program (Fortran) with OpenACC, up to the point where it runs on the GPU

The driver calls three subroutines and passes the arrays (u, f) between them.

          subroutine driver (u,f)
    * dx - grid spacing in x direction
    * dy - grid spacing in y direction
    *     (array declarations etc. omitted)
    * Initialize data
          cpu0 = second()
          call initialize (n,m,alpha,dx,dy,u,f)
    * Solve Helmholtz equation (Jacobi iteration program)
          call jacobi (n,m,dx,dy,alpha,relax,u,f,tol,mits)
    * Check error between exact solution
          call error_check (n,m,alpha,dx,dy,u,f)
          cpu1 = second()
    * Printout Elapsed time
          elapsed = (cpu1 -cpu0) * t_ac
          print '(/,1x,a,F10.3/)',
         & 'Elpased Time (Initialize + Jacobi solver + Check) : ',elapsed
          return
          end

The goal is to speed up each routine. Questions to answer first:
- Where is the time being spent?
- Which arrays are used in the time-consuming loops?
- How is data passed between the procedures?

Core of Subroutine Jacobi

          error = 10.0 * tol
          k = 1
          do while (k.le.maxit .and. error.gt. tol)
             error = 0.0
    !$omp parallel default(shared)
    !$omp do
             do j=1,m
                do i=1,n
                   uold(i,j) = u(i,j)
                enddo
             enddo
    !$omp do private(resid) reduction(+:error)
             do j = 2,m-1
                do i = 2,n-1
                   resid = (ax*(uold(i-1,j) + uold(i+1,j))
         &                + ay*(uold(i,j-1) + uold(i,j+1))
         &                + b * uold(i,j) - f(i,j))/b
                   u(i,j) = uold(i,j) - omega * resid
                   error = error + resid*resid
                end do
             enddo
    !$omp enddo nowait
    !$omp end parallel
    * Error check
             k = k + 1
             error = sqrt(error)/dble(n*m)
          enddo                     ! End iteration loop

- A stencil computation inside a do while loop, which ends once the convergence condition is met.
- The body consists of two-dimensional nested loops.
- uold(i,j): no dependence under parallel execution.
- u(i,j): no dependence either (store only).
- f(i,j): no dependence either (read only).
- error: a reduction variable.

What is a parallel dependence?
    none:    u(i) = u(i)
    present: u(i) = u(i-1)
Whenever the same array is both defined and referenced in a loop, the dependence has to be examined.
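
The distinction between `u(i) = u(i)` and `u(i) = u(i-1)` can be checked with a small experiment: a loop without a loop-carried dependence gives the same result in any iteration order, while a dependent loop does not. The C sketch below is illustrative only; the four function names are hypothetical and do not come from the original program. Running the iterations backwards stands in for the arbitrary ordering that parallel execution may produce.

```c
/* Independent iterations: iteration i touches only u[i]. */
void scale_forward(double *u, int n)
{
    for (int i = 0; i < n; ++i)
        u[i] = 2.0 * u[i];
}

void scale_backward(double *u, int n)
{
    for (int i = n - 1; i >= 0; --i)
        u[i] = 2.0 * u[i];
}

/* Loop-carried dependence: iteration i reads u[i-1],
 * which iteration i-1 may already have overwritten. */
void shift_add_forward(double *u, int n)
{
    for (int i = 1; i < n; ++i)
        u[i] = u[i] + u[i - 1];
}

void shift_add_backward(double *u, int n)
{
    for (int i = n - 1; i >= 1; --i)
        u[i] = u[i] + u[i - 1];
}
```

Starting from {1, 2, 3, 4}, both `scale_*` variants give {2, 4, 6, 8}, whereas `shift_add_forward` gives {1, 3, 6, 10} and `shift_add_backward` gives {1, 3, 5, 7}. The Jacobi update avoids this problem because it reads only `uold` and writes only `u`.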

OpenMP Directives in the Jacobi Routine
The source is the Jacobi core shown on the previous slide, with its !$omp parallel, !$omp do, and !$omp do private(resid) reduction(+:error) directives.

    $ pgfortran -fastsse -mp -Minfo jacobi.f
    jacobi:
        204, Parallel region activated
        206, Parallel loop activated with static block schedule
        207, Memory copy idiom, loop replaced by call to c_mcopy8
        214, Barrier
        215, Parallel loop activated with static block schedule
        216, Generated 4 alternate versions of the loop
             Generated vector sse code for the loop
             Generated 4 prefetch instructions for the loop
        223, Begin critical section
             End critical section
             Parallel region terminated

Each message consists of a source line number followed by the compiler message. The messages confirm that the compiler actually implemented both parallelization and vectorization.

Compiling and Running for Serial Execution

Compile:

    [kato@photon29 OpenACC]$ pgfortran -O3 openmp.f -Minfo
    initialize:
        139, Invariant if transformation
        140, Invariant assignments hoisted out of loop
    jacobi:
        207, Memory copy idiom, loop replaced by call to c_mcopy8
    error_check:
        262, Invariant if transformation
        263, Invariant assignments hoisted out of loop

Run:

    [kato@photon29 OpenACC]$ a.out
    Input n,m - grid real*8 in x,y direction
    N= 5120 M= 5000
    Input alpha - Helmholts constant
    Input relax - Successive over-relaxation parameter
    Input tol - error tolerance for iterative solver
    Input mits - Maximum iterations for solver
    Time measurement accuracy : .10000E-05
    Total Number of Iterations 101
    Residual 3.8483507870214220E-011
    Solution Error : 1.0538552346934107E-004

    Elpased Time (Initialize + Jacobi solver + Check) : 26.191

Optimized Compilation for Serial Execution, and Its Run

Compile with SSE vectorization:

    [kato@photon29 OpenACC]$ pgfortran -fastsse openmp.f -Minfo
    initialize:
        139, Invariant if transformation
        140, Invariant assignments hoisted out of loop
             Loop not vectorized: mixed data types
             Unrolled inner loop 4 times
    jacobi:
        207, Memory copy idiom, loop replaced by call to c_mcopy8
        216, Generated 4 alternate versions of the loop
             Generated vector sse code for the loop
             Generated 4 prefetch instructions for the loop
    error_check:
        262, Invariant if transformation
        263, Invariant assignments hoisted out of loop
             Generated 2 alternate versions of the loop
             Generated vector sse code for the loop
             Generated a prefetch instruction for the loop

Run:

    [kato@photon29 OpenACC]$ a.out
    (omitted)
    Residual 3.8483507872410546E-011
    Solution Error : 1.0538552346934791E-004

    Elpased Time (Initialize + Jacobi solver + Check) : 13.149

Compiling for OpenMP Parallel Execution

    [kato@photon29 OpenACC]$ pgf90 -fastsse -mp openmp.f -Minfo
    initialize:
        138, Parallel region activated
        139, Parallel loop activated with static block schedule
        140, Loop not vectorized: mixed data types
             Unrolled inner loop 4 times
        147, Parallel region terminated
    jacobi:
        204, Parallel region activated
        206, Parallel loop activated with static block schedule
        207, Memory copy idiom, loop replaced by call to c_mcopy8
        214, Barrier
        215, Parallel loop activated with static block schedule
        216, Generated 4 alternate versions of the loop
             Generated vector sse code for the loop
             Generated 4 prefetch instructions for the loop
        226, Begin critical section
             End critical section
             Parallel region terminated
    error_check:
        261, Parallel region activated
        262, Parallel loop activated with static block schedule
        263, Generated 2 alternate versions of the loop
             Generated vector sse code for the loop
             Generated a prefetch instruction for the loop
        269, Begin critical section
             End critical section
             Parallel region terminated

The "Parallel region" and "Parallel loop" messages report thread parallelization; the "Generated vector sse code" messages report SSE vectorization.

OpenMP Parallel Execution

    [kato@photon29 OpenACC]$ export OMP_NUM_THREADS=4      (run with 4 threads)
    [kato@photon29 OpenACC]$ a.out
    Input n,m - grid real*8 in x,y direction
    N= 5120 M= 5000
    Input alpha - Helmholts constant
    Input relax - Successive over-relaxation parameter
    Input tol - error tolerance for iterative solver
    Input mits - Maximum iterations for solver
    Time measurement accuracy : .10000E-05
    Total Number of Iterations 101
    Residual 3.8483507872410740E-011
    Solution Error : 1.0538552346934802E-004

    Elpased Time (Initialize + Jacobi solver + Check) : 8.743

Serial and OpenMP Parallel Performance (Double Precision)

    Configuration                                  Options         Time (s)   Speedup
    1 core, single thread (without SSE vector)     -O3              26.19     (baseline)
    1 core, single thread (with SSE vector)        -fastsse         13.15     x 2.0
    OpenMP, 4 cores, 4 threads                     -mp -fastsse      8.74     x 3.0

-mp is the OpenMP option; -fastsse is the vector optimization option.
OpenMP measurements:  Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (Nehalem), 4 cores
OpenACC measurements: host as above; GPU: NVIDIA GeForce GTX 580
Compiler: PGI 12.5

Collecting a Profile of the Program

Compile (for serial execution):

    [kato@photon29]$ pgfortran -fastsse -Minfo=ccff jacobi.f -o jacobi

Compile (for OpenMP parallel execution):

    [kato@photon29]$ pgfortran -fastsse -Minfo=ccff -mp jacobi.f -o jacobi

Collect the profile data (sampling with the pgcollect utility). The program runs as usual while it is sampled:

    [kato@photon29]$ pgcollect -time jacobi
    Input n,m - grid real*8 in x,y direction
    N= 5120 M= 5000
    Input alpha - Helmholts constant
    Input relax - Successive over-relaxation parameter
    Input tol - error tolerance for iterative solver
    Input mits - Maximum iterations for solver
    Time measurement accuracy : .10000E-05
    Total Number of Iterations 101
    Residual 3.8483507872410546E-011
    Solution Error : 1.0538552346934791E-004

    Elpased Time (Initialize + Jacobi solver + Check) : 13.288
    target process has terminated, writing profile data

Run the pgprof profiling tool:

    [kato@photon29]$ pgprof -exe jacobi      (launches the profiler)

Profile of the Single-Thread Serial Run (PGPROF, the PGI profiler)
Computation cost: 99% in total
- Jacobi routine: 67%
- c_mcopy8 routine: 32% (a PGI built-in routine for memory access)

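Profile of the 4-Thread Parallel Run
The profiler shows the time cost of each of the 4 threads. Computation cost: 97% in total
- Jacobi routine: 49%
- c_mcopy8 routine: 35%
- OpenMP barrier synchronization: 13%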

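Compiler Feedback Information (profiler screenshot)
- Clicking a loop displays the compiler feedback information for it.
- This loop takes 8.91 seconds.
- The Compute Intensity of this loop is 2.43.
- The messages report that vectorization was applied.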

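Work Plan for the Port
1. The Jacobi routine accounts for 99% of the time cost, so start there: identify the parts of this routine to run on the GPU and insert OpenACC directives (targeting).
2. Minimize data movement between host and GPU (keep the data needed for the computation resident on the GPU).
3. Tune the grid size and block size for NVIDIA GPUs.
4. Apply OpenACC directives to the routines other than Jacobi as well.
5. Widen the scope to the whole program and minimize data movement between host and GPU.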

Parallelizing for the GPU with the Three Constructs
Use the OpenACC directives:

    !$acc data                    ! Data construct
    !$acc kernels                 ! Accelerate Compute construct
    !$acc loop                    ! Loop construct
          do i = 1, n
             { parallelizable loop }
          end do
    !$acc end kernels
    !$acc end data

PGI Compiler Options for OpenACC

Fortran

    $ pgfortran -acc -Minfo -fast {source}.f90
          (recognizes OpenACC directives)
  or
    $ pgfortran -ta=nvidia -Minfo -fast {source}.f90
          (recognizes PGI Accelerator directives or OpenACC directives)

C (C99); OpenACC is not currently implemented for C++

    $ pgcc -acc -Minfo -fast {source}.c
  or
    $ pgcc -ta=nvidia -Minfo -fast {source}.c
          (recognizes PGI Accelerator directives or OpenACC directives)

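First, Try Inserting the Kernels Directive

          error = 10.0 * tol
          k = 1
    * Convergence test loop
          do while (k.le.maxit .and. error.gt. tol)
             error = 0.0
    * Start of the accelerator region: steps (1) and (2)
    !$acc kernels
    * Step (3)
             do j=1,m
                do i=1,n
                   uold(i,j) = u(i,j)
                enddo
             enddo
    * Step (3)
             do j = 2,m-1
                do i = 2,n-1
                   resid = (ax*(uold(i-1,j) + uold(i+1,j))
         &                + ay*(uold(i,j-1) + uold(i,j+1))
         &                + b * uold(i,j) - f(i,j))/b
                   u(i,j) = uold(i,j) - omega * resid
                   error = error + resid*resid
                end do
             enddo
    * End of the accelerator region: steps (4) and (5)
    !$acc end kernels
    * Error check
             k = k + 1
             error = sqrt(error)/dble(n*m)
          enddo                     ! End iteration loop

The compiler automatically generates code that:
(1) allocates the array data areas in GPU memory
(2) copies the host data to the GPU
(3) launches the kernel programs from the host
(4) returns the results computed on the GPU to the host
(5) deallocates the data on the GPU

The problem: the number of data transfers. Steps (1), (2), (4), and (5) are repeated on every iteration of the convergence loop.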

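PGI Compiler Feedback Information (-Minfo)

    [kato@photon29]$ pgfortran -acc -fastsse -Minfo=accel -o jacobi1.exe jacobi1.f
    jacobi:
        204, Generating copyout(uold(1:n,1:m))
             Generating copyin(u(:n,:m))
             Generating copyout(u(2:n-1,2:m-1))
             Generating copyin(f(2:n-1,2:m-1))
             Generating compute capability 1.3 binary
             Generating compute capability 2.0 binary
             Accelerator kernel generated
        213, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
        214, !$acc loop gang, vector(8) ! blockidx%x threadidx%x
             CC 1.3 : 32 registers; 640 shared, 28 constant; 50% occupancy
             CC 2.0 : 28 registers; 520 shared, 160 constant; 33% occupancy
        222, Sum reduction generated for error

How to read the messages:
- The leading number (204, 213, ...) is the source line number.
- "Generating copyin/copyout(array)": transfer instructions generated for that array between host and GPU.
- "!$acc loop gang, vector(8)": how the nested loops are partitioned for parallel execution (grid/block).
- "CC 1.3 / CC 2.0" lines: usage characteristics per NVIDIA hardware compute capability, namely registers used, shared memory used, constant memory used, and occupancy per SM.
- "Sum reduction generated": the compiler detected the sum reduction and generated reduction code.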

Running

Compile and build the executable module:

    [kato@photon29]$ make jacobi1.exe
    pgfortran -o jacobi1.exe jacobi1.f -fastsse -Minfo=accel -acc

The module jacobi1.exe contains both the host code and the GPU code.

Run:

    [kato@photon29]$ jacobi1.exe
    Input n,m - grid real*8 in x,y direction
    N= 5120 M= 5000
    Input alpha - Helmholts constant
    Input relax - Successive over-relaxation parameter
    Input tol - error tolerance for iterative solver
    Input mits - Maximum iterations for solver
    Time measurement accuracy : .10000E-05
    Total Number of Iterations 101
    Residual 3.8483507872410927E-011
    Solution Error : 1.0538552346934791E-004

    Elpased Time (Initialize + Jacobi solver + Check) : 17.950
    FORTRAN STOP

PGI Environment Variable (Accelerator Profile): PGI_ACC_TIME

    $ export PGI_ACC_TIME=1

When set, the program prints execution profile information for its OpenACC regions at run time. Profile times are in microseconds.

    Accelerator Kernel Timing data
    jacobi
      204: region entered 100 times
        time(us): total=17765937 init=287277 region=17478660
                  kernels=1895314 data=15523197
        w/o init: total=17478660 max=204303 min=173351 avg=174786
        206: kernel launched 100 times
          grid: [640x625] block: [8x8]
          time(us): total=519776 max=5204 min=5165 avg=5197
        214: kernel launched 100 times
          grid: [640x625] block: [8x8]
          time(us): total=1318752 max=13200 min=13180 avg=13187
        222: kernel launched 100 times
          grid: [1] block: [256]
          time(us): total=56786 max=570 min=565 avg=567

How to read the output:
- jacobi is the routine name; 204, 206, 214, 222 are source line numbers.
- kernels=1895314: actual kernel time (1.89 s).
- data=15523197: data transfer time (15.52 s).
- The kernel for the loop at line 214 takes 1.31 s.
- grid/block: the sizes of the grid/block partitioning.
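
The profile figures can be put in perspective with a little arithmetic. The sketch below is illustrative only (the helper names are hypothetical): it computes the size of one N x M double-precision array for the run shown above, the share of region time spent moving data, and the average transfer time per region entry.

```c
#include <stddef.h>

/* Bytes occupied by an n x m array whose elements are elt_size bytes each. */
size_t array_bytes(size_t n, size_t m, size_t elt_size)
{
    return n * m * elt_size;
}

/* Fraction (0..1) of the region time that is spent in data movement. */
double data_fraction(double data_us, double region_us)
{
    return data_us / region_us;
}

/* Average time in microseconds per region entry. */
double avg_us_per_entry(double total_us, int entries)
{
    return total_us / (double)entries;
}
```

With N = 5120, M = 5000, and 8-byte reals, `array_bytes` gives 204,800,000 bytes per array, the same figure that appears as the allocation size for `f` in the PGI_ACC_DEBUG log later in the deck. `data_fraction(15523197, 17478660)` is roughly 0.89, so close to nine tenths of the region time is data movement, and `avg_us_per_entry(15523197, 100)` is about 155,000 microseconds of transfer per pass through the kernels region.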

OpenACC Performance Summary (Double Precision)
OpenMP performance versus OpenACC performance

    Configuration                                              Options         Time (s)   Speedup
    1 core, single thread (without SSE vector)                 -O3              26.19
    1 core, single thread (with SSE vector)                    -fastsse         13.15
    OpenMP, 4 cores, 4 threads                                 -mp -fastsse      8.74     x 1.0
    OpenACC (kernels construct only, on the target loops)                       17.95

OpenMP measurements:  Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (Nehalem), 4 cores
OpenACC measurements: host as above; GPU: NVIDIA GeForce GTX 580

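Data Transfers Inside the Loop Cause a Storm of Transfers

          do while (k.le.maxit .and. error.gt. tol)
    * Outer loop, repeated until convergence
    * Host -> GPU data copy happens here
    !$acc kernels
    *        (group of DO loops: computation on the GPU)
    !$acc end kernels
    * GPU -> host copy-back happens here
          end do                    ! End iteration loop

Remedy: use the data construct to separate the parallel-processing directives from the data-movement directives.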

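Transfer the Data Outside the Loop
Direct the transfers explicitly with the data construct.

    * Host -> GPU data copy happens here
    !$acc data copy(u) copyin(f)...
          do while (k.le.maxit .and. error.gt. tol)
    !$acc kernels
    *        (group of DO loops: computation on the GPU,
    *         with the data kept resident on the GPU)
    !$acc end kernels
          end do                    ! End iteration loop
    !$acc end data
    * GPU -> host copy-back happens here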

Using the Data Directive

          error = 10.0 * tol
          k = 1
    * Start of the data region: steps (1) and (2)
    !$acc data copy(u)
    !$acc+ copyin(f) create(uold)
          do while (k.le.maxit .and. error.gt. tol)
             error = 0.0
    * Copy new solution into old
    * Start of the kernels parallel region: step (3)
    !$acc kernels
             do j=1,m
                do i=1,n
                   uold(i,j) = u(i,j)
                enddo
             enddo
             do j = 2,m-1
                do i = 2,n-1
                   resid = (ax*(uold(i-1,j) + uold(i+1,j))
         &                + ay*(uold(i,j-1) + uold(i,j+1))
         &                + b * uold(i,j) - f(i,j))/b
                   u(i,j) = uold(i,j) - omega * resid
                   error = error + resid*resid
                end do
             enddo
    * End of the kernels parallel region
    !$acc end kernels
    * Error check
             k = k + 1
             error = sqrt(error)/dble(n*m)
          enddo                     ! End iteration loop
    * End of the data region: steps (4) and (5)
    !$acc end data

- The data region is specified outside the convergence test loop.
- The data in use stays resident on the GPU.
- The data is returned to the host when the convergence loop ends.
- Result: fewer data transfers between host and GPU.
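
For comparison, a C rendering of the same structure is sketched below. It is not the author's code: the function name `jacobi_acc`, the `IDX` macro, and the argument list are assumptions made for illustration. The sketch also compares squared quantities (`sumsq > limit`, with `limit = (tol*n*m)^2`) in place of the original `sqrt(error)/dble(n*m) > tol` test, which is an equivalent condition that avoids a math-library call. Without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#include <stddef.h>

/* Column-major indexing so that IDX(i,j,n) mirrors Fortran's u(i,j), 0-based. */
#define IDX(i, j, n) ((size_t)(j) * (size_t)(n) + (size_t)(i))

/* Runs Jacobi sweeps until convergence or maxit; returns the sweep count.
 * *sumsq_out receives the last sum of squared residuals. */
int jacobi_acc(int n, int m, double ax, double ay, double b, double omega,
               double tol, int maxit, double *restrict u,
               const double *restrict f, double *restrict uold,
               double *sumsq_out)
{
    size_t len = (size_t)n * (size_t)m;
    double limit = tol * (double)n * (double)m;
    limit *= limit;                       /* squared convergence threshold */
    double sumsq = 2.0 * limit + 1.0;     /* forces the first sweep */
    int k = 1;

    if (len == 0) { *sumsq_out = 0.0; return 0; }

    /* Data region outside the iteration loop: arrays stay on the device. */
    #pragma acc data copy(u[0:len]) copyin(f[0:len]) create(uold[0:len])
    {
        while (k <= maxit && sumsq > limit) {
            sumsq = 0.0;
            #pragma acc kernels
            {
                for (int j = 0; j < m; ++j)
                    for (int i = 0; i < n; ++i)
                        uold[IDX(i, j, n)] = u[IDX(i, j, n)];

                #pragma acc loop reduction(+:sumsq)
                for (int j = 1; j < m - 1; ++j) {
                    #pragma acc loop reduction(+:sumsq)
                    for (int i = 1; i < n - 1; ++i) {
                        double resid =
                            (ax * (uold[IDX(i-1, j, n)] + uold[IDX(i+1, j, n)])
                           + ay * (uold[IDX(i, j-1, n)] + uold[IDX(i, j+1, n)])
                           + b * uold[IDX(i, j, n)] - f[IDX(i, j, n)]) / b;
                        u[IDX(i, j, n)] = uold[IDX(i, j, n)] - omega * resid;
                        sumsq += resid * resid;
                    }
                }
            }
            ++k;
        }
    }
    *sumsq_out = sumsq;
    return k - 1;
}
```

The design point is the same as in the Fortran version: only the scalar reduction result has to cross the host/device boundary on each sweep, while `u`, `f`, and `uold` are moved (or created) once.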

Profile with the Data Transferred Only Once

Region of the kernels construct (three kernels exist in it):

    Accelerator Kernel Timing data
    /home/kato/gpgpu/openmp/double/openacc/jacobi2.f
      jacobi
        205: region entered 100 times
          time(us): total=1915950 init=3 region=1915947
                    kernels=1893866 data=0
          w/o init: total=1915947 max=19442 min=19148 avg=19159
          207: kernel launched 100 times
            grid: [640x625] block: [8x8]
            time(us): total=519232 max=5197 min=5169 avg=5192
          215: kernel launched 100 times
            grid: [640x625] block: [8x8]
            time(us): total=1317873 max=13186 min=13172 avg=13178
          223: kernel launched 100 times
            grid: [1] block: [256]
            time(us): total=56761 max=569 min=566 avg=567

- data=0: data transfer time inside the kernels region is 0 s.
- kernels=1893866: this kernel time is the next thing to tune.

Region of the data construct (entered only once):

    /home/kato/gpgpu/openmp/double/openacc/jacobi2.f
      jacobi
        199: region entered 1 time
          time(us): total=2122461 init=87485 region=2034976
                    data=112926
          w/o init: total=2034976 max=2034976 min=2034976 avg=2034976

- data=112926: data transfer time is 0.11 s.

OpenACC Performance Summary (Double Precision)
OpenMP performance versus OpenACC performance

    Configuration                                              Options         Time (s)   Speedup
    1 core, single thread (without SSE vector)                 -O3              26.19
    1 core, single thread (with SSE vector)                    -fastsse         13.15
    OpenMP, 4 cores, 4 threads                                 -mp -fastsse      8.74     x 1.0
    OpenACC (kernels construct only, on the target loops)                       17.95
    OpenACC (data construct outside the iteration loop)                          2.32     x 3.7

OpenMP measurements:  Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (Nehalem), 4 cores
OpenACC measurements: host as above; GPU: NVIDIA GeForce GTX 580

Adjusting Parallel Behavior with Loop Directives

          error = 10.0 * tol
          k = 1
    !$acc data copy(u(1:n,1:m))
    !$acc+ copyin(f(1:n,1:m)) create(uold(1:n,1:m))
          do while (k.le.maxit .and. error.gt. tol)
             error = 0.0
    * Copy new solution into old
    * Start of the accelerator parallel region
    !$acc kernels
             do j=1,m
                do i=1,n
                   uold(i,j) = u(i,j)
                enddo
             enddo
    !$acc loop gang, vector(8)
             do j = 2,m-1
    !$acc loop gang, vector(8)
                do i = 2,n-1
                   resid = (ax*(uold(i-1,j) + uold(i+1,j))
         &                + ay*(uold(i,j-1) + uold(i,j+1))
         &                + b * uold(i,j) - f(i,j))/b
                   u(i,j) = uold(i,j) - omega * resid
                   error = error + resid*resid
                end do
             enddo
    * End of the accelerator parallel region
    !$acc end kernels
    * Error check
             k = k + 1
             error = sqrt(error)/dble(n*m)
          enddo                     ! End iteration loop
    !$acc end data

- The compiler automatically partitions the target parallel loops and maps them onto CUDA thread blocks and the grid.
- The mapping (block partitioning and so on) can be changed explicitly.
- Getting better performance takes trial and error with different gang and vector scheduling.

Changing the Accelerator Loop Mapping
For example: grid size (16 x 16), block size (16 x 16)

Change the loop scheduling clauses (source listing with line numbers):

             !$acc loop gang(16) vector(16)
    ( 235)         do j = 2,m-1
             !$acc loop gang(16) vector(16)
    ( 237)            do i= 2,n-1
    ( 238)               resid = (ax*(uold(i-1,j) + uold(i+1,j))
    ( 239)      &              + ay*(uold(i,j-1) + uold(i,j+1))
    ( 240)      &              + b * uold(i,j) - f(i,j))*b1b
    ( 241)               u(i,j) = uold(i,j) - omega * resid
    ( 242)               error = error + resid*resid
    ( 243)            end do
    ( 244)   !$acc end region

Compiler messages:

    jacobi:
        217, Generating local(uold(:,:))
             Generating local(resid)
             Generating copyin(f(:n,:m))
             Generating copy(u(:n,:m))
        235, Loop is parallelizable
        237, Loop is parallelizable
             Accelerator kernel generated
        235, !$acc loop gang(16), vector(16) ! blockidx%y threadidx%y
        237, !$acc loop gang(16), vector(16) ! blockidx%x threadidx%x
             CC 1.3 : 26 registers; 2176 shared, 36 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 26 registers; 2056 shared, 144 constant, 0 local memory bytes; 66% occupancy
        242, Sum reduction generated for error

Evaluating Performance with Run-Time Profile Information
Changing the loop scheduling (grid/block size) changes the performance.

Default mapping (total execution time: 2.32 s):

    235, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
    237, !$acc loop gang, vector(8) ! blockidx%x threadidx%x

    237: kernel launched 100 times
         grid: [640x625] block: [8x8]
         time(us): total=1318241 max=13189 min=13176 avg=13182

Modified mapping (total execution time: 1.32 s):

    235, !$acc loop gang(16), vector(16) ! blockidx%y threadidx%y
    237, !$acc loop gang(16), vector(16) ! blockidx%x threadidx%x

    237: kernel launched 100 times
         grid: [16x16] block: [16x16]
         time(us): total=729351 max=7365 min=7223 avg=7293

Device Name: GeForce GTX 580. The figures above are for double-precision computation; kernel times are in microseconds.
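
The reported grid sizes follow from simple ceiling division, which the sketch below works through. The helper names are hypothetical. For the run above (N = 5120, M = 5000), the loops `do i = 2,n-1` and `do j = 2,m-1` have trip counts of 5118 and 4998. The second helper rests on an assumption about how the compiler distributes iterations once the gang count is fixed; the slides do not state this.

```c
/* Thread blocks needed so that blocks * vec covers trip iterations. */
int blocks_needed(int trip, int vec)
{
    return (trip + vec - 1) / vec;
}

/* Iterations per thread when the gang count is fixed (assumed even split). */
int iters_per_thread(int trip, int gangs, int vec)
{
    int threads = gangs * vec;
    return (trip + threads - 1) / threads;
}
```

`blocks_needed(5118, 8)` is 640 and `blocks_needed(4998, 8)` is 625, matching the reported `grid: [640x625]` for `block: [8x8]`. With `gang(16), vector(16)` the grid is fixed at 16 x 16, so under the even-split assumption each thread would handle about 20 iterations in each dimension.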

OpenACC Performance Summary (Double Precision)
OpenMP performance versus OpenACC performance

    Configuration                                              Options         Time (s)   Speedup
    1 core, single thread (without SSE vector)                 -O3              26.19
    1 core, single thread (with SSE vector)                    -fastsse         13.15
    OpenMP, 4 cores, 4 threads                                 -mp -fastsse      8.74     x 1.0
    OpenACC (kernels construct only, on the target loops)                       17.95
    OpenACC (data construct outside the iteration loop)                          2.32     x 3.7
    OpenACC (parallel mapping tuned with loop clauses)                           1.32     x 6.6

OpenMP measurements:  Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (Nehalem), 4 cores
OpenACC measurements: host as above; GPU: NVIDIA GeForce GTX 580

Widening the Scope to the Whole Program

          subroutine driver (u,f)
    * dx - grid spacing in x direction
    * dy - grid spacing in y direction
    *     (array declarations etc. omitted)
    * Initialize data
          cpu0 = second()
    * u() and f() are copied in to the GPU here
    !$acc data copy (u,f)
          call initialize (n,m,alpha,dx,dy,u,f)
    * Solve Helmholtz equation
          call jacobi (n,m,dx,dy,alpha,relax,u,f,tol,mits)
    * Check error between exact solution
          call error_check (n,m,alpha,dx,dy,u,f)
    !$acc end data
    * u() and f() are copied out to the host here
          cpu1 = second()
    * Printout Elapsed time
          elapsed = (cpu1 -cpu0) * t_ac
          print '(/,1x,a,F10.3/)',
         & 'Elpased Time (Initialize + Jacobi solver + Check) : ',elapsed
          return
          end

- The arrays u() and f() are used throughout the program.
- Inside each procedure, all that remains is to turn the computations involving u and f into GPU kernels.

Compiler Feedback Information (only the messages about data constructs)

Source of initialize:

          subroutine initialize (n,m,alpha,dx,dy,u,f)
          real*8 u(n,m),f(n,m),dx,dy,alpha
    !$acc kernels copyin(dx,dy,alpha) present(u,f)
    !$acc loop gang private(xx,yy)
          do j = 1,m
    !$acc loop vector(256)
             do i = 1,n
                xx = -1.0 + dx * real(i-1)   ! -1 < x < 1
                yy = -1.0 + dy * real(j-1)   ! -1 < y < 1
                u(i,j) = 0.0
                f(i,j) = -alpha *(1.0-xx*xx)*(1.0-yy*yy)
         &             - 2.0*(1.0-xx*xx)-2.0*(1.0-yy*yy)
             enddo
          enddo
    !$acc end kernels
          return
          end

Compiler messages:

    driver:
        108, Generating copy(f(:,:))
             Generating copy(u(:,:))
             (rest omitted)
    initialize:
        152, Generating present(u(:,:))
             Generating present(f(:,:))
             (rest omitted)
    jacobi:
        215, Generating present_or_copyin(f(:,:))
             Generating present_or_copy(u(:,:))
             Generating local(resid)
             Generating local(uold(1:n,1:m))
             (rest omitted)
    error_check:
        275, Generating present(u(:,:))
             (rest omitted)

Meaning of the present clause: it states that the arrays u() and f() already exist on the GPU.
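
The caller/callee split between `copy` and `present` might look as follows in C. This is a sketch under stated assumptions, not the author's code: the C signatures are invented for illustration, and the grid spacings `dx = 2/(n-1)` and `dy = 2/(m-1)` are assumed from the `-1 < x < 1` comments because the slides omit the declarations. `xx` and `yy` are declared inside the loop body, so no `private` clause is needed in the C version.

```c
#include <stddef.h>

/* Callee: u and f are expected to be on the device already (present). */
void initialize(int n, int m, double alpha, double dx, double dy,
                double *restrict u, double *restrict f)
{
    size_t len = (size_t)n * (size_t)m;
    if (len == 0)
        return;

    #pragma acc kernels present(u[0:len], f[0:len])
    {
        #pragma acc loop gang
        for (int j = 0; j < m; ++j) {
            #pragma acc loop vector(256)
            for (int i = 0; i < n; ++i) {
                double xx = -1.0 + dx * (double)i;   /* -1 <= x <= 1 */
                double yy = -1.0 + dy * (double)j;   /* -1 <= y <= 1 */
                size_t q = (size_t)j * (size_t)n + (size_t)i;
                u[q] = 0.0;
                f[q] = -alpha * (1.0 - xx * xx) * (1.0 - yy * yy)
                       - 2.0 * (1.0 - xx * xx) - 2.0 * (1.0 - yy * yy);
            }
        }
    }
}

/* Caller: one data region spans every routine that touches u and f. */
void driver(int n, int m, double alpha,
            double *restrict u, double *restrict f)
{
    size_t len = (size_t)n * (size_t)m;
    if (len == 0 || n < 2 || m < 2)
        return;
    double dx = 2.0 / (double)(n - 1);   /* assumed grid spacing */
    double dy = 2.0 / (double)(m - 1);

    #pragma acc data copy(u[0:len], f[0:len])
    {
        initialize(n, m, alpha, dx, dy, u, f);
        /* the Jacobi solver and the error check would be called here */
    }
}
```

If `initialize` were called outside any enclosing data region, a `present` clause would fail at run time on a real accelerator build; `present_or_copy`, as in the jacobi messages above, is the tolerant alternative.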

OpenACC Performance Summary (Double Precision)
OpenMP performance versus OpenACC performance

    Configuration                                              Options         Time (s)   Speedup
    1 core, single thread (without SSE vector)                 -O3              26.19
    1 core, single thread (with SSE vector)                    -fastsse         13.15
    OpenMP, 4 cores, 4 threads                                 -mp -fastsse      8.74     x 1.0
    OpenACC (kernels construct only, on the target loops)                       17.95
    OpenACC (data construct outside the iteration loop)                          2.32     x 3.7
    OpenACC (parallel mapping tuned with loop clauses)                           1.32     x 6.6
    OpenACC (data construct at main-program level, present clause)               1.23     x 7.1

OpenMP measurements:  Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (Nehalem), 4 cores
OpenACC measurements: host as above; GPU: NVIDIA GeForce GTX 580
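
The speedup column is each elapsed time measured against the 4-thread OpenMP run (8.74 s). A one-line helper, with a hypothetical name, makes the arithmetic explicit:

```c
/* Speedup of a run taking t_s seconds relative to a baseline of baseline_s. */
double speedup(double baseline_s, double t_s)
{
    return baseline_s / t_s;
}
```

`speedup(8.74, 2.32)` is about 3.77, `speedup(8.74, 1.32)` about 6.62, and `speedup(8.74, 1.23)` about 7.11; the table's 3.7, 6.6, and 7.1 appear to be these values truncated to one decimal place. The kernels-only run gives `speedup(8.74, 17.95)`, roughly 0.49, which means it is about half as fast as the OpenMP run.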

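Useful Tools During Porting
NVIDIA Visual Profiler
- Visual overview of behavior such as streams and events
- Detailed view of kernel characteristics
PGI environment variables
- ACC_NOTIFY: prints a run-time history of kernel activity
- PGI_ACC_DEBUG: prints an event log of the CUDA system calls made at run time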

Using the NVIDIA Visual Profiler

Compile, directing the compiler to use CUDA toolkit 4.1:

    [kato@photon29]$ make jacobi4.exe
    pgfortran -o jacobi4.exe jacobi4.f -fastsse -acc -ta=nvidia:cuda4.1

Launch the profiler (NVIDIA Visual Profiler 4.1 is used):

    [kato@photon29]$ which nvvp
    /usr/local/cuda/bin/nvvp
    [kato@photon29]$ nvvp          (launches the NVIDIA Visual Profiler)

After launch, specify the executable module jacobi4.exe.

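NVIDIA CUDA Visual Profiler (1): Frequent Data Copies!
(Profiler screenshot showing GPU characteristics and overall performance characteristics.)
- The behavior of the whole stream can be read from the color coding.
- Data transfers (khaki) dominate.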

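NVIDIA CUDA Visual Profiler (2): Compute Kernel Execution Dominates
(Profiler screenshot showing kernel execution characteristics.)
- The behavior of the whole stream can be read from the color coding.
- Kernel execution (peacock blue) dominates.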

NVIDIA CUDA Visual Profiler (3): Timeline Details
(Profiler screenshot.)
- Kernel execution characteristics
- Detailed characteristics of individual events

PGI Environment Variable (Kernel Launch Log): ACC_NOTIFY

    $ export ACC_NOTIFY=1

When set, the program prints a history of the kernels executed on the accelerator while it runs.

    launch kernel file=/home/kato/jacobi4.f function=initialize line=154 device=0 grid=20 block=256
    launch kernel file=/home/kato/jacobi4.f function=jacobi line=227 device=0 grid=5000 block=256
    launch kernel file=/home/kato/jacobi4.f function=jacobi line=235 device=0 grid=320x16 block=16x16
    launch kernel file=/home/kato/jacobi4.f function=jacobi line=240 device=0 grid=1 block=256

This shows whether the kernels are actually being executed and with what parallel partitioning (grid, thread block) each kernel runs.

PGI Environment Variable (Run-Time Event Log): PGI_ACC_DEBUG (PGI 2013 and later)

    $ export PGI_ACC_DEBUG=1       (set to 0 to disable)

When set, the program prints an event log of the CUDA system calls that the PGI runtime makes.

    [kato@photon29]$ export PGI_ACC_DEBUG=1
    [kato@photon29]$ jacobi4.exe
    pgi_cu_init() found 2 devices
    pgi_cu_init( file=acc_init.c, function=acc_init, line=41, startline=1, endline=-1 )
    pgi_cu_init() will use device 0 (V2.0)
    pgi_cu_init() compute context created
    initialize nvidia
    pgi_cu_init( file=/home/kato/gpgpu/openmp/double/openacc/jacobi4.f, function=driver, line=107, startline=69, endline=129 )
    pgi_acc_dataon(devptr=0x1,hostptr=0x7ff535c48230,offset=0,0,stride=1,5120,size=5120x5000,
                   extent=5120x5000,eltsize=8,lineno=107,name=f,flags=0xf00=create+present+copyin+copyout)
    NO map for host:0x7ff535c48230
    pgi_cu_alloc(size=204800000,lineno=107,name=f)
    pgi_cu_alloc(204800000) returns 0x200200000
    map dev:0x200200000 host:0x7ff535c48230 size:204800000 offset:0 data[dev:0x7ff535c48230 host:0x200200000 size:204800000] (line:107 name:f)
    pgi_cu_launch_a(func=0xaf6f40, params=0x7fff8d2e1190, bytes=72, sharedbytes=0)
    First arguments are:
        5120 5000 206962688 2 2097152 2 5120 0
        -1610612736 1060780090 1610612736 1060739809 5000 5120 5120 5121
        0 107269324

End