1 CUDA Fortran Tutorial
September 29, 2010, NEC

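2 Overview
Purpose
- Provide basic know-how for using CUDA Fortran.
- After this tutorial, attendees should be able to use CUDA Fortran on their own while consulting materials published on the Web.
Audience
- Anyone interested in using CUDA Fortran.
Prerequisites
- Basic knowledge of programming in Fortran on Linux systems.
- Knowledge of CUDA C is very helpful.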

3 Schedule
Time                    Content
10:00-10:30 (30 min)    Part 1: GPGPU overview
10:30-10:40 (10 min)    Part 2: CUDA Fortran overview
10:40-11:30 (50 min)    Part 3: Usage example with simple code
11:30-11:50 (20 min)    Part 4: Porting the Himeno benchmark to CUDA

4 Part 1: GPGPU overview

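5 Contents of Part 1
- What is GPGPU?
- Glossary
- GPU hardware structure
- GPU features
- Why are GPUs fast?
- Key points for getting performance
- Challenges to overcome
- The latest architecture: Fermi
- Summary of Part 1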

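6 What is GPGPU?
GPGPU (General Purpose computing on Graphic Processing Unit)
- The technique of using GPUs for general-purpose computation.
CUDA (Compute Unified Device Architecture)
- A parallel programming model and software environment for general-purpose computation on GPUs, consisting of extensions to the C language and a runtime library.
- Development is possible in C and its extensions, without using a graphics API.
- Called CUDA C when it needs to be distinguished from CUDA Fortran.
CUDA Fortran
- A feature of the PGI Fortran compiler that allows GPU programs to be developed in Fortran and its extensions.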

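7 Glossary
Host, Device
- The server side, where the CPU does the processing, is called the host; the GPU side is called the device.
Thread, Block, Grid
- The hierarchy of parallel processing in CUDA.
- Thread: the smallest unit of processing.
- Block: a set of threads.
- Grid: a set of blocks.
Warp
- The smallest unit of scheduling.
- Consists of 32 threads, which corresponds to 4 cycles of processing on the 8 cores of one streaming multiprocessor.
Kernel
- A function that runs on the GPU is called a kernel.
[Figure: Thread, Block, Grid]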
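The warp figures in the glossary reduce to a short calculation. The sketch below is written in Python rather than CUDA Fortran so that it runs without a GPU; it reproduces the 32-thread warp from the 8 cores and 4 cycles quoted above and counts the warps in a block. The function names are illustrative and are not part of any CUDA API.

```python
# Figures quoted in the glossary slide.
CORES_PER_SM = 8       # streaming processors in one SM
CYCLES_PER_WARP = 4    # cycles the 8 cores spend on one warp


def warp_size(cores=CORES_PER_SM, cycles=CYCLES_PER_WARP):
    """Threads in one warp: 8 cores x 4 cycles."""
    return cores * cycles


def warps_per_block(threads_per_block):
    """Warps needed to cover a block; a partly filled warp still counts as one."""
    size = warp_size()
    return (threads_per_block + size - 1) // size


print(warp_size())           # 32
print(warps_per_block(512))  # 16
```

A 512-thread block, the size used later in this tutorial, is therefore scheduled as 16 warps.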

8 GPU hardware structure
Understanding the hardware structure is essential for optimization.
Overall structure of a GPU card (Tesla S1070 example)
- Streaming Multiprocessor (SM): the Tesla S1070 has 30 per GPU.
- Device memory (global memory): GDDR3, 4 GByte. Accessible from all SPs. Data is copied here over PCI.
- Host connection: PCI-Express 2.0 x16, bandwidth 8 GByte/s.
Structure of a streaming multiprocessor
- Streaming Processor (SP): single-precision arithmetic unit, 8 per SM.
- Double-precision arithmetic unit: 1 per SM.
- Shared memory: shared within the SM, about as fast as registers, 16 kbyte per SM.

9 GPU hardware structure
Pay attention to the memory hierarchy: access speed, scope, and lifetime differ by region.

Memory     Location (on/off chip)   Scope                  Lifetime
Register   On                       1 thread               Thread
Local      Off                      1 thread               Thread
Shared     On                       All threads in block   Block
Global     Off                      All threads + host     Host allocation

[Figure: host CPU and host memory (25.6 GB/s), link from host to device (8 GB/s), SMs and device memory on the device]
- Registers and shared memory are on the GPU chip (on chip) and are therefore fast.
- Global and local memory reside in GDDR3 memory (device memory, off chip) and are therefore slow.
[Source] NVIDIA CUDA C Programming Best Practice Guide, Table 3.1 Salient features of device memory

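10 GPU features
Peak single-precision performance is very high: about 11 times that of a CPU.
- NVIDIA Tesla S1070, 1 GPU: 1.04 TFLOPS
- Intel Xeon X5570 (2.93 GHz, 4 cores, SSE instructions): roughly 1/11 of the GPU figure
Device memory bandwidth is large: 4 times that of the host.
- NVIDIA Tesla S1070, 1 GPU: about 102 GByte/s (4 x 25.6 GByte/s)
- Intel Xeon X5570 with DDR3 memory: 25.6 GByte/s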

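11 Why are GPUs fast?
Massively parallel processing on a large number of arithmetic cores.
- Compared with a CPU, more transistors are devoted to arithmetic units and fewer to cache and control.
- There are many arithmetic cores. A Tesla S1070, for example, has 240 SPs per GPU; running them in massively parallel fashion delivers high arithmetic performance.
- CUDA is a programming model that assumes parallelization over many threads, so it calls for a different way of thinking from conventional programming.
[Reference] NVIDIA CUDA C Programming Guide, 1.1 From Graphics Processing to General-Purpose Parallel Computing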

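12 Key points for getting performance
Create many threads so that the arithmetic units (SPs) are used efficiently.
- With thread parallelization on a CPU (OpenMP and the like), a thread count close to the core count is appropriate; on a GPU, far more threads are created.
- Up to about 1024 active threads can reside on one SM.
- Switching between threads takes very little time.
- Assigning many threads to an SM hides memory access latency.
Be aware of the memory hierarchy.
- Device memory (global memory), shared memory, and registers have different properties; using each one appropriately is important.
Host-device copies go over PCI and are relatively slow.
- Keep host-device data copies to a minimum.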
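To put "many threads" in numbers, the following Python sketch combines figures quoted in Part 1 for one GPU of a Tesla S1070: 30 SMs, 8 SPs per SM, and about 1024 active threads per SM. It is only arithmetic on those quoted values; the constant and function names are illustrative.

```python
# Figures quoted in this tutorial for one GPU of a Tesla S1070.
SMS_PER_GPU = 30
SPS_PER_SM = 8
MAX_ACTIVE_THREADS_PER_SM = 1024   # the slide says "about 1024"


def total_sps():
    """Arithmetic cores (SPs) on one GPU."""
    return SMS_PER_GPU * SPS_PER_SM


def max_active_threads():
    """Upper bound on threads resident on one GPU at the same time."""
    return SMS_PER_GPU * MAX_ACTIVE_THREADS_PER_SM


def threads_per_sp():
    """Resident threads available to keep each SP busy."""
    return max_active_threads() // total_sps()


print(total_sps(), max_active_threads(), threads_per_sp())  # 240 30720 128
```

Each SP can thus have on the order of a hundred threads waiting behind it, which is what lets the hardware switch to another thread while one is stalled on memory.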

13-17 Challenges to overcome
Double-precision peak performance drops to 1/12 of single precision.
- 1/8 the number of cores (streaming processors) times 2/3 the operations per core gives 1/12.
- NVIDIA Tesla S1070, 1 GPU: 86.4 GFLOPS
- Outlook: the latest architecture, Fermi, improves this to 1/2 of single precision. [Reference] Next Generation CUDA Architecture, Code Named Fermi
Programming is complex.
- Getting performance requires programming with the memory hierarchy and similar details in mind.
- Outlook: the situation is improving; for example, the PGI compiler supports directive-based CUDA porting. [Reference] PGI Resources, Accelerator
Host-device copies go over PCI and are slow.
- Ideally, once the data needed for the computation has been copied to device memory, the computation proceeds without moving data in or out.
- This may require porting the entire code to CUDA.
- Outlook: next-generation PCI is still some way off, but in many cases careful programming can mitigate the cost.
ECC is not supported.
- Memory errors can cause incorrect results that are hard to detect. A bit error can change the result, so take care.
- Outlook: the latest architecture, Fermi, supports ECC. [Reference] Next Generation CUDA Architecture, Code Named Fermi
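The 1/12 figure follows directly from the two ratios on the slide. As a small worked check in Python (exact fractions, no GPU needed), the sketch below multiplies the ratios and recovers the single-precision peak implied by the 86.4 GFLOPS double-precision figure. The names are illustrative only.

```python
from fractions import Fraction

CORE_RATIO = Fraction(1, 8)   # double-precision units vs. single-precision SPs per SM
OPS_RATIO = Fraction(2, 3)    # operations per core, double vs. single precision
DP_OVER_SP = CORE_RATIO * OPS_RATIO


def sp_peak_from_dp(dp_peak_gflops):
    """Single-precision peak implied by a double-precision peak and the ratio above."""
    return Fraction(dp_peak_gflops) / DP_OVER_SP


sp_peak = sp_peak_from_dp("86.4")
print(DP_OVER_SP, float(sp_peak))  # 1/12 1036.8
```

The result, about 1.04 TFLOPS, agrees with the single-precision peak quoted on the "GPU features" slide.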

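18 The latest architecture: Fermi
Fermi is the latest NVIDIA GPU architecture. It improves on many of the points that had been challenges so far:
- Better double-precision performance
- ECC support
- Added cache
- More shared memory
- And so on
[Reference] Next Generation CUDA Architecture, Code Named Fermi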

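19 Summary of Part 1
An overview of GPGPU was presented.
Hardware structure
- A Tesla S1070 has 240 arithmetic cores (SPs) per GPU.
- Memory is hierarchical; access speed, scope, and lifetime differ by region.
- GPUs are fast because of massively parallel processing on many SPs.
Getting performance calls for a different way of thinking from CPUs.
- Create many threads, up to about 1024 per SM, to use the arithmetic cores efficiently.
- Code with the memory hierarchy in mind; in particular, host-device copies must be kept to a minimum.
There are several challenges to overcome, but the latest architecture, Fermi, improves many of them.
Part 2 gives an overview of CUDA Fortran.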

20 Part 2: CUDA Fortran overview

21 Contents of Part 2
- CUDA Fortran
- Differences from CUDA C
- Coding with CUDA Fortran: the general picture
- Thread parallelization in CUDA Fortran
- Porting and optimization workflow
- Summary of Part 2

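22 CUDA Fortran
- Available in PGI Fortran 10.0 and later. Note that it can be used only with the PGI Accelerator compiler products.
- CUDA C is an extension of the C language, whereas CUDA Fortran is an extension of Fortran. It is useful for moving programs written in Fortran to the GPU.
- The basic approach is the same as in CUDA C. The overall sequence does not change:
  1. Allocate device memory.
  2. Copy data from host to device.
  3. Run the kernel.
  4. Copy data from device to host.
- Note that differences between the C and Fortran language specifications lead to differences in coding.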

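23 Differences from CUDA C
The main differences from CUDA C are as follows.
- Arrays can be used inside a kernel. (CUDA C basically allows access only through pointers.)
- Attributes simplify the code. Arrays on the device are declared as such, so the compiler knows that they reside on the device.
- Fortran passes arguments by reference, so passing a scalar value to a kernel needs care: qualify the argument with value.
- Block IDs and thread IDs are 1-origin. (CUDA C is 0-origin.)
- Some API interfaces differ: optional arguments, the unit in which data sizes are specified, and so on.
- Texture memory cannot be used.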
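The 1-origin versus 0-origin difference is the one most likely to cause off-by-one errors when moving between the two languages. The Python sketch below models the two index formulas on the host and shows that, for the same physical thread, the CUDA Fortran index is always the CUDA C index plus one. The function names are hypothetical; only the formulas come from the tutorial and from standard CUDA C usage.

```python
def index_cuda_fortran(blockidx, blockdim, threadidx):
    """Fortran array index (1-origin) from 1-origin block and thread IDs."""
    return (blockidx - 1) * blockdim + threadidx


def index_cuda_c(block_idx, block_dim, thread_idx):
    """C array index (0-origin) from 0-origin block and thread IDs."""
    return block_idx * block_dim + thread_idx


# The same physical thread: first thread of the second block, 512 threads per block.
print(index_cuda_fortran(2, 512, 1))  # 513
print(index_cuda_c(1, 512, 0))        # 512
```

Both values refer to the same array element, because Fortran arrays start at 1 and C arrays start at 0.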

24 Coding with CUDA Fortran: the general picture
Code can be written in Fortran and its extensions.
(1) Allocate arrays in device memory (the memory on the GPU board).
(2) Copy data from host memory to device memory.
(3) Call the function that runs on the GPU. The <<<>>> syntax specifies the number of threads to create and related settings.
(4) Think of the thread ID as taking the place of the loop variable.

[Ordinary Fortran code] Adds arrays b and c of size N and stores the result in array a.
    [...]
    call sub(n, a, b, c)
    [...]
    subroutine sub(n,a,b,c)
      implicit none
      integer :: n
      real,dimension(n) :: a,b,c
      integer :: i
      do i=1,n
        a(i) = b(i) + c(i)
      enddo
      return
    end subroutine sub

[CUDA Fortran code] The same computation on the GPU.
    [...]
    stat = cudamalloc(d_a,n)                               ! (1) allocate on the device
    stat = cudamalloc(d_b,n)
    stat = cudamalloc(d_c,n)
    stat = cudamemcpy(d_b,b,n,cudamemcpyhosttodevice)      ! (2) host-to-device copy
    stat = cudamemcpy(d_c,c,n,cudamemcpyhosttodevice)
    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)          ! (3) kernel call
    [...]
    attributes(global) subroutine sub(n,a,b,c)
      implicit none
      integer,value :: n
      real,dimension(n),device :: a,b,c
      integer :: i
      i = (blockidx%x - 1) * blockdim%x + threadidx%x      ! (4) thread ID replaces the loop variable
      if(i < n+1) a(i) = b(i) + c(i)
      return
    end subroutine sub

25 Coding with CUDA Fortran: the general picture
[CUDA Fortran code using API calls]
    [...]
    stat = cudamalloc(d_a,n)
    stat = cudamalloc(d_b,n)
    stat = cudamalloc(d_c,n)
    stat = cudamemcpy(d_b,b,n,cudamemcpyhosttodevice)
    stat = cudamemcpy(d_c,c,n,cudamemcpyhosttodevice)
    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    [...]

A more concise form is possible. Because the device attribute states explicitly that an array lives in device memory, device memory allocation and simple host-device copies can be written compactly. (In CUDA C, a pointer alone does not tell whether it points to host or device memory.)

[CUDA Fortran code, concise form]
    [...]
    allocate(d_a(n))
    allocate(d_b(n))
    allocate(d_c(n))
    d_b = b
    d_c = c
    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    [...]

The kernel is the same in both forms:
    attributes(global) subroutine sub(n,a,b,c)
      implicit none
      integer,value :: n
      real,dimension(n),device :: a,b,c
      integer :: i
      i = (blockidx%x - 1) * blockdim%x + threadidx%x
      if(i < n+1) a(i) = b(i) + c(i)
      return
    end subroutine sub

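26 Thread parallelization in CUDA Fortran
The hierarchy is the same as in CUDA C.
Block: a set of threads
- Represented as a 3-dimensional array of threads.
- The number of threads in one block is limited: at most 512 threads.
- Blocks are assigned to SMs.
- Threads within a block can synchronize with each other.
- Synchronization between blocks is not possible inside a kernel.
- Watch resources such as registers and shared memory.
Grid: a set of blocks
- One grid corresponds to one kernel.
- Represented as a 2-dimensional array of blocks.
- The number of blocks in one grid is limited: at most 65535 x 65535 x 1 blocks.
[Figure: Thread, Block, Grid]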
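The two limits above can be checked before a kernel is launched. The following Python sketch is a hypothetical host-side helper, not a CUDA Fortran API: it takes grid and block shapes as (x, y, z) tuples and reports which of the limits quoted on this slide would be violated.

```python
# Limits quoted on the slide for the hardware generation covered by this tutorial.
MAX_THREADS_PER_BLOCK = 512
MAX_GRID_DIMS = (65535, 65535, 1)


def launch_problems(grid, block):
    """Return a list of limit violations for (x, y, z) grid and block shapes."""
    problems = []
    threads = block[0] * block[1] * block[2]
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append("block has %d threads, limit is %d"
                        % (threads, MAX_THREADS_PER_BLOCK))
    for axis, size, limit in zip("xyz", grid, MAX_GRID_DIMS):
        if size > limit:
            problems.append("grid %s = %d, limit is %d" % (axis, size, limit))
    return problems


print(launch_problems((2, 1, 1), (512, 1, 1)))  # []
print(launch_problems((2, 1, 1), (32, 32, 1)))  # one problem: 1024 threads in a block
```

Note that the block limit applies to the product of the three block dimensions, so a 32 x 32 block is already too large here.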

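27 Porting and optimization workflow
Workflow for using GPGPU with CUDA Fortran:
1. Decide which routines to port to CUDA: identify high-cost routines with gprof or a similar tool.
2. Port the code:
   - Allocate device memory.
   - Add host-device copies.
   - Specify the grid and block shapes.
   - Write the kernel.
3. Build and run.
4. Measure performance with the CUDA profiler.
5. Optimize.
Part 3 walks through porting and performance measurement for a very simple code.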

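28 Summary of Part 2
An overview of CUDA Fortran was presented.
- CUDA C is an extension of the C language, whereas CUDA Fortran is an extension of Fortran. It is useful for moving code written in Fortran to the GPU.
- The programming method is almost the same as in CUDA C, but note the various differences that stem from the language specifications.
- Keep the Thread-Block-Grid hierarchy in mind:
  - The thread is the smallest unit.
  - A block is a set of threads; SMs process work block by block.
  - A grid is a set of blocks; a grid corresponds to a kernel.
Part 3 walks through porting and performance measurement for a very simple code.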

29 Part 3: Usage example with simple code

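30 Contents of Part 3
- Example: array addition
- The code after porting to CUDA: test.cuf
- Allocating device memory
- Error checking of API functions
- Host-device copies
- Specifying grid and block shapes
- Checking that the kernel ran
- Build and run
- Performance measurement with the CUDA Profiler
- Debugging techniques
- Checking resource usage per block
- Summary of Part 3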

31 Example: array addition
Start by porting a very simple program to CUDA. It adds arrays b and c, each with N (=1000) elements, and stores the result in array a. The subroutine sub is the part to be ported.

[Fortran code test.f90]
    subroutine sub(n,a,b,c)
      implicit none
      integer :: n
      real,dimension(n) :: a,b,c
      integer :: i
      do i=1,n
        a(i) = b(i) + c(i)
      enddo
      return
    end subroutine sub

    program test
      implicit none
      integer,parameter :: N = 1000
      real,dimension(N) :: a,b,c
      integer :: i
      b = 1.0e0
      c = 2.0e0
      call sub(N,a,b,c)
      do i=1,N
        print *, 'i=', i, ',a(', i, ')=', a(i)
      enddo
      stop
    end program test

[Makefile]
    FC=pgf90
    FFLAGS=
    test: test.f90
            $(FC) -o $@ $(FFLAGS) $?
    clean:
            rm -f test

[Output]
    i= 1,a( 1 )=
    i= 2,a( 2 )=
    i= 3,a( 3 )=
    ...
    i= 998,a( 998 )=
    i= 999,a( 999 )=
    i= 1000,a( 1000 )=

32-37 The code after porting to CUDA: test.cuf
[Fortran code test.cuf]
    #define NUM_THREADS 512

    module cuda_kernel
    contains
      attributes(global) subroutine sub(n,a,b,c)
        implicit none
        integer,value :: n
        real,dimension(n),device :: a,b,c
        integer :: i
        i = (blockidx%x - 1) * blockdim%x + threadidx%x
        if(i < n+1) a(i) = b(i) + c(i)
        return
      end subroutine sub
    end module cuda_kernel

    program test
      use cudafor
      use cuda_kernel
      implicit none
      integer,parameter :: N = 1000
      real,dimension(N) :: a,b,c
      real,dimension(:),allocatable,device :: d_a,d_b,d_c
      integer :: i
      integer :: stat

      type(dim3) :: dimgrid,dimblock

      dimgrid = dim3((N-1)/NUM_THREADS+1,1,1)
      dimblock = dim3(NUM_THREADS,1,1)

      b = 1.0e0
      c = 2.0e0

      stat = cudamalloc(d_a,N)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
      stat = cudamalloc(d_b,N)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
      stat = cudamalloc(d_c,N)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

      stat = cudamemcpy(d_b,b,N,cudamemcpyhosttodevice)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
      stat = cudamemcpy(d_c,c,N,cudamemcpyhosttodevice)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

      call sub<<<dimgrid,dimblock>>>(N,d_a,d_b,d_c)
      print *, trim(cudageterrorstring(cudagetlasterror()))

      stat = cudamemcpy(a,d_a,N,cudamemcpydevicetohost)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

      do i=1,N
        print *, 'i=', i, ',a(', i, ')=', a(i)
      enddo

      stat = cudafree(d_a)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
      stat = cudafree(d_b)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
      stat = cudafree(d_c)
      if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

      stop
    end program test

Notes on the kernel (module cuda_kernel)
- A kernel must be placed inside a module.
- A kernel is marked with attributes(global).
- Fortran passes arguments by reference, so a scalar whose value is to be received is qualified with value.
- Arrays in device memory are qualified with device.
- The index is computed from the block ID and thread ID. Note that the IDs are 1-origin.
- The if statement prevents access outside the array (i > n).

Notes on the declarations in program test
- use cuda_kernel makes the module available; use cudafor is needed to use the API.
- Arrays in device memory are defined with the device qualifier.
- stat is a variable that receives the return values of API functions.
- The derived type dim3 is used to specify grid and block shapes; dimgrid and dimblock set the number of threads.

Notes on the API calls
- cudamalloc allocates the regions for arrays a, b, and c. The size is given as a number of elements (CUDA C uses a number of bytes). Because the arrays are declared as device arrays at definition time, allocate(d_a(n)) is also allowed.
- Errors are detected through the function return value. If the return value is anything other than cudaSuccess, cudageterrorstring displays the error.
- cudamemcpy copies the data of arrays b and c to device memory. The size is given as a number of elements (CUDA C uses a number of bytes). The last argument gives the copy direction; cudamemcpyhosttodevice can be omitted, and d_b=b is also allowed.
- <<<>>> specifies the grid and block shapes when the kernel is called.
- cudagetlasterror checks whether the kernel executed normally.
- cudafree releases device memory; deallocate(d_a) is also allowed.

38 Allocating device memory
Allocate regions in device memory to hold arrays a, b, and c. The relevant parts of test.cuf are:
    real,dimension(:),allocatable,device :: d_a,d_b,d_c

    stat = cudamalloc(d_a,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamalloc(d_b,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamalloc(d_c,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

    stat = cudafree(d_a)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudafree(d_b)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudafree(d_c)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

integer function cudamalloc(devptr, count)
- devptr: a 1-dimensional allocatable device array
- count: the number of array elements
- Here, a region of N real elements is reserved for each of the arrays a, b, and c.
integer function cudafree(devptr)
- devptr: an allocatable device array

39 Allocating device memory
Device memory can also be reserved and released with allocate and deallocate, which gives a more concise form.

[Using the API]
    real,dimension(:),allocatable,device :: d_a,d_b,d_c

    stat = cudamalloc(d_a,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamalloc(d_b,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamalloc(d_c,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

    stat = cudafree(d_a)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudafree(d_b)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudafree(d_c)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

[Using allocate and deallocate]
    real,dimension(:),allocatable,device :: d_a,d_b,d_c

    allocate(d_a(n))
    allocate(d_b(n))
    allocate(d_c(n))

    deallocate(d_a)
    deallocate(d_b)
    deallocate(d_c)

A status value can also be obtained, as follows:
    allocate(d_a(n),stat=stat)

40 Error checking of API functions
Errors are checked through the (integer) return value of each API function. The relevant parts of test.cuf are:
    integer :: stat

    stat = cudamalloc(d_a,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamalloc(d_b,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamalloc(d_c,n)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

    stat = cudafree(d_a)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudafree(d_b)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudafree(d_c)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

Here, if a value other than cudasuccess (which indicates normal completion) is returned, cudageterrorstring is used to display the error.

41 Host-device copies
Copy the data of arrays b and c from host to device into the regions in device memory, and copy the computed result in array a from device to host. The relevant parts of test.cuf are:
    stat = cudamemcpy(d_b,b,n,cudamemcpyhosttodevice)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamemcpy(d_c,c,n,cudamemcpyhosttodevice)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    print *, trim(cudageterrorstring(cudagetlasterror()))

    stat = cudamemcpy(a,d_a,n,cudamemcpydevicetohost)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

integer function cudamemcpy(dst, src, count, kdir)
- Copies count elements from the region indicated by src to the region indicated by dst.
- Note that count is a number of elements. (In CUDA C it is a number of bytes.)
- kdir specifies the direction of the copy (optional):
  - cudamemcpyhosttodevice: copy from host to device
  - cudamemcpydevicetohost: copy from device to host
- In CUDA Fortran, arrays are qualified with device at definition time, which makes explicit that they are in device memory; this is why the copy direction can be omitted.
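Because the unit of count differs between the two languages, it is easy to pass the wrong number when translating code. The Python sketch below contrasts the two conventions for the N = 1000 arrays used here. It assumes a 4-byte default Fortran real, which is not stated on the slide; the function names are illustrative and are not CUDA API names.

```python
import struct

# Assumption: a default Fortran real occupies 4 bytes (IEEE single precision).
REAL_BYTES = struct.calcsize("=f")


def count_for_cuda_fortran(n_elements):
    """CUDA Fortran cudamemcpy takes the number of elements."""
    return n_elements


def count_for_cuda_c(n_elements, element_bytes=REAL_BYTES):
    """CUDA C cudaMemcpy takes the number of bytes."""
    return n_elements * element_bytes


print(count_for_cuda_fortran(1000), count_for_cuda_c(1000))  # 1000 4000
```

Passing a byte count to the CUDA Fortran routine would request four times as many elements as the arrays hold.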

42 Host-device copies
Simple data copies can be written concisely, as shown below. Device-to-device copies, however, must use cudamemcpy.

[Using the API]
    stat = cudamemcpy(d_b,b,n,cudamemcpyhosttodevice)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))
    stat = cudamemcpy(d_c,c,n,cudamemcpyhosttodevice)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    print *, trim(cudageterrorstring(cudagetlasterror()))

    stat = cudamemcpy(a,d_a,n,cudamemcpydevicetohost)
    if(stat /= cudasuccess) print *, trim(cudageterrorstring(stat))

[Using assignment]
    d_b = b
    d_c = c

    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    print *, trim(cudageterrorstring(cudagetlasterror()))

    a = d_a

For details, see CUDA Fortran Programming Guide and Reference, Chapter 3, Implicit Data Transfer in Expressions.

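43 Specifying grid and block shapes
Picture of the grid and block shapes:
- Two blocks of 512 threads each process an array of 1000 elements.
- The blocks are executed on the SMs in turn.
- Threads are executed in units of a warp (32 threads).
- More threads are created than there are elements, so an if statement is needed to prevent access outside the array.
- Note that in CUDA Fortran, array indices and grid and block IDs are all 1-origin.
[Figure: grid and block layout; blockidx%x = 1, 2; threadidx%x = 1, 2, ..., 512 within each block; index i into arrays a, b, c; threads processed in parallel]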

44 Specifying grid and block shapes
Here, a loop of length N (=1000) is executed with blocks of shape (NUM_THREADS,1,1) in a grid of shape ((N-1)/NUM_THREADS+1,1,1), that is, with 2 blocks of 512 threads. The relevant parts of test.cuf are:
    #define NUM_THREADS 512

    type(dim3) :: dimgrid,dimblock

    dimgrid = dim3((N-1)/NUM_THREADS+1,1,1)
    dimblock = dim3(NUM_THREADS,1,1)

    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    print *, trim(cudageterrorstring(cudagetlasterror()))

The grid and block shapes are specified with the derived type dim3.
- Block: consists of at most 512 threads; the shape is 3-dimensional.
- Grid: a set of blocks; the shape is effectively 2-dimensional (the third dimension can only be 1).
- Both are given inside <<<>>> when the kernel is called.
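The grid expression (N-1)/NUM_THREADS+1 is integer division that rounds the block count up. As a quick check of that arithmetic, the Python sketch below evaluates it for the values in this example and for a few boundary cases; grid_size is a hypothetical helper name, and n is assumed to be at least 1.

```python
def grid_size(n, threads_per_block):
    """Blocks needed to cover n elements: (n - 1) / threads_per_block + 1, truncated."""
    return (n - 1) // threads_per_block + 1


n, threads = 1000, 512
blocks = grid_size(n, threads)
surplus = blocks * threads - n   # threads with no array element to process
print(blocks, surplus)           # 2 24
```

The 24 surplus threads are the reason the kernel needs its if(i < n+1) guard.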

45 Checking that the kernel ran
A kernel is a subroutine and therefore has no return value, but its error code can be checked as follows.
- cudagetlasterror obtains the most recent error code.
- cudageterrorstring obtains the string corresponding to an error code. On normal completion, "no error" is returned.
- Perform the check immediately after the kernel call.
    #define NUM_THREADS 512

    type(dim3) :: dimgrid,dimblock

    dimgrid = dim3((N-1)/NUM_THREADS+1,1,1)
    dimblock = dim3(NUM_THREADS,1,1)

    call sub<<<dimgrid,dimblock>>>(n,d_a,d_b,d_c)
    print *, trim(cudageterrorstring(cudagetlasterror()))

46 Writing the kernel
- A kernel must be placed inside a module.
- The loop variable i is computed from the block ID (blockidx%x), the thread ID (threadidx%x), and the block size (blockdim%x).

[Before]
    subroutine sub(n,a,b,c)
      implicit none
      integer :: n
      real,dimension(n) :: a,b,c
      integer :: i
      do i=1,n
        a(i) = b(i) + c(i)
      enddo
      return
    end subroutine sub

[After]
    module cuda_kernel
    contains
      attributes(global) subroutine sub(n,a,b,c)
        implicit none
        integer,value :: n
        real,dimension(n),device :: a,b,c
        integer :: i
        i = (blockidx%x - 1) * blockdim%x + threadidx%x
        if(i < n+1) a(i) = b(i) + c(i)
        return
      end subroutine sub
    end module cuda_kernel

There are 2 blocks of 512 threads, so:
- threadidx%x: 1 to 512
- blockidx%x: 1 to 2
- blockdim%x: 512
The access range (the range of i) is therefore 1 to 1024, and an if statement is inserted to avoid out-of-bounds references.
Because the IDs are 1-origin, note that the index is i = (blockidx%x - 1) * blockdim%x + threadidx%x.
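To see how the index formula and the guard interact, the kernel can be modelled on the host. The Python sketch below is a sequential emulation, not how a GPU executes the kernel: it loops over every (block, thread) pair that the launch would create, computes the same 1-origin index, and applies the same guard. The name launch_sub and the counting of skipped threads are additions for illustration.

```python
def launch_sub(n, b, c, grid_x, block_x):
    """Sequentially emulate the kernel 'sub' for a 1-D grid of 1-D blocks."""
    a = [0.0] * n
    skipped = 0
    for blockidx in range(1, grid_x + 1):          # block IDs are 1-origin
        for threadidx in range(1, block_x + 1):    # thread IDs are 1-origin
            i = (blockidx - 1) * block_x + threadidx
            if i < n + 1:                          # same guard as the kernel
                a[i - 1] = b[i - 1] + c[i - 1]     # Python lists are 0-origin
            else:
                skipped += 1
    return a, skipped


n = 1000
a, skipped = launch_sub(n, [1.0] * n, [2.0] * n, grid_x=2, block_x=512)
print(a[0], a[-1], skipped)  # 3.0 3.0 24
```

Without the guard, the threads with i from 1001 to 1024 would read and write past the end of the arrays.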

47 Build and run
Modify the Makefile as needed: add the compiler option -Mcuda.

[Makefile]
    FC=pgf90
    FFLAGS=-Mpreprocess -Mcuda
    test: test.cuf
            $(FC) -o $@ $(FFLAGS) $?
    clean:
            rm -f test *.mod

[Output]
    no error
    i= 1,a( 1 )=
    i= 2,a( 2 )=
    i= 3,a( 3 )=
    ...
    i= 998,a( 998 )=
    i= 999,a( 999 )=
    i= 1000,a( 1000 )=

The first line shows that the output of cudagetlasterror was "no error".

48 Performance measurement with the CUDA Profiler
- Setting the environment variable CUDA_PROFILE=1 when running the program collects performance information.
- By default the output goes to cuda_profile_0.log.
- The environment variable CUDA_PROFILE_LOG=[file name] selects the output file.

[cuda_profile_0.log]
    # CUDA_PROFILE_LOG_VERSION
    # CUDA_DEVICE 0 Tesla T10 Processor
    # TIMESTAMPFACTOR fffff719889c
    method,gputime,cputime,occupancy
    method=[ memcpyhtod ] gputime=[ ] cputime=[ ]
    method=[ memcpyhtod ] gputime=[ ] cputime=[ ]
    method=[ sub ] gputime=[ ] cputime=[ ] occupancy=[ ]
    method=[ memcpydtoh ] gputime=[ ] cputime=[ ]

- The file named by CUDA_PROFILE_CONFIG=[file name] selects which data are collected. For details, see the following URL.
  URL: ocs/visualprofiler/computeprof.html
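Once a log has been collected, the per-method GPU time can be totalled with a few lines of script. The Python sketch below parses records of the form shown above. The numeric values in SAMPLE are hypothetical placeholders, not measurements from this tutorial, and the regular expression assumes the "method=[ ... ] gputime=[ ... ] cputime=[ ... ]" layout exactly as it appears on the slide.

```python
import re
from collections import defaultdict

# One profiler record: method name, GPU time, CPU time (optional fields may follow).
RECORD = re.compile(
    r"method=\[\s*(?P<method>\S+)\s*\]\s+"
    r"gputime=\[\s*(?P<gpu>[\d.]+)\s*\]\s+"
    r"cputime=\[\s*(?P<cpu>[\d.]+)\s*\]"
)


def gputime_by_method(lines):
    """Sum the gputime field per method, ignoring header and comment lines."""
    totals = defaultdict(float)
    for line in lines:
        match = RECORD.search(line)
        if match:
            totals[match.group("method")] += float(match.group("gpu"))
    return dict(totals)


# Hypothetical sample values, for illustration only.
SAMPLE = [
    "# CUDA_DEVICE 0 Tesla T10 Processor",
    "method,gputime,cputime,occupancy",
    "method=[ memcpyhtod ] gputime=[ 5.0 ] cputime=[ 9.0 ]",
    "method=[ memcpyhtod ] gputime=[ 4.5 ] cputime=[ 8.0 ]",
    "method=[ sub ] gputime=[ 3.0 ] cputime=[ 12.0 ] occupancy=[ 1.000 ]",
    "method=[ memcpydtoh ] gputime=[ 6.0 ] cputime=[ 20.0 ]",
]

print(gputime_by_method(SAMPLE))
```

Comparing the total for the kernel with the totals for the copies is a quick way to see whether host-device transfers dominate, which is the concern raised in Part 1.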

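49 Debugging techniques
If the program does not work properly, check the following.
Is the kernel running?
- Check with cudagetlasterror.
- Run in device emulation mode, which executes the kernel on the CPU:
  - Add the -Mcuda=emu option at build time.
  - Mainly useful for finding logic bugs.
  - print statements can be used.
  - Note that execution time increases.
Are resources sufficient?
- Global memory size: up to 4 GByte per card on the S1070.
- Shared memory (smem) and register count per block.
- Arguments of attributes(global) routines: arguments are passed through shared memory, up to 256 Byte.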

50 Checking resource usage per block (1)
Adding the -Mcuda=keepbin option when compiling the kernel produces a file named foo.???.bin.

[test.003.bin]
    architecture {sm_13}
    abiversion {1}
    modname {cubin}
    code {
      name = sub
      lmem = 0
      smem = 48
      reg = 4
      bar = 0
      bincode {
      ...

- smem is the shared memory usage per block. It must be 16,384 Byte (16 kbyte) or less.
- reg is the register usage per thread. The usage per block must be 16,384 or less. Here it is 4 per thread, so 512 threads use 2,048, which is not a problem.
- Adding -Mcuda=maxregcount:[number of registers] at build time limits the number of registers used (although in many cases performance drops).

51 Checking resource usage per block (2)
Adding the -Mcuda=ptxinfo option when compiling the kernel prints register and shared memory usage as compile-time messages.

    $ pgf90 -o test -Mpreprocess -Mcuda=ptxinfo test.cuf
    ptxas info : Compiling entry function 'sub'
    ptxas info : Used 4 registers, ... bytes smem
    ptxas info : Compiling entry function 'sub'
    ptxas info : Used 4 registers, ... bytes smem

- smem is the shared memory usage per block. It must be 16,384 Byte (16 kbyte) or less.
- The register count is per thread. The usage per block must be 16,384 or less. Here it is 4 per thread, so 512 threads use 2,048, which is not a problem.
- Adding -Mcuda=maxregcount:[number of registers] at build time limits the number of registers used (although in many cases performance drops).
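The per-block check described on these two slides is a pair of comparisons. The Python sketch below applies it to the values reported for the kernel sub (smem = 48, reg = 4, 512 threads per block). It assumes limits of 16 x 1024 bytes of shared memory, following the 16 kbyte per SM stated in Part 1, and 16 x 1024 registers per SM for sm_13; treat both constants as assumptions to be confirmed against the hardware documentation. The function name is illustrative.

```python
# Assumed limits for one SM of an sm_13 device.
SMEM_LIMIT_BYTES = 16 * 1024   # 16 kbyte of shared memory per SM
REG_LIMIT = 16 * 1024          # registers per SM


def block_fits(smem_bytes, regs_per_thread, threads_per_block):
    """Return (fits, registers used by one block)."""
    regs_per_block = regs_per_thread * threads_per_block
    fits = smem_bytes <= SMEM_LIMIT_BYTES and regs_per_block <= REG_LIMIT
    return fits, regs_per_block


print(block_fits(48, 4, 512))    # (True, 2048)
print(block_fits(48, 40, 512))   # (False, 20480)
```

The second call shows the situation in which -Mcuda=maxregcount or a smaller block would be needed: 40 registers per thread no longer fit with 512 threads per block.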

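52 Summary of Part 3
An example of using the GPU with CUDA Fortran was shown for a very simple code.
Set up device memory properly through the API.
- Allocate device memory (cudamalloc).
- Copy the data referenced on the GPU into the allocated regions (cudamemcpy).
- Copy the results computed on the GPU back to the host as needed (cudamemcpy).
Call the kernel with appropriate parallelization.
- The parallelization is specified through the block and grid shapes.
- If the program does not work properly, check the resource usage per block.
Part 4 describes porting and optimizing the Himeno benchmark, which is closer to a real code.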

53 References: CUDA Fortran
- PGI Resources, CUDA Fortran: explanatory material on CUDA Fortran.
- CUDA Fortran Programming Guide and Reference: the programming guide for CUDA Fortran. It appears to assume basic knowledge of CUDA, so it needs to be read together with the CUDA Programming Guide.

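54 References: NVIDIA CUDA and others
NVIDIA CUDA
- CUDA Zone: NVIDIA's official page.
- NVIDIA GPU Computing Developer Home Page: various documents on CUDA C are published here.
  - CUDA Programming Guide: the basic document on CUDA programming.
  - CUDA Best Practice Guide: a document on optimization techniques for CUDA programs, aimed at readers who have already written code that runs with CUDA.
  - CUDA Reference Guide: a comprehensive document on the API, to be consulted as a reference.
Other
- Himeno benchmark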
