GPU computing tutorial: CUDA optimization — parallel reduction (2010/06/28)

This material follows NVIDIA's Mark Harris, "Optimizing Parallel Reduction in CUDA":
http://developer.download.nvidia.com/compute/cuda/1_1/website/data-Parallel_Algorithms.html#reduction
The corresponding sample code ships with the CUDA SDK as the reduction project.

Hands-on setup: a copy of the CUDA SDK reduction sample is provided at /work/nmaruyam/gpu-tutorial/reduction. Copy it to your home directory, build it, and run it:

    $ cp -r /work/nmaruyam/gpu-tutorial/reduction ~
    $ cd ~/reduction
    $ cd projects/reduction
    $ make
    $ cd ../..
    $ ./bin/linux/release/reduction


Tree-based parallel reduction: combine elements by pairwise partial sums, so an array of N values is reduced in log2(N) steps. In CUDA, each thread block reduces its portion of the input to a single partial sum, and the per-block partial sums are then reduced in a further pass.

[Figure: example tree reduction]
    3   1   7   0   4   1   6   3
      4       7       5       9
          11              14
                  25
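
As an illustration only (not from the slides), the same pairwise pattern written as plain host code; after log2(n) passes over the array, element 0 holds the total (n is assumed to be a power of two):

    /* Illustration of tree-based reduction on the host (assumes n is a power of two). */
    void tree_reduce_host(int *data, int n)
    {
        for (int stride = 1; stride < n; stride *= 2) {         /* log2(n) passes        */
            for (int i = 0; i + stride < n; i += 2 * stride) {  /* pairwise partial sums */
                data[i] += data[i + stride];
            }
        }
        /* data[0] now holds the sum of all n elements. */
    }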

Reduction does almost no arithmetic per element, so FLOPS is not the metric that matters; the kernel is bound by memory bandwidth, and the optimization target is effective bandwidth. Reference point: the Tesla GPU in TSUBAME (S1070) has a peak memory bandwidth of 102 GB/s.
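
A quick sanity check on the numbers in the tables below (a sketch, assuming the 2^22-element int input used by the SDK sample; the function name is hypothetical): effective bandwidth is just bytes read divided by elapsed time.

    #include <stddef.h>

    /* Effective bandwidth of a reduction that reads n ints once from global memory. */
    double effective_bandwidth_gbps(size_t n, double elapsed_ms)
    {
        double bytes = (double)n * sizeof(int);        /* 4 * 2^22 = 16.8 MB for the assumed input */
        return bytes / (elapsed_ms * 1.0e-3) / 1.0e9;  /* bytes per second -> GB/s                 */
    }
    /* Example: 16.8 MB in 3.51 ms gives about 4.8 GB/s, matching the Kernel 1 result below. */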

Reduction kernel #1:

    __global__ void reduce0(int *g_idata, int *g_odata)
    {
        extern __shared__ int sdata[];

        // Each thread loads one element from global memory into shared memory
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // Reduction in shared memory
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // Thread 0 writes this block's partial sum to global memory
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
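
For completeness, a minimal host-side launch sketch (not shown on the slide; the block count, thread count, and buffer names are assumptions). The third launch parameter sizes the dynamically allocated shared array sdata:

    // Hypothetical host-side driver for reduce0.
    void launch_reduce0(int *d_idata, int *d_odata, unsigned int n)
    {
        int threads = 256;                          // threads per block (assumed)
        int blocks  = (n + threads - 1) / threads;  // one input element per thread
        size_t smem = threads * sizeof(int);        // dynamic shared memory for sdata[]
        reduce0<<<blocks, threads, smem>>>(d_idata, d_odata);
        // d_odata now holds 'blocks' partial sums; reduce them again
        // (further launches or a CPU loop) until one value remains.
    }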

[Figure: interleaved addressing with divergent branching. Shared-memory values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 are combined in 4 steps with strides 1, 2, 4, 8; the active thread IDs are 0, 2, 4, ..., 14, then 0, 4, 8, 12, then 0, 8, then 0. After step 4, sdata[0] holds the block sum 41.]

Measured on the TSUBAME Tesla S1070 GPU:

    Kernel 1:  3.51 ms,  4.77 GB/s  (4.6 % of peak bandwidth)

What is wrong with kernel 1? The reduction loop

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

branches on tid % (2*s) == 0, so neighboring threads of the same warp take different paths (highly divergent warps), and the % operator itself is expensive.

A note on control flow: CUDA C/Fortran has the usual constructs (if, while, for, do-while), but the GPU executes them differently from a CPU. Threads are issued in warps that run in lockstep, so when threads of one warp take different branches, the warp executes both paths with the inactive threads masked off.
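
As an illustration (not from the slides): a condition that is uniform across each warp costs nothing extra, whereas a condition that splits a warp forces both paths to be issued.

    __global__ void branch_example(int *out)
    {
        // Warp-uniform branch: all 32 threads of a warp evaluate the condition
        // identically, so each warp issues only one of the two paths.
        int v;
        if ((threadIdx.x / 32) % 2 == 0) {
            v = 1;
        } else {
            v = 2;
        }

        // Divergent branch: even and odd lanes of the same warp disagree, so the
        // warp issues both paths in turn, masking the inactive lanes each time.
        if (threadIdx.x % 2 == 0) {
            v *= 3;
        } else {
            v += 3;
        }

        out[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }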


How a warp executes a branch (the following slides step through an animation; combined here). Each instruction carries a mask over the warp's 32 threads (T = active, F = masked). For the code below, all threads execute x = z - k and evaluate the condition with a full mask; threads with x > 0 (mask e.g. T T F T F F ... T) then execute the then-branch while the rest are masked; the mask is inverted (F F T F T T ... F) and the else-branch is executed; finally all threads re-converge and execute the statements after the if/else with a full mask again. A divergent warp therefore pays the cost of both branches.

    x = z - k;
    if (x > 0) {
        t = y * x - a;
    } else {
        s = y * y;
        t = 2 * x + a;
    }
    s = x * y;
    a[i] = t * s;

Improvement #2: get rid of the divergent branch and the % operator. Replace the thread-id test with a computed index so that, at every step, the active threads are the contiguous low-numbered threads; whole warps are then either fully active or fully idle, and there is no divergence within a warp.

Before:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

After:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

[Figure: with the strided index, the same 16-element example is processed by contiguous thread IDs — threads 0..7 in step 1 (stride 1), 0..3 in step 2 (stride 2), 0..1 in step 3 (stride 4), and thread 0 in step 4 (stride 8); the final sum 41 again ends up in sdata[0].]

    Kernel      Time      Bandwidth   % of peak   Step speedup   Cumulative
    Kernel 1    3.51 ms   4.77 GB/s    4.6 %      -              -
    Kernel 2    1.62 ms   10.4 GB/s   10.1 %      2.2x           2.2x

A new problem with #2: the computed index makes the active threads access shared memory with a stride that doubles every step (2, 4, 8, 16, ... words), which causes shared-memory bank conflicts.

[Figure: the same example annotated with the shared-memory access stride per step (2, 4, 8, 16).]


Shared memory banks: on this GPU generation, shared memory is divided into 16 banks (Bank 0, Bank 1, ..., Bank 15), and successive 32-bit words belong to successive banks. The 16 threads of a half-warp can access shared memory in a single step only when they hit 16 distinct banks; accesses that land in the same bank are serialized (a bank conflict).

[Figure: no bank conflicts — linear addressing with stride 1 (thread i -> bank i), and any random 1:1 permutation of threads to banks.]

[Figure: 2-way bank conflicts — linear addressing with stride 2 (two threads per bank); 8-way bank conflicts — linear addressing with stride 8 (eight threads per bank).]
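
A minimal sketch (illustration only, not from the slides) of the access patterns in these figures, assuming 16 banks of 32-bit words; the kernel name, array size, and block size of 64 threads are all assumptions:

    __global__ void bank_example(int *out)
    {
        // Illustration for a block of 64 threads; 512 ints span the 16 banks many times over.
        __shared__ int s[512];
        unsigned int tid = threadIdx.x;

        for (unsigned int i = tid; i < 512; i += blockDim.x)   // fill shared memory
            s[i] = (int)i;
        __syncthreads();

        int a = s[tid];        // stride 1: thread t reads bank t % 16            -> conflict-free
        int b = s[2 * tid];    // stride 2: threads t and t+8 read the same bank  -> 2-way conflict
        int c = s[8 * tid];    // stride 8: eight threads of a half-warp per bank -> 8-way conflict
        out[tid] = a + b + c;
    }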

[Figure: sequential addressing — the same 16-element example reduced with strides 8, 4, 2, 1. Contiguous threads 0..7, then 0..3, 0..1, and finally thread 0 are active, and each step reads consecutive shared-memory words, so there are no bank conflicts; sdata[0] again ends up 41.]

Improvement #3: sequential addressing — run the loop from large strides down to small ones so that the active threads are always a contiguous block of IDs accessing consecutive shared-memory words (no bank conflicts).

Before:

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

After:

    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    Kernel      Time      Bandwidth   % of peak   Step speedup   Cumulative
    Kernel 1    3.51 ms   4.77 GB/s    4.6 %      -              -
    Kernel 2    1.62 ms   10.4 GB/s   10.1 %      2.2x           2.2x
    Kernel 3    0.81 ms   20.7 GB/s   20.2 %      2.0x           4.3x


A remaining problem with #3: in

    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

half of the threads are already idle in the very first iteration (only tid < blockDim.x/2 does any work), and the fraction of idle threads grows every step.

Improvement #4: do the first add while loading from global memory, so every thread does useful work and each block covers twice as many input elements.

Before (1): each thread loads one element:

    // Load one element per thread into shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

After (2): each thread loads two elements and adds them on the way in:

    // Load two elements per thread and add during the load
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();
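
Note, not stated on the slide: since each block now consumes 2*blockDim.x input elements, the host must launch half as many blocks. A sketch with hypothetical names (reduce3 stands in for the kernel that uses this load):

    void launch_reduce3(int *d_idata, int *d_odata, unsigned int n)
    {
        int threads = 256;
        int blocks  = (n + threads*2 - 1) / (threads*2);  // each block covers 2*threads elements
        size_t smem = threads * sizeof(int);
        reduce3<<<blocks, threads, smem>>>(d_idata, d_odata);
    }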

    Kernel      Time      Bandwidth   % of peak   Step speedup   Cumulative
    Kernel 1    3.51 ms   4.77 GB/s    4.6 %      -              -
    Kernel 2    1.62 ms   10.4 GB/s   10.1 %      2.2x           2.2x
    Kernel 3    0.81 ms   20.7 GB/s   20.2 %      2.0x           4.3x
    Kernel 4    0.47 ms   36.0 GB/s   35.2 %      1.7x           7.5x

Still only about 35 % of the peak bandwidth.

Improvement #5: unroll the last warp. Once only 32 threads remain active, they belong to a single warp that executes in lockstep, so the last 6 steps need no __syncthreads():

    for (unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
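
One caveat the slide does not show: with no __syncthreads() between these steps, the compiler must not be allowed to keep sdata values in registers. The usual safeguard for this warp-synchronous style is to go through a volatile pointer; a minimal sketch, with the helper name warp_reduce being an assumption:

    __device__ void warp_reduce(volatile int *sdata, unsigned int tid)
    {
        // volatile forces every += to actually read and write shared memory,
        // which this __syncthreads()-free, warp-synchronous code depends on.
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }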

    Kernel      Time      Bandwidth   % of peak   Step speedup   Cumulative
    Kernel 1    3.51 ms   4.77 GB/s    4.6 %      -              -
    Kernel 2    1.62 ms   10.4 GB/s   10.1 %      2.2x           2.2x
    Kernel 3    0.81 ms   20.7 GB/s   20.2 %      2.0x           4.3x
    Kernel 4    0.47 ms   36.0 GB/s   35.2 %      1.7x           7.5x
    Kernel 5    0.28 ms   58.1 GB/s   57.0 %      1.7x          12.5x

Taking it further: the block size is fixed when the kernel is compiled (here 512 threads), so in the remaining loop s only ever takes the values 64, 128, and 256 — those iterations can be unrolled completely as well. (The code is the same as in #5 above.)

Improvement #6: complete unrolling — make the block size a compile-time constant (a template parameter blockSize); all the size tests below are then resolved at compile time and the dead branches disappear.

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }

    Kernel      Time      Bandwidth   % of peak   Step speedup   Cumulative
    Kernel 1    3.51 ms   4.77 GB/s    4.6 %      -              -
    Kernel 2    1.62 ms   10.4 GB/s   10.1 %      2.2x           2.2x
    Kernel 3    0.81 ms   20.7 GB/s   20.2 %      2.0x           4.3x
    Kernel 4    0.47 ms   36.0 GB/s   35.2 %      1.7x           7.5x
    Kernel 5    0.28 ms   58.1 GB/s   57.0 %      1.7x          12.5x
    Kernel 6    0.25 ms   66.2 GB/s   65.0 %      1.13x         14.0x

The complete kernel, with a grid-stride loop so each thread sums multiple input elements before the in-block reduction:

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
    {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) + tid;
        unsigned int gridSize = blockSize*2*gridDim.x;

        // Each thread accumulates as many elements as needed, striding over the grid
        sdata[tid] = 0;
        while (i < n) {
            sdata[tid] += g_idata[i] + g_idata[i+blockSize];
            i += gridSize;
        }
        __syncthreads();

        // Completely unrolled in-block reduction
        if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

        if (tid < 32) {
            if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
            if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
            if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
            if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
            if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
            if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
        }

        // Thread 0 writes this block's partial sum
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
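
Because blockSize is a template parameter, the host has to select the instantiation at run time. A sketch of one way to dispatch (the surrounding function and launch values are assumptions; reduce6 and its arguments are as in the kernel above):

    // Hypothetical dispatch for the templated kernel. Block sizes below 64 would
    // additionally need padded shared memory for the warp-unrolled steps.
    void launch_reduce6(int *d_idata, int *d_odata, unsigned int n,
                        int threads, int blocks)
    {
        size_t smem = threads * sizeof(int);
        switch (threads) {
            case 512: reduce6<512><<<blocks, 512, smem>>>(d_idata, d_odata, n); break;
            case 256: reduce6<256><<<blocks, 256, smem>>>(d_idata, d_odata, n); break;
            case 128: reduce6<128><<<blocks, 128, smem>>>(d_idata, d_odata, n); break;
            case  64: reduce6< 64><<<blocks,  64, smem>>>(d_idata, d_odata, n); break;
        }
        // d_odata again holds one partial sum per block; a final, much smaller
        // reduction (another launch or a CPU loop) produces the total.
    }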

    Kernel      Time      Bandwidth   % of peak   Step speedup   Cumulative
    Kernel 1    3.51 ms   4.77 GB/s    4.6 %      -              -
    Kernel 2    1.62 ms   10.4 GB/s   10.1 %      2.2x           2.2x
    Kernel 3    0.81 ms   20.7 GB/s   20.2 %      2.0x           4.3x
    Kernel 4    0.47 ms   36.0 GB/s   35.2 %      1.7x           7.5x
    Kernel 5    0.28 ms   58.1 GB/s   57.0 %      1.7x          12.5x
    Kernel 6    0.25 ms   66.2 GB/s   65.0 %      1.13x         14.0x
    Kernel 7    0.22 ms   76.1 GB/s   74.6 %      1.14x         16.0x

The CUDA profiler comes in a command-line (CUI) and a GUI version. For the command-line profiler, set the environment variable CUDA_PROFILE to 1 and run the program; the results are written to a log file (CUDA_Profiler.txt). For the GUI version, log in with ssh -Y (X forwarding) and run:

    $ export LD_LIBRARY_PATH=/opt/cuda2.3/cudaprof/bin:$LD_LIBRARY_PATH
    $ export PATH=/opt/cuda2.3/cudaprof/bin:$PATH
    $ cudaprof