1. マシンビジョンにおける GPU の活用

CUDA 画像処理入門 GTC 213 チュートリアルエヌビディアジャパン CUDA エンジニア森野慎也

1. 医用画像処理における GPU の活用 CT や MRI から画像を受信して三次元画像の構築をするシステム 2 次元スキャンデータから 3 次元 4 次元イメージの高速生成 CUDA 化により画像処理速度を約 2 倍に高速化

1. CUDA で画像処理 GPU = Graphics Processing Unit 画像を生成するためのプロセッサです与えられた画像を処理することも上手です複雑な処理もプログラミングできます CUDA による画像処理の入門編です

2. 画像処理 : アフィン変換画像の線形変換平行移動

2. アフィン変換変換式 1 1 1 y x t d c t b a Y X y x 1 cos sin sin cos T rotate 1 y x magnify r r T 1 1 1 y x translate t t T 変換行列の例

2. 画像のメモリ配置 RGBA(8 bit, uchar4) の配列 index = x + y * pitchinpixels pitchinpixels = pitchinbytes / sizeof(uchar4) width (x, y)

2. 2 次元メモリ確保転送 cudaerror_t cudamallocpitch ( void** devptr, size_t* pitch, size_t width, size_t height ) width バイトのメモリを height 行分取得する行は pitch バイトで整列する cudaerror_t cudamemcpy2d ( void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, cudamemcpykind kind ) dst で示されるメモリ (dpitch バイトで整列 ) に src で示されるメモリ (spitch バイトで整列 ) を width ( バイト ) x height ( 行 ) コピーする

2. アフィン変換 : カーネル設計スレッドに変換後の画面のピクセルを割り当てるピクセル数分スレッドが走る例 : 262,144 (= 512 x 512) スレッドスレッドは処理対象のピクセルを持つ自分の位置 (X, Y) を知ることが必要

2. 2D での Block Thread の割り当て 1 Block 1 Pixel = 1 Thread (i, j) = (GlobalID(x),GlobalID(y)) Thread を 2 次元で質点に対応 Block を 2 次元で定義一定のサイズ Grid : 必要数の Block を 2 次元に並べる

2. 2D での Block Thread の割り当て blockdim.x * blockidx.x threadidx.x blockdim.y * blockidx.y threadidx.y GlobalID は (x, y, z) 方向に計算できる GlobalID(x) = blockdim.x * blockidx.x + threadidx.x GlobalID(y) = blockdim.y * blockidx.y + threadidx.y GlobalID(z) = blockdim.z * blockidx.z + threadidx.z

2. アフィン変換 : Grid サイズ指定 /* value radix で割って切り上げる */ int divroundup(int value, int radix) { return (value + radix 1) / radix; } /* griddim, blockdim を 2 次元 (x, y 方向 ) に初期化 */ dim3 blockdim(128, 4); /* divroundup() は切り上げの割り算 */ dim3 griddim(divroundup(width, blockdim.x), divroundup(height, blockdim.y)); affinetransformkernel<<<griddim, blockdim>>>(ddst, dsrc, );

2. アフィン変換 : カーネルの入出力 global void affinetransformkernel(uchar4 *ddst, const uchar4 *dsrc, ) ddst dsrc

2. アフィン変換 : カーネルのスケルトン global void affinetransformkernel(uchar4 *ddst, const uchar4 *dsrc, int width, int height, int pitch) { int gidx = blockdim.x * blockidx.x + threadidx.x; int gidy = blockdim.y * blockidx.y + threadidx.y; } if ((gidx < width) && (gidy < height)) { uchar4 pixel = ; /* 値を設定 */ int mypixelpos = gidx + gidy * pitch; zdst[mypixelpos] = pixel; }

2. アフィン変換 : 座標は逆変換変換後のピクセル座標 (X, Y) は既知 (X, Y) から (x, y) に逆変換ピクセルをコピー

2. アフィン変換 : 逆変換行列はすべての変換で共通 ( 大域的 ) 事前に CPU 上で計算しておくカーネルでは与えられた行列を使うのみ 1 1 1 1 Y X at ct a c dt bt b d bc ad y x y x x y

2. アフィン変換 : カーネル呼び出し struct Matrix { float a, b, c, d; float tx, ty; } Matrix matrix; // 値設定済み ( 略 ) Matrix inverted; // 逆行列 float det = matrix.a * matrix.d - matrix.b * matrix.c; if (det!=.f) { inverted.a = matrix.d / det; inverted.b = - matrix.b / det; inverted.c = - matrix.c / det; inverted.d = matrix.a / det; inverted.tx = (matrix.b * matrix.ty - matrix.tx * matrix.d) / det; inverted.ty = (matrix.tx * matrix.c - matrix.a * matrix.ty) / det; dim3 blockdim(128, 4); dim3 griddim(divroundup(width, blockdim.x), divroundup(height, blockdim.y)); affinetransformkernel<<<griddim, blockdim>>>(inverted, ddst, texsrc, width, height, pitch / sizeof(uchar4)); ( 略 )

2. アフィン変換 : カーネルの実装 global void affinetransformkernel(matrix invmat, uchar4 *ddst, const uchar4 *dsrc, int width, int height, int pitch) { int gidx = blockdim.x * blockidx.x + threadidx.x; int gidy = blockdim.y * blockidx.y + threadidx.y; } if ((gidx < width) && (gidy < height)) { float X = gidx +.5f; float Y = gidy +.5f; float x = invmat.a * X + invmat.b * Y + invmat.tx; /* 逆変換 */ float y = invmat.d * X + invmat.e * Y + invmat.ty; uchar4 srcpixel ; if ((.f < x) && (x < width) && (.f < y) && (y < wdith)) { int srcpixelpos = int(x) + int(y) * pitchinpixels; srcpixel = dsrc[srcpixelpos]; } else { srcpixel = make_uchar4(,,, ) } ddst[gidx + gidy * pitch] = srcpixel;

2. OpenGL Interoperability CUDA から OpenGL オブジェクトをアクセス OpenGL オブジェクト登録 OpenGL オブジェクト登録解除リソースマップリソースアンマップ CUDA オブジェクト取得 Texture cudagraphicsglregisterimage() cudagraphicsglunregisterimage() cudagrahipcssubresourcegetmapp edarray() PBO/VBO などバッファ cudagraphicsmapresources() cudagraphicsunmapresources() cudagraphicsglregisterbuffer() cudagraphicsglunregisterbuffer() cudagraphicsresourcegetmappedpoi nter()

3. たたみ込み画像フィルタ Gaussian Filter, Sobel Filter, Laplacian Filter パターンマッチング SAD SSD 相関マッチング etc

3. Gaussian Filter 元画像のピクセル x 係数すべて足し合わせる係数をガウス分布とする 1 スレッドで 1 ピクセルを出力 + 係数元画像値の形式は float 足し合わせる

3. カーネルの実装イメージ device float f(int x, int y); // ピクセルの値を取得する関数 global void gaussiankernel_3x3(float *ddst, const float *dsrc, int width, int height, int pitch) { int gidx = blockdim.x * blockidx.x + threadidx.x; int gidy = blockdim.y * blockidx.y + threadidx.y; } if ((gidx < width) && (gidy < height)) { float pixel = coef[][] * f(gidx - 1, gidy - 1) + coef[][1] * f(gidx, gidy - 1) + coef[][2] * f(gidx + 1, gidy - 1); + coef[1][] * f(gidx - 1, gidy ) + coef[1][1] * f(gidx, gidy ) + coef[1][2] * f(gidx + 1, gidy ); + coef[2][] * f(gidx - 1, gidy + 1) + coef[2][1] * f(gidx, gidy + 1) + coef[2][2] * f(gidx + 1, gidy + 1); int mypixelpos = gidx + gidy + pitchinpixels; ddst[mypixelpos] = pixel; }

3. Texture GPU 上のハードウエア Read-only L1キャッシュが使用可能端の要素の処理 Clamp Wrap Mirror Border 線形補間も使用可能 Texture Object Fermi 以降 CUDA 5. 以降で使用可能カーネルに引数として渡せる

3. Texture オブジェクトの作成 TextureDesc texdesc; ResourceDesc resdesc; // 値のクリア memset(&texdesc,, sizeof(texdesc)); memset(&resdec,, sizeof(resdesc)); texdesc.addressmode[] = texdesc.addressmode[1] = cudaaddressmodeclamp; texdesc.filtermode = cudafiltermodepoint; texdesc.readmode = cudareadmodeelementtype; texdesc.normalizedcoords = ; resdesc.restype = cudaresourcetypepitch2d; resdesc.res.pitch2d.devptr = dsrc; resdesc.res.pitch2d.desc = cudacreatechanneldesc<float>(); resdesc.res.pitch2d.pitchinbytes = pitchinbytes; resdesc.res.pitch2d.width = width; resdesc.res.pitch2d.height = height; cudatextureobject_t tex; cudacreatetextureobject(&tex, &resdesc, &texdesc, NULL);

カーネル実装 :Texture 導入 device float f(cudatextureobject_t texsrc, int x, int y) { // ピクセルの値を取得する関数 return tex2d<float>(texsrc, x, y); } global void gaussiankernel_3x3(float *ddst, cudatextureobject_t texsrc, int width, int height, int pitch) { int gidx = blockdim.x * blockidx.x + threadidx.x; int gidy = blockdim.y * blockidx.y + threadidx.y; } if ((gidx < width) && (gidy < height)) { float pixel = coef[][] * f(gidx - 1, gidy - 1) + coef[][1] * f(gidx, gidy - 1) + coef[][2] * f(gidx + 1, gidy - 1) + coef[1][] * f(gidx - 1, gidy ) + coef[1][1] * f(gidx, gidy ) + coef[1][2] * f(gidx + 1, gidy ) + coef[2][] * f(gidx - 1, gidy + 1) + coef[2][1] * f(gidx, gidy + 1) + coef[2][2] * f(gidx + 1, gidy + 1); ddst[gidx + gidy * pitchinpixels] = pixel; }

3. Constant Memory 定数専用のメモリ複数のスレッドから同じ値をアクセスするのが前提サイズは64 KB キャッシュされる値の設定直接初期化 Hostから値を設定することも可能 cudamemcpytosymbol()

3. カーネル実装 :Texture 導入 constant float coef[3][3] = { { 1.f / 16.f, 2.f / 16.f, 1.f / 16.f, }, { 2.f / 16.f, 4.f / 16.f, 2.f / 16.f, }, { 1.f / 16.f, 2.f / 16.f, 1.f / 16.f, }, }; device float f(cudatextureobject_t texsrc, int x, int y) { // ピクセルの値を取得する関数 return tex2d<float>(texsrc, x, y); } global void gaussiankernel_3x3(float *ddst, cudatextureobject_t texsrc, int width, int height, int pitch) { int gidx = blockdim.x * blockidx.x + threadidx.x; int gidy = blockdim.y * blockidx.y + threadidx.y; ( 略 ) }

3. 演算量メモリアクセス量の算出画像サイズ : x (pixels) * y(pixels) メモリ読みこみ書き出し量 = 2 * x * y * sizeof(float) [byte] 演算量 = 17 * x * y [FP] B/F = 8 / 17.48 [byte/fp] 実際の GPU =.4~.8 [byte/fp] メモリ読み込み量が多いバンド幅律速

3. TIPS: ベクタライズによる高速化 1 つのスレッドで複数のピクセルを処理する ( 例では 2x2) 係数元画像 : レジスタに保存元画像からの読み込み値は変数 ( レジスタ ) に保存する出力 Communication-Minimizing 2D Convolution in GPU Registers Forrest N. Iandola, David Sheffield, Michael Anderson, Phitchaya Mangpo Phothilimthana, Kurt Keutzer, http://parlab.eecs.berkeley.edu/publication/899

Sobel Filter 輪郭の検出係数 ( 横方向 ) -1-2 -1 1 2 1 + 係数 ( 縦方向 ) -1 1-2 2-1 1 横縦成分の合成 v 2 v x v y 2

3. ベンチマーク例ベクタ化性能 (GFLOPS) バンド幅 (GB/s) バンド幅効率性能向上 Gaussian Filter (3x3) Sobel Filter - 256 112 54 % - 2x2 346 152 73 % 35 % - 25 95.3 46 % - 2x2 315 147 71 % 54 % Tesla K2 ECC off, 248 x 248 pixels.

画像処理のための CUDA 入門画像処理のための CUDA 入門日時 : 8/28 9/26 15:~18: 場所 : NVIDIA Japan 赤坂オフィス定員 : 2 名申し込み : http://www.nvidia.co.jp/object/event-calendar-jp.html 入門編無償です