(MIRU2010) NTT Graphic Processor Unit GPU graphi

(MIRU2010) 2010 7 889 2192 1-1 905 2171 905 NTT 243 0124 3-1 E-mail: ac094608@edu.okinawa-ct.ac.jp, akisato@ieee.org Graphic Processor Unit GPU graphic processor unit CUDA Fully automatic extraction of salient objects in near real-time Kazuma AKAMINE, Ken FUKUCHI, Akisato KIMURA, and Shigeru TAKAGI Department of Computer Science and System Engineering, Faculty of Engineering, Miyazaki University 1-1 Gakuen Kibanadai-Nishi, Miyazaki, 889 2192 Japan. Department of Information and Communication Systems Engineering, Okinawa National College of Technology Henoko 905, Nago, Okinawa, 905 2171 Japan. NTT Communication Science Laboratories, NTT Corporation Morinosato Wakamiya 3-1, Atsugi, Kanagawa, 243 0198 Japan. E-mail: ac094608@edu.okinawa-ct.ac.jp, akisato@ieee.org 1. (MRF) (MAP) MRF Greig [1] Boykov Interactive Graph Cuts [2] Kohli Dynamic Graph Cuts [3] MRF MAP [4] [5] Fu [6] [7]

Graphic Processor Unit GPU SIMD Single Instruction Multiple Data GPU CPU [8] [9] [10] GPU general-purpose GPU(GPGPU) [11] GPU C++ CUDA [12] OpenCL [13] GPU GPU CUDA GPU CUDA 2 GPU CUDA 3 [7] 4 GPU 5 6 2. Graphic Processor Unit GPU GPU CUDA 2. 1 GPU(Graphics Processing Unit) GPU GPU SIMD 1 1 2 GPU GPGPU GPU GPU NVIDIA CUDA [12] AMD ATi Stream [14] CPU GPU OpenCL [13] Microsoft API Direct Compute [15] 2. 2 CUDA GPU HLSL [15] GLSL [16] NVIDIA GPU CUDA GPU CUDA C GPU C GPU 1 ID ID 32 32 2 CPU GPU GPU CPU GPU CPU GPU 2. 3 CUDA CUDA CPU GPU

2 CPU GPU 4 CUDA 1 host device int add(int a, int b) { 2 return a + b; 3 } 4 global void addall(int length, int a, int b, int result) { 5 int px = blockidx.x blockdim.x + threadidx; 6 if(px < length) { 7 result[px] = add(a[px], b[px]); 8 } 9 } 10 void main() { 11 12 Dim3 blocks = {10, 0, 0}; 13 Dim3 threads = {32, 0, 0}; 14 int sm cap = 0; 15 addeach<<<blocks, threads, sm cap>>>(length, a, b, result); 16 17 } 3 CUDA CPU GPU 1 1 CUDA host CPU CPU global CPU GPU device GPU GPU GPU device global void host device 3 3 GPU a b addall GPU add CPU GPU GPU addall <<<>>> 3 blocks 5 2 GPU CPU threads 1 sm cap 2. 4 CUDA CUDA GPU 4 5 6 2 CUDA GPU 6 2 16KB syncthreads()

6 Saliency Graph Cuts 7 CPU GPU GPU 64KB GPU CPU GPU GPU CPU GPU CPU GPU CPU GPU 3. SGC [7] SGC 6 SGC ( 6(a)) [17] ( 6(b)) (Lucas-Kanade [18]) [19] / ( 6(c1))P r (O) P r (B) p P r (O; p) P r (O; p) 0 P r (B; p) = 1 P r (O; p) Boykov [20] 7 S T t-link n-link t-link (1)(2) n-link (5) C p p P r (C p O) P r (C p B) P r (O C p ) P r (B C p ) I p I q p q dist(p, q) p q R p ( obj ) = ln P r (O C p ) (1) R p ( bkg ) = ln P r (B C p ) (2) P r (O C p ) = P r(c p O)P r (O; p) P r (C p ) P r (B C p ) = P r(c p B)P r (B; p) P r (C p ) B {p,q} exp { (I p I q ) 2 } 1 2σ dist(p, q) (3) (4) (5) ( 6(c2)) (6)(8) (GMM) GMM d RGB 3 a k S k π k EM P r (C p O)

F (x, y) P (x, y) n m F k (i, j) (9) F (x, y) = n m F k (i, j) i=0 j=0 P (x + i n 2, y + j m 2 ) (9) 8 GPU n P r (O) RGB P r (C p B) P r (O) RGB p(x; a k, S k, π k ) = π k > = 0, M π k p k (x) (6) k=1 M π k = 1 (7) k=1 1 p k (x) = (2π) d/2 S k 1/2 { exp 1 } 2 (x a k) T S 1 k (x a k) (8) ( 6(d)) 1 4. GPU 3 GPU 8 GPU t-link n-link GPU GPU 1 1 P (x, y) F k (i, j) F (x, y) F k (i, j) P (x, y) 9 1-2 (filter) (src1) height width fheight fwidth result 1 10 10 minsrc maxsrc syncthreads 0 1 P r (O) GPU t-link EM CUDA ZONE [12] Harp [21] EM k-means CPU GMM (6) 9 P k (x) (8) {(2π) d/2 S k 1/2 } 1 1 CPU GPU

1 texture<float, 1, cudareadmodeelementtype> filter; 2 texture<float, 1, cudareadmodeelementtype> src1; 3 device float Filter2DCore(texture<float, 1, cudareadmodeelementtype> fsource, int x, int y, int height, int width, int filtersizex, int filtersizey) { 4 float sum = 0; 5 x = filtersizex/2; 6 y = filtersizey/2; 7 for(int fy = 0; fy < filtersizey; fy++) { 8 int by = y + fy; 9 if(by > 0 && by < height) { 10 by = width; 11 for(int fx = 0; fx < filtersizex; fx++) { 12 int bx = x + fx; 13 if(bx > 0 && bx < width) { 14 sum += tex1dfetch(filter, fy filtersizex + fx) tex1dfetch(fsource, by + bx); 15 } 16 } 17 } 18 } 19 return sum; 20 } 21 global void Filter2DKernel(int height, int width, int fheight, int fwidth, float result) { 22 int px = blockdim.x blockidx.x + threadidx.x; 23 if(px < height width) { 24 int x = px%width; 25 int y = px/width; 26 result[px] = Filter2DCore(src1, x, y, height, width, fwidth, fheight); 27 } 28 } 1 texture<float, 1, cudareadmodeelementtype> minsrc; 2 texture<float, 1, cudareadmodeelementtype> maxsrc; 3 global void SMRangeNormalizeKernel1(int length, float localmin, float localmax) { 4 int px = blockdim.x blockidx.x + threadidx.x; 5 6 shared float mini[32], maxi[32]; 7 if(px < length) { 8 mini[threadidx.x] = tex1dfetch(minsrc, px); 9 maxi[threadidx.x] = tex1dfetch(maxsrc, px); 10 } else { 11 mini[threadidx.x] = FLT MAX; 12 maxi[threadidx.x] = FLT MIN; 13 } 14 syncthreads(); 15 if(threadidx.x == 0) { 16 for(int i = 1; i < blockdim.x; i++) { 17 mini[0] = min(mini[0], mini[i]); 18 maxi[0] = max(maxi[0], maxi[i]); 19 } 20 localmin[blockidx.x] = mini[0]; 21 localmax[blockidx.x] = maxi[0]; 22 } 23 } 10 9 CUDA (8) exp { 1 2 (x a k) T S 1 k (x a k) } a k S 1 k n-link 2 1 GPU CUDA ZONE Vineet CUDA Cuts [22] 1 5. 10 10 CPU GPU 3 352 288 480 384 640 512 3 1 11 1 640x512 GPU 11 4 11 4 GPU 12 5 3 CPU Intel Core2Quad Q9550 4GB GPU NVIDIA Geforce 9800GT 512MB OS Windows XP Professional NVIDIA CUDA 2.1 OpenCV 1.1 4 640x512 2.4 132 t-link 4.2 n-link 11 2.9

352x288 2.1 640x512 4.5 CUDA ZONE EM OpenCL!"! 12 4 1 1 [ms] t- - link link 352 CPU 32.9 148.1 218.6 9.3 97.0 71.0 288 GPU 22.2 1.9 109.6 0.9 69.0 65.6 480 CPU 58.8 372.8 350.8 16.7 246.5 86.4 384 GPU 30.4 3.5 120.8 1.6 127.7 74.6 640 CPU 109.8 814.5 602.6 29.5 664.5 112.7 512 GPU 45.2 6.2 142.6 2.7 232.3 87.1 5 [us] CPU GPU 352 288 5.69 2.66 2.14 480 384 6.14 1.95 3.16 640 512 7.12 1.57 4.52 4.5 5 CPU GPU GPU CPU GPU 2 GPU GPU SGC 11 4 GPU t-link t-link EM 2 CUDAZONE 6. GPU [1] D.Greig B.Porteous and A.Seheuit Exact maximum a posteriori estimation for binary images Royalstat Vol.B:51 No.2 pp.271 279 1989. [2] Y.Boykov and M-P.Jolly Interactive Graph Cuts for Optical Boundary & Region Segmentation of Objects in N-D Images Proc.ICCV Vol.I pp.105 112 2001. [3] P.Kohli and P.Torr Dynamic graph cuts for efficient inference in Markov random fields IEEE Trans.PAMI Vol.29 No.12 pp.2079 2088 2007. [4] AdaBoost Saliency Map Graph Cuts (MIRU2008) IS3-33 pp.796 801 2008 7. [5] PRMU2008 232 pp.145 150 2009 2. [6] Y.Fu J.Cheng Z.Li and H.Lu Saliency cuts: Anautomatic approach to object segmentation Proc.ICPR 2008. [7] (MIRU2009) 2009. [8] TSUBAME http://www.gsic.titech.ac.jp/ ccwww/ tebiki/tesla/tesla.html [9] TMPGEnc 4.0 XPress http://tmpgenc.pegasys-inc.com/ja/press/08 081030.html [10] MediaShow Espresso http://jp.cyberlink.com/prog/ company/press-news-content.do?pid=2115 [11] GPU GPGPU 15 2009 6. [12] http://www.nvidia.com/object/cuda home new.html [13] http://www.khronos.org/opencl/ [14] http://www.amd.com/us/products/technologies/ stream-technology/pages/stream-technology.aspx [15] http://msdn.microsoft.com/en-us/library/ ee663301(vs.85).aspx [16] http://www.opengl.org/documentation/glsl/ [17] MCMC-based particle filter (MIRU2009) 2009 7. [18] B.D.Lucas and T.Kanade An Iterative Image Registration Technique with an Application to Stereo Vision Proceedings of the 7th International Joint Conference on Artificial Intelligence(IJCAI 81) pp.674 679 August 1981. [19] L.Itti C.Koch and E.Niebur A model of saliencybased visual attention for rapid scene analysis IEEE Trans.PAMI Vol.20 No.11 pp.1254 1259 November 1998. [20] Y.Boykov and G.F.Lea Graph cuts and efficient N- D image segmentation Proc.ICCV Vol.70 No.2 pp.109 131 2006.

[21] A.Harp Computational Statistics via GPU http://andrewharp.com/gmmcuda [22] V.Vineet and P.J.Narayanan Cuda Cuts: Fast Graph Cuts on the GPU CVPR Workshop on Visual Computer Vision on GPUs 2008.