PC 1 1 [1][2] [3][4] ( ) GPU(Graphics Processing Unit) GPU PC GPU PC ( 2 GPU ) GPU Harris Corner Detector[5] CPU ( ) ( ) CPU GPU 2 3 GPU 4 5 6 7 1 toyohiro@isc.kyutech.ac.jp
2 ( ) CPU ( ) ( ) () 2.1 ( 1) 1. 2. ( ) 3. 2.2 () CPU SIMD PC (CPU) SIMD (Intel :SSE4 AMD :SSE5 ) [6] CPU 4 32bit
Figure 1: (diagram labels: input, output)
SMP(Symmetric Multi Processing) CPU (Intel: Core2, AMD: Phenom) SMP SMP OpenMP[7] CPU ( ) SMP 16 (= ) PC MPI[8] (Ethernet )
3 GPU(Graphics Processing Unit)
GPU 3 PC. GPU NVIDIA (GeForce ) AMD(ATI) (Radeon ). GPU
PC 1 GPU

Table 1: GPU (Nvidia GeForce 8800GT)
Clock frequency: 1.8GHz
Multiprocessors: 14
Concurrent threads: 7,168
Peak performance: 336GFlops
Video memory (VRAM): 512MByte
Memory bandwidth: 57.6GByte/sec

PC PCI-express 2.0 PC 8GByte/sec GPU GPU GPGPU[9] 2 GPU (2006 ) 3 (OpenGL Direct3D) NVidia C GPU (CUDA : Compute Unified Device Architecture) [10] GPU GPU GPU ( 2)

Figure 2: GPU (diagram labels: CPU (main), bus bridge, PCIe (8GB/sec), VRAM, GPU, RAM; steps: 1. Copy input data from RAM to GPU, 2. Copy sub input data from VRAM to each GPU core, 3. Execution, copy sub result to VRAM, 4. Copy result to RAM)

1. GPU 2. GPU GPU (CUDA ) 2 [11][12]
3. GPU GPU 4. GPU GPU GPU PC 1 GPU 8.0GByte/sec PC CPU 10.6GByte/sec GPU CPU 14 7,000 GPU 300GFlops PC CPU 7 3 GPU CPU 4 GPU 3 1. 1 1 2 2 1 1 ( 3(a)) 2. 1. ( 3(b)) 2 3. Harris Corner Detector CUDA 1.1 4 Microsoft Windows ( 4.5 ) GPU CUDA NVidia GeForce 8800GT 4.1 1 1 4.1.1 $I_{in}(x, y, c)$ $I_{out}(x, y, c)$ 3 2010 2 GPU 1TFlops 4 2010 2 CUDA 2.3
Figure 3:
(a) From RAM → Input data: I = I(1) I(2) I(3) I(4) I(5) → Processing on GPU (proc × 5) → Result data: R = R(1) R(2) R(3) R(4) R(5) → To RAM
(b) From RAM → Input data: I = I(1) I(2) I(3) I(4) I(5) → Processing on GPU (proc × 5) → Sub result: S = S(1) S(2) S(3) S(4) S(5) → Merging (exclusive operation) → Result data: R = R(1) R(2) R(3) → To RAM
RGB YUV HSI HSV rgb GPU RGB Sobel Prewitt LoG Sobel Prewitt
4.1.2 I C
$\mathrm{Filter}(I, C) = I * C = \mathrm{IFFT}(\mathrm{FFT}(I) \cdot \mathrm{FFT}(C))$
CUDA FFT (CUDAFFT)
4.1.3 2 × 2 2 × 2 Harris Corner Detector 2 × 2
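As a concrete illustration of the Sobel filter named above, the following is a minimal CPU-side sketch in C. It is not the library's CUDA implementation. The function name `sobel_x`, the row-major `float` image layout, and the choice to leave border pixels at zero are all assumptions made for this example. The kernel is applied as a correlation (it is not flipped), which is the usual convention for image filters.

```c
#include <assert.h>
#include <stddef.h>

/* Horizontal Sobel gradient of a row-major grayscale image.
 * Border pixels are set to 0 (an assumption of this sketch). */
void sobel_x(const float *in, float *out, int width, int height)
{
    static const int k[3][3] = {
        { -1, 0, 1 },
        { -2, 0, 2 },
        { -1, 0, 1 },
    };

    assert(in != NULL && out != NULL && width >= 3 && height >= 3);

    for (int i = 0; i < width * height; i++)
        out[i] = 0.0f;

    /* Each output pixel reads only its own 3 x 3 input neighborhood. */
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += (float)k[dy + 1][dx + 1] *
                           in[(y + dy) * width + (x + dx)];
            out[y * width + x] = sum;
        }
    }
}
```

Because every output pixel is computed independently from a fixed input neighborhood, the two outer loops are the part that a GPU version would replace with one thread per pixel.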
4.2
4.2.1 ( 4)
Figure 4: (diagram: Grayscale input I = I(1) I(2) I(3) I(4) I(5) → proc × 5 → Voting (exclusive operation, AtomicAdd() operation) → H(1) H(2) H(3) → Result data: R = R(1) R(2) R(3))
CUDA ( ) AtomicAdd
4.2.2 H1 H2 HIN(H1,H2) D1 D2 CORR(D1,D2)
\[ \mathrm{HIN}(H_1, H_2) = \sum_{i=1}^{i_{max}} \min(h_1(i), h_2(i)) \]
\[ \mathrm{CORR}(D_1, D_2) = \frac{\sum_i (D_{1i} - \bar{D}_1)(D_{2i} - \bar{D}_2)}{\sqrt{\sum_i (D_{1i} - \bar{D}_1)^2} \, \sqrt{\sum_i (D_{2i} - \bar{D}_2)^2}} \]
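The voting step of Figure 4 and the histogram intersection HIN(H1,H2) can be sketched on the CPU as follows. This is a single-threaded illustration in C, not the library's CUDA code. The names `histogram_u8` and `histogram_intersection`, the 8-bit input type, and the bin count of 256 are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS 256

/* Vote each 8-bit pixel into its bin. On the GPU many threads may hit the
 * same bin at once, so this increment is the step that needs an atomic add;
 * a single CPU thread can use a plain increment. */
void histogram_u8(const unsigned char *pixels, size_t count,
                  unsigned int hist[NUM_BINS])
{
    assert(pixels != NULL && hist != NULL);

    for (int b = 0; b < NUM_BINS; b++)
        hist[b] = 0;
    for (size_t i = 0; i < count; i++)
        hist[pixels[i]]++;
}

/* Histogram intersection: sum over all bins of min(h1(i), h2(i)). */
unsigned long histogram_intersection(const unsigned int h1[NUM_BINS],
                                     const unsigned int h2[NUM_BINS])
{
    unsigned long sum = 0;

    assert(h1 != NULL && h2 != NULL);

    for (int i = 0; i < NUM_BINS; i++)
        sum += (h1[i] < h2[i]) ? h1[i] : h2[i];
    return sum;
}
```

Intersecting a histogram with itself returns the number of pixels that were voted, which is a quick sanity check for both functions.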
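4.3
4.3.1 Harris Corner Detector
Harris Corner Detector Harris Corner Detector ( 5)
Figure 5: Harris Corner Detector (panel labels: Input image; Red points: Corner)
1. C I
2. ($I_{xx} = (\partial I / \partial x)^2$) ($I_{yy} = (\partial I / \partial y)^2$) ($I_{xy} = (\partial I / \partial x)(\partial I / \partial y)$)
3. Gaussian ($A = G * I_{xx}$), ($B = G * I_{yy}$), ($C = G * I_{xy}$) ( )
4. $H_i = \begin{pmatrix} A_i & C_i \\ C_i & B_i \end{pmatrix}$ $\lambda_1, \lambda_2$
5. $M_i = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2$
1 1 GPU I
4.4 GPU 7168 5 640 × 480 7168 43 5 GPU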
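Steps 4 and 5 of the Harris Corner Detector can be evaluated per pixel without an explicit eigenvalue decomposition. For the symmetric 2 × 2 matrix with entries A, C, C, B, the product of the eigenvalues equals the determinant and their sum equals the trace, so $M = \lambda_1 \lambda_2 - \alpha(\lambda_1 + \lambda_2)^2 = (AB - C^2) - \alpha(A + B)^2$. The sketch below shows this in C. The function name `harris_response` is hypothetical, and whether the library takes this shortcut is not stated in the source.

```c
#include <assert.h>

/* Corner response for one pixel from the Gaussian-smoothed products
 * a = G*Ixx, b = G*Iyy, c = G*Ixy.
 * For the symmetric matrix [[a, c], [c, b]]:
 *   lambda1 * lambda2 = det   = a*b - c*c
 *   lambda1 + lambda2 = trace = a + b
 * so the response needs no eigenvalue computation. */
double harris_response(double a, double b, double c, double alpha)
{
    assert(alpha >= 0.0);

    double det = a * b - c * c;
    double trace = a + b;
    return det - alpha * trace * trace;
}
```

With an assumed α = 0.04 (a commonly used value; the source does not give one), a pixel with strong gradients in both directions, A = B = 1 and C = 0, gets M = 1 − 0.04 · 4 = 0.84, while an edge-like pixel with A = 1, B = 0, C = 0 gets M = −0.04.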
GPU GPU GPU CPU GPU1 ( 6) CPU
Figure 6: GPU (diagram labels: CPU (Multi core), OpenMP SMP, bus bridge, VRAM, GPU 1 : CPU core 1, GPU 2 : CPU core 2, RAM, VRAM)
2 GPU 14336 22
4.5 GPU CPU C GPU CUDA Microsoft Windows DLL(Dynamic Link Library) DLL C Matlab 6
5 GPU 4 (256 × 256, 512 × 512, 1024 × 1024, 2048 × 2048) RGB HSV Sobel 7 × 7 Gaussian 256
6 Matlab DLL (loadlibrary )
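The figures quoted in Section 4.4 (7168 and 43 for one GPU, 14336 and 22 for two GPUs, on a 640 × 480 image) are consistent with a ceiling division of the pixel count by the number of concurrently running threads. The sketch below reproduces that arithmetic in C. Reading 43 and 22 as the number of batches is our assumption, and `batches_needed` is a hypothetical helper, not part of the library.

```c
#include <assert.h>

/* Number of batches needed when `capacity` items can be processed at once:
 * ceil(items / capacity), computed in integer arithmetic. */
long batches_needed(long items, long capacity)
{
    assert(items >= 0 && capacity > 0);
    return (items + capacity - 1) / capacity;
}
```

For 640 × 480 = 307,200 pixels, `batches_needed(307200, 7168)` returns 43 and `batches_needed(307200, 14336)` returns 22, matching the two values in the text.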
2 × 2 4 (256 × 256, 512 × 512, 1024 × 1024, 2048 × 2048) 3 (Matlab (CPU ) GPU GPU2 (CUDA ))) GPU Matlab DLL GPU2 CPU GPU GPU ( 2 5 ) Matlab CPU Intel Core2 Quad Q6600 ( 40GFlops) 4GByte CPU 10.6GByte/sec GPU 8GByte/sec(PCI-Express) CPU 2 5

Table 2: (image size: 256 × 256), unit: msec
CPU GPU GPU × 2
44.31 9.620 6.941
42.49 16.44 13.92
1.988 6.922
7.092 17.19 (9.302)
0.012 0.978 (0.651)
8.070 3.482 (3.140)
158.0 17.47 11.21
551.3 61.48 53.10

Table 3: (image size: 512 × 512), unit: msec
CPU GPU GPU × 2
187.6 30.27 19.79
198.1 63.98 51.60
17.31 13.73
26.34 68.74 (45.70)
0.013 1.292 (0.716)
18.09 8.660 (8.547)
625.8 55.45 34.93
2295 163.2 130.9
Table 4: (image size: 1024 × 1024), unit: msec
CPU GPU GPU × 2
774.7 132.9 79.65
862.1 259.0 213.6
89.84 47.85
103.4 274.8 (168.8)
0.014 1.350 (0.7697)
60.83 30.29 (28.65)
2529 199.0 124.3
9062 547.0 447.5

Table 5: (image size: 2048 × 2048), unit: msec
CPU GPU GPU × 2
3033 398.3 336.4
3598 1027 851.9
399.0 224.0
429.5 1103 (568.6)
0.016 1.364 (0.7170)
232.1 117.0 (115.4)
10163 793.6 484.3
36405 2284 1853

6 256 × 256 GPU GPU CPU 2 1. GPU CPU GPU 2. GPU GPU CPU CPU 16 GPU1 GPU2 1.67 GPU ( ) 1.15 2 GPU
6.1 GPU GPU CPU GPU GPU GPU CPU GPU GPU CPU GPU GPU
7 GPU GPU DLL Harris Corner Detector 512 × 512 GPU GPU PC

[1] 35 5 pp.582-587 2006
[2] 6 H pp.17-20 2007
[3] 2007 10 pp.53-57 2007
[4] 10 pp.1283-1288 2007
[5] A combined corner and edge detector, C. Harris and M. Stephens, Proceedings of the 4th Alvey Vision Conference, pp.147-151, 1988.
[6] Intel Streaming SIMD Extensions 4 (SSE4) Instruction Set, Intel Corp., http://www.intel.com/technology/architecture-silicon/sse4-instructions/, 2007
[7] The OpenMP specification for parallel programming, OpenMP Architecture Review Board, http://www.openmp.org/
[8] Message Passing Interface Forum, MPI Forum, http://www.mpi-forum.org/
[9] General-Purpose Computation Using Graphics Hardware, http://www.gpgpu.org/
[10] NVIDIA CUDA Zone, NVIDIA Corp., http://www.nvidia.com/object/cuda_home.html, 2007
[11] GPU-based implementation of the KLT Tracker, http://cs.unc.edu/~ssinha/research/gpu_KLT/
[12] GPU-based implementation of Scale Invariant Feature Transform, http://cs.unc.edu/~ccwu/siftgpu/