Accelerating Out-of-core Cone Beam Reconstruction Using GPU

Yusuke Okitsu^1, Fumihiko Ino^1, Taketo Kishi^2, Syuhei Ohnishi^2 and Kenichi Hagihara^1

This paper presents a method for accelerating cone beam reconstruction of large-scale volumes using graphics processing units (GPUs). The proposed method reconstructs the volume subvolume by subvolume in a multi-GPU environment, because such a large-scale volume exceeds the capacity of video memory. Two parallelization schemes are possible in a multi-GPU environment: projection parallelism and voxel parallelism. Projection parallelism reconstructs the entire volume from a subset of the projections on each GPU, whereas voxel parallelism reconstructs a subvolume from all projections on each GPU. Experimental results demonstrate that our method can perform out-of-core reconstruction of a $1024^3$-voxel volume. In addition, reconstruction on an NVIDIA Tesla S1070-400 is 2.6 times faster than on a single GPU.

1.

[The body of this section survives only as fragments. Recoverable terms and citations, in order: computed tomography (CT) and X-ray projections; volumes of $1024^3$ voxels; CPU and graphics processing unit (GPU) execution; GPGPU 1) (general-purpose computation on GPUs); NVIDIA CUDA 2) (Compute Unified Device Architecture), a C-based programming environment for GPUs; earlier GPU-based reconstruction work 3)-5); the FDK (Feldkamp, Davis, Kress) method 6); the GPU implementation of FDK in 3) and VRAM; Scherl et al. 4); thread blocks (TB) 2); Noël et al. 5).]

^1 Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
^2 Non-Destructive Inspection (NDI) Business Unit, Analytical and Measuring Instruments Division, Shimadzu Corporation

© 2009 Information
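[Figure 1: Coordinate system for cone beam reconstruction. The uv-plane holds projection $P_n$ ($1 \le n \le K$) and the xyz-space holds volume $F$. $D$ and $d_n$ are distances measured from the X-ray source, and $\theta_n$ is the rotation angle of the X-ray source for the $n$-th projection.]

[The end of Section 1 also survives only as fragments; they mention the FDK method, VRAM capacity, volumes of $512^3$ and $1024^3$ voxels, and the use of multiple GPUs.] The rest of the paper is organized as follows. Section 2 reviews the FDK method. Section 3 describes reconstruction on a single GPU, and Section 4 extends it to multiple GPUs. Section 5 presents experimental results, and Section 6 concludes the paper.

2.

The Feldkamp, Davis, Kress (FDK) method 6) is a reconstruction method for cone beam CT. It takes $K$ projections $P_1, \ldots, P_K$ of $U \times V$ pixels each and reconstructs a volume $F$ of $N^3$ voxels. Figure 1 shows the coordinate system used by the FDK method, including the distance $d_n$.

The method first filters the projections $P_1, P_2, \ldots, P_K$ with the Shepp-Logan filter 7) to obtain filtered projections $Q_1, Q_2, \ldots, Q_K$. The value $Q_n(u, v)$ of pixel $(u, v)$ in $Q_n$, where $0 \le u \le U-1$ and $0 \le v \le V-1$, is computed from the pixel values of $P_n$ with filter size $R$ and weight $W_1(r, v)$, as given by Eqs. (1) and (2):

\[ Q_n(u, v) = \sum_{r=-R}^{R} \frac{2}{\pi^2 (1 - 4r^2)} \, W_1(r, v) \, P_n(r, v) \tag{1} \]

\[ W_1(r, v) = \frac{D}{\sqrt{D^2 + r^2 + v^2}} \tag{2} \]

The value $F(x, y, z)$ of voxel $(x, y, z)$ in $F$, where $0 \le x, y, z \le N-1$, is then computed from $Q_1, Q_2, \ldots, Q_K$ by the backprojection of Eq. (3):

\[ F(x, y, z) = \frac{1}{2\pi K} \sum_{n=1}^{K} W_2(x, y, n) \, Q_n\bigl(u(x, y, n), v(x, y, z, n)\bigr) \tag{3} \]

In Eq. (3), $W_2(x, y, n)$ is the weight applied to $Q_n(u, v)$, and $(u(x, y, n), v(x, y, z, n))$ is the point on $Q_n$ onto which voxel $(x, y, z)$ is projected. They are given by Eqs. (4)-(6):

\[ W_2(x, y, n) = \left( \frac{d_n}{d_n - x\cos\theta_n + y\sin\theta_n} \right)^{2} \tag{4} \]

\[ u(x, y, n) = \frac{D\,(x\sin\theta_n + y\cos\theta_n)}{d_n - x\cos\theta_n + y\sin\theta_n} \tag{5} \]

\[ v(x, y, z, n) = \frac{D z}{d_n - x\cos\theta_n + y\sin\theta_n} \tag{6} \]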
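The weights and projected coordinates of Eqs. (1), (2) and (4)-(6) can be evaluated directly. The following is a minimal Python sketch, not the authors' CUDA code. The function names are hypothetical, and the shared denominator d_n - x cos θ_n + y sin θ_n is the usual FDK form and is assumed here.

```python
import math

def shepp_logan_coefficient(r):
    """Filter coefficient 2 / (pi^2 (1 - 4 r^2)) appearing in Eq. (1)."""
    return 2.0 / (math.pi ** 2 * (1.0 - 4.0 * r * r))

def w1(r, v, D):
    """Filtering weight W_1(r, v) of Eq. (2)."""
    return D / math.sqrt(D * D + r * r + v * v)

def fdk_geometry(x, y, z, theta_n, d_n, D):
    """Return (W_2, u, v) of Eqs. (4)-(6) for voxel (x, y, z)."""
    # Denominator shared by Eqs. (4)-(6); its exact form is an assumption.
    denom = d_n - x * math.cos(theta_n) + y * math.sin(theta_n)
    w2 = (d_n / denom) ** 2                                              # Eq. (4)
    u = D * (x * math.sin(theta_n) + y * math.cos(theta_n)) / denom      # Eq. (5)
    v = D * z / denom                                                    # Eq. (6)
    return w2, u, v
```

For a voxel on the rotation axis (x = y = 0) the denominator reduces to d_n, so the weight is 1, u is 0, and v is simply z scaled by D / d_n.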
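3.

This section describes reconstruction on a single GPU. For $N = 1024$, the volume $F$ alone occupies 4 GB, which is the whole of a 4 GB VRAM, so $F$ cannot be reconstructed in VRAM in one piece.

By Eq. (3), each voxel value $F(x, y, z)$ is computed independently of the others from the filtered projections of Eq. (1). The method therefore divides $F$ into $B$ ($B > 1$) subvolumes $F_b$ ($1 \le b \le B$) of $N^3/B$ voxels each, so that one subvolume fits in VRAM, and reconstructs the subvolumes one after another.

As in 3), Eq. (6) is rewritten as Eqs. (7) and (8), so that the part that does not depend on $z$ is computed once and reused for the $N$ voxels along the z axis:

\[ v(x, y, z, n) = v'(x, y, n)\, z \tag{7} \]

\[ v'(x, y, n) = \frac{D}{d_n - x\cos\theta_n + y\sin\theta_n} \tag{8} \]

Because Eq. (8) depends only on $x$ and $y$, this reuse is preserved when each subvolume keeps the full z extent, which leaves the x and y axes as candidates for the division. Dividing along the x axis would shorten the runs of consecutive x addresses in VRAM by a factor of $B$ and interfere with coalesced memory access 2), so the volume is divided along the y axis.

Figure 2 shows the algorithm. Filtering is performed only while the first subvolume $F_1$ is processed: each filtered projection $Q_n$ is copied back to main memory (line 9) and reused for the remaining subvolumes.

  Input: Projections P_1, ..., P_K, filter size R, number of subvolumes B,
         and parameters D, d_1, ..., d_K, θ_1, ..., θ_K
  Output: Volume F
  Algorithm Reconstruction()
   1: Divide volume F into subvolumes F_1, F_2, ..., F_B;
   2: for b = 1 to B do
   3:   Initialize subvolume F_b;
   4:   for k = 1 to K do
   5:     if b == 1 then
   6:       Transfer projection P_k to global memory;
   7:       Q_k ← Filtering(P_k, R);
   8:       Transfer filtered projection Q_k to texture memory;
   9:       Transfer filtered projection Q_k to main memory;
  10:     else
  11:       Transfer filtered projection Q_k to texture memory;
  12:     end if
  13:     Bind filtered projection Q_k as a texture;
  14:     F_b ← Backprojection(Q_k, D, d_k, θ_k, k, b);
  15:   end for
  16:   Transfer subvolume F_b to main memory;
  17: end for

Figure 2: Out-of-core reconstruction algorithm for a single GPU using CUDA 2). Transfers take place between the CPU and the GPU.

4.

This section extends the method to $C$ GPUs. Two parallelization schemes are considered: (1) projection parallelism and (2) voxel parallelism. Figure 3 illustrates both schemes; Figure 3(a) shows the work assigned to each GPU under projection parallelism.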
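Before turning to the multi-GPU schemes in detail, the single-GPU loop of Figure 2 can be sketched on the CPU in plain Python. This is only an illustration under stated assumptions. `filter_projection` is a hypothetical stand-in rather than the Shepp-Logan filter, sampling is nearest-neighbour instead of texture interpolation, the volume is centred on the origin, the denominator d - x cos θ + y sin θ is assumed, and a single distance d is used for all projections. The dictionary `main_memory` plays the role of host memory holding the filtered projections.

```python
import math

def filter_projection(p):
    """Stand-in for the filtering stage; NOT the Shepp-Logan filter of Eq. (1)."""
    return [[2.0 * value for value in row] for row in p]

def backproject(q, theta, d, D, n, y_begin, y_end, slab, scale):
    """Add the contribution of filtered projection q to rows y_begin..y_end-1."""
    V, U = len(q), len(q[0])
    c, s = math.cos(theta), math.sin(theta)
    half = (n - 1) / 2.0                       # centre the volume on the origin
    for yi in range(y_begin, y_end):
        for xi in range(n):
            x, y = xi - half, yi - half
            denom = d - x * c + y * s          # assumed FDK denominator
            w2 = (d / denom) ** 2              # Eq. (4)
            ui = round(D * (x * s + y * c) / denom + (U - 1) / 2.0)  # Eq. (5)
            v_prime = D / denom                # Eq. (8): independent of z
            if not 0 <= ui < U:
                continue
            for zi in range(n):
                vi = round(v_prime * (zi - half) + (V - 1) / 2.0)    # Eq. (7)
                if 0 <= vi < V:
                    slab[yi - y_begin][xi][zi] += scale * w2 * q[vi][ui]

def reconstruct(projections, thetas, d, D, n, B):
    """Out-of-core loop of Figure 2 with the volume split into B slabs along y."""
    assert n % B == 0, "this sketch assumes B divides N"
    rows = n // B
    scale = 1.0 / (2.0 * math.pi * len(projections))  # 1/(2*pi*K) of Eq. (3)
    main_memory = {}       # filtered projections kept on the host (line 9)
    filter_calls = 0
    volume = []
    for b in range(B):
        # Line 3: allocate and clear one subvolume, indexed [y][x][z].
        slab = [[[0.0] * n for _ in range(n)] for _ in range(rows)]
        for k, p in enumerate(projections):
            if b == 0:                         # lines 5-9: filter only once
                main_memory[k] = filter_projection(p)
                filter_calls += 1
            q = main_memory[k]                 # line 11: reuse the stored Q_k
            backproject(q, thetas[k], d, D, n, b * rows, (b + 1) * rows, slab, scale)
        volume.extend(slab)                    # line 16: subvolume back to host
    return volume, filter_calls
```

Calling `reconstruct` with B = 1 and with B > 1 on the same input yields the same volume, which is the property the out-of-core scheme relies on, and `filter_calls` stays at K regardless of B.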
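Table 1: Complexity of the two parallelization schemes. The first four rows are per GPU. [Row labels other than "CPU" and "VRAM" are inferred from the discussion in the text.]

  Item                     Projection parallelism   Voxel parallelism
  Projection transfer      O(BKUV/C)                O(BKUV/C)
  Filtering                O(KUV/C)                 O(KUV)
  Backprojection           O(KN^3/C)                O(KN^3/C)
  Volume readback          O(N^3)                   O(N^3/C)
  Accumulation on CPU      O(N^3 C)                 O(C)
  VRAM usage               O(N^3/B)                 O(N^3/B)
  Main memory usage        O(N^3 C)                 O(N^3)

[Figure 3: Parallelization schemes for multiple GPUs. (a) Projection parallelism: each GPU processes K/C projections for all B subvolumes F_b. (b) Voxel parallelism: each GPU processes all K projections for B/C subvolumes F_b.]

Under projection parallelism, each GPU is responsible for $K/C$ projections and all $B$ subvolumes $F_b$. Because Eq. (3) sums over all projections, each GPU obtains only a partial sum of $F(x, y, z)$, and the partial subvolumes $F_b$ produced by the GPUs must be added together to obtain the final $F_b$.

Under voxel parallelism, shown in Figure 3(b), each GPU is responsible for all $K$ projections and $B/C$ subvolumes $F_b$. Each GPU evaluates the full sum of Eq. (3) for its own voxels, so the subvolumes $F_b$ it produces are final. If $B < C$, $B$ is increased to $C$ so that every GPU receives a subvolume. Figure 3(b) shows the case $B = C$.

Table 1 compares the two schemes in terms of execution time, VRAM usage, and main memory usage, including the data transfer between CPU and GPU.

For the transfer of projections, one GPU processes $B$ subvolumes with $K/C$ projections each under projection parallelism, and $B/C$ subvolumes with $K$ projections each under voxel parallelism. Transferring one projection costs $O(UV)$, so both schemes cost $O(BKUV/C)$ per GPU.

For the readback of the volume, one GPU returns $B$ subvolumes under projection parallelism and $B/C$ subvolumes under voxel parallelism. One subvolume $F_b$ has $O(N^3/B)$ voxels, so the cost is $O(N^3)$ and $O(N^3/C)$, respectively.

For filtering, one GPU filters $K/C$ projections under projection parallelism and $K$ projections under voxel parallelism, and each projection has $UV$ pixels.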
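The work split of Figure 3 and the per-GPU entries of Table 1 can be checked by simple counting. The sketch below is illustrative only. The function names are hypothetical, and the counts are element counts (pixels or voxels touched) rather than measured times.

```python
def assign_work(scheme, K, B, C):
    """Return (effective B, [(projections, subvolumes) for each of the C GPUs])."""
    if scheme == "projection":
        # Each GPU handles about K/C projections and all B subvolumes.
        return B, [((c + 1) * K // C - c * K // C, B) for c in range(C)]
    if scheme == "voxel":
        B = max(B, C)  # B is raised to C when B < C
        # Each GPU handles all K projections and about B/C subvolumes.
        return B, [(K, (c + 1) * B // C - c * B // C) for c in range(C)]
    raise ValueError("unknown scheme: " + scheme)

def per_gpu_counts(num_proj, num_sub, U, V, N, B):
    """Element counts behind the per-GPU rows of Table 1 for one GPU."""
    voxels_per_sub = N ** 3 // B
    return {
        # Every projection is sent once for every subvolume the GPU processes.
        "projection_transfer": num_sub * num_proj * U * V,
        # Each assigned projection is filtered once (first subvolume only).
        "filtering": num_proj * U * V,
        # Each assigned projection updates every voxel of each assigned subvolume.
        "backprojection": num_proj * num_sub * voxels_per_sub,
        # Each assigned subvolume is copied back to the host once.
        "volume_readback": num_sub * voxels_per_sub,
    }

if __name__ == "__main__":
    for scheme in ("projection", "voxel"):
        B_eff, plan = assign_work(scheme, K=1200, B=8, C=4)
        print(scheme, plan[0], per_gpu_counts(*plan[0], 1024, 1012, 1024, B_eff))
```

With K = 1200, B = 8 and C = 4, both schemes give the same transfer and backprojection counts per GPU, while filtering is C times larger under voxel parallelism and volume readback is C times larger under projection parallelism.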
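The filtering cost per GPU is therefore $O(KUV/C)$ under projection parallelism and $O(KUV)$ under voxel parallelism.

For backprojection, one GPU backprojects $K/C$ projections into $B$ subvolumes under projection parallelism, and $K$ projections into $B/C$ subvolumes under voxel parallelism. One backprojection updates $N^3/B$ voxels, so both schemes cost $O(KN^3/C)$ per GPU.

For accumulation on the CPU, projection parallelism has to add the $C$ partial copies of each subvolume $F_b$, each of $N^3/B$ voxels, for all $B$ subvolumes, which costs $O(N^3 C)$. Voxel parallelism needs no summation; the CPU only handles the $B$ finished subvolumes, which the text gives as $O(B)$. (Table 1 lists this entry as $O(C)$; the two coincide when $B = C$, as in Figure 3(b).)

For VRAM usage, each GPU holds one subvolume in VRAM at a time, so both schemes use $O(N^3/B)$. For main memory usage, the $C$ GPUs each produce $B$ subvolumes under projection parallelism, which requires $O(CN^3)$, whereas they each produce $B/C$ subvolumes under voxel parallelism, which requires $O(N^3)$.

5.

The experiments use $K = 1200$, $U = 1024$, $V = 1012$, and $N = 1024$, with $R = 512$ and $B = 8$. The test machine is a PC with a Xeon E5450 3.0 GHz CPU and 16 GB of main memory. The GPUs are the four GPUs of an NVIDIA Tesla S1070-400. The OS is CentOS 5.3, with driver version 185.18.08 and CUDA 2.2.

5.1

Table 2 shows the execution time for $C = 1, 2, 4$. [A fragment mentioning thread blocks (TB) with the values 64 and 8, and a remark on the two schemes at C = 4, could not be recovered.] Compared with $C = 1$, the method is 1.7 times faster with $C = 2$ and 2.6 times faster with $C = 4$. With $C = 2$, the two schemes take almost the same total time (53.15 versus 53.59).

Table 2: Execution time for C = 1, 2, and 4. [Row labels and the time unit did not survive extraction. The labels below are inferred from the steps of Figure 2 and from how each row scales with C; one row could not be identified. The blank cell is placed at C = 1 because only then does every column sum to its total.]

  Step                     C = 1    C = 2          C = 2      C = 4
                                    (projection)   (voxel)
  Volume initialization     0.52     0.52           0.26       0.13
  Projection transfer      14.88     9.73           9.78       6.32
  Filtering                 9.24     4.69           9.24       9.24
  Backprojection           63.76    31.94          32.01      16.60
  (label not recovered)     1.59     1.10           1.63       2.27
  Volume readback           1.27     1.41           0.67       0.35
  Accumulation on CPU          -     3.76           0.00       0.00
  Total                    91.26    53.15          53.59      34.91

The speedup does not grow in proportion to $C$. The surviving text attributes this to data transfer between CPU and GPU over PCI-Express, which is shared between GPUs. [Further fragments cite 8% for C = 2, and 10% for C = 1 against 25% for C = 4; the quantities they refer to could not be recovered. Figure 4 shows the breakdown of the execution time.]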
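The reported speedups can be recomputed from the totals in Table 2. The short Python check below copies the numbers from the table. The assignment of values to the four columns (C = 1, C = 2 with projection parallelism, C = 2 with voxel parallelism, C = 4) is an assumption, supported by the fact that each column then sums to its reported total. The parallel efficiency (speedup divided by C) is an extra figure not given in the text.

```python
# Stage times per configuration, in table order. The last entry (accumulation
# on the CPU) is blank for C = 1 in the source and is entered as 0.0 here.
stage_times = {
    "C=1":            [0.52, 14.88, 9.24, 63.76, 1.59, 1.27, 0.00],
    "C=2 projection": [0.52, 9.73, 4.69, 31.94, 1.10, 1.41, 3.76],
    "C=2 voxel":      [0.26, 9.78, 9.24, 32.01, 1.63, 0.67, 0.00],
    "C=4":            [0.13, 6.32, 9.24, 16.60, 2.27, 0.35, 0.00],
}
reported_total = {
    "C=1": 91.26,
    "C=2 projection": 53.15,
    "C=2 voxel": 53.59,
    "C=4": 34.91,
}
num_gpus = {"C=1": 1, "C=2 projection": 2, "C=2 voxel": 2, "C=4": 4}

def speedup(config):
    """Speedup over the single-GPU total."""
    return reported_total["C=1"] / reported_total[config]

def efficiency(config):
    """Parallel efficiency: speedup divided by the number of GPUs."""
    return speedup(config) / num_gpus[config]

if __name__ == "__main__":
    for name in stage_times:
        print(name, round(sum(stage_times[name]), 2),
              round(speedup(name), 2), round(efficiency(name), 2))
```

Rounded to one decimal place, the speedups are 1.7 for C = 2 and 2.6 for C = 4, matching the figures quoted in the text.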
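[Figure 4: Breakdown of execution time (%), on a vertical axis from 0% to 100%, for C = 1, C = 2 (both schemes), and C = 4.]

5.2

As Figure 4 shows, data transfer between CPU and GPU accounts for 20 to 25% of the execution time. [The rest of this discussion survives only as fragments: it concerns the CPU-GPU transfer, mentions cudaArray 2), and refers to Table 3 for the values at each C.]

Table 3: [Caption and column headers did not survive extraction; the values concern the CPU-GPU transfer discussed in this section. The two rows for C = 2 presumably follow the order of Table 2.]

  C = 1      2.81    15.84
  C = 2      2.25    18.40
  C = 2      2.79    23.13
  C = 4      2.98    33.26

6.

This paper presented a method for cone beam reconstruction of volumes that exceed the capacity of VRAM, together with two parallelization schemes for multiple GPUs. The experiments show that a single GPU can reconstruct a volume that does not fit in VRAM, and that four GPUs are 2.6 times faster than a single GPU.

Acknowledgments: [Only fragments survive: a grant identified as (A)(2024002) and a COE program.]

References

1) GPGPU: General-Purpose Computation Using Graphics Hardware (2007). http://www.gpgpu.org/.
2) NVIDIA Corporation: CUDA Programming Guide Version 1.1 (2007).
3) Okitsu, Y., Ino, F. and Hagihara, K.: Fast Cone Beam Reconstruction Using the CUDA-enabled GPU, Proc. 15th Int'l Conf. High Performance Computing (HiPC'08), pp. 108-119 (2008).
4) Scherl, H., Keck, B., Kowarschik, M. and Hornegger, J.: Fast GPU-Based CT Reconstruction using the Common Unified Device Architecture (CUDA), Proc. Nuclear Science Symp. and Medical Imaging Conf. (NSS/MIC'07), pp. 4464-4466 (2007).
5) Noël, P.B., Walczak, A.M., Hoffmann, K.R., Xu, J., Corso, J.J. and Schafer, S.: Clinical Evaluation of GPU-Based Cone Beam Computed Tomography, Proc. High-Performance Medical Image Computing and Computer Aided Intervention (HP-MICCAI'08) (2008).
6) Feldkamp, L.A., Davis, L.C. and Kress, J.W.: Practical cone-beam algorithm, J. Optical Society of America A, Vol. 1, No. 6, pp. 612-619 (1984).
7) Shepp, L.A. and Logan, B.F.: The Fourier Reconstruction of a Head Section, IEEE Trans. Nuclear Science, Vol. 21, No. 3, pp. 21-43 (1974).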