GPU 18 2 3
Studies on the Speedup of Volume Rendering with Commodity GPUs

Yuki SHINOMOTO

Abstract

A slice-order texture-based algorithm for volume rendering with commodity GPUs loses spatial locality of reference and suffers from a low cache hit ratio at some viewpoints, because the access pattern of volume references depends on the position of the viewpoint. A cuboid-order ray-casting algorithm that maximizes spatial locality of reference has been proposed. The cuboid-order algorithm divides the volume into subvolumes called cuboids and controls the access pattern by rendering one cuboid at a time. Locality is maximized by detecting and sampling the points in a cuboid that has been fetched into cache memory before the cache lines composing the cuboid are replaced.

In this paper we propose a cuboid-order texture-based algorithm based on the cuboid-order ray-casting algorithm. GPUs cannot control the access pattern as easily as CPUs. Our algorithm controls the access pattern by dividing each slice into smaller slices and arranging them in cuboid order, so that each cuboid is rendered in turn. The number of slices increases in proportion to the number of cuboids.

We evaluate our algorithm on a Radeon X800 Pro. The results show that the performance of the cuboid-order algorithm is lower than that of the slice-order algorithm when the cuboids are smaller than the texture cache, and up to 3.2 times higher at some cuboid sizes. The performance of the cuboid-order algorithm suffers from the cost of processing the vertices of the slices and of switching the cuboid being processed. We therefore also propose an address transformation of the volume and blending in the fragment processor of the GPU. The former reduces the cost of changing cuboids and the latter reduces the number of vertices.
1 3 2 / [1] / 4K 3 [2] [3] 6.5TFLOPS 64GB GPU GPU GPU GPU GPU GPU 1
CPU CPU [4, 5] DRAM CPU GPU GPU GPU CPU GPU 2 GPU 3 4 5 2
2 GPU 2.1 Volume Rendering 3, 2,. 2.,., 3,,.,, CPU GPU PC B Ray A I(A, B) Volume 1: 3 3
1 1 1 A B I(A, B)

$$ I(A, B) = \int_{A}^{B} g(s)\, e^{-\int_{A}^{s} \tau(x)\, dx}\, ds \qquad (1) $$

s, x g τ g p g(p) p 1 g 2 g 1

$$ I(A, B) = \sum_{i=A}^{B} g(s_i)\, e^{-\sum_{j=A}^{i} \tau(x_j)} \qquad (2) $$

2.1.1.,,, ( 2).
(front to back), (back to front). back to front,, v_0, v_1, ..., v_n, RGB( ) c_k α_k v_k ( ), C

$$ C = \sum_{i=0}^{n} \alpha(v_i)\, c(v_i) \prod_{j=0}^{i-1} \bigl(1 - \alpha(v_j)\bigr) \qquad (3) $$

. C_k.

$$ C_{k-1} = \alpha(v_{k-1})\, c(v_{k-1}) + \bigl(1 - \alpha(v_{k-1})\bigr)\, C_k \qquad (4) $$

C = C_0. (2), RGB α. α RGB, α, RGB. 2: 2.2 GPU GPU GPU 3
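The relation between the closed-form sum in Eq. (3) and the back-to-front recurrence in Eq. (4) can be checked numerically. The following is a minimal sketch in C, assuming a single color channel and a starting value of C = 0 behind the last sample v_n; the type `Sample` and both function names are hypothetical and do not come from the thesis.

```c
#include <assert.h>
#include <stddef.h>

/* One sample on a ray: color c(v_i) (one channel here) and opacity alpha(v_i). */
typedef struct {
    double color;
    double alpha;
} Sample;

/* Eq. (3): closed-form sum. samples[0] is v_0, the sample nearest the viewer. */
double composite_closed_form(const Sample *samples, size_t count)
{
    assert(count == 0 || samples != NULL);
    double result = 0.0;
    double transparency = 1.0; /* product of (1 - alpha(v_j)) for j < i */
    for (size_t i = 0; i < count; ++i) {
        result += samples[i].alpha * samples[i].color * transparency;
        transparency *= 1.0 - samples[i].alpha;
    }
    return result;
}

/* Eq. (4): back-to-front recurrence, assuming C = 0 behind the last sample. */
double composite_back_to_front(const Sample *samples, size_t count)
{
    assert(count == 0 || samples != NULL);
    double c = 0.0;
    for (size_t k = count; k > 0; --k) {
        const Sample *s = &samples[k - 1];
        c = s->alpha * s->color + (1.0 - s->alpha) * c; /* C_{k-1} from C_k */
    }
    return c; /* C_0 */
}
```

Both functions return the same value for the same input, which is the statement C = C_0 in the text.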
GPU GPU GPU NVIDIA GeForce ATI Radeon Matrox Parhelia 2.2.1 GPU GPU (Vertex Processor) RGBα (Triangle Setup Engine) 3 (rasterizer) GPU (Fragment Processor) (texture unit) (Texture) 1 6
(Texture Cache) 2 GPU[6] 1 2 2 CPU CPU (Pixel Unit) (Raster Unit) α (Frame Buffer) RAMDAC (Random Access Memory Digital-to-Analog Converter) (Video Memory) GDDR GDDR DRAM GPU GPU 512MB 2.2.2 7
GPU 3 GPU Matrices,light positions,blend factors, and other uniform parameters GPU/ application Programabble Vertex Processor Primitive Assenmbly Rasterization & Interpretation Programabble Fragment Proceddor Frame-Buffer Tests & Blending Vertex Indeces Textures Frame Buffer data Flows:Primitive,Vertex,and Fragment Data Uniform Parameters-Change infrequently 3: CPU GPU / CPU GPU GPU CPU CPU CPU 3 RGBα α 2.2.3 GPU GPU API API (Graphics API) CPU GPU API GPU API Windows 8
DirectX[7] OS OpenGL[8] API GPU OpenGL

    glColor3f(1.0, 1.0, 1.0);
    glBegin(GL_QUADS);
    glVertex3f(-1.0, -1.0, 0.0);
    glVertex3f( 1.0, -1.0, 0.0);
    glVertex3f( 1.0,  1.0, 0.0);
    glVertex3f(-1.0,  1.0, 0.0);
    glEnd();

API GPU 2 CPU-GPU GPU 2.2.4 API Cg (C for Graphics) HLSL (High Level Shader Language), GLSL
(OpenGL Shading Language) GPU NVIDIA microsoft OpenGL ARB CPU API 2.2.5 GPU GPU GPU RGBα 1 1 RGBα RRRR BGG - GPU GPU 2 GPU RGBα 10
2.2.6 GPU GPU API DirectX DirectX 7 GPU CG,Hardware T&L (Hardware Transformation and Lighting) DirectX 8 GPU CG 2 (vertex program) (fragment program) DirectX 9 GPU DirectX 9c GPU GPU GPU GPU GPU GPU GPGPU (General-Purpose computation on GPUs) [9] 11
2.3 Texture Based Volume Rendering GPU [10, 11] 2.3.1 1. 2. 3. 4. 5., α CPU GPU API, α ( 4)., GPU [12]. 4: 12
GPU GPU 3 3 ( 5). CPU GPU API 3 5: 3 GPU 3 2 2D GPU 3 3 3 1 1 1 Z Z Z 13
1 1 GPU CPU CPU GPU GPU GPU 6 GPU ( 6) 14
6: [0,0, 1.0] (0.5, 0.5, 0.5) 7 100 100 [0, 1] GPU Z 15
Figure 7: vertex coordinates (-50.0, 50.0), (50.0, 50.0), (-50.0, -50.0), (50.0, -50.0) with texture coordinates (0.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 0.0)

Z RGBα GPU CPU 3 GPU 3 3 4 GPU GPU GPU GPU Dependent Texture ( ) [13] 2 3 RGBα 1
0 1 3 1 ( 8) 0 2 3... 244 255 244 R 0 200 50 20 G 0 64 B 0 185 A 0 244 8: 3 2 3 1 GPU 17
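The dependent texture read illustrated by Figure 8 can be sketched on the CPU as two chained lookups: the first fetch reads a scalar from the volume, and that scalar is then used as the index into a 256-entry RGBα table. This is only an illustrative sketch in C; the `Rgba` type, the function names, and the x-fastest memory layout are assumptions, not details taken from the thesis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } Rgba;

/* First fetch: scalar value of voxel (x, y, z) in an nx * ny * nz volume
 * stored with x varying fastest (layout assumed for this sketch). */
uint8_t fetch_voxel(const uint8_t *volume, int nx, int ny, int nz,
                    int x, int y, int z)
{
    assert(volume != NULL);
    assert(x >= 0 && x < nx && y >= 0 && y < ny && z >= 0 && z < nz);
    return volume[((size_t)z * (size_t)ny + (size_t)y) * (size_t)nx + (size_t)x];
}

/* Second (dependent) fetch: the scalar selects one entry of the color table. */
Rgba classify_voxel(const uint8_t *volume, int nx, int ny, int nz,
                    int x, int y, int z, const Rgba table[256])
{
    assert(table != NULL);
    uint8_t scalar = fetch_voxel(volume, nx, ny, nz, x, y, z);
    return table[scalar];
}
```

Because the table is separate from the volume, changing the coloring only requires replacing the 256 table entries, not the volume data.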
α 2.3.2 1 2 4 1/8 GPU CPU RC [4] GPU α CPU GPU 18
2 3 CG 2 GPU 2.4 CPU [14, 15] ( 9) 1 1 DRAM 6 1.15 [5] 19
1 2 5 3 6 9 4 7 10 13 8 11 14 12 15 16 9: CPU GPU [4] Itanium2 128 3 9FPS 4 Radeon X800 Pro 512 3 1.8FPS Radeon X800 Pro Itanium2 12.8 GPU CPU View 2.5 2 GPU GPU GPU GPU 4 1/8 CPU 20
3 GPU (Cuboid: ) GPU GPU CPU CPU 22
GPU GPU CPU 4 GPU GPU 3.1 GPU GPU CPU 3.1.1 GPU 2 2 GPU 2 2 2 3.1.2 CPU ID 23
CPU ID GPU ID 3.2 GPU CPU 2 CPU CPU API GPU x, y, z

    int x, y, z;

    /* cx, X_MAX and loop_y() are defined elsewhere. */
    void loop_x(void)
    {
        for (x = 0; x < cx; ++x)      /* x < cx, ascending */
            loop_y();
        for (x = X_MAX; x > cx; --x)  /* x > cx, descending */
            loop_y();
        x = cx;                       /* x == cx, last */
        loop_y();
    }

X_MAX x 1 10 2 cx x x 10 2 cy cy > 3 loop_x x loop_x x < cx, x > cx, x = cx 3 x < cx 0 cx - 1 0, 1, 2, ..., cx - 1
          x = 0   x = 1   x = 2   x = 3
  y = 0   1-1     2-1     4-1     3-1
  y = 1   1-2     2-2     4-2     3-2
  y = 2   1-3     2-3     4-3     3-3
  y = 3   1-4     2-4     4-4     3-4

  # :cuboid

Figure 10:

x > cx X_MAX cx + 1 x = cx 0, 1, 3, 2 loop_y y 0, 1, 2, 3 cx > X_MAX cx < 0 loop_y() 10 1-1, 1-2, ..., 4-4 3.2.1 x, y xy y yx x 10 1-1, 2-1, ..., 4-4 10 xy
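The traversal produced by loop_x() and loop_y() can be written out explicitly. The sketch below, in C, lists the visiting order along one axis (indices below the viewpoint index ascending, indices above it descending, the viewpoint index last) and then nests two axes with x as the outer loop. Clamping the viewpoint index to the valid range when it lies outside the volume is an assumption of this sketch, and the names `axis_order`, `cuboid_order_xy` and `MAX_AXIS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_AXIS 64 /* upper bound on cuboids per axis, chosen for this sketch */

/* Writes the visiting order of indices 0..max into order[] (max + 1 entries)
 * and returns the number written. c is the index of the cuboid holding the
 * viewpoint; it is clamped to [0, max] (assumption for outside viewpoints). */
int axis_order(int c, int max, int *order)
{
    assert(max >= 0 && order != NULL);
    if (c < 0)   c = 0;
    if (c > max) c = max;
    int n = 0;
    for (int i = 0; i < c; ++i)   order[n++] = i; /* i < c, ascending  */
    for (int i = max; i > c; --i) order[n++] = i; /* i > c, descending */
    order[n++] = c;                               /* i == c, last      */
    return n;
}

/* Enumerates cuboids with x as the outer loop and y as the inner loop.
 * out_x and out_y must each hold (x_max + 1) * (y_max + 1) entries. */
int cuboid_order_xy(int cx, int cy, int x_max, int y_max,
                    int *out_x, int *out_y)
{
    int xs[MAX_AXIS], ys[MAX_AXIS];
    assert(x_max < MAX_AXIS && y_max < MAX_AXIS);
    assert(out_x != NULL && out_y != NULL);
    int nx = axis_order(cx, x_max, xs);
    int ny = axis_order(cy, y_max, ys);
    int n = 0;
    for (int i = 0; i < nx; ++i) {
        for (int j = 0; j < ny; ++j) {
            out_x[n] = xs[i];
            out_y[n] = ys[j];
            ++n;
        }
    }
    return n;
}
```

With X_MAX = 3 and cx = 2, `axis_order` yields 0, 1, 3, 2, which matches the column labels 1, 2, 4, 3 in Figure 10.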
3.3 GPU 11: 3.3.1 ( 12) 26
GPU 6 8 GPU 1-1 1-1 1-1 1-2 12: 3.3.2 27
CPU ( 13) 13: 28
3.4 3.5 n 3 O(n 2 ) n 3 n 2 n 3 29
3.5.1 1 2 ( 14) GPU 1 2 [9] GPU 3/4 GPU 1/2 14: 30
3.6 3 GPU GPU n 3 n 2 31
4 2 2 GPU 1 3 2 4.1 CPU Intel Pentium 4 2.5 GHz GPU ATI Radeon X800 Pro Linux 2.6.8 C, OpenGL 1.5 Cg (C for Graphics) Radeon X800 Pro 1

Table 1: Radeon X800 Pro

  Core clock                   475 MHz
  Memory bandwidth             28.8 GB/sec.
  Pixel fill rate              5.7 GPixels/sec.
  Geometry rate                712.5 MTriangles
  Memory                       256 MB
  Memory Interface (bit)       256
  Memory Data Rate             900 MHz
  Pixels per Clock (peak)      12
4.2 1 GPU 1 1 512 512, 512 512 3 (128MB) 8 3 (512Byte) ALPHA 8 1Byte α 2 α RGBα X 0 360 15 Y 0 360 16 (FPS) 15 16 8 3 Radeon X800 Pro 4KB 512Byte 33
Figure 15: FPS versus rotation angle about the X axis (0 to 360); series 512^3, 256^3, 128^3, 64^3, 16^3, 8^3.

Figure 16: FPS versus rotation angle about the Y axis (0 to 360); series 512^3, 256^3, 128^3, 64^3, 16^3, 8^3.
CPU DRAM DRAM GPU DRAM 4.3 2 1 1 1 1 512 3 Byte 1 512 3 16 3 X 0 180 2 (FPS) 3 n O(n 2 ) 2 O(1/n 2 ) FPS FPS 45 0 2 FPS 1/ 2 35
Table 2: FPS by rotation angle and size

  Angle   512^3    256^3   128^3   64^3   32^3   16^3
  0       1316.7   386.3   131.5   41.8   10.7   2.0
  45       924.4   255.7    85.3   25.6    6.5   1.7
  90      1310.3   383.7   130.6   41.6   10.6   2.0
  135      922.2   255.5    85.3   25.6    6.5   1.7
  180     1304.3   383.6   130.6   41.6   10.6   2.0

FPS n O(n 3 ) 32 3 FPS 10.7 6.5 4.4 1 8 3 2 32 3 512 3 512 2 1 (512/ 1 ) 3 X 0 360 17 Y 0 360 18
8^3 GPU

Figure 17: FPS versus rotation angle about the X axis (0 to 360); series 512^3, 256^3, 128^3, 64^3, 32^3, 16^3.

Figure 18: FPS versus rotation angle about the Y axis (0 to 360); series 512^3, 256^3, 128^3, 64^3, 32^3, 16^3.
512 3 64 3 512 3 13.9FPS 512 3 1.8FPS 64 3 5.3FPS 32 3 10.5FPS 5.9FPS 16 3 1 1 8 3 GPU 64 3 = 262, 144 API GPU 32 3 2 32 3 17, 18 1.8FPS 32 3 5.7FPS 13.9FPS 10.5FPS 8 3 38
5 4 4 2 5.1 2 1 1 F P S Vertices 0 3 3: 1 2 3 4 3 8 3 16 3 32 3 2,022,451 2,373,427 3,231,744 4,109,107 4,207,411 3,145,728 4.2 MVertices 1 1/100 5.2 39
5.2.1 1 GPU 3 2 GPU 2 2 2 n 3 n 3/2 n 3/2 2 8 3 16 16 GPU 3 2 2 [16] 3D 1D 1D 2D 3D 2D 1 2 Cg

    float2 addrtranslation_1dto2d( float address1d, float2 texsize )
    {
        // Scale factors that map a 1D address to normalized 2D texture coordinates
        float2 CONV_CONST = float2( 1.0 / texsize.x, 1.0 / (texsize.x * texsize.y) );
        float2 normaddr2d = address1d * CONV_CONST;
        float2 address2d = float2( frac(normaddr2d.x), normaddr2d.y );
        return address2d;
    }

frac(normaddr2d.x) normaddr2d.x texsize 2 address1d 1 3 1 Cg

    float2 addrtranslation_3dto2d( float3 address3d, float3 sizetex3d, float2 sizetex2d )
    {
        // Strides of the x, y and z axes in the flattened 1D address space
        float3 SIZE_CONST = float3( 1.0, sizetex3d.x, sizetex3d.y * sizetex3d.x );
        float address1d = dot( address3d, SIZE_CONST );
        return addrtranslation_1dto2d( address1d, sizetex2d );
    }

dot(address3d, SIZE_CONST) address3d SIZE_CONST 3D 1D GPU 2 3 2 3 2D 2 1 3 2 3 ( 19) 3 GPU 2 4096 4096 2 [5]
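The two Cg functions above can be mirrored on the CPU to check the mapping by hand. The following C sketch reproduces the same arithmetic for non-negative addresses; the `Float2` type and the function names are hypothetical, and the 8 x 8 x 8 volume with a 32 x 16 texture in the usage note is an arbitrary example, not a configuration from the thesis.

```c
#include <assert.h>

typedef struct { float x, y; } Float2;

/* 1D address -> normalized 2D texture coordinate (mirrors the Cg listing).
 * frac() is computed by truncation, which is valid for non-negative input. */
Float2 addr_1d_to_2d(float address1d, float tex_w, float tex_h)
{
    assert(address1d >= 0.0f && tex_w > 0.0f && tex_h > 0.0f);
    float nx = address1d / tex_w;           /* row index plus column fraction */
    float ny = address1d / (tex_w * tex_h); /* normalized row position        */
    Float2 result = { nx - (float)(int)nx, ny };
    return result;
}

/* 3D voxel address -> normalized 2D texture coordinate.
 * The strides (1, size_x, size_x * size_y) play the role of SIZE_CONST. */
Float2 addr_3d_to_2d(float x, float y, float z,
                     float size_x, float size_y,
                     float tex_w, float tex_h)
{
    float address1d = x + y * size_x + z * size_x * size_y;
    return addr_1d_to_2d(address1d, tex_w, tex_h);
}
```

For voxel (1, 2, 3) in an 8 x 8 x 8 volume the 1D address is 1 + 2 * 8 + 3 * 64 = 209. In a 32 x 16 texture this gives texel column 209 mod 32 = 17 and row 6, which is what multiplying the returned coordinates by the texture size and truncating recovers.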
2 1 4 2 3 1 3 4 5 6 8 6 7 5 8 7 19: [0:1.0] GPU 5.2.2 2 α 2 α α 42
Cg lerp() α α α GPU GPU GPU GPU GPU [17] 8 3 8 24 8 GPU RC 43
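Blending with lerp() in the fragment processor amounts to applying the recurrence of Eq. (4) several times within one fragment, instead of issuing one slice polygon per sample. The C sketch below illustrates this under stated assumptions: `lerp_cg` follows the standard definition of Cg's lerp(a, b, w) = a + w * (b - a), samples are given in back-to-front order, and only one color channel is shown. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Same definition as Cg's lerp(a, b, w). */
static float lerp_cg(float a, float b, float w)
{
    return a + w * (b - a);
}

/* Blends `count` samples, ordered back to front, onto the color `dst`.
 * Each step computes alpha * color + (1 - alpha) * dst, i.e. Eq. (4). */
float blend_samples(float dst, const float *color, const float *alpha,
                    size_t count)
{
    assert(count == 0 || (color != NULL && alpha != NULL));
    for (size_t i = 0; i < count; ++i) {
        dst = lerp_cg(dst, color[i], alpha[i]);
    }
    return dst;
}
```

Handling several samples in one call corresponds to one fragment covering several slices, which is how the number of slice vertices sent to the GPU can be reduced.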
GPU GPU 5.3 GPU GPU GPU GPU CPU GPU GPU CPU GPU CPU GPU CPU [13] CPU GPU CPU GPU CPU GPU GPU GPU CPU API 5.2.1 GPU 44
6 GPU 3 1 α GPU GPU GPU GPU GPU 45
[1] Lichtenbelt, B., Crane, R. and Naqvi, S.: Introduction To Volume Rendering, Hewlett-Packard (1998).
[2], : X TV SHIN- MAVISION ELNOS, Medical Now, Vol. 44, pp. 14-15 (2000).
[3], : SHD - -, (SPWS-TMWG) 005 (1999).
[4],,,, :,, Vol. 44, No. SIG 11(ACS 3), pp. 137-146 (2003).
[5],,,, :,, Vol. 45, No. SIG 11(ACS 7), pp. 356-367 (2004).
[6] Montrym, J. and Moreton, H.: The GeForce 6800, IEEE Micro, Vol. 25, No. 2, pp. 41-51 (2005).
[7] DirectX: http://www.microsoft.com/windows/directx/.
[8] OpenGL: http://www.opengl.org/.
[9] GPGPU: http://www.gpgpu.org/.
[10] Muraki, S., Lum, E. B., Ma, K.-L., Ogata, M. and Liu, X.: A PC Cluster System for Simultaneous Interactive Volumetric Modeling and Visualization, IEEE Symposium on Parallel and Large-Data Visualization and Graphics, pp. 95-102 (2003).
[11],, : PC, CVIMl-130-10 (2001).
[12] Rezk-Salama, C., Engel, K., Bauer, M., Greiner, G. and Ertl, T.: Interactive Volume Rendering on Standard PC Graphics Hardware Using Multi-Textures and Multi-Stage Rasterization, Proceedings of Eurographics/SIGGRAPH Workshop on Graphics Hardware (2000).
[13],,,,, :, 2005 ARC 164 (SWoPP 2005), pp. 145-150 (2005).
[14] Wolfe, M.: More Iteration Space Tiling, Proceedings of Supercomputing (SC '89), pp. 655-664 (1989).
[15] Lam, M. S., Rothberg, E. E. and Wolf, M. E.: The Cache Performance and Optimizations of Blocked Algorithms, Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 63-74 (1991).
[16] Pharr, M. and Fernando, R. (eds.): GPU Gems 2: Programming Techniques For High-Performance Graphics And General-Purpose Computation, Addison-Wesley (2005).
[17] Stegmaier, S., Strengert, M., Klein, T. and Ertl, T.: A Simple and Flexible Volume Rendering Framework for Graphics-Hardware-based Raycasting, Proceedings of Eurographics/IEEE VGTC Workshop on Volume Graphics 2005, pp. 187-195 (2005).