23
Performance Comparison of FPGA Array with CUDA on Poisson Equation

Li Jiang (lijiang@sekine-lab.ei.tuat.ac.jp), Kazuki Sato (kazuki@sekine-lab.ei.tuat.ac.jp), Kenichi Takahashi (takahashi@sekine-lab.ei.tuat.ac.jp), Tamukoh Hakaru (tamukoh@cc.tuat.ac.jp), Yuuichi Kobayashi (yu-koba@cc.tuat.ac.jp), Sekine Masatoshi (sekinem@cc.tuat.ac.jp)
Tokyo University of Agriculture and Technology, 2-24-16 Naka-chou, Koganei-shi, Tokyo, 184-8588 Japan
Copyright © 2009 by JSFM

Abstract
In recent years, FPGAs and GPGPUs have increasingly been used for HPC. We propose an FPGA array built by stacking many small cards, each carrying a large-scale FPGA and three-dimensional I/O. The FPGA array is suited to scalable design and can be controlled easily from the host PC. For comparison, we also built a CUDA system on a GeForce 9800GT. In this paper, we solve the Poisson equation by the finite difference method in floating point on both the FPGA array and CUDA, and present the performance and power consumption. We also discuss how the results follow from the different hardware architectures, and the respective advantages of FPGA and GPGPU.

1. HPC HPC x86 POWER HPC GPU GPGPU (General Purpose computing on GPU) (1) GPGPU GPU HPC HPC (High Performance Computing) LSI FPGA (Field Programmable Gate Array) 1 FPGA FPGA FPGA FPGA FPGA HPC HPC FPGA RHPC (Reconfigurable High Performance Computing) FPGA HPC FPGA HPC PC CPU FPGA hw/sw (2)(3) FPGA FPGA GPU CUDA

2.
For the two-dimensional problem of Fig.1, ϕ and ρ satisfy the Poisson equation

$$\nabla^2 \phi = -\rho \qquad (1)$$

Discretizing (1) by finite differences with grid spacing h gives the iterative update

$$\phi^{\mathrm{new}}_{i,j} = \frac{\alpha}{4}, \qquad \alpha = h^2 \rho + \phi^{\mathrm{old}}_{i-1,j} + \phi^{\mathrm{old}}_{i+1,j} + \phi^{\mathrm{old}}_{i,j-1} + \phi^{\mathrm{old}}_{i,j+1} \qquad (2)$$

Both the FPGA and the CUDA implementations compute (2).

Fig. 1:
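As a rough illustration of update (2), where each new value is (h²ρ plus the four old neighbours) divided by 4, the following C++ sketch performs Jacobi sweeps on a small square grid. The `Grid` alias, the function names `jacobi_sweep` and `solve_poisson`, the fixed boundary values, and the use of a per-point `rho[i][j]` are all assumptions made for this example; it is not the FPGA or CUDA implementation evaluated in this paper.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical container for an n x n grid (illustration only).
using Grid = std::vector<std::vector<float>>;

// One Jacobi sweep of update (2):
//   phi_new[i][j] = (h^2 * rho[i][j] + four old neighbours) / 4.
// Boundary points are not written, i.e. they are assumed to hold fixed values.
void jacobi_sweep(const Grid& phi_old, const Grid& rho, float h, Grid& phi_new) {
    const std::size_t n = phi_old.size();
    assert(rho.size() == n && phi_new.size() == n);
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            const float alpha = h * h * rho[i][j]
                              + phi_old[i - 1][j] + phi_old[i + 1][j]
                              + phi_old[i][j - 1] + phi_old[i][j + 1];
            phi_new[i][j] = alpha / 4.0f;
        }
    }
}

// Repeats the sweep a fixed number of times, swapping the two buffers
// so that every sweep reads only values from the previous iteration.
Grid solve_poisson(Grid phi, const Grid& rho, float h, int iterations) {
    Grid next = phi;  // the copy keeps the boundary values in both buffers
    for (int it = 0; it < iterations; ++it) {
        jacobi_sweep(phi, rho, h, next);
        std::swap(phi, next);
    }
    return phi;
}
```

For example, on a 3 × 3 grid with ϕ = 0 on the boundary, ρ = 1 and h = 1, the single interior point becomes (1 + 0 + 0 + 0 + 0)/4 = 0.25 after one sweep. Keeping separate old and new buffers mirrors the ϕ^old / ϕ^new distinction in (2).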
Fig. 2: hwModule V2

3. Reconfigurable HPC

3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hwModule FPGA hwModule hw/sw FPGA hw/sw (2)(3)

3.2 hwModule V2 FPGA hwModule V2 Fig.2 FPGA PC hw/sw hwModule V2 PC hw/sw FPGA hwModule V2 PC-FPGA

3.3 hwModule VS FPGA hwModule VS FPGA Fig.3 hwModule VS hwModule VS Fig.4 PE Board Sub Board 2 Fig.4 hwModule VS 2 3

3.4 FPGA hwModule VS FPGA Fig.5 3 FPGA Fig.6 hwModule V2 FPGA Software (CPU) PC hw/sw FPGA

Fig. 3: hwModule VS
Fig. 4: PE Board Sub Board

3.5 Fig.7 FPGA Fig.7 4 1 (Processing Element: PE) 40 66.6[MHz] PE IEEE754 ϕ ρ FPGA BlockRAM (BRAM) Fig.8 PE BRAM Cache BRAM FPGA BRAM Cache PC hwModule SDRAM FPGA Xilinx XC3S4000 XC3S4000 BRAM 2 BRAM 1 32[bit] 512[word] 2 Cache ρ PE BRAM ϕ PE 2 BRAM PE 1 10 PE 1 ρ BRAM 3 ϕ BRAM 1 PE 10 ρ 12 ϕ 10 XC3S4000 PE1 11[ ]
Fig. 5: FPGA
Fig. 6: hw/sw FPGA
Fig. 7: FPGA ( )
Fig. 8:

PE10 85[ ] PE1 7[ ] FPGA PE 1 10 2 PE 1 20 × 80 PE 10 60 × 60 PE10 hwModule VS FPGA FPGA PE10 FPGA hwModule V2 PE1 hwModule VS 1 1 PE 4 FPGA Fig.9 PE10 60 × 60 6 1

4. GPGPU

4.1 GPU NVIDIA GeForce 9800GT GPU Fig.11 GPU G92 (6) HPC CUDA Global Memory 512[MB] GDDR3 GPU Multi Processor Multi Processor 8 Streaming Processor Streaming Processor Multi Processor Shared Memory Shared Memory 16[KB] Global Memory (3) 14 Multi Processor Multi Processor 8 Streaming Processor 112 Streaming Processor

4.2 CUDA grid, block, thread CUDA grid, block, thread CUDA Fig.12 CUDA GPU Streaming Processor block block thread 1 Multi Processor 1 block Multi Processor block 1 Multi Processor block block grid CPU CPU GPU (6)

4.3 CUDA CUDA FPGA (2) x y 1 1 block block 1 thread 1 thread 1 64 × 64 block 64 block
Fig. 9: PE10
Fig. 10: GPU (GeForce 9800GT)
Fig. 11: GPU

64 thread 64 × 64 = 4,096 CUDA GPU CUDA GPU CUDA ( ) Global Memory Global Memory Shared Memory

5.

5.1 FPGA FPGA PC Fig.13 Tab.1 FPGA FPGA PE 1 10 Fig.15 (CPU) PE10 hwModule V2 hwModule VS hwModule VS hwModule VS 4 FPGA 10 (2)

Fig. 12: grid, block, thread

5.2 CUDA CUDA CUDA Tab.2 GPU CPU Fig.14 GPU CUDA CPU C++ 10,000 Fig.14 1 10,000
Tab. 1: FPGA
  CPU                      Athlon64 X2 3800+ (2.0[GHz])
  Memory                   DDR2 SDRAM PC2-6400 2[GB]
  Motherboard              ASUS M2A-VM HDMI (AMD 690G)
  OS                       Windows XP Professional SP3
  Development environment  Borland C++ Builder 2006

Tab. 2: CUDA
  CPU                      Athlon64 X2 3800+ (2.0[GHz])
  Memory                   DDR2 SDRAM PC2-6400 2[GB]
  Motherboard              ASUS M2A-VM HDMI (AMD 690G)
  OS                       Windows XP Professional SP3
  Development environment  Microsoft Visual Studio 2008
  Graphics card            GALAXY GeForce 9800GT (GDDR3 512[MB])

Fig. 13:
Fig. 14: GPU CPU

Global Memory Fig.14 matrix size 16 × 16 GPU 0.136[GFlops] matrix size GPU 64 × 64 GPU CPU 128 × 128 GPU CPU 1.84 256 × 256 2.86 matrix size CUDA thread 16 × 16 thread 256 Multi Processor thread Fig.14 GPU CPU 128 × 128 GPU 1.4[GFlops] GPU 128 × 128 matrix size matrix size matrix size GPU 14 Multi Processor 128 × 128 matrix size Multi Processor 128 × 128 16,384 thread

5.3 FPGA GPU CPU FPGA GPU CPU FPGA GPU CPU Flops Fig.15 matrix size GPU 128 × 128 16 × 16 2 CPU FPGA (10PE) 60 × 60 matrix size FPGA (10PE) hwModule V2 FPGA (1PE) 4 hwModule VS FPGA 1 VS 1 PE FPGA 4 PE 4 PE 20 × 80 1 PE 20 × 20 GPU 128 × 128 thread G92 14 Multiprocessor Streaming Processor FPGA PE 1 CPU PE 10 CPU 3.4 GPU 2.2 GPU FPGA CUDA FPGA FPGA (2) 150 hwModule VS FPGA 1[TFlops] GeForce 9800GT Flops 462[GFlops] CUDA CUDA
Tab. 3: (unit: [GFlops])
  [ ]                     10^2     10^4     10^6
  FPGA (10PE)             0.105    3.051    3.213
  GPU (GeForce 9800GT)    1.370    1.442    1.299
  CPU (Athlon64 X2)       0.900    0.957    0.952

Tab. 4:
                          Performance [GFlops]   Power [W]   Performance/Power [MFlops/W]
  FPGA                    3.226                  7.92        407.3
  GPU (GeForce 9800GT)    1.442                  66.0        21.85
  CPU (Athlon64 X2)       0.954                  62.0        15.39

Fig. 15: FPGA GPU CPU

Global Memory Global Memory thread thread Shared Memory

5.4 FPGA PE 1 1[W] 131[MFlops/W] (2) PE 10 407[MFlops/W] CPU 65[W] FPGA GPU CPU (Tab.4).

5.5 GPU FPGA GPU CUDA thread 100 CPU FPGA FPGA PE 1 CPU 10 PE. FPGA FPGA FPGA FPGA PC HPC GPU CUDA FPGA CUDA GPU hw/sw FPGA

6. FPGA HPC CUDA GPGPU CPU GPU FPGA hwModule V2 10 PE hwModule VS 2 3 PE FPGA CUDA HPC FPGA FPGA HPC FPGA

(1) NVIDIA CUDA, http://www.nvidia.com/cuda/
(2) , , , FPGA TFlops , , vol. 108, no. 414, pp. 19-24, 2009.
(3) , , , FPGA Reconfigurable HPC , , vol. 107, no. 416, pp. 13-18, 2008.
(4) , , , , hw/sw , 21, pp. 207-212, 2008.
(5) K. Kudo, et al., Hardware Object Model and Its Application to the Image Processing, IEICE Trans. on Fund., vol. E87-A, no. 3, pp. 547-558, 2004.
(6) , GPU CFD , , vol. 50, no. 2, pp. 107-115, 2009.
(7) , , GPU CIP 2 , , vol. 13, no. 2, pp. 837-840, 2008.