Vol. 3, No. 3, pp. 209-220 (Sep. 2010)

Performance Evaluation of One-dimensional FPGA-cluster CUBE for Stream Applications

Masato Yoshimi,*1 Yuri Nishikawa,*2 Hideharu Amano,*2 Mitsunori Miki,*1 Tomoyuki Hiroyasu*1 and Oskar Mencer*3

This paper reports the implementation and evaluation of CUBE, a multi-FPGA system that connects 512 FPGAs in the form of a simple one-dimensional array. Because the system is well suited as a platform for stream-oriented applications, we evaluated its performance by implementing an edit-distance computation algorithm, a typical stream-oriented algorithm. Performance is compared with the Cell/B.E., NVIDIA's GeForce GTX280 and a general multi-core microprocessor. The paper also discusses performance efficiency, logic consumption and power efficiency in comparison with these multi-core devices.

1. Introduction

PC FPGA 1)-3) 2008 CUBE FPGA 4) CUBE 512 Spartan 3 FPGA FPGA ASIC FPGA BlockRAM 5),6) FPGA CUBE Splash 2 7) PROGRAPE 1) FPGA GPU CUBE FPGA FPGA

*1 Doshisha University
*2 Keio University
*3 Imperial College London

209    (c) 2010 Information Processing Society of Japan
2. CUBE: 512 FPGA cluster

CUBE mounts 64 FPGAs per board in an 8 x 8 grid; connecting 8 boards forms a single one-dimensional array of 64 x 8 = 512 FPGAs 4). Each device is a Xilinx Spartan 3 FPGA, and neighboring FPGAs communicate through FIFO links.

Fig. 1  Architecture of CUBE.

(1) With the inter-FPGA FIFO links driven at 100 [MHz], the bandwidth between adjacent FPGAs is 6.4 [Gbps] = 100 [MHz] x 64 [bit].
(2) The single-board configuration with 64 FPGAs is denoted CUBE(1b).

According to 4), one CUBE board consumes 104 [W]. 1,460 Quad Core Xeon 2.5 [GHz] 359 CPU CUBE(1b) 690
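The one-dimensional, FIFO-linked organization above can be sketched in plain Python; every name here is illustrative (none of it comes from the CUBE toolchain). Each stage stands in for one FPGA that consumes words arriving from its upstream neighbor and streams results downstream.

```python
def fpga_stage(process, upstream):
    """Model one FPGA in the 1-D array: consume words from the
    upstream FIFO link, apply a local computation, emit downstream."""
    for word in upstream:
        yield process(word)

def cube_pipeline(stages, source):
    """Chain per-FPGA stages into a one-dimensional array; data
    streams through every FPGA exactly once, as in CUBE."""
    stream = iter(source)
    for process in stages:
        stream = fpga_stage(process, stream)
    return stream

# Toy example: a 4-stage array in which every stage increments each word.
result = list(cube_pipeline([lambda x: x + 1] * 4, range(8)))
```

The generator chain mirrors only the topology: real CUBE links move 64-bit words at 100 MHz between neighbors, while here each stage simply pulls from the previous one.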
Fig. 2  Overview of the algorithm.

3. Edit Distance Computation

3.1 Edit distance

The edit distance 8) between two strings is the minimum number of single-character insertions, deletions and substitutions required to transform one string into the other. For strings stra and strb of lengths lena and lenb, it is computed with a table c of (lena + 1) x (lenb + 1) cells, where each cell c(i, j) is derived from its three neighbors c(i-1, j-1), c(i, j-1) and c(i-1, j):
(1) the boundaries are initialized as c(i, 0) = i and c(0, j) = j;
(2) a = 0 if the i-th character of stra equals the j-th character of strb, and a = 1 otherwise;
(3) c(i, j) = min( c(i-1, j-1) + a, c(i, j-1) + 1, c(i-1, j) + 1 ).
Applying (2) and (3) to every cell yields the edit distance as c(lena, lenb).

Fig. 3  Overview of the parallel algorithm.

3.2 Blocked parallel computation

The table is divided into blocks B(i, j), each covering a segment of stra (index i) and a segment of strb (index j). A block receives the boundary values p(i, j-1) from its left neighbor, q(i-1, j) from its upper neighbor and a(i-1, j-1) from its upper-left neighbor, and produces p(i, j), q(i, j) and a(i, j) for the downstream blocks.

4. Related Work

CUBE FPGA Splash 2 7) PROGRAPE 1) FPGA BEE2 9) Splash 2 CUBE Splash 2 16
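The recurrence (1)-(3) translates directly into a small Python function. This is the textbook sequential formulation the paper starts from, not the blocked hardware version.

```python
def edit_distance(stra, strb):
    """Edit distance by dynamic programming (Levenshtein 8)).
    c[i][j] holds the distance between the first i characters of
    stra and the first j characters of strb."""
    lena, lenb = len(stra), len(strb)
    c = [[0] * (lenb + 1) for _ in range(lena + 1)]
    for i in range(lena + 1):          # boundary (1): c(i, 0) = i
        c[i][0] = i
    for j in range(lenb + 1):          # boundary (1): c(0, j) = j
        c[0][j] = j
    for i in range(1, lena + 1):
        for j in range(1, lenb + 1):
            a = 0 if stra[i - 1] == strb[j - 1] else 1   # rule (2)
            c[i][j] = min(c[i - 1][j - 1] + a,           # rule (3)
                          c[i][j - 1] + 1,
                          c[i - 1][j] + 1)
    return c[lena][lenb]
```

For example, edit_distance("kitten", "sitting") evaluates to 3 (two substitutions and one insertion).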
FPGA 36 FPGA; the original Splash system was reported to run some 330 times faster than a CRAY-2 7). Acceleration of dynamic programming (DP: Dynamic Programming) for string comparison has also been studied with bit-vector algorithms 12),13), on FPGAs 14), on the Cell/B.E. (315 15)) and on GPUs 10),11). With a machine word of width w, the bit-parallel method reduces the complexity from O(lena x lenb) to O(ceil(lena / w) x lenb). On the Cell/B.E., each of the SPEs provides 128-bit SIMD registers, so 128 cells can be treated per word. Compared with these, the PE arrays of FPGA systems such as Splash 2 and CUBE exploit the same parallelism directly in hardware.

5. Implementation on CUBE

5.1 Dataflow

Fig. 4  Computational operation and dataflow on CUBE.
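The parallelism exploited by all of the platforms above comes from the anti-diagonal independence of the DP table: every cell on one anti-diagonal depends only on cells of the two previous diagonals, so a whole diagonal can be computed at once (one PE per cell). A sequential Python sketch of this wavefront order (illustrative only) is:

```python
def edit_distance_wavefront(stra, strb):
    """Compute the edit-distance DP table diagonal by diagonal.
    All cells with the same i + j are mutually independent, so the
    inner loop could run fully in parallel."""
    la, lb = len(stra), len(strb)
    c = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        c[i][0] = i
    for j in range(lb + 1):
        c[0][j] = j
    for d in range(2, la + lb + 1):               # anti-diagonal index i + j
        for i in range(max(1, d - lb), min(la, d - 1) + 1):
            j = d - i
            a = 0 if stra[i - 1] == strb[j - 1] else 1
            c[i][j] = min(c[i - 1][j - 1] + a,
                          c[i][j - 1] + 1,
                          c[i - 1][j] + 1)
    return c[la][lb]
```

The wavefront order produces exactly the same table as the row-major order; only the schedule changes, which is what makes systolic FPGA arrays and SIMD devices effective here.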
5.2 Mapping blocks to FPGAs

Following the blocked algorithm of Section 3.2, one strb segment is assigned to each FPGA: in Fig. 4, PE0 compares segment strb 0 against stra 0 ... stra 3, PE1 handles the next strb segment, and so on along the array. The boundary values p(i, j) and a(i, j) of Section 3.2 are forwarded to the neighboring FPGA together with the stra segments, while values such as q(3,3), p(3,j) and q(i,3) leave the last block of each FPGA.

5.3 LD thread and LD unit

Fig. 5  Block computation modules for obtaining edit distance.

Each FPGA runs at 100 [MHz], and the full system holds 512 FPGAs. An FPGA carries one LD thread built from 16 LD units and processes 128 x 128 blocks (128 = 8 x 16). An LD unit updates an 8 x 8 sub-block, fetching the p and q boundary values as it goes: the 64 cell updates take 80 + 3 = 83 clocks, so its throughput is

  IPC_u = 64 / 83 = 0.77 [Operation/Clock].

Within an LD thread the 16 LD units start in a staggered fashion, the last one 15 block cycles after the first, so one 128 x 128 block completes in 16 + 15 = 31 block cycles of 83 clocks each:

  IPC_t = (128 [cells] x 128 [cells]) / (83 [Clock/Block] x 31 [BlockCycle])
        = 6.368 [Operation/Clock],                                      (1)

which corresponds to a per-LD-unit utilization of 0.398 = 6.368 / 16.
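The utilization figures above follow mechanically from the quoted cycle counts; a quick check using only numbers stated in this section:

```python
# One LD unit: an 8 x 8 sub-block (64 cell updates) takes 83 clocks.
ipc_u = 64 / 83                    # = 0.77 [Operation/Clock]

# One LD thread: 16 staggered LD units cover a 128 x 128 block in
# 16 + 15 = 31 block cycles of 83 clocks each (eq. (1)).
ipc_t = (128 * 128) / (83 * 31)    # = 6.368 [Operation/Clock]
utilization = ipc_t / 16           # = 0.398 per LD unit
```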
6. Evaluation

6.1 Logic synthesis

The modules were written in Verilog-HDL and synthesized with Xilinx ISE 10.1i for the Spartan 3 XC3S4000-5-FG676 used in CUBE. Table 1 lists the logic utilization; the p- and q- boundary buffers of an LD thread are held in Spartan 3 BlockRAMs, and one LD thread fits in a single CUBE FPGA.

Table 1  Logic utilization and maximum operating frequency.

               LD thread   LD unit   XC3S4000 (available)
  Slices          22,483     1,340     27,648
  FFs             17,985     1,063     55,296
  LUTs            42,369     2,496     55,296
  BRAMs               10         1         96
  Freq. [MHz]    125.594   156.966

6.2 Communication cost

One 128-row block pass costs 83 [Clock/BlockCycle] x 31 [BlockCycle] = 2,573 cycles of computation. Transferring the 644 [Byte] exchanged between neighboring FPGAs per pass (512 [Byte] = 128 x 4 [Byte] of p- values, 128 [Byte] of stra characters, plus 4 [Byte] more) adds 82 cycles, so one pass costs 2,573 + 82 = 2,655 cycles in total.

When n characters are processed with at most MaxBlock blocks in flight, the number of block cycles is (2 x MaxBlock - 1) x (ceil(n / MaxBlock))^2; for n = 1,024 and MaxBlock = 512 this is (2 x 512 - 1) x 4 = 4,092.

Per 2,655-cycle pass the host PC sends 256 [Byte] (= 128 x 2) of stra and strb characters and 1,024 [Byte] (= 128 x 2 x 4 [Byte]) of p- and q- values, so the required PC-CUBE I/O bandwidth is ((1,024 + 256) [Byte] x 8) / 2,655 [ClockCycle] x 100 [MHz] = 385.687 [Mbps]. Adding the 128 [Byte] of strb-side results returned by CUBE every 2,655 cycles raises the PC-CUBE bandwidth to 424.256 [Mbps].
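The host-interface bandwidth figures in Section 6.2 can be reproduced from the byte and cycle counts given there; attributing the gap between the two figures to the 128 result bytes per pass is our reading of the fragmentary text.

```python
clock_mhz = 100
cycles = 2655                  # clocks per 128-row block pass
bytes_to_cube = 1024 + 256     # p-/q- values plus stra/strb characters
bytes_from_cube = 128          # strb-side results returned per pass

mbps_in = bytes_to_cube * 8 / cycles * clock_mhz
mbps_total = (bytes_to_cube + bytes_from_cube) * 8 / cycles * clock_mhz
```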
Table 2  Execution environment of the pthread program.

  CPU       Intel Core 2 Quad Q9550 @ 2.83 GHz
  RAM       8.0 GByte
  OS        GNU/Linux 2.6.26-2-amd64
  Compiler  gcc-4.1.3 (-O3 -lpthread)

6.3 Evaluated systems

Four systems were compared:
(1) CUBE, in three configurations: (a) a single Spartan 3 FPGA, (b) one board (CUBE(1b)) and (c) eight boards (CUBE(8b)); performance is derived from 100 MHz RTL simulation.
(2) An Intel Core 2 Quad Q9550 (Table 2), running a pthread-parallelized C program.
(3) A Sony ZEGO 16) with a Cell/B.E.; ZEGO exposes 7 SPEs. The program is based on a Cell Challenge 2009 entry and uses SIMD code on the SPEs.
(4) An NVIDIA GeForce GTX280 GPU, running a GPU Challenge 2009 program.

6.4 Multithreading on the host PC

The pthread program was measured (1) varying the thread count and (2) varying the block size.

Fig. 6  Impact of multithreading and block size on computational time.

6 256 k 5 T2 1 T1 16 k 8 T8 16 4 Core 2 Quad 4k 4k 7 O(N 2 ) 4
Fig. 7  Impact of multithreading and sequence length on computational time.

Fig. 8  Comparison of computational time among C2Q, Cell/B.E., GeForce GTX280 and CUBE.

Fig. 9  Performance for computing edit distance.

Fig. 10  Power consumption ratio based on CUBE(8b).

6.5 Comparison among the four systems

Figures 8, 9 and 10 compare the four systems of Section 6.3:
(1) The single Spartan 3 FPGA executes one 2,655-cycle block pass at 100 [MHz] per 128-character block.
(2) In Fig. 9, for sequences of 64 k characters, CUBE(8b) outperforms the 4-thread Core2Quad run by a factor of 352.
(3) Fig. 10 shows energy consumption [J] normalized to CUBE(8b).
Table 3  Power consumption of each system.

  Vendor    Device                     Power [W]
  Intel     Core2Quad Q9550                   95
  Sony      ZEGO (BCU-100) Cell/B.E.         330
  NVIDIA    GeForce GTX280                   236
  Imperial  CUBE (8 boards)                  832

Table 4  Performance comparison between Spartan 3 and CUBE (speedup relative to a single Spartan 3 FPGA).

  Length    CUBE(1b)   CUBE(8b)   8b/1b
  4 k         15.760     11.052   0.701
  16 k        32.252     58.783   1.823
  64 k        32.252    256.250   7.945
  256 k       32.252    256.250   7.945
  1 M         32.252    256.250   7.945

The power figures in Table 3 are vendor specifications; for the NVIDIA GPU, measured values between about 50 [W] and 200 [W] have been reported 17).

7. Discussion

7.1 Comparison with other multi-core devices

As described in Section 3.1, a comparison of sequences of length N costs O(N^2) cell updates, which is reflected in Figs. 8 and 9. The Cell/B.E. reaches about 300 [Mops] here; its SPEs provide 128-bit SIMD operations, so 128 cells can be handled per word. The GTX280 carries 240 SPs, organized as multiprocessors of 8 SPs sharing 16 KB of on-chip memory.

7.2 Scalability of CUBE

Because the computation streams through the FPGA array, the execution time on CUBE grows as O(N) rather than O(N^2) once enough FPGAs are available. CUBE(1b) saturates at sequences of 8 k = 128 x 64 characters and CUBE(8b) at 64 k = 128 x 512 characters (Fig. 9, Table 4): relative to a single Spartan 3 FPGA, the 64-FPGA CUBE(1b) reaches a 32-fold speedup and the 512-FPGA CUBE(8b) a 256-fold speedup, so the saturated ratio 7.945 between CUBE(8b) and CUBE(1b) is close to the 8-fold board count.
7.3 IPC analysis

The achieved IPC of CUBE(8b) for 64 k x 64 k sequences follows from the cycle model of Section 6.2. With one LD thread per FPGA and 512 FPGAs (1,023 = 2 x 512 - 1 block diagonals),

  IPC = (cells in sequence A) x (cells in sequence B) / (total clocks)
      = (64 x 1,024) x (64 x 1,024) / (2,655 x 1,023)
      = 1,581.320,                                                      (2)

i.e., IPC_t = 1,581.320 / 512 = 3.089 per FPGA and IPC_u = 1,581.320 / (512 x 16) = 0.193 per LD unit.

For CUBE(1b), the 64 k computation is folded into 64 = 8 x 8 passes over the 64-FPGA array (127 = 2 x 64 - 1 block diagonals per pass):

  IPC = (64 x 1,024) x (64 x 1,024) / (2,655 x 127 x 64)
      = 199.027,                                                        (3)

i.e., IPC_t = 199.027 / 64 = 3.110 per FPGA and 0.194 per LD unit. Table 4 shows CUBE(8b) reaching 7.945 times the performance of CUBE(1b); the per-FPGA IPC values of (2) and (3) are nearly equal, so the gain comes almost entirely from the FPGA count.

A further improvement is to enlarge the block handled by one LD thread. Of the 16 x 31 block cycles an LD thread spends on a 128 x 128 block, the staggered start and drain leave the LD units idle for 2 x sum_{i=1}^{15} i = 240 slots. With a 256 x 256 block the same 16 LD units each perform 64 block steps, so the 15-cycle fill and drain are amortized over 64 + 15 = 79 block cycles:

  IPC_t = (256 [cells] x 256 [cells]) / (83 [Clock/Block] x 79 [BlockCycle])
        = 9.994 [Operation/Clock],                                      (4)

with IPC_u = 9.994 / 16 = 0.625 per LD unit, about 1.57 times the IPC_t of eq. (1) (= 9.994 / 6.368). The cost is BlockRAM: doubling the block edge from 128 to 256 increases the BlockRAM requirement of an LD thread by a factor of 2.3 = 72/31. As discussed in Section 6.2, the stra and strb traffic between FPGAs and between the PC and CUBE also grows accordingly.
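Equations (2)-(4) can be re-derived from the cycle model; the following check, using only constants quoted in the paper, reproduces the printed values.

```python
cells = (64 * 1024) * (64 * 1024)     # 64 k x 64 k cell updates

# Eq. (2): CUBE(8b), 512 FPGAs, 1,023 = 2 x 512 - 1 block diagonals.
ipc_8b = cells / (2655 * 1023)        # = 1,581.320
ipc_t_8b = ipc_8b / 512               # = 3.089 per FPGA

# Eq. (3): CUBE(1b), 64 FPGAs, 127 = 2 x 64 - 1 diagonals, 64 passes.
ipc_1b = cells / (2655 * 127 * 64)    # = 199.027
ipc_t_1b = ipc_1b / 64                # = 3.110 per FPGA

# Eq. (4): enlarged 256 x 256 LD thread, 64 + 15 = 79 block cycles.
ipc_wide = (256 * 256) / (83 * 79)    # = 9.994
```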
8. Conclusion

1 FPGA CUBE CUBE FPGA x86 GPU Cell/B.E. CUBE CUBE CUBE

References

1) Hamada, T., Fukushige, T., Kawai, A. and Makino, J.: PROGRAPE-1: A Programmable, Multi-Purpose Computer for Many-Body Simulations, Publications of the Astronomical Society of Japan, Vol.52, pp.943-954 (2000).
2) Burke, D., Wawrzynek, J., Asanovic, K., Krasnov, A., Schultz, A., Gibeling, G. and Droz, P.Y.: RAMP Blue: Implementation of a Multicore 1008 Processor FPGA System, Proc. 4th Annual Reconfigurable Systems Summer Institute (RSSI '08) (2008).
3) Osana, Y., Fukushima, T., Yoshimi, M. and Amano, H.: An FPGA-Based Acceleration Method for Metabolic Simulation, IEICE Trans. Inf. Syst., Vol.E87-D, No.8, pp.2029-2037 (2004).
4) Mencer, O., Tsoi, K.H., Craimer, S., Todman, T., Luk, W., Wong, M.Y. and Leong, P.H.W.: CUBE: A 512-FPGA Cluster, Proc. IEEE Southern Programmable Logic Conference (2009).
5) Yoshimi, M., Nishikawa, Y., Osana, Y., Funahashi, A., Hiroi, N., Shibata, Y., Yamada, H., Kitano, H. and Amano, H.: Practical Implementation of a Network-Based Stochastic Biochemical Simulation System on an FPGA, Proc. 18th International Conference on Field Programmable Logic and Applications (FPL '08), pp.663-666 (2008).
6) Morishita, H., Osana, Y., Fujita, N. and Amano, H.: Exploiting Memory Hierarchy for a Computational Fluid Dynamics Accelerator on FPGAs, Proc. Field Programmable Technology 2008 (FPT '08), pp.193-200 (2008).
7) Arnold, J.M., Buell, D.A. and Davis, E.G.: Splash 2, Proc. 4th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '92), New York, NY, USA, pp.316-322, ACM (1992).
8) Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady, Vol.10, No.8, pp.707-710 (1966).
9) Chang, C., Wawrzynek, J. and Brodersen, R.W.: BEE2: A High-End Reconfigurable Computing System, IEEE Design and Test of Computers, Vol.22, No.2, pp.114-125 (2005).
10) Cell Challenge 2009 (SACSIS2009, in Japanese). http://www.hpcc.jp/sacsis/2009/cell/
11) GPU Challenge 2009 (SACSIS2009, in Japanese). http://www.hpcc.jp/sacsis/2009/gpu/
12) Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM, Vol.46, No.3, pp.395-415 (1999).
13) Hyyrö, H.: A bit-vector algorithm for computing Levenshtein and Damerau edit distances, Nordic J. of Computing, Vol.10, No.1, pp.29-39 (2003).
14) Masuno, S., Maruyama, T., Yamaguchi, Y. and Konagaya, A.: Multiple Sequence Alignment Based on Dynamic Programming Using FPGA, IEICE Trans. Inf. Syst., Vol.E90-D, No.12, pp.1939-1946 (2007).
15) Cell Challenge 2009 regulations (in Japanese) (2009). http://www.hpcc.jp/sacsis/2009/cell/outputs/pdf/kitei 1.pdf
16) SONY: BCU-100 Computing Unit with Cell/B.E. and RSX. http://pro.sony.com/bbsccms/ext/zego/files/bcu-100 Whitepaper.pdf
17) GPU (in Japanese), IPSJ SIG Technical Report, Vol.2009-HPC-121, No.27, pp.1-5 (SWoPP2009) (2009).

(Received January 26, 2010)
(Accepted April 28, 2010)