31_17.dvi


Transcription:

Vol. 3 No. 3 209–220 (Sep. 2010)

Performance Evaluation of One-dimensional FPGA-cluster CUBE for Stream Applications

Masato Yoshimi,1 Yuri Nishikawa,2 Hideharu Amano,2 Mitsunori Miki,1 Tomoyuki Hiroyasu1 and Oskar Mencer3

1 Doshisha University
2 Keio University
3 Imperial College London

This paper reports the implementation and evaluation of CUBE, a multi-FPGA system that can connect 512 FPGAs in a simple one-dimensional array. Because the system is well suited as a platform for stream-oriented applications, we evaluated its performance by implementing edit distance computation, a typical stream-oriented algorithm. Its performance is compared with that of Cell/B.E., NVIDIA's GeForce GTX280 and a general-purpose multi-core microprocessor. The paper also discusses performance efficiency, logic consumption and power efficiency in comparison with the other multi-core devices.

1.

PC FPGA 1) 3) 2008 CUBE FPGA 1 4) CUBE 512 Spartan 3 FPGA FPGA ASIC FPGA BlockRAM 5),6) FPGA CUBE Splash 2 7) PROGRAPE 1) FPGA GPU CUBE FPGA FPGA

© 2010 Information Processing Society of Japan

The performance of CUBE is compared with that of an Intel Core 2 Quad Q9550, Cell/B.E. and a GPU.

2. CUBE: 512 FPGA cluster

Fig. 1 Architecture of CUBE.

Each CUBE board carries 8 × 8 = 64 FPGAs, and 8 boards are connected 4), so that 64 × 8 = 512 FPGAs form a single one-dimensional array. Each FPGA is a Spartan 3 and is linked to its neighbors through FIFOs (Fig. 1).

(1) With the FPGAs operating at 100 [MHz], the bandwidth between adjacent FPGAs is 6.4 [Gbps] = 100 [MHz] × 64 [bit].

(2) FPGA 1 CUBE 1 CUBE CUBE(1b): 64 FPGA 2 40 4) 104 [W] 1,460 [ ] Quad Core Xeon 2.5 [GHz] 359 CPU CUBE(1b) 690

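Fig. 2 Overview of the algorithm.

3.

3.1

The edit distance 8) between two strings strA and strB (for example, "write" and "weight" in Fig. 2) is obtained by filling a table c of size (lenA + 1) × (lenB + 1), where lenA and lenB are the lengths of the two strings. Each entry c(i, j) is computed from its three neighbors c(i−1, j−1), c(i, j−1) and c(i−1, j) in the following steps.

(1) Initialize $c(i, 0) = i$ and $c(0, j) = j$.
(2) Set $a = 0$ if $strA_i = strB_j$, and $a = 1$ otherwise.
(3) Set $c(i, j) = \min\{\, c(i-1, j-1) + a,\ c(i, j-1) + 1,\ c(i-1, j) + 1 \,\}$.

Steps (2) and (3) are repeated until c(lenA, lenB), which is the edit distance, is obtained.

Fig. 3 Overview of the parallel algorithm.

3.2

As shown in Fig. 3, the table is divided into blocks. Block B(i, j) takes 5 inputs, namely the substrings strA_i and strB_j and the boundary values p(i, j−1), q(i−1, j) and a(i−1, j−1), and produces 3 outputs, p(i, j), q(i, j) and a(i, j).

4.

CUBE FPGA Splash 2 7) PROGRAPE 1) FPGA BEE2 9) Splash 2 CUBE Splash 2 16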
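The recurrence of Section 3.1 and the block decomposition of Section 3.2 can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and the mapping of p to the right-hand column of a block, q to its bottom row and a to its corner value is an assumption made here for illustration, not something the text states.

```python
def edit_distance(str_a: str, str_b: str) -> int:
    """Fill the (lenA + 1) x (lenB + 1) table c with steps (1)-(3)."""
    len_a, len_b = len(str_a), len(str_b)
    c = [[0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(len_a + 1):          # step (1): c(i, 0) = i
        c[i][0] = i
    for j in range(len_b + 1):          # step (1): c(0, j) = j
        c[0][j] = j
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            a = 0 if str_a[i - 1] == str_b[j - 1] else 1      # step (2)
            c[i][j] = min(c[i - 1][j - 1] + a,                # step (3)
                          c[i][j - 1] + 1,
                          c[i - 1][j] + 1)
    return c[len_a][len_b]


def compute_block(sub_a, sub_b, p_in, q_in, a_in):
    """One block B(i, j). ASSUMED roles: p_in is the column left of the block,
    q_in the row above it, a_in the corner entry diagonal to its first cell."""
    prev = [a_in] + list(q_in)          # row above the block, corner included
    p_out = []
    for r, ch_a in enumerate(sub_a):
        cur = [p_in[r]]                 # left boundary value for this row
        for s, ch_b in enumerate(sub_b):
            a = 0 if ch_a == ch_b else 1
            cur.append(min(prev[s] + a, cur[s] + 1, prev[s + 1] + 1))
        p_out.append(cur[-1])           # right-hand column, goes to B(i, j+1)
        prev = cur
    return p_out, prev[1:], prev[-1]    # p(i, j), q(i, j), corner of the block


def blocked_edit_distance(str_a: str, str_b: str, block: int = 8) -> int:
    """Same result as edit_distance, computed block by block."""
    blocks_a = [str_a[k:k + block] for k in range(0, len(str_a), block)]
    blocks_b = [str_b[k:k + block] for k in range(0, len(str_b), block)]
    if not blocks_a:
        return len(str_b)
    if not blocks_b:
        return len(str_a)
    # q[j] holds the row just above the next block of block-column j.
    q = [[j * block + t + 1 for t in range(len(sb))]
         for j, sb in enumerate(blocks_b)]
    result = 0
    for i, sub_a in enumerate(blocks_a):
        row0 = i * block
        p = [row0 + t + 1 for t in range(len(sub_a))]   # c(i, 0) = i
        corner = row0                                   # c(row0, 0)
        for j, sub_b in enumerate(blocks_b):
            next_corner = q[j][-1]      # corner entry for block (i, j + 1)
            p, q[j], result = compute_block(sub_a, sub_b, p, q[j], corner)
            corner = next_corner
    return result
```

Because each block needs only the boundary values of its left, upper and diagonal neighbors, blocks on the same anti-diagonal are independent of one another, which is what makes the computation suitable for a one-dimensional array of processing elements.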

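FPGA 36 FPGA Splash CRAY-2 330 7)

Implementations of edit distance computation on Cell/B.E. and GPU were developed in the Cell Challenge 2009 and GPU Challenge 2009 contests 10),11). The computation is based on dynamic programming (DP), for which bit-vector algorithms 12),13) and an FPGA implementation 14) have been reported.

Cell/B.E. 315 15) GPU 100 1 3

With a word length of $w$, the bit-vector algorithms reduce the computational complexity from $O(lenA \cdot lenB)$ to $O(\lceil lenA / w \rceil \cdot lenB)$.

Cell/B.E. SPE 128 SPE 128 FPGA PE FPGA Splash 2 CUBE FPGA

5.

5.1

CUBE CUBE FPGA 4 1 FPGA CUBE FPGA FPGA FPGA FPGA FPGA 2 4 CUBE

Fig. 4 Computational operation and dataflow on CUBE.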
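For reference, the bit-vector approach cited as 12),13) in Section 4 can be sketched as below. The sketch follows the commonly described form of the bit-parallel Levenshtein algorithm and is not taken from this paper; the function name is hypothetical, and Python's arbitrary-precision integers stand in for the $\lceil lenA / w \rceil$ machine words that a real implementation would use.

```python
def bitvector_edit_distance(str_a: str, str_b: str) -> int:
    """Bit-parallel edit distance: one column of the DP table per character
    of str_b, encoded as vertical +1 / -1 difference bit vectors."""
    m = len(str_a)
    if m == 0:
        return len(str_b)
    mask = (1 << m) - 1
    high = 1 << (m - 1)                 # bit of the last row of the table
    # peq[ch] has bit i set when str_a[i] == ch.
    peq = {}
    for i, ch in enumerate(str_a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv = mask, 0                    # first column: every vertical delta is +1
    score = m                           # c(lenA, 0)
    for ch in str_b:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = ((ph << 1) | 1) & mask     # row 0 always grows by 1 per column
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

Each loop iteration updates a whole table column with a fixed number of word operations, which is where the factor-of-$w$ reduction in the complexity comes from.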

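5.2

Each FPGA serves as one PE that computes the blocks described in Section 3.2. In the example of Fig. 4, in which CUBE consists of 4 FPGAs, PE0 holds strB_0 and processes strA_0 through strA_3 in order, passing its results to PE1. In general, FPGA i holds strB_i, while strA_i and the boundary values p and a stream from FPGA to FPGA.

FPGA FPGA FPGA FPGA q (3,3) p (3,j) q (i,3)

5.3

Fig. 5 Block computation modules for obtaining edit distance.

Each FPGA implements an LD thread (Fig. 5). An LD thread consists of 16 LD units that handle 8 characters each, so that one FPGA processes a block of 128 = 8 × 16 characters. With the FPGAs operating at 100 MHz, the 512 FPGAs of CUBE handle sequences of 64 k characters at a time.

CUBE 100 [MHz] (1) (2) (3) (4) 4 p q

An LD unit takes 83 clock cycles (80 + 3) to compute the 64 entries of an 8 × 8 sub-block, so the IPC of an LD unit is $IPC_u = 64/83 = 0.77$ [Operation/Clock]. In the following, $IPC_u$ denotes the IPC of one LD unit and $IPC_t$ the IPC of an LD thread, which consists of 16 LD units.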
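The dataflow of Section 5.2 amounts to a wavefront over the block table. The following sketch lists which strA block each PE works on in each block cycle, under the assumption (made here for illustration) that PE k holds strB_k and receives strA_0, strA_1, ... one block cycle after PE k−1 did. The function name is hypothetical.

```python
def wavefront_schedule(num_pes: int, num_a_blocks: int):
    """Return one list per block cycle with the (pe, a_block) pairs active in it.

    ASSUMPTION: the block for strA_i on PE k is computed in block cycle i + k,
    because PE k needs the boundary values that PE k-1 produced one cycle earlier.
    """
    total_cycles = num_pes + num_a_blocks - 1
    schedule = []
    for t in range(total_cycles):
        active = [(pe, t - pe) for pe in range(num_pes)
                  if 0 <= t - pe < num_a_blocks]
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    # The 4-FPGA example: 4 x 4 blocks finish in 2 * 4 - 1 = 7 block cycles.
    for t, active in enumerate(wavefront_schedule(4, 4)):
        print(t, active)
```

With n PEs and n strA blocks the schedule has 2n − 1 block cycles, and all PEs are busy only in the middle cycle, which is one reason the measured IPC per FPGA stays below the peak of a single LD thread.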

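LD unit stra strb 8 p q p q LD thread 83 3 LD thread 1 2 15 LD thread LD unit LD unit LD thread 15 LD unit 16 LD unit 1

An LD thread processes a block of 128 × 128 entries in 31 block cycles of 83 clock cycles each. $IPC_t$ is therefore 6.368, as given by Eq. (1).

\[
IPC_t = \frac{128 \times 128}{83\ [\mathrm{Clock/Block}] \times 31\ [\mathrm{BlockCycle}]} = 6.368\ [\mathrm{Operation/Clock}] \qquad (1)
\]

The corresponding $IPC_u$ is 0.398 = 6.368/16.

6.

6.1

The design was described in Verilog-HDL and synthesized with Xilinx ISE10.1i for the Spartan 3 XC3S4000-5-FG676 used in CUBE. The p- and q-values of an LD thread are stored in the BlockRAM of the Spartan 3. Table 1 shows the logic utilization of an LD thread on CUBE.

Table 1 Logic utilization and maximum operating frequency.

               LD thread    LD unit   XC3S4000
Slices            22,483      1,340     27,648
FFs               17,985      1,063     55,296
LUTs              42,369      2,496     55,296
BRAMs                 10          1         96
Freq. [MHz]      125.594    156.966

6.2

The FPGAs of CUBE operate at 100 [MHz]. Each FPGA computes a block of 128 characters in 2,573 = 83 [Clock/BlockCycle] × 31 [BlockCycle] clock cycles. The 644 [Byte] passed between adjacent FPGAs, which include the p-values (512 [Byte] = 128 × 4 [Byte]) and strA (128 [Byte] = 128 characters), take 82 clock cycles, so that one FPGA needs 2,573 + 82 = 2,655 clock cycles per block of 128 characters.

When the sequences are divided into n blocks and n does not exceed the 512 FPGAs of CUBE, the computation takes 2n − 1 block cycles. For longer sequences the table is processed in several passes; for 128 k characters (n = 1,024), the number of block cycles is

\[
(2\,\mathrm{MaxBlock} - 1) \times \lceil n / \mathrm{MaxBlock} \rceil^2 = (2 \times 512 - 1) \times 4 = 4{,}092 .
\]

Every 2,655 clock cycles, the PC sends CUBE 256 [Byte] = 128 × 2 for strA and strB and 1,024 [Byte] = 128 × 2 × 4 [Byte] for the p- and q-values. The I/O bandwidth required between the PC and CUBE is therefore 385.687 [Mbps] = ((1,024 + 256) [Byte] × 8) / 2,655 [ClockCycle] × 100 [MHz]. Including a further 128 [Byte] of strB every 2,655 clock cycles, the PC-CUBE bandwidth amounts to 424.256 [Mbps].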
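The per-FPGA timing and I/O figures of Sections 5.3 and 6.2 can be re-derived with a few lines of Python. The constants are the values quoted in the text; the variable and function names are introduced here for illustration only.

```python
CLOCK_MHZ = 100              # FPGA operating frequency [MHz]
CYCLES_PER_SUBBLOCK = 83     # clock cycles per 8 x 8 sub-block
BLOCK_CYCLES = 31            # block cycles per 128 x 128 block
TRANSFER_CYCLES = 82         # clock cycles for the 644-byte transfer
BLOCK_CHARS = 128            # characters handled by one LD thread
LD_UNITS = 16                # LD units per LD thread

compute_cycles = CYCLES_PER_SUBBLOCK * BLOCK_CYCLES      # 2,573
cycles_per_block = compute_cycles + TRANSFER_CYCLES      # 2,655

ipc_t = BLOCK_CHARS * BLOCK_CHARS / compute_cycles       # Eq. (1)
ipc_u = ipc_t / LD_UNITS


def io_mbps(bytes_per_block: int) -> float:
    """Bandwidth needed to move bytes_per_block once every cycles_per_block clocks."""
    return bytes_per_block * 8 / cycles_per_block * CLOCK_MHZ


input_mbps = io_mbps(1024 + 256)          # p-, q-values plus strA and strB
total_mbps = io_mbps(1024 + 256 + 128)    # plus a further 128 bytes of strB

if __name__ == "__main__":
    print(f"cycles per block : {cycles_per_block}")
    print(f"IPC_t            : {ipc_t:.3f}")
    print(f"IPC_u            : {ipc_u:.3f}")
    print(f"PC-CUBE input    : {input_mbps:.3f} Mbps")
    print(f"PC-CUBE total    : {total_mbps:.3f} Mbps")
```

Both bandwidth figures are far below the 6.4 [Gbps] available between adjacent FPGAs, so under these numbers the array is limited by computation rather than by data transfer.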

Table 2 Execution environment of pthread program.

CPU        Intel Core 2 Quad Q9550 @ 2.83 GHz
RAM        8.0 GByte
OS         GNU/Linux 2.6.26-2-amd64
Compiler   gcc-4.1.3 (-O3 -lpthread)

6.3

The following 4 platforms are evaluated.

(1) CUBE, in 3 configurations: (a) a single Spartan 3, (b) 1 CUBE board, CUBE(1b), and (c) 8 CUBE boards, CUBE(8b). 100 MHz RTL
(2) Intel Core 2 Quad Q9550 3 2
(3) Sony ZEGO 16) Cell/B.E. ZEGO 7 SPE Cell/B.E. Cell Challenge 2009 C 3 pthread 7 SPE SPE SIMD
(4) NVIDIA GeForce GTX280 GPU GPU Challenge 2009

6.4

PC PC 3 pthread C 2 (1) 1 (2) 2 1 6

Fig. 6 Impact of multithreading and block size on computational time.

6 256 k 5 T2 1 T1 16 k 8 T8 16 4 Core 2 Quad 4k 4k 7 $O(N^2)$ 4

Fig. 7 Impact of multithreading and sequence length on computational time.

Fig. 8 Comparison of computational time among C2Q, Cell/B.E., GeForce GTX280 and CUBE.

Fig. 9 Performance for computing edit distance.

Fig. 10 Power consumption ratio based on CUBE(8b).

6.5

The 4 platforms are compared in 3 respects, shown in Figs. 8, 9 and 10.

(1) 1 8 8 CUBE Spartan 3 1 FPGA 2,655 100 [MHz]
(2) 9 1 8 64 k CUBE(8b) Core2Quad 4 352
(3) [J] 10 CUBE(8b) [J]

Table 3 Power consumption of each system.

Vendor     Device                        Power [W]
Intel      Core2Quad Q9550                      95
Sony       ZEGO (BCU-100) Cell/B.E.            330
NVIDIA     GeForce GTX280                      236
Imperial   CUBE (8 boards)                     832

Table 4 Performance comparison between Spartan 3 and CUBE.

Length   CUBE(1b)   CUBE(8b)   8b/1b
4 k        15.760     11.052   0.701
16 k       32.252     58.783   1.823
64 k       32.252    256.250   7.945
256 k      32.252    256.250   7.945
1 M        32.252    256.250   7.945

3 8 10 NVIDIA GPU 50 [W] 200 [W] 17) Spartan 3 FPGA Intel Cell/B.E.

7.

7.1

As described in Section 3.1, the amount of computation for sequences of length N is $O(N^2)$ (Figs. 8 and 9).

Cell/B.E. Core2Quad Cell/B.E. 300 [Mops] Cell/B.E. SIMD 1 128 SPE

The GeForce GTX280 has 240 SPs, and each group of 8 SPs shares 16 KB of memory.

7.2

CUBE 6 CUBE

On CUBE, the blocks are computed in parallel by the FPGAs, so the computational time grows as $O(N)$ rather than $O(N^2)$ as long as the sequences fit into the FPGA array at 128 characters per FPGA, that is, up to 8 k = 128 × 64 characters on CUBE(1b) and up to 64 k = 128 × 512 characters on CUBE(8b). Table 4 compares the performance of CUBE(1b) and CUBE(8b) with that of a single Spartan 3.

9 Spartan 3 Core2Quad CUBE FPGA 4 64 FPGA 32 512 FPGA 256 CUBE(1b) CUBE(8b) FPGA 8

7.3

CUBE CUBE

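On CUBE(8b), each FPGA runs an LD thread on 128 characters, so a comparison of 64 k-character sequences occupies every FPGA exactly once. The IPC of the whole system is 1,581.320, as given by Eq. (2).

\[
IPC = \frac{[\text{length of } A] \times [\text{length of } B]}{[\text{clock cycles per LD thread block}] \times [\text{block cycles}]} = \frac{(64 \times 1{,}024) \times (64 \times 1{,}024)}{2{,}655 \times 1{,}023} = 1{,}581.320 \qquad (2)
\]

This corresponds to $IPC_t$ = 3.089 = 1,581.320/512 per LD thread and $IPC_u$ = 0.193 = 1,581.320/(512 × 16) per LD unit.

CUBE(1b) handles up to 8 k characters at a time, so a comparison of 64 k-character sequences is processed in 64 = 8 × 8 passes over the FPGA array. The IPC is 199.027, as given by Eq. (3).

\[
IPC = \frac{[\text{length of } A] \times [\text{length of } B]}{[\text{clock cycles per LD thread block}] \times [\text{block cycles}] \times [\text{passes}]} = \frac{(64 \times 1{,}024) \times (64 \times 1{,}024)}{2{,}655 \times 127 \times 64} = 199.027 \qquad (3)
\]

This corresponds to $IPC_t$ = 3.110 = 199.027/64, and the IPC of an LD unit is 0.194 = 199.027/(64 × 16). As Table 4 shows, the IPC of CUBE(8b) is 7.945 times that of CUBE(1b).

CUBE 2 1 FPGA 2 IPC 4 FPGA LD unit FPGA LD unit16 128 128 8 4 FPGA IPC FPGA CUBE LD thread 16 LD unit 31 1 IPC LD thread IPC

When the block handled by an LD thread is enlarged to 256 characters, the block contains 1,024 sub-blocks of 8 × 8 entries for the 16 LD units. During the first 15 block cycles, while the LD units start up one after another, $\sum_{i=1}^{15} i = 120$ sub-blocks are processed, and likewise $\sum_{i=1}^{15} i = 120$ sub-blocks during the last 15 block cycles. The remaining sub-blocks keep all 16 LD units busy for 49 block cycles, for a total of 79 block cycles. $IPC_t$ then becomes 9.994, as given by Eq. (4), and $IPC_u$ becomes 0.625 = 9.994/16.

\[
IPC_t = \frac{256 \times 256}{83\ [\mathrm{Clock/Block}] \times 79\ [\mathrm{BlockCycle}]} = 9.994\ [\mathrm{Operation/Clock}] \qquad (4)
\]

This $IPC_t$ is 1.56 = 9.994/6.368 times that of Eq. (1).

CUBE LD thread 1 BlockRAM 2 256 1 128 2.3 = 72/31 6.2 CUBE stra strb FPGA FPGA CUBE 10 CUBE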
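The system-level IPC values of Eqs. (2) and (3) follow from the per-block cycle count and the number of block cycles. A small sketch that reproduces them is given below; the function names are hypothetical, and the block-cycle formula is the one quoted in Section 6.2, applied here on the assumption that the number of blocks is a multiple of the number of FPGAs.

```python
import math


def block_cycles(n_blocks: int, max_block: int) -> int:
    """(2 * MaxBlock - 1) * ceil(n / MaxBlock)^2, for n >= MaxBlock."""
    return (2 * max_block - 1) * math.ceil(n_blocks / max_block) ** 2


def system_ipc(seq_len: int, num_fpgas: int,
               block_chars: int = 128, cycles_per_block: int = 2655) -> float:
    """Table entries computed per clock cycle for two sequences of seq_len characters."""
    n_blocks = seq_len // block_chars
    clocks = cycles_per_block * block_cycles(n_blocks, num_fpgas)
    return seq_len * seq_len / clocks


if __name__ == "__main__":
    ipc_8b = system_ipc(64 * 1024, 512)   # CUBE(8b), Eq. (2)
    ipc_1b = system_ipc(64 * 1024, 64)    # CUBE(1b), Eq. (3)
    print(f"CUBE(8b): {ipc_8b:.3f}  per LD thread: {ipc_8b / 512:.3f}")
    print(f"CUBE(1b): {ipc_1b:.3f}  per LD thread: {ipc_1b / 64:.3f}")
    print(f"8b/1b   : {ipc_8b / ipc_1b:.3f}")
```

The ratio reduces to (127 × 64)/1,023, so the 8-board system is slightly less than 8 times as fast as a single board: filling and draining a 512-FPGA pipeline once costs a little more than filling and draining a 64-FPGA pipeline 64 times.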

8.

1 FPGA CUBE CUBE FPGA x86 GPU Cell/B.E. CUBE CUBE CUBE

References

1) Hamada, T., Fukushige, T., Kawai, A. and Makino, J.: PROGRAPE-1: A Programmable, Multi-Purpose Computer for Many-Body Simulations, Publications of the Astronomical Society of Japan, Vol.52, pp.943–954 (2000).
2) Burke, D., Wawrzynek, J., Asanovic, K., Krasnov, A., Schultz, A., Gibeling, G. and Droz, P.Y.: RAMP Blue: Implementation of a Multicore 1008 Processor FPGA System, Proc. 4th Annual Reconfigurable Systems Summer Institute (RSSI'08) (2008).
3) Osana, Y., Fukushima, T., Yoshimi, M. and Amano, H.: An FPGA-Based Acceleration Method for Metabolic Simulation, IEICE Trans. Inf. Syst., Vol.E87-D, No.8, pp.2029–2037 (2004).
4) Mencer, O., Tsoi, K.H., Craimer, S., Todman, T., Luk, W., Wong, M.Y. and Leong, P.H.W.: CUBE: A 512-FPGA CLUSTER, Proc. IEEE Southern Programmable Logic Conference (2009).
5) Yoshimi, M., Nishikawa, Y., Osana, Y., Funahashi, A., Hiroi, N., Shibata, Y., Yamada, H., Kitano, H. and Amano, H.: Practical Implementation of a Network-Based Stochastic Biochemical Simulation System on an FPGA, The 18th International Conference on Field Programmable Logic and Applications (FPL'08), pp.663–666 (2008).
6) Morishita, H., Osana, Y., Fujita, N. and Amano, H.: Exploiting Memory Hierarchy for a Computational Fluid Dynamics Accelerator on FPGAs, Proc. Field Programmable Technology 2008 (FPT'08), pp.193–200 (2008).
7) Arnold, J.M., Buell, D.A. and Davis, E.G.: Splash 2, SPAA'92: Proc. 4th annual ACM symposium on Parallel algorithms and architectures, New York, NY, USA, pp.316–322, ACM (1992).
8) Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady, Vol.10, No.8, pp.707–710 (1966).
9) Chang, C., Wawrzynek, J. and Brodersen, R.W.: BEE2: A High-End Reconfigurable Computing System, IEEE Design and Test of Computers, Vol.22, No.2, pp.114–125 (2005).
10) Cell Challenge 2009, SACSIS2009. http://www.hpcc.jp/sacsis/2009/cell/
11) GPU Challenge 2009, Cell Challenge 2009. http://www.hpcc.jp/sacsis/2009/gpu/
12) Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM, Vol.46, No.3, pp.395–415 (1999).
13) Hyyrö, H.: A bit-vector algorithm for computing Levenshtein and Damerau edit distances, Nordic J. of Computing, Vol.10, No.1, pp.29–39 (2003).
14) Masuno, S., Maruyama, T., Yamaguchi, Y. and Konagaya, A.: Multiple Sequence Alignment Based on Dynamic Programming Using FPGA, IEICE Trans. Inf. Syst., Vol.E90-D, No.12, pp.1939–1946 (2007).
15) Cell (2009). http://www.hpcc.jp/sacsis/2009/cell/outputs/pdf/kitei 1.pdf
16) SONY: BCU-100 Computing Unit with Cell/B.E. and RSX. http://pro.sony.com/bbsccms/ext/zego/files/bcu-100 Whitepaper.pdf
17) GPU, Vol.2009-HPC-121, No.27, pp.1–5 (SWoPP2009) (2009).

(Received January 26, 2010)
(Accepted April 28, 2010)
