
Performance Comparison of FPGA Array with CUDA on Poisson Equation

Li Jiang (lijiang@sekine-lab.ei.tuat.ac.jp), Kazuki Sato (kazuki@sekine-lab.ei.tuat.ac.jp), Kenichi Takahashi (takahashi@sekine-lab.ei.tuat.ac.jp), Tamukoh Hakaru (tamukoh@cc.tuat.ac.jp), Yuuichi Kobayashi (yu-koba@cc.tuat.ac.jp), Sekine Masatoshi (sekinem@cc.tuat.ac.jp)

Tokyo University of Agriculture and Technology, 2-24-16 Naka-chou, Koganei-shi, Tokyo, 184-8588 Japan

Copyright © 2009 by JSFM

Abstract
In recent years, FPGAs and GPGPUs have increasingly been used for HPC. We propose an FPGA array built by stacking many small cards, each carrying a large-scale FPGA and three-dimensional I/O. The FPGA array is suited to scalable design and can be controlled easily from a host PC. For comparison, we also built a CUDA system based on a GeForce 9800GT. In this paper, we solve the Poisson equation by the finite difference method in floating point on both the FPGA array and CUDA, and present the performance and power consumption. We also discuss how the results follow from the different hardware architectures, and the respective advantages of FPGA and GPGPU.

1.
HPC (High Performance Computing) has been carried out on processors such as x86 and POWER, and GPUs are now also used for HPC through GPGPU (General Purpose computing on GPU) (1). The FPGA (Field Programmable Gate Array) is another LSI that can be applied to HPC. FPGA-based HPC is referred to as RHPC (Reconfigurable High Performance Computing). In RHPC, the CPU of a PC and the FPGAs work together in a hw/sw configuration (2)(3). This paper compares the FPGA system with a GPU programmed in CUDA.

2.
The problem solved is the two-dimensional Poisson equation for ϕ with source term ρ (Fig. 1):

\[ \nabla^2 \phi = -\rho \tag{1} \]

Discretizing (1) by finite differences on a grid of spacing h gives the update

\[ \phi^{\mathrm{new}}_{i,j} = \frac{\alpha}{4}, \qquad \alpha = h^2 \rho + \phi^{\mathrm{old}}_{i-1,j} + \phi^{\mathrm{old}}_{i+1,j} + \phi^{\mathrm{old}}_{i,j-1} + \phi^{\mathrm{old}}_{i,j+1} \tag{2} \]

Both the FPGA and the CUDA implementations compute (2).

Fig. 1:
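The update in Eq. (2) can be sketched in software as follows. This is a minimal illustration, not the paper's FPGA or CUDA code: the function names `jacobi_step` and `solve`, the use of nested Python lists, double-precision arithmetic, and boundary values held fixed at zero are all assumptions made for the example.

```python
def jacobi_step(phi_old, rho, h):
    """Apply Eq. (2) once to every interior grid point.

    Boundary points are copied unchanged (an assumption of this sketch).
    """
    n_rows = len(phi_old)
    n_cols = len(phi_old[0])
    phi_new = [row[:] for row in phi_old]
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            # alpha = h^2 * rho + sum of the four neighbours (Eq. (2))
            alpha = (h * h * rho[i][j]
                     + phi_old[i - 1][j] + phi_old[i + 1][j]
                     + phi_old[i][j - 1] + phi_old[i][j + 1])
            phi_new[i][j] = alpha / 4.0
    return phi_new


def solve(rho, h, iterations):
    """Start from phi = 0 everywhere and repeat the update a fixed number of times."""
    phi = [[0.0] * len(rho[0]) for _ in rho]
    for _ in range(iterations):
        phi = jacobi_step(phi, rho, h)
    return phi
```

Every new value depends only on old values, so all grid points can be updated independently within one sweep. That independence is what allows the work to be spread over several PEs or over many GPU threads.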

Fig. 2: hwModule V2

3. Reconfigurable HPC

3.1 hw/sw
In the hw/sw configuration, FPGAs are controlled from a PC so that they can be used for HPC. The FPGA board that carries the hw/sw circuits is called the hwModule (2)(3).

3.2 hwModule V2
Fig. 2 shows the hwModule V2, an FPGA board that the PC uses through hw/sw over a PC-FPGA connection.

3.3 hwModule VS
Fig. 3 shows the hwModule VS, the FPGA card used to build the FPGA array. As shown in Fig. 4, it consists of a PE Board and a Sub Board.

3.4 FPGA
Fig. 5 shows the FPGA array built from hwModule VS cards. Fig. 6 shows how the software on the PC (CPU) works with the FPGA of the hwModule V2 through hw/sw.

Fig. 3: hwModule VS
Fig. 4: PE Board, Sub Board

3.5
Fig. 7 shows the circuit implemented on the FPGA. The arithmetic unit is called a Processing Element (PE). It runs at 66.6[MHz] and computes in IEEE754 floating point. ϕ and ρ are held in the FPGA's Block-RAM (BRAM), which the PE uses as a cache (Fig. 8). The data path involves the PC, the SDRAM on the hwModule and the BRAM cache. The FPGA is a Xilinx XC3S4000, and one BRAM is 32[bit] × 512[word]. Designs with 1 PE and with 10 PEs were implemented on the XC3S4000.

Fig. 5: FPGA
Fig. 6: hw/sw, FPGA
Fig. 7: FPGA ( )
Fig. 8:

The 1-PE design works on a 20 × 80 grid and the 10-PE design on a 60 × 60 grid. The 10-PE design (PE10) is implemented on the FPGA of the hwModule V2. The 1-PE design (PE1) is implemented on the hwModule VS, with one PE per FPGA and 4 FPGAs. Fig. 9 shows PE10 for the 60 × 60 grid.

4. GPGPU

4.1 GPU
We use an NVIDIA GeForce 9800GT, whose GPU is the G92 (6) (Fig. 11). For HPC it is programmed with CUDA. The Global Memory is 512[MB] of GDDR3. The GPU is made up of Multi Processors, each containing 8 Streaming Processors. The Streaming Processors in one Multi Processor share a 16[KB] Shared Memory, which is faster than Global Memory. The GPU has 14 Multi Processors of 8 Streaming Processors each, 112 Streaming Processors in total.

4.2 CUDA grid, block, thread
CUDA organizes a computation into grids, blocks and threads (Fig. 12). A thread runs on a Streaming Processor. A block is a group of threads, and each block is executed on one Multi Processor. A grid is the set of blocks that the CPU launches on the GPU (6).

4.3 CUDA
The CUDA implementation computes Eq. (2), as the FPGA does. One thread computes one grid point, and one line of the grid forms one block. A 64 × 64 grid is therefore divided into 64 blocks.
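The grid, block and thread hierarchy of Section 4.2 and the mapping of Section 4.3 can be emulated in plain Python to check the counts. This is only a sketch, not CUDA code: `launch`, `grid_point` and `global_index` are hypothetical helper names, the choice of block index as the row and thread index as the column is an assumption, and the loops run sequentially where a GPU would run the threads in parallel.

```python
MULTI_PROCESSORS = 14      # Multi Processors on the GeForce 9800GT (Section 4.1)
SPS_PER_MP = 8             # Streaming Processors per Multi Processor
BLOCKS = 64                # one block per grid line for a 64 x 64 grid
THREADS_PER_BLOCK = 64     # one thread per grid point within a line


def global_index(block_idx, thread_idx, threads_per_block=THREADS_PER_BLOCK):
    """Flat index of a thread, analogous to blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * threads_per_block + thread_idx


def grid_point(block_idx, thread_idx):
    """Assumed mapping: the block selects the row y, the thread selects the column x."""
    return thread_idx, block_idx  # (x, y)


def launch(kernel, blocks=BLOCKS, threads=THREADS_PER_BLOCK):
    """Sequential stand-in for a kernel launch: call kernel once per (block, thread)."""
    for b in range(blocks):
        for t in range(threads):
            kernel(b, t)


if __name__ == "__main__":
    points = []
    launch(lambda b, t: points.append(grid_point(b, t)))
    print(len(points), "threads,", MULTI_PROCESSORS * SPS_PER_MP, "Streaming Processors")
```

With these sizes, each of the 4,096 grid points is visited by exactly one thread. There are far more threads than the 112 Streaming Processors, so the hardware must schedule the blocks onto the Multi Processors in turn.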

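Fig. 9: PE10
Fig. 10: GPU (GeForce 9800GT)
Fig. 11: GPU
Fig. 12: grid, block, thread

Each block has 64 threads, for a total of 64 × 64 = 4,096 threads. The CUDA implementation accesses its data in the GPU's Global Memory.

5.

5.1 FPGA
Tab. 1 lists the environment used for the FPGA measurements, in which the FPGA is controlled from the PC (Fig. 13). The FPGA was measured with 1 PE and with 10 PEs, and Fig. 15 shows the results together with the CPU. PE10 runs on the hwModule V2. The hwModule VS measurement uses 4 FPGAs.

5.2 CUDA
Tab. 2 lists the environment used for the CUDA measurements, and Fig. 14 compares the GPU with the CPU. The GPU program is written in CUDA and the CPU program in C++. The measurements in Fig. 14 use 10,000 iterations.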

Tab. 1: FPGA
CPU            Athlon64 X2 3800+ (2.0[GHz])
Memory         DDR2 SDRAM PC2-6400 2[GB]
Motherboard    ASUS M2A-VM HDMI (AMD 690G)
OS             Windows XP Professional SP3
Compiler       Borland C++ Builder 2006

Tab. 2: CUDA
CPU            Athlon64 X2 3800+ (2.0[GHz])
Memory         DDR2 SDRAM PC2-6400 2[GB]
Motherboard    ASUS M2A-VM HDMI (AMD 690G)
OS             Windows XP Professional SP3
Compiler       Microsoft Visual Studio 2008
Graphics card  GALAXY GeForce 9800GT (GDDR3 512[MB])

Fig. 13:
Fig. 14: GPU, CPU

In Fig. 14, the GPU reaches only 0.136[GFlops] at a matrix size of 16 × 16, and its performance rises with the matrix size. From 64 × 64 the GPU is compared against the CPU: the ratio of GPU to CPU is 1.84 at 128 × 128 and 2.86 at 256 × 256. The matrix size sets the number of CUDA threads. A 16 × 16 matrix gives only 256 threads, too few to occupy the Multi Processors. At 128 × 128 the GPU reaches 1.4[GFlops]. At that size there are 128 × 128 = 16,384 threads for the GPU's 14 Multi Processors.

5.3 FPGA, GPU, CPU
Fig. 15 compares the FPGA, the GPU and the CPU in Flops against matrix size. The GPU is shown for 16 × 16 and 128 × 128. FPGA(10PE) processes a 60 × 60 matrix on the hwModule V2. FPGA(1PE) uses 4 FPGAs of the hwModule VS with 1 PE per FPGA: the 4 PEs together process 20 × 80, that is, 20 × 20 per PE. At 128 × 128 the GPU runs its threads on the Streaming Processors of the 14 Multi Processors of the G92. The FPGA with 10 PEs is 3.4 times faster than the CPU and 2.2 times faster than the GPU. Reference (2) gives 1[TFlops] for 150 hwModule VS FPGAs. The Flops rating of the GeForce 9800GT is 462[GFlops], far above what the CUDA implementation achieves here.
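The power-efficiency column of Tab. 4 is the measured performance divided by the power consumption. The short sketch below recomputes it from the [GFlops] and [W] values given in Tab. 4. The function name `mflops_per_watt` and the dictionary layout are illustrative choices, not part of the original work.

```python
def mflops_per_watt(gflops, watts):
    """Convert GFlops to MFlops and divide by the power in watts."""
    return gflops * 1000.0 / watts


# (performance [GFlops], power [W]) as listed in Tab. 4
measurements = {
    "FPGA": (3.226, 7.92),
    "GPU (GeForce 9800GT)": (1.442, 66.0),
    "CPU (Athlon64 X2)": (0.954, 62.0),
}

for name, (gflops, watts) in measurements.items():
    print(f"{name}: {mflops_per_watt(gflops, watts):.2f} [MFlops/W]")
```

The recomputed values agree with the table to the printed precision (about 407.3, 21.85 and 15.39 [MFlops/W]). The FPGA's efficiency advantage is much larger than its speed advantage because it draws roughly one eighth of the power of the GPU or the CPU.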

Tab. 4: Performance and power consumption.
                        Performance [GFlops]   Power [W]   Efficiency [MFlops/W]
FPGA                    3.226                  7.92        407.3
GPU (GeForce 9800GT)    1.442                  66.0        21.85
CPU (Athlon64 X2)       0.954                  62.0        15.39

Fig. 15: FPGA, GPU, CPU

Tab. 3: (unit: [GFlops]).
[ ]                     10^2     10^4     10^6
FPGA (10PE)             0.105    3.051    3.213
GPU (GeForce 9800GT)    1.370    1.442    1.299
CPU (Athlon64 X2)       0.900    0.957    0.952

In the CUDA implementation every thread accesses Global Memory. Shared Memory is the faster alternative.

5.4
For the FPGA with 1 PE, the figures are 1[W] and 131[MFlops/W] (2). With 10 PEs the efficiency rises to 407[MFlops/W]. A figure of 65[W] is cited for the CPU. Tab. 4 compares the FPGA, the GPU and the CPU.

5.5
On the GPU, CUDA obtains its performance from large numbers of threads. On the FPGA, performance is set by the number of PEs, as the comparison of 1 PE and 10 PEs with the CPU shows. The FPGA is controlled from the PC through hw/sw, whereas the GPU is programmed through CUDA.

6.
We applied FPGA-based HPC and CUDA-based GPGPU to the Poisson equation and compared them with a CPU. The FPGA was implemented with 10 PEs on the hwModule V2 and with PEs on the hwModule VS. The results show what the FPGA and CUDA each offer for HPC.

(1) nvidia CUDA, http://www.nvidia.com/cuda/
(2) , , , FPGA TFlops, , vol. 108, no. 414, pp. 19-24, 2009.
(3) , , , FPGA Reconfigurable HPC, , vol. 107, no. 416, pp. 13-18, 2008.
(4) , , , , hw/sw, 21, pp. 207-212, 2008.
(5) K. Kudo, et al., Hardware Object Model and Its Application to the Image Processing, IEICE Trans. on Fund., vol. E87-A, no. 3, pp. 547-558, 2004.
(6) , GPU CFD, , vol. 50, no. 2, pp. 107-115, 2009.
(7) , , GPU CIP 2, , vol. 13, no. 2, pp. 837-840, 2008.