IPSJ SIG Technical Report, Vol.2016-HPC-155, 2016/8/10

a) hanawa@cc.u-tokyo.ac.jp

1. Introduction

In HPC systems, GPUs are now widely used as accelerators alongside CPUs (GPGPU). FPGAs (Field Programmable Gate Arrays) have also been attracting attention as accelerators; Microsoft's Catapult [1] is a prominent large-scale example, and applications of FPGAs to HPC are reported in [3], [4]. FPGAs have traditionally been programmed in hardware description languages (HDLs) such as Verilog HDL, which demands hardware-design expertise and long development times. OpenCL [2], the open parallel-programming standard first adopted for GPUs, can now target FPGAs as well: kernel code is compiled into an FPGA circuit in place of hand-written Verilog HDL, and its use for HPC on FPGAs has been reported in [5], [6], [7], [8]. The authors have been developing software for CPUs and GPUs, including ppOpen-HPC [9] and the HACApK hierarchical-matrix library [10], as well as FPGA-based hardware [11], [12]. In this report we port the H-matrix-vector multiplication kernel of HACApK to an FPGA using OpenCL and evaluate its performance. The remainder of this report is organized as follows: Section 2 describes the OpenCL programming environment for FPGAs, Section 3 describes hierarchical matrices and HACApK, Section 4 presents the implementation and its evaluation, and Section 5 concludes.

2. OpenCL Programming Environment for FPGA

2.1 Compiling OpenCL for FPGA

FPGAs have conventionally been designed in HDLs such as Verilog HDL or VHDL rather than in the C or Fortran familiar to HPC users, which has limited their adoption in HPC. Altera provides an OpenCL environment for FPGAs such as the Stratix V that compiles OpenCL kernel code into a Verilog HDL design, so the device can be programmed without writing HDL directly. OpenCL is an open standard maintained by Khronos and is supported on AMD GPUs, CPUs, Xeon Phi, NVIDIA GPUs, and now Altera Stratix V FPGAs. Some FPGAs integrate ARM CPU cores as hard IP (Xilinx Zynq, Altera Arria SoC); in the OpenCL environment targeted here, the host CPU and the FPGA board communicate over PCI Express, so the Altera Stratix V FPGA board is attached like a GPU or other PCI Express I/O device *1. Altera FPGAs support partial reconfiguration, which allows the kernel region of the FPGA to be reconfigured over PCI Express while the PCI Express and DDR controller logic keeps running, so the OpenCL runtime can load a new kernel without reprogramming the whole device.

The FPGA board used here is a Bittware S5-PCIe-HQ (s5phq_d5) carrying an Altera Stratix V FPGA (Fig. 1). The FPGA contains 172,600 Adaptive Logic Modules (ALMs), each of which can implement, for example, two 4-input Look-Up Tables (LUTs) or one 6-input LUT; 2,014 20-Kbit RAM blocks (M20K); 8,630 Memory Logic Array Blocks (MLABs) of 640 bits each; and 1,590 Digital Signal Processor (DSP) blocks with 27-bit multipliers. On the Stratix V, floating-point arithmetic is implemented from DSP blocks together with ALM and RAM resources *2 [13], [14].

The OpenCL compiler is the Altera Offline Compiler (aoc). Running aoc -c kernel.cl (1) parses and optimizes the OpenCL kernel, (2) estimates the resource usage (logic, DSP blocks, and so on), and (3) generates a hardware design including the interface logic for PCI Express and the board's DDR3-DRAM.

*1 Intel, which acquired Altera, has also announced CPUs with an FPGA integrated via QPI.
*2 The newer Arria 10 and Stratix 10 devices add DSP blocks with hardened floating-point support.

Table 1: Evaluation FPGA board
  FPGA:                Altera Stratix V GS D5 (5SGSMD5K2F40C2)
  #Logic units (ALMs): 172,600
  #RAM blocks (M20K):  2,014
  #DSP blocks:         1,590 (27 x 27)
  Board:               Bittware S5-PCIe-HQ GSMD5
  DDR memory:          (4 + 4) GB, 25.6 GB/sec
  PCIe I/F:            Gen3 x8 (Gen2 x8 under OpenCL)
  Software:            Altera Quartus II 16.0.1, OpenCL SDK, Altera Offline Compiler

The aoc -c step above finally (4) emits the generated Verilog HDL as an intermediate object, kernel.aoco. Running aoc kernel.aoco then (1) invokes Quartus (Altera's FPGA design tool) for logic synthesis and place-and-route, and (2) produces the FPGA bitstream kernel.aocx. This aoco-to-aocx Quartus compilation is lengthy, taking on the order of one to two hours on an Intel Xeon E5 (Haswell) host. Fig. 2 illustrates the OpenCL-to-FPGA compilation flow. Passing --report together with -c prints the estimated FPGA resource usage without running the full hardware compilation.

Fig. 1: Bittware S5-PCIe-HQ board (Bittware; the board also carries QDR II+ SRAM)

2.2 OpenCL programming for FPGA

An OpenCL application consists of a host program, written in C or C++ against the OpenCL API, and kernel code; on GPUs this model is well established alongside CUDA [15], and the same structure applies when the device is an FPGA instead of a GPU or CPU.
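The two-stage compilation flow of Section 2.1 can be summarized as the following command sequence. This is a sketch of the aoc usage described above; the flags and output files follow the Altera SDK for OpenCL [17] and may differ in other SDK versions.

```shell
# Stage 1: compile the OpenCL kernel to an intermediate object (.aoco).
# -c stops before the hardware compilation; --report prints the estimated
# FPGA resource usage (logic, DSP blocks, RAM) for the generated design.
aoc -c --report kernel.cl      # -> kernel.aoco

# Stage 2: run Quartus synthesis and place-and-route on the intermediate
# object to produce the bitstream (.aocx) loaded onto the FPGA.
# This is the step that takes hours.
aoc kernel.aoco                # -> kernel.aocx
```

Because stage 2 is so slow, the stage-1 resource report is the practical way to iterate on kernel optimizations before committing to a full hardware build.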

Fig. 2: Compilation flow of OpenCL for FPGA

Like OpenMP on CPUs and CUDA on GPUs, OpenCL provides a portable parallel-programming model; the Altera Offline Compiler 16.0 used here supports a subset of the OpenCL 2.0 specification. Kernel functions are marked with the kernel qualifier and device buffers with the global qualifier, and the OpenCL host API is similar in granularity to the CUDA Driver API. The essential difference on FPGA is that a kernel is compiled offline into a hardware circuit: an OpenCL kernel written for a GPU can in principle be compiled for the FPGA unchanged, but code tuned for a GPU is not necessarily fast on an FPGA, so FPGA-specific optimization is required.

2.3 Optimizations for Altera FPGAs

Altera's programming guide [17] and best-practices guide [18] describe FPGA-specific optimizations such as SIMD vectorization and compute-unit replication.

2.3.1 Memory hierarchy

As described in Section 2.1, OpenCL global memory maps to the board's DDR memory, while local memory maps to on-chip RAM blocks; placing reused data in local memory therefore reduces accesses to the slower external DDR.

2.4 Loop pipelining

For loops in a kernel (for and while), the Altera Offline Compiler (AOC) attempts to pipeline successive iterations; Fig. 3 shows the optimization report the AOC produces for the for loops of a kernel.
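The SIMD-vectorization and compute-unit-replication optimizations of [17], [18] are expressed as kernel attributes (num_simd_work_items and num_compute_units, introduced below). As an illustration, a simple NDRange kernel could be annotated as follows; this is a hypothetical kernel, not from the paper, and the attribute syntax follows the Altera SDK for OpenCL Programming Guide [17], which also requires reqd_work_group_size when num_simd_work_items is used:

```c
// Hypothetical vector-add kernel: the AOC builds a 4-wide SIMD datapath
// and replicates the whole compute unit 4 times.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(4)))
__kernel void vec_add(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict c)
{
    size_t i = get_global_id(0);  // per-work-item ID, as with CUDA threads
    c[i] = a[i] + b[i];
}
```

The work-group size must be divisible by the SIMD width for the vectorized datapath to be fully utilized.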

================================================================================
*** Optimization Report ***
...
================================================================================
Kernel: hacapk_body
================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1" (file hacapk-calc0.cl line 36)
    NOT pipelined due to:
      Loop structure: loop contains divergent inner loops.
...
-+ Loop "Block4" (file hacapk-calc0.cl line 53)
     Pipelined with successive iterations launched every 2 cycles due to:
...
-+ Loop "Block5" (file hacapk-calc0.cl line 55)
     Pipelined with successive iterations launched every 8 cycles due to:
...
-+ Loop "Block9" (file hacapk-calc0.cl line 62)
     Pipelined well. Successive iterations are launched every cycle.

Fig. 3: Optimization report produced by the AOC

2.4.1 Single work-item kernels

When a kernel is launched as a single work-item (a single stream), it is written much like sequential CPU code and the AOC extracts parallelism by pipelining its for loops, a style well suited to the FPGA.

2.4.2 NDRange kernels (SIMD vectorization and replication)

As on GPUs, a kernel can instead be launched over an index space with clEnqueueNDRangeKernel; each work-item obtains its ID through API calls, just as CUDA threads do on a GPU. In an OpenCL kernel for FPGA, the attribute num_simd_work_items(4) vectorizes the datapath so that four work-items execute in SIMD fashion, and num_compute_units(4) replicates the compute unit four times.

3. Hierarchical Matrices

3.1 Low-rank approximation of submatrices

Let $\bar{A} \in \mathbb{R}^{N \times N}$ be the dense matrix to be approximated by an H-matrix $A$. With row and column index sets $I := \{1, \dots, N\}$ and $J := \{1, \dots, N\}$, the matrix is partitioned into $M$ submatrices: the $m$-th submatrix ($m = 1, \dots, M$) corresponds to a row subset $s_m \subset I$ and a column subset $t_m \subset J$, and the corresponding block of $\bar{A}$ is

$$A_m \in \mathbb{R}^{\#s_m \times \#t_m}. \quad (1)$$

Each $A_m$ is either stored densely or replaced by the low-rank approximation

$$\tilde{A}_m := V_m W_m, \qquad V_m \in \mathbb{R}^{\#s_m \times r_m}, \quad W_m \in \mathbb{R}^{r_m \times \#t_m}, \quad (2)$$

with rank $r_m \ll \min(\#s_m, \#t_m)$, so that $\tilde{A}_m$ approximates $A_m$ on the index block $(s_m, t_m)$.

The number of values stored for submatrix $m$ is

$$N(m) := \begin{cases} \#s_m \, \#t_m & \text{(dense)} \\ r_m(\#s_m + \#t_m) & \text{(low rank)} \end{cases} \qquad N(M) = \sum_{m \in M} N(m), \quad (3)$$

and the low-rank form is chosen for submatrix $m$ when it is the smaller of the two, i.e. when

$$r_m(\#s_m + \#t_m) < \#s_m \, \#t_m. \quad (4)$$

For example, with $\#s_m = \#t_m = 1000$ and $r_m = 10$, the low-rank form stores $2 \times 10^4$ values instead of $10^6$.

3.2 H-matrix-vector multiplication

Consider the product

$$y = Ax, \qquad x, y \in \mathbb{R}^N. \quad (5)$$

Each submatrix $A_m$, with row set $s_m$ and column set $t_m$, contributes the partial product

$$\hat{y}_{s_m} = A_m x_{t_m}, \quad (6)$$

where $x_{t_m}$ is the restriction of $x$ to $t_m$ (length $\#t_m$) and $\hat{y}_{s_m}$ (length $\#s_m$) is added into the entries $y_{s_m}$ of $y$. For a low-rank submatrix $\tilde{A}_m$, with an intermediate vector $c \in \mathbb{R}^{r_m}$, the product $\tilde{A}_m x_{t_m} = V_m W_m x_{t_m}$ is evaluated in two steps (Fig. 4):

$$c = W_m x_{t_m}, \quad (7)$$

$$\hat{y}_{s_m} = V_m c, \quad (8)$$

and the partial results are accumulated over all submatrices:

$$y_{s_m} \leftarrow y_{s_m} + \hat{y}_{s_m} \quad \text{for all } m \in M. \quad (9)$$

3.3 HACApK

We use HACApK 1.0.0, contained in ppOpen-APPL/BEM ver. 0.4.0 [10]. ppOpen-APPL/BEM is a boundary element method (BEM) framework developed under the JST CREST project ppOpen-HPC [9]. HACApK [19] constructs the H-matrix using adaptive cross approximation (ACA and ACA+) [20]. HACApK is written in Fortran90; for this work the matrix-vector multiplication kernel was ported to C.

3.4 Problem settings

Table 2 lists the three problems used in the evaluation.

Table 2: Problem settings
                        100ts     216h     human 1x1
  N                     101250    21600    19664
  #submatrices          222274    50098    46618
  #low-rank submatrices 89534     17002    16202
  #dense submatrices    132740    33096    20416

4. Implementation and Evaluation

4.1 Target kernel

In prior work [7], a CG solver was implemented on an FPGA; here we target the H-matrix-vector multiplication, the dominant kernel of HACApK, on the FPGA.

The evaluation host has two Intel Xeon E5-2680v2 (Ivy Bridge) processors, with the Bittware S5-PCIe-HQ Stratix V FPGA board described in Section 2.1 attached via PCI Express. The H-matrix-vector multiplication routine of HACApK, HACApK_adot_body_lfmtx, was ported from Fortran to C, and the C version was then converted to an OpenCL kernel for the FPGA.

4.2 Version 0: naive port

Fig. 5 shows the C version, which was turned into an OpenCL kernel almost unchanged: the arrays passed to the kernel are declared global and reside in the board's DDR3 memory, while the work buffer zbu is declared local and is placed in on-chip RAM. The resource usage and execution time of version 0 appear in Tables 3 and 4; the CPU baseline uses one core of the Xeon E5-2680v2, against which version 0 is about 126 times slower.

 1 for(ip=0; ip<nlf; ip++){
 2   sttmp=st_lf+ip;
 3   ndl=sttmp->ndl; ndt=sttmp->ndt;
 4   nstrtl=sttmp->nstrtl; nstrtt=sttmp->nstrtt;
 5   if(sttmp->ltmtx==1){
 6     kt=sttmp->kt;
 7     for(il=0; il<kt; il++){
 8       zbu[il] = 0.0;
 9       for(it=0; it<ndt; it++){
10         itt=it+nstrtt-1;
11         itl=it+il*ndt + sttmp->offset_a1;
12         zbu[il] += a1[itl]*zu[itt];
13     } }
14     for(il=0; il<kt; il++){
15       for(it=0; it<ndl; it++){
16         ill=it+nstrtl-1;
17         itl=it+il*ndl + sttmp->offset_a2;
18         zau[ill] += a2[itl]*zbu[il];
19     } }
20   } else if(sttmp->ltmtx==2){
21     for(il=0; il<ndl; il++){
22       ill=il+nstrtl-1;
23       for(it=0; it<ndt; it++){
24         itt=it+nstrtt-1;
25         itl=it+il*ndt + sttmp->offset_a1;
26         zau[ill] += a1[itl]*zu[itt];
27 } } } }

Fig. 5: C version of HACApK_adot_body_lfmtx

4.3 Version 1: loop restructuring

Version 1 restructures the loop nests at lines 5-13 and 21-27 of Fig. 5, i.e. both the ltmtx==1 and ltmtx==2 branches: updates to zau are accumulated per leaf (loop index ip) and written back once, and the nest of the il loop (line 7) and the it loop (line 9) is arranged so that the loaded values zu[itt] are reused. Table 3 shows the FPGA resource usage of versions 0-3.

Table 3: FPGA resource usage of versions 0-3
                      v0       v1       v2       v3
  Logic utilization   29%      26%      28%      26%
  DSP blocks          9        4        6        2
  Memory bits         16%      18%      14%      15%
  RAM blocks          608      630      536      560
                      (30%)    (31%)    (27%)    (28%)
  fmax [MHz]          246.18   244.73   269.25   268.95

4.4 Version 2

Starting from version 1, version 2 further modifies the loops at lines 7-13 and 21-27, changing how the intermediate buffer zbu is used.

4.5 Version 3

Version 3, also based on version 1, interchanges the loop nest at lines 14-19: the it loop (line 15) and the il loop (line 14) are swapped so that, as with zu[itt] in version 1, loaded values are reused.

4.6 Results

Table 4 gives the execution times (ms) of versions 0-3 and of one core of the Xeon E5-2680v2 CPU. The restructured versions 1 and 3 run about 10 to 16 times faster than the naive version 0, but even the best version remains roughly 8 times slower than the single CPU core; the remaining cost is dominated by accesses to the global arrays held in the board's DDR3 memory.

Table 4: Execution time (ms)
              v0        v1       v2        v3       CPU (1 core)
  100ts       62597.0   5540.9   57661.3   4848.3   494.2
  216h        8705.1    808.2    7904.0    684.0    68.7
  human 1x1   8762.6    676.9    7962.5    547.3    69.6

5. Conclusion

We ported the H-matrix-vector multiplication of HACApK to an FPGA using OpenCL. Loop restructuring improved the kernel by up to 16 times over the naive port, but performance remains about 1/8 of a single CPU core; closing this gap and exploiting the strengths of the FPGA relative to the CPU are future work.

Acknowledgments: This work was supported in part by JSPS KAKENHI Grant Number 15K00166, JST/CREST, and the German Priority Programme 1648 "Software for Exascale Computing" (SPPEXA-II).

Quartus II licenses were provided through the Altera University Program.

References

[1] Putnam, A., Caulfield, A. M., Chung, E. S., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G. P., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J.-Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P. Y. and Burger, D.: A reconfigurable fabric for accelerating large-scale datacenter services, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp. 13-24, 2014.
[2] OpenCL - The open standard for parallel programming of heterogeneous systems, https://www.khronos.org/opencl/
[3] Vazhenin, A., Sedukhin, S. et al.: FPGA, IPSJ SIG Technical Report (2015-HPC-149), 2015 (in Japanese).
[4] IPSJ SIG Technical Report (2015-HPC-151), 2015 (in Japanese).
[5] Zohouri, H. R. et al.: OpenCL / FPGA, IPSJ SIG Technical Report (2015-HPC-150), 2015 (in Japanese).
[6] Zohouri, H. R., Maruyama, N., Smith, A., Matsuda, M. and Matsuoka, S.: Optimizing the Rodinia Benchmark for FPGAs (Unrefereed Workshop Manuscript), IPSJ SIG Technical Report (2015-HPC-152), 2015.
[7] FPGA, IPSJ SIG Technical Report (2016-HPC-153), 2016 (in Japanese).
[8] OpenCL / FPGA, IPSJ SIG Technical Report (2016-HPC-154), 2016 (in Japanese).
[9] Nakajima, K., Satoh, M., Furumura, T., Okuda, H., Iwashita, T., Sakaguchi, H., Katagiri, T., Matsumoto, M., Ohshima, S., Jitsumoto, H., Arakawa, T., Mori, F., Kitayama, T., Ida, A., Matsuo, M. Y., Fujisawa, K. et al.: ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT), Optimization in the Real World, pp. 15-35, DOI 10.1007/978-4-431-55420-2_2, 2016.
[10] ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT), http://ppopenhpc.cc.u-tokyo.ac.jp/ppopenhpc/
[11] Tightly Coupled Accelerators / GPU, Vol. 6, No. 4, pp. 14-25, 2013 (in Japanese).
[12] Kodama, Y., Hanawa, T., Boku, T. and Sato, M.: PEACH2: FPGA based PCIe network device for Tightly Coupled Accelerators, International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp. 3-8, Jun. 2014.
[13] Altera Corporation: Floating-Point IP Cores User Guide, UG-01058, 2015.
[14] Altera Corporation: Stratix V Device Handbook, https://www.altera.com/en_us/pdfs/literature/hb/stratix-v/stx5_core.pdf
[15] CUDA Dynamic Parallelism, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
[16] Altera Corporation: SDK for OpenCL, https://www.altera.co.jp/products/design-software/embedded-software-developers/opencl/overview.html
[17] Altera Corporation: Altera SDK for OpenCL Programming Guide 16.0, UG-OCL002, 2016.
[18] Altera Corporation: Altera SDK for OpenCL Best Practices Guide 16.0, UG-OCL003, 2016.
[19] Ida, A., Iwashita, T., Mifune, T. and Takahashi, Y.: Parallel Hierarchical Matrices with Adaptive Cross Approximation on Symmetric Multiprocessing Clusters, Journal of Information Processing, Vol. 22, pp. 642-650, 2014.
[20] Börm, S., Grasedyck, L. and Hackbusch, W.: Hierarchical Matrices, Lecture Note, Max-Planck-Institut für Mathematik, 2006.