IPSJ SIG Technical Report Vol.2016-HPC-153 No /3/1 FPGA 1,a) FPGA(Field Programmable Gate Array) FPGA HPC OpenCL FPGA HPC FPGA FEM CG Open

Similar documents
IPSJ SIG Technical Report Vol.2016-HPC-155 No /8/10 FPGA 1,a) FPGA(Field Programmable Gate Array) FPGA OpenCL FPGA FPGA OpenCL FPGA 1. CP

IPSJ SIG Technical Report Vol.2013-ARC-203 No /2/1 SMYLE OpenCL (NEDO) IT FPGA SMYLEref SMYLE OpenCL SMYLE OpenCL FPGA 1

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

1 GPU GPGPU GPU CPU 2 GPU 2007 NVIDIA GPGPU CUDA[3] GPGPU CUDA GPGPU CUDA GPGPU GPU GPU GPU Graphics Processing Unit LSI LSI CPU ( ) DRAM GPU LSI GPU

! 行行 CPUDSP PPESPECell/B.E. CPUGPU 行行 SIMD [SSE, AltiVec] 用 HPC CPUDSP PPESPE (Cell/B.E.) SPE CPUGPU GPU CPU DSP DSP PPE SPE SPE CPU DSP SPE 2

strtok-count.eps

IPSJ SIG Technical Report Vol.2013-ARC-206 No /8/1 Android Dominic Hillenbrand ODROID-X2 GPIO Android OSCAR WFI 500[us] GPIO GP


HP Workstation 総合カタログ

ネットリストおよびフィジカル・シンセシスの最適化

PLDとFPGA

07-二村幸孝・出口大輔.indd

Vol.214-HPC-145 No /7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] c


Microsoft PowerPoint - Lec pptx

26 FPGA FPGA (Field Programmable Gate Array) ASIC (Application Specific Integrated Circuit) FPGA FPGA FPGA FPGA Linux FreeDOS skewed way L1

GPGPU

IPSJ SIG Technical Report Vol.2013-HPC-138 No /2/21 GPU CRS 1,a) 2,b) SpMV GPU CRS SpMV GPU NVIDIA Kepler CUDA5.0 Fermi GPU Kepler Kepler Tesla

main.dvi

FPGAメモリおよび定数のインシステム・アップデート

プロセッサ・アーキテクチャ

デザインパフォーマンス向上のためのHDLコーディング法

FINAL PROGRAM 22th Annual Workshop SWoPP / / 2009 Sendai Summer United Workshops on Parallel, Distributed, and Cooperative Processing

Nios II ハードウェア・チュートリアル

HP Workstation 総合カタログ

Cyclone IIIデバイスのI/O機能

「FPGAを用いたプロセッサ検証システムの製作」

HPEハイパフォーマンスコンピューティング ソリューション

DO 時間積分 START 反変速度の計算 contravariant_velocity 移流項の計算 advection_adams_bashforth_2nd DO implicit loop( 陰解法 ) 速度勾配, 温度勾配の計算 gradient_cell_center_surface 速

Cloud[2] (48 ) Xeon Phi (50+ ) IBM Cyclops[9] (64 ) Cavium Octeon II (32 ) Tilera Tile-GX (100 ) PE [11][7] 2 Nsim[10] 8080[1] SH-2[5] SH [8

( CUDA CUDA CUDA CUDA ( NVIDIA CUDA I

5 2 5 Stratix IV PLL 2 CMU PLL 1 ALTGX MegaWizard Plug-In Manager Reconfig Alt PLL CMU PLL Channel and TX PLL select/reconfig CMU PLL reconfiguration

スライド 1

IPSJ SIG Technical Report Vol.2016-ARC-221 No /8/9 GC 1 1 GC GC GC GC DalvikVM GC 12.4% 5.7% 1. Garbage Collection: GC GC Java GC GC GC GC Dalv

Microsoft PowerPoint - GPU_computing_2013_01.pptx

DDR3 SDRAMメモリ・インタフェースのレベリング手法の活用


IBM PureData

[2] OCR [3], [4] [5] [6] [4], [7] [8], [9] 1 [10] Fig. 1 Current arrangement and size of ruby. 2 Fig. 2 Typography combined with printing

HP High Performance Computing(HPC)

matrox0

Iteration 0 Iteration 1 1 Iteration 2 Iteration 3 N N N! N 1 MOPT(Merge Optimization) 3) MOPT MOP

09中西

Stratix IIIデバイスの外部メモリ・インタフェース

GPU n Graphics Processing Unit CG CAD

Run-Based Trieから構成される 決定木の枝刈り法

2. CABAC CABAC CABAC 1 1 CABAC Figure 1 Overview of CABAC 2 DCT 2 0/ /1 CABAC [3] 3. 2 値化部 コンテキスト計算部 2 値算術符号化部 CABAC CABAC

01_OpenMP_osx.indd

untitled

IPSJ SIG Technical Report Vol.2015-MUS-107 No /5/23 HARK-Binaural Raspberry Pi 2 1,a) ( ) HARK 2 HARK-Binaural A/D Raspberry Pi 2 1.

6 2. AUTOSAR 2.1 AUTOSAR AUTOSAR ECU OSEK/VDX 3) OSEK/VDX OS AUTOSAR AUTOSAR ECU AUTOSAR 1 AUTOSAR BSW (Basic Software) (Runtime Environment) Applicat

MATLAB® における並列・分散コンピューティング ~ Parallel Computing Toolbox™ & MATLAB Distributed Computing Server™ ~

HBase Phoenix API Mars GPU MapReduce GPU Hadoop Hadoop Hadoop MapReduce : (1) MapReduce (2)JobTracker 1 Hadoop CPU GPU Fig. 1 The overview of CPU-GPU

IPSJ SIG Technical Report Vol.2011-IOT-12 No /3/ , 6 Construction and Operation of Large Scale Web Contents Distribution Platfo

GPUコンピューティング講習会パート1

VHDL-AMS Department of Electrical Engineering, Doshisha University, Tatara, Kyotanabe, Kyoto, Japan TOYOTA Motor Corporation, Susono, Shizuok

動的適応型ハードウェアの提案

Slides: TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments

2017 (413812)

untitled

if clear = 1 then Q <= " "; elsif we = 1 then Q <= D; end rtl; regs.vhdl clk 0 1 rst clear we Write Enable we 1 we 0 if clk 1 Q if rst =

WebGL OpenGL GLSL Kageyama (Kobe Univ.) Visualization / 57

倍々精度RgemmのnVidia C2050上への実装と応用

論理設計の基礎

TSUBAME2.0 における GPU の 活用方法 東京工業大学学術国際情報センター丸山直也第 10 回 GPU コンピューティング講習会 2011 年 9 月 28 日

ADZBT1 Hardware User Manual Hardware User Manual Version 1.0 1/13 アドバンスデザインテクノロジー株式会社

B 2 Thin Q=3 0 0 P= N ( )P Q = 2 3 ( )6 N N TSUB- Hub PCI-Express (PCIe) Gen 2 x8 AME1 5) 3 GPU Socket 0 High-performance Linpack 1

Transcription:

FPGA 1,a) 1 1 1 FPGA(Field Programmable Gate Array) FPGA HPC OpenCL FPGA HPC FPGA FEM CG OpenCL FPGA 1. CPU(Central Processing Unit) GPU(Graphics Processing Unit) HPC FPGA (Field Programmable Gate Array) FPGA FPGA FPGA Catapult[1] HPC FPGA [3], [4] FPGA FPGA Verilog FPGA FPGA OpenCL[2] FPGA Verilog OpenCL HPC FPGA [5], [6] CPU GPU [7], [8] FPGA [9], [10] HPC FPGA OpenCL FPGA 2 FPGA 3 OpenCL FPGA 4 1 a) ohshima@cc.u-tokyo.ac.jp c 2016 Information Processing Society of Japan 1

1 FPGA FPGA: Altera Stratix V GS D5 (5SGSMD5K2F40C2) #Logic units (ALMs) 172,600 #RAM blocks (M20K) 2,014 #DSP blocks 1,590 (27 27) : Bittware S5-PCIe-HQ GSMD5 DDR DDR PCIe I/F (4 + 4) MB 25.6 GB/sec Gen3 x8 (OpenCL Gen2 x8 ) Altera Quartus II 15.1 2. FPGA 2.1 FPGA OpenCL SDK FPGA Altera Stratix V GS D5 Stratix V Altera FPGA 1 FPGA 1 Adaptive Logic Module (ALM) 172,600 4 2 6 Look Up Table (LUT) 2 FPGA 2,014 20Kbit RAM (M20K) 640bit Memory Logic Array Block (MLAB) 8,630 Digital Signal Processor (DSP) 27 1,590 DSP Stratix V ALM RAM *1 [11][12] FPGA Bittware PCI Express S5-PCIe-HQ (s5phq d5) ( 1) FPGA Verilog HDL VHDL C Fotran FPGA HPC FPGA *1 Arria 10, Stratix 10 DSP 1 Bittware S5-PCIe-HQ (Bittware QDR II+ ) OpenCL FPGA HPC Altera FPGA Stratix V OpenCL Verilog OpenCL FPGA Altera Stratix V CPU FPGA ARM IP FPGA (Xilinx Zynq, Altera Arria SoC ) OpenCL FPGA CPU PCI Express OpenCL FPGA Altera Stratix V FPGA PCI Express GPU I/O PCI Express *2 PCI Express FPGA OpenCL FPGA PCI Express FPGA (Partial reconfiguration) FPGA PCI Express *2 Intel Altera QPI c 2016 Information Processing Society of Japan 2

DDR PCI Express OpenCL OpenCL Verilog HDL IP 1 Intel Xeon E5 (Haswell ) 2 FPGA FPGA --report -c FPGA & 2.2 OpenCL FPGA OpenCL Khronos GPU FPGA DSP(Digital Signal Processor) HPC AMD GPU CPU Xeon Phi NVIDIA GPU OpenCL C/++ (API ) GPU OpenCL CUDA[13] OpenCL 2.0 OpenCL CPU OpenMP 2 FPGA OpenCL GPU CUDA OpenCL FPGA FPGA Altera Altera OpenCL SDK[14] FPGA SDK Stratix V Altera OpenCL Altera 2013 SDK OpenCL ( ) FPGA ( ) GPU API FPGA (FPGA ) 2 OpenCL FPGA FPGA FPGA ( ) kernel global CUDA CUDA Driver API OpenCL CUDA FPGA FPGA OpenCL FPGA GPU OpenCL GPU c 2016 Information Processing Society of Japan 3

FPGA GPU OpenCL FPGA GPU 2.3 Altera FPGA Altera [15] [16] (SIMD ) 2.3.1 2.2 FPGA OpenCL global DDR local RAM ( local ) 2.3.2 (SIMD, ) FPGA GPU OpenCL clenqueuendrangekernel FPGA FPGA ID API ID CUDA GPU GPU FPGA OpenCL FPGA attribute num_simd_work_items(4) SIMD 4 num_compute_units(4) 4 2.4 OpenCL for while Altera OpenCL Compiler (AOC) 3 for FPGA 1 0 (single stream) for c 2016 Information Processing Society of Japan 4

======================================================================================================================== *** Optimization Report *** ======================================================================================================================== Kernel: cg File:Ln ======================================================================================================================== Loop for.body [1]:30 Pipelined execution inferred. ------------------------------------------------------------------------------------------------------------------------ Loop for.body5 [1]:37 Pipelined execution inferred. Successive iterations launched every 2 cycles due to: Pipeline structure ------------------------------------------------------------------------------------------------------------------------ Loop for.body18 [1]:39 Pipelined execution inferred. Successive iterations launched every 8 cycles due to: Data dependency on variable Largest Critical Path Contributor: 96%: Fadd Operation [1]:40 ------------------------------------------------------------------------------------------------------------------------ Loop for.body37 [1]:45 Pipelined execution inferred. Successive iterations launched every 8 cycles due to: Data dependency on variable BNorm2 [1]:46 Largest Critical Path Contributor: 96%: Fadd Operation [1]:46 3 AOC c 2016 Information Processing Society of Japan 5

2.4.1 CPU FPGA 3. 3.1 OpenCL FPGA [5], [6] Rodinia ppopen-hpc OpenCL FPGA OpenCL FPGA FPGA FPGA OpenCL 2 CG(Conjugate Gradient) C FEM(Finite Element Method) CG (float ) 4 (CG ) 3 7 OpenMP FPGA CPU-FPGA Intel Xeon E5 2 FPGA(Stratix V) 1 {r0} = {b} - [A]{xini} 2 loop 3 solve {z} = [Minv]{r} 4 RHO = {r}{z} 5 if ITER=1 {p} = {z} 6 else BETA = RHO / RHO1 7 {q} = [A]{p} 8 ALPHA = RHO / {p}{q} 9 {x} = {x} + ALPHA * {p} 10 {r} = {r} - ALPHA * {q} 11 endloop 4 CG OpenCL FPGA 3.2 FPGA OpenCL CG kernel CPU-FPGA global OpenCL API (clenqueuereadbuffer, clenqueuewritebuffer) CPU-FPGA global clenqueuendrangekernel API 1 FPGA (FPGA ) -g -W -v --board s5phq_d5 -g -W warning -v --board s5phq_d5 FPGA CPU -O2 OpenCL CPU const restrict const restrict const restrict c 2016 Information Processing Society of Japan 6

5 / 6 local 2 local (MHz) 247.46 269.32 262.12 Logic utilization 60% 68% 39% Dedicated logic registers 31% 34% 18% Memory blocks 61% 71% 34% DSP blocks 2% 2% 2% 5 ( -1 ) 1000 E5-2680 v2 CPU gcc4.4.7 -O2 CPU OpenCL FPGA restrict warning: declaring kernel argument with no restrict may lead to low kernel performance 2 local 1 3.3 FPGA DDR global DDR global local global local 5 400 1000 local OpenCL FPGA local 2 DSP local 3.4 (SIMD ) SIMD FPGA attribute local SIMD SIMD num_simd_work_items reqd_work_group_size attribute CG SIMD c 2016 Information Processing Society of Japan 7

3 (MHz) 269.32 285.3 Logic utilization 68% 63% Dedicated logic registers 34% 31% Memory blocks 71% 68% DSP blocks 2% 2% (msec) 139.190 106.951 7 SIMD ( ) SIMD Compiler Warning: Kernel Vectorization: branching is thread ID dependent... cannot vectorize. Compiler Warning: Kernel cg : limiting to 2 concurrent work-groups because threads might reach barrier out-of-order. 3.5 2.4 1000 7 ( ) SIMD SIMD num_compute_units attribute 3 SIMD FPGA 2 FPGA SIMD 4. SIMD SIMD FPGA OpenCL CG Compiler Warning: Kernel cg : limiting to 2 concurrent OpenCL work-groups because threads might reach barrier out-of-order. OpenCL FPGA GPU CRS(Compressed Row Storage) SIMD Compiler Warning: Kernel Vectorization: branching is thread ID dependent... cannot vectorize. ID FPGA c 2016 Information Processing Society of Japan 8

JST CREST :ppopen-hpc JSPS 15K00166 Quartus II Altera University Program [1] Putnam, A. and Caulfield, A.M. and Chung, E.S. and Chiou, D. and Constantinides, K. and Demme, J. and Esmaeilzadeh, H. and Fowers, J. and Gopal, G.P. and Gray, J. and Haselman, M. and Hauck, S. and Heil, S. and Hormati, A. and Kim, J.-Y. and Lanka, S. and Larus, J. and Peterson, E. and Pope, S. and Smith, A. and Thong, J. and Xiao, P.Y. and Burger, D., A reconfigurable fabric for accelerating large-scale datacenter services, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp.13-24, 2014. [2] OpenCL - The open standard for parallel programming of heterogeneous systems https://www.khronos.org/ opencl/ [3],,, Alexander Vazhenin, Stanislav Sedukhin: FPGA, (2015-HPC-149), 2015. [4],, :, (2015-HPC-151), 2015. [5], Hamid Reza Zohouri,, : OpenCL FPGA, (2015-HPC-150), 2015. [6] Hamid Reza Zohouri, Naoya Maruyama, Aaron Smith, Motohiko Matsuda, and SatoshiMatsuoka, Optimizing the Rodinia Benchmark for FPGAs (Unrefereed Workshop Manuscript), (2015-HPC- 152), 2015. [7] K. Nakajima and M. Satoh and T. Furumura and H. Okuda and T. Iwashita and H. Sakaguchi and T. Katagiri and M. Matsumoto and S. Ohshima and H. Jitsumoto and T. Arakawa and F. Mori and T. Kitayama and A. Ida and M. Y. Matsuo and K. Fujisawa and et al., ppopen-hpc: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT), Optimization in the Real World, pp.15 35, DOI 10.1007/978-4-431-55420-2 2, 2016. [8] ppopen-hpc Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT) http://ppopenhpc.cc.u-tokyo. ac.jp/ppopenhpc/ [9] Tightly Coupled Accelerators GPU Vol.6, No.4, pp.14-25, 2013. [10] Yuetsu Kodama, Toshihiro Hanawa, Taisuke Boku and Mitsuhisa Sato, PEACH2: FPGA based PCIe network device for Tightly Coupled Accelerators, International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp. 3-8, Jun. 2014. [11] Altera Corporation, Floating-Point IP Cores User Guide, UG-01058, 2015. [12] Altera, Stratix V Device Handbook, https: //www.altera.com/en_us/pdfs/literature/hb/ stratix-v/stx5_core.pdf [13] CUDA Dynamic Parallelism, http://docs.nvidia. com/cuda/cuda-c-programming-guide/index.html# cuda-dynamic-parallelism [14] Altera Corporation, SDK for OpenCL - https://www.altera.co.jp/products/ design-software/embedded-software-developers/ opencl/overview.html [15] Altera Corporation, Altera SDK for OpenCL Programming Guide 15.1, UG-OCL002, 2015. [16] Altera Corporation, Altera SDK for OpenCL Best Practice Guide 15.1, UG-OCL003, 2015. c 2016 Information Processing Society of Japan 9