IPSJ SIG Technical Report Vol.2016-HPC-155 No /8/10

FPGA 1,a) 1 1 1

FPGA(Field Programmable Gate Array) FPGA OpenCL FPGA FPGA OpenCL FPGA

1. CPU GPGPU HPC FPGA (Field Programmable Gate Array) FPGA FPGA FPGA Catapult[1] HPC FPGA [3], [4] FPGA Verilog HDL (HDL) FPGA 1 a) hanawa@cc.u-tokyo.ac.jp FPGA OpenCL OpenCL GPU [2] FPGA Verilog HDL OpenCL HPC FPGA [5], [6], [7], [8] CPU GPU [9], [10] FPGA [11], [12] HPC FPGA OpenCL FPGA 2 OpenCL FPGA 3

4 5

2. OpenCL FPGA

2.1 OpenCL FPGA FPGA Verilog HDL VHDL C Fortran FPGA HPC FPGA OpenCL FPGA HPC Altera FPGA Stratix V OpenCL Verilog HDL OpenCL FPGA OpenCL Khronos GPU HPC AMD GPU CPU Xeon Phi NVIDIA GPU Altera Stratix V CPU FPGA ARM IP FPGA (Xilinx Zynq, Altera Arria SoC ) OpenCL FPGA CPU PCI Express OpenCL FPGA Altera Stratix V FPGA PCI Express GPU I/O PCI Express *1 *1 Intel Altera FPGA (Partial reconfiguration) FPGA PCI Express DDR PCI Express OpenCL PCI Express FPGA OpenCL FPGA FPGA MB PCI Express FPGA Altera Stratix V Bittware PCI Express S5-PCIe-HQ (s5phq d5) ( 1) FPGA 1 Adaptive Logic Module (ALM) 172,600 4 2 6 Look Up Table (LUT) 2 FPGA 2,014 20Kbit RAM (M20K) 640bit Memory Logic Array Block (MLAB) 8,630 Digital Signal Processor (DSP) 27 1,590 DSP Stratix V ALM RAM *2 [13][14]

OpenCL Altera Offline Compiler ( aoc ) aoc -c kernel.cl
(1) OpenCL
(2) ( DSP )
(3) PCI Express DDR3-DRAM QPI

*2 Arria 10, Stratix 10 DSP

Table 1 FPGA

    FPGA                  Altera Stratix V GS D5 (5SGSMD5K2F40C2)
    #Logic units (ALMs)   172,600
    #RAM blocks (M20K)    2,014
    #DSP blocks           1,590 (27 × 27)
    Board                 Bittware S5-PCIe-HQ GSMD5
    DDR capacity          (4 + 4) GB
    DDR bandwidth         25.6 GB/sec
    PCIe I/F              Gen3 x8 (OpenCL Gen2 x8)
    Software              Altera Quartus II 16.0.1 OpenCL SDK, Altera Offline Compiler

OpenCL Verilog HDL
(4) kernel.aoco

aoc kernel.aoco
(1) Quartus (Altera FPGA )
(2) FPGA kernel.aocx

aoco aocx Quartus FPGA 1 Intel Xeon E5 (Haswell ) 2 OpenCL FPGA FPGA --report -c FPGA

Figure 1 Bittware S5-PCIe-HQ (Bittware QDR II+)

2.2 FPGA OpenCL OpenCL C++ (API ) GPU OpenCL CUDA[15] OpenCL CPU

2 FPGA OpenCL OpenMP GPU CUDA OpenCL FPGA FPGA OpenCL 2.0 Altera Offline Compiler 16.0 2.0 2 OpenCL FPGA FPGA FPGA () kernel global CUDA CUDA Driver API OpenCL CUDA FPGA FPGA OpenCL FPGA GPU OpenCL GPU FPGA GPU OpenCL FPGA GPU

2.3 Altera FPGA Altera [17] [18] (SIMD )

2.3.1 2.1 FPGA OpenCL global DDR local RAM ( local ) 2

2.4 OpenCL for while Altera OpenCL Compiler (AOC) 3 for FPGA

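    ================================================================================
    *** Optimization Report ***
    ...
    ================================================================================
    Kernel: hacapk_body
    ================================================================================
    The kernel is compiled for single work-item execution.
    Loop Report:
    + Loop "Block1" (file hacapk-calc0.cl line 36)
      NOT pipelined due to:
        Loop structure: loop contains divergent inner loops.
      ...
    -+ Loop "Block4" (file hacapk-calc0.cl line 53)
       Pipelined with successive iterations launched every 2 cycles due to:
       ...
    -+ Loop "Block5" (file hacapk-calc0.cl line 55)
       Pipelined with successive iterations launched every 8 cycles due to:
       ...
    -+ Loop "Block9" (file hacapk-calc0.cl line 62)
       Pipelined well. Successive iterations are launched every cycle.

Figure 3 AOC 1 0 (single stream) for

2.4.1 CPU FPGA

2.4.2 (SIMD, ) FPGA GPU OpenCL clEnqueueNDRangeKernel FPGA FPGA ID API ID CUDA GPU GPU FPGA OpenCL FPGA attribute num_simd_work_items(4) SIMD 4 num_compute_units(4) 4

3.

3.1
Let $\bar{A} \in \mathbb{R}^{N \times N}$, and let $A$ be an approximation of $\bar{A}$. The row and column index sets of the $N \times N$ matrix are $I := \{1, \dots, N\}$ and $J := \{1, \dots, N\}$. $M$ is a partition of $I \times J$ in which every $m \in M$ is a product $m = s_m \times t_m$ with $s_m \subseteq I$ and $t_m \subseteq J$. For each $m$, the submatrix of $A$ on $s_m \times t_m$ is

$$A|^{m}_{s_m \times t_m} \in \mathbb{R}^{\#s_m \times \#t_m} \tag{1}$$

where $\#$ denotes the number of elements of a set. A submatrix $A|^{m}_{s_m \times t_m}$ can be replaced by a low-rank approximation $\tilde{A}|^{m}$:

$$\tilde{A}|^{m} := V_m W_m, \quad V_m \in \mathbb{R}^{\#s_m \times r_m}, \quad W_m \in \mathbb{R}^{r_m \times \#t_m}, \quad r_m \le \min(\#s_m, \#t_m) \tag{2}$$

When $r_m$ is small relative to $N$, storing $\tilde{A}|^{m}$ as $V_m$ and $W_m$ takes less memory than storing $A|^{m}_{s_m \times t_m}$ (Figure 4).

Figure 4 $A|^{m}_{s_m \times t_m}$, $\tilde{A}|^{m}$, $A$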

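Table 2

                 100ts      216h    human 1x1
                101250     21600        19664
                222274     50098        46618
                 89534     17002        16202
                132740     33096        20416

The storage $N(M)$ is the sum of the per-submatrix storage $N(m)$:

$$N(M) = \sum_{m \in M} N(m) \tag{3}$$

$$N(m) := \begin{cases} \#s_m \times \#t_m & \text{if } m \text{ is stored as a dense submatrix} \\ r_m (\#s_m + \#t_m) & \text{if } m \text{ is stored as a low-rank approximation} \end{cases} \tag{4}$$

When $r_m$ is much smaller than $\#s_m$ and $\#t_m$, $r_m (\#s_m + \#t_m)$ is much smaller than $\#s_m \times \#t_m$.

3.2
$$Ax \rightarrow y, \quad x, y \in \mathbb{R}^{N} \tag{5}$$

$y$ is computed submatrix by submatrix. For a dense submatrix $A|^{m}_{s_m \times t_m}$:

$$A|^{m}_{s_m \times t_m} \, x|_{t_m} \rightarrow \hat{y}|_{s_m} \tag{6}$$

Here $x|_{t_m}$ is the part of $x$ indexed by $t_m$, with $\#t_m$ elements, and $\hat{y}|_{s_m}$ has $\#s_m$ elements and corresponds to the part of $y$ indexed by $s_m$. For a low-rank submatrix $\tilde{A}|^{m}$, a vector $c \in \mathbb{R}^{r_m}$ is used, and $\tilde{A}|^{m} x|_{t_m} = V_m W_m x|_{t_m}$ is computed in two steps:

$$W_m \, x|_{t_m} \rightarrow c|_{r_m} \tag{7}$$

$$V_m \, c|_{r_m} \rightarrow \hat{y}|_{s_m} \tag{8}$$

Finally, the $\hat{y}|_{s_m}$ are summed into $y$:

$$\sum_{m \in M} \hat{y}|_{s_m} \rightarrow y \tag{9}$$

3.3 ppOpen-APPL/BEM ver.0.4.0 HACApK 1.0.0 [10] ppOpen-APPL/BEM JST CREST : ppOpen-HPC [9] 1 (Boundary Element Method, BEM) HACApK [19] ACA ACA+ [20] HACApK Fortran90 C

3.4 2 3 ( )

4.

4.1 [7] FPGA CG FPGA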
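The storage count in Eqs. (3) and (4) can be illustrated with a minimal C sketch. The type `block` and the functions `block_storage` and `total_storage` are hypothetical names introduced only for this illustration; they are not part of HACApK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block descriptor, for illustration only (not a HACApK type). */
typedef enum { BLOCK_LOW_RANK, BLOCK_DENSE } block_kind;

typedef struct {
    block_kind kind;
    size_t ns; /* #s_m: number of rows of the submatrix */
    size_t nt; /* #t_m: number of columns of the submatrix */
    size_t r;  /* r_m: rank, used only for low-rank blocks */
} block;

/* N(m) of Eq. (4): number of stored matrix entries for one submatrix. */
size_t block_storage(const block *m) {
    assert(m != NULL);
    if (m->kind == BLOCK_LOW_RANK) {
        /* Eq. (2) requires r_m <= min(#s_m, #t_m). */
        assert(m->r <= (m->ns < m->nt ? m->ns : m->nt));
        return m->r * (m->ns + m->nt);
    }
    return m->ns * m->nt;
}

/* N(M) of Eq. (3): sum of N(m) over all submatrices. */
size_t total_storage(const block *blocks, size_t count) {
    size_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += block_storage(&blocks[i]);
    }
    return sum;
}
```

For example, a 100 × 80 submatrix of rank 4 needs 4 × (100 + 80) = 720 entries in low-rank form, against 8,000 entries when stored densely.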

FPGA Intel Xeon E5-2680v2 (IvyBridge) 2 PCI Express 2.1 Bittware Stratix V FPGA S5-PCIe-HQ HACApK HACApK_adot_body_lfmtx C HACApK C FPGA OpenCL OpenCL FPGA FPGA OpenCL

4.2 0: FPGA C OpenCL 5 kernel global FPGA DDR3 zbu local FPGA 0 ( 1 ) ( 2 ) ( 3 ) 3 4 CPU Intel Xeon E5-2680v2 1 0 CPU 1 126

     1  for(ip=0; ip<nlf; ip++){
     2    sttmp=st_lf+ip;
     3    ndl=sttmp->ndl; ndt=sttmp->ndt;
     4    nstrtl=sttmp->nstrtl; nstrtt=sttmp->nstrtt;
     5    if(sttmp->ltmtx==1){                    /* low-rank block of rank kt */
     6      kt=sttmp->kt;
     7      for(il=0; il<kt; il++){               /* zbu = a1 * zu */
     8        zbu[il] = 0.0;
     9        for(it=0; it<ndt; it++){
    10          itt=it+nstrtt-1;
    11          itl=it+il*ndt + sttmp->offset_a1;
    12          zbu[il] += a1[itl]*zu[itt];
    13      } }
    14      for(il=0; il<kt; il++){               /* zau += a2 * zbu */
    15        for(it=0; it<ndl; it++){
    16          ill=it+nstrtl-1;
    17          itl=it+il*ndl + sttmp->offset_a2;
    18          zau[ill] += a2[itl]*zbu[il];
    19      } }
    20    } else if(sttmp->ltmtx==2){             /* dense block */
    21      for(il=0; il<ndl; il++){
    22        ill=il+nstrtl-1;
    23        for(it=0; it<ndt; it++){
    24          itt=it+nstrtt-1;
    25          itl=it+il*ndt + sttmp->offset_a1;
    26          zau[ill] += a1[itl]*zu[itt];
    27  } } } }

Figure 5

4.3 1: 5 7 13 21 27 ltmtx 1 2 il kt ndt 1
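To try the loop nest above on a small input, a self-contained C sketch along the following lines can be used. The struct `leaf`, its field types, and the function name `hmat_matvec` are assumptions made for this illustration; only the field and array names are taken from the listing, and the scratch array `zbu` is assumed to hold at least as many elements as the largest rank `kt`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed leaf descriptor: field names follow the listing, types are guesses. */
typedef struct {
    int ltmtx;          /* 1: low-rank block, 2: dense block */
    int kt;             /* rank, used only by low-rank blocks */
    int ndl, ndt;       /* number of rows and columns of the block */
    int nstrtl, nstrtt; /* 1-based index of the first row and column */
    int offset_a1;      /* start of the block's data in a1 */
    int offset_a2;      /* start of the block's data in a2 */
} leaf;

/* Accumulates the block-wise product into zau (hypothetical wrapper). */
void hmat_matvec(const leaf *st_lf, int nlf,
                 const double *a1, const double *a2,
                 const double *zu, double *zau, double *zbu) {
    assert(st_lf != NULL && a1 != NULL && zu != NULL && zau != NULL);
    for (int ip = 0; ip < nlf; ip++) {
        const leaf *m = &st_lf[ip];
        const double *x = zu + (m->nstrtt - 1); /* part of the input vector */
        double *y = zau + (m->nstrtl - 1);      /* part of the output vector */
        if (m->ltmtx == 1) {
            assert(a2 != NULL && zbu != NULL);
            const double *w = a1 + m->offset_a1; /* kt x ndt, row-major */
            const double *v = a2 + m->offset_a2; /* kt x ndl, row-major */
            for (int il = 0; il < m->kt; il++) { /* first step: zbu = w * x */
                double c = 0.0;
                for (int it = 0; it < m->ndt; it++) {
                    c += w[it + il * m->ndt] * x[it];
                }
                zbu[il] = c;
            }
            for (int il = 0; il < m->kt; il++) { /* second step: y += v^T * zbu */
                for (int it = 0; it < m->ndl; it++) {
                    y[it] += v[it + il * m->ndl] * zbu[il];
                }
            }
        } else if (m->ltmtx == 2) {
            const double *a = a1 + m->offset_a1; /* ndl x ndt, row-major */
            for (int il = 0; il < m->ndl; il++) {
                double acc = 0.0;
                for (int it = 0; it < m->ndt; it++) {
                    acc += a[it + il * m->ndt] * x[it];
                }
                y[il] += acc;
            }
        }
    }
}
```

With one dense 1 × 3 block holding (1, 2, 3) for row 1 and one rank-1 block for rows 2 to 3 (row factor (1, 1, 1), column factor (2, 3)), the input (1, 2, 3) should yield (14, 12, 18). Unlike the listing, the sketch accumulates each dot product in a local variable before writing to memory; this is a choice made here for readability and is not claimed to match any of the implementations evaluated in the report.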

zau 1 ip zau 7 il 9 it il zu[itt]

Table 3

                         0            1            2            3
    Logic utilization    29%          26%          28%          26%
    DSP blocks           9            4            6            2
    Memory bits          16%          18%          14%          15%
    RAM block            608 (30%)    630 (31%)    536 (27%)    560 (28%)
    fmax                 246.18       244.73       269.25       268.95

4.4 2: 1 7 13 21 27 1 1 zbu

4.5 3: 1 14 19 15 it 14 il il zu[itt]

4.6 10 16 CPU 1 CPU 8 global constant

Table 4 (ms)

                   0          1          2          3          CPU
    100ts          62597.0    5540.9     57661.3    4848.3     494.2
    216h           8705.1     808.2      7904.0     684.0      68.7
    human 1x1      8762.6     676.9      7962.5     547.3      69.6

DDR3

5. FPGA OpenCL 16 CPU 1 1/8 FPGA FPGA FPGA CPU

JSPS 15K00166 (JST/CREST), German Priority Programme 1648 Software for Exascale Computing (SPPEXA-II)

Quartus II Altera University Program

[1] Putnam, A., Caulfield, A.M., Chung, E.S., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G.P., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J.-Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P.Y. and Burger, D., A reconfigurable fabric for accelerating large-scale datacenter services, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp.13-24, 2014.
[2] OpenCL - The open standard for parallel programming of heterogeneous systems, https://www.khronos.org/opencl/
[3] ,,, Alexander Vazhenin, Stanislav Sedukhin: FPGA, (2015-HPC-149), 2015.
[4] ,, :, (2015-HPC-151), 2015.
[5] , Hamid Reza Zohouri,, : OpenCL FPGA, (2015-HPC-150), 2015.
[6] Hamid Reza Zohouri, Naoya Maruyama, Aaron Smith, Motohiko Matsuda and Satoshi Matsuoka, Optimizing the Rodinia Benchmark for FPGAs (Unrefereed Workshop Manuscript), (2015-HPC-152), 2015.
[7] FPGA (2016-HPC-153) 2016.
[8] OpenCL FPGA (2016-HPC-154) 2016.
[9] K. Nakajima, M. Satoh, T. Furumura, H. Okuda, T. Iwashita, H. Sakaguchi, T. Katagiri, M. Matsumoto, S. Ohshima, H. Jitsumoto, T. Arakawa, F. Mori, T. Kitayama, A. Ida, M. Y. Matsuo, K. Fujisawa, et al., ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT), Optimization in the Real World, pp.15-35, DOI 10.1007/978-4-431-55420-2_2, 2016.
[10] ppOpen-HPC: Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta-Scale Supercomputers with Automatic Tuning (AT), http://ppopenhpc.cc.u-tokyo.ac.jp/ppopenhpc/
[11] Tightly Coupled Accelerators GPU Vol.6, No.4, pp.14-25, 2013.
[12] Yuetsu Kodama, Toshihiro Hanawa, Taisuke Boku and Mitsuhisa Sato, PEACH2: FPGA based PCIe network device for Tightly Coupled Accelerators, International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp.3-8, Jun. 2014.
[13] Altera Corporation, Floating-Point IP Cores User Guide, UG-01058, 2015.
[14] Altera, Stratix V Device Handbook, https://www.altera.com/en_us/pdfs/literature/hb/stratix-v/stx5_core.pdf
[15] CUDA Dynamic Parallelism, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
[16] Altera Corporation, SDK for OpenCL, https://www.altera.co.jp/products/design-software/embedded-software-developers/opencl/overview.html
[17] Altera Corporation, Altera SDK for OpenCL Programming Guide 16.0, UG-OCL002, 2016.
[18] Altera Corporation, Altera SDK for OpenCL Best Practice Guide 16.0, UG-OCL003, 2016.
[19] A. Ida, T. Iwashita, T. Mifune and Y. Takahashi, Parallel Hierarchical Matrices with Adaptive Cross Approximation on Symmetric Multiprocessing Clusters, Journal of Information Processing, Vol.22, pp.642-650, 2014.
[20] Börm S., Grasedyck L. and Hackbusch W.: Hierarchical Matrices, Lecture Note, Max-Planck-Institut fur Mathematik, (2006).