IPSJ SIG Technical Report Vol.2012-ARC-202 No.13 Vol.2012-HPC-137 No /12/13 Tightly Coupled Accelerators 1,a) 1,b) 1,c) 1,d) GPU HA-PACS



Tightly Coupled Accelerators 1,a) 1,b) 1,c) 1,d)

a) hanawa@ccs.tsukuba.ac.jp
b) kodama@cs.tsukuba.ac.jp
c) taisuke@cs.tsukuba.ac.jp
d) msato@cs.tsukuba.ac.jp

HA-PACS 2012 2 HA-PACS TCA (Tightly Coupled Accelerators) TCA PEACH2

1. GPU (Graphics Processing Unit) HPC GPGPU (General Purpose GPU) TOP500 [1] CPU PCI Express (PCIe) CPU HA-PACS TCA (Tightly Coupled Accelerators) [2]

1.1 HA-PACS TCA HPC () HA-PACS :TCA (Tightly Coupled Accelerators) HA-PACS TCA HA-PACS/TCA TCA PCIe TCA 8 16 TCA InfiniBand TCA InfiniBand TCA NVIDIA Kepler Tesla Tesla K20 K20 M2090 2

© 2012 Information Processing Society of Japan

[3] K20 GPUDirect Support for RDMA [4] PCIe CPU

2. HA-PACS/TCA PEACH2

2.1 PEACH2 PEACH2 HA-PACS/TCA TCA [5] PEACH2 TCA

2.1.1 PCI Express PEARL PCIe PEARL (PCI Express Adaptive and Reliable Link) [6] PCIe PC [7] Ethernet, InfiniBand I/O PCIe Gen 2 2.5GHz, 5GHz Gen 3 8GHz (2Gbps, 4Gbps, 8Gbps *1) (x4) PCIe CPU CPU Root Complex (RC) EndPoint (EP) PCIe PEARL PEARL PCIe PEACH (PCI Express Adaptive Communication Hub) [8] PCIe RC EP CPU RC PCIe PEACH PCIe PCIe [9] RC EP

*1 Gen3 128b/130b 7.88Gbps

2.1.2 PEARL 3 A A B B (1) A PCI Express A (2) A B CPU (3) B PCI Express B PEARL PCI Express A A B B

2.1.3 PEACH2 HA-PACS/TCA FPGA PEACH2 PCIe DMA FPGA PCIe Gen2 x8 4 IP Altera Stratix IV GX [10] FPGA TCA PEACH2 PCIe Gen 2 x8 4 1 CPU PCIe RC EP FPGA partial reconfiguration RC EP

2.2 PEACH2 HA-PACS/TCA 1 PEACH2 PCIe CPU RC 4, IB HCA, PEACH2 EP PCIe PEACH CPU PCIe PCIe PCIe Kepler Tesla K20 CUDA 5.0 GPUDirect Support for RDMA [11] PCIe PCIe
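The per-lane figures quoted in Section 2.1.1 (2Gbps, 4Gbps and, per footnote *1, 7.88Gbps) follow from the line encoding of each PCIe generation. The sketch below is only an illustration in Python, which the paper itself does not use; the generation names, the table layout and the function names are assumptions introduced here, and the 8b/10b encoding of the first two generations is taken from the PCIe specification [7] rather than from the surviving text.

```python
from fractions import Fraction

# Assumed table: (signalling rate in GT/s, payload bits, total bits per symbol).
# 8b/10b for the 2.5 and 5 GT/s generations, 128b/130b for Gen3 (footnote *1).
PCIE_GENERATIONS = {
    "Gen1": (Fraction(5, 2), 8, 10),
    "Gen2": (Fraction(5), 8, 10),
    "Gen3": (Fraction(8), 128, 130),
}


def effective_lane_gbps(generation):
    """Usable bit rate of one lane, in Gbps, after encoding overhead."""
    rate, payload_bits, total_bits = PCIE_GENERATIONS[generation]
    return float(rate * payload_bits / total_bits)


def link_gbytes_per_sec(generation, lanes):
    """Usable bandwidth of a multi-lane link, in Gbyte/sec per direction."""
    return effective_lane_gbps(generation) * lanes / 8


for gen in PCIE_GENERATIONS:
    print(gen, round(effective_lane_gbps(gen), 2), "Gbps per lane")
print("Gen2 x8:", link_gbytes_per_sec("Gen2", 8), "Gbyte/sec")
```

With these numbers a Gen2 x8 port, the configuration used by PEACH2, comes out at 4 Gbyte/sec per direction, which is the link rate the evaluation in Section 3.1.1 starts from.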

[Figure 1: HA-PACS/TCA node block diagram. Surviving labels: CPU, QPI, PCIe G3 x16 (0 to 3), G3 x8 IB HCA, G2 x8 links, PEACH2.]

[Figure 2: PEACH2 block diagram. Surviving labels: NIOS (CPU), DMAC, Memory, Routing function, CPU & GPU side (Endpoint), To PEACH2 (Root Complex), To PEACH2 (Root Complex / Endpoint), To PEACH2 (Endpoint).]

NVIDIA NDA GPUDirect Support for RDMA PEACH2 0 1 2 3 QPI *2 PEACH2 0, 1 1 (1) PEACH2 HA-PACS PEACH2 (2) 0, 1 0, 1 (3) 3 8 3 Xeon E5 CPU 2 PCIe 80 80 PCIe

2.3 PEACH2 PEACH2 2 4 PCIe Gen2 x8 N(orth), E(ast), W(est), S(outh) N EP, E EP, W RC PEACH2 S RC EP PEACH2 S DMA chaining DMA DMA *2 2 PEACH2 FPGA DDR3 SDRAM PEACH2 PEACH2 FPGA Altera NIOS Gigabit Ethernet, RS-232C,

2.4 DMA DMA chaining DMA PCIe Altera PCIe IP chaining DMA [12] IP PCIe 255 DMA

2.5 PCIe PEACH2 PCIe 3 PEACH2 PCIe 64bit PEACH2 (512Gbyte) *3

*3 512Gbyte BIOS

[Figures 3 to 5: PCIe address and PEACH2 diagrams. Surviving labels: PCIe, PEACH2, CPU, Node 0 to Node 3, N 0 to N 2; surviving caption: "5 PEACH2 ( )".]

4 *4 PCIe 4 4 PEACH2 N PEACH2 PEACH2

2.6 PEACH2 5 PEACH2 *4 BIOS N 3 PCIe [13] (106.7mm × 312mm) Gen2 x8 PCIe 3 (E,W,S) E, W x8, S x16 8 Altera FPGA Stratix IV GX DDR3 SO-DIMM 1 PCIe FPGA Gigabit Ethernet JTAG HA-PACS/TCA PEACH2 PCIe Gen2 IP 250MHz

2.7 TCA TCA NVIDIA CUDA [14] CUDA 4.0 PCI UVA (Unified Virtual Addressing) *5 PCI CUDA cudaMemcpyPeer() UVA cudaMemcpy() TCA chaining DMA API

3. PEACH2 FPGA 1 HA-PACS

*5 GPUDirect 1 GPUDirect Peer-To-Peer Transfers and Memory Access

Table 1
CPU: Xeon E5 2.6GHz x 2
Memory: 128GB
GPU: NVIDIA K20
Motherboard: Intel S2600IP, W2600CR; SuperMicro X9DRG-QF
OS: Linux, CentOS6.3, kernel-2.6.32-279.{9,14}.1.el6.x86_64
GPU driver: NVIDIA-Linux-x86_64-304.{51,64}
CUDA: CUDA 5.0

PEACH2 GPUDirect Support for RDMA

3.1 DMA PEACH2 FPGA DMA (1) FPGA PCIe (2) Chaining DMA chaining DMA DMA FPGA CPU (TSC)

3.1.1 PEACH2-CPU PEACH2 DMA streaming PEACH2 chaining DMA DMA write PEACH2 DMA read PEACH2 DMA write, DMA read 255 6 1 DMA 4Kbyte 3.3 Gbyte/sec DMA write PEACH2 DMA CPU PCIe PCIe Gen2 x8 4Gbyte/sec PEACH2-256byte 16byte 2byte, LCRC 4byte, 1byte, 1byte

$4\ \mathrm{Gbyte/sec} \times \frac{256}{256+16+2+4+1+1} = 3.66\ \mathrm{Gbyte/sec}$

93% PEACH2 PCIe IP Chaining DMA DMA write DMA read CPU write 2Kbyte write chaining DMA 1 7 255 DMA 8 4Kbyte 4 75 80% 2 7 8Kbyte

3.1.2 PEACH2-GPU (K20) FPGA chaining DMA CUDA driver API (1) cuMemAlloc() (2) () PCIe (3) FPGA chaining DMA (4) cuMemHostAlloc() cuMemcpyDtoH() FPGA (3) PEACH2-CPU DMA PCIe (BAR) CPU 6 7 255 1 4Kbyte DMA CPU DMA CPU DMA 830MB/s CPU PCIe QPI
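The theoretical peak used in Section 3.1.1 can be checked with a few lines of arithmetic. The following Python sketch is only an illustration and not part of the paper's tooling. The byte counts are the ones quoted in the text, but the field names other than LCRC (header, sequence number, start and end framing) are assumptions based on the usual PCIe Gen2 packet layout, because those labels did not survive in this transcript.

```python
# Link rate of PCIe Gen2 x8 after 8b/10b encoding, as quoted in the text.
LINK_GBYTES_PER_SEC = 4.0

# Payload carried by one packet, as quoted in the text.
PAYLOAD_BYTES = 256

# Per-packet overhead in bytes. Only "lcrc" is named in the surviving text;
# the other keys are assumed labels for the 16, 2, 1 and 1 byte fields.
OVERHEAD_BYTES = {
    "tlp_header": 16,
    "sequence_number": 2,
    "lcrc": 4,
    "start_framing": 1,
    "end_framing": 1,
}


def peak_payload_bandwidth(link_rate, payload, overhead):
    """Payload bandwidth when every packet carries `payload` bytes."""
    wire_bytes = payload + sum(overhead.values())
    return link_rate * payload / wire_bytes


peak = peak_payload_bandwidth(LINK_GBYTES_PER_SEC, PAYLOAD_BYTES, OVERHEAD_BYTES)
print(f"{peak:.2f} Gbyte/sec")  # 3.66 Gbyte/sec
```

Changing `PAYLOAD_BYTES` shows how strongly the achievable bandwidth depends on the maximum payload size: the 24 bytes of overhead are paid once per packet regardless of how much data it carries.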

[Figure 6: PEACH2-CPU, DMA (255). x axis: Bytes; y axis: Mbytes/sec. Series: CPU(write), CPU(read), K20(write), K20(read).]

[Figure 7: PEACH2-CPU, DMA (1). x axis: Bytes; y axis: Mbytes/sec. Series: CPU(write), CPU(read), K20(write), K20(read).]

[Figure 8: PEACH2-CPU, DMA (4096byte). x axis: Counts; y axis: Mbytes/sec. Series: CPU(write), CPU(read), K20(write), K20(read).]

[Figure 9: PEACH2-CPU, CPU DMA (255). x axis: Bytes; y axis: Mbytes/sec. Series: CPU(write), CPU(read), RemoteCPU (write).]

DMA 100MB/s

3.2 PEACH2 PEACH2 DMA PIO (CPU store) (1) 1PEACH2 2 (A, B) A, B PCIe (2) PEACH2 (3) A PCIe B 4byte store (4) A B (5) B (6) (2) 1 PC 782ns DMA PEACH2 CPU DMA 9 255 PEACH2 4Kbyte DMA DMA DMA PEACH2 PEACH2 DMA

DMA DMA

4. PCIe Non transparent bridge (NTB) [15] NTB PCIe EP 2 Endpoint 2 RC NTB PCIe PCIe BIOS EP PEACH2 PCIe -PEACH2 APEnet+ [16], [17] FPGA 3D Torus Fermi GPUDirect APEnet+ QSFP+ PCIe 2.2 NVIDIA CUDA 5 CUDA Computing Capability 3.5 GPUDirect Support for RDMA RDMA [4] InfiniBand [18] PEACH2 GPUDirect Support for RDMA InfiniBand PCIe TCA MPI

5. HA-PACS/TCA PEACH2 DMA PEACH2 2012 2012 8 HA-PACS TCA API TCA 2013 10 HA-PACS/TCA 800TFlops 1PFlops HA-PACS TCA Joel Scherpelz NVIDIA NVIDIA JAPAN JST-CREST

[1] Dongarra, J., Meuer, H., Strohmaier, E. and Simon, H.: TOP500 List, http://www.top500.org/.
[2] Tightly Coupled Accelerators ( ) Vol. 2012-ARC-201, No. 26, pp. 1-8 (2012).
[3] NVIDIA Corp.: NVIDIA Tesla Kepler Computing Accelerators. http://www.nvidia.co.jp/content/tesla/pdf/tesla- KSeriesOverviewLR.pdf.
[4] NVIDIA Corp.: NVIDIA GPUDirect. http://developer.nvidia.com/gpudirect.
[5] HA-PACS () Vol. 2011-HPC-130, No. 21, pp. 1-7 (2011).
[6] Hanawa, T., Boku, T., Miura, S., Okamoto, T., Sato, M. and Arimoto, K.: Low-Power and High-Performance Communication Mechanism for Dependable Embedded Systems, Proceedings of 2008 International Workshop on Innovative Architecture for Future Generation Processors and Systems, pp. 67-73 (2008).
[7] PCI-SIG: PCI Express Base Specification, Rev. 3.0 (2010).
[8] Otani, S., Kondo, H., Nonomura, I., Uemura, M., Hayakawa, Y., Oshita, T., Kaneko, S., Asahina, K., Arimoto, K., Miura, S., Hanawa, T., Boku, T. and Sato, M.: An 80Gbps Dependable Communication SoC with PCI Express I/F and 8 CPUs, 2011 IEEE International Solid-State Circuits Conference, pp. 266-267 (2011).
[9] PCI-SIG: PCI Express External Cabling Specification, Rev. 1.0 (2007).

[10] Altera Corp.: Stratix IV Device Handbook. http://www.altera.co.jp/literature/lit-stratix-iv.jsp.
[11] NVIDIA Corp.: Developing A Linux Kernel Module Using RDMA For GPUDirect. http://developer.download.nvidia.com/compute/cuda/ 5 0/rc/docs/Direct RDMA.pdf.
[12] Altera Corp.: IP Compiler for PCI Express user guide. http://www.altera.com/literature/ug/ug pci express.pdf.
[13] PCI-SIG: PCI Express Card Electromechanical (CEM) Specification, Rev. 2.0 (2007).
[14] NVIDIA Corp.: NVIDIA CUDA: Compute Unified Device Architecture. http://developer.nvidia.com/category/zone/cuda-zone.
[15] Gudmundson, J.: Enabling Multi-Host System Designs with PCI Express Technology, http://www.plxtech.com/products/expresslane/techinfo (2004).
[16] Ammendola, R. et al.: APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters, Journal of Physics, Conference Series, Vol. 331, Part 5, No. 5 (2011).
[17] Rosetti, D. et al.: Leveraging NVIDIA GPUDirect on APEnet+ 3D Torus Cluster Interconnect (2012). http://developer.download.nvidia.com/gtc/pdf/ GTC2012/PresentationPDF/S0282-GTC2012-- Torus-Cluster.pdf.
[18] Mellanox Technologies: Mellanox OFED GPUDirect, http://www.mellanox.com/content/pages.php?pg= products dyn&product family=116&menu section=34.