IPSJ SIG Technical Report Vol.2017-OS-141 No /7/27


Performance Evaluation of Cooperative Scheduling of Processing and I/O of Victream GPU Middleware

Jun Suzuki 1,2,a), Yuki Hayashi 1, Takuya Araki 1, Takashi Takenaka 1, Masaru Kitsuregawa 2

1 NEC System Platform Research Laboratories, NEC
2 Institute of Industrial Science, the University of Tokyo
a) j-suzuki@ax.jp.nec.com

CPU (Central Processing Unit) GPU (Graphics Processing Unit) GPU I/O (Input/Output) [1] GPU GPU Out-of-Core GPU I/O Victream GPU I/O Victream State-of-the-Art 117%

1. GPU CPU NVIDIA CUDA GPU GPU GPU GPU CPU GPU I/O 1 2 CPU GPU GPU CPU 10 GPU I/O PCIe 3.0 x 16 I/O 16 GB/s GPU 10 1 CPU GPU I/O GPU GPU I/O I/O CPU 2 GPU 10GB 2 GPU GPU Out-of-Core Out-of-Core GPU I/O GPU GPU I/O [1] GPU Out-of-Core GPU I/O Victream
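The hardware figures quoted in the text and in Tables 1 and 2 can be put side by side with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are made up for this example, and the numbers are the peak values from the tables plus the 16 GB/s quoted for PCIe 3.0 x 16, not measurements.

```cpp
#include <cassert>

// Peak figures quoted in Table 1 (Intel Xeon CPU) and Table 2 (NVIDIA Tesla P100).
// All identifiers below are hypothetical names chosen for this example.
constexpr double kCpuTflops  = 1.8;    // CPU single-precision performance
constexpr double kGpuTflops  = 9.3;    // GPU single-precision performance
constexpr double kCpuMemGBps = 85.0;   // CPU memory bandwidth
constexpr double kGpuMemGBps = 732.0;  // GPU memory bandwidth
constexpr double kGpuMemGB   = 16.0;   // GPU memory size
constexpr double kPcieGBps   = 16.0;   // PCIe 3.0 x 16, as quoted in the text

// Plain ratio of two peak figures.
constexpr double ratio(double a, double b) { return a / b; }

// Seconds needed to move `gigabytes` over a link sustaining `gb_per_s`.
double transfer_seconds(double gigabytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return gigabytes / gb_per_s;
}
```

With these figures, refilling the full 16 GB of GPU memory over the I/O bus takes about one second at peak rate, and GPU memory is roughly 46 times faster than the bus (`ratio(kGpuMemGBps, kPcieGBps)`). That gap is what makes GPU I/O the limiting factor for Out-of-Core processing.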

Table 1  Intel Xeon E7-8894 v4 [5].

  Single-precision Floating Point Performance    1.8 Tflops
  Memory Bandwidth                               85 GB/s

Table 2  NVIDIA Tesla P100 GPU [6].

  Single-precision Floating Point Performance    9.3 Tflops
  Memory Size                                    16 GB
  Memory Bandwidth                               732 GB/s
  I/O Bus                                        PCIe 3.0 x 16

Fig. 1  Victream.

Victream API (Application Programming Interface) API DAG (Directed Acyclic Graph) DAG Victream GPU DAG GPU I/O GPU I/O DAG DAG Dryad [2] Spark [3] GPU PTask [4] Victream GPU Out-of-Core Victream I/O GPU I/O GPU I/O GPU I/O [1] Victream GPU I/O State-of-the-Art Victream 117% 2 Victream 3 Victream 4 Victream 5

2. Victream

Victream Spark API GPU Victream CPU GPU C++ Victream DAG Victream 1 Victream Victream Victream API API DAG RPC (Remote Procedure Call) DAG DAG I/O GPU DAG UDF (User-Defined Function) GRDD (GPU Resilient Distributed Dataset) GRDD GPU UDF GRDD UDF GRDD DAG GRDD GPU Victream GPU GPU I/O GPU I/O GPU Out-of-Core I/O Victream GRDD GRDD / Key-Value 4 GPU I/O DAG Victream GPU GRDD 1 GRDD NVM (Nonvolatile Memory) Express

Card Victream [1]

3. Victream

3.1 Victream DAG Victream I/O GPU DAG Victream DAG DAG Victream DAG Victream GPU Out-of-Core GPU GPU I/O GPU DAG I/O DAG Victream GPU I/O GPU GPU Sundaram [7] GPU Out-of-Core I/O GPU I/O GPU GPU I/O GPU I/O GPU I/O I/O NP-hard Pseudo-Boolean (PB) Optimization GPU PB Optimization Victream GPU I/O DAG ( GPU ) API API I/O 2 (1) GPU I/O I/O (2) GPU I/O 2 DAG Victream GPU Out-of-Core (1) (2) I/O DAG DAG [4] Out-of-Core I/O 2 3 I/O DAG 1 GPU GPU I/O GPU 4 GPU 3 DAG 1-4 1 GPU 1 GPU I/O 2-4 2 GPU 1 4 4

Fig. 2  GPU.
Fig. 3  DAG.

1 1 GPU GPU GPU 3 5 1 5 GPU GPU I/O 1 5 Victream GPU Victream (1) GPU I/O (2) GPU I/O Victream 2 GPU GPU 2 GPU Victream 2 GPU GPU GPU GPU GPU GPU A 5 5 GPU A 5 GPU GPU I/O GPU A I/O GPU A GPU 9 GPU A ( 5) 5 GPU 9 5 GPU 9 GPU DAG 9 5 GPU A 5 GPU A I/O GPU A I/O 9 GPU I/O 9 GPU I/O I/O Victream GPU I/O GPU GPU GPU FIFO (First-In First-Out) GPU FIFO GPU FIFO I/O I/O GPU I/O I/O GPU FIFO FIFO DAG GPU I/O GPU GPU I/O FIFO GPU I/O I/O, I/O GPU DAG I/O

GPU Victream 2 9 Victream 9 DAG 5 5 I/O Victream GPU I/O GPU I/O Victream GPU Victream GPU I/O GPU GPU GPU Out-of-Core GPU I/O ( ) Victream GPU Out-of-Core I/O I/O GPU I/O I/O GPU I/O I/O Victream 2 DAG DAG

3.2

3.2.1 I/O Victream GPU I/O 4 I/O

Fig. 4.

    subtask get_next_subtask(gpu) {
        glob_min = iomin_subtask(global_list);
        local_min = iomin_subtask(local_list[gpu]);
        if(glob_min < local_min) {
            remove(global_list, glob_min);
            return glob_min;
        } else {
            remove(local_list[gpu], local_min);
            return local_min;
        }
    }

    void schedule() {
        foreach(g in available_gpu) {
            if(size(global_list) > 0 || size(local_list[g]) > 0) {
                if(memory_use[g] < load_threashold) {
                    st = get_next_subtask(g);
                    pipeline_dispatch(st, g);
                }
            }
        }
    }

Fig. 5.

2 DAG GPU GPU 2 GPU GPU DAG GPU 1 GPU GPU DAG GPU GPU GPU GPU GPU
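The scheduling pseudocode of Fig. 4 can be turned into a small compilable sketch. This is an illustration under stated assumptions, not Victream's implementation: `Subtask`, `io_bytes`, and `Scheduler` are hypothetical names, `iomin_subtask` is assumed to select the subtask that needs the least I/O, the update of `memory_use` on dispatch is an assumed accounting rule, and `pipeline_dispatch` is replaced by returning the chosen (subtask, GPU) pairs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical subtask record. io_bytes is the amount of data that would have
// to be transferred to the GPU before the subtask can run (an assumption).
struct Subtask {
    int id;
    std::size_t io_bytes;
};

struct Scheduler {
    std::vector<Subtask> global_list;              // subtasks any GPU may run
    std::vector<std::vector<Subtask>> local_list;  // subtasks bound to one GPU
    std::vector<std::size_t> memory_use;           // bytes in use on each GPU
    std::size_t load_threshold = 0;                // stop loading at or above this

    // Position of the subtask with the smallest I/O cost, or end() if empty.
    static std::vector<Subtask>::iterator iomin_subtask(std::vector<Subtask>& list) {
        return std::min_element(list.begin(), list.end(),
            [](const Subtask& a, const Subtask& b) { return a.io_bytes < b.io_bytes; });
    }

    // Take the cheaper of the global and GPU-local candidates. Ties go to the
    // local list, matching the `<` comparison in Fig. 4.
    std::optional<Subtask> get_next_subtask(std::size_t gpu) {
        assert(gpu < local_list.size());
        auto& local = local_list[gpu];
        auto g = iomin_subtask(global_list);
        auto l = iomin_subtask(local);
        const bool has_global = g != global_list.end();
        const bool has_local = l != local.end();
        if (!has_global && !has_local) return std::nullopt;
        if (has_global && (!has_local || g->io_bytes < l->io_bytes)) {
            Subtask st = *g;
            global_list.erase(g);
            return st;
        }
        Subtask st = *l;
        local.erase(l);
        return st;
    }

    // One scheduling pass over all GPUs. Returns (subtask id, GPU index) pairs
    // in place of the pipeline_dispatch call in Fig. 4.
    std::vector<std::pair<int, std::size_t>> schedule() {
        assert(memory_use.size() == local_list.size());
        std::vector<std::pair<int, std::size_t>> dispatched;
        for (std::size_t g = 0; g < local_list.size(); ++g) {
            if (memory_use[g] >= load_threshold) continue;  // GPU is loaded enough
            if (auto st = get_next_subtask(g)) {
                memory_use[g] += st->io_bytes;  // assumed accounting, not from the paper
                dispatched.emplace_back(st->id, g);
            }
        }
        return dispatched;
    }
};
```

Returning an empty `std::optional` when both lists are empty stands in for the non-empty check in `schedule()` of Fig. 4, so the sketch never dereferences a missing candidate.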

Victream GPU GPU I/O GPU 4 I/O I/O GPU I/O GPU DAG GPU GPU I/O I/O GPU I/O I/O I/O GPU I/O I/O Victream I/O GPU I/O GPU I/O I/O I/O Victream I/O DAG GPU DAG GPU Victream I/O GPU Victream LRU (Least Recently Used) GPU I/O

3.2.2 I/O GPU Victream GPU I/O GPU 4 FIFO I/O I/O GPU I/O I/O GPU

3.2.3 GPU I/O Victream GPU I/O GPU GPU I/O GPU I/O GPU GPU 5 GPU Out-of-Core GPU I/O

3.3 I/O I/O GPU GPU GPU FIFO GPU

4.

4.1 4 4 NVIDIA Tesla K20 GPU I/O GPU GPU 5GB 3.52 Tflops E5-2609 Xeon CPU 2

OS Ubuntu 14.04 DAG RAMdisk Victream C++ CUDA 7.5 10K Victream FIFO PTask [4] Data-Aware FIFO GPU GPU GPU I/O PTask FIFO GPU GPU GPU GPU Out-of-Core FIFO GPU Data-Aware GPU I/O FIFO PTask I/O

4.2 (Blur ) 4 Victream API Victream Out-of-Core 4 2 GPU GPU GPU Out-of-Core N GPU 1 GPU N 2 256MB GPU 70% 50% 6 1 1 GPU 6 Victream FIFO PTask PTask 92%-117% GPU Victream GPU FIFO PTask GPU GPU 6 4 GPU I/O DAG I/O Out-of-Core 9%-38% Blur GPU RAMdisk I/O Victream I/O 4 7 Out-of-Core Out-of-Core Victream PTask Out-of-Core Victream Out-of-Core

Fig. 6  Panels: (a), (b), (c) Blur, (d).
Fig. 7  Panels: (a), (b), (c) Blur, (d).

PTask FIFO PTask GPU I/O FIFO 2 Victream Out-of-Core GPU I/O

5.

[1] GPU Out-of-Core I/O Victream State-of-the-Art Victream Victream DAG GPU I/O GPU I/O GPU I/O State-of-the-Art 117% 38%

[1] Victream 2016 / / (SWoPP2016) (2016).
[2] Isard, M., Budiu, M., Yu, Y., Birrell, A. and Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks, ACM SIGOPS Operating Systems Review, Vol. 41, No. 3, ACM, pp. 59-72 (2007).
[3] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. and Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, USENIX Association, pp. 2-2 (2012).
[4] Rossbach, C. J., Currey, J., Silberstein, M., Ray, B. and Witchel, E.: PTask: operating system abstractions to manage GPUs as compute devices, Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, ACM, pp. 233-248 (2011).
[5] Xeon E7-8894 v4, https://ark.intel.com/ja/products/96900/intel-xeon- Processor-E7-8894-v4-60M-Cache-2 40 GHz
[6] NVIDIA: NVIDIA TESLA P100 GPU ACCELERATOR, http://images.nvidia.com/content/tesla/pdf/nvidiatesla-p100-pcie-datasheet.pdf.
[7] Sundaram, N., Raghunathan, A. and Chakradhar, S. T.: A framework for efficient and scalable execution of domain-specific templates on GPUs, Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, IEEE, pp. 1-12 (2009).