IPSJ SIG Technical Report Vol.2017-OS-141 No /7/27

Performance Evaluation of Cooperative Scheduling of Processing and I/O of Victream GPU Middleware

Jun Suzuki 1,2,a) Yuki Hayashi 1 Takuya Araki 1 Takashi Takenaka 1 Masaru Kitsuregawa 2

1 NEC System Platform Research Laboratories, NEC
2 Institute of Industrial Science, the University of Tokyo
a) j-suzuki@ax.jp.nec.com

CPU(Central Processing Unit) GPU(Graphics Processing Unit) GPU I/O(Input/Output) [1] GPU GPU Out-of-Core GPU I/O Victream GPU I/O Victream State-of-the-Art 117%

1. GPU CPU NVIDIA CUDA GPU GPU GPU GPU CPU GPU I/O 1 2 CPU GPU GPU CPU 10 GPU I/O PCIe 3.0 x 16 I/O 16 GB/s GPU 10 1 CPU GPU I/O GPU GPU I/O I/O CPU 2 GPU 10GB 2 GPU GPU Out-of-Core Out-of-Core GPU I/O GPU GPU I/O [1] GPU Out-of-Core GPU I/O Victream

Table 1 Intel Xeon E7-8894 v4 [5].
  Single-precision Floating Point Performance   1.8 Tflops
  Memory Bandwidth                              85 GB/s

Table 2 NVIDIA Tesla P100 GPU [6].
  Single-precision Floating Point Performance   9.3 Tflops
  Memory Size                                   16 GB
  Memory Bandwidth                              732 GB/s
  I/O Bus                                       PCIe 3.0 x 16

Figure 1 Victream.

Victream API(Application Programming Interface) API DAG(Directed Acyclic Graph) DAG Victream GPU DAG GPU I/O GPU I/O DAG DAG Dryad[2] Spark[3] GPU PTask[4] Victream GPU Out-of-Core Victream I/O GPU I/O GPU I/O GPU I/O [1] Victream GPU I/O State-of-the-Art Victream 117% 2 Victream 3 Victream 4 Victream 5

2. Victream Victream Spark API GPU Victream CPU GPU C++ Victream DAG Victream 1 Victream Victream Victream API API DAG RPC(Remote Procedure Call) DAG DAG I/O GPU DAG UDF(User-Defined Function) GRDD(GPU Resilient Distributed Dataset) GRDD GPU UDF GRDD UDF GRDD DAG GRDD GPU Victream GPU GPU I/O GPU I/O GPU Out-of-Core I/O Victream GRDD GRDD / Key-Value 4 GPU I/O DAG Victream GPU GRDD 1 GRDD NVM(Nonvolatile Memory) Express

Card Victream [1]

3. Victream

3.1 Victream DAG Victream I/O GPU DAG Victream DAG DAG Victream DAG Victream GPU Out-of-Core GPU GPU I/O GPU DAG I/O DAG Victream GPU I/O GPU GPU Sundaram [7] GPU Out-of-Core I/O GPU I/O GPU GPU I/O GPU I/O GPU I/O I/O NP-hard Pseudo-Boolean (PB) Optimization GPU PB Optimization Victream GPU I/O DAG ( GPU ) API API I/O 2 (1)GPU I/O 2 GPU. 3 DAG. I/O (2)GPU I/O 2 DAG Victream GPU Out-of-Core (1) (2) I/O DAG DAG [4] Out-of-Core I/O 2 3 I/O DAG 1 GPU GPU I/O GPU 4 GPU 3 DAG 1-4 1 GPU 1 GPU I/O 2-4 2 GPU 1 4 4

1 1 GPU GPU GPU 3 5 1 5 GPU GPU I/O 1 5 Victream GPU Victream (1) GPU I/O (2) GPU I/O Victream 2 GPU GPU 2 GPU Victream 2 GPU GPU GPU GPU GPU GPU A 5 5 GPU A 5 GPU GPU I/O GPU A I/O GPU A GPU 9 GPU A ( 5) 5 GPU 9 5 GPU 9 GPU DAG 9 5 GPU A 5 GPU A I/O GPU A I/O 9 GPU I/O 9 GPU I/O I/O Victream GPU I/O GPU GPU GPU FIFO(First-In First-Out) GPU FIFO GPU FIFO I/O I/O GPU I/O I/O GPU FIFO FIFO DAG GPU I/O GPU GPU I/O FIFO GPU I/O I/O, I/O GPU DAG I/O 4

GPU Victream 2 9 Victream 9 DAG 5 5 I/O Victream GPU I/O GPU I/O Victream GPU Victream GPU I/O GPU GPU GPU Out-of-Core GPU I/O ( ) Victream GPU Out-of-Core I/O I/O GPU I/O I/O GPU I/O I/O Victream 2 DAG DAG

3.2

3.2.1 I/O Victream GPU I/O 4 I/O 4.

    subtask get_next_subtask(gpu) {
        // Find the minimum-I/O subtask in the global list and in this GPU's local list.
        glob_min = iomin_subtask(global_list);
        local_min = iomin_subtask(local_list[gpu]);
        // Take whichever needs less I/O; on a tie the local subtask is chosen.
        if (glob_min < local_min) {
            remove(global_list, glob_min);
            return glob_min;
        } else {
            remove(local_list[gpu], local_min);
            return local_min;
        }
    }

    void schedule() {
        foreach (g in available_gpu) {
            // Dispatch only when some subtask is waiting for g
            // and g's memory use is below the load threshold.
            if (size(global_list) > 0 || size(local_list[g]) > 0) {
                if (memory_use[g] < load_threshold) {
                    st = get_next_subtask(g);
                    pipeline_dispatch(st, g);
                }
            }
        }
    }

5. 2 DAG GPU GPU 2 GPU GPU DAG GPU 1 GPU GPU DAG GPU GPU GPU GPU GPU
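The get_next_subtask / schedule pseudocode above can be made concrete as a small C++17 sketch. This is only an illustration under stated assumptions, not Victream's implementation: the Subtask record, its io_bytes field, the Scheduler class, and the 0.7 default for load_threshold are all hypothetical, and pipeline_dispatch() is replaced by returning the chosen (GPU, subtask) pairs because the source does not specify its behavior. Empty lists are handled explicitly, which the pseudocode leaves implicit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical subtask record; the source only implies that each subtask
// has an I/O cost that iomin_subtask() can compare.
struct Subtask {
    int id;
    std::size_t io_bytes;  // assumed: bytes to transfer before the subtask can run
};

class Scheduler {
public:
    std::vector<Subtask> global_list;              // runnable on any GPU
    std::vector<std::vector<Subtask>> local_list;  // runnable on one specific GPU
    std::vector<double> memory_use;                // fraction of each GPU's memory in use
    double load_threshold = 0.7;                   // assumed value, not from the source

    explicit Scheduler(std::size_t num_gpus)
        : local_list(num_gpus), memory_use(num_gpus, 0.0) {}

    // Remove and return the subtask with the least I/O among the global list
    // and this GPU's local list; ties go to the local list, as in the pseudocode.
    std::optional<Subtask> get_next_subtask(std::size_t gpu) {
        assert(gpu < local_list.size());
        auto& local = local_list[gpu];
        auto g = iomin(global_list);
        auto l = iomin(local);
        const bool has_g = g != global_list.end();
        const bool has_l = l != local.end();
        if (!has_g && !has_l) return std::nullopt;  // nothing to run
        if (has_g && (!has_l || g->io_bytes < l->io_bytes)) {
            Subtask st = *g;
            global_list.erase(g);
            return st;
        }
        Subtask st = *l;
        local.erase(l);
        return st;
    }

    // One scheduling pass. Returns (gpu, subtask id) pairs instead of calling
    // pipeline_dispatch(), whose behavior the source does not specify.
    std::vector<std::pair<std::size_t, int>> schedule() {
        std::vector<std::pair<std::size_t, int>> dispatched;
        for (std::size_t g = 0; g < local_list.size(); ++g) {
            if (memory_use[g] >= load_threshold) continue;  // GPU memory too full
            if (auto st = get_next_subtask(g)) dispatched.emplace_back(g, st->id);
        }
        return dispatched;
    }

private:
    using Iter = std::vector<Subtask>::iterator;
    // Iterator to the element with the smallest io_bytes, or end() if v is empty.
    static Iter iomin(std::vector<Subtask>& v) {
        return std::min_element(v.begin(), v.end(),
            [](const Subtask& a, const Subtask& b) { return a.io_bytes < b.io_bytes; });
    }
};
```

A caller would fill global_list and local_list, update memory_use as transfers complete, and call schedule() whenever a GPU becomes available. A GPU whose memory use is at or above the threshold is skipped for that pass, so its local subtasks simply wait.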

Victream GPU GPU I/O GPU 4 I/O I/O GPU I/O GPU DAG GPU GPU I/O I/O GPU I/O I/O I/O GPU I/O I/O Victream I/O GPU I/O GPU I/O I/O I/O Victream I/O DAG GPU DAG GPU Victream I/O GPU Victream LRU(Least Recently Used) GPU I/O

3.2.2 I/O GPU Victream GPU I/O GPU 4 FIFO I/O I/O GPU I/O I/O GPU

3.2.3 GPU I/O Victream GPU I/O GPU GPU I/O GPU I/O GPU GPU 5 GPU Out-of-Core GPU I/O

3.3 I/O I/O GPU GPU GPU FIFO GPU

4.

4.1 4 4 NVIDIA Tesla K20 GPU I/O GPU GPU 5GB 3.52 Tflops E5-2609 Xeon CPU 2

OS Ubuntu 14.04 DAG RAMdisk Victream C++ CUDA 7.5 10K Victream FIFO PTask[4] Data-Aware FIFO GPU GPU GPU I/O PTask FIFO GPU GPU GPU GPU Out-of-Core FIFO GPU Data-Aware GPU I/O FIFO PTask I/O

4.2 (Blur ) 4 Victream API Victream Out-of-Core 4 2 GPU GPU GPU Out-of-Core N GPU 1 GPU N 2 256MB GPU 70% 50% 6 1 1 GPU 6 Victream FIFO PTask PTask 92%-117% GPU Victream GPU FIFO PTask GPU GPU 6 4 GPU I/O DAG I/O Out-of-Core 9%-38% Blur GPU RAMdisk I/O Victream I/O 4 7 Out-of-Core Out-of-Core Victream PTask Out-of-Core Victream Out-of-Core

Figure 6. (a) (b) (c) Blur (d)
Figure 7. (a) (b) (c) Blur (d)

PTask FIFO PTask GPU I/O FIFO 2 Victream Out-of-Core GPU I/O

5. [1] GPU Out-of-Core I/O Victream State-of-the-Art Victream Victream DAG GPU I/O GPU I/O GPU I/O State-of-the-Art 117% 38%

[1] Victream 2016 / / (SWoPP2016) (2016).
[2] Isard, M., Budiu, M., Yu, Y., Birrell, A. and Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks, ACM SIGOPS Operating Systems Review, Vol. 41, No. 3, ACM, pp. 59-72 (2007).
[3] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S. and Stoica, I.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, USENIX Association, pp. 2-2 (2012).
[4] Rossbach, C. J., Currey, J., Silberstein, M., Ray, B. and Witchel, E.: PTask: operating system abstractions to manage GPUs as compute devices, Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, ACM, pp. 233-248 (2011).
[5] Intel: Xeon E7-8894 v4, https://ark.intel.com/ja/products/96900/intel-xeon-Processor-E7-8894-v4-60M-Cache-2 40 GHz
[6] NVIDIA: NVIDIA TESLA P100 GPU ACCELERATOR, http://images.nvidia.com/content/tesla/pdf/nvidiatesla-p100-pcie-datasheet.pdf.
[7] Sundaram, N., Raghunathan, A. and Chakradhar, S. T.: A framework for efficient and scalable execution of domain-specific templates on GPUs, Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, IEEE, pp. 1-12 (2009).