Evaluation of GPU Computing Based on An Automatic Program Generation Technology

Makoto Sugawara, 1 Katsuto Sato, 1 Kazuhiko Komatsu, 2 Hiroyuki Takizawa 1, 3 and Hiroaki Kobayashi 2, 3

1 Graduate School of Information Sciences, Tohoku University
2 Cyberscience Center, Tohoku University
3 Japan Science and Technology Agency, Core Research for Evolutional Science and Technology

c 2011 Information

Recently, heterogeneous computing systems that achieve high-performance computing by using Graphics Processing Units (GPUs) as accelerators have drawn much attention in computational science. However, using a GPU requires porting an existing program to a GPU program. To relieve the porting effort, this paper focuses on a technology that automatically generates a GPU program from directives inserted into an existing sequential code, and evaluates the sustained performance of the auto-generated program. In addition, we show which code optimizations can be achieved by using directives. A simple matrix multiplication program is used for the evaluation to demonstrate that the automatically generated code can achieve a high sustained performance.

1.

A Graphics Processing Unit (GPU) is commonly programmed with NVIDIA's Compute Unified Device Architecture (CUDA) 1) or with the Open Computing Language (OpenCL) 2). In either case, an existing program has to be ported to CUDA or OpenCL before it can run on a GPU. This paper evaluates CAPS HMPP 3), which generates a GPU program automatically from directives inserted into sequential code.

2.

2.1 OpenCL

OpenCL 2) is a specification maintained by the Khronos OpenCL Working Group.

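Fig. 1 OpenCL

Fig. 1 shows the OpenCL model. An OpenCL device such as a GPU consists of Compute Units, each of which contains Processing Elements. OpenCL follows the SPMD (Single-Program, Multiple-Data) model: the same kernel is executed by many work-items, which are grouped into work-groups over an index space called the NDRange. Each work-item is distinguished by its ID and uses that ID to select the data it operates on. Fig. 1 also shows the OpenCL memory hierarchy: Private Memory for each work-item, Local Memory that can be used as a software-managed cache, and Global Memory.

2.2 Hybrid Multicore Parallel Programming workbench (HMPP)

Several directive-based programming environments for C and Fortran have been proposed 4)5)6). The CAPS Hybrid Multicore Parallel Programming workbench (HMPP) 3) is one of them. In HMPP, the function to be offloaded is marked with a directive and is called a codelet, and HMPP can generate OpenCL code for it.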
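The relation between the work-item and work-group IDs of Section 2.1 can be written down in a few lines of plain C. This is only a sketch: the helper names `global_id` and `num_groups` are hypothetical (they are not OpenCL API calls), one dimension is shown, and a zero global offset is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: global ID of a work-item in one dimension,
   given its work-group ID, the work-group size and its local ID.
   Assumes a zero global offset. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Hypothetical helper: number of work-groups that cover global_size
   work-items. The global size is assumed to be a multiple of the
   work-group size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For instance, with work-groups of 16 work-items, the work-item with local ID 3 in work-group 2 has global ID 35, and 1024 work-items form 64 work-groups.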

OpenCL HMPP OpenCL HMPP OpenMP 7) HMPP CPU OpenCL OpenCL HMPP GPU HMPP OpenCL HMPP OpenCL 32 4 CPU GPU OpenCL OpenCL

3. HMPP OpenCL

3.1 OpenCL

When a C or Fortran program is ported to OpenCL, the computation has to be expressed in terms of work-item IDs. Fig. 2 shows an OpenCL kernel. In Fig. 2, lines 7 and 8 obtain the IDs.

     1 ////////////////////////////////////////////////////////////////
     2 // MatrixMul : C = alpha * A * B + beta * C
     3 // m is A's width, n is A's height and k is B's height
     4 ////////////////////////////////////////////////////////////////
     5 __kernel void MatrixMul(int m, int n, __global float *A, __global float *B, __global float *C, float alpha, float beta)
     6 {
     7     int i = get_global_id(0); // work-item ID
     8     int j = get_global_id(1); // work-item ID
     9     int l;                    // Induction variable
    10     float AB = 0.0f;          // Temporary result
    11     for (l = 0; l < n; ++l) {
    12         AB += A[j * m + l] * B[l * n + i];
    13     }
    14     C[j * m + i] = alpha * AB + beta * C[j * m + i];
    15 }

Fig. 2 OpenCL

The index strides in lines 12 and 14 coincide with the usual row-major strides when m = n = k.

GPU HMPP 3 5 1 GPU HMPP OpenCL 1 2

3.2 GPU

OpenCL GPU GPU
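To make the role of the work-item IDs in the OpenCL kernel of Fig. 2 concrete, the following sketch emulates the kernel on the host in plain C. It is an illustration rather than code from the paper: the function names are hypothetical, the matrices are assumed to be square n-by-n and stored in row-major order, and the two nested loops stand in for a two-dimensional NDRange.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one work-item (i, j): one element of C.
   Hypothetical host-side emulation; square n-by-n row-major
   matrices are assumed. */
static void matmul_work_item(size_t i, size_t j, size_t n,
                             const float *A, const float *B, float *C,
                             float alpha, float beta)
{
    float ab = 0.0f;
    for (size_t l = 0; l < n; ++l)
        ab += A[j * n + l] * B[l * n + i];
    C[j * n + i] = alpha * ab + beta * C[j * n + i];
}

/* Emulates a 2-D NDRange of n x n work-items with two nested loops:
   i plays the role of get_global_id(0), j of get_global_id(1). */
void matmul_ndrange(size_t n, const float *A, const float *B, float *C,
                    float alpha, float beta)
{
    assert(A && B && C);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            matmul_work_item(i, j, n, A, B, C, alpha, beta);
}
```

Such a host-side version can also serve as a reference result when checking the output of a GPU kernel.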

    ////////////////////////////////////////////////////////////////
    // MatrixMul : C = alpha * A * B + beta * C
    // m is A's width, n is A's height and k is B's height
    ////////////////////////////////////////////////////////////////
    #pragma hmpp MatrixMul codelet, target=opencl, args[C].io=inout
    void MatrixMul(int m, int n, int k, float *A, float *B, float *C, float alpha, float beta)
    {
        int i, j, l; // Induction variables
        float AB;    // Temporary result
        for (j = 0; j < m; j++) {
            for (i = 0; i < k; i++) {
                AB = 0.0f;
                for (l = 0; l < n; l++) {
                    AB += A[j * m + l] * B[l * n + i];
                }
                C[j * m + i] = alpha * AB + beta * C[j * m + i];
            }
        }
    }

4 3 HMPP C 4 C B C A A B 8) 5 C HMPP GPU NVIDIA GPU 16 5 9) 1 16 8) HMPP

4.

The CPU is an Intel Core i7 920.
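The evaluation labels some variants "blocking" and others "unblocking". What exactly the paper blocks cannot be recovered from the surviving text, so the following is only a sketch of the common loop-tiling technique applied to the matrix multiplication loop nest of the HMPP listing. The tile edge of 16, the function name, and the assumption of square row-major matrices are all choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16 /* assumed tile edge; not taken from the paper */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C = alpha * A * B + beta * C for square n-by-n row-major matrices,
   with all three loops tiled in TILE-sized blocks. */
void matmul_tiled(size_t n, const float *A, const float *B, float *C,
                  float alpha, float beta)
{
    assert(A && B && C);

    /* Scale C by beta once, then accumulate alpha * A * B tile by tile. */
    for (size_t idx = 0; idx < n * n; ++idx)
        C[idx] *= beta;

    for (size_t jj = 0; jj < n; jj += TILE)
        for (size_t ll = 0; ll < n; ll += TILE)
            for (size_t ii = 0; ii < n; ii += TILE)
                /* Multiply one tile of A by one tile of B. */
                for (size_t j = jj; j < min_sz(jj + TILE, n); ++j)
                    for (size_t l = ll; l < min_sz(ll + TILE, n); ++l) {
                        float a = alpha * A[j * n + l];
                        for (size_t i = ii; i < min_sz(ii + TILE, n); ++i)
                            C[j * n + i] += a * B[l * n + i];
                    }
}
```

Tiling keeps a small block of each matrix in fast memory while it is reused, which is the same idea that motivates staging tiles in OpenCL Local Memory on a GPU.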

Fig. 6 Core i7, Tesla C1060

Fig. 7 Core i7, Tesla C2070

The GPUs are an NVIDIA Tesla C1060 and an NVIDIA Tesla C2070. The OS is CentOS 5.5 (Linux 2.6.18), with GCC 4.1.2 and HMPP version 2.4.0.

GPU CPU GPU OpenCL OpenCL HMPP HMPP OpenMP CPU OpenMP GotoBLAS 10) CUBLAS 11) OpenCL HMPP 16 16 OpenMP 8 3 256 512 768 1024 10 Tesla C1060 Tesla C2070

Fig. 6 and Fig. 7 compare the following implementations: an OpenMP version on the CPU (OpenMP), GotoBLAS on the CPU (GotoBLAS), CUBLAS on the GPU (CUBLAS), the OpenCL version (OpenCL), and the HMPP version (HMPP). The labels (blocking) and (unblocking) distinguish the variants with and without blocking.

7 HMPP Tesla C2070 OpenMP CPU 73 6 HMPP OpenCL 55 HMPP 3 HMPP GPU CPU Tesla C2070 HMPP GotoBLAS
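As an illustration of the OpenMP CPU baseline and of how a sustained-performance figure can be derived from a timing, a minimal sketch follows. The function names, the square row-major layout, and the convention of counting 2n^3 floating-point operations for the product are assumptions of this sketch, not details confirmed by the paper.

```c
#include <assert.h>
#include <stddef.h>

/* CPU baseline parallelized over the rows of C with OpenMP.
   Without OpenMP support enabled in the compiler, the pragma is
   ignored and the loop simply runs serially. */
void matmul_omp(int n, const float *A, const float *B, float *C,
                float alpha, float beta)
{
    assert(n > 0 && A && B && C);
    #pragma omp parallel for
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            float ab = 0.0f;
            for (int l = 0; l < n; ++l)
                ab += A[(size_t)j * n + l] * B[(size_t)l * n + i];
            C[(size_t)j * n + i] = alpha * ab + beta * C[(size_t)j * n + i];
        }
    }
}

/* Sustained performance in GFLOP/s, counting 2 * n^3 floating-point
   operations for the A * B product (an assumed convention). */
double matmul_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    return 2.0 * n * n * n / seconds * 1e-9;
}
```

Under this convention, a 1000-by-1000 multiplication that takes 2 seconds corresponds to about 1 GFLOP/s.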

GPU Tesla C1060 OpenCL GotoBLAS 20% CPU GPU GotoBLAS CPU OpenCL 2 3 GPU GPU. HMPP GotoBLAS. GotoBLAS CPU 2 GPU CPU HMPP CUBLAS HMPP CUBLAS CUBLAS CUBLAS HMPP HMPP Tesla C2070 HMPP OpenCL Tesla C1060 Tesla C2070 HMPP OpenCL OpenCL for if HMPP HMPP OpenCL OpenCL OpenCL GPU

5.

GPU HMPP OpenCL OpenMP CPU OpenCL GPU CUBLAS GotoBLAS

JCC (CAPS ) (B)(23700028) (JST) (CREST) VLSI 3 VLSI

1) NVIDIA Corporation. NVIDIA CUDA Programming Guide 3.0, 2010.
2) Khronos OpenCL Working Group. The OpenCL Specification version 1.1.
3) R. Dolbeau et al. HMPP: A Hybrid Multicore Parallel Programming Environment. Workshop on GPGPU 2007, 2007.
4) The Portland Group. PGI Accelerator Programming Model for Fortran & C. http://www.softek.co.jp/spg/pgi/accel/, 2010.
5) Seyong Lee and R. Eigenmann. OpenMPC: Extended OpenMP Programming and Tuning for GPUs. pp. 1-11, Nov. 2010.
6) T.D. Han and T.S. Abdelrahman. hiCUDA: High-Level GPGPU Programming. IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 1, pp. 78-90, Jan. 2011.
7) OpenMP.org. OpenMP Application Program Interface. http://openmp.org/wp/, 2008.
8) NVIDIA Corporation. NVIDIA OpenCL Best Practice Guide 2.3, 2009.
9) Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, Vol. 28, pp. 39-55, 2008.
10) . Texas Advanced Computing Center. http://www.tacc.utexas.edu/.
11) NVIDIA Corporation. CUDA Toolkit 4.0 CUBLAS Library. http://developer.nvidia.com/nvidiagpu-computing-documentation, 2011.