
GPGPU Programming (VI)

1. Introduction

The previous installments in this series introduced GPGPU programming with CUDA. This installment describes OpenCL, walks through a simple OpenCL program, and compares OpenCL with CUDA along the way.

2. OpenCL

OpenCL is an open standard for GPGPU and, more broadly, heterogeneous parallel programming. It was proposed by Apple and is now maintained by the OpenCL Working Group of the Khronos Group [1]. CUDA is NVIDIA's environment and targets NVIDIA GPUs, whereas OpenCL targets NVIDIA and AMD GPUs as well as multi-core CPUs and DSPs (Digital Signal Processors), so the same OpenCL program can in principle run on devices from different vendors. The programming model itself is close to CUDA's, so experience with CUDA transfers readily to OpenCL. This section describes OpenCL's platform and memory models and their relation to CUDA.

2.1 The OpenCL platform model

Figure 1 shows the OpenCL platform model. A Host (typically a CPU) controls one or more Compute Devices (for example, GPUs). Each Compute Device consists of one or more Compute Units, and each Compute Unit consists of one or more Processing Elements. In CUDA terms, a Compute Unit corresponds to a multiprocessor (MP) and a Processing Element to an SP (CUDA core). Likewise, a CUDA Thread corresponds to an OpenCL Work-Item, and a CUDA Thread Block to an OpenCL Workgroup.

Table 1 summarizes the OpenCL memory model. Each Work-Item has its own Private Memory; each Workgroup has Local Memory shared by the Work-Items in that group; the whole Compute Device shares Global/Constant Memory; and the Host has Host Memory.

Table 1: The OpenCL memory model and its CUDA equivalents

  OpenCL                   Scope            CUDA
  ----------------------   --------------   -------------
  Private Memory           Work-Item        Register
  Local Memory             Workgroup        Shared Memory
  Global/Constant Memory   Compute Device   Global Memory
  Host Memory              Host             Host Memory

As with the platform model (Compute Unit vs. MP, Processing Element vs. SP), the memory hierarchy maps directly onto CUDA's, so the way a CUDA program uses memory carries over to OpenCL almost unchanged. Note that OpenCL defines these memory regions logically; how they map onto physical memory depends on the device. OpenCL also defines an Embedded Profile, a reduced feature set for embedded devices.

2.2 Structure of an OpenCL program

An OpenCL application consists of two parts: host code, written in C/C++ against the OpenCL API and executed on the Host, and kernel code, executed on the Compute Units and Processing Elements of a Compute Device. The host code enumerates platforms and devices, builds kernels, transfers data, and launches kernels through the API.

Kernels are written in OpenCL C, a language derived from C/C++. A function to be executed on the device (e.g., the GPU) is marked with the __kernel qualifier, and pointers into Global Memory are marked with the __global qualifier (Figure 2), much as CUDA C marks kernels with __global__. Unlike CUDA C, however, OpenCL keeps kernel source separate from the host code, and kernels are compiled at run time by the OpenCL runtime, as the next section shows.

2.3 An OpenCL sample program

This section walks through a small OpenCL program that multiplies each element of an array by a constant on the GPU. The environment used here is CentOS 5.5 x86_64 with CUDA 3.2RC and a Tesla C2050; the same source should also build against AMD's OpenCL implementation. Listing 1 shows the host code, arraytest.cpp. Its structure (executed on the CPU) is:

1. lines 16-24: get the platform, the GPU devices, and a context
2. lines 25-28: create a command queue
3. lines 29-42: load, build, and create the kernel
4. lines 43-55: set up the CPU/GPU data and pass the CPU-side data and arguments to the GPU
5. lines 56-63: run the kernel on the GPU
6. lines 64-65: read the result back from the GPU to the CPU
7. lines 66-68: check the result
8. line 69 on: release resources

Listing 1: arraytest.cpp

 1 #include <oclUtils.h>
 2
 3 #define DATA_LENGTH 16
 4 cl_context cxGPUContext;
 5 cl_kernel kernel;
 6 cl_command_queue commandQueue;
 7
 8 #define CHK_DO(name,o) ciErrNum=o; if(ciErrNum!=CL_SUCCESS){printf(name);printf(" failed\n");return(-1);}
 9 #define CHK_ERR(name) if(ciErrNum!=CL_SUCCESS){printf(name);printf(" failed\n");return(-1);}
10
11 int main(int argc, char **argv){
12   cl_platform_id cpPlatform = NULL;
13   cl_uint ciDeviceCount = 0;
14   cl_device_id *cdDevices = NULL;
15   cl_int ciErrNum = CL_SUCCESS;
16   // get platform
17   CHK_DO("oclGetPlatformID", oclGetPlatformID(&cpPlatform));
18   // get devices
19   CHK_DO("clGetDeviceIDs1", clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &ciDeviceCount));
20   cdDevices = (cl_device_id *)malloc(ciDeviceCount * sizeof(cl_device_id));
21   CHK_DO("clGetDeviceIDs2", clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, ciDeviceCount, cdDevices, NULL));
22   // get context
23   cxGPUContext = clCreateContext(0, ciDeviceCount, cdDevices, NULL, NULL, &ciErrNum);
24   CHK_ERR("clCreateContext");
25   // create command queue
26   cl_device_id device = oclGetDev(cxGPUContext, 0);
27   commandQueue = clCreateCommandQueue(cxGPUContext, device, CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
28   CHK_ERR("clCreateCommandQueue");

29   // program setup
30   size_t program_length;
31   const char *source_path = "gpu.cl";
32   char *source = oclLoadProgSource(source_path, "", &program_length);
33   if(!source){printf("oclLoadProgSource failed(%s)\n", source_path); return -2000;}
34   // create the program
35   cl_program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&source, &program_length, &ciErrNum);
36   CHK_ERR("clCreateProgramWithSource");
37   free(source);
38   // build the program
39   CHK_DO("clBuildProgram", clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL));
40   // Create Kernel
41   kernel = clCreateKernel(cpProgram, "arraytest", &ciErrNum);
42   CHK_ERR("clCreateKernel");
43   // setup data
44   cl_mem d_A;
45   cl_mem d_R;
46   float *h_A_data = (float *)malloc(sizeof(float) * DATA_LENGTH);
47   for(int i=0; i<DATA_LENGTH; i++) h_A_data[i] = (float)(i+1);
48   float *h_R_data = (float *)malloc(sizeof(float) * DATA_LENGTH);
49   d_A = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * DATA_LENGTH, h_A_data, NULL);
50   d_R = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, sizeof(float) * DATA_LENGTH, NULL, NULL);
51   float value = 2.0f;
52   // set args
53   clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_R);
54   clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&d_A);
55   clSetKernelArg(kernel, 2, sizeof(cl_float), (void *)&value);
56   // run kernel
57   cl_event GPUExecution;
58   size_t localWorkSize[] = {4};
59   size_t globalWorkSize[] = {DATA_LENGTH};
60   clEnqueueNDRangeKernel(commandQueue, kernel, 1, 0, globalWorkSize, localWorkSize, 0, NULL, &GPUExecution);
61   clFlush(commandQueue);
62   // sync
63   clFinish(commandQueue);
64   // blocking readback
65   clEnqueueReadBuffer(commandQueue, d_R, CL_TRUE, 0, sizeof(float) * DATA_LENGTH, h_R_data, 0, NULL, NULL);
66   // check result
67   printf("before: "); for(int i=0; i<DATA_LENGTH; i++){printf(" %.2f", h_A_data[i]);} printf("\n");
68   printf("after : "); for(int i=0; i<DATA_LENGTH; i++){printf(" %.2f", h_R_data[i]);} printf("\n");
69   // release mem and event
70   clReleaseMemObject(d_A);
71   clReleaseMemObject(d_R);
72   clReleaseEvent(GPUExecution);
73   // cleanup
74   ciErrNum = clReleaseKernel(kernel);
75   ciErrNum = clReleaseCommandQueue(commandQueue);
76   ciErrNum = clReleaseProgram(cpProgram);
77   ciErrNum = clReleaseContext(cxGPUContext);
78   CHK_ERR("release");
79   free(h_A_data);
80   free(h_R_data);
81   return 0;
82 }

The program first obtains a PlatformID, then the IDs of the GPU Devices on that platform, and then creates a Context over those devices. When a machine has more than one GPU, the devices belonging to a context can be enumerated as shown in Listing 2.

Listing 2: arraytest.cpp (device-enumeration excerpt)

 1 // get context, create command queue
 2 size_t nDeviceBytes;
 3 CHK_DO("clGetContextInfo", clGetContextInfo(cxGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &nDeviceBytes));
 4 ciDeviceCount = (cl_uint)(nDeviceBytes / sizeof(cl_device_id));
 5 if(ciDeviceCount == 0){printf("no devices (return code %i)\n", ciErrNum); return -1;}
 6 // check all devices (get device and print the device name)
 7 for(unsigned int i = 0; i < ciDeviceCount; ++i){
 8   // get the i-th GPU device
 9   cl_device_id device = oclGetDev(cxGPUContext, i);
10   printf("device %d: ", i);
11   oclPrintDevName(LOGBOTH, device);  // print the name of the GPU
12   printf("\n");
13 }

Kernel arguments are set one at a time with the clSetKernelArg API, and the kernel is launched on the GPU with the clEnqueueNDRangeKernel API. Its fifth and sixth arguments (globalWorkSize and localWorkSize) specify the index space: where CUDA gives the number of Threads per Block and the number of Blocks, OpenCL gives the total number of Work-Items (globalWorkSize) and the number of Work-Items per Workgroup (localWorkSize). CUDA programmers should note the difference: the global size is the total amount of work, not the number of groups, so the two WorkSize values do not correspond one-to-one to CUDA's launch parameters.

Readers who know CUDA (CUDA C) may find this style closer to the CUDA Driver API than to CUDA C: as in the Driver API, arguments are set and kernels launched through explicit API calls rather than a language extension. The command queue plays much the same role as a CUDA Stream: enqueued commands such as kernel launches and memory transfers execute asynchronously, which makes it possible to overlap GPU execution with CPU-GPU transfers, just as Streams do in CUDA C.

A kernel object is created in four steps:

1. load the kernel source from its file (oclLoadProgSource)
2. create a program object from the source (clCreateProgramWithSource)
3. build the program (clBuildProgram)
4. create the kernel (clCreateKernel)

Listing 3 shows the kernel source, gpu.cl. As in CUDA, each Work-Item obtains its own ID and processes the corresponding element: get_global_id(0) returns the Work-Item's global ID in dimension 0, and since this kernel uses a one-dimensional index space, each Work-Item handles exactly one array element.

Listing 3: gpu.cl

1 #define DATA_LENGTH 16
2 __kernel void arraytest(__global float *R, __global float *A, float value){
3   int i;
4   i = get_global_id(0);
5   R[i] = A[i] * value;
6 }

Compilation also differs from CUDA. A CUDA (CUDA C) program, in which host and device code may share a source file, is compiled with nvcc; an OpenCL host program contains no device code and is compiled with an ordinary compiler such as gcc (g++), while the kernel in gpu.cl is compiled at run time by the OpenCL runtime (via clBuildProgram). Figure 3 shows how the sample is built and run; in addition to the OpenCL library itself, it links the utility libraries bundled with the NVIDIA GPU Computing SDK.

$ ls
arraytest.cpp  gpu.cl
$ g++ -O3 -m64 -o arraytest arraytest.cpp -lOpenCL \
  -I/path_to_cudasdk_3.1/OpenCL/common/inc -I/path_to_cudasdk_3.1/shared/inc \
  -L/path_to_cudasdk_3.1/OpenCL/common/lib -L/path_to_cudasdk_3.1/shared/lib \
  -loclUtil_x86_64 -lshrutil_x86_64
$ ls
arraytest  arraytest.cpp  gpu.cl
$ ./arraytest
before:  1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 (...)
after :  2.00 4.00 6.00 8.00 10.00 12.00 14.00 16.00 18.00 20.00 (...)

Figure 3: Building and running the OpenCL sample program

3. GPGPU programming environments

3.1 Compiler-based environments

In ordinary parallel programming on CPUs, C/C++ programmers rarely manage low-level parallelism by hand; they use pthreads, OpenMP, or MPI. Similar higher-level environments are appearing for GPGPU. One example is OMPCUDA [3], a system that runs OpenMP programs on the GPU by translating them to CUDA. OMPCUDA takes the work-sharing parallel for loops of an OpenMP program, turns the loop bodies into GPU kernels whose iterations are distributed over the GPU's SPs, and generates the required GPU memory management and CPU-GPU data transfers, so that an existing OpenMP program can use the GPU without modification (Figure 4). OMPCUDA is built on the OMNI OpenMP compiler [2]: OMNI's front end parses the OpenMP source, and an OMPCUDA-specific back end generates the CUDA code that runs on the GPU (Figure 5). OMNI itself accepts OpenMP programs written in C/C++ and Fortran.

[Figure 4: How OMPCUDA runs an OpenMP program on the GPU]

[Figure 5: The structure of OMNI and the OMPCUDA back end]

Work related to OMPCUDA includes the compiler framework of Lee et al. [5], which automatically translates and optimizes OpenMP programs for the GPU. Commercial compiler-based environments include the PGI Accelerator compilers [4] and CAPS's HMPP [6].

3.2 GPGPU from languages other than C/C++

CUDA and OpenCL are designed around C/C++, which together with Fortran dominates HPC, but bindings to CUDA and OpenCL now exist for many other languages, including Java, Perl, Python, Ruby, and the .NET languages. PyCUDA [7] and PyOpenCL [8] make CUDA and OpenCL usable from Python: they take care of CPU-GPU data transfer and GPU memory management, and their core is implemented in C++ for performance. Ruby-OpenCL [9] similarly makes OpenCL available from Ruby, playing for Ruby the role that PyOpenCL plays for Python. JCuda [10] lets Java programs call CUDA; unlike the Python and Ruby bindings above, kernels are supplied as compiled GPU code (PTX), and wrapper APIs for the CUDA libraries (CUBLAS, CUFFT, etc.) are provided as well. CUDA Fortran [11], developed by PGI, brings CUDA to Fortran: host code and kernels are written in Fortran with extensions corresponding to CUDA C's, so Fortran programs can use CUDA, and through it the GPU, directly.

As these examples show, GPGPU is spreading beyond C/C++ to the languages programmers already use, widening the range of people who can program GPUs.

3.3 Stream languages

Before CUDA, running a general-purpose computation on a GPU meant expressing it through graphics APIs, which required knowledge of both the target application and graphics programming. Stream-processing languages were developed to hide this complexity; representative examples from around 2005 are BrookGPU [12] and RapidMind [14]. BrookGPU, developed at Stanford University, extends C with streams. Figure 6 shows an example: a kernel function (kfunc) operates on streams, and data moves between ordinary arrays and streams through the two functions streamRead and streamWrite. BrookGPU has multiple back ends, including DirectX, OpenGL, a CPU back end, and CTM (Close To The Metal, AMD's low-level GPU interface), and AMD later developed it into Brook+ [13] for its own GPUs. RapidMind is a commercial platform that grew out of Sh [15], developed at the University of Waterloo. Figure 7 shows a RapidMind example: kernels are written inline in C++ between BEGIN and END markers and applied to arrays, in a style comparable to BrookGPU's.

// kernel
kernel void kfunc(float x<>, float y<>, out float z<>) {
  z = x + y;
}

int main() {
  float a<100>;
  float b<100>;
  float c<100>;
  // read input data into streams
  streamRead(a, data1);
  streamRead(b, data2);
  // run the kernel
  kfunc(a, b, c);
  // write the result stream back
  streamWrite(c, result);
  return 0;
}

Figure 6: A BrookGPU example

int main() {
  // define the kernel
  Program kfunc = BEGIN {
    In<Value3f> x, y;
    Out<Value3f> z;
    z = x + y;
  } END;
  // declare the arrays
  Array<1, Value3f> a(512);
  Array<1, Value3f> b(512);
  Array<1, Value3f> c(512);
  // run the kernel
  c = kfunc(a, b);
  return 0;
}

Figure 7: A RapidMind example

OpenCL runs not only on NVIDIA GPUs but on other vendors' GPUs and on CPUs, and the environments surveyed above open GPGPU to languages beyond C/C++. GPGPU has also reached the largest supercomputers. In November 2010, Tianhe-1A, whose predecessor had ranked 5th in the November 2009 TOP500 using Radeon HD GPUs, took first place in the TOP500 with 14,336 CPUs and 7,168 Tesla M2050 GPUs, achieving over 2 PFLOPS on the LINPACK benchmark. In the same November 2010 list, TSUBAME, upgraded to TSUBAME 2.0 with 4,224 Tesla M2050 GPUs, placed high in both the TOP500 and the Green500. As GPUs themselves continue to evolve, the GPGPU environments described in this series will keep evolving with them.

References

[1] OpenCL - The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl/
[2] M. Sato, S. Satoh, K. Kusano, and Y. Tanaka. Design of OpenMP Compiler for an SMP Cluster.

In EWOMP '99, pp. 32-39, 1999.
[3] OMPCUDA: An OpenMP compiler for GPUs (in Japanese). In HPCS2009, pp. 131-138, 2009.
[4] PGI. PGI Accelerator Compilers, http://www.pgroup.com/resources/accel.htm
[5] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 101-110, 2009.
[6] CAPS. HMPP Workbench, http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36
[7] PyCUDA, http://mathema.tician.de/software/pycuda
[8] PyOpenCL, http://mathema.tician.de/software/pyopencl
[9] Ruby-OpenCL, http://ruby-opencl.rubyforge.org/
[10] JCuda, http://www.jcuda.org/
[11] CUDA Fortran, http://www.pgroup.com/resources/cudafortran.htm
[12] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In SIGGRAPH 2004, 2004.
[13] AMD. Brook+. SC07 BOF Session presentation, November 2007.
[14] Michael D. McCool. Data-Parallel Programming on the Cell BE and the GPU Using the RapidMind Development Platform. In GSPx Multicore Applications Conference, 2006.
[15] Michael McCool and Stefanus Du Toit. Metaprogramming GPUs with Sh. A K Peters Ltd., 2004.