
GPGPU

NVIDIA CUDA references
- Learn More about CUDA - NVIDIA: http://www.nvidia.co.jp/object/cuda_education_jp.html
- NVIDIA CUDA Programming Guide
- CUDA tutorial slides (SINTEF EVITA winter school 2008): http://www.sintef.no/upload/ikt/9011/simoslo/evita/2008/seland.pdf
- CUDA notes: http://tech.ckme.co.jp/cuda.shtml
- OpenCL (NVIDIA): http://www.nvidia.co.jp/object/cuda_opencl_jp.html
- PC Watch, Kaigai column on the NVIDIA G80 (2007/4/16): http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm
- PC Watch, Kaigai column: Khronos shows OpenCL for GPUs and Cell B.E. at GDC (2009/3/30): http://pc.watch.impress.co.jp/docs/2009/0330/kaigai497.htm

GPU Computing
GPGPU (General-Purpose computation on Graphics Processing Units) means using the GPU for workloads other than graphics. CUDA (Compute Unified Device Architecture) is NVIDIA's environment for GPU computing. Compared with a CPU at a similar price, the GPU offers far more raw arithmetic throughput, which is what makes GPGPU attractive.


System architecture: the CPU with its host memory and the GPU (GPGPU) with its graphics memory are separate devices, connected by PCI Express (PCIe).

NVIDIA GPGPU hardware
Each multiprocessor contains:
- eight Scalar Processor (SP) cores
- two special function units for transcendentals
- a multithreaded instruction unit
- on-chip shared memory
The multiprocessor executes threads in SIMT fashion (single-instruction, multiple-thread): it maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state. Threads are created, managed, scheduled, and executed in groups of 32 parallel threads called warps.
Memories: device memory (global memory), shared memory, constant cache, texture cache.
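For reference, these figures can be read back at run time with the CUDA runtime API. A minimal sketch (queries device 0; the printed fields are illustrative, the API calls are standard):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0

    printf("name                  : %s\n", prop.name);
    printf("multiprocessors       : %d\n", prop.multiProcessorCount);
    printf("warp size             : %d\n", prop.warpSize);
    printf("shared memory / block : %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    printf("constant memory       : %lu bytes\n", (unsigned long)prop.totalConstMem);
    return 0;
}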

CUDA (Compute Unified Device Architecture)
- C programming language on GPUs
- Requires no knowledge of graphics APIs or GPU internals
- Access to native instructions and memory
- Easy to get started and to get a real performance benefit
- Designed and developed by NVIDIA; requires an NVIDIA GPU (GeForce 8xxx / Tesla / Quadro)
- Stable, available (for free), documented and supported
- For both Windows and Linux

CUDA programming model (1/2)
The GPU is a co-processor to the CPU (the host) and is called the compute device. The compute-intensive parts of an application are off-loaded to the device; each such part is written as a kernel, which runs on the device. The host (CPU) and the device (GPU) have separate memories: host memory and device memory (graphics memory), connected by PCI Express.

CUDA programming model (2/2)
A kernel executes over a computational grid. The grid consists of thread blocks, and each block consists of threads. When a kernel is launched, every block and every thread is assigned an ID (blockIdx, threadIdx), and the kernel code uses these IDs to decide which part of the data to work on.

Element-wise Matrix Add

CPU program:

void add_matrix(float* a, float* b, float* c, int N)
{
    int index;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
}

int main()
{
    add_matrix(a, b, c, N);
}

CUDA program:

__global__ void add_matrix(float* a, float* b, float* c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main()
{
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(a, b, c, N);
}

Each SM (Streaming Multiprocessor) contains eight processors.

GPGPU

Tesla C1060
- Streaming processor cores: 240
- Core clock: 1.3 GHz
- Memory: 4 GB
- Peak performance (single precision): 933 GFlops
- Peak performance (double precision): 78 GFlops
- Memory bandwidth: 102 GB/sec
- Power: 187.8 W
- Floating point: IEEE 754 single/double precision
- Interface: PCI Express x16 (PCI-E 2.0)

kernel<<<dim3 grid, dim3 block, shmem_size>>>( ... )
The execution configuration between <<< >>> specifies the grid dimensions (x and y), the thread-block dimensions (x, y and z), and optionally the bytes of dynamic shared memory per block:

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);

A CUDA program is a mix of code for the CPU and code for the GPU: the CPU part drives the computation, and it also manages the GPU's memory.

GPU memory is allocated, initialized and freed from the CPU:

cudaMalloc(void** pointer, size_t nbytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)

int n = 1024;
int nbytes = 1024 * sizeof(int);
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);
cudaMemset(d_a, 0, nbytes);
cudaFree(d_a);

Data is copied between host and device with

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);

The direction argument specifies where src and dst live; the call blocks the CPU thread until the copy is complete. The directions are:

enum cudaMemcpyKind
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
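As a small illustration of the three directions (pointer names are hypothetical; d_a and d_b are device buffers, h_a a host buffer, all nbytes long):

cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);    // upload
cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice);  // on-device copy
cudaMemcpy(h_a, d_b, nbytes, cudaMemcpyDeviceToHost);    // download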

Functions that run on the GPU are written in C, with some restrictions: a kernel must return void, variable-length argument lists (varargs) and static variables are not allowed, and code running on the GPU cannot access CPU memory.

Function qualifiers:
__global__ : called from the CPU, runs on the GPU; must return void
__device__ : called from the GPU, runs on the GPU; cannot be called from CPU code
__host__ : called from and runs on the CPU
__host__ __device__ : compiled for both the CPU and the GPU

Inside __global__ and __device__ functions these built-in variables are available:

dim3 gridDim;    // dimensions of the grid (2D)
dim3 blockDim;   // dimensions of the block (3D)
dim3 blockIdx;   // index of this block within the grid
dim3 threadIdx;  // index of this thread within the block

__global__ void minimal(int* d_a)
{
    *d_a = 13;
}

__global__ void assign(int* d_a, int value)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    d_a[idx] = value;
}

__global__ void assign2d(int* d_a, int w, int h, int value)
{
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int idx = iy * w + ix;
    d_a[idx] = value;
}
...
assign2d<<<dim3(64, 64), dim3(16, 16)>>>(...);

CPU version:

void inc_cpu(int* a, int N)
{
    int idx;
    for (idx = 0; idx < N; idx++)
        a[idx] = a[idx] + 1;
}

int main()
{
    ...
    inc_cpu(a, N);
}

CUDA version:

__global__ void inc_gpu(int* a_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a_d[idx] = a_d[idx] + 1;
}

int main()
{
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);
}

// allocate host memory
int numbytes = N * sizeof(float);
float* h_a = (float*)malloc(numbytes);

// allocate device memory
float* d_a = 0;
cudaMalloc((void**)&d_a, numbytes);

// copy the data to the device
cudaMemcpy(d_a, h_a, numbytes, cudaMemcpyHostToDevice);

// launch the kernel
increment_gpu<<<N/blockSize, blockSize>>>(d_a, b);

// copy the result back
cudaMemcpy(h_a, d_a, numbytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_a);

int main()
{
    float *a = new float[N*N];
    float *b = new float[N*N];
    float *c = new float[N*N];

    for (int i = 0; i < N*N; ++i) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    float *ad, *bd, *cd;
    const int size = N*N*sizeof(float);
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);

    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    delete[] a; delete[] b; delete[] c;
    return EXIT_SUCCESS;
}

Variable qualifiers:
__device__ : stored in device (global) memory; memory allocated with cudaMalloc is __device__ by implication; accessible from all threads; lifetime of the application
__shared__ : stored in on-chip shared memory; accessible from all threads in the same block; lifetime of the block

Shared memory can be sized statically at compile time or dynamically at launch time (via the third execution-configuration parameter):

// static: size fixed at compile time
__global__ void kernel( ... )
{
    __shared__ float sdata[256];
    ...
}
int main(void)
{
    ...
    kernel<<<nblocks, blocksize>>>( ... );
}

// dynamic: size chosen at launch time
__global__ void kernel( ... )
{
    extern __shared__ float sdata[];
    ...
}
int main(void)
{
    ...
    int smbytes = blocksize * sizeof(float);
    kernel<<<nblocks, blocksize, smbytes>>>( ... );
}

void __syncthreads();
A barrier synchronization within a block: no thread proceeds until every thread of the block has reached the barrier. It is used to avoid read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) hazards when threads exchange data through shared memory.
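As an illustration (not part of the original slides), a standard block-level sum reduction shows where the barriers are needed; sdata is dynamically sized shared memory of blockDim.x floats, and blockDim.x is assumed to be a power of two:

__global__ void block_sum(float *in, float *out)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = in[i];
    __syncthreads();                 // RAW: all loads must finish before any thread reads

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();             // finish this round before the next halving
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // one partial sum per block
}

Launched, for example, as block_sum<<<nblocks, blocksize, blocksize * sizeof(float)>>>(d_in, d_out).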

CUDA source files are compiled with nvcc. nvcc is a compiler driver: it separates host code from device code (internally invoking cudacc plus the host compiler, g++ on Linux or cl on Windows) and compiles the device code to PTX, CUDA's intermediate representation. Applications link against the CUDA driver library (cuda) and the runtime API library (cudart).
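A typical build then looks like this (file name and architecture flag are illustrative; sm_13 corresponds to the Tesla C1060 above):

nvcc -O2 -arch=sm_13 -o add_matrix add_matrix.cu
./add_matrix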


Constant memory:
- quite small (< 20 KB)
- as fast as register access if all threads in a warp access the same location
Texture memory:
- spatially cached, optimized for 2D locality
- neighboring threads should read neighboring addresses
- no need to think about coalescing
Constraint: both of these memories can only be updated from the CPU.
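A sketch of the constant-memory pattern this constraint implies (names are illustrative):

__constant__ float coeff[16];        // lives in constant memory on the device

__global__ void scale(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= coeff[0];           // whole warp reads one location: register speed
}

// Host side: constant memory is written with cudaMemcpyToSymbol.
// float h_coeff[16] = { 2.0f, ... };
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));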

A global memory fetch takes only 4 cycles to issue, but has 400-600 cycles of latency: the equivalent of roughly 100 MADs. Memory access is therefore likely to be the performance bottleneck, and order-of-magnitude speedups are possible by restructuring it:
- coalesce memory accesses
- use shared memory to re-order non-coalesced addressing

Coalescing: the 16 threads of a half-warp can be served by a single memory transaction if they access a contiguous, aligned region of:
- 64 bytes: each thread reads an int or a float
- 128 bytes: each thread reads an int2 or a float2
- 256 bytes: each thread reads an int4 or a float4
(float3 is not aligned and therefore does not coalesce.)
The starting address (warp base address, WBA) must be aligned to 16*sizeof(type), and the k-th thread must access the k-th element of the region.
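Two illustrative kernels (not from the original slides) contrast the two patterns:

__global__ void copy_coalesced(float *out, float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];       // thread k touches element k: coalesced
}

__global__ void copy_strided(float *out, float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];       // neighboring threads touch distant addresses: serialized transactions
}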


Example: matrix transpose, naive version.

__global__ void transpose_naive(float *out, float *in, int w, int h)
{
    unsigned int xIdx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIdx = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIdx < w && yIdx < h) {
        unsigned int idx_in  = xIdx + w * yIdx;
        unsigned int idx_out = yIdx + h * xIdx;
        out[idx_out] = in[idx_in];
    }
}

The read from in is coalesced, but the write to out is not: consecutive threads write addresses that are h elements apart.

Figure: with a 16 x 16 thread block, a 16 x 16 tile of the matrix is staged through shared memory so that both the read and the write are coalesced.

Optimized version using shared memory:

__global__ void transpose(float *out, float *in, int width, int height)
{
    __shared__ float block[BLOCK_DIM * BLOCK_DIM];
    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height) {
        unsigned int index_in    = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }
    __syncthreads();
    if (xIndex < width && yIndex < height)
        out[index_out] = block[index_transpose];
}


Transfers between host and device are slow compared with device memory: about 4 GB/s over PCIe x16 1.0, versus 76 GB/s to the device memory of a Tesla C870. So minimize transfers, and group many small transfers into one large one. Transfers can also be overlapped with computation using the asynchronous copy:

cudaMemcpyAsync(dst, src, size, direction, 0);
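A sketch of the asynchronous pattern (sizes and names are illustrative); asynchronous copies require page-locked (pinned) host memory, allocated here with cudaMallocHost:

#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;
    size_t nbytes = N * sizeof(float);

    float *h_a, *d_a;
    cudaMallocHost((void**)&h_a, nbytes);   // pinned host buffer
    cudaMalloc((void**)&d_a, nbytes);

    // Copy in stream 0; control returns to the CPU at once.
    cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, 0);

    // ... CPU work that does not touch h_a could overlap here ...

    cudaThreadSynchronize();                // wait for the copy to finish

    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
}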

Kernel launches are asynchronous: control returns to the CPU immediately. cudaMemcpy() is synchronous: it starts after all previous CUDA calls have completed, and control returns to the CPU only after the copy finishes. cudaThreadSynchronize() blocks until all previous CUDA calls have completed.
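Because launches return immediately, timing a kernel needs an explicit synchronization point. A minimal sketch using CUDA events (the kernel and sizes are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void busy(float *x) { x[threadIdx.x] += 1.0f; }

int main(void)
{
    float *d_x;
    cudaMalloc((void**)&d_x, 256 * sizeof(float));
    cudaMemset(d_x, 0, 256 * sizeof(float));

    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busy<<<1, 256>>>(d_x);           // launch returns at once
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);      // block until the kernel has finished

    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}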

OpenCL is a programming environment for GPUs comparable to NVIDIA's C for CUDA, but vendor-neutral: it targets NVIDIA and AMD (ATI) GPUs, CPUs, the Cell Broadband Engine (Cell B.E.), and announced architectures such as Larrabee. As in CUDA, the code that runs on the device is written as kernels.


Summary: GPGPU uses the GPU for general-purpose computation. With CUDA, the computation is written as kernels from the local view of a single thread, while the CPU drives the GPU and manages the transfers between host memory and GPU memory, which remain the slow path.