IPSJ SIG Technical Report                                            Vol.2014-HPC-146 No.14  2014/10/3

CUDA-xSYMV

Toshiyuki Imamura 1,3,a) et al.

1  RIKEN Advanced Institute for Computational Science, Kobe, Hyogo
2  Japan Atomic Energy Agency, Kashiwa, Chiba
3  JST CREST, Kawaguchi, Saitama
a) imamura.toshiyuki@riken.jp

Abstract: This report describes a high-performance symmetric matrix-vector multiplication (SYMV) kernel for CUDA GPUs. Unlike general matrix-vector multiplication (GEMV), SYMV can reuse every loaded matrix element twice; the proposed kernel exploits this by accumulating the column-direction contributions in a work vector whose updates are serialized with a mutex, and it is compared against CUBLAS. The kernel parameters are selected by auto-tuning (AT). On a GeForce GTX Titan Black the tuned single-precision kernel (SSYMV) reaches 211 GFLOPS, 62.8% of the bandwidth-limited peak. The same kernel template further yields complex and DD (double-double) variants, namely CHEMV, ZHEMV, and WSYMV.

1. Introduction

Symmetric matrix-vector multiplication (SYMV) computes

    y := alpha * A_{U or L} * x + beta * y,      A (= A^T) in R^{n x n},  x in R^n,        (1)

where only the upper (U) or lower (L) triangle of A is referenced. SYMV is a Level-2 BLAS routine: it performs O(N^2) floating-point operations on O(N^2) data, so each matrix element is reused only O(1) times and performance is bounded by memory bandwidth rather than by arithmetic throughput. Because GPUs offer far higher memory bandwidth than CPUs, Level-2 kernels are natural targets for GPU BLAS libraries such as CUBLAS [1] and MAGMA [2]. For GEMV, Sørensen's GLAS [3], [4], [5] reports 63.8% of the memory bandwidth on an NVIDIA Tesla C2050 (90 GB/s, i.e. 45 GFLOPS); other tuned Level-2 kernels are reported in [6], [7], [8], and GPU SYMV kernels in particular in [9], [10], [11].

Because SYMV (like the other SYxxx routines) stores only one triangle of A, it touches only half the data of GEMV [9], [11]: its Byte/flop requirement is half that of GEMV, so an ideal SYMV can run up to twice as fast as GEMV. Fast GPU SYMV kernels along these lines have been proposed in [7], [11].
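The counting behind the halved Byte/flop requirement can be made explicit (a back-of-the-envelope sketch for double precision, counting only the traffic for A):

    GEMV:  reads all n^2 elements   ->  8 n^2 bytes / 2 n^2 flop      = 4 Byte/flop
    SYMV:  reads one triangle only  ->  8 (n^2/2) bytes / 2 n^2 flop  = 2 Byte/flop

With memory bandwidth B, the attainable rates are therefore about B/4 and B/2 GFLOP/s respectively, which is the factor-of-two headroom referred to above.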

The CUBLAS shipped with CUDA 6.0 [12] and KBLAS [7] *1 are recent high-performance implementations against which we compare. Building on the Fermi/Kepler SYMV kernel and auto-tuning (AT) framework of our earlier report [13], this report extends the kernel to newer GPUs and packages it as a general CUDA BLAS routine: besides the real types (double, float) it covers the complex types (cuFloatComplex, cuDoubleComplex) and DD (double-double) precision [14], all generated from the same mutex-based ("atomic algorithm") SYMV kernel.

*1  With cublasSetAtomicsMode set to CUBLAS_ATOMICS_ALLOWED.

2. CUDA-xSYMV

2.1 Atomic algorithm

The kernel is designed around the halved data traffic of SYMV. Fig. 1 shows the sequential algorithm it is derived from: each element Aij = A(i,j) loaded from the stored (upper) triangle is used twice in "part one", once for the row contribution accumulated in t and once for the column contribution accumulated in the work vector v; the diagonal is handled in "part two", and v is folded into y in "part three". One loaded word therefore serves four flops instead of two (1 word / 4 flop instead of 1 word / 2 flop): 8/4 = 2 Byte/flop in double precision and 4/4 = 1 Byte/flop in single precision, half the requirement of a plain MV (GEMV).

  ! Sequential SYMV kernel algorithm
  ! Compute y := alpha*A*x + beta*y
  v(1:n) = 0;  y(1:n) *= beta
  ! part one
  do j = 1, n
    t = 0
    do i = 1, j-1
      Aij = A(i,j)
      v(i) += Aij*x(j)
      t    += Aij*x(i)
    enddo
    y(j) += alpha*t
  enddo
  ! part two
  do i = 1, n
    y(i) += alpha*A(i,i)*x(i)
  enddo
  ! part three
  y(1:n) += alpha*v(1:n)

  Fig. 1  Sequential SYMV algorithm (upper triangle of A).

  Fig. 2  Blocking of the SYMV kernel (diagram): the thread-block shape parameters (Tx, Ty, U) and the indices (i, d, k, s) used in Fig. 3; threadIdx.x runs over the Tx rows handled by a block and threadIdx.y partitions the U-wide block column into U/Ty pieces.

2.1.1 Blocking

The blocking follows our earlier kernels [11], [13]. The stored triangle is divided into block columns of width U, each assigned to one thread block of Tx x Ty threads as sketched in Fig. 2: the contributions to y are accumulated in registers (yreg) and reduced across Tx, while the contributions to the work vector v (wreg) are reduced across Ty before being added to v. Fig. 3 gives the resulting CUDA kernels; parts one to three of Fig. 1 map onto symv_main and the pre-/post-processing kernels.
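For reference, the Fig. 1 baseline can be written as plain host-side C in a few lines. This is only a sketch (0-based indexing, column-major storage with leading dimension lda, upper triangle referenced); it is the algorithm that the blocked kernel of Fig. 3 parallelizes, not the library's actual source:

  #include <stdlib.h>

  /* Reference SYMV: y := alpha*A*x + beta*y, A symmetric, upper triangle stored. */
  void ref_symv_upper(int n, double alpha, const double *A, int lda,
                      const double *x, double beta, double *y)
  {
      double *v = (double *)calloc(n, sizeof(double));   /* work vector, v = 0 */
      for (int i = 0; i < n; i++) y[i] *= beta;

      /* part one: strictly upper triangle, each element used twice */
      for (int j = 0; j < n; j++) {
          double t = 0.0;
          for (int i = 0; i < j; i++) {
              double aij = A[i + (size_t)j * lda];
              v[i] += aij * x[j];      /* column contribution */
              t    += aij * x[i];      /* row contribution    */
          }
          y[j] += alpha * t;
      }
      /* part two: diagonal */
      for (int i = 0; i < n; i++)
          y[i] += alpha * A[i + (size_t)i * lda] * x[i];

      /* part three: fold the work vector into y */
      for (int i = 0; i < n; i++)
          y[i] += alpha * v[i];

      free(v);
  }

On the GPU it is exactly the v[i] updates of part one that different thread blocks would race on; serializing them is the subject of Section 2.1.2.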

2.1.2 Mutual exclusion on the work vector

In part one, different thread blocks accumulate into the same entries v(j), so these updates must be serialized. Fig. 3 protects them with a mutex-like Ticket that is acquired and released through get_Ticket() and release_Ticket() (Fig. 4): get_Ticket() spins on the Ticket with atomicCAS until it holds the calling block's ID, and release_Ticket() passes the Ticket on with atomicExch (the CUBLAS [1] in CUDA 6.0 [12] likewise offers an atomics-based SYMV; cf. footnote *1). Because the hardware does not guarantee the order in which thread blocks are scheduled, get_blockid() in Fig. 4 also reassigns block IDs at run time with atomicInc, so that the ticket order matches the order in which blocks actually start.

  kernel symv_preprocess
    j := get_threadid().
    if j < n then v(j) := 0, and y(j) *= beta.
    if j < MAX_blkid then ticket(j) := MAX_blkid.
    if j = 0 then atomicExch( &Master_blkid, 0 ).
  endkernel

  kernel symv_main <Tx, Ty, U, M>
    define j := jj + threadIdx.x.
    thid := get_localid(), and blkid := get_blockid().
    d := (U/Ty)*threadIdx.y, i := U*blkid, and s := ceil(i-1, Tx).
    Ticket := ticket.
    yreg[0] := ... := yreg[U/Ty-1] := 0.
    // part one
    for jj := 0 to s-1 step Tx
      if j < i-1 then areg[k] := A(j, i+k+d), yreg[k] += areg[k]*x(j),
                  and wreg := sum_k areg[k]*x(i+k+d),  for k in [0, U/Ty).
      get_Ticket( Ticket )
      wreg := sum of wreg across Ty.
      if j < i-1 then v(j) += wreg.
      release_Ticket( Ticket ), and Ticket++.
    endfor
    // part two
    for j := thid to U do shm[thid][j] := shm[j][thid] := A(i+thid, i+j). endfor
    syncthreads
    if thid < U then yreg[k] := shm[thid][k]*x(i+k),  for k in [0, U/Ty).
    shm[k][thid] := sum of yreg[k] across Tx,  for k in [0, U/Ty).
    if thid < U then y(i+thid) += alpha*shm[thid][thid].
  endkernel

  kernel symv_postprocess
    // part three
    j := get_threadid().
    if j < n then y(j) += alpha*v(j).
  endkernel

  Fig. 3  Atomic-algorithm SYMV kernels (pseudocode; upper triangle of A, CUDA notation; n is assumed to be a multiple of U).

  function get_blockid
    if isMasterThread() then c := atomicInc( &Master_blkid ).
    broadcast c from the master thread.
    return MAX_blkid - c.
  endfunction

  function get_threadid
    return threadIdx.x + blockIdx.x*blockDim.x.
  endfunction

  function get_localid
    return threadIdx.x + threadIdx.y*blockDim.x.
  endfunction

  procedure get_Ticket( int *Ticket )
    if isMasterThread() then
      while (TRUE)
        c := atomicCAS( Ticket, blkid, -1 ).
        if c = blkid break
      endwhile
    syncthreads
  endprocedure

  procedure release_Ticket( int *Ticket )
    syncthreads
    if isMasterThread() then atomicExch( Ticket, blkid-1 ).
  endprocedure

  Fig. 4  Helper routines of the atomic algorithm: run-time block-ID reassignment and the ticket lock.
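In actual CUDA, the ticket lock of Fig. 4 can be sketched as follows. This is illustrative only: it assumes the ticket word is overwritten with -1 while held and is handed to block blkid-1 on release, and it is not the library's implementation.

  // Ticket lock: block "blkid" may update v(...) only while *ticket == blkid.
  __device__ void get_ticket(int *ticket, int blkid)
  {
      if (threadIdx.x == 0 && threadIdx.y == 0) {              // master thread spins
          while (atomicCAS(ticket, blkid, -1) != blkid) { }    // wait for our turn
      }
      __syncthreads();                                         // whole block may now update v
  }

  __device__ void release_ticket(int *ticket, int blkid)
  {
      __syncthreads();                                         // all partial sums written
      if (threadIdx.x == 0 && threadIdx.y == 0)
          atomicExch(ticket, blkid - 1);                       // pass the ticket to the next block
  }

Only the short window in which a block adds its Ty-reduced partial sums to v(j) is serialized in this way; the loads of A and the register accumulation of part one proceed fully in parallel.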

  // void symv<T>
  // ( char, int, T, T*, int, T*, int, T, T*, int )
  void ASPEN_dsymv( char uplo, int n, double alpha, double *a, int lda,
                    double *x, int incx, double beta, double *y, int incy );
  void ASPEN_ssymv( char uplo, int n, float alpha, float *a, int lda,
                    float *x, int incx, float beta, float *y, int incy );
  void ASPEN_chemv( char uplo, int n, cuFloatComplex alpha, cuFloatComplex *a, int lda,
                    cuFloatComplex *x, int incx, cuFloatComplex beta,
                    cuFloatComplex *y, int incy );
  void ASPEN_zhemv( char uplo, int n, cuDoubleComplex alpha, cuDoubleComplex *a, int lda,
                    cuDoubleComplex *x, int incx, cuDoubleComplex beta,
                    cuDoubleComplex *y, int incy );
  void ASPEN_wsymv( char uplo, int n, cuddreal alpha, cuddreal *a, int lda,
                    cuddreal *x, int incx, cuddreal beta, cuddreal *y, int incy );

  Fig. 5  The x-SYMV API.

2.1.3 Kernel structure

A call to x-SYMV issues the three kernels of Fig. 3 in order: symv_preprocess clears the work vector v and pre-scales y by beta, symv_main computes parts one and two, and symv_postprocess adds alpha*v to y (part three).

2.2 Precision variants (template / cuComplex / dd_real)

The SYMV kernel is written once as a template. For the real routines the template parameter T is simply instantiated with double or float; Fig. 5 lists the resulting x-SYMV API.

2.2.1 Complex arithmetic ([CZ]HEMV)

CUDA provides cuComplex.h, in which cuFloatComplex and cuDoubleComplex are typedefs of float2 and double2. The Hermitian routines CHEMV and ZHEMV instantiate the same kernel template with these types and their overloaded arithmetic operators *2.

*2  cuComplex is itself such a typedef; the DD type below is handled in the same way.

2.2.2 Quadruple-precision arithmetic (WSYMV)

For quadruple-precision-like arithmetic on the GPU we adopt Bailey's DD (double-double) format [14], [15], which is also used by MPACK [16] (e.g. its DGEMM) and by QPBLAS-GPU [17]. In Bailey's formulation one DD operation costs roughly 21 double-precision flops; counting 3 DD loads and 1 DD store against those flops gives (3+1)*(8*2)/21 = 3.0 Byte/flop, so the DD kernel, too, remains limited by memory bandwidth rather than by arithmetic. In the performance figures DD operations are counted as DDFLOPS; since one DD word holds two doubles (16 bytes), the Byte/flop requirement doubles and the attainable DD rate is half that of double precision. A DD value is stored as a double2, and the API uses the typedef cuddreal, corresponding to dd_real of the qd library [14].

3. Auto-tuning (AT)

3.1 Tuning parameters

Three parameters fix the shape of the kernel, (Tx, Ty, U); two further parameters, m and M, relate to the assignment of work to the SM(X)s, and their definitions are given in [13]. Together with the GPU-dependent boolean options of Table 1, the search space comes to several thousand candidate kernels per GPU *3.

*3  The exact count depends on m.

3.2 Parameter selection

As in [13], the parameter search is accelerated with d-spline-based estimation [18].
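To make the DD format of Section 2.2.2 concrete, the two basic double2 operations can be sketched with the usual error-free transformations (two-sum, and an FMA-based two-prod). This follows Bailey's algorithms in spirit; the actual cuddreal operators in the library, and their exact flop counts, may differ:

  // DD (double-double) value: double2 d, with d.x = high word and d.y = low word.
  __device__ double2 dd_add(double2 a, double2 b)
  {
      double s  = a.x + b.x;                      // two-sum of the high words
      double bv = s - a.x;
      double e  = (a.x - (s - bv)) + (b.x - bv);  // rounding error of s
      e += a.y + b.y;                             // fold in the low words
      double hi = s + e;                          // renormalize
      return make_double2(hi, e - (hi - s));
  }

  __device__ double2 dd_mul(double2 a, double2 b)
  {
      double p  = a.x * b.x;
      double e  = fma(a.x, b.x, -p);              // exact low part of the product
      e += a.x * b.y + a.y * b.x;                 // cross terms
      double hi = p + e;
      return make_double2(hi, e - (hi - p));
  }

Counting the flops in operator bodies of this kind is what yields the roughly 21 double flops per DD operation quoted above, and hence the 3.0 Byte/flop estimate for the DD kernel.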

Table 1  Parameter space of the SYMV kernel.

  I. Kernel-shape parameters
     Tx : thread-block size in the x direction,  Tx in {32, 64, ..., Txmax}
     Ty : thread-block size in the y direction,  Ty in {1, 2, ..., 8}
     U  : block-column width, with  U/Ty in {3, 4, ..., 32}
     M  : M in {1, 2, ..., 10}
     Constraints: i) 96 (= 3 x WarpSize) <= Tx*Ty <= Txmax, with Txmax := 288 (D, W, Z, C) or 320 (S);
                  ii) U <= Tx.

  II. GPU-dependent boolean parameters (per architecture: Fermi / Kepler / Maxwell)
     USE_VOLATILE : volatile qualifier                           (1 / 1 / 1)
     USE_TEXTURE  : texture memory
     USE_RESTRICT : const TYPE __restrict__ *  (read-only cache)
     USE_LDG      : __ldg()                    (read-only cache)

The search is carried out over Tx, Ty, U and m. Table 2 lists the top-five SSYMV kernel configurations (Tx, Ty, U, m, M) found on each GPU; the d-spline estimation narrows the search to a handful of candidates around these. For the released routine, the tuned kernels are selected at run time by simple if-then chains keyed on the matrix dimension n; Table 3 shows the chains generated for the four GPUs. Both the best configurations and the switching thresholds differ considerably from GPU to GPU.

4. Performance evaluation

The SYMV kernels were evaluated on the four GPUs listed in Table 4. Since SYMV is bandwidth-bound, the attainable performance is the memory bandwidth (B/s) divided by the Byte/flop requirement, which for SYMV is: W 4 (= (8*2)/4), D 2 (= 8/4), S 1 (= 4/4), Z 1 (= (8*2)/(4*4)), and C 0.5 (= (4*2)/(4*4)) Byte/flop. On the GTX Titan Black (336 GB/s) the corresponding bounds are

  W (cuddreal, quadruple precision):   84 GFLOPS
  D (double):                         168 GFLOPS
  S (float):                          336 GFLOPS
  Z (cuDoubleComplex):                336 GFLOPS
  C (cuFloatComplex):                 672 GFLOPS

that is, W : D : S : Z : C = 1 : 2 : 4 : 4 : 8.

4.1 [DS]SYMV

Figures 6 to 9 plot the [DS]SYMV performance (GFLOPS) on the four GPUs against the other CUDA BLAS implementations: CUBLAS 6.5 [1], KBLAS 1.0 [10], and MAGMABLAS 1.5.0 (beta3) [2]. Our SYMV is the fastest of the four libraries on all four GPUs, by up to roughly 25%. In terms of effective memory bandwidth the kernel achieves

  GTX 580:       D 148 GB/s (77%),  S 149 GB/s (78%)
  Tesla K20c:    D 134 GB/s (64%),  S 131 GB/s (63%)
  Titan Black:   D 220 GB/s (65%),  S 205 GB/s (61%) *4
  GTX 750 Ti:    D  68 GB/s (78%),  S  74 GB/s (85%)

relative to the catalog bandwidth of each GPU *5.

*4  At the memory clock of 6000 MHz actually observed during the runs (288 GB/s, see Table 4), these correspond to 76% and 71%.
*5  For reference, the CUDA bandwidthTest reports GTX 580: 170 GB/s, K20c: 146 GB/s, Titan Black: 229 GB/s, GTX 750 Ti: 67.3 GB/s.

4.2 WSYMV and [CZ]HEMV

Figure 10 shows WSYMV, CHEMV and ZHEMV on the GeForce GTX Titan Black together with the [DS]SYMV results; Tables 2 to 4 and the performance plots are given on the following pages.
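For orientation, a single measurement point in Figs. 6 to 10 can be taken with a few lines of host code. The sketch below assumes, as with the legacy CUBLAS interface, that a, x and y already reside on the device; ASPEN_dsymv is the Fig. 5 entry point, and the event-timing scaffolding is illustrative rather than the benchmark actually used:

  #include <cuda_runtime.h>

  // Provided by the library (Fig. 5):
  extern void ASPEN_dsymv(char uplo, int n, double alpha, double *a, int lda,
                          double *x, int incx, double beta, double *y, int incy);

  // Returns the DSYMV rate in GFLOPS for one call on device arrays dA, dx, dy.
  double time_dsymv(int n, double *dA, double *dx, double *dy)
  {
      cudaEvent_t t0, t1;
      cudaEventCreate(&t0);  cudaEventCreate(&t1);

      cudaEventRecord(t0);
      ASPEN_dsymv('U', n, 1.0, dA, n, dx, 1, 0.0, dy, 1);   // y := A*x, upper triangle
      cudaEventRecord(t1);
      cudaEventSynchronize(t1);

      float ms = 0.0f;
      cudaEventElapsedTime(&ms, t0, t1);
      cudaEventDestroy(t0);  cudaEventDestroy(t1);
      return 2.0 * n * (double)n / (ms * 1.0e6);            // SYMV performs ~2*n^2 flops
  }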

Vol.214-HPC-146 No.14 214/1/3 2 SSYMV GPU Top 5 ( Tx, Ty, U, m, M ) kernel ID GTX58 K2c Titan Black GTX75Ti 1. (64, 2, 42, 4, ) (64, 4, 48, 4, ) (64, 4, 48, 4, ) (96, 1, 23, 8, ) 2. (64, 2, 42, 4, 1) (96, 3, 27, 4, 1) (64, 5, 6, 3, ) (64, 2, 28, 8, ) 3. (32, 8, 32, 2, 9) (96, 3, 51, 3, ) (64, 4, 44, 4, 1) (32, 8, 32, 2, 9) 4. (64, 4, 64, 2, 1) (64, 2, 34, 7, ) (96, 3, 27, 4, ) (64, 2, 28, 8, 3) 5. (96, 3, 36, 2, ) (64, 4, 44, 4, 1) (128, 2, 36, 3, 3) (64, 2, 36, 7, 1) 3 GPU (ID= L+U. if-then ) GTX58 K2c Titan Black GTX75Ti if ( 1 n < 16 ) { if ( 1 n < 1771 ) { if ( 1 n < 298 ) { if ( (1 n < 47 ) { ID=; ID=; ID=; ID=; } elsif ( 16 n < 1842 ) { } elsif ( 1777 n < 2989 ) { } elsif (298 n < 2172 ) { } elsif ( 47 n < 261 ) { ID=16; ID=6; ID=6; ID=9; } elsif ( 2166 n < 1842 ) { } elsif ( 2989 n < 3565 ) { } elsif (2172 n < 4378 ) { } elsif ( 261 n < 294 ) { ID=13: ID=5; ID=14; ID=1;... } elsif ( 3565 n ) {...... } elsif ( 579 n ) { ID=1; } elsif ( 32412 n ) { } elsif ( 1955 n ) { ID=1; } ID=1; ID=6; } } } 4 GPU CPU / ( 3 7MHz, GPUBoost 6MHz. 6MHz 288GB/s.) GTX58 Tesla K2c Titan Black GTX75Ti Compute Capability 2. 3.5 3.5 5. GPU Clock (MHz) 1544(boost NA) 76(boost NA) 889(boost 98) 12(boost 185) Multiprocessors 16 13 15 5 CUDA Cores 512 2496 288 64 Memory Capacity (MB) 1536 512 6144 248 Memory Clock (MHz) 48(384bit) 52(32bit) 7(384bit) * 6 54(128bit) Memory Bandwidth (GB/s) 192 28 336 86.4 ECC Support NA Enabled NA NA Host (a) (b) (c) (a) Host (a) Host (b) Host (b) CPU AMD FX-812 Intel Core i7-393k Intel Core i7-393k CPU Core 8 6 6 CPU Clock (GHz) 3.1 3.2 3.2 Memory Capacity (GB) 16 16 16 Linux Kernel version 3.6.11-4 3.11.1-1 3.11.1-1 CUDA Version 6.5 6.5 6.5 Driver Version 34.29 34.29 34.29 GNU gcc Version 4.6.3 4.7.2 4.7.2 c 214 Information Processing Society of Japan 6

  Fig. 6  SYMV performance on the GeForce GTX 580 for matrix dimensions up to n = 20,000 (top: DSYMV, double precision; bottom: SSYMV, single precision), comparing ASPEN.K2 1.3, CUBLAS (CUDA 6.5), KBLAS 1.0, and MAGMA 1.5.0 beta3.

  Fig. 7  SYMV performance on the Tesla K20c for matrix dimensions up to n = 20,000 (top: DSYMV, double precision; bottom: SSYMV, single precision), comparing ASPEN.K2 1.3, CUBLAS (CUDA 6.5), KBLAS 1.0, and MAGMA 1.5.0 beta3.

  Fig. 8  SYMV performance on the GeForce GTX Titan Black for matrix dimensions up to n = 20,000 (top: DSYMV, double precision; bottom: SSYMV, single precision), comparing ASPEN.K2 1.3, CUBLAS (CUDA 6.5), KBLAS 1.0, and MAGMA 1.5.0 beta3.

  Fig. 9  SYMV performance on the GeForce GTX 750 Ti for matrix dimensions up to n = 20,000 (top: DSYMV, double precision; bottom: SSYMV, single precision), comparing ASPEN.K2 1.3, CUBLAS (CUDA 6.5), KBLAS 1.0, and MAGMA 1.5.0 beta3.

In terms of Byte/flop, the measured D : S : Z : C rates follow the expected ratio 1 : 2 : 2 : 4, i.e. the complex routines [CZ]HEMV run as efficiently as [DS]SYMV. WSYMV, however, falls short of its bound of half the DSYMV rate, staying around 50 GFLOPS. Contributing factors include the fact that the kernel shape and its tuning were carried over from [SD]SYMV rather than retuned for DD, and the quality of the code nvcc generates for the DD operators.

5. Summary

We presented a SYMV kernel that exploits the two-fold reuse of each matrix element by accumulating the column-direction contributions in a work vector whose updates are serialized with a mutex (ticket) mechanism. The kernel was auto-tuned and evaluated on four GPUs against existing CUDA BLAS implementations, and the same template was extended from the real types to complex and DD (double-double) precision. We plan to apply the same approach to further Level-2 BLAS routines of our GPU BLAS.

Acknowledgment  Part of this work was supported by a research grant (No. 22143).

The [DS]SYMV routines described here are available as part of the Level-2 CUDA BLAS library ASPEN.K2 (http://www.aics.riken.jp/labs/lpnctrt/aspenk2.html); WSYMV and [CZ]HEMV are to follow.

References
[1]  NVIDIA Corporation: The NVIDIA CUDA Basic Linear Algebra Subroutines (CUBLAS), http://developer.nvidia.com/cublas
[2]  Innovative Computing Laboratory, University of Tennessee: Matrix Algebra on GPU and Multicore Architectures (MAGMA), http://icl.cs.utk.edu/magma
[3]  Sørensen, H. H. B.: Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs, Parallel Processing and Applied Mathematics, LNCS 7203 (2012), pp. 619-629.
[4]  Sørensen, H. H. B.: Auto-Tuning of Level 1 and Level 2 BLAS for GPUs, Concurrency Computat.: Pract. Exper., Wiley (2012), pp. 1183-1198.
[5]  GPUlab: GLAS library version 0.0.2, http://gpulab.imm.dtu.dk/docs/glas_v0.0.2_C2050_cuda_4.0_linux.tar.gz
[6]  --: A fast DGEMV implementation in CUDA (in Japanese), Vol. 4, No. 4 (Oct. 2011), pp. 158-168.
[7]  Abdelfattah, A., Keyes, D. and Ltaief, H.: KBLAS: High Performance Level-2 BLAS on Multi-GPU Systems, GTC 2014 poster, http://on-demand.gputechconf.com/gtc/2014/poster/pdf/P4168_KBLAS_GPU_computing_optimization.pdf (2014).
[8]  Imamura, T.: ASPEN-K2: Automatic-tuning and Stabilization for the Performance of CUDA BLAS Level 2 Kernels, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12), 2012, http://www.siam.org/meetings/pp12/
[9]  Nath, R., Tomov, S., Dong, T. T. and Dongarra, J.: Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs, Proc. 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), (2011), pp. 6:1-6:10.
[10] Abdelfattah, A., Keyes, D. and Ltaief, H.: KAUST BLAS (KBLAS), http://cec.kaust.edu.sa/pages/kblas.aspx
[11] Imamura, T., Yamada, S. and Machida, M.: A High Performance SYMV Kernel on a Fermi-core GPU, High Performance Computing for Computational Science (VECPAR 2012), LNCS 7851 (2013), pp. 59-70.
[12] NVIDIA Corporation: CUDA C Programming Guide, http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (2014).
[13] --: A SYMV kernel for Fermi and Kepler GPUs (in Japanese), IPSJ SIG Technical Report (HPC), Vol. 2012-HPC-138, No. 8 (2012), pp. 1-7.
[14] Hida, Y., Li, X. S. and Bailey, D. H.: Quad-double arithmetic: Algorithms, implementation, and application (Oct. 2000), http://www.davidhbailey.com/dhbpapers/quad-double.pdf
[15] Bailey, D. H. and Borwein, J. M.: High-Precision Computation and Mathematical Physics, http://crd.lbl.gov/~dhbailey/dhbpapers/dhb-jmb-acat08.pdf
[16] Nakata, M.: The MPACK (MBLAS/MLAPACK); a multiple precision arithmetic version of BLAS and LAPACK, http://mplapack.sourceforge.net/
[17] --: Introduction of QPBLAS-GPU (in Japanese), Vol. 18, D-13-5 (2013).
[18] --: d-spline (in Japanese), 2008.

  Fig. 10  Performance of x-{SY,HE}MV on the GeForce GTX Titan Black for matrix dimensions up to n = 20,000 (top: ASPEN.K2 1.3, bottom: CUDA 6.5 CUBLAS). Shown are DSYMV, SSYMV, ZHEMV, CHEMV and, for ASPEN.K2, WSYMV; the [DS]SYMV curves are sampled more finely than the WSYMV and [CZ]HEMV curves, and WSYMV is plotted in DDFLOPS.

(c) 2014 Information Processing Society of Japan