KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(dou

Vol.214-HPC-146 No.14 214/1/3 CUDA-xSYMV 1,3,a) 1 2,3 2,3 (SYMV)., (GEMV) 2.,, mutex., CUBLAS., 1 2,. (AT). 2, SYMV GPU., SSYMV( SYMV), GeForce GTXTitan Black 211GFLOPS( 62.8%)., ( ) (, ) DD(double-double), SYMV (CHEMV, ZHEMV, WSYMV),. 1. (SYMV).,. y := αa UorL x + βy (A(= A T ) R n n, x R n ) (1) BLAS,,. SYMV BLAS 2, O(N 2 ) O(N 2 ), O(1)., GPU CPU, GPU BLAS CUBLAS[1] 1 RIKEN Advanced Institute for Computational Science, Kobe, Hyogo 2 Japan Atomic Energy Agency, Kashiwa, Chiba 3 CREST CREST JST, Kawaguchi, Saitama a) imamura.toshiyuki@riken.jp MAGMA[2] CPU., Sørensen GLAS [3], [4], [5] GEMV 63.8%(NVIDIA Tesla C25 9GB/s=45GFLOPS). [6], [7], [8]., SYMV GPU [9], [1], [11]. SYMV SYxxx,., GEMV 1/2 [9], [11]., Byte/flop 1/2, GEMV 2.,, GEMV 2., SYMV [7], [11]., CUDA 6.[12] CUBLAS c 214 Information Processing Society of Japan 1

KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(doubledouble) [14], ( Atomic Algorithm) SYMV.,, SYMV. 2. CUDA-xSYMV 2.1 (Atomic algorithm) 1/2. 1, A SYMV. 2 (Aij=A(i,j)) 1 2., 1word/4flop 1word/2flop). 8/4=2B/F, 4/4=1B/F. MV. B/F. 2.1.1 2, 2 1, 2 1 *1, cublassetatomicsmode CUBLAS ATOMICS ALLOWED.! Sequential SYMV kernel algorithm! Compute y:=alpha*a*x+beta*y! v(1:n)=; y(1:n)*=beta! part one do j=1,n t= do i=1,j-1 Aij=A(i,j) v(i)+=aij*x(j) t+=aij*x(i) enddo y(j)+=alpha*t enddo! part two do i=1,n y(i)+=alpha*a(i,i)*x(i) enddo! part three y(1:n)+=alpha*v(1:n) 1 2 SYMV ( A ) i Tx U s Tx U/Ty dd dd threadidx.x dd d k dd threadidx.y dd dd Ty (Tx, Ty, U) (i, d, k, s)., [11], [13] 1 2.. 1 1, 2 1. v, Ty. 3 CUDA 1. 1 3, part one three. 2.1.2 Vol.214-HPC-146 No.14 214/1/3 1 3 part one 2 ( ) c 214 Information Processing Society of Japan 2

3 kernel symv preprocess j := get threadid(). if j < n then v(j) :=, and y(j) *= beta. if j < MAX blkid then ticket(j) := MAX blkid. if j = then atomicexch( &Master blkid, ). endkernel kernel symv main <Tx, Ty, U, M> define j j + threadidx.x. thid := get localid(), and blkid := get blockid(). d := (U/Ty)*threadIdx.y, i := U*blkID, and s := ceil(i 1, Tx). Ticket := ticket. yreg[] :=... := yreg[u/ty 1] :=. // part one for j:= to s 1 step Tx if j < i 1 then areg[k] := A(j, i + k + d), yreg[k] += areg[k]*x(j), and wreg := kareg[k]*x(i + k) for k [, U/Ty). get Ticket( Ticket ) wreg := sumup wreg through Ty. if j < i 1 then v(j) += wreg. release Ticket( Ticket ), and Ticket++. endfor // part two for j:= thid to U do shm[thid][j] := shm[j][thid] := A(i + thid, i + j). endfor synchthreads if thid < U then yreg[k] := shm[thid][k]*x(i+k) for k [, U/Ty). shm[k][thid] := sumup yreg[k] through Tx for k [, U/Ty). if thid < U then y(i+thid) += alpha*shm[thid][thid]. endkernel kernel symv postprocess // part three j := get threadid(). if j < n then y(j) += alpha*v(j). endkernel 2 Atomic ( A, CUDA., n U.) function get blockid if ismasterthread() then c := atomicinc( &Master blkid ). broadcast c of MasterThread. return MAX blkid c. endfunction function get threadid return threadidx.x+blockidx.x*blockdim.x. endfunction function get localid return threadidx.x+threadidx.y*blockdim.x. endfunction procedure get Ticket( int *Ticket ) if ismastertthread() then while (TRUE) c := atomiccas( Ticket, blkid, 1 ). if c = blkid break endwhile syncthreads endprocedure procedure relase Ticket( int *Ticket ) syncthreads if ismasterthread() then atomicexch( Ticket, blkid 1 ). endprocedure 4 Atomic., v(i) part one. 3, mutex Ticket get_ticket() release_ticket(), ( 4 ). get_ticket() atomiccas Ticket 1. release_ticket().,. CUDA 6.[12] CUBLAS[1].,,.,, ID,. Vol.214-HPC-146 No.14 214/1/3 c 214 Information Processing Society of Japan 3

// void symv <T> // ( char, int, T, T*, int, T*, int, T, T*, int ) // void ASPEN_dsymv ( char uplo, int n, double alpha, double *a, int lda, double *x, int incx, double beta, double *y, int incy ); void ASPEN_ssymv ( char uplo, int n, float alpha, float *a, int lda, float *x, int incx, float beta, float *y, int incy ); void ASPEN_chemv ( char uplo, int n, cufloatcomplex alpha, cufloatcomplex *a, int lda, cufloatcomplex *x, int incx, cufloatcomplex beta, cufloatcomplex *y, int incy); void ASPEN_zhemv ( char uplo, int n, cudoublecomplex alpha, cudoublecomplex *a, int lda, cudoublecomplex *x, int incx, cudoublecomplex beta, cudoublecomplex *y, int incy); void ASPEN_wsymv ( char uplo, int n, cuddreal alpha, cuddreal *a, int lda, cuddreal *x, int incx, cuddreal beta, cuddreal *y, int incy); 5 2.1.3 x-symv API part two, three, 2, v y. preprocess v, y,. 2.2 (template/cucomplex/dd real), SYMV 1., double float, T double float., ( 5 x-symv API ). 2.2.1 ([CZ]HEMV) CUDA, cucomplex.h, cufloatcomplex cudoublecomplex typedef float2 double2., cucomplex,.,,. 2.2.2 4 (WSYMV) GPU.,., Bailey DD(double double) [14], [15]. MPACK[16] DGEMM QPBLAS-GPU[17], GPU. DD, Bailey 1DD 21. Byte/flop 3 LD 1ST (3+1)*(8*2)/21=3.Byte/flop. DD.,, DD DDFLOPS, DD double 2., DD 1/2. 4 DD(double double), typedef double2, DD double2 2.,, typedef cuddreal ( qd [14] dd_real ). * 2 3. (AT) 3.1 3 (Tx, Ty, U), SM(X) (, ) m, M. m, M [13]., GPU 1.,, 5 388 *3. 3.2 Vol.214-HPC-146 No.14 214/1/3 [13], 2 d-spline[18] *2, typedef cucomplex., DD. *3 m, c 214 Information Processing Society of Japan 4

1 I ( ) T x x {32, 64,..., T xmax } T y y {1, 2,..., 8} U U/T y {3, 4,..., 32} M {1, 2,..., 1} i) 96 (3 WarpSize) T x T y T xmax, T xmax := {288 (D, W, Z, C), 32 (S)}. ii) U T x. II (GPU boolean ) USE VOLATILE USE TEXTURE USE RESTRICT USE LDG volatile texture memory const TYPE restrict * read only cache ldg() read only cache Fermi Kepler Maxwell USE VOLATILE 1 1 1 USE TEXTURE 1 USE RESTRICT 1 USE LDG 1.,, Tx, Ty, U, m. Tx, Ty, m, U. 2,. 2 GPU SSYMV Top 5., 2., d-spline Top2,. SYMV, 3 if-then. if-then. 2, Ty 2., 2,., GPU., GPU.,. 4. SYMV 4 4GPU., SYMV, Byte/flop B/s. SYMV Byte/flops W 4(=(8*2)/4), D 2(=8/4), S 1(=4/4), Z 1(=(8*2)/(4*4)), C.5(=(4*2)/(4*4)),. Titan Black W(cuddreal: 4 ): 84GFLOPS D(double: ): 168GFLOPS S(float: ): 336GFLOPS Z(cuDoubleComplex: ): 336GLOPS C(cuFloatComplex: ): 672GFLOPS. W:D:S:Z:C=1:2:4:4:8. 4.1 [DS]SYMV 6 9, 4 GPU [DS]SYMV (GFLOPS)., BLAS CUDA CUBLAS 6.5[1], KBLAS 1.[1], MAGMABLAS 1.5.(beta3)[2]. SYMV, 4GPU, 1 25%.,, Titan Black CUBALS6.5. NVIDIA GPU Titan Black. GPU. GTX58: D(148GB/s=77%), S(149GB/s=78%) K2c: D(134GB/s=64%), S(131GB/s=63%) Titan Black: D(22GB/s=65%), S(25GB/s=61%) * 4 GTX75Ti: D(68GB/s=78%), S(74GB/s=85%) 61% *5,.,., 2., 1.,. 4.2 WSYMV, [CZ]HEMV Vol.214-HPC-146 No.14 214/1/3 1 GeForce GTX Titan Black, 4 WSYMV, CHEMV, ZHEMV. WSYMV Byte/flop, *4 6MHz, 76%, 71%. *5 bandwidthtest, GTX58: 17GB/s, K2c: 146GB/s, Titan Black: 229GB/s, GTX75Ti: 67.3GB/s. c 214 Information Processing Society of Japan 5

Vol.214-HPC-146 No.14 214/1/3 2 SSYMV GPU Top 5 ( Tx, Ty, U, m, M ) kernel ID GTX58 K2c Titan Black GTX75Ti 1. (64, 2, 42, 4, ) (64, 4, 48, 4, ) (64, 4, 48, 4, ) (96, 1, 23, 8, ) 2. (64, 2, 42, 4, 1) (96, 3, 27, 4, 1) (64, 5, 6, 3, ) (64, 2, 28, 8, ) 3. (32, 8, 32, 2, 9) (96, 3, 51, 3, ) (64, 4, 44, 4, 1) (32, 8, 32, 2, 9) 4. (64, 4, 64, 2, 1) (64, 2, 34, 7, ) (96, 3, 27, 4, ) (64, 2, 28, 8, 3) 5. (96, 3, 36, 2, ) (64, 4, 44, 4, 1) (128, 2, 36, 3, 3) (64, 2, 36, 7, 1) 3 GPU (ID= L+U. if-then ) GTX58 K2c Titan Black GTX75Ti if ( 1 n < 16 ) { if ( 1 n < 1771 ) { if ( 1 n < 298 ) { if ( (1 n < 47 ) { ID=; ID=; ID=; ID=; } elsif ( 16 n < 1842 ) { } elsif ( 1777 n < 2989 ) { } elsif (298 n < 2172 ) { } elsif ( 47 n < 261 ) { ID=16; ID=6; ID=6; ID=9; } elsif ( 2166 n < 1842 ) { } elsif ( 2989 n < 3565 ) { } elsif (2172 n < 4378 ) { } elsif ( 261 n < 294 ) { ID=13: ID=5; ID=14; ID=1;... } elsif ( 3565 n ) {...... } elsif ( 579 n ) { ID=1; } elsif ( 32412 n ) { } elsif ( 1955 n ) { ID=1; } ID=1; ID=6; } } } 4 GPU CPU / ( 3 7MHz, GPUBoost 6MHz. 6MHz 288GB/s.) GTX58 Tesla K2c Titan Black GTX75Ti Compute Capability 2. 3.5 3.5 5. GPU Clock (MHz) 1544(boost NA) 76(boost NA) 889(boost 98) 12(boost 185) Multiprocessors 16 13 15 5 CUDA Cores 512 2496 288 64 Memory Capacity (MB) 1536 512 6144 248 Memory Clock (MHz) 48(384bit) 52(32bit) 7(384bit) * 6 54(128bit) Memory Bandwidth (GB/s) 192 28 336 86.4 ECC Support NA Enabled NA NA Host (a) (b) (c) (a) Host (a) Host (b) Host (b) CPU AMD FX-812 Intel Core i7-393k Intel Core i7-393k CPU Core 8 6 6 CPU Clock (GHz) 3.1 3.2 3.2 Memory Capacity (GB) 16 16 16 Linux Kernel version 3.6.11-4 3.11.1-1 3.11.1-1 CUDA Version 6.5 6.5 6.5 Driver Version 34.29 34.29 34.29 GNU gcc Version 4.6.3 4.7.2 4.7.2 c 214 Information Processing Society of Japan 6

Vol.214-HPC-146 No.14 214/1/3 情報処理学会研究報告 Performance of DSYMV on <GeForce GTX58> 8 7 6 5 4 3 2 ASPEN.K2-1.3-DSYMVU-GTX58.dat CUDA-6.5-DSYMVU-GTX58.dat KBLAS-1.-DSYMVU-GTX58.dat MAGMA-1.5.b3-DSYMVU-GTX58.dat 1 5 1 15 2 Performance of SSYMV on <GeForce GTX58> 16 14 12 1 8 6 4 ASPEN.K2-1.3-SSYMVU-GTX58.dat CUDA-6.5-SSYMVU-GTX58.dat KBLAS-1.-SSYMVU-GTX58.dat MAGMA-1.5.b3-SSYMVU-GTX58.dat 2 5 図 6 1 15 2 GeForce GTX58 での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 7

Vol.214-HPC-146 No.14 214/1/3 情報処理学会研究報告 Performance of DSYMV on <Tesla K2c> 7 6 5 4 3 2 1 ASPEN.K2-1.3-DSYMVU-K2c.dat CUDA-6.5-DSYMVU-K2c.dat KBLAS-1.-DSYMVU-K2c.dat MAGMA-1.5.beta3-DSYMVU-K2c.dat 5 1 15 2 Performance of SSYMV on <Tesla K2c> 14 12 1 8 6 4 2 ASPEN.K2-1.3-SSYMVU-K2c.dat CUDA-6.5-SSYMVU-K2c.dat KBLAS-1.-SSYMVU-K2c.dat MAGMA-1.5.beta3-SSYMVU-K2c.dat 5 図 7 1 15 2 Tesla K2c での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 8

Vol.214-HPC-146 No.14 214/1/3 情報処理学会研究報告 Performance of DSYMV on <GeForce GTXTitan Black> 12 1 8 6 4 2 ASPEN.K2-1.3-DSYMVU-GTXTITANBlack.dat CUDA-6.5-DSYMVU-GTXTITANBlack.dat KBLAS-1.-DSYMVU-GTXTITANBlack.dat MAGMA-1.5.beta3-DSYMVU-GTXTITANBlack.dat 5 1 15 2 Performance of SSYMV on <GeForce GTXTitan Black> 2 15 1 5 ASPEN.K2-1.3-SSYMVU-GTXTITANBlack.dat CUDA-6.5-SSYMVU-GTXTITANBlack.dat KBLAS-1.-SSYMVU-GTXTITANBlack.dat MAGMA-1.5.beta3-SSYMVU-GTXTITANBlack.dat 5 図8 1 15 2 GeForce GTX Titan Black での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 9

Vol.214-HPC-146 No.14 214/1/3 情報処理学会研究報告 Performance of DSYMV on <GeForce GTX75Ti> 4 35 3 25 2 15 1 ASPEN.K2-1.3-DSYMVU-GTX75Ti.dat CUDA-6.5-DSYMVU-GTX75Ti.dat KBLAS-1.-DSYMVU-GTX75Ti.dat MAGMA-1.5.b3-DSYMVU-GTX75Ti.dat 5 5 1 15 2 Performance of SSYMV on <GeForce GTX75Ti> 8 7 6 5 4 3 2 ASPEN.K2-1.3-SSYMVU-GTX75Ti.dat CUDA-6.5-SSYMVU-GTX75Ti.dat KBLAS-1.-SSYMVU-GTX75Ti.dat MAGMA-1.5.b3-SSYMVU-GTX75Ti.dat 1 5 図 9 1 15 2 GeForce GTX75Ti での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 1

D:S:Z:C=1:2:2:4., [DS]SYMV., WSYMV. DSYMV 1/2 5GFLOPS, 4%,. ( 1 ), [SD]SYMV, ( 2 ), ( 3 ) nvcc DD,.,. 5., SYMV, mutex., 2,.. 4GPU, CUDA BLAS., ( ) (, ) DD(double-double),. Level 2 BLAS, GPUBLAS., ( : 22143)., [DS]SYMV Level-2 CUDA BLAS ASPEN.K2, (http://www.aics.riken.jp/labs/lpnctrt/aspenk2. html ). WSYMV [CZ]HEMV. [1] NVIDIA Corporation, The NVIDIA CUDA Basic Linear Algebra Subroutines, http://developer.nvidia.com/cublas [2] Innovative Computing Laboratory, University of Tennessee, Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma [3] Sørensen, H. H. B., Auto-tuning Dense Vector and Vol.214-HPC-146 No.14 214/1/3 Matrix-Vector Operations for Fermi GPUs, Parallel Processing and Applied Mathematics, LNCS 723 (212) 619 629. [4] Sørensen, H. H. B.. Auto-Tuning of Level 1 and Level 2 BLAS for GPUs, Concurrency Computat.: Pract. Exper., Wiley (212) 1183 1198. [5] GPUlab: GLAS library version..2, http://gpulab.imm.dtu.dk/docs/ glas v..2 C25 cuda 4. linux.tar.gz [6], CUDA DGEMV,, Vol.4, No.4 (Oct. 211) 158 168. [7] Abdelfattah, A., Keyes, D., and Ltaief, H., KBLAS: High Performance Level-2 BLAS on Multi-GPU Systems, http://ondemand.gputechconf.com/gtc/214/poster /pdf/p4168 KBLAS GPU computing optimization.pdf, GTC214 (214). [8] Imamura, T., ASPEN-K2: Automatic-tuning and Stabilization for the Performance of CUDA BLAS Level 2 Kernels, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP212), http://www.siam.org/meetings/pp12/ [9] Nath, R., Tomov, S., Dong, T. T., and Dongarra, J., Optimizing Symmetric Dense Matrix-vector Multiplication on GPUs, in Proceedings of 211 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 11 (211) 6:1 6:1. [1] Abdelfattah, A., Keyes, D., and Ltaief, H., KAUST BLAS (KBLAS), http://cec.kaust.edu.sa/pages/kblas.aspx [11] Imamura, T., Yamada, S., and Machida, M., A High Performance SYMV Kernel on a Fermi-core GPU, High Performance Computing for Computational Science VECPAR 212, LNCS 7851 (213) 59 7. [12] NVIDIA Corporation, CUDA C Programming guide, http://docs.nvidia.com/cuda/pdf/cuda C Programm ing Guide.pdf (214). [13],,,,, Fermi, Kepler GPU SYMV,, HPC, Vol. 212-HPC-138, No. 8 (212) 1 7. [14] Hida, H., Li, X. S., and Bailey, D. H., Quaddouble arithmetic: Algorithms, implementation, and application (Oct 2), Online PDF http://www.davidhbailey.com/dhbpapers/quaddouble.pdf [15] Bailey, D. H., and Borwein, J. M., High-Precision Computation and Mathematical Physics, texttthttp://crd.lbl.gov/ dhbailey/dhbpapers/dhb-jmbacat8.pdf [16] Nakata, M., The MPACK (MBLAS/MLAPACK); a multiple precision arithmetic version of BLAS and LAPACK, http://mplapack.sourceforge.net/ [17],,,,, QPBLAS-GPU 18, Vol. 18 D-13-5 (213). [18],, (28). c 214 Information Processing Society of Japan 11

Vol.214-HPC-146 No.14 214/1/3 45 Performance of ASPEN.K2 on <GeForce GTXTitan Black> 4 35 3 25 2 15 1 ASPEN.K2-1.3-DSYMVU-GTXTITANBlack.dat ASPEN.K2-1.3-SSYMVU-GTXTITANBlack.dat 5 ASPEN.K2-1.3-zhemv-u.dat ASPEN.K2-1.3-chemv-u.dat ASPEN.K2-1.3-wsymv-u.dat 2 4 6 8 1 12 14 16 18 2 45 Performance of CUDA 6.5 on <GeForce GTXTitan Black> 4 35 3 25 2 15 1 CUDA-6.5-DSYMVU-GTXTITANBlack.dat 5 CUDA-6.5-SSYMVU-GTXTITANBlack.dat CUDA-6.5-zhemv-u.dat CUDA-6.5-chemv-u.dat 2 4 6 8 1 12 14 16 18 2 1 x-{sy HE}MV (GeForce GTX Titan Black, : ASPEN.K2, : CUDA6.5, [DS]-SYMV 8, WSYMV,[CZ]-HEMV 32, WSYMV DD DDFLOPS ) c 214 Information Processing Society of Japan 12