KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(dou

Size: px

Start display at page:

Download "KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(dou"

かずきさかど
6 years ago
Views:

1 Vol.214-HPC-146 No /1/3 CUDA-xSYMV 1,3,a) 1 2,3 2,3 (SYMV)., (GEMV) 2.,, mutex., CUBLAS., 1 2,. (AT). 2, SYMV GPU., SSYMV( SYMV), GeForce GTXTitan Black 211GFLOPS( 62.8%)., ( ) (, ) DD(double-double), SYMV (CHEMV, ZHEMV, WSYMV),. 1. (SYMV).,. y := αa UorL x + βy (A(= A T ) R n n, x R n ) (1) BLAS,,. SYMV BLAS 2, O(N 2 ) O(N 2 ), O(1)., GPU CPU, GPU BLAS CUBLAS[1] 1 RIKEN Advanced Institute for Computational Science, Kobe, Hyogo 2 Japan Atomic Energy Agency, Kashiwa, Chiba 3 CREST CREST JST, Kawaguchi, Saitama a) imamura.toshiyuki@riken.jp MAGMA[2] CPU., Sørensen GLAS [3], [4], [5] GEMV 63.8%(NVIDIA Tesla C25 9GB/s=45GFLOPS). [6], [7], [8]., SYMV GPU [9], [1], [11]. SYMV SYxxx,., GEMV 1/2 [9], [11]., Byte/flop 1/2, GEMV 2.,, GEMV 2., SYMV [7], [11]., CUDA 6.[12] CUBLAS c 214 Information Processing Society of Japan 1

2 KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(doubledouble) [14], ( Atomic Algorithm) SYMV.,, SYMV. 2. CUDA-xSYMV 2.1 (Atomic algorithm) 1/2. 1, A SYMV. 2 (Aij=A(i,j)) 1 2., 1word/4flop 1word/2flop). 8/4=2B/F, 4/4=1B/F. MV. B/F , 2 1, 2 1 *1, cublassetatomicsmode CUBLAS ATOMICS ALLOWED.! Sequential SYMV kernel algorithm! Compute y:=alpha*a*x+beta*y! v(1:n)=; y(1:n)*=beta! part one do j=1,n t= do i=1,j-1 Aij=A(i,j) v(i)+=aij*x(j) t+=aij*x(i) enddo y(j)+=alpha*t enddo! part two do i=1,n y(i)+=alpha*a(i,i)*x(i) enddo! part three y(1:n)+=alpha*v(1:n) 1 2 SYMV ( A ) i Tx U s Tx U/Ty dd dd threadidx.x dd d k dd threadidx.y dd dd Ty (Tx, Ty, U) (i, d, k, s)., [11], [13] , 2 1. v, Ty. 3 CUDA , part one three Vol.214-HPC-146 No /1/3 1 3 part one 2 ( ) c 214 Information Processing Society of Japan 2

3 3 kernel symv preprocess j := get threadid(). if j < n then v(j) :=, and y(j) *= beta. if j < MAX blkid then ticket(j) := MAX blkid. if j = then atomicexch( &Master blkid, ). endkernel kernel symv main <Tx, Ty, U, M> define j j + threadidx.x. thid := get localid(), and blkid := get blockid(). d := (U/Ty)*threadIdx.y, i := U*blkID, and s := ceil(i 1, Tx). Ticket := ticket. yreg[] :=... := yreg[u/ty 1] :=. // part one for j:= to s 1 step Tx if j < i 1 then areg[k] := A(j, i + k + d), yreg[k] += areg[k]*x(j), and wreg := kareg[k]*x(i + k) for k [, U/Ty). get Ticket( Ticket ) wreg := sumup wreg through Ty. if j < i 1 then v(j) += wreg. release Ticket( Ticket ), and Ticket++. endfor // part two for j:= thid to U do shm[thid][j] := shm[j][thid] := A(i + thid, i + j). endfor synchthreads if thid < U then yreg[k] := shm[thid][k]*x(i+k) for k [, U/Ty). shm[k][thid] := sumup yreg[k] through Tx for k [, U/Ty). if thid < U then y(i+thid) += alpha*shm[thid][thid]. endkernel kernel symv postprocess // part three j := get threadid(). if j < n then y(j) += alpha*v(j). endkernel 2 Atomic ( A, CUDA., n U.) function get blockid if ismasterthread() then c := atomicinc( &Master blkid ). broadcast c of MasterThread. return MAX blkid c. endfunction function get threadid return threadidx.x+blockidx.x*blockdim.x. endfunction function get localid return threadidx.x+threadidx.y*blockdim.x. endfunction procedure get Ticket( int *Ticket ) if ismastertthread() then while (TRUE) c := atomiccas( Ticket, blkid, 1 ). if c = blkid break endwhile syncthreads endprocedure procedure relase Ticket( int *Ticket ) syncthreads if ismasterthread() then atomicexch( Ticket, blkid 1 ). endprocedure 4 Atomic., v(i) part one. 3, mutex Ticket get_ticket() release_ticket(), ( 4 ). get_ticket() atomiccas Ticket 1. release_ticket().,. CUDA 6.[12] CUBLAS[1].,,.,, ID,. Vol.214-HPC-146 No /1/3 c 214 Information Processing Society of Japan 3

4 // void symv <T> // ( char, int, T, T*, int, T*, int, T, T*, int ) // void ASPEN_dsymv ( char uplo, int n, double alpha, double *a, int lda, double *x, int incx, double beta, double *y, int incy ); void ASPEN_ssymv ( char uplo, int n, float alpha, float *a, int lda, float *x, int incx, float beta, float *y, int incy ); void ASPEN_chemv ( char uplo, int n, cufloatcomplex alpha, cufloatcomplex *a, int lda, cufloatcomplex *x, int incx, cufloatcomplex beta, cufloatcomplex *y, int incy); void ASPEN_zhemv ( char uplo, int n, cudoublecomplex alpha, cudoublecomplex *a, int lda, cudoublecomplex *x, int incx, cudoublecomplex beta, cudoublecomplex *y, int incy); void ASPEN_wsymv ( char uplo, int n, cuddreal alpha, cuddreal *a, int lda, cuddreal *x, int incx, cuddreal beta, cuddreal *y, int incy); x-symv API part two, three, 2, v y. preprocess v, y,. 2.2 (template/cucomplex/dd real), SYMV 1., double float, T double float., ( 5 x-symv API ) ([CZ]HEMV) CUDA, cucomplex.h, cufloatcomplex cudoublecomplex typedef float2 double2., cucomplex,.,, (WSYMV) GPU.,., Bailey DD(double double) [14], [15]. MPACK[16] DGEMM QPBLAS-GPU[17], GPU. DD, Bailey 1DD 21. Byte/flop 3 LD 1ST (3+1)*(8*2)/21=3.Byte/flop. DD.,, DD DDFLOPS, DD double 2., DD 1/2. 4 DD(double double), typedef double2, DD double2 2.,, typedef cuddreal ( qd [14] dd_real ). * 2 3. (AT) (Tx, Ty, U), SM(X) (, ) m, M. m, M [13]., GPU 1.,, * Vol.214-HPC-146 No /1/3 [13], 2 d-spline[18] *2, typedef cucomplex., DD. *3 m, c 214 Information Processing Society of Japan 4

5 1 I ( ) T x x {32, 64,..., T xmax } T y y {1, 2,..., 8} U U/T y {3, 4,..., 32} M {1, 2,..., 1} i) 96 (3 WarpSize) T x T y T xmax, T xmax := {288 (D, W, Z, C), 32 (S)}. ii) U T x. II (GPU boolean ) USE VOLATILE USE TEXTURE USE RESTRICT USE LDG volatile texture memory const TYPE restrict * read only cache ldg() read only cache Fermi Kepler Maxwell USE VOLATILE USE TEXTURE 1 USE RESTRICT 1 USE LDG 1.,, Tx, Ty, U, m. Tx, Ty, m, U. 2,. 2 GPU SSYMV Top 5., 2., d-spline Top2,. SYMV, 3 if-then. if-then. 2, Ty 2., 2,., GPU., GPU.,. 4. SYMV 4 4GPU., SYMV, Byte/flop B/s. SYMV Byte/flops W 4(=(8*2)/4), D 2(=8/4), S 1(=4/4), Z 1(=(8*2)/(4*4)), C.5(=(4*2)/(4*4)),. Titan Black W(cuddreal: 4 ): 84GFLOPS D(double: ): 168GFLOPS S(float: ): 336GFLOPS Z(cuDoubleComplex: ): 336GLOPS C(cuFloatComplex: ): 672GFLOPS. W:D:S:Z:C=1:2:4:4: [DS]SYMV 6 9, 4 GPU [DS]SYMV (GFLOPS)., BLAS CUDA CUBLAS 6.5[1], KBLAS 1.[1], MAGMABLAS 1.5.(beta3)[2]. SYMV, 4GPU, 1 25%.,, Titan Black CUBALS6.5. NVIDIA GPU Titan Black. GPU. GTX58: D(148GB/s=77%), S(149GB/s=78%) K2c: D(134GB/s=64%), S(131GB/s=63%) Titan Black: D(22GB/s=65%), S(25GB/s=61%) * 4 GTX75Ti: D(68GB/s=78%), S(74GB/s=85%) 61% *5,.,., 2., 1.,. 4.2 WSYMV, [CZ]HEMV Vol.214-HPC-146 No /1/3 1 GeForce GTX Titan Black, 4 WSYMV, CHEMV, ZHEMV. WSYMV Byte/flop, *4 6MHz, 76%, 71%. *5 bandwidthtest, GTX58: 17GB/s, K2c: 146GB/s, Titan Black: 229GB/s, GTX75Ti: 67.3GB/s. c 214 Information Processing Society of Japan 5

6 Vol.214-HPC-146 No /1/3 2 SSYMV GPU Top 5 ( Tx, Ty, U, m, M ) kernel ID GTX58 K2c Titan Black GTX75Ti 1. (64, 2, 42, 4, ) (64, 4, 48, 4, ) (64, 4, 48, 4, ) (96, 1, 23, 8, ) 2. (64, 2, 42, 4, 1) (96, 3, 27, 4, 1) (64, 5, 6, 3, ) (64, 2, 28, 8, ) 3. (32, 8, 32, 2, 9) (96, 3, 51, 3, ) (64, 4, 44, 4, 1) (32, 8, 32, 2, 9) 4. (64, 4, 64, 2, 1) (64, 2, 34, 7, ) (96, 3, 27, 4, ) (64, 2, 28, 8, 3) 5. (96, 3, 36, 2, ) (64, 4, 44, 4, 1) (128, 2, 36, 3, 3) (64, 2, 36, 7, 1) 3 GPU (ID= L+U. if-then ) GTX58 K2c Titan Black GTX75Ti if ( 1 n < 16 ) { if ( 1 n < 1771 ) { if ( 1 n < 298 ) { if ( (1 n < 47 ) { ID=; ID=; ID=; ID=; } elsif ( 16 n < 1842 ) { } elsif ( 1777 n < 2989 ) { } elsif (298 n < 2172 ) { } elsif ( 47 n < 261 ) { ID=16; ID=6; ID=6; ID=9; } elsif ( 2166 n < 1842 ) { } elsif ( 2989 n < 3565 ) { } elsif (2172 n < 4378 ) { } elsif ( 261 n < 294 ) { ID=13: ID=5; ID=14; ID=1;... } elsif ( 3565 n ) { } elsif ( 579 n ) { ID=1; } elsif ( n ) { } elsif ( 1955 n ) { ID=1; } ID=1; ID=6; } } } 4 GPU CPU / ( 3 7MHz, GPUBoost 6MHz. 6MHz 288GB/s.) GTX58 Tesla K2c Titan Black GTX75Ti Compute Capability GPU Clock (MHz) 1544(boost NA) 76(boost NA) 889(boost 98) 12(boost 185) Multiprocessors CUDA Cores Memory Capacity (MB) Memory Clock (MHz) 48(384bit) 52(32bit) 7(384bit) * 6 54(128bit) Memory Bandwidth (GB/s) ECC Support NA Enabled NA NA Host (a) (b) (c) (a) Host (a) Host (b) Host (b) CPU AMD FX-812 Intel Core i7-393k Intel Core i7-393k CPU Core CPU Clock (GHz) Memory Capacity (GB) Linux Kernel version CUDA Version Driver Version GNU gcc Version c 214 Information Processing Society of Japan 6

7 Vol.214-HPC-146 No /1/3 情報処理学会研究報告 Performance of DSYMV on <GeForce GTX58> ASPEN.K2-1.3-DSYMVU-GTX58.dat CUDA-6.5-DSYMVU-GTX58.dat KBLAS-1.-DSYMVU-GTX58.dat MAGMA-1.5.b3-DSYMVU-GTX58.dat Performance of SSYMV on <GeForce GTX58> ASPEN.K2-1.3-SSYMVU-GTX58.dat CUDA-6.5-SSYMVU-GTX58.dat KBLAS-1.-SSYMVU-GTX58.dat MAGMA-1.5.b3-SSYMVU-GTX58.dat 2 5 図 GeForce GTX58 での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 7

8 Vol.214-HPC-146 No /1/3 情報処理学会研究報告 Performance of DSYMV on <Tesla K2c> ASPEN.K2-1.3-DSYMVU-K2c.dat CUDA-6.5-DSYMVU-K2c.dat KBLAS-1.-DSYMVU-K2c.dat MAGMA-1.5.beta3-DSYMVU-K2c.dat Performance of SSYMV on <Tesla K2c> ASPEN.K2-1.3-SSYMVU-K2c.dat CUDA-6.5-SSYMVU-K2c.dat KBLAS-1.-SSYMVU-K2c.dat MAGMA-1.5.beta3-SSYMVU-K2c.dat 5 図 Tesla K2c での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 8

9 Vol.214-HPC-146 No /1/3 情報処理学会研究報告 Performance of DSYMV on <GeForce GTXTitan Black> ASPEN.K2-1.3-DSYMVU-GTXTITANBlack.dat CUDA-6.5-DSYMVU-GTXTITANBlack.dat KBLAS-1.-DSYMVU-GTXTITANBlack.dat MAGMA-1.5.beta3-DSYMVU-GTXTITANBlack.dat Performance of SSYMV on <GeForce GTXTitan Black> ASPEN.K2-1.3-SSYMVU-GTXTITANBlack.dat CUDA-6.5-SSYMVU-GTXTITANBlack.dat KBLAS-1.-SSYMVU-GTXTITANBlack.dat MAGMA-1.5.beta3-SSYMVU-GTXTITANBlack.dat 5 図 GeForce GTX Titan Black での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 9

10 Vol.214-HPC-146 No /1/3 情報処理学会研究報告 Performance of DSYMV on <GeForce GTX75Ti> ASPEN.K2-1.3-DSYMVU-GTX75Ti.dat CUDA-6.5-DSYMVU-GTX75Ti.dat KBLAS-1.-DSYMVU-GTX75Ti.dat MAGMA-1.5.b3-DSYMVU-GTX75Ti.dat Performance of SSYMV on <GeForce GTX75Ti> ASPEN.K2-1.3-SSYMVU-GTX75Ti.dat CUDA-6.5-SSYMVU-GTX75Ti.dat KBLAS-1.-SSYMVU-GTX75Ti.dat MAGMA-1.5.b3-SSYMVU-GTX75Ti.dat 1 5 図 GeForce GTX75Ti での SYMV の性能 (上: DSYMV 倍精度, 下: SSYMV 単精度, それぞれ行列は 8 次元毎に測定) 214 Information Processing Society of Japan 1

11 D:S:Z:C=1:2:2:4., [DS]SYMV., WSYMV. DSYMV 1/2 5GFLOPS, 4%,. ( 1 ), [SD]SYMV, ( 2 ), ( 3 ) nvcc DD,.,. 5., SYMV, mutex., 2,.. 4GPU, CUDA BLAS., ( ) (, ) DD(double-double),. Level 2 BLAS, GPUBLAS., ( : 22143)., [DS]SYMV Level-2 CUDA BLAS ASPEN.K2, ( html ). WSYMV [CZ]HEMV. [1] NVIDIA Corporation, The NVIDIA CUDA Basic Linear Algebra Subroutines, [2] Innovative Computing Laboratory, University of Tennessee, Matrix Algebra on GPU and Multicore Architectures, [3] Sørensen, H. H. B., Auto-tuning Dense Vector and Vol.214-HPC-146 No /1/3 Matrix-Vector Operations for Fermi GPUs, Parallel Processing and Applied Mathematics, LNCS 723 (212) [4] Sørensen, H. H. B.. Auto-Tuning of Level 1 and Level 2 BLAS for GPUs, Concurrency Computat.: Pract. Exper., Wiley (212) [5] GPUlab: GLAS library version..2, glas v..2 C25 cuda 4. linux.tar.gz [6], CUDA DGEMV,, Vol.4, No.4 (Oct. 211) [7] Abdelfattah, A., Keyes, D., and Ltaief, H., KBLAS: High Performance Level-2 BLAS on Multi-GPU Systems, /pdf/p4168 KBLAS GPU computing optimization.pdf, GTC214 (214). [8] Imamura, T., ASPEN-K2: Automatic-tuning and Stabilization for the Performance of CUDA BLAS Level 2 Kernels, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP212), [9] Nath, R., Tomov, S., Dong, T. T., and Dongarra, J., Optimizing Symmetric Dense Matrix-vector Multiplication on GPUs, in Proceedings of 211 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 11 (211) 6:1 6:1. [1] Abdelfattah, A., Keyes, D., and Ltaief, H., KAUST BLAS (KBLAS), [11] Imamura, T., Yamada, S., and Machida, M., A High Performance SYMV Kernel on a Fermi-core GPU, High Performance Computing for Computational Science VECPAR 212, LNCS 7851 (213) [12] NVIDIA Corporation, CUDA C Programming guide, C Programm ing Guide.pdf (214). [13],,,,, Fermi, Kepler GPU SYMV,, HPC, Vol. 212-HPC-138, No. 8 (212) 1 7. [14] Hida, H., Li, X. S., and Bailey, D. H., Quaddouble arithmetic: Algorithms, implementation, and application (Oct 2), Online PDF [15] Bailey, D. H., and Borwein, J. M., High-Precision Computation and Mathematical Physics, texttthttp://crd.lbl.gov/ dhbailey/dhbpapers/dhb-jmbacat8.pdf [16] Nakata, M., The MPACK (MBLAS/MLAPACK); a multiple precision arithmetic version of BLAS and LAPACK, [17],,,,, QPBLAS-GPU 18, Vol. 18 D-13-5 (213). [18],, (28). c 214 Information Processing Society of Japan 11

12 Vol.214-HPC-146 No /1/3 45 Performance of ASPEN.K2 on <GeForce GTXTitan Black> ASPEN.K2-1.3-DSYMVU-GTXTITANBlack.dat ASPEN.K2-1.3-SSYMVU-GTXTITANBlack.dat 5 ASPEN.K2-1.3-zhemv-u.dat ASPEN.K2-1.3-chemv-u.dat ASPEN.K2-1.3-wsymv-u.dat Performance of CUDA 6.5 on <GeForce GTXTitan Black> CUDA-6.5-DSYMVU-GTXTITANBlack.dat 5 CUDA-6.5-SSYMVU-GTXTITANBlack.dat CUDA-6.5-zhemv-u.dat CUDA-6.5-chemv-u.dat x-{sy HE}MV (GeForce GTX Titan Black, : ASPEN.K2, : CUDA6.5, [DS]-SYMV 8, WSYMV,[CZ]-HEMV 32, WSYMV DD DDFLOPS ) c 214 Information Processing Society of Japan 12

倍々精度RgemmのnVidia C2050上への実装と応用

倍々精度RgemmのnVidia C2050上への実装と応用 .. maho@riken.jp http://accc.riken.jp/maho/,,, 2011/2/16 1 - : GPU : SDPA-DD 10 1 - Rgemm : 4 (32 ) nvidia C2050, GPU CPU 150, 24GFlops 25 20 GFLOPS 15 10 QuadAdd Cray, QuadMul Sloppy Kernel QuadAdd Cray,