IPSJ SIG Technical Report Vol.2015-HPC-148 No /3/2


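CUDA-BLAS ... GPU ...

Toshiyuki Imamura 1,3,a) and three co-authors (affiliations 1; 2,3; 2,3)

1 RIKEN Advanced Institute for Computational Science, Kobe, Hyogo
2 Japan Atomic Energy Agency, Kashiwa, Chiba
3 CREST, JST, Kawaguchi, Saitama
a) imamura.toshiyuki@riken.jp

Abstract: This report evaluates CUDA-BLAS implementations on GPUs and examines how they affect the GPU eigenvalue solvers Eigen-G and MAGMA. It also reports on MAGMA+ASPEN.K2, a configuration in which MAGMA calls the dsymv kernel of the CUDA-BLAS library ASPEN.K2.

1. Introduction

BLAS on GPUs has been studied actively ([1], [2]), and CUDA [3] implementations such as CUBLAS [4] and MAGMABLAS [5] are available. The authors have previously reported on the symmetric (Hermitian) matrix-vector product (SY|HE)MV in HPC-138 [6] and HPC-146 [7]. (SY|HE)MV is the operation

\[ y := \alpha A_{\mathrm{U\ or\ L}}\, x + \beta y, \qquad A \in K^{n \times n},\ x \in K^{n},\ K = \mathbb{R}\ \text{or}\ \mathbb{C}, \tag{1} \]

where $A$ is symmetric (Hermitian for HEMV) and the subscript indicates that only the upper (U) or lower (L) triangle of $A$ is referenced. The operation is referred to below as SYMV.

Eigenvalue solvers for GPUs include MAGMA [5], which provides magma_dsyevd and magma_dsyevdx_2stage, and Eigen-G [8]. These solvers divide the work between the CPU and the GPU and run Level 2 and Level 3 BLAS on the GPU, so their performance depends on the CUDA-BLAS they call. MAGMA and Eigen-G differ in how they use the CPU alongside the GPU. This report examines how the choice of CUDA-BLAS affects solvers running on a CPU+GPU system.

2. CUDA-BLAS

2.1 CUBLAS

CUBLAS [4] is the CUDA implementation of BLAS that ships with the NVIDIA CUDA SDK [3]. NVIDIA provides the routines from Level 1 through Level 3.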

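DGEMM in particular is tuned for the GPU: on the Tesla K20c, DGEMM reaches roughly 1 TFLOPS.

2.2 MAGMABLAS

MAGMABLAS is the BLAS component of MAGMA [5], the numerical linear algebra library for GPUs. It provides its own CUDA kernels alongside CUBLAS, for example the SYMV kernel of [9].

2.3 KBLAS

KBLAS [10], [11] is a CUDA BLAS developed at KAUST. It targets Level 2 routines, GEMV and SYMV in particular. Version 1.2 is the release referred to here (a 1.3-beta is also mentioned). KBLAS can be used together with MAGMA.

2.4 ASPEN.K2

ASPEN.K2 [1] is the CUDA-BLAS developed by the authors. It provides GEMV and SYMV kernels ([6], [7]); SYMV is the kernel of interest in this report.

2.5 Others

EM Photonics' CULA [16] includes CULABLAS. GLAS [14], developed by Sørensen at GPUlab, DTU, covers Level 1 and Level 2 routines. Reference [2] reports a GEMV kernel. This report evaluates CUBLAS, MAGMA, KBLAS, and ASPEN.K2.

3. Eigenvalue Solvers in CUDA

Three GPU eigenvalue solvers implemented in CUDA are described below. The evaluation covers MAGMA and Eigen-G.

3.1 CULA

CULA [16] is a CUDA implementation of LAPACK. Its symmetric eigenvalue routines are syev and syevx, the former based on the QR method. Unlike MAGMA, CULA cannot be combined with ASPEN.K2. See also [17].

3.2 MAGMA

MAGMA [5] provides two solvers, magma_dsyevd and magma_dsyevdx_2stage. magma_dsyevd follows LAPACK dsyevd and consists of three steps:

(1) tridiagonalization (magma_dsytrd)
(2) divide and conquer on the tridiagonal matrix (magma_dstedx)
(3) back-transformation of the eigenvectors (magma_dormtr)

In step (1), dsymv and dsyr2k run on the GPU. In step (2), dgemm runs on the GPU. In step (3), the WY representation is built on the CPU and applied with dgemm on the GPU. magma_dsyevdx_2stage, by contrast, splits the tridiagonalization into two stages.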
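The role of dsymv and dsyr2k in the one-stage tridiagonalization of step (1) can be sketched with an unblocked Householder reduction. The following is a minimal pure-Python illustration, not MAGMA's blocked GPU implementation: the function names `symv_lower` and `tridiagonalize` are hypothetical, and the rank-2 update is written element by element where a library would call syr2 or syr2k.

```python
import math


def symv_lower(a, x, lo):
    """Return A[lo:, lo:] * x, reading only the lower triangle of that block."""
    m = len(x)
    y = [0.0] * m
    for i in range(m):
        for j in range(i + 1):
            aij = a[lo + i][lo + j]
            y[i] += aij * x[j]
            if i != j:
                y[j] += aij * x[i]  # contribution of the mirrored element A[j][i]
    return y


def tridiagonalize(a):
    """Return a tridiagonal matrix orthogonally similar to the symmetric matrix a."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy; the input is left untouched
    for k in range(n - 2):
        lo = k + 1
        x = [a[i][k] for i in range(lo, n)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue  # column k is already reduced
        alpha = -math.copysign(norm, x[0])
        v = x[:]
        v[0] -= alpha  # Householder vector, H = I - tau * v * v^T
        tau = 2.0 / sum(t * t for t in v)
        # SYMV: p = tau * A22 * v (the Level 2 kernel that dominates this step)
        p = [tau * t for t in symv_lower(a, v, lo)]
        # w = p - (tau / 2) * (p . v) * v
        c = 0.5 * tau * sum(pi * vi for pi, vi in zip(p, v))
        w = [pi - c * vi for pi, vi in zip(p, v)]
        # Rank-2 update: A22 <- A22 - v * w^T - w * v^T
        for i in range(n - lo):
            for j in range(n - lo):
                a[lo + i][lo + j] -= v[i] * w[j] + w[i] * v[j]
        # Column and row k now hold only the subdiagonal entry alpha.
        a[lo][k] = a[k][lo] = alpha
        for i in range(lo + 1, n):
            a[i][k] = a[k][i] = 0.0
    return a
```

Each of the n-2 iterations performs one SYMV on the trailing block, which is why the SYMV kernel matters for the tridiagonalization time. Because the transformation is an orthogonal similarity, the trace and the Frobenius norm of the input are preserved, which gives a simple sanity check.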

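The two-stage solver consists of five steps:

(1) reduction of the symmetric matrix to band form (magma_dsytrd_sy2sb)
(2) reduction of the band matrix to tridiagonal form (magma_dsytrd_sb2st)
(3) divide and conquer on the tridiagonal matrix (magma_dstedx)
(4) back-transformation for step (2) (magma_dbulge_back)
(5) back-transformation for step (1) (magma_dormqr_2stages)

magma_dsytrd_sy2sb corresponds to step (1) of the one-stage solver, but it runs dgemm, dsymm, and dsyr2k on the GPU, so it is built on Level 3 BLAS. The GPU work thus moves from Level 2 to Level 3 in sy2sb. The remaining routines are magma_dsytrd_sb2st, magma_dbulge_back, and the magma_dormqr_2stages API. MAGMA divides these steps between the GPU and the CPU.

3.3 Eigen-G

Eigen-G [8] is a GPU port of EigenK and EigenExa [18]; see [8] for details. Like magma_dsyevd, Eigen-G consists of three steps: (1) tridiagonalization, (2) divide and conquer, and (3) back-transformation. Eigen-G runs dgemm on both the CPU and the GPU. In the comparison with MAGMA 1.4 reported in [8], Eigen-G is associated with a factor of 2/3 relative to magma_dsyevd. Two differences are noted:

(1) DSYMV: MAGMA uses CUBLAS, whereas Eigen-G uses ASPEN.K2.
(2) Eigen-G uses asynchronous (async) operations.

4. Evaluation

This section examines Eigen-G and MAGMA on GPUs.

4.1 CUDA-BLAS

In MAGMA, SYMV can be taken from either CUBLAS or MAGMABLAS. To see which CUDA-BLAS is fastest on each GPU, Tables 1 and 2 list the dgemm and dsymv performance of each library. For dgemm, CUBLAS is the reference choice; on the GTX980, MAGMABLAS (sgemm) is also noted. For dsymv, the candidates are ASPEN.K2, CUBLAS in atomics mode, and KBLAS. The atomic variant of KBLAS relies on AtomicAdd, while ASPEN.K2 uses the mutex-based scheme of HPC-146 [7]. Figures 1, 2, and 3 show dsymv performance in detail, including the Lower and Upper variants of ASPEN.K2. Since dsymv is called repeatedly in the tridiagonalization (dsytrd), its performance matters for step (1). Based on these results, the CUDA-BLAS selected is CUBLAS for dgemm and ASPEN.K2 for dsymv; dsyr2k is treated together with dgemm.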
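The Upper and Lower variants of dsymv discussed above differ only in which triangle of A is read, as in Equation (1). The following reference sketch in plain Python makes that explicit. The function name `symv` and its argument order are illustrative assumptions loosely modeled on the BLAS interface, not the API of any of the libraries evaluated here.

```python
def symv(uplo, alpha, a, x, beta, y):
    """Return alpha * A * x + beta * y for symmetric A.

    Only the triangle named by uplo ('U' or 'L') of a is read; entries in
    the other triangle are never touched and may hold anything.
    """
    if uplo not in ("U", "L"):
        raise ValueError("uplo must be 'U' or 'L'")
    n = len(x)
    out = [beta * yi for yi in y]
    for i in range(n):
        cols = range(i, n) if uplo == "U" else range(0, i + 1)
        for j in cols:
            aij = a[i][j]
            out[i] += alpha * aij * x[j]
            if i != j:
                # The same stored element also stands in for A[j][i].
                out[j] += alpha * aij * x[i]
    return out


# Usage: the same symmetric matrix stored in its upper or lower triangle.
upper = [[2.0, 1.0, 0.0], [None, 3.0, 4.0], [None, None, 5.0]]
lower = [[2.0, None, None], [1.0, 3.0, None], [0.0, 4.0, 5.0]]
x = [1.0, 2.0, 3.0]
y = [1.0, 1.0, 1.0]
result_upper = symv("U", 2.0, upper, x, 1.0, y)
result_lower = symv("L", 2.0, lower, x, 1.0, y)
```

Each stored element is used twice, once for row i and once for row j. On a GPU the second use turns into a scattered accumulation into `out[j]`, which is the part that implementations handle with atomic additions or a mutex.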

Table 1: DGEMM performance (GFLOPS) on the Tesla K20c and the GTX980. Columns: CUBLAS, MAGMABLAS, MKL.

Table 2: DSYMV performance (GFLOPS) on the Tesla K20c and the GTX980. Columns: CUBLAS (Atomic), MAGMABLAS, MKL, KBLAS, ASPEN.K2 1.5p.
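For reading the GFLOPS figures in Tables 1 and 2, the following sketch shows how such rates are usually derived from a measured time. It assumes the conventional operation counts of 2mnk for DGEMM and 2n^2 for DSYMV; the report's own counting convention is not visible here, and all function names are hypothetical. The last helper gives a rough memory-bandwidth bound for DSYMV under the assumption that one triangle of A is read exactly once and vector traffic is ignored.

```python
def dgemm_flops(m, n, k):
    """Conventional operation count for C := alpha*A*B + beta*C (assumed 2*m*n*k)."""
    return 2.0 * m * n * k


def dsymv_flops(n):
    """Conventional operation count for y := alpha*A*x + beta*y (assumed 2*n*n)."""
    return 2.0 * n * n


def gflops(flops, seconds):
    """Convert an operation count and a measured time into GFLOPS."""
    if seconds <= 0.0:
        raise ValueError("seconds must be positive")
    return flops / seconds / 1.0e9


def dsymv_bandwidth_bound_gflops(n, bandwidth_gb_per_s, bytes_per_element=8):
    """Upper bound on DSYMV GFLOPS when one triangle of A is streamed once."""
    bytes_moved = bytes_per_element * n * (n + 1) / 2.0
    # flops per byte times GB/s gives GFLOPS
    return dsymv_flops(n) / bytes_moved * bandwidth_gb_per_s
```

For large n the bound approaches half the memory bandwidth in GB/s for double precision, which is why DSYMV rates sit far below DGEMM rates on the same device.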

Table 3: Results on the Tesla K20c for MAGMA and Eigen-G, broken down into (1) trd, (2) ed, and (3) tbk.

Table 4: Results on the GTX980 for MAGMA and Eigen-G, broken down into (1) trd, (2) ed, and (3) tbk.

MAGMA and Eigen-G are compared next. The two-stage solver (magma_dsyevdx_2stage) is set aside here, and the comparison uses magma_dsyevd. Tables 3 and 4 show the results. The Tesla K20c and the GTX980 sit in hosts with different CPUs, so the two tables reflect both CPU and GPU differences. Step (3) is dominated by dgemm. Of the remaining steps, the discussion singles out Eigen-G for step (1) and MAGMA for step (2).

(1) The difference comes from dsymv, with a smaller contribution from dgemm.
(2) MAGMA and Eigen-G differ in how they use dgemm.
(3) MAGMA builds the transformation on the CPU and applies dgemm on the GPU. Eigen-G also uses the CPU for dgemm; on the GTX980 host, the dgemm rates of the GPU and the CPU are related by a ratio of 3:2.

Figure 2: DSYMV performance on the GeForce GTX980. Panels: ASPEN.K2 1.5p3x; CUBLAS 7.0RC (atomics mode); MAGMA 1.6.1 (plain and "work" variants); KBLAS 1.2. Each panel shows the Upper and Lower cases.

Table 5: Test environment (GPU and CPU/host).

GPU                       Tesla K20c           GTX980
GPU name                  GK110                GM204
GPU clock (MHz)           706 (boost N/A)      1126 (boost 1216)
Multiprocessors           13                   16
CUDA cores                2496 (= 13*192)      2048 (= 16*128)
Memory capacity (MByte)   5120 (GDDR5)         4096 (GDDR5)
Memory bus width          320 bit              256 bit
ECC support               Enabled              N/A

Host                      Host (a)             Host (b)
CPU                       AMD FX-81            Intel Core i7-3930K
CPU cores                 8 (4 FPUs)           6 (AVX available)
CUDA version              7.0RC                6.5

Further rows of the table: Compute Capability, Memory Clock (MHz), Memory Bandwidth (GB/s), PCI bus, CPU Clock (GHz), Memory Capacity (GB), Linux kernel version, driver version, GNU gcc version, Intel MKL version.

Figure 1: DSYMV performance of the CUDA-BLAS libraries on the GeForce GTX980 and the Tesla K20c. Series, each for the Upper and Lower cases: ASPEN.K2 (1.5p3x on the GTX980, 1.5p2 on the Tesla K20c); KBLAS 1.2; MAGMA 1.6.1 (plain and "work" variants); CUBLAS in atomics mode (7.0RC on the GTX980, 6.5 on the Tesla K20c).

Figure 3: DSYMV performance on the Tesla K20c. Panels: ASPEN.K2 1.5p2; CUBLAS 6.5 (atomics mode); MAGMA 1.6.1 (plain and "work" variants); KBLAS 1.2. Each panel shows the Upper and Lower cases.

Table 6: Step (1) trd on the Tesla K20c. Columns: MAGMA+ASPEN.K2, MAGMA only, Eigen-G.

MAGMA+ASPEN.K2 is evaluated next. In MAGMA+ASPEN.K2, the dsymv called in step (1) of magma_dsyevd (magma_dsytrd) is replaced by the ASPEN.K2 dsymv. Steps (2) and (3) are unchanged, so only step (1) is reported. Table 6 compares MAGMA+ASPEN.K2 with Eigen-G and with unmodified MAGMA, and shows the effect of the CUDA-BLAS on the solver. The remaining difference between MAGMA and Eigen-G stems from how MAGMA runs BLAS on the GPU.

4.4

Finally, the two-stage solver magma_dsyevdx_2stage is considered. It is compared with the one-stage results for MAGMA and MAGMA+ASPEN.K2.

5. Conclusion

This report evaluated the performance of CUDA-BLAS libraries and their effect on Eigen-G and MAGMA.

The evaluation included MAGMA+ASPEN.K2 on a CPU + single GPU system, as well as the MAGMA 2stage solver.

Acknowledgments: This work was supported in part by research grants (No. 223; COE).

References

[1] Imamura, T., ASPEN-K2: Automatic-tuning and Stabilization for the Performance of CUDA BLAS Level 2 Kernels, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12).
[2] [Authors and title in Japanese: SGEMV on Kepler GPUs], GTC Japan 2014.
[3] NVIDIA Corporation, CUDA C Programming guide, C Programming Guide.pdf (2014).
[4] NVIDIA Corporation, The NVIDIA CUDA Basic Linear Algebra Subroutines.
[5] Innovative Computing Laboratory, University of Tennessee, Matrix Algebra on GPU and Multicore Architectures.
[6] [Authors and title in Japanese: SYMV on Fermi and Kepler GPUs], HPC, Vol. 2012-HPC-138, No. 8 (2012) 1-7.
[7] [Authors and title in Japanese: CUDA-xSYMV], HPC, Vol. 2014-HPC-146, No. 14 (2014).
[8] Imamura, T., Yamada, S., Machida, M., Eigen-G: GPU-based eigenvalue solver for real-symmetric dense matrices, 10th International Conference on Parallel Processing and Applied Mathematics (PPAM14), LNCS 8384 (2014).
[9] Nath, R., Tomov, S., Dong, T. T., and Dongarra, J., Optimizing Symmetric Dense Matrix-vector Multiplication on GPUs, in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11 (2011) 6:1-6:10.
[10] Abdelfattah, A., Keyes, D., and Ltaief, H., KAUST BLAS (KBLAS).
[11] Abdelfattah, A., Keyes, D., and Ltaief, H., KBLAS: High Performance Level-2 BLAS on Multi-GPU Systems, /pdf/p4168 KBLAS GPU computing optimization.pdf, GTC14 (2014).
[12] Sørensen, H. H. B., Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs, Parallel Processing and Applied Mathematics, LNCS 7203 (2012).
[13] Sørensen, H. H. B., Auto-Tuning of Level 1 and Level 2 BLAS for GPUs, Concurrency Computat.: Pract. Exper., Wiley (2012).
[14] GPUlab: GLAS library version..2, glas v..2 C cuda 4. linux.tar.gz
[15] Imamura, T., Yamada, S., and Machida, M., A High Performance SYMV Kernel on a Fermi-core GPU, High Performance Computing for Computational Science VECPAR 2012, LNCS 7851 (2013).
[16] Humphrey, J. R., Price, D. K., Spagnoli, D. K., Paolini, A. L., Kelmelis, E. J., CULA: Hybrid GPU Accelerated Linear Algebra Routines, SPIE Defense and Security Symposium (DSS), April 2010.
[17] CULA.
[18] EigenExa : EigenK
