Implementation and Evaluation of Sparse Matrix-Vector Multiplication in CRS Format on a Kepler Architecture GPU

a) mukunoki@hpcs.cs.tsukuba.ac.jp
b) daisuke@cs.tsukuba.ac.jp

Abstract: We present an implementation of sparse matrix-vector multiplication (SpMV) for matrices stored in the CRS format on NVIDIA's Kepler architecture GPUs, using CUDA 5.0. The implementation exploits features newly introduced with Kepler that are not available on Fermi GPUs. On a Kepler-generation Tesla K20, our implementation outperformed the CRS SpMV routine of the cuSPARSE library bundled with CUDA 5.0 on 177 of 200 test matrices, with speedups of up to 1.86x.

(c) 2013 Information Processing Society of Japan

1. Introduction

Sparse matrix-vector multiplication (SpMV) is a fundamental kernel of iterative solvers and many other applications. Among the storage formats for sparse matrices, CRS (Compressed Row Storage) *1 is the most widely used. Meanwhile, GPUs have become a common platform for high-performance computing (GPGPU), with NVIDIA GPUs programmed through CUDA. Fig. 1 shows an example of the CRS format: the nonzero elements are stored row by row in val, their column indices in ind, and the index of the first nonzero of each row in ptr. The performance characteristics of CRS SpMV differ considerably between CPUs and GPUs. In 2008, Bell and Garland [1] implemented SpMV in CUDA not only for CRS but also for the ELL, COO, and HYB formats.

*1 In NVIDIA's cuSPARSE library the same format is called CSR (Compressed Sparse Row).

Fig. 1: An example of the CRS format.

    A = | 8 9 0 0 4 5 |
        | 0 7 5 6 2 0 |
        | 0 6 0 7 0 0 |
        | 9 0 0 6 0 2 |
        | 0 0 2 0 0 0 |
        | 0 0 7 2 8 0 |

    val = [8, 9, 4, 5, 7, 5, 6, 2, 6, 7, 9, 6, 2, 2, 7, 2, 8]
    ind = [1, 2, 5, 6, 2, 3, 4, 5, 2, 4, 1, 4, 6, 3, 3, 4, 5]
    ptr = [1, 5, 9, 11, 14, 15, 18]

Besides Bell and Garland's work, SpMV based on segmented scan has been implemented in CUDA [2]. Xu et al. [3] optimized SpMV with a cache blocking method on Fermi GPUs. Feng et al. [4] proposed SIC, a variant of the CSR format, for GPUs. Matam and Kothapalli [5] accelerate iterative methods by automatically selecting between the CRS and ELL formats. Kubota and Takahashi [6] optimize SpMV on GPUs by automatically selecting among storage schemes. NVIDIA's cuSPARSE library [7] provides CRS SpMV routines for its GPUs. In this paper, we implement CRS-format SpMV tuned for the Kepler architecture and evaluate it against the CRS SpMV of cuSPARSE on a Tesla K20.
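The val/ind/ptr arrays in Fig. 1 can be generated mechanically from the dense matrix. The following C sketch (hypothetical helper names, not part of the paper's code; 1-based indices to match the figure) builds the three arrays and checks them against the values listed above:

```c
#include <assert.h>

#define N 6
#define NNZ 17

/* Dense form of the example matrix in Fig. 1. */
static const double A[N][N] = {
    {8, 9, 0, 0, 4, 5},
    {0, 7, 5, 6, 2, 0},
    {0, 6, 0, 7, 0, 0},
    {9, 0, 0, 6, 0, 2},
    {0, 0, 2, 0, 0, 0},
    {0, 0, 7, 2, 8, 0},
};

/* Arrays as listed in Fig. 1 (1-based indices). */
static const double val_ref[NNZ] = {8, 9, 4, 5, 7, 5, 6, 2, 6, 7, 9, 6, 2, 2, 7, 2, 8};
static const int ind_ref[NNZ] = {1, 2, 5, 6, 2, 3, 4, 5, 2, 4, 1, 4, 6, 3, 3, 4, 5};
static const int ptr_ref[N + 1] = {1, 5, 9, 11, 14, 15, 18};

/* Convert the dense matrix to CRS; returns the number of nonzeros. */
static int dense_to_crs(double val[], int ind[], int ptr[])
{
    int k = 0;
    for (int i = 0; i < N; i++) {
        ptr[i] = k + 1;              /* first nonzero of row i (1-based) */
        for (int j = 0; j < N; j++) {
            if (A[i][j] != 0.0) {
                val[k] = A[i][j];
                ind[k] = j + 1;      /* 1-based column index */
                k++;
            }
        }
    }
    ptr[N] = k + 1;                  /* one past the last element */
    return k;
}

/* Returns 1 iff the generated arrays match those in Fig. 1. */
static int crs_matches_figure(void)
{
    double val[NNZ];
    int ind[NNZ], ptr[N + 1];
    if (dense_to_crs(val, ind, ptr) != NNZ) return 0;
    for (int k = 0; k < NNZ; k++)
        if (val[k] != val_ref[k] || ind[k] != ind_ref[k]) return 0;
    for (int i = 0; i <= N; i++)
        if (ptr[i] != ptr_ref[i]) return 0;
    return 1;
}
```

Note that ptr has one entry per row plus a sentinel, so the number of nonzeros in row i is always ptr[i+1] - ptr[i].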
Our implementation exploits Kepler features that are unavailable on Fermi GPUs; on a Kepler-based Tesla K20 with CUDA 5.0, it is faster than the CRS SpMV of cuSPARSE on 177 of 200 matrices, with speedups of up to 1.86x.

2. Related Work: CRS SpMV on GPUs

For y = Ax, Bell and Garland [1] proposed two CRS kernels: CRS-scalar, which assigns one thread per row, and CRS-vector, which assigns one warp (32 threads) per row. CRS-scalar suffers from noncoalesced memory access on the GPU, while CRS-vector wastes threads when a row has fewer than 32 nonzero elements; Bell and Garland report that CRS-vector is advantageous for rows with roughly 32 or more nonzeros. Baskaran and Bordawekar [8] examined CRS-vector variants with 16 and 32 threads per row, and Guo and Wang [9] auto-tune the thread configuration, likewise considering 16 and 32 threads per row. El Zein and Rendell [10] switch between CRS-vector and CRS-scalar depending on the matrix. Reguly and Giles [11] generalized CRS-vector to use 1, 2, 4, 8, 16, or 32 threads per row, chosen per matrix; with 1 thread per row it reduces to CRS-scalar. Following Reguly and Giles, Yoshizawa and Takahashi [12] also select automatically among 1, 2, 4, 8, 16, and 32 threads per row.

3. The Kepler Architecture

Kepler is the GPU architecture that NVIDIA introduced in 2012; we target the Kepler-based Tesla K20 (GK110). According to NVIDIA [13], the differences from Fermi that matter for CRS SpMV include the following:

- Streaming multiprocessors: a Fermi SM has 32 CUDA cores, while a Kepler SMX has 192.
- Maximum x-dimension of the grid: raised from 65,535 on Fermi to 2,147,483,647 on Kepler, so a one-dimensional grid suffices where Fermi required computing a block ID from a multi-dimensional grid.
- Maximum registers per thread: raised from 63 to 255.
- Warp shuffle instructions: __shfl, __shfl_up, __shfl_down, and the XOR-based __shfl_xor exchange 32-bit values between threads of a warp without going through shared memory.
- A 48KB read-only data cache, absent on Fermi, which kernels can use for loads marked with the const __restrict__ qualifiers.

4. CRS SpMV Implementation for Kepler

Our implementation provides an interface compatible with cuSPARSE and computes y = alpha*A*x + beta*y.
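The operation the kernels implement, y = alpha*A*x + beta*y over a CRS matrix, can be written sequentially as a reference. This is a hypothetical host-side C sketch (0-based indices, not the paper's GPU code); the GPU kernels distribute the same per-row loop across threads:

```c
#include <assert.h>
#include <math.h>

/* Sequential reference for y = alpha*A*x + beta*y, A in CRS form
   with 0-based indices: val holds nonzeros, idx their column
   indices, and ptr[row]..ptr[row+1]-1 spans row's nonzeros. */
static void crs_spmv_ref(int m, double alpha, const double *val,
                         const int *ptr, const int *idx,
                         const double *x, double beta, double *y)
{
    for (int row = 0; row < m; row++) {
        double sum = 0.0;
        for (int k = ptr[row]; k < ptr[row + 1]; k++)
            sum += val[k] * x[idx[k]];
        y[row] = alpha * sum + beta * y[row];
    }
}
```

With alpha = 1 and beta = 0 this reduces to the plain y = Ax studied in the related work above.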
Fig. 2: Assignment of threads to the rows of the example matrix of Fig. 1 in CRS-vector for NT = 1, 2, and 4 threads per row (threads and iterations per row).

Our implementation is based on the CRS-vector approach of Reguly and Giles [11], parameterized by NT, the number of threads assigned to each row, with NT in {1, 2, 4, 8, 16, 32} (Fig. 2). Fig. 3 shows the kernel and Fig. 4 the host routine. The kernel is compiled once for each value of NT, and the host routine selects among the six instances at run time from the mean number of nonzeros per row:

    NT = max(1, min(32, (int)pow(2., ceil(log2(nnzrow)))))

The implementation incorporates three Kepler-specific optimizations, described in Sections 4.1 to 4.3: (1) the read-only data cache, (2) elimination of the outer loop over rows, and (3) reduction with warp shuffle instructions. The inner for loop of the kernel accumulates each thread's partial sum; the second loop reduces the NT partial sums.

Fig. 3: The CRS SpMV kernel (one instance per NT).

    __global__ void SpmvKernel_NT(int m, double alpha, double *a_val,
                                  int *a_ptr, int *a_idx,
                                  const double * __restrict__ x,
                                  double beta, double *y)
    {
        unsigned int i;
        unsigned int tx = threadIdx.x;
        unsigned int tid = blockDim.x * blockIdx.x + tx;
        unsigned int rowid = tid / NT;
        unsigned int lane = tid % NT;
        double val;
        int val_hi, val_lo;

        if (rowid < m) {
            val = 0.0;
            for (i = a_ptr[rowid] + lane; i < a_ptr[rowid + 1]; i += NT)
                val += a_val[i] * x[a_idx[i]];
            for (i = NT / 2; i > 0; i = i >> 1) {
                val_hi = __double2hiint(val);
                val_lo = __double2loint(val);
                val += __hiloint2double(__shfl_xor(val_hi, i, 32),
                                        __shfl_xor(val_lo, i, 32));
            }
            if (lane == 0)
                y[rowid] = alpha * val + beta * y[rowid];
        }
    }

Fig. 4: The host routine, which selects NT and launches the corresponding kernel (NTX is the compile-time number of threads per block).

    int Spmv(char trans, int m, int n, double alpha, double *a_val,
             int *a_ptr, int *a_idx, double *x, double beta, double *y,
             int nonzeros)
    {
        int NT, ntx, nbx;
        float nnzrow = (float)nonzeros / (float)m;
        NT = max(1, min(32, (int)pow(2., ceil(log2(nnzrow)))));
        ntx = NTX;
        nbx = m / (ntx / NT) + ((m % (ntx / NT)) != 0);
        dim3 threads(ntx);
        dim3 grid(nbx);
        if (trans == 'N') {
            if (NT == 32) {
                cudaFuncSetCacheConfig(SpmvKernel_32, cudaFuncCachePreferL1);
                SpmvKernel_32 <<< grid, threads >>> (m, alpha, a_val, a_ptr,
                                                     a_idx, x, beta, y);
            } else if (NT == 16) {
                /* ... likewise for NT = 16 */
            } else if (NT == 2) {
                /* ... likewise for NT = 8, 4, 2 */
            } else {
                /* ... likewise for NT = 1 */
            }
        }
    }
The three Kepler-specific optimizations are as follows.

4.1 Use of the Read-Only Data Cache

Kepler provides a 48KB read-only data cache in addition to the L1 cache. In SpMV, the vector x is read repeatedly with an irregular access pattern, so caching it is important. Declaring x with the const __restrict__ qualifiers lets its loads go through the read-only data cache, whereas on Fermi GPUs such loads could only be served by the L1 cache.

4.2 Eliminating the Outer Loop over Rows

The number of rows that one kernel launch can process is bounded by RowMax = MaxGridDimX * (BlockDim.x / NT). With BlockDim.x = 128 and NT = 32, a Fermi GPU (MaxGridDimX = 65,535) gives RowMax = 65,535 * 128/32 = 262,140, so larger matrices require either an outer loop over rows inside the kernel or computing row IDs from a multi-dimensional grid *2. On a Kepler GPU (MaxGridDimX = 2,147,483,647), RowMax = 2,147,483,647 * 128/32 = 8,589,934,588 for NT = 32, which far exceeds the number of rows of any matrix that fits in device memory, so the outer loop can simply be eliminated.

4.3 Reduction with Shuffle Instructions

In CRS-vector, the NT partial sums computed for each row must be reduced. On Fermi this reduction goes through shared memory, which requires the volatile qualifier for warp-synchronous code. On Kepler, we use the __shfl_xor instruction instead (Fig. 3). Since shuffle instructions move 32-bit values, the 64-bit double is split into two 32-bit integers with __double2hiint and __double2loint, each half is shuffled, and the result is recombined with __hiloint2double.

*2 On Fermi, the warp-synchronous shared-memory reduction also requires declaring the shared array volatile.

5. Performance Evaluation

5.1 Environment and Method

The evaluation environment is a Kepler-generation NVIDIA Tesla K20 GPU hosted by an Intel Xeon E3-1230 (3.20GHz) CPU, running CentOS 6.3 (kernel 2.6.32-279.14.1.el6.x86_64) with CUDA 5.0 (driver version 304.54). We compiled with nvcc 5.0 (-O3 -arch sm_35) and gcc 4.4.6 (-O3). Performance is reported in Flops for the GPU kernel only; data transfer between CPU and GPU over PCI-Express is excluded. We compare against the CRS SpMV of the cuSPARSE library bundled with CUDA 5.0 *3. The test set consists of 200 matrices from The University of Florida Sparse Matrix Collection [14], ranging from 1,813 rows up to several million rows and up to 117,406,044 nonzero elements. The measured operation is y = alpha*A*x + beta*y in double precision.

*3 NVIDIA cuSPARSE as included in CUDA 5.0.
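The shuffle-based reduction of Section 4.3 can be emulated on the host to see what it computes: each __shfl_xor step pairs lanes whose IDs differ in one bit, and the double value travels as two 32-bit halves. A hypothetical C sketch (memcpy stands in for the __double2hiint / __double2loint / __hiloint2double intrinsics, and a loop over lanes stands in for the simultaneous warp exchange):

```c
#include <assert.h>
#include <string.h>

/* Split a double into hi/lo 32-bit halves, emulating
   __double2hiint and __double2loint on the host. */
static void double2ints(double d, int *hi, int *lo)
{
    long long bits;                       /* assumes 64-bit long long */
    memcpy(&bits, &d, sizeof bits);
    *hi = (int)(bits >> 32);
    *lo = (int)(bits & 0xffffffffLL);
}

/* Recombine the halves, emulating __hiloint2double. */
static double ints2double(int hi, int lo)
{
    long long bits = ((long long)hi << 32) | ((long long)lo & 0xffffffffLL);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Sequential emulation of the XOR-butterfly reduction over nt
   partial sums (nt a power of two): at each step every lane adds
   the value of the lane whose ID differs in one bit, which is what
   __shfl_xor provides in the kernel. After log2(nt) steps every
   lane holds the total. */
static double butterfly_sum(const double *partial, int nt)
{
    double v[32], nv[32];
    for (int i = 0; i < nt; i++) v[i] = partial[i];
    for (int offset = nt / 2; offset > 0; offset >>= 1) {
        for (int lane = 0; lane < nt; lane++)
            nv[lane] = v[lane] + v[lane ^ offset];
        for (int lane = 0; lane < nt; lane++) v[lane] = nv[lane];
    }
    return v[0];                          /* every lane has the sum */
}
```

Because every lane ends up with the full sum, the kernel needs no extra broadcast before lane 0 writes y[rowid].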
Fig. 5: Performance in GFlops and speedup over cuSPARSE 5.0 for the 200 test matrices (Tesla K20, double precision).

Fig. 6: Properties of the 200 test matrices: matrix size (rows), nonzero elements (NNZ), nonzeros per row (NNZ/Row), and percentage of nonzero elements.

Fig. 7: Number of threads per row (NT) selected for each matrix.

5.2 Results

Fig. 5 shows the performance of our implementation and of the cuSPARSE 5.0 CRS SpMV on the 200 matrices, together with the speedup of our implementation over cuSPARSE; Fig. 6 shows the properties of the matrices (rows, NNZ, NNZ/Row, and density), and Fig. 7 the NT selected for each matrix. Our implementation was faster than cuSPARSE on 177 of the 200 matrices, with a maximum speedup of 1.86x.

6. Discussion
Fig. 8: Speedup of Ver. 1 to Ver. 4 over the baseline Ver. 0 on the 200 matrices (Tesla K20, double precision). Ver. 1: read-only cache; Ver. 2: outer loop eliminated; Ver. 3: shuffle instruction; Ver. 4: all three.

Comparing Fig. 5 with Fig. 6, the achieved Flops correlate strongly with NNZ/Row: matrices with few nonzeros per row are assigned small NT (Fig. 7) and reach lower performance, while matrices with large NNZ/Row perform well. The cuSPARSE routine shows a similar dependence on NNZ and NNZ/Row.

To evaluate the contribution of each Kepler-specific optimization of Section 4, we compare five versions: Ver. 0 uses none of the three optimizations and corresponds to an implementation that would also run on Fermi GPUs; Ver. 1 adds only the read-only data cache for x; Ver. 2 only eliminates the outer loop over rows; Ver. 3 uses only the shuffle-based reduction; and Ver. 4 combines all three. Fig. 8 shows their speedups over Ver. 0 on the 200 matrices. Ver. 4, combining all three optimizations, achieved the highest performance on most matrices, confirming that the Kepler-specific features contribute to CRS SpMV performance beyond what a Fermi-style implementation achieves.

7. Conclusion

We implemented and evaluated CRS-format SpMV on a Kepler architecture GPU, exploiting features that are not available on Fermi GPUs.
On a Kepler-based Tesla K20 with CUDA 5.0, our implementation was faster than the CRS SpMV routine of cuSPARSE 5.0 on 177 of 200 test matrices, with speedups of up to 1.86x, by making use of the read-only data cache, the enlarged grid x-dimension, and the warp shuffle instructions of the Kepler architecture.

Acknowledgments: This work was supported in part by JST CREST.

References

[1] Bell, N. and Garland, M.: Efficient Sparse Matrix-Vector Multiplication on CUDA, NVIDIA Technical Report, No. NVR-2008-004 (2008).
[2] Implementation of Sparse Matrix-Vector Multiplication Based on Segmented Scan in CUDA (in Japanese), IPSJ SIG Technical Report, Vol. 2010-HPC-126, No. 1, pp. 1-7 (2010).
[3] Xu, W., Zhang, H., Jiao, S., Wang, D., Song, F. and Liu, Z.: Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU, Proc. 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2012), pp. 231-235 (2012).
[4] Feng, X., Jin, H., Zheng, R., Hu, K., Zeng, J. and Shao, Z.: Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs, Proc. IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS 2011), pp. 165-172 (2011).
[5] Matam, K. and Kothapalli, K.: Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU, Proc. International Conference on Parallel Processing (ICPP 2011), pp. 612-621 (2011).
[6] Kubota, Y. and Takahashi, D.: Optimization of Sparse Matrix-Vector Multiplication by Auto Selecting Storage Schemes on GPU, Proc. 11th International Conference on Computational Science and Its Applications (ICCSA 2011), Part II, Lecture Notes in Computer Science, No. 6783, pp. 547-561 (2011).
[7] NVIDIA Corporation: cuSPARSE Library (included in CUDA Toolkit), https://developer.nvidia.com/cusparse.
[8] Baskaran, M. M. and Bordawekar, R.: Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Research Report, Vol. RC24704 (2009).
[9] Guo, P. and Wang, L.: Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs, Proc. International Conference on Computational and Information Sciences (ICCIS 2010), pp. 1154-1157 (2010).
[10] El Zein, A. H. and Rendell, A. P.: Generating Optimal CUDA Sparse Matrix Vector Product Implementations for Evolving GPU Hardware, Concurrency and Computation: Practice and Experience, Vol. 24, pp. 3-13 (2012).
[11] Reguly, I. and Giles, M.: Efficient Sparse Matrix-Vector Multiplication on Cache-Based GPUs, Proc. Innovative Parallel Computing: Foundations and Applications of GPU, Manycore, and Heterogeneous Systems (InPar 2012), pp. 1-12 (2012).
[12] Yoshizawa, H. and Takahashi, D.: Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS Format on GPUs, Proc. 15th IEEE International Conference on Computational Science and Engineering (CSE 2012), pp. 130-136 (2012).
[13] NVIDIA Corporation: Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, http://www.nvidia.com/content/pdf/kepler/nvidia-Kepler-GK110-Architecture-Whitepaper.pdf (2012).
[14] Davis, T. and Hu, Y.: The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/.