IPSJ SIG Technical Report, Vol.2013-HPC-138, 2013/2/21

CRS-Format Sparse Matrix-Vector Multiplication on Kepler-Architecture GPUs

Daichi Mukunoki 1,a)    Daisuke Takahashi 2,b)

a) mukunoki@hpcs.cs.tsukuba.ac.jp
b) daisuke@cs.tsukuba.ac.jp

Abstract: We present an implementation of sparse matrix-vector multiplication (SpMV) for matrices stored in the CRS format on GPUs. The implementation targets NVIDIA's Kepler architecture with CUDA 5.0 and exploits new features of Kepler that Fermi-generation GPUs do not provide. On a Kepler-generation Tesla K20 with CUDA 5.0, it achieves speedups of up to 1.86x over the CRS-format SpMV of the cuSPARSE library and is faster on 177 out of 200 test matrices.

1. Introduction

Sparse matrix-vector multiplication (SpMV) is a key kernel in scientific computing, for example in iterative solvers for sparse linear systems, and the Compressed Row Storage (CRS) format *1 is one of the most widely used representations of sparse matrices. In the CRS format, the nonzero values of the matrix are packed row by row into an array val, their column indices into an array ind, and the position at which each row begins into an array ptr. Figure 1 shows an example. This report concerns the implementation of CRS-format SpMV on NVIDIA's Kepler-architecture GPUs using the CUDA programming environment [3].

A =
    [ 8 9 0 0 4 5 ]
    [ 0 7 5 6 2 0 ]
    [ 0 6 0 7 0 0 ]
    [ 9 0 0 6 0 2 ]
    [ 0 0 2 0 0 0 ]
    [ 0 0 7 2 8 0 ]

val = [8, 9, 4, 5, 7, 5, 6, 2, 6, 7, 9, 6, 2, 2, 7, 2, 8]
ind = [1, 2, 5, 6, 2, 3, 4, 5, 2, 4, 1, 4, 6, 3, 3, 4, 5]
ptr = [1, 5, 9, 11, 14, 15, 18]

Fig. 1: An example of the CRS format (1-based indices).

*1 In NVIDIA's cuSPARSE library and elsewhere in the GPU literature, the same format is called CSR (Compressed Sparse Row); this report uses the name CRS.
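For reference, the CRS arrays of Fig. 1 are traversed on the CPU by the following loop (a minimal sketch of y = Ax, ours rather than the paper's code; Fig. 1 uses 1-based indices, hence the -1 offsets):

/* Minimal CPU sketch of y = A*x for the CRS arrays of Fig. 1.
   The indices stored in ptr and ind are 1-based as in Fig. 1,
   so they are converted to 0-based array offsets here. */
void spmv_crs(int m, const double *val, const int *ind,
              const int *ptr, const double *x, double *y)
{
    for (int row = 0; row < m; row++) {
        double sum = 0.0;
        for (int i = ptr[row] - 1; i < ptr[row + 1] - 1; i++)
            sum += val[i] * x[ind[i] - 1];
        y[row] = sum;
    }
}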

Since Bell and Garland [1] published efficient CUDA SpMV implementations for a range of storage formats (CRS, ELL, COO, HYB) in 2008, many SpMV optimizations for GPUs have been studied. An SpMV based on Segmented Scan in CUDA is presented in [2]. Xu et al. [3] optimize CRS-format SpMV on Fermi GPUs with a cache-blocking method, and Feng et al. [4] propose SIC, a variant of the CRS format. Matam and Kothapalli [5] accelerate SpMV in iterative methods by choosing between the CRS and ELL formats, and Kubota and Takahashi [6] select the storage format automatically for each matrix. NVIDIA itself provides a CRS-format SpMV in the cuSPARSE library [7].

In this report we describe a CRS-format SpMV implementation that exploits new features of the Kepler architecture not available on Fermi-generation GPUs, and compare it with the CRS-format SpMV of cuSPARSE [7]. On a Kepler-generation Tesla K20 with CUDA 5.0, our implementation achieves speedups of up to 1.86x over cuSPARSE and is faster on 177 of 200 test matrices.

2. CRS-Format SpMV on GPUs

Bell and Garland [1] proposed two CUDA kernels that compute y = Ax for a CRS-format matrix. The first, CRS-scalar, assigns one thread to each row of y; it parallelizes the natural CPU loop directly (a sketch follows below), but neighboring threads then read distant parts of val and ind, so the memory accesses are not coalesced on the GPU. The second, CRS-vector, assigns one warp of 32 threads to each row, so that the threads of a warp read consecutive elements of a row and their partial sums are then reduced within the warp.
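A minimal sketch of the CRS-scalar kernel in this style (our illustration, assuming 0-based ptr and ind arrays; the names are ours, not from the paper):

// Sketch of the CRS-scalar kernel described by Bell and Garland [1]:
// one thread computes one row of y = A*x.
__global__ void spmv_crs_scalar(int m, const double *val,
                                const int *ind, const int *ptr,
                                const double *x, double *y)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < m) {
        double sum = 0.0;
        for (int i = ptr[row]; i < ptr[row + 1]; i++)
            sum += val[i] * x[ind[i]];  // neighboring threads touch
                                        // distant addresses: uncoalesced
        y[row] = sum;
    }
}

CRS-vector replaces the serial inner loop with 32 threads striding over the row, which is the starting point of the kernel shown later in Fig. 3.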

Several refinements of these two kernels have been proposed. Baskaran and Bordawekar [8] optimized the CRS kernels, assigning 16 threads rather than a full warp of 32 to each row, and Guo and Wang [9] auto-tune the CUDA launch parameters of SpMV. El Zein and Rendell [10] switch between the CRS-vector and CRS-scalar styles depending on the matrix. Reguly and Giles [11] generalized CRS-vector so that NT = 1, 2, 4, 8, 16, or 32 threads cooperate on each row, the case NT = 1 coinciding with CRS-scalar, and select NT from the average number of nonzeros per row; Yoshizawa and Takahashi [12] likewise choose among 1, 2, 4, 8, 16, and 32 threads per row.

3. The Kepler Architecture

Kepler is the GPU architecture that NVIDIA introduced in 2012; the Tesla K20 used in this report is a Kepler-generation product, and the architectural details are described in NVIDIA's whitepaper [13]. Kepler replaces Fermi's streaming multiprocessor (SM) with the SMX: an SM had 32 CUDA cores, while an SMX has 192. Several limits are also relaxed: the maximum x-dimension of a grid grows from 65,535 to 2,147,483,647 blocks, the register file per multiprocessor doubles to 65,536 registers, and the number of registers usable per thread grows from 63 to 255. Kepler further adds shuffle instructions that exchange 32-bit register values among the threads of a warp without using shared memory: __shfl (read from an arbitrary lane), __shfl_up and __shfl_down (read from the lane n positions away), and __shfl_xor (read from the lane whose index is XORed with a mask). Finally, each SMX has a 48KB read-only data cache which a kernel can use for ordinary global loads simply by qualifying pointer arguments with const __restrict__.

4. Implementation of CRS-Format SpMV for Kepler

Like cuSPARSE, our routine computes y = αAx + βy for a CRS-format matrix A. The implementation follows the CRS-vector style with a selectable number of threads per row and exploits three features that are new in Kepler: (1) the read-only data cache, (2) the larger maximum grid dimension, and (3) the shuffle instructions. Sections 4.1 to 4.3 describe how each is used.

Fig. 2: Assignment of threads to the rows of the example matrix of Fig. 1 for NT = 1, 2, and 4 threads per row.

Figure 3 shows the kernel. As in Reguly and Giles's generalization of CRS-vector [11], NT = 1, 2, 4, 8, 16, or 32 threads cooperate on each row; NT is a compile-time constant, and one kernel is generated for each of the six values. Figure 2 illustrates the resulting assignment of threads to rows for NT = 1, 2, and 4.

__global__ void SpmvKernel_NT(int m, double alpha, double *a_val,
                              int *a_ptr, int *a_idx,
                              const double * __restrict__ x,
                              double beta, double *y)
{
    unsigned int i;
    unsigned int tx = threadIdx.x;
    unsigned int tid = blockDim.x * blockIdx.x + tx;
    unsigned int rowid = tid / NT;
    unsigned int lane = tid % NT;
    double val;
    int val_hi, val_lo;

    if (rowid < m) {
        val = 0.0;
        for (i = a_ptr[rowid] + lane; i < a_ptr[rowid + 1]; i += NT)
            val += a_val[i] * x[a_idx[i]];
        for (i = NT / 2; i > 0; i = i >> 1) {
            val_hi = __double2hiint(val);
            val_lo = __double2loint(val);
            val += __hiloint2double(__shfl_xor(val_hi, i, 32),
                                    __shfl_xor(val_lo, i, 32));
        }
        if (lane == 0)
            y[rowid] = alpha * val + beta * y[rowid];
    }
}

Fig. 3: The SpMV kernel; NT is a compile-time constant in {1, 2, 4, 8, 16, 32}.

Figure 4 shows the host routine. The average number of nonzeros per row, nnzrow = nonzeros/m, determines NT as NT = max(1, min(32, 2^ceil(log2 nnzrow))); for example, nnzrow = 10 gives NT = 16. The routine then computes the launch configuration and calls the kernel specialized for the chosen NT.

int Spmv(char trans, int m, int n, double alpha, double *a_val,
         int *a_ptr, int *a_idx, double *x, double beta, double *y,
         int nonzeros)
{
    int NT, ntx, nbx;
    float nnzrow = (float)nonzeros / (float)m;

    NT = max(1, min(32, (int)pow(2., ceil(log2(nnzrow)))));
    ntx = NTX;    /* thread-block size (128 in this report) */
    nbx = m / (ntx / NT) + ((m % (ntx / NT)) != 0);
    dim3 threads(ntx);
    dim3 grid(nbx);

    if (trans == 'N') {
        if (NT == 32) {
            cudaFuncSetCacheConfig(SpmvKernel_32, cudaFuncCachePreferL1);
            SpmvKernel_32 <<< grid, threads >>>
                (m, alpha, a_val, a_ptr, a_idx, x, beta, y);
        } else if (NT == 16) {
            /* ... analogous launch of SpmvKernel_16 ... */
        } else if (NT == 2) {
            /* ... */
        } else {
            /* ... */
        }
    }
    /* ... */
}

Fig. 4: Host routine that selects NT and launches the specialized kernel (excerpt).

4.1 Read-only data cache

Kepler's SMX has a 48KB read-only data cache. A kernel routes loads of read-only data through this cache by declaring the corresponding pointer arguments const __restrict__. In our SpMV, the kernels read the input vector x through the read-only cache (see the parameter declaration in Fig. 3), since elements of x may be read many times with an irregular access pattern; the host routine additionally enlarges the L1 cache with cudaFuncSetCacheConfig(..., cudaFuncCachePreferL1) as shown in Fig. 4.

4.2 Larger maximum grid dimension

With a one-dimensional grid, the number of rows that one kernel launch can handle is RowMax = MaxGridDimX × BlockDim.x / NT. Our implementation uses BlockDim.x = 128, so on a Fermi GPU, where MaxGridDimX = 65,535, at most RowMax = 65,535 × 128/32 = 262,140 rows can be processed per launch when NT = 32; larger matrices require either reconstructing the row index from a two-dimensional block ID or an outer for loop in which each thread processes multiple rows. On a Kepler GPU, MaxGridDimX = 2,147,483,647 gives RowMax = 2,147,483,647 × 128/32 = 8,589,934,588 rows even for NT = 32. Since the row-pointer array of such a matrix alone would occupy roughly 32GB, far more than the few GB of memory on current GPUs, a one-dimensional grid is sufficient in practice and the outer loop and extra index computation can be removed.

4.3 Shuffle instructions

In CRS-vector-style kernels with NT > 1, the NT partial sums for each row must be reduced across threads. Conventionally this reduction is performed through shared memory; on Kepler we instead use the XOR-mode shuffle __shfl_xor to exchange partial sums directly between registers, halving the reduction distance in each of the log2 NT steps. Because the shuffle instructions move 32-bit values, a 64-bit double is split into its high and low words with __double2hiint and __double2loint, the two halves are shuffled separately, and the result is reassembled with __hiloint2double, as in the reduction loop of Fig. 3.
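Factored out of Fig. 3 as a standalone device function, the 64-bit shuffle idiom looks as follows (a sketch under CUDA 5.0; the helper name is ours):

// Sketch: XOR-butterfly shuffle of a 64-bit double on Kepler,
// built from two 32-bit __shfl_xor operations (CUDA 5.0 intrinsics).
__device__ double shfl_xor_double(double val, int lane_mask)
{
    int hi = __double2hiint(val);   // upper 32 bits
    int lo = __double2loint(val);   // lower 32 bits
    hi = __shfl_xor(hi, lane_mask, 32);
    lo = __shfl_xor(lo, lane_mask, 32);
    return __hiloint2double(hi, lo);
}

With this helper, the reduction loop of Fig. 3 can be written as: for (i = NT / 2; i > 0; i >>= 1) val += shfl_xor_double(val, i);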

5. Performance Evaluation

5.1 Evaluation environment

The evaluation uses an NVIDIA Tesla K20 (Kepler architecture) GPU in a host with an Intel Xeon E3-1230 3.20GHz CPU, running CentOS 6.3 (kernel 2.6.32-279.14.1.el6.x86_64) and CUDA 5.0 (driver version 304.54). Code was compiled with nvcc 5.0 (-O3 -arch sm_35) and gcc 4.4.6 (-O3). Performance is reported in Flops measured on the GPU; the time for transferring data between CPU and GPU over PCI-Express is not included. The comparison target is the CRS-format SpMV of the NVIDIA cuSPARSE library [7] shipped with CUDA 5.0.

The test set consists of 200 matrices from The University of Florida Sparse Matrix Collection [14], with row counts ranging from 1,813 to 5,558,326 and nonzero counts of up to 117,406,044; Figure 6 summarizes their sizes, nonzero counts, densities, and nonzeros per row. For each matrix A we measure y = αAx + βy in double precision.

5.2 Comparison with cuSPARSE

Fig. 5: Performance on the Tesla K20 in double precision: GFlops of cuSPARSE 5.0 and of our implementation (left axis) and the speedup of our implementation over cuSPARSE 5.0 (right axis) for the 200 matrices.

Fig. 6: Properties of the 200 test matrices: matrix size (rows), number of non-zero elements (NNZ), percentage of non-zero elements, and non-zero elements per row (NNZ/Row).

Figure 5 shows, for each of the 200 matrices, the performance in Flops of our implementation and of cuSPARSE, together with the speedup of our implementation relative to cuSPARSE. Our implementation achieved speedups of up to 1.86x over the CRS-format SpMV of cuSPARSE 5.0 and was faster on 177 of the 200 matrices; on the remaining 23 matrices it was slower.

Fig. 7: The number of threads per row (NT) selected for each of the 200 matrices.

6. Discussion

Figure 7 shows the NT that the host routine of Fig. 4 selected for each matrix, and Figure 6 shows each matrix's row count (Rows), nonzero count (NNZ), and nonzeros per row (NNZ/Row).

Comparing Figure 5 with Figure 6, the performance in Flops broadly follows NNZ/Row: matrices with more nonzeros per row give the threads assigned to each row more work and reach higher performance. Accordingly, the NT selected in Figure 7 also tracks NNZ/Row, as the selection rule of Fig. 4 intends. The matrices on which cuSPARSE remains faster than our implementation tend to be those with small NNZ and NNZ/Row, for which a small NT is chosen.

To isolate the contribution of each Kepler-specific optimization of Section 4, we compare five variants. Ver.0 uses none of the three features and corresponds to a Fermi-style implementation in which, for example, the input vector x is not read through the read-only cache and the intra-warp reduction goes through shared memory (a sketch of that reduction follows at the end of this section). Ver.1 enables only the read-only cache, Ver.2 only avoids the outer loop by relying on Kepler's larger grid dimension, Ver.3 only performs the reduction with the shuffle instruction, and Ver.4 enables all three.

Fig. 8: Speedup over Ver.0 on the Tesla K20 (double precision) of Ver.1 (read-only cache), Ver.2 (avoid outer loop), Ver.3 (shuffle instruction), and Ver.4 (all), for the 200 matrices.

Figure 8 shows the speedup of Ver.1 through Ver.4 relative to Ver.0 for the 200 matrices. With all three optimizations enabled (Ver.4), the speedup over Ver.0 reached roughly 1.8x at best and fell to about 0.98x in the worst case, and Ver.4 was faster than Ver.0 on 170 of the 200 matrices; among the individual optimizations, the read-only cache (Ver.1) contributed the largest share of the improvement.
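For completeness, the following is a minimal sketch (ours, not the paper's code) of the Fermi-style shared-memory reduction that Ver.0 uses in place of the shuffle-based reduction of Fig. 3. It assumes the same compile-time NT constant and thread mapping as Fig. 3, with each thread having first stored its partial sum to sdata[tx]:

// Sketch: warp-synchronous intra-warp reduction through shared
// memory, as used before Kepler's shuffle instructions existed.
// "volatile" forces every access to go to shared memory between
// the warp-synchronous steps instead of being cached in registers.
__device__ double reduce_shared(volatile double *sdata,
                                unsigned int tx, unsigned int lane)
{
    for (int i = NT / 2; i > 0; i >>= 1)
        if (lane < (unsigned int)i)
            sdata[tx] += sdata[tx + i];
    return sdata[tx - lane];    // the row sum, valid at lane == 0
}

Besides saving the shared-memory traffic, the shuffle version also frees the shared memory itself, which is one reason the combined Ver.4 can outperform the baseline.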

7. Conclusion

In this report we implemented a CRS-format SpMV for Kepler-architecture GPUs that exploits three features unavailable on Fermi-generation GPUs: the read-only data cache, through which the input vector x is read; the larger maximum grid dimension; and the warp shuffle instructions. On a Kepler-generation Tesla K20 with CUDA 5.0, the implementation achieved speedups of up to 1.86x over the CRS-format SpMV of cuSPARSE and was faster on 177 of the 200 test matrices. Further tuning of SpMV for Kepler-generation GPUs remains as future work.

Acknowledgment: This work was supported in part by JST CREST.

References

[1] Bell, N. and Garland, M.: Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report, No. NVR-2008-004 (2008).
[2] Sparse matrix-vector multiplication using Segmented Scan on CUDA (in Japanese), IPSJ SIG Technical Report, Vol. 2010-HPC-126, No. 1, pp. 1-7 (2010).
[3] Xu, W., Zhang, H., Jiao, S., Wang, D., Song, F. and Liu, Z.: Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU, Proc. 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2012), pp. 231-235 (2012).
[4] Feng, X., Jin, H., Zheng, R., Hu, K., Zeng, J. and Shao, Z.: Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs, Proc. IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS 2011), pp. 165-172 (2011).
[5] Matam, K. and Kothapalli, K.: Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU, Proc. International Conference on Parallel Processing (ICPP 2011), pp. 612-621 (2011).
[6] Kubota, Y. and Takahashi, D.: Optimization of Sparse Matrix-Vector Multiplication by Auto Selecting Storage Schemes on GPU, Proc. 11th International Conference on Computational Science and Its Applications (ICCSA 2011), Part II, Lecture Notes in Computer Science, No. 6783, pp. 547-561 (2011).
[7] NVIDIA Corporation: cuSPARSE Library (included in CUDA Toolkit), https://developer.nvidia.com/cusparse.
[8] Baskaran, M. M. and Bordawekar, R.: Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Research Report, No. RC24704 (2009).
[9] Guo, P. and Wang, L.: Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs, Proc. International Conference on Computational and Information Sciences (ICCIS 2010), pp. 1154-1157 (2010).
[10] El Zein, A. H. and Rendell, A. P.: Generating Optimal CUDA Sparse Matrix Vector Product Implementations for Evolving GPU Hardware, Concurrency and Computation: Practice and Experience, Vol. 24, pp. 3-13 (2012).
[11] Reguly, I. and Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs, Proc. Innovative Parallel Computing: Foundations and Applications of GPU, Manycore, and Heterogeneous Systems (InPar 2012), pp. 1-12 (2012).
[12] Yoshizawa, H. and Takahashi, D.: Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS format on GPUs, Proc. 15th IEEE International Conference on Computational Science and Engineering (CSE 2012), pp. 130-136 (2012).
[13] NVIDIA Corporation: Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf (2012).
[14] Davis, T. and Hu, Y.: The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/.

© 2013 Information Processing Society of Japan