IPSJ SIG Technical Report Vol.2015-HPC-148 No /3/2


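CUDA-BLAS ... GPU ...

1,3,a)  1  2,3  2,3

1 RIKEN Advanced Institute for Computational Science, Kobe, Hyogo
2 Japan Atomic Energy Agency, Kashiwa, Chiba
3 CREST, JST, Kawaguchi, Saitama
a) imamura.toshiyuki@riken.jp

Abstract: This report examines CUDA-BLAS implementations on GPUs and how they affect the GPU eigenvalue solvers Eigen-G and MAGMA. It also evaluates MAGMA with its dsymv replaced by the dsymv of the CUDA-BLAS library ASPEN.K2 (MAGMA+ASPEN.K2).

1.

BLAS implementations for GPUs ([1], [2]) include CUBLAS [4], which ships with CUDA [3], and MAGMABLAS [5]. The symmetric/Hermitian matrix-vector product ((SY|HE)MV) was reported on in HPC-138 [6] and HPC-146 [7]. (SY|HE)MV is the operation

$$ y := \alpha A_{U\ \mathrm{or}\ L}\, x + \beta y, \qquad A \in K^{n \times n},\ x \in K^{n},\ K = \mathbb{R}\ \mathrm{or}\ \mathbb{C}, \tag{1} $$

where A is symmetric or Hermitian and only its upper (U) or lower (L) triangle is referenced. In the following, this operation is called SYMV.

For GPU eigenvalue solvers, MAGMA [5] provides magma_dsyevd and magma_dsyevdx_2stage, and Eigen-G [8] is also available. These solvers divide the work between the CPU and the GPU and run Level 2 and Level 3 BLAS on the GPU, so their performance depends on CUDA-BLAS. MAGMA and Eigen-G differ in how they use the CPU and in how they share work between the CPU and the GPU. This report evaluates CUDA-BLAS on GPUs and its effect on CPU+GPU eigenvalue solvers.

2. CUDA-BLAS

2.1 CUBLAS

CUBLAS [4] is the CUDA implementation of BLAS included in the NVIDIA CUDA SDK [3]. NVIDIA provides Level 1 through Level 3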

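BLAS. DGEMM in particular is well tuned for the GPU; on the Tesla K20c, DGEMM reaches the 1 TFLOPS level.

2.2 MAGMABLAS

MAGMABLAS is the BLAS component of the GPU library MAGMA [5]. The current version is 1.6.1. MAGMABLAS complements CUBLAS with its own CUDA kernels ([9]).

2.3 KBLAS

KBLAS [10], [11] is a CUDA BLAS developed at KAUST. It provides the Level 2 routines GEMV and SYMV. The current version is 1.2 (1.3-beta).

2.4 ASPEN.K2

ASPEN.K2 [1] is a CUDA-BLAS that provides GEMV and SYMV ([6], [7]). This report uses its SYMV.

2.5 Other libraries

EM Photonics provides CULABLAS as part of CULA [16]. GLAS [14], developed by Sørensen at GPUlab, DTU, provides Level 1 and Level 2 routines. A GEMV implementation is reported in [2]. This report compares CUBLAS, MAGMA, KBLAS, and ASPEN.K2.

3. Eigenvalue solvers for GPUs using CUDA

Three eigenvalue solvers for GPUs using CUDA are described below. This report evaluates MAGMA and Eigen-G.

3.1 CULA

CULA [16] is a CUDA implementation of LAPACK. Its symmetric eigenvalue routines are syev and syevx, which are based on the QR method.

3.2 MAGMA

MAGMA [5] provides two routines, magma_dsyevd and magma_dsyevdx_2stage. magma_dsyevd follows LAPACK dsyevd and consists of three steps:
( 1 ) tridiagonalization (magma_dsytrd)
( 2 ) divide and conquer on the tridiagonal matrix (magma_dstedx)
( 3 ) back-transformation of the eigenvectors (magma_dormtr)

In step 1 (magma_dsytrd), dsymv and dsyr2k run on the GPU. In step 2 (magma_dstedx), dgemm runs on the GPU. In step 3 (magma_dormtr), the WY representation is formed on the CPU and dgemm runs on the GPU.

magma_dsyevdx_2stage, on the other hand, consists of the following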

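five steps:
( 1 ) reduction from dense to band form (magma_dsytrd_sy2sb)
( 2 ) reduction from band to tridiagonal form (magma_dsytrd_sb2st)
( 3 ) divide and conquer (magma_dstedx)
( 4 ) back-transformation for the band-to-tridiagonal step (magma_dbulge_back)
( 5 ) back-transformation for the dense-to-band step (magma_dormqr_2stages)

In magma_dsytrd_sy2sb (step 1), dgemm, dsymm, and dsyr2k run on the GPU, so this step is built on Level 3 BLAS. In this routine MAGMA uses both the GPU and the CPU.

3.3 Eigen-G

Eigen-G [8] is the GPU version of EigenK and EigenExa [18]; see [8] for details. Like magma_dsyevd, Eigen-G consists of three steps:
( 1 ) tridiagonalization
( 2 ) divide and conquer
( 3 ) back-transformation

Eigen-G shares dgemm between the CPU and the GPU. As of 2013/9, with MAGMA 1.4, Eigen-G ran in about 2/3 of the time of magma_dsyevd. Two reasons were identified:
( 1 ) DSYMV: MAGMA used CUBLAS, whereas Eigen-G used ASPEN.K2.
( 2 ) Asynchronous (async) execution.

4. Evaluation

4.1 CUDA-BLAS

In MAGMA 1.6.1, SYMV is available from either CUBLAS or MAGMABLAS. This section compares the CUDA-BLAS implementations on each GPU. Tables 1 and 2 show the performance of dgemm and dsymv for each CUDA-BLAS.

For dgemm, CUBLAS is the fastest on the Tesla K20c, and on the GTX980 MAGMABLAS is close to CUBLAS. For dsymv, ASPEN.K2, CUBLAS in atomics mode, and KBLAS are the fastest. CUBLAS (Atomic) and KBLAS rely on AtomicAdd, whereas ASPEN.K2 uses the mutex-based scheme of HPC-146 [7].

Figures 1, 2, and 3 show the dsymv performance in detail, including the Lower and Upper cases for ASPEN.K2.

In tridiagonalization (dsytrd), step 1 of the solver, dsymv is the dominant kernel. Among the CUDA-BLAS, the preferred combination is therefore CUBLAS for dgemm and ASPEN.K2 for dsymv. dsyr2k behaves like dgemm.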
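As a reference for what the SYMV kernels compared here compute, the following is a minimal sketch of Eq. (1) for the real symmetric case (dsymv). It is plain Python for illustration only and is not the code of CUBLAS, MAGMABLAS, KBLAS, or ASPEN.K2; the function name `symv` and its argument order are assumptions made for this example. The point it shows is that only the Upper or the Lower triangle of A is read, and each stored off-diagonal element contributes to two entries of y.

```python
def symv(uplo, alpha, a, x, beta, y):
    """Return alpha * A * x + beta * y for a real symmetric A.

    a is a list of rows. Only the triangle named by uplo ('U' or 'L'),
    including the diagonal, is read; the other triangle is never touched.
    """
    n = len(x)
    out = [beta * yi for yi in y]
    for i in range(n):
        for j in range(i, n):
            # Stored element for the index pair (i, j) with i <= j.
            aij = a[i][j] if uplo == "U" else a[j][i]
            out[i] += alpha * aij * x[j]
            if i != j:
                # By symmetry the same element also contributes to row j.
                out[j] += alpha * aij * x[i]
    return out


# The unused triangle is filled with None to show it is never read.
upper = [[2, 1, 0], [None, 3, 4], [None, None, 5]]
lower = [[2, None, None], [1, 3, None], [0, 4, 5]]
print(symv("U", 2, upper, [1, 2, 3], -1, [1, 1, 1]))
print(symv("L", 2, lower, [1, 2, 3], -1, [1, 1, 1]))
```

On a GPU, the second update (`out[j] += ...`) is where different thread blocks write to the same entry of y, which is why the implementations above resort to AtomicAdd or to a mutex. The Hermitian case would additionally conjugate the mirrored element.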

Table 1  DGEMM performance (GFLOPS)

Matrix size:          1088 2112 3136 4160 5184 6208 7232 8256

Tesla K20c
  CUBLAS 6.5          878. 14.45 13.47 24.86 32.9.63 41.29 42.92
  MAGMABLAS 1.6.1     538.65 575.48 577.6 575.13 571.95 565.84 563.17 563.88
  MKL                 121.32 23.87 142. 146.54 147.87 145.89 145.57 145.44
GTX980
  CUBLAS 7.0          125.16 137.61 142.58 143.62 143.69 143. 143.91 143.87
  MAGMABLAS 1.6.1     121.34 126.19 141.74 142.75 142.67 142.89 143.2 142.96
  MKL                 47.18 21.77 66.8 65.72.91 64.71 65.72 69.71

Table 2  DSYMV performance (GFLOPS)

Matrix size:          1088 2112 3136 4160 5184 6208 7232 8256

Tesla K20c
  CUBLAS 6.5          9.67 13.63 15.59 16.72 17.38 17. 18.26 18.48
  CUBLAS 6.5 (Atomic) 17.56 34.22 46.62 55.4 56.23 56.94 59.21 59.54
  MAGMABLAS 1.6.1     13.32 26.57 37.56.91 43.63 46. 48.91.97
  MKL                 .57 15.31 7.77 8.4 7.91 7.97 7.76 7.24
  KBLAS 1.2           23.25 43.15 49.81 54.39 56.44 57.66 58.38 59.26
  ASPEN.K2 1.5p2      26.13 46.71 54.88 58.82 59.21 61.73 62.39 62.64
GTX980
  CUBLAS 7.0          16.35 26.47.13.91 33.29 34.1 34.83 35.47
  CUBLAS 7.0 (Atomic) .8.74 54.51 64.97 73.4 76.64 78.26 78.9
  MAGMABLAS 1.6.1     17.81.46 39.4 44.66 47.53 49.64 51.29 53.22
  MKL                 4.46 4. 3.98 4. 4.3 4.15 3.85 4.8
  KBLAS 1.2           23.75 48.53 63.16 73.13 77.62 79.27 79.75 79.93
  ASPEN.K2 1.5p3      31.72 56.18 65.94 73.13 77.33 79.47 82.19 81.53

Table 3  Time for each solver step on the Tesla K20c

Matrix size:    1088  2122  3136  4160

MAGMA
  (1) trd       .15   .49   1.5   1.94
  (2) ed        .4    .14   .31   .52
  (3) tbk       .1    .7    .14   .28
Eigen-G
  (1) trd       .11   .33   .74   1.42
  (2) ed        .4    .15   .24   .65
  (3) tbk       .2    .6    .15   .29

Table 4  Time for each solver step on the GTX980

Matrix size:    1088  2122  3136  4160

MAGMA
  (1) trd       .13   .49   1.15  2.22
  (2) ed        .8    .24   .58   1.
  (3) tbk       .3    .18   .56   1.22
Eigen-G
  (1) trd       .11   .38   .86   1.69
  (2) ed        .6    .28   .73   1.39
  (3) tbk       .4    .18   .52   1.12

4.2 MAGMA and Eigen-G

This section compares MAGMA and Eigen-G. The two-stage routine (magma_dsyevdx_2stage) is left aside here, and magma_dsyevd is used. Tables 3 and 4 show the time for each step for matrix sizes from 1088 to 4160. The host CPUs of the Tesla K20c and the GTX980 systems differ, so the two tables should not be compared directly.

Of the three steps, (1) is faster in Eigen-G, (2) tends to be faster in MAGMA, and (3) is about the same. Step (1) is dominated by dsymv rather than dgemm. In step (2), MAGMA runs dgemm on the GPU, while Eigen-G shares dgemm with the CPU. In step (3), MAGMA forms the WY representation on the CPU and runs dgemm on the GPU. Eigen-G also uses the CPU; on the GTX980 system it divides dgemm between the GPU and the CPU in a ratio of 3:2.

[Figure 2: SYMV performance on the GeForce GTX980. Panels: ASPEN.K2 1.5p3x; CUBLAS 7.0RC (atomics mode); MAGMA 1.6.1 (with and without work); KBLAS 1.2. Each panel shows DSYMV for Upper and Lower.]
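To make concrete why step (1) depends so strongly on dsymv, the following is a minimal sketch of unblocked Householder tridiagonalization in plain Python. It is an illustration of the textbook algorithm, not the code of magma_dsytrd or Eigen-G; the function name `tridiagonalize` is an assumption, and for brevity the sketch reads the full matrix rather than one triangle. Each of the n-2 iterations performs one symmetric matrix-vector product on the trailing block followed by a rank-2 update, which blocked implementations accumulate into dsyr2k.

```python
import math


def tridiagonalize(a):
    """Householder reduction of a real symmetric matrix to tridiagonal form.

    Returns (d, e): the diagonal and the sub-diagonal of T = Q^T A Q.
    The input matrix (a list of rows) is not modified.
    """
    n = len(a)
    a = [row[:] for row in a]              # work on a copy
    for k in range(n - 2):
        m = range(k + 1, n)                # indices of the trailing block
        x = [a[i][k] for i in m]
        norm = math.sqrt(sum(xi * xi for xi in x))
        if norm == 0.0:
            continue                       # column k is already reduced
        alpha = -math.copysign(norm, x[0])
        v = x[:]
        v[0] -= alpha                      # Householder vector
        tau = 2.0 / sum(vi * vi for vi in v)
        # Symmetric matrix-vector product p = tau * A22 * v (the symv step).
        p = [tau * sum(a[i][j] * vj for j, vj in zip(m, v)) for i in m]
        s = 0.5 * tau * sum(pi * vi for pi, vi in zip(p, v))
        w = [pi - s * vi for pi, vi in zip(p, v)]
        # Rank-2 update A22 -= v w^T + w v^T (the syr2 / syr2k step).
        for r, i in enumerate(m):
            for c, j in enumerate(m):
                a[i][j] -= v[r] * w[c] + w[r] * v[c]
        a[k + 1][k] = a[k][k + 1] = alpha
        for i in range(k + 2, n):
            a[i][k] = a[k][i] = 0.0
    d = [a[i][i] for i in range(n)]
    e = [a[i + 1][i] for i in range(n - 1)]
    return d, e


A = [[4.0, 1.0, -2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [-2.0, 0.0, 3.0, -2.0],
     [2.0, 1.0, -2.0, -1.0]]
d, e = tridiagonalize(A)
print(d, e)
```

Because the matrix-vector product touches the whole trailing block once per column and cannot be blocked away, roughly half of the floating-point work of tridiagonalization stays in a memory-bound Level 2 kernel, which is consistent with the observation above that step (1) follows dsymv rather than dgemm performance.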

Table 5  Test environment (GPU and host CPU)

GPU                        Tesla K20c          GTX980
GPU Name                   GK110               GM204
Compute Capability         3.5                 5.2
GPU Clock (MHz)            706 (boost NA)      1126 (boost 1216)
Multiprocessors            13                  16
CUDA Cores                 2496 (=13*192)      2048 (=16*128)
Memory Capacity (MByte)    5120 (GDDR5)        4096 (GDDR5)
Memory Clock (MHz)         5200 (320bit)       7012 (256bit)
Memory Bandwidth (GB/s)    208                 224
ECC Support                Enabled             NA (ECC on)
PCI bus                    PCIe 2.0 x16        PCIe 3.0 x16 (host PCIe2)
Host                       (a)                 (b)

Host                       Host (a)            Host (b)
CPU                        AMD FX-8120         Intel Core i7-3930K
CPU Core                   8 (4FPUs)           6 (AVX available)
CPU Clock (GHz)            3.1                 3.2
Memory Capacity (GB)       16                  16
Linux Kernel version       3.6.11-4            3.11.-
CUDA Version               7.0RC               6.5
Driver Version             346.29              343.19
GNU gcc Version            4.6.3               4.7.2
Intel MKL Version          13..1               13..1
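The memory bandwidth in Table 5 gives a simple ceiling for the DSYMV figures in Table 2. The following sketch works through that arithmetic under the usual simplifying assumptions, which are not stated in the report itself: SYMV performs about 2n^2 floating-point operations, reads about n^2/2 matrix elements when only one triangle is accessed, and the traffic for the vectors x and y is neglected. The function name is hypothetical.

```python
def symv_bandwidth_bound_gflops(bandwidth_gb_s, bytes_per_element=8):
    """Bandwidth-limited upper bound on SYMV throughput in GFLOPS.

    About 2*n*n flops are done while about n*n/2 matrix elements are read,
    so the kernel needs bytes_per_element / 4 bytes of traffic per flop.
    """
    bytes_per_flop = bytes_per_element / 4.0
    return bandwidth_gb_s / bytes_per_flop


# 224 GB/s is the GTX980 memory bandwidth listed in Table 5.
print(symv_bandwidth_bound_gflops(224))      # double precision (dsymv)
print(symv_bandwidth_bound_gflops(224, 4))   # single precision (ssymv)
```

For double precision this is 2 bytes per flop, so the bound is half the bandwidth figure. A kernel that reads both triangles of A needs twice the traffic and has half this bound, which is one reason triangle-only kernels with atomic or mutex-protected updates of y are attractive.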

[Figure 1: DSYMV performance of each CUDA-BLAS on the GeForce GTX980 and the Tesla K20c. Series on the GTX980: ASPEN.K2 1.5p3x, KBLAS 1.2, MAGMA 1.6.1 (with and without work), CUBLAS 7.0RC (atomics mode). Series on the Tesla K20c: ASPEN.K2 1.5p2, KBLAS 1.2, MAGMA 1.6.1 (with and without work), CUBLAS 6.5 (atomics mode). Each series is shown for Upper and Lower.]

[Figure 3: SYMV performance on the Tesla K20c. Panels: ASPEN.K2 1.5p2; CUBLAS 6.5 (atomics mode); MAGMA 1.6.1 (with and without work); KBLAS 1.2. Each panel shows DSYMV for Upper and Lower.]

Table 6  Time for step (1) trd with MAGMA+ASPEN.K2 on the Tesla K20c

Matrix size:       1088  2122  3136  4160

MAGMA+ASPEN.K2     .11   .33   .77   1.46
MAGMA only         .15   .49   1.5   1.94
Eigen-G            .11   .33   .74   1.42

4.3 MAGMA+ASPEN.K2

This section evaluates MAGMA+ASPEN.K2. In step (1) of magma_dsyevd (magma_dsytrd), the dsymv call is replaced by the ASPEN.K2 dsymv. Steps (2) and (3) are unchanged, so only step (1) is measured. As Table 6 shows, MAGMA+ASPEN.K2 comes close to Eigen-G. The difference between MAGMA and Eigen-G in step (1) is therefore largely explained by the choice of CUDA-BLAS on the GPU.

4.4 Two-stage algorithm

MAGMA also provides the two-stage routine magma_dsyevdx_2stage, which differs from the one-stage algorithm used for the MAGMA and MAGMA+ASPEN.K2 results above.

5. Summary

This report evaluated CUDA-BLAS implementations and examined how CUDA-BLAS affects the eigenvalue solvers Eigen-G and MAGMA.

MAGMA+ASPEN.K2 was evaluated on a CPU+single GPU configuration. The MAGMA 2stage routine is left for further study.

Acknowledgments  ... ( ...: 223 ... ) ( ... (COE) ... ).

References
[1] Imamura, T., ASPEN-K2: Automatic-tuning and Stabilization for the Performance of CUDA BLAS Level 2 Kernels, 15th SIAM Conference on Parallel Processing for Scientific Computing (PP12), http://www.siam.org/meetings/pp12/
[2] ..., Kepler GPU ... SGEMV ..., GTC Japan 2014.
[3] NVIDIA Corporation, CUDA C Programming Guide, http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (2014).
[4] NVIDIA Corporation, The NVIDIA CUDA Basic Linear Algebra Subroutines, http://developer.nvidia.com/cublas
[5] Innovative Computing Laboratory, University of Tennessee, Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma
[6] ..., Fermi, Kepler ... GPU ... SYMV ..., IPSJ SIG Technical Report (HPC), Vol. 2012-HPC-138, No. 8 (2012) 1-7.
[7] ..., CUDA-xSYMV ..., IPSJ SIG Technical Report (HPC), Vol. 2014-HPC-146, No. 14 (2014) 1-12.
[8] Imamura, T., Yamada, S., Machida, M., Eigen-G: GPU-based eigenvalue solver for real-symmetric dense matrices, 10th International Conference on Parallel Processing and Applied Mathematics (PPAM14), LNCS 8384, pp. 673-682, 2014.
[9] Nath, R., Tomov, S., Dong, T. T., and Dongarra, J., Optimizing Symmetric Dense Matrix-vector Multiplication on GPUs, in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 11 (2011) 6:1-6:10.
[10] Abdelfattah, A., Keyes, D., and Ltaief, H., KAUST BLAS (KBLAS), http://cec.kaust.edu.sa/pages/kblas.aspx
[11] Abdelfattah, A., Keyes, D., and Ltaief, H., KBLAS: High Performance Level-2 BLAS on Multi-GPU Systems, http://ondemand.gputechconf.com/gtc/2014/poster/pdf/p4168_KBLAS_GPU_computing_optimization.pdf, GTC14 (2014).
[12] Sørensen, H. H. B., Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs, Parallel Processing and Applied Mathematics, LNCS 7203 (2012) 619-629.
[13] Sørensen, H. H. B., Auto-Tuning of Level 1 and Level 2 BLAS for GPUs, Concurrency Computat.: Pract. Exper., Wiley (2012) 1183-1198.
[14] GPUlab: GLAS library version 0.0.2, http://gpulab.imm.dtu.dk/docs/glas_v0.0.2_C_cuda_4.0_linux.tar.gz
[15] Imamura, T., Yamada, S., and Machida, M., A High Performance SYMV Kernel on a Fermi-core GPU, High Performance Computing for Computational Science, VECPAR 12, LNCS 7851 (2013) 59-70.
[16] Humphrey, J. R., Price, D. K., Spagnoli, K. E., Paolini, A. L., Kelmelis, E. J., CULA: Hybrid GPU Accelerated Linear Algebra Routines, SPIE Defense and Security Symposium (DSS), April, 2010.
[17] CULA. http://www.culatools.com/dense/
[18] EigenExa: http://www.aics.riken.jp/labs/lpnctrt/eigenexa.html
     EigenK: http://ccse.jaea.go.jp/ja/download/eigenk.html