

2011 Symposium on High Performance Computing and Computational Science (High Performance Computing Symposium 2011, HPCS2011), 2011/1/18

Implementation and Evaluation of Quadruple and Octuple Precision BLAS on GPUs

Daichi Mukunoki and Daisuke Takahashi

We implemented quadruple and octuple precision Basic Linear Algebra Subprograms (BLAS) functions on graphics processing units (GPUs) and evaluated their performance. We used DD-type quadruple precision arithmetic, which combines two double precision values to represent a quadruple precision value, and QD-type octuple precision arithmetic, which combines four double precision values to represent an octuple precision value. On the NVIDIA Tesla C2050, quadruple precision AXPY is approximately 9.5 times faster, and octuple precision AXPY approximately 19 times faster, than on the Intel Core i7 920. Quadruple precision GEMM is approximately 29 times faster, and octuple precision GEMM approximately 24 times faster, than on the CPU. Moreover, quadruple precision AXPY takes only about 2.1 times as long as double precision AXPY on the GPU. For quadruple and octuple precision GEMV and GEMM as well, the increase in execution time relative to double precision is smaller on the GPU than on the CPU. When the PCI-Express (PCIe) data transfer time is taken into account, the performance of double precision GEMM is limited by the transfer time, whereas that of quadruple and octuple precision GEMM is hardly affected. These results show that quadruple and octuple precision BLAS operations are well suited to GPUs.

1. Introduction
Graduate School of Systems and Information Engineering, University of Tsukuba
(c) 2011 Information Processing Society of Japan

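Fig. 1 Number formats: a DD-type quadruple precision value is held as two double precision values (a0, a1), and a QD-type octuple precision value as four double precision values (a0, a1, a2, a3).

GPUs (Graphics Processing Units) are increasingly used for general-purpose computation, an approach known as GPGPU (General Purpose computing on GPU). The NVIDIA Tesla C2050, for example, has a peak performance of 1030 GFlops in single precision and 515 GFlops in double precision. A GPU is attached to the CPU through PCI-Express (PCIe), and PCIe 2.0 x16 provides a bandwidth of 8 GB/s, so both the memory bandwidth of the GPU and the PCIe bandwidth matter for operations with a high Byte/Flop requirement.

In this paper we implement quadruple and octuple precision BLAS (Basic Linear Algebra Subprograms) functions on NVIDIA GPUs and evaluate their performance. Quadruple and octuple precision operations need far more floating-point operations per byte than double precision operations, that is, their Byte/Flop requirement is low, which makes quadruple and octuple precision BLAS a promising workload for GPUs.

Section 2 describes quadruple and octuple precision arithmetic, Section 3 the implementation of quadruple and octuple precision BLAS on GPUs, Section 4 the performance evaluation, and Section 5 the conclusions.

2. Quadruple and Octuple Precision Arithmetic

2.1 Related Work

Software libraries for multiple precision arithmetic include GMP 1), MPFR 2), and ARPREC 3). For quadruple and octuple precision there is the QD library 4). QD provides double-double (DD) arithmetic, which represents a quadruple precision value with two double precision values, and quad-double (QD) arithmetic, which represents an octuple precision value with four double precision values. In this paper we call these DD-type quadruple precision and QD-type octuple precision.

Among BLAS implementations, XBLAS 5) uses DD arithmetic internally, and MBLAS 6) is a BLAS built on GMP, MPFR, and QD. These quadruple and octuple precision implementations run on CPUs.

On GPUs, Göddeke et al. 7) used double-float arithmetic in FEM computations, and Thall 8) implemented double-float and quad-float arithmetic. Reference 9) reports DD-type quadruple precision arithmetic on AMD GPUs and GRAPE-DR. Zhao et al. 10) developed GPUMP, a GMP-like multiple precision library for GPUs, and Lu et al. 11) ported QD and ARPREC to GPUs as GQD and GARPREC.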

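The present work differs from these in that it implements BLAS functions in quadruple and octuple precision on GPUs and evaluates their performance.

2.2 DD-type Quadruple Precision and QD-type Octuple Precision Arithmetic

As shown in Fig. 1, a DD-type quadruple precision value a is represented by two double precision values a0 and a1 as a = a0 + a1 with |a0| > |a1|. A QD-type octuple precision value is represented by four double precision values as a = a0 + a1 + a2 + a3 with |a0| > |a1| > |a2| > |a3|.

The significand of IEEE 754-2008 binary64 has 52 stored bits (53 bits including the implicit bit). DD-type quadruple precision therefore has 104 bits (106 bits), or about 32 decimal digits. This is not the same format as the quadruple precision binary128 of IEEE 754-2008. QD-type octuple precision has 208 bits (212 bits), or about 64 decimal digits.

A DD-type operation on a = a0 + a1 and b = b0 + b1 is carried out with double precision operations on a0, a1, b0, and b1. The QD-type operations follow Hida et al. 12); the QD library also provides "sloppy" variants of some operations. Multiplication benefits from Fused Multiply-Add (FMA), which computes a × b + c without rounding the intermediate product. Table 1 lists the number of double precision operations needed for one DD-type or QD-type operation.

Table 1 Number of double precision floating-point operations per DD-type quadruple and QD-type octuple precision operation

Operation                      DD (quadruple)   QD (octuple)
Addition                       20 Flop          90 Flop
Multiplication (with FMA)      10 Flop          193 Flop
Multiplication (without FMA)   24 Flop          333 Flop

3. Implementation of Quadruple and Octuple Precision BLAS

We implemented BLAS functions in DD-type quadruple precision and QD-type octuple precision on GPUs, based on the QD library and using CUDA (Compute Unified Device Architecture), NVIDIA's GPGPU development environment. The implementation targets NVIDIA GPUs of the GT200 architecture and its successor Fermi, which are the ones evaluated in Section 4.

3.1 BLAS Functions

We implemented one function from each of BLAS Levels 1 to 3: AXPY (y = αx + y) from Level 1, GEMV (y = αAx + βy) from Level 2, and GEMM (C = αAB + βC) from Level 3. The BLAS functions are called from a CPU program: the data are transferred from the CPU to the GPU over PCIe, the computation runs on the GPU, and the result is transferred back to the CPU.

In AXPY and GEMV, each thread block has 128 threads and up to 65535 thread blocks are launched; each thread selects the elements it processes from its thread ID. GEMM uses two-dimensional thread blocks of 8 × 8 = 64 or 16 × 16 = 256 threads together with shared memory. The Tesla C1060 has 16 KB of shared memory per multiprocessor. The Tesla C2050 has 64 KB that can be configured either as 16 KB of shared memory and 48 KB of L1 cache or as 48 KB of shared memory and 16 KB of L1 cache, so 16 KB of shared memory is available on both GPUs.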
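The DD representation of Section 2.2 can be illustrated with a minimal host-side C++ sketch. It is not the authors' CUDA code: the type name `dd` and the function names are hypothetical, and the addition shown is the standard accurate double-double addition built from error-free transformations, chosen here because its cost of 20 double precision operations matches the count quoted for DD addition.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side DD type: value = hi + lo with |hi| > |lo|.
struct dd {
    double hi;
    double lo;
};

// Error-free sum of two doubles (Knuth): s + e == a + b exactly. 6 flops.
inline dd two_sum(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}

// Error-free sum that assumes |a| >= |b| (Dekker). 3 flops.
inline dd quick_two_sum(double a, double b) {
    double s = a + b;
    double e = b - (s - a);
    return {s, e};
}

// DD + DD using 6 + 6 + 1 + 3 + 1 + 3 = 20 double precision flops.
inline dd dd_add(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);     // sum of the high parts and its rounding error
    dd t = two_sum(a.lo, b.lo);     // sum of the low parts and its rounding error
    s.lo += t.hi;
    s = quick_two_sum(s.hi, s.lo);  // renormalize
    s.lo += t.lo;
    return quick_two_sum(s.hi, s.lo);
}
```

Adding 1.0 and 1e-20 in plain double precision loses the small term, whereas `dd_add({1.0, 0.0}, {1e-20, 0.0})` keeps it in the low word. The sketch must be compiled without value-changing floating-point optimizations such as `-ffast-math`.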

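3.2 Quadruple and Octuple Precision Operations

The quadruple and octuple precision operations were implemented by porting the corresponding routines of the QD library to CUDA. GQD by Lu et al. is a similar port of QD to CUDA. The DD-type operations build on our earlier CUDA implementation 13). The same QD routines are used for the quadruple and octuple precision BLAS on the CPU.

GPUs from the GT200 generation onward support double precision FMA, and we use it in the multiplication routines. In CUDA, FMA is expressed explicitly with the `__fma_rn` intrinsic; where a multiplication and an addition must not be fused into an FMA, the `__dmul_rn` and `__dadd_rn` intrinsics are used.

4. Performance Evaluation

4.1 Evaluation Environment

We evaluated AXPY, GEMV, and GEMM in DD-type quadruple precision and QD-type octuple precision on two GPUs: the NVIDIA Tesla C2050 (Fermi architecture) and the NVIDIA Tesla C1060 (GT200 architecture).

GPU            Peak (double)   Memory        Memory bandwidth
Tesla C1060    78 GFlops       4 GB GDDR3    102 GB/s
Tesla C2050    515 GFlops      3 GB GDDR5    144 GB/s

The Tesla C2050 supports ECC memory.

We call our GPU implementations CUDDBLAS (DD-type quadruple precision) and CUQDBLAS (QD-type octuple precision). For double precision we used GotoBLAS2-1.13 on the CPU and CUBLAS 3.1 on the GPU. As CPU counterparts of CUDDBLAS and CUQDBLAS we prepared DDBLAS and QDBLAS, quadruple and octuple precision BLAS in the manner of MBLAS that use QD 2.3.11 4) and are parallelized with OpenMP.

The CPU is an Intel Core i7 920 (2.67 GHz, quad-core, with Hyper-Threading support); GotoBLAS, DDBLAS, and QDBLAS were run with 4 threads. The OS is CentOS 5.5 for x86-64 (kernel 2.6.18-194.11.4.el5) with CUDA Version 3.1. CPU code was compiled with g++ 4.1.2 -O3 and GPU code with nvcc 3.1 -O3; DDBLAS, QDBLAS, and QD 2.3.11 were compiled with Intel C++ (icpc) 11.1 -fast.

We count one DD-type quadruple precision operation as 1 DDFlop and one QD-type octuple precision operation as 1 QDFlop, and report performance in DDFlops and QDFlops. Unless noted otherwise, the GPU results do not include the CPU-GPU data transfer over PCIe, which is examined separately in Section 4.6. Input data were generated with the QD functions dd_rand and qd_rand. The scalar α of AXPY, GEMV, and GEMM was set to 1.0, and β to 0.0.

4.2 AXPY

Figures 2 to 4 show the performance of double, quadruple, and octuple precision AXPY. AXPY performs 2 floating-point operations for every 3 memory accesses (2:3), so its Byte/Flop requirement is high. At N = 8,192,000, double precision AXPY on the Tesla C2050 reaches 10.6 GFlops, only 2.1% of the peak. Quadruple precision AXPY on the Tesla C2050 reaches 5.0 GDDFlops. With FMA, one quadruple precision multiply-add (2 DDFlop) costs 30 Flop, so 5.0 GDDFlops corresponds to 75.5 GFlops, or 14.7% of the peak. Octuple precision AXPY on the Tesla C2050 reaches 0.76 GQDFlops, which corresponds to 107.8 GFlops.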
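The role of FMA in Section 3.2 can be sketched on the host with `std::fma`, which plays the part that `__fma_rn` plays in CUDA device code. This is an illustration under assumptions, not the authors' kernel: the names `dd`, `two_prod_fma`, and `dd_mul` are hypothetical, and the multiplication shown is the common double-double product that drops the lo × lo term.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side DD type: value = hi + lo.
struct dd {
    double hi;
    double lo;
};

// Error-free product with FMA: p + e == a * b exactly
// (barring overflow and underflow). Without FMA, the error term
// needs Dekker's splitting, which costs many more operations.
inline dd two_prod_fma(double a, double b) {
    double p = a * b;
    double e = std::fma(a, b, -p);  // exact rounding error of a * b
    return {p, e};
}

// Error-free sum that assumes |a| >= |b|.
inline dd quick_two_sum(double a, double b) {
    double s = a + b;
    double e = b - (s - a);
    return {s, e};
}

// DD * DD, ignoring the a.lo * b.lo term.
inline dd dd_mul(dd a, dd b) {
    dd p = two_prod_fma(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;  // cross terms
    return quick_two_sum(p.hi, p.lo);   // renormalize
}
```

Squaring x = 1 + 2^-30 gives 1 + 2^-29 + 2^-60; a double keeps only 1 + 2^-29, while `dd_mul` returns the 2^-60 term in the low word.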

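Fig. 2 Performance of double precision AXPY (DAXPY) in GFlops for N from 2.048e+6 to 8.192e+6: GotoBLAS on Core i7 920, CUBLAS on Tesla C1060, CUBLAS on Tesla C2050.
Fig. 3 Performance of quadruple precision AXPY (DDAXPY) in GDDFlops for the same N: DDBLAS on Core i7 920, CUDDBLAS on Tesla C1060, CUDDBLAS on Tesla C2050.
Fig. 4 Performance of octuple precision AXPY (QDAXPY) in GQDFlops for the same N: QDBLAS on Core i7 920, CUQDBLAS on Tesla C1060, CUQDBLAS on Tesla C2050.

For octuple precision AXPY on the Tesla C2050 this amounts to 20.9% of the peak. Because quadruple and octuple precision operations have a lower Byte/Flop requirement than double precision operations, they use the GPU more efficiently. Compared with the CPU, the Tesla C2050 is about 9.5 times faster in quadruple precision and about 19 times faster in octuple precision.

Quadruple precision AXPY needs 15 times as many floating-point operations as double precision AXPY, yet it takes only about 2.5 times as long as double precision AXPY on the CPU and about 2.1 times as long on the Tesla C2050, close to the factor of 2 by which the DD format increases the data volume. Octuple precision AXPY takes about 34 times as long as double precision AXPY on the CPU and about 14 times as long on the Tesla C2050.

4.3 GEMV

Figures 5 to 7 show the performance of double, quadruple, and octuple precision GEMV. At N = 8,192, the Tesla C2050 is about 18 times faster than the CPU in quadruple precision and about 19 times faster in octuple precision. On the CPU, quadruple precision GEMV takes about 6.4 times as long as double precision GEMV and octuple precision GEMV about 89 times as long; on the Tesla C2050 the factors are about 3.1 and 41.

GEMV has a ratio of floating-point operations to memory accesses of about 2:1, against 2:3 for AXPY, so its Byte/Flop requirement is lower than that of AXPY. In double precision, GEMV accordingly achieves about 2.7 times the performance of AXPY on the CPU, about 3.2 times on the Tesla C1060, and about 3.0 times on the Tesla C2050. On the Tesla C2050, quadruple precision GEMV achieves about 2.0 times the performance of quadruple precision AXPY, whereas octuple precision GEMV performs about the same as octuple precision AXPY.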
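The Byte/Flop argument used for AXPY can be made concrete with a small calculation. The sketch below is only an illustration: the function names are hypothetical, it assumes that AXPY moves three values per element (load x, load y, store y), and it takes the cost of one multiply-add as a parameter (2 Flop in double precision, 30 Flop in DD-type quadruple precision with FMA, as quoted in Section 4.2).

```cpp
#include <cassert>
#include <cmath>

// Bytes moved per double precision flop for AXPY (y = alpha * x + y).
// Per element: load x, load y, store y -> 3 * bytes_per_value bytes,
// and one multiply-add costing flops_per_muladd double precision flops.
inline double axpy_bytes_per_flop(double bytes_per_value, double flops_per_muladd) {
    return 3.0 * bytes_per_value / flops_per_muladd;
}

// Bytes the hardware can deliver per flop at peak rates.
inline double machine_balance(double mem_gb_per_s, double peak_gflops) {
    return mem_gb_per_s / peak_gflops;
}
```

With these assumptions, double precision AXPY needs 12 bytes per flop and DD-type AXPY 1.6 bytes per flop, while the Tesla C2050 (144 GB/s, 515 GFlops) can supply about 0.28 bytes per flop. Both variants remain memory bound, but the quadruple precision one is much closer to the balance of the machine, which is consistent with its higher fraction of peak.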

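Fig. 5 Performance of double precision GEMV (DGEMV) in GFlops for N from 2048 to 8192: GotoBLAS on Core i7 920, CUBLAS on Tesla C1060, CUBLAS on Tesla C2050.
Fig. 6 Performance of quadruple precision GEMV (DDGEMV) in GDDFlops for the same N: DDBLAS on Core i7 920, CUDDBLAS on Tesla C1060, CUDDBLAS on Tesla C2050.
Fig. 7 Performance of octuple precision GEMV (QDGEMV) in GQDFlops for the same N: QDBLAS on Core i7 920, CUQDBLAS on Tesla C1060, CUQDBLAS on Tesla C2050.

Table 2 Performance with and without FMA on the Tesla C2050 (AXPY: N = 8,192,000, GEMV: N = 8,192, GEMM: N = 4,096)

Function           With FMA          Without FMA
Quadruple AXPY     5.03 GDDFlops     4.96 GDDFlops
Octuple AXPY       0.76 GQDFlops     0.70 GQDFlops
Quadruple GEMV     10.18 GDDFlops    7.1 GDDFlops
Octuple GEMV       0.80 GQDFlops     0.62 GQDFlops
Quadruple GEMM     14.19 GDDFlops    10.06 GDDFlops
Octuple GEMM       0.97 GQDFlops     0.74 GQDFlops

4.4 GEMM

Figures 8 to 10 show the performance of double, quadruple, and octuple precision GEMM. At N = 4,096, the Tesla C2050 is about 29 times faster than the CPU in quadruple precision and about 24 times faster in octuple precision. Relative to double precision GEMM, quadruple precision GEMM takes about 84 times as long and octuple precision GEMM about 1016 times as long on the CPU, against about 12 times and 179 times on the Tesla C2050.

Double precision GEMM reaches 75.2 GFlops (96.4% of the peak) on the Tesla C1060 and 173.8 GFlops (33.8% of the peak) on the Tesla C2050. GEMM performs on the order of 2N^3 floating-point operations on 3N^2 data, so its Byte/Flop requirement is lower than that of AXPY and GEMV.

As shown in Fig. 9, quadruple precision GEMM reaches 2.6 GDDFlops on the Tesla C1060 and 14.2 GDDFlops on the Tesla C2050, which correspond to 39 GFlops and 212.8 GFlops, or 50.1% and 41.3% of the respective peaks. As shown in Fig. 10, octuple precision GEMM reaches 0.18 GQDFlops on the Tesla C1060 and 0.97 GQDFlops on the Tesla C2050, which correspond to 25.8 GFlops and 137.4 GFlops; the efficiencies given are 25.8% and 33.1%.

4.5 Effect of FMA

CUDDBLAS and CUQDBLAS use FMA in their multiplication routines. To measure its effect, we compared them with versions that do not use FMA.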
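The conversion from DDFlops to double precision Flops used in Sections 4.2 and 4.4 is a simple scaling, sketched below. The function names are hypothetical; the default of 30 Flop per quadruple precision multiply-add (2 DDFlop) is the figure quoted for DD arithmetic with FMA, and a different value can be passed for other operation mixes.

```cpp
#include <cassert>
#include <cmath>

// Double precision GFlops equivalent of a DD-type rate.
// One multiply-add is 2 DDFlop and costs flops_per_muladd double flops.
inline double dd_to_double_gflops(double gddflops, double flops_per_muladd = 30.0) {
    return gddflops * flops_per_muladd / 2.0;
}

// Achieved rate as a percentage of the peak rate.
inline double percent_of_peak(double gflops, double peak_gflops) {
    return 100.0 * gflops / peak_gflops;
}
```

For the quadruple precision GEMM result of 14.19 GDDFlops on the Tesla C2050 (peak 515 GFlops), this gives 212.85 GFlops and about 41.3% of the peak, in line with the values reported in Section 4.4.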

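Fig. 8 Performance of double precision GEMM (DGEMM) in GFlops for N from 1024 to 4096: GotoBLAS on Core i7 920, CUBLAS on Tesla C1060, CUBLAS on Tesla C2050.
Fig. 9 Performance of quadruple precision GEMM (DDGEMM) in GDDFlops for the same N: DDBLAS on Core i7 920, CUDDBLAS on Tesla C1060, CUDDBLAS on Tesla C2050.
Fig. 10 Performance of octuple precision GEMM (QDGEMM) in GQDFlops for the same N: QDBLAS on Core i7 920, CUQDBLAS on Tesla C1060, CUQDBLAS on Tesla C2050.

Table 2 shows the performance with and without FMA. With FMA, quadruple precision GEMM is about 1.4 times faster and octuple precision GEMM about 1.3 times faster than without FMA. Overall, the gain from FMA in quadruple and octuple precision is at most about 1.5 times.

4.6 Effect of PCIe Data Transfer

When a GPU BLAS function is called from a CPU program, the data must be transferred between the CPU and the GPU over PCIe. The PCIe bandwidth of 8 GB/s is far lower than the memory bandwidth of the GPUs (102 GB/s on the Tesla C1060 and 144 GB/s on the Tesla C2050), so the transfer can dominate the execution time. Figures 11 to 13 show the share of PCIe transfer time in the total execution time of each BLAS function.

For AXPY (Fig. 11) on the Tesla C1060, PCIe transfer accounts for 93.1% of the total time in double precision, 87.2% in quadruple precision, and 56.8% in octuple precision. Figures 12 and 13 show the corresponding results for GEMV and GEMM. For double precision GEMM, PCIe transfer accounts for 3.5% of the total time on the Tesla C1060 and 7.6% on the Tesla C2050, whereas for quadruple and octuple precision GEMM it accounts for at most 1.5%. Because quadruple and octuple precision operations have a low Byte/Flop requirement, the impact of PCIe transfer is small, which makes them well suited to GPUs.

5. Conclusion

We implemented DD-type quadruple precision and QD-type octuple precision BLAS functions on GPUs and evaluated their performance. On the NVIDIA Tesla C2050, quadruple precision GEMM is about 29 times faster and octuple precision GEMM about 24 times faster than on the Intel Core i7 920.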
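The PCIe analysis of Section 4.6 can be approximated with a simple model. The sketch below is an assumption-laden illustration, not the measurement procedure of the paper: the function name is hypothetical, transfer and computation are assumed not to overlap, and PCIe is assumed to sustain its nominal bandwidth.

```cpp
#include <cassert>
#include <cmath>

// Fraction of total time spent in PCIe transfer, assuming that
// transfer and computation run back to back without overlap.
inline double pcie_share(double bytes, double pcie_gb_per_s, double compute_seconds) {
    double transfer_seconds = bytes / (pcie_gb_per_s * 1e9);
    return transfer_seconds / (transfer_seconds + compute_seconds);
}
```

As a usage example, take double precision GEMM with N = 4,096 on the Tesla C2050 and assume that A, B, and C are sent to the GPU and C is sent back (4 × N × N × 8 bytes), with the computation running at the measured 173.8 GFlops. The model then gives a PCIe share of about 7.8%, the same order as the 7.6% reported. Doubling the bytes and dividing the compute rate by the quadruple precision slowdown shows why the share shrinks to around 1% for quadruple precision GEMM.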

Fig. 11 Share of computation and PCIe data transfer in the total execution time of AXPY (N = 8,192,000) for double, quadruple, and octuple precision on the Tesla C1060 and C2050.
Fig. 12 Share of computation and PCIe data transfer in the total execution time of GEMV (N = 8,192) for double, quadruple, and octuple precision on the Tesla C1060 and C2050.
Fig. 13 Share of computation and PCIe data transfer in the total execution time of GEMM (N = 4,096) for double, quadruple, and octuple precision on the Tesla C1060 and C2050.

On the CPU, quadruple precision GEMM takes about 84 times as long as double precision GEMM and octuple precision GEMM about 1016 times as long, whereas on the Tesla C2050 the factors are only about 12 and 179. The cost of moving from double precision to quadruple or octuple precision is therefore smaller on the GPU than on the CPU, helped by the use of FMA in the DD-type and QD-type operations. When the PCI-Express (PCIe) data transfer is taken into account, double precision GEMM is affected by the transfer time, but quadruple and octuple precision GEMM are hardly affected. Quadruple and octuple precision BLAS are thus well suited to GPUs.

For operations with a high Byte/Flop requirement, such as AXPY, DOT, and SpMV, the cost of higher precision on the GPU is particularly small: on the Tesla C2050, quadruple precision AXPY takes only about 2.1 times as long as double precision AXPY.

References

1) Granlund, T.: GMP: GNU Multiple Precision Arithmetic Library, http://gmplib.org/.
2) Fousse, L., Hanrot, G., Lefevre, V., Pelissier, P. and Zimmermann, P.: MPFR: GNU MPFR Library, http://www.mpfr.org/.
3) Bailey, D. H.: ARPREC (C++/Fortran-90 arbitrary precision package), http://crd.lbl.gov/~dhbailey/mpdist/.
4) Bailey, D. H.: QD (C++/Fortran-90 double-double and quad-double package), http://crd.lbl.gov/~dhbailey/mpdist/.
5) Li, X. S., Demmel, J. W., Bailey, D. H., Hida, Y., Iskandar, J., Kapur, A., Martin, M. C., Thompson, B., Tung, T. and Yoo, D. J.:

XBLAS: Extra Precise Basic Linear Algebra Subroutines.
6) The MPACK; Multiple precision arithmetic BLAS (MBLAS) and LAPACK (MLAPACK), http://mplapack.sourceforge.net/.
7) Göddeke, D., Strzodka, R. and Turek, S.: Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, International Journal of Parallel, Emergent and Distributed Systems, Vol. 22 (2007).
8) Thall, A.: Extended-Precision Floating-Point Numbers for GPU Computation, ACM SIGGRAPH 2006 Research Posters (2006).
9) [in Japanese], Vol. 2009-HPC-121, No. 39 (2009).
10) Zhao, K. and Chu, X.: GPUMP: a Multiple-Precision Integer Library for GPUs, Proc. IEEE International Conference on Computer and Information Technology (CIT 2010) (2010).
11) Lu, M., He, B. and Luo, Q.: Supporting Extended Precision on Graphics Processors, Proc. Sixth International Workshop on Data Management on New Hardware (DaMoN 2010) (2010).
12) Hida, Y., Li, X. S. and Bailey, D. H.: Algorithms for Quad-Double Precision Floating Point Arithmetic, Proc. 15th Symposium on Computer Arithmetic, pp. 155-162 (2001).
13) [in Japanese; on quadruple precision BLAS on GPUs], Vol. 2009-HPC-122, No. 13 (2009).