2011年ハイパフォーマンスコンピューティングと計算科学シンポジウム (2011 Symposium on High Performance Computing and Computational Science) HPCS2011, 2011/1/18

Implementation and Evaluation of Quadruple and Octuple Precision BLAS on GPUs

Daichi Mukunoki and Daisuke Takahashi

We implemented quadruple and octuple precision Basic Linear Algebra Subprograms (BLAS) functions on graphics processing units (GPUs) and evaluated their performance. We used DD-type quadruple precision arithmetic, which combines two double precision values to represent one quadruple precision value, and QD-type octuple precision arithmetic, which combines four double precision values to represent one octuple precision value. On an NVIDIA Tesla C2050, quadruple precision AXPY is approximately 9.5 times faster, and octuple precision AXPY approximately 19 times faster, than on an Intel Core i7 920. Additionally, quadruple precision GEMM is approximately 29 times faster, and octuple precision GEMM approximately 24 times faster, than on the CPU. Moreover, quadruple precision AXPY takes only about 2.1 times longer than double precision AXPY on the GPU, and for quadruple and octuple precision GEMV and GEMM the increase in execution time relative to double precision is smaller on the GPU than on the CPU. On the other hand, when the PCI-Express (PCIe) data transfer time is taken into consideration, the performance of double precision GEMM is limited by the PCIe transfer time, whereas that of quadruple and octuple precision GEMM is almost unaffected by it. In this research, we show that quadruple and octuple precision BLAS operations are well suited to GPUs.

1. Introduction
In many scientific applications, such as iterative solvers like the conjugate gradient (CG) method, rounding errors in standard 64-bit double precision arithmetic can degrade the accuracy of results, so arithmetic with more than double precision is sometimes required.

† Graduate School of Systems and Information Engineering, University of Tsukuba

© 2011 Information Processing Society of Japan
Such quadruple and octuple precision values can be represented as unevaluated sums of double precision values: a = a0 + a1 for quadruple precision and a = a0 + a1 + a2 + a3 for octuple precision.

In recent years, the Graphics Processing Unit (GPU), originally a processor for graphics, has come to be used for general-purpose computation (GPGPU: General Purpose computing on GPUs), because its floating-point performance greatly exceeds that of CPUs. For example, the NVIDIA Tesla C2050 has a theoretical peak performance of 1030 GFlops in single precision and 515 GFlops in double precision. On the other hand, the GPU is connected to the host CPU through the PCI-Express (PCIe) bus; with PCIe 2.0 x16 the bandwidth is 8 GB/s, far lower than the GPU's on-board memory bandwidth, so the Byte/Flop ratio available to GPU programs is low. Quadruple and octuple precision operations require many floating-point instructions per word of data, i.e., they have a low Byte/Flop requirement, and are therefore a good match for GPUs.

In this work we implement quadruple and octuple precision BLAS (Basic Linear Algebra Subprograms) routines on NVIDIA GPUs and evaluate their performance. The rest of this paper is organized as follows: Section 2 describes quadruple and octuple precision arithmetic, Section 3 describes the implementation of quadruple and octuple precision BLAS on GPUs, Section 4 presents the evaluation, and Section 5 concludes.

2. Quadruple and Octuple Precision Arithmetic

2.1 Related Work

Well-known software implementations of high precision arithmetic include the arbitrary precision libraries GMP 1), MPFR 2) and ARPREC 3), and the QD library 4) for quadruple and octuple precision. The QD library provides double-double (DD) arithmetic, which represents a quadruple precision value with two double precision values, and quad-double (QD) arithmetic, which represents an octuple precision value with four double precision values; for fixed quadruple and octuple precision, the DD and QD formats are generally faster than arbitrary precision arithmetic. As BLAS implementations, XBLAS 5) uses DD arithmetic internally to provide extra-precise routines, and MBLAS 6) provides multiple precision BLAS based on GMP, MPFR and QD; both target CPUs.

On GPUs, Göddeke et al. 7) used native, emulated and mixed precision (double-float) solvers in FEM simulations, and Thall 8) studied double-float and quad-float arithmetic for GPU computation. DD-type quadruple precision arithmetic has also been implemented on AMD GPUs and the GRAPE-DR 9). Zhao and Chu 10) implemented GPUMP, a multiple-precision integer library for GPUs, and Lu et al. 11) implemented GQD and GARPREC, GPU ports of the QD and ARPREC libraries.
Although several high precision arithmetic implementations exist for GPUs, quadruple and octuple precision BLAS routines on GPUs have not been studied in detail. In this work we implement quadruple and octuple precision BLAS routines on GPUs and evaluate them in comparison with CPU implementations.

2.2 DD-Type Quadruple and QD-Type Octuple Precision Arithmetic

We use DD-type quadruple precision and QD-type octuple precision arithmetic, following the QD library 4). A DD-type quadruple precision value a is represented by two double precision values a0 and a1 as the unevaluated sum a = a0 + a1, with |a0| > |a1|. A QD-type octuple precision value is represented by four double precision values as a = a0 + a1 + a2 + a3, with |a0| > |a1| > |a2| > |a3|. Since an IEEE 754-2008 binary64 value has a 52-bit fraction (53-bit significand), the DD format provides a 104-bit fraction, about 32 decimal digits, slightly less than the 112-bit fraction of IEEE 754-2008 binary128. The QD format provides a 208-bit fraction, about 64 decimal digits.

DD and QD arithmetic is built from error-free transformations on double precision values: for a = a0 + a1 and b = b0 + b1, the sum and product are computed from the rounded double precision result and its exactly computed rounding error. The QD octuple precision algorithms follow Hida et al. 12); the QD library provides both accurate and faster "sloppy" variants of the QD operations. A fused multiply-add (FMA) instruction, which computes a * b + c with a single rounding, can be used to compute the rounding error of a product efficiently. One DD operation requires on the order of 10 to 20 double precision floating-point instructions, and one QD operation on the order of 100 to 300, so DD quadruple precision arithmetic is roughly an order of magnitude more expensive than double precision, and QD octuple precision arithmetic roughly another order of magnitude more expensive again; FMA reduces the instruction counts of the DD and QD multiplications.

3. Implementation of Quadruple and Octuple Precision BLAS on GPUs

We implemented DD-type quadruple precision and QD-type octuple precision BLAS routines on NVIDIA GPUs using CUDA (Compute Unified Device Architecture), targeting GT200 and later GPUs.

3.1 BLAS Routines

We implemented three routines, one from each of the three BLAS levels: the Level 1 routine AXPY (y = αx + y), the Level 2 routine GEMV (y = αAx + βy), and the Level 3 routine GEMM (C = αAB + βC). The routines assume that the input vectors and matrices are already resident in GPU memory; the PCIe transfer time between CPU and GPU is considered separately. In AXPY and GEMV, elements are assigned to threads by thread ID, using thread blocks of 128 threads and up to 65535 blocks. GEMM is blocked using shared memory, with thread blocks of 8 x 8 = 64 or 16 x 16 = 256 threads depending on the precision. The Tesla C1060 has 16 KB of shared memory per multiprocessor; the Tesla C2050 has 64 KB, configurable as 16 KB of shared memory with a 48 KB L1 cache or 48 KB of shared memory with a 16 KB L1 cache. We use 16 KB of shared memory on both GPUs.
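The error-free addition underlying DD arithmetic can be sketched in plain C (a host-side sketch for illustration; the paper's kernels are written in CUDA, and the function names here are illustrative, not the paper's):

```c
#include <assert.h>

/* A DD (double-double) value is an unevaluated sum hi + lo with |hi| > |lo|. */
typedef struct { double hi, lo; } dd_t;

/* two_sum (Knuth): s + err == a + b exactly, where s = fl(a + b). */
static void two_sum(double a, double b, double *s, double *err) {
    double v;
    *s = a + b;
    v = *s - a;
    *err = (a - (*s - v)) + (b - v);
}

/* DD addition in the style of the QD library's faster variant: add the
   high parts exactly, fold both low parts into the error term, then
   renormalize with a quick two-sum. */
static dd_t dd_add(dd_t a, dd_t b) {
    dd_t c;
    double s, e;
    two_sum(a.hi, b.hi, &s, &e);
    e += a.lo + b.lo;
    c.hi = s + e;              /* quick two-sum renormalization */
    c.lo = e - (c.hi - s);
    return c;
}
```

For example, (1e16 + 1) - 1e16 evaluates to exactly 1 in DD arithmetic, even though the intermediate sum 1e16 + 1 is not representable in a single binary64 value.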
3.2 Quadruple and Octuple Precision Arithmetic in CUDA

We implemented the DD and QD operations in CUDA based on the QD library, extending our earlier quadruple precision BLAS on GPUs 13). Lu et al.'s GQD 11) also provides QD arithmetic in CUDA, but it uses the sloppy variants of the QD operations. On GT200 and later GPUs, double precision FMA instructions are available; however, the CUDA compiler may automatically contract a multiplication and an addition into a single FMA, which breaks the error-free transformations that DD and QD arithmetic depend on. We therefore use the intrinsics __fma_rn(), __dmul_rn() and __dadd_rn() to issue FMA instructions explicitly where they are safe and to prevent multiplications and additions that must not be fused from being contracted.

4. Evaluation

4.1 Evaluation Environment

We evaluated DD-type quadruple precision and QD-type octuple precision AXPY, GEMV and GEMM on an NVIDIA Tesla C2050 (Fermi architecture) and an NVIDIA Tesla C1060 (GT200 architecture). The Tesla C1060 has a double precision peak performance of 78 GFlops and 4 GB of GDDR3 memory with 102 GB/s of bandwidth; the Tesla C2050 has a double precision peak performance of 515 GFlops and 3 GB of GDDR5 memory with 144 GB/s of bandwidth, and supports ECC-protected memory.

We refer to our GPU implementations as CUDDBLAS (DD-type quadruple precision) and CUQDBLAS (QD-type octuple precision). For comparison we use GotoBLAS2 1.13 (double precision, CPU) and CUBLAS 3.1 (double precision, GPU). For quadruple and octuple precision on the CPU, we implemented reference routines, DDBLAS and QDBLAS, using the QD library 2.3.11 4) and parallelized them with OpenMP. The CPU is an Intel Core i7 920 (2.67 GHz, quad-core, with Hyper-Threading); GotoBLAS, DDBLAS and QDBLAS are run with 4 threads. The OS is CentOS 5.5 x86-64 (kernel 2.6.18-194.11.4.el5) with CUDA 3.1. CPU code was compiled with g++ 4.1.2 -O3 and GPU code with nvcc 3.1 -O3; DDBLAS, QDBLAS and QD 2.3.11 were compiled with Intel C++ (icpc) 11.1 -fast.

The performance of the quadruple and octuple precision routines is reported in DDFlops and QDFlops, counting one DD or QD operation as one flop. GPU results do not include the PCIe transfer time, which is examined in Section 4.6. Input values are random numbers generated with the QD library's dd_rand() and qd_rand(); for GEMV and GEMM we use α = 1.0 and β = 0.0.

4.2 AXPY

Figures 2-4 show the performance of AXPY. AXPY performs 2 flops per element against 3 memory accesses (loads of x and y and a store of y), a flop-to-word ratio of 2:3, and is therefore memory bound. At n = 8,192,000, double precision AXPY on the Tesla C2050 achieves 10.6 GFlops, only 2.1% of the peak. Quadruple precision AXPY achieves 5.0 GDDFlops, corresponding to roughly 75.5 GFlops of double precision operations (14.7% of peak), and octuple precision AXPY achieves 0.76 GQDFlops, corresponding to roughly 107.8 GFlops (20.9% of peak).
[Fig. 2-4: Performance of AXPY in double, quadruple and octuple precision on the Core i7 920 (GotoBLAS/DDBLAS/QDBLAS) and the Tesla C1060/C2050 (CUBLAS/CUDDBLAS/CUQDBLAS)]

Because quadruple and octuple precision AXPY perform many more operations per word than double precision AXPY, their lower Byte/Flop requirement is better matched to the GPU, and the GPU's advantage over the CPU grows with precision. On the Tesla C2050, quadruple precision AXPY is approximately 9.5 times and octuple precision AXPY approximately 19 times faster than on the Core i7 920. Moreover, quadruple precision AXPY takes about 2.5 times as long as double precision AXPY on the CPU and only about 2.1 times as long on the Tesla C2050, close to the factor of 2 expected from the data size of the DD format. Octuple precision AXPY takes about 34 times as long as double precision AXPY on the CPU, but only about 14 times as long on the Tesla C2050.

4.3 GEMV

Figures 5-7 show the performance of GEMV. At n = 8,192, quadruple precision GEMV on the Tesla C2050 is approximately 18 times and octuple precision GEMV approximately 19 times faster than on the CPU. On the CPU, quadruple precision GEMV takes about 6.4 times and octuple precision GEMV about 89 times as long as double precision GEMV, while on the Tesla C2050 the factors are about 3.1 and 41. GEMV performs 2n^2 flops against roughly n^2 + 2n memory accesses, a flop-to-word ratio of about 2:1, higher than the 2:3 ratio of AXPY, so double precision GEMV is less memory bound than AXPY: on the CPU double precision GEMV is about 2.7 times faster than AXPY, and on the Tesla C1060 and C2050 about 3.2 and 3.0 times faster. In quadruple and octuple precision the gap between GEMV and AXPY narrows on all platforms, since both routines shift toward being compute bound. This trend carries over to GEMM.
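The flop-to-word ratios quoted above can be checked with a little counting (double precision, counting only reads and writes of the vectors and matrices; the helper names are illustrative):

```c
#include <assert.h>

/* Flop and word counts per call for the three routines, dimension n. */
static long long axpy_flops(long long n) { return 2 * n; }         /* mul + add per element */
static long long axpy_words(long long n) { return 3 * n; }         /* load x, load y, store y */
static long long gemv_flops(long long n) { return 2 * n * n; }
static long long gemv_words(long long n) { return n * n + 3 * n; } /* A, x, y in/out */
static long long gemm_flops(long long n) { return 2 * n * n * n; }
static long long gemm_words(long long n) { return 3 * n * n; }     /* A, B, C */
```

For large n these give the 2:3 (AXPY) and roughly 2:1 (GEMV) flop-to-word ratios, while GEMM's ratio grows as O(n): this is why AXPY and GEMV are memory bound and GEMM is compute bound, and why higher precision, which multiplies the flops per word transferred, narrows the gap between them.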
[Fig. 5-7: Performance of GEMV in double, quadruple and octuple precision on the Core i7 920 (GotoBLAS/DDBLAS/QDBLAS) and the Tesla C1060/C2050 (CUBLAS/CUDDBLAS/CUQDBLAS)]

Table 2  Effect of FMA instructions on the Tesla C2050 (AXPY: n = 8,192,000; GEMV: n = 8,192; GEMM: n = 4,096)

  Routine            with FMA          without FMA
  Quadruple AXPY      5.03 GDDFlops     4.96 GDDFlops
  Octuple   AXPY      0.76 GQDFlops     0.70 GQDFlops
  Quadruple GEMV     10.18 GDDFlops     7.10 GDDFlops
  Octuple   GEMV      0.80 GQDFlops     0.62 GQDFlops
  Quadruple GEMM     14.19 GDDFlops    10.60 GDDFlops
  Octuple   GEMM      0.97 GQDFlops     0.74 GQDFlops

4.4 GEMM

Figures 8-10 show the performance of GEMM. At n = 4,096, quadruple precision GEMM on the Tesla C2050 is approximately 29 times and octuple precision GEMM approximately 24 times faster than on the CPU. Relative to double precision GEMM, quadruple precision GEMM takes about 84 times and octuple precision GEMM about 1160 times as long on the CPU, but only about 12 times and 179 times as long on the Tesla C2050. Double precision GEMM (CUBLAS) achieves 75.2 GFlops (96.4% of peak) on the Tesla C1060 and 173.8 GFlops (33.8% of peak) on the Tesla C2050. GEMM performs 2n^3 flops against O(n^2) data, so unlike AXPY and GEMV it is compute bound rather than limited by the Byte/Flop ratio. Quadruple precision GEMM achieves 2.6 GDDFlops on the Tesla C1060 and 14.2 GDDFlops on the Tesla C2050, corresponding to roughly 39 GFlops (50.1% of peak) and 212.8 GFlops (41.3% of peak) of double precision operations. Octuple precision GEMM achieves 0.18 GQDFlops on the Tesla C1060 and 0.97 GQDFlops on the Tesla C2050, corresponding to roughly 25.8 GFlops (33.1% of peak) and 137.4 GFlops. The FMA-based error-free transformations in the DD and QD multiplications contribute to these efficiencies.

4.5 Effect of FMA

Table 2 compares the performance of CUDDBLAS and CUQDBLAS with and without explicit FMA instructions.
[Fig. 8-10: Performance of GEMM in double, quadruple and octuple precision on the Core i7 920 (GotoBLAS/DDBLAS/QDBLAS) and the Tesla C1060/C2050 (CUBLAS/CUDDBLAS/CUQDBLAS)]

As Table 2 shows, with FMA instructions quadruple precision GEMM is about 1.4 times and octuple precision GEMM about 1.3 times faster than without them; quadruple and octuple precision GEMV improve by similar factors, up to about 1.5 times, while the memory-bound AXPY changes little.

4.6 PCIe Data Transfer Time

The results so far exclude the time to transfer data between CPU and GPU. The PCIe bandwidth (8 GB/s) is far lower than the GPU memory bandwidth (102 GB/s on the Tesla C1060, 144 GB/s on the Tesla C2050), so the transfer time can dominate the total execution time of a GPU BLAS call. Figures 11-13 show the fraction of the total time spent in computation and in PCIe data transfer. For AXPY on the Tesla C1060, PCIe transfer accounts for 93.1% of the total time in double precision, 87.2% in quadruple precision and 56.8% in octuple precision; on the Tesla C2050, whose computation is faster, the PCIe fraction is even larger, and GEMV behaves similarly. For GEMM, in contrast, PCIe transfer accounts for only 3.5% (Tesla C1060) and 7.6% (Tesla C2050) of the total time in double precision, and no more than 1.5% in quadruple and octuple precision. Thus, while double precision GEMM performance is partly limited by PCIe transfer, quadruple and octuple precision GEMM are almost unaffected by it: their low Byte/Flop requirement hides the transfer cost, which makes the GPU all the more attractive for high precision operations.

5. Conclusion

We implemented DD-type quadruple precision and QD-type octuple precision BLAS routines (AXPY, GEMV and GEMM) on GPUs and evaluated them. On an NVIDIA Tesla C2050, quadruple precision GEMM is approximately 29 times and octuple precision GEMM approximately 24 times faster than on an Intel Core i7 920.
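The observation that higher precision hides the PCIe cost can be sketched with a simple time model (the bandwidth, peak rate and per-operation costs below are illustrative assumptions for the sketch, not measured values from the paper):

```c
#include <assert.h>

/* Fraction of total time spent on PCIe transfer for an n x n GEMM, given a
   peak double precision rate (flops/s), a PCIe bandwidth (bytes/s), the size
   of one matrix element in bytes, and the cost of one operation in equivalent
   double precision flops. Illustrative model: compute time 2n^3 ops, transfer
   of the three matrices A, B and C. */
static double pcie_fraction(double n, double peak_flops, double pcie_bps,
                            double elem_bytes, double flops_per_op) {
    double t_comp = 2.0 * n * n * n * flops_per_op / peak_flops;
    double t_xfer = 3.0 * n * n * elem_bytes / pcie_bps;
    return t_xfer / (t_comp + t_xfer);
}
```

Moving from double to DD to QD roughly doubles the bytes per element each time while multiplying the flops per operation by an order of magnitude, so under any such assumptions the PCIe share shrinks sharply with precision, which matches the trend in Fig. 13.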
[Fig. 11: Breakdown of execution time into computation and PCIe data transfer for AXPY (n = 8,192,000)]
[Fig. 12: Breakdown of execution time into computation and PCIe data transfer for GEMV (n = 8,192)]
[Fig. 13: Breakdown of execution time into computation and PCIe data transfer for GEMM (n = 4,096)]

Relative to double precision GEMM, quadruple precision GEMM takes about 84 times and octuple precision GEMM about 1160 times as long on the CPU, but only about 12 times and 179 times as long on the Tesla C2050: the relative cost of higher precision is much smaller on the GPU. FMA instructions improve the performance of the DD and QD operations, and when the PCI-Express (PCIe) data transfer time is included, quadruple and octuple precision GEMM, unlike double precision GEMM, are almost unaffected by it. These results show that quadruple and octuple precision BLAS operations are well suited to GPUs. As future work, we plan to implement further quadruple and octuple precision BLAS routines; routines with a low flop-to-word ratio such as AXPY, DOT and SpMV benefit especially, since on the Tesla C2050 quadruple precision AXPY takes only about 2.1 times as long as double precision AXPY.

References

1) Granlund, T.: GMP: GNU Multiple Precision Arithmetic Library, http://gmplib.org/.
2) Fousse, L., Hanrot, G., Lefevre, V., Pelissier, P. and Zimmermann, P.: MPFR: GNU MPFR Library, http://www.mpfr.org/.
3) Bailey, D. H.: ARPREC (C++/Fortran-90 arbitrary precision package), http://crd.lbl.gov/~dhbailey/mpdist/.
4) Bailey, D. H.: QD (C++/Fortran-90 double-double and quad-double package), http://crd.lbl.gov/~dhbailey/mpdist/.
5) Li, X. S., Demmel, J. W., Bailey, D. H., Hida, Y., Iskandar, J., Kapur, A., Martin, M. C., Thompson, B., Tung, T. and Yoo, D. J.:
XBLAS: Extra Precise Basic Linear Algebra Subroutines.
6) The MPACK: Multiple precision arithmetic BLAS (MBLAS) and LAPACK (MLAPACK), http://mplapack.sourceforge.net/.
7) Göddeke, D., Strzodka, R. and Turek, S.: Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, International Journal of Parallel, Emergent and Distributed Systems, Vol. 22 (2007).
8) Thall, A.: Extended-Precision Floating-Point Numbers for GPU Computation, ACM SIGGRAPH 2006 Research Posters (2006).
9) IPSJ SIG Technical Report, Vol. 2009-HPC-121, No. 39 (2009) (in Japanese).
10) Zhao, K. and Chu, X.: GPUMP: a Multiple-Precision Integer Library for GPUs, Proc. IEEE International Conference on Computer and Information Technology (CIT 2010) (2010).
11) Lu, M., He, B. and Luo, Q.: Supporting Extended Precision on Graphics Processors, Proc. Sixth International Workshop on Data Management on New Hardware (DaMoN 2010) (2010).
12) Hida, Y., Li, X. S. and Bailey, D. H.: Algorithms for Quad-Double Precision Floating Point Arithmetic, Proc. 15th Symposium on Computer Arithmetic, pp. 155-162 (2001).
13) Implementation of quadruple precision BLAS on GPUs, IPSJ SIG Technical Report, Vol. 2009-HPC-122, No. 13 (2009) (in Japanese).