2011 Symposium on High Performance Computing and Computational Science (High Performance Computing Symposium 2011, HPCS2011), 2011/1/18


Implementation and Evaluation of Quadruple and Octuple Precision BLAS on GPUs

Daichi Mukunoki and Daisuke Takahashi

We implemented quadruple and octuple precision Basic Linear Algebra Subprograms (BLAS) functions on graphics processing units (GPUs) and evaluated their performance. We used DD-type (double-double) quadruple precision arithmetic, which combines two double precision values to represent one quadruple precision value, and QD-type (quad-double) octuple precision arithmetic, which combines four double precision values to represent one octuple precision value. On an NVIDIA Tesla C2050, quadruple precision AXPY is approximately 9.5 times faster, and octuple precision AXPY is approximately 19 times faster, than on an Intel Core i7 920. Quadruple precision GEMM is approximately 29 times faster, and octuple precision GEMM is approximately 24 times faster, than on the CPU. Moreover, on the GPU, quadruple precision AXPY takes only approximately 2.1 times as long as double precision AXPY. For quadruple and octuple precision GEMV and GEMM as well, the increase in execution time relative to double precision is smaller on the GPU than on the CPU. When the PCI-Express (PCIe) data transfer time is taken into account, the performance of double precision GEMM is limited by the transfer time, whereas that of quadruple and octuple precision GEMM is hardly limited by it. These results show that quadruple and octuple precision BLAS operations are well suited to GPUs.

1. Introduction
Graduate School of Systems and Information Engineering, University of Tsukuba

© 2011 Information Processing Society of Japan

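The use of GPUs (Graphics Processing Units) for general-purpose computation is known as GPGPU (General Purpose computing on GPU). The NVIDIA Tesla C2050 has a peak performance of 1030 GFlops in single precision and 515 GFlops in double precision. The CPU and the GPU are connected by PCI-Express (PCIe), and PCIe 2.0 x16 provides a bandwidth of 8 GB/s. Quadruple and octuple precision BLAS (Basic Linear Algebra Subprograms) operations have a lower Byte/Flop ratio than double precision operations, which makes them candidates for acceleration on GPUs.

Multiple precision arithmetic libraries include GMP 1), MPFR 2), and ARPREC 3). The QD library 4) provides double-double (DD) arithmetic, which represents a quadruple precision value with two double precision values, and quad-double (QD) arithmetic, which represents an octuple precision value with four double precision values. Among BLAS implementations, XBLAS 5) uses DD arithmetic, and MBLAS 6) is a BLAS built on GMP, MPFR, and QD.

On GPUs, Göddeke et al. 7) used double-float arithmetic in FEM computations, and Thall 8) described double-float and quad-float arithmetic. DD-type quadruple precision arithmetic has also been implemented on AMD GPUs and on GRAPE-DR 9). Zhao et al. 10) developed GPUMP, a multiple precision integer library for GPUs, and Lu et al. 11) developed GQD and GARPREC, GPU versions of QD and ARPREC.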

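In this paper, we implement BLAS functions in DD-type quadruple precision and QD-type octuple precision on GPUs and evaluate their performance.

DD-type quadruple precision represents a value $a$ with two double precision values $a_0$ and $a_1$:

$a = a_0 + a_1, \quad |a_0| > |a_1|$

QD-type octuple precision represents a value $a$ with four double precision values:

$a = a_0 + a_1 + a_2 + a_3, \quad |a_0| > |a_1| > |a_2| > |a_3|$

The DD format differs from the IEEE binary128 quadruple precision format. DD and QD arithmetic follow the algorithms of Hida et al. 12), which are implemented in the QD library; the library also provides faster "sloppy" variants of some operations. The multiplication algorithms can use a fused multiply-add (FMA) instruction, which computes $a \times b + c$ with a single rounding.

Table 1: Number of double precision floating-point operations (Flop) in DD and QD addition and multiplication, with and without FMA.

We implemented DD-type quadruple and QD-type octuple precision BLAS for NVIDIA GPUs using CUDA (Compute Unified Device Architecture), NVIDIA's GPGPU development environment, targeting GPUs of the GT200 generation and later.

3.1 BLAS

BLAS functions are classified into Levels 1 to 3. We implemented one function from each level: AXPY ($y = \alpha x + y$) from Level 1 BLAS, GEMV ($y = \alpha A x + \beta y$) from Level 2 BLAS, and GEMM ($C = \alpha A B + \beta C$) from Level 3 BLAS. The functions are called from the CPU, and data are transferred between the CPU and the GPU over PCIe.

The Tesla C1060 has 16 KB of shared memory per multiprocessor. The Tesla C2050 has 64 KB of on-chip memory per multiprocessor, which can be configured as 16 KB of L1 cache and 48 KB of shared memory, or as 48 KB of L1 cache and 16 KB of shared memory.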
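The DD representation and the AXPY operation can be sketched in a few lines. This is a minimal Python illustration, not the authors' CUDA code. The function names are hypothetical, and the error-free product uses Dekker's splitting, where a GPU kernel would typically use an FMA instead (for example, computing the product error as `__fma_rn(a, b, -p)`).

```python
from fractions import Fraction


def two_sum(a, b):
    """Error-free addition: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e


def quick_two_sum(a, b):
    """Error-free addition that assumes |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def split(a):
    """Veltkamp split of a double into two halves with a = hi + lo."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Error-free product without FMA: a * b = p + e exactly."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e


def dd_add(a, b):
    """DD addition of a = (a0, a1) and b = (b0, b1); 20 double operations."""
    s1, s2 = two_sum(a[0], b[0])
    t1, t2 = two_sum(a[1], b[1])
    s2 += t1
    s1, s2 = quick_two_sum(s1, s2)
    s2 += t2
    return quick_two_sum(s1, s2)


def dd_mul(a, b):
    """DD multiplication; the a1 * b1 term is below DD precision and dropped."""
    p, e = two_prod(a[0], b[0])
    e += a[0] * b[1] + a[1] * b[0]
    return quick_two_sum(p, e)


def dd_axpy(alpha, x, y):
    """y <- alpha * x + y, where every operand is a DD pair (hi, lo)."""
    return [dd_add(dd_mul(alpha, xi), yi) for xi, yi in zip(x, y)]


def dd_to_fraction(a):
    """Exact rational value of a DD pair, used only for checking."""
    return Fraction(a[0]) + Fraction(a[1])


if __name__ == "__main__":
    alpha = (3.0, 0.0)
    x = [(0.1, 0.0), (1.0, 1e-20)]
    y = [(0.2, 0.0), (1.0, 0.0)]
    for hi, lo in dd_axpy(alpha, x, y):
        print(hi, lo)
```

Each DD element costs tens of double precision operations while only doubling the memory traffic, which is why DD kernels are far more compute-heavy per byte than their double precision counterparts.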

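The DD and QD arithmetic routines of the QD library were ported to CUDA. GQD by Lu et al. is an existing CUDA port of QD, and DD arithmetic on CUDA was reported in 13). In CUDA, FMA can be invoked explicitly with the intrinsic __fma_rn, and the intrinsics __dmul_rn and __dadd_rn perform a multiplication and an addition that the compiler does not fuse into an FMA.

We evaluated DD-type quadruple and QD-type octuple precision AXPY, GEMV, and GEMM on two GPUs: the NVIDIA Tesla C2050 (Fermi architecture) and the NVIDIA Tesla C1060 (GT200 architecture). The peak double precision performance is 78 GFlops on the Tesla C1060 and 515 GFlops on the Tesla C2050. The Tesla C1060 has 4 GB of GDDR3 memory with a bandwidth of 102 GB/s, and the Tesla C2050 has 3 GB of GDDR5 memory with a bandwidth of 144 GB/s. The Tesla C2050 supports ECC.

Our GPU implementations are called CUDDBLAS (DD-type quadruple precision) and CUQDBLAS (QD-type octuple precision). For double precision, we used GotoBLAS on the CPU and CUBLAS 3.1 on the GPU. For quadruple and octuple precision on the CPU, we used DDBLAS and QDBLAS, DD and QD BLAS implementations based on MBLAS and the QD library and parallelized with OpenMP.

The CPU is a quad-core Intel Core i7 920 with Hyper-Threading. The OS is CentOS 5.5 (x86-64), and the CUDA version is 3.1. CPU code was compiled with the GNU compiler at -O3 and GPU code with nvcc 3.1 at -O3; DDBLAS, QDBLAS, and the QD library were compiled with Intel C++ (icpc) 11.1 with -fast.

Performance is reported in Flops for double precision, and in DDFlops and QDFlops (the number of DD or QD operations per second) for quadruple and octuple precision. The GPU measurements exclude the PCIe data transfer time between the CPU and the GPU. Input data were generated with the QD library functions dd_rand and qd_rand, with α = 1.0 and β = 0.0.

4.2 AXPY

Figs. 2 to 4 show the performance of double, quadruple, and octuple precision AXPY. AXPY performs 2N operations on 3N memory accesses, a ratio of 2:3, so it has a high Byte/Flop ratio. At N = 8,192,000, double precision AXPY on the Tesla C2050 reaches 10.6 GFlops, which is 2.1% of the peak performance. Quadruple precision AXPY on the Tesla C2050 reaches 5.0 GDDFlops. One DD multiply-add with FMA is 2 DDFlop, or 30 Flop, so 5.0 GDDFlops corresponds to 75.5 GFlops, which is 14.7% of the peak performance. Octuple precision AXPY on the Tesla C2050 corresponds to 107.8 GFlops.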
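The effect of precision on the Byte/Flop ratio of AXPY can be estimated with simple arithmetic. The sketch below is illustrative only and its function names are hypothetical. It assumes three memory accesses per element (read x, read y, write y), 2 Flop per double precision multiply-add, and 30 Flop per DD multiply-add, the figure used in the GDDFlops to GFlops conversion. The hardware numbers are the Tesla C2050 figures quoted in the text.

```python
def axpy_bytes_per_flop(bytes_per_value, flop_per_element):
    """Memory traffic per double precision operation for y = alpha * x + y.

    Each element reads x_i and y_i and writes y_i: three memory accesses.
    """
    return 3 * bytes_per_value / flop_per_element


def machine_balance(bandwidth_gb_per_s, peak_gflops):
    """Bytes the memory system can deliver per peak floating-point operation."""
    return bandwidth_gb_per_s / peak_gflops


if __name__ == "__main__":
    double_ratio = axpy_bytes_per_flop(8, 2)   # one double is 8 bytes
    dd_ratio = axpy_bytes_per_flop(16, 30)     # one DD value is two doubles
    c2050 = machine_balance(144.0, 515.0)      # 144 GB/s, 515 GFlops peak
    print("double AXPY Byte/Flop:", double_ratio)
    print("DD AXPY Byte/Flop:", dd_ratio)
    print("Tesla C2050 Byte/Flop:", round(c2050, 2))
```

Under these assumptions, DD AXPY needs far fewer bytes per operation than double precision AXPY, so it sits much closer to what the GPU memory system can supply.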

Fig. 2: Performance of double precision AXPY (DAXPY), in GFlops. Series: GotoBLAS (Core i7 920), CUBLAS (Tesla C1060), CUBLAS (Tesla C2050).

Fig. 3: Performance of quadruple precision AXPY (DDAXPY), in GDDFlops. Series: DDBLAS (Core i7 920), CUDDBLAS (Tesla C1060), CUDDBLAS (Tesla C2050).

Fig. 4: Performance of octuple precision AXPY (QDAXPY), in GQDFlops. Series: QDBLAS (Core i7 920), CUQDBLAS (Tesla C1060), CUQDBLAS (Tesla C2050).

Octuple precision AXPY on the Tesla C2050 reaches 20.9% of the peak performance. Because quadruple and octuple precision operations have a lower Byte/Flop ratio than double precision operations, they make better use of the arithmetic performance of the GPU. On the Tesla C2050, quadruple precision AXPY takes about 2.1 times as long as double precision AXPY.

4.3 GEMV

Figs. 5 to 7 show the performance of double, quadruple, and octuple precision GEMV at N = 8,192. GEMV has a lower Byte/Flop ratio than AXPY, whose ratio of operations to memory accesses is 2:3.

Fig. 5: Performance of double precision GEMV (DGEMV), in GFlops. Series: GotoBLAS (Core i7 920), CUBLAS (Tesla C1060), CUBLAS (Tesla C2050).

Fig. 6: Performance of quadruple precision GEMV (DDGEMV), in GDDFlops. Series: DDBLAS (Core i7 920), CUDDBLAS (Tesla C1060), CUDDBLAS (Tesla C2050).

Fig. 7: Performance of octuple precision GEMV (QDGEMV), in GQDFlops. Series: QDBLAS (Core i7 920), CUQDBLAS (Tesla C1060), CUQDBLAS (Tesla C2050).

Table 2: Performance with and without FMA for quadruple and octuple precision AXPY (N = 8,192,000), GEMV (N = 8,192), and GEMM (N = 4,096), in GDDFlops and GQDFlops.

4.4 GEMM

Figs. 8 to 10 show the performance of double, quadruple, and octuple precision GEMM at N = 4,096. GEMM has a lower Byte/Flop ratio than AXPY and GEMV. In quadruple precision, the Tesla C1060 reaches 2.6 GDDFlops, which corresponds to 39 GFlops or 50.1% of the peak performance, and the Tesla C2050 result corresponds to 212.8 GFlops or 41.3% of the peak performance. In octuple precision, the Tesla C1060 reaches 0.18 GQDFlops and the Tesla C2050 reaches 0.97 GQDFlops, which correspond to 25.8 GFlops and 137.4 GFlops.

4.5 FMA

CUDDBLAS and CUQDBLAS use FMA in DD and QD multiplication. Table 2 compares the performance with and without FMA.

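Fig. 8: Performance of double precision GEMM (DGEMM), in GFlops. Series: GotoBLAS (Core i7 920), CUBLAS (Tesla C1060), CUBLAS (Tesla C2050).

Fig. 9: Performance of quadruple precision GEMM (DDGEMM), in GDDFlops. Series: DDBLAS (Core i7 920), CUDDBLAS (Tesla C1060), CUDDBLAS (Tesla C2050).

Fig. 10: Performance of octuple precision GEMM (QDGEMM), in GQDFlops. Series: QDBLAS (Core i7 920), CUQDBLAS (Tesla C1060), CUQDBLAS (Tesla C2050).

As Table 2 shows, quadruple precision GEMM with FMA is about 1.3 times faster than without FMA.

4.6 PCIe

When a GPU BLAS function is called from a CPU program, the operands must be transferred between the CPU and the GPU over PCIe. The PCIe bandwidth of 8 GB/s is far lower than the GPU memory bandwidth (102 GB/s on the Tesla C1060 and 144 GB/s on the Tesla C2050). Figs. 11 to 13 show the proportions of computation time and PCIe data transfer time in the total time for AXPY, GEMV, and GEMM. The performance of double precision GEMM is limited by the PCIe data transfer time, whereas that of quadruple and octuple precision GEMM is hardly limited by it, because quadruple and octuple precision operations have a lower Byte/Flop ratio.

5. Conclusion

We implemented DD-type quadruple precision and QD-type octuple precision BLAS functions on GPUs and evaluated their performance. On an NVIDIA Tesla C2050, compared with an Intel Core i7 920, quadruple precision GEMM is approximately 29 times faster and octuple precision GEMM is approximately 24 times faster.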

Fig. 11: Proportions of computation time and PCIe data transfer time for AXPY (N = 8,192,000) in double, quadruple, and octuple precision on the Tesla C1060 and Tesla C2050, in % of total time.

Fig. 12: Proportions of computation time and PCIe data transfer time for GEMV (N = 8,192) in double, quadruple, and octuple precision on the Tesla C1060 and Tesla C2050, in % of total time.

Fig. 13: Proportions of computation time and PCIe data transfer time for GEMM (N = 4,096) in double, quadruple, and octuple precision on the Tesla C1060 and Tesla C2050, in % of total time.

DD and QD arithmetic benefit from FMA on the GPU. When the PCI-Express (PCIe) data transfer time is taken into account, quadruple and octuple precision GEMM are hardly limited by PCIe. Quadruple and octuple precision BLAS operations are therefore well suited to GPUs. This also holds for operations with a high Byte/Flop ratio, such as AXPY, DOT, and SpMV: on the Tesla C2050, quadruple precision AXPY takes only about 2.1 times as long as double precision AXPY.

References

1) Granlund, T.: GMP: GNU Multiple Precision Arithmetic Library.
2) Fousse, L., Hanrot, G., Lefevre, V., Pelissier, P. and Zimmermann, P.: MPFR: GNU MPFR Library.
3) Bailey, D. H.: ARPREC (C++/Fortran-90 arbitrary precision package), dhbailey/mpdist/.
4) Bailey, D. H.: QD (C++/Fortran-90 double-double and quad-double package), dhbailey/mpdist/.
5) Li, X. S., Demmel, J. W., Bailey, D. H., Hida, Y., Iskandar, J., Kapur, A., Martin, M. C., Thompson, B., Tung, T. and Yoo, D. J.: XBLAS: Extra Precise Basic Linear Algebra Subroutines.
6) The MPACK; Multiple precision arithmetic BLAS (MBLAS) and LAPACK (MLAPACK).
7) Göddeke, D., Strzodka, R. and Turek, S.: Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations, International Journal of Parallel, Emergent and Distributed Systems, Vol. 22 (2007).
8) Thall, A.: Extended-Precision Floating-Point Numbers for GPU Computation, ACM SIGGRAPH 2006 Research Posters (2006).
9) Vol. 2009-HPC-121, No. 39 (2009).
10) Zhao, K. and Chu, X.: GPUMP: a Multiple-Precision Integer Library for GPUs, Proc. IEEE International Conference on Computer and Information Technology (CIT 2010) (2010).
11) Lu, M., He, B. and Luo, Q.: Supporting Extended Precision on Graphics Processors, Proc. Sixth International Workshop on Data Management on New Hardware (DaMoN 2010) (2010).
12) Hida, Y., Li, X. S. and Bailey, D. H.: Algorithms for Quad-Double Precision Floating Point Arithmetic, Proc. 15th Symposium on Computer Arithmetic (2001).
13) [Japanese-language report on quadruple precision BLAS on GPUs], Vol. 2009-HPC-122, No. 13 (2009).
