QD library! Features! Easy to use high precision! Easy to understand the structure of the arithmetic! Two types of high-precision arithmetic! Double-Double precision (pseudo quadruple precision)! Quad-Double precision (pseudo octuple precision) * High-Precision Software Directory, http://crd-legacy.lbl.gov/dhbailey/mpdist/
GRAPE-MP4, MP6, MP8! Extending the arithmetic format (software emulation)
- double: 52-bit mantissa, 11-bit exponent
- MP:  116-bit mantissa, 15-bit exponent
- MP4: 112-bit mantissa (DD)
- MP6: 176-bit mantissa (TD)
- MP8: 240-bit mantissa (QD)
TD operations (ADD, SUB, MUL and DIV) are created. Each returns c, e.g. (QD to TD example):

    double c[4] = {s0, s1, s2, s3};
    return c;

TD ADD and TD MUL take 27 and 78 double-precision operations respectively; it is clear that these counts are less than those of QD ADD and QD MUL. QD ADD and QD MUL do not perform Renormalize; also, a branch-free Renormalize is adopted.

Figure 4.3: Algorithm design for QD ADD and TD ADD. For QD ADD, a[0] to a[3] and b[0] to b[3] each represent a QD-precision value ([0] is the highest bits, [3] the lowest); for TD ADD, a[0] to a[2] and b[0] to b[2] each represent a TD-precision value ([2] is the lowest bits). The boxes marked "+" denote addition algorithms.
OpenCL! Framework for parallel programming! Programs run on many platforms and devices (multi-core CPUs, GPUs, DSPs, FPGAs, etc.)! Target devices of this work! Multi-core CPUs! GPUs! Many Integrated Core (MIC)
Matrix multiplication (2012)
K. Nakamura, Graduation Thesis, University of Aizu, March 2012

This section shows a performance evaluation of matrix multiplication with OpenCL. Figure 2 presents the test configurations used in this work, and the specs of the CPU and GPU used are shown in Table 1. The CPUs do not support FMA, so they are used without FMA. Non-parallelized calculations on the Intel and AMD CPUs are also tried to compare with the OpenCL results. As one of the non-parallelized calculations, the mpack library is used; mpack provides many multiple-precision vector and matrix operations.

Figure 2: Test configurations

Table 1: Spec and configuration
                   CPU (by Intel)          GPU (by AMD)
device name        Core i7-2600K           Radeon HD7970
peak performance   108.8 Gflops (AVX)      947 Gflops (FMA)
OpenCL SDK ver.    OpenCL 1.1, LINUX       OpenCL 1.1, AMD-APP 831.4

Figure 3: Result of CPU (non-parallel), from left to right in each dimension (N): No.1, No.2 and No.3
Figure 4: Result of CPU (OpenCL), from left to right in each dimension (N): No.4, No.5 and No.6
LU factorization (2012)
K. Nakamura, Graduation Thesis, University of Aizu, March 2012
Figure 6: Result of LU factorization, from left to right in each dimension (N): mpack (non-blocking), CPU (OpenCL) and GPU (OpenCL)
-GEMM performance on GPUs! HD7970 (Tahiti) produces the highest performance (60 Gflop/s)
-GEMM performance on CPUs! Xeon Phi produces stable and high performance (11 Gflop/s)
QR decomposition! Major routine of linear algebra! Decomposes matrix A into matrix Q and matrix R! A: m-by-n matrix (m >= n)! Q: m-by-m orthogonal matrix! R: m-by-n upper triangular matrix! QR decomposition by the blocked Householder method is implemented
Performance tests of -QR decomposition! Test environments! Comparison of the following:! Without OpenCL (serial execution)! -GEMM on GPUs with OpenCL! -GEMM on CPUs with OpenCL
Algorithm 9: Blocked Householder QR
Require: A ∈ C^{m×n}, Q^T Q = I
1:  Q ← I
2:  for k = 1 to n/r do
3:    s = (k−1)·r + 1
4:    for j = 1 to r do
5:      u = s + j − 1
6:      [v, β] = house(A[u:m, u])
7:      A[u:m, u:s+r−1] = A[u:m, u:s+r−1] − βvv^T A[u:m, u:s+r−1]
8:      V[:, j] = [zeros(j−1, 1); v]
9:      B(j) = β
10:   end for
11:   Y = V[1:end, 1]
12:   W = −B(1)·V[1:end, 1]
13:   for j = 2 to r do
14:     v = V[:, j]
15:     z = −B(j)·v − B(j)·W·Y^T·v
16:     W = [W z]
17:     Y = [Y v]
18:   end for
19:   A[s:m, s+r:n] = A[s:m, s+r:n] + Y·W^T·A[s:m, s+r:n]
20:   Q[1:m, s:m] = Q[1:m, s:m] + Q[1:m, s:m]·W·Y^T
21: end for
Serial vs. OpenCL (GPU)! Computation time (seconds): Serial (CPU) vs. OpenCL (GPU)! For N=3072, using the GPU gives a 20x speedup
Accuracy of the decomposition results (N=1024, r=64)
Feynman loop integral (Yuasa et al. 2007)

I = ∫₀¹ dx ∫₀^{1−x} dy ∫₀^{1−x−y} dz · 1/D²

D = −xys − tz(1−x−y−z) + (x+y)λ² + (1−x−y−z)(1−x−y)mₑ² + z(1−x−y)m_f²

Double-precision arithmetic is numerically unstable here; quadruple-precision arithmetic is required.
Table 4.13: Numerical results with HD6970 (λ = 10 )
N                  256             1024
Double             1.10854011e-7   1.11434660070024864650109150138873600e-7
Double-Double      1.38322168e-7   1.38323589455119100876021157137126812e-7
Triple-Double      1.38322167e-7   1.38323589172160096884080782115035156e-7
Quad-Double        1.38322167e-7   1.38323589172160096884080782115035163e-7
Analytical Answer  1.38323589e-7   1.38323589227981762289646298761828386e-7

Table 4.14: Numerical results with HD6970 (λ = 10 )
N                  256             1024
Double             1.1623e-7       1.17272710173910304193985404e-7
Double-Double      2.1067e-7       2.11964036570355145245187164e-7
Triple-Double      2.4714e-7       2.47248635217083570183809854e-7
Quad-Double        2.4714e-7       2.47248635217083570183809888e-7
Analytical Answer  2.4724e-7       2.47248635259865968819535221e-7
        1 core    4 core    GPU
D       6.8       0.355     0.191
DD      349       25.8      0.859
TD      2335(?)   80        20.1
QD      2921      240       58.1
Kernel Generator: LSUMP (for AMD GPU, DR, GRAPE-MP)
Input: 10 lines → generated OpenCL kernel: 80 lines. Excerpt:

    for(id3=0; id3<n2; id3++){
      x301[0] = g_x301[id3*2+0];
      x301[1] = g_x301[id3*2+1];
      gw30[0] = g_gw30[id3*2+0];
      gw30[1] = g_gw30[id3*2+1];
      TwoProd(x301, cnt4, zz);
      TwoProd(mone, xx, t[0]);
      TwoProd(t[0], yy, t[1]);
      TwoProd(t[1], s, t[2]);
      TwoProd(tt, zz, t[3]);
      TwoSub(one, xx, t[4]);
      TwoSub(t[4], yy, t[5]);
      TwoSub(t[5], zz, t[6]);
      TwoProd(t[3], t[6], t[7]);
      TwoSub(t[2], t[7], t[8]);
      TwoSum(xx, yy, t[9]);
      TwoProd(t[9], ramda, t[10]);
Appendix. MPX: performance comparison (Nakasato et al. 2012, Daisaka et al. 2011)

                 MP               MP4           MP6           MP8
format           116 bit          112 bit       176 bit       240 bit
PEs              6                16            14            10
utilization      78%              61%           81%           85%
clock            100 MHz          125 MHz       95 MHz        70 MHz
peak             1.2 Gflops       4 Gflops      2.66 Gflops   1.4 Gflops
sustained        0.49 Gflops      1.252 Gflops  0.917 Gflops  0.493 Gflops
power            12.6 W           11.5 W        12.3 W
process          90nm (only PEs)  40nm (PEs & PCIe logic)