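Eigenvalue decomposition
  Symmetric case: $A = Q \Lambda Q^T$ ($A$: $n \times n$ matrix, $\Lambda$: diagonal matrix of eigenvalues, $Q$: orthogonal matrix of eigenvectors)
  Nonsymmetric case: $A = X \Lambda X^{-1}$ ($A$: $n \times n$ matrix, $\Lambda$: diagonal matrix of eigenvalues, $X$: nonsingular matrix of eigenvectors)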
GPGPU
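Standard algorithm for the symmetric eigenproblem
  1. Tridiagonalize $A$ into $T$: $Q^T A Q = T$ ($Q$: orthogonal matrix)
  2. Solve the tridiagonal problem $T u_i = \lambda_i u_i$ for the eigenvalues $\{\lambda_i\}$ and eigenvectors $\{u_i\}$ of $T$ (QR method, MR$^3$ algorithm, ...)
  3. Back-transform $v_i = Q u_i$ to obtain the eigenvectors $\{v_i\}$ of $A$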
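The three-step flow (tridiagonalize, solve the tridiagonal problem, back-transform) can be sketched in NumPy as follows. This is a minimal, unblocked illustration rather than the implementation behind the slides: the function names are hypothetical, and `np.linalg.eigh` is used only as a stand-in for the QR or MR$^3$ tridiagonal solvers.

```python
import numpy as np

def householder_tridiagonalize(A):
    """Return (T, Q) with Q.T @ A @ Q = T (tridiagonal) for a symmetric A."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    Q = np.eye(n)
    for c in range(n - 2):
        x = T[c + 1:, c].copy()
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue  # column is already in tridiagonal form
        beta = -np.copysign(norm_x, x[0])
        v = x
        v[0] -= beta                 # v = x - beta * e1
        tau = 2.0 / (v @ v)          # H = I - tau * v v^T maps x to beta * e1
        # Two-sided update T := H T H, restricted to the rows/columns H touches
        T[c + 1:, :] -= tau * np.outer(v, v @ T[c + 1:, :])
        T[:, c + 1:] -= tau * np.outer(T[:, c + 1:] @ v, v)
        # Accumulate Q := Q H so that Q = H_1 H_2 ... H_{n-2}
        Q[:, c + 1:] -= tau * np.outer(Q[:, c + 1:] @ v, v)
    return T, Q

def symmetric_eig(A):
    T, Q = householder_tridiagonalize(A)  # step 1: Q^T A Q = T
    lam, U = np.linalg.eigh(T)            # step 2: T u_i = lambda_i u_i (stand-in solver)
    V = Q @ U                             # step 3: v_i = Q u_i
    return lam, V
```

Calling `symmetric_eig(A)` on a symmetric array returns the eigenvalues in ascending order and the eigenvectors of `A` as columns of `V`.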
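Cost of each step (timings for $n = 9000$ on Quad Core Xeon x 2, using LAPACK)
  Tridiagonalization of $A$: $(4/3) n^3$ operations
  Eigenvalues and eigenvectors of $T$: $O(n^2)$ to $O(n^3)$ operations
  Back-transformation to the eigenvectors $\{v_i\}$ of $A$: $2 m n^2$ operations ($m$: number of eigenvectors)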
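Householder tridiagonalization: step $k$
  $A^{(k)} := (I - \alpha w w^T) A^{(k)} = A^{(k)} - \alpha w (w^T A^{(k)})$
  The work consists of a matrix-vector product ($w^T A^{(k)}$) and a rank-1 update; both are level-2 BLAS operations.
  [Figure: zero pattern of $A^{(k)}$ at step $k$]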
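To make the operation count concrete, the one-sided update $(I - \alpha w w^T) A = A - \alpha w (w^T A)$ can be written as one matrix-vector product followed by one rank-1 update. The sketch below is illustrative only; the function name is hypothetical, and the comments note which BLAS level each line corresponds to.

```python
import numpy as np

def apply_reflector_left(A, w, alpha):
    """Compute (I - alpha * w w^T) @ A without forming the reflector matrix."""
    z = w @ A                          # level-2 BLAS: matrix-vector product w^T A
    return A - alpha * np.outer(w, z)  # level-2 BLAS: rank-1 update
```

Forming $I - \alpha w w^T$ explicitly and multiplying would cost $O(n^3)$ per step; the form above costs $O(n^2)$, but every element of $A$ is read for only a constant number of floating-point operations.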
[Figure: memory hierarchy. Surviving labels: 8 to 128; levels 1, 2; 100K to M; 100M to G; Byte/Flop: 1, < 1]
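BLAS (Basic Linear Algebra Subprograms)
  Level-1 BLAS: vector-vector operations
    inner product: $c := x^T y$
    AXPY: $y := a x + y$
  Level-2 BLAS: matrix-vector operations
    matrix-vector product: $y := A x$
    rank-1 update: $A := A + x y^T$
  Level-3 BLAS: matrix-matrix operations
    matrix product: $C := A B$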
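Characteristics of each BLAS level ($N$: problem size, $p$: number of processors)
  Level         Operations   Data       Work per processor
  Level-1 BLAS  O(N)         O(N)       O(N/p)
  Level-2 BLAS  O(N^2)       O(N^2)     O(N^2/p)
  Level-3 BLAS  O(N^3)       O(N^2)     O(N^3/p)
  For Level-3 BLAS the ratio of operations to data is O(N), so the required Byte/Flop can be made small. Algorithms should therefore be organized around level-3 BLAS.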
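The difference between the levels can be checked with a small back-of-the-envelope calculation. The sketch below assumes 8-byte double-precision words and an idealized model in which every operand travels to or from memory exactly once; the function name and the representative routines (inner product, $y := Ax$, $C := AB$) are choices made for this illustration.

```python
def bytes_per_flop(level, n, word_bytes=8):
    """Idealized memory traffic per floating-point operation for one BLAS call."""
    if level == 1:      # inner product of two length-n vectors
        flops, words = 2 * n, 2 * n
    elif level == 2:    # y := A x with an n x n matrix: read A and x, write y
        flops, words = 2 * n * n, n * n + 2 * n
    elif level == 3:    # C := A B with n x n matrices: read A and B, write C
        flops, words = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError("level must be 1, 2 or 3")
    return word_bytes * words / flops
```

Under these assumptions, for `n = 1000` the model gives 8 Byte/Flop for level 1, about 4 for level 2, and about 0.012 for level 3, which is why only level-3 routines can run near peak speed when Byte/Flop of the machine is below 1.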
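Tridiagonalization based on Level-3 BLAS (Bischof et al., 93)
  Two-stage reduction: $A$ -> band matrix $C$ (bandwidth $L$) -> tridiagonal matrix $T$
  $A$ -> $C$: $(4/3) n^3$ operations, level-3 BLAS
  $C$ -> $T$: $6 n^2 L$ operations, level-2 BLAS, but only $O(n^2 L)$ work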
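Block Householder transformation: $H = I - W \alpha W^T$
  [Figure: block Householder transformations $H_K$ applied from the right and from the left create zero blocks; block width $L$]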
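One common way to build a block transformation of the form $H = I - W \alpha W^T$ is the compact WY representation, in which the columns of $W$ are the individual Householder vectors and $\alpha$ is an $L \times L$ upper-triangular matrix. Whether the slides use exactly this construction is an assumption; the sketch below, with hypothetical function names, shows the standard recurrence and how applying $H$ then reduces to matrix-matrix products.

```python
import numpy as np

def accumulate_wy(W, taus):
    """Return upper-triangular S with H_1 H_2 ... H_L = I - W S W^T,
    where H_j = I - taus[j] * w_j w_j^T and w_j is column j of W."""
    L = W.shape[1]
    S = np.zeros((L, L))
    for j in range(L):
        w = W[:, j]
        # New column of S: -tau_j * S_prev * (W_prev^T w_j), then tau_j on the diagonal
        S[:j, j] = -taus[j] * (S[:j, :j] @ (W[:, :j].T @ w))
        S[j, j] = taus[j]
    return S

def apply_block_left(A, W, S):
    """Compute (I - W S W^T) @ A using only matrix-matrix products (level-3 BLAS)."""
    return A - W @ (S @ (W.T @ A))
```

Applying $L$ reflectors one at a time needs $L$ matrix-vector products and $L$ rank-1 updates; the blocked form performs the same arithmetic as three matrix-matrix products.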
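Overall algorithm based on Level-3 BLAS
  $A$ ($n \times n$) -> band matrix $C$ (bandwidth $L$): $(4/3) n^3$
  $C$ -> tridiagonal matrix $T$: $6 n^2 L$
  Eigenvalues $\{\lambda_i\}$ and eigenvectors $\{u_i\}$ of $T$: QR, DC, MR$^3$
  $\{u_i\}$ -> eigenvectors $\{w_i\}$ of $C$: $2 m n^2$
  $\{w_i\}$ -> eigenvectors $\{v_i\}$ of $A$: $2 m n^2$
  The $O(n^3)$ part can be executed with level-3 BLAS. Back-transformation costs $4 m n^2$, twice the standard algorithm, but it is also level-3.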
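Performance evaluation
  Problem size: $n = 9000$; $L$: given with each result
  Level-3 BLAS based algorithm implemented in Fortran, compared with LAPACK
  Machines:
    Xeon, 8 cores: Xeon X5355 (2.66 GHz, quad-core) x 2; Intel Fortran Compiler 9, Intel Math Kernel Library
    HX600, 1 node: Opteron (2.5 GHz, quad-core) x 4
    Xeon, 24 cores: Xeon E7460 (2.4 GHz, 6 cores, 12 MB L3 cache) x 4; Intel Fortran Compiler 11, Intel Math Kernel Library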
Xeon (8 cores), n = 9000, L = 100: with 8 cores, the Level-3 BLAS based algorithm is 2.1 times faster than LAPACK.
HX600 n = 9000 L = 50
Xeon (24 cores), n = 9000, L = 200. [Figure annotations that survive: Level-3 BLAS based algorithm, 24 cores; LAPACK, 12 cores; 1.6 times; Level-3 BLAS, 40%]
[Figure: Xeon (24 cores), n = 9000, Level-3 BLAS based algorithm with L = 200; surviving labels: 1, 2, 24, level-3 BLAS]
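Xeon (24 cores): performance of the level-3 BLAS routines
  DSYMM: product of a symmetric matrix and a general matrix
  DSYR2K: symmetric rank-2k update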
3 level-2 BLAS Level-3 BLAS level-3 BLAS Xeon 24 level-3 BLAS
GPU GPU
Eigenvalue computation by the QR method
  Step 1: Householder transformation of A (H)
  Step 2:
  Step 3: QR (H, T)
  Step 4: T to A
Steps 1 to 4
  Step 1: Level-2 and Level-3 BLAS (Level-2)
  Step 4: Level-1 BLAS
  Test environment: CPU: Core i7 920 (2.66 GHz), Memory: 6.0 GB
GPU (Graphics Processing Unit)
  NVIDIA provides the CUDA environment together with the CUBLAS and CUFFT libraries.
  Here the GPU is applied to Step 1.
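Tridiagonalization by Householder transformations
  $H = I - t v v^T$ ($H^T = H$, $H^T H = H H^T = I$)
  for i = 1, N-2
    compute the Householder vector
    matrix-vector product: O(N^2)
    Rank-1 type update of A: O(N^2)
  end for
  Each iteration costs O(N^2) for the matrix-vector product and O(N^2) for the update (CPU).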
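Blocking: instead of updating A after every single transformation, N_B transformations are accumulated and A is updated once every N_B steps. Per-step cost: O(N^2); update cost per block: O(N^2 N_B).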
Blocked algorithm
  for k = 1, N/N_B
    for i = 1, N_B
      (1) compute t_i, v_i                      O(N)
      (2) compute w_i^T = t_i v_i^T A           O(N^2)
      (3) correct w_i                           O(N N_B)
    end for
    (4) update A                                O(N^2 N_B)
  end for
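The loop structure with steps (1) to (4) can be sketched in NumPy as below. This is an illustrative full-matrix version under stated assumptions, not the code measured in the slides: the function name is hypothetical, the correction in step (3) follows the usual deferred-update formulation (as in LAPACK-style blocked tridiagonalization), and no use is made of symmetry to save storage or work.

```python
import numpy as np

def tridiagonalize_blocked(A, nb=4):
    """Reduce symmetric A to tridiagonal form, updating A once per block of nb columns."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k0 in range(0, n - 2, nb):
        jb = min(nb, n - 2 - k0)
        V = np.zeros((n, jb))
        W = np.zeros((n, jb))
        for j in range(jb):
            c = k0 + j
            # Column c of the not-yet-applied update A - V W^T - W V^T
            a = A[:, c] - V[:, :j] @ W[c, :j] - W[:, :j] @ V[c, :j]
            x = a[c + 1:]
            norm_x = np.linalg.norm(x)
            if norm_x == 0.0:
                continue  # nothing to eliminate; v and w stay zero
            # (1) Householder scalar t and vector v: O(N)
            beta = -np.copysign(norm_x, x[0])
            v = np.zeros(n)
            v[c + 1:] = x
            v[c + 1] -= beta
            t = 2.0 / (v @ v)
            # (2) w^T = t v^T A (A is symmetric, so this equals t * A v): O(N^2)
            w = t * (A @ v)
            # (3) correct w for the transformations accumulated in this block: O(N N_B)
            w -= t * (V[:, :j] @ (W[:, :j].T @ v) + W[:, :j] @ (V[:, :j].T @ v))
            w -= 0.5 * t * (w @ v) * v
            V[:, j], W[:, j] = v, w
        # (4) rank-2*N_B update of A with matrix-matrix products: O(N^2 N_B)
        A -= V @ W.T + W @ V.T
    return A
```

Only step (4) is a level-3 operation; step (2) remains a level-2 matrix-vector product with the whole matrix, which is why it tends to dominate the run time.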
[Figure: execution time breakdown by step: (4); (3); (2) w_i^T = t_i v_i^T A; (1) t_i, v_i. Label: GPU]
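Variant (a): BLAS on the GPU
  [Diagram: A is sent to the GPU at the start. Inside the loops for k = 1, N/N_B and i = 1, N_B, the CPU computes t_i, v_i and exchanges data (A_i, t_i, v_i) with the GPU by Send/Receive; the GPU computes w_i^T = t_i v_i^T A with BLAS. A is received from the GPU at the end.]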
[Figure: execution time breakdown for variant (a), N = 5120. Steps: (4) O(N^2 N_B); (3) O(N N_B); (2) w_i^T = t_i v_i^T A, O(N^2); (1) t_i, v_i, O(N). Annotation: (2), (3): BLAS on the GPU]
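Variant (b): part of the BLAS on the CPU
  [Diagram: A is sent to the GPU at the start. Inside the loops for k = 1, N/N_B and i = 1, N_B, the CPU computes t_i, v_i; the GPU computes w_i^T (product with v_i^T A) and the CPU receives the result; the remaining per-step work runs on the CPU. A is received from the GPU at the end.]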
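Variant (c): BLAS shared between the CPU and the GPU
  [Diagram: A is sent to the GPU at the start. Inside the loops for k = 1, N/N_B and i = 1, N_B, the CPU computes t_i, v_i; w_i^T is computed partly on the CPU and partly on the GPU, and the GPU part is received by the CPU. A is received from the GPU at the end.]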
Where each step runs in the three variants
  Step            (a) BLAS: GPU   (b) BLAS: CPU   (c) BLAS: CPU and GPU
  (1) t_i, v_i    CPU             CPU             CPU
  (2) w_i^T       GPU             GPU             CPU+GPU
  (3)             GPU             CPU             CPU
  (4)             GPU             GPU             CPU+GPU
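Evaluation
  Size of A: N = 1024, 2048, ..., 8192
  Four implementations compared: CPU only (4 cores) and variants (a), (b), (c)
  N_B = 32
  Share of the work assigned to the CPU in variant (c):
    N          1024   2048   3072   4096   5120   6144   7168   8192
    CPU share  24/32  10/32  8/32   6/32   5/32   5/32   5/32   5/32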
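The idea of dividing one BLAS call between the two devices, as in variant (c), can be illustrated with the matrix-vector product of step (2). The sketch below is purely schematic: both partitions run in NumPy on the host and merely stand in for the CPU and GPU parts, the row-wise split is an assumption (the slides do not say how the work is partitioned), and the function name is hypothetical.

```python
import numpy as np

def split_matvec(A, v, t, cpu_share):
    """Compute w = t * A v with the rows of A divided into a 'CPU' part and a 'GPU' part."""
    n = A.shape[0]
    n_cpu = int(round(cpu_share * n))   # number of rows handled by the CPU stand-in
    w_cpu = t * (A[:n_cpu, :] @ v)      # would run on the CPU
    w_gpu = t * (A[n_cpu:, :] @ v)      # would run on the GPU, then be received by the CPU
    return np.concatenate([w_cpu, w_gpu])
```

With a share such as `5 / 32`, the smaller partition goes to the CPU stand-in and the larger one to the GPU stand-in; both parts can in principle proceed concurrently.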
Result: compared with the CPU, variant (c) is 3.25 times faster at N = 8192.
[Figure: execution time breakdown, N = 5120. Steps: (4); (3); (2) w_i^T = t_i v_i^T A; (1) t_i, v_i]
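Summary: tridiagonalization on the GPU
  Three approaches: BLAS on the GPU with CUBLAS; part of the BLAS on the CPU; BLAS shared between the CPU and the GPU
  With a Tesla C1060, up to 3.25 times faster than the Core i7 (4 cores)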
BLAS MAGMA GPU