Similar documents
1 (bit ) ( ) PC WS CPU IEEE754 standard ( 24bit) ( 53bit)

211 年ハイパフォーマンスコンピューティングと計算科学シンポジウム Computing Symposium 211 HPCS /1/18 a a 1 a 2 a 3 a a GPU Graphics Processing Unit GPU CPU GPU GPGPU G

untitled

untitled

倍々精度RgemmのnVidia C2050上への実装と応用

CPU Levels in the memory hierarchy Level 1 Level 2... Increasing distance from the CPU in access time Level n Size of the memory at each level 1: 2.2


numb.dvi

tabaicho3mukunoki.pptx

MBLAS¤ÈMLAPACK; ¿ÇÜĹÀºÅÙÈǤÎBLAS/LAPACK¤ÎºîÀ®

Krylov (b) x k+1 := x k + α k p k (c) r k+1 := r k α k Ap k ( := b Ax k+1 ) (d) β k := r k r k 2 2 (e) : r k 2 / r 0 2 < ε R (f) p k+1 :=

86 6 r (6) y y d y = y 3 (64) y r y r y r ϕ(x, y, y,, y r ) n dy = f(x, y) (6) 6 Lipschitz 6 dy = y x c R y(x) y(x) = c exp(x) x x = x y(x ) = y (init

C による数値計算法入門 ( 第 2 版 ) 新装版 サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 新装版 1 刷発行時のものです.

Agenda GRAPE-MPの紹介と性能評価 GRAPE-MPの概要 OpenCLによる四倍精度演算 (preliminary) 4倍精度演算用SIM 加速ボード 6 processor elem with 128 bit logic Peak: 1.2Gflops

Second-semi.PDF

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

マルチコアPCクラスタ環境におけるBDD法のハイブリッド並列実装

ipsj-final.dvi

FFTSS Library Version 3.0 User's Guide

( ) a C n ( R n ) R a R C n. a C n (or R n ) a 0 2. α C( R ) a C n αa = α a 3. a, b C n a + b a + b ( ) p 8..2 (p ) a = [a a n ] T C n p n a

インテル(R) Visual Fortran Composer XE

A11 (1993,1994) 29 A12 (1994) 29 A13 Trefethen and Bau Numerical Linear Algebra (1997) 29 A14 (1999) 30 A15 (2003) 30 A16 (2004) 30 A17 (2007) 30 A18

rank ”«‘‚“™z‡Ì GPU ‡É‡æ‡éŁÀŠñ›»

4 倍精度基本線形代数ルーチン群 QPBLAS の紹介 [index] 1. Introduction 2. Double-double algorithm 3. QPBLAS 4. QPBLAS-GPU 5. Summary 佐々成正 1, 山田進 1, 町田昌彦 1, 今村俊幸 2, 奥田洋司

07-二村幸孝・出口大輔.indd

,.,. NP,., ,.,,.,.,,, (PCA)...,,. Tipping and Bishop (1999) PCA. (PPCA)., (Ilin and Raiko, 2010). PPCA EM., , tatsukaw

2009 4

RaVioli SIMD

1 OpenCL OpenCL 1 OpenCL GPU ( ) 1 OpenCL Compute Units Elements OpenCL OpenCL SPMD (Single-Program, Multiple-Data) SPMD OpenCL work-item work-group N

1 filename=mathformula tex 1 ax 2 + bx + c = 0, x = b ± b 2 4ac, (1.1) 2a x 1 + x 2 = b a, x 1x 2 = c a, (1.2) ax 2 + 2b x + c = 0, x = b ± b 2

44 6 MPI 4 : #LIB=-lmpich -lm 5 : LIB=-lmpi -lm 7 : mpi1: mpi1.c 8 : $(CC) -o mpi1 mpi1.c $(LIB) 9 : 10 : clean: 11 : -$(DEL) mpi1 make mpi1 1 % mpiru

Microsoft PowerPoint - sales2.ppt

最新耐震構造解析 ( 第 3 版 ) サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 第 3 版 1 刷発行時のものです.

main.dvi

2001 : : IMS: ; NDC:418; keywords:, ; ( ) 2. ( ) 3. ( ) ( ) 4. ( ) 5.

2

(Basic Theory of Information Processing) 1

(1) 1 y = 2 = = b (2) 2 y = 2 = 2 = 2 + h B h h h< h 2 h

error_g1.eps

2ndD3.eps


HPC (pay-as-you-go) HPC Web 2

橡固有値セミナー2_棚橋改.PDF

IPSJ SIG Technical Report Vol.2011-HPC-131 No /10/6 1 1 Parareal-in-Time Applicability of Time-domain Parallelism to Iterative Linear Calculus T

d dt A B C = A B C d dt x = Ax, A 0 B 0 C 0 = mm 0 mm 0 mm AP = PΛ P AP = Λ P A = ΛP P d dt x = P Ax d dt (P x) = Λ(P x) d dt P x =

Krylov A04 October 8, 2010 T. Sakurai (Univ. Tsukuba) Krylov October 8, / 48

ohp1.dvi

EGunGPU

0 = m 2p 1 p = 1/2 p y = 1 m = 1 2 d ( + 1)2 d ( + 1) 2 = d d ( + 1)2 = = 2( + 1) 2 g() 2 f() f() = [g()] 2 = g()g() f f () = [g()g()]

Part () () Γ Part ,

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved

1 4 1 ( ) ( ) ( ) ( ) () 1 4 2

tomocci ,. :,,,, Lie,,,, Einstein, Newton. 1 M n C. s, M p. M f, p d ds f = dxµ p ds µ f p, X p = X µ µ p = dxµ ds µ p. µ, X µ.,. p,. T M p.

医系の統計入門第 2 版 サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 第 2 版 1 刷発行時のものです.

IPSJ SIG Technical Report Vol.2013-HPC-138 No /2/21 GPU CRS 1,a) 2,b) SpMV GPU CRS SpMV GPU NVIDIA Kepler CUDA5.0 Fermi GPU Kepler Kepler Tesla

1 = = = (set) (element) a A a A a A a A a A {2, 5, (0, 1)}, [ 1, 1] = {x; 1 x 1}. (proposition) A = {x; P (x)} P (x) x x a A a A Remark. (i) {2, 0, 0,


JFE.dvi

2012年度HPCサマーセミナー_多田野.pptx

1 Abstract 2 3 n a ax 2 + bx + c = 0 (a 0) (1) ( x + b ) 2 = b2 4ac 2a 4a 2 D = b 2 4ac > 0 (1) 2 D = 0 D < 0 x + b 2a = ± b2 4ac 2a b ± b 2

untitled

keisoku01.dvi

1 8, : 8.1 1, 2 z = ax + by + c ax by + z c = a b +1 x y z c = 0, (0, 0, c), n = ( a, b, 1). f = n i=1 a ii x 2 i + i<j 2a ij x i x j = ( x, A x), f =

修士論文

微分積分 サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 初版 1 刷発行時のものです.

min. z = 602.5x x 2 + 2

., White-Box, White-Box. White-Box.,, White-Box., Maple [11], 2. 1, QE, QE, 1 Redlog [7], QEPCAD [9], SyNRAC [8] 3 QE., 2 Brown White-Box. 3 White-Box

p12.dvi

GPU GPU CPU CPU CPU GPU GPU N N CPU ( ) 1 GPU CPU GPU 2D 3D CPU GPU GPU GPGPU GPGPU 2 nvidia GPU CUDA 3 GPU 3.1 GPU Core 1



ストリーミング SIMD 拡張命令2 (SSE2) を使用した SAXPY/DAXPY

main.dvi

_0212_68<5A66><4EBA><79D1>_<6821><4E86><FF08><30C8><30F3><30DC><306A><3057><FF09>.pdf

スライド 1

HP High Performance Computing(HPC)

A Feasibility Study of Direct-Mapping-Type Parallel Processing Method to Solve Linear Equations in Load Flow Calculations Hiroaki Inayoshi, Non-member

. a, b, c, d b a ± d bc ± ad = c ac b a d c = bd ac b a d c = bc ad n m nm [2][3] BASIC [4] B BASIC [5] BASIC Intel x * IEEE a e d

simx simxdx, cosxdx, sixdx 6.3 px m m + pxfxdx = pxf x p xf xdx = pxf x p xf x + p xf xdx 7.4 a m.5 fx simxdx 8 fx fx simxdx = πb m 9 a fxdx = πa a =

10D16.dvi

kokyuroku.dvi

main.dvi

Milnor 1 ( ), IX,. [KN].,. 2 : (1),. (2). 1 ; 1950, Milnor[M1, M2]. Milnor,,. ([Hil, HM, IO, St] ).,.,,, ( 2 5 )., Milnor ( 4.1)..,,., [CEGS],. Ω m, P

PC Development of Distributed PC Grid System,,,, Junji Umemoto, Hiroyuki Ebara, Katsumi Onishi, Hiroaki Morikawa, and Bunryu U PC WAN PC PC WAN PC 1 P

D = [a, b] [c, d] D ij P ij (ξ ij, η ij ) f S(f,, {P ij }) S(f,, {P ij }) = = k m i=1 j=1 m n f(ξ ij, η ij )(x i x i 1 )(y j y j 1 ) = i=1 j

,4) 1 P% P%P=2.5 5%!%! (1) = (2) l l Figure 1 A compilation flow of the proposing sampling based architecture simulation

! 行行 CPUDSP PPESPECell/B.E. CPUGPU 行行 SIMD [SSE, AltiVec] 用 HPC CPUDSP PPESPE (Cell/B.E.) SPE CPUGPU GPU CPU DSP DSP PPE SPE SPE CPU DSP SPE 2


DO 時間積分 START 反変速度の計算 contravariant_velocity 移流項の計算 advection_adams_bashforth_2nd DO implicit loop( 陰解法 ) 速度勾配, 温度勾配の計算 gradient_cell_center_surface 速

I A A441 : April 21, 2014 Version : Kawahira, Tomoki TA (Kondo, Hirotaka ) Google

IA 2013 : :10722 : 2 : :2 :761 :1 (23-27) : : ( / ) (1 /, ) / e.g. (Taylar ) e x = 1 + x + x xn n! +... sin x = x x3 6 + x5 x2n+1 + (

スパコンに通じる並列プログラミングの基礎

n (1.6) i j=1 1 n a ij x j = b i (1.7) (1.7) (1.4) (1.5) (1.4) (1.7) u, v, w ε x, ε y, ε x, γ yz, γ zx, γ xy (1.8) ε x = u x ε y = v y ε z = w z γ yz

13 Student Software TI-Nspire CX CAS TI Web TI-Nspire CX CAS Student Software ( ) 1 Student Software 37 Student Software Nspire Nspire Nspir

all.dvi

main.dvi

newmain.dvi

II No.01 [n/2] [1]H n (x) H n (x) = ( 1) r n! r!(n 2r)! (2x)n 2r. r=0 [2]H n (x) n,, H n ( x) = ( 1) n H n (x). [3] H n (x) = ( 1) n dn x2 e dx n e x2

ÊÂÎó·×»»¤È¤Ï/OpenMP¤Î½éÊâ¡Ê£±¡Ë

(I) GotoBALS, kkimur/charpoly.html 2

Transcription:

http://na-inet.jp/ 4 @ 2015 1 19 ( )

MPFR/GMP BNCpack (cf., Vol, 21, pp.197-206, 2011) Runge-Kutta (cf. arxiv preprint arxiv:1306.2392, Vol.19, No.3, pp.313-328, 2009) Strassen (cf. JSIAM Letters, Vol.6, pp.81-84, 2014)

+ - ケチ表現 (1bit) 小数点 Precision _mpfr_prec 符号部 (1bit) 指数部仮数部 in bits + 8bits 1 23 bits ( + 1bit) _mpfr_sign Sign - 単精度 (single) 32 bits _mpfr_exp Exponent Practical multiple + Pointer to 11 bitsieee754(r) 1 standard 52 bits ( + 1bit) *_mpfr_d World record of π Mantissa - precision mantissa 倍精度 (double) 64 bits 1,030,700,000,000 in hex 0 1 2 3 150 bits 24 53 64 113 =4,122,800,000,000 64 bits 224 16384~65536 single double extented quadruple octuple double IEEE754-1985 (CPU, GPU) Length of mantissa in bits (mpfr t ) IEEE754(r) standard 24 53 64 113 実用的な多倍長精度 224 16384~65536 π の世界記録 ( 高橋, 2009) 2576980377524 dec. digits = 8560543516070 bits single double extented double quadruple octuple 仮数部の bit 数 1

IEEE754 ( ) vs. DD(4 ) QD(8 ) exflib MPFR/GMP, mpf(gmp), ARPREC(?) IEEE745 vs. DD QD, ARPREC (by Baily) exflib, MPFR/GMP, mpf(gmp) (GMP ) cf. MPACK (by ( )) = LAPACK / (DD QD + MPFR)

Year Category Integrated Computer Algebra Software 1950 1960 1970 1980 1990 2000 - Maple Mathematica Maxima Linear Computation LINPACK EISPACK LAPACK BLAS ScaLAPACK XBLAS ATLAS exflib CLN MPFR Over-quadruple Precision Floating-point Arithmetic Library GMP ARPREC Double-double QD/DD MP Floating-point Format and Arithmetic (Single and Double Precision) Binary Floating-point Arithmetic in Microsoft BASIC Hexadecimal Floating-point Arithmetic (IBM format) IEEE754 Binary Floating-point Arithmetic 1950 1960 1970 1980 1990 2000-20

MPFR/GMP GMP (GNU MP) (mpn) (mpz) (mpq) (mpf) Version 6.0.0a http://gmplib.org/ MPFR (GNU MPFR) GMP mpn IEEE754 Version 3.1.2 http://www.mpfr.org/ mpz (integer) GNU MP(GMP) mpq (rational) mpf (real) GNU MPFR mpn(mp Natural number arithmetic) kernel generic (pure C codes) x86 x86_64 sparc arm Assembler codes

( ) http://www.mpfr.org/mpfr-current/timings.html MPFR/GMP

MPFR/GMP GMP mpn basecase Karatsuba Toom-Cook FFT CPU SIMD milli-sec/mul 1.E+01 1.E+00 100 #bits1000 10000 100000 1000000 1.E-01 1.E-02 1.E-03 1.E-04 1.E-05 Mul of MPFR N^1.5 N^1.6 PentiumD Pentium4 Pentium4 Add milli-sec/div 1.E+02 1.E+01 1.E+00 100 #bits1000 10000 100000 1000000 1.E-01 N^1.4 1.E-02 N^1.5 PentiumD 1.E-03 Pentium4 1.E-04 Div of MPFR O(N 1.4 ) O(N 1.6 )

BNCpack BNCpack Basic Numerical Computation PACKage 2001 GMP mpf t MPFR/GMP mpfr t C (C++ ) MPI ( ) 1. 2. ( ( ) ) 3. ( ) 4. ( QR ) 5. (Newton Newton, Regula-Falsi ) 6. (DKA ) 7. ( Romberg ) 8. ( ) 9. ( )

A = [a ij ], B = [b ij ] C := AB a ij = 5 (i + j 1), b ij = 3 (n i + 1). l c ij := a ik b kj A = [A ik ], B = [B kj ] (1 i M, 1 k L, 1 j N) A, B C = [C ij ] L C ij := A ik B kj k=1 k=1

( ): 128bits vs. Intel Math Kernel( ) H/W Intel Core i7 3850 (3.6 GHz), 64 GB RAM S/W Scientific Linux 6.3 x86 64, Intel C Compiler Ver. 13.0.1, BNCpack ver. 0.8, MPFR 3.1.2, GMP 5.1.3 m n Simple Block(16) Block(32) Block(64) Double 255 255 1.06 1.20 1.22 1.24 256 256 1.25 1.22 1.22 1.25 257 257 1.04 1.25 1.28 1.37 511 511 9.60 9.71 9.61 10.02 512 512 10.83 9.70 9.68 9.96 513 513 10.02 9.89 9.97 10.44 1023 1023 107.78 77.63 77.80 79.36 0.084 1024 1024 213.09 77.77 77.72 79.51 0.083 1025 1025 94.62 78.92 78.41 81.48 0.087 2047 2047 756.81 627.75 619.21 648.31 0.658 2048 2048 1679.04 624.86 618.87 639.71 0.653 2049 2049 632.74 623.24 625.69 640.84 0.675

: 1024bits m n Simple Block(16) Block(32) Block(64) 255 255 5.5 5.46 5.47 5.54 256 256 5.77 5.53 5.53 5.64 257 257 5.68 5.61 5.72 5.80 511 511 45.73 43.81 43.96 44.51 512 512 49.71 44.37 44.31 44.20 513 513 46.58 44.37 44.77 45.12 1023 1023 372.10 352.79 354.37 353.15 1024 1024 463.79 356.10 356.85 355.99 1025 1025 385.24 356.58 357.17 361.24 2047 2047 3122.48 2829.43 2820.16 2833.66 2048 2048 3754.89 2845.70 2824.34 2859.24 2049 2049 2933.26 2829.95 2835.54 2859.66 2048bits alloc, free )

CPU 12 mpfrbench:100 decimal digits 25 mpfrbench: 10000 decimal digits Speedup Ratio 10 8 6 4 2 0 100digits ION330 100digits PentiumIII 100digits PentiumIV 100digits PentiumD 100digits Corei7 100digits Athlon64X2 100digits Core2Quad 100digits CeleronE3200 100digits PhenomII Speedup Ratio 20 15 10 5 100digits Corei7-3820 0 10000digits ION330 10000digits PentiumIII 10000digits PentiumIV 10000digits PentiumD 10000digits Corei7 10000digits Athlon64X2 10000digits Core2Quad 10000digits CeleronE3200 10000digits PhenomII 10000digits Corei7-3820 1CPU(1 core) MPI MPIBNCpack OpenMP( CPU) + BNCpack, CUDA(NVIDIA GPU) CUMP

MPIBNCpack MPFR/GMP or GMP (MPIGMP) _mpfr_size _mpfr_prec _mpfr_exp *_mpfr_d 符号部 指数部 仮数部 mpfr_t data type PE0 0 1 N (_mpfr_prec) パッキング処理 void * _mpfr_size _mpfr_prec _mpfr_exp 0 1 送受信 Tutorial http://na-inet.jp/tutorial/

OpenMP( CPU) GPU CUMP (http://www.hpcs.cs.tsukuba.ac.jp/~nakayama/cump/) GMP mpf t mpn generic C basecase ) (H/W) Intel Xeon E5-2620 v2(2.10ghz) 2 = 12 cores, NVIDIA Tesla K20 (S/W) CentOS 6.5 x86 64, gcc-4.4.7, Intel C compiler 13.1.3, GMP 6.0.0a, MPFR 3.1.2, CUDA 6.5 CPU:12 threads, GPU: 8 8 = 64 threads

12 cores CPU vs. Tesla K20 GPU CPU / GPU CPU/GPU 128 128 256 256 512 512 1024 1024 128bits 0.87 1.15 2.09 8.44 256 0.56 0.88 2.27 7.67 512 0.41 0.28 1.60 3.84 1024 0.12 0.35 1.20 1.93 2048 0.09 0.39 0.82 0.96 4096 0.17 0.46 0.63 0.66 8192 0.25 0.42 0.46 16384 0.25 0.31 0.32 32768 0.18 0.21 CPU GPU GPU

[ ] Ax = b A R n n, b, x i nr n E(Ã), E( b) E( x) [ ] à x = b à = A + E(Ã), b = b + E( b), x = x + E( x) κ(a) = A A 1 ( ) E( x) κ(a) x 1 E(Ã) E(Ã) + E( b) A b A Strassen LU

1967 Moler n f(x) = 0, f : R n R n (1) Newton (1) f(x) = Ax b Jacobi A (L >> S) [L ] r k := b Ax k (2) [S ] Solve Az k = r k for z k (3) [L ] x k+1 := x k + z k (4) r k x k

(2) (4) A [S], b [L] S L A [L] := A, A [S] := A [L], b [L] := b, b [S] := b [L] A [S] := P [S] L [S] U [S] Solve (P [S] L [S] U [S] )x [S] 0 = b [S] for x [S] 0 x [L] 0 := x [S] 0 For k = 0, 1, 2,... r [L] k r [S] k := b [L] Ax [L] k := r [L] k Solve (P [S] L [S] U [S] )z [S] k z [L] k x [L] k+1 := z [S] k := x[l] k Exit if r [L] k + z[l] k = r [S] k for z [S] k 2 n ε R A F x [L] 2 + ε A k

ρ F (n)κ(a)ε S 1 ψ F (n)κ(a)ε S < 1 α F < 1 (5) lim x x k k β F 1 α F x (6) β F /(1 α F ) (cf. A.Buttari, J.Dogarra, Julie Langou, Julien Langou, P.Luszczek, and J.Karzak, Mixed precision iterative refinement techniques for the solution of dense linear system, The International Journal of High Performance Computing Applications, Vol. 21, No. 4, pp. 457 466, 2007.) S-L κ(a)ε S << 1 (7) ε S, ε L S,L

Computational Time L digits Direct Method LU Decomposition Forward & Backward Substitions S digits Direct Method LU F&B Subst S-L Iterative Refinement LU Matrix-Vector Multiplication F&B Subst Matrix-Vector Multiplication F&B Subst Iteration 3 1. κ(a) < 10 7 = (S = 7)- (L = 15): SP-DP ( ) 2. κ(a) < 10 15 = (S = 15)-4 or (L > 30): DP-MP 3. κ(a) > 10 15 = 4 or - 8 or : MP-MP

DP-MP H/W AMD Athlon64X2 3800+, 4GB S/W CentOS 5.2 x86 64, GCC 4.1.2, MPFR 2.3.2/GMP 4.2.1 + BNCpack 0.7b, LAPACK 3.2, ATLAS 3.8.3 ^ Z DW DW W DW dd ^ Z / Z DW DW W DW dd dd dd dd dd ldde ldde lddd ldddd d d E >W< d>^ E >W< d>^ E >W< d>^ >ldd >lddd >lddd

(1/6): n n (ODE) { dy dt = f(t, y) Rn y(t 0 ) = y 0 :[t 0, α] (8) m IRK c 1,..., c m, a 11,..., a mm, b 1,..., b m m 2m Gauss c 1 a 11 a 1m... c m a m1 a mm b 1 b m = c A b T (9)

(2/6): m Runge-Kutta t 0, t 1 := t 0 + h 0,..., t k+1 := t k + h k... t k t k+1 y k+1 y(t k+1 ) 2 (A) Y = [Y 1... Y m ] T R mn Y 1 = y k + h m k j=1 a 1jf(t k + c j h k, Y j ). Y m = y k + h m k j=1 a mjf(t k + c j h k, Y j ) F(Y) = 0 (10) (B) Y y k+1 y k+1 := y k + h k m j=1 b j f(t k + c j h k, Y j )

(3/6):SPARK3 (1/2) W (Hairer & Wanner) (Gauss ) 1/2 ζ 1. X = W T ζ 1 0.. BAW =...... ζm 2 ζ m 2 0 ζ m 1 ζ m 1 0 W = [w ij ] = [ P j 1 (c i )] (i, j = 1, 2,..., m) ( P s (x) : s-th shifted Legendre polynomial) ( ζ i = 2 1 4i 2 1) (i = 1, 2,..., m 1) B = diag(b), I m = W T BW = diag(1 1 1)

(4/6):SPARK3 (2/4) Newton (W T B I n )(I m I n h k A J)(W I n ) E 1 F 1 G 1 E 2 F 2 = I m I n h k X J =......... G m 2 E m 1 F m 1 G m 1 E m E 1 = I n 1 2 h kj, E 2 = = E s = I n F i = h k ζ i J, G i = h k ζ i J (i = 1, 2,..., m 1)

(5/6):SPARK3 + Newton + 1. Y 1 := [y k y k... y k ] T R mn 2. For l = 0, 1, 2,... Newton Y l := [Y (l) 1 Y (l) 2... Y m (l) ] T Y (l) i = y 0 + h mj=1 k a ij f(t k + c i h k, Y (l 1) i ) C := I m I n h k X J, d := (W T B I n )( F(Y l )) (S) Solve Cx 0 = d for x 0 For ν = 0, 1, 2,... r ν := d Cx ν (S) r ν := r ν / r ν (S) Solve Cz = r ν for z x ν+1 := x ν + r ν z Check convergence x νstop Y l+1 := Y l + (W I n )x νstop Check convergence Y lstop 3. Y := Y lstop = [Y 1 Y 2... Y m ] T 4. y k+1 := y k + h k m j=1 b jf(t k + c j h k, Y j )

(6/6) Runge-Kutta (Gauss ) 128 ODE (10 50 ) Intel Core i7 920, 8GB RAM, CentOS 5.6 x86 64, gcc 4.1.2 MPFR 3.1.1/GMP 5.0.5 Comp.Time (s) 1600 1400 1200 1000 800 600 400 200 0 3 4 5 6 7 8 9 10 11 12 m Relative Error 1.E+14 1.E+10 1.E+06 1.E+02 1.E-02 1.E-06 1.E-10 1.E-14 1.E-18 1.E-22 1.E-26 1.E-30 1.E-34 1.E-38 Iter.Ref-DM W-Trans. W-Iter.Ref-MM W-Iter.Ref-DM Max.Rel.Err SPARK3 + DP-MP

Strassen LU LU (A = LU) 1. A A 11 R K K A 12 R K (n K), A 21 R (n K) K, and A 22 R (n K) (n K) 2. A 11 L 11 U 11 (= A 11 ) LU, A 12 U 12 A 21 L 21 3. A (1) 22 := A 22 L 21 U 12 4. A := A (1) 22 n K 0

Winograd s variant (1/2) C := AB = [c ij ] C R m n, A = [a ij ] R m l B = [b ij ] R l n [ ] [ ] A11 A A = 12 B11 B, B = 12. (11) A 21 A 22 B 21 B 22 S 1 := A 21 + A 22, S 2 := S 1 A 11, S 3 := A 11 A 21, S 4 := A 12 S 2, S 5 := B 12 B 11, S 6 := B 22 S 5, S 7 := B 22 B 12, S 8 := S 6 B 21, (12) M 1 := S 2 S 6, M 2 := A 11 B 11, M 3 := A 12 B 21, M 4 := S 3 S 7, M 5 := S 1 S 5, M 6 := S 4 B 22, M 7 := A 22 S 8, (13)

Winograd s variant (2/2) T 1 := M 1 + M 2, T 2 := T 1 + M 4. (14) Through (12) (13) (14), we can obtain C as follows: [ ] M2 + M C := 3 T 1 + M 5 + M 6 T 2 M 7 T 2 + M 5 Winograd s variant involves the following arithmetical operations: Mul(m, l, n) = 7Mul(m/2, l/2, n/2), Addsub(m, l, n) = 4Addsub(m/2, l/2) + 4Addsub(l/2, n/2) + 7Addsub(m/2, n/2).

Table: Computation time: Strassen s and Winograd s algorithms (128 bits) m n min(simple, Block) Strassen Winograd 255 255 1.06 0.72 0.63 256 256 1.22 0.70 0.57 257 257 1.04 0.74 0.60 511 511 9.60 4.84 4.06 512 512 9.68 4.77 3.73 513 513 9.89 4.92 3.88 1023 1023 77.63 32.02 25.57 1024 1024 77.72 31.53 24.10 1025 1025 78.41 32.21 24.77 2047 2047 619.21 211.80 163.87 2048 2048 618.87 211.19 155.67 2049 2049 623.24 212.79 157.52

Table: Computation time: Strassen s and Winograd s algorithms (1024 bits) n n min(simple, Block) Strassen Winograd 255 255 5.46 2.33 1.95 256 256 5.53 2.31 1.72 257 257 5.61 2.41 1.81 511 511 43.81 13.40 10.57 512 512 44.20 13.02 9.44 513 513 44.37 13.38 9.81 1023 1023 352.79 76.93 57.98 1024 1024 355.99 74.58 52.47 1025 1025 356.58 76.36 54.22 2047 2047 2820.16 454.02 329.41 2048 2048 2824.34 446.87 302.56 2049 2049 2829.95 456.08 307.05

Lotkin a ij = { 1 (i = 1) 1/(i + j 1) (i 2) A 1 A 1 1 4.3 10 1576 (n = 1024) x = [0 1... n 1] T α K := 32α

Lotkin 10 138 (n = 1024 ) 8650bits (458bits 10 137.87 ) Comp.Time (s) 2500 2000 1500 1000 500 Winograd 8192bits vs. 8650bits: Lotkin matrix, 1024 1024 Normal LU: 2376.9 s min: 1617.5 s Normal LU: 3.3E-903 0-200 -400-600 -800-1000 log10(max.relative Error) 0-1200 1 2 3 4 5 6 7 8 9 10 α Winograd(8192 bits) Comp.Time Winograd(8650 bits) Comp.Time Winograd(8192 bits) Max.Rel.Err Winograd(8650 bits) Max.Rel.Err 32 %

1CPU

1. GPU 2. 3.