2

Similar documents
untitled

untitled

CPU Levels in the memory hierarchy Level 1 Level 2... Increasing distance from the CPU in access time Level n Size of the memory at each level 1: 2.2

(Basic Theory of Information Processing) 1

インテル(R) Visual Fortran Composer XE

II 2 3.,, A(B + C) = AB + AC, (A + B)C = AC + BC. 4. m m A, m m B,, m m B, AB = BA, A,, I. 5. m m A, m n B, AB = B, A I E, 4 4 I, J, K

ad bc A A A = ad bc ( d ) b c a n A n A n A A det A A ( ) a b A = c d det A = ad bc σ {,,,, n} {,,, } {,,, } {,,, } ( ) σ = σ() = σ() = n sign σ sign(

MBLAS¤ÈMLAPACK; ¿ÇÜĹÀºÅÙÈǤÎBLAS/LAPACK¤ÎºîÀ®

x, y x 3 y xy 3 x 2 y + xy 2 x 3 + y 3 = x 3 y xy 3 x 2 y + xy 2 x 3 + y 3 = 15 xy (x y) (x + y) xy (x y) (x y) ( x 2 + xy + y 2) = 15 (x y)

SO(2)

LINEAR ALGEBRA I Hiroshi SUZUKI Department of Mathematics International Christian University

Microsoft Word - Sample_CQS-Report_English_backslant.doc

インテル(R) Visual Fortran Composer XE 2013 Windows版 入門ガイド

1 n A a 11 a 1n A =.. a m1 a mn Ax = λx (1) x n λ (eigenvalue problem) x = 0 ( x 0 ) λ A ( ) λ Ax = λx x Ax = λx y T A = λy T x Ax = λx cx ( 1) 1.1 Th

2001 年度 『数学基礎 IV』 講義録

1 (bit ) ( ) PC WS CPU IEEE754 standard ( 24bit) ( 53bit)

01_OpenMP_osx.indd

numb.dvi

1 return main() { main main C 1 戻り値の型 関数名 引数 関数ブロックをあらわす中括弧 main() 関数の定義 int main(void){ printf("hello World!!\n"); return 0; 戻り値 1: main() 2.2 C main

D 24 D D D

セアラの暗号

, = = 7 6 = 42, =

() x + y + y + x dy dx = 0 () dy + xy = x dx y + x y ( 5) ( s55906) 0.7. (). 5 (). ( 6) ( s6590) 0.8 m n. 0.9 n n A. ( 6) ( s6590) f A (λ) = det(a λi)

連載講座 : 高生産並列言語を使いこなす (5) 分子動力学シミュレーション 田浦健次朗 東京大学大学院情報理工学系研究科, 情報基盤センター 目次 1 問題の定義 17 2 逐次プログラム 分子 ( 粒子 ) セル 系の状態 ステップ 18

ストリーミング SIMD 拡張命令2 (SSE2) を使用した SAXPY/DAXPY

211 年ハイパフォーマンスコンピューティングと計算科学シンポジウム Computing Symposium 211 HPCS /1/18 a a 1 a 2 a 3 a a GPU Graphics Processing Unit GPU CPU GPU GPGPU G

RaVioli SIMD

6 2 2 x y x y t P P = P t P = I P P P ( ) ( ) ,, ( ) ( ) cos θ sin θ cos θ sin θ, sin θ cos θ sin θ cos θ y x θ x θ P

A µ : A A A µ(x, y) x y (x y) z = x (y z) A x, y, z x y = y x A x, y A e x e = e x = x A x e A e x A xy = yx = e y x x x y y = x A (1)

1 [1, 2, 3, 4, 5, 8, 9, 10, 12, 15] The Boston Public Schools system, BPS (Deferred Acceptance system, DA) (Top Trading Cycles system, TTC) cf. [13] [

Second-semi.PDF


[1] [2] [3] (RTT) 2. Android OS Android OS Google OS 69.7% [4] 1 Android Linux [5] Linux OS Android Runtime Dalvik Dalvik UI Application(Home,T

.5 z = a + b + c n.6 = a sin t y = b cos t dy d a e e b e + e c e e e + e 3 s36 3 a + y = a, b > b 3 s363.7 y = + 3 y = + 3 s364.8 cos a 3 s365.9 y =,

January 27, 2015

4 倍精度基本線形代数ルーチン群 QPBLAS の紹介 [index] 1. Introduction 2. Double-double algorithm 3. QPBLAS 4. QPBLAS-GPU 5. Summary 佐々成正 1, 山田進 1, 町田昌彦 1, 今村俊幸 2, 奥田洋司

Jacobson Prime Avoidance


II Time-stamp: <05/09/30 17:14:06 waki> ii

A A = a 41 a 42 a 43 a 44 A (7) 1 (3) A = M 12 = = a 41 (8) a 41 a 43 a 44 (3) n n A, B a i AB = A B ii aa

/* sansu1.c */ #include <stdio.h> main() { int a, b, c; /* a, b, c */ a = 200; b = 1300; /* a 200 */ /* b 200 */ c = a + b; /* a b c */ }

76 3 B m n AB P m n AP : PB = m : n A P B P AB m : n m < n n AB Q Q m A B AQ : QB = m : n (m n) m > n m n Q AB m : n A B Q P AB Q AB 3. 3 A(1) B(3) C(

2012年度HPCサマーセミナー_多田野.pptx


p 2 3

,2,4

Untitled

.1 A cos 2π 3 sin 2π 3 sin 2π 3 cos 2π 3 T ra 2 deta T ra 2 deta T ra 2 deta a + d 2 ad bc a 2 + d 2 + ad + bc A 3 a b a 2 + bc ba + d c d ca + d bc +

5 / / $\mathrm{p}$ $\mathrm{r}$ 8 7 double 4 22 / [10][14][15] 23 P double 1 $\mathrm{m}\mathrm{p}\mathrm{f}\mathrm{u}\mathrm{n}/\mathrm{a


1 8, : 8.1 1, 2 z = ax + by + c ax by + z c = a b +1 x y z c = 0, (0, 0, c), n = ( a, b, 1). f = n i=1 a ii x 2 i + i<j 2a ij x i x j = ( x, A x), f =

連載講座 : 高生産並列言語を使いこなす (3) ゲーム木探索問題 田浦健次朗 東京大学大学院情報理工学系研究科, 情報基盤センター 目次 1 概要 17 2 ゲーム木探索 必勝 必敗 引き分け 盤面の評価値 αβ 法 指し手の順序付け (mo

( ) a C n ( R n ) R a R C n. a C n (or R n ) a 0 2. α C( R ) a C n αa = α a 3. a, b C n a + b a + b ( ) p 8..2 (p ) a = [a a n ] T C n p n a

n ( (

HPC146

16 B

数理計画法入門 サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 初版 1 刷発行時のものです.

example2_time.eps

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved

PowerPoint プレゼンテーション

x V x x V x, x V x = x + = x +(x+x )=(x +x)+x = +x = x x = x x = x =x =(+)x =x +x = x +x x = x ( )x = x =x =(+( ))x =x +( )x = x +( )x ( )x = x x x R

数学Ⅱ演習(足助・09夏)

23_33.indd

倍々精度RgemmのnVidia C2050上への実装と応用

31 4 MATLAB A, B R 3 3 A = , B = mat_a, mat_b >> mat_a = [-1, -2, -3; -4, -5, -6; -7, -8, -9] mat_a =

Appendix A BASIC BASIC Beginner s All-purpose Symbolic Instruction Code FORTRAN COBOL C JAVA PASCAL (NEC N88-BASIC Windows BASIC (1) (2) ( ) BASIC BAS

4 4 θ X θ P θ 4. 0, 405 P 0 X 405 X P 4. () 60 () 45 () 40 (4) 765 (5) 40 B 60 0 P = 90, = ( ) = X

インテル(R) C++ Composer XE 2011 Windows版 入門ガイド

並列計算の数理とアルゴリズム サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 初版 1 刷発行時のものです.

Gauss

(I) GotoBALS, kkimur/charpoly.html 2

R R 16 ( 3 )

ver Web

スパコンに通じる並列プログラミングの基礎

Microsoft PowerPoint - sales2.ppt

2 1 x 1.1: v mg x (t) = v(t) mv (t) = mg 0 x(0) = x 0 v(0) = v 0 x(t) = x 0 + v 0 t 1 2 gt2 v(t) = v 0 gt t x = x 0 + v2 0 2g v2 2g 1.1 (x, v) θ

December 28, 2018

1 OpenCL OpenCL 1 OpenCL GPU ( ) 1 OpenCL Compute Units Elements OpenCL OpenCL SPMD (Single-Program, Multiple-Data) SPMD OpenCL work-item work-group N

2012 A, N, Z, Q, R, C

Itanium2ベンチマーク

untitled


A11 (1993,1994) 29 A12 (1994) 29 A13 Trefethen and Bau Numerical Linear Algebra (1997) 29 A14 (1999) 30 A15 (2003) 30 A16 (2004) 30 A17 (2007) 30 A18


( )

直交座標系の回転

index calculus

all.dvi

,4) 1 P% P%P=2.5 5%!%! (1) = (2) l l Figure 1 A compilation flow of the proposing sampling based architecture simulation

福岡大学人文論叢47-3


[ ] 0.1 lim x 0 e 3x 1 x IC ( 11) ( s114901) 0.2 (1) y = e 2x (x 2 + 1) (2) y = x/(x 2 + 1) 0.3 dx (1) 1 4x 2 (2) e x sin 2xdx (3) sin 2 xdx ( 11) ( s

linearal1.dvi

スパコンに通じる並列プログラミングの基礎


C による数値計算法入門 ( 第 2 版 ) 新装版 サンプルページ この本の定価 判型などは, 以下の URL からご覧いただけます. このサンプルページの内容は, 新装版 1 刷発行時のものです.

x x x 2, A 4 2 Ax.4 A A A A λ λ 4 λ 2 A λe λ λ2 5λ + 6 0,...λ 2, λ 2 3 E 0 E 0 p p Ap λp λ 2 p 4 2 p p 2 p { 4p 2 2p p + 2 p, p 2 λ {

DVIOUT-HYOU

XcalableMP入門


Transcription:

( ) 1

2

3

1.CPU, 2.,,,,,, 3. register, register, 4.L1, L2, (L3), (L4) 4

register L1 cache L2 cache Main Memory,, L2, L1 CPU L2, L1, CPU 5

, 6

dgem2vu 7

? Wiedemann algorithm u 0, w 0, s i, s i = u 0 Ai w 0 t 0 = w 0, t 1 = At 0, t 2 = At 1, s 0 = u 0 t 0, s 1 = u 0 t 1, s 2 = u 0 t 2, 8

s i = u 0 Ai w 0 s 0 = (u 0 )(w 0) = (u 0 ) (w 0 ) s 1 = (u 0 )(Aw 0) = (u 0 ) (Aw 0 ) s 2 = (u 0 A)(Aw 0) = (A u 0 ) (Aw 0 ) s 3 = (u 0 A)(A 2 w 0 ) = (A u 0 ) (A 2 w 0 ) s 4 = (u 0 A2 )(A 2 w 0 ) = ((A ) 2 u 0 ) (A 2 w 0 ). 2 t 0 = w 0, t 1 = At 0, t 2 = At 1, v 0 = u 0, v 1 = A v 0, v 2 = A v 1, 2, 9

s 0 = v0 t 0 s 1 = v0 t 1 s 2 = v1 t 1 s 3 = v1 t 2 s 4 = v2 t 2. Wilkinson, Ax, A y, 10

for (j = 0; j < N; j++) ATY_TMP[j]=0; for (j = 0; j < N; j++){ TMP1 = 0; for (k = 0; k < N; k++){ TMP1 = (ULL) A[j][k] * X[k] + TMP1; ATY_TMP[k]=(ULL) A[j][k] * Y[j] + ATY_TMP[k]; } AX[j] = TMP1 % P; } for (j = 0; j < N; j++) ATY[j]=ATY_TMP[j] % P; Wiedemann algorithm, Z/pZ, % P, 1, 2 11

Ax, A y,,? Intel Math Kernel Libirary, BLAS LAPACK, BLAS?gem2vu The?gem2vu routines perform two matrix-vector operations defined as y1 := alpha*a*x1 + beta*y1, and y2 := alpha*a *x2 + beta*y2 12

two-stage algorithm 13

A n n, AV = V D, V V = V V = I n, λ 1 D =..., V = v 1 v n λ n, λ i (1 i n), V (n n) 14

A 1., A 3 T (1)Householder (Dongarra, ) (2)Bischof/Wu + 2. T V = V D, V V = V V = I n 3. V V 15

Bischof/Wu + n n A, (Bischof/Wu ), 3 T ( ) 0 0 0 0 A 16

Bischof/Wu, O(n 3 )., O(Ln 2 ), L,. Bischof/Wu,, CPU. L,,,,, two-stage algorithm 17

two-stage algorithm Z/pZ, N N A, =, N N A, C, C Wiedemann algorithm 0 0 0 0 A 18

N N, [1,10] N Wiedemann two-stage Hensel 250 0.116sec 0.579sec 0.263sec 500 1.546sec 1.983sec 2.757sec 750 7.026sec 5.891sec 12.088sec 1000 22.816sec 14.785sec 36.994sec 1250 61.798sec 33.538sec 86.468sec Intel(R) Core(TM) i7-4850hq CPU @ 2.30GHz, L4 :128MB, Wiedemann algorithm CPU, Mem 16GB, Fodora 20 19

dgem2vu? two-stage algorithm, Wiedemann algorithm,, = 1, two-stage algorithm, Wiedemann algorithm 20

Hensel,,, ( ), v 0, Q, A k 1 v 0 A n 2 v 0 Av 0 v 0 c k 1. c 0 = A k v 0,, Hensel (p ) v 0 generic, 21

3 (1) (2) 1. AVX2 2. AVX2 3. 64bits 22

:two-stage algorithm n n A, 0 0 A 2 2,, 23

SIMD 24

SIMD(single instruction multiple data) IP x y y α x + y, SIMD (SSE,AVX ) SIMD, 25

s = min i, SIMD a i s 0 = a 0, s 1 = a 1, s 2 = a 2, s 3 = a 3 s 0 = min(s 0, a 4 ), s 1 = min(s 1, a 5 ), s 2 = min(s 2, a 6 ), s 3 = min(s 3, a 7 ) s 0 = min(s 0, a 8 ), s 1 = min(s 1, a 9 ), s 2 = min(s 2, a 10 ), s 3 = min(s 3, a 11 ) s = min(s 0, s 1, s 2, s 3 ) 26

, s = min i a i SIMD? C tmp=a[0]; for(i=step;i<n;i=i+step){ tmp=a[i] < tmp? A[i]:tmp; },, s = min i a i C, 27

FORTRAN Fortran =minval( ( )), SIMD, 28

C?, Intel Array Notation = sec reduce min( [ : : ]); Array Notation C++ 12 29

GNU minval Fortran gfortran -O3 -mtune=native -march=native -c program min.f program min.o gcc -O3 -mtune=native -march=native -o test test.c program min.o, Fortran C 30

2 ( ) 31

C89 double A[5][4];, OK, int N=atoi(argv[1]); double A[N][N]; double **A; A=(double **)malloc(sizeof(double *)*N); for(i=0;i<n;i++){ A[i]=(double *)malloc(sizeof(double)*n); }, 32

, 2, double *A; A=(double *)malloc(sizeof(double)*n*n);, A[i*N+j],, C99 int N=atoi(argv[1]); double A[N][N];,,, 33

, static double A[N][N];,, C99 int N=atoi(argv[1]); double (* restrict A)[N]; A=(double (* restrict)[n])malloc(sizeof(double)*n*n); A, N 1,, A[i][j], 34

restrict? Fortran C double precision A(N,N),B(N,N), Fortran, A B,, C, restrict,, 35

36

X X mod 2 = 1 X mod 3 = 2, X =, 13, 7, 1, 5, 11, 17,, X, 37

, X 1, X = 7, 1, 5,, 3 X mod 2 = 1 X mod 3 = 2 X mod 5 = 3,, X = 7,, 38

( ) 2 2 : q, q 1 = q 2 = α 1,,α s c(α 1,, α s ), α 1,,α s c(α 1,, α s ) 2,, q = α 1,,α s c(α 1,, α s )x α 1 1 xα s s Z[x 1,, x s ] 39

A = (a i,j ) Z N N Hadamard, det(a) min N i=1 N j=1 a i,j 2, N j=1 N i=1 a i,j 2 A = (a i,j ) Z[x 1,, x s ] N N Goldstein Graham, det(a) 2 min N i=1 N j=1 a i,j 2 1, N j=1 N i=1 a i,j 2 1 40

B = a 2b 3c 4d + f = 4ad + 1af 6bc G.&G. = min( 1 + 4 9 + 25, 1 + 9 4 + 25) < 13.1 A f(x) = det(xi A) f 1 (x) = f(x) mod p 1 f 2 (x) = f(x) mod p 2., f(x) 41

Z/pZ 42

.1:C GNU p, a, b Z, p < 2 63, 0 a, b < p, a b = (a + b) mod p 0 a 0,, a 99999999 < p, r = a 0 a 1 a 99999999 unsigned long long a[1000000],r; r=0; for(i=0;i<1000000;i++) r=(r+a[i]) % p; 43

r = a 0 a 1 a 99999999 = (a 0 + a 1 + + a 99999999 ) mod p typedef unsigned int uint128_t attribute ((mode(ti))); unsigned long long a[1000000],r; uint128_t t=0; for(i=0;i<1000000;i++) t+=a[i]; r=t % p; 44

.2: Basic Linear Algebra Subprograms BLAS 1.0 2 53 1.0 2 53, p 1 2 p 1 2, N,, N p 1 2 2 253 2 53 p 2 N + 1 p, BLAS 45

.3: SIMD 0 < p, 2 64 1 N + 1 p Z/pZ, SIMD typedef unsigned int UI; typedef unsigned long long ULL; UI a[n],b[n],r,p; ULL t=0; for(i=0;i<n;i++) t+=(ull)a[i]*b[i]; r=t % p 46

.4: c i = a i mod p 0 c i < p, CPU, c i, 0 c i < 2p, 47

M = 264 1 p ( ), M,, 128bits 0 0 0 0 0 64bits 64bits 0 0 0 1 1 48

0.1111111111111111111111111111111111111111111111111111111111111111 (2) 1.0, M = (2 64 1)/p 1.0/p 2 64 p, M (a i M) 64bits, a i /p,, 1 (, ) c i = a i p (a i M 64bits) 49

64bits 64bits, xxx, asm ("mulq %3":"=a"(xxx),"=d"(r):"a"(a1),"g"(a2)) 50

Tropical Determinant 51

A:n A = (a ij ), det A det A = A = σ S n sgn(σ)a 1σ(1) a 2σ(2) a nσ(n) Permanent A:n A = (a ij ), per A per A = σ S n a 1σ(1) a 2σ(2) a nσ(n) 52

ultradiscretization, a + b max(a, b) a b a + b a/b a b ultradiscrete permanent A:n A = (a ij ), udper A udper A = max σ S n a 1σ(1) + a 2σ(2) + + a nσ(n) ultradiscrete permanent, Tropical Determinant 53

Tropical Determinant Linear Assignment Problem Tomizawa, N. : On some techniques useful for the solution of transportation problems. Networks 1, 173 194 (1971). Jonker-Volgenant algorithm(lapjv), 1987 54

Tropical Determinant A = 2x 3 + 1 x 3 + 6x 2 + 5 7x 5 + 11x 4 + 3 9x 3, B = 3 3 5 3, B Tropical Determinant, A 55

Newton 56

Newton 1 n f(x) Newton f(x) = f[x 0 ] + (x x 0 )f[x 0, x 1 ] +... +(x x 0 ) (x x n )f[x 0, x 1, x 2,..., x n ] f[x 0, x 1, x 2,..., x n ], 1, 2, f[x 0, x 1 ] = f(x 1) f(x 0 ) x 1 x 0, f[x 0, x 1, x 2 ] = f[x 0, x 2 ] f[x 0, x 1 ] x 2 x 1, 57

, n. f[x 0, x 1,..., x n ] = f[x 0, x 1,..., x n 2, x n ] f[x 0, x 1,..., x n 2, x n 1 ] x n x n 1, x n x n 1,, Z/pZ,, 1, 58

Newton (1) (x 0, f(x 0 )), (x 1, f(x 1 )),, (x N, f(x N )), f(x) = f 0 + (x x 0 )f 1 + (x x 0 )(x x 1 )f 2 + + (x x 0 )(x x 1 ) (x x N )f N 59

f(x 0 ) = f 0, f(x 1 ) = f 0 + (x 1 x 0 )f 1, f 1 = f(x 1) f 0 x 1 x 0 f(x 2 ) = f 0 + (x 2 x 0 )f 1 + (x 2 x 0 )(x 2 x 1 )f 2, f 2 = f(x 2) (f 0 + (x 2 x 0 )f 1 ) (x 2 x 0 )(x 2 x 1 ) f 0 + (x 2 x 0 )f 1, 60

Newton (2) f(x j ) = f 0 + (x j x 0 )f 1 + (x j x 0 )(x j x 1 )f 2 + + (x j x 0 )(x j x 1 ) (x j x N )f N (x j x 0 ), (x j x 0 )(x j x 1 ),, (x j x 0 )(x j x 1 ) (x j x N ),, f(x j ) = 1 f 0 + { (x j x 0 ) } f 1 + { (xj x 0 )(x j x 1 ) } f 2 + + { (xj x 0 )(x j x 1 ) (x j x N ) } f N,, SIMD 61

Hyper-Threading Technology 62

1.x1=(ULL)a[i]*(ULL)b[i]+(ULL)c[i] 2. asm ("mulq %3":"=a"(xxx),"=d"(x2):"a"(x1),"g"(INV_CMX)) 3.z[i]=(UI)x1-(UI)x2*(UI)CM, z[i] = mod(a[i] b[i] + c[i], p) (ULL=unsigned long long,ui=unsigned int), i, 1.2.3., CPU ( CPU ) Hyper-Threading Technology 63

Intel Hyper-Threading Technology(HTT) Intel web page, Hyper-Threading Technology. ( HT ) 1 HT 64

E6(a)=a^27+12*p2*a^25+60*p2^2*a^23-48*p1*a^22+(168*p2^3+96*q2)*a^21-336*p2*p1*a^20+(294*p2^4+528*q2*p2+480*p0)*a^19+(-1008*p2^2*p1-1344*q1)*a^18 +(144*p1^2+336*p2^5+1152*q2*p2^2+2304*p0*p2)*a^17 +((-1680*p2^3-768*q2)*p1-5568*q1*p2)*a^16 +(608*p2*p1^2+252*p2^6+1200*q2*p2^3+4768*p0*p2^2+17280*q0-1248*q2^2)*a^15 +((-1680*p2^4-2688*q2*p2+2304*p0)*p1-8832*q1*p2^2)*a^14 +(976*p2^2*p1^2+3264*q1*p1+120*p2^7+480*q2*p2^4+5696*p0*p2^3+ (43776*q0-4800*q2^2)*p2+12288*q2*p0)*a^13 +(832*p1^3+(-1008*p2^5-3072*q2*p2^2+5888*p0*p2)*p1-6528*q1*p2^3 +10752*q2*q1)*a^12 +((704*p2^3+4224*q2)*p1^2+2688*q1*p2*p1+33*p2^8-144*q2*p2^5+4384*p0*p2^4 +(41472*q0-6720*q2^2)*p2^2+34560*q2*p0*p2-34560*p0^2)*a^11 +(2560*p2*p1^3+(-336*p2^6-768*q2*p2^3+3584*p0*p2^2+64512*q0+8448*q2^2)*p1-2112*q1*p2^4+23040*q2*q1*p2-70656*p0*q1)*a^10 +((176*p2^4+8960*q2*p2-18944*p0)*p1^2-5504*q1*p2^2*p1+4*p2^9-192*q2*p2^6 +2176*p0*p2^5+(22528*q0-3840*q2^2)*p2^3+32768*q2*p0*p2^2-39936*p0^2*p2 +110592*q2*q0-40704*q1^2+5120*q2^3)*a^9 65

+(2688*p2^2*p1^3+4608*q1*p1^2+(-48*p2^7+768*q2*p2^4-1536*p0*p2^3 +(82944*q0+16128*q2^2)*p2-73728*q2*p0)*p1-192*q1*p2^5+13824*q2*q1*p2^2-64512*p0*q1*p2)*a^8 +(-2560*p1^4+(-32*p2^5+5376*q2*p2^2-16384*p0*p2)*p1^2+(-6144*q1*p2^3-15360*q2*q1)*p1-48*q2*p2^7+608*p0*p2^6+(9600*q0-480*q2^2)*p2^4 +10752*q2*p0*p2^3-20992*p0^2*p2^2+(156672*q2*q0-38400*q1^2+9984*q2^3)*p2-165888*p0*q0-56832*q2^2*p0)*a^7 +((1024*p2^3-10240*q2)*p1^3+10240*q1*p2*p1^2+(384*q2*p2^5-1792*p0*p2^4 +(21504*q0+6912*q2^2)*p2^2-57344*q2*p0*p2+49152*p0^2)*p1+1536*q2*q1*p2^3-19456*p0*q1*p2^2-110592*q1*q0-21504*q2^2*q1)*a^6 +(-1536*p2*p1^4+(-16*p2^6+768*q2*p2^3-4608*p0*p2^2+27648*q0-19200*q2^2)*p1^2 +(-1344*q1*p2^4+10752*q2*q1*p2-9216*p0*q1)*p1+64*p0*p2^7 +(2304*q0+192*q2^2)*p2^5-3072*p0^2*p2^3+(55296*q2*q0-12288*q1^2 +4608*q2^3)*p2^2+(-110592*p0*q0-46080*q2^2*p0)*p2+73728*q2*p0^2)*a^5 +((64*p2^4-4096*q2*p2+8192*p0)*p1^3-512*q1*p2^2*p1^2+(-256*p0*p2^5+ (3072*q0-768*q2^2)*p2^3-8192*q2*p0*p2^2+16384*p0^2*p2+73728*q2*q0-39936*q1^2-18432*q2^3)*p1-1024*p0*q1*p2^3+(-36864*q1*q0-3072*q2^2*q1)*p2 +24576*q2*p0*q1)*a^4 +(256*p2^2*p1^4+15360*q1*p1^3+(128*q2*p2^4-1024*p0*p2^3+(-6144*q0

-2560*q2^2)*p2+8192*q2*p0)*p1^2+(-128*q1*p2^5+2048*q2*q1*p2^2-14336*p0*q1*p2)*p1+256*q0*p2^6-256*q2*p0*p2^5+256*p0^2*p2^4+(9216*q2*q0-2560*q1^2-256*q2^3)*p2^3+(-18432*p0*q0-7680*q2^2*p0)*p2^2 +24576*q2*p0^2*p2-110592*q0^2+55296*q2^2*q0-30720*q2*q1^2-16384*p0^3-6912*q2^4)*a^3 +(-1024*p1^5+4096*p0*p2*p1^3+24576*q2*q1*p1^2-12288*q1^2*p2*p1)*a^2 +(-2048*q2*p1^4+2048*q1*p2*p1^3+((-3072*q0-256*q2^2)*p2^2+4096*q2*p0*p2-4096*p0^2)*p1^2+(512*q2*q1*p2^3-1024*p0*q1*p2^2-36864*q1*q0 +9216*q2^2*q1)*p1-256*q1^2*p2^4-6144*q2*q1^2*p2+12288*p0*q1^2)*a +(4096*q0-1024*q2^2)*p1^3+(2048*q2*q1*p2-4096*p0*q1)*p1^2-1024*q1^2*p2^2*p1-4096*q1^3 E6(a)

E6 k (a) = E6(a) mod a k+1,, CPU:Intel Core i7 980X(6 Core), Mem:24G, OS:Fedora 13 GNU GCC compiler 4.8.2 Option:-O3 -mtune=native -march=native -fopenmp Kimura Parallel Kimura Parallel k Kimura Serial without HTT with HTT 7 5m46.000s 1m13.400s 52.923s 66

Intel C++ compiler 14.0.1 Option:-fast -openmp Kimura Parallel Kimura Parallel k Kimura Serial without HTT with HTT 7 6m11.804s 1m11.837s 52.634s 6 Core CPU, 6 (super-linear) 67

E6(a) CPU:Intel Core i7 980X(6Core) Mem:24G Compiler:GCC 4.8.0 Option:-O3 -mtune=native -march=native -fopenmp :27329463 (txt :2.5G) Serial: 10913m45.857s Parallel: 1773m28.272s Speed Up: 6.15 superlinear 68