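Agenda: Introduction and performance evaluation of GRAPE-MP. Overview of GRAPE-MP. Quadruple-precision arithmetic with OpenCL (preliminary). SIMD accelerator board for quadruple-precision arithmetic: 6 processor elements with 128-bit logic. Peak: 1.2 Gflops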


Transcription:


Board overview
Control processor (FPGA by Altera)
GRAPE-MP chip [Nextreme NX2500] (structured ASIC by eASIC)

Block diagram of the GRAPE-MP chip
Transfer: 800 MB/s
128 bit, 128 words
2 operations x 100 MHz x 6 PE = 1.2 Gflops
4 logical pipelines x 6 PE = 24 pipelines/chip
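The peak figure on this slide follows directly from the three factors it lists. A minimal Python sketch of that arithmetic is given here; the variable names are illustrative and do not come from the slides.

```python
# Peak throughput of one GRAPE-MP chip, using only the figures quoted on the slide.
OPS_PER_CYCLE_PER_PE = 2        # "2 operations" per PE per clock cycle
CLOCK_HZ = 100e6                # 100 MHz
NUM_PE = 6                      # processor elements per chip
LOGICAL_PIPELINES_PER_PE = 4    # logical pipelines per PE

peak_flops = OPS_PER_CYCLE_PER_PE * CLOCK_HZ * NUM_PE
pipelines_per_chip = LOGICAL_PIPELINES_PER_PE * NUM_PE

print(f"peak: {peak_flops / 1e9:.1f} Gflops")
print(f"pipelines per chip: {pipelines_per_chip}")
```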

[Block diagram labels: inst, BM out, BM in, GRF1 (128 words), GRF2 (128 words), add, treg (4 words), mul, tt, ss, rsq]

Instruction word fields (bit range : meaning)
( 0 )   : nop if 0
( 1 )   : sub?
( 2-8)  : grf_a adr, 7 bit
( 9-15) : grf_b adr, 7 bit
(16-22) : grf_c adr, 7 bit
(23-29) : grf_d adr, 7 bit
(30-31) : TREG adr, 2 bit
(32-34) : ADD 1st arg : a, b, bm, t, ti
(35-37) : ADD 2nd arg : a, b, bm, t, ti
(38-40) : MUL 1st arg : a, b, bm, t, ti
(41-43) : MUL 2nd arg : a, b, bm, t, ti
(44 )   : RSQ 1st arg : t, ti
(45-46) : grf_c write : add, mul, rsq
(47-48) : grf_d write : add, mul, rsq
(49-50) : treg write : add, mul, rsq
(51 )   : bm out
(52-55) : bm mask : 1000 => 0, 1001 => 1, 1010 => 2, etc.
(56-62) : bm adr, 7 bit (128 words)
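The field layout above can be turned into a small decoder. The Python sketch below is an illustration only: the field names and the `decode` function are hypothetical, and it assumes that bit 0 is the least significant bit of the instruction word, which the slide does not state. Under that assumption, the first hexadecimal word in the listing further down in this transcript (`1006600214000003`) decodes to bm adr 16, grf_a adr 0 and grf_d adr 40, which appears consistent with the assembly line `sub bm16v ra0v rb40v`.

```python
# Bit ranges copied from the slide: field name -> (lowest bit, highest bit).
# Assumption (not stated on the slide): bit 0 is the least significant bit.
FIELDS = {
    "not_nop":     (0, 0),    # slide: "nop if 0"
    "sub":         (1, 1),
    "grf_a_adr":   (2, 8),
    "grf_b_adr":   (9, 15),
    "grf_c_adr":   (16, 22),
    "grf_d_adr":   (23, 29),
    "treg_adr":    (30, 31),
    "add_arg1":    (32, 34),
    "add_arg2":    (35, 37),
    "mul_arg1":    (38, 40),
    "mul_arg2":    (41, 43),
    "rsq_arg1":    (44, 44),
    "grf_c_write": (45, 46),
    "grf_d_write": (47, 48),
    "treg_write":  (49, 50),
    "bm_out":      (51, 51),
    "bm_mask":     (52, 55),
    "bm_adr":      (56, 62),
}


def decode(word: int) -> dict:
    """Split an instruction word into raw (unsigned) field values."""
    fields = {}
    for name, (lo, hi) in FIELDS.items():
        width = hi - lo + 1
        fields[name] = (word >> lo) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    # First hexadecimal word of the listing in this transcript.
    for name, value in decode(0x1006600214000003).items():
        print(f"{name:12s} {value}")
```

The decoder returns raw field values only. How the selector codes map to the operand names (a, b, bm, t, ti) or to the write sources (add, mul, rsq) is not spelled out on the slide, so no such mapping is attempted here.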

Number format (bits 127 to 0): 11-bit exponent, 116-bit mantissa, 1 bit for sign
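As a quick check of what this format offers, the Python sketch below adds up the field widths and estimates the decimal precision of the mantissa. It assumes that the 116 bits are all stored mantissa bits and does not count any implicit leading bit, since the slide does not say whether one exists.

```python
import math

# Field widths quoted on the slide for the 128-bit number format.
SIGN_BITS = 1
EXPONENT_BITS = 11
MANTISSA_BITS = 116

total_bits = SIGN_BITS + EXPONENT_BITS + MANTISSA_BITS

# Approximate number of decimal digits carried by the mantissa
# (assumption: no implicit leading bit is counted).
decimal_digits = MANTISSA_BITS * math.log10(2)

print(f"total width: {total_bits} bits")
print(f"mantissa precision: about {decimal_digits:.1f} decimal digits")
```

The 11-bit exponent is the same width as the exponent of IEEE 754 double precision, so the format extends precision rather than range.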

Block diagram of the GRAPE-MP board
64 bit, 16k words
The I/O control processor is separated from the GRAPE-MP chip:
to maximize the number of PEs on the MP chip
to simplify development

Assembly:
sub bm16v ra0v rb40v
sub bm20v ra4v rb44v
sub bm24v ra8v rb48v
mul rb40v rb40v ra36v
mul rb44v rb44v tt
add ra36v ts ra32v
mul rb48v rb48v tt
add ra32v ts tt

Encoded instruction words (hexadecimal, binary):
1006600214000003 00010000000001100110000000000010000101000
1106600214800007 00010001000001100110000000000010000101001
120660021500000b 00010010000001100110000000000010000101010
130660021580000f 00010011000001100110000000000010000101011
1406600216000013 00010100000001100110000000000010000101100
1506600216800017 00010101000001100110000000000010000101101
160660021700001b 00010110000001100110000000000010000101110
170660021780001f 00010111000001100110000000000010000101111
1806600218000023 00011000000001100110000000000010000110000
1906600218800027 00011001000001100110000000000010000110001
1a0660021900002b 00011010000001100110000000000010000110010
1b0660021980002f 00011011000001100110000000000010000110011
7a24000245001 0000000000000111101000100100000000000000001001000
7a24000255201 0000000000000111101000100100000000000000001001010
7a24000265401 0000000000000111101000100100000000000000001001100
7a24000275601 0000000000000111101000100100000000000000001001110
3e24000005801 0000000000000011111000100100000000000000000000000
3e24040005a01 0000000000000011111000100100000001000000000000000
3e24080005c01 0000000000000011111000100100000010000000000000000
3e240c0005e01 0000000000000011111000100100000011000000000000000
7802000200091 0000000000000111100000000010000000000000001000000
7802000210095 0000000000000111100000000010000000000000001000010
7802000220099 0000000000000111100000000010000000000000001000100
780200023009d 0000000000000111100000000010000000000000001000110
3e24000006001 0000000000000011111000100100000000000000000000000
3e24040006201 0000000000000011111000100100000001000000000000000
3e24080006401 0000000000000011111000100100000010000000000000000
3e240c0006601 0000000000000011111000100100000011000000000000000
1e02000000081 0000000000000001111000000010000000000000000000000
1e02040000085 0000000000000001111000000010000001000000000000000

Kernel source:
VARI xi, yi, zi, e2;
VARJ xj, yj, zj, mj;
VARF ax, ay, az, pt;
dx = xj - xi;
dy = yj - yi;
dz = zj - zi;
r1i = rsqrt(dx**2 + dy**2 + dz**2 + e2);
pf = mj*r1i;
pt += pf;
af = pf*r1i**2;
ax += af*dx;

Generated assembly:
bm_in bm12v ra12v pe0
bm_in bm8v ra8v pe0
bm_in bm4v ra4v pe0
bm_in bm0v ra0v pe0
mov zz ra16v
mov zz ra28v
mov zz ra24v
mov zz ra20v
sub bm16v ra0v rb40v
sub bm20v ra4v rb44v
sub bm24v ra8v rb48v
mul rb40v rb40v ra36v
mul rb44v rb44v tt
add ra36v ts ra32v
mul rb48v rb48v tt
add ra32v ts tt
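For readers who want to check results against a host-side reference, the kernel above can be written as an ordinary loop. The Python sketch below is a double-precision reference, not the quadruple-precision hardware path. The function name `accumulate` and the tuple layout are hypothetical, and the `ay` and `az` updates are assumed by symmetry with `ax`, since the kernel listing on the slide stops after `ax += af*dx`.

```python
import math


def accumulate(i_particle, j_particles):
    """Sum the softened gravitational interaction on one i-particle.

    i_particle:  (xi, yi, zi, e2), matching the VARI declaration.
    j_particles: iterable of (xj, yj, zj, mj), matching VARJ.
    Returns (ax, ay, az, pt), matching VARF.
    """
    xi, yi, zi, e2 = i_particle
    ax = ay = az = pt = 0.0
    for xj, yj, zj, mj in j_particles:
        dx = xj - xi
        dy = yj - yi
        dz = zj - zi
        # rsqrt in the kernel source: reciprocal square root
        r1i = 1.0 / math.sqrt(dx**2 + dy**2 + dz**2 + e2)
        pf = mj * r1i
        pt += pf
        af = pf * r1i**2
        ax += af * dx
        ay += af * dy  # assumed by symmetry; not shown on the slide
        az += af * dz  # assumed by symmetry; not shown on the slide
    return ax, ay, az, pt


if __name__ == "__main__":
    # One j-particle of mass 2 at unit distance along x, no softening.
    print(accumulate((0.0, 0.0, 0.0, 0.0), [(1.0, 0.0, 0.0, 2.0)]))
```

As in the kernel source, `pt` accumulates `mj * r1i` with a positive sign; any sign convention for the potential is left to the caller.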

Performance evaluation of GRAPE-MP
Test environment
CPU: Intel Core i7 920 (overclocked to 3 GHz)
MEM: DDR-1333 12 GB (running at 1208 MHz)
MB: Asus P6T6 WS Revolution (6 PCIe slots)
Performance was evaluated with 6 boards installed

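Feynman loop integral

$$ I = \int_0^1 dx \int_0^{1-x} dy \int_0^{1-x-y} dz \, \frac{1}{D^2} $$

$$ D = -xys - tz(1-x-y-z) + (x+y)\lambda^2 + (1-x-y-z)(1-x-y)\,m_e^2 + z(1-x-y)\,m_f^2 $$

Given x and y, compute the innermost sum over z
24 (x, y) pairs are computed simultaneously
The number of integration points N is varied
41 N^3 operations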

i-parallelism                 Performance (N=3900)
144 pipelines (6 boards)      3.040 Gflops (5.30x)
96 pipelines (4 boards)       2.150 Gflops (3.75x)
48 pipelines (2 boards)       1.118 Gflops (1.95x)

[Figure: horizontal axis: Number of particles/points]
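The multi-board figures can be read as a scaling measurement. The Python sketch below computes the parallel efficiency and the single-board performance they imply. It assumes that the ratios in parentheses are speedups relative to one board, which the slide does not state explicitly; the helper names are illustrative.

```python
# (number of boards, measured Gflops at N = 3900, ratio quoted on the slide)
RESULTS = [
    (6, 3.040, 5.30),
    (4, 2.150, 3.75),
    (2, 1.118, 1.95),
]


def parallel_efficiency(boards: int, ratio: float) -> float:
    """Efficiency, assuming the quoted ratio is a speedup over one board."""
    return ratio / boards


def implied_single_board_gflops(gflops: float, ratio: float) -> float:
    """Single-board performance implied by the same assumption."""
    return gflops / ratio


if __name__ == "__main__":
    for boards, gflops, ratio in RESULTS:
        eff = parallel_efficiency(boards, ratio)
        base = implied_single_board_gflops(gflops, ratio)
        print(f"{boards} boards: efficiency {eff:.1%}, "
              f"implied single board {base:.3f} Gflops")
```

Under this reading, all three rows imply roughly the same single-board performance, which supports interpreting the ratios as speedups over one board.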

[Figure: horizontal axis: Number of particles; annotation: 42 %]

[Figure (B): Structure of VP_Mult unit. Labels: M_A, RAM-A (RA[1] ... RA[p]); M_B, RAM-B (RB[1] ... RB[p]); Multiplier[1] ... Multiplier[p] (64 x 64); Accumulator[1] ... Accumulator[p], 70 bits (high) and 64 bits (low); E_A + E_B; S_A * S_B; MUX; Sum; RAM-C (RC); Normalization; Result]

          1024 bits                    2048 bits
Op        MPFR     Our     Speedup     MPFR     Our     Speedup
x ± y     0.7      0.126   5.6         1.25     0.25    5
x × y     12.9     0.41    31.5        32.18    1.30    24.8
x/y       18.6     1.95    9.5         64.1     5.05    12.7
x         18.8     2.52    7.5         46.9     6.39    7.3
Sin(x)    458      21.0    21.8        1766     82.0    21.5
Cos(x)    405      22.2    18.2        1640     73.5    22.3
Exp(x)    420      23.0    18.3        1515     83.2    18.2
Ln(x)     579.7    15.7    36.9        1547     46.1    33.6

[Figure: performance in Mop/s versus vector length, comparing POWER7 and FPGA]

[Figure: Performance of C ← AB + C on CPU-GPU Systems. Vertical axis: Performance [GFlop/s]; horizontal axis: Matrix size [n=m=k]. Series: SGEMM on System A (HD 5870 GPU + Core i7 970 CPU); SGEMM on System B (HD 6970 GPU + Core i7 2600k CPU); DGEMM on System C (2 HD 5870 GPUs + Core i7 960 CPU); DGEMM on System A (HD 5870 GPU + Core i7 970 CPU); DGEMM on System B (HD 6970 GPU + Core i7 2600k CPU)]

[Figure: second panel, title truncated in the source. Vertical axis: Performance [GFlop/s]; horizontal axis: Blocking factor [b] (n=m=k=10b)]

Maximum Performance

                           System A           System B
Variant                    Perf. [GFlop/s]    Perf. [GFlop/s]
DGEMM
  C ← A^T B + C            419                467
  C ← AB + C               417                467
  C ← A^T B^T + C          418                467
  C ← AB^T + C             400                466
SGEMM
  C ← A^T B + C            1455               2010
  C ← AB + C               1436               2010
  C ← A^T B^T + C          1442               2010
  C ← AB^T + C             1301               1577