QD library
- Features
  - Easy-to-use high precision
  - The structure of the arithmetic is easy to understand
- Two types of high-precision arithmetic
  - Double-double precision (pseudo quadruple precision)
  - Quad-double precision (pseudo octuple precision)
* High-Precision Software Directory, http://crd-legacy.lbl.gov/dhbailey/mpdist/
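To make the double-double idea concrete, here is a minimal sketch in C of the error-free transformations that such arithmetic is usually built from. It is not the QD library's code: the type name `dd` and the function names are illustrative assumptions, the product uses Dekker's splitting instead of FMA, and the add is the simpler variant that does not guarantee full DD accuracy under cancellation. It assumes IEEE 754 doubles with round-to-nearest and no value-changing compiler options such as -ffast-math.

```c
/* A double-double (DD) value is the unevaluated sum hi + lo. */
typedef struct { double hi, lo; } dd;

/* Knuth TwoSum: hi = fl(a + b), lo = exact rounding error of that sum. */
dd two_sum(double a, double b) {
    double s = a + b;
    double bb = s - a;
    dd r = { s, (a - (s - bb)) + (b - bb) };
    return r;
}

/* FastTwoSum: same result as two_sum, but only valid when |a| >= |b|. */
dd quick_two_sum(double a, double b) {
    double s = a + b;
    dd r = { s, b - (s - a) };
    return r;
}

/* Dekker TwoProd without FMA: hi = fl(a * b), lo = exact rounding error. */
dd two_prod(double a, double b) {
    const double c = 134217729.0;            /* 2^27 + 1, Veltkamp splitter */
    double p = a * b;
    double ta = c * a, ah = ta - (ta - a), al = a - ah;
    double tb = c * b, bh = tb - (tb - b), bl = b - bh;
    dd r = { p, ((ah * bh - p) + ah * bl + al * bh) + al * bl };
    return r;
}

/* DD addition (simple variant): add the high parts exactly, fold in the lows. */
dd dd_add(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return quick_two_sum(s.hi, s.lo);
}

/* DD multiplication: exact product of the high parts plus cross terms. */
dd dd_mul(dd a, dd b) {
    dd p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;
    return quick_two_sum(p.hi, p.lo);
}
```

For example, adding 1 and 2^-60 in plain double returns 1, while `dd_add` keeps the small term in the `lo` component.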

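GRAPE-MP4, MP6, MP8: Extending the arithmetic format
[Figure: bit layouts of the arithmetic formats. Labels: double (11-bit exponent, 52-bit mantissa); MP (11 bit, 116 bit); MP4 (15 bit, 112 bit); MP6 (176 bit), shown with TD; MP8 (240 bit), shown with QD; "software emulation".]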

QD to TD example:
  double[4] c = {s0, s1, s2, s3};
  return c;

TD operations (ADD, SUB, MUL and DIV) are created. TD ADD and TD MUL need 27 and 78 double-precision operations, respectively; these counts are clearly lower than those of QD ADD and QD MUL.
- MUL does not perform renormalization
- A branch-free renormalization is adopted

Figure …1: Algorithm design for QD ADD. a[0] to a[3] and b[0] to b[3] each represent a QD precision value; [0] is the highest bits and [3] is the lowest bits. The box marked + means addition algorithms such ...
Figure 4.3: Algorithm design for TD ADD. a[0] to a[2] and b[0] to b[2] each represent a TD precision value; [0] is the highest bits and [2] is the lowest bits. The box marked + means addition algorithms ...
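The slide does not show its branch-free renormalization, so the following C sketch only illustrates the general idea. It makes a single bottom-up pass of TwoSum over three components, with no magnitude comparisons and therefore no branches. The name `renorm3` and the single-pass structure are assumptions for illustration; a production renormalization may need further passes to guarantee non-overlapping components.

```c
/* Knuth TwoSum: returns fl(a + b) and stores the exact rounding error.
   Needs no comparison of |a| and |b|, so it contains no branch. */
double two_sum(double a, double b, double *err) {
    double s = a + b;
    double bb = s - a;
    *err = (a - (s - bb)) + (b - bb);
    return s;
}

/* One branch-free renormalization pass over a three-component value.
   a[0] is the highest part, a[2] the lowest. The exact sum
   a[0] + a[1] + a[2] is preserved in r[0] + r[1] + r[2]
   (barring overflow), with the bulk of the value moved into r[0]. */
void renorm3(const double a[3], double r[3]) {
    double e1, e2;
    double s = two_sum(a[1], a[2], &e2);   /* combine the two low parts  */
    r[0] = two_sum(a[0], s, &e1);          /* fold them into the top     */
    r[1] = two_sum(e1, e2, &r[2]);         /* collect the leftover error */
}
```

With input (1, 2^-53, 2^-53) the two low parts combine to 2^-52, which is absorbed exactly into the leading component.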

OpenCL
- Framework for parallel programming
- Programs run on many platforms and devices (multi-core CPUs, GPUs, DSPs, FPGAs, etc.)
- Target devices of this work
  - Multi-core CPUs
  - GPUs
  - Many Integrated Core (MIC)

Matrix multiplication (2012)

In this section I show a performance evaluation of matrix multiplication computed with OpenCL. Figure 2 presents the test configurations used in this work. The CPU and GPU I used are described in Section 6.3, and their specifications are shown in Table 1. The CPUs do not support FMA, so I used them without FMA. I also ran non-parallelized calculations on the CPUs to compare the results with those obtained using OpenCL. As one of the non-parallelized calculations, I also used the mpack library, which provides many vector and matrix operations in multiple precision.

Table 1: Spec and configuration

                       CPU                            GPU
vendor                 Intel                          AMD
device name            Core i7-2600K                  Radeon HD7970
peak performance [6]   108.8 Gflops (AVX)             947 Gflops (FMA)
OpenCL SDK ver.        OpenCL 1.1 LINUX (by Intel)    OpenCL 1.1 AMD-APP 831.4 (by AMD)

Figure 2: Test configurations
Figure 3: Result of CPU (non-parallel) - from left to right in each Dimension (N): No.1, No.2 and No.3
Figure 4: Result of CPU (OpenCL) - from left to right in each Dimension (N): No.4, No.5 and No.6

K. Nakamura, Graduation Thesis, University of Aizu, March 2012 (s1160154)
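As a point of reference for what is being measured, here is a minimal, non-parallel sketch in C of a double-double matrix multiplication. It is not the thesis code or the mpack routine: the names `dd` and `dd_gemm`, the row-major layout and the square-matrix restriction are assumptions. In line with the note about FMA above, the product uses Dekker's splitting rather than an FMA instruction. The helper functions are included so that the block compiles on its own.

```c
#include <stddef.h>

typedef struct { double hi, lo; } dd;   /* value = hi + lo */

/* Knuth TwoSum: hi = fl(a + b), lo = exact rounding error. */
static dd two_sum(double a, double b) {
    double s = a + b, bb = s - a;
    dd r = { s, (a - (s - bb)) + (b - bb) };
    return r;
}

/* FastTwoSum, valid when |a| >= |b|. */
static dd quick_two_sum(double a, double b) {
    double s = a + b;
    dd r = { s, b - (s - a) };
    return r;
}

/* Dekker TwoProd (no FMA): hi = fl(a * b), lo = exact rounding error. */
static dd two_prod(double a, double b) {
    const double c = 134217729.0;            /* 2^27 + 1 */
    double p = a * b;
    double ta = c * a, ah = ta - (ta - a), al = a - ah;
    double tb = c * b, bh = tb - (tb - b), bl = b - bh;
    dd r = { p, ((ah * bh - p) + ah * bl + al * bh) + al * bl };
    return r;
}

static dd dd_add(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return quick_two_sum(s.hi, s.lo);
}

static dd dd_mul(dd a, dd b) {
    dd p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;
    return quick_two_sum(p.hi, p.lo);
}

/* C = A * B for n-by-n row-major matrices of DD values (serial version). */
void dd_gemm(size_t n, const dd *A, const dd *B, dd *C) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            dd acc = { 0.0, 0.0 };
            for (size_t k = 0; k < n; ++k)
                acc = dd_add(acc, dd_mul(A[i * n + k], B[k * n + j]));
            C[i * n + j] = acc;
        }
    }
}
```

In an OpenCL version, the two outer loops would typically map to work-items, with each work-item running the inner accumulation.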

LU factorization (2012)

Figure 6: Result of LU factorization - from left to right in each Dimension (N): mpack (non-blocking), CPU (OpenCL) and GPU (OpenCL)

K. Nakamura, G. Thesis 2012

DD-GEMM performance on GPUs
- HD7970 (Tahiti) produces the highest performance (60 Gflop/s)

DD-GEMM performance on CPUs
- Xeon Phi produces stable and high performance (11 Gflop/s)

QR decomposition
- Major routine of linear algebra
- Decomposes matrix A into matrix Q and matrix R
  - A: m-by-n matrix (m ≥ n)
  - Q: m-by-m orthogonal matrix
  - R: m-by-n upper triangular matrix
- QR decomposition is implemented with the blocked Householder method

Performance tests of DD-QR decomposition
- Test environments
- Compared configurations
  - Without OpenCL (serial execution)
  - DD-GEMM on GPUs with OpenCL
  - DD-GEMM on CPUs with OpenCL

Stage 1, Stage 2, Stage 3, Stage 4

Algorithm 9 Blocked Householder QR
Require: A ∈ C^(m×n), Q^T Q = I
  Q ← I
  for k = 1 to n/r do
    s = (k - 1) * r + 1
    for j = 1 to r do
      u = s + j - 1
      [v, β] = house(A[u : m, u])
      A[u : m, u : s + r - 1] = A[u : m, u : s + r - 1] - β v v^T A[u : m, u : s + r - 1]
      V[:, j] = [zeros(j - 1, 1); v]
      B(j) = β
    end for
    Y = V[1 : end, 1]
    W = -B(1) * V[1 : end, 1]
    for j = 2 to r do
      v = V[:, j]
      z = -B(j) * v - B(j) * W Y^T v
      W = [W z]
      Y = [Y v]
    end for
    A[s : m, s + r : n] = A[s : m, s + r : n] + Y W^T A[s : m, s + r : n]
    Q[1 : m, s : m] = Q[1 : m, s : m] + Q[1 : m, s : m] W Y^T
  end for
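The panel update inside the inner loop, A ← A - β v v^T A, can be sketched in C as follows. This is a plain double-precision illustration, not the DD implementation from the slides. The function name `apply_reflector` and the row-major layout are assumptions. The product is formed column by column as w = β v^T A[:, j] followed by A[:, j] -= v w, so the m-by-m matrix v v^T is never built.

```c
#include <stddef.h>

/* Overwrite the m-by-n row-major matrix A with (I - beta * v * v^T) * A.
   v has length m. When beta = 2 / (v^T v) this is a Householder reflection. */
void apply_reflector(size_t m, size_t n, double *A,
                     const double *v, double beta) {
    for (size_t j = 0; j < n; ++j) {
        /* w = beta * (v^T A[:, j]) */
        double w = 0.0;
        for (size_t i = 0; i < m; ++i)
            w += v[i] * A[i * n + j];
        w *= beta;
        /* A[:, j] -= v * w */
        for (size_t i = 0; i < m; ++i)
            A[i * n + j] -= v[i] * w;
    }
}
```

For instance, with v = (1, 1) and β = 1 the reflector is [[0, -1], [-1, 0]], so applying it to [[1, 2], [3, 4]] yields [[-3, -4], [-1, -2]].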

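Serial vs. OpenCL (GPU)
[Figure: computation time (seconds); series: computation time with Serial (CPU) and computation time with OpenCL (GPU)]
- For N = 3072, using the GPU is 20 times faster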

Accuracy of the decomposition results: N = 1024, r = 64

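Yuasa et al. 2007

$$ I = \int_0^1 dx \int_0^{1-x} dy \int_0^{1-x-y} dz \, \frac{1}{D^2} $$

$$ D = -xys - tz(1-x-y-z) + (x+y)\lambda^2 + (1-x-y-z)(1-x-y)\,m_e^2 + z(1-x-y)\,m_f^2 $$

- Numerically unstable in double precision arithmetic
- Quadruple precision arithmetic is required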
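For orientation, the integrand 1/D^2 of the loop integral above can be written out in plain double precision as in the C sketch below. The signs follow the order of operations in the kernel listing later in these slides (-x·y·s first, then subtraction of t·z·(1-x-y-z)). The function and parameter names are assumptions; in particular `lambda2`, `me2` and `mf2` stand for λ², m_e² and m_f². This is exactly the double-precision evaluation that the slide describes as numerically unstable, so it shows only the formula, not a usable integrator.

```c
/* Denominator D of the integrand at the point (x, y, z).
   s and t are the kinematic variables; lambda2, me2, mf2 are the squared
   parameters lambda^2, m_e^2 and m_f^2 (names assumed for this sketch). */
double box_denominator(double x, double y, double z,
                       double s, double t,
                       double lambda2, double me2, double mf2) {
    double w = 1.0 - x - y;       /* 1 - x - y     */
    double u = w - z;             /* 1 - x - y - z */
    return -x * y * s - t * z * u + (x + y) * lambda2
           + u * w * me2 + z * w * mf2;
}

/* Integrand 1 / D^2. */
double box_integrand(double x, double y, double z,
                     double s, double t,
                     double lambda2, double me2, double mf2) {
    double d = box_denominator(x, y, z, s, t, lambda2, me2, mf2);
    return 1.0 / (d * d);
}
```

At the corner x = y = z = 0 only the m_e² term survives, so D reduces to m_e².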

Table 4.13: Numerical results with HD6970 (λ = 10^…)

                    N = 256          N = 1024
Double              1.10854011e-7    1.11434660070024864650109150138873600e-7
Double-Double       1.38322168e-7    1.38323589455119100876021157137126812e-7
Triple-Double       1.38322167e-7    1.38323589172160096884080782115035156e-7
Quad-Double         1.38322167e-7    1.38323589172160096884080782115035163e-7
Analytical Answer   1.38323589e-7    1.38323589227981762289646298761828386e-7

Table 4.14: Numerical results with HD6970 (λ = 10^…)

                    N = 256          N = 1024
Double              1.1623e-7        1.17272710173910304193985404e-7
Double-Double       2.1067e-7        2.11964036570355145245187164e-7
Triple-Double       2.4714e-7        2.47248635217083570183809854e-7
Quad-Double         2.4714e-7        2.47248635217083570183809888e-7
Analytical Answer   2.4724e-7        2.47248635259865968819535221e-7

          1 core     4 core    GPU
D         6.8        0.355     0.191
DD        349        25.8      0.859
TD        2335(?)    80        20.1
QD        2921       240       58.1

Kernel Generator LSUMP for AMD GPU, DR, GRAPE-MP
Input: 10 lines; generated OpenCL kernel: 80 lines

  for(id3=0; id3<n 2; id3++){
    x301[0] = g_x301[id3*2+0];
    x301[1] = g_x301[id3*2+1];
    gw30[0] = g_gw30[id3*2+0];
    gw30[1] = g_gw30[id3*2+1];
    TwoProd(x301, cnt4, zz);
    TwoProd(mone, xx, t[0]);
    TwoProd(t[0], yy, t[1]);
    TwoProd(t[1], s, t[2]);
    TwoProd(tt, zz, t[3]);
    TwoSub(one, xx, t[4]);
    TwoSub(t[4], yy, t[5]);
    TwoSub(t[5], zz, t[6]);
    TwoProd(t[3], t[6], t[7]);
    TwoSub(t[2], t[7], t[8]);
    TwoSum(xx, yy, t[9]);
    TwoProd(t[9], ramda, t[10]);

Appendix: MPX performance comparison

Overview
                      MP           MP4          MP6          MP8
format                116 bit      112 bit      176 bit      240 bit
PEs                   6            16           14           10
(percentage)          78%          61%          81%          85%
clock                 100 MHz      125 MHz      95 MHz       70 MHz
Gflops (first row)    1.2          4            2.66         1.4
Gflops (second row)   0.49         1.252        0.917        0.493
power                 12.6 W       11.5 W, 12.3 W (two values listed for MP4 to MP8)
implementation        90nm, only PEs (MP); 40nm, PEs & PCIe logic (MP4 to MP8)

Nakasato et al. 2012, Daisaka et al. 2011
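The first Gflops figure listed for each design is consistent with a simple peak estimate of PEs × clock × 2. The short C sketch below reproduces that arithmetic. The factor of two floating-point operations per PE per cycle is an assumption inferred from the numbers, not something the slide states, and the function name is illustrative.

```c
/* Peak performance estimate in Gflops.
   ASSUMPTION: each PE completes 2 floating-point operations per cycle;
   this factor is inferred from the listed values, not given in the slides. */
double peak_gflops(int pes, double clock_mhz) {
    const double flops_per_pe_per_cycle = 2.0;
    return pes * clock_mhz * flops_per_pe_per_cycle / 1000.0;
}
```

With this assumption, 6 PEs at 100 MHz give 1.2 Gflops (MP), 16 PEs at 125 MHz give 4 Gflops (MP4), 14 PEs at 95 MHz give 2.66 Gflops (MP6), and 10 PEs at 70 MHz give 1.4 Gflops (MP8), matching the listed values.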