Correlative Analysis of Performance Counters and Power Consumption on GPUs

Hitoshi Nagasaka,1,2 Naoya Maruyama,1,2 Akira Nukada,1,2 Toshio Endo1,2 and Satoshi Matsuoka1,2,3

Abstract

GPUs are being employed in large-scale supercomputing environments, where their power consumption is a first-class design constraint. To reduce their power consumption, we propose a prediction model that leverages application behavior observable through performance counters. It predicts the power consumption of a given GPU kernel by a linear regression that uses the performance counter values collected while the kernel executes, such as instruction throughput, register usage, memory accesses, and number of branches. Our experimental studies show that the model achieves up to 92.8% accuracy. We also found that, among the counters, instruction throughput and memory accesses are the most positively correlated with power, while the number of executed branches is the most negatively correlated.

1. GPU (GPGPU) 1) TSUBAME GPU HPC 2) GPU GPU 1W 2 GPU 53 7.2%

2. GPU GPU 1 FMA fma GPU 1 2 3 1

© 2009 Information Processing Society of Japan

[Figure 1: power (W) by executed code. Bars: idle 53.1, FMA(1) "1 1.8" (digits lost), FMA(N) 136.8, Memcopy 174.8, matmul 196.4. Caption remnants: N=256; 288x288 (matmul).]

Table 1: occupancy, gridsize, blocksize, dynsmemperblock, stasmemperblock, registerperthread, gld_coherent, gst_coherent, branch, divergent_branch, instructions, warp_serialize

3.

3.1 GPU : GPU GeForce GTX 285 6 2 BIOS PCIExpress : GPU 12V 3.3V %

3.2 CUDA CUDA Profiler 3) 1 occupancy CUDA ( ) GPU 1 4 1 SM SM SM

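IPSJ SIG Technical Report

and executions in which the number of blocks was smaller than the number of SMs were excluded from the analysis.

3.3 Correlation Analysis

The power consumption is predicted by a linear regression analysis that takes the average power consumption as the response variable and the counter values per unit time as the explanatory variables. That is, with power consumption P, n kinds of counters, and performance counter values p_i, the model is

P = c_0 + \sum_{i=1}^{n} c_i p_i    (1)

We determine the coefficients c_i that predict P most accurately. To analyze the accuracy we use the leave-one-out method: sample i is excluded, a regression is fitted on the remaining samples, and the fitted model is used to check how accurately the power consumption of sample i is predicted. This is repeated for every sample. Counter values that depend on execution time (such as the instruction count) are converted to values per unit time, and because magnitudes differ between counter types, all values are standardized (mean 0, variance 1) before the regression. To find out which counters correlate most strongly with power consumption, we compare the regression coefficients.

4. Experimental Setup

4.1 Experimental Environment

The GPU used is an NVIDIA GeForce GTX 285; Table 2 lists the details of its architecture. The host machine runs openSUSE 11.0 (kernel 2.6.25.20-0.4-pae) on an AMD Phenom(tm) 9500 Quad-Core Processor (2.2 GHz), with CUDA driver 2.2 and NVIDIA driver 185.18.08. Figure 2 shows an overview of the experimental environment.

Table 2: Details of the GeForce GTX 285
  Total amount of global memory: 1 Gbyte
  Number of multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 64 Kbyte
  Total amount of shared memory per block: 16 Kbyte
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512x512x64
  Maximum sizes of each dimension of a grid: 65535x65535x1
  Maximum memory pitch: 256 Kbyte
  Texture alignment: 256 byte
  Clock rate: 1.48 GHz

[Figure 2: Overview of the machine. Labels: 12 V power supply, A/D converter, GPU, riser card (see the figure on the right).]

[Figure 3: Riser card.]

Measuring the power consumption of the GPU requires measurements at two points: the power supplied through the 12 V lines of the ATX power supply and the power supplied through PCIe. The 12 V lines can be measured simply by attaching current sensors, as shown in Figure 2. For the power supplied through PCIe, a riser card is inserted as shown in Figure 3, and the wires inside it that carry the 12 V and 3.3 V supplies must be measured.

The ammeter is an ST-36 made by 株式会社シナジェテック. It uses clamp sensors, which require no modification of the wiring. To minimize the discrepancy between the timestamps of the ammeter and those of the GPU kernel functions, the ammeter is connected to the same machine that runs the GPU code. The sampling interval is 1 ms.

4.2 Measurement

The code used in the experiments is the sample code bundled with the CUDA SDK. Timestamps are taken immediately before and after each kernel function call and are later matched against the times of the current measurements to compute the power. Because many of the original kernels run for only a very short time, we modified them to repeat the in-kernel processing in order to reduce the measurement error, so that each kernel execution lasts 1 second.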
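The power calculation described in Section 4.2 can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `average_power`, the sample layout, and the use of nominal rail voltages (12 V and 3.3 V) multiplied by the measured currents are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def average_power(samples, t_start, t_end):
    """Mean power in watts between two kernel timestamps.

    samples: time-sorted list of (timestamp_s, rails), where rails is a
             list of (nominal_voltage_V, measured_current_A) pairs, one per
             measured wire (hypothetical layout).
    t_start, t_end: timestamps taken before and after the kernel call.
    """
    times = [t for t, _ in samples]
    lo = bisect_left(times, t_start)   # first sample at or after t_start
    hi = bisect_right(times, t_end)    # one past the last sample at or before t_end
    window = samples[lo:hi]
    if not window:
        raise ValueError("no current samples inside the kernel window")
    # Instantaneous power is the sum of V * I over all measured wires.
    powers = [sum(v * i for v, i in rails) for _, rails in window]
    return sum(powers) / len(powers)
```

Each sample here carries one entry for the ATX 12 V line and one each for the 12 V and 3.3 V wires on the riser card, so the two measurement points are summed before averaging over the kernel's execution window.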

5.

5.1 4 leave-one-out 7.2% 4.4 23.4 4 warp serialize 2.8

[Figure: regression coefficient (回帰係数) for each counter (カウンタ): branch, blockSize, warp_serialize, divergent_branch, dynsmemperblock, stasmemperBlock, gridSize, occupancy, registerperThread, gld_coherent, gst_coherent, instructions. Vertical axis from -0.6 to 0.6.]

[Figure: ratio (比) for each sample (サンプル), samples 1 to 53. Vertical axis from 0 to 2.]

5.2 GPGPU GPU DVFS 4) (DVFS) 5 instructions 24.9% Maury branch 5) GPU
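The leave-one-out evaluation of the linear model in Eq. (1) of Section 3.3 can be sketched as follows. This is an illustrative sketch using only the Python standard library, not the authors' code: the function names are hypothetical, the counters are standardized once over all samples (the text does not say whether this was done per fold), and the fit solves the normal equations directly, which assumes more samples than counters and no constant or collinear counter columns.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale every counter column to mean 0 and variance 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumes no column is constant
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit(rows, power):
    """Least-squares coefficients [c0, c1, ..., cn] of Eq. (1)."""
    a = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept c0
    k = len(a[0])
    ata = [[sum(row[i] * row[j] for row in a) for j in range(k)] for i in range(k)]
    aty = [sum(row[i] * p for row, p in zip(a, power)) for i in range(k)]
    return solve(ata, aty)


def predict(coef, row):
    """Evaluate P = c0 + sum(ci * pi) for one sample."""
    return coef[0] + sum(c * p for c, p in zip(coef[1:], row))


def leave_one_out_errors(rows, power):
    """Relative prediction error of each sample when it is held out."""
    z = standardize(rows)
    errors = []
    for i in range(len(z)):
        coef = fit(z[:i] + z[i + 1:], power[:i] + power[i + 1:])
        errors.append(abs(predict(coef, z[i]) - power[i]) / power[i])
    return errors
```

Here `rows` holds one list of per-unit-time counter values per kernel and `power` the measured average power of the same kernels. Averaging the returned list gives a mean relative error, and because the inputs are standardized, the entries of `fit(standardize(rows), power)[1:]` are directly comparable across counters.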

OpenMP 17% 26%

7.

7.1 GPU 7%

7.2 9% ED Baghsorkhi GPU 6) GPU Da-Qi Ren GPU FMA

ULP-HPC: Microsoft Technical Computing Initiative HPC-GPGPU: Large-Scale Commodity Accelerated Clusters and its Application to Advanced Structural Proteomics

References

1) Samuel S. Stone, Justin P. Haldar, Stephanie C. Tsao, Wen-mei W. Hwu, Zhi-Pei Liang, and Bradley P. Sutton. Accelerating advanced MRI reconstructions on GPUs. In CF '08: Proceedings of the 2008 Conference on Computing Frontiers, pp. 261-272, 2008.
2) TSUBAME. Vol.5, No.2, pp. 1 16, 2009.
3) NVIDIA. CUDA Profiler, 2009.
4) DVFS. No.8, pp. 43 48, 2006.
5) Matthew Curtis-Maury, F. Blagojevic, C. D. Antonopoulos, and D. S. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems, Vol.19, No.10, pp. 1396-1410, 2008.
6) Sara Baghsorkhi and Wen-mei Hwu. Analytical performance prediction for evaluation and tuning of GPGPU applications. In Workshop on Exploiting Parallelism using GPUs and other Hardware-Assisted Methods (EPHAM '09), in conjunction with the International Symposium on Code Generation and Optimization (CGO) 2009, 2009.