
Correlative Analysis of Performance Counters and Power Consumption on GPUs

Hitoshi Nagasaka,1,2 Naoya Maruyama,1,2 Akira Nukada,1,2 Toshio Endo1,2 and Satoshi Matsuoka1,2,3

GPUs are being employed in large-scale supercomputing environments, where their power consumption is a first-class design constraint. To reduce their power consumption, we propose a prediction model that leverages application behavior observable through performance counters. It predicts the power consumption of a given GPU kernel by a linear regression on the performance counter values recorded when the kernel is executed, such as instruction throughput, register usage, memory accesses, and number of branches. Our experimental studies show that the model achieves up to 92.8% accuracy. We also found that instruction throughput and memory accesses are the counters most positively correlated with power, while the number of executed branches is the most negatively correlated one.

[Japanese title and abstract not recoverable. Surviving terms: NVIDIA GeForce GTX285, Tesla S1070, GPGPU, 92.8%.]

1. [Body text not recoverable. Surviving terms: GPU (GPGPU) 1), TSUBAME, GPUs in HPC 2).]

2. [Body text not recoverable. Surviving terms: GPU, FMA (fma).]

© 2009 Information Processing Society of Japan

[Figure: power (W) by executed code: idle, FMA(1), FMA(N), Memcopy, matmul. Annotations: N=256, 288x288 (matmul).]

[Table: profiler counters and launch parameters: occupancy, gridsize, blocksize, dynsmemperblock, stasmemperblock, registerperthread, gld coherent, gst coherent, branch, divergent branch, instructions, warp serialize.]

[Section 3 body text not recoverable. Surviving terms: GeForce GTX, BIOS, PCI Express, 12 V and 3.3 V supplies; Section 3.2 on the CUDA Profiler 3), occupancy, and SMs.]

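IPSJ SIG Technical Report

[...] executions in which the number of blocks was smaller than the number of SMs were excluded from the analysis.

3.3 Correlation analysis

We predict power consumption by linear regression, with the average power consumption as the dependent variable and the counter values per unit time as the explanatory variables. Let P be the power consumption, n the number of counter types, and p_i the performance counter values. Then

$P = c_0 + \sum_{i=1}^{n} c_i p_i$ (1)

We determine the coefficients c_i that predict P with the highest accuracy. To assess that accuracy we use the leave-one-out method: one sample is removed, the regression is fitted on the remaining samples, and the fitted model is used to evaluate how accurately the power consumption of the removed sample is predicted. This is repeated for every sample. Counter values that depend on execution time (such as instruction counts) are converted to values per unit time, and because the magnitudes differ between counter types, all values are standardized (mean 0, variance 1) before the regression. To investigate which counters correlate most strongly with power consumption, we compare the regression coefficients.

4. Preliminary experiment

4.1 Experimental environment

The GPU used is an NVIDIA GeForce GTX 285; its architecture is detailed in Table 2. The machine runs OpenSUSE 11 (kernel 2.6.25.2-.4-pae) on an AMD Phenom(tm) 9500 Quad-Core Processor (2.2 GHz). We used CUDA driver 2.2 and NVIDIA driver .18.8. Figure 2 shows the overall experimental setup.

Table 2: Details of the GeForce GTX 285
Total amount of global memory: 1 Gbyte
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 64 Kbyte
Total amount of shared memory per block: 16 Kbyte
Total number of registers available per block:
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512x512x64
Maximum sizes of each dimension of a grid: 65535x65535x1
Maximum memory pitch: 256 Kbyte
Texture alignment: 256 byte
Clock rate: 1.48 GHz

[Figure 2: Overall view of the machine. Labels: 12 V power supply, A/D converter, GPU, riser card (see Figure 3).]
[Figure 3: Riser card.]

To measure the power consumed by the GPU, two supplies must be measured: the power delivered over the 12 V lines of the ATX power supply and the power delivered through PCIe. The 12 V lines can be measured simply by attaching a current sensor, as shown in Figure 2. For the power delivered through PCIe, a riser card is inserted as shown in Figure 3, and the wires inside it that carry the 12 V and 3.3 V supplies must be measured. The ammeter is an ST-36 made by 株式会社シナジェテック, which uses clamp sensors that require no modification of the wiring. To minimize the skew between the timestamps of the ammeter and those of the GPU kernel functions, the ammeter is connected to the same machine that runs the GPU code. The sampling interval is 1 ms.

4.2 Measurement

The codes used in the experiment are the sample codes shipped with the CUDA SDK. Timestamps are taken immediately before and after each kernel call and are later matched against the times of the current measurements to compute the power. Because many kernels in the original codes run for only a very short time, we modified them to repeat the processing inside the kernel, which reduces measurement error, so that each kernel runs for 1 second.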
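The power calculation of Section 4.2 can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes the ammeter log is available as (timestamp, current per rail) pairs and that power is taken as the mean of the summed V x I values of the samples inside the kernel window. The function name `average_power`, the rail names, and all sample values are hypothetical.

```python
def average_power(samples, t_start, t_end, volts):
    """Mean power (W) over the samples with t_start <= timestamp <= t_end.

    samples: list of (timestamp_seconds, {rail_name: amperes}) pairs.
    volts:   {rail_name: volts} for every measured rail.
    """
    total, count = 0.0, 0
    for t, amps in samples:
        if t_start <= t <= t_end:
            # Instantaneous power is the sum of V * I over all measured rails.
            total += sum(volts[rail] * amps[rail] for rail in volts)
            count += 1
    if count == 0:
        raise ValueError("no current samples inside the kernel window")
    return total / count


# Hypothetical rails: ATX 12 V line plus the 12 V and 3.3 V wires on the riser card.
RAILS = {"atx_12v": 12.0, "pcie_12v": 12.0, "pcie_3v3": 3.3}

# Made-up 1 ms samples; the kernel is assumed to run from t = 0.001 s to t = 0.002 s.
samples = [
    (0.000, {"atx_12v": 2.0, "pcie_12v": 1.0, "pcie_3v3": 0.5}),
    (0.001, {"atx_12v": 8.0, "pcie_12v": 3.0, "pcie_3v3": 1.0}),
    (0.002, {"atx_12v": 10.0, "pcie_12v": 3.0, "pcie_3v3": 1.0}),
    (0.003, {"atx_12v": 2.0, "pcie_12v": 1.0, "pcie_3v3": 0.5}),
]
kernel_power = average_power(samples, 0.001, 0.002, RAILS)
print(kernel_power)
```

Averaging only the samples between the two timestamps is why the ammeter and the GPU code share one clock: any skew shifts the window and mixes idle samples into the kernel's average.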

[Page 4: body text not recoverable. Surviving fragments: leave-one-out results; a figure plotting the regression coefficient of each counter (blockSize, branch, warp_serialize, divergent_branch, dynsmemperblock, stasmemperblock, gridSize, occupancy, registerPerThread, gld_coherent, gst_coherent, instructions); Section 5.2 with mentions of GPGPU, DVFS 4), instructions (24.9%), and Maury et al. on branches 5).]
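The regression and leave-one-out procedure of Section 3.3 can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the least-squares fit is done through the normal equations, the helper names (`standardize`, `solve`, `fit`, `predict`, `leave_one_out`) are hypothetical, and the data are synthetic values generated from an exact linear relation rather than measurements.

```python
def standardize(rows):
    """Scale every column of a row-major matrix to mean 0 and variance 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, power):
    """Least-squares coefficients [c0, c1, ..., cn] of P = c0 + sum(ci * pi)."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the intercept c0
    k = len(design[0])
    ata = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    atb = [sum(d[i] * p for d, p in zip(design, power)) for i in range(k)]
    return solve(ata, atb)


def predict(coef, row):
    """Evaluate equation (1) for one sample."""
    return coef[0] + sum(c * p for c, p in zip(coef[1:], row))


def leave_one_out(rows, power):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_power = power[:i] + power[i + 1:]
        predictions.append(predict(fit(train_rows, train_power), rows[i]))
    return predictions


# Synthetic data: two per-unit-time counters, with power = 50 + 3*x1 - 2*x2.
counters = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 6.0]]
power = [43.0, 50.0, 43.0, 60.0, 53.0]

scaled = standardize(counters)
loo_predictions = leave_one_out(scaled, power)
coefficients = fit(scaled, power)  # coefficients[1:] are comparable across counters
```

Because every counter is standardized first, the magnitudes and signs of `coefficients[1:]` can be compared directly, which is the basis for ranking counters by their correlation with power.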

[Page 5: body text not recoverable. Surviving fragments: OpenMP; the percentages 17%, 26%, 7% and 9%; related work by Baghsorkhi et al. on GPUs 6) and by Da-Qi Ren; FMA. Acknowledged projects: ULP-HPC; Microsoft Technical Computing Initiative, "HPC-GPGPU: Large-Scale Commodity Accelerated Clusters and its Application to Advanced Structural Proteomics".]

References
1) Samuel S. Stone, Justin P. Haldar, Stephanie C. Tsao, Wen-mei W. Hwu, Zhi-Pei Liang, and Bradley P. Sutton. Accelerating advanced MRI reconstructions on GPUs. In CF '08: Proceedings of the 2008 conference on Computing frontiers, pp. [...], 2008.
2) [...] TSUBAME [...], Vol.5, No.2, pp. 1 16, 2009.
3) NVIDIA. CUDA Profiler, 2009.
4) [...] DVFS [...], No.8, pp. [...].
5) Matthew C. Maury, F. Blagojevic, C. D. Antonopoulos, and D. S. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems, Vol.19, No.1, pp. [...], 2008.
6) Sara Baghsorkhi and Wen-mei Hwu. Analytical performance prediction for evaluation and tuning of GPGPU applications. In Workshop on Exploiting Parallelism using GPUs and other Hardware-Assisted Methods (EPHAM 09), in conjunction with the International Symposium on Code Generation and Optimization (CGO) 2009, 2009.
