

IBM Minsky Power-Performance Verification Report (Deep Learning and HPC Applications)
January 2017
Visual Technology, Inc.

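Survey of Minsky power consumption

(1) Running the deep learning demo program (mnist)
Demo program: Attachment 1
Computation time and power consumption were compared for CPU execution, single-GPU execution, and 2-GPU execution.

〇 Computation time
[Chart: computation time (seconds) for CPU execution, single-GPU execution, and multi-GPU execution (2 GPUs in this test); annotated "more than … times faster"]

〇 Power consumption
Power draw was displayed about every 10 seconds (Total Watt: total power consumption; GPU Watt: the GPU share of the total).

Time step   Chainer CPU          Chainer 1 GPU        Chainer 2 GPU
            Total (GPU)          Total (GPU)          Total (sum of 2 GPUs)
 10 s       708 W (55 W)         744 W (70 W)         816 W (75 W)
 20 s       636 W (55 W)         744 W (70 W)         768 W (85 W)
 30 s       708 W (50 W)         744 W (65 W)         792 W (75 W)
 40 s       636 W (50 W)         744 W (70 W)         612 W (55 W)
 50 s       624 W (55 W)         756 W (70 W)         -
 60 s       720 W (55 W)         744 W (65 W)         -
 70 s       600 W (55 W)         720 W (55 W)         -
 80 s       636 W (50 W)         720 W (55 W)         -
 90 s       600 W (55 W)         -                    -
100 s       600 W (55 W)         -                    -
110 s       612 W (50 W)         -                    -
120 s       612 W (55 W)         -                    -

In the GPU runs, total power rises by more than the increase in GPU power consumption.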
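The closing observation can be checked against the readings. The following is a minimal sketch, not part of the original report: it averages the first four time steps of the table above (the only steps for which all three runs have a reading) and compares the rise in total power with the rise in GPU power relative to the CPU run. The function names and the choice of four steps are assumptions made for illustration.

```python
# (total W, GPU W) readings for the first four 10-second steps of each run,
# copied from the power-consumption table.
readings = {
    "cpu":  [(708, 55), (636, 55), (708, 50), (636, 50)],
    "1gpu": [(744, 70), (744, 70), (744, 65), (744, 70)],
    "2gpu": [(816, 75), (768, 85), (792, 75), (612, 55)],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def averages(run):
    """Return (mean total W, mean GPU W) for one run."""
    total = mean([t for t, _ in readings[run]])
    gpu = mean([g for _, g in readings[run]])
    return total, gpu


def increase_over_cpu(run):
    """Return (rise in total W, rise in GPU W) relative to the CPU run."""
    base_total, base_gpu = averages("cpu")
    total, gpu = averages(run)
    return total - base_total, gpu - base_gpu


if __name__ == "__main__":
    for run in ("1gpu", "2gpu"):
        d_total, d_gpu = increase_over_cpu(run)
        print(f"{run}: total +{d_total:.2f} W, GPU +{d_gpu:.2f} W")
```

In both GPU runs the rise in total power is several times the rise in GPU power, which suggests that components other than the GPUs also draw more power while the GPUs are busy.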

(2) Matrix-vector product
Computation: matrix (7,000 x 7,000) x vector (1 column)
CPU execution (20 threads) compared with single-GPU execution
CPU program: Attachment 2
GPU program: Attachment 3

Time step   Matrix-vector product, CPU (20 threads)   Matrix-vector product (1 GPU)
            Total Watt (of which GPU Watt)            Total Watt (of which GPU Watt)
 10 s       984 W (50 W)                              744 W (65 W)
 20 s       972 W (50 W)                              732 W (65 W)
 30 s       936 W (50 W)                              744 W (65 W)
 40 s       912 W (50 W)                              744 W (65 W)
 50 s       888 W (50 W)                              744 W (65 W)
 60 s       864 W (50 W)                              744 W (65 W)
 70 s       924 W (50 W)                              744 W (65 W)
 80 s       792 W (50 W)                              744 W (65 W)
 90 s       948 W (50 W)                              744 W (65 W)
100 s       972 W (50 W)                              744 W (65 W)
110 s       768 W (55 W)                              736 W (65 W)
120 s       802 W (50 W)                              744 W (65 W)

CPU version
  Compiler: pgi fortran
  Version: 16.10
  Options: pgfortran -O3 -mp
GPU version
  Compiler: nvcc
  Version: 7.5
  Options: none
  Library: cublas

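〇 Computing performance and power consumption

                                          CPU execution        GPU execution
Power consumption (mean of 12 time steps) approx. 897 W        approx. 743 W
Computing performance                     approx. 220 GFlops   approx. 1500 GFlops
Performance per watt                      0.25 GFlops/Watt     2.0 GFlops/Watt

(Reference) K computer power-performance ratio: 0.83 GFlops/Watt
(Reference) JC-AHPC supercomputer power-performance ratio: 5.0 GFlops/Watt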
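The per-watt figures follow from dividing the measured performance by the mean of the twelve total-power readings. The sketch below, which is not part of the original report, repeats that arithmetic using the readings from the matrix-vector product table and the approximate GFlops values quoted above; the function names are illustrative.

```python
# Total-power readings (W) for the twelve 10-second steps of each run.
cpu_total_w = [984, 972, 936, 912, 888, 864, 924, 792, 948, 972, 768, 802]
gpu_total_w = [744, 732, 744, 744, 744, 744, 744, 744, 744, 744, 736, 744]

# Approximate performance figures quoted in the report.
cpu_gflops = 220.0
gpu_gflops = 1500.0


def mean_power(samples):
    """Mean power in watts over the sampled time steps."""
    return sum(samples) / len(samples)


def gflops_per_watt(gflops, samples):
    """Performance per watt: GFlops divided by mean power draw."""
    return gflops / mean_power(samples)


if __name__ == "__main__":
    for name, gflops, samples in (
        ("CPU", cpu_gflops, cpu_total_w),
        ("GPU", gpu_gflops, gpu_total_w),
    ):
        print(
            f"{name}: mean {mean_power(samples):.1f} W, "
            f"{gflops_per_watt(gflops, samples):.2f} GFlops/Watt"
        )
```

The results agree with the rounded values in the table to within about a watt, and the GPU run comes out roughly eight times more efficient per watt than the CPU run.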

Attachment 1: Sample program (mnist)
The check at line 55 makes the program use the GPU when args.gpu is specified (a non-negative GPU ID).

      1  #!/usr/bin/env python
      2  from __future__ import print_function
      3  import argparse
      4
      5  import chainer
      6  import chainer.functions as F
      7  import chainer.links as L
      8  from chainer import training
      9  from chainer.training import extensions
     10
     11
     12  # Network definition
     13  class MLP(chainer.Chain):
     14
     15      def __init__(self, n_units, n_out):
     16          super(MLP, self).__init__(
     17              # the size of the inputs to each layer will be inferred
     18              l1=L.Linear(None, n_units),  # n_in -> n_units
     19              l2=L.Linear(None, n_units),  # n_units -> n_units
     20              l3=L.Linear(None, n_out),  # n_units -> n_out
     21          )
     22
     23      def __call__(self, x):
     24          h1 = F.relu(self.l1(x))
     25          h2 = F.relu(self.l2(h1))
     26          return self.l3(h2)
     27
     28
     29  def main():
     30      parser = argparse.ArgumentParser(description='chainer example: MNIST')
     31      parser.add_argument('--batchsize', '-b', type=int, default=100,
     32                          help='number of images in each mini-batch')
     33      parser.add_argument('--epoch', '-e', type=int, default=20,
     34                          help='number of sweeps over the dataset to train')
     35      parser.add_argument('--gpu', '-g', type=int, default=-1,
     36                          help='gpu ID (negative value indicates CPU)')
     37      parser.add_argument('--out', '-o', default='result',
     38                          help='directory to output the result')
     39      parser.add_argument('--resume', '-r', default='',
     40                          help='resume the training from snapshot')
     41      parser.add_argument('--unit', '-u', type=int, default=1000,
     42                          help='number of units')
     43      args = parser.parse_args()
     44
     45      print('gpu: {}'.format(args.gpu))
     46      print('# unit: {}'.format(args.unit))
     47      print('# Minibatch-size: {}'.format(args.batchsize))
     48      print('# epoch: {}'.format(args.epoch))
     49      print('')
     50
     51      # Set up a neural network to train
     52      # Classifier reports softmax cross entropy loss and accuracy at every
     53      # iteration, which will be used by the PrintReport extension below.
     54      model = L.Classifier(MLP(args.unit, 10))
     55      if args.gpu >= 0:
     56          chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
     57          model.to_gpu()  # Copy the model to the GPU
     58
     59      # Setup an optimizer
     60      optimizer = chainer.optimizers.Adam()
     61      optimizer.setup(model)
     62
     63      # Load the MNIST dataset
     64      train, test = chainer.datasets.get_mnist()
     65
     66      train_iter = chainer.iterators.SerialIterator(train, args.batchsize)
     67      test_iter = chainer.iterators.SerialIterator(test, args.batchsize,
     68                                                   repeat=False, shuffle=False)
     69
     70      # Set up a trainer
     71      updater = training.StandardUpdater(train_iter, optimizer, device=args.gpu)
     72      trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)
     73
     74      # Evaluate the model with the test dataset for each epoch
     75      trainer.extend(extensions.Evaluator(test_iter, model, device=args.gpu))
     76
     77      # Dump a computational graph from 'loss' variable at the first iteration
     78      # The "main" refers to the target link of the "main" optimizer.
     79      trainer.extend(extensions.dump_graph('main/loss'))
     80
     81      # Take a snapshot at each epoch
     82      trainer.extend(extensions.snapshot(), trigger=(args.epoch, 'epoch'))
     83
     84      # Write a log of evaluation statistics for each epoch
     85      trainer.extend(extensions.LogReport())
     86
     87      # Save two plot images to the result dir
     88      trainer.extend(
     89          extensions.PlotReport(['main/loss', 'validation/main/loss'], 'epoch',
     90                                file_name='loss.png'))
     91      trainer.extend(
     92          extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'],
     93                                'epoch', file_name='accuracy.png'))
     94
     95      # Print selected entries of the log to stdout
     96      # Here "main" refers to the target link of the "main" optimizer again, and
     97      # "validation" refers to the default name of the Evaluator extension.
     98      # Entries other than 'epoch' are reported by the Classifier link, called by
     99      # either the updater or the evaluator.
    100      trainer.extend(extensions.PrintReport(
    101          ['epoch', 'main/loss', 'validation/main/loss',
    102           'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
    103
    104      # Print a progress bar to stdout
    105      trainer.extend(extensions.ProgressBar())
    106
    107      if args.resume:
    108          # Resume from a snapshot
    109          chainer.serializers.load_npz(args.resume, trainer)
    110
    111      # Run the training
    112      trainer.run()
    113
    114  if __name__ == '__main__':
    115      main()

Attachment 2: Matrix-vector product (CPU multithreaded version)
Lines 27 to 33 are the kernel.

      1        use omp_lib
      2        implicit double precision(a-h,o-z)
      3        allocatable a(:,:),b(:),c(:)
      4        dimension toms(10000),tome(10000)
      5        character*32 buff
      6  !$OMP parallel
      7        nth=omp_get_num_threads()
      8  !$OMP end parallel
      9        call getarg(1,buff)
     10        read(buff,*) n
     11        allocate(a(n,n),b(n),c(n))
     12        do i=1,n
     13        do j=1,n
     14        a(i,j)=1.d0/dble(i+j-1)
     15        enddo
     16        b(i)=1.
     17        enddo
     18        it=0
     19        t0=elaptime()
     20    100 continue                  ! loop head (label number assumed)
     21  !$OMP critical
     22        it=it+1
     23  !$OMP end critical
     24        if(it.gt.10000) goto 200
     25        c=0.0d0; s=0.0d0          ! zero both accumulators before each pass
     26        toms(it)=elaptime()
     27  !$OMP parallel do reduction(+:c)
     28        do j=1,n
     29        do i=1,n
     30        c(i)=c(i)+a(i,j)*b(j)
     31        enddo
     32        enddo
     33  !$OMP end parallel do
     34        tome(it)=elaptime()
     35  !$OMP parallel do reduction(+:s)
     36        do i=1,n
     37        s=s+c(i)*c(i)
     38        enddo
     39  !$OMP end parallel do
     40        s=dsqrt(s)
     41        if(mod(it,1000).eq.0)then
     42        write(6,*) it,s
     43  c     write(6,*) c
     44        endif
     45        b=c/s
     46        c=0.0d0
     47        goto 100
     48    200 continue                  ! loop exit (label number assumed)
     49        t1=elaptime()
     50        write(6,60) n,nth,t1-t0,1.d4*dble(2*n*n+4*n)/(t1-t0)*1.d-9
     51     60 format("qaz",2i6,2f12.6)
     52        write(60,61) tome-toms
     53        write(6,*) "s =",s
     54     61 format(1pd12.6)
     55        stop
     56        end
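What the CPU benchmark computes can be restated compactly. The following is a hedged pure-Python sketch, not the benchmarked program: it repeats the matrix-vector product on the same matrix a(i,j) = 1/(i+j-1), takes the 2-norm of the result, and renormalizes the vector, with both accumulators starting from zero on every pass. The function names, the small test size, and the iteration count are assumptions for illustration; the per-iteration operation count 2n^2 + 4n is the one used in the program's GFlops expression.

```python
import math


def build_matrix(n):
    """Matrix with a[i][j] = 1 / (i + j + 1) for 0-based i, j
    (the same values as a(i,j) = 1/(i+j-1) with 1-based indices)."""
    return [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]


def power_iteration(a, iterations):
    """Repeat c = A b, s = ||c||_2, b = c / s; return the final s and b."""
    n = len(a)
    b = [1.0] * n
    s = 0.0
    for _ in range(iterations):
        # Kernel: matrix-vector product, accumulated from zero each pass.
        c = [sum(a[i][j] * b[j] for j in range(n)) for i in range(n)]
        # 2-norm of the result, also accumulated from zero.
        s = math.sqrt(sum(x * x for x in c))
        # Normalize so the next pass starts from a unit vector.
        b = [x / s for x in c]
    return s, b


def flops_per_iteration(n):
    """Operation count per pass as used in the benchmark's GFlops formula."""
    return 2 * n * n + 4 * n


def gflops(n, iterations, seconds):
    """Sustained GFlops for the given number of passes and elapsed time."""
    return iterations * flops_per_iteration(n) / seconds * 1.0e-9


if __name__ == "__main__":
    s, _ = power_iteration(build_matrix(3), 50)
    print("norm after 50 passes on a 3 x 3 matrix:", s)
    print("operations per pass at n = 7000:", flops_per_iteration(7000))
```

Because the vector is renormalized on every pass, the norm s settles at the largest eigenvalue of the matrix, so the printed value of s gives a quick correctness check that is independent of the timing.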

Attachment 3: Matrix-vector product (GPU version)

    // dgemm CUDA test public domain
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "cublas.h"

    // Print an N x M column-major matrix in Matlab/Octave format
    void printmat(int N, int M, double *A, int LDA) {
        double mtmp;
        printf("[ ");
        for (int i = 0; i < N; i++) {
            printf("[ ");
            for (int j = 0; j < M; j++) {
                mtmp = A[i + j * LDA];
                printf("%5.2e", mtmp);
                if (j < M - 1) printf(", ");
            }
            if (i < N - 1) printf("]; ");
            else printf("] ");
        }
        printf("]");
    }

    extern double elaptime(void);  // wall-clock timer, defined elsewhere

    int main(int argc, char *argv[])
    {
        int n;
        double alpha, beta;
        int i, j, l, it;
        int nt, nl;

        cublasStatus statA, statB, statC;
        // n (matrix size) and nl (number of matrices) come from stdin,
        // nt (number of iterations) from the command line
        if (argc < 2 || scanf("%d %d", &n, &nl) != 2) {
            fprintf(stderr, "usage: %s nt  (n and nl are read from stdin)\n", argv[0]);
            return 1;
        }
        nt = atoi(argv[1]);
        double *devA[nl], *devB[nl], *devC[nl];
        double **A;
        double **B;
        double **C;
        double s1 = 0.0, s2 = 0.0, t1, t2, flop;
        cudaSetDevice(2);
        cublasInit();
        A = (double **) malloc(sizeof(double *) * nl);
        B = (double **) malloc(sizeof(double *) * nl);
        C = (double **) malloc(sizeof(double *) * nl);
        for (l = 0; l < nl; l++) {
            A[l] = (double *) malloc(sizeof(double) * n * n);
            B[l] = (double *) malloc(sizeof(double) * n * n);
            C[l] = (double *) calloc((size_t) n * n, sizeof(double));  // C starts at zero

            // cublasAlloc(element count, element size in bytes, device pointer)
            statA = cublasAlloc(n * n, sizeof(double), (void **) &devA[l]);
            statB = cublasAlloc(2 * n, sizeof(double), (void **) &devB[l]);
            statC = cublasAlloc(2 * n, sizeof(double), (void **) &devC[l]);
            for (i = 0; i < n; i++) {
                for (j = 0; j < n; j++) {
                    A[l][i * n + j] = 1.e0 / (double) (i + j + 1 + l);
                }
                B[l][i] = (double) (2 * i + 2);
                B[l][i + n] = (double) (2 * i + 3);
            }
        }
        printf("# start.\n");
        alpha = 1.0; beta = 1.0;
        t1 = elaptime();
        float elapsed_time_ms = 0.0f;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        // Upload A (n x n) and the two-column B and C (n x 2) to the device
        for (l = 0; l < nl; l++) {
            statA = cublasSetMatrix(n, n, sizeof(double), A[l], n, devA[l], n);
            statB = cublasSetMatrix(n, 2, sizeof(double), B[l], n, devB[l], n);
            statC = cublasSetMatrix(n, 2, sizeof(double), C[l], n, devC[l], n);
        }
        for (it = 0; it < nt; it++) {
            for (l = 0; l < nl; l++) {
                // devC = alpha * devA * devB + beta * devC
                cublasDgemm('n', 'n', n, 2, n, alpha, devA[l], n, devB[l], n, beta, devC[l], n);
                s1 = cublasDnrm2(n, devC[l], 1);      // norm of column 1
                s2 = cublasDnrm2(n, devC[l] + n, 1);  // norm of column 2
                statC = cublasGetMatrix(n, 2, sizeof(double), devC[l], n, C[l], n);
                // Host-side renormalization; B and C are not uploaded again,
                // so the device keeps accumulating into devC
                for (i = 0; i < n; i++) {
                    B[l][i] = C[l][i] / s1;
                    B[l][i + n] = C[l][i + n] / s2;
                    C[l][i] = C[l][i + n] = 0.0;
                }
            }
        }
        t2 = elaptime();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsed_time_ms, start, stop);

        printf("alpha = %5.3e\n", alpha);
        printf("beta = %5.3e\n", beta);
        printf(" sec= %lf\n", t2 - t1);
        flop = 2. * (double) (2 * n * n + 4 * n) * (double) (nt * nl);
        printf(" sec2= %lf\n", elapsed_time_ms);
        printf("s1 s2= %lf %lf\n", s1, s2);
        printf(" n %d nl %d nt %d\n", n, nl, nt);
        printf("#flop %lf\n GFlops %lf\n", flop, flop * 1.e-9 / (t2 - t1));
        for (l = 0; l < nl; l++) {
            cublasFree(devA[l]);
            cublasFree(devB[l]);
            cublasFree(devC[l]);
            free(A[l]); free(B[l]); free(C[l]);
        }
        cublasShutdown();
        free(A); free(B); free(C);
        return 0;
    }
