2016 (412825)
Abstract  Recognition by Deep Learning attracts attention because of its high recognition accuracy. Training a Deep Learning model requires a large amount of computation, so GPUs, which can process a large amount of data in parallel, are used to train quickly. However, a GPU handles only floating-point arithmetic, and its power consumption and latency are large. In recent years, therefore, dedicated hardware based on FPGAs has been studied, since an FPGA can perform fixed-point arithmetic, which needs less power and runs faster than floating-point arithmetic. In such fixed-point hardware most of the gates form multipliers, and the gate count grows in proportion to the product of the bit widths of the multiplier and the multiplicand, so the hardware scale can be reduced by making those bit widths the necessary minimum. Under the condition that a single fixed bit width is used for every layer, prior work has succeeded in reducing the width to 16 bits. In this study, aiming at smaller and faster hardware, the operation bit width is reduced to the necessary minimum for every layer of the neural network, both statically and dynamically. For a Convolutional Neural Network, it is shown that the multiplier scale is reduced by 69% compared with the conventional technique that uses 16 bits in all layers.
Contents

1 ........................................ 1
  1.1 .................................... 1
  1.2 .................................... 2
2 Deep Learning .......................... 3
  2.1 .................................... 4
  2.2 Multi-Layer Perceptron .............. 5
  2.3 Convolutional Neural Network ........ 6
    2.3.1 ................................. 6
    2.3.2 ................................. 7
  2.4 Deep Learning ....................... 8
    2.4.1 ................................. 9
    2.4.2 ................................ 10
3 Deep Learning ......................... 13
  3.1 GPU ................................ 13
  3.2 FPGA ............................... 14
4 ....................................... 15
  4.1 ................................... 15
  4.2 ................................... 16
  4.3 ................................... 17
5 ....................................... 19
  5.1 ................................... 19
  5.2 ................................... 19
  5.3 ................................... 19
6 ....................................... 22
  6.1 ................................... 22
........................................ 23
References ............................. 23
A ...................................... 25
  A.1 MNIST ............................. 25
  A.2 CIFAR-10 .......................... 28
B ...................................... 32
List of Figures

2.1 ...................... 4
2.2 ...................... 5
2.3 ...................... 6
2.4 ...................... 7
2.5 ..................... 10
2.6 ..................... 10
4.7 ..................... 15
4.8 ..................... 15
4.9 ..................... 16
4.10 .................... 16
List of Tables

4.1 MNIST ................ 18
4.2 CIFAR-10 ............. 18
5.3 MNIST ................ 20
5.4 CIFAR-10 ............. 20
1 Introduction

1.1 Background

Research on neural networks (NNs) dates back to the 1940s. After repeated periods of boom and stagnation, pre-training based on Restricted Boltzmann Machines (RBMs) made deep NNs trainable, and recognition by Deep Learning now achieves high accuracy. Training a Deep Learning model requires a large amount of computation, so GPUs, which can process a large amount of data in parallel, are widely used. However, a GPU handles only floating-point arithmetic and suffers from large power consumption and latency. For this reason, dedicated hardware based on FPGAs, which can perform low-power, high-speed fixed-point arithmetic, has been studied.
1.2 Purpose

Under the condition that the bit width is fixed for every layer, prior work succeeded in reducing the operation bit width of an NN to 16 bits [1]. In this study, aiming at smaller and faster hardware, the operation bit width of each layer of the NN is reduced to the necessary minimum, both statically and dynamically. The method is evaluated on a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN).
2 Deep Learning

Research on NNs began in the 1940s. Interest revived in the 1980s, but faded again in the 1990s because networks with three or more layers could not be trained well. In 2006 a pre-training technique made deep NNs trainable, which came to be called Deep Learning, and in 2012 a Deep Learning model won the ILSVRC image recognition competition by a large margin, bringing NNs back into the spotlight. This chapter describes the structure and training of the NNs used in Deep Learning.
2.1 Unit

Figure 2.1 shows a single unit of an NN. A unit multiplies each input x_i by a weight w_i, adds a bias b, and applies an activation function f:

    y = f(Σ_i w_i x_i + b)                     (1)

Typical activation functions are the sigmoid function, the hyperbolic tangent, and the rectified linear unit (ReLU):

    f(x) = 1 / (1 + exp(-x))                   (2)
    f(x) = tanh(x)                             (3)
    f(x) = max(0, x)                           (4)
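As a minimal illustration of Eq. (1) and the activation functions (2)-(4), the following plain-NumPy sketch (not part of the thesis code; the input and weight values are made up) computes the output of one unit:

```python
import numpy as np

def unit(x, w, b, f):
    """Single unit, Eq. (1): weighted sum of inputs plus bias, then activation f."""
    return f(np.dot(w, x) + b)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))   # Eq. (2)
tanh = np.tanh                                 # Eq. (3)
relu = lambda x: np.maximum(0.0, x)            # Eq. (4)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1
print(unit(x, w, b, relu))   # 0.1
```

With these values the weighted sum is 0.5 - 0.5 + 0.1 = 0.1, which ReLU passes through unchanged.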
2.2 Multi-Layer Perceptron

Figure 2.2 shows an MLP; the fully connected part of the CNN described later has the same structure. The output h_i^k of unit i in layer k is computed from the outputs h^(k-1) of the previous layer:

    h_i^k = f(b_i^k + w_i^(kT) h^(k-1))        (5)

The output layer uses the softmax function, which turns the outputs into a probability distribution:

    p_i = softmax_i(w_i x_i + b_i)
        = exp(w_i x_i + b_i) / Σ_j exp(w_j x_j + b_j)     (6)
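A NumPy sketch of one MLP layer (Eq. (5)) and the softmax output (Eq. (6)) follows; it is illustrative code, not the thesis implementation, and subtracting the maximum inside the softmax is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def mlp_layer(h_prev, W, b, f=np.tanh):
    """One MLP layer, Eq. (5): h_i^k = f(b_i^k + w_i^T h^(k-1))."""
    return f(W.dot(h_prev) + b)

def softmax(z):
    """Softmax output, Eq. (6); max-subtraction avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)
```

The outputs are positive, increase monotonically with their inputs, and sum to one, so they can be read as class probabilities.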
2.3 Convolutional Neural Network

A CNN stacks convolution layers and pooling layers, followed by fully connected layers and a softmax output layer.

2.3.1 Convolution layer

Figure 2.3 shows a convolution layer. When an n_x × n_y input is convolved with an n_w × n_w filter, the output size n'_x × n'_y is

    n'_x = n_x - n_w + 1
    n'_y = n_y - n_w + 1

2.3.2 Pooling layer

Figure 2.4 shows a pooling layer, which summarizes each small region P_i of its input into a single value. Typical pooling operations are average pooling (7), max pooling (8), and Lp pooling (9):

    h'_i = (1 / |P_i|) Σ_{j∈P_i} h_j                      (7)
    h'_i = max_{j∈P_i} h_j                                (8)
    h'_i = ((1 / |P_i|) Σ_{j∈P_i} h_j^P)^(1/P)            (9)

Average pooling corresponds to Lp pooling with P = 1, and max pooling to the limit P → ∞.

2.4 Deep Learning

Deep Learning trains the NN by minimizing a cost function. For classification, the cross-entropy

    C = - Σ_i d_i log p_i                                 (10)

is used, where d_i is 1 for the correct class and 0 otherwise; training adjusts the weights so that C decreases.
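The relation between the three pooling operations (7)-(9) can be checked numerically; this is a plain-NumPy sketch with a made-up pooling region, not thesis code:

```python
import numpy as np

def lp_pool(region, P):
    """Lp pooling over one region P_i, Eq. (9)."""
    return np.mean(np.abs(region) ** P) ** (1.0 / P)

region = np.array([1.0, 2.0, 3.0, 4.0])
print(lp_pool(region, 1))     # 2.5, the average (Eq. (7))
print(lp_pool(region, 100))   # close to 4.0, the max (Eq. (8))
```

With P = 1 the result is exactly the mean of the region, and as P grows the largest element dominates, approaching max pooling.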
2.4.1 Gradient descent

The weights are updated in the direction that decreases the cost C:

    w_ij ← w_ij + Δw_ij,   Δw_ij = -ε ∂C/∂w_ij            (11)

where ε is the learning rate.
2.4.2 Backpropagation

Figures 2.5 and 2.6 illustrate the computation. To update the weights with Eq. (11), the gradient ∂C/∂w_ij of the cost (10) is needed. For a weight w_ij feeding the output layer (Fig. 2.5), the gradient can be written directly as

    ∂C/∂w_ij = (p_i - d_i) h_j                            (12)

For deeper weights the output is a nested function f(f(f(...))), so the gradient is obtained with the chain rule. Consider the three layers l, i, j in Fig. 2.6 and let

    x_i = Σ_j w_ij h_j                                    (13)

be the input of unit i. Then

    ∂C/∂w_ij = (∂C/∂x_i)(∂x_i/∂w_ij)                      (14)

The first factor is written δ_i = ∂C/∂x_i. The second factor follows from (13):

    ∂x_i/∂w_ij = h_j                                      (15)

The input of a unit l in the layer above is x_l = Σ_i w_li h_i = Σ_i w_li f(x_i), so δ_i can be expressed through ∂C/∂x_l (l = 1, 2, ...):

    δ_i = ∂C/∂x_i = Σ_l (∂C/∂x_l)(∂x_l/∂x_i)              (16)

The first factor is δ_l = ∂C/∂x_l, and from x_l = Σ_i w_li f(x_i):

    ∂x_l/∂x_i = f'(x_i) w_li                              (17)

so that

    δ_i = f'(x_i) Σ_l δ_l w_li                            (18)

That is, the δ of a layer is computed from the δ_l of the layer above, propagating backwards from the output layer, where δ_l = p_l - d_l. Substituting δ_i into (14) and (15) gives the gradient for any weight:

    ∂C/∂w_ij = δ_i h_j                                    (19)
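As a numerical sanity check of Eqs. (12)-(19), the following NumPy sketch (made-up layer sizes and random weights, not thesis code) computes the backpropagated gradient of a tiny ReLU-softmax network and compares it with a finite-difference approximation:

```python
import numpy as np

rng = np.random.RandomState(0)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up sizes: 4 inputs -> 3 hidden (ReLU) -> 2 softmax outputs.
h0 = rng.randn(4)
W1 = 0.5 * rng.randn(3, 4)
W2 = 0.5 * rng.randn(2, 3)
d = np.array([1.0, 0.0])          # one-hot teacher signal

def forward(W1, W2):
    x1 = W1.dot(h0)               # Eq. (13): x_i = sum_j w_ij h_j
    h1 = relu(x1)
    x2 = W2.dot(h1)
    p = softmax(x2)
    C = -np.sum(d * np.log(p))    # cross-entropy, Eq. (10)
    return x1, h1, p, C

x1, h1, p, C = forward(W1, W2)

delta2 = p - d                               # output-layer delta (Eq. (12))
delta1 = relu_grad(x1) * W2.T.dot(delta2)    # hidden delta (Eq. (18))
gW2 = np.outer(delta2, h1)                   # dC/dw_ij = delta_i h_j (Eq. (19))
gW1 = np.outer(delta1, h0)

# Central-difference check of one weight against the analytic gradient.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
num = (forward(W1p, W2)[3] - forward(W1m, W2)[3]) / (2 * eps)
print(abs(num - gW1[0, 0]) < 1e-5)
```

The analytic and numeric gradients agree to within the finite-difference error, confirming the derivation.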
3 Deep Learning Hardware

Deep Learning is accelerated either with GPUs or, more recently, with dedicated hardware built on FPGAs. This chapter compares the two.

3.1 GPU

A GPU contains a large number of arithmetic cores and can process a large amount of data in parallel, so it is widely used to shorten the training time of NNs. However, a GPU supports only floating-point arithmetic, and its power consumption and latency are large.

3.2 FPGA

An FPGA is a device whose logic can be reconfigured by the user. It can implement fixed-point arithmetic, which requires fewer gates, less power, and less time than floating-point arithmetic, so FPGA-based dedicated hardware is studied as a low-power, low-latency alternative to the GPU.
4 Proposed Method

4.1 Static reduction of the bit width

[Figure 4.7: error against training epoch, static method]
[Figure 4.8]

In the conventional technique [1], the operation bit width is fixed to 16 bits for the entire NN. The static approach proposed here instead fixes the bit width of each layer in advance to the necessary minimum for that layer.
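The static approach fixes each layer's precision before training and rounds the parameters after every update; the Theano code in Appendix A does this with np.round on each layer's weights. A minimal NumPy-only sketch follows (the digit values, weights, and gradients here are illustrative, not the thesis settings):

```python
import numpy as np

# Per-layer digit counts, fixed in advance (static); illustrative values.
layer_digits = {"conv": 3, "fully": 3, "softmax": 3}

def apply_update(w, grad, lr, digits):
    """One gradient step, then round the weights to the layer's fixed
    number of decimal digits, emulating a reduced bit width."""
    return np.round(w - lr * grad, digits)

w = np.array([0.123456, -0.654321])
g = np.array([0.01, -0.02])
w = apply_update(w, g, 0.1, layer_digits["conv"])
print(w)   # [ 0.122 -0.652]
```

Every weight thus stays on a fixed grid of representable values throughout training, which is what a fixed-point multiplier of the corresponding width would see.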
Figures 4.7 and 4.8 show the behavior of the NN under the static reduction.

4.2 Dynamic reduction of the bit width

[Figure 4.9: error against training epoch, dynamic method]
[Figure 4.10]
Figures 4.9 and 4.10 show the behavior of the NN under the dynamic reduction: training starts at a higher precision, and whenever the error has not improved for 5 epochs the precision is lowered.

4.3 Experimental setup

The networks are implemented in Python with the Theano library, based on the Deep Learning Tutorials [2], and trained on a GPU with the MNIST and CIFAR-10 datasets. Tables 4.1 and 4.2 show the NN configurations. CONV, POOL, FULLY, and SOFTMAX denote convolution, pooling, fully connected, and softmax layers; M_i, R_i, C_i and M_o, R_o, C_o are the number, rows, and columns of the input and output feature maps; and K_r, K_c is the filter size.
Table 4.1: NN configuration for MNIST

Layer     M_i,R_i,C_i   K_r,K_c   M_o,R_o,C_o
INPUT     -             -         1,28,28
CONV      1,28,28       5,5       20,24,24
POOL      20,24,24      2,2       20,12,12
CONV      20,12,12      5,5       50,8,8
POOL      50,8,8        2,2       50,4,4
FULLY     800,-,-       -         500,-,-
SOFTMAX   500,-,-       -         10,-,-

Table 4.2: NN configuration for CIFAR-10

Layer     M_i,R_i,C_i   M_o,R_o,C_o
INPUT     -             3,32,32
FULLY     3072,-,-      1000,-,-
FULLY     1000,-,-      500,-,-
SOFTMAX   500,-,-       10,-,-
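The dynamic control of Section 4.2 (lower the precision once the validation error has failed to improve for 5 consecutive epochs) can be sketched as plain Python; the error sequence and digit values below are made up for illustration, though the thesis code in Appendix A uses the same patience-of-5 rule:

```python
def dynamic_digits(errors, start_digit=4, low_digit=3, patience=5):
    """Return the digit count used at each epoch: start at start_digit and
    drop to low_digit once the error fails to improve `patience` times in a row."""
    best, count, digit = float("inf"), 0, start_digit
    schedule = []
    for e in errors:
        if e >= best:        # no improvement this epoch
            count += 1
        else:                # new best error resets the counter
            best, count = e, 0
        if count == patience:
            digit = low_digit
        schedule.append(digit)
    return schedule

# Error stops improving after epoch 3, so precision drops 5 epochs later.
errors = [0.10, 0.08, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07, 0.06]
print(dynamic_digits(errors))   # [4, 4, 4, 4, 4, 4, 4, 3, 3]
```

Once the drop happens it is kept even if the error later improves, matching the one-way reduction in the appendix code.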
5 Evaluation

5.1 Datasets

Two datasets are used. MNIST [3] is a set of 70,000 28×28 grayscale images of the handwritten digits 0-9. CIFAR-10 [4] is a set of 60,000 32×32 color images in 10 classes.

5.2 Networks

The networks of Tables 4.1 and 4.2 are trained on MNIST and CIFAR-10, respectively.

5.3 Results

Tables 5.3 and 5.4 list the results; the leftmost column of each table is the 32-bit baseline and the second column the conventional 16-bit setting. Table 5.3 gives the CNN results on MNIST, where the DATA bit width could be reduced to 6 bits, and Table 5.4 gives the MLP results on CIFAR-10.
Table 5.3: Results for MNIST (each column is one bit-width configuration)

CONV (bit)     32   16   12   16   12   12   16   12   16   12   16
FULLY (bit)    32   16   12   16   12   16   12   12   16   12   16
SOFTMAX (bit)  32   16   12   12   16   12   16   12   16   12
DATA (bit)     32   16   16   16   16   16   6    6
Average (bit)  32   16   15.2 15.2 15.2 14   11   9.0
Accuracy (%)   99.0 98.9 98.8 98.7 98.4 98.4 98.9 98.7

Table 5.4: Results for CIFAR-10 (each column is one bit-width configuration)

FULLY (bit)    32   16   12   16   12   16   12   16
SOFTMAX (bit)  32   16   12   16   12   12   16   16
DATA (bit)     32   16   16   16   16   6
Average (bit)  32   16   15.5 15.5 14.0 11
Accuracy (%)   56.1 55.6 54.6 54.2 46.8 22.4

For the CNN, the average operation bit width could be reduced to 9.0 bits with almost no loss of accuracy, which reduces the multiplier scale by 69% compared with the conventional technique that uses 16 bits in all layers.
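Since the multiplier's gate count grows with the product of the operand bit widths (Section 1), the reported reduction can be roughly checked; this is a back-of-the-envelope sketch assuming the scale is proportional to the square of the average width, while the thesis's 69% figure presumably comes from the exact per-layer widths:

```python
# Multiplier scale ~ product of multiplier and multiplicand bit widths.
baseline = 16 * 16        # all layers at 16 bits
proposed = 9 * 9          # average 9 bits (Table 5.3)
reduction = 1.0 - proposed / float(baseline)
print("%.0f%%" % (reduction * 100))   # 68%, close to the reported 69%
```

The simple square-law estimate already reproduces the order of the reported saving.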
In the implementation, the bit width is emulated by rounding to a fixed number of decimal digits: 4 digits correspond to about 16 bits and 3 digits to about 12 bits. The reduced bit widths obtained here should allow a compact FPGA implementation.
6 Conclusion

6.1 Summary

Aiming at small and fast dedicated hardware for Deep Learning on an FPGA, this study reduced the operation bit width of each layer of the network to the necessary minimum, statically and dynamically. For the CNN trained on MNIST, the average bit width was reduced to 9 bits. Future work includes implementing the proposed method on an actual FPGA.
References

[1] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision," Proc. ICML-15, pp. 1737-1746, Feb. 2015.

[2] Deep Learning Tutorials, http://deeplearning.net/tutorial/ (accessed Jan. 29, 2016).

[3] THE MNIST DATABASE of handwritten digits, http://yann.lecun.com/exdb/mnist/ (accessed Feb. 29, 2016).

[4] The CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html (accessed Feb. 29, 2016).
A Source Code

A.1 MNIST

#####################
# Build layer-wise training functions
#####################
train_layer0 = theano.function(
    [index],
    layer0.output,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
    })

train_layer1 = theano.function(
    [layer1_input],
    layer1.output.flatten(2)
)

train_layer2 = theano.function(
    [layer2_input],
    layer2.output,
)

train_layer3 = theano.function(
    [index, layer3_input],
    layer3.negative_log_likelihood(y),
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    })

###################
# Round each layer's parameters to its digit count
###################
for i in xrange(8):
    # softmax layer
    if i == 0:
        params[i].set_value(np.round(params[i].get_value(), softmax_digit))
    # ReLU (fully connected) layer
    elif i == 2:
        params[i].set_value(np.round(params[i].get_value(), relu_digit))
    # convolution layers
    elif i == 4 or i == 6:
        params[i].set_value(np.round(params[i].get_value(), conv_digit))

####################
# Reduce the digit count when the validation error stops improving
####################
if this_validation_loss >= best_validation_loss:
    count += 1
    print "best_error %.3f, this_error %.3f, count %d" % (best_validation_loss, this_validation_loss, count)
else:
    count = 0
    print "best_error %.3f, this_error %.3f, count %d, best_error update" % (best_validation_loss, this_validation_loss, count)
    best_validation_loss = this_validation_loss
if count == 5:
    conv_digit = 3
    relu_digit = 3
    softmax_digit = 3
    print "digit updated"

A.2 CIFAR-10

#######################
# Reduce the digit count when the validation error stops improving
#######################
this_error = 1 - erv
if this_error >= best_error:
    count += 1
    print "best_error %.3f, this_error %.3f, count %d" % (best_error, this_error, count)
else:
    count = 0
    print "best_error %.3f, this_error %.3f, count %d, best_error update" % (best_error, this_error, count)
    best_error = this_error
if count == 5:
    softmax_digit = 4
    relu_digit = 4
    print "softmax_digit,relu_digit = 4"
#####################
# Round each layer's weights and biases to its digit count
#####################
for i, layer in enumerate(mlp.layers):
    # softmax layer
    if i == 2:
        layer.w.set_value( np.round( layer.w.get_value(), softmax_digit ) )
        layer.b.set_value( np.round( layer.b.get_value(), softmax_digit ) )
    # ReLU layers
    else:
        layer.w.set_value( np.round( layer.w.get_value(), relu_digit ) )
        layer.b.set_value( np.round( layer.b.get_value(), relu_digit ) )

#####################
# Train one step, rounding the intermediate data of each layer
#####################
def training( self, XL, tl, data_digit ):
    X = T.dmatrix( 'X' )
    t = T.dmatrix( 't' )

    Y0, Z0 = self.layers[0].output( X )
    Y1, Z1 = self.layers[1].output( Z0 )
    Y2, Z2 = self.layers[2].output( Z1 )

    cost = T.mean( _T_cost( Z2, t ) )

    updateslist = []
    for layer in self.layers:
        gradw = T.grad( cost, layer.w )
        Wnew = layer.w - 0.1 * gradw
        updateslist.append( ( layer.w, Wnew ) )

        if layer.withbias:
            gradb = T.grad( cost, layer.b )
            bnew = layer.b - 0.1 * gradb
            updateslist.append( ( layer.b, bnew ) )

    train_layer0 = theano.function( [X], Z0 )
    train_layer1 = theano.function( [Z0], Z1 )
    train_layer2 = theano.function( [Z1, X, t], cost, updates = updateslist )

    output_layer0 = train_layer0( XL )
    output_layer0 = np.round( output_layer0, data_digit )
    output_layer1 = train_layer1( output_layer0 )
    output_layer1 = np.round( output_layer1, data_digit )
    cost = train_layer2( output_layer1, XL, tl )

    return cost
B

MNIST CIFAR-10