Image Recognition by Deep Learning
Hironobu Fujiyoshi and Takayoshi Yamashita, Chubu University
JRSJ Vol. 35 No. 3, pp. 180-185, 2017

Key Words: Deep Learning, Image Classification, Object Detection, Semantic Segmentation
Kasugai-shi, Aichi

1.
Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG): handcrafted features. 2010: deep learning, handcrafted features.

2.
[1]. Triplet loss function [2].

2.2
Haar-like features [3] with AdaBoost; HOG [4] with Support Vector Machine (SVM).

2.3
Bag-of-Features (BoF) [5], [6]; Vector of Locally Aggregated Descriptors (VLAD) [7]; SIFT [8]. 2016: Learned Invariant Feature Transform (LIFT) [9], SIFT. 2015: 1,000.

2.4

3.
ILSVRC (ImageNet Large Scale Visual Recognition Challenge): 1,000; 120; 1,000-class classification. 2012: Convolutional Neural Network (CNN) [10]. ILSVRC2012: CNN, AlexNet [11]. VGG [12]: 19 layers. GoogLeNet [13]. ResNet [14]: 152 layers. ResNet error rate: 3.56%; human error rate: 5.1%. Fig. 4: AlexNet.

4.
CNN, Region Proposal, CNN.

Fig. 5: Faster R-CNN. Fig. 6: [17]. Fig. 7: YOLO. Fig. 8: SSD [19].

Regions with Convolutional Neural Network (R-CNN) [15]: Selective search [16]; AlexNet, VGGNet; Selective search, CNN.
Faster R-CNN [17] (Fig. 5): Region Proposal Network (RPN); Fig. 6; k; Region Proposal. 2016.
Single-shot CNN: YOLO (You Only Look Once) [18] (Fig. 7): 1,024; i, j; x, y, w, h.
Single Shot Multi-Box Detector (SSD) [19] (Fig. 8). Fig. 9: SSD.
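The RPN entry above mentions k. In Faster R-CNN [17], k is the number of reference boxes (anchors) evaluated at each sliding-window position. The sketch below shows one plausible way to enumerate them; the function name `make_anchors`, the scales, and the aspect ratios are assumptions chosen for illustration, not values taken from this article.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2)
    centred on (cx, cy). Scales and ratios are assumed example values."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Hypothetical usage: anchors for a window centred at the origin.
anchors = make_anchors(0.0, 0.0)
k = len(anchors)
```

With three scales and three ratios this yields k = 9 boxes per position; a network of this kind would then score each box and regress its coordinates.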

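Fig. 9: Object detection results by SSD (quoted from [19]).
Fig. 10: Structure of the Fully Convolutional Network (FCN).

5. Semantic Segmentation

In computer vision, semantic segmentation is a difficult task that has been studied for many years, and it was thought that achieving highly accurate semantic segmentation would take time. As with other tasks, however, deep-learning-based methods have been proposed and now outperform conventional methods. In 2012, the year CNNs drew attention, a method was proposed that combines feature maps obtained from a three-layer CNN with a superpixel method [20]. That method has to integrate multiple networks with a separate technique and therefore requires complex processing.

The Fully Convolutional Network (FCN) [21] is a method that can be trained and can perform labeling end-to-end using only a CNN. Fig. 10 shows the structure of FCN. FCN is a network with no fully connected layers. Repeatedly applying convolution and pooling layers to the input image makes the resulting feature map smaller. To bring it back to the size of the original image, the feature map is enlarged 32 times at the final layer and then convolved; this is called deconvolution. The final layer outputs a probability map for each class to be labeled, trained so that each pixel holds the probability that the class is present there. Enlarging the feature map in this way yields a coarse segmentation result, so FCN improves accuracy by also integrating feature maps from intermediate layers.

In general, the closer an intermediate CNN feature map is to the input layer, the finer the detail it captures. Pooling merges this information, and fine detail is lost. Such detail is unnecessary for object recognition but important for semantic segmentation. FCN therefore integrates feature maps from the middle of the network at the final layer. Depending on the size of the feature maps used in this integration, there are three variants: FCN-32s, FCN-16s, and FCN-8s. FCN-8s adds the feature maps from the third and fourth pooling stages to the input of the final layer. To match all feature maps to the size of the third-pooling feature map, the fourth-pooling feature map is enlarged 2 times and the feature map just before the final layer is enlarged 4 times. These feature maps are concatenated along the channel axis and convolved to output a segmentation result of the same size as the original image.

FCN has to keep the intermediate feature maps in memory, so its memory use is large. SegNet [22], [23] uses an encoder-decoder structure that does not need to store intermediate feature maps. As shown in Fig. 11 (a), the SegNet encoder repeatedly applies convolution and pooling. The decoder enlarges the feature maps produced by the encoder through deconvolution and outputs a segmentation result at the original image size. In these steps, as shown in Fig. 11 (b), each pooling operation in the encoder records the positions it selected, and when the decoder enlarges a feature map it inserts values only at the corresponding positions. This restores fine detail without using the intermediate feature maps.

PSPNet [24] uses the Pyramid Pooling Module, which enlarges the encoder's feature map at multiple scales, to capture information at different scales. As shown in Fig. 12, the Pyramid Pooling Module pools, at several sizes, the encoder feature map whose height and width have each been reduced to 1/8 of the original image.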
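As a rough illustration of the FCN-8s size matching and the SegNet pooling-position mechanism described above, the sketch below works on tiny single-channel maps held in nested Python lists. Nearest-neighbour repetition stands in for the learned deconvolution (an assumption made for brevity), and every function name here is hypothetical rather than taken from any FCN or SegNet implementation.

```python
def upsample_nearest(fmap, factor):
    """Enlarge a 2-D map by an integer factor (nearest-neighbour stand-in
    for learned upsampling)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fcn8s_stack(pool3, pool4, final):
    """Bring pool4 (x2) and the final map (x4) to pool3's size and stack
    the three maps along a channel axis."""
    return [pool3, upsample_nearest(pool4, 2), upsample_nearest(final, 4)]


def max_pool_with_indices(fmap, size=2):
    """Max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for r in range(0, len(fmap), size):
        prow, irow = [], []
        for c in range(0, len(fmap[0]), size):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(size) for dc in range(size)]
            value, pos = max(window, key=lambda t: t[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Enlarge by writing each value back only at its remembered
    position; every other cell stays zero."""
    out = [[0] * width for _ in range(height)]
    for prow, irow in zip(pooled, indices):
        for value, (r, c) in zip(prow, irow):
            out[r][c] = value
    return out


# Hypothetical 4x4 feature map used only for this example.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [2, 6, 3, 1]]
pooled, idx = max_pool_with_indices(fmap)
restored = max_unpool(pooled, idx, 4, 4)
stack = fcn8s_stack(fmap, pooled, [[9]])
```

Here `restored` is mostly zeros, with the four pooled maxima placed back at their original positions. That is why the decoder only needs the stored positions and not the full intermediate feature maps, whereas `fcn8s_stack` must hold all three maps at once.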

Fig. 11: SegNet. Fig. 12: PSPNet [24]. Fig. 13: [27]. Fig. 14: SegNet [23].

PSPNet: 2016 ILSVRC Scene Parsing; Cityscapes Dataset [25].
CRFasRNN [26]: CNN and Conditional Random Field (CRF); CRF, CNN, end-to-end.
CNN [27] (Fig. 13): Faster R-CNN, end-to-end. Fig. 14: SegNet.

6.
end-to-end.

[1] vol.48, no.SIG16, pp.1-24, 2007.
[2] D. Cheng, Y. Gong, S. Zhou, J. Wang and N. Zheng: Person re-identification by multi-channel parts-based CNN with improved triplet loss function, IEEE Conference on Computer Vision and Pattern Recognition.
[3] P. Viola and M. Jones: Rapid object detection using a boosted cascade of simple features, IEEE Computer Society Computer Vision and Pattern Recognition, vol.1, pp.511-518, 2001.
[4] N. Dalal and B. Triggs: Histograms of Oriented Gradients for Human Detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol.1.
[5] G. Csurka, C.R. Dance, L. Fan, J. Willamowski and C. Bray: Visual Categorization with Bags of Keypoints, ECCV Workshop on Statistical Learning in Computer Vision, pp.1-22.
[6] F. Perronnin and C. Dance: Fisher Kernels on Visual Vocabularies for Image Categorization, IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[7] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez and C. Schmid: Aggregating Local Image Descriptors into Compact Codes, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.34, no.9.
[8] D.G. Lowe: Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, vol.60.
[9] K.M. Yi, E. Trulls, V. Lepetit and P. Fua: LIFT: Learned Invariant Feature Transform, European Conference on Computer Vision, vol.9910.
[10] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-based learning applied to document recognition, Proc. of the IEEE, vol.86, no.11.
[11] A. Krizhevsky, I. Sutskever and G.E. Hinton: ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems.
[12] K. Simonyan and A. Zisserman: Very deep convolutional networks for large-scale image recognition, arXiv preprint.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich: Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition, pp.1-9.
[14] K. He, X. Zhang, S. Ren and J. Sun: Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition.
[15] R. Girshick, J. Donahue, T. Darrell and J. Malik: Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition.
[16] J.R.R. Uijlings, K.E.A. Van De Sande, T. Gevers and A.W.M. Smeulders: Selective search for object recognition, International Journal of Computer Vision, vol.104, no.2.
[17] S. Ren, K. He, R. Girshick and J. Sun: Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, pp.91-99.
[18] J. Redmon, S. Divvala, R. Girshick and A. Farhadi: You only look once: Unified, real-time object detection, IEEE Conference on Computer Vision and Pattern Recognition.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A.C. Berg: SSD: Single shot multibox detector, European Conference on Computer Vision, pp.21-37.
[20] C. Farabet, C. Couprie, L. Najman and Y. LeCun: Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, no.8.
[21] J. Long, E. Shelhamer and T. Darrell: Fully convolutional networks for semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition.
[22] V. Badrinarayanan, A. Kendall and R. Cipolla: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, arXiv preprint.
[23] A. Kendall, V. Badrinarayanan and R. Cipolla: Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding, arXiv preprint.
[24] H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia: Pyramid Scene Parsing Network, arXiv preprint.
[25] The Cityscapes Dataset.
[26] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du and P.H. Torr: Conditional Random Fields as Recurrent Neural Networks, IEEE International Conference on Computer Vision.
[27] J. Dai, K. He and J. Sun: Instance-aware semantic segmentation via multi-task network cascades, IEEE Conference on Computer Vision and Pattern Recognition.

Hironobu Fujiyoshi, Takayoshi Yamashita
