
1 NAIST-IS-MT


Finding Important People in a Video using a Deep Neural Network with Conditional Random Field

Atsushi Nishida

Abstract

Finding important regions is essential for applications such as content-aware video compression and video retargeting, which automatically crops an important region of a video for small screens. Various models for important region estimation have been proposed. Since people are one of the main subjects of videos, some methods for finding important regions use face detection. However, these existing methods usually do not distinguish important people from passers-by in a video. This thesis proposes a method that classifies the people in a video frame as important or non-important. In general, this classification problem is not well defined because who is important may differ from viewer to viewer. Therefore, instead of the viewers' perspective, we use the videographer's perspective. That is, our method finds people who are important to the videographer. Since viewers try to understand what the videographer wants to express in the video, the people important to viewers and to videographers are likely to be highly correlated. Videographers are considered to have certain tendencies, e.g., in how they move the camera while shooting, such as placing important people near the center of the video frame. Since such behavior of videographers is reflected in the trajectories and sizes of face regions, we use them as features for the classification. In addition, as visual cues like the orientation of faces are helpful for important person classification, the proposed method exploits visual features such as color histograms. The proposed method uses a conditional random field (CRF) built upon a deep neural network (DNN), which can capture various types of relationships among the people in a video frame, such as spatial ones, in order to facilitate the classification. Experimental results demonstrate that our models trained on a dataset of user-generated videos achieve an accuracy of over 80%. Our experiments also verify the effectiveness of the proposed model and the effect of the conditional random field by comparing our model with baselines, such as a support vector machine and a DNN without a CRF.

Keywords: Neural network, Conditional random field, Important people classification

Master's Thesis, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-MT , March 16,


9 1. [3,4] [5 8] [1,9,10] [2, 11, 12] Itti [1] Itti [1] Yang [2] Ma [11] Ma [11] 1 2 2(a) 1 2(b) 1 1


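Figure 1: Examples of important and non-important people

(a) When all people are treated as the important region (b) When the importance of each person is taken into account
Figure 2: Example of retargeting applied to Figure 1

... the performance of applications such as retargeting may be degraded. In this work, in order to extract important regions that contain only the important people from videos that capture multiple people, we address the problem of estimating the importance of the people in a video. Specifically, we develop a classifier that decides, for each person detected in a video, whether that person is important in the video or is a non-important person who was captured by chance. By using the classification results to remove the regions of non-important people from the candidates for the important region, it becomes possible to estimate important regions that do not include non-important people. In general, whether a person in a video is important or not differs from viewer to viewer and cannot be uniquely ...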

11 1 1 (Deep Neural Network: DNN) (Conditional Random Fields: CRF) CRF DNN CRF End-to-End YouTube 3

12 2 CRF DNN 3 CRF DNN 4 5 4

13 [1,9,10,13] Itti [1] Itti Baldi [9,10] Bayesian Surprise Achanta [13, 14] Lab 3(b) 3(a) 3(c) Itti [1] 3(d) Yang [2] 5

14 (a) (b) (c) (d) 3: Itti [1] 4 Ma [11] [12] Ma [11] 6


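(a) Input image (b) Importance map emphasizing bicycles (c) Importance map emphasizing cars (d) Importance map emphasizing people
Figure 4: Important region estimation by the method of Yang et al. [2]

... Methods that use presence or absence as the indicator for important region estimation do not consider how important each person is in the video, so in videos containing multiple people they may include people of low importance in the important region. When unimportant people are included in the important region, the performance of applications such as retargeting may be degraded. To address this problem, Nakashima et al. [15] proposed a method that identifies the important people in videos containing multiple people from the videographer's point of view. Nakashima et al. adopted a model based on a conditional random field, built on two ideas: that important people in the same frame are correlated in size and motion trajectory, and that important and non-important people do not switch roles within a short period. In this thesis, to further improve accuracy, we extend the method of Nakashima et al. [15, 16] with a classification method that uses a DNN incorporating a CRF ...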

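CRF DNN

2.2 CRF (Markov Random Field: MRF)

A CRF models the conditional probability p(y | x) of the labels y given the observations x:

\[ p(y \mid x) = \frac{1}{Z} e^{-E(y, x)} \qquad (1) \]

\[ E(y, x) = \sum_{i} f_i(x_i \mid y) + \sum_{ij} f_{ij}(x_i, x_j \mid y) \qquad (2) \]

Here E(y, x) is the energy function, f_i(x_i | y) is the term defined on a single x_i, f_ij(x_i, x_j | y) is the term defined on a pair x_i, x_j, and Z is the partition function.

DNN DNN CRF [17-23] Bengio [17] (Convolutional Neural Networks: CNN) Yao Wang [18, 19] CRF Ma [24] Long Short Term Memory CRF CRF CNN [20-23] [25] [26] Arnab [22] CNN 4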

17 CRF Farabet [27] CNN CRF CNN CRF Liu [26] CNN CRF Chanra [23] CRF DNN End-to-End CRF CRF Contrastive Divergence [28] 2.3 DNN Nakashima [15,16] CRF DNN CRF CRF 9

18 CRF DNN CRF

19 5: 11


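(a) 100 frames before the target frame (b) Target frame (c) 100 frames after the target frame (d) Trajectory of the person obtained by tracking
Figure 6: Example of tracking

... The importance of a person is assumed to be reflected in the person's position and size in the video, so features obtained from the person's motion are used for important person classification. First, each person detected in the target frame is tracked over the 100 frames before and after it to obtain the changes in the size and position of that person's face region. This method adopts the KCF tracker [29] to track face regions. Figure 6 shows an example of tracking. Figure 6(a) shows the frame 100 frames before the target frame and Figure 6(c) the frame 100 frames after it; the blue rectangles are face regions. The yellow line in Figure 6(d) shows how the center of the face region changes. From the face regions of a person i over the 100 frames before and after the target frame, the coordinates and size are extracted, and these 3-dimensional vectors are concatenated into x^m_i ∈ R^600, which serves as the motion feature of the person. If the tracked person disappears from the screen because of movement or occlusion, like the person in the back in Figure 6(b), tracking is stopped and the size and position of the face region in the remaining frames are set to 0.

Appearance features of people

Important people are captured facing the camera, or at least so that their faces are visible ...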
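The construction of the 600-dimensional motion feature can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the function name `motion_feature`, the split into a backward and a forward track, and the choice of exactly 100 frames on each side (200 frames times 3 values) are assumptions made here so that the length comes out to 600. Each tracked frame contributes (center x, center y, size), and frames after the tracker has stopped are filled with zeros, as described above.

```python
# Hypothetical sketch: build a 600-dim motion feature from tracked face boxes.
# Assumes 100 frames before and 100 frames after the target frame, with
# (cx, cy, size) per frame; the exact frame indexing is an assumption.

ZERO_BOX = (0.0, 0.0, 0.0)


def motion_feature(backward, forward, half_window=100):
    """Concatenate (cx, cy, size) triples into one flat feature list.

    backward / forward: lists of (cx, cy, size) tuples for the frames before /
    after the target frame, ordered from the frame nearest to the target frame
    outward. A list shorter than half_window means tracking stopped early.
    """
    def pad(boxes):
        boxes = list(boxes)[:half_window]
        # Frames after tracking stopped get zero position and zero size.
        return boxes + [ZERO_BOX] * (half_window - len(boxes))

    past = pad(backward)[::-1]   # reorder into chronological order
    future = pad(forward)
    return [value for box in past + future for value in box]


# Example: the person is tracked for 1 frame backward and 2 frames forward.
feature = motion_feature([(1.0, 2.0, 3.0)], [(4.0, 5.0, 6.0), (7.0, 8.0, 9.0)])
print(len(feature))  # 600
```

Padding with zeros keeps the feature length fixed regardless of how long the person stays visible, which is what allows the vector to be fed to a fixed-size input layer.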

Figure 7: [panels (a) to (d): histogram plots, horizontal axis "x", vertical axis "Histogram"]

DNN [30] 2 R G B 50 x^l_i ∈ R (a) 7(c) 7(d) DNN DNN FaceNet [30] x^l_i
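A color-histogram appearance feature of the kind mentioned in the abstract might be computed as in the sketch below. The details are assumptions made for illustration: 50 bins for each of the R, G, and B channels (which gives 150 dimensions), 8-bit pixel values, and normalization of each channel histogram to sum to 1. The function name `rgb_histogram` is hypothetical, and the region the pixels are taken from is not specified here.

```python
# Hypothetical sketch: a 150-dim color histogram (50 bins x 3 channels).
# Bin count, value range, and normalization are assumptions.


def rgb_histogram(pixels, bins=50):
    """Return concatenated, normalized R, G, B histograms.

    pixels: iterable of (r, g, b) tuples with integer values in 0..255.
    """
    pixels = list(pixels)
    hist = [0.0] * (3 * bins)
    for rgb in pixels:
        for channel, value in enumerate(rgb):
            # Map 0..255 onto bin indices 0..bins-1.
            index = min(value * bins // 256, bins - 1)
            hist[channel * bins + index] += 1.0
    if pixels:
        hist = [count / len(pixels) for count in hist]
    return hist


# Example with two pixels: pure black and pure white.
x_l = rgb_histogram([(0, 0, 0), (255, 255, 255)])
print(len(x_l))  # 150
```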

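CRF i x^m_i x^l_i f_i

\[ h^m_i = \rho(W_m x^m_i + b_m) \qquad (3) \]

\[ h^l_i = \rho(W_l x^l_i + b_l) \qquad (4) \]

\[ f_i = \rho(W h^{ml}_i + b_{ml}) \qquad (5) \]

Here W_m \in \mathbb{R}^{\cdots}, W_l \in \mathbb{R}^{d \times 100}, and W \in \mathbb{R}^{\cdots} are weight matrices, and d is the dimensionality of x^l_i: d = 150 for the color histogram feature and d = 128 for the DNN feature. ρ is the Rectified Linear Unit [31]. In (5), h^{ml}_i is the concatenation of h^m_i and h^l_i.

CRF: for the features f_i (i = 1, ..., I) of the people, labels t_1, ..., t_I are assigned, where t_i = 1 if person i is important and t_i = 0 otherwise. The unary potentials of the CRF are

\[ \phi_0(f_i) = \rho(v_0^\top f_i + k_0) \qquad (6) \]

\[ \phi_1(f_i) = \rho(v_1^\top f_i + k_1) \qquad (7) \]

where v_0, v_1 \in \mathbb{R}^{100} and k_0, k_1 are the parameters. φ_0(f_i) and φ_1(f_i) correspond to person i taking labels 0 and 1, respectively. t_i = 0 ...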
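The feature fusion of Eqs. (3) to (5) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the thesis code: h^{ml}_i is taken to be the concatenation of h^m_i and h^l_i, weights are stored as lists of rows so that `affine` computes W x + b directly, and the tiny layer sizes in the example are chosen only so the arithmetic can be checked by hand. The names `relu`, `affine`, and `fuse_features` are hypothetical.

```python
# Hypothetical sketch of Eqs. (3)-(5): two ReLU branches fused by a third layer.
# Concatenation of the branch outputs is an assumption.


def relu(vector):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in vector]


def affine(weights, x, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse_features(x_m, x_l, params):
    h_m = relu(affine(params["W_m"], x_m, params["b_m"]))    # Eq. (3)
    h_l = relu(affine(params["W_l"], x_l, params["b_l"]))    # Eq. (4)
    h_ml = h_m + h_l                                          # list concatenation
    return relu(affine(params["W"], h_ml, params["b_ml"]))   # Eq. (5)


# Toy parameters (sizes far smaller than in the thesis, for illustration only).
toy_params = {
    "W_m": [[1.0, -1.0]], "b_m": [0.0],
    "W_l": [[1.0]], "b_l": [-5.0],
    "W": [[2.0, 3.0], [-1.0, 0.0]], "b_ml": [0.5, 0.0],
}
f_i = fuse_features([2.0, 1.0], [3.0], toy_params)
print(f_i)  # [2.5, 0.0]
```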

23 8: 15

2

\[ \psi_{00}(f_{ij}) = \rho(u_{00}^\top f_{ij} + c_{00}) \qquad (8) \]

\[ \psi_{01}(f_{ij}) = \rho(u_{01}^\top f_{ij} + c_{01}) \qquad (9) \]

\[ \psi_{10}(f_{ij}) = \rho(u_{10}^\top f_{ij} + c_{10}) \qquad (10) \]

\[ \psi_{11}(f_{ij}) = \rho(u_{11}^\top f_{ij} + c_{11}) \qquad (11) \]

f_ij (5) f_i, f_j ψ_00(f_ij), ψ_01(f_ij), ψ_10(f_ij), ψ_11(f_ij) 2 (0 (1)

T = {t_i | i = 1 ... I} F = {f_i | i = 1 ... I} E(T | F)

\[ E(T \mid F) = \sum_{i} \phi_{t_i}(f_i) + \sum_{ij} \psi_{t_i t_j}(f_{ij}) \qquad (12) \]

\[ p(T \mid F) = \frac{1}{Z} e^{-E(T \mid F)} \qquad (13) \]

Z

\[ Z = \sum_{T} e^{-E(T \mid F)} \qquad (14) \]

(14) p(T | F) Z CRF Contrastive Divergence [28]
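To make Eqs. (12) to (14) concrete, the sketch below evaluates the energy of a labeling and normalizes over all 2^I labelings by exhaustive enumeration. It rests on several assumptions: the sign convention p(T | F) = e^{-E(T | F)} / Z, pairwise terms supplied for whichever pairs (i, j) the caller chooses, and potentials passed in as precomputed tables rather than computed from f_i and f_ij. Enumeration is practical only for a handful of people and is not claimed to be how the thesis computes or approximates Z. The names `energy` and `label_distribution` are hypothetical.

```python
# Hypothetical sketch of Eqs. (12)-(14) with brute-force normalization.
# Assumes p(T | F) is proportional to exp(-E(T | F)).
import itertools
import math


def energy(labels, unary, pairwise):
    """E(T|F): sum of unary terms plus pairwise terms for the given labeling.

    unary[i][t]             : potential of person i taking label t (0 or 1)
    pairwise[(i, j)][ti][tj]: potential of the pair (i, j) taking labels (ti, tj)
    """
    total = sum(unary[i][t] for i, t in enumerate(labels))
    total += sum(table[labels[i]][labels[j]]
                 for (i, j), table in pairwise.items())
    return total


def label_distribution(unary, pairwise):
    """Return {labeling: probability} over all 2^I binary labelings."""
    configs = list(itertools.product((0, 1), repeat=len(unary)))
    weights = [math.exp(-energy(c, unary, pairwise)) for c in configs]
    z = sum(weights)  # partition function, Eq. (14)
    return {c: w / z for c, w in zip(configs, weights)}


# Two people; label 1 is cheap for person 0 and label 0 is cheap for person 1.
unary = [[2.0, 0.0], [0.0, 2.0]]
pairwise = {(0, 1): [[0.0, 0.0], [0.0, 0.0]]}
dist = label_distribution(unary, pairwise)
print(max(dist, key=dist.get))  # (1, 0)
```

Under this sign convention the labeling with the lowest energy receives the highest probability, which is why (1, 0) wins in the example.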

25 : (14) Z 17

Z

\[ \phi(f_i) = V^\top f_i + K \qquad (15) \]

\[ \psi(f_{ij}) = U^\top f_{ij} + C \qquad (16) \]

V = (v_0 v_1), K = (k_0 k_1), U = (u_00 u_01 u_10 u_11), C = (c_00 c_01 c_10 c_11)

ϕ ψ ϕ (1) (0) 2 ψ 4 E ϕ, ψ

3.4

L

\[ L(T_m, F_m) = -\sum_{m} \log p(T_m \mid F_m) \qquad (17) \]

T_m F_m m Dropout [32] [33]
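Each term of the loss in Eq. (17) can be expanded into an energy term and a log-partition term. The derivation below assumes the sign convention p(T | F) = e^{-E(T | F)} / Z for Eqs. (13) and (14); Z_m is notation introduced here for the partition function of the m-th training sample.

```latex
\begin{align*}
-\log p(T_m \mid F_m)
  &= -\log \frac{e^{-E(T_m \mid F_m)}}{Z_m} \\
  &= E(T_m \mid F_m) + \log Z_m,
  \qquad Z_m = \sum_{T} e^{-E(T \mid F_m)} .
\end{align*}
```

Minimizing the loss therefore lowers the energy of the ground-truth labeling relative to all other labelings, and the sum over T inside Z_m is the part that grows exponentially with the number of people.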

27 4. CRF 4.1 [15] YouTube YouTube 20 YouTube 6 YouTube , , , , 764 [15] YouTube 19

28 (a) (b) 10: 55, , CRF 20

29 1: YouTube ,955 82, ,655 39, ,336 37,431 Nakashima [15] Nakashima CRF CRF CRF CRF DNN (1) (5) 5 (1) Nakashima [15] (2) (3) CRF (4) ( ) (5) 21

(1) (2) (3) (4) (2) (2) CRF CRF (3) CRF (3) CRF 1 Softmax Cross-Entropy

\[ z_a = \frac{\exp(u_a)}{\sum_{b=0}^{1} \exp(u_b)} \qquad (18) \]

a (a = 0, 1) u_0 u z_0 z_1 i f_i t_i

\[ t_i = \begin{cases} 1 & (z_1 \geq 0.5) \\ 0 & (\text{otherwise}) \end{cases} \qquad (19) \]

(4) ( ) (4) CRF (4) CRF Chainer [34]
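The two-class softmax of Eq. (18) and the thresholding rule of Eq. (19) can be sketched as follows. The comparison operator in Eq. (19) did not survive extraction, so the use of "greater than or equal to" here is an assumption; subtracting the maximum before exponentiating is a standard numerical-stability step added in this sketch, not something stated in the text. The function names are hypothetical.

```python
# Hypothetical sketch of Eqs. (18)-(19): softmax over two outputs, then a
# threshold at 0.5. The ">=" comparison is an assumption.
import math


def softmax2(u0, u1):
    """Return (z0, z1) for the two network outputs u0 and u1."""
    m = max(u0, u1)  # shift for numerical stability; does not change the result
    e0, e1 = math.exp(u0 - m), math.exp(u1 - m)
    total = e0 + e1
    return e0 / total, e1 / total


def predict_label(u0, u1, threshold=0.5):
    """Label 1 (important) if z1 reaches the threshold, else 0."""
    _, z1 = softmax2(u0, u1)
    return 1 if z1 >= threshold else 0


print(predict_label(0.0, 2.0))  # 1
print(predict_label(2.0, 0.0))  # 0
```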

4.3 (1) (5)

TP (True Positive) FN (False Negative)

\[ REC = \frac{TP}{TP + FN} \qquad (20) \]

FP (False Positive) TN (True Negative) FPR: False positive rate

\[ FPR = \frac{FP}{FP + TN} \qquad (21) \]

PRE (precision) ACC (Accuracy) F (F1-measure)

\[ PRE = \frac{TP}{TP + FP} \qquad (22) \]

\[ ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (23) \]

\[ F1 = \frac{2 \cdot PRE \cdot REC}{PRE + REC} \qquad (24) \]

2 2 F FaceNet
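The five evaluation measures of Eqs. (20) to (24) follow directly from the four confusion-matrix counts. The sketch below is illustrative: the function name `classification_metrics` is hypothetical, the counts in the example are made up for the arithmetic and are not results from the thesis, and returning 0.0 for an empty denominator is a convention chosen here.

```python
# Hypothetical sketch of Eqs. (20)-(24) from confusion-matrix counts.


def classification_metrics(tp, fn, fp, tn):
    """Return recall, false positive rate, precision, accuracy, and F1."""
    def ratio(numerator, denominator):
        # Convention chosen here: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    rec = ratio(tp, tp + fn)                       # Eq. (20)
    fpr = ratio(fp, fp + tn)                       # Eq. (21)
    pre = ratio(tp, tp + fp)                       # Eq. (22)
    acc = ratio(tp + tn, tp + tn + fp + fn)        # Eq. (23)
    f1 = ratio(2.0 * pre * rec, pre + rec)         # Eq. (24)
    return {"REC": rec, "FPR": fpr, "PRE": pre, "ACC": acc, "F1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=8, fn=2, fp=1, tn=9)
print(metrics["REC"], metrics["ACC"])  # 0.8 0.85
```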

32 2: (1) (5) REC(%) PRE(%) FPR(%) ACC(%) F1(%) (1) Nakashima [15] (5) (2a) (3a) CRF (4a) ( ) (5a) FaceNet (2b) (3b) CRF (4b) ( ) (5b)

Figure 11: Methods (1) and (5)
Method (1): frame 1350, frame 1355, frame 1360
Method (5): frame 1350, frame 1355, frame 1360
Method (1): frame 85, frame 90, frame 95
Method (5): frame 85, frame 90, frame 95

Figure [...]: Methods (2a) to (5a), 1
Method (2a): frame 1350, frame 1355, frame 1360
Method (3a): frame 1350, frame 1355, frame 1360
Method (4a): frame 1350, frame 1355, frame 1360
Method (5a): frame 1350, frame 1355, frame [...]

Figure 13: Methods (2a) to (5a), 2
Method (2a): frame 85, frame 90, frame 95
Method (3a): frame 85, frame 90, frame 95
Method (4a): frame 85, frame 90, frame 95
Method (5a): frame 85, frame 90, frame 95

Figure 14: Methods (2a) to (5a), 3
Method (2a): frame 5, frame 15, frame 25
Method (3a): frame 5, frame 15, frame 25
Method (4a): frame 5, frame 15, frame 25
Method (5a): frame 5, frame 15, frame 25

Figure [...]: Methods (2b) to (5b), 1
Method (2b): frame 1350, frame 1355, frame 1360
Method (3b): frame 1350, frame 1355, frame 1360
Method (4b): frame 1350, frame 1355, frame 1360
Method (5b): frame 1350, frame 1355, frame [...]

Figure 16: Methods (2b) to (5b), 2
Method (2b): frame 85, frame 90, frame 95
Method (3b): frame 85, frame 90, frame 95
Method (4b): frame 85, frame 90, frame 95
Method (5b): frame 85, frame 90, frame 95

Figure 17: Methods (2b) to (5b), 3
Method (2b): frame 5, frame 15, frame 25
Method (3b): frame 5, frame 15, frame 25
Method (4b): frame 5, frame 15, frame 25
Method (5b): frame 5, frame 15, frame 25

40 4.4 (4) (4) F (4) F P 13 (4a) 13 (5a) CRF F P (2) (3) (2) (3) F (2) (3) 12 (2a) (3a) CRF 12 CRF 18 (2) (3) 32

41 (2a) (3a) (5a) 18: DNN FaceNet [30] (4) 55, , 431 (4) FaceNet CRF FaceNet 33

42 5. CRF DNN CRF CRF YouTube CRF DNN CRF CRF FaceNet End-to-End 34

43 ( ) 35

References

[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 20, no. 11, pp ,
[2] J. Yang and M.-H. Yang, "Top-down visual saliency via joint CRF and dictionary learning," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), pp ,
[3] F. Liu and M. Gleicher, "Video retargeting: Automating pan and scan," in Proc. ACM Int. Conf. Multimedia (MM), pp ,
[4] X. Fan, X. Xie, H.-Q. Zhou, and W.-Y. Ma, "Looking into video frames on small displays," in Proc. ACM Int. Conf. Multimedia (MM), pp ,
[5] L. Itti, "Automatic foveation for video compression using a neurobiological model of visual attention," IEEE Trans. Image Processing, vol. 13, no. 10, pp ,
[6] W. Lai, X.-D. Gu, R.-H. Wang, W.-Y. Ma, and H.-J. Zhang, "A content-based bit allocation model for video streaming," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), vol. 2, pp ,
[7] M.-H. Hsiao, Y.-W. Chen, H.-T. Chen, K.-H. Chou, and S.-Y. Lee, "Content-aware video adaptation under low-bitrate constraint," EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 2, 17 pages,
[8] M. Sun, A. Farhadi, B. Taskar, and S. Seitz, "Salient montages from unconstrained videos," in Proc. European Conf. Computer Vision (ECCV), pp ,
[9] L. Itti and P. Baldi, "Bayesian surprise attracts human attention," in Proc. Neural Information Processing Systems (NIPS), pp ,
[10] P. Baldi and L. Itti, "Of bits and wows: A Bayesian theory of surprise with applications to attention," Neural Networks, vol. 23, no. 5, pp ,
[11] Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang, "A generic framework of user attention model and its application in video summarization," IEEE Trans. Multimedia, vol. 7, no. 5, pp ,
[12] D. Walther and C. Koch, "Modeling attention to salient proto-objects," Neural Networks, vol. 19, no. 9, pp ,
[13] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), pp ,
[14] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, "Salient region detection and segmentation," in Proc. Int. Conf. Computer Vision Systems, pp ,
[15] Y. Nakashima, N. Babaguchi, and J. Fan, "Intended human object detection for automatically protecting privacy in mobile video surveillance," Multimedia Systems, vol. 18, no. 2, pp ,
[16] Y. Nakashima, N. Babaguchi, and J. Fan, "Privacy protection for social video via background estimation and CRF-based videographer's intention modeling," IEICE Trans. Information and Systems, vol. E99.D, no. 4, pp ,
[17] Y. Bengio, Y. LeCun, and D. Henderson, "Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models," in Proc. Neural Information Processing Systems (NIPS), pp ,
[18] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao, "Recurrent conditional random field for language understanding," in Proc. IEEE Conf. Acoustics, Speech and Signal Processing (ICASSP), pp ,
[19] W. Wang, S. J. Pan, D. Dahlmeier, and X. Xiao, "Recursive neural conditional random fields for aspect-based sentiment analysis," in Proc. ACL Conf. Empirical Methods Natural Language Processing (EMNLP), pp ,
[20] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, "Semantic object parsing with graph LSTM," in Proc. European Conf. Computer Vision (ECCV), pp ,
[21] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, "Conditional random fields as recurrent neural networks," in Proc. IEEE Int. Conf. Computer Vision (ICCV), pp ,
[22] A. Arnab, S. Jayasumana, S. Zheng, and P. H. S. Torr, "Higher order conditional random fields in deep neural networks," in Proc. European Conf. Computer Vision (ECCV), pp ,
[23] S. Chandra and I. Kokkinos, "Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs," in Proc. European Conf. Computer Vision (ECCV), pp ,
[24] X. Ma and E. Hovy, "End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF," in Proc. Association for Computational Linguistics (ACL), 10 pages,
[25] X. Chu, W. Ouyang, H. Li, and X. Wang, "CRF-CNN: Modeling structured information in human pose estimation," in Proc. Neural Information Processing Systems (NIPS), pp ,
[26] F. Liu, C. Shen, and G. Lin, "Deep convolutional neural fields for depth estimation from a single image," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), pp ,
[27] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 8, pp ,
[28] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp ,
[29] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in Proc. European Conf. Computer Vision (ECCV), pp ,
[30] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), pp ,
[31] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Machine Learning (ICML), pp ,
[32] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp ,
[33] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learning Representations (ICLR), 13 pages,
[34] S. Tokui, K. Oono, S. Hido, and J. Clayton, "Chainer: A next-generation open source framework for deep learning," in Proc. Neural Information Processing Systems (NIPS), 6 pages,
