Meeting on Image Recognition and Understanding (MIRU2009), July 2009

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Akitsugu NOGUCHI and Keiji YANAI
The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585
E-mail: noguchi-a@mm.cs.uec.ac.jp, yanai@cs.uec.ac.jp

Abstract  Recently, spatio-temporal local features have been proposed as image features to recognize events and human actions in videos. In this paper, we propose a novel local spatio-temporal feature which is applicable to large amounts of video data. Our method consists of two parts: extracting visual features and extracting motion features. First, we select candidate points with the SURF detector, which is a very fast detector. Next, we calculate motion features at each point over local temporal units, which are subdivided in order to take the consecutiveness of motions into account. Since the proposed feature is intended to be robust to rotation, we rotate optical flow vectors to the dominant direction of the extracted SURF features. In the experiments, we evaluate the proposed spatio-temporal local feature on a common dataset containing six kinds of simple human actions. The accuracy reaches 85%, which is almost equivalent to the state of the art. In addition, we carry out experiments to classify a large number of Web video clips downloaded from Youtube.

Key words  video recognition, action recognition, spatio-temporal local feature

1. Introduction

Recognizing events and human actions in the large amounts of video now available on the Web requires features that can be extracted quickly. Existing spatio-temporal local features include the cuboid feature [1] and the STIP feature [2], around whose detected interest points descriptors such as HoG and HoF are computed. Detecting cuboids or spatio-temporal interest points, however, is computationally expensive, which makes these features difficult to apply to large-scale video data.
In this paper, we therefore propose a spatio-temporal local feature based on the fast SURF detector [3] instead of cuboid detection. We evaluate the proposed feature on the KTH dataset, which contains six kinds of human actions (walking, running, jogging, boxing, hand waving and hand clapping), and obtain an accuracy of 85%. We also apply the feature to 100 Web video clips downloaded from Youtube.

The rest of this paper is organized as follows. Section 2 reviews related work, Section 3 describes the proposed feature, Section 4 reports the experiments, and Section 5 concludes the paper.

2. Related Work

Many methods for recognizing human motions and actions in video have been proposed [4]-[6]. Dollar et al. proposed the cuboid feature [1]: interest points are detected in the spatio-temporal volume, small cuboids are extracted around them, and the cuboids are quantized into visual words. Laptev et al. proposed the STIP (spatio-temporal interest point) feature [2], which extends interest-point detection to the three-dimensional spatio-temporal domain and describes each point with HoG and HoF descriptors. Alireza et al. learn mid-level features on top of cuboid features [7]. All of these methods must process the whole spatio-temporal volume to detect cuboids or spatio-temporal interest points, which is slow. In contrast, our method selects candidate points with the fast SURF detector.

3. Proposed Method

The proposed feature consists of two parts: a visual feature and a motion feature. For the visual feature we use SURF (Speeded-Up Robust Features) [3].
3.1 Visual Feature: SURF

Like SIFT [8], SURF detects scale-invariant interest points and describes their local appearance. SURF approximates the Gaussian filtering used in SIFT with box filters computed on an integral image, and is therefore considerably faster than SIFT while keeping comparable distinctiveness.

3.1.1 SURF detector

SURF first computes the integral image. For an image I and a point X = (x, y)^T, the integral image I_Σ(X) is defined as

    I_Σ(X) = Σ_{i=0}^{i<=x} Σ_{j=0}^{j<=y} I(i, j)        (1)

Once the integral image is available, the sum of intensities inside any upright rectangle can be computed with only three additions, so Haar-wavelet responses in the x direction (dx) and the y direction (dy) can be evaluated at constant cost regardless of their size.

Interest points are detected with the determinant of the Hessian. For a point X = (x, y) in the image I, the Hessian matrix H(X, σ) at scale σ is defined as

    H(X, σ) = | L_xx(X, σ)  L_xy(X, σ) |
              | L_xy(X, σ)  L_yy(X, σ) |        (2)

where L_xx(X, σ) is the convolution of the second-order Gaussian derivative ∂²g(σ)/∂x² with the image I at point X, and L_yy(X, σ) and L_xy(X, σ) are defined analogously (Fig. 4 shows the second-order derivatives in the y and xy directions and their box-filter approximations). SURF replaces these Gaussian derivatives with box filters D_xx, D_yy and D_xy evaluated on the integral image, and uses the approximated determinant

    det(H_approx) = D_xx D_yy - (0.9 D_xy)²        (3)

as the blob response, where the factor 0.9 compensates for the box-filter approximation, as in SIFT [8].
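As a minimal sketch of Eqs. (1) and (3), the following shows the integral image, a constant-time box sum, and the approximated Hessian response. Computing the actual box filters D_xx, D_yy, D_xy as weighted combinations of box sums is omitted; the function names are ours, not from the paper.

```python
import numpy as np

def integral_image(img):
    """I_Sigma(x, y) = sum of img[0:y+1, 0:x+1] -- Eq. (1)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in rows r0..r1, cols c0..c1 (inclusive), using
    at most three additions/subtractions on the integral image ii."""
    s = ii[r1, c1]
    if r0 > 0:
        s -= ii[r0 - 1, c1]
    if c0 > 0:
        s -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1, c0 - 1]
    return s

def hessian_response(Dxx, Dyy, Dxy):
    """Approximated determinant of the Hessian -- Eq. (3)."""
    return Dxx * Dyy - (0.9 * Dxy) ** 2
```

The key point is that `box_sum` costs the same regardless of the rectangle size, which is why the box-filter scale search described next is cheap.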
Instead of rescaling the image, the box filters themselves are enlarged (9x9, 15x15, 21x21, 27x27, and so on up to 51x51), which corresponds to searching over scales. Candidate points are the local maxima of the response after non-maximum suppression over a 3x3x3 neighborhood in space and scale.

3.1.2 SURF descriptor

For each detected point, the dominant orientation is estimated from the Haar-wavelet responses dx and dy in a circular neighborhood around the point (Fig. 8). The square region around the point, aligned to the dominant orientation, is then divided into 4x4 subregions, and for each subregion the four sums Σdx, Σdy, Σ|dx| and Σ|dy| are computed. This yields a 4x4x4 = 64-dimensional descriptor.

3.2 Motion Feature

3.2.1 Optical flow

As shown in Fig. 9 (left), for each SURF point we take an NxN patch centered at the point (we use N = 5, i.e., N/2 pixels on each side) and compute the optical flow inside the patch with the Lucas-Kanade method [9].

3.2.2 Temporal units

To take the consecutiveness of motions into account, we divide the temporal interval around each point into M local temporal units (we use M = 5), as shown in Fig. 9 (right). Tracking the point from (x, y) to (x + Δx, y + Δy) between consecutive units yields M - 1 flow vectors, and each flow vector is quantized into five elements (positive and negative x components, positive and negative y components, and no motion), giving a 5(M - 1)-dimensional motion feature.

To make the motion feature robust to rotation, each flow vector is rotated to the dominant orientation of the SURF point (dominant rotation, Fig. 10). When a point moves from (x1, y1) to (x2, y2) and the dominant orientation is θ, the rotated flow vector (x', y') is

    | x' |   | cosθ  -sinθ | | x2 - x1 |
    | y' | = | sinθ   cosθ | | y2 - y1 |        (4)
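The dominant rotation of Eq. (4) and the five-element quantization can be sketched as follows. The rotation sign convention, the no-motion threshold `tau`, and the exact binning are our assumptions for illustration, not specified by the surviving text.

```python
import numpy as np

def rotate_flow(p_prev, p_cur, theta):
    """Rotate the flow vector between two tracked positions by the
    dominant orientation theta of the SURF point -- Eq. (4).
    p_prev = (x1, y1), p_cur = (x2, y2)."""
    (x1, y1), (x2, y2) = p_prev, p_cur
    dx, dy = x2 - x1, y2 - y1          # raw flow vector
    c, s = np.cos(theta), np.sin(theta)
    # aligning the flow to the dominant orientation makes the
    # motion feature invariant to in-plane rotation
    return (c * dx - s * dy, s * dx + c * dy)

def flow_elements(dx, dy, tau=0.1):
    """Quantize one flow vector into five elements:
    x+, x-, y+, y-, and 'no motion' (threshold tau is an assumption)."""
    if abs(dx) < tau and abs(dy) < tau:
        return np.array([0.0, 0.0, 0.0, 0.0, 1.0])
    return np.array([max(dx, 0), max(-dx, 0), max(dy, 0), max(-dy, 0), 0.0])
```

Concatenating the five-element vectors of the M - 1 consecutive-unit flows gives the 5(M - 1)-dimensional motion part of the feature.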
3.3 Feature vector

The final feature vector concatenates the 64-dimensional visual (SURF) descriptor and the 5(M - 1)-dimensional motion feature. Since the two parts differ in scale, the motion part is multiplied by a weight, giving 64 + 5(M - 1) dimensions in total; with M = 5 the feature is 84-dimensional.

4. Experiments

We evaluate the proposed feature on the KTH action dataset and on Web video.

4.1 Action recognition on the KTH dataset

Following Dollar et al. [1], we adopt the bag-of-video-words (BoVW) framework shown in Fig. 11, the video analogue of the bag-of-features (BoF) representation [10]. The extracted feature vectors are quantized into a codebook with k-means clustering, each video is represented as a histogram of codeword occurrences, and the histograms are classified with a support vector machine (SVM) with an RBF kernel.

4.1.1 Setup

The KTH dataset contains six actions (walking, running, jogging, boxing, hand clapping and hand waving), each performed by 25 persons under 4 scenarios, i.e., 100 videos per action. We evaluate with 5-fold cross validation. Each KTH video is about 20 seconds long, and about 4,000 features are extracted per video.

4.1.2 Parameters

We examined the codebook size k and the motion weight. Fig. 12 shows the accuracy as a function of the weight with k = 700, and Fig. 13 as a function of k with the weight fixed to 2.5. The best accuracy was obtained with weight = 2.5 and k = 1500, which we use in the following experiments.
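The weighted concatenation of Section 3.3 and the BoVW histogram step can be sketched as below. The codebook here stands in for the k-means centroids; the default weight follows the value 2.5 reported as best in the experiments, and the function names are ours.

```python
import numpy as np

def concat_feature(visual64, motion, weight=2.5):
    """84-dim feature: 64-dim SURF part plus the weighted
    5(M-1)-dim motion part (weight = 2.5 worked best on KTH)."""
    return np.concatenate([np.asarray(visual64), weight * np.asarray(motion)])

def bovw_histogram(features, codebook):
    """Assign each local feature to its nearest codeword and return
    the normalized bag-of-video-words histogram for one video."""
    # squared Euclidean distances, shape (n_features, k)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The resulting histograms are what the RBF-kernel SVM is trained on; with k = 1500 each video becomes a single 1500-dimensional vector regardless of its length.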
4.1.3 Results

We compare four settings: (1) visual + motion + dominant rotation (VMR), (2) visual + motion without rotation (VM), (3) the visual feature only (V), and (4) the motion feature only (M). Fig. 14 shows the accuracy of each setting. VMR and VM clearly outperform the single-feature settings; VMR achieves 85.5% against 83.3% for VM, which confirms the effect of the dominant rotation. Table 1 shows the confusion matrix: confusions occur mainly among walking, running and jogging, and among boxing, hand waving and hand clapping.

Fig. 15 compares the proposed method with existing methods. Our accuracy of 85% is close to Dollar et al. [1] (82.3%), Alireza et al. [7] (91.5%) and Laptev et al. [2] (91.8%). Although the accuracy is somewhat below the best reported results, Fig. 16 shows that the proposed feature is much faster to extract: on a KTH video of about 600 frames, extraction takes roughly two thirds of the time required by the cuboid feature [1].

4.2 Classification of Web video

We also apply the proposed feature to real Web video. Since Web videos usually consist of multiple shots, we first divide each video into shots: an HSV color histogram is computed for every frame, and a shot boundary is declared wherever the chi-square (χ²) distance between the histograms of consecutive frames exceeds a threshold.
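The shot-boundary step can be sketched as follows; the threshold value is our assumption for illustration, as the paper's fragmentary text does not give one.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two (normalized) color histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def shot_boundaries(frame_hists, threshold=0.25):
    """Declare a shot boundary wherever the chi-square distance between
    the HSV histograms of consecutive frames exceeds the threshold.
    frame_hists: list of per-frame histograms, in temporal order."""
    cuts = []
    for t in range(1, len(frame_hists)):
        if chi2_distance(frame_hists[t - 1], frame_hists[t]) > threshold:
            cuts.append(t)  # shot starts at frame t
    return cuts
```

A global color change between frames produces a large χ² distance, so hard cuts are detected; gradual transitions would need a more elaborate test.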
Fig. 17 illustrates the processing of Web video: (1) shot segmentation, (2) extraction of the proposed spatio-temporal features, (3) representation of each shot as a bag-of-video-words histogram, and (4) grouping of the shots by k-means clustering with k = 50.

4.2.1 Dataset

We downloaded 100 video clips from Youtube as Web video data.

4.2.2 Results

Fig. 18 shows the clustering result for the shots of a single video, and Fig. 19 the result over the shots of all videos; additional examples are shown in Fig. 20. Shots with similar content, such as similar camera distance to the person, tend to be grouped into the same cluster.

5. Conclusions

We proposed a spatio-temporal local feature that combines the SURF descriptor with motion features extracted over local temporal units and rotated to the dominant orientation of the SURF point. On the KTH dataset with six kinds of human actions, the proposed feature achieves an accuracy of 85%, which is almost equivalent to the state of the art while being much faster to extract.
Fig. 18  Clustering result for one video: shots taken from a distance (top), relatively close shots (middle), and close-up shots of people (bottom).
Fig. 19  Clustering result over all video shots: shots taken from far away (top), close shots (middle), and a cluster containing a mixture of various shots (bottom).

As future applications, video retrieval, video summarization and automatic surveillance systems are conceivable.

References
[1] P. Dollar, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Proc. of Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005.
[2] I. Laptev and T. Lindeberg. Local descriptors for spatio-temporal recognition. In Proc. of IEEE International Conference on Computer Vision, 2003.
[3] B. Herbert, E. Andreas, T. Tinne, and G. Luc. SURF: Speeded up robust features. CVIU, pp. 346-359, 2008.
[4] C. Fanti and P. Perona. Hybrid models for human motion recognition. In Proc. of IEEE Computer Vision and Pattern Recognition, 2005.
[5] C. Rao, A. Yilmaz, and M. Shah. View-invariant representation and recognition of actions. Int. J. Comput. Vision, Vol. 50(2), pp. 203-226, 2002.
[6] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. Comput. Vis. Image Und., Vol. 72(2), pp. 203-226, 2002.
[7] F. Alireza and M. Greg. Action recognition by learning mid-level feature. In Proc. of IEEE Computer Vision and Pattern Recognition, 2008.
[8] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, pp. 91-110, 2004.
[9] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.
[10] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Proc. of ECCV Workshop on Statistical Learning in Computer Vision, pp. 1-22, 2004.
[11] S. Konrad and G. Luc. Action snippets: How many frames does human action recognition require? In Proc. of IEEE Computer Vision and Pattern Recognition, 2008.