Analysis of video data recognition using multi-frame

Kazuya Hidume 1 and Keiji Yanai 1

In this study, we aim to verify the effectiveness of a multi-frame method for shot recognition proposed in recent years. In the experiments, we extract SURF, color, and spatio-temporal features from the TRECVID 2010 video data and convert them to the Bag-of-Features (BoF) representation. Unlike the conventional method, which extracts features from only one keyframe, the multi-frame method extracts features from multiple frames selected from the video and generates one BoF feature vector by integrating these features. In the experiments, we use five of the 130 TRECVID2010 target concepts and analyze recognition performance under various settings of the number of frames selected from one shot. As a result, compared to the conventional method, recognition accuracy rose by up to 700% when classifying with SVM and by up to 883% with MKL-SVM. The MKL-SVM results for all classes outperformed the average of all the teams in TRECVID2010.

1. WEB TRECVID 1) TRECVID Web TRECVID

2. TRECVID TRECVID TREC Video Retrieval Evaluation NIST Disruptive Technology Office (DTO) TRECVID 2010 TRECVID2010 6 Semantic indexing SIN SIN

1 Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications

© 2011 Information Processing Society of Japan
1 TRECVID2010 Semantic indexing TRECVID2010 130 1 MPEG-4/H.264 200

3. TRECVID 2008 2) 4 SIFT 7 2009 10 Multiple Kernel Learning SR-KDA (Spectral Regression combined with Kernel Discriminant Analysis) 3) MK-FDA (Multiple Kernel Fisher Discriminant Analysis) 4) TRECVID 5) 2009 SIFT (MFCC) 4

4.

4.1 2 2 1 2 3 1 1 N 1 M
Airplane Flying Bus Hand 3 N + 1 M M N ( 1)/M + 1 N M N = 3, 5, 10, M = 30, 15, 10 M 30 /

4.2 4 SURF Bag-of-Features SURF 2 2 Bag-of-Features SVM MKL-SVM 4 SVM RBF-χ² 2000

5.

5.1 Bag-of-Features Bag-of-Features Bag-of-Words Bag-of-Words Bag-of-Features Bag-of-Features
( 1 )
( 2 ) k k visual words visual words codebook
( 3 ) visual words
( 4 ) bin ( SURF ) Bag-of-Features Bag-of-Features codebook

5.2 RGB RGB 3 Bag-of-Features visual words

5.3 SURF SURF (Speeded-Up Robust Features) 6) 64 128 SIFT 7) SIFT 128 SURF SIFT SIFT SURF 5000 TRECVID2010 Web

5.4 8) Web
step1
step2
step3
step4
step5
step6
SURF SURF Delaunay Web Lucas-Kanade 9) SURF SURF Delaunay Delaunay 3 SURF 64 × 3 = 192 SURF N SURF N/2 N M Lucas-Kanade x,y 5 N = 5 M = 5 5 20 × 3 + 5 = 65 SURF 192 + 65 = 257

5.5 Bag-of-Features Lazebnik 10) Bag-of-Features
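As a rough illustration of the Bag-of-Features steps in Section 5.1 applied to several frames of one shot, the following sketch assigns every local descriptor to its nearest visual word and accumulates a single histogram for the shot. It assumes a codebook has already been built (for example by k-means) and that "integrating" the frames means pooling their word counts and L1-normalizing; the pooling rule, the normalization, and the names `nearest_word` and `multiframe_bof` are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the visual word closest to the descriptor
    (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def multiframe_bof(frames: Sequence[Sequence[Vector]],
                   codebook: Sequence[Vector]) -> List[float]:
    """Pool the descriptors of all selected frames into one histogram.

    Assumption: counts are summed over frames and then L1-normalized.
    """
    histogram = [0.0] * len(codebook)
    for descriptors in frames:
        for descriptor in descriptors:
            histogram[nearest_word(descriptor, codebook)] += 1.0
    total = sum(histogram)
    if total == 0:
        return histogram  # no descriptors were extracted from any frame
    return [count / total for count in histogram]


# Toy example: a 2-word codebook and two frames with two descriptors each.
toy_codebook = [(0.0, 0.0), (10.0, 10.0)]
toy_frames = [
    [(1.0, 0.0), (9.0, 9.0)],
    [(0.0, 1.0), (1.0, 1.0)],
]
toy_histogram = multiframe_bof(toy_frames, toy_codebook)
```

With a single frame in `frames`, the same function reduces to the conventional one-keyframe representation, which makes the two settings directly comparable.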
SURF

6. SVM (Support Vector Machine) MKL-SVM (Multiple Kernel Learning SVM)

6.1 Support Vector Machine
SVM (Support Vector Machine) 2 SVM x φ(x) RBF-χ²

$$K(x, y) = \exp\left( -\frac{1}{2\sigma^2} \sum_i \frac{(x_i - y_i)^2}{x_i + y_i} \right) \tag{1}$$

6.2 Multiple Kernel Learning
MKL (Multiple Kernel Learning) SVM () K

$$K_{combined}(x, x') = \sum_{j=1}^{K} \beta_j k_j(x, x'), \qquad \beta_j \ge 0, \qquad \sum_{j=1}^{K} \beta_j = 1 \tag{2}$$

β_j MKL MKL β_j β_j MKL Sonnenburg 11) SVM β_j

7.

7.1

Table 1
Airplane Flying    66   1000
Boat Ship         172   2500
Bus                31   1000
Cityscape         558   5000
Classroom         139   2500

TRECVID TRECVID2010 5 7.1 144,988

7.2 TRECVID2006 (Inferred Average Precision: infAP) ( )/( ) N k Precision(k)

$$\mathrm{infAP} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{Precision}(k) \tag{3}$$

( )/( ) TRECVID2010 TRECVID2010 Semantic Indexing 2000

7.3 SURF 3 Bag-of-Features codebook Bag-of-Features 500 Bag-of-Features
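The RBF-χ² kernel of Section 6.1 and the weighted kernel combination of Section 6.2 can be sketched in a few lines. The sketch reads equation (1) as exp(-(1/(2σ²)) Σ_i (x_i - y_i)² / (x_i + y_i)) and equation (2) as a convex combination of per-feature kernels. Skipping bins where x_i + y_i = 0, the function names, and the choice to pass the weights β_j in by hand are assumptions made for illustration; in the paper the weights are learned by MKL.

```python
import math
from typing import Sequence


def chi2_rbf_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Equation (1): exp(-(1 / (2 sigma^2)) * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    distance = 0.0
    for xi, yi in zip(x, y):
        if xi + yi > 0:  # assumption: bins empty in both histograms contribute 0
            distance += (xi - yi) ** 2 / (xi + yi)
    return math.exp(-distance / (2.0 * sigma ** 2))


def combined_kernel(features_x: Sequence[Sequence[float]],
                    features_y: Sequence[Sequence[float]],
                    betas: Sequence[float],
                    sigmas: Sequence[float]) -> float:
    """Equation (2): sum_j beta_j * k_j(x, x') with beta_j >= 0 and sum_j beta_j = 1.

    Each entry of features_x / features_y is the histogram of one feature type.
    """
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be non-negative and sum to 1")
    return sum(
        b * chi2_rbf_kernel(fx, fy, s)
        for b, fx, fy, s in zip(betas, features_x, features_y, sigmas)
    )


# Toy example with two feature types and equal weights.
k_same = chi2_rbf_kernel([0.5, 0.5], [0.5, 0.5], sigma=1.0)
k_diff = chi2_rbf_kernel([1.0, 0.0], [0.0, 1.0], sigma=1.0)
k_comb = combined_kernel(
    features_x=[[1.0, 0.0], [0.5, 0.5]],
    features_y=[[0.0, 1.0], [0.5, 0.5]],
    betas=[0.5, 0.5],
    sigmas=[1.0, 1.0],
)
```

Setting one weight to 1 and the others to 0 recovers a single-kernel SVM on that feature, so the single-kernel case is a special case of the combined kernel.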
Table 2  M / M
                  M=30          M=15          M=10
Airplane Flying   4.06/5.45     7.62/10.44    11.15/15.45
Boat Ship         5.40/6.05     10.60/11.72   15.82/17.46
Bus               19.06/5.23    37.65/10.08   56.55/14.94
Cityscape         4.80/7.77     9.20/15.23    13.61/22.67
Classroom         14.11/7.29    27.95/14.23   41.94/21.21
                  4.99          9.48          14.01

Table 3  N M
N=1 N=3 N=5 N=10 M=30 M=15 M=10
Airplane Flying   0.0076  0.0263  0.0282  0.0301  0.0331  0.0608  0.0400  0.0414
Boat Ship         0.0294  0.0334  0.0282  0.0278  0.0220  0.0270  0.0276  0.0274
Bus               0.0021  0.0028  0.0030  0.0032  0.0034  0.0034  0.0043  0.0043
Cityscape         0.0586  0.0818  0.0746  0.0466  0.0785  0.0796  0.0722  0.0889
Classroom         0.0002  0.0006  0.0003  0.0003  0.0012  0.0014  0.0004  0.0006

2 2 500 500 × 4 = 2000 SURF 2000 × 2 + 500 = 4500 SVM MKL-SVM RBF-χ²

7.3.1 3
( 1 )
( 2 )
( 3 )
MKL-SVM SVM M 2 MKL-SVM MKL TRECVID2010 (median) (max)

7.4 SVM 3 5 3 N = 1

Table 4  MKL-SVM N M TRECVID2010
N=1 N=3 N=5 N=10 M=30 M=15 M=10 median max
Airplane Flying   0.0114  0.0373  0.420   0.0429  0.0654  0.0742  0.0678  0.0675  0.017  0.141
Boat Ship         0.0528  0.0324  0.0322  0.0294  0.0231  0.0291  0.0286  0.0276  0.018  0.165
Bus               0.0035  0.0023  0.0035  0.0031  0.0024  0.0029  0.0036  0.0030  0.002  0.032
Cityscape         0.0996  0.1236  0.1250  0.1251  0.1255  0.1216  0.1235  0.1264  0.045  0.21
Classroom         0.0006  0.0017  0.0027  0.0031  0.0048  0.0059  0.0029  0.0032  0.002  0.116

Airplane Flying 700% M Boat Ship M 1 M=15 MKL-SVM 4 6 4 N MKL 7 Classroom MKL-SVM 883% Bus 4 SVM MKL-SVM Boat Ship MKL-SVM
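The 700% and 883% figures quoted for Airplane Flying (SVM) and Classroom (MKL-SVM) are consistent with a relative gain computed from the reported average precision values. The sketch below assumes that the first value in each row is the single-keyframe baseline and that the gain is measured against the best value in the same row; both readings are inferred from the numbers, not stated in the surviving text, and `relative_gain_percent` is a hypothetical helper name.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Airplane Flying with SVM: first value vs. best value in its row.
svm_gain = relative_gain_percent(baseline=0.0076, improved=0.0608)

# Classroom with MKL-SVM: first value vs. best value in its row.
mkl_gain = relative_gain_percent(baseline=0.0006, improved=0.0059)

print(f"SVM, Airplane Flying: {svm_gain:.0f}%")
print(f"MKL-SVM, Classroom:   {mkl_gain:.0f}%")
```

Note that these are gains relative to a very small baseline: the absolute changes are 0.0532 and 0.0053 in average precision.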
Fig. 5 SVM
Fig. 6 MKL-SVM

N=1 SURF 5000 TRECVID2010 320 × 240 1 76800 Bag-of-Features 1% TRECVID2010 MKL-SVM SVM Classroom 4 3 1 codebook 4096

Fig. 7 N

8. TRECVID2010 Semantic indexing 2 SURF Semantic indexing N M TRECVID2010 SVM 700% MKL-SVM 883%
IPSJ SIG Technical Report

In addition, compared with the average over all teams in TRECVID2010, several of the settings that used MKL-SVM exceeded that average for all five classes. From these results, we were able to confirm the effectiveness of the multi-frame method for shot recognition.

9. Future Work

In the experiments, extracting features from a larger number of frames sometimes lowered the accuracy instead of raising it. Rather than extracting frames simply by their position in the shot, it is also important to be able to select more useful frames. For example, as in the extraction of the spatio-temporal features, one could compute the difference from another selected frame (in this case, the previously extracted frame) and extract a new frame only when the features differ by more than a fixed amount. Furthermore, if the difference between frames is large, the basic sampling interval could be shortened, and if the difference is small, the interval could be lengthened. In other words, the sampling interval could be changed for each shot according to how much the frames change.

References

1) TRECVID Home Page. http://www-nlpir.nist.gov/projects/trecvid/.
2) C.G.M. Snoek, K.E.A. van de Sande, O. de Rooij, et al. The MediaMill TRECVID 2008 semantic video search engine. In Proc. of TRECVID Workshop, 2008.
3) D. Cai, X. He, and J. Han. Efficient kernel discriminant analysis via spectral regression. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pp. 427-432. IEEE, 2008.
4) J. Ye, S. Ji, and J. Chen. Multi-class discriminant kernel learning via convex programming. The Journal of Machine Learning Research, Vol.9, pp. 719-758, 2008.
5) N. Inoue, S. Hao, T. Saito, K. Shinoda, I. Kim, and C.H. Lee. TITGT at TRECVID 2009 workshop. In Proc. of TRECVID Workshop, Vol.2, 2009.
6) H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In Proc. of European Conference on Computer Vision, pp. 404-415, 2006.
7) D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, Vol.60, No.2, pp. 91-110, 2004.
8) A. Noguchi (野口顕嗣) and K. Yanai (柳井啓司). Extraction of local spatio-temporal features from video considering motion continuity (in Japanese). In MIRU, 2009.
9) B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, Vol.3, pp. 674-679. Citeseer, 1981.
10) S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. of IEEE Computer Vision and Pattern Recognition, pp. 2169-2178, 2006.
11) S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large Scale Multiple Kernel Learning. The Journal of Machine Learning Research, Vol.7, pp. 1531-1565, 2006.

Appendix

The top 15 shots retrieved with MKL-SVM at N=10 are shown below. Shots with a red frame are correct.

Fig. 8 Airplane Flying
Fig. 9 Boat Ship
Fig. 10, Fig. 11 Cityscape Bus
Fig. 12 Classroom