Vol. 59, No. 6 (2003), pp. 1-11    [43.72.Kb]

Person Recognition for News Videos through Multimodal Interaction

Masakiyo Fujimoto (ATR), Yasuo Ariki, and Shuji Doshita
E-mail: masakiyo.fujimoto@atr.jp
[Received/Accepted: 2003]

1. Introduction
[Japanese body text not recovered in this extraction.]
[Recovered evaluation figures: with 150 test utterances, speech recognition accuracy was 93.33% and face recognition accuracy 76.19%.]

2. Background
[Japanese text not recovered. The section cites interactive-TV and data-broadcasting services [2]-[6]: an IPv6 service [3], broadband services [4], and NHK BS digital data broadcasting [5][6], which started on December 1, 2000; a 2007 date is also mentioned.]
[Related work (Japanese text not recovered): Bolt's Put-that-there [7]; face indexing on video [14]; VTR and electronic program guide (EPG: Electric Program Guide) operation by voice and gesture in the style of Put-that-there; a real-time framework for natural multimodal interaction with large screen displays [8]; speech/gesture-driven multimodal interfaces [9]; exploiting speech/gesture co-occurrence in weather narration [10], reporting 97%.]

3. System overview
[Japanese text not recovered; cites [8] and [13].]
[Fig. 1: System configuration. The user points at a person on screen and asks "Who is he?" (pointing / speech input / action input). Speech recognition and action recognition interpret the query, the indicated face is extracted and recognized [11][12], and related information is retrieved from the Web and presented.]
4. Hands-free speech input
[Introductory text (Japanese) not recovered; the front end combines beam forming with two-level MLLR (Maximum Likelihood Linear Regression) acoustic model adaptation [15].]
[Fig. 3: Front-end block diagram: observed signals -> speaker direction estimation -> beam forming -> user utterance section detection -> acoustic model adaptation -> speech recognition.]

4.1 Delay-and-sum beam forming
[Japanese text partially recovered; cites [12] and [16].] The microphone array has M elements with spacing d. Let y_m(t) (m = 1, ..., M) be the observed signals and tau the inter-element delay corresponding to the arrival direction theta. Each channel is delayed by (m - 1) tau and the channels are summed:

  x(t) = sum_{m=1}^{M} y_m(t - (m - 1) tau)

4.2 Speaker direction estimation
The speaker direction is estimated by CSP (Cross-power Spectrum Phase analysis) [11]. For microphones i and j with signals y_i(t) and y_j(t), the CSP coefficient is

  CSP_{i,j}(k) = F^{-1}[ F[y_i(t)] conj(F[y_j(t)]) / ( |F[y_i(t)]| |F[y_j(t)]| ) ]    (1)

where F and F^{-1} denote the Fourier transform and its inverse, and conj(.) denotes complex conjugation.
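The delay-and-sum operation of Sect. 4.1 can be sketched in a few lines of numpy. This is a minimal illustration under the assumption of integer-sample steering delays, not the authors' implementation:

```python
import numpy as np

def delay_and_sum(signals, tau):
    """Delay-and-sum beam forming: x(t) = sum_m y_m(t - (m-1)*tau).

    signals: (M, T) array of microphone signals y_m(t).
    tau: inter-element steering delay in samples (integer, for simplicity).
    """
    out = np.zeros(signals.shape[1])
    for m, y in enumerate(signals):
        # Advance channel m by its steering delay so that the target
        # direction adds coherently (circular shift keeps the sketch short).
        out += np.roll(y, -m * tau)
    return out

# Example: a wave hitting 4 microphones with a 3-sample delay per element.
s = np.sin(2 * np.pi * 0.01 * np.arange(400))
mics = np.stack([np.roll(s, 3 * m) for m in range(4)])
x = delay_and_sum(mics, 3)   # coherent sum: x equals 4 * s
```

In a real front end the steering delay would be fractional and applied by interpolation or in the frequency domain; the integer shift above only shows the structure of the sum.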
The delay tau is obtained as the peak of the CSP coefficients:

  tau = argmax_k CSP_{i,j}(k)    (2)

With sound velocity c, sampling frequency f, and microphone spacing d, the arrival direction is

  theta = cos^{-1}( c tau / (f d) )    (3)

4.3 User utterance section detection
[Japanese text not recovered. Fig. 4: direction of arrival (DOA, deg.) plotted against time (s); the user utterance section appears as an interval where the DOA is stable. Fig. 5: experimental setup with screen, loudspeaker (TV sound), news display, 6 PCs, 5.0 m.]

4.4 Acoustic model adaptation
[Japanese text not recovered; the acoustic model is adapted by a two-level MLLR scheme [15][17], a first MLLR pass followed by a second MLLR pass.]

4.5 [Title not recovered]
[Japanese text not recovered; concerns the DOA stability section under news sound plus user utterance. Fig. 6: room layout, 10.0 m x 8.0 m; digital projector and screens; loudspeaker (TV sound); microphone array; speaker (user); 2 PCs; distances 0.8 m, 1.2 m, 2.0 m, 2.4 m, 2.6 m, 2.85 m, 4.9 m.]
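Eqs. (1)-(3) can be sketched together as a CSP-based delay and DOA estimator. This is an illustrative numpy version under assumed parameters (fs = 16 kHz, d = 0.1 m, c = 340 m/s are examples, not the paper's configuration), and the sign of the recovered lag depends on which channel leads:

```python
import numpy as np

def csp_coefficients(yi, yj, n_fft=4096):
    """Eq. (1): inverse FFT of the magnitude-normalized cross spectrum."""
    Yi = np.fft.rfft(yi, n_fft)
    Yj = np.fft.rfft(yj, n_fft)
    cross = Yi * np.conj(Yj)
    cross /= np.abs(cross) + 1e-12        # phase-only (whitened) spectrum
    return np.fft.irfft(cross, n_fft)

def estimate_delay_and_doa(yi, yj, fs, d, c=340.0, max_lag=20):
    """Eq. (2): peak lag -> tau; Eq. (3): theta = arccos(c*tau/(f*d))."""
    coeff = csp_coefficients(yi, yj)
    lags = np.arange(-max_lag, max_lag + 1)
    tau = int(lags[np.argmax(coeff[lags])])   # negative lags wrap around the FFT
    cos_theta = np.clip(c * tau / (fs * d), -1.0, 1.0)
    return tau, float(np.degrees(np.arccos(cos_theta)))

# Example: channel j is channel i delayed by 3 samples.
rng = np.random.default_rng(0)
yi = rng.standard_normal(4096)
yj = np.roll(yi, 3)
tau, theta = estimate_delay_and_doa(yi, yj, fs=16000, d=0.1)
```

The whitening in Eq. (1) is what makes CSP robust in reverberant rooms: the peak location depends only on the phase difference, not on the spectral shape of the source.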
5. Pointing action recognition
5.1 [Title not recovered]
[Japanese text not recovered: pointing is measured with LED markers and the Visualeyez 3-D optical motion capture system [18]; 3 LEDs with spacings of 15 cm and 7 cm are mentioned. Fig. 7: multi-PC configuration (4 PCs).]

5.2 [Title not recovered]
[Japanese text not recovered; relates the pointing action to the utterance section detection of Sect. 4.5.]

6. Face recognition
[Introductory text not recovered; cites [19].]
6.1 Eigenface subspace and projection distance
[Japanese text partially recovered; cites [20]. Face images are normalized to 100 x 150 pixels, with about 10 images per person.] For N training vectors {x_t} (t = 1, ..., N) of dimension m, the mean vector and covariance matrix are

  mu = (1/N) sum_{t=1}^{N} x_t    (4)

  Sigma = (1/N) sum_{t=1}^{N} (x_t - mu)(x_t - mu)^T    (5)

Eigendecomposition of the covariance matrix gives

  Sigma = V Lambda V^T    (6)

where Lambda holds the eigenvalues lambda_d (d = 1, ..., k, ..., m; lambda_1 > lambda_2 > lambda_3 > ...) and V the corresponding eigenvectors phi_d. Using the k leading eigenvectors, the projection distance of an input x to the eigenface subspace is

  PD = ||x - mu||^2 - sum_{d=1}^{k} ((x - mu, phi_d))^2    (7)

[Fig. 9: Observation space and eigenface space spanned by phi_1, phi_2, phi_3; PD is the residual of x - mu outside the subspace.]

6.2 Face extraction
[Japanese text not recovered. Fig. 8: an n x n search window is scanned over the input image, with the window size n varied.]
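Eqs. (4)-(7) amount to PCA per person followed by a residual distance. A minimal numpy sketch (toy 5-dimensional data standing in for vectorized face images; not the paper's implementation):

```python
import numpy as np

def fit_eigenface_subspace(X, k):
    """Eqs. (4)-(6): mean, covariance, and the k leading eigenvectors.

    X: (N, m) matrix whose rows are vectorized face images x_t.
    """
    mu = X.mean(axis=0)                        # Eq. (4)
    sigma = (X - mu).T @ (X - mu) / len(X)     # Eq. (5)
    vals, vecs = np.linalg.eigh(sigma)         # Eq. (6), ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]         # keep lambda_1 > lambda_2 > ...
    return mu, vecs[:, order]                  # phi_1, ..., phi_k as columns

def projection_distance(x, mu, phi):
    """Eq. (7): PD = ||x - mu||^2 - sum_d ((x - mu, phi_d))^2."""
    r = x - mu
    return float(r @ r - np.sum((phi.T @ r) ** 2))

# Toy data: 5-dimensional "images" lying on a 2-D plane around a mean.
rng = np.random.default_rng(1)
mu_true = np.ones(5)
X = mu_true + rng.standard_normal((50, 2)) @ np.eye(5)[:2]
mu, phi = fit_eigenface_subspace(X, 2)
pd_in = projection_distance(mu_true + np.array([0.3, -0.7, 0, 0, 0]), mu, phi)
pd_out = projection_distance(mu_true + np.eye(5)[3], mu, phi)   # off-plane point
```

For recognition, one subspace is fitted per person and an input face is assigned to the person whose subspace yields the minimum PD: a point inside the subspace gives PD near zero, a point off it gives a large residual.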
7. Evaluation experiments
[Japanese text partially recovered.] Speech was captured at 16 kHz (16 bit). CSP analysis used a 4096-point FFT with a 256 ms Hamming window; the test material was NHK news. Acoustic features: pre-emphasis 1 - 0.97 z^{-1}; 13 MFCCs (0th-12th) with delta and delta-delta coefficients (39 dimensions); 20 ms frame length, 10 ms frame shift, Hamming window. The processing modules were distributed over several PCs connected by TCP/IP.

7.1 Experimental environment
[Japanese text partially recovered: 10 m x 8 m room, reverberation time T60 = 0.3 s; news (TV) sound at about 58 dB(A) against a background noise level of about 40 dB(A); 16-element microphone array with element spacing d = 2 cm; the user stands about 1.2 m from the array; speech-processing PC: Intel Xeon 1.7 GHz x 2, 512 MByte memory; other PCs: Intel Pentium4 1.7 GHz, 256 MByte memory.]

7.2 Speech recognition
[Japanese text partially recovered: speaker-independent monophone HMMs trained on the JNAS corpus [21]; 150 test utterances.]

Table: Utterance recognition accuracy (%)
  Beam forming alone:            67.33 (101/150)
  Beam forming + 2-level MLLR:   93.33 (140/150)

7.3 [Title not recovered]
[Japanese text not recovered; discusses the two-level MLLR adaptation of Sect. 4.4. Pointing recognition reached 100%; speech recognition reached 93.33% (140/150).]
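The front-end numbers above (pre-emphasis 1 - 0.97 z^{-1}, 20 ms frames, 10 ms shift, Hamming window, 16 kHz) can be illustrated with a short framing sketch. The helper names are hypothetical and the MFCC analysis itself is omitted:

```python
import numpy as np

def preemphasis(x, a=0.97):
    """Apply the pre-emphasis filter 1 - 0.97 z^{-1}."""
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_and_window(x, fs=16000, frame_ms=20, shift_ms=10):
    """Cut the signal into 20 ms frames with a 10 ms shift and apply a
    Hamming window; MFCC analysis would operate on each row."""
    flen = fs * frame_ms // 1000     # 320 samples at 16 kHz
    shift = fs * shift_ms // 1000    # 160 samples
    n = 1 + (len(x) - flen) // shift
    idx = shift * np.arange(n)[:, None] + np.arange(flen)[None, :]
    return x[idx] * np.hamming(flen)

x = preemphasis(np.random.default_rng(2).standard_normal(16000))  # 1 s of audio
frames = frame_and_window(x)   # shape (99, 320) for a 1 s signal
```

With a 10 ms shift, one second of 16 kHz audio yields 99 full frames of 320 samples each, matching the 20 ms/10 ms analysis grid stated in the text.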
[Japanese text partially recovered: the face recognition experiment used a face image database [22] (about 10 images per person; 150 and 300 images are mentioned). Of the 140 correctly recognized utterances, face extraction succeeded for 60.00% (84/140); face recognition by the method of Sect. 6.1 then succeeded for 76.19% (64/84). The overall rate through the full cascade is therefore 42.67% (64/150), against 93.33% for speech recognition alone. Itemized discussion points 1)-4) and 1)-3) were not recovered.]

8. Conclusions
[Japanese text not recovered.]
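The overall figure follows from multiplying the per-stage rates, since each stage only sees the inputs the previous stage handled correctly. A quick arithmetic check with the numbers from the text:

```python
# Cascade of recognition stages (counts taken from the reported results).
speech = 140 / 150        # speech recognition:  93.33 %
extraction = 84 / 140     # face extraction:     60.00 %
recognition = 64 / 84     # face recognition:    76.19 %

overall = speech * extraction * recognition   # the 140s and 84s cancel
print(round(100 * overall, 2))                # -> 42.67, i.e. 64/150
```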
10 59 6 2003 [23] [ 1 ],, 9 (2000). [ 2 ] TV, http://www.jiten.com/dicmi/docs/k15/18379s.htm [ 3 ] IPv6, http://www.iij.ad.jp/ipv6/ [ 4 ], http://plusd.itmedia.co.jp/broadband/rbb/0203/13/ rbb 0313 10.html [ 5 ] NHK dnhk, http://www.nhk.or.jp/data/ [ 6 ] NHK /digital, http://www.nhk.or.jp/digital/ [ 7 ] R. A. Bolt, Put-that-there : Voice and gesture at the graphics interface, ACM Computer Graphics, Vol. 14, No. 3, 262-270 (1980). [ 8 ] N. Krahnstoever, S. Kettebekov, M. Yeasin, and R. Sharma, A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays, Proc. ICMI 02, 349-354 (2002). [ 9 ] R. Sharma, M. Yeasin, N. Krahnstoever, I. [24] Rauschert, G. Cai, I. Brewer, A. M. MacEachren, and K. Sengupta, Speech Gesture Driven Multimodal Interfaces for Crisis Management, Proc. IEEE, Vol. 91, No. 9, 1327-1354 (2003). [10] R. Sharma, J. Cai, S. Chakravarthy, I. Poddar, and Y.Sethi, Exploiting Speech/Gesture Cooccurrence for Improving Continuous Gesture Recognition Weather Narration, Proc. FG 00, 422-427 (2000). [11] M. Omologo and P. Svaizer, Acoustic Event Localization Using a Crosspower-Spectrum Phase Based 8. Technique, Proc. ICASSP 94, I, 273-276 (1994). [12],,, SP95-62, 1-8 (1995). [13] M. Kaneko and O. Hasegawa, Processing of Face Images and Its Applications, IEICE Transactions on Information and Systems, Vol. E82-D, No. 3, 535-544 (2005). [14] Y. Ariki, N. Ishikawa, and Y. Sugiyama, Face indexing on Video Data Extraction, Recognition, Tracking and Modeling, Proc. FG 98, 62-69, (1998). [15] C. L. Leggetter and P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adap- 5 93.33% tation of Continuous Density Hidden Markov Models, Computer Speech and Language, 9, 171-100.00% 60.00% 185 (1995). 76.19% [16] J. L. Flanagan, J. D. Jhonston, R. Zhan and G. W. Elko, Computer-Steered Microphone Arrays for 42.67% Sound Transduction in Large Rooms, J.Acoust. Soc. Am., 78(5), 1508-1518 (1985). [17] M. Fujimoto, Y. Ariki and S. 
Doshita, Hands- Free Speech Recognition in Real Environments Using Microphone Array and 2-Levels MLLR Adaptation as a Front-End System for Conversational TV, Acoustical Science and Technology, 24(6), 379-381 (2003). [18] Visualeyez USER S MANUAL, PhoeniX Technologies Incorporated [19],,,, (, 1986) [20],,,,, 24(1), 106-112 (1983). [21] http://www.milab.is.tsukuba.ac.jp/jnas/ [22],, http://www.hoip.jp/web catalog/top.html
11 [23],,,, TV, FIT 03, K-039, 507-508 (2003). [24],,,,, S-tgif,, SP96-32, 89-96 (1996). 1997 2001 2004 ATR 2003 ISCA IEEE 1974 1976 1979 1980 1990 1992 2003 1987 1990 IEEE ISCA 1958 1960 1963 1965 1968 1973 1996 ( ) 1998 1999 2003 1959 1988 1990 30 1999 1994 1995 1997 1998