paper.dvi


Person Recognition for News Videos through Multimodal Interaction
Masakiyo Fujimoto, Yasuo Ariki and Shuji Doshita
1 ATR

2 % 76.19% [2] IPv6 [3] [4] [5] NHK BS [6]

3 [7] Put-that-there [14] Electronic Program Guide (EPG) Put-that-there VTR EPG Put-that-there [8] [8] [9] [10] [10] 97% [8] [13] [11] [12]
[Figure: System overview. Labels: User, "Who is he?", Pointing, Speech input, Action input, Speech recog, Action recog, Face extraction, Face recognition, Web, Information retrieval, Information presentation.]

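4 [12] [16]
[Figure: Beam forming with a microphone array. Labels: microphone signals y_1(t), ..., y_M(t); microphone spacing d; arrival direction θ; delays τ, ..., (M-1)τ; output x(t) = M y_1(t).]
MLLR (Maximum Likelihood Linear Regression) [15]
[Figure: Hands-free speech input. Blocks: Observed signal, Speaker direction estimation, Beam forming, User utterance section detection, Acoustic model adaptation, Speech recognition.]

4.2 CSP (Cross-power Spectrum Phase analysis) [11]
For the signals y_i(t) and y_j(t) observed at microphones i and j, the CSP coefficient CSP_{i,j}(k) is

$$ \mathrm{CSP}_{i,j}(k) = F^{-1}\left[ \frac{F[y_i(t)]\, F[y_j(t)]^{*}}{\left|F[y_i(t)]\right| \left|F[y_j(t)]\right|} \right] \tag{1} $$

where F and F^{-1} denote the Fourier transform and the inverse Fourier transform.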
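As an illustration of Eq. (1), the following is a minimal NumPy sketch of the CSP coefficient for one pair of frames. The function names `csp` and `peak_lag`, the small constant `eps`, and the use of a circular (FFT-length) correlation are assumptions made for this sketch, not details taken from the paper; the complex conjugate on the second spectrum follows the usual definition of CSP and is assumed here as well.

```python
import numpy as np


def csp(y_i, y_j, eps=1e-12):
    """CSP coefficients of two equal-length frames (sketch of Eq. (1)).

    eps is a hypothetical guard against division by zero.
    """
    Y_i = np.fft.fft(y_i)
    Y_j = np.fft.fft(y_j)
    cross = Y_i * np.conj(Y_j)
    # Keep only the phase of the cross-power spectrum, then go back to lags.
    return np.real(np.fft.ifft(cross / (np.abs(cross) + eps)))


def peak_lag(coeffs):
    """Signed lag (in samples) of the largest CSP coefficient."""
    n = len(coeffs)
    k = int(np.argmax(coeffs))
    # Indices above n/2 correspond to negative lags of the circular result.
    return k if k <= n // 2 else k - n


# Hypothetical test signal: white noise and a copy delayed by 5 samples.
rng = np.random.default_rng(0)
noise = rng.standard_normal(256)
delayed = np.roll(noise, 5)
print(peak_lag(csp(delayed, noise)))
```

With this sign convention a positive lag means the first argument is the later (delayed) signal. Real recordings would use windowed frames rather than a circularly shifted copy.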

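The delay τ between the two microphones is the lag k that maximizes the CSP coefficient:

$$ \tau = \arg\max_{k} \left( \mathrm{CSP}_{i,j}(k) \right) \tag{2} $$

With sound velocity c, sampling frequency f and microphone spacing d, the arrival direction θ is

$$ \theta = \cos^{-1}\left( \frac{c \cdot \tau / f}{d} \right) \tag{3} $$

4.3
[Figure: Direction Of Arrival (DOA) over time. Axes: time (s), DOA (deg.). Annotation: User utterance section.]
[Figure: Room layout. Labels: PC x6, Loudspeaker (TV sound), Screen, News, 5.0m.]

4.4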
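Eq. (3) can be checked with a short worked example. The sketch below converts a delay in samples to an arrival direction in degrees. The function name `doa_degrees`, the sound velocity of 340 m/s, the clipping of the arccosine argument, and all numeric inputs are assumptions chosen for illustration, not values reported in the paper.

```python
import math


def doa_degrees(tau_samples, fs_hz, d_m, c_mps=340.0):
    """Arrival direction in degrees from a delay in samples (sketch of Eq. (3)).

    c_mps = 340 m/s is an assumed sound velocity.
    """
    # Path-length difference between the two microphones, divided by spacing.
    arg = c_mps * (tau_samples / fs_hz) / d_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    arg = max(-1.0, min(1.0, arg))
    return math.degrees(math.acos(arg))


# Hypothetical inputs: 8-sample delay, 16 kHz sampling, 0.34 m spacing.
print(doa_degrees(8, 16000.0, 0.34))
# Zero delay corresponds to a source broadside to the microphone pair.
print(doa_degrees(0, 16000.0, 0.34))
```

Because τ is an integer number of samples, the angular resolution depends on the ratio of the sampling frequency to the spacing: a wider pair or a higher sampling frequency gives more distinct delay values across the same range of angles.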

MLLR [15] [17]; 2-level MLLR [17]

4.5 DOA
[Figure: DOA stability section. Label: News sound + user utterance.]
[Figure: Room layout. Labels: Microphone array, Loudspeaker (TV sound), Screen, Digital projector, Speaker (User), PC x2, PC, News. Dimensions: 10.0m, 8.0m, 4.9m, 2.85m, 2.4m, 2.4m, 2.0m, 1.2m, 0.8m.]

6 LED LED 3 [18] 2 LED LED 15cm 7 7cm LED 7 PC PC 4 7 PC PC PC PC [19]

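7 [20]
x: m-dimensional vector (m = 150). From the training vectors {x_t} (t = 1, ..., N), the mean vector µ and the covariance matrix Σ are

$$ \mu = \frac{1}{N} \sum_{t=1}^{N} x_t \tag{4} $$

$$ \Sigma = \frac{1}{N} \sum_{t=1}^{N} (x_t - \mu)(x_t - \mu)^T \tag{5} $$

Σ is decomposed as

$$ \Sigma = V \Lambda V^T \tag{6} $$

where Λ is the diagonal matrix of the eigenvalues λ_d (d = 1, ..., k, ..., m) of Σ, and V is the matrix of the eigenvectors ϕ_d (d = 1, ..., k, ..., m) of Σ. The projection distance PD of an input vector x is

$$ PD = \|x - \mu\|^2 - \sum_{d=1}^{k} (x - \mu, \phi_d)^2 \tag{7} $$

[Figure: Search window of n x n pixels on the input image.]
[Figure: Eigenface space in the observation space. Labels: ϕ_1, ϕ_2, ϕ_3, µ, x, x - µ, PD, λ_1 > λ_2 > λ_3.]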
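Eqs. (4) to (7) can be sketched in a few lines of NumPy. The function names `fit_eigenspace` and `projection_distance`, the use of `numpy.linalg.eigh` for the decomposition in Eq. (6), and the toy data are assumptions for illustration; the paper's own implementation is not shown in the text.

```python
import numpy as np


def fit_eigenspace(X, k):
    """Mean vector and top-k eigenvectors from training rows X (Eqs. (4)-(6))."""
    mu = X.mean(axis=0)                       # Eq. (4)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]              # Eq. (5)
    eigvals, eigvecs = np.linalg.eigh(cov)    # Eq. (6); eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return mu, eigvecs[:, order]


def projection_distance(x, mu, Phi):
    """Squared distance from x to the k-dimensional eigenspace (Eq. (7))."""
    diff = x - mu
    coeffs = Phi.T @ diff                     # inner products (x - mu, phi_d)
    return float(diff @ diff - coeffs @ coeffs)


# Hypothetical toy data: 50 training vectors lying in a 2-D plane of R^5.
rng = np.random.default_rng(0)
X = np.zeros((50, 5))
X[:, 0] = rng.standard_normal(50) * 3.0
X[:, 1] = rng.standard_normal(50)
mu, Phi = fit_eigenspace(X, k=2)

inside = X[0]
outside = X[0] + np.array([0.0, 0.0, 0.0, 0.0, 3.0])
print(projection_distance(inside, mu, Phi))
print(projection_distance(outside, mu, Phi))
```

A vector inside the plane spanned by the training data has a projection distance near zero, while one displaced orthogonally by 3 has a distance near 9. This is why a small PD can serve as a score for how face-like a search window is.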

kHz (16bit) CSP NHK ms 256ms Hamming Window
kHz (16bit) 13 MFCC (0 to 12) + + (39) TCP/IP 20ms 10ms Hamming Window
PC: Intel Xeon 1.7GHz x2, Memory 512MByte
10m x 8m; 1.2m; d = 2cm; 2m
58dB(A); 40dB(A); T_60 = 0.3 [sec]
monophone HMM [21]
PC: Intel Pentium4 1.7GHz, Memory 256MByte

Table 4 (%)
Method                         Rate (%)
Beam Forming                   67.33 (101/150)
Beam Forming + 2-level MLLR    93.33 (140/150)

7.2
100% MLLR % 100% %
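The "20ms 10ms Hamming Window" setting above can be illustrated with a short framing sketch. It assumes that 20 ms is the frame length and 10 ms the frame shift, and it uses a hypothetical sampling frequency of 16 kHz, since the actual value is missing from the text. The function name `frame_signal` is also an assumption.

```python
import numpy as np


def frame_signal(x, fs_hz, frame_ms=20.0, shift_ms=10.0):
    """Split x into overlapping frames and apply a Hamming window to each.

    frame_ms and shift_ms are assumed to mean frame length and frame shift.
    """
    frame_len = int(round(fs_hz * frame_ms / 1000.0))
    shift = int(round(fs_hz * shift_ms / 1000.0))
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // shift
    # Row r holds samples r*shift ... r*shift + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)


# Hypothetical input: one second of noise at an assumed 16 kHz sampling rate.
signal = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(signal, fs_hz=16000)
print(frames.shape)
```

Each row would then be passed to the feature extraction stage (MFCC in the paper), which is not reproduced here.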

[22] 60.00% (84/140) 76.19% (64/84) 42.67% 93.33%
1) 2) 3) 4)
1) 2) 3)
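The counts that survive in the text (140/150, 84/140 and 64/84) chain together, so the 42.67% figure can be verified as the product of the three stage rates. The sketch below does this arithmetic with exact fractions; the variable names are assumptions, and the stages are deliberately left unnamed because their labels are missing from the text.

```python
from fractions import Fraction

# Stage counts as they appear in the text; stage names are not recoverable.
stages = [Fraction(140, 150), Fraction(84, 140), Fraction(64, 84)]

overall = Fraction(1)
for rate in stages:
    overall *= rate

stage_percent = [round(float(r) * 100, 2) for r in stages]
overall_percent = round(float(overall) * 100, 2)

print(stage_percent)
print(overall, overall_percent)
```

Because each stage is evaluated only on the successes of the previous one, the product collapses to 64/150, which matches the overall rate quoted in the text.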

[1] ,, 9 (2000).
[2] TV,
[3] IPv6,
[4] , rbb html
[5] NHK dnhk,
[6] NHK /digital,
[7] R. A. Bolt, Put-that-there: Voice and gesture at the graphics interface, ACM Computer Graphics, Vol. 14, No. 3 (1980).
[8] N. Krahnstoever, S. Kettebekov, M. Yeasin, and R. Sharma, A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays, Proc. ICMI '02 (2002).
[9] R. Sharma, M. Yeasin, N. Krahnstoever, I. Rauschert, G. Cai, I. Brewer, A. M. MacEachren, and K. Sengupta, Speech Gesture Driven Multimodal Interfaces for Crisis Management, Proc. IEEE, Vol. 91, No. 9 (2003).
[10] R. Sharma, J. Cai, S. Chakravarthy, I. Poddar, and Y. Sethi, Exploiting Speech/Gesture Co-occurrence for Improving Continuous Gesture Recognition in Weather Narration, Proc. FG '00 (2000).
[11] M. Omologo and P. Svaizer, Acoustic Event Localization Using a Crosspower-Spectrum Phase Based Technique, Proc. ICASSP '94, I (1994).
[12] ,,, SP95-62, 1-8 (1995).
[13] M. Kaneko and O. Hasegawa, Processing of Face Images and Its Applications, IEICE Transactions on Information and Systems, Vol. E82-D, No. 3 (2005).
[14] Y. Ariki, N. Ishikawa, and Y. Sugiyama, Face Indexing on Video Data: Extraction, Recognition, Tracking and Modeling, Proc. FG '98, 62-69 (1998).
[15] C. L. Leggetter and P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, Computer Speech and Language, 9, 171-185 (1995).
[16] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, Computer-Steered Microphone Arrays for Sound Transduction in Large Rooms, J. Acoust. Soc. Am., 78(5) (1985).
[17] M. Fujimoto, Y. Ariki, and S. Doshita, Hands-Free Speech Recognition in Real Environments Using Microphone Array and 2-Levels MLLR Adaptation as a Front-End System for Conversational TV, Acoustical Science and Technology, 24(6) (2003).
[18] Visualeyez USER'S MANUAL, PhoeniX Technologies Incorporated.
[19] ,,,, (, 1986)
[20] ,,,,, 24(1), (1983).
[21]
[22] ,, catalog/top.html
[23] ,,,, TV, FIT '03, K-039 (2003).
[24] ,,,,, S-tgif,, SP96-32 (1996).

[Author biographies: ATR, 2003, ISCA, IEEE.]

