Journal of the Robotics Society of Japan, Vol. 31, No. 7, pp. 676-683, 2013

Incremental Noise Estimation in Outdoor Auditory Scene Analysis using a Quadrocopter with a Microphone Array

Keita Okutani 1, Takami Yoshida 1, Keisuke Nakamura 2 and Kazuhiro Nakadai 1,2

1 Graduate School of Information Science and Engineering, Tokyo Institute of Technology
2 Honda Research Institute Japan Co., Ltd.

This paper addresses sound source localization using an aerial vehicle with a microphone array in an outdoor environment, to realize outdoor auditory scene analysis. It aims, for instance, at finding distressed people in a disaster situation. In such an environment the noise is loud and dynamically changing, and conventional microphone array techniques studied in the field of indoor robot audition are of limited use. We therefore propose MUltiple SIgnal Classification based on incremental Generalized EigenValue Decomposition (iGEVD-MUSIC), which copes with dynamically changing, high-power noise by introducing incrementally estimated noise correlation matrices. We developed a prototype system for outdoor auditory scene analysis based on the proposed method, using a Parrot AR.Drone with an 8 ch microphone array and a Kinect device. Experimental results with the prototype showed that dynamically changing noise is properly suppressed by the proposed method even when the signal-to-noise ratio is below 0 dB, in both indoor and outdoor environments and with the AR.Drone either hovering or moving.

Key Words: Sound Source Localization, Outdoor, Quadrocopter

1. Introduction

(Section 1, originally in Japanese, introduces computational auditory scene analysis (CASA) and robot audition [1] [2] [3], and motivates carrying a microphone array on an aerial vehicle for outdoor auditory scene analysis such as disaster response [4] [5], illustrated in Fig. 1. Received December 10, 2012.)

Fig. 1 Auditory scene analysis in an outdoor environment
(The remainder of the introduction and the related work, originally in Japanese, survey acoustic vector sensors (AVS) and unmanned aerial vehicles (UAV) [6] [7], beamforming [8] [9] [10], Root-MUSIC [11], ESPRIT [12], a cumulant-based method [13], and Multiple Signal Classification (MUSIC) [14]. Standard MUSIC degrades at low signal-to-noise ratio (SNR). MUSIC based on Generalized EigenValue Decomposition (GEVD-MUSIC) can suppress noise even about 10 dB louder than the target [15] [16], but it assumes a noise correlation matrix measured in advance. We extend GEVD-MUSIC to incremental GEVD-MUSIC (iGEVD-MUSIC), which estimates the noise correlation matrix incrementally [17]. The prototype records the microphone array with 24 bit A/D conversion and carries a Kinect device for reference measurements. Section 2 describes iGEVD-MUSIC, Section 3 the prototype system, Section 4 the experiments, and Section 5 concludes the paper.)

2. iGEVD-MUSIC

2.1 Overview

(This subsection, originally in Japanese, contrasts MUSIC based on Standard EigenValue Decomposition (SEVD-MUSIC), GEVD-MUSIC, and the proposed iGEVD-MUSIC. The variables and parameters used below are listed in Table 1.)

2.2 Formulation of iGEVD-MUSIC

Let M be the number of microphones and f the frame index. The M-channel observation X is modeled as

  X(ω, f) = Σ_{l=1}^{L} A_l(ω, θ_l) S_l(ω, θ_l, f) + N(ω, f),    (1)

  N(ω, f) = Σ_{j=1}^{J} A^n_j(ω, θ_j) N_j(ω, θ_j, f) + N_d(ω, f),    (2)

where A_l is the transfer function of the l-th sound source S_l with direction θ_l at frequency bin ω, L is the number of sound sources, N is the noise, A^n_j and N_j are the transfer function and signal of the j-th of J directional noise sources with direction θ_j, and N_d is non-directional noise. SEVD-MUSIC cannot distinguish the target sources S_l from the directional noises N_j; in particular, when L + J = M no noise subspace remains and the S_l cannot be localized.

Table 1 Variables and parameters
  M : number of microphones (8)
  L : number of sound sources (L < M)
  m : microphone index (1 ≤ m ≤ M)
  l : sound source index (1 ≤ l ≤ L)
  f : frame index
  ω : frequency bin index
  θ_l : l-th sound source direction
  ψ : direction of a steering vector
  X_m(ω, f) : observation for m-th microphone at f-th frame
  X(ω, f) : [X_1(ω, f), X_2(ω, f), ..., X_M(ω, f)]^T
  G(ω, ψ) : steering vector
  R(ω, f) : correlation matrix of X(ω, f) (∈ C^{M×M})
  e_m(ω, f) : m-th eigenvector of R(ω, f)
  E(ω, f) : [e_1(ω, f), ..., e_M(ω, f)]
  λ_m(ω, f) : m-th eigenvalue of R(ω, f) (λ_1 ≥ λ_2 ≥ ... ≥ λ_M)
  Λ(ω, f) : diag(λ_1, ..., λ_M)
  ω_L, ω_H : min and max frequencies for MUSIC (500, 2,800 [Hz])
  T_R : number of frames for averaging correlation matrices
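The observation model of Eqs. (1) and (2) can be sketched numerically. The following toy simulation (a minimal illustration for one frequency bin; the random transfer functions, source spectra, and noise level are assumptions, not the paper's data) builds the multichannel observation X and its correlation matrix with M = 8 microphones as in Table 1.

```python
import numpy as np

# Toy instance of the observation model in Eq. (1) for one frequency bin:
#   X(w, f) = sum_l A_l(w, theta_l) S_l(w, theta_l, f) + N(w, f).
# M follows Table 1; all other values are illustrative assumptions.
rng = np.random.default_rng(0)
M, L, F = 8, 2, 200                     # microphones, sources, frames

# Random unit-modulus transfer functions stand in for A_l(w, theta_l).
A = np.exp(2j * np.pi * rng.random((M, L)))

# Complex source spectra S_l and additive sensor noise N.
S = rng.standard_normal((L, F)) + 1j * rng.standard_normal((L, F))
N = 0.1 * (rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F)))

X = A @ S + N                           # multichannel observation, Eq. (1)

# Correlation matrix of X as defined in Table 1 (here over all F frames).
R = X @ X.conj().T / F
```

With L = 2 dominant sources, R has two large eigenvalues and M − L = 6 small ones; this split into signal and noise subspaces is exactly what the MUSIC variants below exploit.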
Table 2 Configuration of moving microphone array
  quadrocopter: AR.Drone, 525 × 515 [mm], payload: 130 [g]
  recording device: RASP-24, 55 × 90 × 33 [mm], 104 [g]
  microphone: MEMS, 20 × 15 [mm], 3 [g]

Fig. 2 Noise correlation matrix estimation in iGEVD-MUSIC

In GEVD-MUSIC and iGEVD-MUSIC, the number of target sources is assumed to satisfy L < M, and steering vectors G(ω, ψ) are prepared in advance for candidate directions ψ at 5 [deg] intervals. The correlation matrix R(ω, f) of the observation is averaged over T_R frames:

  R(ω, f) = (1/T_R) Σ_{τ=f−T_R}^{f} X(ω, τ) X*(ω, τ),    (3)

where * denotes the conjugate transpose. GEVD-MUSIC whitens R with a noise correlation matrix K measured in advance, so it cannot follow dynamically changing noise. iGEVD-MUSIC instead estimates K incrementally from past observation frames (Fig. 2):

  K(ω, f) = (1/T_N) Σ_{τ=f−f_s−T_N}^{f−f_s} X(ω, τ) X*(ω, τ),    (4)

where T_N is the number of frames averaged for K and f_s is the frame shift between the frames used for R and those used for K; the frames used for K are assumed to be dominated by noise.

iGEVD-MUSIC decomposes K^{-1}(ω, f) R(ω, f) by GEVD (Eq. (5)), GEVD-MUSIC does the same with a static K(ω) (Eq. (6)), and SEVD-MUSIC decomposes R(ω, f) by SEVD (Eq. (7)):

  K^{-1}(ω, f) R(ω, f) = E(ω, f) Λ(ω, f) E*(ω, f),    (5)

  K^{-1}(ω) R(ω, f) = E(ω, f) Λ(ω, f) E*(ω, f),    (6)

  R(ω, f) = E(ω, f) Λ(ω, f) E*(ω, f),    (7)

where Λ(ω, f) = diag(λ_1, ..., λ_M) and E(ω, f) = [e_1(ω, f), ..., e_M(ω, f)]. Using the noise-subspace eigenvectors e_m(ω, f) and the steering vector G(ω, ψ) for direction ψ, the MUSIC spectrum P(ω, ψ, f) is

  P(ω, ψ, f) = |G*(ω, ψ) G(ω, ψ)| / Σ_{m=L+1}^{M} |G*(ω, ψ) e_m(ω, f)|.    (8)

P(ω, ψ, f) is then averaged over the frequency bins from ω_L to ω_H:

  P(ψ, f) = (1/(ω_H − ω_L + 1)) Σ_{ω=ω_L}^{ω_H} P(ω, ψ, f).    (9)

The L largest peaks of P(ψ, f) over ψ are taken as the estimated sound source directions.

3. Prototype System

3.1 Hardware

The hardware configuration is shown in Fig. 4 and Table 2. (Details originally in Japanese: the payload of the AR.Drone is 130 [g], and the mounted microphone array and recording device weigh 115 [g] in total; a Kinect-related dimension of 35 [mm] also appears in the original.)

3.2 Software

The 8 ch microphone array signals are recorded and processed on a PC as shown in Fig. 3. (Details originally in Japanese: the modules run on ROS (http://www.ros.org/), the Kinect provides reference measurements, and iGEVD-MUSIC is implemented using the robot audition software HARK [18].)
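Eqs. (3)-(8) can be condensed into a short numerical sketch for a single frequency bin. The code below is a minimal illustration, not the authors' HARK-based implementation: the function names, the small regularization added to K, and the direct use of the generalized eigenvectors (which are K-orthonormal rather than orthonormal) are all assumptions. The generalized eigenvalue problem of Eq. (5) is solved with SciPy's `eigh`.

```python
import numpy as np
from scipy.linalg import eigh

def corr_matrix(X, f, T):
    """Average X(tau) X(tau)^H over the T frames ending at frame f,
    as in Eqs. (3) and (4)."""
    frames = X[:, max(0, f - T + 1):f + 1]          # shape (M, <=T)
    return frames @ frames.conj().T / frames.shape[1]

def igevd_music_spectrum(X, G, f, L, T_R, T_N, f_s):
    """X: (M, F) observations of one frequency bin, G: (M, D) steering
    vectors for D candidate directions, f: current frame, L: source count."""
    R = corr_matrix(X, f, T_R)                      # signal correlation, Eq. (3)
    K = corr_matrix(X, f - f_s, T_N)                # incremental noise corr., Eq. (4)
    K += 1e-6 * np.trace(K).real / K.shape[0] * np.eye(K.shape[0])  # keep K invertible
    # GEVD of Eq. (5): R e = lambda K e, equivalent to K^{-1} R e = lambda e.
    lam, E = eigh(R, K)                             # eigenvalues in ascending order
    E_noise = E[:, : X.shape[0] - L]                # the M - L noise eigenvectors
    num = (np.abs(G) ** 2).sum(axis=0)              # |G* G| for each direction
    den = np.abs(G.conj().T @ E_noise).sum(axis=1)  # sum_m |G* e_m|, Eq. (8)
    return num / den
```

Solving `eigh(R, K)` implicitly whitens the observation by the incrementally estimated noise correlation matrix: when the history frames contain only ego-noise, that noise is flattened in the generalized spectrum and only the target-source peaks remain, which is the effect iGEVD-MUSIC relies on.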
Fig. 3 System structure for outdoor auditory scene analysis

Fig. 4 Quadrocopter with microphone array

Fig. 5 Position of a vehicle and speakers: a) Hovering, b) Moving

4. Experiments

iGEVD-MUSIC was compared with SEVD-MUSIC and GEVD-MUSIC using the AR.Drone (http://ardrone.parrot.com/parrot-ar-drone/jp/) prototype with the Kinect-based reference system.

4.1 Experimental conditions

(This subsection, originally in Japanese, defines three experiments, Ex1-Ex3, combining indoor/outdoor environments with a hovering or moving AR.Drone. Fig. 5 shows the positions of the vehicle and the loudspeakers; recoverable values include a distance of 1.5 [m] and a 45 [deg] direction for the hovering case (Fig. 5 a)) and the moving case (Fig. 5 b)). The RASP-24 recording device is described at http://www.sifi.co.jp/solution/product.php?contentid=7&categoryid=4&page_name=rasp-24.)

The indoor and outdoor environments are shown in Fig. 6, and the loudness of each sound and noise source is listed in Table 3.

Fig. 6 Experimental environments: a) Indoor, b) Outdoor

Table 3 Loudness of sound and noise sources
  sound source     | indoor | outdoor | propeller | utterance
  sound level [dB] |   35   |   45    |    89     |    75

4.2 Results

(Originally in Japanese: SEVD-MUSIC, GEVD-MUSIC and iGEVD-MUSIC are compared in four respects, detailed in Sections 4.2.1-4.2.4: the spatial spectrograms P(ψ, f), utterance-based localization performance, noise suppression performance, and the dependence of iGEVD-MUSIC on the parameters T_N and f_s.)
Fig. 7 Spatial spectrograms and localization results (Indoor-hovering)

Fig. 8 Spatial spectrograms and localization results (Indoor-moving)

Fig. 9 Spatial spectrograms and localization results (Outdoor-hovering)

4.2.1 Spatial spectrograms

(Originally in Japanese.) Figs. 7-9 b)-d) show the MUSIC spatial spectrograms P(ψ, f) for the three methods, with direction steps of 5 [deg] and a frame period of 0.1 [s], and Figs. 7-9 a) show the reference trajectories measured with the Kinect. (The original notes the Kinect's measurement accuracy, about 1.3 [cm] at a range of 3.5 [m], and the mounting on the AR.Drone.)

Table 4 Noise suppression performance
  Method      | std. dev.
  SEVD-MUSIC  |   0.14
  GEVD-MUSIC  |   0.12
  iGEVD-MUSIC |   0.058

Table 5 Utterance-based localization performance (%)
  Condition        | SEVD LAR | SEVD LCR | GEVD LAR | GEVD LCR | iGEVD LAR | iGEVD LCR
  Indoor-Hovering  |   -100   |    36    |    93    |    93    |    100    |    100
  Indoor-Moving    |   -186   |    39    |    42    |    92    |     50    |     92
  Outdoor-Hovering |   -207   |    29    |   -121   |    14    |     64    |     79

(Discussion, originally in Japanese: in Fig. 7 (indoor, hovering), the propeller noise of the AR.Drone reaches 89 [dB]; SEVD-MUSIC is buried in this noise, while GEVD-MUSIC and iGEVD-MUSIC suppress it. In Fig. 8 (indoor, moving), the noise changes as the AR.Drone moves, so GEVD-MUSIC with its fixed noise correlation matrix degrades (around 160 in the original), whereas iGEVD-MUSIC keeps suppressing the noise. Fig. 9 shows the outdoor hovering case.)
Fig. 10 Normalized histograms

Fig. 11 Difference between speaking area and non-speaking area

(Continuation of 4.2.1, originally in Japanese: outdoors, the ambient noise is 45 [dB] and further values of 10-15 [dB] appear in the original; SEVD-MUSIC and GEVD-MUSIC degrade, while iGEVD-MUSIC still localizes the sources.)

4.2.2 Localization performance

Localization was evaluated by the Localization Accuracy Rate (LAR) and the Localization Correct Rate (LCR):

  LAR = (N_t − N_S − N_D − N_I) / N_t,    (10)

  LCR = (N_t − N_S − N_D) / N_t,    (11)

where N_t is the total number of utterances, N_S the number of substitution errors, N_D the number of deletion errors, and N_I the number of insertion errors. (Tolerances of ±0.2 and ±5 appear in the original as the criteria for counting a localization as correct.) The results are listed in Table 5: iGEVD gives the best LAR and LCR under every condition. Since N_I is unbounded, LAR becomes negative when insertion errors are frequent.

4.2.3 Noise suppression performance

(Originally in Japanese: the spatial spectrum P(ψ, f) is normalized to the range 0-1, and Fig. 10 shows the normalized histograms of P for each method; Table 4 lists the standard deviation of the normalized spectrum. As in Fig. 10 c), iGEVD-MUSIC yields the smallest standard deviation, i.e. the flattest spectrum in noise-only regions, confirming its noise suppression.)

4.2.4 Parameter dependence of iGEVD-MUSIC

(Originally in Japanese: the parameters T_N and f_s of Fig. 2 control the incremental noise correlation estimation in iGEVD-MUSIC. Fig. 11 shows the difference in P(ψ, f) between the speaking and non-speaking areas as a function of T_N and f_s; one frame corresponds to 0.01 [s].)
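The evaluation metrics of Eqs. (10) and (11) are simple counting formulas; the sketch below (function names and the example counts are hypothetical, not the paper's data) expresses them in percent, as reported in Table 5.

```python
def lar(n_t, n_s, n_d, n_i):
    """Localization Accuracy Rate in percent, Eq. (10): substitutions (n_s),
    deletions (n_d) and insertions (n_i) all count against n_t utterances."""
    return 100.0 * (n_t - n_s - n_d - n_i) / n_t

def lcr(n_t, n_s, n_d):
    """Localization Correct Rate in percent, Eq. (11): like LAR but
    insertions are not penalized."""
    return 100.0 * (n_t - n_s - n_d) / n_t
```

Because n_i is unbounded, LAR drops below zero as soon as spurious detections outnumber the correctly localized utterances, which is why SEVD-MUSIC can score a LAR of -100% in Table 5 while its LCR stays non-negative.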
Fig. 12 Outdoor localization experiment: (a) Spatial spectrograms, localization results and pictures, (b) Reference

4.3 Outdoor experiment with a moving vehicle

(Originally in Japanese: the experiment used two loudspeakers; recoverable values include distances of 1-2 [m] and a count of 24, possibly the number of utterances. iGEVD-MUSIC was run with T_N = 90 [frame] and f_s = 140 [frame]. Fig. 12 (a) shows the MUSIC spatial spectrograms, the localization results, and pictures of the scene; Fig. 12 (b) shows the reference. The scores were LAR = 21% and LCR = 42%, with deletion errors N_D dominating; template-based ego-noise estimation [19] and audio-visual voice activity detection [20] are mentioned as possible remedies.)

5. Conclusion

(The conclusion, originally in Japanese, summarizes that iGEVD-MUSIC on the Kinect-equipped AR.Drone prototype suppresses dynamically changing, high-power noise and improves localization over conventional MUSIC; recoverable figures in the original are 15, 30% and 80% (cf. Table 5), and future work is discussed.)
(Acknowledgment, originally in Japanese: this work was supported in part by Grants-in-Aid Nos. 24118702 and 22700165.)

References

[ 1 ] I. Hara, F. Asano, H. Asoh, J. Ogata, N. Ichimura, Y. Kawai, F. Kanehiro, H. Hirukawa and K. Yamamoto: "Robust speech interface based on audio and video information fusion for humanoid HRP-2," Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), pp. 2404-2410, 2004.
[ 2 ] J.-M. Valin, F. Michaud and J. Rouat: "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering," Robotics and Autonomous Systems, vol. 55, no. 3, pp. 216-228, 2007.
[ 3 ] S. Thompson et al.: (title in Japanese; a 64 ch microphone array system), AI Challenge, pp. 3-8, 2010.
[ 4 ] (In Japanese), pp. 2381-2384, 2011.
[ 5 ] K. Okutani, T. Yoshida, K. Nakamura and K. Nakadai: "Outdoor auditory scene analysis using a moving microphone array embedded in a quadrocopter," Proc. of 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), pp. 3288-3293, 2012.
[ 6 ] B. Kaushik, D. Nance and K. K. Ahuja: "A review of the role of acoustic sensors in the modern battlefield," 11th AIAA/CEAS Aeroacoustics Conference (26th AIAA Aeroacoustics Conference), pp. 1-13, 2005.
[ 7 ] H.-E. de Bree: "Acoustic vector sensors increasing UAV's situational awareness," SAE Technical Paper 2009-01-3249, 2009.
[ 8 ] B. D. Van Veen and K. M. Buckley: "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24, 1988.
[ 9 ] J. Burg: "Maximum entropy spectral analysis," 37th Meeting of the Society of Exploration Geophysicists, 1967; reprinted in Modern Spectrum Analysis (D. G. Childers, ed.), pp. 34-39, IEEE Press, 1978.
[10] S. S. Reddi: "Multiple source location: a digital approach," IEEE Trans. on Aerospace and Electronic Systems, vol. AES-15, pp. 95-105, 1979.
[11] B. D. Rao and K. V. S. Hari: "Performance analysis of Root-MUSIC," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 12, pp. 1939-1949, 1989.
[12] R. Roy and T. Kailath: "ESPRIT: estimation of signal parameters via rotational invariance techniques," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 984-995, 1989.
[13] A. Swami and J. M. Mendel: "Cumulant-based approach to harmonic retrieval and related problems," IEEE Trans. on Signal Processing, vol. 39, no. 5, pp. 1099-1109, 1991.
[14] R. Schmidt: "Multiple emitter location and signal parameter estimation," IEEE Trans. on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.
[15] K. Nakamura, K. Nakadai, F. Asano, Y. Hasegawa and H. Tsujino: "Intelligent sound source localization for dynamic environments," Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pp. 664-669, 2009.
[16] K. Nakamura, K. Nakadai, F. Asano and G. Ince: "Intelligent sound source localization and its application to multimodal human tracking," Proc. of 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2011), pp. 143-148, 2011.
[17] (In Japanese), 30th Annual Conference of the Robotics Society of Japan, 3D1-2 (DVD-ROM), 2012.
[18] K. Nakadai, T. Takahashi, H. G. Okuno, H. Nakajima, Y. Hasegawa and H. Tsujino: "Design and implementation of robot audition system HARK: open source software for listening to three simultaneous speakers," Advanced Robotics, vol. 24, no. 5-6, pp. 739-761, 2010.
[19] G. Ince, K. Nakadai and K. Nakamura: "Online learning for template-based multi-channel ego noise estimation," Proc. of 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), pp. 3282-3287, 2012.
[20] T. Yoshida and K. Nakadai: "Active audio-visual integration for voice activity detection based on a causal Bayesian network," Proc. of the 2012 IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), pp. 370-375, 2012.

Keita Okutani
(Biography originally in Japanese; recoverable dates: 2011 and March 2013, at Tokyo Institute of Technology.)

Keisuke Nakamura
(Biography originally in Japanese; studied at the University of Strathclyde; recoverable dates: 2007, 2010 and 2013.)

Takami Yoshida
(Biography originally in Japanese; recoverable dates: 2008, 2010 and March 2013.)

Kazuhiro Nakadai
(Biography originally in Japanese; career includes NTT, the JST ERATO project, and Honda Research Institute Japan; recoverable dates span 1993-2011; IEEE member.)