IPSJ Journal Vol. 52, No. 12, pp. 3853-3867 (Dec. 2011)

VocaListener: A Singing Synthesis System by Mimicking Pitch and Dynamics of User's Singing

Tomoyasu Nakano *1 and Masataka Goto *1

This paper presents a singing synthesis system, VocaListener, that interactively synthesizes a singing voice by mimicking the pitch and dynamics of a user's singing voice. Although there is a method to estimate singing synthesis parameters of pitch (F0) and dynamics (power) from a singing voice, it does not adapt to different singing synthesis conditions (e.g., different singing synthesis systems and their singer databases) or to singing skill/style modifications. To deal with different conditions, VocaListener repeatedly updates the singing synthesis parameters so that the synthesized singing mimics the user's singing more closely. Moreover, VocaListener has functions that help modify the user's singing by correcting off-pitch phrases or changing vibrato. In an experimental evaluation under two different singing synthesis conditions, mean error values after the iteration were much smaller than those of the previous approach.

1. Introduction

Since a commercial singing synthesizer became widely popular in 2007 1), many synthesized songs have been created and shared on the Web 2),3). Singing synthesis itself has a long research history 4)-7), including systems based on waveform samples 8)-10) and on HMMs 11). By analogy with text-to-speech (TTS) synthesis, such systems can be regarded as text-to-singing (lyrics-to-singing) synthesis 12),13); speech-to-singing synthesis, which converts a speaking voice reading the lyrics into a singing voice, has also been studied 12),14).

*1 National Institute of Advanced Industrial Science and Technology (AIST)

(c) 2011 Information Processing Society of Japan
In contrast to these approaches 12),15), VocaListener takes a user's singing voice and its lyrics as input, and can thus be regarded as singing-to-singing synthesis. Janer et al. 16) proposed a related performance-driven control method; VocaListener differs in that it iteratively updates the parameters of an existing commercial synthesizer (YAMAHA's Vocaloid 10)) until the synthesized singing mimics the target. Section 2 describes the problems of singing synthesis under different conditions, Section 3 describes VocaListener, Sections 4 and 5 present an experimental evaluation and discussion, and Section 6 concludes the paper.

2. Singing Synthesis Under Different Conditions

In lyrics-to-singing synthesis with YAMAHA's Vocaloid 10), the synthesized result depends not only on the specified parameters but also on the synthesis condition (Fig. 1) 17).

Fig. 1 Even if the same parameters are specified, the synthesized results always differ when we change the synthesis conditions.

Because the previous approach 16) estimates the parameters directly from the singing voice without feedback from the synthesized result, it cannot absorb such condition-dependent differences; VocaListener addresses this by iterative parameter updating.
3. VocaListener

VocaListener consists of three components: VocaListener-front-end, VocaListener-core, and VocaListener-plus (Fig. 3).

Fig. 2 Problems of a previous approach 16).

Unlike the approach by Janer et al. 16), VocaListener synchronizes the lyrics with the singing by Viterbi alignment. Since automatic alignment (e.g., of a phrase such as /tachidomaru/) cannot be guaranteed to be 100% accurate, VocaListener lets the user point out and correct alignment errors through simple interaction, and then re-estimates the synthesis parameters for Vocaloid 10).
Fig. 3 System architecture of VocaListener.

3.1 VocaListener-front-end

VocaListener-front-end analyzes the user's singing, sampled at 44.1 kHz, and estimates its fundamental frequency (F0) and power every 10 ms.

3.1.1 F0 and Power Estimation

F0 [Hz] is estimated by SWIPE 18), which has a low gross error rate. The estimated F0 is converted to a real-valued MIDI note number f:

    f = 12 log2 ( F0 / 440 ) + 69                                           (1)

Table 1 List of symbols.

    f(t)              F0 (MIDI note number) of the target singing
    f_d               off-pitch amount (in semitones)
    f_t               transposition amount (in semitones)
    f~(t)             smoothed (low-pass filtered) F0
    f_n               estimated note number of the n-th note
    f^(i)(t)          F0 of the singing synthesized at iteration i
    Df_p^(i)(t)       PIT (pitch bend) at iteration i
    Df_s^(i)(t)       PBS (pitch bend sensitivity) at iteration i
    Df^(i)(t)         relative pitch (in semitones) at iteration i
    p(t)              power of the target singing
    p~(t)             smoothed power
    p^(i)(t)          power of the singing synthesized at iteration i
    p_m(t)            power synthesized with all DYN set to 64
    p^hat^(i)(t)      target power for DYN estimation at iteration i
    e, e_f^(i), e_p^(i)   errors (overall, pitch, and power at iteration i)

The power p(t) is computed from the waveform x(t) with a window function h(t):

    p(t) = sqrt( sum_{tau = -N/2}^{N/2 - 1} ( x(t + tau) h(tau) )^2 )       (2)

where the window length N is 2,048 samples (about 46 ms). For lyrics alignment, the phonetic reading of the lyrics is obtained with the morphological analyzer MeCab 19), and Viterbi alignment is carried out while allowing short pauses between phrases.
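Eqs. (1) and (2) can be sketched in a few lines. This is a minimal illustration assuming NumPy; the window for h(t) is assumed to be a Hanning window (the paper does not specify it here), and the function names are hypothetical:

```python
import numpy as np

def hz_to_midi(f0_hz):
    """Eq. (1): convert F0 in Hz to a real-valued MIDI note number
    (440 Hz maps to note 69, A4)."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / 440.0) + 69.0

def frame_power(x, t, n=2048, window=None):
    """Eq. (2): power of a windowed frame of length n centred at sample t.
    The Hanning window is an assumption standing in for h(t)."""
    if window is None:
        window = np.hanning(n)
    frame = x[t - n // 2 : t + n // 2]
    return float(np.sqrt(np.sum((frame * window) ** 2)))
```

At the paper's 44.1 kHz sampling rate, n = 2,048 samples corresponds to the stated 46 ms window.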
The alignment uses a speaker-independent monophone HMM 20) adapted to the user's singing by MLLR-MAP 21), which combines Maximum Likelihood Linear Regression (MLLR) and Maximum A Posteriori (MAP) adaptation. For the Viterbi alignment and the MLLR-MAP adaptation, the singing is downsampled to 16 kHz and the HTK Speech Recognition Toolkit 22) is used.

3.1.2 Singing Synthesis

Singing voices are synthesized with Vocaloid2 10) using the singer databases CV01 and CV02 *1, driven through the VSTi (Vocaloid Playback VST Instrument) *2.

3.2 VocaListener-plus

VocaListener-plus provides two kinds of functions for modifying the target singing: off-pitch correction with transposition, and control of vibrato extent and fine fluctuation.

3.2.1 Off-pitch Correction and Transposition

As in our previous work 23), the off-pitch amount f_d (0 <= f_d < 1, in semitones) is estimated as the constant offset of the F0 trajectory f(t) from the integer semitone grid:

    f_d = argmax_g sum_t sum_{i=0}^{127} exp( - ( f(t) - g - i )^2 / ( 2 sigma^2 ) )    (3)

where sigma = 0.17. Before this estimation, f(t) is smoothed by low-pass filtering *3 so that F0 dynamics such as vibrato 24),25), whose rate is typically 5-8 Hz 26),27), do not disturb the estimate. The trajectory is then corrected by

    f(t) <- f(t) - f_d          (0 <= f_d < 0.5)
    f(t) <- f(t) + (1 - f_d)    (0.5 <= f_d < 1)                                        (4)

In addition, the whole trajectory can be transposed by an amount f_t specified in semitones:

    f(t) <- f(t) + f_t                                                                  (5)

*1 http://www.vocaloid.com/product.ja.html
*2 The synthesis parameters are updated every 10 ms, while the VSTi itself is driven at 1 ms resolution.
*3 An FIR low-pass filter is used.
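As an illustrative sketch of Eqs. (3) and (4), assuming NumPy and a 0.01-semitone grid of candidate offsets g (the paper does not specify its search grid; function names are hypothetical):

```python
import numpy as np

def estimate_pitch_offset(f, sigma=0.17):
    """Eq. (3): find the off-pitch amount f_d in [0, 1) that best explains
    the F0 trajectory f(t) (MIDI note numbers) against the integer
    semitone grid i = 0..127."""
    candidates = np.arange(0.0, 1.0, 0.01)   # assumed grid of g values
    notes = np.arange(128)
    best_g, best_score = 0.0, -np.inf
    for g in candidates:
        d = f[:, None] - g - notes[None, :]          # f(t) - g - i
        score = np.sum(np.exp(-d ** 2 / (2.0 * sigma ** 2)))
        if score > best_score:
            best_g, best_score = g, score
    return best_g

def correct_off_pitch(f, f_d):
    """Eq. (4): shift the trajectory back onto the semitone grid."""
    return f - f_d if f_d < 0.5 else f + (1.0 - f_d)
```

For a trajectory hovering 0.3 semitones above a note, the estimate recovers f_d close to 0.3 and the correction snaps it back to the grid.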
3.2.2 Control of Vibrato Extent and Fine Fluctuation

The vibrato extent and the fine fluctuation of F0 and power can be emphasized or suppressed. Smoothed trajectories f~(t) and p~(t) are obtained by low-pass filtering f(t) and p(t) with a 3 Hz cutoff, and vibrato sections are detected from their 5-8 Hz modulation 26),27). The adjusted trajectories are

    f(t) <- r_{v|s} f(t) + ( 1 - r_{v|s} ) f~(t)                            (6)
    p(t) <- r_{v|s} p(t) + ( 1 - r_{v|s} ) p~(t)                            (7)

where r_v is applied within vibrato sections (detected as in 23)) and r_s elsewhere. With the defaults r_v = r_s = 1 the singing is unchanged; r_v > 1 emphasizes vibrato, while r_s < 1 suppresses fine F0 fluctuations 28).

3.2.3 Examples

Fig. 4 Examples of F0(t) adjusted by VocaListener-plus.

3.3 VocaListener-core

Given the target singing adjusted by VocaListener-plus, VocaListener-core iteratively estimates the synthesis parameters.
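A minimal sketch of Eqs. (6)-(7), assuming NumPy and a simple moving average standing in for the paper's 3 Hz low-pass filter (function and parameter names are hypothetical):

```python
import numpy as np

def scale_fluctuation(f, r, smooth_win=33):
    """Eqs. (6)-(7): scale the deviation of a trajectory from its smoothed
    version by factor r (r = 1: unchanged; r > 1: emphasize vibrato or
    fluctuation; r < 1: attenuate it)."""
    win = smooth_win | 1                              # force an odd window length
    kernel = np.ones(win) / win
    f_bar = np.convolve(f, kernel, mode="same")       # stands in for f~(t)
    return r * f + (1.0 - r) * f_bar
```

With r = 1 the trajectory is returned unchanged, matching the paper's default r_v = r_s = 1; with r = 0 the output collapses to the smoothed trajectory.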
Table 2 Singing synthesis parameters and their initial values.

    Parameter                         Range             Initial value
    Note number                       0 - 127           (estimated in Section 3.3.3)
    PIT (pitch bend)                  -8,192 - 8,191    0
    PBS (pitch bend sensitivity)      0 - 24            1
    DYN (dynamics)                    0 - 127           64

3.3.1 Synthesis Parameters

After the Viterbi-based lyrics alignment, VocaListener-core estimates the Vocaloid2 parameters PIT, PBS, and DYN (Table 2). PIT is a 16,384-step pitch bend value, and PBS determines how many semitones (1 to 24) correspond to its full deflection; DYN, which corresponds to MIDI Expression, controls dynamics in the range 0-127 (Section 3.3.3).

3.3.2 Lyrics Alignment

Fig. 5 The lyrics alignment procedure of VocaListener-core.

The alignment proceeds as shown in Fig. 5: Viterbi alignment of the lyrics (Step 1), interactive correction of alignment errors pointed out by the user (Step 2), and refinement of the boundaries using MFCC-based acoustic matching (Steps 3 and 4); Steps 2 to 4 can be repeated until the result is satisfactory.
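The PIT/PBS encoding described in Section 3.3.1 (full PIT deflection of +/-8,192 steps spans PBS semitones, i.e., offset = PIT / 8192 x PBS) can be sketched as follows; `encode_pitch_bend` and `decode_pitch_bend` are hypothetical helper names, not part of Vocaloid's interface:

```python
def encode_pitch_bend(delta_semitones, pbs):
    """Quantize a desired pitch offset (in semitones) to a PIT value,
    given pitch bend sensitivity PBS (semitones at full deflection)."""
    pit = int(round(delta_semitones / pbs * 8192))
    return max(-8192, min(8191, pit))      # clamp to the 14-bit PIT range

def decode_pitch_bend(pit, pbs):
    """Recover the pitch offset in semitones from PIT and PBS."""
    return pit / 8192.0 * pbs
```

The clamp shows why PBS must be raised when a larger pitch deviation is needed: with PBS = 1, offsets beyond one semitone saturate at the PIT range limits.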
Fig. 6 F0 of the target singing and estimated note numbers.

3.3.3 Pitch Parameter Estimation (1): Note Numbers

Given the aligned note boundaries, the note number f_n of each note is estimated from the corrected F0 trajectory (Fig. 6):

    f_n = argmax_n sum_t exp( - ( n - f(t) )^2 / ( 2 sigma^2 ) )            (8)

where sigma = 0.33 and t ranges over the frames of the note. PIT and PBS are initialized so that pitch bends of about +/-2 semitones can be represented.

3.3.4 Pitch Parameter Estimation (2): Iterative Update

Let f^(i)(t) be the F0 of the singing synthesized at iteration i with PIT Df_p^(i)(t) and PBS Df_s^(i)(t). At each iteration, the singing is synthesized with the current parameters (Step 1), its F0 f^(i)(t) is estimated (Step 2), and the relative pitch is updated from the residual (Step 3):

    Df^(i+1)(t) = Df^(i)(t) + ( f(t) - f^(i)(t) )                           (9)

where the relative pitch Df^(i)(t) represented by PIT and PBS is

    Df^(i)(t) = ( Df_p^(i)(t) / 8192 ) Df_s^(i)(t)                          (10)

and Df^(0)(t) is initialized from the MIDI note numbers. The updated Df^(i+1)(t) is then converted back into Df_p^(i+1)(t) and Df_s^(i+1)(t) (Step 4).

Fig. 7 Power of the target singing and power of the singing synthesized with four different dynamics.

3.3.5 Dynamics Parameter Estimation (1): Power Normalization

Since the absolute power of the target singing and that of the synthesized singing are not directly comparable, the target power is first normalized by a factor alpha. The power synthesized with DYN = 127 gives the upper bound of the achievable dynamics (Fig. 7 (A)).
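The note-number vote of Eq. (8) and the residual update of Eq. (9) above can be sketched with NumPy (function names are hypothetical):

```python
import numpy as np

def estimate_note_number(f, sigma=0.33):
    """Eq. (8): pick the MIDI note number n that best fits the corrected
    F0 trajectory f(t) of one note via a Gaussian vote over its frames."""
    notes = np.arange(128)
    scores = np.array(
        [np.sum(np.exp(-(n - f) ** 2 / (2.0 * sigma ** 2))) for n in notes]
    )
    return int(np.argmax(scores))

def update_pitch_offset(delta_f, f_target, f_synth):
    """Eq. (9): add the residual between target and synthesized F0
    to the current relative-pitch trajectory."""
    return delta_f + (f_target - f_synth)
```

Because the update simply adds the remaining residual each iteration, the synthesized F0 is pulled toward the target as long as the synthesizer responds roughly linearly to the relative pitch.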
The power p_m(t) of the singing synthesized with all DYN set to 64 is used as the reference, and alpha is determined by minimizing

    e^2 = sum_t ( alpha p(t) - p_m(t) )^2                                   (11)

which gives the least-squares solution

    alpha = sum_t p(t) p_m(t) / sum_t p(t)^2                                (12)

Table 3 Dataset for experiments A and B and synthesis conditions. All of the song samples were sung by female singers.

    Experiment   Song    Length [sec]   Singer database
    A            No.07   103            CV01
    A            No.16   100            CV02
    B            No.07   6.0            CV01, CV02
    B            No.16   7.0            CV01, CV02
    B            No.54   8.9            CV01, CV02
    B            No.55   6.5            CV01, CV02
    (All songs are from RWC-MDB-P-2001.)

3.3.6 Dynamics Parameter Estimation (2): Iterative Update

The singing is synthesized with five constant dynamics settings, DYN = (0, 32, 64, 96, 127), to learn how DYN maps to output power. Let p^hat^(i)(t) be the target power at iteration i and p^(i)(t) the power of the singing synthesized with the current DYN. At each iteration, the singing is synthesized (Step 1), its power p^(i)(t) is estimated (Step 2), and the target power is updated (Step 3):

    p^hat^(i+1)(t) = p^hat^(i)(t) + ( alpha p(t) - p^(i)(t) )               (13)

The DYN values are then re-determined from p^hat^(i+1)(t) using the learned DYN-power relation (Step 4).

4. Experimental Evaluation

4.1 Conditions

VocaListener-core was evaluated in two experiments, A and B, using songs from the RWC Music Database (Popular Music, RWC-MDB-P-2001) 29) (Table 3). Experiment A evaluated the lyrics alignment on full songs of about 100 sec each; experiment B evaluated the iterative parameter estimation on short excerpts synthesized with both CV01 and CV02 under otherwise identical Vocaloid2 settings. The errors at iteration i are measured by

    e_f^(i) = ( 1 / T_f ) sum_t | f(t) - f^(i)(t) |                         (14)
    e_p^(i) = ( 1 / T_p ) sum_t | 20 log10( alpha p(t) ) - 20 log10( p^(i)(t) ) |   (15)

where T_f and T_p are the numbers of frames in which both the target and the synthesized values are available (non-zero).
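Eqs. (12)-(15) amount to a least-squares gain plus simple residual updates and error averages; a sketch assuming NumPy (function names are hypothetical):

```python
import numpy as np

def estimate_alpha(p_target, p_mid):
    """Eq. (12): least-squares gain aligning the target power p(t) with the
    power p_m(t) synthesized at the middle dynamics value (DYN = 64)."""
    return float(np.sum(p_target * p_mid) / np.sum(p_target ** 2))

def update_power_target(p_hat, alpha, p_target, p_synth):
    """Eq. (13): one iteration of the dynamics (target power) update."""
    return p_hat + (alpha * p_target - p_synth)

def mean_errors(f_target, f_synth, p_target, p_synth, alpha):
    """Eqs. (14)-(15): mean pitch error [semitones] and power error [dB]
    over frames where all values are available."""
    eps_f = float(np.mean(np.abs(f_target - f_synth)))
    eps_p = float(np.mean(np.abs(
        20.0 * np.log10(alpha * p_target) - 20.0 * np.log10(p_synth))))
    return eps_f, eps_p
```

Eq. (12) follows from setting the derivative of Eq. (11) with respect to alpha to zero; when the synthesized power reproduces alpha p(t) exactly, both error measures vanish.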
Table 4 Number of boundary errors and number of repairs for correcting (pointing out) errors in experiment A. The cells give the number of boundary errors remaining after n repair operations.

    Song    Singer DB   Boundaries   n = 0   n = 1   n = 2   n = 3
    No.07   CV01        166          8       5       2       0
    No.16   CV02        128          3       2       0       -

4.2 Results

4.2.1 Experiment A: Lyrics Alignment

The Viterbi alignment by VocaListener-front-end was evaluated on songs No.07 and No.16 (Table 4). For No.07, 8 of the 166 phoneme boundaries were erroneous, and all errors were eliminated after at most 3 repair operations; the errors occurred around phonemes such as /w/, /r/, /m/, and /n/.

4.2.2 Experiment B: Iterative Parameter Estimation

Table 5 shows the mean error values for song No.07 after each iteration i, together with the errors of the previous approach 16). Without iteration (i = 0) the errors are comparable to or larger than those of the previous approach, but after four iterations (i = 4) they are much smaller. Table 6 summarizes the minimum and maximum errors over all four songs.

Table 5 Mean error values after each iteration for song No.07 in experiment B.

                          Previous 16)   i = 0   i = 1   i = 2   i = 3   i = 4
    e_f^(i) [semitone]
      CV01                0.217          0.386   0.091   0.058   0.042   0.034
      CV02                0.198          0.352   0.074   0.041   0.029   0.024
    e_p^(i) [dB]
      CV01                13.65          11.22   4.128   3.617   3.472   3.414
      CV02                14.17          15.26   6.944   6.382   6.245   6.171

Table 6 Minimum and maximum error values for all four songs in experiment B.

                          Previous 16)    i = 0           i = 4
    e_f^(i) [semitone]    0.168 - 0.369   0.352 - 1.029   0.019 - 0.107
    e_p^(i) [dB]          9.545 - 15.45   10.46 - 19.04   1.676 - 6.560

4.3 Discussion

In experiment A the number of boundary errors was small (8 of 166 for No.07 and 3 of 128 for No.16), and 2 to 3 repair operations sufficed to eliminate them. In experiment B, the iterative updating of VocaListener reduced both the pitch and power errors well below those of the previous approach under both synthesis conditions.

*2 http://staff.aist.go.jp/t.nakano/vocalistener/index-j.html
VocaListener was implemented in C++ with a GUI built with Visual Studio 2005. Fig. 9 shows an example screen: the target F0 and power and the estimated parameters can be inspected (Fig. 9 (A)-(C)), and the synthesized result can be exported as a wav file.

Fig. 8 The estimated parameters and synthesized results.

Singing examples synthesized with the CV01 and CV02 databases have been released on the Web.

5. Discussion

5.1 Usage of VocaListener

5.1.1 Works Created with VocaListener

Works created with VocaListener for Vocaloid/Vocaloid2 had been publicly released by December 2010 (Fig. 9 (D)).

5.1.2 Mimicking and Modifying the User's Singing

By correcting off-pitch phrases and adjusting vibrato (Section 3.2), the system can also synthesize singing whose F0 deliberately differs from the user's original singing.
5.2 Comparison with Manual Parameter Editing

Compared with editing parameters manually on the Vocaloid2 Score Editor 10), VocaListener differs in two respects: (i) the F0 and dynamics curves are obtained automatically from the user's singing rather than drawn by hand, and (ii) the parameters are adapted automatically to the synthesis condition through iteration.

Fig. 9 An example VocaListener screen.

6. Conclusion

This paper has described VocaListener, a system that synthesizes a singing voice by iteratively estimating singing synthesis parameters so that the synthesized pitch and dynamics mimic those of the user's singing.
Acknowledgments This research used the singer databases CV01 and CV02 and the RWC Music Database (RWC-MDB-P-2001), and was supported in part by CrestMuse (CREST, JST).

References

1) Cabinet Office, Government of Japan: Virtual Idol, Highlighting JAPAN through images, Vol.2, No.11, pp.24-25 (2009), available from http://www.gov-online.go.jp/pdf/hlj img/vol 0020et/24-25.pdf.
2) (in Japanese), Vol.25, No.1, pp.157-167 (2010).
3) (in Japanese), 2009, pp.118-124 (2009).
4) Depalle, P., Garcia, G. and Rodet, X.: A Virtual Castrato, Proc. International Computer Music Conference (ICMC 94), pp.357-360 (1994).
5) Cook, P.R.: Identification of Control Parameters in An Articulatory Vocal Tract Model, with Applications to the Synthesis of Singing, Ph.D. Thesis, Stanford Univ. (1991).
6) Cook, P.R.: Singing Voice Synthesis: History, Current Work, and Future Directions, Computer Music Journal, Vol.20, No.3, pp.38-46 (1996).
7) Sundberg, J.: The KTH Synthesis of Singing, Advances in Cognitive Psychology, Special issue on Music Performance, Vol.2, pp.131-143 (2006).
8) (in Japanese): CyberSingers, 99-SLP-25-8, Vol.99, No.14, pp.35-40 (1998).
9) Bonada, J. and Serra, X.: Synthesis of the Singing Voice by Performance Sampling and Spectral Models, IEEE Signal Processing Magazine, Vol.24, No.2, pp.67-79 (2007).
10) Kenmochi, H. and Ohshita, H.: VOCALOID Commercial Singing Synthesizer based on Sample Concatenation, Proc. 8th Annual Conference of the International Speech Communication Association (INTERSPEECH 2007), pp.4010-4011 (2007).
11) (in Japanese), Vol.45, No.7, pp.719-727 (2004).
12) Saitou, T., Goto, M., Unoki, M. and Akagi, M.: Speech-To-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices, Proc. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2007), pp.215-218 (2007).
13) Fukayama, S., Nakatsuma, K., Sako, S., Nishimoto, T. and Sagayama, S.: Automatic Song Composition from the Lyrics Exploiting Prosody of the Japanese Language, Proc. 7th Sound and Music Computing Conference (SMC 2010), pp.299-302 (2010).
14) (in Japanese), 2008-MUS-74-6, Vol.2008, No.12, pp.33-38 (2008).
15) (in Japanese): STRAIGHT, Vol.43, No.2, pp.208-219 (2002).
16) Janer, J., Bonada, J. and Blaauw, M.: Performance-driven Control for Sample-Based Singing Voice Synthesis, Proc. 9th Int. Conference on Digital Audio Effects (DAFx-06), pp.41-44 (2006).
17) (in Japanese): VOCALOID, 2008-MUS-74-9, Vol.2008, No.12, pp.51-58 (2008).
18) Camacho, A.: SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech And Music, Ph.D. Thesis, University of Florida (2007).
19) MeCab: Yet Another Part-of-Speech and Morphological Analyzer, http://mecab.sourceforge.net/.
20) (in Japanese), 2001-SLP-48-1, Vol.2003, No.48, pp.1-6 (2003).
21) Digalakis, V. and Neumeyer, L.: Speaker Adaptation Using Combined Transformation and Bayesian Methods, IEEE Trans. Speech and Audio Processing, Vol.4, No.4, pp.294-300 (1996).
22) Young, S., Evermann, G., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P.: The HTK Book (2002).
23) (in Japanese), Vol.48, No.1, pp.227-236 (2007).
24) Saitou, T., Unoki, M. and Akagi, M.: Development of an F0 Control Model Based on F0 Dynamic Characteristics for Singing-Voice Synthesis, Speech Communication, Vol.46, pp.405-417 (2005).
25) Mori, H., Odagiri, W. and Kasuya, H.: F0 Dynamics in Singing: Evidence from the Data of a Baritone Singer, IEICE Trans. Inf. & Syst., Vol.E87-D, No.5, pp.1068-1092 (2004).
26) Seashore, C.E.: A Musical Ornament, the Vibrato, Psychology of Music, pp.33-52, McGraw-Hill (1938).
27) (in Japanese): STRAIGHT, 2005, 3-P-15, pp.269-270 (2005).
28) (in Japanese), 2006, 109, pp.611-616 (2006).
29) (in Japanese): RWC Music Database, Vol.45, No.3, pp.728-738 (2004).
30) Toda, T., Black, A. and Tokuda, K.: Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory, IEEE Trans. Audio, Speech and Language Processing, Vol.15, No.8, pp.2222-2235 (2007).
31) (in Japanese): STRAIGHT, Vol.J91-D, No.4, pp.1082-1091 (2008).
32) Nakano, T., Ogata, J., Goto, M. and Hiraga, Y.: Analysis and Automatic Detection of Breath Sounds in Unaccompanied Singing Voice, Proc. 10th International Conference of Music Perception and Cognition (ICMPC 10), pp.387-390 (2008).

(Received January 6, 2011)
(Accepted September 12, 2011)