VocaListener2: A Singing Synthesis System Mimicking Voice Timbre Changes in Addition to Pitch and Dynamics of User's Singing

Tomoyasu Nakano†1 and Masataka Goto†1

In this paper, we propose a singing synthesis system, VocaListener2, that automatically synthesizes a singing voice by mimicking the timbre changes of a user's singing voice. The system extends our previous system, VocaListener, which can estimate singing synthesis parameters for only pitch (F0) and dynamics (power) from the user's singing voice. Although most previous techniques for manipulating voice timbre have focused on voice conversion and voice morphing, they cannot deal with timbre changes during singing. To develop VocaListener2, we first construct a voice timbre space on the basis of various singing voices that mimic the pitch and dynamics of the user's singing voice by using VocaListener. In this space, the timbre changes can be reflected in the synthesized singing voice. In our experience with singing synthesis systems on the market, we found that the timbre changes as well as the pitch and dynamics can be mimicked.

1 2007 1) Web 2),3) VocaListener 4),5) VocaListener2 VocaListener VocaListener1 VocaListener1 6)

†1 National Institute of Advanced Industrial Science and Technology (AIST)

© 2010 Information Processing Society of Japan
7) 8) 9),10) 11)–18) 11),16)–18) 19) 20) Vocaloid 6) Vocaloid 1 VocaListener1 2 VocaListener1 3 VocaListener2 4 5 6

2 VocaListener1
VocaListener1 VocaListener2

2.1 VocaListener1: 4),5)
VocaListener1 1 21) 1 VocaListener1 VocaListener1 VocaListener1 5) 2 3

2.2 VocaListener1

1 Vocaloid1: Note Velocity, Resonance, Harmonics, Noise, Brightness, Clearness, Gender Factor; Vocaloid2: VEL, BRE, BRI, CLE, OPE, GEN
2 http://staff.aist.go.jp/t.nakano/vocalistener/index-j.html
3 http://www.nicovideo.jp/mylist/7012071/
Vocaloid Vocaloid2 6) (MIKU Append) 1 2 DARK, LIGHT, SOFT, SOLID, SWEET, VIVID 6 LIGHT SOLID VocaListener1

2.3 (1): (2): 2 2

1 http://www.crypton.co.jp/cv01a/
2 http://www.crypton.co.jp/mp/pages/prod/vocaloid/cv01.jsp

3 VocaListener2: VocaListener2

3.1
2.3 (1) VocaListener1 : (2) VocaListener1
$M$ $t$ $J$ $M$ $z_{j=1,2,\ldots,J}(t)$ 3 $J$ $z_j(t)$ $M$ $u(t)$ 3 7 $J = 7$
$u(t)$ VocaListener2

3.2 VocaListener2
3 VocaListener2 $Z$ $Z_1$ $Z_4$ VocaListener1 A F0 B 2.3 F0 F0 $M$ C $Z_1$ $Z_4$ D E

3.3 : B
F0 STRAIGHT 22) STRAIGHT 7)

3.4 : C
4 VocaListener1 VocaListener1

[Figure 3: VocaListener2. Panels A, B, C, D, E; labels VocaListener1, X, Y, Z1, Z2, Z3, Z4, VocaListener2, STRAIGHT]

1 2 23) 24) 23),24) : : $N$ $M$

1 F0 STRAIGHT 2
1 4 23),24)

3.5 : D
01

3.6 : E
DARK, LIGHT, SOFT, SOLID, SWEET, VIVID 5 1 5 DARK, LIGHT, SOFT, SOLID, SWEET, VIVID 7),19) Variational Interpolation using Radial Basis Functions 25) $t$ $f$ $Z_{j=1,2,\ldots,J}(f,t)$ $Z_1(f,t)$ $Zr_j(f,t)$ $u(t)$ $z_j(t)$
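\[ Zr_j(f,t) = \log\left( \frac{Z_j(f,t)}{Z_1(f,t)} \right) \tag{1} \]

\[ g(u(t); f,t) = \sum_{k=1}^{J} \bigl( w_k(f,t)\, \phi(u(t) - z_k(t)) \bigr) + P(u(t); f,t) \tag{2} \]

\[ Zr_j(f,t) = \sum_{k=1}^{J} \bigl( w_k(f,t)\, \phi(z_j(t) - z_k(t)) \bigr) + P(z_j(t); f,t) \tag{3} \]

\[ g(z_j(t); f,t) = Zr_j(f,t) \tag{4} \]

\[ P(x; f,t) = p_0 + \sum_{m=1}^{M} p_m x^{(m)} \tag{5} \]

Here $Zr_j(f,t)$ is given by Eq. (1), $w_j$ are the weights, and $P(\cdot)$ is the first-degree polynomial of Eq. (5), where $x$ stands for $z_j(t)$ or $u(t)$, $M$ is its number of dimensions, and $p_{m=0,\ldots,M}$ are the coefficients. $\phi(\cdot)$ is the basis function, with $\phi(\cdot) = |\cdot|$ 1. Under the condition of Eq. (4), with $M = 3$:

\[
\begin{pmatrix}
\phi_{11} & \phi_{12} & \cdots & \phi_{1J} & 1 & z_1^{(1)} & z_1^{(2)} & z_1^{(3)} \\
\phi_{21} & \phi_{22} & \cdots & \phi_{2J} & 1 & z_2^{(1)} & z_2^{(2)} & z_2^{(3)} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\phi_{J1} & \phi_{J2} & \cdots & \phi_{JJ} & 1 & z_J^{(1)} & z_J^{(2)} & z_J^{(3)} \\
1 & 1 & \cdots & 1 & 0 & 0 & 0 & 0 \\
z_1^{(1)} & z_2^{(1)} & \cdots & z_J^{(1)} & 0 & 0 & 0 & 0 \\
z_1^{(2)} & z_2^{(2)} & \cdots & z_J^{(2)} & 0 & 0 & 0 & 0 \\
z_1^{(3)} & z_2^{(3)} & \cdots & z_J^{(3)} & 0 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
w_1 \\ w_2 \\ \vdots \\ w_J \\ p_0 \\ p_1 \\ p_2 \\ p_3
\end{pmatrix}
=
\begin{pmatrix}
Zr_1 \\ Zr_2 \\ \vdots \\ Zr_J \\ 0 \\ 0 \\ 0 \\ 0
\end{pmatrix}
\tag{6}
\]

where $\phi_{ij} = \phi(z_i(t) - z_j(t))$, and the arguments $(f,t)$ and $(t)$ are omitted. Solving Eq. (6) gives the $w_j$ and $p_m$ used in Eq. (2).

STRAIGHT

3.7 : (1) (2) (3)

4

4.1
RWC RWC-MDB-G-2001 26) No. 91 44.1 kHz 1 msec Vocaloid Vocaloid2 6) 17 3 2 1 14 3 7 STRAIGHT

1 $\phi(\cdot) = |\cdot|^2 \log(|\cdot|)$, $\phi(\cdot) = |\cdot|^3$
2 KAITO (Vocaloid1) (Vocaloid2) 3 MEIKO (Vocaloid1) 6 SF-A2 miki vocaloid 2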
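The interpolation in Eqs. (2) to (6) can be sketched for a single time-frequency bin as follows. This is a minimal illustration, not the authors' implementation: the function names (`phi`, `fit_rbf`, `evaluate`) are hypothetical, the kernel is assumed to be $\phi(r) = |r|$ applied to the Euclidean distance between points of the timbre space, and a small pure-Python Gaussian elimination stands in for whatever linear solver was actually used.

```python
import math


def phi(r):
    """Radial basis function; phi(r) = |r| is an assumption of this sketch."""
    return abs(r)


def dist(a, b):
    """Euclidean distance between two M-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def fit_rbf(z, zr):
    """Build and solve the block system of Eq. (6).

    z  : list of J points z_j, each of dimension M
    zr : list of J target values Zr_j for one (f, t) bin
    Returns (w, p) with J weights and M + 1 polynomial coefficients.
    """
    J, M = len(z), len(z[0])
    size = J + M + 1
    A = [[0.0] * size for _ in range(size)]
    for i in range(J):
        for j in range(J):
            A[i][j] = phi(dist(z[i], z[j]))   # phi_ij
        A[i][J] = A[J][i] = 1.0               # constant term p_0
        for m in range(M):
            A[i][J + 1 + m] = z[i][m]         # linear terms p_1..p_M
            A[J + 1 + m][i] = z[i][m]
    rhs = list(zr) + [0.0] * (M + 1)
    sol = solve(A, rhs)
    return sol[:J], sol[J:]


def evaluate(u, z, w, p):
    """Evaluate g(u) of Eq. (2): weighted kernels plus the polynomial P(u)."""
    value = p[0] + sum(pm * um for pm, um in zip(p[1:], u))
    value += sum(wk * phi(dist(u, zk)) for wk, zk in zip(w, z))
    return value
```

In use, `fit_rbf` would be called once per time-frequency bin with the $J$ reference points $z_j(t)$ and their values $Zr_j(f,t)$, and `evaluate` would then be called with the user's point $u(t)$. By construction the fitted function reproduces each $Zr_j$ exactly at $z_j$, which is the condition stated in Eq. (4).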
Table 1 17 N R [%]

R [%]        50   55   60   65   70   75   80   85   90   95
N (row 1)   129  162  197  240  289  348  418  504  616  770
N (row 2)   101  127  151  184  220  264  314  377  459  573

6 $M$ F0 FFT 4096 0 80 80 STRAIGHT 80% $N$ 3 $M = 3$ F0

4.2 A:
55 sec 17 R% $N$ $N$ 1 $M$ 6 2 7 1 77 7 2 RWC-MDB-G-2001 No. 91 55 sec 6 17 7 2 7
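Table 1 pairs a percentage R with a number of dimensions N. Assuming that R is a cumulative contribution ratio from a principal component analysis (an assumption of this sketch, as is the hypothetical function name `dims_for_ratio`), the smallest N that reaches a given R could be found from the eigenvalues like this:

```python
def dims_for_ratio(eigenvalues, ratio_percent):
    """Return the smallest N whose N largest eigenvalues account for
    at least ratio_percent of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for n, value in enumerate(ordered, start=1):
        cumulative += value
        if 100.0 * cumulative / total >= ratio_percent:
            return n
    # Reached only if rounding keeps the ratio just under the target.
    return len(ordered)
```

The eigenvalues would come from whatever decomposition is applied to the analysis data; the sketch covers only the thresholding step, not the decomposition itself.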
8 1 4.1 95% 7 1 1

[Figure 7: labels VIVID, SOLID, LIGHT, DARK, SOFT, SWEET]

4.3 B:
55 sec VocaListener1 Closed GEN 90 2 Open 8 Closed Open Closed

[Figure: labels LIGHT, SOLID, VIVID, DARK, SOFT]

7 9

5
3 8 9 LIGHT, DARK, SOLID, SOFT, VIVID, SWEET GEN GEN=90 2 1

5.1 (1):
6 6 VIVID

5.2 (2):
GEN

6
VocaListener2 VocaListener2 4) VocaListener2

CrestMuse RWC (RWC-MDB-G-2001)

[1] Cabinet Office, Government of Japan: Virtual Idol, Highlighting JAPAN through images, Vol. 2, No. 11, pp. 24–25 (2009). http://www.gov-online.go.jp/pdf/hlj_img/vol_0020et/24-25.pdf
[2] Vol. 25, No. 1, pp. 157–167 (2010).
[3] 2009, pp. 118–124 (2009).
[4] VocaListener: 2008-MUS-75-9, Vol. 2008, No. 12, pp. 51–58 (2008).
[5] Nakano, T. and Goto, M.: VocaListener: A singing-to-singing synthesis system based on iterative parameter estimation, Proc. SMC 2009, pp. 343–348 (2009).
[6] VOCALOID, 2008-MUS-74-9, Vol. 2008, No. 12, pp. 51–58 (2008).
[7] Vol. 48, No. 12, pp. 3637–3648 (2007).
[8] e.morish. http://www.crestmuse.jp/cmstraight/personal/e.morish/
[9] Toda, T., Black, A. and Tokuda, K.: Voice conversion based on maximum likelihood estimation of spectral parameter trajectory, IEEE Trans. on Audio, Speech and Language Processing, Vol. 15, No. 8, pp. 2222–2235 (2007).
[10] STRAIGHT, Vol. J91-D, No. 4, pp. 1082–1091 (2008).
[11] Schröder, M.: Emotional speech synthesis: A review, Proc. Eurospeech 2001, pp. 561–564 (2001).
[12] Iida, A., Campbell, N., Higuchi, F. and Yasumura, M.: A corpus-based speech synthesis system with emotion, Speech Communication, Vol. 40, Iss. 1–2, pp. 161–187 (2003).
[13] Tsuzuki, R., Zen, H., Tokuda, K., Kitamura, T., Bulut, M. and Narayanan, S. S.: Constructing emotional speech synthesizers with limited speech database, Proc. ICSLP 2004, pp. 1185–1188 (2004).
[14] F0, Vol. J89-D, No. 8, pp. 1811–1819 (2006).
[15] Vol. 50, No. 3, pp. 1181–1191 (2009).
[16] Türk, O. and Schröder, M.: A comparison of voice conversion methods for transforming voice quality in emotional speech synthesis, Proc. Interspeech 2008, pp. 2282–2285 (2008).
[17] Nose, T., Tachibana, M. and Kobayashi, T.: HMM-based style control for expressive speech synthesis with arbitrary speaker's voice using model adaptation, IEICE Trans. on Information and Systems, Vol. E92-D, No. 3, pp. 489–497 (2009).
[18] Inanoglu, Z. and Young, S.: Data-driven emotion conversion in spoken English, Speech Communication, Vol. 51, Iss. 3, pp. 268–283 (2009).
[19] 1-4-9, pp. 229–230 (2006).
[20] Vol. 51, No. 2, pp. 250–264 (2010).
[21] Janer, J., Bonada, J. and Blaauw, M.: Performance-driven control for sample-based singing voice synthesis, Proc. of the 9th Int. Conference on Digital Audio Effects (DAFx-06), pp. 41–44 (2006).
[22] Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication, Vol. 27, pp. 187–207 (1999).
[23] Vol. J85-D2, No. 4, pp. 554–562 (2002).
[24] SP, Vol. 101, No. 86, pp. 1–6 (2001).
[25] Turk, G. and O'Brien, J. F.: Modelling with implicit surfaces that interpolate, ACM Transactions on Graphics, Vol. 21, No. 4, pp. 855–873 (2002).
[26] RWC: Vol. 45, No. 3, pp. 728–738 (2004).