[Figure: three-state left-to-right HMM with transition probabilities a_11, a_22, a_33, a_12, a_23 and output distributions b_1(o_t), b_2(o_t), b_3(o_t)]
[Figure: overview of average-voice-based speech synthesis.
Training: multi-speaker speech database -> speech analysis (mel-cepstrum, log F0) -> context-dependent HMMs (average voice model).
Adaptation: adaptation data + average voice model -> speaker adaptation -> adapted model.
Synthesis: text -> sentence HMM built from the adapted model -> generation of F0 and mel-cepstrum parameters -> excitation + MLSA filter -> synthesized speech]
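In the synthesis stage above, the generated F0 contour drives the excitation of an MLSA filter: an impulse train at the F0 period for voiced frames, white noise for unvoiced frames. A minimal sketch of just the excitation step, assuming the 5 ms frame shift and 16 kHz sampling used in this work (the MLSA filter itself is omitted):

```python
import numpy as np

def make_excitation(f0_per_frame, frame_shift=80, sr=16000, seed=0):
    """Toy pulse/noise excitation: impulses at the F0 period for voiced
    frames (f0 > 0), white noise for unvoiced frames (f0 == 0).
    A real system feeds this through an MLSA filter driven by the
    generated mel-cepstra."""
    rng = np.random.default_rng(seed)
    excitation = np.zeros(len(f0_per_frame) * frame_shift)
    next_pulse = 0.0
    for i, f0 in enumerate(f0_per_frame):
        start = i * frame_shift
        if f0 > 0:                      # voiced: impulse train
            period = sr / f0            # pitch period in samples
            while next_pulse < start + frame_shift:
                if next_pulse >= start:
                    excitation[int(next_pulse)] = 1.0
                next_pulse += period
        else:                           # unvoiced: low-level white noise
            excitation[start:start + frame_shift] = (
                rng.standard_normal(frame_shift) * 0.1)
            next_pulse = start + frame_shift
    return excitation

f0 = [0, 0, 100, 100, 100, 0]           # hypothetical Hz values per frame
e = make_excitation(f0)
print(len(e), int((e == 1.0).sum()))    # samples, number of pulses
```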
[Figures: decision-tree context clustering. Yes/no context questions split the context-dependent HMM distributions (separate trees for spectrum and F0), with the tree size selected by the MDL criterion; a decision tree shared across the training speakers ties the context-dependent models into the average voice model]
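The MDL criterion used for the clustering above trades model fit against tree size: a split is kept only when the log-likelihood gain outweighs the description-length penalty of the extra leaf. A one-dimensional toy sketch, assuming diagonal Gaussians and hard state occupancies (the constants differ from the full MDL formula, but the accept/reject logic is the same):

```python
import numpy as np

def delta_mdl(node, yes, no, total_frames, dim=1):
    """Change in description length when splitting `node` into `yes`/`no`
    (1-D toy of MDL-based decision-tree clustering).
    Negative value -> the split shortens the description, so accept it."""
    def neg_loglik(x):
        # negative ML Gaussian log-likelihood of x, up to the 2*pi constant
        var = np.var(x) + 1e-8
        return 0.5 * len(x) * (np.log(var) + 1.0)
    # likelihood term: cost of the two child models minus the parent's
    dl_like = neg_loglik(yes) + neg_loglik(no) - neg_loglik(node)
    # model-size term: one extra leaf = 2*dim extra parameters (mean + var)
    dl_model = 0.5 * 2 * dim * np.log(total_frames)
    return dl_like + dl_model

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 500)           # two clearly different contexts
b = rng.normal(5.0, 1.0, 500)
node = np.concatenate([a, b])
good = delta_mdl(node, a, b, total_frames=1000)        # real distinction
bad = delta_mdl(node, node[::2], node[1::2], 1000)     # uninformative split
print(good < 0, bad > 0)
```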
[Table: experimental conditions. ATR Japanese speech database (set B); 16 kHz sampling; 25 ms frame length; 5 ms frame shift; mel-cepstral coefficients (0th-24th); left-to-right HMM topology]
[Table: training-speaker subsets. Six speakers (FKN, FKS, FYM, MHO, MHT, MYI) combined into subsets labeled A-I, with 50-300 training sentences per speaker]
[Table: numbers of distributions and their percentages for the spectrum and F0 decision trees under conditions (A) and (B)]
[Figure: generated F0 contour; frequency (Hz, 100-300) vs. time (0-4 s)]
[Figure: preference scores (%) vs. number of training sentences per speaker:
  50: 15.9 / 84.1;  100: 17.1 / 82.9;  150: 18.3 / 81.7;
  200: 30.0 / 70.0;  250: 17.5 / 82.5;  300: 27.2 / 72.8]
Speaker adaptive training (the SAT algorithm): a speaker-normalization training algorithm for building an average voice model well suited to speaker adaptation.
[Figure: log F0 distributions of the vowel /a/ for Speaker 1, Speaker 2, and the average voice, illustrating speaker adaptive training [T. Anastasakos et al., 1996]]
[Figure: MLLR mean transformation [C.J. Leggetter et al., 1996]. Mean vectors of the average voice model are mapped into speaker A's region of the acoustic space by a linear transform W]
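MLLR moves each average-voice mean mu toward the target speaker via mu_hat = W [1, mu']'. A least-squares toy sketch of estimating and applying W; real MLLR solves a weighted problem using state occupancies and covariances, whereas identity covariances and hard frame-to-Gaussian assignments are assumed here:

```python
import numpy as np

def estimate_mllr_w(means, frames, assign):
    """Least-squares toy of the MLLR mean transform: find W such that
    W @ [1, mu] approximates the adaptation frames assigned to each
    Gaussian (identity covariances assumed)."""
    xi = np.hstack([np.ones((len(frames), 1)), means[assign]])  # extended means
    w, *_ = np.linalg.lstsq(xi, frames, rcond=None)  # frames ~ xi @ w
    return w.T                                       # shape (dim, dim + 1)

def adapt(means, W):
    xi = np.hstack([np.ones((len(means), 1)), means])
    return xi @ W.T

# hypothetical target speaker = average voice scaled by 0.8 and shifted by 1.5
rng = np.random.default_rng(0)
means = rng.normal(size=(4, 2))                      # 4 toy Gaussian means
assign = rng.integers(0, 4, 200)                     # frame-to-Gaussian labels
frames = 0.8 * means[assign] + 1.5 + rng.normal(scale=0.01, size=(200, 2))
W = estimate_mllr_w(means, frames, assign)
adapted = adapt(means, W)
print(np.allclose(adapted, 0.8 * means + 1.5, atol=0.05))
```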
[Figure: speaker adaptive training. One transform W_i per training speaker (Speakers 1-3) normalizes speaker differences around the average voice model]
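SAT can be sketched as alternating between estimating per-speaker transforms and re-estimating the canonical model with the speaker effects removed. A deliberately tiny 1-D version, using additive speaker offsets in place of full linear transforms:

```python
import numpy as np

def speaker_adaptive_training(data_per_speaker, n_iter=10):
    """Toy 1-D SAT: alternately estimate a per-speaker offset b_s and a
    canonical ('average voice') mean mu, modeling each speaker's data as
    mu + b_s. Offsets are anchored to sum to zero, mirroring how SAT
    removes speaker differences from the canonical model."""
    mu = 0.0
    bias = {s: 0.0 for s in data_per_speaker}
    for _ in range(n_iter):
        # transforms given the model, then the model given the transforms
        for s, x in data_per_speaker.items():
            bias[s] = np.mean(x) - mu
        shift = np.mean(list(bias.values()))     # anchor: zero-mean offsets
        bias = {s: b - shift for s, b in bias.items()}
        mu = np.mean([np.mean(x) - bias[s]
                      for s, x in data_per_speaker.items()])
    return mu, bias

rng = np.random.default_rng(2)
data = {s: rng.normal(loc, 0.1, 300)
        for s, loc in [("spk1", 4.0), ("spk2", 5.0), ("spk3", 6.0)]}
mu, bias = speaker_adaptive_training(data)
print(round(mu, 1))                              # canonical mean near 5.0
```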
[Figure: shared-decision-tree context clustering. Speaker-dependent (SD) context-dependent models are tied by a decision tree shared with the speaker-independent (SI) model to form the average voice model]
[Figure: experimental setup. Average voice models trained with NONE / SAT / STC / STC+SAT are adapted to target speakers MMY and FTK and compared with speaker-dependent (SD) models]
[Figure: mean opinion scores (1-5) of synthesized speech.

  Training method   MMY    FTK
  NONE              2.65   2.33
  SAT               2.79   2.66
  STC               3.01   2.95
  STC+SAT           3.52   3.43
  SD                3.84   4.02]
HSMM-based speaker adaptation algorithms: simultaneous adaptation of spectrum, F0, and phone duration based on hidden semi-Markov models.
[Figure: HMM vs. HSMM [J.D. Ferguson 1980, S.E. Levinson 1986]. The HMM's self-transition probabilities a_ii are replaced in the HSMM by explicit state-duration distributions p(d_1), p(d_2), p(d_3); each state i emits observations from b_i(o_t) for d consecutive frames]
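The key difference illustrated above: an HMM's self-transitions imply geometrically distributed state durations, while an HSMM models p(d) explicitly. A small sampling sketch; the Gaussian duration pdf is a common choice in HMM-based synthesis, and the parameter values are illustrative:

```python
import numpy as np

def hmm_state_durations(self_loop_p, n=100_000, seed=0):
    """Durations implied by an HMM self-transition: geometric distribution
    with mean 1 / (1 - a_ii)."""
    rng = np.random.default_rng(seed)
    return rng.geometric(1.0 - self_loop_p, size=n)

def hsmm_state_durations(mean, sd, n=100_000, seed=0):
    """HSMM replaces self-transitions with an explicit duration pdf p(d);
    here a Gaussian, rounded and floored at 1 frame."""
    rng = np.random.default_rng(seed)
    return np.maximum(1, np.rint(rng.normal(mean, sd, size=n))).astype(int)

hmm_d = hmm_state_durations(0.8)        # mean 1 / (1 - 0.8) = 5 frames
hsmm_d = hsmm_state_durations(5.0, 1.0) # same mean, much tighter spread
print(round(float(hmm_d.mean())), round(float(hsmm_d.mean())),
      hmm_d.std() > hsmm_d.std())
```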
[Figure: HSMM-based MLLR adaptation [J. Yamagishi et al., 2004]. Mean vectors of the average voice model are mapped into speaker A's acoustic space by an output transform W and a duration transform X]
[Figure: [J. Yamagishi et al., 2004]. An occupancy threshold on the regression-class tree decides whether each node is adapted toward the target speaker's model or kept at the average voice model]
[Figure: HSMM-based speaker adaptive training [J. Yamagishi et al., 2005]. Per-speaker output transforms W_i and duration transforms X_i (Speakers 1-3) normalize both the output distributions b_i(o_t) and the duration distributions p(d_i) around the average voice model]
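In the HSMM framework above, one linear transform adapts the output (spectrum and F0) means while a second adapts the state-duration means, so all three features move to the target speaker simultaneously. A toy application step; the transforms W and X below are illustrative values, not estimated from data:

```python
import numpy as np

def adapt_hsmm_state(mean_out, mean_dur, W, X):
    """Simultaneous HSMM adaptation sketch: transform W for the output
    (spectrum + F0) mean and transform X for the state-duration mean,
    both in MLLR's extended-vector form mu_hat = W @ [1, mu]."""
    xi_out = np.concatenate([[1.0], mean_out])   # extended output mean
    xi_dur = np.array([1.0, mean_dur])           # extended duration mean
    return W @ xi_out, float(X @ xi_dur)

mean_out = np.array([0.5, -1.0])                 # toy 2-D output mean
W = np.array([[0.2, 1.0, 0.0],                   # shift first dim by 0.2
              [0.0, 0.0, 1.0]])                  # leave second dim as-is
X = np.array([2.0, 1.5])                         # duration: 1.5 * d + 2 frames
new_out, new_dur = adapt_hsmm_state(mean_out, 5.0, W, X)
print(new_out.tolist(), new_dur)
```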
[Figure: distribution of the training speakers. Average speaking rate (mora/sec, 7.0-9.0) vs. average log F0 (4.0-6.0) for speakers FKN, FKS, FYM, FTY, FTK, MHO, MHT, MSH, MMY, MTK, MYI]
[Figure: average log-likelihood per frame vs. number of adaptation sentences (0-450), adapting both output and duration distributions, output only, duration only, or neither (Both / Output / Duration / None)]
[Figure: average speaking rate (mora/sec) vs. average log F0. MLLR adaptation moves the male and female average voices toward target speakers MTK and FTK]
[Figure: RMSE of log F0 (cent) vs. number of adaptation sentences (0-450); MLLR-adapted average voice vs. SD model]
[Figure: mel-cepstral distance (dB) vs. number of adaptation sentences (0-450); MLLR-adapted average voice vs. SD model]
[Figure: RMSE of vowel duration (frames) vs. number of adaptation sentences (0-450); MLLR-adapted average voice vs. SD model]
[Figure: mean opinion scores (1-5) for spectrum, F0, and duration, comparing the average voice, the adapted model, and the speaker-dependent (SD) model]
[Figure: preference scores (%) for adapting the spectrum only, spectrum + F0, and spectrum + F0 + duration]
Journal papers:
1. J. Yamagishi and T. Kobayashi, "Simultaneous Speaker Adaptation Algorithm of Spectrum, Fundamental Frequency and Duration for HMM-based Speech Synthesis," IEICE Trans. Information and Systems (in preparation).
2. J. Yamagishi, Y. Nakano, K. Ogata, J. Isogai, and T. Kobayashi, "A Unified Speech Synthesis Method Using HSMM-based Speaker Adaptation and MAP Modification," IEICE Trans. Information and Systems (in preparation).
3. J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, "Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-based Speech Synthesis," IEICE Trans. Information and Systems, vol. E88-D, no. 3, pp. 503-509, March 2005.
4. J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "A Training Method of Average Voice Model for HMM-based Speech Synthesis," IEICE Trans. Fundamentals, vol. E86-A, no. 8, pp. 1956-1963, Aug. 2003.
5. J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "A Context Clustering Technique for Average Voice Models," IEICE Trans. Information and Systems, vol. E86-D, no. 3, pp. 534-542, March 2003.
International conference papers:
1. J. Yamagishi, K. Ogata, Y. Nakano, J. Isogai, and T. Kobayashi, "HSMM-based Model Adaptation Algorithms for Average-Voice-based Speech Synthesis," Proc. ICASSP 2006, May 2006 (submitted).
2. J. Yamagishi and T. Kobayashi, "Adaptive Training for Hidden Semi-Markov Model," Proc. ICASSP 2005, vol. I, pp. 365-368, March 2005.
3. J. Yamagishi, T. Masuko, and T. Kobayashi, "MLLR Adaptation for Hidden Semi-Markov Model Based Speech Synthesis," Proc. ICSLP 2004, vol. II, pp. 1213-1216, October 2004.
4. J. Yamagishi, M. Tachibana, T. Masuko, and T. Kobayashi, "Speaking Style Adaptation Using Context Clustering Decision Tree for HMM-based Speech Synthesis," Proc. ICASSP 2004, vol. I, pp. 5-8, May 2004.
5. J. Yamagishi, T. Masuko, and T. Kobayashi, "HMM-based Expressive Speech Synthesis: Towards TTS with Arbitrary Speaking Styles and Emotions," Special Workshop in Maui (SWIM), January 2004.
6. J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, "Modeling of Various Speaking Styles and Emotions for HMM-based Speech Synthesis," Proc. EUROSPEECH 2003, vol. III, pp. 2461-2464, September 2003.
7. J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "A Training Method for Average Voice Model Based on Shared Decision Tree Context Clustering and Speaker Adaptive Training," Proc. ICASSP 2003, vol. I, pp. 716-719, April 2003.
8. J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "A Context Clustering Technique for Average Voice Model in HMM-based Speech Synthesis," Proc. ICSLP 2002, vol. 1, pp. 133-136, September 2002.