Speech Recognition of Short Time Utterance Based on Speaker Clustering
Hiroshi SEKI a), Daisuke ENAMI, Faqiang ZHU, Kazumasa YAMAMOTO b), and


Seiichi NAKAGAWA c)

Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi-shi, Japan
DOI: /transinfj.2016JDP7063
D Vol. J100 D No. 1

0.5 DNN (Deep Neural Network) 7%

Deep Neural Network

1. [1] GMM-HMM [2] GMM [3] HMM [4] [5] [6] [7] i-vector [8] [9], [10] GMM-HMM MAP (Maximum a posteriori) [11] MLLR (Maximum Likelihood Linear Regression) [12] VTLN (Vocal Tract Length Normalization) [13] fMLLR (feature-space MLLR) [14] (Deep

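2017/1 Vol. J100 D No. 1

Neural Network; DNN) GMM-HMM [15] DNN-HMM DNN-HMM DNN-HMM GMM-HMM DNN DNN i-vector [16] [17] DNN [18] [19] DNN-HMM DNN [20] 1 (0.5 ) [21] i-vector 0.5 i-vector 5.0 [22] [22] GMM-HMM DNN-HMM GMM GMM

[23] (Cepstral Mean Normalization; CMN, Cepstral Variance Normalization; CVN) [23]

Let $c_i(t)$ denote the $i$-th cepstral coefficient at frame $t$. CMN and CVN are:

$$\text{CMN:}\quad \hat{c}_i(t) = c_i(t) - \mu_i \qquad (1)$$

$$\text{CVN:}\quad \hat{c}_i(t) = \frac{c_i(t)}{\sqrt{\sigma_i^2}} \qquad (2)$$

$$\mu_i = \frac{1}{T}\sum_{t=1}^{T} c_i(t), \qquad \sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T}\bigl(c_i(t) - \mu_i\bigr)^2 \qquad (3)$$

where $T$ is the number of frames. Applying both gives:

$$\hat{c}_i(t) = \frac{c_i(t) - \mu_i}{\sqrt{\sigma_i^2}} \qquad (4)$$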
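As an illustration of Eqs. (1) to (4), the following is a minimal sketch of batch (per-utterance) CMVN in plain Python. The function name `cmvn`, the small constant `eps` (a guard against division by zero), and the toy input values are assumptions made for this example and do not come from the paper.

```python
import math


def cmvn(frames, eps=1e-8):
    """Normalize each cepstral dimension to zero mean and unit variance.

    frames: list of T frames, each a list of D coefficients c_i(t).
    eps: assumed numerical guard, not part of the paper's equations.
    """
    T = len(frames)
    D = len(frames[0])
    # Eq. (3): per-dimension mean and (1/T) variance over the utterance.
    mu = [sum(f[i] for f in frames) / T for i in range(D)]
    var = [sum((f[i] - mu[i]) ** 2 for f in frames) / T for i in range(D)]
    # Eq. (4): subtract the mean, then divide by the standard deviation.
    return [[(f[i] - mu[i]) / math.sqrt(var[i] + eps) for i in range(D)]
            for f in frames]


# Toy utterance with T = 3 frames and D = 2 dimensions (made-up values).
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = cmvn(utterance)
```

Because the statistics are computed over all $T$ frames, this form needs the whole utterance (or a larger unit such as a speaker or corpus) before any frame can be normalized.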

CMVN CMVN [24] [25] DNN-HMM [24] CMVN [24]

$$\mu_i[t] = \beta\,\mu_i[t-1] + (1-\beta)\,c_i[t] \qquad (5)$$

$$\sigma_i^2[t] = \beta\,\sigma_i^2[t-1] + (1-\beta)\,\bigl(c_i[t] - \mu_i[t]\bigr)^2 \qquad (6)$$

$$\beta = 0.992 \qquad (7)$$

Algorithm 1 [26], [27] j rs i j m_min m_max n s GMM

$$L(o \mid \lambda_s) = \log p(o \mid \lambda_s) = \sum_{t=1}^{T} \log p(o_t \mid \lambda_s) \qquad (8)$$

Algorithm 1
 1: for i = 1 to I ( ) do
 2:   n = 1
 3:   M ( ) i
 4:   sc ( 1, ..., m_min, ..., m_max, ..., m)
 5:   for j = 2 to M do
 6:     if sc(1) - sc(j) < rs then
 7:       n = n + 1
 8:     end if
 9:   end for
10:   if n < m_min then
11:     i 1 m_min
12:   else if n > m_max then
13:     i 1 m_max
14:   else if m_min <= n <= m_max then
15:     i n
16:   end if
17: end for

$o = o_1, o_2, \ldots, o_T$ 1 $\lambda_s$ $s$ GMM GMM GMM CMVN DNN

a) GMM DNN 1
b) DNN DNN DNN DNN
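The surviving steps of Algorithm 1 can be read as a soft assignment of each utterance to between m_min and m_max speaker classes. The sketch below shows one such reading in Python and should be treated as an interpretation, not the authors' implementation. It assumes that sc holds the per-class scores of Eq. (8) sorted in descending order, that rs is a margin on the difference from the best score, and that the utterance is assigned to the top n classes. The function name `soft_assign` and the score values are hypothetical; the class labels AM, AF, EM, EF, CM, CF are the ones used in the paper.

```python
def soft_assign(scores, rs, m_min, m_max):
    """Return the class ids an utterance is assigned to.

    scores: dict mapping class id -> log-likelihood L(o | lambda_s), Eq. (8).
    rs: margin on the score difference from the best class.
    m_min, m_max: lower and upper bounds on the number of classes.
    """
    # Rank classes from best to worst score (assumed meaning of "sc").
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = scores[ranked[0]]
    # Steps 2-9: n starts at 1 and counts classes within rs of the best.
    n = 1 + sum(1 for s in ranked[1:] if best - scores[s] < rs)
    # Steps 10-16: clamp n into [m_min, m_max].
    n = max(m_min, min(n, m_max))
    return ranked[:n]


# Made-up log-likelihoods for one utterance under six class GMMs.
scores = {"AM": -100.0, "AF": -103.0, "EM": -101.5,
          "EF": -120.0, "CM": -130.0, "CF": -125.0}
assigned = soft_assign(scores, rs=2.0, m_min=1, m_max=3)
```

With these toy values only EM lies within the margin of the best class AM, so the utterance would contribute to the training data of both classes.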

Fig. 1 Overview of speaker class incorporation.

i-vector [28] i-vector GMM i-vector (8) 4 1

b-1) GMM ( ) DNN
b-2) 1 0 DNN
b-3) GMM 1 DNN
b-4) DNN DNN
c) (a) (b) DNN

3.3 ( ) DNN DNN (stepwise training) ( / / ) ( / )

1 ASJ ( ) [29] JNAS ( ) [30]

Table 1 Training data.
ASJ+JNAS    ,337 ( 33h)    25,056 ( 44h)    0.45%    0.45%
S-JNAS    ,081 ( 53h)    24,061 ( 53h)    2.07%    2.05%
CIAIR-VCV    ,538 (+3993, 11h)    7,744 (+3910, 11h)    13.81%    13.64%

JNAS S-JNAS ( ) [31] CIAIR-VCV ( ) [32] AM AF EM EF CM CF 100 AM: 23 AF: 23 EM: 10 EF: 10 CM: 7 CF: 8 14% 0.5% 2.1% 94.5%

GMM-HMM DNN-HMM GMM 25ms 10ms 12 MFCC ΔMFCC ΔΔMFCC Δ ΔΔ CMVN

GMM-HMM GMM-HMM left-to-right HMM HMM HMM (116 × 8 = 928) MAP [27] HTK (HMM Toolkit) [33]

DNN-HMM DNN GMM-HMM 3 [34] 418 (= 38 × 11) 3 2,048 1,276 (= ) CMVN (zero mean, unit variance) Rectifier function ($f(x) = \max(0, x)$) [34], [35] Rectifier function DNN $\pm\sqrt{6/(n_j + n_{j+1})}$ [36] $n_j$ DNN $j$

4.3 [37] ,000 tri-gram 1
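The DNN setup mentioned above (rectifier units $f(x) = \max(0, x)$ and weights initialized within $\pm\sqrt{6/(n_j + n_{j+1})}$ [36]) can be sketched as follows. This is an illustrative sketch only: it assumes that the figures 418, 2,048 and 1,276 are the input, hidden and output layer sizes and that there are three hidden layers. The helper names `init_layer` and `relu` and the fixed random seed are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Draw an n_out x n_in weight matrix uniformly from
    [-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))]."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]


def relu(x):
    """Rectifier function f(x) = max(0, x)."""
    return max(0.0, x)


# Assumed topology: 418 inputs, three hidden layers of 2,048, 1,276 outputs.
layer_sizes = [418, 2048, 2048, 2048, 1276]
limits = [math.sqrt(6.0 / (a + b))
          for a, b in zip(layer_sizes, layer_sizes[1:])]

# Small demonstration layer, to keep the example fast.
rng = random.Random(0)
demo_weights = init_layer(4, 3, rng)
demo_input = [0.5, -1.0, 2.0, 0.0]
demo_output = [relu(sum(w * x for w, x in zip(row, demo_input)))
               for row in demo_weights]
```

The range shrinks as the layers get wider, which keeps the scale of activations roughly comparable from layer to layer at the start of training.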

WFST SPOJUS [38]

4.4 GMM GMM 8 GMM 10,000 GMM 64 rs class soft 12 class soft m_min m_max rs

GMM-HMM 84.6% DNN-HMM 88.8% DNN-HMM GMM-HMM GMM-HMM

Table 2 Word accuracy of the baseline system [%].
Acoustic Model    AM  AF  EM  EF  CM  CF  Ave.
GMM-HMM
DNN-HMM

Table 3 Word accuracy of the speaker-class dependent models (class-known) [%].
Acoustic Model    AM  AF  EM  EF  CM  CF  Ave.
GMM-HMM
DNN-HMM

DNN-HMM 88.8% GMM-HMM DNN-HMM GMM-HMM 4 (all frames) 50 (50 frames) GMM class init 12 class soft 6 1 (all frames) 2 GMM-HMM (84.6%) (85.9%) % GMM-HMM (85.6%) GMM-HMM DNN-HMM DNN-HMM

5.3 (a) DNN-HMM

Table 4 Increase of speaker-class and changes in accuracy (class-unknown) [%]. Acoustic Model: GMM-HMM.
Training data    All frames: AM  AF  EM  EF  CM  CF  Ave.    50 frames: AM  AF  EM  EF  CM  CF  Ave.
6 class init (6 GMMs)
class soft (12 GMMs)

Table 5 Word accuracy based on speaker-class-dependent CMVN (class-unknown) [%]. Acoustic Model: DNN-HMM.
Training data    All frames: AM  AF  EM  EF  CM  CF  Ave.    50 frames: AM  AF  EM  EF  CM  CF  Ave.
1 class (Table 2)
class init
class soft
class soft

Table 6 Word accuracy comparison on the cepstral normalization (class-known) [%].
CMVN unit    # normalization unit    Ave. Acc. [%]
corpus
class init
speaker
utterance

(all frames) (6 class init) 89.2% ( ) 89.1% (50 ) 2 DNN-HMM (88.8%) 4% % % DNN DNN

(b) 6 6 CMVN (corpus) % 6 (6 class init) CMVN

Table 7 Word accuracy based on utterance-based online cepstral normalization.
Training    Test    Ave. Acc. [%]
Online    Online    89.3
Batch    Online

% 2 6 CMVN CMVN CMVN CMVN CMVN (speaker) 89.4% CMVN (utterance) 89.8% 1 ( 5.4 )

(c) (5) (7) CMVN 7 Online Batch 89.3% 87.7%
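The online normalization of Eqs. (5) to (7), which Table 7 evaluates, can be sketched as a frame-by-frame update of the running mean and variance. The sketch below is a minimal illustration under assumptions: the surviving text does not show how the running statistics are initialized, so `init_mean` and `init_var` are hypothetical parameters, and `eps` is an assumed guard against division by zero.

```python
import math

BETA = 0.992  # forgetting factor, Eq. (7)


def online_cmvn(frames, init_mean, init_var, beta=BETA, eps=1e-8):
    """Normalize frames one at a time with recursively updated statistics.

    frames: list of frames, each a list of D coefficients c_i[t].
    init_mean, init_var: assumed starting statistics (length D each).
    Returns the normalized frames and the final mean and variance.
    """
    mean = list(init_mean)
    var = list(init_var)
    out = []
    for frame in frames:
        normalized = []
        for i, c in enumerate(frame):
            # Eq. (5): update the running mean first.
            mean[i] = beta * mean[i] + (1.0 - beta) * c
            # Eq. (6): update the variance using the new mean.
            var[i] = beta * var[i] + (1.0 - beta) * (c - mean[i]) ** 2
            normalized.append((c - mean[i]) / math.sqrt(var[i] + eps))
        out.append(normalized)
    return out, mean, var


# One made-up frame, starting from zero mean and unit variance.
first_out, first_mean, first_var = online_cmvn([[5.0]], [0.0], [1.0])
```

With $\beta = 0.992$ each new frame carries a weight of 0.008, so the statistics move slowly; for an utterance of roughly 50 frames the starting values still dominate, which is one reason the choice of initial statistics matters for short utterances.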

Table 8 Word accuracy using speaker-class-information (class-unknown) [%].
Training method    Training data    All frames: AM  AF  EM  EF  CM  CF  Ave.    50 frames: AM  AF  EM  EF  CM  CF  Ave.
Baseline    1 class (Table 2)
w/o stepwise training    class init
w/o stepwise training    6 class soft
w/o stepwise training    class soft
w/ stepwise training    class init
w/ stepwise training    6 class soft
w/ stepwise training    class soft

CMVN ( Batch) CMVN CMVN CMVN DNN 8 (all frames) 50 (50 frames)

(a) 8 w/o stepwise training GMM 6 class init 89.2% 50 DNN-HMM ( 2) 2 DNN-HMM (88.8%) 6 DNN (50 frames, 89.2%) [39] p = GMM-HMM DNN-HMM

(b) 8 w/ stepwise training % 5 CMVN (6 class init, 89.1%) 1% DNN Rectified Linear Unit DNN DNN DNN DNN DNN 9 retrain layer % %

GMM 6 class soft 50

a) %
b) DNN
b-1) 8 (6 class soft; 50 frames) 89.2% 89.6%
b-2) (0/1) 87.8% 89.3%
b-3) 88.4% 89.4%
b-4) % 89.5%
c) CMVN ( ) 88.8% 89.3%

Table 9 Word accuracy using stepwise training (class-unknown) [%].
retrain layer    50 frames: AM  AF  EM  EF  CM  CF  Ave.
all
and

5.6 ( ) 0.0% ( ) GMM-HMM method baseline stepwise training 6 GMM 50 online CMVN CMVN ( ) CMVN (0.5 ) ( 600 ) ( ) 6

Table 10 Word accuracy focusing on the first word of a sentence [%].
method    AM  AF  EM  EF  CM  CF  Ave.
baseline (1 class, 1 DNN)
stepwise training (6 class soft, 50 frames)
online CMVN (training: online, recognition: online)
online CMVN (training: batch, recognition: online)

DNN-HMM 0.5 DNN-HMM DNN i-vector % 89.6% ( 89.0% 90.3%) 7% (12%)

[1] J.G. Wilpon and C.N. Jacobsen, "A study of speech recognition for children and the elderly," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] M. Padmanabhan, L.R. Bahl, D. Nahamoo, and M.A. Picheny, "Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.71-77.
[3] D, vol.J85-D, no.3, March.
[4] K. Konno, M. Kato, and T. Kosaka, "Speech recognition with large-scale speaker-class-based acoustic modeling," Proc. APSIPA, pp.1-4.
[5] M. Naito, L. Deng, and Y. Sagisaka, "Speaker clustering for speech recognition using vocal tract parameters," Speech Commun.
[6] R. Faltlhauser and G. Ruske, "Robust speaker clustering in eigenspace," Proc. Automatic Speech Recognition and Understanding, pp.57-60.
[7] H. Nanjo and T. Kawahara, "Speaking-rate dependent decoding and adaptation for spontaneous lecture speech recognition," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[8] Y. Zhang, J. Xu, Z.J. Yan, and Q. Huo, "An i-vector based approach to training data clustering for improved speech recognition," Proc. Interspeech.
[9] pp.1-3.
[10] T. Shinozaki, Y. Kubota, and S. Furui, "Unsupervised acoustic model adaptation based on ensemble methods," IEEE J. Selected Topics in Signal Processing, vol.4.
[11] HMM
[12] C.J. Leggetter and P.C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol.9.
[13] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[14] M.J.F. Gales and P.C. Woodland, "Mean and variance adaptation within the MLLR framework," Comput. Speech Lang., vol.10.
[15] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol.29, pp.82-97.
[16] A. Senior and I. Lopez-Moreno, "Improving DNN speaker independence with i-vector inputs," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[17] O.A. Hamid and H. Jiang, "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[18] H. Huang and K.C. Sim, "An investigation of augmenting speaker representations to improve speaker normalization for DNN-based speech recognition," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[19] T. Tan, Y. Qian, M. Yin, Y. Zhuang, and K. Yu, "Cluster adaptive training for deep neural network," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] T. Kosaka, K. Konno, and M. Kato, "Deep neural network-based speech recognition with combination of speaker-class models," APSIPA.
[21] i-vector, pp.65-70.
[22] Y. Liu, P. Karanasou, and T. Hain, "An investigation into speaker informed DNN front-end for LVCSR," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4300-4304, 2015.
[23] O. Viikki, D. Bye, and K. Laurila, "A recursive feature vector normalization approach for robust speech recognition in noise," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] P. Pujol, D. Macho, and C. Nadeu, "On real-time mean-and-variance normalization of speech recognition features," Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[25] A.Y. Nakano, S. Nakagawa, and K. Yamamoto, "Distant speech recognition using a microphone array network," IEICE Trans. Inf. & Syst., vol.E93-D, no.9, pp.2451-2462, Sept. 2010.
[26] pp.159-160, 2010.
[27] D. Enami, F. Zhu, K. Yamamoto, and S. Nakagawa, "Soft-clustering technique for training data in age- and gender-independent speech recognition," Proc. APSIPA, pp.1-4.
[28] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio Speech Language Process., vol.15, pp.1435-1447, 2007.
[29] (ASJ-JIPDEC) http://research.nii.ac.jp/src/asj-JIPDEC.html
[30] K. Itou, M. Yamamoto, K. Takeda, T. Takezawa, T. Matsuoka, T. Kobayashi, K. Shikano, and S. Itahashi, "Japanese speech corpus for large vocabulary continuous speech recognition research," J. Acoustical Society of Japan (E).
[31] (S-JNAS)
[32] CIAIR, (CIAIR-VCV)
[33] HTK
[34] DNN-HMM, pp.1-6.
[35] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," International Conference on Artificial Intelligence and Statistics.
[36] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," International Conference on Artificial Intelligence and Statistics.
[37]
[38] Y. Fujii, K. Yamamoto, and S. Nakagawa, "Large vocabulary speech recognition system: SPOJUS++," Proc. International Conference MUSP.
[39] vol.50.

ISCA

IETE ( ) ( ) ( ) ( ) Spoken Language Systems (IOS Press) ( )
