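Acoustic model (triphone); input speech; speech analysis; decoder; language model (N-gram); bigram; used as HMM state probabilities; output layer, triphone: 3003 nodes; rescoring, trigram; hidden layers, 2048 nodes × 7 layers. 1 Structure of recognition syst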



1 1,a) deep neural network (DNN) (HMM) () GMM-HMM 2 3 (CSJ) 1. DNN [6]. GPGPU HMM DNN HMM () [7]. [8] [1][2][3] GMM-HMM Gaussian mixture HMM (GMM-HMM) MAP MLLR [4] [3] DNN 1 1 triphone bigram [5]. 2 trigram

1 Graduate School of Science and Engineering, Yamagata University
a) tth18357@st.yamagata-u.ac.jp

[3] 2. 2

© 2014 Information Processing Society of Japan

[Fig. 1 labels: acoustic model (triphone); input speech; speech analysis; decoder; language model (N-gram); bigram; used as HMM state probabilities; output layer, triphone: 3003 nodes; rescoring, trigram; hidden layers, 2048 nodes × 7 layers]

Fig. 1 Structure of recognition system

Input layer: 825 nodes. 11 (CSJ) 5 FBANK+Δ+ΔΔ, 75 dimensions × 11 frames = [3] HMM [10] GMM-HMM triphone 3003 DNN pre-training fine-tuning GMM-HMM 3 2 pre-training 1 Restricted Boltzmann Machine (RBM) pre-training [9]. (SGD) HMM (base) fine-tuning 3 DNN- GMM-HMM GMM-HMM HMM 3. MLLR (GMM-HMM adapt1) [7] HMM base adapt1 GMM-HMM base) base GMM-HMM GMM-HMMadapt1 [8] GMM-HMM

Fig. 2 Structure of FBANK

GMM-HMM (LM base adapt1 LMbase 3 1
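The layer sizes quoted for the DNN (75-dimensional FBANK+Δ+ΔΔ features over 11 frames giving 825 input nodes, 7 hidden layers of 2048 nodes, 3003 triphone output nodes) can be checked with a small sketch. This is only an illustration: the symmetric ±5 frame context, the edge padding by frame repetition, the inclusion of bias terms, and all function names are assumptions, not details confirmed by the text.

```python
import numpy as np

FEAT_DIM = 75    # FBANK + delta + delta-delta, as quoted in the text
CONTEXT = 5      # assumed symmetric context: 5 left + centre + 5 right = 11 frames
HIDDEN = 2048    # nodes per hidden layer
N_HIDDEN = 7     # number of hidden layers
N_OUT = 3003     # triphone output nodes


def splice_frames(feats, context=CONTEXT):
    """Stack every frame with its +/-context neighbours.

    feats has shape (n_frames, dim). Utterance edges are padded by
    repeating the first/last frame (an assumption for this sketch).
    """
    n = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + n] for k in range(2 * context + 1)])


def layer_sizes():
    """Input, hidden and output layer widths of the network."""
    return [FEAT_DIM * (2 * CONTEXT + 1)] + [HIDDEN] * N_HIDDEN + [N_OUT]


def count_parameters(sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))


if __name__ == "__main__":
    dummy = np.zeros((100, FEAT_DIM))        # 100 frames of placeholder features
    print(splice_frames(dummy).shape)        # one 825-dimensional vector per frame
    print(layer_sizes())
    print(count_parameters(layer_sizes()))
```

Splicing is done once per utterance, so each row of the result can be fed to the input layer directly.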

[Labels from Figs. 3 and 4: base; GMM-HMM base; recognition; conversion to phoneme sequence; adaptation; GMM-HMM adapt1; recognition; evaluation data; GMM-HMM; GMM-HMM; DNN; for adaptation; conversion to phoneme sequence (insertion of sil candidates); Viterbi alignment; GMM; for adaptation; evaluation data; conversion to state sequence; phoneme/state sequence; LM base; adaptation; adapt1; recognition; adaptation; LM adapt1]

Fig. 3 Procedure diagram of unsupervised adaptation

trigram,. 3 P(w_i | c_i) 4 GMM-HMM. GMM-HMM,.

\[ P(c_i \mid c_{i-2} c_{i-1}) = \frac{N_0(c_{i-2} c_{i-1} c_i)}{N_0(c_{i-2} c_{i-1})} \qquad (1) \]

(sil) N_0 trigram P(w_i | w_{i-2} w_{i-1}) trigram trigram GMM-HMM GMM-HMM

Fig. 4 Procedure diagram of phoneme or state alignment

4.

\[ P'(w_i \mid w_{i-2} w_{i-1}) = \lambda P(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda) P(w_i \mid c_i) P(c_i \mid c_{i-2} c_{i-1}) \qquad (2) \]

GMM-HMM 1 trigram 2 trigram. fine-tuning λ. λ 0.7 DNN 5 trigram trigram, trigram [11]. 5. DNN DNN [5]. GMM-HMM GMM-HMM dropout [12] / 25ms/8ms 12 MFCC 1 2 [8] 39 CMN CSJ L L2 (ML)
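The language model adaptation in Eqs. (1) and (2) estimates a class trigram from counts N_0 and interpolates it with the baseline word trigram using a weight λ (0.7 in the text). A minimal Python sketch follows. It reads c as the part-of-speech class of word w, which is an interpretation based on the Fig. 5 labels, and the function names, the toy data, and the choice to count histories only where a third class follows are all assumptions made for illustration.

```python
from collections import Counter

LAMBDA = 0.7  # interpolation weight quoted in the text


def class_trigram_probs(class_seqs):
    """Eq. (1): P(c_i | c_{i-2} c_{i-1}) = N0(c_{i-2} c_{i-1} c_i) / N0(c_{i-2} c_{i-1}).

    class_seqs is an iterable of class-label sequences, e.g. obtained from
    recognition results on the adaptation data. Histories are counted only
    where a third class follows, so probabilities sum to 1 per history.
    """
    tri, hist = Counter(), Counter()
    for seq in class_seqs:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            tri[(a, b, c)] += 1
            hist[(a, b)] += 1
    return {key: n / hist[key[:2]] for key, n in tri.items()}


def adapted_trigram(p_base, p_word_given_class, p_class, lam=LAMBDA):
    """Eq. (2): interpolate the baseline word trigram with the class trigram.

    p_base             = P(w_i | w_{i-2} w_{i-1}) from the baseline model
    p_word_given_class = P(w_i | c_i)
    p_class            = P(c_i | c_{i-2} c_{i-1}) from class_trigram_probs
    """
    return lam * p_base + (1.0 - lam) * p_word_given_class * p_class


if __name__ == "__main__":
    # Toy class sequences (hypothetical labels, not from the paper).
    probs = class_trigram_probs([["N", "P", "V"], ["N", "P", "N"]])
    p_c = probs[("N", "P", "V")]
    print(adapted_trigram(p_base=0.01, p_word_given_class=0.02, p_class=p_c))
```

With λ close to 1 the adapted model stays near the baseline word trigram; smaller values give more weight to the class statistics of the adaptation data.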

[Fig. 5 labels: adaptation data; large text corpus; decoder; probability of a word given its part of speech; word trigram (baseline); part-of-speech counts; part-of-speech chain probability; part-of-speech trigram; word trigram (adapted model)]

Fig. 5 Procedure diagram of language model adaptation

Table 1 Conditions for DNN training

pre-training 0.4 (1 0.01) 10 (1 20) ( WER (PMR: Phoneme mismatch rate) (75 × 11 = 825) 2 CSJ (203 ) 1 [13][14] 2 fine-tuning 1/10 WER 0.1% base (adapt1a) 2 47, epo ( 6.68M CSJ ) [8] 100 CSJ testset1 10 DNN Kaldi toolkit [13] base 2 GMM-HMM L2 (LMadapt1b) WER 14.73% GMM-HMM 0 (GMM-HMMadapt1) WER 14.53% L PMR

[Fig. 6 axis labels: WER of [%]; WER of GMM-HMM [%]]

Fig. 6 Word error rate for each speaker

) GMM-HMM L (WER) 19.75% WER fine-tuning 15.12% (base) WER 6 0.1% WER 14.72%

[Fig. 7 residue, values in extraction order: base 15.12%, 0.64%, adapt1a 14.72%, epoch=100, GMM-HMM adapt1a 14.51%, 2.18%, LM adapt1b 14.73%, 2.64%, 2.95%, 3.14%, GMM-HMM adapt1 14.53%, 4.16%, GMM-HMM adapt2c 14.53%, adapt1 13.57%, epoch=25, GMM-HMM adapt2d 14.04%, LM adapt %, GMM-HMMadapt %]

Fig. 7 Word accuracy using cross adaptation

Table 2 Comparisons of substitution, insertion and deletion errors (%)

Type of errors   DNN-HMMbase   DNN-HMMadapt1a   LMadapt1b   GMM-HMMadapt1
Sub              9.57          9.35             8.99        9.
Ins
Del
WER

[Fig. 8 axis labels: WER (%); GMM-HMM PMR; speaker number]

Fig. 8 Results of adaptation for each speaker

PMR (0.64%) 2 3 DNN-HMMadapt1a GMM-HMMadapt1 GMM-HMMadapt1 GMM-HMM (GMM-HMMadapt2) 13.57% GMM-HMM LMadapt1b 13.08% GMM-HMMadapt1 GMM-HMM GMM-HMM LM (GMM-HMMadapt2c) GMM-HMM LM % (LMadapt1) 13.08% GMM-HMM LM
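The error types compared in Table 2 (substitutions, insertions, deletions) and the WER figures reported throughout follow the standard definition WER = (S + I + D) / N, where N is the number of reference words. The sketch below computes these counts with a plain edit-distance alignment. It is not the scoring tool used in the experiments, the function names are hypothetical, and when several minimum-cost alignments exist the split between error types depends on the tie-breaking used here.

```python
def wer_counts(ref, hyp):
    """Return (substitutions, insertions, deletions) for a minimum-cost alignment.

    ref and hyp are lists of words. Each dp cell holds
    (total_cost, substitutions, insertions, deletions) for ref[:i] vs hyp[:j].
    """
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)          # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, s, n, d = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (cost, s, n, d)               # match, no error
            else:
                diag = (cost + 1, s + 1, n, d)       # substitution
            cost, s, n, d = dp[i][j - 1]
            ins = (cost + 1, s, n + 1, d)            # extra word in hyp
            cost, s, n, d = dp[i - 1][j]
            dele = (cost + 1, s, n, d + 1)           # missing word in hyp
            dp[i][j] = min(diag, ins, dele)          # lowest total cost wins
    return dp[-1][-1][1:]


def word_error_rate(ref, hyp):
    """WER = (S + I + D) / N with N = number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return sum(wer_counts(ref, hyp)) / len(ref)


if __name__ == "__main__":
    ref = "a b c d".split()
    hyp = "a x c".split()
    print(wer_counts(ref, hyp), word_error_rate(ref, hyp))
```

Multiplying the returned rate by 100 gives a percentage comparable in form to the WER values listed above.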


[11]

Fig. 9 Summary of recognition results (axis label: WER (%))

7.

GMM-HMM 2 3 (CSJ) GMM-HMM 3 [5]

[1] ,, :,, pp (2012).
[2] ,, : Deep Neural Network,, 2013-SLP-97(8), pp. 1-6 (2013).
[3] , : CSJ,, 2013-SLP-97(9), pp. 1-6 (2013).
[4] Y. Xiao, et al.: An initial attempt on task-specific adaptation for deep neural network-based large vocabulary continuous speech recognition, Proc. of Interspeech2012, (2012).
[5] H. Liao: Speaker adaptation of context dependent deep neural networks, Proc. of ICASSP2013, (2013).
[6] ,, X. Lu,, :, (2014).
[7] S. Stuker, et al.: Cross-system adaptation and combination for continuous speech recognition: The influence of phoneme set and acoustic front-end, Proc. of InterSpeech2006, pp , (2006).
[8] ,,, :, (2014).
[9] A. Mohamed, G. Hinton and G. Penn: Understanding how deep belief networks perform acoustic modelling, Proc. of ICASSP2012, (2012).
[10] T. Kosaka, T. Miyamoto and M. Kato: Unsupervised cross-adaptation approach for speech recognition by combined language model and acoustic model adaptation, Proc. of APSIPA ASC 2011, (2011).
[11] , Vol.J89-D No.2, pp (2006).
[12] G.E. Dahl, T.N. Sainath and G.E. Hinton: Improving deep neural networks for LVCSR using rectified linear units and dropout, Proc. of ICASSP2013, (2013).
[13] Kaldi project: The Kaldi speech recognition toolkit, html
[14] K. Vesely, A. Ghoshal, L. Burget, and D. Povey: Sequence-discriminative training of deep neural networks, Proc. of Interspeech2013, (2013).
