Ver.1.0 2004/3/23

1
CSJ (Corpus of Spontaneous Japanese) [1]
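2
The acoustic models are HMMs trained with HTK [2]. Each model is a 3-state left-to-right HMM, and the triphone set has 3000 states with 16 mixture components per state. [...] MLLR [...] CSJ [...] Table 1 lists the three acoustic models (footnote 2).

Table 1:
  Training data   Male              Female            Male + female     GID model file
                  talks   hours     talks   hours     talks   hours
  APS             787     186       166     42        953     228       AM/CSJ-APS/hmmdefs.gz
  SPS             721     124       822     134       1543    258       AM/CSJ-SPS/hmmdefs.gz
  APS + SPS       1508    310       988     176       2496    486       AM/CSJ-APS,SPS/hmmdefs.gz

Footnote 2: CSJ [...] segment.pdf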
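2.1
Speech is sampled at 16 kHz with 16-bit quantization and analyzed with a 25 msec window shifted every 10 msec. Each frame yields 12 MFCCs, 12 ΔMFCCs and 1 ΔPower, giving a 25-dimensional feature vector, and CMS is applied. Table 2 summarizes the analysis conditions and Table 3 shows the HTK config file.

Table 2:
  Sampling frequency      16 kHz
  Pre-emphasis            0.97
  Analysis window         Hamming
  Window length           25 ms
  Frame shift             10 ms
  Feature vector          MFCC 12 + ΔMFCC 12 + ΔPower (25 dimensions)
  Filter bank channels    24
  Normalization           CMS

Power is computed from the N samples s_n of a frame:

  \mathrm{Power} = \log \sum_{n=1}^{N} s_n^2    (1)

The Δ coefficient d_t is computed from the static coefficients c_t of the neighboring frames:

  d_t = \frac{\sum_{\theta=1}^{\Theta} \theta \, (c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2}    (2)

with \Theta = 2.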
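Equations (1) and (2) can be illustrated with a short Python sketch. It is not taken from the original document: the function names are hypothetical, and the handling of the first and last frames (repeating the edge frame) is an assumption, since the text does not say how boundaries are treated.

```python
import math


def log_power(samples):
    """Eq. (1): log of the sum of squared samples in one frame."""
    return math.log(sum(s * s for s in samples))


def delta(coeffs, theta_max=2):
    """Eq. (2): regression delta of a sequence of static coefficients.

    Assumption (not stated in the source): frames beyond either end of
    the sequence are filled by repeating the first or last frame.
    """
    n = len(coeffs)
    denom = 2 * sum(th * th for th in range(1, theta_max + 1))
    out = []
    for t in range(n):
        num = 0.0
        for th in range(1, theta_max + 1):
            ahead = coeffs[min(t + th, n - 1)]   # c_{t+theta}, clamped at the end
            behind = coeffs[max(t - th, 0)]      # c_{t-theta}, clamped at the start
            num += th * (ahead - behind)
        out.append(num / denom)
    return out


if __name__ == "__main__":
    # A linear ramp has slope 1, so interior deltas should equal 1.
    ramp = [float(t) for t in range(6)]
    print(delta(ramp))
```

With Θ = 2 the denominator is 2(1 + 4) = 10, so each Δ value is a weighted slope over a five-frame span.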
Table 3: HTK config file

  SOURCEFORMAT = NOHEAD
  SOURCEKIND = WAVEFORM
  SOURCERATE = 625
  TARGETKIND = MFCC_E_D_Z
  TARGETRATE = 100000.0
  SAVECOMPRESSED = F
  SAVEWITHCRC = F
  WINDOWSIZE = 250000.0
  USEHAMMING = T
  PREEMCOEF = 0.97
  NUMCHANS = 24
  NUMCEPS = 12
  ZMEANSOURCE = T
  ENORMALISE = F
  ESCALE = 1.0
  TRACE = 0
  RAWENERGY = F
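HTK expresses SOURCERATE, TARGETRATE and WINDOWSIZE in units of 100 ns, which makes the values in Table 3 hard to read at a glance. The following Python sketch converts them back to the conditions of Table 2. The helper names are hypothetical, and the dimension count assumes the usual HTK convention that the energy term and the delta stream each extend the 12 cepstra.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK time values are multiples of 100 ns


def sample_rate_hz(source_rate):
    """Sampling frequency implied by SOURCERATE (sample period in 100 ns units)."""
    return HTK_UNITS_PER_SECOND / source_rate


def to_ms(htk_time):
    """Convert an HTK time value (100 ns units) to milliseconds."""
    return htk_time / 10_000


def to_samples(htk_time, source_rate):
    """Number of waveform samples covered by an HTK time value."""
    return round(htk_time / source_rate)


def feature_dim(num_ceps, with_energy, with_delta, suppress_abs_energy):
    """Vector size for a parameter kind built from cepstra, energy and deltas."""
    static = num_ceps + (1 if with_energy else 0)
    total = static * (2 if with_delta else 1)
    if suppress_abs_energy:
        total -= 1  # absolute energy dropped, delta energy kept
    return total


if __name__ == "__main__":
    print(sample_rate_hz(625))        # sampling frequency in Hz
    print(to_ms(250000.0))            # window length in ms
    print(to_ms(100000.0))            # frame shift in ms
    print(to_samples(250000.0, 625))  # samples per window
    print(to_samples(100000.0, 625))  # samples per shift
```

Under this convention, `feature_dim(12, True, True, False)` gives 26, while the 25-dimensional vector of Table 2 corresponds to `feature_dim(12, True, True, True)`, that is, to suppressing the absolute energy term (the `_N` qualifier in HTK). The config as reproduced here does not show that qualifier.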
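2.2
Table 4 lists the phone set of 42 units, including q, sp, silb and sile. [...] 500 [...] (see 2.3) [...] N [...] a: [...] o: [...]

Table 4:
  a i u e o a: i: u: e: o:
  N w y j my ky by gy ny hy ry py
  p t k ts ch b d g z m n s sh h f r
  q sp silb sile

2.3
CSJ [...] CSJ [...] (W ...;...) --> (?, ...) --> [...] Table 5 [...] 500 [...] silb, sile [...] 500 [...] 20 [...] 500 [...] sp [...] sp [...] sp [...]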
Table 5:
  a      i      u      e      o
  k a    k i    k u    k e    k o
  g a    g i    g u    g e    g o
  s a    sh i   s u    s e    s o
  z a    j i    z u    z e    z o
  t a    ch i   ts u   t e    t o
  d a    j i    z u    d e    d o
  n a    n i    n u    n e    n o
  h a    h i    f u    h e    h o
  b a    b i    b u    b e    b o
  p a    p i    p u    p e    p o
  m a    m i    m u    m e    m o
  r a    r i    r u    r e    r o
  w a    o
  y a    y u    y o
  ky a   ky u   ky o
  gy a   gy u   gy o
  sh a   sh u   sh o
  j a    j u    j o
  ch a   ch u   ch o
  ny a   ny u   ny o
  hy a   hy u   hy o
  by a   by u   by o
  py a   py u   py o
  my a   my u   my o
  ry a   ry u   ry o
  ie  sh e  je  ti  tu  ch e  ts a  ts i  ts e  ts o  di  du  du  nie  he  fa  fi  fe  fo  hy u  bi  me  wi  we  wo  ka  ga  sui  ji  teyu  ba  bi  bu  be  bo
  N      q      :
2.4
silb, sile, sp [...] IPA (footnote 3) [...] Table 6.

Table 6:
  a:-k+a    a-k+a
  -a+ky     *-a+k
  ky-a+*    y-a+*

2.5
[...] Table 7 [...] CSJ [...] 3000 [...]

Footnote 3: http://www.itakura.nuee.nagoya-u.ac.jp/~takeda/ipa/
Table 7:
  L_Nasal            N-, n-, m-                    R_Nasal            +N, +n, +m
  L_Bilabial         p-, b-, f-, m-, w-            R_Bilabial         +p, +b, +f, +m, +w
  L_DeltalAlveolar   t-, d-, ts-, z-, s-, n-       R_DeltalAlveolar   +t, +d, +ts, +z, +s, +n
  L_PalatoAlveola    ch-, j-, sh-                  R_PalatoAlveola    +ch, +j, +sh
  L_Velar            k-, g-                        R_Velar            +k, +g
  L_Glottal          h-                            R_Glottal          +h
  L_YOUON            y-
  L_SOKUON           q-                            R_SOKUON           +q
  L_R                r-                            R_R                +r
  L_N                N-                            R_N                +N
  L_A                a-                            R_A                +a
  L_I                i-                            R_I                +i
  L_U                u-                            R_U                +u
  L_E                e-                            R_E                +e
  L_O                o-                            R_O                +o
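The entries of Table 7 pair a name with a set of left contexts (written `x-`) or right contexts (written `+x`) of a triphone label of the form `L-C+R`. As a minimal sketch of how such a question applies to a label, the Python below splits a triphone name and checks its context against a few of the sets from the table. The function names and the `QUESTIONS` dictionary are hypothetical and are not part of HTK or of the distributed models.

```python
def split_triphone(name):
    """Split an 'L-C+R' label into (left, center, right).

    A missing context is returned as None, so a monophone such as 'sp'
    yields (None, 'sp', None).
    """
    left, rest = name.split("-", 1) if "-" in name else (None, name)
    center, right = rest.split("+", 1) if "+" in rest else (rest, None)
    return left, center, right


# A few context sets copied from Table 7 (illustrative subset only).
QUESTIONS = {
    "L_Nasal": ("left", {"N", "n", "m"}),
    "R_Nasal": ("right", {"N", "n", "m"}),
    "L_Velar": ("left", {"k", "g"}),
    "R_SOKUON": ("right", {"q"}),
}


def answers(question, triphone):
    """Return True if the triphone's context belongs to the question's set."""
    side, phones = QUESTIONS[question]
    left, _center, right = split_triphone(triphone)
    context = left if side == "left" else right
    return context in phones


if __name__ == "__main__":
    print(split_triphone("a:-k+a"))
    print(answers("L_Velar", "k-a+N"), answers("L_Nasal", "k-a+N"))
```

Phone names such as `a:` contain a colon but never `-` or `+`, so splitting on those two characters is enough for the phone set used here.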
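3
[...] [3] (footnote 4) [...] CSJ [...] [4] (footnote 5) [...]
The word dictionary is provided in HTK [2] format: LM/csj.htkdic. It contains two pause entries, <sil> and <sp>. <sil> [...] 1000msec [...] <sp> [...] Table 8 shows example entries.

Table 8:
  <sil>    [<sil>]    silb
  <sil>    [<sil>]    sile
  <sp>     [<sp>]     sp
  +        [ ]        t e N
  + /      [ ]        j u: d e: b i:
  + /      [ ]        ju:rokupi:pi:esu
  + /      [ ]        ju:rokupi:piesu
  +        [ ]        wanwe:
  +        [ ]        wane:
  + /      [ ]        i ch i i: a: r u b i:
  + /      [ ]        nijiqke:
  + /      [ ]        nijuqke:
  +        [ ]        ts u: e:
  + /      [ ]        n i: d e: k e:
  + /      [ ]        n i: d i: k e:

CSJ [...] CSJ [...] 0.2 [...] CSJ [...] CSJ [...] 2596 [...] 6.67M [...] 25,300 [...] 27,249 [...]

Footnote 4: pos.pdf
Footnote 5: wdb.pdf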
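The entries in Table 8 follow the layout of an HTK dictionary line: a word, an optional output symbol in square brackets, and the phone sequence. A minimal Python sketch of reading one such line is given below. The function name is hypothetical, the optional pronunciation probability that HTK also allows is not handled, and the word `WORD` in the last example is a placeholder rather than an entry of LM/csj.htkdic.

```python
import re

# word, optional [output symbol], then the phone sequence
ENTRY = re.compile(r"^(\S+)\s+(?:\[(.*?)\]\s+)?(.*)$")


def parse_dict_line(line):
    """Parse 'WORD [OUTSYM] p1 p2 ...' into (word, outsym, [phones]).

    outsym is None when the bracketed field is absent.
    """
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    word, outsym, phones = match.groups()
    return word, outsym, phones.split()


if __name__ == "__main__":
    for text in ("<sil> [<sil>] silb", "<sp> [<sp>] sp", "WORD t e N"):
        print(parse_dict_line(text))
```

Keeping the phones as a list makes it straightforward to expand an entry into the triphone labels described in section 2.4.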
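4
N-gram language models were built with the CMU-Cambridge SLM toolkit ver.2 [5] (footnote 6). The 2-gram model is csj.2gram.gz and the 3-gram model is csj.3gram.gz. Both are back-off models with Witten-Bell discounting. N-gram [...] <sil> [...] <sp> [...] CSJ [...] 30 [...] 10 [...] (footnote 7) [...] Table 9 summarizes the models: 2592 talks, 6.67M words, a 25K vocabulary, 0.7M 2-grams and 2.6M 3-grams.

Table 9:
  Talks     2,592
  Words     6,671,844
  1-gram    25,300
  2-gram    731,728
  3-gram    2,611,952
  Note: [...] <sil> [...] <sp> [...]

5
CSJ [...] CSJ [...] Three test sets of 10 talks each are defined: test-set 1 (10 talks), test-set 2 (5 + 5 = 10 talks) and test-set 3 (5 + 5 = 10 talks). [...] 10 [...] [6][10] [...] 3 [...] 2002 [...] 10 [...] CSJ [...]

Footnote 6: http://mi.eng.cam.ac.uk/~prc14/toolkit.html
Footnote 7: A01M0007, A01M0035, A01M0074, A02M0117, A03M0100, A05M0031, A06M0134, [...] 3 [...] [6][7][8][9] [...]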
[...] 30 [...] CSJ [...] test-set 2 [...] A01M0056 [...] ID [...] S05M0613, R00M0187, D01M0019, D04M0056, D02M0028, D03M0017

Table 10: CSJ test sets
  test-set 1 (10 talks: 10 male)
    A01M0097 A01M0110 A01M0137 A03M0106 A03M0112
    A03M0156 A04M0051 A04M0121 A04M0123 A05M0011
  test-set 2 (10 talks: 5 male, 5 female)
    A01M0056 A01M0141 A02M0012 A03M0016 A06M0064
    A01F0001 A01F0034 A01F0063 A03F0072 A06F0135
  test-set 3 (10 talks: 5 male, 5 female)
    S00M0008 S00M0070 S00M0079 S00M0112 S00M0213
    S00F0019 S00F0066 S00F0148 S00F0152 S01F0105

[1] T. Kawahara, H. Nanjo, T. Shinozaki, and S. Furui. Benchmark Test for Speech Recognition using the Corpus of Spontaneous Japanese. In Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, pp. 135-138, 2003.
[2] P. C. Woodland, C. J. Leggetter, J. J. Odell, V. Valtchev, and S. J. Young. The 1994 HTK Large Vocabulary Speech Recognition System. In IEEE Int'l Conf. on Acoustics, Speech & Signal Processing (ICASSP), Vol. 1, pp. 73-76, 1995.
[3] [...], pp. 21-28, Feb. 2001.
[4] [...], pp. 33-38, Feb. 2002.
[5] P. R. Clarkson and R. Rosenfeld. Statistical Language Modeling using the CMU-Cambridge Toolkit. In Proc. European Conf. Speech Communication & Technology (EUROSPEECH), pp. 2707-2710, 1997.
[6] [...], Vol. 43, No. 7, pp. 2098-2107, 2002.
[7] T. Shinozaki and S. Furui. Towards Automatic Transcription of Spontaneous Presentations. In Proc. European Conf. Speech Communication & Technology (EUROSPEECH), pp. 491-494, 2001.
[8] H. Nanjo and T. Kawahara. Speaking-Rate Dependent Decoding and Adaptation for Spontaneous Lecture Speech Recognition. In IEEE Int'l Conf. on Acoustics, Speech & Signal Processing (ICASSP), pp. 725-728, 2002.
[9] [...], Vol. J86-DII, No. 4, pp. 450-459, 2003.
[10] T. Shinozaki and S. Furui. Analysis on Individual Differences in Automatic Transcription of Spontaneous Presentations. In IEEE Int'l Conf. on Acoustics, Speech & Signal Processing (ICASSP), Vol. 1, pp. 729-732, 2002.

606-8501 [...] 4F
kawahara@i.kyoto-u.ac.jp