Automatic Detection of Edited Parts in Inexact Transcribed Corpora Using Alignment between Edited Transcription and Corresponding Utterance

Kengo Ohta,†1 Masatoshi Tsuchiya†2 and Seiichi Nakagawa†1

The availability of large-scale spontaneous speech corpora is crucially important for various domains of spoken language processing. However, the available corpora are usually limited because of the cost of preparing them. On the other hand, inexact transcribed corpora are widely produced in the form of shorthand notes, meeting records, and closed captions. Although such corpora are more freely available than faithful/exact ones, they are not transcribed faithfully and contain edited transcriptions. Against this background, we are building an efficient semi-automatic framework for converting inexact transcripts into faithful/exact ones. The framework consists of three steps: the first step automatically detects the positions of edited parts, the second step manually transcribes those parts, and the third step extracts transformation rules from the resulting parallel corpus of written-style and spoken-style text. This paper proposes an automatic method of detecting edited parts for this framework. In the proposed method, the edited transcription is automatically aligned with its corresponding utterance, and a support vector machine based detector is then applied to detect edited parts using features obtained from the alignment. In an evaluation on the Japanese National Diet Record, a reasonable result was obtained under the speaker-closed condition. As a by-product, we obtain reliable transcripts for unsupervised learning of acoustic models.

†1 Department of Computer Sciences and Engineering, Toyohashi University of Technology
†2 Information Media Center, Toyohashi University of Technology

© 2011 Information Processing Society of Japan
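The detection method summarized in the abstract can be outlined as a three-stage pipeline: align, featurize, classify. The sketch below is illustrative only; all names are hypothetical, the alignment and the SVM (TinySVM in the paper) are replaced by trivial stand-ins, and a single toy duration feature takes the place of the paper's seven alignment-derived features.

```python
# Hypothetical outline of the detection pipeline: segments come from an
# alignment between the edited transcription and the utterance; each
# segment is featurized and classified as edited / not edited.
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One aligned region of the transcription/utterance pair."""
    text: str
    duration_ratio: float  # toy feature: observed vs. expected duration


def extract_features(seg: Segment) -> List[float]:
    # The paper uses seven features derived from the alignment
    # (duration ratios, variance, likelihood scores, ...); here a
    # single toy feature stands in for them.
    return [seg.duration_ratio]


def is_edited(seg: Segment, threshold: float = 1.5) -> bool:
    # Stand-in for the SVM decision: a segment whose duration deviates
    # strongly from expectation is flagged as a candidate edited part.
    return extract_features(seg)[0] > threshold


segments = [Segment("so desu ne", 1.1), Segment("te de su", 2.3)]
flagged = [s.text for s in segments if is_edited(s)]
```

In the real system the flagged segments would then be handed to a human transcriber (step two of the framework), so recall matters more than precision at this stage.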
Proceedings of the 5th Spoken Document Processing Workshop (March 7, 2011)

1. Introduction

The National Diet Record, the official record of the Japanese Diet since 1947, is publicly available at http://kokkai.ndl.go.jp/. Its transcripts are edited for readability and therefore do not reproduce the utterances faithfully. Related work includes lightly supervised acoustic model training by Lamel et al.2) and automatic estimation of transcription accuracy and difficulty by Roy et al.3) The remainder of this paper is organized as follows: Section 2 describes the alignment between the edited transcription and the utterance, Section 3 describes the detection method, Section 4 reports the experiments, and Section 5 concludes.

2. Alignment between Edited Transcription and Corresponding Utterance

Let w_1 w_2 ... w_n be the word sequence of the edited transcription. A 2-gram language model is built from this sequence, and an optional short pause sp_i is allowed after each word w_i, so that the utterance can be aligned to the transcription by speech recognition.
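The core idea of matching the edited transcription against the utterance can be illustrated with a symbol-level dynamic-programming alignment. This is a simplified sketch, not the paper's method: the paper performs forced alignment over a bigram network with optional short pauses, whereas here a plain edit-distance alignment with backtrace is used, treating /sp/ in the utterance as freely insertable (zero cost).

```python
# Edit-distance alignment with backtrace between a reference symbol
# sequence (from the edited transcription) and a hypothesis sequence
# (phones of the utterance).  Unmatched runs are candidate edited parts.

def align(ref, hyp, free=frozenset({"sp"})):
    """Return the alignment as (ref_sym_or_None, hyp_sym_or_None) pairs."""
    n, m = len(ref), len(hyp)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            c = cost[i][j]
            if i < n and j < m:                 # match / substitution
                step = 0 if ref[i] == hyp[j] else 1
                if c + step < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c + step, (i, j)
            if i < n and c + 1 < cost[i + 1][j]:  # deletion (ref unmatched)
                cost[i + 1][j], back[i + 1][j] = c + 1, (i, j)
            if j < m:                           # insertion (hyp unmatched)
                step = 0 if hyp[j] in free else 1
                if c + step < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c + step, (i, j)
    pairs, i, j = [], n, m                      # backtrace from the corner
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((ref[pi] if pi < i else None,
                      hyp[pj] if pj < j else None))
        i, j = pi, pj
    return pairs[::-1]


# Toy example loosely modelled on the phone sequences shown in Fig. 3.
pairs = align(["te", "de", "su"], ["si", "te", "ne", "sp", "su"])
```

Pairs with `None` on the reference side mark utterance material absent from the edited transcription; in the paper's setting such regions are exactly the candidates for manual re-transcription.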
Fig. 2 The bigram network used for the alignment.
Fig. 3 An alignment example involving the phone sequences /si/ /te/ /ne/ /sp/ and /te de su/.

3. Detection of Edited Parts

An SVM-based detector decides, for each region of the alignment, whether it is an edited part; TinySVM ver 0.09 5) is used as the implementation.

3.1 Features
Seven features are extracted from the alignment. The duration-based features below follow common practice in phone duration modeling 6).

3.1.1 Local Duration Ratio
The duration of each phone s_i is compared with the mean duration of its six neighbouring phones:

  Local_d = (1/N) Σ_{i=1}^{N} dur(s_i) / [ (1/6) Σ_{j=i-3, j≠i}^{i+3} dur(s_j) ]    (1)

where N is the number of phones in the segment and dur(s) denotes the duration of phone s.

3.1.2 Global Duration Ratio
Each phone duration is instead normalized by the mean phone duration over the whole utterance U:

  Global_d = (1/N) Σ_{i=1}^{N} dur(s_i) / [ (1/|U|) Σ_{s∈U} dur(s) ]    (2)

3.1.3 Variance of Normalized Durations
The variance of the utterance-normalized durations:

  Var_d = (1/N) Σ_{i=1}^{N} ( dur(s_i) / [ (1/|U|) Σ_{s∈U} dur(s) ] − Global_d )²    (3)

3.1.4 Duration Likelihood Score
Following Lo et al.7), a likelihood-ratio score between a phone duration model P(dur(s)|s) and an anti-model P_anti-model(dur(s)) is computed over the phone set W:

  Score_d = (1/|W|) Σ_{s∈W} log [ P(dur(s)|s) / P_anti-model(dur(s)) ]    (4)

Three further features (Sections 3.1.5–3.1.7) complete the set of seven.

3.2 Preliminary Experiment
In a preliminary experiment, detection rates between roughly 80% and 94% were observed 2).

4. Experiments

4.1 Experimental Setup
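The duration features of Eqs. (1)–(3) can be computed directly from the phone durations produced by the alignment. The sketch below assumes two plain lists of durations in seconds: `seg` for the segment under test and `utt` for the whole utterance. Eq. (4) additionally requires trained duration models P(dur(s)|s) and an anti-model, so it is omitted here; the neighbour window in `local_d` is simply truncated at segment boundaries, which is an assumption not specified by the surviving text.

```python
def local_d(seg):
    """Eq. (1): mean ratio of each phone duration to the mean duration
    of its neighbours (up to three on each side, excluding itself)."""
    n = len(seg)
    total = 0.0
    for i in range(n):
        neigh = [seg[j] for j in range(max(0, i - 3), min(n, i + 4)) if j != i]
        total += seg[i] / (sum(neigh) / len(neigh))
    return total / n


def global_d(seg, utt):
    """Eq. (2): mean segment duration normalized by the utterance mean."""
    utt_mean = sum(utt) / len(utt)
    return (sum(seg) / len(seg)) / utt_mean


def var_d(seg, utt):
    """Eq. (3): variance of the utterance-normalized durations."""
    utt_mean = sum(utt) / len(utt)
    g = global_d(seg, utt)
    return sum((d / utt_mean - g) ** 2 for d in seg) / len(seg)
```

For a segment whose phones are uniformly twice as long as the utterance average, `global_d` returns 2.0 and `var_d` returns 0.0, which is the intended behaviour: abnormally stretched or compressed segments are the ones likely to be edited parts.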
Two evaluation conditions were used: a semi-closed condition and an open condition. For the alignment, the speech recognizer SPOJUS++8) was used with a left-to-right HMM acoustic model9) trained on the Corpus of Spontaneous Japanese (CSJ)10).

Table 1 Evaluation data.
                      | Set A | Set B | Total | Open test
  Duration (min)      |   22  |   20  |   42  |   60
  Speakers            |    5  |    4  |    7  |   11
  Words               |  3.6k |  3.6k |  7.2k |  10.8k
  Edited parts        |  347  |  257  |  604  |  426
  Edited-part rate (%)|  9.6  |  7.1  |  8.4  |  3.9

Table 2 Acoustic analysis conditions.
  Sampling frequency: 16 kHz
  Pre-emphasis: 0.98
  Window: 25 ms Hamming
  Frame shift: 10 ms
  Features: MFCC + ΔMFCC + ΔΔMFCC + ΔPow + ΔΔPow (38 dimensions)

4.2 Results
4.2.1 Alignment Accuracy
97.4% of the automatically obtained phone boundaries fell within 30 ms of the reference boundaries.

4.2.2 Semi-closed Condition
Figs. 4–6 show the detection results under the semi-closed condition.

4.2.3 Open Condition
Fig. 7 shows the detection results under the open condition, together with a comparison against the semi-closed condition.
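The 38-dimensional frame feature in Table 2 decomposes as 12 MFCCs plus their delta and delta-delta, and the delta and delta-delta of log power (12×3 + 2 = 38). The sketch below assembles such a vector; the symmetric two-frame difference used for the deltas is an assumption on my part (regression-based deltas over a wider window are also common), and the MFCC extraction itself is taken as given.

```python
import numpy as np


def delta(x):
    """Frame-wise delta via symmetric difference, edges clamped.
    x has shape (T, D); the result has the same shape."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0


def frame_features(mfcc, log_power):
    """mfcc: (T, 12) static MFCCs; log_power: (T,) log frame power.
    Returns the (T, 38) feature matrix of Table 2: static MFCC, its
    delta and delta-delta, and the delta and delta-delta of power."""
    p = log_power[:, None]
    return np.hstack([mfcc, delta(mfcc), delta(delta(mfcc)),
                      delta(p), delta(delta(p))])


# Shape check with dummy frames (50 frames of 12 MFCCs + power).
feats = frame_features(np.zeros((50, 12)), np.zeros(50))
```

Note that the static power term itself is not included, only its deltas, which is what makes the total 38 rather than 39 dimensions.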
Fig. 6 Detection results under the semi-closed condition.
Fig. 7 Detection results under the open condition.

4.2.4 Comparison of Feature Sets
Three feature sets were compared under the semi-closed condition:
  feature set 1:
  feature set 2:
  feature set 3:
Figs. 8 and 9 show the results; feature set 2 and feature set 3 performed comparably.

4.2.5 Overall Detection Performance
Fig. 10 shows the detection performance under the semi-closed condition, with operating points such as 60% recall at 98% precision, 80% at 97%, 88% at 97%, and 100% at 93%.
Fig. 11 shows the corresponding performance under the open condition, with reported values including 60%, 80%, 95.5%, 96%, 97.5%, 98% and 98.5% (cf. Lamel et al.2)).

Fig. 8 Comparison of feature sets (semi-closed condition).
Fig. 9 Comparison of feature sets (semi-closed condition).
Fig. 10 Detection performance (semi-closed condition).

5. Conclusion

We proposed a method that detects edited parts in inexact transcribed corpora by aligning the edited transcription with its corresponding utterance and applying an SVM-based detector to alignment-derived features. An evaluation on the Japanese National Diet Record gave a reasonable result under the speaker-closed condition.
Fig. 11 Detection performance (open condition).

References
1) G. Neubig, Y. Akita, S. Mori and T. Kawahara: Improved Statistical Models for SMT-based Speaking Style Transformation, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.5206–5209, 2010.
2) L. Lamel, J.-L. Gauvain and G. Adda: Investigating Lightly Supervised Acoustic Model Training, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.477–480, 2001.
3) B.C. Roy, S. Vosoughi and D. Roy: Automatic Estimation of Transcription Accuracy and Difficulty, in Proc. of Interspeech, pp.1902–1905, 2010.
4) 3-Q-30, pp.177–178, 1999.
5) TinySVM, http://chasen.org/~taku/software/tinysvm/
6) X. Huang, A. Acero and H. Hon: Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall, 2001.
7) W. Lo, A.M. Harrison and H. Meng: Statistical Phone Duration Modeling to Filter for Intact Utterances in a Computer-Assisted Pronunciation Training System, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.5238–5241, 2010.
8) SPOJUS++, 2010.
9) S. Nakagawa, K. Hanai, K. Yamamoto and N. Minematsu: Comparison of Syllable-Based HMMs and Triphone-Based HMMs in Japanese Speech Recognition, in Proc. of International Workshop on Automatic Speech Recognition and Understanding, pp.393–396, 1999.
10) K. Maekawa: Corpus of Spontaneous Japanese: Its Design and Evaluation, in Proc. of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR2003), pp.7–12, 2003.
11) T. Kawahara, M. Mimura and Y. Akita: Language Model Transformation Applied to Lightly Supervised Training of Acoustic Model for Congress Meetings, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.3853–3856, 2009.