December 9, 1998
RT0288
Human-Computer Interaction
19 pages

Research Report

A word-based Japanese language model

N. Itoh, M. Nishimura, S. Ogino, and K. Yamasaki

IBM Research, Tokyo Research Laboratory
IBM Japan, Ltd.
1623-14 Shimotsuruma, Yamato
Kanagawa 242-8502, Japan

Research Division
Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich

Limited Distribution Notice
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted. It has been issued as a Research Report for early dissemination of its contents. In view of the expected transfer of copyright to an outside publisher, its distribution outside IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or copies of the article legally obtained (for example, by payment of royalties).
Vol. 6 No. 1 Jan. 1999

A Word-based Japanese Language Model

Nobuyasu Itoh†, Masafumi Nishimura†, Shiho Ogino† and Kazutaka Yamasaki†

This paper deals with a word-based language model of Japanese. In Japanese, word boundaries are not stable, and grammatical units do not necessarily coincide with human intuition. For accurate segmentation it is therefore necessary to create a vocabulary set that covers human utterance units. In our word-segmentation method, a model of word boundaries is described by morphological parameters (i.e., part of speech), which are learned by comparing the results of human segmentation with those of a Japanese morphological analyzer. Then, using pseudo-random numbers and the model, we determine whether each morpheme transition is a word boundary. As a result, we automatically obtain a vocabulary set and training data for a Japanese language model. According to our experiments using articles from three newspapers and texts posted in network-based forums, about 44,000 words cover 94-98% of all words in the test data, and the average number of words per sentence is 12-19% smaller than the average number of morphemes. The parameters of the word-segmentation model and of the language model differ considerably between newspaper articles and forum texts. The difference, however, lies not in the probabilities of common events but in the kinds of events that occur. Therefore the language model created from both newspaper articles and forum texts gave satisfactory results for both test sets.
KeyWords: Speech recognition, Dictation, N-gram model, Morphological analysis

1

(,,,, 1996;,,, 1998b) HMM N N-gram 1 N-gram ( 1996;,,,, 1996) ( ) Minimum Cover Set ( 1989) 76% ( 1996)

† Tokyo Research Laboratory, IBM Japan, Ltd.
N-gram ( 1998a) N-gram

2

+ + + + + + + + + + + + ( ) ( ) ( ) $\sharp$ $\sharp$ NULL $\sharp$

$P(\sharp_i \mid \mathrm{Morpheme}_i \rightarrow \mathrm{Morpheme}_{i+1})$

Morpheme $C_1 C_2, \ldots, C_n$ $j$ $\sharp$

$P(\sharp_j \mid \mathrm{Morpheme};\ C_j \rightarrow C_{j+1})$

( ) (KoW)
(Part of Speech: PoS) (String) (KoW[PoS]; String) ( 1994) 81 119 1 6 4 ( ) 1 1

$P(\sharp \mid \text{V. infl.}[29] \rightarrow \text{Conj. p.p.}[69];\ )$

[29] [69] 2 1 3 V. infl. Conj. p.p.

(Verb[8]) + (V. infl.[30]) + (Conj. p.p.[69])

1...

1 A 17 PoS (Part of Speech) ( 1994)
2 V. infl.: Verb inflection; Conj. p.p.: Conjunctive post-positional particle
3
Figure 1 (caption text not recovered). Labels recovered from the figure, in their original order:
  Morpheme level segmentation
    Kind of Word
    KoW transition
    KoW transition with Part-of-Speech
    KoW transition with Part-of-Speech and the string of the following word
  Character level segmentation
  Example entries: P(# | * → Conj. p.p.); P(# | V. infl. → Conj. p.p.); P(# | V. infl.[29] → Conj. p.p.[69]); P(# | P.noun → Conj. p.p.); P(# | V. infl.[29] → Conj. p.p.[69], ); P(# | Interrogative P.p.[17], , )
  Axis labels: Detailed; Applied order

+ ( ) 2
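The levels listed above, from the KoW transition with part of speech and following string down to the kind of word alone, can be read as a back-off lookup whose result is then compared with a pseudo-random number, as the abstract describes. The following Python sketch illustrates that reading only. The class `BoundaryModel`, its table names, the `default` fallback value, and the romanized example strings are all hypothetical and do not come from the paper, and the character-level model is omitted.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical morpheme record: (surface string, kind of word, PoS number).
Morpheme = Tuple[str, str, int]


class BoundaryModel:
    """Back-off tables for P(# | transition). Table names are illustrative."""

    def __init__(self, default: float = 0.5) -> None:
        # Most detailed level: KoW[PoS] -> KoW[PoS] plus the following string.
        self.with_string: Dict[Tuple[str, int, str, int, str], float] = {}
        # KoW[PoS] -> KoW[PoS].
        self.with_pos: Dict[Tuple[str, int, str, int], float] = {}
        # KoW -> KoW.
        self.kow_pair: Dict[Tuple[str, str], float] = {}
        # Kind of word of the following morpheme only.
        self.kow_right: Dict[str, float] = {}
        # Assumed fallback when no level matches (not from the paper).
        self.default = default

    def prob(self, left: Morpheme, right: Morpheme) -> float:
        """Return P(#) from the most detailed level that has an entry."""
        _, kow1, pos1 = left
        string2, kow2, pos2 = right
        levels = [
            (self.with_string, (kow1, pos1, kow2, pos2, string2)),
            (self.with_pos, (kow1, pos1, kow2, pos2)),
            (self.kow_pair, (kow1, kow2)),
            (self.kow_right, kow2),
        ]
        for table, key in levels:
            if key in table:
                return table[key]
        return self.default


def segment(morphemes: Sequence[Morpheme], model: BoundaryModel,
            rng: random.Random) -> List[str]:
    """Join morphemes into words, drawing one random number per transition."""
    if not morphemes:
        return []
    words = [morphemes[0][0]]
    for left, right in zip(morphemes, morphemes[1:]):
        if rng.random() < model.prob(left, right):
            words.append(right[0])   # boundary: start a new word
        else:
            words[-1] += right[0]    # no boundary: extend the current word
    return words
```

With a probability of 0.0 stored for a Verb[8] to V. infl.[30] transition and 1.0 for any transition into a noun, the romanized sequence `hashi` + `tta` + `eki` comes out as the two words `hashitta` and `eki`. Intermediate probabilities make the outcome depend on the random draws, so the same transition can be segmented differently at different places in the corpus.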
3 3.1 (,, 1996) ( 1994) : : ( 1996) ( 1998)
3.2 2 3 25,000 1 2 2 1 3 1 1 3 4 4.1 17 5 3 2 ( 26,000 ) ( 9,500 4 ( ) 2,829 4 1. 2.
Table 1 ( )

  [19] → [19];    0.33
  [19] → [19];    0.71
  [19] → [18];    0.36
  [29] → [69];    0.03
  [19] → [77];    1.0

Figure 2 (caption text not recovered). Axis labels: Newspaper article; Forum.

2,269 1 ( ) 5 2 2 3 6

1. [69] → [31]
2. [62] → [73]
3. [48] → [73]

1. + +,
2. ... +,
3. ... +

5 50
6 String
( [13] → [100] i.e.... + ) ( [19] → [19]; ) 1,607 0.980 ( ) (1) (2) 0 1 (3) N-gram ( ) N-gram

4.2

3 ( 446,079 ) (,, 1995) 97% 3 3 10 7 216,904 132,164 25,000 ( ) 95% 2
Figure 3 (caption text not recovered). Axes: Coverage (%) versus Number of tokens.

3 60%

Table 2 ( ) (%)

  $P(\sharp \mid KoW_1[PoS_1] \rightarrow KoW_2[PoS_2];\ String)$    59.6
  $P(\sharp \mid KoW_1[PoS_1] \rightarrow KoW_2[PoS_2])$             29.2
  $P(\sharp \mid KoW_1 \rightarrow KoW_2)$                           3.9
  $P(\sharp \mid KoW_2)$                                             6.6
  ( )                                                                0.7
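The coverage curve relates vocabulary size to the percentage of running words that the vocabulary accounts for. As a minimal sketch of how such a figure could be computed, the Python below ranks word types by training frequency and measures the share of test tokens that fall inside the top `vocab_size` types. Building the vocabulary by frequency rank and measuring on separate test data are assumptions made for illustration, and the function name `coverage` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def coverage(train_tokens: Iterable[str], test_tokens: Iterable[str],
             vocab_size: int) -> float:
    """Percentage of test tokens covered by the vocab_size most frequent
    training word types (assumed vocabulary construction)."""
    counts = Counter(train_tokens)
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    test = list(test_tokens)
    if not test:
        return 0.0
    covered = sum(1 for word in test if word in vocab)
    return 100.0 * covered / len(test)
```

Calling this function for a range of `vocab_size` values yields one point per vocabulary size, which is the shape of data a coverage plot needs.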
5 5.1 93 96 92 10 97 91 92 EDR (EDR 1995) ( 1998a) 7 90 ( :-) 5.2 EDR 95% 44,000 (44K 7 ID
) 11 5.3 44K ( 1996) ( 1996) 3 6 N-gram
Table 3 ( ) (K: 1,000; M: 1,000,000)

               (K)      (M)
               715     20.9
             1,837     49.4
             1,401     41.4
  EDR          169      4.4
             1,565     33.6

Table 4 ( ) (%)

  Sentences   Morphemes    Words   Coverage (%)   Morphemes/sentence   Words/sentence
        600      21,378   18,725           98.3                 35.6             31.2
        725      22,051   18,608           96.1                 30.4             25.7
        775      21,702   17,751           96.0                 28.0             22.9
      1,381      29,979   24,204           94.4                 21.7             17.5

36 44K 1,800 N-gram 3 ( 44K 4 12-19% N-gram N-gram ( : N-1,..,8 F-1,...,8 95% 5% N-gram Held-out (N-1,..,8)
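The 12-19% figure can be checked against the four rows of test-set counts above. The Python sketch below does that arithmetic, assuming that the first three numbers of each row are the sentence count, the morpheme count, and the word count. That reading is inferred from the per-sentence averages printed in the same rows, not from the lost column headers, and the names `TEST_SETS`, `per_sentence`, and `reduction_percent` are hypothetical.

```python
from typing import List, Tuple

# (sentences, morphemes, words) for the four test sets, copied from the table.
# The column interpretation is an assumption; see the lead-in.
TEST_SETS: List[Tuple[int, int, int]] = [
    (600, 21378, 18725),
    (725, 22051, 18608),
    (775, 21702, 17751),
    (1381, 29979, 24204),
]


def per_sentence(total: int, sentences: int) -> float:
    """Average number of units (morphemes or words) per sentence."""
    return total / sentences


def reduction_percent(morphemes: int, words: int) -> float:
    """How much smaller the word count is than the morpheme count, in percent.
    The sentence count cancels out, so totals can be compared directly."""
    return 100.0 * (1.0 - words / morphemes)


REDUCTIONS = [reduction_percent(m, w) for _, m, w in TEST_SETS]
```

Under this reading the four reductions come out at roughly 12%, 16%, 18%, and 19%, in line with the range quoted in the abstract.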
Figure 4. N-gram (trigram
Legend: Nikkei, Mainichi, Sankei, Forum. Axis label: Training data size (million words).

4 Forum 7 8 1-2% 8 (100-170) ( 1998b) 400 8
Figure 5. (F-1,..,8)
Labels recovered from the figure, in their original order: Nikkei; Mainichi; Sankei; Forum; created from forum corpus; Forum text data size (million words); mixed with news corpus.

5 ( 25M ) 9 152.1 N-gram N-gram N-gram 6 (N-1,..,8) bigram trigram N-1,...,8 F-1,...,8 N-gram trigram 31M bigram 5.6M trigram N-gram N-gram 9
Figure 6. N-gram ( )
Legend: Trigram, Bigram. Axis label: Training data size (million words).

7 N-gram trigram N-1,...,8 F-1,...,8 trigram trigram 5M 1/3 1/5 10 7 44K 94-98% 10 1,...,8 1 trigram 31M 9M
Figure 7. trigram
Legend: Nikkei, Mainichi, Sankei, Forum. Axis label: No. of trigrams (million).

12-19% N-gram N-gram ( ( 1998b) 400 ( 1989)
2 ( 1996) ( 1996) CD- 91-95 ( )

EDR (1995).
(1989). DPHI22-3.
(1996). 3-3-10, pp. 105–106.
(1996). pp. 19–26.
(1994). 35 (7), 1293–1299.
(1996). J79-D-II (12), 2125–2131.
(1995). 51, 3R-7, pp. 117–118.
(1998a). J81-D-II (1), 10–17.
(1998b). SLP20-3, pp. 17–24.
(1998). pp. 122–135.
(1996). J79-D-II (12), 2078–2085.
(1996). "bigram.", 3 (4), 129–139.

: 1982 1984
: 1981 3 1983 3 10
: 1986 1988
: 1988 1990 1993

(1998 4 1 ) (1998 7 ) (1998 8 )