Japanese Word Segmentation by AdaBoost Using the Decision List as the Weak Learner

Hiroyuki Shinnou (Department of Systems Engineering, Faculty of Engineering, Ibaraki University)

In this paper, we propose a new method of Japanese word segmentation: AdaBoost using the decision list as the weak learner. Word segmentation is regarded as the classification problem of judging whether or not a word boundary exists between two characters. By solving this problem with the decision list method, we can conduct Japanese word segmentation. Our method has the advantage of not suffering from the unknown word problem, because we do not use dictionary information as an attribute of the decision list. Moreover, this approach allows us to use AdaBoost, which has recently been actively researched in the machine learning domain. AdaBoost improves the precision of the decision list. In experiments, we built the decision list from the Kyoto University Corpus (about 40K sentences). The precision of this decision list was 97.52%, much higher than the 92.76% precision of a character-based tri-gram model. AdaBoost improved the precision to 98.49%. Furthermore, our word segmentation system was excellent at detecting unknown words.

Keywords: word segmentation, classification problem, decision list, AdaBoost
Vol. 8 No. 2 Apr. 2001

1 Introduction

Japanese word segmentation is usually carried out by dictionary-based morphological analysis, typically modeled with an HMM (Hidden Markov Model) or a character-based tri-gram model. Because such methods rely on a dictionary, they suffer from the unknown word problem (Tsuji and Kageura 1997). In this paper, we instead regard word segmentation as the classification problem of judging, for every pair of adjacent characters, whether a word boundary exists between them (+1) or not (-1). We solve this classification problem with the decision list method (Yarowsky 1994), using no dictionary information among the attributes, and we improve the precision of the learned decision list with AdaBoost (Freund and Schapire 1997).
2 Word Segmentation as a Classification Problem

2.1 Setting up the problem

Let a sentence of n characters be s = c_1 c_2 ... c_n. For each pair of adjacent characters c_i and c_{i+1}, define the class b_i (i = 1, 2, ..., n-1) to be +1 if a word boundary exists between c_i and c_{i+1}, and -1 otherwise. Word segmentation then reduces to assigning the class b_i to every position i, as illustrated in Figure 1.

[Figure 1: Word segmentation by class assignment]
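The reduction of segmentation to per-boundary classes can be sketched in code. This is a minimal illustration; the function name and the list-of-words input format are ours, not the paper's:

```python
def boundary_labels(words):
    """Given a segmented sentence as a list of words, return the classes
    b_1 .. b_{n-1}: +1 where a word boundary lies between characters
    c_i and c_{i+1}, and -1 elsewhere."""
    chars = "".join(words)
    boundaries, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        boundaries.add(pos)        # a word ends after character c_pos
    return [1 if i in boundaries else -1 for i in range(1, len(chars))]

# "私 | は | 学生 | です" -> b = [+1, +1, -1, +1, -1]
print(boundary_labels(["私", "は", "学生", "です"]))
```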
2.2 Learning the decision list

The decision list is learned from a word-segmented corpus in the following steps:

step 1: Extract a training example from every position i of the corpus: the values of the attributes att_1, att_2, ..., att_n at that position, together with its class b_i.

step 2: For every pair (att, a) of an attribute att and one of its values a, and for every class C, count the frequency f_C of the combination ((att, a), C) among the training examples.

step 3: For each pair (att, a), let Ĉ be the class with the largest frequency, and give the rule "(att, a) → Ĉ" the weight

  pw((att, a)) = log ( f_Ĉ / Σ_{C ≠ Ĉ} f_C )

step 4: Sort the rules in descending order of their weights. The sorted list of rules is the decision list.

2.3 Attributes

For the class b_i we use seven attributes drawn from the characters around the boundary, shown in Table 1.

Table 1: Setting attributes
  att_1: c_{i-1} c_i c_{i+1}
  att_2: c_i c_{i+1} c_{i+2}
  att_3: c_{i-1} c_i
  att_4: c_i c_{i+1}
  att_5: c_{i+1} c_{i+2}
  att_6: the pair of character types of c_i and c_{i+1} under classification 1
  att_7: the pair of character types of c_i and c_{i+1} under classification 2

The two classifications of character types are given in Table 2.
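The attribute extraction of Table 1 can be sketched as follows. The `char_type` function is a hypothetical stand-in for the paper's two character-type classifications (the exact type inventory of Table 2 is not reproduced here), and the `#` padding at the sentence edges is also our assumption:

```python
def char_type(c):
    # Hypothetical character-type classification standing in for Table 2;
    # the paper uses two such classifications, not reproduced here.
    if "\u3040" <= c <= "\u309f":
        return "hiragana"
    if "\u30a0" <= c <= "\u30ff":
        return "katakana"
    if "\u4e00" <= c <= "\u9fff":
        return "kanji"
    if c.isascii() and c.isalnum():
        return "alnum"
    return "default"

def attributes(chars, i, pad="#"):
    """Attribute values att_1 .. att_7 for the boundary b_i between the
    1-based characters c_i and c_{i+1}; pad fills positions outside the
    sentence (our assumption)."""
    def c(j):
        return chars[j - 1] if 1 <= j <= len(chars) else pad
    def t(j):
        return char_type(c(j))
    return {
        "att1": c(i - 1) + c(i) + c(i + 1),
        "att2": c(i) + c(i + 1) + c(i + 2),
        "att3": c(i - 1) + c(i),
        "att4": c(i) + c(i + 1),
        "att5": c(i + 1) + c(i + 2),
        "att6": (t(i), t(i + 1)),   # character-type pair, classification 1
        "att7": (t(i), t(i + 1)),   # ditto, classification 2 (same stand-in here)
    }
```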
[Table 2: Classification of character types — two classifications; characters not listed in a classification fall into its default type]

2.4 Classification by the decision list

To judge the class b_i of a position, we extract the seven attribute values of that position and scan the decision list from the top; the class of the first rule whose (att, a) pair matches the position is assigned to b_i. For example, to judge the class b_5 of the boundary between c_5 and c_6, we form the seven pairs (att_1, ·), (att_2, ·), ..., (att_7, ·) from the surrounding characters and look each of them up in the decision list.
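Steps 1 through 4 and the top-down lookup can be sketched together. The 0.5 smoothing added to each count and the default class when no rule matches are our assumptions, introduced so that the weight is always defined:

```python
import math
from collections import Counter

def learn_decision_list(examples):
    """examples: list of (attribute-value dict, class) pairs with class
    in {+1, -1}.  Implements steps 1-4 of the text; the 0.5 smoothing
    is our assumption so that no frequency is zero."""
    freq = Counter()
    for attrs, cls in examples:
        for att, a in attrs.items():
            freq[((att, a), cls)] += 1          # step 2: count ((att, a), C)
    rules = []
    for pair in {p for (p, _c) in freq}:
        f_pos = freq[(pair, 1)] + 0.5
        f_neg = freq[(pair, -1)] + 0.5
        c_hat = 1 if f_pos >= f_neg else -1     # step 3: majority class
        weight = math.log(max(f_pos, f_neg) / min(f_pos, f_neg))
        rules.append((pair, c_hat, weight))
    rules.sort(key=lambda r: -r[2])             # step 4: sort by weight
    return rules

def classify(rules, attrs, default=1):
    """Assign the class of the first (highest-weight) matching rule;
    the default class is our assumption."""
    pairs = set(attrs.items())
    for pair, c_hat, _w in rules:
        if pair in pairs:
            return c_hat
    return default
```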
Table 3 shows the matching rules and their weights for this example.

Table 3: Example of class judgment
  rule         class   weight
  (att_3, ·)    +1     2.74377
  (att_4, ·)    +1     5.83188
  (att_5, ·)    +1     1.64565
  (att_6, ·)    +1     6.33293
  (att_7, ·)    +1     8.64488

The matching rule with the largest weight is (att_7, ·) → +1, so b_5 is judged to be +1.

3 Boosting the Decision List

Boosting is a method of combining many weak hypotheses into a single accurate hypothesis. Suppose we are given m training examples (x_1, y_1), (x_2, y_2), ..., (x_m, y_m), where x_i is an example and y_i ∈ {+1, -1} is its class. First, a weak hypothesis h_1 is learned from the examples. The examples that h_1 misclassifies are then given larger weights, and a second weak hypothesis h_2 is learned from the re-weighted examples. Repeating this procedure T times yields weak hypotheses h_1, h_2, ..., h_T, and the final hypothesis classifies a new example x by a weighted vote over h_1(x), h_2(x), ..., h_T(x). For instance, with T = 3, the final class of x is the sign of the weighted sum of the three votes h_1(x), h_2(x), and h_3(x), even when the individual hypotheses disagree.

AdaBoost (Freund and Schapire 1997) is a concrete realization of this idea, in which the weight update is driven by the error ε_t of each weak hypothesis. The algorithm is shown in Figure 2.

Figure 2: AdaBoost
  Given: (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X, y_i ∈ Y = {+1, -1}
  Initialize D_1(i) = 1/m
  For t = 1, ..., T:
    Train the weak learner using distribution D_t
    Get a weak hypothesis h_t : X → Y with error ε_t = Pr_{i ~ D_t}[h_t(x_i) ≠ y_i]
    Choose α_t = (1/2) ln((1 - ε_t) / ε_t)
    Update:
      D_{t+1}(i) = (D_t(i) / Z_t) × { e^{-α_t}  if h_t(x_i) = y_i
                                      e^{α_t}   if h_t(x_i) ≠ y_i }
      where Z_t is a normalization factor
  Output the final hypothesis: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))
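Figure 2 translates directly into code. In this sketch the weak learner is a simple threshold stump on scalar examples, standing in for the paper's decision list; the small clamping of ε_t is our assumption, added to keep α_t finite:

```python
import math

def adaboost(X, y, weak_learn, T):
    """AdaBoost as in Figure 2.  X: list of examples, y: classes in
    {+1, -1}.  weak_learn(X, y, D) returns a hypothesis h: x -> {+1, -1}
    trained under the example distribution D."""
    m = len(X)
    D = [1.0 / m] * m                          # D_1(i) = 1/m
    hyps = []                                  # (alpha_t, h_t) pairs
    for _ in range(T):
        h = weak_learn(X, y, D)
        eps = sum(D[i] for i in range(m) if h(X[i]) != y[i])
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # clamp: avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        hyps.append((alpha, h))
        # e^{-alpha * y_i * h(x_i)} equals the piecewise update of
        # Figure 2, since y and h take values in {+1, -1}.
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                             # normalization factor Z_t
        D = [d / Z for d in D]
    def H(x):                                  # H(x) = sign(sum alpha_t h_t(x))
        return 1 if sum(a * h(x) for a, h in hyps) >= 0 else -1
    return H

def stump_learn(X, y, D):
    """A minimal weak learner (threshold stump on scalar examples),
    a stand-in for the paper's decision list."""
    best = (float("inf"), None, None)
    for thr in set(X):
        for sign in (1, -1):
            err = sum(D[i] for i in range(len(X))
                      if (sign if X[i] >= thr else -sign) != y[i])
            if err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x, thr=thr, sign=sign: sign if x >= thr else -sign
```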
To use the decision list as the weak learner of AdaBoost, we only need to change step 2 of the learning procedure in Section 2.2: instead of the raw frequency f_C of a combination ((att, a), C), we use the sum of the current weights D_t(i) of the training examples that contain that combination. Suppose, for example, that the current decision list h_k misclassifies the position b_4. Each of the seven training combinations ((att_1, ·), b_4), ..., ((att_7, ·), b_4) extracted from that position then receives a larger weight, and the next decision list h_{k+1} is learned from the re-weighted frequencies of step 2.

4 Experiments

4.1 Comparison with the tri-gram model

As a baseline, we use a character-based n-gram model (here a tri-gram model), which segments a sentence by searching for the most probable sequence of word boundaries with the Viterbi algorithm, in the same way as an HMM.
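This baseline search can be illustrated with a toy version. The paper uses the Viterbi algorithm; for clarity, this sketch enumerates every boundary assignment of a short sentence instead, and the `logp` interface, the `|` boundary symbol, and the `#` edge padding are all our assumptions:

```python
from itertools import product

def best_segmentation(chars, logp):
    """Choose the word boundaries that maximize the character tri-gram
    log-probability of the sentence with boundaries inserted as "|".
    logp(w, u, v) returns log P(v | w, u) over characters plus the
    boundary symbol.  Exhaustive search; only feasible for short input."""
    n = len(chars)
    best, best_score = chars, float("-inf")
    for mask in product([False, True], repeat=n - 1):
        seq = []
        for i, c in enumerate(chars):
            seq.append(c)
            if i < n - 1 and mask[i]:
                seq.append("|")
        padded = ["#", "#"] + seq + ["#"]      # "#" marks the sentence edges
        score = sum(logp(padded[k - 2], padded[k - 1], padded[k])
                    for k in range(2, len(padded)))
        if score > best_score:
            best, best_score = "".join(seq), score
    return best

def toy_logp(w, u, v):
    # A hand-made model for illustration only: it likes a boundary
    # right after "b" and dislikes boundaries anywhere else.
    if (u, v) == ("b", "|") or (w, u) == ("b", "|"):
        return 0.0
    return -10.0 if v == "|" else -1.0

print(best_segmentation("abcd", toy_logp))   # -> ab|cd
```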
From the Kyoto University Corpus, we used the 1,234 sentences of the file 950117.KNP as test data and the remaining 35,717 sentences as training data. The test sentences contain 56,411 character boundaries to be judged.

The tri-gram model was built from the training data with the CMU-Cambridge Toolkit (2), using Witten-Bell discounting to avoid zero frequencies (1999). It judged 52,328 of the 56,411 boundaries correctly and 4,083 incorrectly, a precision of 92.76%.

The decision list learned from the same training data with the 7 attributes consists of 136,114 rules. It judged 55,015 boundaries correctly and erred on 1,396, a precision of 97.52%, which is much higher than the 92.76% of the tri-gram model.

4.2 Effect of boosting

Figure 3 shows the precision of our system after each boosting round. After three rounds, the system judged 55,560 of the 56,411 boundaries correctly and erred on 851, a precision of 98.49%.

4.3 Detection of unknown words

(2) CMU-Cambridge Toolkit: http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
[Figure 3: Precision by boosting]

The training data contain 914,392 word tokens of 41,890 types, and the test data contain 32,764 word tokens of 6,479 types. Among the test words, 1,024 tokens (832 types) do not appear in the training data; these are the unknown words. Our system correctly detected 688 of the 1,024 unknown word tokens (67.2%) and 562 of the 832 types (67.5%).

We classified the unknown words into nine categories. The number of tokens and types in each category is as follows:

(1) 124 tokens, 123 types
(2) 94 tokens, 91 types
(3) 44 tokens, 41 types
(4) 7 tokens, 3 types
(5) 210 tokens, 156 types
(6) 38 tokens, 32 types
(7) 21 tokens, 17 types
(8) 426 tokens, 310 types
(9) 64 tokens, 59 types

Table 4 shows, for each category, how many of the unknown word tokens our system and the tri-gram model detected.
Table 4: Detection of unknown words
  category   tokens   our system      tri-gram model
  (1)          124      101             124
  (2)           94       57               0
  (3)           44       40              44
  (4)            7        5               7
  (5)          210      188             210
  (6)           38       19               0
  (7)           21        4              21
  (8)          426      246               0
  (9)           64       28               0
  total      1,024      688 (67.2%)    406 (39.6%)

As Table 4 shows, the tri-gram model detected 39.6% of the unknown word tokens while our system detected 67.2%. Restricted to the categories (1), (3), (4), (5) and (7), our system's detection rate rises to 83.3%.

5 Discussion

The final hypothesis of AdaBoost attaches to each judgment an identification strength, the absolute value of the weighted sum Σ_t α_t h_t(x). Figure 4 plots the precision of the 56,411 test judgments against this strength; for example, the 83 judgments with strength between 0 and 1 reach a precision of only 57.83%, so low-strength judgments can be singled out as unreliable.

Other classification learners, such as decision trees (Quinlan 1993) or maximum entropy models (Ratnaparkhi 1998), could be used in place of the decision list. The 7 attributes could also be enriched, for example with further bi-gram and tri-gram attributes of the surrounding characters.
[Figure 4: Identification strength and precision]
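The identification strength of a boosted judgment is simply the magnitude of the weighted vote, and can be used to reject unreliable judgments. A sketch; the function names and the rejection interface are ours:

```python
def strength(hyps, x):
    """Identification strength of the boosted judgment on x: the
    absolute value of sum_t alpha_t * h_t(x), where hyps is a list of
    (alpha_t, h_t) pairs."""
    return abs(sum(alpha * h(x) for alpha, h in hyps))

def classify_with_reject(hyps, x, threshold):
    """Judge x, but return None when the identification strength falls
    below the threshold (cf. Figure 4: low-strength judgments are
    markedly less precise)."""
    s = sum(alpha * h(x) for alpha, h in hyps)
    if abs(s) < threshold:
        return None
    return 1 if s >= 0 else -1
```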
6 Conclusion

In this paper, we proposed a method of Japanese word segmentation that regards segmentation as the problem of classifying whether a word boundary exists between two adjacent characters, solves this problem with a decision list (Shinnou 2000), and improves the decision list with AdaBoost. In experiments on the Kyoto University Corpus, the decision list achieved a precision of 97.52%, against 92.76% for a character-based tri-gram model, and AdaBoost raised the precision to 98.49%. The system was also effective at detecting unknown words. Enriching the attributes and comparing the n-gram model and other weak learners in more detail are left as future work.
References

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119-139.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania.
Shinnou, H. (2000). Deterministic Japanese Word Segmentation by Decision List Method. In PRICAI-2000 (poster session), p. 822.
Tsuji, K. and Kageura, K. (1997). An HMM-based Method for Segmenting Japanese Terms and Keywords based on Domain-Specific Bilingual Corpora. In The 4th Natural Language Processing Pacific Rim Symposium, pp. 557-560.
Yarowsky, D. (1994). Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In 32nd Annual Meeting of the Association for Computational Linguistics, pp. 88-95.
(1999). Suffix array. NL-131-7.
(1998). PPM. NL-128-2.
(1999). 6 (7), 93-108.
(1999).
(2000). NL-140-1.
(1998). pp. 524-527.
(1997). pp. 421-424.
(Received August 26, 2000)
(Revised October 6, 2000)
(Accepted January 12, 2001)