Vol. 2 No. 1 145-155 (Feb. 2009)

Generating Diverse Katakana Variants via Backward-Forward Transliteration for Information Retrieval

Hiroyuki Hattori,1 Kazuhiro Seki2 and Kuniaki Uehara3

In Japanese, it is quite common for the same word to be written in multiple ways. This is especially true of katakana words, which are typically used for transliterating words from foreign languages. For example, Los Angeles can be written in katakana as ロサンジェルス (rosanjerusu), ロサンゼルス (rosanzerusu), ロスアンジェルス (rosuanjerusu), or ロスアンゼルス (rosuanzerusu), all of which are considered legitimate. This ambiguity becomes a critical problem for automatic processing such as information retrieval. To tackle this problem, we propose a simple but effective approach for generating katakana variants of a given katakana word, based on a phonemic representation of the word in its original language. The proposed approach is first evaluated through a manual assessment of the variants it generates. We then show that the approach is beneficial for information retrieval when applied to query replacement, retrieving a large number of potentially relevant documents.

1.

1) Y Los Angeles

1 Google, Inc.
2 Organization of Advanced Science and Technology, Kobe University
3 Graduate School of Engineering, Kobe University

© 2009 Information Processing Society of Japan
2) Yahoo! Japan 1 Alisha Keys Alicia Keys 2 3 4 5

2.

2 3),4) 5),6) OR 1

7) 258 17.6% 100 47 Masuyama 8) Masuyama 1 682 98.6% 86.3% 9)

1 http://search.yahoo.co.jp/
3.

3.1

1 10) /æ/ /a/ /e/ Chandler /tʃændlə/ 4 (1) (2) (3) (4)

3.2

Knight 9) 1 1 2 2 Gregory 11) Knight diteeru

Table 1 Katakana characters and their phonetic representations.

a     ta    ma    gi    bi
i     chi   mi    gu    bu
u     tsu   mu    ge    be
e     te    me    go    bo
o     to    mo    za    pa
ka    na    ya    ji    pi
ki    ni    yu    zu    pu
ku    nu    yo    ze    pe
ke    ne    ra    zo    po
ko    no    ri    da    n
sa    ha    ru    ji    v
shi   hi    re    zu
su    hu    ro    de
se    he    wa    do
so    ho    ga    ba

Table 2 Compound katakana characters and their phonetic representations.

di    tsi   chyo  pyu
du    tse   nya   pyo
ti    tso   nyu   gya
tu    she   nyo   gyu
si    je    hya   gyo
wi    che   hyu   jya
we    kya   hyo   jyu
wo    kyu   mya   jyo
va    kyo   myu   dya
vi    shya  myo   dyu
ve    shyu  rya   dyo
vo    shyo  ryu   bya
vyu   chya  ryo   byu
tsa   chyu  pya   byo

Knight 3 diteeru ru 1 L
Table 3 A fragment of English-Japanese phonemic mappings. (Knight 9))

e     j     P(j|e)
D     d     0.535
      do    0.329
ER    aa    0.719
      a     0.081
      ar    0.063
      er    0.042
EY    ee    0.641
      a     0.122
      e     0.114
IH    i     0.908
L     r     0.621
      ru    0.362
T     t     0.463
      to    0.305
      tto   0.103
UH    u     0.794
      uu    0.098

Fig. 1 Possible partitions for diteeru.

r u 1 L UH Knight 3 1 1 1 5 φ φ 1 2

3.3

Fig. 2 Possible English phoneme sequences for diteeru.

noisy channel model 3.2 a ER EY 1 d-i-t-ee-ru 2 1 1 e ee 2 AH EH EY IY

Given $J = j_1 \ldots j_n$ (with elements $j_i$) and $E = e_1 \ldots e_n$ (with elements $e_i$), the estimate $\hat{E}$ maximizes $P(E \mid J)$:

$$\hat{E} = \arg\max_{E} P(E \mid J) = \arg\max_{E} P(J \mid E)\,P(E) \tag{1}$$
$$P(J \mid E)\,P(E) = \prod_{i} P(j_i \mid e_i)\,P(e_i \mid e_{i-1}) \tag{2}$$

$P(e_1 \mid e_0) = P(e_1)$ Knight 9) (2) $e_i$ 12) 1 2

$P(j_i \mid e_i)$ 8,000 Knight 9) $P(e_i \mid e_{i-1})$ CMU 1 127,000 1,571 = CMU 39 2 J (2) $\hat{E}$ D-IH-T-EY-L

3.4

$\hat{E}$ J J K J K 1 2

3.5

K K 2 1 $\hat{E}$ J K

$$P(K) = \prod_{i} P(j_i \mid \hat{e}_i)$$

1 P(K) n 2 (2) 1 Knight n 3 EDICT 2 13,124 K 4

Table 4 Examples of katakana variants generated for

10 6

        P(K) = \prod_i P(j_i | ê_i)
329     0.000002
195     0.000017
86      0.000003
36      0.000003

1 K K Yahoo! API K K K 1 2 1 K 4 46 12

4.

4.1

4.1.1

Infoseek 3 Yahoo!

1 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
2 http://www.csse.monash.edu.au/ jwb/jedict.html
3 http://dictionary.www.infoseek.co.jp
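The backward and forward steps of Section 3 can be illustrated with a minimal sketch. It uses only the fragment of P(j|e) shown in Table 3, and it replaces the phoneme bigram model P(e_i | e_{i-1}) of Eq. (2) with a uniform distribution over 39 phonemes, which is purely an assumption made so that the example is self-contained. The function names `bigram`, `decode`, and `variants` are hypothetical and do not come from the paper.

```python
from functools import lru_cache
from itertools import product
from math import prod

# Fragment of P(j | e) copied from Table 3 (English phoneme -> Japanese sound).
P_J_GIVEN_E = {
    "D": {"d": 0.535, "do": 0.329},
    "ER": {"aa": 0.719, "a": 0.081, "ar": 0.063, "er": 0.042},
    "EY": {"ee": 0.641, "a": 0.122, "e": 0.114},
    "IH": {"i": 0.908},
    "L": {"r": 0.621, "ru": 0.362},
    "T": {"t": 0.463, "to": 0.305, "tto": 0.103},
    "UH": {"u": 0.794, "uu": 0.098},
}

N_PHONEMES = 39  # size of the phoneme inventory; used only by the stand-in below


def bigram(e, prev):
    """ASSUMPTION: uniform stand-in for P(e_i | e_{i-1}); not the trained model."""
    return 1.0 / N_PHONEMES


def decode(romaji):
    """Backward step: return (score, phonemes) maximising Eq. (2) for `romaji`."""

    @lru_cache(maxsize=None)
    def best(pos, prev):
        if pos == len(romaji):
            return 1.0, ()
        top = (0.0, ())
        for e, sounds in P_J_GIVEN_E.items():
            for j, p in sounds.items():
                if romaji.startswith(j, pos):
                    tail_score, tail = best(pos + len(j), e)
                    score = p * bigram(e, prev) * tail_score
                    if score > top[0]:
                        top = (score, (e,) + tail)
        return top

    score, phonemes = best(0, None)
    return score, list(phonemes)


def variants(phonemes):
    """Forward step: all sound sequences K, ranked by prod_i P(j_i | e_i)."""
    choices = [P_J_GIVEN_E[e].items() for e in phonemes]
    scored = [
        ("-".join(j for j, _ in combo), prod(p for _, p in combo))
        for combo in product(*choices)
    ]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)


score, e_hat = decode("diteeru")
print("-".join(e_hat))  # D-IH-T-EY-L
for k, p in variants(e_hat)[:5]:
    print(k, f"{p:.4f}")
```

With this fragment, the highest-scoring forward candidate ends in a bare consonant (d-i-t-ee-r), so a practical implementation would presumably also need to check that each candidate can be written with the katakana units of Tables 1 and 2. That filter is not part of the sketch.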
1 25 7) 17

4.1.2

5 100 100 25 18.56% 13.98% 2 32.54% 7) 17.6% 4.2

4.1.3

5

Table 5 Individual results for quality judgment of generated katakana variants.

12        17.13    12.50    29.63
11        14.14    16.67    30.81
1         11.11    33.33    44.44
13        14.96     9.83    24.79
8         18.06    12.50    30.56
23         8.70     8.45    17.15
6         29.63    12.04    41.67
21         5.03     5.03    10.05
13         6.84     6.41    13.25
12        16.67    11.11    27.78
2         41.67     8.33    50.00
3         27.78     9.26    37.04
9         16.67    20.37    37.04
32         8.51     6.25    14.76
22         5.81    11.36    17.17
7         23.81    11.90    35.71
20        12.78    14.72    27.50
0
9         19.14    16.05    35.19
13        26.92    16.24    43.16
1         72.22    27.78   100.00
29         1.34     5.17     6.51
15        26.30    25.19    51.48
7         11.90    14.29    26.19
4          8.33    20.83    29.17
Average   18.56    13.98    32.54

2 25 293 1 195 66.6% 174 89.2%

1 http://dic.yahoo.co.jp/newword/
2 Google (http://google.com), April 14, 2008
4.2.1 174

4.2

4.2.1

NTCIR-3 Web 13) NTCIR-3 DM2&RL1 DM2&RL1 H A 47 26 3 1 14) tfidf 15) Base Base Phone EDICT 750 5 118 Rule Rule Yahoo! API 3.5

4.2.2

3 3 Base Phone Rule 1,000 4 R-P R-P Precision Recall

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

4 Base Phone Rule 0 0.1 Rule (3) N HO

Fig. 3 Twenty-six katakana queries from NTCIR-3.

Fig. 4 R-P curves for NTCIR dataset.
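The precision and recall definitions above can be made concrete with a short sketch that scores a ranked result list at every cutoff, which yields the raw points of an R-P curve. The function names and the toy document identifiers are hypothetical, and the sketch does not reproduce the NTCIR evaluation tooling or any interpolation it may apply.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant documents that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


def rp_points(ranking, relevant):
    """(precision, recall) after each rank cutoff k = 1..len(ranking)."""
    return [precision_recall(ranking[:k], relevant)
            for k in range(1, len(ranking) + 1)]


# Toy example: four retrieved documents, three relevant ones (one never retrieved).
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
for k, (p, r) in enumerate(rp_points(ranking, relevant), start=1):
    print(k, f"P={p:.2f}", f"R={r:.2f}")
```

At the full cutoff of four documents this toy ranking has TP = 2, FP = 2, and FN = 1, giving a precision of 0.5 and a recall of 2/3.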
Rule Base Phone

4.2.3

Rule 6 Rule Phone 1,000 0 1 6 1,000

Table 6 Precision at top 1,000 retrieved documents by query replacement.

Rule Phone
0.0000 0.0050 0.0000 0.0000 0.0000 0.0020 0.0000 0.0000 0.0000 0.0010 0.0080 0.0080 0.0010 0.0010 0.0010 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0270 0.0630 0.0010 0.0000 0.0000 0.0180 0.0020
N HO 0.0010
0.0063 0.0150

6 Phone Rule Rule Phone 10 NTCIR 1 NTCIR 2 NTCIR 6 NTCIR 8 NTCIR capsaicin 1 10 NTCIR 7 C 7 6
7 10

Table 7 Top 10 documents retrieved by query replacement for polyphenol.

            Y/N
1     A     Y
2     C     N
3           N
4           N
5           N
6     C     Y
7     C     Y
8           N
9           N
10          N

NTCIR 16) n n Rule Phone Phone 3.5

4.2.4

NTCIR-3 Web 4.1 25 NTCIR 4.1 1 Yahoo! API 60.1 1,420 2,040,000 1,437 512,000 12,400,000 238 Phone Rule Phone Rule 25 Rule Phone 20 4.1 17 Phone Rule 100 R-P 5 Phone Rule 0.2
5 Phone Rule R-P

Fig. 5 Comparison of R-P curves for our proposed and existing approaches.

0.03 0.08 Rule Phone

4.3

4.1 4.2 5

5.

25 32.5% 66.6% 1 89.2% 60.1 R-P

References

1) Vol.2, pp.43-49 (1983).
2) Brill, E. and Moore, R.C.: An improved error model for noisy channel spelling correction, Proc. 38th Annual Meeting of the Association for Computational Linguistics, pp.286-293 (2000).
3) 44 pp.3 249 250 (1992).
4) FleCS Vol.87, No.11, pp.83-90 (1992).
5) Vol.J77-D-II, No.2, pp.380-387 (1994).
6) Vol.35, No.12, pp.2745-2750 (1994).
7) Vol.J86-D-II, No.3, pp.418-428 (2003).
8) Masuyama, T. and Nakagawa, H.: Web-based acquisition of Japanese katakana variants, Proc. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.338-344 (2005).
9) Knight, K. and Graehl, J.: Machine Transliteration, Computational Linguistics, Vol.24, No.4, pp.599-612 (1998).
10) (1978).
11) Gregory, G., Yan, Q. and David, A.E.: Mining the Web to create a language model for mapping between English names and phrases and Japanese, Proc. IEEE/WIC/ACM International Conference on Web Intelligence, pp.110-116 (2004).
12) Frederick, J.: Statistical Methods for Speech Recognition, MIT Press (1998).
13) Eguchi, K., Oyama, K., Ishida, E., Kando, N. and Kuriyama, K.: Overview of the Web retrieval task at the third NTCIR workshop, Technical Report NII-2003-002E, National Institute of Informatics (2003).
14) Salton, G. and McGill, M.J.: Introduction to Modern Information Retrieval, McGraw-Hill, Inc. (1983).
15) Jones, K.S.: Statistical interpretation of term specificity and its application in retrieval, Journal of Documentation, Vol.28, No.1, pp.11-20 (1972).
16) Voorhees, E.M. and Harman, D.K. (Eds.): TREC: Experiment and Evaluation in Information Retrieval, The MIT Press (2005).

20 14 18 Ph.D. ACM SIGIR 53 58 AAAI (20 4 17) (20 6 6) (20 6 27)