Vol. 42 No. 6 June 2001

A Japanese Named Entity Extraction System Based on Building a Large-scale and High-quality Dictionary and Pattern-matching Rules

Yoshikazu Takemoto, Toshikazu Fukushima and Hiroshi Yamada

We have developed a Named Entity extraction system for Japanese text. Named Entities, i.e., proper names and temporal/numerical expressions, are considered essential elements for information extraction. The system employs a conventional method: it divides input Japanese text into words with parts of speech by morphological analysis, and extracts each Named Entity by referencing dictionaries and applying pattern-matching rules. To improve the system's accuracy, we aim to build a large-scale, high-quality dictionary and rule set. Both the dictionary and the rules have been produced manually, because we believe a hand-made dictionary and rules have better quality than those made automatically. We also focused our attention on two points for cases that the dictionary cannot cover. One is to extract proper names from compound words, and the other is to designate unknown or ambiguous words as proper names. For the first point, our system divides compound words and determines the proper names within them; thus, omissions of proper names inside compound words can be eliminated. For the second point, our system recognizes abbreviations of proper names, which tend to be unknown or ambiguous, by using reliable proper names. On the IREX-NE corpus, our system achieved an F-measure score of 83.86.

Information Services Department, Information Services Division, NEC Patent Service, Ltd.
NEC Internet Systems Research Laboratories, Computer & Communication Media Research, NEC Corporation
NEC Open Systems Development Department, 2nd Systems Operations Unit, NEC Corporation
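The pipeline the abstract describes (morphological analysis, then dictionary lookup and rule application over the word sequence) can be illustrated with a minimal sketch. The dictionary entries, the `tag_entities` helper, and the use of pre-tokenized English input (in place of Japanese morphological analysis) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of dictionary-based NE tagging over a token
# sequence (not the paper's actual code). A real system would first run
# a Japanese morphological analyzer; here tokens are given directly.

NE_DICT = {  # hypothetical entries: token tuple -> NE class
    ("NEC",): "ORGANIZATION",
    ("New", "York"): "LOCATION",
}

def tag_entities(tokens, ne_dict=NE_DICT, max_len=3):
    """Longest-match lookup of dictionary entries in a token list."""
    tagged, i = [], 0
    while i < len(tokens):
        # try the longest candidate span first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in ne_dict:
                tagged.append((" ".join(key), ne_dict[key]))
                i += n
                break
        else:
            tagged.append((tokens[i], None))  # not a known entity
            i += 1
    return tagged

print(tag_entities(["NEC", "opened", "a", "lab", "in", "New", "York"]))
```

Longest-match lookup is one common design choice here; it prefers "New York" as one LOCATION over two unrelated tokens.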
[Section 1 (Introduction): the Japanese text of this section was lost in extraction. Surviving fragments cite MUC-6 (accuracy around 90%), MET-1, the MUC evaluations begun by ARPA in 1987, TREC (begun in 1992), and the IREX-NE task definition.]

[Section 2 describes the IREX-NE task: Named Entities of eight classes, annotated with SGML tags.]

Table 1  Examples of named entity.
  ORGANIZATION  NEC
  PERSON        (example lost in extraction)
  LOCATION      (example lost in extraction)
  ARTIFACT      (example lost in extraction)
  DATE          (example garbled in extraction)
  TIME          (example garbled in extraction)
  MONEY         (example garbled in extraction)
  PERCENT       120%
[Section 3 (building the dictionary and the pattern-matching rules, §3.1–§3.2.1): the Japanese text was lost in extraction.]
Table 2  Examples of rule. [The rule examples and the surrounding text of §3.2.1–§3.2.2 were lost in extraction; surviving fragments mention rule counts, the IREX-NE evaluation, MET-2, and F-measure.]
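Table 2's actual rule formalism did not survive extraction. As a hedged stand-in, the sketch below shows how pattern-matching rules for the temporal/numerical expression classes of Table 1 (DATE, TIME, MONEY, PERCENT) might look if written as regular expressions; the specific patterns are assumptions, not the paper's rules.

```python
import re

# Hypothetical regex stand-ins for pattern-matching rules over
# temporal/numerical expressions; illustrative only.
RULES = [
    ("DATE", re.compile(r"\d{1,2}月\d{1,2}日")),          # e.g. 5月14日
    ("TIME", re.compile(r"(午前|午後)?\d{1,2}時\d{1,2}分")),
    ("MONEY", re.compile(r"\d+(万|億)?円")),
    ("PERCENT", re.compile(r"\d+(\.\d+)?%")),
]

def match_expressions(text):
    """Return (surface string, NE class) pairs matched by any rule."""
    found = []
    for ne_class, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.group(), ne_class))
    return found

print(match_expressions("売上は500万円、前年比120%だった。"))
```

A production rule set would also need priorities and overlap resolution between rules, which this sketch omits.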
[§3.2.3 and the beginning of Section 4 (system configuration, §4.1): the Japanese text was lost in extraction.]

Fig. 1  Our named entity extraction system configuration.
Table 3  Details of named entity dictionary. [Row and column labels were lost in extraction. Surviving entry counts include 24,990 / 36,714 / 27,564 for a row containing the example "NEC", and 1,343 / 1,110 / 155 for a row containing the example "Lavie".]

[§4.2 (dictionary lookup): the Japanese text was lost in extraction.]
[§4.3–§4.6 (the remaining processing steps of the system): the Japanese text was lost in extraction.]

Fig. 2  An example of our named entity extraction process.
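The abstract's second coverage device, recognizing abbreviations of proper names by reference to reliable proper names found elsewhere, can be sketched as follows. The substring criterion and the `recognize_abbreviations` helper are illustrative assumptions; the paper's actual matching conditions did not survive extraction.

```python
# Hedged sketch: treat an unknown short word as a proper name if it
# matches part of a reliable proper name already recognized in the same
# text. The matching criterion below is an assumption, not the paper's
# algorithm.

def recognize_abbreviations(unknown_words, reliable_names):
    """Map unknown words to the NE class of a reliable name they abbreviate."""
    resolved = {}
    for word in unknown_words:
        for name, ne_class in reliable_names.items():
            # crude criterion: the unknown word is a proper substring of
            # a previously recognized full name
            if word in name and word != name:
                resolved[word] = ne_class
                break
    return resolved

reliable = {"NEC Corporation": "ORGANIZATION"}
print(recognize_abbreviations(["NEC"], reliable))  # -> {'NEC': 'ORGANIZATION'}
```

In practice the reliable names would be those already extracted with high confidence from the same document, so the abbreviation step can run as a second pass.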
5. Evaluation

[Most of the Japanese text of Section 5, which reports the 1999 IREX-NE formal run, was lost in extraction. The evaluation measures survive: GLD is the number of Named Entities in the gold standard, SYS the number the system output, and COR the number of correct system outputs. Recall and precision are R = COR/GLD and P = COR/SYS, and F-measure is]

  F = (1 + b^2) * P * R / (b^2 * P + R)    (1)

which, with b = 1, reduces to

  F = 2 * P * R / (P + R)    (2)

Table 4  Accuracy of our named entity extraction system.
  (The row labels, the eight NE classes, were lost in extraction.)

              GLD    SYS    COR    R (%)   P (%)
              361    373    288    79.8    77.2
              338    324    290    85.8    89.5
              413    387    339    82.1    87.6
               48     20     15    31.3    75.0
              260    275    242    93.1    88.0
               54     60     47    87.0    78.3
               15     15     13    86.7    86.7
               21     17     16    76.2    94.1
  Total      1510   1471   1250    82.8    85.0

Under Equation (2), the overall F-measure is 83.86.
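Equation (2) can be checked against the totals in Table 4 (GLD = 1510, SYS = 1471, COR = 1250); the short sketch below recomputes the reported overall score.

```python
# Recompute the overall score from the Table 4 totals.
GLD, SYS, COR = 1510, 1471, 1250

recall = COR / GLD       # 0.8278 -> 82.8%
precision = COR / SYS    # 0.8498 -> 85.0%

# Equation (2): F-measure with b = 1
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure * 100, 2))  # -> 83.86
```

Note that with b = 1 the formula simplifies to 2*COR/(GLD+SYS) = 2500/2981, which matches the published 83.86.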
Table 5  Evaluation of each component of our system.
  [The header lists ten configurations, A1–A4 (discussed in §5.2) and B1–B6 (§5.3), over eight value columns; surviving fragments suggest B3 = A3 and B5 = A4, so the ten labels map onto eight columns. The surviving value rows are:]

  R (%):  38.7   56.6   76.2   76.3   77.3   81.2   82.4   82.9
  P (%):  31.7   51.3   84.0   84.3   85.2   85.3   84.9   84.9
  F:      34.9   53.8   79.9   80.1   81.1   83.2   83.7   83.9

[The Japanese discussion of the A1–A4 configurations (§5.2) and the B1–B6 configurations (§5.3) was lost in extraction.]
Table 6  Accuracy improvement on training corpus (values are F-measures).
  [Columns correspond to the configurations of Table 5; rows are three training corpora A, B, and C. Surviving fragments mention the IREX-NE training data, the CRL corpus for C, and corpus sizes/dates (46, 36, 1,460; 1994–1995, 1998), but the corpus descriptions were lost in extraction.]

  A:  51.1   69.5   85.9   86.2   86.2   89.6   89.8   93.9
  B:  41.3   56.5   77.2   77.5   79.1   81.9   82.9   87.2
  C:  38.9   56.0   75.6   76.0   77.2   80.4   81.0   82.5

[§5.4 and Section 6 (discussion, referring back to §5.1–§5.4): the Japanese text was lost in extraction.]
7. Conclusion

[The Japanese text of the conclusion was lost in extraction. It restates the IREX-NE result (F-measure 83.86) and mentions NEC and MET-1.]

References
 1) (Japanese reference; title lost in extraction) Vol.40, No.4 (1999).
 2) (Japanese reference; title lost in extraction) 114-12 (1996).
 3) (Japanese reference; title lost in extraction) Vol.36, No.8 (1995).
 4) (Japanese reference; title lost in extraction) 115-12 (1996).
 5) Cowie, J. and Lehnert, W.: Information Extraction, Comm. ACM, Vol.39, No.1 (1996).
 6) Proc. 6th Message Understanding Conference (MUC-6), Morgan Kaufmann Publishers Inc. (1996).
 7) Proc. Tipster Text Program (Phase II), DARPA (1996).
 8) (Japanese reference; title lost in extraction) 115-10 (1996).
 9) IREX (Japanese reference; title lost in extraction) 127-15 (1998).
10) (Japanese reference; title lost in extraction) Vol.25, No.6 (1984).
11) (Japanese reference; title lost in extraction) 6S-3 (1987).
12) Takemoto, Y., Wakao, T., Yamada, H., Gaizauskas, R. and Wilks, Y.: Description of NEC/Sheffield System Used For MET Japanese, Proc. Tipster Text Program (Phase II) (1996).
13) IREX-NE (Japanese reference; title lost in extraction), IREX (1999).
14) IREX NE (Japanese reference; title lost in extraction) B2-1 (1999).
15) http://cs.nyu.edu/cs/projects/proteus/irex/
16) (Japanese reference; title lost in extraction) A2-2 (1999).
17) Sekine, S.: NYU: Description of the Japanese NE System Used For MET-2 (1998). http://www.muc.saic.com/
18) (Japanese reference; title lost in extraction) 92-90 (1992).
19) (Japanese reference; title lost in extraction) (1997).
20) (Japanese reference; title lost in extraction) 7L-3 (1996).
21) (Japanese reference; title lost in extraction) 126-15 (1998).

(Received March 1, 2000)
(Accepted March 9, 2001)