HTML HTML 1) HTML HTML 2) df idf 3) 4) : World Wide Web Automatic acquisition of hyponymy relations from HTML documents This paper describes an automatic acquisition method for hyponymy relations. Hyponymy relations play a crucial role in various natural language processing systems, and there have been many attempts to automatically acquire the relations from largescale corpora. Most of the existing acquisition methods rely on particular linguistic patterns, such as juxtapositions, which specify hyponymy relations. Our method, however, does not use such linguistic patterns. We try to acquire hyponymy relations from four different types of clues. The first is repetitions of HTML tags found in usual HTML documents on the WWW. The second is statistical measures such as df and idf, which are popular in IR literatures. The third is verb-noun cooccurrences found in normal corpora. The fourth is heuristic rules obtained through our experiments on a development set. KeyWords: Knowledge acquisition, Hypernym, Hyponym, Statistical Natural Language Processing, World Wide Web 1 hyponymy relation synonymy relation part-whole relation, School of Information Science, Japan Advanced Institute of Science and Technology 1
Vol. 1 No. 1 Jan. 1 WWW HTML Miller (Miller, Beckwith, Fellbaum, Gross, and Miller 1990) A B (hypernym) B A hyponym B is a (kind of) A 1 B A hyponym(a B) hyponym( ) hyponym( ) hyponym( ) (Mandala, Tokunaga, and Tanaka 1998). QA (Fleischman, Hovy, and Echihabi 2003) WWW HTML 1) 2) 3) HTML HTML lexico syntactic pattern HTML df idf 1 A concept represented by a lexical item L 0 is said to be a hyponym of the consepct represented by a lexical item L 1 if native speakers of English accept sentences consutructed from the frame An L 0 is a (kind of ) L 1. 2
HTML 4 WWW 87 HTML 9 2,000 2,000 14,000, 3.6 % 501 85 % 5% 700 75 % 10 % 1,400 60 % 2 3 4 5 6 2. (Hearst 1992; Caraballo 1999; 2001; Morin and Jacquemin 2003; Fleischman et al. 2003;,, 2003) Hearst such as such NP as {NP, }* {(or and)} NP Hearst such as... works by such authors as Herrick, Goldsmith, and Shakespeare. such as hyponym( author Herrick ) hyponym( author Goldsmith ) hyponym( author Shakespeare ). ( 2001) ( 2003) 1 4 232 15,000 600 77.2% 3
Vol. 1 No. 1 Jan. 1 1 A B A B A B * A, B A B A B * A B * A B A B A B * A B * X ( 2001) 60 6 1 7 60% 85% 2 68.2% 81.3% 3 5 3 3.1 3 1 HTML 2 3 2 ( 2003) 4
HTML DVD-RW PC 1 HTML 4 1 2 3 1 2 3 1 HTML 2 df idf 3 4 4 1 2 3 1 WWW HTML 1 1 HTML 1 2 {DVD RW } { } 1 1 HTML 2 df idf 1 2 1 WWW 5
Vol. 1 No. 1 Jan. 1 2 4 2 2 DVD RW WWW. DVD RW DVD RW 3 3 3 DVD RW 4 1 3 4 3.2 ( 1) 1 WWW HTML HTML 3 3 HTML <A> <FONT> <TD> <B> 6
HTML <UL> <LI> </LI> <UL> <LI>DVD-RW</LI> <LI> </LI> <LI> </LI> <LI> </LI> </UL> <LI>PC </LI> <UL> <LI> </LI> <LI> </LI> <LI> </LI> </UL> </UL> (a) (UL 1 LI 1 ) (UL 1 UL 2 LI 1 DVD RW) (UL 1 UL 2 LI 2 ) (UL 1 UL 2 LI 3 ) (UL 1 UL 2 LI 4 ) (UL 1 LI 3 PC ) (UL 1 UL 4 LI 1 ) (UL 1 UL 4 LI 2 ) (UL 1 UL 4 LI 3 ) (c) (UL LI ) (UL UL LI DVD RW) (UL UL LI ) (UL UL LI ) (UL UL LI ) (UL LI PC ) (UL UL LI ) (UL UL LI ) (UL UL LI ) (b) (UL 1 LI ) (UL 1 UL 2 LI DVD RW) (UL 1 UL 2 LI ) (UL 1 UL 2 LI ) (UL 1 UL 2 LI ) (UL 1 LI PC ) (UL 1 UL 4 LI ) (UL 1 UL 4 LI ) (UL 1 UL 4 LI ) (d) 2 HTML 1 HTML HTML 3 1 <OPTION> <LI> <STRONG> 7
Vol. 1 No. 1 Jan. 1 2(a) HTML 4 HTML HTML HTML 2(a) <LI> </LI> <UL> </UL> (UL LI ) 2(a) HTML 2(b) HTML 2(b) {DVD RW } { PC } 2 PC 2(c) N 2(c) N =1 2(d) {DVD RW } { } { PC } N =1 4 1 HTML 8
HTML 3.3 df idf ( 2) 1 HTML 2 1 2 1 df idf 2 2 1 HTML HTML 2 1 1 2 1 C G C LD(C) LD(C) 5 N 2 h(c) h(c) =argmax n N {df (n, LD(C)) idf(n, G)} idf(n, G) = log G df (n, G) df(n, D) D n G G 3 DVD RW df (n, LD(C)) idf(n, G) 5 B 9
Vol. 1 No. 1 Jan. 1 df(n, D) D n tf(n, D) tf(n, D) tf idf tf(n, LD(C)) idf(n, G) df (n, LD(C)) idf(n, G) 4 3.4 ( 3) 2 { h(c 1 ),C 1, h(c 2 ),C 2,, h(c m ),C m } C 1,,C m 1 h(c i ) 2 C i 3 3 h(c i ) C i 2 df idf h(c i ),C i 3 4 k m k 3.1 2 DVD RW 3 10
HTML 3 C p v f hypo (C, p, v) {p 1,,p l } {v 1,,v m } hypov(c) = f hypo (C, p 1,v 1 ),f hypo (C, p 2,v 1 ),,f hypo (C, p l 1,v m ),f hypo (C, p l,v m ) h(c) hyperv(h(c)) = f(h(c),p 1,v 1 ),f(h(c),p 2,v 1 ),,f(h(c),p l 1,v m ),f(h(c),p l,v m ) f(n, p, v) 33 6 (,,, 2000) 7 n p v 500 0 WWW 2 Lin(Lin 1998) Lee(Lee 1999) Jaccard (Salton and Lesk 1968) C h(c) sim(c, h(c)) = hypov(c) hyperv(h(c)) hypov(c) hyperv(h(c)) Lin Lee 6 1987 2001 1991 1999 1990 1998; 3.01GB 7 ( 2000). 11
Vol. 1 No. 1 Jan. 1 3 2 m { h(c i ),C i } m i=1 sim(c i,h(c i )) df (h(c i ),LD(C i )) idf(h(c i ),G) 2 df idf 1 2 k 4 3.5 ( 4) 4 3 4 3 1 2 3 1 WWW 1 2 12
HTML 3 3 (part-whole relation) 3 4 4.66 10 6 HTML WWW 1.00 10 6 HTML ( 1.26GB ) HTML JUMAN( 1999) 30 4.66 10 6 8.71 10 5 HTML ( 0.89GB ) HTML 13
Vol. 1 No. 1 Jan. 1 2 ID 10397,,, 16653,,,,,,,, 21561,,,,,, 28931 DDI,,,,,, 30288,,,,,,, 35645,,, 51462,,,, 53147,,, 56681 NTT,,,,,,, 58174,,,, 59502,,,, 69064,,,,,, HTML Tidy 8 HTML XML 3.2 9.02 10 4 6.01 10 5 2 9.02 10 4 2,000 4,000 2,000 13,790 3.3 goo 9 100 100 HTML JUMAN ( 2000) j 8 http://www.w3.org/people/raggett/tidy/ 9 http://www.goo.ne.jp/ 14
HTML 100 80 Proposed Method (Step 4) Step 3 Step 2 Step 2 (tf) accuracy [%] 60 40 20 0 0 500 1000 1500 2000 2500 # of hypernym hyponym pairs 3 j j C k, k=1 j k=1 correct(c k,h(c k )) j k=1 C k j 1 j 200 C k C k correct(c k,h(c k )) h(c k ) C k 2 3 4 3 2 3 4 3 4 sim(c, h(c)) df (h(c),ld(c)) idf(h(c),g) 2 df (h(c),ld(c)) idf(h(c),g) 2 1 200 1,800(= 2, 000 200) 3 Step 4 15
Vol. 1 No. 1 Jan. 1 3.6% 501 36 84.6% 5% 701 44 75% 10% 1398 93 61% sim(c, h(c)) df (h(c),ld(c)) idf(h(c),g) 3.3 2 df (n, LD(C)) idf(n, G) tf(n, LD(C)) idf(n, G) 3 Step 2(tf) tf(n, LD(C)) idf(n, G) df (n, LD(C)) idf(n, G) tf(n, LD(C)) idf(n, G) df (n, LD(C)) idf(n, G) 3 3 4 3 4 3 4 4 4 4 Step X Rule X X X 4 200 3 7.7 % 4 13.2 % 3 4 4 3 947 3 779 16
HTML 3 Step2 Step4 Step4 Step4 Step3 Step1 1 2 3 10 *, *, *, *, *, JavaScript* 23 *, *, *, *, *, *, *, *, *, 16 *, *, *, 42 *, *, *, *, *, *, * *, *, *, *, 21 *, *, *, *, *, *, *, *, *, 53 *, *, *, 29 *, *, *, *, *, *, * 68 *, *, *, *, *, 47 *, *, *, 112 *,, * *, *, *, 69 *, *, *, *, *, *, 169 + *, *, * 78 *, *, *, *, *, *, *, *, 196 81,,,,, 200 82 *, *, *, *, *, * 201 86 *,,, * 207 106 *, *, *, *, *, *, *, * 250 116 *,, *, * 280 127 *, *, *, *, * 306 139,,,, 324,,,, 150,,,,,,, 343, 172,,,,,,,,, 391 +,, *, *, 10 +,,,,,,,,, 80 + + * +. 17
Vol. 1 No. 1 Jan. 1 100 80 Proposed Method -Step 3 -Step 4 -Rule 1 -Rule 2 -Rule 3 accuracy [%] 60 40 20 0 0 500 1000 1500 2000 2500 # of hypernym hyponym pairs 4 100 80 accuracy [%] 60 40 Proposed Method 20 Re-ranking of top 2 hypernym candidates Re-ranking of top 3 hypernym candidates Re-ranking of top 4 hypernym candidates Re-ranking of top 5 hypernym candidates 0 0 500 1000 1500 2000 2500 # of hypernym hyponym pairs 5 17.7 % hyponym(, ) 3.4 2 k sim(c, h(c)) df (h(c),ld(c)) idf(h(c),g) 5 18
HTML sim(c, h(c)) df (h(c),ld(c)) idf(h(c),g) 5 3 3 4 1 1 head-final 1 2 HTML 2 HTML 2 1 HTML 3 ( 2001) ( 2003) 3 3 4 3 1 2 3 3 19
Vol. 1 No. 1 Jan. 1 4 A B A B.*.* A B.*.* A B.*.* A B.*.* A B.* ( ).* A B A B.* (, )?.* A, B.* ( ).* ( ) 3 2 HTML 3 1 100 4 1 2 3 4 4 1 2 3 4 1 2 3 6 1 2 3 4 6 1 2 3 4 sim(c, h(c)) df (h(c),ld(c)) idf(h(c),g) 4 20
HTML 100 80 Proposed Method Alternative 1 Alternative 2 Alternative 3 Alternative 4 accuracy [%] 60 40 20 0 0 500 1000 1500 2000 # of hypernym hyponym pairs 6 1 100 1 2 3 4 1 2 df (n, LD(C)) idf(n, G) HTML HTML 2 30 % HTML 30 % 21
Vol. 1 No. 1 Jan. 1 3 WWW HTML 6 WWW HTML HTML HTML df idf 4. 1 60 % 22
HTML 1 3 DVD RW, ( 15 (A)15680005, 15 15650015) ( ).. Caraballo, S. A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 120 126. Fleischman, M., Hovy, E., and Echihabi, A. (2003). Offline Strategies for Online Question Answering: Answering Questions Before They Are Asked. In Proceedings of the 41th Annual Meeting of the Association for Computational Linguistics, pp. 1 7. Hearst, M. A. (1992). Automatic acquistition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, pp. 539 545. Lee, L. (1999). Measures of Distributional Similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 25 32. Lin, D. (1998). Automatic Retrieval and Clustering of Similar Words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 768 774. Mandala, R., Tokunaga, T., and Tanaka, H. (1998). The Use of WordNet in Information Retrieval. In Proceedings of the COLING ACL workshop on Usage of Wordnet in Natural Language Processing, pp. 31 37. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. In Journal of Lexicography, pp. 235 244. Morin, E. and Jacquemin, C. (2003). Automatic acquisition and expansion of hypernym links. In Computer and the Humanities 2003. forthcoming. 23
Vol. 1 No. 1 Jan. 1 Salton, G. and Lesk, M. E. (1968). Computer evaluation of indexing and text processing. Journal of the ACM, 15 (1), 8 36. Yoshida, M., Torisawa, K., and Tsujii, J. (2001). A method to integrate tables of the World Wide Web. In Proceedings of the International Workshop on Web Document Analysis, pp. 31 34.,, (2003).. 2003-NL-157, pp. 77 82. (2001).. Master s thesis,. (2001).., 8 (4), 37 54. (1999). JUMAN version 3.61. http://www.kc.t.u-tokyo.ac.jp/nl-resource/juman.html.,,, (2000). 3 4., 7 (5), 71 91. : 2002 2004 2004 HTML : 1992. 1995,. 1998 2001 21. 2001.. ). ( ) ( ) ( ) A 3.2 1 24
HTML $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $. $,$ $ $.+.+.+.+.+,.+.+.+.+.+.*.*.*.*.*.*.*.* 7 HTML 3.2 HTML (Yoshida, Torisawa, and Tsujii 2001) A B AB O HTML HTML 1 25
Vol. 1 No. 1 Jan. 1 8 2 7 3 3 20 1 HTML 1 12 5 2 7 7 3 20 3 20 B 8 8 26
HTML 27