Automatic Extraction of Related Terms using Web Search Engines 1
Danushka Bollegala 7-3-1 Keigo WATANABE Danushka BOLLEGALA Yutaka MATSUO and Mitsuru ISHIZUKA Graduate School of Information Science and Technology, The University of Tokyo School of Engineering, The University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan 2
. WordNet..., 3
Abstract Semantic lexicons, such as Roget s Thesaurus or WordNet, act as useful knowledge resources in natural language processing applications. However, such manually created lexical resources do not always reflect the new terms and named entities frequently found in the Web. Moreover, manually maintaining lexical resources are costly and time consuming. Motivated by this challenge, we propose a method to automatically extract related terms using the web as a corpus. The proposed method exploits snippets retrieved from a web search engine and efficiently finds related terms. Key words Web mining, information extraction, related terms, search engines 4
1 Miller 1 3) X Y X is a Y X Y is-a car automobile drive rental car automobile 6, 10) 100 Google 1000. URL URL URL WordNet 3) 12) R W car W R : automobile 2 3 4 5 5
1 Synonymy Synonym car - automobile Antonymy Antonym rise - fall Hyponymy Hypernym dog - animal Hyponym dog - poodle Meronymy Holonym dog - canis Meronym dog - flag 2 2.1 Miller 1 3) Hearst 6) Pantel 10) Snow 13) Lin 2) 14) Girju 5) Hearst Hearst 1 2 6) NP 0 such as NP 1, NP 2..., (and or)np n (1) NP 1,..., NP n (and or)other NP 0 (2) 1 Pantel 10) Part-of-Speech Tagging semantic drift 2.2 14) Bollegala 1) Cimiano 9) 3 6
2 dog animal...no dog or other animal shall be left... no X or other Y no X or other Y shall no X or other Y shall be X or other Y X or other Y shall X or other Y shall be 1 1. : 2. : 3. : 1 3.1 1: 1 Google http://google.com/ R R R R 4 k k = 1, 2, 3 X Y X Y X Y X Y Y X Y X Y X M 1. (X,Y) X Y X Y 2. X prefix X Y midfix Y suffix prefix midfix suffix n pre, n in, n post X Y (context window) 2 3. X Y (2) dog animal...no dog or other animal shall be left... 2 χ 2 v 3 3 χ 2 1 7 2 100 n pre = 2, n in = 3, n post = 2 Google
3 χ 2 v v p v P p v P n v N n v N χ 2 χ 2 = (P + N)(p v(n n v ) n v (P p v )) 2 P N(p v + n v )(P + N p v n v ) (3) χ 2 3 p v (N n v ) n v (P p v ) 2 χ 2 N 4 3.2 2: 1 2 1 X Y X and Y are dog dog and are and dog are M n 3 : the, or, of... 4 PorterStemmer 8) : dog dogs 3.3 3: 2 1 0-1 PF(c)( 4) PF(c) = f(c v ) (4) v f(c v ) = {v c v > 0} c 2 v c v c TF(c)( 5) TF(c) = c v (5) v 1 χ 2 X synonyms Y χ 2 X and Y χ 2 P F (c) T F (c) WeightPF(c)( 6) WeightTF(c)( 7) 3 n = 1 2 4 http://armandbrahaj. blog.al/2009/04/14/list-of-english-stop-words/ 8
WeightPF(c) = v WeightTF(c) = v weight v f(c v ) (6) weight v c v (7) weight v v F. F ( 10) ( 8) ( 9) = = F = 2 + (8) (9) (10) (weight v ) 4 F 4 PF( 4),TF( 5),WeightPF( 6) WeightTF( 7) PF TF WeightPF WeightTF PF WeightTF TF PF WeightTF WeightPF weight v = 1 WeightTF WeightPF 4 WordNet WordNet synset normalized F measure 0.03 0.025 0.02 0.015 0.01 0.005 0 20 40 60 80 100 pattern ID 2 WordNet synset 1000 WordNet WordNet synset 1000-1000 synset hypernym set M = 100 ( N = 100 F χ 2 4 100 4 F 10 4 2 100 weight v 3.3 weight v 10 F 2 F 2 9
4 F 10 X,Y: X: X: F χ 2 F χ 2 F χ 2 synset X Y 0.2736 68.4 X is a who 0.2996 28.2 or other X 0.1569 73.2 X synonyms Y 0.2051 50.8 X by of 0.1377 20.2 a is a X 0.1346 8.19 syn X Y 0.1876 64.7 X a who 0.1259 44.2 X such as a 0.1241 42.6 an X or Y 0.1383 29.9 X or other 0.1082 73.6 X or as 0.1011 15.7 X or Y a 0.1128 20.9 a or X 0.1031 43.3 n the X of 0.0933 17.6 X or Y was 0.0978 12.8 and X for 0.0942 25.4 X or in 0.0907 15.7 X or Y is 0.0900 79.0 or X that 0.0886 20.7 X called a 0.0891 24.2 X or Y in 0.0747 43.1 the or X 0.0849 57.8 X or is 0.0861 18.4 a X or Y 0.0745 128.1 or X of the 0.0815 11.7 X or of the 0.0807 16.5 as a X or Y 0.0692 13.5 X a that 0.0763 44.0 X by of 0.0801 32.1 0.5 0.4 0.3 PF WeightTF Lin 98 5 Lin 98 n @5 @10 @20 @30 Lin 98 0.28 0.30 0.28 0.27 PF 0.32 0.32 0.32 0.29 WeightTF 0.44 0.36 0.28 0.30 0.2 0 25 50 3 Lin 98 2 4.1 Lin 2) Lin Lin {cord, forest, fruit, glass, slave} 50 Lin 50 5 5 5-30 Lin 4.2 5 Roget s Thesaurus WordNet 4-6 / F F 0.158 10
WeightTF 0.114 PF 0.084 WeightTF 0 F /= 1 F F / 1 1.5 4.3 Bollegala Jaccard Overlap Dice PMI W ebjaccard W eboverlap W ebdice W ebp MI 11-14 1) H(P ) P H(P Q) P AND Q P Q AND W ebp MI N Google 10 10 WebJaccard(P, Q) = WebOverlap(P, Q) = WebDice(P, Q) = WebPMI(P, Q) = log 2 ( H(P Q) H(P ) + H(Q) H(P Q) (11) H(P Q) min(h(p ), H(Q)) 2H(P Q) H(P ) + H(Q) H(P Q) N H(P ) H(Q) N N ) (12) (13) (14) magician 0.6 0.4 0.2 0 PF WeightTF WebJaccard WebOverlap WebDice WebPMI 0 100 200 300 7 5 88 F 7 F Web- Dice WebJaccard 134 F 0.162 WebPMI WebOverlap 4, 15) WebPMI WebOverlap PF WeightTF 90 100 F 0.324 WebDice WebJaccard PF WeightTF F 120 WeightTF Recall Precision F 30-50 F magician 11 5
0.2 0.15 0.1 0.15 0.1 0.05 WeightTF PF 0 0 0.5 1 1.5 2 0.1 0.05 WeightTF PF 0 0 0.5 1 1.5 2 0.05 WeightTF PF 0 0 0.5 1 1.5 2 4 5 6 6 magician PF WeightTF *wizard 49 *sorcerer 9.78 *illusionist 47 clown 7.37 *sorcerer 43 *wizard 5.93 magic 40 *illusionist 5.62 clown 37 *shaman 5.55 *juggler 36 *witch 5.09 priest 36 *enchanter 4.30 *artist 35 priest 4.07 *witch 35 *juggler 4.01 warrior 34 *artist 3.59 10 6 Roget s Thesaurus 5 Lin 3 6 Roget s Thesaurus clown 1) D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring semantic similarity between words using web search engines. In Proceedings of the 16th International World Wide Web Conference (WWW-07), pages 757-766, 2007. 2) D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 19th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING- ACL-98), pages 768-774, 1998. 3) G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4): pages 235-244, 1990. 4) P. Turney. Mining the Web for Synonyms: PMI- IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML-01), pages 491-502, 2001. 5) R. Girju, A. Badulescu, and D. Moldovan. Automatic Discovery of Part-Whole Relations. In 12
Computational Linguistics, 32(1), pages 83-135, 2006. 6) M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 539-545, 1992. 7) M. Pasca. Organizing and searching the World Wide Web of facts - step two: harnessing the wisdom of the crowds. In Proceedings of the 16th International Conference on World Wide Web (WWW-07), pages 101-110, 2007. 8) M. Porter. An Algorithm for Suffix Stripping. Program, 14, pages 130-137, 1980. Accesible at http://www.tartarus.org/ martin/porterstemmer/. 9) P. Cimiano and J. Wenderoth. Automatic acquisition of ranked qualia structures from the web. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07), pages 888-895, 2007. 10) P. Pantel, D. Ravichandran, and E. Hovy. Towards Terascale Knowledge Acquisition. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), pages 771-777, 2004. 11) P. Pantel and M. Pennacchiotti. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-06), pages 113-120, 2006. 12) P. Roget. Thesaurus of English words and phrases. Longmans, Green and Co., 1911. 13) R. Snow, D. Jurafsky, and A. Ng. Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems 17, 2004. 14),,,. Web., Vol. 14, Number 2, pages 3-31, 2007. 15) Y. Matsuo, J. Mori, M. Hamasaki, K. Ishida, T. Nishimura, H. Takeda, K. Hasida and M. Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In Proceedings of the 15th International Conference on World Wide Web, (WWW-06), 2006. 113-8656 7-3-1 Tel: 03-5841-6751 Fax: 03-5841-6070 E-mail: danushka@iba.t.u-tokyo.ac.jp (2008 2010 DeNA. Bollegala Danushka ( ) 2005 2007 2009,, WWW,ACL ( ) 1997 2002 2005 10 AAAI 13
( ) 1971 1976 NTT 1978 1980-81 Purdue 1992,Web,,.IEEE,AAAI,,,. 14