Danushka Bollegala 7-3-1 Keigo WATANABE Danushka BOLLEGALA Yutaka MATSUO and Mitsuru ISHIZUKA Graduate School of Information Science and Technology, T



Similar documents
skeiji.final.dvi

( )

i

1 AND TFIDF Web DFIWF Wikipedia Web Web AND 5. Wikipedia AND 6. Wikipedia Web Ma [4] Ma URL AND Tian [8] Tian Tian Web Cimiano [3] [

IPSJ-TOD

自然言語処理19_3

Mining Social Network of Conference Participants from the Web

SERPWatcher SERPWatcher SERP Watcher SERP Watcher,

1 Web [2] Web [3] [4] [5], [6] [7] [8] S.W. [9] 3. MeetingShelf Web MeetingShelf MeetingShelf (1) (2) (3) (4) (5) Web MeetingShelf

_314I01BM浅谷2.indd

1., 1 COOKPAD 2, Web.,,,,,,.,, [1]., 5.,, [2].,,.,.,, 5, [3].,,,.,, [4], 33,.,,.,,.. 2.,, 3.., 4., 5., ,. 1.,,., 2.,. 1,,

DEIM Forum 2009 C8-4 QA NTT QA QA QA 2 QA Abstract Questions Recomme

[7] [10] Web Web RDF Resource Description Framework subjectpredicate object Web Web Web Web Web 2 Web Web MUC(Message Understanding Confere

Wikipedia YahooQA MAD 4)5) MAD Web 6) 3. YAMAHA 7) 8) Vocaloid PV YouTube 1 minato minato ussy 3D MAD F EDis ussy

1 1 tf-idf tf-idf i

IPSJ SIG Technical Report Vol.2011-MUS-91 No /7/ , 3 1 Design and Implementation on a System for Learning Songs by Presenting Musical St

A Japanese Word Dependency Corpus ÆüËܸì¤Îñ¸ì·¸¤ê¼õ¤±¥³¡¼¥Ñ¥¹

( : A8TB2163)

IT,, i

DEIM Forum 2010 A Web Abstract Classification Method for Revie

Microsoft Word - toyoshima-deim2011.doc

1 Fig. 2 2 Fig. 1 Sample of tab UI 1 Fig. 1 that changes by clicking tab 5 2. Web HTML Adobe Flash Web ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) 3 Web 2.1 Web Goo

No. 3 Oct The person to the left of the stool carried the traffic-cone towards the trash-can. α α β α α β α α β α Track2 Track3 Track1 Track0 1

: ( 1) () 1. ( 1) 2. ( 1) 3. ( 2)

IPSJ SIG Technical Report Vol.2011-EC-19 No /3/ ,.,., Peg-Scope Viewer,,.,,,,. Utilization of Watching Logs for Support of Multi-

3807 (3)(2) ,267 1 Fig. 1 Advertisement to the author of a blog. 3 (1) (2) (3) (2) (1) TV 2-0 Adsense (2) Web ) 6) 3

Vol. 9 No. 5 Oct (?,?) A B C D 132

21 Pitman-Yor Pitman- Yor [7] n -gram W w n-gram G Pitman-Yor P Y (d, θ, G 0 ) (1) G P Y (d, θ, G 0 ) (1) Pitman-Yor d, θ, G 0 d 0 d 1 θ Pitman-Yor G

Computational Semantics 1 category specificity Warrington (1975); Warrington & Shallice (1979, 1984) 2 basic level superiority 3 super-ordinate catego

17 Proposal of an Algorithm of Image Extraction and Research on Improvement of a Man-machine Interface of Food Intake Measuring System

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2

Wikipedia 2 Wikipedia Web Wikipedia 2. Web [6] [11] [8] 2 SVM Bollegala [1] 5-gram URL URL 2-gram [6] [11] SVM 3 SVM [8] Bollegala [1] SVM [7] [9] [6]

TF-IDF TDF-IDF TDF-IDF Extracting Impression of Sightseeing Spots from Blogs for Supporting Selection of Spots to Visit in Travel Sat

2006 3

,,,,., C Java,,.,,.,., ,,.,, i

IPSJ SIG Technical Report Vol.2009-HCI-134 No /7/17 1. RDB Wiki Wiki RDB SQL Wiki Wiki RDB Wiki RDB Wiki A Wiki System Enhanced by Visibl

24 Region-Based Image Retrieval using Fuzzy Clustering

Modal Phrase MP because but 2 IP Inflection Phrase IP as long as if IP 3 VP Verb Phrase VP while before [ MP MP [ IP IP [ VP VP ]]] [ MP [ IP [ VP ]]]

<> <name> </name> <body> <></> <> <title> </title> <item> </item> <item> 11 </item> </>... </body> </> 1 XML Web XML HTML 1 name item 2 item item HTML

自然言語処理21_249

2reN-A14.dvi

untitled

IPSJ SIG Technical Report Vol.2010-NL-199 No /11/ treebank ( ) KWIC /MeCab / Morphological and Dependency Structure Annotated Corp

gengo.dvi

‰gficŒõ/’ÓŠ¹

HASC2012corpus HASC Challenge 2010,2011 HASC2011corpus( 116, 4898), HASC2012corpus( 136, 7668) HASC2012corpus HASC2012corpus

main.dvi

¥ì¥·¥Ô¤Î¸À¸ì½èÍý¤Î¸½¾õ

IPSJ SIG Technical Report Vol.2016-CE-137 No /12/ e β /α α β β / α A judgment method of difficulty of task for a learner using simple

Web Web Web Web Web, i

. Yahoo! 1!goo 2 QA..... QA Web Web [1]Web Web Yin [2] Web Web Web. [3] Web Wikipedia 1 2

1 4 4 [3] SNS 5 SNS , ,000 [2] c 2013 Information Processing Society of Japan

kut-paper-template.dvi

IPSJ SIG Technical Report Secret Tap Secret Tap Secret Flick 1 An Examination of Icon-based User Authentication Method Using Flick Input for

3_23.dvi

1: A/B/C/D Fig. 1 Modeling Based on Difference in Agitation Method artisoc[7] A D 2017 Information Processing

22 Google Trends Estimation of Stock Dealing Timing using Google Trends

The 18th Game Programming Workshop ,a) 1,b) 1,c) 2,d) 1,e) 1,f) Adapting One-Player Mahjong Players to Four-Player Mahjong

1 1 CodeDrummer CodeMusician CodeDrummer Fig. 1 Overview of proposal system c

DEIM Forum 2009 B4-6, Str

IPSJ SIG Technical Report Vol.2010-GN-74 No /1/ , 3 Disaster Training Supporting System Based on Electronic Triage HIROAKI KOJIMA, 1 KU

DEIM Forum 2009 E

IPSJ SIG Technical Report Vol.2010-SLDM-144 No.50 Vol.2010-EMB-16 No.50 Vol.2010-MBL-53 No.50 Vol.2010-UBI-25 No /3/27 Twitter IME Twitte

BOK body of knowledge, BOK BOK BOK 1 CC2001 computing curricula 2001 [1] BOK IT BOK 2008 ITBOK [2] social infomatics SI BOK BOK BOK WikiBOK BO

FA

独立行政法人情報通信研究機構 Development of the Information Analysis System WISDOM KIDAWARA Yutaka NICT Knowledge Clustered Group researched and developed the infor

Fig. 3 Flow diagram of image processing. Black rectangle in the photo indicates the processing area (128 x 32 pixels).

/4 2012

DEIM Forum 2010 A3-3 Web Web Web Web Web. Web Abstract Web-page R

HP cafe HP of A A B of C C Map on N th Floor coupon A cafe coupon B Poster A Poster A Poster B Poster B Case 1 Show HP of each company on a user scree

ipod touch 1 2 Apple ipod touch ipod touch 3 ( ) ipod touch ( 1 ) Apple ( 2 ) Web 1),2) 3. ipod touch 1 2 ipod touch x y z i

Web [1] [2] [3] [4] [5] SupportVectorMachine SVM [6] [7] Google [11] Web

untitled

Vol. 23 No. 4 Oct Kitchen of the Future 1 Kitchen of the Future 1 1 Kitchen of the Future LCD [7], [8] (Kitchen of the Future ) WWW [7], [3

main.dvi

IT i

HP HP ELF 7 52

Vol.54 No (July 2013) [9] [10] [11] [12], [13] 1 Fig. 1 Flowchart of the proposed system. c 2013 Information

(2008) JUMAN *1 (, 2000) google MeCab *2 KH coder TinyTextMiner KNP(, 2000) google cabocha(, 2001) JUMAN MeCab *1 *2 h

(a) 1 (b) 3. Gilbert Pernicka[2] Treibitz Schechner[3] Narasimhan [4] Kim [5] Nayar [6] [7][8][9] 2. X X X [10] [11] L L t L s L = L t + L s

2. Twitter Twitter 2.1 Twitter Twitter( ) Twitter Twitter ( 1 ) RT ReTweet RT ReTweet RT ( 2 ) URL Twitter Twitter 140 URL URL URL 140 URL URL

main.dvi


IPSJ SIG Technical Report Vol.2009-DBS-149 No /11/ Bow-tie SCC Inter Keyword Navigation based on Degree-constrained Co-Occurrence Graph

% 95% 2002, 2004, Dunkel 1986, p.100 1

IPSJ SIG Technical Report Vol.2011-DBS-153 No /11/3 Wikipedia Wikipedia Wikipedia Extracting Difference Information from Multilingual Wiki

01ⅢⅣⅤⅥⅦⅧⅨⅩ一二三四五六七八九零壱弐02ⅢⅣⅤⅥⅦⅧⅨⅩ一二三四五六七八九零壱弐03ⅢⅣⅤⅥⅦⅧⅨⅩ一二三四五六七八九零壱弐04ⅢⅣⅤⅥⅦⅧⅨⅩ一二三四五六七八九零壱弐05ⅢⅣⅤⅥⅦⅧⅨⅩ一二三四五六七八九零壱弐06ⅢⅣⅤⅥⅦⅧⅨⅩ一二三四五六

Journal of Geography 116 (6) Configuration of Rapid Digital Mapping System Using Tablet PC and its Application to Obtaining Ground Truth

A B C B C ICT ICT ITC ICT

29 jjencode JavaScript

[1], B0TB2053, i


7,, i

1 7.35% 74.0% linefeed point c 200 Information Processing Society of Japan

,,, Twitter,,, ( ), 2. [1],,, ( ),,.,, Sungho Jeon [2], Twitter 4 URL, SVM,, , , URL F., SVM,, 4 SVM, F,.,,,,, [3], 1 [2] Step Entered

@08470030ヨコ/篠塚・窪田 221号

IPSJ SIG Technical Report Vol.2014-GN-90 No.16 Vol.2014-CDS-9 No.16 Vol.2014-DCC-6 No /1/24 1,a) 2,b) 2,c) 1,d) QUMARION QUMARION Kinect Kinect

untitled

05_fuke.indd

自然言語処理16_2_45

Transcription:

Automatic Extraction of Related Terms using Web Search Engines 1

Danushka Bollegala 7-3-1 Keigo WATANABE Danushka BOLLEGALA Yutaka MATSUO and Mitsuru ISHIZUKA Graduate School of Information Science and Technology, The University of Tokyo School of Engineering, The University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan 2

. WordNet..., 3

Abstract Semantic lexicons, such as Roget s Thesaurus or WordNet, act as useful knowledge resources in natural language processing applications. However, such manually created lexical resources do not always reflect the new terms and named entities frequently found in the Web. Moreover, manually maintaining lexical resources are costly and time consuming. Motivated by this challenge, we propose a method to automatically extract related terms using the web as a corpus. The proposed method exploits snippets retrieved from a web search engine and efficiently finds related terms. Key words Web mining, information extraction, related terms, search engines 4

1 Miller 1 3) X Y X is a Y X Y is-a car automobile drive rental car automobile 6, 10) 100 Google 1000. URL URL URL WordNet 3) 12) R W car W R : automobile 2 3 4 5 5

1 Synonymy Synonym car - automobile Antonymy Antonym rise - fall Hyponymy Hypernym dog - animal Hyponym dog - poodle Meronymy Holonym dog - canis Meronym dog - flag 2 2.1 Miller 1 3) Hearst 6) Pantel 10) Snow 13) Lin 2) 14) Girju 5) Hearst Hearst 1 2 6) NP 0 such as NP 1, NP 2..., (and or)np n (1) NP 1,..., NP n (and or)other NP 0 (2) 1 Pantel 10) Part-of-Speech Tagging semantic drift 2.2 14) Bollegala 1) Cimiano 9) 3 6

2 dog animal...no dog or other animal shall be left... no X or other Y no X or other Y shall no X or other Y shall be X or other Y X or other Y shall X or other Y shall be 1 1. : 2. : 3. : 1 3.1 1: 1 Google http://google.com/ R R R R 4 k k = 1, 2, 3 X Y X Y X Y X Y Y X Y X Y X M 1. (X,Y) X Y X Y 2. X prefix X Y midfix Y suffix prefix midfix suffix n pre, n in, n post X Y (context window) 2 3. X Y (2) dog animal...no dog or other animal shall be left... 2 χ 2 v 3 3 χ 2 1 7 2 100 n pre = 2, n in = 3, n post = 2 Google

3 χ 2 v v p v P p v P n v N n v N χ 2 χ 2 = (P + N)(p v(n n v ) n v (P p v )) 2 P N(p v + n v )(P + N p v n v ) (3) χ 2 3 p v (N n v ) n v (P p v ) 2 χ 2 N 4 3.2 2: 1 2 1 X Y X and Y are dog dog and are and dog are M n 3 : the, or, of... 4 PorterStemmer 8) : dog dogs 3.3 3: 2 1 0-1 PF(c)( 4) PF(c) = f(c v ) (4) v f(c v ) = {v c v > 0} c 2 v c v c TF(c)( 5) TF(c) = c v (5) v 1 χ 2 X synonyms Y χ 2 X and Y χ 2 P F (c) T F (c) WeightPF(c)( 6) WeightTF(c)( 7) 3 n = 1 2 4 http://armandbrahaj. blog.al/2009/04/14/list-of-english-stop-words/ 8

WeightPF(c) = v WeightTF(c) = v weight v f(c v ) (6) weight v c v (7) weight v v F. F ( 10) ( 8) ( 9) = = F = 2 + (8) (9) (10) (weight v ) 4 F 4 PF( 4),TF( 5),WeightPF( 6) WeightTF( 7) PF TF WeightPF WeightTF PF WeightTF TF PF WeightTF WeightPF weight v = 1 WeightTF WeightPF 4 WordNet WordNet synset normalized F measure 0.03 0.025 0.02 0.015 0.01 0.005 0 20 40 60 80 100 pattern ID 2 WordNet synset 1000 WordNet WordNet synset 1000-1000 synset hypernym set M = 100 ( N = 100 F χ 2 4 100 4 F 10 4 2 100 weight v 3.3 weight v 10 F 2 F 2 9

4 F 10 X,Y: X: X: F χ 2 F χ 2 F χ 2 synset X Y 0.2736 68.4 X is a who 0.2996 28.2 or other X 0.1569 73.2 X synonyms Y 0.2051 50.8 X by of 0.1377 20.2 a is a X 0.1346 8.19 syn X Y 0.1876 64.7 X a who 0.1259 44.2 X such as a 0.1241 42.6 an X or Y 0.1383 29.9 X or other 0.1082 73.6 X or as 0.1011 15.7 X or Y a 0.1128 20.9 a or X 0.1031 43.3 n the X of 0.0933 17.6 X or Y was 0.0978 12.8 and X for 0.0942 25.4 X or in 0.0907 15.7 X or Y is 0.0900 79.0 or X that 0.0886 20.7 X called a 0.0891 24.2 X or Y in 0.0747 43.1 the or X 0.0849 57.8 X or is 0.0861 18.4 a X or Y 0.0745 128.1 or X of the 0.0815 11.7 X or of the 0.0807 16.5 as a X or Y 0.0692 13.5 X a that 0.0763 44.0 X by of 0.0801 32.1 0.5 0.4 0.3 PF WeightTF Lin 98 5 Lin 98 n @5 @10 @20 @30 Lin 98 0.28 0.30 0.28 0.27 PF 0.32 0.32 0.32 0.29 WeightTF 0.44 0.36 0.28 0.30 0.2 0 25 50 3 Lin 98 2 4.1 Lin 2) Lin Lin {cord, forest, fruit, glass, slave} 50 Lin 50 5 5 5-30 Lin 4.2 5 Roget s Thesaurus WordNet 4-6 / F F 0.158 10

WeightTF 0.114 PF 0.084 WeightTF 0 F /= 1 F F / 1 1.5 4.3 Bollegala Jaccard Overlap Dice PMI W ebjaccard W eboverlap W ebdice W ebp MI 11-14 1) H(P ) P H(P Q) P AND Q P Q AND W ebp MI N Google 10 10 WebJaccard(P, Q) = WebOverlap(P, Q) = WebDice(P, Q) = WebPMI(P, Q) = log 2 ( H(P Q) H(P ) + H(Q) H(P Q) (11) H(P Q) min(h(p ), H(Q)) 2H(P Q) H(P ) + H(Q) H(P Q) N H(P ) H(Q) N N ) (12) (13) (14) magician 0.6 0.4 0.2 0 PF WeightTF WebJaccard WebOverlap WebDice WebPMI 0 100 200 300 7 5 88 F 7 F Web- Dice WebJaccard 134 F 0.162 WebPMI WebOverlap 4, 15) WebPMI WebOverlap PF WeightTF 90 100 F 0.324 WebDice WebJaccard PF WeightTF F 120 WeightTF Recall Precision F 30-50 F magician 11 5

0.2 0.15 0.1 0.15 0.1 0.05 WeightTF PF 0 0 0.5 1 1.5 2 0.1 0.05 WeightTF PF 0 0 0.5 1 1.5 2 0.05 WeightTF PF 0 0 0.5 1 1.5 2 4 5 6 6 magician PF WeightTF *wizard 49 *sorcerer 9.78 *illusionist 47 clown 7.37 *sorcerer 43 *wizard 5.93 magic 40 *illusionist 5.62 clown 37 *shaman 5.55 *juggler 36 *witch 5.09 priest 36 *enchanter 4.30 *artist 35 priest 4.07 *witch 35 *juggler 4.01 warrior 34 *artist 3.59 10 6 Roget s Thesaurus 5 Lin 3 6 Roget s Thesaurus clown 1) D. Bollegala, Y. Matsuo, and M. Ishizuka. Measuring semantic similarity between words using web search engines. In Proceedings of the 16th International World Wide Web Conference (WWW-07), pages 757-766, 2007. 2) D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 19th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING- ACL-98), pages 768-774, 1998. 3) G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4): pages 235-244, 1990. 4) P. Turney. Mining the Web for Synonyms: PMI- IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML-01), pages 491-502, 2001. 5) R. Girju, A. Badulescu, and D. Moldovan. Automatic Discovery of Part-Whole Relations. In 12

Computational Linguistics, 32(1), pages 83-135, 2006. 6) M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 539-545, 1992. 7) M. Pasca. Organizing and searching the World Wide Web of facts - step two: harnessing the wisdom of the crowds. In Proceedings of the 16th International Conference on World Wide Web (WWW-07), pages 101-110, 2007. 8) M. Porter. An Algorithm for Suffix Stripping. Program, 14, pages 130-137, 1980. Accesible at http://www.tartarus.org/ martin/porterstemmer/. 9) P. Cimiano and J. Wenderoth. Automatic acquisition of ranked qualia structures from the web. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07), pages 888-895, 2007. 10) P. Pantel, D. Ravichandran, and E. Hovy. Towards Terascale Knowledge Acquisition. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), pages 771-777, 2004. 11) P. Pantel and M. Pennacchiotti. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-06), pages 113-120, 2006. 12) P. Roget. Thesaurus of English words and phrases. Longmans, Green and Co., 1911. 13) R. Snow, D. Jurafsky, and A. Ng. Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems 17, 2004. 14),,,. Web., Vol. 14, Number 2, pages 3-31, 2007. 15) Y. Matsuo, J. Mori, M. Hamasaki, K. Ishida, T. Nishimura, H. Takeda, K. Hasida and M. Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In Proceedings of the 15th International Conference on World Wide Web, (WWW-06), 2006. 113-8656 7-3-1 Tel: 03-5841-6751 Fax: 03-5841-6070 E-mail: danushka@iba.t.u-tokyo.ac.jp (2008 2010 DeNA. Bollegala Danushka ( ) 2005 2007 2009,, WWW,ACL ( ) 1997 2002 2005 10 AAAI 13

( ) 1971 1976 NTT 1978 1980-81 Purdue 1992,Web,,.IEEE,AAAI,,,. 14