06sugiyama.dvi



Similar documents
IPSJ-TOD

1 AND TFIDF Web DFIWF Wikipedia Web Web AND 5. Wikipedia AND 6. Wikipedia Web Ma [4] Ma URL AND Tian [8] Tian Tian Web Cimiano [3] [

kut-paper-template.dvi

1 1 tf-idf tf-idf i

IT i

Mining Social Network of Conference Participants from the Web

untitled


TF-IDF TDF-IDF TDF-IDF Extracting Impression of Sightseeing Spots from Blogs for Supporting Selection of Spots to Visit in Travel Sat

No. 3 Oct The person to the left of the stool carried the traffic-cone towards the trash-can. α α β α α β α α β α Track2 Track3 Track1 Track0 1

1 4 4 [3] SNS 5 SNS , ,000 [2] c 2013 Information Processing Society of Japan

i

A Japanese Word Dependency Corpus ÆüËܸì¤Îñ¸ì·¸¤ê¼õ¤±¥³¡¼¥Ñ¥¹

<> <name> </name> <body> <></> <> <title> </title> <item> </item> <item> 11 </item> </>... </body> </> 1 XML Web XML HTML 1 name item 2 item item HTML

Computational Semantics 1 category specificity Warrington (1975); Warrington & Shallice (1979, 1984) 2 basic level superiority 3 super-ordinate catego

IPSJ SIG Technical Report Vol.2011-MUS-91 No /7/ , 3 1 Design and Implementation on a System for Learning Songs by Presenting Musical St

main.dvi

17 Proposal of an Algorithm of Image Extraction and Research on Improvement of a Man-machine Interface of Food Intake Measuring System

gengo.dvi

& Vol.5 No (Oct. 2015) TV 1,2,a) , Augmented TV TV AR Augmented Reality 3DCG TV Estimation of TV Screen Position and Ro

Web Web Web Web Web, i

untitled


DPA,, ShareLog 3) 4) 2.2 Strino Strino STRain-based user Interface with tacticle of elastic Natural ObjectsStrino 1 Strino ) PC Log-Log (2007 6)

独立行政法人情報通信研究機構 Development of the Information Analysis System WISDOM KIDAWARA Yutaka NICT Knowledge Clustered Group researched and developed the infor

クラスタリング クラスタリングとは クラスタの良さを類似度 目的関数で定義 困難 教師ありクラスタリング 類似度 目的関数ではなく 教師情報 制約を導入 教師情報 制約に一致するクラスタが良い クラスタリング問題を 絶対クラスタリングと相対クラスタリング に分けて考える必要 2

IPSJ SIG Technical Report Vol.2009-BIO-17 No /5/26 DNA 1 1 DNA DNA DNA DNA Correcting read errors on DNA sequences determined by Pyrosequencing

[7] [10] Web Web RDF Resource Description Framework subjectpredicate object Web Web Web Web Web 2 Web Web MUC(Message Understanding Confere

IT,, i

% 95% 2002, 2004, Dunkel 1986, p.100 1

( )

2007/2 Vol. J90 D No Web 2. 1 [3] [2], [11] [18] [14] YELLOW [16] [8] tfidf [19] 2. 2 / 30% 90% [24] 2. 3 [4], [21] 428

IPSJ SIG Technical Report Vol.2012-CG-148 No /8/29 3DCG 1,a) On rigid body animation taking into account the 3D computer graphics came

Vol.55 No (Jan. 2014) saccess 6 saccess 7 saccess 2. [3] p.33 * B (A) (B) (C) (D) (E) (F) *1 [3], [4] Web PDF a m

2. Twitter Twitter 2.1 Twitter Twitter( ) Twitter Twitter ( 1 ) RT ReTweet RT ReTweet RT ( 2 ) URL Twitter Twitter 140 URL URL URL 140 URL URL

kut-paper-template.dvi

. Yahoo! 1!goo 2 QA..... QA Web Web [1]Web Web Yin [2] Web Web Web. [3] Web Wikipedia 1 2

: ( 1) () 1. ( 1) 2. ( 1) 3. ( 2)

24 Region-Based Image Retrieval using Fuzzy Clustering

[1], B0TB2053, i

3 2 2 (1) (2) (3) (4) 4 4 AdaBoost 2. [11] Onishi&Yoda [8] Iwashita&Stoica [5] 4 [3] 3. 3 (1) (2) (3)

情報処理学会研究報告 IPSJ SIG Technical Report Vol.2013-CVIM-186 No /3/15 EMD 1,a) SIFT. SIFT Bag-of-keypoints. SIFT SIFT.. Earth Mover s Distance

(a) 1 (b) 3. Gilbert Pernicka[2] Treibitz Schechner[3] Narasimhan [4] Kim [5] Nayar [6] [7][8][9] 2. X X X [10] [11] L L t L s L = L t + L s

1 Web,.,, Web..,, Web.,,,.,,,., CGI.,, Web, Web.,,. PC,,.

2) TA Hercules CAA 5 [6], [7] CAA BOSS [8] 2. C II C. ( 1 ) C. ( 2 ). ( 3 ) 100. ( 4 ) () HTML NFS Hercules ( )

4. C i k = 2 k-means C 1 i, C 2 i 5. C i x i p [ f(θ i ; x) = (2π) p 2 Vi 1 2 exp (x µ ] i) t V 1 i (x µ i ) 2 BIC BIC = 2 log L( ˆθ i ; x i C i ) + q

(MIRU2008) HOG Histograms of Oriented Gradients (HOG)

1 Web [2] Web [3] [4] [5], [6] [7] [8] S.W. [9] 3. MeetingShelf Web MeetingShelf MeetingShelf (1) (2) (3) (4) (5) Web MeetingShelf

IPSJ SIG Technical Report iphone iphone,,., OpenGl ES 2.0 GLSL(OpenGL Shading Language), iphone GPGPU(General-Purpose Computing on Graphics Proc

Input image Initialize variables Loop for period of oscillation Update height map Make shade image Change property of image Output image Change time L

ActionScript Flash Player 8 ActionScript3.0 ActionScript Flash Video ActionScript.swf swf FlashPlayer AVM(Actionscript Virtual Machine) Windows

IPSJ SIG Technical Report Vol.2010-SLDM-144 No.50 Vol.2010-EMB-16 No.50 Vol.2010-MBL-53 No.50 Vol.2010-UBI-25 No /3/27 Twitter IME Twitte

21 Pitman-Yor Pitman- Yor [7] n -gram W w n-gram G Pitman-Yor P Y (d, θ, G 0 ) (1) G P Y (d, θ, G 0 ) (1) Pitman-Yor d, θ, G 0 d 0 d 1 θ Pitman-Yor G

‰gficŒõ/’ÓŠ¹

258 5) GPS 1 GPS 6) GPS DP 7) 8) 10) GPS GPS ) GPS Global Positioning System

fiš„v5.dvi

IPSJ SIG Technical Report Vol.2010-CVIM-170 No /1/ Visual Recognition of Wire Harnesses for Automated Wiring Masaki Yoneda, 1 Ta

DEIM Forum 2010 A Web Abstract Classification Method for Revie

johnny-paper2nd.dvi

,,,,., C Java,,.,,.,., ,,.,, i

21 A contents organization method for information sharing systems

2015 9


1_26.dvi

Vol. 9 No. 5 Oct (?,?) A B C D 132

1 7.35% 74.0% linefeed point c 200 Information Processing Society of Japan

( : A8TB2163)

skeiji.final.dvi

SOM SOM(Self-Organizing Maps) SOM SOM SOM SOM SOM SOM i

_314I01BM浅谷2.indd

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2

Web サイト作成者によって設定された Web リンク システムが作成した Web リンク データ 静的リンク 学部ページ 大学 ページ 就職関連ページ 入試関連ページ 利用者の要求 データベース 動的リンク データ工学研究室ページ ベース研究室ページ 大学ページ 3. Web 学科ページ データ工

1 Broder Navigational URL URL Informational Web Transactional Web Web Web 2 Broder [16] SearchLife Broder [16] Daniel [17] Broder

IPSJ SIG Technical Report Pitman-Yor 1 1 Pitman-Yor n-gram A proposal of the melody generation method using hierarchical pitman-yor language model Aki

IPSJ SIG Technical Report Vol.2017-SLP-115 No /2/18 1,a) 1 1,2 Sakriani Sakti [1][2] [3][4] [5][6][7] [8] [9] 1 Nara Institute of Scie

IPSJ SIG Technical Report Vol.2009-DPS-141 No.20 Vol.2009-GN-73 No.20 Vol.2009-EIP-46 No /11/27 1. MIERUKEN 1 2 MIERUKEN MIERUKEN MIERUKEN: Spe

Vol. 48 No. 3 Mar PM PM PMBOK PM PM PM PM PM A Proposal and Its Demonstration of Developing System for Project Managers through University-Indus

22 Google Trends Estimation of Stock Dealing Timing using Google Trends

SERPWatcher SERPWatcher SERP Watcher SERP Watcher,

WII-D 2017 (1) (2) (1) (2) [Tanaka 07] [ 04] [ 10] [ 13, 13], [ 08] [ 13] (1) (2) 2 2 e.g., Wikipedia [ 14] Wikipedia [ 14] Linked Open

IPSJ SIG Technical Report Vol.2010-GN-74 No /1/ , 3 Disaster Training Supporting System Based on Electronic Triage HIROAKI KOJIMA, 1 KU

2

自然言語処理24_705

1 Fig. 2 2 Fig. 1 Sample of tab UI 1 Fig. 1 that changes by clicking tab 5 2. Web HTML Adobe Flash Web ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) 3 Web 2.1 Web Goo

BOK body of knowledge, BOK BOK BOK 1 CC2001 computing curricula 2001 [1] BOK IT BOK 2008 ITBOK [2] social infomatics SI BOK BOK BOK WikiBOK BO

3807 (3)(2) ,267 1 Fig. 1 Advertisement to the author of a blog. 3 (1) (2) (3) (2) (1) TV 2-0 Adsense (2) Web ) 6) 3

main.dvi

2003/3 Vol. J86 D II No Fig. 1 An exterior view of eye scanner. CCD [7] CCD PC USB PC PC USB RS-232C PC

IPSJ SIG Technical Report Secret Tap Secret Tap Secret Flick 1 An Examination of Icon-based User Authentication Method Using Flick Input for

IPSJ SIG Technical Report Vol.2009-HCI-134 No /7/17 1. RDB Wiki Wiki RDB SQL Wiki Wiki RDB Wiki RDB Wiki A Wiki System Enhanced by Visibl

Vol. 42 No MUC-6 6) 90% 2) MUC-6 MET-1 7),8) 7 90% 1 MUC IREX-NE 9) 10),11) 1) MUCMET 12) IREX-NE 13) ARPA 1987 MUC 1992 TREC IREX-N

DEIM Forum 2009 E

[4], [5] [6] [7] [7], [8] [9] 70 [3] 85 40% [10] Snowdon 50 [5] Kemper [3] 2.2 [11], [12], [13] [14] [15] [16]

Studies of Foot Form for Footwear Design (Part 9) : Characteristics of the Foot Form of Young and Elder Women Based on their Sizes of Ball Joint Girth

ディスプレイと携帯端末間の通信を実現する映像媒介通信技術

(2008) JUMAN *1 (, 2000) google MeCab *2 KH coder TinyTextMiner KNP(, 2000) google cabocha(, 2001) JUMAN MeCab *1 *2 h

2017 (413812)

IPSJ SIG Technical Report Vol.2009-DBS-149 No /11/ Bow-tie SCC Inter Keyword Navigation based on Degree-constrained Co-Occurrence Graph

,.,. NP,., ,.,,.,.,,, (PCA)...,,. Tipping and Bishop (1999) PCA. (PPCA)., (Ilin and Raiko, 2010). PPCA EM., , tatsukaw

Wikipedia YahooQA MAD 4)5) MAD Web 6) 3. YAMAHA 7) 8) Vocaloid PV YouTube 1 minato minato ussy 3D MAD F EDis ussy

Transcription:

Web Web Web Web Personal Name Disambiguation in Web Search Results Using a Semi-Supervised Clustering Approach Kazunari Sugiyama and Manabu Okumura Personal names are often submitted to search engines as query keywords. However, in response to a personal name query, search engines return a long list of search results that contains Web pages about several namesakes. In order to address this problem, most of the previous works that disambiguate personal names in Web search results often employ agglomerative clustering approaches. In contrast, we have adopted a semi-supervised clustering approach to integrate similar documents into a seed document. Our proposed semi-supervised clustering approach is novel in that it controls the fluctuation of the centroid of a cluster. Key Words: Web information retrieval, Semi-supervised clustering, Personal name disambiguation 1 ALLTheWeb 1 1 2 Web, Department of Computer Science, National University of Singapore, Precision and Intelligence Laboratory, Tokyo Institute of Technology 1 http://www.alltheweb.com/ 2 http://tap.stanford.edu/peoplesearch.pdf

Vol. 16 No. 5 October 2009 Google 3 William Cohen Web Web (Mann and Yarowsky 2003) (Pedersen, Purandare, and Kulkarni 2005) (Bekkerman, El-Yaniv, and McCallum 2005) (Bollegala, Matsuo, and Ishizuka 2006) Web seed seed Web (1) (2) Wagstaff (Wagstaff and Cardie 2000) (Wagstaff, Rogers, and Schroedl 2001) K-means must-link 2 cannot-link 2 2 Basu (Basu, Banerjee, and Mooney 2002) K-means Klein (Klein, Kamvar, and Manning 2002) 2 (x i,x j ) 0 2 (max i,j D ij )+1 Xing (Xing, Ng, Jordan, and Russell 2003) Bar-Hillel (Bar-Hillel, Hertz, and Shental 2003) RCA (Relevant Component Analysis) (Shental, Hertz, Weinshall, and Pavel 2002) seed 2 3 3 http://www.google.com/ 24

Web 4 2 1 K K-means (MacQueen 1967) Web (Wagstaff and Cardie 2000) (Wagstaff et al. 2001) (Basu et al. 2002) (Klein et al. 2002) (Xing et al. 2003) (Bar-Hillel et al. 2003) seed seed (1) seed seed (2) Web seed W p Web p i w pi (i =1,,n) (1) w pi =(w pi t 1,w pi t 2,,w pi t m ) (1) m W p t k (k =1, 2,,m) (a) Term Frequency (TF) (b) Inverse Document Frequency (IDF) (c) residual IDF (RIDF) (d) TF-IDF (e) x I -measure (f) gain 6 25

Vol. 16 No. 5 October 2009 (a) Term Frequency (TF) TF tf(t k,p i ) Web p i t k w pi w pi t k (2) tf(t w pi k,p i ) t k = m s=1 tf(t s,p i ) (b) Inverse Document Frequency (IDF) (Jones 1973) IDF w pi w pi t k (3) N w pi t k =log df (t k ) N Web df (t k ) t k Web (c) Residual Inverse Document Frequency (RIDF) Church and Gale (Church and Gale 1995a, 1995b) IDF residual IDF IDF IDF cf k t k N Web 1 Web t k λ k = cf k N wpi w pi t k (4) 1 w pi t k = IDF log 1 p(0; λ i ) N =log df (t k ) +log(1 p(0; λ k)) (4) p λ k RIDF (d) TF-IDF TF-IDF (Salton and McGill 1983) TF-IDF (a) TF (b) IDF (5) w pi t k = tf(t k,p i ) m s=1 tf(t s,p i ) log N df (t k ) (2) (3) (5) 26

Web tf(t k,p i ) df (t k ) Web p i t k t k Web N Web (e) x I -measure Bookstein and Swanson (Bookstein and Swanson 1974) t k x I -measure tf(t k,p i ) Web p i t k df (t k ) t k Web w pi w pi t k (6) w pi t k = tf(t k,p i ) df (t k ) (6) 2 (f) gain IDF Papineni (Papineni 2001) IDF gain w pi w pi t k (7) w pi t k = df (t ( k) df (tk ) N N 1 log df (t ) k) (7) N df (t k ) t k Web N Web (a) (f) (f) gain C G C (8) G C =(g C t 1,g C t 2,,g C t m ) (8) gt C k G C t k (k =1, 2,,m) 2 C i C j sim(c i,c j ) (9) G sim(c i,c j )= GCi Cj G Ci G Cj (9) G Ci G Cj C i C j 2.1 Web 27

Vol. 16 No. 5 October 2009 1 C i ( n i ) C j ( n j ) C new G new (10) w G new = p C i w p + w p C j w p (10) n i + n j 2.2 seed C sj seed C i C i ( G Ci ) seed C sj ( G C ) D(G Ci, G C ) Web p w p C i k j seed C s (kj) j ( n sj ) C i ( n i ) k j +1 1 28

Web C (0) (1) C (kj) seed C i C s (kj) j j G C(k ) C i G Ci D(G Ci, G C(k j ) ) C i Web w p l C i (l =1,,n i ) C i ( n i ) C i Web w p l C i (11) w p l C = w pl C i i D(G Ci, G C(k j ) )+c (11) D(G Ci, G C(k j ) ) (i) (ii) (iii) c D(G Ci, G C(k j ) ) 0 w p c 3.3.1 (2) seed C s (kj) j ( n sj ) C i ( n i ) C s (kj+1) j ( n sj + n i ) C (kj+1) = {w p1 C (k j ),, w pnsj C (k j ), w p1 C i,, wpn i C i } (3) k j +1 C (kj+1) G C(k j +1) (12) (11) w p l C i n i G C(k j +1) = w p C (k j +1) n sj + n i 1 w p D(G C i,g C(k j ) )+c (12) seed 2 G new (13) w G new = p C i w p + w p C j w p (13) n i + n j seed seed Web seed 29

Vol. 16 No. 5 October 2009 seed (Rocchio 1971) seed Clusty 4 Web seed seed 2 Web seed seed 2 7 else if (11) (i) (ii) (iii) 3 (i) (11) seed G Cs C G C D(G Cs, G C ) (8) (14) D(G Cs, G C )= m (gt Cs k gt C k ) 2 (14) k=1 (ii) (11) seed C s G Cs C G C D(G Cs, G C ) (15) D(G C(s), G C )= (G Cs G C ) T Σ 1 (G Cs G C ) (15) 4 http://clusty.com 30

Web 2 31

Vol. 16 No. 5 October 2009 Σ seed C s C s C s = {w p1 C s, w p2 C s,, w pm C s } G Cs G Cs = 1 m m i=1 w pi C s Σ ij (16) Σ ij = 1 m m i=1 (w pi C s G Cs )(w pj C s G Cs ) T (16) Σ Σ 11 Σ 12 Σ 1m Σ 21 Σ 22 Σ 2m Σ =...... Σ m1 Σ m2 Σ mm (iii) (ii) seed C sj Web C sj G C C l G C l (Diday and Govaert 1977) (1) C sj Web w pi v (w pi v) d sj (w pi, v) (17) d sj (w pi, v) =(w pi v) T M 1 (w pi v) (17) M sj C sj C sj C sj = {w p1 C sj, w p2 C sj,, w pm C sj } G C 32

Web G C = 1 m m i=1 w pi C sj M ij (18) M ij = 1 m m i=1 (w pi C sj G Cs p j )(w j C sj G C ) T (18) M sj M 11 M 12 M 1m M 21 M 22 M 2m M sj =........ M m1 M m2 M mm (2) Δ sj (v, M sj )= w p i C sj d sj (w pi, v) = (w pi v) T M 1 (w pi v) w p i C sj C sj L sj S sj (i) C sj M sj Δ sj L sj L sj =argmin v w p i C sj (w pi v) T M 1 (w pi v) (19) (19) C sj G G v G L sj = v G (ii) (i) L sj = v G det(m sj )=1 Δ sj S sj S sj =argmin M sj w p i C sj (w pi v G ) T M 1 (w pi v G ) (20) S sj C sj M sj (21) (Diday and Govaert 1977) 33

Vol. 16 No. 5 October 2009 S sj =(det(m sj )) 1/m M 1 (21) det(m sj ) 0 m seed C sj Web S sj S sj C sj G C Cl G C l (22) D(G C, G C l )= (G C G C l ) T S 1 (G C G C l ) (22) (22) (1) (2) C sj Web (21) S sj (15) 3 3.1 Web People Search Task (Artiles, Gonzalo, and Sekine 2007) WePS WePS 49 30 79 Yahoo! 5 API 100 7,900 Web 1 Web 6 Porter Stemmer (Porter 1980) 7 WePS WePS 3.2 purity inverse purity F (Hotho, Nürnberger, and Paaß 2005) Web People Search Task purity C 5 http://www.yahoo.com/ 6 ftp://ftp.cs.cornell.edu/pub/smart/english.stop 7 http://www.tartarus.org/ martin/porterstemmer/ 34

Web 1 WePS (A) (A) 1 Wikipedia 7 23.1 ECDL 06 10 15.3 32 5.90 49 10.8 Wikipedia 10 56.5 ACL 06 10 31.0 10 50.3 30 45.9 *ECDL: European Conference on Digital Libraries, ACL: Association for Computational Linguistics L n purity (23) Purity = i C i n max P recision(c i,l j ) (23) L j C i P recision(c i,l j ) (24) P recision(c i,l j )= C i Lj (24) C i inverse purity inverse purity (25) InversePurity = j L j n max Recall(C i,l j ) (25) L j C i Recall(C i,l j ) (26) Recall(C i,l j )= C i Lj (26) L j purity inverse purity F (27) 1 F = α 1 Purity +(1 α) 1 InverseP urity (27) 35

Vol. 16 No. 5 October 2009 α =0.5 0.2 α =0.5 0.2 F F 0.5 F 0.2 3.3 2 seed (a) Wikipedia (Remy 2002) (b) Web Web 3.3.1 c seed C sj C i 2 (11) C i Web w p l C i (l =1,,n i ) (12) (11) c D(G Ci, G C ) 0 w p WePS 2 seed (a) (b) 7 seed 0.1 c 50 seed 7 seed 2 4 c F 0.5 F 0.2 3.3.3 seed 2 4 c WePS 2 c Seed page c F 0.5 F 0.2 Seed page c F 0.5 F 0.2 1 Wikipedia article 0.94 0.63 0.61 1 Web page 0.95 0.61 0.58 2 Wikipedia articles 0.93 0.65 0.64 2 Web pages 0.94 0.63 0.61 3 Wikipedia articles 0.95 0.66 0.66 3 Web pages 0.96 0.64 0.65 4 Wikipedia articles 0.97 0.67 0.68 4 Web pages 0.93 0.64 0.66 5 Wikipedia articles 0.98 0.68 0.69 5 Web pages 0.97 0.66 0.67 6 Wikipedia articles 0.96 0.67 0.70 6 Web pages 0.94 0.64 0.67 7 Wikipedia articles 0.97 0.66 0.68 7 Web pages 0.96 0.62 0.65 36

Web 3 c Seed page c F 0.5 F 0.2 Seed page c F 0.5 F 0.2 1 Wikipedia article 0.95 0.64 0.61 1 Web page 0.93 0.62 0.59 2 Wikipedia articles 0.94 0.66 0.63 2 Web pages 0.96 0.63 0.61 3 Wikipedia articles 0.96 0.68 0.66 3 Web pages 0.95 0.66 0.64 4 Wikipedia articles 0.98 0.70 0.68 4 Web pages 0.95 0.67 0.68 5 Wikipedia articles 0.97 0.71 0.70 5 Web pages 0.94 0.69 0.70 6 Wikipedia articles 0.95 0.69 0.71 6 Web pages 0.93 0.68 0.69 7 Wikipedia articles 0.96 0.67 0.70 7 Web pages 0.95 0.66 0.68 4 c Seed page c F 0.5 F 0.2 Seed page c F 0.5 F 0.2 1 Wikipedia article 0.97 0.66 0.67 1 Web page 0.96 0.65 0.63 2 Wikipedia articles 0.95 0.68 0.68 2 Web pages 0.98 0.66 0.65 3 Wikipedia articles 0.94 0.70 0.71 3 Web pages 0.93 0.68 0.69 4 Wikipedia articles 0.96 0.72 0.73 4 Web pages 0.95 0.70 0.72 5 Wikipedia articles 0.98 0.75 0.74 5 Web pages 0.98 0.73 0.73 6 Wikipedia articles 0.97 0.73 0.72 6 Web pages 0.96 0.71 0.72 7 Wikipedia articles 0.95 0.71 0.70 7 Web pages 0.97 0.69 0.71 5 Purity Inverse purity F 0.5 F 0.2 0.67 0.48 0.52 0.49 3.3.2 (1) 5 (2) seed seed 3.3 2 seed (a) Wikipedia (b) 1 Web 3.1 WePS Wikipedia Wikipedia seed Web 1 Web WePS 30 16 Wikipedia 14 1 Web seed 37

Vol. 16 No. 5 October 2009 Wikipedia Bunescu (Bunescu and Pasca 2006) Wikipedia 6 seed seed F (F 0.5 =0.68 F 0.2 =0.66) seed 3.3.1 seed seed 7 2 seed cannot-link Web 3 4, Wikipedia 7 Web (F ) 2 seed 1 Klein (Klein et al. 2002) Xing (Xing et al. 2003) Bar-Hillel (Bar-Hillel et al. 2003) seed 3.3.3 3.3.2 Web seed Web seed 3 5 6 1 seed Distance measure Seed page* Purity Inverse purity F 0.5 F 0.2 (i) Euclidean distance (a) 0.47 0.82 0.60 0.62 (b) 0.49 0.77 0.58 0.61 (ii) Mahalanobis distance (a) 0.50 0.85 0.62 0.64 (b) 0.53 0.75 0.64 0.65 (iii) Adapative Mahalanobis distance (a) 0.57 0.88 0.68 0.66 (b) 0.55 0.76 0.65 0.64 *(a) and (b) in Seed page denote Wikipedia article and top-ranked Web page, respectively. 38

Web 3 seed (7 Wikipedia ) 4 seed ( 7 Web ) Wikipedia (F 0.5 =0.76,F 0.2 =0.74) (i) seed Web (ii) (i) WePS F seed Web 5 (ii) 39

Vol. 16 No. 5 October 2009 5 3 5 seed (Wikipedia ), seed Web ( w s, ) 6 3 5 seed (Wikipedia ) seed ( w s, ) WePS F seed 6 (i) (ii) 7 3.3.4 Web People Search Task 3 (F ) 7 7 40

Web 7 Web People Search Task 3 Team-ID Purity Inverse purity F 0.5 F 0.2 CU COMSEM (Chen and Martin 2007) 0.72 0.88 0.78 0.83 IRST-BP (Popescu and Magnini 2007) 0.75 0.80 0.75 0.77 PSNUS (Elmacioglu, Tan, Yan, Kan, and Lee 2007) 0.73 0.82 0.75 0.78 Our proposed method (with adaptive Mahalanobis distance) Using full text (Sec. 3.3.2) 1 Wikipedia article 0.58 0.84 0.68 0.66 1 Web page 0.60 0.76 0.65 0.64 5 Wikipedia articles 0.75 0.78 0.76 0.74 5 Web pages 0.77 0.69 0.72 0.73 Using fragments (Sec. 3.3.3) (i) 2 and 3 sentences in 5 Wikipedia seed pages and a search result Web page, respectively 0.80 0.83 0.81 0.82 (ii) Snippet and 3 sentences in 5 Wikipedia seed pages 0.70 0.62 0.66 0.68 3.4 3.3.2 (11) seed Web 7 Wikipedia 7 Web seed 3.3.2 PC (CPU: Intel Pentium M 2.0 GHz Memory: 2 GByte OS: Windows XP) Perl 7 3.5 (11) c 2 4 c =0.95 5 c 50 (11) c Web 41

Vol. 16 No. 5 October 2009 7 seed 5 purity (0.67) inverse purity (0.48) purity F F 0.5 =0.52 F 0.2 =0.49 2 6 purity (0.47 0.57) 5 purity (0.67) inverse purity (0.75 0.88) (0.48) inverse purity F seed seed 6 seed Wikipedia F (F 0.5 =0.68 F 0.2 =0.66) seed 3 4 seed (F ) seed 7 seed 5 Web Web purity 42

Web Bar-Hillel (Bar-Hillel et al. 2003) Xing (Xing et al. 2003) Klein (Klein et al. 2002) 1 Klein 2 (x i,x j ) 0 2 (max i,j D ij )+1 Xing Bar-Hillel seed seed seed Wikipedia Web Web seed Wikipedia WePS 3.3.3 (i) seed Web 5 Web 5 seed Web 2 3 F (F 0.5 =0.79 F 0.2 =0.80) WePS [purity:0.80 inverse purity:0.83, F 0.5 =0.81 F 0.2 =0.82] F α =0.5 7 Web People Search Task (Artiles et al. 2007) 1 (CU COMSEM) 0.03 3.3.2 (2) Wikipedia 16 Wikipedia 10 ( 1 (A) ) ACL 06 Wikipedia 6 1 (B) Wikipedia seed 8 (A) (B) 0.02 0.04 seed Wikipedia (B) Web Wikipedia 43

Vol. 16 No. 5 October 2009 seed seed Web 7 inverse purity Web People Search Task 3 seed 5 2 gain Web People Search Task 3 3 9 7 seed Wikipedia 5 CU COMSEM 7 F (F 0.5 =0.78 F 0.2 =0.83) F (F 0.5 =0.81 F 0.2 =0.84) F 0.5 0.03 F 0.2 0.01 URL IRST-BP 7 F (F 0.5 =0.75 F 0.2 =0.77) (F 0.5 =0.76 F 0.2 =0.81) F 0.5 0.01 F 0.2 0.04 8 Wikipedia seed (A) Wikipedia 10 (B) Wikipedia ACL 06 6 F 0.5 F 0.2 F 0.5 F 0.2 (A) 0.65 0.75 0.72 0.77 (B) 0.63 0.71 0.69 0.74 9 Web People Search Task 3 Team-ID Purity Inverse purity F 0.5 F 0.2 CU COMSEM 0.70 0.92 0.81 0.84 IRST-BP 0.69 0.82 0.76 0.81 PSNUS 0.67 0.84 0.78 0.82 44

Web PSNUS NE TF-IDF 7 F (F 0.5 =0.75 F 0.2 =0.78) F F 0.5 =0.78 F 0.2 =0.82 F 0.5 0.03 F 0.2 0.04 gain 7 F 0.5 =0.81 F 0.2 =0.82 F CU COMSEM F gain Web 5 F (F 0.5 =0.52 F 0.2 =0.49) F 0.5 0.29 F 0.2 0.33 WePS 3.3.3 (ii) seed 6 seed 6 seed 3 F (F 0.5 =0.64 F 0.2 =0.67) WePS [purity:0.70 inverse purity:0.62 F 0.5 =0.66 F 0.2 =0.68] Web People Search Task 3 seed 3 Web seed Wikipedia seed 2 Web 3 (22) 2 1 100 Web 5 seed seed Web 0.8 7 45

Vol. 16 No. 5 October 2009 4 Web seed [purity:0.80 inverse purity:0.83 F 0.5 :0.81 F 0.2 :0.82] seed cannot-link seed seed Web seed Web Web seed seed Artiles, J., Gonzalo, J., and Sekine, S. (2007). The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task. In Proc. of the Semeval 2007, Association for Computational Linguistics (ACL), pp. 64 69. Bar-Hillel, A., Hertz, T., and Shental, N. (2003). Learning Distance Functions Using Equivalence Relations. In Proc. of the 20th International Conference on Machine Learning (ICML 2003), pp. 577 584. Basu, S., Banerjee, A., and Mooney, R. (2002). Semi-supervised Clustering by Seeding. In Proc. of the 19th International Conference on Machine Learning (ICML 2002), pp. 27 34. Bekkerman, R., El-Yaniv, R., and McCallum, A. (2005). Multi-way Distributional Clustering via Pairwise Interactions. In Proc. of the 22nd International Conference on Machine Learning (ICML2005), pp. 41 48. Bollegala, D., Matsuo, Y., and Ishizuka, M. (2006). Extracting Key Phrases to Disambiguate Personal Names on the Web. In Proc. of the 7th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2006), pp. 223 234. 46

Web Bookstein, A. and Swanson, D. R. (1974). Probabilistic Models for Automatic Indexing. Journal of the American Society for Information Science, 25 (5), pp. 312 318. Bunescu, R. and Pasca, M. (2006). Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proc. of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), pp. 9 16. Chen, Y. and Martin, J. (2007). CU-COMSEM: Exploring Rich Features for Unsupervised Web Personal Name Disambiguation. In Proc. of the Semeval 2007, Association for Computational Linguistics (ACL), pp. 125 128. Church, K. W. and Gale, W. A. (1995a). Inverse Document Frequency (IDF): A Measure of Deviation from Poisson. In Proc. of the 3rd Workshop on Very Large Corpora, pp. 121 130. Church, K. W. and Gale, W. A. (1995b). Poisson Mixtures. Journal of Natural Language Engineering, 1 (2), pp. 163 190. Diday, E. and Govaert, G. (1977). Classification Automatique Avec Distances Adaptatives. R.A.I.R.O. Informatique Computer Science, 11 (4), pp. 329 349. Elmacioglu, E., Tan, Y. F., Yan, S., Kan, M.-Y., and Lee, D. (2007). PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features. In Proc.oftheSemeval 2007, Association for Computational Linguistics (ACL), pp. 268 271. Hotho, A., Nürnberger, A., and Paaß, G. (2005). A Brief Survey of Text Mining. GLDV-Journal for Computational Linguistics and Language Technology, 20 (1), pp. 19 62. Jones, K. S. (1973). Index Term Weighting. Information Strage and Retrieval, 9 (11), pp. 619 633. Klein, D., Kamvar, S. D., and Manning, C. D. (2002). From Instance-level Constraints to Spacelevel Constraints: Making the Most of Prior Knowledge in Data Clustering. In Proc. of the 19th International Conference on Machine Learning (ICML 2002), pp. 307 314. MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proc. of the 5th Berkeley Symposium on Mathmatical Statistics and Probability, pp. 281 297. Mann, G. S. and Yarowsky, D. (2003). Unsupervised Personal Name Disambiguation. In Proc. of the 7th Conference on Natural Language Learning (CoNLL-2003), pp. 33 40. Papineni, K. (2001). Why Inverse Document Frequency? In Proc. of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), pp. 25 32. Pedersen, T., Purandare, A., and Kulkarni, A. (2005). Name Discrimination by Clustering Similar Contexts. In Proc. of the 6th International Conference on Computational Linguistics 47

Vol. 16 No. 5 October 2009 and Intelligent Text Processing (CICLing 2005), pp. 226 237. Popescu, O. and Magnini, B. (2007). IRST-BP: Web People Search Using Name Entities. In Proc. of the Semeval 2007, Association for Computational Linguistics (ACL), pp. 195 198. Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program, 14 (3), pp. 130 137. Remy, M. (2002). Wikipedia: The Free Encyclopedia. Online Information Review, 26 (6), p. 434. Rocchio, J. (1971). Relevance Feedback in Information Retrieval. In Salton, G. (Ed.), The Smart Retrieval System: Experiments in Automatic Document Processing, pp. 313 323. Prentice-Hall, Englewood Cliffs, NJ. Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill. Shental, N., Hertz, T., Weinshall, D., and Pavel, M. (2002). Adjustment Learning and Relevant Component Analysis. In Proc. of the 7th European Conference on Computer Vision (ECCV 2002), pp. 776 792. Wagstaff, K. and Cardie, C. (2000). Clustering with Instance-level Constraints. In Proc.ofthe 17th International Conference on Machine Learning (ICML 2000), pp. 1103 1110. Wagstaff, K., Rogers, S., and Schroedl, S. (2001). Constrained K-means Clustering with Background Knowledge. In Proc. of the 18th International Conference on Machine Learning (ICML 2001), pp. 577 584. Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. J. (2003). Distance Metric Learning with Application to Clustering with Side-Information. Advances in Neural Information Processing Systems, 15, pp. 521 528. 1998 2000 KDDI 2004 2006 2009 IEEE ACM AAAI 1984 1989 1992 2000 2009 48

Web AAAI ACL 2009 2 6 2009 7 16 2009 8 19 49