Vol. 47 No. 10 Oct. 2006 Wikipedia 2 Wiki Wikipedia Web Wikipedia Mining to Construct a Thesaurus Kotaro Nakayama, Takahiro Hara and Shojiro Nishio Thesauri have been widely used in many applications such as information retrieval, natural language processing (NLP), and interactive agents. However, several problems, such as morphological analysis, treatment of synonymous and multisense words, still remain and degrade accuracy on traditional NLP-based thesaurus construction methods. In addition, adding latest/miner words is also a difficult issue on this research area. In this paper, to solve these problems, we propose a web mining method to automatically construct a thesaurus by extracting relations between words from Wikipedia, a wiki-based huge encyclopedia on WWW. 1. WWW Weblog 13) Wiki 15) Web CMS Contents Management System Web WWW Wikiepdia Wikipedia Wiki Wikipedia Web Wikipedia 1 Wikipedia 2005 10 75 Department of Multimedia Engineering, Graduate School of Information Science and Technology, Osaka University 10 2 Wikipedia Web Wikipedia Wiki 2 Web 3 4 3 2917
2918 Oct. 2006 5 2. 2.1 Web WWW WWW Web Web Web HTML Web RDF 1) Web WWW Web Web content Web usage Web structure 3 11) Web Web Web Web 1 5) Web Web Web Web Web Web Web Web Web Web Google(TM) PageRank(TM) 14) HITS 12) Web Dean 9) 2.2 17) IR WordNet 16) EDR 18) 2.2.1 17) 3),7) Brill Tagger 2)
Vol. 47 No. 10 Wikipedia 2919 2.2.2 Web Web 6) Davison 8) Web A Apple <a href="http://en.wikipedia.com/wiki/apple_computer"> Apple </a> Web Web Web Chen 4) Web Web Chen Web Chen 2 1 2 Web Wikipedia Web Web Wikipedia 3. Wikipedia Wikipedia Web 3.1 Web Wikipedia Wikpedia Wiki Wikipedia Web 3.1.1 Wikipedia 1 1 Wikipedia 1 Wikipedia Fig. 1 Wikipedia.
2920 Oct. 2006 Wikipedia Wiki 3.1.2 Wikipedia 65 1,000 Wikipedia Wikipedia 715 3.1.3 Wikipedia 3.1.4 WWW Wikipedia WWW 3.2 Wikipedia Wikipedia Wikipedia Web Wikipedia Wikipedia Web Web 3.2.1 Wikipedia Web P = {p 1,p 2,p 3,..., p n } p i 1 i n Forward Link Backward Link 2 p i Forward Link p i F pi = {f i1,f i2,f i3,..., f im } Backward Link p i B pi = {b i1,b i2,b i3,..., b il } Wikipedia p i Forward Link Backward Link p j p i p j 1 2 3 Forward Link Backward Link n
Vol. 47 No. 10 Wikipedia 2921 Redirect Link Redirect Link Action film Action movie p i Redirect Link R pi = {r i1,r i2,r i3,..., r ik } 2 Forward Link Backward Link Redirect Link p i p j Redirect Link p j Backward Link b i1 b i2 b i3 b j1 b j2 b j3 6 3.2.2 p i RE Algorithm RE(p i,weight,depth) 1 if depth > n then return; 2 F pi = GetF orwardlinks(p i ); 3 for each (p j ) F pi do 4 score = weight/ F pi ; 5 S pj = S pj + score; 6 RE(p j,score,depth+1); 7 B pi = GetBackwardLinks(p i ); 8 for each (p j ) B pi do 9 score = weight/ B pi ; 10 S pj = S pj + score; 11 RE(p j,score,depth+1); 12 R pi = GetRedirectLinks(p i ); 13 for each (p j ) R pi do 14 RE(p j,weight,depth); p i weight 1.0 depth 1 3 1 n 2 6 p i Forward Link 2 Fig. 2 Links. p i Forward Link F pi Backward Link p i Redirect Link S pj p i p j S pj p i 3.3 3.3.1 Apple Computer Apple Apple Computer Apple Backward Link S pi S pi GetSynonym(p i )
2922 Oct. 2006 Algorithm GetSynonym(p i ) 1 B pi = GetBackwardLinks(p i ); 2 for each (p j ) B pi do 3 w = GetLinkT ext(p j ); 4 S w = S w +1; S w 1 w p i GetSynonym(p i ) Apple Computer Backward Link 1 3.3.2 Apple Apple Computer Apple CS w CS(p i,w)= B p i,w B j p j,w B pi,w B pi w UFO CS Unidentified flying object CS 0.65 UFO Unidentified flying object Apple Apple CS 0.44 Apple Computer CS 0.35 Apple 2 CS 1 GetSynonym Table 1 Result of GetSynonym algorithm. w S w Apple 176 Apple Computer 462 Apple Computer Company 1 Apple Computer Corporation 2...... Yahoo! (TM) Yahoo Yahoo Yahoo! CS 0.94 3.4 (1) P = {p 1,p 2,..., p n } (2) p i (3) p i (2) (4) p i n (5) (4) (6) (2) 4. 3 3 47 4.1 2 4.2 Backward Table 2 2 Environment for performance evaluation. CPU Pentium4 3.2 GHz 2GB OS Windows XP C# DB CPU G4 1.42 GHZ 1GB OS Mac OS 10.4 DBMS MySQL 4.1
Vol. 47 No. 10 Wikipedia 2923 3 Fig. 3 Generated words and relations. Link Forward Link Redirect Link RE MySQL B-Tree XML 3 HTML Document Type Definition DTD XML 65 100 Backward Link Forward Link 100 Wikinews Wikipedia Wikipedia 4.3 3 1 n Wikipedia 1 2 3 (1) 1 (2) 30 (3) 5 1 3 5 (4) 10 20 30 is-a is-apart-of
2924 Oct. 2006 CP Concept precision 3) CP = 4 5 3 n 2 17) Wikipedia Chen 4) 1 9,250 Web 762,636 5 52,729,700 1 5 CP Chen Wikipedia 1 Chen d 1 2 3 CP Chen 3 (1) (2) Web (3) Web 30 5 (4) CS (5) (6) Web (7) Web 30 5 (2) Web CS 4.4 1 Forward Link Backward Link 3 3 4 18 30 1 2 2 3 O(A n ) 3 75 Wikipedia Wikipedia 2 3 n 2
Vol. 47 No. 10 Wikipedia 2925 3 Table 3 The influence of distance for performance. 1 2 3 Nintendo 0.05 sec., 328 6.63 sec., 53981 1129.03 sec., 9973687 apple 0.03 sec., 208 3.66 sec., 24217 380.27 sec., 3022035 ipod 0.04 sec., 159 1.71 sec., 11645 205.36 sec., 1647562 Table 5 5 The comparison with other methods. 10 20 30 / 46.2% 35.4% 30.7% 0.34 sec. Chen 1 39.3% 28.1% 22.4% 1.20 sec. Chen 2 50.0% 50.9% 41.7% 121.34 sec. 1 66.7% 64.2% 61.2% 0.04 sec. 2 93.2% 86.2% 83.1% 4.00 sec. 3 91.4% 89.4% 85.9% 571.55 sec. 4 Table 4 The influence of distance for precision. 10 20 30 1 66.7% 64.2% 61.2% 2 93.2% 86.2% 83.1% 3 91.4% 89.4% 85.9% 2 Chen 5 17 1 30 1 SQL Server 2005 (TM) SQL Server 2005 3 Chen 1 2 3 3 Chen 2 121.34 / 1 80 Wikipedia 2 4.00 / Wikipedia 1 15,000 2005 12 Backward Link 100 200 2 1 / URL
2926 Oct. 2006 Fig. 4 4 Backward Link The distribution of backward links. 6 Table 6 The influence of the multiplicity. 10 20 30 65.1% 59.1% 56.8% 88.4% 82.2% 83.3% 2 Arm 3 12 30 30 CP 6 6 CP 4.5 Wikipedia Web Zipf 4 United States United Kingdom square kilometer Marriage World War II Backward Link 100 Backward Link 100 Wikipedia 34,586 1 WordNet WordNet 20 13,735 22,196 30 34,586 WordNet Backward Link 10 188,094 5. Wiki Wikipedia Web Wikinews Web Wikipedia 1 n-gram
Vol. 47 No. 10 Wikipedia 2927 21 COE 18049050 1) Berners-Lee, T., Hendler, J. and Lassila, O.: The Semantic Web, Scientific American, pp.35 43 (2001). 2) Brill, E.: A Simple Rule-based Part of Speech Tagger, Proc. Conference on Applied Computational Linguistics (ACL), pp.112 116 (1992). 3) Chen, H., Yim, T. and Fye, D.: Automatic Thesaurus Generation for an Electronic Community System, Journal of the American Society for Information Science, Vol.46, No.3, pp.175 193 (1995). 4) Chen, Z., Liu, S., Wenyin, L., Pu, G. and Ma, W.Y.: Building a Web Thesaurus from Web Link Structure, Proc. ACM SIGIR, pp.48 55 (2003). 5) Cooley, R., Mobasher, B. and Srivastava, J.: Web Mining: Information and Pattern DiscoveryontheWorldWideWeb,Proc. 9th IEEE International Conference on Tools with Artificial Intelligence, pp.558 567 (1997). 6) Craswell, N., Hawking, D. and Robertson, S.: Effective Site Finding using Link Anchor Information, Proc. ACM SIGIR, pp.250 257 (2001). 7) Crouch, C.J.: A Cluster Based Approach to Thesaurus Construction, Proc. ACM SIGIR, pp.309 320 (1988). 8) Davison, B.D.: Topical Locality in the Web, Proc. ACM SIGIR, pp.272 279 (2000). 9) Dean, J. and Henzinger, M.R.: Finding Related Pages in the World Wide Web, Proc. 8th International World Wide Web Conference, pp.1467 1479 (1999). 10) Edmundson, H.P.: New Methods in Automatic Extracting, J. ACM, Vol.16, No.2, pp.264 285 (1969). 11) Facca, F.M. and Lanzi, P.L.: Mining Interesting Knowledge from Weblogs: A Survey, Data and Knowledge Engineering, Vol.53, No.3, pp.225 241 (2005). 12) Kleinberg, J.M.: Authoritative Sources in a Hyperlinked Environment, J. ACM, Vol.46, No.5, pp.604 632 (1999). 13) Kumar, R., Novak, J., Raghavan, P. and Tomkins, A.: Structure and evolution of blogspace, Comm. ACM, Vol.47, No.12, pp.35 39 (2004). 14) Lawrence, P., Sergey, B., Rajeev, M. and Terry, W.: The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford Digital Library Technologies Project (1999). 15) Leuf, B. and Cunningham, W.: The Wiki Way: Collaboration and Sharing on the Internet, Addison-Wesley (2001). 16) Miller, G.A.: WordNet: A Lexical Database for English, Comm. ACM, Vol.38, No.11, pp.39 41 (1995). 17) Schutze, H. and Pedersen, J.O.: A Cooccurrence-based Thesaurus and Two Applications to Information Retrieval, International Journal of Information Processing and Management, Vol.33, No.3, pp.307 318 (1997). 18) Tseng, Y.H.: Automatic Thesaurus Generation for Chinese Documents, Journal of the American Society for Information Science and Technology, Vol.53, No.13, pp.1130 1138 (2002). ( 17 11 4 ) ( 18 7 4 ) 2001 2003 2004 WWW ACM IEEE
2928 Oct. 2006 1995 1997 2002 2004 1996 2000 IEEE ACM 1975 1980 2002 2000 2003 Data & Knowledge Engineering ACM IEEE 8