DEIM Forum 2015 B1-5

606-8501 606-8501
E-mail: komurasaki@dl.kuis.kyoto-u.ac.jp, tajima@i.kyoto-u.ac.jp

Web Web AND AND Web

1.
Twitter Facebook SNS Web Web Web Web [5] Bollegala [2] Web Web 1 Google Microsoft Bing Cimiano [3] Web Web Web Web Web Web Web

1 4,730,000 660,000 0.993 0.830

Web Satoh [7]
1 AND TFIDF Web DFIWF Wikipedia Web Web

2. 3. 4. AND 5. Wikipedia AND 6. Wikipedia Web 7. 8.

2.
Ma [4] Ma URL AND Tian [8] Tian Tian Web Cimiano [3] [5] AND Bollegala [2] AND SVM Satoh [7] Uyar [9] Satoh Web Uyar Google Yahoo Microsoft Satoh Uyar

3.

3.1
500 1000 1 Microsoft Bing Search API 2 2 0.7 0.9 0.5

1
2 https://datamarket.azure.com/dataset/bing/search
Figure 1: Histogram of Famousness (axes: Famousness, 0.0 to 1.0; Frequency, 0 to 200). Bar counts: 2, 8, 5, 8, 13, 7, 20, 12, 11, 12, 15, 31, 37, 61, 54, 97, 115, 169, 184, 139.

0.0246 log( ) 0.220 4.

Figure 2: Famousness (0.0 to 1.0) plotted against Hitcount.logarithm.

3 93

3.2
1000 Bing Search API 20 20000 MeCab [1]

3.3
3 5 7 10 1000 1 Web 0.220 3. Web 20 AND

4.1 TFIDF
TFIDF (term frequency / inverse document frequency) [6] AND

4.1.1 TFIDF
TFIDF $C$ $c_i$ $D_i$ $w$ $TF_{w,i}$ $D_i$ $w$ D C i $d_i$ $DF_{w,i}$ 1000 20 1 1000 $w$ $DF_{w,i}$ $IDF_{w,i}$

$$IDF_{w,i} = \log\left(\frac{|D_i|}{DF_{w,i}}\right) \tag{1}$$

$TF_{w,i}$ $IDF_{w,i}$ $TFIDF_{w,i}$

$$TFIDF_{w,i} = TF_{w,i} \cdot IDF_{w,i} \tag{2}$$

$TFIDF_{w,i}$ $c_i$ $C$ $c_i$

4.1.2
3.2 TFIDF
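Equations (1) and (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: `docs` stands for the document set $D_i$ retrieved for one person $c_i$, each document is already tokenized (for example by MeCab), $TF_{w,i}$ is counted over all of $D_i$, and $DF_{w,i}$ is the number of documents in $D_i$ that contain $w$. The function name `tfidf_scores` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {word: TFIDF} for one person's document set D_i.

    docs: list of token lists, one list per retrieved document (assumed input).
    """
    tf = Counter()  # TF_{w,i}: occurrences of w over all documents in D_i
    df = Counter()  # DF_{w,i}: number of documents in D_i containing w
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    n_docs = len(docs)
    # Equation (1): IDF = log(|D_i| / DF); Equation (2): TFIDF = TF * IDF
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}


# Hypothetical toy input: three tokenized pages for one person.
docs = [["actor", "film"], ["actor", "award"], ["actor", "film", "film"]]
scores = tfidf_scores(docs)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A word that occurs in every document gets an IDF of zero under this reading, so only words concentrated in a subset of the pages receive a positive score.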
2 0.0255 log( ) 0.212 AND 2 TFIDF AND AND 13,300,000 21 Web

4.2 DFIWF
4.1 TFIDF AND DFIWF (document frequency / inverse web frequency)

4.2.1 DFIWF
DFIWF DFIWF Web $C$ $c_i$ $d_i$ $D = \{d_1, d_2, \ldots, d_n\}$ $w$ $D$ $DF_w$ $w$ $WF_w$ $IWF_w$

$$IWF_w = \frac{1}{\log(WF_w)} \tag{3}$$

$DFIWF_w$

$$DFIWF_w = DF_w \cdot IWF_w \tag{4}$$

$DFIWF_w$ $D$ Web $w$ $w$ Web

4.2.2
3.2 DFIWF 3 1000 AND 3 1000 AND

Table 3: DFIWF
     DF      WF          DFIWF
1    1000    5940000     64.1     0.489
2    1000    11300000    61.6     0.303
3    1000    12300000    61.3     0.304

5 DF 1000 4.1 DFIWF TFIDF 2 0.489 DFIWF 3 3 AND

5. AND
4. TFIDF DFIWF TFIDF DFIWF

5.1
15
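The DFIWF score of Equations (3) and (4) can be checked against the three rows of Table 3 with a short Python sketch. It assumes Equation (3) reads $IWF_w = 1 / \log(WF_w)$ with a natural logarithm, which is the reading that reproduces the tabulated DFIWF values; the function names `iwf` and `dfiwf` are hypothetical.

```python
import math


def iwf(web_frequency):
    """Inverse web frequency, Equation (3): 1 / log(WF_w), natural log assumed."""
    if web_frequency <= 1:
        raise ValueError("web_frequency must exceed 1 so that log(WF_w) > 0")
    return 1.0 / math.log(web_frequency)


def dfiwf(document_frequency, web_frequency):
    """Equation (4): DFIWF_w = DF_w * IWF_w."""
    return document_frequency * iwf(web_frequency)


# (DF, WF) pairs taken from Table 3.
rows = [(1000, 5_940_000), (1000, 11_300_000), (1000, 12_300_000)]
table3_scores = [round(dfiwf(df, wf), 1) for df, wf in rows]
print(table3_scores)  # [64.1, 61.6, 61.3]
```

Because the score divides by the logarithm of the web frequency, a word that appears in all 1000 retrieved documents but is rarer on the Web as a whole ranks higher.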
5.2
5.1 AND Web 4. TFIDF Web Web AND AND Web AND AND $T$ $c_i$ $t_j$ $T$ AND $h_{i,t_j}$ $famousness_i$ $W = \{w_1, w_2, \ldots, w_n\}$

$$famousness_i = w_1 h_{i,t_1} + w_2 h_{i,t_2} + \cdots + w_n h_{i,t_n} = \sum_{j=1}^{n} w_j h_{i,t_j} \tag{5}$$

(5) AND $W$ $W$ AND

5.3
1000 500 5.2 Leave-one-out 499 AND $W$ $W$ 1 AND 500 5.3 AND 5 $w_i$ 0

4 0.422 0.0357 0.259 0.449

5.3 0.0357 0.449 AND 0 4. Satoh [7]

6.
5. Web Web

6.1
Web 3 3

6.2
Web Web Web Web
Web Web 5.3

6.2.1
$Category_c$ $W = \{w_0, w_1, \ldots, w_n\}$ $c$ $C$ $w_i$ $f_{c,i}$

$$Category_c = \{f_{c,0}, f_{c,1}, \ldots, f_{c,n}\}$$

6.2.2
AND $t$ $X$ $c$ $Y$ $t$ $c$ AND $X \cap Y$ $b_{t,c}$

$$b_{t,c} = \frac{X \cap Y}{X + Y} \tag{6}$$

6.2.3
6.2.1 6.2.2 $t$ $c$ $Category_c$ $t$ $c$ $b_{t,c}$ $V_t$ 6.2.1 $w_i$ $f_{t,i}$

$$V_t = \{f_{t,0}, f_{t,1}, \ldots, f_{t,n}\}$$

$c$ $Category_c$ $t$ $c$ $b_{t,c}$ $V_t$ $t$ $Celebrity_t$ $t$ $C_t$

$$Celebrity_t = \sum_{c \in C_t} (b_{t,c} \cdot Category_c) + V_t \tag{7}$$

7 $Celebrity_t$ $Celebrity_t$

6.2.4 Web
7 Web Web Web Web Web 7 Web Web Web Web Web $p$ $Page_p$ $V_t$ $w_i$ Web $p$ $f_{p,i}$

$$Page_p = \{f_{p,0}, f_{p,1}, \ldots, f_{p,n}\} \tag{8}$$

8 Web $p$ $Page_p$ 7 $t$ $Celebrity_t$ cos $sim_{t,p}$

$$sim_{t,p} = \frac{Celebrity_t \cdot Page_p}{|Celebrity_t| \, |Page_p|} \tag{9}$$

Web $p$ $sim_{t,p}$ Web $Occurrence_t$ $Occurrence_t$

$$Occurrence_t = \frac{\sum_{p \in P} sim_{t,p}}{|P|} \tag{10}$$

$Occurrence_t$ $t$ Web Web $Occurrence_t$ $t$ Web $WebAffinity_t$

$$WebAffinity_t = p_1 \cdot Occurrence_t + p_2 \tag{11}$$

$p_1$ $p_2$ Web

6.3
6.1 $t$ $NewsHook_t$ Web Wikipedia Wikipedia Wikipedia $t$ Wikipedia 1 $WikiAccess_t$ Wikimedia $WikiEdit_t$ Wikipedia $WikiAccess_t$ $WikiEdit_t$ $t$ $NewsHook_t$

$$NewsHook_t = p_3 \cdot WikiAccess_t + p_4 \cdot WikiEdit_t + p_5 \tag{12}$$

11 $p_1, p_2$ $p_3, p_4, p_5$
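The vector computations of Equations (7), (9), and (10) can be sketched in Python as follows. This is a sketch under stated assumptions: all vectors are plain lists of word frequencies over the same word list $W$, `categories` is a list of $(b_{t,c}, Category_c)$ pairs for the categories in $C_t$, and a zero vector is given similarity 0. The function names and the toy numbers are hypothetical.

```python
import math


def celebrity_vector(v_t, categories):
    """Equation (7): Celebrity_t = sum over c of b_{t,c} * Category_c, plus V_t."""
    result = [float(x) for x in v_t]
    for b_tc, category in categories:
        for i, f in enumerate(category):
            result[i] += b_tc * f
    return result


def cosine(u, v):
    """Equation (9): cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: similarity to an empty vector is 0
    return dot / (norm_u * norm_v)


def occurrence(celebrity, pages):
    """Equation (10): mean similarity between Celebrity_t and each Page_p in P."""
    return sum(cosine(celebrity, page) for page in pages) / len(pages)


# Hypothetical toy data over a two-word vocabulary.
celebrity = celebrity_vector([1, 0], [(0.5, [0, 2])])
pages = [[1, 0], [1, 1]]
print(celebrity, occurrence(celebrity, pages))
```

$WebAffinity_t$ in Equation (11) is then a linear function of this mean similarity, with $p_1$ and $p_2$ fitted from data.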
Web 4 $t$ $Famousness_t$ $HitCount_t$ $WebAffinity_t$ $NewsHook_t$ $AccumulateDuration_t$

4 Infobox

6.4
Web Web Wikipedia 4 Wikipedia infobox infobox $t$ $days_t$ $AccumulateDuration_t$

$$AccumulateDuration_t = p_6 \cdot days_t + p_7 \tag{13}$$

$p_6$ $p_7$

6.5
6.2 6.3 6.4 $t$ Web $WebAffinity_t$ $NewsHook_t$ $AccumulateDuration_t$ Web Web Web Web Web

$$HitCount_t = Famousness_t \times WebAffinity_t \times NewsHook_t \times AccumulateDuration_t \tag{14}$$

14 $Famousness_t$

$$Famousness_t = \frac{HitCount_t}{WebAffinity_t \times NewsHook_t \times AccumulateDuration_t} \tag{15}$$

15 $WebAffinity_t$ $NewsHook_t$ $AccumulateDuration_t$ 11 12 13 $p_1$ $p_7$ 4.2 DFIWF 15 $HitCount_t$ DFIWF 15

7.
15

7.1
3.1 3.2 4.2 DFIWF AND 2 ( ) (DFIWF) 12 Wikipedia 2013-01-01 2013-12-31 1
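A minimal Python sketch of how Equations (11) to (15) fit together is given below. It assumes the three correction factors combine multiplicatively, so that Equation (15) divides $HitCount_t$ by their product; the operators are not legible in the extracted text, so this is an assumption rather than a confirmed reading. The function name `estimate_famousness`, the parameter tuple `p`, and the sample values are hypothetical.

```python
def estimate_famousness(hit_count, occurrence, wiki_access, wiki_edit, days, p):
    """Estimate Famousness_t from HitCount_t and three linear correction factors.

    p: sequence (p1, ..., p7) of fitted parameters (assumed to be given).
    """
    p1, p2, p3, p4, p5, p6, p7 = p
    web_affinity = p1 * occurrence + p2                # Equation (11)
    news_hook = p3 * wiki_access + p4 * wiki_edit + p5  # Equation (12)
    accumulate_duration = p6 * days + p7               # Equation (13)
    denominator = web_affinity * news_hook * accumulate_duration
    if denominator == 0:
        raise ValueError("correction factors multiply to zero")
    # Equation (15), assuming the multiplicative form of Equation (14)
    return hit_count / denominator


# Hypothetical parameters that reduce NewsHook and AccumulateDuration to 1.
params = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0)
estimate = estimate_famousness(6.0, 2.0, 100.0, 5.0, 365.0, params)
print(estimate)  # 3.0
```

With all factors fixed at 1 the estimate reduces to the raw hit-count term, which makes the contribution of each correction easy to inspect in isolation.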
5 Result of estimation
     ( )       (DFIWF)
     0.0300    0.0243
     0.470     0.592

Figure 5 ( ) and Figure 6 (DFIWF): Estimate (0.0 to 1.5) plotted against Correct (0.0 to 1.0).

13 $days_t$ $t$ 2013-12-31 15 $p_1$ $p_7$ Leave-one-Out

7.2
5 5 DFIWF 6 5 DFIWF 0.0243 5.3 0.0357 0.592 5.3 0.499 5 6 3.1 Web

JSPS 26280112

[1] MeCab. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html.
[2] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. Measuring semantic similarity between words using web search engines. WWW, 7:757-766, 2007.
[3] Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In Proceedings of the 13th International Conference on World Wide Web, pages 462-471. ACM, 2004.
[4] Qiang Ma and Masatoshi Yoshikawa. Ranking people based on metadata analysis of search results. In Sven Hartmann, Xiaofang Zhou, and Markus Kirchberg, editors, Web Information Systems Engineering - WISE 2008 Workshops, volume 5176 of Lecture Notes in Computer Science, pages 48-60. Springer Berlin Heidelberg, 2008.
[5] Yutaka Matsuo, Hironori Tomobe, and Takuichi Nishimura. Robust estimation of Google counts for social network extraction. In AAAI, volume 7, pages 1395-1401, 2007.
[6] Gerard Salton. Developments in automatic text retrieval. Science, 253(5023):974-980, 1991.
[7] Koh Satoh and Hayato Yamana. Hit count reliability: how much can we trust hit counts? Web Technologies and Applications, pages 751-758, 2012.
[8] Tian Tian, Soon Ae Chun, and James Geller. A prediction model for web search hit counts using word frequencies. Journal of Information Science, page 0165551511415183, 2011.
[9] Ahmet Uyar. Investigation of the accuracy of search engine hit counts. Journal of Information Science, 35(4):469-480, 2009.

8.
Web