DEIM Forum 2012 F3-5

A Study of Retrieval in Microblogging Based on Person's Aliases

Yutaro YAMAGUCHI, Satoshi SHIMADA, and Tetsuji SATOH

College of Knowledge and Library Sciences, School of Informatics, University of Tsukuba, 1-2 Kasuga, Tsukuba, Ibaraki, 305-8550 Japan
Graduate School of Library, Information and Media Studies, University of Tsukuba, 1-2 Kasuga, Tsukuba, Ibaraki, 305-8550 Japan
E-mail: {yamaguchi,satoh}@ce.slis.tsukuba.ac.jp, sat@slis.tsukuba.ac.jp

Abstract  In microblogging, where users can post comments easily and intuitively, people are referred to by a variety of aliases other than their personal names. An alias used in a tweet reflects the context and the user's feelings; it is not merely a means of referring to the person. In this paper, we propose a method to extract a person's aliases using a search engine and Wikipedia, and we analyze the topic and polarity of the articles in which the aliases appear. Based on the results, we built a system that, when a user inputs a personal name, retrieves the context and polarity of the articles in which that person's aliases appear.

Key words  Microblogging, alias, SVM

1. Twitter 1 Twitter Web

1 http://twitter.com/
Wikipedia 2 Wikipedia Web Wikipedia

2. Web [6] [11] [8] 2 SVM Bollegala [1] 5-gram URL URL 2-gram [6] [11] SVM 3 SVM [8] Bollegala [1] SVM [7] [9] [6] Wikipedia Brendan [3] 4 0.725 David [5] Aniket [4] K-means Affinity Propagation [2] 3 idf Affinity Propagation Web [7] Wikipedia

3.
3.1
3.1.1 Wikipedia Wikipedia [8] Wikipedia 1

2 http://ja.wikipedia.org/
3 4 Conference-Board) 6
1 2 Wikipedia

3.1.2 2 Wikipedia Wikipedia

3.1.3 Wikipedia Wikipedia Wikipedia

3.1.4 Wikipedia [6] [11] 2 3.2 alias fullname alias fullname 3 [6] alias fullname fullname alias fullname alias [11] alias fullname
(1) fullname Web N
(2) fullname 5 3 candidate fullname 3 5

3.2 SVM
SVM Support Vector Machine SVM SVM 3.1.4 [11] SVM alias fullname 6 Dice(fullname, candidate) OverlapC(fullname, candidate) OverlapN(fullname, candidate) 6 9

\[ \mathrm{Dice}(\mathit{fullname}, \mathit{candidate}) = \frac{\mathrm{Hits}(\mathit{fullname}, \mathit{candidate})}{\mathrm{Hits}(\mathit{fullname}) + \mathrm{Hits}(\mathit{candidate})} \tag{1} \]

\[ \mathrm{OverlapC}(\mathit{fullname}, \mathit{candidate}) = \frac{\mathrm{Hits}(\mathit{fullname}, \mathit{candidate})}{\mathrm{Hits}(\mathit{candidate})} \tag{2} \]

\[ \mathrm{OverlapN}(\mathit{fullname}, \mathit{candidate}) = \frac{\mathrm{Hits}(\mathit{fullname}, \mathit{candidate})}{\mathrm{Hits}(\mathit{fullname})} \tag{3} \]

Here Hits(fullname, candidate) is the number of search hits for the query "fullname AND candidate", and Hits(fullname) and Hits(candidate) are the numbers of hits for fullname and for candidate alone.

The SVM uses the following 8 features:
(1) Dice(fullname, candidate)
(2) OverlapC(fullname, candidate)
(3) OverlapN(fullname, candidate)
(4) log(cf(candidate))
(5) fn(candidate)
(6) fp(candidate)
(7) bn(candidate)
(8) bp(candidate)
3.2

3.3 3.1

3.3.1 [10] 8,500 tweet

\[ \mathrm{pn}(\mathit{tweet}) = \frac{\mathit{posi} - \mathit{nega}}{\mathit{posi} + \mathit{nega}} \tag{4} \]

where posi is the number of positive words in tweet and nega is the number of negative words in tweet, so that -1.0 <= pn(tweet) <= 1.0. For example, for a tweet with 3 positive words and 1 negative word, Eq. (4) gives pn(tweet) = 0.5.

3.3.2 alias date pn_alias(alias, date)

\[ \mathrm{pn\_alias}(\mathit{alias}, \mathit{date}) = \frac{1}{n} \sum_{\mathit{tweet} \in T} \mathrm{pn}(\mathit{tweet}) \tag{5} \]

where T is the set of tweets containing alias that were posted on date, and n is the number of tweets in T. As with Eq. (4), -1.0 <= pn_alias(alias, date) <= 1.0.

4.
4.1 3.
4.2 Wikipedia 5 3.1 500 1 SVM F

5 http://ja.wikipedia.org/wiki/Category:
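Equations (1) to (5) are simple enough to sketch in a few lines of Python. The sketch below is illustrative only and is not the authors' implementation: the function names, the tiny English polarity lexicon, and the choice to return 0.0 when a denominator is zero are all assumptions. Dice follows Eq. (1) as printed in this paper, and pn(tweet) uses the difference posi - nega, which is what the stated range of -1.0 to 1.0 implies.

```python
from typing import Iterable, Mapping, Sequence


def dice(hits_both: int, hits_full: int, hits_cand: int) -> float:
    """Eq. (1) as printed: Hits(fullname, candidate) / (Hits(fullname) + Hits(candidate))."""
    total = hits_full + hits_cand
    return hits_both / total if total else 0.0  # 0.0 on empty counts is an assumption


def overlap_c(hits_both: int, hits_cand: int) -> float:
    """Eq. (2): co-occurrence hits normalized by the candidate's hits."""
    return hits_both / hits_cand if hits_cand else 0.0


def overlap_n(hits_both: int, hits_full: int) -> float:
    """Eq. (3): co-occurrence hits normalized by the full name's hits."""
    return hits_both / hits_full if hits_full else 0.0


def pn_tweet(words: Sequence[str], polarity: Mapping[str, int]) -> float:
    """Eq. (4): (posi - nega) / (posi + nega) over the polarity words of one tweet."""
    posi = sum(1 for w in words if polarity.get(w, 0) > 0)
    nega = sum(1 for w in words if polarity.get(w, 0) < 0)
    if posi + nega == 0:
        return 0.0  # assumption: a tweet without polarity words is neutral
    return (posi - nega) / (posi + nega)


def pn_alias(tweets: Iterable[Sequence[str]], polarity: Mapping[str, int]) -> float:
    """Eq. (5): mean pn(tweet) over the set T of tweets for one alias and one date."""
    scores = [pn_tweet(t, polarity) for t in tweets]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical hit counts: 30 pages contain both names, 100 the full name, 50 the candidate.
    print(dice(30, 100, 50), overlap_c(30, 50), overlap_n(30, 100))

    # Hypothetical lexicon: +1 marks a positive word, -1 a negative word.
    lexicon = {"great": 1, "love": 1, "fun": 1, "boring": -1}
    tweet = ["great", "love", "fun", "boring", "today"]  # 3 positive, 1 negative
    print(pn_tweet(tweet, lexicon))  # the paper's own example: 0.5
```

Passing the hit counts in as plain integers keeps the measures independent of whichever search engine supplies them.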
1 SVM SVM 2 fullname 6 fullname

\[ \mathrm{Precision} = \frac{R}{N} \]

\[ \mathrm{Recall} = \frac{R}{C} \]

\[ \mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{6} \]

where R is the number of correct aliases among those output by the SVM, N is the number of aliases output by the SVM, and C is the number of correct aliases. 10

4 2011/6/27 0:00:00 to 2011/9/26 0:00:00 geocode = 35.67012719,139.8094368,100km

4.3 2 2 4 Twitter Search API 4 2011/8/27 0:00:00 to 9/20 23:59:59 RT URL 2 3 RT URL 224 70 346 313 3 3 1127 259 1389 985 53 34

5.
5.1 SVM 3.1.4 N 300 SVM LibSVM 6 SVM RBF C-SVC C gridsearch 9 excite 7 NAVER 8 9 1 9 15 867 SVM 4.2 Wikipedia 5 6 5 Wikipedia 7 8 Wikipedia 1 4 false negative 5 false positive 1 Web faridyu @faridyu 2 3 4

6 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
7 http://tt.excite.co.jp/people/
8 http://person.naver.jp/issue
4 4 7 5 1 faridyu Twitter Web Web URL 2 3 4 4 5 4 4 5 4 faridyu @faridyu 8 Wikipedia Wikipedia 9 C 1.34 10 8 0.25 1.00 0.50

Table 5
Wikipedia  824  464  80

Table 6
Wikipedia  428  229  168

Table 10
Precision  Recall  F-measure
0.69       0.65    0.67
0.62       0.67    0.64

Table 11
Precision  Recall  F-measure
0.83       0.70    0.76
0.65       0.80    0.71
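The Precision, Recall and F-measure values reported above follow the definitions in Section 4.2: Precision = R / N, Recall = R / C, and Eq. (6) for the F-measure. The following Python sketch shows the calculation. The function names and the counts in the usage lines are hypothetical, and returning 0.0 for an empty denominator is an assumption that the paper does not state.

```python
def precision(r: int, n: int) -> float:
    """Precision = R / N: correct aliases among the N aliases the classifier output."""
    return r / n if n else 0.0  # 0.0 on an empty output is an assumption


def recall(r: int, c: int) -> float:
    """Recall = R / C: correct aliases found among the C correct aliases."""
    return r / c if c else 0.0


def f_measure(p: float, rc: float) -> float:
    """Eq. (6): harmonic mean of precision and recall."""
    return 2 * p * rc / (p + rc) if (p + rc) else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 10 aliases output, 8 of them correct, 16 correct aliases in total.
    p, rc = precision(8, 10), recall(8, 16)
    print(p, rc, f_measure(p, rc))

    # Precision 0.83 and recall 0.70 are the first row of the last table above.
    print(round(f_measure(0.83, 0.70), 2))  # 0.76, the F-measure reported in that row
```

Because the reported precision and recall are themselves rounded to two decimals, an F-measure recomputed from them can differ from the reported one in the last digit.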
5: 7: 6: 8:

5.2 5 6 5 6 5 9 2 3 6 9 10 13 5 9 8 6 9 7 8 5 9 8 9 2 AKBINGO!

5.3 7 8 5 pn_alias pn_alias 0 7 8 1

5.4 9 10 jQuery PHP MySQL DB Twitter API DB 5.1 489 1208
9 10

6. SVM 0.83

[1] D. Bollegala, Y. Matsuo, and M. Ishizuka. Automatic discovery of personal name aliases from the web. IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 6, pp. 831-844, June 2011.
[2] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, Vol. 315, pp. 972-976, 2007.
[3] Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. From tweets to polls: linking text sentiment to public opinion time series. ICWSM-2010, 2010.
[4] Aniket Rangrej, Sayali Kulkarni, and Ashish V. Tendulkar. Comparative study of clustering techniques for short text documents. 20th International World Wide Web Conference (WWW2011), p. 111, 2011.
[5] David A. Shamma, Lyndon Kennedy, and Elizabeth F. Churchill. Peaks and persistence: modeling the shape of microblog conversations. Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, pp. 355-358, 2011.
[6],. Web. (DBWS2006), 2006.
[7],. blog. 18 (DEWS2007), 2007.
[8],,,. Web. 19 DEWS2008, 2008.
[9],,,,. Weblog. 22, 2008.
[10],,.. 14, pp. 584-587, 2008.
[11], Danushka Bollegala,,. Web. NLP 2, 2007.

21500091