Automatic identification of academic articles in Japanese PDF files

Teru AGATA, Atsushi IKEUCHI, Emi ISHIDA, Michiko NOZUE, Takashi KUNO, Shuichi UEDA

Résumé

As open-access policies gain acceptance, a growing number of researchers are posting their papers on publicly accessible web sites (i.e. self-archiving). In principle these papers can be reached through standard search engines, but in practice they tend to be buried among other content on the web. The purpose of this research is to develop a system that can automatically detect academic articles and/or quasi-academic articles on the web. This paper describes experiments on the performance of various classifiers and compares the results in terms of precision, recall, and F-measure. The classifiers use attributes such as terms in PDF files and empirical rules. The results suggest the efficiency of a ranked output system that identifies academic articles in several phases.

Teru AGATA: Asia University, 5-24-10 Sakai, Musashino-shi, Tokyo. e-mail: agata@asia-u.ac.jp
Atsushi IKEUCHI: Daito Bunka University
Emi ISHIDA: Surugadai University
Michiko NOZUE: Railway Technical Research Institute
Takashi KUNO: Sakushingakuin University
Shuichi UEDA: Keio University
2006-05-15; 2006-09-04

I. A. B. C. PDF D.
II. A. PDF B. C. D.
III. A. B. C.
IV. A. B. C.
V. A. B.

I.
A.
2001 Budapest Open Access Initiative (BOAI) 1)
2) 2006 2 9 3)
B.
(JST) 4) (J-STAGE) 5) 100 2006 2 6) OAIster 2006 5 639 732 Google CiteSeer.IST 7) Google Scholar 8) CiteSeer.IST CiteSeer.IST 9) Google Scholar 10) Google Scholar (1) (2)
C.
PDF PDF PDF, HTML, XML, TeX, MS Word PDF
PDF 2006 5 80.9 11) 1 PDF PDF HTML 1 1 1 PDF 1 1 HTML PDF Google Scholar 12) 1 1
D.
PDF 1
2 PDF PDF / F
II.
A.
PDF PDF 2005 5 2005 11 ipadic 2.5.1 213,020 9,750 1 10,250 2 (Yahoo! Japan) PDF 100 URL 307,514 1 441,598 2 URL PDF 0 pdf PDF 1 248,314 2 349,971 PDF 1 2 599,673
B.
PDF 20,000 6 PDF 12,000 565 6 (1) (2) (3) (4) 1 1 (5) 2 PDF 2
C.
1. 2 2 PDF 20,000 326 950 4.75 5 PDF PDF 1 PDF
2. jp 3 URL jp 5 3 ac.jp jp co.jp jp

Table 2
             Academic articles   Quasi-academic articles   Other
Files        326                 624                       19,050
File size    497,622.7 bytes     436,736.4 bytes           295,111.9 bytes
Pages        10.94 pages         13.86 pages               6.88 pages
             100.00              98.54                     92.50

Table 3
JP domain   Academic articles    Quasi-academic articles   Other
            n       %            n       %                 n        %
ac          172     52.60        269     43.32             1,749     9.30
go           52     15.90        109     17.55             1,889    10.05
co           29      8.87         47      7.57             3,023    16.08
or           24      7.34         59      9.50             2,127    11.32
ne            5      1.53         20      3.22             1,322     7.03
other        45     13.76        117     18.84             8,687    46.21
PDF URL
3. 950 2 4
D.
1. PDF PDF PDF

Table 4
NDC                            Academic articles   Quasi-academic articles
                               n      %            n      %
00 General works               33     10.1          22     3.5
10 Philosophy                  20      6.1          15     2.4
20 History                     12      3.7          22     3.5
30 Social sciences             64     19.6         161    25.8
40 Natural sciences            64     19.6         175    28.0
50 Technology / Engineering    88     27.0         145    23.2
60 Industry                    22      6.7          60     9.6
70 Arts / Fine arts             4      1.2          14     2.2
80 Language                    10      3.1           5     0.8
90 Literature                   9      2.8           5     0.8

PDF Adobe Acrobat 13), Xpdf 3.01pl2 14), PDFDocText 15), PDFTrans 16) PDF Acrobat PDFDocText PDFTrans Xpdf PDF PDF PDF Xpdf PDF Xpdf 3a 3b 3a PDF 3a 3b PDF 3b 1 2 3b PDF
3a no. 11, 2001, p. 9
3b EU - vol. 39, no. 2, p. 63
PDF Xpdf 1
2. SVM MeCab 0.81 17) 2 bigram mecab, bigram bigram
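The text above names two ways of turning extracted PDF text into terms: morphological analysis with MeCab and character bigrams. The bigram variant is simple enough to sketch without any external tool. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, and dropping all whitespace before forming bigrams (so that bigrams may span line breaks left by the PDF-to-text step) is an assumption.

```python
from collections import Counter
from typing import List


def char_bigrams(text: str) -> List[str]:
    """Return overlapping two-character terms from the text.

    Assumption: whitespace is removed first, so a bigram may join the last
    character of one extracted line with the first character of the next.
    """
    chars = [ch for ch in text if not ch.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def term_frequencies(text: str) -> Counter:
    """Count how often each character bigram occurs in the text."""
    return Counter(char_bigrams(text))


if __name__ == "__main__":
    print(char_bigrams("学術論文"))
    print(term_frequencies("abab"))
```

Unlike the mecab terms, bigrams need no dictionary, which makes them robust to out-of-vocabulary words at the cost of a much larger and noisier attribute set.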
III.
A.
1. PDF mecab 77,814
2. 18) PDF 2 5 4 19 5 2 5 URL ac.jp URL go.jp
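Among the empirical-rule attributes, the surviving text mentions whether the URL lies in the ac.jp or go.jp domain. A rule of that kind could be computed roughly as below. This is a hedged sketch: the function names, the feature names, and the decision to group all other .jp hosts under "other" are assumptions made for illustration, and the remaining rules of the original study are not reproduced here.

```python
from typing import Dict
from urllib.parse import urlparse

# Second-level labels of the .jp domain that are treated separately
# (the same labels that appear in the domain breakdown: ac, go, co, or, ne).
JP_SECOND_LEVEL = ("ac", "go", "co", "or", "ne")


def jp_domain_type(url: str) -> str:
    """Return 'ac', 'go', 'co', 'or' or 'ne' for such .jp hosts,
    'other' for any other .jp host, and '' for hosts outside .jp."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2 or labels[-1] != "jp":
        return ""
    return labels[-2] if labels[-2] in JP_SECOND_LEVEL else "other"


def url_rule_features(url: str) -> Dict[str, bool]:
    """Binary rule attributes derived from the URL (hypothetical names)."""
    kind = jp_domain_type(url)
    return {"url_is_ac_jp": kind == "ac", "url_is_go_jp": kind == "go"}


if __name__ == "__main__":
    print(url_rule_features("http://www.example.ac.jp/papers/a.pdf"))
```

Binary attributes like these can be fed to the same classifiers as the term attributes, or used on their own as a rule-based attribute set.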
B.
SVM, AdaBoost, 3 SVM, AdaBoost (C4.5), Vote Weka (Waikato Environment for Knowledge Analysis) Weka 19) 20) Waikato Java Weka 3.4.7, Weka 3.5.2 Weka Weka
1. SVM Vapnik 2 21) SVM SVM AdaBoost SVM SVMlight 6.01 22) Weka LIBSVM 2.81 23)
2. AdaBoost (Boosting) (Bagging) (ensemble learning) AdaBoost Schapire Singer AdaBoost k-nn 24) AdaBoost 25) SVM 26) AdaBoost
BoosTexter AdaBoost.MH BoosTexter 27) mecab 70 AdaBoost BoosTexter AdaBoost 3 AdaBoost.MH 2 28) AdaBoost.MH Weka AdaBoost (decision stumps) 10 100 1000 Weka BoosTexter AdaBoost
3. / (naive Bayesian classifier) naive (Bayesian Filter) Paul Graham A Plan for Spam 29) bsfilter 30) bsfilter Paul Graham Gary Robinson-Fisher 31) bsfilter Weka NaïveBayes
4. (C4.5) (decision tree) CART 32), ID3 33), C4.5 34) Weka C4.5 J48 if-then 4 AdaBoost AdaBoost
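The AdaBoost runs described above use decision stumps as weak learners with 10, 100 and 1000 rounds. To make the idea concrete, here is a minimal sketch of textbook discrete AdaBoost for two classes. It is not Weka's implementation and not BoosTexter's AdaBoost.MH: the function names are hypothetical, and it assumes binary term-presence attributes (0 or 1) and labels of +1 and -1.

```python
import math
from typing import List, Sequence, Tuple

# One weak learner: (vote weight alpha, attribute index, sign).
Stump = Tuple[float, int, int]


def stump_predict(x: Sequence[int], feature: int, sign: int) -> int:
    """Predict `sign` if the attribute is present (non-zero), else `-sign`."""
    return sign if x[feature] else -sign


def train_adaboost(X: Sequence[Sequence[int]], y: Sequence[int], rounds: int) -> List[Stump]:
    """Discrete AdaBoost over one-attribute decision stumps; labels are +1/-1."""
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    model: List[Stump] = []
    for _ in range(rounds):
        # Pick the stump with the smallest weighted training error.
        best_err, feature, sign = min(
            (sum(w for xi, yi, w in zip(X, y, weights)
                 if stump_predict(xi, j, s) != yi), j, s)
            for j in range(d) for s in (1, -1)
        )
        if best_err >= 0.5:  # no stump is better than chance
            break
        err = max(best_err, 1e-10)  # avoid log of infinity on a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, feature, sign))
        # Raise the weight of misclassified examples, lower the others.
        weights = [w * math.exp(-alpha * yi * stump_predict(xi, feature, sign))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model: List[Stump], x: Sequence[int]) -> int:
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(alpha * stump_predict(x, j, s) for alpha, j, s in model)
    return 1 if score > 0 else -1


if __name__ == "__main__":
    X = [[1, 0], [1, 1], [0, 1], [0, 0]]
    y = [1, 1, -1, -1]
    model = train_adaboost(X, y, rounds=3)
    print([predict(model, x) for x in X])
```

The number of rounds is the number of stumps combined, so the 10, 100 and 1000 settings trade training time against how many attributes the final vote can draw on.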
α = 0.5 α = 0.5 F1 α = 1/3 F2 P R 4
5. (Vote) Weka Vote SVM AdaBoost (100),
C.
(P), (R), F F1 F2 F

$$F = \frac{1}{\alpha \cdot \frac{1}{P} + (1 - \alpha) \cdot \frac{1}{R}}$$

4 (macro-averaging)
IV. 6-11
A.
2 3 / mecab / bigram, 4 AdaBoost 3
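The evaluation measures above can be computed as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the per-fold inputs are counts of true positives, false positives and false negatives, and an undefined value (no documents judged positive) is represented by `None`, mirroring the N/A entries in the result tables. F1 uses α = 0.5 and F2 uses α = 1/3, as in the definition above.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """Precision P and recall R from counts; None where the ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp > 0 else None
    recall = tp / (tp + fn) if tp + fn > 0 else None
    return precision, recall


def f_alpha(p: Optional[float], r: Optional[float], alpha: float = 0.5) -> Optional[float]:
    """F = 1 / (alpha * 1/P + (1 - alpha) * 1/R).

    alpha = 0.5 gives F1 (harmonic mean); alpha = 1/3 gives F2, which
    weights recall twice as heavily as precision.
    """
    if p is None or r is None:
        return None
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


def macro_average(values: Sequence[Optional[float]]) -> Optional[float]:
    """Plain mean over folds; None if any fold value is undefined."""
    if not values or any(v is None for v in values):
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    print(f_alpha(0.750, 0.277, alpha=0.5))
    print(f_alpha(0.750, 0.277, alpha=1 / 3))
```

With P = .750 and R = .277 this gives roughly 0.40 for F1 and 0.35 for F2. Because macro-averaging takes the mean of per-fold values, an averaged F need not equal the F computed from the averaged P and R.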
1. 6 SVM mecab SVM 7 bigram .933 9 .026 6 F1, F2 AdaBoost AdaBoost 100 1000
2. 7 SVM 0 F 7 N/A AdaBoost 1000 9 F F1, F2 Vote

Table 6
Classifier            Terms    P      R      F1     F2     Retrieved
SVM                   mecab    .750   .277   .404   .350      30
SVM                   bigram   .727   .274   .398   .346      31
AdaBoost Round 10     mecab    .521   .403   .455   .436      63
AdaBoost Round 100    mecab    .549   .407   .467   .445      60
AdaBoost Round 1000   mecab    .605   .383   .469   .437      52

SVM                   mecab    .039   .914   .076   .109   1,891
SVM                   bigram   .047   .923   .090   .128   1,598
                      mecab    .713   .273   .395   .344      31
                      bigram   .743   .271   .397   .344      30
AdaBoost Round 10     mecab    .413   .367   .389   .381      72
AdaBoost Round 100    mecab    .527   .417   .465   .448      64
AdaBoost Round 1000   mecab    .547   .387   .453   .428      58
                      mecab    .103   .506   .172   .220     399
                      bigram   .026   .933   .050   .074   2,937
Vote SVM 1 SVM SVM (C4.5) 4
3. 8 50 SVM 75 .933 .026

Table 7
Classifier            P      R      F1     F2     Retrieved
                      .233   .893   .370   .459     312
(C4.5)                .430   .236   .305   .278      45
AdaBoost Round 10     .422   .331   .371   .357      64
AdaBoost Round 100    .467   .393   .427   .415      69
AdaBoost Round 1000   .504   .365   .423   .402      59
SVM                   N/A    .000   N/A    N/A      N/A
Vote                  .444   .537   .486   .502      99

Table 8
Classifier            Terms    P      R      F1     F2     Retrieved
SVM                   mecab    .750   .277   .404   .350      30
AdaBoost (R100)       mecab    .527   .417   .465   .448      64
AdaBoost (R1000)      mecab    .605   .383   .469   .437      52
                      bigram   .026   .933   .050   .074   2,937
                               .233   .893   .370   .459     312
AdaBoost (R1000)               .504   .365   .423   .402      59
Vote                           .444   .537   .486   .502      99
Vote F Vote F1 AdaBoost (1000) mecab .469 F2 19 F
B.
1. 9 10 20 SVM bigram SVM mecab bigram

Table 9
Classifier            Terms    P      R      F1     F2     Retrieved
SVM                   mecab    .742   .482   .584   .546     154
SVM                   bigram   .749   .478   .584   .544     152
AdaBoost Round 10     mecab    .580   .432   .495   .472     177
AdaBoost Round 100    mecab    .624   .515   .564   .547     196
AdaBoost Round 1000   mecab    .675   .513   .583   .557     180

SVM                   mecab    .113   .895   .200   .270   1,889
SVM                   bigram   .136   .913   .236   .314   1,597
                      mecab    .740   .469   .574   .534     151
                      bigram   .736   .470   .574   .535     152
AdaBoost Round 10     mecab    .587   .379   .461   .430     154
AdaBoost Round 100    mecab    .622   .478   .540   .518     182
AdaBoost Round 1000   mecab    .633   .500   .559   .538     188
                      mecab    .257   .432   .322   .352     399
                      bigram   .075   .931   .139   .194   2,937
bigram .026 .075 F1 SVM bigram F2 AdaBoost 1000 AdaBoost
2. 10 10 20 SVM .681 7 SVM .893 .726 10 F Vote
3. 11

Table 10
Classifier            P      R      F1     F2     Retrieved
                      .363   .726   .484   .545     475
(C4.5)                .662   .445   .532   .500     160
AdaBoost Round 10     .642   .433   .517   .486     160
AdaBoost Round 100    .654   .437   .524   .491     159
AdaBoost Round 1000   .652   .425   .515   .481     155
SVM                   .681   .436   .532   .495     152
Vote                  .592   .551   .571   .564     221

Table 11
Classifier            Terms    P      R      F1     F2     Retrieved
SVM                   bigram   .749   .478   .584   .544     152
AdaBoost (R1000)      mecab    .675   .513   .583   .557     180
                      bigram   .075   .931   .139   .194   2,937
                               .363   .726   .484   .545     475
AdaBoost (R1000)               .681   .436   .532   .495     152
Vote                           .592   .551   .571   .564     221
bigram SVM .277 .478 bigram .931 .363 .726 F F1 bigram SVM F2 Vote
C.
mecab SVM PDF PDF
1. SVM SVM 6 / 6 1 2 1 1 1 SVM 2 SVM
2. 6 1 3 3
V.
A.
1. PDF PDF 2 5 10 20 PDF
2. 19 8 11
3. SVM AdaBoost PDF 3 1) 2)
SVM SVM 5 Vote 3) 5
B.
1.
2. PDF
3.
1) Budapest Open Access Initiative. http://www.soros.org/openaccess/read.shtml (accessed 2006-05-12)
2) Lawrence, S. Free online availability substantially increases a paper's impact. Nature. vol. 411, no. 6837, 2001, p. 521. http://www.nature.com/nature/debates/e-access/Articles/lawrence.html, http://www.nature.com/nature/journal/v411/n6837/full/411521a0.html (accessed 2006-05-12)
3) EPrints. Journal Policies - Summary Statistics So Far. http://romeo.eprints.org/stats.php (accessed 2006-05-12)
4) http://www.jst.go.jp/ (accessed 2006-05-12)
5) J-STAGE. http://www.jstage.jst.go.jp/ (accessed 2006-05-12)
6) ... vol. 55, no. 10, 2005, p. 434.
7) CiteSeer.IST. http://citeseer.ist.psu.edu/ (accessed 2006-05-12)
8) Google Scholar Beta. http://scholar.google.com/ (accessed 2006-05-12)
9) Lawrence, S.; Giles, C. L.; Bollacker, K. Digital libraries and autonomous citation indexing. IEEE Computer. vol. 32, no. 6, 1999, p. 67-71. http://citeseer.ist.psu.edu/aci-computer/acicomputer99.html (accessed 2006-05-12)
10) Google Scholar Beta 2006 5
11) http://www.openaccessjapan.com/archives/2006/05/oa1.html (accessed 2006-05-12)
12) Crane, Diana. 1979, 260 p.
13) Adobe Acrobat family. http://www.adobe.co.jp/products/acrobat/ (accessed 2006-05-12)
14) Glyph & Cog. Xpdf. http://www.foolabs.com/xpdf/ (accessed 2006-05-12)
15) papy. http://homepage3.nifty.com/e-papy/index.html (accessed 2006-05-12)
16) Ishikawa, O. http://ohju.cside4.jp/software/pdftrans/ (accessed 2006-05-12)
17) Kudo, Taku. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://chasen.org/~taku/software/mecab/ (accessed 2006-05-12)
18) PDF 2005, 2005-10-22/23 p. 165-168.
19) Department of Computer Science, University of Waikato. Weka. http://www.cs.waikato.ac.nz/ml/weka (accessed 2006-05-12)
20) Witten, Ian H.; Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., San Francisco, Morgan Kaufmann, 2005, 525 p.
21) Vapnik, Vladimir N. The Nature of Statistical Learning Theory. 2nd ed. New York, Springer, 2000, xix, 314 p. SVM Cristianini, Nello; Shawe-Taylor, John. 2005, 252 p.
22) Joachims, Thorsten. SVMlight. http://svmlight.joachims.org/ (accessed 2006-05-12)
23) Chang, Chih-Chung; Lin, Chih-Jen. LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (accessed 2006-05-12)
24) Schapire, R. E.; Singer, Y. BoosTexter: A boosting-based system for text categorization. Machine Learning. vol. 39, no. 2/3, 2000, p. 135-168.
25) Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. vol. 40, no. 2, 2000, p. 139-157.
26) Freund, Y.; Schapire, R. vol. 14, no. 5, 1999, p. 771-780.
27) Allwein, E.; Schapire, R. E.; Singer, Y. BoosTexter. http://www.research.att.com/sw/tools/boostexter/ (accessed 2006-05-12)
28) Schapire, R. E. The boosting approach to machine learning: An overview. MSRI workshop on nonlinear estimation and classification. 2001, p. 149-172.
29) Graham, Paul 8 2005, p. 127-135. http://www.shiro.dreamhost.com/scheme/trans/spam-j.html (accessed 2006-05-12)
30) nabeken. bsfilter/bayesian spam filter. http://bsfilter.org/ (accessed 2006-05-12)
31) Robinson, G. A statistical approach to the spam problem. http://www.linuxjournal.com/article/6467 (accessed 2006-05-12)
32) Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and regression trees. Belmont, Wadsworth International Group, 1984, 358 p.
33) Quinlan, J. R. Induction of decision trees. Machine Learning. vol. 1, no. 1, 1986, p. 81-106.
34) Quinlan, J. R. AI 1995, 293 p.
SVM