Library and Information Science No. 56 2007


Automatic identification of academic articles in Japanese PDF files

Teru AGATA, Atsushi IKEUCHI, Emi ISHIDA, Michiko NOZUE, Takashi KUNO, Shuichi UEDA

Teru AGATA: Asia University, 5-24-10 Sakai, Musashino-shi, Tokyo; e-mail: agata@asia-u.ac.jp
Atsushi IKEUCHI: Daito Bunka University
Emi ISHIDA: Surugadai University
Michiko NOZUE: Railway Technical Research Institute
Takashi KUNO: Sakushingakuin University
Shuichi UEDA: Keio University
2006-05-15; 2006-09-04

Résumé

As open-access policies gain acceptance, an increasing number of researchers are contributing their papers to publicly accessible web sites (i.e. self-archiving). Theoretically, these papers are accessible from standard search engines, but they tend to be obscured by other contents on the web. The purpose of this research is to develop a system that can automatically detect academic articles and/or quasi-academic articles on the web. This paper describes experiments that were conducted on the performance of various classifiers and the results are compared in terms of precision, recall, and F-measure. The classifiers use attributes such as terms in PDF files and empirical rules. The results suggest the efficiency of a ranked output system which has several phases to identify academic articles.

I. A. B. C. PDF D.
II. A. PDF B. C. D.
III. A. B. C.
IV. A. B. C.
V. A. B.

I.
A.
2001 Budapest Open Access Initiative (BOAI) 1)

2) 2006 2 9 3)

B.
(JST) 4) (J-STAGE) 5) 100 2006 2 6) OAIster 2006 5 639 732 Google CiteSeer.IST 7) Google Scholar 8) CiteSeer.IST CiteSeer.IST 9) Google Scholar 10) Google Scholar (1) (2)

C. PDF
PDF PDF, HTML, XML, TeX, MS Word PDF

PDF 2006 5 80.9 11) 1 PDF PDF HTML 1 1 1 PDF 1 1 HTML PDF Google Scholar 12) 1 1

D.
PDF 1

2 PDF PDF / F

II.
A. PDF
PDF 2005 5 2005 11 ipadic2.5.1 213,020 9,750 1 10,250 2 (Yahoo! Japan) PDF 100 URL 307,514 1 441,598 2 URL PDF 0 pdf PDF 1 248,314 2 349,971 PDF 1 2 599,673

B.
PDF 20,000 6 PDF 12,000 565 6 (1) (2) (3) (4) 1 1 (5) 2 PDF 2

C.
1.
2 2 PDF 20,000 326 950 4.75 5 PDF PDF 1 PDF

2.
jp 3 URL jp 5 3 ac.jp jp co.jp jp

Table 2
326              624              19,050
497,622.7 bytes  436,736.4 bytes  295,111.9 bytes
10.94 pages      13.86 pages      6.88 pages
100.00           98.54            92.50

Table 3. JP
      count      %    count      %    count      %
ac      172  52.60      269  43.32    1,749   9.30
go       52  15.90      109  17.55    1,889  10.05
co       29   8.87       47   7.57    3,023  16.08
or       24   7.34       59   9.50    2,127  11.32
ne        5   1.53       20   3.22    1,322   7.03
         45  13.76      117  18.84    8,687  46.21
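The domain breakdown above groups URLs by the second-level label of the jp domain (ac, go, co, or, ne, and a remainder). A minimal sketch of how such a label might be derived from a URL follows; the function name `jp_domain_class` and the category strings `other-jp` and `non-jp` are hypothetical and do not come from the paper, which may have used a different procedure.

```python
from urllib.parse import urlparse

# Second-level labels that appear as rows in the domain table.
JP_SECOND_LEVEL = ("ac", "go", "co", "or", "ne")


def jp_domain_class(url: str) -> str:
    """Return the second-level jp label of a URL (hypothetical helper).

    Returns one of "ac", "go", "co", "or", "ne", "other-jp" for other
    hosts under .jp, or "non-jp" for every other host.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2 or labels[-1] != "jp":
        return "non-jp"
    if len(labels) >= 3 and labels[-2] in JP_SECOND_LEVEL:
        return labels[-2]
    return "other-jp"
```

A label of this kind could then be turned into binary attributes (for example "URL is under ac.jp") for a rule-based classifier.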

PDF URL

3.
950 2 4

D.
1.
PDF PDF PDF

Table 4. NDC
                              count     %   count     %
00 General works                 33  10.1      22   3.5
10 Philosophy                    20   6.1      15   2.4
20 History                       12   3.7      22   3.5
30 Social sciences               64  19.6     161  25.8
40 Natural sciences              64  19.6     175  28.0
50 Technology / Engineering      88  27.0     145  23.2
60 Industry                      22   6.7      60   9.6
70 Arts / Fine arts               4   1.2      14   2.2
80 Language                      10   3.1       5   0.8
90 Literature                     9   2.8       5   0.8

PDF Adobe Acrobat 13), Xpdf 3.01pl2 14), PDFDocText 15), PDFTrans 16) PDF Acrobat PDFDocText PDFTrans Xpdf PDF PDF PDF Xpdf PDF Xpdf 3a 3b 3a PDF 3a 3b PDF 3b 1 2 3b PDF

3a no. 11, 2001, p. 9
3b EU - vol. 39, no. 2, p. 63
PDF Xpdf 1

2.
SVM MeCab 0.81 17) 2 bigram mecab, bigram bigram
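Two kinds of term attributes are named here: MeCab tokens and bigrams. The surviving text does not say how the bigrams were built, so the following is only a sketch under the assumption that they are overlapping character pairs, a common choice for unsegmented Japanese text. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def char_bigrams(text: str) -> List[str]:
    """Return overlapping two-character substrings, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]


def bigram_counts(text: str) -> Counter:
    """Count bigram occurrences, e.g. as raw term-frequency attributes."""
    return Counter(char_bigrams(text))
```

Unlike morphological analysis, this needs no dictionary, at the cost of a much larger and noisier attribute set.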

III.
A.
1.
PDF mecab 77,814

2.
18) PDF 2 5 4 19 5 2 5 URL ac.jp URL go.jp

B.
SVM, AdaBoost, 3 SVM, AdaBoost (C4.5), Vote Weka (Waikato Environment for Knowledge Analysis) Weka 19) 20) Waikato Java Weka 3.4.7, Weka 3.5.2 Weka Weka

1. SVM
Vapnik 2 21) SVM SVM AdaBoost SVM SVM light 6.01 22) Weka LIBSVM 2.81 23)

2. AdaBoost
(Boosting) (Bagging) (ensemble learning) AdaBoost Schapire Singer AdaBoost k-NN 24) AdaBoost 25) SVM 26) AdaBoost

BoosTexter AdaBoost.MH BoosTexter 27) mecab 70 AdaBoost BoosTexter AdaBoost 3 AdaBoost.MH 2 28) AdaBoost.MH Weka AdaBoost (decision stumps) 10 100 1000 Weka BoosTexter AdaBoost

3.
/ (naive Bayesian classifier) naive (Bayesian Filter) Paul Graham A Plan for Spam 29) bsfilter 30) bsfilter Paul Graham Gary Robinson-Fisher 31) bsfilter Weka NaiveBayes

4. (C4.5)
(decision tree) CART 32), ID3 33), C4.5 34) Weka C4.5 J48 if-then 4 AdaBoost AdaBoost

a = 0.5 a = 0.5 F1 a = 1/3 F2 P R 4

5. (Vote)
Weka Vote SVM AdaBoost (100),

C.
(P), (R), F F1 F2 F F a

$$F = \frac{1}{a\,\frac{1}{P} + (1 - a)\,\frac{1}{R}}$$

4 (macro-averaging)

IV.
611

A.
2 3 / mecab / bigram, 4 AdaBoost 3
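The evaluation measure here is the weighted harmonic mean of precision P and recall R, with weight a = 0.5 giving F1 and a = 1/3 giving F2 (which favours recall). The sketch below computes it and a simple macro-average over folds; the function names are hypothetical. Because the reported figures are macro-averaged, recomputing F from the rounded P and R in a results table can differ from the printed value in the third decimal.

```python
from statistics import mean
from typing import Sequence


def f_measure(precision: float, recall: float, a: float = 0.5) -> float:
    """Weighted harmonic mean F = 1 / (a/P + (1 - a)/R).

    a = 0.5 gives F1; a = 1/3 gives the recall-weighted F2 used here.
    Returns 0.0 when either input is zero, where F is otherwise undefined.
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    return 1.0 / (a / precision + (1.0 - a) / recall)


def macro_average(per_fold_scores: Sequence[float]) -> float:
    """Macro-average: the unweighted mean of the per-fold scores."""
    return mean(per_fold_scores)
```

For example, `f_measure(0.750, 0.277)` and `f_measure(0.750, 0.277, a=1/3)` land close to the .404 and .350 reported for SVM with mecab attributes.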

1.
6 SVM mecab SVM 7 bigram .933 9 .026 6 F1, F2 AdaBoost AdaBoost 100 1000

2.
7 SVM 0 F 7 N/A AdaBoost 1000 9 F F1, F2 Vote

Table 6
                                   P     R     F1    F2
SVM                     mecab    .750  .277  .404  .350     30
                        bigram   .727  .274  .398  .346     31
AdaBoost   Round 10     mecab    .521  .403  .455  .436     63
           Round 100    mecab    .549  .407  .467  .445     60
           Round 1000   mecab    .605  .383  .469  .437     52
SVM                     mecab    .039  .914  .076  .109  1,891
                        bigram   .047  .923  .090  .128  1,598
                        mecab    .713  .273  .395  .344     31
                        bigram   .743  .271  .397  .344     30
AdaBoost   Round 10     mecab    .413  .367  .389  .381     72
           Round 100    mecab    .527  .417  .465  .448     64
           Round 1000   mecab    .547  .387  .453  .428     58
                        mecab    .103  .506  .172  .220    399
                        bigram   .026  .933  .050  .074  2,937

Vote SVM 1 SVM SVM (C4.5) 4

3.
8 50 SVM 75 .933 .026

Table 7
                           P     R     F1    F2
                         .233  .893  .370  .459    312
(C4.5)                   .430  .236  .305  .278     45
AdaBoost   Round 10      .422  .331  .371  .357     64
           Round 100     .467  .393  .427  .415     69
           Round 1000    .504  .365  .423  .402     59
SVM                      N/A   .000  N/A   N/A     N/A
Vote                     .444  .537  .486  .502     99

Table 8
                               P     R     F1    F2
SVM                 mecab    .750  .277  .404  .350     30
AdaBoost (R100)     mecab    .527  .417  .465  .448     64
AdaBoost (R1000)    mecab    .605  .383  .469  .437     52
                    bigram   .026  .933  .050  .074  2,937
                             .233  .893  .370  .459    312
AdaBoost (R1000)             .504  .365  .423  .402     59
Vote                         .444  .537  .486  .502     99

Vote F Vote F1 AdaBoost (1000) mecab .469 F2 19 F

B.
1.
9 10 20 SVM bigram SVM mecab bigram

Table 9
                                   P     R     F1    F2
SVM                     mecab    .742  .482  .584  .546    154
                        bigram   .749  .478  .584  .544    152
AdaBoost   Round 10     mecab    .580  .432  .495  .472    177
           Round 100    mecab    .624  .515  .564  .547    196
           Round 1000   mecab    .675  .513  .583  .557    180
SVM                     mecab    .113  .895  .200  .270  1,889
                        bigram   .136  .913  .236  .314  1,597
                        mecab    .740  .469  .574  .534    151
                        bigram   .736  .470  .574  .535    152
AdaBoost   Round 10     mecab    .587  .379  .461  .430    154
           Round 100    mecab    .622  .478  .540  .518    182
           Round 1000   mecab    .633  .500  .559  .538    188
                        mecab    .257  .432  .322  .352    399
                        bigram   .075  .931  .139  .194  2,937

bigram .026 .075 F1 SVM bigram F2 AdaBoost 1000 AdaBoost

2.
10 10 20 SVM .681 7 SVM .893 .726 10 F Vote

3.
11

Table 10
                           P     R     F1    F2
                         .363  .726  .484  .545    475
(C4.5)                   .662  .445  .532  .500    160
AdaBoost   Round 10      .642  .433  .517  .486    160
           Round 100     .654  .437  .524  .491    159
           Round 1000    .652  .425  .515  .481    155
SVM                      .681  .436  .532  .495    152
Vote                     .592  .551  .571  .564    221

Table 11
                               P     R     F1    F2
SVM                 bigram   .749  .478  .584  .544    152
AdaBoost (R1000)    mecab    .675  .513  .583  .557    180
                    bigram   .075  .931  .139  .194  2,937
                             .363  .726  .484  .545    475
AdaBoost (R1000)             .681  .436  .532  .495    152
Vote                         .592  .551  .571  .564    221

bigram SVM .277 .478 bigram .931 .363 .726 F F1 bigram SVM F2 Vote

C.
mecab SVM PDF PDF

1.
SVM SVM 6 / 6 1 2 1 1 1 SVM 2 SVM

2.
6 1 3 3

V.
A.
1.
PDF PDF 2 5 10 20 PDF

2.
19 8 11

3.
SVM AdaBoost PDF 3 1) 2)

SVM SVM 5 Vote 3) 5

B.
1.
2. PDF
3.

1) Budapest Open Access Initiative. http://www.soros.org/openaccess/read.shtml (accessed 2006-05-12)
2) Lawrence, S. Free online availability substantially increases a paper's impact. Nature. vol. 411, no. 6837, 2001, p. 521. http://www.nature.com/nature/debates/e-access/Articles/lawrence.html, http://www.nature.com/nature/journal/v411/n6837/full/411521a0.html (accessed 2006-05-12)
3) EPrints. Journal Policies - Summary Statistics So Far. http://romeo.eprints.org/stats.php (accessed 2006-05-12)
4) http://www.jst.go.jp/ (accessed 2006-05-12)
5) J-STAGE. http://www.jstage.jst.go.jp/ (accessed 2006-05-12)
6) ... vol. 55, no. 10, 2005, p. 434.
7) CiteSeer.IST. http://citeseer.ist.psu.edu/ (accessed 2006-05-12)
8) Google Scholar Beta. http://scholar.google.com/ (accessed 2006-05-12)
9) Lawrence, S.; Giles, C.L.; Bollacker, K. Digital libraries and autonomous citation indexing. IEEE Computer. vol. 32, no. 6, 1999, p. 67-71. http://citeseer.ist.psu.edu/aci-computer/acicomputer99.html (accessed 2006-05-12)
10) Google Scholar Beta 2006 5
11) http://www.openaccessjapan.com/archives/2006/05/oa1.html (accessed 2006-05-12)
12) Crane, Diana. 1979, 260 p.
13) Adobe Acrobat family. http://www.adobe.co.jp/products/acrobat/ (accessed 2006-05-12)
14) Glyph & Cog. Xpdf. http://www.foolabs.com/xpdf/ (accessed 2006-05-12)
15) papy. http://homepage3.nifty.com/e-papy/index.html (accessed 2006-05-12)
16) Ishikawa, O. http://ohju.cside4.jp/software/pdftrans/ (accessed 2006-05-12)
17) Taku, Kudo. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://chasen.org/~taku/software/mecab/ (accessed 2006-05-12)
18) PDF 2005, 2005-10-22/23 p. 165-168.
19) Department of Computer Science, University of Waikato. Weka. http://www.cs.waikato.ac.nz/ml/weka (accessed 2006-05-12)
20) Witten, Ian H.; Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., San Francisco, Morgan Kaufmann, 2005, 525 p.
21) Vapnik, Vladimir N. The Nature of Statistical Learning Theory. 2nd ed. New York, Springer, 2000, xix, 314 p. SVM Cristianini, Nello; Shawe-Taylor, John. 2005, 252 p.
22) Joachims, Thorsten. SVM light. http://svmlight.joachims.org/ (accessed 2006-05-12)
23) Chang, Chih-Chung; Lin, Chih-Jen. LIBSVM: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (accessed 2006-05-12)
24) Schapire, R.E.; Singer, Y. BoosTexter: A boosting-based system for text categorization. Machine Learning. vol. 39, no. 2/3, 2000, p. 135-168.
25) Dietterich, Thomas G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. vol. 40, no. 2, 2000, p. 139-157.
26) Freund, Y.; Schapire, R. vol. 14, no. 5, 1999, p. 771-780.
27) Allwein, E.; Schapire, R. E.; Singer, Y. BoosTexter. http://www.research.att.com/sw/tools/boostexter/ (accessed 2006-05-12)
28) Schapire, R. E. The boosting approach to machine learning: An overview. MSRI workshop on nonlinear estimation and classification. 2001, p. 149-172.
29) Graham, Paul 8 2005, p. 127-135. http://www.shiro.dreamhost.com/scheme/trans/spam-j.html (accessed 2006-05-12)
30) nabeken. bsfilter/bayesian spam filter. http://bsfilter.org/ (accessed 2006-05-12)
31) Robinson, G. A statistical approach to the spam problem. http://www.linuxjournal.com/article/6467 (accessed 2006-05-12)
32) Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and regression trees. Belmont, Wadsworth International Group, 1984, 358 p.
33) Quinlan, J. R. Induction of decision trees. Machine Learning. vol. 1, no. 1, 1986, p. 81-106.
34) Quinlan, J. R. AI 1995, 293 p.