2015 9


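JAIST Reposi https://dspace.j
Title: Extraction of Site Information and Author Information from Web Pages (ウェブページからのサイト情報 作成者情報の抽出)
Author(s): Hori, Tatsuya (堀, 達也)
Citation:
Issue Date: 2015-09
Type: Thesis or Dissertation
Text version: author
URL: http://hdl.handle.net/10119/12932
Rights:
Description: Supervisor: Kiyoaki Shirai (白井清昭), School of Information Science, Master's degree
Japan Advanced Institute of Science and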

2015 9

1310067 : 2015 8
Copyright © 2015 by Hori Tatsuya

,.,.,,.,.,.,,,.,,,.,,,.,., (, ), (,,, ).,,.,,,.,.,.,,. Kato,, DOM,. Giuffrida,,,.,, ( ),,,, ( ).,,., HTML Document Object Model (DOM),. Support Vector Machine (SVM), DOM, id, class,,, DOM,,, n-gram.,, ( DOM ) ( DOM )

., DOM,,..,.,,,..,, DOM., Kato., 0. 0 DOM,,.,,.. 500,., 10,, F.,,. F, 0.384, 0.258, F, 0.585, 0.675.,,,.,,,., F,.,.,,., F.,.,,.,.,.,,.,,. 2

1: 1.1, 1.2, 1.3
2: 2.1, 2.2
3: 3.1, 3.2, 3.3, 3.4 (3.4.1, 3.4.2, 3.4.3)
4: 4.1, 4.2, 4.3, 4.4 (4.4.1, 4.4.2, 4.4.3), 4.5
5: 5.1, 5.2
A: 100 n-gram

1 1.1,..,.,.,.,,,.,,,.,,,.,.,. 1.2,., (, ), (,,, ).,,.,,,.,,,.,,,.,. 1

1.3 5. 2,. 3,. 4,,.,. 5,. 2

2,. 2.1,, ( ). 2.2,. 2.1 2.1: [1],, [1]., 2.1,,, DOM.,.. X 3

Y DOM HTML DOM DOM, 10 10 5 5, 5. 2.1.,,.,, DOM. 2.1: 1 [1] 10 10 5 5 Precision Recall Precision Recall 0.21 0.52 0.48 0.68,, DOM.. Juman 4

,, 10 10 2. 2.2., 1, 2. 2.2: 2 [1] Precision Recall 1 0.53 0.47 1 0.84 0.47 Kato,, [2]. Kato 2.2.,.. 1. HTML. 2.. 3. KNP( ). 4.. (a) (b) ( ) 5.. (a). (b). (c) ( ) ( ). 6...,. HTML (, ) 5

2.2: Kato [2], 2.3. author name content, h1 div, 2 h1-div-table-tbody-tr-td, 5. SVM. SVM,., ABC,. 2 : ABC 1 : ABC 0 k,., 2.3. All,. 58.6%.,, k.

Table 2.3: [2]

| k | Ranking Precision |
|---|---|
| 1 | 0.586 |
| 3 | 0.720 |
| 5 | 0.752 |
| All | 0.847 |

2.3: DOM [2] Giuffrida, PostScript,,,, [3].,., Giuffrida,. Giuffrida. 1. xy. 2. xy. 3. -,. 4... 2.4. 7

2.4: [3] 9 12 10 10 8 2.5: [3] Accuracy 92% 87% 75% 71% 76% 2.5.,,. Kawahara, [4].,,,. Kawahara. 1. Web,. 1). Web.. 2). JUMAN KNP,. 3). Web,. 2.,. 8

. :.. :. 3... ( ), ). :.. :.,, 25. 2.4.,. A). B)., C).,. 9

2.4: [4] 2.5. Kawahara 2.6. major p-a, contradictions., A, B, 82.5%, 79.3%,. Kobayashi, [5]. Kobayashi, ( ). opinion holder subject ( ) 10

2.5: [4] : aspect,,, Subject evaluation Opinion holder / :,,,, Asp-Eval, Asp-of. Asp-Eval aspect evaluation, Asp-of, aspect.,, subject,, aspect, Asp-of. 2.6.,., 1. 11

Table 2.6: Kawahara [4]

| | major p-a | contradictions |
|---|---|---|
| relevant (A, B) | 160/194 (82.5%) | 46/58 (79.3%) |
| relevant (A) | 118/194 (60.8%) | 39/58 (67.2%) |
| should be merged (B) | 42/194 (21.6%) | 7/58 (12.1%) |
| not relevant (C) | 34/194 (17.5%) | 12/58 (20.7%) |

2.6: [5] ( (Rest) (Auto) (Phone) (Game)), 2.7. I, II aspect., other aspect 3, Non-writer op. holder opinion holder. Asp-Eval, Asp-of.,, Aspect -ga VP-te Evaluation.,., aspect-aspect, aspect-evaluation., Asp-Eval Asp-of.,..

Table 2.7: [5]

| | Rest | Auto | Phone | Game |
|---|---|---|---|---|
| articles | 1,356 | 564 | 481 | 361 |
| sentences | 21,666 | 14,005 | 11,638 | 6,448 |
| # of opinion units | 4,267 | 1,519 | 1,518 | 775 |
| I Asp-Eval | 3,692 | 943 | 965 | 521 |
| I Asp-Asp | 1,426 | 280 | 296 | 221 |
| I Subj-Asp | 2,632 | 877 | 850 | 451 |
| II Subj-Eval | 575 | 576 | 553 | 243 |
| II Subj-Asp-Eval | 2,314 | 736 | 768 | 351 |
| II Subj-Asp-Asp-Eval | 1,065 | 175 | 172 | 127 |
| II other | 313 | 32 | 25 | 54 |
| Non-writer op. holder | 95 | 17 | 22 | 2 |

1). evaluation( aspect),, aspect.,, 2). 2). evaluation( aspect) aspect. Kobayashi, Tateishi [6],., [7] Kobayashi,. 2.8, 2.9. A B Asp-of, B C Asp-of, A C Asp-of., 2.9 Asp-of. Asp-Eval,, 10%. Asp-of, 10%, 20%.,. 2.2, Kato, Giuffrida,.,,

2.8: Asp-Eval [5] Asp-Eval P 0.56 (432/774) R 0.53 (432/809) P 0.70 (504/723) 0.13 (46/360) R 0.62 (504/809) 0.17 (46/274) + P 0.72 (502/694) 0.14 (53/389) R 0.62 (502/809) 0.19 (53/274) 2.9: Asp-of [5] Asp-of precision recall 0.27 (175/682) 0.17 (175/1048) 0.44 (458/1047) 0.44 (458/1048) + 0.45 (474/1047) 0.45 (474/1048) ( ),,,, ( ).,,,. Kawahara, Kobayashi,,.,,,. 14

3 3.1, HTML Document Object Model (DOM),. Document Object Model, HTML. DOM, HTML ( ). DOM,, 1 HTML., HTML DOM. 3.1 HTML DOM. div h1 h2. h1 DOM 1, h2 DOM 2., DOM,. Support Vector Machine (SVM)[8] 3.1: HTML DOM 15
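Section 3.1 treats an HTML page as a Document Object Model (DOM) tree whose nodes are then classified. As a minimal sketch of that representation, the following uses Python's standard `html.parser` to list the element nodes of an HTML fragment together with their depth in the tree. The class name `DepthRecorder`, the function `dom_depths`, the sample HTML, and the depth convention (outermost element at depth 0) are illustrative assumptions, not the thesis's implementation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DepthRecorder(HTMLParser):
    """Record (tag, depth) for every element; the outermost element has depth 0."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((tag, self.depth))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def dom_depths(html):
    """Return the element nodes of `html` in document order with their depth."""
    parser = DepthRecorder()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical fragment: a div containing an h1 and an h2.
sample = "<div><h1>Site title</h1><h2>Author: someone</h2></div>"
print(dom_depths(sample))
```

Each `(tag, depth)` pair corresponds to one candidate DOM node; a real pipeline would also keep the node's attributes and text for feature extraction.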

3.2 DOM. site DOM person DOM site-link, DOM person-link, DOM site-part DOM person-part DOM site-image, DOM person-image, DOM other, site person DOM 3.2.,,.,,,. 1 DOM. person-link DOM 3.3., DOM,.,,,, person-link. person-part DOM 3.4. Author:mirura., DOM., Author:mirura DOM ( ) person-part.,,.,,. 16

3.2: site person 3.3: person-link 17

3.4: person-part 3.3 node+infor. node, DOM., node DOM (N t ), N t (N p ), N t 1 (N s ), N t 1 (N ps ).,. node 3.5., infor. SVM, node infor 1, 0., infor. DOM HTML., site-link 18

3.5: node person-link HTML a, site HTML h1, h2. id, class id= title class= profile, id class,.,,,,., id= title-top title, top 2. DOM l, l [1, 20], [11, 30],..., [181, 200], l = 0, l > 200 1.,,.,, node N t. DOM.,,.,,,., N t 19
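Two of the features described above lend themselves to a short sketch: the `id`/`class` values are split into tokens (the text gives `title-top` yielding `title` and `top`), and the text length l is encoded with overlapping bins [1, 20], [11, 30], ..., [181, 200] plus the special cases l = 0 and l > 200. The function names, the feature-key strings, and the choice to split on every non-alphanumeric character are assumptions made for illustration; the source only shows the hyphen case.

```python
import re

# Overlapping length bins: [1, 20], [11, 30], ..., [181, 200].
BINS = [(start, start + 19) for start in range(1, 182, 10)]


def length_features(length):
    """Binary features for a text length: one per bin, plus l = 0 and l > 200."""
    feats = {"len=0": int(length == 0), "len>200": int(length > 200)}
    for lo, hi in BINS:
        feats[f"len[{lo},{hi}]"] = int(lo <= length <= hi)
    return feats


def split_attr(value):
    """Split an id/class value into tokens (assumed rule: any non-alphanumeric run)."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", value) if token]


print(split_attr("title-top"))
print([name for name, on in length_features(15).items() if on])
```

Because the bins overlap by ten characters, most lengths between 11 and 190 activate two bins at once, which softens the boundary between neighbouring length ranges.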

20, 3,.. 1. ChaSen 1. 2. 3 (N t 20 ) 3.1,. 3.1: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - N s N ps title ( ) 1., 4.1,., 3.6 3.2 DOM, h2 DOM, 1 ( h1 ) 3 title, 1. node N t. 1 http://chasen-legacy.osdn.jp/ 20

3.6: 1, DOM., 1, ( ) + ( ) + 1.,. node N t. ABOUT 1.,,. n-gram,.,,.,,,.,,.,, n-gram,. 1, DOM n-gram, 1., 2 2 http://blog.with2.net/ 21

2. n = 3, n-gram 100. n-gram 3.2., 100 n-gram A A.1, A.2. 3.2: n-gram n-gram 558,, 557,, 281,, 115,, 100,,, N t, N p, N s, N ps, N t. DOM 3.3. 3.3: DOM N t N p N s N ps DOM id, class n-gram 3.4, ( DOM ) ( DOM )., 4.1 4.1, 99%., SVM,., DOM,.. 22
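The n-gram feature above is built from the 100 most frequent n-grams with n = 3. A minimal sketch of that selection step, assuming the text has already been split into tokens (the thesis uses ChaSen for morphological analysis; the toy token lists and the function name `top_ngrams` are hypothetical):

```python
from collections import Counter


def top_ngrams(token_lists, n=3, k=100):
    """Count token n-grams over all token lists and return the k most frequent."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(k)


# Toy input standing in for tokenized text from DOM nodes.
docs = [["a", "b", "c", "d"], ["a", "b", "c"]]
print(top_ngrams(docs, n=3, k=2))
```

Each selected n-gram would then become one binary feature that fires when the n-gram occurs in the text associated with a node.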

3.4.1,,,,..,, DOM,., Kato [2]. 3.7.. 1. DOM ( body DOM ) 3. 2.. 3., t m,, 2.., DOM., t m = 0.5. 3.7, DOM.,,.,.,,.,., = 0.1, 0.2, 0.5, 4.1., = 0.1,.,, 0.1., img height, width.,, 100., ( title ) profile,.,.,,. 3.4.2 0 HTML DOM, ( 0 ). 0 DOM,, 3, DOM. 23

., 0,.,. 3.4.3 2 3.4.1,., 2. T,, I,, 24

3.7: (Kato et al. (2008) p.39 Figure 4) 25

4,,. 4.1,,. 4.2,. 4.3,. 4.4,,,. 4.5,,. 4.1,. Yahoo!,goo,FC2,, 500.,,. 4.1 DOM., site-part person-part, site person., site-image person-image,,, other. 500 10 ( D 1 D 10 ),., D 10., 50,.. 4.2. SVM,., D test. 4.2, DOM. 1. N s N ps, site. 26

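Table 4.1: Number of DOM nodes per class

| Class | DOM nodes |
|---|---|
| site | 252 |
| site-link | 14 |
| site-part* | 17 |
| site-image* | 8 |
| person | 243 |
| person-link | 183 |
| person-part* | 35 |
| person-image* | 2 |
| other | 668386 |

Table 4.2: Number of DOM nodes per class (D_test)

| Class | DOM nodes |
|---|---|
| site | 20 |
| site-link | 0 |
| site-part* | 1 |
| site-image* | 0 |
| person | 28 |
| person-link | 21 |
| person-part* | 3 |
| person-image* | 0 |
| other | 55424 |

2. N t about, a, site-link.
3. N s N ps profile, person.
4. N t profile, a, person-link.
5. other.

4.3,, HTML DOM,, F. (P), (R), F (F).

$$P = \frac{\text{number of correctly extracted DOM nodes}}{\text{number of extracted DOM nodes}} \tag{4.1}$$

$$R = \frac{\text{number of correctly extracted DOM nodes}}{\text{number of correct DOM nodes}} \tag{4.2}$$

$$F = \frac{2PR}{P + R} \tag{4.3}$$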
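Equations (4.1) to (4.3) define precision P, recall R, and the F-measure over DOM nodes. The following is a minimal sketch of that computation for a single class, assuming the system output and the gold standard are each given as a set of node identifiers. The function name, the integer identifiers, and the convention of returning 0.0 for an empty denominator are assumptions for illustration.

```python
def precision_recall_f(predicted, gold):
    """Return (P, R, F) for one class given predicted and gold node ids."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


# Hypothetical node ids: 4 nodes extracted, 3 gold nodes, 2 in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 5})
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```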

P, R, F other, site, site-link, person, person-link,. 10, 10, 10., P, R, F. 4.4 4.4.1 4.3 4.6. 4.3, 10 10, 4.4,., 4.5 4.6,. 4.5 (10 ).,, 3.4.3 I.,. person-link F,., site, site-link, person,,., 4.2,., D 3, D 6, D 7, D 9, site-link DOM, site-link., 4.3 D 1 D 10 F. site, 0.235(D 4 ), 0.541(D 9 ), 0.306., site, site-link 0.253, person 0.3, person-link 0.21.,, ±0.1., site D 2, D 4, D 8, D 9, D 10, site-link D 10, person D 1, person-link D 9., site, ±0.1,.,. site-link, D 2., site-link. site, person, person-link F,, person-link, person, site.,,., 4.4, site 0.192, person 0.367, person-link 0.309., ±0.1, person D 1, D 4, D 5, D 6, D 10, person-link D 7, D 9, site., person, person-link,. 28
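The evaluation reports results for each of the ten subsets D_1 to D_10 into which the 500 pages are divided (50 pages each). A minimal sketch of such a 10-fold protocol follows. How pages were actually assigned to subsets is not recoverable from the text, so the contiguous split, the function names, and the integer page ids are assumptions.

```python
def make_folds(items, k=10):
    """Split items into k equal, contiguous folds (any remainder is dropped)."""
    size = len(items) // k
    return [items[i * size:(i + 1) * size] for i in range(k)]


def cross_validation_splits(folds):
    """Yield (training items, held-out fold) once for each fold."""
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


pages = list(range(500))  # stand-in ids for the 500 pages
folds = make_folds(pages, k=10)
splits = list(cross_validation_splits(folds))
print(len(folds), len(folds[0]), len(splits[0][0]))
```

Each split trains on 450 pages and evaluates on the remaining 50; averaging the per-fold scores gives a single figure per class.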

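Table 4.3: Per-fold results (10-fold cross-validation)

Precision

| Fold | site | site-link | person | person-link |
|---|---|---|---|---|
| D_1 | 0.246 | 0.067 | 0.027 | 1.000 |
| D_2 | 0.153 | 0.077 | 0.146 | 0.714 |
| D_3 | 0.300 | - | 0.161 | 0.909 |
| D_4 | 0.148 | 0.063 | 0.143 | 0.857 |
| D_5 | 0.311 | 0.125 | 0.217 | 0.933 |
| D_6 | 0.194 | - | 0.132 | 0.909 |
| D_7 | 0.214 | - | 0.225 | 0.950 |
| D_8 | 0.392 | 0.059 | 0.219 | 0.733 |
| D_9 | 0.411 | - | 0.230 | 1.000 |
| D_10 | 0.388 | 0.250 | 0.157 | 0.733 |

Recall

| Fold | site | site-link | person | person-link |
|---|---|---|---|---|
| D_1 | 0.600 | 0.500 | 0.462 | 0.692 |
| D_2 | 0.565 | 0.333 | 0.583 | 0.714 |
| D_3 | 0.643 | - | 0.600 | 0.625 |
| D_4 | 0.571 | 1.000 | 0.480 | 0.545 |
| D_5 | 0.704 | 1.000 | 0.670 | 0.700 |
| D_6 | 0.633 | - | 0.474 | 0.556 |
| D_7 | 0.682 | - | 0.714 | 0.704 |
| D_8 | 0.667 | 1.000 | 0.694 | 0.579 |
| D_9 | 0.793 | - | 0.742 | 0.750 |
| D_10 | 0.765 | 0.667 | 0.706 | 0.667 |

F

| Fold | site | site-link | person | person-link |
|---|---|---|---|---|
| D_1 | 0.349 | 0.118 | 0.051 | 0.818 |
| D_2 | 0.241 | 0.125 | 0.233 | 0.714 |
| D_3 | 0.409 | - | 0.254 | 0.741 |
| D_4 | 0.235 | 0.118 | 0.220 | 0.667 |
| D_5 | 0.432 | 0.222 | 0.331 | 0.800 |
| D_6 | 0.297 | - | 0.207 | 0.690 |
| D_7 | 0.326 | - | 0.342 | 0.809 |
| D_8 | 0.494 | 0.111 | 0.333 | 0.647 |
| D_9 | 0.541 | - | 0.351 | 0.857 |
| D_10 | 0.515 | 0.364 | 0.257 | 0.667 |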

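Table 4.4: Per-fold results (10-fold cross-validation)

Precision

| Fold | site | site-link | person | person-link |
|---|---|---|---|---|
| D_1 | 0.667 | - | 0.611 | 0.917 |
| D_2 | 0.556 | 1.000 | 0.750 | 0.792 |
| D_3 | 0.667 | - | 0.824 | 0.929 |
| D_4 | 0.478 | - | 0.692 | 0.857 |
| D_5 | 0.684 | - | 0.917 | 0.850 |
| D_6 | 0.720 | - | 0.882 | 0.867 |
| D_7 | 0.579 | - | 0.895 | 0.963 |
| D_8 | 0.750 | - | 0.629 | 0.762 |
| D_9 | 0.810 | - | 0.692 | 0.952 |
| D_10 | 0.792 | - | 1.000 | 0.750 |

Recall

| Fold | site | site-link | person | person-link |
|---|---|---|---|---|
| D_1 | 0.480 | - | 0.423 | 0.846 |
| D_2 | 0.435 | 0.333 | 0.500 | 0.905 |
| D_3 | 0.571 | - | 0.560 | 0.813 |
| D_4 | 0.524 | - | 0.360 | 0.545 |
| D_5 | 0.481 | - | 0.759 | 0.850 |
| D_6 | 0.600 | - | 0.789 | 0.722 |
| D_7 | 0.500 | - | 0.607 | 0.963 |
| D_8 | 0.500 | - | 0.611 | 0.842 |
| D_9 | 0.586 | - | 0.581 | 1.000 |
| D_10 | 0.559 | - | 0.765 | 0.833 |

F

| Fold | site | site-link | person | person-link |
|---|---|---|---|---|
| D_1 | 0.558 | - | 0.500 | 0.880 |
| D_2 | 0.488 | 0.500 | 0.600 | 0.844 |
| D_3 | 0.615 | - | 0.667 | 0.867 |
| D_4 | 0.500 | - | 0.474 | 0.667 |
| D_5 | 0.565 | - | 0.830 | 0.850 |
| D_6 | 0.655 | - | 0.833 | 0.788 |
| D_7 | 0.537 | - | 0.723 | 0.963 |
| D_8 | 0.600 | - | 0.620 | 0.800 |
| D_9 | 0.680 | - | 0.632 | 0.976 |
| D_10 | 0.655 | - | 0.867 | 0.789 |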

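Table 4.5: Average results (10-fold cross-validation)

| Per-fold source | Class | P | R | F |
|---|---|---|---|---|
| Table 4.3 | site | 0.276 | 0.662 | 0.384 |
| Table 4.3 | site-link | 0.107 | 0.750 | 0.176 |
| Table 4.3 | person | 0.116 | 0.613 | 0.258 |
| Table 4.3 | person-link | 0.874 | 0.653 | 0.741 |
| Table 4.4 | site | 0.670 | 0.524 | 0.585 |
| Table 4.4 | site-link | 1.000 | 0.333 | 0.500 |
| Table 4.4 | person | 0.789 | 0.596 | 0.675 |
| Table 4.4 | person-link | 0.864 | 0.832 | 0.842 |

Table 4.6: Results on D_test (the two blocks are in the same order as in Table 4.5)

| Block | Class | P | R | F |
|---|---|---|---|---|
| 1 | site | 0.320 | 0.762 | 0.451 |
| 1 | site-link | - | - | - |
| 1 | person | 0.208 | 0.645 | 0.315 |
| 1 | person-link | 0.722 | 0.619 | 0.667 |
| 2 | site | 0.667 | 0.667 | 0.667 |
| 2 | site-link | - | - | - |
| 2 | person | 0.750 | 0.677 | 0.712 |
| 2 | person-link | 0.840 | 1.000 | 0.913 |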
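The figures in Table 4.5 can be checked against the per-fold tables: each entry is the arithmetic mean of the corresponding per-fold values. The sketch below reproduces the two `site` F-measures from the per-fold F values printed in Tables 4.3 and 4.4. The variable names are illustrative; the numbers are copied from those tables.

```python
from statistics import mean

# Per-fold F-measures for the class `site`, D_1 to D_10.
site_f_table_4_3 = [0.349, 0.241, 0.409, 0.235, 0.432,
                    0.297, 0.326, 0.494, 0.541, 0.515]
site_f_table_4_4 = [0.558, 0.488, 0.615, 0.500, 0.565,
                    0.655, 0.537, 0.600, 0.680, 0.655]

avg_4_3 = mean(site_f_table_4_3)
avg_4_4 = mean(site_f_table_4_4)
print(f"{avg_4_3:.3f} {avg_4_4:.3f}")

# For comparison: F recomputed from the averaged P and R of Table 4.5 (site, first block).
p, r = 0.276, 0.662
f_from_means = 2 * p * r / (p + r)
print(f"{f_from_means:.3f}")
```

The averaged F of 0.384 differs slightly from the value obtained by plugging the averaged P and R into Equation (4.3), which indicates that the table reports the mean of the per-fold F-measures rather than the F-measure of the mean P and R.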

, 4.5,., site, site-link, person.,., person-link,,. F,., 10,., 4.6,. site,.,. F,, site 0.216, person 0.397, person-link 0.346.,.

4.4.2

Table 4.7: Results by feature set (precision, D_test); parentheses give the difference from F_all

| Feature set | site | site-link | person | person-link |
|---|---|---|---|---|
| F_tag | 0.667 (±0.000) | - | 0.808 (+0.058) | 0.895 (+0.055) |
| F_id,class | 0.571 (-0.096) | - | 0.714 (-0.036) | 0.808 (-0.032) |
| F_length | 0.579 (-0.088) | - | 0.778 (+0.028) | 0.808 (-0.032) |
| F_bow | 0.615 (-0.052) | - | 0.955 (+0.205) | 1.000 (+0.160) |
| F_title | 0.684 (+0.017) | - | 0.750 (±0.000) | 0.840 (±0.000) |
| F_sitekey | 0.667 (±0.000) | - | 0.750 (±0.000) | 0.840 (±0.000) |
| F_linkkey | 0.667 (±0.000) | - | 0.750 (±0.000) | 0.840 (±0.000) |
| F_n-gram | 0.770 (+0.103) | - | 0.778 (+0.028) | 0.840 (±0.000) |
| F_all | 0.667 | - | 0.750 | 0.840 |

Table 4.8: Results by feature set (recall, D_test); parentheses give the difference from F_all

| Feature set | site | site-link | person | person-link |
|---|---|---|---|---|
| F_tag | 0.571 (-0.096) | - | 0.677 (±0.000) | 0.810 (-0.190) |
| F_id,class | 0.571 (-0.096) | - | 0.645 (-0.022) | 1.000 (±0.000) |
| F_length | 0.524 (-0.143) | - | 0.677 (+0.010) | 1.000 (±0.000) |
| F_bow | 0.762 (+0.095) | - | 0.677 (+0.010) | 0.476 (-0.524) |
| F_title | 0.619 (-0.048) | - | 0.677 (+0.010) | 1.000 (±0.000) |
| F_sitekey | 0.667 (±0.000) | - | 0.677 (+0.010) | 1.000 (±0.000) |
| F_linkkey | 0.667 (±0.000) | - | 0.677 (+0.010) | 1.000 (±0.000) |
| F_n-gram | 0.667 (±0.000) | - | 0.677 (+0.010) | 1.000 (±0.000) |
| F_all | 0.667 | - | 0.667 | 1.000 |

Table 4.9: Results by feature set (F, D_test); parentheses give the difference from F_all

| Feature set | site | site-link | person | person-link |
|---|---|---|---|---|
| F_tag | 0.615 (-0.052) | - | 0.737 (+0.025) | 0.895 (-0.018) |
| F_id,class | 0.571 (-0.096) | - | 0.678 (-0.034) | 0.894 (-0.019) |
| F_length | 0.550 (-0.117) | - | 0.724 (+0.012) | 0.894 (-0.019) |
| F_bow | 0.681 (+0.014) | - | 0.792 (+0.080) | 0.645 (-0.268) |
| F_title | 0.650 (-0.017) | - | 0.712 (±0.000) | 0.913 (±0.000) |
| F_sitekey | 0.667 (±0.000) | - | 0.712 (±0.000) | 0.913 (±0.000) |
| F_linkkey | 0.667 (±0.000) | - | 0.712 (±0.000) | 0.913 (±0.000) |
| F_n-gram | 0.683 (+0.016) | - | 0.724 (+0.012) | 0.913 (±0.000) |
| F_all | 0.667 | - | 0.712 | 0.913 |

Table 4.10: Results by feature set (precision, D_10); parentheses give the difference from F_all

| Feature set | site | site-link | person | person-link |
|---|---|---|---|---|
| F_tag | 0.900 (+0.108) | - | 0.889 (-0.111) | 0.846 (+0.096) |
| F_id,class | 0.762 (-0.030) | - | 0.703 (-0.297) | 0.750 (±0.000) |
| F_length | 0.818 (+0.026) | - | 1.000 (±0.000) | 0.750 (±0.000) |
| F_bow | 0.821 (+0.029) | - | 0.864 (-0.136) | 1.000 (+0.250) |
| F_title | 0.833 (+0.041) | - | 1.000 (±0.000) | 0.750 (±0.000) |
| F_sitekey | 0.833 (+0.041) | - | 1.000 (±0.000) | 0.750 (±0.000) |
| F_linkkey | 0.783 (-0.009) | - | 1.000 (±0.000) | 0.750 (±0.000) |
| F_n-gram | 0.818 (+0.026) | - | 1.000 (±0.000) | 0.750 (±0.000) |
| F_all | 0.792 | - | 1.000 | 0.750 |

Table 4.11: Results by feature set (recall, D_10); parentheses give the difference from F_all

| Feature set | site | site-link | person | person-link |
|---|---|---|---|---|
| F_tag | 0.529 (-0.030) | - | 0.706 (-0.059) | 0.611 (-0.222) |
| F_id,class | 0.471 (-0.088) | - | 0.765 (±0.000) | 0.833 (±0.000) |
| F_length | 0.529 (-0.030) | - | 0.735 (-0.030) | 0.833 (±0.000) |
| F_bow | 0.676 (+0.117) | - | 0.559 (-0.206) | 0.500 (-0.333) |
| F_title | 0.588 (+0.029) | - | 0.765 (±0.000) | 0.833 (±0.000) |
| F_sitekey | 0.588 (+0.029) | - | 0.765 (±0.000) | 0.833 (±0.000) |
| F_linkkey | 0.529 (-0.030) | - | 0.765 (±0.000) | 0.833 (±0.000) |
| F_n-gram | 0.529 (-0.030) | - | 0.765 (±0.000) | 0.833 (±0.000) |
| F_all | 0.559 | - | 0.765 | 0.833 |

Table 4.12: Results by feature set (F, D_10); parentheses give the difference from F_all

| Feature set | site | site-link | person | person-link |
|---|---|---|---|---|
| F_tag | 0.667 (+0.012) | - | 0.787 (-0.080) | 0.710 (-0.079) |
| F_id,class | 0.582 (-0.073) | - | 0.732 (-0.135) | 0.789 (±0.000) |
| F_length | 0.643 (-0.012) | - | 0.847 (-0.020) | 0.789 (±0.000) |
| F_bow | 0.742 (+0.087) | - | 0.679 (-0.188) | 0.667 (-0.122) |
| F_title | 0.690 (+0.035) | - | 0.867 (±0.000) | 0.789 (±0.000) |
| F_sitekey | 0.690 (+0.035) | - | 0.867 (±0.000) | 0.789 (±0.000) |
| F_linkkey | 0.632 (-0.023) | - | 0.867 (±0.000) | 0.789 (±0.000) |
| F_n-gram | 0.643 (-0.012) | - | 0.867 (±0.000) | 0.789 (±0.000) |
| F_all | 0.655 | - | 0.867 | 0.789 |

,., SVM, 1 SVM.,,, F,. F tag DOM, F id,class id, class, F length, F bow, F title, F sitekey, F linkkey, F n-gram n-gram., F all.,, I. F all 1,, F 4.7, 4.8, 4.9. () F all. F n-gram F all F, site, person F n-gram, person-link. F n-gram F all, n-gram. F all F id,class., id, class.,, site F length, person F id,class, person-link F bow., site, person id, class, person-link., D 10,. 4.10, 4.11, 4.12. F title F all, site F title, person, person-link., F sitekey F all, site F sitekey, person, person-link., F all, F title F sitekey,,. F all, F id,class F length. 2, person-link F, site, person F F id,class., id, class.,, site F id,class, person F bow, person-link F bow., site id, class, person person-link. n-gram D 10, D test., D test, D 10.,., n-gram., id, class,,,. 36

4.4.3 3.4. 4.1, DOM 668386, 3.4.3 T, 192161. DOM,., site 6, site-link 0, site-part 4, site-image 6, person 7, person-link 6, person-part 4, person-image 2. 71%, 5%. 3.4.3 I, DOM 196954., site 2, site-part 2, person 2, person-link 2, person-part 1, 0. 70%, 1%., T,,, F. 4.13. T,, person-link, site, person F T. person T, site,., T. T,, person-link, site, person F T., site, person, T.,, site, person F., I. D test, I, site DOM 20 5 (25%), person DOM 28 3 ( 11%)., T, site DOM 20 1 5%, person DOM., I T,., 3.4.1,., D 10. 4.14., T I person F. 3.4.1, 0.1, F, person F I 37

Table 4.13: (D_test)

| Metric | Setting | site | site-link | person | person-link |
|---|---|---|---|---|---|
| Precision | | 0.682 | - | 0.778 | 0.840 |
| Precision | T | 0.762 | - | 0.808 | 0.840 |
| Precision | I | 0.667 | - | 0.750 | 0.840 |
| Recall | | 0.714 | - | 0.677 | 1.000 |
| Recall | T | 0.762 | - | 0.677 | 1.000 |
| Recall | I | 0.667 | - | 0.677 | 1.000 |
| F | | 0.698 | - | 0.724 | 0.913 |
| F | T | 0.762 | - | 0.737 | 0.913 |
| F | I | 0.667 | - | 0.712 | 0.913 |

T., T, I, person F,., site T, I.,. D 10, I, site DOM 31 4 ( 13%), person DOM 32 2 ( 6%)., T, site DOM 31 1 ( 3%), person DOM. site person, person F site F. 4.5. 4.3, site,,.,,,. 4.1.,, 1, site.,

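Table 4.14: (D_10)

| Metric | Setting | site | site-link | person | person-link |
|---|---|---|---|---|---|
| Precision | | 0.800 | - | 0.926 | 0.750 |
| Precision | T | 0.792 | - | 0.963 | 0.750 |
| Precision | I | 0.792 | - | 1.000 | 0.750 |
| Recall | | 0.588 | - | 0.735 | 0.833 |
| Recall | T | 0.559 | - | 0.765 | 0.833 |
| Recall | I | 0.559 | - | 0.765 | 0.833 |
| F | | 0.678 | - | 0.820 | 0.789 |
| F | T | 0.655 | - | 0.852 | 0.789 |
| F | I | 0.655 | - | 0.867 | 0.789 |

,, DOM other.,, n-gram,. 4.1:,. 4.2.,.,,.,,.,,.,.,. 3.4.1, DOM., 4.3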

4.2:,,,.,,., DOM,., 4.4,,. DOM,, DOM,.,.,,. 40

Figure 4.3: Example 1 of a content-region detection failure
Figure 4.4: Example 2 of a content-region detection failure

5 5.1,., HTML DOM. SVM, 8.,,, SVM.,. DOM 8, 4., 10,.,., id, class, F,.,,,.,,.,.,,.,.,,. 5.2,,.,.,.,,., DOM, 100,., 100,., 42

,.,., 3.4,., 3.3,,, 4.4.2.,.,., 1 DOM.,, n-gram,.,,.,.,. 43

,,.,,.,,. 44

[1],,,. Web. 16, p.94-p.97, 2010.
[2] Yoshikiyo Kato, Daisuke Kawahara, Kentaro Inui, Sadao Kurohashi, and Tomohide Shibata. Extracting the Author of Web Pages. Proceedings of the 2nd ACM Workshop on Information Credibility on the Web (WICOW '08), p.35-p.42, 2008.
[3] Giovanni Giuffrida, Eddie C. Shek, and Jihoon Yang. Knowledge-Based Metadata Extraction from PostScript Files. Proceedings of the Fifth ACM Conference on Digital Libraries (DL '00), p.77-p.84, 2000.
[4] Daisuke Kawahara, Sadao Kurohashi, and Kentaro Inui. Grasping Major Statements and their Contradictions Toward Information Credibility Analysis of Web Contents. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, p.393-p.397, 2008.
[5] Nozomi Kobayashi, Kentaro Inui, and Yuji Matsumoto. Extracting Aspect-Evaluation and Aspect-of Relations in Opinion Mining. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, p.1065-p.1074, 2007.
[6] K. Tateishi, T. Fukushima, N. Kobayashi, T. Takahashi, A. Fujita, K. Inui, and Y. Matsumoto. Web Opinion Extraction and Summarization Based on Viewpoints of Products. In IPSJ SIGNL Note 163, p.1-p.8, 2004.
[7] Razvan Bunescu. Associative Anaphora Resolution: a Web Based Approach. In Proceedings of the EACL Workshop on the Computational Treatment of Anaphora, p.47-p.52, 2003.
[8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, Vol.2, No.3, Article 27, 2011.

A 100 n-gram A.1: n-gram( 1 50 ) 3-gram 3-gram 2992,, 248,, 2138,, 242,, 1524,, 216,, 868,, 205,, 761,, 194,,! 588,, 191,, 557,, 169,,! 539,, 167 &, amp, ; 509,, 161,, 455,, 158,, 435,, 156,, 426,, 154,, 403,, 152,, 353,, 149,, 315,, 148,, 308,, 148,,! 281,, 145,, 281,, 143,, 266,, 139,, 264,, 138,, 262,, 137,, 261,, 133,, 261,, 130,, 260,, 130,, 260,, 128,, 46

A.2: n-gram( 51 100 ) 3-gram 3-gram 126,, 98,, 125,, 97,, 122,, 95,, 121,, 95,, 119,, 94,, 118,, 93,, 118,, 92,, 116,,? 90,, 115,, 89,, 114,, 88,, 114,, 88,, 112,, 88,, 111,, 86,, 110,, 86,, 110,, 85,, 107,, 82,,! 103,, 81,, 103,, 80,, 102,, 79,, 100,, 79,,! 100,,! 79,, 99,, 74,, 99,, 74,, 99,, 73,, 99,, 73,, 47