A Proposal for an Automatic Content Extraction Tool for Web Page Collections That Requires No Training Data

DEIM Forum 2009 A8-4

Primary Content Extraction from Web Pages without Training Data

Mitsuo YOSHIDA and Mikio YAMAMOTO

College of Information Sciences, and Graduate School of Systems and Information Engineering, University of Tsukuba
Tennodai 1-1-1, Tsukuba, Ibaraki 305-8573 JAPAN
E-mail: m.yoshida@mibel.cs.tsukuba.ac.jp, myama@cs.tsukuba.ac.jp

Abstract  In recent years, the proportion of primary content in a Web page has been decreasing as content management systems (CMSs) continue to spread, because CMSs automatically and excessively add unnecessary parts such as menus and copyright notices to the Web page. In this paper, we propose a simple method, requiring no training data, for extracting the primary content from a collection of Web pages. We regard a Web page as a set of blocks (minimum units of primary or non-primary content), and assume that blocks of the primary content are unique and those of non-primary content are not. We describe experimental results that show the performance of the method on real Web pages from news sites in Japanese and English.

Key words  Primary Content Extraction, Unsupervised, Semi-structured Data, HTML, Web and Internet, Web Science, Data mining

1. Web 2008 7 Google 1998 2600 Web 1 [1] Web CMS Content Management System 1 CMS Web Web 1 Web

1 2 Web Web Web Web Web 1 Web Web Web Web Web Web 2. Web Bing [2] Web Web [3] Web DOM DOM 2 http://www.asahi.com/business/update/0106 /TKY200901060314.html DOM Web Web Web Lin [4] Web Debnath [5] IBDF Inverse Block Document Frequency 2 1 tag-set Web TABLE TABLE 2 IBDF Web Web Web Web W3C World Wide Web Consortium Web Web 3. Web 3. 1 Web Web Web 1 2 3 4 5 6 7

Web 3 Web Web Web Web Web

3.2 Web Web Web Web 1 Web Web Web 4

Step.1 [Web ] Web
Step.2 [ ] Web
Step.3 [ ] Step.2
Step.4 [ ] Step.3 Web
Step.5 [ ] Step.4 Web

3.3 Web Web Web Web Web S

$S = \{D_1, D_2, D_3, \ldots, D_N\}$

$D_i$ ($1 \le i \le N$) Web

3.4 3 Adblock (Firefox Add-ons) Web SGML (Standard Generalized Markup Language) HTML DOM DOM Web 2 HTML 3 2 DOM DOM 3 Web Web HTML WWW W3C (World Wide Web Consortium) W3C HTML Web H1, P, DIV, TABLE FONT, STRONG, A [6]

<body> <div> <p>text 1</p> <div> <div> <a href="#" title="a-title text">Text 2</a> <script>code</script> </body>

[Figure 2: HTML example. Figure 3: its DOM tree, with nodes HTML; BODY; DIV(1), DIV(2), DIV(3); P(1), TEXT(1); IMG(1), IMG(2), IMG(3); A; SCRIPT; TEXT(2); CODE]
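The block division described in Section 3.4 can be illustrated with a minimal sketch. It rests on several assumptions that the surviving text does not confirm: that a block is a block-level element together with the inline elements and text directly inside it, that nested block-level elements form their own blocks, and that SCRIPT and STYLE contents are discarded. The names `BLOCK_TAGS`, `BlockSplitter`, `split_blocks` and the `SAMPLE` markup (a guess at the example HTML, whose IMG elements were lost) are hypothetical.

```python
from html.parser import HTMLParser

# Assumed subset of block-level elements; inline elements such as A, FONT,
# STRONG and IMG stay inside the enclosing block.
BLOCK_TAGS = {"body", "div", "p", "h1", "table"}
SKIPPED_TAGS = {"script", "style"}  # their contents are ignored


class BlockSplitter(HTMLParser):
    """Collects one token list per block-level element (well-formed HTML assumed)."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of token lists for open block elements
        self.blocks = []       # finished blocks, in closing order
        self.skip_depth = 0    # > 0 while inside SCRIPT or STYLE

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in BLOCK_TAGS:
            self.open_blocks.append([])
        if not self.open_blocks:
            return
        tokens = self.open_blocks[-1]
        tokens.append(("tag", tag))
        # title and alt attribute values are kept as text of the block
        for name, value in attrs:
            if name in ("title", "alt") and value:
                tokens.append(("text", value))

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in BLOCK_TAGS and self.open_blocks:
            self.blocks.append(self.open_blocks.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_blocks and not self.skip_depth:
            self.open_blocks[-1].append(("text", text))


def split_blocks(html):
    """Return the blocks of one page as lists of ("tag" | "text", value) tokens."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical reconstruction of the example page (the original markup is incomplete).
SAMPLE = """
<body>
<div>
<p>text 1</p>
<div><img src="1.png" alt="img-alt text"></div>
<div><img src="2.png" alt="img-alt text"><img src="3.png" alt="img-alt text"></div>
<a href="#" title="a-title text">Text 2</a>
<script>code</script>
</div>
</body>
"""

for number, block in enumerate(split_blocks(SAMPLE), start=1):
    print(number, block)
```

Under these assumptions the sample yields five blocks in the same order as the paper's example (P, two DIVs holding images, the outer DIV with the A element, and BODY), and the SCRIPT content is dropped.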

SCRIPT, STYLE 2 BODY HTML Web $D_i$ ($1 \le i \le N$) $B_{ij}$ ($1 \le i \le N$, $1 \le j \le M_i$)

$B_{ij} = (b_{ij1}, b_{ij2}, b_{ij3}, \ldots, b_{ijL})$ ($1 \le i \le N$, $1 \le j \le M_i$)

[Figure 4: block division of the DOM tree of Figure 3 into 5 blocks. Nodes: BODY; DIV(1), DIV(2), DIV(3); P(1), TEXT(1); IMG(1), IMG(2), IMG(3); A; SCRIPT; TEXT(2); CODE]

Web S Web $D_i$ ($1 \le i \le N$)

$D_i = \{B_{i1}, B_{i2}, B_{i3}, \ldots, B_{iM_i}\}$ ($1 \le i \le N$)

$B_{ij}$ ($1 \le i \le N$, $1 \le j \le M_i$) Web

3.5 1 2 3 title, alt title, alt IMG 4 HTML 5 5 1 1 <a> <body> a-title text text 1

1. <p>text 1</p>
2. <div>
3. <div>
4. <div> <a href="#" title="a-title text">Text 2</a>
5. <body></body>

[Figure 5: the 5 blocks of the HTML example of Figure 2]

$b_{ijk}$ ($1 \le i \le N$, $1 \le j \le M_i$, $1 \le k \le L$) Web Web N L Web S

3.6 3.5 Web 6 $B_{ij}$ ($1 \le i \le N$, $1 \le j \le M_i$) $B_{kl}$ ($1 \le k \le N$, $1 \le l \le M_k$) $\mathrm{Sim}(B_{ij}, B_{kl})$

$$\mathrm{Sim}(B_{ij}, B_{kl}) = \frac{B_{ij} \cdot B_{kl}}{|B_{ij}|\,|B_{kl}|}$$

$\mathrm{Sim}(B_{ij}, B_{nm})$ 0.9

[Figure 6: each block of Web Page 1 (Block(1,1), Block(1,2), ..., Block(1,i)) is compared with the blocks of Web Page 2 (Block(2,1), Block(2,2), ..., Block(2,j)) through Web Page n (Block(n,1), Block(n,2), ..., Block(n,k)) to check whether they are the same]

3.7 3.6 Web Web 1

4.

4.1 2. Precision Recall F F-measure Perfect-matching Web N
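The block comparison of Section 3.6 can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: we assume that Sim is the cosine similarity of the block feature vectors, that a block counts as duplicated when its similarity to some block of another page is at least 0.9 (the comparison operator did not survive), and that blocks without such a counterpart are taken as primary content. The function names `cosine` and `unique_blocks` and the sparse `Counter` representation of $B_{ij}$ are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse feature vectors (feature -> count)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def unique_blocks(pages, threshold=0.9):
    """For each page, return the indices of blocks with no similar block on any other page.

    pages: list of pages, each page a list of Counter feature vectors.
    threshold: assumed cut-off; the value 0.9 comes from the text.
    """
    result = []
    for i, page in enumerate(pages):
        kept = []
        for j, block in enumerate(page):
            duplicated = any(
                cosine(block, other) >= threshold
                for k, other_page in enumerate(pages)
                if k != i
                for other in other_page
            )
            if not duplicated:
                kept.append(j)
        result.append(kept)
    return result


# Two toy pages sharing a menu block; only the article blocks differ.
menu = Counter({"tag:div": 1, "tag:a": 3, "text:Home": 1})
page1 = [Counter(menu), Counter({"tag:p": 1, "text:Story A": 1})]
page2 = [Counter(menu), Counter({"tag:p": 1, "text:Story B": 1})]
print(unique_blocks([page1, page2]))
```

In the toy input the shared menu block is discarded on both pages, while the two article blocks (cosine similarity 0.5) are kept. The pairwise comparison is quadratic in the number of blocks, which is acceptable for a sketch but would need indexing for a large collection.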

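Table 1: Feature values of the 5 blocks

Block  <a>  <body>  <div>  <img>  <p>  a-title text  img-alt text  text 1  text 2
1      0    0       0      0      1    0             0             1       0
2      0    0       1      1      0    0             1             0       0
3      0    0       1      2      0    0             2             0       0
4      1    0       1      0      0    1             0             0       1
5      0    1       0      0      0    0             0             0       0

$$\mathrm{Precision} = \frac{R}{N}$$

$$\mathrm{Recall} = \frac{R}{C}$$

$$F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{R}{\frac{1}{2}(N + C)}$$

Here $R$ is the amount extracted correctly, $N$ the amount extracted, and $C$ the amount of correct content.

F Web Web F Web Web N Web M

$$\mathrm{Perfect\text{-}matching} = \frac{M}{N}$$

In this formula $N$ is the number of Web pages and $M$ the number of Web pages whose content was extracted perfectly.

4.2 7 7 HTML DOM 3.1 3.1 2 3.1 Web 2 3 6 7

4.3 3 asahi.com (note 4) Mainichi jp (note 5) YOMIURI ONLINE (note 6) Web URL CEEK.JP NEWS (note 7) URL HTML CEEK.JP NEWS URL Web 4 5 ALL asahi.com Mainichi jp YOMIURI ONLINE 8 (note 8) Web

Note 4: http://www.asahi.com/
Note 5: http://mainichi.jp/
Note 6: http://www.yomiuri.co.jp/
Note 7: http://news.ceek.jp/
Note 8: http://www.yomiuri.co.jp/politics/news/20081205-OYT1T00914.htm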
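The evaluation measures of Section 4.1 can be written down directly from the formulas. The sketch below assumes that extraction results are compared as sets of block identifiers; the unit actually counted in the experiments is not recoverable from the text, so this choice, like the function names `block_metrics` and `perfect_matching`, is hypothetical.

```python
def block_metrics(extracted, correct):
    """Return (precision, recall, F-measure) for one extraction result.

    extracted: set of items output by the extractor (size N)
    correct:   set of items in the true primary content (size C)
    """
    r = len(extracted & correct)  # R: items extracted correctly
    n = len(extracted)
    c = len(correct)
    precision = r / n if n else 0.0
    recall = r / c if c else 0.0
    # F = 2PR / (P + R), which simplifies to R / ((N + C) / 2)
    f_measure = 2 * r / (n + c) if (n + c) else 0.0
    return precision, recall, f_measure


def perfect_matching(pages):
    """Fraction M / N of pages whose extracted set equals the correct set.

    pages: non-empty list of (extracted, correct) pairs, one per Web page.
    """
    matched = sum(1 for extracted, correct in pages if extracted == correct)
    return matched / len(pages)


precision, recall, f_measure = block_metrics({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(precision, recall, round(f_measure, 4))
print(perfect_matching([({1}, {1}), ({1, 2}, {1})]))
```

With 4 extracted items, 5 correct items and 3 in common, precision is 3/4, recall is 3/5 and the F-measure is 3 / 4.5 = 2/3, which agrees with both forms of the formula.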

Table 2

Site    Pages  Precision  Recall  F-measure  Perfect-matching
…       274    0.9968     0.9915  0.9941     0.9526
A       124    0.9931     0.9558  0.9741     0.6935
B       69     1.0000     0.9860  0.9930     0.9275
C       91     1.0000     0.9889  0.9944     0.9560
D       11     1.0000     1.0000  1.0000     1.0000
E       43     0.9953     0.9976  0.9965     0.9535
F       104    0.9977     0.9455  0.9709     0.8173
Total   716    0.9965     0.9771  0.9867     0.8841

Table 3: Dataset (domestic)

Site            Pages  …      …     Date
asahi.com       179    13593  1031  2008-12-12
Mainichi jp     180    28656  1017  2008-12-12
YOMIURI ONLINE  176    33420  1178  2008-12-12
Total           535    75669  3226  -

jp F Web Web Web 1 1 Web 18 Web 0.9494 0.9805 Web 8 Web Web Web 9 9 Web 10 10 Web URL jp 18

9 jp Web 1 Web 1 11 11

Note 9: http://mainichi.jp/enta/sports/news/20081211k0000e050032000c.html
Note 10: http://mainichi.jp/enta/sports/baseball/news/20081211k0000e050032000c.html
Note 11: http://www.yomiuri.co.jp/atmoney/mnews/20081210-OYT8T00266.htm

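Table 4: Experimental results (domestic, 1)

Site            Precision  Recall  F-measure  Perfect-matching
asahi.com       0.9980     0.9777  0.9878     0.8939
Mainichi jp     0.9372     0.7925  0.8588     0.5111
YOMIURI ONLINE  0.9965     0.9559  0.9757     0.8125
Total           0.9800     0.9113  0.9444     0.7383

Table 5: Experimental results (domestic, 2)

Site  Precision  Recall  F-measure  Perfect-matching
ALL   0.9803     0.9113  0.9446     0.7383

Figure 10

Figure 11: Example Web page from Mainichi jp (2): a case where extraction of the date failed

… is not included, the date is likely to appear on other Web pages as well, because there are only a limited number of ways to express a date. One way to solve this would be to prepare in advance a model that has learned date expressions and to extract dates separately.

The totals in Table 4 and the results in Table 5 are almost identical, although the extraction procedure differs. The total in Table 4 aggregates the results obtained by forming one Web page collection per Web site and extracting the content from each, whereas Table 5 is the result of forming a single Web page collection from all Web pages in the dataset. This shows that forming a Web page collection across Web sites has almost no effect on performance.

4.4 Experimental results on overseas news sites

Table 6 gives the details of the dataset used. The URLs of the CNN.com (note 12) Web pages were obtained from the English edition of Google News (note 13), and the HTML files were retrieved from that list of URLs. When obtaining URLs from Google News, we specified only the domain (note 14) so that the content of the Web pages would vary. Blog-style pages on which readers can post comments were excluded by hand.

Table 7 shows the experimental results. Figure 12 (note 15) is an example of a Web page after automatic content extraction; the colored parts indicate the content. The results are relatively poor compared with the domestic news sites. Recall and the perfect-matching rate are particularly poor.

Figure 12: Example Web page from the experiment (overseas), after content extraction

The CNN.com dataset, like the Mainichi jp dataset, …

Note 12: http://www.cnn.com/
Note 13: http://news.google.com/
Note 14: The search query site:cnn.com was used.
Note 15: http://sportsillustrated.cnn.com/2009/baseball/mlb/01/15/bp.salarycap/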

Table 6: Dataset (overseas)

Site     Pages  …      …     Date
CNN.com  175    31401  2758  2009-01-16

Table 7: Experimental results (overseas)

Site     Precision  Recall  F-measure  Perfect-matching
CNN.com  0.9438     0.7128  0.8122     0.2971

URL Web 14 14 Web 0.9411 0.8953 Web CNN.com 13 16 Web Web Web Web Web 13

5. Web Web Web

References

[1] Jesse Alpert, Nissan Hajaj. (2008). We knew the web was big.... Official Google Blog. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, (Accessed 2009-01-29).
[2] Lidong Bing, Yexin Wang, Yan Zhang, Hui Wang. (2008). Primary Content Extraction with Mountain Model. IEEE CIT2008. pp.479-484.
[3] …, …. (2008). … Web …. … 14 ….
[4] Shian-Hua Lin, Jan-Ming Ho. (2002). Discovering Informative Content Blocks from Web Documents. In Proceedings of ACM SIGKDD 02. pp.588-593.
[5] Sandip Debnath, Prasenjit Mitra, Nirmal Pal, and C. Lee Giles. (2005). Automatic Identification of Informative Sections of Web Pages. IEEE Transactions on Knowledge and Data Engineering. Vol.17, No.9, pp.1233-1246.
[6] W3C. (1999). The global structure of an HTML document. HTML 4.01 Specification. http://www.w3.org/TR/1999/REC-html401-19991224/struct/global.html#h-7.5.3, (Accessed 2009-01-29).

Note 16: http://money.cnn.com/news/newsfeeds/articles/djf500/200901151434dowjonesdjonline001004 FORTUNE5.htm