A Proposal for an Automatic Content Extraction Tool for Collections of Web Pages That Requires No Training Data

DEIM Forum 2009 A8-4

Primary Content Extraction from Web Pages without Training Data

Mitsuo YOSHIDA and Mikio YAMAMOTO

College of Information Sciences, and Graduate School of Systems and Information Engineering, University of Tsukuba
Tennodai 1-1-1, Tsukuba, Ibaraki 305-8573, JAPAN
E-mail: m.yoshida@mibel.cs.tsukuba.ac.jp, myama@cs.tsukuba.ac.jp

Abstract In recent years, the proportion of primary content in a Web page has been decreasing as content management systems (CMSs) continue to spread, because CMSs automatically and excessively add unnecessary parts such as menus and copyright notices to Web pages. In this paper, we propose a simple method that requires no training data and extracts the primary content from a collection of Web pages. We regard a Web page as a set of blocks (the minimum units of primary or non-primary content) and assume that blocks of primary content are unique whereas blocks of non-primary content are not. We report experimental results on real Web pages from Japanese and English news sites to show the performance of the method.

Key words Primary Content Extraction, Unsupervised, Semi-structured Data, HTML, Web and Internet, Web Science, Data mining

1. [...]

According to Google, its index contained 26 million Web pages in 1998, and by July 2008 it had found 1 trillion unique URLs [1]. [...] CMS (Content Management System) [...] (Figure 1) [...]

[Figure 1: [...]]

[...]

2. [...]

Bing et al. [2] [...] [3] [...] DOM [...]. Lin and Ho [4] [...]. Debnath et al. [5] [...] IBDF (Inverse Block Document Frequency) [...] tag-set [...] TABLE [...] IBDF [...]. [...] W3C (World Wide Web Consortium) [...]

3. [...]

3.1 [...]

[...]

Note 2: http://www.asahi.com/business/update/0106/TKY200901060314.html

[...]

3.2 [...]

[...]

Step.1 [Web ...] [...]
Step.2 [...] [...]
Step.3 [...] [...] Step.2 [...]
Step.4 [...] [...] Step.3 [...]
Step.5 [...] [...] Step.4 [...]

3.3 [...]

The collection of Web pages $S$ is written as

$S = \{D_1, D_2, D_3, \ldots, D_N\}$

where $D_i$ $(1 \le i \le N)$ is a Web page.

3.4 [...]

[...] Adblock (Note 3) [...] SGML (Standard Generalized Markup Language) [...] HTML [...] DOM [...] (Figure 2, Figure 3) [...] WWW [...] W3C (World Wide Web Consortium) [...]. HTML elements are divided into block-level elements such as H1, P, DIV and TABLE, and inline elements such as FONT, STRONG and A [6].

[Figure 2: HTML example. Surviving markup:]
<body>
<div>
<p>text 1</p>
<div>
<div>
<a href="#" title="a-title text">Text 2</a>
<script>code</script>
</body>

[Figure 3: DOM tree of the HTML in Figure 2, with nodes BODY, P(1), TEXT(1), DIV(1), DIV(2), DIV(3), IMG(1), IMG(2), IMG(3), A, SCRIPT, TEXT(2), CODE]

Note 3: Adblock (Firefox Add-ons)
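As a rough illustration of dividing a page at block-level elements, the following Python sketch attaches each inline element and text node to its nearest enclosing block-level element and counts tags and texts per block. It is a minimal sketch under assumptions, not the authors' implementation. The names `BlockSplitter`, `split_into_blocks`, `BLOCK_LEVEL` and `SKIPPED`, the lowercasing of text, the treatment of a whole text node as one feature, and the sample HTML (including its IMG elements) are all hypothetical. Only the element names H1, P, DIV, TABLE, BODY, SCRIPT and STYLE and the `title` and `alt` attributes are taken from the text.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: BODY plus the block-level examples named in the text.
BLOCK_LEVEL = {"body", "h1", "p", "div", "table"}
# Assumption: SCRIPT and STYLE contribute nothing to any block.
SKIPPED = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Collects one Counter of tag/text features per block-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open blocks, innermost last
        self.blocks = []     # finished blocks, in closing order
        self.skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skipping = True
            return
        if tag in BLOCK_LEVEL:
            self.stack.append(Counter())   # a new block starts here
        if not self.stack:
            return
        block = self.stack[-1]
        block[f"<{tag}>"] += 1
        for name, value in attrs:
            # title and alt attribute values are counted as text
            if name in ("title", "alt") and value:
                block[value.strip().lower()] += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skipping = False
        elif tag in BLOCK_LEVEL and self.stack:
            # Naive: assumes block-level tags are properly nested and closed.
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip().lower()
        if text and not self.skipping and self.stack:
            self.stack[-1][text] += 1


def split_into_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical page, not the paper's Figure 2.
SAMPLE_HTML = """<body>
<p>text 1</p>
<div><img src="a.png" alt="img-alt text"></div>
<div><img src="b.png" alt="img-alt text"><img src="c.png" alt="img-alt text"></div>
<div><a href="#" title="a-title text">Text 2</a></div>
<script>code</script>
</body>"""

for number, block in enumerate(split_into_blocks(SAMPLE_HTML), start=1):
    print(number, dict(block))
```

On this sample the sketch yields five blocks in the order P, DIV, DIV, DIV, BODY, because a block is recorded when its element closes. Real pages often omit closing tags such as `</p>`, which this naive stack handling does not repair.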

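[...] SCRIPT, STYLE [...] (Figure 2) [...] BODY [...] HTML [...]

A Web page $D_i$ $(1 \le i \le N)$ is divided into blocks $B_{ij}$ $(1 \le i \le N,\ 1 \le j \le M_i)$, and each block is written as a vector:

$B_{ij} = (b_{ij1}, b_{ij2}, b_{ij3}, \ldots, b_{ijL})$ $(1 \le i \le N,\ 1 \le j \le M_i)$

[Figure 4: [...] of the DOM tree in Figure 3, with nodes BODY, P(1), TEXT(1), DIV(1), DIV(2), DIV(3), IMG(1), IMG(2), IMG(3), A, SCRIPT, TEXT(2), CODE]

Each Web page $D_i$ $(1 \le i \le N)$ in the collection $S$ is therefore a set of blocks:

$D_i = \{B_{i1}, B_{i2}, B_{i3}, \ldots, B_{iM_i}\}$ $(1 \le i \le N)$

where $B_{ij}$ $(1 \le i \le N,\ 1 \le j \le M_i)$ [...]

3.5 [...]

[...] title, alt [...] IMG [...] HTML [...] (Figure 5, Table 1) [...] <a>, <body> [...] "a-title text", "text 1" [...]

[Figure 5: the 5 blocks obtained from the HTML in Figure 2]
1. <p>text 1</p>
2. <div>
3. <div>
4. <div> <a href="#" title="a-title text">Text 2</a>
5. <body></body>

$b_{ijk}$ $(1 \le i \le N,\ 1 \le j \le M_i,\ 1 \le k \le L)$ [...], where $N$ is the number of Web pages in the collection $S$ and $L$ [...]

3.6 [...]

[...] Section 3.5 [...] (Figure 6) [...] The similarity between blocks $B_{ij}$ $(1 \le i \le N,\ 1 \le j \le M_i)$ and $B_{kl}$ $(1 \le k \le N,\ 1 \le l \le M_k)$ is

$Sim(B_{ij}, B_{kl}) = \frac{B_{ij} \cdot B_{kl}}{|B_{ij}|\,|B_{kl}|}$

[...] $Sim(B_{ij}, B_{nm})$ [...] 0.9 [...]

[Figure 6: the blocks Block(1,1), Block(1,2), ..., Block(1,i) of Web Page 1 are compared with Block(2,1), Block(2,2), ..., Block(2,j) of Web Page 2 through Block(n,1), Block(n,2), ..., Block(n,k) of Web Page n to check whether they are the same]

3.7 [...]

[...] Section 3.6 [...]

4. [...]

4.1 [...]

[...] Section 2 [...] Precision, Recall, F-measure [...] Perfect-matching [...] Web pages [...] $N$ [...]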
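The comparison of Section 3.6 and the decision of Section 3.7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation. Blocks are represented as `Counter` feature vectors, a block counts as non-unique when its cosine similarity to any block of another page is at least 0.9 (the threshold 0.9 is from the text, the `>=` and the restriction to other pages are assumptions), and blocks that remain unique are treated as primary content. The names `cosine`, `extract_primary_content`, `THRESHOLD` and the sample pages are hypothetical.

```python
import math
from collections import Counter

THRESHOLD = 0.9  # value from the text; using ">=" is an assumption


def cosine(a, b):
    """Cosine similarity of two sparse feature vectors (dict-like)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_primary_content(pages):
    """pages: list of pages, each a list of block feature vectors.

    Returns, for each page, the indices of blocks that have no similar
    block on any other page (assumed here to be the primary content).
    """
    result = []
    for i, page in enumerate(pages):
        unique = []
        for j, block in enumerate(page):
            duplicated = any(
                cosine(block, other) >= THRESHOLD
                for k, other_page in enumerate(pages) if k != i
                for other in other_page
            )
            if not duplicated:
                unique.append(j)
        result.append(unique)
    return result


# Hypothetical collection: a shared menu block plus one article block each.
menu = Counter({"<div>": 1, "<a>": 2, "home": 1, "news": 1})
page1 = [menu, Counter({"<p>": 1, "article one": 1})]
page2 = [menu, Counter({"<p>": 1, "article two": 1})]

print(extract_primary_content([page1, page2]))
```

The sketch compares every block with every block of every other page, so its cost grows quadratically with the total number of blocks. Hashing or indexing the vectors would be a natural refinement for larger collections.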

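Table 1: [...] of the blocks in Figure 5

Block  <a>  <body>  <div>  <img>  <p>  a-title text  img-alt text  text 1  text 2
1      0    0       0      0      1    0             0             1       0
2      0    0       1      1      0    0             1             0       0
3      0    0       1      2      0    0             2             0       0
4      1    0       1      0      0    1             0             0       1
5      0    1       0      0      0    0             0             0       0

[...] $R$ [...] $N$ [...] $C$ [...]

$Precision = \frac{R}{N}$

$Recall = \frac{R}{C}$

$F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{R}{\frac{1}{2}(N + C)}$

[...] F-measure [...] Web pages [...]. With $N$ the number of Web pages and $M$ [...]:

$Perfect\text{-}matching = \frac{M}{N}$

4.2 [...]

[...] 7 [...] HTML [...] DOM [...] Section 3.1 [...] Table 2 [...]

4.3 [...]

The dataset is described in Table 3. [...] asahi.com (Note 4), Mainichi jp (毎日jp, Note 5) and YOMIURI ONLINE (Note 6) [...] The URLs of the Web pages were obtained from CEEK.JP NEWS (Note 7), and the HTML files were retrieved from that list of URLs. [...] The results are shown in Tables 4 and 5. [...] ALL [...] asahi.com, Mainichi jp, YOMIURI ONLINE [...] Figure 8 (Note 8) [...] Web page [...]

Note 4: http://www.asahi.com/
Note 5: http://mainichi.jp/
Note 6: http://www.yomiuri.co.jp/
Note 7: http://news.ceek.jp/
Note 8: http://www.yomiuri.co.jp/politics/news/20081205-OYT1T00914.htm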
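The measures of Section 4.1 can be checked with a short Python sketch. It is illustrative only. The function names are hypothetical, and the counts in the example are made-up numbers, since the unit in which $R$, $N$ and $C$ are counted is not recoverable here. The sketch also confirms numerically that the two forms of the F-measure agree.

```python
def precision(r, n):
    """Precision = R / N."""
    return r / n


def recall(r, c):
    """Recall = R / C."""
    return r / c


def f_measure(r, n, c):
    """F-measure in its closed form R / ((N + C) / 2)."""
    return r / (0.5 * (n + c))


def harmonic_mean(p, q):
    """F-measure as the harmonic mean of precision and recall."""
    return 2 * p * q / (p + q)


def perfect_matching(m, n_pages):
    """Perfect-matching = M / N, with N the number of Web pages."""
    return m / n_pages


# Made-up counts, for illustration only.
r, n, c = 8, 10, 16
p, q = precision(r, n), recall(r, c)
print(p, q, f_measure(r, n, c), harmonic_mean(p, q))
```

The same harmonic mean reproduces the reported figures: a precision of 0.9931 and a recall of 0.9558 (row A of Table 2) combine to an F-measure of about 0.9741.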

Table 2: [...]

[...]    Web pages  Precision  Recall  F-measure  Perfect-matching
[...]    274        0.9968     0.9915  0.9941     0.9526
A        124        0.9931     0.9558  0.9741     0.6935
B        69         1.0000     0.9860  0.9930     0.9275
C        91         1.0000     0.9889  0.9944     0.9560
D        11         1.0000     1.0000  1.0000     1.0000
E        43         0.9953     0.9976  0.9965     0.9535
F        104        0.9977     0.9455  0.9709     0.8173
Total    716        0.9965     0.9771  0.9867     0.8841

Table 3: [...]

Site             Web pages  [...]   [...]  Date
asahi.com        179        13593   1031   2008-12-12
Mainichi jp      180        28656   1017   2008-12-12
YOMIURI ONLINE   176        33420   1178   2008-12-12
Total            535        75669   3226   -

[...] Mainichi jp [...] F-measure [...] Web pages [...] 18 Web pages [...] 0.9494 [...] 0.9805 [...] (Figure 8) [...]. Figure 9 (Note 9) [...] Figure 10 (Note 10) [...] Web pages [...] URL [...] Mainichi jp [...] 18 [...]

[Figure 9: Mainichi jp Web page example 1]

[...] Web page [...] Figure 11 (Note 11) [...]

Note 9: http://mainichi.jp/enta/sports/news/20081211k0000e050032000c.html
Note 10: http://mainichi.jp/enta/sports/baseball/news/20081211k0000e050032000c.html
Note 11: http://www.yomiuri.co.jp/atmoney/mnews/20081210-OYT8T00266.htm

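When [...] is not included, the ways of writing a date are limited, so the same date expression is more likely to appear on other Web pages as well. One way to solve this would be to prepare a model trained in advance on date expressions and to extract dates separately from the rest of the content.

The total in Table 4 and the result in Table 5 are nearly identical, although the extraction procedures differ. The total in Table 4 sums the results obtained by building a separate Web page collection for each Web site and extracting the content from each, whereas Table 5 is the result of building a single Web page collection from all Web pages in the dataset. This shows that building a Web page collection across Web sites has almost no effect on performance.

Table 4: Experimental results (domestic, 1)

Site             Precision  Recall  F-measure  Perfect-matching
asahi.com        0.9980     0.9777  0.9878     0.8939
Mainichi jp      0.9372     0.7925  0.8588     0.5111
YOMIURI ONLINE   0.9965     0.9559  0.9757     0.8125
Total            0.9800     0.9113  0.9444     0.7383

Table 5: Experimental results (domestic, 2)

Site             Precision  Recall  F-measure  Perfect-matching
ALL              0.9803     0.9113  0.9446     0.7383

[Figure 10: Mainichi jp (毎日jp) Web page example 2]

[Figure 11: Example in which extraction of the date failed]

4.4 Experimental Results on Overseas News Sites

The dataset used is detailed in Table 6. The URLs of the CNN.com (Note 12) Web pages were obtained from the English edition of Google News (Note 13), and the HTML files were retrieved from that list of URLs. When obtaining URLs from Google News, we specified only the domain (Note 14) so that the content of the Web pages would vary. Blog-style pages on which readers can post comments were removed by hand.

The results are shown in Table 7. Figure 12 (Note 15) is an example of a Web page after automatic content extraction; the colored parts indicate the content. The results are worse than those for the domestic news sites, and recall and perfect-matching are particularly poor.

[Figure 12: Experimental result (overseas): example Web page after content extraction]

As with the Mainichi jp dataset, the CNN.com dataset [...]

Note 12: http://www.cnn.com/
Note 13: http://news.google.com/
Note 14: The search query site:cnn.com was used.
Note 15: http://sportsillustrated.cnn.com/2009/baseball/mlb/01/15/bp.salarycap/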

Table 6: [...]

Site             Web pages  [...]   [...]  Date
CNN.com          175        31401   2758   2009-01-16

Table 7: [...]

Site             Precision  Recall  F-measure  Perfect-matching
CNN.com          0.9438     0.7128  0.8122     0.2971

[...] URL [...] Web pages [...] 14 [...]. [...] 14 Web pages [...] 0.9411 [...] 0.8953 [...]. [...] CNN.com [...] Figure 13 (Note 16) [...]

[Figure 13: [...]]

5. [...]

[...]

References

[1] Jesse Alpert, Nissan Hajaj. (2008). We knew the web was big.... Official Google Blog. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, (Accessed 2009-01-29).
[2] Lidong Bing, Yexin Wang, Yan Zhang, Hui Wang. (2008). Primary Content Extraction with Mountain Model. IEEE CIT2008. pp.479-484.
[3] [...]. (2008). [...] Web [...]. [...] 14 [...].
[4] Shian-Hua Lin, Jan-Ming Ho. (2002). Discovering Informative Content Blocks from Web Documents. In Proceedings of ACM SIGKDD 02. pp.588-593.
[5] Sandip Debnath, Prasenjit Mitra, Nirmal Pal, and C. Lee Giles. (2005). Automatic Identification of Informative Sections of Web Pages. IEEE Transactions on Knowledge and Data Engineering. Vol.17, No.9, pp.1233-1246.
[6] W3C. (1999). The global structure of an HTML document. HTML 4.01 Specification. http://www.w3.org/tr/1999/rec-html401-19991224/struct/global.html#h-7.5.3, (Accessed 2009-01-29).

Note 16: http://money.cnn.com/news/newsfeeds/articles/djf500/200901151434dowjonesdjonline001004 FORTUNE5.htm