WebDB Forum 2015

Browser GUI for Rule Generation in Web Data Extraction System Ducky

Kei Kanaoka 1,a)  Motomichi Toyama 2,b)

Abstract: To gain the benefit of the invaluable data on the World Wide Web, users may need to extract it manually or write web scraping programs, but both processes are tedious and complicated. To address this, we have proposed Ducky, a Web data extraction system that includes a web wrapper which extracts data from web sources and translates them into structured data according to user-defined data extraction rules. Ducky can extract data flexibly from variously structured web pages, remove noise from the extracted data, and integrate data distributed across multiple pages on different sites. In this paper, we propose a browser GUI for Ducky that helps users extract data. It is operated intuitively through actions such as clicking on a target element or pointing the cursor at it (mouse over). These user actions are converted into data extraction rules in a configuration file. The GUI thereby lets users extract data through intuitive operations and reduces the burden of writing the configuration file.

1 Graduate School of Science and Technology, Keio University
2 Department of Information and Computer Science, Keio University
a) kei@db.ics.keio.ac.jp
b) toyama@ics.keio.ac.jp

© 2015 Information Processing Society of Japan

1.

Web Web [12] Web HTML HTML Web Web Web Web HTML Web Web Web Web Web Web Web Web Ducky[6][7] Ducky Web Web Ducky GUI GUI GUI Web 2 Ducky 3 GUI 4 5 6

2.

2.1 Ducky 1 GUI Web Web URL 2 CSS (2.2.1) xml json csv

Fig. 1

{
  "name" : "",
  "author" : "",
  "frequency" : "",
  "format" : "",
  "scraping" : [{ }]
}

Fig. 2

2.2

2.2.1 CSS
CSS Xpath HTML 3.com *1 HP URL HTML CSS div.unit li > a CSS HTML id, class, Ducky CSS 2 CSS Web Web DB

*1 http://eiga.com/link/

HTML id class class.com div unit CSS CSS Web Web JavaScript (DOM) jQuery W3Techs *2 2015 8 jQuery Web 65.5 jQuery CSS Web CSS 3

2.2.2
HTML scraping 4 1 GUI Web CSS selector. scraping

( 1 ) url selector URL HTML CSS HTML CSS
( 2 ) data field attr find remove replace
( 3 ) next next URL url (1) DB next Web (3.2.3 )

"scraping" : [{
  "url" : " ",
  "selector" : " ",
  "data" : [{
    "field" : " ",
    "attr" : " ",
    "find" : " ",
    "remove" : [" ", " "],
    "replace" : [[" ", " "]]
  }],
  "next" : { }
}]

Fig. 4

3.
GUI GUI Web URL URL HTML GUI GUI (2.2.2) GUI Ameba *3 ( 5) 50 *4

*2 Web http://w3techs.com/
*3 http://official.ameba.jp/
*4 http://official.ameba.jp/genrekana/kanatop.html
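The remove and replace fields of a data entry (Fig. 4) suggest a post-processing step applied to each extracted value. The following is a minimal sketch under assumptions: the descriptions of these fields did not survive in this transcription, so the semantics used here (delete each listed substring literally, then apply each [old, new] pair literally) and the function name clean_value are hypothetical, not Ducky's documented behaviour.

```python
from typing import Iterable, Sequence


def clean_value(value: str,
                remove: Iterable[str] = (),
                replace: Iterable[Sequence[str]] = ()) -> str:
    """Post-process one extracted string (assumed semantics, see lead-in)."""
    # Assumption: every entry of "remove" is deleted as a literal substring.
    for token in remove:
        value = value.replace(token, "")
    # Assumption: every entry of "replace" is an [old, new] pair.
    for old, new in replace:
        value = value.replace(old, new)
    # Trim whitespace left behind by the deletions.
    return value.strip()


cleaned = clean_value("Taro (official)",
                      remove=["(official)"],
                      replace=[["Taro", "TARO"]])
print(cleaned)
```

Keeping this step separate from selector matching mirrors the layout of the rule, where find, remove, and replace sit next to field and attr inside each data entry.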

Table 1
scraping array url string URL selector string CSS data string ( ) field array attr string selector find string selector CSS blank remove array parentheses string replace array next object URL name url

Fig. 6 ( 5 )

3.1.3 URL ( 5 ) 6 Ameba

3.1

3.1.1 URL GUI 50 CSS

3.1.2

3.2

3.2.1 CSS GUI CSS CSS body class CSS 50 CSS 6
div.syllabarymdl > table > tbody > tr > td > a ( )

3.2.2 alt ( 5 ) Web HTML 7 a img a 5
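Section 3.2.1 shows a generated selector of the form div.syllabarymdl > table > tbody > tr > td > a. One way such a path could be produced from a clicked element is sketched below. This is an illustration under assumptions, not the authors' algorithm: the Node class is a hypothetical stand-in for a DOM element, and the stopping rule (walk up the ancestors until one carries a class, or until body is reached) is inferred only from the shape of that example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    classes: List[str] = field(default_factory=list)
    parent: Optional["Node"] = None


def css_path(node: Node) -> str:
    """Build a child-combinator selector from the nearest classed ancestor."""
    parts = []
    current: Optional[Node] = node
    while current is not None and current.tag != "body":
        if current.classes:
            # Assumption: an element with a class is specific enough to stop at.
            parts.append(current.tag + "." + ".".join(current.classes))
            break
        parts.append(current.tag)
        current = current.parent
    return " > ".join(reversed(parts))


# Rebuild the ancestor chain of the example selector.
body = Node("body")
div = Node("div", ["syllabarymdl"], body)
table = Node("table", parent=div)
tbody = Node("tbody", parent=table)
tr = Node("tr", parent=tbody)
td = Node("td", parent=tr)
a = Node("a", parent=td)
selector = css_path(a)
print(selector)
```

A selector anchored at a classed ancestor matches every sibling link under the same container, which is what selecting all 50 syllabary links at once would require.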

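Fig. 5 Example of data extraction using the browser GUI

When the element is clicked, its HTML structure is as shown in Fig. 7, so the popup displays the values of the src and alt attributes of the img tag and the value of the href attribute of its parent node, the a tag. The items that the user selects with the checkboxes in the popup become extraction targets and are later converted into data extraction rules.

Fig. 7 Example HTML structure of a linked image

3.2.3 Page transition
When the user clicks the arrow button on the toolbar, the GUI obtains the value of the href attribute of the selected element and navigates to the linked page. In a data extraction rule this is expressed with the next field (Fig. 6). In Fig. 6, the value of the selector field on line 3, div.syllabarymdl > table > tbody > tr > td > a, is the CSS selector that locates the links of the Japanese syllabary index in Fig. 5. The element selected here is an a tag and has an href attribute. Its value, a URL, is used as the value of the url field inside the following next field (2.2.2). In other words, a request is sent to every URL from あ to わ, and on each destination page the celebrity's name and blog URL are extracted. In this way, the next field makes it possible to extract and integrate information from destination Web pages that are generated from the same template.

4. Evaluation

4.1 Evaluation method
To evaluate the usefulness of the proposed browser GUI, we conducted two experiments. In the first experiment, we extracted data from 30 Web sites using the browser GUI. In the second experiment, we selected two Web sites, one from each category identified in Experiment 1, and ran a user study on each site under two conditions: writing the data extraction rules by hand without the browser GUI, and extracting the data with the browser GUI. Recall and precision are defined as follows. (The defining formulas are not preserved in this transcription.)

4.2 Experiment 1

4.2.1 Results and discussion
Table 2 shows the data on extraction from 23 of the 30 target Web sites. For every site listed in Table 2 except the SONY product category list*5, we obtained data with 100% recall and 100% precision. We discuss the SONY product category list below.

Data extraction from the SONY product category list proceeds as follows. From the product category list, the links to a total of 52 category pages are clicked and the page transitions are performed.

*5 http://www.sony.jp/products menu.html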
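The page transition of Section 3.2.3 can be sketched as a loop: apply the outer selector to the start page, collect the href values, and run the nested rule on each destination page. The code below is a rough illustration, not Ducky's implementation. The function follow_next, the two callbacks, and the example.com URLs and records are all hypothetical, and the pages are faked with dictionaries so that the block runs without network access.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical in-memory site: start page -> linked pages, linked page -> records.
FAKE_LINKS: Dict[str, List[str]] = {
    "http://example.com/index": ["http://example.com/a", "http://example.com/ka"],
}
FAKE_RECORDS: Dict[str, List[Record]] = {
    "http://example.com/a": [{"name": "Aoi", "blog": "http://example.com/blog/aoi"}],
    "http://example.com/ka": [{"name": "Kana", "blog": "http://example.com/blog/kana"}],
}


def follow_next(start_url: str,
                get_links: Callable[[str], List[str]],
                get_records: Callable[[str], List[Record]]) -> List[Record]:
    """Visit every linked page and merge the records extracted from each."""
    results: List[Record] = []
    # get_links stands in for "apply the selector and read each href".
    for url in get_links(start_url):
        # get_records stands in for "apply the nested rule to that page".
        results.extend(get_records(url))
    return results


records = follow_next(
    "http://example.com/index",
    lambda url: FAKE_LINKS.get(url, []),
    lambda url: FAKE_RECORDS.get(url, []),
)
print(records)
```

Because every destination page is handled by the same nested rule, this only works when the linked pages share a template, which is the condition the section states for the next field.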

*6 52 36 4 2 SONY SONY 36 7 *7 JavaScript Web Ducky Web GUI Web

4.3 2
Web 1 2 * 2.com ( 1 ) 1 Ameba ( 2) Web Web GUI ( A) GUI ( B) ( ) GUI URL Web HTML CSS JavaScript ( ) HTML CSS ( ) 3 6 GUI GUI CSS Google Chrome Web CSS

4.3.1
GUI 8 8 GUI GUI GUI HTML GUI 3 3 2 Web 2 100 GUI CSS 100 GUI 2 100 2 1

Table 3

      Site 1                     Site 2
      A            B             A            B
1     66.6  64.5   100   100     45    41.6   100   83.3
2     100   92.3   -     -       68.5  62.1   100   100
3     100   93     -     -       100   80.1   -     -
4     100   100    -     -       100   90.2   -     -
5     -     -      -     -       100   100    -     -

*6 1 http://www.sony.jp/bravia/
*7 http://www.pokemon.jp/zukan/
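The evaluation reports recall and precision as percentages, but the paper's own formulas are missing from this transcription. The sketch below therefore uses the standard information-retrieval definitions as an assumption: recall is the share of expected items that were extracted, and precision is the share of extracted items that were expected. The function name and the sample items are hypothetical.

```python
from typing import Iterable, Tuple


def recall_precision(extracted: Iterable[str],
                     expected: Iterable[str]) -> Tuple[float, float]:
    """Return (recall, precision) in percent under the standard definitions."""
    extracted_set, expected_set = set(extracted), set(expected)
    correct = len(extracted_set & expected_set)
    # Guard against empty sets so that the division is always defined.
    recall = 100.0 * correct / len(expected_set) if expected_set else 0.0
    precision = 100.0 * correct / len(extracted_set) if extracted_set else 0.0
    return recall, precision


# Hypothetical run: three of six expected items found, plus one wrong item.
recall, precision = recall_precision(
    extracted=["a", "b", "c", "x"],
    expected=["a", "b", "c", "d", "e", "f"],
)
print(recall, precision)
```

Under these definitions a result of 100 for both measures means the extracted set and the expected set coincide exactly.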

Table 2 Web ( )

.com 1 - - - 2 549
FC Barcelona 1 - - - 3 26
EXILE HP 1 - - - 3 14
1 - - - 2 100
SKE48 HP 1 - - - 4 71
HP 4 - - 2 1136
1 - - 2 94
1 - - 4 73
1 (1) 6 - 2 83
NMB48 HP 1 - - 3 65
Ameba 1 (1) 44 - 3 11774
46 HP 1 - - 3 32
SAMURAI JAPAN 1 - - 6 12
DeNA 5 - - 4 90
1 - - 5 104
2 - 3 92
1 - 3 115
SAMURAI BLUE 1 - 5 51
HP 1 - 2 1634
1 (1) 5 6 170
21 - 2 3551
1 (1) 5 6 85
SONY 1 (3) 52 - - 3 238

URL HTML OXPath [4][11] Xpath DEiXTo [8] GUI

Fig. 8

5.
Web [12] Web Web XML Web 1 (semi-automatic) 2 (automatic) Zhang [13] Adelberg NoDoSE [1] XML Kushmerick[9] Kushmerick Chang [3] Chang IEPAD HTML IEPAD HTML HTML IEPAD

URL Web Web [2] [5] GUI [10] URL OXPath[4] Web OXPath Web Web

6.
Ducky GUI GUI Web 2 1 GUI CSS 2 GUI Web Web Web

References

[1] Brad Adelberg. NoDoSE - a tool for semi-automatically extracting structured and semistructured data from text documents. SIGMOD Rec., 27(2):283-294, June 1998.
[2] Sudhir Agarwal and Michael Genesereth. Extraction and integration of web data by end-users. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 2405-2410, New York, NY, USA, 2013. ACM.
[3] Chia-Hui Chang and Shao-Chen Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 681-688, New York, NY, USA, 2001. ACM.
[4] Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew Sellers. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal, 22(1):47-72, February 2013.
[5] Matthias Geel, Timothy Church, and Moira C. Norrie. Sift: An end-user tool for gathering web content on the go. In Proceedings of the 2012 ACM Symposium on Document Engineering, DocEng '12, pages 181-190, New York, NY, USA, 2012. ACM.
[6] Kei Kanaoka, Yotaro Fujii, and Motomichi Toyama. Ducky: A data extraction system for various structured web documents. In Proceedings of the 18th International Database Engineering & Applications Symposium, IDEAS '14, pages 342-347, New York, NY, USA, 2014. ACM.
[7] Kei Kanaoka and Motomichi Toyama. Effective web data extraction with Ducky. In Proceedings of the 19th International Database Engineering & Applications Symposium, IDEAS '15, pages 212-213, New York, NY, USA, 2014. ACM.
[8] Fotios Kokkoras, Konstantinos Ntonas, and Nick Bassiliades. DEiXTo: A web data extraction suite. In Proceedings of the 6th Balkan Conference in Informatics, BCI '13, pages 9-12, New York, NY, USA, 2013. ACM.
[9] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artif. Intell., 118(1-2):15-68, April 2000.
[10] Tiezheng Nie, Zhenhua Wang, Yue Kou, and Rui Zhang. Crawling result pages for data extraction based on URL classification. In Proceedings of the 2010 Seventh Web Information Systems and Applications Conference, WISA '10, pages 79-84, Washington, DC, USA, 2010. IEEE Computer Society.
[11] Andrew Jon Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. OXPath: Little language, little memory, great value. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 261-264, New York, NY, USA, 2011. ACM.
[12] H.A. Sleiman and R. Corchuelo. A survey on region extractors from web documents. Knowledge and Data Engineering, IEEE Transactions on, 25(9):1960-1981, September 2013.
[13] Suzhi Zhang and Peizhong Shi. An efficient wrapper for web data extraction and its application. In Computer Science Education, 2009. ICCSE '09. 4th International Conference on, pages 1245-1250, July 2009.