DEIM Forum 2016 F1-5

Web Ducky GUI

223-8522 3-14-1
E-mail: kei@db.ics.keio.ac.jp, toyama@ics.keio.ac.jp

Web, 2,,,, Ducky Ducky Web URL CSS,, Ducky GUI. GUI, Web,,. Web, Web,

1. Web, 2,,,,, Web Web, Web [13]., Web HTML,,, HTML, Web Web, Web Web HTML,, Web,, Web, Web, Web Web,,,, Web,,,, Web, Web Ducky [8] [9], Ducky Web, Web GUI,,,, Web,, 2. Ducky 3. GUI, 4., 5. 6., 7.
2.

2.1 Ducky 1. GUI, Web, Web URL,, 2., CSS (2.2.1), xml, json, csv,,, Web DB HTML id class, class,. com, div unit,, CSS CSS, Web, Web, JavaScript (DOM) jquery. W3Techs 2, 2015 8 jquery Web 65.5 jquery CSS, Web CSS 1

    {
      "name" : "",
      "author" : "",
      "frequency" : "",
      "format" : "",
      "scraping" : [{ }]
    }

2

2.2

2.2.1 CSS CSS Xpath, HTML. 3.com 1, HP URL, HTML CSS, div.unit li > a CSS HTML id, class,, Ducky CSS, 2 CSS Web

1 http://eiga.com/link/

3

2.2.2 HTML, scraping 4 1. GUI Web, CSS selector scraping, 1 url, selector URL, HTML CSS HTML. CSS 2 data field attr, find, 2 Web.

http://w3techs.com/
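As a rough illustration of the top-level settings object shown above, the following Python sketch parses such a document and checks that the five keys from the figure are present. Treating all five keys as required, the `load_config` helper, and the sample values (`eiga-links`, `example-user`, `daily`, `json`) are assumptions made for this example; the surviving text confirms only the key names, the `div.unit li > a` selector, and the footnoted URL.

```python
import json

# Key names taken from the settings figure; treating all of them as
# required is an assumption of this sketch.
TOP_LEVEL_KEYS = ("name", "author", "frequency", "format", "scraping")


def load_config(text):
    """Parse a settings document and check its top-level shape."""
    config = json.loads(text)
    missing = [key for key in TOP_LEVEL_KEYS if key not in config]
    if missing:
        raise ValueError("missing top-level keys: " + ", ".join(missing))
    if not isinstance(config["scraping"], list):
        raise ValueError('"scraping" must be an array')
    return config


# Hypothetical values; only the key names, the selector, and the URL
# appear in the source.
SAMPLE = """
{
  "name": "eiga-links",
  "author": "example-user",
  "frequency": "daily",
  "format": "json",
  "scraping": [{"url": "http://eiga.com/link/", "selector": "div.unit li > a"}]
}
"""

config = load_config(SAMPLE)
print(config["scraping"][0]["selector"])
```

Checking the shape up front keeps a malformed settings file from failing later, in the middle of a crawl.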
1
scraping      array
url           string   URL
selector      string   CSS
data          array    ( )
field         string
attr          string   selector
find          string   selector CSS
blank remove array parentheses string,
replace       array
next          object

    "scraping" : [{
      "url" : " ",
      "selector" : " ",
      "data" : [{
        "field" : " ",
        "attr" : " ",
        "find" : " ",
        "remove" : [" ", " ", ],
        "replace" : [[" ", " "], ]
      }],
      "next" : { }
    }]

4

remove replace, 3 next next, URL url (1)., DB. next Web,, (3.2.3) (2.2.2),, GUI, Ameba 3 ( 5). 50 4, URL, name, url 6.

3.1

3.1.1 URL GUI,, 50,, CSS,,

3.1.2 ,, ( 5 ).,

3. GUI GUI, Web URL, URL, HTML GUI.. GUI

3.1.3 URL, ( 5 ).

3.2

3.2.1 CSS GUI

3 http://official.ameba.jp/
4 http://official.ameba.jp/genrekana/kanatop.html
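The `scraping` entry above can be read as a small recursive structure: each entry carries a `url`, a `selector`, a list of `data` rules, and an optional `next` object for the linked page. The Python sketch below shows one possible way to post-process an extracted string and to follow `next`. It rests on several assumptions that the surviving text does not confirm: that `remove` lists literal substrings to delete, that `replace` lists `[old, new]` pairs, and that a non-empty `next` has the same shape as an entry. The parameter table also mentions `blank` and `parentheses`, which may indicate other behavior that is not modeled here. The helper names, the `attr` value, and the sample strings are hypothetical.

```python
def clean_value(value, rule):
    """Apply the "remove" and "replace" settings of one data rule."""
    # Assumption: "remove" lists literal substrings to delete.
    for fragment in rule.get("remove", []):
        value = value.replace(fragment, "")
    # Assumption: "replace" lists [old, new] pairs applied in order.
    for old, new in rule.get("replace", []):
        value = value.replace(old, new)
    # Trimming surrounding whitespace is an addition of this sketch.
    return value.strip()


def nesting_depth(entry):
    """Count how many pages deep an entry reaches by following "next"."""
    depth = 1
    # Assumption: a non-empty "next" object has the same shape as an entry.
    while entry.get("next"):
        entry = entry["next"]
        depth += 1
    return depth


# Hypothetical entry; the inner selector and the rule values are illustrative.
entry = {
    "url": "http://official.ameba.jp/genrekana/kanatop.html",
    "selector": "div.syllabarymdl > table > tbody > tr > td > a",
    "data": [{"field": "name", "attr": "text", "find": "",
              "remove": ["(official)"], "replace": [["\u3000", " "]]}],
    "next": {"url": "", "selector": "li > a", "data": [], "next": {}},
}

print(clean_value("Taro\u3000Yamada (official)", entry["data"][0]))
print(nesting_depth(entry))
```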
5 GUI CSS,,, 50 CSS, 6 div.syllabarymdl > table > tbody > tr > td > a ( ). 6 Ameba, CSS (Algorithm 1). CSS, body, class

Algorithm 1 Pseudocode of generating CSS selector
    Declare CSS selector called C
    Require: node
    Ensure: C of node
    TN ← tagname of node
    CN ← classname of node
    while TN is not Body do
        if CN is not null then
            C += TN + "." + CN
        else
            C += TN
        end if
        node ← parentnode of node
    end while

3.2.2 , alt, ( 5 ). Web,. HTML 7, a img, a, 5, HTML 7, img src alt, a href, 7 HTML
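A minimal Python sketch of the idea behind Algorithm 1 follows. It is an interpretation, not the authors' implementation: the `Node` class is a hypothetical stand-in for a DOM element, and the sketch departs from the pseudocode in two ways. It re-reads the tag and class of the current node on every iteration, and it joins the steps from the outermost ancestor down with ` > `. Both choices are assumptions made so that the output has the form of the selector quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Minimal stand-in for a DOM element (hypothetical, for illustration)."""
    tag: str
    class_name: Optional[str] = None
    parent: Optional["Node"] = None


def css_selector(node):
    """Build a child-combinator selector from below <body> down to node."""
    parts = []
    current = node
    # Walk upward until <body> (or the top of the tree) is reached.
    while current is not None and current.tag.lower() != "body":
        if current.class_name:
            # Several classes in one attribute become tag.a.b
            classes = ".".join(current.class_name.split())
            parts.append(current.tag + "." + classes)
        else:
            parts.append(current.tag)
        current = current.parent
    # Steps were collected leaf-first, so reverse before joining.
    return " > ".join(reversed(parts))


# Element chain mirroring the selector quoted in the text.
body = Node("body")
div = Node("div", "syllabarymdl", body)
table = Node("table", None, div)
tbody = Node("tbody", None, table)
tr = Node("tr", None, tbody)
td = Node("td", None, tr)
a = Node("a", None, td)

print(css_selector(a))  # div.syllabarymdl > table > tbody > tr > td > a
```

Because the selector uses only tag and class names, it matches every element that shares the same path, which is how a single selected link can stand for a whole list of links.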
3.2.3 , href,, next ( 6). 6, 3 selector div.syllabarymdl > table > tbody > tr > td > a 5 50 CSS a, href URL next url (2.2.2). URL, URL, next Web Web, 8 Web ( A)

4.

4.1 Web Web,, 2 3. Web 2,,, 2 Web A B C 2 A 8. 8 Web Web, Web,., CSS, Web GUI, Web., B 9. 9 Web Web, Web, Web, CSS, Web GUI, Web,, C, Web, 9 Web ( B) Web,., 0,, Web,, GUI,, B C Web, 9,,

4.2 10. GUI,. 0, 0, Web,, 3.2.1
Algorithm 1 CSS CSS, 1 CSS,, CSS CSS,

$$P_{hk} = \frac{|A_k \cap C_h|}{|C_h|} \tag{1}$$

$$R_{hk} = \frac{|A_k \cap C_h|}{|A_k|} \tag{2}$$

5.4 , import io [1] kimono [2].

5.4.1 import io kimono Ducky, 4 1 2 3 Web 4 Web 10

5. , 2 1, 2,

5.1 , 2,

5.2 , Web,

5.3 $C = \{C_1, C_2, \ldots, C_h\}$, $A = \{A_1, A_2, \ldots, A_k\}$, ( (1)), ( (2)) 3. 1, import io, kimono Ducky Web, import io, kimono Ducky Web API 2, kimono Ducky 3 Web 4 Web, Ducky,

3 import io kimono Ducky 1 Web Web 2 3 4

5.4.2 Web Web,, a. Web b. Web c. d. A. (HTML ) B. (HTML ) C.
D. E. F. Null 5 Web

5.4.3 Web, Web 4. A F Web A (HTML ),. B (HTML ),, C, import io, kimono Ducky kimono Web, PC,. Web, Ducky, D Null. Null,. import io, kimono, Null,, E F, Ducky, Web

4 Web import io kimono Ducky a b c d
A   10   100    100   100    100    100    100    3   261
B   15   83.4   100   36.9   59.9   87.1   100    5   78
C   10   50     50    100    100    100    100    4   338
D   2    0      0     85     73.9   100    100    5   60
E   2    99.3   100   98.7   100    85.7   68.9   2   610
F   5    100    100   100    100    -      -      3   100

5.4.4 Web kimono Ducky, Web 5., Web, kimono,

a b import io kimono c d
15   44   87.6   72.1   89.9   100    5   12615
2    52   36.5   19.9   68.1   72.1   8   18739

5.5 , Web Web, 2 Web, 6

5.5.1 , 6. 3, 4, 5, 100%, 9 CSS,, Web next 1, 2, 6, 0% Web URL,.,,.

6
1   23.8   19.9   10
2   0      0      0
3   100    100    3631
4   100    100    41
5   100    100    52
6   0      0      0

6. Web [13]. Web, (semi-automatic) (automatic) 2, Zhang [14]
Adelberg NoDoSE [3] XML, URL HTML OXPath [6] [12], Xpath,,,,,, Kushmerick [10] Kushmerick,., Chang [5]. Chang IEPAD, HTML, IEPAD HTML,,,,, HTML, IEPAD,,, URL, Web, Web, [4] [7] GUI, [11] URL OXPath [6], Web OXPath,, Web,, Web,,

7. 1, 2, Web

[1] import.io. https://import.io/.
[2] kimono. https://www.kimonolabs.com/.
[3] Brad Adelberg. NoDoSE - a tool for semi-automatically extracting structured and semistructured data from text documents. SIGMOD Rec., 27(2):283-294, June 1998.
[4] Sudhir Agarwal and Michael Genesereth. Extraction and integration of web data by end-users. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 2405-2410, New York, NY, USA, 2013. ACM.
[5] Chia-Hui Chang and Shao-Chen Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 681-688, New York, NY, USA, 2001. ACM.
[6] Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, and Andrew Sellers. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal, 22(1):47-72, February 2013.
[7] Matthias Geel, Timothy Church, and Moira C. Norrie. Sift: An end-user tool for gathering web content on the go. In Proceedings of the 2012 ACM Symposium on Document Engineering, DocEng '12, pages 181-190, New York, NY, USA, 2012. ACM.
[8] Kei Kanaoka, Yotaro Fujii, and Motomichi Toyama. Ducky: A data extraction system for various structured web documents. In Proceedings of the 18th International Database Engineering & Applications Symposium, IDEAS '14, pages 342-347, New York, NY, USA, 2014. ACM.
[9] Kei Kanaoka and Motomichi Toyama. Effective web data extraction with Ducky. In Proceedings of the 19th International Database Engineering & Applications Symposium, IDEAS '15, pages 212-213, New York, NY, USA, 2015. ACM.
[10] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artif. Intell., 118(1-2):15-68, April 2000.
[11] Tiezheng Nie, Zhenhua Wang, Yue Kou, and Rui Zhang. Crawling result pages for data extraction based on URL classification. In Proceedings of the 2010 Seventh Web Information Systems and Applications Conference, WISA '10, pages 79-84, Washington, DC, USA, 2010. IEEE Computer Society.
[12] Andrew Jon Sellers, Tim Furche, Georg Gottlob, Giovanni Grasso, and Christian Schallhart. OXPath: Little language, little memory, great value. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 261-264, New York, NY, USA, 2011. ACM.
[13] H.A. Sleiman and R. Corchuelo. A survey on region extractors from web documents. Knowledge and Data Engineering, IEEE Transactions on, 25(9):1960-1981, September 2013.
[14] Suzhi Zhang and Peizhong Shi. An efficient wrapper for web data extraction and its application. In Computer Science Education, 2009. ICCSE '09. 4th International Conference on, pages 1245-1250, July 2009.