DEIM Forum 2017 B4-4
Recognition and semantics interpretation of header hierarchies in statistical tables with complicated structures
603 8047 603 8047
E-mail: g1344739@cse.kyoto-su.ac.jp, miya@cc.kyoto-su.ac.jp
AI. AI. AI. Excel. Excel. Excel.. Excel. (appearance) Excel 1... Excel. 1 Excel Excel. Excel.. Excel. 2017 1 2.... 1 2017 1. 2017 1 2... 2017 1 2. 2017 1 2017 2.... Excel. Excel.. Excel. (appearance)
1 https://www.e-stat.go.jp/
. 2. 2. 1 [1] Excel Excel... [3]. CSV RDF. [4] LinkedData. [5] LinkedData 1 RDF. OLAP. OLAP. [6] RDBMS. Excel CSV. CSV JSON RDF. Excel RDF. RDF RDF.. RDF 1. OLAP.. 2. 2 Kieninger [7].. [8] (CRF).. [9] HTML. HTML.. Excel HTML. Excel. 3. 3. 1 Excel. CSV. Excel.. - 1. CSV. 1. Excel... CSV. 3. 2 Excel. Excel pdf. pdf Excel 1. ImageMagick 2 pdf. OCR. density 600. Excel. Excel. 1. 2 http://imagemagick.org/script/index.php
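Section 3.2 appears to rasterize the pdf output of each Excel table with ImageMagick at density 600 before OCR. The following is a minimal sketch of building such a command, assuming the ImageMagick 6 `convert` binary; the helper names and file names are hypothetical and do not come from the paper.

```python
import subprocess
from typing import List


def build_convert_command(pdf_path: str, png_path: str, density: int = 600) -> List[str]:
    """Build an ImageMagick command that rasterizes a pdf at the given density (dpi)."""
    # -density is placed before the input file so that it applies when the pdf is read.
    return ["convert", "-density", str(density), pdf_path, png_path]


def rasterize(pdf_path: str, png_path: str, density: int = 600) -> None:
    """Run the command; needs ImageMagick (and Ghostscript for pdf input) installed."""
    subprocess.run(build_convert_command(pdf_path, png_path, density), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed only; nothing is executed here.
    print(" ".join(build_convert_command("table.pdf", "table.png")))
```

A higher density gives OCR more pixels per character at the cost of larger images; 600 is the value mentioned in the text.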
Table 1: cell no; page 3; x; y; percentage x; percentage y; width; height; area; normalized area; lower left x; lower left y; upper right x; upper right y; lower right x; lower right y; text type.
OpenCV3..... 1 Excel.. OCR. OCR tesseract3.04.01 4 Google Cloud Vision API 5.. tesseract OCR. Google Vision API tesseract., Google Vision API tesseract. tesseract tesseract. 3. 3 3.2. 1.. GBDT Gradient Boosting Decision Tree. XGBoost [10].... 3. 4 CSV. CSV.. 3. 4. 1.. 2 5. 2 ( ) 3 Excel.Excel.
4 https://github.com/tesseract-ocr/tesseract
5 https://cloud.google.com/vision/
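The per-cell features named in Table 1 (position, size, area, and corner coordinates) can be derived from a cell's bounding box. The sketch below is only an illustration: the exact definitions are not recoverable from the text, so it assumes that `percentage_x` and `percentage_y` are the position divided by the page size and that `normalized_area` is the cell area divided by the page area. The `CellBox` type and the `cell_features` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CellBox:
    """Bounding box of one detected cell, in pixels (origin at the top left)."""
    x: float
    y: float
    width: float
    height: float


def cell_features(box: CellBox, page_width: float, page_height: float) -> Dict[str, float]:
    """Compute geometric features for one cell (definitions are assumptions)."""
    area = box.width * box.height
    return {
        "x": box.x,
        "y": box.y,
        "percentage_x": box.x / page_width,    # assumed: position relative to page width
        "percentage_y": box.y / page_height,   # assumed: position relative to page height
        "width": box.width,
        "height": box.height,
        "area": area,
        "normalized_area": area / (page_width * page_height),  # assumed definition
        "lower_left_x": box.x,
        "lower_left_y": box.y + box.height,
        "upper_right_x": box.x + box.width,
        "upper_right_y": box.y,
        "lower_right_x": box.x + box.width,
        "lower_right_y": box.y + box.height,
    }


if __name__ == "__main__":
    print(cell_features(CellBox(100, 50, 200, 40), page_width=1000, page_height=500))
```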
3 ( ) 4 ( ) 6 3. 4. 2. 7 9. 5 ( ) Excel 2. 3. 4 1 2 3 4... 5 2. 2. 2.. Population- -Both sexes Population- -Male Population- -Female Households- -Total Households- -Private-households Households- - -(a). 1.. 2. 2 1. x y x y. 2 3.. 7 (1) 8 (2) 9 (3) 7 x. 8 x. 9. 8. 8. -Japan -Both sexes -Both sexes-a -Both sexes-a -I -Both sexes-a -I -(1) -Both sexes-a -I -(2)
-Both sexes-a -I -(3) -Both sexes-a -I -(4) 1. 8. 1 Japan -Japan. 1 -Japan. 2 -Both sexes x 1 -Japan. -Both sexes. 3 A -Both sexes x. -Both sexes A. 4 I A x.. 1. 1. 1. 1..1. 39913. 1. 5 (1). 4. 4. 1 4. 1. 1 Excel CSV.. Excel 81. 4498857. 2. 1 XGBoost.
XGBoost: eta 0.3, max_depth 6, min_child_weight 1, subsample 1, colsample_bytree 1.
Table 2
292        142
163        83
2,475      947
25,600     11,006
196,271    84,614
488        221
95,676     41,000
4. 1. 2 XGBoost. 3.. Excel CSV CSV.
Table 3
Precision                Recall                   F
0.972 (138/142)          0.972 (138/142)          0.972
0.918 (78/85)            0.940 (78/83)            0.923
0.950 (910/958)          0.960 (910/947)          0.955
0.983 (10,909/11,102)    0.991 (10,909/11,006)    0.987
0.995 (84,129/84,512)    0.994 (84,129/84,614)    0.995
0.911 (164/180)          0.742 (164/221)          0.818
0.987 (40,509/41,034)    0.988 (40,509/41,000)    0.988
4 5. percentage x percentage y normalized area.
Table 4
Feature            F score
percentage y       2,702
cell no            2,071
normalized area    1,830
width              1,523
percentage x       1,461
4. 1. 3. CRF( ). CRFsuite 0.12 6. 1 1..
6 http://www.chokkan.org/software/crfsuite/
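Each entry in Table 3 is reported as a ratio of counts followed by an F value. A minimal sketch of how such figures can be recomputed is given here, assuming F is the usual harmonic mean of precision and recall (F1); the function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f(correct: int, predicted: int, actual: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    correct:   cells assigned to the class that truly belong to it
    predicted: all cells the classifier assigned to the class
    actual:    all cells that truly belong to the class
    """
    precision = correct / predicted
    recall = correct / actual
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # Counts taken from the row 0.911 (164/180), 0.742 (164/221) of Table 3.
    p, r, f = precision_recall_f(164, 180, 221)
    print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "0.911 0.742 0.818"
```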
( ) N. N=1 10. 5 IOB2.
Table 5: IOB2
B TITLE         I TITLE 2
B SUB TITLE     I SUB TITLE 2
B COL HEADER    I COL HEADER 2
B ROW HEADER    I ROW HEADER 2
B BODY ( )      I BODY ( ) 2
B COMMENT       I COMMENT 2
23164 5792 2. 2. 1 1 IOB2 1. 10. N=6 F. 6 N=6. F..
Table 6: N=6
Precision               Recall                   F
0.854 (94/110)          0.662 (94/142)           0.746
0.755 (37/49)           0.446 (37/83)            0.561
0.926 (686/741)         0.724 (686/947)          0.812
0.940 (8,844/9,408)     0.803 (8,844/11,006)     0.867
0.948 (82,076/86,557)   0.970 (82,076/84,614)    0.959
0.767 (115/150)         0.520 (115/221)          0.620
4. 2 CSV 14 Excel CSV 15. INCA OCR. Excel CSV OCR. Excel.. 5. 1 2. CSV OCR. OCR. 14 18. Google Cloud Vision API.. Excel OCR Google Cloud Vision API. TemplateMatching OCR. 11 1 10.. 11. 1 Related member. 2. 7-7- - -or more.
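The labels in Table 5 follow the IOB2 scheme, in which the first element of a run receives a B label and each following element of the same run receives an I label. A minimal sketch of that conversion over a sequence of cell classes is shown here. How the paper orders cells into a sequence is not recoverable from the text, so the input order, the underscore in the label names, and the function name are all assumptions.

```python
from itertools import groupby
from typing import List


def to_iob2(classes: List[str]) -> List[str]:
    """Convert a sequence of cell classes into IOB2 labels.

    Adjacent cells of the same class are treated as one run (a simplification):
    the first cell gets B_<class>, the rest get I_<class>.
    """
    labels: List[str] = []
    for cls, run in groupby(classes):
        run_length = len(list(run))
        labels.append(f"B_{cls}")
        labels.extend([f"I_{cls}"] * (run_length - 1))
    return labels


if __name__ == "__main__":
    cells = ["TITLE", "TITLE", "COL_HEADER", "BODY", "BODY", "BODY"]
    print(to_iob2(cells))
    # ['B_TITLE', 'I_TITLE', 'B_COL_HEADER', 'B_BODY', 'I_BODY', 'I_BODY']
```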
. Related member 1 2 1. 12 2 12 A.. A (4). I Family nuclei 1. Excel. Excel. 13 13. 65 1 18 18 -. 65 1 18 1.. 6. Excel Excel CSV.. CSV OCR. OCR Google Cloud Vision API. OCR. pdf OCR... OCR Excel CSV..
[1] (2013) Excel http://oku.edu.mieu.ac.jp/ okumura/sss2013.pdf (accessed 2016/12/31)
[2] (2015) http://www.meti.go.jp/committee/kenkyukai/sa -nsei/kaseguchikara/pdf/010 03 03.pdf (accessed 2016/1/7)
[3] (2014) UNISYS TECHNOLOGY REVIEW 121 SEP. 2014
[4] (2011) Linked Data, The 25th Annual Conference of the Japanese Society for Artificial Intelligence, 2011
[5] (2013) RDF 12 F-034
[6] (1996) RDB OLAP 52 4-157
[7] T.G. Kieninger and B. Strieder (1999) T-Recs Table Recognition and Validation Approach, AAAI Fall Symposium on Using Layout for the Generation Understanding and Retrieval of Documents.
[8] (2015) DEIM Forum 2015 B4-5.
[9] (2003) HTML.
[10] Tianqi Chen, Carlos Guestrin (2016) XGBoost: A Scalable Tree Boosting System, https://arxiv.org/pdf/1603.02754.pdf (accessed 2016/12/31)
Figure 14: Example of an Excel statistical table
Figure 15: Example of conversion to CSV