DEWS2008 C6-4

Person Retrieval on XML Documents by Coreference that Uses Structural Features

Yumi YONEI, Mizuho IWAIHARA, and Masatoshi YOSHIKAWA
Department of Social Informatics, Graduate School of Informatics, Kyoto University
Yoshidahonmachi, Sakyo-ku, Kyoto, 606-8501 Japan
E-mail: yyonei@db.soc.i.kyoto-u.ac.jp, {iwaihara,yoshikawa}@i.kyoto-u.ac.jp

Abstract  Current keyword retrieval is based on the occurrence frequency and the occurrence position of the keywords. When a query contains two or more keywords, the semantic relation between them is important. To retrieve information about a person, it is common to search with a pair of keywords consisting of the person's name and one of his or her attributes. However, if the semantic relation between the keywords is not considered, documents that describe a different person's attribute may be retrieved. Dependency analysis and coreference analysis make it possible to retrieve content in which the query keywords are semantically dependent, which improves search precision, but these analyses are costly. In structured documents such as XML, on the other hand, the correspondence between a person and an attribute is often influenced by the document structure. In this paper, we confirm this through coreference analysis that uses structural features of XML documents, and we describe our person retrieval method based on this structural coreference.

Key words  XML, coreference, person retrieval, structural features

1. XXX XXX
<> <name> </name> <body> <></> <> <title> </title> <item> </item> <item> 11 </item> </>... </body> </>
Figure 1  XML

Web XML HTML 1 name item 2 item item HTML [1] [17] [3] [18] Wikipedia Web HTML XML Web Web XML 2 3 4 Wikipedia XML 5

2.

Web [14] [8] [14] [8]. Web caption [1] [17] [1] - [17] [3] [18] [3] 3 [18] Web 3 Web Web Web [16] [16] Web Web HTML XML
2 (a) name body

3. XML

3.1 XML

3.1.1 3.1.2

3.1.1

(1) (linguistic features) [4] 4 ( ) ( ) ( ) ( ) PERSON, LOCATION Cabocha [7] ( )

(2) (structural features) XML HTML XML

[Figure 2 (a) (b): node labels anaphor, title, p, item]

[Figure 3 (a) (b) (c) (d): k- (k=2); node labels anaphor, name, body, item; //]

XML 2 2(a) XML title XML 2(b) XML
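k- k 3 k=2 k- 3(a),(b),(c) name body 3(d) 2 k- (k=2) k- k- ( (name )(body )) 3(d) k-

3.1.2

Support Vector Machine

(1) x, y C(x, y) p(x, y) [6] [9] (x, y) 1,0 x ( (name )(body )) k- y

$$f_i(x, y) = \begin{cases} 1 & \text{if } x \text{ and } y \text{ satisfy the condition of feature } i \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$p(x, y) = \frac{1}{z(x)} \exp\Bigl(\sum_i \lambda_i f_i(x, y)\Bigr), \qquad z(x) = \sum_y \exp\Bigl(\sum_i \lambda_i f_i(x, y)\Bigr) \tag{2}$$

Here $\lambda_i$ is the weight of feature $f_i$ and $z(x)$ is the normalization term. The empirical expectation of $f_i$ is

$$\tilde{P}(f_i) = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y) \tag{3}$$

and, from (1) and (2), the expectation of $f_i$ under the model is

$$P(f_i) = \sum_{x,y} \tilde{p}(x)\, p(x, y)\, f_i(x, y) \tag{4}$$

The model is constrained so that the two expectations agree:

$$P(f_i) = \tilde{P}(f_i) \tag{5}$$

Among the models in which $P(f_i)$ matches $\tilde{P}(f_i)$, the one that maximizes the entropy (6) is chosen, which determines $\lambda$.

$$H(P) = -\sum_{x,y} p(x, y) \log p(x, y) \tag{6}$$

(2) SVM

SVM [2] 2 SVM [5]

Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ be trees with vertex sets $V_1, V_2$ and edge sets $E_1, E_2$.

$$K(T_1, T_2) = \sum_{v_1 \in V_1} \sum_{v_2 \in V_2} \sum_{s_1 \in S_{v_1}(T_1)} \sum_{s_2 \in S_{v_2}(T_2)} K_S(s_1, s_2) \tag{7}$$

$$K_S(s_1, s_2) = I(s_1 = s_2) \tag{8}$$

$S_v(T)$ ($v \in V$) K_S (2) $I(\cdot)$ is 1 if $s_1 = s_2$ and 0 otherwise.

2 [12] RNA HTML XML Web SVM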
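As a rough illustration of the maximum entropy model described above, the following Python sketch evaluates $p(x, y)$ from a set of binary feature functions and weights. The structure patterns, the two feature functions, and the weights are hypothetical values chosen only for illustration; they are not taken from the paper, and no parameter estimation is performed here.

```python
import math


def maxent_probability(x, y, labels, features, weights):
    """Return p(x, y) = exp(sum_i lambda_i * f_i(x, y)) / z(x)."""
    def unnormalized(label):
        total = sum(w * f(x, label) for f, w in zip(features, weights))
        return math.exp(total)

    # z(x): sum of the unnormalized scores over every possible label y
    z = sum(unnormalized(label) for label in labels)
    return unnormalized(y) / z


# Hypothetical binary features over a structure pattern x and a label y
# (assumed here: y = 1 means the pair corefers, y = 0 means it does not).
def f_name_body(x, y):
    return 1 if x == "((name)(body))" and y == 1 else 0


def f_item(x, y):
    return 1 if x == "(item)" and y == 1 else 0


features = [f_name_body, f_item]
weights = [1.5, -0.5]  # hypothetical lambda_i, not learned from data

p_coref = maxent_probability("((name)(body))", 1, [0, 1], features, weights)
p_not = maxent_probability("((name)(body))", 0, [0, 1], features, weights)
print(round(p_coref, 3), round(p_not, 3))
```

Because $z(x)$ sums over all labels, the two probabilities for the same pattern add up to 1. A positive weight pushes the probability of the label on which the feature fires above 0.5, and a negative weight pushes it below.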
SVM 1 0 k- 3.1.1(2) SVM

3.2

( (name )(body )) body name body XML

3.3

4. XML

4.1

4.1.1

Wikipedia 1 XML 4 1: 2: 3: 4: 1 2 3 4 1 1

1 http://ja.wikipedia.org/

Table 1
              1        2       3       4
            333      521       6      72
             75       96       4       4
          12597    22758      18      33
            240      489       7      14
          12260    22269      11      19

Figure 4  k- (k=2)
 1  (item )
 2  (p )
 3  (item (normalist ))
 4  ( (name )(body ))
 5  ( (title )( ))
 6  (body (p )( ))
 7  (body ( )( ))
 8  (normalist (item )(item ))
 9  ( (normalist ) (normalist ))
10  ((p ) (p ))
11  ( ( )( ))

4.1.2

[4] [11] Cabocha [7] EDR [13] EDR
Table 2  k- (k=2)
(1)   (item )                        71
(2)   (p )                           40
(3)   (item (normalist ))             4
(4)   ( (name )(body ))              93
(5)   ( (title )( ))                  3
(6)   (body (p )( ))                 29
(7)   (body ( )( ))                   0
(8)   (normalist (item )(item ))      0
(9)   ( (normalist )(normalist ))     0
(10)  ((p ) (p ))                     0
(11)  ( ( )( ))                       0
      Total                         240

Table 3  k- (k=2)
(2)   (p )                            7
(10)  ((p ) (p ))                     0
      Total                           7

Wikipedia 2 3 k=2 k- 4 2 3 4 (1) (2) item (3) item (4) name (5) title (6) (6) (10) 2 3 XML (3) (4) (5) name title 2 (4) name item (body (p )( ) 2 3

4.1.3

SVM [10] SVM SVM-light [15] SVM 0 1

4.2

Evaluation uses precision and recall:

$$\mathrm{precision} = \frac{\text{number of correct answers output}}{\text{number of answers output}}, \qquad \mathrm{recall} = \frac{\text{number of correct answers output}}{\text{number of correct answers}}$$

The F value (F-measure) combines the two:

$$F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

4.3

k- k- SVM

4.3.1 XML

(I) (II) (III) 4, 3.1.1(1) k=2 k- 5 5 (I) (II)
Table 4  (I) (II) (III)

Table 5
                        1        2        3        4
(I)    Precision    74.3%    76.0%    51.3%    31.2%
       Recall       40.8%    66.8%    57.5%    48.7%
       F            52.7%    71.1%    54.2%    38.0%
(II)   Precision    77.0%    78.9%    69.3%    75.0%
       Recall       48.1%    69.2%    91.7%    54.8%
       F            59.2%    73.7%    74.9%    63.3%
(III)  Precision    90.6%    92.0%    82.0%    86.0%
       Recall       38.9%    33.5%    62.0%    54.8%
       F            54.4%    49.1%    70.6%    66.7%

Table 6  k-
                        1        2        3        4
k=2    Precision    77.0%    78.9%    69.3%    75.0%
       Recall       48.1%    69.2%    91.7%    54.8%
       F            59.2%    73.7%    74.9%    63.3%
k=3    Precision    75.4%    77.6%    69.3%    75.0%
       Recall       46.2%    68.6%    91.7%    54.8%
       F            57.3%    72.8%    74.9%    63.3%
k=     Precision    72.1%    78.0%    69.3%    75.0%
       Recall       49.3%    72.0%    91.7%    54.8%
       F            58.6%    74.9%    74.9%    63.3%

Table 7
                        1        2        3        4
       Precision    72.1%    78.0%    69.3%    75.0%
       Recall       49.3%    72.0%    91.7%    54.8%
       F            58.6%    74.9%    74.9%    63.3%
SVM    Precision    97.4%    85.3%    63.3%    88.0%
       Recall       30.6%    61.2%    83.3%    30.4%
       F            46.6%    71.3%    71.9%    45.2%

(k= ) F XML (III) (II) (III) 6 3 4 3 1 2 k=2 k=3 k- k=2 ( = ) 3 F 1 k=2 2 k= k=2 k-

4.3.3

[4] SVM

4.3.2

k- 7 2 3-2 (k=2 - ) 3 (k=3 - ) SVM F 3 SVM
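The F-measure of Section 4.2 can be checked against the reported results with a few lines of Python. This is a minimal sketch: the function names are ours, `precision_recall` assumes the standard count-based definitions, and the value triples are the percentages reported for method (I) on data sets 1 to 4. Since the F-measure is symmetric in its two arguments, the check does not depend on which of the two reported rows is precision and which is recall.

```python
def precision_recall(correct_outputs, total_outputs, total_correct):
    """Standard definitions (assumed): fractions of correct answers."""
    precision = correct_outputs / total_outputs if total_outputs else 0.0
    recall = correct_outputs / total_correct if total_correct else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (value 1, value 2, reported F) in percent for method (I), data sets 1-4
reported = [
    (74.3, 40.8, 52.7),
    (76.0, 66.8, 71.1),
    (51.3, 57.5, 54.2),
    (31.2, 48.7, 38.0),
]

for value_1, value_2, reported_f in reported:
    computed = round(f_measure(value_1, value_2), 1)
    print(value_1, value_2, computed, reported_f)
```

For each of these four data sets the recomputed value agrees with the reported F to one decimal place.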
5.

XML k- XML Wikipedia XML 4 F k- k- =2 k- SVM F Wikipedia HTML

(B)( 18300031)

[1] Hsin-Hsi Chen, Shih-Chung Tsai, Jin-He Tsai, Mining Tables from Large Scale HTML Texts, 18th International Conference on Computational Linguistics, pp.166-172, 2000.
[2] Nello Cristianini, John Shawe-Taylor,,
[3] WWW HTML, DE2005-136, 2005.
[4] ,,,,, Vol. 46, No. 3, 2005.
[5] Vol.21, No.1,a, 2006.
[6] Andrew Kehler, Probabilistic Coreference in Information Extraction, CoRR, cmp-lg/9706012, 1997.
[7] , Support Vector Machine Chunk, , Vol. 9, No. 5, pp.3-21, 2002.
[8] 11, 2005.
[9] Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 22, 1996.
[10] Zhang Le, Maximum Entropy Modeling Toolkit for Python and C++, http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html.
[11] ,,,,, version 2.3.3, 2003.
[12] Alessandro Moschitti, Making Tree Kernels Practical for Natural Language Learning, EACL, 2006.
[13] EDR. Technical Report TR 045, 1995.
[14] 2002 pp175-176 2002.
[15] SVM-light, http://dit.unitn.it/~moschitt/tree-kernel.htm.
[16] Lan Yi, Bing Liu, Xiaoli Li, Eliminating noisy information in web pages for data mining, Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.296-305, 2003.
[17] Minoru Yoshida, Kentaro Torisawa, Junichi Tsujii, Extracting ontologies from World Wide Web via HTML tables, Pacific Association for Computational Linguistics, pp.332-341, 2001.
[18] Web DEWS 6-p-05 2003.