19 A Proposal of Text Classification using Formal Concept Analysis 1080418 2008 3 7
Abstract

A Proposal of Text Classification using Formal Concept Analysis

Akinori Moriki

Formal concept analysis is a data analysis method based on lattice theory that visualizes relations among objects by means of a partial order. The method uses a Hasse diagram generated from a two-dimensional table consisting of objects and attributes. In this thesis, formal concept analysis is applied to articles in the Reuters-21578 collection in order to obtain their main subjects and a summary of the articles. The objects are the news articles, and the attributes are all words that appear in the articles. The concept lattice is constructed with Concept Explorer, a software tool for formal concept analysis.

In the results, prepositions and articles, such as "a", "an", and "the", are located in the upper layers of the concept lattice. This indicates that prepositions and articles are common to many news articles. They are not, however, indicative of news content, since they carry little meaning. Nouns and verbs, on the other hand, are generally meaningful words and are indicative of news content. These words are located in the lower layers of the concept lattice and are shared by 2 to 4 articles. The word "said" is an exception: it appears in an upper layer of the concept lattice, because it accompanies the name of the speaker in news articles. Therefore, nouns and verbs that appear in the upper layers indicate tendencies of the articles as a whole, and nouns and verbs that appear in the lower layers indicate relations and associations among texts.

key words: formal concept analysis, text classification, concept lattice
1 Web Web Web 1981 Darmstadt Rudolf Wille 2 Hasse 2 1
[2] Concept Explorer[3] Hasse Hasse 2 3 4 3 5 2
2 2.1 2.1 Hasse 2.1 Hasse Hasse 2.1 3
Table 2.1: Context with objects 1, 2, 3, 4 (rows) and attributes a, b, c, d (columns).

Figure 2.1: Concept lattice of Table 2.1. Each node is a concept:
({1,2,3,4}, empty)
({2,3}, {c}) ({2,4}, {d}) ({1,4}, {b})
({2}, {c,d}) ({4}, {b,d}) ({1}, {a,b})
(empty, {a,b,c,d})

2.1 ({2}, {c,d}) ({4}, {b,d}) ({2}, {c,d}) {2} {c,d} ({4}, {b,d}) {4} {b,d} 2 4
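The eight concepts listed for Figure 2.1 can be checked with a short script. This is a minimal sketch, not the procedure used in the thesis: the cross marks of Table 2.1 did not survive in this copy, so the `CONTEXT` dictionary below is an assumption reconstructed from the listed concepts (for example, ({1}, {a,b}) implies that object 1 has attributes a and b). The function names are hypothetical, and the brute-force enumeration is only practical for very small contexts.

```python
from itertools import combinations

# Assumed context, reconstructed from the concepts listed for Figure 2.1.
CONTEXT = {
    1: {"a", "b"},
    2: {"c", "d"},
    3: {"c"},
    4: {"b", "d"},
}
ATTRIBUTES = {"a", "b", "c", "d"}


def common_attributes(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return shared


def objects_having(attributes):
    """Objects that have every attribute in `attributes`."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def all_concepts():
    """Enumerate concepts by closing every subset of objects (brute force)."""
    found = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found


concepts = all_concepts()
# Print from the largest extent (top of the lattice) to the smallest (bottom).
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Under the assumed context, the script finds exactly the eight (extent, intent) pairs shown as nodes in Figure 2.1.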
Figure 2.2: A concept is a pair of an extent A (a set of objects) and an intent B (a set of attributes) [5].

({2,3}, {c}) ({2,4}, {d}) 2 {2,3} {c} {2,4} {d} ({2,3}, {c}) ({2}, {c,d}) ({2,4}, {d}) ({2}, {c,d}) ({4}, {b,d}) 2.1 ({2,3}, {c}) ({2}, {c,d}) {2} ({2,3}, {c}) ({3}, {c}) ({2,4}, {d}) {2}{4} {d} ({2}, {c,d}) ({2,3}, {c}) ({2,4}, {d}) {c}{d} {2}
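For reference, the standard definitions behind extent, intent, and the ordering of concepts can be written as follows. The symbols G (objects), M (attributes), and I (incidence relation) follow the usual notation of formal concept analysis and are an assumption here, since the notation of the original passage did not survive; A and B are the extent and intent named in Figure 2.2.

```latex
% Derivation operators of a context (G, M, I) with I \subseteq G \times M
A' = \{\, m \in M \mid (g, m) \in I \ \text{for all } g \in A \,\}, \qquad A \subseteq G
B' = \{\, g \in G \mid (g, m) \in I \ \text{for all } m \in B \,\}, \qquad B \subseteq M

% (A, B) is a concept exactly when each part determines the other
(A, B) \ \text{is a concept} \iff A' = B \ \text{and} \ B' = A

% Partial order drawn by the Hasse diagram
(A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2 \iff B_2 \subseteq B_1

% Example with the concepts of Figure 2.1
(\{2\}, \{c, d\}) \le (\{2, 3\}, \{c\}) \quad \text{because} \quad \{2\} \subseteq \{2, 3\}
```

In the same way, ({2}, {c,d}) lies below ({2,4}, {d}), so it is the concept that the two upper concepts share: object 2 is the only object with both c and d.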
2.1 Hasse ( 2.2) Hasse ({2,3}, {c}) {2,3} {c} 1 12 9 2.2 2.2 2.3 ({Arctic Monkeys,Metallica}, {Rock}) ({James Blunt}, {Pop}) 2 ({Arctic Monkeys,Metallica}, {Rock}) {Rock} ({James Blunt}, {Pop}) {Pop} ({Arctic Monkeys,Metallica}, {Rock}) ({James Blunt}, {Pop}) Rock Pop ({Arctic Monkeys,Metallica}, {Rock}) { } { } {Punk, } { } { } { } Rock ({Slipknot}) {Rock} { } { } 6
2.1 Pop Rock Punk Sum 41 Red Hot Chili Peppers Oasis Linkinpark Killswitch Engage Fall Out Boy Slipknot Marilyn Manson Arctic Monkeys Maroon5 Metallica James Blunt 2.2 { } Slipknot Rock ({Sum 41}) {Rock} {Punk, } {Pop} Sum 41 Punk Pop Rock 7
2.2 2.3 2.2 2.2 CREDO[4] Claudio Carpinet Gianni Romano 2 Web Web Yahoo! CREDO Web Web Yahoo! 8
2.2 Google Web Web 9
3 3.1 Web Reuter 21578, Distribution 1.0[2] SGML <BODY> </BODY> *.txt txt 10
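The surviving fragments of this section mention SGML, the `<BODY>` and `</BODY>` tags, and `*.txt` files, which suggests that the article bodies were extracted from the SGML files and saved as text files. The following is a minimal sketch of such a step, assuming that interpretation. The function names and the output file naming (`text1.txt`, `text2.txt`, ...) are hypothetical and are not taken from the thesis.

```python
import re
from pathlib import Path

# Non-greedy match so that each <BODY>...</BODY> pair is captured separately.
BODY_PATTERN = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def extract_bodies(sgml_text):
    """Return the text found between each <BODY> and </BODY> pair."""
    return [body.strip() for body in BODY_PATTERN.findall(sgml_text)]


def save_bodies(sgml_path, out_dir):
    """Write every article body of one SGML file to its own .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # latin-1 decodes any byte sequence, which avoids decoding errors.
    text = Path(sgml_path).read_text(encoding="latin-1")
    paths = []
    for index, body in enumerate(extract_bodies(text), start=1):
        path = out_dir / f"text{index}.txt"  # hypothetical naming scheme
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths


# Small in-memory example; a real file would be passed to save_bodies().
sample = (
    "<REUTERS><TEXT><TITLE>T</TITLE>"
    "<BODY>Showers continued throughout the week.\n</BODY>"
    "</TEXT></REUTERS>"
)
print(extract_bodies(sample))
```

A regular expression is used instead of an SGML parser only to keep the sketch short; articles without a `<BODY>` element are simply skipped.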
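3.2 3.2 3.2.1 1 path 1 3.2.2 CSV CSV CSV 3.2.1 CSV

The context is represented by a matrix K whose elements are 1 or 0:

\[
K = \begin{pmatrix}
w_{0,0} & \cdots & w_{0,\mathit{attr}-1} \\
\vdots & \ddots & \vdots \\
w_{\mathit{obj}-1,0} & \cdots & w_{\mathit{obj}-1,\mathit{attr}-1}
\end{pmatrix}
\tag{3.1}
\]

Here obj is the number of objects (texts), attr is the number of attributes (words), and w is an element of the matrix.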
    ,Showers,continued,throughout,,hotels
    text 1,1,1,1,,0
    text 2,0,0,0,,0
    text 3,0,0,0,,0
    text 4,0,0,0,,0
    text 5,0,0,0,,1

Table 3.1: CSV

m g wobj-1,attr-1, CSV CSV 3.1

3.3 Web Concept Explorer version 1.3[3] Concept Explorer Yevtushenko Concept Explorer 3.2.2 CSV Concept Explorer
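A binary text-by-word table in the layout of Table 3.1 can be produced with a few lines of Python. This is a sketch under stated assumptions, not the program used in the thesis. The tokenization rule, the column order (first occurrence), and the function names are hypothetical, and words are kept case-sensitive only because "Showers" is capitalized in Table 3.1.

```python
import csv
import io
import re


def tokenize(text):
    """Split a text into words (a simple assumed rule, case-sensitive)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def build_context(texts):
    """Return (words, rows); rows[i][j] is 1 if text i contains word j, else 0."""
    words = []
    seen = set()
    token_sets = []
    for text in texts:
        tokens = tokenize(text)
        token_sets.append(set(tokens))
        for token in tokens:
            if token not in seen:  # keep words in order of first occurrence
                seen.add(token)
                words.append(token)
    rows = [[1 if word in tokens else 0 for word in words] for tokens in token_sets]
    return words, rows


def write_context_csv(texts, stream):
    """Write the context as CSV: a header of words, then one row per text."""
    words, rows = build_context(texts)
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow([""] + words)  # empty top-left cell, as in Table 3.1
    for number, row in enumerate(rows, start=1):
        writer.writerow([f"text {number}"] + row)


texts = ["Showers continued throughout", "hotels continued"]
buffer = io.StringIO()
write_context_csv(texts, buffer)
print(buffer.getvalue())
```

Each row of the output corresponds to one row of the matrix K in equation (3.1), and each column to one word.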
Figure 3.1 (labels in the figure: text 1, text 2, ..., text N; context; Concept Explorer; CSV; Concept Lattice)

1. 2. CSV 3. Concept Explorer CSV 4. CSV 5.
4 4.1 10 5 50 4.1 4.9 4.1.1 2 1 2 2 4 1 14
4.1 10 20 3 5 30 40 40 4 5 50 9 10 4.1 n n 1 1 2 2 3 3 n n 2 4 8 2 n 4.2 said 2 Concept Explorer CSV 10 20 5 30 30 40 40 4 5 50 9 10 4.1 50 60 50 60 15
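Figure 4.1: Concept lattice of 10 texts

4.1.2 1 4 1 2 4 said 1 2

Figure 4.2: Concept lattice of 15 texts

n n n 4.2 n 1 2 n

Figure 4.3: Concept lattice of 20 texts

Figure 4.4: Concept lattice of 25 texts

Figure 4.5: Concept lattice of 30 texts

Figure 4.6: Concept lattice of 35 texts

Figure 4.7: Concept lattice of 40 texts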
4.1 Experimental results and discussion

Figure 4.8: Concept lattice of 45 texts

Figure 4.9: Concept lattice of 50 texts
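The layer in which a word appears in the lattices of this chapter is related to the number of texts that contain it: a concept lies higher in the lattice when its extent is larger, so a word shared by many texts tends to sit near the top, and a word shared by only 2 to 4 texts sits lower. The sketch below illustrates this with a small invented context. The data and the function names are hypothetical and are not results from the thesis, and extent size is only a rough proxy for the layer.

```python
def extent_sizes(context):
    """Map each word to the number of texts that contain it."""
    counts = {}
    for words in context.values():
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return counts


def words_shared_by(context, low, high):
    """Words contained in at least `low` and at most `high` texts, sorted."""
    counts = extent_sizes(context)
    return sorted(word for word, n in counts.items() if low <= n <= high)


# Invented toy context (NOT data from the thesis): text name -> set of words.
context = {
    "text 1": {"the", "said", "showers", "continued"},
    "text 2": {"the", "said", "hotels"},
    "text 3": {"the", "said", "showers"},
    "text 4": {"the", "a", "hotels"},
    "text 5": {"the", "said", "a"},
}

sizes = extent_sizes(context)
# Words in many texts ("the", "said") would appear in the upper layers.
print(sorted(sizes.items(), key=lambda item: (-item[1], item[0])))
# Words shared by only a few texts would appear in the lower layers.
print(words_shared_by(context, 2, 3))
```

In this toy context, "the" occurs in all five texts and "said" in four, while "a", "hotels", and "showers" are each shared by two texts.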
5 Web [2] 2 4 said 2 4 said said 24
50 60 60 Concept Explorer[3] 60 25
1 2 1 4 3 26
3 1 1 8 1 4 3 27
[1] Bernhard Ganter, TU Dresden, Formal Concept Analysis: Methods and Application in Computer Science, 2002.
[2] Reuters-21578 Text Categorization Collection, http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
[3] Yevtushenko, Concept Explorer version 1.3, http://sourceforge.net/projects/conexp
[4] C. Carpinet, G. Romano, CREDO, http://credo.fub.it/
[5] vol. 19, no. 2, pp. 103-142, 2007.