I05-010
2 [1][2][3][4][5][6][7]
3 Data Mining Techniques

This chapter reviews the techniques used in this work [8][9]: association rules (Section 3.1), the Apriori algorithm (3.2), the chi-square test (3.3), decision trees (3.4), and clustering (3.5).

3.1 Association Rules

An association rule X → Y with, say, sup = 6% and conf = 85% states that 6% of all transactions contain both X and Y, and that 85% of the transactions that contain X also contain Y [10]. Formally, let I be the set of all items; a transaction T is a subset of I (T ⊆ I), and the database D is a collection of such transactions.
The support sup(X) of an itemset X is the fraction of transactions in D that contain X, and the confidence of a rule X → Y is defined from the supports as

conf(X → Y) = sup(X ∪ Y) / sup(X).

Association-rule mining originates with R. Agrawal's group at IBM [11].

3.2 The Apriori Algorithm

Apriori [10] finds all frequent itemsets level by level, deriving the candidate (k+1)-itemsets from the frequent k-itemsets [12]. Writing C_k for the set of candidate k-itemsets and L_k for the set of frequent (large) k-itemsets, it proceeds as follows:

1. Generate the candidate 1-itemsets C_1.
2. Remove from C_1 every itemset whose support is below minsup; the rest form L_1.
3. Join the itemsets of L_1 to generate the candidate 2-itemsets C_2.
4. Repeat steps 2 and 3 for k = 2, 3, ... until C_k becomes empty.

Figure 3.1 traces the algorithm on a small example with minsup = 50%.
Figure 3.1: A run of Apriori with minsup = 50% on four transactions. Item 4 occurs in only one of the four transactions, so it is removed from C_1 when forming L_1. Joining L_1 yields C_2, from which the candidates {1,5} and {3,5} are removed for falling short of minsup, giving L_2. Joining L_2 yields the single candidate {1,2,3}, which is frequent, so L_3 = {{1,2,3}} and the algorithm terminates.
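The level-wise procedure can be sketched in Python. This is a minimal illustration, not the thesis's implementation; the four transactions are chosen so the trace matches Figure 3.1, and the subset-pruning optimization of the join step is omitted:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining: prune C_k to L_k by support,
    then join L_k to form the next candidate set C_{k+1}."""
    n = len(transactions)

    def support(s):
        return sum(1 for t in transactions if s <= t) / n

    # C_1 -> L_1: keep the single items meeting minsup.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= minsup]
    frequent = {s: support(s) for s in level}
    k = 1
    while level:
        # Join step: unions of two L_k itemsets that have k+1 items.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        level = [c for c in candidates if support(c) >= minsup]
        frequent.update({c: support(c) for c in level})
        k += 1
    return frequent

T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 5}, {2, 5}]
F = apriori(T, 0.5)  # {4}, {1,5}, {3,5} are pruned; {1,2,3} survives
```

On these transactions the run mirrors the figure: L_1 drops item 4, L_2 drops {1,5} and {3,5}, and L_3 contains only {1,2,3}.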
           t/yes   t/no   total
  m/yes     50      15     65
  m/no      30       5     35
  total     80      20    100

Table 3.1: A contingency table for two items m and t.

3.3 The Chi-Square Test

High support and confidence alone do not guarantee that a rule is meaningful [11]. In the data of Table 3.1, the rule {m/yes} → {t/yes} has confidence 50/65 ≈ 77%, which looks high; but {t/yes} alone holds in 80/100 = 80% of all transactions, so the presence of m/yes in fact makes t/yes slightly less likely. To filter out such misleading rules, each rule X → Y is subjected to a chi-square test of independence on its 2×2 contingency table. Let S_X, S_Y, and S_XY denote the supports of X, of Y, and of X and Y together.
Over N transactions, if X and Y were independent, the statistic

T_dep = N (S_XY − S_X S_Y)² / ( S_X S_Y (1 − S_X)(1 − S_Y) )        (3.1)

would follow a chi-square distribution with one degree of freedom. Given a significance level α, X and Y are judged independent when T_dep < χ²₁(α); otherwise the rule X → Y is taken to express a real dependence between X and Y.

3.4 Decision Trees

A decision tree [13][14] classifies records with a hierarchy of IF-THEN tests: each internal node tests an attribute, and each leaf assigns a class.

3.4.1 Choosing the Split Attribute

Table 3.2 shows a small training set: each of the eight records has an attribute A, candidate split attributes Y1 and Y2, and a class X. Tables 3.3 and 3.4 cross-tabulate the class X against Y1 and Y2 respectively.
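Equation (3.1) is straightforward to evaluate directly; a short sketch, where the support values are illustrative (they correspond to a 2×2 table like Table 3.1 with N = 100):

```python
def t_dep(n, s_x, s_y, s_xy):
    """Chi-square statistic of Eq. (3.1) for a candidate rule X -> Y."""
    return n * (s_xy - s_x * s_y) ** 2 / (
        s_x * s_y * (1 - s_x) * (1 - s_y))

# Illustrative supports over N = 100 transactions:
# S_X = 0.65, S_Y = 0.80, S_XY = 0.50.
t = t_dep(100, 0.65, 0.80, 0.50)
# Independence is rejected at the 5% level only if t >= 3.841.
significant = t >= 3.841
```

For these values t is only about 1.1, so the (weak, negative) dependence is not significant and the rule would be discarded.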
    A    Y1   X
   20    Y    0
   10    Y    1
   30    N    1
   20    N    1
   10    Y    0
   30    N    1
   20    N    0
   20    Y    1

Table 3.2: Training data (class attribute X; the Y2 values are tabulated in Table 3.4).

           Y1=yes  Y1=no  total
   X=1        2      3      5
   X=0        2      1      3
   total      4      4      8

Table 3.3: Class X versus Y1.

           Y2=yes  Y2=no  total
   X=1        4      1      5
   X=0        0      3      3
   total      4      4      8

Table 3.4: Class X versus Y2.

To decide which of Y1 and Y2 better separates the classes, the entropy of the class attribute X is used:

H(X) = − Σ_{i=1..k} p_i log₂ p_i,

where X takes k values a_i (1 ≤ i ≤ k) and p_i is the relative frequency of a_i. For the data above,

H(X) = −5/8 log 5/8 − 3/8 log 3/8 = 0.95.

Splitting on Y1 (Table 3.3):

H(X | Y1 = yes) = −2/4 log 2/4 − 2/4 log 2/4 = 1
H(X | Y1 = no)  = −3/4 log 3/4 − 1/4 log 1/4 = 0.8113
H(X | Y1) = 4/8 · H(X | Y1 = yes) + 4/8 · H(X | Y1 = no) = 0.9057

so the information gain of Y1 is H(X) − H(X | Y1) = 0.95 − 0.9057 = 0.0443.

Splitting on Y2 (Table 3.4):

H(X | Y2 = yes) = −4/4 log 4/4 − 0/4 log 0/4 = 0
H(X | Y2 = no)  = −1/4 log 1/4 − 3/4 log 3/4 = 0.8113
H(X | Y2) = 4/8 · H(X | Y2 = yes) + 4/8 · H(X | Y2 = no) = 0.4057

so the gain of Y2 is H(X) − H(X | Y2) = 0.95 − 0.4057 = 0.5443. Since Y2 yields the larger gain, Y2 is selected as the split attribute.

3.4.2 Building the Tree

Constructing an optimal decision tree is NP-complete [15], so trees are grown with a greedy algorithm [16][17]:

(main)
1. Read the training data D.
2. Call SPLIT(D).

(SPLIT(D))
1. IF all records in D have the same class THEN output a leaf and return.
2. Compute the evaluation measure (e.g., information gain) for each candidate split.
3. Choose the split with the best score.
4. Partition D into D1 and D2 according to steps (2) and (3).
5. Call SPLIT(D1).
6. Call SPLIT(D2).

The recursion terminates when D can no longer be usefully split, and the resulting tree classifies a new record by following the tests from the root to a leaf.

3.4.3 Cross Validation

To estimate how well a tree generalizes to unseen data, N-fold cross validation is used:

1. Partition the data into N parts of equal size.
2. Build a tree from N − 1 of the parts.
3. Test the tree on the one part left out in step (2).
4. Repeat steps (2) and (3) so that each part serves once as the test set, and average the results.

Taking N equal to the number of records gives leave-one-out cross validation. Cross validation guards against overfitting, where a tree fits its training data closely but performs poorly on new records.
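The information-gain comparison of Section 3.4.1 can be reproduced numerically. A sketch using the class counts of Tables 3.3 and 3.4; note the exact gains differ in the third decimal from the text, which rounds H(X) to 0.95:

```python
from math import log2

def entropy(counts):
    """H = -sum p_i * log2(p_i) over the nonzero class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(class_counts, branches):
    """Gain = H(X) minus the size-weighted entropy of each branch."""
    n = sum(class_counts)
    cond = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(class_counts) - cond

# Class X is split 5:3 over the eight records; each branch is given
# as its per-class counts [X=1, X=0].
g1 = info_gain([5, 3], [[2, 2], [3, 1]])  # Y1: small gain
g2 = info_gain([5, 3], [[4, 0], [1, 3]])  # Y2: larger gain, so Y2 wins
```

With exact arithmetic g1 ≈ 0.049 and g2 ≈ 0.549, and Y2 is preferred, as in the text.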
An implementation built on top of SQL is described in [18].

3.5 Clustering

Clustering groups N records so that similar records fall into the same cluster. Agglomerative hierarchical clustering starts from N single-record clusters and repeatedly merges the closest pair. With D(x1, x2) the distance between records x1 and x2, the distance D(C1, C2) between clusters C1 and C2 (of sizes n1 and n2) can be defined in several ways:

single linkage:   D(C1, C2) = min over x1 ∈ C1, x2 ∈ C2 of D(x1, x2)
complete linkage: D(C1, C2) = max over x1 ∈ C1, x2 ∈ C2 of D(x1, x2)
group average:    D(C1, C2) = (1 / n1 n2) Σ_{x1 ∈ C1} Σ_{x2 ∈ C2} D(x1, x2)
Ward's method merges the pair of clusters whose union increases the within-cluster scatter least:

D(C1, C2) = E(C1 ∪ C2) − E(C1) − E(C2),   where E(C_i) = Σ_{x ∈ C_i} D(x, c_i)²

and c_i is the centroid of cluster C_i [19].

In contrast, the k-means algorithm fixes the number of clusters k in advance and iteratively refines k centroids c_i so as to minimize

Σ_{i=1..k} Σ_{x ∈ C_i} D(x, c_i)².

One pass of k-means costs O(NK), whereas hierarchical methods cost O(N²), so k-means scales better to large data sets [20].
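The k-means iteration can be sketched as follows. This is a minimal 1-D version for illustration, not the thesis's implementation; the sample points are invented:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points: alternate assignment and update
    steps to (locally) minimize the sum of squared distances to the
    centroid of each cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: O(N*K) distance computations.
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[i].append(x)
        # Update step: move each centroid to its cluster mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

pts = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = kmeans(pts, 2)
```

On these well-separated points the centroids converge to roughly 1.0 and 10.0 regardless of the random initialization.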
4 Experiments

4.1 The Data

The dataset DS used in the experiments is a collection of daily meal records entered by the participants through a web form (Figure 4.1). Meals are coded with 198 food items, each carrying an id (ids 1-190 and 701-708); the full item list is split across Tables 4.1-4.9.

Table 4.1: Food items (ids 1-39).
Figure 4.1: The HTML input form.
Table 4.2: Food items (ids 40-78).
Table 4.3: Food items (ids 79-117).
Table 4.4: Food items (ids 118-128).
Table 4.5: Food items (ids 129-146).
Table 4.6: Food items (ids 147-161).
Table 4.7: Food items (ids 162-180).
Table 4.8: Food items (ids 181-190).
Table 4.9: Food items (ids 701-708).
  2006/11/15:  95 95 138 145 186 701 706      1  2
  2006/11/16:  125 138 138 145 186 701 707    1  2
  2006/11/17:  118 131 138 138 145 186 701    1  2
  2006/11/18:  103 138 145 146 701            1  2
  2006/11/20:  50 83 120 128 138 138 145 701  1  2
  2006/11/21:  0 1                            1  2
  2006/11/22:  137 137 145 186 186 701 702 706  1  2
  2006/11/23:  35 136 138 145 186 701         1  2
  2006/11/24:  138 145 145 177 183 701 705    1  0

Table 4.10: An excerpt of one participant's daily records.

Each row of Table 4.10 lists the ids of the items eaten that day; an id may repeat when the item was eaten more than once (item 95 on 2006/11/15), and the trailing values are additional per-day codes. Some days carry no usable record: 2006/11/19 is missing entirely, and 2006/11/21 was recorded as 0. Section 4.3 describes how such days are handled.
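Before mining, each daily record is turned into a 0/1 transaction over the item ids. A sketch of that conversion; the two records are taken from Table 4.10, the helper names are illustrative, and the trailing per-day codes are ignored here:

```python
# Daily records: date -> item ids eaten that day (an id may repeat,
# e.g. item 95 appears twice on 2006/11/15).
records = {
    "2006/11/15": [95, 95, 138, 145, 186, 701, 706],
    "2006/11/16": [125, 138, 138, 145, 186, 701, 707],
    "2006/11/21": [],  # a day with nothing recorded becomes all zeros
}

# The 198 item ids used in the data: 1-190 and 701-708.
ITEM_IDS = list(range(1, 191)) + list(range(701, 709))

def to_transaction(items):
    """0/1 vector: 1 iff the item id appears at least once that day."""
    present = set(items)
    return {i: int(i in present) for i in ITEM_IDS}

t15 = to_transaction(records["2006/11/15"])
```

For 2006/11/15 the resulting vector has six 1-entries, one for each distinct item eaten that day.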
Figure 4.2: The 20 frequent 2-patterns found.

Candidate patterns are generated following the sequential-pattern scheme of R. Agrawal's group at IBM. The 198 items yield 198 candidate 1-sequences, < a_1 >, < a_2 >, ..., < a_198 >. The candidate 2-patterns consist of the ordered pairs < a_1 a_1 >, < a_1 a_2 >, ..., < a_198 a_198 > together with the unordered itemset pairs < (a_1 a_2) >, < (a_1 a_3) >, ..., < (a_197 a_198) >, for a total of

198 × 198 + 198 × 197 / 2 = 58707.

With minsup = 1, all 198 candidate 1-sequences are frequent; the 20 frequent patterns found among the 58707 candidate 2-patterns are shown in Figure 4.2.
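The candidate count above can be checked by direct arithmetic:

```python
n = 198  # number of distinct food items

ordered = n * n               # ordered 2-sequences <a_i a_j>: 39204
unordered = n * (n - 1) // 2  # itemset pairs <(a_i a_j)>, i < j: 19503
total = ordered + unordered   # candidate 2-patterns in all
```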
For the chi-square test, the significance level is set to α = 5%: a rule X → Y is regarded as independent, and discarded, when T_dep < χ²₁(0.05) = 3.841.
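The critical value 3.841 can be checked without statistical tables, since the chi-square distribution function with one degree of freedom reduces to an error function. This check is not part of the original text:

```python
from math import erf, sqrt

def chi2_cdf_1df(x):
    """CDF of the chi-square distribution with 1 degree of freedom:
    P(T <= x) = erf(sqrt(x / 2))."""
    return erf(sqrt(x / 2))

p = chi2_cdf_1df(3.841)  # should be very close to 0.95
```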
5 Results

Association rules were first mined from the daily 0/1 transactions of Chapter 4, keeping rules X → Y with confidence at least 50%, and each surviving rule was then chi-square tested at the 5% level: its supports S_X, S_Y, and S_XY were substituted into Eq. (3.1) to obtain T_dep, and only rules with T_dep ≥ 3.841 were retained. Table 5.1 lists the retained rules.
   T_dep       conf       sup
   6.849704    0.833333   0.006361
   4.033251    1          0.002545
   4.033251    1          0.002545
   7.856304    0.727273   0.010178

Table 5.1: T_dep, confidence, and support of the retained rules.

   T_dep
   3.917093

Table 5.2: Retained rules (T_dep).

Rules were next mined under a second setting (Tables 5.2-5.5), again testing each rule X → Y at the 5% level: S_X, S_Y, and S_XY were substituted into Eq. (3.1), and rules with T_dep ≥ 3.841 were kept. Finally, rules relating items eaten within N consecutive days were examined for N = 2 to 4 (Tables 5.6-5.8).
   T_dep
   8.15726
   5.02715
   10.866193

Table 5.3: Retained rules (T_dep).

   T_dep
   5.392515
   11.206815
   3.889798
   20.681059

Table 5.4: Retained rules (T_dep).

For the N-day analysis, S_X, S_Y, and S_XY were again substituted into Eq. (3.1), and rules with T_dep ≥ 3.841 were kept; the results appear in Tables 5.6-5.8.
   T_dep
   11.222124
   19.64843
   4.101117

Table 5.5: Retained rules (T_dep).

   T_dep
   4.876865

Table 5.6: Retained rules for N = 2 (T_dep).

   T_dep
   4.3198314
   3.900086
   4.993071
   6.900366

Table 5.7: Retained rules for N = 3 (T_dep).

   T_dep
   5.176787
   4.040732

Table 5.8: Retained rules for N = 4 (T_dep).
[1] http://www.v350f200.com/kanri/kankei 2g.html
[2] http://www.mhlw.go.jp/houdou/2004/11/h1122-2.html
[3]
[4] http://www.nih.go.jp/eiken/main adult.html
[5] http://www.nih.go.jp/eiken/programs/ekigaku shokuji.html
[6] http://www2.eiyo.shikoku-u.ac.jp/eiyokun/soft/ffqg/ffqg.htm
[7] DEWS2006, 1B-ill, 2006.
[8] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 1999.
[9] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge Discovery in Databases: An Overview, AI Magazine, pp. 213-228, Fall 1992.
[10] 2001.
[11] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant, The Quest Data Mining System, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1996.
[12] Sequential Pattern Mining, DE2005-72, pp. 43-48, 2005.
[13] http://www.engr.ie.u-ryukyu.ac.jp/~taka/soturon/genkou/node21.html
[14] 2004.
[15] L. Hyafil and R. Rivest, Constructing optimal binary decision trees is NP-complete, Information Processing Letters, 5:15-17, 1976.
[16] J. R. Quinlan, Induction of decision trees, Machine Learning, 1:81-106, 1986.
[17] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[18] http://mikilab.doshisha.ac.jp/dia/research/report/2005/0712/001/report20050712
[19] vol. 18, no. 1, pp. 59-65, 2003.
[20] http://www.kamishima.net/jp/clustering/