Feature Extraction with Randomness for an Application to Machine Learning from Text Data
Natsuki Fujimori (student ID 1175087)
Abstract

Feature Extraction with Randomness for an Application to Machine Learning from Text Data

Natsuki Fujimori

In text processing, a word vector converted from a text document is usually used as the feature vector. A word vector is a histogram of word frequencies or occurrences in a document. Although it is numerical, it is not quantitative data on a ratio scale; it is essentially qualitative data on an ordinal or nominal scale. Many data-mining methods based on machine learning rely on numeric computations that require a distance or inner product between feature vectors. Random Forest, an ensemble of decision trees, is instead well suited to such qualitative text data, because it builds decision trees of various shapes by choosing predictor variables at random. This random choice is normally made with a pseudo-random sequence. However, when the maximum tree depth, the number of trees used for discrimination, or the number of predictor features per tree is small, the non-uniformity of pseudo-random numbers can introduce a bias. To address this problem, this study proposes applying a quasi-random number generator, which produces a uniformly distributed sequence of points. When these parameters are small, the proposed method achieves higher performance than the pseudo-random baseline. We evaluate the conventional and proposed methods on an original Japanese SPAM e-mail dataset (600 SPAM, 1000 non-SPAM). Under the condition that the maximum tree depth is 2 and the number of features is at most 10, the proposed method improves classification accuracy by 2 to 3 percentage points.

Key words: Random Forest, Pseudo-Random Sequence, Mersenne Twister, Low-Discrepancy Sequence, Quasi-Random Sequence
1 Introduction

Since the 1970s, and especially with the spread of the Web, text data such as e-mail, weblogs and Twitter posts have grown rapidly, and machine learning is widely used to mine them. Random Forest, an ensemble of decision trees, is well suited to the qualitative features of text such as word-occurrence vectors.

When Random Forest builds its trees, it samples predictor features with a pseudo-random sequence. This thesis proposes replacing that pseudo-random sequence with a low-discrepancy (quasi-random) sequence, and evaluates the idea on SPAM e-mail discrimination, extending our earlier comparisons of machine learning algorithms for English and Japanese SPAM filtering [1],[2].

The remainder of this thesis is organized as follows. Chapter 2 reviews decision trees, Bagging and Random Forest. Chapter 3 describes pseudo-random and quasi-random sequences. Chapter 4 explains how the Random Forest implementation of scikit-learn was modified to use a quasi-random sequence. Chapter 5 reports the experiments, and Chapter 6 concludes.
2 Decision Trees, Bagging and Random Forest

2.1 Decision Trees

A decision tree classifies an input by testing one feature at each internal node and following the branches until a leaf assigns a class label [4]. Table 2.1 shows a toy example for SPAM discrimination: each mail is described by the binary occurrence (1 or 0) of the words 無料 ("free"), 出会い ("dating"), 女性 ("woman") and 秘密 ("secret"), together with its label, SPAM or HAM.

Table 2.1: Example of binary word-occurrence features with SPAM/HAM labels.

A tree built from such data tests one word feature f at each node, sending mails with f = 1 to one child and mails with f = 0 to the other.
2.2 Ensemble Learning

2.2.1 Weak Learners

A weak learner is a classifier that performs only slightly better than random guessing on a training set {x_i | 1 ≤ i ≤ N}. Ensemble methods combine many weak learners into one strong classifier.

2.2.2 Ensemble Methods

Two representative ensemble methods based on decision trees are Bagging and Random Forest.

2.2.3 Bagging

Bagging (Bootstrap Aggregating) was proposed by Leo Breiman in 1996 [6]. It draws m bootstrap samples from the training data, trains one weak learner on each sample, and combines their outputs.
Each bootstrap sample trains one weak learner T_1, ..., T_m, yielding discriminant functions f_1(x), ..., f_m(x); the final label is decided by a majority vote over the weak outputs, as shown in Figure 2.2.

Figure 2.2: Flow of Bagging: training data → bootstrap sampling → bootstrap samples 1..m → weak learners T_1..T_m → weak outputs f_1(x)..f_m(x) → majority vote → final decision.
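The Bagging flow can be sketched in a few lines of Python; this is a minimal stand-in, with a trivial "most frequent label" learner in place of a real decision tree:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw one bootstrap sample: len(data) points with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def majority_label(labels):
    """A trivial weak learner: predict the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def bagging_predict(train_labels, m, rng):
    """Train m weak learners on bootstrap samples and combine them
    by majority vote, as in Figure 2.2."""
    votes = [majority_label(bootstrap_sample(train_labels, rng))
             for _ in range(m)]
    return majority_label(votes)
```

A real implementation would train a tree per sample; only the bootstrap-and-vote skeleton is shown here.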
2.2.4 Random Forest

Random Forest was proposed by Leo Breiman in 2001 [7] as an extension of Bagging. In addition to bootstrap sampling, at each node it selects d candidate features by random sampling and splits on the best of them, which makes the trees in the forest diverse [5]. Figure 2.3 shows the flow; it differs from Bagging only in this random feature selection [5].
Figure 2.3: Flow of Random Forest: training data → bootstrap sampling → a forest of diverse decision trees (bootstrap samples 1..m, with d features chosen by random sampling at each node) → weak outputs f_1(x)..f_m(x) → majority vote → final decision.
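The node-level feature randomization that distinguishes Random Forest from Bagging can be sketched as follows (candidate_features is a hypothetical helper, not scikit-learn code):

```python
import random

def candidate_features(n_features, d, rng):
    """Random Forest's node-level randomization: draw d distinct
    feature indices out of n_features predictors; the node then
    splits on the best of only these d candidates."""
    return rng.sample(range(n_features), d)
```

Because each node sees a different random subset, trees trained on the same bootstrap sample still take different shapes.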
3 Random Number Sequences

Random sampling appears throughout machine learning, from k-means initialization to the feature selection of Random Forest. This chapter reviews the pseudo-random and quasi-random sequences compared in this thesis.

3.1 Pseudo-Random Sequences

A pseudo-random sequence is generated by a deterministic recurrence [9]. The classic example is the linear congruential generator

    X_{n+1} = (A * X_n + B) mod M,    (3.1)

which is used, for example, by the C standard library.
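Equation (3.1) is a one-liner in code; the constants below are the ones popularized by C-library rand() implementations, an illustrative assumption rather than the generator used later in this thesis:

```python
def lcg(seed, a=1103515245, b=12345, m=2**31):
    """Linear congruential generator X_{n+1} = (a*X_n + b) mod m,
    i.e. equation (3.1). Yields an infinite stream of integers."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x
```

The same seed always reproduces the same stream, which is what makes the sequence "pseudo" rather than truly random.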
The period of (3.1) is at most M, so its quality depends strongly on the chosen constants. The Mersenne Twister, proposed in 1998 [8], has the much longer period 2^19937 - 1 and is equidistributed in up to 623 dimensions [10]. Figures 3.1 and 3.2 plot 1000*1000 and 10000*10000 two-dimensional points generated by the Mersenne Twister. In 2003, Marsaglia proposed Xorshift, which attains a period of 2^128 - 1 using only shift and XOR operations [12]; scikit-learn uses an Xorshift generator internally.

3.2 Quasi-Random Sequences

Quasi-random numbers originate in the quasi-Monte Carlo method. To approximate the integral S(f) of a function f over the unit cube I^n, where I := {0 <= x < 1}, one uses N sample points x_i (1 <= i <= N) and the estimate

    S_N(f) = (1/N) * sum_{1<=i<=N} f(x_i).    (3.2)

The error of this estimate is governed by the discrepancy D(P_N) of the point set P_N = {x_i | 1 <= i <= N} in I^n; point sets achieving D(P_N) = O(N^{-1} (log N)^n) give the fastest convergence.
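Marsaglia's xorshift128 is short enough to sketch directly; the default state values below are the ones given in his paper, and the masking keeps the arithmetic within 32 bits as C's unsigned overflow would:

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic

def xorshift128(x=123456789, y=362436069, z=521288629, w=88675123):
    """Marsaglia's xorshift128 generator: period 2^128 - 1, using only
    shifts and XORs (the state must never be all zero)."""
    while True:
        t = (x ^ (x << 11)) & MASK32
        x, y, z = y, z, w
        w = (w ^ (w >> 19) ^ t ^ (t >> 8)) & MASK32
        yield w
```

Its speed (no multiplications) is one reason variants of it are used inside numerical libraries.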
Sequences attaining this bound are called low-discrepancy sequences [9], or quasi-random sequences. A representative example is the SOBOL sequence, proposed by I. M. Sobol in 1967 [11]. Figure 3.3 plots 1000*1000 and Figure 3.4 plots 10000*10000 two-dimensional SOBOL points.
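A one-dimensional taste of a low-discrepancy sequence is the van der Corput sequence, built by reversing the digits of the index about the radix point; Sobol's construction generalizes the same idea to many dimensions using direction numbers. A sketch:

```python
def van_der_corput(n, base=2):
    """n-th element of the van der Corput sequence: write n in the
    given base and mirror its digits about the radix point. The
    resulting points fill [0, 1) far more evenly than random draws."""
    q, denom = 0.0, 1.0
    while n:
        n, r = divmod(n, base)
        denom *= base
        q += r / denom
    return q
```

In base 2 the sequence runs 1/2, 1/4, 3/4, 1/8, 5/8, ..., always placing the next point in the largest remaining gap.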
Figure 3.1: 1000*1000 two-dimensional points generated by the Mersenne Twister.

Figure 3.2: 10000*10000 two-dimensional points generated by the Mersenne Twister.
Figure 3.3: 1000*1000 two-dimensional SOBOL points.

Figure 3.4: 10000*10000 two-dimensional SOBOL points.
4 Applying a Quasi-Random Sequence to Random Forest

This chapter describes how the Random Forest implementation of the Python library scikit-learn was modified to use a quasi-random sequence.

4.1 Why Combine a Quasi-Random Sequence with Random Forest

Random Forest relies on random sampling in two places: the bootstrap sampling of the training data and the selection of candidate features at each node. When the tree depth, the number of trees or the number of features is small, only a few random draws are made, so any non-uniformity of the pseudo-random sequence can bias which features the forest uses. A quasi-random sequence, whose points are uniformly spread by construction, should reduce this bias.
4.2 Modification Points in Random Forest

scikit-learn's Random Forest draws its random numbers from an Xorshift generator. Two points were modified to use a quasi-random sequence instead: the per-tree random_state seed assigned in the Python layer (Section 4.3.1) and the feature sampling inside the compiled tree code (Section 4.3.2).
4.3 Modifying Random Forest in Python

4.3.1 Modifying forest.py

scikit-learn implements Random Forest in forest.py, whose fit() method seeds one decision tree per iteration. The loop was changed as follows:

```python
for i in range(self.n_estimators):
    # (1) draw an initial seed for this tree
    seed = np.random.randint(MAX_INT)
    tree = self._make_estimator(append=False)
    # (2) generate a quasi-random integer with sobol_seq
    [j, seed] = i4_uniform(0, MAX_INT, seed)
    # (3) seed the tree with the quasi-random value
    tree.set_params(random_state=j)
    # tree.set_params(random_state=random_state.randint(MAX_INT))
    trees.append(tree)
```

The function i4_uniform() comes from the sobol_seq module, which is imported at the top of forest.py.
The import added to forest.py is:

```python
from sobol_seq import *
```

Figure 4.1 summarizes the relation between a user program and the modified fit() function. The user program runs

```python
rf = RandomForestClassifier(n_estimators=num, max_features=feat, max_depth=2)
rf.fit(traindata, trainlabel)     # build the forest on the training data
result = rf.predict(testdata)     # discriminate the test data
```

where n_estimators, max_features and max_depth set the number of trees, the number of features and the maximum tree depth.

Figure 4.1: Relation between the executed program and the modified fit() function in forest.py.
Which generator seeds the forest, the Mersenne Twister or the SOBOL sequence, is selected by commenting out one of the two set_params() lines in fit(); predict() then returns the discrimination result.

4.3.2 Modifying the Cython Tree Code

The decision-tree core of scikit-learn lives under sklearn/tree and is written in Cython, a Python dialect compiled via C: tree.pyx and tree.pxd are translated to tree.c and built into the binary module tree.so (UNIX) or tree.pyd (Windows) with setup.py install. Inside tree.pyx, feature sampling uses a rand_int function based on Xorshift [12]. This rand_int was replaced with a SOBOL sequence generator, and the module was recompiled.
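Both modifications replace a pseudo-random draw with a point from a low-discrepancy sequence. A stdlib-only sketch of that idea, where quasi_seeds and its embedded van der Corput helper are illustrative stand-ins, not the thesis's sobol_seq-based patch:

```python
def quasi_seeds(m, max_int=2**31 - 1):
    """Derive one random_state seed per tree from a one-dimensional
    low-discrepancy sequence (van der Corput, base 2) instead of a
    pseudo-random draw: scale each point in [0, 1) to an integer."""
    def vdc(n, base=2):
        q, denom = 0.0, 1.0
        while n:
            n, r = divmod(n, base)
            denom *= base
            q += r / denom
        return q
    return [int(vdc(i + 1) * max_int) for i in range(m)]
```

Because consecutive van der Corput points are maximally spread out, the resulting seeds are guaranteed distinct even for very small forests, which is exactly the regime the thesis targets.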
5 Experiments

The proposed Random Forest is evaluated on SPAM discrimination with an original Japanese e-mail dataset of 1600 mails: 600 SPAM and 1000 non-SPAM.

5.1 Experimental Setup

5.1.1 Environment and Data

The experiments use Python 2.7.6 with NumPy, scikit-learn and the SOBOL implementation sobol_seq.py. The SPAM mails were collected between 2004 and 2007; each of the 600 SPAM and 1000 non-SPAM mails is represented by 52 word features.
Table 5.1: Experimental environment.
    OS: Windows 7 Enterprise
    Memory: 4.00 GB
    CPU: Intel(R) Core(TM) i5-2400S CPU @ 2.50 GHz
    Language: Python 2.7.6
    Libraries: scikit-learn, NumPy, sobol_seq

From the 600 SPAM and 1000 non-SPAM mails, 50 mails are used for training and the remaining 1550 for testing, each represented by 52 word features.

5.1.2 Parameters

Random Forest has three main parameters, varied as shown in Table 5.2.

Table 5.2: Random Forest parameters.
    D_max (maximum tree depth): 1, 2, 3, 4, 5
    F_max (number of features): 1, 2, 3, ..., 10
    N_max (number of decision trees): 1, 2, 3, ..., 10

Figure 5.1 illustrates trees of maximum depth 1 to 5.
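The sweep of Table 5.2 enumerates 5 * 10 * 10 = 500 parameter combinations; a hypothetical driver could iterate them as:

```python
from itertools import product

def parameter_grid():
    """All (D_max, F_max, N_max) combinations of Table 5.2.
    One train/test run would be performed per combination."""
    return list(product(range(1, 6),     # D_max: 1..5
                        range(1, 11),    # F_max: 1..10
                        range(1, 11)))   # N_max: 1..10
```

Each tuple would be passed to RandomForestClassifier as max_depth, max_features and n_estimators in an actual run.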
Figure 5.1: Decision trees of maximum depth 1 to 5.

Depths above 5 are not used in the main experiments (a depth of 6 is examined in Appendix B); F_max and N_max each range over 1 to 10.

5.1.3 Evaluation

For each parameter setting, classification accuracy on the 1550 test mails is compared between the two random sequences.
Every combination of F_max and N_max from 1 to 10 is evaluated for each D_max, for a total of 5000 accuracy comparisons between the two sequences.

5.2 Results

5.2.1 Overall Comparison of the Random Sequences

Table 5.3 counts, for each maximum depth, in how many trials each sequence achieved the higher classification accuracy.

Table 5.3: Number of trials in which each sequence gave the higher accuracy (MT: Mersenne Twister, Quasi: SOBOL).
    D_max   MT    Quasi
    1       740   260
    2       305   695
    3       465   535
    4       454   546
    5       810   190

Over all 5000 trials, the Mersenne Twister was ahead in 55.5% of the comparisons and the quasi-random sequence in 44.5%.
Figure 5.2: Number of trials, per maximum tree depth, in which each random sequence achieved the higher classification accuracy.

As Figure 5.2 and Table 5.3 show, the quasi-random sequence is clearly ahead only at D_max = 2; the following subsections examine the accuracy curves for each depth from D_max = 1 to D_max = 5.
Figure 5.3: Accuracy vs. number of features (D_max = 1, MT vs. Quasi).

Figure 5.4: Accuracy vs. number of decision trees (D_max = 1, MT vs. Quasi).
5.2.2 Results for D_max = 1

Figure 5.3 plots classification accuracy against the number of features and Figure 5.4 against the number of trees, with N_max (respectively F_max) ranging over 1 to 10. In Figure 5.3 the Mersenne Twister is ahead by 2.9%, and by 4.6% at F_max = 8; in Figure 5.4 its margins reach 3.1% and 4.6%. A depth-1 tree splits on a single word, and the presence of one word is rarely enough to tell SPAM from non-SPAM; trees of depth 2 are examined next.

5.2.3 Results for D_max = 2

Figure 5.5 plots accuracy against the number of features and Figure 5.6 against the number of trees for D_max = 2, again with the other parameter over 1 to 10. Unlike at D_max = 1, the quasi-random sequence is ahead: by 1.8% at F_max = 9 and 2.0% at F_max = 3 in Figure 5.5, and from N_max = 2 onward in Figure 5.6, by 2.0% at N_max = 9 and 3.3% at N_max = 10.
Figure 5.5: Accuracy vs. number of features (D_max = 2, MT vs. Quasi).

Figure 5.6: Accuracy vs. number of decision trees (D_max = 2, MT vs. Quasi).
5.2 N max = 10 5.2.4 D max = 3 D max = 4 76 木の深さ :3 識別精度 ( % ) 74 72 70 68 66 MT Quasi 64 1 2 3 4 5 6 7 8 9 10 特徴数 5.7 D max = 3 5.7 D max = 3 5.8 5.9 D max = 4 5.10 N max 1 10 5.7 0.7% 1.5% 5.8 0.7% 3.6% D max = 2 5.9 0.8% 1.2% 5.10 0.6% 1.0% D max = 3 1 28
Figure 5.8: Accuracy vs. number of decision trees (D_max = 3, MT vs. Quasi).

Figures 5.9 and 5.10 give the corresponding results for D_max = 4.
Figure 5.9: Accuracy vs. number of features (D_max = 4, MT vs. Quasi).

Figure 5.10: Accuracy vs. number of decision trees (D_max = 4, MT vs. Quasi).
5.2.5 Results for D_max = 5

Figure 5.11: Accuracy vs. number of features (D_max = 5, MT vs. Quasi).

Figure 5.12: Accuracy vs. number of decision trees (D_max = 5, MT vs. Quasi).
Figure 5.11 plots accuracy against the number of features and Figure 5.12 against the number of trees for D_max = 5, with the other parameter over 1 to 10. The Mersenne Twister is ahead by 4.6% to 5.7% in Figure 5.11 and by 4.6% to 8.2% in Figure 5.12: as at D_max = 1, the pseudo-random sequence is superior at D_max = 5. Detailed accuracy tables for D_max = 1 and D_max = 2 are given in Appendix A.

5.2.6 Extended Parameter Ranges

To examine Random Forest with larger parameter values at D_max = 2, F_max and N_max were extended to 11 to 20. Figures 5.13 and 5.14 show the results.
Figure 5.13: Accuracy vs. number of features (D_max = 2, F_max = 11-20, N_max = 11-20).

Figure 5.14: Accuracy vs. number of decision trees (D_max = 2, F_max = 11-20, N_max = 11-20).
5.2.7 Mixed Parameter Ranges

Figure 5.15: Accuracy vs. number of features (D_max = 2, F_max = 1-10, N_max = 11-20).

Figure 5.16: Accuracy vs. number of decision trees (D_max = 2, F_max = 1-10, N_max = 11-20).
Figures 5.15 to 5.18 combine the small range 1 to 10 for one of F_max and N_max with the extended range 11 to 20 for the other, at D_max = 2.
Figure 5.17: Accuracy vs. number of decision trees (D_max = 2, F_max = 11-20, N_max = 1-10).

Figure 5.18: Accuracy vs. number of features (D_max = 2, F_max = 11-20, N_max = 1-10).
5.3 Distribution of Selected Features

Figure 5.19: Feature-selection counts by the rand_int function.

Figure 5.20: Feature-selection counts by the SOBOL sequence.

Figures 5.19 and 5.20 show, for each feature index, how often it was selected by the rand_int function and by the SOBOL sequence, respectively.
Both histograms cover the feature indices 0 to 51, under F_max = 5, N_max = 5 and D_max = 2. The SOBOL sequence selects the features almost uniformly, whereas rand_int selects some features noticeably more often than others. This difference in uniformity is a plausible source of the accuracy differences observed when Random Forest is run with small parameter values.
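The counts behind Figures 5.19 and 5.20 are simply selection frequencies per feature index; a sketch with a pseudo-random selector (hypothetical helper names, stdlib only):

```python
import random
from collections import Counter

def selection_histogram(draws):
    """Count how often each feature index is chosen, as in
    Figures 5.19/5.20 (selection counts per feature index)."""
    return Counter(draws)

# example: tally 1000 pseudo-random selections over 52 features
rng = random.Random(0)
hist = selection_histogram(rng.randrange(52) for _ in range(1000))
```

Feeding a low-discrepancy selector into the same tally would produce the flatter histogram of Figure 5.20.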
6 Conclusion

This thesis proposed applying a quasi-random (low-discrepancy) sequence to the random sampling inside Random Forest and compared it with the conventional pseudo-random sequence. Experiments varied the maximum depth D_max from 1 to 5, the number of features F_max from 1 to 10 and the number of trees N_max from 1 to 10, on a Japanese e-mail dataset of 600 SPAM and 1000 non-SPAM mails, using 50 mails for training and 1550 for testing. Under D_max = 2 with F_max = 1 to 9 and N_max = 2 to 10, the proposed method improved classification accuracy by about 1.9%.

The advantage of the quasi-random sequence is thus confined to small depths, D_max = 2 in particular, and extending F_max and N_max up to 20 at D_max = 2 preserved it. As a further check, the Spambase dataset of the UCI Machine Learning Repository (4601 mails) was also used: with 3% of the mails (138) for training and the remaining 4463 for testing, the proposed method again improved accuracy by 1.9% to 3.2% at D_max = 2 with F_max and N_max from 1 to 10. The feature-selection histograms showed that the SOBOL sequence samples the features more uniformly than the pseudo-random generator, which accounts for the improvement. Applying quasi-random sequences to the remaining randomized parts of Random Forest remains as future work.
References

[1] Natsuki Fujimori and Shinichi Yoshida, "Comparison of Machine Learning Algorithms for English and Japanese SPAM mail discrimination," The 3rd International Workshop on Advanced Computational Intelligence and Intelligent Informatics (IWACIII 2013), Shanghai, China, October 18-21, ss1-10, 2013.
[2] Natsuki Fujimori and Shinichi Yoshida, "Machine Learning Algorithms applied to SPAM Filtering for English and Japanese E-Mail," The 14th International Symposium on Advanced Intelligent Systems (ISIS 2013), Daejeon, Korea, November 13-16, F3a-1, 2013.
[3] (In Japanese), 2013.
[4] (In Japanese), 2010.
[5] (In Japanese; R, p.266), 2007.
[6] Leo Breiman, "Bagging Predictors," Machine Learning, 24, pp.123-140, 1996.
[7] Leo Breiman, "Random Forests," Machine Learning, 45, pp.5-23, 2001.
[8] Makoto Matsumoto and Takuji Nishimura, "Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator," ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1), pp.3-30, 1998.
[9] (In Japanese; 4th ed., p.1560), 2007.
[10] http://random.ism.ac.jp/info01/random_number_generation/node6.html
[11] I. M. Sobol, "On the distribution of points in a cube and the approximate evaluation of integrals," USSR Computational Mathematics and Mathematical Physics, 7, pp.86-112, 1967.
[12] George Marsaglia, "Xorshift RNGs," Journal of Statistical Software, 8(14), pp.1-6, 2003.
Appendix A: Accuracy Tables for D_max = 1 and D_max = 2

Table A.1: Classification accuracy (%) for D_max = 1, by number of features F_max (left) and by number of decision trees N_max (right); columns: MT, Quasi.

    F_max   MT    Quasi     N_max   MT    Quasi
    1       63.5  63.3      1       64.0  64.7
    2       65.6  63.7      2       65.4  64.8
    3       66.1  64.1      3       67.9  65.4
    4       67.2  64.3      4       67.9  64.7
    5       67.2  64.8      5       67.9  65.1
    6       68.6  65.1      6       68.7  65.2
    7       68.7  65.5      7       68.9  65.7
    8       70.3  65.7      8       68.7  65.1
    9       70.0  66.5      9       69.0  64.3
    10      66.5  71.2      10      69.9  64.9

Table A.2: Classification accuracy (%) for D_max = 2; columns: MT, Quasi.

    F_max   MT    Quasi     N_max   MT    Quasi
    1       63.5  65.9      1       66.5  65.6
    2       65.3  67.0      2       66.6  67.1
    3       66.3  68.3      3       66.4  68.4
    4       66.9  68.8      4       67.2  69.3
    5       67.5  69.0      5       68.1  68.7
    6       67.8  69.4      6       67.2  69.2
    7       68.0  70.0      7       67.0  69.3
    8       68.6  70.0      8       67.6  70.3
    9       68.4  70.2      9       67.2  70.1
    10      70.2  68.8      10      67.1  70.5
Appendix B: Results for a Maximum Depth of 6

Figure B.1: Number of trials, per maximum tree depth (1 to 6), in which each random sequence achieved the higher classification accuracy.
Figure B.2: Accuracy vs. number of features (D_max = 6, MT vs. Quasi).
Figure B.3: Accuracy vs. number of decision trees (D_max = 6, MT vs. Quasi).