Performance Comparison of Machine Learning Algorithms for SPAM Discrimination

Academic Year Heisei 24 (2012)
Student ID 1130378
March 9, 2013
Abstract (translated from the Japanese)

In this research, six machine learning algorithms, the Naive Bayes classifier, neural network, SVM, Bagging, AdaBoost, and Random Forest, are compared on the task of SPAM discrimination. For English mail, the Spambase data set of the UCI Machine Learning Repository (4601 messages) is used, with 500 to 4000 messages for training. For Japanese mail, an original corpus of 1400 messages is created, with 1000 messages for training. For English SPAM, the five algorithms other than the Naive Bayes classifier achieve discrimination rates of 90.6 to 94.9%. For Japanese SPAM, the Naive Bayes classifier reaches 42.8%, while the other algorithms achieve 77.0 to 79.5%.
Abstract

Performance Comparison of Machine Learning Algorithms for SPAM Discrimination

Natsuki Fujimori

Many machine learning techniques can be applied to classification, and a quantitative comparison among them is required. In this research, the Naive Bayes classifier, neural network, support vector machine (SVM), Bagging, AdaBoost, and Random Forest are applied to classify e-mail written in English and in Japanese in order to filter out SPAM mail, and the algorithms are compared from the viewpoint of classification precision. For English e-mail classification, the Spambase dataset of the UCI Machine Learning Repository is used; of its 4601 messages, between 500 and 4000 randomly selected messages are used for training and the rest for testing. For Japanese e-mail classification, an original corpus of 1400 messages is created, from which 1000 randomly selected messages are used for training, and discrimination is performed under the same conditions as for English so that precision can be compared. As a result, for English SPAM, every algorithm except the Naive Bayes classifier achieves a precision exceeding 90%. For Japanese SPAM, the Naive Bayes classifier reaches a discrimination rate of 42.8%, while 77 to 79% is obtained by the other algorithms.

key words: SPAM, Machine Learning, Naive Bayes Classifier, Neural Network, Support Vector Machine, Bagging, AdaBoost, Random Forest
Contents

1 Introduction
2 Machine Learning Algorithms
  2.1 Naive Bayes Classifier
  2.2 Neural Network
  2.3 Support Vector Machine
  2.4 Bagging
  2.5 AdaBoost
  2.6 Random Forest
  2.7 Cross Validation
3 SPAM Discrimination Experiments
  3.1 Experimental Environment
  3.2 Discrimination of English SPAM Mail
  3.3 Discrimination of Japanese SPAM Mail
    3.3.1 Corpus Creation
    3.3.2 Experimental Data
4 Results and Discussion
  4.1 Discrimination of English SPAM Mail
    4.1.1 Results
    4.1.2 Discussion
  4.2 Discrimination of Japanese SPAM Mail
    4.2.1 Results
    4.2.2 Discussion
5 Conclusion
References
Appendix A Enlarged Figures
List of Figures

2.1 Structure of a neural network
2.2 Margin and support vectors of an SVM
2.3 Flow of Bagging
2.4 Flow of AdaBoost
2.5 Flow of Random Forest
2.6 8-fold cross validation
3.1 Corpus creation procedure
3.2 Part of the feature matrix
4.1 Discrimination rates for English SPAM
4.2 Discrimination rates of SVM with 8 kernels for English SPAM (table)
4.3 Discrimination rates of SVM with 8 kernels for English SPAM (graph)
4.4 Discrimination rates for English SPAM with 3500 to 4000 training samples (table)
4.5 Discrimination rates for English SPAM with 3500 to 4000 training samples (graph)
4.6 Discrimination rates with 1000 training samples for English and Japanese SPAM
A.1 Enlarged version of Fig. 4.1
A.2 Enlarged version of Fig. 4.3
A.3 Enlarged version of Fig. 4.5
List of Tables

2.1 Term occurrence counts in 100 e-mail messages
2.2 Probabilities for the worked example
3.1 Experimental environment
3.2 Contents of the corpus
4.1 Discrimination rates of the 6 algorithms for English SPAM
4.2 Discrimination rates with 1000 training samples for English and Japanese SPAM
1 Introduction

According to [1], 92.51% of e-mail traffic in 2010 was SPAM. SPAM mail imposes a burden on both users and mail servers, so automatically discriminating SPAM from legitimate (HAM) mail is an important problem. In this thesis, six machine learning algorithms are applied to SPAM discrimination and compared quantitatively.

The six algorithms are the Naive Bayes classifier, neural network (NN), support vector machine (SVM), Bagging, AdaBoost, and Random Forest (RF). For English mail, the Spambase Data Set of the UCI Machine Learning Repository, created from mail collected at Hewlett-Packard Labs by Mark Hopkins and colleagues, is used, with the number of training samples varied from 500 to 4000 in steps of 500. For Japanese mail, an original corpus of 1400 SPAM and HAM messages is created, features are extracted with TF-IDF weighting, and 1000 randomly selected messages are used for training. As a result, for English SPAM the five algorithms other than the Naive Bayes classifier exceed 90% discrimination; for Japanese SPAM they achieve 75 to 79%, while the Naive Bayes classifier reaches only about 42%, some 10 to 15 points below the English results.

The remainder of this thesis is organized as follows. Chapter 2 describes the six machine learning algorithms. Chapter 3 describes the SPAM discrimination experiments. Chapter 4 presents and discusses the results, and Chapter 5 concludes the thesis.
2 Machine Learning Algorithms

This chapter describes the six machine learning algorithms used for SPAM discrimination.

2.1 Naive Bayes Classifier

The Naive Bayes classifier is based on Bayes' theorem:

    P(B|A) = P(A|B)P(B) / P(A)    (2.1)

where P(A) and P(B) are the prior probabilities of events A and B, and P(B|A) is the posterior probability of B given A.

As a worked example, suppose 100 e-mail messages are given, of which 40 are SPAM and 60 are HAM. Let B_S denote the event that a message is SPAM and B_H the event that it is HAM; then P(B_S) = 40/100 = 0.4 and P(B_H) = 60/100 = 0.6. Table 2.1 shows how often certain terms occur in the 100 messages. For the first term, the probability that a message contains it is P(A) = (5 + 11)/100 = 0.16, and the class-conditional probabilities are P(A|B_S) = 5/40 = 0.125 and P(A|B_H) = 11/60 = 0.1833.

Table 2.1 Term occurrence counts in 100 e-mail messages
term     | SPAM | HAM
(term 1) |  5   | 11
(term 2) |  3   | 14
(term 3) | 13   |  2
(term 4) | 20   |  0

Table 2.2 Probabilities for the worked example
       | SPAM  | HAM
P(B)   | 0.4   | 0.6
P(A)   | 0.16  | 0.16
P(A|B) | 0.125 | 0.1833

From Table 2.2, the posterior probability that a message containing the first term is SPAM is P(SPAM) = 0.125 x 0.4 / 0.16 = 0.3125, and the probability that it is HAM is P(HAM) = 0.1833 x 0.6 / 0.16 = 0.687. Since P(SPAM) > P(HAM) does not hold, the message is classified as HAM. A Bayesian filter of this kind is used, for example, in Mozilla Thunderbird.
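The worked example above can be reproduced in a few lines. The function name below is ours; the counts come from Table 2.1 (first term: 5 of 40 SPAM mails, 11 of 60 HAM mails).

```python
# Minimal sketch of the Naive Bayes decision from the worked example.

def naive_bayes_posterior(count_in_class, class_size, prior, p_evidence):
    """Posterior P(class | term) = P(term | class) * P(class) / P(term)."""
    likelihood = count_in_class / class_size
    return likelihood * prior / p_evidence

p_term = (5 + 11) / 100                               # P(A) = 0.16
p_spam = naive_bayes_posterior(5, 40, 0.4, p_term)    # 0.3125
p_ham = naive_bayes_posterior(11, 60, 0.6, p_term)    # 0.6875
label = "SPAM" if p_spam > p_ham else "HAM"
```

Since p_ham > p_spam, the sketch reaches the same HAM decision as the text.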
2.2 Neural Network

Figure 2.1 shows the structure of a neural network with an input layer, a hidden layer, and an output layer: inputs x_i (i = 1, ..., I) are fed to the I input units, which are connected by weights w_ih to the H hidden units, which in turn are connected by weights w_hk to the K output units.

[Figure 2.1 Structure of a neural network: input layer, hidden layer, and output layer connected by weights w_ih and w_hk.]

The neural network trained by error backpropagation was proposed by David E. Rumelhart and colleagues in 1986. The output of a two-layer network is given by [3]

    y_k = phi_0(alpha_k + Sum_h w_hk phi_h(alpha_h + Sum_i w_ih x_i))    (2.2)

where phi is an activation function and alpha is a bias term.
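Eq. (2.2) can be sketched directly in code. The network sizes, weight values, and the sigmoid activation below are illustrative assumptions, not values from the thesis.

```python
# Forward pass of a two-layer network, following Eq. (2.2):
# y_k = phi(alpha_k + sum_h w_hk * phi(alpha_h + sum_i w_ih * x_i))
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_ih, bias_h, w_hk, bias_k):
    hidden = [sigmoid(bias_h[h] + sum(w_ih[i][h] * x[i] for i in range(len(x))))
              for h in range(len(bias_h))]
    return [sigmoid(bias_k[k] + sum(w_hk[h][k] * hidden[h] for h in range(len(hidden))))
            for k in range(len(bias_k))]

# I = 2 inputs, H = 2 hidden units, K = 1 output (illustrative values)
y = forward([1.0, 0.5],
            w_ih=[[0.2, -0.4], [0.7, 0.1]], bias_h=[0.0, 0.1],
            w_hk=[[0.5], [-0.3]], bias_k=[0.2])
```

With a sigmoid activation, every output lies strictly between 0 and 1, which is convenient for a two-class SPAM/HAM decision threshold at 0.5.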
The weights of the network are trained by iterating over the training data:
1. Initialize the weights and present a training example to the network.
2. Update each weight by

    w(j+1) = w(j) + eta * delta * R    (2.3)

   where w(j+1) is the weight at step j+1, w(j) the weight at step j, eta the learning rate, and delta and R the error and input terms of the backpropagation rule [3].
3. Repeat step 2 until the outputs converge.

2.3 Support Vector Machine

[Figure 2.2 Margin and support vectors of an SVM: the separating hyperplane in the (x1, x2) plane is chosen to maximize the margin to the nearest training points, the support vectors.]
The support vector machine (SVM) was proposed by Vladimir N. Vapnik and co-workers in 1992 [2]. As shown in Figure 2.2, the SVM separates two classes by the hyperplane that maximizes the margin, that is, the distance to the nearest training points (the support vectors). Although the basic SVM is a two-class linear classifier, nonlinear separation is obtained by replacing inner products with kernel functions.

2.4 Bagging

[Figure 2.3 Flow of Bagging: the training data are resampled into B bootstrap sets, a weak learner is trained on each, and the results are combined by majority vote.]

Bagging (bootstrap aggregating) was proposed by Leo Breiman in 1996.
Bagging takes its name from the bootstrap. As illustrated in Figure 2.3, it proceeds as follows.
1. Let n be the number of training examples.
2. Sample m examples from the training data with replacement (a bootstrap sample).
3. Train a weak learner h on the bootstrap sample.
4. Repeat steps 1 to 3 B times to obtain weak learners {h_i | i = 1, 2, ..., B}.
5. Classify by majority vote over the weak learners: H(x) = arg max_y |{i | h_i(x) = y}|.

2.5 AdaBoost

[Figure 2.4 Flow of AdaBoost: the training data are split into training and test sets; weak learners 1, 2, ..., T are trained in sequence, each with example weights w_i, a misclassification rate, and a confidence alpha; the weights are updated after each round, and the weighted majority vote of the weak learners forms the strong learner.]
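The Bagging steps above can be sketched as follows. The decision-stump weak learner and the toy one-dimensional data are illustrative assumptions, not part of the thesis experiments.

```python
# Minimal sketch of Bagging (steps 1-5): bootstrap, train, majority vote.
import random

random.seed(0)
data = [(x, 1 if x > 5 else -1) for x in range(10)]   # (feature, label)

def train_stump(sample):
    """Pick the threshold t that minimizes errors of 'predict 1 if x > t'."""
    best = None
    for t in range(11):
        err = sum(1 for x, y in sample if (1 if x > t else -1) != y)
        if best is None or err < best[0]:
            best = (err, t)
    return best[1]

B = 25
thresholds = []
for _ in range(B):                                    # B bootstrap rounds
    sample = [random.choice(data) for _ in data]      # sample n with replacement
    thresholds.append(train_stump(sample))

def bagged_predict(x):
    votes = sum(1 if x > t else -1 for t in thresholds)
    return 1 if votes > 0 else -1                     # majority vote H(x)
```

Each stump is trained on a different bootstrap sample, so their thresholds vary; the vote averages out individual sampling noise.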
AdaBoost was proposed by Yoav Freund and Robert Schapire in 1996. Unlike Bagging, AdaBoost trains its weak learners sequentially, requiring only that each weak learner have an error rate below 1/2, and reweights the training examples so that later learners concentrate on the examples misclassified so far (Figure 2.4). The procedure is as follows.
1. Let N be the number of training examples.
2. Let T be the number of weak learners.
3. Initialize the example weights to w_1i = 1/N.
4. In round t, train a weak learner h on the weighted data.
5. Compute the weighted misclassification rate of h.
6. From the error rate, compute the confidence alpha of h.
7. Update the example weights w_(t+1)i, increasing the weights of misclassified examples.
8. Repeat steps 4 to 7 for T rounds.
9. Output the strong learner: the vote of the weak learners weighted by their confidences alpha.

2.6 Random Forest

Random Forest was proposed by Leo Breiman in 2001 (Figure 2.5).
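The AdaBoost steps above can be sketched compactly. This is discrete AdaBoost with decision stumps on a toy one-dimensional dataset; the data, the stump form, and T = 3 rounds are illustrative assumptions.

```python
# Discrete AdaBoost (steps 1-9): weighted stump training, confidence alpha,
# exponential reweighting, and an alpha-weighted final vote.
import math

data = [(0, -1), (1, -1), (2, 1), (3, 1), (4, -1), (5, 1), (6, 1), (7, 1)]
N = len(data)
w = [1.0 / N] * N                 # step 3: uniform initial weights
T = 3                             # step 2: number of weak learners
learners = []                     # (threshold t, direction d, confidence alpha)

for _ in range(T):
    best = None                   # step 4: weighted-error-minimizing stump
    for t in range(8):
        for d in (1, -1):         # stump predicts d if x > t else -d
            err = sum(wi for wi, (x, y) in zip(w, data)
                      if (d if x > t else -d) != y)
            if best is None or err < best[0]:
                best = (err, t, d)
    err, t, d = best
    err = max(err, 1e-10)                        # guard against division by zero
    alpha = 0.5 * math.log((1 - err) / err)      # step 6: confidence
    learners.append((t, d, alpha))
    # step 7: up-weight misclassified examples, then renormalize
    w = [wi * math.exp(-alpha * y * (d if x > t else -d))
         for wi, (x, y) in zip(w, data)]
    s = sum(w)
    w = [wi / s for wi in w]

def strong(x):                    # step 9: alpha-weighted vote
    return 1 if sum(a * (d if x > t else -d) for t, d, a in learners) > 0 else -1

preds = [strong(x) for x, _ in data]
```

No single stump can separate this data, but after three boosting rounds the alpha-weighted combination classifies every training example correctly, illustrating why only error rate below 1/2 is required of each weak learner.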
[Figure 2.5 Flow of Random Forest: the training data are randomly sampled into B bootstrap sets, a decision tree is trained on each, and the results are combined by majority vote.]

Random Forest applies the idea of Bagging to decision trees:
1. Draw B bootstrap samples from the training data by random sampling.
2. Grow a decision tree on each bootstrap sample, selecting each split from a randomly chosen subset of the features.
3. Classify new data by majority vote over the B trees.

2.7 Cross Validation

Cross validation is used to evaluate the generalization ability of a classifier; Figure 2.6 illustrates the 8-fold case.
[Figure 2.6 8-fold cross validation: the training data are divided into 8 parts; in round i (i = 1, ..., 8), part i is used for evaluation and the remaining 7 parts for training.]

The procedure for a-fold cross validation is:
1. Divide the data into a parts.
2. Hold out one part for evaluation and train on the remaining parts.
3. Repeat step 2 so that each part is held out exactly once.
4. Average the a evaluation results.
5. Use the average of step 4 as the estimate of the classifier's performance.
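The cross-validation steps above can be sketched as follows, with k = 8 folds as in Figure 2.6. The toy data and the trivial majority-class "learner" are illustrative stand-ins for a real classifier.

```python
# k-fold cross validation: split, hold out each fold once, average the scores.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_fn, eval_fn):
    folds = k_fold_indices(len(data), k)
    scores = []
    for i in range(k):                              # round i: fold i evaluates
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i+1:] for j in f]
        model = train_fn(train)
        scores.append(eval_fn(model, test))
    return sum(scores) / k                          # average the k results

# toy example: a "classifier" that always predicts the majority class
data = [(x, 1 if x % 3 else -1) for x in range(16)]
train_fn = lambda tr: 1 if sum(y for _, y in tr) > 0 else -1
eval_fn = lambda m, te: sum(1 for _, y in te if y == m) / len(te)
score = cross_validate(data, 8, train_fn, eval_fn)
```

Every example is used for evaluation exactly once, so the averaged score uses all the data without ever testing on a training point.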
3 SPAM Discrimination Experiments

3.1 Experimental Environment

Table 3.1 shows the experimental environment.

Table 3.1 Experimental environment
OS       | Windows 7 Enterprise
CPU      | Intel(R) Core(TM) i5-2400S CPU @ 2.50GHz
Memory   | 4.00GB
Software | R x64 2.15.2, MeCab

3.2 Discrimination of English SPAM Mail

For English SPAM, the Spambase Data Set of the UCI Machine Learning Repository is used. It contains 4601 messages, each described by 58 attributes including the Spam-or-NonSpam label; 1813 messages are SPAM and 2788 are HAM. The number of training samples is varied from 500 to 4000 in steps of 500, with the remaining messages used for evaluation.
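The train/test protocol of Section 3.2 can be sketched as follows: draw a random training set of the desired size from the 4601 messages and keep the rest for testing. This is a minimal sketch of the splitting step only; the actual experiments were run in R.

```python
# Random train/test split over n message indices.
import random

def split(n, n_train, seed=0):
    """Return (train_indices, test_indices), disjoint and covering 0..n-1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]

train, test = split(4601, 500)   # smallest training size used in the thesis
```

Repeating this for n_train = 500, 1000, ..., 4000 reproduces the experimental sweep, with the test set shrinking as the training set grows.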
3.3 Discrimination of Japanese SPAM Mail

3.3.1 Corpus Creation

Figure 3.1 shows the procedure used to create the Japanese corpus.

[Figure 3.1 Corpus creation procedure: each mail (1.txt, 2.txt, 3.txt, ...) is morphologically analyzed with RMeCab and written out as a word-segmented .csv file; for every word, the number of documents it appears in (the DF value) and its total occurrence count are computed, and the top 10% of words are extracted as the corpus features.]

The corpus is created as follows.
1. Collect SPAM and HAM messages as text files, one message per file.
2. Morphologically analyze each message with RMeCab to segment it into words.
3. Write each segmented message out as a .csv file.
4. For each word, count the number of documents in which it appears (its document frequency, DF) over all messages.
5. Extract the top 10% of words by DF as the feature terms of the corpus.
6. Compute a TF-IDF feature value for each feature term in each document. Part of the resulting feature matrix is shown in Figure 3.2, and Table 3.2 summarizes the SPAM and HAM contents of the corpus.
[Figure 3.2 Part of the feature matrix: each row is a document ID (1, 10, 100, 101, ...) and each column a feature term such as サイト (site), 無料 (free), 様 (honorific), and 今回 (this time); each cell holds the term's TF-IDF value in that document (e.g. 15.73, 2.10, 11.36), with 0 where the term does not occur.]
3.3.2 Experimental Data

The Japanese corpus contains 1400 messages in total: 800 SPAM and 600 HAM. Of these, 1000 randomly selected messages are used for training and the remaining 400 for evaluation.
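The DF-based feature selection of Section 3.3.1 can be sketched as follows. Real use would segment Japanese mail with MeCab, so the pre-segmented toy documents below are illustrative.

```python
# Count each word's document frequency (DF) and keep the top 10% of the
# vocabulary as feature terms, as in steps 4-5 of the corpus procedure.
from collections import Counter

docs = [
    ["無料", "サイト", "登録"],
    ["会議", "資料", "送付"],
    ["無料", "登録", "今回"],
    ["無料", "サイト", "今回"],
]

df = Counter()
for doc in docs:
    df.update(set(doc))            # each word counts at most once per document

top_k = max(1, len(df) // 10)      # top 10% of the vocabulary (at least one)
features = [w for w, _ in df.most_common(top_k)]
```

Using a set per document is what distinguishes DF from a raw term count: a word repeated many times in one mail still contributes only one document to its DF.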
4 Results and Discussion

This chapter presents the results of the experiments described in Chapter 3.

4.1 Discrimination of English SPAM Mail

4.1.1 Results

Table 4.1 shows the discrimination rates of the six algorithms for English SPAM.

Table 4.1 Discrimination rates of the 6 algorithms for English SPAM
training | Bayes     | NN        | SVM       | Bagging   | AdaBoost  | RF
500      | 0.6559376 | 0.919288  | 0.8973421 | 0.9051451 | 0.9053889 | 0.9256279
1000     | 0.6842544 | 0.9297417 | 0.9050264 | 0.9122466 | 0.9139128 | 0.9394613
1500     | 0.6875202 | 0.9245405 | 0.9129313 | 0.9068043 | 0.9090616 | 0.9422767
2000     | 0.6885813 | 0.9300269 | 0.9200308 | 0.9073433 | 0.9161861 | 0.9423299
2500     | 0.7029986 | 0.9324131 | 0.9281295 | 0.9033793 | 0.9219419 | 0.9433603
3000     | 0.7083073 | 0.9394129 | 0.9312929 | 0.9050593 | 0.9125547 | 0.9437851
3500     | 0.7111717 | 0.9445958 | 0.9300636 | 0.9064487 | 0.9218892 | 0.9491371
4000     | 0.7304493 | 0.9351082 | 0.9267887 | 0.9018303 | 0.9301165 | 0.9450915
[Figure 4.1 Discrimination rates for English SPAM plotted against the number of training samples (500 to 4000 in steps of 500) for Bayes, NN, SVM, Bagging, AdaBoost, and RF; an enlarged version is given in Fig. A.1.]

As the number of training samples increases from 500 to 4000 in steps of 500, the discrimination rate of every algorithm tends to improve. NN and RF achieve the highest rates, with Random Forest ranking first at most training sizes.
Figure 4.2 shows the discrimination rates of the SVM with 8 different kernel functions, and Figure 4.3 plots them.

[Figure 4.2 Discrimination rates of SVM with 8 kernels for English SPAM]
training | Gaussian    | linear      | polynomial  | tangent     | Laplacian   | Bessel      | ANOVA       | spline
500      | 0.891733723 | 0.9029505   | 0.86100951  | 0.799804926 | 0.907583516 | 0.545476713 | 0.911241161 | 0.844672031
1000     | 0.910024993 | 0.90863649  | 0.883365732 | 0.777006387 | 0.9130797   | 0.493474035 | 0.926409331 | 0.860038878
1500     | 0.921315705 | 0.917445985 | 0.882296034 | 0.790712673 | 0.922605611 | 0.546597872 | 0.930667527 | 0.862947436
2000     | 0.9242599   | 0.924644368 | 0.894655902 | 0.796232218 | 0.92272203  | 0.517877739 | 0.931180315 | 0.902729719
2500     | 0.926701571 | 0.924321752 | 0.899095669 | 0.74631128  | 0.925749643 | 0.540218943 | 0.938124703 | 0.910042837
3000     | 0.931917552 | 0.931292942 | 0.90131168  | 0.797001874 | 0.927545284 | 0.552779513 | 0.926296065 | 0.555902561
3500     | 0.934604905 | 0.924613987 | 0.905540418 | 0.792915531 | 0.920980926 | 0.522252498 | 0.933696639 | 0.821071753
4000     | 0.931780366 | 0.916805324 | 0.908485857 | 0.792013311 | 0.920133111 | 0.510815308 | 0.93344426  | 0.821071753

[Figure 4.3 Discrimination rates of SVM with 8 kernels for English SPAM, plotted against the number of training samples; an enlarged version is given in Fig. A.2.]
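Assuming the standard definitions, some of the kernel functions compared above can be written as plain functions; the parameter values (sigma, degree) are illustrative, not those used in the experiments.

```python
# Plain-Python sketches of four of the SVM kernel functions.
import math

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, degree=2, c=1.0):
    return (linear(x, y) + c) ** degree

def gaussian(x, y, sigma=1.0):          # the RBF kernel
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def laplacian(x, y, sigma=1.0):
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist / sigma)

k = gaussian([1.0, 0.0], [1.0, 0.0])    # identical points give k = 1.0
```

Each kernel plays the role of an inner product in a transformed feature space, which is how the linear SVM of Section 2.3 obtains the nonlinear decision boundaries compared here.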
4.1.2 Discussion

For every algorithm the discrimination rate tends to rise as the training data increase, but not monotonically, and the six algorithms peaked near the largest training sizes. Since the whole data set contains 4601 messages, an additional experiment was therefore run with the number of training samples varied from 3500 to 4000 in steps of 50. The best rates observed in this range were: the Naive Bayes classifier 74.5% at 3950 training samples, NN 95.0% and Random Forest 96.3% at 3800, the SVM (RBF kernel) 94.1% at 3600, and Bagging 92.2% and AdaBoost 94.3% at 3500.
[Figure 4.4 Discrimination rates for English SPAM with 3500 to 4000 training samples (step 50)]
training | Bayes       | NN          | SVM         | Bagging     | AdaBoost    | RF
3500     | 0.713896458 | 0.930971844 | 0.933696639 | 0.921889192 | 0.942779292 | 0.959128065
3550     | 0.703139867 | 0.941960038 | 0.917221694 | 0.900095147 | 0.917221694 | 0.939105614
3600     | 0.732267732 | 0.946053946 | 0.941058941 | 0.911088911 | 0.93006993  | 0.947052947
3650     | 0.703470032 | 0.936908517 | 0.930599369 | 0.904311251 | 0.923238696 | 0.954784437
3700     | 0.720310766 | 0.933407325 | 0.937846837 | 0.905660377 | 0.92563818  | 0.955604883
3750     | 0.722679201 | 0.937720329 | 0.929494712 | 0.902467685 | 0.92479436  | 0.945945946
3800     | 0.696629213 | 0.950062422 | 0.937578027 | 0.913857678 | 0.936329588 | 0.962546816
3850     | 0.711051931 | 0.917443409 | 0.934753662 | 0.909454061 | 0.925432756 | 0.954727031
3900     | 0.697574893 | 0.932952924 | 0.92296719  | 0.910128388 | 0.915834522 | 0.938659058
3950     | 0.74500768  | 0.935483871 | 0.937019969 | 0.894009217 | 0.913978495 | 0.950844854
4000     | 0.727121464 | 0.936772047 | 0.935108153 | 0.91014975  | 0.913477537 | 0.951747088

[Figure 4.5 Discrimination rates for English SPAM with 3500 to 4000 training samples, plotted for Bayes, NN, SVM, Bagging, AdaBoost, and RF; an enlarged version is given in Fig. A.3.]
Among the 8 SVM kernels, the Bessel kernel stays near 50% regardless of training size, and the spline kernel, which exceeds 90% at 2500 training samples, drops to about 55% at 3000 before recovering at 3500. The Gaussian, linear, polynomial, Laplacian, and ANOVA kernels approach or exceed 90% with enough training data, the hyperbolic tangent kernel remains below 80%, and the ANOVA kernel gives the best rates overall.

4.2 Discrimination of Japanese SPAM Mail

4.2.1 Results

Table 4.2 shows the discrimination rates with 1000 training samples for Japanese SPAM alongside the English results under the same conditions, and Figure 4.6 plots them. As with English mail, the neural network and Random Forest achieve high rates.
Table 4.2 Discrimination rates with 1000 training samples
algorithm | English | Japanese
Bayes     | 0.684   | 0.422
NN        | 0.929   | 0.785
SVM       | 0.926   | 0.795
Bagging   | 0.912   | 0.778
AdaBoost  | 0.913   | 0.77
RF        | 0.939   | 0.793
4.2.2 Discussion

For Japanese SPAM, every algorithm scores 13 to 15 percentage points lower than on English mail, except the Naive Bayes classifier, which degrades far more severely. The SVM gives the highest Japanese rate, exceeding Random Forest by only 0.002.

[Figure 4.6 Discrimination rates with 1000 training samples for English and Japanese SPAM, by machine learning method.]
5 Conclusion

In this thesis, six machine learning algorithms were compared on the task of SPAM discrimination. For English mail, the Spambase Data Set of the University of California, Irvine (UCI) Machine Learning Repository was used, with training sizes varied from 500 upward; Random Forest achieved the highest discrimination rates, and among the 8 SVM kernels the ANOVA kernel performed best. For Japanese mail, an original corpus was built and 1000 training samples were used; the SVM achieved the highest rate. Compared with English SPAM, the discrimination rates for Japanese SPAM were considerably lower, which remains an issue for future work.
References

[1] (Japanese source), vol. 45, September 2010.
[2] Nello Cristianini and John Shawe-Taylor: An Introduction to Support Vector Machines (Japanese translation), p. 9.
[3] (Japanese textbook on data analysis with R), p. 251.
Appendix A Enlarged Figures

[Figure A.1 Enlarged version of Figure 4.1: discrimination rates (NN, SVM, Bagging, AdaBoost, RF) for 500 to 4000 training samples.]

[Figure A.2 Enlarged version of Figure 4.3: discrimination rates of the SVM kernels (Gaussian, linear, polynomial, tangent, ANOVA) for 500 to 4000 training samples.]

[Figure A.3 Enlarged version of Figure 4.5: discrimination rates (NN, SVM, Bagging, AdaBoost, RF) for 3500 to 4000 training samples.]