Performance Comparison of Machine Learning Algorithms for SPAM Discrimination


Performance Comparison of Machine Learning Algorithms for SPAM Discrimination

1130378
March 9, 2013

Summary

In this thesis, the Naive Bayes classifier, neural network, support vector machine (SVM), Bagging, AdaBoost, and Random Forest are applied to the discrimination of SPAM mail in English and in Japanese, and their performance is compared. For English mail, the Spambase dataset (4601 mails) of the UCI Machine Learning Repository is used, with the number of training data varied from 500 to 4000. For Japanese mail, an original corpus of 1400 mails is constructed, of which 1000 are used as training data. For English SPAM, the five algorithms other than the Naive Bayes classifier achieve discrimination rates of 90.6–94.9 %. For Japanese SPAM, the Naive Bayes classifier achieves 42.8 %, while the other algorithms achieve 77.0–79.5 %.

Abstract

Performance Comparison of Machine Learning Algorithms for SPAM Discrimination

Fujimori Natsuki

Many machine learning techniques can be applied to classification, and a quantitative comparison between them is required. In this research, the Naive Bayes classifier, neural network, support vector machine (SVM), Bagging, AdaBoost, and Random Forest are applied to the classification of e-mail written in English and in Japanese in order to filter out SPAM mail. The algorithms are compared with each other from the viewpoint of classification precision. For English e-mail classification, the Spambase dataset of the UCI Machine Learning Repository is used. The total number of data is 4601; between 500 and 4000 randomly selected data are used for training, and the rest are used as test data. For Japanese e-mail classification, an original corpus is created and used. The total number of data is 1400, of which 1000 randomly selected data are used for training. Discrimination is performed under the same conditions for every algorithm so that precision can be compared. As a result, for English SPAM discrimination, all algorithms except the Naive Bayes classifier achieve a precision exceeding 90 %. Moreover, for Japanese SPAM, the Naive Bayes classifier achieved a discrimination rate of 42.8 %, while 77–79 % was obtained by the other algorithms.

Key words: SPAM, machine learning, Naive Bayes classifier, neural network, support vector machine, Bagging, AdaBoost, Random Forest

Contents

1 Introduction 1
2 Machine Learning Algorithms 3
  2.1 Naive Bayes Classifier 3
  2.2 Neural Network 5
  2.3 Support Vector Machine 6
  2.4 Bagging 7
  2.5 AdaBoost 8
  2.6 Random Forest 9
  2.7 Cross Validation 10
3 SPAM Discrimination Experiments 12
  3.1 Experimental Environment 12
  3.2 English SPAM Data 12
  3.3 Japanese SPAM Data 13
    3.3.1 Corpus Construction 13
    3.3.2 Data Split 16
4 Results and Discussion 17
  4.1 English SPAM Discrimination 17
    4.1.1 Results 17
    4.1.2 Discussion 20
  4.2 Japanese SPAM Discrimination 22
    4.2.1 Results 22
    4.2.2 Discussion 23
5 Conclusion 25
Acknowledgments 26
References 28
A Enlarged Figures 29

List of Figures

2.1 Structure of a neural network 5
2.2 Support vector machine 6
2.3 Bagging 7
2.4 AdaBoost 8
2.5 Random Forest 10
2.6 Cross validation with 8 folds 11
3.1 Flow of corpus construction 13
3.2 Part of the feature table 15
4.1 Discrimination rates for English SPAM 18
4.2 Discrimination rates of SVM with 8 kernels for English SPAM 19
4.3 Discrimination rates of SVM with 8 kernels for English SPAM (plot) 19
4.4 Discrimination rates for 3500–4000 training data on English SPAM 21
4.5 Discrimination rates for 3500–4000 training data on English SPAM (plot) 21
4.6 Discrimination rates with 1000 training data for Japanese SPAM 24
A.1 Enlarged view of Figure 4.1 29
A.2 Enlarged view of Figure 4.3 30
A.3 Enlarged view of Figure 4.5 30

List of Tables

2.1 Word occurrences in 100 mails 4
2.2 Probabilities 4
3.1 Experimental environment 12
3.2 … 14
4.1 Discrimination rates of the six algorithms for English SPAM 17
4.2 Discrimination rates with 1000 training data 23

1 Introduction

According to [1], SPAM accounted for 92.51 % of all e-mail traffic in 2010, and discriminating SPAM from legitimate mail remains an important practical problem. In this thesis, six machine learning algorithms are applied to SPAM discrimination and their performance is compared quantitatively.

The six algorithms are the Naive Bayes classifier (Bayes), neural network (NN), support vector machine (SVM), Bagging, AdaBoost, and Random Forest (RF). For English mail, the Spambase Data Set, collected at Hewlett-Packard Labs by Mark Hopkins et al. and distributed through the UCI Machine Learning Repository, is used, with the number of training data varied from 500 to 4000 in steps of 500. For Japanese mail, an original corpus of 1400 mails (SPAM and HAM) is constructed, features are weighted by TF-IDF, and 1000 mails are used as training data. For English SPAM, the five algorithms other than Bayes achieve discrimination rates above 90 %; for Japanese SPAM they achieve 75–79 %, while Bayes achieves 42.2 %.

The rest of this thesis is organized as follows. Chapter 2 describes the six machine learning algorithms, Chapter 3 the SPAM data and experimental setup, and Chapter 4 the discrimination results.

2 Machine Learning Algorithms

This chapter describes the six machine learning algorithms used for SPAM discrimination.

2.1 Naive Bayes Classifier

The Naive Bayes classifier is based on Bayes' theorem:

    P(B|A) = P(A|B) P(B) / P(A)    (2.1)

where P(A) and P(B) are the prior probabilities of events A and B, and P(B|A) is the conditional probability of B given that A has occurred.

As an example, suppose 100 mails are given, of which 40 are SPAM and 60 are HAM (legitimate mail). Let B_S denote the event that a mail is SPAM and B_H the event that it is HAM; the priors are P(B_S) = 40/100 = 0.4 and P(B_H) = 60/100 = 0.6. Table 2.1 shows how many of the 100 mails contain each word.

Table 2.1: Word occurrences in 100 mails
    word      SPAM   HAM
    word 1       5    11
    word 2       3    14
    word 3      13     2
    word 4      20     0

Let A be the event that a mail contains word 1. Then P(A) = (5+11)/100 = 0.16, P(A|B_S) = 5/40 = 0.125, and P(A|B_H) = 11/60 = 0.1833. Table 2.2 summarizes these values.

Table 2.2: Probabilities for a mail containing word 1
              SPAM     HAM
    P(B)      0.4      0.6
    P(A)      0.16     0.16
    P(A|B)    0.125    0.1833

Substituting into Eq. (2.1), P(SPAM) = 0.125 × 0.4 / 0.16 = 0.3125 and P(HAM) = 0.1833 × 0.6 / 0.16 = 0.687. If P(SPAM) > P(HAM) the mail is judged to be SPAM; here P(SPAM) < P(HAM), so this mail is judged to be HAM. Naive Bayes filtering of this kind is used in mail clients such as Mozilla Thunderbird.
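The worked example above can be sketched in code. The thesis experiments used R; the following is an illustrative Python sketch of the same computation, using only the counts of Table 2.1.

```python
# An illustrative Python sketch of the worked example above: posteriors
# for a mail containing word 1, from the counts of Table 2.1.

def naive_bayes_posterior(term_spam, term_ham, n_spam, n_ham):
    """Return (P(SPAM|A), P(HAM|A)) for a mail containing the word."""
    n_total = n_spam + n_ham
    p_spam, p_ham = n_spam / n_total, n_ham / n_total   # P(B_S), P(B_H)
    p_a = (term_spam + term_ham) / n_total              # P(A)
    p_a_spam = term_spam / n_spam                       # P(A|B_S)
    p_a_ham = term_ham / n_ham                          # P(A|B_H)
    return p_a_spam * p_spam / p_a, p_a_ham * p_ham / p_a

p_spam_a, p_ham_a = naive_bayes_posterior(5, 11, 40, 60)
label = "SPAM" if p_spam_a > p_ham_a else "HAM"
print(round(p_spam_a, 4), round(p_ham_a, 4), label)  # 0.3125 0.6875 HAM
```

Note that the two posteriors share the same denominator P(A), so comparing the numerators P(A|B) P(B) alone would give the same decision.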

2.2 Neural Network

[Figure 2.1: structure of a neural network — input layer (units x_1, …, x_i, …, x_I), hidden layer (H units), output layer (K units); w_ih are the input-to-hidden weights and w_hk the hidden-to-output weights.]

The neural network used here is a feed-forward network trained by the error backpropagation algorithm proposed by David E. Rumelhart et al. in 1986. For a network with one hidden layer (Figure 2.1), the k-th output is given by [3]

    y_k = φ_0(α_k + Σ_h w_hk φ_h(α_h + Σ_i w_ih x_i))    (2.2)

where φ is an activation function and α a bias term.
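Eq. (2.2) can be sketched directly in code. The thesis used R; the following illustrative Python sketch assumes a logistic sigmoid for both φ_h and φ_0, and all weights below are made-up values, not taken from the thesis.

```python
import math

# An illustrative forward pass for Eq. (2.2), with a logistic sigmoid as
# both the hidden activation phi_h and the output activation phi_0.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_ih, b_h, w_hk, b_k):
    """y_k = phi_0(alpha_k + sum_h w_hk phi_h(alpha_h + sum_i w_ih x_i))."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(w_ih, b_h)]          # one row per hidden unit
    return [sigmoid(b + sum(w * h for w, h in zip(row, hidden)))
            for row, b in zip(w_hk, b_k)]            # one row per output unit

# 2 inputs -> 2 hidden units -> 1 output
y = forward([1.0, 0.0],
            w_ih=[[0.5, -0.3], [0.2, 0.8]], b_h=[0.0, 0.1],
            w_hk=[[1.0, -1.0]], b_k=[0.0])
print(0.0 < y[0] < 1.0)  # True: a sigmoid output always lies in (0, 1)
```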

The weights are trained by the following procedure.

1. Initialize the weights with random values.
2. Update each weight by

       w(j+1) = w(j) + η δR    (2.3)

   where w(j+1) is the weight at iteration j+1, w(j) the weight at iteration j, η the learning rate, and δR the error term propagated back from the output.
3. Repeat step 2 until the error converges.

2.3 Support Vector Machine

[Figure 2.2: a support vector machine separating two classes in the (x_1, x_2) plane; the samples nearest the separating boundary are the support vectors, and the distance between them and the boundary is the margin.]

The support vector machine (SVM) was proposed by Vladimir N. Vapnik et al. in 1992 [2]. As shown in Figure 2.2, an SVM separates two classes by the boundary that maximizes the margin; the decision boundary is determined only by the support vectors.

2.4 Bagging

[Figure 2.3: Bagging — the training data are split into B bootstrap samples, a weak learner is trained on each, and the final result is output by majority vote.]

Bagging was proposed by Leo Breiman in 1996.

Bagging draws bootstrap samples from the training data and proceeds as follows (Figure 2.3).

1. Draw n data at random, with replacement, from the training set.
2. Use the drawn data as a bootstrap training set of size m.
3. Train a weak learner h on it.
4. Repeat steps 1–3 B times to obtain B weak learners {h_i | i = 1, 2, …, B}.
5. Combine the weak learners of step 4 by majority vote:

       H(x) = arg max_y |{i | h_i(x) = y}|

2.5 AdaBoost

[Figure 2.4: AdaBoost — the training data are split into training and test sets; weak learners 1 through T are built in sequence with sample weights w_i, each round computing a misclassification rate and a confidence α and updating the weights; the α-weighted majority vote of the T weak learners forms the strong learner.]
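The five Bagging steps above can be sketched as follows. This is an illustrative Python sketch (the thesis used R); the trivial "weak learner", which simply predicts the majority label of its own bootstrap sample, and the 70/30 label mix are made-up choices to keep the example self-contained.

```python
import random
from collections import Counter

# An illustrative sketch of Bagging: B bootstrap rounds, then a majority
# vote over the resulting weak learners.

def train_weak_learner(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging(train_data, B, rng):
    n = len(train_data)
    learners = []
    for _ in range(B):                       # steps 1-4: B bootstrap rounds
        sample = [rng.choice(train_data) for _ in range(n)]  # with replacement
        learners.append(train_weak_learner(sample))
    def H(x):                                # step 5: majority vote H(x)
        return Counter(h(x) for h in learners).most_common(1)[0][0]
    return H

rng = random.Random(0)
data = [(i, "SPAM") for i in range(70)] + [(i, "HAM") for i in range(30)]
H = bagging(data, B=11, rng=rng)
print(H(0))
```

In practice the weak learners are real classifiers (e.g. decision trees); only the bootstrap-and-vote structure matters here.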

AdaBoost was proposed by Yoav Freund and Robert Schapire in 1996. Whereas Bagging trains its weak learners independently, AdaBoost trains them sequentially, reweighting the data so that each weak learner concentrates on the examples misclassified so far; each weak learner only needs an error rate below 1/2. The procedure is as follows (Figure 2.4).

1. Prepare the N training data.
2. Fix the number of weak learners T.
3. Initialize the data weights to w_ti = 1/N.
4. In round t, train a weak learner on the weighted data.
5. Compute its misclassification rate.
6. From the misclassification rate of weak learner h, compute its confidence α.
7. Update the data weights w_(t+1)i.
8. Repeat steps 4–7 for T rounds.
9. Output the strong learner: the α-weighted majority vote of the T weak learners.

2.6 Random Forest

Random Forest was proposed by Leo Breiman in 2001 (Figure 2.5).
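Steps 4–7 of one AdaBoost round can be sketched as follows. This is an illustrative Python sketch (the thesis used R); labels are +1/−1, and the toy predictions, in which the weak learner errs only on sample 2, are made up.

```python
import math

# An illustrative sketch of one AdaBoost round: weighted error, learner
# confidence alpha, and the weight update that emphasizes mistakes.

def adaboost_round(weights, y_true, y_pred):
    """Return (alpha, renormalized weights); assumes 0 < error < 1."""
    eps = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - eps) / eps)        # confidence of the learner
    new_w = [w * math.exp(-alpha * t * p)          # raise weight of mistakes
             for w, t, p in zip(weights, y_true, y_pred)]
    z = sum(new_w)
    return alpha, [w / z for w in new_w]           # renormalize to sum to 1

N = 4
w0 = [1.0 / N] * N               # step 3: uniform initial weights
y_true = [+1, +1, -1, -1]
y_pred = [+1, -1, -1, -1]        # the weak learner misclassifies sample 2
alpha, w1 = adaboost_round(w0, y_true, y_pred)
print(alpha > 0, round(sum(w1), 6), w1[1] > w1[0])  # True 1.0 True
```

After the update, the misclassified sample carries more weight than the correctly classified ones, which is exactly what forces the next weak learner to focus on it.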

[Figure 2.5: Random Forest — the training data are split into B bootstrap samples by random sampling, a weak learner (a decision tree) is trained on each, and the result is output by majority vote.]

Random Forest combines bootstrap sampling with random feature selection, so that the individual trees remain decorrelated.

1. Create B bootstrap samples from the training data.
2. Grow a decision tree on each sample, choosing each split from a random subset of the features.
3. Output the class decided by majority vote of the B trees.

2.7 Cross Validation

Cross validation evaluates a classifier by repeatedly holding out part of the data; Figure 2.6 illustrates the case of 8 folds.
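The sampling that distinguishes Random Forest from plain Bagging can be sketched as follows. This is an illustrative Python sketch (the thesis used R); the tree-growing itself is omitted, and all sizes below are made-up examples except 57, the number of Spambase attributes excluding the label.

```python
import random

# An illustrative sketch of Random Forest's input sampling: each of the
# B trees sees a bootstrap sample of the rows AND a random subset of the
# feature indices.

def random_forest_inputs(n_rows, n_features, B, n_sub, rng):
    """Yield (bootstrap row indices, feature subset) for each of the B trees."""
    for _ in range(B):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
        feats = rng.sample(range(n_features), n_sub)           # distinct features
        yield rows, feats

rng = random.Random(0)
for rows, feats in random_forest_inputs(n_rows=10, n_features=57, B=3,
                                        n_sub=7, rng=rng):
    print(len(rows), len(feats), len(set(feats)) == len(feats))  # 10 7 True
```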

[Figure 2.6: 8-fold cross validation — the training data are split into 8 blocks; in round i, block i is the evaluation data and the remaining 7 blocks are the training data.]

1. Split the data into a blocks.
2. Use one block as evaluation data and the remaining blocks as training data.
3. Train the classifier and evaluate it on the held-out block.
4. Repeat steps 2–3 until each of the a blocks has served once as evaluation data.
5. Average the a evaluation results of step 4.
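The fold construction above can be sketched as follows, with a = 8 as in Figure 2.6. This is an illustrative Python sketch (the thesis used R), and n = 80 is an example value.

```python
# An illustrative sketch of a-fold cross validation index splitting: each
# round holds out one contiguous block for evaluation and trains on the rest.

def kfold_indices(n, a):
    """Yield (training indices, evaluation indices) for each of the a rounds."""
    fold = n // a                 # assumes n is divisible by a, for brevity
    for i in range(a):
        eval_idx = list(range(i * fold, (i + 1) * fold))
        held_out = set(eval_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        yield train_idx, eval_idx

splits = list(kfold_indices(n=80, a=8))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 8 70 10
```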

3 SPAM Discrimination Experiments

3.1 Experimental Environment

Table 3.1 shows the hardware and software environment.

Table 3.1: Experimental environment
    OS        Windows 7 Enterprise
    CPU       Intel(R) Core(TM) i5-2400S CPU @ 2.50GHz
    Memory    4.00 GB
    Software  R x64 2.15.2, MeCab

3.2 English SPAM Data

For English mail, the Spambase Data Set of the UCI Machine Learning Repository is used. It contains 4601 instances, each labeled Spam or NonSpam and described by 58 attributes; 1813 instances are SPAM and 2788 are HAM. The number of training data is varied from 500 to 4000 in steps of 500, with the remainder used as test data.

3.3 Japanese SPAM Data

3.3.1 Corpus Construction

For Japanese mail, an original corpus was constructed. Figure 3.1 shows the flow.

[Figure 3.1: corpus construction — each mail text (1.txt, 2.txt, 3.txt, …) is morphologically analyzed with RMeCab and written out as space-separated words (.csv files); for every word, the number of documents it appears in (its DF value) and its total frequency are computed, and the top 10 % of words are extracted as corpus features.]

1. Collect the SPAM and HAM mail texts.
2. Morphologically analyze each text with RMeCab.
3. Write each text out as space-separated words in a .csv file.

4. Compute each word's document frequency (DF), the number of documents in which it appears.
5. Compute each word's total frequency over all documents.
6. Extract as corpus features the words in the top 10 % by DF (step 4) and the top 10 % by total frequency (step 5).

Table 3.2 shows part of the resulting corpus; a type column records whether each mail is SPAM or HAM. Figure 3.2 shows part of the feature table.
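Steps 4–6 can be sketched as follows. This is an illustrative Python sketch (the thesis used R with RMeCab); the three toy "mails" and the alphabetical tie-breaking rule are assumptions, not from the thesis.

```python
from collections import Counter

# An illustrative sketch of DF-based feature selection: compute each
# word's document frequency (DF) and total frequency, then keep the top
# 10 % of words as features.

def select_features(docs, ratio=0.10):
    df = Counter()      # number of documents containing the word
    total = Counter()   # total occurrences over all documents
    for words in docs:
        total.update(words)
        df.update(set(words))          # DF counts each document only once
    k = max(1, int(len(df) * ratio))
    ranked = sorted(df, key=lambda w: (-df[w], -total[w], w))
    return ranked[:k]

docs = [["free", "site", "free"], ["free", "now"], ["site", "now", "now"]]
print(select_features(docs))  # ['free']
```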

[Figure 3.2: part of the feature table — rows are mail IDs and columns are extracted words such as サイト (site), 無料 (free), 様 (Mr./Ms.), 今回 (this time); each cell holds the word's TF-IDF weight in that mail, and most cells are zero.]

3.3.2 Data Split

The Japanese corpus contains 1400 mails in total: 800 SPAM and 600 HAM. Of these, 1000 randomly selected mails are used as training data and the remaining 400 as test data.

4 Results and Discussion

This chapter reports the results of the experiments described in Chapter 3.

4.1 English SPAM Discrimination

4.1.1 Results

Table 4.1: Discrimination rates of the six algorithms for English SPAM

Training data  Bayes      NN         SVM        Bagging    AdaBoost   RF
500            0.6559376  0.919288   0.8973421  0.9051451  0.9053889  0.9256279
1000           0.6842544  0.9297417  0.9050264  0.9122466  0.9139128  0.9394613
1500           0.6875202  0.9245405  0.9129313  0.9068043  0.9090616  0.9422767
2000           0.6885813  0.9300269  0.9200308  0.9073433  0.9161861  0.9423299
2500           0.7029986  0.9324131  0.9281295  0.9033793  0.9219419  0.9433603
3000           0.7083073  0.9394129  0.9312929  0.9050593  0.9125547  0.9437851
3500           0.7111717  0.9445958  0.9300636  0.9064487  0.9218892  0.9491371
4000           0.7304493  0.9351082  0.9267887  0.9018303  0.9301165  0.9450915

The number of training data was varied from 500 to 4000 in steps of 500; Figure 4.1 plots the discrimination rates of Table 4.1.

[Figure 4.1: discrimination rate vs. number of training data (500–4000) for the six algorithms on English SPAM; legend: Bayes, NN, SVM, Bagging, AdaBoost, RF.]

Throughout this range NN and Random Forest give high discrimination rates, with Random Forest the highest at every training size.

Figure 4.2 shows the discrimination rates of SVM with 8 different kernel functions for English SPAM. The plotted values are:

Training data  Gaussian     Linear       Polynomial   Tangent      Laplacian    Bessel       ANOVA        Spline
500            0.891733723  0.9029505    0.86100951   0.799804926  0.907583516  0.545476713  0.911241161  0.844672031
1000           0.910024993  0.90863649   0.883365732  0.777006387  0.9130797    0.493474035  0.926409331  0.860038878
1500           0.921315705  0.917445985  0.882296034  0.790712673  0.922605611  0.546597872  0.930667527  0.862947436
2000           0.9242599    0.924644368  0.894655902  0.796232218  0.92272203   0.517877739  0.931180315  0.902729719
2500           0.926701571  0.924321752  0.899095669  0.74631128   0.925749643  0.540218943  0.938124703  0.910042837
3000           0.931917552  0.931292942  0.90131168   0.797001874  0.927545284  0.552779513  0.926296065  0.555902561
3500           0.934604905  0.924613987  0.905540418  0.792915531  0.920980926  0.522252498  0.933696639  0.821071753
4000           0.931780366  0.916805324  0.908485857  0.792013311  0.920133111  0.510815308  0.93344426   0.821071753

[Figure 4.3: the same rates plotted against the number of training data for the 8 SVM kernels.]

4.1.2 Discussion

Of the six algorithms, Random Forest gave the highest discrimination rate. To locate each algorithm's peak, the number of training data was then varied from 3500 to 4000 in steps of 50, the remainder of the 4601 instances serving as test data (Figures 4.4 and 4.5). The peak rates were 74.5 % at 3950 training data for Bayes, 95.0 % at 3800 for NN, 94.1 % at 3600 for SVM, 92.2 % at 3500 for Bagging, 94.3 % at 3500 for AdaBoost, and 96.3 % at 3800 for Random Forest.

Figure 4.4 shows the discrimination rates of the six algorithms for 3500–4000 training data on English SPAM. The plotted values are:

Training data  Bayes        NN           SVM          Bagging      AdaBoost     RF
3500           0.713896458  0.930971844  0.933696639  0.921889192  0.942779292  0.959128065
3550           0.703139867  0.941960038  0.917221694  0.900095147  0.917221694  0.939105614
3600           0.732267732  0.946053946  0.941058941  0.911088911  0.93006993   0.947052947
3650           0.703470032  0.936908517  0.930599369  0.904311251  0.923238696  0.954784437
3700           0.720310766  0.933407325  0.937846837  0.905660377  0.92563818   0.955604883
3750           0.722679201  0.937720329  0.929494712  0.902467685  0.92479436   0.945945946
3800           0.696629213  0.950062422  0.937578027  0.913857678  0.936329588  0.962546816
3850           0.711051931  0.917443409  0.934753662  0.909454061  0.925432756  0.954727031
3900           0.697574893  0.932952924  0.92296719   0.910128388  0.915834522  0.938659058
3950           0.74500768   0.935483871  0.937019969  0.894009217  0.913978495  0.950844854
4000           0.727121464  0.936772047  0.935108153  0.91014975   0.913477537  0.951747088

[Figure 4.5: the same rates plotted against the number of training data (3500–4000).]

Among the 8 SVM kernels, the Bessel kernel stayed around 50 %, and the spline kernel, which had reached 91.0 % at 2500 training data, dropped to 55.6 % at 3000, while the ANOVA kernel was the best of the 8 and exceeded 90 %.

4.2 Japanese SPAM Discrimination

4.2.1 Results

Table 4.2 and Figure 4.6 show the discrimination rates with 1000 training data on the Japanese corpus, alongside the English results under the same conditions.

Table 4.2: Discrimination rates with 1000 training data

Algorithm   English   Japanese
Bayes       0.684     0.422
NN          0.929     0.785
SVM         0.926     0.795
Bagging     0.912     0.778
AdaBoost    0.913     0.77
RF          0.939     0.793

4.2.2 Discussion

[Figure 4.6: discrimination rates with 1000 training data for English and Japanese mail, by machine learning method.]

Every algorithm discriminates Japanese SPAM less accurately than English SPAM. For Japanese SPAM, SVM gives the highest rate (0.795), narrowly ahead of Random Forest (0.793).

5 Conclusion

In this thesis, six machine learning algorithms were compared on the task of SPAM discrimination. For English SPAM, the Spambase Data Set of the University of California, Irvine (UCI) Machine Learning Repository was used with training data from 500 to 4000 mails; Random Forest achieved the best discrimination rate, and among the 8 SVM kernels the ANOVA kernel performed best. For Japanese SPAM, an original corpus was used with 1000 training data, and SVM achieved the best discrimination rate.

Acknowledgments

References

[1] …, 45, September 2010 (in Japanese).
[2] Nello Cristianini and John Shawe-Taylor, …, p. 9 (Japanese translation).
[3] …, R…, p. 251 (in Japanese).

A Enlarged Figures

[Figure A.1: enlarged view of Figure 4.1 — discrimination rate vs. number of training data (500–4000) for NN, SVM, Bagging, AdaBoost, and RF.]

[Figure A.2: enlarged view of Figure 4.3 — discrimination rate vs. number of training data for the Gaussian, linear, polynomial, tangent, and ANOVA kernels.]

[Figure A.3: enlarged view of Figure 4.5 — discrimination rate vs. number of training data (3500–4000) for NN, SVM, Bagging, AdaBoost, and RF.]