JAIST Reposi
https://dspace.j

Title: A Study on Detecting Obfuscated JavaScript for Predicting Drive-by-Download Attacks (Drive-by-Download 攻撃予測のための難読化 JavaScript の検知に関する研究)
Author(s): 本田, 仁
Citation:
Issue Date: 2016-03
Type: Thesis or Dissertation
Text version: author
URL: http://hdl.handle.net/10119/13608
Rights:
Description: Supervisor: 面和成, School of Information Science, Master's degree

Japan Advanced Institute of Science and
Drive-by-Download JavaScript 1410039 28 2
Drive-by-Download (DbD ) DbD web web DbD JavaScript DbD JavaScript JavaScript JavaScript Support Vector Machine JavaScript JavaScript JavaScript 3 D3M dataset bigram
1
  1.1
  1.2
  1.3
2
  2.1 Drive-by-Download
  2.2 JavaScript
  2.3 D3M dataset
  2.4 K-Fold Cross Validation
  2.5
    2.5.1 Support Vector Machine (SVM)
    2.5.2 Naive Bayes
    2.5.3 k-nearest Neighbor
    2.5.4 Decision Tree
    2.5.5 Random Forest
  2.6
    2.6.1 Python[10]
    2.6.2 scikit-learn[11]
    2.6.3 HTMLParser[4]
  2.7
  2.8 unigram, bigram
3
  3.1
  3.2
    3.2.1 XZZ 13[12]
    3.2.2 JCB 14[7]
  3.3
    3.3.1 LJJ 09[8]
    3.3.2 CCVK 11[3]
    3.3.3 NHKEIN 14[9]
  3.4
    3.4.1 AO 15[1]
  3.5
4 [9]
  4.1 [9]
  4.2 [9]
  4.3
    4.3.1 JavaScript
    4.3.2 SVM
    4.3.3
  4.4
5
  5.1
  5.2 JavaScript
    5.2.1 JavaScript
    5.2.2 JavaScript
    5.2.3 JavaScript
  5.3
    5.3.1
    5.3.2 SVM
  5.4
  5.5
6 bigram
  6.1 bigram
  6.2
7
8
1 1.1 Drive-by- Download (DbD ) DbD web web Bot IBM 2014 21.9% 11.3% DbD [5] IPA 10 2015[6] DbD DbD DbD web JavaScript JavaScript JavaScript DbD 1.2 DbD JavaScript DbD 2 JavaScript [9] JavaScript JavaScript JavaScript Support Vector Machine 1
JavaScript JavaScript 3 D3M dataset [9] bigram 1.3 2 3 DbD 4 [9] 5 6 bigram 7 2
2 2.1 Drive-by-Download Drive-by-Download (DbD ) web web DbD DbD 2.1 1. web 2. 3. 4. DbD web JavaScript 2.2 JavaScript DbD JavaScript JavaScript JavaScript unescape() String.replace() String.charAt() eval() 2.2 JavaScript JavaScript unescape() eval() 3
2.1: DbD 2.3 D3M dataset 2.2: JavaScript D3M dataset MWS MWS datasets DbD D3M dataset DbD 3 1. URL DbD pcap 2. 3. pcap 4
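Figure 2.3: 5-fold cross validation

2.4 K-Fold Cross Validation
K-fold cross validation splits the data into K parts. One part is used for testing and the remaining K-1 parts are used for training. This is repeated K times so that each part is used for testing exactly once. Figure 2.3 shows the case of 5 parts (K=5).

2.5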
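The K-fold procedure of Section 2.4 can be sketched in plain Python as follows. This is only an illustration: the helper name `k_fold_indices` is hypothetical, and the split is by index order without shuffling, which a real experiment would probably add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread samples as evenly as possible: the first n_samples % k folds
    # receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# K = 5 on ten samples: each fold tests on 2 samples and trains on the other 8.
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so the K evaluation results together cover the whole dataset.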
2.4: Support Vector Machine 2.5.1 Support Vector Machine (SVM) SVM ( 2.4) SVM 2 2.5.2 Naive Bayes Naive Bayes Naive Bayes Gaussian Naive Bayes Bernoulli Naive Bayes Multinomial Naive Bayes Gaussian Naive Bayes 6
2.5.3 k-nearest Neighbor 2.5: k-nearest Neighbor k-nearest Neighbor k 2.5 k=3 Class2 k=5 Class1 k=7 Class2 2 2 2.5.4 Decision Tree C4.5 CART C4.5 CART 2.5.5 Random Forest Decision Tree 1 7
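To illustrate how the k-nearest Neighbor result in Section 2.5.3 can change with k, here is a minimal sketch of majority voting. The one-dimensional training points are made up so that the vote follows the pattern mentioned for Figure 2.5 (k=3 gives Class2, k=5 gives Class1, k=7 gives Class2); they are not taken from the thesis, and `knn_predict` is a hypothetical helper name.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (value, label) pairs on a 1-D axis.
    """
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: labels ordered by distance from 0 are
# Class2, Class2, Class1, Class1, Class1, Class2, Class2.
train = [(1, "Class2"), (-2, "Class2"), (3, "Class1"), (-4, "Class1"),
         (5, "Class1"), (-6, "Class2"), (7, "Class2")]

for k in (3, 5, 7):
    print(k, knn_predict(train, 0, k))
```

With two classes, an odd k guarantees that the vote cannot end in a tie.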
2.6

2.6.1 Python[10]
Python is a general-purpose programming language.

2.6.2 scikit-learn[11]
scikit-learn is a machine learning library for Python. It provides classifiers such as SVM, Decision Tree, Random Forest, k-nearest Neighbor, and Naive Bayes.

2.6.3 HTMLParser[4]
HTMLParser is a Python module for parsing HTML. HTMLParser is used to extract JavaScript from HTML.
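Extracting JavaScript from HTML with HTMLParser might look like the sketch below. It is written for Python 3, where the module is `html.parser` (reference [4] points to the Python 2.7 `HTMLParser` module). The class name `ScriptExtractor`, the sample page, and the `example.com` URL are hypothetical, and this is not the thesis's actual extraction code.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect inline <script> bodies and the src URLs of external scripts."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.inline_scripts = []
        self.external_srcs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external_srcs.append(src)
            self.in_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self.in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.inline_scripts.append(code)
            self.in_script = False


html_doc = (
    '<html><head><script src="http://example.com/a.js"></script>'
    "<script>var x = 1;</script></head><body><p>hi</p></body></html>"
)
extractor = ScriptExtractor()
extractor.feed(html_doc)
extractor.close()
print(extractor.inline_scripts)
print(extractor.external_srcs)
```

Scripts referenced through `src` are only recorded as URLs here; fetching them is a separate step.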
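2.7
Classifying JavaScript gives four possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The following metrics are computed from these four counts.

Accuracy
Accuracy is the fraction of all samples that are classified correctly.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2.1}$$

Precision
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2.2}$$

Recall
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2.3}$$

False Negative Rate (FNR)
$$\mathrm{FNR} = \frac{FN}{TP + FN} \tag{2.4}$$
From the definition of Recall (2.3), $\mathrm{FNR} = 1 - \mathrm{Recall}$, so a higher Recall means a lower FNR.

False Positive Rate (FPR)
$$\mathrm{FPR} = \frac{FP}{TN + FP} \tag{2.5}$$

In (2.1), (2.2), (2.3), (2.4), and (2.5), TP, TN, FP, and FN are the numbers of samples with each outcome. The evaluation in this thesis reports Accuracy, Precision, FNR, and FPR.

2.8 unigram, bigram
A unigram treats 1 character as one unit, and a bigram treats 2 consecutive characters as one unit. In general, an N-gram treats N consecutive characters as one unit. When there are m distinct characters, there are $m^N$ possible N-grams.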
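As a small illustration of unigrams and bigrams over characters, the sketch below slides a window of length N across a string. The helper name `char_ngrams` and the sample string are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "eval(x)"
unigram_counts = Counter(char_ngrams(sample, 1))
bigram_counts = Counter(char_ngrams(sample, 2))
print(char_ngrams(sample, 2))

# With m distinct characters there are m ** N possible N-grams,
# for example 94 ** 2 bigrams over 94 characters.
print(94 ** 2)
```

A string of length L contains L - N + 1 N-grams, so the 7-character sample yields 7 unigrams and 6 bigrams.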
3 3.1 DbD 2 1. 2. ( ) 3.2 3.2.1 XZZ 13[12] [12] JavaScript 11
3 1. unescape() 2. JavaScript 3. JavaScript unescape() eval () 3 1 3.2.2 JCB 14[7] [7] JavaScript JavaScript N-gram Support Vector Machine 3.1 1. web JavaScript JavaScript 2. 3. N-gram N N 12
4. N-gram σ σ N
5. SVM

3.3

3.3.1 LJJ 09[8]
[8] JavaScript JavaScript JavaScript eval 50 3.1 15 65

3.3.2 CCVK 11[3]
[3] FNR [3] 3 DbD
1. HTML iframe HTML HTML URL 19
2. JavaScript eval() setTimeout() setInterval() 25
3. URL URL URL IP DNS A DNS NS 33
3 3 1 FPR FNR
3.1: [7] 14
Table 3.1: [8]

Feature                    Description
Length in characters       The length of the script in characters.
Avg. characters per line   The average number of characters on each line.
Num. of lines              The number of newline characters in the script.
Num. of strings            The number of strings in the script.
Num. of unicode symbols    The number of unicode characters in the script.
hex or octal numbers       A count of the numbers represented in hex or octal.
% human readable           We judge a word to be readable if it is > 70% alphabetical, has 20% < vowels < 60%, is less than 15 characters long, and does not contain > 2 repetitions of the same character in a row.
% whitespace               The percentage of the script that is whitespace.
Num. of methods called     The number of methods invoked by the script.
Avg. string length         The average number of characters per string in the script.
Avg. argument length       The average length of the arguments to a method, in characters.
Num. of comments           The number of comments in the script.
Avg. comments per line     The number of comments over the total number of lines in the script.
Num. of words              The number of words in the script where words are delineated by whitespace and JavaScript symbols (for example, arithmetic operators).
% word not in comments     The percentage of words in the script that are not commented out.
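The "% human readable" feature in Table 3.1 is defined precisely enough to sketch in code. The version below is one possible reading, not the implementation of [8]. It assumes that the vowel ratio is taken over the whole word, that "> 2 repetitions in a row" means a run of three or more identical characters, and that words are maximal runs of letters, digits, `_`, and `$`. The function names are hypothetical.

```python
import re

VOWELS = set("aeiouAEIOU")


def is_readable(word):
    """Apply the four readability conditions from Table 3.1 to one word."""
    if not word or len(word) >= 15:
        return False
    alpha_ratio = sum(c.isalpha() for c in word) / len(word)
    vowel_ratio = sum(c in VOWELS for c in word) / len(word)
    # Assumption: a run of three identical characters counts as "> 2 repetitions".
    has_long_run = re.search(r"(.)\1\1", word) is not None
    return alpha_ratio > 0.7 and 0.2 < vowel_ratio < 0.6 and not has_long_run


def percent_human_readable(script):
    """Percentage of words in `script` judged readable."""
    # Assumption: everything except letters, digits, '_' and '$' separates words.
    words = re.findall(r"[A-Za-z0-9_$]+", script)
    if not words:
        return 0.0
    return 100.0 * sum(is_readable(w) for w in words) / len(words)


print(percent_human_readable("var count = total + 1;"))
```

Obfuscated scripts tend to contain long encoded strings and random identifiers, which is why a low readable percentage can be a useful signal.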
3.2: [1] 16
3.3.3 NHKEIN 14[9] [9] JavaScript 4 3.4 3.4.1 AO 15[1] [1] URL [7] JavaScript N-gram 2 DbD 3.2 3.5 3.2 JavaScript [9] JavaScript [9] 3 17
/ 3.2: LJJ 09[8] JavaScript CCVK 11[3] HTML JavaScript URL 3 NHKEIN 14[9] JavaScript XZZ 13[12] JCB 14[7] JavaScript N-gram JavaScript AO 15[1] JavaScript 18
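4 [9]

4.1 [9]
2.2 JavaScript JavaScript JavaScript [9] JavaScript
[9] uses as features the frequency of the 94 ASCII characters from 0x21 to 0x7e in JavaScript. Let $m_i$ be the number of occurrences of the i-th character in a JavaScript, and let N be the total number of these characters in the JavaScript. The frequency F(i) is defined as follows.
$$N = \sum_{i} m_i \tag{4.1}$$
$$F(i) = \frac{m_i}{N} \tag{4.2}$$
F(i) satisfies $0 \le F(i) \le 1$. The values F(i) are given to a Support Vector Machine (SVM) as a 94-dimensional feature vector. Because each feature counts 1 character as one unit, this corresponds to unigram.
[9] JavaScript UTF-8

4.2 [9]
[9] JavaScript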
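The feature F(i) of equations (4.1) and (4.2) can be sketched as follows. This is an illustration under stated assumptions, not the code of [9]: characters outside the range 0x21 to 0x7e (spaces, newlines, non-ASCII characters) are simply ignored, and the function name `char_frequencies` is hypothetical.

```python
FIRST, LAST = 0x21, 0x7E  # the 94 ASCII characters used as features


def char_frequencies(script):
    """Return the 94-dimensional vector F(i) = m_i / N for one script."""
    counts = [0] * (LAST - FIRST + 1)
    for ch in script:
        code = ord(ch)
        # Assumption: characters outside the 94-character range are skipped.
        if FIRST <= code <= LAST:
            counts[code - FIRST] += 1
    total = sum(counts)  # N in equation (4.1)
    if total == 0:
        return [0.0] * len(counts)
    return [m / total for m in counts]  # equation (4.2)


vector = char_frequencies("aab c\n")
print(len(vector), vector[ord("a") - FIRST])
```

Because the counts are divided by N, the vector sums to 1 and does not depend on the length of the script.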
JavaScript JavaScript JavaScript [9] JavaScript JavaScript JavaScript JavaScript [9] JavaScript JavaScript JavaScript JavaScript JavaScript JavaScript Document Object Model (DOM) HTML HTML DOM JavaScript JavaScript 4.3 4.3.1 JavaScript [9] JavaScript JavaScript JavaScript JavaScript Alexa[2] 500 <script> JavaScript <script> 20
4.1: [9] JavaScript 2013/11/18 2011/2/8-2013/2/26 2786 330 src URL URL JavaScript 4.2 JavaScript 1KB JavaScript [9] JavaScript JavaScript JavaScript D3M dataset 2011 2013 3 [9] JavaScript jquery JavaScript JavaScript 1KB [9] JavaScript JavaScript 4.1 4.3.2 SVM SVM [9] SVM libsvm SVM 1. 4.2 F (i) 94 SVM 94 JavaScript JavaScript JavaScript +1 JavaScript 1 21
4.2: [9] Result Accuracy 98.84% Precision 97.72% Recall 94.35% 2. SVM SVM RBF SVM C γ [9] libsvm grid.py 2 5 4.3.3 [9] C = 25.22 γ = 55.72 Accuracy 4.2 4.4 [9] MWS MWS datasets D3M dataset 3 1 SVM 1 SVM 2 [9] 22
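Section 4.3.2 selects the RBF-kernel parameters C and γ with libsvm's grid.py. The sketch below shows the general idea of such a grid search in plain Python; it does not call libsvm. The exponent ranges, the function names, and the toy scoring function are assumptions made for illustration. In a real run `score(c, gamma)` would train an SVM and return its cross-validated accuracy.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def grid_search(score, log2_c_values, log2_gamma_values):
    """Try every (C, gamma) pair on an exponential grid and keep the best score."""
    best = None
    for log2_c, log2_g in product(log2_c_values, log2_gamma_values):
        c, gamma = 2.0 ** log2_c, 2.0 ** log2_g
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best


def toy_score(c, gamma):
    # Hypothetical stand-in for cross-validated accuracy, peaking at C=8, gamma=0.5.
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 1) ** 2)


best = grid_search(toy_score, range(-5, 16, 2), range(3, -16, -2))
print("best C:", best[1], "best gamma:", best[2])
```

Searching over powers of two covers several orders of magnitude with few evaluations, which matters because every grid point costs one full cross validation.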
5 5.1 4.4 [9] [9] 5.1 5.2 JavaScript JavaScript [9] JavaScript JavaScript JavaScript 5.2.1 JavaScript Alexa Top500[2] web URL <script> JavaScript <script> src URL JavaScript JavaScript Python JavaScript URL JavaScript Python 5.2 [9] 1KB 5.2.2 JavaScript 2011-2014 D3M dataset JavaScript JavaScript 1KB 23
5.1: 5.1: JavaScript 2015/6/9 2011/2/14-2014/4/11 3344 906 5.2.3 JavaScript JavaScript JavaScript JavaScript JavaScript 5.1 5.3 [9] Support Vector Machine (SVM) Naive Bayes (NB) k-nearest Neighbor (knn) Decision Tree (DT) Random Forest (RF) 5 knn k 1 3 5 7 9 5 5 24
5.2: JavaScript [9] 2 5.3.1 JavaScript 2011 2012 2014 5.3.2 SVM SVM [9] RBF C γ 2 25
Table 5.2:

Classifier                   Accuracy   Precision   FNR       FPR
NB                           0.9247     0.8369      0.1954    0.04276
SVM (C = 329.9, γ = 61.7)    0.9638     0.9226      0.09384   0.02064
DT                           0.9548     0.8711      0.07394   0.03738
RF                           0.9769     0.9822      0.09162   0.004485
knn (k=1)                    0.9616     0.9159      0.09715   0.02243
knn (k=3)                    0.9546     0.9424      0.1612    0.01405
knn (k=5)                    0.9576     0.9602      0.1634    0.009569
knn (k=7)                    0.9572     0.9585      0.1645    0.009868
knn (k=9)                    0.9544     0.9578      0.1777    0.009868

Table 5.3:

Classifier                   Accuracy   Precision   FNR       FPR
NB                           0.8960     0.8988      0.2618    0.03589
SVM (C = 501.0, γ = 5.92)    0.8480     0.9453      0.4737    0.01316
DT                           0.7916     0.8608      0.6316    0.02572
RF                           0.7414     0.9558      0.8504    0.002990
knn (k=1)                    0.7385     0.8582      0.8407    0.01136
knn (k=3)                    0.7410     0.9180      0.8449    0.005981
knn (k=5)                    0.7544     0.9408      0.8019    0.005383
knn (k=7)                    0.7410     0.8984      0.8407    0.007775
knn (k=9)                    0.7410     0.8592      0.8310    0.01196

FNR FNR FNR

5.4
5.2 5.3
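The Accuracy, Precision, FNR, and FPR columns in the tables above follow from the four confusion counts through equations (2.1) to (2.5). A minimal sketch, using made-up counts that are not taken from the thesis:

```python
def metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics of equations (2.1) to (2.5)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fnr": fn / (tp + fn),
        "fpr": fp / (tn + fp),
    }


# Hypothetical counts, chosen only to show the arithmetic.
m = metrics(tp=80, tn=900, fp=10, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, Accuracy can stay high while FNR is large, because the negative samples dominate the numerator of (2.1). That is why FNR is reported separately.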
5.5 5.2 FNR DT RF SVM Accuracy RF SVM DT SVM RF RF SVM Accuracy DT 3 knn FNR NB FNR Accuracy ( 5.3) FNR RF knn 0.8 (80%) FNR DT FNR 0.6316 SVM FNR 0.4737 Naive Bayes FNR 0.2618 74% Accuracy Naive Bayes 0.8960 Naive Bayes FNR FPR 3.6% Naive Bayes [9] SVM 4.4 SVM SVM SVM 27
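6 bigram

6.1 bigram
With bigram features over the 94 characters there are $94^2 = 8836$ possible features. Two feature sets are evaluated, one with 1767 features and one with 4418 features. Table 6.1 shows the results for 1767 features and Table 6.2 shows the results for 4418 features. Both are compared with the unigram results in Table 5.3.

6.2
Table 6.1 shows the results with 1767 features. Compared with the unigram results in Table 5.3, FNR is higher for NB, SVM, and RF. For knn, FNR falls to about 0.69 (69%) for most values of k, and FNR also falls for DT. The lowest FNR with 1767 features is 0.4571 for NB, which is higher than the 0.2618 of NB in Table 5.3.

Table 6.2 shows the results with 4418 features. Compared with the unigram results in Table 5.3, FNR is again higher for NB, SVM, and RF. For knn, FNR falls to about 0.72 (72%), and as with 1767 features, FNR falls for DT. The lowest FNR with 4418 features is 0.4460 for NB, slightly lower than with 1767 features but still higher than NB in Table 5.3.

bigram JavaScript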
Table 6.1: bigram (1767 dimensions)

Classifier                   Accuracy   Precision   FNR      FPR
NB                           0.8438     0.8991      0.4571   0.02632
SVM (C = 60.0, γ = 5.0)      0.8250     0.9749      0.5693   0.004785
DT                           0.8396     0.9183      0.4861   0.01974
RF                           0.7339     0.9670      0.8781   0.001794
knn (k=1)                    0.7715     0.9187      0.7341   0.01017
knn (k=3)                    0.7853     0.9444      0.6939   0.007775
knn (k=5)                    0.7857     0.9563      0.6967   0.005981
knn (k=7)                    0.7853     0.9602      0.6994   0.005383
knn (k=9)                    0.7861     0.9605      0.6967   0.005383

Table 6.2: bigram (4418 dimensions)

Classifier                   Accuracy   Precision   FNR      FPR
NB                           0.8404     0.8696      0.4460   0.03589
SVM (C = 216, γ = 2.3)       0.7682     0.9563      0.7576   0.004785
DT                           0.8371     0.9171      0.4945   0.01974
RF                           0.7410     0.9636      0.8532   0.002392
knn (k=1)                    0.7623     0.8964      0.7604   0.01196
knn (k=3)                    0.7740     0.9209      0.7258   0.01017
knn (k=5)                    0.7749     0.9463      0.7313   0.006579
knn (k=7)                    0.7753     0.9466      0.7299   0.006579
knn (k=9)                    0.7757     0.9426      0.7271   0.007177

1767 4418 NB RF FNR $94^2 = 8836$
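The full bigram feature space of Chapter 6 has 94 × 94 = 8836 dimensions. The sketch below builds that frequency vector for one script. It is an illustration under assumptions: characters outside 0x21 to 0x7e are dropped before pairs are formed, and the function name is hypothetical. How the thesis reduces the 8836 features to 1767 or 4418 is not recoverable from the text, so that step is not shown.

```python
FIRST, LAST = 0x21, 0x7E
ALPHABET = LAST - FIRST + 1  # 94 characters


def bigram_frequencies(script):
    """Return the 8836-dimensional vector of character-bigram frequencies."""
    counts = [0] * (ALPHABET * ALPHABET)
    # Assumption: characters outside the 94-character range are removed first.
    codes = [ord(c) - FIRST for c in script if FIRST <= ord(c) <= LAST]
    for a, b in zip(codes, codes[1:]):
        counts[a * ALPHABET + b] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


vector = bigram_frequencies("abab")
index_ab = (ord("a") - FIRST) * ALPHABET + (ord("b") - FIRST)
print(len(vector), vector[index_ab])
```

Most entries are zero for any single script, so a bigram vector is far sparser than the 94-dimensional unigram vector.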
7 DbD JavaScript [9] [9] D3M dataset SVM Naive Bayes Naive Bayes bigram Naive Bayes 74% 30
8,, Drive-by-Download JavaScript, The 33rd Symposium on Cryptography and Information Security (SCIS 2016), 2016. 31
[1] Takashi Adachi and Kazumasa Omote, An Approach to Predict Drive-by-Download Attacks by Vulnerability Evaluation and Opcode, The 10th Asia Joint Conference on Information Security (AsiaJCIS 2015), 2015.
[2] Alexa Top 500 Global Sites, www.alexa.com/topsites
[3] Davide Canali, Marco Cova, Giovanni Vigna, Christopher Kruegel, Prophiler: a fast filter for the large-scale detection of malicious web pages, The 20th international conference on World wide web, 2011.
[4] HTMLParser, https://docs.python.org/2.7/library/htmlparser.html
[5] IBM, 2014 Tokyo SOC, www-935.ibm.com/services/jp/ja/it-services/soc-report/
[6] IPA, 10 2015, www.ipa.go.jp/security/vuln/10threats2015.html
[7] G.K. Jayasinghe, J.S. Culpepper, P. Bertok, Efficient and effective realtime prediction of drive-by download attacks, Journal of Network and Computer Applications, Volume 38, February 2014.
[8] P. Likarish, E. Jung, I. Jo, Obfuscated malicious javascript detection using classification techniques, MALWARE 09, 2009.
[9] ,,,,,, JavaScript, 2014-CSEC-64, vol.21, pp. 1-7, 2014.
[10] Python, https://www.python.org/
[11] scikit-learn, http://scikit-learn.org/stable/
[12] W. Xu, F. Zhang, S. Zhu, Jstill: Mostly static detection of obfuscated malicious javascript code, CODASPY 13, 2013.