CN114172748A - Encrypted malicious traffic detection method

Info

Publication number: CN114172748A
Application number: CN202210124869.1A
Authority: CN
Inventors: 霍跃华; 赵法起; 李晓宇; 裴超; 曹洪治
Original assignee: China University of Mining and Technology Beijing CUMTB
Current assignee: China University of Mining and Technology Beijing CUMTB
Priority date: 2022-02-10
Filing date: 2022-02-10
Publication date: 2022-03-11
Anticipated expiration: 2042-02-10
Also published as: CN114172748B

Abstract

The invention discloses an encrypted malicious traffic detection method. The method processes traffic packets with the Wireshark tool; filters out packets with invalid IP checksums, preprocesses the sample set and marks malicious/benign labels; performs preliminary feature extraction on the preprocessed traffic packets; constructs 3 feature subsets from the preliminarily extracted features and standardizes and encodes them; performs feature dimensionality reduction on each feature subset using a machine learning method or principal component analysis; establishes a random forest, an XGBoost and a Gaussian naive Bayes classifier model for the 3 feature subsets respectively; combines the 3 classifier models according to a Stacking strategy to form a DMMFC detection model; fuses the 3 feature subsets through stream fingerprints to form a sample set, divides the sample set into a training set and a test set, and trains the model; and tests the model, evaluating the detection performance of the DMMFC model by accuracy, F₁ score and false alarm rate. By combining multi-feature fusion with the Stacking strategy, the method detects encrypted malicious traffic with strong detection capability.

Description

Encrypted malicious traffic detection method

Technical Field

The invention belongs to the field of encrypted malicious traffic detection in data identification, and particularly relates to a double-layer multi-model fusion (DMMFC) encrypted malicious traffic detection method based on a Stacking strategy.

Background

In recent years, every industry has been moving toward digitalization, network attack methods have diversified, and security incidents such as data leakage and ransomware occur frequently. To protect the security of users accessing the internet, many websites have adopted transport encryption protocols. According to a Sophos News report, the proportion of web pages loaded by Chrome with encryption enabled rose from 40% in 2014 to 98% in 2021. Unfortunately, while legitimate traffic is encrypted, malicious traffic also uses the TLS encryption protocol to mask attacks. In 2020, 23% of detected malware that communicated with remote systems over the Internet used Transport Layer Security (TLS); today, this ratio is close to 46%.

Decryption technology suffers from high hardware overhead, long processing time and high cost, and it runs counter to the original purpose of protecting users' internet privacy. As the computing power of computers has improved significantly in recent years, many researchers have begun to identify malicious traffic in networks without decryption by using information entropy, machine learning or deep learning. The general idea of malicious traffic detection based on machine learning is as follows: malicious traffic and normal traffic are obtained, features are extracted according to specific rules, a feature matrix and a training set are constructed, a machine learning model is established, the training set is input into the model for training, and after training is completed the model is used to detect malicious traffic. Deep learning-based methods are similar, except that the training network is built from interconnected neurons; common networks include deep neural networks, one-dimensional convolutional neural networks and long short-term memory (LSTM) networks. Researchers have also obtained good results by converting the features into images and using a two-dimensional convolutional neural network.

In the past, conventional methods based on deep packet inspection analyzed the underlying information of data packets; on the one hand this violates users' internet privacy, and on the other hand these methods have a high false alarm rate, which burdens network security practitioners. Machine learning-based malicious traffic detection has now become the mainstream research approach, but TLS encrypted traffic detection still has the following problems: (1) TLS encrypted traffic features are of many types, and a single machine learning model is not suited to processing multiple heterogeneous features; (2) TLS encrypted malicious traffic detection has a low recall rate and a high false alarm rate.

Disclosure of Invention

To address the defects and shortcomings of the prior art, the invention provides a DMMFC encrypted malicious traffic detection method based on a Stacking strategy. The method considers flow features, connection features, DNS response features, HTTP background features and TLS handshake features in the detection process, synthesizes traffic behavior, and combines the Stacking strategy to solve the above problems.

The technical route of the invention is to extract the flow features, connection features, DNS response features, HTTP background features and TLS handshake features so as to detect encrypted malicious traffic in mixed traffic without decryption. The technical idea is as follows: complete pcap data packets are obtained and preprocessed, and malicious/benign traffic is labeled; features are then extracted and divided into 3 feature subsets according to feature category, and the feature subsets are standardized, encoded and reduced in dimension; 3 classifier models are designed for the 3 feature subsets respectively; the 3 dimension-reduced feature subsets are fused according to the stream fingerprints to construct a sample set, which is shuffled and divided into a training set and a test set; the 3 classifier models and 1 logistic regression model are combined according to a Stacking strategy to form a DMMFC detection model; the training set is input into the DMMFC model for training, and the DMMFC detection model is tested with the test set; the performance of the model is evaluated by accuracy, F₁ score and false alarm rate.

According to the technical idea, the technical scheme for achieving the purpose of the invention comprises the following steps:

first, an original traffic packet is obtained:

(1) collecting malicious traffic generated by 7 kinds of malicious software in the attack process by using a Wireshark tool, and combining 7 kinds of malicious traffic packets to obtain malicious traffic packets;

(2) collecting benign traffic under normal conditions by using a Wireshark tool to obtain a benign traffic packet;

further, data preprocessing is carried out: packets with invalid IP checksums are filtered out of the traffic packets to obtain malicious traffic packets and benign traffic packets that can be used for data analysis;

further, analyzing the flow packet by using a Zeek tool, extracting features to obtain flow features, connection features, DNS response features, HTTP background features and TLS handshake features, and marking malicious/benign labels; the TLS handshake characteristics comprise a Client Hello part, a server Hello part and a certificate verification part;

further, according to the feature types, combining the extracted flow features and the extracted connection features to form a flow feature subset, wherein the extracted DNS response features, HTTP background features and Client Hello and server Hello parts in the TLS handshake features form a protocol feature subset, and a certificate verification part in the extracted TLS handshake features forms a certificate feature subset;

further, 3 feature subsets are subjected to standardization and coding processing, and feature dimensionality reduction is carried out according to a feature importance assessment method and a principal component analysis method:

(1) after the stream feature subset is standardized, a 101-dimensional standard stream feature subset is obtained; the feature importances are evaluated and sorted, and the features whose importance is greater than or equal to 0.01 are taken as the new stream feature subset X₁;

(2) after the protocol feature subset is one-hot encoded, a 117-dimensional sparse protocol feature subset is obtained; the cumulative maximum feature contribution rate is set to ε ≥ 90%, principal component analysis is used to reduce the features to 4 dimensions, and the 4-dimensional features are fused with the TLS version number features to obtain 7-dimensional features, forming the new protocol feature subset X₂;

(3) after the certificate feature subset is one-hot encoded, a 2874-dimensional sparse certificate feature subset is obtained; the cumulative maximum feature contribution rate is set to ε ≥ 90%, and principal component analysis is used for feature dimensionality reduction to obtain 120-dimensional features, forming the new certificate feature subset X₃;

further, a classifier model is established for each of the dimension-reduced feature subsets X₁, X₂ and X₃, the model parameters are adjusted, and the models are trained:

(1) for the dimension-reduced standard stream feature subset X₁, a random forest classifier model is established;

(2) for the dimension-reduced sparse protocol feature subset X₂, an XGBoost classifier model is established;

(3) for the dimension-reduced sparse certificate feature subset X₃, a Gaussian naive Bayes classifier model is established;

(4) in the parameter adjustment process, the parameters of the random forest classifier and Gaussian naive Bayes classifier models are adjusted by the control variable method, and the parameters of the XGBoost classifier model are adjusted by grid search;

(5) training the 3 models;

further, the 3 dimension-reduced feature subsets are fused through stream fingerprints and, together with their label values Y, form a sample set, which is divided into a training set and a test set at a ratio of 7:3;

further, the 3 classifier models and the 1 logistic regression model are combined into a DMMFC detection model through a Stacking strategy:

(1) the first-layer network of the DMMFC detection model consists of a random forest classifier, an XGBoost classifier and a Gaussian naive Bayes classifier;

(2) the second layer network of the DMMFC detection model consists of 1 logistic regression model;

(3) combining two layers of networks according to a Stacking strategy to form a DMMFC detection model;

further, the performance of the encrypted malicious traffic detection model is evaluated by accuracy, F₁ score and false alarm rate.

Compared with the prior art, the invention has the beneficial effects that:

1. TLS encrypted malicious traffic can be accurately detected in a real network environment; the false alarm rate is low and few samples are misdetected, which reduces the burden on network traffic analysts and helps users respond in time;

2. because machine learning models perform differently on different types of data, the extracted features are divided into 3 feature subsets and a suitable classifier model is established for each, which improves classification accuracy;

3. the first-layer network of the DMMFC detection model is trained per feature dimension range, so that each designed machine learning model is trained effectively and encrypted malicious traffic is accurately detected; a single logistic regression model is used in the second-layer network of the DMMFC detection model, which prevents overfitting during training, reduces the overall complexity of the detection model, and improves detection efficiency.

Drawings

In order to more clearly describe the technical solution of the present invention, the drawings which are needed to be used in the present invention are briefly described below, and the drawings are only for illustrating the embodiments of the present invention and are not to be construed as limiting the present invention.

Fig. 1 is a flowchart of the Stacking-strategy DMMFC encrypted malicious traffic detection method according to an embodiment of the present invention;

Fig. 2 is a design diagram of the Stacking-strategy DMMFC encrypted malicious traffic detection method according to an embodiment of the present invention;

Fig. 3 is a flowchart of encrypted traffic processing according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the Stacking-strategy DMMFC model according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention; they serve to further explain the invention so that those skilled in the art can understand it clearly and thoroughly, and they do not limit the invention.

As shown in Fig. 1 to Fig. 4, the encrypted malicious traffic detection method according to an embodiment of the present invention is designed as follows: the original traffic packets are divided into benign traffic packets and malicious traffic packets; feature extraction, feature subset construction, feature encoding and dimension reduction are performed; a classifier model is established for each dimension-reduced feature subset; the 3 classifier models and 1 logistic regression model are combined through a Stacking strategy to form a DMMFC detection model; and training and prediction are performed using a training set and a test set.

As shown in Fig. 2, the encrypted malicious traffic detection method provided in an embodiment of the present invention includes the following steps:

step 1, obtaining an original pcap flow packet:

(1) using the Wireshark tool, the traffic generated by 7 types of malicious software running in a network communication environment is collected and merged to obtain the original pcap traffic packet of the malicious traffic; this ensures the diversity of the encrypted malicious traffic and prevents the overfitting that would result from training the model on a single category of encrypted malicious traffic;

(2) collecting benign traffic under normal conditions by using a Wireshark tool to obtain an original pcap traffic packet of the benign traffic;

step 2, data preprocessing:

packets with invalid IP checksums are filtered out of the traffic packets; after preprocessing of the captured traffic, a malicious traffic packet containing 653633 flows (of which 35552 are TLS-encrypted) and a benign traffic packet containing 314733 flows (of which 51703 are TLS-encrypted) are obtained;
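The checksum test in Step 2 can be illustrated with a minimal sketch. It assumes the check refers to the standard IPv4 header checksum (the one's-complement sum of the header's 16-bit words); the function name `ipv4_header_checksum_ok` is hypothetical, and the embodiment itself performs the filtering with Wireshark rather than with code like this.

```python
import struct


def ipv4_header_checksum_ok(header: bytes) -> bool:
    """Return True if the IPv4 header checksum verifies (illustrative helper)."""
    if not header:
        return False
    ihl = (header[0] & 0x0F) * 4  # header length in bytes, taken from the IHL field
    if ihl < 20 or len(header) < ihl:
        return False
    total = 0
    for (word,) in struct.iter_unpack("!H", header[:ihl]):
        total += word
    # Fold any carries back into the low 16 bits (one's-complement addition).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    # A header carrying a correct checksum sums to all ones.
    return total == 0xFFFF
```

A packet whose header fails this test would be dropped before feature extraction.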

Step 3, feature extraction: the features of the preprocessed traffic are extracted and converted to obtain the feature vector of each flow, as shown in Fig. 3:

the Zeek tool is used to extract the features of each flow in the two preprocessed traffic packets. The extracted features comprise: flow features, namely the arrival characteristics and arrival process of the bidirectional flow; connection features, namely the transmission and connection establishment processes of the TCP and UDP protocols; DNS response features and HTTP background features, including the dialogs between initiators and responders; and TLS handshake features, including the Client Hello, Server Hello and certificate verification parts. To facilitate statistics and analysis, the extracted features are stored in 5 log files, and each flow is given a unique stream fingerprint that associates all behaviors of that flow; each flow is labeled, with malicious flows represented by -1 and benign flows by 1;
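The association of log records through the stream fingerprint can be sketched as follows. The sketch makes several assumptions not stated in the text: the logs are in Zeek's default tab-separated format with a `#fields` header line, Zeek's `uid` column plays the role of the stream fingerprint, and the log names `conn` and `ssl` stand in for the 5 log files. The helper names are hypothetical.

```python
def read_zeek_log(text):
    """Parse a tab-separated Zeek log into a list of {field: value} dicts."""
    fields, rows = [], []
    for line in text.splitlines():
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # column names follow the "#fields" tag
        elif line and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split("\t"))))
    return rows


def group_by_fingerprint(logs, label, key="uid"):
    """Collect the records of every log under the fingerprint of their flow.

    logs maps a log name to its parsed rows; label is -1 (malicious) or 1 (benign).
    """
    flows = {}
    for name, rows in logs.items():
        for row in rows:
            flow = flows.setdefault(row[key], {"label": label, "logs": {}})
            flow["logs"].setdefault(name, []).append(row)
    return flows
```

Calling `group_by_fingerprint` once per traffic packet, with label -1 for the malicious capture and 1 for the benign capture, yields one labeled record per flow.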

Step 4, feature conversion: the features extracted in Step 3 are divided into 3 categories:

(1) the stream feature subset consists of the flow features and the connection features; its data type is numerical, and it has 101 dimensions in total;

(2) the protocol feature subset consists of the DNS response features, the HTTP background features, and the Client Hello and Server Hello parts of the TLS handshake features; its data type is text, and it has 21 dimensions;

(3) the certificate feature subset consists of the cipher suite ('cipher suite'), the certificate issuer ('issuers') and the certificate subject ('subject') in the TLS handshake features; its data type is text, and it has 3 dimensions.
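Because the protocol and certificate feature subsets are text-typed, they are one-hot encoded in Step 5, where every distinct value of a column becomes its own 0/1 dimension; this is why 3 certificate columns can expand to thousands of dimensions. A minimal sketch of that encoding follows. The function name `one_hot_encode` and the use of `"-"` as the placeholder for a missing value are assumptions made for illustration.

```python
def one_hot_encode(rows, columns):
    """One-hot encode the given text columns of a list of dict records.

    Returns (names, matrix): one output column per distinct (column, value) pair.
    """
    vocab = []
    for col in columns:
        for value in sorted({row.get(col, "-") for row in rows}):
            vocab.append((col, value))
    index = {pair: i for i, pair in enumerate(vocab)}

    matrix = []
    for row in rows:
        vec = [0] * len(vocab)
        for col in columns:
            vec[index[(col, row.get(col, "-"))]] = 1  # exactly one 1 per source column
        matrix.append(vec)

    names = [f"{col}={value}" for col, value in vocab]
    return names, matrix
```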

Step 5, standardizing, coding and reducing dimensions of the feature subset:

because machine learning models perform differently on different types of data, the 3 feature subsets from Step 4 are standardized and encoded separately so that each model can work to best effect; in addition, to reduce the complexity of the detection model and improve its detection efficiency, feature dimensionality reduction is performed on the 3 standardized and encoded feature subsets;

(1) after the stream feature subset is standardized, a 101-dimensional standard stream feature subset is obtained, which together with its label values forms the sample set T₁; T₁ is input into a random forest classifier model for training; after training, a random forest feature importance evaluator assesses the importance of the feature in each dimension; the feature importance threshold is set to 0.01, and the features whose importance is greater than or equal to 0.01 (28 dimensions) are taken as the new stream feature subset X₁, which together with its label values forms a new sample set;

(2) after the protocol feature subset is one-hot encoded, a 117-dimensional sparse protocol feature subset is obtained, which together with its label values forms the sample set T₂; principal component analysis is applied to it for feature dimensionality reduction with the cumulative maximum feature contribution rate set to ε ≥ 90%, yielding 4-dimensional features; these are fused with the encoded 3-dimensional TLS version number, which serves as the identifier of encrypted traffic, to obtain the 7-dimensional protocol feature subset X₂, which together with its label values forms a new sample set;

(3) after the certificate feature subset is one-hot encoded, a 2874-dimensional sparse certificate feature subset is obtained, which together with its label values forms the sample set T₃; principal component analysis is applied to it for feature dimensionality reduction with the cumulative maximum feature contribution rate set to ε ≥ 90%, yielding the 120-dimensional feature subset X₃, which together with its label values forms a new sample set;
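The selection rules in Step 5 can be sketched in a few lines. The sketch assumes that standardization means z-scoring each column and that the "contribution rate" of a principal component is its share of the total variance; neither detail is spelled out in the text. The function names are hypothetical, and the importances and component variances are taken as given inputs rather than computed here.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score one numeric feature column (assumed form of standardization)."""
    mu, sigma = mean(column), pstdev(column)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in column]


def select_by_importance(importances, threshold=0.01):
    """Indices of the features whose importance is at least the threshold."""
    return [i for i, weight in enumerate(importances) if weight >= threshold]


def components_for_contribution(variances, epsilon=0.90):
    """Smallest number of principal components whose cumulative share of
    the total variance reaches epsilon."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0
    running = 0.0
    for k, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= epsilon:
            return k
    return len(ordered)
```

Under these assumptions, `select_by_importance` corresponds to the reduction from 101 to 28 stream dimensions, and `components_for_contribution` to the choice of 4 and 120 components for the protocol and certificate subsets.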

Step 6, respectively establishing classifier models for the 3 feature subsets after dimensionality reduction, adjusting model parameters, training the models:

(1) for the dimension-reduced stream feature subset X₁, which is high-dimensional and has imbalanced samples, a Bagging-based random forest classifier model is established, which trains on sample sets in parallel according to the random sampling principle; using the control variable method, the number of trees in the forest is set to 110 and the maximum depth of each tree to 20;

(2) for the dimension-reduced protocol feature subset X₂, which has moderate dimensionality, a Boosting-based XGBoost classifier model is established, which trains on the sample set serially so that misclassified samples receive more attention; using grid search, the maximum depth of each tree is set to 20, the column sampling ratio to 0.8 and the number of iterations to 100;

(3) for the dimension-reduced certificate feature subset X₃, a Gaussian naive Bayes classifier model is established, which assumes that the sample set obeys a Gaussian distribution and makes its decision on that basis, with the prior probability calculated by the maximum likelihood method;

(4) training the 3 models;
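To make item (3) concrete, the following is a minimal sketch of a Gaussian naive Bayes classifier in its textbook form: class priors are the maximum-likelihood class frequencies, each feature is modeled by a per-class Gaussian, and the class with the largest log posterior is chosen. The class name `TinyGaussianNB` and the small variance floor are illustrative assumptions; the exact decision formula of the embodiment is not reproduced in the text.

```python
import math
from collections import defaultdict


class TinyGaussianNB:
    """Textbook Gaussian naive Bayes, for illustration only."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.params = {}
        for label, rows in groups.items():
            prior = len(rows) / len(X)  # maximum-likelihood estimate of P(class)
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # The small floor keeps constant features from producing zero variance.
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                for c, m in zip(cols, means)
            ]
            self.params[label] = (prior, means, variances)
        return self

    def _log_posterior(self, row, label):
        prior, means, variances = self.params[label]
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score

    def predict(self, X):
        return [
            max(self.params, key=lambda label: self._log_posterior(row, label))
            for row in X
        ]
```

If scikit-learn and xgboost were used for items (1) and (2), the stated settings would plausibly correspond to `RandomForestClassifier(n_estimators=110, max_depth=20)` and `XGBClassifier(max_depth=20, colsample_bytree=0.8, n_estimators=100)`; the text does not name the libraries, so this mapping is an assumption.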

Step 7, feature fusion:

the 3 dimension-reduced feature subsets X₁, X₂ and X₃ are fused through stream fingerprints; the fused features have 155 dimensions and, together with their label values Y, form the sample set T, which is divided into a training set and a test set at a ratio of 7:3;
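Fusion by stream fingerprint amounts to concatenating, for each flow, its rows from the 3 subsets (28 + 7 + 120 = 155 dimensions). A minimal sketch follows; it assumes each subset is held as a mapping from fingerprint to feature list and keeps only flows present in all three, which is a simplification, since the text does not say how flows lacking some features are handled. The function names and the fixed random seed are hypothetical.

```python
import random


def fuse_by_fingerprint(x1, x2, x3, labels):
    """Concatenate the three feature rows of every flow that appears in all inputs.

    Each x* maps fingerprint -> feature list; labels maps fingerprint -> -1 or 1.
    Returns a list of (fingerprint, fused_features, label) tuples.
    """
    shared = sorted(set(x1) & set(x2) & set(x3) & set(labels))
    return [(uid, x1[uid] + x2[uid] + x3[uid], labels[uid]) for uid in shared]


def shuffle_and_split(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples reproducibly and split them into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```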

Step 8, the 3 models are combined according to a Stacking strategy, as shown in Fig. 4; the structure and training mechanism of the DMMFC detection model are as follows:

(1) the first-layer network of the DMMFC detection model consists of a random forest classifier, an XGBoost classifier and a Gaussian naive Bayes classifier; to make full use of each machine learning model, the first-layer network is trained per feature dimension range: the random forest classifier is trained on dimensions 1 to 28 of the training sample set, the XGBoost classifier on dimensions 29 to 35, and the Gaussian naive Bayes classifier on dimensions 36 to 155, each using five-fold cross-validation; the training result of each first-layer classifier is taken as a one-dimensional feature to reconstruct the features, the label value of each sample is kept unchanged, and the reconstructed features and labels form a new sample set that is input into the second-layer network;

(2) to prevent overfitting, the second-layer network of the DMMFC detection model consists of 1 logistic regression model, which is fitted to the new sample set obtained from the training of the first-layer network;
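The first-layer mechanism can be read as the usual out-of-fold construction used in stacking: each base classifier predicts every training sample using a model that was fitted without that sample, and the 3 prediction columns become the input of the second layer. The sketch below illustrates only this construction. `NearestCentroid` is a stand-in base learner, not one of the classifiers named above, and the contiguous folds, the `BLOCKS` mapping and all function names are assumptions made for illustration.

```python
# 0-based column slices for the 1-based ranges 1-28, 29-35 and 36-155.
BLOCKS = {"rf": slice(0, 28), "xgb": slice(28, 35), "gnb": slice(35, 155)}


class NearestCentroid:
    """Stand-in base learner: predicts the class with the closest mean vector."""

    def fit(self, X, y):
        sums = {}
        for row, label in zip(X, y):
            total, count = sums.get(label, ([0.0] * len(row), 0))
            sums[label] = ([t + v for t, v in zip(total, row)], count + 1)
        self.centroids = {c: [t / n for t in total] for c, (total, n) in sums.items()}
        return self

    def predict(self, X):
        def dist(row, c):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[c]))
        return [min(self.centroids, key=lambda c: dist(row, c)) for row in X]


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds (samples assumed already shuffled)."""
    return [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]


def out_of_fold_predictions(make_model, X, y, columns, k=5):
    """Predict every sample with a model that was fitted without that sample."""
    view = [row[columns] for row in X]  # restrict to this learner's dimensions
    oof = [None] * len(X)
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([view[i] for i in train_idx], [y[i] for i in train_idx])
        for i, pred in zip(fold, model.predict([view[i] for i in fold])):
            oof[i] = pred
    return oof


def build_meta_features(base_models, X, y, k=5):
    """Stack one out-of-fold prediction column per base learner."""
    columns = [
        out_of_fold_predictions(make, X, y, BLOCKS[name], k)
        for name, make in base_models.items()
    ]
    return [list(row) for row in zip(*columns)]
```

The rows returned by `build_meta_features`, paired with the unchanged labels, would then be fitted by the second-layer logistic regression model.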

step 9, training the model, and checking the performance of the model:

(1) inputting the training set into a DMMFC detection model, and training the model;

(2) the test set is input into the trained DMMFC detection model for testing, and the model judges each sample to obtain the final prediction result; a prediction of 1 means the test sample is predicted to be benign traffic, and a prediction of -1 means it is predicted to be malicious traffic;

(3) accuracy, F₁ score and false alarm rate are used to evaluate the performance of the DMMFC detection model; Table 1 shows the performance of each model proposed in the embodiment of the present invention.

Table 1: performance of each model

Model	Accuracy (%)	F₁ score (%)	False alarm rate (%)
Random forest classifier	99.80	99.68	0.11
XGBoost classifier	97.12	95.00	1.4
Gaussian naive Bayes classifier	14.2	24.82	0.12
DMMFC	99.90	99.91	0.05
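The three indicators in Table 1 can be computed from the confusion counts as sketched below. The sketch assumes the standard definitions and treats the malicious class (-1) as the positive class, so that the false alarm rate is the share of benign samples flagged as malicious; the text does not state which class is positive. The function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred, positive=-1):
    """Accuracy, F1 score and false alarm rate for labels in {-1, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "f1": f1,  # harmonic mean of precision and recall
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```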

In summary, in the encrypted malicious traffic detection method, the performance of the DMMFC detection model is superior to that of any single classifier model. First, among the misreported samples, the DMMFC detection model misreported 1 TLS-encrypted malicious sample, whereas the other models misreported more than 50; this shows that the proposed method has a low false alarm rate and a high detection rate for TLS-encrypted malicious traffic, and reduces the workload of threat response personnel. Second, the overall accuracy reaches 99.90%, which shows that the model can not only detect encrypted malicious traffic in mixed traffic but also detects unencrypted malicious samples well. Third, the F₁ score reaches 99.91%; since the F₁ score is the harmonic mean of precision and recall, this indicates that the model achieves both high precision and high recall.

Claims

1. An encrypted malicious traffic detection method is characterized by comprising the following steps:

step one, capturing pcap traffic packets by using a Wireshark tool, and constructing an encrypted traffic original data set;

step two, filtering out packets with invalid IP checksums in the original data set, and marking malicious/benign labels;

step three, analyzing the pcap traffic packets by using a Zeek tool, and extracting flow features, connection features, DNS response features, HTTP background features and TLS handshake features;

step four, constructing a stream feature subset, a protocol feature subset and a certificate feature subset according to the extracted stream features, connection features, DNS response features, HTTP background features and TLS handshake features, and carrying out standardization and encoding;

step five, performing feature dimensionality reduction on the stream feature subset, the protocol feature subset and the certificate feature subset by using a machine learning and principal component analysis method to obtain a standard stream feature subset, a sparse protocol feature subset and a sparse certificate feature subset;

step six, respectively establishing 3 classifier models for the standard stream feature subset, the sparse protocol feature subset and the sparse certificate feature subset, training the 3 classifier models, and adjusting the parameters of each of the 3 classifier models;

step seven, fusing the standard stream feature subset, the sparse protocol feature subset and the sparse certificate feature subset through the stream fingerprint to form a labeled sample set, and dividing the sample set into a training set and a test set;

step eight, forming a DMMFC detection model from 1 logistic regression model and the 3 classifier models of step six through a Stacking strategy, and training the DMMFC detection model;

step nine, inputting the test set into the trained DMMFC model for prediction, and evaluating the performance of the DMMFC detection model with 3 evaluation indexes: accuracy, F₁ score and false alarm rate.

2. The encryption malicious traffic detection method according to claim 1,

the normalization and encoding of the feature subsets comprises:

the stream feature subset comprises flow features and connection log features, and a standard stream feature subset with 101 dimensions is obtained after standardization;

the protocol feature subset comprises DNS response, HTTP background and TLS handshake features, and is coded in a one-hot coding mode, and after coding, sparse protocol feature subsets with 117 dimensions are obtained;

the certificate feature subset comprises TLS certificate features and an encryption algorithm selected in the TLS handshake process, one-hot coding is adopted for coding, and 2874 dimensionalities of sparse certificate feature subsets are obtained after coding.

3. The encrypted malicious traffic detection method according to claim 2, wherein the dimension reduction and fusion of the standard stream feature subset, the sparse protocol feature subset and the sparse certificate feature subset comprise:

for the standard stream feature subset, a random forest feature importance evaluator is used to evaluate the feature of each dimension in the subset, and the features with feature importance greater than or equal to 0.01 are selected to obtain 28-dimensional features;

for the sparse protocol feature subset, the cumulative maximum feature importance contribution rate is set to ε ≥ 90%, feature dimension reduction is carried out by principal component analysis to obtain 4-dimensional features, and a label feature of TLS encrypted traffic is added, so that the feature set has 7 dimensions;

for the sparse certificate feature subset, the cumulative maximum feature importance contribution rate is set to ε ≥ 90%, and feature dimension reduction is carried out by principal component analysis to obtain 120-dimensional features;

and fusing the 3 feature subsets subjected to dimension reduction according to the stream fingerprint, wherein the fused sample set has 155-dimensional features.

4. The encrypted malicious traffic detection method according to claim 3, wherein the label feature of TLS encrypted traffic is obtained as follows:

based on the extracted TLS handshake features, the TLS version number feature is extracted from each TLS encrypted flow, 3-dimensional features are obtained after encoding, and these features are fused with the dimension-reduced sparse protocol feature subset.

5. The encrypted malicious traffic detection method according to claim 1, wherein the classifier models respectively established for the standard stream feature subset, the sparse protocol feature subset and the sparse certificate feature subset, and the parameter adjustment, comprise:

establishing a random forest classifier model for the standard stream feature subset; establishing an XGBoost classifier model for the sparse protocol feature subset; establishing a Gaussian naive Bayes classifier model for the sparse certificate feature subset;

the parameter adjustment comprises adjusting the parameters of the random forest classifier and Gaussian naive Bayes classifier models by the control variable method, and adjusting the parameters of the XGBoost classifier model by grid search.

6. The encrypted malicious traffic detection method according to claim 1, wherein, in the DMMFC detection model:

the first-layer network consists of a random forest classifier, an XGBoost classifier and a Gaussian naive Bayes classifier; the second-layer network consists of 1 logistic regression model.

7. The encrypted malicious traffic detection method according to claim 1, wherein the DMMFC detection model is trained in a manner of:

the first-layer network of the DMMFC model is trained per feature dimension range: the random forest classifier is trained on dimensions 1 to 28 of the training sample set, the XGBoost classifier on dimensions 29 to 35, and the Gaussian naive Bayes classifier on dimensions 36 to 155, each using five-fold cross-validation; the training results of the first-layer network are input into the second-layer network as new features, and a logistic regression model is used for fitting.