CN113704762A

CN113704762A - Malicious software encrypted flow detection method based on ensemble learning

Info

Publication number: CN113704762A
Application number: CN202111024464.2A
Authority: CN
Inventors: 李树栋; 赵传彧; 吴晓波; 韩伟红; 方滨兴; 田志宏; 殷丽华; 顾钊铨; 仇晶; 唐可可; 李默涵
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2021-11-26
Anticipated expiration: 2041-09-02
Also published as: CN113704762B

Abstract

The invention discloses an ensemble-learning-based method for detecting encrypted malware traffic, comprising the following steps: collecting an encrypted traffic sample set comprising a plurality of heterogeneous features; constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set; and constructing a malware encrypted traffic detection model from the plurality of feature classifiers, wherein the detection model determines whether a host is infected with malware by majority voting among the feature classifiers. The invention addresses the low detection rate and high false alarm rate of existing malware traffic detection systems. Compared with deep packet inspection (DPI), it does not need to decrypt encrypted packets: it detects malicious encrypted traffic using only the observable characteristics of the packets, and it achieves a high detection rate and a low false alarm rate.

Description

Malicious software encrypted flow detection method based on ensemble learning

Technical Field

The invention relates to the technical field of malware traffic detection, and in particular to an ensemble-learning-based method for detecting encrypted malware traffic.

Background

Malware is software designed to damage computer systems, and it is one of the most serious threats to information security today. Besides PE-file-based malware detection methods, detection based on the traffic that malware generates is also effective. TLS is an encryption protocol that provides privacy for applications. In recent years, with the widespread adoption of TLS, encrypted traffic on the Internet has been increasing; meanwhile, the number of malware attacks that propagate or communicate over encrypted HTTP traffic has also risen sharply. While encryption protects user privacy, it also introduces security risks: malicious traffic can hide within encrypted traffic, causing a range of security problems.

Identifying whether encrypted traffic is benign or malicious is a significant challenge. The importance of network infrastructure security places high demands on both the true positive rate (TPR) and the false positive rate (FPR) of detection. Traditional detection methods for unencrypted traffic are difficult to apply to encrypted traffic, because encryption defeats deep packet inspection (DPI) and pattern matching. Traditional signature-based methods can only detect attacks for which a signature already exists, so they cannot detect new attacks; moreover, encrypted payloads cannot be observed directly and are enormous in number. It is therefore necessary to provide an automatic malware traffic detection method that combines domain knowledge with machine learning, so as to protect information security.

Disclosure of Invention

The main purpose of the invention is to overcome the defects of the prior art and to provide an ensemble-learning-based method for detecting encrypted malware traffic, which detects malicious encrypted traffic using only the observable characteristics of packets, without decrypting the encrypted packets.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention provides an ensemble-learning-based method for detecting encrypted malware traffic, comprising the following steps:

collecting an encrypted traffic sample set, wherein the encrypted traffic sample set comprises a plurality of heterogeneous features, specifically: packet length distribution features, server IP address features, certificate word frequency features, packet length sequence features, TCP connection state features, flow features and host features;

constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier;

and constructing a malware encrypted traffic detection model from the plurality of feature classifiers, wherein the detection model determines whether a host is infected with malware by majority voting among the feature classifiers.

Preferably, the packet length distribution feature classifier is specifically described as follows:

packet length distribution feature construction: for each host, extracting the number of packets for each length and direction, and dividing each count by the total number of packets to obtain a probability distribution, wherein the probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet with a given direction and a given length;

model selection: processing the packet length distribution feature with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.
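As an illustration only, the feature construction and the 0.5 decision threshold described above might be sketched in Python as follows. The function names, the `(direction, length)` input format, the 1500-byte cap (`max_len`) and the clipping of longer packets are assumptions of this sketch and are not specified in the patent; the random forest itself is omitted.

```python
from collections import Counter


def packet_length_distribution(packets, max_len=1500):
    """Build the per-host packet length distribution feature.

    packets: iterable of (direction, length) pairs, where direction is
    0 (sent) or 1 (received). Returns a list with one probability per
    (direction, length) pair. max_len and clipping are assumptions.
    """
    counts = Counter((d, min(n, max_len)) for d, n in packets)
    total = sum(counts.values())
    features = [0.0] * (2 * (max_len + 1))
    if total == 0:
        return features  # host with no packets: all-zero vector (assumption)
    for (d, n), c in counts.items():
        # Divide each count by the total number of packets.
        features[d * (max_len + 1) + n] = c / total
    return features


def is_malicious(probability, threshold=0.5):
    """Decision rule from the text: malicious when probability >= 0.5."""
    return probability >= threshold
```

The resulting vector can be passed to any random forest implementation; the threshold function applies to the probability that the forest outputs.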

Preferably, the server IP address feature classifier is specifically described as follows:

server IP address feature construction: for each host, one-hot encoding all accessed server IP addresses, wherein a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents one server IP address;

model selection: processing the server IP address feature with a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class; on the test set, the conditional probabilities are used to compute the probability that each sample is malicious; if the probability >= 0.5, the sample is considered malicious, otherwise it is considered benign.
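A minimal sketch of the one-hot encoding and of a naive Bayes decision over it, assuming a Bernoulli model with Laplace smoothing (`alpha`). The smoothing, the function names and the choice to ignore IP addresses unseen in training are assumptions of this sketch, not details given in the patent.

```python
import math


def build_ip_index(training_hosts):
    """training_hosts: dict mapping host -> iterable of accessed server IPs."""
    all_ips = sorted({ip for ips in training_hosts.values() for ip in ips})
    return {ip: i for i, ip in enumerate(all_ips)}


def encode_host(accessed_ips, ip_index):
    """One-hot vector: 1 if the server IP was accessed, 0 otherwise."""
    vec = [0] * len(ip_index)
    for ip in accessed_ips:
        i = ip_index.get(ip)
        if i is not None:  # IPs unseen in training are ignored (assumption)
            vec[i] = 1
    return vec


def train_bernoulli_nb(X, y, alpha=1.0):
    """X: list of 0/1 vectors; y: labels (1 malicious, 0 benign).

    Both classes must be present in y. Returns, per class, the prior and
    the smoothed P(dimension = 1 | class) for every dimension.
    """
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        p = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
             for j in range(len(X[0]))]
        model[label] = (prior, p)
    return model


def malicious_probability(model, x):
    """Posterior probability of class 1 under the independence assumption."""
    log_post = {}
    for label, (prior, p) in model.items():
        ll = math.log(prior)
        for xj, pj in zip(x, p):
            ll += math.log(pj if xj else 1.0 - pj)
        log_post[label] = ll
    m = max(log_post.values())
    e0 = math.exp(log_post[0] - m)
    e1 = math.exp(log_post[1] - m)
    return e1 / (e0 + e1)
```

A host would then be labeled malicious when `malicious_probability(...) >= 0.5`, matching the threshold stated in the text.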

Preferably, the certificate word frequency feature classifier is specifically described as follows:

certificate word frequency feature construction: for each host, extracting the X.509 certificate chains of all received TLS flows, obtaining the words contained in the certificate subjects and issuers, and counting the occurrences of each word, wherein each dimension of the feature represents the number of occurrences of one word;

model selection: processing the certificate word frequency feature with a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class; on the test set, the conditional probabilities are used to compute the probability that each sample is malicious; if the probability >= 0.5, the sample is considered malicious, otherwise it is considered benign; if none of the words in a sample's certificates appears in the training set, the sample is directly inferred to be malicious.
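The word counting and the rule for certificates whose words were never seen in training could look roughly like this. The tokenization (splitting on non-alphanumeric characters and lowercasing), the function names, and the choice not to apply the rule to a host with no certificate words at all are assumptions made for this sketch.

```python
import re
from collections import Counter


def certificate_words(subjects_and_issuers):
    """Count words in certificate subject and issuer strings of one host."""
    words = []
    for text in subjects_and_issuers:
        # Assumed tokenization: runs of letters and digits, lowercased.
        words.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", text))
    return Counter(words)


def all_words_unseen(word_counts, training_vocabulary):
    """True when the host has words and none of them occurs in training.

    In that case the text infers the sample to be malicious directly,
    without consulting the naive Bayes probabilities.
    """
    return bool(word_counts) and not any(
        w in training_vocabulary for w in word_counts
    )
```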

Preferably, the packet length sequence feature classifier is specifically described as follows:

packet length sequence feature construction: for each host, extracting the packet length sequence formed by the first 1000 packets of its communication, and padding with 0 when there are fewer than 1000 packets;

model selection: processing the packet length sequence feature with a TextCNN convolutional neural network classifier; the length of each packet is treated as a word, so that the packet length sequence generated by each host's communication is equivalent to a sentence; on the training set, the packet length sequence passes in turn through a word embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a SoftMax layer, and the parameters of all layers are then updated jointly by gradient descent, these layers together forming the TextCNN convolutional neural network classifier; on the test set, the TextCNN convolutional neural network classifier outputs a probability of being malicious for each sample, and when the probability >= 0.5, the classifier judges the sample to be malicious.
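The truncate-and-pad step that produces the fixed-length input for the network can be sketched as below; the function and parameter names are hypothetical, and the TextCNN itself is not shown.

```python
def packet_length_sequence(lengths, size=1000, pad_value=0):
    """Keep the first `size` packet lengths and pad the rest with zeros.

    Each entry then plays the role of a word index for an embedding layer.
    """
    seq = list(lengths)[:size]
    return seq + [pad_value] * (size - len(seq))
```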

Preferably, the TCP connection state feature classifier is specifically described as follows:

TCP connection state feature construction: for each host, sorting the TLS encrypted flows in chronological order, and then parsing the TCP connection state of each flow;

there are 14 TCP connection states, defined as follows:

S0: the client attempts to connect, but receives no reply;

S1: the connection is established but not terminated;

SF: the connection is established and terminated normally;

REJ: the client attempts to connect, but the server rejects the attempt;

S2: the connection is established and the client attempts to close it, but the server does not reply;

S3: the connection is established and the server attempts to close it, but the client does not reply;

RSTO: the connection is established and the client aborts it by sending an RST;

RSTR: the server sends an RST;

RSTOS0: the client sends a SYN and an RST, but no SYN-ACK is seen from the server;

RSTOS0: the server sends a SYN-ACK and an RST, but the client does not send a SYN;

RSTRH: the server sends a SYN-ACK and an RST, but the client does not send a SYN;

SH: the client sends a SYN and a FIN, but no SYN-ACK is seen from the server;

SHR: the server sends a SYN-ACK and a FIN, but the client does not send a SYN;

OTH: no SYN is seen, only midstream traffic;

then building a Markov random field transition matrix (MRFTM) over the TCP connection states, wherein the MRFTM is a 14 x 14 two-dimensional matrix and MRFTM[i, j] is the number of transitions from the i-th state to the j-th state; finally, row-normalizing the MRFTM and reshaping it into a 1 x 196 one-dimensional vector, which is the TCP connection state feature;

model selection: processing the TCP connection state feature with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.
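The transition-count matrix, its row normalization and the flattening into a single vector might be implemented along these lines. The list of state labels and its ordering are passed in by the caller and are assumptions of this sketch, as is the choice to leave rows without any transition as all zeros; with 14 states the output has 196 entries.

```python
def tcp_state_feature(state_sequence, states):
    """Flattened, row-normalized transition matrix over connection states.

    state_sequence: connection states of one host's TLS flows, already
    sorted by time. states: ordered list of all possible state labels.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    matrix = [[0.0] * n for _ in range(n)]
    # Count transitions between consecutive flows.
    for prev, cur in zip(state_sequence, state_sequence[1:]):
        matrix[index[prev]][index[cur]] += 1
    # Row normalization; rows with no outgoing transition stay zero.
    for row in matrix:
        total = sum(row)
        if total > 0:
            row[:] = [v / total for v in row]
    # Reshape n x n into a 1 x (n * n) vector.
    return [v for row in matrix for v in row]
```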

Preferably, the flow feature classifier is specifically described as follows:

flow feature construction: for each TLS flow of each host, extracting the one-hot encoding of the TCP connection state, the flow duration, the total number of sent packets, the total number of received packets, the ratio of the number of received packets to the number of sent packets, the max/min/total/mean/variance of the byte counts of all sent packets, the max/min/total/mean/variance of the byte counts of all received packets, the max/min/total/mean/variance of the byte counts of all packets, the max/min/total/mean/variance of the time intervals between sent packets, the max/min/total/mean/variance of the time intervals between received packets, the max/min/mean/variance of the time intervals between all packets, the Markov chain over bucketed packet lengths, and the TLS certificate word frequency;

model selection: processing the flow features with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious; whenever any one flow of a host is judged to be malicious, the host is judged to be infected with malware.
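Two pieces of this classifier lend themselves to a short sketch: the repeated max/min/total/mean/variance summary and the rule that a single malicious flow marks the whole host as infected. The helper names, the use of population variance and the all-zero result for an empty list are assumptions; the patent does not say which variance estimator is used.

```python
import statistics


def summary_stats(values):
    """Return [max, min, total, mean, variance] for one list of numbers.

    Uses population variance; returns zeros for an empty list (assumption).
    """
    if not values:
        return [0.0] * 5
    return [
        max(values),
        min(values),
        sum(values),
        statistics.fmean(values),
        statistics.pvariance(values),
    ]


def host_infected_by_flows(flow_probabilities, threshold=0.5):
    """Host is infected as soon as any flow has probability >= threshold."""
    return any(p >= threshold for p in flow_probabilities)
```

`summary_stats` would be applied separately to sent packet sizes, received packet sizes, all packet sizes and the corresponding inter-packet time intervals.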

Preferably, the host feature classifier is specifically described as:

host feature construction: for each host, extracting individual features from the flow-level features of each TLS flow and aggregating them into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of self-signed flows, the number of flows with expired certificates, the count of each TCP connection state, and the number of flows with an Alert;

model selection: processing the host features with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.
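One possible way to aggregate per-flow records into the host-level vector is sketched below. The dictionary keys (`n_packets`, `mean_packet_length`, `self_signed`, `cert_expired`, `conn_state`, `has_alert`) are hypothetical names for the flow-level values, and reading "average packet length per flow" as the mean of the per-flow averages is an assumption.

```python
from collections import Counter


def host_features(flows, states):
    """Aggregate flow-level records of one host into a host-level vector.

    flows: list of dicts with the hypothetical keys named above.
    states: ordered list of TCP connection state labels to count.
    """
    n_flows = len(flows)
    total_packets = sum(f["n_packets"] for f in flows)
    state_counts = Counter(f["conn_state"] for f in flows)
    features = [
        total_packets,
        total_packets / n_flows if n_flows else 0.0,
        sum(f["mean_packet_length"] for f in flows) / n_flows if n_flows else 0.0,
        sum(1 for f in flows if f["self_signed"]),
        sum(1 for f in flows if f["cert_expired"]),
    ]
    # One count per TCP connection state, in the given order.
    features.extend(state_counts.get(s, 0) for s in states)
    # Number of flows in which a TLS Alert was seen.
    features.append(sum(1 for f in flows if f["has_alert"]))
    return features
```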

Preferably, the malware encrypted traffic detection model determines whether the host is infected with malware by majority voting among the plurality of feature classifiers, specifically:

when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
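The voting rule can be written down in a few lines. This is a sketch with hypothetical names; it takes one boolean per classifier, where `True` means that the classifier judged the host malicious.

```python
def detect_host(votes):
    """Majority vote over the 7 feature classifiers.

    votes: list of 7 booleans, True if a classifier judged the host
    malicious. The host is normal only when more than 3 classifiers
    judge it normal; otherwise it is reported as infected.
    """
    if len(votes) != 7:
        raise ValueError("expected one vote from each of the 7 classifiers")
    normal_votes = sum(1 for v in votes if not v)
    return "normal" if normal_votes > 3 else "infected"
```

Because there are 7 voters, ties cannot occur: 4 or more normal votes yield "normal", and 4 or more malicious votes yield "infected".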

Compared with the prior art, the invention has the following advantages and beneficial effects:

the invention uses 7 classifiers for detection, so that the classifiers compensate for one another's errors, which gives higher robustness and addresses the low detection rate and high false alarm rate of existing malware traffic detection systems; compared with deep packet inspection (DPI), the invention does not need to decrypt encrypted packets, detects malicious encrypted traffic using only the observable characteristics of the packets, and achieves a high detection rate and a low false alarm rate.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a malware traffic detection method based on ensemble learning according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a detection model according to an embodiment of the present invention;

Detailed Description

In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings, it should be understood that the drawings are for illustrative purposes only and are not to be construed as limiting the patent. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Examples

As shown in FIG. 1, the present embodiment is a malware traffic detection method based on ensemble learning, and the method includes the following steps:

S1, collecting an encrypted traffic sample set, where the encrypted traffic sample set includes a plurality of heterogeneous features, specifically: packet length distribution features, server IP address features, certificate word frequency features, packet length sequence features, TCP connection state features, flow features and host features;

S2, constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier, and the feature construction and model selection processes of the 7 classifiers are as follows:

Classifier 1, the packet length distribution feature classifier: for each host, the number of packets for each length and direction is extracted, and each count is divided by the total number of packets to obtain a probability distribution; this probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet with a given direction and a given length. The packet length distribution feature is processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.

Classifier 2, the server IP address feature classifier: for each host, all accessed server IP addresses are one-hot encoded, where a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents one server IP address. The server IP address feature is processed with a naive Bayes classifier. On the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class. On the test set, the conditional probabilities are used to compute the probability that each sample is malicious; if the probability >= 0.5, the sample is considered malicious, otherwise it is considered benign.

Classifier 3, the certificate word frequency feature classifier: for each host, the X.509 certificate chains of all received TLS flows are extracted to obtain the words contained in the certificate subjects and issuers, and the occurrences of each word are counted; each dimension of the feature represents the number of occurrences of one word. The certificate word frequency feature is processed with a naive Bayes classifier. On the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class. On the test set, the conditional probabilities are used to compute the probability that each sample is malicious; if the probability >= 0.5, the sample is considered malicious, otherwise it is considered benign. If none of the words in a sample's certificates appears in the training set, the sample is directly inferred to be malicious.

Classifier 4, the packet length sequence feature classifier: for each host, the packet length sequence formed by the first 1000 packets of its communication is extracted and padded with 0 when there are fewer than 1000 packets. The packet length sequence feature is processed with a TextCNN convolutional neural network classifier. The length of each packet is treated as a word, so the packet length sequence generated by each host's communication is equivalent to a sentence. On the training set, the parameters of the word embedding layer, the convolutional layer, the pooling layer, the fully connected layer and the SoftMax layer are updated jointly by gradient descent, and these layers together form the TextCNN convolutional neural network classifier. On the test set, the TextCNN convolutional neural network classifier outputs a probability of being malicious for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.

Classifier 5, the TCP connection state feature classifier: for each host, the TLS encrypted flows are sorted in chronological order and their TCP connection states are then parsed, the TCP connection states being defined as:

S0: the client attempts to connect, but receives no reply;

S1: the connection is established but not terminated;

S2: the connection is established and the client attempts to close it (but the server does not reply);

S3: the connection is established and the server attempts to close it (but the client does not reply);

SF: the connection is established and terminated normally;

SH: the client sends a SYN and a FIN, but no SYN-ACK is seen from the server (half-open connection);

SHR: the server sends a SYN-ACK and a FIN, but the client does not send a SYN;

REJ: the client attempts to connect, but the server rejects the attempt;

RSTO: the connection is established and the client aborts it (by sending an RST);

RSTR: the server sends an RST;

RSTRH: the server sends a SYN-ACK and an RST, but the client does not send a SYN;

OTH: no SYN is seen, only midstream traffic (partial connection);

then a Markov random field transition matrix (MRFTM) is built over the TCP connection states, where the MRFTM is a 14 x 14 two-dimensional matrix and MRFTM[i, j] is the number of transitions from the i-th state to the j-th state; finally, the MRFTM is row-normalized and reshaped into a 1 x 196 one-dimensional vector, which is the TCP connection state feature. The TCP connection state feature is processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.

Classifier 6, the flow feature classifier: for each TLS flow of each host, the following flow-level features are extracted: the one-hot encoding of the TCP connection state, the flow duration, the total number of sent packets, the total number of received packets, the ratio of the number of received packets to the number of sent packets, the max/min/total/mean/variance of the byte counts of all sent packets, the max/min/total/mean/variance of the byte counts of all received packets, the max/min/total/mean/variance of the byte counts of all packets, the max/min/total/mean/variance of the time intervals between sent packets, the max/min/total/mean/variance of the time intervals between received packets, the max/min/mean/variance of the time intervals between all packets, the Markov chain over bucketed packet lengths, and the TLS certificate word frequency. The flow features are processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious. Whenever any one flow of a host is judged to be malicious, the host is judged to be infected with malware.

Classifier 7, the host feature classifier: for each host, individual features are extracted from the flow-level features of each TLS flow and aggregated into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of self-signed flows, the number of flows with expired certificates, the count of each TCP connection state, and the number of flows with an Alert. The host features are processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the samples are predicted with the plurality of CART decision trees, and a probability of being malicious is output for each sample; when the probability >= 0.5, the classifier judges the sample to be malicious.

S3, model integration: as shown in FIG. 2, the 7 constructed classifiers are integrated into one detection model, and whether a host is infected with malware is determined from the detection result of the model, specifically: when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.

To further verify the detection performance of the invention, corresponding experiments were carried out; the training set and the test set are summarized in Table 1, and the validation results on the test set are shown in Table 2.

The following criteria are defined:

final score = detection rate - false alarm rate;

detection rate = number of infected hosts judged to be infected / number of infected hosts;

false alarm rate = number of benign hosts judged to be infected / number of benign hosts.
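For concreteness, the three criteria could be computed as in the sketch below. It reads the detection rate as the fraction of infected hosts that are judged infected and the false alarm rate as the fraction of benign hosts that are judged infected; that reading, the function name and the 0/1 label encoding are assumptions of this sketch.

```python
def evaluate(y_true, y_pred):
    """Return (detection_rate, false_alarm_rate, final_score).

    y_true, y_pred: per-host labels, 1 for infected and 0 for benign.
    Both classes must be present in y_true.
    """
    infected = sum(1 for t in y_true if t == 1)
    benign = len(y_true) - infected
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    detection_rate = true_pos / infected
    false_alarm_rate = false_pos / benign
    return detection_rate, false_alarm_rate, detection_rate - false_alarm_rate
```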

Table 1: training set and test set overview

Table 2: test set validation results

In summary, the invention is an ensemble-learning-based malware traffic detection method which, by extracting a plurality of heterogeneous features and constructing 7 classifiers, makes maximal use of the features that differ markedly between malicious traffic and benign traffic to distinguish the two kinds of traffic, thereby identifying malware traffic without decrypting the traffic data.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. The method for detecting the malicious software encrypted flow based on ensemble learning is characterized by comprising the following steps of:

2. The ensemble learning based malware encryption traffic detection method according to claim 1, wherein the packet length distribution feature classifier is specifically described as follows:

3. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the server-side IP address feature classifier is specifically described as:

4. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the certificate word frequency feature classifier is specifically described as:

5. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the packet length sequence feature classifier is specifically described as:

6. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the TCP connection state feature classifier is specifically described as:

there are 14 TCP connection states, defined as follows:

S0: the client attempts to connect, but receives no reply;

S1: the connection is established but not terminated;

SF: the connection is established and terminated normally;

REJ: the client attempts to connect, but the server rejects the attempt;

RSTO: the connection is established and the client aborts it by sending an RST;

RSTR: the server sends an RST;

RSTRH: the server sends a SYN-ACK and an RST, but the client does not send a SYN;

SH: the client sends a SYN and a FIN, but no SYN-ACK is seen from the server;

SHR: the server sends a SYN-ACK and a FIN, but the client does not send a SYN;

OTH: no SYN is seen, only midstream traffic;

7. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the traffic feature classifier is specifically described as:

8. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the host feature classifier is specifically described as:

9. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the malware encryption traffic detection model determines whether the host is infected with malware by using majority voting of a plurality of feature classifiers, specifically: