CN113704762A - Malicious software encrypted flow detection method based on ensemble learning - Google Patents
Malicious software encrypted flow detection method based on ensemble learning
- Publication number
- CN113704762A (application number CN202111024464.2A)
- Authority
- CN
- China
- Prior art keywords
- classifier
- malicious
- host
- sample
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Probability & Statistics with Applications (AREA)
- Virology (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a method for detecting encrypted malware traffic based on ensemble learning, comprising the following steps: collecting an encrypted traffic sample set that comprises a plurality of heterogeneous features; constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set; and constructing an encrypted malware traffic detection model from the plurality of feature classifiers, wherein the detection model judges whether a host is infected with malware by a majority vote of the feature classifiers. The invention addresses the low detection rate and high false alarm rate of existing malware traffic detection systems. Unlike deep packet inspection (DPI), it does not need to decrypt encrypted packets: it detects malicious encrypted traffic using only the observable characteristics of the packets, and achieves a high detection rate with a low false alarm rate.
Description
Technical Field
The invention relates to the technical field of malware traffic detection, and in particular to a method for detecting encrypted malware traffic based on ensemble learning.
Background
Malware is software designed to damage computer systems, and it is one of the most serious threats to information security today. In addition to detection methods based on PE (Portable Executable) files, detection based on the traffic that malware generates is also effective. TLS is a cryptographic protocol that provides privacy for application communication. In recent years, with the widespread adoption of TLS, the volume of encrypted traffic on the Internet has kept growing; meanwhile, the number of malware attacks that propagate or communicate over encrypted HTTP traffic has also risen sharply. Encryption protects user privacy but also introduces security risks: malicious traffic can hide within encrypted traffic, which causes a range of security problems.
Identifying whether encrypted traffic is benign or malicious is a significant challenge. Because network infrastructure security is critical, detection must achieve both a high true positive rate (TPR) and a low false positive rate (FPR). Traditional detection methods for unencrypted traffic are difficult to apply to encrypted traffic, because encryption defeats deep packet inspection (DPI) and pattern matching. Traditional signature-based methods can detect only attacks for which a signature already exists, so they cannot detect new attacks; moreover, the encrypted payload cannot be observed directly, and the volume of encrypted payloads is huge. It is therefore necessary to provide an automatic malware traffic detection method that combines domain knowledge with machine learning in order to protect information security.
Disclosure of Invention
The main purpose of the invention is to overcome the shortcomings of the prior art by providing a method for detecting encrypted malware traffic based on ensemble learning, which does not need to decrypt encrypted packets and detects malicious encrypted traffic using only the observable characteristics of the packets.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention provides a method for detecting encrypted malware traffic based on ensemble learning, comprising the following steps:
collecting an encrypted traffic sample set, wherein the encrypted traffic sample set comprises a plurality of heterogeneous features, specifically: a packet length distribution feature, a server IP address feature, a certificate word frequency feature, a packet length sequence feature, a TCP connection state feature, flow features and host features;
constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier;
and constructing an encrypted malware traffic detection model from the plurality of feature classifiers, wherein the detection model judges whether a host is infected with malware by a majority vote of the plurality of feature classifiers.
Preferably, the packet length distribution feature classifier is specifically described as follows:
Packet length distribution feature construction: for each host, count the number of packets for each combination of length and direction, and divide each count by the total number of packets to obtain a probability distribution; this probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet with a certain direction and a certain length;
Model selection: the packet length distribution feature is processed by a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
Preferably, the server IP address feature classifier is specifically described as follows:
Server IP address feature construction: for each host, one-hot encode all server IP addresses that the host accessed, wherein a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents one server IP address;
Model selection: the server IP address feature is processed by a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, separately for each feature dimension, its conditional probability given each class; on the test set, these conditional probabilities are used to compute the probability that each sample is malicious, and if the probability >= 0.5 the sample is considered malicious, otherwise it is considered benign.
Preferably, the certificate word frequency feature classifier is specifically described as follows:
Certificate word frequency feature construction: for each host, extract the X.509 certificate chains of all received TLS flows to obtain the words contained in the certificate subjects and issuers, and count the occurrences of each word, wherein each dimension of the feature represents the number of occurrences of one word;
Model selection: the certificate word frequency feature is processed by a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, separately for each feature dimension, its conditional probability given each class; on the test set, these conditional probabilities are used to compute the probability that each sample is malicious, and if the probability >= 0.5 the sample is considered malicious, otherwise it is considered benign; if none of the words in a sample's certificates appears in the training set, the sample is directly inferred to be malicious.
Preferably, the packet length sequence feature classifier is specifically described as follows:
Packet length sequence feature construction: extract the packet length sequence formed by the first 1000 packets of each host's communication, and pad sequences shorter than 1000 packets with 0;
Model selection: the packet length sequence feature is processed by a TextCNN convolutional neural network classifier; the length of each packet is treated as a word, so the packet length sequence generated by each host's communication is equivalent to a sentence; on the training set, the packet length sequence passes in turn through a word embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a SoftMax layer, and the parameters of all layers are updated jointly by gradient descent, so that these layers form the TextCNN convolutional neural network classifier; on the test set, the TextCNN convolutional neural network classifier outputs a probability that each sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
Preferably, the TCP connection status feature classifier is specifically described as:
TCP connection state feature construction: for each host, sort the TLS encrypted flows in chronological order, and then parse the TCP connection state of each flow;
there are 14 TCP connection states, defined as follows:
S0: the client attempted to connect, but no reply was seen;
S1: the connection was established but not terminated;
SF: the connection was established and terminated normally;
REJ: the client attempted to connect, but the server refused;
S2: a connection was established and the client attempted to close it, but the server did not reply;
S3: a connection was established and the server attempted to close it, but the client did not reply;
RSTO: a connection was established and the client aborted it by sending a RST;
RSTR: the server sent a RST;
RSTOS0: the client sent a SYN and a RST, but no SYN-ACK was seen from the server;
RSTOS0: the server sent a SYN-ACK and a RST, but no SYN was seen from the client;
RSTRH: the server sent a SYN-ACK and a RST, but no SYN was seen from the client;
SH: the client sent a SYN and a FIN, but no SYN-ACK was seen from the server;
SHR: the server sent a SYN-ACK and a FIN, but no SYN was seen from the client;
OTH: no SYN was seen, only midstream traffic;
then a Markov random field transition matrix (MRFTM) is built for the TCP connection states, wherein the MRFTM is a 14 x 14 two-dimensional matrix and MRFTM[i, j] is the number of transitions from the i-th state to the j-th state; finally, the MRFTM is row-normalized and reshaped into a 1 x 196 one-dimensional vector, which is the TCP connection state feature;
Model selection: the TCP connection state feature is processed by a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
Preferably, the flow characteristic classifier is specifically described as:
Flow feature construction: for each TLS flow of each host, extract the one-hot encoding of the TCP connection state, the flow duration, the total number of sent packets, the total number of received packets, the ratio of received to sent packets, the max/min/total/mean/variance of the bytes of all sent packets, the max/min/total/mean/variance of the bytes of all received packets, the max/min/total/mean/variance of the bytes of all packets, the max/min/total/mean/variance of the time intervals between sent packets, the max/min/total/mean/variance of the time intervals between received packets, the max/min/mean/variance of the time intervals between all packets, the Markov chain over bucketed packet lengths, and the TLS certificate word frequency;
Model selection: the flow features are processed by a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious; if any one of a host's flows is judged to be malicious, the host is judged to be infected with malware.
Preferably, the host feature classifier is specifically described as:
Host feature construction: for each host, extract individual features from the flow-level features of each TLS flow and aggregate them into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of flows with self-signed certificates, the number of flows with expired certificates, counts of each TCP connection state, and the number of flows containing an Alert;
Model selection: the host features are processed by a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
Preferably, the malware encryption traffic detection model determines whether the host is infected with malware by using majority voting of the plurality of feature classifiers, and specifically includes:
when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The invention uses 7 classifiers for detection, so that the classifiers compensate for one another's errors and the result is more robust, which addresses the low detection rate and high false alarm rate of existing malware traffic detection systems. Unlike deep packet inspection (DPI), the invention does not need to decrypt encrypted packets: it detects malicious encrypted traffic using only the observable characteristics of the packets, and achieves a high detection rate with a low false alarm rate.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a malware traffic detection method based on ensemble learning according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a detection model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the embodiments of the present invention and the accompanying drawings, it should be understood that the drawings are for illustrative purposes only and are not to be construed as limiting the patent. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Examples
As shown in Fig. 1, the present embodiment is a malware traffic detection method based on ensemble learning, and the method includes the following steps:
S1, collecting an encrypted traffic sample set, where the encrypted traffic sample set includes a plurality of heterogeneous features, specifically: a packet length distribution feature, a server IP address feature, a certificate word frequency feature, a packet length sequence feature, a TCP connection state feature, flow features and host features;
S2, constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier, and the feature construction and model selection processes of the 7 classifiers are as follows:
Classifier 1, the packet length distribution feature classifier: for each host, count the number of packets for each combination of length and direction, and divide each count by the total number of packets to obtain a probability distribution; this probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet with a certain direction and a certain length. The packet length distribution feature is processed by a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
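The feature construction for classifier 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name, the encoding of direction as 0 (sent) or 1 (received), and the clipping of lengths at an assumed maximum of 1500 bytes are all assumptions not stated in the text.

```python
from collections import Counter

def packet_length_distribution(packets, max_len=1500):
    """Build a per-host packet length distribution vector.

    packets: iterable of (direction, length) pairs, where direction is
    0 for sent and 1 for received (an assumed encoding).
    max_len: assumed upper bound; longer packets are clipped to it.
    Returns a list with one probability per (direction, length) pair.
    """
    counts = Counter()
    total = 0
    for direction, length in packets:
        counts[(direction, min(length, max_len))] += 1
        total += 1

    width = max_len + 1                    # lengths 0..max_len
    vector = [0.0] * (2 * width)           # two directions
    if total == 0:
        return vector                      # host with no packets
    for (direction, length), n in counts.items():
        vector[direction * width + length] = n / total
    return vector

# Usage: two sent packets of 100 bytes, two received packets.
vec = packet_length_distribution([(0, 100), (0, 100), (1, 1500), (1, 40)])
print(vec[100], vec[1501 + 40])
```

The resulting vector sums to 1 for any host with at least one packet, which is what makes it a probability distribution suitable as classifier input.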
Classifier 2, the server IP address feature classifier: for each host, one-hot encode all server IP addresses that the host accessed, wherein a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents one server IP address. The server IP address feature is processed by a naive Bayes classifier. On the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, separately for each feature dimension, its conditional probability given each class. On the test set, these conditional probabilities are used to compute the probability that each sample is malicious; if the probability >= 0.5 the sample is considered malicious, otherwise it is considered benign.
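The encoding step for classifier 2 might look like the sketch below. The helper names are hypothetical, the addresses are documentation-range placeholders, and ignoring server IPs that never occurred in the training set is an assumption; the text does not say how unseen addresses are handled.

```python
def build_ip_index(training_hosts):
    """Map every server IP seen in training to a fixed vector position.

    training_hosts: dict of host id -> iterable of accessed server IPs.
    """
    all_ips = sorted({ip for ips in training_hosts.values() for ip in ips})
    return {ip: position for position, ip in enumerate(all_ips)}

def encode_server_ips(accessed_ips, ip_index):
    """Return the 0/1 vector for one host: 1 where the IP was accessed."""
    vector = [0] * len(ip_index)
    for ip in accessed_ips:
        position = ip_index.get(ip)
        if position is not None:   # assumption: unseen IPs are dropped
            vector[position] = 1
    return vector

# Usage with placeholder addresses.
index = build_ip_index({
    "hostA": ["192.0.2.1", "198.51.100.3"],
    "hostB": ["192.0.2.7"],
})
print(encode_server_ips(["192.0.2.7", "203.0.113.9"], index))
```

Each host's vector has the same length and ordering, so the vectors can be fed directly to a naive Bayes classifier over binary features.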
Classifier 3, the certificate word frequency feature classifier: for each host, extract the X.509 certificate chains of all received TLS flows to obtain the words contained in the certificate subjects and issuers, and count the occurrences of each word, wherein each dimension of the feature represents the number of occurrences of one word. The certificate word frequency feature is processed by a naive Bayes classifier. On the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, separately for each feature dimension, its conditional probability given each class. On the test set, these conditional probabilities are used to compute the probability that each sample is malicious; if the probability >= 0.5 the sample is considered malicious, otherwise it is considered benign. If none of the words in a sample's certificates appears in the training set, the sample is directly inferred to be malicious.
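A compact sketch of classifier 3 follows, including the rule that a sample whose certificate words are all unseen is treated as malicious. The tokenizer, the class name, the multinomial formulation and the Laplace (add-one) smoothing are assumptions chosen for illustration; the text specifies only a naive Bayes classifier over word counts. The sketch also assumes both classes occur in the training data.

```python
import math
import re
from collections import Counter

def certificate_words(subject_and_issuer_strings):
    """Count lower-cased alphanumeric words in subject/issuer strings."""
    words = []
    for text in subject_and_issuer_strings:
        words.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", text))
    return Counter(words)

class CertWordNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, samples, labels):
        # samples: list of Counter; labels: 1 = malicious, 0 = benign.
        self.vocab = set()
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for counts, label in zip(samples, labels):
            self.word_counts[label].update(counts)
            self.vocab.update(counts)
        return self

    def malicious_probability(self, counts):
        known = {w: n for w, n in counts.items() if w in self.vocab}
        if not known:
            return 1.0  # rule from the text: all words unseen -> malicious
        n_samples = sum(self.class_counts.values())
        log_post = {}
        for label in (0, 1):
            total = sum(self.word_counts[label].values())
            score = math.log(self.class_counts[label] / n_samples)
            for word, n in known.items():
                likelihood = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                score += n * math.log(likelihood)
            log_post[label] = score
        peak = max(log_post.values())          # subtract for numerical stability
        benign = math.exp(log_post[0] - peak)
        malicious = math.exp(log_post[1] - peak)
        return malicious / (benign + malicious)
```

A sample is then judged malicious when `malicious_probability(...) >= 0.5`, matching the threshold used by the other classifiers.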
Classifier 4, the packet length sequence feature classifier: extract the packet length sequence formed by the first 1000 packets of each host's communication, and pad sequences shorter than 1000 packets with 0. The packet length sequence feature is processed by a TextCNN convolutional neural network classifier. The length of each packet is treated as a word, so the packet length sequence generated by each host's communication is equivalent to a sentence. On the training set, the packet length sequence passes in turn through a word embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a SoftMax layer, and the parameters of all layers are updated jointly by gradient descent, so that these layers form the TextCNN convolutional neural network classifier. On the test set, the TextCNN convolutional neural network classifier outputs a probability that each sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
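The input preparation for classifier 4 amounts to truncating or zero-padding each host's packet lengths to a fixed size of 1000, as in the sketch below. The function name is hypothetical, and the TextCNN itself is not shown; the padded list would serve as the sequence of "word" indices fed to the embedding layer.

```python
SEQUENCE_LENGTH = 1000  # first 1000 packets, as stated in the text

def packet_length_sequence(lengths, sequence_length=SEQUENCE_LENGTH):
    """Keep the first `sequence_length` packet lengths, zero-padding the rest."""
    sequence = list(lengths)[:sequence_length]
    return sequence + [0] * (sequence_length - len(sequence))

# Usage: a host that produced only three packets.
padded = packet_length_sequence([60, 1500, 52])
print(len(padded), padded[:5])
```

Because 0 is used for padding, it acts as a reserved "no packet" token in the vocabulary of packet lengths.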
Classifier 5, the TCP connection state feature classifier: for each host, the TLS encrypted flows are sorted in chronological order and the TCP connection state of each flow is then parsed, the TCP connection states being defined as:
S0: the client attempted to connect, but no reply was seen;
S1: the connection was established but not terminated;
S2: a connection was established and the client attempted to close it (but the server did not reply);
S3: a connection was established and the server attempted to close it (but the client did not reply);
SF: the connection was established and terminated normally;
SH: the client sent a SYN and a FIN, but no SYN-ACK was seen from the server (half-open connection);
SHR: the server sent a SYN-ACK and a FIN, but no SYN was seen from the client;
REJ: the client attempted to connect, but the server refused;
RSTO: a connection was established and the client aborted it (by sending a RST);
RSTR: the server sent a RST;
RSTOS0: the client sent a SYN and a RST, but no SYN-ACK was seen from the server;
RSTOS0: the server sent a SYN-ACK and a RST, but no SYN was seen from the client;
RSTRH: the server sent a SYN-ACK and a RST, but no SYN was seen from the client;
OTH: no SYN was seen, only midstream traffic (partial connection);
Then a Markov random field transition matrix (MRFTM) is built for the TCP connection states, wherein the MRFTM is a 14 x 14 two-dimensional matrix and MRFTM[i, j] is the number of transitions from the i-th state to the j-th state. Finally, the MRFTM is row-normalized and reshaped into a 1 x 196 one-dimensional vector, which is the TCP connection state feature. The TCP connection state feature is processed by a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
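The transition matrix step can be sketched as below: count transitions between consecutive flow states, normalize each row, and flatten. The function name is hypothetical and the state list is passed in by the caller; with the 14 states of the text the output has 196 dimensions, while the usage example uses a 3-state subset purely to keep the output readable. Leaving all-zero rows unchanged during normalization is an assumption.

```python
def transition_feature(state_sequence, state_names):
    """Row-normalized state transition matrix, flattened row by row.

    state_sequence: connection states of one host's flows in time order.
    state_names: ordered list of all possible states (defines the indices).
    """
    index = {name: i for i, name in enumerate(state_names)}
    n = len(state_names)
    matrix = [[0.0] * n for _ in range(n)]

    # matrix[i][j] counts transitions from state i to state j.
    for previous, current in zip(state_sequence, state_sequence[1:]):
        matrix[index[previous]][index[current]] += 1

    # Row normalization; rows with no outgoing transitions stay all zero.
    for row in matrix:
        total = sum(row)
        if total:
            for j in range(n):
                row[j] /= total

    return [value for row in matrix for value in row]

# Usage with a 3-state subset, giving a 9-dimensional vector.
feature = transition_feature(["SF", "SF", "S0", "SF", "REJ"], ["S0", "SF", "REJ"])
print(feature)
```

After normalization, each non-empty row holds the empirical probability of moving from one state to each of the others.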
Classifier 6, the flow feature classifier: for each TLS flow of each host, the following flow-level features are extracted: the one-hot encoding of the TCP connection state, the flow duration, the total number of sent packets, the total number of received packets, the ratio of received to sent packets, the max/min/total/mean/variance of the bytes of all sent packets, the max/min/total/mean/variance of the bytes of all received packets, the max/min/total/mean/variance of the bytes of all packets, the max/min/total/mean/variance of the time intervals between sent packets, the max/min/total/mean/variance of the time intervals between received packets, the max/min/mean/variance of the time intervals between all packets, the Markov chain over bucketed packet lengths, and the TLS certificate word frequency. The flow features are processed by a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious. If any one of a host's flows is judged to be malicious, the host is judged to be infected with malware.
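Two parts of classifier 6 lend themselves to a short sketch: the max/min/total/mean/variance summaries and the rule that lifts flow-level verdicts to the host. The function names are hypothetical, only the byte statistics are shown, and the use of population variance and of zeros for empty inputs are assumptions not stated in the text.

```python
import statistics

def summarize(values):
    """Return [max, min, total, mean, variance] for a list of numbers."""
    if not values:
        return [0.0] * 5               # assumption: empty direction -> zeros
    return [
        max(values),
        min(values),
        sum(values),
        statistics.fmean(values),
        statistics.pvariance(values),  # assumption: population variance
    ]

def flow_byte_features(sent_lengths, received_lengths):
    """Byte statistics for sent, received, and all packets of one flow."""
    return (summarize(sent_lengths)
            + summarize(received_lengths)
            + summarize(sent_lengths + received_lengths))

def host_infected_by_flows(flow_probabilities, threshold=0.5):
    """Host is infected if any of its flows is judged malicious."""
    return any(p >= threshold for p in flow_probabilities)

# Usage: one flow's byte features, then the host-level decision.
print(flow_byte_features([100, 200, 300], [1500]))
print(host_infected_by_flows([0.1, 0.7, 0.2]))
```

The any-flow rule makes this classifier sensitive: a single malicious flow among many benign ones is enough to flag the host, which is one reason its vote is combined with six others.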
Classifier 7, the host feature classifier: for each host, extract individual features from the flow-level features of each TLS flow and aggregate them into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of flows with self-signed certificates, the number of flows with expired certificates, counts of each TCP connection state, and the number of flows containing an Alert. The host features are processed by a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the plurality of CART decision trees predict each sample and output a probability that the sample is malicious, and when the probability >= 0.5 the classifier judges the sample to be malicious.
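The aggregation for classifier 7 can be illustrated as follows. The per-flow dictionary keys are hypothetical names chosen for this sketch, and reading "average packet length per flow" as the mean over flows of each flow's average packet length is an assumption about an ambiguous phrase.

```python
from collections import Counter

def host_features(flows):
    """Aggregate flow-level records into host-level features.

    flows: list of dicts with the (hypothetical) keys
    packets, bytes, self_signed, cert_expired, conn_state, has_alert.
    """
    n_flows = len(flows)
    total_packets = sum(f["packets"] for f in flows)
    # Assumed reading: average each flow's mean packet length over flows.
    per_flow_mean_length = [
        f["bytes"] / f["packets"] for f in flows if f["packets"] > 0
    ]
    return {
        "total_packets": total_packets,
        "mean_packets_per_flow": total_packets / n_flows if n_flows else 0.0,
        "mean_packet_length_per_flow": (
            sum(per_flow_mean_length) / len(per_flow_mean_length)
            if per_flow_mean_length else 0.0
        ),
        "self_signed_flows": sum(1 for f in flows if f["self_signed"]),
        "expired_cert_flows": sum(1 for f in flows if f["cert_expired"]),
        "conn_state_counts": dict(Counter(f["conn_state"] for f in flows)),
        "alert_flows": sum(1 for f in flows if f["has_alert"]),
    }

# Usage with two illustrative flows.
print(host_features([
    {"packets": 10, "bytes": 1000, "self_signed": True,
     "cert_expired": False, "conn_state": "SF", "has_alert": False},
    {"packets": 30, "bytes": 6000, "self_signed": False,
     "cert_expired": True, "conn_state": "S0", "has_alert": True},
]))
```

Before training a random forest on these values, the connection state counts would need to be expanded into one fixed column per state so that every host yields a vector of the same length.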
S3, model integration: as shown in Fig. 2, the 7 constructed classifiers are integrated into one detection model, and whether a host is infected with malware is determined from the model's detection result. Specifically, when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
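The voting rule of step S3 is small enough to state directly in code. The sketch below assumes each classifier's output has already been reduced to a boolean (True for "malicious"); the function name is hypothetical.

```python
def host_is_infected(malicious_votes):
    """Majority vote over the 7 classifiers.

    malicious_votes: 7 booleans, True where a classifier judged the
    host malicious. The host is normal only if more than 3 classifiers
    judged it normal; otherwise it is infected.
    """
    if len(malicious_votes) != 7:
        raise ValueError("expected exactly 7 classifier votes")
    normal_votes = sum(1 for vote in malicious_votes if not vote)
    return not (normal_votes > 3)

# Usage: four classifiers say normal, three say malicious.
print(host_is_infected([False, False, False, False, True, True, True]))
```

With an odd number of classifiers there are no ties, so "more than 3 normal" and "at least 4 malicious" partition all possible outcomes.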
To further verify the detection rate of the invention, corresponding experiments were performed. The training set and the test set are shown in Table 1, and the verification results on the test set are shown in Table 2.
The following criteria are defined:
final score = detection rate - false alarm rate;
detection rate = number of infected hosts judged to be infected / number of infected hosts;
false alarm rate = number of benign hosts judged to be infected / number of benign hosts.
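A minimal sketch of how these criteria could be computed, assuming the detection rate counts infected hosts that were judged infected and the false alarm rate counts benign hosts that were judged infected. The function name `final_score` and the 0/1 label encoding are illustrative; the sample labels below are made up for the example and are not the patent's experimental data.

```python
def final_score(true_labels, predicted_labels):
    """Return (detection_rate, false_alarm_rate, final_score).

    Labels are 1 for infected and 0 for benign (hypothetical encoding).
    Both classes must be present, otherwise a rate is undefined.
    """
    pairs = list(zip(true_labels, predicted_labels))
    infected_total = sum(1 for t, _ in pairs if t == 1)
    benign_total = sum(1 for t, _ in pairs if t == 0)
    detected = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_alarms = sum(1 for t, p in pairs if t == 0 and p == 1)
    detection_rate = detected / infected_total
    false_alarm_rate = false_alarms / benign_total
    return detection_rate, false_alarm_rate, detection_rate - false_alarm_rate


# Toy labels: 4 infected hosts (3 detected), 4 benign hosts (1 false alarm).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
rates = final_score(truth, predicted)
```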
Table 1: training set and test set overview
Table 2: test set validation results
In summary, the invention is a malware traffic detection method based on ensemble learning. By extracting a plurality of heterogeneous features and constructing 7 classifiers, it makes maximal use of the features that differ clearly between malicious and benign traffic to distinguish the two kinds of traffic data, thereby identifying malware traffic without decrypting the traffic data.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described, but any combination should be considered within the scope of the present specification as long as the combined features do not contradict one another.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (9)
1. A malware encrypted traffic detection method based on ensemble learning, characterized by comprising the following steps:
collecting an encrypted traffic sample set, wherein the encrypted traffic sample set comprises a plurality of heterogeneous features, specifically: packet length distribution features, server IP address features, certificate word frequency features, packet length sequence features, TCP connection state features, flow features and host features;
constructing a plurality of corresponding feature classifiers based on the plurality of heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier;
and constructing a malware encrypted traffic detection model based on the plurality of feature classifiers, wherein the malware encrypted traffic detection model judges whether a host is infected with malware by majority voting of the plurality of feature classifiers.
2. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the packet length distribution feature classifier is specifically described as follows:
packet length distribution feature construction: for each host, counting the number of packets of each length in each direction, and dividing each count by the total number of packets to obtain a probability distribution, wherein the probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet having a certain direction and a certain length;
model selection: processing the packet length distribution feature using a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, samples are predicted using the plurality of CART decision trees, a probability of being malicious is output for each sample, and when the probability is ≥ 0.5, the classifier judges the sample to be malicious.
3. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the server IP address feature classifier is specifically described as follows:
server IP address feature construction: for each host, one-hot encoding all accessed server IP addresses, wherein a value of 1 indicates that the server IP address was accessed, a value of 0 indicates that it was not accessed, and each dimension of the feature represents a certain server IP address;
model selection: processing the server IP address feature using a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and calculates, for each dimension separately, its conditional probability given each class; on the test set, the probability that each sample is malicious is computed from these conditional probabilities, and if the probability is ≥ 0.5, the sample is considered malicious, otherwise it is considered benign.
4. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the certificate word frequency feature classifier is specifically described as follows:
certificate word frequency feature construction: for each host, extracting the X509 certificate chains of all received TLS flows to obtain the words contained in the certificate subjects and issuers, and counting the occurrences of each word, wherein each dimension of the feature represents the number of occurrences of one word;
model selection: processing the certificate word frequency feature using a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and calculates, for each dimension separately, its conditional probability given each class; on the test set, the probability that each sample is malicious is computed from these conditional probabilities, and if the probability is ≥ 0.5, the sample is considered malicious, otherwise it is considered benign; if none of the words in a sample's certificates appears in the training set, the sample is directly inferred to be malicious.
5. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the packet length sequence feature classifier is specifically described as follows:
packet length sequence feature construction: extracting the packet length sequence formed by the first 1000 packets generated by the communication of each host, and padding sequences shorter than 1000 packets with 0;
model selection: processing the packet length sequence feature using a TextCNN convolutional neural network classifier; the length of each packet is treated as a word, so that the packet length sequence generated by the communication of each host is equivalent to a sentence; on the training set, the packet length sequence passes in turn through a word embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a SoftMax layer, and the parameters of all layers are updated jointly by gradient descent, these layers forming the TextCNN convolutional neural network classifier; on the test set, the TextCNN convolutional neural network classifier outputs a probability of being malicious for each sample, and when the probability is ≥ 0.5, the classifier judges the sample to be malicious.
6. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the TCP connection state feature classifier is specifically described as follows:
TCP connection state feature construction: for each host, sorting the TLS encrypted flows in chronological order and then parsing the TCP connection state of each TLS encrypted flow;
there are 14 TCP connection states, defined as follows:
S0: the client attempted to connect, but received no reply;
S1: the connection was established but not terminated;
SF: the connection was established and terminated normally;
REJ: the client attempted to connect, but the server rejected the attempt;
S2: the connection was established and the client attempted to close it, but the server did not reply;
S3: the connection was established and the server attempted to close it, but the client did not reply;
RSTO: the connection was established and the client aborted it by sending a RST;
RSTR: the server sent a RST;
RSTOS0: the client sent a SYN followed by a RST, but no SYN ACK was seen from the server;
RSTOS0: the server sent a SYN ACK followed by a RST, but the client did not send a SYN;
RSTRH: the server sent a SYN ACK followed by a RST, but the client did not send a SYN;
SH: the client sent a SYN followed by a FIN, but no SYN ACK was seen from the server;
SHR: the server sent a SYN ACK followed by a FIN, but the client did not send a SYN;
OTH: no SYN was seen, only midstream traffic;
then a Markov random field transition matrix (MRFTM) is built for the TCP connection states, wherein the MRFTM is a 14 x 14 two-dimensional matrix and MRFTM[i, j] represents the number of transitions from the i-th state to the j-th state; finally, the MRFTM is row-normalized and reshaped into a 1 x 196 one-dimensional vector, which is the TCP connection state feature;
model selection: processing the TCP connection state feature using a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, samples are predicted using the plurality of CART decision trees, a probability of being malicious is output for each sample, and when the probability is ≥ 0.5, the classifier judges the sample to be malicious.
7. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the flow feature classifier is specifically described as follows:
flow feature construction: for each TLS flow of each host, extracting the one-hot encoding of the TCP connection state, flow duration, total number of sent packets, total number of received packets, ratio of received to sent packets, max/min/total/mean/variance of the bytes of all sent packets, max/min/total/mean/variance of the bytes of all received packets, max/min/total/mean/variance of the bytes of all packets, max/min/total/mean/variance of the time intervals between sent packets, max/min/total/mean/variance of the time intervals between received packets, max/min/mean/variance of the time intervals between all packets, the Markov chain over bucketed packet lengths, and TLS certificate word frequency;
model selection: processing the flow features using a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, samples are predicted using the plurality of CART decision trees, a probability of being malicious is output for each sample, and when the probability is ≥ 0.5, the classifier judges the sample to be malicious; whenever any one of a host's flows is judged to be malicious, the host is judged to be infected with malware.
8. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the host feature classifier is specifically described as follows:
host feature construction: for each host, extracting individual features from the flow-level features of each TLS flow and aggregating them to form host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of flows with self-signed certificates, the number of flows with expired certificates, per-state counts of TCP connection states, and the number of flows containing an Alert;
model selection: processing the host features using a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, samples are predicted using the plurality of CART decision trees, a probability of being malicious is output for each sample, and when the probability is ≥ 0.5, the classifier judges the sample to be malicious.
9. The ensemble learning-based malware encrypted traffic detection method according to claim 1, wherein the malware encrypted traffic detection model judges whether a host is infected with malware by majority voting of the plurality of feature classifiers, specifically:
when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
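For illustration only, and not as part of the claims, three of the feature constructions recited above (the packet length distribution of claim 2, the zero-padded packet length sequence of claim 5, and the row-normalized transition matrix of claim 6) might be sketched as below. All function names and the `(direction, length)` tuple layout are hypothetical. The state list is passed in as a parameter; with the 14 states of claim 6 the returned vector would have 196 entries, while the example uses a 3-state list to stay readable.

```python
from collections import Counter

SEQUENCE_LENGTH = 1000  # first 1000 packets, per claim 5


def packet_length_distribution(packets):
    """Map each (direction, length) pair to its share of the host's packets."""
    counts = Counter(packets)
    total = len(packets)
    return {key: count / total for key, count in counts.items()}


def pad_packet_lengths(lengths, size=SEQUENCE_LENGTH):
    """Keep the first `size` packet lengths and pad shorter sequences with 0."""
    kept = list(lengths[:size])
    return kept + [0] * (size - len(kept))


def mrftm_feature(state_sequence, states):
    """Count state-to-state transitions, row-normalize, and flatten.

    state_sequence: connection states of a host's flows in time order.
    states: the ordered list of possible states (14 in claim 6).
    """
    index = {state: i for i, state in enumerate(states)}
    n = len(states)
    matrix = [[0.0] * n for _ in range(n)]
    for current, following in zip(state_sequence, state_sequence[1:]):
        matrix[index[current]][index[following]] += 1
    for row in matrix:
        row_total = sum(row)
        if row_total:  # rows with no outgoing transitions stay all zero
            row[:] = [count / row_total for count in row]
    return [value for row in matrix for value in row]


distribution = packet_length_distribution(
    [("out", 100), ("out", 100), ("in", 1500), ("in", 60)]
)
padded = pad_packet_lengths([120, 60, 1500])
vector = mrftm_feature(["SF", "SF", "S0", "SF"], ["S0", "SF", "REJ"])
```

Leaving rows without outgoing transitions as zeros is a choice made for this sketch; the claims do not say how such rows are handled.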
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111024464.2A CN113704762B (en) | 2021-09-02 | 2021-09-02 | Malicious software encrypted flow detection method based on ensemble learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113704762A (en) | 2021-11-26 |
CN113704762B (en) | 2022-06-21 |
Family
ID=78657257
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111024464.2A Active CN113704762B (en) | 2021-09-02 | 2021-09-02 | Malicious software encrypted flow detection method based on ensemble learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113704762B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103324888A (en) * | 2012-03-19 | 2013-09-25 | 哈尔滨安天科技股份有限公司 | Method and system for automatically extracting virus characteristics based on family samples |
CN106031293A (en) * | 2014-10-31 | 2016-10-12 | 华为技术有限公司 | Data processing method, apparatus, terminal, mobility management entity, and system |
CN107153789A (en) * | 2017-04-24 | 2017-09-12 | 西安电子科技大学 | The method for detecting Android Malware in real time using random forest grader |
CN108833360A (en) * | 2018-05-23 | 2018-11-16 | 四川大学 | A kind of malice encryption flow identification technology based on machine learning |
CN110138849A (en) * | 2019-05-05 | 2019-08-16 | 哈尔滨英赛克信息技术有限公司 | Agreement encryption algorithm type recognition methods based on random forest |
US20190349403A1 (en) * | 2018-05-11 | 2019-11-14 | Cisco Technology, Inc. | Detecting targeted data exfiltration in encrypted traffic |
CN110572382A (en) * | 2019-09-02 | 2019-12-13 | 西安电子科技大学 | Malicious flow detection method based on SMOTE algorithm and ensemble learning |
CN110708341A (en) * | 2019-11-15 | 2020-01-17 | 中国科学院信息工程研究所 | User behavior detection method and system based on remote desktop encryption network traffic mode difference |
CN111310796A (en) * | 2020-01-19 | 2020-06-19 | 中山大学 | Web user click identification method facing encrypted network flow |
CN112104677A (en) * | 2020-11-23 | 2020-12-18 | 北京金睛云华科技有限公司 | Controlled host detection method and device based on knowledge graph |
Non-Patent Citations (6)
Title |
---|
JUN LI et al.: "Identifying Skype Traffic by Random Forest", 《2007 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING》, 8 October 2007 (2007-10-08) * |
TANGDA YU et al.: "An Encrypted Malicious Traffic Detection System Based on Neural Network", 《2019 INTERNATIONAL CONFERENCE ON CYBER-ENABLED DISTRIBUTED COMPUTING AND KNOWLEDGE DISCOVERY (CYBERC)》, 2 January 2020 (2020-01-02) * |
李树栋等: "基于Schnorr算法的不可否认签名方案", 《海军航空工程学院学报》, vol. 22, no. 4, 17 December 2007 (2007-12-17) * |
陈良臣等: "网络加密流量识别研究进展及发展趋势", 《信息网络安全》, vol. 2019, no. 3, 10 March 2019 (2019-03-10) * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114492623A (en) * | 2022-01-25 | 2022-05-13 | 电子科技大学 | Method and device for classifying Android malicious software |
CN114938290A (en) * | 2022-04-22 | 2022-08-23 | 北京天际友盟信息技术有限公司 | Information detection method, device and equipment |
CN114553605A (en) * | 2022-04-26 | 2022-05-27 | 中国矿业大学(北京) | Encrypted malicious flow detection method for voting strategy |
CN114726653A (en) * | 2022-05-24 | 2022-07-08 | 深圳市永达电子信息股份有限公司 | Abnormal flow detection method and system based on distributed random forest |
CN114726653B (en) * | 2022-05-24 | 2022-11-15 | 深圳市永达电子信息股份有限公司 | Abnormal flow detection method and system based on distributed random forest |
CN115174160B (en) * | 2022-06-16 | 2023-10-20 | 广州大学 | Malicious encryption traffic classification method and device based on stream level and host level |
CN115174160A (en) * | 2022-06-16 | 2022-10-11 | 广州大学 | Malicious encrypted traffic classification method and device based on stream level and host level |
CN115834097A (en) * | 2022-06-24 | 2023-03-21 | 电子科技大学 | HTTPS malicious software flow detection system and method based on multiple visual angles |
CN115834097B (en) * | 2022-06-24 | 2024-03-22 | 电子科技大学 | HTTPS malicious software flow detection system and method based on multiple views |
CN115632875A (en) * | 2022-11-29 | 2023-01-20 | 湖北省楚天云有限公司 | Malicious flow detection method and system based on multi-feature fusion and real-time analysis |
CN116055201B (en) * | 2023-01-16 | 2023-09-01 | 中国矿业大学(北京) | Multi-view encryption malicious traffic detection method based on collaborative training |
CN116055201A (en) * | 2023-01-16 | 2023-05-02 | 中国矿业大学(北京) | Multi-view encryption malicious traffic detection method based on collaborative training |
CN117521052A (en) * | 2024-01-04 | 2024-02-06 | 中国电信股份有限公司江西分公司 | Protection authentication method and device for server privacy, computer equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN113704762B (en) | 2022-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||