CN113704762A - Malicious software encrypted flow detection method based on ensemble learning - Google Patents

Malicious software encrypted flow detection method based on ensemble learning

Info

Publication number
CN113704762A
Authority
CN
China
Prior art keywords
classifier
malicious
host
sample
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111024464.2A
Other languages
Chinese (zh)
Other versions
CN113704762B (en)
Inventor
李树栋
赵传彧
吴晓波
韩伟红
方滨兴
田志宏
殷丽华
顾钊铨
仇晶
唐可可
李默涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202111024464.2A
Publication of CN113704762A
Application granted
Publication of CN113704762B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Virology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a malware encrypted traffic detection method based on ensemble learning, which comprises the following steps: collecting an encrypted traffic sample set, the encrypted traffic sample set comprising a plurality of heterogeneous features; constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set; and constructing a malware encrypted traffic detection model based on the plurality of feature classifiers, wherein the model determines whether a host is infected with malware by majority voting among the feature classifiers. The invention addresses the low detection rate and high false alarm rate of existing malware traffic detection systems. Compared with deep packet inspection (DPI), it does not need to decrypt encrypted data packets, can detect malicious encrypted traffic using only the observable characteristics of the packets, and achieves a high detection rate and a low false alarm rate.

Description

Malicious software encrypted flow detection method based on ensemble learning
Technical Field
The invention relates to the technical field of malware traffic detection, and in particular to a malware encrypted traffic detection method based on ensemble learning.
Background
Malware is software designed to damage computer systems and is one of the most serious threats to information security today. Besides detection methods based on PE files, detection based on the traffic that malware generates is also effective. TLS is an encryption protocol that provides privacy for applications. In recent years, as TLS has been widely adopted, encrypted traffic on the Internet has kept increasing; at the same time, the number of malware attacks that propagate or communicate over encrypted HTTP traffic has risen sharply. Encryption protects user privacy but also introduces security risks: malicious traffic can hide within encrypted traffic, causing a series of security problems.
Determining whether encrypted traffic is benign or malicious is a significant challenge. The importance of network infrastructure security places high demands on both the true positive rate (TPR) and the false positive rate (FPR) of detection. Traditional detection methods for unencrypted traffic are difficult to apply to encrypted traffic, because encryption defeats deep packet inspection (DPI) and pattern matching. Traditional signature-based methods can only detect attacks for which a signature already exists, so they cannot detect new attacks; moreover, encrypted payloads cannot be observed directly, and their volume is huge. It is therefore necessary to provide an automatic malware traffic detection method that combines domain knowledge with machine learning, in order to protect information security.
Disclosure of Invention
The main purpose of the invention is to overcome the shortcomings of the prior art and to provide a malware encrypted traffic detection method based on ensemble learning, which detects malicious encrypted traffic using only the observable characteristics of data packets, without decrypting the encrypted packets.
To achieve this purpose, the invention adopts the following technical solution:
The invention provides a malware encrypted traffic detection method based on ensemble learning, which comprises the following steps:
collecting an encrypted traffic sample set, wherein the encrypted traffic sample set comprises a plurality of heterogeneous features, specifically: packet length distribution features, server IP address features, certificate word frequency features, packet length sequence features, TCP connection state features, flow features and host features;
constructing a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier;
and constructing a malware encrypted traffic detection model based on the plurality of feature classifiers, wherein the model determines whether a host is infected with malware by majority voting among the feature classifiers.
Preferably, the packet length distribution feature classifier is specifically as follows:
Packet length distribution feature construction: for each host, count the number of packets for each combination of length and direction, and divide each count by the total number of packets to obtain a probability distribution; this probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet having a given direction and a given length;
Model selection: the packet length distribution features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious, and when the probability is >= 0.5 the classifier judges the sample to be malicious.
Preferably, the server IP address feature classifier is specifically as follows:
Server IP address feature construction: for each host, one-hot encode all server IP addresses it has accessed; a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents one server IP address;
Model selection: the server IP address features are processed with a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class; on the test set, these conditional probabilities are used to compute the probability that each sample is malicious; if the probability is >= 0.5 the sample is considered malicious, otherwise it is considered benign.
Preferably, the certificate word frequency feature classifier is specifically as follows:
Certificate word frequency feature construction: for each host, extract the X.509 certificate chains of all TLS flows it received, obtain the words contained in the certificate subjects and issuers, and count the occurrences of each word; each dimension of the feature represents the number of occurrences of one word;
Model selection: the certificate word frequency features are processed with a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class; on the test set, these conditional probabilities are used to compute the probability that each sample is malicious; if the probability is >= 0.5 the sample is considered malicious, otherwise it is considered benign; if none of the words in a sample's certificates appear in the training set, the sample is directly judged to be malicious.
Preferably, the packet length sequence feature classifier is specifically as follows:
Packet length sequence feature construction: for each host, extract the sequence of packet lengths of the first 1000 packets of its communication; if there are fewer than 1000 packets, pad the sequence with 0;
Model selection: the packet length sequence features are processed with a TextCNN convolutional neural network classifier; the length of each packet is treated as a word, so the packet length sequence generated by each host's communication is equivalent to a sentence; on the training set, the packet length sequence passes in turn through a word embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a SoftMax layer, and the parameters of all these layers are updated jointly by gradient descent; together these layers form the TextCNN convolutional neural network classifier; on the test set, the TextCNN classifier outputs for each sample the probability that it is malicious, and when the probability is >= 0.5 the classifier judges the sample to be malicious.
Preferably, the TCP connection state feature classifier is specifically as follows:
TCP connection state feature construction: for each host, sort its TLS encrypted flows in chronological order, and then parse the TCP connection state of each flow;
there are 14 TCP connection states, defined as follows:
S0: the client attempted to connect, but there was no reply;
S1: the connection was established but not terminated;
SF: the connection was established and terminated normally;
REJ: the client attempted to connect, but the server rejected the attempt;
S2: the connection was established and the client attempted to close it, but the server did not reply;
S3: the connection was established and the server attempted to close it, but the client did not reply;
RSTO: the connection was established and the client aborted it by sending a RST;
RSTR: the server sent a RST;
RSTOS0: the client sent a SYN followed by a RST, but no SYN-ACK was seen from the server;
RSTOS0: the server sent a SYN-ACK followed by a RST, but no SYN was seen from the client;
RSTRH: the server sent a SYN-ACK followed by a RST, but no SYN was seen from the client;
SH: the client sent a SYN followed by a FIN, but no SYN-ACK was seen from the server;
SHR: the server sent a SYN-ACK followed by a FIN, but no SYN was seen from the client;
OTH: no SYN was seen, only midstream traffic;
then a Markov random field transition matrix (MRFTM) is built over the TCP connection states, wherein the MRFTM is a 14 × 14 two-dimensional matrix and MRFTM[i, j] is the number of transitions from the i-th state to the j-th state; finally, the MRFTM is row-normalized and reshaped into a 1 × 196 one-dimensional vector, which is the TCP connection state feature;
Model selection: the TCP connection state features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious, and when the probability is >= 0.5 the classifier judges the sample to be malicious.
Preferably, the flow feature classifier is specifically as follows:
Flow feature construction: for each TLS flow of each host, extract the following: a one-hot encoding of the TCP connection state; the flow duration; the total number of sent packets; the total number of received packets; the ratio of the number of received packets to the number of sent packets; the max/min/total/mean/variance of the byte counts of all sent packets; the max/min/total/mean/variance of the byte counts of all received packets; the max/min/total/mean/variance of the byte counts of all packets; the max/min/total/mean/variance of the time intervals between sent packets; the max/min/total/mean/variance of the time intervals between received packets; the max/min/mean/variance of the time intervals between all packets; a Markov chain over bucketed packet lengths; and the TLS certificate word frequencies;
Model selection: the flow features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious, and when the probability is >= 0.5 the classifier judges the sample to be malicious; if any one of a host's flows is judged to be malicious, the host is judged to be infected with malware.
Preferably, the host feature classifier is specifically as follows:
Host feature construction: for each host, select individual features from the flow-level features of each of its TLS flows and aggregate them into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of flows with self-signed certificates, the number of flows with expired certificates, counts of each TCP connection state, and the number of flows containing an Alert;
Model selection: the host features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier; on the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious, and when the probability is >= 0.5 the classifier judges the sample to be malicious.
Preferably, the malware encrypted traffic detection model determines whether the host is infected with malware by majority voting among the plurality of feature classifiers, specifically:
when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The invention uses 7 classifiers for detection, so that the classifiers compensate for one another's errors and the method is more robust; it can address the low detection rate and high false alarm rate of existing malware traffic detection systems. Compared with deep packet inspection (DPI), the invention does not need to decrypt encrypted data packets, can detect malicious encrypted traffic using only the observable characteristics of the packets, and achieves a high detection rate and a low false alarm rate.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a malware traffic detection method based on ensemble learning according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a detection model according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the technical solution of the present invention, it is described clearly and completely below with reference to the embodiments and the accompanying drawings. The drawings are for illustrative purposes only and are not to be construed as limiting the patent. The embodiments described are only some, not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
Examples
As shown in Fig. 1, the present embodiment is a malware traffic detection method based on ensemble learning, and the method includes the following steps:
S1. Collect an encrypted traffic sample set, where the encrypted traffic sample set includes a plurality of heterogeneous features, specifically: packet length distribution features, server IP address features, certificate word frequency features, packet length sequence features, TCP connection state features, flow features and host features;
S2. Construct a corresponding feature classifier for each of the heterogeneous features of the encrypted traffic sample set, where the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier. The feature construction and model selection for the 7 classifiers are as follows:
Classifier 1, the packet length distribution feature classifier: for each host, count the number of packets for each combination of length and direction, and divide each count by the total number of packets to obtain a probability distribution; this probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet having a given direction and a given length. The packet length distribution features are processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious; when the probability is >= 0.5, the classifier judges the sample to be malicious.
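The feature construction for classifier 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `packet_length_distribution`, the `"out"`/`"in"` direction labels and the sample packets are all assumptions made for the example.

```python
from collections import Counter


def packet_length_distribution(packets):
    """Return {(direction, length): probability} for one host.

    `packets` is an iterable of (direction, length) pairs. The direction
    labels "out" (sent by the host) and "in" (received) are hypothetical.
    """
    counts = Counter(packets)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Divide each (direction, length) count by the total packet count.
    return {key: n / total for key, n in counts.items()}


# Hypothetical host that sent two 60-byte packets and received two 1500-byte ones.
host_packets = [("out", 60), ("in", 1500), ("in", 1500), ("out", 60)]
dist = packet_length_distribution(host_packets)
```

To feed a random forest, the dictionary would still have to be mapped onto a fixed set of columns (one per direction and length pair seen in training); how the original method fixes that column set is not stated in the text.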
Classifier 2, the server IP address feature classifier: for each host, one-hot encode all server IP addresses it has accessed; a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents one server IP address. The server IP address features are processed with a naive Bayes classifier. On the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class. On the test set, these conditional probabilities are used to compute the probability that each sample is malicious; if the probability is >= 0.5 the sample is considered malicious, otherwise it is considered benign.
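The encoding step for classifier 2 might look like the sketch below. The helper names, the documentation-range IP addresses and the choice to ignore addresses not seen in training are assumptions for illustration; the text does not say how unseen addresses are handled.

```python
def build_ip_vocabulary(hosts_to_ips):
    """Assign one column index to every server IP seen across training hosts."""
    vocab = sorted({ip for ips in hosts_to_ips.values() for ip in ips})
    return {ip: i for i, ip in enumerate(vocab)}


def encode_host(ips, vocab):
    """Return a 0/1 vector: 1 if the host accessed that server IP, else 0."""
    vec = [0] * len(vocab)
    for ip in ips:
        idx = vocab.get(ip)
        if idx is not None:  # assumption: IPs unseen in training are ignored
            vec[idx] = 1
    return vec


# Hypothetical training hosts and the server IPs each one contacted.
training = {
    "host-a": {"192.0.2.1", "192.0.2.2"},
    "host-b": {"192.0.2.2", "198.51.100.7"},
}
vocab = build_ip_vocabulary(training)
vec_a = encode_host(training["host-a"], vocab)
vec_new = encode_host({"198.51.100.7", "203.0.113.9"}, vocab)
```

Each resulting vector has one dimension per known server IP, which matches the independent per-dimension treatment that naive Bayes applies.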
Classifier 3, the certificate word frequency feature classifier: for each host, extract the X.509 certificate chains of all TLS flows it received, obtain the words contained in the certificate subjects and issuers, and count the occurrences of each word; each dimension of the feature represents the number of occurrences of one word. The certificate word frequency features are processed with a naive Bayes classifier. On the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and computes, for each dimension separately, its conditional probability given each class. On the test set, these conditional probabilities are used to compute the probability that each sample is malicious; if the probability is >= 0.5 the sample is considered malicious, otherwise it is considered benign. If none of the words in a sample's certificates appear in the training set, the sample is directly judged to be malicious.
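A small sketch of classifier 3 follows, assuming a multinomial naive Bayes with add-one (Laplace) smoothing; the text only says "naive Bayes", so the variant, the smoothing, the tokenizer and the class name `WordCountNB` are all assumptions. The sketch also assumes both classes occur in the training data.

```python
import math
import re
from collections import Counter


def certificate_words(subject, issuer):
    """Split certificate subject and issuer strings into lowercase words."""
    return re.findall(r"[a-z0-9]+", (subject + " " + issuer).lower())


class WordCountNB:
    """Multinomial naive Bayes over word counts (label 1 = malicious, 0 = benign)."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.word_counts = {0: Counter(), 1: Counter()}
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        self.prior = {y: labels.count(y) / len(labels) for y in (0, 1)}
        return self

    def malicious_probability(self, doc):
        known = {w: c for w, c in doc.items() if w in self.vocab}
        if not known:
            return 1.0  # rule from the text: no known words means malicious
        log_post = {}
        for y in (0, 1):
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            lp = math.log(self.prior[y])
            for w, c in known.items():
                lp += c * math.log((self.word_counts[y][w] + 1) / denom)
            log_post[y] = lp
        m = max(log_post.values())  # subtract the max for numerical stability
        e0, e1 = math.exp(log_post[0] - m), math.exp(log_post[1] - m)
        return e1 / (e0 + e1)


# Hypothetical training certificates: one benign host, one malicious host.
benign = Counter(certificate_words("CN=example.com O=Example Inc", "CN=Trusted CA"))
malicious = Counter(certificate_words("CN=localhost", "CN=localhost"))
model = WordCountNB().fit([benign, malicious], [0, 1])

p_localhost = model.malicious_probability(Counter(["localhost"]))
p_trusted = model.malicious_probability(Counter(["trusted", "ca"]))
p_unseen = model.malicious_probability(Counter(["zzz"]))
```

The early return for samples with no known words mirrors the fallback rule in the text, which treats a wholly unfamiliar certificate vocabulary as suspicious in itself.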
Classifier 4, the packet length sequence feature classifier: for each host, extract the sequence of packet lengths of the first 1000 packets of its communication; if there are fewer than 1000 packets, pad the sequence with 0. The packet length sequence features are processed with a TextCNN convolutional neural network classifier. The length of each packet is treated as a word, so the packet length sequence generated by each host's communication is equivalent to a sentence. On the training set, the packet length sequence passes in turn through the word embedding layer, the convolutional layer, the pooling layer, the fully connected layer and the SoftMax layer, and the parameters of all these layers are updated jointly by gradient descent; together these layers form the TextCNN convolutional neural network classifier. On the test set, the TextCNN classifier outputs for each sample the probability that it is malicious; when the probability is >= 0.5, the classifier judges the sample to be malicious.
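The input preparation for classifier 4 can be sketched as below. Only the truncate-and-pad step is shown; the TextCNN itself is omitted. The function name and default arguments are assumptions for the example.

```python
def packet_length_sequence(lengths, max_packets=1000, pad_value=0):
    """Keep the first `max_packets` packet lengths and pad with `pad_value`."""
    seq = list(lengths)[:max_packets]
    return seq + [pad_value] * (max_packets - len(seq))


# Hypothetical hosts: one with 3 packets, one with 1200 packets.
short_seq = packet_length_sequence([60, 1500, 52])
long_seq = packet_length_sequence(range(1, 1201))
```

Because each length is treated as a word, the padded integers could serve directly as token indices for an embedding layer, with 0 acting as the padding token; that mapping is an assumption, since the text does not describe the vocabulary.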
Classifier 5, the TCP connection state feature classifier: for each host, sort its TLS encrypted flows in chronological order, and then parse the TCP connection state of each flow. The TCP connection states are defined as:
S0: the client attempted to connect, but there was no reply;
S1: the connection was established but not terminated;
S2: the connection was established and the client attempted to close it (but the server did not reply);
S3: the connection was established and the server attempted to close it (but the client did not reply);
SF: the connection was established and terminated normally;
SH: the client sent a SYN followed by a FIN, but no SYN-ACK was seen from the server (half-open connection);
SHR: the server sent a SYN-ACK followed by a FIN, but no SYN was seen from the client;
REJ: the client attempted to connect, but the server rejected the attempt;
RSTO: the connection was established and the client aborted it (by sending a RST);
RSTR: the server sent a RST;
RSTOS0: the client sent a SYN followed by a RST, but no SYN-ACK was seen from the server;
RSTOS0: the server sent a SYN-ACK followed by a RST, but no SYN was seen from the client;
RSTRH: the server sent a SYN-ACK followed by a RST, but no SYN was seen from the client;
OTH: no SYN was seen, only midstream traffic (partial connection);
Then a Markov random field transition matrix (MRFTM) is built over the TCP connection states, where the MRFTM is a 14 × 14 two-dimensional matrix and MRFTM[i, j] is the number of transitions from the i-th state to the j-th state. Finally, the MRFTM is row-normalized and reshaped into a 1 × 196 one-dimensional vector, which is the TCP connection state feature. The TCP connection state features are processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious; when the probability is >= 0.5, the classifier judges the sample to be malicious.
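The transition-matrix feature for classifier 5 can be sketched as follows. To keep the output readable, the example uses a hypothetical three-state subset instead of the full state list; with the 14 states named in the text the same function would yield the 196-value vector. Leaving all-zero rows unchanged during normalization is an assumption.

```python
def transition_feature(state_sequence, states):
    """Count state-to-state transitions, row-normalize, and flatten.

    `state_sequence` is one host's connection states in chronological order.
    Rows with no outgoing transitions stay all zero (an assumption).
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    matrix = [[0.0] * n for _ in range(n)]
    # matrix[i][j] counts transitions from state i to the following state j.
    for prev, curr in zip(state_sequence, state_sequence[1:]):
        matrix[index[prev]][index[curr]] += 1
    for row in matrix:
        total = sum(row)
        if total:
            row[:] = [v / total for v in row]
    return [v for row in matrix for v in row]


# Hypothetical subset of states and one host's chronological flow states.
states = ["S0", "SF", "REJ"]
feature = transition_feature(["SF", "SF", "S0", "SF"], states)
```

Row normalization turns each row into the empirical distribution of the next state given the current one, so hosts with different numbers of flows become comparable.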
Classifier 6, the flow feature classifier: for each TLS flow of each host, the following flow-level features are extracted: a one-hot encoding of the TCP connection state; the flow duration; the total number of sent packets; the total number of received packets; the ratio of the number of received packets to the number of sent packets; the max/min/total/mean/variance of the byte counts of all sent packets; the max/min/total/mean/variance of the byte counts of all received packets; the max/min/total/mean/variance of the byte counts of all packets; the max/min/total/mean/variance of the time intervals between sent packets; the max/min/total/mean/variance of the time intervals between received packets; the max/min/mean/variance of the time intervals between all packets; a Markov chain over bucketed packet lengths; and the TLS certificate word frequencies. The flow features are processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious; when the probability is >= 0.5, the classifier judges the sample to be malicious. If any one of a host's flows is judged to be malicious, the host is judged to be infected with malware.
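Two pieces of classifier 6 lend themselves to a short sketch: the max/min/total/mean/variance summaries and the rule that lifts flow-level verdicts to the host. The function names are hypothetical, and the use of population variance is an assumption, since the text does not say which variance is meant.

```python
from statistics import mean, pvariance


def summarize(values):
    """Return the max/min/total/mean/variance summary used for flow features."""
    if not values:
        return {"max": 0, "min": 0, "total": 0, "mean": 0, "variance": 0}
    return {
        "max": max(values),
        "min": min(values),
        "total": sum(values),
        "mean": mean(values),
        "variance": pvariance(values),  # assumption: population variance
    }


def inter_arrival_times(timestamps):
    """Time intervals between consecutive packets of one flow."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def host_infected_by_flows(flow_probabilities, threshold=0.5):
    """The host counts as infected if any single flow is judged malicious."""
    return any(p >= threshold for p in flow_probabilities)


# Hypothetical flow: three sent packets with sizes in bytes and timestamps in seconds.
sent_bytes = summarize([100, 200, 300])
gaps = inter_arrival_times([0.0, 0.5, 2.0])
```

The any-flow rule makes this classifier's host-level vote sensitive: a single flow at or above the threshold is enough to mark the host.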
Classifier 7, the host feature classifier: for each host, select individual features from the flow-level features of each of its TLS flows and aggregate them into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of flows with self-signed certificates, the number of flows with expired certificates, counts of each TCP connection state, and the number of flows containing an Alert. The host features are processed with a random forest classifier. On the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling feature dimensions, and the set of these CART decision trees constitutes the random forest classifier. On the test set, the CART decision trees are used to predict each sample and output the probability that it is malicious; when the probability is >= 0.5, the classifier judges the sample to be malicious.
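The aggregation for classifier 7 could be sketched as below. The per-flow dictionary keys are hypothetical, and "average packet length per flow" is read here as the mean over flows of each flow's bytes divided by its packets, which is one possible interpretation of the text.

```python
from collections import Counter


def host_features(flows):
    """Aggregate per-flow records (dicts with hypothetical keys) per host."""
    n = len(flows)
    if n == 0:
        raise ValueError("host has no TLS flows")
    total_packets = sum(f["packets"] for f in flows)
    # Assumption: mean over flows of (flow bytes / flow packets).
    mean_len = sum(f["bytes"] / f["packets"] for f in flows if f["packets"]) / n
    return {
        "total_packets": total_packets,
        "mean_packets_per_flow": total_packets / n,
        "mean_packet_length_per_flow": mean_len,
        "self_signed_flows": sum(1 for f in flows if f["self_signed"]),
        "expired_cert_flows": sum(1 for f in flows if f["cert_expired"]),
        "alert_flows": sum(1 for f in flows if f["has_alert"]),
        "conn_state_counts": dict(Counter(f["conn_state"] for f in flows)),
    }


# Two hypothetical TLS flows observed for one host.
flows = [
    {"packets": 10, "bytes": 5000, "self_signed": True,
     "cert_expired": False, "conn_state": "SF", "has_alert": False},
    {"packets": 30, "bytes": 6000, "self_signed": False,
     "cert_expired": True, "conn_state": "S0", "has_alert": True},
]
features = host_features(flows)
```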
S3, model integration: as shown in fig. 2, the 7 constructed classifiers are integrated into one detection model, and whether a host is infected with malware is determined from the detection result of this model, specifically: when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
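The voting rule can be written in a few lines. The sketch below assumes each of the 7 classifiers has already produced a boolean verdict for the host (True meaning malicious); the function name `ensemble_verdict` and the string labels are hypothetical.

```python
def ensemble_verdict(votes_malicious):
    """Combine the verdicts of the 7 classifiers for one host.

    votes_malicious: iterable of 7 booleans, True = classifier says malicious.
    """
    votes = list(votes_malicious)
    if len(votes) != 7:
        raise ValueError("expected exactly 7 classifier verdicts")
    normal_votes = sum(1 for v in votes if not v)
    # More than 3 "normal" votes out of 7 means the host is judged normal.
    return "normal" if normal_votes > 3 else "infected"
```

Because the number of classifiers is odd, a tie cannot occur: 4 or more normal votes give "normal", and 3 or fewer give "infected".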
To further verify the detection rate of the invention, corresponding experiments were performed. The training set and the test set are shown in table 1, and the validation results on the test set are shown in table 2.
The following criteria are defined:
final score = detection rate - false alarm rate;
detection rate = number of infected hosts judged to be infected / number of infected hosts;
false alarm rate = number of benign hosts judged to be infected / number of benign hosts.
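A small sketch of how these criteria could be computed from host labels and model verdicts is given below. It assumes the usual reading of the two rates (infected hosts judged infected, and benign hosts judged infected); the function name `final_score` and the 0/1 label encoding are hypothetical.

```python
def final_score(y_true, y_pred):
    """Return (detection rate, false alarm rate, final score).

    y_true: 1 for an infected host, 0 for a benign host.
    y_pred: 1 if the model judged the host infected, else 0.
    """
    infected = sum(1 for t in y_true if t)
    benign = len(y_true) - infected
    detected = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    detection_rate = detected / infected if infected else 0.0
    false_alarm_rate = false_alarms / benign if benign else 0.0
    return detection_rate, false_alarm_rate, detection_rate - false_alarm_rate
```

For example, with 4 infected hosts of which 3 are detected and 4 benign hosts of which 1 is flagged, the detection rate is 0.75, the false alarm rate is 0.25, and the final score is 0.5.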
Table 1: training set and test set overview
Figure BDA0003242880630000081
Table 2: test set validation results
Figure BDA0003242880630000082
In summary, the invention is a malware traffic detection method based on ensemble learning. By extracting a plurality of heterogeneous features and constructing 7 classifiers, it makes the fullest possible use of the features in which malicious traffic and benign traffic differ markedly to distinguish the two kinds of traffic data, thereby identifying malware traffic without decrypting the traffic data.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of the present specification.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. A malware encrypted traffic detection method based on ensemble learning, characterized by comprising the following steps:
collecting an encrypted traffic sample set, wherein the encrypted traffic sample set comprises a plurality of heterogeneous features, specifically: packet length distribution features, server IP address features, certificate word frequency features, packet length sequence features, TCP connection state features, flow features and host features;
constructing a plurality of corresponding feature classifiers based on the plurality of heterogeneous features of the encrypted traffic sample set, wherein the feature classifiers comprise a packet length distribution feature classifier, a server IP address feature classifier, a certificate word frequency feature classifier, a packet length sequence feature classifier, a TCP connection state feature classifier, a flow feature classifier and a host feature classifier;
and constructing a malware encrypted traffic detection model based on the plurality of feature classifiers, wherein the malware encrypted traffic detection model determines whether a host is infected with malware by majority voting of the plurality of feature classifiers.
2. The ensemble learning based malware encryption traffic detection method according to claim 1, wherein the packet length distribution feature classifier is specifically described as follows:
packet length distribution feature construction: for each host, the number of packets of each length in each direction is counted and divided by the total number of packets to obtain a probability distribution; this probability distribution is the packet length distribution feature, and each dimension of the feature represents the probability of a packet of a certain length in a certain direction;
model selection: the packet length distribution features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling the feature dimensions, and the set of these CART decision trees is the random forest classifier; on the test set, the samples are predicted with these CART decision trees and a probability of being malicious is output for each sample, and when the probability >= 0.5, the classifier judges the sample to be malicious.
3. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the server-side IP address feature classifier is specifically described as:
server IP address feature construction: for each host, all accessed server IP addresses are one-hot encoded, where a value of 1 indicates that the server IP address was accessed and a value of 0 indicates that it was not, and each dimension of the feature represents a certain server IP address;
model selection: the server IP address features are processed with a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and separately calculates the conditional probability of each feature dimension for each class; on the test set, the probability that each sample is malicious is computed from these conditional probabilities, and if the probability >= 0.5, the sample is considered malicious, otherwise it is considered benign.
4. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the certificate word frequency feature classifier is specifically described as:
certificate word frequency feature construction: for each host, the X509 certificate chains of all received TLS flows are extracted to obtain the words contained in the certificate subjects and issuers, and the number of occurrences of each word is counted; each dimension of the feature represents the number of occurrences of one word;
model selection: the certificate word frequency features are processed with a naive Bayes classifier; on the training set, the naive Bayes classifier assumes that the feature dimensions are mutually independent and separately calculates the conditional probability of each feature dimension for each class; on the test set, the probability that each sample is malicious is computed from these conditional probabilities, and if the probability >= 0.5, the sample is considered malicious, otherwise it is considered benign; if none of the words in a sample's certificates appear in the training set, the sample is directly inferred to be malicious.
5. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the packet length sequence feature classifier is specifically described as:
packet length sequence feature construction: for each host, the packet length sequence formed by the first 1000 packets generated by its communication is extracted, and a sequence shorter than 1000 packets is padded with 0;
model selection: the packet length sequence features are processed with a TextCNN convolutional neural network classifier; each packet length is treated as a word, so that the packet length sequence generated by the communication of each host is equivalent to a sentence; on the training set, the packet length sequence passes in turn through a word embedding layer, a convolutional layer, a pooling layer, a fully connected layer and a SoftMax layer, and the parameters of all layers are then updated jointly by gradient descent, so that these layers form the TextCNN convolutional neural network classifier; on the test set, the TextCNN convolutional neural network classifier outputs a probability of being malicious for each sample, and when the probability >= 0.5, the classifier judges the sample to be malicious.
6. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the TCP connection state feature classifier is specifically described as:
TCP connection state feature construction: for each host, the TLS encrypted flows are sorted in chronological order, and the TCP connection state of each flow is then parsed;
there are 14 TCP connection states, defined as follows:
S0: the client attempted to connect, but there was no reply;
S1: the connection was established but not terminated;
SF: the connection was established and terminated normally;
REJ: the client attempted to connect, but the server rejected it;
S2: the connection was established and the client attempted to close it, but there was no reply from the server;
S3: the connection was established and the server attempted to close it, but there was no reply from the client;
RSTO: the connection was established and the client aborted it by sending a RST;
RSTR: the server sent a RST;
RSTOS0: the client sent a SYN followed by a RST, but no SYN ACK was seen from the server;
RSTOS 0: the server sends a SYN ACK with RST, but the client does not send a SYN;
RSTRH: the server sends a SYN ACK with RST, but the client does not send a SYN;
SH: the client has sent one SYN and one FIN, but no SYN ACK from the server;
SHR: the server sends a SYN ACK and a FIN, but the client does not send a SYN;
OTH: no SYN is seen, only intermediate traffic;
a Markov random field transition matrix (MRFTM) is then built for the TCP connection states, wherein the MRFTM is a 14 x 14 two-dimensional matrix and MRFTM[i, j] represents the number of transitions from the i-th state to the j-th state; finally, the MRFTM is row-normalized and reshaped into a 1 x 196 one-dimensional vector, which is the TCP connection state feature;
model selection: the TCP connection state features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling the feature dimensions, and the set of these CART decision trees is the random forest classifier; on the test set, the samples are predicted with these CART decision trees and a probability of being malicious is output for each sample, and when the probability >= 0.5, the classifier judges the sample to be malicious.
7. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the traffic feature classifier is specifically described as:
flow feature construction: for each TLS flow of each host, the following are extracted: one-hot encoding of the TCP connection state; flow duration; total number of sent packets; total number of received packets; ratio of the total number of received packets to the total number of sent packets; max/min/total/mean/variance of the bytes of all sent packets; max/min/total/mean/variance of the bytes of all received packets; max/min/total/mean/variance of the bytes of all packets; max/min/total/mean/variance of the time intervals between sent packets; max/min/total/mean/variance of the time intervals between received packets; max/min/mean/variance of the time intervals between all packets; Markov chain over bucketed packet lengths; and TLS certificate word frequency;
model selection: the flow features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling the feature dimensions, and the set of these CART decision trees is the random forest classifier; on the test set, the samples are predicted with these CART decision trees and a probability of being malicious is output for each sample, and when the probability >= 0.5, the classifier judges the sample to be malicious; whenever any one of a host's flows is judged to be malicious, the host is judged to be infected with malware.
8. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the host feature classifier is specifically described as:
host feature construction: for each host, individual features are extracted from the flow-level features of each TLS flow and aggregated into host-level features, the host-level features comprising: the total number of packets, the average number of packets per flow, the average packet length per flow, the number of self-signed flows, the number of flows with an expired certificate, counts of each TCP connection state, and the number of flows containing an Alert;
model selection: the host features are processed with a random forest classifier; on the training set, the random forest classifier builds a plurality of CART decision trees by randomly sampling the feature dimensions, and the set of these CART decision trees is the random forest classifier; on the test set, the samples are predicted with these CART decision trees and a probability of being malicious is output for each sample, and when the probability >= 0.5, the classifier judges the sample to be malicious.
9. The ensemble learning-based malware encryption traffic detection method according to claim 1, wherein the malware encryption traffic detection model determines whether the host is infected with malware by using majority voting of a plurality of feature classifiers, specifically:
when more than 3 of the 7 classifiers judge the host to be normal, the detection model judges the host to be normal; otherwise, the detection model judges the host to be infected with malware.
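As an illustration of the TCP connection state feature recited in claim 6, the construction of the row-normalized transition matrix and its flattening into a vector can be sketched as follows. This is a hedged sketch rather than the claimed implementation: the function name `mrftm_feature` is hypothetical, the list of states is passed in by the caller, and rows with no observed transitions are assumed to stay all zero. With the 14 states of the method the returned vector has 14 x 14 = 196 entries.

```python
def mrftm_feature(state_sequence, states):
    """Build the flattened, row-normalized state transition matrix.

    state_sequence: connection states of one host's TLS flows in time order.
    states: ordered list of all possible state names.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    # Count transitions between consecutive flows.
    for prev, nxt in zip(state_sequence, state_sequence[1:]):
        counts[index[prev]][index[nxt]] += 1
    feature = []
    for row in counts:
        total = sum(row)
        # Assumption: a row without any transition is left as zeros.
        feature.extend(c / total if total else 0.0 for c in row)
    return feature
```

For a toy state list `["SF", "S0", "REJ"]` and the sequence SF, SF, S0, SF, the SF row becomes 0.5, 0.5, 0, the S0 row becomes 1, 0, 0, and the REJ row stays zero.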
CN202111024464.2A 2021-09-02 2021-09-02 Malicious software encrypted flow detection method based on ensemble learning Active CN113704762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111024464.2A CN113704762B (en) 2021-09-02 2021-09-02 Malicious software encrypted flow detection method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111024464.2A CN113704762B (en) 2021-09-02 2021-09-02 Malicious software encrypted flow detection method based on ensemble learning

Publications (2)

Publication Number Publication Date
CN113704762A true CN113704762A (en) 2021-11-26
CN113704762B CN113704762B (en) 2022-06-21

Family

ID=78657257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111024464.2A Active CN113704762B (en) 2021-09-02 2021-09-02 Malicious software encrypted flow detection method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN113704762B (en)

Also Published As

Publication number Publication date
CN113704762B (en) 2022-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant