CN116346384A - Malicious encryption flow detection method based on variation self-encoder - Google Patents

Info

Publication number
CN116346384A
Authority
CN
China
Prior art keywords
data
encoder
class
category
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111604173.0A
Other languages
Chinese (zh)
Inventor
陈敏
刘澍波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tonghe Shiyi Telecommunication Science And Technology Research Institute Co ltd
Data Communication Science & Technology Research Institute
Xingtang Telecommunication Technology Co ltd
Original Assignee
Beijing Tonghe Shiyi Telecommunication Science And Technology Research Institute Co ltd
Data Communication Science & Technology Research Institute
Xingtang Telecommunication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tonghe Shiyi Telecommunication Science And Technology Research Institute Co ltd, Data Communication Science & Technology Research Institute, Xingtang Telecommunication Technology Co ltd
Priority to CN202111604173.0A
Publication of CN116346384A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention relates to a method for detecting malicious encrypted traffic based on a variational autoencoder (VAE). It belongs to the technical field of computer network security and addresses two problems of prior-art detection methods: they are difficult to generalize and migrate, and they detect unknown types of malicious encrypted traffic poorly. Network encrypted traffic data is collected and processed to obtain feature vector data, and a class label is added to the feature vector data to obtain total sample data, the class labels covering normal traffic data and each class of malicious traffic data. The normal traffic sample data is input into a variational autoencoder for training, so that the feature distribution of the normal traffic data is learned and a trained variational autoencoder is obtained. The total sample data is then input into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and an identification threshold interval is determined for each detection category based on the reconstruction errors; the detection categories comprise the normal category and each malicious category.

Description

Malicious encryption flow detection method based on variation self-encoder
Technical Field
The invention relates to the technical field of computer network security, and in particular to a method for detecting malicious encrypted traffic based on a variational autoencoder.
Background
As awareness of network security has grown, Internet communication has increasingly moved to encrypted protocols, and the volume of encrypted traffic has risen rapidly. While encrypted communication protects user privacy and information security, it has also become an important vehicle for network attacks and information theft. For the HTTPS traffic used in Web communication, encryption destroys the distinctive statistical characteristics and data format of plaintext data, so an attacker can use SSL-encrypted channels to evade the filtering and protection of boundary devices, deliver and distribute malware payloads, carry out malicious behaviors such as scanning, probing and brute-force cracking, and maintain communication between an infected host and its server, posing a serious threat to network information security. Because encryption changes the characteristics of the traffic, traditional traffic detection methods are difficult to reuse in an encrypted environment, and effectively identifying malicious encrypted traffic has become one of the major challenges in the field of network security.
At present, researchers at home and abroad mainly use machine learning to detect malicious encrypted traffic. In machine-learning-based detection, statistical features of the detection object are extracted through feature engineering that draws on the prior knowledge of network security experts; the extracted features are fed as training samples into a machine learning model, and different models are then trained and tuned until malicious encrypted traffic can be detected. The core of this approach is the feature engineering applied to network traffic data: better feature engineering yields better recognition results. Feature engineering, however, depends heavily on the prior knowledge of domain experts, which makes existing detection methods difficult to generalize and migrate; and because they rely on empirical knowledge, they detect unknown types of malicious encrypted traffic poorly.
Therefore, existing methods for detecting malicious encrypted traffic are difficult to generalize and migrate, and detect unknown types of malicious encrypted traffic poorly.
Disclosure of Invention
In view of the above analysis, embodiments of the invention aim to provide a method for detecting malicious encrypted traffic based on a variational autoencoder, so as to solve the problems that existing detection methods are difficult to generalize and migrate and detect unknown types of malicious encrypted traffic poorly.
An embodiment of the invention provides a method for detecting malicious encrypted traffic based on a variational autoencoder, comprising the following steps:
collecting and processing network encrypted traffic data to obtain feature vector data, and adding a class label to the feature vector data to obtain total sample data, wherein the class labels cover normal traffic data and each class of malicious traffic data;
inputting the normal traffic sample data into a variational autoencoder for training, and learning the feature distribution of the normal traffic data to obtain a trained variational autoencoder; inputting the total sample data into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and determining an identification threshold interval for each detection category based on the reconstruction errors; wherein the detection categories comprise the normal category and each malicious category;
and inputting the feature vector data of the encrypted traffic data to be detected into the trained variational autoencoder to obtain a reconstruction error, and obtaining a detection result for the encrypted traffic from the reconstruction error and the identification threshold interval of each detection category.
Further, the feature vector data is obtained as follows:
extracting data features from the encrypted traffic data; cleaning the data features, filling in null values and converting non-numeric values to numeric form; and obtaining the feature vector data through zero-mean normalization; the data features comprise statistical features, time-series features and message payload features.
Further, the statistical features include the number of data packets, the number of transmitted bytes, the packet transmission rate and the byte transmission rate; the time-series features include the average packet interval time, the average flow duration, the standard deviation of the packet interval time and the standard deviation of the flow duration; the message payload features include the protocol version number, the protocol entropy, the number of TLS cipher suites and the TLS extension length.
Further, inputting the total sample data into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and determining the identification threshold interval of each detection category based on the reconstruction errors, comprises:
inputting the total sample data into the trained variational autoencoder to obtain reconstructed input data;
obtaining the reconstruction errors from the originally input total sample data and the reconstructed input data, and compiling the reconstruction error distribution of the traffic data of each category;
and adaptively adjusting the identification threshold of each category based on the reconstruction error distribution and the detection accuracy of the traffic data of each category, to obtain the identification threshold interval of each category.
Further, the reconstruction error MSE is obtained by the following formula:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - x'_i\right)^2$$
wherein x_i and x′_i denote the i-th dimension features of the input data and of the reconstructed input data respectively, and n is the number of feature dimensions of the input data and the reconstructed input data.
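The reconstruction error above is an ordinary mean squared error over the n feature dimensions. A minimal Python sketch follows; the function name `reconstruction_error` and the sample vectors are illustrative assumptions, not part of the patent text.

```python
def reconstruction_error(x, x_rec):
    """Mean squared error between an input vector and its reconstruction."""
    if len(x) != len(x_rec):
        raise ValueError("input and reconstruction must have the same length")
    n = len(x)
    # Average the squared per-dimension differences over the n features.
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / n


# Hypothetical 3-dimensional feature vector and its reconstruction.
mse = reconstruction_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Only the third dimension differs here, by 2, so the error is 4 divided by the 3 dimensions.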
Further, the identification threshold interval of each category is obtained by executing the following steps:
Step 1: set the initial value of LBound to the most frequent reconstruction error of the normal traffic category in the reconstruction error distribution, set the initial value of RBound to the most frequent reconstruction error of the malicious traffic category closest to the normal traffic category in the reconstruction error distribution, and set the initial step size Step to 0.1;
Step 2: take mid = (LBound + RBound)/2; starting from mid, traverse the reconstruction error distribution to the left with step size Step, ending at LBound, and take the reconstruction error with the highest classification precision in the interval [LBound, mid] as the left threshold Lmax; likewise traverse to the right from mid with step size Step, ending at RBound, and take the reconstruction error with the highest classification precision in the interval [mid, RBound] as the right threshold Rmax;
Step 3: if (Rmax − Lmax) is less than or equal to the tolerable error, stop iterating and take the midpoint of Lmax and Rmax as the identification threshold between the current two categories; otherwise set LBound = Lmax, RBound = Rmax and Step = Step × 0.1, and return to Step 2 for the next iteration;
Step 4: if no other malicious traffic category lies to the right of the category to which the current initial value of RBound belongs, stop computing identification thresholds and take the maximum reconstruction error of that category as the final identification threshold; otherwise take the current initial value of RBound as the next initial value of LBound, take the most frequent reconstruction error of the nearest malicious traffic category to the right in the reconstruction error distribution as the next initial value of RBound, and execute Steps 2 to 3 to determine the identification threshold between the next two categories;
Step 5: sort the identification thresholds obtained, with 0 as the initial identification threshold, to obtain the identification threshold interval of each category of traffic data.
Further, the classification precision is the average of the detection accuracies of the current two categories; the detection accuracy of the left category is the ratio of the number of its samples lying to the left of the threshold point to its total number of samples; the detection accuracy of the right category is the ratio of the number of its samples lying to the right of the threshold point to its total number of samples.
Further, the tolerable error has a value of 0.001.
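Steps 1 to 3, together with the classification precision and the tolerable error defined above, can be sketched in Python for a single pair of adjacent categories. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the toy error values and the `max_iter` guard are hypothetical, and `lbound` and `rbound` are passed in directly instead of being derived from the most frequent reconstruction error of each category.

```python
def classification_precision(threshold, left_errors, right_errors):
    """Average of the two per-category detection accuracies at a threshold."""
    left_acc = sum(e <= threshold for e in left_errors) / len(left_errors)
    right_acc = sum(e > threshold for e in right_errors) / len(right_errors)
    return (left_acc + right_acc) / 2


def best_in_range(start, end, step, left_errors, right_errors):
    """Walk from start towards end in increments of step (end included)
    and return the candidate threshold with the highest precision."""
    direction = 1 if end >= start else -1
    count = int(abs(end - start) / step)
    candidates = [start + direction * k * step for k in range(count + 1)]
    candidates.append(end)
    # max() keeps the first maximum, so ties go to the candidate nearest start.
    return max(
        candidates,
        key=lambda t: classification_precision(t, left_errors, right_errors),
    )


def find_threshold(left_errors, right_errors, lbound, rbound,
                   step=0.1, tol=0.001, max_iter=6):
    """Iteratively narrow [lbound, rbound] around the best split point."""
    for _ in range(max_iter):  # max_iter is an added safety guard (assumption)
        mid = (lbound + rbound) / 2
        lmax = best_in_range(mid, lbound, step, left_errors, right_errors)
        rmax = best_in_range(mid, rbound, step, left_errors, right_errors)
        if rmax - lmax <= tol:
            break
        lbound, rbound, step = lmax, rmax, step * 0.1
    return (lmax + rmax) / 2


# Hypothetical reconstruction errors for a normal and a malicious category.
normal_errors = [0.1, 0.2, 0.3, 0.4]
attack_errors = [1.6, 1.7, 1.8, 1.9]
threshold = find_threshold(normal_errors, attack_errors, lbound=0.2, rbound=1.8)
```

Step 4 would repeat `find_threshold` for each further pair of neighbouring categories, and Step 5 would sort the resulting thresholds.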
Further, obtaining the detection result for the encrypted traffic from the reconstruction error and the identification threshold interval of each category comprises:
if the reconstruction error of the encrypted traffic to be detected falls within the identification threshold interval of any category, the detection result for the encrypted traffic is that category;
otherwise, the encrypted traffic is judged to be unknown malicious traffic.
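The interval lookup described above can be sketched as follows. The threshold values, the category names and their ordering along the error axis are made up for illustration; only the rule itself (a match inside an interval gives that category, anything else is unknown malicious traffic) comes from the text.

```python
from bisect import bisect_left


def classify(error, upper_bounds, labels):
    """Map a reconstruction error to a category via sorted interval bounds.

    upper_bounds[i] is the upper identification threshold of labels[i];
    the first interval starts at 0.
    """
    if error < 0 or error > upper_bounds[-1]:
        return "unknown malicious"
    # Index of the first upper bound that is >= error.
    return labels[bisect_left(upper_bounds, error)]


# Hypothetical thresholds and category order.
upper_bounds = [1.0, 2.5, 4.0, 6.0]
labels = ["normal", "ddos", "brute_force", "sql_injection"]

result_normal = classify(0.4, upper_bounds, labels)
result_brute = classify(3.0, upper_bounds, labels)
result_unknown = classify(7.2, upper_bounds, labels)
```

Because the last upper bound is the maximum reconstruction error of the right-most known category, any larger error falls outside every interval and is reported as unknown.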
Further, inputting the normal traffic sample data into the variational autoencoder for training to obtain the trained variational autoencoder comprises:
inputting the normal traffic sample data into the variational autoencoder to obtain reconstructed normal traffic sample data;
obtaining a loss function Loss from the input normal traffic sample data and the reconstructed normal traffic sample data, the loss function Loss being expressed as:
$$\mathrm{Loss} = \mathrm{KL}\big(N(\mu(X), \sigma^2(X)) \,\|\, N(0,1)\big) - L\big(X, P(X' = X \mid Z)\big)$$
wherein X denotes the input normal traffic sample data and X′ the reconstructed normal traffic sample data; N(μ(X), σ²(X)) denotes a normal distribution with mean μ(X) and variance σ²(X); N(0, 1) denotes the standard normal distribution with mean 0 and variance 1; KL(·‖·) denotes the KL divergence; P(X′ = X | Z) denotes the probability that the reconstructed X′ equals the input X given the distribution of the hidden vector Z; and −L(·) denotes the negative log-likelihood loss function;
and updating the parameters of the variational autoencoder with the back-propagation algorithm and the stochastic gradient descent optimization algorithm so as to minimize the loss function and obtain the trained variational autoencoder.
Compared with the prior art, the method for detecting malicious encrypted traffic based on a variational autoencoder provided by the invention achieves at least one of the following beneficial effects:
1. Network encrypted traffic data is collected and processed to obtain total sample data, and a variational autoencoder is trained on the normal traffic sample data to obtain the feature distribution of normal encrypted traffic. The total sample data is then input into the variational autoencoder to obtain the reconstruction errors of the sample data, from which the identification threshold interval of each category is derived, and the type of the encrypted traffic to be detected is determined from its reconstruction error and these intervals. Encrypted traffic is thereby identified and detected with high generalization capability and high accuracy, which is of practical significance for safeguarding network information security and keeping the network running normally;
2. The variational autoencoder learns the feature distribution of normal encrypted traffic samples, and the probability encoder models the distribution of the hidden vectors, which broadens the expressive capacity of the input features and improves the robustness to interference and the generality of the model;
3. The detection method based on the variational autoencoder deep learning model can effectively and accurately detect and classify known types of malicious encrypted traffic;
4. Encrypted traffic is detected from the reconstruction error and the identification threshold interval of each category, so the set of detection categories can be extended flexibly without retraining the variational autoencoder; the method has strong generalization and migration capability and can quickly identify and detect unknown threats.
In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a flow diagram of a method for detecting malicious encrypted traffic based on a variational autoencoder according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of obtaining the identification threshold interval of each category according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a variational autoencoder according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
The invention discloses a method for detecting malicious encrypted traffic based on a variational autoencoder which, as shown in Fig. 1, comprises the following steps:
S1, collecting and processing network encrypted traffic data to obtain feature vector data, and adding class labels to the feature vector data to obtain total sample data, wherein the class labels cover normal traffic data and each class of malicious traffic data.
Specifically, network encrypted traffic data is traffic produced by an encryption algorithm: it is the encrypted form of the actual plaintext content transmitted during communication, which ensures the confidentiality of the communication content.
In practice, the feature vector data is obtained as follows:
extracting data features from the network encrypted traffic data; in sequence, cleaning the data features, filling in null values and converting non-numeric values to numeric form; and obtaining the feature vector data through zero-mean normalization; the data features comprise statistical features, time-series features and message payload features.
Specifically, a web cyber range is built with the LAMP (Linux, Apache, MySQL, PHP) stack, and a network attack-and-defense exercise sandbox is built with Kali Linux. The Wireshark packet capture and analysis software is used to collect the traffic data in transit, useless start-up traffic is filtered out, and a pcap file is collected for each detection category, so as to simulate the abnormal traffic seen when network devices are attacked in a real environment. The collected pcap files are then parsed and the data features are extracted. The detection categories comprise normal traffic data and malicious traffic data, the malicious traffic data comprising DDoS attacks, brute-force cracking and SQL injection; each category contains 5000 records.
Specifically, the statistical features include the number of data packets, the number of transmitted bytes, the packet transmission rate and the byte transmission rate; the time-series features include the average packet interval time, the average flow duration, the standard deviation of the packet interval time and the standard deviation of the flow duration; the message payload features include the protocol version number, the protocol entropy, the number of TLS cipher suites and the TLS extension length.
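Several of the statistical and time-series features listed above can be computed directly from the packet timestamps and sizes of one flow. The sketch below assumes each packet is already available as a hypothetical `(timestamp_seconds, size_bytes)` pair; parsing the pcap file and the payload features (protocol entropy, TLS fields) are left out, and the flow-duration mean and standard deviation would be taken across many flows rather than inside one.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute per-flow features from (timestamp_seconds, size_bytes) pairs."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = sorted(t for t, _ in packets)
    total_bytes = sum(size for _, size in packets)
    duration = times[-1] - times[0]
    # Inter-arrival gaps between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "byte_count": total_bytes,
        "duration": duration,
        "packet_rate": len(packets) / duration if duration > 0 else 0.0,
        "byte_rate": total_bytes / duration if duration > 0 else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }


# Hypothetical four-packet flow.
features = flow_features([(0.0, 100), (0.5, 200), (1.0, 300), (2.0, 400)])
```

The population standard deviation (`pstdev`) is used here as a design choice; a sample standard deviation would be equally defensible, since the text does not say which is meant.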
Specifically, the data features are cleaned and their null values filled as follows:
filtering the source IP and destination IP in the extracted data features: since most current network traffic uses IPv4 addresses, only IPv4 addresses are retained and IPv6 addresses are filtered out, so that the collected sample traffic data is of a uniform type; computing the mean of each column of the extracted data features that contains missing values, such as the size of the received data or the number of received data packets, and setting the null entries of that column to the mean; adding a cipher suite length column and an extension length column, and setting the cipher suite length and the extension length to 0 where the TLS field is empty; and setting other irrelevant fields to 0.
It can be appreciated that cleaning and null-value filling make the data features more usable and avoid the influence of data noise and outliers on generalization, which benefits the later training and the determination of the class identification thresholds.
Specifically, the data features, after cleaning and null-value filling, are converted to numeric form as follows:
non-numeric data in the extracted data features, such as the message payload features, is converted to numeric form using one-hot encoding, and zero-mean normalization is applied to all data features. It can be appreciated that numeric conversion and zero-mean normalization help the variational autoencoder converge faster during later training and improve its robustness.
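The three preprocessing operations described above (mean filling of null values, one-hot encoding and zero-mean normalization) can be sketched column by column in plain Python. The helper names and sample columns are illustrative assumptions; a real pipeline would more likely use a data-frame library.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in column if v is not None]
    fill = mean(present) if present else 0.0
    return [fill if v is None else v for v in column]


def one_hot(column):
    """Encode categorical values as 0/1 vectors, one slot per category."""
    categories = sorted(set(column))
    return [[1.0 if v == c else 0.0 for c in categories] for v in column]


def zscore(column):
    """Zero-mean, unit-variance normalization of a numeric column."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no signal
    return [(v - mu) / sigma for v in column]


# Hypothetical columns: received bytes with a gap, and a protocol version.
bytes_received = impute_mean([100.0, None, 300.0])
normalized = zscore(bytes_received)
version_encoded = one_hot(["TLS1.2", "TLS1.3", "TLS1.2"])
```

In this toy column the missing entry is filled with 200.0, the mean of the two observed values, before normalization.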
S2, inputting the normal traffic sample data into a variational autoencoder for training, and learning the feature distribution of the normal traffic data to obtain a trained variational autoencoder; inputting the total sample data into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and determining an identification threshold interval for each detection category based on the reconstruction errors, as shown in Fig. 2; wherein the detection categories comprise the normal category and each malicious category.
Specifically, an unsupervised learning module based on a variational autoencoder learns the feature distribution of the normal traffic data using a variational autoencoder deep learning model and saves the model weights. The normal traffic sample data, after data processing and standardization, is input into the variational autoencoder (VAE) deep learning model. The VAE is a directed probabilistic graphical model based on variational inference; by combining deep learning with probability and statistics, it can learn the feature distribution of the data.
Specifically, as shown in Fig. 3, the variational autoencoder comprises a probability encoder and a probability decoder;
the probability encoder encodes the input data into a probability distribution over the hidden space, from which the corresponding low-dimensional hidden vector is sampled;
the probability decoder maps the low-dimensional hidden vector to the reconstructed input data.
Specifically, the probability encoder is defined by p(Z|X), the distribution of the encoded vector given the original input vector, and the probability decoder is defined by p(X′|Z), the distribution of the decoded vector given the encoded vector; wherein X is the original high-dimensional input feature vector, Z is the low-dimensional hidden vector in the hidden space, and X′ is the reconstructed high-dimensional output vector.
Assume the original input feature vector is
$$X = (x_1, x_2, \ldots, x_n)$$
wherein x_i denotes the i-th dimension feature of the original input feature vector X and n denotes the number of feature dimensions of X; each data sample x_i is randomly generated, mutually independent and discretely distributed. The generated output feature vector is
$$X' = (x'_1, x'_2, \ldots, x'_n)$$
wherein x′_i denotes the i-th dimension feature of the output feature vector X′, whose dimensions correspond one to one with those of the original input feature vector. Assume further that the prior distribution p(Z) of the low-dimensional hidden vector Z is a standard Gaussian distribution and that the likelihood distribution p(X|Z) is Gaussian. By Bayes' theorem, the prior distribution p(Z), the likelihood distribution p(X|Z) and the posterior distribution p(Z|X) satisfy the following relationship:
$$p(Z \mid X) = \frac{p(X \mid Z)\,p(Z)}{p(X)}$$
Thus, the variational autoencoder deep learning model can be divided into two processes: the approximate inference of the posterior distribution of the low-dimensional hidden vector Z by the probability encoder, $q_\phi(Z \mid X)$, i.e. the inference network; and the generation of the vector X′ by the probability decoder through the conditional distribution $p_\theta(Z)\,p_\theta(X' \mid Z)$, i.e. the generation network.
In implementation, inputting the normal traffic sample data into the variational autoencoder for training to obtain the trained variational autoencoder comprises the following steps:
S21, inputting the normal traffic sample data into the variational autoencoder to obtain reconstructed normal traffic sample data;
specifically, the reconstructed normal traffic sample data is obtained by performing the following steps:
setting the initial weights of the variational autoencoder to random values drawn from a normal distribution; encoding the input data as a normal distribution over the hidden space: the probability encoder maps the input sample data to a mean μ and a variance σ², which determine the conditional probability distribution $q_\phi(Z \mid X)$ of the low-dimensional hidden vector Z given the input data X;
sampling from this probability distribution to obtain a low-dimensional hidden vector Z belonging to the distribution;
mapping the low-dimensional hidden vector Z to the reconstructed input sample data X′ through the probability decoder $p_\theta(Z)\,p_\theta(X' \mid Z)$.
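The sampling step between the encoder and the decoder can be sketched as below. The text only says that Z is sampled from the encoded normal distribution; writing the sample as mean plus standard deviation times standard normal noise (the usual reparameterization) and representing the variance by its logarithm are assumptions of this sketch, as are the function name and the all-zero encoder outputs.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z_i = mu_i + sigma_i * eps_i with eps_i ~ N(0, 1).

    mu and log_var are the per-dimension mean and log-variance that a
    probability encoder would output for one input sample.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
latent_dim = 20
mu = [0.0] * latent_dim       # hypothetical encoder output: zero mean
log_var = [0.0] * latent_dim  # log(1) = 0, i.e. unit variance
z = sample_latent(mu, log_var, rng)
```

Keeping the randomness in a separate noise term is what lets gradients flow back through the mean and variance during training.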
Preferably, the dimension of the hidden vector Z is set to 20, at which the variational autoencoder performs best.
It can be understood that the variational autoencoder learns the feature distribution of normal encrypted traffic samples and the probability encoder models the distribution of the hidden vectors, which broadens the expressive capacity of the input features and improves the robustness to interference and the generalization capability of the model.
S22, obtaining a loss function Loss from the input normal traffic sample data and the reconstructed normal traffic sample data, the loss function Loss being expressed as:
$$\mathrm{Loss} = \mathrm{KL}\big(N(\mu(X), \sigma^2(X)) \,\|\, N(0,1)\big) - L\big(X, P(X' = X \mid Z)\big)$$
wherein X denotes the input normal traffic sample data, i.e. the original input feature vector, and X′ denotes the reconstructed normal traffic sample data, i.e. the output feature vector; N(μ(X), σ²(X)) denotes a normal distribution with mean μ(X) and variance σ²(X); N(0, 1) denotes the standard normal distribution with mean 0 and variance 1; KL(·‖·) denotes the KL divergence; P(X′ = X | Z) denotes the probability that the reconstructed X′ equals the input X given the distribution of the hidden vector Z; and −L(·) denotes the negative log-likelihood loss function.
Specifically, the KL divergence measures the distance between $q_\phi(Z \mid X)$ and the reference probability distribution, namely the standard Gaussian distribution N(0, 1). The negative log-likelihood loss is the negative logarithm of the probability that the reconstructed output X′ equals the input sample X given the distribution of the hidden vector Z; it expresses how closely the distribution of the reconstructed input X′ matches the distribution of the input sample data X.
Specifically, the KL divergence is calculated by the following formula:
$$\mathrm{KL} = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)$$
where d denotes the dimension of the hidden vector Z, and μ_i and σ_i² denote the mean and variance of the approximate posterior distribution of the i-th dimension of the hidden vector Z.
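For a diagonal Gaussian posterior measured against N(0, 1), the KL divergence has the standard closed form ½ Σ (μ_i² + σ_i² − log σ_i² − 1), summed over the d hidden dimensions. The sketch below evaluates that closed form; the function name and the sample values are illustrative assumptions.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) for per-dimension means and variances."""
    return 0.5 * sum(
        m * m + v - math.log(v) - 1.0
        for m, v in zip(mu, var)
    )


# A posterior that already equals the prior costs nothing.
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Shifting one dimension's mean by 1 (variance still 1) costs 0.5.
kl_shifted = kl_to_standard_normal([1.0], [1.0])
```

The term grows as the encoded mean drifts from 0 or the variance from 1, which is how it pulls the hidden distribution towards the standard Gaussian.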
Specifically, the negative log-likelihood loss function is calculated by the following formula:
$$-L\big(X, P(X' = X \mid Z)\big) = -\sum_{i=1}^{n} \log P\left(x'_i = x_i \mid Z\right)$$
wherein P(x′_i = x_i | Z) denotes the probability that the value x′_i of the i-th dimension feature of the reconstructed X′ equals the value x_i of the i-th dimension feature of the input X, given the hidden vector Z.
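To make the negative log-likelihood concrete, the sketch below assumes a Gaussian likelihood with a fixed unit variance per dimension; the text states that the likelihood is Gaussian but does not fix its variance, so that choice is an assumption, as are the function name and sample vectors. Under this assumption the loss is half the summed squared error plus a constant, so minimizing it is equivalent to minimizing the squared reconstruction error.

```python
import math


def gaussian_nll(x, x_rec, var=1.0):
    """Negative log-likelihood of x under N(x_rec, var) per dimension."""
    return sum(
        0.5 * math.log(2.0 * math.pi * var) + (a - b) ** 2 / (2.0 * var)
        for a, b in zip(x, x_rec)
    )


x = [1.0, 2.0, 3.0]
x_rec = [1.0, 2.0, 5.0]
nll = gaussian_nll(x, x_rec)

# Same quantity expressed through the mean squared error:
n = len(x)
mse = sum((a - b) ** 2 for a, b in zip(x, x_rec)) / n
nll_from_mse = 0.5 * n * mse + 0.5 * n * math.log(2.0 * math.pi)
```

Both expressions give the same value, which shows why the reconstruction error used later for thresholding is a natural by-product of training.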
S23, updating the parameters of the variational autoencoder with the back-propagation algorithm and the stochastic gradient descent optimization algorithm so as to minimize the loss function and obtain the trained variational autoencoder.
It will be appreciated that the loss function is designed, and back-propagation is used in training the variational autoencoder, so that the distance between the reconstructed input data X′ and the input data X is as small as possible while, to ensure that the variational autoencoder retains its generative capability, $q_\phi(Z \mid X)$ stays as close as possible to the standard Gaussian distribution N(0, 1).
In a specific implementation, inputting the total sample data into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and determining the recognition threshold interval of each detection category based on the reconstruction errors, comprises:
inputting the total sample data into the trained variational autoencoder to obtain reconstructed input data;
obtaining reconstruction errors from the originally input total sample data and the reconstructed input data, and computing the reconstruction error distribution of the traffic data of each category, wherein the categories include the normal category and each malicious category;
and adaptively adjusting the recognition threshold of each category based on the reconstruction error distribution and the detection accuracy of the traffic data of each category, to obtain the recognition threshold interval of each category.
In a specific implementation, the reconstruction error MSE is obtained by the following formula:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - x'_i\right)^{2}$$
where x_i and x′_i denote the i-th dimension features of the input data and of the reconstructed input data, respectively, and n is the number of feature dimensions of the input data and the reconstructed input data.
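The per-sample reconstruction error can be computed directly from the input and the reconstructed feature vectors. The sketch below is illustrative; the function name `reconstruction_error` and the sample values are hypothetical.

```python
import numpy as np

def reconstruction_error(x, x_rec):
    """Mean squared error over the n feature dimensions (last axis).

    Accepts a single feature vector or a batch of shape (num_samples, n).
    """
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    return np.mean((x - x_rec) ** 2, axis=-1)

# One sample with n = 3 features.
print(reconstruction_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
# A batch of two samples gives one error per sample.
print(reconstruction_error([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]))
```

Applying this to every sample of a category gives the reconstruction error distribution from which the thresholds are derived.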
In a specific implementation, the following steps are performed to adaptively adjust the recognition threshold of each category and obtain the recognition threshold interval of each category:
Step 1, setting the initial value of LBound to the reconstruction error with the highest frequency for the normal traffic category in the reconstruction error distribution, setting the initial value of RBound to the reconstruction error with the highest frequency for the malicious traffic category closest to the normal traffic category in the reconstruction error distribution, and setting the initial step size Step to 0.1;
Step 2, taking mid = (LBound + RBound)/2; traversing the reconstruction error distribution leftward from mid with step size Step, ending at LBound, and taking the reconstruction error with the maximum classification precision in the interval [LBound, mid] as the left threshold Lmax; traversing rightward from mid with step size Step, ending at RBound, and taking the reconstruction error with the maximum classification precision in the interval [mid, RBound] as the right threshold Rmax;
Step 3, if (Rmax - Lmax) is less than or equal to the tolerable error, stopping the iteration and taking the midpoint of Lmax and Rmax as the recognition threshold between the current two categories; otherwise, setting LBound = Lmax, RBound = Rmax and Step = Step × 0.1, and returning to Step 2 for the next iteration;
Step 4, if there is no other malicious traffic category to the right of the category to which the current initial value of RBound belongs, stopping the calculation of recognition thresholds and taking the maximum reconstruction error of that category as the final recognition threshold; otherwise, taking the current initial value of RBound as the next initial value of LBound, taking the highest-frequency reconstruction error of the malicious traffic category closest to that category on its right in the reconstruction error distribution as the next initial value of RBound, and performing Steps 2 to 3 to determine the recognition threshold between the next two categories;
Step 5, sorting the obtained recognition thresholds, with 0 as the initial recognition threshold, to obtain the recognition threshold interval of each category of traffic data. For example, if the sequentially determined recognition thresholds are 0.33, 0.45 and 0.55, the initial recognition threshold is 0 and the final recognition threshold is 0.7, the resulting recognition threshold intervals are [0, 0.33], (0.33, 0.45], (0.45, 0.55] and (0.55, 0.7], where [0, 0.33] is the recognition threshold interval of the normal traffic category and the others correspond to the respective malicious traffic categories.
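The interval construction in the last step can be sketched as follows, using the threshold values from the example above. The function name `build_intervals` and the category labels are hypothetical placeholders, not names used in the patent.

```python
def build_intervals(thresholds, final_threshold, labels):
    """Return (label, low, high) triples ordered by reconstruction error.

    The first interval is [0, t1]; every later interval is (low, high].
    """
    edges = [0.0] + sorted(thresholds) + [final_threshold]
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per interval")
    return [(label, edges[i], edges[i + 1]) for i, label in enumerate(labels)]

# Hypothetical labels: the normal category first, then the malicious categories.
intervals = build_intervals(
    [0.33, 0.45, 0.55], 0.7,
    ["normal", "malicious_1", "malicious_2", "malicious_3"],
)
for label, low, high in intervals:
    print(f"{label}: {low} .. {high}")
```

Sorting inside the function means the thresholds may be supplied in the order in which they were determined.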
It can be appreciated that the variational autoencoder learns the data features of the normal traffic category, so when it reconstructs its input the reconstruction error for normal traffic is smaller than for the malicious categories; therefore, the recognition threshold between the normal traffic category and its adjacent malicious traffic category is determined first, and the recognition thresholds between successive adjacent malicious traffic categories are then determined in turn.
Specifically, the classification precision is the average of the detection accuracies of the current two categories; the detection accuracy of the left category is the ratio of the number of samples of that category lying to the left of the threshold point to the total number of samples of that category; the detection accuracy of the right category is the ratio of the number of samples of that category lying to the right of the threshold point to the total number of samples of that category. It can be understood that when the left threshold Lmax is being determined, the threshold point at which the two detection accuracies are computed is the candidate left threshold point; when the right threshold Rmax is being determined, it is the candidate right threshold point.
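The threshold search between two adjacent categories, together with the classification precision defined above, can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the initial bounds are passed in by the caller rather than derived from a histogram, ties in precision are resolved in favour of the candidate nearest mid, and the `max_iter` cap is a safeguard added here that the described procedure does not mention.

```python
import numpy as np

def classification_precision(t, left_errors, right_errors):
    """Mean of the two per-category detection accuracies at threshold t."""
    left_errors = np.asarray(left_errors, dtype=float)
    right_errors = np.asarray(right_errors, dtype=float)
    left_acc = np.mean(left_errors <= t)    # left-category samples on the left of t
    right_acc = np.mean(right_errors > t)   # right-category samples on the right of t
    return 0.5 * (left_acc + right_acc)

def best_on_segment(start, end, step, left_errors, right_errors):
    """Walk from start towards end in increments of step; return the best candidate."""
    n_steps = int(abs(end - start) / step + 1e-9)
    direction = 1.0 if end >= start else -1.0
    candidates = [start + direction * k * step for k in range(n_steps + 1)] + [end]
    scores = [classification_precision(c, left_errors, right_errors) for c in candidates]
    return candidates[int(np.argmax(scores))]  # first maximum, i.e. nearest to start

def find_threshold(left_errors, right_errors, lbound, rbound,
                   step=0.1, tol=1e-3, max_iter=5):
    """Iteratively narrow [lbound, rbound] and return the midpoint of Lmax and Rmax."""
    for _ in range(max_iter):
        mid = 0.5 * (lbound + rbound)
        lmax = best_on_segment(mid, lbound, step, left_errors, right_errors)
        rmax = best_on_segment(mid, rbound, step, left_errors, right_errors)
        if rmax - lmax <= tol:
            break
        lbound, rbound, step = lmax, rmax, step * 0.1
    return 0.5 * (lmax + rmax)

# Hypothetical reconstruction errors for two well-separated categories.
normal = [0.10, 0.12, 0.15, 0.20]
malicious = [0.40, 0.45, 0.50, 0.55]
threshold = find_threshold(normal, malicious, lbound=0.12, rbound=0.45)
print(threshold, classification_precision(threshold, normal, malicious))
```

When the two error distributions overlap, the precision curve may have several equal maxima and the interval may stop shrinking, which is the case the iteration cap is meant to cover.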
Specifically, the tolerable error can be set according to actual requirements; preferably, its value is 0.001, so that the resulting recognition threshold is more accurate and reliable.
S3, inputting the feature vector data of the encrypted traffic data to be detected into the trained variational autoencoder to obtain a reconstruction error, and obtaining the detection result of the encrypted traffic according to the reconstruction error and the recognition threshold interval of each category.
In a specific implementation, obtaining the detection result of the encrypted traffic according to the reconstruction error and the recognition threshold interval of each category comprises:
if the reconstruction error of the encrypted traffic to be detected falls within the recognition threshold interval of any category, the detection result of the encrypted traffic is that category;
otherwise, the encrypted traffic is judged to be unknown malicious traffic.
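The decision rule above amounts to an interval lookup on the reconstruction error. The sketch below assumes the intervals are contiguous, with the first closed at 0 and the rest of the form (low, high]; the function name `classify`, the label strings and the edge values are hypothetical.

```python
from bisect import bisect_left

def classify(error, edges, labels):
    """Map a reconstruction error to a category label.

    edges = [0, t1, ..., t_final]; labels[i] owns (edges[i], edges[i + 1]],
    except that the first interval also includes its lower edge.
    """
    if error < edges[0] or error > edges[-1]:
        return "unknown malicious"
    index = max(bisect_left(edges, error) - 1, 0)
    return labels[index]

edges = [0.0, 0.33, 0.45, 0.55, 0.7]
labels = ["normal", "malicious_1", "malicious_2", "malicious_3"]
for err in (0.05, 0.33, 0.34, 0.7, 0.9):
    print(err, classify(err, edges, labels))
```

Because the reconstruction error is a mean of squares and cannot be negative, in this sketch the unknown case arises only when the error exceeds the final recognition threshold.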
It can be understood that detecting in this manner allows unknown types of malicious traffic to be identified quickly, so that an early warning can be issued in time and the traffic analysed further, which improves the detection model's resistance to unknown threats.
Compared with the prior art, this embodiment provides a malicious encrypted traffic detection method based on a variational autoencoder. Network encrypted traffic data is collected and processed to obtain total sample data, and the variational autoencoder is trained on normal traffic sample data to learn the feature distribution of normal encrypted traffic. The total sample data is then input into the variational autoencoder to obtain the reconstruction errors of the sample data, from which the recognition threshold interval of each category is derived; the type of the encrypted traffic to be detected is determined from its reconstruction error and the recognition threshold intervals. This realizes identification and detection of encrypted traffic with strong generalization capability and high accuracy, which is of practical significance for ensuring network information security and maintaining normal network operation. The detection method, based on the variational autoencoder deep learning model, can effectively and accurately detect and classify known types of malicious encrypted traffic. Because encrypted traffic is detected by means of the reconstruction error and the recognition threshold interval of each category, the detection categories can be flexibly extended without retraining the variational autoencoder; the method therefore has strong generalization and transfer capability and can quickly identify and detect unknown threats.
Those skilled in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, such as a magnetic disk, an optical disk, a read-only memory or a random access memory.
The present invention is not limited to the above embodiments; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A malicious encrypted traffic detection method based on a variational autoencoder, characterized by comprising the following steps:
collecting and processing network encrypted traffic data to obtain feature vector data, and adding category labels to the feature vector data to obtain total sample data, wherein the category labels cover normal traffic data and malicious-category traffic data;
inputting the normal traffic sample data into a variational autoencoder for training, so that it learns the feature distribution of the normal traffic data, to obtain a trained variational autoencoder; inputting the total sample data into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and determining the recognition threshold interval of each detection category based on the reconstruction errors, wherein the detection categories include a normal category and each malicious category;
and inputting the feature vector data of the encrypted traffic data to be detected into the trained variational autoencoder to obtain a reconstruction error, and obtaining the detection result of the encrypted traffic according to the reconstruction error and the recognition threshold interval of each detection category.
2. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 1, wherein the feature vector data is obtained by performing the following steps:
extracting data features of the encrypted traffic data, performing cleaning, null-value filling and numericalization on the data features, and obtaining the feature vector data through zero-mean normalization, wherein the data features comprise statistical features, time-series features and message payload features.
3. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 2, wherein the statistical features include the number of packets, the number of bytes transmitted, the packet transmission rate and the byte transmission rate; the time-series features include the average packet interval time, the average flow duration, the standard deviation of the packet time interval and the standard deviation of the flow duration; and the message payload features include the protocol version number, the protocol entropy, the TLS cipher suite number and the TLS extension length.
4. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 1, wherein inputting the total sample data into the trained variational autoencoder to obtain the reconstruction errors of the sample data of each detection category, and determining the recognition threshold interval of each detection category based on the reconstruction errors, comprises:
inputting the total sample data into the trained variational autoencoder to obtain reconstructed input data;
obtaining reconstruction errors from the originally input total sample data and the reconstructed input data, and computing the reconstruction error distribution of the traffic data of each category;
and adaptively adjusting the recognition threshold of each category based on the reconstruction error distribution and the detection accuracy of the traffic data of each category, to obtain the recognition threshold interval of each category.
5. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 4, wherein the reconstruction error MSE is obtained by the following formula:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - x'_i\right)^{2}$$
where x_i and x′_i denote the i-th dimension features of the input data and of the reconstructed input data, respectively, and n is the number of feature dimensions of the input data and the reconstructed input data.
6. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 5, wherein the recognition threshold of each category is adaptively adjusted, and the recognition threshold interval of each category obtained, by performing the following steps:
Step 1, setting the initial value of LBound to the reconstruction error with the highest frequency for the normal traffic category in the reconstruction error distribution, setting the initial value of RBound to the reconstruction error with the highest frequency for the malicious traffic category closest to the normal traffic category in the reconstruction error distribution, and setting the initial step size Step to 0.1;
Step 2, taking mid = (LBound + RBound)/2; traversing the reconstruction error distribution leftward from mid with step size Step, ending at LBound, and taking the reconstruction error with the maximum classification precision in the interval [LBound, mid] as the left threshold Lmax; traversing rightward from mid with step size Step, ending at RBound, and taking the reconstruction error with the maximum classification precision in the interval [mid, RBound] as the right threshold Rmax;
Step 3, if (Rmax - Lmax) is less than or equal to the tolerable error, stopping the iteration and taking the midpoint of Lmax and Rmax as the recognition threshold between the current two categories; otherwise, setting LBound = Lmax, RBound = Rmax and Step = Step × 0.1, and returning to Step 2 for the next iteration;
Step 4, if there is no other malicious traffic category to the right of the category to which the current initial value of RBound belongs, stopping the calculation of recognition thresholds and taking the maximum reconstruction error of that category as the final recognition threshold; otherwise, taking the current initial value of RBound as the next initial value of LBound, taking the highest-frequency reconstruction error of the malicious traffic category closest to that category on its right in the reconstruction error distribution as the next initial value of RBound, and performing Steps 2 to 3 to determine the recognition threshold between the next two categories;
Step 5, sorting the obtained recognition thresholds, with 0 as the initial recognition threshold, to obtain the recognition threshold interval of each category of traffic data.
7. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 6, wherein the classification precision is the average of the detection accuracies of the current two categories; the detection accuracy of the left category is the ratio of the number of samples of that category lying to the left of the threshold point to the total number of samples of that category; and the detection accuracy of the right category is the ratio of the number of samples of that category lying to the right of the threshold point to the total number of samples of that category.
8. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 6, wherein the tolerable error has a value of 0.001.
9. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 1, wherein obtaining the detection result of the encrypted traffic according to the reconstruction error and the recognition threshold interval of each category comprises:
if the reconstruction error of the encrypted traffic to be detected falls within the recognition threshold interval of any category, the detection result of the encrypted traffic is that category;
otherwise, the encrypted traffic is judged to be unknown malicious traffic.
10. The malicious encrypted traffic detection method based on a variational autoencoder according to claim 2, wherein inputting the normal traffic sample data into the variational autoencoder for training to obtain the trained variational autoencoder comprises:
inputting the normal traffic sample data into the variational autoencoder to obtain reconstructed normal traffic sample data;
obtaining a loss function Loss from the input normal traffic sample data and the reconstructed normal traffic sample data, wherein the loss function Loss is expressed as:
Loss = KL(N(μ(X), σ²(X)) | N(0, 1)) - L(X, P(X′ = X | Z));
where X denotes the input normal traffic sample data and X′ denotes the reconstructed normal traffic sample data; N(μ(X), σ²(X)) denotes a normal distribution with mean μ(X) and variance σ²(X); N(0, 1) denotes the standard normal distribution with mean 0 and variance 1; KL(·) denotes the KL divergence; P(X′ = X | Z) denotes the probability that the reconstructed X′ equals the input X given the distribution of the hidden vector Z; and -L(·) denotes the negative log-likelihood loss function;
and updating the parameters of the variational autoencoder by using a back-propagation algorithm and a stochastic gradient descent optimization algorithm so as to minimize the loss function, thereby obtaining the trained variational autoencoder.
CN202111604173.0A 2021-12-24 2021-12-24 Malicious encryption flow detection method based on variation self-encoder Pending CN116346384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111604173.0A CN116346384A (en) 2021-12-24 2021-12-24 Malicious encryption flow detection method based on variation self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111604173.0A CN116346384A (en) 2021-12-24 2021-12-24 Malicious encryption flow detection method based on variation self-encoder

Publications (1)

Publication Number Publication Date
CN116346384A true CN116346384A (en) 2023-06-27

Family

ID=86877801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111604173.0A Pending CN116346384A (en) 2021-12-24 2021-12-24 Malicious encryption flow detection method based on variation self-encoder

Country Status (1)

Country Link
CN (1) CN116346384A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116776949A (en) * 2023-06-30 2023-09-19 中国地质科学院矿产资源研究所 Machine learning chemical exploration data processing method and system based on mineral control element restriction
CN116910752A (en) * 2023-07-17 2023-10-20 重庆邮电大学 Malicious code detection method based on big data
CN116910752B (en) * 2023-07-17 2024-03-08 重庆邮电大学 Malicious code detection method based on big data
CN116915504A (en) * 2023-09-11 2023-10-20 中国电子科技集团公司第三十研究所 Fine granularity identification method for unknown protocol flow data in bright and dense state
CN116915504B (en) * 2023-09-11 2023-11-21 中国电子科技集团公司第三十研究所 Fine granularity identification method for unknown protocol flow data in bright and dense state
CN117640252A (en) * 2024-01-24 2024-03-01 北京邮电大学 Encryption stream threat detection method and system based on context analysis
CN117640252B (en) * 2024-01-24 2024-03-26 北京邮电大学 Encryption stream threat detection method and system based on context analysis

Similar Documents

Publication Publication Date Title
CN116346384A (en) Malicious encryption flow detection method based on variation self-encoder
CN112398779B (en) Network traffic data analysis method and system
CN109450842B (en) Network malicious behavior recognition method based on neural network
CN112738039B (en) Malicious encrypted flow detection method, system and equipment based on flow behavior
Wressnegger et al. Zoe: Content-based anomaly detection for industrial control systems
CN110611640A (en) DNS protocol hidden channel detection method based on random forest
CN107370752B (en) Efficient remote control Trojan detection method
Liu et al. LSTM-CGAN: Towards generating low-rate DDoS adversarial samples for blockchain-based wireless network detection models
CN112804253A (en) Network flow classification detection method, system and storage medium
CN112118154A (en) ICMP tunnel detection method based on machine learning
CN114143037A (en) Malicious encrypted channel detection method based on process behavior analysis
CN117220920A (en) Firewall policy management method based on artificial intelligence
CN113037748A (en) C and C channel hybrid detection method and system
CN113965393B (en) Botnet detection method based on complex network and graph neural network
CN111865947B (en) Method for generating abnormal data of power terminal based on transfer learning
CN113382003B (en) RTSP mixed intrusion detection method based on two-stage filter
Whalen et al. Hidden markov models for automated protocol learning
CN112804239B (en) Traffic safety analysis modeling method and system
Sun et al. Covert timing channels detection based on auxiliary classifier generative adversarial network
Gonzalez-Granadillo et al. An improved live anomaly detection system (i-lads) based on deep learning algorithm
CN111343205B (en) Industrial control network security detection method and device, electronic equipment and storage medium
CN114205855A (en) Feeder automation service network anomaly detection method facing 5G slices
CN113449768A (en) Network traffic classification device and method based on short-time Fourier transform
Kapoor et al. Detecting VoIP data streams: approaches using hidden representation learning
Fernandes et al. Statistical, forecasting and metaheuristic techniques for network anomaly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination