CN116232772B - Unsupervised network data intrusion detection method based on ensemble learning

Info

Publication number: CN116232772B
Application number: CN202310509884.2A
Authority: CN
Inventors: 江荣; 刘海天; 周斌; 李爱平; 涂宏魁; 王晔
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-05-08
Filing date: 2023-05-08
Publication date: 2023-07-07
Anticipated expiration: 2043-05-08
Also published as: CN116232772A

Abstract

The invention relates to an unsupervised network data intrusion detection method based on ensemble learning, which comprises the following steps: step S1: processing the collected network flow data into time series data; step S2: reconstructing the time series data into single-point format data, context format data and time-period format data; step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder ensemble CNN-VAE model based on the single-point format data and/or a recurrent neural network predictor TCN-LSTM model based on the context format data and/or a variational autoencoder BILSTM-VAE model based on the time-period format data; step S4: acquiring error data; step S5: comparing the difference between the error data obtained in step S4 and the expected error. The invention can reduce the high cost of manually labeling messages and realize intrusion detection on network data.

Description

Unsupervised network data intrusion detection method based on ensemble learning

Technical Field

The invention relates to the technical field of network data intrusion detection and anomaly identification, in particular to an unsupervised network data intrusion detection method based on ensemble learning.

Background

Network security has become an extensive research area, and detecting malicious activity on networks is one of its most common problems. Intrusion Detection Systems (IDS), among the best solutions for detecting various network threats, can be used to monitor activity in a specific environment. Typical intrusion detection systems, including but not limited to firewalls, access control lists and authentication mechanisms, have long been widely used to improve the security of computer systems.

In terms of detection technology, conventional IDS generally fall into three types: signature-based, anomaly-based, and specification-based. However, these conventional detection techniques cannot handle the multivariate data streams produced by increasingly dynamic and complex modern cyber attacks. Researchers have therefore moved beyond specification-based and signature-based techniques and begun to apply machine learning to the large amounts of data generated by the system. As the demand for intelligence and autonomy increases, neural networks have become an increasingly popular solution for intrusion detection systems. Their ability to learn complex patterns and behaviors makes them well suited to distinguishing normal traffic from network attacks.

Mainstream neural network solutions tend toward supervised training, which has been shown to achieve good anomaly recognition in intrusion detection. However, in addition to autonomy, another important attribute of an IDS is its ability to detect zero-day attacks. Attacks change over time and new attacks are continually discovered, so continuously maintaining a repository of malicious attack traffic may be impractical: experts must annotate network traffic and manually update the models from time to time, which requires the support of a specialized expert knowledge base, and the labeling process is time-consuming and incurs high labor costs. Furthermore, classification is inherently a closed-set approach: a classifier is trained to identify the classes provided in its training set, yet it is not reasonable to assume that all possible malicious traffic can be collected and placed in the training data.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an unsupervised network data intrusion detection method based on ensemble learning, so as to reduce the high cost of manually labeling messages and realize intrusion detection on network data.

The technical scheme adopted by the invention for solving the technical problems is as follows: an unsupervised network data intrusion detection method based on ensemble learning comprises the following steps:

step S1: data preprocessing, namely processing the acquired network flow data into time series data;

step S2: time series data splitting, namely reconstructing the time series data of step S1 into three different data forms: single-point format data, context format data, and time-period format data;

step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder ensemble CNN-VAE model based on the single-point format data and/or a recurrent neural network predictor TCN-LSTM model based on the context format data and/or a variational autoencoder BILSTM-VAE model based on the time-period format data;

step S4: acquiring error data, namely, for newly entered network flow data of unknown nature, performing the data preprocessing of step S1 and the time series data splitting of step S2, and then inputting the result into the intrusion detection model set trained in step S3 to form error data;

step S5: determining the characteristics of the newly entered network flow data, namely, for the newly entered network flow data of unknown nature, comparing the difference between the error data obtained in step S4 and the expected error given by the network administrator to obtain the characteristic determination result of the network flow data.

Preferably, for the data entering the intrusion detection model set, each new piece of network flow data is reconstructed in step S2, in combination with its historical network flow data, into three different data forms, which are respectively used as input data of the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model.

Preferably, the network flow data of step S1 includes network flow data obtained from a secure network environment; based on the network flow data obtained from the secure network environment, the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model respectively learn different intrinsic characteristics of normal network flow data, including its temporal and non-temporal characteristics.

Preferably, in step S4, after the newly entered network flow data of unknown nature has undergone the data preprocessing of step S1 and the time series data splitting of step S2, the variational autoencoder ensemble CNN-VAE model reconstructs the network flow data individually, the recurrent neural network predictor TCN-LSTM model predicts the network flow data from the historical network flow data preceding it, and the variational autoencoder BILSTM-VAE model reconstructs the data segment containing the new network flow data.

Preferably, in step S5, for newly entered network flow data with unknown properties, when the difference between the error data obtained in step S4 and the expected error given by the network administrator is greater than a preset threshold, the network flow data is considered as attack data; and when the difference between the error data obtained in the step S4 and the expected error given by the network administrator is not greater than a preset threshold value, the network flow data is considered to be normal data.

Preferably, the data preprocessing in step S1 includes:

step S101: feature selection, namely extracting features from network stream data;

step S102: feature numeralization, namely assigning a numerical value to the network flow data feature acquired in the step S101;

step S103: feature normalization, namely normalizing the feature values obtained in step S102 to the [0,1] interval.

Preferably, in step S103, the feature values are normalized using the following formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x$ is the original feature value, $x'$ is the normalized value of $x$, $x_{\min}$ is the minimum value exhibited by the same class of feature values in the dataset, and $x_{\max}$ is the maximum value exhibited by the same class of feature values in the dataset.

Preferably, the error data of the intrusion detection model set comprises an error of the variational autoencoder ensemble CNN-VAE model and/or an error of the recurrent neural network predictor TCN-LSTM model and/or an error of the variational autoencoder BILSTM-VAE model.

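Preferably, for the variational autoencoder ensemble CNN-VAE model, its input is $x$, its output is $\hat{x}$, and its error is the reconstruction error $L_{MSE}$ or the VAE loss function $L_{VAE}$:

$$L_{MSE} = \frac{1}{c}\sum_{i=1}^{c}\left(x_i - \hat{x}_i\right)^2, \qquad L_{VAE} = L_{MSE} + D_{KL}$$

wherein $D_{KL}$ is the corresponding KL divergence and the value of c represents the number of features contained in the network flow data.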

Preferably, for the recurrent neural network predictor TCN-LSTM model, the error is the loss function $L$:

$$L = \frac{1}{c}\sum_{i=1}^{c}\left(y_i - \hat{y}_i\right)^2$$

where $y$ is the actual flow feature information of the next timestamp, $\hat{y}$ is the predicted flow feature information, and the value of c represents the number of features contained in the network flow data.

Preferably, for the variational autoencoder BILSTM-VAE model, its error is its reconstruction error, and its reconstruction error is its mean square error or its mean square error plus the corresponding KL divergence.

The invention has the following beneficial effects: three unsupervised anomaly detectors are formed by the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model. Compared with supervised machine learning methods, the invention can reduce the high cost of manually labeling messages and realize intrusion detection on network data. No labels are used in the training process, so malicious and benign labeled data in the network data flow messages do not need to be balanced through oversampling and undersampling; the method has better zero-day attack detection capability and adapts better to new types of network attack.

Drawings

FIG. 1 is a schematic diagram of the overall workflow of the ensemble learning-based unsupervised network data intrusion detection method of the present invention;

FIG. 2 is a schematic diagram of the workflow of time series data splitting in the present invention;

FIG. 3 is a schematic diagram of intrusion detection model set training of the present invention;

FIG. 4 is a schematic diagram of the error output of the present invention for network flow data of unknown nature newly entered;

FIG. 5 is a schematic diagram of the present invention for determining characteristics of newly incoming network flow data;

FIG. 6 is a diagram of the encoder architecture of the variational autoencoder ensemble CNN-VAE model of the present invention;

FIG. 7 is a diagram of encoder parameters of the variational autoencoder ensemble CNN-VAE model of the present invention;

FIG. 8 is a block diagram of a decoder of the variational autoencoder ensemble CNN-VAE model of the present invention;

FIG. 9 is a diagram of decoder parameters of the variational autoencoder ensemble CNN-VAE model of the present invention;

FIG. 10 is a block diagram of a recurrent neural network predictor TCN-LSTM model in accordance with the present invention;

FIG. 11 is a parameter diagram of a recurrent neural network predictor TCN-LSTM model in accordance with the present invention;

FIG. 12 is a block diagram of an encoder of the variational autoencoder BILSTM-VAE model of the present invention;

FIG. 13 is a diagram of encoder parameters of the variational autoencoder BILSTM-VAE model of the present invention;

FIG. 14 is a block diagram of a decoder of the variational autoencoder BILSTM-VAE model of the present invention;

FIG. 15 is a diagram of decoder parameters of the variational autoencoder BILSTM-VAE model of the present invention;

FIG. 16 is a graph illustrating performance comparison and evaluation results of accuracy, recall and F1 score on a KDD Cup 1999 dataset using the network data intrusion detection method of the present invention and other methods.

Detailed Description

The invention will now be described in further detail with reference to the drawings and a preferred embodiment. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.

As shown in FIGS. 1 to 5, in the unsupervised network data intrusion detection method based on ensemble learning according to the present invention, the network data is specifically network flow data, and the network data intrusion detection method includes the following steps:

step S2: time series data splitting, namely reconstructing the time series data of step S1 into three different data forms: single-point format data, context format data, and time-period format data; as shown in FIG. 2, the time series data is duplicated and split to form time-period format data, context format data, and individual single-point format data;

After training, the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model each form an unsupervised anomaly detector, giving three unsupervised anomaly detectors in total. The three models can be trained independently or in parallel.

Specifically, as an optional implementation manner in this embodiment, the data preprocessing in step S1 includes:

In step S101, feature selection consists of extracting features from the network flow data by means of a third-party tool such as CICFlowMeter, manual labeling, or a statistical algorithm. Features are extracted from network flow data collected in a conventional network environment, and the resulting network flow data is further processed into multivariate time series data. Taking the third-party tool CICFlowMeter (a network traffic generator and analyzer) as an example, the tool takes as input a pcap file of network flow data and outputs feature information for the data packets contained in the pcap file, with more than 80 dimensions in total, in the form of a csv table.

The CICFlowMeter tool extracts transport-layer statistics, taking one TCP flow or one UDP flow as the unit. A TCP flow ends with the FIN flag, while a UDP flow is judged to have ended once the configured flowtimeout is exceeded. One TCP flow contains multiple data packets. The statistics within a flow are computed as the extracted features, with the forward direction defined as from the source address to the destination address and the reverse direction as from the destination address to the source address.
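As a non-limiting illustration of the forward and reverse convention described above, per-direction statistics for one flow might be computed as in the following minimal sketch. This is not CICFlowMeter's implementation; the packet tuple layout `(src, dst, size)`, the addresses, and the function name `direction_stats` are assumptions made purely for illustration.

```python
def direction_stats(packets, flow_src, flow_dst):
    """Count packets and bytes per direction for one flow.

    packets: iterable of (src, dst, size) tuples (hypothetical layout).
    Forward means flow_src -> flow_dst; reverse means flow_dst -> flow_src.
    """
    stats = {"fwd_packets": 0, "fwd_bytes": 0, "bwd_packets": 0, "bwd_bytes": 0}
    for src, dst, size in packets:
        if (src, dst) == (flow_src, flow_dst):
            stats["fwd_packets"] += 1
            stats["fwd_bytes"] += size
        elif (src, dst) == (flow_dst, flow_src):
            stats["bwd_packets"] += 1
            stats["bwd_bytes"] += size
    return stats


# Hypothetical packets belonging to a single flow.
packets = [
    ("10.0.0.1", "10.0.0.2", 60),
    ("10.0.0.2", "10.0.0.1", 1500),
    ("10.0.0.1", "10.0.0.2", 40),
]
flow_stats = direction_stats(packets, "10.0.0.1", "10.0.0.2")
```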

In step S102, taking feature collection with CICFlowMeter as an example, most flow features can be collected directly as numeric values through CICFlowMeter, such as serror_rate (the number of connections with a SYN error) and RERROR_rate (the number of connections in which a REJ error occurs), or can be obtained by statistical methods using the CICFlowMeter software. String-type data such as Protocol (the network protocol) needs further discretization: the string values are assigned sequential integer codes, for example the TCP type is set to 1, the UDP type is set to 2, and so on, which completes the numeralization of the string-type data.
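The sequential numbering of string-type features described above can be sketched as follows. This is a minimal illustration only; the helper names `build_code_table` and `numeralize` are hypothetical, and codes are assumed to be assigned in order of first appearance.

```python
def build_code_table(values):
    """Assign consecutive integer codes (1, 2, ...) in order of first appearance."""
    table = {}
    for value in values:
        if value not in table:
            table[value] = len(table) + 1
    return table


def numeralize(values, table):
    """Replace each string value with its integer code."""
    return [table[value] for value in values]


protocols = ["TCP", "UDP", "TCP", "ICMP"]
protocol_codes = build_code_table(protocols)   # TCP -> 1, UDP -> 2, ICMP -> 3
encoded = numeralize(protocols, protocol_codes)
```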

In step S103, the feature values obtained in step S102 are normalized to the [0,1] interval to avoid unbalanced influence of the features in classification; the feature values are normalized using the following formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x$ is the original feature value, $x'$ is the normalized value of $x$, $x_{\min}$ is the minimum value exhibited by the same class of feature values in the dataset, and $x_{\max}$ is the maximum value exhibited by the same class of feature values in the dataset.
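By way of example, the min-max normalization of step S103 could be applied to one feature column as in the sketch below. The function name `min_max_normalize` is hypothetical, and mapping a constant column to 0 is an assumption added to avoid division by zero; the source does not address that case.

```python
def min_max_normalize(column):
    """Scale one feature column to the [0, 1] interval."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


durations = [2.0, 4.0, 10.0]            # hypothetical raw feature values
normalized = min_max_normalize(durations)
```

With these inputs the minimum maps to 0, the maximum maps to 1, and the middle value maps to (4 - 2) / (10 - 2) = 0.25.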

Specifically, as an alternative implementation manner in this embodiment, for the data entering the intrusion detection model set, each new piece of network flow data is reconstructed in step S2, in combination with its historical network flow data, into three different data forms, which are respectively used as input data of the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model.

For example, for sequence data $X = \{x_1, x_2, \ldots, x_t\}$: when converting to single-point format data, the data of the latest time point entering the system, $x_t$, is retained as training data for the variational autoencoder ensemble CNN-VAE model; when converting to context format data, the historical data segment $\{x_1, \ldots, x_{t-1}\}$ and the prediction target data $x_t$ are retained as training data for the recurrent neural network predictor TCN-LSTM model; when converting to time-period format data, all of the sequence data $\{x_1, \ldots, x_t\}$ is retained as training data for the variational autoencoder BILSTM-VAE model.
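The splitting of a time series into the three data forms can be sketched with a sliding window as follows. This is an illustrative sketch under stated assumptions: the function name `split_formats` is hypothetical, a single window length is used for both the time-period data and the context data, and the last point of each window is treated as the key point to be detected.

```python
def split_formats(series, window):
    """Split a multivariate time series into the three data forms.

    series: list of feature vectors ordered by time.
    window: sliding window length (assumed shared by all three forms).
    """
    single_points, contexts, periods = [], [], []
    for end in range(window, len(series) + 1):
        segment = series[end - window:end]
        key_point = segment[-1]                      # latest time point
        single_points.append(key_point)              # single-point format
        contexts.append((segment[:-1], key_point))   # (history, prediction target)
        periods.append(segment)                      # time-period format
    return single_points, contexts, periods


series = [[0.1], [0.2], [0.3], [0.4]]    # hypothetical one-feature series
single_points, contexts, periods = split_formats(series, window=3)
```

Each window therefore yields one sample for each of the three detectors, all aligned on the same key point.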

Specifically, as an optional implementation manner in this embodiment, the network flow data in step S1 includes network flow data obtained from a secure network environment, and based on the network flow data obtained from the secure network environment, the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model respectively learn different intrinsic features of the normal network flow data, including its temporal and non-temporal features.

Specifically, as an alternative implementation manner in this embodiment, the present invention constructs the intrusion detection model set from three different single learners used in parallel: a variational autoencoder ensemble CNN-VAE model based on single-point format data, a recurrent neural network predictor TCN-LSTM model based on context format data, and a variational autoencoder BILSTM-VAE model based on time-period format data. As an optional implementation manner in this embodiment, these three single learners are trained in parallel. As an optional implementation manner in this embodiment, before parallel training, the multivariate time series data is divided into sub-sequences in step S2, the data of the last time point (Last Point) of each sub-sequence is set as the key point data (Key Point) to be detected, and a sliding window is introduced to control the length of the time-period data corresponding to the sub-sequence and the length of the historical data in the context data.

For the variational autoencoder ensemble CNN-VAE model based on single-point format data, the main purpose of each encoder is to convert the input $x$ into the intermediate variable $z$, learning the implicit features of the input data; the main purpose of each decoder is to convert the intermediate variable $z$ into the output $\hat{x}$, i.e., to reconstruct the input data using the newly learned features. FIGS. 6 and 7 show the encoder structure diagram and encoder parameter diagram, respectively, of the variational autoencoder ensemble CNN-VAE model; FIGS. 8 and 9 show its decoder structure diagram and decoder parameter diagram, respectively. Specifically, as an alternative implementation of this embodiment, in FIGS. 6 and 7, the function of the encoding mean layer is to encode the input data into a mean vector, which represents the distribution of the input data in the latent space; the function of the encoding variance layer is to encode the input data into a variance vector, which represents the variance of the distribution of the input data in the latent space; the latent vector layer implements sampling, in which the network converts, i.e., encodes, the input data into a latent vector, a low-dimensional representation of the learned data that contains the characteristics of the input data.
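The sampling performed by the latent vector layer can be illustrated with the reparameterization commonly used in variational autoencoders. The following is a minimal sketch and not the patented network: the function name `sample_latent` is hypothetical, and it is an assumption that the variance layer outputs the logarithm of the variance, which the source does not state.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), element-wise.

    Assumption: log_var holds log(sigma^2), so sigma = exp(0.5 * log_var).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)     # fixed seed so the sketch is repeatable
mean = [0.0, 1.0]          # hypothetical output of the encoding mean layer
log_var = [0.0, -2.0]      # hypothetical output of the encoding variance layer
z = sample_latent(mean, log_var, rng)
```

When the variance is very small the sample collapses onto the mean vector, which is why the mean vector alone is often used at detection time in such models.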

The variational autoencoder ensemble CNN-VAE model for single time point data detects point anomalies in the time series data through an autoencoder ensemble (an ensemble of neural networks called autoencoders) built on a shared framework. The point anomaly detection strategy is based on the following principle: the CNN-VAE model is trained on normal time point data and given n tasks, each of which reconstructs the data features of the same time point; the hidden feature codes of the n tasks interact in the hidden layer so that they learn the normal feature combinations of the data. As an example, with reference to the number of flow features provided in the KDD-CUP-1999 dataset, n=41 is set in the CNN-VAE model: 41 features are selected from the network flow data, and [41×41] two-dimensional matrix input data is constructed using 41 different feature ordering methods.
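The source does not specify which 41 feature orderings are used. Purely as an assumed illustration, the sketch below builds an n×n input matrix in which each row is the same feature vector under a different cyclic rotation; the function name `build_ordering_matrix` and the choice of rotations are hypothetical.

```python
def build_ordering_matrix(features):
    """Stack n reorderings of an n-feature vector into an n x n matrix.

    Assumption: row k is the feature vector rotated left by k positions.
    Any other family of n distinct orderings could be substituted.
    """
    n = len(features)
    return [features[shift:] + features[:shift] for shift in range(n)]


features = [0.1, 0.2, 0.3, 0.4]      # small example; the text uses n = 41
matrix = build_ordering_matrix(features)
```

Every row contains the same feature values, so each of the n reconstruction tasks sees the same time point under a different arrangement.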

Specifically, as an alternative implementation manner in this embodiment, the error data of the intrusion detection model set includes an error of the variational autoencoder ensemble CNN-VAE model and/or an error of the recurrent neural network predictor TCN-LSTM model and/or an error of the variational autoencoder BILSTM-VAE model.

Specifically, as an alternative implementation of this embodiment, for the variational autoencoder ensemble CNN-VAE model, its input is $x$, its output is $\hat{x}$, and its error is the reconstruction error $L_{MSE}$ or the VAE loss function $L_{VAE}$. Its reconstruction error uses the mean square error, and its VAE loss function is the reconstruction error $L_{MSE}$ plus the corresponding KL divergence $D_{KL}$:

$$L_{MSE} = \frac{1}{c}\sum_{i=1}^{c}\left(x_i - \hat{x}_i\right)^2, \qquad L_{VAE} = L_{MSE} + D_{KL}$$

wherein the value of c represents the number of features contained in the network flow data. Depending on the requirements of the situation, the variational autoencoder ensemble CNN-VAE model still works normally even if its error uses only the mean square error $L_{MSE}$ and the corresponding KL divergence is ignored.
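The two error options described above (mean square error alone, or mean square error plus KL divergence) can be sketched as follows. This is an illustration under stated assumptions, not the patented loss: the KL term uses the closed form for a diagonal Gaussian posterior against a standard normal prior, which is the usual choice in variational autoencoders but is not confirmed by the source, and all function names are hypothetical.

```python
import math


def mse(x, x_hat):
    """Mean square reconstruction error over the c features."""
    c = len(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / c


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions.

    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mean, log_var)
    )


def vae_loss(x, x_hat, mean, log_var):
    """Reconstruction error plus the KL divergence."""
    return mse(x, x_hat) + kl_to_standard_normal(mean, log_var)
```

Dropping the second term of `vae_loss` gives the error variant in which the KL divergence is ignored.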

Specifically, as an alternative implementation manner in this embodiment, for the recurrent neural network predictor TCN-LSTM model, its error is the loss function $L$:

$$L = \frac{1}{c}\sum_{i=1}^{c}\left(y_i - \hat{y}_i\right)^2$$

where $y$ is the actual flow feature information of the next timestamp, $\hat{y}$ is the predicted flow feature information, and the value of c represents the number of features contained in the network flow data. FIG. 10 shows a structural diagram of the recurrent neural network predictor TCN-LSTM model, and FIG. 11 shows a parameter diagram of the recurrent neural network predictor TCN-LSTM model.

Specifically, as an alternative implementation manner in this embodiment, for the variational autoencoder BILSTM-VAE model, its error is its reconstruction error, and its reconstruction error is its mean square error or its mean square error plus the corresponding KL divergence. FIG. 12 shows an encoder block diagram of the variational autoencoder BILSTM-VAE model; FIG. 13 shows its encoder parameter diagram; FIG. 14 shows its decoder block diagram; FIG. 15 shows its decoder parameter diagram. Depending on the requirements of the situation, the BILSTM-VAE model still works normally even if its reconstruction error uses only the mean square error and the corresponding KL divergence is ignored. Specifically, as an alternative implementation manner in this embodiment, in FIGS. 12 to 15, a strided-slice layer, which is a TensorFlow operation layer (TF operation layer), is used to perform a slicing operation on a tensor, that is, to select part of a tensor with a specified stride along any dimension. Specifically, as an alternative implementation manner in this embodiment, data with a historical time series length of 3 is sliced by the strided-slice layer, and the tensor data of each time point is taken out independently for the encoding and decoding operations.

Specifically, as an alternative implementation manner in this embodiment, for the recurrent neural network predictor TCN-LSTM model based on context format data, during training the model takes a segment of historical time series information (used as context) and attempts to predict the flow feature information of the next timestamp. It is trained by comparing the actual flow feature information $y$ of the next timestamp with the predicted flow feature information $\hat{y}$, so that $\hat{y}$ approaches $y$ as closely as possible. Specifically, as an alternative implementation manner in this embodiment, the recurrent neural network predictor TCN-LSTM model adopts a joint structure of a three-layer temporal convolutional network (TCN) and a long short-term memory network (LSTM layer): the LSTM layer receives the output data of the TCN layers and is used to predict the flow feature data of the next timestamp, and different window sizes are used to capture historical context information at different levels.

Specifically, as an alternative implementation manner in this embodiment, for the variational autoencoder BILSTM-VAE model based on time-period format data, the model is trained on normal sequences so as to learn the normal pattern of the time series data and make the output data consistent with the input data. Specifically, as an alternative implementation manner in this embodiment, the BILSTM-VAE model extracts the contextual interaction relationships in the time series data through a BILSTM layer; the output at each time point corresponds to one variational autoencoder, each consisting of an independent encoder and decoder; and the original time series data is reconstructed through another BILSTM layer. The invention can use different window sizes to capture the system state at different resolutions; the parameters given for constructing the BILSTM-VAE model take a selected window length of SW=3 as an example.

Specifically, in step S4, after the newly entered network flow data of unknown nature has undergone the data preprocessing of step S1 and the time series data splitting of step S2, the variational autoencoder ensemble CNN-VAE model reconstructs the network flow data individually, the recurrent neural network predictor TCN-LSTM model predicts the network flow data from the historical network flow data preceding it, and the variational autoencoder BILSTM-VAE model reconstructs the data segment containing the new network flow data. Depending on the requirements of the situation, the error output by the CNN-VAE model or the BILSTM-VAE model may use only the mean square error and ignore the KL divergence, and the method still works normally.

Specifically, as an optional implementation manner in this embodiment, in step S5, for newly entered network flow data with unknown properties, when the difference between the error data obtained in step S4 and the expected error given by the network administrator is greater than a preset threshold, the network flow data is considered as attack data; and when the difference between the error data obtained in the step S4 and the expected error given by the network administrator is not greater than a preset threshold value, the network flow data is considered to be normal data.
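The decision rule of step S5 for a single detector can be sketched as follows. The function name `classify_flow` and the default threshold of 0 are assumptions for illustration; in practice both the expected error and the threshold are supplied by the network administrator.

```python
def classify_flow(error, expected_error, threshold=0.0):
    """Label a flow by comparing its detector error with the expected error.

    The flow is treated as attack data only when the difference is strictly
    greater than the preset threshold; otherwise it is treated as normal.
    """
    difference = error - expected_error
    return "attack" if difference > threshold else "normal"
```

For instance, `classify_flow(2.5, 2.0)` yields `"attack"`, while `classify_flow(0.9, 1.0)` yields `"normal"`.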

Specifically, as an alternative implementation manner in this embodiment, newly entered network flow data of unknown nature is divided, after data preprocessing and time series data splitting, into single-point format data, context format data and time-period format data, which are respectively used as inputs of the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model. The error output by each of the three unsupervised anomaly detectors formed by the CNN-VAE model, the TCN-LSTM model and the BILSTM-VAE model is obtained and compared with the expected error given by the network administrator, yielding each detector's determination result for the characteristics of the new network flow data; the importance (priority) of the different unsupervised anomaly detectors for each attack flow is then quantified according to the three determination results.

Specifically, as an optional implementation manner in this embodiment, the anomaly scores obtained by the three determination results are quantized to form final anomaly scores, and when the final anomaly scores are greater than a preset certain threshold, the new network flow data is considered as attack data.

In the present invention, the anomaly scores obtained from the individual single-detector models are weighted and summed to quantify the final anomaly score. One optional weighting is to average the differences between the errors of the three unsupervised anomaly detectors and the expected errors given by the network administrator (i.e., the acceptable threshold errors). For example, if the errors of the three unsupervised anomaly detectors are <0.5, 2.5, 0.9> and the error thresholds given by the network administrator are <0.4, 2.0, 1.0>, then the anomaly scores of the three detectors are <0.1, 0.5, -0.1> and their equal-weight average is approximately 0.167. The higher the anomaly score, the more anomalous the input data; an anomaly score smaller than 0 indicates normal data. By adjusting the weight given to each single detector's anomaly score, the contribution of that detector's result to the final detection result can be controlled. For example, if intrusions into the target network mostly manifest as point anomalies, the weight of the anomaly score obtained by the CNN-VAE model can be increased accordingly.
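The worked example above can be reproduced with the following minimal sketch. The function names `anomaly_scores` and `weighted_score` are hypothetical, and equal weights are assumed when none are given.

```python
def anomaly_scores(errors, expected_errors):
    """Per-detector anomaly score: detector error minus its expected error."""
    return [e - t for e, t in zip(errors, expected_errors)]


def weighted_score(scores, weights=None):
    """Weighted sum of the per-detector scores (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))


errors = [0.5, 2.5, 0.9]       # errors of the three detectors (from the example)
expected = [0.4, 2.0, 1.0]     # administrator-supplied expected errors
scores = anomaly_scores(errors, expected)   # about 0.1, 0.5 and -0.1
final_score = weighted_score(scores)        # about 0.167
is_attack = final_score > 0                 # acceptance threshold assumed to be 0
```

Passing an explicit `weights` list, for example one that favors the first detector, changes how much each detector contributes to `final_score`.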

Specifically, as an alternative implementation manner in this embodiment, if the anomaly scores obtained by different single detectors differ greatly in magnitude, the anomaly scores or the underlying error values need to be preprocessed before the weighted summation so that they fall within the same numerical interval. For example, for the CNN-VAE model or the BILSTM-VAE model, whether the KL divergence is added to the reconstruction error greatly affects the value range of the error. As one option in this embodiment, a simple way of processing the error values directly is a probabilistic choice of Φ: the output error values are fitted to a log-normal or non-standard distribution, and the probability of occurrence of an output error value is then used directly as its anomaly score, in which case the lower the probability of occurrence, the more anomalous the data. As another option in this embodiment, a simple normalization method may be used to normalize the obtained error values to the [0,1] interval, after which the resulting anomaly scores are weighted and summed. In the detection stage, the final detection decision is made by comparing the weighted anomaly score with the acceptance threshold set by the network administrator (generally set directly to 0): when the final anomaly score is greater than the threshold, the data is considered an anomaly, attack or intrusion.
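One possible reading of the probabilistic option is sketched below: the errors observed on normal data are fitted to a log-normal distribution, and the upper-tail probability of a new error serves as its probability of occurrence. The use of the upper tail, the moment-based fit on log-errors, and the function names are all assumptions; the source does not fix these details.

```python
import math
import statistics


def fit_log_normal(errors):
    """Estimate (mu, sigma) of log(error) from errors observed on normal data.

    Assumes all errors are positive and not all identical.
    """
    logs = [math.log(e) for e in errors]
    return statistics.mean(logs), statistics.pstdev(logs)


def tail_probability(error, mu, sigma):
    """P(E >= error) under the fitted log-normal distribution.

    A lower probability means the error is rarer, i.e. more anomalous.
    """
    z = (math.log(error) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical errors collected from normal traffic.
mu, sigma = fit_log_normal([1.0, math.e, math.e ** 2])
typical = tail_probability(math.e, mu, sigma)       # near the median
extreme = tail_probability(math.e ** 3, mu, sigma)  # far in the upper tail
```

Because every detector's output is mapped to a probability in [0, 1], the scores become comparable before the weighted summation regardless of whether the KL term was included in the error.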

The three unsupervised anomaly detectors are the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model. Compared with supervised machine learning methods, the unsupervised neural network models greatly reduce the expensive cost of manually labeling messages: no labels are used during training, and malicious and benign labeled data in the network data stream do not need to be balanced by oversampling and undersampling. The method is also better able to detect zero-day attacks and adapts better to new types of network attack. By performing intrusion detection on several different forms of time-series data, the invention reduces the expert knowledge and manual intervention required for model training and efficiently detects network attack data that may be present.

The invention integrates point anomaly detection and contextual anomaly detection by using three different deep learning model frameworks, is less affected by noise, and takes into account the different types of anomalous data present in time-series data. Compared with other detection methods, the method improves on metrics such as precision, recall and F1 score. The results of a comparative evaluation of precision, recall and F1 score on the KDD Cup 1999 dataset, using the network data intrusion detection method of the invention and other methods, are shown in Fig. 16. The first four columns on the X-axis give the results of four popular unsupervised detection methods (PCA, KNN, FB and AE). The last four columns give the results of the variational autoencoder BILSTM-VAE model, the recurrent neural network predictor TCN-LSTM model, the variational autoencoder set CNN-VAE model, and the ensemble of the three models. It can be seen that the last four columns improve on the first four in precision, recall, F1 score and other metrics.

The unsupervised network data intrusion detection method based on ensemble learning provided by the invention has been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the above description of the embodiments is only intended to help in understanding the method and core idea of the invention. Since those skilled in the art may vary the specific embodiments and the scope of application in accordance with the ideas of the invention, the contents of this description should not be construed as limiting the invention.

Claims

1. An unsupervised network data intrusion detection method based on ensemble learning, characterized by comprising the following steps:

step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder set CNN-VAE model based on single-point format data, a recurrent neural network predictor TCN-LSTM model based on context format data and a variational autoencoder BILSTM-VAE model based on time-period format data;

after newly entered network flow data of unknown nature has been processed into time-series data through step S1 and step S2, the network flow data is reconstructed on its own by the variational autoencoder set CNN-VAE model, is predicted from its preceding historical network flow data by the recurrent neural network predictor TCN-LSTM model, and the data segment containing the new network flow data is reconstructed by the variational autoencoder BILSTM-VAE model;

for the variational autoencoder set CNN-VAE model, the error between its input and its output is either the reconstruction error or the VAE loss function, wherein the reconstruction error uses the mean square error and the VAE loss function is the reconstruction error plus the corresponding KL divergence;

for the recurrent neural network predictor TCN-LSTM model, the error is a loss function computed between the actual flow feature information for the next timestamp and the predicted flow feature information;

for the variational autoencoder BILSTM-VAE model, the error is the reconstruction error, wherein the reconstruction error is the mean square error or the mean square error plus the corresponding KL divergence.

2. The unsupervised network data intrusion detection method based on ensemble learning according to claim 1, wherein, for data entering the intrusion detection model set, each new piece of network flow data is reconstructed in step S2, in combination with its historical network flow data, into three different data forms, which are used respectively as input data for the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model.

3. The unsupervised network data intrusion detection method based on ensemble learning according to claim 1, wherein the network flow data of step S1 comprises network flow data obtained from a secure network environment; based on the network flow data obtained from the secure network environment, the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model respectively learn different intrinsic characteristics of normal network flow data, including its time-series and non-time-series characteristics.

4. The unsupervised network data intrusion detection method based on ensemble learning according to claim 1, wherein in step S5, for newly entered network flow data of unknown nature, when the difference between the error data obtained in step S4 and the expected error given by the network administrator is greater than a preset threshold, the network flow data is considered attack data; and when the difference between the error data obtained in step S4 and the expected error given by the network administrator is not greater than the preset threshold, the network flow data is considered normal data.

5. The method for unsupervised network data intrusion detection based on ensemble learning according to claim 1, wherein the data preprocessing of step S1 comprises:

6. The unsupervised network data intrusion detection method based on ensemble learning according to claim 5, wherein in step S103 the feature values are normalized using a formula that maps each original feature value to its normalized value [formula and remaining symbol definitions missing from the source text].

7. The unsupervised network data intrusion detection method based on ensemble learning according to claim 1, wherein the error data of the intrusion detection model set includes the error of the variational autoencoder set CNN-VAE model and/or the error of the recurrent neural network predictor TCN-LSTM model and/or the error of the variational autoencoder BILSTM-VAE model.