CN116232772B - Unsupervised network data intrusion detection method based on ensemble learning - Google Patents

Unsupervised network data intrusion detection method based on ensemble learning

Info

Publication number
CN116232772B
Authority
CN
China
Prior art keywords
data
network
error
encoder
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310509884.2A
Other languages
Chinese (zh)
Other versions
CN116232772A (en)
Inventor
江荣
刘海天
周斌
李爱平
涂宏魁
王晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202310509884.2A
Publication of CN116232772A
Application granted
Publication of CN116232772B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to an unsupervised network data intrusion detection method based on ensemble learning, comprising the following steps. Step S1: processing the collected network flow data into time series data. Step S2: reconstructing the time series data into single-point format data, context format data and time-period format data. Step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder ensemble CNN-VAE model based on the single-point format data and/or a recurrent neural network predictor TCN-LSTM model based on the context format data and/or a variational autoencoder BILSTM-VAE model based on the time-period format data. Step S4: acquiring error data. Step S5: comparing the difference between the error data obtained in step S4 and the expected error. The invention reduces the high cost of manually labeling messages and realizes intrusion detection on network data.

Description

Unsupervised network data intrusion detection method based on ensemble learning
Technical Field
The invention relates to the technical field of network data intrusion detection and anomaly identification, in particular to an unsupervised network data intrusion detection method based on ensemble learning.
Background
Network security has become an extensive research area, and detecting malicious activity on networks is one of its most common problems. Intrusion detection systems (IDS), regarded as one of the best solutions for detecting a variety of network threats, can be used to inspect activity in a specific environment. Typical intrusion detection and protection mechanisms, including but not limited to firewalls, access control lists and authentication mechanisms, have long been widely used to improve the security of computer systems.
In terms of detection technology, conventional IDS generally fall into three types: signature-based, anomaly-based and specification-based. However, these conventional detection techniques cannot handle the multivariate data streams produced by increasingly dynamic and complex modern cyber attacks. Researchers have therefore moved beyond specification-based and signature-based techniques and begun to apply machine learning to the large amounts of data generated by systems. As the demand for intelligence and autonomy grows, neural networks have become an increasingly popular approach to intrusion detection. Their ability to learn complex patterns and behaviors makes them well suited to distinguishing normal traffic from network attacks.
Mainstream neural network solutions tend to rely on supervised training, which has been shown to recognize anomalies well in intrusion detection. However, besides autonomy, another important property of an IDS is its ability to detect zero-day attacks. Attacks change over time and new attacks are continually discovered, so continuously maintaining a repository of malicious attack traffic may be impractical: experts would have to annotate network traffic and manually update the model from time to time, which requires specialized expert knowledge, and the labeling process is time-consuming and expensive in labor. Furthermore, classification is inherently a closed-set method: a classifier is trained to recognize the classes provided in the training set, yet it is unreasonable to assume that all possible malicious traffic can be collected and placed in the training data.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an unsupervised network data intrusion detection method based on ensemble learning, so as to reduce the high cost of manually labeling messages and realize intrusion detection on network data.
The technical solution adopted by the invention to solve this problem is an unsupervised network data intrusion detection method based on ensemble learning, comprising the following steps:
step S1: data preprocessing, namely processing the collected network flow data into time series data;
step S2: time series data splitting, namely reconstructing the time series data of step S1 into three different data forms: single-point format data, context format data and time-period format data;
step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder ensemble CNN-VAE model based on the single-point format data and/or a recurrent neural network predictor TCN-LSTM model based on the context format data and/or a variational autoencoder BILSTM-VAE model based on the time-period format data;
step S4: acquiring error data, namely, for newly arriving network flow data of unknown nature, performing the preprocessing of step S1 and the time series data splitting of step S2, and then inputting the result into the intrusion detection model set trained in step S3 to produce error data;
step S5: judging the nature of the newly arriving network flow data, namely, for newly arriving network flow data of unknown nature, comparing the difference between the error data obtained in step S4 and the expected error given by the network administrator to obtain the judgment result for the network flow data.
Preferably, for data entering the intrusion detection model set, each new piece of network flow data is combined with its historical network flow data and reconstructed in step S2 into the three different data forms, which serve respectively as the input data of the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model.
Preferably, the network flow data of step S1 includes network flow data obtained from a secure network environment; based on the network flow data obtained from the secure network environment, the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model each learn different intrinsic characteristics of normal network flow data, including its temporal and non-temporal characteristics.
Preferably, in step S4, after the newly arriving network flow data of unknown nature has passed through step S1 and the time series data splitting of step S2, the variational autoencoder ensemble CNN-VAE model reconstructs the network flow data on its own, the recurrent neural network predictor TCN-LSTM model predicts the network flow data from the historical network flow data preceding it, and the variational autoencoder BILSTM-VAE model reconstructs the data segment that contains the new network flow data.
Preferably, in step S5, for newly arriving network flow data of unknown nature, when the difference between the error data obtained in step S4 and the expected error given by the network administrator is greater than a preset threshold, the network flow data is considered attack data; when the difference is not greater than the preset threshold, the network flow data is considered normal data.
Preferably, the data preprocessing in step S1 includes:
step S101: feature selection, namely extracting features from the network flow data;
step S102: feature numeric encoding, namely assigning numeric values to the network flow data features obtained in step S101;
step S103: feature normalization, namely normalizing the feature values obtained in step S102 to the [0,1] interval.
Preferably, in step S103, the feature values are normalized using the following formula:
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
where $x$ is the original feature value, $x'$ is the normalized value of $x$, $x_{\min}$ is the minimum value taken by feature values of the same class in the dataset, and $x_{\max}$ is the maximum value taken by feature values of the same class in the dataset.
Preferably, the error data of the intrusion detection model set comprises the error of the variational autoencoder ensemble CNN-VAE model and/or the error of the recurrent neural network predictor TCN-LSTM model and/or the error of the variational autoencoder BILSTM-VAE model.
Preferably, for the variational autoencoder ensemble CNN-VAE model, the input is $x$, the output is $\hat{x}$, and the error is either the reconstruction error $L_{\mathrm{rec}}$ or the VAE loss function $L_{\mathrm{VAE}}$:
$L_{\mathrm{rec}} = \frac{1}{c}\sum_{i=1}^{c}\left(x_i - \hat{x}_i\right)^2$
$L_{\mathrm{VAE}} = L_{\mathrm{rec}} + D_{\mathrm{KL}}$
where $c$ is the number of features contained in the network flow data and $D_{\mathrm{KL}}$ is the corresponding KL divergence.
Preferably, for the recurrent neural network predictor TCN-LSTM model, the error is the loss function $L(y, \hat{y})$ [equation image not preserved in this text], where $y$ is the actual flow feature information at the next timestamp, $\hat{y}$ is the predicted flow feature information, and $c$ is the number of features contained in the network flow data.
Preferably, for the variational autoencoder BILSTM-VAE model, the error is its reconstruction error, which is either its mean square error or its mean square error plus the corresponding KL divergence.
The beneficial effects of the invention are as follows. Three unsupervised anomaly detectors are formed from the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model. Compared with supervised machine learning methods, the invention reduces the high cost of manually labeling messages and realizes intrusion detection on network data. No labels are used during training, so there is no need to balance malicious and benign labeled data in the network flow messages by oversampling or undersampling; the method detects zero-day attacks better and adapts better to new types of network attack.
Drawings
FIG. 1 is a schematic diagram of the overall workflow of the ensemble-learning-based unsupervised network data intrusion detection method of the invention;
FIG. 2 is a schematic diagram of the workflow of time series data splitting in the invention;
FIG. 3 is a schematic diagram of the training of the intrusion detection model set of the invention;
FIG. 4 is a schematic diagram of the error output of the invention for newly arriving network flow data of unknown nature;
FIG. 5 is a schematic diagram of how the invention judges the nature of newly arriving network flow data;
FIG. 6 is an encoder structure diagram of the variational autoencoder ensemble CNN-VAE model of the invention;
FIG. 7 is an encoder parameter diagram of the variational autoencoder ensemble CNN-VAE model of the invention;
FIG. 8 is a decoder structure diagram of the variational autoencoder ensemble CNN-VAE model of the invention;
FIG. 9 is a decoder parameter diagram of the variational autoencoder ensemble CNN-VAE model of the invention;
FIG. 10 is a structure diagram of the recurrent neural network predictor TCN-LSTM model of the invention;
FIG. 11 is a parameter diagram of the recurrent neural network predictor TCN-LSTM model of the invention;
FIG. 12 is an encoder structure diagram of the variational autoencoder BILSTM-VAE model of the invention;
FIG. 13 is an encoder parameter diagram of the variational autoencoder BILSTM-VAE model of the invention;
FIG. 14 is a decoder structure diagram of the variational autoencoder BILSTM-VAE model of the invention;
FIG. 15 is a decoder parameter diagram of the variational autoencoder BILSTM-VAE model of the invention;
FIG. 16 shows the performance comparison and evaluation results, in terms of accuracy, recall and F1 score, of the network data intrusion detection method of the invention and other methods on the KDD Cup 1999 dataset.
Detailed Description
The invention will now be described in further detail with reference to the drawings and a preferred embodiment. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
As shown in figs. 1 to 5, in the unsupervised network data intrusion detection method based on ensemble learning of the invention, the network data is specifically network flow data, and the method comprises the following steps:
step S1: data preprocessing, namely processing the collected network flow data into time series data;
step S2: time series data splitting, namely reconstructing the time series data of step S1 into three different data forms: single-point format data, context format data and time-period format data; as shown in fig. 2, the time series data is duplicated and split to form time-period format data, context format data and individual single-point format data;
step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder ensemble CNN-VAE model based on the single-point format data and/or a recurrent neural network predictor TCN-LSTM model based on the context format data and/or a variational autoencoder BILSTM-VAE model based on the time-period format data;
after training, the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model form three unsupervised anomaly detectors. The three models can be trained independently or in parallel;
step S4: acquiring error data, namely, for newly arriving network flow data of unknown nature, performing the preprocessing of step S1 and the time series data splitting of step S2, and then inputting the result into the intrusion detection model set trained in step S3 to produce error data;
step S5: judging the nature of the newly arriving network flow data, namely, for newly arriving network flow data of unknown nature, comparing the difference between the error data obtained in step S4 and the expected error given by the network administrator to obtain the judgment result for the network flow data.
Specifically, as an optional implementation of this embodiment, the data preprocessing in step S1 includes:
step S101: feature selection, namely extracting features from the network flow data;
step S102: feature numeric encoding, namely assigning numeric values to the network flow data features obtained in step S101;
step S103: feature normalization, namely normalizing the feature values obtained in step S102 to the [0,1] interval.
In step S101, features may be extracted from the network flow data with a third-party tool such as CICFlowMeter, by manual labeling, or by a statistical algorithm. Features are extracted from network flow data collected in an ordinary network environment, and the resulting data is further processed into multivariate time series data. Taking the third-party tool CICFlowMeter (a network traffic generator and analyzer) as an example, the tool takes a pcap file of network flow data as input and outputs the feature information of the packets contained in the pcap file, more than 80 dimensions in total, as a csv table.
CICFlowMeter extracts transport-layer statistics, with one TCP flow or one UDP flow as the unit. A TCP flow ends with the FIN flag; a UDP flow is bounded by the configured flow timeout and is considered finished once that time is exceeded. One TCP flow contains multiple packets. The statistics computed within a flow are used as the extracted features, where the direction from source address to destination address is defined as forward and the direction from destination address to source address as backward.
In step S102, taking feature collection with CICFlowMeter as an example, most flow features can be collected directly as numeric values, such as serror_rate (the number of connections in which a SYN error occurs) and RERROR_rate (the number of connections in which a REJ error occurs); others can be obtained by statistical methods using the CICFlowMeter software. String-type data such as Protocol (the network protocol) must be further discretized and assigned values by sequential numbering, for example setting the TCP type to 1 and the UDP type to 2, which completes the numeric encoding of string-type data.
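The sequential numbering of string-type features in step S102 can be sketched as follows. This is a minimal illustration only: the helper name encode_categories and the sample protocol list are assumptions introduced here, and numbering in order of first appearance is one possible convention.

```python
def encode_categories(values):
    """Map each distinct string to 1, 2, 3, ... in order of first appearance.

    Hypothetical helper: the numbering convention is an assumption.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1  # next unused sequence number
        encoded.append(codes[value])
    return encoded, codes


# Illustrative protocol column (sample values are assumptions).
protocols = ["TCP", "UDP", "TCP", "ICMP"]
encoded, mapping = encode_categories(protocols)
```

The mapping learned on the training data would need to be reused for new flows so that the same protocol always receives the same number.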
In step S103, the feature values obtained in step S102 are normalized to the [0,1] interval so that the features do not have an unbalanced influence on classification. The feature values are normalized using the following formula:
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
where $x$ is the original feature value, $x'$ is the normalized value of $x$, $x_{\min}$ is the minimum value taken by feature values of the same class in the dataset, and $x_{\max}$ is the maximum value taken by feature values of the same class in the dataset.
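The min-max normalization of step S103 can be sketched in NumPy as below. The function name min_max_normalize is an assumption, as is the handling of a feature whose value never varies (mapped to 0 here to avoid division by zero); the source does not state how that case is treated.

```python
import numpy as np


def min_max_normalize(features):
    """Scale each feature column of a (samples, features) array to [0, 1]."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_max = features.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column (span == 0) is mapped to 0.
    safe_span = np.where(span > 0, span, 1.0)
    return (features - col_min) / safe_span, col_min, col_max


# Illustrative data: the second feature is constant.
data = [[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]]
normalized, col_min, col_max = min_max_normalize(data)
```

Returning col_min and col_max lets the same scaling be applied to newly arriving flows, which seems the natural choice when the detectors are trained on normal traffic only.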
Specifically, as an optional implementation of this embodiment, for data entering the intrusion detection model set, each new piece of network flow data is combined with its historical network flow data and reconstructed in step S2 into the three different data forms, which serve respectively as the input data of the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model.
For example, given sequence data $X = (x_{t-k}, \ldots, x_{t-1}, x_t)$: when converting to single-point format data, only the most recent time point to enter the system, $x_t$, is retained as training data for the variational autoencoder ensemble CNN-VAE model; when converting to context format data, the historical data segment $(x_{t-k}, \ldots, x_{t-1})$ and the prediction target $x_t$ are retained as training data for the recurrent neural network predictor TCN-LSTM model; when converting to time-period format data, the entire sequence $X$ is retained as training data for the variational autoencoder BILSTM-VAE model.
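The three data forms can be sketched with a sliding window as follows. This is an illustration under stated assumptions: the function name split_formats is hypothetical, the window counts the historical points only, and the time-period segment is taken as the history plus the key point; the source does not fix these conventions.

```python
import numpy as np


def split_formats(series, window=3):
    """Split a (T, c) time series into the three data forms.

    Returns (points, contexts, targets, periods):
      points   (N, c)            latest time point of each subsequence
      contexts (N, window, c)    history preceding the key point
      targets  (N, c)            prediction target (the key point itself)
      periods  (N, window+1, c)  history plus the key point
    """
    series = np.asarray(series, dtype=float)
    points, contexts, targets, periods = [], [], [], []
    for end in range(window, len(series)):
        key_point = series[end]                # last point of the subsequence
        history = series[end - window:end]     # points strictly before it
        points.append(key_point)
        contexts.append(history)
        targets.append(key_point)
        periods.append(series[end - window:end + 1])
    return (np.array(points), np.array(contexts),
            np.array(targets), np.array(periods))


# Illustrative series: 5 time points, 2 features each.
demo = np.arange(10).reshape(5, 2)
points, contexts, targets, periods = split_formats(demo, window=3)
```

With 5 time points and a window of 3, two subsequences are produced, each with its own key point.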
Specifically, as an optional implementation of this embodiment, the network flow data of step S1 includes network flow data obtained from a secure network environment; based on the network flow data obtained from the secure network environment, the variational autoencoder ensemble CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model each learn different intrinsic characteristics of normal network flow data, including its temporal and non-temporal characteristics.
Specifically, as an optional implementation of this embodiment, the invention builds the intrusion detection model set from three different base learners used in parallel: the variational autoencoder ensemble CNN-VAE model based on single-point format data, the recurrent neural network predictor TCN-LSTM model based on context format data, and the variational autoencoder BILSTM-VAE model based on time-period format data. The three base learners are trained in parallel. Before parallel training, the multivariate time series data is divided into subsequences in step S2; the data at the last time point (Last Point) of each subsequence is set as the key point (Key Point) to be detected, and a sliding window is introduced to control the length of the time-period data corresponding to the subsequence and the length of the historical data in the context data.
For the variational autoencoder ensemble CNN-VAE model based on single-point format data, the main purpose of each encoder is to convert the input $x$ into an intermediate variable $z$, learning implicit features of the input data; the main purpose of each decoder is to convert the intermediate variable $z$ into the output $\hat{x}$, that is, to reconstruct the input data from the newly learned features. Figs. 6 and 7 show the encoder structure diagram and encoder parameter diagram of the variational autoencoder ensemble CNN-VAE model, respectively; figs. 8 and 9 show its decoder structure diagram and decoder parameter diagram, respectively. Specifically, as an optional implementation of this embodiment, in figs. 6 and 7 the encoding mean layer encodes the input data into a mean vector, which represents the distribution of the input data in the latent space; the encoding variance layer encodes the input data into a variance vector, which represents the variance of the distribution of the input data in the latent space; the latent vector layer implements sampling, in which the network converts (encodes) the input data into a latent vector, a low-dimensional learned representation that contains the characteristics of the input data.
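The sampling performed by the latent vector layer is commonly implemented with the reparameterization trick, which draws the latent vector from the encoded mean and variance. The sketch below assumes that the variance layer outputs the logarithm of the variance, a common convention that the source does not confirm; the function name sample_latent is hypothetical.

```python
import numpy as np


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I).

    Assumption: the variance layer outputs log-variance, so
    sigma = exp(0.5 * log_var).
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
# With a very small variance the sample collapses onto the mean.
z_tight = sample_latent(mean, np.full(2, -100.0), rng)
# With unit variance the sample scatters around the mean.
z_unit = sample_latent(mean, np.zeros(2), rng)
```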
Through a shared framework, the CNN-VAE model for single time point data detects point anomalies in the time series data with an autoencoder ensemble (an ensemble of neural networks called autoencoders). The point anomaly detection strategy rests on the following principle: the CNN-VAE model is trained on normal time point data with n given tasks, each of which reconstructs the data features of the same time point, and the hidden feature codes of the n tasks interact in the hidden layer so that they learn the normal feature combinations of the data. As an example, following the number of flow features provided in the KDD-CUP-1999 dataset, n = 41 is set in the CNN-VAE model: 41 features are selected from the network flow data, and a [41×41] two-dimensional input matrix is constructed using 41 different feature orderings.
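The construction of the two-dimensional input can be illustrated as follows. The source does not state which 41 feature orderings are used; this sketch assumes cyclic shifts of the feature vector purely for illustration, and the function name build_ordering_matrix is hypothetical.

```python
import numpy as np


def build_ordering_matrix(x):
    """Stack n reorderings of an n-feature vector into an (n, n) matrix.

    Assumption: row k is the feature vector cyclically shifted by k
    positions; the actual orderings are not specified in the source.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    return np.stack([np.roll(x, -k) for k in range(n)])


# Illustrative 41-feature time point.
point = np.arange(41, dtype=float)
matrix = build_ordering_matrix(point)
```

Each row contains the same 41 feature values in a different order, so every row can serve as one reconstruction task over the same time point.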
Specifically, as an optional implementation of this embodiment, the error data of the intrusion detection model set comprises the error of the variational autoencoder ensemble CNN-VAE model and/or the error of the recurrent neural network predictor TCN-LSTM model and/or the error of the variational autoencoder BILSTM-VAE model.
Specifically, as an optional implementation of this embodiment, for the variational autoencoder ensemble CNN-VAE model, the input is $x$, the output is $\hat{x}$, and the error is either the reconstruction error $L_{\mathrm{rec}}$ or the VAE loss function $L_{\mathrm{VAE}}$. The reconstruction error $L_{\mathrm{rec}}$ uses the mean square error, and the VAE loss function is the reconstruction error $L_{\mathrm{rec}}$ plus the corresponding KL divergence $D_{\mathrm{KL}}$:
$L_{\mathrm{rec}} = \frac{1}{c}\sum_{i=1}^{c}\left(x_i - \hat{x}_i\right)^2$
$L_{\mathrm{VAE}} = L_{\mathrm{rec}} + D_{\mathrm{KL}}$
where $c$ is the number of features contained in the network flow data. Depending on the requirements of the situation, the variational autoencoder ensemble CNN-VAE model still works normally even if only the mean square error is used as the reconstruction error $L_{\mathrm{rec}}$ and the corresponding KL divergence is ignored.
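The two error options can be sketched numerically as below. The function names are hypothetical, and the KL term uses the standard closed form for a diagonal Gaussian posterior against a standard normal prior, parameterized by log-variance; the source only says "the corresponding KL divergence", so this form is an assumption.

```python
import numpy as np


def reconstruction_error(x, x_hat):
    """Mean square error over the features of one sample."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))


def kl_divergence(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, I)) for a diagonal Gaussian.

    Assumption: standard normal prior and log-variance parameterization.
    """
    mean = np.asarray(mean, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(-0.5 * np.sum(1.0 + log_var - mean ** 2 - np.exp(log_var)))


def vae_loss(x, x_hat, mean, log_var):
    """Reconstruction error plus the KL term."""
    return reconstruction_error(x, x_hat) + kl_divergence(mean, log_var)


mse_only = reconstruction_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
with_kl = vae_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], [1.0], [0.0])
```

Using reconstruction_error alone corresponds to the option of ignoring the KL divergence when scoring a flow.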
Specifically, as an optional implementation of this embodiment, for the recurrent neural network predictor TCN-LSTM model, the error is the loss function $L(y, \hat{y})$ [equation image not preserved in this text], where $y$ is the actual flow feature information at the next timestamp, $\hat{y}$ is the predicted flow feature information, and $c$ is the number of features contained in the network flow data. Fig. 10 shows a structure diagram of the recurrent neural network predictor TCN-LSTM model, and fig. 11 shows its parameter diagram.
Specifically, as an optional implementation of this embodiment, for the variational autoencoder BILSTM-VAE model, the error is its reconstruction error, which is either its mean square error or its mean square error plus the corresponding KL divergence. Fig. 12 shows the encoder structure diagram of the variational autoencoder BILSTM-VAE model; fig. 13 shows its encoder parameter diagram; fig. 14 shows its decoder structure diagram; fig. 15 shows its decoder parameter diagram. Depending on the requirements of the situation, the model still works normally even if the reconstruction error of the BILSTM-VAE model uses only the mean square error and the corresponding KL divergence is ignored. Specifically, as an optional implementation of this embodiment, in figs. 12 to 15 the strided slice layer of the TensorFlow operation layer (TF operation layer) performs a slicing operation on a tensor, that is, it selects part of the tensor with a specified stride in any dimension. Specifically, as an optional implementation of this embodiment, the strided slice layer slices the data with a historical time series length of 3, and the tensor data of each time point is taken out separately for encoding and decoding.
Specifically, as an optional implementation of this embodiment, the recurrent neural network predictor TCN-LSTM model based on context format data takes, during training, a segment of historical time series information (used as context) and tries to predict the flow feature information of the next timestamp; it is trained by comparing the actual flow feature information $y$ of the next timestamp with the predicted flow feature information $\hat{y}$, so that $y$ and $\hat{y}$ become as close as possible. Specifically, as an optional implementation of this embodiment, the recurrent neural network predictor TCN-LSTM model combines a three-layer temporal convolutional network (TCN) structure with a long short-term memory (LSTM) layer: the LSTM layer receives the output of the TCN layers and predicts the flow feature data of the next timestamp, and different window sizes are used to capture historical context information at different levels.
Specifically, as an optional implementation in this embodiment, the variational autoencoder BILSTM-VAE model based on time period format data is trained on normal sequences, so that it learns the normal pattern of the time series data and its output reproduces its input. Specifically, as an optional implementation in this embodiment, the BILSTM-VAE model extracts the contextual dependencies in the time series data through a BILSTM layer; the output at each time point is fed to its own variational autoencoder, each consisting of an independent encoder and decoder, and the original time series data is then reconstructed through another BILSTM layer. The invention takes a selected window length of SW=3 as an example and provides the corresponding parameters for constructing the BILSTM-VAE model; different window sizes can be used to capture the system state at different resolutions.
Specifically, in step S4, after newly entered network flow data of unknown nature has passed through step S1 and step S2 and has been split into the time series data forms, the network flow data is reconstructed individually by the variational autoencoder set CNN-VAE model, is predicted by the recurrent neural network predictor TCN-LSTM model from the historical network flow data preceding it, and the data segment containing the new network flow data is reconstructed by the variational autoencoder BILSTM-VAE model. Depending on requirements, the error output by the CNN-VAE model or the BILSTM-VAE model may use only the mean square error and ignore the KL divergence, and the models still work normally.
Specifically, as an optional implementation manner in this embodiment, in step S5, for newly entered network flow data with unknown properties, when the difference between the error data obtained in step S4 and the expected error given by the network administrator is greater than a preset threshold, the network flow data is considered as attack data; and when the difference between the error data obtained in the step S4 and the expected error given by the network administrator is not greater than a preset threshold value, the network flow data is considered to be normal data.
Specifically, as an optional implementation in this embodiment, newly entered network flow data of unknown nature is divided, after data preprocessing and time series data splitting, into single-point format data, context format data and time period format data, which serve as the inputs of the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model respectively. The error output by each of the three unsupervised anomaly detectors formed by these models is obtained and compared with the expected error given by the network administrator, which yields each detector's judgment of the nature of the new network flow data. The importance (priority) of the different unsupervised anomaly detectors for each attack flow is then quantified according to the three judgment results.
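The splitting of one time series into the three input forms can be sketched as below. This is only an illustration under stated assumptions: the function name and dictionary keys are hypothetical, and the exact window layout (the context is the `window` records preceding the new record, and the segment is the `window` records ending at the new record) is an assumed reading, since the text gives only the window length of 3 as an example.

```python
def split_time_series(series, window=3):
    """Build one sample per timestamp that has at least `window` records of history.

    `series` is a list of feature vectors ordered by time.
    """
    samples = []
    for t in range(window, len(series)):
        samples.append({
            # Single-point format: the new record alone (input of CNN-VAE).
            "single_point": series[t],
            # Context format: preceding history, used to predict the new record (TCN-LSTM).
            "context": series[t - window:t],
            "target": series[t],
            # Time period format: a segment that contains the new record (BILSTM-VAE).
            "segment": series[t - window + 1:t + 1],
        })
    return samples
```

For a series of five records and `window=3`, this yields two samples, one for each record that has three predecessors.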
Specifically, as an optional implementation in this embodiment, the anomaly scores obtained from the three judgment results are combined into a final anomaly score, and when the final anomaly score is greater than a preset threshold, the new network flow data is considered attack data.
In the present invention, the anomaly scores obtained from the individual detector models are combined by weighted summation to produce the final anomaly score. One optional weighted summation is to average the differences between the errors of the three unsupervised anomaly detectors and the expected errors given by the network administrator (i.e., the acceptable threshold errors). For example, if the errors of the three unsupervised anomaly detectors are <0.5, 2.5, 0.9> and the error thresholds given by the network administrator are <0.4, 2.0, 1.0>, the anomaly scores of the three detectors are <0.1, 0.5, -0.1> and their equally weighted average is approximately 0.167. A higher anomaly score indicates a higher degree of abnormality of the input data, and an anomaly score smaller than 0 is considered normal. Adjusting the weight of each single detector's anomaly score controls the share of that detector's result in the final detection result. For example, if intrusions into the target network to be detected mostly appear as point anomalies, the weight of the anomaly score obtained by the CNN-VAE model can be increased accordingly.
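The worked example above can be reproduced with a short sketch. The function names are hypothetical, and equal weights are used by default to match the averaging example; the acceptance threshold of 0 follows the description.

```python
def anomaly_scores(errors, expected_errors):
    """Per-detector anomaly score: detector error minus the administrator's expected error."""
    return [e - t for e, t in zip(errors, expected_errors)]


def weighted_score(scores, weights=None):
    """Weighted sum of per-detector scores; equal weights give the plain average."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight per detector is required")
    return sum(w * s for w, s in zip(weights, scores))


def is_attack(score, acceptance_threshold=0.0):
    """Flag the flow as an attack when the final score exceeds the threshold."""
    return score > acceptance_threshold


# Order of detectors: CNN-VAE, TCN-LSTM, BILSTM-VAE (values from the example above).
scores = anomaly_scores([0.5, 2.5, 0.9], [0.4, 2.0, 1.0])
final = weighted_score(scores)
```

With equal weights the final score is about 0.167, so the flow is flagged. Passing a weight vector that emphasizes one detector, such as a larger first weight for the CNN-VAE model, shifts the decision toward that detector's view.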
Specifically, as an optional implementation in this embodiment, if the anomaly scores obtained by different single detectors differ greatly in magnitude, the anomaly scores or the error values obtained by the different single detectors need to be preprocessed before the weighted summation so that they fall within the same numerical interval. For example, for the CNN-VAE model or the BILSTM-VAE model, whether the KL divergence value is added to the reconstruction error greatly affects the range of the error values. As one option in this embodiment, a simple way of processing the error value directly is to choose Φ probabilistically: the output error value can be fitted to a log-normal or non-standard distribution, and the probability of occurrence of the output error value is then used directly as the anomaly score, in which case a lower probability of occurrence means a more anomalous input. As another option in this embodiment, the invention may use a simple normalization method to normalize the obtained error values to the [0,1] interval and then perform weighted summation on the resulting anomaly scores. The final detection decision in the detection stage is made by comparing the final weighted anomaly score with the acceptance threshold set by the network administrator (generally set directly to 0): when the final anomaly score is greater than this threshold, the data is considered an anomaly, attack or intrusion.
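Both preprocessing options can be sketched in plain Python. The names are hypothetical, and two points are assumptions rather than statements from the text: the log-normal fit is estimated from a reference set of errors observed on normal traffic, and the "probability of occurrence" is interpreted here as the upper-tail probability of an error at least as large as the observed one.

```python
import math
from statistics import NormalDist, mean, stdev


def min_max_normalize(values):
    """Rescale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values identical
    return [(v - lo) / (hi - lo) for v in values]


def fit_log_normal(reference_errors):
    """Fit a log-normal distribution by fitting a normal to the log of the errors.

    Errors must be strictly positive and at least two are needed.
    """
    logs = [math.log(e) for e in reference_errors]
    return NormalDist(mean(logs), stdev(logs))


def tail_probability(error, dist):
    """Probability of an error at least this large; smaller means more anomalous."""
    return 1.0 - dist.cdf(math.log(error))
```

Because the tail probability decreases as the error grows, a detector using it would flag low values, which is the opposite direction from the difference-based scores, so the two should not be mixed in one weighted sum without converting one of them.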
The three unsupervised anomaly detectors are formed by the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model. Compared with supervised machine learning methods, the unsupervised neural network models greatly reduce the high cost of manually labeling messages. No labels are used during training, so there is no need to balance malicious and benign labeled data in the network data flow messages through oversampling and undersampling. The method is also better able to detect zero-day attacks and adapts better to new types of network attack. By performing intrusion detection on several different forms of time series data, the invention reduces the expert knowledge and manual intervention required for model training and efficiently detects network attack data that may be present.
The invention integrates point anomaly detection and context anomaly detection by using three different deep learning model frameworks, is less affected by noise, and takes into account the different types of anomalous data present in time series data. Compared with other detection methods, it achieves a clear improvement on metrics such as precision, recall and F1 score. FIG. 16 shows the precision, recall and F1 score comparison on the KDD Cup 1999 dataset between the network data intrusion detection method of the invention and other methods. The first four columns of the X-axis give the results of four popular unsupervised detection methods (PCA, KNN, FB and AE), and the last four columns give the results of the variational autoencoder BILSTM-VAE model, the recurrent neural network predictor TCN-LSTM model, the variational autoencoder set CNN-VAE model and the ensemble of the three models. The last four columns show higher precision, recall and F1 score than the first four columns.
The unsupervised network data intrusion detection method based on ensemble learning provided by the invention has been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the above description of the embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the invention. In view of the above, the content of this description should not be construed as limiting the invention.

Claims (7)

1. The unsupervised network data intrusion detection method based on ensemble learning is characterized by comprising the following steps:
step S1: preprocessing data, namely processing the acquired network stream data into time sequence data;
step S2: time series data splitting, namely reconstructing the time series data of step S1 into three different data forms: single-point format data, context format data, and time period format data;
step S3: training an intrusion detection model set, wherein the intrusion detection model set comprises a variational autoencoder set CNN-VAE model based on single-point format data, a recurrent neural network predictor TCN-LSTM model based on context format data, and a variational autoencoder BILSTM-VAE model based on time period format data;
step S4: acquiring error data, namely, for newly entered network flow data of unknown nature, performing the time series data splitting through step S1 and step S2 and then inputting the result into the intrusion detection model set trained in step S3 to form error data;
after the newly entered network flow data of unknown nature has undergone the time series data splitting through step S1 and step S2, the network flow data is reconstructed individually by the variational autoencoder set CNN-VAE model, is predicted by the recurrent neural network predictor TCN-LSTM model from the historical network flow data preceding it, and the data segment containing the new network flow data is reconstructed by the variational autoencoder BILSTM-VAE model;
for the variational autoencoder set CNN-VAE model, its input is $x$ and its output is $\hat{x}$; its error is the reconstruction error $L_{rec}$ or the VAE loss function $L_{VAE}$; the reconstruction error $L_{rec}$ uses the mean square error between $x$ and $\hat{x}$; the VAE loss function $L_{VAE}$ is the reconstruction error $L_{rec}$ plus the corresponding KL divergence;
for the recurrent neural network predictor TCN-LSTM model, its error is the loss function $L(y, \hat{y})$, wherein $y$ is the actual stream feature information of the next timestamp and $\hat{y}$ is the predicted stream feature information;
for the variational autoencoder BILSTM-VAE model, its error is its reconstruction error, and the reconstruction error is the mean square error or the mean square error plus the corresponding KL divergence;
step S5: judging the nature of the newly entered network flow data, namely, for newly entered network flow data of unknown nature, comparing the difference between the error data obtained in step S4 and the expected error given by the network administrator to obtain the judgment result for the nature of the network flow data.
2. The method of claim 1, wherein for data entering the intrusion detection model set, each new piece of network flow data is reconstructed in step S2, in combination with its historical network flow data, into three different data forms, which are respectively used as input data for the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model, and the variational autoencoder BILSTM-VAE model.
3. The unsupervised network data intrusion detection method based on ensemble learning according to claim 1, wherein the network flow data of step S1 comprises network flow data obtained from a secure network environment; based on the network flow data obtained from the secure network environment, the variational autoencoder set CNN-VAE model, the recurrent neural network predictor TCN-LSTM model and the variational autoencoder BILSTM-VAE model respectively learn different intrinsic characteristics of normal network flow data, including its time series characteristics and non-time-series characteristics.
4. The method for intrusion detection of network data based on ensemble learning according to claim 1, wherein in step S5, for newly entered network flow data of unknown nature, when a difference between the error data obtained in step S4 and an expected error given by a network administrator is greater than a preset threshold, the network flow data is considered as attack data; and when the difference between the error data obtained in the step S4 and the expected error given by the network administrator is not greater than a preset threshold value, the network flow data is considered to be normal data.
5. The method for unsupervised network data intrusion detection based on ensemble learning according to claim 1, wherein the data preprocessing of step S1 comprises:
step S101: feature selection, namely extracting features from network stream data;
step S102: feature numeralization, namely assigning a numerical value to the network flow data feature acquired in the step S101;
step S103: feature normalization normalizes the feature values obtained in step S102 to the [0,1] section.
6. The method for unsupervised network data intrusion detection based on ensemble learning according to claim 5, wherein in step S103, the feature values are normalized by using the following formula:
$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
wherein $x$ is the original feature value, $x'$ is the normalized value of $x$, $x_{min}$ is the minimum value exhibited in the dataset by the same class of feature values, and $x_{max}$ is the maximum value exhibited in the dataset by the same class of feature values.
7. The method of claim 1, wherein the error data of the intrusion detection model set includes the error of the variational autoencoder set CNN-VAE model and/or the error of the recurrent neural network predictor TCN-LSTM model and/or the error of the variational autoencoder BILSTM-VAE model.
CN202310509884.2A 2023-05-08 2023-05-08 Unsupervised network data intrusion detection method based on ensemble learning Active CN116232772B (en)


Publications (2)

Publication Number Publication Date
CN116232772A CN116232772A (en) 2023-06-06
CN116232772B true CN116232772B (en) 2023-07-07



