CN111556016B

CN111556016B - Network flow abnormal behavior identification method based on automatic encoder

Info

Publication number: CN111556016B
Application number: CN202010217930.8A
Authority: CN
Inventors: 蹇诗婕; 姜波; 卢志刚; 刘玉岭; 杜丹; 刘宝旭
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2020-03-25
Filing date: 2020-03-25
Publication date: 2021-02-26
Anticipated expiration: 2040-03-25
Also published as: CN111556016A

Abstract

The invention provides a network flow abnormal behavior identification method based on an automatic encoder (autoencoder), which belongs to the interdisciplinary technical field combining machine learning and information security. The method uses the synthetic minority oversampling technique (SMOTE) to balance the category distribution of normal flow data and abnormal flow data in flow data and combines it with the automatic encoder, so that nonlinear structure information can be effectively extracted from massive data and abnormal behaviors in network flow can be identified.

Description

Network flow abnormal behavior identification method based on automatic encoder

Technical Field

The invention provides an effective network flow abnormal behavior identification method. The method combines the synthetic minority oversampling technique (SMOTE) with an automatic encoder classification algorithm, and belongs to the interdisciplinary technical field combining machine learning and information security.

Background

With the rapid development of the information age, the internet has become an indispensable part of people's lives. However, the frequency of attacks in the network and the scale of attack events are also increasing. These attacks not only cause huge economic loss but also pose serious threats to social stability and national security, so maintaining the security of cyberspace has become a problem to be solved urgently. In order to better maintain the security of cyberspace, ensure the availability of network resources and prevent attacks, intrusion detection technology, as an active defense method, has become a focus of current research. An intrusion detection system is an active security protection technology: it monitors the transmission of data in a network, and raises an alarm or interrupts the abnormal transmission after finding suspicious traffic.

The concept of intrusion detection was first proposed by James Anderson in 1980 to monitor attack behavior. There are many studies on the detection of network intrusion behaviors, and these works can be classified into misuse-based intrusion detection systems (MIDS) and anomaly-based intrusion detection systems (AIDS). The MIDS, also called a signature-based intrusion detection system, detects attack behaviors according to existing knowledge. Although the MIDS has higher accuracy and a lower false alarm rate, it cannot detect unknown attacks that are not in the signature database. Unlike the MIDS, the AIDS can detect unknown attacks by comparing normal and abnormal behavior. The AIDS is therefore drawing increasing attention, and its most important branch uses traditional feature-based machine learning methods, such as decision trees, random forests, naive Bayes, etc. However, intrusion detection based on conventional machine learning usually emphasizes feature engineering and is a shallow learning method. As massive high-dimensional data and network bandwidth grow, the complexity of the data and the diversity of its features keep increasing, and shallow learning can hardly achieve the purposes of analysis and prediction.

In recent years, deep neural network technology has enjoyed great success in the fields of image recognition, natural language processing, speech recognition, and the like. The deep neural network is a method for performing representation learning on data; it can learn the intrinsic rules of the data, and it meets the requirements of high-dimensional learning and prediction by constructing a nonlinear network structure formed by a plurality of hidden layers. Intrusion detection methods based on deep learning also show promise; they include the automatic encoder, the deep belief network, the recurrent neural network, the convolutional neural network, the gated recurrent unit and the like, and have achieved a degree of success. However, these deep learning methods for intrusion detection still have some problems.

For example, because of the category imbalance problem, many studies do not consider the overall distribution of traffic data: the decision function is biased towards the majority-class samples, and low-frequency attack samples are treated as noise and ignored, so that the model has difficulty capturing effective features and low-frequency attacks are difficult to detect. On the other hand, some studies do not reduce the dimensionality of the high-dimensional data produced when symbolic data are converted into numerical data, which results in low training efficiency, high memory consumption and poor detection performance. Therefore, performing dimension reduction on the traffic data can improve both the efficiency and the accuracy of intrusion detection.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a novel deep neural network intrusion detection method, namely a network flow abnormal behavior identification method based on an automatic encoder, which uses the synthetic minority oversampling technique (SMOTE) to balance the category distribution of normal flow data and abnormal flow data in flow data and combines it with the automatic encoder, thereby effectively extracting nonlinear structure information from massive data.

In order to achieve the purpose, the invention adopts the specific technical scheme that:

a network flow abnormal behavior identification method based on an automatic encoder comprises the following steps:

1) constructing a sparse abnormal intrusion detection model SAIDS by using an automatic encoder;

2) training the SAIDS model, comprising the following steps:

the SAIDS model preprocesses original training data, and balances the category distribution of normal flow and abnormal flow in flow data by applying the synthetic minority oversampling technique (SMOTE) to the preprocessed training data to obtain balanced data;

classifying normal flow and abnormal flow according to the balanced data, calculating a loss value, and finding the model parameters corresponding to the minimum loss value to obtain a trained SAIDS model;

3) detecting the network traffic to be identified by using the trained SAIDS model, wherein the steps comprise:

the SAIDS model preprocesses the network traffic to be identified, classifies the preprocessed network traffic into normal traffic and abnormal traffic, and identifies abnormal behaviors.

Further, the original training data carries class labels for normal traffic and abnormal traffic.

Further, the normalized data is obtained through preprocessing, wherein the preprocessing comprises converting the symbolic data into numerical data by using one-hot coding, and normalizing the numerical data.

Further, the normalization processing refers to scaling the numerical data into the range [0,1] by adopting a Min-Max normalization method.

Further, the synthetic minority oversampling technique adopts linear interpolation: new data are generated by multiplying the difference between a data sample in the minority class and a randomly selected nearest neighbor sample by a random number between 0 and 1, and then adding the result to the data sample in the minority class.

Furthermore, in addition to the network structure responsible for preprocessing the original data and obtaining balanced data by the SMOTE method, the SAIDS model mainly comprises a discarding (dropout) layer and an automatic encoder; the discarding layer processes the balanced data to prevent overfitting; the automatic encoder comprises an input layer, an encoding layer and a decoding layer, wherein the input layer receives the balanced data processed by the discarding layer, the encoding layer maps the balanced data into low-dimensional features, the decoding layer reconstructs the input data from the low-dimensional features, and the input data is classified into normal flow and abnormal flow.

Further, the discarding layer performs an element-wise product of the input balanced data and a vector whose elements obey a Bernoulli distribution.

Further, the SAIDS model selects the ReLU activation function and the Adam optimizer during training, and calculates the loss value using the mean square error.

The invention chooses the synthetic minority oversampling technique to balance the class distribution of the traffic data because it has the following advantages: (1) the under-sampling method obtains a balanced data set by deleting majority-class data, and important data information may be lost, so the over-sampling method generally has a better processing effect and is used more frequently than the under-sampling method; (2) the synthetic minority oversampling technique uses linear interpolation, which effectively reduces overfitting and reduces the limitations of the sampling process.

Excessively high data dimensionality leads to low training efficiency; reducing the dimensionality of the data reduces the required storage space, speeds up computation, removes redundant features, and expresses the data better. Traditional linear dimensionality reduction methods such as principal component analysis have difficulty capturing nonlinear information in data, while kernel-based nonlinear dimensionality reduction methods such as kernel principal component analysis have high computational complexity and are difficult to apply to large-scale data sets. The automatic encoder, as a dimension reduction method in deep learning, can effectively extract nonlinear structure information from massive data sets and obtain higher-level features. Therefore, the invention adopts the automatic encoder algorithm to construct the intrusion detection system, thereby improving the detection capability for massive high-dimensional data.

Compared with the prior art, the invention has the following positive effects:

the invention performs experiments on a plurality of real network traffic data sets, and evaluates the performance of the model by using the overall accuracy, precision, recall rate and F1 value. Comprehensive experimental results show that the model provided by the invention outperforms existing baseline recognition methods such as decision trees, random forests and gated recurrent units.

Drawings

Fig. 1 is a flowchart of the entire method for identifying abnormal network traffic behavior according to this embodiment.

Fig. 2A-2B are distribution diagrams of the NSL-KDD data set used in the present embodiment, where fig. 2A is the original training data set and fig. 2B is the data set processed by the SMOTE method.

Fig. 3A-3B are distribution diagrams of UNSW-NB15 data sets used in the present embodiment, where fig. 3A is the original training data set and fig. 3B is the data set processed by the SMOTE method.

Fig. 4A-4B are bar charts of the performance comparison with the deep learning methods, where fig. 4A is the evaluation on the NSL-KDD data set and fig. 4B is the evaluation on the UNSW-NB15 data set.

Detailed Description

In order to make the technical solutions in the embodiments of the present invention better understood and make the objects, features, and advantages of the present invention more comprehensible, the technical core of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The embodiment provides an effective network traffic abnormal behavior identification method. The general idea of the method is as follows: the network traffic data is first preprocessed, where the preprocessing comprises two parts, symbolic data numericalization and numerical data normalization; the class distribution of the network traffic data is then changed by using the synthetic minority oversampling technique (SMOTE); and a model is established in combination with the automatic encoder method, so that attack behavior in the network traffic data can be detected.

The overall flow chart of the method is shown in fig. 1, and the details of the steps of the method are described as follows:

(1) data pre-processing

The validation data sets used in the method are the NSL-KDD data set and the UNSW-NB15 data set. Specifically, the NSL-KDD data set is a subset of the KDDCup1999 data set, which is provided by the United States Defense Advanced Research Projects Agency (DARPA) and contains several weeks of attack data that can be used to evaluate intrusion detection performance. However, the KDDCup1999 data set suffers from problems such as redundant records, which bias a classifier towards frequent records. The NSL-KDD data set effectively solves the problems of redundant and duplicate records in the KDDCup1999 data set, and the sizes of its training and test data sets are reasonable. The NSL-KDD data consist of TCP/IP connection records, each containing 41 features, such as basic features, content features, and traffic features, as well as a category label and a difficulty label. The UNSW-NB15 data set was created in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS); it collects 16 hours of data from January 22, 2015 and 15 hours of data from February 17, 2015, reflects contemporary real network traffic, and contains more comprehensive attack activity. The UNSW-NB15 data set contains 49 features, including flow features, basic features, content features, time features, etc.

In order to remove redundant data, improve detection efficiency, and reduce time consumption, the method preprocesses the network traffic data; the preprocessing comprises two parts, symbolic feature numericalization and numerical data normalization.

Symbolic feature numericalization: intrusion detection data sets usually contain symbolic feature data, which a model can hardly process directly, so this step uses a one-hot encoder to convert the symbolic data into numerical data. For example, the protocol_type feature in the NSL-KDD data set takes three symbolic values: TCP, UDP and ICMP. One-hot coding maps the three values to three binary vectors: [1,0,0], [0,1,0], and [0,0,1]. In this way, all symbolic features are mapped by one-hot coding. For the category label, normal traffic data in the data set are labeled 0 and abnormal traffic data are labeled 1.
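
The one-hot mapping and label conversion described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names `one_hot_encode` and `binary_label`, the lowercase value spellings, and the label string "normal" are assumptions made for the example.

```python
def one_hot_encode(values, categories):
    """Map each symbolic value to a binary vector containing a single 1."""
    # Position of each category in the output vector (order is an assumption).
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(vector)
    return encoded


def binary_label(label, normal_name="normal"):
    """Return 0 for normal traffic and 1 for any other (abnormal) label."""
    # The label string "normal" is a hypothetical spelling for illustration.
    return 0 if label == normal_name else 1


protocol_types = ["tcp", "udp", "icmp"]
encoded = one_hot_encode(["tcp", "udp", "icmp"], protocol_types)
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print([binary_label(name) for name in ["normal", "attack"]])  # [0, 1]
```

Each symbolic feature with k distinct values becomes k binary columns, which is why the feature dimensionality grows after this step and why the later dimension reduction is useful.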

Normalization of numerical data: data normalization resolves the problem that different features differ greatly in scale, and is therefore widely used in data preprocessing. In order to ensure the reliability of the detection result, the numerical data in the two data sets need to be normalized, where normalization refers to scaling all feature data into the range [0,1]. The method adopts the Min-Max normalization method to process the data, and the conversion formula is as follows:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

where x represents the attribute value of a feature, x_max represents the maximum value of that feature attribute, x_min represents the minimum value of that feature attribute, and x' represents the result of normalizing x.
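
As a small worked example of Min-Max normalization, the sketch below scales one feature column into [0,1]. The function name `min_max_normalize`, the sample values, and the handling of a constant column (mapped to 0.0 to avoid division by zero) are assumptions of this sketch and are not stated in the description.

```python
def min_max_normalize(column):
    """Scale a list of numeric feature values into the range [0, 1]."""
    x_min, x_max = min(column), max(column)
    if x_max == x_min:
        # Assumption: a constant feature carries no information, map it to 0.0.
        return [0.0 for _ in column]
    return [(x - x_min) / (x_max - x_min) for x in column]


durations = [0.0, 5.0, 10.0, 20.0]  # hypothetical feature values
print(min_max_normalize(durations))  # [0.0, 0.25, 0.5, 1.0]
```

In practice x_min and x_max would typically be computed on the training data and reused for the traffic to be identified, so that both are scaled consistently; that choice is likewise an assumption here.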

(2) Changing data class distribution

In network traffic data, abnormal traffic data is usually far less abundant than normal traffic data, so that the decision function is biased towards the majority-class samples, and low-frequency attack samples are treated as noise and ignored. Therefore, in order to improve the detection performance of the model, the classes with a small amount of data need to be processed. There are generally two kinds of processing methods, namely algorithm-level solutions and data-level solutions. An algorithm-level solution usually modifies the classifier algorithm or optimizes the performance of the learning algorithm, and compensates for the uneven distribution of the categories by adjusting the importance of the categories in the learning or decision-making process. Common data-level solutions include under-sampling and over-sampling methods, which balance the distribution of data classes by sampling. Under-sampling balances the class distribution by reducing the amount of majority-class data, and over-sampling balances the class distribution by increasing the amount of minority-class data.

Because the under-sampling method obtains a balanced data set by deleting majority-class data, important data information may be lost; the over-sampling method therefore generally has a better processing effect and is used more frequently than the under-sampling method. The method adopts a well-known over-sampling method, the synthetic minority oversampling technique (SMOTE), to process the classes with a small data volume, thereby improving detection performance. SMOTE randomly generates new samples between minority-class samples and their neighbors, and improves the class distribution by adding minority-class data. The SMOTE method uses linear interpolation, which effectively reduces overfitting and reduces the limitations of the sampling process. The formula for generating synthetic data by the SMOTE method is as follows:

$$y_{new} = y_i + (y_i - y_j) \times \delta \qquad \text{formula (1)}$$

where y_new represents the newly generated synthetic data, y_i represents a data sample in the minority class, y_j represents a nearest neighbor sample randomly selected from the nearest neighbors of y_i, and δ represents a random number between 0 and 1. The distribution of the NSL-KDD original training data set and the distribution of the training data set after processing with the SMOTE method are shown in fig. 2A-2B. The distribution of the UNSW-NB15 original training data set and the distribution of the training data set after processing with the SMOTE method are shown in fig. 3A-3B.
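
The sampling idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the embodiment's code: the helper names `k_nearest_neighbors` and `smote`, the Euclidean distance, the default k = 5 and the fixed seed are all choices made for the example. The sketch interpolates toward the neighbor, y_i + δ·(y_j − y_i), which is the conventional SMOTE form and keeps each new sample on the segment between a minority sample and its neighbor; that direction convention is an assumption of the sketch.

```python
import math
import random


def k_nearest_neighbors(sample, minority, k):
    """Return the k minority samples closest to `sample` (Euclidean distance)."""
    others = [s for s in minority if s is not sample]
    others.sort(key=lambda s: math.dist(sample, s))
    return others[:k]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        y_i = rng.choice(minority)                               # minority sample
        y_j = rng.choice(k_nearest_neighbors(y_i, minority, k))  # random neighbor
        delta = rng.random()                                     # random number in [0, 1)
        # Point on the segment between y_i and y_j.
        synthetic.append([a + delta * (b - a) for a, b in zip(y_i, y_j)])
    return synthetic


minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority_samples, n_new=10, k=2)
print(len(new_samples))  # 10
```

Because every synthetic point lies between two existing minority samples, features already normalized to [0,1] stay inside that range after oversampling.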

(3) Model training

Excessively high data dimensionality leads to low training efficiency; reducing the dimensionality of the data reduces the required storage space, speeds up computation, removes redundant features, and expresses the data better. Traditional linear dimensionality reduction methods such as principal component analysis have difficulty capturing nonlinear information in data, while kernel-based nonlinear dimensionality reduction methods such as kernel principal component analysis have high computational complexity and are difficult to apply to large-scale data sets. The automatic encoder, as a dimension reduction method in deep learning, can effectively extract nonlinear structure information from massive data sets and obtain higher-level features. Therefore, the automatic encoder is well suited to the tasks of dimension reduction and feature learning, and the intrusion detection system is constructed with the automatic encoder algorithm, which improves the detection capability for massive high-dimensional data.

The automatic encoder is a three-layer neural network comprising an input layer, an encoding layer and a decoding layer, and is an unsupervised learning structure consisting of an encoder and a decoder. After data preprocessing and class distribution balancing, an automatic encoder is used to reduce the dimensionality of the training data and train the model. Specifically, to prevent overfitting, a discarding (dropout) layer is added to process the balanced data, and its output is used as the input of the automatic encoder. In the automatic encoder, the encoder maps the input data to low-dimensional features, and the decoder reconstructs the input data from those features. By reconstructing the input, the hidden layer learns a representation of the input data, which effectively reduces the feature dimensionality of the data set while preserving the integrity of the feature information. The formula for the discarding layer is as follows:

$$r \sim \mathrm{Bernoulli}(p) \qquad \text{formula (2)}$$

$$\lambda(n) = r * \alpha(n) \qquad \text{formula (3)}$$

where r is a vector of independent and identically distributed random variables, each obeying a Bernoulli distribution with probability p, whose shape is the same as that of α(n); * represents the element-wise product; α(n) represents the output of the current layer; and λ(n) represents the output of the discarding layer.
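
Formulas (2) and (3) amount to multiplying the layer output by a random 0/1 mask. The sketch below illustrates this; the function name `dropout` and the reading of p as the probability that an element of r equals 1 (so that the corresponding value is kept) are assumptions of the sketch. Many deep learning frameworks instead take the fraction of values to drop as the parameter and rescale the kept values during training, which the formulas above do not include.

```python
import random


def dropout(alpha, p, rng):
    """Apply a Bernoulli(p) mask r to the layer output alpha, element by element."""
    # r[k] is 1 with probability p and 0 otherwise (assumed reading of formula (2)).
    r = [1 if rng.random() < p else 0 for _ in alpha]
    # Element-wise product of the mask and the layer output (formula (3)).
    lam = [r_k * a_k for r_k, a_k in zip(r, alpha)]
    return r, lam


rng = random.Random(42)
mask, output = dropout([0.2, 0.9, 0.5, 0.7], p=0.8, rng=rng)
print(mask, output)
```

Elements whose mask entry is 0 are set to zero for that training step, so the model cannot rely on any single input feature, which is how the layer helps to prevent overfitting.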

(4) Abnormal behavior detection

Finally, the ReLU activation function and the Adam optimizer are selected, and the loss value is calculated using the mean square error. The formula for the mean square error is as follows:

$$MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

where y_i is the true value of the i-th data, ŷ_i is the predicted value of the i-th data, and m is the total number of samples of the test data. After the models are trained, the best trained model, namely the one with the minimum loss value, is selected to classify the test data, and the detection result is evaluated with the evaluation indices.
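
To make the encoder, decoder and loss concrete, the following sketch runs one forward pass of a tiny automatic encoder and computes the mean square error between the input and its reconstruction. It is an illustration only: the layer sizes (4 inputs, 2 low-dimensional features), the weight values, the use of ReLU on the decoding layer, and the function names are assumptions, and no training with the Adam optimizer is shown.

```python
def relu(values):
    """ReLU activation: negative values become 0."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x_k for w, x_k in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def autoencoder_forward(x, enc_w, enc_b, dec_w, dec_b):
    """Encode x into low-dimensional features, then reconstruct the input."""
    code = relu(dense(x, enc_w, enc_b))               # encoding layer
    reconstruction = relu(dense(code, dec_w, dec_b))  # decoding layer
    return code, reconstruction


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical weights for a 4 -> 2 -> 4 network.
enc_w = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
enc_b = [0.0, 0.0]
dec_w = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
dec_b = [0.0, 0.0, 0.0, 0.0]

x = [0.2, 0.4, 1.0, 0.0]  # one normalized input record
code, reconstruction = autoencoder_forward(x, enc_w, enc_b, dec_w, dec_b)
loss = mean_squared_error(x, reconstruction)
print(code, reconstruction, loss)
```

During training, an optimizer such as Adam would adjust the weights to reduce this loss, and the parameters giving the minimum loss would be kept.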

(5) Comparison of results

The invention performs experiments on a plurality of real network traffic data sets, and evaluates the performance of the model by using the overall accuracy, precision, recall rate and F1 value. In order to verify the effectiveness of the proposed method (SMOTE + AE), the invention carries out comparison experiments against both machine learning methods and deep learning methods. Seven commonly used machine learning methods are used for comparison: decision tree, random forest, Gaussian naive Bayes, multinomial naive Bayes, Bernoulli naive Bayes, AdaBoost, and extreme gradient boosting (XGBoost). The baselines for the deep learning comparison were the gated recurrent unit (GRU), SMOTE + gated recurrent unit, the sparse auto-encoder (SAE), SMOTE + sparse auto-encoder, and the automatic encoder.
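
The four evaluation indices can be computed from the counts of correct and incorrect predictions. The sketch below uses the standard definitions and treats abnormal traffic (label 1) as the positive class, in line with the labeling used in the preprocessing step; the function name `binary_metrics`, the example labels, and returning 0.0 when a denominator is zero are assumptions of the sketch.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = abnormal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal traffic passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attacks
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical model predictions
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```

On unbalanced traffic data, accuracy alone can look high even when most attacks are missed, which is why precision, recall and F1 are reported alongside it.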

1) Performance comparison with machine learning methods

The results of the performance comparison on the NSL-KDD and UNSW-NB15 data sets are shown in Table 1. As can be seen from Table 1, among the machine learning methods, the tree-based methods generally detect better than the probability-based methods, because the naive Bayes method has difficulty processing correlated features. The detection capability of the ensemble learning methods is superior to that of a single classifier, because a single classifier can hardly capture all the features of a particular data set, whereas an ensemble learning method can capture more information. As can also be seen from Table 1, the accuracy of the SAIDS method on the two data sets is 91.08% and 97.58%, respectively, which is superior to the traditional machine learning methods and demonstrates the effectiveness of the proposed method. This is because the shallow learning methods do not handle unbalanced traffic data, so the resulting model is biased towards the normal category, which has a large number of samples. In addition, as the complexity of the data increases, the learning ability of the shallow learning methods is limited.

Table 1. Performance comparison with machine learning methods on the NSL-KDD and UNSW-NB15 data sets

2) Performance comparison with deep learning methods

The deep learning method can learn the intrinsic rules of the data, and is therefore better suited to fitting and predicting network traffic data. The results of comparing the SAIDS model proposed by the present invention with state-of-the-art deep learning models are shown in fig. 4A-4B. It can be seen from the figures that the SAIDS model proposed by the present invention outperforms the other five methods and can effectively detect network intrusion data. The detection performance of the models that use the SMOTE method is generally better than that of the models that do not, which demonstrates the importance of balancing the data class distribution. The invention handles unbalanced data and redundant features at the same time, and its detection result is superior to that of a gated recurrent unit or a sparse automatic encoder used alone.

The performance comparison result with the machine learning and deep learning method shows that the SAIDS model provided by the invention has better prediction accuracy rate for the detection of network traffic and has the potential of practical application.

The above-mentioned embodiments only express the embodiments of the present invention, and the description thereof is specific, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent should be subject to the appended claims.

Claims

1. A network flow abnormal behavior identification method based on an automatic encoder comprises the following steps:

1) constructing a sparse abnormal intrusion detection model SAIDS by using an automatic encoder;

2) training the SAIDS model, comprising the following steps:

preprocessing original training data by the SAIDS model, wherein the original training data comprises an NSL-KDD data set and an UNSW-NB15 data set, and acquiring standardized data through preprocessing, wherein the preprocessing comprises converting symbolic data into numerical data by using one-hot coding and normalizing the numerical data;

balancing the category distribution of normal flow and abnormal flow in the flow data by applying the synthetic minority oversampling technique (SMOTE) to the preprocessed training data to obtain balanced data;

3) detecting the network traffic to be identified by using the trained SAIDS model, wherein the steps comprise:

the SAIDS model preprocesses the network traffic to be identified, classifies the preprocessed network traffic into normal traffic and abnormal traffic, and identifies abnormal behavior;

the SAIDS model includes a discarding layer and an automatic encoder; the discarding layer processes the balanced data to prevent overfitting; the automatic encoder comprises an input layer, an encoding layer and a decoding layer, wherein the input layer receives the balanced data processed by the discarding layer, the encoding layer maps the balanced data into low-dimensional features, the decoding layer reconstructs the input data from the low-dimensional features, and the input data is classified into normal flow and abnormal flow; the SAIDS model is trained by selecting a ReLU activation function and an Adam optimizer and using the mean square error to calculate the loss value.

2. The method of claim 1, wherein the raw training data carries class labels for normal traffic and abnormal traffic.

3. The method of claim 1, wherein the normalization process scales the numerical data into the range [0,1].

4. The method of claim 3, wherein the normalization is performed using a Min-Max normalization method.

5. The method of claim 1, wherein the synthetic minority oversampling technique uses linear interpolation, and new data is generated by multiplying the difference between a data sample in the minority class and a randomly selected nearest neighbor sample by a random number between 0 and 1, and adding the result to the data sample in the minority class.

6. The method of claim 1, wherein the discarding layer performs an element-wise product of the input balanced data and a vector whose elements obey a Bernoulli distribution.