CN111291860A

CN111291860A - Anomaly detection method based on convolutional neural network feature compression

Info

Publication number: CN111291860A
Application number: CN202010031422.0A
Authority: CN
Inventors: 李思照; 姜宏睿; 孙建国; 巩建光; 阎梓宁; 王文衫
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2020-06-16

Abstract

The invention belongs to the technical field of deep learning intrusion detection, and particularly relates to an anomaly detection method based on convolutional neural network feature compression. The method preprocesses the data features using one-hot encoding and dispersion standardization (min-max normalization), so that the deep learning model can identify the features of a data set more effectively and the distortion rate of the data is reduced. The one-hot encoded sparse vectors are compressed into dense vectors through an embedding layer, which reduces the training time of the model. Dispersion standardization applies a linear transformation to the original data, so the original linear relationships are preserved after the transformation and the accuracy of the model in intrusion detection can be improved. The invention offers high intrusion detection accuracy, short training time and high prediction precision, and can be widely applied to network intrusion detection and similar tasks.

Description

Anomaly detection method based on convolutional neural network feature compression

Technical Field

The invention belongs to the technical field of deep learning intrusion detection, and particularly relates to an anomaly detection method based on convolutional neural network feature compression.

Background

In recent years, as network technologies have developed rapidly, intrusions into network communication systems have become increasingly common in fields such as industry, education and medical care. In response, many scholars have put considerable effort into anomaly detection. This work falls mainly into two categories: intrusion detection using traditional mathematical models, and intrusion detection using deep learning models. The traditional approach mainly constructs a network association graph using a probability model, and performs probabilistic inference and intrusion judgment on that graph. Foreign scholars have used a Bayesian model to construct a threat attack graph and then detected attack chains through the graph. Other scholars have gone further and established a service dependency graph, solving local threats through dynamic iteration and then inferring a global threat chain. Domestic scholars have also contributed, building a time-dependent network from the information flows collected between memory, threads and files, and then inferring intrusion paths through a Bayesian network.

Detecting abnormal intrusions with a deep learning model mainly involves characterizing abnormal network behaviors through a heuristic algorithm and then, over many training iterations, fitting the data in a high-dimensional space with the deep learning model in an attempt to approximate a global optimal solution. Once trained, the model can identify abnormal behaviors or features. When only a small amount of data is available, intrusion detection with a generative adversarial network (GAN) gives significantly better results than conventional machine learning models. Other scholars have used bidirectional long short-term memory (LSTM) networks and recurrent neural networks to detect intrusions. However, these methods are less effective for anomaly detection in large-scale network environments, because existing models struggle to converge quickly as the network size and data volume increase. To address this problem, an embedded model can be used to compress the sparse features, so that the accuracy of intrusion detection is maintained while the model training time is reduced.

Disclosure of Invention

The invention aims to provide an anomaly detection method based on convolutional neural network feature compression, which solves the problems in existing systems that the deep learning model takes too long to train, cannot effectively perform anomaly detection on large volumes of data, and has a high data distortion rate.

The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:

step 1: inputting a data set to be detected, and dividing the data set into a training set and a test set;

step 2: digitizing the features of the data in the data set to be detected by using one-hot encoding;

step 3: standardizing data in the data set to be detected;

because the feature values in the data set differ greatly in magnitude, the convolutional neural network model pays more attention to features with larger values and ignores those with smaller values; the original data is therefore linearly transformed through dispersion standardization (min-max normalization), so that the transformed result falls within the interval [0,1] and the linear relationship in the original data is not changed; the formula is as follows:

$$\vec{x}^{*} = \frac{\vec{x} - \min(\vec{x})}{\max(\vec{x}) - \min(\vec{x})}$$

wherein $\vec{x}$ is the vector before transformation and $\vec{x}^{*}$ is the transformed vector;

step 4: establishing a convolutional neural network model;

the convolutional neural network model comprises 1 embedding layer, 4 one-dimensional convolutional layers and 4 fully connected layers, wherein the embedding layer is a feedforward neural network; each neural unit of the embedding layer is expressed as follows: given a set $S$ of signal vectors, the weight and bias of node $j$ in the embedding layer are $\theta_j$ and $b_j$, and for a signal vector $\vec{s} \in S$ the formula is:

$$O_j = \mathrm{act}\left(\theta_j^{T}\vec{s} + b_j\right)$$

wherein act represents an activation function and $O_j$ represents the output of the $j$-th neuron; based on the above expression, the embedding layer is defined as follows: suppose $\vec{y}$ is the vector obtained by converting $\vec{x}$, and the numbers of input and output nodes are $m$ and $n$ respectively; then there exist an $n \times m$ matrix $\Theta = [\theta_j^{T}]$ and an $n$-dimensional vector $\vec{b}$ such that the following formula is satisfied:

$$\vec{y} = \mathrm{act}\left(\Theta\vec{x} + \vec{b}\right)$$

step 5: compressing the one-hot encoded sparse vector into a dense vector through the embedding layer; after compression, the convolution process of the model is formulated as follows:

$$T_{i+1} = \mathrm{act}\left(W_i \otimes T_i + \vec{b}_i\right)$$

wherein $T_i$ is the input vector of the $i$-th layer of the convolutional neural network; $W_i$ represents the convolution kernel of the $i$-th layer; the symbol $\otimes$ represents a convolution operation; $\vec{b}_i$ is the bias vector of the $i$-th layer; act is an activation function;

step 6: inputting the training set into a convolutional neural network model for training;

evaluating the performance of the convolutional neural network model in intrusion detection by using a confusion matrix; all data in the data set fall into the following four categories: TP, TN, FP and FN, where T (true) and F (false) represent correct and incorrect classification results, respectively, and P (positive) and N (negative) represent positive and negative examples in the model prediction results, respectively;

evaluating the performance of the convolutional neural network model by adopting the accuracy AC, the detection rate DR and the false alarm rate FAR;

AC represents the proportion of the correct classification number in the classification result in the total samples;

DR represents the probability of correct detection of the model when intrusion occurs;

the FAR represents the probability that a certain normal behavior is judged as intrusion by a model;

when AC is more than or equal to 98%, DR is more than or equal to 98% and FAR is less than 0.6%, judging that the convolutional neural network model meets the requirements, and stopping training;

step 7: inputting the test set into the trained convolutional neural network model to obtain a detection result; after each record in the test set is processed by the convolutional neural network model, a vector of length 5 is obtained, whose entries 1 to 5 are respectively the likelihood of a normal record, of a denial-of-service attack, of surveillance and other probing activities, of unauthorized access from a remote machine, and of unauthorized access by a normal user to local superuser privileges.

The invention has the beneficial effects that:

the invention designs an anomaly detection method based on convolutional neural network feature compression. It preprocesses the data features using one-hot encoding and dispersion standardization (min-max normalization), so that the deep learning model can identify the features of a data set more effectively, thereby reducing the distortion rate of the data. The one-hot encoded sparse vectors are compressed into dense vectors through an embedding layer, which reduces the training time of the model. Dispersion standardization applies a linear transformation to the original data, so the original linear relationships are preserved after the transformation and the accuracy of the model in intrusion detection can be improved. The invention offers high intrusion detection accuracy, short training time and high prediction precision, and can be widely applied to network intrusion detection and similar tasks.

Drawings

FIG. 1 is a flow chart of an anomaly detection method based on convolutional neural network feature compression.

Fig. 2 is a diagram of an embedded convolutional neural network structure.

Fig. 3 is a comparison graph of different learning rates.

Fig. 4 is a comparison of different systems.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The invention designs an anomaly detection method based on convolutional neural network feature compression. Its main principle is to compress sparse features with an embedded model and then fit the data with a convolutional neural network. The method is mainly intended to improve the success rate of intrusion detection on large-scale data by using a system with an embedded deep learning model, while reducing the training time of that model. It finally produces a vector of length 5, from which it is judged whether the data is abnormal. Structurally, the method has one embedding layer that compresses the sparse features, four one-dimensional convolutional layers that perform the convolution operations, and four fully connected layers that integrate the refined features to obtain the high-level meaning of the data features, which is finally used for data classification. The invention offers high intrusion detection accuracy, short training time and high prediction precision, and can be widely applied to network intrusion detection and similar tasks.

The invention relates to the field of deep learning intrusion detection, wherein an intrusion is detected by using a convolutional neural network model in deep learning, and an embedded model is used for compressing sparse features to improve the detection precision. The invention describes an anomaly detection method based on convolutional neural network feature compression.

The invention relates to an anomaly detection method based on convolutional neural network feature compression. It compresses sparse matrix features through an embedded model and identifies and detects intrusions from the compressed features, which improves the accuracy of intrusion detection and reduces the training time of the deep learning model, and it therefore has research and practical value. The invention mainly comprises steps 1 to 7 as set out in the Disclosure of Invention; the embedded model and the data preprocessing are described in more detail below.

The embedded model is used for compressing sparse features, which ultimately improves prediction precision and reduces training time. Each neural unit of the embedding layer is described as follows: given a set $S$ of signal vectors, the weight and bias of node $j$ in the embedding layer are $\theta_j$ and $b_j$, and for a signal vector $\vec{s} \in S$ the formula is:

$$O_j = \mathrm{act}\left(\theta_j^{T}\vec{s} + b_j\right)$$

In the formula, act represents the activation function and $O_j$ represents the output of the $j$-th neuron. Based on the above expression, the embedding layer can be defined as follows: suppose $\vec{y}$ is the vector obtained by converting $\vec{x}$, and the numbers of input and output nodes are $m$ and $n$ respectively; then there exist an $n \times m$ matrix $\Theta = [\theta_j^{T}]$ and an $n$-dimensional vector $\vec{b}$ such that the following formula is satisfied:

$$\vec{y} = \mathrm{act}\left(\Theta\vec{x} + \vec{b}\right)$$

With these formulas, we can compress the one-hot encoded sparse vector into a dense vector through the embedding layer.
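As a minimal sketch of the compression described above, the following NumPy snippet applies one embedding layer of the form act(Θx + b) to a one-hot vector. The sizes `m = 8` and `n = 3`, the random weights, and the choice of `tanh` as the activation are assumptions made purely for illustration; the source does not specify them.

```python
import numpy as np

def embedding_layer(x, theta, b, act=np.tanh):
    """Compress an m-dimensional input into n dimensions: y = act(theta @ x + b).

    theta has shape (n, m) and b has shape (n,). The tanh activation is an
    illustrative assumption, not a choice stated in the source.
    """
    return act(theta @ x + b)

rng = np.random.default_rng(0)
m, n = 8, 3                        # hypothetical input and output sizes
theta = rng.normal(size=(n, m))    # random weights stand in for trained ones
b = np.zeros(n)

x = np.zeros(m)
x[5] = 1.0                         # sparse one-hot input with a single 1

y = embedding_layer(x, theta, b)   # dense 3-dimensional vector
```

For a one-hot input, the product Θx simply selects one column of Θ, which is why the layer can be read as a lookup from a sparse code to a dense vector.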

One-hot encoding converts the symbolic features among the data features into numerical data, which makes model identification and intrusion detection in the system more convenient.
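The conversion of a symbolic feature into numerical data can be sketched as follows. The helper name `one_hot_encode` and the sample protocol values are hypothetical and serve only as an illustration of one-hot encoding.

```python
import numpy as np

def one_hot_encode(values, categories):
    """Map each symbolic value to a 0/1 row vector containing a single 1."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=np.float32)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded

# Hypothetical values of one symbolic feature (a protocol type).
protocols = ["tcp", "udp", "icmp", "tcp"]
encoded_protocols = one_hot_encode(protocols, ["tcp", "udp", "icmp"])
```

Each row of `encoded_protocols` is a sparse vector; with many categories these vectors become long, which is the motivation for compressing them with the embedding layer.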

Data normalization applies dispersion standardization (min-max normalization) to the original data so that the result falls within [0,1] while the linear relationship of the original data is unchanged. This is needed because the feature values in the data set differ greatly in magnitude: large feature values are emphasized during processing and small feature values are easily ignored. The formula is as follows, where $\vec{x}$ is the vector before transformation and $\vec{x}^{*}$ is the transformed vector:

$$\vec{x}^{*} = \frac{\vec{x} - \min(\vec{x})}{\max(\vec{x}) - \min(\vec{x})}$$

The method aims to solve the problems in existing systems that the deep learning model takes too long to train, cannot effectively perform anomaly detection on large volumes of data, and has a high data distortion rate. The invention provides an anomaly detection method based on convolutional neural network feature compression, which compresses sparse features with an embedded model and then fits the data through a convolutional neural network. During training, the learning rate of the model is adjusted to achieve rapid convergence. When the training termination condition is reached, the performance of the system is evaluated on a test data set, and a vector of length 5 is finally obtained as the criterion for deciding whether abnormal behavior has been detected. The method mainly comprises the following steps:

(1) Data preprocessing: feature digitization and data normalization.

(2) Establishing a model: the model mainly comprises an embedding layer, 4 one-dimensional convolutional layers and 4 fully connected layers.

(3) Training the model with the learning rate set to 0.01, 0.001 and 0.0001 in turn, performing the convolution operations of the model.

(4) Comparing the training results of the model with the results of the other comparison models.

(5) Applying the system to the NSL-KDD data set to identify the abnormal data in it.
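The layer structure named in item (2), one embedding layer followed by 4 one-dimensional convolutional layers and 4 fully connected layers, can be sketched as a plain NumPy forward pass. Everything beyond that layer count is an assumption for illustration: the input and embedding sizes, the channel counts, the kernel width, the ReLU activations, the softmax output and the random untrained weights are all hypothetical, and the sketch does not include training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the source only fixes the number of layers and the 5 outputs.
INPUT_DIM, EMBED_DIM = 128, 32
CONV_CHANNELS = [1, 8, 8, 16, 16]   # in/out channels of the 4 conv layers
KERNEL_WIDTH = 3
FC_SIZES = [64, 32, 16, 5]          # 4 fully connected layers, 5 output scores

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conv1d(t, kernels, bias):
    """'Valid' 1-D convolution (cross-correlation form).

    t: (length, in_channels); kernels: (width, in_channels, out_channels).
    """
    width = kernels.shape[0]
    out_len = t.shape[0] - width + 1
    out = np.empty((out_len, kernels.shape[2]))
    for pos in range(out_len):
        window = t[pos:pos + width]
        out[pos] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

def init_params():
    params = {"theta": rng.normal(scale=0.1, size=(EMBED_DIM, INPUT_DIM)),
              "b": np.zeros(EMBED_DIM), "conv": [], "fc": []}
    for c_in, c_out in zip(CONV_CHANNELS[:-1], CONV_CHANNELS[1:]):
        kernels = rng.normal(scale=0.1, size=(KERNEL_WIDTH, c_in, c_out))
        params["conv"].append((kernels, np.zeros(c_out)))
    length = EMBED_DIM - 4 * (KERNEL_WIDTH - 1)   # length left after 4 valid convs
    sizes = [length * CONV_CHANNELS[-1]] + FC_SIZES
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        params["fc"].append((rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)))
    return params

def forward(x, params):
    t = relu(params["theta"] @ x + params["b"])   # embedding layer: compress input
    t = t[:, None]                                # shape (EMBED_DIM, 1 channel)
    for kernels, bias in params["conv"]:          # 4 one-dimensional conv layers
        t = relu(conv1d(t, kernels, bias))
    h = t.ravel()
    last = len(params["fc"]) - 1
    for i, (w, bias) in enumerate(params["fc"]):  # 4 fully connected layers
        z = w @ h + bias
        h = softmax(z) if i == last else relu(z)
    return h

params = init_params()
x = rng.random(INPUT_DIM)       # stand-in for one preprocessed record
scores = forward(x, params)     # vector of length 5
```

With untrained weights the five scores carry no meaning; the sketch only shows how the shapes flow from the compressed embedding through the convolutional and fully connected layers to a length-5 output.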

Compared with the prior system, the invention has the advantages that:

1. Reducing the data distortion rate: the data features are preprocessed using one-hot encoding and dispersion standardization (min-max normalization), so that the deep learning model can identify the features of the data set more effectively and the distortion rate of the data is reduced.

2. Reducing the training time of the model: the invention compresses the one-hot encoded sparse vectors into dense vectors through the embedding layer, thereby reducing the training time of the model.

3. Improving intrusion detection accuracy on large volumes of data: dispersion standardization applies a linear transformation to the original data, so the original linear relationships are preserved after the transformation and the accuracy of the model in intrusion detection can be improved.

Example 1:

First, the data set is preprocessed. The features of the data are digitized using one-hot encoding, and the data is then standardized. Because the feature values in the data set differ greatly in magnitude, a convolutional neural network model pays more attention to features with larger values and neglects those with smaller values. The original data is therefore linearly transformed through dispersion standardization (min-max normalization), so that the transformed result falls within the interval [0,1] and the linear relationship in the original data is not changed. The formula is as follows:

$$\vec{x}^{*} = \frac{\vec{x} - \min(\vec{x})}{\max(\vec{x}) - \min(\vec{x})}$$

In the formula, $\vec{x}$ is the vector before the transformation and $\vec{x}^{*}$ is the transformed vector.
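The dispersion standardization described here corresponds to column-wise min-max scaling, which can be sketched in NumPy as below. The function name, the sample values, and the handling of constant columns (mapped to 0 to avoid division by zero) are assumptions for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Scale each feature column linearly into [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    x_min = x.min(axis=0)
    span = x.max(axis=0) - x_min
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    safe_span = np.where(span == 0, 1.0, span)
    return (x - x_min) / safe_span

# Two hypothetical features with very different magnitudes.
features = [[0.0, 100.0],
            [5.0, 300.0],
            [10.0, 500.0]]
normalized = min_max_normalize(features)
```

After scaling, both columns span the same range, so neither feature dominates merely because of its original magnitude, and the relative spacing of the values within each column is unchanged.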

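After preprocessing the data, the invention establishes a model based on a convolutional neural network, which comprises an embedding layer, 4 one-dimensional convolutional layers and 4 fully connected layers. The embedding layer is a feedforward neural network, and each of its neural units can be expressed as follows: given a set $S$ of signal vectors, the weight and bias of node $j$ in the embedding layer are $\theta_j$ and $b_j$, and for a signal vector $\vec{s} \in S$ the formula is:

$$O_j = \mathrm{act}\left(\theta_j^{T}\vec{s} + b_j\right)$$

Here act represents the activation function and $O_j$ represents the output of the $j$-th neuron. Suppose $\vec{y}$ is the vector obtained by converting $\vec{x}$, and the numbers of input and output nodes are $m$ and $n$ respectively; then there exist an $n \times m$ matrix $\Theta = [\theta_j^{T}]$ and an $n$-dimensional vector $\vec{b}$ such that the following formula is satisfied:

$$\vec{y} = \mathrm{act}\left(\Theta\vec{x} + \vec{b}\right)$$

With these formulas, we can compress the one-hot encoded sparse vector into a dense vector through the embedding layer. After compression, the convolution process of the model is formulated as follows: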

$$T_{i+1} = \mathrm{act}\left(W_i \otimes T_i + \vec{b}_i\right)$$

where $T_i$ denotes the input vector of the $i$-th layer of the convolutional neural network, $W_i$ represents the convolution kernel of the $i$-th layer, the symbol $\otimes$ represents a convolution operation, $\vec{b}_i$ is the bias vector of the $i$-th layer, and act is the activation function.

Through training with these convolution operations, a vector of length 5 is finally obtained, whose entries 1 to 5 are respectively the likelihood of a normal record, of a denial-of-service attack, of surveillance and other probing activities, of unauthorized access from a remote machine, and of unauthorized access by a normal user to local superuser privileges. For example, suppose the end result is (0.7, 0.05, 0.05, 0.1, 0.1). From this we can determine that the record is 70% likely to be normal behavior, 5% likely to be a denial-of-service attack, 5% likely to be surveillance or other probing activity, 10% likely to be unauthorized access from a remote machine, and 10% likely to be unauthorized access by a normal user to local superuser privileges. The record can therefore be judged to be normal behavior data.
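The judgment in this example amounts to picking the entry with the highest likelihood. A minimal sketch, in which the short class labels in `CLASS_NAMES` and the function name `classify` are hypothetical shorthand for the five categories listed above:

```python
# Hypothetical short labels for the five output positions, in order.
CLASS_NAMES = ["normal", "dos", "probe", "r2l", "u2r"]

def classify(scores):
    """Return the label and score of the most likely class."""
    if len(scores) != len(CLASS_NAMES):
        raise ValueError("expected one score per class")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASS_NAMES[best], scores[best]

# The example vector from the text.
label, score = classify([0.7, 0.05, 0.05, 0.1, 0.1])
```

For the example vector, `label` is `"normal"`, matching the conclusion drawn in the text.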

Finally, the model parameters are continuously adjusted during training to improve system performance. In the experiments, we use a confusion matrix to evaluate the performance of the system in intrusion detection. Every record in the data set falls into one of four types: TP, TN, FP and FN, where T (true) and F (false) represent correct and incorrect classification results, respectively, and P (positive) and N (negative) represent positive and negative examples in the model prediction results, respectively. For example, TP indicates that an intrusion actually occurred and was detected by the model. In addition, the following three indicators are used to evaluate the performance of the model in our system: accuracy (AC), detection rate (DR) and false alarm rate (FAR). The calculation formulas are as follows:

(1) AC represents the proportion of correctly classified samples among all samples:

$$AC = \frac{TP + TN}{TP + TN + FP + FN}$$

(2) DR represents the probability that the model detects an intrusion correctly when one occurs:

$$DR = \frac{TP}{TP + FN}$$

(3) FAR represents the probability that the model misjudges a normal behavior as an intrusion:

$$FAR = \frac{FP}{FP + TN}$$
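The three indicators can be computed directly from the confusion-matrix counts using their standard definitions, as sketched below. The thresholds in `meets_requirements` are the ones stated in step 6 (AC ≥ 98%, DR ≥ 98%, FAR < 0.6%); the counts in the example are hypothetical and are not results from the experiments.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute accuracy (AC), detection rate (DR) and false alarm rate (FAR)."""
    ac = (tp + tn) / (tp + tn + fp + fn)   # share of all samples classified correctly
    dr = tp / (tp + fn)                    # share of intrusions that were detected
    far = fp / (fp + tn)                   # share of normal records flagged as intrusions
    return ac, dr, far

def meets_requirements(ac, dr, far):
    """Stopping criterion from step 6: AC >= 98%, DR >= 98%, FAR < 0.6%."""
    return ac >= 0.98 and dr >= 0.98 and far < 0.006

# Hypothetical counts, for illustration only.
ac, dr, far = detection_metrics(tp=990, tn=995, fp=5, fn=10)
done = meets_requirements(ac, dr, far)
```

With these counts, AC is 1985/2000, DR is 990/1000 and FAR is 5/1000, so the criterion is satisfied and training would stop.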

Our experimental environment is as follows:

CPU: Intel(R) Core(TM) i7-7700HQ 2.80GHz

GPU:NVIDIA GeForce GTX1060

RAM:16GB

To evaluate our system's ability to recognize intrusions, we chose the NSL-KDD data set to train and test it. The data set contains 125973 records, each with 41 features and 1 label; of the features, 7 are symbolic and 34 are continuous. Fig. 3 compares different learning rates, where LR is the learning rate, Accuracy is the accuracy, and Epoch is the training period. We conducted three experiments. The accuracy of the system fluctuates more during training when the learning rate is 0.0001, is more stable when the learning rate is 0.001, and is best when the learning rate is set to 0.01. We therefore consider 0.01 to be the optimal value of the learning rate.

We also compared our system with several others, as shown in Fig. 4, where AC is the accuracy, DR is the probability that the model detects an intrusion correctly when one occurs, and FAR is the probability that the model misjudges a normal behavior as an intrusion. Bayesian is a Bayesian model, SVM is a support vector machine, CNN-IDS is a traditional convolutional neural network, LSTM-RNN is a long short-term memory recurrent neural network, and GAN is a generative adversarial network. Our system outperforms the traditional machine-learning-based systems in both accuracy and false alarm rate. The results show that traditional machine-learning-based systems struggle to maintain identification accuracy when the data volume is large, whereas the deep-learning-based systems perform well. The CNN-IDS-based system uses data dimension reduction in preprocessing, which markedly reduces its false alarm rate, but its detection rate is slightly lower than that of our method. In addition, the GAN-based and CNN-IDS-based systems cause a certain degree of data distortion, so their accuracy and detection rate are slightly lower than those of our system. The LSTM-RNN-based approach has a higher detection rate than our system, but our approach has a lower false alarm rate. In summary, the accuracy (AC) of our system in intrusion detection is higher than that of the other detection systems, reaching 98.03%, and its false alarm rate is lower than that of the other systems, at only 0.54%. Our system is therefore superior to the other systems in intrusion detection.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.