Disclosure of Invention
The invention aims to solve the problem that existing network intrusion detection methods perform poorly in detecting rare attacks in unbalanced network data, and provides a rare attack-oriented network intrusion detection method.
In order to solve the problems, the invention is realized by the following technical scheme:
the rare attack-oriented network intrusion detection method comprises the following steps:
step 1, collecting network attack data, and processing the collected network attack data into network data with an attack type label as a training set;
step 2, after the feature selection is carried out on the training set by utilizing a genetic programming algorithm, a subdata set with selected features is generated;
step 3, evaluating the accuracy of the subdata sets by using a random forest, and calculating the fitting value of the subdata sets by using the accuracy: if the fitting value of the sub data set reaches the target fitting value, ending the iteration, taking the sub data set as an optimized data set, and turning to the step 4; otherwise, the subdata set is used as a training set, and the step 2 is returned to, and the iteration is continued;
step 4, dividing the optimized data set into a common attack set and a rare attack set according to the feature tags of the network data attack types;
step 5, firstly, constructing an initial common attack classification model by using a convolutional neural network, and sending a common attack set to the initial common attack classification model to train the model to obtain a common attack classifier; then, constructing an initial rare attack classification model by using a common attack classifier, and sending a rare attack set to the initial rare attack classification model to train the model to obtain a rare attack classifier; finally, cascading the common attack classifier and the rare attack classifier to form a combined attack classifier;
step 6, sending the network data to be detected collected in real time into the joint attack classifier for classification; in the joint attack classifier, the network data to be detected is first sent into the common attack classifier for a first detection to judge whether a common attack exists: if a common attack exists, the network data to be detected is judged to have suffered a common attack; if not, the network data to be detected is sent into the rare attack classifier for a second detection to judge whether a rare attack exists: if a rare attack exists, the network data to be detected is judged to have suffered a rare attack; and if not, the network data to be detected is judged to be normal network data that has not been attacked.
In the above step 3, the fitting value $f_{fitness}$ of the sub data set is:
$f_{fitness} = score / n$
where score represents the accuracy of the sub data set obtained using a random forest for evaluation, and n represents the number of trees in the random forest.
To prevent an endless loop in which no sub data set ever reaches the required target fitting value during the iteration process, in step 3, if a preset number of iterations is reached and the fitting value of the sub data set has still not reached the target fitting value, the sub data set with the highest fitting value across all iterations is used as the optimized data set, so that the best available sub data set is passed to the subsequent steps and the subsequent classification accuracy is preserved.
Compared with the prior art, the invention has the following characteristics:
1. The method performs feature extraction on unbalanced data through a genetic programming algorithm and a random forest, finally obtains an optimized subset, and eliminates the influence of redundant data on rare attack patterns; meanwhile, a common attack set and a rare attack set are constructed by separating the data, so as to balance the data distribution among attack types.
2. The invention provides a joint attack classifier based on a convolutional neural network, so that the rare attack classifier can perform efficient learning based on small sample data, and further, the detection capability of rare attacks is improved. Firstly, training a common attack classifier based on a common attack type with huge data volume, and obtaining a proper model by adjusting parameters; then, based on the idea of transfer learning, the common attack classifier is used as an initialization model of the rare attack classifier, and then the rare attack classifier is continuously finely adjusted on the rare attack set. And finally, respectively training a convolutional neural network based on the obtained subsets to obtain a common attack classifier and a rare attack classifier, and connecting the two sub-classifiers in a series connection mode to construct a combined attack classifier for detecting the rare attack type.
3. According to the invention, under the condition of unbalanced network data, the two sub-classifiers of the constructed combined attack classifier can be ensured to be effectively learned through the mode of firstly detecting the common attack of the network data and then detecting the rare attack, so that the common attack and the rare attack can be detected, and the effect of detecting the rare attack is improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.
In order to solve the problem of low detection performance of rare attacks caused by unbalanced data distribution among attack categories in the network intrusion detection process, the key technology adopted by the invention is as follows:
(1) Provides a feature extraction technology based on genetic programming and random forests
In general, data used for network intrusion detection not only contains a large amount of redundant information, but also has an imbalance in distribution among attack categories in the data. In unbalanced network data, the existence of redundant data seriously affects the classification capability of the classifier on rare attack data, so that the overall network intrusion detection performance is reduced, and a higher false alarm rate is generated. In order to eliminate the influence caused by redundant data, the network data is processed before the attack classifier is trained, wherein a commonly used technology is feature extraction, and the technology is to eliminate redundant attack features and keep important information. In order to eliminate information which generates negative influence in the detection process and highlight rare attack characteristics, the invention provides a method for cleaning redundant data by combining genetic programming and random forests. The method comprises the steps of firstly finding out subgroups of network data by using a genetic programming algorithm, then evaluating each subgroup by using a random forest algorithm, finally selecting one with the highest fitting value, and forming a new data set by using features appearing in a random forest in the subgroups, wherein an overfitting feature set is effectively prevented from being generated in the process.
The genetic algorithm is designed according to the principle of biological evolution, a population representing a potential problem solution is continuously evolved to finally find an optimized problem solution set, and the basic process comprises the following steps:
The first step: N individuals are randomly initialized, the population at this time is denoted P(0), the iteration counter t is set to 0, and the maximum number of iterations T of population evolution is set.
The second step: representative individuals with good performance are selected and passed on to the next generation, either directly or after crossover.
The third step: evolution operations are carried out on the new population, changing the values of its individuals.
The fourth step: the population P(t) generates the next-generation population P(t+1) through crossover, mutation and selection. When t is equal to T, the iteration ends and the population evolution is complete.
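The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the bit-string encoding, the tournament selection, the mutation rate of 0.05, the elitism, and the names evolve, n_individuals and n_genes are assumptions made for the example and are not specified by the invention.

```python
import random

def evolve(fitness, n_individuals=20, n_genes=10, generations=30, seed=0):
    """Minimal genetic algorithm over bit strings (illustrative only)."""
    rng = random.Random(seed)
    # First step: randomly initialise N individuals as population P(0).
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(n_individuals)]
    best = max(population, key=fitness)

    def pick():
        # Second step: tournament of two, the fitter individual is selected.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for t in range(generations):  # iterate until t reaches T
        # Assumption: the best individual is inherited unchanged (elitism).
        next_population = [best[:]]
        while len(next_population) < n_individuals:
            p1, p2 = pick(), pick()
            # Crossover at a single random cut point.
            cut = rng.randint(1, n_genes - 1)
            child = p1[:cut] + p2[cut:]
            # Third step: mutation flips each gene with a small probability.
            child = [1 - g if rng.random() < 0.05 else g for g in child]
            next_population.append(child)
        # Fourth step: P(t) is replaced by P(t+1).
        population = next_population
        best = max(population, key=fitness)
    return best

# Toy fitness: the number of ones in the bit string.
best = evolve(fitness=sum)
initial_best = evolve(fitness=sum, generations=0)
```

Because the best individual is carried over unchanged in every generation, the fitness of the returned individual can never be lower than that of the best individual in P(0).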
Genetic programming algorithms are extensions of genetic algorithms that are typically used as an optimization technique to find solution sets for a particular problem while forming solution set populations for the problem to be solved. Briefly, a genetic programming algorithm can be represented by the following formula:
$g_{t+1} = f_{fitness}(g_t), \quad t = 0, 1, 2, \ldots, n$
The algorithm first randomly initializes a population $g_0$ as the original solution set of the problem to be solved; then, using the fitting function $f_{fitness}$, it selects a number of individuals from the current population $g_t$ and operates on them to form the next-generation population $g_{t+1}$. The selection process generates a tree that defines rules and functions and evolves through genetic operations such as crossover, mutation, and reproduction until the iteration is complete.
The random forest acts as an integrated evaluator that uses averaging to control the overfitting problem. The main implementation process of the random forest is as follows: firstly, constructing different sample sets from original data; then training a decision tree based on each sample set; and finally, integrating all decision trees to form a forest. After the forest is formed, each tree votes for the same classification target. Finally, the forest will define the category of the target as the type with the most votes.
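The bootstrap, per-tree training and majority vote described above can be illustrated with a deliberately small sketch. It assumes depth-1 trees (decision stumps) in place of full decision trees and toy one-dimensional data; the names train_stump, train_forest and predict are hypothetical and chosen only for this example.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-feature threshold rule (a depth-1 tree) on (features, label) pairs."""
    best = None
    n_features = len(samples[0][0])
    for j in range(n_features):
        for x, _ in samples:
            thr = x[j]
            left = [y for f, y in samples if f[j] <= thr]
            right = [y for f, y in samples if f[j] > thr]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, thr, left_label, right_label)
    _, j, thr, left_label, right_label = best
    return lambda f: left_label if f[j] <= thr else right_label

def train_forest(samples, n_trees=15, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Construct a different sample set by drawing with replacement.
        boot = [rng.choice(samples) for _ in samples]
        # Train one tree on that sample set.
        forest.append(train_stump(boot))
    return forest

def predict(forest, features):
    # Every tree votes; the class with the most votes wins.
    votes = Counter(tree(features) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data: small feature values are normal traffic, large values are attacks.
data = ([([0.1 * i], "normal") for i in range(10)]
        + [([5.0 + 0.1 * i], "attack") for i in range(10)])
forest = train_forest(data)
low = predict(forest, [-1.0])
high = predict(forest, [10.0])
```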
The random forest can effectively reduce the generation of overfitted solution sets, so the random forest is used as a filtering method to evaluate each solution set obtained from the genetic programming algorithm. First, the individuals of a solution set in the genetic programming algorithm are obtained, and a temporary data set X is constructed from the original data set according to the obtained individuals; X contains a large number of samples $(x_i, y_i)$, where $i = 1, \ldots, l$, $x_i \in \mathbb{R}^n$, n denotes the feature dimension of the data set, and $y_i \in \mathbb{Z}^+$ denotes the category of the variable Y. Then classification is performed with a random forest on the new data set X, and the fitting value of the data set is obtained. Finally, the data set with the highest fitting value is selected from the generated population, and the features appearing in the random forest are selected from that data set to construct the final optimized data set.
The feature extraction process based on genetic programming and random forests is as follows:
1) setting parameters of a genetic programming algorithm and the number of iterations.
2) Parameters and iteration numbers are initialized, wherein each population parameter represents a solution set of a specific problem, and individuals in the population represent features in the data set.
3) The new data set obtained after one pass of genetic programming is called a sub data set after feature selection.
4) And evaluating the accuracy of the sub data sets by using a random forest algorithm, and calculating a fitting value of each sub data set by using a fitting function so as to evaluate the sub data sets generated by each characteristic selection.
The fitting function is defined as follows:
$f_{fitness} = score / n$
where score represents the accuracy of classification of each sub data set using a random forest, and n represents the number of trees in the forest.
5) The above process is repeated until the fitting value of the sub data set reaches the target fitting value or reaches a predetermined number of iterations. When the fitting value of the sub data set reaches the target fitting value, ending iteration and taking the sub data set as an optimized data set; and if the target fitting value is not reached until the iteration is finished, taking the subdata set with the highest fitting value in all iterations as an optimized data set.
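The iteration control in steps 3) to 5) can be sketched as follows. The sketch covers only the loop around the fitting function: propose stands in for one pass of genetic programming and score_fn stands in for the random forest accuracy, and both are hypothetical placeholders supplied by the caller, as are the names select_features and fitting_value.

```python
def fitting_value(score, n_trees):
    """f_fitness = score / n, as defined in the text."""
    return score / n_trees

def select_features(features, propose, score_fn, n_trees, target, max_iterations):
    """Iterate feature selection until the target fitting value or the iteration limit."""
    current = list(features)
    best_subset, best_fit = None, float("-inf")
    for _ in range(max_iterations):
        # Stand-in for one pass of genetic programming over the current set.
        subset = propose(current)
        # Stand-in accuracy, converted into the fitting value.
        fit = fitting_value(score_fn(subset), n_trees)
        if fit > best_fit:
            best_subset, best_fit = subset, fit
        if fit >= target:
            # Target reached: this sub data set is the optimized data set.
            return subset, fit
        # Otherwise the sub data set becomes the next training set.
        current = subset
    # Iteration limit reached: fall back to the best sub data set seen so far.
    return best_subset, best_fit

# Toy stand-ins (assumptions): drop the last feature each pass, and pretend
# the accuracy peaks when exactly three features remain.
propose = lambda feats: feats[:-1]
score_fn = lambda feats: 1.0 - 0.1 * abs(len(feats) - 3)

subset, fit = select_features(range(6), propose, score_fn,
                              n_trees=10, target=0.1, max_iterations=10)
fallback_subset, fallback_fit = select_features(range(6), propose, score_fn,
                                                n_trees=10, target=1.0,
                                                max_iterations=5)
```

Since the accuracy is divided by the number of trees, the target fitting value has to be chosen on the same scale; with 10 trees, as in this toy run, a perfect accuracy corresponds to a fitting value of 0.1.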
(2) Provides a data separation technology
In order to solve the problem of low detection capability for rare attacks caused by unbalanced attack data, the invention provides a method that separates the attack data set and constructs a combined attack classifier. Taking the NSL-KDD data set as an example, it contains only a small number of U2R and R2L training samples, so a conventional machine learning or deep learning classifier trained directly on the original data set has difficulty identifying these two attack types during classification. The invention therefore separates the data set into a common attack set and a rare attack set, which reduces the degree of imbalance in the attack data distribution. The common attack set consists of a large amount of data of common attack types and part of the normal network traffic; the rare attack set consists of rare attack types containing a small number of records and part of the normal network traffic. In the process of constructing the subsets, the normal-type data is divided by random selection. Compared with the original data, the data distribution within the common attack set and within the rare attack set is more balanced, and attack classifiers trained on these two data sets can obtain better classification performance. By contrast, a common approach to unbalanced data is to reduce the degree of imbalance between classes by sampling the data and then to construct a single attack classifier that learns the attack patterns of all classes together. The data separation also provides the data basis for designing a combined attack classifier that addresses the low detection performance for rare attacks caused by data imbalance. After the data is separated, the number of samples in the common attack set is large, and the number of samples in the rare attack set is small.
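A minimal sketch of the separation is given below. The text does not state how much of the normal traffic goes to each subset, so the rare_normal_fraction parameter (set to 0.25 here) is purely an assumption, as are the function name separate, the lowercase label strings, and the record layout of (features, label) pairs.

```python
import random

def separate(records, rare_labels=("u2r", "r2l"), normal_label="normal",
             rare_normal_fraction=0.25, seed=0):
    """Split (features, label) records into a common attack set and a rare attack set."""
    rng = random.Random(seed)
    normal, common, rare = [], [], []
    for rec in records:
        label = rec[1]
        if label == normal_label:
            normal.append(rec)
        elif label in rare_labels:
            rare.append(rec)
        else:
            common.append(rec)
    # Normal traffic is divided between the two subsets by random selection.
    rng.shuffle(normal)
    k = int(len(normal) * rare_normal_fraction)
    common_set = common + normal[k:]
    rare_set = rare + normal[:k]
    return common_set, rare_set

# Toy records with empty feature vectors; only the labels matter here.
records = ([([], "normal")] * 100 + [([], "dos")] * 50 + [([], "probe")] * 20
           + [([], "u2r")] * 3 + [([], "r2l")] * 5)
common_set, rare_set = separate(records)
```

In this toy run the rare attack set holds 8 attack records and 25 normal records, instead of 8 attack records against 170 other records in the unseparated data.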
(3) Provides a combined attack classifier based on deep learning and traditional machine learning
The data separation mode can balance the distribution among attack types, and the joint attack classifier obtained based on new data set training eliminates the influence of common attack mode on rare attack mode. However, after the data is separated, the rare attack set has a small sample number, which may affect the performance of the rare attack classifier. Therefore, based on the characteristics that the data volume of the common attack set is huge and the data volume of the rare attack set is small, the deep learning technology is used for constructing the common attack classifier, the traditional machine learning technology is used for constructing the rare attack classifier, and the two attack classifiers are connected to construct the combined attack classifier. The method comprises the steps of firstly, constructing a common attack classifier by using a convolutional neural network, and adjusting training parameters by using a common attack set to obtain a proper common attack classifier; then introducing a transfer learning idea, taking a common attack classifier as a source model to initialize a rare attack classifier, and utilizing a rare attack set to fine tune training parameters to obtain a proper rare attack classifier; and finally, connecting the two classifiers to form a joint attack classifier, as shown in FIG. 1.
The convolutional neural network reduces the number of parameters in the neural network mainly by sharing weights, and most conventional convolutional neural networks are built for tasks such as image recognition and video processing. In a convolutional neural network, the convolutional layer is mainly used to extract high-dimensional features of the data, and the pooling layer further reduces the dimension of the resulting feature set, thereby reducing the computational complexity. Image data is generally still large after multiple convolution layers, so a reasonable pooling operation effectively reduces the computational cost. Network data is different: after feature extraction, mapping it to a higher-dimensional space with a convolutional neural network makes the resulting features very sparse, and reducing the dimension of these sparse features with a pooling operation would seriously lose important information in the data and in turn degrade the training of the attack classifier. The convolutional neural network is trained with the back propagation algorithm: first, input data is given and the output is computed through a series of high-dimensional mapping operations; then an error function measures the difference between the output and the true label of the input sample; finally, the weights are updated through back propagation. The convolutional neural network used in the invention is composed of an input layer, a convolutional layer, a fully connected layer, and a classification layer.
The joint attack classifier of the invention is based on the convolutional neural network and uses two convolutional neural networks with the same structure. In order to eliminate the influence of common attacks in the data set on rare attacks, the network data needs to be separated, and a common attack set and a rare attack set are constructed respectively. In practice, when a new problem is related to data or tasks already encountered and the amount of data available for the problem to be solved is insufficient, transfer learning can be applied for optimization. Transfer learning improves the learning efficiency of a new model by sharing well-trained model parameters with it; the new model is then fine-tuned on its own task domain. In order to enable the rare attack classifier to obtain effective classification performance despite the small sample size of the rare attack set, the invention applies transfer learning to train the attack classifier.
The specific implementation algorithm for training the common attack classifier is as follows:
Input: a common attack data set $X = (x_1, x_2, \ldots, x_n)$, where n represents the number of samples in the data set; the number of filters T; and the number of fully connected layers L.
Output: the class y of the input sample.
The first step: for each filter $t \in [1, T]$, initialize the weight $W_t$ and bias $b_t$ of the filter so that $W_t = 0$, $b_t = 0$. Then use $f_t = \tanh(W_t X + b_t)$ to compute the new feature set obtained after the input data has undergone the convolution operation.
The second step: for each fully connected layer $l \in [1, L]$, initialize the weight $W_l$ and bias $b_l$ of the fully connected layer so that $W_l = 0$, $b_l = 0$; then define $h_0 = f$, and finally compute the output of the l-th fully connected layer: $h_l = \mathrm{relu}(W_l h_{l-1} + b_l)$.
The third step: compute the output of the final classification layer: $y = \mathrm{softmax}(W_L h_{L-1} + b_L)$.
The fourth step: and calculating a loss function, and updating parameters by using a gradient descent algorithm.
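The forward computation of the first three steps can be sketched in plain Python. Several details are assumptions made for the example: the convolution is taken to be one-dimensional with a kernel width of 3, the last fully connected layer is treated as the softmax classification layer while the earlier ones use relu, and the layer sizes and function names are hypothetical. The loss calculation and gradient descent update of the fourth step are not shown.

```python
import math

def conv1d_tanh(x, kernel, bias):
    """f_t = tanh(W_t X + b_t) for one filter sliding over the input vector."""
    k = len(kernel)
    return [math.tanh(sum(kernel[i] * x[p + i] for i in range(k)) + bias)
            for p in range(len(x) - k + 1)]

def dense(h, weights, bias):
    """Affine map W h + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, filters, dense_layers):
    # First step: apply every filter and concatenate the feature maps into f.
    h = [a for kernel, bias in filters for a in conv1d_tanh(x, kernel, bias)]
    # Second step: h_l = relu(W_l h_{l-1} + b_l) for the hidden layers.
    for weights, bias in dense_layers[:-1]:
        h = relu(dense(h, weights, bias))
    # Third step: y = softmax(W_L h_{L-1} + b_L) in the classification layer.
    weights, bias = dense_layers[-1]
    return softmax(dense(h, weights, bias))

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Zero initialisation as stated in the text: 2 filters, then 8 -> 4 -> 3 units.
filters = [([0.0] * 3, 0.0) for _ in range(2)]
dense_layers = [(zeros(4, 8), [0.0] * 4), (zeros(3, 4), [0.0] * 3)]
y = forward([0.2, 0.5, 0.1, 0.9, 0.3, 0.7], filters, dense_layers)
```

With all parameters at zero, every class receives the same probability before training; in practice small random initial values are often used instead of zeros.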
First, the common attack model is trained, with the weights and biases updated by the stochastic gradient descent algorithm. Once the common attack classifier is obtained, it serves as the initialization model. When training the rare attack classifier, all weight matrices W and biases b of the common attack classifier are first obtained; after the rare attack classifier is constructed, the preliminarily established rare attack classifier is initialized with the obtained parameters. The specific implementation process is as follows:
Input: a rare attack data set $X' = (x'_1, x'_2, \ldots, x'_{n'})$, where n' represents the number of samples in the data set; the number of filters T; the number of fully connected layers L; and the initialized weights $W_t$ and biases $b_t$.
Output: the class y' of the input sample.
The first step: for each filter $t \in [1, T]$ and each fully connected layer $l \in [1, L]$, the weights $W'_t$ and biases $b'_t$ therein are initialized so that $W'_t = W_t$, $b'_t = b_t$.
The second step: given rare attack input samples, forward propagation is carried out to obtain the output, and the parameters of the classifier are then fine-tuned on the rare attack set using the back propagation algorithm.
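The parameter transfer in the first step can be sketched as a deep copy followed by an ordinary gradient step. The dictionary layout of the parameters, the learning rate, the gradient values and the function names are hypothetical; the gradients would come from back propagation on the rare attack set, which is not shown here.

```python
import copy

def init_rare_from_common(common_params):
    """W'_t = W_t and b'_t = b_t: start the rare classifier from a copy of the common one."""
    return copy.deepcopy(common_params)

def fine_tune_step(params, grads, learning_rate=0.01):
    """One gradient descent update applied to the rare classifier only."""
    for name, values in params.items():
        params[name] = [v - learning_rate * g
                        for v, g in zip(values, grads[name])]

# Hypothetical trained parameters of the common attack classifier.
common_params = {"W": [0.5, -0.2], "b": [0.1]}
rare_params = init_rare_from_common(common_params)
initially_equal = rare_params == common_params

# Hypothetical gradients obtained on the rare attack set.
fine_tune_step(rare_params, {"W": [1.0, 1.0], "b": [0.0]}, learning_rate=0.1)
```

Copying rather than sharing the parameters keeps the common attack classifier unchanged while the rare attack classifier is fine-tuned, so both remain available for the cascade.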
Training of the two classifiers is carried out independently, and the common attack classifier learns a common attack mode and a normal mode; the rare attack classifier learns rare attack patterns and normal patterns. And in the detection stage, the two attack classifiers are connected to form a combined attack classifier, the process firstly uses a common attack classifier to classify the network data, and if the classification result is of a normal type, the network data is detected again by the rare attack classifier.
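The cascade used in the detection stage can be sketched as below. The two classifiers are passed in as plain callables; the toy threshold rules standing in for the trained networks, the label strings, and the function name joint_classify are assumptions made only for this illustration.

```python
def joint_classify(sample, common_clf, rare_clf, normal_label="normal"):
    """Cascade: the common attack classifier runs first, then the rare one."""
    first = common_clf(sample)
    if first != normal_label:
        # A common attack was found, so the second stage is skipped.
        return first
    # Judged normal by the first stage: detect again with the rare classifier.
    return rare_clf(sample)

# Toy stand-ins for the two trained classifiers (hypothetical rules).
common_clf = lambda s: "dos" if s["count"] > 100 else "normal"
rare_clf = lambda s: "u2r" if s["root_shell"] == 1 else "normal"

flood = joint_classify({"count": 500, "root_shell": 0}, common_clf, rare_clf)
escalation = joint_classify({"count": 2, "root_shell": 1}, common_clf, rare_clf)
benign = joint_classify({"count": 2, "root_shell": 0}, common_clf, rare_clf)
```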
Referring to FIG. 2, a rare attack-oriented network intrusion detection method specifically includes the following steps:
step 1, collecting network attack data, and processing the collected network attack data into network data with an attack type label as a training set.
The classic datasets in the field of network intrusion detection are the KDD99 dataset and the NSL-KDD dataset. The KDD99 dataset was collected and created by American researchers by simulating different network attacks in a real network environment, and it is widely used to evaluate network intrusion detection methods. Although the KDD99 raw dataset contains a huge number of attack samples, many of them are repetitive and redundant, and a large number of repeated samples in the training set biases the learning process of the classifier. The NSL-KDD dataset is an upgraded version of the KDD99 dataset that eliminates redundant samples in the training set and deletes duplicate records in the test set. The NSL-KDD dataset is therefore used in the study of the present invention. The NSL-KDD dataset consists of network traffic data. Each record is a TCP packet within a certain period of time, a packet being data transmitted from a source address to a destination address under a certain protocol, and each sample in the data includes 41 feature attributes and 1 sample class label.
Step 2, after the feature selection is carried out on the training set by utilizing a genetic programming algorithm, a subdata set with selected features is generated;
Step 3, evaluating the accuracy of the sub data set by using the random forest, and calculating the fitting value of the sub data set from the accuracy, wherein the fitting value $f_{fitness}$ of the sub data set is:
$f_{fitness} = score / n$
where score represents the accuracy of the sub data set obtained using a random forest for evaluation, and n represents the number of trees in the random forest.
If the fitting value of the sub data set reaches the target fitting value, ending iteration, taking the sub data set as an optimized data set, and turning to the step 4; otherwise, the subdata set is used as a training set, and the step 2 is returned to, and the iteration is continued.
When the preset number of iterations is reached and the fitting value of the sub data set has not reached the target fitting value, the sub data set with the highest fitting value in all iterations is taken as the optimized data set, and the process goes to step 4.
Step 4, dividing the optimized data set into a common attack set and a rare attack set according to the attack type labels of the network data.
Step 5, firstly, constructing an initial common attack classification model by using a convolutional neural network, and sending a common attack set to the initial common attack classification model to train the model to obtain a common attack classifier; then, constructing an initial rare attack classification model by using a common attack classifier, and sending a rare attack set to the initial rare attack classification model to train the model to obtain a rare attack classifier; and finally, cascading the common attack classifier and the rare attack classifier to form a combined attack classifier.
Step 6, sending the network data to be detected collected in real time into the joint attack classifier for classification; in the joint attack classifier, the network data to be detected is first sent into the common attack classifier for a first detection to judge whether a common attack exists: if a common attack exists, the network data to be detected is judged to have suffered a common attack; if not, the network data to be detected is sent into the rare attack classifier for a second detection to judge whether a rare attack exists: if a rare attack exists, the network data to be detected is judged to have suffered a rare attack; and if not, the network data to be detected is judged to be normal network data that has not been attacked.
It should be noted that the above-mentioned embodiments of the present invention are illustrative, and the present invention is not limited to them. Other embodiments made by those skilled in the art in light of the teachings of the present invention, without departing from its principles, are considered to be within the scope of the present invention.