CN114091661A - Oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm - Google Patents

Oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm

Info

Publication number
CN114091661A
CN114091661A CN202111409785.4A
Authority
CN
China
Prior art keywords
attack
sample
data
training
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111409785.4A
Other languages
Chinese (zh)
Other versions
CN114091661B (en
Inventor
李童
刘晓东
张润滋
杨震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Nsfocus Technologies Group Co Ltd
Original Assignee
Beijing University of Technology
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology, Nsfocus Technologies Group Co Ltd filed Critical Beijing University of Technology
Priority to CN202111409785.4A priority Critical patent/CN114091661B/en
Priority claimed from CN202111409785.4A external-priority patent/CN114091661B/en
Publication of CN114091661A publication Critical patent/CN114091661A/en
Application granted granted Critical
Publication of CN114091661B publication Critical patent/CN114091661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm, which comprises the following steps: numericalizing and normalizing the original data; constructing a generative model based on WGAN-GP and training it with minority-class attack samples and random noise, so that the generator models the attack distribution and can generate attack samples; filtering noise in the generated attack samples with the k-nearest neighbor algorithm; and finally, ranking the importance of the data's field attributes with analysis of variance, performing feature selection according to the ranking, and removing unnecessary features to obtain the oversampled training set. Using the oversampled training set produced by the invention can effectively improve the performance of an intrusion detection model.

Description

Oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm
Technical Field
The invention relates to an oversampling technique based on a generative adversarial network and the k-nearest neighbor algorithm, which is used to improve intrusion detection performance and belongs to the field of intrusion detection.
Background
Intrusion detection is an effective method for detecting and defending against network attacks: it monitors network traffic in real time, classifies network records as normal or malicious, and provides the necessary information to a defense system. With the arrival of the big data era, machine learning has developed rapidly and has become a widely used approach to intrusion detection. However, in real life attacks occur far less frequently than normal activity, so the data sets used to train machine learning models tend to be imbalanced, which degrades detection performance. Oversampling techniques are commonly used to address data set imbalance. Researchers have proposed the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling technique (ADASYN), which generate samples by interpolating between two instances of the same class. However, the complexity of network traffic makes class boundaries fuzzy, and interpolation may generate samples that cross those boundaries, increasing the confusion of decision boundaries. Furthermore, these methods only consider class labels and ignore the similarity of feature relationships, which increases the risk of generating noise.
The generative adversarial network (GAN) is a deep learning model that can model the complex, high-dimensional distributions of real-world data; its structure is shown in fig. 1. It is inspired by the two-player game in game theory and consists of a generator and a discriminator, both of which are neural networks. The generator captures the latent distribution of real data samples and produces new data; the discriminator judges whether its input is real data or generated data. The generator uses the discriminator as its loss function and updates its parameters so that the data it generates appears more realistic, while the discriminator updates its parameters to better separate generated data from real data. The two networks are trained iteratively so that the generator can produce near-real samples. Because samples drawn from the GAN-modeled data distribution are closer in their features to real data, researchers have applied GANs to intrusion detection to generate attack samples. However, GAN-based oversampling methods still risk generating noise.
The k-nearest neighbor algorithm (KNN) is a supervised machine learning algorithm that can be used for classification and regression. Its core idea is that the class of an unlabeled sample is determined by a vote of its k nearest neighbors. Specifically, for an unlabeled instance, the k instances closest to it are found in the training set, and the instance is assigned to the class to which the majority of those k instances belong. Based on this idea it can be used for noise filtering: for each generated attack sample, the k closest instances in the training set are found, and if most of them are non-attack samples, the generated sample is marked as noise.
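As a brief illustration of this majority-vote idea (toy data, not from the patent), the following sketch uses scikit-learn's KNeighborsClassifier, assuming binary labels where 0 denotes non-attack and 1 denotes attack:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two non-attack points (label 0) and three attack points (label 1).
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]])
y_train = np.array([0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # vote among the 3 closest training instances
knn.fit(X_train, y_train)
print(knn.predict([[0.15, 0.15]]))          # -> [0]: two of its three neighbors are class 0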
Analysis of variance (ANOVA) is a commonly used feature selection method that screens features by their variance. If a feature's variance is small, the samples show essentially no difference on that feature; most of its values may be identical, or even the entire feature may take a single value, and such a feature contributes nothing to discriminating samples. We therefore calculate an f-value for each feature based on ANOVA, and finally rank the features by importance to obtain the optimal subset.
Disclosure of Invention
In order to solve the problem of poor detection performance caused by the imbalance of training data sets for machine learning models, the invention aims to provide an oversampling method for improving intrusion detection performance: high-quality attack samples are generated by modeling the distribution of attack samples, the generated samples are noise-filtered using neighbor information, and the result is finally added to the original data set, improving the balance of the training set and thereby the intrusion detection performance.
In order to achieve this purpose, the technical scheme adopted by the invention is a noise-reducing oversampling method based on a generative adversarial network (GAN) and the k-nearest neighbor algorithm. As shown in fig. 2, the method comprises the following five steps:
data preprocessing: the fields of the original data contain several data types, such as character and numerical types, and the feature scales are inconsistent, so the data are numericalized and normalized, and attack samples are extracted for training the generative model;
constructing a generative model for each attack separately: a generative model is constructed based on WGAN-GP and trained with minority-class attack samples and random noise, so that the generator models the attack distribution and can be used to generate attack samples.
After training is completed, the preprocessed original training set is input into the generator for the corresponding attack to generate the corresponding attack sample set;
noise filtering: filtering noise data in the generated attack sample set by using a k-nearest neighbor algorithm;
feature selection: carrying out importance ranking on the field attributes based on analysis of variance (ANOVA), carrying out feature selection according to a ranking result, removing unnecessary features, and finally obtaining an oversampled training set;
furthermore, the intrusion detection model is trained by using the training set after oversampling, and the detection performance is further improved.
Further, the method comprises the following concrete implementation steps:
Step (1) data preprocessing:
The original training set DS contains several data types, such as character and numerical types, with inconsistent feature scales, and is therefore not directly suitable for training the GAN and the machine learning algorithms; it must be preprocessed.
Step (1.1) digitizing:
The original training set may contain character-type features, such as protocol, which are not suitable for training and therefore need to be digitized. In this step, each character-type feature is digitized by mapping it to an integer value in [0, S-1], where S is the number of distinct feature values. For example, the feature protocol contains the three values tcp, udp and icmp, so S is 3, the mapping interval is [0, 2], and tcp, udp and icmp are mapped to 0, 1 and 2, respectively.
Step (1.2) normalization:
the data obtained in the step (1.1) are numerical data, but different features often have different dimensions and dimension units, which affect the result of data analysis, and data normalization processing is required to eliminate the dimension effect between indexes. After the raw data are subjected to data normalization processing, all the characteristics are in the same order of magnitude, and the method is suitable for comprehensive comparison and evaluation. We scale all eigenvalues to the [0,1] interval by the following formula.
x' = (x - x_min) / (x_max - x_min)
Where x is the feature value before normalization and x' is the feature value after normalization; x_max is the maximum value of the corresponding feature and x_min is its minimum value.
Step (1.3) dividing subsets:
The intrusion detection training set contains multiple attack types; attacks of the same type are similar in distribution, and the GAN is used to generate each attack type separately. According to the attack type of the samples to be generated, the corresponding attack subset, named DS_Attack, is extracted (e.g., the subset containing only DoS attacks is DS_DoS).
Step (2) respectively constructing a generation model aiming at each attack:
in the step, the GAN is used for generating an attack sample, so that a training set is supplemented, and the balance of the training set is improved.
Step (2.1) constructing a GAN model:
the model is designed based on WGAN-GP and consists of a generator and a discriminator, wherein the generator is defined as G, and the discriminator is defined as D. G and D are both feedforward neural networks. G includes an input layer, a hidden layer, and an output layer. The sample obtained by the output layer is a generated sample, the generated sample needs to be the same as the original data dimension, so that the number of neurons of the output layer needs to be the same as the dimension of the preprocessed data, the activation function is Linear, and the activation functions of the rest layers are ReLu. D includes an input layer, a hidden layer, and an output layer. D distinguishes data by scoring, so the number of neurons in the output layer is set to 1 and the activation function is Linear. The remaining layer activation function is ReLu. The objective function is:
L_D = E_{x̃~p_G}[D(x̃)] - E_{x~p_data}[D(x)] + λ·E_{x̂~p_x̂}[(‖∇_{x̂}D(x̂)‖_2 - 1)^2]
where E(·) denotes the mathematical expectation under the corresponding distribution, D(·) is the output of the discriminator, x denotes a real sample, x̃ denotes a generated sample, and x̂ denotes a point obtained by random interpolation between a real sample and a generated sample; p_data is the distribution of real sample data, p_G is the distribution of generated samples, and p_x̂ is defined as the distribution of points sampled uniformly along straight lines between pairs drawn from p_data and p_G; λ is the gradient penalty coefficient and ∇ denotes the gradient.
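A minimal PyTorch sketch of the generator G and discriminator D described above, i.e. feedforward networks with ReLU hidden layers and a Linear output layer; the number and width of hidden layers are assumptions, not values taken from the patent:

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, data_dim),      # linear output with the data dimension
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),             # single linear score for real vs generated
        )

    def forward(self, x):
        return self.net(x)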
Step (2.2) training the generative model:
and (4) according to the number n of the attack subsets obtained in the step (1.3), training n different generators in total.
First, the parameters of the two networks of generators and discriminators are initialized, and a noise distribution following a normal distribution is defined.
Second, real data and noise data are prepared. The real data are the subset DS_Attack obtained in step (1.3); the noise Noise_Attack is drawn from the noise distribution and has the same number of samples as DS_Attack.
Third, the generator is fixed and the discriminator is trained. The noise is passed through the generator to produce an equal number of generated samples Sample_Attack, and DS_Attack together with Sample_Attack is used to train the discriminator to distinguish whether a record comes from the real data DS_Attack or from the generated data Sample_Attack.
Fourth, the discriminator is fixed and the generator is trained. After the discriminator has been trained for k rounds as in the third step, it is used to train the generator, so that the discriminator becomes as unable as possible to tell whether the data come from DS_Attack or from Sample_Attack.
The generator and discriminator are updated iteratively according to the third and fourth steps. In the ideal case the final discriminator cannot distinguish whether a sample comes from the real training set or was produced by the trained generator, at which point its discrimination probability is 0.5.
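The alternating training of the third and fourth steps, together with the WGAN-GP gradient penalty, could be sketched as follows; the batch size, learning rate, λ = 10 and the critic/generator update ratio are assumptions, not values stated in the patent:

import torch

def gradient_penalty(D, real, fake, device):
    # Penalize the critic gradient norm at random interpolation points x_hat.
    eps = torch.rand(real.size(0), 1, device=device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

def train_wgan_gp(G, D, real_data, noise_dim, iters=1000, k_critic=5,
                  batch_size=64, lam=10.0, lr=1e-4, device="cpu"):
    real_data = real_data.to(device)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.9))
    for _ in range(iters):
        for _ in range(k_critic):                       # third step: fix G, train D
            idx = torch.randint(0, real_data.size(0), (batch_size,), device=device)
            real = real_data[idx]
            fake = G(torch.randn(batch_size, noise_dim, device=device)).detach()
            loss_d = (D(fake).mean() - D(real).mean()
                      + lam * gradient_penalty(D, real, fake, device))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        z = torch.randn(batch_size, noise_dim, device=device)   # fourth step: fix D, train G
        loss_g = -D(G(z)).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return G

In this sketch real_data is a float tensor holding the preprocessed DS_Attack records, and the trained G is afterwards sampled with fresh random noise to obtain Sample_Attack.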
Step (3) generating samples:
Through step (2.2), generators for the different attack types are obtained, and the corresponding attack sample sets Sample_Attack can be generated with them.
Step (4) noise filtering:
Noise is filtered using the k-nearest neighbor algorithm:
In the original data set the proportion of non-attack samples is larger than that of attack samples, so a large k value lets non-attack samples dominate and coarsens the noise filtering; the value of k is therefore set between 3 and 5. The k-nearest neighbor algorithm is applied to the generated sample set Sample_Attack: the Euclidean distance between each generated sample and the other samples is computed, and when more than half of the samples in its neighborhood are non-attack samples, the current sample is regarded as noise and deleted from the set.
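A sketch of this noise filter under the stated rule (the function name and array interface are assumptions): a generated attack sample is dropped when more than half of its k nearest neighbors in the original training set are non-attack samples.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_noise_filter(generated, train_X, train_is_attack, k=3):
    # generated: (m, d) generated attack samples; train_X: (n, d) original training
    # features; train_is_attack: (n,) boolean array, True for attack records.
    nn_index = NearestNeighbors(n_neighbors=k).fit(train_X)
    _, idx = nn_index.kneighbors(generated)             # Euclidean distance by default
    non_attack_votes = (~train_is_attack[idx]).sum(axis=1)
    keep = non_attack_votes <= k / 2                     # noise if the majority is non-attack
    return generated[keep]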
Step (5) feature selection:
Step (5.1) merging sample sets:
The filtered attack sample set is combined with the original training set to form a new training set, which is more balanced than the original data set DS.
Step (5.2) obtaining the feature subset:
Network traffic data are high-dimensional, and some features have little influence on classification. Feature selection reduces the number of features, avoids the curse of dimensionality, makes the model generalize better, and shortens model training time. The invention uses analysis of variance for feature selection, which applies an F-test to determine whether the means of certain groups differ, i.e., it statistically tests whether the means are equal. More specifically, for each feature x_i, the null hypothesis is that x_i has the same mean in the positive-class and negative-class samples, i.e. H_0: μ_S+ = μ_S-, where μ denotes the mean, S+ denotes the set of x_i values belonging to positive-class samples, and S- denotes the set of x_i values belonging to negative-class samples. Then f_value is calculated according to the following equation.
f_value = (S_A / (r - 1)) / (S_E / (n - r))
where S_A and S_E denote the between-group and within-group deviations respectively, n is the total number of samples, and r is the number of classes. Following the steps above, the f_value of each feature is calculated separately. Finally, the features are ranked by importance according to their f_value, a feature subset is constructed, and the new training set DS_new used for intrusion detection model training is obtained.
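A sketch of this ranking step, assuming scikit-learn's f_classif (which computes the same one-way ANOVA F statistic per feature) and an assumed keep ratio; the returned indices define the feature subset used to build DS_new:

import numpy as np
from sklearn.feature_selection import f_classif

def select_features_by_anova(X, y, keep_ratio=0.3):
    f_values, _ = f_classif(X, y)                 # one F value per feature column
    order = np.argsort(f_values)[::-1]            # descending importance
    n_keep = max(1, int(np.ceil(keep_ratio * X.shape[1])))
    selected = np.sort(order[:n_keep])            # indices of the retained features
    return X[:, selected], selected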
Further, the new training set obtained in step (5) is used to train the intrusion detection model, which finally improves its detection performance.
Advantageous effects
Existing oversampling techniques artificially increase the number of minority-class samples and thereby improve the balance of a data set; classical methods include random oversampling, the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The invention provides a noise-reducing oversampling method based on a generative adversarial network (GAN) and the k-nearest neighbor algorithm: attack samples are generated with a GAN, noise in the generated samples is filtered using neighbor information, and the data dimensionality is reduced through feature selection. Supplementing the original data set with the generated attack samples improves the balance of the data set and finally improves the performance of the intrusion detection system.
Drawings
FIG. 1 is a structural diagram of the generative adversarial network;
FIG. 2 is a flow chart of an oversampling method;
Detailed Description
The invention aims to provide an oversampling method for improving intrusion detection performance, which generates high-quality attack samples by modeling attack sample distribution, improves the balance of a training set and improves the intrusion detection performance.
In order to achieve the purpose, the invention adopts the technical scheme that the oversampling method for improving the intrusion detection performance based on the GAN comprises the following steps:
Step (1) data preprocessing:
In the specific implementation example we use UNSW_NB15 as the experimental data set. UNSW_NB15 is one of the data sets commonly used in the intrusion detection field and contains normal activity samples and nine different types of attack samples. Instances in the data set are represented by 42-dimensional features, which can be divided into character-type and numerical-type features according to their representation. Since character-type features cannot be used for GAN training, they must be converted to numerical values; moreover, inconsistent feature scales affect training time and convergence speed, so the data must also be normalized.
Step (1.1) digitizing the character-type features:
UNSW_NB15 contains three character-type features, namely protocol, service and state, which are not suitable for training and must each be converted to numerical form. In this step the character-type features are mapped to integer values in [0, S-1], where S is the number of distinct feature values. protocol, service and state contain 133, 13 and 11 different values respectively, and are therefore mapped as follows:
protocol—>[0-132]
service—>[0-12]
state—>[0-10]
After digitization, all 42 features in UNSW_NB15 are numerical.
Step (1.2) normalizing each feature:
the data obtained in the step (1.1) are numerical data, but different features often have different dimensions and dimension units, which affect the result of data analysis, and data normalization processing is required to eliminate the dimension effect between indexes. After the raw data are subjected to data normalization processing, all the characteristics are in the same order of magnitude, and the method is suitable for comprehensive comparison and evaluation. We scale all eigenvalues to the [0,1] interval by the following formula.
x' = (x - x_min) / (x_max - x_min)
Where x is the feature value before normalization and x' is the feature value after normalization; x_max is the maximum value of the corresponding feature and x_min is its minimum value.
Step (1.3) dividing subsets:
The UNSW_NB15 training set contains 9 attack types: Generic, Exploits, Fuzzers, Reconnaissance, DoS, Analysis, Shellcode, Backdoor and Worms. Attacks of the same type are similar in distribution, and we use the GAN to generate each attack type separately. We only generate the attack types with small sample counts, namely Analysis, Shellcode, Backdoor and Worms, and extract the corresponding attack samples from the original data set to form the four subsets DS_Analysis, DS_Shellcode, DS_Backdoor and DS_Worms.
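An illustrative pandas sketch of this subset extraction; the label column name "attack_cat" follows the public UNSW_NB15 field naming and is an assumption here, not taken from the patent text:

import pandas as pd

MINORITY_ATTACKS = ["Analysis", "Shellcode", "Backdoor", "Worms"]

def split_attack_subsets(train_df, label_col="attack_cat"):
    # One subset of feature rows per minority attack category.
    return {cat: train_df[train_df[label_col] == cat].drop(columns=[label_col])
            for cat in MINORITY_ATTACKS}

# subsets = split_attack_subsets(train_df)
# subsets["Analysis"] then plays the role of DS_Analysis, and so on.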
Step (2) respectively constructing a generation model aiming at each attack:
in the step, the GAN is used for generating an attack sample, so that a training set is supplemented, and the balance of the training set is improved.
Step (2.1) constructing a GAN model:
a model for generating an attack sample is constructed by adopting a WGAN-GP framework, the model consists of a generator and a discriminator, the generator is defined as G, and the discriminator is defined as D. G and D are both feedforward neural networks.
The G comprises an input layer, an output layer and a hidden layer, wherein the activation function of the output layer is Linear, and the activation function of the hidden layer is ReLu. The number of neurons in the input layer is equal to the dimension of noise, and the number of neurons in the output layer is equal to the dimension of a real sample.
D comprises an input layer, an output layer and a hidden layer; the number of neurons in its input layer equals the dimension of a real sample and the number of neurons in its output layer is 1. The output layer activation function is Linear and the hidden layer activation function is ReLu. The objective function is:
L_D = E_{x̃~p_G}[D(x̃)] - E_{x~p_data}[D(x)] + λ·E_{x̂~p_x̂}[(‖∇_{x̂}D(x̂)‖_2 - 1)^2]
where E(·) denotes the mathematical expectation under the corresponding distribution, D(·) is the output of the discriminator, x denotes a real sample, x̃ denotes a generated sample, and x̂ denotes a point obtained by random interpolation between a real sample and a generated sample; p_data is the distribution of real sample data, p_G is the distribution of generated samples, and p_x̂ is defined as the distribution of points sampled uniformly along straight lines between pairs drawn from p_data and p_G; λ is the gradient penalty coefficient and ∇ denotes the gradient.
Step (2.2) training a model:
According to the four subsets obtained in step (1.3), four different generators need to be trained in this step.
First, the parameters of the two networks of generators and discriminators are initialized, and a noise distribution following a normal distribution is defined.
Second, real data and noise data are prepared. The real data are the attack subsets DS_Analysis, DS_Shellcode, DS_Backdoor and DS_Worms obtained in step (1.3). Taking the subset DS_Analysis first, attack noise Noise_Analysis is drawn from the noise distribution with the same number of samples as DS_Analysis.
Third, the generator is fixed and the discriminator is trained. The noise is passed through the generator to produce an equal number of generated samples Sample_Analysis, and DS_Analysis together with Sample_Analysis is used to train the discriminator to distinguish whether a record comes from the real data DS_Analysis or from the generated data Sample_Analysis.
Fourth, the discriminator is fixed and the generator is trained. The discriminator obtained after 100 rounds of the third step is used to train the generator, so that the discriminator becomes as unable as possible to tell whether the data come from DS_Analysis or from Sample_Analysis. The third and fourth steps are iterated 1000 times to obtain the final attack sample generator.
Fifth, each attack subset is used in turn to train a generative adversarial network according to the steps above, finally yielding four different attack sample generators.
Step (3) generating samples:
Using the four generators obtained in step (2.2), we generate the corresponding attack sample sets Sample_Analysis, Sample_Shellcode, Sample_Backdoor and Sample_Worms.
Step (4) noise filtering:
Noise is filtered with the k-nearest neighbor algorithm. Because non-attack samples make up a large proportion of the original data set, a large k value lets them dominate and coarsens the noise filtering, so k is set to 3. The k-nearest neighbor algorithm is applied to the attack samples generated in step (3): when more than half of a generated sample's nearest neighbors are non-attack samples, the current sample is regarded as noise and deleted, achieving the noise filtering effect. The filtered attack sample sets are named Sample_Analysis*, Sample_Shellcode*, Sample_Backdoor* and Sample_Worms*.
Step (5) feature selection:
Step (5.1) merging sample sets:
The filtered attack sample sets Sample_Analysis*, Sample_Shellcode*, Sample_Backdoor* and Sample_Worms* are combined with the original training set DS to form a new training set, which is more balanced than the original data set DS.
Step (5.2) feature selection:
The original UNSW_NB15 training set has 42 features, some of which have little influence on classification. We calculate the f_value of each of the 42 features, sort them in descending order and, after parameter tuning, select the features in the top 30% by importance, namely {proto, dttl, dloss, sinpkt, swin, stcpb, dtcpb, dwin, dmean, ct_state_ttl, ct_dst_ltm, ct_src_dport_ltm, is_sm_ips_ports}. Finally, we construct the new training set DS_new used for intrusion detection model training from these 13-dimensional features.
Step (6), training an intrusion detection model:
and training the intrusion detection model by using the obtained new training set. An intrusion detection model is constructed based on four common Machine learning algorithms of a Decision Tree (DT), a Random Forest (RF), a Support Vector Machine (SVM) and an Artificial Neural Network (ANN) and is used for evaluating the effectiveness of the method. IDS is an important tool to ensure network security, and it is necessary not only to accurately identify attacks, but also to avoid false positives. Therefore, we use Accuracy and F1The detection performance of the intrusion detection model is evaluated according to the value, and the calculation mode is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
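A sketch of this evaluation step with scikit-learn (the model hyperparameters are assumptions, not the patent's settings): the four detectors are trained on the oversampled training set DS_new and scored with Accuracy and F1 on the test set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate_ids_models(X_train, y_train, X_test, y_test):
    models = {
        "DT": DecisionTreeClassifier(),
        "RF": RandomForestClassifier(n_estimators=100),
        "SVM": SVC(),
        "ANN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {"Accuracy": accuracy_score(y_test, pred),
                         "F1": f1_score(y_test, pred)}   # binary attack/normal labels assumed
    return results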
tables 1 and 2 plot our method compared to the prior art method, where the second row represents the results of the experiment using the original training set. From table 1, it can be seen that Accuracy of our method achieves the best results on DT, SVM and ANN, and also approaches the optimal results on RF. From table 2, it can be seen that the F1 values of our method have the best effect on all four intrusion detection models. From the overall result, after the original training set is oversampled by adopting the method, the intrusion detection model based on the SVM obtains the optimal detection performance. Therefore, the intrusion detection performance can be effectively improved by adopting the oversampling method for improving the intrusion detection performance based on the generation countermeasure network and the k-nearest neighbor algorithm.
TABLE 1 intrusion detection results-Accuracy after processing based on different oversampling methods
TABLE 2 intrusion detection results after processing based on different oversampling methods-F1 values

Claims (2)

1. An oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm, characterized in that it comprises the following steps:
Step (1), data preprocessing: intrusion detection data instances contain character-type features and numerical-type features; the character-type features are converted to numerical form to be suitable for training, specifically by mapping each character-type feature to an integer value in [0, S-1], where S is the number of feature values; after numericalization, all features in the data set are numerical; to eliminate dimensional effects between indicators, all feature values are scaled to the [0,1] interval by the following formula,
x' = (x - x_min) / (x_max - x_min)
where x is the feature value before normalization, x' is the feature value after normalization, x_max is the maximum value of the corresponding feature, and x_min is the minimum value of the corresponding feature; after data numericalization and normalization, the corresponding attack subset, named DS_Attack, is extracted according to the attack type of the samples to be generated;
Step (2), constructing a generative model for each attack separately: the attack sample generation model is designed based on WGAN-GP and consists of a generator and a discriminator, the generator being defined as G and the discriminator as D; G and D are both feedforward neural networks; G comprises, in order, an input layer, 4 hidden layers and an output layer; the samples obtained at the output layer are the generated samples, the number of neurons in the output layer is the same as the dimension of the preprocessed data, its activation function is Linear, and the activation functions of the remaining layers are ReLu; D comprises an input layer, a hidden layer and an output layer, the output of D being used to judge whether a sample is a real sample or a generated sample; the number of neurons in its output layer is set to 1, its activation function is Linear, and the activation functions of the remaining layers are ReLu;
Step (3), inputting the original training set preprocessed in step (1) into the trained generator for the corresponding attack to generate the corresponding attack sample set Sample_Attack;
Step (4), filtering noise data in the generated attack sample set with the k-nearest neighbor algorithm: when more than half of a generated sample's nearest neighbors are non-attack samples, the current sample is regarded as noise and deleted from the set, with the value of k set between 3 and 5;
Step (5), selecting features using analysis of variance: specifically, the noise-filtered attack sample set is merged with the original training set, all features are ranked by importance, and unnecessary features are removed to finally obtain the new training set DS_new used for intrusion detection model training.
2. The oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm according to claim 1, wherein the initialization and training process of the generative model described in step (2) is specifically as follows:
the first step, initializing the network structure of the generator and the discriminator according to the design, and defining a noise distribution which follows normal distribution;
second step, preparing real data and noise data, wherein the real data are the attack subset DS_Attack obtained in step (1), and the noise Noise_Attack is drawn from the noise distribution with the same number of samples as DS_Attack;
third step, fixing the generator and training the discriminator: the noise is passed through the generator to produce an equal number of generated samples Sample_Attack, and DS_Attack and Sample_Attack are used to train the discriminator to distinguish whether the data come from DS_Attack or from Sample_Attack;
fourth step, fixing the discriminator and training the generator: the discriminator trained for k rounds as in the third step is used to train the generator, so that the discriminator cannot, as far as possible, distinguish whether the data come from DS_Attack or from Sample_Attack;
the generator and the discriminator are updated iteratively many times according to the third and fourth steps, and training finally ends when the discriminator cannot distinguish whether the data are real training samples or samples produced by the generator.
CN202111409785.4A 2021-11-24 Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm Active CN114091661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111409785.4A CN114091661B (en) 2021-11-24 Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111409785.4A CN114091661B (en) 2021-11-24 Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm

Publications (2)

Publication Number Publication Date
CN114091661A true CN114091661A (en) 2022-02-25
CN114091661B CN114091661B (en) 2024-06-04


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115021965A (en) * 2022-05-06 2022-09-06 中南民族大学 Method and system for generating attack data of intrusion detection system based on generating type countermeasure network
CN116170237A (en) * 2023-04-25 2023-05-26 南京众智维信息科技有限公司 Intrusion detection method fusing GNN and ACGAN
CN117100293A (en) * 2023-10-25 2023-11-24 武汉理工大学 Muscle fatigue detection method and system based on multidimensional feature fusion network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188212A1 (en) * 2016-07-27 2019-06-20 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN112491797A (en) * 2020-10-28 2021-03-12 北京工业大学 Intrusion detection method and system based on unbalanced industrial control data set
CN113395280A (en) * 2021-06-11 2021-09-14 成都为辰信息科技有限公司 Anti-confusion network intrusion detection method based on generation of countermeasure network
CN113438239A (en) * 2021-06-25 2021-09-24 杭州电子科技大学 Network attack detection method and device based on depth k nearest neighbor
CN113569920A (en) * 2021-07-06 2021-10-29 上海顿飞信息科技有限公司 Second neighbor anomaly detection method based on automatic coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAODONG LIU等: "A GAN and Feature Selection-Based Oversampling Technique for Intrusion Detection", 《HINDAWI》, 6 July 2021 (2021-07-06), pages 1 - 15 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115021965A (en) * 2022-05-06 2022-09-06 中南民族大学 Method and system for generating attack data of intrusion detection system based on generating type countermeasure network
CN115021965B (en) * 2022-05-06 2024-04-02 中南民族大学 Method and system for generating attack data of intrusion detection system based on generation type countermeasure network
CN116170237A (en) * 2023-04-25 2023-05-26 南京众智维信息科技有限公司 Intrusion detection method fusing GNN and ACGAN
CN116170237B (en) * 2023-04-25 2023-07-25 南京众智维信息科技有限公司 Intrusion detection method fusing GNN and ACGAN
CN117100293A (en) * 2023-10-25 2023-11-24 武汉理工大学 Muscle fatigue detection method and system based on multidimensional feature fusion network
CN117100293B (en) * 2023-10-25 2024-02-06 武汉理工大学 Muscle fatigue detection method and system based on multidimensional feature fusion network

Similar Documents

Publication Publication Date Title
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
Zhong et al. Applying big data based deep learning system to intrusion detection
Xing et al. Medical health big data classification based on KNN classification algorithm
CN111211994B (en) Network traffic classification method based on SOM and K-means fusion algorithm
CN112165464B (en) Industrial control hybrid intrusion detection method based on deep learning
CN111785329B (en) Single-cell RNA sequencing clustering method based on countermeasure automatic encoder
CN110266672B (en) Network intrusion detection method based on information entropy and confidence degree downsampling
CN112087447B (en) Rare attack-oriented network intrusion detection method
CN110213222A (en) Network inbreak detection method based on machine learning
CN110348486A (en) Based on sampling and feature brief non-equilibrium data collection conversion method and system
CN114492768B (en) Twin capsule network intrusion detection method based on small sample learning
CN111556016B (en) Network flow abnormal behavior identification method based on automatic encoder
CN111143838A (en) Database user abnormal behavior detection method
CN113378160A (en) Graph neural network model defense method and device based on generative confrontation network
CN113901448A (en) Intrusion detection method based on convolutional neural network and lightweight gradient elevator
CN110177112B (en) Network intrusion detection method based on double subspace sampling and confidence offset
Ai-jun et al. Research on unbalanced data processing algorithm base tomeklinks-smote
Zhuang et al. A handwritten Chinese character recognition based on convolutional neural network and median filtering
CN111737688B (en) Attack defense system based on user portrait
Bui et al. A clustering-based shrink autoencoder for detecting anomalies in intrusion detection systems
CN114091661B (en) Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm
CN114091661A (en) Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm
CN116051924A (en) Divide-and-conquer defense method for image countermeasure sample
CN108446740B (en) A kind of consistent Synergistic method of multilayer for brain image case history feature extraction
CN114124437A (en) Encrypted flow identification method based on prototype convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant