CN114091661A - Oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm - Google Patents

Oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm

Info

Publication number
CN114091661A
CN114091661A CN202111409785.4A
Authority
CN
China
Prior art keywords
attack
sample
data
training
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111409785.4A
Other languages
Chinese (zh)
Other versions
CN114091661B (en
Inventor
李童
刘晓东
张润滋
杨震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Nsfocus Technologies Group Co Ltd
Original Assignee
Beijing University of Technology
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology, Nsfocus Technologies Group Co Ltd filed Critical Beijing University of Technology
Priority to CN202111409785.4A priority Critical patent/CN114091661B/en
Priority claimed from CN202111409785.4A external-priority patent/CN114091661B/en
Publication of CN114091661A publication Critical patent/CN114091661A/en
Application granted granted Critical
Publication of CN114091661B publication Critical patent/CN114091661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm, which comprises the following steps: numericalizing and normalizing the original data; constructing a generative model based on WGAN-GP and training it with minority-class attack samples and random noise, so that the generator models the attack distribution and can generate attack samples; filtering noise in the generated attack samples with the k-nearest neighbor algorithm; and finally, ranking the importance of the data's field attributes with analysis of variance, performing feature selection according to the ranking, and removing unnecessary features to obtain the oversampled training set. Using the oversampled training set produced by the invention can effectively improve the performance of an intrusion detection model.

Description

Oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm
Technical Field
The invention relates to an oversampling technique based on a generative adversarial network and the k-nearest neighbor algorithm, which is used to improve intrusion detection performance and belongs to the field of intrusion detection.
Background
Intrusion detection is an effective method for detecting and defending against network attacks: it monitors network traffic in real time, classifies network records as normal or malicious, and provides the necessary information to a defense system. With the arrival of the big data era, machine learning has developed rapidly and has become a widely used approach to intrusion detection. However, in real life attacks occur far less frequently than normal activity, so the data sets used to train machine learning models tend to be imbalanced, which degrades detection performance. Oversampling techniques are commonly used to address data set imbalance. Researchers have proposed the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling technique (ADASYN), which generate samples by interpolating between two instances of the same class. However, the complexity of network traffic makes class boundaries fuzzy, and interpolation may generate samples that cross those boundaries, increasing the confusion of decision boundaries. Furthermore, these methods only consider class labels and ignore the similarity of feature relationships, which increases the risk of generating noise.
The generative adversarial network (GAN) is a deep learning model that can model the complex, high-dimensional distributions of real-world data; its structure is shown in fig. 1. It is inspired by the two-player game in game theory and consists of a generator and a discriminator, both of which are neural networks. The generator captures the latent distribution of real data samples and produces new data; the discriminator judges whether its input is real data or generated data. The generator uses the discriminator as its loss function and updates its parameters so that the data it generates appears more realistic, while the discriminator updates its parameters to better separate generated data from real data. The two networks are trained iteratively so that the generator can produce near-real samples. Because samples drawn from the GAN-modeled data distribution are closer in their features to real data, researchers have applied GANs to intrusion detection to generate attack samples. However, GAN-based oversampling methods still risk generating noise.
The k-nearest neighbor algorithm (KNN) is a supervised machine learning algorithm that can be used for classification and regression. Its core idea is that the class of an unlabeled sample is determined by a vote of its k nearest neighbors. Specifically, for an unlabeled instance, the k instances closest to it are found in the training set, and the instance is assigned to the class to which the majority of those k instances belong. Based on this idea it can be used for noise filtering: for each generated attack sample, the k closest instances in the training set are found, and if most of them are non-attack samples, the generated sample is marked as noise.
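As a brief illustration of this majority-vote idea (toy data, not from the patent), the following sketch uses scikit-learn's KNeighborsClassifier, assuming binary labels where 0 denotes non-attack and 1 denotes attack:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two non-attack points (label 0) and three attack points (label 1).
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]])
y_train = np.array([0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # vote among the 3 closest training instances
knn.fit(X_train, y_train)
print(knn.predict([[0.15, 0.15]]))          # -> [0]: two of its three neighbors are class 0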
Analysis of variance (ANOVA) is a commonly used feature selection method that screens features by their variance. If a feature's variance is small, the samples show essentially no difference on that feature; most of its values may be identical, or even the entire feature may take a single value, and such a feature contributes nothing to discriminating samples. We therefore calculate an f-value for each feature based on ANOVA, and finally rank the features by importance to obtain the optimal subset.
Disclosure of Invention
In order to solve the problem of poor detection performance caused by the imbalance of training data sets for machine learning models, the invention aims to provide an oversampling method for improving intrusion detection performance: high-quality attack samples are generated by modeling the distribution of attack samples, the generated samples are noise-filtered using neighbor information, and the result is finally added to the original data set, improving the balance of the training set and thereby the intrusion detection performance.
In order to achieve this purpose, the technical scheme adopted by the invention is a noise-reducing oversampling method based on a generative adversarial network (GAN) and the k-nearest neighbor algorithm. As shown in fig. 2, the method comprises the following five steps:
data preprocessing: the fields of the original data contain several data types, such as character and numerical types, and the feature scales are inconsistent, so the data are numericalized and normalized, and attack samples are extracted for training the generative model;
constructing a generative model for each attack separately: a generative model is constructed based on WGAN-GP and trained with minority-class attack samples and random noise, so that the generator models the attack distribution and can be used to generate attack samples.
After training is completed, the preprocessed original training set is input into the generator for the corresponding attack to generate the corresponding attack sample set;
noise filtering: filtering noise data in the generated attack sample set by using a k-nearest neighbor algorithm;
feature selection: carrying out importance ranking on the field attributes based on analysis of variance (ANOVA), carrying out feature selection according to a ranking result, removing unnecessary features, and finally obtaining an oversampled training set;
furthermore, the intrusion detection model is trained by using the training set after oversampling, and the detection performance is further improved.
Further, the method comprises the following concrete implementation steps:
Step (1) data preprocessing:
The original training set DS contains several data types, such as character and numerical types, with inconsistent feature scales, and is therefore not directly suitable for training the GAN and the machine learning algorithms; it must be preprocessed.
Step (1.1) digitizing:
The original training set may contain character-type features, such as protocol, which are not suitable for training and therefore need to be digitized. In this step, each character-type feature is digitized by mapping it to an integer value in [0, S-1], where S is the number of distinct feature values. For example, the feature protocol contains the three values tcp, udp and icmp, so S is 3, the mapping interval is [0, 2], and tcp, udp and icmp are mapped to 0, 1 and 2, respectively.
Step (1.2) normalization:
the data obtained in the step (1.1) are numerical data, but different features often have different dimensions and dimension units, which affect the result of data analysis, and data normalization processing is required to eliminate the dimension effect between indexes. After the raw data are subjected to data normalization processing, all the characteristics are in the same order of magnitude, and the method is suitable for comprehensive comparison and evaluation. We scale all eigenvalues to the [0,1] interval by the following formula.
x' = (x - x_min) / (x_max - x_min)
Where x is the feature value before normalization and x' is the feature value after normalization; x_max is the maximum value of the corresponding feature and x_min is its minimum value.
Step (1.3) dividing subsets:
The intrusion detection training set contains multiple attack types; attacks of the same type are similar in distribution, and the GAN is used to generate each attack type separately. According to the attack type of the samples to be generated, the corresponding attack subset, named DS_Attack, is extracted (e.g., the subset containing only DoS attacks is DS_DoS).
Step (2) respectively constructing a generation model aiming at each attack:
in the step, the GAN is used for generating an attack sample, so that a training set is supplemented, and the balance of the training set is improved.
Step (2.1) constructing a GAN model:
the model is designed based on WGAN-GP and consists of a generator and a discriminator, wherein the generator is defined as G, and the discriminator is defined as D. G and D are both feedforward neural networks. G includes an input layer, a hidden layer, and an output layer. The sample obtained by the output layer is a generated sample, the generated sample needs to be the same as the original data dimension, so that the number of neurons of the output layer needs to be the same as the dimension of the preprocessed data, the activation function is Linear, and the activation functions of the rest layers are ReLu. D includes an input layer, a hidden layer, and an output layer. D distinguishes data by scoring, so the number of neurons in the output layer is set to 1 and the activation function is Linear. The remaining layer activation function is ReLu. The objective function is:
L_D = E_{x̃~p_G}[D(x̃)] - E_{x~p_data}[D(x)] + λ·E_{x̂~p_x̂}[(‖∇_{x̂}D(x̂)‖_2 - 1)^2]
where E(·) denotes the mathematical expectation under the corresponding distribution, D(·) is the output of the discriminator, x denotes a real sample, x̃ denotes a generated sample, and x̂ denotes a point obtained by random interpolation between a real sample and a generated sample; p_data is the distribution of real sample data, p_G is the distribution of generated samples, and p_x̂ is defined as the distribution of points sampled uniformly along straight lines between pairs drawn from p_data and p_G; λ is the gradient penalty coefficient and ∇ denotes the gradient.
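A minimal PyTorch sketch of the generator G and discriminator D described above, i.e. feedforward networks with ReLU hidden layers and a Linear output layer; the number and width of hidden layers are assumptions, not values taken from the patent:

import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, data_dim),      # linear output with the data dimension
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),             # single linear score for real vs generated
        )

    def forward(self, x):
        return self.net(x)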
Step (2.2) training the generative model:
and (4) according to the number n of the attack subsets obtained in the step (1.3), training n different generators in total.
First, the parameters of the two networks of generators and discriminators are initialized, and a noise distribution following a normal distribution is defined.
Second, real data and noise data are prepared. The real data are the subset DS_Attack obtained in step (1.3); the noise Noise_Attack is drawn from the noise distribution and has the same number of samples as DS_Attack.
Third, the generator is fixed and the discriminator is trained. The noise is passed through the generator to produce an equal number of generated samples Sample_Attack, and DS_Attack together with Sample_Attack is used to train the discriminator to distinguish whether a record comes from the real data DS_Attack or from the generated data Sample_Attack.
Fourth, the discriminator is fixed and the generator is trained. After the discriminator has been trained for k rounds as in the third step, it is used to train the generator, so that the discriminator becomes as unable as possible to tell whether the data come from DS_Attack or from Sample_Attack.
The generator and discriminator are updated iteratively according to the third and fourth steps. In the ideal case the final discriminator cannot distinguish whether a sample comes from the real training set or was produced by the trained generator, at which point its discrimination probability is 0.5.
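The alternating training of the third and fourth steps, together with the WGAN-GP gradient penalty, could be sketched as follows; the batch size, learning rate, λ = 10 and the critic/generator update ratio are assumptions, not values stated in the patent:

import torch

def gradient_penalty(D, real, fake, device):
    # Penalize the critic gradient norm at random interpolation points x_hat.
    eps = torch.rand(real.size(0), 1, device=device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

def train_wgan_gp(G, D, real_data, noise_dim, iters=1000, k_critic=5,
                  batch_size=64, lam=10.0, lr=1e-4, device="cpu"):
    real_data = real_data.to(device)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.9))
    for _ in range(iters):
        for _ in range(k_critic):                       # third step: fix G, train D
            idx = torch.randint(0, real_data.size(0), (batch_size,), device=device)
            real = real_data[idx]
            fake = G(torch.randn(batch_size, noise_dim, device=device)).detach()
            loss_d = (D(fake).mean() - D(real).mean()
                      + lam * gradient_penalty(D, real, fake, device))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        z = torch.randn(batch_size, noise_dim, device=device)   # fourth step: fix D, train G
        loss_g = -D(G(z)).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return G

In this sketch real_data is a float tensor holding the preprocessed DS_Attack records, and the trained G is afterwards sampled with fresh random noise to obtain Sample_Attack.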
Step (3) generating samples:
Through step (2.2), generators for the different attack types are obtained, and the corresponding attack sample sets Sample_Attack can be generated with them.
Step (4) noise filtering:
Noise is filtered using the k-nearest neighbor algorithm:
In the original data set the proportion of non-attack samples is larger than that of attack samples, so a large k value lets non-attack samples dominate and coarsens the noise filtering; the value of k is therefore set between 3 and 5. The k-nearest neighbor algorithm is applied to the generated sample set Sample_Attack: the Euclidean distance between each generated sample and the other samples is computed, and when more than half of the samples in its neighborhood are non-attack samples, the current sample is regarded as noise and deleted from the set.
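A sketch of this noise filter under the stated rule (the function name and array interface are assumptions): a generated attack sample is dropped when more than half of its k nearest neighbors in the original training set are non-attack samples.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_noise_filter(generated, train_X, train_is_attack, k=3):
    # generated: (m, d) generated attack samples; train_X: (n, d) original training
    # features; train_is_attack: (n,) boolean array, True for attack records.
    nn_index = NearestNeighbors(n_neighbors=k).fit(train_X)
    _, idx = nn_index.kneighbors(generated)             # Euclidean distance by default
    non_attack_votes = (~train_is_attack[idx]).sum(axis=1)
    keep = non_attack_votes <= k / 2                     # noise if the majority is non-attack
    return generated[keep]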
Step (5) feature selection:
Step (5.1) merging sample sets:
The filtered attack sample set is combined with the original training set to form a new training set, which is more balanced than the original data set DS.
Step (5.2) obtaining the feature subset:
Network traffic data are high-dimensional, and some features have little influence on classification. Feature selection reduces the number of features, avoids the curse of dimensionality, makes the model generalize better, and shortens model training time. The invention uses analysis of variance for feature selection, which applies an F-test to determine whether the means of certain groups differ, i.e., it statistically tests whether the means are equal. More specifically, for each feature x_i, the null hypothesis is that x_i has the same mean in the positive-class and negative-class samples, i.e. H_0: μ_S+ = μ_S-, where μ denotes the mean, S+ denotes the set of x_i values belonging to positive-class samples, and S- denotes the set of x_i values belonging to negative-class samples. Then f_value is calculated according to the following equation.
f_value = (S_A / (r - 1)) / (S_E / (n - r))
where S_A and S_E denote the between-group and within-group deviations respectively, n is the total number of samples, and r is the number of classes. Following the steps above, the f_value of each feature is calculated separately. Finally, the features are ranked by importance according to their f_value, a feature subset is constructed, and the new training set DS_new used for intrusion detection model training is obtained.
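A sketch of this ranking step, assuming scikit-learn's f_classif (which computes the same one-way ANOVA F statistic per feature) and an assumed keep ratio; the returned indices define the feature subset used to build DS_new:

import numpy as np
from sklearn.feature_selection import f_classif

def select_features_by_anova(X, y, keep_ratio=0.3):
    f_values, _ = f_classif(X, y)                 # one F value per feature column
    order = np.argsort(f_values)[::-1]            # descending importance
    n_keep = max(1, int(np.ceil(keep_ratio * X.shape[1])))
    selected = np.sort(order[:n_keep])            # indices of the retained features
    return X[:, selected], selected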
Further, the new training set obtained in step (5) is used to train the intrusion detection model, which finally improves its detection performance.
Advantageous effects
Existing oversampling techniques artificially increase the number of minority-class samples and thereby improve the balance of a data set; classical methods include random oversampling, the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The invention provides a noise-reducing oversampling method based on a generative adversarial network (GAN) and the k-nearest neighbor algorithm: attack samples are generated with a GAN, noise in the generated samples is filtered using neighbor information, and the data dimensionality is reduced through feature selection. Supplementing the original data set with the generated attack samples improves the balance of the data set and finally improves the performance of the intrusion detection system.
Drawings
FIG. 1 is a structural diagram of the generative adversarial network;
FIG. 2 is a flow chart of an oversampling method;
Detailed Description
The invention aims to provide an oversampling method for improving intrusion detection performance, which generates high-quality attack samples by modeling attack sample distribution, improves the balance of a training set and improves the intrusion detection performance.
In order to achieve the purpose, the invention adopts the technical scheme that the oversampling method for improving the intrusion detection performance based on the GAN comprises the following steps:
Step (1) data preprocessing:
In the specific implementation example we use UNSW_NB15 as the experimental data set. UNSW_NB15 is one of the data sets commonly used in the intrusion detection field and contains normal activity samples and nine different types of attack samples. Instances in the data set are represented by 42-dimensional features, which can be divided into character-type and numerical-type features according to their representation. Since character-type features cannot be used for GAN training, they must be converted to numerical values; moreover, inconsistent feature scales affect training time and convergence speed, so the data must also be normalized.
Step (1.1) digitizing the character-type features:
UNSW_NB15 contains three character-type features, namely protocol, service and state, which are not suitable for training and must each be converted to numerical form. In this step the character-type features are mapped to integer values in [0, S-1], where S is the number of distinct feature values. protocol, service and state contain 133, 13 and 11 different values respectively, and are therefore mapped as follows:
protocol—>[0-132]
service—>[0-12]
state—>[0-10]
After digitization, all 42 features in UNSW_NB15 are numerical.
Step (1.2) normalizing each feature:
the data obtained in the step (1.1) are numerical data, but different features often have different dimensions and dimension units, which affect the result of data analysis, and data normalization processing is required to eliminate the dimension effect between indexes. After the raw data are subjected to data normalization processing, all the characteristics are in the same order of magnitude, and the method is suitable for comprehensive comparison and evaluation. We scale all eigenvalues to the [0,1] interval by the following formula.
x' = (x - x_min) / (x_max - x_min)
Where x is the feature value before normalization and x' is the feature value after normalization; x_max is the maximum value of the corresponding feature and x_min is its minimum value.
Step (1.3) dividing subsets:
The UNSW_NB15 training set contains 9 attack types: Generic, Exploits, Fuzzers, Reconnaissance, DoS, Analysis, Shellcode, Backdoor and Worms. Attacks of the same type are similar in distribution, and we use the GAN to generate each attack type separately. We only generate the attack types with small sample counts, namely Analysis, Shellcode, Backdoor and Worms, and extract the corresponding attack samples from the original data set to form the four subsets DS_Analysis, DS_Shellcode, DS_Backdoor and DS_Worms.
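An illustrative pandas sketch of this subset extraction; the label column name "attack_cat" follows the public UNSW_NB15 field naming and is an assumption here, not taken from the patent text:

import pandas as pd

MINORITY_ATTACKS = ["Analysis", "Shellcode", "Backdoor", "Worms"]

def split_attack_subsets(train_df, label_col="attack_cat"):
    # One subset of feature rows per minority attack category.
    return {cat: train_df[train_df[label_col] == cat].drop(columns=[label_col])
            for cat in MINORITY_ATTACKS}

# subsets = split_attack_subsets(train_df)
# subsets["Analysis"] then plays the role of DS_Analysis, and so on.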
Step (2) respectively constructing a generation model aiming at each attack:
in the step, the GAN is used for generating an attack sample, so that a training set is supplemented, and the balance of the training set is improved.
Step (2.1) constructing a GAN model:
a model for generating an attack sample is constructed by adopting a WGAN-GP framework, the model consists of a generator and a discriminator, the generator is defined as G, and the discriminator is defined as D. G and D are both feedforward neural networks.
The G comprises an input layer, an output layer and a hidden layer, wherein the activation function of the output layer is Linear, and the activation function of the hidden layer is ReLu. The number of neurons in the input layer is equal to the dimension of noise, and the number of neurons in the output layer is equal to the dimension of a real sample.
D comprises an input layer, an output layer and a hidden layer; the number of neurons in its input layer equals the dimension of a real sample and the number of neurons in its output layer is 1. The output layer activation function is Linear and the hidden layer activation function is ReLu. The objective function is:
L_D = E_{x̃~p_G}[D(x̃)] - E_{x~p_data}[D(x)] + λ·E_{x̂~p_x̂}[(‖∇_{x̂}D(x̂)‖_2 - 1)^2]
where E(·) denotes the mathematical expectation under the corresponding distribution, D(·) is the output of the discriminator, x denotes a real sample, x̃ denotes a generated sample, and x̂ denotes a point obtained by random interpolation between a real sample and a generated sample; p_data is the distribution of real sample data, p_G is the distribution of generated samples, and p_x̂ is defined as the distribution of points sampled uniformly along straight lines between pairs drawn from p_data and p_G; λ is the gradient penalty coefficient and ∇ denotes the gradient.
Step (2.2) training a model:
According to the four subsets obtained in step (1.3), four different generators need to be trained in this step.
First, the parameters of the two networks of generators and discriminators are initialized, and a noise distribution following a normal distribution is defined.
Second, real data and noise data are prepared. The real data are the attack subsets DS_Analysis, DS_Shellcode, DS_Backdoor and DS_Worms obtained in step (1.3). Taking the subset DS_Analysis first, attack noise Noise_Analysis is drawn from the noise distribution with the same number of samples as DS_Analysis.
Third, the generator is fixed and the discriminator is trained. The noise is passed through the generator to produce an equal number of generated samples Sample_Analysis, and DS_Analysis together with Sample_Analysis is used to train the discriminator to distinguish whether a record comes from the real data DS_Analysis or from the generated data Sample_Analysis.
Fourth, the discriminator is fixed and the generator is trained. The discriminator obtained after 100 rounds of the third step is used to train the generator, so that the discriminator becomes as unable as possible to tell whether the data come from DS_Analysis or from Sample_Analysis. The third and fourth steps are iterated 1000 times to obtain the final attack sample generator.
Fifth, each attack subset is used in turn to train a generative adversarial network according to the steps above, finally yielding four different attack sample generators.
Step (3) generating samples:
Using the four generators obtained in step (2.2), we generate the corresponding attack sample sets Sample_Analysis, Sample_Shellcode, Sample_Backdoor and Sample_Worms.
Step (4) noise filtering:
Noise is filtered with the k-nearest neighbor algorithm. Because non-attack samples make up a large proportion of the original data set, a large k value lets them dominate and coarsens the noise filtering, so k is set to 3. The k-nearest neighbor algorithm is applied to the attack samples generated in step (3): when more than half of a generated sample's nearest neighbors are non-attack samples, the current sample is regarded as noise and deleted, achieving the noise filtering effect. The filtered attack sample sets are named Sample_Analysis*, Sample_Shellcode*, Sample_Backdoor* and Sample_Worms*.
Step (5) feature selection:
Step (5.1) merging sample sets:
The filtered attack sample sets Sample_Analysis*, Sample_Shellcode*, Sample_Backdoor* and Sample_Worms* are combined with the original training set DS to form a new training set, which is more balanced than the original data set DS.
Step (5.2) feature selection:
The original UNSW_NB15 training set has 42 features, some of which have little influence on classification. We calculate the f_value of each of the 42 features, sort them in descending order and, after parameter tuning, select the features in the top 30% by importance, namely {proto, dttl, dloss, sinpkt, swin, stcpb, dtcpb, dwin, dmean, ct_state_ttl, ct_dst_ltm, ct_src_dport_ltm, is_sm_ips_ports}. Finally, we construct the new training set DS_new used for intrusion detection model training from these 13-dimensional features.
Step (6), training an intrusion detection model:
and training the intrusion detection model by using the obtained new training set. An intrusion detection model is constructed based on four common Machine learning algorithms of a Decision Tree (DT), a Random Forest (RF), a Support Vector Machine (SVM) and an Artificial Neural Network (ANN) and is used for evaluating the effectiveness of the method. IDS is an important tool to ensure network security, and it is necessary not only to accurately identify attacks, but also to avoid false positives. Therefore, we use Accuracy and F1The detection performance of the intrusion detection model is evaluated according to the value, and the calculation mode is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
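A sketch of this evaluation step with scikit-learn (the model hyperparameters are assumptions, not the patent's settings): the four detectors are trained on the oversampled training set DS_new and scored with Accuracy and F1 on the test set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate_ids_models(X_train, y_train, X_test, y_test):
    models = {
        "DT": DecisionTreeClassifier(),
        "RF": RandomForestClassifier(n_estimators=100),
        "SVM": SVC(),
        "ANN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {"Accuracy": accuracy_score(y_test, pred),
                         "F1": f1_score(y_test, pred)}   # binary attack/normal labels assumed
    return results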
tables 1 and 2 plot our method compared to the prior art method, where the second row represents the results of the experiment using the original training set. From table 1, it can be seen that Accuracy of our method achieves the best results on DT, SVM and ANN, and also approaches the optimal results on RF. From table 2, it can be seen that the F1 values of our method have the best effect on all four intrusion detection models. From the overall result, after the original training set is oversampled by adopting the method, the intrusion detection model based on the SVM obtains the optimal detection performance. Therefore, the intrusion detection performance can be effectively improved by adopting the oversampling method for improving the intrusion detection performance based on the generation countermeasure network and the k-nearest neighbor algorithm.
TABLE 1 intrusion detection results-Accuracy after processing based on different oversampling methods
TABLE 2 intrusion detection results after processing based on different oversampling methods-F1 values

Claims (2)

1. An oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm, characterized in that it comprises the following steps:
Step (1), data preprocessing: intrusion detection data instances contain character-type features and numerical-type features; the character-type features are converted to numerical form to be suitable for training, specifically by mapping each character-type feature to an integer value in [0, S-1], where S is the number of feature values; after numericalization, all features in the data set are numerical; to eliminate dimensional effects between indicators, all feature values are scaled to the [0,1] interval by the following formula,
x' = (x - x_min) / (x_max - x_min)
where x is the feature value before normalization, x' is the feature value after normalization, x_max is the maximum value of the corresponding feature, and x_min is the minimum value of the corresponding feature; after data numericalization and normalization, the corresponding attack subset, named DS_Attack, is extracted according to the attack type of the samples to be generated;
Step (2), constructing a generative model for each attack separately: the attack sample generation model is designed based on WGAN-GP and consists of a generator and a discriminator, the generator being defined as G and the discriminator as D; G and D are both feedforward neural networks; G comprises, in order, an input layer, 4 hidden layers and an output layer; the samples obtained at the output layer are the generated samples, the number of neurons in the output layer is the same as the dimension of the preprocessed data, its activation function is Linear, and the activation functions of the remaining layers are ReLu; D comprises an input layer, a hidden layer and an output layer, the output of D being used to judge whether a sample is a real sample or a generated sample; the number of neurons in its output layer is set to 1, its activation function is Linear, and the activation functions of the remaining layers are ReLu;
Step (3), inputting the original training set preprocessed in step (1) into the trained generator for the corresponding attack to generate the corresponding attack sample set Sample_Attack;
Step (4), filtering noise data in the generated attack sample set with the k-nearest neighbor algorithm: when more than half of a generated sample's nearest neighbors are non-attack samples, the current sample is regarded as noise and deleted from the set, with the value of k set between 3 and 5;
Step (5), selecting features using analysis of variance: specifically, the noise-filtered attack sample set is merged with the original training set, all features are ranked by importance, and unnecessary features are removed to finally obtain the new training set DS_new used for intrusion detection model training.
2. The oversampling method for improving intrusion detection performance based on a generative adversarial network and the k-nearest neighbor algorithm according to claim 1, wherein the initialization and training process of the generative model described in step (2) is specifically as follows:
the first step, initializing the network structure of the generator and the discriminator according to the design, and defining a noise distribution which follows normal distribution;
second step, preparing real data and noise data, wherein the real data are the attack subset DS_Attack obtained in step (1), and the noise Noise_Attack is drawn from the noise distribution with the same number of samples as DS_Attack;
third step, fixing the generator and training the discriminator: the noise is passed through the generator to produce an equal number of generated samples Sample_Attack, and DS_Attack and Sample_Attack are used to train the discriminator to distinguish whether the data come from DS_Attack or from Sample_Attack;
fourth step, fixing the discriminator and training the generator: the discriminator trained for k rounds as in the third step is used to train the generator, so that the discriminator cannot, as far as possible, distinguish whether the data come from DS_Attack or from Sample_Attack;
the generator and the discriminator are updated iteratively many times according to the third and fourth steps, and training finally ends when the discriminator cannot distinguish whether the data are real training samples or samples produced by the generator.
CN202111409785.4A 2021-11-24 Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm Active CN114091661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111409785.4A CN114091661B (en) 2021-11-24 Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111409785.4A CN114091661B (en) 2021-11-24 Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm

Publications (2)

Publication Number Publication Date
CN114091661A true CN114091661A (en) 2022-02-25
CN114091661B CN114091661B (en) 2024-06-04


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115021965A (en) * 2022-05-06 2022-09-06 中南民族大学 Method and system for generating attack data of intrusion detection system based on generating type countermeasure network
CN116170237A (en) * 2023-04-25 2023-05-26 南京众智维信息科技有限公司 Intrusion detection method fusing GNN and ACGAN
CN117100293A (en) * 2023-10-25 2023-11-24 武汉理工大学 Muscle fatigue detection method and system based on multidimensional feature fusion network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188212A1 (en) * 2016-07-27 2019-06-20 Anomalee Inc. Prioritized detection and classification of clusters of anomalous samples on high-dimensional continuous and mixed discrete/continuous feature spaces
CN112491797A (en) * 2020-10-28 2021-03-12 北京工业大学 Intrusion detection method and system based on unbalanced industrial control data set
CN113395280A (en) * 2021-06-11 2021-09-14 成都为辰信息科技有限公司 Anti-confusion network intrusion detection method based on generation of countermeasure network
CN113438239A (en) * 2021-06-25 2021-09-24 杭州电子科技大学 Network attack detection method and device based on depth k nearest neighbor
CN113569920A (en) * 2021-07-06 2021-10-29 上海顿飞信息科技有限公司 Second neighbor anomaly detection method based on automatic coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAODONG LIU等: "A GAN and Feature Selection-Based Oversampling Technique for Intrusion Detection", 《HINDAWI》, 6 July 2021 (2021-07-06), pages 1 - 15 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115021965A (en) * 2022-05-06 2022-09-06 中南民族大学 Method and system for generating attack data of intrusion detection system based on generating type countermeasure network
CN115021965B (en) * 2022-05-06 2024-04-02 中南民族大学 Method and system for generating attack data of intrusion detection system based on generation type countermeasure network
CN116170237A (en) * 2023-04-25 2023-05-26 南京众智维信息科技有限公司 Intrusion detection method fusing GNN and ACGAN
CN116170237B (en) * 2023-04-25 2023-07-25 南京众智维信息科技有限公司 Intrusion detection method fusing GNN and ACGAN
CN117100293A (en) * 2023-10-25 2023-11-24 武汉理工大学 Muscle fatigue detection method and system based on multidimensional feature fusion network
CN117100293B (en) * 2023-10-25 2024-02-06 武汉理工大学 Muscle fatigue detection method and system based on multidimensional feature fusion network

Similar Documents

Publication Publication Date Title
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
Zhong et al. Applying big data based deep learning system to intrusion detection
Xing et al. Medical health big data classification based on KNN classification algorithm
CN111211994B (en) Network traffic classification method based on SOM and K-means fusion algorithm
CN112165464B (en) Industrial control hybrid intrusion detection method based on deep learning
CN111785329B (en) Single-cell RNA sequencing clustering method based on countermeasure automatic encoder
CN110266672B (en) Network intrusion detection method based on information entropy and confidence degree downsampling
CN112087447B (en) Rare attack-oriented network intrusion detection method
CN110213222A (en) Network inbreak detection method based on machine learning
CN110348486A (en) Based on sampling and feature brief non-equilibrium data collection conversion method and system
CN114492768B (en) Twin capsule network intrusion detection method based on small sample learning
CN111556016B (en) Network flow abnormal behavior identification method based on automatic encoder
CN111143838A (en) Database user abnormal behavior detection method
CN113378160A (en) Graph neural network model defense method and device based on generative confrontation network
CN113901448A (en) Intrusion detection method based on convolutional neural network and lightweight gradient elevator
CN110177112B (en) Network intrusion detection method based on double subspace sampling and confidence offset
Ai-jun et al. Research on unbalanced data processing algorithm base tomeklinks-smote
Zhuang et al. A handwritten Chinese character recognition based on convolutional neural network and median filtering
CN111737688B (en) Attack defense system based on user portrait
Bui et al. A clustering-based shrink autoencoder for detecting anomalies in intrusion detection systems
CN114091661B (en) Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm
CN114091661A (en) Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm
CN116051924A (en) Divide-and-conquer defense method for image countermeasure sample
CN108446740B (en) A kind of consistent Synergistic method of multilayer for brain image case history feature extraction
CN114124437A (en) Encrypted flow identification method based on prototype convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant