CN115021965A

CN115021965A - Method and system for generating attack data for an intrusion detection system based on a generative adversarial network

Info

Publication number: CN115021965A
Application number: CN202210485160.4A
Authority: CN
Inventors: 孟博; 杨杰; 王德军; 魏增颂
Original assignee: Shijiazhuang Citic Youlian Software Co ltd; South Central University for Nationalities
Current assignee: Shijiazhuang Citic Youlian Software Co ltd; South Central Minzu University
Priority date: 2022-05-06
Filing date: 2022-05-06
Publication date: 2022-09-06
Anticipated expiration: 2042-05-06
Also published as: CN115021965B

Abstract

The invention provides a method and a system for generating attack data for an intrusion detection system based on a generative adversarial network. Feature analysis is first performed on the acquired data traffic, feature screening is then performed with a random forest algorithm, and the data set is then preprocessed by removing zero values and null values and uniformly sampling the various kinds of attack data. The constructed generative adversarial network model comprises a generator, a converter and a discriminator. The generator takes random noise as input and generates new data samples through a multilayer neural network. The converter combines the non-attack features of the generated data samples with the attack features of attack behavior data samples to form new attack samples, which are passed to the discriminator. The discriminator is trained jointly on the real data and the data samples produced by the converter, and the training result parameters are passed back to the generator for iterative training. In addition, the attack performance of the attack samples is evaluated by running them through an intrusion detection system based on a deep belief network.

Description

Method and system for generating attack data for an intrusion detection system based on a generative adversarial network

Technical Field

The invention relates to the technical field of information security, and in particular to a method and a system for generating attack data for an intrusion detection system based on a generative adversarial network.

Background

Intrusion detection technology, as an active security defense, has been widely researched. In particular, with the development of machine learning and deep learning algorithms, detection algorithms have become more varied. Attacks against deep-learning-based intrusion detection systems have also been studied. The network-based intrusion detection system (IDS) is an important branch of intrusion detection: it monitors the network, collects information from data packets, and observes and analyzes real-time network traffic to detect intrusion behavior in the network.

Since the concept of deep learning was proposed, constructing nonlinear network structures composed of multiple hidden layers to perform data classification has become a major trend. "Depth" refers to the number of hidden layers in a neural network: a traditional neural network contains only 2-3 hidden layers, whereas a deep learning network can contain up to 150. The layers operate in sequence and are connected to one another, with each layer receiving the output of the previous layer as input. For example, the autoencoder consists of an encoder and a decoder that produces a reconstruction; it can represent both linear and nonlinear transformations and is widely used for dimensionality reduction in the intrusion detection field. The deep belief network is a directed deep neural network consisting of several layers of restricted Boltzmann machines (RBMs) and one back-propagation (BP) layer. Features are extracted by the hidden layers so that the training data of each later layer is more representative, which addresses the detection of complex high-dimensional data, and the deep belief network has already been applied in the field of intrusion detection.

Intrusion detection algorithms are varied, and they have improved the detection efficiency and accuracy of detection systems. However, research on ensuring the security and reliability of these systems is lacking. At present, methods for generating attack traffic against network-based intrusion detection systems require many iterations, have low computational efficiency, and take a long time to generate perturbations.

Disclosure of Invention

The invention provides a method and a system for generating attack data for an intrusion detection system based on a generative adversarial network, which solve, or at least partially solve, the technical problems in the prior art that attack data is generated inefficiently and with poor effect.

In order to solve the above technical problem, a first aspect of the present invention provides a method for generating attack data for an intrusion detection system based on a generative adversarial network, including:

S1: acquiring data traffic, wherein the data traffic comprises normal network behavior data traffic and attack behavior data traffic;

S2: performing feature analysis on the acquired data traffic with a traffic analysis tool to obtain a related data set, wherein the related data set comprises normal network behavior data samples and attack behavior data samples, and both kinds of samples contain attack features and non-attack features;

S3: performing feature screening with a random forest algorithm, marking the attack features and non-attack features of the data samples in the related data set, and then preprocessing the feature-marked data set;

S4: constructing a generative adversarial network model comprising a generator, a converter and a discriminator, wherein the generator learns the feature distribution of the normal network behavior data samples and generates attack data samples; the converter combines the non-attack features contained in the generated attack data samples with the attack features contained in the attack behavior data samples to form new attack data samples; the discriminator is a binary classifier that is trained jointly on the normal network behavior data samples in the related data set and the new attack data samples produced by the converter, and judges whether an input data sample is a real data sample or a generated data sample; the training result parameters are then passed to the generator for iterative training, yielding the trained generative adversarial network model;

S5: generating target attack data with the trained generative adversarial network model.

In one embodiment, after step S5, the method further comprises:

setting up an intrusion detection system based on a deep belief network, and using it to detect the attack performance of the generated target attack data.

In one embodiment, performing feature screening with a random forest algorithm in step S3 to mark the attack features and non-attack features of the data samples in the related data set includes:

performing feature screening with the random forest algorithm, marking the features whose importance ranking meets a preset condition as attack features, and marking the remaining features as non-attack features.

In one embodiment, preprocessing the feature-marked data set in step S3 includes:

clearing abnormal data from the feature-marked data set, deleting data containing infinite values and null values, and converting date values into timestamps.
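The preprocessing described above can be illustrated with a minimal sketch. The function name `preprocess`, the column names, and the date format `%d/%m/%Y %H:%M:%S` are assumptions made for illustration only; the source does not specify them.

```python
import math
from datetime import datetime, timezone


def preprocess(rows, date_key="Timestamp", date_format="%d/%m/%Y %H:%M:%S"):
    """Drop rows with null, NaN or infinite values; convert the date field to a Unix timestamp.

    The column names and the date format are illustrative assumptions.
    """
    cleaned = []
    for row in rows:
        values = [v for k, v in row.items() if k != date_key]
        if row.get(date_key) is None or any(v is None for v in values):
            continue  # row contains a null value
        if any(isinstance(v, float) and not math.isfinite(v) for v in values):
            continue  # row contains NaN or an infinite value
        parsed = datetime.strptime(row[date_key], date_format).replace(tzinfo=timezone.utc)
        new_row = dict(row)
        new_row[date_key] = parsed.timestamp()  # date value -> timestamp
        cleaned.append(new_row)
    return cleaned


# Toy records standing in for extracted flow features.
rows = [
    {"Timestamp": "01/01/2018 00:00:10", "Flow Duration": 120.0, "Flow Bytes/s": 3.5},
    {"Timestamp": "01/01/2018 00:00:11", "Flow Duration": 80.0, "Flow Bytes/s": float("inf")},
    {"Timestamp": "01/01/2018 00:00:12", "Flow Duration": None, "Flow Bytes/s": 1.0},
]
cleaned = preprocess(rows)
```

Only the first toy record survives: the second contains an infinite value and the third a null value.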

In one embodiment, in step S4, the loss function used in the iterative training is:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y \rVert\big]$$

where $P_r$ is the probability distribution of the real data samples, $P_g$ is the probability distribution of the generated data samples, $W(P_r, P_g)$ is the Wasserstein distance between $P_r$ and $P_g$, and $\Pi(P_r, P_g)$ is the set of all possible joint distributions formed by combining $P_r$ and $P_g$. For each joint distribution $\gamma$ in this set, a pair of samples $x$ and $y$ is obtained by sampling from $\gamma$, $\lVert x - y \rVert$ is the distance between the samples, $\mathbb{E}_{(x,y)\sim\gamma}[\lVert x - y \rVert]$ is the expected value of that distance under $\gamma$, and $\inf$ denotes the infimum (greatest lower bound) of the expected value over all such joint distributions.

Based on the same inventive concept, a second aspect of the present invention provides a system for generating attack data for an intrusion detection system based on a generative adversarial network, comprising:

a data traffic acquisition module, configured to acquire data traffic, wherein the data traffic comprises normal network behavior data traffic and attack behavior data traffic;

a feature analysis module, configured to perform feature analysis on the acquired data traffic with a traffic analysis tool to obtain a related data set, wherein the related data set comprises normal network behavior data samples and attack behavior data samples, and both kinds of samples contain attack features and non-attack features;

a feature screening and preprocessing module, configured to perform feature screening with a random forest algorithm, mark the attack features and non-attack features of the data samples in the related data set, and then preprocess the feature-marked data set;

a model construction and training module, configured to construct a generative adversarial network model comprising a generator, a converter and a discriminator, wherein the generator learns the feature distribution of the normal network behavior data samples and generates attack data samples; the converter combines the non-attack features contained in the generated attack data samples with the attack features contained in the attack behavior data samples to form new attack data samples; the discriminator is a binary classifier that is trained jointly on the normal network behavior data samples in the related data set and the new attack data samples produced by the converter, and judges whether an input data sample is a real data sample or a generated data sample; the training result parameters are then passed to the generator for iterative training, yielding the trained generative adversarial network model;

and an attack data generation module, configured to generate target attack data with the trained generative adversarial network model.

One or more technical solutions in the embodiments of the present application at least have one or more of the following technical effects:

The invention provides a method for generating attack data for an intrusion detection system based on a generative adversarial network. The data set is divided into normal network behavior data samples and attack behavior data samples. The normal network behavior data samples are used as the model input for training, while selected attack features of the attack behavior data samples are combined with the non-attack features of the generated attack data samples and do not directly take part in model training. Feature screening is performed with a random forest algorithm, and the features ranked highest in importance are identified as attack features. The non-attack features of the generated attack data samples are then combined with the attack features of the attack behavior data samples to form new attack data samples. On the one hand this preserves the attack capability of the attack samples, and on the other hand it reduces the time and space consumed by the model algorithm, thereby improving both the quality and the efficiency of attack data generation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a basic framework diagram of a generative adversarial network provided in an embodiment of the present invention;

Fig. 2 is an architecture diagram of the method for generating attack data for an intrusion detection system based on a generative adversarial network provided in an embodiment of the present invention;

Fig. 3 is a flowchart of the method for generating attack data for an intrusion detection system based on a generative adversarial network according to an embodiment of the present invention.

Detailed Description

The invention provides a method and a system for generating attack data for an intrusion detection system based on a generative adversarial network.

In order to achieve the above object, the main concept of the present invention is as follows:

First, data traffic is acquired and feature analysis is performed on it with a traffic analysis tool. Feature screening is then performed with a random forest algorithm, the attack features and non-attack features of the data samples in the related data set are marked, and the data set is divided into normal network behavior data samples and attack behavior data samples. A generative adversarial network model is then constructed: the generator learns the feature distribution of the normal network behavior data samples and generates attack data samples; the converter combines the non-attack features of the generated attack data samples with the attack features of the attack behavior data samples to form new attack data samples; the discriminator is trained jointly on the normal network behavior data samples in the data set and the new attack data samples produced by the converter, and the training result parameters are passed to the generator for iterative training. Finally, target attack data is generated by the trained generative adversarial network model.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The embodiment of the invention provides a method for generating attack data for an intrusion detection system based on a generative adversarial network, which comprises the following steps:

S4: constructing a generative adversarial network model, wherein the generator learns the feature distribution of the normal network behavior data samples and generates attack data samples; the converter combines the non-attack features contained in the generated attack data samples with the attack features contained in the attack behavior data samples to form new attack data samples; the discriminator is a binary classifier that is trained jointly on the normal network behavior data samples in the related data set and the new attack data samples produced by the converter, and judges whether an input data sample is a real data sample or a generated data sample; the training result parameters are then passed to the generator for iterative training, and finally target attack data is generated by the trained generative adversarial network model.

Specifically, the generative adversarial network (GAN) is based on the idea of a two-player zero-sum game and has been a research hotspot since it was proposed. Through research in recent years, GANs have been applied in many fields. The method uses a GAN to generate attack data; in a specific application, the generated attack data deceives an intrusion detection system that uses deep learning, so the method is an effective attack scheme targeting the weaknesses of deep-learning-based intrusion detection systems. To construct an attack that bypasses detection by a deep-learning-based intrusion detection system, the first problem to be solved is how to generate data traffic that can bypass that detection. An intrusion detection system based on a deep belief network uses a deep neural network for feature extraction, perception and learning, so the generated data traffic must conform to the traffic features and probability distribution learned by such a network.

Secondly, ensuring that the generated data traffic both bypasses detection by the intrusion detection system and retains attack capability is a further technical problem to be solved. Otherwise, the generated data samples are merely ordinary data traffic and cannot attack the target server or the target user.

In order to solve the above problems, this embodiment provides a method for generating attack data for an intrusion detection system based on a generative adversarial network, which constructs a generator and a discriminator based on the zero-sum game idea of the generative adversarial network and trains them iteratively against each other. In addition, a converter is placed between the generator and the discriminator, so that the non-attack features of the generated data are retained and combined with the attack features of existing attack behavior data samples, yielding attack data with strong attack capability.

In a specific implementation, the data traffic in step S1 may be downloaded by the user. Step S3 performs feature screening with the random forest algorithm and preprocessing, and then separates the normal network behavior data samples from the attack behavior data samples; for example, the label of a normal network behavior data sample is Benign, and the label of an attack behavior data sample is the attack type of the data. The normal network behavior data samples are used as the model input and take part in model training. The attack features of the attack behavior data samples are combined with the non-attack features of the attack data samples produced by the generator during training, and do not directly take part in model training.

Referring to Figs. 1-2, Fig. 1 is a basic framework diagram of a generative adversarial network provided in an embodiment of the present invention. Fig. 2 is an architecture diagram of the method for generating attack data for an intrusion detection system based on a generative adversarial network according to an embodiment of the present invention.

The generative adversarial network model constructed in step S4 consists of a generator, a converter and a discriminator. The input of the generator is a one-dimensional random variable and its output is the learned traffic features. (The generative adversarial network learns the feature distribution of the normal network behavior data samples; the generated attack data samples contain attack features and non-attack features but generally have no attack capability, so the attack features they contain must be replaced with the attack features of attack behavior data to give them attack capability.) The converter combines the non-attack features of the attack data samples produced by the generator with the attack features of the attack traffic. The discriminator acts as a binary classifier that judges whether its input is real data features or generated sample features. The traffic features in the data set and the traffic features produced by the converter are trained on jointly, and the training result parameters are passed to the generator for iterative training.

In one embodiment, after step S4, the method further comprises:

Specifically, network intrusion detection based on deep learning is a current research hotspot. The deep belief network is composed of several RBMs and one BP neural network layer. Training proceeds mainly as follows: the RBMs are trained layer by layer; the hidden-layer vector of each lower layer is obtained by mapping its visible-layer vector and is then used as the visible-layer vector of the next layer; a BP neural network is added after the last RBM, with the output vector of the last RBM as its input vector. Against the deep belief network, a generative adversarial network can be constructed whose generator and discriminator are built from convolutional neural networks. A zero-sum game is played so that attack data with attack capability is finally generated, and this attack data can bypass detection by the intrusion detection system.

Feature screening is performed with a random forest algorithm; the features whose importance ranking meets the preset condition are marked as attack features, and the remaining features are marked as non-attack features.
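The layer-by-layer RBM training described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the detector used in the invention: it assumes binary inputs, uses one-step contrastive divergence (CD-1) as the RBM update rule, and omits the final BP layer. The class `RBM`, the function `pretrain_stack`, the layer sizes and the learning rate are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence (an assumed rule)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)  # reconstruction of the visible layer
        h1 = self.hidden_probs(v1)
        n = len(v0)
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)


def pretrain_stack(data, layer_sizes, epochs, rng):
    """Train RBMs layer by layer; each hidden output becomes the next visible input."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(layer_input)
        rbms.append(rbm)
        layer_input = rbm.hidden_probs(layer_input)
    return rbms, layer_input


rng = np.random.default_rng(0)
data = (rng.random((20, 8)) < 0.5).astype(float)  # toy binary samples
rbms, top_features = pretrain_stack(data, layer_sizes=[6, 4], epochs=5, rng=rng)
```

In a full deep belief network, `top_features` would be fed to a BP classifier that is then fine-tuned on labeled traffic.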

Specifically, the normal network behavior data samples and the attack behavior data samples both contain attack features and non-attack features, and the attack features and the non-attack features are divided by using a random forest algorithm.

In a specific implementation, a feature that meets the preset condition is a feature whose importance ranking reaches a preset level; the ranking indicates how important the feature is in determining whether a data sample has attack capability.
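The feature screening can be illustrated with a short sketch. It assumes scikit-learn's `RandomForestClassifier` as the random forest implementation and takes "the top `top_k` features by importance" as the preset condition; both are assumptions, since the source does not fix the library or the threshold. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def split_features(X, y, top_k, seed=0):
    """Rank features by random-forest importance; the top_k become 'attack' features.

    top_k stands in for the preset condition, which the source leaves unspecified.
    """
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]  # most important first
    attack_idx = np.sort(order[:top_k])
    non_attack_idx = np.sort(order[top_k:])
    return attack_idx, non_attack_idx


# Synthetic stand-in for flow features: only columns 0 and 1 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = attack, 0 = benign (toy labels)

attack_idx, non_attack_idx = split_features(X, y, top_k=2)
```

On this toy data the two informative columns are marked as attack features and the four noise columns as non-attack features.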

The loss function used in the iterative training is the Wasserstein distance:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y \rVert\big]$$

where $P_r$ is the probability distribution of the real data samples, $P_g$ is the probability distribution of the generated data samples, $W(P_r, P_g)$ is the Wasserstein distance between $P_r$ and $P_g$, and $\Pi(P_r, P_g)$ is the set of all possible joint distributions formed by combining $P_r$ and $P_g$. For each joint distribution $\gamma$ in this set, a pair of samples $x$ and $y$ is obtained by sampling from $\gamma$, $\lVert x - y \rVert$ is the distance between the samples, $\mathbb{E}_{(x,y)\sim\gamma}[\lVert x - y \rVert]$ is the expected value of that distance under $\gamma$, and $\inf$ denotes the infimum (greatest lower bound) of the expected value over all such joint distributions.
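As a small worked illustration of the Wasserstein distance, the following sketch computes it for two one-dimensional empirical samples of equal size. In that special case the infimum is attained by pairing the sorted values, so the distance is the mean absolute gap between them. The function name `wasserstein_1d` and the sample values are illustrative, and this closed form does not carry over to the high-dimensional traffic features used in training.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D empirical samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


real = [0.0, 1.0, 2.0, 3.0]        # stand-in for samples drawn from P_r
generated = [2.5, 0.5, 3.5, 1.5]   # stand-in for samples drawn from P_g
distance = wasserstein_1d(real, generated)
```

Here each sorted generated value sits 0.5 above its real counterpart, so the distance is 0.5, and it shrinks smoothly to 0 as the generated samples move onto the real ones.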

Specifically, $P_r$ is the probability distribution of the real data samples $x$, and $P_g$ is the probability distribution of the generated data samples. The generative adversarial network mainly learns a mapping from a random variable $z$ to the real data samples $x$, where $z$ follows a normal distribution. The generator realizes a differentiable function $g(z)$ with parameters $\theta_g$, whose outputs follow $P_g$. The discriminator realizes a function $f(x)$ with parameters $\theta_d$, which represents the probability that $x$ is real data, and the discriminator is trained to maximize the cost function. $L(f, \theta_d)$ is the cost function of the generative adversarial network:

$$L(f, \theta_d) = \mathbb{E}_{x \sim P_r}\big[\log f(x)\big] + \mathbb{E}_{x \sim P_g}\big[\log(1 - f(x))\big] = \int_x \Big(P_r(x)\log f(x) + P_g(x)\log\big(1 - f(x)\big)\Big)\,dx$$

where the integral is the same expectation written out over the probability densities of $P_r$ and $P_g$.

The optimal discriminator can be derived theoretically:

$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}$$

where $D^*(x)$ is the optimal discriminator function obtained when the generator is fixed, and $P_r(x)$ and $P_g(x)$ denote the probability densities of $P_r$ and $P_g$, respectively. The difference between $P_r$ and $P_g$ can be measured by the KL divergence. The JS divergence is a variant of the KL divergence, and $\mathrm{JSD}(P_r(x)\,\|\,P_g(x))$, the JS divergence between $P_r$ and $P_g$, can also be used to measure it. Substituting the optimal discriminator, the cost function $L(f, \theta_d)$ of the generative adversarial network reduces to a training criterion expressed in terms of the JS divergence:

$$L(f, \theta_d) = -2\log 2 + 2\,\mathrm{JSD}\big(P_r(x)\,\|\,P_g(x)\big)$$
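The relation between the cost function and the JS divergence can be checked numerically on a small discrete example. The sketch assumes the standard GAN cost in its discrete form, $\sum_x P_r(x)\log f(x) + P_g(x)\log(1 - f(x))$, and the standard optimal discriminator $D^*(x) = P_r(x)/(P_r(x) + P_g(x))$. The distributions `p_r` and `p_g` are arbitrary illustrative values.

```python
import math


def gan_cost(p_r, p_g, f):
    """Discrete form of L = sum_x P_r(x) log f(x) + P_g(x) log(1 - f(x))."""
    return sum(pr * math.log(fx) + pg * math.log(1.0 - fx)
               for pr, pg, fx in zip(p_r, p_g, f))


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p_r = [0.5, 0.3, 0.2]  # illustrative real distribution
p_g = [0.2, 0.2, 0.6]  # illustrative generated distribution

d_star = [pr / (pr + pg) for pr, pg in zip(p_r, p_g)]  # optimal discriminator
lhs = gan_cost(p_r, p_g, d_star)
rhs = -2 * math.log(2) + 2 * jsd(p_r, p_g)
```

The two sides agree to floating-point precision, and any other discriminator, such as a constant 0.5, gives a cost no larger than `lhs`.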

However, when the probability distributions $P_r$ and $P_g$ do not overlap, gradient descent cannot obtain useful gradient information between the two distributions. Therefore, WGAN uses the Wasserstein distance instead of the Jensen-Shannon divergence; that is, the final loss function adopted is the formula for the Wasserstein distance.

Through a continuous max-min game and continued optimization of the generator and the discriminator, the two modules finally reach a Nash equilibrium, at which the discriminator cannot tell whether the data produced by the generator is real sample data or generated sample data.

In order to ensure that the generated data traffic has attack capability, a converter is placed between the generator and the intrusion detection system. The converter combines the non-attack features contained in the generated attack data samples with the attack features contained in the attack behavior data samples to form new data samples. This preserves the attack capability of the attack samples, and because the converter combines the features directly, it also reduces the time and space consumed by the model algorithm.
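The loss of gradient information for non-overlapping distributions can be seen in a minimal example. Assume $P_r$ is a unit point mass at 0 and $P_g$ a unit point mass at $\theta$; this construction and the helper names are illustrative assumptions. The JS divergence equals $\log 2$ for every $\theta \neq 0$, while the Wasserstein distance equals $|\theta|$ and so still indicates how far the generator has to move.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions given as dicts."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = (px + qx) / 2
        if px > 0:
            total += 0.5 * px * math.log(px / mx)
        if qx > 0:
            total += 0.5 * qx * math.log(qx / mx)
    return total


def w1_point_masses(a, b):
    """Wasserstein-1 distance between unit point masses located at a and b."""
    return abs(a - b)


p_r = {0.0: 1.0}
results = []
for theta in (0.5, 1.0, 4.0):
    p_g = {theta: 1.0}
    results.append((theta, jsd(p_r, p_g), w1_point_masses(0.0, theta)))
```

Every entry of `results` has the same JS divergence, so a loss based on it is flat in $\theta$, whereas the Wasserstein column grows linearly with $\theta$.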
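The converter's direct combination can be sketched in a few lines. The function `convert`, the choice of a randomly selected real attack sample as the donor for each generated sample, and the feature indices are assumptions made for illustration; the source only states that the attack features of attack behavior samples replace those of the generated samples.

```python
import random


def convert(generated, attack_samples, attack_idx, rng):
    """Keep each generated sample's non-attack features and overwrite its
    attack-feature positions with values taken from a real attack sample."""
    combined = []
    for sample in generated:
        donor = rng.choice(attack_samples)  # assumed pairing rule: random donor
        merged = list(sample)
        for i in attack_idx:
            merged[i] = donor[i]
        combined.append(merged)
    return combined


generated = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]         # toy generator output
attack_samples = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]  # toy real attack samples
attack_idx = [0, 2]  # hypothetical positions marked as attack features

new_attack_samples = convert(generated, attack_samples, attack_idx, random.Random(0))
```

Because the combination is a plain copy of selected positions, it adds no trainable parameters and costs time linear in the number of samples and attack features.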

In a specific example, the data set used is the CSE-CIC-IDS-2018 data set. The data set is a collaborative project between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). It aims to generate diverse and comprehensive benchmark data sets for intrusion detection based on user profiles that contain abstract representations of the events and behaviors seen on the network; the profiles are combined to generate a set of different data sets, each with a unique set of features covering part of the evaluation domain. The data set contains 7 different attack scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside.

The features of the data traffic are analyzed with a traffic analysis tool to obtain the related data set. The relevant traffic features can be extracted with the CICFlowMeter tool, a network traffic flow generator written in Java. This yields FlowID, SourceIP, DestinationIP, SourcePort, DestinationPort and more than 80 network flow features.

Feature selection is performed with a random forest algorithm, and the attack features and non-attack features of the data samples in the related data set are marked. The data set is then preprocessed to clear abnormal data. The normal network behavior data samples and the attack behavior data samples are separated: the normal network behavior data samples are used as the model input and take part in model training, while the attack features of the attack behavior data samples are combined directly with the non-attack features of the attack data samples produced by the generator during training and do not directly take part in model training.

The converter combines the attack features from the CSE-CIC-IDS-2018 data set with the non-attack features produced by the generator and sends the combined features to the discriminator for binary classification; the discriminator and the generator are trained iteratively until enough sample data has been generated. Finally, the generated target attack data is passed to a deep-belief-network-based intrusion detection system, which acts as the detector, and the attack performance of the generated data is measured.

In a specific implementation, the attack features are chosen according to the attack traffic selected. The attack features of various attack modes can be combined with the non-attack features of the generated attack data samples to simulate various attack methods, including but not limited to DoS, Brute-force, Heartbleed and Botnet attacks.
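One simple way to quantify the attack performance measured by the detector is an evasion rate, the fraction of generated attack samples that the detector fails to flag. The source does not name a metric, so the metric itself, the function `evasion_rate` and the label encoding are assumptions for illustration.

```python
def evasion_rate(predictions, attack_label=1):
    """Fraction of generated attack samples that the detector did not flag as attacks."""
    if not predictions:
        raise ValueError("no predictions to evaluate")
    missed = sum(1 for p in predictions if p != attack_label)
    return missed / len(predictions)


# Hypothetical detector outputs for five generated attack samples (1 = flagged as attack).
detector_predictions = [0, 1, 0, 0, 1]
rate = evasion_rate(detector_predictions)
```

In this toy case three of the five samples pass undetected, giving an evasion rate of 0.6.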

In general, the invention is a method for generating attack data for an intrusion detection system based on a generative adversarial network. Its object is to generate attack data with attack features that can bypass detection by an intrusion detection system based on a deep belief network.

First, feature analysis is performed on the acquired data traffic with a traffic analysis tool, feature screening is then performed with a random forest algorithm, and the data set is then preprocessed by removing zero values and null values and uniformly sampling the various kinds of attack data. The constructed generative adversarial network model comprises a generator, a converter and a discriminator. The generator takes random noise as input and generates new data samples through a multilayer neural network. The converter combines the non-attack features of the generated data samples with the attack features of real data samples (attack behavior data samples) to form new attack samples, which are passed to the discriminator. The discriminator is trained jointly on the real data and the data samples produced by the converter, and the training result parameters are passed to the generator for iterative training. In addition, the attack performance of the attack samples is evaluated by running them through an intrusion detection system based on a deep belief network.

It is worth noting that the data set is divided into normal network behavior data samples and attack behavior data samples. The normal network behavior data samples are used as the model input for training, while selected attack features of the attack behavior data samples are combined with the non-attack features of the generated attack data and do not directly take part in model training.

The attack data generated by the method of the invention can be used to mount effective network attacks against deep-learning-based intrusion detection systems. Depending on which attack behavior data samples are selected, their specific attack features are combined with the non-attack features of the generated attack samples to simulate various attack methods, including but not limited to DoS, Brute-force, Heartbleed and Botnet attacks. The advantage of the method is that feature screening is performed with a random forest algorithm, the features ranked highest in importance are identified as attack features, and these attack features are combined with the non-attack features of the generated samples, so that the attack capability of the generated samples is preserved efficiently.

Example two

Based on the same inventive concept, this embodiment provides a system for generating attack data for an intrusion detection system based on a generative adversarial network, which comprises:

a feature analysis module, configured to perform feature analysis on the acquired data traffic with a traffic analysis tool to obtain a related data set, wherein the related data set comprises normal network behavior data samples and attack behavior data samples, and both kinds of samples contain attack features and non-attack features;

Since the system described in the second embodiment is the system used to implement the method for generating attack data for an intrusion detection system based on a generative adversarial network in the first embodiment, a person skilled in the art can understand the specific structure of the system from the method described in the first embodiment, and details are not repeated here. All systems used to implement the method of the first embodiment fall within the intended protection scope of the present invention.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for generating attack data for an intrusion detection system based on a generative adversarial network, comprising:

2. The method for generating attack data for an intrusion detection system based on a generative adversarial network as recited in claim 1, wherein after step S5, the method further comprises:

setting up an intrusion detection system based on a deep belief network to detect the attack performance of the generated target attack data.

3. The method as claimed in claim 1, wherein performing feature screening with a random forest algorithm in step S3 to mark the attack features and non-attack features of the data samples in the related data set comprises:

4. The method as claimed in claim 1, wherein preprocessing the feature-marked data set in step S3 includes:

5. The method as claimed in claim 1, wherein in step S4 the loss function used in the iterative training is:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y)\sim\gamma}\big[\lVert x - y \rVert\big]$$

wherein $P_r$ is the probability distribution of the real data samples, $P_g$ is the probability distribution of the generated data samples, $W(P_r, P_g)$ is the Wasserstein distance between $P_r$ and $P_g$, and $\Pi(P_r, P_g)$ is the set of all possible joint distributions formed by combining $P_r$ and $P_g$; for each joint distribution $\gamma$ in this set, a pair of samples $x$ and $y$ is obtained by sampling from $\gamma$, $\lVert x - y \rVert$ is the distance between the samples, $\mathbb{E}_{(x,y)\sim\gamma}[\lVert x - y \rVert]$ is the expected value of that distance under $\gamma$, and $\inf$ denotes the infimum (greatest lower bound) of the expected value.

6. A system for generating attack data for an intrusion detection system based on a generative adversarial network, comprising: