CN116304705A - Flow data set generation method and device based on conditional diffusion model - Google Patents

Flow data set generation method and device based on conditional diffusion model

Info

Publication number
CN116304705A
Authority
CN
China
Prior art keywords
flow data
noise
data set
diffusion
gray level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310278870.4A
Other languages
Chinese (zh)
Inventor
赵莎莎
刘振娟
张登银
刘鑫
冯向南
蔡宇欣
肖睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310278870.4A
Publication of CN116304705A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/50 - Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a flow data set generation method and device based on a conditional diffusion model. The method comprises the following steps: collecting a labeled flow data set; preprocessing the original flow data into gray-scale maps; taking the gray-scale maps as input and training the forward process of the diffusion model; after the diffusion model converges, obtaining a trained noise predictor for use in the reverse process of the diffusion model; generating, from Gaussian noise, a noise image of the same size as the target gray-scale map as the initial value of the noisy image and running the reverse process iteratively until the target gray-scale map is finally obtained; and converting the generated gray-scale map into the corresponding numerical matrix to complete the generation of flow data. The invention avoids the drawbacks that undersampling may lose some key features and that oversampling may overfit the classifier, and, compared with a GAN, it achieves a better image generation effect while avoiding the unstable training of the original generative adversarial model.

Description

Flow data set generation method and device based on conditional diffusion model
Technical Field
The invention relates to a flow data set generation method and device based on a conditional diffusion model. It is a conditional flow data generation method and belongs to the fields of network traffic mining and network behavior analysis.
Background
With the development of computer information technology, a huge number of terminal devices have flooded into the Internet, generating large volumes of network traffic of many types. The Internet has had a profound effect on society, the economy, and daily life in China and has become an indispensable part of everyday infrastructure. In order to protect user privacy, improve user quality of service (QoS), and maintain network security, network traffic identification and classification have become important topics in fields such as network behavior analysis and anomaly detection. With the development of information security technology, traffic data analysis faces the following challenges:
1. To protect user privacy, many Internet applications encrypt their traffic with encryption protocols, which makes traffic features difficult to extract; some traditional methods based on deep packet inspection and machine learning suffer a sharp drop in accuracy when confronted with encrypted traffic.
2. Traffic data is difficult to obtain. Because different applications differ in popularity, the collected data sets are imbalanced. Given a balanced data set, a deep learning model can learn the feature attributes of every class well and achieve good classification results, whereas imbalanced data sets lead to reduced classification accuracy for minority classes, unstable model performance, and similar problems.
3. For imbalanced data, common remedies include random oversampling and undersampling, generative adversarial network (GAN) techniques, and the like. In practice, however, oversampling trains on a large number of repeated samples, so the classifier is likely to overfit and to derive inconsistent rules from the same samples; undersampling is likely to discard part of the key feature information; and the original generative adversarial network is prone to unstable training, vanishing gradients, and mode collapse.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a flow data set generation method and device based on a conditional diffusion model. It is a flow data generation method based on supervised learning that alleviates the difficulty of acquiring encrypted flow data and overcomes the shortcomings of existing data set balancing methods.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
the invention provides a flow data set generation method based on a conditional diffusion model, which comprises the following steps:
collecting a flow data set with a packet capture tool, and configuring labels for the flow data according to flow categories (such as MySQL and QQ) to form a labeled flow data set;
based on the labeled flow data set, converting each sample into an array of equal length by truncation or zero padding, normalizing it, and finally converting it into a gray-scale map;
feeding the gray-scale map into a conditional denoising diffusion probability model and training the forward diffusion process;
after the forward diffusion process training is completed, generating a corresponding flow gray-scale image by inputting label information into the model;
and reducing the generated gray-scale map to an array and then applying inverse normalization to obtain the generated flow data.
Further, collecting a flow data set, comprising:
the packets in the network are captured using Wireshark and saved as PCAP format to form a traffic data set.
Further, the flow data set is flow data in PCAP format or ERF format or PCAPNG format.
Further, constructing a gray scale map based on the flow data set includes:
filtering useless local area network data packets in the data set based on the flow data set;
based on the cleaned flow data set, reading the flow data in binary mode, reading each byte so that its value lies in the range 0 to 255, and finally forming a one-dimensional array;
for the one-dimensional array, unifying the array length across the data set by clipping, and then forming a two-dimensional array by dimension transformation;
and performing data normalization on the two-dimensional array, constraining it to [0,1], to generate the gray-scale map, which is essentially a two-dimensional matrix expressed as:
Pixel = [P_1, P_2, ..., P_i]^T    (4)
P_i = (x_irc),  r ∈ {1, 2, ..., h},  c ∈ {1, 2, ..., w}    (5)
where Pixel denotes the entire data set, P_i is the feature matrix of the i-th packet, i.e. the i-th gray-scale image, and x_irc is one byte of information in the i-th packet, which is also one pixel value of the gray-scale map; c denotes the c-th column of the matrix and encodes the arrival order of the bytes, increasing from left to right; r denotes the r-th row of the matrix, with arrival order increasing from top to bottom; and h×w is the length of the packet after uniform clipping (a minimal sketch of this conversion is given below).
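For illustration, a minimal sketch of this byte-to-gray-scale conversion is given below. It assumes the per-packet payload bytes have already been extracted from the PCAP file; the 28×28 target size, the function name, and the use of NumPy are assumptions made for the example, not details fixed by the original text.

import numpy as np

H, W = 28, 28  # assumed uniform clipping size h x w

def packet_to_gray(payload: bytes) -> np.ndarray:
    """Convert one packet's raw bytes into a normalized h x w gray-scale map."""
    # Read the bytes in binary mode so each value lies in the range 0..255 (one-dimensional array).
    arr = np.frombuffer(payload, dtype=np.uint8)
    # Unify the array length by truncation or zero padding.
    if arr.size >= H * W:
        arr = arr[:H * W]
    else:
        arr = np.pad(arr, (0, H * W - arr.size))
    # Dimension transformation into a two-dimensional array, then normalization to [0, 1].
    return arr.reshape(H, W).astype(np.float32) / 255.0

Stacking these maps, together with their traffic-category labels, yields the labeled matrix Pixel described above.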
Further, the useless local area network data packets to be filtered in this process comprise data packets under the ARP protocol and the DHCP protocol.
further, placing the processed data set into a conditional denoising diffusion probability model for training comprises the following steps:
step A, a gray level diagram constructed by flow data is read, wherein:
each gray-scale map P_i is a 28×28 image; the input X of the model consists of gray-scale images and is a four-dimensional tensor [b, c, h, w], where b is the minibatch size, c is the number of channels of the gray-scale map (equal to 1), and h and w are the height and width of the gray-scale map;
the diffusion step t ~ Uniform(1, T) indicates how much noise is added to the image;
the label c = [c_1, c_2, ..., c_b] is likewise a vector of dimension b, where each value of c is the flow label of the current gray-scale image; during training of the noise predictor the condition is randomly dropped so that the model is also trained unconditionally;
Step B, add Gaussian noise to the gray-scale map to obtain a noisy map:
X_t = sqrt(ᾱ_t)·X_0 + sqrt(1 - ᾱ_t)·ε
where X_t denotes the noisy image, X_0 denotes the initial gray-scale map, t denotes the diffusion step, α = [α_1, α_2, ..., α_t, ..., α_T] is a sequence indexed by t and generated in this example by a preset schedule function, ᾱ_t (the cumulative product of α_1 through α_t) gives the weight sqrt(ᾱ_t), c denotes the condition, i.e. the label information of the data, and ε ~ N(0, I) denotes Gaussian noise drawn from a standard normal distribution;
Step C, take the noisy map X_t, the flow label c, and the diffusion step t as the input of the noise predictor G to predict the current Gaussian noise ε_θ;
Step D, compute the loss function from the predicted Gaussian noise ε_θ and the true Gaussian noise ε:
Loss = || ε - ε_θ( sqrt(ᾱ_t)·X_0 + sqrt(1 - ᾱ_t)·ε, t, c ) ||²
where Loss denotes the loss, ε denotes Gaussian noise drawn from a standard normal distribution, ε_θ denotes the noise predictor (a neural network), sqrt(ᾱ_t) denotes the weight, X_0 denotes the initial gray-scale map, t denotes the diffusion step, and c denotes the condition, i.e. the label information of the data;
the loss function of the diffusion probability model is then optimized with the Adam optimization algorithm (a training-loop sketch covering steps A to E is given after step E);
and E, repeating steps A to D until the number of training epochs reaches the preset value.
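As an illustration of steps A to E, the following PyTorch-style sketch performs one training iteration with the closed-form noising of step B and the noise-prediction loss of step D. The predictor interface noise_predictor(x_t, t, c), the linear beta schedule, the condition-drop probability, and the -1 sentinel for the unconditional case are assumptions made for the example and are not fixed by the original text.

import torch
import torch.nn.functional as F

T = 1000                                   # total number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product: the weight in step B

def train_step(noise_predictor, optimizer, x0, labels, p_uncond=0.1):
    """One forward-process training iteration (steps A to D)."""
    b = x0.shape[0]                                    # minibatch size
    t = torch.randint(0, T, (b,), device=x0.device)    # diffusion step, uniformly sampled
    eps = torch.randn_like(x0)                         # true Gaussian noise ~ N(0, I)

    # Step B: closed-form noising X_t = sqrt(a_bar)*X_0 + sqrt(1 - a_bar)*eps.
    ab = alpha_bar.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

    # Randomly drop the condition so the predictor is also trained unconditionally.
    c = labels.clone()
    c[torch.rand(b, device=x0.device) < p_uncond] = -1   # assumed "no condition" sentinel

    # Step C: predict the current Gaussian noise from (X_t, t, c).
    eps_pred = noise_predictor(x_t, t, c)

    # Step D: noise-prediction loss, followed by an Adam step.
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Step E then corresponds to calling train_step in a loop until the preset number of epochs is reached.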
Further, step (4) of performing a reverse process of the conditional diffusion probability model by using the trained noise predictor to generate flow data, includes the steps of:
step a, using Gaussian noise ε ~ N(0, I) drawn from a standard normal distribution, generating a noise image X_T of the same size as the target gray-scale image as the initial value of the noisy map, and setting the initial value of the diffusion step t to T, i.e. the number of loop iterations, which is decremented to 0 during the iteration;
step b, predicting the current Gaussian noise ε' by taking the noisy image, the target flow label, and the diffusion step t as the input of the noise predictor:
ε' = (1 + w)·ε_θ(X_t, t, c) - w·ε_θ(X_t, t)
where the first term on the right of the equation is the conditional noise prediction obtained by feeding the current noisy map X_t, the label c, and the diffusion step t into the noise predictor G; the second term on the right corresponds to the unconditional prediction of the noise predictor G; and w is a hyperparameter that controls the proportion in which the conditional and unconditional predictions are combined to form the current predicted Gaussian noise ε';
Step c, subtracting the generated prediction noise from the noise image;
step d, repeating the step b and the step c until the diffusion step t is 0, and finally generating a gray level diagram of the target flow data;
and e, converting the gray-scale image into the corresponding numerical matrix to complete the generation of flow data (a sketch of the sampling loop in steps a to d is given below).
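A minimal sketch of the sampling loop is shown below. The reverse-update coefficients follow the standard denoising diffusion probabilistic model sampler, and the guidance combination (1 + w)·conditional - w·unconditional is one common reading of the formula above; both, along with the schedule and the -1 unconditional sentinel, are assumptions made for the example.

import torch

@torch.no_grad()
def sample(noise_predictor, label, w=2.0, T=1000, shape=(1, 1, 28, 28), device="cpu"):
    """Generate a gray-scale map for the target flow label (steps a to d)."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)   # same assumed schedule as in training
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(shape, device=device)                  # step a: X_T ~ N(0, I)
    c = torch.full((shape[0],), label, device=device, dtype=torch.long)
    c_uncond = torch.full_like(c, -1)                        # assumed "no condition" sentinel

    for t in reversed(range(T)):                             # t decremented towards 0
        ts = torch.full((shape[0],), t, device=device, dtype=torch.long)
        # Step b: combine conditional and unconditional predictions with weight w.
        eps = (1.0 + w) * noise_predictor(x_t, ts, c) - w * noise_predictor(x_t, ts, c_uncond)

        # Step c: remove the predicted noise (standard DDPM reverse update).
        x_t = (x_t - (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x_t = x_t + betas[t].sqrt() * torch.randn_like(x_t)
    return x_t                                               # step d: final gray-scale map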
Further, the noise predictor G is implemented based on a U-Net model.
In a second aspect, the present invention provides a flow data set generating device based on a conditional diffusion model, including:
a data acquisition module: used to collect a flow data set and configure labels for the flow data according to flow categories, forming a labeled flow data set;
a digital-to-image conversion module: used to convert the labeled flow data set into arrays of equal length by truncation or zero padding, normalize them, and finally convert them into gray-scale maps;
a model training module: used to feed the gray-scale maps into the conditional denoising diffusion probability model and train the forward diffusion process;
a gray-scale map generation module: used, after the forward diffusion process training is completed, to generate a corresponding flow gray-scale image by inputting label information into the model;
a graph-to-digital conversion module: used to reduce the generated gray-scale map to an array and then apply inverse normalization to obtain the generated flow data.
In a third aspect, the present invention provides a flow data set generating device based on a conditional diffusion model, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a flow data generation method based on a conditional denoising diffusion probability model: gray-scale maps are first generated from the flow data, a new gray-scale map is generated by training on them, and the new gray-scale map is decoded back into data. This alleviates the difficulty of obtaining encrypted flow data and avoids the defects that undersampling may lose some key features and that oversampling may overfit the classifier;
2. The invention adopts a noise predictor G based on a U-Net model and, compared with a GAN, achieves a better image generation effect while avoiding the training instability of the original generative adversarial model.
Drawings
FIG. 1 is a forward process of a diffusion probability model as described in the present invention;
FIG. 2 is a reverse process of the diffusion probability model described in the present invention;
FIG. 3 is a block diagram of a Resblock network in a diffusion probability model as described in the present invention;
fig. 4 is a block diagram of a noise predictor U-Net network described in the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
The network traffic generation method based on the conditional diffusion probability model provided by the invention has the forward diffusion process shown in FIG. 1 and specifically comprises the following steps:
the method comprises the steps of (1) collecting a flow data set, configuring labels of flow data according to flow content, and forming a flow data set with labels;
in a specific example of the embodiment of the present invention, the flow data set collecting step includes:
Capture the packets in the network using Wireshark and save them in PCAP format to form a traffic data set; the data sources include WeChat, Douyin, Taobao, mailbox applications, and the like.
In terms of content, the traffic is binary data, and the labels are the data sources, such as QQ, WeChat, and Youku.
Step (2) constructing a gray scale map based on the flow data set;
in a specific implementation method of the embodiment of the present invention, the step (2) specifically includes:
(2-1) filtering useless local area network data packets in the data set, such as data packets under ARP protocol and DHCP protocol, based on the flow data set;
(2-2) based on the cleaned flow data set, read the flow data in binary mode, reading each byte so that its value lies in the range 0 to 255, and finally form a one-dimensional array;
(2-3) for the one-dimensional array, clip it to unify the array length across the data set, and then form a two-dimensional array by dimension transformation;
(2-4) perform data normalization on the two-dimensional array, constraining it to [0,1], to generate the gray-scale map, which is essentially a two-dimensional matrix expressed as:
Pixel = [P_1, P_2, ..., P_i]^T    (4)
P_i = (x_irc),  r ∈ {1, 2, ..., h},  c ∈ {1, 2, ..., w}    (5)
where Pixel denotes the entire data set, P_i is the feature matrix of the i-th packet, i.e. the i-th gray-scale image, and x_irc is one byte of information in the i-th packet, which is also one pixel value of the gray-scale map; c denotes the c-th column of the matrix and encodes the arrival order of the bytes, increasing from left to right; r denotes the r-th row of the matrix, with arrival order increasing from top to bottom; and h×w is the length of the packet after uniform clipping.
A gray-scale image is a black-and-white image (as opposed to a color image); collecting the traffic and converting it into pictures are preparatory steps, and in computer memory a gray-scale image is a two-dimensional array.
Step (3), putting the processed data set into a conditional denoising diffusion probability model for training;
The network traffic generation method based on the conditional denoising diffusion probability model provided by the invention is divided into two processes, a forward process and a reverse process. In a specific implementation of the invention, the structure of the noise predictor G is shown in Table 1 and is implemented based on a U-Net model:
TABLE 1 U-Net network parameters
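Since the parameter table itself is not reproduced in this text, the following is a small, hedged sketch of what a conditional U-Net noise predictor of this kind can look like for 1×28×28 inputs; the channel widths, embedding size, layer layout, and the -1 sentinel for the unconditional case are illustrative assumptions, not the values from Table 1.

import torch
import torch.nn as nn

class ConditionalUNet(nn.Module):
    """Toy conditional U-Net: predicts noise from (x_t, t, c) for 1x28x28 inputs."""
    def __init__(self, num_classes, base=64, emb_dim=128):
        super().__init__()
        # Embeddings for the diffusion step t and the flow label c
        # (the last index is reserved for the unconditional case).
        self.t_emb = nn.Embedding(1000, emb_dim)
        self.c_emb = nn.Embedding(num_classes + 1, emb_dim)
        self.emb_proj = nn.Linear(emb_dim, base)

        self.enc1 = nn.Sequential(nn.Conv2d(1, base, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)          # 28 -> 14
        self.enc2 = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)   # 14 -> 28
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(base, 1, 3, padding=1))

    def forward(self, x, t, c):
        # Map label -1 (no condition) to the reserved embedding index.
        c = torch.where(c < 0, torch.full_like(c, self.c_emb.num_embeddings - 1), c)
        emb = self.emb_proj(self.t_emb(t) + self.c_emb(c))[:, :, None, None]

        h1 = self.enc1(x) + emb                        # inject step and label into features
        h2 = self.enc2(self.down(h1))
        u = self.up(h2)
        return self.dec1(torch.cat([u, h1], dim=1))    # skip connection, noise output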
The forward process training step (3) comprises the following steps:
step one, a gray scale image constructed by flow data is read, wherein:
each gray-scale map P_i is a 28×28 image; the input X of the model consists of gray-scale images and is a four-dimensional tensor [b, c, h, w], where b is the minibatch size, c is the number of channels of the gray-scale map (equal to 1), and h and w are the height and width of the gray-scale map; the number of training epochs is set to 1000;
the diffusion step t ~ Uniform(1, T) indicates how much noise is added to the image;
the label c = [c_1, c_2, ..., c_b] is likewise a vector of dimension b, where each value of c is the flow label of the current gray-scale image; during training of the noise predictor the condition is randomly dropped so that the model is also trained unconditionally;
Step two, add Gaussian noise to the gray-scale map to obtain a noisy map:
X_t = sqrt(ᾱ_t)·X_0 + sqrt(1 - ᾱ_t)·ε
where X_t denotes the noisy image, X_0 denotes the initial gray-scale map, t denotes the diffusion step, α = [α_1, α_2, ..., α_t, ..., α_T] is a sequence indexed by t and generated in this example by a preset schedule function, ᾱ_t (the cumulative product of α_1 through α_t) gives the weight sqrt(ᾱ_t), c denotes the condition, i.e. the label information of the data, and ε ~ N(0, I) denotes Gaussian noise drawn from a standard normal distribution;
Step three, take the noisy map X_t, the flow label c, and the diffusion step t as the input of the noise predictor G to predict the current Gaussian noise ε_θ; the noise predictor G is currently most often implemented with a U-Net neural network model.
Step four, compute the loss function from the predicted Gaussian noise ε_θ and the true Gaussian noise ε:
Loss = || ε - ε_θ( sqrt(ᾱ_t)·X_0 + sqrt(1 - ᾱ_t)·ε, t, c ) ||²
The loss function of the diffusion probability model is then optimized with the Adam optimization algorithm.
Step five, repeat steps one to four until the model converges, i.e. until the preset number of training epochs has been completed.
After the deep learning training is finished, the model has learned the information in the data; in particular, the trained model has learned the noise distribution during training, so it can predict the noise and perform denoising on the image.
Step (4) is to use a trained noise predictor to perform a reverse process of a conditional diffusion probability model to generate flow data, as shown in fig. 2, and specifically includes the following steps:
Step one, using Gaussian noise ε ~ N(0, I) drawn from a standard normal distribution, generate a noise image X_T of the same size as the target gray-scale image as the initial value of the noisy map, and set the initial value of the diffusion step t to T, i.e. the number of denoising iterations, which is decremented to 0 during the iteration;
Step two, predict the current Gaussian noise ε' by taking the noisy image, the target flow label, and the diffusion step t as the input of the noise predictor:
ε' = (1 + w)·ε_θ(X_t, t, c) - w·ε_θ(X_t, t)
where the first term on the right of the equation is the conditional noise prediction obtained by feeding the current noisy map X_t, the label c, and the diffusion step t into the noise predictor G; the second term on the right corresponds to the unconditional prediction of the noise predictor G; and w is a hyperparameter that controls the proportion in which the conditional and unconditional predictions are combined to form the current predicted Gaussian noise ε';
Step three, subtracting the generated prediction noise from the noise image;
and fourthly, repeating the second and third steps until the diffusion step t is 0, and finally generating a gray level diagram of the target flow data.
And fifthly, reduce the generated gray-scale map to an array and then perform inverse normalization to obtain the generated flow data (a short decoding sketch follows).
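As a companion to the preprocessing sketch given earlier, the following snippet illustrates this image-to-data direction: the generated gray-scale map is flattened to a one-dimensional array and de-normalized back into byte values. The rounding and clipping choices are assumptions made for the example.

import numpy as np

def gray_to_bytes(gray) -> bytes:
    """Reduce a generated [0,1] gray-scale map to an array and undo the normalization."""
    flat = np.asarray(gray, dtype=np.float32).reshape(-1)               # dimension reduction
    values = np.clip(np.rint(flat * 255.0), 0, 255).astype(np.uint8)    # inverse normalization
    return values.tobytes()                                             # recovered flow bytes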
The invention can be deployed on a server, and solves the problem of unbalanced flow data sets in network flow identification and anomaly detection research.
Embodiment two:
the present embodiment provides a flow data set generating device based on a conditional diffusion model, including:
a data acquisition module: used to collect a flow data set and configure labels for the flow data according to flow categories, forming a labeled flow data set;
a digital-to-image conversion module: used to convert the labeled flow data set into arrays of equal length by truncation or zero padding, normalize them, and finally convert them into gray-scale maps;
a model training module: used to feed the gray-scale maps into the conditional denoising diffusion probability model and train the forward diffusion process;
a gray-scale map generation module: used, after the forward diffusion process training is completed, to generate a corresponding flow gray-scale image by inputting label information into the model;
a graph-to-digital conversion module: used to reduce the generated gray-scale map to an array and then apply inverse normalization to obtain the generated flow data.
The apparatus of this embodiment may be used to implement the method described in embodiment one.
Embodiment III:
the embodiment provides a flow data set generating device based on a conditional diffusion model, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (10)

1. A method for generating a flow data set based on a conditional diffusion model, comprising the steps of:
collecting a flow data set, configuring labels of the flow data according to flow categories, and forming a flow data set with the labels;
based on the labeled flow data set, converting each sample into an array of equal length by truncation or zero padding, normalizing it, and finally converting it into a gray-scale map;
feeding the gray-scale map into a conditional denoising diffusion probability model and training the forward diffusion process;
after the forward diffusion process training is completed, generating a corresponding flow gray-scale image by inputting label information into the model;
and reducing the generated gray-scale map to an array and then applying inverse normalization to obtain the generated flow data.
2. The method of generating a flow data set based on a conditional diffusion model of claim 1, wherein collecting the flow data set comprises:
the packets in the network are captured using Wireshark and saved as PCAP format to form a traffic data set.
3. The method for generating a flow data set based on a conditional diffusion model according to claim 2, wherein the flow data set is flow data in PCAP format, ERF format, or PCAPNG format.
4. The flow data set generating method based on the conditional diffusion model according to claim 1, wherein constructing a gray scale map based on the flow data set comprises:
filtering useless local area network data packets in the data set based on the flow data set;
based on the cleaned flow data set, reading the flow data in binary mode, reading each byte so that its value lies in the range 0 to 255, and finally forming a one-dimensional array;
for the one-dimensional array, unifying the array length across the data set by clipping, and then forming a two-dimensional array by dimension transformation;
and performing data normalization on the two-dimensional array, constraining it to [0,1], to generate the gray-scale map, which is essentially a two-dimensional matrix expressed as:
Pixel = [P_1, P_2, ..., P_i]^T    (4)
P_i = (x_irc),  r ∈ {1, 2, ..., h},  c ∈ {1, 2, ..., w}    (5)
where Pixel denotes the entire data set, P_i is the feature matrix of the i-th packet, i.e. the i-th gray-scale image, and x_irc is one byte of information in the i-th packet, which is also one pixel value of the gray-scale map; c denotes the c-th column of the matrix and encodes the arrival order of the bytes, increasing from left to right; r denotes the r-th row of the matrix, with arrival order increasing from top to bottom; and h×w is the length of the packet after uniform clipping.
5. The method for generating a traffic data set based on a conditional diffusion model according to claim 1, wherein the useless local area network data packets to be filtered in the process comprise ARP and DHCP data packets.
6. The conditional diffusion model-based flow data set generation method of claim 1, wherein placing the processed data set into the conditional denoising diffusion probability model for training comprises the following steps:
step A, a gray level diagram constructed by flow data is read, wherein:
each gray-scale map P_i is a 28×28 image; the input X of the model consists of gray-scale images and is a four-dimensional tensor [b, c, h, w], where b is the minibatch size, c is the number of channels of the gray-scale map (equal to 1), and h and w are the height and width of the gray-scale map;
the diffusion step t ~ Uniform(1, T) indicates how much noise is added to the image;
the label c = [c_1, c_2, ..., c_b] is likewise a vector of dimension b, where each value of c is the flow label of the current gray-scale image; during training of the noise predictor the condition is randomly dropped so that the model is also trained unconditionally;
step B, adding Gaussian noise to the gray-scale map to obtain a noisy map:
X_t = sqrt(ᾱ_t)·X_0 + sqrt(1 - ᾱ_t)·ε
where X_t denotes the noisy image, X_0 denotes the initial gray-scale map, t denotes the diffusion step, α = [α_1, α_2, ..., α_t, ..., α_T] is a sequence indexed by t and generated in this example by a preset schedule function, ᾱ_t (the cumulative product of α_1 through α_t) gives the weight sqrt(ᾱ_t), c denotes the condition, i.e. the label information of the data, and ε ~ N(0, I) denotes Gaussian noise drawn from a standard normal distribution;
step C, taking the noisy map X_t, the flow label c, and the diffusion step t as the input of the noise predictor G to predict the current Gaussian noise ε_θ;
step D, computing the loss function from the predicted Gaussian noise ε_θ and the true Gaussian noise ε:
Loss = || ε - ε_θ( sqrt(ᾱ_t)·X_0 + sqrt(1 - ᾱ_t)·ε, t, c ) ||²
where Loss denotes the loss, ε denotes Gaussian noise drawn from a standard normal distribution, ε_θ denotes the noise predictor (a neural network), sqrt(ᾱ_t) denotes the weight, X_0 denotes the initial gray-scale map, t denotes the diffusion step, and c denotes the condition, i.e. the label information of the data;
then optimizing the loss function of the diffusion probability model with the Adam optimization algorithm;
and E, repeating steps A to D until the number of training epochs reaches the preset value.
7. The method of generating a flow data set based on a conditional diffusion model according to claim 1, wherein the step (4) of generating flow data by performing a reverse process of the conditional diffusion probability model using a trained noise predictor comprises the steps of:
step a, using Gaussian noise ε ~ N(0, I) drawn from a standard normal distribution, generating a noise image X_T of the same size as the target gray-scale image as the initial value of the noisy map, and setting the initial value of the diffusion step t to T, i.e. the number of loop iterations, which is decremented to 0 during the iteration;
step b, predicting the current Gaussian noise ε' by taking the noisy image, the target flow label, and the diffusion step t as the input of the noise predictor:
ε' = (1 + w)·ε_θ(X_t, t, c) - w·ε_θ(X_t, t)
where the first term on the right of the equation is the conditional noise prediction obtained by feeding the current noisy map X_t, the label c, and the diffusion step t into the noise predictor G; the second term on the right corresponds to the unconditional prediction of the noise predictor G; and w is a hyperparameter that controls the proportion in which the conditional and unconditional predictions are combined to form the current predicted Gaussian noise ε';
Step c, subtracting the generated prediction noise from the noise image;
step d, repeating the step b and the step c until the diffusion step t is 0, and finally generating a gray level diagram of the target flow data;
and e, converting the gray level image into a corresponding numerical matrix to finish the generation of flow data.
8. The method for generating a conditional diffusion model-based flow data set according to claim 1, wherein the noise predictor G is implemented based on a U-Net model.
9. A flow data set generating device based on a conditional diffusion model, comprising:
a data acquisition module: used to collect a flow data set and configure labels for the flow data according to flow categories, forming a labeled flow data set;
a digital-to-image conversion module: used to convert the labeled flow data set into arrays of equal length by truncation or zero padding, normalize them, and finally convert them into gray-scale maps;
a model training module: used to feed the gray-scale maps into the conditional denoising diffusion probability model and train the forward diffusion process;
a gray-scale map generation module: used, after the forward diffusion process training is completed, to generate a corresponding flow gray-scale image by inputting label information into the model;
a graph-to-digital conversion module: used to reduce the generated gray-scale map to an array and then apply inverse normalization to obtain the generated flow data.
10. A flow data set generating device based on a conditional diffusion model, which is characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1 to 8.
CN202310278870.4A 2023-03-21 2023-03-21 Flow data set generation method and device based on conditional diffusion model Pending CN116304705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310278870.4A CN116304705A (en) 2023-03-21 2023-03-21 Flow data set generation method and device based on conditional diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310278870.4A CN116304705A (en) 2023-03-21 2023-03-21 Flow data set generation method and device based on conditional diffusion model

Publications (1)

Publication Number Publication Date
CN116304705A true CN116304705A (en) 2023-06-23

Family

ID=86819811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310278870.4A Pending CN116304705A (en) 2023-03-21 2023-03-21 Flow data set generation method and device based on conditional diffusion model

Country Status (1)

Country Link
CN (1) CN116304705A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116542774A (en) * 2023-06-27 2023-08-04 深圳市迪博企业风险管理技术有限公司 Probability diffusion model-based method for detecting compliance of company-associated transactions on sale
CN116542774B (en) * 2023-06-27 2023-12-22 深圳市迪博企业风险管理技术有限公司 Probability diffusion model-based method for detecting compliance of company-associated transactions on sale
CN117423396A (en) * 2023-12-18 2024-01-19 烟台国工智能科技有限公司 Crystal structure generation method and device based on diffusion model
CN117423396B (en) * 2023-12-18 2024-03-08 烟台国工智能科技有限公司 Crystal structure generation method and device based on diffusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination