CN113033614A

CN113033614A - Network traffic data processing method and system

Info

Publication number: CN113033614A
Application number: CN202110221395.8A
Authority: CN
Inventors: 卜佑军; 王方玉; 张建辉; 陈博; 张桥; 张鹏; 伊鹏; 马海龙; 胡宇翔; 张稣荣; 孙嘉; 路祥雨; 王继; 张进
Original assignee: Information Engineering University of PLA Strategic Support Force; Network Communication and Security Zijinshan Laboratory
Current assignee: Information Engineering University of PLA Strategic Support Force; Network Communication and Security Zijinshan Laboratory
Priority date: 2021-02-27
Filing date: 2021-02-27
Publication date: 2021-06-25

Abstract

The invention belongs to the technical field of network security and in particular relates to a network traffic data processing method and system for network traffic classification detection. The method comprises the following steps: sampling the unbalanced data in an original network traffic data set and adding noise to obtain data to be processed, wherein the unbalanced data are the traffic data of those classes whose share of the overall network traffic distribution is smaller than a set threshold; standardizing the data to be processed and generating sample data with a data generator; and adding the sample data to the original network traffic data set to form an enhanced data set for network traffic classification detection. The invention uses the data generator to handle the latent variables and the likelihood function of the sampled data exactly, and can therefore generate clearer sample data; the method can process large-scale image data, has good application prospects in practice, runs efficiently on hardware, and has a simple optimization process.

Description

Network traffic data processing method and system

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a network traffic data processing method and system, which are suitable for unbalanced data enhancement processing in network traffic detection.

Background

Imbalance in misclassification cost and imbalance in the number of samples per class are the two characteristics of an unbalanced data set. With the advent of the big data era, unbalanced data sets of many kinds are widespread in everyday life. The problem of determining the class of a sample by building a model of the relation between classes and training samples, with an unbalanced data set as the training set, is called the unbalanced data classification problem. When artificial intelligence algorithms such as machine learning are used for data processing, this imbalance seriously affects practical applications in different fields. Because unbalanced data raises so many problems during processing, it has attracted a growing amount of research. Unbalanced data currently arises mainly in fields such as medical diagnosis, fraud detection, information security, and traffic classification, and the imbalance of the data sets poses a major challenge to researchers. How to deal effectively with unbalanced data distributions is therefore an important and urgent problem in the data processing field.

Classification of unbalanced network traffic data is one of the challenging problems in the classification field, and most of the network traffic collected in real scenarios is unbalanced. Traditional methods for processing unbalanced traffic data mainly comprise data-set-level methods, feature-level methods, and methods tied to the classification algorithm. Data-set-level methods mainly comprise oversampling algorithms, chiefly SMOTE and the like, and undersampling algorithms, chiefly random undersampling and the like. Traditional classification algorithms cannot classify effectively when the data distribution is unbalanced and cannot classify at all when it is highly unbalanced. Moreover, oversampling at the data set level can cause overfitting, and undersampling often leads to incomplete learning of the information in the data. With the rapid development of Internet technology, new data generation approaches have emerged at the data set level, such as variational autoencoders, generative adversarial networks, and deep oversampling models. From the current theoretical and practical perspective, however, these methods have the following problems when applied to unbalanced traffic: (1) Unbalanced data is mainly processed with undersampling and oversampling techniques at the data set level. Undersampling is simple and easy to apply and can reduce model training time, but discarding majority-class samples may discard useful information hidden in the data, so the trained classifier performs poorly; oversampling balances the data by increasing the number of minority-class samples, but replicating samples many times can cause the classification algorithm to overfit, and although some improved algorithms overcome overfitting, they introduce other problems such as poor generalization. (2) Although recent data generation methods such as variational autoencoders and generative adversarial networks are simple, it is difficult for them to generate large images, because the dimensionality of the synthesized traffic grows with the computation length. (3) The variational autoencoder, an example of combining a probabilistic graphical model with deep learning, can only infer approximate values of the latent variables corresponding to the traffic data points; the generative adversarial network has no encoder to infer hidden information, so the traffic data points cannot be represented by latent variables.

Disclosure of Invention

Therefore, the invention provides a network traffic data processing method and system, which are suited to the classification of unbalanced network traffic data and facilitate the practical analysis and application of network traffic classification detection.

According to the design scheme provided by the invention, a network traffic data processing method for network traffic classification detection is provided, comprising the following steps:

sampling the unbalanced data in an original network traffic data set and adding noise to obtain data to be processed, wherein the unbalanced data are the traffic data of those classes whose share of the overall network traffic distribution is smaller than a set threshold;

standardizing the data to be processed, and generating sample data with a data generator;

and adding the sample data to the original network traffic data set to form an enhanced data set for network traffic classification detection.
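The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`minority_classes`, `sample_with_noise`, `augment`), the Gaussian noise model, the default threshold of 0.2, and the identity placeholder used as the `generator` are all hypothetical choices that the source does not specify.

```python
import numpy as np


def minority_classes(labels, ratio_threshold):
    """Classes whose share of all samples is below ratio_threshold."""
    classes, counts = np.unique(labels, return_counts=True)
    ratios = counts / counts.sum()
    return classes[ratios < ratio_threshold]


def sample_with_noise(x_cls, n_samples, noise_std, rng):
    """Draw rows with replacement and add Gaussian noise (assumed noise model)."""
    picked = x_cls[rng.integers(0, len(x_cls), size=n_samples)]
    return picked + rng.normal(0.0, noise_std, size=picked.shape)


def augment(x, y, generator, ratio_threshold=0.2, n_samples=50,
            noise_std=0.05, seed=0):
    """Return the enhanced data set: original samples plus generated ones."""
    rng = np.random.default_rng(seed)
    xs, ys = [x], [y]
    for cls in minority_classes(y, ratio_threshold):
        pending = sample_with_noise(x[y == cls], n_samples, noise_std, rng)
        # Standardize, run the generator, then map back to the data scale.
        mean, std = pending.mean(axis=0), pending.std(axis=0) + 1e-6
        generated = generator((pending - mean) / std) * std + mean
        xs.append(generated)
        ys.append(np.full(n_samples, cls, dtype=y.dtype))
    return np.concatenate(xs), np.concatenate(ys)


# Toy data: 90 samples of class 0 and 10 samples of class 1.
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# The identity function stands in for a trained flow-based generator.
x_aug, y_aug = augment(x, y, generator=lambda z: z)
print(x_aug.shape, np.bincount(y_aug))
```

In a real pipeline the placeholder would be replaced by sampling from a trained generative model; the sketch only shows how the threshold test, the noisy sampling, and the merge into the enhanced data set fit together.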

Further, in the network traffic data processing method of the present invention, during unbalanced data sampling the unbalanced data of the data set is preprocessed by a sampling means, which includes but is not limited to SMOTE oversampling and undersampling.

Further, in the network traffic data processing method, the network traffic data is represented in a tensor-structured shape, and the network traffic sequence used as the data to be processed is obtained by sampling the unbalanced data and then adding noise.

Further, in the network traffic data processing method of the present invention, the data generator employs a flow-based generative model, which includes a data initialization layer for normalizing the data to be processed, a convolution layer that reverses the channel order of the input data through a convolution operation, and an affine coupling mapping layer for simplifying the network structure.

Further, in the network traffic data processing method, the data to be processed is first compressed and then input into the generative model, and the sample data is obtained by the generative model in combination with the data compression.

Further, in the network traffic data processing method, the data initialization layer applies batch normalization to the input network traffic, and the affine coupling mapping layer simplifies the traffic sequence through a 1 × 1 invertible matrix.

Further, in the network traffic data processing method, the affine coupling mapping layer uses bijective functions and builds a bijective model by stacking them to complete the simplification of the traffic sequence.

Further, in the network traffic data processing method of the present invention, the affine coupling mapping layer splits the network traffic tensor with a split function and joins the split tensors back together with a concatenation function, which performs the reverse operation.

Further, in the network traffic data processing method, the enhanced data set is used as the input of a classification detection model; each sample in the data set is traversed, the class means of the features are computed using a distance measure on the data, and class-conditional probability estimates are obtained from the class means; a back propagation algorithm then performs gradient computation so that the similarity probability between the generated sample data and the original network traffic data reaches the expected value.

Further, the present invention also provides a network traffic data processing system, for classification detection of network traffic, comprising: a sampling module, a generating module and an enhancing module, wherein,

the sampling module is used for sampling the unbalanced data in the original network traffic data set and adding noise to obtain the data to be processed, wherein the unbalanced data are the traffic data of those classes whose share of the overall network traffic distribution is smaller than a set threshold;

the generating module is used for carrying out standardized processing on the data to be processed and generating sample data based on the data generator;

and the enhancement module is used for adding the sample data into the original network traffic data set to form an enhanced data set for the network traffic classification detection.

The invention has the beneficial effects that:

The invention uses the data generator to handle the latent variables and the likelihood function of the sampled data exactly, so that the method can generate clearer sample data; the method can process large-scale image data and has good application prospects in practice. Training can be carried out while data samples are being generated, that is, the data set samples are processed in parallel. Compared with models such as variational autoencoders, generative adversarial networks, and deep oversampling models, which cannot be computed in parallel, the flow-based data generation model offers efficient data generation and inference, runs efficiently on hardware, and has a simpler optimization process.

Description of the drawings:

FIG. 1 is a schematic diagram illustrating a flow of processing network traffic data according to an embodiment;

FIG. 2 is a flow diagram illustrating a process flow of the flow-based data generation model in an embodiment;

FIG. 3 is a schematic diagram of a data compression process in the embodiment.

The specific implementation mode is as follows:

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.

To address the problem of unbalanced data in typical traffic, an embodiment of the present invention provides a network traffic data processing method for network traffic classification detection, comprising: sampling the unbalanced data in an original network traffic data set and adding noise to obtain data to be processed, wherein the unbalanced data are the traffic data of those classes whose share of the overall network traffic distribution is smaller than a set threshold; standardizing the data to be processed, and generating sample data with a data generator; and adding the sample data to the original network traffic data set to form an enhanced data set for network traffic classification detection. The data generator handles the latent variables and the likelihood function of the sampled data exactly, so that the method can generate clearer sample data; the method can process large-scale image data and has good application prospects in practice.

Further, in the network traffic data processing method of the embodiment of the present invention, during unbalanced data sampling the unbalanced data in the data set is preprocessed by a sampling means; the sampling means commonly used on the original data set, chiefly SMOTE oversampling, undersampling, and the like, are used to preprocess the unbalanced data in the training data set.
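For orientation, the two sampling means named above can be sketched in a few lines of NumPy. This is a simplified, SMOTE-style interpolation and a plain random undersampler written for illustration only; the function names, the neighbour count `k`, and the toy data are assumptions, and a production system would more likely rely on an established library implementation.

```python
import numpy as np


def smote_like(x_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(x_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))             # interpolation factor in [0, 1)
    return x_min[base] + gap * (x_min[chosen] - x_min[base])


def random_undersample(x_maj, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(x_maj), size=n_keep, replace=False)
    return x_maj[keep]


rng = np.random.default_rng(42)
x_minority = rng.normal(size=(10, 4))
x_majority = rng.normal(size=(90, 4))

synthetic = smote_like(x_minority, n_new=25)
reduced = random_undersample(x_majority, n_keep=20)
print(synthetic.shape, reduced.shape)
```

Because every synthetic row is a convex combination of two existing minority rows, it always lies inside the bounding box of the minority class, which is also why pure interpolation cannot add genuinely new structure and motivates a learned generator.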

Further, in the network traffic data processing method of the embodiment of the present invention, the data generator employs a flow-based generative model, which includes a data initialization layer for normalizing the data to be processed, a convolution layer that reverses the channel order of the input data through a convolution operation, and an affine coupling mapping layer for simplifying the network structure. Further, the data initialization layer applies batch normalization to the input network traffic, and the affine coupling mapping layer simplifies the traffic sequence through a 1 × 1 invertible matrix.

In the data generator model, the input data is first standardized, with the normalization layer used to initialize the parameters of the scale and bias layers; after initialization, the scale and bias of the data flow are treated as trainable parameters independent of the data. An invertible transformation of the input data is then realized in the affine coupling layer, and finally the computation of the whole network is simplified through an invertible 1 × 1 convolution.

Further, in the network traffic data processing method of the embodiment of the invention, the enhanced data set is used as the input of a classification detection model; each sample in the data set is traversed, the class means of the features are computed using a distance measure on the data, and class-conditional probability estimates are obtained from the class means; a back propagation algorithm then performs gradient computation so that the similarity probability between the generated sample data and the original network traffic data reaches the expected value.

The enhanced data set s_aug serves as the input of the classification algorithm. The differentiable classification algorithm passes each sample in the training set through a differentiable feature extractor, computes the class means of the features using a distance measure on the data, and then computes the class-conditional probability estimate p from the class means. Finally, gradients are computed by back propagation so that the generated data and the original data reach the desired similarity probability; in this way the flow-based data generator can be trained continuously, the whole preprocessing process can be optimized continuously, and the error rate of the classification algorithm can be reduced.
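One possible reading of the class-mean step is a nearest-class-mean classifier, sketched below. The source does not fix the feature extractor, the distance, or the way distances become probabilities, so the identity feature extractor, the squared Euclidean distance, and the softmax over negative distances used here are all assumptions, as are the function names.

```python
import numpy as np


def class_means(features, labels):
    """Mean feature vector of every class."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, means


def class_probabilities(features, means):
    """Estimate p(class | x) with a softmax over negative squared distances."""
    d2 = ((features[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


# Toy enhanced data set: two well-separated classes in two dimensions.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 0.3, size=(20, 2)),
                        rng.normal(5.0, 0.3, size=(20, 2))])
labs = np.array([0] * 20 + [1] * 20)

classes, means = class_means(feats, labs)
probs = class_probabilities(feats, means)
predicted = classes[probs.argmax(axis=1)]
print(probs.shape, (predicted == labs).mean())
```

Every operation here is differentiable with respect to the features, which is what would allow a loss on `probs` to be back-propagated into a feature extractor and a generator in an automatic-differentiation framework.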

Further, based on the foregoing method, an embodiment of the present invention further provides a network traffic data processing system, configured to perform classification detection on network traffic, where the system includes: a sampling module, a generating module and an enhancing module, wherein,

In this embodiment, unbalanced data is handled mainly through sampling, and the sampled data is fed into the flow-based data generator; once enough samples x'_i have been generated, the newly generated sample data and the original data set are combined into the enhanced data set s_aug; a differentiable classification algorithm then produces the probability estimate p from the computed class means, and a gradient descent algorithm with negative feedback continuously optimizes the classification algorithm and the feature extractor. In the data preprocessing stage, the unbalanced network traffic is detected and analyzed in real time, the minority-class sample data is sampled according to the distribution of the various types of network traffic, the sampled data is put into the flow-based data generation model together with noise to generate new data, and the new data is then combined with the original traffic data set to form the enhanced traffic data set. Finally, classification is performed with a classification algorithm.

The scheme of the present invention is further explained below by referring to the typical traffic type and model parameter examples:

Referring to FIG. 1, samples are drawn from the original data set and the network traffic is processed into a tensor of shape [c × w × h] that serves as the input, where c is the channel dimension and [w × h] is the traffic dimension of the input. Noise is added to the sampled traffic sequence; that is, with the original data denoted x_i and the noise denoted z, x′_i = g(x_i, z). In the flow-based data generation model, the sampled and noise-added traffic sequence is first standardized, the network structure is then simplified in the affine coupling layer, and finally the overall computation on the traffic matrix is simplified through a 1 × 1 invertible matrix. The flow is combined with a multi-scale structure (as shown in FIG. 2(b)), which compresses the traffic data before the processing of FIG. 2(a). After compression, one part, whose determinant is tractable, is output directly, and the remaining part is computed through the functions l and m; after data compression, the functions l and m have more hidden features. For example, when the traffic data input x has dimension T, the output of the affine coupling layer is: $y_{1:t} = x_{1:t}$, $y_{t+1:T} = x_{t+1:T} \odot \exp(l(x_{1:t})) + m(x_{1:t})$, where t < T is the split index. The data stream is then split, processed through the traffic data flow of FIG. 2(a), and finally compressed. For example, FIG. 3 shows the process of compressing a 4 × 4 × 1 tensor into a 2 × 2 × 4 tensor.
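The compression step can be read as a space-to-depth "squeeze" of the kind used in multi-scale flow models, in which every 2 × 2 spatial block is moved into the channel dimension. The sketch below assumes that reading and a channel-first layout (c, w, h); the function names `squeeze` and `unsqueeze` are hypothetical.

```python
import numpy as np


def squeeze(x):
    """(c, w, h) -> (4c, w/2, h/2): move each 2x2 spatial block into channels."""
    c, w, h = x.shape
    x = x.reshape(c, w // 2, 2, h // 2, 2)
    x = x.transpose(0, 2, 4, 1, 3)           # (c, 2, 2, w/2, h/2)
    return x.reshape(4 * c, w // 2, h // 2)


def unsqueeze(y):
    """Inverse of squeeze: (4c, w/2, h/2) -> (c, w, h)."""
    c4, w2, h2 = y.shape
    c = c4 // 4
    y = y.reshape(c, 2, 2, w2, h2)
    y = y.transpose(0, 3, 1, 4, 2)           # (c, w/2, 2, h/2, 2)
    return y.reshape(c, 2 * w2, 2 * h2)


# A single-channel 4 x 4 input becomes four channels of size 2 x 2.
x = np.arange(16).reshape(1, 4, 4)
y = squeeze(x)
print(y.shape)
print(y[:, 0, 0])   # the four values of the top-left 2x2 block of x
```

The operation only rearranges elements, so it is exactly invertible and contributes nothing to the log-determinant; its purpose is to give later coupling layers more channels to work with.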

Initialization layer: a data-dependent initialization of the scale and bias layers. To address the problems encountered when training deep models, batch normalization is applied to the preprocessed network traffic. The variance of the noise introduced by batch normalization is inversely proportional to the mini-batch size per GPU or other processing unit. The parameters of the scale and bias layers are initialized using the normalization layer so that the post-activation channels have zero mean and unit variance. After initialization, both the scale and the bias are treated as independent of the preprocessed network traffic tensor.
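A data-dependent initialization of this kind can be sketched as a per-channel scale and bias whose starting values are taken from the first mini-batch. The class below is an illustrative assumption, not the patented layer: the class name `ActNorm`, the batch layout (n, c, w, h), and the small constant `eps` are choices made for the example.

```python
import numpy as np


class ActNorm:
    """Per-channel scale and bias, initialized from the first mini-batch."""

    def __init__(self):
        self.scale = None
        self.bias = None

    def initialize(self, batch, eps=1e-6):
        # batch has shape (n, c, w, h); statistics are taken per channel.
        mean = batch.mean(axis=(0, 2, 3), keepdims=True)
        std = batch.std(axis=(0, 2, 3), keepdims=True)
        self.scale = 1.0 / (std + eps)
        self.bias = -mean * self.scale

    def forward(self, x):
        if self.scale is None:
            self.initialize(x)
        y = x * self.scale + self.bias
        # log|det| of the per-sample Jacobian: w * h * sum(log|scale|).
        _, _, w, h = x.shape
        logdet = w * h * np.log(np.abs(self.scale)).sum()
        return y, logdet

    def inverse(self, y):
        return (y - self.bias) / self.scale


rng = np.random.default_rng(0)
batch = rng.normal(3.0, 2.0, size=(8, 3, 4, 4))
layer = ActNorm()
out, logdet = layer.forward(batch)
print(out.mean(axis=(0, 2, 3)), out.std(axis=(0, 2, 3)))
```

After this first call the scale and bias no longer depend on the data; in a trainable model they would simply be updated by gradient descent like any other parameter.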

Invertible 1 × 1 convolution: the weight matrix of the invertible 1 × 1 convolution is treated as a random rotation matrix at initialization. For a preprocessed network traffic tensor $\mathbf{h}$ of shape [c × w × h] and a c × c weight matrix W, the log-determinant of the invertible 1 × 1 convolution is calculated as follows:

$$\log\left|\det\left(\frac{d\,\mathrm{conv2D}(\mathbf{h}; W)}{d\,\mathbf{h}}\right)\right| = h \cdot w \cdot \log\left|\det(W)\right|$$

The cost of computing or differentiating det(W) is O(c³), which is often comparable to the cost of computing conv2D(h; W), which is O(h·w·c²). Because W is initialized as a random rotation matrix, its log-determinant starts at 0; after one step of gradient descent the value is no longer 0.

LU decomposition: by parameterizing W through its LU decomposition, the cost of computing det(W) is reduced from O(c³) to O(c): $W = P L (U + \mathrm{diag}(s))$, where P is a permutation matrix, L is a lower triangular matrix with ones on the diagonal, U is an upper triangular matrix with zeros on the diagonal, and s is a vector.

When c is large, the difference in computational cost becomes significant. In this parameterization, the parameters are initialized by first sampling a random rotation matrix W and then computing the corresponding value of P and the corresponding values of L, U, and s, with $\log|\det(W)| = \mathrm{sum}(\log|s|)$. The matrix simplified by the invertible convolution in turn reduces the overall computation of the processing flow.
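The determinant identities used here can be checked numerically. The sketch below first confirms that a random rotation matrix has a log-determinant of 0, then builds a weight matrix of the form P L (U + diag(s)) and compares the O(c) log-determinant with a full computation. As a simplification that differs from the text, the factors P, L, U, and s are drawn at random rather than computed from a sampled rotation matrix; all function names and sizes are assumptions.

```python
import numpy as np


def random_rotation(c, rng):
    """Random orthogonal matrix from a QR decomposition (|det| = 1)."""
    q, _ = np.linalg.qr(rng.normal(size=(c, c)))
    return q


def compose_weight(p, l, u, s):
    """W = P L (U + diag(s)) with unit-lower L and strictly-upper U."""
    return p @ l @ (u + np.diag(s))


def conv1x1(x, w):
    """Apply the c x c matrix w at every spatial position of x, shape (c, w, h)."""
    return np.einsum("ij,jab->iab", w, x)


rng = np.random.default_rng(0)
c, width, height = 4, 3, 5

# Rotation initialization: log|det W| starts at 0.
w_rot = random_rotation(c, rng)
logdet_rot = np.linalg.slogdet(w_rot)[1]

# LU-style factors, drawn at random purely for illustration.
p = np.eye(c)[rng.permutation(c)]
l = np.tril(rng.normal(size=(c, c)), k=-1) + np.eye(c)
u = np.triu(rng.normal(size=(c, c)), k=1)
s = rng.uniform(0.5, 2.0, size=c)
w_lu = compose_weight(p, l, u, s)

logdet_fast = np.log(np.abs(s)).sum()        # O(c)
logdet_full = np.linalg.slogdet(w_lu)[1]     # O(c^3)

# The convolution is undone by applying the inverse matrix.
x = rng.normal(size=(c, width, height))
y = conv1x1(x, w_lu)
x_back = conv1x1(y, np.linalg.inv(w_lu))
layer_logdet = width * height * logdet_fast  # h * w * log|det W|
print(logdet_rot, logdet_fast, logdet_full)
```

The shortcut works because det(P) is ±1, det(L) is 1, and the determinant of the triangular factor U + diag(s) is the product of the entries of s.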

Affine coupling layer: the affine coupling layer is a powerful invertible transformation whose forward function, inverse function, and log-determinant are all computationally efficient. It is the bijective function used when processing the traffic data (that is, for any y there is a uniquely determined corresponding x), and a bijective model is built by stacking a series of simple bijective functions. In each simple bijection, one part of the input vector is updated with a function that is easy to invert but that depends in a complex way on the rest of the input vector. These stacked bijective functions are called affine coupling layers. The special case of a coupling layer with scale s = 1, whose log-determinant is 0, can also be used.
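A minimal sketch of one affine coupling layer on flat vectors follows. The first t dimensions pass through unchanged and parameterize a log-scale l and a shift m for the remaining dimensions. The small random network standing in for l and m, the split index, and the function names are assumptions made for the example.

```python
import numpy as np


def make_l_and_m(d_in, d_out, rng, hidden=8):
    """Tiny one-hidden-layer network returning log-scale l(x) and shift m(x)."""
    w1 = rng.normal(scale=0.5, size=(d_in, hidden))
    w2 = rng.normal(scale=0.5, size=(hidden, 2 * d_out))

    def net(x_a):
        out = np.tanh(x_a @ w1) @ w2
        return out[..., :d_out], out[..., d_out:]

    return net


def coupling_forward(x, t, net):
    """y[:t] = x[:t]; y[t:] = x[t:] * exp(l(x[:t])) + m(x[:t])."""
    x_a, x_b = x[..., :t], x[..., t:]
    log_scale, shift = net(x_a)
    y_b = x_b * np.exp(log_scale) + shift
    logdet = log_scale.sum(axis=-1)          # log|det| of the Jacobian
    return np.concatenate([x_a, y_b], axis=-1), logdet


def coupling_inverse(y, t, net):
    """Exact inverse: the untouched half reproduces l and m."""
    y_a, y_b = y[..., :t], y[..., t:]
    log_scale, shift = net(y_a)
    x_b = (y_b - shift) * np.exp(-log_scale)
    return np.concatenate([y_a, x_b], axis=-1)


rng = np.random.default_rng(0)
t, dim = 3, 6
net = make_l_and_m(t, dim - t, rng)
x = rng.normal(size=(5, dim))

y, logdet = coupling_forward(x, t, net)
x_rec = coupling_inverse(y, t, net)
print(np.abs(x_rec - x).max(), logdet.shape)
```

The network for l and m never has to be inverted, which is why it can be arbitrarily complex while the layer as a whole stays cheaply invertible with a triangular Jacobian.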

Zero initialization: the last convolution of each neural network is initialized with zeros, so that each bijective coupling layer initially performs an identity function, which helps in training deep networks. Split and concatenation: a split function Split() divides the preprocessed network traffic tensor h of shape [c × w × h] into two parts along the channel dimension, and a concatenation function performs the reverse operation, joining the separated tensors into a single tensor. In the traffic processing steps above, some permutation of the variables should be applied so that, after successive processing steps, the dimensions of each type of traffic can influence one another. The permutation used here corresponds to reversing the order of the channels at the affine coupling layer.
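The effect of zero initialization, and the round trip between splitting and concatenating along the channel axis, can be illustrated as follows. The helper names, the single linear map standing in for the "last convolution", and the tensor sizes are hypothetical.

```python
import numpy as np


def reverse_channels(h):
    """Reverse the channel order of a (c, w, h) tensor; it is its own inverse."""
    return h[::-1]


def split(h):
    """Split a (c, w, h) tensor into two halves along the channel axis."""
    half = h.shape[0] // 2
    return h[:half], h[half:]


def concat(h_a, h_b):
    """Reverse of split: join two tensors along the channel axis."""
    return np.concatenate([h_a, h_b], axis=0)


def coupling(h, last_weight):
    """Affine coupling whose last layer is a per-position linear map."""
    h_a, h_b = split(h)
    out = np.einsum("oc,cab->oab", last_weight, np.tanh(h_a))
    log_scale, shift = split(out)
    return concat(h_a, h_b * np.exp(log_scale) + shift)


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 2, 3))

# With a zero-initialized last layer, log_scale = 0 and shift = 0.
zero_weight = np.zeros((4, 2))
out = coupling(h, zero_weight)
print(np.array_equal(out, h))
```

Starting every coupling layer as the identity means a deep stack initially passes its input through unchanged, so early gradients are well behaved; reversing the channels between layers ensures the half that was left untouched is transformed next.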

Through these powerful invertible and learnable transformations the space of the model is expanded, yielding a data generation algorithm with exact log-likelihood computation, exact sampling, exact latent-variable inference, and an interpretable latent space, which is used to process the various classes of data in the original data. The enhanced data set s_aug is then adjusted by negative feedback, following the sequence in the flow chart of FIG. 1, until ideal samples are generated; the resulting balanced data set can be used as the input for model training in fields of study such as network traffic detection.

The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments may be referred to one another. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so its description is brief, and the relevant points can be found in the description of the method.

The elements of the various examples and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and the components and steps of the examples have been described in a functional generic sense in the foregoing description for clarity of hardware and software interchangeability. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Those skilled in the art will appreciate that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, such as: read-only memory, magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A network traffic data processing method is used for network traffic classification detection, and is characterized by comprising the following steps:

2. The method according to claim 1, wherein during unbalanced data sampling the unbalanced data in the data set is preprocessed by a sampling means, the sampling means including but not limited to SMOTE oversampling and undersampling.

3. The network traffic data processing method according to claim 1 or 2, characterized in that the network traffic data is expressed in a data shape having a tensor structure, and a network traffic sequence is obtained as data to be processed by sampling and noise processing for unbalanced data in turn.

4. The method according to claim 1, wherein the data generator employs a flow-based generative model comprising a data initialization layer for normalizing the data to be processed, a convolution layer that reverses the channel order of the input data through a convolution operation, and an affine coupling mapping layer for simplifying the network structure.

5. The method according to claim 4, wherein the data to be processed is first compressed and then input into the generative model, and the sample data is obtained by the generative model in combination with the data compression.

6. The method of claim 4, wherein the data initialization layer applies batch normalization to the input network traffic, and the affine coupling mapping layer simplifies the traffic sequence through a 1 × 1 invertible matrix.

7. The method according to claim 4, wherein the affine coupling mapping layer uses bijective functions and builds a bijective model by stacking the bijective functions to complete the simplification of the traffic sequence.

8. The network traffic data processing method according to claim 1 or 4, wherein the affine coupling mapping layer splits the network traffic tensor with a split function and joins the split tensors back together with a concatenation function, which performs the reverse operation.

9. The network traffic data processing method according to claim 1, wherein the enhanced data set is used as the input of a classification detection model; each sample in the data set is traversed, the class means of the features are computed using a distance measure on the data, and class-conditional probability estimates are obtained from the class means; and a back propagation algorithm performs gradient computation so that the similarity probability between the generated sample data and the original network traffic data reaches the expected value.

10. A network traffic data processing system for classification detection of network traffic, comprising: a sampling module, a generating module and an enhancing module, wherein,