CN115134128A - Method for mining and utilizing new type encrypted network flow packet in distributed scene - Google Patents


Info

Publication number
CN115134128A
CN115134128A
Authority
CN
China
Prior art keywords
model
new type
node
local
samples
Prior art date
Legal status
Pending
Application number
CN202210665404.7A
Other languages
Chinese (zh)
Inventor
张平
唐艳艳
Current Assignee
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202210665404.7A
Publication of CN115134128A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425Traffic logging, e.g. anomaly detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for mining and utilizing new types of encrypted network traffic packets in a distributed scenario. The new types of traffic packets detected on different network nodes in a distributed scenario contain valuable pattern information. The scheme designed by the invention can perform globally consistent class division and class label assignment on the new types of encrypted network traffic packets distributed across different network traffic monitoring nodes. The scheme can also use the labeled new type traffic packet samples to quickly update the various existing global models (such as the feature vector extraction model, the new type encrypted network traffic packet detection model, and the existing type encrypted network traffic classification model) so as to expand the class identification capability of these models.

Description

Method for mining and utilizing new type encrypted network flow packet in distributed scene
Technical Field
The invention relates to the field of network security, in particular to a method for monitoring and managing a network traffic packet in a distributed scene.
Background
Network traffic packet classification is a crucial task in network management and cyberspace security. Network management typically needs to classify network traffic packets into different categories and then employ different routing or firewall configuration policies for the different types of network traffic packets. For example, we can divide the network traffic packets according to the application classes and allocate different priorities to different classes of network traffic packets to guarantee the network quality of service (QoS) of the high-priority service. As another example, network packet classification may be used for network intrusion detection. The network data packet is classified into a benign traffic packet and a malicious traffic packet, so that the purpose of network anomaly detection can be achieved.
Most network traffic is currently encrypted traffic. Most network applications introduce secure communication protocols, such as SSL (Secure Sockets Layer) and TLS (Transport Layer Security), to improve their security. At the same time, much malware encrypts its network traffic packets to escape detection by firewalls and network intrusion detection systems. Since the payload of an encrypted network traffic packet is in an encrypted state, this poses a challenge to traditional traffic classification methods such as Deep Packet Inspection (DPI). Network traffic classifiers based on classic machine learning generally require manual feature design and selection, which makes them difficult to implement and limits their classification precision.
In recent years, deep learning techniques have been introduced into encrypted network traffic classification scenarios. However, the network encryption traffic classification scheme based on deep learning has many challenges that are disjointed from the real scene.
First, the training of deep learning models requires a large amount of sample support, otherwise the over-fitting problem is easily induced. The deep learning model is generally complex, parameters to be trained are numerous, and the construction of the high-precision encryption flow classifier based on deep learning needs the support of a large number of marked training samples. However, it is not easy to collect a large amount of correctly labeled encrypted traffic. Because the traffic packet load is in an encrypted state, the cost of encrypted network traffic type analysis and labeling is very high. The capacity of a single monitoring node is limited, and the number of encrypted traffic packet samples which can be marked is limited.
Second, a classification model of value for an application should be able to identify as many traffic classes as possible. However, the coverage area of a single network monitoring node is limited, the types of samples which can be collected are limited, and the model identification capability is limited. The network traffic packet distribution usually has some regional characteristics, for example, the types of network traffic generated by different types of network users are not completely consistent. For another example, network viruses usually burst in a certain area and then spread to other areas.
Furthermore, new types of traffic packets emerge constantly and cannot be classified correctly by models trained only on existing traffic packet samples. In a real application scenario, the types of network traffic are not fixed, and we often encounter a large number of new type network traffic packets. There are many reasons for the frequent appearance of new types of network traffic packets. On the one hand, novel network applications emerge in an endless stream, and new network applications inevitably lead to new network traffic patterns. On the other hand, to escape network monitoring, malicious network users usually change their behavior patterns, thereby causing malicious network traffic patterns to change.
Therefore, it is necessary to research the problem of monitoring and management of the encrypted network traffic packets, which is closer to the real situation, in the distributed network monitoring scene, where the existing type and the new type of encrypted network traffic packets exist at the same time. In the scenario studied by the present invention, there are multiple network monitoring nodes. A plurality of network monitoring nodes (nodes for short) are distributed at the entrance positions of different network areas to monitor the network traffic of the areas. Each node accumulates a certain amount of labeled network traffic samples. The types of network traffic corresponding to these labeled samples are collectively referred to as the existing types. Correspondingly, a new type means that no samples of that category have yet been labeled. In this scenario, both existing and new types of encrypted network traffic packets exist. The newly received traffic packet samples may be either existing type samples or new type samples.
Disclosure of Invention
The invention aims to solve the technical problem of providing a method for mining and utilizing a new type of encrypted network traffic packet in a distributed scene aiming at the defects of the prior art. The technical scheme of the invention is as follows:
a method for mining and utilizing a new type of encrypted network traffic packet in a distributed scene is characterized by comprising the following steps:
(1) a preparation stage: a plurality of network traffic monitoring nodes (referred to as "nodes" for short) respectively monitor the network traffic of different network areas which are respectively responsible for the network traffic; each node independently collects a certain number of network traffic packet samples (referred to as labeled samples) which are subjected to class labeling; a plurality of network flow monitoring nodes cooperate with each other to train a new type of network flow packet detection model; the new type of network flow packet means that no network flow packet sample of the type is subjected to category marking;
(2) detecting a new type of flow packets: each node detects a new type of network traffic packet from the newly received network traffic packets respectively; the mining and the utilization of the new type of traffic packets are carried out in a periodic mode, and each round of mining and utilization operation is carried out on the basis of all the new type of traffic packets detected in the current period;
(3) subcategory discovery: each node independently performs a local clustering operation on the new type traffic packets detected in the current cycle; each node independently assigns labels to the samples of each subcategory in its clustering result; within a local clustering result, new type traffic packet samples of the same subcategory are assigned the same local label; the labels of different local subcategories differ from each other;
(4) local subcategory feature vector extraction: each node selects a global uniform reference; on the basis of the global uniform reference, each node extracts globally consistent class feature vectors for each local sub-class; each node uploads the feature vectors of the local subcategories to the sink node together with the local subcategory labels corresponding to the feature vectors;
(5) global consistency category labeling: the sink node collects the subcategory feature vectors and local label information from different nodes; the aggregation node performs global clustering on the basis of all collected sub-category feature vectors; the sink node distributes global labels to all sub-categories in the global clustering result; the sink node establishes a mapping scheme of a local label and a global label for each node, and returns the mapping scheme to the corresponding node respectively; each node distributes global labels to each subcategory sample by using the received mapping scheme;
(6) updating the model: and (4) expanding the model by a plurality of network flow monitoring nodes, and training the expanded model by utilizing the respectively collected samples distributed with the global labels in the step (5) in a cooperative mode until the model converges or reaches a preset error threshold.
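The claim steps above can be sketched as a per-round driver for the detection and local-clustering stages (steps (2)-(3)); all function names, the toy packet records, and the grouping key below are illustrative assumptions, not the patent's actual detection or clustering logic:

```python
# Sketch of one mining/utilization round per claim steps (2)-(3).
# Packet records, type keys, and label formats are made-up stand-ins.

def detect_new_type(node_packets, known_types):
    """Step (2): separate packets whose type has no labeled samples yet."""
    return [p for p in node_packets if p["type"] not in known_types]

def local_cluster(new_packets):
    """Step (3): cluster locally; here we simply group by a toy key,
    then assign each subcategory a distinct local label."""
    clusters = {}
    for p in new_packets:
        clusters.setdefault(p["type"], []).append(p)
    return {f"local-{i}": v for i, (_, v) in enumerate(sorted(clusters.items()))}

def mining_round(nodes, known_types):
    """Run steps (2)-(3) on every node; steps (4)-(6) would follow."""
    per_node = {}
    for name, packets in nodes.items():
        new = detect_new_type(packets, known_types)
        per_node[name] = local_cluster(new)
    return per_node

nodes = {
    "node-A": [{"type": "https"}, {"type": "botnet-X"}, {"type": "botnet-X"}],
    "node-B": [{"type": "dns"}, {"type": "botnet-X"}],
}
result = mining_round(nodes, known_types={"https", "dns"})
```

Note that two nodes that observe the same new pattern here end up with independent local labels; reconciling those labels is exactly what steps (4)-(5) are for.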
As a further optimization, the specific steps of the step (4) are as follows:
(4.1) designing a globally consistent reference model:
the globally consistent reference model is defined as: y = f_μ(x) = f_e(f_θ(x)) = argmax(softmax(f_θ(x))); the submodel f_θ is the encrypted network traffic packet feature extraction model; each node initializes the submodel f_θ with the globally optimal model parameters θ*; the submodel f_e contains no parameters to be optimized and requires no initialization;
(4.2) training of the subcategory incremental model: each node independently trains an incremental model for each of its different local subcategory samples; the optimization equation for incremental training is:

θ_c* = argmin_θ Σ_{x ∈ D_c} l(f_e(f_θ(x)), y_c)

where D_c is the local sample set of subcategory c, y_c is its local subcategory label, and training starts from the globally optimal initialization θ*;
(4.3) extracting the subcategory features based on the incremental model: each node selects a parameter subset from each subcategory model parameter according to the same rule as a feature vector of the subcategory; each node uploads each local sub-category feature vector and each local sub-category label to a sink node;
(4.4) globally consistent sub-category label assignment: the sink node performs global clustering on all local subcategories on the basis of the collected subcategory feature vectors, and allocates different global labels to each global subcategory according to a global clustering result; the sink node establishes a mapping scheme of local subcategory labels and global labels for the local subcategories of all the nodes according to the collected local labels of the subcategories and the redistributed global labels, and feeds back the mapping relation to the corresponding nodes; and modifying the local class label of each sample into a global class label by each node according to the received mapping scheme.
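A minimal sketch of how the sink node in step (4.4) could turn uploaded subcategory feature vectors into a local-to-global label mapping; the greedy distance-threshold clustering and the `eps` value are stand-ins, since the claims do not fix a particular clustering algorithm:

```python
# Hedged sketch of step (4.4): the sink node clusters subcategory
# feature vectors from all nodes and returns a (node, local label)
# -> global label mapping.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def assign_global_labels(uploads, eps=0.5):
    """uploads: list of (node_id, local_label, feature_vector)."""
    centers = []   # one representative vector per global label
    mapping = {}   # (node_id, local_label) -> global label
    for node_id, local_label, vec in uploads:
        for g, c in enumerate(centers):
            if euclid(vec, c) < eps:          # close enough: same global class
                mapping[(node_id, local_label)] = g
                break
        else:                                  # no match: open a new global class
            mapping[(node_id, local_label)] = len(centers)
            centers.append(vec)
    return mapping

uploads = [
    ("node-A", "local-0", [0.9, 0.1]),
    ("node-B", "local-0", [0.88, 0.12]),  # same pattern as node-A's local-0
    ("node-B", "local-1", [0.1, 0.9]),    # a genuinely different subcategory
]
mapping = assign_global_labels(uploads)
```

The returned mapping is exactly what the sink node would split per node and send back, so that each node can rewrite its local subcategory labels into global ones.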
As a further optimization, the specific steps of the step (6) are as follows:
(6.1) extension of the model: expanding the model according to the total number of the newly added categories and the total number of the newly added samples; increasing the number of output layer neurons of the model when the number of newly added classes and samples is small; when the number of the newly added classes and the number of samples are very large, the number of middle-level processing layers or the number of neurons of each level is also required to be increased;
(6.2) model initialization: initializing the original basic model parameters in the model subjected to model expansion by using the existing optimal characteristic parameters; initializing each neuron parameter of the model extension part by using a random number;
(6.3) optimization equation: the optimization equation is defined as

θ'* = argmin_θ' Σ_{(x, y) ∈ D} l(f_e(f'_θ(x)), y)

where f'_θ is the extended model and D is the set of samples assigned global labels;
(6.4) model training: and training the expanded model by using the collected samples distributed with the global labels by the plurality of network flow monitoring nodes in a cooperative mode until the model converges or reaches a preset error threshold.
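Steps (6.1)-(6.2) can be illustrated with a plain weight-matrix sketch: the trained output layer is retained and only the rows for newly added classes are randomly initialized. The shapes, the scale of the random initialization, and the zero biases are assumptions:

```python
import numpy as np

# Sketch of model extension (6.1) and initialization (6.2): grow the
# output layer by the number of newly discovered global classes, reuse
# the existing optimal parameters, and random-init only the new neurons.

rng = np.random.default_rng(0)

def extend_output_layer(W, b, n_new):
    """W: (old_classes, features) trained weights; b: (old_classes,) biases."""
    W_new = rng.normal(scale=0.01, size=(n_new, W.shape[1]))  # random init
    b_new = np.zeros(n_new)
    return np.vstack([W, W_new]), np.concatenate([b, b_new])

W = np.ones((3, 8))   # stand-in for trained weights of 3 existing classes
b = np.zeros(3)
W2, b2 = extend_output_layer(W, b, n_new=2)   # 2 new global classes
```

When the number of new classes and samples is large, the same copy-then-extend idea would also be applied to intermediate layers, as step (6.1) notes.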
Beneficial effects:
the scheme adopted by the invention designs a new type of encrypted network flow packet mining and utilizing method in a distributed scene. The new type of traffic packets detected on different network nodes in a distributed scenario contain valuable mode information. The scheme designed by the invention can carry out globally consistent class division and class label distribution on the new type of encrypted network traffic packets distributed on different network traffic monitoring nodes. The scheme can also utilize the marked new type traffic packet samples to quickly update the existing various global models (such as a feature vector extraction model, a new type encryption network traffic packet detection model, an existing type encryption network traffic classification model and the like) so as to expand the class identification capability of the models.
Drawings
FIG. 1 is a schematic diagram of a feature extraction model structure
FIG. 2(a) distribution of the top-3 feature vector elements of new type traffic samples
FIG. 2(b) distribution of the top-3 feature vector elements of existing type traffic samples
FIG. 3 high-confidence new type traffic packet sample extraction model
FIG. 4 Category representation of incremental model parameters
FIG. 5(a) two-dimensional spatial view of network traffic packets (dimensions of the 1st- and 2nd-largest feature vector elements)
FIG. 5(b) two-dimensional spatial view of network traffic packets (dimensions of the 1st- and 3rd-largest feature vector elements)
FIG. 5(c) two-dimensional spatial view of network traffic packets (dimensions of the 2nd- and 3rd-largest feature vector elements)
FIG. 6(a) expression capability of bias parameters of first layer
FIG. 6(b) expression capability of kernel parameter of first layer
FIG. 7(a) expression ability of bias parameter of last layer
FIG. 7(b) expression ability of kernel parameter of last layer
FIG. 8 globally consistent class label assignments
Detailed description of the embodiments:
the specific implementation process of the invention is as follows:
the invention researches a monitoring and management problem of the encrypted network flow packet in a distributed scene which is similar to a real scene. In the problem scenario, a plurality of network monitoring nodes (simply referred to as "nodes") exist, and each network monitoring node independently monitors and manages encrypted network traffic packets in the jurisdiction. Each network monitoring node has accumulated some labeled encrypted network traffic packet samples. The number and types of marked samples on each node are limited, and the training of a complex deep learning model cannot be completed independently.
The encrypted network traffic packets newly received by each node have both existing type traffic packets and new type traffic packets. An existing type of encrypted network traffic packet (referred to simply as an "existing type of traffic packet") means that some sample of the encrypted network traffic packet of that type has been assigned the correct class label. Such a sample to which a label is assigned is called a labeled sample, or simply a labeled sample. A new type of encrypted network traffic packet (referred to simply as a "new type of traffic packet") means that no sample of that type of encrypted network traffic packet has been assigned a class label. We assume that different network monitoring nodes all assign the same class label to the same type of encrypted traffic packet sample.
In order to solve the problem that the number and types of labeled network traffic packets of a single network node are limited, model training is cooperatively performed on different nodes by using respective labeled samples. By integrating the sample resources of a plurality of nodes, the number of labeled samples for model training can be increased to avoid the over-fitting problem, and the trained model can learn the flow pattern characteristic differences of different network areas.
The inventors have mainly conducted the following three specific studies with respect to the problems studied.
(1) The method for extracting the characteristics of the encrypted network traffic packet in the distributed scene comprises the following steps: the feature extraction model can be used in other methods such as new type flow packet detection, new type flow packet marking, existing type flow packet classification and the like.
(2) The new type encrypted network flow packet detection method under the distributed scene comprises the following steps: in the newly received network traffic, the existing type and the new type of encrypted network traffic packets coexist. If we classify the newly received network traffic directly, the new type of traffic packet will be wrongly classified into some existing type, resulting in classification errors. Therefore, it is necessary to detect and separate a new type of encrypted network traffic packet from newly received network traffic of different nodes.
(3) The method for mining and utilizing the new type of encrypted network traffic packet in the distributed scene comprises the following steps: the new type of traffic packets detected at different network nodes contain valuable pattern information. We will study how to mine this information and use it to update existing models.
Method for extracting characteristics of encrypted network traffic packet in distributed scene
This section introduces a method for extracting characteristics of an encrypted network traffic packet in a distributed scenario. The method can be used for other methods such as new type flow detection, new type flow marking, existing type flow classification and the like. The method for extracting the characteristics of the encrypted network traffic packet in the distributed scene mainly comprises the following steps: first, a feature extraction model is designed. This model is used to directly convert the original encrypted traffic packets into feature vectors. Then, a training method of the feature extraction model in the distributed scene is designed. The method comprises the following specific steps:
(1) a preparation stage: the network flow monitoring nodes respectively monitor the network flow of different network areas which are respectively responsible for the network flow monitoring nodes; each node independently collects a certain number of network traffic packet samples (referred to as labeled samples) which are subjected to class labeling (are distributed with class labels);
(2) constructing a feature extraction model: the network traffic packet feature extraction model f_θ can be expressed as v = f_θ(x), where x is an encrypted network traffic packet and v is the feature vector extracted by the model; the feature extraction model f_θ contains at least a one-dimensional convolution (1D CNN) layer and an Attention layer; the output of the Attention layer is converted into a group of weights; this group of weights serves as the weights of the different channels of the one-dimensional convolutional layer and modifies that layer's original output values; as an optimization, the feature extraction model f_θ also includes a one-dimensional pooling layer, a fully connected layer, and an activation layer; FIG. 1 shows the combination of a one-dimensional convolutional layer and an Attention layer in the feature extraction model. In particular, the feature extraction model generally includes multiple convolutional layers. The conventional convolutional layer structures in the computer vision field are mainly two-dimensional and three-dimensional convolutional layers, and some researchers have applied them to encrypted traffic classification scenarios. However, network traffic is essentially sequential data, a one-dimensional byte stream, so the feature extraction model adopts a one-dimensional convolutional layer and a one-dimensional pooling layer as the basic components of its convolutional neural network. Meanwhile, an Attention layer is also introduced into the feature extraction model. The Attention layer uses the output of a certain convolutional layer as its input to capture the feature differences between the different channels of that convolutional layer.
The Attention layer, combined with Softmax, converts the captured difference information into a set of weights. This set of weights serves as the weights of the different channels of the convolutional layer and rescales the layer's original output values, thereby dynamically weighting the different output features of the convolutional layer. As an optimization, the feature extraction model f_θ also includes a one-dimensional pooling layer, a fully connected layer, and an activation layer. In FIG. 1, "1D CNN" represents a sub-network of the artificial neural network whose main component is a one-dimensional convolutional layer, and the "other layers" in FIG. 1 generally consist of components such as one-dimensional convolutional layers, pooling layers, and fully connected layers.
The hierarchical depth and structure of the middle layers of the feature extraction model need to be determined comprehensively according to factors such as the number of training samples and machine performance. According to deep learning theory, given enough samples, the more complex the model structure and the deeper the hierarchy, the stronger the expressive capability of the model in general. The number of output layer neurons of the feature extraction model is kept on the same order of magnitude as the number of existing encrypted network traffic packet types, and can generally be set equal to or slightly larger than the number of existing types. A simpler embodiment of the feature extraction model is given below, suitable for scenarios where the number of training samples is limited. The input of this example model is an encrypted network traffic packet x in one-dimensional form, and the output is the corresponding feature vector v. The model consists of 7 layers: 3 convolutional layers, two pooling layers, and two fully connected layers. The Attention layer is inserted after the second convolutional layer in the form of a bypass. The results computed by the Attention layer are used to dynamically weight the output features of the second convolutional layer: each output of the second convolutional layer is multiplied by the corresponding weight in the set of weights provided by the Attention layer, yielding a weighted output. The weighted output of the second convolutional layer is then passed to subsequent modules for processing.
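The core pattern of FIG. 1, a 1D convolution whose channel outputs are dynamically reweighted by softmax attention scores, might be sketched as follows; the kernel size, channel count, and mean-pooling as the channel-scoring step are assumptions, not the patent's exact layer configuration:

```python
import numpy as np

# Minimal sketch of the "1D CNN + Attention" combination: a valid 1D
# convolution produces per-channel feature maps, and a softmax over
# per-channel scores reweights those channels dynamically.

rng = np.random.default_rng(1)

def conv1d(x, kernels):
    """x: (length,); kernels: (channels, k). Valid 1D convolution."""
    c, k = kernels.shape
    out_len = len(x) - k + 1
    out = np.empty((c, out_len))
    for i in range(c):
        for t in range(out_len):
            out[i, t] = np.dot(x[t:t + k], kernels[i])
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weighted_conv(x, kernels):
    feat = conv1d(x, kernels)        # (channels, out_len)
    scores = feat.mean(axis=1)       # one score per channel (assumed pooling)
    weights = softmax(scores)        # Attention output -> channel weights
    return feat * weights[:, None]   # dynamic per-channel weighting

x = rng.normal(size=32)              # stand-in for a one-dimensional byte stream
kernels = rng.normal(size=(4, 5))    # 4 channels, kernel width 5
out = attention_weighted_conv(x, kernels)
```

In the embodiment described above, this reweighting sits as a bypass after the second convolutional layer, with the weighted output flowing on to the pooling and fully connected layers.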
To train the feature extraction model f_θ, we need to solve the following two problems. First, the input of the feature extraction model is a traffic packet sample and the output is a feature vector; however, there is no prior knowledge about the optimal feature vector, so the training process cannot be guided directly. Second, the sample resources are distributed over multiple independent nodes, so a cooperative training mechanism needs to be constructed.
Theoretically, the feature extractor should not change the class attribution of traffic packet samples. That is, samples of the same traffic packet type should lie close to each other in the feature space. Therefore, the class labels of the labeled samples can supervise and guide the training process and optimization direction of the feature extractor. To realize this idea, an interface model f_e is constructed on the basis of the feature vector v.
(3) constructing an interface model: the interface model f_e is formed by nesting two modules, Softmax and Argmax; the interface model can be expressed as y = f_e(v) = argmax(softmax(v));
(4) constructing an optimization equation: the optimization equation can be expressed as

θ* = argmin_θ Σ_{(x, y) ∈ D} l(f_e(f_θ(x)), y)

where l is the loss function and D is the set of labeled samples;
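A small sketch of the interface model f_e(v) = argmax(softmax(v)) together with a cross-entropy loss as l; using the softmax probabilities (rather than the non-differentiable argmax output) inside the loss is an assumption consistent with standard practice:

```python
import numpy as np

# Sketch of the interface model and a per-sample loss l. The feature
# vector v stands in for f_theta(x); its values are made up.

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def f_e(v):
    """Interface model: argmax(softmax(v)) -> predicted class label."""
    return int(np.argmax(softmax(v)))

def cross_entropy(v, y):
    """Loss l for one labeled sample with true class index y."""
    return -np.log(softmax(v)[y])

v = np.array([0.2, 2.5, -1.0])   # stand-in feature vector from f_theta
pred = f_e(v)
```

The optimization equation then sums `cross_entropy` over all labeled samples and minimizes over θ.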
the feature extraction model is constructed based on a deep neural network technology, and a large number of marked training samples are required to be used as supports. In order to increase the number of samples for model training and improve the accuracy of the model, a distributed training scheme of the model is constructed to realize model-level sharing of sample resources accumulated by each node.
(5) distributed training of the model: multiple network traffic monitoring nodes ("nodes" for short) use the labeled network traffic packet samples respectively collected in step (1) to train the feature extraction model f_θ of step (2) in a cooperative mode according to the optimization equation given in step (4), until the model converges or reaches a predetermined error threshold. The specific steps of the distributed training of the model are as follows:
(5.1) model initialization: a node is selected as the sink node; the sink node first randomly initializes the parameters of the feature extraction model f_θ to obtain the initial parameters θ_0, then sends θ_0 to the other nodes;
(5.2) local model training: node i uses the received θ_0 to initialize f_θ and constructs the local optimization equation

θ_i^1 = argmin_θ Σ_{(x, y) ∈ D_i} l(f_e(f_θ(x)), y)

where D_i is node i's locally accumulated labeled encrypted network traffic data set; node i optimizes the model f_θ based on this equation to obtain the optimized model parameters θ_i^1, and feeds the optimization result θ_i^1 back to the sink node;
(5.3) generating the model parameters of the current stage: the sink node receives the feedback results {θ_i^1} from all participating nodes and calculates their mathematical expectation; with n participating nodes, the model parameters of this round of distributed training are

θ^1 = (1/n) Σ_{i=1}^{n} θ_i^1

the sink node then sends the current-stage model parameters θ^1 to the other nodes;
(5.4) repeating steps (5.2)-(5.3) until the model converges or reaches a preset error threshold, thereby obtaining the current optimal model parameters θ*;
(5.5) all nodes obtain the current optimal model parameters θ* from the sink node and construct the current optimal feature extraction model f_θ for extracting feature vectors of network traffic packets.
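The averaging loop of steps (5.1)-(5.4) can be sketched as follows; each node's local training is reduced to a toy gradient step on a quadratic objective, so only the sink node's expectation-taking reflects the scheme itself, while the per-node objectives and learning rate are made-up stand-ins:

```python
import numpy as np

# Sketch of the distributed training loop: the sink node broadcasts
# theta, each node takes a local optimization step, and the sink node
# averages the feedbacks to form the next-stage parameters.

def local_step(theta, local_target, lr=0.5):
    """Toy stand-in for step (5.2): move theta toward this node's optimum."""
    return theta - lr * (theta - local_target)

def federated_round(theta, node_targets):
    """Steps (5.2)-(5.3): local training, then averaging at the sink node."""
    feedbacks = [local_step(theta, t) for t in node_targets]
    return np.mean(feedbacks, axis=0)

theta = np.zeros(2)                                   # theta_0 from step (5.1)
node_targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for _ in range(50):                                   # step (5.4): repeat
    theta = federated_round(theta, node_targets)
```

Here the iteration settles at the average of the two nodes' optima, illustrating how the averaged model reflects sample resources from every node rather than any single one.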
Method for detecting new type encrypted network flow packet in distributed scene
In the newly received network traffic, existing type and new type encrypted network traffic packets coexist. If the newly received network traffic is classified directly, new type traffic packets will be wrongly classified into some existing type, resulting in classification errors. Therefore, a new type encrypted network traffic packet detection method is designed for the distributed scenario, for detecting and separating new type encrypted network traffic packets from the newly received network traffic of the different nodes. The inventor proposes the following new type network traffic packet detection method in a distributed scenario:
(1) a preparation stage: the network flow monitoring nodes respectively monitor the network flow of different network areas which are respectively responsible for the network flow monitoring nodes; each node independently collects a certain number of network traffic packet samples (referred to as labeled samples) which are subjected to class labeling (distributed with class labels); the new type of network flow packet means that no network flow packet sample of the type is subjected to class marking;
(2) training a feature extraction model: the feature extraction model may be expressed as v = f_θ(x), where x is a network traffic packet and v is the feature vector extracted by the model; a plurality of network traffic monitoring nodes (referred to as "nodes" for short) use their respectively collected labeled network traffic packet samples to train the optimal parameter θ* of the feature extraction model in a cooperative manner;
(3) acquisition of positive samples for detection model training: each node extracts a feature vector from each newly received network traffic packet using the feature extraction model, and compares the vector with a preset threshold vector to judge whether the traffic packet should be used as a positive sample for training the new type network traffic packet detection model; the labels of all positive samples are set to the same value; the specific steps for obtaining the positive samples are as follows:
(3.1) defining a threshold vector [α_1, α_2, …, α_k]:
The length of the threshold vector is k; each element α_i of the threshold vector marks an interval range; each element α_i consists of two parts, flag and value, where flag ∈ {+, −} and value ∈ (0, 1); a negative flag indicates that value defines the left boundary of the interval, with the right boundary being 1 (i.e., [value, 1]); a positive flag indicates that value defines the right boundary, with the left boundary being 0 (i.e., [0, value]);
The threshold vector [α_1, α_2, …, α_k] is determined from historical data, as illustrated by the following example. Equal numbers of existing-type and new-type traffic packet samples are input into the feature extraction model to obtain their feature vectors. To produce threshold values that can be compared quantitatively in subsequent steps, we normalize the feature vectors with a Softmax module. For each sample's feature vector, the elements are sorted in descending order. Most element values in each vector are close to 0, on the same order of magnitude as noise, so analyzing and comparing them is of little significance. Therefore, we record only the top-k elements of each vector. For existing-type and new-type traffic packets, we draw histograms of their top-k feature vector elements separately for comparison. Fig. 2 shows the histogram statistics of the top-3 feature vector elements, where the horizontal axis represents element values and the vertical axis represents the number of samples at each value. Fig. 2(a) shows the distributions of the first three sorted elements of the new-type traffic feature vectors (k = 1, 2, 3 in the corresponding subplots), and Fig. 2(b) shows the distributions of the first three sorted elements of the existing-type traffic feature vectors (k = 1, 2, 3 in the corresponding subplots). The first column corresponds to the distribution of the top-1 vector element, and the second and third columns correspond to the second-largest and third-largest vector elements.
Although the distribution intervals of the new-type and existing-type samples overlap, we can still select distribution intervals that yield new-type samples with very high confidence. Taking Fig. 2 as an example, when the top-3 elements of a sample's output vector lie in the intervals [0, 0.75], [0.2, 1] and [0.1, 1] respectively, the confidence that the sample is a new-type sample is very high. For the example of Fig. 2, the threshold vector may be represented as [+0.75, −0.2, −0.1]. With this threshold vector, we can construct the high-confidence new-type traffic packet sample extraction model shown in Fig. 3.
(3.2) extracting the top-k elements of the feature vector of a network traffic packet: extract the feature vector of the traffic packet; the length of the feature vector is greater than or equal to k; sort the elements of the extracted feature vector and keep the k largest (top-k) elements, denoted [v′_1, v′_2, …, v′_k];
(3.3) obtaining positive samples for detection model training: compare the threshold vector [α_1, α_2, …, α_k] with the top-k elements [v′_1, v′_2, …, v′_k] of the feature vector to determine whether the packet is a high-confidence new-type sample; check whether the k elements of [v′_1, v′_2, …, v′_k] each lie within the interval marked by the corresponding element of [α_1, α_2, …, α_k]; if all elements of [v′_1, v′_2, …, v′_k] lie within the corresponding intervals, set a positive-sample label for the sample and add it to the positive sample set for detection model training. The samples in the positive sample set represent new-type samples with very high confidence. The specific algorithm is as follows:
(The algorithm listing is given in the accompanying figure.)
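The threshold comparison of steps (3.1)-(3.3) can be sketched in plain Python as follows (a minimal illustration assuming the signed flag/value encoding and Softmax normalization described above; the function and variable names are our own, not the patent's):

```python
import math

def decode_interval(t):
    """Decode one signed threshold element: a positive value marks the
    interval [0, t] (left boundary 0), a negative value marks [|t|, 1]."""
    return (0.0, t) if t > 0 else (abs(t), 1.0)

def is_high_confidence_new_type(feature_vec, thresholds):
    """Return True if the top-k softmax-normalized elements of the feature
    vector all fall inside the intervals marked by the signed threshold
    vector [a_1, ..., a_k]."""
    exps = [math.exp(v) for v in feature_vec]
    total = sum(exps)
    top = sorted((e / total for e in exps), reverse=True)[:len(thresholds)]
    for value, t in zip(top, thresholds):
        lo, hi = decode_interval(t)
        if not (lo <= value <= hi):
            return False
    return True

# Example thresholds from Fig. 2: top-1 in [0, 0.75], top-2 in [0.2, 1], top-3 in [0.1, 1]
thresholds = [0.75, -0.2, -0.1]
```

With these thresholds a diffuse (flat) output vector passes the check, which matches the observation that high-confidence new-type samples produce no single dominant element, while a sharply peaked output fails it.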
(4) obtaining negative samples for detection model training: negative samples represent the existing types and are drawn from the existing labeled network traffic packet sample set; each node randomly selects a certain number of samples from its labeled traffic packet samples of step (1) as negative samples; the number of selected negative samples is the same as or similar to the number of positive samples; the labels of all negative samples are set to the same value, which must differ from the positive-sample label (for example, 0 and 1 respectively).
(5) constructing the new type traffic packet detection model: the new type traffic packet detection model f_n is composed of two sub-models, f_b and f_θ, combined in series, and can be expressed as y′ = f_n(x) = f_b(f_θ(x));
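The series combination y′ = f_n(x) = f_b(f_θ(x)) can be illustrated with a toy sketch; the stand-in feature extractor and the weights of the detection head f_b below are placeholders, not the trained models:

```python
import math

def f_theta(x):
    # Stand-in feature extractor: in the real scheme this is the trained
    # deep model with parameters theta*; here a fixed toy mapping.
    return [x[0] + x[1], x[0] - x[1]]

def f_b(v):
    # Binary detection head with illustrative weights b = (w, bias).
    w, bias = [0.5, -0.5], 0.0
    z = sum(wi * vi for wi, vi in zip(w, v)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # probability of "new type"

def f_n(x):
    # Series combination y' = f_n(x) = f_b(f_theta(x)).
    return f_b(f_theta(x))
```

Only the parameters of f_b need fresh training; f_θ enters the composition with its already-optimized parameters θ*.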
(6) constructing the optimization equation: the optimization equation is expressed as
n* = argmin_n Σ_j l(f_n(x_j), y_j)
where l is the loss function and (x_j, y_j) are the training samples;
(7) distributed training of the model: a plurality of network traffic monitoring nodes ("nodes") use the samples collected in steps (3)-(4) to train the new type traffic packet detection model f_n of step (5) in a cooperative manner according to the optimization equation of step (6), until the model converges or reaches a preset error threshold. The specific steps of the distributed training are as follows:
(7.1) model initialization: the model initialization parameter is defined as n_0 = [b_0, θ*], where θ* is the parameter of sub-model f_θ and already exists on each node; the sink node therefore only needs to randomly initialize the parameter b_0 of sub-model f_b and send the initialization result to each node;
(7.2) model construction: each node builds the new type traffic packet detection model y′ = f_n(x) = f_b(f_θ(x)) and initializes the model parameters using the received initialization parameter b_0 and the current optimal parameter θ* of sub-model f_θ;
(7.3) local model training: first, node i constructs the local optimization equation
n_1^(i) = argmin_n Σ_{(x, y) ∈ D_i} l(f_n(x), y)
where D_i is the local training sample set of node i. Then, node i optimizes the model f_n on its local training sample set to obtain the optimized model parameters n_1^(i). The training set includes new-type samples (the positive samples) and existing-type samples (the negative samples); after training is finished, node i feeds the optimization result n_1^(i) back to the sink node;
(7.4) generating the training result of the current round: the sink node receives the feedback results n_1^(i) from all participating nodes and calculates their mathematical expectation
n_1 = E[n_1^(i)]
thereby obtaining the optimization result n_1 of the current round of distributed training. The sink node then sends the current-round optimization result n_1 to each node;
(7.5) repeating steps (7.3)-(7.4) until the model converges or reaches a preset error threshold, thereby obtaining the final model parameter n*.
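Steps (7.1)-(7.5) follow the usual federated-averaging pattern: random initialization at the sink, local optimization at each node, and aggregation by mathematical expectation. A minimal numpy sketch with a scalar linear model standing in for f_n (all data, function names and hyperparameters are illustrative):

```python
import numpy as np

def local_update(w, x, y, lr=0.1, steps=5):
    """Step (7.3): node-local gradient descent on the node's own samples."""
    for _ in range(steps):
        grad = 2.0 * np.mean(x * (w * x - y))   # d/dw of mean squared error
        w = w - lr * grad
    return w

def federated_train(node_data, rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal()                             # step (7.1): sink initializes
    for _ in range(rounds):                      # steps (7.3)-(7.5)
        local = [local_update(w, x, y) for x, y in node_data]
        w = float(np.mean(local))                # step (7.4): expectation over nodes
    return w

# Two nodes, both observing the same underlying relation y = 2x
# on different sample points.
node_data = [
    (np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])),
    (np.array([0.5, 1.5, 2.5]), np.array([1.0, 3.0, 5.0])),
]
```

Each round alternates local optimization with averaging at the sink, so the shared parameter converges toward the value consistent with all nodes' data without any raw samples leaving a node.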
After the distributed training is finished, all nodes obtain the current optimal model parameter n* from the sink node and construct the current optimal new type traffic packet detection model f_n*. This model detects and identifies all newly received network traffic packets in the current time interval so as to separate out the new-type traffic packets. The new type traffic packet detection model f_n* introduces the existing types and the new types as mutual contrasts during training. Therefore, compared with the previous simple threshold segmentation approach, it can learn the difference characteristics between the new types and the existing types more comprehensively, and its detection capability is greatly improved.
Method for mining and utilizing new type encrypted network flow packet in distributed scene
This section mainly covers two aspects. The first is a globally consistent category label assignment method: the new-type encrypted network traffic packets on different nodes are divided into sub-categories, and globally uniform labels are assigned to the samples of the different sub-categories. The second is a method for updating the existing models: the existing models are updated in a distributed manner with the samples on each node that have been assigned globally consistent labels.
Globally consistent class label assignment faces challenges. New-type traffic packets may be further divided into different sub-types, and new traffic packets with similar pattern characteristics that are distributed on different nodes should receive the same class label. The most direct way to realize globally unified labeling is for each node to upload its collected new-type traffic packets to a server, which then labels them uniformly. However, this method is unsuitable because of the large volume of raw traffic packet data. If each node independently performs a dimensionality-reduction operation and uploads the result, the communication overhead can be reduced; however, the dimensionality-reduction results of different nodes are not globally consistent and cannot be compared directly to realize globally unified class labeling.
In the method provided by the invention, globally consistent class labeling of new-type traffic packets comprises three processes. (1) Local sub-category division and local sub-category labeling of the new-type traffic packets: the new-type traffic packets of each node are clustered into different sub-categories, and an appropriate local label is assigned to each new-type sample according to the clustering result. Note that, since each node labels its samples independently, the local labels of same-type samples located on different nodes usually differ. (2) Extraction of globally consistent features for each local category: each node extracts globally consistent class features for each local category and uploads them to the server. Ensuring the global consistency of the extracted class features is the premise and basis of the next step. (3) Globally consistent category label assignment: the server divides the local categories into different global categories according to the similarity between the feature data uploaded by the nodes, and assigns a corresponding global label to each global category. Each local node then replaces its local category labels with the global labels, thereby providing global consistency of the category labels. The method for mining and utilizing a new type of encrypted network traffic packet in a distributed scenario comprises the following steps:
(1) a preparation stage: a plurality of network traffic monitoring nodes ("nodes") respectively monitor the network traffic of the different network areas for which they are responsible; each node independently collects a certain number of class-labeled ("labeled") network traffic packet samples; the nodes jointly train a new type encrypted network traffic packet detection model by means of the detection method for the distributed scenario described above; a new type of network traffic packet is one for which no sample of that type has been class-labeled;
(2) detecting new-type traffic packets: each node detects new-type network traffic packets from its newly received traffic packets; the mining and utilization of new-type traffic packets is carried out periodically, and each round of mining and utilization operates on all new-type traffic packets detected in the current period;
(3) local sub-category label assignment: each node independently performs a local clustering operation on the new-type traffic packets detected in the current period; each node independently assigns labels to the samples of each sub-category in its clustering result; within a local clustering result, new-type traffic packet samples of the same sub-category receive the same local label, and the labels of different local sub-categories differ from each other;
(4) local sub-category feature vector extraction: the nodes select a globally uniform reference; based on this global reference, each node extracts a globally consistent class feature vector for each of its local sub-categories; each node uploads the feature vectors of its local sub-categories, together with the corresponding local sub-category labels, to the sink node. The specific steps are as follows:
(4.1) designing a globally consistent reference model:
The globally consistent reference model is defined as y = f_μ(x) = f_e(f_θ(x)) = argmax(softmax(f_θ(x))); sub-model f_θ is the feature extraction model for encrypted network traffic packets; each node initializes sub-model f_θ with the globally optimal model parameter θ*; sub-model f_e contains no parameters to be optimized and needs no initialization;
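A minimal sketch of the parameter-free sub-model f_e (since softmax is monotone, argmax(softmax(v)) equals argmax(v), which is why f_e needs no initialization or training; the function names are our own):

```python
import math

def softmax(v):
    exps = [math.exp(x - max(v)) for x in v]   # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def f_e(v):
    """Parameter-free sub-model: index of the class with the largest probability."""
    p = softmax(v)
    return p.index(max(p))

def f_mu(x, f_theta):
    """Globally consistent reference model y = f_e(f_theta(x)); f_theta is the
    feature extraction model initialized with the global optimum theta*."""
    return f_e(f_theta(x))
```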
(4.2) training of the sub-category incremental model: each node independently trains an incremental model for each of its local sub-category sample sets; the optimization equation of the incremental training is
θ_S* = argmin_θ Σ_{x ∈ S} l(softmax(f_θ(x)), y_S)
where S is the sample set of one local sub-category and y_S is its local label.
It should be noted that although the incremental model training process uses conventional deep learning training methods, the training sample composition, training purpose and training cost differ. Conventional deep learning training data contains many samples of different classes, and the purpose is to learn the characteristic information of those different classes so as to improve model accuracy; because the differences between samples of different classes are large, model convergence is slow and the training cost is high. In the incremental model training designed in this scheme, the training samples come from a single local sub-category, and the purpose is to learn the characteristic information of that single class. Because the differences among samples of the same class are small, the model converges very quickly. Experiments show that even a few epochs of training ensure that the parameters achieve excellent class representation performance. Taking the VPNnonVPN data set as an example, 100 single-class training sample sets were generated by random sampling and incremental model training tests were performed independently on each; with epoch greater than 2, the training accuracy of all 100 runs was close to 1.
(4.3) extracting the sub-category features based on the incremental model: each node selects, according to the same rule, a subset of each sub-category model's parameters as the feature vector of that sub-category; each node uploads each local sub-category feature vector together with the corresponding local sub-category label to the sink node;
Because different nodes use the same reference model for incremental model training, the sub-category features have global consistency. Meanwhile, the sub-category incremental model parameters discriminate well between sub-categories. Fig. 4 shows the class-expression capability of the incremental model parameters. Each point in Fig. 4 corresponds to the parameter vector of one incremental model, trained on a sampled subset of one class of data. For display convenience, principal component analysis is used to reduce the parameter vectors to a two-dimensional space. As the figure shows, the parameter vectors of incremental models trained on different classes of samples are well separated, and the incremental models obtained from sample subsets of the same class lie at adjacent positions in parameter space. However, these full incremental model parameters are not suitable for direct use as local sub-category feature vectors: the parameter spaces of incremental models of some local classes overlap. For example, the two models for the classes numbered 8 and B in Fig. 4 coincide in parameter space, making those two classes inseparable.
To avoid the overlap of some sub-category feature spaces and to reduce feature vector dimensionality and communication overhead, a representative optimized parameter subset is extracted from the model parameters as the sub-category feature vector. This scheme selects the bias parameters of the last layer of the model as the final features, for the following reasons. First, from the input layer to the output layer of a deep neural network, the later the layer, the higher its abstraction capability and the stronger its class expression capability. Second, from the perspective of the back-propagation algorithm, the parameters of the layers near the output are adjusted first; parameters close to the output layer are therefore most affected by the incremental training process and best capture the characteristic information of the relevant class training data. In addition, in most deep learning models, the layers close to the output have relatively few neurons, so their parameter counts are also small. Taking the classical LeNet-5 model as an example, the model has more than 40000 parameters in total, while the feature parameters selected according to our scheme number only 10. For the class expression capability of the optimized parameter subset, see the "Performance evaluation" section.
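The idea of steps (4.2)-(4.3) can be sketched with a single softmax layer standing in for the last layer of the reference model: a few epochs of gradient descent on one sub-category's samples already push the bias vector toward that sub-category, so the bias serves as a compact, discriminative feature. A toy illustration with made-up data, not the scheme's actual deep model:

```python
import numpy as np

def incremental_bias_feature(samples, class_id, n_classes, epochs=3, lr=0.5):
    """Train only a softmax output layer on samples of one local sub-category
    and return its bias vector as the sub-category feature."""
    dim = samples.shape[1]
    W = np.zeros((dim, n_classes))
    b = np.zeros(n_classes)
    onehot = np.zeros(n_classes)
    onehot[class_id] = 1.0
    for _ in range(epochs):
        logits = samples @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = p - onehot                             # cross-entropy gradient
        W -= lr * samples.T @ grad / len(samples)
        b -= lr * grad.mean(axis=0)
    return b

rng = np.random.default_rng(0)
feat_a = incremental_bias_feature(rng.normal(size=(20, 8)), class_id=1, n_classes=4)
feat_b = incremental_bias_feature(rng.normal(size=(20, 8)), class_id=2, n_classes=4)
```

Because the targets in each run belong to a single class, the bias component of that class grows monotonically while the others shrink, so bias vectors trained on different sub-categories peak at different positions and are easy to tell apart.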
(5) globally consistent sub-category label assignment: the sink node collects the sub-category feature vectors and local label information from the different nodes; on the basis of the collected sub-category feature vectors, the sink node performs global clustering of the local sub-category feature vectors and assigns a different global label to each global sub-category according to the clustering result; using the collected local sub-category labels and the assigned global labels, the sink node establishes, for the local sub-categories of each node, a mapping from local sub-category labels to global labels, and feeds the mapping back to the corresponding node; each node then modifies the local class label of each of its samples to the global class label according to the received mapping. The whole process is shown in Fig. 8.
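Step (5) can be sketched as a grouping of the uploaded (node, local label, feature) triples at the sink node; the greedy single-link grouping and the distance radius below are illustrative stand-ins for whatever global clustering algorithm is actually used:

```python
import numpy as np

def assign_global_labels(uploads, radius=1.0):
    """uploads: list of (node_id, local_label, feature_vector).
    Greedy single-link grouping: a feature joins the first existing global
    cluster whose founding representative lies within `radius`; otherwise it
    founds a new global label. Returns {(node_id, local_label): global_label},
    the mapping fed back to each node."""
    reps, mapping = [], {}
    for node_id, local_label, feat in uploads:
        feat = np.asarray(feat, dtype=float)
        for g, rep in enumerate(reps):
            if np.linalg.norm(feat - rep) <= radius:
                mapping[(node_id, local_label)] = g
                break
        else:
            reps.append(feat)
            mapping[(node_id, local_label)] = len(reps) - 1
    return mapping

uploads = [
    (0, "a", [0.0, 0.0]),   # node 0, local label "a"
    (1, "x", [0.1, 0.1]),   # same underlying sub-category as (0, "a")
    (1, "y", [5.0, 5.0]),   # a different sub-category
]
mapping = assign_global_labels(uploads, radius=1.0)
```

Sub-categories whose feature vectors lie close together, even when they come from different nodes under different local labels, end up sharing one global label, which is exactly the consistency property the scheme requires.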
Through the above operations, we obtain many labeled new-type sample sets. Next, we use these new-type sample sets as training data to update the feature extraction model, the new type encrypted network traffic packet detection model, the existing-type encrypted network traffic classification model, and other models. The principles for updating these models are essentially similar; the following takes the updating of the feature extraction model as an example.
(6) updating the model: the network traffic monitoring nodes extend the model and train the extended model in a cooperative manner, using the samples with global labels assigned in step (5), until the model converges or reaches a preset error threshold. The specific steps are as follows:
(6.1) extension of the model: extend the model according to the total numbers of newly added categories and newly added samples; when the numbers of new categories and samples are small, only increase the number of output-layer neurons; when they are very large, the number of intermediate processing layers or the number of neurons per layer must also be increased.
When the model is updated, the total numbers of classes and samples grow with the addition of the new-type training samples, so the model must be extended appropriately to match its complexity to the numbers of samples and classes. Specifically, when the numbers of added classes and samples are much smaller than the numbers of existing classes and samples in the previous training, we only need to modify the output layer of the model, adding as many neurons as there are new classes. When the numbers of newly added classes and samples are very large, we also need to extend the intermediate layers, either by adding layers or by expanding the dimensions of the existing intermediate layers.
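For the small-increment case of step (6.1), widening the output layer while preserving the trained parameters can be sketched as follows (a numpy stand-in for a dense layer; the initialization scale is illustrative):

```python
import numpy as np

def extend_output_layer(W, b, n_new, seed=0):
    """Widen a dense output layer (W: [dim, n_old], b: [n_old]) by n_new
    neurons: trained columns keep their values (step 6.2's base-model
    initialization), new columns are random-initialized."""
    rng = np.random.default_rng(seed)
    dim, _ = W.shape
    W_ext = np.concatenate([W, rng.normal(scale=0.01, size=(dim, n_new))], axis=1)
    b_ext = np.concatenate([b, np.zeros(n_new)])
    return W_ext, b_ext

W_old = np.arange(6.0).reshape(3, 2)    # a "trained" 3-input, 2-class output layer
b_old = np.array([0.5, -0.5])
W_new, b_new = extend_output_layer(W_old, b_old, n_new=1)
```

Keeping the old columns intact means the extended model starts from the existing optimum and only the neurons for the new classes must be learned from scratch.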
(6.2) model initialization: the original base-model parameters within the extended model are initialized with the existing optimal parameters; the neuron parameters of the extension part are initialized with random numbers;
(6.3) optimization equation: the optimization equation is defined as
θ′* = argmin_θ′ Σ_j l(f′_θ(x_j), y_j)
where f′_θ is the extended model;
(6.4) model training: the network traffic monitoring nodes train the extended model in a cooperative manner, using their collected samples with assigned global labels, until the model converges or reaches a preset error threshold.
Performance evaluation
1. Experimental setup
1) Data set for evaluation: the data set used in the experiments consists of two parts. One part comes from the ISCX VPNnonVPN data set, which includes different types of conventional encrypted traffic and protocol-encapsulated traffic. Samples with disputed categories were deleted. The final cleaned data set has 12 categories in total: 6 conventional encrypted traffic classes (Chat, Email, File, P2P, Streaming, Voip) and 6 protocol-encapsulated traffic classes (Vpn_Chat, Vpn_Email, Vpn_File, Vpn_P2P, Vpn_Streaming, Vpn_Voip). Unless stated otherwise, we use the 10 digits 0 through 9 to identify the first 10 categories in the above order, and the 2 letters A and B to identify the last 2 categories in order. However, the numbers of samples in some categories of this data set are too small; for example, the total number of samples of the Vpn_Email class is only 253. The sample distribution among classes is also highly unbalanced; for example, the Chat category has 5257 samples, far more than the total for the Vpn_Email category. For this reason, we expand the samples of each category and ensure that the numbers of samples in all categories of the data set are substantially equal.
2) Platform and model: the deep learning framework used in the experiments is TensorFlow, and the federated learning platform is the TensorFlow Federated (TFF) framework. In the experiments, the federated learning mechanisms are realized in local mode, i.e., the client nodes and the server are virtual and actually located on the same device. Several models are mentioned in the proposed solution, such as the feature extraction model and the new-type detection model. The main part of these models as used in the experiments is similar to LeNet-5, modified in three ways. First, all 2D modules are replaced by 1D modules; for example, the 2D convolution module of each convolutional layer is replaced with a 1D convolution module. Second, an attention layer is added to the model. Third, the input and output layers are adapted to the sample size and the number of traffic classes. The main part of the model consists of seven layers: three convolutional layers, two pooling layers and two fully connected layers, with the convolutional and pooling layers implemented in one-dimensional form. The attention layer is inserted after the second convolutional layer in the form of a bypass, and its output is used to dynamically weight the output of the second convolutional layer.
3) Evaluation indexes: the evaluation criteria used in the experiments include Accuracy, Precision, Recall and F1-score.
2. Model component selection and performance comparison.
The model employed in the proposed solution is constructed on the basis of 1D-CNN and an attention mechanism. Table 1 compares the performance of the different model strategies: ours (based on 1D-CNN and the attention mechanism), a model based on 1D-CNN alone, and a model based on 2D-CNN. According to the experimental results, the classification performance of the model based on 1D-CNN plus attention is better than that of the 1D-CNN or 2D-CNN models alone. Note that comparing the results of this experiment with those of the subsequent experiments is meaningless: on the one hand, the numbers of samples of the different classes here are very unbalanced, whereas the data sets of the subsequent experiments are class-balanced; on the other hand, the number of samples used for model training in this experiment is much larger than in the subsequent experiments.
TABLE 1. Performance comparison of different model strategies

Scheme          Accuracy   Precision   Recall   F1-score
The invention   0.949      0.957       0.938    0.947
1D-CNN          0.921      0.945       0.933    0.939
2D-CNN          0.845      0.852       0.846    0.848
3. New type traffic packet detection model performance.
As can be seen from Fig. 2, the output vectors of new-type and known traffic packets differ markedly in the dimensions corresponding to the top-3 vector elements. To show the difference more intuitively, we use the top-3 elements of the output vector as three dimensions to construct a three-dimensional space and plot the samples in it. Samples labeled 0 are existing-type samples and samples labeled 1 are new-type samples. To observe the distribution characteristics of the existing-type and new-type samples in this space, the three-dimensional plot is projected onto three different 2-dimensional planes for easier viewing. The results are shown in Figs. 5(a), 5(b) and 5(c).
As Figs. 5(a), 5(b) and 5(c) show, most existing-type samples and new-type samples are distributed in different regions of the space, with relatively obvious distribution differences. By choosing appropriate threshold parameters, we can easily separate out most new-type samples. At the same time, the existing-type and new-type samples overlap in some regions. For example, many samples labeled 0 or 1 are densely mixed in the lower-right corner regions of Figs. 5(a) and 5(b), and the same occurs in the lower-left corner region of Fig. 5(c). A threshold-based segmentation scheme therefore cannot separate them cleanly in these regions. We compared the performance of the new-type sample detection scheme proposed by the invention with a threshold-based segmentation scheme; Table 2 shows the comparison results. In the experiment, the segmentation threshold of the first dimension was set to 0.9, and the segmentation thresholds of the other two dimensions to 0.1.
TABLE 2. Performance comparison of new-type sample detection schemes

Scheme                   Accuracy   Precision   Recall   F1-score
Threshold segmentation   0.683      0.996       0.613    0.759
The invention            0.942      0.986       0.906    0.944
As Table 2 shows, the Precision of the threshold segmentation scheme is very high, while its Recall and Accuracy are relatively low. Its high Precision stems mainly from the very concentrated distribution of the existing-type samples over the top-3 dimensions: under threshold segmentation, existing-type samples are correctly identified with high probability, so the probability of judging an existing-type sample as a new type is low.
The new-type samples, by contrast, are scattered, and many of them overlap with existing-type samples and cannot be separated by simple threshold segmentation. The threshold segmentation scheme therefore identifies new-type samples relatively poorly, with relatively low Recall and Accuracy.
The Recall and Accuracy of the scheme of the invention are both improved by a large margin, although its Precision degrades slightly relative to the threshold segmentation scheme. The new-type samples (labeled 1) in the training set of the inventive scheme come from the threshold segmentation scheme; under the chosen thresholds, some existing-type samples are wrongly assigned to the new type, and this mislabeling pattern is inevitably learned during training, increasing the probability of classifying existing-type samples as new-type samples. Hence the slight degradation in Precision relative to the threshold segmentation scheme.
4. Globally consistent feature extraction scheme performance
Although incremental models trained on data of different classes have a certain class expression capability, the full incremental model parameters are unsuitable for direct upload to the server as class feature data, both because the parameters are numerous and because their expression capability needs further improvement.
The last-layer parameters of the incremental model, however, are highly expressive, as verified by the following experiment. The experimental data come from the VPNnonVPN data set. The five categories numbered [1, 2, 3, 4, 5] are treated as existing categories and the other 7 categories as new categories. A training sample set is constructed from the existing-category data and a base model is trained. Then, 56 training sample sets are randomly generated from the new-category data, each containing samples of only one of the 7 new categories. Incremental model training is then performed on each data set starting from the base model. Because the incremental training uses single-class samples, the models converge very quickly; to reduce training cost, we set epoch to 3 for each single-class incremental training. Training a single incremental model took no more than 1 s, and at the end of training the training accuracy of most models reached 1.
Two groups of parameters (kernel and bias) are extracted from the first and last layers of each model and compared visually after dimensionality reduction with PCA. The clustering of the two first-layer parameter groups in 2-D space is shown in figs. 6(a) and 6(b); the clustering of the two last-layer parameter groups is shown in figs. 7(a) and 7(b).
As figs. 6 and 7 show, the clustering of the two first-layer parameter groups is markedly worse than that of the two last-layer groups. In fig. 7 no categories overlap, whereas in fig. 6 the points of each category are scattered and overlap. The bias parameters (figs. 6a and 7a) also cluster noticeably better than the kernel parameters (figs. 6b and 7b). Taking fig. 7 as an example, although both last-layer parameter groups cluster well, the points of the bias group (fig. 7a) are clearly more concentrated.
As shown in fig. 4, directly using the full set of model parameters as feature data is not a sensible option: the feature vectors are long, the categories are poorly separated, and some categories cannot be distinguished at all. Using the last-layer bias as the feature data instead greatly shortens the feature vector and markedly improves class separation (fig. 7a).
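The PCA comparison above can be reproduced in miniature. The sketch below simulates the 56 per-class parameter vectors (tight per-class clusters for the last-layer bias, diffuse ones for the first-layer kernel, an assumption mirroring the observation in figs. 6 and 7) and measures cluster tightness after a NumPy-only 2-D PCA; the vector dimensions and noise scales are illustrative.

```python
# Sketch: project simulated per-class parameter vectors to 2-D with PCA
# and compare how tightly the classes cluster.
import numpy as np

rng = np.random.default_rng(1)

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# 7 new classes x 8 incremental models each = 56 models, as in the text.
n_classes, per_class, d = 7, 8, 12
centers = rng.normal(scale=10, size=(n_classes, d))
last_bias    = np.vstack([c + rng.normal(scale=0.3, size=(per_class, d)) for c in centers])
first_kernel = np.vstack([c + rng.normal(scale=8.0, size=(per_class, d)) for c in centers])
labels = np.repeat(np.arange(n_classes), per_class)

def cluster_tightness(Z, labels):
    """Mean within-class spread divided by mean between-class spread."""
    within = np.mean([np.linalg.norm(Z[labels == k] - Z[labels == k].mean(axis=0),
                                     axis=1).mean() for k in range(n_classes)])
    means = np.array([Z[labels == k].mean(axis=0) for k in range(n_classes)])
    between = np.linalg.norm(means - means.mean(axis=0), axis=1).mean()
    return within / between

t_bias   = cluster_tightness(pca_2d(last_bias), labels)
t_kernel = cluster_tightness(pca_2d(first_kernel), labels)
print(f"tightness (lower is better): bias={t_bias:.3f}, kernel={t_kernel:.3f}")
```

Under these assumptions the bias vectors form much tighter clusters after PCA, matching the qualitative conclusion drawn from fig. 7a.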
5. Performance comparison before and after model update
In the adaptive-update global-model performance experiment, 3 scenarios are designed. All three contain 9 known traffic categories; the number of unknown traffic categories is 1, 2, and 3, respectively. The scenarios are described in table 3. In each scenario, 2000 samples are randomly selected from each existing type as training samples, and a base classification model G1 is trained. Then 2000 further samples are randomly selected from each existing type and each new type as new traffic samples, on which the later stages of the proposed scheme are run to obtain an updated classification model G2; the traffic samples used in the two stages are disjoint. Finally, the performance of the two models before and after updating is compared across scenarios; the results are shown in table 4. In all three scenarios, the performance of the updated model G2 is slightly lower than that of G1, and the degradation grows with the number of new types. This is because as the number of new types increases, so does the proportion of new-type samples in the new sample set; since the identification and labeling of new-type samples carry some error, a larger share of new-type samples amplifies the effect of that error on the final result.
Table 3 experimental scenario description

Scene   | Existing types      | New types
Scene 1 | [1,2,3,4,5,6,7,8,9] | [A]
Scene 2 | [2,3,4,5,6,7,8,9,A] | [0,B]
Scene 3 | [0,2,3,4,6,7,8,A,B] | [1,5,9]
TABLE 4 adaptive update model performance analysis

Scene | Model | Accuracy | Precision | Recall | F1-score
1     | G1    | 0.948    | 0.949     | 0.948  | 0.948
1     | G2    | 0.942    | 0.942     | 0.942  | 0.942
2     | G1    | 0.952    | 0.953     | 0.952  | 0.952
2     | G2    | 0.942    | 0.943     | 0.942  | 0.942
3     | G1    | 0.968    | 0.969     | 0.968  | 0.968
3     | G2    | 0.897    | 0.898     | 0.897  | 0.896
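The four metrics in Table 4 can be computed from a confusion matrix with macro averaging over classes. A minimal NumPy sketch on toy predictions (not the patent's data) follows:

```python
# Sketch: Accuracy, macro Precision, macro Recall, macro F1 from a
# confusion matrix, as reported in Table 4.
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                         # rows: true, cols: predicted
    tp = np.diag(cm).astype(float)
    col, row = cm.sum(axis=0), cm.sum(axis=1)
    precision = np.divide(tp, col, out=np.zeros_like(tp), where=col > 0)
    recall    = np.divide(tp, row, out=np.zeros_like(tp), where=row > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred, 3)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
```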

Claims (3)

1. A method for mining and utilizing new types of encrypted network traffic packets in a distributed scenario, characterized by comprising the following steps:
(1) preparation stage: a plurality of network traffic monitoring nodes ("nodes" for short) each monitor the network traffic of the network area for which they are responsible; each node independently collects a number of network traffic packet samples that have been class-labeled (referred to as labeled samples); the nodes cooperate to train a detection model for new types of network traffic packets, where a new type of network traffic packet is one for which no class-labeled sample exists;
(2) new-type traffic packet detection: each node detects new-type network traffic packets among its newly received packets; mining and utilization of new-type packets proceed periodically, each round operating on all new-type packets detected in the current period;
(3) sub-category discovery: each node independently clusters the new-type traffic packets detected within the current period and independently assigns a label to each sub-category in its clustering result; within a local clustering result, new-type packet samples of the same sub-category receive the same local label, and different local sub-categories receive different labels;
(4) local sub-category feature vector extraction: the nodes agree on a globally uniform reference; based on this reference, each node extracts a globally consistent class feature vector for each of its local sub-categories, and uploads these feature vectors, together with the corresponding local sub-category labels, to the sink node;
(5) globally consistent category labeling: the sink node collects the sub-category feature vectors and local label information from the different nodes, performs global clustering on all collected feature vectors, and assigns a global label to each sub-category in the global clustering result; for each node it builds a mapping scheme from local labels to global labels and returns it to that node; each node then assigns global labels to its sub-category samples using the received mapping scheme;
(6) model updating: the network traffic monitoring nodes expand the model and cooperatively train the expanded model on the samples assigned global labels in step (5), until the model converges or a preset error threshold is reached.
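Step (2) of the claim leaves the per-node detector unspecified; the description's baseline is a threshold-segmentation scheme, under which a packet is flagged as new-type when the classifier's maximum softmax confidence falls below an agreed threshold. The rule below is a minimal sketch of that idea; the threshold value and logits are illustrative assumptions.

```python
# Sketch of threshold-segmentation new-type detection: flag a sample as
# "new type" when the classifier's top softmax confidence is below a
# globally agreed threshold (0.7 here is an illustrative value).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def detect_new_type(logits, threshold=0.7):
    """Boolean mask: True where max class confidence < threshold."""
    conf = softmax(logits).max(axis=1)
    return conf < threshold

logits = np.array([
    [8.0, 0.1, 0.2],   # confident -> existing type
    [1.0, 1.1, 0.9],   # near-uniform -> candidate new type
])
mask = detect_new_type(logits)
print(mask)
```

As noted in the performance discussion, such a threshold inevitably mislabels some low-confidence existing-type samples as new-type, which is the source of the Precision loss analyzed earlier.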
2. The method for mining and utilizing new types of traffic packets in a distributed scenario according to claim 1, wherein step (4) specifically comprises the following steps:
(1) design of the globally consistent reference model: the globally consistent reference model is defined as y = f_μ(x) = f_e(f_θ(x)) = argmax(softmax(f_θ(x))); the sub-model f_θ is the feature extraction model for encrypted network traffic packets; each node initializes f_θ with the globally optimal model parameters θ*; the sub-model f_e contains no parameters to be optimized and needs no initialization;
(2) sub-category incremental model training: each node independently trains an incremental model for each of its local sub-categories on that sub-category's samples; the optimization equation for the incremental training is:
Figure FSA0000275169320000021
(3) incremental-model-based sub-category feature extraction: each node selects, by the same rule, a subset of each sub-category model's parameters as that sub-category's feature vector, and uploads each local sub-category feature vector together with its local sub-category label to the sink node;
(4) globally consistent sub-category label assignment: the sink node globally clusters all local sub-categories on the basis of the collected feature vectors and assigns a distinct global label to each global sub-category according to the clustering result; from the collected local labels and the assigned global labels it builds, for every node, a mapping scheme from local sub-category labels to global labels and feeds the mapping back to the corresponding node; each node then changes the local class label of each sample to the global class label according to the received mapping scheme.
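Steps (3) and (4) can be illustrated end to end: each node uploads one feature vector per local sub-category (e.g. the last-layer bias of its incremental model, per the description's experiments), the sink clusters them, and each node receives a local-to-global label map. The feature vectors, the distance threshold, and the greedy grouping below are illustrative assumptions standing in for the unspecified global clustering algorithm.

```python
# Sketch: sink-side global clustering of uploaded sub-category feature
# vectors and construction of per-node local->global label mappings.
import numpy as np

def global_cluster(features, thresh=1.0):
    """Greedy grouping: join a vector to the first cluster whose seed
    vector is within `thresh`, otherwise open a new global cluster."""
    centroids, assign = [], []
    for v in features:
        for g, c in enumerate(centroids):
            if np.linalg.norm(v - c) < thresh:
                assign.append(g)
                break
        else:
            centroids.append(v)
            assign.append(len(centroids) - 1)
    return assign

# Two nodes discovered the same two sub-categories under different local labels.
uploads = [  # (node, local sub-category label, feature vector)
    ("node1", 0, np.array([0.0, 0.0])),
    ("node1", 1, np.array([5.0, 5.0])),
    ("node2", 0, np.array([5.1, 4.9])),
    ("node2", 1, np.array([0.1, -0.1])),
]
global_labels = global_cluster([u[2] for u in uploads])
mapping = {}   # per-node mapping scheme returned by the sink node
for (node, local, _), g in zip(uploads, global_labels):
    mapping.setdefault(node, {})[local] = g
print(mapping)
```

Note how node2's local labels map to the opposite global labels from node1's, which is exactly the inconsistency the mapping scheme exists to resolve.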
3. The method for mining and utilizing new types of traffic packets in a distributed scenario according to claim 1, wherein step (6) specifically comprises the following steps:
(1) model expansion: the model is expanded according to the total number of newly added categories and newly added samples; when these numbers are small, only the number of output-layer neurons is increased; when they are very large, the number of intermediate hidden layers, or the number of neurons per hidden layer, must also be increased;
(2) model initialization: in the expanded model, the parameters belonging to the original base model are initialized with the existing optimal parameters, and each neuron parameter of the extension is initialized with random numbers;
(3) an optimization equation: the optimization equation is defined as
Figure FSA0000275169320000022
wherein f_θ′ is the expanded model;
(4) model training: the network traffic monitoring nodes cooperatively train the expanded model on the collected samples assigned global labels, until the model converges or a preset error threshold is reached.
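Steps (1) and (2) of this claim can be sketched for the common case where only the output layer grows: keep the trained parameters for the original outputs and randomly initialize only the extension. The layer sizes and initialization scale below are illustrative assumptions.

```python
# Sketch: expand the output layer by the number of newly discovered
# classes, preserving trained weights and random-initializing the new part.
import numpy as np

rng = np.random.default_rng(2)

def expand_output_layer(W, b, n_new):
    """W: (hidden, old_classes). Returns parameters covering
    old_classes + n_new outputs; only the extension is newly initialized."""
    hidden, _ = W.shape
    W_ext = rng.normal(scale=0.01, size=(hidden, n_new))  # random init
    b_ext = np.zeros(n_new)
    return np.hstack([W, W_ext]), np.concatenate([b, b_ext])

W = rng.normal(size=(16, 9))   # 9 existing classes, as in scene 1 of Table 3
b = rng.normal(size=9)
W2, b2 = expand_output_layer(W, b, n_new=1)   # one newly discovered class
print(W2.shape, b2.shape)
```

Initializing the original slice from the existing optimal parameters, as step (2) requires, is what lets the cooperative training in step (4) converge quickly instead of relearning the existing classes from scratch.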
CN202210665404.7A 2022-05-11 2022-05-11 Method for mining and utilizing new type encrypted network flow packet in distributed scene Pending CN115134128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665404.7A CN115134128A (en) 2022-05-11 2022-05-11 Method for mining and utilizing new type encrypted network flow packet in distributed scene


Publications (1)

Publication Number Publication Date
CN115134128A true CN115134128A (en) 2022-09-30

Family

ID=83377584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665404.7A Pending CN115134128A (en) 2022-05-11 2022-05-11 Method for mining and utilizing new type encrypted network flow packet in distributed scene

Country Status (1)

Country Link
CN (1) CN115134128A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116582372A (en) * 2023-07-13 2023-08-11 深圳市前海新型互联网交换中心有限公司 Internet of things intrusion detection method, system, electronic equipment and storage medium
CN116582372B (en) * 2023-07-13 2023-09-26 深圳市前海新型互联网交换中心有限公司 Internet of things intrusion detection method, system, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination