CN115459937A - Method for extracting characteristics of encrypted network traffic packet in distributed scene - Google Patents


Info

Publication number
CN115459937A
Authority
CN
China
Prior art keywords
model, samples, node, training, network traffic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210665402.8A
Other languages
Chinese (zh)
Inventor
张平
唐艳艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202210665402.8A
Publication of CN115459937A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/46 Interconnection of networks
    • H04L12/4641 Virtual LANs, VLANs, e.g. virtual private networks [VPN]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for extracting features of encrypted network traffic in a distributed scenario. The method automatically extracts the feature information contained in raw encrypted network traffic packets, without manual feature design, selection, or extraction. By adopting a one-dimensional convolutional neural network, an attention mechanism, and related techniques, it greatly improves the representational power of the extracted features. The method suits distributed scenarios: sample data collected at different network nodes can be integrated to improve model training accuracy, sample data is shared among different network nodes at the model level, and the security of the raw data is thereby guaranteed to a certain extent. The method can be applied as a sub-module in different scenarios, such as detecting new types of encrypted traffic packets, classifying existing types of encrypted traffic packets, and class-labeling new types of encrypted traffic packets.

Description

Method for extracting characteristics of encrypted network traffic packet in distributed scene
Technical Field
The invention relates to the field of network security, and in particular to a method for monitoring and managing network traffic packets in a distributed scenario.
Background
Network traffic packet classification is a crucial task in network management and cyberspace security. Network management typically needs to classify network traffic packets into different categories and then apply different routing or firewall configuration policies to each type. For example, network traffic packets can be divided by application class, and different priorities allocated to different classes to guarantee the network quality of service (QoS) of high-priority services. As another example, network packet classification can be used for network intrusion detection: classifying network packets into benign and malicious traffic achieves the purpose of network anomaly detection.
Most network traffic today is encrypted. Most network applications introduce secure communication protocols, such as SSL (Secure Sockets Layer) and TLS (Transport Layer Security), to improve their security. At the same time, much malware encrypts its network traffic packets to evade detection by firewalls and network intrusion detection systems. Since the payload of an encrypted network traffic packet is in an encrypted state, this poses a challenge to traditional traffic classification methods such as those based on Deep Packet Inspection (DPI). Network traffic classifiers based on machine learning generally require manual feature design and selection, which makes them difficult to implement and limits their classification accuracy.
In recent years, deep learning techniques have been introduced into encrypted network traffic classification. However, deep-learning-based encrypted traffic classification schemes still face many challenges that separate them from real-world scenarios.
First, training a deep learning model requires the support of a large number of samples; otherwise, over-fitting problems are easily induced. Deep learning models are generally complex, with numerous parameters to train, so building a high-accuracy deep-learning-based encrypted traffic classifier requires a large number of labeled training samples. However, collecting a large amount of correctly labeled encrypted traffic is not easy: since the traffic packet payload is encrypted, analyzing and labeling the type of encrypted network traffic is very costly, and the capacity of a single monitoring node, and hence the number of encrypted traffic packet samples it can label, is limited.
Second, a classification model of practical value should be able to identify as many traffic classes as possible. However, a single network monitoring node has limited coverage, can collect only a limited range of sample types, and therefore yields a model with limited identification capability. Network traffic packet distribution typically has regional characteristics: for example, the types of network traffic generated by different kinds of network users are not completely consistent, and network viruses usually break out in one area before spreading to others.
Furthermore, new types of traffic packets emerge one after another and cannot be classified correctly by models trained on existing traffic packet samples. In real application scenarios, the set of network traffic types is not fixed, and large numbers of new types of network traffic packets are often encountered. There are many reasons for this. On the one hand, new network applications appear constantly and inevitably produce new network traffic patterns. On the other hand, to escape network monitoring, malicious network users often change their behavior patterns, causing malicious network traffic patterns to change as well.
Therefore, it is necessary to study the problem, closer to the real situation, of monitoring and managing encrypted network traffic packets in a distributed network monitoring scenario where existing types and new types of encrypted network traffic packets exist at the same time. In the scenario investigated by the present invention, there are multiple network monitoring nodes ("nodes" for short), distributed at the entry points of different network areas to monitor the network traffic of those areas. Each node has accumulated a certain number of labeled network traffic samples. The traffic types corresponding to these labeled samples are collectively referred to as the existing types; correspondingly, a new type is one for which no samples have yet been labeled. In this scenario, both existing and new types of encrypted network traffic packets exist, and a newly received traffic packet sample may be of either kind.
Disclosure of Invention
The invention aims to solve the technical problem of providing, in view of the defects of the prior art, a method for extracting encrypted network traffic features in a distributed scenario. The technical scheme of the invention is as follows:
A method for extracting features of encrypted network traffic packets in a distributed scenario, characterized by comprising the following steps:
(1) A preparation stage: network traffic monitoring nodes respectively monitor the network traffic of the different network areas for which they are responsible; each node independently collects a certain number of network traffic packet samples that have undergone class labeling, i.e., have been assigned class labels (referred to as labeled samples);
(2) Constructing a feature extraction model: the network traffic packet feature extraction model f_θ can be expressed as v = f_θ(x), where x is an encrypted network traffic packet and v is the feature vector extracted by the model; the feature extraction model f_θ comprises at least a one-dimensional convolution (1D CNN) layer and an Attention layer; the output of the Attention layer is converted into a set of weights; this set of weights serves as the weights of the different channels of the one-dimensional convolution layer and is used to rescale that layer's original output values; as an optimization, the feature extraction model f_θ further comprises a one-dimensional pooling layer, a fully-connected layer, and an activation layer;
(3) Constructing an interface model: the interface model f_e is formed by nesting two modules, softmax and argmax; the interface model may be expressed as y = f_e(v) = argmax(softmax(v));
(4) Constructing an optimization equation: the optimization equation can be expressed as

$$\theta^{*} = \arg\min_{\theta} \sum_{(x_i, y_i) \in D} l\left(f_e(f_\theta(x_i)),\; y_i\right)$$

where D is the set of labeled samples and l is the loss function;
(5) Distributed training of the model: multiple network traffic monitoring nodes ("nodes" for short) use the labeled samples respectively collected in step (1) and, in a cooperative manner, train the feature extraction model f_θ of step (2) according to the optimization equation given in step (4), until the model converges or a preset error threshold is reached.
As a further optimization, the specific steps of step (5) are as follows:
(5.1) Model initialization: select one node as the sink node; the sink node first randomly initializes the parameters of the feature extraction model f_θ to obtain the initial parameters θ_0, and then sends θ_0 to the other nodes;
(5.2) Local model training: node i uses the received θ_0 to initialize f_θ and constructs the local optimization equation

$$\theta_i^{1} = \arg\min_{\theta} \sum_{(x, y) \in D_i} l\left(f_e(f_\theta(x)),\; y\right)$$

where D_i is the labeled encrypted network traffic dataset accumulated locally at node i; based on this optimization equation, node i optimizes the model f_θ to obtain the optimized model parameters θ_i^1, and feeds the optimization result θ_i^1 back to the sink node;
(5.3) Generating the model parameters of the current stage: the sink node receives the feedback results {θ_i^1} from all participating nodes and calculates their mathematical expectation; the model parameters of the current round of distributed training are

$$\theta^{1} = E[\theta_i^{1}] = \frac{1}{N} \sum_{i=1}^{N} \theta_i^{1}$$

where N is the number of participating nodes; the sink node then sends the current-stage model parameters θ^1 to the other nodes;
(5.4) Repeat steps (5.2) and (5.3) until the model converges or a preset error threshold is reached, obtaining the current optimal model parameters θ^*;
(5.5) All nodes obtain the current optimal model parameters θ^* from the sink node and construct the current optimal feature extraction model f_{θ^*}, used for extracting feature vectors of network traffic packets.
Beneficial effects:
The scheme adopted by the invention designs a method for extracting encrypted network traffic features in a distributed scenario. The scheme provides an end-to-end feature extraction mode: a raw network traffic packet is taken as input, and after computation the feature information it contains is extracted automatically, avoiding the manual feature design, selection, and extraction required by traditional machine learning schemes. By adopting a one-dimensional convolutional neural network, an attention mechanism, and related techniques, the scheme greatly improves the representational power of the extracted features. The scheme also provides a training method for the encrypted network traffic feature extraction model in a distributed scenario: sample data collected at different network nodes can be integrated to improve model training accuracy, sample data is shared among different network nodes at the model level, and the security of the raw data is thereby guaranteed to a certain extent. The invention can be applied as a sub-module in many different scenarios, such as detecting new types of encrypted traffic packets, classifying existing types of encrypted traffic packets, and class-labeling new types of encrypted traffic packets.
Drawings
FIG. 1 is a schematic diagram of a feature extraction model structure
FIG. 2 (a) Distribution of the top-3 feature vector elements of new-type traffic samples
FIG. 2 (b) Distribution of the top-3 feature vector elements of existing-type traffic samples
FIG. 3 High-confidence new-type traffic packet sample extraction model
FIG. 4 Category representation of incremental model parameters
FIG. 5 (a) Two-dimensional spatial view of network traffic packets (dimensions of the 1st and 2nd largest feature vector elements)
FIG. 5 (b) Two-dimensional spatial view of network traffic packets (dimensions of the 1st and 3rd largest feature vector elements)
FIG. 5 (c) Two-dimensional spatial view of network traffic packets (dimensions of the 2nd and 3rd largest feature vector elements)
FIG. 6 (a) Expressive capability of the bias parameters of the first layer
FIG. 6 (b) Expressive capability of the kernel parameters of the first layer
FIG. 7 (a) Expressive capability of the bias parameters of the last layer
FIG. 7 (b) Expressive capability of the kernel parameters of the last layer
FIG. 8 Globally consistent class label assignment
Detailed Description:
the specific implementation process of the invention is as follows:
the invention researches a monitoring and management problem of the encrypted network traffic packet in a distributed scene which is similar to a real scene. In the problem scenario, a plurality of network monitoring nodes (simply referred to as "nodes") exist, and each network monitoring node independently monitors and manages encrypted network traffic packets in the jurisdiction. Each network monitoring node has accumulated some labeled encrypted network traffic packet samples. The number and types of marked samples on each node are limited, and the training of a complex deep learning model cannot be completed independently.
The encrypted network traffic packets newly received by each node include both existing-type and new-type traffic packets. An existing type of encrypted network traffic packet ("existing-type traffic packet" for short) is one for which some samples have already been assigned the correct class label; such samples are called labeled samples. A new type of encrypted network traffic packet ("new-type traffic packet" for short) is one for which no sample has yet been assigned a class label. We assume that different network monitoring nodes assign the same class label to samples of the same type of encrypted traffic packet.
To overcome the limited number and variety of labeled network traffic packets at a single network node, model training is performed cooperatively across different nodes using their respective labeled samples. By integrating the sample resources of multiple nodes, the number of labeled samples available for model training is increased, avoiding the over-fitting problem, and the trained model can learn the differences in traffic pattern characteristics across network regions.
With respect to the problem under study, the inventors have mainly conducted the following three specific studies.
(1) The method for extracting the characteristics of the encrypted network traffic packet in the distributed scene comprises the following steps: the feature extraction model can be used in other methods such as new type traffic packet detection, new type traffic packet labeling, existing type traffic packet classification and the like.
(2) The new type encrypted network flow packet detection method under the distributed scene comprises the following steps: in the newly received network traffic, the existing type and the new type of encrypted network traffic packets coexist. If we directly classify the newly received network traffic, the new type traffic packet will be wrongly classified into a certain existing type, resulting in a classification error. Therefore, it is necessary to detect and separate a new type of encrypted network traffic packet from newly received network traffic of different nodes.
(3) The method for mining and utilizing the new type of encrypted network traffic packet in the distributed scene comprises the following steps: the new type of traffic packets detected at different network nodes contain valuable pattern information. We will study how to mine this information and use it to update existing models.
1. Method for extracting characteristics of encrypted network traffic packet in distributed scene
This section introduces a method for extracting features of encrypted network traffic packets in a distributed scenario. The method can serve other methods such as new-type traffic detection, new-type traffic labeling, and existing-type traffic classification. The method mainly comprises the following steps: first, a feature extraction model is designed, used to convert raw encrypted traffic packets directly into feature vectors; then, a training method for the feature extraction model in the distributed scenario is designed. The specific steps are as follows:
(1) A preparation stage: network traffic monitoring nodes respectively monitor the network traffic of the different network areas for which they are responsible; each node independently collects a certain number of network traffic packet samples that have undergone class labeling, i.e., have been assigned class labels (referred to as labeled samples);
(2) Constructing a feature extraction model: the network traffic packet feature extraction model f_θ can be expressed as v = f_θ(x), where x is an encrypted network traffic packet and v is the feature vector extracted by the model; the feature extraction model f_θ comprises at least a one-dimensional convolution (1D CNN) layer and an Attention layer; the output of the Attention layer is converted into a set of weights; this set of weights serves as the weights of the different channels of the one-dimensional convolution layer and is used to rescale that layer's original output values; as an optimization, the feature extraction model f_θ further comprises a one-dimensional pooling layer, a fully-connected layer, and an activation layer. Fig. 1 shows the combination of the one-dimensional convolution layer and the Attention layer in the feature extraction model. In concrete implementations, the feature extraction model generally includes multiple convolutional layers. The convolutional layer structures commonly used in computer vision are mainly two-dimensional and three-dimensional, and some researchers have applied them to encrypted traffic classification scenarios. However, network traffic is essentially sequential data, a one-dimensional byte stream, so the feature extraction model uses one-dimensional convolution layers and one-dimensional pooling layers as the basic components of its convolutional neural network. An Attention layer is also introduced into the feature extraction model; it takes the output of a given convolutional layer as its input, to capture the characteristic differences between the different channels of that layer.
The Attention layer, combined with Softmax, converts the captured difference information into a set of weights. This set of weights serves as the weights of the different channels of the convolution layer and rescales the layer's original output values, so that the different output features of the convolution layer are dynamically weighted. As an optimization, the feature extraction model f_θ further comprises a one-dimensional pooling layer, a fully-connected layer, and an activation layer. "1D CNN" in Fig. 1 denotes a sub-network of the artificial neural network whose main components are one-dimensional convolution layers, and the "other layers" in Fig. 1 generally consist of components such as one-dimensional convolution layers, pooling layers, and fully-connected layers.
The depth and structure of the intermediate layers of the feature extraction model need to be determined comprehensively from factors such as the number of training samples and machine performance. According to deep learning theory, in general, given enough samples, the more complex the model structure and the deeper its hierarchy, the stronger the model's expressive capability. The number of neurons in the output layer of the feature extraction model is kept on the same order of magnitude as the number of existing encrypted network traffic packet classes, and can generally be set equal to or slightly larger than the number of existing classes. A simple embodiment of the feature extraction model, suitable for scenarios with a limited number of training samples, is given below. The input of this example model is an encrypted network traffic packet x in one-dimensional form; the output is the corresponding feature vector v. The model consists of seven layers: three convolutional layers, two pooling layers, and two fully-connected layers. The Attention layer is inserted after the second convolutional layer in the form of a bypass. The result computed by the Attention layer dynamically weights the output features of the second convolutional layer: each output of that layer is multiplied by the corresponding weight in the weight set provided by the Attention layer, producing a weighted output, which is then passed on to the subsequent modules for processing.
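As a rough illustration, the bypass-style channel weighting described above can be sketched in NumPy. This is a toy stand-in, not the patent's actual architecture: `conv1d` is a single valid-mode convolution, and the Attention branch is reduced to one hypothetical linear map followed by a softmax over channels.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conv1d(x, kernels):
    """Valid-mode 1D convolution: x is a (length,) byte stream,
    kernels is a (channels, width) filter bank."""
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, width)
    return windows @ kernels.T            # (length - width + 1, channels)

def attention_weighted_conv(x, kernels, w_att):
    """Convolve, then rescale each channel by an Attention-derived weight,
    mirroring the bypass structure of Fig. 1."""
    feats = conv1d(x, kernels)            # raw per-channel outputs
    summary = feats.mean(axis=0)          # per-channel summary fed to Attention
    weights = softmax(w_att @ summary)    # one softmax weight per channel
    return feats * weights                # dynamically weighted outputs

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=64) / 255.0  # normalized packet bytes (hypothetical)
kernels = rng.standard_normal((4, 5))      # 4 channels, width-5 filters
w_att = rng.standard_normal((4, 4))        # toy Attention parameters
out = attention_weighted_conv(x, kernels, w_att)
print(out.shape)                           # (60, 4)
```

In a full implementation each of these pieces would be a trainable layer; the sketch only shows how the Attention branch reweights, rather than replaces, the convolutional outputs.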
To train the feature extraction model f_θ, two problems need to be solved. First, the input of the feature extraction model is a traffic packet sample and the output is a feature vector; however, no prior knowledge about the optimal feature vector is available, so the training process cannot be guided directly. Second, the sample resources are distributed over multiple independent nodes, so a cooperative training mechanism needs to be constructed.
In principle, the feature extractor should not change the class attribution of traffic packet samples; that is, samples of the same type should lie close together in the feature space. Therefore, the class labels of the labeled samples can be used to supervise and guide the training process and optimization direction of the feature extractor. To realize this idea, an interface model f_e is constructed on top of the feature vector v.
(3) Constructing an interface model: the interface model f_e is formed by nesting two modules, softmax and argmax; the interface model may be expressed as y = f_e(v) = argmax(softmax(v));
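A minimal NumPy sketch of this nesting, where the predicted class index y is simply the position of the largest normalized element:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def interface_model(v):
    """y = f_e(v) = argmax(softmax(v)): map a feature vector to a class index."""
    return int(np.argmax(softmax(v)))

v = np.array([0.1, 2.3, 0.4, 1.1])
print(interface_model(v))  # prints 1
```

Since softmax is monotonic it does not change the argmax; the normalization matters later, when the normalized element values themselves are compared against thresholds, as in the detection method of Section 2.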
(4) Constructing an optimization equation: the optimization equation can be expressed as

$$\theta^{*} = \arg\min_{\theta} \sum_{(x_i, y_i) \in D} l\left(f_e(f_\theta(x_i)),\; y_i\right)$$

where D is the set of labeled samples and l is the loss function;
the feature extraction model is constructed based on a deep neural network technology, and a large number of marked training samples are required to be used as supports. In order to increase the number of samples for model training and improve the accuracy of a model, a distributed training scheme of the model is constructed to realize model-level sharing of sample resources accumulated by each node.
(5) Distributed training of the model: multiple network traffic monitoring nodes ("nodes" for short) use the labeled network traffic packet samples respectively collected in step (1) and, in a cooperative manner, train the feature extraction model f_θ of step (2) according to the optimization equation given in step (4), until the model converges or a preset error threshold is reached. The specific steps of the distributed training of the model are as follows:
(5.1) Model initialization: select one node as the sink node; the sink node first randomly initializes the parameters of the feature extraction model f_θ to obtain the initial parameters θ_0, and then sends θ_0 to the other nodes;
(5.2) Local model training: node i uses the received θ_0 to initialize f_θ and constructs the local optimization equation

$$\theta_i^{1} = \arg\min_{\theta} \sum_{(x, y) \in D_i} l\left(f_e(f_\theta(x)),\; y\right)$$

where D_i is the labeled encrypted network traffic dataset accumulated locally at node i; based on this optimization equation, node i optimizes the model f_θ to obtain the optimized model parameters θ_i^1, and feeds the optimization result θ_i^1 back to the sink node;
(5.3) Generating the model parameters of the current stage: the sink node receives the feedback results {θ_i^1} from all participating nodes and calculates their mathematical expectation; the model parameters of the current round of distributed training are

$$\theta^{1} = E[\theta_i^{1}] = \frac{1}{N} \sum_{i=1}^{N} \theta_i^{1}$$

where N is the number of participating nodes; the sink node then sends the current-stage model parameters θ^1 to the other nodes;
(5.4) Repeat steps (5.2) and (5.3) until the model converges or a preset error threshold is reached, obtaining the current optimal model parameters θ^*;
(5.5) All nodes obtain the current optimal model parameters θ^* from the sink node and construct the current optimal feature extraction model f_{θ^*}, used for extracting feature vectors of network traffic packets.
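Steps (5.1) through (5.5) amount to iterated parameter averaging, in the style of federated averaging. The sketch below simulates three nodes in NumPy; the linear least-squares `local_train` is a hypothetical stand-in for optimizing f_θ on each node's private labeled data, and the data and dimensions are illustrative only.

```python
import numpy as np

def local_train(theta, data, lr=0.1, steps=50):
    """Stand-in for step (5.2): a node optimizes the shared parameters
    on its own labeled dataset D_i (here, a noiseless linear regression)."""
    X, y = data
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(1)
true_theta = np.array([1.0, -2.0])          # ground truth the nodes should recover
node_data = []
for _ in range(3):                          # three monitoring nodes, private samples
    X = rng.standard_normal((40, 2))
    node_data.append((X, X @ true_theta))

theta = np.zeros(2)                         # (5.1) sink node broadcasts theta_0
for _ in range(10):                         # (5.4) repeat until convergence
    local = [local_train(theta, d) for d in node_data]  # (5.2) theta_i
    theta = np.mean(local, axis=0)          # (5.3) theta = E[theta_i]

print(np.round(theta, 2))                   # close to [ 1. -2.]
```

Only parameters travel between the nodes and the sink node, which is exactly the model-level sample sharing the scheme claims: raw traffic never leaves its node.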
2. New type encrypted network flow packet detection method in distributed scene
In the newly received network traffic, existing-type and new-type encrypted network traffic packets coexist. If the newly received traffic is classified directly, new-type traffic packets will be wrongly assigned to some existing type, causing classification errors. New-type encrypted network traffic packets therefore need to be detected and separated from the newly received traffic of the different nodes. To this end, the inventor provides the following method for detecting new-type network traffic packets in a distributed scenario:
(1) A preparation stage: network traffic monitoring nodes respectively monitor the network traffic of the different network areas for which they are responsible; each node independently collects a certain number of network traffic packet samples that have undergone class labeling, i.e., have been assigned class labels (referred to as labeled samples); a new type of network traffic packet is one for which no sample has undergone class labeling;
(2) Training the feature extraction model: the feature extraction model can be expressed as v = f_θ(x), where x is a network traffic packet and v is the feature vector extracted by the model; a plurality of network traffic monitoring nodes (referred to as "nodes") use their respectively collected labeled network traffic packet samples to train the optimal parameter θ* of the feature extraction model in a cooperative manner;
(3) Acquisition of positive samples for detection model training: each node extracts a feature vector from every newly received network traffic packet with the feature extraction model and compares it with a preset vector to judge whether the packet should serve as a positive sample in the training set of the new-type traffic packet detection model; the labels of all positive samples are set to the same value. The specific steps for obtaining positive samples for detection model training are as follows:
(3.1) Defining a threshold vector [α_1, α_2, …, α_k]:
The length of the threshold vector is k; each element α_i marks a range and consists of two parts, flag and value, where flag ∈ {+, −} and value ∈ (0, 1); a positive flag indicates that value defines the right boundary of the interval, with the left boundary being 0; a negative flag indicates that value defines the left boundary, with the right boundary being 1;
The threshold vector [α_1, α_2, …, α_k] is determined from historical data, as the following example illustrates. Equal numbers of existing-type and new-type traffic packet samples are input into the feature extraction model to obtain their feature vectors. To form quantifiable threshold comparisons for the subsequent scheme, the feature vectors are normalized with a Softmax module. For each sample's feature vector, the elements are sorted in descending order. Most element values in each vector are close to 0, of the same order of magnitude as noise errors, so analyzing and comparing them is not meaningful; therefore only the top-k elements of each vector are recorded. For the existing-type and new-type traffic packets, histograms of their top-k feature vector elements are drawn for comparison. Fig. 2 shows histogram statistics of the top-3 feature vector elements, where the horizontal axis represents element values and the vertical axis the number of samples with those values. Fig. 2(a) shows the distribution of the first three sorted elements of the new-type traffic feature vectors (corresponding to k = 1, 2, 3 in the figure), and Fig. 2(b) the same for the existing-type traffic. The first column corresponds to the distribution of the top-1 element, and the second and third columns to the second- and third-largest elements.
Although the distribution intervals of new-type and existing-type samples overlap, distribution intervals can still be selected that yield new-type samples with very high confidence. Taking Fig. 2 as an example, when the top-3 elements of a sample's output vector lie in the intervals [0, 0.75], [0.2, 1] and [0.1, 1] respectively, the confidence that the sample is a new-type sample is very high. For this example the threshold vector can be written as [+0.75, −0.2, −0.1]. With this threshold vector, a high-confidence new-type traffic packet sample extraction model can be constructed, as shown in Fig. 3.
(3.2) Extracting the top-k elements of the network traffic packet's feature vector: extract the feature vector of the packet (its length is greater than or equal to k); sort the elements and keep the largest k (the top-k elements), denoted [v′_1, v′_2, …, v′_k];
(3.3) Obtaining positive samples for detection model training: compare the threshold vector [α_1, α_2, …, α_k] with the top-k element vector [v′_1, v′_2, …, v′_k] to determine whether the sample is a high-confidence new-type sample; check whether the k elements of [v′_1, v′_2, …, v′_k] lie within the intervals marked by the corresponding k elements of [α_1, α_2, …, α_k]; if all of them do, set a positive-sample label for the sample and add it to the positive sample set for detection model training. The samples in the positive sample set represent new-type samples with very high confidence. The specific algorithm is as follows:
(The algorithm pseudocode is given as an image in the original publication.)
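The threshold check of steps (3.1)-(3.3) can be sketched in Python as follows; the function names, the (flag, value) tuple encoding of the threshold elements, and the toy feature vectors are illustrative assumptions, not text from the patent:

```python
import math

def softmax(v):
    """Normalize a raw feature vector so its elements sum to 1 (step 3.1 setup)."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def in_marked_range(value, alpha):
    """alpha = (flag, bound): '+' marks the interval [0, bound], '-' marks [bound, 1]."""
    flag, bound = alpha
    return 0.0 <= value <= bound if flag == '+' else bound <= value <= 1.0

def is_high_confidence_new_type(feature_vec, thresholds):
    """Steps (3.2)-(3.3): softmax-normalize, sort descending, keep the top-k
    elements, and require every one to fall inside its marked range."""
    k = len(thresholds)
    top_k = sorted(softmax(feature_vec), reverse=True)[:k]
    return all(in_marked_range(v, a) for v, a in zip(top_k, thresholds))

# Example threshold vector [+0.75, -0.2, -0.1] from the discussion of Fig. 2.
thresholds = [('+', 0.75), ('-', 0.2), ('-', 0.1)]
```

A flat feature vector (uncertain model output) passes the check and is taken as a high-confidence new-type positive sample, while a sharply peaked vector does not.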
(4) Obtaining negative samples for detection model training: negative samples represent the existing types and are drawn from the existing labeled network traffic packet sample set; each node randomly selects a certain number of samples from its labeled samples of step (1) as negative samples; the number of selected negative samples is the same as, or close to, the number of positive samples; the labels of all negative samples are set to the same value, which must differ from the positive-sample label (for example, 0 and 1 respectively).
(5) Constructing the new-type traffic packet detection model: the detection model f_n consists of two submodels, f_b and f_θ, combined in series, and can be expressed as y′ = f_n(x) = f_b(f_θ(x));
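A minimal sketch of the serial combination y′ = f_n(x) = f_b(f_θ(x)); the stand-in extractor and binary head below are placeholders for the trained submodels, not the patent's implementation:

```python
def compose(f_b, f_theta):
    """Serial combination: the detection head consumes the extractor's output."""
    def f_n(x):
        return f_b(f_theta(x))
    return f_n

# Hypothetical stand-ins for the trained submodels:
f_theta = lambda pkt: [byte / 255.0 for byte in pkt]   # packet bytes -> feature vector
f_b = lambda vec: 1 if max(vec) < 0.5 else 0           # flat features -> "new type" (1)
f_n = compose(f_b, f_theta)
```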
(6) Construction of the optimization equation: the optimization equation is expressed as

n* = argmin_n Σ_j l(f_n(x_j), y_j)

where l is the loss function and (x_j, y_j) are the training samples and their labels;
(7) Distributed training of the model: the plurality of network traffic monitoring nodes (referred to as "nodes") use the samples respectively collected in steps (3)-(4) and, following the optimization process given in step (6), cooperatively train the new-type traffic packet detection model f_n of step (5) until the model converges or reaches a preset error threshold. The specific steps of the distributed training are as follows:
(7.1) Model initialization: the model initialization parameter is defined as n_0 = [b_0, θ*], where θ* is the parameter of submodel f_θ and already exists at each node; the sink node therefore only needs to randomly initialize the parameter b_0 of submodel f_b and send the initialization result to each node;
(7.2) Model construction: each node builds the new-type traffic packet detection model y = f_n(x) = f_b(f_θ(x)) and initializes the model parameters with the received initialization parameter b_0 and the current optimal parameter θ* of submodel f_θ;
(7.3) Local model training: first, node i constructs the local optimization equation

min_n Σ_{(x_j, y_j) ∈ D_i} l(f_n(x_j), y_j)

where D_i is node i's local training sample set, which includes both new-type samples (i.e. positive samples) and existing-type samples (i.e. negative samples); then node i optimizes the model f_n on this set, obtaining the optimized model parameters n_1^(i); after training, node i feeds the optimization result n_1^(i) back to the sink node;
(7.4) Generating the training result of the current round: the sink node receives the feedback results n_1^(i) from all participating nodes and calculates their mathematical expectation, obtaining the optimization result of the current round of distributed training

n_1 = E[n_1^(i)]

The sink node then sends the current round's optimization result n_1 to each node;
(7.5) Repeating steps (7.3)-(7.4) until the model converges or reaches a preset error threshold, thereby obtaining the final model parameter n*.
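Steps (7.1)-(7.5) follow the familiar federated-averaging pattern: local training at each node, expectation at the sink node, broadcast, repeat. The sketch below simulates this loop with a toy "local update" standing in for real gradient training of f_n; all names and the update rule are illustrative assumptions:

```python
def local_update(params, data, lr=0.1):
    """Stand-in for step (7.3): one node's local optimization of f_n.
    Each node nudges the parameters toward its local data."""
    return [p - lr * (p - d) for p, d in zip(params, data)]

def federated_round(global_params, node_datasets):
    """Steps (7.3)-(7.4): every node trains locally, the sink node averages."""
    feedbacks = [local_update(global_params, data) for data in node_datasets]
    n = len(feedbacks)
    return [sum(vals) / n for vals in zip(*feedbacks)]   # expectation over nodes

def distributed_train(global_params, node_datasets, rounds=50):
    """Step (7.5): repeat the round (a fixed round budget stands in for a
    convergence / error-threshold test)."""
    for _ in range(rounds):
        global_params = federated_round(global_params, node_datasets)
    return global_params
```

With two toy nodes whose local data are 1.0 and 3.0, the global parameter converges toward their common compromise, 2.0.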
After distributed training ends, all nodes obtain the final optimal model parameter n* from the sink node and construct the current optimal new-type traffic packet detection model f_n*. This model detects and identifies all network traffic packets newly received in the current round's time interval, so as to separate out the new-type traffic packets. Because the detection model f_n* introduces both the existing types and the new type as mutual contrast during training, it can learn the discriminating feature information between the new type and the existing types more comprehensively than the earlier simple threshold segmentation approach, and its detection capability is greatly improved.
3. Method for mining and utilizing new type encrypted network flow packet in distributed scene
This section mainly includes two aspects. The first is a globally consistent category label assignment method: the new-type encrypted traffic packets on the different nodes are divided into different subclasses, and globally uniform labels are assigned to the samples of the different subclasses. The second is a method for updating the existing models: the existing models are updated in a distributed manner using the samples on each node that have been assigned globally consistent labels.
Globally consistent class label assignment faces challenges. New-type traffic packets may themselves fall into several different types, and new packets with similar pattern characteristics but distributed on different nodes should receive the same class label. The most direct way to achieve globally uniform labeling is for each node to upload its collected new-type traffic packets to a server, which labels them uniformly. This method is unsuitable, however, because the amount of raw traffic packet data is large. If each node independently performed a dimensionality-reduction operation and uploaded the result, the communication overhead could be reduced; but the dimensionality-reduction results of different nodes are not globally consistent and cannot be compared directly to achieve globally uniform class labeling.
In the method provided by the invention, globally consistent class labeling of new-type traffic packets comprises three processes. (1) Local sub-category division and local sub-category labeling of the new-type traffic packets: each node clusters its own new-type traffic packets into different sub-categories and, according to the clustering result, assigns an appropriate local label to each new-type sample. Note that because each node labels its samples independently, local labels of samples of the same type located on different nodes usually differ. (2) Globally consistent feature extraction for the local categories: each node extracts globally consistent class features for each local category and uploads them to the server. Ensuring the global consistency of the extracted class features is the premise and basis of the next step. (3) Globally consistent class label assignment: the server divides the local categories into different global categories according to the similarity of the feature data uploaded by the nodes and assigns a corresponding global label to each global category. Each local node then replaces its local category labels with the global labels, which guarantees global consistency of the class labels. The method for mining and utilizing new-type encrypted network traffic packets in a distributed scene comprises the following steps:
(1) A preparation stage: a plurality of network traffic monitoring nodes (referred to as "nodes") respectively monitor the network traffic of the different network areas for which they are responsible; each node independently collects a certain number of class-labeled (referred to as labeled) network traffic packet samples; the plurality of nodes train a new-type encrypted traffic packet detection model by the detection method for the distributed scene described above; a new-type network traffic packet is one of a type for which no sample has been class-labeled;
(2) Detecting new-type traffic packets: each node detects new-type network traffic packets from its newly received traffic packets; the mining and utilization of new-type traffic packets is carried out periodically, and each round of mining and utilization operates on all new-type traffic packets detected in the current period;
(3) Local sub-category label assignment: each node independently performs a local clustering operation on the new-type traffic packets detected in the current period and independently assigns a label to each sub-category of its clustering result; within the local clustering result, new-type traffic packet samples of the same sub-category receive the same local label, and the labels of different local sub-categories differ from each other;
(4) Local subcategory feature vector extraction: each node selects a globally uniform reference; based on the global uniform reference, each node extracts globally consistent class feature vectors for each local sub-class respectively; each node uploads the feature vectors of the local subcategories to the sink node together with the local subcategory labels corresponding to the local subcategories; the method comprises the following specific steps:
(4.1) Designing a globally consistent reference model:
The globally consistent reference model is defined as y = f_μ(x) = f_e(f_θ(x)) = argmax(softmax(f_θ(x))); submodel f_θ is the encrypted network traffic packet feature extraction model, which each node initializes with the globally optimal model parameter θ*; submodel f_e contains no parameters to be optimized and needs no initialization;
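A minimal sketch of the reference model y = f_e(f_θ(x)) = argmax(softmax(f_θ(x))); the identity stand-in for f_θ is an illustrative assumption:

```python
import math

def softmax(v):
    """Normalize raw scores into a probability distribution."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def f_e(feature_vec):
    """Parameter-free head: argmax over the softmax-normalized features."""
    probs = softmax(feature_vec)
    return probs.index(max(probs))

def f_mu(f_theta, x):
    """Globally consistent reference model: y = f_e(f_theta(x))."""
    return f_e(f_theta(x))
```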
(4.2) Training of the sub-category incremental model: each node independently trains an incremental model for each of its local sub-category sample sets; the optimization equation for incremental training is

min_θ Σ_{x_j ∈ S_c} l(softmax(f_θ(x_j)), y_c)

where S_c is the sample set of local sub-category c and y_c its label;
It should be noted that although the incremental model training process uses a conventional deep learning training method, the training sample composition, purpose and cost differ. Conventional deep learning training data contain many samples of different classes, and the purpose is to learn the feature information of those different classes so as to improve model accuracy; because the differences between classes are large, convergence is slow and training is expensive. In the incremental training designed in this scheme, the training samples come from a single local sub-category, and the aim is to learn the feature information of that single class of data. Because the differences between samples of the same category are small, the model converges very quickly. Experiments show that even a few epochs of training suffice for the parameters to achieve excellent class representation. Taking the VPNnonVPN data set as an example, 100 single-class training sample sets were generated by random sampling and incremental training tests were performed independently on each; whenever epoch > 2, the training accuracy of all 100 runs was close to 1.
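The fast convergence of single-class incremental training can be illustrated with a toy linear softmax head trained by cross-entropy gradient descent; this is a hypothetical stand-in for the real model, not the patent's implementation:

```python
import math

def train_single_class(samples, target_class, n_classes, epochs=3, lr=0.5):
    """Fit a linear softmax head to samples that all share one label.
    Because the targets are identical, the loss collapses within a few epochs."""
    dim = len(samples[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x in samples:
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(n_classes)]
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            probs = [e / total for e in exps]
            for c in range(n_classes):      # cross-entropy gradient step
                g = probs[c] - (1.0 if c == target_class else 0.0)
                b[c] -= lr * g
                for j in range(dim):
                    W[c][j] -= lr * g * x[j]
    return W, b

def predict(W, b, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) + bc
              for row, bc in zip(W, b)]
    return logits.index(max(logits))
```

After just three epochs on two similar single-class samples, the head already assigns them their (single) target class.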
(4.3) Extracting the sub-category features based on the incremental model: each node selects, according to the same rule, a parameter subset from each sub-category model's parameters as the feature vector of that sub-category; each node uploads each local sub-category feature vector, together with the corresponding local sub-category label, to the sink node;
Because different nodes use the same reference model for incremental training, the sub-category features are globally consistent. Meanwhile, the sub-category incremental model parameters discriminate well between sub-categories. Fig. 4 shows the class feature expression capability of the incremental model parameters. Each point in Fig. 4 corresponds to the parameter vector of one incremental model, trained from a sampled subset of some class of data. For convenience of display, the parameter vectors are reduced in dimension by principal component analysis and shown in two-dimensional space. As the figure shows, the parameter vectors of incremental models of different sample types are well separated, and the incremental models obtained from sample subsets of the same type lie at adjacent positions in the parameter space. However, these incremental model parameters are not suitable for direct use as local sub-category feature vectors: the parameter spaces of incremental models of different local classes can overlap. For example, the two models for the categories numbered 8 and B in Fig. 4 coincide in the parameter space, which makes those two categories inseparable.
To avoid the overlap of some sub-category feature spaces and to reduce the dimension of the feature vectors and the communication overhead, a representative optimized parameter subset is extracted from the model parameters and used as the sub-category feature vector. The scheme selects the bias parameters of the last layer of the model. First, from the input layer to the output layer of a deep neural network, the later a layer is, the higher its abstraction capability and the stronger its category expression capability. Second, from the perspective of the back-propagation algorithm, the parameters of layers near the output are adjusted first, so they are most strongly affected by the incremental training process and can capture the feature information of the relevant class of training data. In addition, in most deep learning models, layers near the output have relatively few nodes, so those layers also have fewer parameters. Taking the classical LeNet-5 model as an example, the model has more than 40000 parameters in total, while the feature parameters selected by our scheme number only 10. The category expression capability of the optimized parameter subset is evaluated in the "Performance evaluation" section.
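Selecting the last-layer bias as the sub-category feature amounts to a simple parameter-subset extraction; the list-of-(kernel, bias) parameter layout below is an assumed representation, not the patent's data structure:

```python
def subclass_feature_vector(model_params):
    """Return the last layer's bias as the sub-category feature vector.
    `model_params` is an ordered list of (kernel, bias) pairs, one per layer."""
    _, last_bias = model_params[-1]
    return list(last_bias)

# A LeNet-5-like model has >40000 parameters in total, but the last-layer
# bias alone (10 values for a 10-class output) serves as the feature.
```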
(5) Globally consistent sub-category label assignment: the sink node collects the sub-category feature vectors and local label information from the different nodes; on the basis of the collected feature vectors it performs global clustering of the local sub-categories and assigns a different global label to each global sub-category in the clustering result; from the collected local labels and the newly assigned global labels it establishes, for the local sub-categories of every node, a mapping from local sub-category labels to global labels and feeds the mapping back to the corresponding nodes; each node then modifies the local class label of each of its samples into the global class label according to the received mapping scheme. The whole process is shown in Fig. 8.
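The sink-node side of step (5) can be sketched as a greedy distance-based grouping of the uploaded feature vectors; the patent does not specify the global clustering algorithm, so the radius-based merge below is only an illustrative stand-in:

```python
def assign_global_labels(uploads, radius=1.0):
    """Greedily merge local sub-class feature vectors lying within `radius`
    of an existing global centre; returns (node, local_label) -> global label."""
    centres, mapping = [], {}
    for node, local_label, vec in uploads:
        for g, centre in enumerate(centres):
            dist = sum((a - b) ** 2 for a, b in zip(vec, centre)) ** 0.5
            if dist <= radius:
                mapping[(node, local_label)] = g
                break
        else:
            centres.append(vec)
            mapping[(node, local_label)] = len(centres) - 1
    return mapping
```

Two local sub-categories on different nodes with nearby feature vectors receive the same global label, while a distant one starts a new global category.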
Through the above operations, many new-type sample sets are obtained. Next, the new-type sample sets are used as training data to update the feature vector extraction model, the new-type encrypted traffic packet detection model, the existing-type encrypted traffic classification model, and other models. The updating principle for these models is basically similar; the updating of the feature extraction model is explained below as an example.
(6) Updating the model: the plurality of network traffic monitoring nodes extend the model and cooperatively train the extended model with their respectively collected samples that were assigned global labels in step (5), until the model converges or reaches a preset error threshold. The specific steps are as follows:
(6.1) Extension of the model: the model is extended according to the total numbers of newly added categories and newly added samples; when the numbers of new classes and samples are small, only the number of output-layer neurons is increased; when they are very large, the number of intermediate processing layers, or the number of neurons in each layer, must also be increased;
When the model is updated, the total numbers of classes and samples grow as new-type training samples are added, so the model must be expanded appropriately to match its complexity to the numbers of samples and classes. Specifically, when the numbers of added categories and samples are much smaller than those of the existing classes in the previous training, it suffices to modify the output layer of the model, adding as many neurons as there are new classes. When the numbers of added categories and samples are very large, the intermediate layers must also be expanded, either by adding layers or by widening the existing intermediate layers.
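Output-layer expansion as described above can be sketched as appending randomly initialised weight rows while keeping the trained ones; the (kernel, bias) layout and the initialisation range are assumptions for illustration:

```python
import random

def expand_output_layer(kernel, bias, n_new_classes, seed=0):
    """Keep the trained per-class weight rows and biases, and append randomly
    initialised rows/biases for the newly discovered classes."""
    rng = random.Random(seed)
    dim = len(kernel[0])
    new_kernel = [row[:] for row in kernel]
    new_bias = list(bias)
    for _ in range(n_new_classes):
        new_kernel.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
        new_bias.append(0.0)
    return new_kernel, new_bias
```

This mirrors step (6.2): the original base parameters survive unchanged, and only the extension part starts from random numbers.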
(6.2) Model initialization: the parameters of the original base model within the extended model are initialized with the existing optimal feature parameters; the neuron parameters of the extension part are initialized with random numbers;
(6.3) Optimization equation: the optimization equation is defined as

θ′* = argmin_θ′ Σ_j l(f′_θ(x_j), y_j)

wherein f′_θ is the extended model;
(6.4) Model training: the plurality of network traffic monitoring nodes cooperatively train the extended model with their respectively collected globally labeled samples until the model converges or reaches a preset error threshold.
4. Performance evaluation
1. Experimental setup
1) Data set for evaluation: the data set used in the experiments consists of two parts. One part comes from the ISCX VPNnonVPN data set, which includes different types of regular encrypted traffic and protocol-encapsulated traffic. Samples with disputed categories were deleted. The final cleaned data set has 12 categories in total: 6 regular encrypted traffic classes (Chat, Email, File, P2P, Streaming, Voip) and 6 protocol-encapsulated traffic classes (Vpn_Chat, Vpn_Email, Vpn_File, Vpn_P2P, Vpn_Streaming, Vpn_Voip). Unless stated otherwise, the 10 digits 0 through 9 identify the first 10 categories in the listed order, and the 2 letters A and B identify the last 2. However, some categories of the data set have too few samples (the Vpn_Email class has only 253 samples in total), and the sample distribution across classes is highly unbalanced (the Chat class has 5257 samples, far more than Vpn_Email). For this reason, each category of the data set was sample-expanded so that the numbers of samples in all categories are substantially equal.
2) Platform and model: the deep learning framework used in the experiments is TensorFlow, and the federated learning platform is the TensorFlow Federated (TFF) framework. In the experiments the federated learning mechanisms are realized in local mode, i.e. the client nodes and the server are virtual and in fact reside on the same device. Several models are mentioned in the proposed solution, such as the feature extraction model and the new-type detection model. The main part of these models is similar to LeNet-5, modified in three ways. First, all 2D modules are replaced by 1D modules; for example, the 2D convolution module of the convolutional layer is replaced by a 1D convolution module. Second, an attention layer is added to the model. Third, the input and output layers are adapted to the sample size and the number of traffic classes. The main part of the model consists of seven layers: three convolutional layers, two pooling layers and two fully connected layers, with the convolutional and pooling layers implemented in one-dimensional form. The attention layer is inserted after the second convolutional layer in the form of a bypass, and its output dynamically weights the output of the second convolutional layer.
3) Evaluation indexes: the evaluation indexes used in the experiments are Accuracy, Precision, Recall and F1-score.
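For reference, the four indexes can be computed from binary confusion-matrix counts as follows (a generic sketch, not code from the patent):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```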
2. Model component selection and performance comparison.
The model employed in the proposed solution is constructed on the basis of 1D-CNN and the attention mechanism. Table 1 compares the performance of different model strategies: ours (based on 1D-CNN and the attention mechanism), a 1D-CNN-based model and a 2D-CNN-based model. According to the experimental results, the classification performance of the model based on 1D-CNN plus attention is better than that of either the 1D-CNN or the 2D-CNN model alone. Note that comparing the results of this experiment with those of the subsequent experiments is meaningless: on the one hand, the numbers of samples of the different classes here are very unbalanced, while the data sets of the subsequent experiments are class-balanced; on the other hand, the number of samples used for model training here is much larger than in the subsequent experiments.
TABLE 1 comparison of Performance of different model strategies
(Table 1 is provided as an image in the original publication.)
3. The new type of traffic packet detects model performance.
As can be seen from Fig. 2, the output vectors of new-type and known traffic packets differ markedly in the dimensions corresponding to the top-3 vector elements. To show the difference more intuitively, the top-3 elements of the output vector are used as three dimensions of a three-dimensional space, and the samples are plotted in this space; samples labeled 0 are existing-type samples and samples labeled 1 are new-type samples. To observe the distribution characteristics of the existing-type and new-type samples, the three-dimensional plot is projected onto three different 2-dimensional planes for easier viewing; the results are shown in Figs. 5(a), 5(b) and 5(c).
As Figs. 5(a), 5(b) and 5(c) show, most existing-type and new-type samples occupy different regions of the space, with a fairly obvious distribution difference; by choosing appropriate threshold parameters, most new-type samples can easily be separated out. At the same time, the existing-type and new-type samples overlap in some regions: for example, many samples labeled 0 or 1 are densely mixed in the lower-right corner of Figs. 5(a) and 5(b), and likewise in the lower-left corner of Fig. 5(c). A threshold-based segmentation scheme therefore cannot completely separate them in these regions. We compared the performance of the new-type sample detection scheme proposed by the invention with a threshold-based segmentation scheme; Table 2 shows the results. In the experiment, the segmentation threshold of the first dimension was set to 0.9 and the thresholds of the other two dimensions to 0.1.
TABLE 2 Performance comparison of new-type sample detection schemes

Scheme                          Accuracy   Precision   Recall   F1-score
Threshold segmentation scheme   0.683      0.996       0.613    0.759
The invention                   0.942      0.986       0.906    0.944
As can be seen from Table 2, the Precision of the threshold segmentation scheme is very high, while its Recall and Accuracy are relatively low. The high Precision is mainly due to the very concentrated distribution of the existing-type samples in the top-3 dimensions: under threshold segmentation, existing-type samples are correctly identified with high probability, and the probability of mistaking an existing-type sample for a new-type sample is low.
The new-type samples, by contrast, are scattered, and many of them overlap with existing-type samples and cannot be separated directly by simple threshold segmentation. The threshold segmentation scheme therefore identifies new-type samples relatively poorly, with relatively low Recall and Accuracy.
The scheme of the invention improves both Recall and Accuracy by a large margin, while its Precision drops slightly relative to the threshold segmentation scheme. The new-type samples (labeled 1) in the invention's training data come from the threshold segmentation scheme; under the chosen thresholds, some existing-type samples are wrongly assigned to the new type, and the patterns of these mislabeled samples are inevitably learned during training, increasing the probability of classifying existing-type samples as new-type. Hence the slight drop in Precision relative to the threshold segmentation scheme.
4. Global consistent feature extraction scheme performance
Although an incremental model trained on different types of data has a certain class expression capability, the incremental model parameters are not suitable for direct upload to the server as class feature data, because the parameters are numerous and their expression capability needs further improvement.
The last-layer parameters of the incremental model, however, are highly expressive, as the following experiment verifies. The experimental data come from the VPNnonVPN data set; the five categories numbered [1, 2, 3, 4, 5] are treated as existing categories and the other 7 as new categories. A training sample set is constructed from the existing-type data, and a base model is trained. Then 56 training sample sets are randomly generated from the new-type data, each containing only samples of one of the 7 new classes, and incremental training is performed on each set starting from the base model. Because single-class incremental training converges very quickly, epoch was set to 3 for each set to reduce training cost; training a single incremental model took no more than 1 s, and at the end of training the training accuracy of most models reached 1.
The two groups of parameters (kernel and bias) of the first and last layers of each model are extracted and, after dimensionality reduction by PCA, compared visually. The clustering of the first layer's two parameter groups in 2-dimensional space is shown in fig. 6(a) and fig. 6(b); that of the last layer's two parameter groups in fig. 7(a) and fig. 7(b).
As figs. 6 and 7 show, the two parameter groups of the first layer cluster significantly worse than those of the last layer: in fig. 7 no categories overlap, whereas in fig. 6 the points of each category are scattered and categories overlap. The clustering of the bias parameter (figs. 6a and 7a) is significantly better than that of the kernel parameter (figs. 6b and 7b). Taking fig. 7 as an example, although both the kernel and bias parameters of the last layer cluster well, the points of the bias parameter (fig. 7a) are clearly more concentrated.
As shown in fig. 4, directly using the full set of model parameters as feature data is not a sensible choice: the feature vector is long, the categories are poorly separated, and some categories cannot be distinguished at all. Using the bias parameter of the last layer as the feature data instead greatly shortens the feature vector while significantly improving class separation (fig. 7a).
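As a rough sketch of the comparison above, the last-layer bias vectors of the 56 incremental models can be reduced to 2-D with PCA. Everything below is simulated: the per-class clustering of the bias vectors is assumed rather than computed from real traffic, and all shapes are hypothetical.

```python
# Illustrative sketch (not the patent's code): cluster per-class incremental
# models by the bias vector of their last layer, reduced to 2-D with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_classes, models_per_class, bias_len = 7, 8, 32  # 7 new classes, 56 models

# Simulate the last-layer bias of each incremental model: models fine-tuned
# on the same class are assumed to drift toward a shared per-class offset.
class_centres = rng.normal(size=(n_classes, bias_len))
biases = np.concatenate([
    centre + 0.05 * rng.normal(size=(models_per_class, bias_len))
    for centre in class_centres
])
labels = np.repeat(np.arange(n_classes), models_per_class)

# Reduce the 56 bias vectors to 2-D for visual comparison (cf. fig. 7a).
coords = PCA(n_components=2).fit_transform(biases)
print(coords.shape)  # → (56, 2), one 2-D point per incremental model
```

Because the bias vector is short (one entry per unit of the last layer), it makes a far more compact class signature than the full parameter set.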
5. Performance comparison before and after model update
In the adaptive-update global-model performance experiment, 3 scenarios are designed, each containing 9 known traffic categories and, respectively, 1, 2, and 3 unknown traffic categories; the scenarios are described in table 3. In each scenario, 2000 samples are randomly drawn from each existing category as training samples and a base classification model G1 is trained. Then 2000 further samples are randomly drawn from each existing and each new category as new traffic samples, on which the later stages of the proposed scheme are run to obtain an updated classification model G2; the traffic samples used in the two stages are disjoint. Finally, the performance of the two models before and after the update is analysed in each scenario; the results are shown in table 4. In all three scenarios the performance of the new model G2 is slightly lower than that of G1, and the more new categories there are, the larger the drop. This is because as the number of new categories grows, so does the proportion of new-category samples in the new sample set; since the identification and labeling of new-category samples is error-prone, a larger proportion of new-category samples means a larger influence of those errors on the final result.
Table 3  Experimental scenario description

Scene     Existing types           New types
Scene 1   [1,2,3,4,5,6,7,8,9]      [A]
Scene 2   [2,3,4,5,6,7,8,9,A]      [0,B]
Scene 3   [0,2,3,4,6,7,8,A,B]      [1,5,9]
Table 4  Adaptive-update model performance analysis

Scene   Model   Accuracy   Precision   Recall   F1-score
1       G1      0.948      0.949       0.948    0.948
1       G2      0.942      0.942       0.942    0.942
2       G1      0.952      0.953       0.952    0.952
2       G2      0.942      0.943       0.942    0.942
3       G1      0.968      0.969       0.968    0.968
3       G2      0.897      0.898       0.897    0.896
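For reference, the four metrics reported in table 4 can be computed with scikit-learn's standard functions. Macro averaging is assumed here (the text does not state which averaging is used), and the label values are made up for illustration.

```python
# Hedged sketch: computing the four metrics of table 4 for a model's
# predictions; the label values below are illustrative, not experimental data.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(round(acc, 3))  # → 0.875 (7 of 8 samples correct)
```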

Claims (2)

1. A method for extracting characteristics of an encrypted network traffic packet in a distributed scene is characterized by comprising the following steps:
(1) A preparation stage: the network traffic monitoring nodes each monitor the network traffic of the network area they are responsible for; each node independently collects a certain number of network traffic packet samples that have been class-labeled (assigned class labels), referred to as labeled samples;
(2) Constructing a feature extraction model: the network traffic packet feature extraction model f_θ can be expressed as v = f_θ(x), where x is an encrypted network traffic packet and v is the feature vector extracted by the model; the feature extraction model f_θ comprises at least a one-dimensional convolution (1D CNN) layer and an Attention layer; the output of the Attention layer is converted into a group of weights, which serve as the weights of the different channels of the one-dimensional convolutional layer and rescale its original output values; as an optimization, the feature extraction model f_θ further comprises a one-dimensional pooling layer, a fully connected layer, and an activation layer;
(3) Constructing an interface model: the interface model f_e is formed by nesting the two modules softmax and argmax, and can be expressed as y = f_e(v) = argmax(softmax(v));
(4) Constructing an optimization equation: over the labeled sample set {(x_i, y_i)}, the optimization equation can be expressed as

    θ* = argmin_θ Σ_i l(softmax(f_θ(x_i)), y_i)

where l is the loss function;
(5) Distributed training of the model: a plurality of network traffic monitoring nodes ("nodes") use the labeled samples each collected in step (1) to train the feature extraction model f_θ cooperatively according to the optimization equation given in step (4), until the model converges or reaches a preset error threshold.
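A minimal numpy sketch of the pipeline in steps (2) and (3) follows. This is not the patented implementation: all shapes, parameter values, and the attention stand-in (a per-channel mean score) are illustrative assumptions; only the structure (1-D convolution channels rescaled by softmax weights, then argmax(softmax(v))) follows the claim.

```python
# Illustrative pipeline: y = f_e(f_theta(x)) for one encrypted traffic packet.
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(100)                  # one traffic packet (100 bytes, made up)
kernels = rng.normal(size=(4, 5))    # 4 conv channels, kernel width 5

def f_theta(x):
    # 1D CNN layer: one output row per channel (valid padding)
    conv = np.stack([np.convolve(x, k, mode="valid") for k in kernels])
    # attention stand-in: one score per channel, softmaxed into weights
    scores = conv.mean(axis=1)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # the weights rescale the conv layer's original output values
    return (w[:, None] * conv).reshape(-1)   # feature vector v

def f_e(v):
    # interface model: y = argmax(softmax(v))
    p = np.exp(v - v.max())
    return int(np.argmax(p / p.sum()))

v = f_theta(x)
y = f_e(v)
print(v.shape)  # → (384,): 4 channels × 96 valid positions
```

Note that argmax(softmax(v)) selects the same index as argmax(v); the softmax matters during training, where the loss l operates on the probabilities.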
2. The method for extracting features of encrypted network traffic packets in a distributed scenario according to claim 1, wherein the step (5) specifically comprises the following steps:
(1) Model initialization: one node is selected as the sink node; the sink node first randomly initializes the parameters of the feature extraction model f_θ to θ_0 and then sends θ_0 to the other nodes;
(2) Local model training: node i uses the received θ_0 to initialize f_θ and constructs the local optimization equation

    θ_i^1 = argmin_θ Σ_{(x, y) ∈ D_i} l(softmax(f_θ(x)), y)

where D_i is the labeled encrypted network traffic data set accumulated locally at node i; node i optimizes f_θ according to this equation to obtain the optimized model parameters θ_i^1, and feeds the optimization result θ_i^1 back to the sink node;
(3) Generating the model parameters of the current stage: the sink node receives the feedback results {θ_i^1} from all N participating nodes and calculates their mathematical expectation

    θ^1 = E[θ_i^1] = (1/N) Σ_i θ_i^1

which serves as the model parameters of this round of distributed training; the sink node then sends the current-stage model parameters θ^1 to the other nodes;
(4) Steps (2) and (3) are repeated until the model converges or reaches a preset error threshold, thereby obtaining the current optimal model parameters θ*;
(5) All nodes obtain the current optimal model parameters θ* from the sink node and construct the currently optimal feature extraction model f_θ*, which is used to extract the feature vectors of network traffic packets.
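The aggregation loop of claim 2 (broadcast, local optimization, averaging) can be sketched as below. This is a hedged illustration of the federated-averaging structure the claim describes, not the patented system: the local "training" step is a placeholder gradient update on synthetic data, and all names and sizes are hypothetical.

```python
# Sketch of steps (1)-(4) of claim 2: the sink node broadcasts theta, each
# node optimizes locally, and the sink node averages the fed-back parameters.
import numpy as np

rng = np.random.default_rng(2)
n_nodes, dim, rounds = 3, 8, 5
theta = rng.normal(size=dim)               # theta_0: random initialization

def local_train(theta, node_seed):
    # placeholder for node i's optimization over its labeled samples
    g = np.random.default_rng(node_seed).normal(size=theta.shape)
    return theta - 0.1 * g

for r in range(rounds):
    # each node optimizes from the broadcast parameters, yielding theta_i
    updates = [local_train(theta, seed) for seed in range(n_nodes)]
    # the sink node takes the mathematical expectation (mean) as this
    # round's model parameters and broadcasts them again
    theta = np.mean(updates, axis=0)

print(theta.shape)  # → (8,): one shared parameter vector for all nodes
```

In the real scheme, convergence or a preset error threshold (step (4)) would terminate the loop instead of a fixed round count.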
CN202210665402.8A 2022-05-11 2022-05-11 Method for extracting characteristics of encrypted network traffic packet in distributed scene Pending CN115459937A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665402.8A CN115459937A (en) 2022-05-11 2022-05-11 Method for extracting characteristics of encrypted network traffic packet in distributed scene


Publications (1)

Publication Number Publication Date
CN115459937A true CN115459937A (en) 2022-12-09

Family

ID=84295995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665402.8A Pending CN115459937A (en) 2022-05-11 2022-05-11 Method for extracting characteristics of encrypted network traffic packet in distributed scene

Country Status (1)

Country Link
CN (1) CN115459937A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116582372A (en) * 2023-07-13 2023-08-11 深圳市前海新型互联网交换中心有限公司 Internet of things intrusion detection method, system, electronic equipment and storage medium
CN116582372B (en) * 2023-07-13 2023-09-26 深圳市前海新型互联网交换中心有限公司 Internet of things intrusion detection method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhang et al. Autonomous unknown-application filtering and labeling for dl-based traffic classifier update
Shapira et al. FlowPic: A generic representation for encrypted traffic classification and applications identification
Lotfollahi et al. Deep packet: A novel approach for encrypted traffic classification using deep learning
US10719780B2 (en) Efficient machine learning method
Byerly et al. No routing needed between capsules
Kayacik et al. A hierarchical SOM-based intrusion detection system
CN114615093B (en) Anonymous network traffic identification method and device based on traffic reconstruction and inheritance learning
Singh Performance analysis of unsupervised machine learning techniques for network traffic classification
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN102420723A (en) Anomaly detection method for various kinds of intrusion
CN108650194A (en) Net flow assorted method based on K_means and KNN blending algorithms
CN105141455B (en) A kind of net flow assorted modeling method of making an uproar based on statistical nature
CN108881192A (en) A kind of ciphering type Botnet detection system and method based on deep learning
Abdel-Hamid et al. A dynamic spark-based classification framework for imbalanced big data
Ganapathy et al. An intelligent intrusion detection system for mobile ad-hoc networks using classification techniques
Soleymanpour et al. An efficient deep learning method for encrypted traffic classification on the web
Lu et al. A heuristic-based co-clustering algorithm for the internet traffic classification
CN115459937A (en) Method for extracting characteristics of encrypted network traffic packet in distributed scene
CN115134128A (en) Method for mining and utilizing new type encrypted network flow packet in distributed scene
CN108494620B (en) Network service flow characteristic selection and classification method
CN114301850A (en) Military communication encrypted flow identification method based on generation countermeasure network and model compression
CN114095447A (en) Communication network encrypted flow classification method based on knowledge distillation and self-distillation
CN115348198B (en) Unknown encryption protocol identification and classification method, device and medium based on feature retrieval
Abdalla et al. Impact of packet inter-arrival time features for online peer-to-peer (P2P) classification
CN114358177B (en) Unknown network traffic classification method and system based on multidimensional feature compact decision boundary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination