CN115913992A - Anonymous network traffic classification method based on small sample machine learning - Google Patents

Anonymous network traffic classification method based on small sample machine learning

Info

Publication number
CN115913992A
CN115913992A
Authority
CN
China
Prior art keywords
data
flow
flow sequence
classified
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211592847.4A
Other languages
Chinese (zh)
Inventor
周强
王良民
路通
朱会娟
冯丽
宋香梅
申屠浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202211592847.4A priority Critical patent/CN115913992A/en
Publication of CN115913992A publication Critical patent/CN115913992A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/50Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an anonymous network traffic classification method based on small-sample machine learning. Acquired traffic data and data to be classified are mapped into a feature space by a deep neural network. The original labeled data are used to pre-train a deep classification model; a small amount of newly collected labeled data is used to compute class centers of the traffic features in the feature space; these class centers serve as cluster centers for clustering the target traffic data to be classified, which are thereby assigned pseudo labels. Knowledge transfer from the original labeled data is completed by optimizing the classification loss functions of the original labeled traffic data and the target pseudo-labeled data, which reduces the influence of data aging on the model and eliminates the distribution difference between the training data and the data to be classified that aging causes. The method addresses the performance degradation of anonymous network traffic classification algorithms that occurs when updates to the anonymous system reduce the timeliness of the originally collected traffic sequence data.

Description

Anonymous network traffic classification method based on small sample machine learning
Technical Field
The invention relates to a network security technology, in particular to an anonymous network traffic classification method based on small sample machine learning.
Background
With the development of the internet, a variety of anonymous communication systems have been designed and deployed, and corresponding attack methods have appeared as well. The anonymity of the Tor network can be effectively broken by website fingerprinting (WF) attacks. Because different websites load different resources and content, the traffic sequence between the client and the server carries distinct pattern information during page loading, which gives an attacker a convenient way to break anonymity. Deep-learning-based anonymous network traffic classification clearly outperforms non-deep methods, but it requires a large amount of labeled data as a training set. When the data set changes, for example when Tor Browser version updates produce Tor traffic of different versions, the performance of the classification algorithm degrades.
Currently, two methods address the performance degradation caused by the scarcity of labeled traffic data: TF (Triplet Fingerprinting) [1] and TLFA (Transfer Learning Fingerprinting Attack) [2]. However, TF is computationally expensive, and TLFA only fine-tunes a pre-trained classification model with a small amount of newly collected labeled traffic, so the improvement in classification performance is limited.
Therefore, the scarcity of labeled data caused by data set changes poses a great challenge to the practical performance and deployment of anonymous network traffic classification algorithms.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the defects of the prior art and provides an anonymous network traffic classification method based on small-sample machine learning.
The timeliness problem caused by data set changes in anonymous network traffic classification poses a great challenge to practical deployment and application of classification algorithms. To address this challenge, the method rests on the clustering assumption that samples belonging to the same cluster belong to the same category. An anonymous network traffic classification algorithm based on cluster analysis is provided: the originally collected traffic data, a small amount of newly collected labeled data, and the data to be classified are mapped into a feature space by a deep neural network; the class centers of the newly collected labeled data are computed in the feature space and used as cluster centers for clustering the target traffic data to be classified, which are assigned pseudo labels; knowledge transfer from the originally labeled data is completed by optimizing the classification loss functions of the originally labeled traffic data and the target pseudo-labeled data, thereby reducing the influence of data aging on the model.
The technical scheme is as follows: the invention discloses an anonymous network traffic classification method based on small sample machine learning, which comprises the following steps:
step (1), collect network traffic to obtain the original traffic sequence $X_s$, a small amount of newly collected labeled traffic $X'_s$, and the traffic sequence to be classified $X_t$;
wherein the original traffic sequence is labeled: $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $n$ is the number of original traffic sequence samples and $x_i^s$ and $y_i^s$ respectively denote a traffic sequence record and its corresponding label; the newly collected labeled traffic is expressed as $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; the traffic sequence to be classified is expressed as $X_t = \{x_j^t\}_{j=1}^{m}$, where $n'$ and $m$ are the numbers of newly collected labeled samples and of samples to be classified, respectively;
step (2) constructing a classification model
Splicing a feature extractor G and a task classifier C to form a classification model, wherein the feature extractor G adopts a deep convolution network, and the task classifier C comprises two layers of fully-connected neural networks;
step (3) pre-training classification model
Input the labeled original traffic sequence $X_s$ into the classification model, compute the classification loss function from the predicted class probabilities of the original traffic data and the true labels, and pre-train the deep classification model constructed in the previous step;
step (4) training classification model
Step (4.1) will have marked original flow sequence X s And a newly collected small amount of annotation flow X' s Mapping the flow sequence characteristics to a characteristic space through a neural network, and calculating the central point of each category of the newly acquired small quantity of marked flow sequence characteristics;
step (4.2) taking the obtained category central point as a clustering central point of newly acquired flow sequence features to be classified, calculating the distance from each flow sequence feature to be classified to each clustering central point, and giving a category label of the nearest category center of the flow sequence features to be classified, wherein the category label is used as a pseudo label of the flow sequence to be classified;
step (4.3) mapping the features of the feature space by a classifier to obtain class prediction probability, and calculating a clustering loss function according to the pseudo label and the prediction probability; updating the network weight of the feature extractor G and the task classifier C according to the obtained cluster adaptation loss;
the steps (4.1) to (4.3) are circulated for multiple times to finish model training; finally, the feature center of the newly acquired flow sequence in the feature space is aligned with the feature center of the original flow sequence, so that the features of the same category are mapped to the same region by the classifier, and the problem of performance reduction of a deep anonymous network flow classification algorithm caused by training data aging is effectively solved.
Further, the structures of the feature extractor G and the task classifier C in step (2) are as follows:
the feature extractor G has three convolution modules: the first comprises two convolution layers and the last two each comprise three convolution layers; each convolution module is followed by a max pooling layer and a Dropout layer and uses the ELU activation function, which helps shorten training time and improve accuracy; the task classifier C consists of two fully connected layers, each followed by a dropout layer to avoid overfitting.
Further, when the classification model is pre-trained on the labeled original traffic sequence data in step (3), the classification loss function is computed as in conventional supervised deep model training:

$$L_{cls}(x_s, y_s) = \mathcal{L}_{ce}(y'_s, y_s) \tag{1}$$

where $y'_s$ is the classifier's predicted probability output over the classes for the original traffic data, $y_s$ is the true label of the traffic (in one-hot form), and $\mathcal{L}_{ce}$ is the cross-entropy loss function:

$$\mathcal{L}_{ce}(p, q) = -\sum_{x} q(x) \log p(x) \tag{2}$$

where $p(x)$ is the predicted probability that sample $x$ belongs to each class and $q(x)$ is the one-hot encoding of the true label of sample $x$.
Further, the cluster centers in step (4.1) are computed as follows:
given the newly collected small amount of labeled traffic sequence data $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$ and assuming the original traffic sequence data has $K$ categories, the cluster center $C_k$ is:

$$C_k = \frac{1}{n_k} \sum_{i=1}^{n'} I_i \, f'_i \tag{3}$$

where $f'_i = G(x'_i)$; $I_i = 1$ when $y'_i = k$ and $I_i = 0$ otherwise; and $n_k$ is the number of samples with label $k$, $k \in \{1, 2, 3, \dots, K\}$.
Further, the pseudo label of the traffic sequence to be classified in step (4.2) is computed as follows:
after the newly acquired traffic sequence is mapped by the same neural network as the original traffic sequence, the distance between a traffic sequence feature and each cluster center is measured in the feature space by cosine similarity:

$$d_{j,k} = 1 - \frac{f_j^t \cdot C_k}{\|f_j^t\| \, \|C_k\|} \tag{4}$$

where $f_j^t = G(x_j^t)$; the distance from each sample to all cluster centers is computed, and each traffic sequence is then assigned the class of its nearest cluster center as its pseudo label:

$$\hat{y}_j^t = \arg\min_k d_{j,k} \tag{5}$$
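Equations (4) and (5) together are a nearest-center assignment under cosine distance; a minimal sketch:

```python
import numpy as np

def pseudo_labels(features, centers):
    """Eqs. (4)-(5): cosine distance from each feature f_j^t to every
    cluster center C_k, then the label of the nearest center."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    c = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)
    dist = 1.0 - f @ c.T             # (m, K) matrix of cosine distances
    return dist.argmin(axis=1)       # eq. (5): arg min over k
```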
Further, step (4.3) computes the cluster adaptation loss and updates the network weights as follows:
the clustering loss function is:

$$L_{clu}(x_t) = \mathcal{L}_{ce}(y'_t, \hat{y}_t) \tag{6}$$

where $y'_t$ is the classifier's predicted probability output over the classes for the newly acquired traffic sequence $x_j^t$ and $\hat{y}_t$ is the pseudo label (in one-hot form) obtained by formula (5);
the overall optimization objective of the final anonymous network traffic classification algorithm is:

$$\min_{G,C} L = L_{cls}(x_s, y_s) + \lambda L_{clu}(x_t) \tag{7}$$

where $\lambda$ is a hyper-parameter balancing the classification loss and the clustering loss during training.
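One gradient step on the combined objective of formula (7) can be sketched as follows (a minimal sketch: the value of `lam` is an assumption, since the patent leaves λ as a hyper-parameter, and `F.cross_entropy` takes integer class indices rather than one-hot labels):

```python
import torch
import torch.nn.functional as F

def training_step(G, C, opt, x_s, y_s, x_t, pseudo_t, lam=0.1):
    """One optimization step of eq. (7):
    L = L_cls(x_s, y_s) + lambda * L_clu(x_t, pseudo labels)."""
    opt.zero_grad()
    loss = (F.cross_entropy(C(G(x_s)), y_s)          # eq. (1) on source data
            + lam * F.cross_entropy(C(G(x_t)), pseudo_t))  # eq. (6)
    loss.backward()
    opt.step()
    return float(loss)
```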
Beneficial effects: the originally collected traffic data, the small-sample labeled data, and the data to be classified are mapped into a feature space by a deep neural network; the class centers of the small-sample labeled data are computed in the feature space and used as cluster centers for clustering the target traffic data to be classified, which are assigned pseudo labels; the originally collected labeled traffic data are used to pre-train the deep model; and the classification loss functions of the originally labeled traffic data and the target pseudo-labeled data are optimized to complete knowledge transfer from the originally labeled data and reduce the influence of data aging on the model.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of a feature extractor of the present invention;
FIG. 3 is a diagram of a task classifier according to the present invention.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
To address the above challenges, the invention rests on the clustering assumption that samples belonging to the same cluster belong to the same category. The algorithm maps the originally collected and newly collected labeled traffic data and the data to be classified into a feature space through a deep neural network, computes the class centers of the newly collected labeled data in the feature space, uses these class centers as cluster centers for clustering the target traffic data to be classified, assigns pseudo labels to the target traffic data, and optimizes the classification loss functions of the originally labeled traffic data and the target pseudo-labeled data, thereby completing knowledge transfer from the originally labeled data and reducing the influence of data aging on model performance.
The invention realizes effective anonymous network traffic classification through the following technical features, addressing the performance degradation caused by the distribution difference between training data and test data:
1. Based on the clustering assumption that samples in the same cluster belong to the same category, the target traffic sequence cluster centers are aligned with the original labeled data class centers in the feature space, which overcomes the distribution difference caused by data aging.
2. A deep network extracts traffic sequence features; the original labeled data class centers and the cluster centers of the traffic to be classified are aligned in the feature space, and the model is optimized end to end.
As shown in fig. 1, the anonymous network traffic classification method based on small sample machine learning of this embodiment includes the following steps:
step (1), collect network traffic to obtain the original traffic sequence $X_s$, a small amount of newly collected labeled traffic $X'_s$, and the traffic sequence to be classified $X_t$;
wherein the original traffic sequence is labeled: $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $n$ is the number of original traffic sequence samples and $x_i^s$ and $y_i^s$ respectively denote a traffic sequence record and its corresponding label; the newly collected labeled traffic is expressed as $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; the traffic sequence to be classified is expressed as $X_t = \{x_j^t\}_{j=1}^{m}$, where $n'$ and $m$ are the numbers of newly collected labeled samples and of samples to be classified, respectively;
collecting a small amount of new access flow by using a packet capturing tool, converting the captured flow into an available format, and marking a label of a corresponding website on each flow;
step (2) of constructing a classification model
Splicing a feature extractor G and a task classifier C to form a classification model, wherein the feature extractor G adopts a deep convolution network, and the task classifier C comprises two layers of fully-connected neural networks;
step (3) pre-training classification model
The marked original flow sequence X s Inputting the data into the classification model, calculating a classification loss function based on the obtained original flow data class prediction probability and the real label, and pre-training the depth classification model constructed in the previous step;
step (4), training classification model
Step (4.1) will have marked original flow sequence X s And a newly collected small amount of annotation flow X' s Mapping the sample characteristics to a characteristic space through a neural network, and calculating the category center points of all categories of the newly acquired labeled sample characteristics;
step (4.2) using the obtained category central point as a clustering central point of the flow sequence features to be classified to calculate the distance from each flow sequence feature to be classified to each clustering central point, and giving a category label of the nearest category center of the sequence features to be classified, wherein the label is a pseudo label of the sequence features to be classified;
step (4.3) mapping the characteristics of the characteristic space by a classifier to obtain a class prediction probability, and calculating a clustering loss function through a pseudo label and the prediction probability; updating the network weight of the feature extractor G and the task classifier C according to the obtained cluster adaptation loss;
and (4) circulating the steps (4.1) to (4.3) for multiple times to finish model training.
The above-described embodiments use a well-collected test set to perform performance evaluation on the classification algorithm.
The detailed process of the algorithm comprises the following steps:
[The algorithm's pseudocode is given as an image in the original document.]
the embodiment is as follows:
the present embodiment is based on the anonymous communication system Tor as an environment for acquiring traffic. Tor is based on an onion routing technology, data packets of anonymous network users are transmitted through a plurality of proxy nodes, source IP, target IP and information in the data packets are encrypted, so that the real source and destination of the data packets cannot be tracked, and privacy information of the users is effectively protected. In an actual application scenario, due to the problems of the Tor-browser version (TBB), the setting of the Tor-browser, the aging between newly acquired data and originally acquired data, and the like, distribution difference exists between the originally acquired data and the newly acquired data, and the difference causes performance degradation of an anonymous network traffic classification model on originally acquired traffic sequence data, so that the requirements of actual application are difficult to meet, and time and labor are consumed for re-acquiring labeled data trained by a deep anonymous network traffic classification model. The present embodiment solves this problem by the following steps.
Step (1) of collecting network flow
Download the Tor proxy service source code from the official Tor site https://www.torproject.org/download/tor, upload it to a cloud server, and install it. Collect access traffic with a packet capture tool: simulate a Tor user's browsing habits, visit websites, and capture the traffic between the user and the first-hop Tor node. Convert the captured traffic into a usable format, label each flow with its corresponding website, and split the collected traffic data set into a training set and a test set. For the subsequent experiments, the originally collected labeled traffic sequence data are denoted $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$; a small amount of labeled traffic drawn from the test set is denoted $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; and the remaining flows of the test set, used as the traffic sequence to be classified, are denoted $X_t = \{x_j^t\}_{j=1}^{m}$.
When collecting traffic data, following the common convention in the field, the visited websites are divided into two types: monitored websites and non-monitored websites. Monitored websites are those of interest to the attacker; non-monitored websites are those the user does not visit or the attacker does not care about. The data set composition is shown in Table 1.
Table 1: tor network traffic data set
[Table 1 is given as an image in the original document.]
Step (2) constructing a classification model
And splicing the feature extractor G and the task classifier C to form a classification model, wherein the model structure of the extractor G is shown in FIG. 2, and the model structure of the task classifier is shown in FIG. 3. The feature extractor G is composed of a convolution neural network, and the task classifier C is composed of two layers of fully-connected neural networks.
Step (3) pre-training classification model
Input the labeled original traffic sequence data $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$ into the deep model, compute the classification loss function from the predicted class probabilities and the true labels, and minimize it with stochastic gradient descent to complete pre-training of the deep classification model constructed in the previous step. The loss value is computed as in formula (1):

$$L_{cls}(x_s, y_s) = \mathcal{L}_{ce}(y'_s, y_s) \tag{1}$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss function:

$$\mathcal{L}_{ce}(p, q) = -\sum_{x} q(x) \log p(x) \tag{2}$$

where $p(x)$ is the predicted probability that sample $x$ belongs to each class and $q(x)$ is the one-hot encoding of the true label of sample $x$.
Step (4) training classification model
Map the newly collected labeled traffic sequence $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$ into the feature space with the neural network pre-trained in the previous step, and compute the center point of each category of the newly collected traffic sequence features as in formula (3):

$$C_k = \frac{1}{n_k} \sum_{i=1}^{n'} I_i \, f'_i \tag{3}$$

where $f'_i = G(x'_i)$; $I_i = 1$ when $y'_i = k$ and $I_i = 0$ otherwise; and $n_k$ is the number of samples with label $k$, $k \in \{1, 2, 3, \dots, K\}$.

Take the obtained category centers as cluster centers for the traffic sequence features to be classified, and compute the distance from each feature to each cluster center:

$$d_{j,k} = 1 - \frac{f_j^t \cdot C_k}{\|f_j^t\| \, \|C_k\|} \tag{4}$$

The distance from each sample to all cluster centers is computed; each traffic sequence is then assigned the class of its nearest cluster center as its pseudo label:

$$\hat{y}_j^t = \arg\min_k d_{j,k} \tag{5}$$
the class prediction probability is obtained after the features of the feature space are mapped by a classifier, and a cluster adaptation loss function is calculated through the pseudo label and the prediction probability, wherein the cluster adaptation loss function is shown in a formula (6) as follows:
Figure BDA0003995472950000085
wherein
Figure BDA0003995472950000086
For the classifier on the newly acquired flow sequence->
Figure BDA0003995472950000087
Is in each category, is output based on the prediction probability in the respective category>
Figure BDA0003995472950000088
A pseudo tag (one-hot encoded form) obtained by the above formula (5);
the overall optimization objective function of the final anonymous network traffic classification algorithm is shown in the following formula (7):
min G,C L=L clu (x s ,y s )+λL clu (x t ) (7)
wherein, lambda is a hyper-parameter of the classification loss and the clustering loss in the balance training. And updating the network weights of the feature extractor G and the task classifier C based on a random gradient descent algorithm. And circulating the process for multiple times to finish model training.
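The full alternating procedure of steps (4.1)-(4.3) can be sketched end to end as follows. This is a minimal sketch: `epochs` and `lam` are illustrative values, the small labeled set is assumed to contain every class, and `F.cross_entropy` takes integer pseudo labels rather than one-hot vectors.

```python
import torch
import torch.nn.functional as F

def train(G, C, opt, x_s, y_s, x_new, y_new, x_t, num_classes,
          epochs=10, lam=0.1):
    """Steps (4.1)-(4.3): compute class centers from the small newly
    collected labeled set (eq. 3), assign cosine-nearest pseudo labels
    to the traffic to be classified (eqs. 4-5), then take one gradient
    step on L_cls + lambda * L_clu (eq. 7); repeat for several epochs."""
    for _ in range(epochs):
        with torch.no_grad():
            f_new = F.normalize(G(x_new), dim=1)
            centers = torch.stack([f_new[y_new == k].mean(0)
                                   for k in range(num_classes)])
            sim = F.normalize(G(x_t), dim=1) @ F.normalize(centers, dim=1).T
            pseudo = sim.argmax(1)       # nearest center by cosine similarity
        opt.zero_grad()
        loss = (F.cross_entropy(C(G(x_s)), y_s)
                + lam * F.cross_entropy(C(G(x_t)), pseudo))
        loss.backward()
        opt.step()
    return float(loss)
```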
And (3) selecting a five-fold cross validation and grid search method to carry out hyper-parameter tuning on the classifier based on the training set, the validation set and the test set data set in the step one, and determining the optimal hyper-parameter in the classification process. And (3) evaluating the classification effect of the classification model by using the test set data, and calculating a classification accuracy index, wherein the result is shown in table 2.
Table 2: the classification effect of different models (N-shot represents that N newly collected samples with labels are provided, and the evaluation index in the table is classification accuracy (%))
[Table 2 is given as an image in the original document.]
TF [1] in Table 2 refers to "Triplet Fingerprinting: More practical and portable website fingerprinting with N-shot learning" by P. Sirinam et al., and TLFA [2] refers to "Few-shot website fingerprinting attack" by M. Chen et al.
This embodiment shows that deep anonymous network traffic classification requires a large amount of labeled training data, and that changes to the anonymous network system, such as Tor Browser version updates, reduce the validity of existing labeled data and thus degrade the performance of current deep classification algorithms. The proposed algorithm rests on the clustering assumption that samples in the same cluster belong to the same category; by aligning the target traffic sequence cluster centers with the original labeled data class centers in the feature space, it overcomes the distribution difference caused by data aging and effectively improves anonymous network traffic classification performance.

Claims (6)

1. An anonymous network traffic classification method based on small sample machine learning is characterized by comprising the following steps: the method comprises the following steps:
step (1), collect network traffic to obtain the original traffic sequence $X_s$, a small amount of newly collected labeled traffic $X'_s$, and the traffic sequence to be classified $X_t$;
wherein the original traffic sequence $X_s$ is labeled: $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $n$ is the number of original traffic sequence samples and $x_i^s$ and $y_i^s$ respectively denote a traffic sequence record and its corresponding label; the newly collected labeled traffic is expressed as $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; the traffic sequence to be classified is expressed as $X_t = \{x_j^t\}_{j=1}^{m}$, where $n'$ and $m$ are respectively the numbers of newly collected labeled samples and of data samples to be classified;
step (2) constructing a classification model
Splicing a feature extractor G and a task classifier C to form a classification model, wherein the feature extractor G adopts a deep convolution network, and the task classifier C comprises two layers of fully-connected neural networks;
step (3) pre-training classification model
input the labeled original traffic sequence $X_s$ into the classification model, compute the classification loss function from the predicted class probabilities of the original traffic data and the true labels, and pre-train the deep classification model constructed in the previous step;
step (4) training classification model
step (4.1): map the labeled original traffic sequence $X_s$ and the newly collected small amount of labeled traffic $X'_s$ into a feature space through the neural network, and compute the center point of each category of the newly collected labeled traffic sequence features;
step (4.2) taking the obtained category central point as a clustering central point of newly acquired flow sequence features to be classified, calculating the distance from each flow sequence feature to be classified to each clustering central point, and giving a category label of the nearest category center of the flow sequence features to be classified, wherein the category label is used as a pseudo label of the flow sequence to be classified;
step (4.3) mapping the features of the feature space by a classifier to obtain class prediction probability, and calculating a clustering loss function according to the pseudo label and the prediction probability; updating the network weight of the feature extractor G and the task classifier C according to the obtained cluster adaptation loss;
and (4) circulating the steps (4.1) to (4.3) for multiple times to finish model training.
2. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the structures of the feature extractor G and the task classifier C in the step (2) are as follows:
the feature extractor G is provided with three convolution modules, wherein the first convolution module comprises two convolution layers, the last two convolution modules comprise three convolution layers, a maximum pooling layer and a Dropout layer are adopted behind each convolution module, and an ELU activation function is adopted in each convolution module; and the task classifier C adopts two layers of fully-connected neural networks, and a dropout layer is added behind each layer of network.
3. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: when the classification model is used for pre-training the marked original flow sequence data, the classification loss function is calculated as follows:
$$L_{cls} = \ell_{ce}(\hat{y}, y)$$

where $\hat{y}$ is the classifier's prediction probability for each class of original flow data, $y$ is the one-hot encoding of the traffic's true label, and $\ell_{ce}$ represents the cross-entropy loss function, calculated as follows:

$$\ell_{ce}(\hat{y}, y) = -\sum_{j} y_j \log \hat{y}_j$$

where $\hat{y}_j$ is the predicted probability that a sample belongs to class $j$, and $y_j$ is the $j$-th component of the one-hot encoding of the sample's true label.
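The cross-entropy loss described in claim 3 can be computed directly in NumPy. A minimal sketch with toy probabilities (the function name and batch averaging are illustrative choices, not from the patent):

```python
import numpy as np

def cross_entropy(probs, onehot):
    # l_ce = -sum_j y_j * log(y_hat_j), per sample, averaged over the batch
    return float(-(onehot * np.log(probs)).sum(axis=1).mean())

probs  = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])   # classifier prediction probabilities
onehot = np.array([[1, 0, 0],
                   [0, 1, 0]])         # one-hot true labels
loss = cross_entropy(probs, onehot)    # -(ln 0.7 + ln 0.8) / 2
```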
4. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the cluster centers in step (4.1) are calculated as follows:

given the newly acquired small amount of labeled flow sequence data $D_l = \{(x_i, y_i)\}_{i=1}^{n}$, and assuming the original flow sequence data has $K$ classes, the cluster center $c_k$ of class $k$ is:

$$c_k = \frac{\sum_{i=1}^{n} \mathbb{1}[y_i = k] \, G(x_i)}{\sum_{i=1}^{n} \mathbb{1}[y_i = k]}, \quad k = 1, \dots, K$$

where $\mathbb{1}[y_i = k]$ is the indicator function: $\mathbb{1}[y_i = k] = 1$ when $y_i = k$, and $\mathbb{1}[y_i = k] = 0$ otherwise; $G(x_i)$ denotes the feature of flow sequence data $x_i$ mapped into the feature space by the feature extractor G, so $c_k$ is the mean feature of the flow sequence data labeled $k$.
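The class-mean computation of claim 4 can be written with an explicit indicator matrix. A small sketch on hand-picked 2-D features (function and variable names are illustrative):

```python
import numpy as np

def cluster_centers(feats, labels, K):
    # c_k = sum_i 1[y_i = k] * f_i  /  sum_i 1[y_i = k]
    ind = (labels[:, None] == np.arange(K)[None, :]).astype(float)  # n x K indicator
    return (ind.T @ feats) / ind.sum(axis=0)[:, None]

feats  = np.array([[1., 0.], [3., 0.], [0., 2.], [0., 4.]])
labels = np.array([0, 0, 1, 1])
centers = cluster_centers(feats, labels, 2)
# class 0 center is the mean of [1,0] and [3,0]; class 1 center the mean of [0,2] and [0,4]
```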
5. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the method for calculating the pseudo label of the traffic sequence to be classified in the step (4.2) comprises the following steps:
cosine similarity is adopted in the feature space to measure the distance between a new flow sequence feature and each cluster center, calculated as follows:

$$d(G(x_i), c_k) = 1 - \frac{G(x_i) \cdot c_k}{\|G(x_i)\| \, \|c_k\|}$$

where $d(G(x_i), c_k)$ is the distance between newly collected sample $x_i$ and cluster center $c_k$; the distance from each newly collected sample to all cluster centers is computed, the sample is assigned the class of its nearest cluster center, and the new flow sequences in each class cluster are given pseudo labels as follows:

$$\hat{y}_i = \arg\min_{k} \, d(G(x_i), c_k)$$
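The pseudo-labeling rule of claim 5 (nearest cluster center under cosine distance) can be sketched as follows; the two unit-axis centers and the test feature are toy values chosen so the nearest center is obvious:

```python
import numpy as np

def cosine_distance(f, c):
    # d = 1 - (f . c) / (|f| |c|)
    return 1.0 - (f @ c) / (np.linalg.norm(f) * np.linalg.norm(c))

centers = np.array([[1.0, 0.0],
                    [0.0, 1.0]])          # two cluster centers
feat = np.array([0.9, 0.1])               # a new flow sequence feature

dists = np.array([cosine_distance(feat, c) for c in centers])
pseudo = int(dists.argmin())               # pseudo label = index of nearest center
```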
6. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the specific process of calculating the clustering loss and updating the network weights in step (4.3) is as follows:

the clustering loss function is calculated as follows:

$$L_{clu} = \ell_{ce}(p_i, \hat{y}_i)$$

where $p_i = C(G(x_i))$ is the classifier's prediction output for the newly acquired flow sequence $x_i$, and $\hat{y}_i$ is its pseudo label;

the final overall optimization objective function is as follows:

$$L = L_{cls} + \lambda \, L_{clu}$$

where $\lambda$ is a hyper-parameter that balances the classification loss and the clustering loss during training.
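The combined objective of claim 6 (clustering cross-entropy against pseudo labels plus the weighted sum) can be sketched with toy values; `lam` and all function names are illustrative, and the classification loss is simply assumed rather than computed:

```python
import numpy as np

def clu_loss(probs, pseudo):
    # cross-entropy of classifier outputs against integer pseudo labels
    return float(-np.log(probs[np.arange(len(pseudo)), pseudo]).mean())

def total_loss(l_cls, l_clu, lam=0.5):
    # L = L_cls + lambda * L_clu; lam balances the two losses
    return l_cls + lam * l_clu

probs = np.array([[0.6, 0.4],
                  [0.3, 0.7]])    # classifier outputs on two new flow sequences
pseudo = np.array([0, 1])         # their pseudo labels
L = total_loss(0.30, clu_loss(probs, pseudo), lam=0.5)
```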
CN202211592847.4A 2022-12-13 2022-12-13 Anonymous network traffic classification method based on small sample machine learning Pending CN115913992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592847.4A CN115913992A (en) 2022-12-13 2022-12-13 Anonymous network traffic classification method based on small sample machine learning


Publications (1)

Publication Number Publication Date
CN115913992A true CN115913992A (en) 2023-04-04

Family

ID=86477931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592847.4A Pending CN115913992A (en) 2022-12-13 2022-12-13 Anonymous network traffic classification method based on small sample machine learning

Country Status (1)

Country Link
CN (1) CN115913992A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117155707A (en) * 2023-10-30 2023-12-01 广东省通信产业服务有限公司 Harmful domain name detection method based on passive network flow measurement
CN117155707B (en) * 2023-10-30 2023-12-29 广东省通信产业服务有限公司 Harmful domain name detection method based on passive network flow measurement

Similar Documents

Publication Publication Date Title
CN113705712B (en) Network traffic classification method and system based on federal semi-supervised learning
Han et al. Joint air quality and weather prediction based on multi-adversarial spatiotemporal networks
CN106992994A (en) A kind of automatically-monitored method and system of cloud service
CN113378899B (en) Abnormal account identification method, device, equipment and storage medium
CN113762595B (en) Traffic time prediction model training method, traffic time prediction method and equipment
CN111461784B (en) Multi-model fusion-based fraud detection method
CN110990718A (en) Social network model building module of company image improving system
CN115913992A (en) Anonymous network traffic classification method based on small sample machine learning
CN114584406B (en) Industrial big data privacy protection system and method for federated learning
CN115660147A (en) Information propagation prediction method and system based on influence modeling between propagation paths and in propagation paths
CN111224998B (en) Botnet identification method based on extreme learning machine
Li et al. Street-Level Landmarks Acquisition Based on SVM Classifiers.
CN109657725B (en) Service quality prediction method and system based on complex space-time context awareness
CN113938290A (en) Website de-anonymization method and system for user side traffic data analysis
CN105447148A (en) Cookie identifier association method and apparatus
CN117271899A (en) Interest point recommendation method based on space-time perception
CN115438753B (en) Method for measuring security of federal learning protocol data based on generation
CN114757391B (en) Network data space design and application method oriented to service quality prediction
Kumar et al. Progressive machine learning approach with WebAstro for Web usage mining
CN115622810A (en) Business application identification system and method based on machine learning algorithm
CN115767601A (en) 5GC network element automatic nanotube method and device based on multidimensional data
CN112581177B (en) Marketing prediction method combining automatic feature engineering and residual neural network
Liu et al. Prediction model for non-topological event propagation in social networks
CN111586052A (en) Multi-level-based crowd sourcing contract abnormal transaction identification method and identification system
Xu et al. Machine learning based abnormal flow analysis of university course teaching network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination