CN115913992A - Anonymous network traffic classification method based on small sample machine learning - Google Patents

Anonymous network traffic classification method based on small sample machine learning

Info

Publication number
CN115913992A
CN115913992A
Authority
CN
China
Prior art keywords
data
flow
flow sequence
classified
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211592847.4A
Other languages
Chinese (zh)
Inventor
周强
王良民
路通
朱会娟
冯丽
宋香梅
申屠浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202211592847.4A priority Critical patent/CN115913992A/en
Publication of CN115913992A publication Critical patent/CN115913992A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/50Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an anonymous network traffic classification method based on small-sample machine learning. Acquired traffic data and data to be classified are mapped into a feature space by a deep neural network. The original labeled data are used to pre-train a deep classification model; a small amount of newly collected labeled data is used to compute class centers of the traffic features in the feature space; these class centers serve as cluster centers for clustering the target traffic data to be classified, which are thereby assigned pseudo labels. Knowledge transfer from the original labeled data is completed by optimizing the classification loss functions of the original labeled traffic data and the target pseudo-labeled data, which reduces the influence of data aging on the model and eliminates the distribution difference between the training data and the data to be classified that aging causes. The method addresses the performance degradation of anonymous network traffic classification algorithms that occurs when updates to the anonymous system reduce the timeliness of the originally collected traffic sequence data.

Description

Anonymous network traffic classification method based on small sample machine learning
Technical Field
The invention relates to a network security technology, in particular to an anonymous network traffic classification method based on small sample machine learning.
Background
With the development of the internet, a variety of anonymous communication systems have been designed and deployed, and corresponding attack methods have appeared as well. The anonymity of the Tor network can be effectively broken by website fingerprinting (WF) attacks. Because different websites load different resources and content, the traffic sequence between the client and the server carries distinct pattern information during page loading, which gives an attacker a convenient way to break anonymity. Deep-learning-based anonymous network traffic classification clearly outperforms non-deep methods, but it requires a large amount of labeled data as a training set. When the data set changes, for example when Tor Browser version updates produce Tor traffic of different versions, the performance of the classification algorithm degrades.
Currently, two methods address the performance degradation caused by the scarcity of labeled traffic data: TF (Triplet Fingerprinting) [1] and TLFA (Transfer Learning Fingerprinting Attack) [2]. However, TF is computationally expensive, and TLFA only fine-tunes a pre-trained classification model with a small amount of newly collected labeled traffic, so the improvement in classification performance is limited.
Therefore, the scarcity of labeled data caused by data set changes poses a great challenge to the practical performance and deployment of anonymous network traffic classification algorithms.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the defects of the prior art and provides an anonymous network traffic classification method based on small-sample machine learning.
The timeliness problem caused by data set changes in anonymous network traffic classification poses a great challenge to practical deployment and application of classification algorithms. To address this challenge, the method rests on the clustering assumption that samples belonging to the same cluster belong to the same category. An anonymous network traffic classification algorithm based on cluster analysis is provided: the originally collected traffic data, a small amount of newly collected labeled data, and the data to be classified are mapped into a feature space by a deep neural network; the class centers of the newly collected labeled data are computed in the feature space and used as cluster centers for clustering the target traffic data to be classified, which are assigned pseudo labels; knowledge transfer from the originally labeled data is completed by optimizing the classification loss functions of the originally labeled traffic data and the target pseudo-labeled data, thereby reducing the influence of data aging on the model.
The technical scheme is as follows: the invention discloses an anonymous network traffic classification method based on small sample machine learning, which comprises the following steps:
step (1), collect network traffic to obtain the original traffic sequence $X_s$, a small amount of newly collected labeled traffic $X'_s$, and the traffic sequence to be classified $X_t$;
wherein the original traffic sequence is labeled: $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $n$ is the number of original traffic sequence samples and $x_i^s$ and $y_i^s$ respectively denote a traffic sequence record and its corresponding label; the newly collected labeled traffic is expressed as $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; the traffic sequence to be classified is expressed as $X_t = \{x_j^t\}_{j=1}^{m}$, where $n'$ and $m$ are the numbers of newly collected labeled samples and of samples to be classified, respectively;
step (2) constructing a classification model
Splicing a feature extractor G and a task classifier C to form a classification model, wherein the feature extractor G adopts a deep convolution network, and the task classifier C comprises two layers of fully-connected neural networks;
step (3) pre-training classification model
Input the labeled original traffic sequence $X_s$ into the classification model, compute the classification loss function from the predicted class probabilities of the original traffic data and the true labels, and pre-train the deep classification model constructed in the previous step;
step (4) training classification model
Step (4.1) will have marked original flow sequence X s And a newly collected small amount of annotation flow X' s Mapping the flow sequence characteristics to a characteristic space through a neural network, and calculating the central point of each category of the newly acquired small quantity of marked flow sequence characteristics;
step (4.2) taking the obtained category central point as a clustering central point of newly acquired flow sequence features to be classified, calculating the distance from each flow sequence feature to be classified to each clustering central point, and giving a category label of the nearest category center of the flow sequence features to be classified, wherein the category label is used as a pseudo label of the flow sequence to be classified;
step (4.3) mapping the features of the feature space by a classifier to obtain class prediction probability, and calculating a clustering loss function according to the pseudo label and the prediction probability; updating the network weight of the feature extractor G and the task classifier C according to the obtained cluster adaptation loss;
the steps (4.1) to (4.3) are circulated for multiple times to finish model training; finally, the feature center of the newly acquired flow sequence in the feature space is aligned with the feature center of the original flow sequence, so that the features of the same category are mapped to the same region by the classifier, and the problem of performance reduction of a deep anonymous network flow classification algorithm caused by training data aging is effectively solved.
Further, the structures of the feature extractor G and the task classifier C in step (2) are as follows:
the feature extractor G has three convolution modules: the first comprises two convolution layers and the last two each comprise three convolution layers; each convolution module is followed by a max pooling layer and a Dropout layer and uses the ELU activation function, which helps shorten training time and improve accuracy; the task classifier C consists of two fully connected layers, each followed by a dropout layer to avoid overfitting.
Further, when the classification model is pre-trained on the labeled original traffic sequence data in step (3), the classification loss function is computed as in conventional supervised deep model training:

$$L_{cls}(x_s, y_s) = \mathcal{L}_{ce}(y'_s, y_s) \tag{1}$$

where $y'_s$ is the classifier's predicted probability output over the classes for the original traffic data, $y_s$ is the true label of the traffic (in one-hot form), and $\mathcal{L}_{ce}$ is the cross-entropy loss function:

$$\mathcal{L}_{ce}(p, q) = -\sum_{x} q(x) \log p(x) \tag{2}$$

where $p(x)$ is the predicted probability that sample $x$ belongs to each class and $q(x)$ is the one-hot encoding of the true label of sample $x$.
Further, the cluster centers in step (4.1) are computed as follows:
given the newly collected small amount of labeled traffic sequence data $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$ and assuming the original traffic sequence data has $K$ categories, the cluster center $C_k$ is:

$$C_k = \frac{1}{n_k} \sum_{i=1}^{n'} I_i \, f'_i \tag{3}$$

where $f'_i = G(x'_i)$; $I_i = 1$ when $y'_i = k$ and $I_i = 0$ otherwise; and $n_k$ is the number of samples with label $k$, $k \in \{1, 2, 3, \dots, K\}$.
Further, the pseudo label of the traffic sequence to be classified in step (4.2) is computed as follows:
after the newly acquired traffic sequence is mapped by the same neural network as the original traffic sequence, the distance between a traffic sequence feature and each cluster center is measured in the feature space by cosine similarity:

$$d_{j,k} = 1 - \frac{f_j^t \cdot C_k}{\|f_j^t\| \, \|C_k\|} \tag{4}$$

where $f_j^t = G(x_j^t)$; the distance from each sample to all cluster centers is computed, and each traffic sequence is then assigned the class of its nearest cluster center as its pseudo label:

$$\hat{y}_j^t = \arg\min_k d_{j,k} \tag{5}$$
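Equations (4) and (5) together are a nearest-center assignment under cosine distance; a minimal sketch:

```python
import numpy as np

def pseudo_labels(features, centers):
    """Eqs. (4)-(5): cosine distance from each feature f_j^t to every
    cluster center C_k, then the label of the nearest center."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    c = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)
    dist = 1.0 - f @ c.T             # (m, K) matrix of cosine distances
    return dist.argmin(axis=1)       # eq. (5): arg min over k
```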
Further, step (4.3) computes the cluster adaptation loss and updates the network weights as follows:
the clustering loss function is:

$$L_{clu}(x_t) = \mathcal{L}_{ce}(y'_t, \hat{y}_t) \tag{6}$$

where $y'_t$ is the classifier's predicted probability output over the classes for the newly acquired traffic sequence $x_j^t$ and $\hat{y}_t$ is the pseudo label (in one-hot form) obtained by formula (5);
the overall optimization objective of the final anonymous network traffic classification algorithm is:

$$\min_{G,C} L = L_{cls}(x_s, y_s) + \lambda L_{clu}(x_t) \tag{7}$$

where $\lambda$ is a hyper-parameter balancing the classification loss and the clustering loss during training.
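One gradient step on the combined objective of formula (7) can be sketched as follows (a minimal sketch: the value of `lam` is an assumption, since the patent leaves λ as a hyper-parameter, and `F.cross_entropy` takes integer class indices rather than one-hot labels):

```python
import torch
import torch.nn.functional as F

def training_step(G, C, opt, x_s, y_s, x_t, pseudo_t, lam=0.1):
    """One optimization step of eq. (7):
    L = L_cls(x_s, y_s) + lambda * L_clu(x_t, pseudo labels)."""
    opt.zero_grad()
    loss = (F.cross_entropy(C(G(x_s)), y_s)          # eq. (1) on source data
            + lam * F.cross_entropy(C(G(x_t)), pseudo_t))  # eq. (6)
    loss.backward()
    opt.step()
    return float(loss)
```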
Beneficial effects: the originally collected traffic data, the small-sample labeled data, and the data to be classified are mapped into a feature space by a deep neural network; the class centers of the small-sample labeled data are computed in the feature space and used as cluster centers for clustering the target traffic data to be classified, which are assigned pseudo labels; the originally collected labeled traffic data are used to pre-train the deep model; and the classification loss functions of the originally labeled traffic data and the target pseudo-labeled data are optimized to complete knowledge transfer from the originally labeled data and reduce the influence of data aging on the model.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a schematic diagram of a feature extractor of the present invention;
FIG. 3 is a diagram of a task classifier according to the present invention.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
To address the above challenges, the invention rests on the clustering assumption that samples belonging to the same cluster belong to the same category. The algorithm maps the originally collected and newly collected labeled traffic data and the data to be classified into a feature space through a deep neural network, computes the class centers of the newly collected labeled data in the feature space, uses these class centers as cluster centers for clustering the target traffic data to be classified, assigns pseudo labels to the target traffic data, and optimizes the classification loss functions of the originally labeled traffic data and the target pseudo-labeled data, thereby completing knowledge transfer from the originally labeled data and reducing the influence of data aging on model performance.
The invention realizes effective anonymous network traffic classification through the following technical features, addressing the performance degradation caused by the distribution difference between training data and test data:
1. Based on the clustering assumption that samples in the same cluster belong to the same category, the target traffic sequence cluster centers are aligned with the original labeled data class centers in the feature space, which overcomes the distribution difference caused by data aging.
2. A deep network extracts traffic sequence features; the original labeled data class centers and the cluster centers of the traffic to be classified are aligned in the feature space, and the model is optimized end to end.
As shown in fig. 1, the anonymous network traffic classification method based on small sample machine learning of this embodiment includes the following steps:
step (1), collect network traffic to obtain the original traffic sequence $X_s$, a small amount of newly collected labeled traffic $X'_s$, and the traffic sequence to be classified $X_t$;
wherein the original traffic sequence is labeled: $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $n$ is the number of original traffic sequence samples and $x_i^s$ and $y_i^s$ respectively denote a traffic sequence record and its corresponding label; the newly collected labeled traffic is expressed as $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; the traffic sequence to be classified is expressed as $X_t = \{x_j^t\}_{j=1}^{m}$, where $n'$ and $m$ are the numbers of newly collected labeled samples and of samples to be classified, respectively;
collecting a small amount of new access flow by using a packet capturing tool, converting the captured flow into an available format, and marking a label of a corresponding website on each flow;
step (2) of constructing a classification model
Splicing a feature extractor G and a task classifier C to form a classification model, wherein the feature extractor G adopts a deep convolution network, and the task classifier C comprises two layers of fully-connected neural networks;
step (3) pre-training classification model
The marked original flow sequence X s Inputting the data into the classification model, calculating a classification loss function based on the obtained original flow data class prediction probability and the real label, and pre-training the depth classification model constructed in the previous step;
step (4), training classification model
Step (4.1) will have marked original flow sequence X s And a newly collected small amount of annotation flow X' s Mapping the sample characteristics to a characteristic space through a neural network, and calculating the category center points of all categories of the newly acquired labeled sample characteristics;
step (4.2) using the obtained category central point as a clustering central point of the flow sequence features to be classified to calculate the distance from each flow sequence feature to be classified to each clustering central point, and giving a category label of the nearest category center of the sequence features to be classified, wherein the label is a pseudo label of the sequence features to be classified;
step (4.3) mapping the characteristics of the characteristic space by a classifier to obtain a class prediction probability, and calculating a clustering loss function through a pseudo label and the prediction probability; updating the network weight of the feature extractor G and the task classifier C according to the obtained cluster adaptation loss;
and (4) circulating the steps (4.1) to (4.3) for multiple times to finish model training.
The above-described embodiments use a well-collected test set to perform performance evaluation on the classification algorithm.
The detailed process of the algorithm comprises the following steps:
[The algorithm's pseudocode is given as an image in the original document.]
the embodiment is as follows:
the present embodiment is based on the anonymous communication system Tor as an environment for acquiring traffic. Tor is based on an onion routing technology, data packets of anonymous network users are transmitted through a plurality of proxy nodes, source IP, target IP and information in the data packets are encrypted, so that the real source and destination of the data packets cannot be tracked, and privacy information of the users is effectively protected. In an actual application scenario, due to the problems of the Tor-browser version (TBB), the setting of the Tor-browser, the aging between newly acquired data and originally acquired data, and the like, distribution difference exists between the originally acquired data and the newly acquired data, and the difference causes performance degradation of an anonymous network traffic classification model on originally acquired traffic sequence data, so that the requirements of actual application are difficult to meet, and time and labor are consumed for re-acquiring labeled data trained by a deep anonymous network traffic classification model. The present embodiment solves this problem by the following steps.
Step (1) of collecting network flow
Download the Tor proxy service source code from the official Tor site https://www.torproject.org/download/tor, upload it to a cloud server, and install it. Collect access traffic with a packet capture tool: simulate a Tor user's browsing habits, visit websites, and capture the traffic between the user and the first-hop Tor node. Convert the captured traffic into a usable format, label each flow with its corresponding website, and split the collected traffic data set into a training set and a test set. For the subsequent experiments, the originally collected labeled traffic sequence data are denoted $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$; a small amount of labeled traffic drawn from the test set is denoted $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; and the remaining flows of the test set, used as the traffic sequence to be classified, are denoted $X_t = \{x_j^t\}_{j=1}^{m}$.
When collecting traffic data, following the common convention in the field, the visited websites are divided into two types: monitored websites and non-monitored websites. Monitored websites are those of interest to the attacker; non-monitored websites are those the user does not visit or the attacker does not care about. The data set composition is shown in Table 1.
Table 1: tor network traffic data set
[Table 1 is given as an image in the original document.]
Step (2) constructing a classification model
And splicing the feature extractor G and the task classifier C to form a classification model, wherein the model structure of the extractor G is shown in FIG. 2, and the model structure of the task classifier is shown in FIG. 3. The feature extractor G is composed of a convolution neural network, and the task classifier C is composed of two layers of fully-connected neural networks.
Step (3) pre-training classification model
Input the labeled original traffic sequence data $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$ into the deep model, compute the classification loss function from the predicted class probabilities and the true labels, and minimize it with stochastic gradient descent to complete pre-training of the deep classification model constructed in the previous step. The loss value is computed as in formula (1):

$$L_{cls}(x_s, y_s) = \mathcal{L}_{ce}(y'_s, y_s) \tag{1}$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss function:

$$\mathcal{L}_{ce}(p, q) = -\sum_{x} q(x) \log p(x) \tag{2}$$

where $p(x)$ is the predicted probability that sample $x$ belongs to each class and $q(x)$ is the one-hot encoding of the true label of sample $x$.
Step (4) training classification model
Map the newly collected labeled traffic sequence $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$ into the feature space with the neural network pre-trained in the previous step, and compute the center point of each category of the newly collected traffic sequence features as in formula (3):

$$C_k = \frac{1}{n_k} \sum_{i=1}^{n'} I_i \, f'_i \tag{3}$$

where $f'_i = G(x'_i)$; $I_i = 1$ when $y'_i = k$ and $I_i = 0$ otherwise; and $n_k$ is the number of samples with label $k$, $k \in \{1, 2, 3, \dots, K\}$.

Take the obtained category centers as cluster centers for the traffic sequence features to be classified, and compute the distance from each feature to each cluster center:

$$d_{j,k} = 1 - \frac{f_j^t \cdot C_k}{\|f_j^t\| \, \|C_k\|} \tag{4}$$

The distance from each sample to all cluster centers is computed; each traffic sequence is then assigned the class of its nearest cluster center as its pseudo label:

$$\hat{y}_j^t = \arg\min_k d_{j,k} \tag{5}$$
the class prediction probability is obtained after the features of the feature space are mapped by a classifier, and a cluster adaptation loss function is calculated through the pseudo label and the prediction probability, wherein the cluster adaptation loss function is shown in a formula (6) as follows:
Figure BDA0003995472950000085
wherein
Figure BDA0003995472950000086
For the classifier on the newly acquired flow sequence->
Figure BDA0003995472950000087
Is in each category, is output based on the prediction probability in the respective category>
Figure BDA0003995472950000088
A pseudo tag (one-hot encoded form) obtained by the above formula (5);
the overall optimization objective function of the final anonymous network traffic classification algorithm is shown in the following formula (7):
min G,C L=L clu (x s ,y s )+λL clu (x t ) (7)
wherein, lambda is a hyper-parameter of the classification loss and the clustering loss in the balance training. And updating the network weights of the feature extractor G and the task classifier C based on a random gradient descent algorithm. And circulating the process for multiple times to finish model training.
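The full alternating procedure of steps (4.1)-(4.3) can be sketched end to end as follows. This is a minimal sketch: `epochs` and `lam` are illustrative values, the small labeled set is assumed to contain every class, and `F.cross_entropy` takes integer pseudo labels rather than one-hot vectors.

```python
import torch
import torch.nn.functional as F

def train(G, C, opt, x_s, y_s, x_new, y_new, x_t, num_classes,
          epochs=10, lam=0.1):
    """Steps (4.1)-(4.3): compute class centers from the small newly
    collected labeled set (eq. 3), assign cosine-nearest pseudo labels
    to the traffic to be classified (eqs. 4-5), then take one gradient
    step on L_cls + lambda * L_clu (eq. 7); repeat for several epochs."""
    for _ in range(epochs):
        with torch.no_grad():
            f_new = F.normalize(G(x_new), dim=1)
            centers = torch.stack([f_new[y_new == k].mean(0)
                                   for k in range(num_classes)])
            sim = F.normalize(G(x_t), dim=1) @ F.normalize(centers, dim=1).T
            pseudo = sim.argmax(1)       # nearest center by cosine similarity
        opt.zero_grad()
        loss = (F.cross_entropy(C(G(x_s)), y_s)
                + lam * F.cross_entropy(C(G(x_t)), pseudo))
        loss.backward()
        opt.step()
    return float(loss)
```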
And (3) selecting a five-fold cross validation and grid search method to carry out hyper-parameter tuning on the classifier based on the training set, the validation set and the test set data set in the step one, and determining the optimal hyper-parameter in the classification process. And (3) evaluating the classification effect of the classification model by using the test set data, and calculating a classification accuracy index, wherein the result is shown in table 2.
Table 2: the classification effect of different models (N-shot represents that N newly collected samples with labels are provided, and the evaluation index in the table is classification accuracy (%))
[Table 2 is given as an image in the original document.]
TF [1] in Table 2 refers to "Triplet Fingerprinting: More practical and portable website fingerprinting with N-shot learning" by P. Sirinam et al., and TLFA [2] refers to "Few-shot website fingerprinting attack" by M. Chen et al.
This embodiment shows that deep anonymous network traffic classification requires a large amount of labeled training data, and that changes to the anonymous network system, such as Tor Browser version updates, reduce the validity of existing labeled data and thus degrade the performance of current deep classification algorithms. The proposed algorithm rests on the clustering assumption that samples in the same cluster belong to the same category; by aligning the target traffic sequence cluster centers with the original labeled data class centers in the feature space, it overcomes the distribution difference caused by data aging and effectively improves anonymous network traffic classification performance.

Claims (6)

1. An anonymous network traffic classification method based on small sample machine learning is characterized by comprising the following steps: the method comprises the following steps:
step (1), collect network traffic to obtain the original traffic sequence $X_s$, a small amount of newly collected labeled traffic $X'_s$, and the traffic sequence to be classified $X_t$;
wherein the original traffic sequence $X_s$ is labeled: $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n}$, where $n$ is the number of original traffic sequence samples and $x_i^s$ and $y_i^s$ respectively denote a traffic sequence record and its corresponding label; the newly collected labeled traffic is expressed as $X'_s = \{(x'_i, y'_i)\}_{i=1}^{n'}$; the traffic sequence to be classified is expressed as $X_t = \{x_j^t\}_{j=1}^{m}$, where $n'$ and $m$ are respectively the numbers of newly collected labeled samples and of data samples to be classified;
step (2) constructing a classification model
Splicing a feature extractor G and a task classifier C to form a classification model, wherein the feature extractor G adopts a deep convolution network, and the task classifier C comprises two layers of fully-connected neural networks;
step (3) pre-training classification model
input the labeled original traffic sequence $X_s$ into the classification model, compute the classification loss function from the predicted class probabilities of the original traffic data and the true labels, and pre-train the deep classification model constructed in the previous step;
step (4) training classification model
step (4.1): map the labeled original traffic sequence $X_s$ and the newly collected small amount of labeled traffic $X'_s$ into a feature space through the neural network, and compute the center point of each category of the newly collected labeled traffic sequence features;
step (4.2) taking the obtained category central point as a clustering central point of newly acquired flow sequence features to be classified, calculating the distance from each flow sequence feature to be classified to each clustering central point, and giving a category label of the nearest category center of the flow sequence features to be classified, wherein the category label is used as a pseudo label of the flow sequence to be classified;
step (4.3) mapping the features of the feature space by a classifier to obtain class prediction probability, and calculating a clustering loss function according to the pseudo label and the prediction probability; updating the network weight of the feature extractor G and the task classifier C according to the obtained cluster adaptation loss;
and (4) circulating the steps (4.1) to (4.3) for multiple times to finish model training.
2. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the structures of the feature extractor G and the task classifier C in the step (2) are as follows:
the feature extractor G is provided with three convolution modules, wherein the first convolution module comprises two convolution layers, the last two convolution modules comprise three convolution layers, a maximum pooling layer and a Dropout layer are adopted behind each convolution module, and an ELU activation function is adopted in each convolution module; and the task classifier C adopts two layers of fully-connected neural networks, and a dropout layer is added behind each layer of network.
3. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: when the classification model is used for pre-training the marked original flow sequence data, the classification loss function is calculated as follows:
$$L_{cls} = \ell_{ce}(\hat{y}, y)$$

where $\hat{y}$ is the classifier's prediction probability for each class of original flow data, $y$ is the one-hot encoding of the traffic's true label, and $\ell_{ce}$ represents the cross-entropy loss function, calculated as follows:

$$\ell_{ce}(\hat{y}, y) = -\sum_{j} y_j \log \hat{y}_j$$

where $\hat{y}_j$ is the predicted probability that a sample belongs to class $j$, and $y_j$ is the $j$-th component of the one-hot encoding of the sample's true label.
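The cross-entropy loss described in claim 3 can be computed directly in NumPy. A minimal sketch with toy probabilities (the function name and batch averaging are illustrative choices, not from the patent):

```python
import numpy as np

def cross_entropy(probs, onehot):
    # l_ce = -sum_j y_j * log(y_hat_j), per sample, averaged over the batch
    return float(-(onehot * np.log(probs)).sum(axis=1).mean())

probs  = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])   # classifier prediction probabilities
onehot = np.array([[1, 0, 0],
                   [0, 1, 0]])         # one-hot true labels
loss = cross_entropy(probs, onehot)    # -(ln 0.7 + ln 0.8) / 2
```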
4. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the cluster centers in step (4.1) are calculated as follows:

given the newly acquired small amount of labeled flow sequence data $D_l = \{(x_i, y_i)\}_{i=1}^{n}$, and assuming the original flow sequence data has $K$ classes, the cluster center $c_k$ of class $k$ is:

$$c_k = \frac{\sum_{i=1}^{n} \mathbb{1}[y_i = k] \, G(x_i)}{\sum_{i=1}^{n} \mathbb{1}[y_i = k]}, \quad k = 1, \dots, K$$

where $\mathbb{1}[y_i = k]$ is the indicator function: $\mathbb{1}[y_i = k] = 1$ when $y_i = k$, and $\mathbb{1}[y_i = k] = 0$ otherwise; $G(x_i)$ denotes the feature of flow sequence data $x_i$ mapped into the feature space by the feature extractor G, so $c_k$ is the mean feature of the flow sequence data labeled $k$.
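The class-mean computation of claim 4 can be written with an explicit indicator matrix. A small sketch on hand-picked 2-D features (function and variable names are illustrative):

```python
import numpy as np

def cluster_centers(feats, labels, K):
    # c_k = sum_i 1[y_i = k] * f_i  /  sum_i 1[y_i = k]
    ind = (labels[:, None] == np.arange(K)[None, :]).astype(float)  # n x K indicator
    return (ind.T @ feats) / ind.sum(axis=0)[:, None]

feats  = np.array([[1., 0.], [3., 0.], [0., 2.], [0., 4.]])
labels = np.array([0, 0, 1, 1])
centers = cluster_centers(feats, labels, 2)
# class 0 center is the mean of [1,0] and [3,0]; class 1 center the mean of [0,2] and [0,4]
```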
5. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the method for calculating the pseudo label of the traffic sequence to be classified in the step (4.2) comprises the following steps:
cosine similarity is adopted in the feature space to measure the distance between a new flow sequence feature and each cluster center, calculated as follows:

$$d(G(x_i), c_k) = 1 - \frac{G(x_i) \cdot c_k}{\|G(x_i)\| \, \|c_k\|}$$

where $d(G(x_i), c_k)$ is the distance between newly collected sample $x_i$ and cluster center $c_k$; the distance from each newly collected sample to all cluster centers is computed, the sample is assigned the class of its nearest cluster center, and the new flow sequences in each class cluster are given pseudo labels as follows:

$$\hat{y}_i = \arg\min_{k} \, d(G(x_i), c_k)$$
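The pseudo-labeling rule of claim 5 (nearest cluster center under cosine distance) can be sketched as follows; the two unit-axis centers and the test feature are toy values chosen so the nearest center is obvious:

```python
import numpy as np

def cosine_distance(f, c):
    # d = 1 - (f . c) / (|f| |c|)
    return 1.0 - (f @ c) / (np.linalg.norm(f) * np.linalg.norm(c))

centers = np.array([[1.0, 0.0],
                    [0.0, 1.0]])          # two cluster centers
feat = np.array([0.9, 0.1])               # a new flow sequence feature

dists = np.array([cosine_distance(feat, c) for c in centers])
pseudo = int(dists.argmin())               # pseudo label = index of nearest center
```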
6. The anonymous network traffic classification method based on small sample machine learning of claim 1, wherein: the specific process of calculating the clustering loss and updating the network weights in step (4.3) is as follows:

the clustering loss function is calculated as follows:

$$L_{clu} = \ell_{ce}(p_i, \hat{y}_i)$$

where $p_i = C(G(x_i))$ is the classifier's prediction output for the newly acquired flow sequence $x_i$, and $\hat{y}_i$ is its pseudo label;

the final overall optimization objective function is as follows:

$$L = L_{cls} + \lambda \, L_{clu}$$

where $\lambda$ is a hyper-parameter that balances the classification loss and the clustering loss during training.
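The combined objective of claim 6 (clustering cross-entropy against pseudo labels plus the weighted sum) can be sketched with toy values; `lam` and all function names are illustrative, and the classification loss is simply assumed rather than computed:

```python
import numpy as np

def clu_loss(probs, pseudo):
    # cross-entropy of classifier outputs against integer pseudo labels
    return float(-np.log(probs[np.arange(len(pseudo)), pseudo]).mean())

def total_loss(l_cls, l_clu, lam=0.5):
    # L = L_cls + lambda * L_clu; lam balances the two losses
    return l_cls + lam * l_clu

probs = np.array([[0.6, 0.4],
                  [0.3, 0.7]])    # classifier outputs on two new flow sequences
pseudo = np.array([0, 1])         # their pseudo labels
L = total_loss(0.30, clu_loss(probs, pseudo), lam=0.5)
```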
CN202211592847.4A 2022-12-13 2022-12-13 Anonymous network traffic classification method based on small sample machine learning Pending CN115913992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211592847.4A CN115913992A (en) 2022-12-13 2022-12-13 Anonymous network traffic classification method based on small sample machine learning


Publications (1)

Publication Number Publication Date
CN115913992A true CN115913992A (en) 2023-04-04

Family

ID=86477931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211592847.4A Pending CN115913992A (en) 2022-12-13 2022-12-13 Anonymous network traffic classification method based on small sample machine learning

Country Status (1)

Country Link
CN (1) CN115913992A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117155707A (en) * 2023-10-30 2023-12-01 广东省通信产业服务有限公司 Harmful domain name detection method based on passive network flow measurement
CN117155707B (en) * 2023-10-30 2023-12-29 广东省通信产业服务有限公司 Harmful domain name detection method based on passive network flow measurement

Similar Documents

Publication Publication Date Title
CN113705712B (en) Network traffic classification method and system based on federal semi-supervised learning
Han et al. Joint air quality and weather prediction based on multi-adversarial spatiotemporal networks
CN106992994A (en) A kind of automatically-monitored method and system of cloud service
CN113378899B (en) Abnormal account identification method, device, equipment and storage medium
CN113762595B (en) Traffic time prediction model training method, traffic time prediction method and equipment
CN111461784B (en) Multi-model fusion-based fraud detection method
CN110990718A (en) Social network model building module of company image improving system
CN115913992A (en) Anonymous network traffic classification method based on small sample machine learning
CN114584406B (en) Industrial big data privacy protection system and method for federated learning
CN115660147A (en) Information propagation prediction method and system based on influence modeling between propagation paths and in propagation paths
CN111224998B (en) Botnet identification method based on extreme learning machine
Li et al. Street-Level Landmarks Acquisition Based on SVM Classifiers.
CN109657725B (en) Service quality prediction method and system based on complex space-time context awareness
CN113938290A (en) Website de-anonymization method and system for user side traffic data analysis
CN105447148A (en) Cookie identifier association method and apparatus
CN117271899A (en) Interest point recommendation method based on space-time perception
CN115438753B (en) Method for measuring security of federal learning protocol data based on generation
CN114757391B (en) Network data space design and application method oriented to service quality prediction
Kumar et al. Progressive machine learning approach with WebAstro for Web usage mining
CN115622810A (en) Business application identification system and method based on machine learning algorithm
CN115767601A (en) 5GC network element automatic nanotube method and device based on multidimensional data
CN112581177B (en) Marketing prediction method combining automatic feature engineering and residual neural network
Liu et al. Prediction model for non-topological event propagation in social networks
CN111586052A (en) Multi-level-based crowd sourcing contract abnormal transaction identification method and identification system
Xu et al. Machine learning based abnormal flow analysis of university course teaching network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination