CN115563532A - Flow classification method and system based on federal semi-supervised learning - Google Patents


Publication number
CN115563532A
Authority
CN
China
Prior art keywords: unsupervised, parameter, client, supervised learning, supervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211123213.4A
Other languages
Chinese (zh)
Inventor
卜佑军
孙重鑫
陈博
马海龙
周锟
张德升
乔伟
王克跃
蒋笑笑
王亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Network Communication and Security Zijinshan Laboratory
Original Assignee
Information Engineering University of PLA Strategic Support Force
Network Communication and Security Zijinshan Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force and Network Communication and Security Zijinshan Laboratory
Priority to CN202211123213.4A
Publication of CN115563532A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a flow classification method and system based on federal semi-supervised learning. The method comprises the following steps: constructing a label-free traffic data set and a tagged traffic data set; the central server decomposes the global model into supervised learning parameters and unsupervised learning parameters and initializes them; the parameters and the auxiliary agents are sent to each client; each client performs unsupervised training on its local label-free traffic data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agents, and uploads its unsupervised learning parameter difference to the central server; the central server aggregates and updates the unsupervised learning parameters of each client, carries out supervised training using its local tagged traffic data set, and sends the supervised learning parameter difference and the unsupervised learning parameter difference to each client; new auxiliary agents are obtained based on nearest neighbor search and sent to each client when the set sending condition is met; the unsupervised and supervised training steps are executed iteratively until a stop condition is met.

Description

Flow classification method and system based on federal semi-supervised learning
Technical Field
The invention relates to the technical field of network security, in particular to a flow classification method and system based on federal semi-supervised learning.
Background
As an important task in the field of network management and security, traffic classification is an indispensable link in traffic control, planning, intrusion detection and traffic trend analysis. With the explosive growth of traffic, traffic classification, as a core management component, requires a more efficient, lower-consumption model. However, traffic classification based on port identification, payload matching or machine learning all relies on handcrafted statistical features, and obtaining statistical features such as average packet length, flow duration and average packet inter-arrival time requires observing the entire flow or a large portion of it; such methods therefore cannot accomplish accurate and efficient real-time traffic classification.
Compared with traffic classification methods that require manual feature extraction, classification methods based on deep learning integrate feature extraction and model training into a unified end-to-end model that automatically learns features from raw traffic and classifies it, and are therefore favored by researchers. Although deep learning avoids complicated feature-extraction operations, conventional deep-learning-based traffic classification models lack consideration of practical deployment, and the approach still faces several problems in real applications.
First, data islanding problem
In the field of network traffic classification, traffic data collected from user devices typically contains private information such as the user's network behavior. Users do not want this information disclosed, and legal regulations also forbid commercial companies from disclosing or sharing such user data, which causes the data island problem in industry: companies and organizations can only store and use their own internal data independently. Training on such severely homogenized data tends to produce an overfitted model, so the trained traffic classification model lacks generality. Moreover, the absence of user traffic from some parts of the network seriously limits the application of deep learning to network traffic classification and greatly reduces classification accuracy.
Second, the problem of data scarcity
At present, the mainstream deep-learning-based traffic classification methods still rely on supervised learning, which requires a large amount of labeled data to train a model. In reality, however, most captured traffic data is unlabeled, and because of the complexity of domain knowledge in computer networks, labeling traffic data requires expert experience; labeling all of it would consume enormous labor and time costs.
Third, transmission cost problem
Naively applying unsupervised learning and federal learning to the traffic classification task is infeasible: current network traffic is huge and the environment complex, so a traffic classification model must be trained and updated in real time. Transmitting classification models concurrently occupies considerable service bandwidth; if bandwidth allocation is insufficient, the result is network paralysis and slow upload and download speeds, while allocating too much bandwidth ties up a large amount of bandwidth resources at a very high economic cost.
Disclosure of Invention
In order to solve at least part of the problems of data islands, marked flow data scarcity, communication cost and the like existing in the real network flow classification task of deep learning, the invention provides a flow classification method and a flow classification system based on federal semi-supervised learning, which can obtain a flow classification model with high accuracy, wide applicability and capability of fully protecting user privacy.
In one aspect, the invention provides a flow classification method based on federal semi-supervised learning, which comprises the following steps:
step 1: the client captures the label-free network flow of the local gateway and performs data preprocessing on the label-free network flow to form a label-free flow data set; the central server carries out data preprocessing on the marked network traffic to form a tagged traffic data set;
step 2: the central server selects the flow classification model adopted as the global model, decomposes the global model into a supervised learning parameter and an unsupervised learning parameter, and initializes the two learning parameters; initializes the auxiliary agents; and sends the two initialized learning parameters and the auxiliary agents to each client;
step 3: the client performs unsupervised training by using a local label-free flow data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agents, updates the unsupervised learning parameters, obtains the unsupervised learning parameter difference before and after updating, and uploads the unsupervised learning parameter difference to the central server;
step 4: the central server aggregates and updates the unsupervised learning parameters of each client and obtains the unsupervised learning parameter difference before and after updating; carries out supervised training by using a local tagged flow data set, updates the supervised learning parameters, and obtains the supervised learning parameter difference before and after updating; then sends the supervised learning parameter difference and the unsupervised learning parameter difference to each client; obtains, based on nearest neighbor search, the H local unsupervised learning parameters most similar to the current unsupervised learning parameters as new auxiliary agents, and sends the new auxiliary agents to each client when the set sending condition is met;
step 5: iteratively execute step 3 to step 4 until a stop condition is met, and take the global model as the final flow classification model.
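The server-client loop in the steps above can be sketched in a few lines of NumPy. Everything here is illustrative: the gradient computations are toy stand-ins (a pull of ψ toward the helper consensus, a pull of σ toward ψ), not the patent's actual losses, and all function and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(g):
    """Unit direction vector of a parameter update."""
    n = np.linalg.norm(g)
    return g / n if n > 0 else g

def client_unsupervised_step(sigma, psi, helpers, eta_u=0.1, lam_l2=0.01):
    """Step 3 (sketch): sigma stays frozen; psi takes one step of size
    eta_u along a toy consistency gradient. Returns only the difference,
    which is what the client uploads."""
    consensus = np.mean(helpers, axis=0)
    grad = (psi - consensus) + lam_l2 * (psi - sigma)
    new_psi = psi - eta_u * unit(grad)
    return new_psi - psi

def server_round(sigma, psi, client_deltas, eta_s=0.1):
    """Step 4 (sketch): aggregate the uploaded psi-differences, then take
    a supervised step on sigma with psi frozen (stubbed gradient)."""
    psi = psi + np.mean(client_deltas, axis=0)
    grad_s = sigma - psi            # placeholder supervised gradient
    sigma = sigma - eta_s * unit(grad_s)
    return sigma, psi

# toy run: 3 clients, 5 communication rounds (step 5 iterates 3-4)
sigma, psi = rng.normal(size=4), rng.normal(size=4)
helpers = [rng.normal(size=4) for _ in range(2)]
for r in range(5):
    deltas = [client_unsupervised_step(sigma, psi, helpers) for _ in range(3)]
    sigma, psi = server_round(sigma, psi, deltas)
```

Note that only parameter differences cross the network in this sketch, mirroring the transmission-cost argument of the patent.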
Further, in step 2, a Resnet9 network model is adopted as a traffic classification model.
Further, in step 3, the unsupervised training process of the client specifically includes:
freezing the supervised learning parameter σ, and performing unsupervised training using the local label-free traffic data set u to obtain a new model $\theta=\sigma^{*}+\psi$, namely:

$$\psi \leftarrow \psi - \eta_u\,\hat{g}_{\psi}$$

obtaining an updated unsupervised learning parameter ψ;

wherein the minimized consistency loss term in the unsupervised training process is shown in formula (1):

$$\mathcal{L}_u(\psi)=\lambda_{ICCS}\,\Phi(\sigma^{*}+\psi)+\lambda_{L1}\,\lVert\psi\rVert_{1}+\lambda_{L2}\,\lVert\sigma^{*}-\psi\rVert_{2}^{2}\qquad(1)$$

wherein * denotes a frozen parameter, $\psi_{h}^{*}$ represents the auxiliary agents (which enter through Φ), $\eta_u$ represents the movement step size of the parameter, $\hat{g}_{\psi}$ represents the unit direction vector of the parameter update, $\lambda_{L1}$ and $\lambda_{L2}$ are parameters set to prevent the unsupervised training from affecting the supervised learning parameters, $\lambda_{ICCS}$ represents a hyper-parameter controlling unsupervised learning, and Φ(·) is the consistency regularization of the local model with the auxiliary agents.
Further, formula (2) is used to represent Φ(·):

$$\Phi(u)=\frac{1}{H}\sum_{j=1}^{H}\mathrm{KL}\!\left(p_{\psi_{h_j}^{*}}(y\mid u)\,\big\|\,p_{\theta}(y\mid u)\right)+\mathrm{CE}\!\left(\hat{y},\,p_{\theta}\big(y\mid\pi(u)\big)\right)\qquad(2)$$

wherein $\psi_{h_j}^{*}$ is the j-th auxiliary agent, $\hat{y}=\mathrm{MAX}\!\left(\mathbb{1}\!\left(p_{\sigma^{*}+\psi}(y\mid u)+\sum_{j=1}^{H}p_{\psi_{h_j}^{*}}(y\mid u)\right)\right)$ is the pseudo label output by integrating the auxiliary agents, $\mathbb{1}(\cdot)$ denotes a label generated based on softmax, MAX(·) denotes outputting the label of the class with the maximum consistency, π(u) is a random augmentation operation performed on the label-free traffic data set u, and the KL term is the consistency loss among the auxiliary agents.
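A toy numeric reading of the consistency regularization Φ(·) in formula (2), assuming the softmax outputs are already available as probability vectors; the uniform helper weighting and all function names are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
    return float(np.sum(p * np.log(p / q)))

def inter_client_consistency(p_local, p_local_aug, helper_probs):
    """Sketch of Phi(): mean KL consistency with each auxiliary agent
    on u, plus cross-entropy against the integrated pseudo label on
    the augmented input pi(u)."""
    # consistency with the H auxiliary agents
    kl_term = np.mean([kl(ph, p_local) for ph in helper_probs])
    # pseudo label: class with maximum summed confidence across the
    # local model and the helpers (the MAX(1(...)) of the patent)
    summed = p_local + np.sum(helper_probs, axis=0)
    y_hat = int(np.argmax(summed))
    ce_term = -np.log(max(p_local_aug[y_hat], 1e-12))
    return kl_term + ce_term

# toy 3-class example
p_u     = np.array([0.7, 0.2, 0.1])   # local model on u
p_aug   = np.array([0.6, 0.3, 0.1])   # local model on augmented pi(u)
helpers = [np.array([0.8, 0.1, 0.1]), np.array([0.6, 0.3, 0.1])]
loss = inter_client_consistency(p_u, p_aug, helpers)
```

The loss shrinks as the local model's predictions on u and π(u) move toward the helper consensus, which is the training signal unlabeled traffic provides here.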
Further, in step 4, the supervised training process of the central server specifically includes:

performing supervised training using the local tagged traffic data set s to obtain a new model $\theta=\sigma+\psi^{*}$, namely:

$$\sigma \leftarrow \sigma - \eta_s\,\hat{g}_{\sigma}$$

simultaneously obtaining an updated supervised learning parameter σ;

wherein the minimized loss term in the supervised training process is shown in formula (3):

$$\mathcal{L}_s(\sigma)=\lambda_{s}\,\mathrm{CE}\!\left(y,\,p_{\sigma+\psi^{*}}(y\mid x)\right)\qquad(3)$$

wherein * denotes a frozen parameter, $\eta_s$ represents the movement step size of the parameter, $\hat{g}_{\sigma}$ represents the unit direction vector of the parameter update, and $\lambda_s$ is a hyper-parameter controlling supervised learning.
Further, in step 1, the data preprocessing includes: dividing, cleaning, length unification and visualization of the traffic data, performed in sequence, to obtain a traffic data image.
Further, the dividing of the traffic data specifically includes: splitting the original traffic in the Pcap file into different bidirectional sessions according to source IP, destination IP, source port, destination port and transport-layer protocol;
the cleaning of the flow data specifically comprises: deleting repeated data packets and null data packets, iterating data packets of all bidirectional sessions, and deleting information irrelevant to flow classification;
the uniform length of the flow data specifically includes: unifying the length of each session as a fixed byte, cutting off if the length of the session is greater than the fixed byte, and filling zero at the end of the session if the length of the session is less than the fixed byte; and/or, padding 0 at the end of the header of the UDP segment to make it equal to the length of the TCP header to make the transport layer segment uniform.
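The length-unification rules above can be sketched directly in a few lines; the fixed length of 784 bytes and the function names are illustrative assumptions (the claim only specifies "a fixed byte" length, plus the 8-byte UDP and 20-byte TCP header sizes).

```python
def unify_session_length(session: bytes, fixed_len: int = 784) -> bytes:
    """Truncate a session longer than fixed_len; zero-pad a shorter one
    at the end, so every session maps to a same-sized input."""
    if len(session) >= fixed_len:
        return session[:fixed_len]
    return session + b"\x00" * (fixed_len - len(session))

def pad_udp_header(segment: bytes) -> bytes:
    """Pad the 8-byte UDP header with zeros up to the 20-byte TCP header
    length, so transport-layer segments are uniform."""
    udp_header, payload = segment[:8], segment[8:]
    return udp_header + b"\x00" * 12 + payload

sample = unify_session_length(b"\x45\x00" * 10)   # 20-byte toy session
```

A fixed length of 784 = 28 x 28 bytes would, for instance, let each session be visualized as a square grayscale image.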
Further, for the r-th communication process, the central server aggregates the unsupervised learning parameters of the A clients based on model similarity, that is:

$$\psi_{G}^{(r)}=\sum_{a=1}^{A}\alpha_{a}\,\psi_{a}^{(r)},\qquad \sum_{a=1}^{A}\alpha_{a}=1$$

wherein $\psi_{a}^{(r)}$ represents the unsupervised learning parameter of client a during the r-th communication, and $\alpha_a$ is the aggregation weight of client a.
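"Aggregation based on model similarity" is not spelled out further in this text; one plausible reading, sketched here purely as an assumption, weights each client's ψ by its cosine similarity to the current global ψ before averaging.

```python
import numpy as np

def aggregate_by_similarity(psi_clients, psi_global):
    """Weight each client's unsupervised parameters by cosine similarity
    to the current global parameters, then take the weighted average.
    (Illustrative reading; the patent does not fix the weighting.)"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.array([max(cos(p, psi_global), 0.0) for p in psi_clients])
    if sims.sum() == 0:                 # fall back to a plain average
        sims = np.ones(len(psi_clients))
    w = sims / sims.sum()
    return sum(wi * pi for wi, pi in zip(w, psi_clients))

psi_g = np.array([1.0, 0.0])
clients = [np.array([0.9, 0.1]), np.array([1.1, -0.1]), np.array([0.0, 1.0])]
psi_new = aggregate_by_similarity(clients, psi_g)
```

Dissimilar clients (here the third one, orthogonal to the global model) contribute little, which damps the effect of non-IID outliers.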
Further, in step 4, the sending condition is that the central server sends the new auxiliary agents to the clients once every fixed number of communication rounds.
In another aspect, the present invention provides a flow classification system based on federal semi-supervised learning, including:
the flow preprocessing module is respectively arranged on the client and the central server and is used for preprocessing data of the non-label network flow of the local gateway captured by the client to form a non-label flow data set and preprocessing the data of the marked network flow on the central server to form a label flow data set;
the server initialization module is arranged in the central server and used for selecting the flow classification model adopted by the global model, decomposing the global model into a supervised learning parameter and an unsupervised learning parameter and initializing the two learning parameters; and initializing a secondary agent; sending the two initialized learning parameters and the auxiliary agent to each client;
the client training module is arranged at the client and used for carrying out unsupervised training by utilizing a local label-free flow data set based on supervised learning parameters, unsupervised learning parameters and auxiliary agents, updating the unsupervised learning parameters, obtaining unsupervised learning parameter differences before and after updating, and uploading the unsupervised learning parameter differences to the central server;
the server retraining module is arranged in the central server and used for aggregating the unsupervised learning parameters of each client and obtaining the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled flow data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, transmitting the supervised learning parameter difference and the unsupervised learning parameter difference to each client; and obtaining H most similar local models as new auxiliary agents based on nearest neighbor search, and sending the new auxiliary agents to each client when set sending conditions are met.
The invention has the beneficial effects that:
1. the method trains the network flow classification model through the federal semi-supervised learning architecture, can assist multiple parties to jointly learn an accurate and universal neural network model without disclosing and sharing local user flow data of a client; each participant, namely the client can independently train on the own user data set, and only needs to selectively share the model parameters learned and trained by the local data set during the training period; the training mode assisting multi-party training and not needing to collect local data not only solves the problem of data islanding in the flow field, but also skillfully solves the problem of exposing user privacy data.
2. The invention carries out semi-supervised learning under the federal environment based on the classification model of the convolutional neural network, the semi-supervised learning uses a large amount of local unlabeled flow, and simultaneously uses a small amount of labeled data labeled by experts to train the classification model, thereby effectively solving the problem of high cost for labeling data in the actual network flow classification task.
3. The invention provides consistency loss among clients to perform semi-supervised learning, utilizes the clients among similar network segments as consistency disturbance sources in the semi-supervised learning, maximizes the utilization of consensus among the clients among the similar network segments and effectively accelerates the training speed.
4. According to the method, the federal semi-supervised learning is carried out based on the model updating strategy of parameter decomposition, the model is decomposed into unsupervised learning parameters and supervised learning parameters, and reliable knowledge from the marked data is stored, so that the situation that the knowledge learned from the marked data is forgotten when the unsupervised learning proportion of the model is too large is avoided, and the inter-task interference can be effectively prevented; and communication costs can be further reduced.
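The communication saving claimed above comes from transmitting only parameter differences rather than full models; a further, purely illustrative refinement (not claimed verbatim by the patent) is zeroing negligible entries of the difference before sending, which makes it sparse and cheaper to encode.

```python
import numpy as np

def sparse_delta(old, new, threshold=1e-3):
    """Transmit only the parameter difference, zeroing entries whose
    magnitude falls below a threshold. The thresholding is an
    illustrative compression, not part of the patent's claims."""
    delta = new - old
    delta[np.abs(delta) < threshold] = 0.0
    return delta

old = np.array([0.50, 0.20, -0.10, 0.00])
new = np.array([0.50, 0.25, -0.10, 0.40])
d = sparse_delta(old, new)
# the receiver reconstructs the updated parameters from its stored copy
reconstructed = old + d
```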
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2 is a schematic flow chart of a traffic classification method based on federal semi-supervised learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a flow classification system based on Federal semi-supervised learning according to an embodiment of the present invention;
fig. 4 is a second frame diagram of the flow classification system based on federal semi-supervised learning according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the embodiment of the present invention is mainly applied to a communication scenario of a central server and a plurality of clients in a federated environment. As shown in fig. 2, an embodiment of the present invention provides a traffic classification method based on federal semi-supervised learning, including the following steps:
s101: the client captures the label-free network traffic of the local gateway and performs data preprocessing on the label-free network traffic to form a label-free traffic data set; the central server carries out data preprocessing on the marked network traffic to form a tagged traffic data set;
s102: the central server selects a flow classification model adopted by a global model, decomposes the global model into a supervised learning parameter and an unsupervised learning parameter and initializes the two learning parameters; and initializing a secondary agent; sending the initialized two learning parameters and the auxiliary agent to each client;
s103: the client performs unsupervised training by using a local label-free flow data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agent, updates the unsupervised learning parameters, obtains unsupervised learning parameter differences before and after updating, and uploads the unsupervised learning parameter differences to the central server;
s104: the central server aggregates the unsupervised learning parameters of the clients and obtains the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled traffic data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, the supervised learning parameter difference and the unsupervised learning parameter difference are sent to each client; h local unsupervised learning parameters which are most similar to the current unsupervised learning parameters are obtained based on nearest neighbor search and serve as new auxiliary agents, and when the set sending conditions are met, the new auxiliary agents are sent to the clients;
s105: and (5) iteratively executing the step S103 to the step S104 until a stop condition is met, wherein the global model at the moment is used as a final flow classification model.
The flow classification method based on the federal semi-supervised learning provided by the embodiment of the invention can assist multiple parties to jointly learn an accurate and universal neural network model without disclosing and sharing local user flow data of a client; each participant, namely the client can independently train on the own user data set, and only needs to selectively share the model parameters learned and trained by the local data set during the training period; the training mode which assists in multi-party training and does not need to collect local data not only solves the problem of data island in the traffic field, but also skillfully solves the problem of exposing user privacy data;
the embodiment of the invention provides a model updating strategy based on parameter decomposition, and the updating strategy is utilized to carry out federal semi-supervised learning and store reliable knowledge from the marked data, so that on one hand, the condition that the knowledge learned from the marked data is forgotten when the unsupervised learning proportion of the model is too large can be avoided, and the inter-task interference can be effectively prevented; and on the other hand, the communication cost can be further reduced.
Example 2
In order to maximize the utilization of the consensus among the clients among the similar network segments, on the basis of the embodiment, the embodiment of the invention provides another traffic classification method based on the federal semi-supervised learning, and the embodiment of the invention mainly provides a specific training mode of an unsupervised training process and a supervised training process. The embodiment of the invention specifically comprises the following steps:
s201: the client captures local label-free data flow and carries out data preprocessing on the local label-free data flow so as to form a label-free flow data set u l L represents local; the central server carries out data preprocessing on the marked network flow to form a labeled data set s;
specifically, the label-free traffic data sets of the clients are distributed in a non-independent and identical manner; the tagged network data set s is tagged traffic tagged by an expert and comprises a plurality of data pairs (x, y), wherein x is a data stream and y is a tag corresponding to the data stream.
The data preprocessing comprises: dividing, cleaning, length unification and visualization of the traffic data, performed in sequence, to obtain traffic data images. The dividing of the traffic data specifically includes: splitting the original traffic in the Pcap file into different bidirectional sessions according to source IP, destination IP, source port, destination port and transport-layer protocol. The cleaning of the traffic data specifically includes: deleting repeated data packets and empty data packets, iterating over the data packets of all bidirectional sessions, and deleting information irrelevant to traffic classification (such as MAC addresses). The length unification of the traffic data specifically includes: unifying the length of each session to a fixed number of bytes, truncating a session longer than the fixed length, and zero-padding at the end of a session shorter than it. In addition, 0 is padded at the end of the header (8 bytes) of a UDP segment so that it equals the length (20 bytes) of a TCP header, making transport-layer segments uniform.
S202: the central server adopts the Resnet9 network model as the traffic classification model, records its parameters as the global model parameter $\theta_G$, decomposes the global model parameter $\theta_G$ into a supervised learning parameter σ and an unsupervised learning parameter ψ, and initializes each of them; the parameter ψ is subsequently trained locally at the clients, and the parameter σ is subsequently trained at the central server; the auxiliary agents are also initialized;

Specifically, the Resnet9 traffic classification model in this embodiment comprises a plurality of convolutional layers, pooling layers, residual connections and a Softmax output layer; the auxiliary agents are one of the key designs of the embodiment of the invention and are mainly used to maximize the utilization of the consensus among clients on similar network segments.
S203: the central server randomly selects A clients from all the clients to participate in the following model training task, and sends related training parameters to the selected A clients;
specifically, in consideration of the control of the training scale, the actual online condition of the client, and the like, in practical applications, not all the clients participate in the training task, and therefore, the central server selects the client that needs to participate in the training task.
In the first communication process, the parameters for training sent to each client by the central server include: initial supervised learning parameters σ and unsupervised learning parameters ψ, and an initial secondary agent.
In the second and subsequent communication processes, the parameters for training sent by the central server to each client mainly include: supervised learning parameter difference and unsupervised learning parameter difference; in addition to this, new secondary agents may be included, as the case may be.
S204: the client performs unsupervised training using the local label-free traffic data set based on the supervised learning parameters, unsupervised learning parameters and auxiliary agents, updates the unsupervised learning parameters, and obtains the unsupervised learning parameter difference before and after updating, $\Delta\psi=\psi^{(r)}-\psi^{(r-1)}$; the unsupervised learning parameter difference is then uploaded to the central server; wherein $\psi^{(r)}$ represents the updated unsupervised learning parameter and $\psi^{(r-1)}$ represents the unsupervised learning parameter before updating.
Specifically, in the first communication process (at this time, also called as the first round of training process), the client performs unsupervised training by using a local unlabelled traffic data set based on the initial supervised learning parameters, unsupervised learning parameters and the auxiliary agent, and updates the unsupervised learning parameters;
in the second and subsequent communication processes, the client needs to calculate new supervised learning parameters and unsupervised learning parameters based on the received supervised learning parameter difference and unsupervised learning parameter difference as well as the supervised learning parameters and the unsupervised learning parameters in the previous communication process which are locally stored; and then performing unsupervised training by using the local unlabeled traffic data set based on the new supervised learning parameters, unsupervised learning parameters and the auxiliary agent.
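The bookkeeping described above, where a client rebuilds its working parameters from the broadcast differences plus its stored previous-round copy, can be sketched as follows (class and variable names are illustrative):

```python
import numpy as np

class ClientState:
    """Keeps last round's parameters and applies server-sent differences."""
    def __init__(self, sigma, psi):
        self.sigma, self.psi = sigma, psi

    def apply_differences(self, delta_sigma, delta_psi):
        # new parameters = stored previous-round parameters + differences
        self.sigma = self.sigma + delta_sigma
        self.psi = self.psi + delta_psi
        return self.sigma, self.psi

c = ClientState(np.zeros(3), np.ones(3))
sigma, psi = c.apply_differences(np.array([0.1, 0.0, -0.1]),
                                 np.array([0.0, 0.5, 0.0]))
```

Because both sides keep the previous round's parameters, only the (typically small) differences ever need to cross the network.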
As an implementation manner, the unsupervised training process of the client specifically includes:

freezing the supervised learning parameter σ, and performing unsupervised training using the local label-free traffic data set u to obtain a new model $\theta=\sigma^{*}+\psi$, namely:

$$\psi \leftarrow \psi - \eta_u\,\hat{g}_{\psi}$$

obtaining an updated unsupervised learning parameter ψ;

wherein the minimized consistency loss term in the unsupervised training process is shown in formula (1):

$$\mathcal{L}_u(\psi)=\lambda_{ICCS}\,\Phi(\sigma^{*}+\psi)+\lambda_{L1}\,\lVert\psi\rVert_{1}+\lambda_{L2}\,\lVert\sigma^{*}-\psi\rVert_{2}^{2}\qquad(1)$$

wherein * denotes a frozen parameter, $\psi_{h}^{*}$ represents the auxiliary agents (which enter through Φ), $\eta_u$ represents the movement step size of the parameter, $\hat{g}_{\psi}$ represents the unit direction vector of the parameter update, $\lambda_{L1}$ and $\lambda_{L2}$ are parameters set to prevent the unsupervised training from affecting the supervised learning parameters, $\lambda_{ICCS}$ represents a hyper-parameter controlling unsupervised learning, and Φ(·) is the consistency regularization of the local model with the auxiliary agents.
Further, in the present embodiment, Φ(·) is expressed by formula (2):

$$\Phi(u)=\frac{1}{H}\sum_{j=1}^{H}\mathrm{KL}\!\left(p_{\psi_{h_j}^{*}}(y\mid u)\,\big\|\,p_{\theta}(y\mid u)\right)+\mathrm{CE}\!\left(\hat{y},\,p_{\theta}\big(y\mid\pi(u)\big)\right)\qquad(2)$$

wherein $\psi_{h_j}^{*}$ is the j-th auxiliary agent, $\hat{y}=\mathrm{MAX}\!\left(\mathbb{1}\!\left(p_{\sigma^{*}+\psi}(y\mid u)+\sum_{j=1}^{H}p_{\psi_{h_j}^{*}}(y\mid u)\right)\right)$ is the pseudo label output by integrating the auxiliary agents, $\mathbb{1}(\cdot)$ denotes a label generated based on softmax, MAX(·) denotes outputting the label of the class with the maximum consistency, π(u) is a random augmentation operation performed on the label-free traffic data set u, and the KL term is the consistency loss among the auxiliary agents.
S205: the central server aggregates and updates the unsupervised learning parameters of the A clients and obtains the unsupervised learning parameter difference before and after updating; performs supervised training using the local labeled traffic data set, updates the supervised learning parameters, and obtains the supervised learning parameter difference before and after updating; then sends the supervised learning parameter difference σ^{r+1} − σ^{r} and the unsupervised learning parameter difference ψ^{r+1} − ψ^{r} to each client; wherein σ^{r+1} denotes the updated supervised learning parameter, σ^{r} denotes the supervised learning parameter before updating, ψ^{r+1} denotes the updated unsupervised learning parameter, and ψ^{r} denotes the unsupervised learning parameter before updating.
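The difference-based exchange in S205 can be sketched as follows (the NumPy values are illustrative): only the parameter difference crosses the network, and the receiver restores the updated parameter by adding the difference to the copy it stored in the previous round.

```python
import numpy as np

# Sketch of difference-based communication: only deltas are transmitted.
psi_prev = np.array([0.5, -0.2, 1.0])   # parameter stored from round r
psi_new = np.array([0.6, -0.1, 0.9])    # parameter after updating (round r+1)
delta_psi = psi_new - psi_prev          # what is actually sent over the network

# receiver side: stored previous copy + received difference = updated parameter
psi_restored = psi_prev + delta_psi
print(psi_restored)  # identical to psi_new
```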
Specifically, the central server may obtain a new unsupervised learning parameter based on the received unsupervised learning parameter differences and the unsupervised learning parameters stored locally from the previous communication round;
as an implementation manner, for the r-th communication round, the central server aggregates the unsupervised learning parameters of the A clients based on model similarity, namely:

ψ^{r+1} = Σ_{a=1}^{A} α_a·ψ_a^{r},  with Σ_{a=1}^{A} α_a = 1

wherein ψ_a^{r} denotes the unsupervised learning parameter of client a during the r-th communication, and α_a denotes the similarity-based aggregation weight of client a.
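A minimal sketch of similarity-based aggregation; the cosine-similarity weighting used here is an assumption, since the disclosure only states that aggregation is based on model similarity.

```python
import numpy as np

def aggregate_unsupervised(client_psis, global_psi):
    """Similarity-weighted aggregation of the clients' unsupervised
    parameters psi_a^r. Hypothetical scheme: cosine similarity of each
    client parameter to the current global unsupervised parameter,
    normalized so the weights sum to 1."""
    sims = []
    for psi in client_psis:
        num = float(np.dot(psi, global_psi))
        den = float(np.linalg.norm(psi) * np.linalg.norm(global_psi)) or 1.0
        sims.append(max(num / den, 0.0) + 1e-12)  # keep weights positive
    weights = np.array(sims) / np.sum(sims)
    return sum(w * psi for w, psi in zip(weights, client_psis))

# two identical clients aggregate back to themselves
psis = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
new_psi = aggregate_unsupervised(psis, global_psi=np.array([1.0, 0.0]))
print(new_psi)  # [1. 0.]
```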
As an implementation manner, the supervised training process of the central server specifically includes:
performing supervised training using the local labeled traffic data set s to obtain a new model θ_{σ+ψ*}, namely:

σ ← σ − η_s·∇_σ L_s(σ)/‖∇_σ L_s(σ)‖

simultaneously obtaining an updated supervised learning parameter σ;

the minimized loss term in the supervised training process is shown in formula (3):

minimize L_s(σ) = λ_s·CrossEntropy(y, p_{σ+ψ*}(y|x))   (3)

wherein * denotes a frozen parameter, η_s denotes the step size of the parameter movement, ∇_σ L_s(σ)/‖∇_σ L_s(σ)‖ denotes the unit direction vector of the parameter update, and λ_s is a hyper-parameter for controlling the supervised learning.
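For illustration, one step of the server-side supervised update of formula (3) on a toy linear model: only σ is updated, ψ* stays frozen, and the gradient is scaled to a unit direction vector with step size η_s. The linear classifier itself is an assumption made for the sketch, not the Resnet9 model of the embodiment.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def supervised_step(sigma, psi_frozen, x, y, eta_s=0.1, lam_s=1.0):
    """One sketch step of formula (3): cross-entropy of the full model
    theta = sigma + psi* on a labeled sample (x, y); only sigma receives
    the update, scaled to a unit direction vector of length eta_s."""
    theta = sigma + psi_frozen                # psi* stays frozen
    p = softmax(theta @ x)
    grad_logits = lam_s * (p - y)             # d CrossEntropy / d logits
    grad_sigma = np.outer(grad_logits, x)     # gradient w.r.t. sigma only
    norm = float(np.linalg.norm(grad_sigma)) or 1.0
    return sigma - eta_s * grad_sigma / norm  # unit-direction update

sigma = np.zeros((2, 3)); psi = np.zeros((2, 3))
x = np.array([1.0, 0.0, 0.0]); y = np.array([1.0, 0.0])
new_sigma = supervised_step(sigma, psi, x, y)
# after one step the model assigns class 0 a probability above 0.5
```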
S206: it is preset that the server sends new auxiliary agents to the clients every 10 communication rounds; the central server judges whether the current communication round number r is a multiple of 10, and if so, obtains the H most similar local models as new auxiliary agents based on nearest neighbor search and sends the new auxiliary agents θ_1*, …, θ_H* to each client.
S207: steps S203 to S206 are iterated multiple times, the central server performing aggregation and updating repeatedly until the global model converges and iteration stops, finally obtaining the parameter θ_G of the global model.
Example 3
Corresponding to the above method, as shown in fig. 3 and 4, an embodiment of the present invention further provides a traffic classification system based on federal semi-supervised learning, including a traffic preprocessing module, a server initialization module, a client training module, and a server retraining module;
The traffic preprocessing module is respectively arranged on the clients and the central server, and is used for performing data preprocessing on the unlabeled network traffic of the local gateway captured by the clients to form an unlabeled traffic data set, and performing data preprocessing on the labeled network traffic on the central server to form a labeled traffic data set.

The server initialization module is arranged in the central server and is used for selecting the traffic classification model adopted by the global model, decomposing the global model into a supervised learning parameter and an unsupervised learning parameter, and initializing the two learning parameters; initializing the auxiliary agents; and sending the two initialized learning parameters and the auxiliary agents to each client.

The client training module is arranged at the clients and is used for performing unsupervised training using the local unlabeled traffic data set based on the supervised learning parameters, the unsupervised learning parameters, and the auxiliary agents, updating the unsupervised learning parameters, obtaining the unsupervised learning parameter differences before and after updating, and then uploading the unsupervised learning parameter differences to the central server.
The server retraining module is arranged in the central server and used for aggregating the unsupervised learning parameters of each client and obtaining the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled flow data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, the supervised learning parameter difference and the unsupervised learning parameter difference are sent to each client; and obtaining H most similar local models as new auxiliary agents based on nearest neighbor search, and sending the new auxiliary agents to each client when set sending conditions are met.
It should be noted that the system provided in the embodiment of the present invention is for implementing the method embodiment, and specific reference may be made to the method embodiment for functions of the system, which are not described herein again.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. The traffic classification method based on the federal semi-supervised learning is characterized by comprising the following steps of:
step 1: the client captures the label-free network traffic of the local gateway and performs data preprocessing on the label-free network traffic to form a label-free traffic data set; the central server carries out data preprocessing on the marked network traffic to form a tagged traffic data set;
step 2: the central server selects a flow classification model adopted by a global model, decomposes the global model into a supervised learning parameter and an unsupervised learning parameter and initializes the two learning parameters; and initializing a secondary agent; sending the initialized two learning parameters and the auxiliary agent to each client;
step 3: the client performs unsupervised training by using a local label-free flow data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agent, updates the unsupervised learning parameters, obtains unsupervised learning parameter differences before and after updating, and uploads the unsupervised learning parameter differences to the central server;
step 4: the central server aggregates and updates the unsupervised learning parameters of the client sides, and obtains unsupervised learning parameter differences before and after updating; carrying out supervised training by using a local labeled flow data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, transmitting the supervised learning parameter difference and the unsupervised learning parameter difference to each client; H local unsupervised learning parameters which are most similar to the current unsupervised learning parameters are obtained based on nearest neighbor search and serve as new auxiliary agents, and when the set sending conditions are met, the new auxiliary agents are sent to the clients;
step 5: iteratively executing step 3 to step 4 until a stop condition is met, and taking the global model as the final traffic classification model.
2. The method for traffic classification based on federal semi-supervised learning of claim 1, wherein in step 2, a Resnet9 network model is used as the traffic classification model.
3. The traffic classification method based on federal semi-supervised learning as claimed in claim 1, wherein in step 3, the unsupervised training process of the client specifically comprises:
freezing the supervised learning parameter σ and performing unsupervised training using the local unlabeled traffic data set u to obtain a new model θ_{σ*+ψ}, namely:

ψ ← ψ − η_u·∇_ψ L_u(ψ)/‖∇_ψ L_u(ψ)‖

obtaining an updated unsupervised learning parameter ψ;

wherein the minimized consistency loss term in the unsupervised training process is shown in formula (1):

minimize L_u(ψ) = λ_ICCS·Φ(σ*, ψ, u) + λ_L2·‖σ* − ψ‖₂² + λ_L1·‖ψ‖₁   (1)

wherein * denotes a frozen parameter, θ_h* (h = 1, …, H) denote the auxiliary agents, η_u denotes the step size of the parameter movement, ∇_ψ L_u(ψ)/‖∇_ψ L_u(ψ)‖ denotes the unit direction vector of the parameter update, λ_L2·‖σ* − ψ‖₂² and λ_L1·‖ψ‖₁ are terms set to prevent the unsupervised training from affecting the supervised learning parameters, λ_ICCS denotes a hyper-parameter for controlling the unsupervised learning, and Φ(·) is the consistency regularization of the local model with the auxiliary agents.
4. The traffic classification method based on federal semi-supervised learning of claim 3, wherein Φ(·) is expressed by formula (2):

Φ(σ*, ψ, u) = (1/H)·Σ_{h=1}^{H} KL(p_{θ_h*}(y|u) ‖ p_{σ*+ψ}(y|u)) + CrossEntropy(ŷ, p_{σ*+ψ}(y|π(u)))   (2)

wherein θ_h* (h = 1, …, H) are the auxiliary agents; ŷ = MAX(1(p_{σ*+ψ}(y|u)) + Σ_{h=1}^{H} 1(p_{θ_h*}(y|u))) is the pseudo label output by the ensemble of auxiliary agents; 1(·) denotes a one-hot label generated based on softmax; MAX(·) denotes the output label of the class with the maximum consistency; π(u) is a random enhancement operation performed on the unlabeled traffic data set u; and the KL divergence term is the consistency loss between the local model and the auxiliary agents.
5. The traffic classification method based on federal semi-supervised learning as claimed in claim 1, wherein in the step 4, the supervised training process of the central server specifically comprises:

performing supervised training using the local labeled traffic data set s to obtain a new model θ_{σ+ψ*}, namely:

σ ← σ − η_s·∇_σ L_s(σ)/‖∇_σ L_s(σ)‖

simultaneously obtaining an updated supervised learning parameter σ;

the minimized loss term in the supervised training process is shown in formula (3):

minimize L_s(σ) = λ_s·CrossEntropy(y, p_{σ+ψ*}(y|x))   (3)

wherein * denotes a frozen parameter, η_s denotes the step size of the parameter movement, ∇_σ L_s(σ)/‖∇_σ L_s(σ)‖ denotes the unit direction vector of the parameter update, and λ_s is a hyper-parameter for controlling the supervised learning.
6. The traffic classification method based on federal semi-supervised learning as in claim 1, wherein in the step 1, the data preprocessing comprises: and the flow data is divided, cleaned, unified in length and visualized in sequence to obtain a flow data image.
7. The method for traffic classification based on federal semi-supervised learning of claim 6, wherein the dividing of the traffic data specifically comprises: dividing the original traffic in the Pcap file into different bidirectional sessions according to the source IP, destination IP, source port, destination port and transport layer protocol;
the cleaning of the flow data specifically comprises the following steps: deleting repeated data packets and null data packets, iterating data packets of all bidirectional sessions, and deleting information irrelevant to flow classification;
the unified length of the traffic data specifically includes: unifying the length of each session to a fixed number of bytes, truncating the session if its length is greater than the fixed length, and padding zeros at the end of the session if its length is less than the fixed length; and/or padding 0 at the end of the UDP segment header to make it equal in length to the TCP header, so that transport layer segments are uniform.
8. The method for traffic classification based on federal semi-supervised learning of claim 1, wherein for the r-th communication process, the central server aggregates unsupervised learning parameters of a clients based on model similarity, that is:
ψ^{r+1} = Σ_{a=1}^{A} α_a·ψ_a^{r},  with Σ_{a=1}^{A} α_a = 1

wherein ψ_a^{r} denotes the unsupervised learning parameter of client a during the r-th communication, and α_a denotes the similarity-based aggregation weight of client a.
9. The traffic classification method based on federal semi-supervised learning of claim 1, wherein in the step 4, the sending condition is that the central server sends the new auxiliary agents to the clients every fixed number of communication rounds.
10. Flow classification system based on federal semi-supervised learning is characterized by comprising:
the flow preprocessing module is respectively arranged on the client and the central server and is used for preprocessing data of the unlabeled network flow of the local gateway captured by the client to form an unlabeled flow data set and preprocessing the data of the labeled network flow on the central server to form a labeled flow data set;
the server initialization module is arranged in the central server and used for selecting the flow classification model adopted by the global model, decomposing the global model into a supervised learning parameter and an unsupervised learning parameter and initializing the two learning parameters; and initializing a secondary agent; sending the two initialized learning parameters and the auxiliary agent to each client;
the client training module is arranged at the client and used for performing unsupervised training by utilizing a local label-free flow data set based on supervised learning parameters, unsupervised learning parameters and auxiliary agents, updating the unsupervised learning parameters, obtaining unsupervised learning parameter differences before and after updating, and then uploading the unsupervised learning parameter differences to the central server;
the server retraining module is arranged in the central server and used for aggregating the unsupervised learning parameters of each client and obtaining the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled traffic data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, the supervised learning parameter difference and the unsupervised learning parameter difference are sent to each client; and obtaining H most similar local models based on nearest neighbor search to serve as new auxiliary agents, and sending the new auxiliary agents to the clients when set sending conditions are met.
CN202211123213.4A 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning Pending CN115563532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211123213.4A CN115563532A (en) 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211123213.4A CN115563532A (en) 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning

Publications (1)

Publication Number Publication Date
CN115563532A true CN115563532A (en) 2023-01-03

Family

ID=84740627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211123213.4A Pending CN115563532A (en) 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning

Country Status (1)

Country Link
CN (1) CN115563532A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108491A (en) * 2023-04-04 2023-05-12 杭州海康威视数字技术股份有限公司 Data leakage early warning method, device and system based on semi-supervised federal learning
CN116108491B (en) * 2023-04-04 2024-03-22 杭州海康威视数字技术股份有限公司 Data leakage early warning method, device and system based on semi-supervised federal learning

Similar Documents

Publication Publication Date Title
CN113705712B (en) Network traffic classification method and system based on federal semi-supervised learning
Shi et al. From semantic communication to semantic-aware networking: Model, architecture, and open problems
US10163420B2 (en) System, apparatus and methods for adaptive data transport and optimization of application execution
CN113435472A (en) Vehicle-mounted computing power network user demand prediction method, system, device and medium
CN115102763B (en) Multi-domain DDoS attack detection method and device based on trusted federal learning
CN114710330B (en) Anomaly detection method based on heterogeneous layered federated learning
Vinayakumar et al. Secure shell (ssh) traffic analysis with flow based features using shallow and deep networks
CN115563532A (en) Flow classification method and system based on federal semi-supervised learning
CN116862012A (en) Machine learning model training method, business data processing method, device and system
Zhang et al. Optimization of image transmission in cooperative semantic communication networks
CN110365659B (en) Construction method of network intrusion detection data set in small sample scene
CN115359298A (en) Sparse neural network-based federal meta-learning image classification method
CN116187469A (en) Client member reasoning attack method based on federal distillation learning framework
CN108737491A (en) Information-pushing method and device and storage medium, electronic device
Gou et al. Clustered hierarchical distributed federated learning
Lin et al. Federated learning with dynamic aggregation based on connection density at satellites and ground stations
Han et al. An effective encrypted traffic classification method based on pruning convolutional neural networks for cloud platform
CN114070775A (en) Block chain network slice safety intelligent optimization method facing 5G intelligent network connection system
CN116992336B (en) Bearing fault diagnosis method based on federal local migration learning
CN117633657A (en) Method, device, processor and computer readable storage medium for realizing encryption application flow identification processing based on multi-graph characterization enhancement
Abbasi et al. FLITC: A novel federated learning-based method for IoT traffic classification
CN112215326A (en) Distributed AI system
Zhang et al. Encrypted network traffic classification: A data driven approach
Mertens et al. i-WSN League: Clustered Distributed Learning in Wireless Sensor Networks
Li et al. Knowledge-Assisted Few-Shot Fault Diagnosis in Cellular Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination