CN115563532A - Flow classification method and system based on federal semi-supervised learning - Google Patents


Publication number
CN115563532A
Authority
CN
China
Prior art keywords: unsupervised, parameter, client, supervised learning, supervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211123213.4A
Other languages
Chinese (zh)
Inventor
卜佑军
孙重鑫
陈博
马海龙
周锟
张德升
乔伟
王克跃
蒋笑笑
王亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Network Communication and Security Zijinshan Laboratory
Original Assignee
Information Engineering University of PLA Strategic Support Force
Network Communication and Security Zijinshan Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force and Network Communication and Security Zijinshan Laboratory
Priority to CN202211123213.4A
Publication of CN115563532A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a flow classification method and system based on federal semi-supervised learning. The method comprises the following steps: constructing a label-free traffic data set and a tagged traffic data set; the central server decomposes the global model into supervised learning parameters and unsupervised learning parameters and initializes them; the parameters and the auxiliary agents are sent to each client; each client performs unsupervised training on its local label-free traffic data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agents, and uploads its unsupervised learning parameter difference to the central server; the central server aggregates and updates the unsupervised learning parameters of each client, carries out supervised training using its local tagged traffic data set, and sends the supervised learning parameter difference and the unsupervised learning parameter difference to each client; new auxiliary agents are obtained based on nearest neighbor search and sent to each client when the set sending condition is met; the unsupervised and supervised training steps are executed iteratively until a stop condition is met.

Description

Flow classification method and system based on federal semi-supervised learning
Technical Field
The invention relates to the technical field of network security, in particular to a flow classification method and system based on federal semi-supervised learning.
Background
As an important task in the field of network management and security, traffic classification is an indispensable link in traffic control, planning, intrusion detection and traffic trend analysis. With the explosive growth of traffic, traffic classification, as a core management component, requires a more efficient, lower-consumption model. However, traffic classification based on port identification, payload matching or machine learning all relies on handcrafted statistical features, and obtaining statistical features such as average packet length, flow duration and average packet inter-arrival time requires observing the entire flow or a large portion of it; such methods therefore cannot accomplish accurate and efficient real-time traffic classification.
Compared with traffic classification methods that require manual feature extraction, classification methods based on deep learning integrate feature extraction and model training into a unified end-to-end model that automatically learns features from raw traffic and classifies it, and are therefore favored by researchers. Although deep learning avoids complicated feature-extraction operations, conventional deep-learning-based traffic classification models lack consideration of practical deployment, and the approach still faces several problems in real applications.
First, data islanding problem
In the field of network traffic classification, traffic data collected from user devices typically contains private information such as the user's network behavior. Users do not want this information disclosed, and legal regulations also forbid commercial companies from disclosing or sharing such user data, which causes the data island problem in industry: companies and organizations can only store and use their own internal data independently. Training on such severely homogenized data tends to produce an overfitted model, so the trained traffic classification model lacks generality. Moreover, the absence of user traffic from some parts of the network seriously limits the application of deep learning to network traffic classification and greatly reduces classification accuracy.
Second, the problem of data scarcity
At present, the mainstream deep-learning-based traffic classification methods still rely on supervised learning, which requires a large amount of labeled data to train a model. In reality, however, most captured traffic data is unlabeled, and because of the complexity of domain knowledge in computer networks, labeling traffic data requires expert experience; labeling all of it would consume enormous labor and time costs.
Third, transmission cost problem
Naively applying unsupervised learning and federal learning to the traffic classification task is infeasible: current network traffic is huge and the environment complex, so a traffic classification model must be trained and updated in real time. Transmitting classification models concurrently occupies considerable service bandwidth; if bandwidth allocation is insufficient, the result is network paralysis and slow upload and download speeds, while allocating too much bandwidth ties up a large amount of bandwidth resources at a very high economic cost.
Disclosure of Invention
In order to solve at least part of the problems of data islands, marked flow data scarcity, communication cost and the like existing in the real network flow classification task of deep learning, the invention provides a flow classification method and a flow classification system based on federal semi-supervised learning, which can obtain a flow classification model with high accuracy, wide applicability and capability of fully protecting user privacy.
In one aspect, the invention provides a flow classification method based on federal semi-supervised learning, which comprises the following steps:
step 1: the client captures the label-free network flow of the local gateway and performs data preprocessing on the label-free network flow to form a label-free flow data set; the central server carries out data preprocessing on the marked network traffic to form a tagged traffic data set;
step 2: the central server selects the flow classification model adopted as the global model, decomposes the global model into a supervised learning parameter and an unsupervised learning parameter, and initializes the two learning parameters; initializes the auxiliary agents; and sends the two initialized learning parameters and the auxiliary agents to each client;
step 3: the client performs unsupervised training by using a local label-free flow data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agents, updates the unsupervised learning parameters, obtains the unsupervised learning parameter difference before and after updating, and uploads the unsupervised learning parameter difference to the central server;
step 4: the central server aggregates and updates the unsupervised learning parameters of each client and obtains the unsupervised learning parameter difference before and after updating; carries out supervised training by using a local tagged flow data set, updates the supervised learning parameters, and obtains the supervised learning parameter difference before and after updating; then sends the supervised learning parameter difference and the unsupervised learning parameter difference to each client; obtains, based on nearest neighbor search, the H local unsupervised learning parameters most similar to the current unsupervised learning parameters as new auxiliary agents, and sends the new auxiliary agents to each client when the set sending condition is met;
step 5: iteratively execute step 3 to step 4 until a stop condition is met, and take the global model as the final flow classification model.
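The server-client loop in the steps above can be sketched in a few lines of NumPy. Everything here is illustrative: the gradient computations are toy stand-ins (a pull of ψ toward the helper consensus, a pull of σ toward ψ), not the patent's actual losses, and all function and variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(g):
    """Unit direction vector of a parameter update."""
    n = np.linalg.norm(g)
    return g / n if n > 0 else g

def client_unsupervised_step(sigma, psi, helpers, eta_u=0.1, lam_l2=0.01):
    """Step 3 (sketch): sigma stays frozen; psi takes one step of size
    eta_u along a toy consistency gradient. Returns only the difference,
    which is what the client uploads."""
    consensus = np.mean(helpers, axis=0)
    grad = (psi - consensus) + lam_l2 * (psi - sigma)
    new_psi = psi - eta_u * unit(grad)
    return new_psi - psi

def server_round(sigma, psi, client_deltas, eta_s=0.1):
    """Step 4 (sketch): aggregate the uploaded psi-differences, then take
    a supervised step on sigma with psi frozen (stubbed gradient)."""
    psi = psi + np.mean(client_deltas, axis=0)
    grad_s = sigma - psi            # placeholder supervised gradient
    sigma = sigma - eta_s * unit(grad_s)
    return sigma, psi

# toy run: 3 clients, 5 communication rounds (step 5 iterates 3-4)
sigma, psi = rng.normal(size=4), rng.normal(size=4)
helpers = [rng.normal(size=4) for _ in range(2)]
for r in range(5):
    deltas = [client_unsupervised_step(sigma, psi, helpers) for _ in range(3)]
    sigma, psi = server_round(sigma, psi, deltas)
```

Note that only parameter differences cross the network in this sketch, mirroring the transmission-cost argument of the patent.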
Further, in step 2, a Resnet9 network model is adopted as a traffic classification model.
Further, in step 3, the unsupervised training process of the client specifically includes:
freezing the supervised learning parameter σ, and performing unsupervised training using the local label-free traffic data set u to obtain a new model $\theta=\sigma^{*}+\psi$, namely:

$$\psi \leftarrow \psi - \eta_u\,\hat{g}_{\psi}$$

obtaining an updated unsupervised learning parameter ψ;

wherein the minimized consistency loss term in the unsupervised training process is shown in formula (1):

$$\mathcal{L}_u(\psi)=\lambda_{ICCS}\,\Phi(\sigma^{*}+\psi)+\lambda_{L1}\,\lVert\psi\rVert_{1}+\lambda_{L2}\,\lVert\sigma^{*}-\psi\rVert_{2}^{2}\qquad(1)$$

wherein * denotes a frozen parameter, $\psi_{h}^{*}$ represents the auxiliary agents (which enter through Φ), $\eta_u$ represents the movement step size of the parameter, $\hat{g}_{\psi}$ represents the unit direction vector of the parameter update, $\lambda_{L1}$ and $\lambda_{L2}$ are parameters set to prevent the unsupervised training from affecting the supervised learning parameters, $\lambda_{ICCS}$ represents a hyper-parameter controlling unsupervised learning, and Φ(·) is the consistency regularization of the local model with the auxiliary agents.
Further, formula (2) is used to represent Φ(·):

$$\Phi(u)=\frac{1}{H}\sum_{j=1}^{H}\mathrm{KL}\!\left(p_{\psi_{h_j}^{*}}(y\mid u)\,\big\|\,p_{\theta}(y\mid u)\right)+\mathrm{CE}\!\left(\hat{y},\,p_{\theta}\big(y\mid\pi(u)\big)\right)\qquad(2)$$

wherein $\psi_{h_j}^{*}$ is the j-th auxiliary agent, $\hat{y}=\mathrm{MAX}\!\left(\mathbb{1}\!\left(p_{\sigma^{*}+\psi}(y\mid u)+\sum_{j=1}^{H}p_{\psi_{h_j}^{*}}(y\mid u)\right)\right)$ is the pseudo label output by integrating the auxiliary agents, $\mathbb{1}(\cdot)$ denotes a label generated based on softmax, MAX(·) denotes outputting the label of the class with the maximum consistency, π(u) is a random augmentation operation performed on the label-free traffic data set u, and the KL term is the consistency loss among the auxiliary agents.
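A toy numeric reading of the consistency regularization Φ(·) in formula (2), assuming the softmax outputs are already available as probability vectors; the uniform helper weighting and all function names are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
    return float(np.sum(p * np.log(p / q)))

def inter_client_consistency(p_local, p_local_aug, helper_probs):
    """Sketch of Phi(): mean KL consistency with each auxiliary agent
    on u, plus cross-entropy against the integrated pseudo label on
    the augmented input pi(u)."""
    # consistency with the H auxiliary agents
    kl_term = np.mean([kl(ph, p_local) for ph in helper_probs])
    # pseudo label: class with maximum summed confidence across the
    # local model and the helpers (the MAX(1(...)) of the patent)
    summed = p_local + np.sum(helper_probs, axis=0)
    y_hat = int(np.argmax(summed))
    ce_term = -np.log(max(p_local_aug[y_hat], 1e-12))
    return kl_term + ce_term

# toy 3-class example
p_u     = np.array([0.7, 0.2, 0.1])   # local model on u
p_aug   = np.array([0.6, 0.3, 0.1])   # local model on augmented pi(u)
helpers = [np.array([0.8, 0.1, 0.1]), np.array([0.6, 0.3, 0.1])]
loss = inter_client_consistency(p_u, p_aug, helpers)
```

The loss shrinks as the local model's predictions on u and π(u) move toward the helper consensus, which is the training signal unlabeled traffic provides here.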
Further, in step 4, the supervised training process of the central server specifically includes:

performing supervised training using the local tagged traffic data set s to obtain a new model $\theta=\sigma+\psi^{*}$, namely:

$$\sigma \leftarrow \sigma - \eta_s\,\hat{g}_{\sigma}$$

simultaneously obtaining an updated supervised learning parameter σ;

wherein the minimized loss term in the supervised training process is shown in formula (3):

$$\mathcal{L}_s(\sigma)=\lambda_{s}\,\mathrm{CE}\!\left(y,\,p_{\sigma+\psi^{*}}(y\mid x)\right)\qquad(3)$$

wherein * denotes a frozen parameter, $\eta_s$ represents the movement step size of the parameter, $\hat{g}_{\sigma}$ represents the unit direction vector of the parameter update, and $\lambda_s$ is a hyper-parameter controlling supervised learning.
Further, in step 1, the data preprocessing includes: dividing, cleaning, length unification and visualization of the traffic data, performed in sequence, to obtain a traffic data image.
Further, the dividing of the traffic data specifically includes: splitting the original traffic in the Pcap file into different bidirectional sessions according to source IP, destination IP, source port, destination port and transport-layer protocol;
the cleaning of the flow data specifically comprises: deleting repeated data packets and null data packets, iterating data packets of all bidirectional sessions, and deleting information irrelevant to flow classification;
the uniform length of the flow data specifically includes: unifying the length of each session as a fixed byte, cutting off if the length of the session is greater than the fixed byte, and filling zero at the end of the session if the length of the session is less than the fixed byte; and/or, padding 0 at the end of the header of the UDP segment to make it equal to the length of the TCP header to make the transport layer segment uniform.
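The length-unification rules above can be sketched directly in a few lines; the fixed length of 784 bytes and the function names are illustrative assumptions (the claim only specifies "a fixed byte" length, plus the 8-byte UDP and 20-byte TCP header sizes).

```python
def unify_session_length(session: bytes, fixed_len: int = 784) -> bytes:
    """Truncate a session longer than fixed_len; zero-pad a shorter one
    at the end, so every session maps to a same-sized input."""
    if len(session) >= fixed_len:
        return session[:fixed_len]
    return session + b"\x00" * (fixed_len - len(session))

def pad_udp_header(segment: bytes) -> bytes:
    """Pad the 8-byte UDP header with zeros up to the 20-byte TCP header
    length, so transport-layer segments are uniform."""
    udp_header, payload = segment[:8], segment[8:]
    return udp_header + b"\x00" * 12 + payload

sample = unify_session_length(b"\x45\x00" * 10)   # 20-byte toy session
```

A fixed length of 784 = 28 x 28 bytes would, for instance, let each session be visualized as a square grayscale image.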
Further, for the r-th communication process, the central server aggregates the unsupervised learning parameters of the A clients based on model similarity, that is:

$$\psi_{G}^{(r)}=\sum_{a=1}^{A}\alpha_{a}\,\psi_{a}^{(r)},\qquad \sum_{a=1}^{A}\alpha_{a}=1$$

wherein $\psi_{a}^{(r)}$ represents the unsupervised learning parameter of client a during the r-th communication, and $\alpha_a$ is the aggregation weight of client a.
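"Aggregation based on model similarity" is not spelled out further in this text; one plausible reading, sketched here purely as an assumption, weights each client's ψ by its cosine similarity to the current global ψ before averaging.

```python
import numpy as np

def aggregate_by_similarity(psi_clients, psi_global):
    """Weight each client's unsupervised parameters by cosine similarity
    to the current global parameters, then take the weighted average.
    (Illustrative reading; the patent does not fix the weighting.)"""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.array([max(cos(p, psi_global), 0.0) for p in psi_clients])
    if sims.sum() == 0:                 # fall back to a plain average
        sims = np.ones(len(psi_clients))
    w = sims / sims.sum()
    return sum(wi * pi for wi, pi in zip(w, psi_clients))

psi_g = np.array([1.0, 0.0])
clients = [np.array([0.9, 0.1]), np.array([1.1, -0.1]), np.array([0.0, 1.0])]
psi_new = aggregate_by_similarity(clients, psi_g)
```

Dissimilar clients (here the third one, orthogonal to the global model) contribute little, which damps the effect of non-IID outliers.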
Further, in step 4, the sending condition is that the central server sends the new auxiliary agents to the clients once every fixed number of communication rounds.
In another aspect, the present invention provides a flow classification system based on federal semi-supervised learning, including:
the flow preprocessing module is respectively arranged on the client and the central server and is used for preprocessing data of the non-label network flow of the local gateway captured by the client to form a non-label flow data set and preprocessing the data of the marked network flow on the central server to form a label flow data set;
the server initialization module is arranged in the central server and used for selecting the flow classification model adopted by the global model, decomposing the global model into a supervised learning parameter and an unsupervised learning parameter and initializing the two learning parameters; and initializing a secondary agent; sending the two initialized learning parameters and the auxiliary agent to each client;
the client training module is arranged at the client and used for carrying out unsupervised training by utilizing a local label-free flow data set based on supervised learning parameters, unsupervised learning parameters and auxiliary agents, updating the unsupervised learning parameters, obtaining unsupervised learning parameter differences before and after updating, and uploading the unsupervised learning parameter differences to the central server;
the server retraining module is arranged in the central server and used for aggregating the unsupervised learning parameters of each client and obtaining the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled flow data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, transmitting the supervised learning parameter difference and the unsupervised learning parameter difference to each client; and obtaining H most similar local models as new auxiliary agents based on nearest neighbor search, and sending the new auxiliary agents to each client when set sending conditions are met.
The invention has the beneficial effects that:
1. the method trains the network flow classification model through the federal semi-supervised learning architecture, can assist multiple parties to jointly learn an accurate and universal neural network model without disclosing and sharing local user flow data of a client; each participant, namely the client can independently train on the own user data set, and only needs to selectively share the model parameters learned and trained by the local data set during the training period; the training mode assisting multi-party training and not needing to collect local data not only solves the problem of data islanding in the flow field, but also skillfully solves the problem of exposing user privacy data.
2. The invention carries out semi-supervised learning under the federal environment based on the classification model of the convolutional neural network, the semi-supervised learning uses a large amount of local unlabeled flow, and simultaneously uses a small amount of labeled data labeled by experts to train the classification model, thereby effectively solving the problem of high cost for labeling data in the actual network flow classification task.
3. The invention provides consistency loss among clients to perform semi-supervised learning, utilizes the clients among similar network segments as consistency disturbance sources in the semi-supervised learning, maximizes the utilization of consensus among the clients among the similar network segments and effectively accelerates the training speed.
4. According to the method, the federal semi-supervised learning is carried out based on the model updating strategy of parameter decomposition, the model is decomposed into unsupervised learning parameters and supervised learning parameters, and reliable knowledge from the marked data is stored, so that the situation that the knowledge learned from the marked data is forgotten when the unsupervised learning proportion of the model is too large is avoided, and the inter-task interference can be effectively prevented; and communication costs can be further reduced.
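The communication saving claimed above comes from transmitting only parameter differences rather than full models; a further, purely illustrative refinement (not claimed verbatim by the patent) is zeroing negligible entries of the difference before sending, which makes it sparse and cheaper to encode.

```python
import numpy as np

def sparse_delta(old, new, threshold=1e-3):
    """Transmit only the parameter difference, zeroing entries whose
    magnitude falls below a threshold. The thresholding is an
    illustrative compression, not part of the patent's claims."""
    delta = new - old
    delta[np.abs(delta) < threshold] = 0.0
    return delta

old = np.array([0.50, 0.20, -0.10, 0.00])
new = np.array([0.50, 0.25, -0.10, 0.40])
d = sparse_delta(old, new)
# the receiver reconstructs the updated parameters from its stored copy
reconstructed = old + d
```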
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2 is a schematic flow chart of a traffic classification method based on federal semi-supervised learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a flow classification system based on Federal semi-supervised learning according to an embodiment of the present invention;
fig. 4 is a second frame diagram of the flow classification system based on federal semi-supervised learning according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the embodiment of the present invention is mainly applied to a communication scenario of a central server and a plurality of clients in a federated environment. As shown in fig. 2, an embodiment of the present invention provides a traffic classification method based on federal semi-supervised learning, including the following steps:
s101: the client captures the label-free network traffic of the local gateway and performs data preprocessing on the label-free network traffic to form a label-free traffic data set; the central server carries out data preprocessing on the marked network traffic to form a tagged traffic data set;
s102: the central server selects a flow classification model adopted by a global model, decomposes the global model into a supervised learning parameter and an unsupervised learning parameter and initializes the two learning parameters; and initializing a secondary agent; sending the initialized two learning parameters and the auxiliary agent to each client;
s103: the client performs unsupervised training by using a local label-free flow data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agent, updates the unsupervised learning parameters, obtains unsupervised learning parameter differences before and after updating, and uploads the unsupervised learning parameter differences to the central server;
s104: the central server aggregates the unsupervised learning parameters of the clients and obtains the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled traffic data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, the supervised learning parameter difference and the unsupervised learning parameter difference are sent to each client; h local unsupervised learning parameters which are most similar to the current unsupervised learning parameters are obtained based on nearest neighbor search and serve as new auxiliary agents, and when the set sending conditions are met, the new auxiliary agents are sent to the clients;
s105: and (5) iteratively executing the step S103 to the step S104 until a stop condition is met, wherein the global model at the moment is used as a final flow classification model.
The flow classification method based on the federal semi-supervised learning provided by the embodiment of the invention can assist multiple parties to jointly learn an accurate and universal neural network model without disclosing and sharing local user flow data of a client; each participant, namely the client can independently train on the own user data set, and only needs to selectively share the model parameters learned and trained by the local data set during the training period; the training mode which assists in multi-party training and does not need to collect local data not only solves the problem of data island in the traffic field, but also skillfully solves the problem of exposing user privacy data;
the embodiment of the invention provides a model updating strategy based on parameter decomposition, and the updating strategy is utilized to carry out federal semi-supervised learning and store reliable knowledge from the marked data, so that on one hand, the condition that the knowledge learned from the marked data is forgotten when the unsupervised learning proportion of the model is too large can be avoided, and the inter-task interference can be effectively prevented; and on the other hand, the communication cost can be further reduced.
Example 2
In order to maximize the utilization of the consensus among the clients among the similar network segments, on the basis of the embodiment, the embodiment of the invention provides another traffic classification method based on the federal semi-supervised learning, and the embodiment of the invention mainly provides a specific training mode of an unsupervised training process and a supervised training process. The embodiment of the invention specifically comprises the following steps:
s201: the client captures local label-free data flow and carries out data preprocessing on the local label-free data flow so as to form a label-free flow data set u l L represents local; the central server carries out data preprocessing on the marked network flow to form a labeled data set s;
specifically, the label-free traffic data sets of the clients are distributed in a non-independent and identical manner; the tagged network data set s is tagged traffic tagged by an expert and comprises a plurality of data pairs (x, y), wherein x is a data stream and y is a tag corresponding to the data stream.
The data preprocessing comprises: dividing, cleaning, length unification and visualization of the traffic data, performed in sequence, to obtain traffic data images. The dividing of the traffic data specifically includes: splitting the original traffic in the Pcap file into different bidirectional sessions according to source IP, destination IP, source port, destination port and transport-layer protocol. The cleaning of the traffic data specifically includes: deleting repeated data packets and empty data packets, iterating over the data packets of all bidirectional sessions, and deleting information irrelevant to traffic classification (such as MAC addresses). The length unification of the traffic data specifically includes: unifying the length of each session to a fixed number of bytes, truncating a session longer than the fixed length, and zero-padding at the end of a session shorter than it. In addition, 0 is padded at the end of the header (8 bytes) of a UDP segment so that it equals the length (20 bytes) of a TCP header, making transport-layer segments uniform.
S202: the central server adopts the Resnet9 network model as the traffic classification model, records its parameters as the global model parameter $\theta_G$, decomposes the global model parameter $\theta_G$ into a supervised learning parameter σ and an unsupervised learning parameter ψ, and initializes each of them; the parameter ψ is subsequently trained locally at the clients, and the parameter σ is subsequently trained at the central server; the auxiliary agents are also initialized;

Specifically, the Resnet9 traffic classification model in this embodiment comprises a plurality of convolutional layers, pooling layers, residual connections and a Softmax output layer; the auxiliary agents are one of the key designs of the embodiment of the invention and are mainly used to maximize the utilization of the consensus among clients on similar network segments.
S203: the central server randomly selects A clients from all the clients to participate in the following model training task, and sends related training parameters to the selected A clients;
specifically, in consideration of the control of the training scale, the actual online condition of the client, and the like, in practical applications, not all the clients participate in the training task, and therefore, the central server selects the client that needs to participate in the training task.
In the first communication process, the parameters for training sent to each client by the central server include: initial supervised learning parameters σ and unsupervised learning parameters ψ, and an initial secondary agent.
In the second and subsequent communication processes, the parameters for training sent by the central server to each client mainly include: supervised learning parameter difference and unsupervised learning parameter difference; in addition to this, new secondary agents may be included, as the case may be.
S204: the client performs unsupervised training using the local label-free traffic data set based on the supervised learning parameters, unsupervised learning parameters and auxiliary agents, updates the unsupervised learning parameters, and obtains the unsupervised learning parameter difference before and after updating, $\Delta\psi=\psi^{(r)}-\psi^{(r-1)}$; the unsupervised learning parameter difference is then uploaded to the central server; wherein $\psi^{(r)}$ represents the updated unsupervised learning parameter and $\psi^{(r-1)}$ represents the unsupervised learning parameter before updating.
Specifically, in the first communication process (at this time, also called as the first round of training process), the client performs unsupervised training by using a local unlabelled traffic data set based on the initial supervised learning parameters, unsupervised learning parameters and the auxiliary agent, and updates the unsupervised learning parameters;
in the second and subsequent communication processes, the client needs to calculate new supervised learning parameters and unsupervised learning parameters based on the received supervised learning parameter difference and unsupervised learning parameter difference as well as the supervised learning parameters and the unsupervised learning parameters in the previous communication process which are locally stored; and then performing unsupervised training by using the local unlabeled traffic data set based on the new supervised learning parameters, unsupervised learning parameters and the auxiliary agent.
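The bookkeeping described above, where a client rebuilds its working parameters from the broadcast differences plus its stored previous-round copy, can be sketched as follows (class and variable names are illustrative):

```python
import numpy as np

class ClientState:
    """Keeps last round's parameters and applies server-sent differences."""
    def __init__(self, sigma, psi):
        self.sigma, self.psi = sigma, psi

    def apply_differences(self, delta_sigma, delta_psi):
        # new parameters = stored previous-round parameters + differences
        self.sigma = self.sigma + delta_sigma
        self.psi = self.psi + delta_psi
        return self.sigma, self.psi

c = ClientState(np.zeros(3), np.ones(3))
sigma, psi = c.apply_differences(np.array([0.1, 0.0, -0.1]),
                                 np.array([0.0, 0.5, 0.0]))
```

Because both sides keep the previous round's parameters, only the (typically small) differences ever need to cross the network.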
As an implementation manner, the unsupervised training process of the client specifically includes:

freezing the supervised learning parameter σ, and performing unsupervised training using the local label-free traffic data set u to obtain a new model $\theta=\sigma^{*}+\psi$, namely:

$$\psi \leftarrow \psi - \eta_u\,\hat{g}_{\psi}$$

obtaining an updated unsupervised learning parameter ψ;

wherein the minimized consistency loss term in the unsupervised training process is shown in formula (1):

$$\mathcal{L}_u(\psi)=\lambda_{ICCS}\,\Phi(\sigma^{*}+\psi)+\lambda_{L1}\,\lVert\psi\rVert_{1}+\lambda_{L2}\,\lVert\sigma^{*}-\psi\rVert_{2}^{2}\qquad(1)$$

wherein * denotes a frozen parameter, $\psi_{h}^{*}$ represents the auxiliary agents (which enter through Φ), $\eta_u$ represents the movement step size of the parameter, $\hat{g}_{\psi}$ represents the unit direction vector of the parameter update, $\lambda_{L1}$ and $\lambda_{L2}$ are parameters set to prevent the unsupervised training from affecting the supervised learning parameters, $\lambda_{ICCS}$ represents a hyper-parameter controlling unsupervised learning, and Φ(·) is the consistency regularization of the local model with the auxiliary agents.
Further, in the present embodiment, Φ(·) is expressed by formula (2):

$$\Phi(u)=\frac{1}{H}\sum_{j=1}^{H}\mathrm{KL}\!\left(p_{\psi_{h_j}^{*}}(y\mid u)\,\big\|\,p_{\theta}(y\mid u)\right)+\mathrm{CE}\!\left(\hat{y},\,p_{\theta}\big(y\mid\pi(u)\big)\right)\qquad(2)$$

wherein $\psi_{h_j}^{*}$ is the j-th auxiliary agent, $\hat{y}=\mathrm{MAX}\!\left(\mathbb{1}\!\left(p_{\sigma^{*}+\psi}(y\mid u)+\sum_{j=1}^{H}p_{\psi_{h_j}^{*}}(y\mid u)\right)\right)$ is the pseudo label output by integrating the auxiliary agents, $\mathbb{1}(\cdot)$ denotes a label generated based on softmax, MAX(·) denotes outputting the label of the class with the maximum consistency, π(u) is a random augmentation operation performed on the label-free traffic data set u, and the KL term is the consistency loss among the auxiliary agents.
S205: the central server aggregates and updates the unsupervised learning parameters of the A clients and obtains the unsupervised learning parameter difference before and after updating; performs supervised training using the local labeled traffic data set, updates the supervised learning parameters, and obtains the supervised learning parameter difference before and after updating; then sends the supervised learning parameter difference σ^{r+1} − σ^{r} and the unsupervised learning parameter difference ψ^{r+1} − ψ^{r} to each client; wherein σ^{r+1} denotes the updated supervised learning parameter, σ^{r} denotes the supervised learning parameter before updating, ψ^{r+1} denotes the updated unsupervised learning parameter, and ψ^{r} denotes the unsupervised learning parameter before updating.
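The difference-based exchange in S205 can be sketched as follows (the NumPy values are illustrative): only the parameter difference crosses the network, and the receiver restores the updated parameter by adding the difference to the copy it stored in the previous round.

```python
import numpy as np

# Sketch of difference-based communication: only deltas are transmitted.
psi_prev = np.array([0.5, -0.2, 1.0])   # parameter stored from round r
psi_new = np.array([0.6, -0.1, 0.9])    # parameter after updating (round r+1)
delta_psi = psi_new - psi_prev          # what is actually sent over the network

# receiver side: stored previous copy + received difference = updated parameter
psi_restored = psi_prev + delta_psi
print(psi_restored)  # identical to psi_new
```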
Specifically, the central server may obtain a new unsupervised learning parameter based on the received unsupervised learning parameter differences and the unsupervised learning parameters stored locally from the previous communication round;
as an implementation manner, for the r-th communication round, the central server aggregates the unsupervised learning parameters of the A clients based on model similarity, namely:

ψ^{r+1} = Σ_{a=1}^{A} α_a·ψ_a^{r},  with Σ_{a=1}^{A} α_a = 1

wherein ψ_a^{r} denotes the unsupervised learning parameter of client a during the r-th communication, and α_a denotes the similarity-based aggregation weight of client a.
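A minimal sketch of similarity-based aggregation; the cosine-similarity weighting used here is an assumption, since the disclosure only states that aggregation is based on model similarity.

```python
import numpy as np

def aggregate_unsupervised(client_psis, global_psi):
    """Similarity-weighted aggregation of the clients' unsupervised
    parameters psi_a^r. Hypothetical scheme: cosine similarity of each
    client parameter to the current global unsupervised parameter,
    normalized so the weights sum to 1."""
    sims = []
    for psi in client_psis:
        num = float(np.dot(psi, global_psi))
        den = float(np.linalg.norm(psi) * np.linalg.norm(global_psi)) or 1.0
        sims.append(max(num / den, 0.0) + 1e-12)  # keep weights positive
    weights = np.array(sims) / np.sum(sims)
    return sum(w * psi for w, psi in zip(weights, client_psis))

# two identical clients aggregate back to themselves
psis = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
new_psi = aggregate_unsupervised(psis, global_psi=np.array([1.0, 0.0]))
print(new_psi)  # [1. 0.]
```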
As an implementation manner, the supervised training process of the central server specifically includes:
performing supervised training using the local labeled traffic data set s to obtain a new model θ_{σ+ψ*}, namely:

σ ← σ − η_s·∇_σ L_s(σ)/‖∇_σ L_s(σ)‖

simultaneously obtaining an updated supervised learning parameter σ;

the minimized loss term in the supervised training process is shown in formula (3):

minimize L_s(σ) = λ_s·CrossEntropy(y, p_{σ+ψ*}(y|x))   (3)

wherein * denotes a frozen parameter, η_s denotes the step size of the parameter movement, ∇_σ L_s(σ)/‖∇_σ L_s(σ)‖ denotes the unit direction vector of the parameter update, and λ_s is a hyper-parameter for controlling the supervised learning.
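For illustration, one step of the server-side supervised update of formula (3) on a toy linear model: only σ is updated, ψ* stays frozen, and the gradient is scaled to a unit direction vector with step size η_s. The linear classifier itself is an assumption made for the sketch, not the Resnet9 model of the embodiment.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def supervised_step(sigma, psi_frozen, x, y, eta_s=0.1, lam_s=1.0):
    """One sketch step of formula (3): cross-entropy of the full model
    theta = sigma + psi* on a labeled sample (x, y); only sigma receives
    the update, scaled to a unit direction vector of length eta_s."""
    theta = sigma + psi_frozen                # psi* stays frozen
    p = softmax(theta @ x)
    grad_logits = lam_s * (p - y)             # d CrossEntropy / d logits
    grad_sigma = np.outer(grad_logits, x)     # gradient w.r.t. sigma only
    norm = float(np.linalg.norm(grad_sigma)) or 1.0
    return sigma - eta_s * grad_sigma / norm  # unit-direction update

sigma = np.zeros((2, 3)); psi = np.zeros((2, 3))
x = np.array([1.0, 0.0, 0.0]); y = np.array([1.0, 0.0])
new_sigma = supervised_step(sigma, psi, x, y)
# after one step the model assigns class 0 a probability above 0.5
```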
S206: it is preset that the server sends new auxiliary agents to the clients every 10 communication rounds; the central server judges whether the current communication round number r is a multiple of 10, and if so, obtains the H most similar local models as new auxiliary agents based on nearest neighbor search and sends the new auxiliary agents θ_1*, …, θ_H* to each client.
S207: steps S203 to S206 are iterated multiple times, the central server performing aggregation and updating repeatedly until the global model converges and iteration stops, finally obtaining the parameter θ_G of the global model.
Example 3
Corresponding to the above method, as shown in fig. 3 and 4, an embodiment of the present invention further provides a traffic classification system based on federal semi-supervised learning, including a traffic preprocessing module, a server initialization module, a client training module, and a server retraining module;
The traffic preprocessing module is respectively arranged on the clients and the central server, and is used for performing data preprocessing on the unlabeled network traffic of the local gateway captured by the clients to form an unlabeled traffic data set, and performing data preprocessing on the labeled network traffic on the central server to form a labeled traffic data set.

The server initialization module is arranged in the central server and is used for selecting the traffic classification model adopted by the global model, decomposing the global model into a supervised learning parameter and an unsupervised learning parameter, and initializing the two learning parameters; initializing the auxiliary agents; and sending the two initialized learning parameters and the auxiliary agents to each client.

The client training module is arranged at the clients and is used for performing unsupervised training using the local unlabeled traffic data set based on the supervised learning parameters, the unsupervised learning parameters, and the auxiliary agents, updating the unsupervised learning parameters, obtaining the unsupervised learning parameter differences before and after updating, and then uploading the unsupervised learning parameter differences to the central server.
The server retraining module is arranged in the central server and used for aggregating the unsupervised learning parameters of each client and obtaining the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled flow data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, the supervised learning parameter difference and the unsupervised learning parameter difference are sent to each client; and obtaining H most similar local models as new auxiliary agents based on nearest neighbor search, and sending the new auxiliary agents to each client when set sending conditions are met.
It should be noted that the system provided in the embodiment of the present invention is for implementing the method embodiment, and specific reference may be made to the method embodiment for functions of the system, which are not described herein again.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. The traffic classification method based on the federal semi-supervised learning is characterized by comprising the following steps of:
step 1: the client captures the label-free network traffic of the local gateway and performs data preprocessing on the label-free network traffic to form a label-free traffic data set; the central server carries out data preprocessing on the marked network traffic to form a tagged traffic data set;
step 2: the central server selects a flow classification model adopted by a global model, decomposes the global model into a supervised learning parameter and an unsupervised learning parameter and initializes the two learning parameters; and initializing a secondary agent; sending the initialized two learning parameters and the auxiliary agent to each client;
step 3: the client performs unsupervised training by using a local label-free flow data set based on the supervised learning parameters, the unsupervised learning parameters and the auxiliary agent, updates the unsupervised learning parameters, obtains unsupervised learning parameter differences before and after updating, and uploads the unsupervised learning parameter differences to the central server;
step 4: the central server aggregates and updates the unsupervised learning parameters of the client sides, and obtains unsupervised learning parameter differences before and after updating; carrying out supervised training by using a local labeled flow data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, transmitting the supervised learning parameter difference and the unsupervised learning parameter difference to each client; H local unsupervised learning parameters which are most similar to the current unsupervised learning parameters are obtained based on nearest neighbor search and serve as new auxiliary agents, and when the set sending conditions are met, the new auxiliary agents are sent to the clients;
step 5: iteratively executing step 3 to step 4 until a stop condition is met, and taking the global model as the final traffic classification model.
2. The method for traffic classification based on federal semi-supervised learning of claim 1, wherein in step 2, a Resnet9 network model is used as the traffic classification model.
3. The traffic classification method based on federal semi-supervised learning as claimed in claim 1, wherein in step 3, the unsupervised training process of the client specifically comprises:
freezing the supervised learning parameter σ and performing unsupervised training using the local unlabeled traffic data set u to obtain a new model θ_{σ*+ψ}, namely:

ψ ← ψ − η_u·∇_ψ L_u(ψ)/‖∇_ψ L_u(ψ)‖

obtaining an updated unsupervised learning parameter ψ;

wherein the minimized consistency loss term in the unsupervised training process is shown in formula (1):

minimize L_u(ψ) = λ_ICCS·Φ(σ*, ψ, u) + λ_L2·‖σ* − ψ‖₂² + λ_L1·‖ψ‖₁   (1)

wherein * denotes a frozen parameter, θ_h* (h = 1, …, H) denote the auxiliary agents, η_u denotes the step size of the parameter movement, ∇_ψ L_u(ψ)/‖∇_ψ L_u(ψ)‖ denotes the unit direction vector of the parameter update, λ_L2·‖σ* − ψ‖₂² and λ_L1·‖ψ‖₁ are terms set to prevent the unsupervised training from affecting the supervised learning parameters, λ_ICCS denotes a hyper-parameter for controlling the unsupervised learning, and Φ(·) is the consistency regularization of the local model with the auxiliary agents.
4. The traffic classification method based on federal semi-supervised learning of claim 3, wherein Φ(·) is expressed by formula (2):

Φ(σ*, ψ, u) = (1/H)·Σ_{h=1}^{H} KL(p_{θ_h*}(y|u) ‖ p_{σ*+ψ}(y|u)) + CrossEntropy(ŷ, p_{σ*+ψ}(y|π(u)))   (2)

wherein θ_h* (h = 1, …, H) are the auxiliary agents; ŷ = MAX(1(p_{σ*+ψ}(y|u)) + Σ_{h=1}^{H} 1(p_{θ_h*}(y|u))) is the pseudo label output by the ensemble of auxiliary agents; 1(·) denotes a one-hot label generated based on softmax; MAX(·) denotes the output label of the class with the maximum consistency; π(u) is a random enhancement operation performed on the unlabeled traffic data set u; and the KL divergence term is the consistency loss between the local model and the auxiliary agents.
5. The traffic classification method based on federal semi-supervised learning as claimed in claim 1, wherein in the step 4, the supervised training process of the central server specifically comprises:

performing supervised training using the local labeled traffic data set s to obtain a new model θ_{σ+ψ*}, namely:

σ ← σ − η_s·∇_σ L_s(σ)/‖∇_σ L_s(σ)‖

simultaneously obtaining an updated supervised learning parameter σ;

the minimized loss term in the supervised training process is shown in formula (3):

minimize L_s(σ) = λ_s·CrossEntropy(y, p_{σ+ψ*}(y|x))   (3)

wherein * denotes a frozen parameter, η_s denotes the step size of the parameter movement, ∇_σ L_s(σ)/‖∇_σ L_s(σ)‖ denotes the unit direction vector of the parameter update, and λ_s is a hyper-parameter for controlling the supervised learning.
6. The traffic classification method based on federal semi-supervised learning as in claim 1, wherein in the step 1, the data preprocessing comprises: and the flow data is divided, cleaned, unified in length and visualized in sequence to obtain a flow data image.
7. The method for traffic classification based on federal semi-supervised learning of claim 6, wherein the dividing of the traffic data specifically comprises: dividing the original traffic in the Pcap file into different bidirectional sessions according to the source IP, destination IP, source port, destination port and transport layer protocol;
the cleaning of the flow data specifically comprises the following steps: deleting repeated data packets and null data packets, iterating data packets of all bidirectional sessions, and deleting information irrelevant to flow classification;
the unified length of the traffic data specifically includes: unifying the length of each session to a fixed number of bytes, truncating the session if its length is greater than the fixed length, and padding zeros at the end of the session if its length is less than the fixed length; and/or padding 0 at the end of the UDP segment header to make it equal in length to the TCP header, so that transport layer segments are uniform.
8. The method for traffic classification based on federal semi-supervised learning of claim 1, wherein for the r-th communication process, the central server aggregates unsupervised learning parameters of a clients based on model similarity, that is:
ψ^{r+1} = Σ_{a=1}^{A} α_a·ψ_a^{r},  with Σ_{a=1}^{A} α_a = 1

wherein ψ_a^{r} denotes the unsupervised learning parameter of client a during the r-th communication, and α_a denotes the similarity-based aggregation weight of client a.
9. The traffic classification method based on federal semi-supervised learning of claim 1, wherein in the step 4, the sending condition is that the central server sends the new auxiliary agents to the clients every fixed number of communication rounds.
10. Flow classification system based on federal semi-supervised learning is characterized by comprising:
the flow preprocessing module is respectively arranged on the client and the central server and is used for preprocessing data of the unlabeled network flow of the local gateway captured by the client to form an unlabeled flow data set and preprocessing the data of the labeled network flow on the central server to form a labeled flow data set;
the server initialization module is arranged in the central server and used for selecting the flow classification model adopted by the global model, decomposing the global model into a supervised learning parameter and an unsupervised learning parameter and initializing the two learning parameters; and initializing a secondary agent; sending the two initialized learning parameters and the auxiliary agent to each client;
the client training module is arranged at the client and used for performing unsupervised training by utilizing a local label-free flow data set based on supervised learning parameters, unsupervised learning parameters and auxiliary agents, updating the unsupervised learning parameters, obtaining unsupervised learning parameter differences before and after updating, and then uploading the unsupervised learning parameter differences to the central server;
the server retraining module is arranged in the central server and used for aggregating the unsupervised learning parameters of each client and obtaining the unsupervised learning parameter difference before and after aggregation; carrying out supervised training by using a local labeled traffic data set, updating supervised learning parameters, and obtaining the difference of the supervised learning parameters before and after updating; then, the supervised learning parameter difference and the unsupervised learning parameter difference are sent to each client; and obtaining H most similar local models based on nearest neighbor search to serve as new auxiliary agents, and sending the new auxiliary agents to the clients when set sending conditions are met.
CN202211123213.4A 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning Pending CN115563532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211123213.4A CN115563532A (en) 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211123213.4A CN115563532A (en) 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning

Publications (1)

Publication Number Publication Date
CN115563532A true CN115563532A (en) 2023-01-03

Family

ID=84740627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211123213.4A Pending CN115563532A (en) 2022-09-15 2022-09-15 Flow classification method and system based on federal semi-supervised learning

Country Status (1)

Country Link
CN (1) CN115563532A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108491A (en) * 2023-04-04 2023-05-12 杭州海康威视数字技术股份有限公司 Data leakage early warning method, device and system based on semi-supervised federal learning
CN116108491B (en) * 2023-04-04 2024-03-22 杭州海康威视数字技术股份有限公司 Data leakage early warning method, device and system based on semi-supervised federal learning

Similar Documents

Publication Publication Date Title
CN113705712B (en) Network traffic classification method and system based on federal semi-supervised learning
Shi et al. From semantic communication to semantic-aware networking: Model, architecture, and open problems
US10163420B2 (en) System, apparatus and methods for adaptive data transport and optimization of application execution
CN113435472A (en) Vehicle-mounted computing power network user demand prediction method, system, device and medium
CN115102763B (en) Multi-domain DDoS attack detection method and device based on trusted federal learning
CN114710330B (en) Anomaly detection method based on heterogeneous layered federated learning
Vinayakumar et al. Secure shell (ssh) traffic analysis with flow based features using shallow and deep networks
CN115563532A (en) Flow classification method and system based on federal semi-supervised learning
CN116862012A (en) Machine learning model training method, business data processing method, device and system
Zhang et al. Optimization of image transmission in cooperative semantic communication networks
CN110365659B (en) Construction method of network intrusion detection data set in small sample scene
CN115359298A (en) Sparse neural network-based federal meta-learning image classification method
CN116187469A (en) Client member reasoning attack method based on federal distillation learning framework
CN108737491A (en) Information-pushing method and device and storage medium, electronic device
Gou et al. Clustered hierarchical distributed federated learning
Lin et al. Federated learning with dynamic aggregation based on connection density at satellites and ground stations
Han et al. An effective encrypted traffic classification method based on pruning convolutional neural networks for cloud platform
CN114070775A (en) Block chain network slice safety intelligent optimization method facing 5G intelligent network connection system
CN116992336B (en) Bearing fault diagnosis method based on federal local migration learning
CN117633657A (en) Method, device, processor and computer readable storage medium for realizing encryption application flow identification processing based on multi-graph characterization enhancement
Abbasi et al. FLITC: A novel federated learning-based method for IoT traffic classification
CN112215326A (en) Distributed AI system
Zhang et al. Encrypted network traffic classification: A data driven approach
Mertens et al. i-WSN League: Clustered Distributed Learning in Wireless Sensor Networks
Li et al. Knowledge-Assisted Few-Shot Fault Diagnosis in Cellular Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination