WO2023024844A1 - Model training method, device, and system - Google Patents


Info

Publication number
WO2023024844A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
training
node
parameter update
update information
Prior art date
Application number
PCT/CN2022/109525
Other languages
English (en)
French (fr)
Inventor
赵礼菁
胡翔
冯张潇
翁昕
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023024844A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present application relate to the field of communication technologies, and in particular, to a model training method, device, and system.
  • Service awareness (SA) can be made intelligent and automated based on an artificial intelligence (AI) model.
  • SA based on an AI model must be trained on a large number of samples before it can perform recognition. Transferring the local sample data of each node to other nodes does not meet data privacy protection requirements, while training with too few samples leads to a low recognition rate for service content.
  • Embodiments of the present application provide a model training method, device, and system to improve the recognition rate of an AI model as much as possible on the premise of satisfying data privacy protection requirements.
  • the embodiment of the present application provides a model training method, which can be executed by a central node.
  • The method includes: the central node sends model training information to at least two first nodes, where the model training information includes an artificial intelligence (AI) model and model training configuration information, and the AI model is used to identify the category to which a data flow belongs; the central node receives at least two pieces of first model parameter update information from the at least two first nodes, where each piece of first model parameter update information is the model parameter update information obtained after the corresponding first node trains the AI model based on its local data and the model training configuration information; and the central node sends second model parameter update information to the first nodes of the at least two first nodes, where the second model parameter update information is obtained based on the at least two pieces of first model parameter update information and is used to update the model parameters of the AI model of each first node.
  • Because the first model parameter update information is obtained by training at the first node using only local data and the model training configuration information, the data privacy protection requirements are met; and because the second model parameter update information is obtained based on the at least two pieces of first model parameter update information from the at least two first nodes, an AI model updated using the second model parameter update information achieves a higher recognition rate. The recognition rate of the AI model can therefore be improved as much as possible.
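  • As an illustrative aid (not part of the claimed method), the aggregation of the first model parameter update information into the second model parameter update information can be sketched as a sample-weighted average in the style of federated averaging; the function name and flat parameter lists below are assumptions made for the sketch:

```python
def aggregate_updates(first_updates, sample_counts):
    """Combine per-node model parameter updates into one second update
    by sample-weighted averaging (the FedAvg-style scheme assumed here)."""
    total = sum(sample_counts)
    n_params = len(first_updates[0])
    return [
        sum(update[i] * count for update, count in zip(first_updates, sample_counts)) / total
        for i in range(n_params)
    ]

# Two first nodes report updates for a two-parameter model.
node_a = [0.2, -0.4]   # trained on 100 local samples
node_b = [0.6,  0.0]   # trained on 300 local samples
second_update = aggregate_updates([node_a, node_b], [100, 300])
```

Only the updates and their sample counts reach the central node; the local data itself never does, which is the privacy property the design relies on.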
  • The category to which the data flow belongs includes at least one of the following: the application to which the data flow belongs; the type or protocol to which the service content of the data flow belongs; or the packet feature rules of the data flow.
  • The AI model can be used to identify the application to which the data stream belongs. Because this identification is performed by the AI model, no manual participation is required, and the ability to identify the application to which the data stream belongs is also improved.
  • The AI model can be used to identify the type or protocol to which the business content of the data stream belongs. Because this identification is performed by the AI model without manual participation, the ability to identify the type or protocol of the data stream is also improved.
  • The AI model can be used to identify the packet feature rules of the data flow. Because this identification is performed by the AI model, no manual participation is required, and the ability to identify the packet feature rules of the data flow is also improved.
  • After the central node receives the at least two pieces of first model parameter update information from the at least two first nodes, the method further includes: the central node sends the second model parameter update information and the AI model to a second node, where the second model parameter update information is used to update the model parameters of the AI model of the second node.
  • The model parameters of the AI model of the second node are updated using the second model parameter update information, which is obtained from the at least two pieces of first model parameter update information produced after the at least two first nodes train the AI model with local data and the model training configuration information. The AI model of the second node thus combines the updates of the at least two first nodes, so its recognition rate can be improved as much as possible.
  • The model training configuration information further includes a training result accuracy threshold. The training result accuracy threshold indicates the accuracy to be reached when the first node trains the AI model based on local data and the model training configuration information; once the threshold is reached, model training can be stopped. In this way, the central node can control the training result accuracy of the AI model of the first node.
  • the embodiment of the present application provides a model training method, which may be executed by a first node.
  • The method includes: the first node receives a model training message, where the model training message includes an AI model and model training configuration information, and the AI model is used to identify the category to which a data stream belongs; the first node sends first model parameter update information, where the first model parameter update information is the model parameter update information obtained after training the AI model according to the local data of the first node and the model training configuration information; and the first node receives second model parameter update information, where the second model parameter update information is obtained according to at least two pieces of first model parameter update information of at least two first nodes and is used to update the model parameters of the AI model of the first node.
  • Because the first node trains the AI model using local data and the model training configuration information, the data privacy protection requirements are met; and because the second model parameter update information is obtained based on at least two pieces of first model parameter update information, an AI model updated using the second model parameter update information achieves a higher recognition rate, so the recognition rate of the AI model can be improved as much as possible.
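  • As a toy sketch of the first-node side (a stand-in, not the patent's AI model), local training on local data can produce a parameter update without the data itself ever leaving the node; the linear model, learning rate, and epoch count below are illustrative assumptions:

```python
def local_train(weights, local_data, lr=0.1, epochs=20):
    """Train a toy 1-D linear model y = w*x on local data only and
    return the parameter update (delta), not the raw data."""
    w = weights[0]
    for _ in range(epochs):
        # mean-squared-error gradient over the local samples
        grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
        w -= lr * grad
    return [w - weights[0]]  # first model parameter update information

# Local samples lie on y = 2x, so the learned delta approaches 2.
update = local_train([0.0], [(1.0, 2.0), (2.0, 4.0)])
```

The returned delta is what would be uploaded (possibly compressed and encrypted) to the central node.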
  • The category to which the data flow belongs includes at least one of the following: the application to which the data flow belongs; the type or protocol to which the service content of the data flow belongs; or the packet feature rules of the data flow.
  • The AI model can be used to identify the application to which the data stream belongs. Because this identification is performed by the AI model, no manual participation is required, and the ability to identify the application to which the data stream belongs is also improved.
  • The AI model can be used to identify the type or protocol to which the business content of the data stream belongs. Because this identification is performed by the AI model without manual participation, the ability to identify the type or protocol of the data stream is also improved.
  • The AI model can be used to identify the packet feature rules of the data flow. Because this identification is performed by the AI model, no manual participation is required, and the ability to identify the packet feature rules of the data flow is also improved.
  • The method further includes: the first node determines to train, based on local data, the AI model that has been updated using the second model parameter update information.
  • The first node uses local data to further train the AI model updated with the second model parameter update information, so that the finally trained AI model has a higher recognition rate on local data.
  • The method is executed by an application (APP) deployed on a cloud platform or an edge computing platform.
  • An APP deployed on the cloud platform or edge computing platform can execute the method performed by the above first node, which decouples the APP from the cloud platform or edge computing platform and minimizes changes to the existing cloud platform or edge computing platform.
  • The method further includes: receiving the first model parameter update information from the first node through the server module of the APP; and the sending of the first model parameter update information includes: sending the first model parameter update information through the client module of the APP.
  • A server module and a client module are provided in the APP: communication with the first node is realized through the server module, and communication with the outside through the client module, giving full play to the APP's information transmission function.
  • The first model parameter update information is obtained after the model parameters of the trained AI model are successfully verified according to the model training configuration information.
  • In the above design, the model parameters of the trained AI model are verified before the first model parameter update information is sent. This ensures that the model parameters of the AI model are consistent before and after training, and avoids, as far as possible, inconsistent model parameters affecting the training effect.
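  • The verification step can be sketched as a structural consistency check: the trained model must still expose the same parameter names and shapes as the model that was delivered. The dictionary-of-lists layout below is an assumption for illustration, not the format used by the embodiment:

```python
def verify_model_parameters(reference, trained):
    """Check that the trained model still has exactly the parameter
    names and shapes of the reference (pre-training) model, so that
    its update can be aggregated with updates from other nodes."""
    if reference.keys() != trained.keys():
        return False
    return all(len(trained[name]) == len(reference[name]) for name in reference)

reference = {"conv1.weight": [0.0] * 8, "fc.weight": [0.0] * 4}
trained   = {"conv1.weight": [0.1] * 8, "fc.weight": [0.2] * 4}
```

A node whose trained model fails this check would skip the upload rather than corrupt the aggregation.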
  • The model training configuration information further includes a training result accuracy threshold. The training result accuracy threshold indicates the accuracy to be reached when the first node trains the AI model based on local data and the model training configuration information; once the threshold is reached, model training can be stopped. In this way, the central node can control the training result accuracy of the AI model of the first node.
  • The method further includes: the first node obtains, using the AI model updated based on the second model parameter update information, the identification result of the data stream and the packet feature rules, and updates the service awareness (SA) feature library according to the packet feature rules.
  • Updating the SA feature library with the packet feature rules identified by the AI model obtained through model training improves the recognition rate of the SA feature library.
  • the embodiment of the present application provides a model training device, and the model training device includes various modules for implementing the first aspect or any possible design of the first aspect.
  • the model training device includes various modules for implementing the second aspect or any possible design of the second aspect.
  • the embodiment of the present application provides a model training device, which includes a processor and a memory.
  • The memory is used to store computer-executable instructions. When running, the processor executes the computer-executable instructions in the memory and uses the hardware resources in the controller to perform the operational steps of the method in any possible design of any one of the first aspect to the second aspect.
  • the embodiment of the present application provides a model training system, including the model training device provided in the third aspect or the fourth aspect.
  • The present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, they cause the computer to execute the methods in the above aspects.
  • the present application provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute the methods in the above aspects.
  • FIG. 1 is a schematic diagram of a federated learning architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flow chart of another model training method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the architecture of a model training system provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the distribution of sample data of each first node in Scenario 1.
  • FIG. 7 is a schematic diagram of the recognition accuracy of each first node model and the federated model in Scenario 1.
  • FIG. 8 is a schematic diagram of the distribution of sample data of each first node in Scenario 2.
  • FIG. 9 is a schematic diagram of the recognition accuracy of each first node model and the federated model in Scenario 2.
  • FIG. 10 is a schematic diagram of the recall rate of small-sample applications at each first node in Scenario 2 and the recall rate after federated learning.
  • FIG. 11 is a schematic diagram of the distribution of sample data of each first node in Scenario 3.
  • FIG. 12 is a schematic diagram of the recognition accuracy of each first node model and the federated model in Scenario 3.
  • FIG. 13 is a schematic diagram of the recall rate of zero-sample applications at each first node after federated learning in Scenario 3.
  • FIG. 14 is a schematic diagram of the recognition accuracy of the federated model after fine-tuning and after retraining from initialization in Scenario 4.
  • FIG. 15 is a schematic diagram of the recognition accuracy for a new application of the federated model after fine-tuning and after retraining from initialization in Scenario 4.
  • FIG. 16 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of another model training device provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of another model training device provided by an embodiment of the present application.
  • The main process of traffic identification based on an AI algorithm includes: a training phase, in which traffic is captured, preprocessed, and used as input to the AI model; and a testing phase, in which preprocessed traffic is sent to the AI model for classification, and the traffic category with the highest probability in the classifier output is taken as the final traffic prediction result.
  • For model training, a large amount of labeled data must first be used.
  • The generalization ability of the model is also related to the amount of training data; moreover, the training data must be protected for privacy, so it is limited to the local node and cannot be used externally.
  • AI model training and inference consume considerable processor capacity. For example, if a single node faces high performance requirements and an overly long training time, rapid iteration of the model is affected and its recognition capability cannot be updated quickly.
  • APP traffic has regional distribution characteristics: even the same APP has different traffic characteristics in different regions. For example, when traffic characteristics change rapidly, the AI model of node A may gain the ability to identify a new APP, but this ability cannot be passed to node B.
  • Node data cannot be exported for model training, and data collected by means of dial testing cannot meet the training requirements.
  • the embodiments of the present application provide a model training method, device and system, which are used to improve the recognition rate of the AI model as much as possible on the premise of meeting the privacy protection requirements of the data.
  • FIG. 1 is a schematic diagram of a federated learning architecture provided by an embodiment of the present application. For ease of understanding, the scenario and process of federated learning are illustrated in conjunction with Figure 1.
  • Federated learning is an encrypted distributed machine learning technology in which all participating parties jointly build an AI model without sharing their local data. Its core idea: each participant trains the AI model locally and then encrypts and uploads only the updated part of the model to the coordinating node, where it is aggregated and integrated with the updates of the other participants to form a federated learning model. The model is then sent from the cloud to each participant, and through repeated local training and repeated integration, a better AI model is finally obtained.
  • the federated learning scenario can include a coordinator node and multiple participant nodes.
  • the coordinator node is the coordinator in the federated learning process.
  • the coordinator node can be deployed on the cloud.
  • the participant node is the participant in the federated learning process and also the owner of the dataset.
  • the coordinator node is called the central node (such as marked as 110 in FIG. 1 ), and the participant node is called the first node (such as marked as 120 and 121 in FIG. 1 ).
  • the central node 110 and the first nodes 120, 121 may be any nodes (such as network nodes) that support data transmission.
  • the central node may be a server, or a parameter server, or an aggregation server.
  • The first node may be a client, such as a mobile terminal or a personal computer.
  • the central node 110 can be used to maintain the federated learning model.
  • The first nodes 120 and 121 can obtain the federated learning model from the central node 110 and perform local training in combination with their local training data sets to obtain local models. After a local model is obtained through training, the first nodes 120 and 121 can send it to the central node 110, so that the central node 110 can update or optimize the federated learning model. This repeats for several rounds of iterations until the federated learning model converges or a preset iteration stop condition is reached (such as reaching the maximum number of rounds or the longest training time).
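  • The iterate-until-converged-or-stopped loop can be sketched as follows; the round callback, tolerance, and limits are illustrative assumptions, not values prescribed by the embodiment:

```python
import time

def run_federated_rounds(train_round, max_rounds=100, max_seconds=3600, tol=1e-4):
    """Repeat local-train / aggregate rounds until the federated model
    converges or a preset stop condition is reached (round or time limit)."""
    start, previous = time.monotonic(), None
    for round_no in range(1, max_rounds + 1):
        metric = train_round()  # one round: distribute, train locally, aggregate
        converged = previous is not None and abs(metric - previous) < tol
        if converged or time.monotonic() - start > max_seconds:
            return round_no, metric
        previous = metric
    return max_rounds, metric

# Demo callback: a loss that halves each round stands in for a real round.
state = {"loss": 1.0}
def demo_round():
    state["loss"] *= 0.5
    return state["loss"]

rounds, final_loss = run_federated_rounds(demo_round)
```

In a real deployment `train_round` would perform the distribute/train/aggregate exchange described above and return a convergence metric for the federated model.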
  • the embodiment of the present application can be applied to various scenarios such as service package based on SA technology, service management based on SA technology, or auxiliary operation based on SA technology.
  • the solution provided by the embodiment of this application can be used to train the AI model, and then the trained AI model can be used to identify different applications for subsequent implementation of different billing and control strategies.
  • service management based on SA technology may include bandwidth control, congestion control, or service guarantee after identifying service content.
  • For example, there are many types of voice over internet protocol (VoIP) applications, their versions and protocols are updated frequently, and many of them are encrypted, so SA technology is required to detect and control VoIP software.
  • Another example: as content on the network is continuously enriched, operators urgently need to analyze the content transmitted in the network. By analyzing transmission traffic, operators can better formulate service operation and maintenance strategies. SA technology is therefore needed to identify the traffic of different applications.
  • the embodiment of the present application may be applied to the scenario shown in FIG. 2 .
  • the central node 110 may be deployed on the cloud, and the first nodes 120 and 121 may be respectively deployed on the cloud platform.
  • the first nodes 120 and 121 may also be respectively deployed on edge computing platforms.
  • The central node 110 can deliver the AI model to the first nodes 120 and 121; the first nodes 120 and 121 then use local data to perform model training respectively and upload their local model parameter updates (also called the first model parameter update information) to the central node 110. The central node 110 aggregates the local model parameter updates received from the first nodes 120 and 121 to obtain a shared federated model parameter update (also called the second model parameter update information) and sends it to the first nodes 120 and 121, which respectively update the AI model from before local training according to the shared federated model parameter update to obtain the final federated model.
  • Data streams from the Internet pass through the first nodes 120 and 121 and reach users A and B respectively, where the first nodes 120 and 121 can use the trained federated model to perform SA identification on the local data streams; for example, the application to which a data flow belongs can be identified.
  • A data stream may be scored against different applications. For example, data stream 1 may be evaluated against application 1, application 2, application 3, application 4, application 5, and application 6, with an accuracy of 99% for application 1, 80% for application 2, 78% for application 3, 72% for application 4, 68% for application 5, and 40% for application 6.
  • the data flow is identified as belonging to application 1.
  • The application names and specific accuracies ranked second to fifth may also be output as part of the classification result.
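  • Selecting the top-1 result and optionally outputting the next-ranked candidates can be sketched as a simple sort over per-application scores; the score values below mirror the example above, while the function name is an assumption:

```python
def classify(scores, top_k=5):
    """Return the best-matching application plus the next top_k
    candidates with their confidence scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_app, _best_score = ranked[0]
    return best_app, ranked[1:1 + top_k]

scores = {"application 1": 0.99, "application 2": 0.80, "application 3": 0.78,
          "application 4": 0.72, "application 5": 0.68, "application 6": 0.40}
best, runners_up = classify(scores)
```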
  • AI model parameters can be updated synchronously, realizing normalized training between different nodes. Distributed training of AI models can also be realized, with model training computing power shared among nodes to avoid performance bottlenecks. Differentiated or personalized node data can also be processed, so that recognition capability is quickly transferred between nodes; where data samples are few or missing, the recognition ability of other nodes can still be obtained through federated learning.
  • the central node 110 may be deployed on the NAIE, and the first nodes 120 and 121 may also be respectively deployed on edge computing platforms.
  • the first nodes 120 and 121 may be deployed in the local data center of operator A and the local data center of operator B respectively.
  • The identification capabilities of each data center can be summarized and integrated to improve overall identification ability while ensuring that the original data never leaves the local site.
  • Training computing power can also be shared, enabling rapid iteration of identification capabilities for key applications within key time periods.
  • the embodiment of the present application provides a model training method, which can be executed by the central node 110 and the first node 120 or 121 in Fig. 1 or Fig. 2 .
  • the method includes:
  • the central node sends a model training message to at least two first nodes; correspondingly, the first node of the at least two first nodes receives the model training message;
  • the model training message includes an artificial intelligence AI model and model training configuration information, and the AI model is used to identify the category to which the data flow belongs.
  • FIG. 3 is only an exemplary display and does not limit the embodiment of the present application.
  • the central node may be deployed on the cloud, for example, may be deployed on an artificial intelligence engine.
  • The first node can be deployed on a cloud platform or an edge computing platform; for example, it can be deployed on a clouded multiple service engine (CloudMSE) or on a multi-access edge computing (MEC) platform.
  • The central node or the first node may be implemented using a container service, one or more virtual machines (VMs), one or more processors, or one or more computers.
  • The model training configuration information also includes a training result accuracy threshold, which is used to indicate the training result accuracy to be reached when the first node trains the AI model according to the local data and the model training configuration information.
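  • The effect of the training result accuracy threshold can be sketched as an early-stopping check on the first node; the epoch callback and the threshold value below are illustrative assumptions:

```python
def train_until_threshold(train_one_epoch, accuracy_threshold, max_epochs=50):
    """Stop local training as soon as the training-result accuracy
    configured by the central node is reached."""
    for epoch in range(1, max_epochs + 1):
        accuracy = train_one_epoch()
        if accuracy >= accuracy_threshold:
            return epoch, accuracy
    return max_epochs, accuracy

# Demo callback: accuracy improves by 0.05 per epoch, standing in for real training.
acc = {"value": 0.50}
def demo_epoch():
    acc["value"] += 0.05
    return acc["value"]

epochs_used, final_acc = train_until_threshold(demo_epoch, accuracy_threshold=0.90)
```

Because the threshold comes from the model training configuration information, the central node can control how far each first node trains without ever seeing the local data.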
  • the AI model may adopt a neural network model or the like.
  • The neural network model may be a convolutional neural network (CNN) model, a recurrent neural network model, or the like.
  • A CNN is a deep-learning network structure that has been widely used in the field of image recognition. It is a feedforward neural network whose artificial neurons respond to surrounding units.
  • The embodiment of the present application may use a CNN model. Because the convolutional layers of a CNN model have a multi-layer structure, multiple convolution calculations are performed on the original data; the data processing of the CNN model is thus relatively sophisticated and can extract more numerous and more complex traffic features, which helps SA identification.
  • The CNN model has strong generalization ability, is not sensitive to the position of a traffic feature within a packet, and does not need special early-stage processing of the data flow to be identified, so it adapts well to different network environments.
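  • The position insensitivity can be illustrated with a toy one-dimensional convolution followed by global max pooling (a drastic simplification of a real CNN): the same byte pattern yields the same pooled response wherever it appears in the sequence. All values below are made up for the illustration:

```python
def conv1d(seq, kernel):
    """Slide a 1-D kernel over a byte sequence (no padding, stride 1)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def global_max_pool(feature_map):
    """Keep only the strongest response, discarding its position."""
    return max(feature_map)

pattern = [3, 1, 4]                  # a made-up traffic feature
kernel = [3, 1, 4]                   # a detector tuned to that feature
early = [0, 0] + pattern + [0] * 5   # feature near the start of the packet
late = [0] * 6 + pattern + [0]       # the same feature near the end

response_early = global_max_pool(conv1d(early, kernel))
response_late = global_max_pool(conv1d(late, kernel))
```

The pooled responses match, which is why the feature's position in the packet does not matter to the classifier.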
  • the CNN model is taken as an example.
  • Taking the CNN model as an example of the AI model, see Table 1: the model training configuration information may include the required parameters in Table 1, the required parameters together with the optional parameters, or some of the required parameters and some of the optional parameters; for example, parameter No. 4 and parameter No. 5 may be selected for inclusion in the model training configuration information.
  • Parameter No. 27 is dynamically generated from parameter No. 25 during training, and parameter No. 28 is dynamically generated from parameter No. 26 during training.
  • the AI model supports a certain number of protocols, and different protocols may correspond to different applications or different types of applications.
  • AI models can configure the range of recognizable categories as needed.
  • the range of protocols that need to be recognized can be flexibly configured for the AI model according to business requirements.
  • The configured protocol information is called the checked protocols.
  • The checked protocols may include one or more of the following: the level of the protocol, the number of the protocol, the name of the protocol, and the like.
  • The checked protocols may also include an AI instance number and an AI instance name.
  • the level of the protocol, the name of the protocol, and the number of the protocol may be configured in advance, and may not be changed later.
  • The checked protocol ID list corresponding to parameter No. 3 refers to the list of identification IDs of each protocol included in the checked protocols.
  • the AI model and model training configuration information may be sent to the first node in the form of a configuration file.
  • The configuration file may include the following three files: 1. a .caffemodel file, representing the initial AI model; 2. a .proto file, including parameter definition information related to the AI model, such as the content in the above table; 3. a *netproto.txt file, including the parameter values of the AI model-related parameters, such as the parameter values of each parameter in the above table.
  • the *netproto.txt file matches the AI model and can be used for AI model loading and verification.
  • The category to which the data flow belongs may include at least one of the following: the application to which the data flow belongs; the type or protocol to which the business content of the data flow belongs; or the packet feature rules of the data flow, and the like.
  • a data stream can be identified as belonging to an application, for example, data stream A can be identified as belonging to WeChat, data stream B can be identified as belonging to YouTube, data stream C can be identified as belonging to iQiyi, and so on.
  • the business content of a data stream can be identified as belonging to different types.
  • For example, the business content of data stream D can be identified as belonging to video, and more specifically to WeChat video; the business content of data stream E can be identified as belonging to IP telephony; the business content of data stream F can be identified as belonging to pictures; and so on.
  • the business content of a data stream can be identified to which protocol it belongs. Different protocols may correspond to different applications or to different types of applications.
  • the business content of data stream G can be identified as belonging to the BT (BitTorrent) protocol
  • the business content of data stream H can be identified as belonging to the MSN (Microsoft Network) protocol
  • the business content of data stream F can be identified as belonging to SMTP (Simple Mail Transfer Protocol), and so on.
  • the AI model can also be used to extract the packet feature rules of the data flow.
  • packet characteristic rules may be extracted during the process of identifying Unknown data flows, that is, the packet characteristic rules are extracted from the identification process of the Unknown data flows.
  • the feature rule may be updated into the SA feature database.
  • the central node can compress the model training information before sending it, and the first node can decompress the received model training information to obtain the AI model and model training configuration information; or the central node can encrypt the model training information before sending it, and the first node can decrypt the received model training information to obtain the AI model and model training configuration information; or the central node can compress and encrypt the model training information before sending it, and the first node can decompress and decrypt the received model training information to obtain the AI model and model training configuration information.
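As an illustrative sketch of the compress-before-sending option (the bundling format, function names, and fields here are hypothetical, not specified by the patent; a real deployment would additionally layer encryption, e.g. AES, on top of the compression):

```python
import json
import zlib

def pack_training_info(model_bytes: bytes, config: dict) -> bytes:
    # Bundle the AI model and the model training configuration, then compress.
    payload = json.dumps({
        "model": model_bytes.hex(),
        "config": config,
    }).encode("utf-8")
    return zlib.compress(payload)

def unpack_training_info(blob: bytes):
    # Decompress and split the bundle back into model bytes and configuration.
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    return bytes.fromhex(payload["model"]), payload["config"]

blob = pack_training_info(b"\x01\x02", {"iterations": 10, "batch_size": 64})
model, config = unpack_training_info(blob)
```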
  • the central node as the coordinator may select at least two first nodes from multiple first nodes to participate in federated learning, and send model training information to the at least two first nodes.
  • the first node may register with the central node in advance, and the central node selects at least two first nodes from the registered first nodes to participate in federated learning.
  • the central node may randomly select at least two first nodes or select at least two first nodes according to preset rules.
  • the central node may select, from the multiple first nodes and according to the data distribution information of the multiple first nodes, at least two first nodes that store business data meeting the training requirements to participate in the model training process of the AI model.
  • the first node may also send a heartbeat message to the central node periodically or in real time; the heartbeat message may include status information of the first node.
  • the central node may send the AI model and model training configuration information to at least two first nodes in the form of a training task.
  • the central node may send training task information to at least two first nodes, where the training task information includes the AI model and model training configuration information.
  • the central node may first send a training task notification message to the first node, where the training task notification message may include a task identification ID; the first node that receives the training task notification message may then send a training task query message to the central node, where the training task query message may include the task ID; after receiving the training task query message, the central node may send the training task information to the first node.
  • before sending the model training information to the at least two first nodes, the central node can also perform initialization settings. The initialization settings include: selecting the first nodes participating in federated learning; setting the aggregation algorithm; setting parameters such as the number of iterations of model training, for example the parameters in Table 1; establishing a federated learning instance and selecting SA as the instance type, for example selecting an instance for AISA traffic identification; and initializing an AI model, for example initializing an AISA traffic identification model, or injecting already pre-trained model parameters and weights into an AI model.
  • Initializing an AI model refers to injecting initial model parameters and weights into the AI model.
  • the first node of the at least two first nodes trains the AI model according to local data and model training configuration information
  • the local data may be the data obtained by the first node locally performing feature recognition on the collected data flow through the SA engine, obtaining the classification result of the data flow, and labeling the data flow according to the classification result of the data flow.
  • the first node can first load the AI model according to the received model training configuration information, for example according to the parameters in Table 1, such as parameters 1, 3 to 4, 10 to 14, and 16 to 30 in Table 1; and then train the AI model according to the local data and the model training configuration information, for example according to parameters 3, 8, 15, 31, and 32 in Table 1.
  • the process of training the AI model may include: dividing the local data into K batches according to a predetermined size B (batchsize); and training the AI model K times based on the K batches.
  • B is the parameter value of parameter No. 31 in Table 1.
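The batching step can be sketched as follows (the function name is illustrative; B corresponds to the batchsize parameter, and the last batch may be smaller when the data does not divide evenly):

```python
def split_into_batches(samples, batch_size):
    # Divide the local data into K = ceil(len(samples) / B) batches of
    # size B (batchsize); the AI model is then trained once per batch.
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

batches = split_into_batches(list(range(10)), 4)
# K = 3 batches: [0..3], [4..7], [8, 9]
```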
  • the model training configuration information further includes: a training result accuracy threshold; the training result accuracy threshold is used to indicate the training result accuracy of the AI model trained by the first node according to the local data and the model training configuration information.
  • parameter No. 15 in Table 1 is a training result parameter, which may include a training result accuracy threshold.
  • the accuracy of the training results refers to the proportion of successfully identified samples among the identified samples. For example, if the total number of samples is 100 and the number of identified samples is 90, and among the 90 identified samples the number of successfully identified samples is 80, the accuracy is 80/90.
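A minimal sketch of this accuracy definition (names are illustrative):

```python
def training_accuracy(identified: int, successful: int) -> float:
    # Accuracy here is the share of successfully identified samples among
    # all identified samples (not among the total sample count).
    return successful / identified

acc = training_accuracy(identified=90, successful=80)  # 80/90
```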
  • the model training process mainly includes: the collection and labeling phase of the training data set, the training phase, and the verification phase.
  • the SA engine can be used to obtain the training set for training the CNN model without manual participation, which can both improve efficiency and save human resources. Deep learning training relies on large labeled data sets; large data sets not only need to be labeled but also need to be updated regularly to ensure that new features can be learned. For example, based on the successful identification results of the SA engine, training messages can be labeled to form a training data set, and the data set can also be updated regularly.
  • BP stands for Back Propagation, also known as Error Back Propagation, that is, error backpropagation.
  • the basic idea of the BP algorithm is forward propagation plus back propagation. In forward propagation, input samples are passed in from the input layer, processed layer by layer through each hidden layer, and passed to the output layer; if the actual output of the output layer does not match the expected output, the process turns to backpropagation of the error.
  • backpropagation passes the output-layer error back layer by layer to the input layer in some form and apportions the error to all units of each layer, so as to obtain the error of each layer's units, which serves as the basis for correcting the weight of each unit.
  • network learning takes place during the modification of the weights; when the error reaches the expected value, network learning ends. The above two phases are iterated repeatedly until the response of the network to the input reaches the predetermined target range.
  • a CNN may include an input layer, a hidden layer, and an output layer.
  • the hidden layer may include multiple layers, which is not limited in this application.
  • the input training samples are fed to the model to be trained through the input layer, processed layer by layer in the hidden layers, and transmitted to the output layer; if the actual output of the output layer does not match the expected output, the process turns to backpropagation of the error.
  • backpropagation means that the output error is passed back to the input layer layer by layer in some form and distributed to each layer, so as to obtain the error of each layer, and this error is used as the basis for correcting the weights of each layer.
  • the training process obtains the CNN model after multiple weight modifications; the training process ends when the error reaches the desired value.
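The forward-pass/backpropagation/weight-correction loop described above can be illustrated with a single linear unit trained by gradient descent, as a minimal stand-in for the full CNN (all names, data, and hyperparameters are illustrative assumptions, not from the patent):

```python
def train_linear_unit(data, lr=0.1, epochs=500):
    # One unit y_hat = w*x + b trained by repeated forward/backward passes.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = w * x + b      # forward pass through the unit
            err = y_hat - y        # error at the output
            w -= lr * err * x      # backpropagate: correct weight by its gradient
            b -= lr * err          # correct bias by its gradient
    return w, b

# Samples drawn from y = 2x + 1; training should recover w ~ 2, b ~ 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_linear_unit(data)
```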
  • the cross-validation method can be used to train the model.
  • the training data set can be divided into two parts, one part is used to train the model (as a training sample), and the other part is used to verify the accuracy of the network model (verification sample).
  • after a CNN model is trained, the verification samples are used to check whether the trained model can accurately identify data streams, and the recognition accuracy is given.
  • when the recognition accuracy reaches the set threshold, it can be determined that the CNN model can be used as the model for subsequent feature recognition.
  • when the recognition accuracy does not reach the set threshold, training can be continued until the recognition accuracy reaches the set threshold.
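A minimal sketch of the cross-validation split and the threshold check (the 80/20 ratio and all names are illustrative assumptions):

```python
def split_train_validation(dataset, train_ratio=0.8):
    # Split the labeled data set: one part trains the model (training samples),
    # the other verifies the recognition accuracy (verification samples).
    cut = int(len(dataset) * train_ratio)
    return dataset[:cut], dataset[cut:]

def meets_threshold(correct, total, threshold=0.9):
    # Training continues until validation accuracy reaches the set threshold.
    return (correct / total) >= threshold

train, val = split_train_validation(list(range(10)))
```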
  • the feature recognition of the Unknown data stream can be performed according to the AI model obtained from the training and verification to obtain the classification result.
  • the model training process may also include a reasoning and recognition stage.
  • a data set (message) without labels is input, that is, a data set to be identified, and the label classification of the data set (message) is given, that is, the recognition result.
  • the data set to be recognized can be divided into two parts: one part consists of data messages whose recognition results have already been marked by the SA engine, which can be used to verify the consistency between the SA engine and the AI model and prove the accuracy of AI model recognition; the other part consists of data packets whose recognition results have not been marked by the SA engine, which can be used to find packets that reflect the difference in capability between the SA engine and the AI model.
  • the first node of the at least two first nodes determines first model parameter update information
  • the first model parameter update information is the model parameter update information after the AI model is trained according to the local data of the first node and the model training configuration information.
  • the first node of the at least two first nodes sends first model parameter update information; correspondingly, the central node receives at least two first model parameter update information of the at least two first nodes;
  • the first model parameter update information is the model parameter update information after training the AI model according to the local data of the first node and the model training configuration information;
  • the first model parameter update information is sent after successful verification of the model parameters of the trained AI model according to the model training configuration information.
  • the first node can compress the first model parameter update information and send it to the central node, and the central node obtains the first model parameter update information after decompression; or the first node can encrypt the first model parameter update information and send it to the central node, and the central node obtains the first model parameter update information after decryption; or the first node can compress and encrypt the first model parameter update information and send it to the central node, and the central node obtains the first model parameter update information after decompression and decryption.
  • the first model parameter update information is sent after successful verification of the model parameters of the trained AI model according to the model training configuration information. That is, after the first node calculates the first model parameter update information, it needs to verify whether the structure of the trained AI model is consistent with that of the AI model before training; the first model parameter update information is sent to the central node only when the structures are consistent, and an error is reported when they are inconsistent. This ensures that the structure of the AI model before and after training is completely consistent and avoids affecting the training effect.
  • some or all of parameters 3 to 5, 10 to 13, and 16 to 30 in Table 1 can be used as verification parameters, that is, the parameters of the AI model before and after training are compared for consistency. When these parameters are completely consistent, it is determined that the structure of the AI model before and after training is consistent; when they are not completely consistent, it is determined that the structure of the AI model before and after training is inconsistent.
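The structure check can be sketched as a comparison of the verification parameters before and after training (the parameter names below are illustrative placeholders, not the actual entries of Table 1):

```python
def verify_structure(params_before: dict, params_after: dict, check_keys) -> bool:
    # The trained model is accepted only when every verification parameter
    # (e.g. layer counts, kernel sizes) is identical before and after training.
    return all(params_before.get(k) == params_after.get(k) for k in check_keys)

keys = ["conv_layers", "kernel_size", "classes"]
before = {"conv_layers": 4, "kernel_size": 3, "classes": 20}
after_ok = {"conv_layers": 4, "kernel_size": 3, "classes": 20}
after_bad = {"conv_layers": 5, "kernel_size": 3, "classes": 20}
```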
  • the first node can also send other parameter values to the central node at the same time, such as training time, training data volume, training result precision, training result recall rate, etc.
  • the recall rate of the training results refers to the proportion of identified samples among the total samples. For example, if the total number of samples is 100 and the number of identified samples is 90, the recall rate is 90/100.
  • the first node may send the first model parameter update information to the central node in the form of a task execution result.
  • the first node sends the execution result information of the training task to the central node; the execution result information includes: the task ID, an indication that the task executed successfully, and the first model parameter update information, and may also include the training time spent, the training data volume, the training result precision, the training result recall rate, and the like.
  • the execution result information may include: task ID, task execution failure, failure reason, and the like.
  • the execution result information may be sent after being compressed and/or encrypted.
  • the central node aggregates at least two first model parameter update information using a preset aggregation algorithm to obtain second model parameter update information;
  • the preset aggregation algorithm may be an average algorithm, a weighted average algorithm, a FedAvg (Federated Averaging) algorithm, or a stochastic variance reduced gradient (SVRG) algorithm.
  • the model update of the first node adopts gradient descent: $\omega_{t+1}^{k}=\omega_{t}-\eta\nabla F_{k}(\omega_{t})$, where $\eta$ is the learning rate and $F_{k}(\omega)=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}} f_{i}(\omega)$ is the local training objective of the first node $k$ over its $n_{k}$ local samples.
  • the update process of the central node model can adopt the method of model aggregation: $\omega_{t+1}=\sum_{k=1}^{K}\frac{n_{k}}{n}\,\omega_{t+1}^{k}$.
  • the model update amount $\Delta_{k}=\omega_{t+1}^{k}-\omega_{t}$ of each first node can also be used to aggregate the model: $\omega_{t+1}=\omega_{t}+\sum_{k=1}^{K}\frac{n_{k}}{n}\,\Delta_{k}$.
  • subsequently the central node uses $\omega_{t+1}$ to update the model and then sends it to each first node, so that the federated learning mechanism aggregates and enhances the local model to achieve the business goal.
  • here $i$ denotes the sample number and takes values $1$ to $n$; $n$ is the total number of samples; lowercase $k$ is the number of a first node and takes values $1$ to uppercase $K$; $K$ is the number of first nodes; $n_{k}$ is the number of samples of the $k$-th node; $t$ denotes the current moment and $t+1$ the next moment; $\omega$ denotes the model parameters, $\omega_{t}$ and $\omega_{t+1}$ the model parameters at times $t$ and $t+1$, and $\omega_{t+1}^{k}$ the model parameter update of the first node $k$ at time $t+1$; $f_{i}(\omega)$ is the loss corresponding to sample $i$; $F_{k}(\omega)$ is the local objective of the first node $k$, and $F_{k}(\omega_{t})$ its value at time $t$; $f(\omega)=\sum_{k=1}^{K}\frac{n_{k}}{n}F_{k}(\omega)$ is the global objective.
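The weighted aggregation at the central node (the FedAvg-style rule) can be sketched as follows, with parameter vectors as plain lists (function and variable names are illustrative):

```python
def fedavg(local_params, sample_counts):
    # Weighted average of the first nodes' parameter vectors:
    # w_{t+1} = sum_k (n_k / n) * w_{t+1}^k.
    n = sum(sample_counts)
    dim = len(local_params[0])
    agg = [0.0] * dim
    for params, n_k in zip(local_params, sample_counts):
        for j in range(dim):
            agg[j] += (n_k / n) * params[j]
    return agg

# Two first nodes; node 1 holds 3x the data of node 2, so its
# parameters dominate the aggregate.
w = fedavg([[1.0, 2.0], [5.0, 6.0]], [30, 10])
# w = [0.75*1 + 0.25*5, 0.75*2 + 0.25*6] = [2.0, 3.0]
```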
  • the central node sends second model parameter update information to the first node of the at least two first nodes; correspondingly, the first node of the at least two first nodes receives the second model parameter update information.
  • the second model parameter update information is obtained according to the at least two first model parameter update information, and the second model parameter update information is used to update the model parameters of the AI model of the first node.
  • the above method steps performed by the first node may be performed by an application program (APP) deployed on a cloud platform or an edge computing platform.
  • the method further includes: receiving the first model parameter update information from the first node through the server module of the APP; the sending of the first model parameter update information includes: sending the first model parameter update information through the client module of the APP.
  • the method may further include: the first node obtains, based on the AI model updated with the second model parameter update information, the identification result of identifying the data flow and the message feature rules; and the first node updates the service-aware SA feature library according to the message feature rules.
  • the central node can compress the second model parameter update information before sending it, and the first node obtains the second model parameter update information after decompression; or the central node can encrypt the second model parameter update information before sending it, and the first node obtains the second model parameter update information after decryption; or the central node can compress and encrypt the second model parameter update information before sending it, and the first node obtains the second model parameter update information after decompression and decryption.
  • the first node of the at least two first nodes updates the AI model before this training according to the second model parameter update information.
  • for example, the second model parameter update information is a one-dimensional array W, where W is (4, 5, 6, 7, 8, 9), and the model parameters of the AI model before training are expanded into a one-dimensional array X as (1, 2, 3, 4, 5, 6); the first node restores the model parameters of the AI model from the array W and updates the pre-training AI model with the restored model parameters.
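The expand-and-restore handling of the one-dimensional parameter array can be sketched as follows (the two-layer split is an illustrative assumption):

```python
def flatten(params_per_layer):
    # Expand the model parameters layer by layer into a one-dimensional array.
    return [p for layer in params_per_layer for p in layer]

def restore(flat, layer_sizes):
    # Rebuild the per-layer parameters from the one-dimensional array; the
    # pre-training AI model can then be updated with the restored parameters.
    out, i = [], 0
    for size in layer_sizes:
        out.append(flat[i:i + size])
        i += size
    return out

# As in the example: the update W = (4, 5, 6, 7, 8, 9) replaces the
# flattened pre-training parameters X = (1, 2, 3, 4, 5, 6).
W = [4, 5, 6, 7, 8, 9]
layers = restore(W, [2, 4])
```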
  • the process of the above steps S302 to S307 can be performed iteratively until the model convergence condition is met, and the federated learning is ended to obtain the final AI model.
  • the first node may send a heartbeat message to the central node regularly or in real time, and the heartbeat message may include status information of the training task, such as receiving, running, completed, and failed.
  • the first node can send a de-registration request message to the central node; the central node de-registers the first node and sends a de-registration success response message to the first node, and the first node goes offline successfully.
  • the method may further include:
  • the first node trains the updated AI model according to local data and model training configuration information.
  • the first node uses local data to further train the AI model obtained after federated learning, which not only improves the recognition rate of the AI model through federated learning, but also makes the trained AI model better adapted to local data, further improving the recognition rate and satisfying local business needs.
  • the method may further include:
  • based on the updated AI model, the first node obtains the identification result of the data flow and the message feature rules; the first node updates the service-aware SA feature library according to the message feature rules.
  • the feature rules can be used to realize the rapid identification of SA.
  • identification using the SA feature library is faster than identification using the AI model.
  • the acquired message feature rules are quickly added to the SA feature library, which ensures that SA can quickly identify the part of the traffic that matches the supplementary feature rules, so that AI model identification is no longer needed for this traffic.
  • packet feature rules are extracted to meet the requirements of automatic product operation and maintenance; no manual participation is required to upgrade the SA feature library, improving efficiency and reducing costs.
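A sketch of the fast path through the SA feature library with fallback to the AI model, and the feedback of newly learned rules into the library (the data shapes and names are illustrative assumptions, not from the patent):

```python
def classify_flow(packet_signature, sa_feature_db, ai_model):
    # Fast path: match the packet against the SA feature library first;
    # only unmatched (Unknown) flows fall through to the slower AI model.
    if packet_signature in sa_feature_db:
        return sa_feature_db[packet_signature]
    label = ai_model(packet_signature)
    # Feed the newly learned feature rule back into the SA feature library
    # so future flows of this kind are identified without the AI model.
    sa_feature_db[packet_signature] = label
    return label

db = {"sig-a": "WeChat"}
label = classify_flow("sig-b", db, lambda s: "YouTube")
```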
  • the method may further include:
  • the central node sends the second model parameter update information and the AI model to the second node; wherein the second model parameter update information is used to update the model parameters of the AI model of the second node.
  • the second node updates the AI model according to the second model parameter update information.
  • the second node is a node that does not participate in federated learning.
  • the federated model obtained by federated learning can be directly used in non-federated scenarios, that is, the AI model obtained by federated learning can be applied to the second node that does not participate in federated learning to improve the recognition rate of the AI model of the second node.
  • the model carrying these parameters can be directly exported to a non-federated node (that is, the second node); the non-federated node can fully understand the structure of the AI model, and the AI model can run automatically. These parameters may include parameters 4 to 5, 14, and 16 to 33 in Table 1.
  • the second node can also directly modify the federated model before using it, so that different models can be used on different nodes. For example, if some nodes (that is, second nodes) do not meet expectations after using the federated model, the model parameters of the federated model can be modified, or the federated model can be taken offline and run on non-federated nodes with modified model parameters, to achieve better business results.
  • the model parameters modified here may include parameters 4 to 5, 14, and 16 to 33 in Table 1.
  • because the first model parameter update information is obtained by training at the first node using local data and the model training configuration information, data privacy protection requirements can be met; and because the second model parameter update information is obtained based on at least two pieces of first model parameter update information from at least two first nodes, the recognition rate of the AI model updated with the second model parameter update information is higher, thereby improving the recognition rate of the AI model as much as possible.
  • the embodiment of the present application can also realize the sharing of training computing power among various nodes, avoiding problems such as too long training time of a single node, excessive performance consumption, and the like.
  • the method may further include:
  • the first node sends a registration request message to the central node; correspondingly, the central node receives the registration request message.
  • FIG. 3 is only an exemplary display and does not limit the embodiment of the present application.
  • the registration request message may include the name, identifier, etc. of the first node, or may also include information such as the data volume of the first node's local data.
  • after the registration succeeds, the central node sends a registration success response message to the first node; correspondingly, the first node receives the registration success response message.
  • if the registration fails, the central node may send a registration failure response message to the first node; correspondingly, the first node receives the registration failure response message.
  • the first node may continue to send heartbeat messages to the central node.
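An illustrative sketch of the registration exchange between a first node and the central node (all message fields and function names are hypothetical, not taken from the patent):

```python
def make_registration_request(node_name, node_id, data_volume):
    # Registration request carrying the node's name, identifier, and
    # the data volume of its local data.
    return {"type": "register", "name": node_name, "id": node_id,
            "data_volume": data_volume}

def handle_registration(request, registry):
    # The central node records the node and answers with success or failure.
    if request["id"] in registry:
        return {"type": "register_response", "status": "failure"}
    registry[request["id"]] = request["name"]
    return {"type": "register_response", "status": "success"}

registry = {}
resp = handle_registration(make_registration_request("siteA", "n1", 1000), registry)
```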
  • the central node executes initialization settings.
  • the initialization setting may be performed by the management unit of the central node.
  • the initialization settings may include:
  • at least two first nodes participating in federated learning may be selected, randomly or according to preset rules, from the multiple registered first nodes;
  • parameters such as the number of iterations of model training can be set, for example some or all of the parameters in Table 1;
  • a federated learning instance is established with SA as the instance type, for example an instance for AISA traffic identification;
  • an AISA traffic recognition model is initialized, or pre-trained model parameters and weights are injected into an AI model.
  • Initializing an AI model refers to injecting initial model parameters and weights into the AI model.
  • the central node sends a training task notification message to at least two selected first nodes; correspondingly, the first node of the at least two first nodes receives the training task notification message;
  • the training task notification message may include a task ID, which is used to notify the first node of the training task.
  • the first node of the at least two first nodes sends a training task query message to the central node;
  • the training task query message may include a task ID, which is used to query the central node for the training task.
  • the model training information may be carried in a training task message and sent to the first node.
  • the central node sends training task information to the at least two first nodes; correspondingly, the first node of the at least two first nodes receives the training task information; wherein the training task information includes the initial AI model and the model training configuration information.
  • the AI model and model training configuration information have been introduced in the previous embodiment, and will not be repeated here.
  • the central node may compress and/or encrypt the training task information and send it to the first node. After receiving the information, the first node decompresses and/or decrypts the information to obtain the AI model and model training configuration information.
  • the first node can continuously send heartbeat messages to the central node to inform the central node of the status of the training task, such as receiving the task, task running, task completion, task failure, etc.
  • the first model parameter update information may be included in the task execution result and sent to the central node.
  • the first node of the at least two first nodes sends a task execution result to the central node; correspondingly, the central node receives at least two task execution results sent by the at least two first nodes; wherein the task execution result includes the first model parameter update information.
  • the task execution results may also include one or more of the following: time spent on training, amount of training data, precision of training results, or recall rate of training results, and the like.
  • the first node may first compress and/or encrypt the task execution result and send it to the central node, and the central node may decompress and/or decrypt the task execution result to obtain the task execution result.
  • a next round of training can also be performed, that is, the process skips to step S305 to train the updated AI model again until the model converges and the federated learning ends.
  • the technical solution provided by the embodiment of the present application uses a federated learning mechanism to transmit, between different nodes, the model parameter updates obtained after local training at each first node, so that even if a certain first node has too little traffic of an APP to be identified and its recognition effect is poor, the recognition capability of other first nodes can be transferred to enhance the local recognition capability.
  • by training together with other first nodes through federated learning, the recognition capability can be rapidly expanded and the single-node model training efficiency can be improved, achieving fast iteration.
  • the identification capability for a new application APP can also be quickly transferred between different first nodes. For example, taking site A and site B participating in federated learning to train AI models as an example, site A's identification capability can be passed on to site B, so that even with little traffic (small samples) or no traffic of the APP, site B can still have the same identification capability as site A.
  • site B can share the computing power of site A, which improves the training performance compared to site A alone.
  • a federated learning server FLS1101 (Federated Learning Server) and an aggregation unit 1102 are set at the central node 110, and each first node is respectively provided with a federated learning client FLC (Federated Learning Client) and a training unit; for example, an FLC1201 and a training unit 1202 are provided at the first node 120, and an FLC1211 and a training unit 1212 are provided at the first node 121.
  • the FLS1101 and each FLC1201 and FLC1211 perform information exchange through wired or wireless connections respectively.
  • the aggregation unit 1102 can be any unit that supports data aggregation, and can be set in the central node 110 together with the FLS1101 to realize federated learning together with the FLS1101.
  • the training unit 1202 and the training unit 1212 can be any units that support AI model training, and can be set up in the first node together with the FLC1201 and FLC1211 respectively, and cooperate with the FLS1101 to realize federated learning.
  • the FLS1101 and the aggregation unit 1102 may be implemented by using different container services, or may each be implemented by one or more virtual machines (virtual machine, VM), or by one or more processors, or by one or more computers.
  • FLC 1201, FLC 1211, training unit 1202, and training unit 1212 can be respectively implemented by using different container services, or can be respectively implemented by one or more virtual machines (virtual machine, VM), or respectively by one or multiple processors, or one or more computers, respectively.
  • the training unit 1202 and the training unit 1212 may be artificial intelligence service awareness (AISA) deployed in the first node 120 and the first node 121 respectively, and AISA may also be called artificial intelligence identification.
  • the function may also be named by other names; AISA is used here as an example.
  • AISA can be used to classify collected data streams according to the SA feature library to obtain classification results.
  • the SA feature library can be located inside the AISA, or can also be located outside the AISA, and connected through an interface.
  • An SA engine may also be included in AISA.
  • the SA engine is used to implement feature recognition for the collected data stream according to the SA feature library.
  • AISA can perform feature identification on the collected data streams through the SA engine, and after obtaining the classification results of the data streams, label the data streams according to the classification results; the labeled data streams are then used as the training data set of the AI model, and AISA trains the AI model according to the training data set.
  • the first nodes 120 and 121 can also deploy SA recognition engines, such as SA@AI engines; an SA recognition engine can submit a model training application to AISA as an application that needs to perform SA recognition and configure the checked protocol, and AISA can then complete data collection, model training, and rule extraction, output the recognition results and rules, and update them to the SA feature library.
  • the AI model can be deployed on a cloud platform, and the cloud platform can register with the federated learning server (FLS) on the cloud central node; the two perceive each other's status through status messages.
  • after the AI model outputs recognition results, the cloud platform can forward interaction data, such as the recognition results produced by the AI model, to the FLS.
  • the FLS accepts the model parameters uploaded by the cloud platform, performs model aggregation and fusion, and, once complete, sends the shared model to the cloud platform. In this way, after the models are aggregated and fused, the model capability is enhanced.
  • FLC 1201 and FLC 1211 can be responsible for receiving the data of their local nodes and forwarding it to the central node 110.
  • AISA can also use FLC 1201 and FLC 1211 to synchronize status messages with the FLS 1101 on the central node, to update, export, upload, and download the model parameters of the AI model, and to upload and download information such as training time, data volume, and recognition results (recall rate, precision).
  • the federated learning server FLS 1101 is responsible for receiving the data of FLC 1201 and FLC 1211, detecting the status of the first nodes 120 and 121, distributing training tasks, accepting the updated model parameters uploaded by each distributed first node 120 and 121, aggregating and fusing the model parameter updates, and delivering the aggregated model parameter update.
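The aggregation and fusion performed by FLS 1101 can be illustrated with a minimal sketch. The function name and the sample-count weighting below are assumptions for illustration; the embodiments only require that a preset aggregation algorithm combine at least two pieces of first model parameter update information into one second model parameter update.

```python
# Minimal sketch of the aggregation step on the central node (FLS):
# combine per-node model parameter updates into one shared update.
# Weighting by each node's sample count (as in federated averaging)
# is an assumption; the text only requires a preset aggregation algorithm.

def aggregate_updates(updates, sample_counts):
    """updates: list of dicts mapping parameter name -> list of floats.
    sample_counts: number of local training samples per first node."""
    total = sum(sample_counts)
    merged = {}
    for name in updates[0]:
        merged[name] = [0.0] * len(updates[0][name])
        for update, count in zip(updates, sample_counts):
            weight = count / total
            for i, value in enumerate(update[name]):
                merged[name][i] += weight * value
    return merged

# Example: two first nodes upload updates for one parameter vector.
node_a = {"layer1.weight": [1.0, 2.0]}
node_b = {"layer1.weight": [3.0, 4.0]}
shared = aggregate_updates([node_a, node_b], sample_counts=[10, 30])
# node_a weighted 0.25, node_b weighted 0.75 -> [2.5, 3.5]
```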
  • a server module and a client module may be set in FLC 1201 and FLC 1211, a client module may be set in training unit 1202 and training unit 1212, and a server module may be set in FLS 1101.
  • the server modules in FLC 1201 and FLC 1211 are configured to connect with the client modules in training units 1202 and 1212 respectively, and are responsible for data transmission between FLC 1201/FLC 1211 and training unit 1202/training unit 1212 respectively.
  • the client modules in FLC 1201 and FLC 1211 are configured to connect with the server module in FLS 1101, and are responsible for data transmission between FLC 1201/FLC 1211 and FLS 1101.
  • the server module 11011 can be set in FLS 1101; the client module 12011 and the server module 12012 can be set in FLC 1201; the client module 12021 can be set in training unit 1202; the client module 12111 and the server module 12112 can be set in FLC 1211; and the client module 12121 can be set in training unit 1212.
  • the client module 12011, the client module 12021, the client module 12111, and the client module 12121 can be HTTP/HTTPS clients, and the server module 11011, the server module 12012, and the server module 12112 can be HTTP/HTTPS servers.
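The division into server and client modules described above can be sketched as follows. Transport details (HTTP/HTTPS) are elided, and the class and method names are hypothetical; only the relay pattern — a server side facing the training unit and a client side facing the FLS — follows the text.

```python
# Sketch of the module layout: an FLC exposes a server module toward its
# training unit's client module, and a client module toward the server
# module of the FLS. Class and method names are illustrative assumptions.

class FLS:
    def __init__(self):
        self.received = []

    def receive(self, update):
        self.received.append(update)

class FLC:
    def __init__(self, fls):
        self.fls = fls      # client side: connection to the FLS server module
        self.inbox = []     # server side: data received from the training unit

    def server_receive(self, update):
        """Server module: accept a model parameter update from the training unit."""
        self.inbox.append(update)

    def client_forward(self):
        """Client module: forward buffered updates to the FLS."""
        while self.inbox:
            self.fls.receive(self.inbox.pop(0))

fls = FLS()
flc = FLC(fls)
flc.server_receive({"node": "first node 120", "update": [0.1, -0.2]})
flc.client_forward()
```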
  • the FLC 1201 and the FLC 1211 may be deployed on the first node 120 and the first node 121 respectively in the form of an application (APP).
  • the first node 120 and the first node 121 may be nodes of an edge computing platform or a cloud platform.
  • the FLC 1201 and FLC 1211 may be deployed on the edge computing platform in the form of an APP.
  • FLC 1201 and FLC 1211 can be launched directly as APPs on the edge computing platform or cloud platform, and then act as agents for the connection between the first node 120/first node 121 and the FLS 1101.
  • the FLC 1201 and the FLC 1211 can be deployed on the first node 120 and the first node 121 respectively in the form of a static link library.
  • the FLC 1201 and the FLC 1211 can each be integrated into the virtual machine (VM) where the AISA is deployed.
  • the VM provides an interface for FLC 1201 and FLC 1211 to connect externally; the IP address and the parameters for connecting FLC 1201 and FLC 1211 to the FLS 1101 can also be configured through the login portal interface of the virtual machine, where these parameters refer to the name and identifier of FLC 1201 and FLC 1211, the user name and password registered with the FLS 1101, and so on.
  • FLS 1101 can execute the operations performed by the central node 110.
  • FLC 1201 and FLC 1211 can respectively execute the operations performed by the first node 120 and the first node 121.
  • alternatively, the FLS 1101 and the convergence unit 1102 can cooperate to execute the operations performed by the central node 110, where the FLS 1101 is responsible for receiving and sending data, and the convergence unit 1102 is responsible for aggregating the at least two pieces of first model parameter update information using a preset aggregation algorithm to obtain the second model parameter update information.
  • the operations performed by the first node 120 and the first node 121 can be performed jointly by FLC 1201/FLC 1211 and training unit 1202/training unit 1212, where training unit 1202 and training unit 1212 are responsible for training the AI model, and the other operations are performed by FLC 1201 and FLC 1211.
  • in scenario 1, three locals serve as the first nodes; local 1, local 2, and local 3 below may be collectively referred to as locals.
  • their sample numbers are respectively 10%, 30%, and 60% of the total sample number, and the total sample set is randomly allocated to each local in these proportions.
  • the sample data distribution of each local in scenario 1 is shown in Figure 6; since local 1 is allocated only 10% of the total data volume, it becomes a small-sample node.
  • the federated learning experiment is used to verify whether a small-sample node can improve the recognition ability of its model through federated learning.
  • the recognition accuracy rates of the three local models trained based on local data are 76.0%, 90.9%, and 95.6%, respectively.
  • the recognition accuracy of the obtained model can reach 97.1%.
  • the model parameter fusion strategy performs parameter fusion once per epoch, and the recognition accuracy of the obtained federated model is 95.7%. It can be seen that, because local 1 has few training samples, the recognition accuracy of its locally trained model is low, while local 3, with many samples, obtains a locally trained model with higher recognition accuracy.
  • through federated learning, each local obtains a federated model with higher recognition accuracy, improved to different degrees compared with local training; in particular, for the small-sample node local 1, the recognition accuracy of the model is greatly improved through federated learning.
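The fusion strategy used in scenario 1 — one parameter fusion per training epoch — can be sketched as below. The plain averaging, toy gradient values, and learning rate are illustrative assumptions; in the experiments the fusion is performed by the central node over real model parameters.

```python
# Sketch of per-epoch parameter fusion: after every local epoch, the
# parameters of all locals are fused (here by an element-wise mean, an
# assumption) and written back to every local.

def train_epoch(params, gradient, lr=0.1):
    # Stand-in for one local training epoch (illustrative only).
    return [p - lr * g for p, g in zip(params, gradient)]

def fuse(all_params):
    # Parameter fusion across locals: element-wise mean.
    n = len(all_params)
    return [sum(vals) / n for vals in zip(*all_params)]

locals_params = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # three locals
gradients = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # fixed toy gradients
for _ in range(2):                                      # two epochs
    locals_params = [train_epoch(p, g) for p, g in zip(locals_params, gradients)]
    fused = fuse(locals_params)                         # fuse once per epoch
    locals_params = [list(fused) for _ in locals_params]  # write fused params back
```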
  • the recognition accuracy rates of the three local models trained based on local data are 83.6%, 81.9%, and 82.8%, respectively.
  • the recognition accuracy of the obtained model can reach 97.1%.
  • the recognition accuracy of the obtained federated model reaches 95.6%.
  • Figure 10 shows, for scenario 2, the recall rates of small-sample applications in each local and the recall rates after federated learning.
  • the recall rates of small-sample application recognition in each local are low, and the recall rate of the federated model is significantly improved.
  • for example, the recall rate of application A in the local model of local 1 is only 46.9%, while the recall rate of the federated model is 95.6%; the recall rate of application B in the local model of local 1 is only 13.8%, while the recall rate of the federated model is significantly higher; the recall rate of application C in the local model of local 2 is only 49.7%, while the recall rate of the federated model is 98.5%; the recall rate of application D in the local model of local 2 is only 42.3%, while the recall rate of the federated model is 97.6%; the recall rate of application E in the local model of local 3 is only 32.1%, while the recall rate of the federated model is 96.4%; and the recall rate of application F in the local model of local 3 is only 59.1%, while the recall rate of the federated model is 97.3%.
  • Scenario 3: federated learning experiment for expanding the number of recognized applications.
  • the recognition accuracy of the three local models based on local data training is 87%, 87.5%, and 86.9%, respectively.
  • the recognition accuracy of the obtained model can reach 98.4%.
  • the recognition accuracy of the obtained federated model reaches 97.9%. It can be seen that when some local applications have no training samples, the locally trained model has no ability to recognize these applications, and the overall recognition accuracy is low.
  • after federated learning, the obtained federated model achieves high recognition accuracy and recall for the local applications without samples, and the overall recognition accuracy of the federated model is significantly improved.
  • Figure 13 shows, for scenario 3, the recall rates after federated learning for applications with no samples in each local.
  • for these applications, the local recall rate is 0, and the recall rate of the federated model is significantly improved.
  • the recall rate of APP1 in local1 is 0, and the recall rate of the federated model is increased to 98.1%.
  • the recall rate of APP2 in local2 is 0, and the recall rate of the federated model is increased to 97.0%.
  • the recall rate of APP3 in local3 is 0, and the recall rate of the federated model is increased to 98.5%.
  • the recall rate of APP4 is 0 in both local2 and local3, and the recall rate of the federated model is increased to 96.0%.
  • the recall rate of APP5 in local1 and local3 is 0, and the recall rate of the federated model is increased to 99.3%.
  • the recall rate of APP6 in local1 and local2 is 0, and the recall rate of the federated model is increased to 99.9%.
  • Scenario 4: federated learning fine-tuning experiment based on the federated model.
  • a federated model is obtained.
  • the model can be fine-tuned through federated learning, or federated-trained after reinitialization, to improve the recognition rate.
  • the fine-tuning is Federated-finetune, which refers to performing several rounds of training on a pre-trained model until the model converges; in this scenario, it refers to performing several rounds of training on the federated model until it converges.
  • the reinitialization is Federated-init, which means initializing the federated model before retraining.
  • the federated model is fine-tuned, that is, after 500 rounds of federated training on the federated model, the recognition accuracy of the obtained model is 95.7%.
  • after the federated model is reinitialized and federated training is performed, the recognition accuracy of the obtained model is also 95.7%. It can be seen that fine-tuning training can be performed based on the federated model obtained by federated learning, and the same model accuracy as initializing and retraining the model can be achieved with fewer rounds of federated training.
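The contrast between Federated-finetune and Federated-init described above can be sketched with a toy one-parameter model. The quadratic objective, learning rate, and round counts are assumptions chosen only to show that fine-tuning starts near the optimum and needs fewer rounds to reach comparable accuracy.

```python
# Toy contrast: fine-tuning continues from the federated model's parameters
# for a few rounds; reinitialization restarts from fresh parameters and
# must train longer to reach the same neighborhood of the optimum.

def train(start, optimum=5.0, lr=0.5, rounds=1):
    w = start
    for _ in range(rounds):
        w -= lr * (w - optimum)   # gradient step on (w - optimum)**2 / 2
    return w

federated_w = 4.9                          # federated model, already near the optimum
finetuned = train(federated_w, rounds=3)   # Federated-finetune: a few extra rounds
reinitialized = train(0.0, rounds=20)      # Federated-init: retrain from scratch
# Both end near the optimum, but fine-tuning starts much closer.
```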
  • for one new application, the recognition accuracy of the model obtained by fine-tuning the federated model is 99.4%, and the recognition accuracy of the model obtained by federated training on the reinitialized federated model is 99.4%; for another new application, the recognition accuracy of the model obtained by federated training on the reinitialized federated model is 99.8%.
  • for another new application, the recognition accuracy of the model obtained by fine-tuning the federated model is 99.3%, and the recognition accuracy of the model obtained by federated training on the reinitialized federated model is 99.2%.
  • for the new application APP_M, the recognition accuracy of the model obtained by fine-tuning the federated model is 99.7%, and the recognition accuracy of the model obtained by federated training on the reinitialized federated model is 99.5%. It can be seen that a model fine-tuned from the federated model of federated learning can achieve a recognition rate for new applications that is the same as, or even higher than, that of a model retrained after initialization.
  • for AI-based SA technology in the network, the technical solutions provided by the embodiments of this application can train an AI model that achieves high-precision, high-performance recognition even when training samples are lacking or computing power is insufficient, while protecting privacy.
  • traditional AI-based SA requires a large number of data samples for model training, and the legitimate means of collecting data samples are very limited; if the requirements are not met, overall performance is affected.
  • the embodiments of this application provide an overall improvement in the recognition ability of the AI model even in the absence of training samples.
  • traditional AI-based traffic identification generally takes a long time to compute, which easily causes a large amount of performance consumption on a single node.
  • the embodiments of this application can realize distributed model training and avoid the performance bottleneck of a single node.
  • the demand for privacy protection of network data traffic is increasing day by day.
  • the embodiments of the present application can realize distributed computing of AI technology in the network, protect the security of original data, and provide safe and reliable intelligent traffic identification services.
  • FIG. 16 shows the structure of a model training device provided by the embodiment of the present application.
  • the model training device can be deployed on the central node shown in FIG. 1 or FIG. 2 , or deployed on the FLS in FIG. 5 .
  • the model training device includes: a first sending module 1601 and a receiving module 1602 .
  • the first sending module 1601 is configured to send a model training message to at least two first nodes, and send second model parameter update information to a first node of the at least two first nodes, where the model training message includes an artificial intelligence (AI) model and model training configuration information, the AI model is used to identify the category to which a data stream belongs, the second model parameter update information is obtained according to the at least two pieces of first model parameter update information, and the second model parameter update information is used to update the model parameters of the AI model of the first node;
  • the receiving module 1602 is configured to receive at least two pieces of first model parameter update information from the at least two first nodes, where the first model parameter update information is model parameter update information obtained after training the AI model according to the local data of the first node corresponding to the first model parameter update information and the model training configuration information.
  • the category to which the data flow belongs includes at least one of the following: the application to which the data flow belongs; the type or protocol to which the service content of the data flow belongs; and the packet feature rule of the data flow.
  • the apparatus further includes: a second sending module 1603, configured to send the second model parameter update information and the AI model to a second node, where the second model parameter update information is used to update the model parameters of the AI model of the second node.
  • the model training configuration information may further include a training result accuracy threshold, which is used to indicate the training result accuracy with which the first node trains the AI model according to local data and the model training configuration information.
  • FIG. 17 shows the structure of a model training device provided by the embodiment of the present application.
  • the model training device may be deployed on the first node shown in FIG. 1 or FIG. 2 , or deployed on the FLC in FIG. 5 .
  • the model training device includes: a receiving module 1701 and a sending module 1702 .
  • the receiving module 1701 is configured to receive a model training message and receive second model parameter update information, where the model training message includes an AI model and model training configuration information, the AI model is used to identify the category to which a data stream belongs, the second model parameter update information is obtained according to at least two pieces of first model parameter update information of at least two first nodes, and the second model parameter update information is used to update the model parameters of the AI model of the first node;
  • the sending module 1702 is configured to send first model parameter update information, where the first model parameter update information is model parameter update information obtained after training the AI model according to the local data of the first node and the model training configuration information.
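The interaction between modules 1701 and 1702 can be illustrated with a toy one-weight model: the node trains on local data, derives the first model parameter update information as a weight delta, and later applies the second model parameter update information received from the central node. The SGD step, learning rate, and example values are assumptions for illustration.

```python
# First-node side of one federated round: receive model, train locally,
# produce the first model parameter update, then apply the second update.

def local_round(weights, local_data, lr):
    """Train on local data and return the first model parameter update
    (here: the weight delta produced by one pass of plain SGD)."""
    new_w = list(weights)
    for x, y in local_data:
        pred = new_w[0] * x
        grad = 2 * (pred - y) * x          # gradient of the squared error
        new_w[0] -= lr * grad
    return [nw - w for nw, w in zip(new_w, weights)]

weights = [0.0]                             # AI model from the model training message
update = local_round(weights, local_data=[(1.0, 2.0), (2.0, 4.0)], lr=0.05)
# The second model parameter update information later arrives from the
# central node; applying it updates the local AI model's parameters:
second_update = [1.5]                       # illustrative aggregated value
weights = [w + u for w, u in zip(weights, second_update)]
```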
  • the category to which the data flow belongs includes at least one of the following: the application to which the data flow belongs; the type or protocol to which the service content of the data flow belongs; and the packet feature rule of the data flow.
  • the apparatus may further include: a training module 1703, configured to determine that the first node trains, according to local data, the AI model updated using the second model parameter update information.
  • the receiving module and the sending module may be modules of an application program deployed on a cloud platform or an edge computing platform.
  • the receiving module 1701 and the sending module 1702 are client modules of the APP; the APP also includes a server module, and the server module is configured to receive the first model parameter update information from the first node.
  • the first model parameter update information is sent after successful verification of model parameters of the trained AI model according to the model training configuration information.
  • the model training configuration information further includes a training result accuracy threshold, which is used to indicate the training result accuracy with which the first node trains the AI model according to the local data and the model training configuration information.
  • the apparatus may further include an acquisition module and an update module:
  • an acquisition module, configured to acquire the recognition result and packet feature rules obtained by identifying data streams based on the AI model updated using the second model parameter update information;
  • an update module, configured to update the service-aware (SA) feature library according to the packet feature rules.
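The update module's role can be sketched as below: packet feature rules identified by the updated AI model are merged into the SA feature library. The rule representation (category mapped to a set of packet features) and all example values are assumptions for illustration.

```python
# Sketch of updating the service-aware (SA) feature library with packet
# feature rules recognized by the updated AI model.

def update_sa_library(sa_library, recognized_rules):
    for category, features in recognized_rules.items():
        sa_library.setdefault(category, set()).update(features)
    return sa_library

sa_library = {"VoIP": {"port:5060"}}                       # existing library
rules_from_model = {"VoIP": {"tls-sni:example-voip"},      # hypothetical rules
                    "Video": {"port:1935"}}
sa_library = update_sa_library(sa_library, rules_from_model)
```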
  • each functional unit in each embodiment of the present application may be integrated into one processor, or physically exist separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • FIG. 18 shows the structure of a model training device provided by the embodiment of the present application.
  • the model training apparatus 1800 may be deployed on the central node shown in FIG. 1 or FIG. 2 , or deployed on the FLS in FIG. 5 .
  • the model training apparatus 1800 may be deployed on the first node shown in FIG. 1 or FIG. 2 , or deployed on the FLC in FIG. 5 .
  • the model training device 1800 may include a communication interface 1810 and a processor 1820 .
  • the model training device 1800 may also include a memory 1830.
  • the memory 1830 may be located inside the model training device, or may be located outside the model training device.
  • the functions implemented by the model training apparatus in the above embodiments can all be implemented by the processor 1820 .
  • the processor 1820 receives the data stream through the communication interface 1810, and is used to implement the model training method described in any of the foregoing embodiments.
  • each step of the processing flow can be completed by an integrated hardware logic circuit in the processor 1820 or by instructions in the form of software, to implement the model training method described in any of the above embodiments.
  • the program codes executed by the processor 1820 to implement the model training method described in any of the foregoing embodiments may be stored in the memory 1830 .
  • the memory 1830 is coupled to the processor 1820 .
  • the processors involved in the embodiments of this application may be general-purpose processors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, modules or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, modules or modules.
  • the processor may operate in conjunction with the memory.
  • the memory can be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
  • the memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the embodiment of the present application does not limit the specific connection medium among the communication interface, the processor, and the memory.
  • the memory, the processor, and the communication interface can be connected through a bus.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • an embodiment of this application also provides a computer storage medium in which a software program is stored; when the software program is read and executed by one or more processors, the model training method of any of the above embodiments can be implemented.
  • the computer storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.
  • the embodiments of the present application further provide a computer program product containing instructions, which, when run on a computer, cause the computer to execute the model training method provided by any one of the above embodiments.
  • an embodiment of the present application further provides a chip, the chip includes a processor, configured to implement the functions of the model training method provided by any one or more of the above embodiments.
  • the chip further includes a memory for necessary program instructions and data executed by the processor.
  • the chip system may consist of a chip, or may include a chip and other discrete devices.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implements the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.

Abstract

A model training method, apparatus, and system, used to solve the problem of the low recognition rate of an AI model. The method includes: a central node sends a model training message to at least two first nodes, the model training message including an AI model and model training configuration information. The central node receives, from the at least two first nodes, at least two pieces of model parameter update information obtained after training the AI model according to the local data of the corresponding first node and the model training configuration information. The central node sends, to a first node of the at least two first nodes, second model parameter update information obtained according to the at least two pieces of first model parameter update information, the second model parameter update information being used to update the model parameters of the AI model of the first node. The method can maximize the recognition rate of the AI model while meeting data privacy protection requirements.

Description

Model training method, apparatus, and system
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202110970462.6, filed with the China National Intellectual Property Administration on August 23, 2021 and entitled "Model training method, apparatus, and system", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of this application relate to the field of communication technologies, and in particular, to a model training method, apparatus, and system.
Background
With the rapid development of the mobile Internet in recent years, new applications such as augmented reality (AR)/virtual reality (VR) and 4K high-definition video have kept emerging, bringing explosive growth in mobile data services. What operators urgently need is differentiated charging based on service content.
To implement differentiated charging based on service content, the service content needs to be identified. Currently, service awareness (SA) technology is used to identify service content. SA technology, built on the analysis of packet headers, can deeply analyze the layer-4 to layer-7 protocol features carried in data packets, and is a detection and control technology based on application-layer information.
SA can be made intelligent and automated based on artificial intelligence (AI) models. However, AI-model-based SA must be trained on a large number of samples before recognition. If the local sample data of the individual nodes is transferred among them, data privacy protection requirements are violated; if the number of samples is too small, the recognition rate for service content is low.
Based on this, how to maximize the recognition rate of the AI model while meeting data privacy protection requirements has become an urgent problem to be solved.
Summary
Embodiments of this application provide a model training method, apparatus, and system, to maximize the recognition rate of an AI model while meeting data privacy protection requirements.
According to a first aspect, an embodiment of this application provides a model training method, which may be performed by a central node. The method includes: the central node sends a model training message to at least two first nodes, where the model training message includes an artificial intelligence (AI) model and model training configuration information, and the AI model is used to identify the category to which a data stream belongs; the central node receives at least two pieces of first model parameter update information from the at least two first nodes, where the first model parameter update information is model parameter update information obtained after training the AI model according to the local data of the first node corresponding to the first model parameter update information and the model training configuration information; and the central node sends second model parameter update information to a first node of the at least two first nodes, where the second model parameter update information is obtained according to the at least two pieces of first model parameter update information and is used to update the model parameters of the AI model of the first node.
In the above method, because the first model parameter update information is obtained by training at each first node using local data and the model training configuration information, data privacy protection requirements can be met; and because the second model parameter update information is obtained according to the at least two pieces of first model parameter update information of the at least two first nodes, the recognition rate of the AI model updated with the second model parameter update information is higher, so the recognition rate of the AI model can be maximized.
In a possible design, the category to which the data stream belongs includes at least one of the following: the application to which the data stream belongs; the type or protocol to which the service content of the data stream belongs; or the packet feature rules of the data stream.
In the above design, the AI model can be used to identify the application to which a data stream belongs; because the identification is performed by the AI model without manual involvement, the ability to identify the application to which the data stream belongs can also be improved. Alternatively, the AI model can be used to identify the type or protocol to which the service content of the data stream belongs; because the identification is performed by the AI model without manual involvement, the ability to identify the type or protocol can also be improved. Alternatively, the AI model can be used to identify the feature rules of the data stream; because the identification of the packet feature rules is performed by the AI model without manual involvement, the ability to identify the packet feature rules of the data stream can also be improved.
In a possible design, after the central node receives the at least two pieces of first model parameter update information from the at least two first nodes, the method further includes: the central node sends the second model parameter update information and the AI model to a second node, where the second model parameter update information is used to update the model parameters of the AI model of the second node.
In the above design, the model parameters of the AI model of the second node are updated using the second model parameter update information obtained according to the at least two pieces of first model parameter update information produced by the at least two first nodes training the AI model with local data and the model training configuration information; the AI model of the second node thus incorporates the at least two pieces of first model parameter update information of the at least two first nodes, so the recognition rate of the AI model of the second node can be maximized.
In a possible design, the model training configuration information further includes a training result accuracy threshold, which is used to indicate the training result accuracy with which the first node trains the AI model according to local data and the model training configuration information.
In the above design, by setting the training result accuracy threshold in the model training configuration information, model training can be stopped when the accuracy of the AI model trained by the first node reaches the threshold, so the central node can control the training result accuracy of the AI model on the first node.
According to a second aspect, an embodiment of this application provides a model training method, which may be performed by a first node. The method includes: the first node receives a model training message, where the model training message includes an AI model and model training configuration information, and the AI model is used to identify the category to which a data stream belongs; the first node sends first model parameter update information, where the first model parameter update information is model parameter update information obtained after training the AI model according to the local data of the first node and the model training configuration information; and the first node receives second model parameter update information, where the second model parameter update information is obtained according to at least two pieces of first model parameter update information of at least two first nodes and is used to update the model parameters of the AI model of the first node.
In the above method, because the first node trains the AI model according to local data and the model training configuration information, data privacy protection requirements can be met; and because the second model parameter update information is obtained according to the at least two pieces of first model parameter update information of the at least two first nodes, the recognition rate of the AI model updated with the second model parameter update information is higher, so the recognition rate of the AI model can be maximized.
In a possible design, the category to which the data stream belongs includes at least one of the following: the application to which the data stream belongs; the type or protocol to which the service content of the data stream belongs; or the packet feature rules of the data stream.
In the above design, the AI model can be used to identify the application to which a data stream belongs; because the identification is performed by the AI model without manual involvement, the ability to identify the application to which the data stream belongs can also be improved. Alternatively, the AI model can be used to identify the type or protocol to which the service content of the data stream belongs; because the identification is performed by the AI model without manual involvement, the ability to identify the type or protocol can also be improved. Alternatively, the AI model can be used to identify the feature rules of the data stream; because the identification of the packet feature rules is performed by the AI model without manual involvement, the ability to identify the packet feature rules of the data stream can also be improved.
In a possible design, the method further includes: the first node determines to train, according to local data, the AI model updated with the second model parameter update information.
In the above design, the first node uses local data to train the AI model updated with the second model parameter update information, so that the finally trained AI model has a higher recognition rate for local data.
In a possible design, the method is performed by an application (APP) deployed on a cloud platform or an edge computing platform.
In the above design, the method performed by the first node can be executed by an APP deployed on the cloud platform or edge computing platform; this decouples the APP from the cloud platform or edge computing platform and minimizes changes to the existing cloud platform or edge computing platform.
In a possible design, after the model training message is received and before the first model parameter update information is sent, the method further includes: receiving the first model parameter update information from the first node through the server module of the APP; and the sending of the first model parameter update information includes: sending the first model parameter update information through the client module of the APP.
In the above design, a server module and a client module are set in the APP; communication with the first node is implemented through the server module, and communication with the outside is implemented through the client module, giving full play to the APP's role in information transfer.
In a possible design, the first model parameter update information is obtained after the model parameters of the trained AI model are successfully verified according to the model training configuration information.
In the above design, verifying the model parameters of the trained AI model before sending the first model parameter update information ensures that the model parameters of the AI model are consistent before and after training, which avoids, as far as possible, a degraded training effect caused by inconsistent model parameters before and after training.
In a possible design, the model training configuration information further includes a training result accuracy threshold, which is used to indicate the training result accuracy with which the first node trains the AI model according to local data and the model training configuration information.
In the above design, by setting the training result accuracy threshold in the model training configuration information, model training can be stopped when the accuracy of the AI model trained by the first node reaches the threshold, so the central node can control the training result accuracy of the AI model on the first node.
In a possible design, after the second model parameter update information is received, the method further includes: the first node obtains the recognition result and packet feature rules obtained by identifying data streams based on the AI model updated with the second model parameter update information, and updates the service awareness (SA) feature library according to the packet feature rules.
In the above design, updating the SA feature library with the packet feature rules recognized by the AI model obtained through model training can improve the recognition rate of the SA feature library.
According to a third aspect, an embodiment of this application provides a model training apparatus, including modules for performing the first aspect or any possible design of the first aspect, or including modules for performing the second aspect or any possible design of the second aspect.
According to a fourth aspect, an embodiment of this application provides a model training apparatus, including a processor and a memory. The memory stores computer-executable instructions; when running, the processor executes the computer-executable instructions in the memory to use the hardware resources in the controller to perform the operation steps of the method in any possible design of any of the first to second aspects.
According to a fifth aspect, an embodiment of this application provides a model training system, including the model training apparatus provided in the third or fourth aspect.
According to a sixth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
According to a seventh aspect, based on the same inventive concept as the first aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
On the basis of the implementations provided in the above aspects, this application can make further combinations to provide more implementations.
Brief Description of Drawings
Figure 1 is a schematic diagram of a federated learning architecture according to an embodiment of this application;
Figure 2 is a schematic diagram of an application scenario according to an embodiment of this application;
Figure 3 is a schematic flowchart of a model training method according to an embodiment of this application;
Figure 4 is a schematic flowchart of another model training method according to an embodiment of this application;
Figure 5 is a schematic architecture diagram of a model training system according to an embodiment of this application;
Figure 6 is a schematic diagram of the sample data distribution of each first node in scenario 1;
Figure 7 is a schematic diagram of the recognition accuracy of each first node's model and of the federated model in scenario 1;
Figure 8 is a schematic diagram of the sample data distribution of each first node in scenario 2;
Figure 9 is a schematic diagram of the recognition accuracy of each first node's model and of the federated model in scenario 2;
Figure 10 is a schematic diagram of the recall rates of small-sample applications in each first node and the recall rates after federated learning in scenario 2;
Figure 11 is a schematic diagram of the sample data distribution of each first node in scenario 3;
Figure 12 is a schematic diagram of the recognition accuracy of each first node's model and of the federated model in scenario 3;
Figure 13 is a schematic diagram of the recall rates after federated learning for applications with no samples in each first node in scenario 3;
Figure 14 is a schematic diagram of the recognition accuracy of fine-tuning the federated model versus retraining after initialization in scenario 4;
Figure 15 is a schematic diagram of the recognition accuracy for new applications of the models obtained by fine-tuning the federated model and by retraining after initialization in scenario 4;
Figure 16 is a schematic structural diagram of a model training apparatus according to an embodiment of this application;
Figure 17 is a schematic structural diagram of another model training apparatus according to an embodiment of this application;
Figure 18 is a schematic structural diagram of another model training apparatus according to an embodiment of this application.
Detailed Description
SA technology based on AI algorithms can make recognition technology intelligent and automated. For example, the main flow of AI-based traffic identification includes: a training phase, in which traffic is captured and preprocessed as input to the AI model; and a testing phase, in which the preprocessed traffic is fed into the AI model for classification, and the traffic category with the highest probability in the classifier output is taken as the final traffic prediction result.
However, AI-based SA technology has the following disadvantages:
First, a large amount of labeled data is needed for model training; the more complex the content to be identified and the wider the identification scope, the larger the required data volume. The generalization ability of model recognition is also related to the amount of training data; moreover, the training data must be privacy-protected, remain local only, and must not be transferred elsewhere for use. For example, if the traffic of an APP (application) that a node needs to identify is too small (a small sample), the node will either lack the ability to identify this APP, or have poor generalization and low recognition accuracy.
Second, AI model training and inference consume a great deal of processor performance. For example, if a single node has large performance demands and the training time is too long, fast model iteration is affected, and the recognition capability cannot be updated quickly.
Third, APP traffic has regional distribution characteristics; even the same APP has different traffic characteristics in different regions. For example, when traffic characteristics change rapidly, the ability of node A's AI model to identify a new APP cannot be transferred to node B. In addition, under privacy protection, node data cannot be exported for model training, and data collected by means such as dial testing cannot meet the training requirements.
In view of this, embodiments of this application provide a model training method, apparatus, and system, to maximize the recognition rate of the AI model while meeting data privacy protection requirements.
在对本申请实施例进行详细的解释说明之前,先对本申请实施例涉及的系统架构进行介绍。
图1为本申请实施例提供的一种联邦学习的架构示意图。为了便于理解,先结合图1对联邦学习的场景和过程进行示例性说明。
联邦学习是一种加密的分布式机器学习技术,它是指参与联邦学习的各方在不共享本地数据的前提下共建AI模型。其核心是:参与方在本地对AI模型进行模型训练,然后仅将模型更新部分加密上传到协调方节点,并与其它参与方的模型更新部分进行汇聚整合,形成一个联邦学习模型,这个联邦学习模型再由云端下发给各参与方,通过反复的本地训练以及反复的整合,最终得到一个更好的AI模型。
参阅图1,联邦学习的场景中可以包括协调方节点和多个参与方节点。协调方节点为联邦学习过程中的协调者,该协调方节点可以部署在云端,参与方节点为联邦学习过程的参与者,也为数据集的拥有者。为了便于理解和区分,本申请实施例中将协调方节点称为中心节点(比如在图1中标记为110),将参与方节点称为第一节点(比如在图1中标记为120和121)。
中心节点110和第一节点120、121可以是支持数据传输的任意节点(如网络节点)。例如,中心节点可以是服务器(server),或称参数服务器,或称汇聚服务器。第一节点可以是客户端(client),如移动终端或个人电脑等。
中心节点110可以用于维护联邦学习模型。第一节点120、121可以从中心节点110获取联邦学习模型,并结合本地训练数据集进行本地训练,得到本地模型。在训练得到本地模型之后,第一节点120、121可以将该本地模型发送给中心节点110,以便中心节点110更新或优化该联邦学习模型。如此往复,经过多轮迭代,直到联邦学习模型收敛或达到预设的迭代停止条件(例如达到最大次数或者达到最长训练时间)。
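上述"本地训练—上传模型更新—中心汇聚—下发共享模型"的迭代过程，可以用如下Python草图示意。这只是在上述描述假设下的示意性实现：模型以一维numpy数组表示，local_train是本文为说明假设的占位训练函数，并非真实的AI模型训练：

```python
import numpy as np

def local_train(model, data):
    # 假设的本地训练:用一个依赖本地数据均值的固定步长代替真实的梯度下降
    return model + 0.1 * data.mean(axis=0)

def federated_round(global_model, client_datasets):
    """一轮联邦学习:各参与方仅上传模型更新部分(实际系统中还会加密),
    中心节点按样本量加权整合后得到新的联邦学习模型。"""
    n_total = sum(len(d) for d in client_datasets)
    aggregated = np.zeros_like(global_model)
    for data in client_datasets:
        local_model = local_train(global_model.copy(), data)
        update = local_model - global_model            # 仅模型更新部分,本地数据不出节点
        aggregated += (len(data) / n_total) * update   # 按样本量占比加权整合
    return global_model + aggregated                   # 汇聚后的联邦学习模型

# 两个参与方节点各自持有本地数据集(示例数据)
clients = [np.ones((10, 3)), 3.0 * np.ones((30, 3))]
model = np.zeros(3)
for _ in range(5):    # 反复本地训练与整合,直至收敛或达到迭代停止条件
    model = federated_round(model, clients)
```

该草图中每轮只做一次本地更新与一次汇聚，对应上文"反复的本地训练以及反复的整合"的最简形式。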
本申请实施例可以应用到基于SA技术的业务套餐、基于SA技术的业务管理、或者基于SA技术的辅助运营等多种场景下。
例如，针对运营商提出的有业务针对性的数据流量套餐，可以采用本申请实施例提供的方案训练AI模型，然后采用训练后的AI模型对不同应用进行识别，用于后续实施不同的计费和控制策略。
又例如，基于SA技术的业务管理可以包括识别业务内容后的带宽控制、阻塞控制或者业务保证等。比如，以识别业务内容后的阻塞控制为例，某些国家或者地区不允许使用某种类型的软件，例如基于IP的语音传输(voice over internet protocol，VoIP)；VoIP类型的应用种类比较多，并且版本或协议更新频繁，很多应用还是加密的，从而需要SA技术支持对VoIP软件的检测和控制。
又例如，随着网络上内容的不断丰富，运营商急需对网络中传输的内容进行分析；通过对传输流量的分析，运营商可以更好地制定商业运维策略，从而需要SA技术对不同应用的流量进行识别。
如下通过具体应用场景对本申请实施例提供的方案进行示例性说明。
基于上述内容,本申请实施例可以应用于图2所示的场景中。如图2所示,中心节点110可以部署在云端,第一节点120、121可以分别部署在云化平台。另外,第一节点120、121还可以分别部署在边缘计算平台。
中心节点110可以将AI模型下发到第一节点120、121，然后第一节点120、121可以利用本地数据分别进行模型训练，然后上传本地模型参数更新（也可以称为第一模型参数更新信息）至中心节点110，中心节点110将接收的来自第一节点120、121的本地模型参数更新进行汇聚得到共享联邦模型参数更新（也可以称为第二模型参数更新信息），然后下发给第一节点120、121，第一节点120、121分别根据共享联邦模型参数更新来更新本地训练前的AI模型，得到最终的联邦模型。
如图2所示，来自互联网Internet的数据流经过第一节点120、121，分别到达用户A、B，其中第一节点120、121可以分别利用训练好的联邦模型对流经本地的数据流进行SA识别，比如可以识别出数据流所属的应用。例如，一个数据流可能被识别为属于不同的应用，比如数据流1可能被识别为应用1、应用2、应用3、应用4、应用5、应用6，其中应用1的准确度为99%，应用2的准确度为80%，应用3的准确度为78%，应用4的准确度为72%，应用5的准确度为68%，应用6的准确度为40%。一般情况下，经过识别后，该数据流被识别为属于应用1。而在输出分类结果时，可以将识别准确度排名第二至第五的应用名称及其具体准确度也作为分类结果输出。
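上述"取概率最大的类别作为最终预测结果，并将排名靠后的若干候选一并输出"的做法可以示意如下。应用名称与概率沿用正文中的示例数值，classify_output是本文为说明假设的函数名：

```python
def classify_output(probs, top_k=5):
    """probs: {应用名: 识别准确度}。返回概率最大的应用,以及第2至第top_k名候选。"""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0], ranked[1:top_k]

probs = {"应用1": 0.99, "应用2": 0.80, "应用3": 0.78,
         "应用4": 0.72, "应用5": 0.68, "应用6": 0.40}
best, alternatives = classify_output(probs)
# best 为概率最大的最终预测结果;alternatives 为排名第二至第五的应用名称及其准确度
```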
本场景中,可以在保证数据的隐私保护要求情况下,进行AI模型参数更新实现同步,实现不同节点间归一化训练;还可以实现AI模型的分布式训练,模型训练算力分担在不同节点,避免性能瓶颈;还可以针对节点差异化或者个性化数据进行处理,使得识别能力在不同节点间快速赋能,另外在数据样本较少或缺失情况下,也能通过联邦学习获取其他节点的识别能力。
基于上述内容,在另一种可能的场景中,中心节点110可以部署在NAIE上,第一节点120、121还可以分别部署在边缘计算平台。例如,第一节点120、121可以分别部署在运营商A本地数据中心以及运营商B本地数据中心。本场景中,可以在保证原始数据不出本地的情况下,将各数据中心的识别能力进行汇总综合,提升整体的识别能力。另外,还可以实现训练算力的分担,实现对重点应用、重点时间内的识别能力快速迭代。
基于上述内容，本申请实施例提供了一种模型训练方法，该方法可以由图1或图2中的中心节点110和第一节点120或121执行。如图3所示，该方法包括：
S301,中心节点向至少两个第一节点发送模型训练消息;相应的,所述至少两个第一节点中的第一节点接收模型训练消息;
其中,所述模型训练消息包括人工智能AI模型以及模型训练配置信息,所述AI模型用于识别数据流所属类别。
需要说明的是,图3中仅示出了一个第一节点,此仅为示例性显示,并不对本申请实施例进行限制。
其中,所述中心节点可以部署在云端,例如,可以部署在人工智能引擎上。
其中,第一节点可以部署于云化平台或者边缘计算平台,例如,可以部署于云化多业务引擎(clouded multiple service engine,CloudMSE),又例如可以部署于多接入边缘计算(MEC)等。
在一种可能的实现方式中,中心节点或者第一节点可以采用容器服务来实现,也可以通过一个或者多个虚拟机(virtual machine,VM)来实现,或者由一个或者多个处理器来实现、或者由一个或者多个计算机来实现。
另外,所述模型训练配置信息还包括:训练结果精度阈值;
所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
在一种可能的实现方式中，AI模型可以采用神经网络模型等。例如，神经网络模型可以是卷积神经网络(convolutional neural network，CNN)模型、循环神经网络模型等。CNN是一种目前已经被广泛使用于图像识别领域的深度学习网络结构，是一种前馈神经网络，其人工神经元可以响应周围单元。作为一种示例，本申请实施例可以采用CNN模型：CNN模型的卷积层是多层结构，会对原始数据进行多次卷积计算，数据处理过程较为复杂，能够提取到更多、更复杂的流量特征，有助于SA识别。另外，CNN模型的泛化能力强，对流量特征在报文中所处的位置并不敏感，不需要对待识别数据流进行前期的特殊处理，因此对不同网络环境的适应性强。后续描述时，以CNN模型为例。
其中,AI模型以CNN模型为例,参见表1,模型训练配置信息可以包括表1中的必选参数,或者可以包括表1中的必选参数和可选参数,或者可以包括必选参数中的部分参数以及可选参数中的部分参数,例如4号参数和5号参数可以选择一个包括在模型训练配置信息中。
表1:
（表1的具体参数列表在原文中以图片形式给出。）
其中,27号参数是训练的时候使用25号参数动态生成的,28号参数是训练的时候使用26号参数动态生成的。
其中,AI模型支持一定数量的协议,不同的协议可以对应不同的应用或者对应于不同类型的应用。AI模型可以根据需要配置可识别类别的范围。例如,可以根据业务需求灵活为AI模型配置需要识别的协议范围,为了描述方便将配置的协议配置信息称为勾选协议。示例性地,勾选协议可以包括如下信息中的一项或者多项:协议的级别、协议的数量、协议的名称、协议的编号等。勾选协议中还可以包括AI实例编号、AI实例名称。其中,协议的级别、协议的名称、协议的编号可以是提前配置完成的,后续可以不再更改。当然,也可以根据需求对协议的级别、协议的名称、协议的编号等进行更改,本申请实施例对此不作限定。参数3对应的勾选协议ID列表是指勾选协议中包括的各个协议的标识ID列表。
示例性的,AI模型以及模型训练配置信息可以配置文件的形式发送给第一节点。例如,该配置文件可以包括如下三个文件:1、.caffemodel文件,代表初始的AI模型;2、.proto文件,包括AI模型相关参数定义信息,例如上表中的内容;3、*netproto.txt文件,包括AI模型相关参数的参数值,例如上表中各个参数的参数值。其中,*netproto.txt文件,与AI模型匹配,可以用于AI模型的加载和校验。
举例而言,所述数据流所属类别可以包括以下内容中的至少一项:所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;或,所述数据流的报文特征规则等。
例如,根据业务需求,一个数据流可以被识别属于的应用,比如,数据流A可以被识别属于微信,数据流B可以被识别属于youtube,数据流C可以被识别属于爱奇艺,等。
又例如,根据业务需求,一个数据流的业务内容可以被识别属于不同的类型,比如,数据流D的业务内容可以被识别属于视频,具体的可以被识别属于微信的视频;数据流E的业务内容可以被识别属于IP电话,数据流F的业务内容可以被识别属于图片,等。
又例如，根据业务需求，还可以识别一个数据流的业务内容所属的协议。不同的协议可以对应不同的应用或者对应于不同类型的应用。比如，数据流G的业务内容可以被识别属于BT(BitTorrent)协议，数据流H的业务内容可以被识别属于MSN(Microsoft Network，微软网络服务)协议，数据流I的业务内容可以被识别属于SMTP(Simple Mail Transfer Protocol，简单邮件传输协议)，等。
又例如,根据业务需求,AI模型还可以用于提取出数据流的报文特征规则。比如,针对一个Unknown数据流,可以在识别Unknown数据流的过程中进行报文特征规则提取,即从Unknown数据流中的识别过程提取报文特征规则。本申请实施例中,可以在获得特征规则后,将特征规则更新至SA特征库中。
举例而言,中心节点可以对所述模型训练信息进行压缩后再发送,第一节点可以对接收到的模型训练信息进行解压缩得到AI模型和模型训练配置信息;或者中心节点可以对所述模型训练信息进行加密后再发送,第一节点可以对接收到的模型训练信息进行解密后得到AI模型和模型训练配置信息;或者中心节点可以对所述模型训练信息进行压缩和加密后再发送,第一节点可以对接收到的模型训练信息进行解压缩和解密后得到AI模型和模型训练配置信息。
在一种可能的实现方式中,作为协调方的中心节点可以从多个第一节点中选择至少两个第一节点参与联邦学习,并将模型训练信息发送给所述至少两个第一节点。例如,第一节点可以预先向中心节点进行注册,中心节点从注册的多个第一节点中选择至少两个第一节点参与联邦学习。又例如,中心节点可以随机选择至少两个第一节点或者根据预设规则选择至少两个第一节点,比如中心节点可以根据多个第一节点的数据分布信息,从多个第一节点中选择存储有符合训练需求的业务数据的至少两个第一节点,参与到与AI模型的模型训练过程中。
示例性的,当注册成功后,第一节点还可以定时或者实时向中心节点发送心跳消息;该心跳消息可以包括第一节点的状态信息。
另外,中心节点可以通过训练任务的形式将AI模型以及模型训练配置信息发送给至少两个第一节点。例如,中心节点可以将训练任务信息发送给至少两个第一节点,该训练任务信息包括AI模型以及模型训练配置信息。例如,在将训练任务信息发送给所述至少两个第一节点之前,中心节点可以先向第一节点发送训练任务通知消息,该训练任务通知消息可以包括任务标识ID,然后接收到该训练任务通知消息的第一节点可以向中心节点发送训练任务查询消息,该训练任务查询消息可以包括任务ID,然后中心节点接收到训练任务查询消息后,再将训练任务信息发送给第一节点。
其中,在向至少两个第一节点发送模型训练信息之前,中心节点还可以进行初始化设置,该初始化设置包括:选择参与联邦学习的第一节点;设置汇聚算法;设置模型训练的迭代次数等参数,例如表1中的参数;建立联邦学习实例,并将该实例类型选择为SA,例如将该实例选择为进行AISA流量识别;初始化一个AI模型,例如初始化一个AISA流量识别模型,或者将已经完成预训练的模型参数和权重注入一个AI模型。初始化一个AI模型是指为该AI模型注入初始的模型参数和权重。
S302,所述至少两个第一节点中的第一节点根据本地数据和模型训练配置信息对AI模型进行训练;
示例性的,本地数据可以是第一节点本地通过SA引擎对采集到的数据流进行特征识别,得到数据流的分类结果后,根据数据流的分类结果对数据流进行标签标注得到的数据。
其中，第一节点可以根据接收到的模型训练配置信息先对AI模型进行加载，例如可以根据表1中的参数对AI模型进行加载，比如根据表1中的1、3至4、10至14、16至30号等参数对AI模型进行加载；然后根据本地数据和模型训练配置信息对AI模型进行训练，例如可以根据表1中的3、8、15、31、32号等参数对AI模型进行训练。例如，对所述AI模型进行训练的过程可以包括：将本地数据按照预定大小B(batchsize)划分为K个批次batch；基于K个batch对AI模型训练K次。B的取值为表1中31号参数的参数值。当表1中的8号参数(训练次数)的参数值为1时，上述训练过程执行1次即可；当该参数值为2时，则上述训练过程需要执行2次，以此类推。
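上述"将本地数据按预定大小B划分为K个batch、并按训练次数重复整个批次训练过程"的步骤可以示意如下。其中train_on_batch只是假设的占位调用，此处仅统计训练步数：

```python
def split_batches(samples, batch_size):
    """将本地数据按预定大小B(batchsize)划分为K个批次,末尾不足一个batch的样本单独成批。"""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

def train(samples, batch_size, epochs):
    """epochs对应表1中训练次数参数:整个批次训练过程重复执行epochs次。"""
    steps = 0
    for _ in range(epochs):
        for batch in split_batches(samples, batch_size):
            steps += 1      # 此处应调用假设的 train_on_batch(model, batch)
    return steps

batches = split_batches(list(range(10)), 4)   # K=3:大小为4、4、2的三个批次
steps = train(list(range(10)), 4, 2)          # 训练次数为2时共执行6个训练步
```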
另外,模型训练配置信息还包括:训练结果精度阈值;训练结果精度阈值用于指示第一节点根据本地数据和模型训练配置信息对AI模型进行训练的训练结果精度。例如表1中的15号参数为训练结果参数,该参数中可以包括训练结果精度阈值。当第一节点训练的AI模型的精度达到该训练结果精度阈值时,就可以停止训练,如此中心节点可以预先设置训练结果精度阈值,然后包括在模型训练配置信息中下发给第一节点,实现对第一节点的模型训练进行控制。训练结果精度是指识别成功的样本占识别出的样本的比重,比如,总样本为100,识别出的样本为90,在识别出的样本90中识别成功的样本为80,则精度为80/90。
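正文对训练结果精度（以及后文的训练结果召回率）给出的口径，可以用如下小例复核：

```python
def precision(correct, recognized):
    """训练结果精度:识别成功的样本占识别出的样本的比重。"""
    return correct / recognized

def recall(recognized, total):
    """训练结果召回率:识别出的样本占总样本的比重。"""
    return recognized / total

# 总样本为100,识别出的样本为90,其中识别成功的样本为80
p = precision(80, 90)    # 精度 80/90
r = recall(90, 100)      # 召回率 90/100
```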
示例性的,以采用有监督学习方式训练CNN为例,模型训练过程主要包括:训练数据集的收集和标注阶段,训练阶段、以及验证阶段。
训练数据集的收集和标注阶段中,本申请实施例中可以通过SA引擎来获得用于训练CNN模型的训练集,无需人工的参与,不仅可以提高效率,还可以减少人力资源。深度学习训练依赖有标签标注的大数据集,大数据集不仅需要标注标签还需要定期更新,确保能够学习到新的特征。例如,可以基于SA引擎的识别成功的结果,对待训练报文标注标签,形成训练数据集,还可以定期更新数据集。
训练CNN可以采用BP算法，BP即Back Propagation，也称Error Back Propagation，含义为误差反向传播。BP算法的基本思路是前向传播和反向传播。前向传播中，输入样本从输入层传入，经各隐含层逐层处理后，传向输出层；若输出层的实际输出与期望输出不符，则转入误差的反向传播。反向传播将输出误差以某种形式通过各隐含层向输入层逐层反传，并将误差分摊给各层的所有单元，从而获得各层单元的误差，此误差作为修正各单元权重的依据。网络学习在权值的修改过程中完成，误差达到所期望值时，网络学习结束。对上述两个环节反复循环迭代，直到网络对输入的响应达到预定的目标范围为止。
例如,CNN可以包括输入层、隐含层以及输出层。隐含层可以包括多层,本申请对此不作限定。输入的训练样本经输入层输入待训练模型,经过隐含层的逐层处理后,传向输出层。若输出层的实际输出与期望输出不符,则转入误差的反向传播。反向传播即将输出以某种形式通过逐层向输入层逐层反传,并将误差分摊给各层,从而获得各层的误差,此误差作为修正各层权重的依据。训练过程也就是经过多次的权值的修改最后得到CNN模型。误差达到所期望值时,训练过程即结束。
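上述"前向传播—误差反向传播—修改权值"的循环，可以用一个最小的线性单元示意。这只是示意性草图，并非CNN的真实实现，其中的数据与学习率均为假设值：

```python
import numpy as np

# 单个线性单元:y_pred = w * x,期望输出 y = 2 * x
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
w = 0.0
lr = 0.1
for _ in range(100):
    y_pred = w * x              # 前向传播:输入经处理后传向输出
    err = y_pred - y            # 实际输出与期望输出之差
    grad = (err * x).mean()     # 误差反向传播得到的梯度
    w -= lr * grad              # 以误差为依据修正权重
# 误差达到所期望值时训练结束;此处 w 收敛到期望值 2 附近
```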
本申请实施例可以采用交叉验证的方式进行模型的训练，训练数据集可以分为两部分，一部分用于训练模型（作为训练样本），另一部分用于验证网络模型的准确程度（验证样本）。当一个CNN模型训练完成后，使用验证样本验证训练的模型是否能够准确地识别数据流，给出识别准确率。当识别准确率达到设定阈值，则可以确定CNN模型可以作为用于后续特征识别的模型。当识别准确率未达到设定阈值，可以继续进行训练，直到识别准确率达到设定阈值。
进一步地,在完成模型训练和验证得到AI模型后,可以根据训练和验证得到的AI模型对Unknown数据流进行特征识别得到分类结果。
示例性的,以采用有监督学习方式训练CNN为例,模型训练过程还可以包括推理识别阶段。例如,在推理识别过程中,输入没有标注标签的数据集(报文)即待识别数据集,给出数据集(报文)的标签分类,即识别结果。示例性的,待识别数据集可以分两部分:一部分是SA引擎已标注识别结果的数据报文,可以用于SA引擎与AI模型之间的功能相互印证,证明AI模型识别的准确度;另一部分是没有被SA引擎标注识别结果的数据报文,可以用于寻找体现SA引擎与AI模型的能力差异的报文。
S303,所述至少两个第一节点中的第一节点确定第一模型参数更新信息;
其中,所述第一模型参数更新信息是根据第一节点的本地数据和模型训练配置信息对AI模型训练后的模型参数更新信息。
举一个例子,第一节点可以将收到的AI模型common model本地备份为old_model,然后进行训练后得到的模型为new_model,计算第一模型参数更新信息grad_n=new_model-old_model。
例如，计算第一模型参数更新信息的过程包括：根据模型结构，将训练后的AI模型的模型参数展开为一维数组Y，将训练前的AI模型的模型参数展开为一维数组X，将Y减去X得到的一维数组Z作为第一模型参数更新信息。例如，假设数组X为(1,2,3,4,5,6)，数组Y为(2,3,4,5,6,7)，则数组Z=Y-X，为(1,1,1,1,1,1)。
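上述将模型参数按模型结构展开为一维数组并作差的计算可以示意如下。各层参数的形状为本文假设的示例：

```python
import numpy as np

def flatten_params(layers):
    """根据模型结构,将各层参数依次展开并拼接为一维数组。"""
    return np.concatenate([layer.ravel() for layer in layers])

# old_model为训练前备份的模型参数,new_model为本地训练后的模型参数(假设数据)
old_model = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([5.0, 6.0])]
new_model = [np.array([[2.0, 3.0], [4.0, 5.0]]), np.array([6.0, 7.0])]

X = flatten_params(old_model)   # (1,2,3,4,5,6)
Y = flatten_params(new_model)   # (2,3,4,5,6,7)
grad_n = Y - X                  # 第一模型参数更新信息:(1,1,1,1,1,1)
```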
S304,所述至少两个第一节点中的第一节点发送第一模型参数更新信息;相应的,中心节点接收所述至少两个第一节点的至少两个第一模型参数更新信息;
其中,所述第一模型参数更新信息是根据第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息;
示例性的,所述第一模型参数更新信息是根据所述模型训练配置信息,对训练后的所述AI模型的模型参数校验成功之后发送的。
另外,第一节点可以对第一模型参数更新信息进行压缩后发送给中心节点,中心节点解压缩后得到第一模型参数更新信息;或者第一节点对第一模型参数更新信息进行加密后发送给中心节点,中心节点解密后得到第一模型参数更新信息;或者第一节点对第一模型参数更新信息进行压缩和加密后发送给中心节点,中心节点解压缩以及解密后得到第一模型参数更新信息。
举一个例子,第一模型参数更新信息是根据模型训练配置信息,对训练后的AI模型的模型参数校验成功之后发送的。也就是,第一节点计算得到第一模型参数更新信息之后,需要校验训练后的AI模型和训练前的AI模型的结构是否一致,当结构一致时再将第一模型参数更新信息发送给中心节点,当结构不一致时报错,如此可以保证训练前后的AI模型的结构完全一致,避免影响训练效果。
例如,可以根据模型训练配置信息对训练前后的AI模型的结构是否一致进行校验。比如表1中的3至5、10至13、16至30号参数中的部分或者全部可以作为校验参数,也就是比较训练前后的AI模型的这些参数是否一致,当这些参数完全一致时,确定训练前后的AI模型的结构一致;当这些参数不完全一致时,确定训练前后的AI模型的结构不一 致。
另外,第一节点还可以同时向中心节点发送其他参数值,例如训练耗费时间、训练的数据量、训练结果精度、训练结果召回率等。训练结果召回率是指识别出的样本占总样本的比重,例如总样本为100,识别出的样本为90,则召回率为90/100。
其中,第一节点可以任务执行结果的方式将第一模型参数更新信息发送给中心节点。例如,第一节点向中心节点发送训练任务的执行结果信息;该执行结果信息包括:任务ID、任务执行成功、第一模型参数更新信息,或者还可以包括训练耗费时间、训练的数据量、训练结果精度、训练结果召回率等。又例如,当训练失败时,该执行结果信息可以包括:任务ID、任务执行失败,失败原因等。该执行结果信息可以在压缩和/或加密后发送。
S305,中心节点采用预设的汇聚算法对至少两个第一模型参数更新信息进行汇聚得到第二模型参数更新信息;
举例而言，预设的汇聚算法可以是平均算法、加权平均算法、FedAvg(Federated Averaging)算法、或者随机方差缩减梯度SVRG(stochastic variance reduced gradient)算法等。下面以加权平均算法为例对汇聚计算进行介绍：
例如，假设有K个分布式第一节点，每个第一节点的数据集是 $\mathcal{P}_k$，即 $(x_i, y_i)$，$i\in\mathcal{P}_k$，样本量 $n_k=|\mathcal{P}_k|$，各个第一节点的总样本量为：

$$n=\sum_{k=1}^{K}n_k \tag{1}$$

全局目标函数为各样本损失的平均，其中 $f_i(\omega)$ 为样本 $i$ 对应的损失：

$$f(\omega)=\frac{1}{n}\sum_{i=1}^{n}f_i(\omega) \tag{2}$$

第一节点 $k$ 的本地目标函数为：

$$F_k(\omega)=\frac{1}{n_k}\sum_{i\in\mathcal{P}_k}f_i(\omega) \tag{3}$$

假设t时刻中心节点将共享模型参数 $\omega_t$ 下发到各第一节点，第一节点的模型更新采用梯度下降：

$$\omega_{t+1}^{k}=\omega_t-\alpha g_k \tag{4}$$

中心节点模型的更新过程可以采用模型聚合的方式：

$$\omega_{t+1}=\sum_{k=1}^{K}\frac{n_k}{n}\omega_{t+1}^{k} \tag{5}$$

另外，也可以采用各第一节点的模型更新量 $\Delta\omega_t^{k}$ 来聚合模型：

$$\Delta\omega_t^{k}=\omega_{t+1}^{k}-\omega_t \tag{6}$$

$$\omega_{t+1}=\omega_t+\sum_{k=1}^{K}\frac{n_k}{n}\Delta\omega_t^{k} \tag{7}$$

后续中心节点使用 $\omega_{t+1}$ 更新模型，再下发给各第一节点，实现利用联邦学习机制对Local模型的汇聚增强，实现业务目标。

上述公式(1)至公式(7)中，$i$ 表示样本编号，取值为1至 $n$，$n$ 表示总样本数量；小写 $k$ 表示第一节点的编号，取值为1至大写 $K$，$K$ 表示第一节点的数量，$n_k$ 表示第 $k$ 个第一节点的样本数量；$t$ 表示时刻，$t+1$ 表示下一时刻；$\omega_t$ 与 $\omega_{t+1}$ 分别表示 $t$ 时刻与 $t+1$ 时刻的模型参数，$\omega_{t+1}^{k}$ 表示 $t+1$ 时刻第一节点 $k$ 的模型参数；$f_i(\omega)$ 表示样本 $i$ 对应的损失，$F_k(\omega)$ 表示第一节点 $k$ 的本地目标函数，$F_k(\omega_t)$ 表示其在 $t$ 时刻的取值，$f(\omega)$ 表示汇聚后的全局目标函数，$f(\omega_t)$ 表示其在 $t$ 时刻的取值；$\alpha$ 表示预设的系数（学习率），$g_k$ 表示梯度 $\nabla F_k(\omega_t)$。
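上文两种聚合方式（直接按样本量加权聚合各节点模型，或先取各节点的模型更新量再聚合）在数值上等价，可以用如下示例验证。节点数量与参数取值均为假设数据：

```python
import numpy as np

n_k = np.array([10.0, 30.0, 60.0])     # 各第一节点的样本量,总样本量 n=100
n = n_k.sum()
w_t = np.array([1.0, 2.0])             # t时刻下发的共享模型参数
w_next = np.array([[1.5, 2.5],         # 各节点本地梯度下降后的模型参数
                   [0.5, 1.5],
                   [2.0, 3.0]])

# 方式一:直接对各节点模型按样本量占比加权聚合
agg_model = (n_k[:, None] / n * w_next).sum(axis=0)

# 方式二:先取各节点的模型更新量,再加权聚合到共享模型参数上
delta = w_next - w_t
agg_update = w_t + (n_k[:, None] / n * delta).sum(axis=0)
```

两种方式结果相同，因为各节点的权重之和为1，共享模型参数项在两式中相互抵消。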
S306,中心节点向所述至少两个第一节点中的第一节点发送第二模型参数更新信息;相应的,所述至少两个第一节点中的第一节点接收第二模型参数更新信息。
其中,所述第二模型参数更新信息是根据所述至少两个第一模型参数更新信息得到的,所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数。
示例性的,上述由第一节点执行的方法步骤可以通过部署于云化平台或者边缘计算平台的应用程序APP执行。
在一种可能的实现方式中,所述接收模型训练消息之后,发送第一模型参数更新信息之前,所述方法还包括:通过所述APP的服务端模块接收来自所述第一节点的所述第一模型参数更新信息;所述发送第一模型参数更新信息,包括:通过所述APP的客户端模块发送所述第一模型参数更新信息。
在本申请另一个实施例中,所述第一节点接收第二模型参数更新信息之后,所述方法还可以包括:第一节点获取基于采用所述第二模型参数更新信息更新后的所述AI模型,对数据流进行识别的识别结果和报文特征规则;第一节点根据所述报文特征规则更新业务感知SA特征库。
另外,中心节点可以对第二模型参数更新信息进行压缩后再发送,第一节点解压缩后得到第二模型参数更新信息;或者中心节点可以对第二模型参数更新信息进行加密后再发送,第一节点解密后得到第二模型参数更新信息;或者中心节点可以对第二模型参数更新信息进行压缩和加密后再发送,第一节点进行解压缩和解密后得到第二模型参数更新信息。
S307,所述至少两个第一节点中的第一节点根据第二模型参数更新信息更新本次训练前的AI模型。
例如，假设所述第二模型参数更新信息为一维数组W，W为(4,5,6,7,8,9)，训练前的AI模型的模型参数展开为一维数组X，X为(1,2,3,4,5,6)，则计算得到的一维数组V=W+X为(5,7,9,11,13,15)，然后按照该数组还原出AI模型的模型参数，根据还原出的模型参数对训练前的AI模型进行更新。
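上述根据第二模型参数更新信息逐元素相加并还原模型参数的过程可以示意如下。各层的形状为本文假设的示例：

```python
import numpy as np

W = np.array([4, 5, 6, 7, 8, 9])   # 第二模型参数更新信息(一维数组)
X = np.array([1, 2, 3, 4, 5, 6])   # 训练前AI模型参数的一维展开
V = W + X                          # 逐元素相加得到更新后的模型参数

# 按模型结构将V还原为各层参数,例如一层2x2与一层长度2(结构为假设)
layer1, layer2 = V[:4].reshape(2, 2), V[4:]
```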
在一种可能的实现方式中，上述步骤S302至S307的过程可以迭代进行，直至满足模型收敛条件，结束联邦学习，得到最终的AI模型。
另外,在上述过程中,第一节点可以定时或者实时向中心节点发送心跳消息,所述心跳消息可以包括训练任务的状态信息,例如接收,运行中,完成,失败等。
还有,联邦学习结束之后,第一节点如果需要下线,可以向中心节点发送去注册请求消息,中心节点去注册所述第一节点,并向所述第一节点发送去注册成功响应消息,所述第一节点下线成功。
在本申请另一个实施例中,在上述内容的基础上,所述方法还可以包括:
第一节点根据本地数据和模型训练配置信息对更新后的AI模型进行训练。
如此,第一节点利用本地数据对联邦学习后的AI模型进行训练,既可以利用联邦学习提高AI模型的识别率,又可以使得训练后的AI模型更加适应于本地数据,进一步提高识别率,满足本地业务需求。
在本申请另一个实施例中,在上述内容的基础上,所述方法还可以包括:
第一节点可以获取基于更新后的所述AI模型对数据流进行识别的识别结果和报文特征规则；第一节点根据所述报文特征规则更新业务感知SA特征库。
其中,特征规则可以用于实现SA的快速识别,采用SA特征库的方式相比AI模型来说,速率更高,本申请实施例中将获取来的报文特征规则快速补充至SA特征库,将会保证SA具备对满足补充的特征规则的这部分流量的快速识别能力,后续无须再经过AI模型识别。另外,在采用AI模型识别数据流的过程中,进行报文特征规则提取,满足产品自动化运维要求,不再需要人工参与对SA特征库进行升级,提升效率,降低成本。
在本申请另一个实施例中,在上述内容的基础上,所述方法还可以包括:
中心节点向第二节点发送第二模型参数更新信息和AI模型;其中,第二模型参数更新信息用于更新第二节点的所述AI模型的模型参数。第二节点根据第二模型参数更新信息更新AI模型。
其中,所述第二节点为没有参与联邦学习的节点。如此,联邦学习得到的联邦模型,可以直接转为非联邦场景使用,即可以将联邦学习得到的AI模型应用于没有参与联邦学习的第二节点,提高第二节点的AI模型的识别率。
例如,如果第二节点有需求,携带这些参数的模型可以直接导出给非联邦节点(即第二节点)使用,非联邦节点通过读取这些参数,可以完全了解AI模型结构,AI模型可自动运行。此处的这些参数可以包括表1中的4至5、14、16至33号参数。
又例如,第二节点也可以直接修改联邦模型后再使用,方便在不同节点使用不同的模型。比如,如果某些节点(即第二节点)使用联邦模型后业务效果不达预期,则可以修改联邦模型的模型参数,也可以将联邦模型下线,通过修改模型参数,运行在非联邦节点上,实现更佳的业务效果。此处修改的模型参数可以包括表1中的4至5、14、16至33号参数。
本申请实施例提供的技术方案,由于第一模型参数更新信息是分别在第一节点利用本地数据和模型训练配置信息训练得到的,因而可以满足数据的隐私保护要求,而且由于第二模型参数更新信息是根据至少两个第一节点的至少两个第一模型参数更新信息得到的,可以使得利用第二模型参数更新信息更新后的AI模型的识别率更高,从而可以尽量提高AI模型的识别率。另外,本申请实施例还可以实现各个节点之间训练算力的分担,避免单个节点的训练时间过长,性能消耗过大等问题。
在本申请另一个实施例中,在上述内容的基础上,如图4所示,在S301之前,该方法还可以包括:
S401,第一节点向中心节点发送注册请求消息;相应的,中心节点接收该注册请求消息。
需要说明的是,参与模型训练的第一节点可以为多个,图3中仅示出了一个第一节点,此仅为示例性显示,并不对本申请实施例进行限制。
示例性的,所述注册请求消息可以包括所述第一节点的名称、标识等,或者还可以包括第一节点本地数据的数据量等信息。
S402,注册成功后,中心节点向第一节点发送注册成功响应消息;相应的,第一节点接收该注册成功响应消息。
其中,当注册失败时,中心节点可以向第一节点发送注册失败响应消息;相应的,第一节点接收该注册失败响应消息。
其中,注册成功后,第一节点还可以持续向中心节点发送心跳消息。
S403,中心节点执行初始化设置。
举例而言,可以由所述中心节点的管理单元执行初始化设置。所述初始化设置可以包括:
1,选择参与本轮联邦学习的第一节点;
在一种可能的实现方式中,可以随机或者按照预设规则从已经注册的多个第一节点中选择参与联邦学习的至少两个第一节点。
2,设置汇聚算法;
3,设置模型训练配置信息;
例如,可以设置模型训练的迭代次数等参数,例如设置表1中的部分或者全部参数;
4,建立联邦学习实例,并将该实例类型选择为SA;
例如,将该实例选择为进行AISA流量识别。
5,初始化一个AI模型。
例如,初始化一个AISA流量识别模型,或者将已经完成预训练的模型参数和权重注入一个AI模型。初始化一个AI模型是指为该AI模型注入初始的模型参数和权重。
S404,中心节点向选择的至少两个第一节点发送训练任务通知消息;相应的所述至少两个第一节点中的第一节点接收训练任务通知消息;
其中,训练任务通知消息可以包括任务ID,用于向第一节点通知训练任务。
S405,所述至少两个第一节点中的第一节点向中心节点发送训练任务查询消息;
其中,训练任务查询消息可以包括任务ID,用于向中心节点查询训练任务。
作为一个示例,S301中,模型训练消息可以携带在训练任务消息中发送给第一节点。例如,中心节点向所述至少两个第一节点发送训练任务信息;相应的所述至少两个第一节点中的第一节点接收训练任务信息;其中,训练任务信息包括初始的AI模型以及模型训练配置信息。AI模型以及模型训练配置信息已经在上一实施例中进行了介绍,在此不再赘述。
另外,中心节点可以将训练任务信息进行压缩和/或加密后发送给第一节点,第一节点接收到信息后进行解压缩和/或解密后得到AI模型以及模型训练配置信息。
还有，在训练过程中，第一节点可以持续向中心节点发送心跳消息，告知中心节点训练任务的状态，例如接收任务，任务运行中，任务完成，任务失败等。
示例性的,S304中,第一模型参数更新信息可以携带在任务执行结果中发送给中心节点。例如,所述至少两个第一节点中的第一节点向中心节点发送任务执行结果;相应的中心节点接收至少两个第一节点发送的至少两个任务执行结果;其中,任务执行结果包括所述第一模型参数更新信息。
还有,任务执行结果还可以包括以下一种或者多种:训练耗费时间、训练的数据量、训练结果精度、或者训练结果召回率等。
其中,第一节点可以先对任务执行结果进行压缩和/或加密后发送给中心节点,中心节点解压缩和/或解密后得到任务执行结果。
在一种可能的实现方式中,还可以进行下一轮训练,即跳转到步骤305再次对更新后的AI模型进行训练,直至模型收敛,联邦学习结束。
本申请实施例提供的技术方案，使用联邦学习机制，将各个第一节点本地训练后的模型参数更新在不同节点间传递，使得某第一节点即使存在所需识别的APP流量太少、识别效果不佳的情况，也可以通过其他第一节点识别能力的传递，增强本地的识别能力。
另外,当某第一节点需要识别的APP的数量需求大且单靠本地训练无法短时间内完成识别时,通过联邦学习其他第一节点的训练,可以迅速扩充识别能力,提升单节点模型训练效率,实现快速迭代。另外,新的应用APP的识别能力也可以在不同第一节点间快速传递。例如,以A站点和B站点参与联邦学习训练AI模型为例,A站点的识别能力可以传递给B站点,从而使得B站点即使某一APP流量较少(小样本)或没有,也具备与A站点相同的识别能力。
还有,基于联邦学习不同节点间传递的都是模型参数更新,与原数据无任何关系,充分满足了隐私保护的要求。例如,A站点本地数据不出A站点,B站点无需依赖A站点源数据就能实现识别能力提升。
还有,基于联邦学习还可以实现训练算力的分担,实现对重点应用、重点时间内的识别能力快速迭代。例如,B站点可以分担A站点的计算能力,比A站点单独训练性能提升。
在图1的基础上,本申请实施例提供了一种模型训练系统的架构,如图5所示,在中心节点110设置联邦学习服务器FLS1101(Federated Learning Server)和汇聚单元1102,在各个第一节点分别设置联邦学习客户端FLC(Federated Learning Client)和训练单元,例如在第一节点120设置FLC1201和训练单元1202,在第一节点121设置FLC1211和训练单元1212。
其中,FLS1101与各个FLC1201、FLC1211分别通过有线或者无线连接进行信息交互。
如图5所示,汇聚单元1102可以是支持数据汇聚的任意单元,可以与FLS1101合设在中心节点110内,与FLS1101配合一同实现联邦学习。训练单元1202、训练单元1212可以是支持AI模型训练的任意单元,可以分别与FLC1201、FLC1211合设在第一节点内,与FLS1101配合一同实现联邦学习。
示例性的,FLS1101和汇聚单元1102可以分别采用不同的容器服务来实现,也可以分别通过一个或者多个虚拟机(virtual machine,VM)来实现,或者分别由一个或者多个处理器来实现、或者分别由一个或者多个计算机来实现。
示例性的，FLC1201、FLC1211、训练单元1202、训练单元1212可以分别采用不同的容器服务来实现，也可以分别通过一个或者多个虚拟机(virtual machine，VM)来实现，或者分别由一个或者多个处理器来实现、或者分别由一个或者多个计算机来实现。
作为一个例子,训练单元1202和训练单元1212可以是分别部署在第一节点120和第一节点121中的应用人工智能的业务感知(artificial intelligence service awareness,AISA),AISA还可以称为人工智能识别功能,也可以命名为其它的名字,本申请实施例中为了描述,以称为AISA为例。
例如,AISA可以用于根据SA特征库针对采集到的数据流进行分类得到分类结果。SA特征库可以位于AISA之内,或者也可以位于AISA之外,通过接口连接。AISA中还可以包括SA引擎。SA引擎用于实现根据SA特征库针对采集到的数据流进行特征识别。AISA可以通过SA引擎对采集到的数据流进行特征识别,得到数据流的分类结果后,根据数据流的分类结果对数据流进行标签标注;然后将具有标签标注的数据流作为AI模型的训练数据集,AISA根据训练数据集来训练AI模型。
又例如,第一节点120、121还可以部署SA识别引擎,例如SA@AI引擎,可以由SA识别引擎作为需要进行SA识别的应用向AISA提出模型训练申请,配置勾选协议,然后由AISA进行数据收集、模型训练、规则提取后,输出识别结果与规则并更新到SA特征库。
示例性的，AI模型可以部署在云化平台，云化平台可以向在云端的中心节点上的联邦学习服务器FLS进行注册，并通过状态消息互相感知状态。AI模型进行数据收集、模型训练后，输出识别结果，云化平台可以将AI模型参与的识别结果等交互数据转发给FLS；FLS接收云化平台上传的模型参数，进行模型汇聚融合，完成后将共享模型下发给云化平台。这样AI模型经汇聚融合后实现模型能力增强。
作为一个例子，FLC1201和FLC1211可以负责接收本地节点的数据，并转发给中心节点110。AISA还可以利用FLC1201和FLC1211实现与中心节点FLS1101的状态消息同步，AI模型的模型参数更新的导出、上传、下载，以及训练耗费时间、数据量、识别结果（召回率、精度）等的上传与下载，等。在中心节点110，联邦学习服务器FLS1101负责接收FLC1201和FLC1211的数据，检测第一节点120、121的状态，实现训练任务的下发，接收各分布式第一节点120、121上传的模型参数更新，进行模型参数更新的汇聚融合，下发汇聚后的模型参数更新，等。
作为另一个例子，还可以在FLC1201和FLC1211中设置服务端模块和客户端模块，在训练单元1202和训练单元1212中设置客户端模块，在FLS1101设置服务端模块。其中，FLC1201和FLC1211中的服务端模块被配置为分别与训练单元1202、1212中的客户端模块连接，负责FLC1201和FLC1211分别与训练单元1202和训练单元1212之间的数据传输。其中，FLC1201和FLC1211中的客户端模块被配置为与FLS1101中的服务端模块连接，负责FLC1201和FLC1211与FLS1101之间的数据传输。
例如图5所示,可以在FLS1101设置服务端模块11011,在FLC1201设置客户端12011和服务端12012,在训练单元1202设置客户端模块12021,在FLC1211设置客户端12111和服务端12112,在训练单元1212设置客户端模块12121。
示例性的，客户端模块12011、客户端模块12021、客户端模块12111、客户端模块12121可以为HTTP/HTTPS客户端Client，服务端模块11011、服务端模块12012、服务端模块12112可以为HTTP/HTTPS服务器Server。
在一个例子中，FLC1201和FLC1211可以以应用程序APP的方式分别部署在第一节点120、第一节点121上。第一节点120、第一节点121可以为边缘计算平台或者云化平台的节点，例如，所述FLC1201和FLC1211可以以应用程序APP的方式部署在边缘计算平台。又例如，可以将FLC1201和FLC1211作为一个APP在边缘计算平台或者云化平台直接上线，然后由FLC1201和FLC1211代理第一节点120、第一节点121与FLS1101之间的连接。
在另一个例子中，FLC1201和FLC1211可以以静态链接库的方式分别部署在第一节点120、第一节点121上。例如，可以将所述FLC1201和FLC1211分别集成到部署AISA的虚拟机VM内部，在该VM为所述FLC1201和FLC1211设置与外部连接的接口，还可以通过该虚拟机的登录Portal界面配置所述FLC1201和FLC1211与FLS1101连接的IP地址和参数。该参数是指FLC1201和FLC1211的名称、标识，以及向FLS1101注册的用户名和密码等。
基于图5所示的架构，针对图3所示的方法，可以由FLS1101执行中心节点110执行的操作，可以由FLC1201和FLC1211分别执行第一节点120、第一节点121执行的操作。
基于图5所示的架构，针对图4所示的方法，可以由FLS1101和汇聚单元1102配合执行中心节点110执行的操作，其中，FLS1101负责数据的接收和发送，汇聚单元1102负责采用预设的汇聚算法对至少两个第一模型参数更新信息进行汇聚得到第二模型参数更新信息。可以由FLC1201和训练单元1202、FLC1211和训练单元1212分别配合执行第一节点120、第一节点121执行的操作，其中，训练单元1202、训练单元1212负责训练AI模型，其他操作由FLC1201和FLC1211负责。
如下通过具体应用场景对本申请实施例提供的方案达到的效果进行示例性说明。
场景一:小样本节点的联邦学习实验
本场景一中,3个local(分别为local 1、local 2和local 3,local 1、local 2和local 3为第一节点,以下local 1、local 2和local 3可以统一称为local)的样本数分别为总样本数的10%、30%和60%,将总样本集按比例随机分配给各个local。本场景一中各个local的样本数据分布情况如图6所示。由于local 1只分配到总数据量的10%,成为小样本节点,通过联邦学习实验以验证小样本节点能否在联邦学习后提升模型的识别能力。
如图7所示，3个local基于本地数据进行训练得到的模型识别准确率分别为76.0%、90.9%和95.6%。当使用全量数据（即总样本集）训练时，得到的模型的识别准确率能够达到97.1%。对3个local进行联邦学习时，模型参数融合策略为每个epoch进行一次参数融合，得到的联邦模型的识别准确率为95.7%。由此可见，由于local1训练样本数量较少，在本地训练得到的模型的识别准确率较低；而由于local3的样本数量较多，在本地训练得到的模型的识别准确率较高。通过联邦学习，各个local得到一个有较高识别准确率的联邦模型，识别准确率相比本地训练都得到了不同幅度的提升。特别是小样本节点local1，通过联邦学习得到的模型的识别准确率得到大幅提升。
场景二:小样本应用的联邦学习实验
本场景二中，各个local的样本数量相近，但应用的样本数量分布不同，每个local都各有一些应用为小样本应用，但小样本应用在其他local的样本数量足够大。本场景二中各个local的样本数据分布情况如图8所示。通过联邦学习实验以验证各个local的小样本应用能否在联邦学习后具有较好的识别能力。
如图9所示,3个local基于本地数据进行训练得到的模型识别准确率分别为83.6%、81.9%和82.8%。当使用全量数据训练时,得到的模型的识别准确率能够达到97.1%。通过联邦学习之后,得到的联邦模型的识别准确率达到95.6%。由此可见,由于各个local都有一些应用为小样本的,本地训练模型对这些小样本应用的识别准确率不高,但通过联邦学习之后,得到的联邦模型对小样本应用的识别准确率和召回率都得到了明显的提升。例如图10所示为场景二:各个local中小样本应用的召回率及联邦学习后的召回率,各个local中的小样本应用识别的召回率都较低,联邦模型的召回率得到了明显提升,比如应用A在local1本地模型只有46.9%的召回率,但是应用联邦模型(federated)则有95.6%的召回率;比如应用B在local1本地模型只有13.8%的召回率,但是应用联邦模型(federated)则有94.4%的召回率;比如应用C在local2本地模型只有49.7%的召回率,但是应用联邦模型(federated)则有98.5%的召回率;比如应用D在local2本地模型只有42.3%的召回率,但是应用联邦模型(federated)则有97.6%的召回率;比如应用E在local3本地模型只有32.1%的召回率,但是应用联邦模型(federated)则有96.4%的召回率;比如应用F在local3本地模型只有59.1%的召回率,但是应用联邦模型(federated)则有97.3%的召回率。
场景三:扩展识别应用数量的联邦学习实验
本场景三中,各个local的样本数量相近,但应用的样本数量分布不同,一些应用在某一个或某两个local都没有样本。本场景三中各个local的样本数据分布情况如图11所示。通过联邦学习实验以验证各个local能否在联邦学习后具有扩展识别应用数量的能力。
如图12所示,3个local基于本地数据进行训练得到的模型识别准确率分别为87%、87.5%和86.9%,当使用全量数据训练时,得到的模型的识别准确率能够达到98.4%。通过联邦学习之后,得到的联邦模型的识别准确率达97.9%。由此可见,当各local有一些应用没有训练样本时,本地训练模型对这些应用没有识别能力,并且整体的识别准确率较低,通过联邦学习,得到的联邦模型对local无样本的应用有较高的识别准确率和召回率,并且联邦模型的整体识别准确率得到明显提升。例如图13所示场景三:各个local中无样本应用联邦学习后的召回率,当各local有一些应用没有训练样本时,召回率为0,联邦模型的召回率得到了明显提升。比如,APP1在local1召回率为0,应用联邦模型(federated)召回率提升至98.1%。比如,APP2在local2召回率为0,应用联邦模型(federated)召回率提升至97.0%。比如,APP3在local3召回率为0,应用联邦模型(federated)召回率提升至98.5%。比如,APP4在local2和local3召回率都为0,应用联邦模型(federated)召回率提升至96.0%。比如,APP5在local1和local3召回率为0,应用联邦模型(federated)召回率提升至99.3%。比如,APP6在local1和local2召回率为0,应用联邦模型(federated)召回率提升至99.9%。
场景四:基于联邦模型的联邦学习微调实验
在联邦学习完成后得到了一个联邦模型，当某个local产生一些新的应用类别时，可以通过两种方法提高识别率：对联邦模型进行联邦学习微调，或将联邦模型重新初始化后再进行联邦训练；下面分别实验这两种方法的联邦学习效果。所述微调即Federated-finetune，是指将一个预训练模型再进行几轮训练，从而收敛模型，在本场景中是指对联邦模型再进行几轮训练，从而收敛模型。所述重新初始化即Federated-init，是指对联邦模型进行初始化。
例如图14所示，基于联邦学习的联邦模型，对联邦模型进行微调，即对联邦模型经过500轮联邦训练后，得到的模型的识别准确率为95.7%。而对初始化后的联邦模型进行联邦训练，经过2000轮联邦训练，模型的识别准确率也为95.7%。由此可见，基于联邦学习的联邦模型进行微调训练，能够通过较少的联邦训练迭代轮次就达到与模型初始化重新训练相同的模型准确率。
例如图15所示，针对新应用APP_T，基于对联邦模型进行微调后的模型的识别准确率为99.4%，基于对初始化后的联邦模型进行联邦训练后的模型的识别准确率为99.4%。针对新应用APP_D，基于对联邦模型进行微调后的模型的识别准确率为99.8%，基于对初始化后的联邦模型进行联邦训练后的模型的识别准确率为99.8%。针对新应用APP_Y，基于对联邦模型进行微调后的模型的识别准确率为99.3%，基于对初始化后的联邦模型进行联邦训练后的模型的识别准确率为99.2%。针对新应用APP_M，基于对联邦模型进行微调后的模型的识别准确率为99.7%，基于对初始化后的联邦模型进行联邦训练后的模型的识别准确率为99.5%。由此可见，基于联邦学习的联邦模型进行微调训练的模型，针对新应用的识别率可以达到与模型初始化重新训练后的模型的准确率相同甚至更高。
本申请实施例提供的技术方案可以为基于AI的SA技术在网络中提供一种在缺少训练样本或计算能力不足、同时需要隐私保护情况下,训练得到能够实现高精度高性能识别的AI模型。例如,传统基于AI的SA,均需要大量数据样本进行模型训练,而数据样本的合理采集手段十分有限,如果达不到要求,影响整体性能。本申请实施例提供的可以在缺少训练样本情况下,实现AI模型的识别能力的整体提升。又例如,传统基于AI的流量识别,计算时间普遍较长,容易在单一节点造成的大量性能消耗。本申请实施例可以实现分布式模型训练,避免单一节点性能瓶颈。又例如,网络数据流量隐私保护需求日益增强,本申请实施例可以实现AI技术在网络中的分布式计算,保护原始数据安全,提供安全可信的智能化流量识别服务。
基于上述内容,图16示出了本申请实施例提供的一种模型训练装置的结构。如图16所示,该模型训练装置可以部署于图1或图2中所示的中心节点,或者部署于图5中的FLS。
该模型训练装置包括:第一发送模块1601和接收模块1602。
该第一发送模块1601,用于向至少两个第一节点发送模型训练消息,向所述至少两个第一节点中的第一节点发送第二模型参数更新信息,所述模型训练消息包括人工智能AI模型以及模型训练配置信息,所述AI模型用于识别数据流所属类别,所述第二模型参数更新信息是根据所述至少两个第一模型参数更新信息得到的,所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数;
该接收模块1602,用于接收来自所述至少两个第一节点的至少两个第一模型参数更新信息,第一模型参数更新信息是根据所述第一模型参数更新信息对应的第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息。
示例性的,所述数据流所属类别包括以下内容中的至少一项:所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;所述数据流的报文特征规则。
在一种可能的实现方式中,所述装置还包括:第二发送模块1603,用于向第二节点发送所述第二模型参数更新信息和所述AI模型;所述第二模型参数更新信息用于更新所述第二节点的所述AI模型的模型参数。
另外，所述模型训练配置信息还可以包括：训练结果精度阈值；所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
基于上述内容,图17示出了本申请实施例提供的一种模型训练装置的结构。如图17所示,该模型训练装置可以部署于图1或图2中所示的第一节点,或者部署于图5中的FLC。
该模型训练装置包括:接收模块1701和发送模块1702。
接收模块1701,用于接收模型训练消息,接收第二模型参数更新信息,所述模型训练消息包括AI模型以及模型训练配置信息,所述AI模型用于识别数据流所属类别,所述第二模型参数更新信息是根据至少两个第一节点的至少两个第一模型参数更新信息得到的,所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数;
发送模块1702,用于发送第一模型参数更新信息,所述第一模型参数更新信息是根据第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息。
示例性的,所述数据流所属类别包括以下内容中的至少一项:所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;所述数据流的报文特征规则。
示例性的,所述装置还可以包括:训练模块1703,用于确定所述第一节点根据本地数据,对采用所述第二模型参数更新信息更新后的所述AI模型进行训练。
举例而言,所述接收模块和所述发送模块可以为部署于云化平台或者边缘计算平台的应用程序APP的模块。
例如,所述接收模块1701和所述发送模块1702为所述APP的客户端模块;所述APP还包括服务端模块;所述服务端模块,用于向所述第一节点发送所述第一模型参数更新信息。
在一种可能的实现方式中,所述第一模型参数更新信息是根据所述模型训练配置信息,对训练后的所述AI模型的模型参数校验成功之后发送的。
另外,所述模型训练配置信息还包括:训练结果精度阈值;所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
在一种示例中,所述装置还可以包括获取模块和更新模块:
获取模块,用于获取基于采用所述第二模型参数更新信息更新后的所述AI模型,对数据流进行识别的识别结果和报文特征规则;
更新模块,用于根据所述报文特征规则更新业务感知SA特征库。
本申请实施例中对模块的划分是示意性的,仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。另外,在本申请各个实施例中的各功能单元可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
基于上述内容,图18示出了本申请实施例提供的一种模型训练装置的结构。如图18所示,该模型训练装置1800可以部署于图1或图2中所示的中心节点,或者部署于图5中的FLS。或者,该模型训练装置1800可以部署于图1或图2中所示的第一节点,或者部署于图5中的FLC。
该模型训练装置1800中可以包括通信接口1810、处理器1820。可选的，模型训练装置1800中还可以包括存储器1830。其中，存储器1830可以设置于模型训练装置内部，还可以设置于模型训练装置外部。上述实施例中所述模型训练装置所实现的功能均可以由处理器1820实现。处理器1820通过通信接口1810接收数据流，并用于实现上述任一实施例所述的模型训练方法。在实现过程中，处理流程的各步骤可以通过处理器1820中的硬件的集成逻辑电路或者软件形式的指令完成上述任一实施例所述的模型训练方法。为了简洁，在此不再赘述。处理器1820用于实现上述任一实施例所述的模型训练方法所执行的程序代码可以存储在存储器1830中。存储器1830和处理器1820耦合。
本申请实施例中涉及的处理器可以是通用处理器、数字信号处理器、专用集成电路、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件,可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。
本申请实施例中的耦合是装置、模块或模块之间的间接耦合或通信连接,可以是电性,机械或其它的形式,用于装置、模块或模块之间的信息交互。
处理器可能和存储器协同操作。存储器可以是非易失性存储器,比如硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD)等,还可以是易失性存储器(volatile memory),例如随机存取存储器(random-access memory,RAM)。存储器是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
本申请实施例中不限定上述通信接口、处理器以及存储器之间的具体连接介质。比如存储器、处理器以及通信接口之间可以通过总线连接。所述总线可以分为地址总线、数据总线、控制总线等。
基于以上实施例,本申请实施例还提供了一种计算机存储介质,该存储介质中存储软件程序,该软件程序在被一个或多个处理器读取并执行时可实现上述任意一个实施例提供的模型训练方法。所述计算机存储介质可以包括:U盘、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
基于以上实施例,本申请实施例还提供了包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述任意一个实施例提供的模型训练方法。
基于以上实施例,本申请实施例还提供了一种芯片,该芯片包括处理器,用于实现上述任意一个或多个实施例所提供的模型训练方法的功能。可选地,所述芯片还包括存储器,所述存储器,用于处理器所执行必要的程序指令和数据。该芯片,可以由芯片构成,也可以包含芯片和其他分立器件。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请实施例的范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (27)

  1. 一种模型训练方法,其特征在于,所述方法包括:
    向至少两个第一节点发送模型训练消息,所述模型训练消息包括人工智能AI模型以及模型训练配置信息,所述AI模型用于识别数据流所属类别;
    接收来自所述至少两个第一节点的至少两个第一模型参数更新信息,第一模型参数更新信息是根据所述第一模型参数更新信息对应的第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息;
    向所述至少两个第一节点中的第一节点发送第二模型参数更新信息,所述第二模型参数更新信息是根据所述至少两个第一模型参数更新信息得到的,所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数。
  2. 如权利要求1所述的方法,其特征在于,所述数据流所属类别包括以下内容中的至少一项:
    所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;所述数据流的报文特征规则。
  3. 如权利要求1或2所述的方法,其特征在于,在所述接收来自所述至少两个第一节点的至少两个第一模型参数更新信息之后,还包括:
    向第二节点发送所述第二模型参数更新信息和所述AI模型;所述第二模型参数更新信息用于更新所述第二节点的所述AI模型的模型参数。
  4. 如权利要求1至3中任一项所述的方法,其特征在于,所述模型训练配置信息还包括:训练结果精度阈值;
    所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
  5. 一种模型训练方法,其特征在于,所述方法包括:
    接收模型训练消息,所述模型训练消息包括AI模型以及模型训练配置信息,所述AI模型用于识别数据流所属类别;
    发送第一模型参数更新信息,所述第一模型参数更新信息是根据第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息;
    接收第二模型参数更新信息,所述第二模型参数更新信息是根据至少两个第一节点的至少两个第一模型参数更新信息得到的,所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数。
  6. 如权利要求5所述的方法,其特征在于,所述数据流所属类别包括以下内容中的至少一项:
    所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;所述数据流的报文特征规则。
  7. 如权利要求5或6所述的方法,其特征在于,所述方法还包括:
    确定所述第一节点根据本地数据,对采用所述第二模型参数更新信息更新后的所述AI模型进行训练。
  8. 如权利要求5至7中任一项所述的方法,其特征在于,所述方法通过部署于云化平台或者边缘计算平台的应用程序APP执行。
  9. 如权利要求8所述的方法,其特征在于,所述接收模型训练消息之后,发送第一模型参数更新信息之前,所述方法还包括:
    通过所述APP的服务端模块接收来自所述第一节点的所述第一模型参数更新信息;
    所述发送第一模型参数更新信息,包括:
    通过所述APP的客户端模块发送所述第一模型参数更新信息。
  10. 如权利要求5至9中任一项所述的方法,其特征在于,所述第一模型参数更新信息是根据所述模型训练配置信息,对训练后的所述AI模型的模型参数校验成功之后发送的。
  11. 如权利要求5至10中任一项所述的方法,其特征在于,所述模型训练配置信息还包括:训练结果精度阈值;
    所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
  12. 如权利要求5至11中任一项所述的方法,其特征在于,所述接收第二模型参数更新信息之后,所述方法还包括:
    获取基于采用所述第二模型参数更新信息更新后的所述AI模型,对数据流进行识别的识别结果和报文特征规则;
    根据所述报文特征规则更新业务感知SA特征库。
  13. 一种模型训练装置,其特征在于,所述装置包括:
    第一发送模块,用于向至少两个第一节点发送模型训练消息,向所述至少两个第一节点中的第一节点发送第二模型参数更新信息,所述模型训练消息包括人工智能AI模型以及模型训练配置信息,所述AI模型用于识别数据流所属类别,所述第二模型参数更新信息是根据所述至少两个第一模型参数更新信息得到的,所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数;
    接收模块,用于接收来自所述至少两个第一节点的至少两个第一模型参数更新信息,第一模型参数更新信息是根据所述第一模型参数更新信息对应的第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息。
  14. 如权利要求13所述的装置,其特征在于,所述数据流所属类别包括以下内容中的至少一项:
    所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;所述数据流的报文特征规则。
  15. 如权利要求13或14所述的装置,其特征在于,所述装置还包括:
    第二发送模块,用于向第二节点发送所述第二模型参数更新信息和所述AI模型;所述第二模型参数更新信息用于更新所述第二节点的所述AI模型的模型参数。
  16. 如权利要求13至15中任一项所述的装置,其特征在于,所述模型训练配置信息还包括:训练结果精度阈值;
    所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
  17. 一种模型训练装置,其特征在于,所述装置包括:
    接收模块，用于接收模型训练消息，接收第二模型参数更新信息，所述模型训练消息包括AI模型以及模型训练配置信息，所述AI模型用于识别数据流所属类别，所述第二模型参数更新信息是根据至少两个第一节点的至少两个第一模型参数更新信息得到的，所述第二模型参数更新信息用于更新第一节点的所述AI模型的模型参数；
    发送模块,用于发送第一模型参数更新信息,所述第一模型参数更新信息是根据第一节点的本地数据和所述模型训练配置信息对所述AI模型训练后的模型参数更新信息。
  18. 如权利要求17所述的装置,其特征在于,所述数据流所属类别包括以下内容中的至少一项:
    所述数据流所属的应用;所述数据流的业务内容所属的类型或协议;所述数据流的报文特征规则。
  19. 如权利要求17或18所述的装置,其特征在于,所述装置还包括:
    训练模块,用于确定所述第一节点根据本地数据,对采用所述第二模型参数更新信息更新后的所述AI模型进行训练。
  20. 如权利要求17至19中任一项所述的装置,其特征在于,所述接收模块和所述发送模块为部署于云化平台或者边缘计算平台的应用程序APP的模块。
  21. 如权利要求20所述的装置,其特征在于,
    所述接收模块和所述发送模块为所述APP的客户端模块;
    所述APP还包括服务端模块;
    所述服务端模块,用于向所述第一节点发送所述第一模型参数更新信息。
  22. 如权利要求17至21中任一项所述的装置,其特征在于,所述第一模型参数更新信息是根据所述模型训练配置信息,对训练后的所述AI模型的模型参数校验成功之后发送的。
  23. 如权利要求17至22中任一项所述的装置,其特征在于,所述模型训练配置信息还包括:训练结果精度阈值;
    所述训练结果精度阈值用于指示所述第一节点根据本地数据和所述模型训练配置信息对所述AI模型进行训练的训练结果精度。
  24. 如权利要求17至23中任一项所述的装置,其特征在于,所述装置还包括:
    获取模块,用于获取基于采用所述第二模型参数更新信息更新后的所述AI模型,对数据流进行识别的识别结果和报文特征规则;
    更新模块,用于根据所述报文特征规则更新业务感知SA特征库。
  25. 一种模型训练装置,其特征在于,包括处理器以及存储器,其中:
    所述存储器,用于存储程序代码;
    所述处理器,用于读取并执行所述存储器存储的程序代码,以实现如权利要求1~12中任一项所述的方法。
  26. 一种模型训练系统,其特征在于,包括如权利要求13至16中任一项所述的装置,以及包括如权利要求17至24中任一项所述的装置。
  27. 一种计算机可读存储介质，其特征在于，所述计算机可读存储介质中存储软件程序，该软件程序在被一个或多个处理器读取并执行时，用于实现权利要求1至12中任一项所述的方法。
PCT/CN2022/109525 2021-08-23 2022-08-01 模型训练方法、装置及系统 WO2023024844A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110970462.6A CN115718868A (zh) 2021-08-23 2021-08-23 模型训练方法、装置及系统
CN202110970462.6 2021-08-23

Publications (1)

Publication Number Publication Date
WO2023024844A1 true WO2023024844A1 (zh) 2023-03-02

Family

ID=85253349

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/109525 WO2023024844A1 (zh) 2021-08-23 2022-08-01 模型训练方法、装置及系统

Country Status (2)

Country Link
CN (1) CN115718868A (zh)
WO (1) WO2023024844A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152609A (zh) * 2023-04-04 2023-05-23 南京鹤梦信息技术有限公司 分布式模型训练方法、系统、装置以及计算机可读介质
CN117793764A (zh) * 2023-12-27 2024-03-29 广东宜通衡睿科技有限公司 5g专网软探针拨测数据完整性校验和补全方法及系统

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117857647A (zh) * 2023-12-18 2024-04-09 慧之安信息技术股份有限公司 基于mqtt面向工业物联网的联邦学习通信方法和系统
CN117521115B (zh) * 2024-01-04 2024-04-23 广东工业大学 一种数据保护方法、装置及计算机存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206416A1 (en) * 2016-01-19 2017-07-20 Fuji Xerox Co., Ltd. Systems and Methods for Associating an Image with a Business Venue by using Visually-Relevant and Business-Aware Semantics
CN109189825A (zh) * 2018-08-10 2019-01-11 深圳前海微众银行股份有限公司 横向数据切分联邦学习建模方法、服务器及介质
CN111444848A (zh) * 2020-03-27 2020-07-24 广州英码信息科技有限公司 一种基于联邦学习的特定场景模型升级方法和系统
CN112861894A (zh) * 2019-11-27 2021-05-28 华为技术有限公司 一种数据流分类方法、装置及系统
CN114584581A (zh) * 2022-01-29 2022-06-03 华东师范大学 面向智慧城市物联网信物融合的联邦学习系统及联邦学习训练方法


Also Published As

Publication number Publication date
CN115718868A (zh) 2023-02-28

Similar Documents

Publication Publication Date Title
WO2023024844A1 (zh) 模型训练方法、装置及系统
JP6569020B2 (ja) ネットワーキング技術
EP4152797A1 (en) Information processing method and related device
US20210211467A1 (en) Offload of decryption operations
EP3937051B1 (en) Methods and apparatuses for processing transactions based on blockchain integrated station
WO2018196643A1 (zh) 一种私有数据云存储系统及私有数据云存储方法
US20210326863A1 (en) Methods and apparatuses for identifying replay transaction based on blockchain integrated station
US11783339B2 (en) Methods and apparatuses for transferring transaction based on blockchain integrated station
US10592578B1 (en) Predictive content push-enabled content delivery network
CN110741573A (zh) 在区块链网络中选择性使用网络编码传播交易的方法和系统
WO2023202596A1 (zh) 一种半监督模型训练方法、系统及相关设备
WO2022105090A1 (zh) 基于回调机制的细粒度数据流可靠卸载方法
US11368482B2 (en) Threat detection system for mobile communication system, and global device and local device thereof
US11870756B2 (en) Reliable data transfer protocol for unidirectional network segments
EP1303812B1 (fr) Procede de transmission d'un agent mobile dans un reseau; emetteur, recepteur, et agent mobile associes
KR20220074971A (ko) 블록체인 기반 데이터 프로세싱 방법, 장치 및 디바이스, 그리고 판독가능 저장 매체
WO2021050905A1 (en) Global table management operations for multi-region replicated tables
US11936717B2 (en) Scalable media file transfer
JP2011505606A (ja) 表形式データストリームプロトコルの行におけるヌル列の圧縮
US11159614B2 (en) Method and apparatus for managing data in a network based on swarm intelligence
WO2023051455A1 (zh) 一种信任模型的训练方法及装置
CN113487041B (zh) 横向联邦学习方法、装置及存储介质
WO2022121840A1 (zh) 一种神经网络模型的调整系统、方法及设备
WO2020220986A1 (zh) 一种报文处理方法、装置及设备
US7756975B1 (en) Methods and systems for automatically discovering information about a domain of a computing device

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE