CN117313831A - Combined learning training method and device based on model distillation - Google Patents

Combined learning training method and device based on model distillation

Info

Publication number
CN117313831A
Authority
CN
China
Prior art keywords
model
participant
student
teacher
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210691251.3A
Other languages
Chinese (zh)
Inventor
张敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinao Xinzhi Technology Co ltd
Original Assignee
Xinao Xinzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinao Xinzhi Technology Co ltd filed Critical Xinao Xinzhi Technology Co ltd
Priority to CN202210691251.3A priority Critical patent/CN117313831A/en
Publication of CN117313831A publication Critical patent/CN117313831A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to the technical field of joint learning, and provides a joint learning training method and device based on model distillation. The method includes the following steps: obtaining a teacher model corresponding to the joint learning training, and issuing the teacher model to each participant; training a student model of each participant with the participant data of that participant according to the teacher model; performing model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; obtaining first model parameters of each participant's student model after the model distillation processing, and aggregating the plurality of first model parameters to obtain first aggregation parameters; and determining an aggregation model corresponding to the joint learning training based on the first aggregation parameters. By adopting these technical means, the problems of large model scale, high communication cost, and high model application cost in joint learning training in the prior art are solved.

Description

Combined learning training method and device based on model distillation
Technical Field
The disclosure relates to the technical field of joint learning, in particular to a joint learning training method and device based on model distillation.
Background
In current joint learning, a neural network model is trained on the training data sets of a plurality of participants to obtain a network model for each participant, and a joint learning model is then obtained from these per-participant network models. This training approach has the following problems. First, the model trained at each participant is large: deep learning models can range from hundreds of megabytes to hundreds of gigabytes, the network complexity is high, and the time, computation, and storage costs required for joint training are correspondingly high. Second, the communication cost is high: joint training of large models requires transferring large volumes of data over long periods, and cross-regional, large-scale network communication is expensive, high-latency, and unstable, which hinders model training and increases cost and the risk of losing control. Third, the model application cost is high: because of computational complexity or parameter redundancy, deep learning limits the deployment of such models in some scenarios and on some devices, the response speed is slow, and high-traffic access cannot be handled.
In the course of implementing the disclosed concept, the inventor found that the related art has at least the following technical problems: joint learning training suffers from large model scale, high communication cost, high model application cost, and the like.
Disclosure of Invention
In view of the above, the embodiments of the present disclosure provide a model distillation-based joint learning training method, apparatus, electronic device, and computer-readable storage medium, so as to solve the problem of long time required for joint learning training in the prior art.
In a first aspect of the embodiments of the present disclosure, a joint learning training method based on model distillation is provided, including: obtaining a teacher model corresponding to the joint learning training, and issuing the teacher model to each participant; training a student model of each participant with the participant data of that participant according to the teacher model; performing model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; obtaining first model parameters of each participant's student model after the model distillation processing, and aggregating the plurality of first model parameters to obtain first aggregation parameters; and determining an aggregation model corresponding to the joint learning training based on the first aggregation parameters.
In a second aspect of embodiments of the present disclosure, there is provided a model distillation based joint learning training apparatus, including: the acquisition module is configured to acquire a teacher model corresponding to the joint learning training and issue the teacher model to each participant; a training module configured to train a student model of each participant with participant data of each participant according to the teacher model; a distillation module configured to perform a model distillation process on the student model at each participant, wherein the model distillation process is to migrate knowledge of the teacher model to the student model; the aggregation module is configured to acquire first model parameters of the student models trained by each participant, and aggregate the first model parameters of each student model to obtain first aggregate parameters; the determining module is configured to determine an aggregation model corresponding to the joint learning training based on the first aggregation parameter.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: a teacher model corresponding to the joint learning training is obtained and issued to each participant; a student model of each participant is trained with the participant data of that participant according to the teacher model; model distillation processing is performed on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; first model parameters of each participant's student model after the model distillation processing are obtained, and the plurality of first model parameters are aggregated to obtain first aggregation parameters; and an aggregation model corresponding to the joint learning training is determined based on the first aggregation parameters. By adopting these technical means, the problems of large model scale, high communication cost, and high model application cost in joint learning training in the prior art can be solved, and the model scale, communication cost, and model application cost in joint learning training are thereby reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a joint learning architecture according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a model distillation based joint learning training method provided in an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a model distillation-based joint learning training device according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
Joint learning refers to comprehensively utilizing multiple AI (Artificial Intelligence) technologies on the premise of ensuring data security and user privacy, jointly mining data value through multiparty cooperation, and promoting new intelligent business forms and models based on joint modeling. Joint learning has at least the following characteristics:
(1) Participating nodes control their own data in a weakly centralized joint training mode, ensuring data privacy and security in the process of co-creating intelligence.
(2) In different application scenarios, multiple model aggregation optimization strategies are established by screening and/or combining AI algorithms and privacy-preserving computation, so as to obtain high-level, high-quality models.
(3) On the premise of ensuring data security and user privacy, a method for improving the efficiency of the joint learning engine is obtained based on the multiple model aggregation optimization strategies; this method improves the overall efficiency of the joint learning engine by solving problems such as information interaction, intelligent perception, and exception handling mechanisms under a large-scale cross-domain network with a parallel computing architecture.
(4) The requirements of multiparty users in each scenario are obtained, the real contribution of each joint participant is determined and reasonably evaluated through a mutual trust mechanism, and incentives are distributed accordingly.
Based on this mode, an AI technology ecosystem based on joint learning can be established, the value of industry data can be fully exploited, and the deployment of vertical-domain scenarios can be promoted.
A model distillation based joint learning training method and apparatus according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a joint learning architecture according to an embodiment of the present disclosure. As shown in fig. 1, the architecture of joint learning may include a server (central node) 101, as well as participants 102, 103, and 104.
In the joint learning process, a basic model may be established by the server 101, and the server 101 transmits the model to the participants 102, 103, and 104 with which it has established a communication connection. Alternatively, the basic model may be built by any participant and uploaded to the server 101, after which the server 101 sends the model to the other participants with which it has established a communication connection. The participants 102, 103, and 104 construct models according to the downloaded basic structure and model parameters, perform joint learning training using local data to obtain updated model parameters, and upload the updated model parameters to the server 101 in encrypted form. The server 101 aggregates the model parameters sent by the participants 102, 103, and 104 to obtain global model parameters, and transmits the global model parameters back to the participants 102, 103, and 104. The participants 102, 103, and 104 iterate their respective models according to the received global model parameters until the models eventually converge, thereby completing the training of the models. In the joint learning process, the data uploaded by the participants 102, 103, and 104 are model parameters; local data is not uploaded to the server 101, and all participants can share the final model parameters, so that common modeling can be realized while data privacy is ensured. It should be noted that the number of participants is not limited to three and may be set as needed, which is not limited by the embodiments of the present disclosure.
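As a minimal illustration of this parameter exchange (a sketch under assumptions, not part of the patent text), the following Python/PyTorch snippet shows one round in which participants train locally, upload only their parameters, and the server averages them FedAvg-style; the placeholder model, synthetic data loaders, and hyperparameters are illustrative choices.

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def local_update(global_model, loader, epochs=1, lr=0.01):
    """Participant side: train a copy of the downloaded model on local data only."""
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model.state_dict()                      # only parameters leave the participant

def aggregate(state_dicts):
    """Server side: element-wise average of uploaded parameters (assumes float tensors)."""
    return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

# three participants with synthetic local data (stand-ins for real participant data)
participant_loaders = [
    DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=16)
    for _ in range(3)
]

# one communication round: broadcast, local training, upload, aggregate
global_model = nn.Linear(16, 2)                    # placeholder basic model
uploads = [local_update(global_model, ld) for ld in participant_loaders]
global_model.load_state_dict(aggregate(uploads))   # global parameters sent back next round
```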
Fig. 2 is a schematic flow chart of a model distillation-based joint learning training method according to an embodiment of the present disclosure. The model distillation based joint learning training method of fig. 2 may be performed by the server of fig. 1.
As shown in fig. 2, the model distillation-based joint learning training method includes:
s201, obtaining a teacher model corresponding to the joint learning training, and issuing the teacher model to each participant;
s202, training a student model of each participant by using participant data of each participant according to a teacher model;
s203, performing model distillation processing on the student model at each participant, wherein the model distillation processing is to migrate knowledge of the teacher model to the student model;
s204, obtaining first model parameters of the student model after model distillation treatment of each participant, and aggregating a plurality of first model parameters to obtain first aggregation parameters;
s205, determining an aggregation model corresponding to the joint learning training based on the first aggregation parameters.
In joint learning there are a plurality of participants and a training center. Each participant provides training data and trains its own student model on that data. The training center can initiate the joint learning training, aggregate the model parameters of the participants' student models after the model distillation processing to obtain aggregation parameters, and generate an aggregation model from the aggregation parameters; this aggregation model is the joint learning model. The execution subject of the embodiments of the disclosure may be the entire joint learning system, including the plurality of participants and the training center. The teacher model is a model trained in advance by a user holding a large amount of data assets, and the teacher model corresponding to the joint learning training is such a pre-trained model for the topic of the joint learning training. The teacher model and the student model may be any neural network models, the teacher model being a pre-trained model (that is, a model already trained by a user with a large amount of data assets). The student model is a participant's model, and the accuracy of each participant's student model after training on that participant's data still needs to be improved further. Because the teacher model is a pre-trained model, performing model distillation processing on the student model at each participant allows the model parameters of the student model to be further updated with the help of the teacher model.
Training the student model of each participant with the participant data of that participant according to the teacher model can be understood as using the teacher model to provide guidance, for example labeling the participant data with the teacher model, while the student model of each participant is trained with that participant's data.
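One hedged sketch of such guidance (the "labeling" mentioned above; the function name and the use of hard pseudo-labels are assumptions for illustration, not the patent's prescribed method) is to let the downloaded teacher model annotate unlabeled local data before the student is trained on it:

```python
import torch

@torch.no_grad()
def teacher_label(teacher, unlabeled_batches):
    """Use the teacher model to pseudo-label local participant data; both the data and
    the generated labels stay on the participant side."""
    teacher.eval()
    labeled = []
    for x in unlabeled_batches:                     # each x is a batch of unlabeled samples
        logits = teacher(x)
        labeled.append((x, logits.argmax(dim=1)))   # hard pseudo-labels from the teacher
    return labeled
```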
Performing the model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model, means that at each participant the teacher model performs model distillation on the student model; that is, model distillation migrates the knowledge of the teacher model into the student model.
Optionally, determining an aggregation model corresponding to the joint learning training based on the first aggregation parameter includes: at the training center, model parameters of the student model are updated based on the first aggregate parameters to obtain an aggregate model (the student model may be a neural network model without training, the scale of the student model is much smaller than that of the teacher model, and the student model may be issued by the training center to each participant).
The topic of the joint learning training, or the application scenario of the embodiments of the present disclosure, may be electricity/gas consumption prediction (training an aggregation model to predict the amount of electricity or gas consumed by users in a certain area over a certain period), face recognition (training an aggregation model to recognize faces), insurance data processing (training an aggregation model to process insurance data, that is, data about insured users, and thereby determine the most suitable insurance type for each user), and so on.
According to the technical solution provided by the embodiments of the disclosure, a teacher model corresponding to the joint learning training is obtained and issued to each participant; a student model of each participant is trained with the participant data of that participant according to the teacher model; model distillation processing is performed on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; first model parameters of each participant's student model after the model distillation processing are obtained, and the plurality of first model parameters are aggregated to obtain first aggregation parameters; and an aggregation model corresponding to the joint learning training is determined based on the first aggregation parameters. By adopting these technical means, the problems of large model scale, high communication cost, and high model application cost in joint learning training in the prior art can be solved, and the model scale, communication cost, and model application cost in joint learning training are thereby reduced.
In step S203, model distillation processing is performed on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; this includes, at each participant: calculating a target loss value corresponding to the teacher model by using an objective function; and performing the model distillation processing on the student model by the teacher model based on the target loss value.
The target loss value is a loss value between the teacher model and the student model, and can be regarded as a constraint from the teacher model on the student model; knowledge migration from the teacher model to the student model, that is, the model distillation processing, is realized according to this constraint. The teacher model may be any common neural network model.
Step S203 may specifically include: calculating a first loss value corresponding to the feature extraction networks of the teacher model and the student model by using a minimum absolute value deviation function; calculating a second loss value corresponding to the region candidate networks of the teacher model and the student model by using a minimum average error function; calculating a third loss value corresponding to the head networks of the teacher model and the student model by using a cross entropy loss function; and performing the model distillation processing on the teacher model based on the first loss value, the second loss value, and the third loss value to obtain the student model; where the objective function includes the minimum absolute value deviation function, the minimum average error function, and the cross entropy loss function, and the target loss value includes the first loss value, the second loss value, and the third loss value.
A neural network model generally includes three parts: a feature extraction network (Backbone), a region candidate network (RPN, region proposal network), and a head network (Head). The Backbone is often a residual network used to extract features; the RPN determines candidate boxes from the features; and the Head makes predictions from the regions corresponding to the candidate boxes.
The minimum absolute value deviation function is the L1-norm loss function, the minimum average error function is the L2-norm loss function, and the cross entropy loss function is CrossEntropyLoss. For calculating the first loss value corresponding to the feature extraction networks of the teacher model and the student model with the minimum absolute value deviation function, the first loss value may be a loss between the output of the (uncompressed) feature extraction network in the teacher model and the output of the (compressed) feature extraction network in the student model. The second and third loss values are obtained similarly. The first, second, and third loss values can be regarded as three constraints, and knowledge migration from the teacher model to the student model, that is, the model distillation processing, is realized according to these three constraints.
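A minimal sketch of such a three-part distillation loss is given below; the weighting factors, tensor shapes, and the use of the teacher's argmax as the cross entropy target are assumptions made for illustration, not details taken from the patent.

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_feat, s_feat, t_rpn, s_rpn, t_head_logits, s_head_logits,
                      w_feat=1.0, w_rpn=1.0, w_head=1.0):
    """Combine the three constraints: L1 on Backbone features, L2 on RPN outputs,
    cross entropy on Head outputs. t_* come from the frozen teacher, s_* from the
    student; matching shapes are assumed."""
    loss_feat = F.l1_loss(s_feat, t_feat)          # minimum absolute value deviation (L1)
    loss_rpn = F.mse_loss(s_rpn, t_rpn)            # minimum average error (L2)
    loss_head = F.cross_entropy(s_head_logits, t_head_logits.argmax(dim=1))
    return w_feat * loss_feat + w_rpn * loss_rpn + w_head * loss_head

# example call with dummy tensors of assumed shapes
loss = distillation_loss(torch.randn(4, 256, 32, 32), torch.randn(4, 256, 32, 32),
                         torch.randn(4, 100, 4), torch.randn(4, 100, 4),
                         torch.randn(4, 10), torch.randn(4, 10))
```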
After step S203 is performed, that is, after model distillation processing is performed on the student model at each participant (the model distillation processing migrating the knowledge of the teacher model to the student model), the method further includes: performing model acceleration processing, with a deep learning inference optimizer, on each participant's student model after the model distillation processing; acquiring second model parameters of each participant's student model after the model acceleration processing, and aggregating the second model parameters of each student model to obtain second aggregation parameters; and determining an aggregation model corresponding to the joint learning training based on the second aggregation parameters.
The deep learning inference optimizer may be TensorRT, which provides inter-layer fusion or tensor fusion, data precision calibration, automatic CUDA kernel tuning, dynamic tensor memory, and multi-stream execution; model inference speed can be doubled by means of TensorRT acceleration.
CUDA is a general parallel computing architecture that may be used in model training.
Inter-layer fusion or tensor fusion: the CUDA kernels compute tensors quickly, but a great deal of time is wasted on kernel launches and on reading and writing the input/output tensors of each layer, which creates a memory-bandwidth bottleneck and wastes GPU resources; TensorRT greatly reduces the number of layers by merging layers horizontally or vertically, so that fewer CUDA kernels are occupied and the whole model structure becomes smaller, faster, and more efficient. Data precision calibration: during deployment and inference the model does not need back propagation, so the data precision can be appropriately reduced, for example to FP16 or INT8; lower data precision leads to lower memory occupation and latency and a smaller model volume, but in practice FP16 was found to introduce a considerable accuracy loss, so FP16 quantization is not performed in order to preserve model accuracy. Automatic CUDA kernel tuning: when the network model performs inference computation, the CUDA kernels of the GPU are invoked for the calculation. Dynamic tensor memory: TensorRT allocates memory for each tensor only during its period of use, which avoids repeated memory allocation, reduces memory occupation, and improves reuse efficiency. Multi-stream execution: the execution speed of the underlying GPU operations is optimized.
Performing model acceleration processing with the deep learning inference optimizer on each participant's student model after the model distillation processing can adjust the model structure or model parameters of the student model, so that the adjusted student model performs better and the aggregation model in turn performs better.
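As a hedged sketch of how the distilled student might be handed to such an inference optimizer (the export route via ONNX, the stand-in model, the input shape, and the file name are assumptions; the patent itself only names TensorRT), one could export the student model and then build an engine with the vendor tooling:

```python
import torch
import torchvision

student = torchvision.models.resnet18(num_classes=10)   # stand-in for the distilled student
student.eval()
dummy_input = torch.randn(1, 3, 224, 224)                # assumed input shape
torch.onnx.export(student, dummy_input, "student.onnx",
                  input_names=["input"], output_names=["output"], opset_version=13)
# A TensorRT engine can then be built from student.onnx with the vendor tooling
# (e.g. the trtexec command-line utility); consistent with the description above,
# FP16 quantization would be skipped here to avoid the observed accuracy loss.
```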
In an alternative embodiment, the method includes: acquiring a preset total number of rounds and a teacher model corresponding to the joint learning training, and issuing the teacher model to each participant; the joint learning training is then performed cyclically as follows: training a student model of each participant with the participant data of that participant according to the teacher model; performing model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; obtaining first model parameters of each participant's student model after the model distillation processing, and aggregating the plurality of first model parameters to obtain first aggregation parameters; determining an aggregation model based on the first aggregation parameters and incrementing the training round by one, where the training round indicates how many rounds of the joint learning training have been performed and has an initial value of zero; when the training round equals the preset total number of rounds, the joint learning training ends; when the training round is smaller than the preset total number of rounds, the joint learning training continues and the aggregation model is issued to each participant to update the teacher model at each participant.
To achieve a higher precision of the aggregation model, the embodiment of the disclosure proposes a loop algorithm that performs multiple rounds of training; the joint learning training ends when the training round equals the preset total number of rounds. Training the student model of each participant with the participant data of each participant according to the teacher model means that each participant's student model is trained with that single participant's own data. Each time an aggregation model is obtained, it is judged whether to end the joint learning training. If the joint learning training continues, the aggregation model is issued to each participant so that the teacher model at each participant is updated with the aggregation model, and the next round of training then proceeds.
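A compact sketch of this round-based loop follows; the helper functions train_with_teacher, distill, and aggregate and the make_student factory are assumed placeholders standing in for the steps sketched around this section, not functions defined by the patent.

```python
def joint_training(teacher, make_student, participants, total_rounds):
    """Round-based joint learning: train and distill at each participant, aggregate the
    first model parameters, and feed the aggregation model back as the next teacher."""
    aggregation_model = None
    for training_round in range(1, total_rounds + 1):        # round counter, one per loop
        uploads = []
        for p in participants:
            student = make_student()                         # small student model per participant
            train_with_teacher(student, teacher, p.data)     # S202 (assumed helper)
            distill(student, teacher, p.data)                # S203 (assumed helper)
            uploads.append(student.state_dict())             # first model parameters (S204)
        aggregation_model = make_student()
        aggregation_model.load_state_dict(aggregate(uploads))  # first aggregation parameters (S205)
        if training_round < total_rounds:
            teacher = aggregation_model                      # issue aggregation model as new teacher
    return aggregation_model
```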
The method of aggregating the first model parameters of each student model may employ FedAdam, FedProx, or SCAFFOLD.
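For illustration only, the following is a generic sketch of a FedAdam-style server update (treating the mean client delta as a pseudo-gradient and applying an Adam-like rule); the hyperparameters and state handling are assumptions, and this is not presented as the patent's aggregation procedure.

```python
import torch

def fedadam_step(global_params, client_params_list, m, v,
                 lr=1e-2, beta1=0.9, beta2=0.99, tau=1e-3):
    """One FedAdam-style server update over state_dict-like parameter dictionaries."""
    new_params, new_m, new_v = {}, {}, {}
    for k, w in global_params.items():
        # mean of the client deltas acts as a pseudo-gradient
        delta = torch.stack([cp[k] - w for cp in client_params_list]).mean(dim=0)
        new_m[k] = beta1 * m.get(k, torch.zeros_like(w)) + (1 - beta1) * delta
        new_v[k] = beta2 * v.get(k, torch.zeros_like(w)) + (1 - beta2) * delta ** 2
        new_params[k] = w + lr * new_m[k] / (new_v[k].sqrt() + tau)
    return new_params, new_m, new_v
```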
In an alternative embodiment, the method includes: acquiring a preset model precision and a teacher model corresponding to the joint learning training, and issuing the teacher model to each participant; the joint learning training is then performed cyclically as follows: training a student model of each participant with the participant data of that participant according to the teacher model; performing model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; obtaining first model parameters of each participant's student model after the model distillation processing, and aggregating the plurality of first model parameters to obtain first aggregation parameters; determining an aggregation model based on the first aggregation parameters and testing the model precision of the aggregation model; when the model precision is greater than the preset model precision, the joint learning training ends; when the model precision is less than or equal to the preset model precision, the joint learning training continues and the aggregation model is issued to each participant to update the teacher model at each participant.
To achieve a higher precision of the aggregation model, the embodiment of the disclosure provides a loop algorithm that performs multiple rounds of training; the joint learning training ends when the final model precision is greater than the preset model precision. This embodiment is otherwise similar to the previous embodiment and is not described again here.
In step S202, training the student model of each participant with the participant data of that participant according to the teacher model includes: inputting the participant data of each participant into the student model of that participant to obtain a first output of each participant; inputting the participant data of each participant into the teacher model to obtain a second output; and training the student model of each participant based on the second output and the first output of that participant.
The teacher model is larger in scale; performing model distillation processing on the teacher model to obtain the student model and then training the student model can improve the training speed of joint learning. However, the accuracy of the student model is lower than that of the teacher model, so in the disclosed embodiments the teacher model is used to guide the training of the student model. Training the student model of each participant based on the second output and the first output of that participant means training the student model on the difference between the second output and the first output. This is essentially the same as ordinary loss-based model training (in the prior art it is common to compute gradients from the loss and then update the model parameters with a back-propagation algorithm); the difference is that the disclosed embodiments use a teacher model to guide the training of the student models.
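One common realization of "training on the difference between the two outputs" is soft-target distillation with a temperature, sketched below; the temperature, the weighting alpha, and the KL form of the difference are assumptions for illustration rather than the specific loss prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def student_step(student, teacher, x, y, optimizer, T=2.0, alpha=0.5):
    """One training step: the KL term measures the difference between the teacher's
    output (second output) and the student's output (first output); the cross entropy
    term is the ordinary supervised loss on the participant's labels."""
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(x)                      # second output
    student_logits = student(x)                          # first output
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, y)
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```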
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a model distillation-based joint learning training device according to an embodiment of the present disclosure. As shown in fig. 3, the model distillation-based joint learning training device includes:
the obtaining module 301 is configured to obtain a teacher model corresponding to the joint learning training, and issue the teacher model to each participant;
a training module 302 configured to train a student model of each participant with participant data of each participant according to the teacher model;
a distillation module 303 configured to perform a model distillation process on the student model at each participant, wherein the model distillation process is to migrate knowledge of the teacher model to the student model;
the aggregation module 304 is configured to obtain first model parameters of the student model after model distillation processing of each participant, and aggregate a plurality of the first model parameters to obtain first aggregation parameters;
The determining module 305 is configured to determine an aggregation model corresponding to the joint learning training based on the first aggregation parameter.
In joint learning there are a plurality of participants and a training center. Each participant provides training data and trains its own student model on that data. The training center can initiate the joint learning training, aggregate the model parameters of the participants' student models after the model distillation processing to obtain aggregation parameters, and generate an aggregation model from the aggregation parameters; this aggregation model is the joint learning model. The execution subject of the embodiments of the disclosure may be the entire joint learning system, including the plurality of participants and the training center. The teacher model is a model trained in advance by a user holding a large amount of data assets, and the teacher model corresponding to the joint learning training is such a pre-trained model for the topic of the joint learning training. The teacher model and the student model may be any neural network models, the teacher model being a pre-trained model (that is, a model already trained by a user with a large amount of data assets). The student model is a participant's model, and the accuracy of each participant's student model after training on that participant's data still needs to be improved further. Because the teacher model is a pre-trained model, performing model distillation processing on the student model at each participant allows the model parameters of the student model to be further updated with the help of the teacher model.
Performing the model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model, means that at each participant the teacher model performs model distillation on the student model; that is, model distillation migrates the knowledge of the teacher model into the student model.
Optionally, the aggregation module 304 is further configured to update, at the training center, model parameters of the student model based on the first aggregation parameters to obtain an aggregate model (the student model may be a neural network model without training, the student model may be of a size much smaller than the teacher model, and the student model may be issued by the training center to each participant).
The topic of the joint learning training, or the application scenario of the embodiments of the present disclosure, may be electricity/gas consumption prediction (training an aggregation model to predict the amount of electricity or gas consumed by users in a certain area over a certain period), face recognition (training an aggregation model to recognize faces), insurance data processing (training an aggregation model to process insurance data, that is, data about insured users, and thereby determine the most suitable insurance type for each user), and so on.
According to the technical solution provided by the embodiments of the disclosure, a teacher model corresponding to the joint learning training is obtained and issued to each participant; a student model of each participant is trained with the participant data of that participant according to the teacher model; model distillation processing is performed on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; first model parameters of each participant's student model after the model distillation processing are obtained, and the plurality of first model parameters are aggregated to obtain first aggregation parameters; and an aggregation model corresponding to the joint learning training is determined based on the first aggregation parameters. By adopting these technical means, the problems of large model scale, high communication cost, and high model application cost in joint learning training in the prior art can be solved, and the model scale, communication cost, and model application cost in joint learning training are thereby reduced.
Optionally, the distillation module 303 is further configured to, at each participant: calculate a target loss value corresponding to the teacher model by using an objective function; and perform model distillation processing on the student model by the teacher model based on the target loss value.
The target loss value is a loss value between the teacher model and the student model, and can be regarded as a constraint from the teacher model on the student model; knowledge migration from the teacher model to the student model, that is, the model distillation processing, is realized according to this constraint. The teacher model may be any common neural network model.
Optionally, the distillation module 303 is further configured to calculate a first loss value corresponding to the feature extraction networks of the teacher model and the student model by using a minimum absolute value deviation function; calculate a second loss value corresponding to the region candidate networks of the teacher model and the student model by using a minimum average error function; calculate a third loss value corresponding to the head networks of the teacher model and the student model by using a cross entropy loss function; and perform the model distillation processing on the teacher model based on the first loss value, the second loss value, and the third loss value to obtain the student model; where the objective function includes the minimum absolute value deviation function, the minimum average error function, and the cross entropy loss function, and the target loss value includes the first loss value, the second loss value, and the third loss value.
A neural network model generally includes three parts: a feature extraction network (Backbone), a region candidate network (RPN, region proposal network), and a head network (Head). The Backbone is often a residual network used to extract features; the RPN determines candidate boxes from the features; and the Head makes predictions from the regions corresponding to the candidate boxes.
The minimum absolute value deviation function is the L1-norm loss function, the minimum average error function is the L2-norm loss function, and the cross entropy loss function is CrossEntropyLoss. For calculating the first loss value corresponding to the feature extraction networks of the teacher model and the student model with the minimum absolute value deviation function, the first loss value may be a loss between the output of the (uncompressed) feature extraction network in the teacher model and the output of the (compressed) feature extraction network in the student model. The second and third loss values are obtained similarly. The first, second, and third loss values can be regarded as three constraints, and knowledge migration from the teacher model to the student model, that is, the model distillation processing, is realized according to these three constraints.
Optionally, the distillation module 303 is further configured to perform model acceleration processing, using the deep learning inference optimizer, on each participant's student model after the model distillation processing; acquire second model parameters of each participant's student model after the model acceleration processing, and aggregate the second model parameters of each student model to obtain second aggregation parameters; and determine an aggregation model corresponding to the joint learning training based on the second aggregation parameters.
The deep learning inference optimizer may be TensorRT, which provides inter-layer fusion or tensor fusion, data precision calibration, automatic CUDA kernel tuning, dynamic tensor memory, and multi-stream execution; model inference speed can be doubled by means of TensorRT acceleration.
CUDA is a general parallel computing architecture that may be used in model training.
Inter-layer fusion or tensor fusion: the CUDA kernels compute tensors quickly, but a great deal of time is wasted on kernel launches and on reading and writing the input/output tensors of each layer, which creates a memory-bandwidth bottleneck and wastes GPU resources; TensorRT greatly reduces the number of layers by merging layers horizontally or vertically, so that fewer CUDA kernels are occupied and the whole model structure becomes smaller, faster, and more efficient. Data precision calibration: during deployment and inference the model does not need back propagation, so the data precision can be appropriately reduced, for example to FP16 or INT8; lower data precision leads to lower memory occupation and latency and a smaller model volume, but in practice FP16 was found to introduce a considerable accuracy loss, so FP16 quantization is not performed in order to preserve model accuracy. Automatic CUDA kernel tuning: when the network model performs inference computation, the CUDA kernels of the GPU are invoked for the calculation. Dynamic tensor memory: TensorRT allocates memory for each tensor only during its period of use, which avoids repeated memory allocation, reduces memory occupation, and improves reuse efficiency. Multi-stream execution: the execution speed of the underlying GPU operations is optimized.
Performing model acceleration processing with the deep learning inference optimizer on each participant's student model after the model distillation processing can adjust the model structure or model parameters of the student model, so that the adjusted student model performs better and the aggregation model in turn performs better.
Optionally, the training module 302 is further configured to acquire a preset total number of rounds and a teacher model corresponding to the joint learning training, and issue the teacher model to each participant; the joint learning training is then performed cyclically as follows: training a student model of each participant with the participant data of that participant according to the teacher model; performing model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; obtaining first model parameters of each participant's student model after the model distillation processing, and aggregating the plurality of first model parameters to obtain first aggregation parameters; determining an aggregation model based on the first aggregation parameters and incrementing the training round by one, where the training round indicates how many rounds of the joint learning training have been performed and has an initial value of zero; when the training round equals the preset total number of rounds, the joint learning training ends; when the training round is smaller than the preset total number of rounds, the joint learning training continues and the aggregation model is issued to each participant to update the teacher model at each participant.
To achieve a higher precision of the aggregation model, the embodiment of the disclosure proposes a loop algorithm that performs multiple rounds of training; the joint learning training ends when the training round equals the preset total number of rounds. Training the student model of each participant with the participant data of each participant according to the teacher model means that each participant's student model is trained with that single participant's own data. Each time an aggregation model is obtained, it is judged whether to end the joint learning training. If the joint learning training continues, the aggregation model is issued to each participant so that the teacher model at each participant is updated with the aggregation model, and the next round of training then proceeds.
The method of aggregating the first model parameters of each student model may employ FedAdam, FedProx, or SCAFFOLD.
Optionally, the training module 302 is further configured to acquire a preset model precision and a teacher model corresponding to the joint learning training, and issue the teacher model to each participant; the joint learning training is then performed cyclically as follows: training a student model of each participant with the participant data of that participant according to the teacher model; performing model distillation processing on the student model at each participant, where the model distillation processing migrates knowledge of the teacher model to the student model; obtaining first model parameters of each participant's student model after the model distillation processing, and aggregating the plurality of first model parameters to obtain first aggregation parameters; determining an aggregation model based on the first aggregation parameters and testing the model precision of the aggregation model; when the model precision is greater than the preset model precision, the joint learning training ends; when the model precision is less than or equal to the preset model precision, the joint learning training continues and the aggregation model is issued to each participant to update the teacher model at each participant.
To achieve a higher precision of the aggregation model, the embodiment of the disclosure provides a loop algorithm that performs multiple rounds of training; the joint learning training ends when the final model precision is greater than the preset model precision. This embodiment is otherwise similar to the previous embodiment and is not described again here.
Optionally, the training module 302 is further configured to input the participant data of each participant into the student model of that participant to obtain a first output of each participant; input the participant data of each participant into the teacher model to obtain a second output; and train the student model of each participant based on the second output and the first output of that participant.
The teacher model is larger in scale; performing model distillation processing on the teacher model to obtain the student model and then training the student model can improve the training speed of joint learning. However, the accuracy of the student model is lower than that of the teacher model, so in the disclosed embodiments the teacher model is used to guide the training of the student model. Training the student model of each participant based on the second output and the first output of that participant means training the student model on the difference between the second output and the first output. This is essentially the same as ordinary loss-based model training (in the prior art it is common to compute gradients from the loss and then update the model parameters with a back-propagation algorithm); the difference is that the disclosed embodiments use a teacher model to guide the training of the student models.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
Illustratively, the computer program 403 may be partitioned into one or more modules/units, which are stored in the memory 402 and executed by the processor 401 to complete the present disclosure. One or more of the modules/units may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program 403 in the electronic device 4.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that FIG. 4 is merely an example of the electronic device 4 and does not constitute a limitation of the electronic device 4, which may include more or fewer components than shown, a combination of certain components, or different components; for example, the electronic device may also include an input-output device, a network access device, a bus, and the like.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 4. Further, the memory 402 may also include both an internal storage unit and an external storage device of the electronic device 4. The memory 402 is used to store the computer program and other programs and data required by the electronic device. The memory 402 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other manners. For example, the apparatus/electronic device embodiments described above are merely illustrative, e.g., the division of modules or elements is merely a logical functional division, and there may be additional divisions of actual implementations, multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A model distillation-based joint learning training method, comprising:
obtaining a teacher model corresponding to the joint learning training, and issuing the teacher model to each participant;
training a student model of each participant by using the participant data of each participant according to the teacher model;
performing model distillation processing on the student model at each participant, wherein the model distillation processing is to migrate knowledge of the teacher model to the student model;
obtaining first model parameters of a student model of each participant after the model distillation processing, and aggregating a plurality of first model parameters to obtain first aggregation parameters;
and determining an aggregation model corresponding to the joint learning training based on the first aggregation parameters.
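By way of non-limiting illustration, the following Python (PyTorch) sketch shows one round of the above method under stated assumptions: each participant is represented here by a local data loader, plain parameter averaging is assumed as the aggregation rule, and local_train_and_distill is a hypothetical caller-supplied routine standing in for the local training and model distillation steps (sketched after claims 3 and 7). None of these names are taken from the specification.

    # One joint-learning round: issue the teacher, train/distill a student per
    # participant, collect first model parameters, aggregate, build the model.
    import copy
    import torch

    def joint_learning_round(teacher, student_template, participants,
                             local_train_and_distill):
        first_params = []
        for loader in participants:                    # teacher has been issued to each participant
            student = copy.deepcopy(student_template)  # that participant's student model
            local_train_and_distill(teacher, student, loader)
            first_params.append({k: v.detach().clone()
                                 for k, v in student.state_dict().items()})
        # aggregate the first model parameters into the first aggregation parameters
        agg = {}
        for k, v in first_params[0].items():
            if v.is_floating_point():
                agg[k] = torch.stack([p[k] for p in first_params]).mean(dim=0)
            else:
                agg[k] = v.clone()                     # integer buffers: keep the first value
        aggregated_model = copy.deepcopy(student_template)
        aggregated_model.load_state_dict(agg)          # aggregation model for this round
        return aggregated_model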
2. The method of claim 1, wherein performing the model distillation processing on the student model at each participant, the model distillation processing being to migrate knowledge of the teacher model to the student model, comprises:
at each participant:
calculating a target loss value corresponding to the teacher model by using an objective function;
and performing model distillation processing on the student model by the teacher model based on the target loss value.
3. The method according to claim 2, characterized by comprising:
calculating a first loss value corresponding to the feature extraction networks of the teacher model and the student model by using a minimum absolute value deviation function;
calculating a second loss value corresponding to the region proposal networks of the teacher model and the student model by using a minimum average error function;
calculating a third loss value corresponding to the head network of the teacher model and the student model by using a cross entropy loss function;
performing the model distillation processing on the teacher model based on the first loss value, the second loss value and the third loss value to obtain the student model;
wherein the objective function comprises the minimum absolute value deviation function, the minimum average error function, and the cross entropy loss function, and the target loss value comprises the first loss value, the second loss value, and the third loss value.
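As a purely illustrative sketch of the target loss of claim 3 (continuing the Python assumption above): the teacher and student are assumed to expose intermediate outputs of a detector-style network, i.e. feature-extraction (backbone) features, region-proposal outputs and head logits; equal weighting of the three terms is an assumption, and the L1 and MSE losses stand in for the claimed minimum absolute value deviation and minimum average error functions.

    import torch
    import torch.nn.functional as F

    def target_loss(t_feat, s_feat, t_rpn, s_rpn, t_logits, s_logits):
        loss_feat = F.l1_loss(s_feat, t_feat)    # first loss: feature extraction networks
        loss_rpn = F.mse_loss(s_rpn, t_rpn)      # second loss: region proposal networks
        # third loss: cross entropy between the head outputs, with the teacher's
        # softened predictions used as the target distribution (an assumption)
        loss_head = F.cross_entropy(s_logits, F.softmax(t_logits, dim=1))
        return loss_feat + loss_rpn + loss_head  # target loss value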
4. The method of claim 1, wherein after the model distillation processing is performed on the student model at each participant, the model distillation processing being to migrate knowledge of the teacher model to the student model, the method further comprises:
performing, at each participant, model acceleration processing on the student model after the model distillation processing by using a deep learning inference optimizer;
acquiring second model parameters of the student models of each participant after the model acceleration processing, and aggregating the second model parameters of each student model to obtain second aggregation parameters;
and determining an aggregation model corresponding to the joint learning training based on the second aggregation parameter.
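The claim leaves the deep learning inference optimizer unspecified; as a hedged sketch, one common route is to export the distilled student to ONNX and hand the exported file to a TensorRT-style engine builder afterwards. The function and file names below are illustrative assumptions, not part of the claimed method.

    import torch

    def export_student_for_acceleration(student, example_input, path="student.onnx"):
        student.eval()
        torch.onnx.export(
            student, example_input, path,          # trace the student on an example input
            input_names=["input"], output_names=["output"],
            opset_version=13,
        )
        return path                                # file consumed by the inference optimizer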
5. The method according to claim 1, characterized in that it comprises:
acquiring a preset total number of rounds and the teacher model corresponding to the joint learning training, and issuing the teacher model to each participant;
the joint learning training is performed by cyclically executing the following steps:
training a student model of each participant by using the participant data of each participant according to the teacher model;
performing model distillation processing on the student model at each participant, wherein the model distillation processing is to migrate knowledge of the teacher model to the student model;
obtaining first model parameters of a student model of each participant after the model distillation processing, and aggregating a plurality of first model parameters to obtain first aggregation parameters;
determining the aggregation model based on the first aggregation parameter, and adding one to a training round, wherein the training round represents the number of rounds of the current joint learning training and has an initial value of zero, the joint learning training is ended when the training round is equal to the preset total number of rounds, and the joint learning training is continued when the training round is smaller than the preset total number of rounds;
and issuing the aggregation model to each participant so as to update the teacher model at each participant.
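A minimal sketch of the round-limited loop of claim 5, reusing the hypothetical joint_learning_round helper from the sketch after claim 1; the aggregation model produced in each round is issued as the teacher for the next round.

    def train_for_preset_rounds(teacher, student_template, participants,
                                local_train_and_distill, preset_total_rounds):
        training_round = 0                         # initial value of the training round is zero
        while training_round < preset_total_rounds:
            aggregated = joint_learning_round(teacher, student_template,
                                              participants, local_train_and_distill)
            training_round += 1                    # add one to the training round
            teacher = aggregated                   # issue the aggregation model to update the teacher
        return teacher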
6. The method according to claim 1, characterized in that it comprises:
acquiring a preset model precision and the teacher model corresponding to the joint learning training, and issuing the teacher model to each participant;
the joint learning training is performed by cyclically executing the following steps:
training a student model of each participant by using the participant data of each participant according to the teacher model;
performing model distillation processing on the student model at each participant, wherein the model distillation processing is to migrate knowledge of the teacher model to the student model;
obtaining first model parameters of a student model of each participant after the model distillation processing, and aggregating a plurality of first model parameters to obtain first aggregation parameters;
determining an aggregation model based on the first aggregation parameter, testing the model precision of the aggregation model, ending the joint learning training when the model precision is greater than the preset model precision, and continuing the joint learning training when the model precision is less than or equal to the preset model precision;
and issuing the aggregation model to each participant so as to update the teacher model at each participant.
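A corresponding sketch of the precision-gated loop of claim 6; evaluate_precision is a hypothetical evaluation routine returning the tested model precision of the aggregation model on held-out data.

    def train_until_precision(teacher, student_template, participants,
                              local_train_and_distill, evaluate_precision,
                              preset_model_precision):
        while True:
            aggregated = joint_learning_round(teacher, student_template,
                                              participants, local_train_and_distill)
            if evaluate_precision(aggregated) > preset_model_precision:
                return aggregated                  # preset precision exceeded: end training
            teacher = aggregated                   # otherwise update the teacher and continue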
7. The method of claim 1, wherein training the student model of each participant with participant data of each participant according to the teacher model comprises:
inputting the participant data of each participant into a student model of each participant to obtain a first output of each participant;
inputting the participant data of each participant into the teacher model to obtain a second output;
based on the second output and the first output of each participant, a student model of each participant is trained.
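A sketch of one local training step in the sense of claim 7: the student's first output is pulled toward the teacher's second output. Using a temperature-scaled KL-divergence term plus an optional hard-label cross-entropy term is a common concrete choice assumed here, not something the claim prescribes.

    import torch
    import torch.nn.functional as F

    def train_student_step(teacher, student, optimizer, x, y, T=2.0, alpha=0.5):
        with torch.no_grad():
            teacher_out = teacher(x)               # second output
        student_out = student(x)                   # first output
        soft = F.kl_div(F.log_softmax(student_out / T, dim=1),
                        F.softmax(teacher_out / T, dim=1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_out, y)     # hard-label term (labels assumed available)
        loss = alpha * soft + (1 - alpha) * hard
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()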
8. A model distillation based joint learning training device, comprising:
the acquisition module is configured to acquire a teacher model corresponding to the joint learning training and issue the teacher model to each participant;
a training module configured to train a student model of each participant with participant data of each participant according to the teacher model;
a distillation module configured to perform a model distillation process on the student model at each participant, wherein the model distillation process is to migrate knowledge of the teacher model to the student model;
the aggregation module is configured to acquire first model parameters of the student model of each participant after the model distillation processing, and aggregate a plurality of first model parameters to obtain first aggregation parameters;
a determination module configured to determine an aggregation model corresponding to the joint learning training based on the first aggregation parameter.
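For orientation only, the device of claim 8 can be pictured as a thin composition of the hypothetical helpers sketched above; the class and attribute names mirror the claimed modules and are not taken from the specification.

    class JointLearningTrainer:
        def __init__(self, teacher, student_template, participants,
                     local_train_and_distill):
            self.teacher = teacher                             # acquisition module: obtained teacher model
            self.student_template = student_template
            self.participants = participants
            self.local_train_and_distill = local_train_and_distill  # training + distillation modules

        def run_round(self):
            # aggregation + determination modules: one joint-learning round
            aggregated = joint_learning_round(self.teacher, self.student_template,
                                              self.participants,
                                              self.local_train_and_distill)
            self.teacher = aggregated                          # issue the aggregation model
            return aggregated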
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202210691251.3A 2022-06-17 2022-06-17 Combined learning training method and device based on model distillation Pending CN117313831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210691251.3A CN117313831A (en) 2022-06-17 2022-06-17 Combined learning training method and device based on model distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210691251.3A CN117313831A (en) 2022-06-17 2022-06-17 Combined learning training method and device based on model distillation

Publications (1)

Publication Number Publication Date
CN117313831A true CN117313831A (en) 2023-12-29

Family

ID=89260850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210691251.3A Pending CN117313831A (en) 2022-06-17 2022-06-17 Combined learning training method and device based on model distillation

Country Status (1)

Country Link
CN (1) CN117313831A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination