WO2023050754A1 - Model training method and apparatus for private data set - Google Patents

Model training method and apparatus for private data set

Info

Publication number
WO2023050754A1
WO2023050754A1 (PCT/CN2022/085131)
Authority
WO
WIPO (PCT)
Prior art keywords
model
public data
data set
output
server
Application number
PCT/CN2022/085131
Other languages
French (fr)
Chinese (zh)
Inventor
刘洋
Original Assignee
清华大学
Application filed by 清华大学 (Tsinghua University)
Publication of WO2023050754A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the technical field of multi-party data cooperation, and in particular to a model training method and device based on private data sets.
  • the embodiment of the present application provides a model training method and device based on a private data set to solve the problem of the lack of a model training solution based on a multi-party private data set.
  • the embodiment of the present application provides a model training method based on a private data set, including:
  • training a server-side model based on a public data set and real labels corresponding to the public data set;
  • the first model output sent by each client is obtained by the client inputting the public data set into the local learning model; the local learning model is obtained by the client training a preset model based on the private data set and corresponding labels;
  • the second model output is sent to each of the clients, so that each of the clients can retrain the local learning model based on the second model output and the public data set.
  • the training of the server-side model based on the public data set and the real label corresponding to the public data set includes:
  • the server-side model is trained based on a cross-entropy loss function between the predicted result and the true label.
  • the training of the server-side model based on the public data set and the real label corresponding to the public data set further includes:
  • the first target model output is the model output corresponding to the target public data;
  • the target public data is the public data in the public data set whose prediction result, after being input into the server-side model, matches the corresponding real label;
  • the target public data to be distilled is the public data in the public data set whose prediction result, after being input into the server-side model, does not match the corresponding real label;
  • the first public data to be distilled is the part of the target public data to be distilled that has a corresponding first target model output;
  • the server-side model is trained based on the first public data to be distilled and a first target model output corresponding to the first public data to be distilled.
  • the acquiring the first model output sent by each client includes:
  • the second public data to be distilled is the part of the target public data to be distilled that does not have a corresponding first target model output;
  • the request is used to ask the client to return the first model output;
  • the first model output is the portion of each local learning model's output that corresponds to the second public data to be distilled;
  • retraining the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs includes:
  • the second target model output is the portion of the first model outputs whose corresponding prediction results match the corresponding real labels;
  • the third public data to be distilled is the part of the second public data to be distilled that has corresponding second target model outputs;
  • the server-side model is retrained.
  • the retraining of the server-side model based on the third public data to be distilled and each second target model output includes: determining the information entropy of each second target model output; determining weights based on the magnitude of the information entropy; fusing the second target model outputs based on the weights to obtain a third target model output; and retraining the server-side model based on the third public data to be distilled and the third target model output.
  • the public data set and the private data set include: image data, text data or sound data related to entities.
  • the embodiment of the present application provides a model training device based on a private data set, including:
  • the first training unit is used to train the server-side model based on the public data set and the real label corresponding to the public data set;
  • the obtaining unit is used to obtain the first model output sent by each client; the first model output is obtained by the client inputting the public data set into the local learning model; the local learning model is obtained by the client training the preset model based on the private data set and corresponding labels;
  • a second training unit configured to retrain the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs;
  • an input unit configured to input the public data set into the retrained server-side model to obtain a second model output;
  • a sending unit configured to send the second model output to each of the clients, so that each of the clients can retrain the local learning model based on the second model output and the public data set.
  • the embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the model training method based on a private data set provided by this application.
  • the embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the steps of the model training method based on a private data set provided by this application are implemented.
  • The embodiment of the present application provides a model training method based on a private data set. The public data set, the first model outputs, and the second model outputs serve as channels and media for information exchange between each local learning model and the server-side model, fully exploiting the server-side model's independent training ability: knowledge distillation and knowledge fusion are performed based on each first model output, and the fused knowledge is then sent back to each local learning model via the second model output, so that every local learning model obtains the fused knowledge. That is, with the public data set, the first model outputs, and the second model outputs as the media of knowledge transmission, all knowledge is stored in one powerful model (the server-side model) as a general knowledge base to help federated learning.
  • The server-side model not only uses ample computing resources to train itself, but also learns knowledge from all clients acting as multiple teachers, which further improves the effect of the server-side model.
  • In return, the knowledge accumulated on the server side is further passed to the clients to improve the effect of all clients' local learning models, so that each local learning model obtained after training contains knowledge from multiple private data sets; that is, each local learning model is effectively trained on multi-party private data sets.
  • the embodiment of the present application provides a feasible model training method based on private data sets, which can be specifically applied to the training of models related to private data in the medical field.
  • Fig. 1 is the first schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
  • Fig. 2 is the second schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
  • Fig. 3 is the third schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
  • Fig. 4 is the fourth schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
  • Fig. 5 is the fifth schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a model training device based on a private data set provided in an embodiment of the present application
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • A large amount of information data floods industries such as economy, culture, education, medical care, and public administration, and data processing such as data analysis, data mining, and trend prediction is widely applied to it in more and more scenarios.
  • Through data cooperation, multiple data owners can obtain better data processing results.
  • For example, more accurate model parameters can be obtained through joint training on multi-party data.
  • the joint training system for models based on private data can be applied to the scenario where all parties cooperate to train machine learning models for use by multiple parties while ensuring the data security of all parties.
  • multiple data parties have their own data, and they want to jointly use each other's data for unified modeling (for example, a linear regression model, a logistic regression model, etc.), but they do not want their own data (especially private data) to be leaked.
  • hospital A has a batch of patient data (such as photos of patients' diseased parts) that is not suitable for public disclosure because of patient privacy concerns, and hospital B likewise has a batch of patient data that is not suitable for public disclosure;
  • a training sample set determined from the patient data of hospital A and hospital B could be trained into a relatively good machine learning model. Both hospitals are willing to participate in joint model training using each other's patient data, but hospital A and hospital B must ensure that patient data is not leaked, and they cannot, or are unwilling to, let the other party see their patient data.
  • the embodiments of the present application provide a model training method and device based on private datasets based on knowledge distillation and federated learning.
  • In the traditional federated learning setting, clients upload model parameters or model gradients to a central server; the server aggregates them in some form and distributes them back to the clients, which then update further on their local data.
  • Transferring parameters or gradients will bring a series of privacy, heterogeneity, and communication cost issues.
  • Some existing work uses knowledge distillation to transfer knowledge between the clients and the server. However, because clients are in practice resource-constrained, large models cannot be used directly on the client, so the resource constraint remains a major challenge. Only by mining the server-side computing resources as much as possible, and using an auxiliary large model on the server to transfer and accumulate knowledge, can the same knowledge-fusion effect be achieved as centralized training with a large model.
  • Fig. 1 is the first schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application. As shown in Fig. 1, the method includes:
  • Step 110: train the server-side model based on the public data set and the real labels corresponding to the public data set;
  • the public data set and the private data set are the same type of data; the difference is that the public data set is data that can be made public, while the private data set is data that cannot be, or is not suitable to be, made public.
  • the public data set and the private data set may be image data, text data, or sound data related to entities, for example, the patient disease picture data of some hospitals or the user data of some Internet companies.
  • the server-side model is a large model, that is, a relatively complex model, so that knowledge can be mined and learned as fully as possible.
  • Step 120: obtain the first model output sent by each client; the first model output is obtained by the client inputting the public data set into the local learning model; the local learning model is obtained by the client training the preset model based on the private data set and corresponding labels;
  • Step 130: retrain the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs;
  • Step 140: input the public data set into the retrained server-side model to obtain a second model output;
  • Step 150: send the second model output to each of the clients, so that each of the clients can retrain the local learning model based on the second model output and the public data set.
  • The public data set, the first model outputs, and the second model outputs serve as the channels and media for information exchange between the local learning models and the server-side model, fully exploiting the server-side model's independent training ability: knowledge distillation and knowledge fusion are performed based on each first model output, and the fused knowledge is then sent back to each local learning model via the second model output, so that every local learning model obtains the fused knowledge. That is, with the public data set, the first model outputs, and the second model outputs as the media of knowledge transmission, all knowledge is stored in one powerful model (the server-side model) as a general knowledge base to help federated learning.
  • The server-side model not only uses ample computing resources to train itself, but also learns knowledge from all clients acting as multiple teachers, which further improves the effect of the server-side model.
  • In return, the knowledge accumulated on the server side is further passed to the clients to improve the effect of all clients' local learning models, so that each local learning model obtained after training contains knowledge from multiple private data sets; that is, each local learning model is effectively trained on multi-party private data sets.
  • the server-side model is the center of knowledge aggregation, and the knowledge it learns directly affects the knowledge obtained by each local learning model from the second model output; therefore, the training of the server-side model is a relatively important part.
  • Step 110 ("train the server-side model based on the public data set and the real labels corresponding to the public data set") and step 130 ("retrain the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs") constitute the training part of the server-side model.
  • The training of the server-side model is mainly divided into three parts: preliminary training, self-distillation, and aggregation distillation (retraining). It should be noted that these three parts are not executed in a strict chronological order but are interleaved with one another; a high-level sketch of one round follows.
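  • As a rough illustration only, the following Python sketch shows how one server-side training round might tie the three parts together. All helper names (pretrain_and_memorize, mispredicted_samples, self_distill, aggregation_distill, predict_all) are assumptions introduced for this sketch, not names from the patent; the self-distillation and fusion helpers are sketched in more detail further below.

```python
# Hypothetical orchestration of one server-side round; helper names are
# illustrative stand-ins for the patent's steps, not a definitive API.
def server_round(server_model, public_set, clients):
    # Preliminary training (steps 111-112) while memorizing correctly
    # predicted outputs as "first target model outputs" (step 113).
    memory = pretrain_and_memorize(server_model, public_set)
    # Mispredicted public data = "target public data to be distilled" (step 114).
    wrong = mispredicted_samples(server_model, public_set)
    first_to_distill = [s for s in wrong if s.sample_id in memory]       # step 115
    second_to_distill = [s for s in wrong if s.sample_id not in memory]  # step 123
    # Self-distillation toward the memorized correct outputs (step 116).
    self_distill(server_model, first_to_distill, memory)
    # Ask each client for its outputs on the second data to be distilled
    # (steps 124-126), then aggregate and distill (steps 131-136).
    first_outputs = {c: c.outputs_for(second_to_distill) for c in clients}
    aggregation_distill(server_model, second_to_distill, first_outputs)
    # Second model output on the whole public set, sent back to the
    # clients for local retraining (steps 140-150).
    second_output = predict_all(server_model, public_set)
    for c in clients:
        c.retrain_local(second_output, public_set)
```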
  • Step 111: input the public data set into the server-side model to obtain prediction results;
  • Step 112: train the server-side model based on the cross-entropy loss function between the prediction results and the real labels.
  • This part of the training is relatively routine: the cross-entropy loss function between the prediction results and the real labels is simply used to train the server-side model, and existing training examples can be referred to; a minimal sketch of one such step follows.
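  • A minimal PyTorch sketch of this preliminary training step, assuming a standard classification setting (the model, optimizer, and data shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def pretrain_step(server_model, optimizer, x, y):
    # Step 111: input public data x into the server-side model to get predictions.
    logits = server_model(x)
    # Step 112: cross-entropy between the prediction and the real label y.
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Returning the logits lets the caller memorize correctly predicted
    # outputs (the "first target model output" of step 113).
    return logits.detach(), loss.item()
```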
  • Step 113: determine and store the first target model output;
  • the first target model output is the model output corresponding to the target public data;
  • the target public data is the public data in the public data set whose prediction result, after being input into the server-side model, matches the corresponding real label;
  • The first target model output is stored in preparation for the distillation of steps 114, 115, and 116: the correctly predicted model outputs (that is, the first target model outputs) are saved into a global model-output memory to help later correct samples that the model once predicted correctly but now predicts wrongly.
  • Self-distillation works as follows: for samples in the public data set on which the model's prediction is wrong (that is, the target public data to be distilled), first check whether a corresponding model output exists in the global model-output memory (that is, among the first target model outputs). If it does, this knowledge was once mastered by the model, so reviewing what it has already mastered can help the model correct its own mistakes. Following this idea, the self-distillation method is used to carry out distillation training on the model: the current model output is pulled toward the model output from when the prediction was correct, combined with the cross-entropy loss. The specific steps are as follows:
  • Step 114: determine the target public data to be distilled; the target public data to be distilled is the public data in the public data set whose prediction result, after being input into the server-side model, does not match the corresponding real label;
  • Step 115: determine the first public data to be distilled; the first public data to be distilled is the part of the target public data to be distilled that has a corresponding first target model output;
  • Step 116: train the server-side model based on the first public data to be distilled and the first target model output corresponding to the first public data to be distilled.
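  • A sketch of one self-distillation step under the same assumptions as above. The temperature-scaled KL term and the alpha mixing weight are common distillation choices assumed here for illustration; the patent only states that the current output is pulled toward the previously correct output, combined with the cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def self_distill_step(server_model, optimizer, x, y, stored_logits,
                      temperature=2.0, alpha=0.5):
    # x, y: first public data to be distilled and its real label (step 115).
    # stored_logits: the memorized first target model output for this sample.
    logits = server_model(x)
    ce = F.cross_entropy(logits, y)
    # Pull the current output toward the output from when the model was right.
    kd = F.kl_div(
        F.log_softmax(logits / temperature, dim=1),
        F.softmax(stored_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    loss = alpha * kd + (1.0 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```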
  • The self-distillation of the server-side model is completed in the above way. Focused distillation is then performed on the data in the target public data to be distilled other than the first public data to be distilled; focused distillation is a core part of the solution in the embodiments of this application and is mainly used to obtain the knowledge of other clients' private data sets.
  • Step 121: the client trains the preset model based on the private data set and corresponding labels to obtain the local learning model;
  • Step 122: the client inputs the public data set into the local learning model to obtain a model output;
  • Step 123: determine the second public data to be distilled; the second public data to be distilled is the part of the target public data to be distilled that does not have a corresponding first target model output;
  • Step 124: send a request to each of the clients;
  • Step 125: the client returns the first model output, where the first model output is the portion of each local learning model's output that corresponds to the second public data to be distilled;
  • Step 126: receive the first model output returned by each client.
  • The data sent by each client is the data used for aggregation distillation; the knowledge contained in each local learning model trained on a private data set is conveyed to the server-side model through these first model outputs.
  • This setting not only avoids leaking private data during knowledge transmission, but also reduces the amount of data that needs to be transmitted.
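  • A sketch of the client side of this exchange; the dictionary-of-logits return format and the function name are assumptions made for illustration:

```python
import torch

def client_first_model_output(local_model, requested_public_samples):
    # Steps 124-125: the server's request names the second public data to be
    # distilled; the client returns its local model's outputs on exactly that
    # subset, so neither private data nor the full output set is transmitted.
    local_model.eval()
    outputs = {}
    with torch.no_grad():
        for sample_id, x in requested_public_samples:
            outputs[sample_id] = local_model(x.unsqueeze(0)).squeeze(0)
    return outputs  # the "first model output" sent back to the server
```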
  • Then the aggregation distillation is carried out; the steps mainly include:
  • Step 131: filter the first model outputs to obtain the second target model outputs;
  • the second target model outputs are the portion of the first model outputs whose corresponding prediction results match the corresponding real labels;
  • This step eliminates the first model outputs that cannot serve as good teachers for training on the second public data to be distilled.
  • Step 132: determine the third public data to be distilled;
  • the third public data to be distilled is the part of the second public data to be distilled that has corresponding second target model outputs;
  • Step 133: determine the information entropy of the model output in each of the second target model outputs;
  • Step 134: determine the weight of the model output in each of the second target model outputs based on the magnitude of the information entropy;
  • Step 135: fuse the second target model outputs based on the weights to obtain a third target model output;
  • Step 136: retrain the server-side model based on the third public data to be distilled and the third target model output; that is, distill the server-side model using the weighted model output combined with the cross-entropy loss.
  • In this way, the server-side model is retrained based on the third public data to be distilled and each of the second target model outputs.
  • The outputs are weighted according to their information entropy: the higher the information entropy, the lower the corresponding confidence, so knowledge fusion is carried out selectively.
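  • A sketch of the elimination and entropy-weighted fusion for a single public sample. The softmax over negative entropies is one plausible weighting that realizes "higher entropy means lower weight"; the patent does not fix the exact formula, so treat it as an assumption.

```python
import torch
import torch.nn.functional as F

def fuse_client_outputs(client_logits, true_label):
    # Step 131: keep only client outputs whose prediction matches the real label.
    kept = [z for z in client_logits if z.argmax().item() == true_label]
    if not kept:
        return None  # no usable teacher output for this sample
    # Steps 133-134: information entropy of each kept output, turned into a
    # weight so that higher entropy (lower confidence) gets a lower weight.
    entropies = []
    for z in kept:
        p = F.softmax(z, dim=-1)
        entropies.append(-(p * p.clamp_min(1e-12).log()).sum())
    weights = F.softmax(-torch.stack(entropies), dim=0)
    # Step 135: weighted fusion into the "third target model output".
    return sum(w * z for w, z in zip(weights, kept))
```

  • The fused output is then used, together with the cross-entropy loss, to distill the server-side model on the third public data to be distilled (step 136).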
  • Finally, steps 140 and 150 are executed to complete the retraining of the local learning models.
  • the embodiment of the present application provides a novel method, which uses selective knowledge fusion to store all knowledge in a powerful model as a general knowledge base to help federated learning.
  • The server-side model not only uses ample computing resources to train itself, but also learns knowledge from all clients acting as multiple teachers, which further improves the effect of the server-side model.
  • the accumulated knowledge on the server side will be further passed on to the client side to help all clients improve the effect of local learning models.
  • it can increase the robustness of the models at both ends, and reduce the communication cost of uploading knowledge from the client to the server.
  • The model training system based on a private data set includes a server end and a plurality of clients (hospital A and hospital B represent the plurality of clients in Fig. 5).
  • Step 501: hospital A trains the preset model based on the private data set A and corresponding labels to obtain a local learning model A;
  • Step 502: hospital B trains the preset model based on the private data set B and corresponding labels to obtain a local learning model B;
  • Step 503: the server end trains the server-side model based on the public data set and the real labels corresponding to the public data set;
  • the private data set A, the private data set B, and the public data set are pictures of patients' injured parts;
  • the main purpose of this embodiment is to obtain a model that can identify injured parts and make predictions about them;
  • Step 504: input the public data set into the server-side model for prediction;
  • Step 505: save the correctly predicted model outputs into the global model-output memory;
  • Step 506: based on the global model-output memory, perform self-distillation on some incorrectly predicted samples;
  • Step 507: obtain the model output A sent by hospital A;
  • Step 508: obtain the model output B sent by hospital B;
  • the model output A is obtained by inputting the public data set into the local learning model A;
  • the model output B is obtained by inputting the public data set into the local learning model B;
  • Step 509: carry out the elimination operation and weighted fusion on the model output A and the model output B;
  • Step 510: based on the fused model output, perform aggregation distillation on some of the wrongly predicted samples;
  • The data for aggregation distillation can be multiple pictures, and model output A and model output B contain a model output for each picture undergoing aggregation distillation. Fusion and elimination proceed picture by picture: first a picture for aggregation distillation is determined, and then the model outputs of hospital A and hospital B corresponding to this picture are obtained; whether the prediction results given by the two model outputs match the real label is judged, and for those that match, the information entropy of the outputs is determined and the outputs are weighted according to the level of information entropy, where higher information entropy means lower confidence.
  • Step 511: input the public data set into the server-side model for prediction to obtain the second model output;
  • Step 512: send the second model output to hospital A;
  • Step 513: train the local learning model A based on the second model output;
  • Step 514: send the second model output to hospital B;
  • Step 515: train the local learning model B based on the second model output.
  • This cycle repeats: through selective knowledge fusion, all knowledge is stored in one powerful model as a general knowledge base to help federated learning, and is then passed to the clients, hospital A and hospital B, to help improve the effect of local learning model A and local learning model B. In this way, hospital A and hospital B conduct joint training without disclosing their own private data sets, and obtain local learning models A and B with better actual prediction performance. A hypothetical wiring of this cycle is sketched below.
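  • A hypothetical top-level loop for the hospital example, reusing the server_round sketch from above; ServerModel, HospitalClient, num_rounds, and the data-set variables are illustrative assumptions:

```python
# Hypothetical wiring of the hospital example (steps 501-515).
server_model = ServerModel()
clients = [HospitalClient("A", private_set_a),   # step 501
           HospitalClient("B", private_set_b)]   # step 502
for client in clients:
    client.train_local_on_private_data()         # obtain local models A and B
for _ in range(num_rounds):                      # "this cycle is carried out"
    server_round(server_model, public_set, clients)  # steps 503-515 per round
```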
  • FIG. 6 is a schematic structural diagram of a model training device based on a private data set provided in the embodiment of the present application. As shown in FIG. 6, the device includes:
  • the first training unit 61 is used to train the server-side model based on the public data set and the real label corresponding to the public data set;
  • the obtaining unit 62 is configured to obtain the first model output sent by each client; the first model output is obtained by the client inputting the public data set into a local learning model; the local learning model is obtained by the client training the preset model based on the private data set and corresponding labels;
  • the second training unit 63 is configured to retrain the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs;
  • an input unit 64, configured to input the public data set into the retrained server-side model to obtain a second model output;
  • the sending unit 65 is configured to send the second model output to each of the clients, so that each of the clients retrains the local learning model based on the second model output and the public data set.
  • the first training unit 61 is specifically used for:
  • the server-side model is trained based on a cross-entropy loss function between the predicted result and the true label.
  • the first target model output is the model output corresponding to the target public data;
  • the target public data is the public data in the public data set whose prediction result, after being input into the server-side model, matches the corresponding real label;
  • the target public data to be distilled is the public data in the public data set whose prediction result, after being input into the server-side model, does not match the corresponding real label;
  • the first public data to be distilled is part of the target public data to be distilled that has a corresponding first target model output;
  • the server-side model is trained based on the first public data to be distilled and a first target model output corresponding to the first public data to be distilled.
  • said obtaining the first model output sent by each client includes:
  • the second public data to be distilled is the part of the target public data to be distilled that does not have a corresponding first target model output;
  • the request is used to request the client to return the first model output;
  • the first model output is the portion of each local learning model's output that corresponds to the second public data to be distilled;
  • the second training unit 63 is specifically used for:
  • the second target model outputs are the portion of the first model outputs whose corresponding prediction results match the corresponding real labels;
  • the third public data to be distilled is the part of the second public data to be distilled that has corresponding second target model outputs;
  • the server-side model is retrained.
  • the retraining of the server-side model based on the third public data to be distilled and the output of each of the second target models includes:
  • each of the second target model outputs is fused to obtain a third target model output
  • the public data set and the private data set include: image data, text data or sound data related to entities.
  • Figure 7 is a schematic structural diagram of an electronic device provided by the embodiment of the present application. As shown in Figure 7, the electronic device may include: a processor (processor) 710, a communication interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, where the processor 710, the communication interface 720, and the memory 730 communicate with each other through the communication bus 740.
  • the processor 710 can call the logic instructions in the memory 730 to execute the following method: train the server-side model based on the public data set and the real labels corresponding to the public data set; obtain the first model output sent by each client, the first model output being obtained by the client inputting the public data set into the local learning model, and the local learning model being obtained by the client training the preset model based on the private data set and corresponding labels; retrain the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs; input the public data set into the retrained server-side model to obtain the second model output; and send the second model output to each of the clients, so that each of the clients retrains the local learning model based on the second model output and the public data set.
  • The above-mentioned logic instructions in the memory 730 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium.
  • In essence, the technical solution of the present application, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • The embodiment of the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the methods provided by the above embodiments are performed, for example including: training the server-side model based on the public data set and the real labels corresponding to the public data set; obtaining the first model output sent by each client, the first model output being obtained by the client inputting the public data set into the local learning model, and the local learning model being obtained by the client training the preset model based on the private data set and corresponding labels; retraining the server-side model based on the public data corresponding to each of the first model outputs and each of the first model outputs; inputting the public data set into the retrained server-side model to obtain a second model output; and sending the second model output to each of the clients, so that each of the clients retrains the local learning model based on the second model output and the public data set.
  • The device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment, and this can be understood and implemented by those skilled in the art without creative effort.
  • each implementation can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware.
  • The essence of the above technical solution, or the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments or in some parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A private data set-based method and apparatus for model training, which relate to the technical field of multi-party data collaboration. The method comprises: training a server-side model on the basis of a public data set and a real label corresponding to the public data set; obtaining first model outputs sent by clients, the first model outputs being obtained by inputting the public data set into local learning models, and the local learning models being obtained by training on the basis of the private data set and the corresponding label; training the server-side model on the basis of public data corresponding to the first model outputs; inputting the public data set into the server-side model to obtain second model outputs; and sending the second model outputs to the clients, for the clients to perform retraining of the local learning models on the basis of the second model outputs and the public data set. As such, while avoiding private data set leakage, model training is performed by using the private data set as part of training samples on the basis of knowledge distillation and knowledge fusion.

Description

Model training method and device for private data sets
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application No. 202111165679.6, filed on September 30, 2021 and entitled "Model training method and device based on private data sets", and the Chinese patent application No. 202111189306.2, filed on October 12, 2021 and entitled "Model training method and device based on private data sets", both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the technical field of multi-party data cooperation, and in particular to a model training method and device based on private data sets.
Background
In fields such as data analysis, data mining, and economic forecasting, machine learning models can be used to analyze and discover potential data value. Since the data held by a single data owner may be incomplete, it is difficult to characterize the target accurately; to obtain better model prediction results, joint model training through the data cooperation of multiple data owners has been widely used. However, the process of multi-party data cooperation involves issues such as data security and model security.
Especially in the medical field, some data sets involve privacy, cannot be made public, and may only be used inside the hospital, which makes it very difficult to build a learning model on the private data sets of individual hospitals. In existing schemes, the information exchanged consists of the private data set and the model output obtained by feeding the private data set into the learning model (generally the output of the last neural network layer of the learning model), rather than model results and corresponding labels, and the model is trained by means of knowledge distillation and knowledge fusion. However, under this approach the problem of privacy leakage still exists.
Therefore, there is currently a lack of a model training solution based on multi-party private data sets.
Summary of the Invention
The embodiments of the present application provide a model training method and device based on a private data set, to solve the existing lack of a model training solution based on multi-party private data sets.
In a first aspect, an embodiment of the present application provides a model training method based on a private data set, including:
training a server-side model based on a public data set and real labels corresponding to the public data set;
obtaining a first model output sent by each client, the first model output being obtained by the client inputting the public data set into a local learning model, and the local learning model being obtained by the client training a preset model based on a private data set and corresponding labels;
retraining the server-side model based on the public data corresponding to each first model output and each first model output;
inputting the public data set into the retrained server-side model to obtain a second model output;
sending the second model output to each client, so that each client retrains its local learning model based on the second model output and the public data set.
Optionally, training the server-side model based on the public data set and the real labels corresponding to the public data set includes:
inputting the public data set into the server-side model to obtain prediction results;
training the server-side model based on a cross-entropy loss function between the prediction results and the real labels.
Optionally, training the server-side model based on the public data set and the real labels corresponding to the public data set further includes:
determining and storing a first target model output, the first target model output being the model output corresponding to target public data, and the target public data being the public data in the public data set whose prediction result, after being input into the server-side model, matches the corresponding real label;
determining target public data to be distilled, the target public data to be distilled being the public data in the public data set whose prediction result, after being input into the server-side model, does not match the corresponding real label;
determining first public data to be distilled, the first public data to be distilled being the part of the target public data to be distilled that has a corresponding first target model output;
training the server-side model based on the first public data to be distilled and the first target model output corresponding to the first public data to be distilled.
Optionally, obtaining the first model output sent by each client includes:
determining second public data to be distilled, the second public data to be distilled being the part of the target public data to be distilled that does not have a corresponding first target model output;
sending a request to each client, the request being used to ask the client to return a first model output, where the first model output is the portion of each local learning model's output that corresponds to the second public data to be distilled;
receiving the first model output returned by each client.
Optionally, retraining the server-side model based on the public data corresponding to each first model output and each first model output includes:
filtering the first model outputs to obtain second target model outputs, the second target model outputs being the portion of the first model outputs whose corresponding prediction results match the corresponding real labels;
determining third public data to be distilled, the third public data to be distilled being the part of the second public data to be distilled that has corresponding second target model outputs;
retraining the server-side model based on the third public data to be distilled and each second target model output.
Optionally, retraining the server-side model based on the third public data to be distilled and each second target model output includes:
determining the information entropy of the model output in each second target model output;
determining a weight for the model output in each second target model output based on the magnitude of the information entropy;
fusing the second target model outputs based on the weights to obtain a third target model output;
retraining the server-side model based on the third public data to be distilled and the third target model output.
Optionally, the public data set and the private data set include image data, text data, or sound data related to entities.
In a second aspect, an embodiment of the present application provides a model training device based on a private data set, including:
a first training unit, configured to train a server-side model based on a public data set and real labels corresponding to the public data set;
an obtaining unit, configured to obtain a first model output sent by each client, the first model output being obtained by the client inputting the public data set into a local learning model, and the local learning model being obtained by the client training a preset model based on a private data set and corresponding labels;
a second training unit, configured to retrain the server-side model based on the public data corresponding to each first model output and each first model output;
an input unit, configured to input the public data set into the retrained server-side model to obtain a second model output;
a sending unit, configured to send the second model output to each client, so that each client retrains its local learning model based on the second model output and the public data set.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the model training method based on a private data set provided by this application.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the model training method based on a private data set provided by this application.
The embodiments of the present application provide a model training method based on a private data set. The public data set, the first model outputs, and the second model outputs serve as channels and media for information exchange between each local learning model and the server-side model, fully exploiting the server-side model's independent training ability: knowledge distillation and knowledge fusion are performed based on each first model output, and the fused knowledge is then sent back to each local learning model via the second model output, so that every local learning model obtains the fused knowledge. That is, with the public data set, the first model outputs, and the second model outputs as the media of knowledge transmission, all knowledge is stored in one powerful model (the server-side model) as a general knowledge base to help federated learning. The server-side model not only uses ample computing resources to train itself, but also learns knowledge from all clients acting as multiple teachers, which further improves the effect of the server-side model. In return, the knowledge accumulated on the server side is further passed to the clients to improve the effect of all clients' local learning models, so that each local learning model obtained after training contains knowledge from multiple private data sets; that is, each local learning model is effectively trained on multi-party private data sets. In this way, the embodiments of the present application provide a feasible model training method based on private data sets, which can be specifically applied to the training of models involving private data in the medical field.
Brief Description of the Drawings
To describe the technical solutions in this application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the first schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
Fig. 2 is the second schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
Fig. 3 is the third schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
Fig. 4 is the fourth schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
Fig. 5 is the fifth schematic flowchart of the model training method based on a private data set provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of the model training device based on a private data set provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions in this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
A large amount of information data floods industries such as economy, culture, education, medical care, and public administration, and data processing such as data analysis, data mining, and trend prediction is widely applied to it in more and more scenarios. Through data cooperation, multiple data owners can obtain better data processing results; for example, more accurate model parameters can be obtained through joint training on multi-party data.
In some embodiments, a joint model training system based on private data can be applied to a scenario in which multiple parties cooperate to train a machine learning model for shared use while the data security of all parties is guaranteed. In this scenario, multiple data parties have their own data and want to jointly use each other's data for unified modeling (for example, a linear regression model, a logistic regression model, etc.), but do not want their own data (especially private data) to be leaked. For example, hospital A has a batch of patient data (such as photos of patients' diseased parts) that is not suitable for public disclosure because of patient privacy concerns, and hospital B likewise has a batch of patient data that is not suitable for public disclosure; a training sample set determined from the patient data of hospital A and hospital B could be trained into a relatively good machine learning model. Both hospitals are willing to participate in joint model training using each other's patient data, but hospital A and hospital B must ensure that patient data is not leaked, and they cannot, or are unwilling to, let the other party see their patient data. Therefore, a model training method based on private data sets is needed that allows the private data of multiple parties to be jointly trained into a shared machine learning model without being leaked, achieving a win-win cooperation. On this basis, the embodiments of the present application provide a model training method and device based on private data sets, built on knowledge distillation and federated learning.
In the traditional federated learning setting, clients upload model parameters or model gradients to a central server; the server aggregates them in some form and distributes them back to the clients, which then update further on their local data. Transferring parameters or gradients brings a series of privacy, heterogeneity, and communication cost issues, and some existing work addresses this by using knowledge distillation to transfer knowledge between the clients and the server. However, because clients are in practice resource-constrained, large models cannot be used directly on the client, so the resource constraint remains a major challenge. Only by mining the server-side computing resources as much as possible, and using an auxiliary large model on the server to transfer and accumulate knowledge, can the same knowledge-fusion effect be achieved as centralized training with a large model.
Fig. 1 is one of the schematic flowcharts of the model training method based on private data sets provided by an embodiment of the present application. As shown in Fig. 1, the method includes:
Step 110: train the server-side model based on a public data set and the real labels corresponding to the public data set.
Here, the public data set and the private data sets contain the same type of data; the only difference is that the public data set may be disclosed, whereas the private data sets cannot, or should not, be disclosed. Specifically, the public and private data sets may be image data, text data, or sound data related to entities, for example, patient disease images held by hospitals or user data held by Internet companies. The server-side model is a large model, that is, a relatively complex model able to mine and learn as much knowledge as possible.
Step 120: obtain the first model output sent by each client, where the first model output is obtained by the client inputting the public data set into its local learning model, and the local learning model is obtained by the client training a preset model based on its private data set and the corresponding labels.
Step 130: retrain the server-side model based on each first model output and the public data corresponding to it.
Step 140: input the public data set into the retrained server-side model to obtain a second model output.
Step 150: send the second model output to each client, so that each client retrains its local learning model based on the second model output and the public data set.
By using the public data set, the first model outputs, and the second model output as the channel and medium for exchanging information between the local learning models and the server-side model, the method fully exploits the server-side model's capacity for independent training: knowledge distillation and knowledge fusion are performed on the first model outputs, and the fused knowledge is then sent back to each local learning model via the second model output, so that every local learning model receives the fused knowledge. In other words, with the public data set and the two kinds of model outputs acting as the medium of knowledge transfer, all knowledge is stored in one powerful model (the server-side model) that serves as a universal knowledge base supporting federated learning. The server-side model not only uses its ample computing resources to train itself but also treats all the clients as multiple teachers from which to learn, further improving its own performance. In return, the knowledge accumulated on the server side is passed back to the clients to improve every client's local learning model, so that each locally trained model ultimately contains the knowledge of the multiple private data sets; that is, each local learning model is effectively trained on multi-party private data.
In the solution provided by the embodiments of the present application, the server-side model is the center of knowledge aggregation, and the knowledge it learns directly determines the knowledge that each local learning model finally acquires from the second model output; the training of the server-side model is therefore a particularly important part of the method.
Specifically, step 110 ("train the server-side model based on the public data set and the real labels corresponding to the public data set") and step 130 ("retrain the server-side model based on each first model output and the public data corresponding to it") form the training part of the server-side model:
The training of the server-side model divides into three parts: preliminary training, self-distillation, and aggregation distillation (retraining). It should be noted that these three parts are not executed in strict chronological order; rather, they are interleaved with one another.
Referring to Fig. 2, the steps of preliminary training and self-distillation are as follows:
Step 111: input the public data set into the server-side model to obtain prediction results.
Step 112: train the server-side model based on the cross-entropy loss function between the prediction results and the real labels.
This part of the training is conventional: the server-side model is simply trained with the cross-entropy loss between its predictions and the real labels. Existing training embodiments may be consulted for the details.
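As one possible illustration, a minimal PyTorch sketch of steps 111 and 112 is given below; the model, data loader, and optimizer are placeholders assumed for the example rather than prescribed by this application:

import torch.nn.functional as F

def preliminary_train(server_model, public_loader, optimizer):
    """One epoch of supervised training on the public data set (steps 111-112)."""
    server_model.train()
    for x, y in public_loader:             # public samples and their real labels
        optimizer.zero_grad()
        logits = server_model(x)           # step 111: prediction results
        loss = F.cross_entropy(logits, y)  # step 112: cross-entropy against real labels
        loss.backward()
        optimizer.step()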
Step 113: determine and store the first target model output, where the first target model output is the model output corresponding to target public data, and the target public data is the public data in the public data set whose prediction result, obtained after input into the server-side model, matches the corresponding real label.
Specifically, the storage performed in step 113 prepares for the self-distillation of steps 114, 115, and 116: the model outputs that are predicted correctly in the current round (the first target model outputs) are saved into a global record of model outputs as a memory, which later helps correct samples that are predicted wrongly now but were once answered correctly.
Self-distillation proceeds as follows. For samples in the public data set that the model currently predicts wrongly (the target public data to be distilled), we first check whether the global memory of model outputs (the stored first target model outputs) contains an output for the sample. If it does, the model once possessed this piece of knowledge, so revisiting knowledge it previously mastered can help the model correct its own mistakes. Following this idea, we distill the model against itself: the current model output is trained to approach the model output recorded when the sample was answered correctly, combined with the cross-entropy loss. The specific steps are as follows:
Step 114: determine the target public data to be distilled, i.e., the public data in the public data set whose prediction result after input into the server-side model does not match the corresponding real label.
Step 115: determine the first public data to be distilled, i.e., the portion of the target public data to be distilled that has a corresponding first target model output.
Step 116: train the server-side model based on the first public data to be distilled and the first target model output corresponding to it.
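A hedged sketch of this self-distillation update, under the same PyTorch assumptions as above; the temperature T and the mixing weight alpha are illustrative choices that the application leaves open:

import torch.nn.functional as F

def self_distill_step(server_model, x, y, remembered_logits, optimizer, T=2.0, alpha=0.5):
    """Pull the current output toward a stored correct output (step 116)."""
    # x, y: a currently misclassified public sample and its real label
    # remembered_logits: the first target model output stored for this sample
    optimizer.zero_grad()
    logits = server_model(x)
    kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                  F.softmax(remembered_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)  # approach the remembered output
    ce = F.cross_entropy(logits, y)                 # combined with cross-entropy loss
    loss = alpha * kd + (1.0 - alpha) * ce
    loss.backward()
    optimizer.step()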
The self-distillation of the server-side model is completed in the manner described above. Aggregation distillation is then performed on the remainder of the target public data to be distilled, i.e., everything other than the first public data to be distilled. Aggregation distillation is a core element of the solution of the embodiments of this application and is mainly used to acquire the knowledge contained in the private data sets of the other clients.
Referring to Fig. 3, before aggregation distillation can be performed, the operation of step 120 ("obtain the first model output sent by each client") must be executed. The specific steps are as follows:
Step 121: each client trains the preset model based on its private data set and the corresponding labels to obtain its local learning model.
Step 122: each client inputs the public data set into its local learning model to obtain a model output.
Step 123: determine the second public data to be distilled, i.e., the portion of the target public data to be distilled that has no corresponding first target model output.
Step 124: send a request to each client.
Step 125: each client returns its first model output, where the first model output is the portion of the local learning model's outputs that corresponds to the second public data to be distilled.
Step 126: receive the first model output returned by each client.
With this arrangement, the data sent by each client is exactly the data used for aggregation distillation: the knowledge contained in each local learning model trained on a private data set is conveyed to the server-side model through these first model outputs. This not only avoids leaking private data during the transfer of knowledge but also reduces the amount of data that must be transmitted.
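On the client side, answering such a request might look like the following sketch; sample_ids and the returned dictionary are hypothetical message formats introduced only for illustration:

import torch

@torch.no_grad()
def answer_server_request(local_model, public_dataset, sample_ids):
    """Return local outputs only for the requested public samples (steps 124-126)."""
    local_model.eval()
    outputs = {}
    for i in sample_ids:                    # indices of the second public data to be distilled
        x = public_dataset[i].unsqueeze(0)  # one public sample as a batch of one
        outputs[i] = local_model(x).squeeze(0)
    return outputs                          # this client's first model output

Because only the outputs for the requested samples are returned, no raw private data leaves the client and the upload stays small.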
For samples that the server side has never predicted correctly (the second public data to be distilled), we consider that the server does not yet have the ability to predict them correctly on its own, so we choose to aggregate knowledge from the clients to guide the server's learning. First, from all the clients we select the models that predict the correct answer; then we weight their outputs by the information entropy of each output, on the principle that the higher the information entropy, the lower the corresponding confidence.
Specifically, referring to Fig. 4, aggregation distillation mainly includes the following steps:
Step 131: filter the first model outputs to obtain the second target model outputs, i.e., the portion of the first model outputs whose corresponding prediction results match the corresponding real labels.
The purpose of this step is to eliminate those first model outputs that cannot serve as good teachers for training on the second public data to be distilled.
Step 132: determine the third public data to be distilled, i.e., the portion of the second public data to be distilled that has corresponding second target model outputs.
Step 133: determine the information entropy of each model output among the second target model outputs.
Step 134: determine the weight of each model output among the second target model outputs based on the magnitude of its information entropy.
Step 135: fuse the second target model outputs based on these weights to obtain a third target model output.
Step 136: retrain the server-side model based on the third public data to be distilled and the third target model output; that is, distill the server side using the weighted, fused model output combined with the cross-entropy loss.
In steps 133 to 136, the server-side model is retrained based on the third public data to be distilled and the second target model outputs. During the fusion itself, each output is weighted according to its information entropy, a higher entropy being taken to indicate a lower confidence, so that knowledge is fused selectively. Steps 140 and 150 are then executed to complete the retraining of the local learning models.
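One way to realize the entropy-based weighting of steps 133 to 135 is sketched below; normalizing the weights with a softmax over negative entropies is an assumption of this sketch, since the application only fixes the principle that higher entropy means lower weight:

import torch.nn.functional as F

def entropy_weighted_fusion(teacher_logits):
    """Fuse the second target model outputs for one sample (steps 133-135)."""
    # teacher_logits: torch tensor of shape (num_teachers, num_classes)
    probs = F.softmax(teacher_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # step 133
    weights = F.softmax(-entropy, dim=0)  # step 134: higher entropy -> lower weight
    fused = (weights.unsqueeze(-1) * probs).sum(dim=0)  # step 135
    return fused                          # the third target model output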
On the basis of the above scheme, the embodiments of the present application provide a novel method that uses selective knowledge fusion to store all knowledge in one powerful model serving as a universal knowledge base for federated learning. The server-side model not only uses ample computing resources to train itself but also learns from all the clients as multiple teachers, further improving its own performance. In return, the knowledge accumulated on the server side is passed on to the clients to improve the local learning models of all clients. At the same time, the approach increases the robustness of the models on both ends and reduces the communication cost of uploading knowledge from the clients to the server.
The solution provided by the embodiments of the present application is described below with reference to a specific embodiment.
Referring to Fig. 5, the model training system based on private data sets includes one server side and multiple clients (represented in Fig. 5 by hospital A and hospital B).
Step 501: hospital A trains the preset model based on private data set A and the corresponding labels to obtain local learning model A.
Step 502: hospital B trains the preset model based on private data set B and the corresponding labels to obtain local learning model B.
Step 503: the server side trains the server-side model based on the public data set and the real labels corresponding to the public data set.
Here, private data set A, private data set B, and the public data set consist of pictures of patients' injured areas; the main objective of this embodiment is to obtain a model that can identify injuries and make predictions about them.
Step 504: input the public data set into the server-side model for prediction.
Step 505: save the correctly predicted model outputs into the global memory of model outputs.
Step 506: based on the global memory of model outputs, perform self-distillation on part of the wrongly predicted samples.
Step 507: obtain model output A sent by hospital A.
Step 508: obtain model output B sent by hospital B.
Here, model output A is obtained by inputting the public data set into local learning model A, and model output B is obtained by inputting the public data set into local learning model B.
Step 509: perform the elimination operation and weighted fusion on model output A and model output B.
Step 510: based on the fused model output, perform aggregation distillation on part of the wrongly predicted samples.
It should be noted that the data undergoing aggregation distillation may comprise multiple pictures, with model output A and model output B containing a model output for every picture to be aggregated and distilled; fusion and elimination should therefore proceed picture by picture. That is, first select one picture for aggregation distillation, then obtain the model outputs of hospital A and hospital B corresponding to that picture, and judge whether the prediction results given by the two model outputs match the real label; if they do, determine the information entropy of the two outputs and weight them accordingly, taking a higher information entropy to indicate a lower confidence.
Step 511: input the public data set into the server-side model for prediction to obtain the second model output.
Step 512: send the second model output to hospital A.
Step 513: train local learning model A based on the second model output.
Step 514: send the second model output to hospital B.
Step 515: train local learning model B based on the second model output.
This cycle repeats: selective knowledge fusion stores all knowledge in one powerful model that serves as a universal knowledge base for federated learning, and that knowledge is then passed to the client hospitals A and B to help improve local learning models A and B. Hospital A and hospital B thereby train jointly without disclosing their own private data sets, each obtaining a local learning model (A and B respectively) with better practical prediction performance.
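A sketch of the client-side retraining in steps 513 and 515 (and step 150 generally) is given below; distilling against the server's second model output with a KL loss at temperature T is one plausible reading, and second_outputs is a hypothetical container for the per-sample server logits:

import torch.nn.functional as F

def client_retrain(local_model, public_loader, second_outputs, optimizer, T=2.0):
    """Retrain a local learning model against the server's second model output."""
    local_model.train()
    for i, (x, _) in enumerate(public_loader):  # labels unused; the server output teaches
        optimizer.zero_grad()
        logits = local_model(x)
        loss = F.kl_div(F.log_softmax(logits / T, dim=-1),
                        F.softmax(second_outputs[i] / T, dim=-1),
                        reduction="batchmean") * (T * T)
        loss.backward()
        optimizer.step()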
Based on any of the above embodiments, Fig. 6 is a schematic structural diagram of the model training apparatus based on private data sets provided by an embodiment of the present application. As shown in Fig. 6, the apparatus includes:
a first training unit 61, configured to train the server-side model based on a public data set and the real labels corresponding to the public data set;
an obtaining unit 62, configured to obtain the first model output sent by each client, where the first model output is obtained by the client inputting the public data set into its local learning model, and the local learning model is obtained by the client training a preset model based on its private data set and the corresponding labels;
a second training unit 63, configured to retrain the server-side model based on each first model output and the public data corresponding to it;
an input unit 64, configured to input the public data set into the retrained server-side model to obtain a second model output; and
a sending unit 65, configured to send the second model output to each client, so that each client retrains its local learning model based on the second model output and the public data set.
The first training unit 61 is specifically configured to:
input the public data set into the server-side model to obtain prediction results;
train the server-side model based on the cross-entropy loss function between the prediction results and the real labels;
determine and store the first target model output, the first target model output being the model output corresponding to target public data, and the target public data being the public data in the public data set whose prediction result after input into the server-side model matches the corresponding real label;
determine the target public data to be distilled, i.e., the public data in the public data set whose prediction result after input into the server-side model does not match the corresponding real label;
determine the first public data to be distilled, i.e., the portion of the target public data to be distilled that has a corresponding first target model output; and
train the server-side model based on the first public data to be distilled and the first target model output corresponding to it.
Obtaining the first model output sent by each client includes:
determining the second public data to be distilled, i.e., the portion of the target public data to be distilled that has no corresponding first target model output;
sending a request to each client, the request being used to ask the client to return its first model output, where the first model output is the portion of each local learning model's outputs corresponding to the second public data to be distilled; and
receiving the first model output returned by each client.
Optionally, the second training unit 63 is specifically configured to:
filter the first model outputs to obtain the second target model outputs, the second target model outputs being the portion of the first model outputs whose corresponding prediction results match the corresponding real labels;
determine the third public data to be distilled, i.e., the portion of the second public data to be distilled that has corresponding second target model outputs; and
retrain the server-side model based on the third public data to be distilled and the second target model outputs.
Optionally, retraining the server-side model based on the third public data to be distilled and the second target model outputs includes:
determining the information entropy of each model output among the second target model outputs;
determining the weight of each model output among the second target model outputs based on the magnitude of its information entropy;
fusing the second target model outputs based on the weights to obtain a third target model output; and
retraining the server-side model based on the third public data to be distilled and the third target model output.
Optionally, the public data set and the private data sets include image data, text data, or sound data related to entities.
Fig. 7 is a schematic structural diagram of the electronic device provided by an embodiment of the present application. As shown in Fig. 7, the electronic device may include a processor 710, a communications interface 720, a memory 730, and a communication bus 740, where the processor 710, the communications interface 720, and the memory 730 communicate with one another through the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to execute the following method: training the server-side model based on a public data set and the real labels corresponding to the public data set; obtaining the first model output sent by each client, where the first model output is obtained by the client inputting the public data set into its local learning model, and the local learning model is obtained by the client training a preset model based on its private data set and the corresponding labels; retraining the server-side model based on each first model output and the public data corresponding to it; inputting the public data set into the retrained server-side model to obtain a second model output; and sending the second model output to each client, so that each client retrains its local learning model based on the second model output and the public data set.
In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. On the basis of this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present application further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the methods provided by the above embodiments, for example: training the server-side model based on a public data set and the real labels corresponding to the public data set; obtaining the first model output sent by each client, where the first model output is obtained by the client inputting the public data set into its local learning model, and the local learning model is obtained by the client training a preset model based on its private data set and the corresponding labels; retraining the server-side model based on each first model output and the public data corresponding to it; inputting the public data set into the retrained server-side model to obtain a second model output; and sending the second model output to each client, so that each client retrains its local learning model based on the second model output and the public data set.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without creative effort.
From the above description of the implementations, a person skilled in the art can clearly understand that each implementation may be realized by means of software plus the necessary general-purpose hardware platform, or of course by hardware. On the basis of this understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A model training method based on private data sets, comprising:
    training a server-side model based on a public data set and real labels corresponding to the public data set;
    obtaining a first model output sent by each client, wherein the first model output is obtained by the client inputting the public data set into a local learning model, and the local learning model is obtained by the client training a preset model based on a private data set and corresponding labels;
    retraining the server-side model based on public data corresponding to each first model output and each first model output;
    inputting the public data set into the retrained server-side model to obtain a second model output; and
    sending the second model output to each client, so that each client retrains the local learning model based on the second model output and the public data set.
  2. The model training method based on private data sets according to claim 1, wherein the training of the server-side model based on the public data set and the real labels corresponding to the public data set comprises:
    inputting the public data set into the server-side model to obtain prediction results; and
    training the server-side model based on a cross-entropy loss function between the prediction results and the real labels.
  3. The model training method based on private data sets according to claim 2, wherein the training of the server-side model based on the public data set and the real labels corresponding to the public data set further comprises:
    determining and storing a first target model output, the first target model output being a model output corresponding to target public data, and the target public data being public data in the public data set whose prediction result obtained after input into the server-side model matches the corresponding real label;
    determining target public data to be distilled, the target public data to be distilled being public data in the public data set whose prediction result obtained after input into the server-side model does not match the corresponding real label;
    determining first public data to be distilled, the first public data to be distilled being a portion of the target public data to be distilled that has a corresponding first target model output; and
    training the server-side model based on the first public data to be distilled and the first target model output corresponding to the first public data to be distilled.
  4. The model training method based on private data sets according to claim 3, wherein the obtaining of the first model output sent by each client comprises:
    determining second public data to be distilled, the second public data to be distilled being a portion of the target public data to be distilled that has no corresponding first target model output;
    sending a request to each client, the request being used to ask the client to return the first model output, wherein the first model output is a portion of the model outputs of each local learning model corresponding to the second public data to be distilled; and
    receiving the first model output returned by each client.
  5. The model training method based on private data sets according to claim 4, wherein the retraining of the server-side model based on the public data corresponding to each first model output and each first model output comprises:
    filtering the first model outputs to obtain second target model outputs, the second target model outputs being a portion of the first model outputs whose corresponding prediction results match the corresponding real labels;
    determining third public data to be distilled, the third public data to be distilled being a portion of the second public data to be distilled that has corresponding second target model outputs; and
    retraining the server-side model based on the third public data to be distilled and each second target model output.
  6. The model training method based on private data sets according to claim 5, wherein the retraining of the server-side model based on the third public data to be distilled and each second target model output comprises:
    determining an information entropy of each model output among the second target model outputs;
    determining a weight of each model output among the second target model outputs based on the magnitude of the information entropy;
    fusing the second target model outputs based on the weights to obtain a third target model output; and
    retraining the server-side model based on the third public data to be distilled and the third target model output.
  7. The model training method based on private data sets according to claim 1, wherein the public data set and the private data sets comprise image data, text data, or sound data related to entities.
  8. A model training apparatus based on private data sets, comprising:
    a first training unit, configured to train a server-side model based on a public data set and real labels corresponding to the public data set;
    an obtaining unit, configured to obtain a first model output sent by each client, wherein the first model output is obtained by the client inputting the public data set into a local learning model, and the local learning model is obtained by the client training a preset model based on a private data set and corresponding labels;
    a second training unit, configured to retrain the server-side model based on public data corresponding to each first model output and each first model output;
    an input unit, configured to input the public data set into the retrained server-side model to obtain a second model output; and
    a sending unit, configured to send the second model output to each client, so that each client retrains the local learning model based on the second model output and the public data set.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the model training method based on private data sets according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the model training method based on private data sets according to any one of claims 1 to 7.
PCT/CN2022/085131 2021-09-30 2022-04-02 Model training method and apparatus for private data set WO2023050754A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202111165679 2021-09-30
CN202111165679.6 2021-09-30
CN202111189306.2 2021-10-12
CN202111189306.2A CN114003949B (en) 2021-09-30 2021-10-12 Model training method and device based on private data set

Publications (1)

Publication Number Publication Date
WO2023050754A1 true WO2023050754A1 (en) 2023-04-06

Family

ID=79922769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/085131 WO2023050754A1 (en) 2021-09-30 2022-04-02 Model training method and apparatus for private data set

Country Status (2)

Country Link
CN (1) CN114003949B (en)
WO (1) WO2023050754A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003949B (en) * 2021-09-30 2022-08-30 清华大学 Model training method and device based on private data set
CN115238826B (en) * 2022-09-15 2022-12-27 支付宝(杭州)信息技术有限公司 Model training method and device, storage medium and electronic equipment
CN115270001B (en) * 2022-09-23 2022-12-23 宁波大学 Privacy protection recommendation method and system based on cloud collaborative learning
CN115578369B (en) * 2022-10-28 2023-09-15 佐健(上海)生物医疗科技有限公司 Online cervical cell TCT slice detection method and system based on federal learning
CN116797829A (en) * 2023-06-13 2023-09-22 北京百度网讯科技有限公司 Model generation method, image classification method, device, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428881B (en) * 2020-03-20 2021-12-07 深圳前海微众银行股份有限公司 Recognition model training method, device, equipment and readable storage medium
CN112329052A (en) * 2020-10-26 2021-02-05 哈尔滨工业大学(深圳) Model privacy protection method and device
CN112862011A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Model training method and device based on federal learning and federal learning system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210272011A1 (en) * 2020-02-27 2021-09-02 Omron Corporation Adaptive co-distillation model
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium
CN113222175A (en) * 2021-04-29 2021-08-06 深圳前海微众银行股份有限公司 Information processing method and system
CN114003949A (en) * 2021-09-30 2022-02-01 清华大学 Model training method and device based on private data set

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117313869A (en) * 2023-10-30 2023-12-29 浙江大学 Large model privacy protection reasoning method based on model segmentation
CN117313869B (en) * 2023-10-30 2024-04-05 浙江大学 Large model privacy protection reasoning method based on model segmentation

Also Published As

Publication number Publication date
CN114003949A (en) 2022-02-01
CN114003949B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
WO2023050754A1 (en) Model training method and apparatus for private data set
CN110189192B (en) Information recommendation model generation method and device
US11138521B2 (en) System and method for defining and using different levels of ground truth
US20230039182A1 (en) Method, apparatus, computer device, storage medium, and program product for processing data
CN112085159B (en) User tag data prediction system, method and device and electronic equipment
WO2016090326A1 (en) Intent based digital collaboration platform architecture and design
CN111125760B (en) Model training and predicting method and system for protecting data privacy
Martínez-Villaseñor et al. Enrichment of learner profile with ubiquitous user model interoperability
Yin Research and analysis of intelligent English learning system based on improved neural network
US11847421B2 (en) Discussion support device and program for discussion support device
CN110377827B (en) Course training scene pushing method and device, medium and electronic equipment
Hodhod et al. Cybersecurity curriculum development using ai and decision support expert system
KR101429446B1 (en) System for creating contents and the method thereof
CN112241417B (en) Page data verification method and device, medium and electronic equipment
US20230351153A1 (en) Knowledge graph reasoning model, system, and reasoning method based on bayesian few-shot learning
CN116431915A (en) Cross-domain recommendation method and device based on federal learning and attention mechanism
CN109118151B (en) Work order transaction processing method and work order transaction processing system
CN114528392A (en) Block chain-based collaborative question-answering model construction method, device and equipment
Li Design and implementation of mental health consultation system for primary and secondary school students based on credibility matching model
He et al. Design of shared Internet of Things system for English translation teaching using deep learning text classification
WO2019169422A1 (en) Knowledge management system
Zhao et al. Construction of Higher Education Management Data Analysis Model Based on Association Rules
WO2023273237A1 (en) Model compression method and system, electronic device, and storage medium
Mejia A New Proposal for Virtual Academic Advisories Using ChatBots
Bushell Transition for Transformation for Sustainable Automation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874155

Country of ref document: EP

Kind code of ref document: A1