CN114492837A - Federated model training method and device

Federated model training method and device

Info

Publication number
CN114492837A
Authority
CN
China
Prior art keywords
model
training
mask sequence
personalized
determining
Prior art date
Legal status
Pending
Application number
CN202210056987.3A
Other languages
Chinese (zh)
Inventor
沈力
黄天晟
刘世伟
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210056987.3A
Publication of CN114492837A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the present application provide a federated model training method and device. The method includes: receiving a personalized model sent by a central device, where the number of effective model parameters of the personalized model is smaller than the number of effective model parameters of the global model; performing model training on the personalized model to obtain a training gradient of the personalized model, where the training gradient includes the gradient of each model parameter in the personalized model; determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model, and the pruning quantity, where the target mask sequence is used by the central device to determine the personalized model in the next round of training; and sending the target mask sequence and the training gradient of the personalized model to the central device. Personalization is achieved by relying on the mask sequence without training an additional model, so the computation of local training in personalized federated learning can be effectively reduced.

Description

Federated model training method and device
Technical Field
The embodiment of the application relates to computer technology, in particular to a method and a device for training a federated model.
Background
On the basis of traditional federated learning, personalized federated learning allows local devices to have different personalized models instead of sharing the same global model.
At present, personalized federated learning in the prior art is usually realized by having all local devices jointly train a global model while each local device additionally trains its own local model, so that each device can own a personalized model.
However, training an additional local model by each local device to implement personalized federated learning results in a large amount of computation for model training.
Disclosure of Invention
The embodiment of the application provides a method and a device for training a federated model, and aims to solve the problem that the calculation amount of model training for personalized federated learning is large.
In a first aspect, an embodiment of the present application provides a federated model training method, which is applied to a participant device and includes:
receiving a personalized model sent by a central device, wherein the number of effective model parameters of the personalized model is smaller than that of effective model parameters of a global model;
performing model training on the personalized model to obtain a training gradient of the personalized model, wherein the training gradient comprises gradients of various model parameters in the personalized model;
determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model and the pruning quantity, wherein the target mask sequence is used for the central equipment to determine the personalized model in the next round of training;
and sending the target mask sequence and the training gradient of the personalized model to the central equipment.
In one possible design, determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model, and the pruning amount includes:
determining a first mask sequence according to the model parameters of the personalized model;
setting part of masks in the first mask sequence to be a first preset value according to the pruning quantity to obtain a second mask sequence;
and setting part of masks in the second mask sequence to be a second preset value according to the pruning quantity to obtain the target mask sequence.
In one possible design, determining a first mask sequence according to the model parameters of the personalized model includes:
determining a parameter state of each model parameter of the personalized model;
and determining the first mask sequence according to the parameter state of each model parameter of the personalized model.
In one possible design, determining the first mask sequence according to a parameter state of each model parameter of the personalized model includes:
setting a mask corresponding to the model parameter in the valid state as a second preset value, and setting a mask corresponding to the model parameter in the invalid state as a first preset value to obtain the first mask sequence.
In one possible design, the amount of pruning is α, which is an integer greater than or equal to 1; setting part of masks in the first mask sequence to be a first preset value according to the pruning quantity to obtain a second mask sequence, wherein the second mask sequence comprises the following steps:
sorting from large to small according to the weights of the effective model parameters in the personalized model to obtain an effective model parameter sequence;
determining alpha effective model parameters at the tail end in the effective model parameter sequence as first effective model parameters;
setting the mask corresponding to the first effective model parameter in the first mask sequence as the first preset value to obtain the second mask sequence.
In one possible design, the amount of pruning is α, which is an integer greater than or equal to 1; setting a part of masks in the second mask sequence to a second preset value according to the pruning quantity to obtain the target mask sequence, wherein the method comprises the following steps:
acquiring gradients corresponding to first masks in the second mask sequence, wherein the values of the first masks are the first preset values;
determining alpha masks to be selected in the first masks of the second mask sequence according to gradients corresponding to the first masks in the second mask sequence;
and setting the alpha masks to be selected in the second mask sequence to be the second preset value to obtain the target mask sequence.
In one possible design, obtaining a gradient corresponding to each first mask in the second mask sequence includes:
determining model parameters corresponding to each first mask in the second mask sequence;
and determining the gradient of the model parameter corresponding to each first mask as the gradient corresponding to each first mask.
In one possible design, determining α candidate masks in the first masks of the second mask sequence according to a gradient corresponding to each first mask in the second mask sequence includes:
sorting the first masks in the second mask sequence according to the sequence of the corresponding gradients from large to small;
and determining the first alpha masks in the sorted first masks as the alpha masks to be selected.
In one possible design, the model training includes N iterations, where N is an integer greater than 1; performing model training on the personalized model to obtain a training gradient of the personalized model, wherein the training gradient comprises the following steps:
obtaining N intermediate gradients corresponding to each model parameter;
and determining and obtaining the training gradient of the personalized model according to the N intermediate gradients corresponding to each model parameter.
In a second aspect, an embodiment of the present application provides a federated model training method, which is applied to a central device and includes:
determining a target global model;
determining a target mask sequence corresponding to the participant device;
updating the model parameters of the target global model according to the target mask sequence to obtain a personalized model;
sending the personalized model to the participant device.
In one possible design, determining a target global model includes:
if the current training is the first round of training, determining a preset model as the target global model;
if the current training is the Mth round of training, obtaining a plurality of training gradients sent by a plurality of participant devices, and determining the target global model according to the M-1 th global model and the plurality of training gradients, wherein the M-1 th global model is the global model of the M-1 th round of training, the target global model is the Mth global model, and M is an integer greater than or equal to 2.
In one possible design, determining the target global model based on the M-1 th global model and the plurality of training gradients comprises:
and updating the model parameters of the M-1 th global model according to the plurality of training gradients to obtain the target global model.
In one possible design, determining a target mask sequence corresponding to the participant device includes:
if the current training is the first round of training, determining a preset mask sequence as the target mask sequence;
if the current training is the Mth round of training, judging whether a mask sequence sent by the participant equipment exists, if so, determining the target mask sequence according to the mask sequence sent by the participant equipment; if not, determining the preset mask sequence as the target mask sequence.
In one possible design, determining the target mask sequence based on the mask sequence sent by the participant device includes:
acquiring a mask sequence sent by the participant equipment for the last time;
and determining the mask sequence sent by the participant device for the last time as the target mask sequence.
In a third aspect, an embodiment of the present application provides a federated model training device, which is applied to a participant device and includes:
the receiving module is used for receiving the personalized model sent by the central equipment, and the number of the effective model parameters of the personalized model is less than that of the effective model parameters of the global model;
the training module is used for carrying out model training on the personalized model to obtain a training gradient of the personalized model, wherein the training gradient comprises the gradient of each model parameter in the personalized model;
the determining module is used for determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model and the pruning quantity, and the target mask sequence is used for determining the personalized model in the next round of training by the central equipment;
and the sending module is used for sending the target mask sequence and the training gradient of the personalized model to the central equipment.
In one possible design, the determining module is specifically configured to:
determining a first mask sequence according to the model parameters of the personalized model;
setting part of masks in the first mask sequence to be a first preset value according to the pruning quantity to obtain a second mask sequence;
and setting part of masks in the second mask sequence to be a second preset value according to the pruning quantity to obtain the target mask sequence.
In one possible design, the determining module is specifically configured to:
determining a parameter state of each model parameter of the personalized model;
and determining the first mask sequence according to the parameter state of each model parameter of the personalized model.
In one possible design, the determining module is specifically configured to:
setting a mask corresponding to the model parameter in the valid state as a second preset value, and setting a mask corresponding to the model parameter in the invalid state as a first preset value to obtain the first mask sequence.
In one possible design, the amount of pruning is α, which is an integer greater than or equal to 1; the determining module is specifically configured to:
sorting from large to small according to the weights of the effective model parameters in the personalized model to obtain an effective model parameter sequence;
determining alpha effective model parameters at the tail end in the effective model parameter sequence as first effective model parameters;
setting the mask corresponding to the first effective model parameter in the first mask sequence as the first preset value, so as to obtain the second mask sequence.
In one possible design, the amount of pruning is α, which is an integer greater than or equal to 1; the determining module is specifically configured to:
acquiring gradients corresponding to first masks in the second mask sequence, wherein the values of the first masks are the first preset values;
determining alpha masks to be selected in the first masks of the second mask sequence according to gradients corresponding to the first masks in the second mask sequence;
and setting the alpha masks to be selected in the second mask sequence to be the second preset value to obtain the target mask sequence.
In one possible design, the determining module is specifically configured to:
determining model parameters corresponding to each first mask in the second mask sequence;
and determining the gradient of the model parameter corresponding to each first mask as the gradient corresponding to each first mask.
In one possible design, the determining module is specifically configured to:
sorting the first masks in the second mask sequence according to the sequence of the corresponding gradients from large to small;
and determining the first alpha masks in the sorted first masks as the alpha masks to be selected.
In one possible design, the model training includes N iterations, where N is an integer greater than 1; the training module is specifically configured to:
obtaining N intermediate gradients corresponding to each model parameter;
and determining and obtaining the training gradient of the personalized model according to the N intermediate gradients corresponding to each model parameter.
In a fourth aspect, an embodiment of the present application provides a federated model training device, which is applied to a central device and includes:
a determining module for determining a target global model;
the determining module is further configured to determine a target mask sequence corresponding to the participant device;
the updating module is used for updating the model parameters of the target global model according to the target mask sequence to obtain a personalized model;
a sending module for sending the personalized model to the participant device.
In one possible design, the determining module is specifically configured to:
if the current training is the first round of training, determining a preset model as the target global model;
if the current training is the Mth round of training, obtaining a plurality of training gradients sent by a plurality of participant devices, and determining the target global model according to the M-1 th global model and the plurality of training gradients, wherein the M-1 th global model is the global model of the M-1 th round of training, the target global model is the Mth global model, and M is an integer greater than or equal to 2.
In one possible design, the determining module is specifically configured to:
and updating the model parameters of the M-1 th global model according to the plurality of training gradients to obtain the target global model.
In one possible design, the determining module is specifically configured to:
if the current training is the first round of training, determining a preset mask sequence as the target mask sequence;
if the current training is the Mth round of training, judging whether a mask sequence sent by the participant equipment exists, if so, determining the target mask sequence according to the mask sequence sent by the participant equipment; if not, determining the preset mask sequence as the target mask sequence.
In one possible design, the determining module is specifically configured to:
acquiring a mask sequence sent by the participant equipment for the last time;
and determining the mask sequence sent by the participant device last time as the target mask sequence.
In a fifth aspect, an embodiment of the present application provides a federated model training apparatus, including:
a memory for storing a program;
a processor for executing the program stored in the memory, the processor being configured to perform the method as described above in the first aspect and in the various possible designs of the first aspect, or in the second aspect and in any one of the various possible designs of the second aspect, when the program is executed.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method as described in the first aspect and any one of the possible designs of the first aspect, or any one of the possible designs of the second aspect and the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer program product, which includes a computer program that, when executed by a processor, implements the method as described in the first aspect and in various possible designs of the first aspect, or in any one of the second aspect and in various possible designs of the second aspect.
The embodiments of the present application provide a federated model training method and device. The method includes: receiving a personalized model sent by a central device, where the number of effective model parameters of the personalized model is smaller than the number of effective model parameters of the global model; performing model training on the personalized model to obtain a training gradient of the personalized model, where the training gradient includes the gradient of each model parameter in the personalized model; determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model, and the pruning quantity, where the target mask sequence is used by the central device to determine the personalized model in the next round of training; and sending the target mask sequence and the training gradient of the personalized model to the central device. By determining a mask sequence corresponding to each participant device and determining the personalized model of each participant device according to that mask sequence, each participant device trains its personalized model and then sends the training gradient of the personalized model to the central device for model aggregation, so that model training for personalized federated learning can be effectively realized. Personalization is achieved by relying on the mask sequence without training an additional model, so the computation of local training in personalized federated learning can be effectively reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and those skilled in the art can also obtain other drawings according to the drawings without inventive exercise.
FIG. 1 is a schematic diagram of the principle of federated learning provided in an embodiment of the present application;
FIG. 2 is a first flowchart of a federated model training method provided in an embodiment of the present application;
FIG. 3 is a second flowchart of a federated model training method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an implementation of a first mask sequence according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an implementation of determining a valid model parameter sequence according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an implementation of determining a second mask sequence according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an implementation of determining a target mask sequence according to an embodiment of the present application;
FIG. 8 is a third flowchart of a federated model training method provided in an embodiment of the present application;
FIG. 9 is a fourth flowchart of a federated model training method provided in an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an implementation of a federated model training method provided in an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating an implementation of a personalized federated learning algorithm provided in an embodiment of the present application;
FIG. 12 is a first schematic structural diagram of a federated model training device provided in an embodiment of the present application;
FIG. 13 is a second schematic structural diagram of a federated model training device provided in an embodiment of the present application;
FIG. 14 is a schematic diagram of the hardware structure of a federated model training device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.
In order to better understand the technical solution of the present application, the related technologies related to the present application are further described in detail below.
In recent years, data privacy and security have received increasing attention. Against this background, governments have introduced new data security regulations, such as the European Union's General Data Protection Regulation (GDPR).
In this context, acquiring labeled data for traditional machine-learning model training becomes increasingly difficult. Federated learning has therefore been proposed as a privacy-preserving distributed training paradigm. Federated learning allows local devices to train models locally with their own data, and the models are then aggregated by a central server. In this process, a local device only needs to transmit the trained model and does not need to transmit its own data to the central server or other nodes, which reduces the risk of data leakage.
Federated learning is briefly explained below. With the continuous development of network technologies, technologies such as machine learning and big-data-based model training keep advancing in many fields. A great deal of valuable information can be obtained by mining big data. As technology develops, the sources of raw data become broader and broader, and even cross-domain combination can be achieved.
In some business scenarios, multiple business platforms collect their respective business data. These business data are valuable and are usually kept as private information of each business platform; no business platform wants to share the original form of its business data with other platforms. In some cases, however, multiple business platforms wish to perform collaborative computation without exposing the business data of any party, in order to improve their business processing capabilities. Approaches such as federated machine learning have emerged from this need.
Federated machine learning (also called federated learning) enables all parties to jointly use data and build models collaboratively on the premise that the data never leaves its local environment, and it has gradually become a common method in privacy-preserving computation.
In the federated learning process, the private data of the participants can be protected by exchanging parameters under an encryption mechanism; the data themselves are not transmitted, and participants neither need to expose their own data to other participants nor can they reversely deduce the data of other participants. Federated learning can therefore protect user privacy and guarantee data security well, and it can solve the data-silo problem.
Depending on the data sets involved, federated learning can be divided into horizontal federated learning, vertical federated learning, and federated transfer learning. Many machine learning algorithms may be used for federated learning, including but not limited to neural networks, random forests, and the like.
FIG. 1 is a schematic diagram of the principle of federated learning provided in an embodiment of the present application. As shown in FIG. 1, a server and k client terminals may participate in the federated learning process. In each round, the server issues the global model to each client terminal; each client terminal trains the issued global model with its local data and uploads the trained local model to the server; the server then aggregates the models uploaded by the client terminals to obtain an updated global model. This process is repeated until the aggregated global model converges.
In this way, each client terminal holds its own training samples and trains the local model with them, so model training can be completed without the data leaving the local devices.
The goal of traditional federated learning is to solve the problem shown in Formula I (problem P1) below:

$$\min_{w}\; F(w) := \sum_{k=1}^{K} \frac{|P_k|}{\sum_{j=1}^{K} |P_j|}\, F_k(w), \qquad F_k(w) = \frac{1}{|P_k|} \sum_{(x_i,\, y_i) \in P_k} \ell(w;\, x_i, y_i) \tag{I}$$

where w denotes the model trained by federated learning, K denotes the number of devices participating in the federated training, k denotes the index of a device, F_k(w) denotes the empirical loss over all the data on one local device, P_k denotes the training data of the k-th device, (x_i, y_i) denotes one sample from that training data, with x_i being the features and y_i the label of the sample, ℓ(·) denotes the cross-entropy loss function, and s.t. denotes a constraint (subject to).
Solving the problem shown in Formula I mainly involves the following four main steps (a rough sketch of this loop, for illustration only, is given after the list):
1. The central server distributes the same global model to a plurality of local devices.
2. Each local device trains the distributed model with its own data and uploads the trained model back to the central server.
3. The central server aggregates the uploaded models.
4. Step 1 is repeated for the next round of iterative training until the model converges.
However, the heterogeneity of local training data (i.e., data distributions that are not independent and identically distributed) tends to slow down the convergence of the model and to reduce the accuracy of the final trained model. In addition, traditional federated learning trains only a single global model for deployment; when data heterogeneity is high, one global model often cannot fit the different data distributions. In this case, training multiple deployable models, each optimized for a different data distribution, is a direction worth pursuing.
Therefore, personalized federated learning has been proposed on the basis of traditional federated learning. Personalized federated learning emphasizes training and deploying multiple high-precision models to the local devices participating in training, so that each deployed model improves the inference accuracy of its local device on that device's own data. For example, the problem solved by personalized federated learning can be constructed in the form shown in Formula II below (problem P2):
$$\min_{v_1, \dots, v_K}\; \sum_{k=1}^{K} F_k(v_k), \qquad \text{s.t.}\;\; w^{*} \in \arg\min_{w}\; F(w) \tag{II}$$

where w* is an optimal solution of the traditional federated learning problem P1, v_k denotes the parameters of the personalized model deployed on the k-th local device, and F_k(v_k) is the empirical loss function of the model v_k. The solving procedure is similar to that of traditional federated learning; the difference is that in each round of local training, the local device not only trains the issued global model with local data, but each device also maintains a personalized model (with v_k as its parameters), and in local training this personalized model is trained with local data according to the loss function in P2.
Based on the above description, it can be determined that, to address the existing limitations of federated learning, personalized federated learning allows local devices to own different personalized models rather than sharing the same global model. To achieve personalization, local devices must find a way to efficiently exchange global information and incorporate it into their personalized models.
In one possible implementation, the model may be personalized by layering; for example, the layers of the global model may be divided into shared layers and personalization layers. The shared layers are aggregated by weight averaging with the FedAvg (Federated Averaging) method, while the personalization layers are trained only locally and are not exchanged with other devices. In this way, each device has its own personalized model, in which the local network layers are fine-tuned with local data and are unaffected by the models of other devices. This also makes it possible to learn local feature extractors together with a shared classifier.
Similarly, FedRep, another algorithm based on representation learning, proposes learning a shared feature extractor and a personalized classifier. Such model-layering-based personalized federated learning methods have poor robustness across different model structures: a specific division into personalization layers and shared layers has to be designed for each specific model, which is not conducive to large-scale, wide application.
Another idea for implementing personalized federated learning is to use proximal (near-end) operators to keep the local model as "close" to the global model as possible. For example, a local model and a global model may be maintained simultaneously. Specifically, training the two models may involve two training phases: 1) training the global model and the local models based on the idea of FedAvg; 2) each local device fine-tunes its local model based on the global model obtained by this training. In the fine-tuning phase, a proximal term is applied between the local model and the global model.
Yet another meta-learning-based algorithm applies a similar proximal-term idea; the difference is that it treats the global model as the average of all local models, so the global model evolves gradually with the local fine-tuning. Such proximal-operator-based personalized federated learning schemes usually require several models to be trained simultaneously, and therefore usually require additional model-training computation.
Model interpolation is another way to implement personalized federated learning. For example, a linear interpolation of the global model and the local model may be used as the personalized model, in which three components (the global model, the local model, and the interpolation weights) need to be optimized alternately, i.e., one is optimized while the other two are fixed. Some implementations adopt a similar interpolation idea, the only difference being that their global model is optimized with ordinary FedAvg, so the existence of the local models is not considered during its optimization. Likewise, interpolation-based personalized federated learning requires additional computation, since this type of approach requires alternating optimization of the global and local models.
Based on the above problems in the prior art, the present application proposes the following technical idea: a personalized federated learning mechanism that constructs multi-model deployment through personalized model sparsity. Compared with existing solutions, it can significantly improve the convergence rate and the final inference accuracy of the model while reducing the computation of local training.
Based on the above description, the federated model training method provided in the present application is described in detail below with reference to specific embodiments. It can be understood that the federated model training method in the present application involves actions of devices on two sides: one side is the participant device, that is, the local device described above; the other side is the central device, i.e., the central server introduced above. The implementation of the federated model training method corresponding to each of the two devices is described below.
For example, the federated model training method on the participant device side is described first with reference to FIG. 2, where FIG. 2 is a flowchart of the federated model training method provided in this embodiment of the present application.
As shown in fig. 2, the method includes:
s201, receiving the personalized model sent by the central equipment, wherein the number of effective model parameters of the personalized model is smaller than that of effective model parameters of the global model.
Based on the above description, it can be determined that, in federated learning, the central server sends the model to each participant device so that each participant device trains the model; any one of the participant devices is taken as an example for description below.
In this embodiment, the central server sends the personalized model to the participant device, where the number of valid model parameters of the personalized model is less than the number of valid model parameters of the global model.
The global model and the personalized model are introduced first. The global model is the model that the central server needs to train; in ordinary federated learning, the central server issues the global model to each participant device. Meanwhile, the global model contains a plurality of model parameters. In a possible implementation of this embodiment, mask processing may be performed on the model parameters of the global model, where the mask processing invalidates part of the model parameters and keeps the other part valid, so that the personalized model can be effectively obtained after the model parameters of the global model are masked.
It can be understood that the personalized model is obtained by masking the model parameters of the global model so that part of them become invalid, whereas all the model parameters of the global model are valid; therefore, the number of effective model parameters of the personalized model is smaller than that of the global model.
Based on the above description, the central device in this embodiment may send the personalized model to the participant device after determining the personalized model corresponding to the participant device, and therefore the participant device in this embodiment may receive the personalized model sent by the central device.
S202, performing model training on the personalized model to obtain a training gradient of the personalized model, wherein the training gradient comprises gradients of various model parameters in the personalized model.
After receiving the personalized model, the participant device may perform model training on the personalized model to obtain a training gradient of the personalized model, where the training gradient in this embodiment includes respective gradients of model parameters in the personalized model.
For a specific implementation of the training gradient, reference may be made to the description in the related art for model training of the personalized model, and details are not described here.
S203, determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model and the pruning quantity, wherein the target mask sequence is used for the central equipment to determine the personalized model in the next round of training.
It can be understood that in this embodiment, it is necessary to implement personalization processing for the participant devices, and therefore, it is necessary to determine a mask sequence corresponding to each participant device in a personalized manner, where the mask sequence is used for performing the mask processing.
The mask sequence is a sequence composed of 0s and 1s; more specifically, each model parameter corresponds to a 0 or a 1, and these values together form the mask sequence. When a 0 is applied to a model parameter, the model parameter is invalid; when a 1 is applied to a model parameter, the model parameter is valid, thereby implementing the mask processing described above.
Therefore, after the gradient of the personalized model is determined, preset processing can be performed on the gradient of the personalized model, the model parameters of the personalized model, and the pruning quantity, so as to determine the target mask sequence. The target mask sequence is used by the central device to determine the personalized model parameters in the next round of training.
The pruning quantity is used for indicating how many model parameters are specifically set as invalid when the target mask sequence is determined, wherein the specific setting of the pruning quantity can be selected and set according to actual requirements.
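For illustration only (the concrete values below are hypothetical and not taken from the embodiment), applying such a 0/1 mask sequence to a parameter vector invalidates the masked positions:

```python
import numpy as np

global_params = np.array([0.8, 0.3, 0.5, 0.1, -0.9])  # hypothetical global model parameters
mask_sequence = np.array([1, 0, 1, 0, 1])             # 1 = parameter valid, 0 = parameter invalid

personalized_params = mask_sequence * global_params
print(personalized_params)  # [ 0.8  0.   0.5  0.  -0.9]

# The personalized model here keeps 3 of the 5 parameters valid; a pruning quantity
# of, say, alpha = 1 would mean that one additional valid parameter is set invalid
# when the target mask sequence for the next round is determined.
```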
S204, sending the target mask sequence and the training gradient of the personalized model to the central device.
After the participant device determines the target mask sequence for the personalized model of the next round of training, the target mask sequence may be sent to the central device, so that the central device can determine the personalized model of the participant device in the next round of training.
And the participant equipment can also send the training gradient of the personalized model to the central equipment, so that the central equipment can aggregate the models according to the training gradient sent by each participant equipment, and the training of the global model is realized.
The federated model training method provided by this embodiment of the application includes: receiving the personalized model sent by the central device, where the number of effective model parameters of the personalized model is smaller than the number of effective model parameters of the global model; performing model training on the personalized model to obtain a training gradient of the personalized model, where the training gradient includes the gradient of each model parameter in the personalized model; determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model, and the pruning quantity, where the target mask sequence is used by the central device to determine the personalized model in the next round of training; and sending the target mask sequence and the training gradient of the personalized model to the central device. By determining a mask sequence corresponding to each participant device and determining the personalized model of each participant device according to that mask sequence, each participant device trains its personalized model and then sends the training gradient of the personalized model to the central device for model aggregation, so that model training for personalized federated learning can be effectively realized. Personalization is achieved by relying on the mask sequence without training an additional model, so the computation of local training in personalized federated learning can be effectively reduced.
On the basis of the above embodiments, the federated model training method on the participant device side is further described in detail below with reference to FIG. 3 to FIG. 7. FIG. 3 is a second flowchart of the federated model training method provided in the embodiment of the present application; FIG. 4 is a schematic diagram of an implementation of a first mask sequence provided in the embodiment of the present application; FIG. 5 is a schematic diagram of an implementation of determining a valid model parameter sequence provided in the embodiment of the present application; FIG. 6 is a schematic diagram of an implementation of determining a second mask sequence provided in the embodiment of the present application; and FIG. 7 is a schematic diagram of an implementation of determining a target mask sequence provided in the embodiment of the present application.
As shown in fig. 3, the method includes:
s301, receiving the personalized models sent by the central equipment, wherein the number of effective model parameters of the personalized models is smaller than that of effective model parameters of the global model.
The implementation manner of S301 is similar to that of S201 described above, and is not described again here.
Furthermore, when the central device in this embodiment sends the personalized model to the participant device, it may, for example, send the sparse parameters

$$\tilde{w}_{k,t} = m_{k,t} \odot w_{t}$$

to the participant device, where m_{k,t} is the personalized mask corresponding to the current participant device k in the t-th round of training, w_t denotes the model parameters of the global model in the t-th round of training, and ⊙ denotes element-wise multiplication of corresponding positions. That is, m_{k,t} ⊙ w_t multiplies each model parameter of the global model by its corresponding mask, which yields the sparse parameters corresponding to the current participant device k in the t-th round of training. These sparse parameters are the personalized model sent by the central server to the participant device.
S302, model training is performed on the personalized model to obtain a training gradient of the personalized model, where the training gradient includes the gradient of each model parameter in the personalized model.
When the training gradient of the personalized model is determined by performing model training on the personalized model, the training may, for example, use Stochastic Gradient Descent (SGD), a widely used method for updating neural networks.
In the implementation process of the SGD, for example, N rounds of training may be performed on the personalized model, so as to obtain N intermediate gradients corresponding to each model parameter, and then the training gradient of the personalized model is determined according to the N intermediate gradients corresponding to each model parameter.
After determining the respective corresponding gradients of the model parameters, the training gradient of the personalized model in this embodiment can be obtained, wherein the specific implementation of determining the training gradient of the personalized model may refer to the description of SGD in the related art, and is not described herein again.
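A minimal sketch of this local training step is given below (illustrative only; it assumes the training gradient is taken as the mean of the N intermediate gradients, and `grad_fn` is a hypothetical routine computing the gradient of the local loss):

```python
import numpy as np

def local_training_gradient(params, batches, grad_fn, lr=0.01, n_iters=5):
    """Run N SGD iterations on the personalized model and return one training
    gradient per model parameter, here taken as the mean of the N intermediate
    gradients. `grad_fn(params, batch)` is a hypothetical routine returning the
    gradient of the local loss with respect to `params`.
    """
    intermediate_grads = []
    for i in range(n_iters):
        g = grad_fn(params, batches[i % len(batches)])
        intermediate_grads.append(g)
        params = params - lr * g  # plain SGD update
    return np.mean(intermediate_grads, axis=0)
```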
S303, determining the parameter state of each model parameter of the personalized model.
Further, in this embodiment, the target mask sequence needs to be determined according to the gradient of the personalized model, the model parameter of the personalized model, and the pruning amount. Therefore, a specific implementation of determining the target mask sequence is described below.
In this embodiment, the personalized mask is a vector that can only take 1 or 0, and a plurality of personalized models for deployment of different local devices can be obtained by vector multiplication of the mask and the model parameters. Therefore, in this embodiment, the personalized mask of the participant device is applied to the global model to obtain the personalized model of the participant device, and the number of valid model parameters of the personalized model is smaller than that of the global model.
More specifically, in the personalized model, because part of the model parameters are covered by the mask, part of the model parameters are invalid while the other part are valid. Also, based on the above description, the central server sends the personalized model to the participant device by, for example, sending the sparse parameters m_{k,t} ⊙ w_t to the participant device; the participant device therefore only obtains the sparse parameters and does not obtain the mask sequence itself.
Since the current mask sequence is needed when determining the next mask sequence to be processed, the parameter state of each model parameter of the personalized model can be determined according to the personalized model.
For example, it can be understood with reference to fig. 4 that, in the current personalized model, there are 5 model parameters, a, b, c, d, and e, shown in fig. 4, where the parameter state of the model parameter a is valid, the parameter state of the model parameter b is invalid, the parameter state of the model parameter c is valid, the parameter state of the model parameter d is invalid, and the parameter state of the model parameter e is valid.
It should be noted that, in this embodiment, a valid parameter state means that the parameter is not covered by the mask, and an invalid parameter state means that the parameter is covered by the mask.
Fig. 4 is only an exemplary description of the parameter states, and in an actual implementation process, the parameter states of the model parameters may be selected and set according to actual requirements, which is not limited in this embodiment.
S304, setting the mask corresponding to the model parameter in the valid state as a second preset value, and setting the mask corresponding to the model parameter in the invalid state as a first preset value to obtain a first mask sequence.
After determining the parameter states of the respective model parameters, the first mask sequence may be determined based on the parameter states of the respective model parameters.
Based on the above description, it can be determined that the personalized mask is a sequence composed of 0 and 1, and when the mask is 0, the corresponding model parameter is covered, i.e. the model parameter is invalid; and when the mask is 1, the corresponding model parameter is not covered, which means that the model parameter is valid.
The first mask sequence can thus be determined from the parameter states of the individual model parameters. In a possible implementation manner, for example, a mask corresponding to the model parameter in the valid state may be set to a second preset value, where the second preset value is a preset value used for indicating that the parameter state of the model parameter is valid, and for example, the second preset value may be 1. And, for example, the mask corresponding to the model parameter in the invalid state may be set to a first preset value, where the first preset value is a preset value used to indicate that the parameter state of the model parameter is invalid, and for example, the first preset value may be 0. By setting the first preset value and the second preset value, the first mask sequence can be obtained.
For example, as can be understood by referring to fig. 4, the parameter states of the model parameters a, b, c, d, and e are sequentially valid, invalid, and valid, for example, the first preset value may be 0, and the second preset value may be 1, then in the example of fig. 4, the first mask sequence is determined according to the parameter states of the model parameters of the personalized model, which may be 10101 shown in fig. 4.
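Purely as an illustration following the example of FIG. 4 (and assuming that invalid parameters appear as exact zeros in the received sparse parameters, which is a simplification since a valid parameter could in principle also be zero):

```python
import numpy as np

# Sparse parameters received for the personalized model; positions b and d were
# masked out by the central device and therefore arrive as zeros (cf. FIG. 4).
personalized_params = np.array([0.8, 0.0, 0.5, 0.0, -0.9])

# Valid parameters get the second preset value (1), invalid ones the first preset value (0).
first_mask_sequence = (personalized_params != 0).astype(int)
print(first_mask_sequence)  # [1 0 1 0 1]
```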
It should be noted that, in one possible implementation, the first mask sequence in this embodiment may be the initial mask sequence, i.e., the mask sequence set for the participant device for the first time. The initial mask sequence set for the first time is unknown to the participant device, and the participant device needs to determine the first mask sequence in the manner described above.
When the mask sequence is initialized to obtain the first mask sequence, for example, the Erdős-Rényi Kernel (ERK) method may be used to calculate the sparsity allocation of each layer of the convolutional network, so as to determine the mask of each model parameter and thereby obtain the initialized first mask sequence. In particular, this allows the number of sparse parameters to vary as the input and output channels vary; such initialization methods essentially ensure that larger layers have higher pruning rates. Under the federated learning setting of this embodiment, every participating device starts from the same randomly initialized first mask sequence with the sparsity allocation computed by ERK. ERK is a method for initializing the sparsity ratios of a neural network and is widely used in dynamic sparse training.
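The sketch below illustrates the general idea of such a sparsity allocation for fully connected layers only; the exact ERK scaling rule used by the embodiment (for example, how convolutional kernel dimensions enter the formula) is not specified here, so the proportionality and the function names are assumptions.

```python
import numpy as np

def er_layer_densities(layer_shapes, global_density=0.2):
    """Per-layer keep-density proportional to (n_in + n_out) / (n_in * n_out),
    rescaled so that roughly `global_density` of all parameters stay valid.
    Wider layers thus end up with a higher pruning rate (lower density)."""
    raw = np.array([(n_in + n_out) / (n_in * n_out) for n_in, n_out in layer_shapes])
    sizes = np.array([n_in * n_out for n_in, n_out in layer_shapes], dtype=float)
    scale = global_density * sizes.sum() / (raw * sizes).sum()
    return np.clip(scale * raw, 0.0, 1.0)

def init_first_masks(layer_shapes, global_density=0.2, seed=0):
    """Randomly initialize one 0/1 mask per layer using the densities above."""
    rng = np.random.default_rng(seed)
    densities = er_layer_densities(layer_shapes, global_density)
    return [(rng.random((n_in, n_out)) < d).astype(int)
            for (n_in, n_out), d in zip(layer_shapes, densities)]

# Example: a small two-layer network; each participating device would start from
# the same per-layer densities (and a shared seed, in that case).
masks = init_first_masks([(784, 256), (256, 10)])
print([round(m.mean(), 3) for m in masks])  # approximate per-layer densities
```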
In another possible implementation, the first mask sequence in this embodiment may not be the initial mask sequence, i.e., it is the second or a subsequent mask sequence for the participant device. It can be understood that, apart from the first initial mask sequence, the subsequent mask sequences are actually the target mask sequences previously determined by the participant device itself.
Then the first masking sequence may also be determined, for example, in the above described implementation when the first masking sequence is not the initial masking sequence. Alternatively, because the mask sequence is determined by the participant device itself, the participant device may not perform the above operation, but directly obtain the previously determined target mask sequence in the training of the current round.
S305, sorting from large to small according to the weight of the effective model parameters in the personalized model to obtain an effective model parameter sequence.
Because it is currently necessary to determine the target mask sequence for the next round of training, in this embodiment, for example, the model parameters may be pruned first, and then the model parameters may be restored, so as to obtain the target mask sequence for the next round of training.
When pruning is performed on the model parameters, for example, the effective model parameter sequences can be obtained by sorting according to the weights of the effective model parameters in the personalized model from large to small, wherein the weights of the effective model parameters in the effective model sequences are sequentially reduced.
As will be understood below with reference to FIG. 5, it is assumed that the current personalized model has a plurality of model parameters, namely the model parameters a, b, c, d, e, f, g, h, i, j shown in FIG. 5, whose weights may also be referred to in FIG. 5. In the example of FIG. 5, a smaller value indicates a greater weight; for example, the value of model parameter a is 1, meaning it has the greatest weight among the model parameters.
And the effective model parameters include model parameters a, c, e, f, g, and j, and then the effective model parameters are sorted from large to small according to the weights of the effective model parameters, so as to obtain the effective model parameter sequence shown in fig. 5: a → e → c → g → f → j.
In the actual implementation process, the specific implementation of the weight of the model parameter may be selected and set according to the actual requirement, which is not particularly limited in this embodiment.
S306, determining alpha effective model parameters at the tail end in the effective model parameter sequence as first effective model parameters.
In this embodiment, for example, a pruning amount α may be set, where α is an integer greater than or equal to 1. And after the effective model parameter sequence is determined, the weights of the model parameters in the effective model parameter sequence are sequentially reduced, so that when parameter pruning is realized, for example, α effective model parameters at the tail end in the effective model parameter sequence can be determined as first effective model parameters.
Assuming that the pruning quantity α equals 3, in the example of FIG. 5 above, the last 3 effective model parameters of the effective model parameter sequence a → e → c → g → f → j are determined as the first effective model parameters, i.e., model parameter g, model parameter f, and model parameter j. The first effective model parameters in this embodiment are the effective model parameters on which parameter pruning needs to be performed.
In an actual implementation process, the specific setting of the pruning amount may be selected according to actual requirements, which is not limited in this embodiment.
S307, setting a mask corresponding to the first effective model parameter in the first mask sequence as a first preset value to obtain a second mask sequence.
It can be understood that, in the embodiment, the parameter pruning actually sets the parameter state of the determined first effective model parameter to be invalid, and for example, the mask corresponding to the first effective model parameter in the first mask sequence may be set to be a first preset value, so as to obtain the second mask sequence. The second mask sequence is the mask sequence that enables setting the parameter state of the first valid model parameter to an invalid state.
For example, as can be understood in conjunction with fig. 6, assuming that the model parameters of the current personalized model include model parameters a, b, c, d, e, f, g, h, i, j and the effective model parameters include a, c, e, f, g, j, continuing with the example of fig. 5, it may be determined that in the current example, the first mask sequence is as shown in fig. 6 and the first mask sequence is 1010111001.
Based on the above description, the last 3 first valid model parameters are the model parameter g, the model parameter f, and the model parameter j, and the masks corresponding to the 3 model parameters are set as the first preset values, where the first preset value in this embodiment may be 0, so as to obtain the second mask sequence 1010100000 shown in fig. 6.
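For illustration only, the pruning of S305-S307 can be sketched in Python as follows; the NumPy arrays, the function name prune_mask, and the toy weights are assumptions introduced here and are not part of the embodiment:

```python
import numpy as np

def prune_mask(first_mask, weights, alpha):
    """Set to 0 the masks of the alpha effective parameters with the smallest weights.

    first_mask: 0/1 vector, where 1 marks an effective (valid) model parameter.
    weights:    weight magnitude of every model parameter (same length).
    alpha:      pruning amount (number of effective parameters to invalidate).
    """
    second_mask = first_mask.copy()
    valid_idx = np.flatnonzero(first_mask == 1)
    # sort the effective parameters by weight, from large to small
    order = valid_idx[np.argsort(-weights[valid_idx])]
    # the alpha parameters at the tail end of the sequence are pruned
    second_mask[order[-alpha:]] = 0
    return second_mask

# toy example mirroring fig. 5 / fig. 6 (the weights here are illustrative only)
first_mask = np.array([1, 0, 1, 0, 1, 1, 1, 0, 0, 1])   # a..j, effective: a c e f g j
weights = np.array([10, 0, 8, 0, 9, 5, 6, 0, 0, 4], dtype=float)
print(prune_mask(first_mask, weights, alpha=3))          # -> [1 0 1 0 1 0 0 0 0 0]
```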
It can be understood that, compared with the first mask sequence, the second mask sequence prunes the α effective model parameters with the smallest weights and sets them as invalid model parameters. This can also be expressed by the following formula.
In the parameter pruning stage, the positions of parameters considered unimportant are usually pruned, and a very intuitive idea is to prune the weight positions with the smallest absolute values. Therefore, in the parameter pruning stage, this embodiment proposes to use the absolute values of the personalized weights of the participant device (i.e., $|w_{k,t,N}|$) as the pruning criterion.
Typically, a plurality of network structure layers, such as convolutional layers, pooling layers, etc., are included in the model, and it can be determined that a plurality of model parameters are included in each network structure layer. In one possible implementation, a specific network structure layer may be indexed by j, for example, and the implementation of parameter pruning may be expressed as the content of the following formula four:
$\tilde{m}_{k,t}^{(j)} = \mathrm{Prune}\left(m_{k,t}^{(j)},\ w_{k,t,N}^{(j)},\ \alpha_t\right)$  (formula four)
wherein $m_{k,t}^{(j)}$ is the mask of the jth network structure layer of participant device k in the current round of training, $w_{k,t,N}^{(j)}$ is the local weight of the jth network structure layer of the current participant device k in the tth round of training, and $\alpha_t$ represents the pruning amount in the tth round of training. $\mathrm{Prune}(\cdot)$ is a function that returns the mask $\tilde{m}_{k,t}^{(j)}$ of the pruned jth network structure layer; specifically, it clips to 0 the masks of the $\alpha_t$ smallest-magnitude weights that have not yet been clipped.
In one possible implementation, the pruning amount $\alpha_t$ in this embodiment may be decayed as the number of training rounds t increases, for example by using a cosine annealing schedule.
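A minimal sketch of such a decay schedule is shown below; the exact schedule is not specified in this embodiment, so the formula used here is only one common cosine-annealing variant and the function name is an assumption:

```python
import math

def pruning_amount(alpha_0, t, T):
    """Cosine-annealed pruning amount: starts at alpha_0 and decays towards 0 as t approaches T."""
    return int(round(alpha_0 / 2 * (1 + math.cos(math.pi * t / T))))

# example: initial pruning amount of 6 parameters over 10 communication rounds
print([pruning_amount(6, t, 10) for t in range(10)])
```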
S308, acquiring gradients corresponding to the first masks in the second mask sequence, wherein the values of the first masks are first preset values.
The foregoing describes the implementation of parameter pruning. In this embodiment, after some model parameters are pruned in the pruning stage, the same number of pruned model parameters needs to be recovered, so as to ensure that the sparsity (or density) of the trained neural network remains unchanged throughout the training process.
In a possible implementation manner, the pruned model parameters may be restored according to their gradient information. It should be noted that the pruned model parameters described here are actually the model parameters whose masks are 0.
Therefore, in this embodiment, a gradient corresponding to each first mask in the second mask sequence may be obtained, where a value of the first mask is a first preset value.
The second mask sequence is the above-described mask sequence obtained after the parameter pruning, and the first preset value in this embodiment may be 0, in this embodiment, a mask with a mask of 0 in the second mask sequence may be determined as a first mask, a model parameter corresponding to each first mask is determined, and a gradient of a model parameter corresponding to each first mask is determined as a gradient corresponding to each first mask.
Continuing the above example, assume the currently obtained second mask sequence is 1010100000. The masks whose values are 0 are determined as the first masks, and the gradients of the model parameters corresponding to the first masks are determined as the gradients corresponding to those first masks. As can be understood with reference to fig. 7, the masks corresponding to the model parameters b, d, f, g, h, i, j are all 0, so the current second mask sequence contains 7 first masks, and the gradients of the model parameters b, d, f, g, h, i, j corresponding to these 7 first masks are determined as the gradients corresponding to the respective first masks.
S309, sorting the first masks in the second mask sequence according to the sequence of the corresponding gradients from large to small.
After determining the gradient corresponding to each first mask, the first masks in the second mask sequence may be sorted, for example, in order of their corresponding gradients from large to small.
And S310, determining the first alpha masks in the sorted first masks as alpha masks to be selected.
After the first masks are sorted, because α model parameters are pruned in the foregoing process, α model parameters need to be restored currently, and therefore, in this embodiment, the first α masks in the sorted first masks may be determined as masks to be selected, so as to obtain α masks to be selected.
Continuing the above example with a pruning amount α of 3, assume that the first 3 masks after sorting are the masks corresponding to model parameters b, i, and j. These may be determined as the masks to be selected, so that 3 candidate masks are obtained.
And S311, setting alpha masks to be selected in the second mask sequence to be a second preset value, and obtaining the target mask sequence.
Then, α candidate masks in the second mask sequence may be set to a second preset value to obtain a target mask sequence, where the target mask sequence in this embodiment is used by the central device to determine an individualized model in the next round of training.
In a possible implementation manner, the second preset value in this embodiment may be 1. Continuing the above example, referring to fig. 7, the determined α candidate masks are the masks corresponding to the model parameters b, i, and j. On the basis of the second mask sequence, these masks may be set to 1, so as to obtain the target mask sequence 1110100011. Comparing the target mask sequence with the first mask sequence described above, it can be seen that the mask sequences used for model training in this embodiment always maintain the same sparsity.
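The recovery of S308-S311 can similarly be sketched as follows; again, the NumPy arrays, the function name recover_mask, and the toy gradients are illustrative assumptions rather than part of the embodiment:

```python
import numpy as np

def recover_mask(second_mask, gradients, alpha):
    """Set back to 1 the alpha pruned positions whose gradients are largest.

    second_mask: 0/1 vector after pruning (0 marks a pruned/invalid parameter).
    gradients:   gradient magnitude of every model parameter.
    alpha:       number of masks to restore (equal to the pruning amount).
    """
    target_mask = second_mask.copy()
    pruned_idx = np.flatnonzero(second_mask == 0)            # the "first masks"
    order = pruned_idx[np.argsort(-gradients[pruned_idx])]   # sort by gradient, large to small
    target_mask[order[:alpha]] = 1                           # the alpha candidate masks
    return target_mask

# continuing the fig. 7 example: pruned positions are b, d, f, g, h, i, j
second_mask = np.array([1, 0, 1, 0, 1, 0, 0, 0, 0, 0])
gradients = np.array([0, 9, 0, 1, 0, 2, 3, 4, 8, 7], dtype=float)
print(recover_mask(second_mask, gradients, alpha=3))         # -> [1 1 1 0 1 0 0 0 1 1]
```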
In a possible implementation manner, the parameter recovery described above can be represented as the following formula five. Specifically, the pruned weights are restored by making the following update to the personalized mask:
$m_{k,t+1}^{(j)} = \mathrm{Grow}\left(\tilde{m}_{k,t}^{(j)},\ g_{k,t,N}^{(j)},\ \alpha_t\right)$  (formula five)
wherein $g_{k,t,N}^{(j)}$ is the stochastic gradient of the jth layer of the neural network, and $\mathrm{Grow}(\cdot)$ is a function whose returned mask $m_{k,t+1}^{(j)}$ restores the masks corresponding to the top-$\alpha_t$ gradients by magnitude. Intuitively, this embodiment considers that if a gradient obtained in local training is large, the corresponding parameter should not be clipped to 0 but should be restored to the normal state.
In the implementation of updating the model parameters in this embodiment, an iterative mask needs to be obtained for updating the model, and it can be determined based on the above description that a dynamic sparse training method is adopted in this embodiment to perform iterative search of the mask. The dynamic sparse training enables the model in the training process to maintain the same sparsity, so that the training overhead in the local training process can be reduced.
And S312, sending the target mask sequence and the training gradient of the personalized model to the central equipment.
After the target mask sequence and the training gradient of the personalized model are determined, the participant device can send them to the central device, and the central server can then aggregate the training gradients of the personalized models, thereby completing one round of model training. After one round of model training is finished, when the next round of model training is performed, the central server can determine the personalized model that currently needs to be sent to each participant device according to the target mask sequence, and then repeat the model training process described above until the number of training rounds reaches the preset number of rounds, so as to obtain the trained model.
In the federated model training method provided by the embodiment of the application, the target mask sequence corresponding to each participant device is determined, and the target mask sequence is used to determine the personalized model of that participant device, so that a respective personalized model can be determined for each participant device simply and effectively on the basis of masks. After a participant device finishes a round of training, it sends the training gradient of its personalized model to the central device, so that the central device can aggregate the gradients of the participant devices. Personalized federated learning can therefore be effectively realized based on the personalized mask of each participant device, which further improves the training accuracy of federated learning. Because masks are used to realize the personalized processing of the model, the additional overhead caused by model layering or by training additional models can be avoided. Moreover, when determining the personalized mask of a participant device, parameter pruning and parameter recovery are performed on the basis of the mask of the current round of training of that device to obtain the personalized mask for the next round of training, so that the personalized mask effectively guarantees personalized processing for the current participant device; in addition, through parameter pruning and parameter recovery, the sparsity of the model can be kept consistent throughout the training process, thereby reducing the training overhead of local training.
The implementation process of the federal model training method executed by the participant device is described above, and the implementation process of the federal model training method executed by the central device is described below with reference to specific embodiments.
First, description is made with reference to fig. 8, and fig. 8 is a flowchart three of a federal model training method provided in the embodiment of the present application.
As shown in fig. 8, the method includes:
s801, determining a target global model.
In this embodiment, the central device may determine a target global model, where the target global model is the global model that the central device needs to train, or may be the global model in the training process.
S802, determining a target mask sequence corresponding to the participant device.
The central device may further determine a target mask sequence corresponding to the participant device, and based on the above description, it may be determined that the participant device may send the target mask sequence to the central device, and then the central device in this embodiment may determine the target mask sequence corresponding to each participant device according to the target mask sequence sent by each participant device. Wherein for each participant device, its corresponding target mask sequence is personalized.
And S803, updating the model parameters of the target global model according to the target mask sequence to obtain the personalized model.
After determining the target mask sequence corresponding to each participant device, for example, the model parameters of the target global model may be updated according to the target mask sequence, that is, some parameters in the target global model are set to be in an invalid state, and the rest of parameters are kept to be in a valid state. After the model parameters of the target global model are updated, the personalized models corresponding to the participant devices can be obtained.
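In other words, S803 essentially amounts to an element-wise multiplication of the global parameters by the 0/1 mask, as in the following minimal sketch (the array names are assumptions for illustration):

```python
import numpy as np

def personalize(global_params, target_mask):
    """Zero out (invalidate) the global parameters whose mask is 0."""
    return global_params * target_mask

global_params = np.array([0.4, -1.2, 0.7, 0.1, 2.3])
target_mask = np.array([1, 0, 1, 1, 0])
print(personalize(global_params, target_mask))   # -> [0.4 0.  0.7 0.1 0. ]
```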
And S804, sending the personalized model to the participant equipment.
After the personalized model is determined, it may be sent to the participant device, and the participant device may then perform model training according to the personalized model. For the model training process of the participant device, reference may be made to the description of the above embodiments, and details are not repeated here.
After the one-round model training of the participant device is finished, the target mask sequence determined by the participant device for the next round of training and the training gradient of the personalized model may be sent to the central device, so that the central device may aggregate the training gradients of the personalized models of the respective participant devices to implement the one-round model training, and perform the next round of model training according to the target mask sequence of the respective participant devices until the final model training is finished.
The federal model training method provided by the embodiment of the application comprises the following steps: a target global model is determined. A target mask sequence corresponding to the participant device is determined. And updating the model parameters of the target global model according to the target mask sequence to obtain the personalized model. The personalized model is sent to the participant device. By determining the target mask sequence corresponding to each participant device, the personalized model of each participant device can be obtained according to the personalized mask sequence of each participant device, and then the personalized model is sent to the participant device to realize model training. The personalized processing is realized by relying on the mask sequence without training an additional model, so that the calculated amount of local training in personalized federal learning can be effectively reduced.
On the basis of the above embodiments, the federal model training method on the side of the central facility will be further described in detail with reference to fig. 9. Fig. 9 is a fourth flowchart of a federal model training method provided in an embodiment of the present application.
As shown in fig. 9, the method includes:
and S901, if the current training is the first round of training, determining a preset model as a target global model.
In this embodiment, the central device needs to obtain a target global model, and the model training in this embodiment may be performed in multiple rounds. In one possible implementation, if the current training is the first round of training, the preset model may, for example, be determined as the target global model. The preset model is the model to be trained by the central device.
S902, if the current training is the Mth round of training, obtaining a plurality of training gradients sent by a plurality of participant devices, and updating the model parameters of the M-1 th global model according to the training gradients to obtain the target global model.
And in another possible implementation manner, if the current training is not the first round of training, but the mth round of training, the target global model in this embodiment may be the global model in training.
The central device may, for example, obtain a plurality of training gradients sent by the plurality of participant devices, and update the model parameters of the M-1 th global model according to the plurality of training gradients, so as to obtain the target global model that needs to be used in the Mth round of training. The M-1 th global model is the global model trained in the M-1 th round, the target global model is the Mth global model, and M is an integer greater than or equal to 2.
And S903, if the current training is the first round of training, determining the preset mask sequence as a target mask sequence.
The central device in this embodiment also needs to determine a target mask sequence corresponding to the participant device, and determines that the target mask sequence is implemented in different ways for different training rounds. It should be noted that, in this embodiment, the central device may determine the target mask sequence corresponding to each of the participant devices, and the implementation manner of determining the target mask sequence for each of the participant devices is similar, so that any one of the participant devices is taken as an example to be introduced below, and details of the rest of the participant devices are not described again.
In one possible implementation, if the current training is the first training round, the preset mask sequence may be determined as the target mask sequence. The preset mask sequence may be the initial mask sequence described in the above embodiment, that is, the mask sequence set for the participant device for the first time, and thus, for example, the ERK algorithm initialized by masks may be used to determine the preset mask sequence, so as to determine the target mask sequence in the first round of training.
And S904, if the current training is the Mth round of training, judging whether a mask sequence sent by the participant device exists; if so, executing S905, otherwise, executing S907.
In another possible implementation, if the current training is not the first training round, but the mth training round, the central device may determine whether there is a mask sequence sent by the participant device, for example. It can be determined based on the above description of the embodiments that, except that the mask sequence trained for the first time is determined by the center device, the following mask sequences are determined by the participant device itself, and the participant device, after determining the target mask sequence, sends the determined target mask sequence to the center device.
Meanwhile, it should be further noted that, in a possible implementation manner, when performing each round of model training, the central device may select some participant devices to participate in the current round of model training among the multiple participant devices, instead of letting each participant device participate in the model training. It will be appreciated that in each round of model training, the hub device may select a portion of the participant devices and send the personalized model to the selected portion of the participant devices.
It can be determined from the above that a participant device may participate in only some rounds of model training. If the participant device participates in a round of model training, it generates the target mask sequence to be used in the next round of training and sends it to the central device. Conversely, if a participant device has never participated in model training, then even if the current round is the Mth round, the target mask sequence for that device is still the initialized preset mask sequence. Therefore, in this embodiment, it is necessary to judge whether there is a mask sequence sent by the participant device.
S905, acquiring a mask sequence sent by the participant equipment for the last time.
In a possible implementation manner, if the central device determines that there is a mask sequence sent by the participant device, this indicates that the participant device has previously participated in model training and has therefore sent its generated target mask sequence to the central device. The participant device may also have participated in multiple rounds of model training, sending a target mask sequence to the central device in each of those rounds; the mask sequence to be used now is the mask sequence that the participant device sent most recently. The central device may therefore obtain the mask sequence last sent by the participant device.
And S906, determining the mask sequence sent by the participant device for the last time as a target mask sequence.
After the mask sequence sent by the participant device for the last time is obtained, the mask sequence sent by the participant device for the last time may be determined as the target mask sequence.
To illustrate with a specific example, suppose the current training is the 5th round, and participant device A participated in the 2nd and 3rd rounds of training. Then in the 2nd round of training, participant device A sent a target mask sequence to the central device, and in the 3rd round of training, participant device A again sent a target mask sequence to the central device. In the current 5th round of training, when determining the target mask sequence corresponding to this participant device, it can be determined that mask sequences sent by the participant device exist, so the central device can obtain the mask sequence last sent by the participant device, i.e., the mask sequence sent in the 3rd round of training, and determine it as the target mask sequence.
And S907, determining the preset mask sequence as a target mask sequence.
In another possible implementation, if it is determined that there is no mask sequence sent by the participant device, this indicates that the participant device has not previously participated in model training even though the current round is not the first round. In that case, the target mask sequence for the participant device is still the initialized mask sequence, so the preset mask sequence may be determined as the target mask sequence.
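The mask selection logic of S903-S907 can be summarized by the following sketch; the dictionary-based bookkeeping and the function name are assumptions made here for illustration:

```python
def target_mask_for(device_id, current_round, sent_masks, preset_mask):
    """Return the mask the central device should use for this participant.

    sent_masks: dict mapping device_id -> list of masks that device has sent,
                in the order they were received (the list may be empty).
    """
    if current_round == 1 or not sent_masks.get(device_id):
        # first round, or the device has never participated: use the preset mask
        return preset_mask
    # otherwise use the mask the participant device sent most recently
    return sent_masks[device_id][-1]
```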
And S908, updating model parameters of the target global model according to the target mask sequence to obtain the personalized model.
And S909, sending the personalized model to the participant device.
The implementation manners of S908 and S909 are similar to the implementation manners of S803 and S804, and are not described herein again.
According to the federal model training method provided by the embodiment of the application, the preset model is determined as the target global model when the model of the first round is trained, and the global model of the previous round is subjected to parameter updating according to the training gradient sent by the participating side equipment when the model of the subsequent round is trained, so that the target global model is obtained, and therefore model training based on federal learning can be effectively achieved. And when determining the target mask sequence corresponding to the participant device, determining the initialized preset mask sequence as the target mask sequence if the participant device participates in the model training for the first time, and determining the personalized mask sequence sent by the participant device for the last time as the target mask sequence if the participant device does not participate in the model training for the first time, thereby ensuring that the personalized mask sequence aiming at the participant device can be effectively determined when the participant device participates in the model training for each time, further determining the personalized model aiming at the participant device according to the target global model and the personalized mask sequence, and then sending the personalized model to the participant device, thereby accurately and simply realizing personalized federal learning based on the mask.
Based on the above embodiments, the federal model training method provided in this application is further introduced systematically with reference to fig. 10. Fig. 10 is a schematic diagram illustrating an implementation of a federal model training method provided in an embodiment of the present application.
Referring to fig. 10, the server is the above-described central device, and the local device is the above-described participant device. In each round of training in fig. 10, there will be the following training steps:
1) and the server sparsifies the global model according to the dynamic mask and then distributes the sparse model to the local equipment.
2) And after receiving the sparse parameters, the local equipment uses local data of the local equipment to train the model. Wherein the training process keeps the same sparsity.
3) After the training is finished, the local equipment sends the training gradient of the trained personalized model and the target mask sequence to the server so as to update the sparse model parameters and upload the sparse model parameters to the server again.
4) And the server performs aggregate updating on the global model according to the collected sparse updates.
And repeatedly executing the training process until the preset number of training rounds is reached, thereby completing the model training.
Further, it should be noted that the federal model training method provided in the present application solves the following loss minimization problem of personalized sparse federated learning:
$\min_{w}\ \frac{1}{K}\sum_{k=1}^{K} F_k\!\left(m_k^{*} \odot w\right)$, with $F_k(w) = \mathbb{E}_{(x,y)\sim \mathcal{D}_k}\left[\ell(w; x, y)\right]$  (formula six)
In formula six, w represents the model of the federated learning training, K represents the number of devices participating in the federated learning training, k represents the index of a certain device, and $m_k^{*}$ indicates the optimal personalization mask of the kth local device. The personalized mask is a vector whose entries can only take the value 1 or 0; by element-wise multiplication of the mask with the model parameters, a plurality of personalized models can be obtained for deployment on different local devices. $\mathcal{D}_k$ is the local data distribution, and (x, y) represents a sample drawn from the local data distribution, where x represents the features of the data and y represents the label of the data. Finally, $F_k$ is the expected loss function of the kth local device (where $\ell$ represents a cross-entropy loss function). The overall loss function to be minimized is the average expected loss function over all local devices.
For the problem shown in formula six above, based on the idea of data-parallel stochastic gradient descent used in the present application, the following model parameter update can, for example, be given:
$w_{t+1} = w_t - \eta\,\frac{1}{K}\sum_{k=1}^{K} m_k^{*} \odot \nabla F_k\!\left(m_k^{*}\odot w_t;\ \xi_{k,t}\right)$
wherein $\xi_{k,t}$ is a batch of data randomly sampled from the data distribution of the local device, $\eta$ is the learning rate of the local training, and $m_k^{*}\odot w_t$ is the sparse parameter obtained by sparsifying according to the optimal personalized mask.
However, the optimal mask is unknown. We assume for the moment that there is a separate scheme that iteratively finds the optimal personalized mask. Based on this, the model parameter update scheme described above can be adapted as:
$w_{t+1} = w_t - \eta\,\frac{1}{K}\sum_{k=1}^{K} m_{k,t} \odot \nabla F_k\!\left(m_{k,t}\odot w_t;\ \xi_{k,t}\right)$
wherein $m_{k,t}$ is the iteratively obtained personalized mask. It should be noted that, since part of the parameters are masked to 0, only part of the parameters participate in the forward propagation of the neural network during the update, which reduces the computational complexity of training to some extent. In addition, during back propagation, since the back-propagated gradient is multiplied by the personalized mask vector, only part of the gradient needs to be propagated backwards, which further reduces the computational complexity of training. Moreover, since the uploaded updates are sparse, this update method can further reduce the communication overhead of aggregation and synchronization.
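A minimal sketch of one such masked update step is given below; the loss gradient, data batch, and variable names are placeholders introduced only for illustration:

```python
import numpy as np

def masked_sgd_step(w, mask, grad_fn, batch, lr):
    """One local SGD step in which both the forward weights and the gradient are masked."""
    sparse_w = mask * w                  # only unmasked parameters join forward propagation
    grad = grad_fn(sparse_w, batch)      # stochastic gradient on the sparse parameters
    return w - lr * mask * grad          # the back-propagated gradient is masked as well

# toy quadratic "loss" gradient, standing in for the cross-entropy gradient
grad_fn = lambda w, batch: 2 * (w - batch.mean(axis=0))
w = np.array([0.5, -0.3, 1.2, 0.0])
mask = np.array([1, 0, 1, 1])
batch = np.array([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 2.0, 0.0]])
print(masked_sgd_step(w, mask, grad_fn, batch, lr=0.1))
```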
In the above update scheme, the global model $w_t$ must be updated and synchronized after every single SGD step and with full participation. In federated learning, to further save communication overhead, we typically employ 1) local SGD, i.e., multiple local SGD steps are allowed before synchronization, so that each participating device performs SGD locally to train its personalized model; and 2) partial participation, i.e., only a small fraction of the local devices perform local training in each round. That is, as described above, a portion of the participant devices is selected to participate in each round of model training.
Based on this, we rewrite the data-parallel SGD scheme accordingly. Let $w_{k,t,\tau}$ denote the local weights of device k before performing step $\tau+1$ of local SGD in round t, and let $w_{k,t,0} = w_t$ (this means that the local weights are synchronized with the global weights when $\tau = 0$). In our federated scheme, for $\tau \in \{0, \ldots, N-1\}$, each local device $k \in S_t$ performs the following local SGD on its model:
$w_{k,t,\tau+1} = w_{k,t,\tau} - \eta\, m_{k,t} \odot \nabla F_k\!\left(m_{k,t}\odot w_{k,t,\tau};\ \xi_{k,t,\tau}\right)$
wherein $\xi_{k,t,\tau}$ denotes the batch of data sampled at step $\tau$ of round t. After local training, each local device sends its sparse parameter update to the server for aggregation, which may be expressed as:
$w_{t+1} = w_t - \frac{\eta}{S}\sum_{k \in S_t} U_{k,t}$, with $U_{k,t} = \sum_{\tau=0}^{N-1} m_{k,t} \odot \nabla F_k\!\left(m_{k,t}\odot w_{k,t,\tau};\ \xi_{k,t,\tau}\right)$
wherein $S_t$ is the randomly selected set of local devices and $S = |S_t|$ is the number of devices that need to be selected per round.
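The aggregation step can be sketched as follows; the averaging form shown matches the reconstruction above, and the function and variable names are assumptions for illustration:

```python
import numpy as np

def aggregate(w_global, updates, lr):
    """Aggregate the cumulative sparse updates U_{k,t} of the selected devices.

    updates: list of per-device cumulative (masked) gradient updates for this round.
    """
    mean_update = np.mean(updates, axis=0)        # average over the S selected devices
    return w_global - lr * mean_update

w_global = np.zeros(4)
updates = [np.array([0.2, 0.0, -0.4, 0.0]),       # device 1 (masked positions stay 0)
           np.array([0.0, 0.1, 0.3, 0.0])]        # device 2
print(aggregate(w_global, updates, lr=0.5))
```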
Based on the above description, the federal model training method in this embodiment can be further summarized as the algorithm flow shown in fig. 11. Fig. 11 is a schematic diagram illustrating an implementation of a personalized federal learning algorithm provided in an embodiment of the present application.
As shown in fig. 11:
1. Set the hyper-parameters: the learning rate η of the training model, the number S of devices selected per round, the number N of local training steps, and the number T of global communication rounds.
The server side main process executes the following operations:
2. Randomly initialize the global model parameters $w_0$;
3. Call the mask search algorithm to determine the initialized mask sequences $m_{k,0}$;
4. For the training round number t = 0, 1, ..., T−1, perform:
5. Sample and select S participant devices into the set $S_t$;
6. For each participant device, if participant device k does not belong to the set $S_t$:
7. Set the next-round mask sequence of that participant device k as $m_{k,t+1} = m_{k,t}$;
8. For each participant device, if participant device k belongs to the set $S_t$:
9. Send the sparse parameters $m_{k,t}\odot w_t$ to device k;
10. Call the device-side main process to obtain the update gradient $U_{k,t}$ and the next-round mask $m_{k,t+1}$;
11. The server side aggregates the updates and updates the model parameters to obtain the target global model: $w_{t+1} = w_t - \frac{\eta}{S}\sum_{k\in S_t} U_{k,t}$.
It can be understood that t is incremented by 1 each time the above round of model training is performed; when t reaches T, the model training is determined to be finished.
The server side can then output the trained sparse personalized parameters $m_{k,T}\odot w_T$ to each participant device k.
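Putting steps 2-11 together, the server-side main process might look like the following sketch; this is only an illustrative Python sketch, and the callable device_main standing in for the device-side process, as well as the variable names, are assumptions rather than part of the embodiment:

```python
import numpy as np
import random

def server_main(w0, init_masks, device_main, S, T, lr):
    """Sketch of the server-side main process of fig. 11 (steps 2-11).

    device_main(k, sparse_params, mask) must return (U_k, next_mask) and stands in
    for the device-side local training described in steps 12-16.
    """
    w = w0.copy()
    masks = dict(init_masks)                      # m_{k,0} for every device k
    devices = list(masks)
    for t in range(T):                            # global communication rounds
        selected = random.sample(devices, S)      # step 5: the set S_t
        updates = []
        for k in selected:                        # steps 8-10
            U_k, masks[k] = device_main(k, masks[k] * w, masks[k])
            updates.append(U_k)
        # devices not in S_t keep their previous mask (steps 6-7)
        w -= lr / S * np.sum(updates, axis=0)     # step 11: aggregate the sparse updates
    return w, masks

# toy run with a dummy device that returns a random sparse update and an unchanged mask
dummy = lambda k, sparse_w, m: (m * np.random.randn(len(sparse_w)), m)
w, masks = server_main(np.zeros(4),
                       {0: np.array([1, 0, 1, 1]), 1: np.array([1, 1, 0, 1])},
                       dummy, S=1, T=3, lr=0.1)
print(w)
```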
The above describes the server-side main process. Based on the above description, the server-side main process calls the device-side main process to obtain the update gradient $U_{k,t}$ and the next-round mask $m_{k,t+1}$; the implementation of the device-side main process is described below.
12. For the local training step number τ = 0, 1, ..., N−1, perform:
13. Randomly sample a batch of data, denoted $\xi_{k,t,\tau}$;
14. Compute the gradient after back-propagating the loss function: $g_{k,t,\tau} = \nabla F_k\!\left(m_{k,t}\odot w_{k,t,\tau};\ \xi_{k,t,\tau}\right)$;
15. Perform the following calculation: $w_{k,t,\tau+1} = w_{k,t,\tau} - \eta\, m_{k,t}\odot g_{k,t,\tau}$;
16. Accumulate the gradient update over the τ steps: $U_{k,t} = \sum_{\tau=0}^{N-1} m_{k,t}\odot g_{k,t,\tau}$;
15. Set the next mask sequence by the mask search algorithm Next_Masks(): $m_{k,t+1} = \mathrm{Next\_Masks}(\cdot)$;
16. Send the cumulative gradient update $U_{k,t}$ and the next mask sequence $m_{k,t+1}$ to the server.
It will be appreciated that steps 12-16 described above are actually the process of the local SGD.
Finally, the mask search algorithm Next_Masks() is introduced:
17. Set the hyper-parameter: the initial per-round pruning rate $\alpha_0$.
18. The mask initialization method for the first mask setting is as follows:
Compute the sparse proportion of the neural network according to ERK, and randomly initialize the mask $m_{k,0}$ according to the sparse proportion (a simplified sketch is given after this listing);
Return the initialization mask $m_{k,0}$.
19. Apart from the first mask setting, subsequent masks are determined by the algorithm Next_Masks() that searches for the next mask, where Next_Masks() is:
20. Decay the per-round pruning amount $\alpha_t$ according to $\alpha_0$;
21. Randomly sample a batch of data, denoted $\xi_{k,t,N}$;
22. Compute the gradient after back-propagating the loss function: $g_{k,t,N} = \nabla F_k\!\left(m_{k,t}\odot w_{k,t,N};\ \xi_{k,t,N}\right)$;
23. For each network structure layer index j in the set of network structure layers, perform:
24. Prune a part of the parameters: $\tilde{m}_{k,t}^{(j)} = \mathrm{Prune}\!\left(m_{k,t}^{(j)},\ w_{k,t,N}^{(j)},\ \alpha_t\right)$;
25. Restore a part of the parameters: $m_{k,t+1}^{(j)} = \mathrm{Grow}\!\left(\tilde{m}_{k,t}^{(j)},\ g_{k,t,N}^{(j)},\ \alpha_t\right)$;
26. Return the next mask $m_{k,t+1}$.
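As a complement to step 18 above, the following is a simplified, hypothetical sketch of random mask initialization at a given sparsity; the ERK rule that the embodiment uses to compute per-layer sparse proportions is not reproduced here, so the single sparsity argument merely stands in for that computation:

```python
import numpy as np

def init_mask(num_params, sparsity, seed=0):
    """Randomly initialize a 0/1 mask with the given fraction of pruned (0) entries."""
    rng = np.random.default_rng(seed)
    num_zero = int(round(sparsity * num_params))
    mask = np.ones(num_params, dtype=int)
    mask[rng.choice(num_params, size=num_zero, replace=False)] = 0
    return mask

print(init_mask(10, sparsity=0.4))   # e.g. a mask with four 0s and six 1s
```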
The above description is about the overall implementation algorithm flow of the federal model training method executed by the central device and the participant devices, and specific implementation of each step may refer to the description of the above embodiments, which is not described herein again.
For the above technique, we present the following convergence conclusion. The convergence conclusion can ensure that the personalized model generated by the method of the application converges to the vicinity of the minimum extreme point, thereby proving the effectiveness of the proposed method of the application.
Under the assumptions of smoothness, bounded gradient variance, bounded gradients, and bounded device gradient difference, assuming that the learning rate $\eta$ satisfies a suitable smallness condition and that all optimal personalization masks satisfy the same sparsity condition, the application draws the following convergence conclusion: over the T rounds of training, the average squared gradient norm of the personalized models is bounded by a quantity that decreases as T grows, plus error terms expressed through several intermediate quantities, which are in turn defined by the constants below.
Wherein $\beta$ represents the number of parameters that need to be sparsified; B is the constant in the bounded-gradient assumption, i.e., $\|\nabla F_k(w)\| \le B$; G is the constant in the bounded device gradient difference assumption, i.e., $\|\nabla F_k(w) - \nabla F(w)\| \le G$; $\sigma^2$ is the constant in the bounded gradient variance assumption, i.e., $\mathbb{E}\|\nabla F_k(w;\xi) - \nabla F_k(w)\|^2 \le \sigma^2$; d represents the dimension of the model parameters; k' and k are summation indices, and $\sum_k$ and $\sum_{k'}$ denote sums over k, k' = 1, 2, ..., K; and $\mathrm{dist}(m_1, m_2)$ denotes the Hamming distance between mask $m_1$ and mask $m_2$.
Based on the above introduction, it can be understood that the traditional federal learning single model deployment has a limitation on the overall accuracy improvement. The technical scheme of the application allows a plurality of personalized models to be deployed on different local devices so as to break through the single model deployment constraint.
Moreover, implementing personalized federated learning by model layering suffers from poor robustness and generalization. The technical scheme of the application does not require personalized layering of a specific model, and all of its optimization hyper-parameters are independent of the specific model implementation.
Implementing personalized federated learning with proximal operators or model interpolation has the problem of requiring additional training overhead to train a local model. The technical scheme of the application does not need to additionally train a local model: personalization is realized by training a personalized mask, and compared with training a personalized model, the personalized mask search only needs negligible additional computation overhead.
As for existing solutions based on personalized model compression, the technical scheme of the application ensures that the whole training process is based on a sparse model. In contrast, the solution of the present application 1) can significantly reduce the training computation overhead, because training is always performed on sparse models; 2) has better performance in terms of convergence speed and model accuracy, as confirmed by experiments on real data; and 3) has a convergence rate upper bound under the non-convex setting, i.e., the model deployed by the scheme converges to a model whose gradient is upper-bounded.
In summary, the technical scheme of the application is a personalized federated learning training algorithm based on sparse training. Dynamic sparse training technology is applied to personalized federated learning training for the first time. Aiming at the characteristics of sparse training and personalized federated learning, a systematic model parameter updating and aggregation scheme is provided. Specifically, the technical scheme of the application first proposes to keep the same model sparsity during local training in federated learning; in global aggregation, model parameters are not aggregated directly, but sparse model parameter updates are aggregated.
Meanwhile, aiming at the characteristics of personalized federated learning, the technical scheme of the application improves the optimal mask search algorithm of sparse training. In particular, we propose for the first time to use the global model to warm-start the mask search process of sparse training. In addition, the technical scheme of the application provides an upper bound on the convergence rate of the sparse personalized algorithm; through error analysis, this theory explains why the algorithm of the technical scheme of the application achieves better convergence and inference performance under high data heterogeneity.
Fig. 12 is a first structural schematic diagram of a federal model training device provided in an embodiment of the present application. As shown in fig. 12, the apparatus 120 includes: a receiving module 1201, a training module 1202, a determining module 1203, and a sending module 1204.
A receiving module 1201, configured to receive an individualized model sent by a central device, where the number of effective model parameters of the individualized model is smaller than the number of effective model parameters of a global model;
a training module 1202, configured to perform model training on the personalized model to obtain a training gradient of the personalized model, where the training gradient includes gradients of model parameters in the personalized model;
a determining module 1203, configured to determine a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model, and the pruning amount, where the target mask sequence is used by the central device to determine a personalized model in a next round of training;
a sending module 1204, configured to send the target mask sequence and the training gradient of the personalized model to the central device.
In one possible design, the determining module 1203 is specifically configured to:
determining a first mask sequence according to the model parameters of the personalized model;
setting part of masks in the first mask sequence to be a first preset value according to the pruning quantity to obtain a second mask sequence;
and setting part of masks in the second mask sequence to be a second preset value according to the pruning quantity to obtain the target mask sequence.
In one possible design, the determining module 1203 is specifically configured to:
determining a parameter state of each model parameter of the personalized model;
and determining the first mask sequence according to the parameter state of each model parameter of the personalized model.
In one possible design, the determining module 1203 is specifically configured to:
setting a mask corresponding to the model parameter in the valid state as a second preset value, and setting a mask corresponding to the model parameter in the invalid state as a first preset value to obtain the first mask sequence.
In one possible design, the amount of pruning is α, which is an integer greater than or equal to 1; the determining module 1203 is specifically configured to:
sorting from large to small according to the weights of the effective model parameters in the personalized model to obtain an effective model parameter sequence;
determining alpha effective model parameters at the tail end in the effective model parameter sequence as first effective model parameters;
setting the mask corresponding to the first effective model parameter in the first mask sequence as the first preset value, so as to obtain the second mask sequence.
In one possible design, the amount of pruning is α, which is an integer greater than or equal to 1; the determining module 1203 is specifically configured to:
acquiring gradients corresponding to first masks in the second mask sequence, wherein the values of the first masks are the first preset values;
determining alpha masks to be selected in the first masks of the second mask sequence according to gradients corresponding to the first masks in the second mask sequence;
and setting the alpha masks to be selected in the second mask sequence to be the second preset value to obtain the target mask sequence.
In one possible design, the determining module 1203 is specifically configured to:
determining model parameters corresponding to each first mask in the second mask sequence;
and determining the gradient of the model parameter corresponding to each first mask as the gradient corresponding to each first mask.
In one possible design, the determining module 1203 is specifically configured to:
sorting the first masks in the second mask sequence according to the sequence of the corresponding gradients from large to small;
and determining the first alpha masks in the sorted first masks as the alpha masks to be selected.
In one possible design, the model training includes N iterations, where N is an integer greater than 1; the training module 1202 is specifically configured to:
obtaining N intermediate gradients corresponding to each model parameter;
and determining and obtaining the training gradient of the personalized model according to the N intermediate gradients corresponding to each model parameter.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 13 is a schematic structural diagram of a federal model training device provided in the embodiment of the present application. As shown in fig. 13, the apparatus 130 includes: a determining module 1301, an updating module 1302 and a sending module 1303.
A determining module 1301, configured to determine a target global model;
the determining module 1301 further determines a target mask sequence corresponding to the participant device;
an updating module 1302, configured to update a model parameter of the target global model according to the target mask sequence, so as to obtain an individualized model;
a sending module 1303, configured to send the personalized model to the participant device.
In one possible design, the determining module 1301 is specifically configured to:
if the current training is the first round of training, determining a preset model as the target global model;
if the current training is the Mth round of training, obtaining a plurality of training gradients sent by a plurality of participant devices, and determining the target global model according to the M-1 th global model and the plurality of training gradients, wherein the M-1 th global model is the global model of the M-1 th round of training, the target global model is the Mth global model, and M is an integer greater than or equal to 2.
In one possible design, the determining module 1301 is specifically configured to:
and updating the model parameters of the M-1 th global model according to the plurality of training gradients to obtain the target global model.
In one possible design, the determining module 1301 is specifically configured to:
if the current training is the first round of training, determining a preset mask sequence as the target mask sequence;
if the current training is the Mth round of training, judging whether a mask sequence sent by the participant equipment exists, if so, determining the target mask sequence according to the mask sequence sent by the participant equipment; if not, determining the preset mask sequence as the target mask sequence.
In one possible design, the determining module 1301 is specifically configured to:
acquiring a mask sequence sent by the participant equipment for the last time;
and determining the mask sequence sent by the participant device for the last time as the target mask sequence.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 14 is a schematic diagram of a hardware structure of the federal model training device provided in the embodiment of the present application, and as shown in fig. 14, the federal model training device 140 of the present embodiment includes: a processor 1401 and a memory 1402; wherein
A memory 1402 for storing computer-executable instructions;
a processor 1401 configured to execute the computer executable instructions stored in the memory to implement the steps performed by the federal model training method in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 1402 may be separate or integrated with the processor 1401.
When the memory 1402 is provided separately, the federal model training device further includes a bus 1403 for connecting the memory 1402 and the processor 1401.
An embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored. When a processor executes the computer-executable instructions, the federal model training method implemented by the above federal model training device is implemented.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps for implementing the above-described method embodiments may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (19)

1. A method for training a federated model is applied to participant equipment, and comprises the following steps:
receiving an individualized model sent by central equipment, wherein the number of effective model parameters of the individualized model is smaller than that of effective model parameters of a global model;
performing model training on the personalized model to obtain a training gradient of the personalized model, wherein the training gradient comprises the gradient of each model parameter in the personalized model;
determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model and the pruning quantity, wherein the target mask sequence is used for the central equipment to determine the personalized model in the next round of training;
and sending the target mask sequence and the training gradient of the personalized model to the central equipment.
2. The method of claim 1, wherein determining a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model, and the pruning amount comprises:
determining a first mask sequence according to the model parameters of the personalized model;
setting part of masks in the first mask sequence to be a first preset value according to the pruning quantity to obtain a second mask sequence;
and setting part of masks in the second mask sequence to be a second preset value according to the pruning quantity to obtain the target mask sequence.
3. The method of claim 2, wherein determining a first mask sequence based on the model parameters of the personalized model comprises:
determining a parameter state of each model parameter of the personalized model;
and determining the first mask sequence according to the parameter state of each model parameter of the personalized model.
4. The method of claim 3, wherein determining the first mask sequence according to the parameter state of each model parameter of the personalized model comprises:
setting a mask corresponding to the model parameter in the valid state as a second preset value, and setting a mask corresponding to the model parameter in the invalid state as a first preset value to obtain the first mask sequence.
5. The method according to any one of claims 2 to 4, wherein the amount of pruning is α, which is an integer greater than or equal to 1; setting part of masks in the first mask sequence to be a first preset value according to the pruning quantity to obtain a second mask sequence, wherein the second mask sequence comprises the following steps:
sorting from large to small according to the weights of the effective model parameters in the personalized model to obtain an effective model parameter sequence;
determining alpha effective model parameters at the tail end in the effective model parameter sequence as first effective model parameters;
setting the mask corresponding to the first effective model parameter in the first mask sequence as the first preset value, so as to obtain the second mask sequence.
6. The method according to any one of claims 2 to 5, wherein the amount of pruning is α, which is an integer greater than or equal to 1; setting a part of masks in the second mask sequence to a second preset value according to the pruning quantity to obtain the target mask sequence, wherein the method comprises the following steps:
acquiring gradients corresponding to first masks in the second mask sequence, wherein the values of the first masks are the first preset values;
determining alpha masks to be selected in the first masks of the second mask sequence according to gradients corresponding to the first masks in the second mask sequence;
and setting the alpha masks to be selected in the second mask sequence to be the second preset value to obtain the target mask sequence.
7. The method of claim 6, wherein obtaining a gradient corresponding to each first mask in the second sequence of masks comprises:
determining model parameters corresponding to each first mask in the second mask sequence;
and determining the gradient of the model parameter corresponding to each first mask as the gradient corresponding to each first mask.
8. The method according to claim 6 or 7, wherein determining α candidate masks from the first masks in the second mask sequence according to the gradient corresponding to each first mask in the second mask sequence comprises:
sorting the first masks in the second mask sequence by their corresponding gradients from large to small;
and determining the first α masks among the sorted first masks as the α candidate masks.
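Claims 6 to 8 describe the regrow step: among the positions pruned in the second mask sequence, the α positions with the largest gradients are reactivated. The sketch below assumes sorting by absolute gradient value and preset values of 0 and 1; the function name `regrow_step` is an illustrative label, not a term from the claims.

```python
import numpy as np

def regrow_step(grads, second_mask, alpha, first_preset=0, second_preset=1):
    """Regrow step of claims 6-8: reactivate the alpha most promising masks.

    Among the first masks (positions whose mask equals the first preset
    value), the corresponding gradients are sorted from large to small
    (absolute value is assumed); the first alpha positions become the
    candidate masks and are set to the second preset value, which yields
    the target mask sequence.
    """
    grads = np.asarray(grads, dtype=float)
    target_mask = np.asarray(second_mask).copy()

    pruned_idx = np.flatnonzero(target_mask == first_preset)
    # Claim 7: the gradient of a first mask is the gradient of its model parameter.
    order = pruned_idx[np.argsort(-np.abs(grads[pruned_idx]))]
    candidates = order[:alpha]

    target_mask[candidates] = second_preset
    return target_mask

# Example: of the three pruned positions, position 3 has the largest gradient.
grads = np.array([0.1, 0.02, -0.4, 0.9, 0.3])
second_mask = np.array([1, 0, 1, 0, 0])
print(regrow_step(grads, second_mask, alpha=1))  # -> [1 0 1 1 0]
```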
9. The method of claim 8, wherein the model training comprises N iterations, and N is an integer greater than 1; and performing model training on the personalized model to obtain the training gradient of the personalized model comprises:
obtaining N intermediate gradients corresponding to each model parameter;
and determining the training gradient of the personalized model according to the N intermediate gradients corresponding to each model parameter.
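Claim 9 only states that the training gradient is determined from the N intermediate gradients of each model parameter; a simple average over the N local iterations is one natural reading and is what the sketch below assumes.

```python
import numpy as np

def training_gradient(intermediate_grads):
    """Combine the N intermediate gradients of claim 9 into one training gradient.

    intermediate_grads: array of shape (N, num_params), one row per local
    iteration. Averaging over the N iterations is an assumption; the claim
    only says the training gradient is determined from them.
    """
    intermediate_grads = np.asarray(intermediate_grads, dtype=float)
    return intermediate_grads.mean(axis=0)

# Example: N = 3 local iterations over a 4-parameter model.
grads = np.array([[0.2, -0.1, 0.0, 0.4],
                  [0.1, -0.3, 0.0, 0.2],
                  [0.3, -0.2, 0.0, 0.3]])
print(training_gradient(grads))  # -> [0.2, -0.2, 0.0, 0.3]
```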
10. A federated model training method, applied to a central device, the method comprising:
determining a target global model;
determining a target mask sequence corresponding to the participant device;
updating the model parameters of the target global model according to the target mask sequence to obtain a personalized model;
sending the personalized model to the participant device.
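On the central-device side (claim 10), the personalized model can be obtained by applying the target mask sequence to the target global model. Element-wise masking, i.e. zeroing the pruned positions, is assumed in the sketch below; the claim itself only requires that the global model's parameters are updated according to the mask.

```python
import numpy as np

def personalize(global_params, target_mask):
    """Claim 10, central-device side: mask the target global model.

    Element-wise masking is assumed, so the personalized model has fewer
    effective (non-zero) parameters than the global model.
    """
    global_params = np.asarray(global_params, dtype=float)
    target_mask = np.asarray(target_mask)
    return global_params * target_mask

# Example: the participant device keeps only the three positions its mask marks.
global_params = np.array([0.3, 0.8, -1.2, 0.5, 0.05])
target_mask = np.array([1, 0, 1, 1, 0])
print(personalize(global_params, target_mask))  # -> [0.3, 0.0, -1.2, 0.5, 0.0]
```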
11. The method of claim 10, wherein determining a target global model comprises:
if the current training is the first round of training, determining a preset model as the target global model;
if the current training is the M-th round of training, obtaining a plurality of training gradients sent by a plurality of participant devices, and determining the target global model according to the (M-1)-th global model and the plurality of training gradients, wherein the (M-1)-th global model is the global model of the (M-1)-th round of training, the target global model is the M-th global model, and M is an integer greater than or equal to 2.
12. The method of claim 11, wherein determining the target global model based on the (M-1)-th global model and the plurality of training gradients comprises:
and updating the model parameters of the (M-1)-th global model according to the plurality of training gradients to obtain the target global model.
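For claims 11 and 12, one plausible realization is to average the training gradients received from the participant devices and take a gradient step from the (M-1)-th global model; the learning rate and the averaging rule below are assumptions, since the claims only require an update according to the plurality of training gradients.

```python
import numpy as np

def update_global_model(prev_global_params, training_grads, lr=0.1):
    """Claims 11-12: form the M-th global model from the (M-1)-th one.

    training_grads: list of training gradients, one per participant device.
    Averaging the participants' gradients and taking a single gradient step
    is an assumed realization, not the only one consistent with the claims.
    """
    prev_global_params = np.asarray(prev_global_params, dtype=float)
    avg_grad = np.mean(np.asarray(training_grads, dtype=float), axis=0)
    return prev_global_params - lr * avg_grad

# Example: two participant devices report gradients for a 3-parameter model.
prev = np.array([0.5, -0.2, 1.0])
grads = [np.array([0.4, 0.0, -0.2]), np.array([0.2, 0.0, 0.2])]
print(update_global_model(prev, grads))  # -> [0.47, -0.2, 1.0]
```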
13. The method of any of claims 10-12, wherein determining the target mask sequence for the participant device comprises:
if the current training is the first round of training, determining a preset mask sequence as the target mask sequence;
if the current training is the M-th round of training, judging whether a mask sequence sent by the participant device exists; if so, determining the target mask sequence according to the mask sequence sent by the participant device; and if not, determining the preset mask sequence as the target mask sequence.
14. The method of claim 13, wherein determining the target mask sequence based on the mask sequence sent by the participant device comprises:
acquiring the mask sequence most recently sent by the participant device;
and determining the most recently sent mask sequence as the target mask sequence.
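Claims 13 and 14 reduce to a small selection rule on the central device: use the preset mask sequence in the first round (or when nothing has been received from the participant device), otherwise use the mask sequence the participant device sent most recently. A plain-Python sketch, with illustrative names only:

```python
def select_target_mask(round_idx, received_masks, preset_mask):
    """Claims 13-14: choose the target mask sequence for one participant device.

    received_masks: mask sequences received from that device so far, in
    arrival order (may be empty). Round 1, or an empty history, falls back
    to the preset mask sequence; otherwise the latest received one is used.
    """
    if round_idx == 1 or not received_masks:
        return preset_mask
    return received_masks[-1]

# Example: round 3, the participant device has already sent two mask sequences.
preset = [1, 1, 1, 1, 1]
history = [[1, 0, 1, 1, 0], [1, 0, 1, 0, 1]]
print(select_target_mask(3, history, preset))  # -> [1, 0, 1, 0, 1]
```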
15. A federated model training apparatus, applied to a participant device, the apparatus comprising:
a receiving module, configured to receive a personalized model sent by a central device, wherein the number of effective model parameters of the personalized model is less than the number of effective model parameters of a global model;
a training module, configured to perform model training on the personalized model to obtain a training gradient of the personalized model, wherein the training gradient comprises the gradient of each model parameter in the personalized model;
a determining module, configured to determine a target mask sequence according to the gradient of the personalized model, the model parameters of the personalized model and the pruning amount, wherein the target mask sequence is used by the central device to determine the personalized model in the next round of training;
and a sending module, configured to send the target mask sequence and the training gradient of the personalized model to the central device.
16. A federated model training apparatus, applied to a central device, the apparatus comprising:
a determining module, configured to determine a target global model;
wherein the determining module is further configured to determine a target mask sequence corresponding to a participant device;
an updating module, configured to update the model parameters of the target global model according to the target mask sequence to obtain a personalized model;
and a sending module, configured to send the personalized model to the participant device.
17. A federated model training device, comprising:
a memory, configured to store a program;
and a processor, configured to execute the program stored in the memory, wherein, when the program is executed, the processor performs the method of any one of claims 1 to 9 or claims 10 to 14.
18. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 9 or claims 10 to 14.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 9 or claims 10 to 14.
CN202210056987.3A 2022-01-18 2022-01-18 Federal model training method and device Pending CN114492837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210056987.3A CN114492837A (en) 2022-01-18 2022-01-18 Federal model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210056987.3A CN114492837A (en) 2022-01-18 2022-01-18 Federal model training method and device

Publications (1)

Publication Number Publication Date
CN114492837A true CN114492837A (en) 2022-05-13

Family

ID=81472674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210056987.3A Pending CN114492837A (en) 2022-01-18 2022-01-18 Federal model training method and device

Country Status (1)

Country Link
CN (1) CN114492837A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306884A (en) * 2023-03-03 2023-06-23 北京泰尔英福科技有限公司 Pruning method and device for federal learning model and nonvolatile storage medium
CN116306884B (en) * 2023-03-03 2024-02-06 北京泰尔英福科技有限公司 Pruning method and device for federal learning model and nonvolatile storage medium

Similar Documents

Publication Publication Date Title
Fletcher et al. Inference in deep networks in high dimensions
Chaudhari et al. Trident: Efficient 4pc framework for privacy preserving machine learning
CN111814977B (en) Method and device for training event prediction model
CN111030861B (en) Edge calculation distributed model training method, terminal and network side equipment
CN111062487B (en) Machine learning model feature screening method and device based on data privacy protection
CN111079939B (en) Machine learning model feature screening method and device based on data privacy protection
EP4350572A1 (en) Method, apparatus and system for generating neural network model, devices, medium and program product
CN114514519A (en) Joint learning using heterogeneous model types and architectures
Peng et al. Asynchronous distributed variational Gaussian process for regression
Pandit et al. Asymptotics of MAP inference in deep networks
CN113221183A (en) Method, device and system for realizing privacy protection of multi-party collaborative update model
US11843587B2 (en) Systems and methods for tree-based model inference using multi-party computation
CN112948885B (en) Method, device and system for realizing privacy protection of multiparty collaborative update model
CN114330673A (en) Method and device for performing multi-party joint training on business prediction model
CN114362948B (en) Federated derived feature logistic regression modeling method
CN114492837A (en) Federal model training method and device
CN114723047A (en) Task model training method, device and system
CN114358250A (en) Data processing method, data processing apparatus, computer device, medium, and program product
US20230325718A1 (en) Method and apparatus for joint training logistic regression model
CN115134114B (en) Longitudinal federal learning attack defense method based on discrete confusion self-encoder
CN109697511B (en) Data reasoning method and device and computer equipment
CN116341636A (en) Federal learning method, apparatus, system, and storage medium
WO2023038940A1 (en) Systems and methods for tree-based model inference using multi-party computation
Kan Federated hypergradient descent
CN114399901A (en) Method and equipment for controlling traffic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination