CN114386570A - Heterogeneous federated learning training method based on multi-branch neural network model - Google Patents

Heterogeneous federated learning training method based on multi-branch neural network model

Info

Publication number
CN114386570A
CN114386570A
Authority
CN
China
Prior art keywords
model
branch
cloud
training
global
Prior art date
Legal status
Pending
Application number
CN202111575862.3A
Other languages
Chinese (zh)
Inventor
陈旭
崔嘉洛
周知
Current Assignee
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111575862.3A
Publication of CN114386570A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a heterogeneous federated learning training method based on a multi-branch neural network model. A multi-branch neural network model is introduced as the shared global model, so that a suitable sub-branch model can be matched to the computing power of each device; this adapts well to scenarios with heterogeneous computing resources and makes full use of the computing resources of different devices, thereby effectively improving the performance and efficiency of the whole heterogeneous federated learning training system. Aiming at the characteristics of the multi-branch model, the invention provides a multi-branch model aggregation method based on shared-layer parameters, which aggregates the different sub-branch models into a global multi-branch model so that model parameters can be shared effectively among the different device models. On the basis of parameter aggregation, distillation learning is introduced to resolve the performance fluctuation after model aggregation, which accelerates the convergence of the global model, reduces the number of required training rounds, and saves communication cost.

Description

Heterogeneous federated learning training method based on multi-branch neural network model
Technical Field
The invention relates to the technical field of federated learning, and in particular to a heterogeneous federated learning training method based on a multi-branch neural network model.
Background
Existing federated learning algorithms such as FedAvg and its variants, while enabling distributed machine learning under data-privacy constraints, require all models deployed on the participating devices to be homogeneous because of the limitations of model gradient aggregation. In practice, however, device heterogeneity is inevitable, so the size and capacity of the training model are dictated by the device with the weakest computing capability in the system, causing resource waste and a performance bottleneck. Many researchers have tried to address federated learning training of heterogeneous models by introducing knowledge distillation. For example, FedMD transfers knowledge by using a labeled public data set and averaging the output logits of the device models, so that each device model can be trained individually, but it cannot aggregate a shared model on the central server as classical federated learning does. In the FedDF work, ensemble distillation for model aggregation is proposed by averaging all device models and their output logits, and several heterogeneous shared models can be obtained through cloud aggregation. However, if device heterogeneity grows, the number of device model types inevitably increases, so the cost of server maintenance and heterogeneous model training rises sharply.
Disclosure of Invention
The invention provides a heterogeneous federated learning training method based on a multi-branch neural network model to overcome the defects in the prior art and to effectively improve the performance and efficiency of the whole heterogeneous federated learning training system.
In order to solve the technical problems, the invention adopts the following technical scheme: a heterogeneous federated learning training method based on a multi-branch neural network model comprises the following steps:
S1, initialization training of the cloud multi-branch model: the cloud server holds a global model with a plurality of branches and a public data set D_0; the branch models each retain their own output layer while sharing part of the common hidden layers; the global multi-branch model is first pre-trained on the cloud public data set D_0 for initialization;
S2, matching of the sub-branch neural network models: before federated learning starts, every device requesting to participate reports its available computing and storage information to the cloud server; the cloud server then determines from the collected device information the branch model best suited to each device and distributes the matched single-branch model to the corresponding device;
S3, local training of the device models: after each participant device receives the single-branch model issued by the cloud, it adopts the received model as its local model and then trains the current device model on its local private data set;
S4, aggregation of the cloud multi-branch neural network model: after local training, each participant device uploads its model parameters to the cloud server, and a weighted average is taken over the parameters that the device models share with the corresponding layers of the global branch models, so that the cloud global multi-branch model is aggregated and updated;
S5, distillation training of the cloud multi-branch neural network model: knowledge distillation is performed on the aggregated multi-branch network model using the cloud public data set and the model gradients uploaded by the devices; after distillation training, the model parameters of each branch of the multi-branch model are issued to the corresponding devices for the next round of local training;
S6, application of the cloud multi-branch neural network model: steps S3 to S5 are repeated until a preset number of training rounds is reached, and the final global multi-branch neural network model is obtained.
In one embodiment, the global multi-branch model has overall parameters W_s, where the parameters of each sub-branch model are denoted W_s^k, k = 1, ..., K; on the device side there are N participating devices, and each device i holds a private local data set D_i and a local model W_c^i.
In said step S1, the cross-entropy loss between the output of each branch model and the ground truth is computed, and the weighted average of the per-branch loss values (with branch weights β_k, Σ_k β_k = 1) is taken as the total loss:
L_total = Σ_{k=1}^{K} β_k · L_CE(W_s^k; D_0)    (Equation 1)
Finally, the parameters are updated by back-propagation on the total loss until the model converges (lr denotes the learning rate):
W_s ← W_s - lr · ∇_{W_s} L_total    (Equation 2)
in one embodiment, in step S2, when each device i reports its maximum satisfiable parameter number PiThen, match one to satisfy Pi≥PkThen the selected sub-branch model k is issued to the device i, where P iskIs the parameter number of the sub-branch model k.
In one embodiment, in step S3, a constraint is imposed on the difference between the parameters of the current local model and the original cloud model, so that under the constraint of the cloud model the shared hidden layers of the device models obtain parameter distributions that are as similar as possible during training.
In one embodiment, imposing a constraint on the difference between the parameters of the current local model and the original cloud model specifically comprises: adding an L2 regularization loss to the original cross-entropy loss function to measure the difference in parameter distribution between the local model and the cloud model; when device i receives the sub-branch model k, it initializes its local model before local training as
W_c^i ← W_s^k
Assuming that the sub-branch model k shares H network layers with the cloud main-branch model, the L2 regularization loss of the local training of device i is expressed as (l_h denotes the index of the h-th shared layer):
L_L2^i = Σ_{h=1}^{H} || W_c^i[l_h] - W_s^k[l_h] ||_2^2    (Equation 3)
Combining this with the loss function on the local data set gives the total loss function of device i:
L_i = L_CE(W_c^i; D_i) + η · L_L2^i    (Equation 4)
wherein η is a hyper-parameter that weighs the L2 regularization loss against the overall loss;
finally, the local model parameters of device i are updated according to the following formula (lr denotes the learning rate):
W_c^i ← W_c^i - lr · ∇_{W_c^i} L_i    (Equation 5)
in one embodiment, in step S4, the polymerization method using multi-branch joint averaging specifically includes: when the cloud server receives the model parameter sets uploaded by all the participant devices
Figure BDA0003424771300000035
Then multi-branch polymerization is carried out; firstly, acquiring a dictionary set W of all parameter layers in a global multi-branch modelSIf the number of layers of the model is K, the index of the K-th layer is lkCorrespondingly, the parameter of the k-th layer is denoted as WS[lk](ii) a Then traverse WSAll parameter layers in (1): for each layer k, the layer parameter W is calculatedS[lk]Firstly, initializing to zero value, and counting the index l of the layerkUploading a set of models
Figure BDA0003424771300000036
The total number of occurrences in (1) is recorded as Countk]Simultaneously for each presence of lkIndex layer device model
Figure BDA0003424771300000037
Adding the parameters of its corresponding layer to the global model, i.e.Order to
Figure BDA0003424771300000038
Traversing and accumulating each layer of the global multi-branch model to obtain and output a new global model parameter WS
In one embodiment, the loss function in distillation training comprises two parts: a cross-entropy loss against the real labels, as in Equation 1, and a KL-divergence loss against the soft labels output by the device models, as in Equation 6 below (p_i denotes the soft-label distribution produced by device model i on the public data set and p_S^k the output distribution of the corresponding global branch k):
L_KL = KL( p_i || p_S^k )    (Equation 6)
The total loss of the distillation training is given by Equation 7, where the hyper-parameter α sets the weight ratio between the cross-entropy loss and the KL-divergence loss:
L_distill = α · L_CE + (1 - α) · L_KL    (Equation 7)
The global branch model optimized in distillation training is obtained from Equation 8 (lr denotes the learning rate):
W_S ← W_S - lr · ∇_{W_S} L_distill    (Equation 8)
in one embodiment, in step S6, the output global multi-branch neural network model can be matched with the adaptive branch model according to the resource limitation and the accuracy requirement of different devices, so as to meet the applications of different devices; when the computing resources of the device side are insufficient or insufficient, the device side can only process the computation of the shared parameter part of the local model, then sends the intermediate result to the cloud global network for the computation of the rest part, and finally returns the cloud result to the edge device.
The present invention also provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the heterogeneous federated learning training method described above.
The present invention also provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the heterogeneous federated learning training method described above.
Compared with the prior art, the beneficial effects are:
1. In the invention, a multi-branch neural network model is introduced as the shared global model, and suitable sub-branch models can be matched to the computing power of different devices. Compared with traditional federated learning methods, the method adapts well to scenarios with heterogeneous computing resources and makes full use of the computing resources of different devices, thereby effectively improving the performance and efficiency of the whole heterogeneous federated learning training system;
2. aiming at the characteristics of a multi-branch model, the invention provides a multi-branch model aggregation method based on shared layer parameters, which can aggregate different sub-branch models to form a global multi-branch model, so that model parameters can be effectively shared among different equipment models;
3. on the basis of parameter aggregation, distillation learning is introduced to solve the problem of performance fluctuation after model aggregation, so that the convergence speed of the global model is accelerated, the number of required training rounds is reduced, and communication consumption is saved;
4. In the invention, the multi-branch neural network model obtained by cloud training can flexibly match the application requirements of various types of devices and supports device-edge collaborative inference.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a schematic workflow diagram of the multi-branch model training system of the present invention.
FIG. 3 is a diagram illustrating the computation of a loss function in local training according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention; obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The drawings are for illustrative purposes only, represent schematic rather than physical forms, and should not be construed as limiting this patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced and do not represent the size of an actual product; and it will be understood by those skilled in the art that certain well-known structures in the drawings, and their descriptions, may be omitted.
In the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by the terms "upper", "lower", "left", "right", etc. based on the orientation or positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but it is not intended to indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limiting the present patent, and the specific meaning of the terms may be understood by those skilled in the art according to specific circumstances. In addition, if there is a description of "first", "second", etc. in an embodiment of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the meaning of "and/or" appearing throughout is to include three juxtapositions, exemplified by "A and/or B" including either scheme A, or scheme B, or a scheme in which both A and B are satisfied.
Example 1:
As shown in FIG. 1, a heterogeneous federated learning training method based on a multi-branch neural network model includes the following steps:
Step 1, initialization training of the cloud multi-branch model. In the invention, most of the data resides in the private data sets of the participant-side devices, and only a small public data set is stored in the cloud server. A global multi-branch neural network model, shown in FIG. 2, resides in the cloud server, in which the different branch models retain their own output layers while sharing part of the common hidden layers. The model size of each branch therefore differs, which accommodates devices with different computing resources. The global multi-branch model is then pre-trained on the cloud public data set so that its initialization already has a certain level of performance, which improves the convergence speed and training accuracy of the subsequent federated learning.
Specifically, the system of the invention is divided into a cloud side and a device side. The cloud server holds a global multi-branch model with K branches; the overall parameters of the model are denoted W_s, the parameters of each sub-branch model are denoted W_s^k, k = 1, ..., K, and the public data set is denoted D_0. On the device side there are N participating devices, and each device i holds a private local data set D_i and a local model W_c^i.
Since the target tasks of the branches in the multi-branch model are identical, they are jointly optimized on the public data set D_0. First, the cross-entropy loss between the output of each sub-branch model and the ground truth is computed, and the weighted average of the per-branch loss values (with branch weights β_k, Σ_k β_k = 1) is taken as the total loss, as shown in Equation 1:
L_total = Σ_{k=1}^{K} β_k · L_CE(W_s^k; D_0)    (Equation 1)
Finally, the parameters are updated by back-propagation on the total loss until the model converges, as shown in Equation 2 (lr denotes the learning rate):
W_s ← W_s - lr · ∇_{W_s} L_total    (Equation 2)
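To make the multi-branch structure and the joint pre-training objective concrete, the following PyTorch-style sketch shows one possible instantiation. It is only a minimal illustration under assumed details (a fully connected trunk; the names MultiBranchNet and pretrain_step are invented for the example), not the exact network of the patent; each deeper branch reuses all hidden layers of the shallower ones, matching the shared-hidden-layer description above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiBranchNet(nn.Module):
        """Illustrative global multi-branch model: K branches share a prefix of
        hidden layers and each branch attaches its own output layer."""
        def __init__(self, in_dim=784, hidden=128, num_classes=10, num_branches=3):
            super().__init__()
            # Shared hidden layers; branch k uses the first (k + 1) of them.
            self.shared = nn.ModuleList(
                [nn.Linear(in_dim if i == 0 else hidden, hidden) for i in range(num_branches)]
            )
            # One output head per branch.
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, num_classes) for _ in range(num_branches)]
            )

        def forward(self, x):
            outputs, h = [], x
            for layer, head in zip(self.shared, self.heads):
                h = F.relu(layer(h))
                outputs.append(head(h))      # logits of branch k
            return outputs                   # K logit tensors

    def pretrain_step(model, batch, optimizer, branch_weights):
        """One joint pre-training step on the cloud public data set D_0:
        total loss = weighted average of per-branch cross-entropy (Eq. 1),
        followed by one back-propagation update (Eq. 2)."""
        x, y = batch
        optimizer.zero_grad()
        losses = [F.cross_entropy(logits, y) for logits in model(x)]
        total = sum(w * l for w, l in zip(branch_weights, losses))
        total.backward()
        optimizer.step()
        return total.item()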
step 2, matching of the sub-branch neural network model: before the start of federal learning, all devices requesting to participate in federal learning should report the available computing resource information of the devices to the cloud server, and the model parameters are used as the measurement of computing capacity in the system of the invention. Thus, when each device i reports its maximum satisfiable parameter number PiThen, match one to satisfy Pi≥PkMaximum sub-branch model k, P ofkIs the parameter quantity of the sub-branch model k, and then the selected sub-branch model k is issued to the device i.
Step 3, local training of the device models: after each participant device receives the single-branch model issued by the cloud, it adopts the received model as its local model and then trains the current device model on its local private data set. In each round of local training, however, a constraint is imposed on the difference between the parameters of the current local model and the original cloud model, as shown in FIG. 3. Under the constraint of the cloud model, the shared hidden layers of the device models thus obtain parameter distributions that are as similar as possible during training, which effectively reduces noisy parameters in the subsequent parameter-aggregation step.
In a traditional federated learning algorithm, a device runs inference on its local data with its local model to obtain predictions, computes a loss between the predictions and the ground truth with a conventional loss function (such as cross-entropy), and finally updates the model parameters according to the resulting loss. In the system of the invention, however, the local models of the devices are heterogeneous, so the parameter distributions obtained by their respective local training may diverge. To make the models of different devices end local training with parameter distributions that are as similar as possible, a constraint must be imposed on the parameter update during training. Specifically, an L2 regularization loss is added to the original cross-entropy loss to measure the difference in parameter distribution between the local model and the cloud model. When device i receives the sub-branch model k, it initializes its local model before local training as
W_c^i ← W_s^k
Assuming that the sub-branch model k shares H network layers with the cloud main-branch model, the L2 regularization loss of the local training of device i can be expressed as (l_h denotes the index of the h-th shared layer):
L_L2^i = Σ_{h=1}^{H} || W_c^i[l_h] - W_s^k[l_h] ||_2^2    (Equation 3)
Combining this with the loss function on the local data set gives the total loss function of device i, shown in Equation 4, where η is a hyper-parameter that weighs the L2 regularization loss against the overall loss:
L_i = L_CE(W_c^i; D_i) + η · L_L2^i    (Equation 4)
Finally, the local model parameters of device i are updated according to Equation 5 (lr denotes the learning rate):
W_c^i ← W_c^i - lr · ∇_{W_c^i} L_i    (Equation 5)
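A minimal sketch of one such local update, under the same illustrative assumptions as the earlier snippet (invented helper names, PyTorch-style), might look as follows; the frozen copy of the cloud shared-layer parameters plays the role of W_s^k in Equations 3 and 4:

    import torch
    import torch.nn.functional as F

    def local_train_step(local_model, cloud_shared_params, batch, optimizer, eta):
        """One local step on device i: cross-entropy on the private batch plus an
        L2 penalty keeping the H shared layers close to the cloud parameters
        received at the start of the round (Eqs. 3-5).
        cloud_shared_params: dict mapping shared-layer parameter names to frozen tensors."""
        x, y = batch
        optimizer.zero_grad()
        loss = F.cross_entropy(local_model(x), y)           # loss on local private data
        l2_reg = sum(torch.sum((p - cloud_shared_params[n]) ** 2)
                     for n, p in local_model.named_parameters()
                     if n in cloud_shared_params)           # only the shared layers
        loss = loss + eta * l2_reg                          # Eq. 4
        loss.backward()
        optimizer.step()                                    # Eq. 5
        return loss.item()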
step 4, aggregation of the cloud multi-branch lifting network model: after local training is completed, each participant device uploads parameters of the respective model to the cloud server, and weighted average is performed on the device model and the parameters of the same part of the plurality of global branch models, so that the global branch models of the cloud are aggregated and updated.
Since the models uploaded by the devices belong to different sub-branches and those sub-branch models are heterogeneous, they cannot be aggregated directly with the traditional federated learning method. Considering that the device models, although heterogeneous, are all parts of the global multi-branch model, a multi-branch joint-averaging aggregation method is proposed whose core is to average the shared-layer parameters across the device models. The specific procedure is as follows:
When the cloud server has received the model parameter sets {W_c^i} uploaded by all participant devices, multi-branch aggregation is carried out. First, the dictionary (set of named parameter layers) W_S of the global multi-branch model is obtained; if the model has K parameter layers, the index of the k-th layer is l_k and the corresponding layer parameters are denoted W_S[l_k]. Then all parameter layers of W_S are traversed: for each layer k, the layer parameters W_S[l_k] are first initialized to zero, the number of uploaded device models {W_c^i} that contain the index l_k is counted and recorded as Count[l_k], and for every device model W_c^i that contains the layer with index l_k, the parameters of its corresponding layer are added to the global model, i.e.
W_S[l_k] ← W_S[l_k] + W_c^i[l_k] / Count[l_k]
After every layer of the global multi-branch model has been traversed and accumulated in this way, the new global model parameters W_S are obtained and output.
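The following sketch shows how this multi-branch joint averaging could operate on PyTorch state dictionaries. It is an illustration under assumed names; as a small practical deviation from the literal zero-initialization above, layers that appear in no uploaded model keep their previous cloud values instead of being zeroed out.

    import torch

    def aggregate_multibranch(global_state, device_states):
        """Average every parameter layer l_k of the global multi-branch model over
        the Count[l_k] uploaded device models that contain it (step 4)."""
        new_state = {}
        for layer_name, old_param in global_state.items():
            contributions = [ds[layer_name] for ds in device_states if layer_name in ds]
            if contributions:                               # Count[l_k] > 0
                new_state[layer_name] = torch.stack(contributions).mean(dim=0)
            else:                                           # no device trained this layer this round
                new_state[layer_name] = old_param.clone()
        return new_state

    # Usage sketch:
    # new_global = aggregate_multibranch(global_model.state_dict(),
    #                                    [m.state_dict() for m in device_models])
    # global_model.load_state_dict(new_global)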
Step 5, distillation training of the cloud multi-branch neural network model: because the structures of the different branch models are inconsistent, the parameter distributions of the device models are also inconsistent, so parameter aggregation inevitably introduces parameter noise into the global model and causes its performance to fluctuate. Knowledge distillation is therefore performed on the aggregated multi-branch network model, based on the cloud public data set and the model gradients uploaded by the devices, so as to restore and improve the prediction accuracy of the multi-branch model. After distillation training, the model parameters of each branch of the multi-branch model are sent to the corresponding devices for the next round of local training.
Specifically, after the new global multi-branch model has been generated by cloud aggregation, distillation training is carried out on the cloud public data set D_0 using the model parameters of each device. As before, the loss function of the distillation training has two parts: the first is the cross-entropy loss against the real labels, as in Equation 1; the second is the KL-divergence loss against the soft labels output by the device models, as in Equation 6, where p_i denotes the soft-label distribution produced by device model i on the public data and p_S^k the output distribution of the corresponding global branch k:
L_KL = KL( p_i || p_S^k )    (Equation 6)
The total loss of the distillation training is therefore given by Equation 7, where the hyper-parameter α sets the weight ratio between the cross-entropy loss and the KL-divergence loss:
L_distill = α · L_CE + (1 - α) · L_KL    (Equation 7)
The global branch model optimized in the distillation training is then obtained from Equation 8 (lr denotes the learning rate):
W_S ← W_S - lr · ∇_{W_S} L_distill    (Equation 8)
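A minimal sketch of one such distillation step for a single branch is given below, under the same illustrative assumptions as the earlier snippets; in particular, the alpha / (1 - alpha) weighting and the absence of temperature scaling follow the reconstruction of Equation 7 above and are assumptions rather than specifics from the patent.

    import torch
    import torch.nn.functional as F

    def distill_step(global_model, branch_idx, device_logits, batch, optimizer, alpha):
        """One cloud-side distillation step for branch `branch_idx` on a public-data
        batch: cross-entropy against the true labels plus KL divergence between the
        branch output and the soft labels produced by the matched device model
        (Eqs. 6-8)."""
        x, y = batch
        optimizer.zero_grad()
        student_logits = global_model(x)[branch_idx]
        ce = F.cross_entropy(student_logits, y)
        # F.kl_div expects log-probabilities first, target probabilities second,
        # i.e. this computes KL(device soft labels || global branch prediction).
        kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                      F.softmax(device_logits, dim=1),
                      reduction="batchmean")
        loss = alpha * ce + (1.0 - alpha) * kl              # Eq. 7
        loss.backward()
        optimizer.step()                                    # Eq. 8
        return loss.item()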
and after the distillation training is finished, model parameters of each branch in the global multi-branch model are sent to corresponding equipment for the next round of local training.
Step 6, application of the cloud multi-branch neural network model: steps S3 to S5 are repeated until a preset number of training rounds is reached, and the final global multi-branch neural network model is obtained. When devices need to be deployed or newly added for an application, sub-branch models with different inference accuracy can be matched to the resource limitations and requirements of the different devices, giving flexibility and diversity in model deployment. In addition, the multi-branch neural network model also supports device-edge collaborative inference: for example, when the computing resources of the device side are insufficient or low latency is required, the device side may process only the computation of the shared-parameter part of its local model, send the intermediate result to the cloud global network for computation of the remaining part, and finally receive the cloud result back at the edge device, thereby achieving accelerated inference through device-edge cooperation.
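To make the collaborative-inference idea concrete, the sketch below splits one forward pass between the device and the cloud, reusing the illustrative MultiBranchNet layout from the first snippet; all attribute and function names are assumptions for the example, not the patent's own interfaces.

    import torch

    def device_partial_inference(local_model, x, num_shared_layers):
        """Device side: run only the shared hidden layers of the local single-branch
        model and return the intermediate feature to be sent to the cloud."""
        h = x
        for layer in local_model.shared[:num_shared_layers]:
            h = torch.relu(layer(h))
        return h

    def cloud_finish_inference(global_model, h, branch_idx, num_shared_layers):
        """Cloud side: continue from the received intermediate feature through the
        remaining layers of the requested branch and return the prediction,
        which is then sent back to the edge device."""
        for layer in global_model.shared[num_shared_layers:branch_idx + 1]:
            h = torch.relu(layer(h))
        logits = global_model.heads[branch_idx](h)
        return logits.argmax(dim=1)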
Example 2
The present embodiment provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the heterogeneous federated learning training method of embodiment 1.
Example 3
The present embodiment provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the heterogeneous federated learning training method of embodiment 1.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A heterogeneous federated learning training method based on a multi-branch neural network model is characterized by comprising the following steps:
S1, initialization training of the cloud multi-branch model: the cloud server holds a global model with a plurality of branches and a public data set D_0; the branch models each retain their own output layer while sharing part of the common hidden layers; the global multi-branch model is first pre-trained on the cloud public data set D_0 for initialization;
S2, matching of the sub-branch neural network models: before federated learning starts, every device requesting to participate reports its available computing and storage information to the cloud server; the cloud server then determines from the collected device information the branch model best suited to each device and distributes the matched single-branch model to the corresponding device;
S3, local training of the device models: after each participant device receives the single-branch model issued by the cloud, it adopts the received model as its local model and then trains the current device model on its local private data set;
S4, aggregation of the cloud multi-branch neural network model: after local training, each participant device uploads its model parameters to the cloud server, and a weighted average is taken over the parameters that the device models share with the corresponding layers of the global branch models, so that the cloud global multi-branch model is aggregated and updated;
S5, distillation training of the cloud multi-branch neural network model: knowledge distillation is performed on the aggregated multi-branch network model using the cloud public data set and the model gradients uploaded by the devices; after distillation training, the model parameters of each branch of the multi-branch model are issued to the corresponding devices for the next round of local training;
S6, application of the cloud multi-branch neural network model: steps S3 to S5 are repeated until a preset number of training rounds is reached, and the final global multi-branch neural network model is obtained.
2. The method of claim 1, wherein the global multi-branch model has overall parameters W_s, the parameters of each sub-branch model are denoted W_s^k, k = 1, ..., K, and on the device side there are N participating devices, each device i holding a private local data set D_i and a local model W_c^i;
in said step S1, the cross-entropy loss between the output of each branch model and the ground truth is computed, and the weighted average of the per-branch loss values (with branch weights β_k, Σ_k β_k = 1) is taken as the total loss:
L_total = Σ_{k=1}^{K} β_k · L_CE(W_s^k; D_0)    (Equation 1)
finally, the parameters are updated by back-propagation on the total loss until the model converges (lr denotes the learning rate):
W_s ← W_s - lr · ∇_{W_s} L_total    (Equation 2)
3. The method for heterogeneous federated learning training based on a multi-branch neural network model as claimed in claim 2, wherein in step S2, when a device i reports its maximum supportable parameter count P_i, it is matched with the largest sub-branch model k that satisfies P_i ≥ P_k, and the selected sub-branch model k is then issued to device i, where P_k is the parameter count of sub-branch model k.
4. The method of claim 2, wherein in step S3 a constraint is imposed on the difference between the parameters of the current local model and the original cloud model, so that under the constraint of the cloud model the shared hidden layers of the device models obtain parameter distributions that are as similar as possible during training.
5. The method according to claim 4, wherein imposing a constraint on the difference between the parameters of the current local model and the original cloud model specifically comprises: adding an L2 regularization loss to the original cross-entropy loss function to measure the difference in parameter distribution between the local model and the cloud model; when device i receives the sub-branch model k, it initializes its local model before local training as
W_c^i ← W_s^k
assuming that the sub-branch model k shares H network layers with the cloud main-branch model, the L2 regularization loss of the local training of device i is expressed as (l_h denotes the index of the h-th shared layer):
L_L2^i = Σ_{h=1}^{H} || W_c^i[l_h] - W_s^k[l_h] ||_2^2    (Equation 3)
combining this with the loss function on the local data set gives the total loss function of device i:
L_i = L_CE(W_c^i; D_i) + η · L_L2^i    (Equation 4)
wherein η is a hyper-parameter that weighs the L2 regularization loss against the overall loss;
finally, the local model parameters of device i are updated according to the following formula (lr denotes the learning rate):
W_c^i ← W_c^i - lr · ∇_{W_c^i} L_i    (Equation 5)
6. The method for heterogeneous federated learning training based on a multi-branch neural network model as claimed in claim 5, wherein in step S4 the aggregation method of multi-branch joint averaging specifically comprises: when the cloud server has received the model parameter sets {W_c^i} uploaded by all participant devices, multi-branch aggregation is carried out; first, the dictionary (set of named parameter layers) W_S of the global multi-branch model is obtained; if the model has K parameter layers, the index of the k-th layer is l_k and the corresponding layer parameters are denoted W_S[l_k]; then all parameter layers of W_S are traversed: for each layer k, the layer parameters W_S[l_k] are first initialized to zero, the number of uploaded device models {W_c^i} that contain the index l_k is counted and recorded as Count[l_k], and for every device model W_c^i that contains the layer with index l_k, the parameters of its corresponding layer are added to the global model, i.e.
W_S[l_k] ← W_S[l_k] + W_c^i[l_k] / Count[l_k]
after every layer of the global multi-branch model has been traversed and accumulated, the new global model parameters W_S are obtained and output.
7. The method of claim 6, wherein the loss function in distillation training comprises: a cross-entropy loss against the real labels, as in Equation 1, and a KL-divergence loss against the soft labels output by the device models, as in Equation 6 below (p_i denotes the soft-label distribution produced by device model i on the public data set and p_S^k the output distribution of the corresponding global branch k):
L_KL = KL( p_i || p_S^k )    (Equation 6)
the total loss of the distillation training is given by Equation 7, where the hyper-parameter α sets the weight ratio between the cross-entropy loss and the KL-divergence loss:
L_distill = α · L_CE + (1 - α) · L_KL    (Equation 7)
the global branch model optimized in distillation training is obtained from Equation 8 (lr denotes the learning rate):
W_S ← W_S - lr · ∇_{W_S} L_distill    (Equation 8)
8. The method for heterogeneous federated learning training based on a multi-branch neural network model of claim 7, wherein in step S6 the output global multi-branch neural network model can be matched with an adapted branch model according to the resource limitations and accuracy requirements of different devices, so as to serve the applications of different types of devices; when the computing resources of the device side are insufficient or low latency is required, the device side processes only the computation of the shared-parameter part of its local model, then sends the intermediate result to the cloud global network for computation of the remaining part, and finally the cloud result is returned to the edge device.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the heterogeneous federated learning training method based on a multi-branch neural network model as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the heterogeneous federated learning training method based on a multi-branch neural network model as claimed in any one of claims 1 to 8.
Application CN202111575862.3A, priority date 2021-12-21, filing date 2021-12-21: Heterogeneous federated learning training method based on multi-branch neural network model. Status: Pending. Publication: CN114386570A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111575862.3A CN114386570A (en) 2021-12-21 2021-12-21 Heterogeneous federated learning training method based on multi-branch neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111575862.3A CN114386570A (en) 2021-12-21 2021-12-21 Heterogeneous federated learning training method based on multi-branch neural network model

Publications (1)

Publication Number Publication Date
CN114386570A true CN114386570A (en) 2022-04-22

Family

ID=81198395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111575862.3A Pending CN114386570A (en) 2021-12-21 2021-12-21 Heterogeneous federated learning training method based on multi-branch neural network model

Country Status (1)

Country Link
CN (1) CN114386570A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925829A (en) * 2022-07-18 2022-08-19 山东海量信息技术研究院 Neural network training method and device, electronic equipment and storage medium
WO2024017001A1 (en) * 2022-07-21 2024-01-25 华为技术有限公司 Model training method and communication apparatus
CN115034836A (en) * 2022-08-12 2022-09-09 腾讯科技(深圳)有限公司 Model training method and related device
CN115034836B (en) * 2022-08-12 2023-09-22 腾讯科技(深圳)有限公司 Model training method and related device
CN116614484A (en) * 2023-07-19 2023-08-18 北京邮电大学 Heterogeneous data federal learning method based on structure enhancement and related equipment
CN116614484B (en) * 2023-07-19 2023-11-10 北京邮电大学 Heterogeneous data federal learning method based on structure enhancement and related equipment

Similar Documents

Publication Publication Date Title
CN114386570A (en) Heterogeneous federated learning training method based on multi-branch neural network model
Nie et al. Network traffic prediction based on deep belief network in wireless mesh backbone networks
CN110533183A (en) The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning
CN113010305B (en) Federal learning system deployed in edge computing network and learning method thereof
CN106815782A (en) A kind of real estate estimation method and system based on neutral net statistical models
WO2021036414A1 (en) Co-channel interference prediction method for satellite-to-ground downlink under low earth orbit satellite constellation
CN108122048B (en) Transportation path scheduling method and system
CN108111335A (en) A kind of method and system dispatched and link virtual network function
CN113992676A (en) Incentive method and system for layered federal learning under terminal edge cloud architecture and complete information
CN113708969B (en) Collaborative embedding method of cloud data center virtual network based on deep reinforcement learning
CN110134507B (en) A kind of cooperative computing method under edge calculations system
CN110119399B (en) Business process optimization method based on machine learning
CN104021315A (en) Method for calculating station service power consumption rate of power station on basis of BP neutral network
CN109034232A (en) The automation output system and control method of urban planning condition verification achievement Report
JP2020077090A (en) Decentralized processing system and decentralized processing method
CN115001978B (en) Cloud tenant virtual network intelligent mapping method based on reinforcement learning model
CN116502709A (en) Heterogeneous federal learning method and device
CN106789163A (en) A kind of network equipment power information monitoring method, device and system
CN111292062A (en) Crowdsourcing garbage worker detection method and system based on network embedding and storage medium
CN114997422B (en) Grouping type federal learning method of heterogeneous communication network
CN116362327A (en) Model training method and system and electronic equipment
CN116302481A (en) Resource allocation method and system based on sparse knowledge graph link prediction
Sen et al. A Data and Model Parallelism based Distributed Deep Learning System in a Network of Edge Devices
CN114816755A (en) Scheduling method, scheduling device, processing core, electronic device and readable medium
CN114599043A (en) Air-space-ground integrated network resource allocation method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination