CN113191504A - Federated learning training acceleration method for computing resource heterogeneity - Google Patents

Federated learning training acceleration method for computing resource heterogeneity

Info

Publication number
CN113191504A
Authority
CN
China
Prior art keywords
global model
parameters
gradient
updated
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110556962.5A
Other languages
Chinese (zh)
Other versions
CN113191504B (en)
Inventor
何耶肖
李欢
章小宁
吴昊
范晨昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110556962.5A priority Critical patent/CN113191504B/en
Publication of CN113191504A publication Critical patent/CN113191504A/en
Application granted granted Critical
Publication of CN113191504B publication Critical patent/CN113191504B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a federated learning training acceleration method oriented to computing resource heterogeneity. The method judges whether the difference in iteration counts between the fastest and the slowest device has reached a threshold. If so, the fastest device does not wait: it directly updates its local model parameters with the gradient update value and then downloads the latest global model parameters to obtain a copy of them. The copy is updated with the additional gradient update parameters, and its loss function value is compared with that of the latest global model parameters; if the loss function value of the updated copy is smaller, the latest global model parameters are replaced by the updated copy. The invention adapts the traditional SSP synchronous parallel mechanism and improves the utilization of computing resources, thereby improving training efficiency and shortening the overall training time.

Description

Federated learning training acceleration method for computing resource heterogeneity
Technical Field
The invention relates to the technical field of federated learning, and in particular to a federated learning training acceleration method oriented to computing resource heterogeneity.
Background
In recent years, with the rapid development of machine learning, many artificial intelligence applications that were difficult to realize with conventional techniques have emerged in all areas of human life, such as data mining, image recognition, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, stock market analysis, speech and handwriting recognition, strategy games, and robotics. Machine learning is a data analysis method for automated analytical model construction that allows computers to learn autonomously without explicit programming. As a data-driven technique, machine learning requires a large amount of data for training in order to obtain a high-performance model. Today, with the spread of mobile phones, tablets, and various wearable devices, billions of edge devices generate massive amounts of user data; according to a report by International Data Corporation (IDC), the data generated by edge devices will grow to 79.4 ZB by 2025. This is a valuable data resource for machine learning, and the related techniques and applications of machine learning could advance further if the data stored on edge devices were utilized. However, the traditional approach to training a machine learning model is to upload all the raw data collected by the edge devices to a remote data center for centralized training. This approach requires large amounts of communication resources to transmit large amounts of data, resulting in unacceptably high costs. In addition, as privacy awareness increases, many users are reluctant to upload their data to a data center, and transmitting user data over a communication network also raises the problems of privacy disclosure and data security. Moreover, the centralized training mode cannot be applied to fields such as finance that are highly sensitive to data security.
In order to protect data privacy and reduce communication overhead, federated learning, a distributed training architecture, has been proposed in recent years as a replacement for centralized training. Today, many banks, securities companies, medical equipment manufacturers, and technology companies are actively developing federated learning, whose security and utility have been widely verified. The basic framework of federated learning is shown in fig. 1.
A set of edge devices (also referred to as clients), such as smartphones, laptops, and tablets, use their locally stored data to participate in the distributed training of the model. Each edge device keeps a copy of the model as its local model. A server connected to all edge devices maintains the global model and aggregates the gradient updates from the edge devices to update it. In each training iteration, every edge device uses its local data to compute a gradient update of the model parameters, uploads the gradient update to the server, and then downloads the updated global model parameters from the server as its new local model parameters. In this process, each edge device shares only the intermediate computation result (i.e., the gradient update of the model parameters) with the server and never uploads its locally stored raw data, so the privacy and security of the data are protected. In addition, the transmitted gradients and model can be protected with various encryption methods to further enhance security.
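To make this exchange concrete, the following is a minimal sketch of one such training iteration, written as an in-process Python simulation; the least-squares model, the synthetic data, and the Server class are illustrative assumptions introduced here and are not part of the invention.

```python
import numpy as np

def local_gradient(w, X, y):
    # gradient of the least-squares objective Q(w) = mean((Xw - y)^2) on local data
    return 2.0 * X.T @ (X @ w - y) / len(y)

class Server:
    """Toy stand-in for the central server (assumed interface, for illustration only)."""
    def __init__(self, dim, eta=0.1):
        self.w = np.zeros(dim)        # global model parameters
        self.eta = eta                # training step length
    def apply_update(self, grad):     # apply an uploaded gradient update
        self.w -= self.eta * grad
    def download(self):               # latest global model parameters
        return self.w.copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # locally stored data (never uploaded)
y = X @ np.array([1.0, -2.0, 0.5])

server = Server(dim=3)
w_local = server.download()                   # local model copy
grad = local_gradient(w_local, X, y)          # compute gradient update on local data
server.apply_update(grad)                     # upload only the intermediate result
w_local = server.download()                   # download new global parameters
print("local parameters after one round:", w_local)
```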
In the edge environment, devices have heterogeneous computing resources because of differences in processor architecture, power-consumption limits, operating systems, and so on. Due to this heterogeneity, the gradient computation times of the devices differ greatly. To keep the model parameters of all devices consistent, federated learning requires a synchronous parallel mechanism. Under the traditional Bulk Synchronous Parallel (BSP) mechanism, a device with more computing resources that finishes its gradient update quickly must wait in every iteration for the devices with fewer computing resources in order to stay synchronized. The faster devices waste a large amount of computing resources in this waiting process, while the slowest device largely determines the duration of each iteration; this is known as the straggler problem. The straggler problem lowers the training efficiency of federated learning and slows the convergence of the learning model. To improve training efficiency, the Stale Synchronous Parallel (SSP) mechanism has been proposed. The strategy of SSP is to let each device proceed directly to the next iteration after finishing the current one, without waiting, while limiting the difference in iteration counts between the fastest and the slowest device to a threshold. Once this threshold is reached, the fastest device must wait until the slowest device catches up. Although SSP improves training efficiency to some extent, a large amount of computing resources is still inevitably wasted in the waiting process. Current machine learning algorithms require substantial computing resources, and on resource-limited edge devices, inefficient training leads to long training times and makes it difficult to deploy algorithm applications.
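The bounded-staleness rule of SSP can be stated in a few lines; the helper below is a hypothetical sketch in which iteration_counts maps each device to its current iteration number and m is the staleness threshold.

```python
def fastest_must_wait(iteration_counts, m):
    """SSP rule (sketch): the fastest device must wait once its lead over the
    slowest device reaches the staleness threshold m."""
    fastest = max(iteration_counts.values())
    slowest = min(iteration_counts.values())
    return fastest - slowest >= m

# Example: device A is three iterations ahead of device C, so with m = 3 it must wait.
print(fastest_must_wait({"A": 10, "B": 9, "C": 7}, m=3))   # True
```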
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a federated learning training acceleration method for computing resource heterogeneity.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme:
A federated learning training acceleration method for computing resource heterogeneity comprises the following steps:
S1, initialize the local iteration count, the consecutive additional gradient computation count, the threshold on consecutive additional gradient computations, and the total iteration count, and download the initial global model parameters from the server;
S2, increment the local iteration count by 1 and judge whether it has reached the total iteration count; if so, end the training, otherwise go to step S3;
S3, store the latest global model parameters downloaded from the server as the local model parameters, perform a gradient update with the BP algorithm on the local data set to obtain the gradient update parameters, and upload the gradient update parameters to the server;
S4, judge whether the consecutive additional gradient computation count has reached its threshold; if so, go to step S10, otherwise go to step S5;
S5, update the local model parameters with the gradient update parameters obtained in step S3 to obtain the updated local model parameters, and perform one additional gradient update with the BP algorithm on the local data set to obtain the additional gradient update parameters;
S6, receive the signal issued by the server allowing the latest global model parameters to be downloaded, and judge whether the additional gradient computation has been completed; if so, go to step S7, otherwise go to step S9;
S7, download the latest global model parameters from the server and copy them to obtain a global model parameter copy, update the copy with the additional gradient update parameters obtained in step S5, and increment the consecutive additional gradient computation count by 1;
S8, re-determine the latest global model parameters according to the loss function values of the latest global model parameters and the updated global model parameter copy, and return to step S2;
S9, immediately stop the additional gradient computation, download the latest global model parameters from the server, reinitialize the consecutive additional gradient computation count, and return to step S2;
S10, reinitialize the consecutive additional gradient computation count, receive the signal issued by the server allowing the latest global model parameters to be downloaded, download them, and return to step S2.
The beneficial effects of this scheme are as follows:
After the threshold of the SSP synchronous parallel mechanism is reached, the device that would otherwise have to wait is allowed to perform one additional round of gradient computation. After the latest global model parameters are downloaded in the next round, they are updated with the gradient obtained from this additional round, on the condition that doing so reduces the loss function of the model. Computing resources are thereby further utilized, the waiting time of the devices that finish gradient computation quickly is reduced, the utilization of computing resources is improved, training efficiency is improved, and the overall training time is shortened.
Further, the gradient update parameter is calculated according to the formula:

$g_p = \nabla Q(w_p, z)$

where g_p is the gradient update parameter, ∇ denotes differentiation with respect to the parameters of the objective function, w_p is the local model parameter, z is a data sample of the local data set, and Q is the objective function.
The beneficial effects of the further scheme are as follows:
and calculating to obtain gradient updating parameters, and updating global model parameters by combining the gradient updating parameters.
Further, the additional gradient update parameter is calculated according to the formula:

$g_p' = \nabla Q(w_p', z)$

where g_p' is the additional gradient update parameter, ∇ denotes differentiation with respect to the parameters of the objective function, w_p' is the updated local model parameter, z is a data sample of the local data set, and Q is the objective function.
The beneficial effects of the further scheme are as follows:
and performing an additional round of gradient updating to obtain additional gradient updating parameters, and updating the global model parameter copy.
Further, the update formula for updating the local model parameters with the gradient update parameters in step S5 is:

$w_p' = w_p - \eta g_p$

where g_p is the gradient update parameter, η is the training step length, w_p is the local model parameter, and w_p' is the updated local model parameter.
The beneficial effects of the further scheme are as follows:
and the local model is updated, an additional round of gradient calculation is performed, and the utilization rate of calculation resources is improved.
Further, the update formula for updating the global model parameter copy with the additional gradient update parameters in step S7 is:

$w_s^* = w_s' - \eta g_p'$

where g_p' is the additional gradient update parameter, w_s' is the global model parameter copy, and w_s^* is the updated global model parameter copy.
The beneficial effects of the further scheme are as follows:
preparation is made for re-determining the latest global model parameters for the loss function values corresponding to the latest global model parameters and the updated copy of the global model parameters in step S8.
Further, step S8 is specifically:
comparing the loss function values loss(w_s) and loss(w_s^*) corresponding to the latest global model parameters w_s and the updated global model parameter copy w_s^*; if loss(w_s) is smaller than loss(w_s^*), the updated copy w_s^* is discarded; otherwise, the updated copy w_s^* is taken as the latest global model parameters w_s; the process then returns to step S2.
The beneficial effects of the further scheme are as follows:
and updating the gradient obtained by the extra gradient calculation, updating the copy after obtaining the global model parameter by next downloading, replacing the original global model parameter with the copy if the updated copy has a smaller loss function value than the original global model parameter, and otherwise discarding the copy to enable the training model to be converged more quickly, thereby improving the training efficiency and shortening the training time.
Further, the signal allowing the latest global model parameters to be downloaded is issued by the server according to the following steps:
s61, initializing global model parameters, an iteration number difference threshold value between the fastest device and the slowest device and a target loss function value;
s62, updating the initial global model parameters by using the gradient updating parameters uploaded to the server in the step S3 to obtain updated global model parameters;
s63, judging whether the loss function value corresponding to the updated global model parameter is smaller than the target loss function value, if so, stopping training, otherwise, entering the step S64;
s64, judging whether the iteration number difference between the fastest device and the slowest device meets an iteration number difference threshold value, if so, entering a step S65, otherwise, entering a step S66;
s65, sending a signal for allowing the latest global model parameter to be downloaded to other equipment except the fastest equipment, and returning to the step S62;
s66, a signal for allowing the latest global model parameter to be downloaded is sent to each device, and the process returns to step S62.
The beneficial effects of the further scheme are as follows:
after reaching the SSP threshold, the fastest devices need not wait, update the local model directly with the gradient obtained from the local computation, and then perform an additional round of gradient computation using the local model and the data. This reduces the latency of the device and improves the utilization of the computing resources.
Further, the update formula for updating the initial global model parameters with the gradient update parameters in step S62 is:

$w_s = w_s^0 - \eta g_p$

where g_p is the gradient update parameter, w_s^0 is the initial global model parameter, η is the training step length, and w_s is the updated global model parameter.
The beneficial effects of the further scheme are as follows:
the global model parameters are updated, the convergence of the global model is promoted, and the overall training progress is promoted.
Drawings
FIG. 1 is a prior art federated learning basic framework;
FIG. 2 is a schematic flow chart of a federated learning training acceleration method for computing resource heterogeneity provided in the present invention;
fig. 3 is a flowchart illustrating the substep of step S6.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. It will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined in the appended claims, and everything produced using the inventive concept is protected.
As shown in fig. 2, an embodiment of the present invention provides a federated learning training acceleration method for computing resource heterogeneity, including the following steps S1 to S10:
S1, initialize the local iteration count t_p, the consecutive additional gradient computation count α_p, the threshold c on consecutive additional gradient computations, and the total iteration count T, and download the initial global model parameters w_s^0 from the server.
In this embodiment, all edge devices are connected to a single server through a physical channel, over which they receive the download signal and obtain the initial global model parameters w_s^0 from the server.
S2, increment the local iteration count t_p by 1 to enter the iteration, and judge whether the local iteration count t_p has reached the total iteration count T; if so, finish the training, otherwise go to step S3.
S3, store the latest global model parameters downloaded from the server as the local model parameters w_p, compute the gradient update parameters with the BP algorithm on the local data set, and upload the gradient update parameters to the server.
The gradient update parameters are computed by the BP algorithm according to the formula:

$g_p = \nabla Q(w_p, z)$

where g_p is the gradient update parameter, ∇ denotes differentiation with respect to the parameters of the objective function, w_p is the local model parameter, z is a data sample of the local data set, and Q is the objective function.
In this embodiment, the local data set is the data set generated at the edge device.
In this embodiment, the work flow of the BP algorithm includes the following steps:
first, an input example is presented to the input-layer neurons and the signal is propagated forward layer by layer until the output layer produces a result; the error is then propagated backward to the hidden-layer neurons; the connection weights and thresholds are adjusted according to the errors of the hidden-layer neurons; finally, this process is iterated until a preset stopping condition is reached.
S4, judge whether the consecutive additional gradient computation count α_p has reached the threshold c; if so, go to step S10, otherwise go to step S5.
In this embodiment, the threshold c on consecutive additional gradient computations is a hyperparameter that is set manually and typically ranges from 1 to 10.
S5, update the local model parameters w_p with the gradient update parameter g_p obtained in step S3 to obtain the updated local model parameters w_p'. The update formula is:

$w_p' = w_p - \eta g_p$

where η is the training step length.
Then perform one additional round of gradient computation with the BP algorithm on the local data set to obtain the additional gradient update parameter g_p', calculated as:

$g_p' = \nabla Q(w_p', z)$

where g_p' is the additional gradient update parameter, ∇ denotes differentiation with respect to the parameters of the objective function, w_p' is the updated local model parameter, z is a data sample of the local data set, and Q is the objective function.
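As a concrete illustration of the BP-based gradient computations used in steps S3 and S5, the following sketch trains a single-hidden-layer network with a squared-error objective; the architecture, sizes, and data are arbitrary choices made for illustration and are not specified by the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                     # input examples
y = rng.normal(size=(32, 1))                     # targets
W1, b1 = 0.1 * rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.05                                       # training step length

for step in range(200):                          # iterate until a stop condition
    h = np.tanh(X @ W1 + b1)                     # forward pass, layer by layer
    out = h @ W2 + b2
    d_out = 2.0 * (out - y) / len(X)             # error at the output layer
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)        # error propagated back to the hidden layer
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)   # adjust weights and biases
    W1 -= eta * X.T @ d_h;    b1 -= eta * d_h.sum(axis=0)

print("squared error:", float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)))
```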
S6, receive the signal issued by the server allowing the latest global model parameters to be downloaded, and judge whether the additional gradient computation has been completed; if so, go to step S7, otherwise go to step S9.
In this embodiment, the signal sent by the server allowing the latest global model parameters to be downloaded is received through the physical channel.
As shown in fig. 3, the server issues the signal allowing the latest global model parameters to be downloaded according to the following steps S61 to S66:
S61, initialize the global model parameters w_s^0, the threshold m on the iteration-count difference between the fastest and the slowest device, and the target loss function value loss_inf.
In this embodiment, the threshold m on the iteration-count difference between the fastest and the slowest device is a hyperparameter that is set manually and typically ranges from 1 to 10.
S62, update the initial global model parameters w_s^0 with the gradient update parameter g_p uploaded to the server in step S3 to obtain the updated global model parameters w_s. The update formula is:

$w_s = w_s^0 - \eta g_p$

where η is the training step length.
S63, judge whether the loss function value loss(w_s) of the updated global model parameters w_s is smaller than the target loss function value loss_inf; if so, stop the training, otherwise go to step S64.
S64, judge whether the difference in iteration counts between the fastest device and the slowest device has reached the preset threshold m; if so, go to step S65, otherwise go to step S66.
S65, send the signal allowing the latest global model parameters to be downloaded to all devices except the fastest device, and return to step S62.
S66, send the signal allowing the latest global model parameters to be downloaded to every device, and return to step S62.
In this embodiment, the server is connected through a physical channel to all the edge devices participating in federated learning.
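A minimal sketch of one pass of this server-side procedure (steps S62 to S66) follows; the device bookkeeping, the loss evaluation, and the return values are assumptions introduced for illustration rather than the server's actual interface.

```python
import numpy as np

def server_step(w_s, grad, eta, iter_counts, sender, m, loss_fn, loss_inf):
    """Apply one uploaded gradient update (S62), check the target loss (S63),
    then decide which devices receive the download-allowed signal (S64-S66)."""
    w_s = w_s - eta * grad                        # S62: update the global parameters
    iter_counts[sender] += 1
    if loss_fn(w_s) < loss_inf:                   # S63: target loss reached, stop training
        return w_s, "stop", []
    gap = max(iter_counts.values()) - min(iter_counts.values())
    fastest = max(iter_counts, key=iter_counts.get)
    if gap >= m:                                  # S64 -> S65: exclude the fastest device
        allowed = [d for d in iter_counts if d != fastest]
    else:                                         # S66: signal every device
        allowed = list(iter_counts)
    return w_s, "continue", allowed

loss_fn = lambda w: float(np.sum(w ** 2))
w, status, allowed = server_step(np.ones(3), np.ones(3), 0.1, {"A": 5, "B": 2},
                                 sender="A", m=3, loss_fn=loss_fn, loss_inf=1e-3)
print(status, allowed)   # 'continue', and only device "B" may download
```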
S7, download the latest global model parameters w_s from the server and copy them to obtain the global model parameter copy w_s'; update the copy with the additional gradient update parameter g_p' obtained in step S5 to obtain the updated copy w_s^*, and increment the consecutive additional gradient computation count α_p by 1. The update formula is:

$w_s^* = w_s' - \eta g_p'$

where η is the training step length.
S8, re-determine the latest global model parameters w_s according to the loss function values of the latest global model parameters w_s and the updated global model parameter copy w_s^*, and return to step S2.
In this embodiment, the loss function values loss(w_s) and loss(w_s^*) of the latest global model parameters w_s and the updated copy w_s^* are compared. If loss(w_s) is smaller than loss(w_s^*), the updated copy w_s^* is discarded; otherwise, the updated copy w_s^* is taken as the latest global model parameters w_s. The process then returns to step S2.
S9, immediately stop the additional gradient computation, download the latest global model parameters w_s from the server, reinitialize the consecutive additional gradient computation count α_p, and return to step S2.
S10, reinitialize the consecutive additional gradient computation count α_p, receive the signal from the server allowing the latest global model parameters w_s to be downloaded, download them, and return to step S2.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (8)

1. A federated learning training acceleration method for computing resource heterogeneity is characterized by comprising the following steps:
S1, initialize the local iteration count, the consecutive additional gradient computation count, the threshold on consecutive additional gradient computations, and the total iteration count, and download the initial global model parameters from the server;
S2, increment the local iteration count by 1 and judge whether it has reached the total iteration count; if so, end the training, otherwise go to step S3;
S3, store the latest global model parameters downloaded from the server as the local model parameters, perform a gradient update with the BP algorithm on the local data set to obtain the gradient update parameters, and upload the gradient update parameters to the server;
S4, judge whether the consecutive additional gradient computation count has reached its threshold; if so, go to step S10, otherwise go to step S5;
S5, update the local model parameters with the gradient update parameters obtained in step S3 to obtain the updated local model parameters, and perform one additional gradient update with the BP algorithm on the local data set to obtain the additional gradient update parameters;
S6, receive the signal issued by the server allowing the latest global model parameters to be downloaded, and judge whether the additional gradient computation has been completed; if so, go to step S7, otherwise go to step S9;
S7, download the latest global model parameters from the server and copy them to obtain a global model parameter copy, update the copy with the additional gradient update parameters obtained in step S5, and increment the consecutive additional gradient computation count by 1;
S8, re-determine the latest global model parameters according to the loss function values of the latest global model parameters and the updated global model parameter copy, and return to step S2;
S9, immediately stop the additional gradient computation, download the latest global model parameters from the server, reinitialize the consecutive additional gradient computation count, and return to step S2;
S10, reinitialize the consecutive additional gradient computation count, receive the signal issued by the server allowing the latest global model parameters to be downloaded, download them, and return to step S2.
2. The federated learning training acceleration method oriented to computing resource heterogeneity according to claim 1, wherein the gradient update parameter is calculated according to the formula:

$g_p = \nabla Q(w_p, z)$

where g_p is the gradient update parameter, ∇ denotes differentiation with respect to the parameters of the objective function, w_p is the local model parameter, z is a data sample of the local data set, and Q is the objective function.
3. The federated learning training acceleration method oriented to computing resource heterogeneity according to claim 1, wherein the additional gradient update parameter is calculated according to the formula:

$g_p' = \nabla Q(w_p', z)$

where g_p' is the additional gradient update parameter, ∇ denotes differentiation with respect to the parameters of the objective function, w_p' is the updated local model parameter, z is a data sample of the local data set, and Q is the objective function.
4. The method as claimed in claim 1, wherein the update formula for updating the local model parameters with the gradient update parameters in step S5 is:

$w_p' = w_p - \eta g_p$

where g_p is the gradient update parameter, η is the training step length, w_p is the local model parameter, and w_p' is the updated local model parameter.
5. The method of claim 4, wherein in step S7 the update formula for updating the global model parameter copy with the additional gradient update parameter is:

$w_s^* = w_s' - \eta g_p'$

where g_p' is the additional gradient update parameter, w_s' is the global model parameter copy, and w_s^* is the updated global model parameter copy.
6. The federated learning training acceleration method oriented to computing resource heterogeneity according to claim 2, wherein step S8 is specifically:
comparing the loss function values loss(w_s) and loss(w_s^*) corresponding to the latest global model parameters w_s and the updated global model parameter copy w_s^*; if loss(w_s) is smaller than loss(w_s^*), discarding the updated copy w_s^*; otherwise, taking the updated copy w_s^* as the latest global model parameters w_s, and returning to step S2.
7. The federated learning training acceleration method oriented to computing resource heterogeneity according to claim 1, wherein the signal allowing the latest global model parameters to be downloaded is issued by the server according to the following steps:
s61, initializing global model parameters, an iteration number difference threshold value between the fastest device and the slowest device and a target loss function value;
s62, updating the initial global model parameters by using the gradient updating parameters uploaded to the server in the step S3 to obtain updated global model parameters;
s63, judging whether the loss function value corresponding to the updated global model parameter is smaller than the target loss function value, if so, stopping training, otherwise, entering the step S64;
s64, judging whether the iteration number difference between the fastest device and the slowest device meets an iteration number difference threshold value, if so, entering a step S65, otherwise, entering a step S66;
s65, sending a signal for allowing the latest global model parameter to be downloaded to other equipment except the fastest equipment, and returning to the step S62;
s66, a signal for allowing the latest global model parameter to be downloaded is sent to each device, and the process returns to step S62.
8. The method of claim 5, wherein the update formula for updating the initial global model parameters with the gradient update parameters in step S62 is:

$w_s = w_s^0 - \eta g_p$

where g_p is the gradient update parameter, w_s^0 is the initial global model parameter, η is the training step length, and w_s is the updated global model parameter.
CN202110556962.5A 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism Expired - Fee Related CN113191504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556962.5A CN113191504B (en) 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110556962.5A CN113191504B (en) 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism

Publications (2)

Publication Number Publication Date
CN113191504A true CN113191504A (en) 2021-07-30
CN113191504B CN113191504B (en) 2022-06-28

Family

ID=76984715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556962.5A Expired - Fee Related CN113191504B (en) 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism

Country Status (1)

Country Link
CN (1) CN113191504B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113778966A (en) * 2021-09-15 2021-12-10 深圳技术大学 Cross-school information sharing method and related device for college teaching and course score
CN113902128A (en) * 2021-10-12 2022-01-07 中国人民解放军国防科技大学 Asynchronous federal learning method, device and medium for improving utilization efficiency of edge device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285731A1 (en) * 2017-03-30 2018-10-04 Atomwise Inc. Systems and methods for correcting error in a first classifier by evaluating classifier output in parallel
CN111611610A (en) * 2020-04-12 2020-09-01 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal
CN112181971A (en) * 2020-10-27 2021-01-05 华侨大学 Edge-based federated learning model cleaning and equipment clustering method, system, equipment and readable storage medium
CN112288100A (en) * 2020-12-29 2021-01-29 支付宝(杭州)信息技术有限公司 Method, system and device for updating model parameters based on federal learning
US20210073678A1 (en) * 2019-09-09 2021-03-11 Huawei Technologies Co., Ltd. Method, apparatus and system for secure vertical federated learning
CN112817653A (en) * 2021-01-22 2021-05-18 西安交通大学 Cloud-side-based federated learning calculation unloading computing system and method
CN112818394A (en) * 2021-01-29 2021-05-18 西安交通大学 Self-adaptive asynchronous federal learning method with local privacy protection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285731A1 (en) * 2017-03-30 2018-10-04 Atomwise Inc. Systems and methods for correcting error in a first classifier by evaluating classifier output in parallel
US20210073678A1 (en) * 2019-09-09 2021-03-11 Huawei Technologies Co., Ltd. Method, apparatus and system for secure vertical federated learning
CN111611610A (en) * 2020-04-12 2020-09-01 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal
CN112181971A (en) * 2020-10-27 2021-01-05 华侨大学 Edge-based federated learning model cleaning and equipment clustering method, system, equipment and readable storage medium
CN112288100A (en) * 2020-12-29 2021-01-29 支付宝(杭州)信息技术有限公司 Method, system and device for updating model parameters based on federal learning
CN112817653A (en) * 2021-01-22 2021-05-18 西安交通大学 Cloud-side-based federated learning calculation unloading computing system and method
CN112818394A (en) * 2021-01-29 2021-05-18 西安交通大学 Self-adaptive asynchronous federal learning method with local privacy protection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LAIZHONG CUI: "ClusterGrad: Adaptive Gradient Compression by Clustering in Federated Learning", 《GLOBECOM 2020 - 2020 IEEE GLOBAL COMMUNICATIONS CONFERENCE》 *
廖钰盈: "Research on a Fused Federated Learning Mechanism for Heterogeneous Edge Nodes", China Excellent Master's Theses Electronic Journal *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113778966A (en) * 2021-09-15 2021-12-10 深圳技术大学 Cross-school information sharing method and related device for college teaching and course score
CN113778966B (en) * 2021-09-15 2024-03-26 深圳技术大学 Cross-school information sharing method and related device for university teaching and course score
CN113902128A (en) * 2021-10-12 2022-01-07 中国人民解放军国防科技大学 Asynchronous federal learning method, device and medium for improving utilization efficiency of edge device

Also Published As

Publication number Publication date
CN113191504B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN110880036B (en) Neural network compression method, device, computer equipment and storage medium
CN111784002B (en) Distributed data processing method, device, computer equipment and storage medium
CN113191504B (en) Federated learning training acceleration method for computing resource isomerism
WO2020042658A1 (en) Data processing method, device, apparatus, and system
CN108416440A (en) A kind of training method of neural network, object identification method and device
CN110889509B (en) Gradient momentum acceleration-based joint learning method and device
WO2021238262A1 (en) Vehicle recognition method and apparatus, device, and storage medium
WO2021244354A1 (en) Training method for neural network model, and related product
CN114611705A (en) Data processing method, training method for machine learning, and related device and equipment
WO2023201963A1 (en) Image caption method and apparatus, and device and medium
EP3889846A1 (en) Deep learning model training method and system
WO2023103864A1 (en) Node model updating method for resisting bias transfer in federated learning
CN112163601A (en) Image classification method, system, computer device and storage medium
CN111598213A (en) Network training method, data identification method, device, equipment and medium
WO2021169366A1 (en) Data enhancement method and apparatus
CN112446462B (en) Method and device for generating target neural network model
CN115526307A (en) Network model compression method and device, electronic equipment and storage medium
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN114358250A (en) Data processing method, data processing apparatus, computer device, medium, and program product
CN113762503A (en) Data processing method, device, equipment and computer readable storage medium
CN115907041A (en) Model training method and device
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN115795025A (en) Abstract generation method and related equipment thereof
CN114723069A (en) Parameter updating method and device and electronic equipment
CN113762304B (en) Image processing method, image processing device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220628

CF01 Termination of patent right due to non-payment of annual fee