CN113191504B - Federated learning training acceleration method for computing resource heterogeneity - Google Patents

Federated learning training acceleration method for computing resource heterogeneity

Info

Publication number
CN113191504B
CN113191504B (application CN202110556962.5A)
Authority
CN
China
Prior art keywords
global model
parameters
gradient
updated
model parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110556962.5A
Other languages
Chinese (zh)
Other versions
CN113191504A (en)
Inventor
何耶肖
李欢
章小宁
吴昊
范晨昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110556962.5A priority Critical patent/CN113191504B/en
Publication of CN113191504A publication Critical patent/CN113191504A/en
Application granted granted Critical
Publication of CN113191504B publication Critical patent/CN113191504B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a federated learning training acceleration method for heterogeneous computing resources. Whether the difference in iteration count between the fastest device and the slowest device has reached a threshold is judged; if so, the fastest device does not need to wait. It updates its local model parameters directly with the gradient update values and performs an additional round of gradient computation, then downloads the latest global model parameters and obtains a copy of them. The copy is updated with the additional gradient update parameters, and its loss function value is compared with that of the latest global model parameters; if the loss function value of the updated copy is smaller, the latest global model parameters are replaced by the updated copy. The invention adapts the traditional SSP synchronous parallel mechanism accordingly and raises the utilization of computing resources, thereby improving training efficiency and shortening the overall training time.

Description

Federated learning training acceleration method for computing resource heterogeneity
Technical Field
The invention relates to the technical field of federated learning, and in particular to a federated learning training acceleration method for heterogeneous computing resources.
Background
In recent years, with the rapid development of machine learning, many artificial intelligence applications that are difficult to implement with conventional technologies have emerged in various fields of human life, such as data mining, image recognition, natural language processing, biometric recognition, search engines, medical diagnosis, credit card fraud detection, stock market analysis, voice and handwriting recognition, strategy games, and robotics. Machine learning is a data analysis method for automated analytical model construction that allows computers to learn autonomously without explicit programming. As a data-driven technique, machine learning requires a large amount of data for training to arrive at a high-performance model. Today, with the spread of mobile phones, tablets, and various wearable devices, billions of edge devices generate vast amounts of user data; according to a report by International Data Corporation (IDC), the data generated by edge devices will grow to 79.4 ZB by 2025. This is a valuable data resource for machine learning, and the related techniques and applications of machine learning can be expected to advance further if the data stored on edge devices can be utilized. However, to train a machine learning model, the traditional approach is to upload all raw data sets collected by the edge devices to a remote data center for centralized training. This approach requires a large amount of communication resources to transmit a large amount of data, resulting in unacceptably high costs. In addition, as people's privacy awareness increases, many users are reluctant to upload their data to a data center, and transmitting user data over a communication network raises problems of privacy disclosure and data security. Moreover, the centralized training mode cannot be applied to fields such as finance that are highly sensitive to data security.
In order to protect data privacy and reduce communication resource overhead, federated learning, a distributed training system, has been proposed in recent years to replace the centralized training system. Today, many banks, securities companies, medical equipment manufacturers, and technology companies are actively developing federated learning, whose safety and utility have been widely verified; the basic framework of federated learning is shown in Fig. 1.
A set of edge devices (also referred to as clients), such as smartphones, laptops, and tablets, participate in the distributed training process of the model using their locally stored data. Each edge device maintains a copy of the model as a local model. The server connecting all edge devices maintains a global model and aggregates the gradient updates from the various edge devices to update the global model. In each training iteration, each edge device uses its local data to compute gradient updates of the model parameters, uploads the gradient updates to the server, and then downloads the new global model parameters from the server as its new local model parameters. In this process, each edge device shares only the intermediate computation result (namely the gradient updates of the model parameters) with the server, without uploading the raw data stored locally, so the privacy and security of the data are protected. In addition, the transmitted gradients and model may be protected with various encryption methods to further enhance security.
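For illustration, this round trip can be condensed into a minimal sketch, assuming a single client, an in-memory server object, and a linear model with squared loss; the names Server and local_gradient are illustrative and not part of any existing framework.

```python
# Minimal sketch of one federated learning iteration for a single client.
# Assumptions: linear model, squared loss, in-memory "server"; illustrative only.
import numpy as np

class Server:
    """Holds the global model and applies gradient updates uploaded by clients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)     # global model parameters
        self.lr = lr               # training step length (eta)

    def apply_gradient(self, grad):
        self.w -= self.lr * grad   # update the global model with an uploaded gradient

    def download(self):
        return self.w.copy()       # latest global model parameters

def local_gradient(w, X, y):
    """Gradient of the squared loss 0.5*||Xw - y||^2 averaged over the local data."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)   # locally stored (private) data
server = Server(dim=5)

w_local = server.download()            # 1. download the global model parameters
grad = local_gradient(w_local, X, y)   # 2. compute the gradient update on local data
server.apply_gradient(grad)            # 3. upload only the gradient (raw data never leaves)
w_local = server.download()            # 4. download the new global parameters
```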
In an edge environment, devices have heterogeneous computing resources due to differences in their processor architectures, power consumption limits, operating systems, and so on. Because of this heterogeneity, the gradient computation times of different devices differ greatly. In order to keep the model parameters of the various devices consistent, federated learning requires a synchronous parallel mechanism. Under the traditional Bulk Synchronous Parallel (BSP) mechanism, a device with more computing resources that finishes its gradient update quickly must wait in every iteration for the devices with fewer computing resources so that synchronization is achieved. The faster devices waste a large amount of computing resources in this waiting process, while the slowest device largely determines the duration of the iteration; this is known as the straggler problem. The straggler problem reduces the training efficiency of federated learning and slows the convergence of the learning model. To improve training efficiency, the Stale Synchronous Parallel (SSP) mechanism was proposed. The strategy of SSP is to allow each device to proceed directly to the next iteration after completing one, without waiting, but to limit the difference in iteration count between the fastest device and the slowest device to within a threshold. Once this threshold is reached, the fastest device must wait until the slowest device catches up. Although SSP improves training efficiency to some extent, a large amount of computing resources is still inevitably wasted in the waiting process. Current machine learning algorithms need substantial computing resource support, and on resource-limited edge devices, inefficient training leads to long training times and makes it difficult to deploy algorithm applications.
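The SSP staleness bound amounts to a simple check; the sketch below assumes a hypothetical iteration_counts mapping from device identifier to completed iterations.

```python
# Minimal sketch of the SSP staleness check; `iteration_counts` is an assumed structure.
def fastest_must_wait(iteration_counts, threshold):
    """Under SSP, the fastest device must wait once its lead over the slowest
    device reaches the staleness threshold."""
    fastest = max(iteration_counts.values())
    slowest = min(iteration_counts.values())
    return fastest - slowest >= threshold

# Example: with a threshold of 3, a device that is 3 iterations ahead must wait.
print(fastest_must_wait({"dev_a": 10, "dev_b": 7, "dev_c": 8}, threshold=3))  # True
```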
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a federated learning training acceleration method for computing resource heterogeneity.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
a federated learning training acceleration method for computing resource isomerism comprises the following steps:
S1, initializing local iteration times, continuous extra gradient calculation times, a continuous extra gradient calculation time threshold value and total iteration times, and downloading initial global model parameters from a server;
S2, adding 1 to the local iteration times, judging whether the local iteration times meet the total iteration times, if so, ending the training, otherwise, entering the step S3;
S3, storing the latest global model parameters downloaded from the server as local model parameters, performing gradient update by using a BP algorithm in combination with a local data set to obtain gradient update parameters, and uploading the gradient update parameters to the server;
S4, judging whether the continuous extra gradient calculation times meet the continuous extra gradient calculation time threshold, if yes, entering a step S10, otherwise, entering a step S5;
S5, updating the local model parameters by using the gradient update parameters obtained in the step S3 to obtain updated local model parameters, and performing an additional gradient update by using a BP algorithm in combination with the local data set to obtain additional gradient update parameters;
S6, receiving a signal which is issued by the server and allows the latest global model parameter to be downloaded, and judging whether the extra gradient calculation is completed, if so, entering a step S7, otherwise, entering a step S9;
S7, downloading the latest global model parameters from the server and copying to obtain a global model parameter copy, updating the global model parameter copy by using the extra gradient update parameters obtained in the step S5, and adding 1 to the calculation times of the continuous extra gradients;
S8, re-determining the latest global model parameters according to the latest global model parameters and the loss function values corresponding to the updated global model parameter copies, and returning to the step S2;
S9, immediately stopping extra gradient calculation, downloading the latest global model parameters from the server, initializing the calculation times of continuous extra gradient, and returning to the step S2;
S10, initializing continuous additional gradient calculation times, receiving a signal which is sent by the server and allows the latest global model parameter to be downloaded, downloading the latest global model parameter, and returning to the step S2.
The beneficial effects of this scheme are as follows:
After the threshold of the synchronous parallel mechanism SSP is reached, the device that would otherwise wait is allowed to perform an additional round of gradient computation. After the global model parameters are downloaded the next time, they are updated with the gradients obtained from the additional round of computation, on the premise that this reduces the model's loss function. The computing resources are thus further utilized and the waiting time of devices that finish gradient computation quickly is reduced, which raises the utilization of computing resources, improves training efficiency, and shortens the overall training time.
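For illustration only, the client-side steps S1 to S10 can be condensed into the following single-process sketch. It assumes one client, an in-memory server, a linear model with squared loss, and that the download signal is always granted and the additional gradient always completes in time (so steps S6 and S9 are not exercised); every name in it is illustrative rather than taken from the invention.

```python
# Single-process sketch of steps S1-S10 under simplifying assumptions (one client,
# in-memory server, linear model, squared loss). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)   # local data set
eta, T, c = 0.05, 50, 2                                # step length, total iterations, threshold c

def loss(w):                       # objective function Q averaged over the local data
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):                       # gradient of Q (stand-in for the BP computation)
    return X.T @ (X @ w - y) / len(y)

class Server:
    def __init__(self, dim): self.w = np.zeros(dim)
    def upload(self, g): self.w -= eta * g             # server-side update with an uploaded gradient
    def download(self): return self.w.copy()

server = Server(dim=5)
t_p, alpha_p = 0, 0                                    # S1: iteration and extra-gradient counters
w_p = server.download()                                # S1: initial global model parameters

while True:
    t_p += 1                                           # S2
    if t_p > T:
        break
    g_p = grad(w_p)                                    # S3: gradient update on local data
    server.upload(g_p)                                 # S3: upload the gradient update

    if alpha_p >= c:                                   # S4 -> S10: too many extra rounds in a row
        alpha_p = 0
        w_p = server.download()
        continue

    w_tilde = w_p - eta * g_p                          # S5: update the local model
    g_tilde = grad(w_tilde)                            # S5: additional gradient update

    w_s = server.download()                            # S7: latest global parameters
    w_star = w_s - eta * g_tilde                       # S7: update the copy with the extra gradient
    alpha_p += 1

    if loss(w_star) < loss(w_s):                       # S8: keep whichever copy has the smaller loss
        server.w = w_star                              # the updated copy replaces the global parameters
        w_p = w_star
    else:
        w_p = w_s

print("final loss:", loss(server.download()))
```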
Further, the gradient update parameter is computed as:

$g_p^{t_p} = \nabla Q(w_p^{t_p}, z)$

where $g_p^{t_p}$ is the gradient update parameter, $\nabla Q$ denotes the derivative of the objective function with respect to the parameters, $w_p^{t_p}$ are the local model parameters, $z$ is a data sample of the local data set, and $Q$ is the objective function.
The beneficial effects of the further scheme are as follows:
and calculating to obtain gradient updating parameters, and updating global model parameters by combining the gradient updating parameters.
Further, the additional gradient update parameter is computed as:

$\tilde{g}_p^{t_p} = \nabla Q(\tilde{w}_p^{t_p}, z)$

where $\tilde{g}_p^{t_p}$ is the additional gradient update parameter, $\nabla Q$ denotes the derivative of the objective function with respect to the parameters, $\tilde{w}_p^{t_p}$ are the updated local model parameters, $z$ is a data sample of the local data set, and $Q$ is the objective function.
The beneficial effects of the further scheme are as follows:
and performing an additional round of gradient updating to obtain additional gradient updating parameters, and updating the global model parameter copy.
Further, the update formula for updating the local model parameters with the gradient update parameters in step S5 is:

$\tilde{w}_p^{t_p} = w_p^{t_p} - \eta\, g_p^{t_p}$

where $g_p^{t_p}$ is the gradient update parameter, $\eta$ is the training step length, $w_p^{t_p}$ are the local model parameters, and $\tilde{w}_p^{t_p}$ are the updated local model parameters.
The beneficial effects of the further scheme are as follows:
and the local model is updated, an additional round of gradient calculation is performed, and the utilization rate of calculation resources is improved.
Further, the update formula for updating the global model parameter copy with the additional gradient update parameters in step S7 is:

$w_s^* = w_s' - \eta\, \tilde{g}_p^{t_p}$

where $\tilde{g}_p^{t_p}$ is the additional gradient update parameter, $w_s'$ is the global model parameter copy, and $w_s^*$ is the updated global model parameter copy.
The beneficial effects of the further scheme are as follows:
preparation is made for re-determining the latest global model parameters for the loss function values corresponding to the latest global model parameters and the updated copy of the global model parameters in step S8.
Further, step S8 specifically comprises:
comparing the loss function values $loss(w_s)$ and $loss(w_s^*)$ respectively corresponding to the latest global model parameters $w_s$ and the updated global model parameter copy $w_s^*$; if $loss(w_s)$ is smaller than $loss(w_s^*)$, the updated copy $w_s^*$ is discarded; otherwise, the updated copy $w_s^*$ is taken as the latest global model parameters $w_s$; the method then returns to step S2.
The beneficial effects of the further scheme are as follows:
and updating the gradient obtained by the extra gradient calculation, updating the copy after obtaining the global model parameter by next downloading, replacing the original global model parameter with the copy if the updated copy has a smaller loss function value than the original global model parameter, and otherwise discarding the copy to enable the training model to be converged more quickly, thereby improving the training efficiency and shortening the training time.
Further, the receiving of the signal issued by the server that allows the latest global model parameters to be downloaded comprises the following steps:
s61, initializing global model parameters, an iteration number difference threshold value between the fastest device and the slowest device and a target loss function value;
s62, updating the initial global model parameters by using the gradient updating parameters uploaded to the server in the step S3 to obtain updated global model parameters;
s63, judging whether the loss function value corresponding to the updated global model parameter is smaller than the target loss function value, if so, stopping training, otherwise, entering the step S64;
s64, judging whether the iteration number difference between the fastest device and the slowest device meets an iteration number difference threshold value, if so, entering a step S65, otherwise, entering a step S66;
s65, sending a signal for allowing the latest global model parameter to be downloaded to other equipment except the fastest equipment, and returning to the step S62;
s66, a signal for allowing the latest global model parameter to be downloaded is sent to each device, and the process returns to step S62.
The beneficial effects of the further scheme are as follows:
After the SSP threshold is reached, the fastest device need not wait: it updates its local model directly with the locally computed gradient and then performs an additional round of gradient computation using the local model and data. This reduces the device's waiting time and improves the utilization of computing resources.
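The server-side steps S61 to S66 can be sketched as below, under the assumption of a hypothetical broadcast callback that delivers the download-allowed signal to a given set of devices; the class and method names are illustrative.

```python
# Sketch of the server loop in steps S61-S66; `broadcast` is an assumed callback.
import numpy as np

class FLServer:
    def __init__(self, dim, eta, m, target_loss, loss_fn, broadcast):
        self.w = np.zeros(dim)        # S61: initial global model parameters
        self.eta = eta                # training step length
        self.m = m                    # S61: iteration-gap threshold between fastest and slowest
        self.target_loss = target_loss
        self.loss_fn = loss_fn
        self.broadcast = broadcast    # broadcast(device_ids): send the download-allowed signal
        self.iters = {}               # completed iterations per device

    def on_gradient(self, device_id, grad):
        self.iters[device_id] = self.iters.get(device_id, 0) + 1
        self.w -= self.eta * grad                          # S62: update the global parameters
        if self.loss_fn(self.w) < self.target_loss:        # S63: target loss reached, stop training
            return "stop"
        fastest_id = max(self.iters, key=self.iters.get)
        gap = self.iters[fastest_id] - min(self.iters.values())
        if gap >= self.m:                                  # S64 -> S65: signal all but the fastest device
            self.broadcast([d for d in self.iters if d != fastest_id])
        else:                                              # S66: signal every device
            self.broadcast(list(self.iters))
        return "continue"

# Example wiring with a trivial loss and a print-based broadcast:
server = FLServer(dim=3, eta=0.1, m=3, target_loss=1e-6,
                  loss_fn=lambda w: float(np.sum(w ** 2)),
                  broadcast=lambda ids: print("download allowed for:", ids))
server.on_gradient("dev_a", np.array([0.2, -0.1, 0.0]))
```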
Further, the update formula for updating the initial global model parameters by using the gradient update parameters in step S62 is as follows:
Figure BDA0003077532780000071
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003077532780000072
the parameters are updated for the purpose of the gradient,
Figure BDA0003077532780000073
is an initial global model parameter, eta is a training step length, wsIs the updated global model parameters.
The beneficial effects of the further scheme are as follows:
The global model parameters are updated, promoting the convergence of the global model and advancing the overall training progress.
Drawings
FIG. 1 is a prior art federated learning basic framework;
FIG. 2 is a schematic flow chart of a federated learning training acceleration method for computing resource heterogeneity provided in the present invention;
fig. 3 is a flowchart illustrating the substep of step S6.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are apparent as long as they remain within the spirit and scope of the invention as defined in the appended claims, and all matter produced using the inventive concept is protected.
As shown in fig. 2, an embodiment of the present invention provides a federated learning training acceleration method for computing resource heterogeneity, including the following steps S1 to S10:
S1, initializing the local iteration times $t_p$, the continuous additional gradient calculation times $\alpha_p$, the continuous additional gradient calculation time threshold $c$, and the total iteration times $T$, and downloading the initial global model parameters $w_s^0$ from the server;

In this embodiment, all edge devices are connected to a single server through a physical channel, and acquire the initial global model parameters $w_s^0$ from the server through the physical channel.
S2, adding 1 to the local iteration times $t_p$ and entering the iteration process, and judging whether the local iteration times $t_p$ meet the total iteration times $T$; if so, training is finished, otherwise go to step S3;
S3, storing the latest global model parameters downloaded from the server as the local model parameters $w_p^{t_p}$, calculating the gradient update parameters with the BP algorithm in combination with the local data set, and uploading the gradient update parameters to the server;
The gradient update parameters are calculated through the BP algorithm according to the following formula:

$g_p^{t_p} = \nabla Q(w_p^{t_p}, z)$

where $g_p^{t_p}$ is the gradient update parameter, $\nabla Q$ denotes the derivative of the objective function with respect to the parameters, $w_p^{t_p}$ are the local model parameters, $z$ is a data sample of the local data set, and $Q$ is the objective function.
In this embodiment, the local data set is a data set generated at the edge device.
In this embodiment, the work flow of the BP algorithm includes the following steps:
First, an input example is provided to the input-layer neurons, and the signal is propagated forward layer by layer until the output layer produces a result; the error is then propagated backward to the hidden-layer neurons; the connection weights and thresholds are adjusted according to the errors of the hidden-layer neurons; finally, this process is iterated until a preset stopping condition is reached.
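As a concrete illustration of this workflow (not taken from the patent), the following sketch trains a one-hidden-layer network with sigmoid activation and squared loss by backpropagation; the network shape and the random data are arbitrary.

```python
# Backpropagation sketch for a one-hidden-layer network with sigmoid activation
# and squared loss; shapes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))                 # input examples
y = rng.normal(size=(16, 1))                 # targets
W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
eta = 0.1

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

for step in range(100):                      # iterate until a stopping condition (here: fixed steps)
    # forward pass: propagate the signal layer by layer to the output
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2
    # backward pass: propagate the error back to the hidden layer
    d_out = (out - y) / len(y)               # gradient of 0.5 * mean squared error w.r.t. the output
    d_h = (d_out @ W2.T) * h * (1 - h)       # error at the hidden-layer neurons
    # adjust connection weights and thresholds (biases) from the errors
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_h;    b1 -= eta * d_h.sum(axis=0)

h = sigmoid(X @ W1 + b1)                     # final forward pass for reporting
print("final squared loss:", float(0.5 * np.mean((h @ W2 + b2 - y) ** 2)))
```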
S4, judging whether the continuous additional gradient calculation times $\alpha_p$ meet the continuous additional gradient calculation time threshold $c$; if so, go to step S10, otherwise go to step S5;
in this embodiment, the threshold c of the continuous extra-gradient computation time is a hyper-parameter, which is set manually and generally ranges from 1 to 10.
S5, updating the local model parameters $w_p^{t_p}$ with the gradient update parameters $g_p^{t_p}$ obtained in step S3 to obtain the updated local model parameters $\tilde{w}_p^{t_p}$; the update formula is:

$\tilde{w}_p^{t_p} = w_p^{t_p} - \eta\, g_p^{t_p}$

where $\eta$ is the training step length;

then calculating an additional round of gradient update with the BP algorithm in combination with the local data set to obtain the additional gradient update parameters $\tilde{g}_p^{t_p}$, computed as:

$\tilde{g}_p^{t_p} = \nabla Q(\tilde{w}_p^{t_p}, z)$

where $\tilde{g}_p^{t_p}$ is the additional gradient update parameter, $\nabla Q$ denotes the derivative of the objective function with respect to the parameters, $\tilde{w}_p^{t_p}$ are the updated local model parameters, $z$ is a data sample of the local data set, and $Q$ is the objective function.
S6, receiving a signal which is sent by the server and allows the latest global model parameter to be downloaded, judging whether the extra gradient calculation is completed, if so, entering the step S7, otherwise, entering the step S9;
in this embodiment, the signal for allowing the latest global model parameter to be downloaded and sent by the server is received through the physical channel.
As shown in fig. 3, the receiving of the signal for allowing the latest global model parameter to be downloaded from the server includes the following steps S61 to S66:
S61, initializing the global model parameters $w_s$, the threshold $m$ on the iteration number difference between the fastest device and the slowest device, and the target loss function value $loss_{inf}$;
In this embodiment, the threshold m of the difference between the iteration times of the fastest device and the slowest device is a hyper-parameter, which is set manually, and the range is generally 1 to 10.
S62, updating the initial global model parameters $w_s^0$ with the gradient update parameters $g_p^{t_p}$ uploaded to the server in step S3 to obtain the updated global model parameters $w_s$; the update formula is:

$w_s = w_s^0 - \eta\, g_p^{t_p}$

where $\eta$ is the training step length;
S63, judging whether the loss function value $loss(w_s)$ corresponding to the updated global model parameters $w_s$ is smaller than the target loss function value $loss_{inf}$; if so, training is stopped, otherwise go to step S64;
s64, judging whether the iteration number difference between the fastest device and the slowest device meets a preset iteration number difference threshold value m, if so, entering a step S65, otherwise, entering a step S66;
s65, sending a signal for allowing the latest global model parameter to be downloaded to other equipment except the fastest equipment, and returning to the step S62;
s66, a signal for allowing the latest global model parameter to be downloaded is sent to each device, and the process returns to step S62.
In this embodiment, the server is connected to all edge devices participating in federal learning through a physical channel.
S7, downloading the latest global model parameters $w_s$ from the server and copying them to obtain the global model parameter copy $w_s'$, updating the copy with the additional gradient update parameters $\tilde{g}_p^{t_p}$ obtained in step S5 to obtain $w_s^*$, and adding 1 to the continuous additional gradient calculation times $\alpha_p$; the update formula for updating the copy $w_s'$ with the additional gradient update parameters is:

$w_s^* = w_s' - \eta\, \tilde{g}_p^{t_p}$

where $\eta$ is the training step length;
S8, re-determining the latest global model parameters $w_s$ according to the loss function values corresponding to the latest global model parameters $w_s$ and the updated global model parameter copy $w_s^*$, and returning to step S2;

In this embodiment, the loss function values $loss(w_s)$ and $loss(w_s^*)$ respectively corresponding to the latest global model parameters $w_s$ and the updated copy $w_s^*$ are compared; if $loss(w_s)$ is smaller than $loss(w_s^*)$, the updated copy $w_s^*$ is discarded; otherwise, the updated copy $w_s^*$ is taken as the latest global model parameters $w_s$, and the method returns to step S2.
S9, immediately stopping the additional gradient calculation, downloading the latest global model parameters $w_s$ from the server, initializing the continuous additional gradient calculation times $\alpha_p$, and returning to step S2;

S10, initializing the continuous additional gradient calculation times $\alpha_p$, receiving the signal issued by the server that allows the latest global model parameters $w_s$ to be downloaded, downloading them, and returning to step S2.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto and changes may be made without departing from the scope of the invention in its aspects.

Claims (8)

1. A federated learning training acceleration method for computing resource heterogeneity, characterized by comprising the following steps:
S1, initializing local iteration times, continuous extra gradient calculation times, a continuous extra gradient calculation time threshold value and total iteration times, and downloading initial global model parameters from a server;
S2, adding 1 to the local iteration times, judging whether the local iteration times meet the total iteration times, if so, ending the training, otherwise, entering the step S3;
S3, storing the latest global model parameter downloaded from the server as a local model parameter, performing gradient update by using a BP algorithm in combination with a local data set to obtain a gradient update parameter, and uploading the gradient update parameter to the server;
S4, judging whether the continuous extra gradient calculation times meet the continuous extra gradient calculation time threshold, if yes, entering a step S10, otherwise, entering a step S5;
S5, updating the local model parameters by using the gradient update parameters obtained in the step S3 to obtain updated local model parameters, and performing an additional gradient update by using a BP algorithm in combination with the local data set to obtain additional gradient update parameters;
S6, receiving a signal which is issued by the server and allows the latest global model parameter to be downloaded, and judging whether the extra gradient calculation is completed, if so, entering a step S7, otherwise, entering a step S9;
S7, downloading the latest global model parameters from the server and copying to obtain a global model parameter copy, updating the global model parameter copy by using the extra gradient update parameters obtained in the step S5, and adding 1 to the continuous extra gradient calculation times;
S8, re-determining the latest global model parameters according to the latest global model parameters and the loss function values corresponding to the updated global model parameter copies, and returning to the step S2;
S9, stopping the extra gradient calculation immediately, downloading the latest global model parameters from the server, initializing the calculation times of the continuous extra gradient, and returning to the step S2;
S10, initializing continuous additional gradient calculation times, receiving a signal which is issued by the server and allows the latest global model parameter to be downloaded, downloading the latest global model parameter, and returning to the step S2.
2. The method as claimed in claim 1, wherein the gradient update parameter is calculated according to the following formula:

$g_p^{t_p} = \nabla Q(w_p^{t_p}, z)$

where $g_p^{t_p}$ is the gradient update parameter, $\nabla Q$ denotes the derivative of the objective function with respect to the parameters, $w_p^{t_p}$ are the local model parameters, $z$ is a data sample of the local data set, and $Q$ is the objective function.
3. The method as claimed in claim 1, wherein the additional gradient update parameter is calculated according to the following formula:

$\tilde{g}_p^{t_p} = \nabla Q(\tilde{w}_p^{t_p}, z)$

where $\tilde{g}_p^{t_p}$ is the additional gradient update parameter, $\nabla Q$ denotes the derivative of the objective function with respect to the parameters, $\tilde{w}_p^{t_p}$ are the updated local model parameters, $z$ is a data sample of the local data set, and $Q$ is the objective function.
4. The method as claimed in claim 1, wherein the update formula for updating the local model parameters by using the gradient update parameters in step S5 is:

$\tilde{w}_p^{t_p} = w_p^{t_p} - \eta\, g_p^{t_p}$

where $g_p^{t_p}$ is the gradient update parameter, $\eta$ is the training step length, $w_p^{t_p}$ are the local model parameters, and $\tilde{w}_p^{t_p}$ are the updated local model parameters.
5. The method as claimed in claim 4, wherein the update formula for updating the global model parameter copy with the additional gradient update parameters in step S7 is:

$w_s^* = w_s' - \eta\, \tilde{g}_p^{t_p}$

where $\tilde{g}_p^{t_p}$ is the additional gradient update parameter, $w_s'$ is the global model parameter copy, and $w_s^*$ is the updated global model parameter copy.
6. The method as claimed in claim 2, wherein step S8 specifically comprises:
comparing the loss function values $loss(w_s)$ and $loss(w_s^*)$ respectively corresponding to the latest global model parameters $w_s$ and the updated global model parameter copy $w_s^*$; if $loss(w_s)$ is smaller than $loss(w_s^*)$, the updated copy $w_s^*$ is discarded; otherwise, the updated copy $w_s^*$ is taken as the latest global model parameters $w_s$, and the method returns to step S2.
7. The federated learning training acceleration method oriented to computing resource heterogeneity according to claim 1, wherein the receiving of the signal transmitted by the server for allowing the latest global model parameters to be downloaded comprises the following steps:
S61, initializing global model parameters, an iteration number difference threshold value between the fastest device and the slowest device and a target loss function value;
s62, updating the initial global model parameters by using the gradient update parameters uploaded to the server in the step S3 to obtain updated global model parameters;
s63, judging whether the loss function value corresponding to the updated global model parameter is smaller than the target loss function value, if yes, stopping training, otherwise, entering the step S64;
s64, judging whether the iteration number difference between the fastest device and the slowest device meets the iteration number difference threshold value, if yes, entering a step S65, otherwise, entering a step S66;
s65, sending a signal for allowing the latest global model parameter to be downloaded to other equipment except the fastest equipment, and returning to the step S62;
s66, a signal for allowing the latest global model parameter to be downloaded is sent to each device, and the process returns to step S62.
8. The method of claim 7, wherein the update formula for updating the initial global model parameters by using the gradient update parameters in step S62 is:

$w_s = w_s^0 - \eta\, g_p^{t_p}$

where $g_p^{t_p}$ is the gradient update parameter, $w_s^0$ are the initial global model parameters, $\eta$ is the training step length, and $w_s$ are the updated global model parameters.
CN202110556962.5A 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource heterogeneity Expired - Fee Related CN113191504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556962.5A CN113191504B (en) 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110556962.5A CN113191504B (en) 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism

Publications (2)

Publication Number Publication Date
CN113191504A CN113191504A (en) 2021-07-30
CN113191504B true CN113191504B (en) 2022-06-28

Family

ID=76984715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556962.5A Expired - Fee Related CN113191504B (en) 2021-05-21 2021-05-21 Federated learning training acceleration method for computing resource isomerism

Country Status (1)

Country Link
CN (1) CN113191504B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113778966B (en) * 2021-09-15 2024-03-26 深圳技术大学 Cross-school information sharing method and related device for university teaching and course score
CN113902128B (en) * 2021-10-12 2022-09-16 中国人民解放军国防科技大学 Asynchronous federal learning method, device and medium for improving utilization efficiency of edge device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611610A (en) * 2020-04-12 2020-09-01 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal
CN112181971A (en) * 2020-10-27 2021-01-05 华侨大学 Edge-based federated learning model cleaning and equipment clustering method, system, equipment and readable storage medium
CN112288100A (en) * 2020-12-29 2021-01-29 支付宝(杭州)信息技术有限公司 Method, system and device for updating model parameters based on federal learning
CN112817653A (en) * 2021-01-22 2021-05-18 西安交通大学 Cloud-side-based federated learning calculation unloading computing system and method
CN112818394A (en) * 2021-01-29 2021-05-18 西安交通大学 Self-adaptive asynchronous federal learning method with local privacy protection

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10546237B2 (en) * 2017-03-30 2020-01-28 Atomwise Inc. Systems and methods for correcting error in a first classifier by evaluating classifier output in parallel
US11836583B2 (en) * 2019-09-09 2023-12-05 Huawei Cloud Computing Technologies Co., Ltd. Method, apparatus and system for secure vertical federated learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611610A (en) * 2020-04-12 2020-09-01 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal
CN112181971A (en) * 2020-10-27 2021-01-05 华侨大学 Edge-based federated learning model cleaning and equipment clustering method, system, equipment and readable storage medium
CN112288100A (en) * 2020-12-29 2021-01-29 支付宝(杭州)信息技术有限公司 Method, system and device for updating model parameters based on federal learning
CN112817653A (en) * 2021-01-22 2021-05-18 西安交通大学 Cloud-side-based federated learning calculation unloading computing system and method
CN112818394A (en) * 2021-01-29 2021-05-18 西安交通大学 Self-adaptive asynchronous federal learning method with local privacy protection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ClusterGrad: Adaptive Gradient Compression by Clustering in Federated Learning; Laizhong Cui; GLOBECOM 2020 - 2020 IEEE Global Communications Conference; 2021-01-25; full text *
Research on a fused federated learning mechanism for heterogeneous edge nodes (面向异构边缘节点的融合联邦学习机制研究); 廖钰盈; China Excellent Master's Theses Electronic Journal (中国优秀硕士电子期刊); 2021-05-15; full text *

Also Published As

Publication number Publication date
CN113191504A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN111784002B (en) Distributed data processing method, device, computer equipment and storage medium
CN113191504B (en) Federated learning training acceleration method for computing resource heterogeneity
CN110889509B (en) Gradient momentum acceleration-based joint learning method and device
CN108416440A (en) A kind of training method of neural network, object identification method and device
EP3889846A1 (en) Deep learning model training method and system
CN113011568B (en) Model training method, data processing method and equipment
WO2023103864A1 (en) Node model updating method for resisting bias transfer in federated learning
CN112163601A (en) Image classification method, system, computer device and storage medium
CN111598213A (en) Network training method, data identification method, device, equipment and medium
WO2023231954A1 (en) Data denoising method and related device
WO2021169366A1 (en) Data enhancement method and apparatus
WO2023020613A1 (en) Model distillation method and related device
CN112446462B (en) Method and device for generating target neural network model
CN115526307A (en) Network model compression method and device, electronic equipment and storage medium
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN114358250A (en) Data processing method, data processing apparatus, computer device, medium, and program product
CN115907041A (en) Model training method and device
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN115795025A (en) Abstract generation method and related equipment thereof
CN114723069A (en) Parameter updating method and device and electronic equipment
CN110782017B (en) Method and device for adaptively adjusting learning rate
CN113033422A (en) Face detection method, system, equipment and storage medium based on edge calculation
CN112379688B (en) Multi-robot finite time synchronization control method based on membrane calculation
WO2023143128A1 (en) Data processing method and related device
CN116419251A (en) Cell load adjusting method and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220628

CF01 Termination of patent right due to non-payment of annual fee