CN115719093A - Distributed training method, device, system, storage medium and electronic equipment - Google Patents

Distributed training method, device, system, storage medium and electronic equipment

Info

Publication number
CN115719093A
Authority
CN
China
Prior art keywords
gradient
current iteration
central
compression
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211468935.3A
Other languages
Chinese (zh)
Inventor
沈力
程亦飞
钱迅
陶大程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202211468935.3A priority Critical patent/CN115719093A/en
Publication of CN115719093A publication Critical patent/CN115719093A/en
Pending legal-status Critical Current



Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a distributed training method, a device, a system, a storage medium and electronic equipment, wherein the method comprises the steps of determining the random gradient of a machine learning model in the current iteration in the iterative training process of the machine learning model; compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration, and sending the compressed gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device; and receiving a central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient. The communication cost is reduced through gradient compression, and the problem of slow convergence caused by the gradient compression is avoided through gradient compensation in the iteration process.

Description

Distributed training method, device and system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of distributed machine learning technologies, and in particular, to a distributed training method, apparatus, system, storage medium, and electronic device.
Background
With the development of artificial intelligence technology and the explosive growth of data volume, the scale of machine learning is becoming larger and larger. In order to improve the training speed of large-scale machine learning models, distributed learning has been proposed and applied to machine learning training in multiple fields such as vision and speech. One of the more common distributed learning deployment environments is a centralized distributed architecture, which is composed of a plurality of computing nodes and a central server, wherein the central server is responsible for coordinating the computation results of the computing nodes.
In the process of implementing the invention, at least the following technical problem was found in the prior art: the number of model parameters in large-scale machine learning is usually large, so that the dimensionality of the random gradient is very high, the communication cost between a computing node and the central server in parallel stochastic gradient descent (P-SGD) is very high, and the model training efficiency is reduced.
Disclosure of Invention
The invention provides a distributed training method, a device, a system, a storage medium and electronic equipment, which are used for ensuring the training precision of a machine learning model on the basis of reducing the communication cost.
According to an aspect of the present invention, there is provided a distributed training method applied to a computing node device, the method including:
in the iterative training process of the machine learning model, determining the random gradient of the machine learning model in the current iteration;
compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration, and sending the compressed gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device;
and receiving a central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
According to another aspect of the present invention, there is provided a distributed training method applied to a central server node device, the method including:
in the iterative training process of the machine learning model, receiving the compression gradient of the machine learning model in the current iteration, which is sent by each computing node;
determining the central gradient of the current iteration based on the compression gradient sent by each computing node and the error of the current iteration;
under the condition that the current iteration times do not meet preset conditions, compressing the central gradient of the current iteration, and feeding the compressed central gradient back to each computing node; and feeding back the central gradient of the current iteration to each computing node under the condition that the current iteration number meets a preset condition, wherein the computing nodes update the current iteration of the machine learning model on the basis of the random gradient, the compression gradient and the central gradient of the current iteration.
According to another aspect of the present invention, there is provided a distributed training apparatus integrated with a computing node device, the apparatus comprising:
the random gradient determining module is used for determining the random gradient of the machine learning model in the current iteration in the iterative training process of the machine learning model;
the compression gradient determining module is used for compressing the random gradient of the current iteration to obtain a compression gradient of the current iteration and sending the compression gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compression gradient sent by each computing node device;
and the model updating module is used for receiving the central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
According to another aspect of the present invention, there is provided a distributed training apparatus integrated with a central server node device, the apparatus comprising:
the compression gradient receiving module is used for receiving the compression gradient of the machine learning model sent by each computing node in the current iteration in the iterative training process of the machine learning model;
the central gradient determining module is used for determining the central gradient of the current iteration based on the compression gradient sent by each computing node and the error of the current iteration;
and the central gradient sending module is used for compressing the central gradient of the current iteration under the condition that the current iteration number does not meet a preset condition, feeding the compressed central gradient back to each computing node, and feeding the central gradient of the current iteration back to each computing node under the condition that the current iteration number meets the preset condition, wherein the computing nodes update the current iteration of the machine learning model on the basis of the random gradient of the current iteration, the compressed gradient and the central gradient.
According to another aspect of the invention, there is provided a distributed training system comprising a central server node and a plurality of computing nodes, wherein,
the computing node determines the random gradient of the machine learning model in the current iteration in the iterative training process of the machine learning model, compresses the random gradient of the current iteration to obtain the compressed gradient of the current iteration, and sends the compressed gradient to the central server node;
the central server node determines the central gradient of the current iteration based on the compression gradient sent by each computing node device, and sends the central gradient to each computing node after compressing the central gradient under the condition that the current iteration times do not meet the preset condition;
and the computing node receives a central gradient fed back by the central server node, determines a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensates the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updates a machine learning model based on the target gradient in the current iteration.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the distributed training method of any of the embodiments of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the distributed training method according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme provided by this embodiment, in the transmission between each computing node and the central server node, the random gradient of each iteration is compressed and the resulting compression gradient is transmitted, so that the communication cost between the computing nodes and the central server node is reduced. Furthermore, the central gradient fed back by the central server node may itself be a compressed gradient, so that the gradient is compressed in both directions of the transmission between the computing nodes and the central server node, further reducing the communication cost. Meanwhile, a compensation gradient is determined from the error caused by compression, the central gradient of the current iteration fed back by the central server node is compensated by the compensation gradient, and the model parameters of the current iteration are updated based on the compensated target gradient. By performing gradient compensation within the current iteration, the problem of slow convergence caused by gradient compression is avoided while the communication cost is reduced, and the convergence speed of the distributed training process is improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a distributed training method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a distributed training method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a distributed training method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a distributed training method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a distributed training apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a distributed training apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a distributed training system according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The training of the machine learning model in a centralized distributed environment can be realized with parallel stochastic gradient descent (P-SGD), which requires the following steps: 1. Each computing node calculates a random gradient based on the local model and data and sends the random gradient to the central server; 2. the central server averages the collected random gradients and returns the average to the computing nodes; 3. the computing nodes update their local models with the averaged gradient. In the distributed training system formed by multiple computing nodes and a central server node, the central server node may be configured in a central server device, and the computing nodes are configured in computing node devices, where different computing nodes may be configured in different computing node devices, or two or more computing nodes may be configured in the same computing node device, which is not limited here.
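To make the P-SGD workflow above concrete, the following is a minimal single-process sketch of one P-SGD round; the quadratic loss, node count, and learning rate are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

# Minimal simulation of one P-SGD round with N workers and a toy quadratic loss.
# The loss f_i(x, xi) = 0.5 * ||x - xi||^2 is only an illustrative stand-in.
N, dim, lr = 4, 8, 0.1
rng = np.random.default_rng(0)
x = np.zeros(dim)                      # model parameters, identical on every node

def stochastic_gradient(x, xi):
    """Gradient of 0.5*||x - xi||^2 at the sampled data point xi."""
    return x - xi

# Step 1: each computing node computes a stochastic gradient and "sends" it.
local_grads = [stochastic_gradient(x, rng.normal(size=dim)) for _ in range(N)]

# Step 2: the central server averages the collected gradients and returns the mean.
avg_grad = np.mean(local_grads, axis=0)

# Step 3: every computing node updates its model with the averaged gradient.
x = x - lr * avg_grad
print("updated parameters:", x)
```

In this uncompressed baseline the full-dimensional gradient is exchanged in both directions, which is exactly the communication cost the scheme below aims to reduce.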
In this embodiment, the compute node and the server node are trained together to obtain the machine learning model. The application scenario of distributed training is not limited here, that is, the type of the machine learning model and the function of the trained machine learning model are not limited. In some embodiments, the machine learning model may include, but is not limited to, a neural network model, a logistic regression model, etc., wherein the neural network model may be, but is not limited to, a convolutional neural network model CNN, a recurrent neural network model RNN, a long-short term memory network model LSTM, a residual network model ResNet50, etc. Distributed training of machine learning models can be applied to machine learning models such as an image classification model, an image segmentation model, an image feature extraction model, an image compression model, an image enhancement model, an image denoising model, an image label generation model, a text classification model, a text translation model, a text abstract extraction model, a text prediction model, a keyword conversion model, a text semantic analysis model, a speech recognition model, an audio denoising model, an audio synthesis model, an audio equalizer conversion model, a weather prediction model, a commodity recommendation model, an article recommendation network, an action recognition model, a face recognition model, a facial expression recognition model and the like. The application scenarios described above are merely exemplary, and the application scenarios of the neural model generation method are not limited in the present application.
In distributed training of the machine learning model, the computing nodes perform iterative training of the machine learning model locally. Correspondingly, each computing node is preconfigured with sample data and iteratively trains the machine learning model based on that sample data. The training mode of the machine learning model is not limited: for example, supervised or unsupervised training may be used to train the machine learning model and update its network parameters.
The sample data can be image data, and the prediction result of the machine learning model is an image processing result; or the sample data is text data, and the prediction result of the machine learning model is a text processing result; or the sample data is audio data, and the prediction result of the machine learning model is an audio processing result.
For example, if the sample data is image data, the machine learning model may be an image classification model, and the prediction result output by the machine learning model may be an image classification result; alternatively, the machine learning model may be an image segmentation model, and the prediction result may be an image segmentation result; alternatively, the machine learning model may be an image feature extraction model, and the prediction result may be an image feature extraction result; alternatively, the machine learning model may be an image compression model and the prediction result may be an image compression result; alternatively, the machine learning model may be an image enhancement model and the prediction result may be an image enhancement result; alternatively, the machine learning model may be an image denoising model, and the prediction result may be an image denoising result; alternatively, the machine learning model may be an image label generation model, the prediction result may be an image label, and so on. If the sample data is text data, the machine learning model can be a text classification model, and a prediction result output by the machine learning model can be a text classification result; alternatively, the machine learning model may be a text prediction model and the prediction result may be a text prediction result; alternatively, the machine learning model may be a text summarization extraction model, and the prediction result may be a text summarization extraction result; alternatively, the machine learning model may be a text translation model and the prediction result may be a text translation result; alternatively, the machine learning model may be a keyword conversion model, and the prediction result may be a keyword conversion result; alternatively, the machine learning model may be a text semantic analysis model, and the prediction result may be a text semantic analysis result, or the like. If the sample data is audio data, the machine learning model can be a speech recognition model, and the prediction result output by the machine learning model can be a speech recognition result; alternatively, the machine learning model may be an audio noise reduction model and the prediction result may be an audio noise reduction result; alternatively, the machine learning model may be an audio synthesis model and the prediction result may be an audio synthesis result; alternatively, the machine learning model may be an audio equalizer transition model, the prediction result may be an audio equalizer transition result, and so on.
Each computing node performs local training of the machine learning model for any of the above scenarios in any iteration of distributed training, and computes and sends a stochastic gradient to the central server node based on the local model and data. Data transmission between each computing node and the central server node incurs a large communication cost. To reduce the communication cost, the transmitted gradient can be compressed; at the same time, however, the compression slows the convergence of the training process of the machine learning model.
In view of the above technical problems, an embodiment of the present invention provides a distributed training method, and fig. 1 is a flowchart of the distributed training method provided in the embodiment of the present invention, where the embodiment is applicable to a case where a computing node trains a machine learning model, and the method may be executed by a distributed training apparatus, where the distributed training apparatus may be implemented in a form of hardware and/or software, and the distributed training apparatus may be configured in a computing node device, and the computing node device may be an electronic device such as a computer, a mobile phone, a PC terminal, and the like. As shown in fig. 1, the method includes:
and S110, in the iterative training process of the machine learning model, determining the random gradient of the machine learning model in the current iteration.
And S120, compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration, and sending the compressed gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device.
S130, receiving a center gradient fed back by the center server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the center gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
In this embodiment, when any computing node completes local training, the random gradient of the machine learning model in the current iteration is determined. The random gradient of the machine learning model in the current iteration may be determined based on a loss function of the machine learning model, and the loss function may be preset, which is not limited herein. Illustratively, the loss function may be denoted f_i(x, ξ_i), and the random gradient determined from the loss function is ∇f_i(x, ξ_i).
In particular, the derivatives of the model parameters of the machine learning model may be determined based on the loss function, and these may be derivatives of different orders corresponding to different network layers. The derivatives corresponding to each model parameter are then combined, for example in matrix or vector form, to obtain the random gradient of the machine learning model in the current iteration.
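As an illustration of assembling per-parameter derivatives into a single random-gradient vector, the sketch below uses a small linear model with a squared loss; the model, loss, and shapes are assumptions chosen only for the example.

```python
import numpy as np

# Tiny linear model y = W @ a + b with squared loss on one sampled example (a, y).
rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 5)), np.zeros(3)          # model parameters
a, y = rng.normal(size=5), rng.normal(size=3)        # sampled data xi = (a, y)

pred = W @ a + b
residual = pred - y                                  # d loss / d pred for 0.5*||pred - y||^2
grad_W = np.outer(residual, a)                       # derivative w.r.t. each entry of W
grad_b = residual                                    # derivative w.r.t. b

# Combine the per-parameter derivatives into one random-gradient vector.
random_gradient = np.concatenate([grad_W.ravel(), grad_b.ravel()])
print(random_gradient.shape)                         # (3*5 + 3,) = (18,)
```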
Each computing node needs to send the random gradient of the current iteration to the central server node, and in order to reduce the communication cost, the random gradient of the current iteration can be compressed to obtain a compressed gradient, and the compressed gradient is sent to the central server. Compared with the original random gradient, the compressed gradient is small in quantity and quick in transmission, and the communication cost between the computing node and the central server node is reduced.
Optionally, compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration includes: calling a compressor, and compressing the random gradient of the current iteration based on the compressor to obtain the compressed gradient of the current iteration. The computing node is configured with a compressor, and the type of the compressor is not limited herein. For example, the compressor may be a δ-compressor C_δ which, for any vector x, satisfies
‖C_δ(x) − x‖² ≤ (1 − δ)‖x‖²,
where δ is a compression-related parameter of the compressor and C_δ(x) denotes the compression of an arbitrary vector x.
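One common compressor satisfying a contraction bound of this form is top-k sparsification (with δ = k/d for a d-dimensional vector); the sketch below is an assumed example of such a C_δ, not the specific compressor of the patent.

```python
import numpy as np

def top_k_compress(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x and zero the rest.

    For a d-dimensional x this satisfies ||C(x) - x||^2 <= (1 - k/d) * ||x||^2,
    i.e. it is a delta-compressor with delta = k/d.
    """
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

x = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
cx = top_k_compress(x, k=2)
delta = 2 / x.size
assert np.sum((cx - x) ** 2) <= (1 - delta) * np.sum(x ** 2) + 1e-12
print(cx)   # only the entries -3.0 and 2.0 survive
```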
In this embodiment, the compression gradient is transmitted between the computing node and the central server node instead of the model parameters. Since the model parameters cannot be compressed, transmitting them would incur a high communication cost; by transmitting the compression gradient, the communication cost of distributed training can be reduced.
The central server node receives the compression gradients transmitted by the computing nodes and determines a central gradient of the current iteration based on the compression gradients. In some embodiments, the central gradient is the mean value of the compression gradients uploaded by the computing nodes; in other embodiments, the central gradient is the sum of that mean value and the current error compensation value. The error compensation value can be determined from the central gradient and the compressed central gradient of the last iteration, e.g. e_t = v_{t−1} − p_{t−1}, where e_t is the error compensation value in the current iteration, v_{t−1} is the central gradient in the last iteration, and p_{t−1} is the compressed central gradient in the last iteration. Accordingly, the central gradient of the current iteration may be
v_t = e_t + (1/N) · Σ_{i=1}^{N} C_δ(∇f_i(x_t^(i), ξ_t^(i))),
where N is the number of computing nodes.
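A minimal sketch of this server-side rule, i.e. v_t = e_t plus the mean of the received compression gradients, with the error carried over as e_{t+1} = v_t − p_t; the one-coordinate "compressor" and the shapes are assumptions used only for illustration.

```python
import numpy as np

def aggregate_with_error_feedback(compressed_grads, e_t, compress):
    """Server step: central gradient v_t = e_t + mean(compressed gradients).

    Returns the compressed central gradient p_t (what is sent back) and the
    error e_{t+1} = v_t - p_t kept for the next iteration.
    """
    v_t = e_t + np.mean(compressed_grads, axis=0)
    p_t = compress(v_t)
    e_next = v_t - p_t
    return p_t, e_next

# Illustrative usage with a keep-the-largest-entry compressor as a stand-in for C_delta.
compress = lambda v: np.where(np.abs(v) == np.abs(v).max(), v, 0.0)
grads = [np.array([1.0, -2.0, 0.5]), np.array([0.5, -1.0, 1.5])]
p_t, e_next = aggregate_with_error_feedback(grads, np.zeros(3), compress)
print(p_t, e_next)
```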
In some embodiments, the central server node transmits the central gradient to each computing node directly; in other embodiments, the central server node compresses the central gradient and transmits the compressed central gradient to each computing node, thereby realizing bidirectional gradient compression between the central server node and the computing nodes to further reduce the communication cost.
For each computing node, the machine learning model is updated in the current iteration based on the central gradient fed back by the central server node. The central gradient received by the computing node may be a compressed or an uncompressed central gradient. Because unidirectional or bidirectional gradient compression exists between the central server node and each computing node, the compression correspondingly causes the problem of slow convergence. In view of this problem, in this embodiment a compensation gradient is determined based on the compression error caused by the compression processing, the central gradient of the current iteration is compensated by the compensation gradient, and the machine learning model is updated in the current iteration based on the compensated gradient, so that the problem of slow convergence of distributed training is avoided. It should be noted that the central gradient of the current iteration is compensated by the compensation gradient formed within the current iteration; compared with accumulating the gradient elements removed by compression in an error variable on the computing node, fewer stale elements are retained in the error variable over the whole distributed training process, the delay that such stale error elements impose on the model update is reduced, and the convergence speed of the distributed training process is further improved.
Optionally, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, and compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, includes: determining the compensation gradient based on the difference between the random gradient and the compression gradient of the current iteration; and determining the target gradient of the current iteration based on the sum of the compensation gradient and the central gradient. In particular, the random gradient of the current iteration may be ∇f_i(x_t^(i), ξ_t^(i)), the compression gradient may be C_δ(∇f_i(x_t^(i), ξ_t^(i))), and accordingly the compensation gradient may be
ε_t^(i) = ∇f_i(x_t^(i), ξ_t^(i)) − C_δ(∇f_i(x_t^(i), ξ_t^(i))).
Further, the target gradient may be p_t + ε_t^(i), where p_t is the central gradient.
On the basis of the above embodiment, the parameter update of the machine learning model by the computing node based on the target gradient may be implemented with the following formula:
x_{t+1}^(i) = x_t^(i) − η · (p_t + ε_t^(i)),
where x_t^(i) denotes the model parameters before the update in the current iteration, x_{t+1}^(i) denotes the model parameters after the current iteration, and η is the learning rate.
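A sketch of this node-side update x_{t+1} = x_t − η(p_t + ε_t), where ε_t is the locally stored compression error; the variable names and toy values are assumptions for illustration only.

```python
import numpy as np

def node_update(x_t, p_t, compensation, lr):
    """Update rule x_{t+1} = x_t - lr * (p_t + compensation).

    p_t is the (possibly compressed) central gradient returned by the server and
    `compensation` is the locally stored error eps_t = grad - C_delta(grad).
    """
    target_gradient = p_t + compensation     # compensate the central gradient
    return x_t - lr * target_gradient

x_t = np.array([1.0, 1.0, 1.0])
grad = np.array([0.4, -0.2, 0.9])
compressed = np.array([0.0, 0.0, 0.9])       # e.g. top-1 compression of grad
compensation = grad - compressed             # eps_t stored on the node
p_t = np.array([0.1, -0.1, 0.8])             # central gradient from the server
print(node_update(x_t, p_t, compensation, lr=0.5))
```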
In a distributed training system formed by a plurality of computing nodes and a central server node, each computing node completes an iterative process based on the process, performs gradient interaction with the central server node in each iterative process to determine the target gradient of each iterative process so as to update model parameters of a machine learning model, and iteratively executes the process until a convergence state is reached, so that each computing node can obtain the trained machine learning model.
According to the technical scheme provided by this embodiment, in the transmission between each computing node and the central server node, the random gradient of each iteration is compressed and the resulting compression gradient is transmitted, so that the communication cost between the computing nodes and the central server node is reduced. Furthermore, the central gradient fed back by the central server node may itself be a compressed gradient, so that the gradient is compressed in both directions of the transmission between the computing nodes and the central server node, further reducing the communication cost. Meanwhile, a compensation gradient is determined from the error caused by compression, the compensation gradient is used to compensate the central gradient of the current iteration fed back by the central server node, and the model parameters of the current iteration are updated based on the compensated target gradient. By performing gradient compensation within the current iteration, the problem of slow convergence caused by gradient compression is avoided while the communication cost is reduced, and the convergence speed of the distributed training process is improved.
On the basis of the above embodiments, an embodiment of the present invention further provides a distributed training method. Referring to fig. 2, fig. 2 is a schematic flowchart of the distributed training method provided in this embodiment. Optionally, before compressing the random gradient of the current iteration, the current iteration number is determined, and the random gradient of the current iteration is compressed if the current iteration number does not meet a preset condition; if the current iteration number meets the preset condition, the random gradient of the current iteration is sent to the central server node as the compression gradient. Correspondingly, the method specifically comprises the following steps:
s210, in the iterative training process of the machine learning model, determining the random gradient of the machine learning model in the current iteration.
S220, under the condition that the current iteration number does not meet a preset condition, compressing the random gradient of the current iteration; and under the condition that the current iteration number meets a preset condition, sending the random gradient of the current iteration as a compression gradient to the central server node.
S230, receiving a center gradient fed back by the center server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the center gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
The training processes of the machine learning model differ across computing nodes, for example in the sample data or in the compressor; moreover, in each iteration, the bidirectional gradient compression between each computing node and the central server node causes the model parameters of the machine learning models trained on different computing nodes to diverge during iterative training. In this embodiment, in order to reduce the model parameter difference between different computing nodes, gradient compression is performed only in part of the iterations, so as to alleviate the model parameter difference between different computing nodes.
The determination condition of the number of iterations is set in advance, and when the determination condition of the number of iterations is satisfied, gradient compression is not performed, and when the determination condition of the number of iterations is not satisfied, gradient compression is performed. Optionally, the number of iterations of performing gradient compression is greater than the number of iterations of not performing gradient compression, so as to reduce the difference between machine learning models trained on different computing nodes on the basis of reducing communication cost.
In some embodiments, the preset condition includes a preset interval number condition, for example, the preset interval number may be 50 times, or 100 times, etc., that is, when an interval number between the current iteration number and the last iteration number without performing gradient compression satisfies the preset interval number, it is determined that the current iteration number satisfies the preset condition, no gradient compression is performed in the current iteration process, and when an interval number between the current iteration number and the last iteration number without performing gradient compression does not satisfy the preset interval number, it is determined that the current iteration number does not satisfy the preset condition, and gradient compression is performed in the current iteration process.
In some embodiments, the preset condition comprises a condition on the iteration number determined by a compression-related parameter of the compressor; that is, the preset condition may be a criterion expressed in terms of the iteration number t and the compression-related parameter δ of the compressor, where δ can be read from the compressor configured on the computing node. Judging the iteration number against a condition derived from the compression-related parameter of the compressor is favorable for meeting the convergence requirement.
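The patent gives the exact criterion only as a formula in t and δ; as an assumed example of such a criterion, the helper below treats the condition as a periodic check with period ⌈1/δ⌉ — a placeholder choice for illustration, not the formula claimed in the patent.

```python
import math

def meets_preset_condition(t: int, delta: float) -> bool:
    """Assumed example: treat the condition on the iteration number t as
    'every ceil(1/delta) iterations'. The real criterion in the patent is some
    function of t and the compressor parameter delta; this period is only a
    stand-in for illustration."""
    period = math.ceil(1.0 / delta)
    return t % period == 0

# With delta = 0.25 the condition holds every 4th iteration: t = 0, 4, 8, ...
print([t for t in range(10) if meets_preset_condition(t, 0.25)])
```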
On the basis of this embodiment, after the random gradient of the current iteration is determined, the current iteration number is judged. If the current iteration number does not meet the preset condition, the random gradient of the current iteration is compressed and the resulting compression gradient is sent to the central server node; if the current iteration number meets the preset condition, the random gradient of the current iteration is sent to the central server node as the compression gradient. Accordingly, when the current iteration number does not meet the preset condition, the compensation gradient can be determined as
ε_t^(i) = ∇f_i(x_t^(i), ξ_t^(i)) − C_δ(∇f_i(x_t^(i), ξ_t^(i)));
when the current iteration number meets the preset condition, the random gradient coincides with the compression gradient and the compensation gradient ε_t^(i) is zero. The computing node stores the compensation gradient to facilitate subsequent use.
The central server node determines a central gradient based on the received compression gradients; for example, the central gradient is v_t = e_t + (1/N) · Σ_{i=1}^{N} C_δ(∇f_i(x_t^(i), ξ_t^(i))).
The central server node determines the current iteration number, and it should be noted that, for the computing node and the central server node, the current iteration number is the same, that is, the current iteration number is in the same iteration process. Optionally, the preset conditions for determining the iteration times of the computing node and the central server node are the same. And under the condition that the central server node determines that the current iteration times meet the preset conditions, the central gradient is not compressed, and under the condition that the central server node determines that the current iteration times do not meet the preset conditions, the central gradient is compressed. And feeding back the processed center gradient to each computing node.
In iterations with different iteration numbers, the computing node updates the model parameters of the machine learning model in different ways. For example, when the current iteration number does not meet the preset condition, the central gradient is compensated based on the compensation gradient and the model parameters are updated on the basis of the model parameters obtained in the last iteration, for example by
x_{t+1}^(i) = x_t^(i) − η · (p_t + ε_t^(i)).
When the current iteration number meets the preset condition, the local model parameters are updated to the returned average model parameters, and the model parameters are then updated from the average model parameters based on the central gradient, for example by
x_{t+1}^(i) = x̄_t − η · v_t,
where the compensation gradient ε_t^(i) here is zero and x̄_t is the average of the local model parameters of the computing nodes.
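The two update branches just described can be summarized as in the sketch below; `meets_preset_condition` refers to the assumed helper sketched earlier, and the averaged model x̄_t and the toy gradients are placeholders.

```python
import numpy as np

def worker_model_update(x_t, p_t, compensation, lr, sync_round, x_bar_t=None):
    """Two-branch update on the computing node.

    sync_round == True  -> the preset condition holds: adopt the returned average
                           model parameters and step with the (uncompressed)
                           central gradient: x_{t+1} = x_bar_t - lr * p_t.
    sync_round == False -> regular round: compensate the compressed central
                           gradient and step from the local parameters:
                           x_{t+1} = x_t - lr * (p_t + compensation).
    """
    if sync_round:
        return x_bar_t - lr * p_t            # here p_t = v_t and compensation is zero
    return x_t - lr * (p_t + compensation)

x_t = np.array([0.5, -0.5])
print(worker_model_update(x_t, np.array([0.2, 0.1]), np.array([0.05, -0.05]),
                          lr=0.1, sync_round=False))
```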
According to the technical scheme of the embodiment, bidirectional gradient compression is performed between the computing node and the central server node in the distributed training process, so that the communication cost in the distributed training process is reduced. Meanwhile, by means of periodic model averaging, the problem that different local models on each computing node are caused by an error compensation mechanism is reduced, and the model accuracy of a machine learning model on the computing node is guaranteed while the communication cost is reduced.
Fig. 3 is a flowchart of a distributed training method provided in an embodiment of the present invention, where the embodiment is applicable to a case where a central server node trains a machine learning model, and the method may be performed by a distributed training apparatus integrated in the central server node, where the distributed training apparatus may be implemented in a form of hardware and/or software, and the distributed training apparatus may be configured in a central server node device, where the central server node device may be an electronic device such as a computer, a server, or the like. As shown in fig. 3, the method includes:
and S310, receiving the compression gradient of the machine learning model in the current iteration, which is sent by each computing node, in the iterative training process of the machine learning model.
And S320, determining the central gradient of the current iteration based on the compression gradient sent by each computing node and the error of the current iteration.
S330, compressing the central gradient of the current iteration under the condition that the current iteration number does not meet a preset condition, feeding the compressed central gradient back to each computing node, and feeding the central gradient of the current iteration back to each computing node under the condition that the current iteration number meets the preset condition. And the computing node updates the current iteration of the machine learning model based on the random gradient, the compression gradient and the central gradient of the current iteration.
In this embodiment, the compression gradient sent by a computing node and received by the central server node may be obtained by compressing the random gradient, or the random gradient itself may be used as the compression gradient. Which of the two is used may depend on the current iteration number: for example, the current iteration number is determined, the compression gradient is obtained by compressing the random gradient when the current iteration number does not satisfy the preset condition, and the random gradient calculated by the computing node is used directly as the compression gradient when the current iteration number satisfies the preset condition.
The central server node determines a central gradient of the current iteration based on the compression gradients and the error of the current iteration; for example, the central gradient may be
v_t = e_t + (1/N) · Σ_{i=1}^{N} C_δ(∇f_i(x_t^(i), ξ_t^(i))),
where C_δ(∇f_i(x_t^(i), ξ_t^(i))) is the compression gradient of computing node i and e_t is the error of the current iteration.
The central server node judges the current iteration number. When the current iteration number does not meet the preset condition, the central gradient of the current iteration is compressed to obtain p_t = C_δ(v_t), and the compressed central gradient p_t is fed back to each computing node, so that the computing node can perform the current iteration update of the machine learning model based on the random gradient, the compression gradient and the central gradient. When the current iteration number meets the preset condition, the central gradient of the current iteration is set to p_t = v_t, and it is fed back, together with the average model parameters x̄_t = (1/N) · Σ_{i=1}^{N} x_t^(i), to each computing node, so that the computing node can update the model parameters based on the central gradient and the average model parameters.
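A sketch of this server-side branching: compress the central gradient in regular rounds, or send it uncompressed together with the averaged model parameters in synchronization rounds. The compressor and the `sync_round` flag are assumptions carried over from the earlier sketches.

```python
import numpy as np

def server_step(compressed_grads, local_models, e_t, compress, sync_round):
    """Central-server step of the current iteration.

    Regular round: return p_t = C_delta(v_t) and carry the error e_{t+1} = v_t - p_t.
    Sync round:    return p_t = v_t, the averaged model parameters, and reset the error.
    """
    v_t = e_t + np.mean(compressed_grads, axis=0)
    if sync_round:
        x_bar = np.mean(local_models, axis=0)
        return v_t, x_bar, np.zeros_like(v_t)        # p_t = v_t, error reset to 0
    p_t = compress(v_t)
    return p_t, None, v_t - p_t                      # feedback, no model average, new error

compress = lambda v: np.where(np.abs(v) == np.abs(v).max(), v, 0.0)
grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
models = [np.array([0.2, 0.2]), np.array([0.4, 0.0])]
print(server_step(grads, models, np.zeros(2), compress, sync_round=False))
```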
According to the technical scheme of the embodiment, bidirectional gradient compression is performed between the computing node and the central server node in the distributed training process, so that the communication cost in the distributed training process is reduced. Meanwhile, by means of periodic model averaging, the problem that different local models on each computing node are caused by an error compensation mechanism is reduced, and the model accuracy of a machine learning model on the computing node is guaranteed while the communication cost is reduced.
On the basis of the foregoing embodiments, an embodiment of the present invention further provides a preferred example of the distributed training method. Referring to fig. 4, which is a flowchart of the distributed training method provided in this embodiment, i.e. the execution flow when the current iteration number does not satisfy the preset condition: any i-th computing node calculates the random gradient ∇f_i(x_t^(i), ξ_t^(i)), compresses it to obtain C_δ(∇f_i(x_t^(i), ξ_t^(i))) and sends the result to the central server, and temporarily stores the compression error, i.e. the compensation gradient ε_t^(i) = ∇f_i(x_t^(i), ξ_t^(i)) − C_δ(∇f_i(x_t^(i), ξ_t^(i))), on the computing node. The central server compensates the global error e_t onto the average of the received compression gradients to obtain the central gradient v_t = e_t + (1/N) · Σ_{i=1}^{N} C_δ(∇f_i(x_t^(i), ξ_t^(i))). The central server then compresses v_t to obtain p_t, sends p_t to each computing node and updates the global error variable; each computing node compensates its locally stored error onto p_t to obtain p_t + ε_t^(i), and updates the local model with this result.
Specifically, the computing node calculates the random gradient ∇f_i(x_t^(i), ξ_t^(i)) based on the local model x_t^(i) and the sampled data ξ_t^(i). When the iteration round t satisfies the preset condition, the computing node sends the random gradient (as the compression gradient) together with the local model x_t^(i) to the central server node; otherwise, the computing node uses the compressor C_δ to compress the random gradient, obtaining C_δ(∇f_i(x_t^(i), ξ_t^(i))), sends it to the server node, and stores the compression error ε_t^(i) = ∇f_i(x_t^(i), ξ_t^(i)) − C_δ(∇f_i(x_t^(i), ξ_t^(i))) on the computing node.
The central server node first averages the gradients sent by the computing nodes and performs error compensation on the result using the error variable: v_t = e_t + (1/N) · Σ_{i=1}^{N} C_δ(∇f_i(x_t^(i), ξ_t^(i))). When the iteration round t satisfies the preset condition, the central server node averages the received local model parameters and sends the result x̄_t = (1/N) · Σ_{i=1}^{N} x_t^(i), together with v_t (the uncompressed central gradient), to each computing node. Otherwise, the central server node uses the compressor C_δ to compress v_t, obtaining p_t = C_δ(v_t) (the compressed central gradient), and sends p_t to each computing node. The error variable is then updated as e_{t+1} = v_t − p_t (i.e. the error variable for the next iteration). Note that when the iteration round t satisfies the preset condition, this operation resets the error variable to 0.
Model update on the computing node: when the iteration round t satisfies the preset condition, the computing node updates its local model parameters to the returned average model parameters x̄_t and then updates the model as x_{t+1}^(i) = x̄_t − η · v_t, which is equivalent to using the uncompressed central gradient v_t to update the model. Otherwise, the computing node runs an instant error compensation mechanism: it compensates the locally stored error ε_t^(i) onto the returned gradient p_t and then updates the model as x_{t+1}^(i) = x_t^(i) − η · (p_t + ε_t^(i)).
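Putting the walkthrough above together, the following single-process simulation runs a few iterations of the described scheme on a quadratic objective; the top-k compressor, the loss, and the periodic synchronization condition are all illustrative assumptions rather than the patent's concrete choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim, lr, k, T, period = 4, 10, 0.05, 2, 12, 4
targets = [rng.normal(size=dim) for _ in range(N)]        # node-local data
x = [np.zeros(dim) for _ in range(N)]                     # local models x_t^(i)
e = np.zeros(dim)                                         # server error variable e_t

def grad(xi, target):                                     # stochastic-gradient stand-in
    return xi - target

def compress(v):                                          # top-k as the delta-compressor
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

for t in range(T):
    sync = (t % period == 0)                              # assumed preset condition
    gs = [grad(x[i], targets[i]) for i in range(N)]
    sent = gs if sync else [compress(g) for g in gs]      # compression gradients
    eps = [g - s for g, s in zip(gs, sent)]               # local compensation gradients
    v = e + np.mean(sent, axis=0)                         # central gradient
    if sync:
        p, x_bar, e = v, np.mean(x, axis=0), np.zeros(dim)
        x = [x_bar - lr * p for _ in range(N)]            # adopt averaged model
    else:
        p = compress(v)
        e = v - p                                         # error for the next iteration
        x = [x[i] - lr * (p + eps[i]) for i in range(N)]  # compensated update
print("final mean parameters:", np.mean(x, axis=0))
```

In this sketch only the sparse vectors `sent` and `p` would cross the network in regular rounds, while the periodic synchronization rounds keep the local models from drifting apart.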
For the above embodiments, the following convergence conclusion can be drawn: for a non-convex optimization objective, under the assumptions of continuity, bounded variance and bounded gradient, when the algorithm uses a δ-compressor and the learning rate is chosen appropriately, an upper bound on the convergence rate can be established. The result shows that this embodiment of the scheme achieves the same convergence-rate upper bound as the unidirectional gradient compression algorithm under the traditional error compensation mechanism, and a better bound than the bidirectional gradient compression algorithm, thereby achieving both a higher convergence speed and a lower communication cost.
Fig. 5 is a schematic structural diagram of a distributed training apparatus according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes:
a random gradient determining module 410, configured to determine a random gradient of the machine learning model in a current iteration during an iterative training process of the machine learning model;
a compression gradient determining module 420, configured to, when the number of iterations of the current time does not meet a preset condition, perform compression processing on a random gradient of the current time iteration to obtain a compression gradient of the current time iteration, and send the compression gradient to a central server node, where the central server node determines a central gradient of the current time iteration based on the compression gradients sent by each computing node device;
and the model updating module 430 is configured to receive a center gradient fed back by the center server node, determine a compensation gradient based on the random gradient of the current iteration and the compression gradient, compensate the center gradient based on the compensation gradient to obtain a target gradient of the current iteration, and update the machine learning model based on the target gradient.
On the basis of the foregoing embodiment, optionally, the model updating module 430 is configured to: determine a compensation gradient based on a difference of the random gradient of the current iteration and the compression gradient; determine a target gradient for the current iteration based on a sum of the compensation gradient and the central gradient.
On the basis of the foregoing embodiment, optionally, the center gradient fed back by the center server node is subjected to compression processing.
On the basis of the above embodiment, optionally, the compression gradient determining module 420 is configured to:
and calling a compressor, and compressing the random gradient of the current iteration based on the compressor to obtain the compressed gradient of the current iteration.
Based on the above embodiment, optionally, the compression gradient determining module 420 is configured to: under the condition that the current iteration times do not meet the preset conditions, compressing the random gradient of the current iteration;
under the condition that the current iteration number meets a preset condition, sending a random gradient of the current iteration as a compression gradient to a central server node;
correspondingly, the central gradient fed back by the central server node is the central gradient which is not subjected to compression processing under the condition that the previous iteration number is judged to meet the preset condition.
On the basis of the foregoing embodiment, optionally, the preset condition includes a preset interval number condition, or a condition for determining the number of iterations based on a compression related parameter in the compressor.
The distributed training device provided by the embodiment of the invention can execute the distributed training method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Fig. 6 is a schematic structural diagram of a distributed training apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes:
a compressed gradient receiving module 510, configured to receive, in an iterative training process of the machine learning model, a compressed gradient of the machine learning model in the current iteration, where the compressed gradient is sent by each computing node;
a central gradient determining module 520, configured to determine a central gradient of the current iteration based on the compressed gradient sent by each computing node and an error of the current iteration;
a central gradient sending module 530, configured to, when the current iteration number does not satisfy a preset condition, compress a central gradient of the current iteration and feed the compressed central gradient back to each computing node, and when the current iteration number satisfies the preset condition, feed the central gradient of the current iteration back to each computing node, where the computing node updates, on the basis of a random gradient of the current iteration, the compressed gradient, and the central gradient, a machine learning model in the current iteration.
The distributed training device provided by the embodiment of the invention can execute the distributed training method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
An embodiment of the present invention provides a distributed training system, and referring to fig. 7, fig. 7 is a schematic structural diagram of the distributed training system provided in this embodiment. The distributed training system of fig. 7 includes a central server node 610 and a plurality of compute nodes 620. The computing node 620 is configured to determine a random gradient of the machine learning model in the current iteration during an iterative training process of the machine learning model, compress the random gradient of the current iteration to obtain a compressed gradient of the current iteration, and send the compressed gradient to the central server node.
The central server node 610 is configured to: and determining the central gradient of the current iteration based on the compression gradient sent by each computing node device, and compressing the central gradient and sending the compressed central gradient to each computing node under the condition that the current iteration number does not meet the preset condition.
The computation node 620 is further configured to receive a center gradient fed back by the center server node, determine a compensation gradient based on the random gradient of the current iteration and the compression gradient, compensate the center gradient based on the compensation gradient to obtain a target gradient of the current iteration, and update the current iteration of the machine learning model based on the target gradient.
Optionally, the computing node 620 compresses the random gradient of the current iteration when the current iteration number does not meet the preset condition; and under the condition that the current iteration number meets a preset condition, sending the random gradient of the current iteration as a compression gradient to the central server node.
Optionally, the central server node 610 is configured to: under the condition that the current iteration number does not meet the preset condition, compressing the central gradient of the current iteration, and feeding the compressed central gradient back to each computing node 620; and feeding back the central gradient of the current iteration to each computing node 620 when the current iteration number meets a preset condition.
According to the technical scheme provided by this embodiment, the random gradient of each iteration is compressed before transmission between each computing node and the central server node, and only the resulting compressed gradient is transmitted, which reduces the communication cost between the computing nodes and the central server node. Furthermore, the central gradient fed back by the central server node may itself be compressed, so that gradients are compressed in both directions of the transmission between the computing nodes and the central server node, further reducing the communication cost. Meanwhile, a compensation gradient is determined from the error introduced by the compression, the central gradient of the current iteration fed back by the central server node is compensated with this compensation gradient, and the model parameters of the current iteration are updated based on the compensated target gradient. By performing gradient compensation within the current iteration, the slow convergence caused by gradient compression is avoided while the communication cost is reduced, and the convergence speed of the distributed training process is improved.
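The compressor itself is left open by the embodiments (a compressor is simply "called"; see claim 4). Top-k sparsification is one commonly used gradient compressor and is sketched below purely as an illustrative example, not as the compressor prescribed by this application.

```python
import numpy as np

def topk_compress(grad, k_ratio=0.01):
    """Keep only the k largest-magnitude entries of the gradient, zeroing the rest."""
    k = max(1, int(grad.size * k_ratio))
    flat = grad.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest-magnitude entries
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(grad.shape)
```

With such a compressor, only the retained values and their indices need to be transmitted, which is where the communication saving described above comes from.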
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12 and a Random Access Memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the ROM 12 or loaded from a storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data necessary for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the distributed training method.
In some embodiments, the distributed training method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the distributed training method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the distributed training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
The computer program for implementing the distributed training method of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, a special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
An embodiment of the present invention further provides a computer-readable storage medium, in which computer instructions are stored, where the computer instructions are configured to cause a processor to execute a distributed training method, where the method includes:
in the iterative training process of the machine learning model, determining the random gradient of the machine learning model in the current iteration;
compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration, and sending the compressed gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device;
and receiving a central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, which is not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A distributed training method applied to a computing node device, the method comprising:
in the iterative training process of the machine learning model, determining the random gradient of the machine learning model in the current iteration;
compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration, and sending the compressed gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device;
and receiving a central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
2. The method according to claim 1, wherein the receiving a central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compressed gradient of the current iteration, and compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration comprises:
determining the compensation gradient based on the difference between the random gradient of the current iteration and the compressed gradient;
determining the target gradient of the current iteration based on the sum of the compensation gradient and the central gradient.
3. The method according to claim 1 or 2, wherein the central gradient fed back by the central server node is a central gradient that has been subjected to compression processing.
4. The method according to claim 1, wherein the compressing the random gradient of the current iteration to obtain a compressed gradient of the current iteration comprises:
calling a compressor, and compressing the random gradient of the current iteration with the compressor to obtain the compressed gradient of the current iteration.
5. The method of claim 1, wherein the compressing the random gradient of the current iteration further comprises:
compressing the random gradient of the current iteration when the current iteration number does not meet a preset condition;
sending the random gradient of the current iteration to the central server node as the compressed gradient when the current iteration number meets the preset condition;
correspondingly, the central gradient fed back by the central server node is a central gradient that has not been subjected to compression processing when the current iteration number is determined to meet the preset condition.
6. The method according to claim 5, wherein the preset condition comprises a preset iteration-interval condition, or a condition in which the number of iterations is determined based on a compression-related parameter of the compressor.
7. A distributed training method, applied to a central server node device, the method comprising:
in the iterative training process of the machine learning model, receiving the compression gradient of the machine learning model in the current iteration, which is sent by each computing node;
determining the central gradient of the current iteration based on the compression gradient sent by each computing node and the error of the current iteration;
under the condition that the current iteration times do not meet preset conditions, compressing the central gradient of the current iteration, and feeding the compressed central gradient back to each computing node; under the condition that the current iteration times meet preset conditions, feeding back the central gradient of the current iteration to each computing node;
and the computing node updates the machine learning model in the current iteration based on the random gradient, the compression gradient and the central gradient of the current iteration.
8. A distributed training apparatus, integrated in a computing node device, the apparatus comprising:
the random gradient determining module is used for determining the random gradient of the machine learning model in the current iteration in the iterative training process of the machine learning model;
the compression gradient determining module is used for compressing the random gradient of the current iteration to obtain a compression gradient of the current iteration and sending the compression gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compression gradient sent by each computing node device;
and the model updating module is used for receiving the central gradient fed back by the central server node, determining a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensating the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updating the current iteration of the machine learning model based on the target gradient.
9. A distributed training apparatus, integrated in a central server node device, the apparatus comprising:
the compression gradient receiving module is used for receiving the compression gradient of the machine learning model sent by each computing node in the current iteration in the iterative training process of the machine learning model;
the central gradient determining module is used for determining the central gradient of the current iteration based on the compression gradient sent by each computing node and the error of the current iteration;
and the central gradient sending module is used for compressing the central gradient of the current iteration under the condition that the current iteration number does not meet a preset condition, feeding the compressed central gradient back to each computing node, and feeding the central gradient of the current iteration back to each computing node under the condition that the current iteration number meets the preset condition, wherein the computing nodes update the current iteration of the machine learning model on the basis of the random gradient of the current iteration, the compressed gradient and the central gradient.
10. A distributed training system comprising a central server node and a plurality of compute nodes, wherein,
the computing node determines the random gradient of the machine learning model in the current iteration in the iterative training process of the machine learning model, compresses the random gradient of the current iteration to obtain the compression gradient of the current iteration, and sends the compression gradient to the central server node;
the central server node determines the central gradient of the current iteration based on the compression gradient sent by each computing node device, and sends the central gradient to each computing node after compression processing on the central gradient under the condition that the current iteration times do not meet preset conditions;
and the computing node receives a central gradient fed back by the central server node, determines a compensation gradient based on the random gradient and the compression gradient of the current iteration, compensates the central gradient based on the compensation gradient to obtain a target gradient of the current iteration, and updates a machine learning model based on the target gradient in the current iteration.
11. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the distributed training method of any one of claims 1-6 or the distributed training method of claim 7.
12. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the distributed training method of any one of claims 1-6 or the distributed training method of claim 7.
CN202211468935.3A 2022-11-22 2022-11-22 Distributed training method, device, system, storage medium and electronic equipment Pending CN115719093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211468935.3A CN115719093A (en) 2022-11-22 2022-11-22 Distributed training method, device, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211468935.3A CN115719093A (en) 2022-11-22 2022-11-22 Distributed training method, device, system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115719093A true CN115719093A (en) 2023-02-28

Family

ID=85256309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211468935.3A Pending CN115719093A (en) 2022-11-22 2022-11-22 Distributed training method, device, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115719093A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117910521A (en) * 2024-03-20 2024-04-19 浪潮电子信息产业股份有限公司 Gradient compression method, gradient compression device, gradient compression equipment, gradient compression distributed cluster and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination