CN113627519B - Distributed stochastic gradient descent method with compression and delay compensation - Google Patents

Distributed stochastic gradient descent method with compression and delay compensation

Info

Publication number
CN113627519B
CN113627519B (application CN202110904974.2A)
Authority
CN
China
Prior art keywords
weight
local
training
grad
loc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110904974.2A
Other languages
Chinese (zh)
Other versions
CN113627519A (en)
Inventor
董德尊
于恩达
汪杨海
廖湘科
肖立权
徐叶茂
欧阳硕
杨维玲
王笑雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110904974.2A priority Critical patent/CN113627519B/en
Publication of CN113627519A publication Critical patent/CN113627519A/en
Application granted granted Critical
Publication of CN113627519B publication Critical patent/CN113627519B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a distributed stochastic gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes. The computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture. After the transition training is finished, formal training is performed to obtain the local gradient and local weight of the next moment, and compensation training is then performed. The method adopts a unique local update mechanism to cover the extra computational overhead of quantization, is applicable to all lossy compression methods, and therefore has the advantage of a wide application range.

Description

Distributed stochastic gradient descent method with compression and delay compensation
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a distributed stochastic gradient descent method with compression and delay compensation.
Background
At present, distributed training has become an effective method for deep learning model training. A large training data set is partitioned across multiple nodes that perform the training task together. The nodes must therefore share their computed parameters with each other before the global parameters are updated, and the communication cost of this sharing process limits the scalability of the distributed system and seriously reduces the efficiency of distributed training. For example, when training ResNet-50 on a 16-node Nvidia P102-100 GPU cluster connected via 1 Gbps Ethernet, the communication time is more than nine times the computation time. As the number of compute nodes increases, the communication cost worsens. To address the communication problem in distributed training, researchers have proposed many methods for accelerating distributed training, which can be classified into system-level methods and algorithm-level methods.
At the system level, pipelining methods exploit the layered structure of neural networks so that the communication of each back-propagation (BP) layer can overlap with the computation of the next layer. Building on the pipeline approach, a communication priority scheduling mechanism has been proposed to achieve a more aggressive overlap between computation and communication overhead. Recent research efforts have been directed at improving distributed training performance by parallelizing computation and communication operations. The Post-local SGD, K-AVG, and Periodic Averaging methods allow each compute node to perform local updates before communicating, evolving the local models through periodic average synchronization.
At the algorithm level, gradient compression techniques have been proposed to reduce communication traffic; they can be divided into gradient sparsification methods and gradient quantization methods. Gradient quantization converts high-precision gradients into low-precision gradients for communication. The 1-bit quantization method reduces communication traffic by encoding each 32-bit gradient into 1 bit. The QSGD method allows the user to select different quantization levels according to the network bandwidth. The WAGE method and the 8-bit training method quantize not only the gradients but also the weights. Early sparsification methods judged whether to send a gradient by a single threshold. DGC then accelerates large-scale distributed training even further by exchanging only the top 0.1% of gradients in each iteration and accumulating the remaining gradients locally until they become large enough. While these communication algorithms can alleviate communication pressure, they introduce additional computational overhead for data encoding and gradient selection. Worse, the performance improvement of a compression method is not significant when the additional computation overhead plus the gradient computation time is much higher than the communication cost.
Some research efforts have tried to combine the advantages of the system-level and algorithm-level approaches, but the results are still not ideal. The LAGS-SGD approach integrates DGC with the pipeline approach, but does not bring a large speed advantage because of the start-up cost and the additional compression cost of multi-layer communication. The Canary method combines 8-bit quantization with gradient partitioning, but fails to solve the problem of reduced accuracy. OMGS-SGD combines DGC with an optimal merging mechanism, yet it is not much faster than DGC when training communication-intensive models such as VGG-16.
For the communication problem in distributed training, an appropriate method is therefore needed that eliminates or hides the extra cost of compression, solves the accuracy degradation caused by compression, and ultimately delivers a sufficient gain in training efficiency.
Disclosure of Invention
Aiming at the problems of eliminating or hiding the extra cost of compression and of the accuracy degradation caused by compression in communication optimization for distributed training, the invention discloses a distributed stochastic gradient descent method with compression and delay compensation. The method is implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture. The method specifically comprises the following steps:
Firstly, warm-up training is carried out, comprising steps S1, S2 and S3;
S1: retrieve the global weight W_{i-1} at time t-1 from the parameter server;
S2: using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 in each computing node; the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:
Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)
where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, the prediction result Y_j' is computed; the loss value loss is computed from the prediction Y_j' and the label Y_j, and the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss value loss. The gradient calculation can be implemented with the SGD algorithm, the DCSGD algorithm, the Adam algorithm, and the like;
S3: each computing node pushes the computed local gradient value to the parameter server, and the global weight W_{i-1} at time t-1 is updated with the aggregated local gradient values to obtain the global weight W_i at time t:
W_i = W_{i-1} - (η/N) · ∑_{j=1}^{N} Grad_{j,i-1}    (2)
where η is the learning rate hyperparameter and N is the number of computing nodes.
The warm-up training stage repeats steps S1 to S3 several times so that the weights stabilize more quickly; the number of repetitions can be adjusted by the user, and typically 5 epochs of warm-up are performed.
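To make the warm-up step concrete, the following is a minimal NumPy sketch of one iteration of steps S1 to S3. It assumes a generic differentiable model; the helper predict_and_grad, which returns the loss and its gradient for a node's mini-batch, is a hypothetical stand-in for the forward and backward pass that grad_cal performs and is not part of the patent.

# Minimal sketch of one warm-up iteration (S1-S3), assuming a hypothetical
# predict_and_grad(weight, X, Y) -> (loss, grad) helper for the forward/backward pass.
import numpy as np

def grad_cal(weight, X_j, Y_j, predict_and_grad):
    """Local gradient Grad_{j,i-1} computed from the pulled global weight and local data."""
    _, grad = predict_and_grad(weight, X_j, Y_j)
    return grad

def warmup_iteration(global_weight, shards, eta, predict_and_grad):
    """One synchronous step: every node pulls W_{i-1}, pushes its gradient, the server averages."""
    grads = [grad_cal(global_weight, X_j, Y_j, predict_and_grad) for X_j, Y_j in shards]
    # Equation (2): W_i = W_{i-1} - (eta / N) * sum_j Grad_{j,i-1}
    return global_weight - eta * np.mean(grads, axis=0)

In a real deployment each element of shards lives on a different computing node and the averaging happens on the parameter server; the single-process loop above only illustrates the data flow.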
After the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4: the global weight W_i at time t retrieved from the parameter server is backed up and stored in the backup weight variable loc_weight; the local weight W_i^loc of each computing node at time t is thus equal to the global weight W_i at time t and to the backup weight loc_weight. In each computing node, the local gradient at time t is computed; the local gradient value Grad_{j,i} of the j-th computing node at time t is:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5: at each computing node, the computed local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^loc at time t+1:
W_{i+1}^loc = W_i - η · Grad_{j,i}    (4)
While the local weight W_{i+1}^loc is being computed, each computing node uploads its computed local gradient at time t to the parameter server, and the global weight W_i at time t is updated with the aggregated local gradient values at time t to obtain the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · ∑_{j=1}^{N} Grad_{j,i}    (5)
The resulting global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight.
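The following sketch, under the same assumptions as the warm-up sketch above (NumPy, the hypothetical predict_and_grad helper and the grad_cal function defined there), illustrates how the transition step lets the local update of equation (4) proceed concurrently with the push and global update of equation (5); a thread pool stands in for the asynchronous execution engine of a real framework.

# Sketch of the transition step (S4-S5); reuses grad_cal from the warm-up sketch.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def transition_step(global_weight, shards, eta, predict_and_grad):
    loc_weight = global_weight.copy()   # backup of W_i into loc_weight
    grads = [grad_cal(loc_weight, X_j, Y_j, predict_and_grad) for X_j, Y_j in shards]
    with ThreadPoolExecutor() as pool:
        # Equation (4): per-node local weight W_{i+1}^loc = W_i - eta * Grad_{j,i}
        local_futures = [pool.submit(lambda g=g: loc_weight - eta * g) for g in grads]
        # Equation (5): server aggregate W_{i+1} = W_i - (eta / N) * sum_j Grad_{j,i}
        global_future = pool.submit(lambda: global_weight - eta * np.mean(grads, axis=0))
        local_weights = [f.result() for f in local_futures]
        new_global = global_future.result()
    return local_weights, new_global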
After the transition training is finished, formal training is performed, comprising steps S6, S7 and S8;
S6: after obtaining the local weight W_{i+1}^loc, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^loc, X_j, Y_j)    (6)
S7: after each computing node has computed the local gradient Grad_{j,i+1} at time t+1, it stores the gradient in the variable Sgrad_j in 2-bit format; at the same time, the local gradient Grad_{j,i+1} at time t+1 is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^loc at time t+2:
W_{i+2}^loc = W_{i+1} - η · Grad_{j,i+1}    (7)
S8: each computing node pushes the data in the variable Sgrad_j to the parameter server, and the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The calculation formula of the global weight W_{i+2} at time t+2 is:
W_{i+2} = W_{i+1} - (η/N) · ∑_{j=1}^{N} Sgrad_j    (8)
Meanwhile, after the local weight W_{i+2}^loc at time t+2 is obtained, formula (6) is applied to W_{i+2}^loc to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N.
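To illustrate step S7's split between the quantized gradient that travels to the server and the full-precision gradient that updates the local weight, here is a sketch under the same assumptions as the earlier snippets. The quantize_2bit function is a simplified threshold ternarizer standing in for the 2-bit encoding (the actual codec additionally packs the values densely); its default threshold of 0.5 follows the experimental setting described later.

# Sketch of one formal-training step (S6-S8); reuses grad_cal from the warm-up sketch.
import numpy as np

def quantize_2bit(grad, threshold=0.5):
    """Simplified 2-bit style compression: map each entry to -threshold, 0, or +threshold."""
    return np.where(grad > threshold, threshold,
                    np.where(grad < -threshold, -threshold, 0.0))

def formal_step(global_weight, local_weights, shards, eta, predict_and_grad):
    new_locals, sgrads = [], []
    for w_loc, (X_j, Y_j) in zip(local_weights, shards):
        grad = grad_cal(w_loc, X_j, Y_j, predict_and_grad)   # eq. (6): uses W^loc, not W
        sgrads.append(quantize_2bit(grad))                   # Sgrad_j, pushed to the server
        new_locals.append(global_weight - eta * grad)        # eq. (7): full-precision local update
    new_global = global_weight - eta * np.mean(sgrads, axis=0)  # eq. (8): server update
    return new_locals, new_global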
After the formal training is finished, compensation training is performed, comprising step S9;
S9: in every k computations of the global weight, the formal training mode is adopted for the first k-1 iterations, and a compensation computation is performed at every k-th iteration; the compensation computation substitutes the local gradient Grad_j at that moment for the variable Sgrad_j and then computes the global weight with formula (8).
After the transition from warm-up training to formal training is completed, the operations of steps S7, S8 and S9 are repeated until the user-specified number of training epochs has been performed. Models of different complexity require different amounts of training, and training is also constrained by the user's accuracy expectations; experiments show that the convergence of the method is close to that of S-SGD, so S-SGD training examples can be used as a reference when setting the number of training epochs.
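Putting the pieces together, the outer loop below sketches the compensation schedule of step S9 under the same assumptions as the previous snippets (grad_cal and quantize_2bit as defined above): in every k-th iteration the full-precision gradients replace the quantized Sgrad_j on the server side, compensating the accumulated quantization error.

# Sketch of the training loop with k-step compensation (S7-S9).
import numpy as np

def train_loop(global_weight, local_weights, shards, eta, k, num_iters, predict_and_grad):
    for it in range(1, num_iters + 1):
        new_locals, grads, sgrads = [], [], []
        for w_loc, (X_j, Y_j) in zip(local_weights, shards):
            grad = grad_cal(w_loc, X_j, Y_j, predict_and_grad)
            grads.append(grad)
            sgrads.append(quantize_2bit(grad))
            new_locals.append(global_weight - eta * grad)
        # S9: every k-th iteration pushes Grad_j instead of Sgrad_j (compensation step).
        pushed = grads if it % k == 0 else sgrads
        global_weight = global_weight - eta * np.mean(pushed, axis=0)
        local_weights = new_locals
    return global_weight, local_weights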
The invention discloses a distributed stochastic gradient descent method with compression and delay compensation that is implemented on the PS (parameter server) architecture in an MXNet environment and comprises the following steps.
n iterations are performed during the warm-up training phase. In each iteration, the computing node uses the global weight W to compute the local gradient Grad and stores it in its buffer comm_buf; the computed local gradient Grad is then pushed to the parameter server, the parameter server updates the global weight W, and the content of the buffer comm_buf is overwritten with the updated global weight W.
In the (n-1)-th iteration of the warm-up training phase, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated in the n-th iteration using the local gradient Grad, providing the new local weight W^loc required by the formal training phase. The number of warm-up iterations n can be adjusted according to the complexity of different models, and the time required is usually very short.
In the formal training stage, after obtaining the global weight, the computing node copies it into the buffer loc_buf, updates the copy of the global weight in loc_buf with the local gradient, and in the next iteration the updated local weight participates in the computation of the local gradient at the next moment. The dependency engine of MXNet ensures that the local weights in the buffer loc_buf are not overwritten by the next copy of the global weights until they have participated in the local gradient computation. A first counter on the computing node controls the number of warm-up iterations, a second counter on the computing node determines whether quantization is performed, and the local weight update and the quantization operation are performed in parallel.
The local gradient is pushed to the parameter server after being encoded into 2-bit data, and the parameter server decodes the 2-bit data back into 32 bits before updating the global weights. If the local gradient does not need to be quantized in a given iteration of the formal training stage, the operation of that iteration is the same as in the warm-up stage, and once the local weight has been updated the next iteration is executed immediately.
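The double-buffer arrangement described above can be pictured with the short sketch below. It reuses grad_cal and quantize_2bit from the earlier snippets and uses a Python thread pool as a stand-in for MXNet's dependency engine; the variable names comm_buf and loc_buf follow the text, everything else is illustrative.

# Sketch of one formal-training iteration on a single node with the two buffers.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def formal_iteration(global_weight, w_loc_prev, X_j, Y_j, eta, predict_and_grad):
    comm_buf = grad_cal(w_loc_prev, X_j, Y_j, predict_and_grad)  # local gradient Grad
    loc_buf = global_weight.copy()                               # pulled global weight copied to loc_buf
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both tasks only read comm_buf, so they can be issued in parallel.
        quant_future = pool.submit(quantize_2bit, comm_buf)           # encode for the push
        local_future = pool.submit(lambda: loc_buf - eta * comm_buf)  # update the copy in loc_buf
        sgrad = quant_future.result()       # pushed to the parameter server
        w_loc_next = local_future.result()  # used for the next iteration's gradient
    return sgrad, w_loc_next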
The invention has the beneficial effects that:
the method employs a unique local update mechanism to mask the quantization overhead. The introduction of the compensation mechanism in the method is applicable to all lossy compression methods (including quantization and sparsification methods), is a mainstream technology for accelerating training, and the existing lossy compression method can make up for the precision loss by using the method, so the method has the advantage of wide application range.
Drawings
FIG. 1 compares the training accuracy of the LeNet-5 model on 2 computing nodes for different methods;
FIG. 2 compares the training accuracy of the ResNet-50 model on 4 computing nodes for different methods;
FIG. 3 compares the training accuracy of the ResNet-50 model on 8 computing nodes for different methods;
FIG. 4 shows the speed-up ratios of different methods trained with 4 computing nodes in a K80 cluster;
FIG. 5 shows the speed-up ratios of different methods trained with 4 computing nodes in a V100 cluster.
Detailed Description
For a better understanding of the present disclosure, an example is given here.
The invention discloses a distributed stochastic gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture. The method specifically comprises the following steps:
Firstly, warm-up training is carried out, comprising steps S1, S2 and S3;
S1: retrieve the global weight W_{i-1} at time t-1 from the parameter server;
S2: using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 in each computing node; the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:
Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)
where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, the prediction result Y_j' is computed; the loss value loss is computed from the prediction Y_j' and the label Y_j, and the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss value loss. The gradient calculation can be implemented with the SGD algorithm, the DCSGD algorithm, the Adam algorithm, and the like, so the method has good generality;
S3: each computing node pushes the computed local gradient value to the parameter server, and the global weight W_{i-1} at time t-1 is updated with the aggregated local gradient values to obtain the global weight W_i at time t:
W_i = W_{i-1} - (η/N) · ∑_{j=1}^{N} Grad_{j,i-1}    (2)
where η is the learning rate hyperparameter and N is the number of computing nodes.
The warm-up training stage repeats steps S1 to S3 several times so that the weights stabilize more quickly; the number of repetitions can be adjusted by the user, and typically 5 epochs of warm-up are performed.
After the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4: the global weight W_i at time t retrieved from the parameter server is backed up and stored in the backup weight variable loc_weight; the local weight W_i^loc of each computing node at time t is thus equal to the global weight W_i at time t and to the backup weight loc_weight. In each computing node, the local gradient at time t is computed; the local gradient value Grad_{j,i} of the j-th computing node at time t is:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5: at each computing node, the computed local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^loc at time t+1:
W_{i+1}^loc = W_i - η · Grad_{j,i}    (4)
While the local weight W_{i+1}^loc is being computed, each computing node uploads its computed local gradient at time t to the parameter server, and the global weight W_i at time t is updated with the aggregated local gradient values at time t to obtain the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · ∑_{j=1}^{N} Grad_{j,i}    (5)
The resulting global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight.
After the transition training is finished, formal training is performed, comprising steps S6, S7 and S8;
S6: after obtaining the local weight W_{i+1}^loc, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^loc, X_j, Y_j)    (6)
S7: after each computing node has computed the local gradient Grad_{j,i+1} at time t+1, it stores the gradient in the variable Sgrad_j in 2-bit format; at the same time, the local gradient Grad_{j,i+1} at time t+1 is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^loc at time t+2:
W_{i+2}^loc = W_{i+1} - η · Grad_{j,i+1}    (7)
S8: each computing node pushes the data in the variable Sgrad_j to the parameter server, and the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The calculation formula of the global weight W_{i+2} at time t+2 is:
W_{i+2} = W_{i+1} - (η/N) · ∑_{j=1}^{N} Sgrad_j    (8)
Meanwhile, after the local weight W_{i+2}^loc at time t+2 is obtained, formula (6) is applied to W_{i+2}^loc to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N.
After the formal training is finished, compensation training is performed, comprising step S9;
S9: in every k computations of the global weight, the formal training mode is adopted for the first k-1 iterations, and a compensation computation is performed at every k-th iteration; the compensation computation substitutes the local gradient Grad_j at that moment for the variable Sgrad_j and then computes the global weight with formula (8).
After the transition from warm-up training to formal training is completed, the operations of steps S7, S8 and S9 are repeated until the user-specified number of training epochs has been performed.
The method can be implemented on existing mainstream distributed learning frameworks. For a better understanding of the invention, CD-SGD, an implementation of the method on the PS (parameter server) architecture of MXNet, is given here.
The invention discloses a distributed stochastic gradient descent method with compression and delay compensation that is implemented on the PS (parameter server) architecture in an MXNet environment and comprises the following steps.
n iterations are performed during the warm-up training phase. In each iteration, the computing node uses the global weight W to compute the local gradient Grad, stores it in its buffer comm_buf, and then pushes it to the parameter server; the parameter server updates the global weight W and overwrites the content of the buffer comm_buf with the updated global weight W, so only one buffer is needed for this work.
In the (n-1)-th iteration of the warm-up training phase, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated in the n-th iteration using the local gradient Grad, providing the new local weight W^loc required by the formal training phase. The number of warm-up iterations n can be adjusted according to the complexity of different models, and the time required is usually very short.
In the formal training stage, after obtaining the global weight, the computing node copies it into the buffer loc_buf, updates the copy of the global weight in loc_buf with the local gradient, and in the next iteration the updated local weight participates in the computation of the local gradient at the next moment. The dependency engine of MXNet ensures that the local weights in the buffer loc_buf are not overwritten by the next copy of the global weights until they have participated in the local gradient computation. A first counter on the computing node controls the number of warm-up iterations, and a second counter on the computing node determines whether quantization is performed; the local weight update and the quantization operation are performed in parallel because both only read the buffer comm_buf without modifying it.
The local gradient is pushed to the parameter server after being encoded into 2-bit data, and the parameter server decodes the 2-bit data back into 32 bits before updating the global weights. If the local gradient does not need to be quantized in a given iteration of the formal training stage, the operation of that iteration is the same as in the warm-up stage, and once the local weight has been updated the next iteration is executed immediately. The local update mechanism adopted here ensures that the computation in the next iteration is not delayed by compression, and it works together with parallel communication.
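For concreteness, here is a simplified encode/decode round trip for a threshold-based 2-bit representation. It only illustrates the idea of sending 2-bit codes and expanding them back to 32-bit floats on the parameter server; it is not MXNet's actual codec (a real implementation would also pack the codes densely rather than store one code per byte), and the default threshold of 0.5 follows the experimental setting below.

# Simplified 2-bit encode/decode round trip for the gradient push.
import numpy as np

def encode_2bit(grad, threshold=0.5):
    """2-bit codes: 0 -> zero, 1 -> +threshold, 2 -> -threshold (one code per entry)."""
    codes = np.zeros(grad.shape, dtype=np.uint8)
    codes[grad > threshold] = 1
    codes[grad < -threshold] = 2
    return codes

def decode_2bit(codes, threshold=0.5):
    """Expand the codes back to 32-bit floats before the global weight update."""
    table = np.array([0.0, threshold, -threshold], dtype=np.float32)
    return table[codes]

# Example: each float32 gradient entry carries only 2 bits of information on the wire.
g = np.array([0.9, -0.7, 0.1, -0.2], dtype=np.float32)
assert np.allclose(decode_2bit(encode_2bit(g)), [0.5, -0.5, 0.0, 0.0])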
The CD-SGD method is designed to inherit the advantage of gradient quantization in reducing communication traffic while compensating for the accuracy loss caused by quantization, masking the quantization overhead, and improving the overlap of computation and communication so as to improve training efficiency. Here we evaluate whether the CD-SGD method achieves these goals. FIG. 1 compares the training accuracy of the LeNet-5 model on 2 computing nodes for different methods; FIG. 2 compares the training accuracy of the ResNet-50 model on 4 computing nodes for different methods; FIG. 3 compares the training accuracy of the ResNet-50 model on 8 computing nodes for different methods; FIG. 4 shows the speed-up ratios of different methods trained with 4 computing nodes in a K80 cluster; FIG. 5 shows the speed-up ratios of different methods trained with 4 computing nodes in a V100 cluster.
Fig. 1 to Fig. 3 compare the convergence accuracy curves of several methods: CD-SGD (the method of the present invention), S-SGD (the traditional high-accuracy synchronous update algorithm), BIT-SGD (the 2-bit quantization method provided by MXNet, the optimization target of the present invention), and OD-SGD (a local update algorithm whose local update mechanism this invention uses as a reference). In the experiments, 4 parameter server nodes and 4 computing nodes are used, the batch size per node is 128, the global learning rate (lr) of all algorithms is 0.1, the local lr of CD-SGD and OD-SGD is 0.4, and the warm-up period is 5 epochs. In addition, the quantization threshold of CD-SGD and BIT-SGD is 0.5, and the precision compensation step size k of CD-SGD is 2. From Fig. 1 to Fig. 3 the following conclusions can be drawn: (1) the test accuracy of BIT-SGD is clearly lower than that of the other three methods, but CD-SGD solves the accuracy problem of BIT-SGD, and its convergence is very close to that of S-SGD and even higher than that of OD-SGD. (2) CD-SGD maintains good accuracy scalability: when the number of nodes increases, its accuracy still differs from that of S-SGD by no more than 0.5%.
Fig. 4 and Fig. 5 show the relative speed-ups of the above methods trained on two different clusters, K80 and V100. The precision compensation step size k of CD-SGD is 5, and the remaining parameter settings are the same as above. The speed of S-SGD is used as the baseline. From Fig. 4 and Fig. 5 the following conclusions can be drawn: (1) when ResNet-50 is trained on K80, CD-SGD obtains the same training speed as OD-SGD. We attribute this to the limited computational power of the K80, which creates a computational bottleneck. Furthermore, when VGG-16 and Inception-BN are trained, the performance of BIT-SGD is worse than that of OD-SGD, unlike the training of AlexNet. This means that, in addition to masking the compression overhead and reducing communication time, a parallel mechanism is needed to increase the overlap of computation and communication. The speed-ups of CD-SGD on the models shown in FIG. 4 are 0%, 43%, 33%, and 32%, respectively. (2) The V100 GPU cluster has stronger computing power and completes the computation in a shorter time, so BIT-SGD outperforms OD-SGD when training most of the models in FIG. 5, because the limited computation cost cannot fully cover the communication time. However, BIT-SGD is slower than OD-SGD when training Inception-BN, because Inception-BN has many computation layers and therefore a significant computation cost. The speed-ups of CD-SGD on the models shown in FIG. 5 are 24%, 43%, 39%, and 44%, respectively. In FIG. 4 and FIG. 5, the training efficiency of CD-SGD improves over BIT-SGD by 3% to 30%.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (4)

1. A distributed stochastic gradient descent method with compression and delay compensation, characterized in that it is implemented with a parameter server and computing nodes, wherein the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture, the method comprising the following specific steps:
firstly, warm-up training is carried out, comprising steps S1, S2 and S3;
S1: retrieving the global weight W_{i-1} at time t-1 from the parameter server;
S2: using the global weight W_{i-1} at time t-1, computing the local gradient at time t-1 in each computing node, the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 being computed as:
Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)
where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, the prediction result Y_j' is computed; the loss value loss is computed from the prediction Y_j' and the label Y_j, and the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss value loss;
S3: each computing node pushes the computed local gradient value to the parameter server, and the global weight W_{i-1} at time t-1 is updated with the aggregated local gradient values to obtain the global weight W_i at time t:
W_i = W_{i-1} - (η/N) · ∑_{j=1}^{N} Grad_{j,i-1}    (2)
where η is the learning rate hyperparameter and N is the number of computing nodes;
the warm-up training stage repeats steps S1 to S3 several times;
after the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4: the global weight W_i at time t retrieved from the parameter server is backed up and stored in the backup weight variable loc_weight; the local weight W_i^loc of each computing node at time t is thus equal to the global weight W_i at time t and to the backup weight loc_weight; in each computing node, the local gradient at time t is computed, the local gradient value Grad_{j,i} of the j-th computing node at time t being:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5: at each computing node, the computed local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^loc at time t+1:
W_{i+1}^loc = W_i - η · Grad_{j,i}    (4)
while the local weight W_{i+1}^loc is being computed, each computing node uploads its computed local gradient at time t to the parameter server, and the global weight W_i at time t is updated with the aggregated local gradient values at time t to obtain the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · ∑_{j=1}^{N} Grad_{j,i}    (5)
the resulting global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight;
after the transition training is finished, formal training is performed, comprising steps S6, S7 and S8;
S6: after obtaining the local weight W_{i+1}^loc, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^loc, X_j, Y_j)    (6)
S7: after each computing node has computed the local gradient Grad_{j,i+1} at time t+1, it stores the gradient in the variable Sgrad_j in 2-bit format; at the same time, the local gradient Grad_{j,i+1} at time t+1 is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^loc at time t+2:
W_{i+2}^loc = W_{i+1} - η · Grad_{j,i+1}    (7)
S8: each computing node pushes the data in the variable Sgrad_j to the parameter server, and the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight, the calculation formula of the global weight W_{i+2} at time t+2 being:
W_{i+2} = W_{i+1} - (η/N) · ∑_{j=1}^{N} Sgrad_j    (8)
meanwhile, after the local weight W_{i+2}^loc at time t+2 is obtained, formula (6) is applied to W_{i+2}^loc to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N;
After the formal training is finished, performing compensation training, which includes step S9;
S9: in every k computations of the global weight, the formal training mode is adopted for the first k-1 iterations, and a compensation computation is performed at every k-th iteration; the compensation computation substitutes the local gradient Grad_j at that moment for the variable Sgrad_j and then computes the global weight with formula (8);
after the transition from warm-up training to formal training is completed, the operations of steps S7, S8 and S9 are repeated until the user-specified number of training epochs has been performed.
2. The distributed stochastic gradient descent method with compression and delay compensation of claim 1, wherein the gradient computation is implemented using the SGD algorithm, the DCSGD algorithm, or the Adam algorithm.
3. The distributed stochastic gradient descent method with compression and delay compensation of claim 1, implemented on a PS (parameter server) architecture in an MXNet environment, comprising the following steps:
performing n iterations in the warm-up training stage; in each iteration, the computing node uses the global weight W to compute the local gradient Grad, stores it in its buffer comm_buf, and then pushes it to the parameter server; the parameter server updates the global weight W, and the content of the buffer comm_buf is overwritten with the updated global weight W;
in the (n-1)-th iteration of the warm-up training stage, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated in the n-th iteration using the local gradient Grad, providing the new local weight W^loc required by the formal training stage;
In the formal training stage, after acquiring the global weight, the computing node copies the global weight into the buffer loc _ buf, updates the copy of the global weight in the buffer loc _ buf by using a local gradient, and then in the next iteration, the updated local weight participates in the calculation of the local gradient at the next moment; the dependency engine of the MXNet can ensure that the local weights in the buffer loc _ buf are covered by the global weight copy at the next time after participating in the local gradient computation; controlling the iteration number of the preheating training by using a first counter on a computing node, determining whether quantization is performed or not by using a second counter on the computing node, and performing the updating of the local weight and the quantization operation in parallel;
the local gradient is pushed to the parameter server after being encoded into 2-bit data, and the parameter server decodes the 2-bit data back into 32 bits before updating the global weights; if the local gradient does not need to be quantized in a given iteration of the formal training stage, the operation of that iteration is the same as in the warm-up stage, and once the local weight has been updated the next iteration is executed immediately.
4. The distributed stochastic gradient descent method with compression and delay compensation of claim 3, wherein the number of iterations n performed during the warm-up training phase is adjusted according to the complexity of different models.
CN202110904974.2A 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation Active CN113627519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110904974.2A CN113627519B (en) 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110904974.2A CN113627519B (en) 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation

Publications (2)

Publication Number Publication Date
CN113627519A CN113627519A (en) 2021-11-09
CN113627519B true CN113627519B (en) 2022-09-09

Family

ID=78383413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110904974.2A Active CN113627519B (en) 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation

Country Status (1)

Country Link
CN (1) CN113627519B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118384B (en) * 2021-12-09 2024-06-04 安谋科技(中国)有限公司 Quantification method of neural network model, readable medium and electronic device
CN114912587B (en) * 2022-06-09 2023-05-26 上海燧原科技有限公司 Neural network distributed training system, method, device, computing unit and medium
CN115129471A (en) * 2022-06-28 2022-09-30 中国人民解放军国防科技大学 Distributed local random gradient descent method for large-scale GPU cluster

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402469B2 (en) * 2015-10-16 2019-09-03 Google Llc Systems and methods of distributed optimization
US11630994B2 (en) * 2018-02-17 2023-04-18 Advanced Micro Devices, Inc. Optimized asynchronous training of neural networks using a distributed parameter server with eager updates
CN111382844B (en) * 2020-03-11 2023-07-07 华南师范大学 Training method and device for deep learning model
CN111882060A (en) * 2020-07-20 2020-11-03 中国人民解放军国防科技大学 Single-step delay stochastic gradient descent training method for machine learning
CN112381218B (en) * 2020-11-20 2022-04-12 中国人民解放军国防科技大学 Local updating method for distributed deep learning training
CN112862088B (en) * 2021-01-18 2023-11-07 中山大学 Distributed deep learning method based on pipeline annular parameter communication

Also Published As

Publication number Publication date
CN113627519A (en) 2021-11-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant