CN113627519B - Distributed stochastic gradient descent method with compression and delay compensation - Google Patents

Distributed stochastic gradient descent method with compression and delay compensation

Info

Publication number
CN113627519B
CN113627519B (application CN202110904974.2A)
Authority
CN
China
Prior art keywords
weight
local
training
grad
loc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110904974.2A
Other languages
Chinese (zh)
Other versions
CN113627519A (en)
Inventor
董德尊
于恩达
汪杨海
廖湘科
肖立权
徐叶茂
欧阳硕
杨维玲
王笑雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110904974.2A priority Critical patent/CN113627519B/en
Publication of CN113627519A publication Critical patent/CN113627519A/en
Application granted granted Critical
Publication of CN113627519B publication Critical patent/CN113627519B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a distributed stochastic gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes. The computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture. After the transition training is finished, formal training is performed to obtain the local gradient and local weight of the next moment, and compensation training is then performed. The method adopts a unique local update mechanism to cover the extra computational overhead of quantization, is applicable to all lossy compression methods, and therefore has the advantage of a wide application range.

Description

Distributed stochastic gradient descent method with compression and delay compensation
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a distributed stochastic gradient descent method with compression and delay compensation.
Background
At present, distributed training has become an effective method for deep learning model training. A large training data set is partitioned across multiple nodes that perform the training task together. The nodes must therefore share their computed parameters with each other before the global parameters are updated, and the communication cost of this sharing process limits the scalability of the distributed system and seriously reduces the efficiency of distributed training. For example, when training ResNet-50 on a 16-node Nvidia P102-100 GPU cluster connected via 1 Gbps Ethernet, the communication time is more than nine times the computation time. As the number of compute nodes increases, the communication cost worsens. To address the communication problem in distributed training, researchers have proposed many methods for accelerating distributed training, which can be classified into system-level methods and algorithm-level methods.
At the system level, pipelining methods exploit the layered structure of neural networks so that the communication of each back-propagation (BP) layer can overlap with the computation of the next layer. Building on the pipeline approach, a communication priority scheduling mechanism has been proposed to achieve a more aggressive overlap between computation and communication overhead. Recent research efforts have been directed at improving distributed training performance by parallelizing computation and communication operations. The Post-local SGD, K-AVG, and Periodic Averaging methods allow each compute node to perform local updates before communicating, evolving the local models through periodic average synchronization.
At the algorithm level, gradient compression techniques have been proposed to reduce communication traffic; they can be divided into gradient sparsification methods and gradient quantization methods. Gradient quantization converts high-precision gradients into low-precision gradients for communication. The 1-bit quantization method reduces communication traffic by encoding each 32-bit gradient into 1 bit. The QSGD method allows the user to select different quantization levels according to the network bandwidth. The WAGE method and the 8-bit training method quantize not only the gradients but also the weights. Early sparsification methods judged whether to send a gradient by a single threshold. DGC then accelerates large-scale distributed training even further by exchanging only the top 0.1% of gradients in each iteration and accumulating the remaining gradients locally until they become large enough. While these communication algorithms can alleviate communication pressure, they introduce additional computational overhead for data encoding and gradient selection. Worse, the performance improvement of a compression method is not significant when the additional computation overhead plus the gradient computation time is much higher than the communication cost.
Some research efforts have tried to combine the advantages of the system-level and algorithm-level approaches, but the results are still not ideal. The LAGS-SGD approach integrates DGC with the pipeline approach, but does not bring a large speed advantage because of the start-up cost and the additional compression cost of multi-layer communication. The Canary method combines 8-bit quantization with gradient partitioning, but fails to solve the problem of reduced accuracy. OMGS-SGD combines DGC with an optimal merging mechanism, yet it is not much faster than DGC when training communication-intensive models such as VGG-16.
For the communication problem in distributed training, an appropriate method is therefore needed that eliminates or hides the extra cost of compression, solves the accuracy degradation caused by compression, and ultimately delivers a sufficient gain in training efficiency.
Disclosure of Invention
Aiming at the problems of eliminating or hiding the extra cost of compression and of the accuracy degradation caused by compression in communication optimization for distributed training, the invention discloses a distributed stochastic gradient descent method with compression and delay compensation. The method is implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture. The method specifically comprises the following steps:
Firstly, warm-up training is carried out, comprising steps S1, S2 and S3;
S1: retrieve the global weight W_{i-1} at time t-1 from the parameter server;
S2: using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 in each computing node; the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:
Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)
where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, the prediction result Y_j' is computed; the loss value loss is computed from the prediction Y_j' and the label Y_j, and the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss value loss. The gradient calculation can be implemented with the SGD algorithm, the DCSGD algorithm, the Adam algorithm, and the like;
S3: each computing node pushes the computed local gradient value to the parameter server, and the global weight W_{i-1} at time t-1 is updated with the aggregated local gradient values to obtain the global weight W_i at time t:
W_i = W_{i-1} - (η/N) · ∑_{j=1}^{N} Grad_{j,i-1}    (2)
where η is the learning rate hyperparameter and N is the number of computing nodes.
The warm-up training stage repeats steps S1 to S3 several times so that the weights stabilize more quickly; the number of repetitions can be adjusted by the user, and typically 5 epochs of warm-up are performed.
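To make the warm-up step concrete, the following is a minimal NumPy sketch of one iteration of steps S1 to S3. It assumes a generic differentiable model; the helper predict_and_grad, which returns the loss and its gradient for a node's mini-batch, is a hypothetical stand-in for the forward and backward pass that grad_cal performs and is not part of the patent.

# Minimal sketch of one warm-up iteration (S1-S3), assuming a hypothetical
# predict_and_grad(weight, X, Y) -> (loss, grad) helper for the forward/backward pass.
import numpy as np

def grad_cal(weight, X_j, Y_j, predict_and_grad):
    """Local gradient Grad_{j,i-1} computed from the pulled global weight and local data."""
    _, grad = predict_and_grad(weight, X_j, Y_j)
    return grad

def warmup_iteration(global_weight, shards, eta, predict_and_grad):
    """One synchronous step: every node pulls W_{i-1}, pushes its gradient, the server averages."""
    grads = [grad_cal(global_weight, X_j, Y_j, predict_and_grad) for X_j, Y_j in shards]
    # Equation (2): W_i = W_{i-1} - (eta / N) * sum_j Grad_{j,i-1}
    return global_weight - eta * np.mean(grads, axis=0)

In a real deployment each element of shards lives on a different computing node and the averaging happens on the parameter server; the single-process loop above only illustrates the data flow.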
After the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4: the global weight W_i at time t retrieved from the parameter server is backed up and stored in the backup weight variable loc_weight; the local weight W_i^loc of each computing node at time t is thus equal to the global weight W_i at time t and to the backup weight loc_weight. In each computing node, the local gradient at time t is computed; the local gradient value Grad_{j,i} of the j-th computing node at time t is:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5: at each computing node, the computed local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^loc at time t+1:
W_{i+1}^loc = W_i - η · Grad_{j,i}    (4)
While the local weight W_{i+1}^loc is being computed, each computing node uploads its computed local gradient at time t to the parameter server, and the global weight W_i at time t is updated with the aggregated local gradient values at time t to obtain the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · ∑_{j=1}^{N} Grad_{j,i}    (5)
The resulting global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight.
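The following sketch, under the same assumptions as the warm-up sketch above (NumPy, the hypothetical predict_and_grad helper and the grad_cal function defined there), illustrates how the transition step lets the local update of equation (4) proceed concurrently with the push and global update of equation (5); a thread pool stands in for the asynchronous execution engine of a real framework.

# Sketch of the transition step (S4-S5); reuses grad_cal from the warm-up sketch.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def transition_step(global_weight, shards, eta, predict_and_grad):
    loc_weight = global_weight.copy()   # backup of W_i into loc_weight
    grads = [grad_cal(loc_weight, X_j, Y_j, predict_and_grad) for X_j, Y_j in shards]
    with ThreadPoolExecutor() as pool:
        # Equation (4): per-node local weight W_{i+1}^loc = W_i - eta * Grad_{j,i}
        local_futures = [pool.submit(lambda g=g: loc_weight - eta * g) for g in grads]
        # Equation (5): server aggregate W_{i+1} = W_i - (eta / N) * sum_j Grad_{j,i}
        global_future = pool.submit(lambda: global_weight - eta * np.mean(grads, axis=0))
        local_weights = [f.result() for f in local_futures]
        new_global = global_future.result()
    return local_weights, new_global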
After the transition training is finished, formal training is performed, comprising steps S6, S7 and S8;
S6: after obtaining the local weight W_{i+1}^loc, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^loc, X_j, Y_j)    (6)
S7: after each computing node has computed the local gradient Grad_{j,i+1} at time t+1, it stores the gradient in the variable Sgrad_j in 2-bit format; at the same time, the local gradient Grad_{j,i+1} at time t+1 is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^loc at time t+2:
W_{i+2}^loc = W_{i+1} - η · Grad_{j,i+1}    (7)
S8: each computing node pushes the data in the variable Sgrad_j to the parameter server, and the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The calculation formula of the global weight W_{i+2} at time t+2 is:
W_{i+2} = W_{i+1} - (η/N) · ∑_{j=1}^{N} Sgrad_j    (8)
Meanwhile, after the local weight W_{i+2}^loc at time t+2 is obtained, formula (6) is applied to W_{i+2}^loc to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N.
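To illustrate step S7's split between the quantized gradient that travels to the server and the full-precision gradient that updates the local weight, here is a sketch under the same assumptions as the earlier snippets. The quantize_2bit function is a simplified threshold ternarizer standing in for the 2-bit encoding (the actual codec additionally packs the values densely); its default threshold of 0.5 follows the experimental setting described later.

# Sketch of one formal-training step (S6-S8); reuses grad_cal from the warm-up sketch.
import numpy as np

def quantize_2bit(grad, threshold=0.5):
    """Simplified 2-bit style compression: map each entry to -threshold, 0, or +threshold."""
    return np.where(grad > threshold, threshold,
                    np.where(grad < -threshold, -threshold, 0.0))

def formal_step(global_weight, local_weights, shards, eta, predict_and_grad):
    new_locals, sgrads = [], []
    for w_loc, (X_j, Y_j) in zip(local_weights, shards):
        grad = grad_cal(w_loc, X_j, Y_j, predict_and_grad)   # eq. (6): uses W^loc, not W
        sgrads.append(quantize_2bit(grad))                   # Sgrad_j, pushed to the server
        new_locals.append(global_weight - eta * grad)        # eq. (7): full-precision local update
    new_global = global_weight - eta * np.mean(sgrads, axis=0)  # eq. (8): server update
    return new_locals, new_global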
After the formal training is finished, compensation training is performed, comprising step S9;
S9: in every k computations of the global weight, the formal training mode is adopted for the first k-1 iterations, and a compensation computation is performed at every k-th iteration; the compensation computation substitutes the local gradient Grad_j at that moment for the variable Sgrad_j and then computes the global weight with formula (8).
After the transition from warm-up training to formal training is completed, the operations of steps S7, S8 and S9 are repeated until the user-specified number of training epochs has been performed. Models of different complexity require different amounts of training, and training is also constrained by the user's accuracy expectations; experiments show that the convergence of the method is close to that of S-SGD, so S-SGD training examples can be used as a reference when setting the number of training epochs.
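Putting the pieces together, the outer loop below sketches the compensation schedule of step S9 under the same assumptions as the previous snippets (grad_cal and quantize_2bit as defined above): in every k-th iteration the full-precision gradients replace the quantized Sgrad_j on the server side, compensating the accumulated quantization error.

# Sketch of the training loop with k-step compensation (S7-S9).
import numpy as np

def train_loop(global_weight, local_weights, shards, eta, k, num_iters, predict_and_grad):
    for it in range(1, num_iters + 1):
        new_locals, grads, sgrads = [], [], []
        for w_loc, (X_j, Y_j) in zip(local_weights, shards):
            grad = grad_cal(w_loc, X_j, Y_j, predict_and_grad)
            grads.append(grad)
            sgrads.append(quantize_2bit(grad))
            new_locals.append(global_weight - eta * grad)
        # S9: every k-th iteration pushes Grad_j instead of Sgrad_j (compensation step).
        pushed = grads if it % k == 0 else sgrads
        global_weight = global_weight - eta * np.mean(pushed, axis=0)
        local_weights = new_locals
    return global_weight, local_weights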
The invention discloses a distributed stochastic gradient descent method with compression and delay compensation that is implemented on the PS (parameter server) architecture in an MXNet environment and comprises the following steps.
n iterations are performed during the warm-up training phase. In each iteration, the computing node uses the global weight W to compute the local gradient Grad and stores it in its buffer comm_buf; the computed local gradient Grad is then pushed to the parameter server, the parameter server updates the global weight W, and the content of the buffer comm_buf is overwritten with the updated global weight W.
In the (n-1)-th iteration of the warm-up training phase, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated in the n-th iteration using the local gradient Grad, providing the new local weight W^loc required by the formal training phase. The number of warm-up iterations n can be adjusted according to the complexity of different models, and the time required is usually very short.
In the formal training stage, after obtaining the global weight, the computing node copies it into the buffer loc_buf, updates the copy of the global weight in loc_buf with the local gradient, and in the next iteration the updated local weight participates in the computation of the local gradient at the next moment. The dependency engine of MXNet ensures that the local weights in the buffer loc_buf are not overwritten by the next copy of the global weights until they have participated in the local gradient computation. A first counter on the computing node controls the number of warm-up iterations, a second counter on the computing node determines whether quantization is performed, and the local weight update and the quantization operation are performed in parallel.
The local gradient is pushed to the parameter server after being encoded into 2-bit data, and the parameter server decodes the 2-bit data back into 32 bits before updating the global weights. If the local gradient does not need to be quantized in a given iteration of the formal training stage, the operation of that iteration is the same as in the warm-up stage, and once the local weight has been updated the next iteration is executed immediately.
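The double-buffer arrangement described above can be pictured with the short sketch below. It reuses grad_cal and quantize_2bit from the earlier snippets and uses a Python thread pool as a stand-in for MXNet's dependency engine; the variable names comm_buf and loc_buf follow the text, everything else is illustrative.

# Sketch of one formal-training iteration on a single node with the two buffers.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def formal_iteration(global_weight, w_loc_prev, X_j, Y_j, eta, predict_and_grad):
    comm_buf = grad_cal(w_loc_prev, X_j, Y_j, predict_and_grad)  # local gradient Grad
    loc_buf = global_weight.copy()                               # pulled global weight copied to loc_buf
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both tasks only read comm_buf, so they can be issued in parallel.
        quant_future = pool.submit(quantize_2bit, comm_buf)           # encode for the push
        local_future = pool.submit(lambda: loc_buf - eta * comm_buf)  # update the copy in loc_buf
        sgrad = quant_future.result()       # pushed to the parameter server
        w_loc_next = local_future.result()  # used for the next iteration's gradient
    return sgrad, w_loc_next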
The invention has the beneficial effects that:
the method employs a unique local update mechanism to mask the quantization overhead. The introduction of the compensation mechanism in the method is applicable to all lossy compression methods (including quantization and sparsification methods), is a mainstream technology for accelerating training, and the existing lossy compression method can make up for the precision loss by using the method, so the method has the advantage of wide application range.
Drawings
FIG. 1 compares the training accuracy of the LeNet-5 model on 2 computing nodes for different methods;
FIG. 2 compares the training accuracy of the ResNet-50 model on 4 computing nodes for different methods;
FIG. 3 compares the training accuracy of the ResNet-50 model on 8 computing nodes for different methods;
FIG. 4 shows the speed-up ratios of different methods trained with 4 computing nodes in a K80 cluster;
FIG. 5 shows the speed-up ratios of different methods trained with 4 computing nodes in a V100 cluster.
Detailed Description
For a better understanding of the present disclosure, an example is given here.
The invention discloses a distributed stochastic gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture. The method specifically comprises the following steps:
Firstly, warm-up training is carried out, comprising steps S1, S2 and S3;
S1: retrieve the global weight W_{i-1} at time t-1 from the parameter server;
S2: using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 in each computing node; the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:
Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)
where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, the prediction result Y_j' is computed; the loss value loss is computed from the prediction Y_j' and the label Y_j, and the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss value loss. The gradient calculation can be implemented with the SGD algorithm, the DCSGD algorithm, the Adam algorithm, and the like, so the method has good generality;
S3: each computing node pushes the computed local gradient value to the parameter server, and the global weight W_{i-1} at time t-1 is updated with the aggregated local gradient values to obtain the global weight W_i at time t:
W_i = W_{i-1} - (η/N) · ∑_{j=1}^{N} Grad_{j,i-1}    (2)
where η is the learning rate hyperparameter and N is the number of computing nodes.
The warm-up training stage repeats steps S1 to S3 several times so that the weights stabilize more quickly; the number of repetitions can be adjusted by the user, and typically 5 epochs of warm-up are performed.
After the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4: the global weight W_i at time t retrieved from the parameter server is backed up and stored in the backup weight variable loc_weight; the local weight W_i^loc of each computing node at time t is thus equal to the global weight W_i at time t and to the backup weight loc_weight. In each computing node, the local gradient at time t is computed; the local gradient value Grad_{j,i} of the j-th computing node at time t is:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5: at each computing node, the computed local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^loc at time t+1:
W_{i+1}^loc = W_i - η · Grad_{j,i}    (4)
While the local weight W_{i+1}^loc is being computed, each computing node uploads its computed local gradient at time t to the parameter server, and the global weight W_i at time t is updated with the aggregated local gradient values at time t to obtain the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · ∑_{j=1}^{N} Grad_{j,i}    (5)
The resulting global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight.
After the transition training is finished, formal training is performed, comprising steps S6, S7 and S8;
S6: after obtaining the local weight W_{i+1}^loc, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^loc, X_j, Y_j)    (6)
S7: after each computing node has computed the local gradient Grad_{j,i+1} at time t+1, it stores the gradient in the variable Sgrad_j in 2-bit format; at the same time, the local gradient Grad_{j,i+1} at time t+1 is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^loc at time t+2:
W_{i+2}^loc = W_{i+1} - η · Grad_{j,i+1}    (7)
S8: each computing node pushes the data in the variable Sgrad_j to the parameter server, and the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The calculation formula of the global weight W_{i+2} at time t+2 is:
W_{i+2} = W_{i+1} - (η/N) · ∑_{j=1}^{N} Sgrad_j    (8)
Meanwhile, after the local weight W_{i+2}^loc at time t+2 is obtained, formula (6) is applied to W_{i+2}^loc to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N.
After the formal training is finished, compensation training is performed, comprising step S9;
S9: in every k computations of the global weight, the formal training mode is adopted for the first k-1 iterations, and a compensation computation is performed at every k-th iteration; the compensation computation substitutes the local gradient Grad_j at that moment for the variable Sgrad_j and then computes the global weight with formula (8).
After the transition from warm-up training to formal training is completed, the operations of steps S7, S8 and S9 are repeated until the user-specified number of training epochs has been performed.
The method can be implemented on existing mainstream distributed learning frameworks. For a better understanding of the invention, CD-SGD, an implementation of the method on the PS (parameter server) architecture of MXNet, is given here.
The invention discloses a distributed stochastic gradient descent method with compression and delay compensation that is implemented on the PS (parameter server) architecture in an MXNet environment and comprises the following steps.
n iterations are performed during the warm-up training phase. In each iteration, the computing node uses the global weight W to compute the local gradient Grad, stores it in its buffer comm_buf, and then pushes it to the parameter server; the parameter server updates the global weight W and overwrites the content of the buffer comm_buf with the updated global weight W, so only one buffer is needed for this work.
In the (n-1)-th iteration of the warm-up training phase, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated in the n-th iteration using the local gradient Grad, providing the new local weight W^loc required by the formal training phase. The number of warm-up iterations n can be adjusted according to the complexity of different models, and the time required is usually very short.
In the formal training stage, after obtaining the global weight, the computing node copies it into the buffer loc_buf, updates the copy of the global weight in loc_buf with the local gradient, and in the next iteration the updated local weight participates in the computation of the local gradient at the next moment. The dependency engine of MXNet ensures that the local weights in the buffer loc_buf are not overwritten by the next copy of the global weights until they have participated in the local gradient computation. A first counter on the computing node controls the number of warm-up iterations, and a second counter on the computing node determines whether quantization is performed; the local weight update and the quantization operation are performed in parallel because both only read the buffer comm_buf without modifying it.
The local gradient is pushed to the parameter server after being encoded into 2-bit data, and the parameter server decodes the 2-bit data back into 32 bits before updating the global weights. If the local gradient does not need to be quantized in a given iteration of the formal training stage, the operation of that iteration is the same as in the warm-up stage, and once the local weight has been updated the next iteration is executed immediately. The local update mechanism adopted here ensures that the computation in the next iteration is not delayed by compression, and it works together with parallel communication.
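For concreteness, here is a simplified encode/decode round trip for a threshold-based 2-bit representation. It only illustrates the idea of sending 2-bit codes and expanding them back to 32-bit floats on the parameter server; it is not MXNet's actual codec (a real implementation would also pack the codes densely rather than store one code per byte), and the default threshold of 0.5 follows the experimental setting below.

# Simplified 2-bit encode/decode round trip for the gradient push.
import numpy as np

def encode_2bit(grad, threshold=0.5):
    """2-bit codes: 0 -> zero, 1 -> +threshold, 2 -> -threshold (one code per entry)."""
    codes = np.zeros(grad.shape, dtype=np.uint8)
    codes[grad > threshold] = 1
    codes[grad < -threshold] = 2
    return codes

def decode_2bit(codes, threshold=0.5):
    """Expand the codes back to 32-bit floats before the global weight update."""
    table = np.array([0.0, threshold, -threshold], dtype=np.float32)
    return table[codes]

# Example: each float32 gradient entry carries only 2 bits of information on the wire.
g = np.array([0.9, -0.7, 0.1, -0.2], dtype=np.float32)
assert np.allclose(decode_2bit(encode_2bit(g)), [0.5, -0.5, 0.0, 0.0])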
The CD-SGD method is designed to inherit the advantage of gradient quantization in reducing communication traffic while compensating for the accuracy loss caused by quantization, masking the quantization overhead, and improving the overlap of computation and communication so as to improve training efficiency. Here we evaluate whether the CD-SGD method achieves these goals. FIG. 1 compares the training accuracy of the LeNet-5 model on 2 computing nodes for different methods; FIG. 2 compares the training accuracy of the ResNet-50 model on 4 computing nodes for different methods; FIG. 3 compares the training accuracy of the ResNet-50 model on 8 computing nodes for different methods; FIG. 4 shows the speed-up ratios of different methods trained with 4 computing nodes in a K80 cluster; FIG. 5 shows the speed-up ratios of different methods trained with 4 computing nodes in a V100 cluster.
Fig. 1 to Fig. 3 compare the convergence accuracy curves of several methods: CD-SGD (the method of the present invention), S-SGD (the traditional high-accuracy synchronous update algorithm), BIT-SGD (the 2-bit quantization method provided by MXNet, the optimization target of the present invention), and OD-SGD (a local update algorithm whose local update mechanism this invention uses as a reference). In the experiments, 4 parameter server nodes and 4 computing nodes are used, the batch size per node is 128, the global learning rate (lr) of all algorithms is 0.1, the local lr of CD-SGD and OD-SGD is 0.4, and the warm-up period is 5 epochs. In addition, the quantization threshold of CD-SGD and BIT-SGD is 0.5, and the precision compensation step size k of CD-SGD is 2. From Fig. 1 to Fig. 3 the following conclusions can be drawn: (1) the test accuracy of BIT-SGD is clearly lower than that of the other three methods, but CD-SGD solves the accuracy problem of BIT-SGD, and its convergence is very close to that of S-SGD and even higher than that of OD-SGD. (2) CD-SGD maintains good accuracy scalability: when the number of nodes increases, its accuracy still differs from that of S-SGD by no more than 0.5%.
Fig. 4 and Fig. 5 show the relative speed-ups of the above methods trained on two different clusters, K80 and V100. The precision compensation step size k of CD-SGD is 5, and the remaining parameter settings are the same as above. The speed of S-SGD is used as the baseline. From Fig. 4 and Fig. 5 the following conclusions can be drawn: (1) when ResNet-50 is trained on K80, CD-SGD obtains the same training speed as OD-SGD. We attribute this to the limited computational power of the K80, which creates a computational bottleneck. Furthermore, when VGG-16 and Inception-BN are trained, the performance of BIT-SGD is worse than that of OD-SGD, unlike the training of AlexNet. This means that, in addition to masking the compression overhead and reducing communication time, a parallel mechanism is needed to increase the overlap of computation and communication. The speed-ups of CD-SGD on the models shown in FIG. 4 are 0%, 43%, 33%, and 32%, respectively. (2) The V100 GPU cluster has stronger computing power and completes the computation in a shorter time, so BIT-SGD outperforms OD-SGD when training most of the models in FIG. 5, because the limited computation cost cannot fully cover the communication time. However, BIT-SGD is slower than OD-SGD when training Inception-BN, because Inception-BN has many computation layers and therefore a significant computation cost. The speed-ups of CD-SGD on the models shown in FIG. 5 are 24%, 43%, 39%, and 44%, respectively. In FIG. 4 and FIG. 5, the training efficiency of CD-SGD improves over BIT-SGD by 3% to 30%.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (4)

1. A distributed stochastic gradient descent method with compression and delay compensation, characterized in that it is implemented with a parameter server and computing nodes, wherein the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the PS (parameter server) architecture, the method comprising the following specific steps:
firstly, warm-up training is carried out, comprising steps S1, S2 and S3;
S1: retrieving the global weight W_{i-1} at time t-1 from the parameter server;
S2: using the global weight W_{i-1} at time t-1, computing the local gradient at time t-1 in each computing node, the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 being computed as:
Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)
where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, the prediction result Y_j' is computed; the loss value loss is computed from the prediction Y_j' and the label Y_j, and the local gradient value Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss value loss;
S3: each computing node pushes the computed local gradient value to the parameter server, and the global weight W_{i-1} at time t-1 is updated with the aggregated local gradient values to obtain the global weight W_i at time t:
W_i = W_{i-1} - (η/N) · ∑_{j=1}^{N} Grad_{j,i-1}    (2)
where η is the learning rate hyperparameter and N is the number of computing nodes;
the warm-up training stage repeats steps S1 to S3 several times;
after the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4: the global weight W_i at time t retrieved from the parameter server is backed up and stored in the backup weight variable loc_weight; the local weight W_i^loc of each computing node at time t is thus equal to the global weight W_i at time t and to the backup weight loc_weight; in each computing node, the local gradient at time t is computed, the local gradient value Grad_{j,i} of the j-th computing node at time t being:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5: at each computing node, the computed local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^loc at time t+1:
W_{i+1}^loc = W_i - η · Grad_{j,i}    (4)
while the local weight W_{i+1}^loc is being computed, each computing node uploads its computed local gradient at time t to the parameter server, and the global weight W_i at time t is updated with the aggregated local gradient values at time t to obtain the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · ∑_{j=1}^{N} Grad_{j,i}    (5)
the resulting global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight;
after the transition training is finished, formal training is performed, comprising steps S6, S7 and S8;
S6: after obtaining the local weight W_{i+1}^loc, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^loc, X_j, Y_j)    (6)
S7: after each computing node has computed the local gradient Grad_{j,i+1} at time t+1, it stores the gradient in the variable Sgrad_j in 2-bit format; at the same time, the local gradient Grad_{j,i+1} at time t+1 is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^loc at time t+2:
W_{i+2}^loc = W_{i+1} - η · Grad_{j,i+1}    (7)
S8: each computing node pushes the data in the variable Sgrad_j to the parameter server, and the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight, the calculation formula of the global weight W_{i+2} at time t+2 being:
W_{i+2} = W_{i+1} - (η/N) · ∑_{j=1}^{N} Sgrad_j    (8)
meanwhile, after the local weight W_{i+2}^loc at time t+2 is obtained, formula (6) is applied to W_{i+2}^loc to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N;
After the formal training is finished, performing compensation training, which includes step S9;
S9: in every k computations of the global weight, the formal training mode is adopted for the first k-1 iterations, and a compensation computation is performed at every k-th iteration; the compensation computation substitutes the local gradient Grad_j at that moment for the variable Sgrad_j and then computes the global weight with formula (8);
after the transition from warm-up training to formal training is completed, the operations of steps S7, S8 and S9 are repeated until the user-specified number of training epochs has been performed.
2. The distributed stochastic gradient descent method with compression and delay compensation of claim 1, wherein the gradient computation is implemented using the SGD algorithm, the DCSGD algorithm, or the Adam algorithm.
3. The distributed stochastic gradient descent method with compression and delay compensation of claim 1, implemented on a PS (parameter server) architecture in an MXNet environment, comprising the following steps:
performing n iterations in the warm-up training stage; in each iteration, the computing node uses the global weight W to compute the local gradient Grad, stores it in its buffer comm_buf, and then pushes it to the parameter server; the parameter server updates the global weight W, and the content of the buffer comm_buf is overwritten with the updated global weight W;
in the (n-1)-th iteration of the warm-up training stage, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated in the n-th iteration using the local gradient Grad, providing the new local weight W^loc required by the formal training stage;
In the formal training stage, after acquiring the global weight, the computing node copies the global weight into the buffer loc _ buf, updates the copy of the global weight in the buffer loc _ buf by using a local gradient, and then in the next iteration, the updated local weight participates in the calculation of the local gradient at the next moment; the dependency engine of the MXNet can ensure that the local weights in the buffer loc _ buf are covered by the global weight copy at the next time after participating in the local gradient computation; controlling the iteration number of the preheating training by using a first counter on a computing node, determining whether quantization is performed or not by using a second counter on the computing node, and performing the updating of the local weight and the quantization operation in parallel;
the local gradient is pushed to the parameter server after being encoded into 2-bit data, and the parameter server decodes the 2-bit data back into 32 bits before updating the global weights; if the local gradient does not need to be quantized in a given iteration of the formal training stage, the operation of that iteration is the same as in the warm-up stage, and once the local weight has been updated the next iteration is executed immediately.
4. The distributed stochastic gradient descent method with compression and delay compensation of claim 3, wherein the number of iterations n performed during the warm-up training phase is adjusted according to the complexity of different models.
CN202110904974.2A 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation Active CN113627519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110904974.2A CN113627519B (en) 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110904974.2A CN113627519B (en) 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation

Publications (2)

Publication Number Publication Date
CN113627519A CN113627519A (en) 2021-11-09
CN113627519B true CN113627519B (en) 2022-09-09

Family

ID=78383413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110904974.2A Active CN113627519B (en) 2021-08-07 2021-08-07 Distributed random gradient descent method with compression and delay compensation

Country Status (1)

Country Link
CN (1) CN113627519B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118384B (en) * 2021-12-09 2024-06-04 安谋科技(中国)有限公司 Quantification method of neural network model, readable medium and electronic device
CN114912587B (en) * 2022-06-09 2023-05-26 上海燧原科技有限公司 Neural network distributed training system, method, device, computing unit and medium
CN115129471A (en) * 2022-06-28 2022-09-30 中国人民解放军国防科技大学 Distributed local random gradient descent method for large-scale GPU cluster

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402469B2 (en) * 2015-10-16 2019-09-03 Google Llc Systems and methods of distributed optimization
US11630994B2 (en) * 2018-02-17 2023-04-18 Advanced Micro Devices, Inc. Optimized asynchronous training of neural networks using a distributed parameter server with eager updates
CN111382844B (en) * 2020-03-11 2023-07-07 华南师范大学 Training method and device for deep learning model
CN111882060A (en) * 2020-07-20 2020-11-03 中国人民解放军国防科技大学 Single-step delay stochastic gradient descent training method for machine learning
CN112381218B (en) * 2020-11-20 2022-04-12 中国人民解放军国防科技大学 Local updating method for distributed deep learning training
CN112862088B (en) * 2021-01-18 2023-11-07 中山大学 Distributed deep learning method based on pipeline annular parameter communication

Also Published As

Publication number Publication date
CN113627519A (en) 2021-11-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant