CN113627519B - Distributed random gradient descent method with compression and delay compensation - Google Patents
- Publication number
- CN113627519B (application CN202110904974.2A)
- Authority
- CN
- China
- Prior art keywords
- weight
- local
- training
- grad
- loc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a distributed random gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, and the parameter server receives and aggregates the gradients from all computing nodes and then performs the update; data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the Parameter Server (PS) architecture. After the transition training finishes, formal training is performed to obtain the local gradient and local weight of the next moment, followed by compensation training. The method adopts a unique local update mechanism to cover the extra computational overhead of quantization, is applicable to all lossy compression methods, and thus has the advantage of a wide application range.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a distributed random gradient descent method with compression and delay compensation.
Background
At present, distributed training has become an effective method for deep learning model training. A large training data set is partitioned into multiple nodes to perform the training task. Therefore, the nodes must share their calculation parameters with each other before updating the global parameters, and the communication cost of the sharing process limits the scalability of the distributed system, and seriously reduces the efficiency of the distributed training. For example, when training ResNet-50 on a 16-node Nvidia P102-100 GPU cluster connected via 1Gbps Ethernet, the communication time is more than nine times the computation time. As the number of compute nodes increases, the communication cost tends to deteriorate. In order to solve the communication problem in distributed training, researchers have proposed many methods for accelerating distributed training, which can be classified into a system-level method and an algorithm-level method.
On the system level, the pipeline method is based on hierarchical structure optimization training of the neural network, so that each Back Propagation (BP) can be in overlapped communication with the calculation process of the next layer. Following the pipeline approach, a communication priority scheduling mechanism is proposed to achieve a more aggressive overlap between computational and communication overhead. Recent research efforts have been directed to improving distributed training performance by parallelizing computational and communication operations. The Post-local SGD, K-AVG, and Periodic Averaging methods allow each compute node to perform local updates prior to communication, evolving the local model through average synchronization.
At the algorithm level, gradient compression techniques have been proposed to reduce communication traffic; they can be divided into gradient sparsification methods and gradient quantization methods. Gradient quantization converts high-precision gradients into low-precision gradients for communication. The 1-bit quantization method reduces communication traffic by encoding a 32-bit gradient into 1 bit. The QSGD method lets the user select different quantization levels according to network bandwidth. The WAGE method and the 8-bit training method quantize not only the gradients but also the weights. Early sparsification methods decided whether to send a gradient by a single threshold. DGC further accelerates large-scale distributed training by exchanging only the top 0.1% of gradients in each iteration and accumulating the others until they become large enough. While these communication algorithms can relieve communication pressure, they introduce extra computational overhead for data encoding and gradient selection. Worse yet, when this extra overhead plus the gradient computation time is much higher than the communication cost, the performance improvement of compression is not significant.
Some research efforts have been made to combine the advantages of the system-level approach and the algorithm-level approach, but the results are still not ideal. The LAGS-SGD approach integrates DGC with the pipeline approach, but does not bring a great speed advantage due to the start-up cost and the additional compression cost of multi-layer communication. The Canary method combines 8-bit quantization with gradient partitioning, but fails to solve the problem of reduced precision. OMGS-SGD combines DGC with an optimal merging mechanism, and it is not much faster than the DGC approach when training dense communication models such as VGG-16.
For the communication problem in distributed training, an appropriate method is therefore needed that eliminates or masks the extra cost of compression, solves the accuracy degradation caused by compression, and ultimately delivers a sufficient gain in training efficiency.
Disclosure of Invention
Aiming at the problems in communication optimization for distributed training of eliminating or masking the extra cost of compression and of the accuracy degradation caused by compression, the invention discloses a distributed random gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the Parameter Server (PS) architecture. The method specifically comprises the following steps:
firstly, carrying out preheating training, wherein the preheating training comprises a step S1, a step S2 and a step S3;
S1, retrieve the global weight W_{i-1} at time t-1 from the parameter server.
S2, using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 on each computing node; the local gradient Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:

Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)

where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, it computes the prediction Y_j'; the loss value is calculated from the prediction Y_j' and the label Y_j, and the local gradient Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss. The gradient calculation can be realized with the SGD algorithm, the DCSGD algorithm, the Adam algorithm, etc.
S3, each computing node pushes its computed local gradient to the parameter server, which uses the received local gradient values to update the global weight W_{i-1} at time t-1 and obtain the global weight W_i at time t:

W_i = W_{i-1} - (η/N) · Σ_{j=1}^{N} Grad_{j,i-1}    (2)

where η is the learning-rate hyper-parameter and N is the number of computing nodes.
The warm-up training phase repeats steps S1 to S3 several times so that the weights stabilize more quickly; the number of repetitions can be adjusted by the user, and typically 5 epochs of warm-up are performed.
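Steps S1 to S3 of the warm-up phase amount to one synchronous parameter-server SGD iteration. A minimal sketch, assuming a user-supplied grad_cal and the averaged SGD update of step S3 (names are illustrative):

```python
import numpy as np

def warmup_step(w_global, node_batches, grad_cal, lr=0.1):
    """One synchronous warm-up iteration (steps S1-S3): every node
    computes its local gradient from the shared global weight, and the
    server averages the N gradients and applies one SGD step."""
    grads = [grad_cal(w_global, X, Y) for (X, Y) in node_batches]  # step S2 on each node
    return w_global - lr * np.mean(grads, axis=0)                  # step S3 on the server
```

Repeating this step for a few epochs plays the role of the warm-up phase before the transition and formal phases begin.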
After the preheating training is finished, performing transition training, wherein the transition training comprises a step S4 and a step S5;
S4, the global weight W_i at time t retrieved from the parameter server is backed up into the backup weight variable loc_weight; at this point the local weight W_i^{loc} of every computing node at time t equals the global weight W_i at time t and also equals the backup weight loc_weight. On each computing node, compute the local gradient at time t; the local gradient Grad_{j,i} of the j-th computing node at time t is computed as:

Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5, on each computing node, use the computed local gradient Grad_{j,i} at time t to update W_i, generating the local weight W_{i+1}^{loc} at time t+1:

W_{i+1}^{loc} = W_i - η · Grad_{j,i}    (4)

While the local weight W_{i+1}^{loc} is being computed, each computing node uploads its local gradient at time t to the parameter server, which uses the received local gradient values to update the global weight W_i at time t into the global weight W_{i+1} at time t+1:

W_{i+1} = W_i - (η/N) · Σ_{j=1}^{N} Grad_{j,i}    (5)

The obtained global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight.
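The paired updates of the transition step (local update on each node, aggregated update on the server, plus the loc_weight backup) can be sketched sequentially; the function name and the sequential stand-in for the parallel upload are assumptions:

```python
import numpy as np

def transition_step(w_global, local_grads, lr=0.1):
    """Transition iteration sketch (steps S4-S5): each node keeps a
    locally-updated copy w_loc so the next local gradient can start
    before the server's aggregated update arrives; the new global
    weight is backed up into loc_weight."""
    w_locs = [w_global - lr * g for g in local_grads]      # per-node local weights
    w_next = w_global - lr * np.mean(local_grads, axis=0)  # server-side aggregate
    loc_weight = w_next.copy()                             # backup of the new global
    return w_locs, w_next, loc_weight
```

In the real system the two updates run concurrently; computing them one after the other here only serves to show the data flow.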
After the transitional training is finished, performing formal training, which comprises a step S6, a step S7 and a step S8;
S6, as soon as the local weight W_{i+1}^{loc} is obtained, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:

Grad_{j,i+1} = grad_cal(W_{i+1}^{loc}, X_j, Y_j), 1 ≤ j ≤ N    (6)

S7, after computing the local gradient at time t+1, each computing node stores Grad_{j,i+1} in 2-bit format in the variable Sgrad_j, and at the same time uses Grad_{j,i+1} to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^{loc} at time t+2:

W_{i+2}^{loc} = W_{i+1} - η · Grad_{j,i+1}    (7)

S8, each computing node pushes the data in the variable Sgrad_j to the parameter server; the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The global weight W_{i+2} at time t+2 is computed as:

W_{i+2} = W_{i+1} - (η/N) · Σ_{j=1}^{N} dec(Sgrad_j)    (8)

where dec(·) denotes decoding the 2-bit data back to 32-bit values. Meanwhile, once the local weight W_{i+2}^{loc} at time t+2 is obtained, formula (6) is applied with W_{i+2}^{loc} to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N.
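The formal-training iteration can be sketched on a single node (so no averaging over N). The 2-bit quantizer below, a simple sign-and-threshold scheme, is an assumed stand-in for the patent's actual encoder; function names are illustrative:

```python
import numpy as np

def quantize_2bit(grad, threshold=0.5):
    """Assumed 2-bit quantizer: maps each value to {-1, 0, +1}
    (representable in two bits) by comparing against a threshold."""
    q = np.zeros(grad.shape, dtype=np.int8)
    q[grad > threshold] = 1
    q[grad < -threshold] = -1
    return q

def formal_step(w_global, grad, lr=0.1, threshold=0.5):
    """One formal-training iteration on a node, sketching steps S6-S8:
    the full-precision gradient updates the local weight (lossless),
    while only its 2-bit code Sgrad travels to the server, which
    decodes it back to floats before updating the global weight."""
    sgrad = quantize_2bit(grad, threshold)            # Sgrad_j, 2-bit code
    w_loc_next = w_global - lr * grad                 # local weight: full precision
    decoded = sgrad.astype(np.float32) * threshold    # server-side 32-bit decode
    w_global_next = w_global - lr * decoded           # global weight update
    return w_loc_next, w_global_next, sgrad
```

The gap between w_loc_next and w_global_next is exactly the quantization loss that the compensation step S9 periodically repairs.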
After the formal training is finished, performing compensation training, which includes step S9;
S9, in every group of k global-weight calculations, the first k-1 iterations use the formal training mode, and every k-th iteration performs a compensation calculation: the full-precision local gradient Grad_j at that moment is used in place of the variable Sgrad_j, and the global weight is then computed with formula (8).
After the transition from warm-up training to formal training is complete, the operations of steps S7, S8, and S9 are repeated until the user-specified number of training epochs has been executed. Models of different complexity require different amounts of training, and the choice is also constrained by the user's accuracy expectations; experiments show that the convergence of this method is close to that of S-SGD, so the number of training epochs can be set by reference to S-SGD training examples.
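The k-step compensation schedule of step S9 can be sketched as a simple plan generator; the function name and the string labels are hypothetical:

```python
def training_schedule(num_iters, k):
    """Sketch of the step-S9 compensation schedule: in each group of k
    iterations, the first k-1 push the 2-bit code Sgrad, and the k-th
    pushes the full-precision local gradient Grad instead, repairing
    accumulated quantization loss."""
    plan = []
    for it in range(1, num_iters + 1):
        plan.append("full" if it % k == 0 else "quantized")
    return plan
```

Smaller k compensates more often (better accuracy, more traffic); the experiments in this patent use k = 2 for the accuracy runs and k = 5 for the speed runs.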
The invention discloses a distributed random gradient descent method with compression and delay compensation, implemented based on the Parameter Server (PS) architecture in an MXNet environment, which comprises the following steps.
n iterations are performed during the warm-up training phase. In each iteration, the computing node uses the global weight W to compute the local gradient Grad, the computed local gradient Grad is stored in the buffer comm _ buf of the computing node, then the computed local gradient Grad is pushed to the parameter server, the parameter server updates the global weight W, and the content in the buffer comm _ buf is covered by the updated global weight W.
In the (n-1)-th iteration of the warm-up training phase, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; in the n-th iteration the data in loc_buf is updated with the local gradient Grad, providing the new local weight W^{loc} required by the formal training phase. The number of warm-up iterations n can be adjusted for models of different complexity, and the time required is usually very short.
In the formal training phase, after obtaining the global weight, the computing node copies it into the buffer loc_buf, updates this copy with the local gradient, and in the next iteration the updated local weight participates in computing the local gradient at the next moment. The dependency engine of MXNet ensures that the local weight in loc_buf is not overwritten by the next copy of the global weight until it has participated in the local gradient computation. A first counter on the computing node controls the number of warm-up training iterations, a second counter on the computing node determines whether quantization is performed, and the local weight update and the quantization operation are executed in parallel.
The local gradient is encoded into 2-bit data before being pushed to the parameter server, and the parameter server decodes the 2-bit data back to 32 bits before updating the global weight. If the local gradient does not need to be quantized in a given iteration of the formal training phase, that iteration proceeds exactly as in the warm-up phase, and as soon as the local weight has been updated the next iteration starts immediately.
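A plausible 2-bit encode/decode pair matching this description (four ternary codes packed per byte, decoded back to 32-bit floats on the server) might look like the following; the packing layout and threshold-based decoding are assumptions, not MXNet's actual 2-bit compression scheme:

```python
import numpy as np

def encode_2bit(grad, threshold=0.5):
    """Hypothetical 2-bit packing: each gradient value maps to a 2-bit
    code (0 -> zero, 1 -> +threshold, 2 -> -threshold), four codes per byte."""
    codes = np.zeros(len(grad), dtype=np.uint8)
    codes[grad > threshold] = 1
    codes[grad < -threshold] = 2
    pad = (-len(codes)) % 4                 # pad to a multiple of 4 codes
    codes = np.pad(codes, (0, pad))
    packed = (codes[0::4] | (codes[1::4] << 2)
              | (codes[2::4] << 4) | (codes[3::4] << 6))
    return packed, len(grad)

def decode_2bit(packed, n, threshold=0.5):
    """Server-side decode back to 32-bit floats before the weight update."""
    codes = np.empty(len(packed) * 4, dtype=np.uint8)
    for s in range(4):                      # unpack the four 2-bit slots
        codes[s::4] = (packed >> (2 * s)) & 0b11
    lut = np.array([0.0, threshold, -threshold, 0.0], dtype=np.float32)
    return lut[codes[:n]]
```

The round trip turns 32-bit floats into one quarter-byte each, a 16x reduction in traffic, at the cost of the precision loss that the compensation mechanism later repairs.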
The invention has the beneficial effects that:
the method employs a unique local update mechanism to mask the quantization overhead. The introduction of the compensation mechanism in the method is applicable to all lossy compression methods (including quantization and sparsification methods), is a mainstream technology for accelerating training, and the existing lossy compression method can make up for the precision loss by using the method, so the method has the advantage of wide application range.
Drawings
FIG. 1 shows the comparison of LeNet-5 model training accuracy on 2 compute nodes using different methods;
FIG. 2 is a diagram illustrating the comparison of the training accuracy of the ResNet-50 model on 4 computing nodes by different methods;
FIG. 3 is a diagram illustrating the comparison of the training accuracy of ResNet-50 models on 8 computing nodes by different methods;
FIG. 4 is a speed boost ratio for different methods using 4 compute nodes in a K80 cluster for training;
FIG. 5 is a speed boost ratio for training with 4 compute nodes in a V100 cluster according to various methods.
Detailed Description
For a better understanding of the present disclosure, an example is given here.
The invention discloses a distributed random gradient descent method with compression and delay compensation, implemented with a parameter server and computing nodes: the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the Parameter Server (PS) architecture. The method specifically comprises the following steps:
firstly, carrying out preheating training, wherein the preheating training comprises a step S1, a step S2 and a step S3;
S1, retrieve the global weight W_{i-1} at time t-1 from the parameter server.

S2, using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 on each computing node; the local gradient Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:

Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)

where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, it computes the prediction Y_j'; the loss value is calculated from the prediction Y_j' and the label Y_j, and the local gradient Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss. The gradient calculation can be realized with the SGD algorithm, the DCSGD algorithm, the Adam algorithm, etc., giving the method good generality.
S3, each computing node pushes its computed local gradient to the parameter server, which uses the received local gradient values to update the global weight W_{i-1} at time t-1 and obtain the global weight W_i at time t:

W_i = W_{i-1} - (η/N) · Σ_{j=1}^{N} Grad_{j,i-1}    (2)

where η is the learning-rate hyper-parameter and N is the number of computing nodes.
The warm-up training phase repeats steps S1 to S3 several times so that the weights stabilize more quickly; the number of repetitions can be adjusted by the user, and typically 5 epochs of warm-up are performed.
After the preheating training is finished, performing transition training, which comprises a step S4 and a step S5;
S4, the global weight W_i at time t retrieved from the parameter server is backed up into the backup weight variable loc_weight; at this point the local weight W_i^{loc} of every computing node at time t equals the global weight W_i at time t and also equals the backup weight loc_weight. On each computing node, compute the local gradient at time t; the local gradient Grad_{j,i} of the j-th computing node at time t is computed as:

Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5, on each computing node, use the computed local gradient Grad_{j,i} at time t to update W_i, generating the local weight W_{i+1}^{loc} at time t+1:

W_{i+1}^{loc} = W_i - η · Grad_{j,i}    (4)

While the local weight W_{i+1}^{loc} is being computed, each computing node uploads its local gradient at time t to the parameter server, which uses the received local gradient values to update the global weight W_i at time t into the global weight W_{i+1} at time t+1:

W_{i+1} = W_i - (η/N) · Σ_{j=1}^{N} Grad_{j,i}    (5)

The obtained global weight W_{i+1} at time t+1 is backed up into the weight variable loc_weight.
After the transitional training is finished, performing formal training, which comprises a step S6, a step S7 and a step S8;
S6, as soon as the local weight W_{i+1}^{loc} is obtained, each computing node immediately starts computing the local gradient Grad_{j,i+1} at time t+1:

Grad_{j,i+1} = grad_cal(W_{i+1}^{loc}, X_j, Y_j), 1 ≤ j ≤ N    (6)

S7, after computing the local gradient at time t+1, each computing node stores Grad_{j,i+1} in 2-bit format in the variable Sgrad_j, and at the same time uses Grad_{j,i+1} to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^{loc} at time t+2:

W_{i+2}^{loc} = W_{i+1} - η · Grad_{j,i+1}    (7)

S8, each computing node pushes the data in the variable Sgrad_j to the parameter server; the parameter server computes the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The global weight W_{i+2} at time t+2 is computed as:

W_{i+2} = W_{i+1} - (η/N) · Σ_{j=1}^{N} dec(Sgrad_j)    (8)

where dec(·) denotes decoding the 2-bit data back to 32-bit values. Meanwhile, once the local weight W_{i+2}^{loc} at time t+2 is obtained, formula (6) is applied with W_{i+2}^{loc} to compute the local gradient Grad_{j,i+2} of each computing node at time t+2, 1 ≤ j ≤ N.
After the formal training is finished, performing compensation training, which includes step S9;
S9, in every group of k global-weight calculations, the first k-1 iterations use the formal training mode, and every k-th iteration performs a compensation calculation: the full-precision local gradient Grad_j at that moment is used in place of the variable Sgrad_j, and the global weight is then computed with formula (8).
After the transition from warm-up training to formal training is complete, the operations of steps S7, S8, and S9 are repeated until the user-specified number of training epochs has been executed.
The method can be implemented on existing mainstream distributed learning frameworks; for a better understanding of the invention, CD-SGD, an implementation of the method on the Parameter Server (PS) architecture of MXNet, is given here.
The invention discloses a distributed random gradient descent method with compression and delay compensation, implemented based on the Parameter Server (PS) architecture in an MXNet environment, which comprises the following steps.
n iterations are performed during the warm-up training phase. In each iteration, the computing node uses the global weight W to compute the local gradient Grad, stores the computed local gradient Grad in the buffer comm _ buf of the computing node, then pushes the computed local gradient Grad to the parameter server, and the parameter server updates the global weight W and overwrites the contents in the buffer comm _ buf with the updated global weight W, so that only one buffer is needed for the work.
In the (n-1)-th iteration of the warm-up training phase, the global weight W is copied into another buffer loc_buf, distinct from the buffer comm_buf; in the n-th iteration the data in loc_buf is updated with the local gradient Grad, providing the new local weight W^{loc} required by the formal training phase. The number of warm-up iterations n can be adjusted for models of different complexity, and the time required is usually very short.
In the formal training phase, after acquiring the global weight, the computing node copies it into the buffer loc_buf, updates this copy with the local gradient, and in the next iteration the updated local weight participates in computing the local gradient at the next moment. The dependency engine of MXNet ensures that the local weight in loc_buf is overwritten by the next copy of the global weight only after it has participated in the local gradient computation. A first counter on the computing node controls the number of warm-up training iterations, and a second counter determines whether quantization is performed; the local weight update and the quantization operation can run in parallel because both only read the buffer comm_buf without modifying it.
The local gradient is encoded into 2-bit data before being pushed to the parameter server, and the parameter server decodes the 2-bit data back to 32 bits before updating the global weight. If the local gradient does not need to be quantized in a given iteration of the formal training phase, that iteration proceeds exactly as in the warm-up phase, and as soon as the local weight has been updated the next iteration starts immediately. The local update mechanism adopted here ensures that computation in the next iteration is not delayed by compression, and it works together with parallel communication.
The CD-SGD method is designed to inherit the advantage of gradient quantization in reducing communication traffic while compensating for the accuracy loss caused by quantization, masking the quantization overhead, and improving the overlap of computation and communication so as to raise training efficiency. Here, the effectiveness of the CD-SGD method in achieving these goals is evaluated. FIG. 1 shows the comparison of LeNet-5 model training accuracy on 2 compute nodes using different methods; FIG. 2 is a diagram illustrating the comparison of the training accuracy of the ResNet-50 model on 4 computing nodes by different methods; FIG. 3 is a diagram illustrating the comparison of the training accuracy of ResNet-50 models on 8 computing nodes by different methods; FIG. 4 is a speed boost ratio for different methods using 4 compute nodes in a K80 cluster for training; FIG. 5 is a speed boost ratio for training with 4 compute nodes in a V100 cluster.
FIGS. 1 to 3 compare the convergence accuracy curves of several methods: CD-SGD (the method of the present invention), S-SGD (the traditional high-accuracy synchronous update algorithm), BIT-SGD (the 2-bit quantization method provided by MXNet, the optimization target of the present invention), and OD-SGD (a local update algorithm whose local update mechanism the present invention uses as a reference). In the experiments, 4 parameter server nodes and 4 computing nodes are used; the per-node batch size is 128; the global learning rate (lr) of all algorithms is 0.1; the local lr of CD-SGD and OD-SGD is 0.4; and the warm-up period is 5 epochs. In addition, the quantization threshold of CD-SGD and BIT-SGD is 0.5, and the precision compensation step size k of CD-SGD is 2. Based on FIGS. 1 to 3, the following conclusions can be drawn: (1) The test accuracy of BIT-SGD is clearly lower than that of the other three methods, while CD-SGD solves the accuracy problem of BIT-SGD: its convergence performance is very close to that of S-SGD and even higher than that of OD-SGD. (2) CD-SGD maintains good accuracy scalability: as the number of nodes increases, its accuracy still shows no obvious difference from S-SGD, with a gap of no more than 0.5%.
FIGS. 4 and 5 show the relative speed ratios of the above methods trained on two different clusters, K80 and V100. The precision compensation step size k of CD-SGD is 5, and the remaining parameter settings are the same as above; speeds are measured relative to S-SGD. Based on FIGS. 4 and 5, the following conclusions can be drawn: (1) When ResNet-50 is trained on K80, CD-SGD attains the same training speed as OD-SGD; we attribute this to the limited computational power of the K80, which creates a computation bottleneck. Furthermore, when VGG-16 and Inception-BN are trained, the performance of BIT-SGD is worse than that of OD-SGD, unlike the training of AlexNet. This means that besides masking the compression overhead and reducing communication time, a parallel mechanism is needed to increase the overlap of computation and communication. The speed-ups of CD-SGD on the models shown in FIG. 4 are 0%, 43%, 33%, and 32%, respectively. (2) The V100 GPU cluster has stronger computational power and completes computation tasks in a shorter time; therefore BIT-SGD outperforms OD-SGD when training most of the models in FIG. 5, because the limited computation cost cannot fully cover the communication time. However, BIT-SGD is slower than OD-SGD when training Inception-BN, because Inception-BN has many layers of computation and thus a significant computation cost. The speed-ups of CD-SGD on the models shown in FIG. 5 are 24%, 43%, 39%, and 44%, respectively. In FIGS. 4 and 5, the training efficiency of CD-SGD is 3% to 30% higher than that of BIT-SGD.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.
Claims (4)
1. A distributed random gradient descent method with compression and delay compensation, characterized in that it is implemented with a parameter server and computing nodes, wherein the computing nodes are responsible for computing gradients and local weights, the parameter server receives and aggregates the gradients from all computing nodes and then performs the update, and data interaction between the parameter server and the computing nodes adopts the one-to-many mode of the Parameter Server (PS) architecture; the method comprises the following specific steps:
firstly, carrying out preheating training, wherein the preheating training comprises a step S1, a step S2 and a step S3;
S1, retrieve the global weight W_{i-1} at time t-1 from the parameter server;

S2, using the global weight W_{i-1} at time t-1, compute the local gradient at time t-1 on each computing node; the local gradient Grad_{j,i-1} of the j-th computing node at time t-1 is computed as:

Grad_{j,i-1} = grad_cal(W_{i-1}, X_j, Y_j), 1 ≤ j ≤ N    (1)

where grad_cal is the gradient-calculation function: using the global weight W_{i-1} and the sample features X_j stored at the input of the j-th computing node, it computes the prediction Y_j'; the loss value is calculated from the prediction Y_j' and the label Y_j, and the local gradient Grad_{j,i-1} of the j-th computing node at time t-1 is then obtained by differentiating the loss;
S3, each computing node pushes the local gradient value obtained by computing to a parameter server, and global weight W at the moment t-1 is obtained by utilizing the local gradient value obtained by computing i-1 Updating to obtain the global weight W at the time t i :
Wherein eta is the hyper-parameter learning rate, and N is the number of the calculation nodes;
the warm-up training phase repeats steps S1 to S3 several times;
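The warm-up loop (steps S1 to S3) is plain synchronous data-parallel SGD through the parameter server. A minimal numpy sketch under toy assumptions: `grad_cal` is stood in by a least-squares gradient, and the shard data, shapes, and learning rate are illustrative, not taken from the patent.

```python
import numpy as np

def grad_cal(w, x, y):
    # Toy stand-in for the patent's grad_cal: gradient of the squared error
    # 0.5 * ||x @ w - y||^2 with respect to w.
    return x.T @ (x @ w - y)

def warmup_step(w_global, shards, eta):
    # Steps S1-S3: every node pulls the same global weight, computes its
    # local gradient, and the server applies the averaged update, eq. (2).
    grads = [grad_cal(w_global, x, y) for x, y in shards]
    return w_global - eta * sum(grads) / len(grads)

# four "computing nodes", each holding one data shard
rng = np.random.default_rng(0)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
w = np.zeros(3)
for _ in range(50):
    w = warmup_step(w, shards, eta=0.02)
```

Because every node starts from the same global weight, this phase has no staleness; its only purpose in the method is to stabilize the weights before the delayed updates of the later phases begin.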
after the warm-up training is finished, transition training is performed, comprising steps S4 and S5;
S4, the global weight W_i at time t retrieved from the parameter server is backed up into the backup weight variable loc_weight; the local weight W_i^{loc} of each computing node at time t is then equal to the global weight W_i at time t and also to the backup weight loc_weight. In each computing node, the local gradient at time t is calculated; the local gradient value Grad_{j,i} of the j-th computing node at time t is:
Grad_{j,i} = grad_cal(W_i, X_j, Y_j), 1 ≤ j ≤ N    (3)
S5, at each computing node, the local gradient Grad_{j,i} at time t is used to update W_i, generating the local weight W_{i+1}^{loc} at time t+1:
W_{i+1}^{loc} = W_i - η · Grad_{j,i}    (4)
While the local weight W_{i+1}^{loc} is being calculated, each computing node uploads its local gradient at time t to the parameter server, which uses the gradients to update the global weight W_i at time t into the global weight W_{i+1} at time t+1:
W_{i+1} = W_i - (η/N) · Σ_{j=1}^{N} Grad_{j,i}    (5)
The global weight W_{i+1} at time t+1 is then backed up into the weight variable loc_weight;
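The two updates of the transition phase can be sketched side by side: each node advances a private copy with only its own gradient, eq. (4), while the server averages all gradients into the next global weight, eq. (5). A small numpy sketch with illustrative names and toy values:

```python
import numpy as np

def transition_step(w_global, local_grads, eta):
    # Steps S4-S5: every node updates a private copy with only its own
    # gradient (eq. (4)); the server, in parallel, averages all gradients
    # to produce the next global weight (eq. (5)).
    w_locals = [w_global - eta * g for g in local_grads]
    w_next = w_global - eta * np.mean(local_grads, axis=0)
    return w_locals, w_next  # w_next is also backed up into loc_weight

w = np.ones(4)
grads = [np.full(4, 1.0), np.full(4, 3.0)]  # toy gradients for two nodes
w_locs, w_glob = transition_step(w, grads, eta=0.1)
```

The point of the phase is that each node now owns a weight it can keep computing with, so the formal phase never has to block on the server's aggregated result.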
after the transition training is finished, formal training is performed, comprising steps S6, S7, and S8;
S6, as soon as the local weight W_{i+1}^{loc} is obtained, each computing node immediately starts to calculate the local gradient Grad_{j,i+1} at time t+1:
Grad_{j,i+1} = grad_cal(W_{i+1}^{loc}, X_j, Y_j), 1 ≤ j ≤ N    (6)
S7, after each computing node calculates its local gradient at time t+1, the local gradient Grad_{j,i+1} is quantized and stored in the 2-bit-format variable Sgrad_j; at the same time, the local gradient Grad_{j,i+1} is used to update the global weight W_{i+1} at time t+1, obtaining the local weight W_{i+2}^{loc} at time t+2:
W_{i+2}^{loc} = W_{i+1} - η · Grad_{j,i+1}    (7)
S8, each computing node pushes the data in the variable Sgrad_j to the parameter server; the parameter server calculates the global weight W_{i+2} at time t+2 and backs it up into the weight variable loc_weight. The global weight W_{i+2} at time t+2 is calculated as:
W_{i+2} = W_{i+1} - (η/N) · Σ_{j=1}^{N} Sgrad_j    (8)
where each Sgrad_j is decoded back to full precision before use;
At the same time, once the local weight W_{i+2}^{loc} at time t+2 is obtained, each computing node uses W_{i+2}^{loc} in formula (6) to calculate its local gradient Grad_{j,i+2} at time t+2, 1 ≤ j ≤ N;
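One formal-training iteration (steps S6 to S8) can be sketched as follows. The 2-bit codec is stood in by a crude sign-and-threshold quantizer; all names, the toy gradients, and the quantizer itself are illustrative assumptions, not the patent's actual encoder.

```python
import numpy as np

def quantize_2bit(g, scale):
    # Crude stand-in for the patent's 2-bit codec: each element becomes
    # -scale, 0, or +scale depending on a threshold test.
    q = np.zeros_like(g)
    q[g >= scale] = scale
    q[g <= -scale] = -scale
    return q

def formal_step(w_global, w_locals, data, eta, scale, grad_cal):
    # S6: gradients come from the *local* weights, so computation starts
    # without waiting for the server.
    grads = [grad_cal(wl, x, y) for wl, (x, y) in zip(w_locals, data)]
    # S7: compress for communication (Sgrad_j) and update the local
    # weights with the full-precision gradients (eq. (7)).
    sgrads = [quantize_2bit(g, scale) for g in grads]
    new_locals = [w_global - eta * g for g in grads]
    # S8: the server averages the (decoded) compressed gradients (eq. (8)).
    new_global = w_global - eta * np.mean(sgrads, axis=0)
    return new_locals, new_global, sgrads

# toy run: the "dataset" just hands back a fixed gradient per node
data = [(np.array([2.0, -0.4]), None), (np.array([0.6, -3.0]), None)]
locals_ = [np.zeros(2), np.zeros(2)]
new_locals, new_global, sgrads = formal_step(
    np.zeros(2), locals_, data, eta=0.1, scale=1.0, grad_cal=lambda wl, x, y: x)
```

Note the asymmetry: the local update in eq. (7) uses the uncompressed gradient, while the global update in eq. (8) sees only the quantized version; the compensation phase below exists to limit the error this asymmetry accumulates.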
after the formal training is finished, compensation training is performed, comprising step S9;
S9, within every k global-weight updates, the first k-1 use the formal training mode; at every k-th iteration, a compensation calculation is performed, in which the full-precision local gradient Grad_j at that moment replaces the variable Sgrad_j and the global weight is then calculated with formula (8);
after the transition from warm-up training to formal training is completed, the operations of steps S7, S8, and S9 are repeated until the user-specified number of training epochs is reached.
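The compensation schedule of step S9 reduces, under the assumption that iterations are counted from 1 inside each window, to a simple modulus test:

```python
def use_full_precision(iteration, k):
    # Step S9: in every window of k iterations, the first k-1 push the
    # 2-bit Sgrad_j; the k-th pushes the raw local gradient Grad_j instead.
    # Counting iterations from 1 is an assumption of this sketch.
    return iteration % k == 0

# with k = 5, iterations 1..10 give four compressed pushes, then a full one
schedule = ["full" if use_full_precision(t, 5) else "2bit" for t in range(1, 11)]
```

The experiments in the description above use k = 5, so one in every five pushes carries the uncompressed gradient and corrects the drift introduced by the quantized updates.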
2. The distributed random gradient descent method with compression and delay compensation of claim 1, wherein the gradient computation is implemented using the SGD, DCSGD, or Adam algorithm.
3. The distributed random gradient descent method with compression and delay compensation of claim 1, implemented based on the PS architecture in an MXNet environment, comprising the following steps:
n iterations are performed in the warm-up training phase; in each iteration, the computing node uses the global weight W to compute the local gradient Grad, stores it in its buffer comm_buf, and then pushes it to the parameter server; the parameter server updates the global weight W, and the updated global weight W overwrites the content of the buffer comm_buf;
in the (n-1)-th iteration of the warm-up training phase, the global weight W is copied to another buffer loc_buf, distinct from the buffer comm_buf; the data in the buffer loc_buf is updated with the local gradient Grad in the n-th iteration, thereby providing the formal training phase with the new local weight W^{loc} it needs;
in the formal training phase, after acquiring the global weight, the computing node copies it into the buffer loc_buf and updates the copy with its local gradient; in the next iteration, this updated local weight participates in the calculation of the next local gradient. The dependency engine of MXNet ensures that the local weight in the buffer loc_buf is overwritten by the next global-weight copy only after it has participated in the local gradient computation. A first counter on the computing node controls the number of warm-up iterations, a second counter determines whether quantization is performed, and the local-weight update and the quantization operation are carried out in parallel;
the local gradient is encoded into 2-bit data before being pushed to the parameter server, and the parameter server decodes the 2-bit data back to 32 bits before updating the global weight; if the local gradient does not need to be quantized in a given iteration of the formal training phase, that iteration proceeds exactly as in the warm-up phase, and the next iteration starts as soon as the local weight has been updated.
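A plausible shape for such a codec, packing four 2-bit codes per byte and mapping each gradient element to one of {0, +threshold, -threshold}, is sketched below; this is an illustrative reconstruction, not MXNet's actual 2-bit wire format.

```python
import numpy as np

def encode_2bit(grad, threshold):
    # Map each element to a 2-bit code (00 zero, 01 +threshold,
    # 10 -threshold) and pack four codes into each byte.
    codes = np.zeros(grad.size, dtype=np.uint8)
    codes[grad >= threshold] = 1
    codes[grad <= -threshold] = 2
    pad = (-codes.size) % 4                     # pad to a multiple of 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return packed, grad.size

def decode_2bit(packed, n, threshold):
    # Inverse transform, as the server would run before applying eq. (8).
    c = np.stack([(packed >> s) & 3 for s in (0, 2, 4, 6)], axis=1).reshape(-1)[:n]
    table = np.array([0.0, threshold, -threshold, 0.0], dtype=np.float32)
    return table[c]

g = np.array([0.9, -0.2, 0.05, -1.4, 2.2], dtype=np.float32)
packed, n = encode_2bit(g, threshold=0.5)      # 5 floats -> 2 bytes
restored = decode_2bit(packed, n, threshold=0.5)
```

A 32-bit float per element becomes 2 bits on the wire, a 16x reduction, which is the communication saving that the delay-compensation machinery of the method is designed to exploit safely.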
4. The distributed stochastic gradient descent method with compression and delay compensation of claim 3, wherein the number n of iterations performed during the warm-up training phase is adjusted according to the complexity of the model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110904974.2A CN113627519B (en) | 2021-08-07 | 2021-08-07 | Distributed random gradient descent method with compression and delay compensation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113627519A CN113627519A (en) | 2021-11-09 |
CN113627519B true CN113627519B (en) | 2022-09-09 |
Family
ID=78383413
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||