CN109951438B - Communication optimization method and system for distributed deep learning - Google Patents

Communication optimization method and system for distributed deep learning

Info

Publication number
CN109951438B
CN109951438B (application CN201910035739.9A)
Authority
CN
China
Prior art keywords
gradient
training
deep learning
value
communication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910035739.9A
Other languages
Chinese (zh)
Other versions
CN109951438A (en
Inventor
刘万涛
陆取
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910035739.9A priority Critical patent/CN109951438B/en
Publication of CN109951438A publication Critical patent/CN109951438A/en
Application granted granted Critical
Publication of CN109951438B publication Critical patent/CN109951438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a communication optimization method and system for distributed deep learning. The method comprises the following steps: 1) for the gradients generated by each computing node during distributed deep learning training, selecting a subset of gradient values with a sparsification method; 2) quantizing the gradient values selected by the sparsification method and transmitting the quantized values to the parameter server as the objects of gradient communication. The invention combines sparse gradient transmission with quantization compression, divides the training process into three stages according to their different characteristics, and adjusts the compression according to the training conditions, thereby further improving the gradient compression effect and the training efficiency in a distributed setting without any obvious negative impact on convergence performance or model accuracy.

Description

Communication optimization method and system for distributed deep learning
Technical Field
The invention belongs to the technical field of artificial intelligence and deep learning, and particularly relates to a communication optimization method and system for distributed deep learning.
Background
In recent years, with the rising popularity of artificial intelligence, machine learning has become a hot topic of scientific research. In the deep learning field, network structures have grown explosively: ever larger and deeper networks are continually being proposed and applied to fields such as text classification, image processing, machine translation and speech recognition, with remarkable results.
As the size of the data sets fed to neural networks and the number of network model parameters keep growing (common networks have tens of millions or even hundreds of millions of parameters, and many models are several GB in size), single-machine training runs into problems such as insufficient storage space and excessively long training time. Much of the training of deep learning network structures has therefore been moved to distributed systems.
There are two common modes of distributed training for deep learning: model parallelism and data parallelism. Model parallelism is usually used when the network structure is too large to be stored on a single node; the complete neural network is split into several parts stored on different nodes, which then fetch training data in parallel for training. However, splitting a model is difficult and sharing the training data is costly, so model parallelism is not the main approach at the current stage. In data parallelism, each node stores a copy of the network structure and the training data set is split across the nodes.
Distributed deep learning training widely uses the stochastic gradient descent (SGD) algorithm, and clusters mainly adopt the ps-worker architecture. As shown in Fig. 1, the parameter server (ps) collects the gradients computed by the worker nodes and updates the model parameters stored on the server according to these gradients. Each worker fetches the parameter values after the parameter server has computed the latest parameters, then fetches new training data and performs the computation for the next batch.
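The ps-worker gradient exchange described above can be illustrated with a minimal single-process sketch in Python/NumPy; the class and method names are illustrative assumptions, and a real system would exchange gradients over the network rather than within one process:

    import numpy as np

    # Minimal single-process sketch of the ps-worker pattern described above.
    # All names (ParameterServer, Worker, compute_gradient) are illustrative,
    # not part of the patent.

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.params = np.zeros(dim)
            self.lr = lr

        def apply_gradients(self, grads):
            # Average the gradients collected from all workers and update the model.
            self.params -= self.lr * np.mean(grads, axis=0)
            return self.params

    class Worker:
        def __init__(self, data):
            self.data = data          # this worker's shard of the training set

        def compute_gradient(self, params, batch_idx):
            x = self.data[batch_idx]
            # Toy quadratic loss 0.5*||params - x||^2, whose gradient is params - x.
            return params - x

    # Data parallelism: the data set is split across workers, each holding a model copy.
    data_shards = np.split(np.random.randn(8, 4), 2)
    workers = [Worker(shard) for shard in data_shards]
    ps = ParameterServer(dim=4)

    params = ps.params
    for step in range(4):
        grads = [w.compute_gradient(params, step % 4) for w in workers]
        params = ps.apply_gradients(grads)   # synchronous gradient exchange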
Enlarging the cluster greatly increases the computing power of the whole system, but the communication bandwidth between machines then becomes the main factor limiting training efficiency. Because the volume of gradient communication is large (the gradient has the same size as the parameters) and the communication is concentrated in time (gradients are usually exchanged all at once after back-propagation finishes), the network bandwidth often cannot meet the demand. The cluster spends a large amount of time on network communication, and the computation time saved by adding nodes can be completely swallowed by communication time, so training efficiency does not improve. The problem is even more acute on small-scale or mobile devices. The network bandwidth required for communication has therefore become a major obstacle to the development of distributed machine learning.
The main objective of the invention is to solve the network communication bottleneck that appears when a distributed system is used to train a deep learning model. The problem can be illustrated with a concrete example. Fig. 2 shows the training of the classic image classification network AlexNet on a Tesla K80 graphics card, listing the average time per iteration under different frameworks; the average time of one iteration with mini-batch size 128 on the GPU is between 19 ms and 54 ms (different frameworks introduce different overheads). Since AlexNet has about 60M parameters, the amount of gradient data that must be exchanged in one gradient exchange is about 60M × 4 B = 240 MB, ignoring the overhead of the network communication protocol. Over a fairly common 1 Gbps network it takes on the order of a second just to deliver a gradient of this size to the server, roughly 20 to 50 times the cost of one round of computation. AlexNet is also a comparatively simple network; the situation is even worse for larger networks. Network bandwidth thus limits the improvement of distributed deep learning efficiency and greatly restricts the scalability of clusters.
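As a rough order-of-magnitude check of the figures above, the following sketch (with illustrative variable names) computes the ideal transfer time of an AlexNet-sized gradient over a 1 Gbps link, ignoring protocol overhead:

    # Order-of-magnitude estimate of the gradient communication cost discussed above.
    num_params = 60e6            # AlexNet has roughly 60M parameters
    bytes_per_param = 4          # 32-bit floating-point gradients
    gradient_bytes = num_params * bytes_per_param           # ~240 MB per exchange

    bandwidth_bits_per_s = 1e9   # a common 1 Gbps link
    transfer_s = gradient_bytes * 8 / bandwidth_bits_per_s  # ~1.9 s, ideal case

    compute_s = 0.05             # ~19-54 ms per iteration on a Tesla K80
    print(f"gradient size: {gradient_bytes / 1e6:.0f} MB, "
          f"transfer: {transfer_s:.1f} s, "
          f"transfer/compute ratio: {transfer_s / compute_s:.0f}x")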
Under the distributed architecture of the ps-worker, researchers provide a plurality of methods for solving the network bandwidth bottleneck problem in the field of machine learning during distributed training of models.
From the system point of view, one idea exploits the uneven convergence characteristic of machine learning training: instead of the strict Bulk Synchronous Parallel (BSP) mechanism, parameters are updated asynchronously, so that a node with a fast computation speed can push its gradient as soon as it is computed, without waiting to synchronize the results of all nodes every time; this is the so-called Stale Synchronous Parallel (SSP) scheme. Another idea is to overlap gradient computation and data communication: during back-propagation, as soon as the gradients for one layer of parameters have been computed, the corresponding parameters are updated asynchronously while the compute nodes continue their computation. In the best case, the network communication time can be completely hidden by the computation time.
In addition to these methods, controlling the amount of data transmitted over the network can also effectively reduce the time spent on network communication. During deep learning training only gradients and parameters are transmitted, and since changing the parameter values would obviously affect model accuracy, compressing the gradients is the common idea. There are two approaches: sparse transmission of gradients and quantization compression.
(1) Sparse gradient transmission. The number of gradients transmitted during communication is limited and the transmission of some gradients is delayed, which reduces the amount of data on the network. Storm proposed filtering gradients with a threshold: gradients larger than the threshold are transmitted and the rest are accumulated locally; his experiments show a 17x speed-up with 20 nodes. However, the threshold is an additionally introduced hyper-parameter, and choosing it well is very difficult. Aji proposed selecting the fraction of gradients with the largest absolute values for synchronous updating, i.e. filtering the gradients at a fixed ratio, which cuts gradient transmission by about 99% with only a slight drop in the BLEU score of the trained model. To adapt the compression rate automatically to different tasks, Chen proposed a local gradient selection method that achieves a 200x compression on fully connected networks without affecting the final accuracy. Lin observed that 99.9% of gradient transmission is redundant and therefore proposed a deep gradient compression strategy in which four techniques compensate for the errors caused by sparsification, raising the compression ratio to as much as 600x.
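A minimal sketch of this kind of ratio-based sparsification with local accumulation, assuming a flat NumPy gradient vector; the function name and the fixed selection ratio are illustrative assumptions rather than the exact algorithm of any cited work:

    import numpy as np

    def sparsify_gradient(grad, residual, ratio=0.01):
        """Select the top `ratio` fraction of gradients by absolute value.

        Unselected gradients are accumulated in `residual` and carried over
        to the next iteration instead of being transmitted.
        """
        grad = grad + residual                       # add what was held back earlier
        k = max(1, int(ratio * grad.size))
        threshold = np.sort(np.abs(grad))[-k]        # k-th largest |g|
        mask = np.abs(grad) >= threshold
        new_residual = np.where(mask, 0.0, grad)     # keep small gradients locally
        indices = np.nonzero(mask)[0]
        values = grad[mask]
        return indices, values, threshold, new_residual

    grad = np.random.randn(1000)
    residual = np.zeros_like(grad)
    idx, vals, thr, residual = sparsify_gradient(grad, residual)
    # Only (idx, vals), about 1% of the gradient, would be sent over the network.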
(2) Gradient quantization. Representing gradient values with lower-precision numbers also reduces the bandwidth needed for communication. Zhou used a bit-convolution approach to accelerate training and prediction (compressing gradients and the model simultaneously) in a network named DoReFa-Net, whose final accuracy is comparable to that of a normal network. Glide quantized the transmitted gradient so that each gradient takes only one bit during transmission, and trained a 160M network with a tenfold acceleration and only a small loss of accuracy. Alistarh proposed the QSGD compression strategy, which allows the user to trade off compression rate against convergence performance effectively and shortens the training time of ResNet-152 by a factor of 1.8. Wen proposed a ternary gradient compression method that quantizes all gradient values to {-1, 0, 1}; experiments show no large loss of accuracy on most network structures. Both the QSGD and the ternary work prove the convergence of their methods theoretically.
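A minimal sketch of ternary (three-valued) gradient quantization in the spirit described above, assuming deterministic rounding and the maximum absolute value as the scale; this is an illustrative simplification, not the exact algorithm of any cited work:

    import numpy as np

    def ternarize(grad):
        """Quantize a gradient vector to scale * {-1, 0, +1}.

        Only the signs of the retained entries and one scale value need to be
        transmitted, so each entry costs about 2 bits instead of 32.
        """
        scale = np.max(np.abs(grad))
        if scale == 0.0:
            return 0.0, np.zeros_like(grad, dtype=np.int8)
        # Keep entries whose magnitude is at least half the scale; drop the rest.
        ternary = np.where(np.abs(grad) >= 0.5 * scale, np.sign(grad), 0.0)
        return scale, ternary.astype(np.int8)

    def dequantize(scale, ternary):
        # The receiver reconstructs an approximate gradient from the compact form.
        return scale * ternary.astype(np.float32)

    g = np.random.randn(8)
    scale, q = ternarize(g)
    g_hat = dequantize(scale, q)       # lossy reconstruction of the gradient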
As shown in Fig. 3, increasing the number of nodes improves the computing capability of the whole cluster, and for the same mini-batch the computation time can be greatly shortened. At that point, however, the communication bottleneck becomes obvious; gradient compression reduces the scale of communication and thereby shortens the training time.
The SSP approach in existing distributed communication optimization, which abandons strict synchronous updates, has the drawback that a node may compute gradients with stale parameters, which harms convergence to some extent. The accuracy of the resulting model differs from that obtained before the optimization, the number of epochs needed for convergence may grow, and the final speed-up is not necessarily significant.
The idea of overlapping computation and communication is not universally applicable either: communicating ahead of time cannot fully help neural networks with a relatively low "computation-to-communication ratio", where communication still takes up a large share of the time.
Existing work on gradient compression has achieved good results, but some shortcomings remain.
In the sparsification work, the common problem is how to choose the threshold; moreover, convergence and final accuracy cannot be fully guaranteed, and the behaviour during training is hard to keep stable. Another problem of the sparsification methods is that the transmitted gradients are not expressed efficiently, so there is still obvious redundancy in the data transfer. Whether the gradients are selected by a threshold or by a fixed ratio, the value range of the gradients sent over the network has already been narrowed considerably; representing them with high-precision floating-point numbers is wasteful from a coding point of view, and a more efficient representation is possible.
Quantization usually requires more computation than sparsification, introducing extra overhead, and it has an upper bound on the compression rate (a 32-bit floating-point number can be compressed to at most 1 bit). Quantization also has another drawback: whatever reconstruction scale is chosen for transmission, the quantization process introduces an error into most of the reconstructed gradients. Some methods ignore this error, while others accumulate it locally, which increases part of the local workload; either way, the error may affect the training result.
Disclosure of Invention
The main objective of the present invention is to solve the network communication bottleneck problem that occurs when a deep learning model is trained using a distributed system.
The technical scheme adopted by the invention is as follows:
a communication optimization method for distributed deep learning comprises the following steps:
1) for the gradients generated by each computing node during the training process of distributed deep learning, selecting a subset of gradient values with a sparsification method;
2) quantizing the gradient values selected by the sparsification method, and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
Further, the training process of the distributed deep learning adopts a stochastic gradient descent algorithm and comprises three stages: a warm-up stage, a steady stage and a best-effort stage; automatic adjustment across the stages is ensured through automatic sparsity adjustment and quantization rollback, realizing an adaptive staged gradient quantization strategy.
Further, the automatic sparsity adjustment uses the value of the regularization term to decide the specific degree of sparsity.
Further, the automatic sparsity adjustment comprises the following sub-steps:
a) initializing the parameters and calculating the value r_0 of the regularization term;
b) starting training and storing the regularization term after each round of training;
c) calculating the average regularization term within an epoch;
d) modifying the sparsity to 1 - 0.2(r_0/r_i), where r_i represents the average regularization term value in the i-th epoch;
e) if the sparsity has increased to the maximum sparsity, or the regularization term no longer decreases within an epoch, or the number of epochs has reached the upper limit, setting the sparsity to the maximum value.
Further, quantization rollback means abandoning the quantization operation on the gradient at an appropriate time, so that the whole model can reach the convergence state more accurately.
Further, errors produced by the sparsification and the quantization are corrected with a merged-residual method: the gradients not selected for transmission during sparsification are first kept locally, the quantization error produced during quantization is accumulated with the gradients kept during sparsification, and the combined gradient, called the merged residual, is added to the newly calculated gradient values in the next training round.
Further, for each locally retained merged residual, a momentum correction operation is performed once the residual reaches the threshold and is transmitted, i.e. the local momentum is cleared to eliminate the influence of its previous value.
A distributed deep learning computing node apparatus, comprising:
the sparsification module, which is responsible for selecting a subset of gradient values with a sparsification method from the gradients generated by each computing node during the training process of distributed deep learning;
and the quantization module, which is responsible for quantizing the gradient values selected by the sparsification module and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
A distributed deep learning system comprising at least one computing node device as described above, and a parameter server in communication connection with the computing node device.
The key point of the invention is to combine sparse gradient transmission with quantization compression and to divide the training process into three different stages according to their characteristics. The compression method is adjusted according to the training conditions, which further improves the gradient compression effect and the training efficiency in a distributed scenario, without any obvious negative impact on convergence performance or model accuracy.
From the point of view of compression ratio, existing gradient compression techniques reach at best a few hundred times. The invention applies a multi-stage gradient compression technique that combines two different compression methods and can reach compression ratios of several thousand times on some models, thereby improving the training efficiency of the distributed system.
Drawings
FIG. 1 is a diagram of a ps-worker architecture for a stochastic gradient descent algorithm.
Fig. 2 is a schematic diagram of the training situation of the image classification network AlexNet on the Tesla K80 graphics card.
Fig. 3 is a schematic diagram of a computation-communication comparison of distributed deep learning.
Fig. 4 is a flow chart of the main steps of the present invention.
Fig. 5 is a schematic diagram of the division of the training process into three training phases.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
In order to exploit the advantages of both gradient sparsification and gradient quantization and to solve the problems of the prior art, the invention combines the two into one compression strategy (as shown in Fig. 4): a sparsification method is first applied to the computed gradients, and a subset of large gradient values, whose value range is relatively concentrated, is selected for transmission; these gradients are then quantized with a ternary compression strategy, using the threshold applied when selecting gradients during sparsification as the reconstruction scale. There are two reasons for this design. First, because the filtered gradient values are concentrated, the error produced by using the threshold as the reconstructed gradient value is small, which benefits training. Second, the existing ternary method mentions gradient clipping to ensure overall convergence, but clipping introduces a hyper-parameter that controls what proportion of the gradient is clipped; using the threshold directly avoids this hyper-parameter.
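A minimal sketch of this combined step on a flat NumPy gradient: entries above a threshold are selected, and each selected entry is transmitted as its sign multiplied by the threshold, which also serves as the reconstruction scale. The helper names and the way the threshold is derived from a selection ratio are illustrative assumptions:

    import numpy as np

    def compress(grad, ratio=0.01):
        """Sparsify, then ternarize the selected gradients.

        The k-th largest absolute value serves both as the selection threshold
        and as the reconstruction scale, so only indices, signs and one scalar
        need to be transmitted.
        """
        k = max(1, int(ratio * grad.size))
        threshold = np.sort(np.abs(grad))[-k]
        mask = np.abs(grad) >= threshold
        indices = np.nonzero(mask)[0]
        signs = np.sign(grad[mask]).astype(np.int8)
        residual = np.where(mask, 0.0, grad)          # unselected gradients stay local
        quant_error = grad[mask] - signs * threshold  # error introduced by quantization
        return (indices, signs, threshold), residual, quant_error

    def decompress(packet, size):
        indices, signs, threshold = packet
        grad_hat = np.zeros(size)
        grad_hat[indices] = signs * threshold         # threshold as reconstruction scale
        return grad_hat

    g = np.random.randn(10000)
    packet, residual, err = compress(g)
    g_hat = decompress(packet, g.size)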
The basic formula of stochastic gradient descent is:

    w_{t+1} = w_t - η ∇f(w_t; x)

where w_{t+1} denotes the updated parameter value after the t-th round, w_t denotes the parameter value at the start of the t-th round, η denotes the learning rate, ∇f(w_t; x) denotes the calculated gradient, x denotes the input data, and f denotes the loss function.
From the SGD formula above, the magnitude of a parameter change depends on the learning rate and on the value of the gradient. At the start of neural network training all parameter values are generated randomly, so the whole network is in a very unstable state: the parameters change greatly and fluctuate sharply. Published work suggests fixing a relatively small learning rate for the first few epochs to reduce this fluctuation, and Facebook, when training large networks, lets the learning rate grow linearly from a small value to its final value over the first five epochs. Here an epoch is one pass of forward and backward propagation over all batches, i.e. every sample in the data set has been computed once.
In the MGC (Multi-stage Gradient Compression) strategy, to prevent the parameters from being misled by erroneous gradients while they are still changing violently at the start, the invention uses the automatic sparsity adjustment technique to control training during the warm-up stage.
After a period of training, both the loss function and the parameters change smoothly. The experimental part of TernGrad found that in the second half of training, adjusting the compression strategy and stopping quantization compression in time gives better results. This is because gradient quantization compression is essentially a lossy representation of the data: most of the transmitted gradient values carry an error, which perturbs the direction represented by the overall gradient. In the final stage of training the loss function is close to convergence, and such differences in the gradient values may push it away from the target optimum. A "best-effort" stage is therefore needed so that the final goal is reached more accurately and more quickly in the phase approaching convergence. The invention realizes this change of training strategy with a quantization rollback policy.
1. Multi-stage gradient compression
Algorithm 1 in Table 1 summarizes the basic flow of multi-stage gradient compression. Compared with the traditional distributed stochastic gradient descent algorithm, the method adds two steps on each computing node: automatic sparsity adjustment and quantization rollback. The former changes the sparsity with which gradients are selected according to the current training situation; the latter abandons the quantization of the gradient at an appropriate time, so that the whole model can reach the convergence state more accurately.
2. Automatic sparsity adjustment
In the loss function of deep learning, a regularization term is added to prevent overfitting and increase the generalization ability of the model. Because the parameters are initialized randomly, the value of the regularization term tends to be large at that point; it changes sharply in the early stage of training and usually falls quickly to a reasonable level. The invention uses the value of the regularization term to determine the specific degree of sparsity. In the MGC compression strategy the sparsification of the gradient is relatively mild at the beginning of training, and it is advantageous to transmit a relatively complete gradient at this time to help the transition from the rapidly changing stage to the steady stage. The specific algorithm is shown in Table 2 below.
Table 1. Algorithm 1: multi-stage gradient compression (shown as an image in the original publication).
Table 2. Algorithm 2: automatic sparsity adjustment (shown as an image in the original publication).
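A minimal sketch of the automatic sparsity adjustment of Algorithm 2, assuming an L2-norm regularization term; the maximum sparsity value and all names are illustrative assumptions rather than values fixed by the patent:

    import numpy as np

    def l2_regularization(params):
        # Assumed regularization term: 0.5 * sum of squared weights over all tensors.
        return 0.5 * sum(float(np.sum(w * w)) for w in params)

    def adjust_sparsity(r0, r_i, max_sparsity=0.999):
        """Sparsity schedule from Algorithm 2: sparsity = 1 - 0.2 * (r0 / r_i).

        r0 is the regularization term at initialization and r_i the average value
        over the i-th epoch; the result is clamped to [0, max_sparsity]. Step e)
        of the algorithm sets the sparsity to the maximum once it has reached the
        maximum, the regularization term stops decreasing, or the epoch limit is hit.
        """
        sparsity = 1.0 - 0.2 * (r0 / r_i)
        return float(np.clip(sparsity, 0.0, max_sparsity))

    params = [np.random.randn(100, 10), np.random.randn(10)]
    r0 = l2_regularization(params)
    print(adjust_sparsity(r0, 0.9 * r0))   # sparsity during an early warm-up epoch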
3. Quantization rollback
Intuitively, stochastic gradient descent is like descending a mountain, with each step following the direction of steepest descent. When approaching a lowest point, the compression amplitude should be reduced somewhat so that the point is reached faster and is not overshot because of deviations in direction. The invention therefore incorporates a quantization rollback method into MGC. The rollback point is chosen according to the total length of the whole training run: in the later period (for example the last 20% of the run) the compression strategy gives up quantization. For example, if the total number of epochs is 100, quantization is abandoned in the last 20 epochs.
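A small sketch of this schedule decision, using the 20% fraction from the example above; the function name is illustrative:

    def use_quantization(epoch, total_epochs, rollback_fraction=0.2):
        """Quantization rollback: stop quantizing in the last `rollback_fraction`
        of the training run (e.g. the last 20 of 100 epochs)."""
        return epoch < (1.0 - rollback_fraction) * total_epochs

    # With 100 epochs in total, quantization is applied up to epoch 79 and
    # abandoned from epoch 80 onwards (0-based epoch index).
    assert use_quantization(79, 100) and not use_quantization(80, 100)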
4. Merged residual correction
Because the compression strategy of the invention uses two compression methods at the same time and both of them introduce errors, the invention uses a merged-residual correction method to compensate for the loss. During the first step, sparsification, the gradients that are not selected for transmission are kept locally; during the second step, quantization, the quantization error that is produced is accumulated with the gradients kept in the first step. The combined gradient is called the merged residual and is added to the newly computed gradient values in the next training round.
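A minimal, self-contained sketch of the merged-residual bookkeeping, using a simple threshold-plus-sign compression step as a stand-in for the full MGC pipeline; all names are illustrative assumptions:

    import numpy as np

    def sparsify_and_quantize(grad, threshold):
        """One compression round, returning the transmitted gradient approximation
        and the merged residual (unselected gradients + quantization error)."""
        mask = np.abs(grad) >= threshold
        transmitted = np.where(mask, np.sign(grad) * threshold, 0.0)
        sparsity_residual = np.where(mask, 0.0, grad)                 # step 1: kept locally
        quantization_error = np.where(mask, grad - transmitted, 0.0)  # step 2: error
        merged_residual = sparsity_residual + quantization_error
        return transmitted, merged_residual

    residual = np.zeros(1000)
    for step in range(3):
        grad = np.random.randn(1000)
        grad = grad + residual                 # add the merged residual of the
                                               # previous round to the new gradient
        transmitted, residual = sparsify_and_quantize(grad, threshold=1.5)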
The existing SGD algorithm often adds momentum to accelerate training, and momentum correction can address the resulting loss of model performance. The algorithm of the invention adapts the existing momentum correction and applies the momentum correction measure to the locally kept merged residual as well.
For the merged residual gradients that are kept locally each time, a correction operation is performed once the residual reaches the threshold and is transmitted: the local momentum is cleared to eliminate the influence of its previous value, as shown in the following equation, where threshold denotes the threshold, u_{t,k} denotes the momentum, t denotes the computation round and k denotes the index of the gradient.
    u_{t,k} = 0,  if |residual_{t,k}| > threshold
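A small sketch of this momentum correction, assuming a standard momentum accumulation into the residual; only the zeroing of the transmitted coordinates follows the equation above, the rest is an illustrative assumption:

    import numpy as np

    def momentum_corrected_step(grad, momentum, residual, threshold, beta=0.9):
        """Accumulate momentum into the residual, transmit entries above the
        threshold, and zero the momentum of the transmitted coordinates
        (u_{t,k} = 0 where |residual_{t,k}| > threshold)."""
        momentum = beta * momentum + grad          # assumed standard momentum update
        residual = residual + momentum
        mask = np.abs(residual) > threshold
        transmitted = np.where(mask, np.sign(residual) * threshold, 0.0)
        residual = np.where(mask, 0.0, residual)   # transmitted part leaves the residual
        momentum = np.where(mask, 0.0, momentum)   # momentum correction: clear u_{t,k}
        return transmitted, momentum, residual

    g = np.random.randn(1000)
    u = np.zeros_like(g)
    r = np.zeros_like(g)
    sent, u, r = momentum_corrected_step(g, u, r, threshold=2.0)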
Overall, the invention divides the whole training process into three main stages, namely a warm-up stage, a steady stage and a best-effort stage, as shown in Fig. 5, where the abscissa represents the training epoch and the ordinate represents the loss function value. Automatic adjustment across the stages is ensured by automatic sparsity adjustment and quantization rollback, realizing an adaptive staged gradient quantization strategy.
From the point of view of compression ratio, existing gradient compression techniques reach at best a few hundred times. After applying MGC, the compression technique of the invention that uses two different compression methods, compression ratios of several thousand times can be reached on some models, as shown in Table 3 below, which improves the training efficiency of the distributed system.
Another embodiment of the present invention provides a distributed deep learning computing node device, including:
the sparsification module, which is responsible for selecting a subset of gradient values with a sparsification method from the gradients generated by each computing node during the training process of distributed deep learning;
and the quantization module, which is responsible for quantizing the gradient values selected by the sparsification module and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
The specific implementation of the sparsification module and the quantization module is as described above for the method of the invention.
Another embodiment of the present invention provides a distributed deep learning system, which includes at least one computing node apparatus as described above, and a parameter server establishing a communication connection with the computing node apparatus.
Table 3. Comparison of the compression effect of the invention with other methods (table shown as an image in the original publication).
The gradient sparsification and gradient quantization techniques in the invention are not limited to ternary quantization and deep gradient compression; other compression methods for gradient sparsification or quantization (such as the one-bit quantization strategy of CNTK) can also be applied in this scheme.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (6)

1. A communication optimization method for distributed deep learning is characterized by comprising the following steps:
1) for the gradients generated by each computing node during the training process of distributed deep learning, selecting a subset of gradient values with a sparsification method;
2) quantizing the gradient values selected by the sparsification method, and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication;
the training process of the distributed deep learning adopts a stochastic gradient descent algorithm and comprises three stages: a warm-up stage, a steady stage and a best-effort stage; automatic adjustment across the stages is ensured through automatic sparsity adjustment and quantization rollback, realizing an adaptive staged gradient quantization strategy; the quantization rollback means abandoning the quantization operation on the gradient at an appropriate time, so that the whole model can reach the convergence state more accurately;
the random gradient descent algorithm comprises the following steps:
for compute nodes numbered 1-N:
1) obtaining the training sample Z_t(i) assigned to itself;
2) calculating the gradient g_t(i) and accumulating the newly calculated gradient with the local gradient residual g_re(i): g_t(i) = g_t(i) + g_re(i);
3) running the automatic sparsity adjustment algorithm to obtain the sparsity τ;
4) filtering the gradient with the obtained sparsity and locally saving the threshold and the residual g_re;
5) applying the quantization rollback strategy to decide whether to quantize the transmitted gradient;
6) transmitting the gradient through the network;
7) applying merged-residual correction to the local gradient residual;
8) obtaining the aggregated gradient values from the server side;
9) updating local parameters;
the parameter server averages the obtained gradient values;
the automatic sparsity adjustment algorithm uses the value of the regularization term to decide the specific sparsity and comprises the following sub-steps:
a) initializing the parameters and calculating the value r_0 of the regularization term;
b) starting training and storing the regularization term after each round of training;
c) calculating the average regularization term within an epoch;
d) modifying the sparsity to 1 - 0.2(r_0/r_i), where r_i represents the average regularization term value in the i-th epoch;
e) if the sparsity has increased to the maximum sparsity, or the regularization term no longer decreases within an epoch, or the number of epochs has reached the upper limit, setting the sparsity to the maximum value.
2. The method of claim 1, wherein the quantization rollback abandons quantization during the last 20% of the total duration of the whole training.
3. The method according to claim 1 or 2, wherein errors generated by the sparsification and the quantization are corrected using a merged-residual method: the gradients not selected for transmission during sparsification are first kept locally, the quantization error generated during quantization is accumulated with the gradients kept during sparsification, and the combined gradient, called the merged residual, is added to the newly calculated gradient values in the next training round.
4. The method according to claim 3, wherein, for each locally retained merged residual, a momentum correction operation is performed once the residual reaches the threshold and is transmitted, i.e. the local momentum is cleared to eliminate the influence of its previous value.
5. A computing node device for distributed deep learning by using the method of any one of claims 1 to 4, comprising:
the sparsification module, which is responsible for selecting a subset of gradient values with a sparsification method from the gradients generated by each computing node during the training process of distributed deep learning;
and the quantization module, which is responsible for quantizing the gradient values selected by the sparsification module and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
6. A distributed deep learning system comprising at least one computing node apparatus of claim 5 and a parameter server in communication with the computing node apparatus.
CN201910035739.9A 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning Active CN109951438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035739.9A CN109951438B (en) 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910035739.9A CN109951438B (en) 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning

Publications (2)

Publication Number Publication Date
CN109951438A CN109951438A (en) 2019-06-28
CN109951438B true CN109951438B (en) 2020-11-20

Family

ID=67007902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035739.9A Active CN109951438B (en) 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning

Country Status (1)

Country Link
CN (1) CN109951438B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287031B (en) * 2019-07-01 2023-05-09 南京大学 Method for reducing communication overhead of distributed machine learning
CN110503194B (en) * 2019-08-09 2022-05-24 苏州浪潮智能科技有限公司 Distributed parallel training method and system
CN110633798B (en) * 2019-09-12 2023-04-07 北京金山数字娱乐科技有限公司 Parameter updating method and device in distributed training
CN110647765B (en) * 2019-09-19 2022-04-12 济南大学 Privacy protection method and system based on knowledge migration under collaborative learning framework
CN110619388B (en) * 2019-09-20 2024-04-02 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
CN112651411B (en) * 2019-10-10 2022-06-07 中国人民解放军国防科技大学 Gradient quantization method and system for distributed deep learning
CN111027671B (en) * 2019-11-12 2023-07-04 华中科技大学 Distributed deep learning communication method and system based on model structural characteristics
CN112948105B (en) * 2019-12-11 2023-10-17 香港理工大学深圳研究院 Gradient transmission method, gradient transmission device and parameter server
CN111382844B (en) * 2020-03-11 2023-07-07 华南师范大学 Training method and device for deep learning model
US20210295168A1 (en) * 2020-03-23 2021-09-23 Amazon Technologies, Inc. Gradient compression for distributed training
CN111582494B (en) * 2020-04-17 2023-07-07 浙江大学 Mixed distributed machine learning updating method based on delay processing
CN111625603A (en) * 2020-05-28 2020-09-04 浪潮电子信息产业股份有限公司 Gradient information updating method for distributed deep learning and related device
CN111429142B (en) * 2020-06-10 2020-09-11 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN112134812B (en) * 2020-09-08 2022-08-05 华东师范大学 Distributed deep learning performance optimization method based on network bandwidth allocation
CN112463189B (en) * 2020-11-20 2022-04-22 中国人民解放军国防科技大学 Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN112564118B (en) * 2020-11-23 2022-03-18 广西大学 Distributed real-time voltage control method capable of expanding quantum deep width learning
CN112561078B (en) * 2020-12-18 2021-12-28 北京百度网讯科技有限公司 Distributed model training method and related device
CN113095510B (en) * 2021-04-14 2024-03-01 深圳前海微众银行股份有限公司 Federal learning method and device based on block chain
CN113139663B (en) * 2021-04-23 2023-01-20 深圳市大数据研究院 Federal edge learning configuration information acquisition method, device, equipment and medium
CN113159331B (en) * 2021-05-24 2023-06-30 同济大学 Self-adaptive sparseness quantization method of networked machine learning system
CN113487036B (en) * 2021-06-24 2022-06-17 浙江大学 Distributed training method and device of machine learning model, electronic equipment and medium
CN113467949B (en) * 2021-07-07 2022-06-28 河海大学 Gradient compression method for distributed DNN training in edge computing environment
CN113660113B (en) * 2021-07-27 2023-09-15 上海大学 Self-adaptive sparse parameter model design and quantization transmission method for distributed machine learning
CN113592701B (en) * 2021-08-05 2024-03-29 中国科学技术大学 Method and system for registering gradient compression algorithm development into deep learning framework
CN114125070B (en) * 2021-11-10 2023-06-13 深圳大学 Communication method, system, electronic device and storage medium for quantization compression
CN114298277B (en) * 2021-12-28 2023-09-12 四川大学 Distributed deep learning training method and system based on layer sparsification
CN114386622A (en) * 2022-01-14 2022-04-22 平安科技(深圳)有限公司 Gradient compression method, device, equipment and storage medium
CN114710415B (en) * 2022-05-23 2022-08-12 北京理工大学 Redundant coded passive message reliable transmission and processing system
CN115103031B (en) * 2022-06-20 2023-07-14 西南交通大学 Multistage quantization and self-adaptive adjustment method
CN116341628B (en) * 2023-02-24 2024-02-13 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training
CN117910521B (en) * 2024-03-20 2024-06-14 浪潮电子信息产业股份有限公司 Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106062786B (en) * 2014-09-12 2019-12-31 微软技术许可有限责任公司 Computing system for training neural networks
CN104714852B (en) * 2015-03-17 2018-05-22 华中科技大学 A kind of parameter synchronization optimization method and its system suitable for distributed machines study
US11106973B2 (en) * 2016-03-16 2021-08-31 Hong Kong Applied Science and Technology Research Institute Company Limited Method and system for bit-depth reduction in artificial neural networks
US10013644B2 (en) * 2016-11-08 2018-07-03 International Business Machines Corporation Statistical max pooling with deep learning
US20200380356A1 (en) * 2017-02-23 2020-12-03 Sony Corporation Information processing apparatus, information processing method, and program
CN107679618B (en) * 2017-07-28 2021-06-11 赛灵思电子科技(北京)有限公司 Static strategy fixed-point training method and device
CN107832847A (en) * 2017-10-26 2018-03-23 北京大学 A kind of neural network model compression method based on rarefaction back-propagating training
CN108491928B (en) * 2018-03-29 2019-10-25 腾讯科技(深圳)有限公司 Model parameter sending method, device, server and storage medium
CN108829441B (en) * 2018-05-14 2022-10-18 中山大学 Distributed deep learning parameter updating and optimizing system
CN108875602A (en) * 2018-05-31 2018-11-23 珠海亿智电子科技有限公司 Monitor the face identification method based on deep learning under environment

Also Published As

Publication number Publication date
CN109951438A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109951438B (en) Communication optimization method and system for distributed deep learning
CN111382844B (en) Training method and device for deep learning model
CN112181666B (en) Equipment assessment and federal learning importance aggregation method based on edge intelligence
CN110969251B (en) Neural network model quantification method and device based on label-free data
EP3889846A1 (en) Deep learning model training method and system
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN113778691B (en) Task migration decision method, device and system
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN110689113A (en) Deep neural network compression method based on brain consensus initiative
CN114169543A (en) Federal learning algorithm based on model obsolescence and user participation perception
CN111740925A (en) Deep reinforcement learning-based flow scheduling method
CN114465900B (en) Data sharing delay optimization method and device based on federal edge learning
CN113627519A (en) Distributed random gradient descent method with compression and delay compensation
CN117290721A (en) Digital twin modeling method, device, equipment and medium
CN115408072A (en) Rapid adaptation model construction method based on deep reinforcement learning and related device
Zhou et al. AdaptCL: Efficient collaborative learning with dynamic and adaptive pruning
CN117202264A (en) 5G network slice oriented computing and unloading method in MEC environment
CN115150288B (en) Distributed communication system and method
CN104537224B (en) Multi-state System Reliability analysis method and system based on adaptive learning algorithm
CN115129471A (en) Distributed local random gradient descent method for large-scale GPU cluster
CN116524173A (en) Deep learning network model optimization method based on parameter quantization
CN112906291B (en) Modeling method and device based on neural network
CN114386469A (en) Method and device for quantizing convolutional neural network model and electronic equipment
CN115719086B (en) Method for automatically obtaining hybrid precision quantized global optimization strategy
CN115696405B (en) Computing task unloading optimization method and system considering fairness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant