CN109951438B - Communication optimization method and system for distributed deep learning - Google Patents

Communication optimization method and system for distributed deep learning

Info

Publication number
CN109951438B
CN109951438B (application CN201910035739.9A)
Authority
CN
China
Prior art keywords
gradient
training
deep learning
value
communication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910035739.9A
Other languages
Chinese (zh)
Other versions
CN109951438A (en
Inventor
刘万涛
陆取
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910035739.9A priority Critical patent/CN109951438B/en
Publication of CN109951438A publication Critical patent/CN109951438A/en
Application granted granted Critical
Publication of CN109951438B publication Critical patent/CN109951438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a communication optimization method and system for distributed deep learning. The method comprises the following steps: 1) for the gradients generated by each computing node during distributed deep learning training, selecting a subset of gradient values with a sparsification method; 2) quantizing the gradient values selected by the sparsification method and transmitting the quantized values to the parameter server as the objects of gradient communication. The invention combines sparse gradient transmission with quantization compression, divides the training process into three stages according to their different characteristics, and adjusts the compression according to the training conditions, thereby further improving the gradient compression effect and the training efficiency in a distributed setting without any obvious negative impact on convergence performance or model accuracy.

Description

Communication optimization method and system for distributed deep learning
Technical Field
The invention belongs to the technical field of artificial intelligence and deep learning, and particularly relates to a communication optimization method and system for distributed deep learning.
Background
In recent years, with the rising popularity of artificial intelligence, machine learning has become a hot topic of scientific research. In the deep learning field, network structures have grown explosively: ever larger and deeper networks are continually being proposed and applied to fields such as text classification, image processing, machine translation and speech recognition, with remarkable results.
As the size of the data sets fed to neural networks and the number of network model parameters keep growing (common networks have tens of millions or even hundreds of millions of parameters, and many models are several GB in size), single-machine training runs into problems such as insufficient storage space and excessively long training time. Much of the training of deep learning network structures has therefore been moved to distributed systems.
There are two common modes of distributed training for deep learning: model parallelism and data parallelism. Model parallelism is usually used when the network structure is too large to be stored on a single node; the complete neural network is split into several parts stored on different nodes, which then fetch training data in parallel for training. However, splitting a model is difficult and sharing the training data is costly, so model parallelism is not the main approach at the current stage. In data parallelism, each node stores a copy of the network structure and the training data set is split across the nodes.
Distributed deep learning training widely uses the stochastic gradient descent (SGD) algorithm, and clusters mainly adopt the ps-worker architecture. As shown in Fig. 1, the parameter server (ps) collects the gradients computed by the worker nodes and updates the model parameters stored on the server according to these gradients. Each worker fetches the parameter values after the parameter server has computed the latest parameters, then fetches new training data and performs the computation for the next batch.
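The ps-worker gradient exchange described above can be illustrated with a minimal single-process sketch in Python/NumPy; the class and method names are illustrative assumptions, and a real system would exchange gradients over the network rather than within one process:

    import numpy as np

    # Minimal single-process sketch of the ps-worker pattern described above.
    # All names (ParameterServer, Worker, compute_gradient) are illustrative,
    # not part of the patent.

    class ParameterServer:
        def __init__(self, dim, lr=0.1):
            self.params = np.zeros(dim)
            self.lr = lr

        def apply_gradients(self, grads):
            # Average the gradients collected from all workers and update the model.
            self.params -= self.lr * np.mean(grads, axis=0)
            return self.params

    class Worker:
        def __init__(self, data):
            self.data = data          # this worker's shard of the training set

        def compute_gradient(self, params, batch_idx):
            x = self.data[batch_idx]
            # Toy quadratic loss 0.5*||params - x||^2, whose gradient is params - x.
            return params - x

    # Data parallelism: the data set is split across workers, each holding a model copy.
    data_shards = np.split(np.random.randn(8, 4), 2)
    workers = [Worker(shard) for shard in data_shards]
    ps = ParameterServer(dim=4)

    params = ps.params
    for step in range(4):
        grads = [w.compute_gradient(params, step % 4) for w in workers]
        params = ps.apply_gradients(grads)   # synchronous gradient exchange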
Enlarging the cluster greatly increases the computing power of the whole system, but the communication bandwidth between machines then becomes the main factor limiting training efficiency. Because the volume of gradient communication is large (the gradient has the same size as the parameters) and the communication is concentrated in time (gradients are usually exchanged all at once after back-propagation finishes), the network bandwidth often cannot meet the demand. The cluster spends a large amount of time on network communication, and the computation time saved by adding nodes can be completely swallowed by communication time, so training efficiency does not improve. The problem is even more acute on small-scale or mobile devices. The network bandwidth required for communication has therefore become a major obstacle to the development of distributed machine learning.
The main objective of the invention is to solve the network communication bottleneck that appears when a distributed system is used to train a deep learning model. The problem can be illustrated with a concrete example. Fig. 2 shows the training of the classic image classification network AlexNet on a Tesla K80 graphics card, listing the average time per iteration under different frameworks; the average time of one iteration with mini-batch size 128 on the GPU is between 19 ms and 54 ms (different frameworks introduce different overheads). Since AlexNet has about 60M parameters, the amount of gradient data that must be exchanged in one gradient exchange is about 60M × 4 B = 240 MB, ignoring the overhead of the network communication protocol. Over a fairly common 1 Gbps network it takes on the order of a second just to deliver a gradient of this size to the server, roughly 20 to 50 times the cost of one round of computation. AlexNet is also a comparatively simple network; the situation is even worse for larger networks. Network bandwidth thus limits the improvement of distributed deep learning efficiency and greatly restricts the scalability of clusters.
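As a rough order-of-magnitude check of the figures above, the following sketch (with illustrative variable names) computes the ideal transfer time of an AlexNet-sized gradient over a 1 Gbps link, ignoring protocol overhead:

    # Order-of-magnitude estimate of the gradient communication cost discussed above.
    num_params = 60e6            # AlexNet has roughly 60M parameters
    bytes_per_param = 4          # 32-bit floating-point gradients
    gradient_bytes = num_params * bytes_per_param           # ~240 MB per exchange

    bandwidth_bits_per_s = 1e9   # a common 1 Gbps link
    transfer_s = gradient_bytes * 8 / bandwidth_bits_per_s  # ~1.9 s, ideal case

    compute_s = 0.05             # ~19-54 ms per iteration on a Tesla K80
    print(f"gradient size: {gradient_bytes / 1e6:.0f} MB, "
          f"transfer: {transfer_s:.1f} s, "
          f"transfer/compute ratio: {transfer_s / compute_s:.0f}x")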
Under the distributed architecture of the ps-worker, researchers provide a plurality of methods for solving the network bandwidth bottleneck problem in the field of machine learning during distributed training of models.
From the system point of view, one idea exploits the uneven convergence characteristic of machine learning training: instead of the strict Bulk Synchronous Parallel (BSP) mechanism, parameters are updated asynchronously, so that a node with a fast computation speed can push its gradient as soon as it is computed, without waiting to synchronize the results of all nodes every time; this is the so-called Stale Synchronous Parallel (SSP) scheme. Another idea is to overlap gradient computation and data communication: during back-propagation, as soon as the gradients for one layer of parameters have been computed, the corresponding parameters are updated asynchronously while the compute nodes continue their computation. In the best case, the network communication time can be completely hidden by the computation time.
In addition to these methods, controlling the amount of data transmitted over the network can also effectively reduce the time spent on network communication. During deep learning training only gradients and parameters are transmitted, and since changing the parameter values would obviously affect model accuracy, compressing the gradients is the common idea. There are two approaches: sparse transmission of gradients and quantization compression.
(1) Sparse gradient transmission. The number of gradients transmitted during communication is limited and the transmission of some gradients is delayed, which reduces the amount of data on the network. Storm proposed filtering gradients with a threshold: gradients larger than the threshold are transmitted and the rest are accumulated locally; his experiments show a 17x speed-up with 20 nodes. However, the threshold is an additionally introduced hyper-parameter, and choosing it well is very difficult. Aji proposed selecting the fraction of gradients with the largest absolute values for synchronous updating, i.e. filtering the gradients at a fixed ratio, which cuts gradient transmission by about 99% with only a slight drop in the BLEU score of the trained model. To adapt the compression rate automatically to different tasks, Chen proposed a local gradient selection method that achieves a 200x compression on fully connected networks without affecting the final accuracy. Lin observed that 99.9% of gradient transmission is redundant and therefore proposed a deep gradient compression strategy in which four techniques compensate for the errors caused by sparsification, raising the compression ratio to as much as 600x.
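A minimal sketch of this kind of ratio-based sparsification with local accumulation, assuming a flat NumPy gradient vector; the function name and the fixed selection ratio are illustrative assumptions rather than the exact algorithm of any cited work:

    import numpy as np

    def sparsify_gradient(grad, residual, ratio=0.01):
        """Select the top `ratio` fraction of gradients by absolute value.

        Unselected gradients are accumulated in `residual` and carried over
        to the next iteration instead of being transmitted.
        """
        grad = grad + residual                       # add what was held back earlier
        k = max(1, int(ratio * grad.size))
        threshold = np.sort(np.abs(grad))[-k]        # k-th largest |g|
        mask = np.abs(grad) >= threshold
        new_residual = np.where(mask, 0.0, grad)     # keep small gradients locally
        indices = np.nonzero(mask)[0]
        values = grad[mask]
        return indices, values, threshold, new_residual

    grad = np.random.randn(1000)
    residual = np.zeros_like(grad)
    idx, vals, thr, residual = sparsify_gradient(grad, residual)
    # Only (idx, vals), about 1% of the gradient, would be sent over the network.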
(2) Gradient quantization. Representing gradient values with lower-precision numbers also reduces the bandwidth needed for communication. Zhou used a bit-convolution approach to accelerate training and prediction (compressing gradients and the model simultaneously) in a network named DoReFa-Net, whose final accuracy is comparable to that of a normal network. Glide quantized the transmitted gradient so that each gradient takes only one bit during transmission, and trained a 160M network with a tenfold acceleration and only a small loss of accuracy. Alistarh proposed the QSGD compression strategy, which allows the user to trade off compression rate against convergence performance effectively and shortens the training time of ResNet-152 by a factor of 1.8. Wen proposed a ternary gradient compression method that quantizes all gradient values to {-1, 0, 1}; experiments show no large loss of accuracy on most network structures. Both the QSGD and the ternary work prove the convergence of their methods theoretically.
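A minimal sketch of ternary (three-valued) gradient quantization in the spirit described above, assuming deterministic rounding and the maximum absolute value as the scale; this is an illustrative simplification, not the exact algorithm of any cited work:

    import numpy as np

    def ternarize(grad):
        """Quantize a gradient vector to scale * {-1, 0, +1}.

        Only the signs of the retained entries and one scale value need to be
        transmitted, so each entry costs about 2 bits instead of 32.
        """
        scale = np.max(np.abs(grad))
        if scale == 0.0:
            return 0.0, np.zeros_like(grad, dtype=np.int8)
        # Keep entries whose magnitude is at least half the scale; drop the rest.
        ternary = np.where(np.abs(grad) >= 0.5 * scale, np.sign(grad), 0.0)
        return scale, ternary.astype(np.int8)

    def dequantize(scale, ternary):
        # The receiver reconstructs an approximate gradient from the compact form.
        return scale * ternary.astype(np.float32)

    g = np.random.randn(8)
    scale, q = ternarize(g)
    g_hat = dequantize(scale, q)       # lossy reconstruction of the gradient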
As shown in Fig. 3, increasing the number of nodes improves the computing capability of the whole cluster, and for the same mini-batch the computation time can be greatly shortened. At that point, however, the communication bottleneck becomes obvious; gradient compression reduces the scale of communication and thereby shortens the training time.
The SSP approach in existing distributed communication optimization, which abandons strict synchronous updates, has the drawback that a node may compute gradients with stale parameters, which harms convergence to some extent. The accuracy of the resulting model differs from that obtained before the optimization, the number of epochs needed for convergence may grow, and the final speed-up is not necessarily significant.
The idea of overlapping computation and communication is not universally applicable either: communicating ahead of time cannot fully help neural networks with a relatively low "computation-to-communication ratio", where communication still takes up a large share of the time.
Existing work on gradient compression has achieved good results, but some shortcomings remain.
In the sparsification work, the common problem is how to choose the threshold; moreover, convergence and final accuracy cannot be fully guaranteed, and the behaviour during training is hard to keep stable. Another problem of the sparsification methods is that the transmitted gradients are not expressed efficiently, so there is still obvious redundancy in the data transfer. Whether the gradients are selected by a threshold or by a fixed ratio, the value range of the gradients sent over the network has already been narrowed considerably; representing them with high-precision floating-point numbers is wasteful from a coding point of view, and a more efficient representation is possible.
Quantization usually requires more computation than sparsification, introducing extra overhead, and it has an upper bound on the compression rate (a 32-bit floating-point number can be compressed to at most 1 bit). Quantization also has another drawback: whatever reconstruction scale is chosen for transmission, the quantization process introduces an error into most of the reconstructed gradients. Some methods ignore this error, while others accumulate it locally, which increases part of the local workload; either way, the error may affect the training result.
Disclosure of Invention
The main objective of the present invention is to solve the network communication bottleneck problem that occurs when a deep learning model is trained using a distributed system.
The technical scheme adopted by the invention is as follows:
a communication optimization method for distributed deep learning comprises the following steps:
1) for the gradients generated by each computing node during the training process of distributed deep learning, selecting a subset of gradient values with a sparsification method;
2) quantizing the gradient values selected by the sparsification method, and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
Further, the training process of the distributed deep learning adopts a stochastic gradient descent algorithm and comprises three stages: a warm-up stage, a steady stage and a best-effort stage; automatic adjustment across the stages is ensured through automatic sparsity adjustment and quantization rollback, realizing an adaptive staged gradient quantization strategy.
Further, the automatic sparsity adjustment uses the value of the regularization term to decide the specific degree of sparsity.
Further, the automatic sparsity adjustment comprises the following sub-steps:
a) initializing the parameters and calculating the value r_0 of the regularization term;
b) starting training and storing the regularization term after each round of training;
c) calculating the average regularization term within an epoch;
d) modifying the sparsity to 1 - 0.2(r_0/r_i), where r_i represents the average regularization term value in the i-th epoch;
e) if the sparsity has increased to the maximum sparsity, or the regularization term no longer decreases within an epoch, or the number of epochs has reached the upper limit, setting the sparsity to the maximum value.
Further, quantization rollback means abandoning the quantization operation on the gradient at an appropriate time, so that the whole model can reach the convergence state more accurately.
Further, errors produced by the sparsification and the quantization are corrected with a merged-residual method: the gradients not selected for transmission during sparsification are first kept locally, the quantization error produced during quantization is accumulated with the gradients kept during sparsification, and the combined gradient, called the merged residual, is added to the newly calculated gradient values in the next training round.
Further, for each locally retained merged residual, a momentum correction operation is performed once the residual reaches the threshold and is transmitted, i.e. the local momentum is cleared to eliminate the influence of its previous value.
A distributed deep learning computing node apparatus, comprising:
the sparsification module, which is responsible for selecting a subset of gradient values with a sparsification method from the gradients generated by each computing node during the training process of distributed deep learning;
and the quantization module, which is responsible for quantizing the gradient values selected by the sparsification module and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
A distributed deep learning system comprising at least one computing node device as described above, and a parameter server in communication connection with the computing node device.
The key point of the invention is to combine sparse gradient transmission with quantization compression and to divide the training process into three different stages according to their characteristics. The compression method is adjusted according to the training conditions, which further improves the gradient compression effect and the training efficiency in a distributed scenario, without any obvious negative impact on convergence performance or model accuracy.
From the point of view of compression ratio, existing gradient compression techniques reach at best a few hundred times. The invention applies a multi-stage gradient compression technique that combines two different compression methods and can reach compression ratios of several thousand times on some models, thereby improving the training efficiency of the distributed system.
Drawings
FIG. 1 is a diagram of a ps-worker architecture for a stochastic gradient descent algorithm.
Fig. 2 is a schematic diagram of the training situation of the image classification network AlexNet on the Tesla K80 graphics card.
Fig. 3 is a schematic diagram of a computation-communication comparison of distributed deep learning.
Fig. 4 is a flow chart of the main steps of the present invention.
Fig. 5 is a schematic diagram of the division of the training process into three training phases.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
In order to exploit the advantages of both gradient sparsification and gradient quantization and to solve the problems of the prior art, the invention combines the two into one compression strategy (as shown in Fig. 4): a sparsification method is first applied to the computed gradients, and a subset of large gradient values, whose value range is relatively concentrated, is selected for transmission; these gradients are then quantized with a ternary compression strategy, using the threshold applied when selecting gradients during sparsification as the reconstruction scale. There are two reasons for this design. First, because the filtered gradient values are concentrated, the error produced by using the threshold as the reconstructed gradient value is small, which benefits training. Second, the existing ternary method mentions gradient clipping to ensure overall convergence, but clipping introduces a hyper-parameter that controls what proportion of the gradient is clipped; using the threshold directly avoids this hyper-parameter.
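A minimal sketch of this combined step on a flat NumPy gradient: entries above a threshold are selected, and each selected entry is transmitted as its sign multiplied by the threshold, which also serves as the reconstruction scale. The helper names and the way the threshold is derived from a selection ratio are illustrative assumptions:

    import numpy as np

    def compress(grad, ratio=0.01):
        """Sparsify, then ternarize the selected gradients.

        The k-th largest absolute value serves both as the selection threshold
        and as the reconstruction scale, so only indices, signs and one scalar
        need to be transmitted.
        """
        k = max(1, int(ratio * grad.size))
        threshold = np.sort(np.abs(grad))[-k]
        mask = np.abs(grad) >= threshold
        indices = np.nonzero(mask)[0]
        signs = np.sign(grad[mask]).astype(np.int8)
        residual = np.where(mask, 0.0, grad)          # unselected gradients stay local
        quant_error = grad[mask] - signs * threshold  # error introduced by quantization
        return (indices, signs, threshold), residual, quant_error

    def decompress(packet, size):
        indices, signs, threshold = packet
        grad_hat = np.zeros(size)
        grad_hat[indices] = signs * threshold         # threshold as reconstruction scale
        return grad_hat

    g = np.random.randn(10000)
    packet, residual, err = compress(g)
    g_hat = decompress(packet, g.size)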
The basic formula of stochastic gradient descent is:

    w_{t+1} = w_t - η ∇f(w_t; x)

where w_{t+1} denotes the updated parameter value after the t-th round, w_t denotes the parameter value at the start of the t-th round, η denotes the learning rate, ∇f(w_t; x) denotes the calculated gradient, x denotes the input data, and f denotes the loss function.
From the SGD formula above, the magnitude of a parameter change depends on the learning rate and on the value of the gradient. At the start of neural network training all parameter values are generated randomly, so the whole network is in a very unstable state: the parameters change greatly and fluctuate sharply. Published work suggests fixing a relatively small learning rate for the first few epochs to reduce this fluctuation, and Facebook, when training large networks, lets the learning rate grow linearly from a small value to its final value over the first five epochs. Here an epoch is one pass of forward and backward propagation over all batches, i.e. every sample in the data set has been computed once.
In the MGC (Multi-stage Gradient Compression) strategy, to prevent the parameters from being misled by erroneous gradients while they are still changing violently at the start, the invention uses the automatic sparsity adjustment technique to control training during the warm-up stage.
After a period of training, both the loss function and the parameters change smoothly. The experimental part of TernGrad found that in the second half of training, adjusting the compression strategy and stopping quantization compression in time gives better results. This is because gradient quantization compression is essentially a lossy representation of the data: most of the transmitted gradient values carry an error, which perturbs the direction represented by the overall gradient. In the final stage of training the loss function is close to convergence, and such differences in the gradient values may push it away from the target optimum. A "best-effort" stage is therefore needed so that the final goal is reached more accurately and more quickly in the phase approaching convergence. The invention realizes this change of training strategy with a quantization rollback policy.
1. Multi-stage gradient compression
Algorithm 1 in Table 1 summarizes the basic flow of multi-stage gradient compression. Compared with the traditional distributed stochastic gradient descent algorithm, the method adds two steps on each computing node: automatic sparsity adjustment and quantization rollback. The former changes the sparsity with which gradients are selected according to the current training situation; the latter abandons the quantization of the gradient at an appropriate time, so that the whole model can reach the convergence state more accurately.
2. Automatic sparsity adjustment
In the loss function of deep learning, a regularization term is added to prevent overfitting and increase the generalization ability of the model. Because the parameters are initialized randomly, the value of the regularization term tends to be large at that point; it changes sharply in the early stage of training and usually falls quickly to a reasonable level. The invention uses the value of the regularization term to determine the specific degree of sparsity. In the MGC compression strategy the sparsification of the gradient is relatively mild at the beginning of training, and it is advantageous to transmit a relatively complete gradient at this time to help the transition from the rapidly changing stage to the steady stage. The specific algorithm is shown in Table 2 below.
Table 1. Algorithm 1: multi-stage gradient compression (shown as an image in the original publication).
Table 2. Algorithm 2: automatic sparsity adjustment (shown as an image in the original publication).
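A minimal sketch of the automatic sparsity adjustment of Algorithm 2, assuming an L2-norm regularization term; the maximum sparsity value and all names are illustrative assumptions rather than values fixed by the patent:

    import numpy as np

    def l2_regularization(params):
        # Assumed regularization term: 0.5 * sum of squared weights over all tensors.
        return 0.5 * sum(float(np.sum(w * w)) for w in params)

    def adjust_sparsity(r0, r_i, max_sparsity=0.999):
        """Sparsity schedule from Algorithm 2: sparsity = 1 - 0.2 * (r0 / r_i).

        r0 is the regularization term at initialization and r_i the average value
        over the i-th epoch; the result is clamped to [0, max_sparsity]. Step e)
        of the algorithm sets the sparsity to the maximum once it has reached the
        maximum, the regularization term stops decreasing, or the epoch limit is hit.
        """
        sparsity = 1.0 - 0.2 * (r0 / r_i)
        return float(np.clip(sparsity, 0.0, max_sparsity))

    params = [np.random.randn(100, 10), np.random.randn(10)]
    r0 = l2_regularization(params)
    print(adjust_sparsity(r0, 0.9 * r0))   # sparsity during an early warm-up epoch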
3. Quantization rollback
Intuitively, stochastic gradient descent is like descending a mountain, with each step following the direction of steepest descent. When approaching a lowest point, the compression amplitude should be reduced somewhat so that the point is reached faster and is not overshot because of deviations in direction. The invention therefore incorporates a quantization rollback method into MGC. The rollback point is chosen according to the total length of the whole training run: in the later period (for example the last 20% of the run) the compression strategy gives up quantization. For example, if the total number of epochs is 100, quantization is abandoned in the last 20 epochs.
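A small sketch of this schedule decision, using the 20% fraction from the example above; the function name is illustrative:

    def use_quantization(epoch, total_epochs, rollback_fraction=0.2):
        """Quantization rollback: stop quantizing in the last `rollback_fraction`
        of the training run (e.g. the last 20 of 100 epochs)."""
        return epoch < (1.0 - rollback_fraction) * total_epochs

    # With 100 epochs in total, quantization is applied up to epoch 79 and
    # abandoned from epoch 80 onwards (0-based epoch index).
    assert use_quantization(79, 100) and not use_quantization(80, 100)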
4. Merged residual correction
Because the compression strategy of the invention uses two compression methods at the same time and both of them introduce errors, the invention uses a merged-residual correction method to compensate for the loss. During the first step, sparsification, the gradients that are not selected for transmission are kept locally; during the second step, quantization, the quantization error that is produced is accumulated with the gradients kept in the first step. The combined gradient is called the merged residual and is added to the newly computed gradient values in the next training round.
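A minimal, self-contained sketch of the merged-residual bookkeeping, using a simple threshold-plus-sign compression step as a stand-in for the full MGC pipeline; all names are illustrative assumptions:

    import numpy as np

    def sparsify_and_quantize(grad, threshold):
        """One compression round, returning the transmitted gradient approximation
        and the merged residual (unselected gradients + quantization error)."""
        mask = np.abs(grad) >= threshold
        transmitted = np.where(mask, np.sign(grad) * threshold, 0.0)
        sparsity_residual = np.where(mask, 0.0, grad)                 # step 1: kept locally
        quantization_error = np.where(mask, grad - transmitted, 0.0)  # step 2: error
        merged_residual = sparsity_residual + quantization_error
        return transmitted, merged_residual

    residual = np.zeros(1000)
    for step in range(3):
        grad = np.random.randn(1000)
        grad = grad + residual                 # add the merged residual of the
                                               # previous round to the new gradient
        transmitted, residual = sparsify_and_quantize(grad, threshold=1.5)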
The existing SGD algorithm often adds momentum to accelerate training, and momentum correction can address the resulting loss of model performance. The algorithm of the invention adapts the existing momentum correction and applies the momentum correction measure to the locally kept merged residual as well.
For the merged residual gradients that are kept locally each time, a correction operation is performed once the residual reaches the threshold and is transmitted: the local momentum is cleared to eliminate the influence of its previous value, as shown in the following equation, where threshold denotes the threshold, u_{t,k} denotes the momentum, t denotes the computation round and k denotes the index of the gradient.
    u_{t,k} = 0,  if |residual_{t,k}| > threshold
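A small sketch of this momentum correction, assuming a standard momentum accumulation into the residual; only the zeroing of the transmitted coordinates follows the equation above, the rest is an illustrative assumption:

    import numpy as np

    def momentum_corrected_step(grad, momentum, residual, threshold, beta=0.9):
        """Accumulate momentum into the residual, transmit entries above the
        threshold, and zero the momentum of the transmitted coordinates
        (u_{t,k} = 0 where |residual_{t,k}| > threshold)."""
        momentum = beta * momentum + grad          # assumed standard momentum update
        residual = residual + momentum
        mask = np.abs(residual) > threshold
        transmitted = np.where(mask, np.sign(residual) * threshold, 0.0)
        residual = np.where(mask, 0.0, residual)   # transmitted part leaves the residual
        momentum = np.where(mask, 0.0, momentum)   # momentum correction: clear u_{t,k}
        return transmitted, momentum, residual

    g = np.random.randn(1000)
    u = np.zeros_like(g)
    r = np.zeros_like(g)
    sent, u, r = momentum_corrected_step(g, u, r, threshold=2.0)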
Overall, the invention divides the whole training process into three main stages, namely a warm-up stage, a steady stage and a best-effort stage, as shown in Fig. 5, where the abscissa represents the training epoch and the ordinate represents the loss function value. Automatic adjustment across the stages is ensured by automatic sparsity adjustment and quantization rollback, realizing an adaptive staged gradient quantization strategy.
From the point of view of compression ratio, existing gradient compression techniques reach at best a few hundred times. After applying MGC, the compression technique of the invention that uses two different compression methods, compression ratios of several thousand times can be reached on some models, as shown in Table 3 below, which improves the training efficiency of the distributed system.
Another embodiment of the present invention provides a distributed deep learning computing node device, including:
the sparsification module, which is responsible for selecting a subset of gradient values with a sparsification method from the gradients generated by each computing node during the training process of distributed deep learning;
and the quantization module, which is responsible for quantizing the gradient values selected by the sparsification module and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
The specific implementation of the sparsification module and the quantization module is as described above for the method of the invention.
Another embodiment of the present invention provides a distributed deep learning system, which includes at least one computing node apparatus as described above, and a parameter server establishing a communication connection with the computing node apparatus.
Table 3. Comparison of the compression effect of the invention with other methods (table shown as an image in the original publication).
The gradient sparsification and gradient quantization techniques in the invention are not limited to ternary quantization and deep gradient compression; other compression methods for gradient sparsification or quantization (such as the one-bit quantization strategy of CNTK) can also be applied in this scheme.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (6)

1. A communication optimization method for distributed deep learning is characterized by comprising the following steps:
1) for the gradients generated by each computing node during the training process of distributed deep learning, selecting a subset of gradient values with a sparsification method;
2) quantizing the gradient values selected by the sparsification method, and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication;
the training process of the distributed deep learning adopts a stochastic gradient descent algorithm and comprises three stages: a warm-up stage, a steady stage and a best-effort stage; automatic adjustment across the stages is ensured through automatic sparsity adjustment and quantization rollback, realizing an adaptive staged gradient quantization strategy; the quantization rollback means abandoning the quantization operation on the gradient at an appropriate time, so that the whole model can reach the convergence state more accurately;
the random gradient descent algorithm comprises the following steps:
for compute nodes numbered 1-N:
1) obtaining the training sample Z_t(i) assigned to itself;
2) calculating the gradient g_t(i) and accumulating the newly calculated gradient with the local gradient residual g_re(i): g_t(i) = g_t(i) + g_re(i);
3) running the automatic sparsity adjustment algorithm to obtain the sparsity τ;
4) filtering the gradient with the obtained sparsity and locally saving the threshold and the residual g_re;
5) applying the quantization rollback strategy to decide whether to quantize the transmitted gradient;
6) transmitting the gradient through the network;
7) applying merged-residual correction to the local gradient residual;
8) obtaining the aggregated gradient values from the server side;
9) updating local parameters;
the parameter server averages the obtained gradient values;
the automatic sparsity adjustment algorithm uses the value of the regularization term to decide the specific sparsity and comprises the following sub-steps:
a) initializing the parameters and calculating the value r_0 of the regularization term;
b) starting training and storing the regularization term after each round of training;
c) calculating the average regularization term within an epoch;
d) modifying the sparsity to 1 - 0.2(r_0/r_i), where r_i represents the average regularization term value in the i-th epoch;
e) if the sparsity has increased to the maximum sparsity, or the regularization term no longer decreases within an epoch, or the number of epochs has reached the upper limit, setting the sparsity to the maximum value.
2. The method of claim 1, wherein the quantization rollback abandons quantization during the last 20% of the total duration of the whole training.
3. The method according to claim 1 or 2, wherein errors generated by the sparsification and the quantization are corrected using a merged-residual method: the gradients not selected for transmission during sparsification are first kept locally, the quantization error generated during quantization is accumulated with the gradients kept during sparsification, and the combined gradient, called the merged residual, is added to the newly calculated gradient values in the next training round.
4. The method according to claim 3, wherein, for each locally retained merged residual, a momentum correction operation is performed once the residual reaches the threshold and is transmitted, i.e. the local momentum is cleared to eliminate the influence of its previous value.
5. A computing node device for distributed deep learning by using the method of any one of claims 1 to 4, comprising:
the sparsification module, which is responsible for selecting a subset of gradient values with a sparsification method from the gradients generated by each computing node during the training process of distributed deep learning;
and the quantization module, which is responsible for quantizing the gradient values selected by the sparsification module and transmitting the quantized gradient values to the parameter server as the transmission objects of gradient communication.
6. A distributed deep learning system comprising at least one computing node apparatus of claim 5 and a parameter server in communication with the computing node apparatus.
CN201910035739.9A 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning Active CN109951438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035739.9A CN109951438B (en) 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910035739.9A CN109951438B (en) 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning

Publications (2)

Publication Number Publication Date
CN109951438A CN109951438A (en) 2019-06-28
CN109951438B true CN109951438B (en) 2020-11-20

Family

ID=67007902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035739.9A Active CN109951438B (en) 2019-01-15 2019-01-15 Communication optimization method and system for distributed deep learning

Country Status (1)

Country Link
CN (1) CN109951438B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287031B (en) * 2019-07-01 2023-05-09 南京大学 Method for reducing communication overhead of distributed machine learning
CN110503194B (en) * 2019-08-09 2022-05-24 苏州浪潮智能科技有限公司 Distributed parallel training method and system
CN110633798B (en) * 2019-09-12 2023-04-07 北京金山数字娱乐科技有限公司 Parameter updating method and device in distributed training
CN110647765B (en) * 2019-09-19 2022-04-12 济南大学 Privacy protection method and system based on knowledge migration under collaborative learning framework
CN110619388B (en) * 2019-09-20 2024-04-02 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
CN112651411B (en) * 2019-10-10 2022-06-07 中国人民解放军国防科技大学 Gradient quantization method and system for distributed deep learning
CN111027671B (en) * 2019-11-12 2023-07-04 华中科技大学 Distributed deep learning communication method and system based on model structural characteristics
CN112948105B (en) * 2019-12-11 2023-10-17 香港理工大学深圳研究院 Gradient transmission method, gradient transmission device and parameter server
CN111382844B (en) * 2020-03-11 2023-07-07 华南师范大学 Training method and device for deep learning model
US20210295168A1 (en) * 2020-03-23 2021-09-23 Amazon Technologies, Inc. Gradient compression for distributed training
CN111582494B (en) * 2020-04-17 2023-07-07 浙江大学 Mixed distributed machine learning updating method based on delay processing
CN111625603A (en) * 2020-05-28 2020-09-04 浪潮电子信息产业股份有限公司 Gradient information updating method for distributed deep learning and related device
CN111429142B (en) * 2020-06-10 2020-09-11 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN112134812B (en) * 2020-09-08 2022-08-05 华东师范大学 Distributed deep learning performance optimization method based on network bandwidth allocation
CN112463189B (en) * 2020-11-20 2022-04-22 中国人民解放军国防科技大学 Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN112564118B (en) * 2020-11-23 2022-03-18 广西大学 Distributed real-time voltage control method capable of expanding quantum deep width learning
CN112561078B (en) * 2020-12-18 2021-12-28 北京百度网讯科技有限公司 Distributed model training method and related device
CN113095510B (en) * 2021-04-14 2024-03-01 深圳前海微众银行股份有限公司 Federal learning method and device based on block chain
CN113139663B (en) * 2021-04-23 2023-01-20 深圳市大数据研究院 Federal edge learning configuration information acquisition method, device, equipment and medium
CN113159331B (en) * 2021-05-24 2023-06-30 同济大学 Self-adaptive sparseness quantization method of networked machine learning system
CN113487036B (en) * 2021-06-24 2022-06-17 浙江大学 Distributed training method and device of machine learning model, electronic equipment and medium
CN113467949B (en) * 2021-07-07 2022-06-28 河海大学 Gradient compression method for distributed DNN training in edge computing environment
CN113660113B (en) * 2021-07-27 2023-09-15 上海大学 Self-adaptive sparse parameter model design and quantization transmission method for distributed machine learning
CN113592701B (en) * 2021-08-05 2024-03-29 中国科学技术大学 Method and system for registering gradient compression algorithm development into deep learning framework
CN114125070B (en) * 2021-11-10 2023-06-13 深圳大学 Communication method, system, electronic device and storage medium for quantization compression
CN114298277B (en) * 2021-12-28 2023-09-12 四川大学 Distributed deep learning training method and system based on layer sparsification
CN114386622A (en) * 2022-01-14 2022-04-22 平安科技(深圳)有限公司 Gradient compression method, device, equipment and storage medium
CN114710415B (en) * 2022-05-23 2022-08-12 北京理工大学 Redundant coded passive message reliable transmission and processing system
CN115103031B (en) * 2022-06-20 2023-07-14 西南交通大学 Multistage quantization and self-adaptive adjustment method
CN116341628B (en) * 2023-02-24 2024-02-13 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training
CN117910521B (en) * 2024-03-20 2024-06-14 浪潮电子信息产业股份有限公司 Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106062786B (en) * 2014-09-12 2019-12-31 微软技术许可有限责任公司 Computing system for training neural networks
CN104714852B (en) * 2015-03-17 2018-05-22 华中科技大学 A kind of parameter synchronization optimization method and its system suitable for distributed machines study
US11106973B2 (en) * 2016-03-16 2021-08-31 Hong Kong Applied Science and Technology Research Institute Company Limited Method and system for bit-depth reduction in artificial neural networks
US10013644B2 (en) * 2016-11-08 2018-07-03 International Business Machines Corporation Statistical max pooling with deep learning
US20200380356A1 (en) * 2017-02-23 2020-12-03 Sony Corporation Information processing apparatus, information processing method, and program
CN107679618B (en) * 2017-07-28 2021-06-11 赛灵思电子科技(北京)有限公司 Static strategy fixed-point training method and device
CN107832847A (en) * 2017-10-26 2018-03-23 北京大学 A kind of neural network model compression method based on rarefaction back-propagating training
CN108491928B (en) * 2018-03-29 2019-10-25 腾讯科技(深圳)有限公司 Model parameter sending method, device, server and storage medium
CN108829441B (en) * 2018-05-14 2022-10-18 中山大学 Distributed deep learning parameter updating and optimizing system
CN108875602A (en) * 2018-05-31 2018-11-23 珠海亿智电子科技有限公司 Monitor the face identification method based on deep learning under environment

Also Published As

Publication number Publication date
CN109951438A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109951438B (en) Communication optimization method and system for distributed deep learning
CN111382844B (en) Training method and device for deep learning model
CN112181666B (en) Equipment assessment and federal learning importance aggregation method based on edge intelligence
CN110969251B (en) Neural network model quantification method and device based on label-free data
EP3889846A1 (en) Deep learning model training method and system
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN113778691B (en) Task migration decision method, device and system
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN110689113A (en) Deep neural network compression method based on brain consensus initiative
CN114169543A (en) Federal learning algorithm based on model obsolescence and user participation perception
CN111740925A (en) Deep reinforcement learning-based flow scheduling method
CN114465900B (en) Data sharing delay optimization method and device based on federal edge learning
CN113627519A (en) Distributed random gradient descent method with compression and delay compensation
CN117290721A (en) Digital twin modeling method, device, equipment and medium
CN115408072A (en) Rapid adaptation model construction method based on deep reinforcement learning and related device
Zhou et al. AdaptCL: Efficient collaborative learning with dynamic and adaptive pruning
CN117202264A (en) 5G network slice oriented computing and unloading method in MEC environment
CN115150288B (en) Distributed communication system and method
CN104537224B (en) Multi-state System Reliability analysis method and system based on adaptive learning algorithm
CN115129471A (en) Distributed local random gradient descent method for large-scale GPU cluster
CN116524173A (en) Deep learning network model optimization method based on parameter quantization
CN112906291B (en) Modeling method and device based on neural network
CN114386469A (en) Method and device for quantizing convolutional neural network model and electronic equipment
CN115719086B (en) Method for automatically obtaining hybrid precision quantized global optimization strategy
CN115696405B (en) Computing task unloading optimization method and system considering fairness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant