WO2022003562A1 - Statistical-based gradient compression method for distributed training system - Google Patents

Statistical-based gradient compression method for distributed training system

Info

Publication number
WO2022003562A1
WO2022003562A1 (PCT/IB2021/055814)
Authority
WO
WIPO (PCT)
Prior art keywords
gradient vector
threshold
compressed
node
sid
Prior art date
Application number
PCT/IB2021/055814
Other languages
French (fr)
Inventor
Ahmed MOHAMED ABDELMONIEM SAYED
Ahmed ELZANATY
Marco Canini
Mohamed-Slim Alouini
Original Assignee
King Abdullah University Of Science And Technology
Priority date
Filing date
Publication date
Application filed by King Abdullah University Of Science And Technology filed Critical King Abdullah University Of Science And Technology
Publication of WO2022003562A1 publication Critical patent/WO2022003562A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00: Computing arrangements based on specific mathematical models
    • G06N7/01: Probabilistic graphical models, e.g. probabilistic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • Embodiments of the subject matter disclosed herein generally relate to exchanging data within a neural network, and more particularly, to applying an efficient gradient compression technique to a distributed training neural network.
  • DNNs deep neural networks
  • Modern deep learning toolkits e.g., pytorch.org
  • a worker is understood herein to be a component or a node of the DNN that executes training, for example, a processor or a computing device.
  • Training DNNs in such settings relies on synchronous distributed Stochastic Gradient Descent (SGD) or similar optimizers.
  • SGD synchronous distributed Stochastic Gradient Descent
  • N (a positive integer) is the number of workers of the training DNN
  • x^(i) ∈ ℝ^d denotes the model parameters with d dimensions at iteration i.
  • a bold symbol means in this context a vector.
  • each worker n (where n takes a value between 1 and N) runs a back-propagation algorithm to produce a local stochastic gradient g_n^(i) ∈ ℝ^d.
  • each worker updates its model parameters x^(i+1) = x^(i) − λ·ĝ^(i) using the final gradient ĝ^(i) = (1/N) Σ_{n=1}^{N} g_n^(i) aggregated across all workers, where λ is the learning rate. This means that each worker n of the N workers needs to receive all individual gradients from all other workers and calculate the new aggregated gradient ĝ^(i).
  • Gradient aggregation involves extensive communication, which is either between the workers in a peer-to-peer fashion (typically through collective communication primitives like all-reduce) or via a parameter server architecture. Due to the synchronous nature of the optimizer, workers cannot proceed with the (i + 1) th iteration until the aggregated gradient is available. Therefore, in distributed training workloads, communication is commonly one of the predominant bottlenecks.
  • a statistical-based gradient compression method that includes receiving input data at plural nodes of a neural network, running the input data forward and backward through the plural nodes to generate node gradient vectors, fitting a sparsity-inducing distribution, SID, at each node, to a corresponding node gradient vector, calculating a first threshold η1, based on the SID, for the corresponding node gradient vector, compressing the corresponding node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmitting a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero.
  • a neural network system that uses a statistical-based gradient compression method, and the system includes plural nodes, each configured to receive input data, and each node including a processor configured to run the input data forward and backward to generate a node gradient vector, fit a sparsity-inducing distribution, SID, to the node gradient vector, calculate a first threshold η1, based on the SID, for the node gradient vector, compress the node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmit a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero.
  • a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for gradient compression using statistical methods as discussed above.
  • Figures 1A and 1B illustrate the compression speedups over Topk using different compression ratios (0.1, 0.01, 0.001) on a GPU (Figure 1A) and a CPU (Figure 1B);
  • Figure 1C shows the average estimation quality of a target k;
  • Figure 2A shows the sorted magnitude of the gradients versus their indexes and the fitted curve via a power law;
  • Figure 2B shows the approximation error of the Topk versus the number of nonzero elements k;
  • Figures 3A to 3D show gradient fitting using three sparsity-inducing distributions (SIDs) for the gradient vector along with the empirical distribution generated from training ResNet-20 on the CIFAR10 dataset using the Topk compressor with an error compensation mechanism, for the 100th [(A) PDF, (B) CDF] and 10,000th [(C) PDF, (D) CDF] iterations;
  • Figures 4A and 4B schematically illustrate a neural network configured to run a statistical-based gradient compression method
  • Figures 5A and 5B illustrate the pseudo-code run by the statistical-based gradient compression method
  • Figure 6 illustrates the iteration aspect of the statistical-based gradient compression method
  • Figure 7 provides an example for the application of the statistical-based gradient compression method
  • Figures 8A to 8C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a first dataset
  • Figures 9A and 9B show the train loss versus number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the first dataset;
  • Figures 10A to 10C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a second dataset
  • Figures 11A and 11B show the train loss versus the number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the second dataset;
  • Figure 12 is a flow chart of a method for compressing gradient vectors based on sparsity-inducing distributions.
  • Figure 13 is a schematic diagram of a node of a neural network.
  • a novel gradient compressor with minimal overhead is introduced. Noting the sparsity of the gradients, the compressor models the gradients as random variables distributed according to some sparsity- inducing distributions (SIDs). This approach is validated by studying the statistical characteristics of the evolution of the gradient vectors over the training process.
  • SIDCo Sparsity-Inducing Distribution-based Compression
  • Gradient quantization is another way to reduce the size of the transmitted gradients and it represents gradients with fewer bits for each gradient element. Under some conditions, quantization is known to achieve the same convergence as no compression. Error compensation (EC) is used to attain convergence when the gradients are quantized using fewer bits. Given the standard 32-bit float number representation, the volume reduction of the quantization is limited to 32×, i.e., 1 bit out of 32 bits, which may not be sufficient for large models or slow networks, and it requires expensive encoding to pack the quantized bits.
  • EC Error compensation
  • the gradient sparsification selects a subset of the gradient elements for the next iteration. It is generally more flexible than the quantization approach, as it can reduce the transmitted volume by up to d times and adapts easily to network conditions [6]. It was shown that in some cases, up to 99.9% of the non-significant gradient elements can be dropped with limited impact on convergence.
  • Gradient sparsification using Topk, i.e., selecting the top k elements by their magnitude, is known to yield better convergence compared to other compression schemes, e.g., Random-k.
  • Topk or its variants are notorious for being computationally inefficient. The Topk selection does not perform well on accelerators such as GPUs. For instance, in many cases, it is reported that Topk imposes high overheads and worsens the run time of distributed training systems.
  • the main challenge with using gradient compression is the computational overhead the algorithm itself introduces in the training. If the overhead is greater than the reduction gains in the communication time, the overall iteration time increases.
  • a robust compressor should have a low overhead.
  • one of the dominantly robust compressors is Topk.
  • this compressor is also computationally intensive. Because of this, using Topk for large models results in either an increased training time or unsatisfactory performance benefits.
  • Topk is the slowest on GPUs and not the fastest on CPUs.
  • threshold-based methods, which aim to overcome the overhead of Topk, select, in linear time, the gradient elements larger in magnitude than a threshold η.
  • DGC [3] proposes to sample a random subset of the gradients (e.g., 1%), and apply Topk on the sampled sub-population to find a threshold, which is then used to hierarchically obtain the actual Topk elements. Even though DGC leads to improved performance over Topk, its computational complexity is still of the same order as Topk's complexity.
  • the threshold estimation methods are shown to attain a linear time complexity. In this regard, several works have leveraged certain features of the gradients to enhance the training process.
  • the SIDCo algorithm uses the compressibility of the gradients and their statistical distribution. Signals, including gradient vectors of DNNs, can be efficiently compressed by exploiting some of their statistical features. Among these features, sparsity and compressibility are the key drivers for performing signal compression [13-15].
  • a definition of compressible signals is as follows: the signal g ∈ ℝ^d is compressible if the magnitudes of its sorted coefficients obey a power-law decay, i.e., the jth largest magnitude is bounded by c1·j^(−p) with decay exponent p > 1/2.
  • the signal is more compressible if it decays faster, i.e., p is higher.
  • the compressibility of the gradient vector allows efficient compression for the gradients through sparsification techniques, e.g., Topk and thresholding-based compression.
  • the gradients generated while training the ResNet20 were used.
  • Figure 2A shows the elements of the sorted gradient magnitude vector, i.e., g̃_j, which are reported versus their index, for three iterations in the beginning, middle, and end of the training.
  • Figure 2B shows the sparsification error for the best k approximation, e.g., the Topk, as a function of k.
  • the goal is to find the distribution of the gradient vector, while accounting for the compressibility of the gradients.
  • the selection of sparsity-promoting priors that are able to efficiently capture the statistical characteristics of the gradients with low computational complexity is a challenging task.
  • the inventors noticed a specific property of the distribution of the gradients, which permits high compression gains with low computational overhead.
  • This property indicates that gradients generated from many DNNs during the training can be modeled as random variables (r.v.s) distributed according to some sparsity-inducing distributions, i.e., double exponential, double gamma and double generalized Pareto (GP) distributions. More specifically, the gradient G can be modeled or fitted as G ~ Distribution(θ).
  • Distribution(θ) is one of the three SIDs with parameters indicated by the vector θ, which generally depend on the iteration and the worker's data.
  • PDF probability density function
  • for double exponentially distributed gradients, the threshold η that achieves the compression ratio δ can be computed as η(δ) = β̂·ln(1/δ), where β̂ is the maximum likelihood estimate (MLE) of the scale parameter.
  • MLE maximum likelihood estimate
  • the vector that contains only the exceedance non-zero gradients, i.e., the gradients that are larger in magnitude than the threshold η.
  • the number of its components is k.
  • the target compression ratio δ can be as low as 10⁻⁴. Therefore, in order to accurately estimate the threshold η, the fitted distribution should tightly resemble the gradient distribution at the tail. This is quite challenging because the estimation of the parameters tends to account more for the majority of the data at the expense of the tail. Hence, the threshold η obtained from the single-stage fitting discussed above is accurate only up to some moderate compression ratios.
  • a multi-stage fitting approach is proposed in this embodiment.
  • a two-stage approach is first discussed.
  • the calculated vector of the exceedance gradients g is used to fit another distribution, defined precisely below.
  • the threshold for the multi-stage approach.
  • the absolute values of the exceedance gradients |G_m| can be modeled as GP distributed r.v.s whose shape, scale, and location parameters are estimated at each stage.
  • the threshold that achieves a compression ratio δ_m is obtained from this fit, where η_{m−1} is the threshold computed at the previous stage and μ̂ and σ̂² are the sample mean and variance of |g_m| − η_{m−1}, respectively.
  • FIGS 4A and 4B show a neural network/computing system 400 that includes n workers, where n is an integer number that varies between 2 and N. Only two workers or nodes 410-1 and 410-n are shown in the figure for simplicity.
  • a training node may be a GPU, a CPU, a computing device, etc.
  • Each training node 410-n receives input data 412 for training, and performs a forward-backward pass 414/416 on different batches sampled from the training dataset 412 with the same network model.
  • each worker runs a local version of the SGD algorithm to produce the corresponding gradients 418, which are then reshaped by the gradient reshaping module 420 to generate the gradient vector g_n^(i) ∈ ℝ^d 422.
  • Figures 5A and 5B, lines 20-26 show how to calculate the number of stages M.
  • the processor of each worker estimates the parameters of the selected SID to effectively fit the gradient vector to the selected SID.
  • the function Thresh_Estimation shown in line 13 in Figures 5A and 5B uses the chosen SID to obtain a corresponding threshold.
  • the algorithm dynamically adapts the number of stages M by monitoring the quality of its estimated selection of elements and adjusting M using the function Adapt_Stages noted in line 20 in Figure 5B.
  • the algorithm in Figure 5A starts by calling the sparsify function, which takes the gradient vector 422 and the target ratio as the parameters. Then, the algorithm applies a multi-stage estimation loop of M iterations. In each iteration, the compressed gradient vector g̃_n^(i) 430 is partially sparsified with the previously estimated threshold obtained from the previous stage m − 1.
  • the chosen SID distribution fitting is invoked via the function Thresh_Estimation to obtain a new threshold.
  • the resulting estimation threshold should approximate the threshold that would obtain the target ratio δ of the input vector.
  • the estimated threshold is used to sparsify the full gradient vector and obtain the values and their corresponding indices. For each invocation of the algorithm in each training iteration, the algorithm maintains statistics like the average ratio of the quality of its estimations over the past training steps Q.
  • the algorithm invokes the Adapt_Stages function (see line 20 in Figure 5B), which adjusts the current number of stages M based on user-defined allowable error bounds of the estimation (i.e., ε_H and ε_L); an illustrative sketch of this adaptation logic is given after this list.
  • the next algorithm iteration invocation will use the new number of stages M.
  • the number of stages is adjusted only if the obtained ratio is not within the error bounds.
  • Figure 6 illustrates these steps of the fitting refinement 610 in which the threshold η is adjusted for each iteration to arrive at the desired target ratio δ.
  • A simplified example of compressing the gradient vector is illustrated in Figure 7.
  • the gradient vector g_n^(i) 422 has the 10 values 710, as illustrated in Figure 7.
  • the method computes for each stage the threshold from the peak-over-threshold data from the previous stage (i.e., the gradient elements that have an absolute value larger than the threshold). After finishing all the stages, the final threshold is used to compress and send the vectors. For M stages, this process is repeated while ensuring that the overall target compression ratio equals the product of the per-stage compression ratios.
  • the all-gather module 432 is configured to collect the compressed and sparsified gradient vectors from each worker and to provide this info to all the workers in step 434.
  • the averaged gradient 436 from all the workers is used in step 438 to update the model x^(i) ∈ ℝ^d used by each worker, and then the steps discussed above are repeated for the next iteration i+1.
  • the SIDCo algorithm was compared to Topk, DGC, RedSync and GaussianKSGD.
  • the EC mechanism is employed to further enhance the convergence of SGD with compressed gradients.
  • SIDCo-E double exponential fitting
  • Normalized Training Speed-up: the model quality is evaluated at iteration T (the end of training) and it is divided by the time taken to complete T iterations. This quantity is normalized by the same measurement calculated for the baseline case. This is the normalized training speed-up relative to the baseline;
  • Normalized Average Training Throughput is the average throughput normalized by the baseline's throughput, which illustrates the speed-up from compression irrespective of its impact on the model quality;
  • the number of selected elements is two orders of magnitude lower than the target.
  • the estimation quality of RedSync has a high variance, harming its convergence.
  • Figure 9B shows that, at a target ratio of 0.001, the RedSync causes significant fluctuation in the compression ratio and the training does not converge.
  • GaussianKSGD results in a very low compression ratio, which is close to 0, and far from the target leading to significantly higher loss (and test perplexity) values compared to the target values.
  • Figure 10A shows that SIDCo achieves higher gains compared to other compressors, by up to about 2.1 times for ratios of 0.1 and 0.01. Notably, at a ratio of 0.001, only SIDCo achieved the target character error rate (CER). Thus, other compressors were run for 250 epochs to achieve the target CER (instead of the default 150), except for the GaussianKSGD, which does not converge. The gains of the SIDCo method over the other compressors then increase by up to about 4 times. The reason could be that the model is more sensitive to compression (especially in the initial training phase).
  • the SIDCo method starts as a single-stage before performing stage adaptations, leading to a slight over-estimation of k and so more gradient elements are sent during the training start-up.
  • Figure 10B shows that the threshold- estimation methods including the SIDCo method enjoy high training throughput, explaining the gains over the baseline.
  • Figure 10C shows that on average, with low variance, SIDCo closely matches the estimated ratios of DGC while other estimation methods have poor estimation quality.
  • Figures 11A and 11B show that, at a target ratio of 0.001, RedSync causes significant fluctuation in the compression ratio and the GaussianKSGD method results in a very low compression ratio (close to 0), which is far from the target. This leads both methods to achieve significantly higher loss (or test perplexity) values compared to the target loss (or test perplexity) values.
  • the SIDCo method estimates the threshold with very high quality for all ratios. Similar trends are observed for the VGG19 benchmark, where a compression ratio of 0.001 was used. The results also indicate that the SIDCo method estimates the threshold with high quality and achieves the highest top-1 accuracy and training throughput among all methods. The accuracy gains compared to the baseline, Topk and DGC methods are about 34, 2.9, and 1.13 times higher, respectively.
  • the novel SIDCo method solves a practical problem in distributed deep learning.
  • compressors other than threshold-based ones have high computational costs, whereas the existing threshold-estimation methods fail to achieve their target.
  • the novel SIDCo threshold-based compressor is introduced, which imposes a sparsity prior on the gradients.
  • the method includes a step 1200 of receiving input data 412 at plural nodes 410-n of a neural network 400, a step 1202 of running the input data 412 forward and backward through the plural nodes 410-n to generate node gradient vectors 422, a step 1204 of fitting a sparsity-inducing distribution, SID, at each node 410-n, to the corresponding gradient vector 422, a step 1206 of calculating a first threshold η1, based on the SID, for the corresponding node gradient vector 422, a step 1208 of compressing the corresponding node gradient vector 422 to obtain a first compressed gradient vector 430, by setting to zero those components that are smaller than the first threshold η1, and a step 1210 of transmitting a compressed gradient vector, which is related to the first compressed gradient vector 430, to all other nodes 410-n for updating a corresponding model.
  • the method may further include selecting the SID from the double exponential distribution, the double gamma distribution and the double generalized Pareto distribution, and/or fitting the SID or another SID, at each node, to the first compressed gradient vector, and/or calculating a second threshold η2, based on the SID or another SID, for the first compressed gradient vector, compressing the first compressed gradient vector to obtain a second compressed gradient vector, by setting the components that are smaller than the second threshold η2 to zero, and transmitting the second compressed gradient vector, i.e., the non-zero components of the second compressed vector and their indices, to all other nodes for updating a corresponding model. For each threshold, a target compression ratio is used to calculate the threshold.
  • a product of the target compression ratio for each stage is equal to an overall target compression ratio.
  • the method may further include calculating a number of stages for which to repeat the steps of fitting, calculating and compressing before the step of transmitting.
  • the input data is training data for the neural network.
  • the steps of fitting, calculating and compressing are run independently and simultaneously on the plural nodes.
  • the step of compressing has a target compression ratio, and the threshold is calculated based on the target compression ratio.
  • Computing device 1300 of Figure 13 is suitable for performing the activities described above with regard to a node 410-n, and may include a server 1301.
  • a server 1301 may include a central processor (CPU) 1302 coupled to a random access memory (RAM) 1304 and to a read-only memory (ROM) 1306.
  • RAM random access memory
  • ROM 1306 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc.
  • Processor 1302 may communicate with other internal and external components through input/output (I/O) circuitry 1308 and bussing 1310 to provide control signals and the like.
  • I/O input/output
  • Processor 1302 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
  • Server 1301 may also include one or more data storage devices, including hard drives 1312, CD-ROM drives 1314 and other hardware capable of reading and/or storing information, such as DVD, etc.
  • software for carrying out the above-discussed steps may be stored and distributed on a CD- ROM or DVD 1316, a USB storage device 1318 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1314, disk drive 1312, etc.
  • Server 1301 may be coupled to a display 1320, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc.
  • a user input interface 1322 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
  • Server 1301 may be coupled to other devices, such as a database of data that needs to be analyzed, etc.
  • the server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1328, which allows ultimate connection to various landline and/or mobile computing devices.
  • GAN global area network
  • the disclosed embodiments provide a statistical-based gradient compression method and system for distributed training systems. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
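As referenced in the list above, the following is a rough, hypothetical sketch of the multi-stage threshold estimation with stage-count adaptation described for Figures 5A, 5B and 6. It is illustrative only: the class and method names (SIDCoLikeCompressor, compress, _adapt), the double exponential fit, and the equal per-stage split of the target ratio are assumptions, not the patent's pseudo-code.

```python
import math
import torch


class SIDCoLikeCompressor:
    """Illustrative multi-stage threshold compressor with stage adaptation (double exponential fit)."""

    def __init__(self, delta, stages=1, eps_low=0.2, eps_high=0.2, window=5):
        self.delta, self.stages = delta, stages
        self.eps_low, self.eps_high = eps_low, eps_high
        self.window = window
        self.history = []                     # recent k_obtained / k_target ratios

    def _estimate_threshold(self, abs_g):
        stage_ratio = self.delta ** (1.0 / self.stages)   # assumed equal split of the ratio
        eta = 0.0
        for _ in range(self.stages):
            excess = abs_g[abs_g > eta] - eta             # peak-over-threshold data, shifted
            if excess.numel() == 0:
                break
            # Exponential (double exponential on signed gradients) scale estimate = mean excess
            eta += float(excess.mean()) * math.log(1.0 / stage_ratio)
        return eta

    def compress(self, grad):
        abs_g = grad.abs()
        eta = self._estimate_threshold(abs_g)
        indices = (abs_g >= eta).nonzero(as_tuple=True)[0]
        self._adapt(indices.numel(), max(1, int(self.delta * grad.numel())))
        return grad[indices], indices          # sparse payload: values and positions

    def _adapt(self, k_obtained, k_target):
        self.history.append(k_obtained / k_target)
        if len(self.history) < self.window:
            return
        avg = sum(self.history) / len(self.history)
        self.history.clear()
        if avg > 1.0 + self.eps_high:          # selecting too many elements: add a stage
            self.stages += 1
        elif avg < 1.0 - self.eps_low and self.stages > 1:
            self.stages -= 1                   # selecting too few: relax by removing a stage
```

In this sketch the number of stages is changed only when the average obtained-to-target ratio leaves the user-defined bounds, mirroring the behavior attributed to the Adapt_Stages function above.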

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

A statistical-based gradient compression method includes receiving (1200) input data (412) at plural nodes (410-n) of a neural network, running (1202) the input data (412) forward and backward through the plural nodes (410-n) to generate node gradient vectors, fitting (1204) a sparsity-inducing distribution, SID, at each node (410-n), to a corresponding node gradient vector, calculating (1206) a first threshold η1, based on the SID, for the corresponding node gradient vector, compressing (1208) the corresponding node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η1, and transmitting (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model.

Description

STATISTICAL-BASED GRADIENT COMPRESSION METHOD FOR
DISTRIBUTED TRAINING SYSTEM
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/045,346, filed on June 29, 2020, entitled “EFFICIENT GRADIENT COMPRESSION FOR FAST DISTRIBUTED TRAINING,” the disclosure of which is incorporated herein by reference in its entirety.
BACKGROUND
TECHNICAL FIELD
[0002] Embodiments of the subject matter disclosed herein generally relate to exchanging data within a neural network, and more particularly, to applying an efficient gradient compression technique to a distributed training neural network.
DISCUSSION OF THE BACKGROUND
[0003] As the deep neural networks (DNNs) continue to become larger and more sophisticated, and an ever increasing amount of training data is used, scaling the training process to run efficiently on a distributed cluster is currently a goal that is attracting a multitude of efforts. Modern deep learning toolkits (e.g., pytorch.org) are capable of distributed data-parallel training whereby the model is replicated and the training data are partitioned among plural workers. A worker is understood herein to be a component or a node of the DNN that executes training, for example, a processor or a computing device. Training DNNs in such settings relies on synchronous distributed Stochastic Gradient Descent (SGD) or similar optimizers. [0004] More specifically, assume that N (a positive integer) is the number of workers of the training DNN, and that x^(i) ∈ ℝ^d denotes the model parameters with d dimensions at iteration i. A bold symbol means in this context a vector. At the end of the ith iteration, each worker n (where n takes a value between 1 and N) runs a back-propagation algorithm to produce a local stochastic gradient g_n^(i) ∈ ℝ^d. Then, each worker updates its model parameters x^(i+1) using the final gradient aggregated across all workers, i.e.,

x^(i+1) = x^(i) − λ·ĝ^(i),  with  ĝ^(i) = (1/N) Σ_{n=1}^{N} g_n^(i),

where λ is the learning rate. This means that each worker n of the N workers needs to receive all individual gradients from all other workers and calculate the new aggregated gradient ĝ^(i).
[0005] Gradient aggregation involves extensive communication, which is either between the workers in a peer-to-peer fashion (typically through collective communication primitives like all-reduce) or via a parameter server architecture. Due to the synchronous nature of the optimizer, workers cannot proceed with the (i + 1)th iteration until the aggregated gradient ĝ^(i) is available. Therefore, in distributed training workloads, communication is commonly one of the predominant bottlenecks.
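For illustration only (this sketch is not part of the patent), the synchronous update of paragraphs [0004]-[0005] can be written with PyTorch's all-reduce primitive; the model, loss function, data loading and process-group initialization are assumed to exist elsewhere.

```python
# Minimal sketch of uncompressed synchronous data-parallel SGD.
# Assumes torch.distributed is already initialized with N worker processes.
import torch
import torch.distributed as dist


def training_step(model, loss_fn, batch, lr):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    model.zero_grad()
    loss.backward()                          # local stochastic gradient g_n^(i)

    world_size = dist.get_world_size()
    for p in model.parameters():
        # Exchange and average the full gradient: g^(i) = (1/N) sum_n g_n^(i)
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad.div_(world_size)

    with torch.no_grad():
        for p in model.parameters():
            p.add_(p.grad, alpha=-lr)        # x^(i+1) = x^(i) - lambda * g^(i)
```

It is exactly this per-iteration exchange of every gradient element that the compression schemes discussed below try to shrink.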
[0006] Addressing this communication bottleneck is the focus of intensive research, where one avenue is pursuing a path of improving training by reducing the communicated data volume via lossy gradient compression. Compression entails two main challenges: (i) it can negatively affect the training accuracy (because the greater the compression is, the larger the error in the aggregated gradient), and (ii) it introduces extra computation latency (due to the compression operation itself). While the former can be mitigated by applying compression to a smaller extent or using compression with error-feedback, the latter, if left unchecked, can actually slow down the training of the neural network when compared to the no-compression option.
[0007] Thus, there is a need for a new system and algorithm that are capable of accurately compressing the gradients exchanged by the workers of the neural network so that almost no error is introduced and the compression process does not increase the computation latency to undesirable levels.
BRIEF SUMMARY OF THE INVENTION
[0008] According to an embodiment, there is a statistical-based gradient compression method that includes receiving input data at plural nodes of a neural network, running the input data forward and backward through the plural nodes to generate node gradient vectors, fitting a sparsity-inducing distribution, SID, at each node, to a corresponding node gradient vector, calculating a first threshold η1, based on the SID, for the corresponding node gradient vector, compressing the corresponding node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmitting a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero. [0009] According to another embodiment, there is a neural network system that uses a statistical-based gradient compression method, and the system includes plural nodes, each configured to receive input data, and each node including a processor configured to run the input data forward and backward to generate a node gradient vector, fit a sparsity-inducing distribution, SID, to the node gradient vector, calculate a first threshold η1, based on the SID, for the node gradient vector, compress the node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmit a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero. [0010] According to still another embodiment, there is a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for gradient compression using statistical methods as discussed above.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
[0012] Figures 1A and 1B illustrate the compression speedups over Topk using different compression ratios (0.1, 0.01, 0.001) on a GPU (Figure 1A) and a CPU (Figure 1B);
[0013] Figure 1C shows the average estimation quality of a target k;
[0014] Figure 2A shows the sorted magnitude of the gradients versus their indexes and the fitted curve via a power law;
[0015] Figure 2B shows the approximation error of the Topk versus the number of nonzero elements k;
[0016] Figures 3A to 3D show gradient fitting using three sparsity-inducing distributions (SIDs) for the gradient vector along with the empirical distribution generated from training ResNet-20 on the CIFAR10 dataset using the Topk compressor with an error compensation mechanism, for the 100th [(A) PDF, (B) CDF] and 10,000th [(C) PDF, (D) CDF] iterations;
[0017] Figures 4A and 4B schematically illustrate a neural network configured to run a statistical-based gradient compression method;
[0018] Figures 5A and 5B illustrate the pseudo-code run by the statistical-based gradient compression method; [0019] Figure 6 illustrates the iteration aspect of the statistical-based gradient compression method;
[0020] Figure 7 provides an example for the application of the statistical-based gradient compression method;
[0021] Figures 8A to 8C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a first dataset;
[0022] Figures 9A and 9B show the train loss versus number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the first dataset;
[0023] Figures 10A to 10C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a second dataset;
[0024] Figures 11A and 11B show the train loss versus number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the second dataset;
[0025] Figure 12 is a flow chart of a method for compressing gradient vectors based on sparsity-inducing distributions; and
[0026] Figure 13 is a schematic diagram of a node of a neural network.
DETAILED DESCRIPTION OF THE INVENTION
[0027] The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims. The following embodiments are discussed, for simplicity, with regard to training a DNN. However, the embodiments to be discussed next are not limited to training a DNN, but may be applied to other neural networks or other purposes than training.
[0028] Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
[0029] According to an embodiment, a novel gradient compressor with minimal overhead is introduced. Noting the sparsity of the gradients, the compressor models the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). This approach is validated by studying the statistical characteristics of the evolution of the gradient vectors over the training process. A Sparsity-Inducing Distribution-based Compression (SIDCo) method is then introduced, which takes advantage of a threshold-based sparsification scheme that enjoys similar threshold estimation quality to the deep gradient compression (DGC) while being faster, as it imposes a lower compression overhead. An evaluation of popular machine learning benchmarks involving both recurrent neural network (RNN) and convolution neural network (CNN) models shows that the SIDCo method speeds up training by up to 41.7, 7.6, and 1.9 times when compared to the no-compression baseline, Topk, and DGC compressors, respectively. The SIDCo method builds on a sound theory of signal compressibility and enjoys linear complexity in terms of the size of the model parameters. This approach affords an implementation that parallelizes very efficiently using modern GPUs and other hardware targets. Thus, in one application, the new scheme addresses a previously-overlooked yet crucial technical obstacle to using compression in practice, especially for communication-bound training of large models.
[0030] To better understand the applicability of the SIDCo method in the context of neural networks, before discussing the novel features of this method, it is believed that a discussion of the existing compression methods is in order. Efficient communication in distributed training has received extensive attention. One approach tries to maximize the overlap between the computation and communication to hide the communication overhead. However, the gains from these methods are bounded by the length of the computation and are modest when the training is dominantly communication-bound. Alternatively, many approaches adopt methods that reduce the amount of communication, in volume or frequency. For example, gradient compression is a well-known volume reduction technique [1-5]. According to this approach, each worker applies a compression operator C to the corresponding gradient g_n^(i) to produce a compressed gradient vector, which is then transmitted for aggregation. Generally, the compressor C involves quantization and/or sparsification operations.
[0031] Gradient quantization is another way to reduce the size of the transmitted gradients; it represents gradients with fewer bits for each gradient element. Under some conditions, quantization is known to achieve the same convergence as no compression. Error compensation (EC) is used to attain convergence when the gradients are quantized using fewer bits. Given the standard 32-bit float number representation, the volume reduction of the quantization is limited to 32×, i.e., 1 bit out of 32 bits, which may not be sufficient for large models or slow networks, and it requires expensive encoding to pack the quantized bits.
[0032] Another approach is gradient sparsification. The gradient sparsification selects a subset of the gradient elements for the next iteration. It is generally more flexible than the quantization approach, as it can reduce the transmitted volume by up to d times and adapts easily to network conditions [6]. It was shown that in some cases, up to 99.9% of the non-significant gradient elements can be dropped with limited impact on convergence. Gradient sparsification using Topk, i.e., selecting the top k elements by their magnitude, is known to yield better convergence compared to other compression schemes, e.g., Random-k. However, Topk or its variants are notorious for being computationally inefficient. The Topk selection does not perform well on accelerators such as GPUs. For instance, in many cases, it is reported that Topk imposes high overheads and worsens the run time of distributed training systems.
[0033] Thus, the main challenge with using gradient compression (e.g., sparsification or quantization) is the computational overhead the algorithm itself introduces in the training. If the overhead is greater than the reduction gains in the communication time, the overall iteration time increases. Hence, to be useful, a robust compressor should have a low overhead. As discussed earlier, one of the dominantly robust compressors is Topk. However, this compressor is also computationally intensive. Because of this, using Topk for large models results in either an increased training time or unsatisfactory performance benefits. [0034] Numerous efforts based on algorithmic or heuristic approaches have been dedicated to enhancing the performance of Topk [3, 7-9]. Existing fast implementations of Topk are compute-intensive, e.g., on the CPU, the computational complexity is O(d log₂ k). Recently, more optimized implementations for multi-core hardware were proposed, which greatly depend on the data distribution, and these implementations work best for small values of k [8]. For instance, the Radix select algorithm used in PyTorch is O(⌈b/r⌉·d), where b is the number of bits in the data values and r is the radix size (pytorch.org). Yet, using gradient vectors of various sizes, Topk is the slowest on GPUs and not the fastest on CPUs.
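As a concrete point of reference (illustrative only, not taken from the patent), a Topk sparsifier is a few lines with torch.topk; it is this selection step whose cost motivates the threshold-based alternatives discussed next.

```python
import torch


def topk_compress(grad: torch.Tensor, ratio: float):
    """Keep the k = ratio * d largest-magnitude elements of a flat gradient vector."""
    d = grad.numel()
    k = max(1, int(ratio * d))
    _, indices = torch.topk(grad.abs(), k, sorted=False)  # the costly selection step
    return grad[indices], indices                          # sparse payload: values and positions
```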
[0035] In the context of gradient compression, threshold-based methods, which aim to overcome the overhead of Topk, select, in linear time, the gradient elements larger in magnitude than a threshold η. DGC [3] proposes to sample a random subset of the gradients (e.g., 1%), and apply Topk on the sampled sub-population to find a threshold, which is then used to hierarchically obtain the actual Topk elements. Even though DGC leads to improved performance over Topk, its computational complexity is still of the same order as Topk's complexity. The threshold estimation methods, on the other hand, are shown to attain a linear time complexity. In this regard, several works have leveraged certain features of the gradients to enhance the training process. Some approaches leveraged these features and devised heuristics to estimate and find the Topk threshold, which exhibit lower compression overhead compared to the traditional Topk and DGC [2, 9]. In particular, RedSync [2] finds the threshold by moving the ratio between the maximum and mean values of the gradient, while GaussianKSGD [9] adjusts an initial threshold obtained from fitting a Gaussian distribution through an iterative heuristic to obtain the Topk elements. Nevertheless, the threshold estimation of these methods is of poor quality and the number of selected elements varies significantly from the target k, as discussed later.
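The sampling idea described above can be sketched as follows. This is a simplified approximation for illustration, not DGC's exact procedure (DGC additionally refines the threshold hierarchically); the function name and the 1% sampling fraction are assumptions.

```python
import torch


def sampled_threshold_compress(grad: torch.Tensor, ratio: float, sample_frac: float = 0.01):
    """Estimate a Top-k-like threshold from a random sub-sample, then apply it to the full vector."""
    d = grad.numel()
    n_samples = max(1, int(sample_frac * d))
    sample = grad.abs()[torch.randint(0, d, (n_samples,), device=grad.device)]
    k_sample = max(1, int(ratio * n_samples))
    threshold = torch.topk(sample, k_sample, sorted=True).values[-1]  # k-th largest in the sample
    mask = grad.abs() >= threshold
    indices = mask.nonzero(as_tuple=True)[0]
    return grad[indices], indices
```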
[0036] To overcome the problems noted above, a statistical approach is developed in this embodiment to estimate an accurate threshold for selecting the Topk elements with minimal overhead. In particular, the compressibility of the gradients is exploited, and a SID is adopted that fits the gradients well. For instance, double exponential (i.e., Laplace), double gamma and double generalized Pareto distributions have been used as sparsity-promoting priors in the Bayesian estimation framework [10-12]. The inventors have discovered that as the gradients are compressible, they are suitable to be modeled as random variables (r.v.s) distributed according to one of the SIDs noted above. [0037] To validate this novel approach, initial micro-benchmark experiments were conducted to evaluate the compression overhead of the existing sparsification techniques, e.g., Topk, DGC (which uses a random sub-sample for the threshold calculation), RedSync and GaussianKSGD (which heuristically estimate the threshold), against one of the proposed SIDCo schemes that estimates the threshold via a multi-stage fitting. Both CPU and GPU were used to benchmark the performance of these algorithms. It was observed from the results of this comparison that the methods based on random sub-sampling (e.g., DGC) excel on the GPU (see Figure 1A), but they impose a huge overhead on the CPU, which leads to DGC performing significantly worse than Topk on the CPU (see Figure 1B). In contrast, the methods that are based on estimating a threshold over which only k elements are selected consistently impose lower compression overhead compared to Topk and DGC on both the GPU and CPU. This shows that, except for linear-time threshold-based methods, a variable compression overhead is to be expected on different architectures (e.g., CPU, GPU, TPU, FPGA or AI chips). Figure 1C shows the normalized actual compression ratio (i.e., the ratio of the obtained k to the target k) for various schemes. Note that the heuristic approaches fail to obtain the right threshold, leading to unpredictable behavior.
[0038] The SIDCo algorithm uses the compressibility of the gradients and their statistical distribution. Signals, including gradient vectors of DNNs, can be efficiently compressed by exploiting some of their statistical features. Among these features, sparsity and compressibility are the key drivers for performing signal compression [13-15]. A definition of compressible signals is as follows: the signal g ∈ ℝ^d is compressible if the magnitudes of its sorted coefficients obey the following power-law decay:

g̃_j ≤ c₁·j^(−p),  ∀ j ∈ {1, 2, …, d},   (1)

where g̃ is the vector of |g| sorted in descending order, g̃_j is the jth element of g̃, and p > 1/2 is the decay exponent, for some constant c₁. For compressible signals obeying a power-law decay, the sparsification error σ_k(g) is bounded as follows:

σ_k(g) ≜ ||g − T_k{g}||₂ ≤ c₂·k^(1/2 − p),   (2)

where ||x||_q = (Σ_j |x_j|^q)^(1/q) is the ℓq norm of x, T_k{·} is the Topk sparsification operator that keeps only the largest k elements in terms of magnitude and sets the others to zero, T_k{g} is a k-sparse vector with only k non-zero elements, and c₂ is a constant. The signal is more compressible if it decays faster, i.e., p is higher.
[0039] The compressibility of the gradient vector allows efficient compression for the gradients through sparsification techniques, e.g., Topk and thresholding-based compression. To verify the vector compressibility, the gradients generated while training the ResNet20 were used. The absolute values of the gradients are sorted in descending order to obtain the vector g̃ with d = 269,722. Figure 2A shows the elements of the gradient vector g̃, i.e., g̃_j, which are reported versus their index, for three iterations in the beginning, middle, and end of the training. As a benchmark, a power-law decay example with decay exponent p > 0.5, i.e., p = 0.7, is used. It can be noticed that the gradients follow a power-law decay with a decay exponent p > 0.7 > 0.5, which indicates that the gradients form a compressible signal according to the definition given by equation (1). Figure 2B shows the sparsification error for the best k approximation, e.g., the Topk, as a function of k. An example of the power decay model with exponent p − 0.5 = 0.2 is used. It can be seen that the best k approximation error decays faster than the benchmark. Hence, the vector can be considered compressible, according to equation (2).
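A small diagnostic in the spirit of Figure 2A (illustrative only, not the patent's code): sort the gradient magnitudes and estimate the decay exponent p from the slope of a log-log fit; values of p above 1/2 indicate compressibility in the sense of equation (1). The synthetic stand-in vector below is an assumption used only to make the snippet runnable.

```python
import numpy as np


def estimate_decay_exponent(grad: np.ndarray) -> float:
    """Fit sorted |g|_(j) ~ c1 * j^(-p) on a log-log scale and return the estimated p."""
    mags = np.sort(np.abs(grad))[::-1]
    mags = mags[mags > 0]                        # avoid log(0)
    j = np.arange(1, mags.size + 1)
    slope, _ = np.polyfit(np.log(j), np.log(mags), 1)
    return -slope                                # estimate of the decay exponent p


# Usage with a flattened gradient vector (here a synthetic heavy-tailed stand-in):
g = np.random.standard_t(df=3, size=100_000) * 1e-3
print("estimated decay exponent p =", round(estimate_decay_exponent(g), 3))
```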
[0040] Next, the goal is to find the distribution of the gradient vector, while accounting for the compressibility of the gradients. The selection of sparsity-promoting priors that are able to efficiently capture the statistical characteristics of the gradients with low computational complexity is a challenging task. However, the inventors noticed a specific property of the distribution of the gradients, which permits high compression gains with low computational overhead. This property indicates that gradients generated from many DNNs during the training can be modeled as random variables (r.v.s) distributed according to some sparsity-inducing distributions, i.e., double exponential, double gamma and double generalized Pareto (GP) distributions. More specifically, the gradient G can be modeled or fitted as

G ~ Distribution(θ),   (3)

where Distribution(θ) is one of the three SIDs with parameters indicated by the vector θ, which generally depend on the iteration and the worker's data. Also, the probability density function (PDF) of G is f_G(g; θ), which is symmetric around zero. [0041] Because the gradients G are compressible as indicated by equation (1), they can be well approximated by sparse vectors with minimal error, as implied by equation (2). Hence, the distributions that promote sparsity are good candidates for fitting (or modeling) the gradient vector G. For instance, the double exponential, double gamma, double GP, and Bernoulli-Gaussian distributions have been used as priors that promote sparsity in [11, 12, 14]. The gradients resulting from the training of ResNet-20 with SGD have been considered and they are fitted by the three proposed SIDs, i.e., double exponential, double gamma, and double GP distributions. Figures 3A to 3D show the empirical distribution of the gradients and their absolutes, without the EC mechanism, along with the distributions of the three fitted SIDs for two different iterations. In this regard, Figure 3A shows that the three proposed distributions can capture the main statistical characteristic of the gradients, as their PDFs approximate the empirical distribution for most of the gradient domain. This is so because of the compressibility of the gradients illustrated before. The compressibility of r.v.s distributed according to one of the SIDs can be attributed to the shape of their PDFs, where the most probable values are those with small amplitudes. From Figures 3A and 3C it can be seen that the gradients at iteration 10,000 (Figure 3C) are more sparse than those at iteration 100 (Figure 3A), where the PDF at iteration 10,000 has higher values at smaller gradient values and it has a faster-decaying tail. Regarding the cumulative distribution function (CDF) of the absolute value of the gradients shown in Figures 3B and 3D, it can be seen that the SIDs approximate well the empirical CDF. However, at the tail of the distribution, they tend to slightly overestimate/underestimate the CDF. The reason for this is that the fitting is biased toward the majority of the data with lower values, as the gradient vector is sparse.
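For intuition about such fits (illustrative only, not the patent's fitting code): for a zero-location double exponential model the MLE of the scale is simply the mean absolute gradient, and the quality of the fit at the tail can be checked by comparing empirical and model exceedance probabilities. The synthetic gradient below is an assumption for runnability.

```python
import numpy as np


def fit_double_exponential(grad: np.ndarray) -> float:
    """MLE of the Laplace (double exponential) scale for a zero-centered gradient vector."""
    return float(np.mean(np.abs(grad)))


def tail_check(grad: np.ndarray, beta: float, threshold: float):
    """Compare empirical vs. model probability that |g| exceeds a threshold."""
    empirical = float(np.mean(np.abs(grad) > threshold))
    model = float(np.exp(-threshold / beta))     # P(|G| > t) under Laplace(0, beta)
    return empirical, model


g = np.random.laplace(scale=5e-4, size=200_000)  # stand-in for a real flattened gradient
beta_hat = fit_double_exponential(g)
print(tail_check(g, beta_hat, threshold=4 * beta_hat))
```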
[0042] Using the statistical properties noted above, two threshold estimators are now discussed, the single-stage and the multiple-stage threshold estimators. The threshold that yields the target compression ratio δ = k/d is derived for each of the three proposed SIDs. Then, the single-stage thresholding scheme is discussed for moderate compression ratios and the multi-stage thresholding scheme is discussed for aggressive compression ratios with δ « 1, e.g., δ ≤ 0.001. The sparsification threshold can be computed from the fitted distribution of the gradients as follows. For G ~ Distribution(θ) with CDF F_G(g; θ), the threshold η that yields the Topk vector with the average target compression ratio δ = k/d can be derived as:

η(δ) = F⁻¹_|G|(1 − δ; θ̂)   (4)
     = F⁻¹_G(1 − δ/2; θ̂),   (5)

where θ̂ is the estimated parameter vector for the gradient distribution, F_|G|(g; θ̂) is the CDF of the absolute gradient, F⁻¹_|G|(p; θ̂) = {g: F_|G|(g; θ̂) = p} is the inverse CDF of the absolute gradient at probability p, and F⁻¹_G(p; θ̂) is the inverse CDF of the gradient, also known as the quantile function or percent-point function (PPF).
[0043] The threshold calculation for the three SIDs mentioned above is now discussed. For the double exponentially distributed gradients with a scale parameter β and location zero (symmetric around zero), i.e., G ~ Laplace(β), the threshold η that achieves the compression ratio δ can be computed as:

η(δ) = β̂·ln(1/δ),   (6)

where β̂ is the maximum likelihood estimate (MLE) of the scale parameter. [0044] For the gradients that can be well-fitted by the double gamma distribution with a shape parameter α ≤ 1, the absolute of the gradient is gamma distributed, i.e., |G| ~ gamma(α, β). The sparsifying threshold η can be derived as:

η(δ) = β̂·P⁻¹(α̂, 1 − δ),   (7)

where α̂ is the shape estimate obtained from the (approximate) maximum likelihood fit of the gamma distribution (8), an expression that involves log(Γ(α)) and the statistic s = log(μ̂) − (1/d) Σ_j log|g_j|; β̂ is the corresponding scale estimate; P(α, x) = γ(α, x)/Γ(α) is the regularized lower incomplete gamma function; P⁻¹(α, p) = {x: P(α, x) = p} is the inverse of the regularized lower incomplete gamma function; and μ̂ and σ̂² are the sample mean and variance for the absolute gradient vector |g|, respectively.
[0045] The threshold calculation for the generalized Pareto distributed gradients is now discussed. For gradients distributed as double generalized Pareto r.v.s, the absolute of the gradients is modeled as GP distributed r.v.s |G| ~ GP(α, β, a), where 0 < α < 1/2, β, and a = 0 are the shape, scale, and location parameters. The sparsifying threshold η that achieves a compression ratio δ is given by:

η(δ) = (β̂/α̂)·(δ^(−α̂) − 1),   (9)

where the shape and scale parameters are estimated from the sample moments as

α̂ = (1/2)·(1 − μ̂²/σ̂²),   (10)
β̂ = (1/2)·μ̂·(μ̂²/σ̂² + 1),   (11)

with μ̂ and σ̂² being the sample mean and variance for the absolute gradient vector |g|, respectively.
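The three single-stage thresholds can be sketched as follows. The exact expressions follow the reconstructed equations (6), (7) and (9)-(11) above, so treat them as an interpretation rather than a verbatim transcription of the patent; scipy.special.gammaincinv is used as the inverse regularized lower incomplete gamma function P⁻¹, and the gamma shape/scale estimates are passed in rather than fitted here.

```python
import numpy as np
from scipy.special import gammaincinv  # inverse of the regularized lower incomplete gamma


def threshold_double_exponential(abs_g: np.ndarray, delta: float) -> float:
    beta = abs_g.mean()                          # MLE of the Laplace scale (location 0)
    return float(beta * np.log(1.0 / delta))     # eq. (6)


def threshold_double_gamma(alpha: float, beta: float, delta: float) -> float:
    # eq. (7): eta = beta * P^{-1}(alpha, 1 - delta), with alpha, beta fitted to |g| elsewhere
    return float(beta * gammaincinv(alpha, 1.0 - delta))


def threshold_double_gpd(abs_g: np.ndarray, delta: float) -> float:
    # Method-of-moments estimates; assumes a heavy tail so that mean^2 < variance
    mu, var = abs_g.mean(), abs_g.var()
    alpha = 0.5 * (1.0 - mu**2 / var)            # eq. (10)
    beta = 0.5 * mu * (mu**2 / var + 1.0)        # eq. (11)
    return float((beta / alpha) * (delta ** (-alpha) - 1.0))  # eq. (9)


abs_g = np.abs(np.random.standard_t(df=4, size=500_000)) * 1e-3   # synthetic stand-in
print(threshold_double_exponential(abs_g, delta=0.01))
print(threshold_double_gpd(abs_g, delta=0.01))
```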
[0046] After computing the threshold η for the three SIDs discussed above, the single-stage approach calculates the compressed gradient vector as

ĝ_j = g_j · 1[|g_j| ≥ η(δ)]  for each j ∈ {1, 2, …, d},

where the vector ĝ ∈ ℝ^d is the compressed gradient vector and 1[condition] is an indicator function that equals one when the condition is satisfied and zero otherwise. In the following, the vector that contains only the exceedance non-zero gradients (i.e., the gradients that are larger than the threshold η) is denoted by ḡ and the number of its components is k.
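A minimal sketch of this single-stage compressor (illustrative only): zero out the entries below the estimated threshold and ship only the surviving values with their indices; the receiver scatters them back into a dense vector.

```python
import torch


def threshold_compress(grad: torch.Tensor, eta: float):
    """Keep only gradient entries with |g_j| >= eta (linear-time selection)."""
    mask = grad.abs() >= eta
    indices = mask.nonzero(as_tuple=True)[0]
    values = grad[indices]
    return values, indices              # exceedance gradients and their positions


def threshold_decompress(values: torch.Tensor, indices: torch.Tensor, d: int) -> torch.Tensor:
    out = torch.zeros(d, device=values.device, dtype=values.dtype)
    out[indices] = values
    return out
```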
[0047] The target compression ratio δ can be as low as 10⁻⁴. Therefore, in order to accurately estimate the threshold η, the fitted distribution should tightly resemble the gradient distribution at the tail. This is quite challenging because the estimation of the parameters tends to account more for the majority of the data at the expense of the tail. Hence, the threshold η obtained from the single-stage fitting discussed above is accurate only up to some moderate compression ratios.
[0048] To overcome this problem, a multi-stage fitting approach is proposed in this embodiment. For convenience, a two-stage approach is first discussed. The gradients are fitted with one of the three SIDs and compressed using the proposed procedure discussed above with a first threshold η_1 computed to yield a first compression ratio δ_1 = k_1/d > δ. Then, the calculated vector of the exceedance gradients ĝ is used to fit another distribution, defined precisely below. Then, a second threshold η_2 is computed to achieve a second compression ratio δ_2 = k/k_1 with respect to the exceedance gradients ĝ. The second compression ratio δ_2 is chosen such that the overall compression ratio of the original gradient vector is the target ratio δ, i.e., δ_2 = δ/δ_1. Then, the estimated threshold from the last stage is applied to compress the newly compressed gradient. This procedure can be extended to multiple stages such that δ = Π_{m=1}^{M} δ_m, where M is the number of stages.
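A minimal two-stage sketch for the exponential case is shown below: a loose first threshold keeps a fraction δ_1 of the elements, the shifted exceedances are re-fitted, and the second stage targets δ_2 = δ/δ_1 so the overall ratio is δ. The default delta_1 = 0.25 matches the value used later in the evaluation; the function name and everything else are illustrative assumptions.

```python
import numpy as np

def two_stage_threshold_exp(abs_g, delta, delta_1=0.25):
    # Stage 1: exponential fit of |g| at a loose ratio delta_1 > delta.
    beta_1 = abs_g.mean()
    eta_1 = beta_1 * np.log(1.0 / delta_1)
    exceed = abs_g[abs_g >= eta_1]               # peak-over-threshold data (k_1 values)
    # Stage 2: the shifted exceedances are again exponential; target delta_2 = delta / delta_1.
    delta_2 = delta / delta_1
    beta_2 = (exceed - eta_1).mean()
    eta_2 = eta_1 + beta_2 * np.log(1.0 / delta_2)
    return eta_2                                 # overall kept fraction is approximately delta
```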
[0049] Because the method above replaces the original gradients with the exceedance gradients, which are also known as Peaks over Threshold (PoT), the question of whether the exceedance gradients have the same distribution as the original gradients before the compression needs to be addressed. The extreme value theory in statistics can provide an answer to this question. Let k_m be the number of exceedance gradients after the mth thresholding stage. Then, if a threshold operator with threshold η is applied on a sequence of r.v.s |G_1|, |G_2|, ..., |G_d|, the distribution of the exceedance r.v., i.e., |G_j| − η given |G_j| > η, can be approximated by a GP distribution for a large enough threshold and vector dimension, irrespective of the original distribution of the gradients. This means that the exceedance step does not impair the accuracy of this method.
[0050] Next, using the extreme value theory, it is possible to compute the threshold for the multi-stage approach. Considering that, for the mth thresholding stage with m ≥ 2, the absolute of the exceedance gradients |G_m| can be modeled as GP distributed r.v.s, |G_m| ~ GP(a_m, β_m, η_{m-1}), where a_m, β_m, and η_{m-1} are the shape, scale, and location parameters, the threshold that achieves a compression ratio δ_m is obtained as:

η_m(δ_m) = η_{m-1} + (β̂_m/â_m)(δ_m^{−â_m} − 1),   (12)

with

â_m = (1/2)(1 − μ̂²/σ̂²),  β̂_m = (μ̂/2)(μ̂²/σ̂² + 1),   (13)

where η_{m-1} is the threshold computed at the previous stage and μ̂ and σ̂² are the sample mean and variance of |g_m| − η_{m-1}, respectively.
[0051] If the absolute of the gradients is modeled as exponentially distributed r.v.s |G_m| ~ Exp(β_m), the distribution of the exceedance gradients over the threshold η_{m-1}, after proper shifting, is still exponentially distributed, i.e., |G_m| − η_{m-1} ~ Exp(β_m). The new stage threshold is then:

η_m(δ_m) = η_{m-1} + β̂_m log(1/δ_m),  with β̂_m = (1/k_{m-1}) Σ_j (|g_j| − η_{m-1}),

where g_j is the jth element of the vector g_m.
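The two per-stage estimators can be expressed as small helper functions, as sketched below; the GP version follows equations (12)-(13) with moment matching on the shifted exceedances, and the exponential version uses the shifted-mean MLE. Both are reconstructions under the stated assumptions, and the function names are illustrative.

```python
import numpy as np

def gp_stage_threshold(exceed, eta_prev, delta_m):
    shifted = exceed - eta_prev                   # |g_m| - eta_{m-1}
    mu, var = shifted.mean(), shifted.var()
    a_hat = 0.5 * (1.0 - mu * mu / var)           # shape, as in equation (13)
    beta_hat = 0.5 * mu * (mu * mu / var + 1.0)   # scale, as in equation (13)
    return eta_prev + (beta_hat / a_hat) * (delta_m ** (-a_hat) - 1.0)  # equation (12)

def exp_stage_threshold(exceed, eta_prev, delta_m):
    beta_hat = (exceed - eta_prev).mean()         # MLE of the shifted-exponential scale
    return eta_prev + beta_hat * np.log(1.0 / delta_m)
```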
[0052] These equations are used such that, when the absolute of the gradients is fitted by an exponential distribution in the first stage, the latter stages for the exceedance gradients are also fitted by exponential distributions, i.e., multi-stage exponential. On the other hand, for the gamma-fitted absolute gradients in the first stage, the latter stages are fitted by a GP distribution (based on equations (12) and (13)), i.e., gamma-GP. Finally, for GP distributed absolute gradients in the first stage, the GP is still used for the PoT data (based on equations (12) and (13)), i.e., multi-stage GP.

[0053] The SIDCo algorithm leverages the SIDs discussed above to obtain a threshold via the multi-stage threshold estimator. For this approach, the number of stages M is selected via an adaptive algorithm such that the estimation error, averaged over Q iterations, is bounded below a predefined error tolerance, i.e., |δ̂ − δ| ≤ ε δ, with 0 < ε < 1, where δ̂ is the achieved compression ratio.
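One possible form of such an adaptation rule is sketched below: every Q iterations the average achieved ratio is compared with the target, and M is changed only when the relative error leaves the allowed band. The direction of the adjustment and the names eps_h and eps_l are assumptions; the adjustment actually disclosed is specified by the pseudo-code in Figures 5A and 5B.

```python
def adapt_stages(num_stages, achieved_ratios, target_delta, eps_h=0.2, eps_l=0.2):
    avg = sum(achieved_ratios) / len(achieved_ratios)   # average ratio over the last Q steps
    if avg > (1.0 + eps_h) * target_delta:
        num_stages += 1                                 # too many elements kept: refine the tail fit
    elif avg < (1.0 - eps_l) * target_delta:
        num_stages = max(1, num_stages - 1)             # too few elements kept: relax the estimate
    return num_stages
```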
[0054] The SIDCo algorithm, which is schematically illustrated in Figures 4A and 4B together with the corresponding neural network system 400, and whose full pseudo-code is shown in Figures 5A and 5B, is now discussed. Figures 4A and 4B show a neural network/computing system 400 that includes N workers, denoted 410-n, where the index n is an integer that varies between 1 and N. Only two workers or nodes, 410-1 and 410-n, are shown in the figure for simplicity. A training node may be a GPU, a CPU, a computing device, etc. Each training node 410-n receives input data 412 for training and performs a forward-backward pass 414/416 on different batches sampled from the training dataset 412 with the same network model. At the end of the ith iteration (Figures 4A and 4B show all the workers 410-n simultaneously and independently performing the same ith iteration), each worker runs a local version of the SGD algorithm to produce the corresponding gradients 418, which are then reshaped by the gradient reshaping module 420 to generate the gradient vector g_i ∈ ℝ^d 422. Note that Figures 5A and 5B, lines 20-26, show how to calculate the number of stages M. Then, in step 424, the processor of each worker estimates the parameters of the selected SID to effectively fit the gradient vector to the selected SID. Then, in step 426, the threshold η for the selected SID is calculated (see lines 13-19 in Figures 5A and 5B), and the gradient vector is sparsified (see lines 2-12 in Figures 5A and 5B) in step 428 by applying the compression operator with threshold η to obtain the compressed gradient 430, g̃_i = g_i · 1{|g_i| ≥ η}; i.e., the values below the threshold are set to zero by the indicator operator 1{|g_i| ≥ η} and all other values are maintained. The compressed gradient g̃ 430 from each worker is then supplied to an all-gather module 432.
[0055] Note that, for each stage, the function Thresh_Estimation shown in line 13 in Figures 5A and 5B uses the chosen SID to obtain a corresponding threshold. The algorithm dynamically adapts the number of stages M by monitoring the quality of its estimated selection of elements and adjusting M using the function Adapt_Stages noted in line 20 in Figure 5B. The algorithm in Figure 5A starts by calling the sparsify function, which takes the gradient vector 422 and the target ratio as the parameters. Then, the algorithm applies a multi-stage estimation loop of M iterations. In each iteration, the compressed gradient vector g̃ 430 is partially sparsified with the previously estimated threshold obtained from the previous stage m - 1. Then, given the ratio δ_m at loop step m, the chosen SID distribution fitting is invoked via the function Thresh_Estimation to obtain a new threshold. At the last stage (i.e., step M of the loop), the resulting estimation threshold should approximate the threshold that would obtain the target ratio δ of the input vector. Then, the estimated threshold is used to sparsify the full gradient vector and obtain the values and their corresponding indices. For each invocation of the algorithm in each training iteration, the algorithm maintains statistics, such as the average ratio of the quality of its estimations over the past Q training steps. Then, at the end of every Q training steps, the algorithm invokes the Adapt_Stages function (see line 20 in Figure 5B), which adjusts the current number of stages M based on user-defined allowable error bounds of the estimation (i.e., ε_H and ε_L). After the adjustment, the next invocation of the algorithm will use the new number of stages M. The number of stages is adjusted only if the obtained ratio is not within the error bounds. Figure 6 illustrates these steps of the fitting refinement 610, in which the threshold η is adjusted at each iteration to arrive at the desired target ratio δ.
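The following sketch ties the pieces together into a SIDCo-style sparsify call for the multi-stage exponential variant, reusing the laplace_threshold and exp_stage_threshold helpers sketched earlier (assumed names). Tensors are moved to NumPy for the fitting step, and the per-stage ratios are chosen so that their product equals the target ratio; this is a simplified stand-in for the pseudo-code of Figures 5A and 5B, not the disclosed implementation.

```python
import numpy as np
import torch

def sidco_sparsify(grad: torch.Tensor, delta: float, num_stages: int, delta_1: float = 0.25):
    abs_g = grad.abs().cpu().numpy()
    if num_stages == 1:
        ratios = [delta]
    else:  # delta_1 * (delta / delta_1) = delta overall
        ratios = [delta_1] + [(delta / delta_1) ** (1.0 / (num_stages - 1))] * (num_stages - 1)
    eta, data = 0.0, abs_g
    for m, delta_m in enumerate(ratios):
        if m == 0:
            eta = laplace_threshold(data, delta_m)         # first-stage exponential fit
        else:
            eta = exp_stage_threshold(data, eta, delta_m)  # refit the peak-over-threshold data
        data = abs_g[abs_g >= eta]                         # exceedances for the next stage
    indices = torch.nonzero(grad.abs() >= eta).flatten()
    return grad[indices], indices                          # sparse payload: values and indices
```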
[0056] A simplified example for compressing the gradient vector is illustrated in Figure 7. Suppose that the gradient vector g 422 has the 10 values 710 illustrated in Figure 7. The Adapt_Stages function from Figures 5A and 5B has calculated that M = 2, and the target compression ratio δ is 0.1. The SID was selected and, thus, the selected SID is fitted on the gradient vector to obtain the threshold η_1 = 11, with an associated compression ratio δ_1 = 0.2 for the first stage. After applying the compression operator, only two values (PoT) 720 of the gradient vector g 422 are larger than the threshold η_1 = 11. The two PoT values form the compressed gradient vector, which is fitted again with the selected SID, and a new threshold η_2 = 20 is calculated with an associated compression ratio δ_2 = 0.5. After applying the compression operator again, only one value 730 survives, which corresponds to the desired compression ratio δ = 0.1. In other words, the method computes, for each stage, the threshold from the peak-over-threshold data of the previous stage (i.e., the gradient elements that have an absolute value larger than the threshold). After finishing all the stages, the final threshold is used to compress and send the vectors. For M stages, this process is repeated, ensuring that the target compression ratio is the product of all the per-stage compression ratios.
[0057] Returning to Figures 4A and 4B, the all-gather module 432 is configured to collect the compressed and sparsified gradient vectors from each worker and to provide this information to all the workers in step 434. The averaged gradient 436 from all the workers is used in step 438 to update the model x^{(i+1)} ∈ ℝ^d used by each worker, and then the steps discussed above are repeated for the next iteration i+1.
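A hedged sketch of this exchange and update step is given below, using Horovod's allgather as in the evaluation setup; the flattened-parameter layout, the absence of error feedback, and the helper name are simplifying assumptions.

```python
import torch
import horovod.torch as hvd  # assumes hvd.init() was called during setup

def exchange_and_update(param, values, indices, d, lr):
    all_vals = hvd.allgather(values)             # kept values from all N workers, concatenated
    all_idx = hvd.allgather(indices)             # matching indices from all N workers
    dense = torch.zeros(d, device=param.device)
    dense.index_add_(0, all_idx, all_vals)       # sum the contributions at each index
    dense /= hvd.size()                          # average gradient across the workers
    param.data -= lr * dense                     # x_(i+1) = x_i - lr * averaged gradient
    return param
```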
[0058] The above approach has been tested with CNN and RNN models on real datasets and against various compressors. The tests were performed on 8 server machines equipped with dual 2.6 GHz 16-core Intel Xeon Silver 4112 CPUs, 512 GB of RAM, and 10 Gbps NICs. Each machine has an NVIDIA V100 GPU with 16 GB of memory. PyTorch 1.1.0 with CUDA 10.2 was used as the ML toolkit, and Horovod 0.16.4 configured with OpenMPI 4.0.0 was used for collective communication.
[0059] For the benchmarks and hyper-parameters, various models have been used, for example, the LSTM model on the PTB dataset with the Nesterov-Momentum-SGD local optimizer, the LSTM model with the AN4 dataset, the ResNet20 model with the CIFAR-10 dataset, and the ResNet-50 model with the ImageNet dataset. All these models, datasets and optimizers are known in the literature, and thus, their description is omitted herein. The CNN and RNN models were used for image classification and language modeling tasks, respectively. Various compression ratios (δ) of 0.1 (10%), 0.01 (1%) and 0.001 (0.1%) were used to span a wide range of the trade-off between compression and accuracy, similar to prior work. The SIDCo algorithm was compared to Top_k, DGC, RedSync and GaussianKSGD. The EC mechanism is employed to further enhance the convergence of SGD with compressed gradients. For the SIDCo method, the compression ratio δ_1 = 0.25, ε = 20%, and Q = 5 iterations were selected to adapt the stages as in the algorithm illustrated in Figures 5A and 5B. For conciseness, only the performance of the SIDCo method with the double exponential fitting (shown in the figures as "SIDCo-E") is presented.
[0060] The performance of a given scheme (i.e., SIDCo, Top_k, DGC, RedSync or GaussianKSGD) is quantified via the following metrics:
[0061] Normalized Training Speed-up: the model quality is evaluated at iteration T (the end of training) and divided by the time taken to complete T iterations. This quantity is normalized by the same measurement calculated for the baseline case, giving the normalized training speed-up relative to the baseline.

[0062] Normalized Average Training Throughput: the average throughput normalized by the baseline's throughput, which illustrates the speed-up from compression irrespective of its impact on the model quality.

[0063] Estimation Quality: the achieved compression ratio averaged over the training divided by the target ratio δ = k/d, along with the 90% confidence interval as error bars.
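For clarity, these metrics can be computed as in the short sketch below, assuming per-run records of final model quality, wall-clock training time, and the achieved compression ratios; the record layout and function names are illustrative.

```python
def normalized_speedup(quality_T, time_T, base_quality_T, base_time_T):
    # Model quality at iteration T divided by time to reach T, relative to the baseline.
    return (quality_T / time_T) / (base_quality_T / base_time_T)

def estimation_quality(achieved_ratios, target_delta):
    # Average achieved compression ratio divided by the target ratio (ideally 1.0).
    return (sum(achieved_ratios) / len(achieved_ratios)) / target_delta
```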
[0064] The RNN-LSTM on PTB has been tested first. This benchmark has the highest communication overhead. In Figure 8A, the SIDCo method shows a significant speed-up over the no-compression method, by about 41.7 times, and improves over Top_k and DGC by up to about 7.6 and about 1.9 times, respectively. At a high compression ratio of 0.001, both the RedSync and GaussianKSGD compression methods do not converge to the target loss and test perplexity, as shown in Figure 9A, and therefore they attain zero speed-ups. Figure 8B shows that the threshold estimation schemes, including SIDCo, have the highest training throughput. In Figure 8C, DGC and SIDCo are the only methods that accurately estimate the target ratio with high confidence. However, for GaussianKSGD at a ratio of 0.001 and RedSync at ratios of 0.01 and 0.001, the number of selected elements is two orders of magnitude lower than the target. Moreover, over the training process, the estimation quality of RedSync has a high variance, harming its convergence. Figure 9B shows that, at a target ratio of 0.001, RedSync causes significant fluctuation in the compression ratio and the training does not converge. GaussianKSGD results in a very low compression ratio, which is close to 0 and far from the target, leading to significantly higher loss (and test perplexity) values compared to the target values.

[0065] When the RNN-LSTM is run on the AN4 data, Figure 10A shows that SIDCo achieves higher gains compared to the other compressors, by up to about 2.1 times, for ratios of 0.1 and 0.01. Notably, at a ratio of 0.001, only SIDCo achieved the target character error rate (CER). Thus, the other compressors were run for 250 epochs to achieve the target CER (instead of the default 150), except for GaussianKSGD, which does not converge. The gains of the SIDCo method over the other compressors thereby increase by up to about 4 times. The reason could be that the model is more sensitive to compression (especially in the initial training phase). The SIDCo method starts as a single stage before performing stage adaptations, leading to a slight over-estimation of k, so more gradient elements are sent during the training start-up. Throughput-wise, Figure 10B shows that the threshold-estimation methods, including the SIDCo method, enjoy high training throughput, explaining the gains over the baseline. Similar to the LSTM-PTB results, Figure 10C shows that, on average and with low variance, SIDCo closely matches the estimated ratios of DGC, while the other estimation methods have poor estimation quality. Similar to the PTB case, Figures 11A and 11B show that, at a target ratio of 0.001, RedSync causes significant fluctuation in the compression ratio and the GaussianKSGD method results in a very low compression ratio (close to 0), which is far from the target. This leads both methods to achieve significantly higher loss (or test perplexity) values compared to the target loss (or test perplexity) values.
[0066] For CNN networks applied to CIFAR-10 data with the ResNet20 model, all compressors achieve somewhat comparable and modest speed-ups over the no-compression baseline (except at a ratio of 0.001, where accuracy is degraded and hence the speed-up is lower than the baseline). This is not surprising because ResNet20 is not network-bound. However, for the larger VGG16 model, the SIDCo method achieves significant speed-ups over the no-compression method, Top_k and DGC, by up to about 5, 1.5, and 1.2 times, respectively. Unlike the other estimation schemes, the SIDCo method can accurately achieve the target ratio.
[0067] When the ResNet50 and VGG19 models were considered for the ImageNet data, a time limit of 5 hours per run was set to reduce the costs. For calculating the speed-up, the top-1 accuracy achieved by the different methods at the end of training was compared. First, for the ResNet50 benchmark, the compression ratios of 0.1, 0.01, and 0.001 were used. It was found that the SIDCo method achieves the highest accuracy, which is higher than the baseline, Top_k and DGC by about 15, 3, and 2 accuracy points, i.e., normalized accuracy gains of about 40%, 5%, and 4%, respectively. Further, the inventors found that the SIDCo method attains the highest throughput among all methods (except for RedSync at 0.1 compression). The results indicate that, unlike GaussianKSGD and RedSync, which both result in an estimation quality far from the target with high variance, the SIDCo method estimates the threshold with very high quality for all ratios. Similar trends are observed for the VGG19 benchmark, where a compression ratio of 0.001 was used. The results also indicate that the SIDCo method estimates the threshold with high quality and achieves the highest top-1 accuracy and training throughput among all methods. The accuracy gains compared to the baseline, Top_k, and DGC methods are about 34, 2.9, and 1.13 times, respectively.
[0068] These results indicate that the SIDCo method goes beyond existing methods that estimate a threshold for the Top_k sparsification. These methods either do not leverage the statistical properties of the gradients (DGC) or assume a Gaussian distribution without a thorough study of the gradients (e.g., RedSync, GaussianKSGD). On a GPU, the SIDCo method improves over DGC by at least 2 times, and the speed-ups are significantly larger on the CPU. As a threshold estimation method, SIDCo benefits not only from the throughput gains of the threshold methods, but also from the high quality of its threshold estimation. The results discussed above indicate that the existing estimation methods (e.g., RedSync and GaussianKSGD) fail to achieve consistent threshold estimation behavior even though they may provide throughput gains. Their throughput gains, in many cases, are due to severe under-estimation of the target ratio, which results in lower volumes of sent data compared to other compressors.
[0069] Thus, the novel SIDCo method solves a practical problem in distributed deep learning. Compressors other than threshold-based ones have high computational costs, whereas the existing threshold-estimation methods fail to achieve their targets. To address these issues, the novel SIDCo threshold-based compressor is introduced, which imposes a sparsity prior on the gradients.
[0070] A statistical-based gradient compression method is now discussed with regard to Figure 12. The method includes a step 1200 of receiving input data 412 at plural nodes 410-n of a neural network 400, a step 1202 of running the input data 412 forward and backward through the plural nodes 410-n to generate node gradient vectors 422, a step 1204 of fitting a sparsity-inducing distribution, SID, at each node 410-n, to the corresponding node gradient vector 422, a step 1206 of calculating a first threshold η_1, based on the SID, for the corresponding node gradient vector 422, a step 1208 of compressing the corresponding node gradient vector 422 to obtain a first compressed gradient vector 430, by setting to zero those components that are smaller than the first threshold η_1, and a step 1210 of transmitting a compressed gradient vector, which is related to the first compressed gradient vector 430, to all other nodes 410-n for updating a corresponding model.
[0071] The method may further include selecting the SID from a double exponential distribution, a double gamma distribution, and a double generalized Pareto distribution, and/or fitting the SID or another SID, at each node, to the first compressed gradient vector, and/or calculating a second threshold η_2, based on the SID or another SID, for the first compressed gradient vector, compressing the first compressed gradient vector to obtain a second compressed gradient vector, by setting the components that are smaller than the second threshold η_2 to zero, and transmitting the second compressed gradient vector, i.e., the non-zero components of the second compressed vector and their indices, to all other nodes for updating a corresponding model. For each threshold, a target compression ratio is used to calculate the threshold. The product of the per-stage target compression ratios is equal to the overall target compression ratio. The method may further include calculating a number of stages for which to repeat the steps of fitting, calculating and compressing before the step of transmitting. In one application, the input data is training data for the neural network. The steps of fitting, calculating and compressing are run independently and simultaneously on the plural nodes. In this or another application, the step of compressing has a target compression ratio, and the threshold is calculated based on the target compression ratio.
[0072] The above-discussed procedures and methods may be implemented in a computing device as illustrated in Figure 13. Hardware, firmware, software or a combination thereof may be used to perform the various steps and operations described herein. Computing device 1300 of Figure 13 is suitable for performing the activities described above with regard to a node 410-n, and may include a server 1301. Such a server 1301 may include a central processor (CPU) 1302 coupled to a random access memory (RAM) 1304 and to a read-only memory (ROM) 1306. ROM 1306 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 1302 may communicate with other internal and external components through input/output (I/O) circuitry 1308 and bussing 1310 to provide control signals and the like. Processor 1302 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
[0073] Server 1301 may also include one or more data storage devices, including hard drives 1312, CD-ROM drives 1314 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1316, a USB storage device 1318 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1314, disk drive 1312, etc. Server 1301 may be coupled to a display 1320, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 1322 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
[0074] Server 1301 may be coupled to other devices, such as a database of data that needs to be analyzed, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1328, which allows ultimate connection to various landline and/or mobile computing devices. [0075] The disclosed embodiments provide a statistical-based gradient compression method and system for distributed training systems. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
[0076] Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein. [0077] This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.
References
The entire content of all the publications listed herein is incorporated by reference in this patent application.
[1] Dutta, A., Bergou, E. H., Abdelmoniem, A. M., Ho, C.-Y., Sahu, A. N., Canini, M., and Kalnis, P., On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning. In AAAI, 2020.
[2] Fang, J., Fu, H., Yang, G., and Hsieh, C.-J., RedSync: Reducing synchronization bandwidth for distributed deep learning training system. Journal of Parallel and Distributed Computing, 133, 2019.
[3] Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, W., Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR, 2018.
[4] Wangni, J., Wang, J., Liu, J., and Zhang, T., Gradient Sparsification for Communication-Efficient Distributed Optimization. In NeurIPS, 2018.
[5] Xu, H., Ho, C.-Y., Abdelmoniem, A. M., Dutta, A., Bergou, E. H., Karatsenidis, K., Canini, M., and Kalnis, P., Compressed communication for distributed deep learning: Survey and quantitative evaluation. Technical report, KAUST, 2020. http://hdl.handle.net/10754/662495.
[6] Abdelmoniem, A. M. and Canini, M., DC2: Delay-aware Compression Control for Distributed Machine Learning. In INFOCOM, 2021.
[7] Jiang, J., Fu, F., Yang, T., and Cui, B., SketchML: Accelerating Distributed Machine Learning with Data Sketches. In SIGMOD, 2018.
[8] Shanbhag, A., Pirk, H., and Madden, S., Efficient Top-K Query Processing on Massively Parallel Hardware. In SIGMOD, 2018.
[9] Shi, S., Chu, X., Cheung, K. C., and See, S., Understanding Top-k Sparsification in Distributed Deep Learning. arXiv:1911.08772, 2019.
[10] Armagan, A., Dunson, D. B., and Lee, J., Generalized double Pareto shrinkage. Statistica Sinica, 23(1), 2013.
[11] Babacan, S. D., Molina, R., and Katsaggelos, A. K., Bayesian Compressive Sensing Using Laplace Priors. IEEE Transactions on Image Processing, 19(1), 2010.
[12] Monga, V., Mousavi, H. S., and Srinivas, U., Sparsity Constrained Estimation in Image Processing and Computer Vision. In Handbook of Convex Optimization Methods in Imaging Science, pp. 177-206. Springer International Publishing, 2018.
[13] Elzanaty, A., Giorgetti, A., and Chiani, M., Limits on Sparse Data Acquisition: RIC Analysis of Finite Gaussian Matrices. IEEE Transactions on Information Theory, 65(3), 2019.
[14] Elzanaty, A., Giorgetti, A., and Chiani, M., Lossy Compression of Noisy Sparse Sources Based on Syndrome Encoding. IEEE Transactions on Communications, 67(10), 2019.
[15] Mallat, S., A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 2009.

Claims

WHAT IS CLAIMED IS:
1. A statistical-based gradient compression method comprising: receiving (1200) input data (412) at plural nodes (410-n) of a neural network (400); running (1202) the input data (412) forward and backward through the plural nodes (410-n) to generate node gradient vectors (422); fitting (1204) a sparsity-inducing distribution, SID, at each node (410-n), to a corresponding node gradient vector (422); calculating (1206) a first threshold η_1, based on the SID, for the corresponding node gradient vector (422); compressing (1208) the corresponding node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η_1; and transmitting (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model, wherein the compressed gradient vector is sparse as only some components are non-zero.
2. The method of Claim 1, further comprising: selecting the SID from a double exponential distribution, a double gamma distribution, and a double generalized Pareto distribution.
3. The method of Claim 1, further comprising: fitting the SID or another SID, at each node (410-n), to the first compressed gradient vector (430).
4. The method of Claim 3, further comprising: calculating a second threshold η_2, based on a distribution of new peak-over-threshold data using extreme value theory, for the first compressed gradient vector (430); further compressing the first compressed gradient vector (430) to obtain a second compressed gradient vector, by making zero those components that are smaller than the second threshold η_2; repeating this process for several stages to obtain a final threshold and the corresponding compressed gradients; and transmitting the final compressed gradient vector to all other nodes (410-n) for updating the corresponding model.
5. The method of Claim 4, wherein for each threshold, a target compression ratio is used to calculate the threshold.
6. The method of Claim 5, wherein a product of the target compression ratios for each stage is equal to an overall target compression ratio.
7. The method of Claim 3, further comprising: calculating a number of stages for which to repeat the steps of fitting.
8. The method of Claim 1, wherein the input data is training data for the neural network.
9. The method of Claim 1, wherein the steps of fitting, calculating and compressing are run independently and simultaneously on each of the plural nodes.
10. The method of Claim 1, wherein the step of compressing has a target compression ratio, and the threshold is calculated based on the target compression ratio and the SID.
11. A neural network system (400) that uses a statistical-based gradient compression method, the system (400) comprising: plural nodes (410-n), each configured to receive (1200) input data (412); and each node (410-n) including a processor (1302) configured to: run (1202) the input data (412) forward and backward to generate a node gradient vector (422); fit (1204) a sparsity-inducing distribution, SID, to the node gradient vector (422); calculate (1206) a first threshold η_1, based on the SID, for the node gradient vector (422); compress (1208) the node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η_1; and transmit (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model, wherein the compressed gradient vector is sparse as only some components are non-zero.
12. The system of Claim 11 , wherein the node is configured to: select the SID from a double exponential distribution, a double gamma distribution and a double generalized Pareto distribution.
13. The system of Claim 11 , wherein the processor is configured to: fit the SID or another SID to the first compressed gradient vector (430).
14. The system of Claim 13, wherein the processor is further configured to: calculate a second threshold η_2, based on extreme value theory, for the first compressed gradient vector (430); compress the first compressed gradient vector (430) to obtain a second compressed gradient vector, by making zero those components of the first compressed gradient vector that are smaller than the second threshold η_2; and transmit the second compressed gradient vector to all other nodes (410-n) for updating the corresponding model.
15. The system of Claim 14, wherein for each threshold, a target compression ratio is used to calculate the threshold.
16. The system of Claim 11, wherein the processor is further configured to: calculate a number of stages for which to repeat the steps of fitting, calculating and compressing before the step of transmitting.
17. The system of Claim 11, wherein the input data is training data for the neural network.
18. The system of Claim 11, wherein the steps of fitting, calculating and compressing are run independently and simultaneously on each of the plural nodes.
19. The system of Claim 11, wherein the step of compressing has a target compression ratio, and the threshold is calculated by the processor based on the target compression ratio.
20. A non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for gradient compression using statistical methods, the method comprising: receiving (1200) input data (412) at plural nodes (410-n) of a neural network (400); running (1202) the input data (412) forward and backward through the plural nodes (410-n) to generate node gradient vectors (422); fitting (1204) a sparsity-inducing distribution, SID, at each node (410-n), to the corresponding node gradient vector (422); calculating (1206) a first threshold η_1, based on the SID, for the corresponding node gradient vector (422); compressing (1208) the corresponding node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η_1; and transmitting (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model, wherein the compressed gradient vector is sparse as only some components are non-zero.