WO2022003562A1 - Statistical-based gradient compression method for distributed training system - Google Patents

Statistical-based gradient compression method for distributed training system

Info

Publication number
WO2022003562A1
WO2022003562A1 (PCT/IB2021/055814)
Authority
WO
WIPO (PCT)
Prior art keywords
gradient vector
threshold
compressed
node
sid
Prior art date
Application number
PCT/IB2021/055814
Other languages
French (fr)
Inventor
Ahmed MOHAMED ABDELMONIEM SAYED
Ahmed ELZANATY
Marco Canini
Mohamed-Slim Alouini
Original Assignee
King Abdullah University Of Science And Technology
Priority date
Filing date
Publication date
Application filed by King Abdullah University Of Science And Technology filed Critical King Abdullah University Of Science And Technology
Publication of WO2022003562A1 publication Critical patent/WO2022003562A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00: Computing arrangements based on specific mathematical models
    • G06N7/01: Probabilistic graphical models, e.g. probabilistic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • Embodiments of the subject matter disclosed herein generally relate to exchanging data within a neural network, and more particularly, to applying an efficient gradient compression technique to a distributed training neural network.
  • DNNs deep neural networks
  • Modern deep learning toolkits e.g., pytorch.org
  • a worker is understood herein to be a component or a node of the DNN that executes training, for example, a processor or a computing device.
  • Training DNNs in such settings relies on synchronous distributed Stochastic Gradient Descent (SGD) or similar optimizers.
  • SGD synchronous distributed Stochastic Gradient Descent
  • N (a positive integer) is the number of workers of the training DNN
  • x^(i) ∈ ℝ^d denotes the model parameters with d dimensions at iteration i.
  • a bold symbol means in this context a vector.
  • each worker n (where n takes a value between 1 and N) runs a back-propagation algorithm to produce a local stochastic gradient g_n^(i) ∈ ℝ^d.
  • each worker updates its model parameters x^(i+1) = x^(i) − λ·ĝ^(i) using the final gradient ĝ^(i) = (1/N) Σ_{n=1}^{N} g_n^(i) aggregated across all workers, where λ is the learning rate. This means that each worker n of the N workers needs to receive all individual gradients from all other workers and calculate the new aggregated gradient ĝ^(i).
  • Gradient aggregation involves extensive communication, which is either between the workers in a peer-to-peer fashion (typically through collective communication primitives like all-reduce) or via a parameter server architecture. Due to the synchronous nature of the optimizer, workers cannot proceed with the (i + 1) th iteration until the aggregated gradient is available. Therefore, in distributed training workloads, communication is commonly one of the predominant bottlenecks.
  • a statistical-based gradient compression method that includes receiving input data at plural nodes of a neural network, running the input data forward and backward through the plural nodes to generate node gradient vectors, fitting a sparsity-inducing distribution, SID, at each node, to a corresponding node gradient vector, calculating a first threshold η1, based on the SID, for the corresponding node gradient vector, compressing the corresponding node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmitting a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero.
  • a neural network system that uses a statistical-based gradient compression method, and the system includes plural nodes, each configured to receive input data, and each node including a processor configured to run the input data forward and backward to generate a node gradient vector, fit a sparsity-inducing distribution, SID, to the node gradient vector, calculate a first threshold η1, based on the SID, for the node gradient vector, compress the node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmit a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero.
  • a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for gradient compression using statistical methods as discussed above.
  • Figures 1A and 1B illustrate the compression speedups over Topk using different compression ratios (0.1, 0.01, 0.001) on a GPU (Figure 1A) and a CPU (Figure 1B);
  • Figure 1C shows the average estimation quality of a target k;
  • Figure 2A shows the sorted magnitude of the gradients versus their indexes and the fitted curve via a power law;
  • Figure 2B shows the approximation error of the Topk versus the number of nonzero elements k;
  • Figures 3A to 3D show gradient fitting using three sparsity-inducing distributions (SIDs) for the gradient vector along with the empirical distribution generated from training ResNet-20 on the CIFAR10 dataset using the Topk compressor with an error compensation mechanism, for the 100th [(A) PDF, (B) CDF] and 10,000th [(C) PDF, (D) CDF] iterations;
  • Figures 4A and 4B schematically illustrate a neural network configured to run a statistical-based gradient compression method
  • Figures 5A and 5B illustrate the pseudo-code run by the statistical-based gradient compression method
  • Figure 6 illustrates the iteration aspect of the statistical-based gradient compression method
  • Figure 7 provides an example for the application of the statistical-based gradient compression method
  • Figures 8A to 8C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a first dataset
  • Figures 9A and 9B show the train loss versus number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the first dataset;
  • Figures 10A to 10C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a second dataset
  • Figures 11A and 11B show the train loss versus the number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the second dataset;
  • Figure 12 is a flow chart of a method for compressing gradient vectors based on sparsity-inducing distributions.
  • Figure 13 is a schematic diagram of a node of a neural network.
  • a novel gradient compressor with minimal overhead is introduced. Noting the sparsity of the gradients, the compressor models the gradients as random variables distributed according to some sparsity- inducing distributions (SIDs). This approach is validated by studying the statistical characteristics of the evolution of the gradient vectors over the training process.
  • SIDCo Sparsity-Inducing Distribution-based Compression
  • Gradient quantization is another way to reduce the size of the transmitted gradients and it represents gradients with fewer bits for each gradient element. Under some conditions, quantization is known to achieve the same convergence as no compression. Error compensation (EC) is used to attain convergence when the gradients are quantized using fewer bits. Given the standard 32-bit float number representation, the volume reduction of the quantization is limited to 32×, i.e., 1 bit out of 32 bits, which may not be sufficient for large models or slow networks, and it requires expensive encoding to pack the quantized bits.
  • EC Error compensation
  • the gradient sparsification selects a subset of the gradient elements for the next iteration. It is generally more flexible than the quantization approach, as it can reduce the transmitted volume by up to d times and adapts easily to network conditions [6]. It was shown that in some cases, up to 99.9% of the non-significant gradient elements can be dropped with limited impact on convergence.
  • Gradient sparsification using Topk, i.e., selecting the top k elements by their magnitude, is known to yield better convergence compared to other compression schemes, e.g., Random-k.
  • Topk or its variants are notorious for being computationally inefficient. The Topk selection does not perform well on accelerators such as GPUs. For instance, in many cases, it is reported that Topk imposes high overheads and worsens the run time of distributed training systems.
  • the main challenge with using gradient compression is the computational overhead the algorithm itself introduces in the training. If the overhead is greater than the reduction gains in the communication time, the overall iteration time increases.
  • a robust compressor should have a low overhead.
  • one of the dominantly robust compressors is Topk.
  • this compressor is also computationally intensive. Because of this, using Topk for large models results in either an increased training time or unsatisfactory performance benefits.
  • Topk is the slowest on GPUs and not the fastest on CPUs.
  • threshold-based methods, which aim to overcome the overhead of Topk, select, in linear time, the gradient elements larger in magnitude than a threshold η.
  • DGC [3] proposes to sample a random subset of the gradients (e.g., 1%), and apply Topk on the sampled sub-population to find a threshold, which is then used to hierarchically obtain the actual Topk elements. Even though DGC leads to improved performance over Topk, its computational complexity is still of the same order as Topk's complexity.
  • the threshold estimation methods are shown to attain a linear time complexity. In this regard, several works have leveraged certain features of the gradients to enhance the training process.
  • the SIDCo algorithm uses the compressibility of the gradients and their statistical distribution. Signals, including gradient vectors of DNNs, can be efficiently compressed by exploiting some of their statistical features. Among these features, sparsity and compressibility are the key drivers for performing signal compression [13-15].
  • a definition of compressible signals is as follows: the signal g ∈ ℝ^d is compressible if the magnitudes of its sorted coefficients obey a power-law decay, i.e., the jth largest magnitude is bounded by c1·j^(−p) with decay exponent p > 1/2.
  • the signal is more compressible if it decays faster, i.e., p is higher.
  • the compressibility of the gradient vector allows efficient compression for the gradients through sparsification techniques, e.g., Topk and thresholding-based compression.
  • the gradients generated while training the ResNet20 were used.
  • Figure 2A shows the elements of the sorted gradient magnitude vector, i.e., g̃_j, which are reported versus their index, for three iterations in the beginning, middle, and end of the training.
  • Figure 2B shows the sparsification error for the best k approximation, e.g., the Topk, as a function of k.
  • the goal is to find the distribution of the gradient vector, while accounting for the compressibility of the gradients.
  • the selection of sparsity-promoting priors that are able to efficiently capture the statistical characteristics of the gradients with low computational complexity is a challenging task.
  • the inventors noticed a specific property of the distribution of the gradients, which permits high compression gains with low computational overhead.
  • This property indicates that gradients generated from many DNNs during the training can be modeled as random variables (r.v.s) distributed according to some sparsity-inducing distributions, i.e., double exponential, double gamma and double generalized Pareto (GP) distributions. More specifically, the gradient G can be modeled or fitted as G ~ Distribution(θ).
  • Distribution(θ) is one of the three SIDs with parameters indicated by the vector θ, which generally depend on the iteration and the worker's data.
  • PDF probability density function
  • for double exponentially distributed gradients, the threshold η that achieves the compression ratio δ can be computed as η(δ) = β̂·ln(1/δ), where β̂ is the maximum likelihood estimate (MLE) of the scale parameter.
  • MLE maximum likelihood estimate
  • the vector that contains only the exceedance non-zero gradients, i.e., the gradients that are larger in magnitude than the threshold η.
  • the number of its components is k.
  • the target compression ratio δ can be as low as 10⁻⁴. Therefore, in order to accurately estimate the threshold η, the fitted distribution should tightly resemble the gradient distribution at the tail. This is quite challenging because the estimation of the parameters tends to account more for the majority of the data at the expense of the tail. Hence, the threshold η obtained from the single-stage fitting discussed above is accurate only up to some moderate compression ratios.
  • a multi-stage fitting approach is proposed in this embodiment.
  • a two-stage approach is first discussed.
  • the calculated vector of the exceedance gradients g is used to fit another distribution, defined precisely below.
  • the threshold for the multi-stage approach.
  • the absolute values of the exceedance gradients |G_m| can be modeled as GP distributed r.v.s whose shape, scale, and location parameters are estimated at each stage.
  • the threshold that achieves a compression ratio δ_m is obtained from this fit, where η_{m−1} is the threshold computed at the previous stage and μ̂ and σ̂² are the sample mean and variance of |g_m| − η_{m−1}, respectively.
  • FIGS 4A and 4B show a neural network/computing system 400 that includes n workers, where n is an integer number that varies between 2 and N. Only two workers or nodes 410-1 and 410-n are shown in the figure for simplicity.
  • a training node may be a GPU, a CPU, a computing device, etc.
  • Each training node 410-n receives input data 412 for training, and performs a forward-backward pass 414/416 on different batches sampled from the training dataset 412 with the same network model.
  • each worker runs a local version of the SGD algorithm to produce the corresponding gradients 418, which are then reshaped by the gradient reshaping module 420 to generate the gradient vector g_n^(i) ∈ ℝ^d 422.
  • Figures 5A and 5B, lines 20-26 show how to calculate the number of stages M.
  • the processor of each worker estimates the parameters of the selected SID to effectively fit the gradient vector to the selected SID.
  • the function Thresh_Estimation shown in line 13 in Figures 5A and 5B uses the chosen SID to obtain a corresponding threshold.
  • the algorithm dynamically adapts the number of stages M by monitoring the quality of its estimated selection of elements and adjusting M using the function Adapt_Stages noted in line 20 in Figure 5B.
  • the algorithm in Figure 5A starts by calling the sparsify function, which takes the gradient vector 422 and the target ratio as the parameters. Then, the algorithm applies a multi-stage estimation loop of M iterations. In each iteration, the compressed gradient vector g̃_n^(i) 430 is partially sparsified with the previously estimated threshold obtained from the previous stage m − 1.
  • the chosen SID distribution fitting is invoked via the function Thresh_Estimation to obtain a new threshold.
  • the resulting estimation threshold should approximate the threshold that would obtain the target ratio δ of the input vector.
  • the estimated threshold is used to sparsify the full gradient vector and obtain the values and their corresponding indices. For each invocation of the algorithm in each training iteration, the algorithm maintains statistics like the average ratio of the quality of its estimations over the past training steps Q.
  • the algorithm invokes the Adapt_Stages function (see line 20 in Figure 5B), which adjusts the current number of stages M based on user-defined allowable error bounds of the estimation (i.e., ε_H and ε_L); an illustrative sketch of this adaptation logic is given after this list.
  • the next algorithm iteration invocation will use the new number of stages M.
  • the number of stages is adjusted only if the obtained ratio is not within the error bounds.
  • Figure 6 illustrates these steps of the fitting refinement 610 in which the threshold η is adjusted for each iteration to arrive at the desired target ratio δ.
  • A simplified example of compressing the gradient vector is illustrated in Figure 7.
  • the gradient vector g_n^(i) 422 has the 10 values 710, as illustrated in Figure 7.
  • the method computes for each stage the threshold from the peak-over-threshold data from the previous stage (i.e., the gradient elements that have an absolute value larger than the threshold). After finishing all the stages, the final threshold is used to compress and send the vectors. For M stages, this process is repeated while ensuring that the overall target compression ratio equals the product of the per-stage compression ratios.
  • the all-gather module 432 is configured to collect the compressed and sparsified gradient vectors from each worker and to provide this info to all the workers in step 434.
  • the averaged gradient 436 from all the workers is used in step 438 to update the model x^(i) ∈ ℝ^d used by each worker, and then the steps discussed above are repeated for the next iteration i+1.
  • the SIDCo algorithm was compared to Topk, DGC, RedSync and GaussianKSGD.
  • the EC mechanism is employed to further enhance the convergence of SGD with compressed gradients.
  • SIDCo-E double exponential fitting
  • Normalized Training Speed-up: the model quality is evaluated at iteration T (the end of training) and it is divided by the time taken to complete T iterations. This quantity is normalized by the same measurement calculated for the baseline case. This is the normalized training speed-up relative to the baseline;
  • Normalized Average Training Throughput is the average throughput normalized by the baseline's throughput, which illustrates the speed-up from compression irrespective of its impact on the model quality;
  • the number of selected elements is two orders of magnitude lower than the target.
  • the estimation quality of RedSync has a high variance, harming its convergence.
  • Figure 9B shows that, at a target ratio of 0.001, the RedSync causes significant fluctuation in the compression ratio and the training does not converge.
  • GaussianKSGD results in a very low compression ratio, which is close to 0, and far from the target leading to significantly higher loss (and test perplexity) values compared to the target values.
  • Figure 10A shows that SIDCo achieves higher gains compared to other compressors, by up to about 2.1 times for ratios of 0.1 and 0.01. Notably, at a ratio of 0.001, only SIDCo achieved the target character error rate (CER). Thus, other compressors were run for 250 epochs to achieve the target CER (instead of the default 150), except for the GaussianKSGD, which does not converge. The gains of the SIDCo method over the other compressors then increase by up to about 4 times. The reason could be that the model is more sensitive to compression (especially in the initial training phase).
  • the SIDCo method starts as a single-stage before performing stage adaptations, leading to a slight over-estimation of k and so more gradient elements are sent during the training start-up.
  • Figure 10B shows that the threshold- estimation methods including the SIDCo method enjoy high training throughput, explaining the gains over the baseline.
  • Figure 10C shows that on average, with low variance, SIDCo closely matches the estimated ratios of DGC while other estimation methods have poor estimation quality.
  • Figures 11A and 11B show that, at a target ratio of 0.001, RedSync causes significant fluctuation in the compression ratio and the GaussianKSGD method results in a very low compression ratio (close to 0), which is far from the target. This leads both methods to achieve significantly higher loss (or test perplexity) values compared to the target loss (or test perplexity) values.
  • the SIDCo method estimates the threshold with very high quality for all ratios. Similar trends are observed for the VGG19 benchmark, where a compression ratio of 0.001 was used. The results also indicate that the SIDCo method estimates the threshold with high quality and achieves the highest top-1 accuracy and training throughput among all methods. The accuracy gains compared to the baseline, Topk and DGC methods are about 34, 2.9, and 1.13 times higher, respectively.
  • the novel SIDCo method solves a practical problem in distributed deep learning.
  • compressors other than threshold-based ones have high computational costs, whereas the existing threshold-estimation methods fail to achieve their target.
  • the novel SIDCo threshold-based compressor is introduced, which imposes a sparsity prior on the gradients.
  • the method includes a step 1200 of receiving input data 412 at plural nodes 410-n of a neural network 400, a step 1202 of running the input data 412 forward and backward through the plural nodes 410-n to generate node gradient vectors 422, a step 1204 of fitting a sparsity-inducing distribution, SID, at each node 410-n, to the corresponding gradient vector 422, a step 1206 of calculating a first threshold η1, based on the SID, for the corresponding node gradient vector 422, a step 1208 of compressing the corresponding node gradient vector 422 to obtain a first compressed gradient vector 430, by setting to zero those components that are smaller than the first threshold η1, and a step 1210 of transmitting a compressed gradient vector, which is related to the first compressed gradient vector 430, to all other nodes 410-n for updating a corresponding model.
  • the method may further include selecting the SID from the double exponential distribution, the double gamma distribution and the double generalized Pareto distribution, and/or fitting the SID or another SID, at each node, to the first compressed gradient vector, and/or calculating a second threshold η2, based on the SID or another SID, for the first compressed gradient vector, compressing the first compressed gradient vector to obtain a second compressed gradient vector, by setting the components that are smaller than the second threshold η2 to zero, and transmitting the second compressed gradient vector, i.e., the non-zero components of the second compressed vector and their indices, to all other nodes for updating a corresponding model. For each threshold, a target compression ratio is used to calculate the threshold.
  • a product of the target compression ratio for each stage is equal to an overall target compression ratio.
  • the method may further include calculating a number of stages for which to repeat the steps of fitting, calculating and compressing before the step of transmitting.
  • the input data is training data for the neural network.
  • the steps of fitting, calculating and compressing are run independently and simultaneously on the plural nodes.
  • the step of compressing has a target compression ratio, and the threshold is calculated based on the target compression ratio.
  • Computing device 1300 of Figure 13 is suitable for performing the activities described above with regard to a node 410-n, and may include a server 1301.
  • a server 1301 may include a central processor (CPU) 1302 coupled to a random access memory (RAM) 1304 and to a read-only memory (ROM) 1306.
  • RAM random access memory
  • ROM 1306 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc.
  • Processor 1302 may communicate with other internal and external components through input/output (I/O) circuitry 1308 and bussing 1310 to provide control signals and the like.
  • I/O input/output
  • Processor 1302 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
  • Server 1301 may also include one or more data storage devices, including hard drives 1312, CD-ROM drives 1314 and other hardware capable of reading and/or storing information, such as DVD, etc.
  • software for carrying out the above-discussed steps may be stored and distributed on a CD- ROM or DVD 1316, a USB storage device 1318 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1314, disk drive 1312, etc.
  • Server 1301 may be coupled to a display 1320, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc.
  • a user input interface 1322 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
  • Server 1301 may be coupled to other devices, such as a database of data that needs to be analyzed, etc.
  • the server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1328, which allows ultimate connection to various landline and/or mobile computing devices.
  • GAN global area network
  • the disclosed embodiments provide a statistical-based gradient compression method and system for distributed training systems. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
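As referenced in the list above, the following is a rough, hypothetical sketch of the multi-stage threshold estimation with stage-count adaptation described for Figures 5A, 5B and 6. It is illustrative only: the class and method names (SIDCoLikeCompressor, compress, _adapt), the double exponential fit, and the equal per-stage split of the target ratio are assumptions, not the patent's pseudo-code.

```python
import math
import torch


class SIDCoLikeCompressor:
    """Illustrative multi-stage threshold compressor with stage adaptation (double exponential fit)."""

    def __init__(self, delta, stages=1, eps_low=0.2, eps_high=0.2, window=5):
        self.delta, self.stages = delta, stages
        self.eps_low, self.eps_high = eps_low, eps_high
        self.window = window
        self.history = []                     # recent k_obtained / k_target ratios

    def _estimate_threshold(self, abs_g):
        stage_ratio = self.delta ** (1.0 / self.stages)   # assumed equal split of the ratio
        eta = 0.0
        for _ in range(self.stages):
            excess = abs_g[abs_g > eta] - eta             # peak-over-threshold data, shifted
            if excess.numel() == 0:
                break
            # Exponential (double exponential on signed gradients) scale estimate = mean excess
            eta += float(excess.mean()) * math.log(1.0 / stage_ratio)
        return eta

    def compress(self, grad):
        abs_g = grad.abs()
        eta = self._estimate_threshold(abs_g)
        indices = (abs_g >= eta).nonzero(as_tuple=True)[0]
        self._adapt(indices.numel(), max(1, int(self.delta * grad.numel())))
        return grad[indices], indices          # sparse payload: values and positions

    def _adapt(self, k_obtained, k_target):
        self.history.append(k_obtained / k_target)
        if len(self.history) < self.window:
            return
        avg = sum(self.history) / len(self.history)
        self.history.clear()
        if avg > 1.0 + self.eps_high:          # selecting too many elements: add a stage
            self.stages += 1
        elif avg < 1.0 - self.eps_low and self.stages > 1:
            self.stages -= 1                   # selecting too few: relax by removing a stage
```

In this sketch the number of stages is changed only when the average obtained-to-target ratio leaves the user-defined bounds, mirroring the behavior attributed to the Adapt_Stages function above.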

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

A statistical-based gradient compression method includes receiving (1200) input data (412) at plural nodes (410-n) of a neural network, running (1202) the input data (412) forward and backward through the plural nodes (410-n) to generate node gradient vectors, fitting (1204) a sparsity-inducing distribution, SID, at each node (410-n), to a corresponding node gradient vector, calculating (1206) a first threshold η1, based on the SID, for the corresponding node gradient vector, compressing (1208) the corresponding node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η1, and transmitting (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model.

Description

STATISTICAL-BASED GRADIENT COMPRESSION METHOD FOR
DISTRIBUTED TRAINING SYSTEM
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/045,346, filed on June 29, 2020, entitled “EFFICIENT GRADIENT COMPRESSION FOR FAST DISTRIBUTED TRAINING,” the disclosure of which is incorporated herein by reference in its entirety.
BACKGROUND
TECHNICAL FIELD
[0002] Embodiments of the subject matter disclosed herein generally relate to exchanging data within a neural network, and more particularly, to applying an efficient gradient compression technique to a distributed training neural network.
DISCUSSION OF THE BACKGROUND
[0003] As the deep neural networks (DNNs) continue to become larger and more sophisticated, and an ever increasing amount of training data is used, scaling the training process to run efficiently on a distributed cluster is currently a goal that is attracting a multitude of efforts. Modern deep learning toolkits (e.g., pytorch.org) are capable of distributed data-parallel training whereby the model is replicated and the training data are partitioned among plural workers. A worker is understood herein to be a component or a node of the DNN that executes training, for example, a processor or a computing device. Training DNNs in such settings relies on synchronous distributed Stochastic Gradient Descent (SGD) or similar optimizers. [0004] More specifically, assume that N (a positive integer) is the number of workers of the training DNN, and that x^(i) ∈ ℝ^d denotes the model parameters with d dimensions at iteration i. A bold symbol means in this context a vector. At the end of the ith iteration, each worker n (where n takes a value between 1 and N) runs a back-propagation algorithm to produce a local stochastic gradient g_n^(i) ∈ ℝ^d. Then, each worker updates its model parameters x^(i+1) using the final gradient aggregated across all workers, i.e.,

x^(i+1) = x^(i) − λ·ĝ^(i),  with  ĝ^(i) = (1/N) Σ_{n=1}^{N} g_n^(i),

where λ is the learning rate. This means that each worker n of the N workers needs to receive all individual gradients from all other workers and calculate the new aggregated gradient ĝ^(i).
[0005] Gradient aggregation involves extensive communication, which is either between the workers in a peer-to-peer fashion (typically through collective communication primitives like all-reduce) or via a parameter server architecture. Due to the synchronous nature of the optimizer, workers cannot proceed with the (i + 1)th iteration until the aggregated gradient ĝ^(i) is available. Therefore, in distributed training workloads, communication is commonly one of the predominant bottlenecks.
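For illustration only (this sketch is not part of the patent), the synchronous update of paragraphs [0004]-[0005] can be written with PyTorch's all-reduce primitive; the model, loss function, data loading and process-group initialization are assumed to exist elsewhere.

```python
# Minimal sketch of uncompressed synchronous data-parallel SGD.
# Assumes torch.distributed is already initialized with N worker processes.
import torch
import torch.distributed as dist


def training_step(model, loss_fn, batch, lr):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    model.zero_grad()
    loss.backward()                          # local stochastic gradient g_n^(i)

    world_size = dist.get_world_size()
    for p in model.parameters():
        # Exchange and average the full gradient: g^(i) = (1/N) sum_n g_n^(i)
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad.div_(world_size)

    with torch.no_grad():
        for p in model.parameters():
            p.add_(p.grad, alpha=-lr)        # x^(i+1) = x^(i) - lambda * g^(i)
```

It is exactly this per-iteration exchange of every gradient element that the compression schemes discussed below try to shrink.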
[0006] Addressing this communication bottleneck is the focus of intensive research, where one avenue is pursuing a path of improving training by reducing the communicated data volume via lossy gradient compression. Compression entails two main challenges: (i) it can negatively affect the training accuracy (because the greater the compression is, the larger the error in the aggregated gradient), and (ii) it introduces extra computation latency (due to the compression operation itself). While the former can be mitigated by applying compression to a smaller extent or using compression with error-feedback, the latter, if left unchecked, can actually slow down the training of the neural network when compared to the no-compression option.
[0007] Thus, there is a need for a new system and algorithm that are capable of accurately compressing the gradients exchanged by the workers of the neural network so that almost no error is introduced and the compression process does not increase the computation latency to undesirable levels.
BRIEF SUMMARY OF THE INVENTION
[0008] According to an embodiment, there is a statistical-based gradient compression method that includes receiving input data at plural nodes of a neural network, running the input data forward and backward through the plural nodes to generate node gradient vectors, fitting a sparsity-inducing distribution, SID, at each node, to a corresponding node gradient vector, calculating a first threshold η1, based on the SID, for the corresponding node gradient vector, compressing the corresponding node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmitting a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero. [0009] According to another embodiment, there is a neural network system that uses a statistical-based gradient compression method, and the system includes plural nodes, each configured to receive input data, and each node including a processor configured to run the input data forward and backward to generate a node gradient vector, fit a sparsity-inducing distribution, SID, to the node gradient vector, calculate a first threshold η1, based on the SID, for the node gradient vector, compress the node gradient vector to obtain a first compressed gradient vector, by setting to zero those components that are smaller than the first threshold η1, and transmit a compressed gradient vector, which is related to the first compressed gradient vector, to all other nodes for updating a corresponding model, where the compressed gradient vector is sparse as only some components are non-zero. [0010] According to still another embodiment, there is a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for gradient compression using statistical methods as discussed above.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
[0012] Figures 1A and 1B illustrate the compression speedups over Topk using different compression ratios (0.1, 0.01, 0.001) on a GPU (Figure 1A) and a CPU (Figure 1B);
[0013] Figure 1C shows the average estimation quality of a target k;
[0014] Figure 2A shows the sorted magnitude of the gradients versus their indexes and the fitted curve via a power law;
[0015] Figure 2B shows the approximation error of the Topk versus the number of nonzero elements k;
[0016] Figures 3A to 3D show gradient fitting using three sparsity-inducing distributions (SIDs) for the gradient vector along with the empirical distribution generated from training ResNet-20 on the CIFAR10 dataset using the Topk compressor with an error compensation mechanism, for the 100th [(A) PDF, (B) CDF] and 10,000th [(C) PDF, (D) CDF] iterations;
[0017] Figures 4A and 4B schematically illustrate a neural network configured to run a statistical-based gradient compression method;
[0018] Figures 5A and 5B illustrate the pseudo-code run by the statistical-based gradient compression method; [0019] Figure 6 illustrates the iteration aspect of the statistical-based gradient compression method;
[0020] Figure 7 provides an example for the application of the statistical-based gradient compression method;
[0021] Figures 8A to 8C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a first dataset;
[0022] Figures 9A and 9B show the train loss versus number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the first dataset;
[0023] Figures 10A to 10C illustrate the performance of the statistical-based gradient compression method versus known methods for speed-up, throughput, and estimation quality for a second dataset;
[0024] Figures 11A and 11B show the train loss versus number of iterations and the threshold estimation quality, respectively, for the statistical-based gradient compression method and other background methods for the second dataset;
[0025] Figure 12 is a flow chart of a method for compressing gradient vectors based on sparsity-inducing distributions; and
[0026] Figure 13 is a schematic diagram of a node of a neural network.
DETAILED DESCRIPTION OF THE INVENTION
[0027] The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims. The following embodiments are discussed, for simplicity, with regard to training a DNN. However, the embodiments to be discussed next are not limited to training a DNN, but may be applied to other neural networks or other purposes than training.
[0028] Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
[0029] According to an embodiment, a novel gradient compressor with minimal overhead is introduced. Noting the sparsity of the gradients, the compressor models the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). This approach is validated by studying the statistical characteristics of the evolution of the gradient vectors over the training process. A Sparsity-Inducing Distribution-based Compression (SIDCo) method is then introduced, which takes advantage of a threshold-based sparsification scheme that enjoys similar threshold estimation quality to the deep gradient compression (DGC) while being faster, as it imposes a lower compression overhead. An evaluation of popular machine learning benchmarks involving both recurrent neural network (RNN) and convolution neural network (CNN) models shows that the SIDCo method speeds up training by up to 41.7, 7.6, and 1.9 times when compared to the no-compression baseline, Topk, and DGC compressors, respectively. The SIDCo method builds on a sound theory of signal compressibility and enjoys linear complexity in terms of the size of the model parameters. This approach affords an implementation that parallelizes very efficiently using modern GPUs and other hardware targets. Thus, in one application, the new scheme addresses a previously-overlooked yet crucial technical obstacle to using compression in practice, especially for communication-bound training of large models.
[0030] To better understand the applicability of the SIDCo method in the context of neural networks, before discussing the novel features of this method, it is believed that a discussion of the existing compression methods is in order. Efficient communication in distributed training has received extensive attention. One approach tries to maximize the overlap between the computation and communication to hide the communication overhead. However, the gains from these methods are bounded by the length of the computation and are modest when the training is dominantly communication-bound. Alternatively, many approaches adopt methods that reduce the amount of communication, in volume or frequency. For example, gradient compression is a well-known volume reduction technique [1-5]. According to this approach, each worker applies a compression operator C to the corresponding gradient g_n^(i) to produce a compressed gradient vector, which is then transmitted for aggregation. Generally, the compressor C involves quantization and/or sparsification operations.
[0031] Gradient quantization is another way to reduce the size of the transmitted gradients; it represents gradients with fewer bits for each gradient element. Under some conditions, quantization is known to achieve the same convergence as no compression. Error compensation (EC) is used to attain convergence when the gradients are quantized using fewer bits. Given the standard 32-bit float number representation, the volume reduction of the quantization is limited to 32×, i.e., 1 bit out of 32 bits, which may not be sufficient for large models or slow networks, and it requires expensive encoding to pack the quantized bits.
[0032] Another approach is gradient sparsification. The gradient sparsification selects a subset of the gradient elements for the next iteration. It is generally more flexible than the quantization approach, as it can reduce the transmitted volume by up to d times and adapts easily to network conditions [6]. It was shown that in some cases, up to 99.9% of the non-significant gradient elements can be dropped with limited impact on convergence. Gradient sparsification using Topk, i.e., selecting the top k elements by their magnitude, is known to yield better convergence compared to other compression schemes, e.g., Random-k. However, Topk or its variants are notorious for being computationally inefficient. The Topk selection does not perform well on accelerators such as GPUs. For instance, in many cases, it is reported that Topk imposes high overheads and worsens the run time of distributed training systems.
[0033] Thus, the main challenge with using gradient compression (e.g., sparsification or quantization) is the computational overhead the algorithm itself introduces in the training. If the overhead is greater than the reduction gains in the communication time, the overall iteration time increases. Hence, to be useful, a robust compressor should have a low overhead. As discussed earlier, one of the dominantly robust compressors is Topk. However, this compressor is also computationally intensive. Because of this, using Topk for large models results in either an increased training time or unsatisfactory performance benefits. [0034] Numerous efforts based on algorithmic or heuristic approaches have been dedicated to enhancing the performance of Topk [3, 7-9]. Existing fast implementations of Topk are compute-intensive, e.g., on the CPU, the computational complexity is O(d log₂ k). Recently, more optimized implementations for multi-core hardware were proposed, which greatly depend on the data distribution, and these implementations work best for small values of k [8]. For instance, the Radix select algorithm used in PyTorch is O(⌈b/r⌉·d), where b is the number of bits in the data values and r is the radix size (pytorch.org). Yet, using gradient vectors of various sizes, Topk is the slowest on GPUs and not the fastest on CPUs.
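As a concrete point of reference (illustrative only, not taken from the patent), a Topk sparsifier is a few lines with torch.topk; it is this selection step whose cost motivates the threshold-based alternatives discussed next.

```python
import torch


def topk_compress(grad: torch.Tensor, ratio: float):
    """Keep the k = ratio * d largest-magnitude elements of a flat gradient vector."""
    d = grad.numel()
    k = max(1, int(ratio * d))
    _, indices = torch.topk(grad.abs(), k, sorted=False)  # the costly selection step
    return grad[indices], indices                          # sparse payload: values and positions
```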
[0035] In the context of gradient compression, threshold-based methods, which aim to overcome the overhead of Topk, select, in linear time, the gradient elements larger in magnitude than a threshold η. DGC [3] proposes to sample a random subset of the gradients (e.g., 1%), and apply Topk on the sampled sub-population to find a threshold, which is then used to hierarchically obtain the actual Topk elements. Even though DGC leads to improved performance over Topk, its computational complexity is still of the same order as Topk's complexity. The threshold estimation methods, on the other hand, are shown to attain a linear time complexity. In this regard, several works have leveraged certain features of the gradients to enhance the training process. Some approaches leveraged these features and devised heuristics to estimate and find the Topk threshold, which exhibit lower compression overhead compared to the traditional Topk and DGC [2, 9]. In particular, RedSync [2] finds the threshold by moving the ratio between the maximum and mean values of the gradient, while GaussianKSGD [9] adjusts an initial threshold obtained from fitting a Gaussian distribution through an iterative heuristic to obtain the Topk elements. Nevertheless, the threshold estimation of these methods is of poor quality and the number of selected elements varies significantly from the target k, as discussed later.
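The sampling idea described above can be sketched as follows. This is a simplified approximation for illustration, not DGC's exact procedure (DGC additionally refines the threshold hierarchically); the function name and the 1% sampling fraction are assumptions.

```python
import torch


def sampled_threshold_compress(grad: torch.Tensor, ratio: float, sample_frac: float = 0.01):
    """Estimate a Top-k-like threshold from a random sub-sample, then apply it to the full vector."""
    d = grad.numel()
    n_samples = max(1, int(sample_frac * d))
    sample = grad.abs()[torch.randint(0, d, (n_samples,), device=grad.device)]
    k_sample = max(1, int(ratio * n_samples))
    threshold = torch.topk(sample, k_sample, sorted=True).values[-1]  # k-th largest in the sample
    mask = grad.abs() >= threshold
    indices = mask.nonzero(as_tuple=True)[0]
    return grad[indices], indices
```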
[0036] To overcome the problems noted above, a statistical approach is developed in this embodiment to estimate an accurate threshold for selecting the Topk elements with minimal overhead. In particular, the compressibility of the gradients is exploited, and a SID is adopted that fits the gradients well. For instance, double exponential (i.e., Laplace), double gamma and double generalized Pareto distributions have been used as sparsity-promoting priors in the Bayesian estimation framework [10-12]. The inventors have discovered that as the gradients are compressible, they are suitable to be modeled as random variables (r.v.s) distributed according to one of the SIDs noted above. [0037] To validate this novel approach, initial micro-benchmark experiments were conducted to evaluate the compression overhead of the existing sparsification techniques, e.g., Topk, DGC (which uses a random sub-sample for the threshold calculation), RedSync and GaussianKSGD (which heuristically estimate the threshold), against one of the proposed SIDCo schemes that estimates the threshold via a multi-stage fitting. Both CPU and GPU were used to benchmark the performance of these algorithms. It was observed from the results of this comparison that the methods based on random sub-sampling (e.g., DGC) excel on the GPU (see Figure 1A), but they impose a huge overhead on the CPU, which leads to DGC performing significantly worse than Topk on the CPU (see Figure 1B). In contrast, the methods that are based on estimating a threshold over which only k elements are selected consistently impose lower compression overhead compared to Topk and DGC on both the GPU and CPU. This shows that, except for linear-time threshold-based methods, a variable compression overhead is to be expected on different architectures (e.g., CPU, GPU, TPU, FPGA or AI chips). Figure 1C shows the normalized actual compression ratio (i.e., the ratio of the obtained k to the target k) for various schemes. Note that the heuristic approaches fail to obtain the right threshold, leading to unpredictable behavior.
[0038] The SIDCo algorithm uses the compressibility of the gradients and their statistical distribution. Signals, including gradient vectors of DNNs, can be efficiently compressed by exploiting some of their statistical features. Among these features, sparsity and compressibility are the key drivers for performing signal compression [13-15]. A definition of compressible signals is as follows: the signal g ∈ ℝ^d is compressible if the magnitudes of its sorted coefficients obey the following power-law decay:

g̃_j ≤ c₁·j^(−p),  ∀ j ∈ {1, 2, …, d},   (1)

where g̃ is the vector of |g| sorted in descending order, g̃_j is the jth element of g̃, and p > 1/2 is the decay exponent, for some constant c₁. For compressible signals obeying a power-law decay, the sparsification error σ_k(g) is bounded as follows:

σ_k(g) ≜ ||g − T_k{g}||₂ ≤ c₂·k^(1/2 − p),   (2)

where ||x||_q = (Σ_j |x_j|^q)^(1/q) is the ℓq norm of x, T_k{·} is the Topk sparsification operator that keeps only the largest k elements in terms of magnitude and sets the others to zero, T_k{g} is a k-sparse vector with only k non-zero elements, and c₂ is a constant. The signal is more compressible if it decays faster, i.e., p is higher.
[0039] The compressibility of the gradient vector allows efficient compression for the gradients through sparsification techniques, e.g., Topk and thresholding-based compression. To verify the vector compressibility, the gradients generated while training the ResNet20 were used. The absolute values of the gradients are sorted in descending order to obtain the vector g̃ with d = 269,722. Figure 2A shows the elements of the gradient vector g̃, i.e., g̃_j, which are reported versus their index, for three iterations in the beginning, middle, and end of the training. As a benchmark, a power-law decay example with decay exponent p > 0.5, i.e., p = 0.7, is used. It can be noticed that the gradients follow a power-law decay with a decay exponent p > 0.7 > 0.5, which indicates that the gradients form a compressible signal according to the definition given by equation (1). Figure 2B shows the sparsification error for the best k approximation, e.g., the Topk, as a function of k. An example of the power decay model with exponent p − 0.5 = 0.2 is used. It can be seen that the best k approximation error decays faster than the benchmark. Hence, the vector can be considered compressible, according to equation (2).
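A small diagnostic in the spirit of Figure 2A (illustrative only, not the patent's code): sort the gradient magnitudes and estimate the decay exponent p from the slope of a log-log fit; values of p above 1/2 indicate compressibility in the sense of equation (1). The synthetic stand-in vector below is an assumption used only to make the snippet runnable.

```python
import numpy as np


def estimate_decay_exponent(grad: np.ndarray) -> float:
    """Fit sorted |g|_(j) ~ c1 * j^(-p) on a log-log scale and return the estimated p."""
    mags = np.sort(np.abs(grad))[::-1]
    mags = mags[mags > 0]                        # avoid log(0)
    j = np.arange(1, mags.size + 1)
    slope, _ = np.polyfit(np.log(j), np.log(mags), 1)
    return -slope                                # estimate of the decay exponent p


# Usage with a flattened gradient vector (here a synthetic heavy-tailed stand-in):
g = np.random.standard_t(df=3, size=100_000) * 1e-3
print("estimated decay exponent p =", round(estimate_decay_exponent(g), 3))
```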
[0040] Next, the goal is to find the distribution of the gradient vector, while accounting for the compressibility of the gradients. The selection of sparsity-promoting priors that are able to efficiently capture the statistical characteristics of the gradients with low computational complexity is a challenging task. However, the inventors noticed a specific property of the distribution of the gradients, which permits high compression gains with low computational overhead. This property indicates that gradients generated from many DNNs during the training can be modeled as random variables (r.v.s) distributed according to some sparsity-inducing distributions, i.e., double exponential, double gamma and double generalized Pareto (GP) distributions. More specifically, the gradient G can be modeled or fitted as

G ~ Distribution(θ),   (3)

where Distribution(θ) is one of the three SIDs with parameters indicated by the vector θ, which generally depend on the iteration and the worker's data. Also, the probability density function (PDF) of G is f_G(g; θ), which is symmetric around zero. [0041] Because the gradients G are compressible as indicated by equation (1), they can be well approximated by sparse vectors with minimal error, as implied by equation (2). Hence, the distributions that promote sparsity are good candidates for fitting (or modeling) the gradient vector G. For instance, the double exponential, double gamma, double GP, and Bernoulli-Gaussian distributions have been used as priors that promote sparsity in [11, 12, 14]. The gradients resulting from the training of ResNet-20 with SGD have been considered and they are fitted by the three proposed SIDs, i.e., double exponential, double gamma, and double GP distributions. Figures 3A to 3D show the empirical distribution of the gradients and their absolutes, without the EC mechanism, along with the distributions of the three fitted SIDs for two different iterations. In this regard, Figure 3A shows that the three proposed distributions can capture the main statistical characteristic of the gradients, as their PDFs approximate the empirical distribution for most of the gradient domain. This is so because of the compressibility of the gradients illustrated before. The compressibility of r.v.s distributed according to one of the SIDs can be attributed to the shape of their PDFs, where the most probable values are those with small amplitudes. From Figures 3A and 3C it can be seen that the gradients at iteration 10,000 (Figure 3C) are more sparse than those at iteration 100 (Figure 3A), where the PDF at iteration 10,000 has higher values at smaller gradient values and it has a faster-decaying tail. Regarding the cumulative distribution function (CDF) of the absolute value of the gradients shown in Figures 3B and 3D, it can be seen that the SIDs approximate well the empirical CDF. However, at the tail of the distribution, they tend to slightly overestimate/underestimate the CDF. The reason for this is that the fitting is biased toward the majority of the data with lower values, as the gradient vector is sparse.
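For intuition about such fits (illustrative only, not the patent's fitting code): for a zero-location double exponential model the MLE of the scale is simply the mean absolute gradient, and the quality of the fit at the tail can be checked by comparing empirical and model exceedance probabilities. The synthetic gradient below is an assumption for runnability.

```python
import numpy as np


def fit_double_exponential(grad: np.ndarray) -> float:
    """MLE of the Laplace (double exponential) scale for a zero-centered gradient vector."""
    return float(np.mean(np.abs(grad)))


def tail_check(grad: np.ndarray, beta: float, threshold: float):
    """Compare empirical vs. model probability that |g| exceeds a threshold."""
    empirical = float(np.mean(np.abs(grad) > threshold))
    model = float(np.exp(-threshold / beta))     # P(|G| > t) under Laplace(0, beta)
    return empirical, model


g = np.random.laplace(scale=5e-4, size=200_000)  # stand-in for a real flattened gradient
beta_hat = fit_double_exponential(g)
print(tail_check(g, beta_hat, threshold=4 * beta_hat))
```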
[0042] Using the statistical properties noted above, two threshold estimators are now discussed, the single-stage and the multiple-stage threshold estimators. The threshold that yields the target compression ratio δ = k/d is derived for each of the three proposed SIDs. Then, the single-stage thresholding scheme is discussed for moderate compression ratios and the multi-stage thresholding scheme is discussed for aggressive compression ratios with δ « 1, e.g., δ ≤ 0.001. The sparsification threshold can be computed from the fitted distribution of the gradients as follows. For G ~ Distribution(θ) with CDF F_G(g; θ), the threshold η that yields the Topk vector with the average target compression ratio δ = k/d can be derived as:

η(δ) = F⁻¹_|G|(1 − δ; θ̂)   (4)
     = F⁻¹_G(1 − δ/2; θ̂),   (5)

where θ̂ is the estimated parameter vector for the gradient distribution, F_|G|(g; θ̂) is the CDF of the absolute gradient, F⁻¹_|G|(p; θ̂) = {g: F_|G|(g; θ̂) = p} is the inverse CDF of the absolute gradient at probability p, and F⁻¹_G(p; θ̂) is the inverse CDF of the gradient, also known as the quantile function or percent-point function (PPF).
[0043] The threshold calculation for the three SIDs mentioned above is now discussed. For the double exponentially distributed gradients with a scale parameter β and location zero (symmetric around zero), i.e., G ~ Laplace(β), the threshold η that achieves the compression ratio δ can be computed as:

η(δ) = β̂·ln(1/δ),   (6)

where β̂ is the maximum likelihood estimate (MLE) of the scale parameter. [0044] For the gradients that can be well-fitted by the double gamma distribution with a shape parameter α ≤ 1, the absolute of the gradient is gamma distributed, i.e., |G| ~ gamma(α, β). The sparsifying threshold η can be derived as:

η(δ) = β̂·P⁻¹(α̂, 1 − δ),   (7)

where α̂ is the shape estimate obtained from the (approximate) maximum likelihood fit of the gamma distribution (8), an expression that involves log(Γ(α)) and the statistic s = log(μ̂) − (1/d) Σ_j log|g_j|; β̂ is the corresponding scale estimate; P(α, x) = γ(α, x)/Γ(α) is the regularized lower incomplete gamma function; P⁻¹(α, p) = {x: P(α, x) = p} is the inverse of the regularized lower incomplete gamma function; and μ̂ and σ̂² are the sample mean and variance for the absolute gradient vector |g|, respectively.
[0045] The threshold calculation for the generalized Pareto distributed gradients is now discussed. For gradients distributed as double generalized Pareto r.v.s, the absolute of the gradients is modeled as GP distributed r.v.s |G| ~ GP(α, β, a), where 0 < α < 1/2, β, and a = 0 are the shape, scale, and location parameters. The sparsifying threshold η that achieves a compression ratio δ is given by:

η(δ) = (β̂/α̂)·(δ^(−α̂) − 1),   (9)

where the shape and scale parameters are estimated from the sample moments as

α̂ = (1/2)·(1 − μ̂²/σ̂²),   (10)
β̂ = (1/2)·μ̂·(μ̂²/σ̂² + 1),   (11)

with μ̂ and σ̂² being the sample mean and variance for the absolute gradient vector |g|, respectively.
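The three single-stage thresholds can be sketched as follows. The exact expressions follow the reconstructed equations (6), (7) and (9)-(11) above, so treat them as an interpretation rather than a verbatim transcription of the patent; scipy.special.gammaincinv is used as the inverse regularized lower incomplete gamma function P⁻¹, and the gamma shape/scale estimates are passed in rather than fitted here.

```python
import numpy as np
from scipy.special import gammaincinv  # inverse of the regularized lower incomplete gamma


def threshold_double_exponential(abs_g: np.ndarray, delta: float) -> float:
    beta = abs_g.mean()                          # MLE of the Laplace scale (location 0)
    return float(beta * np.log(1.0 / delta))     # eq. (6)


def threshold_double_gamma(alpha: float, beta: float, delta: float) -> float:
    # eq. (7): eta = beta * P^{-1}(alpha, 1 - delta), with alpha, beta fitted to |g| elsewhere
    return float(beta * gammaincinv(alpha, 1.0 - delta))


def threshold_double_gpd(abs_g: np.ndarray, delta: float) -> float:
    # Method-of-moments estimates; assumes a heavy tail so that mean^2 < variance
    mu, var = abs_g.mean(), abs_g.var()
    alpha = 0.5 * (1.0 - mu**2 / var)            # eq. (10)
    beta = 0.5 * mu * (mu**2 / var + 1.0)        # eq. (11)
    return float((beta / alpha) * (delta ** (-alpha) - 1.0))  # eq. (9)


abs_g = np.abs(np.random.standard_t(df=4, size=500_000)) * 1e-3   # synthetic stand-in
print(threshold_double_exponential(abs_g, delta=0.01))
print(threshold_double_gpd(abs_g, delta=0.01))
```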
[0046] After computing the threshold η for the three SIDs discussed above, the single-stage approach calculates the compressed gradient vector as

ĝ_j = g_j · 1[|g_j| ≥ η(δ)]  for each j ∈ {1, 2, …, d},

where the vector ĝ ∈ ℝ^d is the compressed gradient vector and 1[condition] is an indicator function that equals one when the condition is satisfied and zero otherwise. In the following, the vector that contains only the exceedance non-zero gradients (i.e., the gradients that are larger than the threshold η) is denoted by ḡ and the number of its components is k.
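A minimal sketch of this single-stage compressor (illustrative only): zero out the entries below the estimated threshold and ship only the surviving values with their indices; the receiver scatters them back into a dense vector.

```python
import torch


def threshold_compress(grad: torch.Tensor, eta: float):
    """Keep only gradient entries with |g_j| >= eta (linear-time selection)."""
    mask = grad.abs() >= eta
    indices = mask.nonzero(as_tuple=True)[0]
    values = grad[indices]
    return values, indices              # exceedance gradients and their positions


def threshold_decompress(values: torch.Tensor, indices: torch.Tensor, d: int) -> torch.Tensor:
    out = torch.zeros(d, device=values.device, dtype=values.dtype)
    out[indices] = values
    return out
```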
[0047] The target compression ratio δ can be as low as 10⁻⁴. Therefore, in order to accurately estimate the threshold η, the fitted distribution should tightly resemble the gradient distribution at the tail. This is quite challenging because the estimation of the parameters tends to account more for the majority of the data at the expense of the tail. Hence, the threshold η obtained from the single-stage fitting discussed above is accurate only up to some moderate compression ratios.
[0048] To overcome this problem, a multi-stage fitting approach is proposed in this embodiment. For convenience, a two-stage approach is first discussed. The gradients are fitted with one of the three SIDs and compressed using the proposed procedure discussed above with a first threshold η_1 computed to yield a first compression ratio δ_1 = k_1/d > δ. Then, the calculated vector of the exceedance gradients ĝ is used to fit another distribution, defined precisely below. Then, a second threshold η_2 is computed to achieve a second compression ratio δ_2 = k/k_1 with respect to the exceedance gradients ĝ. The second compression ratio δ_2 is chosen such that the overall compression ratio of the original gradient vector is the target ratio δ, i.e., δ_2 = δ/δ_1. Then, the estimated threshold from the last stage is applied to compress the newly compressed gradient. This procedure can be extended to multiple stages such that δ = Π_{m=1}^{M} δ_m, where M is the number of stages.
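A minimal two-stage sketch for the exponential case is shown below: a loose first threshold keeps a fraction δ_1 of the elements, the shifted exceedances are re-fitted, and the second stage targets δ_2 = δ/δ_1 so the overall ratio is δ. The default delta_1 = 0.25 matches the value used later in the evaluation; the function name and everything else are illustrative assumptions.

```python
import numpy as np

def two_stage_threshold_exp(abs_g, delta, delta_1=0.25):
    # Stage 1: exponential fit of |g| at a loose ratio delta_1 > delta.
    beta_1 = abs_g.mean()
    eta_1 = beta_1 * np.log(1.0 / delta_1)
    exceed = abs_g[abs_g >= eta_1]               # peak-over-threshold data (k_1 values)
    # Stage 2: the shifted exceedances are again exponential; target delta_2 = delta / delta_1.
    delta_2 = delta / delta_1
    beta_2 = (exceed - eta_1).mean()
    eta_2 = eta_1 + beta_2 * np.log(1.0 / delta_2)
    return eta_2                                 # overall kept fraction is approximately delta
```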
[0049] Because the method above replaces the original gradients with the exceedance gradients, which are also known as Peaks over Threshold (PoT), the question of whether the exceedance gradients have the same distribution as the original gradients before the compression needs to be addressed. The extreme value theory in statistics can provide an answer to this question. Let k_m be the number of exceedance gradients after the mth thresholding stage. Then, if a threshold operator with threshold η is applied on a sequence of r.v.s |G_1|, |G_2|, ..., |G_d|, the distribution of the exceedance r.v., i.e., |G_j| − η given |G_j| > η, can be approximated by a GP distribution for a large enough threshold and vector dimension, irrespective of the original distribution of the gradients. This means that the exceedance step does not impair the accuracy of this method.
[0050] Next, using the extreme value theory, it is possible to compute the threshold for the multi-stage approach. Considering that, for the mth thresholding stage with m ≥ 2, the absolute of the exceedance gradients |G_m| can be modeled as GP distributed r.v.s, |G_m| ~ GP(a_m, β_m, η_{m-1}), where a_m, β_m, and η_{m-1} are the shape, scale, and location parameters, the threshold that achieves a compression ratio δ_m is obtained as:

η_m(δ_m) = η_{m-1} + (β̂_m/â_m)(δ_m^{−â_m} − 1),   (12)

with

â_m = (1/2)(1 − μ̂²/σ̂²),  β̂_m = (μ̂/2)(μ̂²/σ̂² + 1),   (13)

where η_{m-1} is the threshold computed at the previous stage and μ̂ and σ̂² are the sample mean and variance of |g_m| − η_{m-1}, respectively.
[0051] If the absolute of the gradients is modeled as exponentially distributed r.v.s |G_m| ~ Exp(β_m), the distribution of the exceedance gradients over the threshold η_{m-1}, after proper shifting, is still exponentially distributed, i.e., |G_m| − η_{m-1} ~ Exp(β_m). The new stage threshold is then:

η_m(δ_m) = η_{m-1} + β̂_m log(1/δ_m),  with β̂_m = (1/k_{m-1}) Σ_j (|g_j| − η_{m-1}),

where g_j is the jth element of the vector g_m.
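The two per-stage estimators can be expressed as small helper functions, as sketched below; the GP version follows equations (12)-(13) with moment matching on the shifted exceedances, and the exponential version uses the shifted-mean MLE. Both are reconstructions under the stated assumptions, and the function names are illustrative.

```python
import numpy as np

def gp_stage_threshold(exceed, eta_prev, delta_m):
    shifted = exceed - eta_prev                   # |g_m| - eta_{m-1}
    mu, var = shifted.mean(), shifted.var()
    a_hat = 0.5 * (1.0 - mu * mu / var)           # shape, as in equation (13)
    beta_hat = 0.5 * mu * (mu * mu / var + 1.0)   # scale, as in equation (13)
    return eta_prev + (beta_hat / a_hat) * (delta_m ** (-a_hat) - 1.0)  # equation (12)

def exp_stage_threshold(exceed, eta_prev, delta_m):
    beta_hat = (exceed - eta_prev).mean()         # MLE of the shifted-exponential scale
    return eta_prev + beta_hat * np.log(1.0 / delta_m)
```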
[0052] These equations are used such that, when the absolute of the gradients is fitted by an exponential distribution in the first stage, the latter stages for the exceedance gradients are also fitted by exponential distributions, i.e., multi-stage exponential. On the other hand, for the gamma-fitted absolute gradients in the first stage, the latter stages are fitted by a GP distribution (based on equations (12) and (13)), i.e., gamma-GP. Finally, for GP distributed absolute gradients in the first stage, the GP is still used for the PoT data (based on equations (12) and (13)), i.e., multi-stage GP.

[0053] The SIDCo algorithm leverages the SIDs discussed above to obtain a threshold via the multi-stage threshold estimator. For this approach, the number of stages M is selected via an adaptive algorithm such that the estimation error, averaged over Q iterations, is bounded below a predefined error tolerance, i.e., |δ̂ − δ| ≤ ε δ, with 0 < ε < 1, where δ̂ is the achieved compression ratio.
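One possible form of such an adaptation rule is sketched below: every Q iterations the average achieved ratio is compared with the target, and M is changed only when the relative error leaves the allowed band. The direction of the adjustment and the names eps_h and eps_l are assumptions; the adjustment actually disclosed is specified by the pseudo-code in Figures 5A and 5B.

```python
def adapt_stages(num_stages, achieved_ratios, target_delta, eps_h=0.2, eps_l=0.2):
    avg = sum(achieved_ratios) / len(achieved_ratios)   # average ratio over the last Q steps
    if avg > (1.0 + eps_h) * target_delta:
        num_stages += 1                                 # too many elements kept: refine the tail fit
    elif avg < (1.0 - eps_l) * target_delta:
        num_stages = max(1, num_stages - 1)             # too few elements kept: relax the estimate
    return num_stages
```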
[0054] The SIDCo algorithm, which is schematically illustrated in Figures 4A and 4B together with the corresponding neural network system 400, and whose full pseudo-code is shown in Figures 5A and 5B, is now discussed. Figures 4A and 4B show a neural network/computing system 400 that includes N workers, denoted 410-n, where the index n is an integer that varies between 1 and N. Only two workers or nodes, 410-1 and 410-n, are shown in the figure for simplicity. A training node may be a GPU, a CPU, a computing device, etc. Each training node 410-n receives input data 412 for training and performs a forward-backward pass 414/416 on different batches sampled from the training dataset 412 with the same network model. At the end of the ith iteration (Figures 4A and 4B show all the workers 410-n simultaneously and independently performing the same ith iteration), each worker runs a local version of the SGD algorithm to produce the corresponding gradients 418, which are then reshaped by the gradient reshaping module 420 to generate the gradient vector g_i ∈ ℝ^d 422. Note that Figures 5A and 5B, lines 20-26, show how to calculate the number of stages M. Then, in step 424, the processor of each worker estimates the parameters of the selected SID to effectively fit the gradient vector to the selected SID. Then, in step 426, the threshold η for the selected SID is calculated (see lines 13-19 in Figures 5A and 5B), and the gradient vector is sparsified (see lines 2-12 in Figures 5A and 5B) in step 428 by applying the compression operator with threshold η to obtain the compressed gradient 430, g̃_i = g_i · 1{|g_i| ≥ η}; i.e., the values below the threshold are set to zero by the indicator operator 1{|g_i| ≥ η} and all other values are maintained. The compressed gradient g̃ 430 from each worker is then supplied to an all-gather module 432.
[0055] Note that, for each stage, the function Thresh_Estimation shown in line 13 in Figures 5A and 5B uses the chosen SID to obtain a corresponding threshold. The algorithm dynamically adapts the number of stages M by monitoring the quality of its estimated selection of elements and adjusting M using the function Adapt_Stages noted in line 20 in Figure 5B. The algorithm in Figure 5A starts by calling the sparsify function, which takes the gradient vector 422 and the target ratio as the parameters. Then, the algorithm applies a multi-stage estimation loop of M iterations. In each iteration, the compressed gradient vector g̃ 430 is partially sparsified with the previously estimated threshold obtained from the previous stage m - 1. Then, given the ratio δ_m at loop step m, the chosen SID distribution fitting is invoked via the function Thresh_Estimation to obtain a new threshold. At the last stage (i.e., step M of the loop), the resulting estimation threshold should approximate the threshold that would obtain the target ratio δ of the input vector. Then, the estimated threshold is used to sparsify the full gradient vector and obtain the values and their corresponding indices. For each invocation of the algorithm in each training iteration, the algorithm maintains statistics, such as the average ratio of the quality of its estimations over the past Q training steps. Then, at the end of every Q training steps, the algorithm invokes the Adapt_Stages function (see line 20 in Figure 5B), which adjusts the current number of stages M based on user-defined allowable error bounds of the estimation (i.e., ε_H and ε_L). After the adjustment, the next invocation of the algorithm will use the new number of stages M. The number of stages is adjusted only if the obtained ratio is not within the error bounds. Figure 6 illustrates these steps of the fitting refinement 610, in which the threshold η is adjusted at each iteration to arrive at the desired target ratio δ.
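The following sketch ties the pieces together into a SIDCo-style sparsify call for the multi-stage exponential variant, reusing the laplace_threshold and exp_stage_threshold helpers sketched earlier (assumed names). Tensors are moved to NumPy for the fitting step, and the per-stage ratios are chosen so that their product equals the target ratio; this is a simplified stand-in for the pseudo-code of Figures 5A and 5B, not the disclosed implementation.

```python
import numpy as np
import torch

def sidco_sparsify(grad: torch.Tensor, delta: float, num_stages: int, delta_1: float = 0.25):
    abs_g = grad.abs().cpu().numpy()
    if num_stages == 1:
        ratios = [delta]
    else:  # delta_1 * (delta / delta_1) = delta overall
        ratios = [delta_1] + [(delta / delta_1) ** (1.0 / (num_stages - 1))] * (num_stages - 1)
    eta, data = 0.0, abs_g
    for m, delta_m in enumerate(ratios):
        if m == 0:
            eta = laplace_threshold(data, delta_m)         # first-stage exponential fit
        else:
            eta = exp_stage_threshold(data, eta, delta_m)  # refit the peak-over-threshold data
        data = abs_g[abs_g >= eta]                         # exceedances for the next stage
    indices = torch.nonzero(grad.abs() >= eta).flatten()
    return grad[indices], indices                          # sparse payload: values and indices
```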
[0056] A simplified example for compressing the gradient vector is illustrated in Figure 7. Suppose that the gradient vector g 422 has the 10 values 710 illustrated in Figure 7. The Adapt_Stages function from Figures 5A and 5B has calculated that M = 2, and the target compression ratio δ is 0.1. The SID was selected and, thus, the selected SID is fitted on the gradient vector to obtain the threshold η_1 = 11, with an associated compression ratio δ_1 = 0.2 for the first stage. After applying the compression operator, only two values (PoT) 720 of the gradient vector g 422 are larger than the threshold η_1 = 11. The two PoT values form the compressed gradient vector, which is fitted again with the selected SID, and a new threshold η_2 = 20 is calculated with an associated compression ratio δ_2 = 0.5. After applying the compression operator again, only one value 730 survives, which corresponds to the desired compression ratio δ = 0.1. In other words, the method computes, for each stage, the threshold from the peak-over-threshold data of the previous stage (i.e., the gradient elements that have an absolute value larger than the threshold). After finishing all the stages, the final threshold is used to compress and send the vectors. For M stages, this process is repeated, ensuring that the target compression ratio is the product of all the per-stage compression ratios.
[0057] Returning to Figures 4A and 4B, the all-gather module 432 is configured to collect the compressed and sparsified gradient vectors from each worker and to provide this information to all the workers in step 434. The averaged gradient 436 from all the workers is used in step 438 to update the model x^{(i+1)} ∈ ℝ^d used by each worker, and then the steps discussed above are repeated for the next iteration i+1.
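A hedged sketch of this exchange and update step is given below, using Horovod's allgather as in the evaluation setup; the flattened-parameter layout, the absence of error feedback, and the helper name are simplifying assumptions.

```python
import torch
import horovod.torch as hvd  # assumes hvd.init() was called during setup

def exchange_and_update(param, values, indices, d, lr):
    all_vals = hvd.allgather(values)             # kept values from all N workers, concatenated
    all_idx = hvd.allgather(indices)             # matching indices from all N workers
    dense = torch.zeros(d, device=param.device)
    dense.index_add_(0, all_idx, all_vals)       # sum the contributions at each index
    dense /= hvd.size()                          # average gradient across the workers
    param.data -= lr * dense                     # x_(i+1) = x_i - lr * averaged gradient
    return param
```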
[0058] The above approach has been tested with CNN and RNN models on real datasets and against various compressors. The tests were performed on 8 server machines equipped with dual 2.6 GHz 16-core Intel Xeon Silver 4112 CPUs, 512 GB of RAM, and 10 Gbps NICs. Each machine has an NVIDIA V100 GPU with 16 GB of memory. PyTorch 1.1.0 with CUDA 10.2 was used as the ML toolkit, and Horovod 0.16.4 configured with OpenMPI 4.0.0 was used for collective communication.
[0059] For the benchmarks and hyper-parameters, various models have been used, for example, the LSTM model on the PTB dataset with the Nesterov-Momentum-SGD local optimizer, the LSTM model with the AN4 dataset, the ResNet20 model with the CIFAR-10 dataset, and the ResNet-50 model with the ImageNet dataset. All these models, datasets and optimizers are known in the literature, and thus, their description is omitted herein. The CNN and RNN models were used for image classification and language modeling tasks, respectively. Various compression ratios (δ) of 0.1 (10%), 0.01 (1%) and 0.001 (0.1%) were used to span a wide range of the trade-off between compression and accuracy, similar to prior work. The SIDCo algorithm was compared to Top_k, DGC, RedSync and GaussianKSGD. The EC mechanism is employed to further enhance the convergence of SGD with compressed gradients. For the SIDCo method, the compression ratio δ_1 = 0.25, ε = 20%, and Q = 5 iterations were selected to adapt the stages as in the algorithm illustrated in Figures 5A and 5B. For conciseness, only the performance of the SIDCo method with the double exponential fitting (shown in the figures as "SIDCo-E") is presented.
[0060] The performance of a given scheme (i.e., SIDCo, Top_k, DGC, RedSync or GaussianKSGD) is quantified via the following metrics:
[0061] Normalized Training Speed-up: the model quality is evaluated at iteration T (the end of training) and divided by the time taken to complete T iterations. This quantity is normalized by the same measurement calculated for the baseline case, giving the normalized training speed-up relative to the baseline.

[0062] Normalized Average Training Throughput: the average throughput normalized by the baseline's throughput, which illustrates the speed-up from compression irrespective of its impact on the model quality.

[0063] Estimation Quality: the achieved compression ratio averaged over the training divided by the target ratio δ = k/d, along with the 90% confidence interval as error bars.
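For clarity, these metrics can be computed as in the short sketch below, assuming per-run records of final model quality, wall-clock training time, and the achieved compression ratios; the record layout and function names are illustrative.

```python
def normalized_speedup(quality_T, time_T, base_quality_T, base_time_T):
    # Model quality at iteration T divided by time to reach T, relative to the baseline.
    return (quality_T / time_T) / (base_quality_T / base_time_T)

def estimation_quality(achieved_ratios, target_delta):
    # Average achieved compression ratio divided by the target ratio (ideally 1.0).
    return (sum(achieved_ratios) / len(achieved_ratios)) / target_delta
```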
[0064] The RNN-LSTM on PTB has been tested first. This benchmark has the highest communication overhead. In Figure 8A, the SIDCo method shows a significant speed-up over the no-compression method, by about 41.7 times, and improves over Top_k and DGC by up to about 7.6 and about 1.9 times, respectively. At a high compression ratio of 0.001, both the RedSync and GaussianKSGD compression methods do not converge to the target loss and test perplexity, as shown in Figure 9A, and therefore they attain zero speed-ups. Figure 8B shows that the threshold estimation schemes, including SIDCo, have the highest training throughput. In Figure 8C, DGC and SIDCo are the only methods that accurately estimate the target ratio with high confidence. However, for GaussianKSGD at a ratio of 0.001 and RedSync at ratios of 0.01 and 0.001, the number of selected elements is two orders of magnitude lower than the target. Moreover, over the training process, the estimation quality of RedSync has a high variance, harming its convergence. Figure 9B shows that, at a target ratio of 0.001, RedSync causes significant fluctuation in the compression ratio and the training does not converge. GaussianKSGD results in a very low compression ratio, which is close to 0 and far from the target, leading to significantly higher loss (and test perplexity) values compared to the target values.

[0065] When the RNN-LSTM is run on the AN4 data, Figure 10A shows that SIDCo achieves higher gains compared to the other compressors, by up to about 2.1 times, for ratios of 0.1 and 0.01. Notably, at a ratio of 0.001, only SIDCo achieved the target character error rate (CER). Thus, the other compressors were run for 250 epochs to achieve the target CER (instead of the default 150), except for GaussianKSGD, which does not converge. The gains of the SIDCo method over the other compressors thereby increase by up to about 4 times. The reason could be that the model is more sensitive to compression (especially in the initial training phase). The SIDCo method starts as a single stage before performing stage adaptations, leading to a slight over-estimation of k, so more gradient elements are sent during the training start-up. Throughput-wise, Figure 10B shows that the threshold-estimation methods, including the SIDCo method, enjoy high training throughput, explaining the gains over the baseline. Similar to the LSTM-PTB results, Figure 10C shows that, on average and with low variance, SIDCo closely matches the estimated ratios of DGC, while the other estimation methods have poor estimation quality. Similar to the PTB case, Figures 11A and 11B show that, at a target ratio of 0.001, RedSync causes significant fluctuation in the compression ratio and the GaussianKSGD method results in a very low compression ratio (close to 0), which is far from the target. This leads both methods to achieve significantly higher loss (or test perplexity) values compared to the target loss (or test perplexity) values.
[0066] For CNN networks applied to CIFAR-10 data with the ResNet20 model, all compressors achieve somewhat comparable and modest speed-ups over the no-compression baseline (except at a ratio of 0.001, where accuracy is degraded and hence the speed-up is lower than the baseline). This is not surprising because ResNet20 is not network-bound. However, for the larger VGG16 model, the SIDCo method achieves significant speed-ups over the no-compression method, Top_k and DGC, by up to about 5, 1.5, and 1.2 times, respectively. Unlike the other estimation schemes, the SIDCo method can accurately achieve the target ratio.
[0067] When the ResNet50 and VGG19 models were considered for the ImageNet data, a time limit of 5 hours per run was set to reduce the costs. For calculating the speed-up, the top-1 accuracy achieved by the different methods at the end of training was compared. First, for the ResNet50 benchmark, the compression ratios of 0.1, 0.01, and 0.001 were used. It was found that the SIDCo method achieves the highest accuracy, which is higher than the baseline, Top_k and DGC by about 15, 3, and 2 accuracy points, i.e., normalized accuracy gains of about 40%, 5%, and 4%, respectively. Further, the inventors found that the SIDCo method attains the highest throughput among all methods (except for RedSync at 0.1 compression). The results indicate that, unlike GaussianKSGD and RedSync, which both result in an estimation quality far from the target with high variance, the SIDCo method estimates the threshold with very high quality for all ratios. Similar trends are observed for the VGG19 benchmark, where a compression ratio of 0.001 was used. The results also indicate that the SIDCo method estimates the threshold with high quality and achieves the highest top-1 accuracy and training throughput among all methods. The accuracy gains compared to the baseline, Top_k, and DGC methods are about 34, 2.9, and 1.13 times, respectively.
[0068] These results indicate that the SIDCo method goes beyond existing methods that estimate a threshold for the Top_k sparsification. These methods either do not leverage the statistical properties of the gradients (DGC) or assume a Gaussian distribution without a thorough study of the gradients (e.g., RedSync, GaussianKSGD). On a GPU, the SIDCo method improves over DGC by at least 2 times, and the speed-ups are significantly larger on the CPU. As a threshold estimation method, SIDCo benefits not only from the throughput gains of the threshold methods, but also from the high quality of its threshold estimation. The results discussed above indicate that the existing estimation methods (e.g., RedSync and GaussianKSGD) fail to achieve consistent threshold estimation behavior even though they may provide throughput gains. Their throughput gains, in many cases, are due to severe under-estimation of the target ratio, which results in lower volumes of sent data compared to other compressors.
[0069] Thus, the novel SIDCo method solves a practical problem in distributed deep learning. Compressors other than threshold-based ones have high computational costs, whereas the existing threshold-estimation methods fail to achieve their targets. To address these issues, the novel SIDCo threshold-based compressor is introduced, which imposes a sparsity prior on the gradients.
[0070] A statistical-based gradient compression method is now discussed with regard to Figure 12. The method includes a step 1200 of receiving input data 412 at plural nodes 410-n of a neural network 400, a step 1202 of running the input data 412 forward and backward through the plural nodes 410-n to generate node gradient vectors 422, a step 1204 of fitting a sparsity-inducing distribution, SID, at each node 410-n, to the corresponding node gradient vector 422, a step 1206 of calculating a first threshold η_1, based on the SID, for the corresponding node gradient vector 422, a step 1208 of compressing the corresponding node gradient vector 422 to obtain a first compressed gradient vector 430, by setting to zero those components that are smaller than the first threshold η_1, and a step 1210 of transmitting a compressed gradient vector, which is related to the first compressed gradient vector 430, to all other nodes 410-n for updating a corresponding model.
[0071] The method may further include selecting the SID from a double exponential distribution, a double gamma distribution, and a double generalized Pareto distribution, and/or fitting the SID or another SID, at each node, to the first compressed gradient vector, and/or calculating a second threshold η_2, based on the SID or another SID, for the first compressed gradient vector, compressing the first compressed gradient vector to obtain a second compressed gradient vector, by setting the components that are smaller than the second threshold η_2 to zero, and transmitting the second compressed gradient vector, i.e., the non-zero components of the second compressed vector and their indices, to all other nodes for updating a corresponding model. For each threshold, a target compression ratio is used to calculate the threshold. The product of the per-stage target compression ratios is equal to the overall target compression ratio. The method may further include calculating a number of stages for which to repeat the steps of fitting, calculating and compressing before the step of transmitting. In one application, the input data is training data for the neural network. The steps of fitting, calculating and compressing are run independently and simultaneously on the plural nodes. In this or another application, the step of compressing has a target compression ratio, and the threshold is calculated based on the target compression ratio.
[0072] The above-discussed procedures and methods may be implemented in a computing device as illustrated in Figure 13. Hardware, firmware, software or a combination thereof may be used to perform the various steps and operations described herein. Computing device 1300 of Figure 13 is suitable for performing the activities described above with regard to a node 410-n, and may include a server 1301. Such a server 1301 may include a central processor (CPU) 1302 coupled to a random access memory (RAM) 1304 and to a read-only memory (ROM) 1306. ROM 1306 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 1302 may communicate with other internal and external components through input/output (I/O) circuitry 1308 and bussing 1310 to provide control signals and the like. Processor 1302 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
[0073] Server 1301 may also include one or more data storage devices, including hard drives 1312, CD-ROM drives 1314 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1316, a USB storage device 1318 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1314, disk drive 1312, etc. Server 1301 may be coupled to a display 1320, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 1322 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
[0074] Server 1301 may be coupled to other devices, such as a database of data that needs to be analyzed, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1328, which allows ultimate connection to various landline and/or mobile computing devices. [0075] The disclosed embodiments provide a statistical-based gradient compression method and system for distributed training systems. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
[0076] Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein. [0077] This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.
References
The entire content of all the publications listed herein is incorporated by reference in this patent application.
[1] Dutta, A., Bergou, E. H., Abdelmoniem, A. M., Ho, C.-Y., Sahu, A. N., Canini, M., and Kalnis, P., On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning. In AAAI, 2020.
[2] Fang, J., Fu, H., Yang, G., and Hsieh, C.-J., RedSync: Reducing synchronization bandwidth for distributed deep learning training system. Journal of Parallel and Distributed Computing, 133, 2019.
[3] Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, W., Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR, 2018.
[4] Wangni, J., Wang, J., Liu, J., and Zhang, T., Gradient Sparsification for Communication-Efficient Distributed Optimization. In NeurIPS, 2018.
[5] Xu, H., Ho, C.-Y., Abdelmoniem, A. M., Dutta, A., Bergou, E. H., Karatsenidis, K., Canini, M., and Kalnis, P., Compressed communication for distributed deep learning: Survey and quantitative evaluation. Technical report, KAUST, 2020. http://hdl.handle.net/10754/662495.
[6] Abdelmoniem, A. M. and Canini, M., DC2: Delay-aware Compression Control for Distributed Machine Learning. In INFOCOM, 2021.
[7] Jiang, J., Fu, F., Yang, T., and Cui, B., SketchML: Accelerating Distributed Machine Learning with Data Sketches. In SIGMOD, 2018.
[8] Shanbhag, A., Pirk, H., and Madden, S., Efficient Top-K Query Processing on Massively Parallel Hardware. In SIGMOD, 2018.
[9] Shi, S., Chu, X., Cheung, K. C., and See, S., Understanding Top-k Sparsification in Distributed Deep Learning. arXiv:1911.08772, 2019.
[10] Armagan, A., Dunson, D. B., and Lee, J., Generalized double Pareto shrinkage. Statistica Sinica, 23(1), 2013.
[11] Babacan, S. D., Molina, R., and Katsaggelos, A. K., Bayesian Compressive Sensing Using Laplace Priors. IEEE Transactions on Image Processing, 19(1), 2010.
[12] Monga, V., Mousavi, H. S., and Srinivas, U., Sparsity Constrained Estimation in Image Processing and Computer Vision. In Handbook of Convex Optimization Methods in Imaging Science, pp. 177-206. Springer International Publishing, 2018.
[13] Elzanaty, A., Giorgetti, A., and Chiani, M., Limits on Sparse Data Acquisition: RIC Analysis of Finite Gaussian Matrices. IEEE Transactions on Information Theory, 65(3), 2019.
[14] Elzanaty, A., Giorgetti, A., and Chiani, M., Lossy Compression of Noisy Sparse Sources Based on Syndrome Encoding. IEEE Transactions on Communications, 67(10), 2019.
[15] Mallat, S., A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 2009.

Claims

WHAT IS CLAIMED IS:
1. A statistical-based gradient compression method comprising: receiving (1200) input data (412) at plural nodes (410-n) of a neural network (400); running (1202) the input data (412) forward and backward through the plural nodes (410-n) to generate node gradient vectors (422); fitting (1204) a sparsity-inducing distribution, SID, at each node (410-n), to a corresponding node gradient vector (422); calculating (1206) a first threshold η_1, based on the SID, for the corresponding node gradient vector (422); compressing (1208) the corresponding node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η_1; and transmitting (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model, wherein the compressed gradient vector is sparse as only some components are non-zero.
2. The method of Claim 1, further comprising: selecting the SID from a double exponential distribution, a double gamma distribution, and a double generalized Pareto distribution.
3. The method of Claim 1, further comprising: fitting the SID or another SID, at each node (410-n), to the first compressed gradient vector (430).
4. The method of Claim 3, further comprising: calculating a second threshold η_2, based on a distribution of new peak-over-threshold data using extreme value theory, for the first compressed gradient vector (430); further compressing the first compressed gradient vector (430) to obtain a second compressed gradient vector, by making zero those components that are smaller than the second threshold η_2; repeating this process for several stages to obtain a final threshold and the corresponding compressed gradients; and transmitting the final compressed gradient vector to all other nodes (410-n) for updating the corresponding model.
5. The method of Claim 4, wherein for each threshold, a target compression ratio is used to calculate the threshold.
6. The method of Claim 5, wherein a product of the target compression ratios for each stage is equal to an overall target compression ratio.
7. The method of Claim 3, further comprising: calculating a number of stages for which to repeat the steps of fitting.
8. The method of Claim 1, wherein the input data is training data for the neural network.
9. The method of Claim 1, wherein the steps of fitting, calculating and compressing are run independently and simultaneously on each of the plural nodes.
10. The method of Claim 1, wherein the step of compressing has a target compression ratio, and the threshold is calculated based on the target compression ratio and the SID.
11. A neural network system (400) that uses a statistical-based gradient compression method, the system (400) comprising: plural nodes (410-n), each configured to receive (1200) input data (412); and each node (410-n) including a processor (1302) configured to: run (1202) the input data (412) forward and backward to generate a node gradient vector (422); fit (1204) a sparsity-inducing distribution, SID, to the node gradient vector (422); calculate (1206) a first threshold η_1, based on the SID, for the node gradient vector (422); compress (1208) the node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η_1; and transmit (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model, wherein the compressed gradient vector is sparse as only some components are non-zero.
12. The system of Claim 11 , wherein the node is configured to: select the SID from a double exponential distribution, a double gamma distribution and a double generalized Pareto distribution.
13. The system of Claim 11 , wherein the processor is configured to: fit the SID or another SID to the first compressed gradient vector (430).
14. The system of Claim 13, wherein the processor is further configured to: calculate a second threshold η_2, based on extreme value theory, for the first compressed gradient vector (430); compress the first compressed gradient vector (430) to obtain a second compressed gradient vector, by making zero those components of the first compressed gradient vector that are smaller than the second threshold η_2; and transmit the second compressed gradient vector to all other nodes (410-n) for updating the corresponding model.
15. The system of Claim 14, wherein for each threshold, a target compression ratio is used to calculate the threshold.
16. The system of Claim 11, wherein the processor is further configured to: calculate a number of stages for which to repeat the steps of fitting, calculating and compressing before the step of transmitting.
17. The system of Claim 11, wherein the input data is training data for the neural network.
18. The system of Claim 11, wherein the steps of fitting, calculating and compressing are run independently and simultaneously on each of the plural nodes.
19. The system of Claim 11, wherein the step of compressing has a target compression ratio, and the threshold is calculated by the processor based on the target compression ratio.
20. A non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, implement a method for gradient compression using statistical methods, the method comprising: receiving (1200) input data (412) at plural nodes (410-n) of a neural network (400); running (1202) the input data (412) forward and backward through the plural nodes (410-n) to generate node gradient vectors (422); fitting (1204) a sparsity-inducing distribution, SID, at each node (410-n), to the corresponding node gradient vector (422); calculating (1206) a first threshold η_1, based on the SID, for the corresponding node gradient vector (422); compressing (1208) the corresponding node gradient vector (422) to obtain a first compressed gradient vector (430), by setting to zero those components that are smaller than the first threshold η_1; and transmitting (1210) a compressed gradient vector, which is related to the first compressed gradient vector (430), to all other nodes (410-n) for updating a corresponding model, wherein the compressed gradient vector is sparse as only some components are non-zero.