CN117910521B - Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium
- Publication number
- CN117910521B (application CN202410317335.XA)
- Authority
- CN
- China
- Prior art keywords
- gradient
- training
- current
- standard
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Mathematics (AREA)
- Algebra (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Databases & Information Systems (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a gradient compression method, a gradient compression device, gradient compression equipment, a distributed cluster and a storage medium, belonging to the field of distributed computing. The gradient compression degree is adjusted with reference to two indexes, namely the model performance optimization rate and the current single-step training duration, thereby solving the problem that model performance and communication overhead cannot be balanced when gradient compression is performed over a low-speed network. According to the invention, taking a single training step as the granularity, after gradient data is obtained in any training step after the preheating stage, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
Description
Technical Field
The present invention relates to the field of distributed computing, and in particular, to a gradient compression method, apparatus, device, distributed cluster, and storage medium.
Background
With the development of large language models (LLM, Large Language Model), training deep learning models requires ever more computing resources, so distributed clusters are increasingly used for distributed computation. During model training, any computing node in the distributed cluster needs to send the gradient data generated in the iterative training process to the other computing nodes in the cluster over the network, so that each computing node can update its model parameters.
During model training, the distributed cluster must transmit a huge amount of gradient data over the network. To cope with low-speed, unstable networks, the gradient data can be compressed before transmission; compressing the gradient data reduces the communication cost, but it may also affect model performance. How to balance model performance and communication cost when compressing gradient data is therefore a difficult problem to be solved.
Therefore, how to provide a solution to the above technical problem is a problem that persons skilled in the art currently need to solve.
Disclosure of Invention
The invention aims to provide a gradient compression method, apparatus, device, distributed cluster and storage medium. Taking a single training step as the granularity, for any training step after the preheating stage, once the gradient data of that step has been obtained, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
In order to solve the above technical problems, the present invention provides a gradient compression method applied to any computing node in a distributed cluster, including:
for any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained;
If the current model performance optimization rate does not reach the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
If the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard;
If the current single-step training time length exceeds the standard, amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
Based on the latest gradient compression degree, compressing gradient data of the current training step by adopting a preset compression method so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
On the other hand, judging whether the current model performance optimization rate meets the standard comprises:
Judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number;
If the improvement amplitude of the model performance meets the standard, judging that the current model performance optimization rate meets the standard;
if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard;
the step number sliding window comprises a preset number of training steps.
On the other hand, judging whether the improvement amplitude of the model performance reaches the standard in the previous step number sliding window comprises the following steps:
determining a relation according to the change rate of the loss function, and determining the change rate of the loss function of the local model in the last step sliding window;
If the change rate of the loss function is smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance does not reach the standard;
If the change rate of the loss function is not smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance reaches the standard;
the loss function change rate determination relation includes:
ΔL(t) = ( L̄_M(t) − L_min(t) ) / L̄_M(t),   with   L̄_M(t) = (1/M) · Σ_{τ = t−M+1}^{t} L(τ)   and   L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
Wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as reference; L is the loss function of the local model; L̄_M(t) represents the sliding average of the loss function L in the last step number sliding window, taking the current t-th training step as reference, where τ is the summation variable; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
In another aspect, the preset rate of change threshold includes:
δ_th(t) = δ₀ · f(t);
Wherein f(t) is a decay function related to the number t of the training step, and δ₀ is a hyper-parameter.
In another aspect, the preset compression method includes a mixed gradient compression method combined with a gradient quantization method and a gradient thinning method.
On the other hand, reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
Q(t) → Clip_upper( 2·Q(t) ),   S(t) → λ·S(t);
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_upper(·) represents a clip function for performing an upper-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
On the other hand, amplifying the current gradient compression degree of the preset compression method according to the second preset rule includes:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
The compression degree amplifying relational expression comprises:
Q(t) → Clip_lower( Q(t)/2 ),   S(t) → S(t)/λ;
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_lower(·) represents a clip function for performing a lower-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
On the other hand, for any training step after the preheating stage of the local model iterative training, after gradient data of the current training step is obtained, judging whether the current model performance optimization rate meets the standard comprises:
Judging whether compression of gradient data can cause the iterative training to be unable to converge according to the gradient data of the current training step at any training step after the iterative training of the local model begins;
if compression of the gradient data would cause the iterative training to fail to converge, judging that the iterative training is in the preheating stage;
If compression of the gradient data would not cause the iterative training to fail to converge, judging that the preheating stage of the iterative training has ended;
For any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained.
On the other hand, according to the gradient data of the current training step, judging whether the compression of the gradient data can cause the iterative training to fail to converge includes:
Determining the variance of gradient data of the current training step;
judging whether the variance of gradient data of the current training step is larger than a preset variance threshold value or not;
if the variance is larger than the preset variance threshold, judging that compression of the gradient data would cause the iterative training to fail to converge;
if the variance is not larger than the preset variance threshold, judging that compression of the gradient data would not cause the iterative training to fail to converge.
On the other hand, after judging whether the current model performance optimization rate meets the standard, the gradient compression method further comprises the following steps:
If the current model performance optimization rate does not reach the standard, adding one to the number of consecutive times the optimization rate has failed to reach the standard, and judging whether this count reaches a first preset count threshold;
if it does, controlling the prompter to prompt that the optimization rate has failed to reach the standard too many consecutive times;
And if the current model performance optimization rate reaches the standard, resetting the count of consecutive optimization-rate failures to zero.
On the other hand, after judging whether the current single-step training time length exceeds the standard, the gradient compression method further comprises the following steps:
If the current single-step training time length exceeds the standard, adding one to the single-step training time length continuous exceeding times, and judging whether the single-step training time length continuous exceeding times reach a second preset time threshold value or not;
if the single-step training time is up, the control prompter prompts that the continuous exceeding frequency of the single-step training time is too high;
If the current single-step training time length does not exceed the standard, the continuous exceeding times of the single-step training time length are cleared.
In another aspect, the gradient compression method further comprises:
in response to the standard modification instruction, a standard-up standard for the model performance optimization rate and/or a standard-exceeding standard for the single step training duration are modified.
On the other hand, judging whether the current single-step training time length exceeds the standard comprises the following steps:
determining the training duration of the last training step as the current single-step training duration;
Judging whether the current single-step training time length is greater than a preset time length threshold value or not;
If the training time is greater than the standard, judging that the current single-step training time exceeds the standard;
If the training time is not greater than the preset training time, judging that the current single-step training time is not out of standard.
On the other hand, for any training step after the preheating stage of the local model iterative training, after gradient data of the current training step are obtained, before the current gradient compression degree of the preset compression method is reduced according to the first preset rule, the gradient compression method further comprises:
judging whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard or not;
If the current model performance optimization rate does not reach the standard, the step of reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises the following steps:
If the current model performance optimization rate does not reach the standard and/or the gradient distortion degree exceeds the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard comprises the following steps:
And if the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard, judging whether the current single-step training time length is out of standard.
On the other hand, determining whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard includes:
Determining a relation according to the gradient distortion degree, and determining the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree;
If the gradient distortion is greater than a preset distortion threshold, judging that the gradient distortion exceeds the standard;
If the gradient distortion is not greater than the preset distortion threshold, judging that the gradient distortion is not out of standard;
The gradient distortion degree determination relation comprises:
D(t) = ‖ g_GC − g ‖₂;
Wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖ g_GC − g ‖₂ represents the Euclidean distance between g_GC and g.
On the other hand, after judging whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard, the gradient compression method further comprises:
if the gradient distortion degree exceeds the standard, adding one to the continuous exceeding number of the gradient distortion degree, and judging whether the continuous exceeding number of the gradient distortion degree reaches a third preset number threshold;
if the number of the continuous exceeding standard times of the gradient distortion degree is too high, the control prompter prompts;
And if the gradient distortion degree does not exceed the standard, clearing the continuous exceeding frequency of the gradient distortion degree.
In order to solve the above technical problem, the present invention further provides a gradient compression device, which is applied to any computing node in a distributed cluster, including:
The first judging module is used for judging whether the current model performance optimization rate meets the standard or not for any training step after the preheating stage of the local model iterative training after the gradient data of the current training step is obtained, triggering the first adjusting module if the current model performance optimization rate does not meet the standard, and triggering the second judging module if the current model performance optimization rate meets the standard;
The first adjusting module is used for reducing the current gradient compression degree of the preset compression method according to a first preset rule;
The second judging module is used for judging whether the current single-step training time length exceeds the standard, and triggering the second adjusting module if the current single-step training time length exceeds the standard;
the second adjusting module is used for amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
The compression module is used for compressing gradient data of the current training step by adopting a preset compression method based on the latest gradient compression degree so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
In order to solve the technical problem, the present invention further provides a gradient compression device, including:
A memory for storing a computer program;
a processor for implementing the steps of the gradient compression method as described above when executing the computer program.
In order to solve the technical problem, the invention also provides a distributed cluster which comprises a plurality of gradient compression devices.
To solve the above technical problem, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the gradient compression method as described above.
The beneficial effects are that: the invention provides a gradient compression method. Considering that gradient data needs to be compressed after every training step, that the gradient compression degree influences both the optimization speed of the model performance and the single-step training duration, and that the single-step training duration is correlated with the network condition, the method takes a single training step as the granularity: for any training step after the preheating stage, once the gradient data of that step has been obtained, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
The invention also provides a gradient compression device, equipment, a distributed cluster and a computer readable storage medium, which have the same beneficial effects as the gradient compression method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly explain the related art and the drawings required to be used in the embodiments, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic flow chart of a gradient compression method provided by the invention;
Fig. 2 is a schematic structural diagram of a distributed cluster according to the present invention;
FIG. 3 is a schematic flow chart of another gradient compression method according to the present invention;
FIG. 4 is a schematic diagram of a gradient compression device according to the present invention;
FIG. 5 is a schematic diagram of a gradient compression apparatus according to the present invention;
Fig. 6 is a schematic structural diagram of a computer readable storage medium according to the present invention.
Detailed Description
The invention provides a gradient compression method, a device, equipment, a distributed cluster and a storage medium, wherein a single training step is taken as granularity, after gradient data is obtained in any training step after a preheating stage, the gradient compression degree is reduced under the condition that the model performance optimization rate does not reach the standard so as to improve the model performance, and the gradient compression degree can be amplified under the condition that the model performance optimization rate reaches the standard and the current single step training time length exceeds the standard so as to reduce the communication cost.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flow chart of a gradient compression method provided by the present invention, where the gradient compression method is applied to any computing node in a distributed cluster, and includes:
S101: for any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained;
for better explaining the embodiments of the present invention, please refer to fig. 2, fig. 2 is a schematic structural diagram of a distributed cluster provided by the present invention, and the gradient compression method in the embodiments of the present invention may be applied to any computing node in the distributed cluster.
Specifically, considering the technical problems in the background art and considering that gradient data needs to be compressed after any training step, the gradient compression degree can influence the optimization speed of model performance and the single-step training time length respectively, so that the invention aims to seek the adjustment of the gradient compression degree by taking a single training step in the iterative training process of the local model as granularity, thereby realizing the dynamic adjustment of the gradient compression degree and adjusting the balance between the model performance and communication overhead more timely and flexibly; meanwhile, considering that the model performance optimization rate determines the performance of the final local model to a large extent in the iterative training process, based on the fact that any training step after the preheating stage of the iterative training of the local model can be subjected to the gradient data of the current training step, whether the current model performance optimization rate meets the standard is judged, and the judgment result is used as a data basis of the follow-up step so as to guarantee the model performance optimization rate by timely adjusting the gradient compression degree.
Considering that the initial stage of the iterative training of the local model in the distributed cluster is not a good basis, gradient data and model parameters cannot be directly introduced into gradient compression, which causes the iterative training of the local model to be unable to converge, so that a preheating stage can be divided at the initial stage of the iterative training, the original uncompressed gradient data can be transmitted in a network manner at the stage so as to lay a basis for the iterative training of the local model in the distributed cluster, and the compression of the gradient data can be performed after the preheating stage so as to save communication cost, so that in the embodiment of the invention, after the gradient data of the current training stage is obtained, whether the current model performance optimization rate reaches the standard or not can be judged.
S102: if the current model performance optimization rate does not reach the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
Specifically, the "model performance optimization rate" determines the performance of the final local model to a larger extent, and the performance of the local model is an index focused on, so that in the embodiment of the invention, the current gradient compression degree of the preset compression method can be reduced according to the first preset rule under the condition that the current model performance optimization rate does not reach the standard, the reduction of the gradient compression degree represents the reduction of the compression degree of gradient data, and the model performance is improved by being beneficial to learning more features in the local training data in the model parameter optimization process.
S103: if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard;
Specifically, considering that after model performance, communication overhead and training time length which is in direct proportion to the communication overhead are indexes which need to be focused, and the two indexes are related to single-step training time length in a local model iterative training process, in addition, fluctuation of network bandwidth also affects the single-step training time length, therefore, under the condition that the current model performance optimization rate reaches the standard, the embodiment of the invention can judge whether the current single-step training time length exceeds the standard or not so as to trigger subsequent actions according to a judging result, thereby combining the influence of network conditions on the single-step training time length and possibly reducing the communication overhead.
Specifically, the single-step training duration may include the calculation time of a single training step and the communication time consumed by gradient data transmission; the influencing factors of the communication time include the gradient compression degree and the network condition. In practical applications, a certain overlap may exist between the calculation time and the communication time.
S104: if the current single-step training time length exceeds the standard, amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
the single-step training duration theoretically has a certain theoretical interval, namely a standard in the exceeding of the single-step training duration, and the single-step training duration can be considered to have a down-regulating space under the condition that the single-step training duration exceeds the standard, and the single-step training duration does not have the down-regulating space under the condition that the single-step training duration does not exceed the standard, so that the current gradient compression degree of the preset compression method can be amplified under the condition that the current single-step training duration exceeds the standard, so that the single-step training duration is adjusted down, and the training duration of the model is reduced.
The current gradient compression degree of the preset compression method can be amplified through the second preset rule, so that higher flexibility is achieved, the first preset rule and the second preset rule can be flexibly and autonomously set, and the embodiment of the invention is not limited herein.
S105: based on the latest gradient compression degree, compressing gradient data of the current training step by adopting a preset compression method so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
Specifically, after the adjustment of the steps, based on the judgment of the model performance optimization rate and the single-step training duration, the adjustment of the gradient compression degree in the training step is completed, so that the gradient data of the current training step can be compressed by adopting a preset compression method based on the latest gradient compression degree in the step, and the compressed gradient data can be synchronized in the distributed cluster.
The invention provides a gradient compression method. Considering that gradient data needs to be compressed after every training step, that the gradient compression degree influences both the optimization speed of the model performance and the single-step training duration, and that the single-step training duration is correlated with the network condition, the method takes a single training step as the granularity: for any training step after the preheating stage, once the gradient data of that step has been obtained, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
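For illustration only (the following sketch is not part of the patent text), the per-step decision logic of S101 to S105 can be outlined in Python roughly as follows; the class name, default window size, thresholds, initial quantization precision and sparsity are all assumptions made for the example:

```python
import numpy as np

class CompressionController:
    """Illustrative sketch of the S101-S105 decision loop (assumed parameter values)."""

    def __init__(self, window=50, rate_threshold=0.01, step_time_limit=2.0, lam=2.0):
        self.losses = []                    # loss history feeding the step-number sliding window
        self.window = window                # M: training steps per sliding window
        self.rate_threshold = rate_threshold
        self.step_time_limit = step_time_limit
        self.bits, self.sparsity = 8, 0.1   # current quantization precision Q(t) and sparsity S(t)
        self.lam = lam                      # preset adjustment parameter, lambda > 1

    def optimization_rate_ok(self):
        if len(self.losses) < self.window:
            return True                     # not enough history yet
        recent = np.array(self.losses[-self.window:])
        rate = (recent.mean() - recent.min()) / recent.mean()
        return rate >= self.rate_threshold

    def adjust(self, loss, last_step_duration):
        """S101-S104: update the gradient compression degree for the current training step."""
        self.losses.append(loss)
        if not self.optimization_rate_ok():              # S102: compress less
            self.bits = min(32, 2 * self.bits)
            self.sparsity = min(1.0, self.lam * self.sparsity)
        elif last_step_duration > self.step_time_limit:  # S104: compress more
            self.bits = max(1, self.bits // 2)
            self.sparsity = max(1e-4, self.sparsity / self.lam)
        return self.bits, self.sparsity
```

The returned pair then parameterizes the preset compression method in S105 before the compressed gradient is synchronized across the cluster.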
Based on the above embodiments:
as an alternative embodiment, determining whether the current model performance optimization rate meets the criteria includes:
Judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number;
If the improvement amplitude of the model performance meets the standard, judging that the current model performance optimization rate meets the standard;
if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard;
The step number sliding window comprises a preset number of training steps.
Specifically, considering that the improvement amplitude of the model performance can be estimated for each training step, and the improvement amplitude of the model performance of a plurality of past training steps can represent the current model performance optimization rate of the local model, the invention presets a step number sliding window, can judge whether the improvement amplitude of the model performance reaches the standard in the last step number sliding window, and judges that the current model performance optimization rate reaches the standard if the improvement amplitude of the model performance reaches the standard; and if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard.
The preset number of training steps included in the step number sliding window can be set autonomously, and the embodiment of the invention is not limited herein.
Of course, besides this specific manner, it may also be determined in other manners whether the current model performance optimization rate meets the standard, which is not limited herein.
As an alternative embodiment, determining whether the improvement amplitude of the model performance meets the standard in the previous step number sliding window includes:
determining a relation according to the change rate of the loss function, and determining the change rate of the loss function of the local model in the last step sliding window;
if the change rate of the loss function is smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance does not reach the standard;
if the change rate of the loss function is not smaller than the preset change rate threshold, judging that the improvement amplitude of the model performance reaches the standard;
the loss function change rate determination relation includes:
ΔL(t) = ( L̄_M(t) − L_min(t) ) / L̄_M(t),   with   L̄_M(t) = (1/M) · Σ_{τ = t−M+1}^{t} L(τ)   and   L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
Wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as reference; L is the loss function of the local model; L̄_M(t) represents the sliding average of the loss function L in the last step number sliding window, taking the current t-th training step as reference, where τ is the summation variable; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
Specifically, considering that the loss function can accurately evaluate the change of the model performance between each training step, and the change rate of the loss function can represent the optimization rate of the model performance, in the embodiment of the invention, a relational expression can be determined according to the change rate of the loss function, the change rate of the loss function in the last step sliding window of the local model is determined, the change rate of the loss function is compared with a preset change rate threshold, if the change rate of the loss function is smaller than the preset change rate threshold, the improvement amplitude of the model performance is judged to be not up to standard, and if the change rate of the loss function is not smaller than the preset change rate threshold, the improvement amplitude of the model performance is judged to be up to standard.
Specifically, in the above loss function change rate determination relation, the subtraction in the numerator indicates how much the currently achieved minimum loss function value has decreased compared with the sliding average of the loss function L in the last step number sliding window; dividing by that sliding average gives the proportion by which the minimum loss function value has decreased, and whether this proportional change is significant enough can be determined by comparing it with the preset change rate threshold.
The loss function change rate of the local model in the last step sliding window can be determined efficiently and accurately by the loss function change rate determination relation.
Of course, the loss function change rate determination relation may be in other specific forms besides the above specific forms, and embodiments of the present invention are not limited herein.
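As a non-authoritative illustration of the above relation, the loss function change rate over the last M training steps could be computed as follows; the window size M = 50 and the 0.01 threshold in the usage line are assumed values:

```python
import numpy as np

def loss_change_rate(loss_history, M=50):
    """Change rate of the loss over the last M training steps (step-number sliding window)."""
    window = np.array(loss_history[-M:])
    moving_avg = window.mean()           # sliding average of L in the window
    best = window.min()                  # smallest loss function value in the window
    return (moving_avg - best) / moving_avg

# usage: the improvement amplitude meets the standard if the rate is not below the threshold
losses = [2.0 - 0.01 * i for i in range(100)]
meets_standard = loss_change_rate(losses) >= 0.01
```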
As an alternative embodiment, the preset change rate threshold comprises:
δ_th(t) = δ₀ · f(t);
Wherein f(t) is a decay function related to the number t of the training step, and δ₀ is a hyper-parameter.
Specifically, considering that the optimization rate of the model performance also presents a trend from fast to slow along with the deep training stage of the local model, the preset change rate threshold should be in a trend of attenuation in theory, so that in order to more accurately judge whether the model performance optimization rate of different training steps meets the standard, the core of the preset change rate threshold in the embodiment of the invention is an attenuation function related to the number of the training steps, and is matched with a super-parameter for fine adjustment.
Specifically, the attenuation function may be of various types, and may include, for example, a staged attenuation, an exponential attenuation, a cosine attenuation, etc., which are not limited herein.
Of course, the preset change rate threshold may take a variety of forms other than this specific form, and embodiments of the present invention are not limited herein.
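A minimal sketch of one possible form of the threshold, assuming an exponential decay function (only one of the decay types mentioned above) together with assumed values for the hyper-parameter and the decay constant:

```python
import math

def change_rate_threshold(t, base=0.05, decay=1e-4):
    """Preset change-rate threshold: a hyper-parameter scaled by a decay function of step t."""
    return base * math.exp(-decay * t)   # staged or cosine decay could be used instead
```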
As an alternative embodiment, the preset compression method includes a mixed gradient compression method by a combination of gradient quantization and gradient thinning.
Specifically, considering that the essential principle of the gradient quantization method is to reduce the bit number required for characterizing a single communication data (i.e., gradient data), while the essential principle of the gradient thinning method is to reduce the number of communication data, these two methods do not have contradictions, and there are conditions of common use, and if only one of them is used, the compression degree and the compression effect are limited (because the sensitivity of the quantization method and the thinning method to gradient data with different characteristics is different), the preset compression method in the embodiment of the present invention may include a hybrid gradient compression method that combines the gradient quantization method and the gradient thinning method, and when specifically implemented, the gradient data to be compressed may be thinned by the gradient thinning method first, and then quantized by the gradient quantization method, and the embodiment of the present invention is not limited herein.
The gradient quantization method and the gradient sparsification method may each be of various specific types; for example, the quantization range of the gradient quantization method may be flexibly set (for example, from 32 bit down to 1 bit), and the specific gradient sparsification method may be the Top-K method (i.e., the K gradient values with the largest magnitude are retained out of the N_g gradient values), etc., which is not limited herein.
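A simplified, non-authoritative sketch of such a hybrid scheme: Top-K sparsification first, followed by a symmetric uniform quantizer. The specific quantizer and all parameter values are assumptions made for illustration:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k gradient entries with the largest magnitude; return values and indices."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return grad[idx], idx

def uniform_quantize(values, bits):
    """Uniformly quantize values to the given bit width (symmetric, per-tensor scale)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(values)) / levels if levels > 0 else 1.0
    q = np.round(values / scale).astype(np.int32)
    return q, scale

def hybrid_compress(grad, sparsity, bits):
    """Mixed gradient compression: Top-K sparsification followed by quantization."""
    k = max(1, int(sparsity * grad.size))
    values, idx = topk_sparsify(grad, k)
    q, scale = uniform_quantize(values, bits)
    return q, idx, scale
```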
As an optional embodiment, reducing the current gradient compression degree of the preset compression method according to the first preset rule includes:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
Q(t) → Clip_upper( 2·Q(t) ),   S(t) → λ·S(t);
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_upper(·) represents a clip function for performing an upper-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
Specifically, the gradient compression degree can be quickly and appropriately adjusted down by the compression degree reduction relational expression, wherein Clip_upper(2·Q(t)) represents the adjustment of the gradient compression degree corresponding to the gradient quantization strategy function, specifically embodied by up-regulating the current quantization precision Q(t) to 2Q(t), with the highest quantization precision supported by the system as an upper limit, so the clip function is used to perform an upper-limit clipping operation; λ·S(t) represents the adjustment of the gradient compression degree corresponding to the gradient sparsification strategy, embodied by multiplying the current sparsity S(t) by λ.
As an alternative embodiment, amplifying the current gradient compression degree of the preset compression method according to the second preset rule includes:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
the compression degree amplification relation includes:
Q(t) → Clip_lower( Q(t)/2 ),   S(t) → S(t)/λ;
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_lower(·) represents a clip function for performing a lower-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
Specifically, the gradient compression degree can be quickly and appropriately amplified by the compression degree amplification relational expression, wherein Clip_lower(Q(t)/2) represents the amplification of the gradient compression degree corresponding to the gradient quantization strategy function, specifically embodied by adjusting the current quantization precision Q(t) down to Q(t)/2, with the minimum quantization precision of the system as a lower limit, so the clip function is used to perform a lower-limit clipping operation; the latter half of the expression represents the adjustment of the gradient compression degree corresponding to the gradient sparsification strategy, embodied by multiplying the current sparsity S(t) by 1/λ.
Of course, the compression degree reducing relational expression and the compression degree enlarging relational expression may be other specific forms besides the compression degree reducing relational expression and the compression degree enlarging relational expression described above, and the embodiment of the present invention is not limited thereto.
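As an illustrative sketch only, the two relational expressions above could be realized as follows, with the system's highest and lowest quantization precisions assumed to be 32 bit and 1 bit:

```python
def clip_upper(bits, max_bits=32):
    """Upper-limit clipping of the quantization precision."""
    return min(bits, max_bits)

def clip_lower(bits, min_bits=1):
    """Lower-limit clipping of the quantization precision."""
    return max(bits, min_bits)

def reduce_compression_degree(Q, S, lam=2.0):
    """First preset rule: Q(t) -> Clip_upper(2*Q(t)), S(t) -> lam*S(t)."""
    return clip_upper(2 * Q), min(lam * S, 1.0)

def amplify_compression_degree(Q, S, lam=2.0):
    """Second preset rule: Q(t) -> Clip_lower(Q(t)/2), S(t) -> S(t)/lam."""
    return clip_lower(Q // 2), S / lam
```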
As an alternative embodiment, for any training step after the warm-up phase of the local model iterative training, after obtaining gradient data of the current training step, determining whether the current model performance optimization rate meets the standard includes:
Judging whether compression of gradient data can cause the iterative training to be unable to converge according to the gradient data of the current training step at any training step after the iterative training of the local model begins;
if compression of the gradient data would cause the iterative training to fail to converge, judging that the iterative training is in the preheating stage;
If compression of the gradient data would not cause the iterative training to fail to converge, judging that the preheating stage of the iterative training has ended;
For any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained.
Specifically, in order to more accurately identify the preheating stage of the local model iterative training, whether the iterative training cannot be converged or not can be reflected by considering the gradient data of a single training step, and further whether the iterative training is in the preheating stage or not is determined.
Of course, the identification of the preheating stage may be performed in other ways besides this specific way, and embodiments of the present invention are not limited herein.
As an alternative embodiment, determining whether compression of the gradient data would result in failure of convergence of the iterative training based on the gradient data of the current training step includes:
Determining the variance of gradient data of the current training step;
judging whether the variance of gradient data of the current training step is larger than a preset variance threshold value or not;
if the variance is larger than the preset variance threshold, judging that compression of the gradient data would cause the iterative training to fail to converge;
if the variance is not larger than the preset variance threshold, judging that compression of the gradient data would not cause the iterative training to fail to converge.
Specifically, considering the variance of the gradient data, whether the local model belongs to the preheating stage at the current moment or not, namely, whether the local model is suitable for compressing the gradient data or not can be reflected, so that in the embodiment of the invention, whether the variance of the gradient data of the current training step is larger than the preset variance threshold value or not can be judged, if so, the compression of the gradient data can be judged to cause the non-convergence of iterative training, and if not, the compression of the gradient data can be judged to not cause the non-convergence of iterative training, thereby being efficient and accurate.
In the preheating stage, the gradient compression degrees of the gradient quantization method and the gradient sparsification method may be set to the lowest, for example a state in which no quantization or sparsification is performed.
Of course, in addition to this specific form, "determining whether compression of gradient data will cause the iterative training to fail to converge according to gradient data of the current training step" may be performed in other manners, and embodiments of the present invention are not limited herein.
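A minimal sketch of the variance-based preheating check, assuming a hypothetical preset variance threshold:

```python
import numpy as np

def warmup_has_ended(grad, variance_threshold=1.0):
    """Compression is allowed only once the variance of the current step's gradient is small enough."""
    return np.var(grad) <= variance_threshold

# while the variance stays above the threshold, the gradient is transmitted uncompressed
```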
As an optional embodiment, after determining whether the current model performance optimization rate meets the standard, the gradient compression method further includes:
If the current model performance optimization rate does not reach the standard, adding one to the continuous times of the optimization rate, and judging whether the continuous times of the optimization rate reach a first preset time threshold;
if the frequency reaches the preset frequency, the control prompter prompts that the optimization rate is not up to the standard continuously and the frequency is too high;
And if the current model performance optimization rate meets the standard, resetting the times of continuously unqualified optimization rate.
Specifically, considering that in some cases, even though the gradient compression degree is repeatedly adjusted, the model performance optimization rate may still not reach the standard, and in this state, the staff is required to check and maintain the model performance optimization rate, so in order to enable the staff to know the condition as soon as possible, in the embodiment of the invention, the condition can be monitored, and therefore if the current model performance optimization rate does not reach the standard, the continuous substandard times of the optimization rate are increased by one, and whether the continuous substandard times of the optimization rate reach a first preset times threshold value is judged; if the frequency reaches the preset frequency, the control prompter prompts that the optimization rate is not up to the standard continuously and the frequency is too high; if the current model performance optimization rate meets the standard, the optimization rate is continuously cleared for times which do not meet the standard, so that automatic recording of the continuous times of the model performance optimization rate which do not meet the standard and corresponding alarm triggering are realized, and the reliability of the system and the user experience are improved.
As an optional embodiment, after determining whether the current single-step training duration exceeds the standard, the gradient compression method further includes:
If the current single-step training time length exceeds the standard, adding one to the single-step training time length continuous exceeding times, and judging whether the single-step training time length continuous exceeding times reach a second preset time threshold value or not;
if the single-step training time is up, the control prompter prompts that the continuous exceeding frequency of the single-step training time is too high;
if the current single-step training time length does not exceed the standard, the continuous exceeding times of the single-step training time length are cleared.
Specifically, considering that the single-step training duration is not continuously exceeded after the gradient compression degree is adjusted for a plurality of times theoretically, namely, the single-step training duration is continuously exceeded for a plurality of times, which belongs to an abnormal condition, in order to enable staff to know as soon as possible, the condition can be monitored in the embodiment of the invention, therefore, if the current single-step training duration exceeds the standard, the single-step training duration is continuously exceeded times are increased by one, and whether the single-step training duration continuously exceeds the standard times reaches a second preset times threshold value is judged; if the single-step training time is up, the control prompter prompts that the continuous exceeding frequency of the single-step training time is too high; if the current single-step training duration is not out of standard, the continuous out-of-standard times of the single-step training duration are cleared, so that automatic recording of the continuous times of the 'single-step training duration out of standard' and corresponding alarm triggering are realized, and the reliability of the system and the user experience are improved.
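The count-and-prompt pattern used here (and for the optimization rate above and the gradient distortion degree further below) could be factored into a small helper as sketched below; the prompter is represented by a placeholder print call:

```python
class ConsecutiveOverLimitCounter:
    """Counts consecutive violations and prompts once a preset count threshold is reached."""

    def __init__(self, threshold, message):
        self.threshold = threshold
        self.message = message
        self.count = 0

    def update(self, violated):
        if violated:
            self.count += 1
            if self.count >= self.threshold:
                print(f"PROMPT: {self.message}")   # placeholder for the prompter device
        else:
            self.count = 0                         # reset when the indicator is back within standard
```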
As an alternative embodiment, the gradient compression method further comprises:
in response to the standard modification instruction, a standard-up standard for the model performance optimization rate and/or a standard-exceeding standard for the single step training duration are modified.
In particular, in consideration of the need for a worker to modify the standard of the model performance optimization rate and/or the standard exceeding of the single-step training duration, in order to improve the working efficiency, the embodiment of the invention provides a related modification interface, so that the standard modifying instruction can be responded to modify the standard of the model performance optimization rate and/or the standard exceeding of the single-step training duration.
As an alternative embodiment, determining whether the current single step training period exceeds the standard includes:
determining the training duration of the last training step as the current single-step training duration;
Judging whether the current single-step training time length is greater than a preset time length threshold value or not;
If the training time is greater than the standard, judging that the current single-step training time exceeds the standard;
If the training time is not greater than the preset training time, judging that the current single-step training time is not out of standard.
Specifically, considering that before the gradient data of the current training step are compressed and synchronized, the communication time of the gradient data of the current training step is not determined, so that the single-step training duration of the current training step cannot be determined, and the single-step training duration of the last training step is influenced by the current gradient compression degree and the current network condition, so that the training duration of the last training step can be determined as the current single-step training duration, and whether the current single-step training duration exceeds the standard is determined by judging whether the previous single-step training duration is greater than a preset duration threshold.
The preset duration threshold may be set autonomously, which is not limited in the embodiment of the present invention.
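The check itself reduces to comparing the previous step's wall-clock duration against the preset threshold. A sketch under the assumption that step timing is taken with time.perf_counter (the embodiment does not prescribe a particular timer):

```python
import time

def step_duration_over_limit(last_step_duration: float, threshold_s: float) -> bool:
    """True when the training duration of the previous step exceeds the preset duration threshold."""
    return last_step_duration > threshold_s

# usage: time one training step, then reuse its duration as the "current" single-step duration
start = time.perf_counter()
# ... forward/backward pass plus compressed-gradient synchronization of the previous step ...
last_step_duration = time.perf_counter() - start
print(step_duration_over_limit(last_step_duration, threshold_s=1.0))
```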
As an optional embodiment, for any training step after the preheating stage of the local model iterative training, after obtaining the gradient data of the current training step, before reducing the current gradient compression degree of the preset compression method according to the first preset rule, the gradient compression method further includes:
judging whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard or not;
If the current model performance optimization rate does not reach the standard, the step of reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises the following steps:
if the current model performance optimization rate does not reach the standard and/or the gradient distortion degree exceeds the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard comprises the following steps:
If the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard, judging whether the current single-step training time length is out of standard or not.
For better explanation of the embodiment of the present invention, please refer to fig. 3, fig. 3 is a flow chart of another gradient compression method provided by the present invention, wherein S304 is the same as S102, S305 is the same as S103, S306 is the same as S104, S307 is the same as S105, and S301-S303 comprise:
S301: for any training step after the preheating stage of the local model iterative training, gradient data of the current training step is obtained;
S302: whether the current model performance optimization rate meets the standard or not;
S303: whether the gradient distortion degree exceeds the standard.
Specifically, considering that, in the process of performing the adjustment with the training step as the granularity, compressing the gradient data of the current training step based on the current gradient compression degree is likely to distort the gradient data and thereby impair the precision of the local model, in the embodiment of the invention it may be judged whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard, and the action of reducing the current gradient compression degree of the preset compression method according to the first preset rule may be triggered when the standard is exceeded, so that gradient data distortion caused by gradient compression is avoided and the model precision is further improved.
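Putting the branches of fig. 3 together, the per-step control flow can be sketched as follows. The CompressionState container, the concrete adjustment factors and the three boolean inputs are hypothetical stand-ins for the checks and preset rules described above, not the patent's exact parameter values.

```python
from dataclasses import dataclass

@dataclass
class CompressionState:
    """Hypothetical container for the current joint compression settings."""
    quant_bits: int = 8      # gradient quantization setting, cf. Q(t)
    keep_ratio: float = 0.1  # fraction of gradients kept by sparsification, cf. S(t)

    def decrease_compression(self) -> None:
        # more bits and a larger keep ratio mean *less* compression
        self.quant_bits = min(32, self.quant_bits * 2)
        self.keep_ratio = min(1.0, self.keep_ratio * 2.0)

    def increase_compression(self) -> None:
        self.quant_bits = max(1, self.quant_bits // 2)
        self.keep_ratio = max(0.01, self.keep_ratio / 2.0)


def adjust_for_step(state: CompressionState,
                    rate_meets_standard: bool,
                    distortion_over_limit: bool,
                    duration_over_limit: bool) -> CompressionState:
    """One adjustment decision per training step after the warm-up phase (cf. S302-S307 in fig. 3)."""
    if not rate_meets_standard or distortion_over_limit:
        # model quality is suffering: compress less aggressively
        state.decrease_compression()
    elif duration_over_limit:
        # quality is on track but the step is too slow under the current network: compress more
        state.increase_compression()
    return state


# usage: a step whose duration exceeds the standard while model quality is fine
print(adjust_for_step(CompressionState(), True, False, True))
```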
Specifically, the basic idea of the patent is based on 'mathematical modeling under a low-speed unstable network environment', and the specific contents are as follows:
In a distributed training scenario where the network is unstable, for a given training task, if the total duration of the local model training is noted as T_total, then:
T_total ≈ Σ_{t=1}^{T} ( T_comm(t) + T_comp(t) );
wherein T_comm(t) represents the communication time of the t-th training step, T_comp(t) represents the computation time of the t-th training step, and T represents the total number of steps required for iterative training of the local model. Since there is in fact a certain overlap between the communication time and the computation time, T_total is only approximately equal to this sum over t.
The part of this scenario that the present work really focuses on is the communication time of the training step. The communication time of a training step can be further expanded as being proportional to a combination of several functions of t:
T_comm(t) ∝ Q(t) · S(t) / BW(t);
Specifically, GC refers to Gradient Compression, and GC(t) represents the gradient compression strategy with respect to training step t. The first element is Q(t), the gradient quantization strategy function; it is a function of t because the quantization strategy may change at every step of the distributed training, and thus varies dynamically with t. The second element is S(t), the gradient pruning (sparsification) strategy function; it is likewise a function of t because the pruning strategy may change at every step, and thus also varies dynamically with t.
The third term, BW(t), represents the bandwidth of the network at the t-th training step of iterative training; as described above, the present invention is directed to a low-speed, unstable network environment, so the network bandwidth changes over time and also behaves as a function of the dynamic change of t.
Thus, the problem of constructing a dynamic gradient compression strategy is converted into the problem of solving, under a given model performance requirement, for the objective function T_comm(t) (or Σ_t T_comm(t)), the optimal joint compression strategy (quantization + sparsification) GC*(t). The specific mathematical expression is as follows:
GC*(t) = argmin_{Q(t), S(t)} N_g · Q(t) · S(t) / BW(t), subject to the model performance requirement L_required;
wherein N_g represents the number of gradient data; L_required represents a given model performance requirement (which may be manifested as "model performance optimization rate up to standard" in the present invention); GC*(t) represents the optimal hybrid compression strategy for which the right-hand expression (the communication time of the training step) reaches its minimum value; as a variable with respect to time t, it is the target of the optimization problem that needs to be solved.
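As a rough numerical illustration of how these modeled quantities interact, the sketch below estimates the per-step communication time from the gradient count, quantization bit width, sparsification keep ratio and momentary bandwidth. The simple payload-divided-by-bandwidth cost model (and the neglect of protocol overhead and of overlap with computation) is an assumption of the sketch, not the patent's exact expression.

```python
def estimated_comm_time(n_gradients: int, quant_bits: int, keep_ratio: float, bandwidth_bps: float) -> float:
    """Estimate T_comm(t) as compressed payload size in bits divided by available bandwidth."""
    payload_bits = n_gradients * keep_ratio * quant_bits
    return payload_bits / bandwidth_bps

# usage: 100M gradient values, 4-bit quantization, 1% kept, on a 100 Mbit/s link
print(estimated_comm_time(100_000_000, 4, 0.01, 100e6))  # -> 0.04 seconds
```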
As an alternative embodiment, determining whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds a standard includes:
determining, according to the gradient distortion degree determination relational expression, the gradient distortion degree with which the gradient data of the current training step is compressed based on the current gradient compression degree;
if the gradient distortion is greater than a preset distortion threshold, judging that the gradient distortion exceeds the standard;
If the gradient distortion is not greater than the preset distortion threshold, judging that the gradient distortion is not out of standard;
The gradient distortion degree determination relation includes:
gradient distortion degree = ‖g_GC − g‖_2;
wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖g_GC − g‖_2 denotes the Euclidean distance between g_GC and g.
Specifically, the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree can be determined efficiently and accurately by the gradient distortion degree determination relational expression as described above.
Of course, in addition to this specific manner, the "gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree" may be determined in other manners, and the embodiment of the present invention is not limited herein.
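A minimal sketch of this Euclidean-distance distortion check with NumPy; the threshold value and the crude sparsification used to produce a compressed gradient are illustrative assumptions only.

```python
import numpy as np

def gradient_distortion(g_compressed: np.ndarray, g: np.ndarray) -> float:
    """Euclidean distance between the compressed and the uncompressed gradient of the current step."""
    return float(np.linalg.norm(g_compressed - g))

def distortion_over_limit(g_compressed: np.ndarray, g: np.ndarray, threshold: float) -> bool:
    return gradient_distortion(g_compressed, g) > threshold

# usage with toy data: keep only large-magnitude components as a stand-in for sparsification
g = np.random.randn(1000).astype(np.float32)
g_compressed = np.where(np.abs(g) > 1.0, g, 0.0)
print(distortion_over_limit(g_compressed, g, threshold=10.0))
```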
As an optional embodiment, after determining whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard, the gradient compression method further includes:
if the gradient distortion degree exceeds the standard, adding one to the continuous exceeding number of times of the gradient distortion degree, and judging whether the continuous exceeding number of times of the gradient distortion degree reaches a third preset number of times threshold;
if the third preset number of times threshold is reached, the prompter is controlled to prompt that the gradient distortion degree has exceeded the standard too many consecutive times;
and if the gradient distortion degree does not exceed the standard, resetting the continuous exceeding number of times of the gradient distortion degree.
Specifically, if the gradient distortion degree still exceeds the standard many times in a row after the gradient compression degree has been adjusted repeatedly, this is an abnormal situation; in order to monitor it, if the gradient distortion degree exceeds the standard, the count of consecutive exceedances of the gradient distortion degree is increased by one, and it is judged whether this count reaches a third preset number of times threshold; if it does, the prompter is controlled to prompt that the gradient distortion degree has exceeded the standard too many consecutive times; if the gradient distortion degree does not exceed the standard, the count of consecutive exceedances of the gradient distortion degree is cleared. This realizes automatic monitoring of, and alarm triggering for, the situation in which the gradient distortion degree repeatedly exceeds the standard.
The first preset frequency threshold, the second preset frequency threshold and the third preset frequency threshold can be set independently, which is not limited herein.
For better explaining the embodiments of the present invention, please refer to fig. 4, fig. 4 is a schematic structural diagram of a gradient compression device provided by the present invention, where the gradient compression device is applied to any computing node in a distributed cluster, and the gradient compression device includes:
a first judging module 41, configured to, for any training step after the warm-up phase of the local model iterative training, judge whether the current model performance optimization rate meets the standard after obtaining gradient data of the current training step, trigger the first adjusting module 42 if the current model performance optimization rate does not meet the standard, and trigger the second judging module 43 if the current model performance optimization rate meets the standard;
the first adjusting module 42 is configured to reduce a current gradient compression degree of the preset compression method according to a first preset rule;
A second judging module 43, configured to judge whether the current single-step training duration exceeds a standard, and trigger the second adjusting module 44 if the current single-step training duration exceeds the standard;
The second adjusting module 44 is configured to amplify the current gradient compression degree of the preset compression method according to a second preset rule;
The compression module 45 is configured to compress gradient data of a current training step by using a preset compression method based on the latest gradient compression degree, so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
As an alternative embodiment, the first judging module 41 includes:
The first judging sub-module is used for judging whether the improvement amplitude of the model performance meets the standard in the last step number sliding window, triggering the first judging module if the improvement amplitude of the model performance meets the standard, and triggering the second judging module if the improvement amplitude of the model performance does not meet the standard;
the first judging module is used for judging that the current model performance optimization rate reaches the standard;
the second judging module is used for judging that the current model performance optimization rate does not reach the standard;
The step number sliding window comprises a preset number of training steps.
As an alternative embodiment, the first judging submodule includes:
The first determining module is used for determining, according to the loss function change rate determination relational expression, the change rate of the loss function of the local model in the last step number sliding window;
the third judging module is used for judging that the improvement amplitude of the model performance does not reach the standard if the change rate of the loss function is smaller than a preset change rate threshold value;
The fourth judging module is used for judging that the improvement amplitude of the performance of the model reaches the standard if the change rate of the loss function is not smaller than a preset change rate threshold value;
the loss function change rate determination relation includes:
ΔL(t) = ( L_avg(t) − L_min(t) ) / L_min(t), with L_avg(t) = (1/M) Σ_{τ=t−M+1}^{t} L(τ) and L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as a reference; L is the loss function of the local model; L_avg(t) represents the sliding average value of the loss function L in the last step number sliding window taking the current t-th training step as a reference, τ being the variable of the summation; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
As an alternative embodiment, the preset change rate threshold comprises:
ε(t) = ε_0 · d(t);
wherein d(t) is a decay function related to the number of steps t of the training step, and ε_0 is a hyperparameter.
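The window statistics needed here are a sliding average and a minimum over the last M recorded loss values, compared against a step-decayed threshold. The sketch below assumes the relative-change form reconstructed above and an exponential decay for d(t); both are illustrative choices rather than the patent's prescribed formulas.

```python
from collections import deque
import math

class LossRateMonitor:
    """Tracks the loss over a sliding window of M steps and judges the improvement rate."""

    def __init__(self, window: int, eps0: float, decay: float = 1e-4):
        self.losses = deque(maxlen=window)  # the step-number sliding window
        self.eps0 = eps0                    # hyperparameter of the threshold
        self.decay = decay                  # controls the assumed decay function d(t)

    def rate_meets_standard(self, step: int, loss: float) -> bool:
        self.losses.append(loss)
        if len(self.losses) < self.losses.maxlen:
            return True  # not enough history yet to judge
        avg = sum(self.losses) / len(self.losses)
        lowest = min(self.losses)
        if lowest <= 0:
            return True  # avoid division by zero once the loss has bottomed out
        change_rate = (avg - lowest) / lowest                 # reconstructed relative change rate
        threshold = self.eps0 * math.exp(-self.decay * step)  # illustrative decay function d(t)
        return change_rate >= threshold

# usage: a slowly decreasing loss curve
mon = LossRateMonitor(window=10, eps0=0.05)
for step in range(50):
    ok = mon.rate_meets_standard(step, loss=2.0 * 0.99 ** step)
print(ok)
```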
As an alternative embodiment, the preset compression method includes a hybrid gradient compression method formed by combining gradient quantization and gradient sparsification.
As an optional embodiment, reducing the current gradient compression degree of the preset compression method according to the first preset rule includes:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
clip_upper(2·Q(t)) · λ·S(t);
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_upper(·) represents a clip function that performs an upper-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
As an alternative embodiment, amplifying the current gradient compression degree of the preset compression method according to the second preset rule includes:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
the compression degree amplification relation includes:
clip_lower(Q(t)/2) · S(t)/λ;
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_lower(·) represents a clip function that performs a lower-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
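Read together, the two relational expressions double or halve the quantization setting inside clip bounds and scale the sparsification setting by λ in the corresponding direction. A sketch of both directions follows; the clip limits, the value of λ and the interpretation of S(t) as a keep ratio are assumptions of the sketch (the amplification formula above is itself reconstructed by symmetry).

```python
def clip(value: float, lower: float, upper: float) -> float:
    return max(lower, min(upper, value))

def decrease_compression(q_bits: float, s_keep: float, lam: float = 2.0,
                         q_upper: float = 32.0) -> tuple[float, float]:
    """Reduce the gradient compression degree: clip_upper(2*Q(t)) and lambda*S(t), lambda > 1."""
    return clip(2.0 * q_bits, 1.0, q_upper), clip(lam * s_keep, 0.0, 1.0)

def increase_compression(q_bits: float, s_keep: float, lam: float = 2.0,
                         q_lower: float = 1.0) -> tuple[float, float]:
    """Amplify the gradient compression degree: the mirror update with a lower clip."""
    return clip(q_bits / 2.0, q_lower, 32.0), clip(s_keep / lam, 0.001, 1.0)

# usage
print(decrease_compression(4, 0.01))   # more bits, more gradients kept
print(increase_compression(8, 0.02))   # fewer bits, fewer gradients kept
```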
As an alternative embodiment, the first judging module 41 includes:
The second judging sub-module is used for judging whether the compression of the gradient data can cause the convergence failure of the iterative training according to the gradient data of the current training step at any training step after the iterative training of the local model begins, triggering the fifth judging module if the convergence failure of the iterative training can be caused, and triggering the sixth judging module if the convergence failure of the iterative training can not be caused;
a fifth judging module, configured to judge that the iterative training is in a preheating stage;
A sixth judging module, configured to judge that the preheating phase of the iterative training ends;
and the third judging sub-module is used for judging whether the current model performance optimization rate reaches the standard or not after gradient data of the current training step is obtained for any training step after the preheating stage of the local model iterative training.
As an alternative embodiment, the second judging submodule includes:
The second determining module is used for determining the variance of the gradient data of the current training step;
a fourth judging sub-module for judging whether the variance of the gradient data of the current training step is larger than a preset variance threshold, if so, triggering a seventh judging module, and if not, triggering an eighth judging module;
A seventh determining module, configured to determine that compression of the gradient data may cause the iterative training to fail to converge;
and the eighth judging module is used for judging that the compression of the gradient data does not lead to the failure of convergence of the iterative training.
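The warm-up test at the start of iterative training comes down to comparing the variance of the current step's gradient data with a preset variance threshold. A NumPy sketch; the threshold value is an assumption.

```python
import numpy as np

def compression_would_block_convergence(gradient: np.ndarray, variance_threshold: float) -> bool:
    """High gradient variance suggests compressing now could keep the iterative training from converging,
    i.e. the warm-up (preheating) phase is not over yet and gradients should not be compressed."""
    return float(np.var(gradient)) > variance_threshold

# usage: a noisy early-stage gradient versus a settled one
early = np.random.randn(10_000) * 5.0
late = np.random.randn(10_000) * 0.1
print(compression_would_block_convergence(early, variance_threshold=1.0))  # likely True
print(compression_would_block_convergence(late, variance_threshold=1.0))   # likely False
```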
As an alternative embodiment, the gradient compression device further comprises:
The third judging module is used for adding one to the continuous unqualified times of the optimized rate if the current performance optimized rate of the model is unqualified, judging whether the continuous unqualified times of the optimized rate reach a first preset time threshold, and triggering the first prompting module if the continuous unqualified times of the optimized rate reach the first preset time threshold;
The first prompting module is used for controlling the prompter to prompt that the optimization rate is continuously not up to standard for too high times;
and the first zero clearing module is used for clearing the times of continuous failure of the optimization rate if the current model performance optimization rate reaches the standard.
As an alternative embodiment, the gradient compression device further comprises:
the fourth judging module is used for adding one to the continuous exceeding times of the single-step training time length if the current single-step training time length exceeds the standard, judging whether the continuous exceeding times of the single-step training time length reach a second preset time threshold, and triggering a second prompting module if the continuous exceeding times of the single-step training time length reach the second preset time threshold;
The second prompting module is used for controlling the prompting device to prompt that the single-step training time duration is over high in continuous exceeding frequency;
and the second zero clearing module is used for clearing the continuous exceeding times of the single-step training duration if the current single-step training duration does not exceed the standard.
As an alternative embodiment, the gradient compression device further comprises:
And the modification module is used for responding to the standard modification instruction and modifying standard reaching standard of the model performance optimization rate and/or standard exceeding standard of the single-step training duration.
As an alternative embodiment, the second judging module 43 includes:
The third determining module is used for determining the training duration of the last training step as the current single-step training duration;
A fifth judging sub-module, configured to judge whether the current single-step training duration is greater than a preset duration threshold, if so, trigger a ninth judging module, and if not, trigger a tenth judging module;
A ninth judging module, configured to judge that the current single-step training duration exceeds the standard;
And a tenth judging module, configured to judge that the current single-step training duration does not exceed the standard.
As an alternative embodiment, the gradient compression device further comprises:
A fifth judging module, configured to judge whether a gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds a standard, and if the gradient distortion degree exceeds the standard, trigger the first adjusting module 42;
the triggering conditions of the second judging module 43 include:
if the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard.
As an alternative embodiment, the fifth judging module includes:
the fourth determining module is used for determining, according to the gradient distortion degree determination relational expression, the gradient distortion degree with which the gradient data of the current training step is compressed based on the current gradient compression degree;
the eleventh judging module is used for judging that the gradient distortion degree exceeds the standard if the gradient distortion degree is larger than a preset distortion degree threshold value;
A twelfth determining module, configured to determine that the gradient distortion degree does not exceed the standard if the gradient distortion degree is not greater than the preset distortion degree threshold;
The gradient distortion degree determination relation includes:
gradient distortion degree = ‖g_GC − g‖_2;
wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖g_GC − g‖_2 denotes the Euclidean distance between g_GC and g.
As an alternative embodiment, the gradient compression device further comprises:
the sixth judging module is used for adding one to the continuous exceeding number of times of the gradient distortion degree if the gradient distortion degree exceeds the standard, judging whether the continuous exceeding number of times of the gradient distortion degree reaches a third preset number of times threshold, and triggering a third prompting module if the continuous exceeding number of times of the gradient distortion degree reaches the third preset number of times threshold;
the third prompting module is used for controlling the prompter to prompt that the gradient distortion degree has exceeded the standard too many consecutive times;
and the third zero clearing module is used for clearing the continuous exceeding times of the gradient distortion degree if the gradient distortion degree does not exceed the standard.
For the description of the gradient compression apparatus provided in the embodiment of the present invention, reference is made to the foregoing embodiment of the gradient compression method, and the embodiment of the present invention is not repeated here.
For better illustrating an embodiment of the present invention, please refer to fig. 5, fig. 5 is a schematic structural diagram of a gradient compression device provided by the present invention, the gradient compression device includes:
A memory 51 for storing a computer program;
a processor 52 for implementing the steps of the gradient compression method in the previous embodiment when executing a computer program.
For the description of the gradient compression apparatus provided in the embodiment of the present invention, reference is made to the foregoing embodiment of the gradient compression method, and the embodiment of the present invention is not repeated herein.
The invention also provides a distributed cluster comprising a plurality of gradient compression devices as in the previous embodiments.
For the description of the distributed clusters provided in the embodiments of the present invention, reference is made to the foregoing embodiments of the gradient compression method, and the embodiments of the present invention are not repeated herein.
For a better explanation of the embodiments of the present invention, please refer to fig. 6, fig. 6 is a schematic structural diagram of a computer readable storage medium provided by the present invention, the computer readable storage medium 60 stores a computer program 61 thereon, and the computer program 61 implements the steps of the gradient compression method according to the previous embodiments when executed by the processor 52.
For the description of the computer readable storage medium provided in the embodiment of the present invention, please refer to the foregoing embodiment of the gradient compression method, and the embodiment of the present invention is not repeated here.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (18)
1. A gradient compression method, applied to any computing node in a distributed cluster, comprising:
for any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained;
If the current model performance optimization rate does not reach the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
If the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard;
If the current single-step training time length exceeds the standard, amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
Based on the latest gradient compression degree, compressing gradient data of the current training step by adopting a preset compression method so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training duration is related to gradient compression and network conditions;
Judging whether the current model performance optimization rate meets the standard comprises the following steps:
Judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number;
If the improvement amplitude of the model performance meets the standard, judging that the current model performance optimization rate meets the standard;
if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard;
Wherein the step number sliding window comprises a preset number of training steps;
judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number or not comprises the following steps:
determining, according to the loss function change rate determination relational expression, the change rate of the loss function of the local model in the last step number sliding window;
If the change rate of the loss function is smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance does not reach the standard;
If the change rate of the loss function is not smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance reaches the standard;
the loss function change rate determination relation includes:
ΔL(t) = ( L_avg(t) − L_min(t) ) / L_min(t), with L_avg(t) = (1/M) Σ_{τ=t−M+1}^{t} L(τ) and L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as a reference; L is the loss function of the local model; L_avg(t) represents the sliding average value of the loss function L in the last step number sliding window taking the current t-th training step as a reference, τ being the variable of the summation; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
2. The gradient compression method of claim 1, wherein the preset rate of change threshold comprises:
ε(t) = ε_0 · d(t);
wherein d(t) is a decay function related to the number of steps t of the training step, and ε_0 is a hyperparameter.
3. The gradient compression method according to claim 1, wherein the preset compression method includes a hybrid gradient compression method formed by combining a gradient quantization method and a gradient sparsification method.
4. A gradient compression method according to claim 3, wherein reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
clip_upper(2·Q(t)) · λ·S(t);
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_upper(·) represents a clip function that performs an upper-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
5. A gradient compression method according to claim 3, wherein amplifying the current gradient compression degree of the preset compression method according to the second preset rule comprises:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
The compression degree amplifying relational expression comprises:
clip_lower(Q(t)/2) · S(t)/λ;
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_lower(·) represents a clip function that performs a lower-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
6. The gradient compression method of claim 1, wherein for any one of the training steps after the warm-up phase of the local model iterative training, after obtaining gradient data for the current training step, determining whether the current model performance optimization rate meets the criteria comprises:
Judging whether compression of gradient data can cause the iterative training to be unable to converge according to the gradient data of the current training step at any training step after the iterative training of the local model begins;
if the compression of the gradient data would cause the iterative training to fail to converge, judging that the iterative training is in the preheating stage;
If the compression of the gradient data would not cause the iterative training to fail to converge, judging that the preheating stage of the iterative training has ended;
For any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained.
7. The gradient compression method of claim 6, wherein determining whether compression of the gradient data would result in failure to converge of iterative training based on the gradient data of the current training step comprises:
Determining the variance of gradient data of the current training step;
judging whether the variance of gradient data of the current training step is larger than a preset variance threshold value or not;
if the variance is greater than the preset variance threshold, judging that the compression of the gradient data would cause the iterative training to fail to converge;
if the variance is not greater than the preset variance threshold, judging that the compression of the gradient data would not cause the iterative training to fail to converge.
8. The gradient compression method according to claim 1, wherein after determining whether the current model performance optimization rate meets the standard, the gradient compression method further comprises:
If the current model performance optimization rate does not reach the standard, adding one to the number of consecutive times the optimization rate has failed to reach the standard, and judging whether this number reaches a first preset times threshold;
if the first preset times threshold is reached, controlling a prompter to prompt that the optimization rate has failed to reach the standard too many consecutive times;
And if the current model performance optimization rate reaches the standard, clearing the number of consecutive times the optimization rate has failed to reach the standard.
9. The gradient compression method according to claim 1, wherein after determining whether the current single step training period exceeds the standard, the gradient compression method further comprises:
If the current single-step training time length exceeds the standard, adding one to the single-step training time length continuous exceeding times, and judging whether the single-step training time length continuous exceeding times reach a second preset time threshold value or not;
if the second preset times threshold is reached, controlling the prompter to prompt that the single-step training duration has exceeded the standard too many consecutive times;
If the current single-step training time length does not exceed the standard, the continuous exceeding times of the single-step training time length are cleared.
10. The gradient compression method of claim 1, further comprising:
in response to the standard modification instruction, a standard-up standard for the model performance optimization rate and/or a standard-exceeding standard for the single step training duration are modified.
11. The gradient compression method of claim 1, wherein determining whether the current single step training period exceeds a standard comprises:
determining the training duration of the last training step as the current single-step training duration;
Judging whether the current single-step training time length is greater than a preset time length threshold value or not;
If it is greater than the preset duration threshold, judging that the current single-step training duration exceeds the standard;
If it is not greater than the preset duration threshold, judging that the current single-step training duration does not exceed the standard.
12. The gradient compression method according to any one of claims 1 to 11, wherein for any one of the training steps after the warm-up phase of the local model iterative training, before the gradient data of the current training step is obtained and the current gradient compression degree of the preset compression method is reduced according to the first preset rule, the gradient compression method further comprises:
judging whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard or not;
If the current model performance optimization rate does not reach the standard, the step of reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises the following steps:
If the current model performance optimization rate does not reach the standard and/or the gradient distortion degree exceeds the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard comprises the following steps:
And if the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard, judging whether the current single-step training time length is out of standard.
13. The gradient compression method of claim 12, wherein determining whether a gradient distortion degree by which gradient data of a current training step is compressed based on a current gradient compression degree exceeds a standard comprises:
determining, according to the gradient distortion degree determination relational expression, the gradient distortion degree with which the gradient data of the current training step is compressed based on the current gradient compression degree;
If the gradient distortion is greater than a preset distortion threshold, judging that the gradient distortion exceeds the standard;
If the gradient distortion is not greater than the preset distortion threshold, judging that the gradient distortion is not out of standard;
The gradient distortion degree determination relation comprises:
gradient distortion degree = ‖g_GC − g‖_2;
wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖g_GC − g‖_2 denotes the Euclidean distance between g_GC and g.
14. The gradient compression method according to claim 12, wherein after judging whether or not a gradient distortion degree for compressing gradient data of a current training step based on a current gradient compression degree exceeds a standard, the gradient compression method further comprises:
if the gradient distortion degree exceeds the standard, adding one to the continuous exceeding number of the gradient distortion degree, and judging whether the continuous exceeding number of the gradient distortion degree reaches a third preset number threshold;
if the third preset number of times threshold is reached, the prompter is controlled to prompt that the gradient distortion degree has exceeded the standard too many consecutive times;
And if the gradient distortion degree does not exceed the standard, clearing the continuous exceeding frequency of the gradient distortion degree.
15. A gradient compression apparatus for use with any one of computing nodes in a distributed cluster, comprising:
The first judging module is used for judging whether the current model performance optimization rate meets the standard or not for any training step after the preheating stage of the local model iterative training after the gradient data of the current training step is obtained, triggering the first adjusting module if the current model performance optimization rate does not meet the standard, and triggering the second judging module if the current model performance optimization rate meets the standard;
The first adjusting module is used for reducing the current gradient compression degree of the preset compression method according to a first preset rule;
the second judging module is used for judging whether the current single-step training time length exceeds the standard, and triggering the second adjusting module if the current single-step training time length exceeds the standard;
the second adjusting module is used for amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
The compression module is used for compressing gradient data of the current training step by adopting a preset compression method based on the latest gradient compression degree so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training duration is related to gradient compression and network conditions;
The first judging module includes:
The first judging sub-module is used for judging whether the lifting amplitude of the model performance meets the standard in the previous step number sliding window, triggering the first judging module if the lifting amplitude of the model performance meets the standard, and triggering the second judging module if the lifting amplitude of the model performance does not meet the standard;
the first judging module is used for judging that the current model performance optimization rate reaches the standard;
the second judging module is used for judging that the current model performance optimization rate does not reach the standard;
The step number sliding window comprises a preset number of training steps;
the first judgment submodule includes:
The first determining module is used for determining, according to the loss function change rate determination relational expression, the change rate of the loss function of the local model in the last step number sliding window;
the third judging module is used for judging that the improvement amplitude of the model performance does not reach the standard if the change rate of the loss function is smaller than a preset change rate threshold value;
The fourth judging module is used for judging that the improvement amplitude of the performance of the model reaches the standard if the change rate of the loss function is not smaller than a preset change rate threshold value;
the loss function change rate determination relation includes:
ΔL(t) = ( L_avg(t) − L_min(t) ) / L_min(t), with L_avg(t) = (1/M) Σ_{τ=t−M+1}^{t} L(τ) and L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as a reference; L is the loss function of the local model; L_avg(t) represents the sliding average value of the loss function L in the last step number sliding window taking the current t-th training step as a reference, τ being the variable of the summation; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
16. A gradient compression apparatus, comprising:
A memory for storing a computer program;
processor for implementing the steps of the gradient compression method as claimed in any one of claims 1 to 14 when executing the computer program.
17. A distributed cluster system comprising a plurality of gradient compression apparatuses as claimed in claim 16.
18. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the gradient compression method as claimed in any of claims 1 to 14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410317335.XA CN117910521B (en) | 2024-03-20 | 2024-03-20 | Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410317335.XA CN117910521B (en) | 2024-03-20 | 2024-03-20 | Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117910521A CN117910521A (en) | 2024-04-19 |
CN117910521B true CN117910521B (en) | 2024-06-14 |
Family
ID=90686309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410317335.XA Active CN117910521B (en) | 2024-03-20 | 2024-03-20 | Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117910521B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472347A (en) * | 2018-10-15 | 2019-03-15 | 中山大学 | A kind of gradient compression method of distribution deep learning |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109951438B (en) * | 2019-01-15 | 2020-11-20 | 中国科学院信息工程研究所 | Communication optimization method and system for distributed deep learning |
CN110245743A (en) * | 2019-05-23 | 2019-09-17 | 中山大学 | A kind of asynchronous distributed deep learning training method, apparatus and system |
CN110309283B (en) * | 2019-06-28 | 2023-03-21 | 创新先进技术有限公司 | Answer determination method and device for intelligent question answering |
WO2022003562A1 (en) * | 2020-06-29 | 2022-01-06 | King Abdullah University Of Science And Technology | Statistical-based gradient compression method for distributed training system |
CN113988266A (en) * | 2021-11-01 | 2022-01-28 | 南京大学 | Top-k-based adaptive distributed gradient compression method supporting complex network conditions |
CN115146119A (en) * | 2022-07-22 | 2022-10-04 | 上海燧原科技有限公司 | Compression training method, device, equipment and storage medium for distributed gradient |
WO2024050659A1 (en) * | 2022-09-05 | 2024-03-14 | 华南理工大学 | Federated learning lower-side cooperative channel adaptive gradient compression method |
CN115719093A (en) * | 2022-11-22 | 2023-02-28 | 京东科技信息技术有限公司 | Distributed training method, device, system, storage medium and electronic equipment |
CN116484946A (en) * | 2023-05-19 | 2023-07-25 | 平安科技(深圳)有限公司 | Model parameter adjustment method, device, equipment and medium based on dynamic compression |
CN116739107A (en) * | 2023-06-09 | 2023-09-12 | 平安科技(深圳)有限公司 | Gradient quantization method, device, equipment and storage medium based on federal learning |
- 2024-03-20 CN CN202410317335.XA patent/CN117910521B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472347A (en) * | 2018-10-15 | 2019-03-15 | 中山大学 | A kind of gradient compression method of distribution deep learning |
Non-Patent Citations (1)
Title |
---|
A Gradient Compression Algorithm for Deep Learning Based on 4-Bit Encoding (一种基于4Bit编码的深度学习梯度压缩算法); Jiang Wenbin et al.; Computer Science (计算机科学); 2020-07-31; Vol. 47, No. 7; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117910521A (en) | 2024-04-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||