CN117910521B - Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium
- Publication number
- CN117910521B (application CN202410317335.XA)
- Authority
- CN
- China
- Prior art keywords
- gradient
- training
- current
- standard
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Mathematics (AREA)
- Algebra (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Databases & Information Systems (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a gradient compression method, a gradient compression device, gradient compression equipment, a distributed cluster and a storage medium, belonging to the field of distributed computing. The gradient compression degree is adjusted with reference to two indexes, namely the model performance optimization rate and the current single-step training duration, thereby solving the problem that model performance and communication overhead cannot be balanced when gradient compression is performed over a low-speed network. According to the invention, taking a single training step as the granularity, after gradient data is obtained in any training step after the preheating stage, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
Description
Technical Field
The present invention relates to the field of distributed computing, and in particular, to a gradient compression method, apparatus, device, distributed cluster, and storage medium.
Background
With the development of large language models (LLM, Large Language Model), training deep learning models requires ever more computing resources, so distributed clusters are increasingly used for distributed computation. During model training, any computing node in the distributed cluster needs to send the gradient data generated in the iterative training process to the other computing nodes in the cluster over the network, so that each computing node can update its model parameters.
During model training, the distributed cluster must transmit a huge amount of gradient data over the network. To cope with low-speed, unstable networks, the gradient data can be compressed before transmission; compressing the gradient data reduces the communication cost, but it may also affect model performance. How to balance model performance and communication cost when compressing gradient data is therefore a difficult problem to be solved.
Therefore, how to provide a solution to the above technical problem is a problem that persons skilled in the art currently need to solve.
Disclosure of Invention
The invention aims to provide a gradient compression method, apparatus, device, distributed cluster and storage medium. Taking a single training step as the granularity, for any training step after the preheating stage, once the gradient data of that step has been obtained, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
In order to solve the above technical problems, the present invention provides a gradient compression method applied to any computing node in a distributed cluster, including:
for any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained;
If the current model performance optimization rate does not reach the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
If the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard;
If the current single-step training time length exceeds the standard, amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
Based on the latest gradient compression degree, compressing gradient data of the current training step by adopting a preset compression method so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
On the other hand, judging whether the current model performance optimization rate meets the standard comprises:
Judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number;
If the improvement amplitude of the model performance meets the standard, judging that the current model performance optimization rate meets the standard;
if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard;
the step number sliding window comprises a preset number of training steps.
On the other hand, judging whether the improvement amplitude of the model performance reaches the standard in the previous step number sliding window comprises the following steps:
determining a relation according to the change rate of the loss function, and determining the change rate of the loss function of the local model in the last step sliding window;
If the change rate of the loss function is smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance does not reach the standard;
If the change rate of the loss function is not smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance reaches the standard;
the loss function change rate determination relation includes:
ΔL(t) = ( L̄_M(t) − L_min(t) ) / L̄_M(t),   with   L̄_M(t) = (1/M) · Σ_{τ = t−M+1}^{t} L(τ)   and   L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
Wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as reference; L is the loss function of the local model; L̄_M(t) represents the sliding average of the loss function L in the last step number sliding window, taking the current t-th training step as reference, where τ is the summation variable; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
In another aspect, the preset rate of change threshold includes:
δ_th(t) = δ₀ · f(t);
Wherein f(t) is a decay function related to the number t of the training step, and δ₀ is a hyper-parameter.
In another aspect, the preset compression method includes a mixed gradient compression method combined with a gradient quantization method and a gradient thinning method.
On the other hand, reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
Q(t) → Clip_upper( 2·Q(t) ),   S(t) → λ·S(t);
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_upper(·) represents a clip function for performing an upper-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
On the other hand, amplifying the current gradient compression degree of the preset compression method according to the second preset rule includes:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
The compression degree amplifying relational expression comprises:
Q(t) → Clip_lower( Q(t)/2 ),   S(t) → S(t)/λ;
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_lower(·) represents a clip function for performing a lower-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
On the other hand, for any training step after the preheating stage of the local model iterative training, after gradient data of the current training step is obtained, judging whether the current model performance optimization rate meets the standard comprises:
Judging whether compression of gradient data can cause the iterative training to be unable to converge according to the gradient data of the current training step at any training step after the iterative training of the local model begins;
if compression of the gradient data would cause the iterative training to fail to converge, judging that the iterative training is in the preheating stage;
If compression of the gradient data would not cause the iterative training to fail to converge, judging that the preheating stage of the iterative training has ended;
For any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained.
On the other hand, according to the gradient data of the current training step, judging whether the compression of the gradient data can cause the iterative training to fail to converge includes:
Determining the variance of gradient data of the current training step;
judging whether the variance of gradient data of the current training step is larger than a preset variance threshold value or not;
if the variance is larger than the preset variance threshold, judging that compression of the gradient data would cause the iterative training to fail to converge;
if the variance is not larger than the preset variance threshold, judging that compression of the gradient data would not cause the iterative training to fail to converge.
On the other hand, after judging whether the current model performance optimization rate meets the standard, the gradient compression method further comprises the following steps:
If the current model performance optimization rate does not reach the standard, adding one to the number of consecutive times the optimization rate has failed to reach the standard, and judging whether this count reaches a first preset count threshold;
if it does, controlling the prompter to prompt that the optimization rate has failed to reach the standard too many consecutive times;
And if the current model performance optimization rate reaches the standard, resetting the count of consecutive optimization-rate failures to zero.
On the other hand, after judging whether the current single-step training time length exceeds the standard, the gradient compression method further comprises the following steps:
If the current single-step training time length exceeds the standard, adding one to the single-step training time length continuous exceeding times, and judging whether the single-step training time length continuous exceeding times reach a second preset time threshold value or not;
if the single-step training time is up, the control prompter prompts that the continuous exceeding frequency of the single-step training time is too high;
If the current single-step training time length does not exceed the standard, the continuous exceeding times of the single-step training time length are cleared.
In another aspect, the gradient compression method further comprises:
in response to the standard modification instruction, a standard-up standard for the model performance optimization rate and/or a standard-exceeding standard for the single step training duration are modified.
On the other hand, judging whether the current single-step training time length exceeds the standard comprises the following steps:
determining the training duration of the last training step as the current single-step training duration;
Judging whether the current single-step training time length is greater than a preset time length threshold value or not;
If the training time is greater than the standard, judging that the current single-step training time exceeds the standard;
If the training time is not greater than the preset training time, judging that the current single-step training time is not out of standard.
On the other hand, for any training step after the preheating stage of the local model iterative training, after gradient data of the current training step are obtained, before the current gradient compression degree of the preset compression method is reduced according to the first preset rule, the gradient compression method further comprises:
judging whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard or not;
If the current model performance optimization rate does not reach the standard, the step of reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises the following steps:
If the current model performance optimization rate does not reach the standard and/or the gradient distortion degree exceeds the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard comprises the following steps:
And if the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard, judging whether the current single-step training time length is out of standard.
On the other hand, determining whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard includes:
Determining a relation according to the gradient distortion degree, and determining the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree;
If the gradient distortion is greater than a preset distortion threshold, judging that the gradient distortion exceeds the standard;
If the gradient distortion is not greater than the preset distortion threshold, judging that the gradient distortion is not out of standard;
The gradient distortion degree determination relation comprises:
D(t) = ‖ g_GC − g ‖₂;
Wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖ g_GC − g ‖₂ represents the Euclidean distance between g_GC and g.
On the other hand, after judging whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard, the gradient compression method further comprises:
if the gradient distortion degree exceeds the standard, adding one to the continuous exceeding number of the gradient distortion degree, and judging whether the continuous exceeding number of the gradient distortion degree reaches a third preset number threshold;
if the number of the continuous exceeding standard times of the gradient distortion degree is too high, the control prompter prompts;
And if the gradient distortion degree does not exceed the standard, clearing the continuous exceeding frequency of the gradient distortion degree.
In order to solve the above technical problem, the present invention further provides a gradient compression device, which is applied to any computing node in a distributed cluster, including:
The first judging module is used for judging whether the current model performance optimization rate meets the standard or not for any training step after the preheating stage of the local model iterative training after the gradient data of the current training step is obtained, triggering the first adjusting module if the current model performance optimization rate does not meet the standard, and triggering the second judging module if the current model performance optimization rate meets the standard;
The first adjusting module is used for reducing the current gradient compression degree of the preset compression method according to a first preset rule;
The second judging module is used for judging whether the current single-step training time length exceeds the standard, and triggering the second adjusting module if the current single-step training time length exceeds the standard;
the second adjusting module is used for amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
The compression module is used for compressing gradient data of the current training step by adopting a preset compression method based on the latest gradient compression degree so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
In order to solve the technical problem, the present invention further provides a gradient compression device, including:
A memory for storing a computer program;
a processor for implementing the steps of the gradient compression method as described above when executing the computer program.
In order to solve the technical problem, the invention also provides a distributed cluster which comprises a plurality of gradient compression devices.
To solve the above technical problem, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the gradient compression method as described above.
The beneficial effects are that: the invention provides a gradient compression method. Considering that gradient data needs to be compressed after every training step, that the gradient compression degree influences both the optimization speed of the model performance and the single-step training duration, and that the single-step training duration is correlated with the network condition, the method takes a single training step as the granularity: for any training step after the preheating stage, once the gradient data of that step has been obtained, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
The invention also provides a gradient compression device, equipment, a distributed cluster and a computer readable storage medium, which have the same beneficial effects as the gradient compression method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly explain the related art and the drawings required to be used in the embodiments, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic flow chart of a gradient compression method provided by the invention;
Fig. 2 is a schematic structural diagram of a distributed cluster according to the present invention;
FIG. 3 is a schematic flow chart of another gradient compression method according to the present invention;
FIG. 4 is a schematic diagram of a gradient compression device according to the present invention;
FIG. 5 is a schematic diagram of a gradient compression apparatus according to the present invention;
Fig. 6 is a schematic structural diagram of a computer readable storage medium according to the present invention.
Detailed Description
The invention provides a gradient compression method, a device, equipment, a distributed cluster and a storage medium, wherein a single training step is taken as granularity, after gradient data is obtained in any training step after a preheating stage, the gradient compression degree is reduced under the condition that the model performance optimization rate does not reach the standard so as to improve the model performance, and the gradient compression degree can be amplified under the condition that the model performance optimization rate reaches the standard and the current single step training time length exceeds the standard so as to reduce the communication cost.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flow chart of a gradient compression method provided by the present invention, where the gradient compression method is applied to any computing node in a distributed cluster, and includes:
S101: for any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained;
for better explaining the embodiments of the present invention, please refer to fig. 2, fig. 2 is a schematic structural diagram of a distributed cluster provided by the present invention, and the gradient compression method in the embodiments of the present invention may be applied to any computing node in the distributed cluster.
Specifically, considering the technical problems in the background art and considering that gradient data needs to be compressed after any training step, the gradient compression degree can influence the optimization speed of model performance and the single-step training time length respectively, so that the invention aims to seek the adjustment of the gradient compression degree by taking a single training step in the iterative training process of the local model as granularity, thereby realizing the dynamic adjustment of the gradient compression degree and adjusting the balance between the model performance and communication overhead more timely and flexibly; meanwhile, considering that the model performance optimization rate determines the performance of the final local model to a large extent in the iterative training process, based on the fact that any training step after the preheating stage of the iterative training of the local model can be subjected to the gradient data of the current training step, whether the current model performance optimization rate meets the standard is judged, and the judgment result is used as a data basis of the follow-up step so as to guarantee the model performance optimization rate by timely adjusting the gradient compression degree.
Considering that the initial stage of the iterative training of the local model in the distributed cluster is not a good basis, gradient data and model parameters cannot be directly introduced into gradient compression, which causes the iterative training of the local model to be unable to converge, so that a preheating stage can be divided at the initial stage of the iterative training, the original uncompressed gradient data can be transmitted in a network manner at the stage so as to lay a basis for the iterative training of the local model in the distributed cluster, and the compression of the gradient data can be performed after the preheating stage so as to save communication cost, so that in the embodiment of the invention, after the gradient data of the current training stage is obtained, whether the current model performance optimization rate reaches the standard or not can be judged.
S102: if the current model performance optimization rate does not reach the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
Specifically, the "model performance optimization rate" determines the performance of the final local model to a larger extent, and the performance of the local model is an index focused on, so that in the embodiment of the invention, the current gradient compression degree of the preset compression method can be reduced according to the first preset rule under the condition that the current model performance optimization rate does not reach the standard, the reduction of the gradient compression degree represents the reduction of the compression degree of gradient data, and the model performance is improved by being beneficial to learning more features in the local training data in the model parameter optimization process.
S103: if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard;
Specifically, considering that after model performance, communication overhead and training time length which is in direct proportion to the communication overhead are indexes which need to be focused, and the two indexes are related to single-step training time length in a local model iterative training process, in addition, fluctuation of network bandwidth also affects the single-step training time length, therefore, under the condition that the current model performance optimization rate reaches the standard, the embodiment of the invention can judge whether the current single-step training time length exceeds the standard or not so as to trigger subsequent actions according to a judging result, thereby combining the influence of network conditions on the single-step training time length and possibly reducing the communication overhead.
Specifically, the single-step training duration may include the calculation time of a single training step and the communication time consumed by gradient data transmission; the influencing factors of the communication time include the gradient compression degree and the network condition. In practical applications, a certain overlap may exist between the calculation time and the communication time.
S104: if the current single-step training time length exceeds the standard, amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
the single-step training duration theoretically has a certain theoretical interval, namely a standard in the exceeding of the single-step training duration, and the single-step training duration can be considered to have a down-regulating space under the condition that the single-step training duration exceeds the standard, and the single-step training duration does not have the down-regulating space under the condition that the single-step training duration does not exceed the standard, so that the current gradient compression degree of the preset compression method can be amplified under the condition that the current single-step training duration exceeds the standard, so that the single-step training duration is adjusted down, and the training duration of the model is reduced.
The current gradient compression degree of the preset compression method can be amplified through the second preset rule, so that higher flexibility is achieved, the first preset rule and the second preset rule can be flexibly and autonomously set, and the embodiment of the invention is not limited herein.
S105: based on the latest gradient compression degree, compressing gradient data of the current training step by adopting a preset compression method so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
Specifically, after the adjustment of the steps, based on the judgment of the model performance optimization rate and the single-step training duration, the adjustment of the gradient compression degree in the training step is completed, so that the gradient data of the current training step can be compressed by adopting a preset compression method based on the latest gradient compression degree in the step, and the compressed gradient data can be synchronized in the distributed cluster.
The invention provides a gradient compression method. Considering that gradient data needs to be compressed after every training step, that the gradient compression degree influences both the optimization speed of the model performance and the single-step training duration, and that the single-step training duration is correlated with the network condition, the method takes a single training step as the granularity: for any training step after the preheating stage, once the gradient data of that step has been obtained, the gradient compression degree is reduced when the model performance optimization rate does not reach the standard, so as to improve model performance; and when the model performance optimization rate reaches the standard and the current single-step training duration exceeds the standard, the gradient compression degree can be amplified so as to reduce communication overhead.
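For illustration only (the following sketch is not part of the patent text), the per-step decision logic of S101 to S105 can be outlined in Python roughly as follows; the class name, default window size, thresholds, initial quantization precision and sparsity are all assumptions made for the example:

```python
import numpy as np

class CompressionController:
    """Illustrative sketch of the S101-S105 decision loop (assumed parameter values)."""

    def __init__(self, window=50, rate_threshold=0.01, step_time_limit=2.0, lam=2.0):
        self.losses = []                    # loss history feeding the step-number sliding window
        self.window = window                # M: training steps per sliding window
        self.rate_threshold = rate_threshold
        self.step_time_limit = step_time_limit
        self.bits, self.sparsity = 8, 0.1   # current quantization precision Q(t) and sparsity S(t)
        self.lam = lam                      # preset adjustment parameter, lambda > 1

    def optimization_rate_ok(self):
        if len(self.losses) < self.window:
            return True                     # not enough history yet
        recent = np.array(self.losses[-self.window:])
        rate = (recent.mean() - recent.min()) / recent.mean()
        return rate >= self.rate_threshold

    def adjust(self, loss, last_step_duration):
        """S101-S104: update the gradient compression degree for the current training step."""
        self.losses.append(loss)
        if not self.optimization_rate_ok():              # S102: compress less
            self.bits = min(32, 2 * self.bits)
            self.sparsity = min(1.0, self.lam * self.sparsity)
        elif last_step_duration > self.step_time_limit:  # S104: compress more
            self.bits = max(1, self.bits // 2)
            self.sparsity = max(1e-4, self.sparsity / self.lam)
        return self.bits, self.sparsity
```

The returned pair then parameterizes the preset compression method in S105 before the compressed gradient is synchronized across the cluster.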
Based on the above embodiments:
as an alternative embodiment, determining whether the current model performance optimization rate meets the criteria includes:
Judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number;
If the improvement amplitude of the model performance meets the standard, judging that the current model performance optimization rate meets the standard;
if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard;
The step number sliding window comprises a preset number of training steps.
Specifically, considering that the improvement amplitude of the model performance can be estimated for each training step, and the improvement amplitude of the model performance of a plurality of past training steps can represent the current model performance optimization rate of the local model, the invention presets a step number sliding window, can judge whether the improvement amplitude of the model performance reaches the standard in the last step number sliding window, and judges that the current model performance optimization rate reaches the standard if the improvement amplitude of the model performance reaches the standard; and if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard.
The preset number of training steps included in the step number sliding window can be set autonomously, and the embodiment of the invention is not limited herein.
Of course, besides this specific manner, it may also be determined in other manners whether the current model performance optimization rate meets the standard, which is not limited herein.
As an alternative embodiment, determining whether the improvement amplitude of the model performance meets the standard in the previous step number sliding window includes:
determining a relation according to the change rate of the loss function, and determining the change rate of the loss function of the local model in the last step sliding window;
if the change rate of the loss function is smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance does not reach the standard;
if the change rate of the loss function is not smaller than the preset change rate threshold, judging that the improvement amplitude of the model performance reaches the standard;
the loss function change rate determination relation includes:
ΔL(t) = ( L̄_M(t) − L_min(t) ) / L̄_M(t),   with   L̄_M(t) = (1/M) · Σ_{τ = t−M+1}^{t} L(τ)   and   L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
Wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as reference; L is the loss function of the local model; L̄_M(t) represents the sliding average of the loss function L in the last step number sliding window, taking the current t-th training step as reference, where τ is the summation variable; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
Specifically, considering that the loss function can accurately evaluate the change of the model performance between each training step, and the change rate of the loss function can represent the optimization rate of the model performance, in the embodiment of the invention, a relational expression can be determined according to the change rate of the loss function, the change rate of the loss function in the last step sliding window of the local model is determined, the change rate of the loss function is compared with a preset change rate threshold, if the change rate of the loss function is smaller than the preset change rate threshold, the improvement amplitude of the model performance is judged to be not up to standard, and if the change rate of the loss function is not smaller than the preset change rate threshold, the improvement amplitude of the model performance is judged to be up to standard.
Specifically, in the above loss function change rate determination relation, the subtraction in the numerator indicates how much the currently achieved minimum loss function value has decreased compared with the sliding average of the loss function L in the last step number sliding window; dividing by that sliding average gives the proportion by which the minimum loss function value has decreased, and whether this proportional change is significant enough can be determined by comparing it with the preset change rate threshold.
The loss function change rate of the local model in the last step sliding window can be determined efficiently and accurately by the loss function change rate determination relation.
Of course, the loss function change rate determination relation may be in other specific forms besides the above specific forms, and embodiments of the present invention are not limited herein.
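As a non-authoritative illustration of the above relation, the loss function change rate over the last M training steps could be computed as follows; the window size M = 50 and the 0.01 threshold in the usage line are assumed values:

```python
import numpy as np

def loss_change_rate(loss_history, M=50):
    """Change rate of the loss over the last M training steps (step-number sliding window)."""
    window = np.array(loss_history[-M:])
    moving_avg = window.mean()           # sliding average of L in the window
    best = window.min()                  # smallest loss function value in the window
    return (moving_avg - best) / moving_avg

# usage: the improvement amplitude meets the standard if the rate is not below the threshold
losses = [2.0 - 0.01 * i for i in range(100)]
meets_standard = loss_change_rate(losses) >= 0.01
```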
As an alternative embodiment, the preset change rate threshold comprises:
δ_th(t) = δ₀ · f(t);
Wherein f(t) is a decay function related to the number t of the training step, and δ₀ is a hyper-parameter.
Specifically, considering that the optimization rate of the model performance also presents a trend from fast to slow along with the deep training stage of the local model, the preset change rate threshold should be in a trend of attenuation in theory, so that in order to more accurately judge whether the model performance optimization rate of different training steps meets the standard, the core of the preset change rate threshold in the embodiment of the invention is an attenuation function related to the number of the training steps, and is matched with a super-parameter for fine adjustment.
Specifically, the attenuation function may be of various types, and may include, for example, a staged attenuation, an exponential attenuation, a cosine attenuation, etc., which are not limited herein.
Of course, the preset change rate threshold may take a variety of forms other than this specific form, and embodiments of the present invention are not limited herein.
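A minimal sketch of one possible form of the threshold, assuming an exponential decay function (only one of the decay types mentioned above) together with assumed values for the hyper-parameter and the decay constant:

```python
import math

def change_rate_threshold(t, base=0.05, decay=1e-4):
    """Preset change-rate threshold: a hyper-parameter scaled by a decay function of step t."""
    return base * math.exp(-decay * t)   # staged or cosine decay could be used instead
```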
As an alternative embodiment, the preset compression method includes a mixed gradient compression method by a combination of gradient quantization and gradient thinning.
Specifically, considering that the essential principle of the gradient quantization method is to reduce the bit number required for characterizing a single communication data (i.e., gradient data), while the essential principle of the gradient thinning method is to reduce the number of communication data, these two methods do not have contradictions, and there are conditions of common use, and if only one of them is used, the compression degree and the compression effect are limited (because the sensitivity of the quantization method and the thinning method to gradient data with different characteristics is different), the preset compression method in the embodiment of the present invention may include a hybrid gradient compression method that combines the gradient quantization method and the gradient thinning method, and when specifically implemented, the gradient data to be compressed may be thinned by the gradient thinning method first, and then quantized by the gradient quantization method, and the embodiment of the present invention is not limited herein.
The gradient quantization method and the gradient sparsification method may each be of various specific types; for example, the quantization range of the gradient quantization method may be flexibly set (for example, from 32 bit down to 1 bit), and the specific gradient sparsification method may be the Top-K method (i.e., the K gradient values with the largest magnitude are retained out of the N_g gradient values), etc., which is not limited herein.
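A simplified, non-authoritative sketch of such a hybrid scheme: Top-K sparsification first, followed by a symmetric uniform quantizer. The specific quantizer and all parameter values are assumptions made for illustration:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k gradient entries with the largest magnitude; return values and indices."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return grad[idx], idx

def uniform_quantize(values, bits):
    """Uniformly quantize values to the given bit width (symmetric, per-tensor scale)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(values)) / levels if levels > 0 else 1.0
    q = np.round(values / scale).astype(np.int32)
    return q, scale

def hybrid_compress(grad, sparsity, bits):
    """Mixed gradient compression: Top-K sparsification followed by quantization."""
    k = max(1, int(sparsity * grad.size))
    values, idx = topk_sparsify(grad, k)
    q, scale = uniform_quantize(values, bits)
    return q, idx, scale
```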
As an optional embodiment, reducing the current gradient compression degree of the preset compression method according to the first preset rule includes:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
Q(t) → Clip_upper( 2·Q(t) ),   S(t) → λ·S(t);
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_upper(·) represents a clip function for performing an upper-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
Specifically, the gradient compression degree can be quickly and appropriately adjusted down by the compression degree reduction relational expression, wherein Clip_upper(2·Q(t)) represents the adjustment of the gradient compression degree corresponding to the gradient quantization strategy function, specifically embodied by up-regulating the current quantization precision Q(t) to 2Q(t), with the highest quantization precision supported by the system as an upper limit, so the clip function is used to perform an upper-limit clipping operation; λ·S(t) represents the adjustment of the gradient compression degree corresponding to the gradient sparsification strategy, embodied by multiplying the current sparsity S(t) by λ.
As an alternative embodiment, amplifying the current gradient compression degree of the preset compression method according to the second preset rule includes:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
the compression degree amplification relation includes:
Q(t) → Clip_lower( Q(t)/2 ),   S(t) → S(t)/λ;
Wherein Q(t) represents the gradient quantization strategy function related to the number t of the training step, S(t) represents the gradient sparsification strategy function related to the number t of the training step, Clip_lower(·) represents a clip function for performing a lower-limit clipping operation on the quantization precision, λ is a preset adjustment parameter, and λ > 1.
Specifically, the gradient compression degree can be quickly and appropriately amplified by the compression degree amplification relational expression, wherein Clip_lower(Q(t)/2) represents the amplification of the gradient compression degree corresponding to the gradient quantization strategy function, specifically embodied by adjusting the current quantization precision Q(t) down to Q(t)/2, with the minimum quantization precision of the system as a lower limit, so the clip function is used to perform a lower-limit clipping operation; the latter half of the expression represents the adjustment of the gradient compression degree corresponding to the gradient sparsification strategy, embodied by multiplying the current sparsity S(t) by 1/λ.
Of course, the compression degree reducing relational expression and the compression degree enlarging relational expression may be other specific forms besides the compression degree reducing relational expression and the compression degree enlarging relational expression described above, and the embodiment of the present invention is not limited thereto.
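As an illustrative sketch only, the two relational expressions above could be realized as follows, with the system's highest and lowest quantization precisions assumed to be 32 bit and 1 bit:

```python
def clip_upper(bits, max_bits=32):
    """Upper-limit clipping of the quantization precision."""
    return min(bits, max_bits)

def clip_lower(bits, min_bits=1):
    """Lower-limit clipping of the quantization precision."""
    return max(bits, min_bits)

def reduce_compression_degree(Q, S, lam=2.0):
    """First preset rule: Q(t) -> Clip_upper(2*Q(t)), S(t) -> lam*S(t)."""
    return clip_upper(2 * Q), min(lam * S, 1.0)

def amplify_compression_degree(Q, S, lam=2.0):
    """Second preset rule: Q(t) -> Clip_lower(Q(t)/2), S(t) -> S(t)/lam."""
    return clip_lower(Q // 2), S / lam
```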
As an alternative embodiment, for any training step after the warm-up phase of the local model iterative training, after obtaining gradient data of the current training step, determining whether the current model performance optimization rate meets the standard includes:
Judging whether compression of gradient data can cause the iterative training to be unable to converge according to the gradient data of the current training step at any training step after the iterative training of the local model begins;
if compression of the gradient data would cause the iterative training to fail to converge, judging that the iterative training is in the preheating stage;
If compression of the gradient data would not cause the iterative training to fail to converge, judging that the preheating stage of the iterative training has ended;
For any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained.
Specifically, in order to more accurately identify the preheating stage of the local model iterative training, whether the iterative training cannot be converged or not can be reflected by considering the gradient data of a single training step, and further whether the iterative training is in the preheating stage or not is determined.
Of course, the identification of the preheating stage may be performed in other ways besides this specific way, and embodiments of the present invention are not limited herein.
As an alternative embodiment, determining whether compression of the gradient data would result in failure of convergence of the iterative training based on the gradient data of the current training step includes:
Determining the variance of gradient data of the current training step;
judging whether the variance of gradient data of the current training step is larger than a preset variance threshold value or not;
if the variance is larger than the preset variance threshold, judging that compression of the gradient data would cause the iterative training to fail to converge;
if the variance is not larger than the preset variance threshold, judging that compression of the gradient data would not cause the iterative training to fail to converge.
Specifically, considering the variance of the gradient data, whether the local model belongs to the preheating stage at the current moment or not, namely, whether the local model is suitable for compressing the gradient data or not can be reflected, so that in the embodiment of the invention, whether the variance of the gradient data of the current training step is larger than the preset variance threshold value or not can be judged, if so, the compression of the gradient data can be judged to cause the non-convergence of iterative training, and if not, the compression of the gradient data can be judged to not cause the non-convergence of iterative training, thereby being efficient and accurate.
In the preheating stage, the gradient compression degrees of the gradient quantization method and the gradient sparsification method may be set to the lowest, for example a state in which no quantization or sparsification is performed.
Of course, in addition to this specific form, "determining whether compression of gradient data will cause the iterative training to fail to converge according to gradient data of the current training step" may be performed in other manners, and embodiments of the present invention are not limited herein.
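A minimal sketch of the variance-based preheating check, assuming a hypothetical preset variance threshold:

```python
import numpy as np

def warmup_has_ended(grad, variance_threshold=1.0):
    """Compression is allowed only once the variance of the current step's gradient is small enough."""
    return np.var(grad) <= variance_threshold

# while the variance stays above the threshold, the gradient is transmitted uncompressed
```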
As an optional embodiment, after determining whether the current model performance optimization rate meets the standard, the gradient compression method further includes:
If the current model performance optimization rate does not reach the standard, adding one to the continuous times of the optimization rate, and judging whether the continuous times of the optimization rate reach a first preset time threshold;
if the frequency reaches the preset frequency, the control prompter prompts that the optimization rate is not up to the standard continuously and the frequency is too high;
And if the current model performance optimization rate meets the standard, resetting the times of continuously unqualified optimization rate.
Specifically, considering that in some cases, even though the gradient compression degree is repeatedly adjusted, the model performance optimization rate may still not reach the standard, and in this state, the staff is required to check and maintain the model performance optimization rate, so in order to enable the staff to know the condition as soon as possible, in the embodiment of the invention, the condition can be monitored, and therefore if the current model performance optimization rate does not reach the standard, the continuous substandard times of the optimization rate are increased by one, and whether the continuous substandard times of the optimization rate reach a first preset times threshold value is judged; if the frequency reaches the preset frequency, the control prompter prompts that the optimization rate is not up to the standard continuously and the frequency is too high; if the current model performance optimization rate meets the standard, the optimization rate is continuously cleared for times which do not meet the standard, so that automatic recording of the continuous times of the model performance optimization rate which do not meet the standard and corresponding alarm triggering are realized, and the reliability of the system and the user experience are improved.
As an optional embodiment, after determining whether the current single-step training duration exceeds the standard, the gradient compression method further includes:
If the current single-step training time length exceeds the standard, adding one to the single-step training time length continuous exceeding times, and judging whether the single-step training time length continuous exceeding times reach a second preset time threshold value or not;
if the single-step training time is up, the control prompter prompts that the continuous exceeding frequency of the single-step training time is too high;
if the current single-step training time length does not exceed the standard, the continuous exceeding times of the single-step training time length are cleared.
Specifically, considering that the single-step training duration is not continuously exceeded after the gradient compression degree is adjusted for a plurality of times theoretically, namely, the single-step training duration is continuously exceeded for a plurality of times, which belongs to an abnormal condition, in order to enable staff to know as soon as possible, the condition can be monitored in the embodiment of the invention, therefore, if the current single-step training duration exceeds the standard, the single-step training duration is continuously exceeded times are increased by one, and whether the single-step training duration continuously exceeds the standard times reaches a second preset times threshold value is judged; if the single-step training time is up, the control prompter prompts that the continuous exceeding frequency of the single-step training time is too high; if the current single-step training duration is not out of standard, the continuous out-of-standard times of the single-step training duration are cleared, so that automatic recording of the continuous times of the 'single-step training duration out of standard' and corresponding alarm triggering are realized, and the reliability of the system and the user experience are improved.
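The count-and-prompt pattern used here (and for the optimization rate above and the gradient distortion degree further below) could be factored into a small helper as sketched below; the prompter is represented by a placeholder print call:

```python
class ConsecutiveOverLimitCounter:
    """Counts consecutive violations and prompts once a preset count threshold is reached."""

    def __init__(self, threshold, message):
        self.threshold = threshold
        self.message = message
        self.count = 0

    def update(self, violated):
        if violated:
            self.count += 1
            if self.count >= self.threshold:
                print(f"PROMPT: {self.message}")   # placeholder for the prompter device
        else:
            self.count = 0                         # reset when the indicator is back within standard
```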
As an alternative embodiment, the gradient compression method further comprises:
in response to the standard modification instruction, a standard-up standard for the model performance optimization rate and/or a standard-exceeding standard for the single step training duration are modified.
In particular, in consideration of the need for a worker to modify the standard of the model performance optimization rate and/or the standard exceeding of the single-step training duration, in order to improve the working efficiency, the embodiment of the invention provides a related modification interface, so that the standard modifying instruction can be responded to modify the standard of the model performance optimization rate and/or the standard exceeding of the single-step training duration.
As an alternative embodiment, determining whether the current single step training period exceeds the standard includes:
determining the training duration of the last training step as the current single-step training duration;
Judging whether the current single-step training time length is greater than a preset time length threshold value or not;
If the training time is greater than the standard, judging that the current single-step training time exceeds the standard;
If the training time is not greater than the preset training time, judging that the current single-step training time is not out of standard.
Specifically, considering that before the gradient data of the current training step are compressed and synchronized, the communication time of the gradient data of the current training step is not determined, so that the single-step training duration of the current training step cannot be determined, and the single-step training duration of the last training step is influenced by the current gradient compression degree and the current network condition, so that the training duration of the last training step can be determined as the current single-step training duration, and whether the current single-step training duration exceeds the standard is determined by judging whether the previous single-step training duration is greater than a preset duration threshold.
The preset duration threshold may be set autonomously, which is not limited in the embodiment of the present invention.
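The check itself reduces to comparing the previous step's wall-clock duration against the preset threshold. A sketch under the assumption that step timing is taken with time.perf_counter (the embodiment does not prescribe a particular timer):

```python
import time

def step_duration_over_limit(last_step_duration: float, threshold_s: float) -> bool:
    """True when the training duration of the previous step exceeds the preset duration threshold."""
    return last_step_duration > threshold_s

# usage: time one training step, then reuse its duration as the "current" single-step duration
start = time.perf_counter()
# ... forward/backward pass plus compressed-gradient synchronization of the previous step ...
last_step_duration = time.perf_counter() - start
print(step_duration_over_limit(last_step_duration, threshold_s=1.0))
```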
As an optional embodiment, for any training step after the preheating stage of the local model iterative training, after obtaining the gradient data of the current training step, before reducing the current gradient compression degree of the preset compression method according to the first preset rule, the gradient compression method further includes:
judging whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard or not;
If the current model performance optimization rate does not reach the standard, the step of reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises the following steps:
if the current model performance optimization rate does not reach the standard and/or the gradient distortion degree exceeds the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard comprises the following steps:
If the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard, judging whether the current single-step training time length is out of standard or not.
For better explanation of the embodiment of the present invention, please refer to fig. 3, fig. 3 is a flow chart of another gradient compression method provided by the present invention, wherein S304 is the same as S102, S305 is the same as S103, S306 is the same as S104, S307 is the same as S105, and S301-S303 comprise:
S301: for any training step after the preheating stage of the local model iterative training, gradient data of the current training step is obtained;
S302: whether the current model performance optimization rate meets the standard or not;
S303: whether the gradient distortion degree exceeds the standard.
Specifically, considering that, in the process of performing the adjustment with the training step as the granularity, compressing the gradient data of the current training step based on the current gradient compression degree is likely to distort the gradient data and thereby impair the precision of the local model, in the embodiment of the invention it may be judged whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard, and the action of reducing the current gradient compression degree of the preset compression method according to the first preset rule may be triggered when the standard is exceeded, so that gradient data distortion caused by gradient compression is avoided and the model precision is further improved.
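Putting the branches of fig. 3 together, the per-step control flow can be sketched as follows. The CompressionState container, the concrete adjustment factors and the three boolean inputs are hypothetical stand-ins for the checks and preset rules described above, not the patent's exact parameter values.

```python
from dataclasses import dataclass

@dataclass
class CompressionState:
    """Hypothetical container for the current joint compression settings."""
    quant_bits: int = 8      # gradient quantization setting, cf. Q(t)
    keep_ratio: float = 0.1  # fraction of gradients kept by sparsification, cf. S(t)

    def decrease_compression(self) -> None:
        # more bits and a larger keep ratio mean *less* compression
        self.quant_bits = min(32, self.quant_bits * 2)
        self.keep_ratio = min(1.0, self.keep_ratio * 2.0)

    def increase_compression(self) -> None:
        self.quant_bits = max(1, self.quant_bits // 2)
        self.keep_ratio = max(0.01, self.keep_ratio / 2.0)


def adjust_for_step(state: CompressionState,
                    rate_meets_standard: bool,
                    distortion_over_limit: bool,
                    duration_over_limit: bool) -> CompressionState:
    """One adjustment decision per training step after the warm-up phase (cf. S302-S307 in fig. 3)."""
    if not rate_meets_standard or distortion_over_limit:
        # model quality is suffering: compress less aggressively
        state.decrease_compression()
    elif duration_over_limit:
        # quality is on track but the step is too slow under the current network: compress more
        state.increase_compression()
    return state


# usage: a step whose duration exceeds the standard while model quality is fine
print(adjust_for_step(CompressionState(), True, False, True))
```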
Specifically, the basic idea of the patent is based on 'mathematical modeling under a low-speed unstable network environment', and the specific contents are as follows:
In a distributed training scenario where the network is unstable, for a given training task, if the total duration of the local model training is noted as T_total, then:
T_total ≈ Σ_{t=1}^{T} ( T_comm(t) + T_comp(t) );
wherein T_comm(t) represents the communication time of the t-th training step, T_comp(t) represents the computation time of the t-th training step, and T represents the total number of steps required for iterative training of the local model. Since there is in fact a certain overlap between the communication time and the computation time, T_total is only approximately equal to this sum over t.
The part of this scenario that the present work really focuses on is the communication time of the training step. The communication time of a training step can be further expanded as being proportional to a combination of several functions of t:
T_comm(t) ∝ Q(t) · S(t) / BW(t);
Specifically, GC refers to Gradient Compression, and GC(t) represents the gradient compression strategy with respect to training step t. The first element is Q(t), the gradient quantization strategy function; it is a function of t because the quantization strategy may change at every step of the distributed training, and thus varies dynamically with t. The second element is S(t), the gradient pruning (sparsification) strategy function; it is likewise a function of t because the pruning strategy may change at every step, and thus also varies dynamically with t.
The third term, BW(t), represents the bandwidth of the network at the t-th training step of iterative training; as described above, the present invention is directed to a low-speed, unstable network environment, so the network bandwidth changes over time and also behaves as a function of the dynamic change of t.
Thus, the problem of constructing a dynamic gradient compression strategy is converted into the problem of solving, under a given model performance requirement, for the objective function T_comm(t) (or Σ_t T_comm(t)), the optimal joint compression strategy (quantization + sparsification) GC*(t). The specific mathematical expression is as follows:
GC*(t) = argmin_{Q(t), S(t)} N_g · Q(t) · S(t) / BW(t), subject to the model performance requirement L_required;
wherein N_g represents the number of gradient data; L_required represents a given model performance requirement (which may be manifested as "model performance optimization rate up to standard" in the present invention); GC*(t) represents the optimal hybrid compression strategy for which the right-hand expression (the communication time of the training step) reaches its minimum value; as a variable with respect to time t, it is the target of the optimization problem that needs to be solved.
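As a rough numerical illustration of how these modeled quantities interact, the sketch below estimates the per-step communication time from the gradient count, quantization bit width, sparsification keep ratio and momentary bandwidth. The simple payload-divided-by-bandwidth cost model (and the neglect of protocol overhead and of overlap with computation) is an assumption of the sketch, not the patent's exact expression.

```python
def estimated_comm_time(n_gradients: int, quant_bits: int, keep_ratio: float, bandwidth_bps: float) -> float:
    """Estimate T_comm(t) as compressed payload size in bits divided by available bandwidth."""
    payload_bits = n_gradients * keep_ratio * quant_bits
    return payload_bits / bandwidth_bps

# usage: 100M gradient values, 4-bit quantization, 1% kept, on a 100 Mbit/s link
print(estimated_comm_time(100_000_000, 4, 0.01, 100e6))  # -> 0.04 seconds
```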
As an alternative embodiment, determining whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds a standard includes:
determining, according to the gradient distortion degree determination relational expression, the gradient distortion degree with which the gradient data of the current training step is compressed based on the current gradient compression degree;
if the gradient distortion is greater than a preset distortion threshold, judging that the gradient distortion exceeds the standard;
If the gradient distortion is not greater than the preset distortion threshold, judging that the gradient distortion is not out of standard;
The gradient distortion degree determination relation includes:
gradient distortion degree = ‖g_GC − g‖_2;
wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖g_GC − g‖_2 denotes the Euclidean distance between g_GC and g.
Specifically, the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree can be determined efficiently and accurately by the gradient distortion degree determination relational expression as described above.
Of course, in addition to this specific manner, the "gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree" may be determined in other manners, and the embodiment of the present invention is not limited herein.
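A minimal sketch of this Euclidean-distance distortion check with NumPy; the threshold value and the crude sparsification used to produce a compressed gradient are illustrative assumptions only.

```python
import numpy as np

def gradient_distortion(g_compressed: np.ndarray, g: np.ndarray) -> float:
    """Euclidean distance between the compressed and the uncompressed gradient of the current step."""
    return float(np.linalg.norm(g_compressed - g))

def distortion_over_limit(g_compressed: np.ndarray, g: np.ndarray, threshold: float) -> bool:
    return gradient_distortion(g_compressed, g) > threshold

# usage with toy data: keep only large-magnitude components as a stand-in for sparsification
g = np.random.randn(1000).astype(np.float32)
g_compressed = np.where(np.abs(g) > 1.0, g, 0.0)
print(distortion_over_limit(g_compressed, g, threshold=10.0))
```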
As an optional embodiment, after determining whether the gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard, the gradient compression method further includes:
if the gradient distortion degree exceeds the standard, adding one to the continuous exceeding number of times of the gradient distortion degree, and judging whether the continuous exceeding number of times of the gradient distortion degree reaches a third preset number of times threshold;
if the third preset number of times threshold is reached, the prompter is controlled to prompt that the gradient distortion degree has exceeded the standard too many consecutive times;
and if the gradient distortion degree does not exceed the standard, resetting the continuous exceeding number of times of the gradient distortion degree.
Specifically, if the gradient distortion degree still exceeds the standard many times in a row after the gradient compression degree has been adjusted repeatedly, this is an abnormal situation; in order to monitor it, if the gradient distortion degree exceeds the standard, the count of consecutive exceedances of the gradient distortion degree is increased by one, and it is judged whether this count reaches a third preset number of times threshold; if it does, the prompter is controlled to prompt that the gradient distortion degree has exceeded the standard too many consecutive times; if the gradient distortion degree does not exceed the standard, the count of consecutive exceedances of the gradient distortion degree is cleared. This realizes automatic monitoring of, and alarm triggering for, the situation in which the gradient distortion degree repeatedly exceeds the standard.
The first preset frequency threshold, the second preset frequency threshold and the third preset frequency threshold can be set independently, which is not limited herein.
For better explaining the embodiments of the present invention, please refer to fig. 4, fig. 4 is a schematic structural diagram of a gradient compression device provided by the present invention, where the gradient compression device is applied to any computing node in a distributed cluster, and the gradient compression device includes:
a first judging module 41, configured to, for any training step after the warm-up phase of the local model iterative training, judge whether the current model performance optimization rate meets the standard after obtaining gradient data of the current training step, trigger the first adjusting module 42 if the current model performance optimization rate does not meet the standard, and trigger the second judging module 43 if the current model performance optimization rate meets the standard;
the first adjusting module 42 is configured to reduce a current gradient compression degree of the preset compression method according to a first preset rule;
A second judging module 43, configured to judge whether the current single-step training duration exceeds a standard, and trigger the second adjusting module 44 if the current single-step training duration exceeds the standard;
The second adjusting module 44 is configured to amplify the current gradient compression degree of the preset compression method according to a second preset rule;
The compression module 45 is configured to compress gradient data of a current training step by using a preset compression method based on the latest gradient compression degree, so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training period is related to the gradient compression degree and the network condition.
As an alternative embodiment, the first judging module 41 includes:
The first judging sub-module is used for judging whether the improvement amplitude of the model performance meets the standard in the last step number sliding window, triggering the first judging module if the improvement amplitude of the model performance meets the standard, and triggering the second judging module if the improvement amplitude of the model performance does not meet the standard;
the first judging module is used for judging that the current model performance optimization rate reaches the standard;
the second judging module is used for judging that the current model performance optimization rate does not reach the standard;
The step number sliding window comprises a preset number of training steps.
As an alternative embodiment, the first judging submodule includes:
The first determining module is used for determining, according to the loss function change rate determination relational expression, the change rate of the loss function of the local model in the last step number sliding window;
the third judging module is used for judging that the improvement amplitude of the model performance does not reach the standard if the change rate of the loss function is smaller than a preset change rate threshold value;
The fourth judging module is used for judging that the improvement amplitude of the performance of the model reaches the standard if the change rate of the loss function is not smaller than a preset change rate threshold value;
the loss function change rate determination relation includes:
ΔL(t) = ( L_avg(t) − L_min(t) ) / L_min(t), with L_avg(t) = (1/M) Σ_{τ=t−M+1}^{t} L(τ) and L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as a reference; L is the loss function of the local model; L_avg(t) represents the sliding average value of the loss function L in the last step number sliding window taking the current t-th training step as a reference, τ being the variable of the summation; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
As an alternative embodiment, the preset change rate threshold comprises:
ε(t) = ε_0 · d(t);
wherein d(t) is a decay function related to the number of steps t of the training step, and ε_0 is a hyperparameter.
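The window statistics needed here are a sliding average and a minimum over the last M recorded loss values, compared against a step-decayed threshold. The sketch below assumes the relative-change form reconstructed above and an exponential decay for d(t); both are illustrative choices rather than the patent's prescribed formulas.

```python
from collections import deque
import math

class LossRateMonitor:
    """Tracks the loss over a sliding window of M steps and judges the improvement rate."""

    def __init__(self, window: int, eps0: float, decay: float = 1e-4):
        self.losses = deque(maxlen=window)  # the step-number sliding window
        self.eps0 = eps0                    # hyperparameter of the threshold
        self.decay = decay                  # controls the assumed decay function d(t)

    def rate_meets_standard(self, step: int, loss: float) -> bool:
        self.losses.append(loss)
        if len(self.losses) < self.losses.maxlen:
            return True  # not enough history yet to judge
        avg = sum(self.losses) / len(self.losses)
        lowest = min(self.losses)
        if lowest <= 0:
            return True  # avoid division by zero once the loss has bottomed out
        change_rate = (avg - lowest) / lowest                 # reconstructed relative change rate
        threshold = self.eps0 * math.exp(-self.decay * step)  # illustrative decay function d(t)
        return change_rate >= threshold

# usage: a slowly decreasing loss curve
mon = LossRateMonitor(window=10, eps0=0.05)
for step in range(50):
    ok = mon.rate_meets_standard(step, loss=2.0 * 0.99 ** step)
print(ok)
```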
As an alternative embodiment, the preset compression method includes a hybrid gradient compression method formed by combining gradient quantization and gradient sparsification.
As an optional embodiment, reducing the current gradient compression degree of the preset compression method according to the first preset rule includes:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
clip_upper(2·Q(t)) · λ·S(t);
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_upper(·) represents a clip function that performs an upper-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
As an alternative embodiment, amplifying the current gradient compression degree of the preset compression method according to the second preset rule includes:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
the compression degree amplification relation includes:
clip_lower(Q(t)/2) · S(t)/λ;
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_lower(·) represents a clip function that performs a lower-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
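Read together, the two relational expressions double or halve the quantization setting inside clip bounds and scale the sparsification setting by λ in the corresponding direction. A sketch of both directions follows; the clip limits, the value of λ and the interpretation of S(t) as a keep ratio are assumptions of the sketch (the amplification formula above is itself reconstructed by symmetry).

```python
def clip(value: float, lower: float, upper: float) -> float:
    return max(lower, min(upper, value))

def decrease_compression(q_bits: float, s_keep: float, lam: float = 2.0,
                         q_upper: float = 32.0) -> tuple[float, float]:
    """Reduce the gradient compression degree: clip_upper(2*Q(t)) and lambda*S(t), lambda > 1."""
    return clip(2.0 * q_bits, 1.0, q_upper), clip(lam * s_keep, 0.0, 1.0)

def increase_compression(q_bits: float, s_keep: float, lam: float = 2.0,
                         q_lower: float = 1.0) -> tuple[float, float]:
    """Amplify the gradient compression degree: the mirror update with a lower clip."""
    return clip(q_bits / 2.0, q_lower, 32.0), clip(s_keep / lam, 0.001, 1.0)

# usage
print(decrease_compression(4, 0.01))   # more bits, more gradients kept
print(increase_compression(8, 0.02))   # fewer bits, fewer gradients kept
```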
As an alternative embodiment, the first judging module 41 includes:
The second judging sub-module is used for judging whether the compression of the gradient data can cause the convergence failure of the iterative training according to the gradient data of the current training step at any training step after the iterative training of the local model begins, triggering the fifth judging module if the convergence failure of the iterative training can be caused, and triggering the sixth judging module if the convergence failure of the iterative training can not be caused;
a fifth judging module, configured to judge that the iterative training is in a preheating stage;
A sixth judging module, configured to judge that the preheating phase of the iterative training ends;
and the third judging sub-module is used for judging whether the current model performance optimization rate reaches the standard or not after gradient data of the current training step is obtained for any training step after the preheating stage of the local model iterative training.
As an alternative embodiment, the second judging submodule includes:
The second determining module is used for determining the variance of the gradient data of the current training step;
a fourth judging sub-module for judging whether the variance of the gradient data of the current training step is larger than a preset variance threshold, if so, triggering a seventh judging module, and if not, triggering an eighth judging module;
A seventh determining module, configured to determine that compression of the gradient data may cause the iterative training to fail to converge;
and the eighth judging module is used for judging that the compression of the gradient data does not lead to the failure of convergence of the iterative training.
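The warm-up test at the start of iterative training comes down to comparing the variance of the current step's gradient data with a preset variance threshold. A NumPy sketch; the threshold value is an assumption.

```python
import numpy as np

def compression_would_block_convergence(gradient: np.ndarray, variance_threshold: float) -> bool:
    """High gradient variance suggests compressing now could keep the iterative training from converging,
    i.e. the warm-up (preheating) phase is not over yet and gradients should not be compressed."""
    return float(np.var(gradient)) > variance_threshold

# usage: a noisy early-stage gradient versus a settled one
early = np.random.randn(10_000) * 5.0
late = np.random.randn(10_000) * 0.1
print(compression_would_block_convergence(early, variance_threshold=1.0))  # likely True
print(compression_would_block_convergence(late, variance_threshold=1.0))   # likely False
```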
As an alternative embodiment, the gradient compression device further comprises:
The third judging module is used for adding one to the continuous unqualified times of the optimized rate if the current performance optimized rate of the model is unqualified, judging whether the continuous unqualified times of the optimized rate reach a first preset time threshold, and triggering the first prompting module if the continuous unqualified times of the optimized rate reach the first preset time threshold;
The first prompting module is used for controlling the prompter to prompt that the optimization rate is continuously not up to standard for too high times;
and the first zero clearing module is used for clearing the times of continuous failure of the optimization rate if the current model performance optimization rate reaches the standard.
As an alternative embodiment, the gradient compression device further comprises:
the fourth judging module is used for adding one to the continuous exceeding times of the single-step training time length if the current single-step training time length exceeds the standard, judging whether the continuous exceeding times of the single-step training time length reach a second preset time threshold, and triggering a second prompting module if the continuous exceeding times of the single-step training time length reach the second preset time threshold;
The second prompting module is used for controlling the prompting device to prompt that the single-step training time duration is over high in continuous exceeding frequency;
and the second zero clearing module is used for clearing the continuous exceeding times of the single-step training duration if the current single-step training duration does not exceed the standard.
As an alternative embodiment, the gradient compression device further comprises:
And the modification module is used for responding to the standard modification instruction and modifying standard reaching standard of the model performance optimization rate and/or standard exceeding standard of the single-step training duration.
As an alternative embodiment, the second judging module 43 includes:
The third determining module is used for determining the training duration of the last training step as the current single-step training duration;
A fifth judging sub-module, configured to judge whether the current single-step training duration is greater than a preset duration threshold, if so, trigger a ninth judging module, and if not, trigger a tenth judging module;
A ninth judging module, configured to judge that the current single-step training duration exceeds the standard;
And a tenth judging module, configured to judge that the current single-step training duration does not exceed the standard.
As an alternative embodiment, the gradient compression device further comprises:
A fifth judging module, configured to judge whether a gradient distortion degree of compressing the gradient data of the current training step based on the current gradient compression degree exceeds a standard, and if the gradient distortion degree exceeds the standard, trigger the first adjusting module 42;
the triggering conditions of the second judging module 43 include:
if the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard.
As an alternative embodiment, the fifth judging module includes:
the fourth determining module is used for determining, according to the gradient distortion degree determination relational expression, the gradient distortion degree with which the gradient data of the current training step is compressed based on the current gradient compression degree;
the eleventh judging module is used for judging that the gradient distortion degree exceeds the standard if the gradient distortion degree is larger than a preset distortion degree threshold value;
A twelfth determining module, configured to determine that the gradient distortion degree does not exceed the standard if the gradient distortion degree is not greater than the preset distortion degree threshold;
The gradient distortion degree determination relation includes:
gradient distortion degree = ‖g_GC − g‖_2;
wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖g_GC − g‖_2 denotes the Euclidean distance between g_GC and g.
As an alternative embodiment, the gradient compression device further comprises:
the sixth judging module is used for adding one to the continuous exceeding number of times of the gradient distortion degree if the gradient distortion degree exceeds the standard, judging whether the continuous exceeding number of times of the gradient distortion degree reaches a third preset number of times threshold, and triggering a third prompting module if the continuous exceeding number of times of the gradient distortion degree reaches the third preset number of times threshold;
the third prompting module is used for controlling the prompter to prompt that the gradient distortion degree has exceeded the standard too many consecutive times;
and the third zero clearing module is used for clearing the continuous exceeding times of the gradient distortion degree if the gradient distortion degree does not exceed the standard.
For the description of the gradient compression apparatus provided in the embodiment of the present invention, reference is made to the foregoing embodiment of the gradient compression method, and the embodiment of the present invention is not repeated here.
For better illustrating an embodiment of the present invention, please refer to fig. 5, fig. 5 is a schematic structural diagram of a gradient compression device provided by the present invention, the gradient compression device includes:
A memory 51 for storing a computer program;
a processor 52 for implementing the steps of the gradient compression method in the previous embodiment when executing a computer program.
For the description of the gradient compression apparatus provided in the embodiment of the present invention, reference is made to the foregoing embodiment of the gradient compression method, and the embodiment of the present invention is not repeated herein.
The invention also provides a distributed cluster comprising a plurality of gradient compression devices as in the previous embodiments.
For the description of the distributed clusters provided in the embodiments of the present invention, reference is made to the foregoing embodiments of the gradient compression method, and the embodiments of the present invention are not repeated herein.
For a better explanation of the embodiments of the present invention, please refer to fig. 6, fig. 6 is a schematic structural diagram of a computer readable storage medium provided by the present invention, the computer readable storage medium 60 stores a computer program 61 thereon, and the computer program 61 implements the steps of the gradient compression method according to the previous embodiments when executed by the processor 52.
For the description of the computer readable storage medium provided in the embodiment of the present invention, please refer to the foregoing embodiment of the gradient compression method, and the embodiment of the present invention is not repeated here.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (18)
1. A gradient compression method, applied to any computing node in a distributed cluster, comprising:
for any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained;
If the current model performance optimization rate does not reach the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
If the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard;
If the current single-step training time length exceeds the standard, amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
Based on the latest gradient compression degree, compressing gradient data of the current training step by adopting a preset compression method so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training duration is related to gradient compression and network conditions;
Judging whether the current model performance optimization rate meets the standard comprises the following steps:
Judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number;
If the improvement amplitude of the model performance meets the standard, judging that the current model performance optimization rate meets the standard;
if the improvement amplitude of the model performance does not reach the standard, judging that the current model performance optimization rate does not reach the standard;
Wherein the step number sliding window comprises a preset number of training steps;
judging whether the improvement amplitude of the model performance reaches the standard in the sliding window of the last step number or not comprises the following steps:
determining, according to the loss function change rate determination relational expression, the change rate of the loss function of the local model in the last step number sliding window;
If the change rate of the loss function is smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance does not reach the standard;
If the change rate of the loss function is not smaller than a preset change rate threshold, judging that the improvement amplitude of the model performance reaches the standard;
the loss function change rate determination relation includes:
ΔL(t) = ( L_avg(t) − L_min(t) ) / L_min(t), with L_avg(t) = (1/M) Σ_{τ=t−M+1}^{t} L(τ) and L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as a reference; L is the loss function of the local model; L_avg(t) represents the sliding average value of the loss function L in the last step number sliding window taking the current t-th training step as a reference, τ being the variable of the summation; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
2. The gradient compression method of claim 1, wherein the preset rate of change threshold comprises:
ε(t) = ε_0 · d(t);
wherein d(t) is a decay function related to the number of steps t of the training step, and ε_0 is a hyperparameter.
3. The gradient compression method according to claim 1, wherein the preset compression method includes a hybrid gradient compression method formed by combining a gradient quantization method and a gradient sparsification method.
4. A gradient compression method according to claim 3, wherein reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises:
according to the compression degree reduction relational expression, the current gradient compression degree of the preset compression method is reduced;
the compression degree reduction relation includes:
clip_upper(2·Q(t)) · λ·S(t);
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_upper(·) represents a clip function that performs an upper-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
5. A gradient compression method according to claim 3, wherein amplifying the current gradient compression degree of the preset compression method according to the second preset rule comprises:
amplifying the current gradient compression degree of a preset compression method according to a compression degree amplification relational expression;
The compression degree amplifying relational expression comprises:
clip_lower(Q(t)/2) · S(t)/λ;
wherein Q(t) represents the gradient quantization strategy function related to the step number t of the training step, S(t) represents the gradient sparsification strategy function related to the step number t of the training step, clip_lower(·) represents a clip function that performs a lower-limit clipping operation on the gradient quantization, λ is a preset adjustment parameter, and λ > 1.
6. The gradient compression method of claim 1, wherein for any one of the training steps after the warm-up phase of the local model iterative training, after obtaining gradient data for the current training step, determining whether the current model performance optimization rate meets the criteria comprises:
Judging whether compression of gradient data can cause the iterative training to be unable to converge according to the gradient data of the current training step at any training step after the iterative training of the local model begins;
if the compression of the gradient data would cause the iterative training to fail to converge, judging that the iterative training is in the preheating stage;
If the compression of the gradient data would not cause the iterative training to fail to converge, judging that the preheating stage of the iterative training has ended;
For any training step after the preheating stage of the local model iterative training, judging whether the current model performance optimization rate meets the standard after gradient data of the current training step is obtained.
7. The gradient compression method of claim 6, wherein determining whether compression of the gradient data would result in failure to converge of iterative training based on the gradient data of the current training step comprises:
Determining the variance of gradient data of the current training step;
judging whether the variance of gradient data of the current training step is larger than a preset variance threshold value or not;
if the variance is greater than the preset variance threshold, judging that the compression of the gradient data would cause the iterative training to fail to converge;
if the variance is not greater than the preset variance threshold, judging that the compression of the gradient data would not cause the iterative training to fail to converge.
8. The gradient compression method according to claim 1, wherein after determining whether the current model performance optimization rate meets the standard, the gradient compression method further comprises:
If the current model performance optimization rate does not reach the standard, adding one to the number of consecutive times the optimization rate has failed to reach the standard, and judging whether this number reaches a first preset times threshold;
if the first preset times threshold is reached, controlling a prompter to prompt that the optimization rate has failed to reach the standard too many consecutive times;
And if the current model performance optimization rate reaches the standard, clearing the number of consecutive times the optimization rate has failed to reach the standard.
9. The gradient compression method according to claim 1, wherein after determining whether the current single step training period exceeds the standard, the gradient compression method further comprises:
If the current single-step training time length exceeds the standard, adding one to the single-step training time length continuous exceeding times, and judging whether the single-step training time length continuous exceeding times reach a second preset time threshold value or not;
if the second preset times threshold is reached, controlling the prompter to prompt that the single-step training duration has exceeded the standard too many consecutive times;
If the current single-step training time length does not exceed the standard, the continuous exceeding times of the single-step training time length are cleared.
10. The gradient compression method of claim 1, further comprising:
in response to the standard modification instruction, a standard-up standard for the model performance optimization rate and/or a standard-exceeding standard for the single step training duration are modified.
11. The gradient compression method of claim 1, wherein determining whether the current single step training period exceeds a standard comprises:
determining the training duration of the last training step as the current single-step training duration;
Judging whether the current single-step training time length is greater than a preset time length threshold value or not;
If it is greater than the preset duration threshold, judging that the current single-step training duration exceeds the standard;
If it is not greater than the preset duration threshold, judging that the current single-step training duration does not exceed the standard.
12. The gradient compression method according to any one of claims 1 to 11, wherein for any one of the training steps after the warm-up phase of the local model iterative training, before the gradient data of the current training step is obtained and the current gradient compression degree of the preset compression method is reduced according to the first preset rule, the gradient compression method further comprises:
judging whether the gradient distortion degree for compressing the gradient data of the current training step based on the current gradient compression degree exceeds the standard or not;
If the current model performance optimization rate does not reach the standard, the step of reducing the current gradient compression degree of the preset compression method according to the first preset rule comprises the following steps:
If the current model performance optimization rate does not reach the standard and/or the gradient distortion degree exceeds the standard, reducing the current gradient compression degree of the preset compression method according to a first preset rule;
if the current model performance optimization rate reaches the standard, judging whether the current single-step training time length exceeds the standard comprises the following steps:
And if the current model performance optimization rate reaches the standard and the gradient distortion degree is not out of standard, judging whether the current single-step training time length is out of standard.
13. The gradient compression method of claim 12, wherein determining whether a gradient distortion degree by which gradient data of a current training step is compressed based on a current gradient compression degree exceeds a standard comprises:
determining, according to the gradient distortion degree determination relational expression, the gradient distortion degree with which the gradient data of the current training step is compressed based on the current gradient compression degree;
If the gradient distortion is greater than a preset distortion threshold, judging that the gradient distortion exceeds the standard;
If the gradient distortion is not greater than the preset distortion threshold, judging that the gradient distortion is not out of standard;
The gradient distortion degree determination relation comprises:
gradient distortion degree = ‖g_GC − g‖_2;
wherein g_GC is the gradient data of the current training step compressed based on the current gradient compression degree, g is the uncompressed gradient data of the current training step, and ‖g_GC − g‖_2 denotes the Euclidean distance between g_GC and g.
14. The gradient compression method according to claim 12, wherein after judging whether or not a gradient distortion degree for compressing gradient data of a current training step based on a current gradient compression degree exceeds a standard, the gradient compression method further comprises:
if the gradient distortion degree exceeds the standard, adding one to the continuous exceeding number of the gradient distortion degree, and judging whether the continuous exceeding number of the gradient distortion degree reaches a third preset number threshold;
if the third preset number of times threshold is reached, the prompter is controlled to prompt that the gradient distortion degree has exceeded the standard too many consecutive times;
And if the gradient distortion degree does not exceed the standard, clearing the continuous exceeding frequency of the gradient distortion degree.
15. A gradient compression apparatus for use with any one of computing nodes in a distributed cluster, comprising:
The first judging module is used for judging whether the current model performance optimization rate meets the standard or not for any training step after the preheating stage of the local model iterative training after the gradient data of the current training step is obtained, triggering the first adjusting module if the current model performance optimization rate does not meet the standard, and triggering the second judging module if the current model performance optimization rate meets the standard;
The first adjusting module is used for reducing the current gradient compression degree of the preset compression method according to a first preset rule;
the second judging module is used for judging whether the current single-step training time length exceeds the standard, and triggering the second adjusting module if the current single-step training time length exceeds the standard;
the second adjusting module is used for amplifying the current gradient compression degree of the preset compression method according to a second preset rule;
The compression module is used for compressing gradient data of the current training step by adopting a preset compression method based on the latest gradient compression degree so as to synchronize the compressed gradient data in the distributed cluster;
wherein the single step training duration is related to gradient compression and network conditions;
The first judging module includes:
The first judging sub-module is used for judging whether the lifting amplitude of the model performance meets the standard in the previous step number sliding window, triggering the first judging module if the lifting amplitude of the model performance meets the standard, and triggering the second judging module if the lifting amplitude of the model performance does not meet the standard;
the first judging module is used for judging that the current model performance optimization rate reaches the standard;
the second judging module is used for judging that the current model performance optimization rate does not reach the standard;
The step number sliding window comprises a preset number of training steps;
the first judgment submodule includes:
The first determining module is used for determining, according to the loss function change rate determination relational expression, the change rate of the loss function of the local model in the last step number sliding window;
the third judging module is used for judging that the improvement amplitude of the model performance does not reach the standard if the change rate of the loss function is smaller than a preset change rate threshold value;
The fourth judging module is used for judging that the improvement amplitude of the performance of the model reaches the standard if the change rate of the loss function is not smaller than a preset change rate threshold value;
the loss function change rate determination relation includes:
ΔL(t) = ( L_avg(t) − L_min(t) ) / L_min(t), with L_avg(t) = (1/M) Σ_{τ=t−M+1}^{t} L(τ) and L_min(t) = min_{t−M+1 ≤ τ ≤ t} L(τ);
wherein ΔL(t) represents the change rate of the loss function L in the last step number sliding window, taking the current t-th training step as a reference; L is the loss function of the local model; L_avg(t) represents the sliding average value of the loss function L in the last step number sliding window taking the current t-th training step as a reference, τ being the variable of the summation; L_min(t) represents the smallest loss function value in the last step number sliding window; the step number sliding window includes M training steps.
16. A gradient compression apparatus, comprising:
A memory for storing a computer program;
processor for implementing the steps of the gradient compression method as claimed in any one of claims 1 to 14 when executing the computer program.
17. A distributed cluster system comprising a plurality of gradient compression apparatuses as claimed in claim 16.
18. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the gradient compression method as claimed in any of claims 1 to 14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410317335.XA CN117910521B (en) | 2024-03-20 | 2024-03-20 | Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410317335.XA CN117910521B (en) | 2024-03-20 | 2024-03-20 | Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117910521A CN117910521A (en) | 2024-04-19 |
CN117910521B true CN117910521B (en) | 2024-06-14 |
Family
ID=90686309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410317335.XA Active CN117910521B (en) | 2024-03-20 | 2024-03-20 | Gradient compression method, gradient compression device, gradient compression equipment, distributed cluster system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117910521B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472347A (en) * | 2018-10-15 | 2019-03-15 | 中山大学 | A kind of gradient compression method of distribution deep learning |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109951438B (en) * | 2019-01-15 | 2020-11-20 | 中国科学院信息工程研究所 | Communication optimization method and system for distributed deep learning |
CN110245743A (en) * | 2019-05-23 | 2019-09-17 | 中山大学 | A kind of asynchronous distributed deep learning training method, apparatus and system |
CN110309283B (en) * | 2019-06-28 | 2023-03-21 | 创新先进技术有限公司 | Answer determination method and device for intelligent question answering |
WO2022003562A1 (en) * | 2020-06-29 | 2022-01-06 | King Abdullah University Of Science And Technology | Statistical-based gradient compression method for distributed training system |
CN113988266A (en) * | 2021-11-01 | 2022-01-28 | 南京大学 | Top-k-based adaptive distributed gradient compression method supporting complex network conditions |
CN115146119A (en) * | 2022-07-22 | 2022-10-04 | 上海燧原科技有限公司 | Compression training method, device, equipment and storage medium for distributed gradient |
WO2024050659A1 (en) * | 2022-09-05 | 2024-03-14 | 华南理工大学 | Federated learning lower-side cooperative channel adaptive gradient compression method |
CN115719093A (en) * | 2022-11-22 | 2023-02-28 | 京东科技信息技术有限公司 | Distributed training method, device, system, storage medium and electronic equipment |
CN116484946A (en) * | 2023-05-19 | 2023-07-25 | 平安科技(深圳)有限公司 | Model parameter adjustment method, device, equipment and medium based on dynamic compression |
CN116739107A (en) * | 2023-06-09 | 2023-09-12 | 平安科技(深圳)有限公司 | Gradient quantization method, device, equipment and storage medium based on federal learning |
- 2024-03-20 CN CN202410317335.XA patent/CN117910521B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472347A (en) * | 2018-10-15 | 2019-03-15 | 中山大学 | A kind of gradient compression method of distribution deep learning |
Non-Patent Citations (1)
Title |
---|
A Gradient Compression Algorithm for Deep Learning Based on 4-Bit Encoding (一种基于4Bit编码的深度学习梯度压缩算法); Jiang Wenbin et al.; Computer Science (计算机科学); 2020-07-31; Vol. 47, No. 7; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117910521A (en) | 2024-04-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||