WO2021238274A1 - Gradient information updating method for distributed deep learning, and related apparatus - Google Patents
Gradient information updating method for distributed deep learning, and related apparatus
- Publication number
- WO2021238274A1 (PCT/CN2021/073493)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gradient
- gradient information
- sparse
- information
- node
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
Definitions
- This application relates to the field of computer technology, and in particular to a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning.
- when a large-scale distributed deep learning model is trained, each computing node generates its own gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes.
- the global gradient value is thereby obtained, and the model is updated synchronously to keep the replica model on each computing node consistent.
- the size of the model and the communication bandwidth between nodes determine the communication duration
- the heterogeneous characteristics of the nodes, or manufacturing tolerances among homogeneous devices, determine the communication waiting time, and these two factors together determine the training duration.
- the communication strategy between nodes is therefore very important; the communication strategies in the prior art severely degrade the efficiency of model training and hence of deep learning.
- the purpose of this application is to provide a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning, in which sparse communication is performed according to a gradient threshold only when the number of iterations of a node is greater than a preset number of switching times.
- at the same time, after gradient information is received, the reduced gradient is corrected to obtain a target gradient, and the target gradient is then used to update the replica model.
- on the basis of sparse communication, this reduces both the number of communications and the amount of data transferred.
- the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
- this application provides a method for updating gradient information of distributed deep learning, including:
- when the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold;
- performing sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times includes:
- the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information
- the zeroed gradient information is communicated.
- performing zeroing processing on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information includes:
- the remaining sparse gradients and the zeroed gradient values are used as the zeroed gradient information.
- the node calculating the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information includes:
- the node determines whether it has finished calculating the gradient information
- This application also provides a gradient information update device for distributed deep learning, including:
- the sparse communication module is used to make the node perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times;
- the reduction processing module is used to reduce the received gradient information to obtain the reduced gradient information
- the replica model update module is used to correct the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and to update the replica model in the node according to the target gradient.
- the sparse communication module includes:
- the importance degree calculation unit is configured to have the node calculate the importance degree coefficient of the calculated gradient information when the number of iterations is greater than the preset number of switching times, to obtain the importance degree coefficient of each piece of gradient information;
- an index information broadcasting unit is configured to take gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and to broadcast the corresponding index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
- a zeroing processing unit is configured to perform zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain the zeroed gradient information
- the sparse communication unit is used to communicate the zeroed gradient information.
- the zeroing processing unit includes:
- a gradient value setting subunit configured to, when the communication request is received, set the sparse gradient values whose absolute values are less than the gradient threshold to zero, to obtain zeroed gradient values
- a zeroed-information acquisition subunit configured to use the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
- the importance degree calculation unit includes:
- the calculation completion judging subunit is configured to have the node judge whether the calculation of the gradient information is finished when the number of iterations is greater than the preset number of switching times;
- the importance degree calculation subunit is used to calculate the importance degree coefficient of the calculated gradient information when the node has completed the calculation of the gradient information to obtain the importance degree coefficient of each gradient information;
- the broadcast waiting unit is configured to send a broadcast waiting message when the node has not completed the calculation of the gradient information.
- This application also provides a computer device, including:
- Memory used to store computer programs
- the processor is used to implement the steps of the gradient information update method described above when the computer program is executed.
- the present application also provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gradient information update method described above are realized.
- the gradient information update method for distributed deep learning includes: when the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold; the received gradient information is reduced to obtain reduced gradient information; the reduced gradient information is corrected according to the set inertia coefficient to obtain a target gradient, and the replica model in the node is updated according to the target gradient.
- sparse communication is carried out according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
- after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model.
- on the basis of sparse communication, the number of communications is reduced and the amount of data transferred is also reduced.
- the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
- This application also provides a distributed deep learning gradient information update device, computer equipment, and computer-readable storage medium, which have the above beneficial effects, and will not be repeated here.
- FIG. 1 is a flowchart of a method for updating gradient information of distributed deep learning according to an embodiment of the application
- FIG. 2 is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of the application.
- the core of this application is to provide a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning.
- sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
- at the same time, after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model.
- on the basis of sparse communication, the number of communications is reduced and the amount of data transferred is also reduced.
- the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
- when a large-scale distributed deep learning model is trained, each computing node generates its own gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes.
- the global gradient value is thereby obtained, and the model is updated synchronously to keep the replica model on each computing node consistent.
- the size of the model and the communication bandwidth between nodes determine the communication duration
- the heterogeneous characteristics of the nodes, or manufacturing tolerances among homogeneous devices, determine the communication waiting time, and these two factors together determine the training duration.
- the communication strategy between nodes is therefore very important; the communication strategies in the prior art severely degrade the efficiency of model training and hence of deep learning.
- this application provides a gradient information update method for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
- after the gradient information is received, the reduced gradient is corrected to obtain the target gradient.
- the target gradient is then used to update the replica model.
- the correction processing preserves the reliability of the gradient information and realizes sparse communication with less data, improving both the efficiency of distributed deep learning and the speed of sparse communication.
- FIG. 1 is a flowchart of a method for updating gradient information of distributed deep learning according to an embodiment of the application.
- the method may include:
- the purpose of this step is to have the node perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times. That is, during distributed deep learning training, whether the number of training iterations exceeds the preset number of switching times is judged in real time; once it does, this step performs sparse communication on the calculated gradient information according to the preset gradient threshold.
- in the prior art, to improve communication efficiency in distributed deep learning, a sparse communication strategy is usually adopted directly, so as to reduce the amount of communication data between nodes and improve the efficiency of data communication.
- however, the sparse communication methods used in the prior art generally apply sparse communication throughout the entire distributed deep learning process.
- in the early stage of model training, gradient values tend to be relatively large and are crucial to the decrease of the objective function. Therefore, in this step a warm-up strategy is added to the gradient information update process, so that sparse data communication is only started after a certain learning rate is reached, yielding more suitable gradient values.
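- As a minimal sketch of this warm-up switch (purely illustrative; the constant `SWITCH_ITERS`, the function name `use_sparse_communication`, and the sample iteration numbers are assumptions, not taken from the patent), the following Python snippet shows the iteration-count check that decides when to move from full gradient exchange to sparse communication:

```python
SWITCH_ITERS = 100  # preset number of switching times (assumed value)

def use_sparse_communication(iter_count: int) -> bool:
    """Warm-up strategy: full (dense) gradient exchange for the first
    SWITCH_ITERS iterations, sparse communication only after that."""
    return iter_count > SWITCH_ITERS

for it in (1, 50, 100, 101, 500):
    mode = "sparse" if use_sparse_communication(it) else "dense"
    print(f"iteration {it}: {mode} gradient communication")
```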
- this step may include:
- Step 1: when the number of iterations is greater than the preset number of switching times, the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
- Step 2: take the gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and broadcast the corresponding index information so that other nodes can send communication requests for the sparse gradient to the node according to the index information;
- Step 3: when a communication request is received, perform zeroing processing on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information;
- Step 4: communicate the zeroed gradient information.
- steps 1 to 4 above realize the sparse communication of gradient information.
- first, when the number of iterations is greater than the preset number of switching times, the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information.
- this importance calculation determines which gradient information matters, so that only important gradient information is transmitted, reducing the amount of data transmitted while improving transmission efficiency.
- then, the gradient information whose importance degree coefficient is greater than the preset coefficient is taken as the sparse gradient, and the corresponding index information is broadcast so that other nodes send communication requests for the sparse gradient to this node according to the index information; in other words, the other nodes are notified to request the gradient information, so that the corresponding gradient information can be sent to them.
- next, when a communication request is received, the local node performs zeroing processing on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information; that is, gradient values smaller than the gradient threshold are set directly to zero, further reducing the amount of data. Finally, the zeroed gradient information is communicated, realizing sparse communication.
- in this way, the gradient threshold further shrinks the data volume of the sparse gradient, further reducing the amount of data communicated and greatly improving the efficiency of sparse communication.
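- A rough single-node numpy sketch of steps 1 to 4 is given below; the helper names, the constants `PRESET_COEFF` and `GRAD_THRESHOLD`, and the use of the absolute gradient value as the importance measure are illustrative assumptions based on the description above, not a definitive implementation of the patent:

```python
import numpy as np

PRESET_COEFF = 0.5    # preset importance coefficient (assumed value)
GRAD_THRESHOLD = 0.3  # preset gradient threshold (assumed value)

def importance_coefficients(grad: np.ndarray) -> np.ndarray:
    # Step 1: the absolute gradient value is used as the importance measure.
    return np.abs(grad)

def select_sparse_indices(grad: np.ndarray) -> np.ndarray:
    # Step 2: components whose importance coefficient exceeds the preset
    # coefficient form the sparse gradient; their indices are broadcast.
    return np.flatnonzero(importance_coefficients(grad) > PRESET_COEFF)

def zeroed_sparse(grad: np.ndarray, idx: np.ndarray) -> np.ndarray:
    # Step 3: on a communication request, sparse-gradient values whose
    # absolute value is below the gradient threshold are set to zero.
    sparse = grad[idx].copy()
    sparse[np.abs(sparse) < GRAD_THRESHOLD] = 0.0
    return sparse

grad_a = np.array([0.9, -0.05, 0.4, -0.7, 0.1, 0.35])  # node that finished first
grad_b = np.array([0.2, 0.6, -0.1, -0.8, 0.05, 0.25])  # another node's gradient

idx = select_sparse_indices(grad_a)   # index information broadcast by the first node
print(idx)                            # -> [0 3]
print(zeroed_sparse(grad_b, idx))     # Step 4: [ 0.  -0.8] is what gets communicated
```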
- step 3 in this optional solution may include:
- when a communication request is received, the sparse gradient values whose absolute values are less than the gradient threshold are set to zero to obtain zeroed gradient values; the remaining sparse gradients and the zeroed gradient values together form the zeroed gradient information.
- that is, the sparse gradient is split according to the gradient threshold into the components whose absolute values are less than the threshold and the remaining components.
- the components whose absolute values are less than the gradient threshold are set to zero, finally yielding the zeroed gradient information.
- the data volume of gradient information is further reduced, and the efficiency of sparse communication is improved.
- step 1 in this optional solution may include:
- Step 1.1 when the number of iterations is greater than the preset number of switching times, the node judges whether the calculation of the gradient information is completed; if so, execute step 1.2; if not, execute step 1.3;
- Step 1.2 calculate the importance degree coefficient of the calculated gradient information, and obtain the importance degree coefficient of each gradient information
- Step 1.3 send a broadcast waiting message.
- that is, when the number of iterations is greater than the preset number of switching times, the node determines whether it has finished calculating the gradient information. If so, it carries out the importance evaluation; if not, it waits for the broadcast message in order to obtain the corresponding gradient information.
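- The coordination just described can be pictured with the small single-process simulation below; the completion flags, node numbering, and helper logic are assumptions for illustration, and a real system would rely on a collective broadcast primitive rather than this in-memory loop:

```python
import numpy as np

PRESET_COEFF = 0.5  # preset importance coefficient (assumed value)

# Simulated per-node gradients; node 1 is the first to finish computing.
node_grads = {0: None,                               # still computing
              1: np.array([0.9, -0.05, 0.4, -0.7]),  # finished first
              2: None}                               # still computing

# Only the first node that has finished evaluates gradient importance and
# produces the index information; the other nodes wait for the broadcast.
first_done = next(n for n, g in node_grads.items() if g is not None)
index_info = np.flatnonzero(np.abs(node_grads[first_done]) > PRESET_COEFF)

for n, g in node_grads.items():
    if n == first_done:
        print(f"node {n}: broadcasts index information {index_info.tolist()}")
    else:
        print(f"node {n}: sends a broadcast waiting message and waits")
```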
- S102 Perform reduction processing on the received gradient information to obtain reduced gradient information
- each node performing sparse communication is sending gradient data to other nodes, so that each node updates the gradient information of the corresponding replica model.
- the received gradient information is reduced to obtain the reduced gradient information.
- the reduction processing in this step mainly refers to the global reduction processing. Specifically, any reduction processing method provided by the prior art can be used, and will not be repeated here.
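- For illustration only, the global reduction can be pictured as an element-wise average of the zeroed sparse gradients received from all nodes at the shared index positions; the numpy simulation below stands in for whatever all-reduce primitive the cluster actually uses:

```python
import numpy as np

# Zeroed sparse gradients received from three nodes at the same broadcast
# index positions (values below the gradient threshold already set to zero).
received = [
    np.array([0.9, -0.7, 0.0]),
    np.array([0.0, -0.8, 0.6]),
    np.array([0.5,  0.0, 0.0]),
]

# Global reduction: element-wise average across nodes, so every node ends up
# with the same reduced gradient for the shared sparse positions.
reduced = np.mean(received, axis=0)
print(reduced)  # -> approximately [ 0.467 -0.5  0.2 ]
```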
- S103 Perform correction processing on the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and update the replica model in the node according to the target gradient.
- the purpose of this step is to correct the reduced gradient information according to the set inertia coefficient to obtain the target gradient, and to update the replica model in the node according to the target gradient. That is, each node that has received gradient information and performed the reduction operation to obtain the final reduced gradient information further corrects this reduced gradient information according to the set inertia coefficient, avoiding the problem of the gradient information being insufficiently accurate; the target gradient is thus obtained, and finally the replica model in the node is updated according to the target gradient.
- the reduced gradient information is corrected according to the set inertia coefficient in order to solve the problem of failure to converge.
- specifically, the following formula can be used to correct the reduced gradient information: G_{t+1} = m·G_t + sparse(G_{t+1}), where G_{t+1} is the gradient value at the (t+1)-th iteration, G_t is the gradient value at the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient importance evaluation, and m is the inertia coefficient.
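- A short numeric illustration of this correction follows; the inertia coefficient and gradient values are made-up sample numbers, used only to show how the formula combines the previous gradient with the newly reduced sparse gradient:

```python
import numpy as np

m = 0.9                                      # inertia coefficient (assumed value)
G_t = np.array([0.40, -0.50, 0.20])          # gradient value from iteration t
sparse_G_t1 = np.array([0.30, 0.00, -0.10])  # reduced sparse gradient at iteration t+1

# G_{t+1} = m * G_t + sparse(G_{t+1})
G_t1 = m * G_t + sparse_G_t1
print(G_t1)  # -> [ 0.66 -0.45  0.08]

# G_t1 is the target gradient used to update the replica model on the node.
```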
- in summary, this embodiment performs sparse communication according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
- after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and the target gradient is finally used to update the replica model, which, on the basis of sparse communication, reduces the number of communications and also the amount of data transferred.
- the correction processing preserves the reliability of the gradient information, achieves sparse communication with less data, and improves both the efficiency of distributed deep learning and the speed of sparse communication.
- this patent uses the absolute value of the gradient value as the basis for determining the importance of the gradient.
- in the notation of this embodiment, Threshold is the preset threshold; G denotes the full set of gradients and G_i is the i-th dimension of G; Iter_count is the current iteration count; G_{t+1} is the gradient value at the (t+1)-th iteration; G_t is the gradient value at the t-th iteration; sparse(·) is the sparse gradient obtained after the gradient importance evaluation; and m is the inertia coefficient.
- the method in this embodiment may include:
- Step 1: when the switch value Iter is reached, the node that finishes calculating its gradient first evaluates the importance of the gradient, obtains the index information of the important gradients, and initiates a global broadcast to send the index information to every other computing node;
- Step 2 Each node selects the gradient at the corresponding position from the gradient according to the index information to form a sparse gradient, and initiates a communication request;
- Step 3 Perform sparse communication between computing nodes, perform a reduction operation, and obtain the gradient after the global reduction process, and use the gradient to update the replica model on each node.
- the sparse communication strategy in this embodiment aims to reduce the amount of communication data and reduce the communication duration without loss of accuracy.
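- Putting the three steps together, the sketch below simulates one sparse-communication iteration across a few nodes in a single process and applies the resulting update to every replica model; all constants, the random gradients, and the SGD-style update are illustrative assumptions, and a real deployment would run one process per node with collective communication:

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES, DIM = 3, 8
PRESET_COEFF, GRAD_THRESHOLD, M, LR = 0.5, 0.2, 0.9, 0.1  # assumed values

replicas = [np.zeros(DIM) for _ in range(N_NODES)]  # identical replica models
prev_grad = np.zeros(DIM)                           # G_t from the previous iteration

# One iteration after the switch value Iter has been reached.
local_grads = [rng.normal(size=DIM) for _ in range(N_NODES)]

# Step 1: the node that finishes first evaluates importance and broadcasts indices.
idx = np.flatnonzero(np.abs(local_grads[0]) > PRESET_COEFF)

# Step 2: every node selects the gradient at the broadcast positions and zeroes
# values below the gradient threshold before initiating communication.
sparse = []
for g in local_grads:
    s = g[idx].copy()
    s[np.abs(s) < GRAD_THRESHOLD] = 0.0
    sparse.append(s)

# Step 3: global reduction over the sparse gradients, momentum-style correction
# G_{t+1} = m*G_t + sparse(G_{t+1}), then the same replica-model update on every node.
reduced = np.zeros(DIM)
reduced[idx] = np.mean(sparse, axis=0)
target = M * prev_grad + reduced
replicas = [w - LR * target for w in replicas]
print(target)
```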
- the following describes a gradient information update device for distributed deep learning provided by an embodiment of the present application.
- the gradient information update device for distributed deep learning described below and the gradient information update method for distributed deep learning described above may be referred to in correspondence with each other.
- FIG. 2 is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of the application.
- the device may include:
- the sparse communication module 100 is configured to perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times;
- the reduction processing module 200 is configured to perform reduction processing on the received gradient information to obtain reduced gradient information
- the replica model update module 300 is used to correct the reduced gradient information according to the set inertia coefficient to obtain the target gradient, and update the replica model in the node according to the target gradient.
- the sparse communication module 100 may include:
- the importance degree calculation unit is used to have the node calculate the importance degree coefficient of the calculated gradient information when the number of iterations is greater than the preset number of switching times, to obtain the importance degree coefficient of each piece of gradient information;
- An index information broadcasting unit configured to use gradient information with an importance degree coefficient greater than a preset coefficient as a sparse gradient, and broadcast the sparse index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
- the zeroing processing unit is used to perform zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain the zeroed gradient information;
- the sparse communication unit is used to communicate the zeroed gradient information.
- the zeroing processing unit may include:
- the gradient value setting subunit is used to set the sparse gradient values whose absolute values are less than the gradient threshold to zero when a communication request is received, to obtain zeroed gradient values;
- the zeroed-information acquisition subunit is used to take the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
- the importance degree calculation unit may include:
- the calculation completion judging subunit is used for the node to judge whether the calculation of the gradient information is completed when the number of iterations is greater than the preset number of switching times;
- the importance degree calculation subunit is used to calculate the importance degree coefficient of the calculated gradient information when the node completes the calculation of the gradient information, and obtain the importance degree coefficient of each gradient information;
- the broadcast waiting unit is used to send a broadcast waiting message when the node has not completed the calculation of the gradient information.
- This application also provides a computer device, including:
- Memory used to store computer programs
- the processor is used to implement the steps of the gradient information update method described in the above embodiment when the computer program is executed.
- the present application also provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gradient information update method as described in the above embodiments are implemented.
- the steps of the method or algorithm described in combination with the embodiments disclosed herein can be directly implemented by hardware, a software module executed by a processor, or a combination of the two.
- the software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Complex Calculations (AREA)
Abstract
A gradient information updating method for distributed deep learning, and a gradient information updating apparatus, a computer device and a computer-readable storage medium. The method comprises: when the number of iterations is greater than a preset number of times of switching, a node performing sparse communication on calculated gradient information according to a preset gradient threshold value; reducing received gradient information to obtain reduced gradient information; and correcting the reduced gradient information according to a set inertia coefficient so as to obtain a target gradient, and updating a copy model in the node according to the target gradient. The present invention is aimed at reducing the data volume of communication and reducing the duration of communication while ensuring that precision is not lost.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 28, 2020, with application number 202010469747.7 and invention title "A gradient information update method for distributed deep learning and related apparatus", the entire contents of which are incorporated herein by reference.
This application relates to the field of computer technology, and in particular to a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning.
With the continuous development of information technology, the requirements on deep learning training keep rising, and distributed deep learning has emerged to improve training efficiency. When a distributed deep learning model is trained, communication between computing nodes is required to carry out the reduction processing of gradient information and to ensure that the replica model on each computing node is updated synchronously. When the model is very large, the sample size is very large, and the computing cluster is very large, the communication time between computing nodes is long and causes lengthy message-processing waits; GPU utilization during this process is low and computing resources cannot be fully used, leading to long model training times and reduced efficiency of trial-and-error and debugging.
In the prior art, the main issue is that when a large-scale distributed deep learning model is trained, each computing node generates its own gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes to obtain the global gradient value, after which the model is updated synchronously to keep the replica model on each computing node consistent. In this process, the size of the model and the communication bandwidth between nodes determine the communication duration, while the heterogeneous characteristics of the nodes or manufacturing tolerances among homogeneous devices determine the communication waiting time; together these two factors determine the training duration. In distributed deep model training, the communication strategy between nodes is therefore very important, and the communication strategies in the prior art severely degrade the efficiency of model training and hence of deep learning.
Therefore, how to improve the efficiency of distributed deep learning and the speed of sparse communication is a key concern for those skilled in the art.
Summary of the invention
The purpose of this application is to provide a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning, in which sparse communication is performed according to a gradient threshold only when the number of iterations of a node is greater than a preset number of switching times; at the same time, after gradient information is received, the reduced gradient is corrected to obtain a target gradient, and the target gradient is finally used to update the replica model. On the basis of sparse communication this reduces both the number of communications and the amount of data transferred, and the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
In order to solve the above technical problems, this application provides a gradient information update method for distributed deep learning, including:
when the number of iterations is greater than a preset number of switching times, a node performing sparse communication on the calculated gradient information according to a preset gradient threshold;
performing reduction processing on the received gradient information to obtain reduced gradient information;
correcting the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and updating the replica model in the node according to the target gradient.
Optionally, the node performing sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times includes:
when the number of iterations is greater than the preset number of switching times, the node calculating the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
taking gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and broadcasting the corresponding index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
when a communication request is received, performing zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
communicating the zeroed gradient information.
Optionally, performing zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain the zeroed gradient information, includes:
when the communication request is received, setting the sparse gradient values whose absolute values are less than the gradient threshold to zero to obtain zeroed gradient values;
using the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
Optionally, the node calculating the importance degree coefficient of the calculated gradient information when the number of iterations is greater than the preset number of switching times, to obtain the importance degree coefficient of each piece of gradient information, includes:
when the number of iterations is greater than the preset number of switching times, the node determining whether it has finished calculating the gradient information;
if so, calculating the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
if not, sending a broadcast waiting message.
This application also provides a gradient information update device for distributed deep learning, including:
a sparse communication module, used to make the node perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times;
a reduction processing module, used to perform reduction processing on the received gradient information to obtain reduced gradient information;
a replica model update module, used to correct the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and to update the replica model in the node according to the target gradient.
Optionally, the sparse communication module includes:
an importance degree calculation unit, configured to have the node calculate the importance degree coefficient of the calculated gradient information when the number of iterations is greater than the preset number of switching times, to obtain the importance degree coefficient of each piece of gradient information;
an index information broadcasting unit, configured to take gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and to broadcast the corresponding index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
a zeroing processing unit, configured to perform zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain the zeroed gradient information;
a sparse communication unit, used to communicate the zeroed gradient information.
Optionally, the zeroing processing unit includes:
a gradient value setting subunit, configured to set the sparse gradient values whose absolute values are less than the gradient threshold to zero when the communication request is received, to obtain zeroed gradient values;
a zeroed-information acquisition subunit, configured to use the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
Optionally, the importance degree calculation unit includes:
a calculation completion judging subunit, configured to have the node judge whether the calculation of the gradient information is finished when the number of iterations is greater than the preset number of switching times;
an importance degree calculation subunit, configured to calculate the importance degree coefficient of the calculated gradient information when the node has finished calculating the gradient information, to obtain the importance degree coefficient of each piece of gradient information;
a broadcast waiting unit, configured to send a broadcast waiting message when the node has not finished calculating the gradient information.
This application also provides a computer device, including:
a memory, used to store a computer program;
a processor, used to implement the steps of the gradient information update method described above when executing the computer program.
This application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the gradient information update method described above are realized.
The gradient information update method for distributed deep learning provided by this application includes: when the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold; the received gradient information is reduced to obtain reduced gradient information; the reduced gradient information is corrected according to the set inertia coefficient to obtain the target gradient, and the replica model in the node is updated according to the target gradient.
By performing sparse communication according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times, correcting the reduced gradient to obtain the target gradient once the gradient information has been received, and finally using the target gradient to update the replica model, the number of communications is reduced on the basis of sparse communication and the amount of data transferred is also reduced; the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
This application also provides a gradient information update device for distributed deep learning, a computer device, and a computer-readable storage medium, which have the above beneficial effects and will not be described again here.
In order to describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of this application, and for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative work.
FIG. 1 is a flowchart of a gradient information update method for distributed deep learning provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of this application.
The core of this application is to provide a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times; at the same time, after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and the target gradient is finally used to update the replica model. On the basis of sparse communication this reduces both the number of communications and the amount of data transferred, and the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
In order to make the purpose, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of this application.
In the prior art, the main issue is that when a large-scale distributed deep learning model is trained, each computing node generates its own gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes to obtain the global gradient value, after which the model is updated synchronously to keep the replica model on each computing node consistent. In this process, the size of the model and the communication bandwidth between nodes determine the communication duration, while the heterogeneous characteristics of the nodes or manufacturing tolerances among homogeneous devices determine the communication waiting time; together these two factors determine the training duration. In distributed deep model training, the communication strategy between nodes is therefore very important, and the communication strategies in the prior art severely degrade the efficiency of model training and hence of deep learning.
Therefore, this application provides a gradient information update method for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times; at the same time, after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and the target gradient is finally used to update the replica model. On the basis of sparse communication this reduces both the number of communications and the amount of data transferred, and the correction processing preserves the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
Please refer to FIG. 1, which is a flowchart of a gradient information update method for distributed deep learning provided by an embodiment of this application.
In this embodiment, the method may include:
S101: when the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold.
This step aims to have the node perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times. That is, during distributed deep learning training, whether the number of training iterations exceeds the preset number of switching times is judged in real time; once it does, this step performs sparse communication on the calculated gradient information according to the preset gradient threshold.
In the prior art, to improve communication efficiency in distributed deep learning, a sparse communication strategy is usually adopted directly so as to reduce the amount of communication data between nodes and improve the efficiency of data communication. However, the sparse communication methods in the prior art generally apply sparse communication throughout the entire distributed deep learning process. In the early stage of model training, gradient values tend to be relatively large and are crucial to the decrease of the objective function. Therefore, in this step a warm-up strategy is added to the gradient information update process, so that sparse data communication is only started after a certain learning rate is reached, yielding more suitable gradient values.
Optionally, to explain this step further, it may include:
Step 1: when the number of iterations is greater than the preset number of switching times, the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
Step 2: take the gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and broadcast the corresponding index information so that other nodes can send communication requests for the sparse gradient to the node according to the index information;
Step 3: when a communication request is received, perform zeroing processing on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information;
Step 4: communicate the zeroed gradient information.
It can be seen that steps 1 to 4 of this optional solution realize the sparse communication of gradient information. First, when the number of iterations is greater than the preset number of switching times, the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information. This importance calculation determines which gradient information matters, so that only important gradient information is transmitted, reducing the amount of data transmitted while improving transmission efficiency. Then, the gradient information whose importance degree coefficient is greater than the preset coefficient is taken as the sparse gradient, and the corresponding index information is broadcast so that other nodes send communication requests for the sparse gradient to this node according to the index information; in other words, the other nodes are notified to request the gradient information, so that the corresponding gradient information can be sent to them. Next, when a communication request is received, the local node performs zeroing processing on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information; that is, gradient values smaller than the gradient threshold are set directly to zero, further reducing the amount of data. Finally, the zeroed gradient information is communicated, realizing sparse communication.
In this optional solution, the gradient threshold further shrinks the data volume of the sparse gradient, further reducing the amount of data communicated and greatly improving the efficiency of sparse communication.
Optionally, step 3 of this optional solution may include:
when the communication request is received, setting the sparse gradient values whose absolute values are less than the gradient threshold to zero to obtain zeroed gradient values, and using the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
That is, the sparse gradient is split according to the gradient threshold into the components whose absolute values are less than the threshold and the remaining components; the former are set to zero, finally yielding the zeroed gradient information. This further reduces the data volume of the gradient information and improves the efficiency of sparse communication.
Optionally, step 1 of this optional solution may include:
Step 1.1: when the number of iterations is greater than the preset number of switching times, the node judges whether it has finished calculating the gradient information; if so, step 1.2 is executed; if not, step 1.3 is executed;
Step 1.2: calculate the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
Step 1.3: send a broadcast waiting message.
The node first judges whether it has finished calculating the gradient information. The main reason is that if every node performed its own gradient importance evaluation, the amount of computation would increase, and this increase grows linearly with the number of computing nodes; moreover, the sparse gradient information obtained by different computing nodes would not match in dimension and position. To address this, in this optional solution only the node that finishes calculating its gradient first performs the gradient importance evaluation, obtains the index information of the important gradients, and broadcasts it to all other nodes; all nodes then select only the gradients corresponding to the index information when initiating communication requests. Therefore, in this optional solution the node judges whether it has finished calculating the gradient information: if so, it performs the importance calculation; if not, it waits for the broadcast message in order to obtain the corresponding gradient information.
S102: perform reduction processing on the received gradient information to obtain reduced gradient information.
Building on S101, the model update in this embodiment takes place in a distributed network, so every node taking part in sparse communication sends gradient data to the other nodes, and every node updates the gradient information of its own replica model. In this step, when the node receives gradient information sent by other nodes, it performs reduction processing on the received gradient information to obtain the reduced gradient information.
The reduction processing in this step mainly refers to global reduction. Specifically, any reduction method provided by the prior art may be used, and details are not repeated here.
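For concreteness, the global reduction can be sketched as an all-reduce followed by averaging over the nodes; both mpi4py and the averaging step are assumptions made for illustration, since this step may use any existing reduction method.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def reduce_gradients(selected_grad: np.ndarray) -> np.ndarray:
    """Sum the sparse gradients contributed by all nodes and average the result.

    Every node ends up with the same reduced gradient information, which is
    then used to update its local replica model.
    """
    total = comm.allreduce(selected_grad, op=MPI.SUM)
    return total / comm.Get_size()
```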
S103: correct the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and update the replica model in the node according to the target gradient.
Building on S102, this step corrects the reduced gradient information according to the set inertia coefficient to obtain the target gradient and updates the replica model in the node according to the target gradient. That is, each node that has received gradient information and performed the reduction operation to obtain the final reduced gradient information further corrects that information according to the set inertia coefficient, avoiding the problem of the gradient information becoming too inaccurate, obtains the target gradient, and finally updates the replica model in the node according to the target gradient.
The main reason is that, in the prior art, discarding gradient information causes the weight update direction during stochastic gradient descent to deviate from the ideal direction; this deviation accumulates as the iterations proceed, so training may fail to converge and become invalid. Therefore, to avoid training failure, in this step the reduced gradient information is corrected according to the set inertia coefficient so as to resolve the convergence problem. Specifically, in this embodiment the reduced gradient information may be corrected using the following formula:
G_{t+1} = m·G_t + sparse(G_{t+1})
where G_{t+1} is the gradient value at the (t+1)-th iteration, G_t is the gradient value at the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient importance evaluation, and m is the inertia coefficient.
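A minimal sketch of this correction step, assuming the gradients are NumPy arrays and `m` is the inertia coefficient (variable names are illustrative):

```python
import numpy as np

def correct_gradient(prev_grad: np.ndarray, reduced_sparse_grad: np.ndarray, m: float) -> np.ndarray:
    """Implements G_{t+1} = m * G_t + sparse(G_{t+1}).

    The inertia term m * G_t carries part of the previous gradient forward,
    compensating for the information dropped by sparsification.
    """
    return m * prev_grad + reduced_sparse_grad
```

The target gradient returned here is what each node then applies to its replica model.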
In summary, in this embodiment sparse communication based on the gradient threshold is performed only when the node's iteration count exceeds the preset number of switching times, and when gradient information is received the reduced gradient is corrected to obtain the target gradient, which is then used to update the replica model. On top of sparse communication this reduces both the number of communications and the amount of data transferred, while the correction step preserves the reliability of the gradient information. Sparse communication with less data is thereby achieved, improving the efficiency of distributed deep learning and the speed of sparse communication.
A gradient information update method for distributed deep learning provided by this application is described below by means of a specific embodiment.
In the prior art, the long communication time of large-scale distributed model training is caused by the large amount of communication and computation between computing nodes. This embodiment proposes a sparse communication strategy for this problem: when a computing node finishes its Forward-Backward pass, it judges the importance of the gradients, initiates communication requests only for the important gradients, and ignores the minor gradient information.
In addition, during the training of a deep learning model a large number of trainable parameters are zero or close to zero. This application therefore takes the absolute value of a gradient as the basis for judging its importance: when the absolute value of the gradient is greater than the preset threshold Threshold, the gradient is considered important; otherwise, the gradient is considered unimportant and the gradient value is retained. The formula is as follows:
where G denotes all gradients and G_i is the i-th dimension of G.
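The formula itself does not survive in this text; a reconstruction consistent with the description above, offered as an assumption rather than a quotation of the original, is the element-wise rule:

```latex
\mathrm{sparse}(G)_i =
\begin{cases}
G_i, & |G_i| > \mathrm{Threshold} \\
0,   & \text{otherwise}
\end{cases}
```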
In addition, considering that gradient values tend to be large at the beginning of training and are crucial to driving down the objective function, a warm-up strategy is proposed on top of the sparsification strategy above. In the warm-up strategy a switch value Iter is set; once the iteration count reaches Iter, the gradient importance evaluation is performed and the important gradients are selected to initiate communication requests. The formula is as follows:
where Iter_count is the current iteration count.
When Iter is set to 0, sparse communication is used from the start of training; when Iter is set to the maximum number of iterations Iter_max, sparse communication is not used.
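The warm-up switch described above can be sketched as follows; the variable names and the use of a greater-than-or-equal comparison are assumptions consistent with the description, not a quotation of the missing formula.

```python
def use_sparse_communication(iter_count: int, switch_iter: int, max_iter: int) -> bool:
    """Warm-up switch: dense communication before switch_iter, sparse afterwards.

    switch_iter == 0        -> sparse communication from the start of training
    switch_iter == max_iter -> sparse communication is never used
    """
    return switch_iter < max_iter and iter_count >= switch_iter
```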
It is also worth noting that discarding gradient information causes the weight update direction during stochastic gradient descent to deviate from the ideal direction; this deviation accumulates as the iterations proceed, so training may fail to converge and become invalid. To address this situation, a gradient correction strategy is proposed, with the following formula:
G_{t+1} = m·G_t + sparse(G_{t+1})
where G_{t+1} is the gradient value at the (t+1)-th iteration, G_t is the gradient value at the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient importance evaluation, and m is the inertia coefficient.
In addition, if every computing node performed the gradient importance evaluation, two drawbacks would follow: 1) the amount of computation increases, and the increase grows linearly with the number of computing nodes; 2) the sparse gradient information obtained by different computing nodes does not match in dimension and position. To address this problem, a fastest-first processing strategy is proposed: only the node that computes its gradients first performs the gradient importance evaluation, obtains the index information of the important gradients, and broadcasts it to all other nodes. All nodes then select only the gradients at the corresponding indices to initiate communication requests.
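To illustrate how every node uses the broadcast index information, the selection of gradients at the corresponding positions can be sketched as below (variable names are illustrative):

```python
import numpy as np

def select_by_index(local_grad: np.ndarray, index_info: np.ndarray) -> np.ndarray:
    """Keep only the positions named in the broadcast index information.

    The result keeps the shape of the local gradient with all other positions
    set to zero, so the sparsified gradients of different nodes match in
    dimension and position.
    """
    sparse_grad = np.zeros_like(local_grad)
    sparse_grad[index_info] = local_grad[index_info]
    return sparse_grad
```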
Based on the above description, the method in this embodiment may include:
Step 1: when the switch value Iter is reached, the node that computes its gradients first performs the gradient importance evaluation, obtains the index information of the important gradients, and initiates a global broadcast to send the index information to every node that is still computing.
Step 2: according to the index information, each node selects the gradients at the corresponding positions to form the sparsified gradient, and initiates a communication request.
Step 3: perform sparse communication between the computing nodes and carry out the reduction operation to obtain the globally reduced gradient, which is used to update the replica model on each node.
It can be seen that the sparse communication strategy in this embodiment aims to reduce the amount of communicated data and shorten the communication time while ensuring that no accuracy is lost.
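Putting steps 1 to 3 together, one possible per-iteration flow is sketched below under the same assumptions as above: mpi4py for communication, rank 0 standing in for the first node to finish, and illustrative hyperparameter names and values.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

def train_step(weights, local_grad, prev_grad, iter_count,
               switch_iter=100, threshold=1e-3, m=0.9, lr=0.01):
    """One training iteration combining steps 1-3 of this embodiment."""
    if iter_count >= switch_iter:
        # Step 1: the first finisher (rank 0 here, for illustration) evaluates
        # gradient importance and broadcasts the index information.
        idx = np.flatnonzero(np.abs(local_grad) > threshold) if rank == 0 else None
        idx = comm.bcast(idx, root=0)
        # Step 2: every node keeps only the gradients at those positions.
        send = np.zeros_like(local_grad)
        send[idx] = local_grad[idx]
    else:
        send = local_grad  # warm-up phase: communicate the full gradient
    # Step 3: global reduction, inertia correction, replica model update.
    reduced = comm.allreduce(send, op=MPI.SUM) / world
    target = m * prev_grad + reduced
    weights -= lr * target
    return weights, target
```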
A gradient information update device for distributed deep learning provided by an embodiment of this application is described below; the device described below and the method described above may be referred to in correspondence with each other.
Please refer to FIG. 2, which is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of this application.
In this embodiment, the device may include:
a sparse communication module 100, configured to perform, when the number of iterations is greater than the preset number of switching times, sparse communication of the computed gradient information according to the preset gradient threshold;
a reduction processing module 200, configured to perform reduction processing on the received gradient information to obtain reduced gradient information;
a replica model update module 300, configured to correct the reduced gradient information according to the set inertia coefficient to obtain the target gradient, and to update the replica model in the node according to the target gradient.
Optionally, the sparse communication module 100 may include:
an importance calculation unit, configured to calculate, when the number of iterations is greater than the preset number of switching times, an importance coefficient for each piece of computed gradient information;
an index information broadcasting unit, configured to take gradient information whose importance coefficient is greater than a preset coefficient as the sparse gradient, and to broadcast the index information of the sparse gradient, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
a zeroing processing unit, configured to perform, when a communication request is received, zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
a sparse communication unit, configured to communicate the zeroed gradient information.
Optionally, the zeroing processing unit may include:
a gradient value setting subunit, configured to set, when a communication request is received, the values of those sparse gradients whose absolute value is less than the gradient threshold to zero to obtain zeroed gradient values;
a zeroed information acquisition subunit, configured to take the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
Optionally, the importance calculation unit may include:
a calculation completion judging subunit, configured to judge, when the number of iterations is greater than the preset number of switching times, whether the node has finished computing the gradient information;
an importance calculation subunit, configured to calculate, when the node has finished computing the gradient information, an importance coefficient for each piece of computed gradient information;
a broadcast waiting unit, configured to send a broadcast wait message when the node has not finished computing the gradient information.
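A structural skeleton of how these modules and units might be composed in code is given below; the class and method names are illustrative and are not taken from the specification.

```python
class SparseCommunicationModule:          # module 100
    def communicate(self, grad, iter_count, switch_iter, threshold):
        """Importance evaluation, index broadcast, zeroing, and communication."""
        raise NotImplementedError

class ReductionModule:                    # module 200
    def reduce(self, received_grad):
        """Global reduction of the received gradient information."""
        raise NotImplementedError

class ReplicaModelUpdateModule:           # module 300
    def update(self, reduced_grad, prev_grad, m, replica_model):
        """Inertia-coefficient correction followed by the replica model update."""
        raise NotImplementedError

class GradientUpdateDevice:
    """Composition mirroring modules 100, 200 and 300 described above."""
    def __init__(self):
        self.sparse_comm = SparseCommunicationModule()
        self.reduction = ReductionModule()
        self.replica_update = ReplicaModelUpdateModule()
```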
This application further provides a computer device, including:
a memory, configured to store a computer program; and
a processor, configured to implement the steps of the gradient information update method described in the above embodiments when executing the computer program.
This application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the gradient information update method described in the above embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant parts may refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of this application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The gradient information update method, gradient information update device, computer device, and computer-readable storage medium for distributed deep learning provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementation of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of this application.
Claims (10)
- A gradient information update method for distributed deep learning, comprising:
when the number of iterations is greater than a preset number of switching times, performing, by a node, sparse communication of computed gradient information according to a preset gradient threshold;
performing reduction processing on received gradient information to obtain reduced gradient information; and
correcting the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and updating a replica model in the node according to the target gradient.
- The gradient information update method according to claim 1, wherein performing, by the node, sparse communication of the computed gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times comprises:
when the number of iterations is greater than the preset number of switching times, calculating, by the node, an importance coefficient for each piece of computed gradient information;
taking gradient information whose importance coefficient is greater than a preset coefficient as a sparse gradient, and broadcasting index information of the sparse gradient, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
when a communication request is received, performing zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information; and
communicating the zeroed gradient information.
- The gradient information update method according to claim 2, wherein performing zeroing processing on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information when the communication request is received comprises:
when the communication request is received, setting the values of those sparse gradients whose absolute value is less than the gradient threshold to zero to obtain zeroed gradient values; and
taking the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
- The gradient information update method according to claim 2, wherein calculating, by the node, the importance coefficient for each piece of computed gradient information when the number of iterations is greater than the preset number of switching times comprises:
when the number of iterations is greater than the preset number of switching times, judging, by the node, whether the computation of the gradient information has been completed;
if so, calculating the importance coefficient for each piece of computed gradient information; and
if not, sending a broadcast wait message.
- A gradient information update device for distributed deep learning, comprising:
a sparse communication module, configured to perform, when the number of iterations is greater than a preset number of switching times, sparse communication of computed gradient information according to a preset gradient threshold;
a reduction processing module, configured to perform reduction processing on received gradient information to obtain reduced gradient information; and
a replica model update module, configured to correct the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and to update a replica model in the node according to the target gradient.
- The gradient information update device according to claim 5, wherein the sparse communication module comprises:
an importance calculation unit, configured to calculate, when the number of iterations is greater than the preset number of switching times, an importance coefficient for each piece of computed gradient information;
an index information broadcasting unit, configured to take gradient information whose importance coefficient is greater than a preset coefficient as a sparse gradient, and to broadcast index information of the sparse gradient, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
a zeroing processing unit, configured to perform, when a communication request is received, zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information; and
a sparse communication unit, configured to communicate the zeroed gradient information.
- The gradient information update device according to claim 5, wherein the zeroing processing unit comprises:
a gradient value setting subunit, configured to set, when the communication request is received, the values of those sparse gradients whose absolute value is less than the gradient threshold to zero to obtain zeroed gradient values; and
a zeroed information acquisition subunit, configured to take the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
- The gradient information update device according to claim 5, wherein the importance calculation unit comprises:
a calculation completion judging subunit, configured to judge, when the number of iterations is greater than the preset number of switching times, whether the node has finished computing the gradient information;
an importance calculation subunit, configured to calculate, when the node has finished computing the gradient information, an importance coefficient for each piece of computed gradient information; and
a broadcast waiting unit, configured to send a broadcast wait message when the node has not finished computing the gradient information.
- A computer device, comprising:
a memory, configured to store a computer program; and
a processor, configured to implement the steps of the gradient information update method according to any one of claims 1 to 4 when executing the computer program.
- A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gradient information update method according to any one of claims 1 to 4 are implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010469747.7 | 2020-05-28 | ||
CN202010469747.7A CN111625603A (en) | 2020-05-28 | 2020-05-28 | Gradient information updating method for distributed deep learning and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021238274A1 true WO2021238274A1 (en) | 2021-12-02 |
Family
ID=72272640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/073493 WO2021238274A1 (en) | 2020-05-28 | 2021-01-25 | Gradient information updating method for distributed deep learning, and related apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111625603A (en) |
WO (1) | WO2021238274A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115906982A (en) * | 2022-11-15 | 2023-04-04 | 北京百度网讯科技有限公司 | Distributed training method, gradient communication method, device and electronic equipment |
WO2024027676A1 (en) * | 2022-08-05 | 2024-02-08 | 索尼集团公司 | Apparatus and method for handover in hierarchical federated learning network, and medium |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111625603A (en) * | 2020-05-28 | 2020-09-04 | 浪潮电子信息产业股份有限公司 | Gradient information updating method for distributed deep learning and related device |
CN113570067B (en) * | 2021-07-23 | 2022-08-02 | 北京百度网讯科技有限公司 | Synchronization method and device of distributed system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109472347A (en) * | 2018-10-15 | 2019-03-15 | 中山大学 | A kind of gradient compression method of distribution deep learning |
CN109951438B (en) * | 2019-01-15 | 2020-11-20 | 中国科学院信息工程研究所 | Communication optimization method and system for distributed deep learning |
CN110287031B (en) * | 2019-07-01 | 2023-05-09 | 南京大学 | Method for reducing communication overhead of distributed machine learning |
- 2020-05-28: CN application CN202010469747.7A, published as CN111625603A (en), active, Pending
- 2021-01-25: WO application PCT/CN2021/073493, published as WO2021238274A1 (en), active, Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101814159A (en) * | 2009-02-24 | 2010-08-25 | 余华 | Speaker verification method based on combination of auto-associative neural network and Gaussian mixture model-universal background model |
CN108021982A (en) * | 2016-10-28 | 2018-05-11 | 北京市商汤科技开发有限公司 | Data transmission method and system, electronic equipment |
US20190213470A1 (en) * | 2018-01-09 | 2019-07-11 | NEC Laboratories Europe GmbH | Zero injection for distributed deep learning |
CN109102075A (en) * | 2018-07-26 | 2018-12-28 | 联想(北京)有限公司 | Gradient updating method and relevant device during a kind of distribution is trained |
CN111625603A (en) * | 2020-05-28 | 2020-09-04 | 浪潮电子信息产业股份有限公司 | Gradient information updating method for distributed deep learning and related device |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024027676A1 (en) * | 2022-08-05 | 2024-02-08 | 索尼集团公司 | Apparatus and method for handover in hierarchical federated learning network, and medium |
CN115906982A (en) * | 2022-11-15 | 2023-04-04 | 北京百度网讯科技有限公司 | Distributed training method, gradient communication method, device and electronic equipment |
CN115906982B (en) * | 2022-11-15 | 2023-10-24 | 北京百度网讯科技有限公司 | Distributed training method, gradient communication device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111625603A (en) | 2020-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021238274A1 (en) | Gradient information updating method for distributed deep learning, and related apparatus | |
CN111092823B (en) | Method and system for adaptively adjusting congestion control initial window | |
US11784931B2 (en) | Network burst load evacuation method for edge servers | |
CN110832509A (en) | Black box optimization using neural networks | |
CN113989561B (en) | Parameter aggregation updating method, device and system based on asynchronous federal learning | |
WO2022042741A1 (en) | Learning model training method, working node, server, device and medium | |
WO2020042332A1 (en) | Word vector-based event-driven service matching method | |
CN111935783A (en) | Edge cache system and method based on flow perception | |
CN112862088A (en) | Distributed deep learning method based on pipeline annular parameter communication | |
CN113723619A (en) | Federal learning training method based on training phase perception strategy | |
EP4187882A1 (en) | Data transmission method and apparatus, device, storage medium, and computer program product | |
WO2021128293A1 (en) | Model training method and apparatus, and storage medium and program product | |
CN113222148A (en) | Neural network reasoning acceleration method for material identification | |
CN107357592A (en) | A kind of event-handling method and device based on state machine mechanism | |
CN105939385A (en) | Request frequency based real-time data replacement method in NDN cache | |
CN113902128B (en) | Asynchronous federal learning method, device and medium for improving utilization efficiency of edge device | |
CN113064907B (en) | Content updating method based on deep reinforcement learning | |
WO2023045161A1 (en) | Correction method for electron beam proximity effect and device therefor | |
CN117151208B (en) | Asynchronous federal learning parameter updating method based on self-adaptive learning rate, electronic equipment and storage medium | |
CN112492026B (en) | Hybrid self-adaptive copy consistency updating method in dynamic cloud storage environment | |
CN111027671A (en) | Distributed deep learning communication method and system based on model structure characteristics | |
CN113114762A (en) | Data caching method and system | |
CN111465057B (en) | Edge caching method and device based on reinforcement learning and electronic equipment | |
Yu et al. | Accelerating distributed training in heterogeneous clusters via a straggler-aware parameter server | |
JP2020003860A (en) | Learning system, processing device, processing method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21812485; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 21812485; Country of ref document: EP; Kind code of ref document: A1 |