WO2021238274A1 - Gradient information updating method for distributed deep learning, and related apparatus - Google Patents

Gradient information updating method for distributed deep learning, and related apparatus

Info

Publication number
WO2021238274A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient
gradient information
sparse
information
node
Prior art date
Application number
PCT/CN2021/073493
Other languages
French (fr)
Chinese (zh)
Inventor
张玉彦
陈培
张东
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2021238274A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating

Definitions

  • This application relates to the field of computer technology, and in particular to a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning.
  • In the prior art, when a large-scale distributed deep learning model is trained, each computing node generates a piece of gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes to obtain the global gradient value, so that the model is updated synchronously and the replica model of each computing node stays consistent.
  • The size of the model and the communication bandwidth between nodes determine the communication duration.
  • The heterogeneous characteristics of the nodes or the manufacturing errors of homogeneous equipment determine the communication waiting time, and the two together determine the training duration.
  • The communication strategy between nodes is therefore very important; the communication strategy in the prior art seriously degrades the efficiency of model training and reduces the efficiency of deep learning.
  • The purpose of this application is to provide a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
  • At the same time, after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model.
  • On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced.
  • Finally, correction processing maintains the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
  • this application provides a method for updating gradient information of distributed deep learning, including:
  • When the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold;
  • the step in which, when the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold includes:
  • the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
  • the zeroed gradient information is communicated.
  • Performing zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information includes:
  • the remaining sparse gradients and the zeroed gradient values are used as the zeroed gradient information.
  • the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each gradient information, including:
  • the node determines whether the calculation of gradient information is completed
  • This application also provides a gradient information update device for distributed deep learning, including:
  • the sparse communication module is configured to perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times;
  • the reduction processing module is configured to perform reduction processing on the received gradient information to obtain reduced gradient information;
  • the replica model update module is configured to correct the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and to update the replica model in the node according to the target gradient.
  • the sparse communication module includes:
  • the importance degree calculation unit is configured to calculate the importance degree coefficient of the calculated gradient information by the node when the number of iterations is greater than the preset switching times to obtain the importance degree coefficient of each gradient information;
  • an index information broadcasting unit, configured to use gradient information with an importance degree coefficient greater than a preset coefficient as a sparse gradient, and to broadcast the sparse index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
  • a zeroing processing unit, configured to perform zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain zeroed gradient information;
  • the sparse communication unit is configured to communicate the zeroed gradient information.
  • the zeroing processing unit includes:
  • a gradient value setting subunit, configured to set to zero, when the communication request is received, the sparse gradient values whose absolute values are less than the gradient threshold, to obtain zeroed gradient values;
  • the zeroed-information acquisition subunit is configured to use the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
  • the importance degree calculation unit includes:
  • the calculation completion judging subunit is configured to judge whether the calculation of the gradient information is completed by the node when the number of iterations is greater than the preset switching number;
  • the importance degree calculation subunit is used to calculate the importance degree coefficient of the calculated gradient information when the node has completed the calculation of the gradient information to obtain the importance degree coefficient of each gradient information;
  • the broadcast waiting unit is configured to send a broadcast waiting message when the node has not completed the calculation of the gradient information.
  • This application also provides a computer device, including:
  • a memory, configured to store a computer program;
  • a processor, configured to implement the steps of the gradient information update method described above when executing the computer program.
  • the present application also provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gradient information update method described above are realized.
  • The gradient information update method for distributed deep learning includes: when the number of iterations is greater than the preset number of switching times, the node performs sparse communication on the calculated gradient information according to the preset gradient threshold; the received gradient information is reduced to obtain reduced gradient information; the reduced gradient information is corrected according to the set inertia coefficient to obtain the target gradient, and the replica model in the node is updated according to the target gradient.
  • Sparse communication is carried out according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
  • After the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model.
  • On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced.
  • Finally, correction processing maintains the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
  • This application also provides a distributed deep learning gradient information update device, computer equipment, and computer-readable storage medium, which have the above beneficial effects, and will not be repeated here.
  • FIG. 1 is a flowchart of a method for updating gradient information of distributed deep learning according to an embodiment of the application
  • FIG. 2 is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of the application.
  • the core of this application is to provide a gradient information update method, gradient information update device, computer equipment, and computer-readable storage medium for distributed deep learning.
  • Sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
  • At the same time, after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model.
  • On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced.
  • Finally, correction processing maintains the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
  • In the prior art, when a large-scale distributed deep learning model is trained, each computing node generates a piece of gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes to obtain the global gradient value, so that the model is updated synchronously and the replica model of each computing node stays consistent.
  • The size of the model and the communication bandwidth between nodes determine the communication duration.
  • The heterogeneous characteristics of the nodes or the manufacturing errors of homogeneous equipment determine the communication waiting time, and the two together determine the training duration.
  • The communication strategy between nodes is therefore very important; the communication strategy in the prior art seriously degrades the efficiency of model training and reduces the efficiency of deep learning.
  • Therefore, this application provides a method for updating gradient information of distributed deep learning, which performs sparse communication according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
  • After the gradient information is received, the reduced gradient is corrected to obtain the target gradient.
  • Finally, the target gradient is used to update the replica model.
  • On the basis of sparse communication, the number of communications and the amount of transmitted data are reduced; the correction processing maintains the reliability of the gradient information and achieves sparse communication with less data, improving the efficiency of distributed deep learning and the speed of sparse communication.
  • FIG. 1 is a flowchart of a method for updating gradient information of distributed deep learning according to an embodiment of the application.
  • the method may include:
  • The purpose of this step is to perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times. That is, during distributed deep learning training, whether the number of training iterations is greater than the preset number of switching times is judged in real time; when it is, the calculated gradient information is communicated sparsely according to the preset gradient threshold through this step.
  • In the prior art, in order to improve communication efficiency in distributed deep learning, a sparse communication strategy is usually adopted directly so as to reduce the amount of communication data between nodes and improve the efficiency of data communication.
  • However, the sparse communication method used in the prior art generally performs sparse communication throughout the whole distributed deep learning process.
  • In the early stage of model training, the gradient values are often relatively large and are crucial to the decrease of the objective function. Therefore, in this step a warm-up strategy is added to the gradient information update process, so that data communication is switched on only after a certain learning rate is reached, in order to obtain more suitable gradient values.
  • this step may include:
  • Step 1: when the number of iterations is greater than the preset number of switching times, the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
  • Step 2: gradient information whose importance degree coefficient is greater than a preset coefficient is used as a sparse gradient, and the sparse index information is broadcast so that other nodes can send communication requests for the sparse gradient to the node according to the index information;
  • Step 3: when a communication request is received, zeroing processing is performed on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
  • Step 4: the zeroed gradient information is communicated.
  • Steps 1 to 4 are mainly used to realize the sparse communication operation on the gradient information.
  • the node calculates the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each gradient information.
  • the importance degree coefficient calculation is mainly to determine the importance of gradient information, so that only important gradient information is transmitted during the data transmission process, which reduces the amount of data transmitted and improves the efficiency of data transmission.
  • The gradient information whose importance degree coefficient is greater than the preset coefficient is used as the sparse gradient, and the sparse index information is broadcast, so that other nodes send communication requests for the sparse gradient to the node according to the index information.
  • Then, when the communication request is received, the sparse gradient is zeroed according to the gradient threshold to obtain the zeroed gradient information. That is, when the local node receives the communication request, it performs zeroing processing on the sparse gradient according to the gradient threshold: gradient values smaller than the gradient threshold are directly set to zero, further reducing the amount of data. Finally, the zeroed gradient information is communicated in order to achieve sparse communication.
  • In this optional solution, the data volume of the sparse gradient is further reduced by the gradient threshold, which further reduces the amount of data in sparse communication and greatly improves the efficiency of sparse communication.
  • step 3 in this optional solution may include:
  • When the communication request is received, the sparse gradient values whose absolute values are less than the gradient threshold are set to zero to obtain zeroed gradient values; the remaining sparse gradients and the zeroed gradient values are used as the zeroed gradient information.
  • That is, the sparse gradients are classified according to the gradient threshold into those whose absolute values are less than the gradient threshold and the remaining sparse gradients.
  • The gradient values of the sparse gradients whose absolute values are less than the gradient threshold are set to zero, finally yielding the zeroed gradient information.
  • the data volume of gradient information is further reduced, and the efficiency of sparse communication is improved.
  • step 1 in this optional solution may include:
  • Step 1.1: when the number of iterations is greater than the preset number of switching times, the node judges whether the calculation of the gradient information is completed; if so, step 1.2 is executed; if not, step 1.3 is executed;
  • Step 1.2: the importance degree coefficient of the calculated gradient information is calculated to obtain the importance degree coefficient of each piece of gradient information;
  • Step 1.3: a broadcast waiting message is sent.
  • Therefore, the node judges whether it has completed the calculation of the gradient information; if so, the importance degree is calculated; if not, it waits for the broadcast message in order to obtain the corresponding gradient information.
  • S102 Perform reduction processing on the received gradient information to obtain reduced gradient information
  • each node performing sparse communication is sending gradient data to other nodes, so that each node updates the gradient information of the corresponding replica model.
  • the received gradient information is reduced to obtain the reduced gradient information.
  • the reduction processing in this step mainly refers to the global reduction processing. Specifically, any reduction processing method provided by the prior art can be used, and will not be repeated here.
  • S103 Perform correction processing on the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and update the replica model in the node according to the target gradient.
  • That is, each node that receives the gradient information and performs the reduction operation to obtain the final reduced gradient information further corrects the reduced gradient information according to the set inertia coefficient, so as to avoid the problem of the gradient information being insufficiently accurate, obtains the target gradient, and finally updates the replica model in the node according to the target gradient.
  • In order to avoid training failure, the reduced gradient information is corrected according to the set inertia coefficient, so as to solve the problem of failure to converge.
  • Specifically, the following formula can be used to correct the reduced gradient information:
  • G t+1 = mG t + sparse(G t+1)
  • where G t+1 is the gradient value of the (t+1)-th iteration, G t is the gradient value of the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient importance evaluation, and m is the inertia coefficient.
  • In summary, in this embodiment sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times.
  • After the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model, which reduces the number of communications on the basis of sparse communication and also reduces the amount of transmitted data.
  • Finally, correction processing maintains the reliability of the gradient information, achieves sparse communication with less data, and improves the efficiency of distributed deep learning and the speed of sparse communication.
  • In a specific embodiment, this patent uses the absolute value of the gradient value as the basis for determining the importance of the gradient, where:
  • Threshold is the preset threshold;
  • G is the full set of gradients;
  • G i is the i-th dimension of G;
  • Iter count is the currently counted number of iterations.
  • The corrected gradient likewise satisfies G t+1 = mG t + sparse(G t+1), where G t+1 is the gradient value of the (t+1)-th iteration, G t is the gradient value of the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient importance evaluation, and m is the inertia coefficient.
  • the method in this embodiment may include:
  • Step 1: when the switch value Iter is reached, the node that finishes calculating its gradient first evaluates the importance of the gradient, obtains the index information of the important gradients, and initiates a global broadcast to send the index information to each node that is still calculating;
  • Step 2: each node selects the gradient values at the corresponding positions from its gradient according to the index information to form a sparse gradient, and initiates a communication request;
  • Step 3: sparse communication is performed between the computing nodes, a reduction operation is performed to obtain the globally reduced gradient, and this gradient is used to update the replica model on each node (a sketch of this three-step flow is given at the end of this section).
  • the sparse communication strategy in this embodiment aims to reduce the amount of communication data and reduce the communication duration without loss of accuracy.
  • the following describes a gradient information update device for distributed deep learning provided by an embodiment of the present application.
  • The gradient information update device for distributed deep learning described below and the gradient information update method for distributed deep learning described above may correspond to each other and be referred to mutually.
  • FIG. 2 is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of the application.
  • the device may include:
  • the sparse communication module 100 is configured to perform sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times;
  • the reduction processing module 200 is configured to perform reduction processing on the received gradient information to obtain reduced gradient information
  • the replica model update module 300 is used to correct the reduced gradient information according to the set inertia coefficient to obtain the target gradient, and update the replica model in the node according to the target gradient.
  • the sparse communication module 100 may include:
  • the importance degree calculation unit is configured to calculate, when the number of iterations is greater than the preset number of switching times, the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
  • An index information broadcasting unit configured to use gradient information with an importance degree coefficient greater than a preset coefficient as a sparse gradient, and broadcast the sparse index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
  • the zeroing processing unit is configured to perform zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain zeroed gradient information;
  • the sparse communication unit is configured to communicate the zeroed gradient information.
  • the zeroing processing unit may include:
  • the gradient value setting subunit is configured to set to zero, when a communication request is received, the sparse gradient values whose absolute values are less than the gradient threshold, to obtain zeroed gradient values;
  • the zeroed-information acquisition subunit is configured to use the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
  • the importance degree calculation unit may include:
  • the calculation completion judging subunit is used for the node to judge whether the calculation of the gradient information is completed when the number of iterations is greater than the preset number of switching times;
  • the importance degree calculation subunit is used to calculate the importance degree coefficient of the calculated gradient information when the node completes the calculation of the gradient information, and obtain the importance degree coefficient of each gradient information;
  • the broadcast waiting unit is used to send a broadcast waiting message when the node has not completed the calculation of the gradient information.
  • This application also provides a computer device, including:
  • a memory, configured to store a computer program;
  • the processor is used to implement the steps of the gradient information update method described in the above embodiment when the computer program is executed.
  • the present application also provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gradient information update method as described in the above embodiments are implemented.
  • the steps of the method or algorithm described in combination with the embodiments disclosed herein can be directly implemented by hardware, a software module executed by a processor, or a combination of the two.
  • The software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
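As referenced in step 3 of the embodiment above, the following is a minimal single-process Python sketch of the three-step flow (importance evaluation on the first-finished node, index broadcast, sparse exchange and reduction). It is only an illustration of the idea, not code from the patent: the constants THRESHOLD and ITER_SWITCH and the averaging used for the reduction are assumptions, and the inertia correction is omitted here.

```python
import numpy as np

THRESHOLD = 0.25    # assumed preset threshold for gradient importance
ITER_SWITCH = 100   # assumed preset number of switching iterations (warm-up length)

def run_iteration(node_grads, iter_count):
    if iter_count <= ITER_SWITCH:
        # Warm-up phase: ordinary dense reduction of the full gradients.
        return np.mean(np.stack(node_grads), axis=0)

    # Step 1: the node that finishes computing first evaluates gradient
    # importance by absolute value and "broadcasts" the important indices.
    first = node_grads[0]
    index_info = np.flatnonzero(np.abs(first) > THRESHOLD)

    # Step 2: every node selects the gradient values at the broadcast positions
    # to form its sparse gradient; values below the threshold are zeroed.
    sparse_grads = []
    for g in node_grads:
        sparse = np.zeros_like(g)
        kept = g[index_info]
        sparse[index_info] = np.where(np.abs(kept) < THRESHOLD, 0.0, kept)
        sparse_grads.append(sparse)

    # Step 3: sparse communication and reduction; the reduced gradient is then
    # used to update the replica model on every node.
    return np.mean(np.stack(sparse_grads), axis=0)

node_grads = [np.array([0.9, 0.01, -0.6, 0.02]),
              np.array([0.7, 0.30, -0.4, 0.01])]
print(run_iteration(node_grads, iter_count=101))  # reduced sparse gradient
```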

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Complex Calculations (AREA)

Abstract

A gradient information updating method for distributed deep learning, and a gradient information updating apparatus, a computer device and a computer-readable storage medium. The method comprises: when the number of iterations is greater than a preset number of times of switching, a node performing sparse communication on calculated gradient information according to a preset gradient threshold value; reducing received gradient information to obtain reduced gradient information; and correcting the reduced gradient information according to a set inertia coefficient so as to obtain a target gradient, and updating a copy model in the node according to the target gradient. The present invention is aimed at reducing the data volume of communication and reducing the duration of communication while ensuring that precision is not lost.

Description

Gradient information update method for distributed deep learning, and related apparatus
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 28, 2020, with application number 202010469747.7 and entitled "Gradient information update method for distributed deep learning, and related apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a gradient information update method, a gradient information update device, computer equipment, and a computer-readable storage medium for distributed deep learning.
Background
With the continuous development of information technology, the requirements on deep learning training techniques are getting higher and higher, and distributed deep learning has emerged in order to improve training efficiency. When a distributed deep learning model is trained, communication between computing nodes is required to realize the reduction processing of gradient information and ensure that the replica model on each computing node is updated synchronously. When the model is very large, the sample size is very large, and the computing cluster is very large, the communication time between computing nodes is long and causes lengthy message-processing waiting time; this process makes poor use of the GPUs and cannot fully utilize the computing resources, leading to long model training time and reducing the efficiency of trial-and-error and debugging.
In the prior art, when a large-scale distributed deep learning model is trained, each computing node generates a piece of gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes to obtain the global gradient value, so that the model is updated synchronously and the replica model of each computing node stays consistent. In this process, the size of the model and the communication bandwidth between nodes determine the communication duration, while the heterogeneous characteristics of the nodes or the manufacturing errors of homogeneous equipment determine the communication waiting time; the two together determine the training duration. In the process of distributed deep model training, the communication strategy between nodes is therefore very important, and the communication strategy in the prior art seriously degrades the efficiency of model training and reduces the efficiency of deep learning.
Therefore, how to improve the efficiency of distributed deep learning and the speed of sparse communication is a key issue for those skilled in the art.
Summary of the Invention
The purpose of this application is to provide a gradient information update method, a gradient information update device, computer equipment, and a computer-readable storage medium for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times; after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model. On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced; finally, correction processing maintains the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
To solve the above technical problem, this application provides a gradient information update method for distributed deep learning, including:
when the number of iterations is greater than a preset number of switching times, performing, by a node, sparse communication on the calculated gradient information according to a preset gradient threshold;
performing reduction processing on the received gradient information to obtain reduced gradient information; and
correcting the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and updating the replica model in the node according to the target gradient.
Optionally, the performing, by the node when the number of iterations is greater than the preset number of switching times, sparse communication on the calculated gradient information according to the preset gradient threshold includes:
when the number of iterations is greater than the preset number of switching times, calculating, by the node, an importance degree coefficient for the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
taking the gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and broadcasting the sparse index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
when a communication request is received, performing zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information; and
communicating the zeroed gradient information.
Optionally, the performing, when a communication request is received, zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information includes:
when the communication request is received, setting to zero the sparse gradient values whose absolute values are less than the gradient threshold, to obtain zeroed gradient values; and
taking the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
Optionally, the calculating, by the node when the number of iterations is greater than the preset number of switching times, an importance degree coefficient for the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information includes:
when the number of iterations is greater than the preset number of switching times, judging, by the node, whether the calculation of the gradient information has been completed;
if so, calculating the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information; and
if not, sending a broadcast waiting message.
This application further provides a gradient information update device for distributed deep learning, including:
a sparse communication module, configured to perform, by a node, sparse communication on the calculated gradient information according to a preset gradient threshold when the number of iterations is greater than a preset number of switching times;
a reduction processing module, configured to perform reduction processing on the received gradient information to obtain reduced gradient information; and
a replica model update module, configured to correct the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and to update the replica model in the node according to the target gradient.
Optionally, the sparse communication module includes:
an importance degree calculation unit, configured to calculate, by the node when the number of iterations is greater than the preset number of switching times, an importance degree coefficient for the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
an index information broadcasting unit, configured to take the gradient information whose importance degree coefficient is greater than a preset coefficient as a sparse gradient, and to broadcast the sparse index information, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
a zeroing processing unit, configured to perform zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain zeroed gradient information; and
a sparse communication unit, configured to communicate the zeroed gradient information.
Optionally, the zeroing processing unit includes:
a gradient value setting subunit, configured to set to zero, when the communication request is received, the sparse gradient values whose absolute values are less than the gradient threshold, to obtain zeroed gradient values; and
a zeroed-information acquisition subunit, configured to take the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
Optionally, the importance degree calculation unit includes:
a calculation completion judging subunit, configured to judge, by the node when the number of iterations is greater than the preset number of switching times, whether the calculation of the gradient information has been completed;
an importance degree calculation subunit, configured to calculate, when the node has completed the calculation of the gradient information, the importance degree coefficient of the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information; and
a broadcast waiting unit, configured to send a broadcast waiting message when the node has not completed the calculation of the gradient information.
This application further provides a computer device, including:
a memory, configured to store a computer program; and
a processor, configured to implement the steps of the gradient information update method described above when executing the computer program.
This application further provides a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the gradient information update method described above are implemented.
The gradient information update method for distributed deep learning provided by this application includes: when the number of iterations is greater than a preset number of switching times, performing, by a node, sparse communication on the calculated gradient information according to a preset gradient threshold; performing reduction processing on the received gradient information to obtain reduced gradient information; and correcting the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and updating the replica model in the node according to the target gradient.
Sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times; after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model. On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced; finally, correction processing maintains the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
This application further provides a gradient information update device for distributed deep learning, a computer device, and a computer-readable storage medium, which have the above beneficial effects and will not be described again here.
Description of the Drawings
In order to describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of this application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative work.
FIG. 1 is a flowchart of a gradient information update method for distributed deep learning provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of this application.
Detailed Description
The core of this application is to provide a gradient information update method, a gradient information update device, computer equipment, and a computer-readable storage medium for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times; after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model. On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced; finally, correction processing maintains the reliability of the gradient information, realizing sparse communication with less data and improving both the efficiency of distributed deep learning and the speed of sparse communication.
In order to make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
In the prior art, when a large-scale distributed deep learning model is trained, each computing node generates a piece of gradient information (local gradient information), and gradient exchange and reduction (a global reduction operation) are required between nodes to obtain the global gradient value, so that the model is updated synchronously and the replica model of each computing node stays consistent. In this process, the size of the model and the communication bandwidth between nodes determine the communication duration, while the heterogeneous characteristics of the nodes or the manufacturing errors of homogeneous equipment determine the communication waiting time; the two together determine the training duration. In the process of distributed deep model training, the communication strategy between nodes is therefore very important, and the communication strategy in the prior art seriously degrades the efficiency of model training and reduces the efficiency of deep learning.
Therefore, this application provides a gradient information update method for distributed deep learning, in which sparse communication is performed according to the gradient threshold only when the number of iterations of the node is greater than the preset number of switching times; after the gradient information is received, the reduced gradient is corrected to obtain the target gradient, and finally the target gradient is used to update the replica model. On the basis of sparse communication, the number of communications is reduced and the amount of transmitted data is also reduced; finally, correction processing maintains the reliability of the gradient information, achieving sparse communication with less data and improving the efficiency of distributed deep learning and the speed of sparse communication.
Please refer to FIG. 1. FIG. 1 is a flowchart of a gradient information update method for distributed deep learning provided by an embodiment of this application.
In this embodiment, the method may include:
S101: when the number of iterations is greater than a preset number of switching times, the node performs sparse communication on the calculated gradient information according to a preset gradient threshold.
This step aims at performing sparse communication on the calculated gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times. That is, during distributed deep learning training, whether the number of training iterations is greater than the preset number of switching times is judged in real time; when it is, the calculated gradient information is communicated sparsely according to the preset gradient threshold in this step.
In the prior art, in order to improve communication efficiency in distributed deep learning, a sparse communication strategy is usually adopted directly so as to reduce the amount of communication data between nodes and improve the efficiency of data communication, and such sparse communication is generally performed throughout the whole distributed deep learning process. However, in the early stage of model training the gradient values are often relatively large and are crucial to the decrease of the objective function. Therefore, in this step a warm-up strategy is added to the gradient information update process, so that data communication is switched on only after a certain learning rate is reached, in order to obtain more suitable gradient values.
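As a rough illustration of the warm-up switch described above, the following minimal Python sketch (not part of the patent; the names iter_count and iter_switch are illustrative) shows how the iteration count can gate the change from dense to sparse communication:

```python
def should_use_sparse(iter_count: int, iter_switch: int) -> bool:
    """Warm-up gate: communicate full (dense) gradients while the iteration
    count has not yet exceeded the preset number of switching iterations,
    and switch to the sparse strategy afterwards."""
    return iter_count > iter_switch

# Toy check of the gate with an assumed switch point of 100 iterations.
for it in (1, 100, 101):
    mode = "sparse" if should_use_sparse(it, iter_switch=100) else "dense"
    print(it, mode)
```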
Optionally, in order to further explain this step, this step may include:
Step 1: when the number of iterations is greater than the preset number of switching times, the node calculates an importance degree coefficient for the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information;
Step 2: the gradient information whose importance degree coefficient is greater than a preset coefficient is taken as a sparse gradient, and the sparse index information is broadcast, so that other nodes send communication requests for the sparse gradient to the node according to the index information;
Step 3: when a communication request is received, zeroing processing is performed on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
Step 4: the zeroed gradient information is communicated.
It can be seen that, in this optional solution, the sparse communication operation on the gradient information is mainly realized through steps 1 to 4. First, when the number of iterations is greater than the preset number of switching times, the node calculates an importance degree coefficient for the calculated gradient information to obtain the importance degree coefficient of each piece of gradient information. The importance degree coefficient calculation mainly determines how important each piece of gradient information is, so that only important gradient information is transmitted during data transmission, reducing the amount of transmitted data while improving transmission efficiency. Then, the gradient information whose importance degree coefficient is greater than the preset coefficient is taken as the sparse gradient, and the sparse index information is broadcast, so that other nodes send communication requests for the sparse gradient to the node according to the index information; that is, the other nodes are notified to request gradient information from this node so that the corresponding gradient information can be sent to them. Next, when a communication request is received, zeroing processing is performed on the sparse gradient according to the gradient threshold to obtain the zeroed gradient information; that is, when the local node receives the communication request, it directly sets to zero the gradient values that are smaller than the gradient threshold, further reducing the amount of data. Finally, the zeroed gradient information is communicated in order to achieve sparse communication.
In this optional solution, the data volume of the sparse gradient is further reduced by the gradient threshold, which further reduces the amount of data in sparse communication and greatly improves the efficiency of sparse communication.
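A minimal sketch of the importance evaluation and index selection is given below. It assumes, as in the later embodiment, that the absolute value of a gradient component serves as its importance degree coefficient; the function and variable names are illustrative only, and the broadcast of the index information to the other nodes is not shown:

```python
import numpy as np

def select_important_indices(grad: np.ndarray, preset_coeff: float) -> np.ndarray:
    """Compute an importance degree coefficient for every gradient component
    (here its absolute value) and return the indices whose coefficient is
    greater than the preset coefficient; these indices form the sparse index
    information that the first-finished node would broadcast."""
    importance = np.abs(grad)
    return np.flatnonzero(importance > preset_coeff)

grad = np.array([0.01, -0.8, 0.3, -0.02, 1.2])
index_info = select_important_indices(grad, preset_coeff=0.25)
print(index_info)        # indices of the sparse gradient: [1 2 4]
print(grad[index_info])  # the sparse gradient requested by the other nodes
```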
Optionally, step 3 of this optional solution may include:
when the communication request is received, setting to zero the sparse gradient values whose absolute values are less than the gradient threshold to obtain zeroed gradient values; and taking the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
That is, the sparse gradients are classified according to the gradient threshold into those whose absolute values are less than the gradient threshold and the remaining sparse gradients; the gradient values of the former are set to zero, finally yielding the zeroed gradient information. This further reduces the data volume of the gradient information and improves the efficiency of sparse communication.
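The zeroing processing of step 3 can be sketched as follows (illustrative names, assuming NumPy arrays); components of the sparse gradient whose absolute value is below the gradient threshold are set to zero and the rest are kept:

```python
import numpy as np

def zero_out_small(sparse_grad: np.ndarray, gradient_threshold: float) -> np.ndarray:
    """Zeroing processing: sparse gradient values whose absolute values are
    less than the gradient threshold are set to zero; the remaining sparse
    gradients are kept, giving the zeroed gradient information."""
    return np.where(np.abs(sparse_grad) < gradient_threshold, 0.0, sparse_grad)

sparse_grad = np.array([0.8, 0.3, 1.2, -0.26])
print(zero_out_small(sparse_grad, gradient_threshold=0.5))  # -> 0.8, 0.0, 1.2, 0.0
```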
Optionally, step 1 of this optional solution may include:
Step 1.1: when the number of iterations is greater than the preset number of switching times, the node judges whether the calculation of the gradient information has been completed; if so, step 1.2 is executed; if not, step 1.3 is executed;
Step 1.2: the importance degree coefficient of the calculated gradient information is calculated to obtain the importance degree coefficient of each piece of gradient information;
Step 1.3: a broadcast waiting message is sent.
The judgment of whether the node has completed the gradient calculation is made because, if every node performed the gradient importance evaluation, the amount of computation would increase, and this increase grows linearly with the number of computing nodes; moreover, the sparse gradient information obtained by different computing nodes would not match in dimension and position. To address this problem, in this optional solution only the node that finishes computing its gradient first performs the gradient importance evaluation, obtains the index information of the important gradients, and broadcasts it to all other nodes; all nodes then select only the gradients corresponding to the index information to initiate communication requests. Therefore, in this optional solution the node judges whether it has completed the calculation of the gradient information; if so, it performs the importance calculation; if not, it waits for the broadcast message in order to obtain the corresponding gradient information.
S102: Reduction processing is performed on the received gradient information to obtain reduced gradient information.
Building on S101, the model update in this embodiment takes place in a distributed network, so every node that participates in sparse communication sends gradient data to the other nodes, and each node updates the gradient information of its own replica model. In this step, when the node receives the gradient information sent by other nodes, it performs reduction processing on the received gradient information to obtain the reduced gradient information.
The reduction processing in this step mainly refers to global reduction. Specifically, any reduction method provided by the prior art may be used, and details are not repeated here.
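By way of example only, the following single-process sketch shows one possible global reduction, here an average over the sparse gradients of all nodes; summation would work equally well, and in a real cluster this step would be an all-reduce over the interconnect rather than an in-process mean.

import numpy as np

def global_reduce(sparse_grads):
    # Average the sparse gradients contributed by all nodes.
    return np.mean(np.stack(sparse_grads), axis=0)

reduced = global_reduce([np.array([0.0, 0.4]), np.array([0.2, 0.6])])
print(reduced)  # [0.1 0.5]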
S103: The reduced gradient information is corrected according to the set inertia coefficient to obtain a target gradient, and the replica model in the node is updated according to the target gradient.
Building on S102, this step corrects the reduced gradient information according to the set inertia coefficient to obtain the target gradient, and then updates the replica model in the node according to the target gradient. That is, every node that has received gradient information and performed the reduction operation further corrects the resulting reduced gradient information with the set inertia coefficient, so as to avoid the accuracy of the gradient information becoming too low, obtains the target gradient, and finally updates its replica model according to the target gradient.
The main reason is that, in the prior art, discarding gradient information causes the weight update direction in stochastic gradient descent to deviate from the ideal direction, and this deviation accumulates as the iterations proceed, so the training may fail to converge and become ineffective. Therefore, to avoid training failure, in this step the reduced gradient information is corrected according to the set inertia coefficient so as to resolve the convergence problem. Specifically, in this embodiment, the following formula may be used to correct the reduced gradient information.
The specific formula is as follows:
G_{t+1} = m·G_t + sparse(G_{t+1})
where G_{t+1} is the gradient value at the (t+1)-th iteration, G_t is the gradient value at the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient-importance evaluation, and m is the inertia coefficient.
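As a rough sketch of how this correction and the subsequent replica update could be applied, the following assumes NumPy arrays and a plain SGD update; the learning-rate parameter and the SGD step itself are illustrative assumptions, since the application does not fix a particular optimizer.

import numpy as np

def correct_gradient(prev_grad, sparse_grad, m):
    # G_{t+1} = m * G_t + sparse(G_{t+1}): blend the previous gradient back in
    # with inertia coefficient m to compensate for the discarded entries.
    return m * prev_grad + sparse_grad

def update_replica(weights, target_grad, lr):
    # Plain SGD step on the local replica model using the corrected gradient.
    return weights - lr * target_grad

prev = np.array([0.10, -0.20])
reduced = np.array([0.00, 0.05])
target = correct_gradient(prev, reduced, m=0.9)              # [ 0.09 -0.13]
new_weights = update_replica(np.array([1.0, 1.0]), target, lr=0.1)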
In summary, in this embodiment the node performs sparse communication according to the gradient threshold only when its iteration count exceeds the preset number of switching times; after gradient information is received, the reduced gradient is corrected to obtain the target gradient, and the target gradient is then used to update the replica model. On top of sparse communication, this reduces both the number of communications and the amount of data exchanged, while the correction keeps the gradient information reliable. Sparse communication with less data is thus achieved, improving the efficiency of distributed deep learning and the speed of sparse communication.
A gradient information update method for distributed deep learning provided by the present application is described below through a specific embodiment.
In the prior art, the long communication time of large-scale distributed model training stems from the large amount of communication and computation between computing nodes, and this embodiment proposes a sparse communication strategy for exactly this point. The idea is that when a computing node finishes its Forward-Backward pass, it judges how important each gradient is, initiates communication requests only for the important gradients, and ignores the secondary gradient information.
In addition, during the training of a deep learning model, a large number of trainable parameters are zero or close to zero. Therefore, this application takes the absolute value of a gradient as the basis for judging its importance: when the absolute value of the gradient is greater than the preset threshold Threshold, the gradient is considered important; otherwise, the gradient is considered unimportant and its value is merely retained. The formula is as follows:
Important(G_i) = 1 if |G_i| > Threshold, and Important(G_i) = 0 otherwise,
where G denotes all gradients and G_i is the i-th dimension of G.
In addition, considering that gradient values tend to be large at the beginning of training and are crucial to the decrease of the objective function, a warm-up strategy is proposed on top of the above sparsification strategy. In the warm-up strategy, a switch value Iter is set; once the iteration count reaches Iter, the gradient-importance evaluation is carried out and the important gradients are selected to initiate communication requests. The formula is as follows:
communication mode = sparse communication (with gradient-importance evaluation) if Iter_count ≥ Iter, and dense (full-gradient) communication otherwise,
where Iter_count is the current iteration count.
When Iter is set to 0, sparse communication is considered to start as soon as training begins; when Iter is set to the maximum number of iterations Iter_max, sparse communication is considered to be disabled.
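A minimal sketch of this warm-up switch is given below; the function name is illustrative, and the boundary is taken as "the iteration count reaches Iter" per the description above.

def use_sparse_communication(iter_count, switch_iter):
    # Warm-up switch: dense communication while iter_count is below the switch
    # value Iter, sparse communication once it has been reached. With Iter = 0
    # sparse communication starts immediately; with Iter = Iter_max (and
    # iterations counted from 0) the sparse path is never taken.
    return iter_count >= switch_iter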
It is also worth noting that discarding gradient information causes the weight update direction in stochastic gradient descent to deviate from the ideal direction; this deviation accumulates as the iterations proceed, so the training may fail to converge and become ineffective. For this situation, a gradient correction strategy is proposed, with the following specific formula:
G_{t+1} = m·G_t + sparse(G_{t+1})
where G_{t+1} is the gradient value at the (t+1)-th iteration, G_t is the gradient value at the t-th iteration, sparse(·) is the sparse gradient obtained after the gradient-importance evaluation, and m is the inertia coefficient.
Moreover, if every computing node performed the gradient-importance evaluation, two drawbacks would follow: 1) the amount of computation would increase, and this increase grows linearly with the number of computing nodes; 2) the sparse gradient information obtained by different computing nodes would not match in dimension or position. To address this, a fastest-node strategy is proposed: only the node that computes its gradients first performs the gradient-importance evaluation, obtains the index information of the important gradients, and broadcasts it to all other nodes. All nodes then select only the gradients at the corresponding indices to initiate communication requests.
Based on the above description, the method in this embodiment may include:
Step 1: when the switch value Iter is reached, the node among all nodes that first finishes computing its gradients performs the gradient-importance evaluation, obtains the index information of the important gradients, and initiates a global broadcast to send the index information to every node that is still computing;
Step 2: according to the index information, each node selects the gradients at the corresponding positions from its own gradients to form the sparsified gradient, and initiates a communication request;
Step 3: sparse communication is carried out between the computing nodes and a reduction operation is performed to obtain the globally reduced gradient, which is then used to update the replica model on each node (an end-to-end sketch of these three steps is given below).
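The sketch below combines the helpers outlined earlier into one illustrative per-iteration flow, simulated in a single process with NumPy. The choice of local_grads[0] as the fastest node, the averaging reduction, and the plain SGD update are all assumptions made for illustration; in a real deployment the broadcast and reduction would run over the cluster interconnect.

import numpy as np

def training_iteration(local_grads, prev_grad, weights, iter_count,
                       switch_iter, coeff_threshold, grad_threshold, m, lr):
    # One illustrative iteration across all replicas, simulated in-process.
    if iter_count < switch_iter:
        sparse_parts = local_grads                      # dense exchange during warm-up
    else:
        # Step 1: the fastest node picks the important indices (broadcast in a real run).
        idx = np.flatnonzero(np.abs(local_grads[0]) > coeff_threshold)
        sparse_parts = []
        for g in local_grads:
            # Step 2: each node keeps only the broadcast positions ...
            picked = np.zeros_like(g)
            picked[idx] = g[idx]
            # ... and zeroes the values still below the gradient threshold.
            picked[np.abs(picked) < grad_threshold] = 0.0
            sparse_parts.append(picked)
    # Step 3: global reduction, inertia correction, and replica update.
    reduced = np.mean(np.stack(sparse_parts), axis=0)
    target = m * prev_grad + reduced
    return weights - lr * target, target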
As can be seen, the sparse communication strategy in this embodiment aims to reduce the amount of communicated data and shorten the communication time while ensuring that no accuracy is lost.
A gradient information update device for distributed deep learning provided by an embodiment of the present application is introduced below. The gradient information update device for distributed deep learning described below and the gradient information update method for distributed deep learning described above may be referred to in correspondence with each other.
Please refer to FIG. 2, which is a schematic structural diagram of a gradient information update device for distributed deep learning provided by an embodiment of the present application.
In this embodiment, the device may include:
a sparse communication module 100, configured to, when the number of iterations is greater than the preset number of switching times, have the node perform sparse communication on the computed gradient information according to the preset gradient threshold;
a reduction processing module 200, configured to perform reduction processing on the received gradient information to obtain reduced gradient information;
a replica model update module 300, configured to correct the reduced gradient information according to the set inertia coefficient to obtain a target gradient, and to update the replica model in the node according to the target gradient.
Optionally, the sparse communication module 100 may include:
an importance calculation unit, configured to, when the number of iterations is greater than the preset number of switching times, have the node calculate an importance coefficient for the computed gradient information to obtain the importance coefficient of each piece of gradient information;
an index information broadcasting unit, configured to take the gradient information whose importance coefficient is greater than the preset coefficient as the sparse gradient, and to broadcast the sparse index information so that other nodes send communication requests for the sparse gradient to the node according to the index information;
a zeroing processing unit, configured to, when a communication request is received, perform zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
a sparse communication unit, configured to communicate the zeroed gradient information.
Optionally, the zeroing processing unit may include:
a gradient value setting subunit, configured to, when the communication request is received, set the sparse gradient values whose absolute values are less than the gradient threshold to zero to obtain zeroed gradient values;
a zeroing information acquisition subunit, configured to take the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
Optionally, the importance calculation unit may include:
a calculation completion judging subunit, configured to, when the number of iterations is greater than the preset number of switching times, have the node judge whether the computation of the gradient information has been completed;
an importance calculation subunit, configured to, when the node has finished computing the gradient information, calculate an importance coefficient for the computed gradient information to obtain the importance coefficient of each piece of gradient information;
a broadcast waiting unit, configured to send a broadcast waiting message when the node has not finished computing the gradient information.
The present application further provides a computer device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the gradient information update method described in the above embodiments when executing the computer program.
The present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the gradient information update method described in the above embodiments are implemented.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant parts may be found in the description of the method.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be regarded as going beyond the scope of this application.
The steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The gradient information update method for distributed deep learning, the gradient information update device, the computer device, and the computer-readable storage medium provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made to the present application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims (10)

  1. A gradient information update method for distributed deep learning, characterized by comprising:
    when the number of iterations is greater than a preset number of switching times, performing, by a node, sparse communication on the computed gradient information according to a preset gradient threshold;
    performing reduction processing on the received gradient information to obtain reduced gradient information;
    correcting the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and updating the replica model in the node according to the target gradient.
  2. The gradient information update method according to claim 1, characterized in that performing, by the node, sparse communication on the computed gradient information according to the preset gradient threshold when the number of iterations is greater than the preset number of switching times comprises:
    when the number of iterations is greater than the preset number of switching times, calculating, by the node, an importance coefficient for the computed gradient information to obtain the importance coefficient of each piece of gradient information;
    taking the gradient information whose importance coefficient is greater than a preset coefficient as a sparse gradient, and broadcasting the sparse index information so that other nodes send communication requests for the sparse gradient to the node according to the index information;
    when a communication request is received, performing zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
    communicating the zeroed gradient information.
  3. The gradient information update method according to claim 2, characterized in that performing zeroing processing on the sparse gradient according to the gradient threshold when a communication request is received, to obtain the zeroed gradient information, comprises:
    when the communication request is received, setting the sparse gradient values whose absolute values are less than the gradient threshold to zero to obtain zeroed gradient values;
    taking the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
  4. The gradient information update method according to claim 2, characterized in that calculating, by the node, the importance coefficient of the computed gradient information to obtain the importance coefficient of each piece of gradient information when the number of iterations is greater than the preset number of switching times comprises:
    when the number of iterations is greater than the preset number of switching times, judging, by the node, whether the computation of the gradient information has been completed;
    if so, calculating an importance coefficient for the computed gradient information to obtain the importance coefficient of each piece of gradient information;
    if not, sending a broadcast waiting message.
  5. A gradient information update device for distributed deep learning, characterized by comprising:
    a sparse communication module, configured to, when the number of iterations is greater than a preset number of switching times, have the node perform sparse communication on the computed gradient information according to a preset gradient threshold;
    a reduction processing module, configured to perform reduction processing on the received gradient information to obtain reduced gradient information;
    a replica model update module, configured to correct the reduced gradient information according to a set inertia coefficient to obtain a target gradient, and to update the replica model in the node according to the target gradient.
  6. The gradient information update device according to claim 5, characterized in that the sparse communication module comprises:
    an importance calculation unit, configured to, when the number of iterations is greater than the preset number of switching times, have the node calculate an importance coefficient for the computed gradient information to obtain the importance coefficient of each piece of gradient information;
    an index information broadcasting unit, configured to take the gradient information whose importance coefficient is greater than a preset coefficient as a sparse gradient, and to broadcast the sparse index information so that other nodes send communication requests for the sparse gradient to the node according to the index information;
    a zeroing processing unit, configured to, when a communication request is received, perform zeroing processing on the sparse gradient according to the gradient threshold to obtain zeroed gradient information;
    a sparse communication unit, configured to communicate the zeroed gradient information.
  7. The gradient information update device according to claim 5, characterized in that the zeroing processing unit comprises:
    a gradient value setting subunit, configured to, when the communication request is received, set the sparse gradient values whose absolute values are less than the gradient threshold to zero to obtain zeroed gradient values;
    a zeroing information acquisition subunit, configured to take the remaining sparse gradients and the zeroed gradient values as the zeroed gradient information.
  8. The gradient information update device according to claim 5, characterized in that the importance calculation unit comprises:
    a calculation completion judging subunit, configured to, when the number of iterations is greater than the preset number of switching times, have the node judge whether the computation of the gradient information has been completed;
    an importance calculation subunit, configured to, when the node has finished computing the gradient information, calculate an importance coefficient for the computed gradient information to obtain the importance coefficient of each piece of gradient information;
    a broadcast waiting unit, configured to send a broadcast waiting message when the node has not finished computing the gradient information.
  9. A computer device, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the gradient information update method according to any one of claims 1 to 4 when executing the computer program.
  10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gradient information update method according to any one of claims 1 to 4 are implemented.