CN110472731A - Gradient synchronization method and apparatus in distributed training - Google Patents

Gradient synchronization method and apparatus in distributed training

Info

Publication number
CN110472731A
Authority
CN
China
Prior art keywords
training
gradient
sub
node
accumulation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910760056.XA
Other languages
Chinese (zh)
Inventor
李小龙
王洪伟
李鑫
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Chengdu Kingsoft Digital Entertainment Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Original Assignee
Chengdu Kingsoft Digital Entertainment Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Digital Entertainment Co Ltd and Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority to CN201910760056.XA
Publication of CN110472731A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The present application provides a gradient synchronization method and apparatus in distributed training. The gradient synchronization method includes: grouping the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring; calculating a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster; obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients; and synchronizing the sub training accumulated gradients to each training node of the distributed training cluster. By accumulating the gradients of the sub training data before communication, the number of synchronization rounds for the accumulated gradients is reduced, the communication frequency is lowered, and model training is accelerated.

Description

Gradient synchronization method and apparatus in distributed training
Technical field
The present application relates to the field of computer technology, and in particular to a gradient synchronization method and apparatus in distributed training, a computing device, a computer-readable storage medium, and a chip.
Background art
Currently, with the rapid development of computer technology, deep learning technology is also advancing rapidly. As deep learning has matured, increasingly complex algorithms have been developed; these algorithms need large amounts of data and a great deal of time to be trained effectively, which has motivated distributed training.
In the model optimization of deep learning, gradient descent is used to compute gradients and find the minimum loss function, thereby training the model and accelerating its convergence. In current distributed training, the gradient information must be transmitted and synchronized after every training step so that the gradients can be shared across the distributed training nodes and the minimum loss function can be found. The high frequency of gradient transmission and the large volume of transmitted information therefore make model training long and drawn out, severely delaying model training.
Therefore, how to alleviate the above problem has become an urgent issue to be solved.
Summary of the invention
In view of this, the embodiments of the present application provide a gradient synchronization method and apparatus in distributed training, a computing device, a computer-readable storage medium, and a chip, so as to overcome the technical deficiencies in the prior art.
According to a first aspect of the embodiments of the present application, a gradient synchronization method in distributed training is provided, comprising:
grouping the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
calculating a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
synchronizing the sub training accumulated gradients to each training node of the distributed training cluster.
According to a second aspect of the embodiments of the present application, a gradient synchronization apparatus in distributed training is provided, comprising:
a grouping module, configured to group the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
a computing module, configured to calculate a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
an accumulation module, configured to obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
a synchronization module, configured to synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
According to a third aspect of the embodiments of the present application, a computing device is provided, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the gradient synchronization method in distributed training when executing the instructions.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer instructions that, when executed by a processor, implement the steps of the gradient synchronization method in distributed training.
According to a fifth aspect of the embodiments of the present application, a chip is provided, which stores computer instructions that, when executed by the chip, implement the steps of the gradient synchronization method in distributed training.
In the gradient synchronization method in distributed training provided by the present application, the training data on each training node in the distributed training cluster is grouped to obtain multiple sub training data sets on each training node, the training nodes in the distributed training cluster being connected in a ring; a sub training accumulation gradient is calculated for each sub training data set on the training nodes of the distributed training cluster; the sub training accumulated gradients corresponding to the sub training accumulation gradients are obtained from the sub training accumulation gradients; and the sub training accumulated gradients are synchronized to each training node of the distributed training cluster. During model training, the gradient information computed over multiple steps is accumulated before the gradients are synchronized, which significantly reduces the communication frequency of the gradient information, shortens the time spent transmitting it, accelerates model training, and improves training efficiency.
Brief description of the drawings
Fig. 1 is a structural block diagram of a computing device provided by an embodiment of the present application;
Fig. 2 is a flowchart of a gradient synchronization method in distributed training provided by an embodiment of the present application;
Fig. 3 is a flowchart of a method for calculating a sub training accumulation gradient provided by an embodiment of the present application;
Fig. 4 is a flowchart of a gradient synchronization method in distributed training provided by another embodiment of the present application;
Fig. 5 is a structural schematic diagram of a distributed training cluster provided by an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a gradient synchronization apparatus in distributed training provided by an embodiment of the present application.
Detailed description of the embodiments
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the present application; the present application is therefore not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of the present application. The singular forms "a", "said" and "the" used in the one or more embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various pieces of information, this information should not be limited by these terms; these terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly "second" may be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms involved in one or more embodiments of the present invention are explained.
Gradient: a gradient is a vector indicating that the directional derivative of a function at a given point attains its maximum value along the direction of the gradient, i.e. the function changes fastest, with the greatest rate of change, along that direction at that point. In model training, the gradient is used to find the minimum loss function, train the model, and accelerate its convergence; the number of model training steps is the number of gradient steps.
Gradient descent: an optimization algorithm, also commonly called the method of steepest descent. Gradient descent is one of the most commonly used methods for solving unconstrained optimization problems, and in machine learning it is now mostly used to iteratively approach the model with the minimum error; in particular, gradient descent provides the theoretical basis for the backpropagation algorithm in neural networks.
Gradient accumulation: accumulating the gradients of multiple training steps together.
Distributed training: a training method that uses multiple training nodes.
Sub training accumulation gradient: the gradients of the training data on each training node of the distributed training cluster are calculated, and the gradients computed over multiple steps are accumulated to obtain the sub training accumulation gradient.
Sub training accumulated gradient: the corresponding sub training accumulation gradients of all nodes in the distributed training cluster are accumulated together to obtain the sub training accumulated gradient.
The present application provides a gradient synchronization method and apparatus in distributed training, a computing device, a computer-readable storage medium, and a chip, which are described in detail one by one in the following embodiments.
Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 via a bus 130, and a database 150 is used to store data.
The computing device 100 also includes an access device 140 that enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any kind of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.
In an embodiment of the present application, the above components of the computing device 100 and other components not shown in Fig. 1 may also be connected to one another, for example via the bus. It should be understood that the structural block diagram of the computing device shown in Fig. 1 is provided for illustration only and does not limit the scope of the present application; those skilled in the art may add or replace other components as needed.
The computing device 100 can be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smartwatch, smart glasses, etc.), another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 100 can also be a mobile or stationary server.
The processor 120 can execute the steps of the gradient synchronization method in distributed training shown in Fig. 2. Fig. 2 shows a flowchart of a gradient synchronization method in distributed training according to an embodiment of the present application, including steps 202 to 208.
Step 202: group the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring.
The distributed training cluster used in distributed training has multiple training nodes. The same model to be trained is placed on each training node, all training samples are distributed evenly across the training nodes, and each training node trains the model independently on its own training samples. The training nodes in the distributed training cluster are connected in a ring, that is, each training node is assigned a corresponding previous training node and next training node; each training node completes its own training, computes its gradients, passes the gradients to its next training node, and at the same time receives gradients from its previous training node.
A sub training data set is one of the multiple groups obtained by grouping the training data on a training node.
Optionally, the training data on each training node is grouped according to the number of training nodes in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the number of sub training data sets obtained equals the number of training nodes in the distributed training cluster.
If the number of training nodes in the distributed training cluster is n, the training data on each training node is randomly and evenly divided into n groups, yielding n sub training data sets; the number of sub training data sets on each training node equals the number of training nodes in the distributed training cluster. Randomly and evenly grouping the training data on each training node guarantees the randomness of the training data and balances the load across the training nodes of the distributed training cluster.
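For illustration only (not part of the patent text; the function and variable names are assumptions), the grouping of step 202 could be sketched in Python as follows:

```python
import random

def partition_local_data(samples, num_nodes, seed=0):
    """Randomly and evenly split one training node's samples into
    num_nodes sub training data sets (one group per training node)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)        # random grouping preserves randomness
    groups = [[] for _ in range(num_nodes)]
    for i, sample in enumerate(shuffled):
        groups[i % num_nodes].append(sample)     # round-robin keeps the groups balanced
    return groups

# Example: a cluster of 3 training nodes, 12 local samples on this node
sub_training_data = partition_local_data(range(12), num_nodes=3)
assert len(sub_training_data) == 3               # as many groups as training nodes
```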
Step 204: calculate the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster.
On each training node, the model is trained multiple times with each sub training data set: the sub training gradient of each training step on the sub training data set is calculated, and the sub training gradients of the multiple steps are accumulated to obtain the sub training accumulation gradient.
The gradient is closely tied to the training data, and different training data yields different gradients. Computing the sub training gradients on each training node from the randomly grouped sub training data sets preserves the randomness of the sub training gradients, which helps find the minimum loss function quickly and accelerates the convergence of the model being trained.
Optionally, referring to Fig. 3, step 204 can be implemented by the following steps 302 to 306.
Step 302: obtain the preset gradient accumulation step number.
The number of training steps over which gradients are to be accumulated is the gradient accumulation step number; it is set in advance, so the preset gradient accumulation step number is obtained at this point.
In the embodiment provided by the present application, on each training node the gradients of 5 steps need to be accumulated after training, so the preset gradient accumulation step number obtained is 5.
Step 304: calculate the sub training gradients of the sub training data set over the gradient accumulation steps.
On a given training node, the sub training gradients of a given sub training data set are calculated over the gradient accumulation steps.
In the embodiment provided by the present application, continuing the example above, the first sub training data set on the first training node is denoted d0; the sub training gradient computed in the first step is denoted g1, the one computed in the second step g2, the one computed in the third step g3, the one computed in the fourth step g4, and the one computed in the fifth step g5.
Step 306: accumulate the sub training gradients over the gradient accumulation steps to obtain the sub training accumulation gradient.
The sub training gradients within the gradient accumulation steps are accumulated to obtain the sub training accumulation gradient.
In the embodiment provided by the present application, continuing the example above, the sub training accumulation gradient corresponding to the first sub training data set d0 is a0, where a0 = g1 + g2 + g3 + g4 + g5.
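A minimal sketch of steps 302 to 306, assuming a PyTorch-style model, loss function, and iterable of (input, label) batches (these names are illustrative and not taken from the patent):

```python
import torch

def sub_training_accumulation_gradient(model, loss_fn, sub_dataset, accumulation_steps=5):
    """Steps 302-306: accumulate the sub training gradients of one sub training
    data set over the preset number of gradient accumulation steps."""
    accumulated = [torch.zeros_like(p) for p in model.parameters()]
    for _, (x, y) in zip(range(accumulation_steps), sub_dataset):
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                          # sub training gradient of this step
        for acc, p in zip(accumulated, model.parameters()):
            acc += p.grad                        # a0 = g1 + g2 + g3 + g4 + g5
    return accumulated                           # sub training accumulation gradient
```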
Step 206: obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients.
Optionally, the sub training accumulation gradients are accumulated across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients.
The sub training accumulation gradients corresponding to the sub training data sets on the individual training nodes are accumulated across nodes. For example, if the distributed training cluster has 5 training nodes and each training node has 5 sub training data sets, the sub training accumulation gradients corresponding to the 1st sub training data set on each training node are accumulated, then those corresponding to the 2nd sub training data set, and so on, up to the sub training accumulation gradients corresponding to the 5th sub training data set.
Step 208: synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
Optionally, the sub training accumulated gradients are synchronized in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node.
Because the training nodes in the distributed training cluster are connected in a ring, the preset order can be clockwise or counterclockwise. The sub training accumulated gradients on a training node are synchronized in turn, in the preset order, to each training node of the distributed training cluster.
In the gradient synchronization method in distributed training provided by the embodiments of the present application, the sub training gradients of the sub training data sets are calculated, the training gradients of multiple steps are accumulated into sub training accumulation gradients, and only then is gradient information transmitted among the training nodes of the distributed training cluster. This reduces the communication frequency and the number of gradient transmissions, effectively alleviates the time cost of gradient transmission during model training, accelerates model training, and saves time.
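Putting steps 202 to 208 together, a per-node training loop might be sketched as follows. This is an illustration, not the patented implementation: ring_accumulate_and_sync is a stand-in for the ring accumulation and synchronization detailed in the Fig. 4 embodiment below (it is not defined in the patent), the two helpers reuse the sketches given above, and the averaging of the combined gradient is one common choice rather than a requirement of the method:

```python
def distributed_training_loop(model, loss_fn, local_data, rank, num_nodes,
                              accumulation_steps, num_epochs, optimizer):
    """Group local data (step 202), accumulate gradients locally (step 204),
    sum them across the ring (step 206), synchronize and apply them (step 208)."""
    sub_datasets = partition_local_data(local_data, num_nodes)            # step 202
    for _ in range(num_epochs):
        local_grads = [
            sub_training_accumulation_gradient(model, loss_fn, d, accumulation_steps)
            for d in sub_datasets                                         # step 204
        ]
        synced = ring_accumulate_and_sync(local_grads, rank, num_nodes)   # steps 206-208
        total = [sum(per_param) for per_param in zip(*synced)]            # combine all groups
        for p, g in zip(model.parameters(), total):
            p.grad = g / (num_nodes * accumulation_steps)                 # averaging is one common choice
        optimizer.step()                                                  # update with the synchronized gradients
```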
Fig. 4 shows a gradient synchronization method in distributed training according to an embodiment of the present application; the method is described by taking gradient synchronization in a distributed training cluster containing 3 training nodes as an example, and includes steps 402 to 420.
Step 402: group the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring.
Step 404: calculate the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster.
Steps 402 to 404 are the same as steps 202 to 204 above; for their detailed explanation, refer to the description of steps 202 to 204 in the previous embodiment, which is not repeated here.
Fig. 5 shows the structural schematic diagram of a distributed training cluster provided by an embodiment of the present application. Referring to Fig. 5, the explanation is given with three training nodes in the distributed training cluster and three groups of sub training data on each training node as an example.
The training nodes in Fig. 5 are connected in a ring, and each training node in Fig. 5 is assigned a corresponding previous training node and next training node: the previous training node of training node 0 is training node 2, and its next training node is training node 1; the previous training node of training node 1 is training node 0, and its next training node is training node 2; the previous training node of training node 2 is training node 1, and its next training node is training node 0.
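In a ring of n training nodes the previous and next neighbours follow from a node's index by modular arithmetic; a small illustrative helper (the name is an assumption):

```python
def ring_neighbours(rank, num_nodes):
    """Return the (previous, next) training node of a ring-connected node."""
    return (rank - 1) % num_nodes, (rank + 1) % num_nodes

# With the 3 training nodes of Fig. 5:
assert ring_neighbours(0, 3) == (2, 1)   # node 0: previous is node 2, next is node 1
assert ring_neighbours(1, 3) == (0, 2)
assert ring_neighbours(2, 3) == (1, 0)
```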
Step 406: determine whether the i-th sub training accumulation gradient on the current training node is the starting sub training accumulation gradient; if so, execute step 408, and if not, execute step 410.
Here i is a positive integer, and the starting sub training accumulation gradient is the sub training accumulation gradient from which synchronization begins. On each training node of the distributed training cluster, each node corresponds to one sub training accumulation gradient from which gradient synchronization starts; the starting sub training accumulation gradient is the first sub training accumulation gradient to begin gradient synchronization. The i-th sub training accumulation gradient is the starting sub training accumulation gradient on only one training node of the distributed training cluster. For example, the 2nd sub training accumulation gradient is the starting sub training accumulation gradient only on training node 1 of the distributed training cluster; on the other training nodes of the distributed training cluster it is never treated as the starting sub training accumulation gradient.
More preferably, each training node automatically sets one starting sub training accumulation gradient, and the index of the starting sub training accumulation gradient matches the index of the current training node; this guarantees load balancing in the distributed training cluster, makes full use of the training nodes in the distributed training cluster, and saves resources.
In the embodiment provided by the present application, referring to Table 1, a0 on training node 0, b1 on training node 1 and c2 on training node 2 are the starting sub training accumulation gradients, so step 408 is executed for a0, b1 and c2, and step 410 is executed for the other sub training accumulation gradients.
Table 1 (sub training accumulation gradients on each training node; the starting gradient is marked)
Training node 0: a0 (starting), b0, c0
Training node 1: a1, b1 (starting), c1
Training node 2: a2, b2, c2 (starting)
Step 408: send the starting sub training accumulation gradient to the next training node.
In the embodiment provided by the present application, referring to Table 1, the starting sub training accumulation gradient a0 on training node 0 is sent to training node 1, the starting sub training accumulation gradient b1 on training node 1 is sent to training node 2, and the starting sub training accumulation gradient c2 on training node 2 is sent to training node 0.
Step 410: upon receiving the i-th sub training accumulation gradient sent by the previous training node, accumulate the i-th sub training accumulation gradient of the previous training node with the i-th sub training accumulation gradient of the current training node to obtain the accumulated sub training accumulated gradient.
For a sub training accumulation gradient on a training node that is not the starting one, the node waits to receive the corresponding sub training accumulation gradient sent by its previous training node and accumulates the sub training accumulation gradient sent by the previous training node with the sub training accumulation gradient of the current training node, obtaining the accumulated sub training accumulated gradient.
In the embodiment provided by the present application, referring to Table 2, training node 0 receives the 3rd sub training accumulation gradient c2 sent by training node 2 and accumulates it with the 3rd sub training accumulation gradient c0 of training node 0, obtaining the accumulated sub training accumulated gradient c2+c0; similarly, the accumulated sub training accumulated gradient on training node 1 is a0+a1 and that on training node 2 is b1+b2.
Table 2 (after the first round of accumulation)
Training node 0: a0, b0, c2+c0
Training node 1: a0+a1, b1, c1
Training node 2: a2, b1+b2, c2
Step 412: determine whether the accumulated sub training accumulated gradient is the final sub training accumulated gradient; if not, execute step 414, and if so, execute step 416.
The final sub training accumulated gradient is the sub training accumulation gradient on each training node after a number of accumulations equal to the number of training nodes in the distributed training cluster minus 1; that is, when the number of training nodes in the distributed training cluster is n, the final sub training accumulated gradient is the gradient obtained after the sub training accumulated gradient on each training node has been accumulated n-1 times. When the accumulated sub training accumulated gradient is not the final sub training accumulated gradient, step 414 is executed; when it is, step 416 is executed.
Step 414: send the accumulated sub training accumulated gradient to the next training node.
In the embodiment provided by the present application, the number of training nodes in the distributed training cluster is 3, so the final sub training accumulated gradient should be the gradient obtained after 2 accumulations. Referring to Table 2, only 1 accumulation has been performed at this point, so a0+a1, b1+b2 and c2+c0 are judged not to be final sub training accumulated gradients; the accumulated sub training accumulated gradients therefore need to be sent to the next training node to continue accumulating the sub training accumulation gradients.
Step 416: stop accumulating and obtain the final sub training accumulated gradient.
In the embodiment provided by the present application, referring to Table 3, b1+b2+b0 on training node 0 is the gradient obtained after two accumulations and is a final sub training accumulated gradient; similarly, c2+c0+c1 on training node 1 and a0+a1+a2 on training node 2 are final sub training accumulated gradients.
Table 3 (after the second round of accumulation; each node holds one final sub training accumulated gradient)
Training node 0: a0, b1+b2+b0, c2+c0
Training node 1: a0+a1, b1, c2+c0+c1
Training node 2: a0+a1+a2, b1+b2, c2
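The accumulation phase of steps 406 to 416 can be sketched as a single-process simulation (illustrative only; in a real cluster each round would be a point-to-point send to the next training node, and the function names are assumptions):

```python
def ring_accumulate(node_gradients, add=lambda received, own: received + own):
    """Simulate steps 406-416. node_gradients[k][i] is the i-th sub training
    accumulation gradient on training node k; node k starts from its k-th one."""
    n = len(node_gradients)
    for step in range(n - 1):                              # n-1 accumulations give the final gradient
        sending = [(k - step) % n for k in range(n)]       # index each node sends this round
        received = {(k + 1) % n: (sending[k], node_gradients[k][sending[k]]) for k in range(n)}
        for k, (i, grad) in received.items():              # step 410: accumulate with own i-th gradient
            node_gradients[k][i] = add(grad, node_gradients[k][i])
    return node_gradients                                  # each node now holds one final accumulated gradient

# Reproducing Tables 1-3 symbolically:
nodes = [["a0", "b0", "c0"], ["a1", "b1", "c1"], ["a2", "b2", "c2"]]
nodes = ring_accumulate(nodes, add=lambda received, own: f"{received}+{own}")
# nodes[0] == ["a0", "b1+b2+b0", "c2+c0"]; nodes[1] == ["a0+a1", "b1", "c2+c0+c1"]
# nodes[2] == ["a0+a1+a2", "b1+b2", "c2"]
```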
Step 418: the preset training node receives the synchronization information issued by each training node of the distributed training cluster and issues, to each training node of the distributed training cluster, the instruction to synchronize the sub training accumulated gradients.
A distributed training cluster connected in a ring has no training node dedicated solely to receiving synchronization information, so one training node is designated in advance to receive synchronization information and issue the synchronization instruction; that node also takes part in the synchronized receiving and sending of gradient information, so that gradient information flows and is synchronized among the ring-connected training nodes of the distributed training cluster.
In the embodiment provided by the present application, training node 0 is chosen as the preset training node. After receiving from training node 0, training node 1 and training node 2 the synchronization information indicating that accumulation of the sub training accumulated gradients is complete and synchronization can proceed, training node 0 issues to training node 0, training node 1 and training node 2 the instruction to synchronize the sub training accumulated gradients.
Optionally, the sub training accumulated gradients are compressed to obtain compressed sub training accumulated gradients.
Because the sub training accumulated gradients are the result of accumulation, they are relatively large, and completing parameter synchronization takes a long time; the sub training accumulated gradients can therefore be compressed to obtain compressed sub training accumulated gradients.
In the embodiment provided by the present application, the sub training accumulated gradients obtained are 32-bit and are compressed to 8-bit by gradient compression. Compressing the sub training accumulated gradients, for example with deep gradient compression (DGC), reduces the parameter size of the sub training accumulated gradients and the amount of data exchanged during synchronization, saves communication time, and improves model training efficiency.
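As an illustration of the 32-bit to 8-bit compression mentioned above (a simple uniform quantization sketch under assumed tensor shapes, not the DGC algorithm itself):

```python
import torch

def compress_gradient(grad_fp32):
    """Quantize a 32-bit gradient tensor to 8-bit integers plus one scale."""
    scale = grad_fp32.abs().max().clamp(min=1e-12) / 127.0
    grad_int8 = torch.clamp(torch.round(grad_fp32 / scale), -127, 127).to(torch.int8)
    return grad_int8, scale                      # roughly 4x less data to synchronize

def decompress_gradient(grad_int8, scale):
    """Recover an approximate 32-bit gradient on the receiving training node."""
    return grad_int8.to(torch.float32) * scale
```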
Step 420: synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
Optionally, the compressed sub training accumulated gradients are synchronized to each training node of the distributed training cluster.
Optionally, the compressed sub training accumulated gradients are synchronized as final sub training accumulated gradients around the distributed training cluster in order, round by round; the number of synchronization rounds equals the number of training nodes in the distributed training cluster minus 1.
In the embodiment provided by the present application, to achieve load balancing in the distributed training cluster, the final sub training accumulated gradients on the training nodes are synchronized around the distributed training cluster in order, round by round. As shown in Table 4, the final sub training accumulated gradient a0+a1+a2 on training node 2 is synchronized to training node 0, the final sub training accumulated gradient b1+b2+b0 on training node 0 is synchronized to training node 1, and the final sub training accumulated gradient c2+c0+c1 on training node 1 is synchronized to training node 2.
Table 4 (after the first synchronization round)
Training node 0: a0+a1+a2, b1+b2+b0, c2+c0
Training node 1: a0+a1, b1+b2+b0, c2+c0+c1
Training node 2: a0+a1+a2, b1+b2, c2+c0+c1
As shown in Table 5, the final sub training accumulated gradient a0+a1+a2 on training node 0 is then synchronized to training node 1, the final sub training accumulated gradient b1+b2+b0 on training node 1 is synchronized to training node 2, and the final sub training accumulated gradient c2+c0+c1 on training node 2 is synchronized to training node 0. At this point, in the present embodiment, the final sub training accumulated gradients have been synchronized twice around the distributed training cluster, completing the synchronization of the sub training accumulated gradients to every training node of the distributed training cluster.
Table 5 (after the second synchronization round; every node holds all final sub training accumulated gradients)
Training node 0: a0+a1+a2, b1+b2+b0, c2+c0+c1
Training node 1: a0+a1+a2, b1+b2+b0, c2+c0+c1
Training node 2: a0+a1+a2, b1+b2+b0, c2+c0+c1
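The synchronization of step 420 can be sketched in the same simulated style (illustrative only): after the accumulation phase, node k holds the final sub training accumulated gradient at index (k + 1) mod n, and n-1 forwarding rounds give every node every final gradient, reproducing Tables 4 and 5:

```python
def ring_allgather(node_gradients):
    """Simulate step 420: forward the final sub training accumulated gradients
    around the ring for n-1 rounds so that every training node holds all of them."""
    n = len(node_gradients)
    for step in range(n - 1):
        sending = [(k + 1 - step) % n for k in range(n)]   # index of the final gradient each node forwards
        received = {(k + 1) % n: (sending[k], node_gradients[k][sending[k]]) for k in range(n)}
        for k, (i, grad) in received.items():
            node_gradients[k][i] = grad                    # overwrite with the received final gradient
    return node_gradients

nodes = ring_allgather(nodes)      # continuing the ring_accumulate example above
# every node now holds ["a0+a1+a2", "b1+b2+b0", "c2+c0+c1"], matching Table 5
```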
In the gradient synchronization method in distributed training provided by the embodiments of the present application, the sub training gradients of the sub training data sets are calculated and the training gradients of multiple steps are accumulated into sub training accumulation gradients, which are then transmitted among the training nodes of the distributed training cluster. This reduces the communication frequency and the number of gradient transmissions and effectively alleviates the time cost of gradient transmission during model training. When the sub training accumulated gradients are synchronized, they are compressed and the compressed sub training accumulated gradients are synchronized among the ring-connected training nodes, which reduces the volume of parameters exchanged during gradient synchronization, accelerates model training, and saves time.
Corresponding to the above method embodiments, the present application also provides embodiments of a gradient synchronization apparatus in distributed training. Fig. 6 shows the structural schematic diagram of a gradient synchronization apparatus in distributed training according to an embodiment of the present application. As shown in Fig. 6, the apparatus comprises:
a grouping module 502, configured to group the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
a computing module 504, configured to calculate the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster;
an accumulation module 506, configured to obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
a synchronization module 508, configured to synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
Optionally, the grouping module 502 is further configured to group the training data on each training node according to the number of training nodes in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the number of sub training data sets obtained equals the number of training nodes in the distributed training cluster.
Optionally, the computing module 504 is further configured to obtain the preset gradient accumulation step number, calculate the sub training gradients of the sub training data set over the gradient accumulation steps, and accumulate the sub training gradients over the gradient accumulation steps to obtain the sub training accumulation gradient.
Optionally, the accumulation module 506 is further configured to accumulate the sub training accumulation gradients across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients.
Optionally, the accumulation module comprises:
a first judging subunit, configured to determine whether the i-th sub training accumulation gradient on the current training node is the starting sub training accumulation gradient, where i is a positive integer;
a first sending subunit, configured to send the starting sub training accumulation gradient to the next training node;
a receiving subunit, configured to, upon receiving the i-th sub training accumulation gradient sent by the previous training node, accumulate the i-th sub training accumulation gradient of the previous training node with the i-th sub training accumulation gradient of the current training node to obtain the accumulated sub training accumulated gradient;
a second judging subunit, configured to determine whether the accumulated sub training accumulated gradient is the final sub training accumulated gradient;
an obtaining subunit, configured to stop the accumulation and obtain the sub training accumulated gradient;
a second sending subunit, configured to send the accumulated sub training accumulated gradient to the next training node.
Optionally, the apparatus further comprises a receiving-and-issuing instruction module, configured so that a preset training node receives the synchronization information issued by each training node of the distributed training cluster and issues, to each training node of the distributed training cluster, the instruction to synchronize the sub training accumulated gradients.
Optionally, the synchronization module 508 is further configured to synchronize the sub training accumulated gradients in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node.
Optionally, the synchronization module 508 is further configured to compress the sub training accumulated gradients to obtain compressed sub training accumulated gradients and to synchronize the compressed sub training accumulated gradients to each training node of the distributed training cluster.
In the gradient synchronization apparatus in distributed training provided by the embodiments of the present application, the sub training gradients of the sub training data sets are calculated and the training gradients of multiple steps are accumulated into sub training accumulation gradients, which are then transmitted among the training nodes of the distributed training cluster. This reduces the communication frequency and the number of gradient transmissions and effectively alleviates the time cost of gradient transmission during model training. When the sub training accumulated gradients are synchronized, they are compressed and the compressed sub training accumulated gradients are synchronized among the ring-connected training nodes, which reduces the volume of parameters exchanged during gradient synchronization, accelerates model training, and saves time.
An embodiment of the present application also provides a computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the gradient synchronization method in distributed training when executing the instructions.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the gradient synchronization method in distributed training described above.
The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the gradient synchronization method in distributed training described above belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the gradient synchronization method in distributed training above.
An embodiment of the present application discloses a chip storing computer instructions that, when executed by a processor, implement the steps of the gradient synchronization method in distributed training described above.
Specific embodiments of the present application have been described above; other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that of the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results; in some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of brevity, the foregoing method embodiments are described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The preferred embodiments of the present application disclosed above are intended only to help illustrate the present application. The alternative embodiments do not describe all details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made in light of the content of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical applications of the present application, so that those skilled in the art can better understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (11)

1. A gradient synchronization method in distributed training, characterized by comprising:
grouping the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
calculating a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
synchronizing the sub training accumulated gradients to each training node of the distributed training cluster.
2. The gradient synchronization method in distributed training according to claim 1, characterized in that grouping the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node comprises:
grouping the training data on each training node according to the number of training nodes in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the number of sub training data sets obtained equals the number of training nodes in the distributed training cluster.
3. The gradient synchronization method in distributed training according to claim 1, characterized in that calculating the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster comprises:
obtaining a preset gradient accumulation step number;
calculating the sub training gradients of the sub training data set over the gradient accumulation steps;
accumulating the sub training gradients over the gradient accumulation steps to obtain the sub training accumulation gradient.
4. The gradient synchronization method in distributed training according to claim 1, characterized in that obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients comprises:
accumulating the sub training accumulation gradients across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients.
5. The gradient synchronization method in distributed training according to claim 4, characterized in that accumulating the sub training accumulation gradients across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients comprises:
determining whether the i-th sub training accumulation gradient on the current training node is the starting sub training accumulation gradient, where i is a positive integer;
if so, sending the starting sub training accumulation gradient to the next training node;
if not, upon receiving the i-th sub training accumulation gradient sent by the previous training node, accumulating the i-th sub training accumulation gradient of the previous training node with the i-th sub training accumulation gradient of the current training node to obtain an accumulated sub training accumulated gradient;
determining whether the accumulated sub training accumulated gradient is the final sub training accumulated gradient;
if so, stopping the accumulation and obtaining the sub training accumulated gradient;
if not, sending the accumulated sub training accumulated gradient to the next training node.
6. The gradient synchronization method in distributed training according to claim 1, characterized by further comprising, before synchronizing the sub training accumulated gradients to each training node of the distributed training cluster:
receiving, by a preset training node, the synchronization information issued by each training node of the distributed training cluster, and issuing, to each training node of the distributed training cluster, the instruction to synchronize the sub training accumulated gradients.
7. The gradient synchronization method in distributed training according to claim 1, characterized in that synchronizing the sub training accumulated gradients to each training node of the distributed training cluster comprises:
synchronizing the sub training accumulated gradients in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node.
8. The gradient synchronization method in distributed training according to claim 7, characterized in that synchronizing the sub training accumulated gradients in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node comprises:
compressing the sub training accumulated gradients to obtain compressed sub training accumulated gradients;
synchronizing the compressed sub training accumulated gradients to each training node of the distributed training cluster.
9. A gradient synchronization apparatus in distributed training, characterized by comprising:
a grouping module, configured to group the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
a computing module, configured to calculate a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
an accumulation module, configured to obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
a synchronization module, configured to synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
10. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 8 when executing the instructions.
11. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
CN201910760056.XA 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training Pending CN110472731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910760056.XA CN110472731A (en) 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910760056.XA CN110472731A (en) 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training

Publications (1)

Publication Number Publication Date
CN110472731A true CN110472731A (en) 2019-11-19

Family

ID=68511040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910760056.XA Pending CN110472731A (en) 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training

Country Status (1)

Country Link
CN (1) CN110472731A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723932A (en) * 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product
CN114764601A (en) * 2022-05-05 2022-07-19 北京瑞莱智慧科技有限公司 Gradient data fusion method and device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Andrew Gibiansky: "Bringing HPC Techniques to Deep Learning", blog post *
Pascal: "Why gradients need to be manually zeroed before backpropagation in PyTorch", Zhihu *
Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally: "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training", ICLR 2018 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723932A (en) * 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product
CN114764601A (en) * 2022-05-05 2022-07-19 北京瑞莱智慧科技有限公司 Gradient data fusion method and device and storage medium
CN114764601B (en) * 2022-05-05 2024-01-30 北京瑞莱智慧科技有限公司 Gradient data fusion method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191119)