CN110472731A - Gradient synchronization method and apparatus in distributed training - Google Patents

Gradient synchronization method and apparatus in distributed training

Info

Publication number
CN110472731A
Authority
CN
China
Prior art keywords
training
gradient
sub
node
accumulation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910760056.XA
Other languages
Chinese (zh)
Inventor
李小龙
王洪伟
李鑫
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Chengdu Kingsoft Digital Entertainment Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Original Assignee
Chengdu Kingsoft Digital Entertainment Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Digital Entertainment Co Ltd and Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority to CN201910760056.XA
Publication of CN110472731A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The present application provides a gradient synchronization method and apparatus in distributed training. The gradient synchronization method includes: grouping the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring; calculating a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster; obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients; and synchronizing the sub training accumulated gradients to each training node of the distributed training cluster. By accumulating the gradients of the sub training data before communication, the number of synchronization rounds for the accumulated gradients is reduced, the communication frequency is lowered, and model training is accelerated.

Description

Gradient synchronization method and apparatus in distributed training
Technical field
The present application relates to the field of computer technology, and in particular to a gradient synchronization method and apparatus in distributed training, a computing device, a computer-readable storage medium, and a chip.
Background art
Currently, with the rapid development of computer technology, deep learning technology is also advancing rapidly. As deep learning has matured, increasingly complex algorithms have been developed; these algorithms need large amounts of data and a great deal of time to be trained effectively, which has motivated distributed training.
In the model optimization of deep learning, gradient descent is used to compute gradients and find the minimum loss function, thereby training the model and accelerating its convergence. In current distributed training, the gradient information must be transmitted and synchronized after every training step so that the gradients can be shared across the distributed training nodes and the minimum loss function can be found. The high frequency of gradient transmission and the large volume of transmitted information therefore make model training long and drawn out, severely delaying model training.
Therefore, how to alleviate the above problem has become an urgent issue to be solved.
Summary of the invention
In view of this, the embodiments of the present application provide a gradient synchronization method and apparatus in distributed training, a computing device, a computer-readable storage medium, and a chip, so as to overcome the technical deficiencies in the prior art.
According to a first aspect of the embodiments of the present application, a gradient synchronization method in distributed training is provided, comprising:
grouping the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
calculating a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
synchronizing the sub training accumulated gradients to each training node of the distributed training cluster.
According to a second aspect of the embodiments of the present application, a gradient synchronization apparatus in distributed training is provided, comprising:
a grouping module, configured to group the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
a computing module, configured to calculate a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
an accumulation module, configured to obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
a synchronization module, configured to synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
According to a third aspect of the embodiments of the present application, a computing device is provided, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the gradient synchronization method in distributed training when executing the instructions.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, which stores computer instructions that, when executed by a processor, implement the steps of the gradient synchronization method in distributed training.
According to a fifth aspect of the embodiments of the present application, a chip is provided, which stores computer instructions that, when executed by the chip, implement the steps of the gradient synchronization method in distributed training.
In the gradient synchronization method in distributed training provided by the present application, the training data on each training node in the distributed training cluster is grouped to obtain multiple sub training data sets on each training node, the training nodes in the distributed training cluster being connected in a ring; a sub training accumulation gradient is calculated for each sub training data set on the training nodes of the distributed training cluster; the sub training accumulated gradients corresponding to the sub training accumulation gradients are obtained from the sub training accumulation gradients; and the sub training accumulated gradients are synchronized to each training node of the distributed training cluster. During model training, the gradient information computed over multiple steps is accumulated before the gradients are synchronized, which significantly reduces the communication frequency of the gradient information, shortens the time spent transmitting it, accelerates model training, and improves training efficiency.
Brief description of the drawings
Fig. 1 is a structural block diagram of a computing device provided by an embodiment of the present application;
Fig. 2 is a flowchart of a gradient synchronization method in distributed training provided by an embodiment of the present application;
Fig. 3 is a flowchart of a method for calculating a sub training accumulation gradient provided by an embodiment of the present application;
Fig. 4 is a flowchart of a gradient synchronization method in distributed training provided by another embodiment of the present application;
Fig. 5 is a structural schematic diagram of a distributed training cluster provided by an embodiment of the present application;
Fig. 6 is a structural schematic diagram of a gradient synchronization apparatus in distributed training provided by an embodiment of the present application.
Detailed description of the embodiments
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the present application; the present application is therefore not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of the present application. The singular forms "a", "said" and "the" used in the one or more embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various pieces of information, this information should not be limited by these terms; these terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present application, "first" may also be referred to as "second", and similarly "second" may be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms involved in one or more embodiments of the present invention are explained.
Gradient: a gradient is a vector indicating that the directional derivative of a function at a given point attains its maximum value along the direction of the gradient, i.e. the function changes fastest, with the greatest rate of change, along that direction at that point. In model training, the gradient is used to find the minimum loss function, train the model, and accelerate its convergence; the number of model training steps is the number of gradient steps.
Gradient descent: an optimization algorithm, also commonly called the method of steepest descent. Gradient descent is one of the most commonly used methods for solving unconstrained optimization problems, and in machine learning it is now mostly used to iteratively approach the model with the minimum error; in particular, gradient descent provides the theoretical basis for the backpropagation algorithm in neural networks.
Gradient accumulation: accumulating the gradients of multiple training steps together.
Distributed training: a training method that uses multiple training nodes.
Sub training accumulation gradient: the gradients of the training data on each training node of the distributed training cluster are calculated, and the gradients computed over multiple steps are accumulated to obtain the sub training accumulation gradient.
Sub training accumulated gradient: the corresponding sub training accumulation gradients of all nodes in the distributed training cluster are accumulated together to obtain the sub training accumulated gradient.
The present application provides a gradient synchronization method and apparatus in distributed training, a computing device, a computer-readable storage medium, and a chip, which are described in detail one by one in the following embodiments.
Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 via a bus 130, and a database 150 is used to store data.
The computing device 100 also includes an access device 140 that enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any kind of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.
In an embodiment of the present application, the above components of the computing device 100 and other components not shown in Fig. 1 may also be connected to one another, for example via the bus. It should be understood that the structural block diagram of the computing device shown in Fig. 1 is provided for illustration only and does not limit the scope of the present application; those skilled in the art may add or replace other components as needed.
The computing device 100 can be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile phone (e.g., a smartphone), a wearable computing device (e.g., a smartwatch, smart glasses, etc.), another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 100 can also be a mobile or stationary server.
The processor 120 can execute the steps of the gradient synchronization method in distributed training shown in Fig. 2. Fig. 2 shows a flowchart of a gradient synchronization method in distributed training according to an embodiment of the present application, including steps 202 to 208.
Step 202: group the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring.
The distributed training cluster used in distributed training has multiple training nodes. The same model to be trained is placed on each training node, all training samples are distributed evenly across the training nodes, and each training node trains the model independently on its own training samples. The training nodes in the distributed training cluster are connected in a ring, that is, each training node is assigned a corresponding previous training node and next training node; each training node completes its own training, computes its gradients, passes the gradients to its next training node, and at the same time receives gradients from its previous training node.
A sub training data set is one of the multiple groups obtained by grouping the training data on a training node.
Optionally, the training data on each training node is grouped according to the number of training nodes in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the number of sub training data sets obtained equals the number of training nodes in the distributed training cluster.
If the number of training nodes in the distributed training cluster is n, the training data on each training node is randomly and evenly divided into n groups, yielding n sub training data sets; the number of sub training data sets on each training node equals the number of training nodes in the distributed training cluster. Randomly and evenly grouping the training data on each training node guarantees the randomness of the training data and balances the load across the training nodes of the distributed training cluster.
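For illustration only (not part of the patent text; the function and variable names are assumptions), the grouping of step 202 could be sketched in Python as follows:

```python
import random

def partition_local_data(samples, num_nodes, seed=0):
    """Randomly and evenly split one training node's samples into
    num_nodes sub training data sets (one group per training node)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)        # random grouping preserves randomness
    groups = [[] for _ in range(num_nodes)]
    for i, sample in enumerate(shuffled):
        groups[i % num_nodes].append(sample)     # round-robin keeps the groups balanced
    return groups

# Example: a cluster of 3 training nodes, 12 local samples on this node
sub_training_data = partition_local_data(range(12), num_nodes=3)
assert len(sub_training_data) == 3               # as many groups as training nodes
```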
Step 204: calculate the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster.
On each training node, the model is trained multiple times with each sub training data set: the sub training gradient of each training step on the sub training data set is calculated, and the sub training gradients of the multiple steps are accumulated to obtain the sub training accumulation gradient.
The gradient is closely tied to the training data, and different training data yields different gradients. Computing the sub training gradients on each training node from the randomly grouped sub training data sets preserves the randomness of the sub training gradients, which helps find the minimum loss function quickly and accelerates the convergence of the model being trained.
Optionally, referring to Fig. 3, step 204 can be implemented by the following steps 302 to 306.
Step 302: obtain the preset gradient accumulation step number.
The number of training steps over which gradients are to be accumulated is the gradient accumulation step number; it is set in advance, so the preset gradient accumulation step number is obtained at this point.
In the embodiment provided by the present application, on each training node the gradients of 5 steps need to be accumulated after training, so the preset gradient accumulation step number obtained is 5.
Step 304: calculate the sub training gradients of the sub training data set over the gradient accumulation steps.
On a given training node, the sub training gradients of a given sub training data set are calculated over the gradient accumulation steps.
In the embodiment provided by the present application, continuing the example above, the first sub training data set on the first training node is denoted d0; the sub training gradient computed in the first step is denoted g1, the one computed in the second step g2, the one computed in the third step g3, the one computed in the fourth step g4, and the one computed in the fifth step g5.
Step 306: accumulate the sub training gradients over the gradient accumulation steps to obtain the sub training accumulation gradient.
The sub training gradients within the gradient accumulation steps are accumulated to obtain the sub training accumulation gradient.
In the embodiment provided by the present application, continuing the example above, the sub training accumulation gradient corresponding to the first sub training data set d0 is a0, where a0 = g1 + g2 + g3 + g4 + g5.
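A minimal sketch of steps 302 to 306, assuming a PyTorch-style model, loss function, and iterable of (input, label) batches (these names are illustrative and not taken from the patent):

```python
import torch

def sub_training_accumulation_gradient(model, loss_fn, sub_dataset, accumulation_steps=5):
    """Steps 302-306: accumulate the sub training gradients of one sub training
    data set over the preset number of gradient accumulation steps."""
    accumulated = [torch.zeros_like(p) for p in model.parameters()]
    for _, (x, y) in zip(range(accumulation_steps), sub_dataset):
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                          # sub training gradient of this step
        for acc, p in zip(accumulated, model.parameters()):
            acc += p.grad                        # a0 = g1 + g2 + g3 + g4 + g5
    return accumulated                           # sub training accumulation gradient
```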
Step 206: obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients.
Optionally, the sub training accumulation gradients are accumulated across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients.
The sub training accumulation gradients corresponding to the sub training data sets on the individual training nodes are accumulated across nodes. For example, if the distributed training cluster has 5 training nodes and each training node has 5 sub training data sets, the sub training accumulation gradients corresponding to the 1st sub training data set on each training node are accumulated, then those corresponding to the 2nd sub training data set, and so on, up to the sub training accumulation gradients corresponding to the 5th sub training data set.
Step 208: synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
Optionally, the sub training accumulated gradients are synchronized in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node.
Because the training nodes in the distributed training cluster are connected in a ring, the preset order can be clockwise or counterclockwise. The sub training accumulated gradients on a training node are synchronized in turn, in the preset order, to each training node of the distributed training cluster.
In the gradient synchronization method in distributed training provided by the embodiments of the present application, the sub training gradients of the sub training data sets are calculated, the training gradients of multiple steps are accumulated into sub training accumulation gradients, and only then is gradient information transmitted among the training nodes of the distributed training cluster. This reduces the communication frequency and the number of gradient transmissions, effectively alleviates the time cost of gradient transmission during model training, accelerates model training, and saves time.
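Putting steps 202 to 208 together, a per-node training loop might be sketched as follows. This is an illustration, not the patented implementation: ring_accumulate_and_sync is a stand-in for the ring accumulation and synchronization detailed in the Fig. 4 embodiment below (it is not defined in the patent), the two helpers reuse the sketches given above, and the averaging of the combined gradient is one common choice rather than a requirement of the method:

```python
def distributed_training_loop(model, loss_fn, local_data, rank, num_nodes,
                              accumulation_steps, num_epochs, optimizer):
    """Group local data (step 202), accumulate gradients locally (step 204),
    sum them across the ring (step 206), synchronize and apply them (step 208)."""
    sub_datasets = partition_local_data(local_data, num_nodes)            # step 202
    for _ in range(num_epochs):
        local_grads = [
            sub_training_accumulation_gradient(model, loss_fn, d, accumulation_steps)
            for d in sub_datasets                                         # step 204
        ]
        synced = ring_accumulate_and_sync(local_grads, rank, num_nodes)   # steps 206-208
        total = [sum(per_param) for per_param in zip(*synced)]            # combine all groups
        for p, g in zip(model.parameters(), total):
            p.grad = g / (num_nodes * accumulation_steps)                 # averaging is one common choice
        optimizer.step()                                                  # update with the synchronized gradients
```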
Fig. 4 shows a gradient synchronization method in distributed training according to an embodiment of the present application; the method is described by taking gradient synchronization in a distributed training cluster containing 3 training nodes as an example, and includes steps 402 to 420.
Step 402: group the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring.
Step 404: calculate the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster.
Steps 402 to 404 are the same as steps 202 to 204 above; for their detailed explanation, refer to the description of steps 202 to 204 in the previous embodiment, which is not repeated here.
Fig. 5 shows the structural schematic diagram of a distributed training cluster provided by an embodiment of the present application. Referring to Fig. 5, the explanation is given with three training nodes in the distributed training cluster and three groups of sub training data on each training node as an example.
The training nodes in Fig. 5 are connected in a ring, and each training node in Fig. 5 is assigned a corresponding previous training node and next training node: the previous training node of training node 0 is training node 2, and its next training node is training node 1; the previous training node of training node 1 is training node 0, and its next training node is training node 2; the previous training node of training node 2 is training node 1, and its next training node is training node 0.
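In a ring of n training nodes the previous and next neighbours follow from a node's index by modular arithmetic; a small illustrative helper (the name is an assumption):

```python
def ring_neighbours(rank, num_nodes):
    """Return the (previous, next) training node of a ring-connected node."""
    return (rank - 1) % num_nodes, (rank + 1) % num_nodes

# With the 3 training nodes of Fig. 5:
assert ring_neighbours(0, 3) == (2, 1)   # node 0: previous is node 2, next is node 1
assert ring_neighbours(1, 3) == (0, 2)
assert ring_neighbours(2, 3) == (1, 0)
```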
Step 406: determine whether the i-th sub training accumulation gradient on the current training node is the starting sub training accumulation gradient; if so, execute step 408, and if not, execute step 410.
Here i is a positive integer, and the starting sub training accumulation gradient is the sub training accumulation gradient from which synchronization begins. On each training node of the distributed training cluster, each node corresponds to one sub training accumulation gradient from which gradient synchronization starts; the starting sub training accumulation gradient is the first sub training accumulation gradient to begin gradient synchronization. The i-th sub training accumulation gradient is the starting sub training accumulation gradient on only one training node of the distributed training cluster. For example, the 2nd sub training accumulation gradient is the starting sub training accumulation gradient only on training node 1 of the distributed training cluster; on the other training nodes of the distributed training cluster it is never treated as the starting sub training accumulation gradient.
More preferably, each training node automatically sets one starting sub training accumulation gradient, and the index of the starting sub training accumulation gradient matches the index of the current training node; this guarantees load balancing in the distributed training cluster, makes full use of the training nodes in the distributed training cluster, and saves resources.
In the embodiment provided by the present application, referring to Table 1, a0 on training node 0, b1 on training node 1 and c2 on training node 2 are the starting sub training accumulation gradients, so step 408 is executed for a0, b1 and c2, and step 410 is executed for the other sub training accumulation gradients.
Table 1 (sub training accumulation gradients on each training node; the starting gradient is marked)
Training node 0: a0 (starting), b0, c0
Training node 1: a1, b1 (starting), c1
Training node 2: a2, b2, c2 (starting)
Step 408: send the starting sub training accumulation gradient to the next training node.
In the embodiment provided by the present application, referring to Table 1, the starting sub training accumulation gradient a0 on training node 0 is sent to training node 1, the starting sub training accumulation gradient b1 on training node 1 is sent to training node 2, and the starting sub training accumulation gradient c2 on training node 2 is sent to training node 0.
Step 410: upon receiving the i-th sub training accumulation gradient sent by the previous training node, accumulate the i-th sub training accumulation gradient of the previous training node with the i-th sub training accumulation gradient of the current training node to obtain the accumulated sub training accumulated gradient.
For a sub training accumulation gradient on a training node that is not the starting one, the node waits to receive the corresponding sub training accumulation gradient sent by its previous training node and accumulates the sub training accumulation gradient sent by the previous training node with the sub training accumulation gradient of the current training node, obtaining the accumulated sub training accumulated gradient.
In the embodiment provided by the present application, referring to Table 2, training node 0 receives the 3rd sub training accumulation gradient c2 sent by training node 2 and accumulates it with the 3rd sub training accumulation gradient c0 of training node 0, obtaining the accumulated sub training accumulated gradient c2+c0; similarly, the accumulated sub training accumulated gradient on training node 1 is a0+a1 and that on training node 2 is b1+b2.
Table 2 (after the first round of accumulation)
Training node 0: a0, b0, c2+c0
Training node 1: a0+a1, b1, c1
Training node 2: a2, b1+b2, c2
Step 412: determine whether the accumulated sub training accumulated gradient is the final sub training accumulated gradient; if not, execute step 414, and if so, execute step 416.
The final sub training accumulated gradient is the sub training accumulation gradient on each training node after a number of accumulations equal to the number of training nodes in the distributed training cluster minus 1; that is, when the number of training nodes in the distributed training cluster is n, the final sub training accumulated gradient is the gradient obtained after the sub training accumulated gradient on each training node has been accumulated n-1 times. When the accumulated sub training accumulated gradient is not the final sub training accumulated gradient, step 414 is executed; when it is, step 416 is executed.
Step 414: send the accumulated sub training accumulated gradient to the next training node.
In the embodiment provided by the present application, the number of training nodes in the distributed training cluster is 3, so the final sub training accumulated gradient should be the gradient obtained after 2 accumulations. Referring to Table 2, only 1 accumulation has been performed at this point, so a0+a1, b1+b2 and c2+c0 are judged not to be final sub training accumulated gradients; the accumulated sub training accumulated gradients therefore need to be sent to the next training node to continue accumulating the sub training accumulation gradients.
Step 416: stop accumulating and obtain the final sub training accumulated gradient.
In the embodiment provided by the present application, referring to Table 3, b1+b2+b0 on training node 0 is the gradient obtained after two accumulations and is a final sub training accumulated gradient; similarly, c2+c0+c1 on training node 1 and a0+a1+a2 on training node 2 are final sub training accumulated gradients.
Table 3 (after the second round of accumulation; each node holds one final sub training accumulated gradient)
Training node 0: a0, b1+b2+b0, c2+c0
Training node 1: a0+a1, b1, c2+c0+c1
Training node 2: a0+a1+a2, b1+b2, c2
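The accumulation phase of steps 406 to 416 can be sketched as a single-process simulation (illustrative only; in a real cluster each round would be a point-to-point send to the next training node, and the function names are assumptions):

```python
def ring_accumulate(node_gradients, add=lambda received, own: received + own):
    """Simulate steps 406-416. node_gradients[k][i] is the i-th sub training
    accumulation gradient on training node k; node k starts from its k-th one."""
    n = len(node_gradients)
    for step in range(n - 1):                              # n-1 accumulations give the final gradient
        sending = [(k - step) % n for k in range(n)]       # index each node sends this round
        received = {(k + 1) % n: (sending[k], node_gradients[k][sending[k]]) for k in range(n)}
        for k, (i, grad) in received.items():              # step 410: accumulate with own i-th gradient
            node_gradients[k][i] = add(grad, node_gradients[k][i])
    return node_gradients                                  # each node now holds one final accumulated gradient

# Reproducing Tables 1-3 symbolically:
nodes = [["a0", "b0", "c0"], ["a1", "b1", "c1"], ["a2", "b2", "c2"]]
nodes = ring_accumulate(nodes, add=lambda received, own: f"{received}+{own}")
# nodes[0] == ["a0", "b1+b2+b0", "c2+c0"]; nodes[1] == ["a0+a1", "b1", "c2+c0+c1"]
# nodes[2] == ["a0+a1+a2", "b1+b2", "c2"]
```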
Step 418: the preset training node receives the synchronization information issued by each training node of the distributed training cluster and issues, to each training node of the distributed training cluster, the instruction to synchronize the sub training accumulated gradients.
A distributed training cluster connected in a ring has no training node dedicated solely to receiving synchronization information, so one training node is designated in advance to receive synchronization information and issue the synchronization instruction; that node also takes part in the synchronized receiving and sending of gradient information, so that gradient information flows and is synchronized among the ring-connected training nodes of the distributed training cluster.
In the embodiment provided by the present application, training node 0 is chosen as the preset training node. After receiving from training node 0, training node 1 and training node 2 the synchronization information indicating that accumulation of the sub training accumulated gradients is complete and synchronization can proceed, training node 0 issues to training node 0, training node 1 and training node 2 the instruction to synchronize the sub training accumulated gradients.
Optionally, the sub training accumulated gradients are compressed to obtain compressed sub training accumulated gradients.
Because the sub training accumulated gradients are the result of accumulation, they are relatively large, and completing parameter synchronization takes a long time; the sub training accumulated gradients can therefore be compressed to obtain compressed sub training accumulated gradients.
In the embodiment provided by the present application, the sub training accumulated gradients obtained are 32-bit and are compressed to 8-bit by gradient compression. Compressing the sub training accumulated gradients, for example with deep gradient compression (DGC), reduces the parameter size of the sub training accumulated gradients and the amount of data exchanged during synchronization, saves communication time, and improves model training efficiency.
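As an illustration of the 32-bit to 8-bit compression mentioned above (a simple uniform quantization sketch under assumed tensor shapes, not the DGC algorithm itself):

```python
import torch

def compress_gradient(grad_fp32):
    """Quantize a 32-bit gradient tensor to 8-bit integers plus one scale."""
    scale = grad_fp32.abs().max().clamp(min=1e-12) / 127.0
    grad_int8 = torch.clamp(torch.round(grad_fp32 / scale), -127, 127).to(torch.int8)
    return grad_int8, scale                      # roughly 4x less data to synchronize

def decompress_gradient(grad_int8, scale):
    """Recover an approximate 32-bit gradient on the receiving training node."""
    return grad_int8.to(torch.float32) * scale
```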
Step 420: synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
Optionally, the compressed sub training accumulated gradients are synchronized to each training node of the distributed training cluster.
Optionally, the compressed sub training accumulated gradients are synchronized as final sub training accumulated gradients around the distributed training cluster in order, round by round; the number of synchronization rounds equals the number of training nodes in the distributed training cluster minus 1.
In the embodiment provided by the present application, to achieve load balancing in the distributed training cluster, the final sub training accumulated gradients on the training nodes are synchronized around the distributed training cluster in order, round by round. As shown in Table 4, the final sub training accumulated gradient a0+a1+a2 on training node 2 is synchronized to training node 0, the final sub training accumulated gradient b1+b2+b0 on training node 0 is synchronized to training node 1, and the final sub training accumulated gradient c2+c0+c1 on training node 1 is synchronized to training node 2.
Table 4 (after the first synchronization round)
Training node 0: a0+a1+a2, b1+b2+b0, c2+c0
Training node 1: a0+a1, b1+b2+b0, c2+c0+c1
Training node 2: a0+a1+a2, b1+b2, c2+c0+c1
As shown in Table 5, the final sub training accumulated gradient a0+a1+a2 on training node 0 is then synchronized to training node 1, the final sub training accumulated gradient b1+b2+b0 on training node 1 is synchronized to training node 2, and the final sub training accumulated gradient c2+c0+c1 on training node 2 is synchronized to training node 0. At this point, in the present embodiment, the final sub training accumulated gradients have been synchronized twice around the distributed training cluster, completing the synchronization of the sub training accumulated gradients to every training node of the distributed training cluster.
Table 5 (after the second synchronization round; every node holds all final sub training accumulated gradients)
Training node 0: a0+a1+a2, b1+b2+b0, c2+c0+c1
Training node 1: a0+a1+a2, b1+b2+b0, c2+c0+c1
Training node 2: a0+a1+a2, b1+b2+b0, c2+c0+c1
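The synchronization of step 420 can be sketched in the same simulated style (illustrative only): after the accumulation phase, node k holds the final sub training accumulated gradient at index (k + 1) mod n, and n-1 forwarding rounds give every node every final gradient, reproducing Tables 4 and 5:

```python
def ring_allgather(node_gradients):
    """Simulate step 420: forward the final sub training accumulated gradients
    around the ring for n-1 rounds so that every training node holds all of them."""
    n = len(node_gradients)
    for step in range(n - 1):
        sending = [(k + 1 - step) % n for k in range(n)]   # index of the final gradient each node forwards
        received = {(k + 1) % n: (sending[k], node_gradients[k][sending[k]]) for k in range(n)}
        for k, (i, grad) in received.items():
            node_gradients[k][i] = grad                    # overwrite with the received final gradient
    return node_gradients

nodes = ring_allgather(nodes)      # continuing the ring_accumulate example above
# every node now holds ["a0+a1+a2", "b1+b2+b0", "c2+c0+c1"], matching Table 5
```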
In the gradient synchronization method in distributed training provided by the embodiments of the present application, the sub training gradients of the sub training data sets are calculated and the training gradients of multiple steps are accumulated into sub training accumulation gradients, which are then transmitted among the training nodes of the distributed training cluster. This reduces the communication frequency and the number of gradient transmissions and effectively alleviates the time cost of gradient transmission during model training. When the sub training accumulated gradients are synchronized, they are compressed and the compressed sub training accumulated gradients are synchronized among the ring-connected training nodes, which reduces the volume of parameters exchanged during gradient synchronization, accelerates model training, and saves time.
Corresponding to the above method embodiments, the present application also provides embodiments of a gradient synchronization apparatus in distributed training. Fig. 6 shows the structural schematic diagram of a gradient synchronization apparatus in distributed training according to an embodiment of the present application. As shown in Fig. 6, the apparatus comprises:
a grouping module 502, configured to group the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
a computing module 504, configured to calculate the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster;
an accumulation module 506, configured to obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
a synchronization module 508, configured to synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
Optionally, the grouping module 502 is further configured to group the training data on each training node according to the number of training nodes in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the number of sub training data sets obtained equals the number of training nodes in the distributed training cluster.
Optionally, the computing module 504 is further configured to obtain the preset gradient accumulation step number, calculate the sub training gradients of the sub training data set over the gradient accumulation steps, and accumulate the sub training gradients over the gradient accumulation steps to obtain the sub training accumulation gradient.
Optionally, the accumulation module 506 is further configured to accumulate the sub training accumulation gradients across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients.
Optionally, the accumulation module comprises:
a first judging subunit, configured to determine whether the i-th sub training accumulation gradient on the current training node is the starting sub training accumulation gradient, where i is a positive integer;
a first sending subunit, configured to send the starting sub training accumulation gradient to the next training node;
a receiving subunit, configured to, upon receiving the i-th sub training accumulation gradient sent by the previous training node, accumulate the i-th sub training accumulation gradient of the previous training node with the i-th sub training accumulation gradient of the current training node to obtain the accumulated sub training accumulated gradient;
a second judging subunit, configured to determine whether the accumulated sub training accumulated gradient is the final sub training accumulated gradient;
an obtaining subunit, configured to stop the accumulation and obtain the sub training accumulated gradient;
a second sending subunit, configured to send the accumulated sub training accumulated gradient to the next training node.
Optionally, the apparatus further comprises a receiving-and-issuing instruction module, configured so that a preset training node receives the synchronization information issued by each training node of the distributed training cluster and issues, to each training node of the distributed training cluster, the instruction to synchronize the sub training accumulated gradients.
Optionally, the synchronization module 508 is further configured to synchronize the sub training accumulated gradients in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node.
Optionally, the synchronization module 508 is further configured to compress the sub training accumulated gradients to obtain compressed sub training accumulated gradients and to synchronize the compressed sub training accumulated gradients to each training node of the distributed training cluster.
In the gradient synchronization apparatus in distributed training provided by the embodiments of the present application, the sub training gradients of the sub training data sets are calculated and the training gradients of multiple steps are accumulated into sub training accumulation gradients, which are then transmitted among the training nodes of the distributed training cluster. This reduces the communication frequency and the number of gradient transmissions and effectively alleviates the time cost of gradient transmission during model training. When the sub training accumulated gradients are synchronized, they are compressed and the compressed sub training accumulated gradients are synchronized among the ring-connected training nodes, which reduces the volume of parameters exchanged during gradient synchronization, accelerates model training, and saves time.
An embodiment of the present application also provides a computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the gradient synchronization method in distributed training when executing the instructions.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the gradient synchronization method in distributed training described above.
The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the gradient synchronization method in distributed training described above belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the gradient synchronization method in distributed training above.
An embodiment of the present application discloses a chip storing computer instructions that, when executed by a processor, implement the steps of the gradient synchronization method in distributed training described above.
Specific embodiments of the present application have been described above; other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that of the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results; in some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for the sake of brevity, the foregoing method embodiments are described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The preferred embodiments of the present application disclosed above are intended only to help illustrate the present application. The alternative embodiments do not describe all details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made in light of the content of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical applications of the present application, so that those skilled in the art can better understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (11)

1. A gradient synchronization method in distributed training, characterized by comprising:
grouping the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
calculating a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
synchronizing the sub training accumulated gradients to each training node of the distributed training cluster.
2. The gradient synchronization method in distributed training according to claim 1, characterized in that grouping the training data on each training node in the distributed training cluster to obtain multiple sub training data sets on each training node comprises:
grouping the training data on each training node according to the number of training nodes in the distributed training cluster to obtain multiple sub training data sets on each training node, wherein the number of sub training data sets obtained equals the number of training nodes in the distributed training cluster.
3. The gradient synchronization method in distributed training according to claim 1, characterized in that calculating the sub training accumulation gradient of each sub training data set on the training nodes of the distributed training cluster comprises:
obtaining a preset gradient accumulation step number;
calculating the sub training gradients of the sub training data set over the gradient accumulation steps;
accumulating the sub training gradients over the gradient accumulation steps to obtain the sub training accumulation gradient.
4. The gradient synchronization method in distributed training according to claim 1, characterized in that obtaining, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients comprises:
accumulating the sub training accumulation gradients across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients.
5. The gradient synchronization method in distributed training according to claim 4, characterized in that accumulating the sub training accumulation gradients across the training nodes of the distributed training cluster to obtain the sub training accumulated gradients corresponding to the sub training accumulation gradients comprises:
determining whether the i-th sub training accumulation gradient on the current training node is the starting sub training accumulation gradient, where i is a positive integer;
if so, sending the starting sub training accumulation gradient to the next training node;
if not, upon receiving the i-th sub training accumulation gradient sent by the previous training node, accumulating the i-th sub training accumulation gradient of the previous training node with the i-th sub training accumulation gradient of the current training node to obtain an accumulated sub training accumulated gradient;
determining whether the accumulated sub training accumulated gradient is the final sub training accumulated gradient;
if so, stopping the accumulation and obtaining the sub training accumulated gradient;
if not, sending the accumulated sub training accumulated gradient to the next training node.
6. The gradient synchronization method in distributed training according to claim 1, characterized by further comprising, before synchronizing the sub training accumulated gradients to each training node of the distributed training cluster:
receiving, by a preset training node, the synchronization information issued by each training node of the distributed training cluster, and issuing, to each training node of the distributed training cluster, the instruction to synchronize the sub training accumulated gradients.
7. The gradient synchronization method in distributed training according to claim 1, characterized in that synchronizing the sub training accumulated gradients to each training node of the distributed training cluster comprises:
synchronizing the sub training accumulated gradients in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node.
8. The gradient synchronization method in distributed training according to claim 7, characterized in that synchronizing the sub training accumulated gradients in turn, in a preset order, into the corresponding sub training accumulation gradients of each training node comprises:
compressing the sub training accumulated gradients to obtain compressed sub training accumulated gradients;
synchronizing the compressed sub training accumulated gradients to each training node of the distributed training cluster.
9. A gradient synchronization apparatus in distributed training, characterized by comprising:
a grouping module, configured to group the training data on each training node in a distributed training cluster to obtain multiple sub training data sets on each training node, wherein the training nodes in the distributed training cluster are connected in a ring;
a computing module, configured to calculate a sub training accumulation gradient for each sub training data set on the training nodes of the distributed training cluster;
an accumulation module, configured to obtain, from the sub training accumulation gradients, the sub training accumulated gradients corresponding to the sub training accumulation gradients;
a synchronization module, configured to synchronize the sub training accumulated gradients to each training node of the distributed training cluster.
10. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 8 when executing the instructions.
11. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
CN201910760056.XA 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training Pending CN110472731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910760056.XA CN110472731A (en) 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910760056.XA CN110472731A (en) 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training

Publications (1)

Publication Number Publication Date
CN110472731A true CN110472731A (en) 2019-11-19

Family

ID=68511040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910760056.XA Pending CN110472731A (en) 2019-08-16 2019-08-16 Gradient synchronization method and apparatus in distributed training

Country Status (1)

Country Link
CN (1) CN110472731A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723932A (en) * 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product
CN114764601A (en) * 2022-05-05 2022-07-19 北京瑞莱智慧科技有限公司 Gradient data fusion method and device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Andrew Gibiansky: "Bringing HPC Techniques to Deep Learning", blog post *
Pascal: "Why gradients need to be manually zeroed before backpropagation in PyTorch", Zhihu *
Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally: "Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training", ICLR 2018 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723932A (en) * 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product
CN114764601A (en) * 2022-05-05 2022-07-19 北京瑞莱智慧科技有限公司 Gradient data fusion method and device and storage medium
CN114764601B (en) * 2022-05-05 2024-01-30 北京瑞莱智慧科技有限公司 Gradient data fusion method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191119)