CN109919313A - Method and distributed training system for gradient transmission - Google Patents

Method and distributed training system for gradient transmission

Info

Publication number: CN109919313A
Application number: CN201910101338.9A
Other versions: CN109919313B (en)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: gradient, neural network model, neuron, layer
Legal status: Granted; Active
Inventors: 徐华, 徐宇啸, 吕跃强
Current Assignee: Huawei Cloud Computing Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd; priority to CN201910101338.9A

Abstract

This application discloses a method of gradient transmission and a distributed training system, intended to improve the efficiency with which gradients generated during training are transmitted and thereby improve the training efficiency of distributed training. The method comprises: obtaining, according to input training data, the gradients of the weights of the i-th layer of neurons of a neural network model; sending the gradients of the weights of the i-th layer of neurons to a gradient buffer; judging whether the number of gradients stored in the gradient buffer exceeds a transmission threshold; sending, according to the judgment result, the gradients stored in the gradient buffer to a gradient collection module; obtaining the mean gradient of the weights of the i-th layer of neurons of the neural network model, computed from the gradients sent by the multiple neural network models and stored in the gradient collection module; and updating the corresponding weights according to the mean gradient of the weights of the i-th layer of neurons, so as to execute the next iteration of the neural network model.

Description

Method and distributed training system for gradient transmission
Technical field
Embodiments of the present application relate to the field of information technology, and in particular to a method of gradient transmission and a distributed training system.
Background art
At present, artificial intelligence (AI) is receiving significant attention, and its core technologies are achieving important breakthroughs in many fields. One of the core technologies of AI is deep learning, a machine learning technique based on neural network models. A neural network model comprises multiple layers of neurons, and each layer of neurons corresponds to at least one weight. A neural network model can only be used normally after many iterations of training. An iteration of a neural network model can be described as: determining, from massive training data, the optimal weights that minimize the difference between the prediction result of the neural network model and the prior knowledge.
When iterating a neural network model, distributed training with multiple training devices can be used to improve training efficiency. In one iteration performed by multiple training devices on the neural network model, the gradients that the devices compute for any given weight may differ, so the devices must transmit the computed gradients of each weight in order to determine a mean gradient. The devices then update the weights using the mean gradients, so that for any weight the updated value is identical across devices. After updating the weights of every layer of neurons, each training device uses the updated weights of each layer of neurons to perform the next iteration of its neural network model.
How the multiple training devices synchronize the gradients obtained during one iteration has a significant impact on training efficiency.
Summary of the invention
Embodiments of the present application provide a method of gradient transmission, intended to improve the efficiency with which gradients generated during training are synchronized between neural network models.
In a first aspect, the present invention provides a method of gradient transmission in a distributed training system for neural network models. The system comprises multiple neural network models; each neural network model comprises n layers of neurons, each layer of neurons corresponds to at least one weight, n is a positive integer, and the multiple neural network models carry out an iteration simultaneously. The process is similar for each of the multiple neural network models when carrying out an iteration, so the method is illustrated with any one neural network model in the distributed training system.
In one iteration, the neural network model first takes training data as input and obtains, according to the input training data, the gradients of the weights of the i-th layer of neurons of the neural network model, where i is a positive integer not greater than n. The gradients of the weights of the i-th layer of neurons of the neural network model are sent to the gradient buffer of the neural network model. After the gradients have been sent to the gradient buffer, it is judged whether the number of gradients stored in the gradient buffer of the neural network model exceeds a determined transmission threshold; according to the judgment result, the gradients stored in the gradient buffer of the neural network model are sent to the gradient collection module, after which the gradient buffer holds no gradients. Generally, once the number of gradients stored in the gradient buffer exceeds the determined transmission threshold, the stored gradients are sent to the gradient collection module. Since the multiple neural network models all send their gradients to the gradient collection module, the module stores the gradients sent by the multiple neural network models; from the gradients sent by the multiple neural network models and stored in the gradient collection module, the mean gradient of the weights of the i-th layer of neurons of the neural network model is obtained. Finally, the weights of the i-th layer of neurons of the neural network model are updated according to this mean gradient, so as to execute the next iteration of the neural network model.
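The flow of the first aspect can be illustrated with a minimal sketch in Python; the buffer structure, the count-based threshold test, and the collector interface below are illustrative assumptions, not the patent's implementation:

    # Sketch of the per-model gradient buffer of the first aspect (names assumed).
    class GradientBuffer:
        def __init__(self, transmission_threshold):
            self.threshold = transmission_threshold
            self.grads = []                       # gradients awaiting transmission

        def push(self, layer_grads, collector):
            self.grads.extend(layer_grads)        # cache the i-th layer's gradients
            if len(self.grads) > self.threshold:  # stored gradients exceed the threshold?
                collector.receive(self.grads)     # send the stored gradients
                self.grads = []                   # the buffer then holds no gradients

        def flush(self, collector):
            if self.grads:                        # unconditional send at the end of an iteration
                collector.receive(self.grads)
                self.grads = []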
By setting a transmission threshold, comparing the number of stored gradients with the transmission threshold to determine whether to transmit the gradients, determining the mean gradient from the gradients of the multiple neural network models, and updating the weight values with the mean gradient, the method not only realizes gradient transmission but also improves the training efficiency of distributed training. In addition, the transmission threshold is determined according to the neural network model, and different neural network models use different transmission thresholds, which avoids the loss of transmission efficiency caused by a mismatch between the transmission threshold and the neural network model and further improves the training efficiency of distributed training.
With reference to the first aspect, in a first possible implementation of the first aspect, before judging whether the number of gradients stored in the gradient buffer of the neural network model exceeds the determined transmission threshold, a transmission threshold set may be determined. The transmission threshold set contains m alternate transmission thresholds, and the transmission duration corresponding to each of the m alternate transmission thresholds in the transmission threshold set is obtained over m iterations; the transmission threshold is then determined according to these m transmission durations.
Further, a threshold set including at least two alternate transmission thresholds can be determined in advance, and during the iterations of the distributed training system the transmission threshold is determined from the at least two alternate transmission thresholds, so as to maximize transmission efficiency.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, when determining the transmission threshold according to the m transmission durations, the alternate transmission threshold corresponding to the shortest of the m transmission durations can be selected as the transmission threshold.
Further, using the alternate transmission threshold corresponding to the shortest transmission duration as the final transmission threshold realizes fast transmission and improves the efficiency of distributed training.
With reference to the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, each alternate transmission threshold is not less than the number of weights of any layer of neurons of the neural network model.
In a second aspect, the present invention provides a distributed training system that has the functional modules for realizing the method in the first aspect and any of its possible implementations. The functional modules can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
The distributed training system includes multiple neural network models and multiple transmission modules, with a one-to-one correspondence between the neural network models and the transmission modules. Each neural network model includes n layers of neurons, each layer of neurons corresponds to at least one weight, and n is a positive integer.
The neural network model is configured to, in one iteration, obtain the gradients of the weights of the i-th layer of neurons of the neural network model according to the input training data, where i is a positive integer not greater than n, and to send the gradients of the weights of the i-th layer of neurons of the neural network model to the gradient buffer of the neural network model.
The transmission module is configured to judge whether the number of gradients stored in the gradient buffer of the neural network model exceeds the determined transmission threshold and, according to the judgment result, to send the gradients stored in the gradient buffer of the neural network model to the gradient collection module.
The neural network model is further configured to obtain the mean gradient of the weights of the i-th layer of neurons of the neural network model, computed from the gradients sent by the multiple neural network models and stored in the gradient collection module, and to update the weights of the i-th layer of neurons of the neural network model according to this mean gradient, so as to execute the next iteration of the neural network model.
By setting a transmission threshold, comparing the number of stored gradients with the transmission threshold to determine whether to transmit the gradients, determining the mean gradient from the gradients of the multiple neural network models, and updating the weight values with the mean gradient, the system not only realizes gradient transmission but also improves the training efficiency of distributed training. In addition, the transmission threshold is determined according to the neural network model, and different neural network models use different transmission thresholds, which avoids the loss of transmission efficiency caused by a mismatch between the transmission threshold and the neural network model and further improves the training efficiency of distributed training.
In conjunction with the second aspect, in a first possible implementation of the second aspect, the neural network model is further configured to obtain, over m iterations, the transmission duration corresponding to each of the m alternate transmission thresholds in the transmission threshold set; to determine the transmission threshold according to these m transmission durations; and to trigger the transmission module to execute the step of judging whether the number of gradients stored in the gradient buffer of the neural network model exceeds the determined transmission threshold.
Further, a threshold set including at least two alternate transmission thresholds can be determined in advance, and during the iterations of the distributed training system the transmission threshold is determined from the at least two alternate transmission thresholds, so as to maximize transmission efficiency.
In conjunction with the first possible implementation of the second aspect, in a second possible implementation of the second aspect, when determining the transmission threshold according to the m transmission durations, the transmission module is specifically configured to select the alternate transmission threshold corresponding to the shortest of the m transmission durations as the transmission threshold.
Further, using the alternate transmission threshold corresponding to the shortest transmission duration as the transmission threshold realizes fast transmission and improves the efficiency of distributed training.
In conjunction with the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, each alternate transmission threshold is not less than the number of weights of any layer of neurons of the neural network model.
In a third aspect, the present invention provides a distributed training system that includes at least one training device, each training device including a processor and a memory. The memory is configured to store computer instructions; the processor is configured to execute the computer instructions in the memory so as to realize the method in the first aspect or any of its possible implementations.
In a fourth aspect, the present invention provides a non-volatile computer-readable storage medium storing computer instructions that are executed by a processor to realize the method in the first aspect or any of its possible implementations.
In a fifth aspect, the present invention provides a computer program product that, when read and executed by a computer, causes the computer to execute the method in the first aspect or any of its possible implementations.
In a sixth aspect, the present invention provides a chip coupled to a memory, the chip being configured to read and execute the software program stored in the memory so as to realize the method in the first aspect or any of its possible implementations.
Brief description of the drawings
Figure 1A is a schematic diagram of a neural network model provided in an embodiment of this application;
Figure 1B is a schematic diagram of a distributed training system provided in an embodiment of this application;
Figure 1C is a schematic diagram of a decentralized distributed training system provided in an embodiment of this application;
Figure 2A is a schematic diagram of an iteration process provided in an embodiment of this application;
Figure 2B is a schematic diagram of a gradient transmission process provided in an embodiment of this application;
Figure 3 is a schematic diagram of a process for determining the transmission threshold provided in an embodiment of this application;
Figure 4 is a schematic diagram of a process for determining the threshold set provided in an embodiment of this application;
Figure 5 is a schematic diagram of a distributed training system provided in an embodiment of this application.
Detailed description of the embodiments
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
To facilitate understanding of this solution, the neural network model is introduced first. It should be understood that a neural network model is a network model that imitates the behavioral features of animal neural networks. Relying on its complexity, this kind of network model achieves the purpose of processing information by adjusting the interconnections among a large number of internal nodes.
The process of training a neural network is the process of learning the weights of the neurons; its final purpose is to obtain the weights of each layer of neurons of the trained neural network model.
The training process of a neural network model that may be applied to embodiments of this application is described in detail below with reference to Figure 1A.
Figure 1A is a schematic block diagram of a neural network model 100 provided by an embodiment of this application. Neural network model 100 includes n layers of neurons; each of the n layers includes one or more neurons, and all neurons of each layer are connected to all neurons of the next layer. Taking the neural network model 100 in Figure 1A as an example: layer 1 includes two neurons, each of layers 2 to n-1 includes three neurons, and layer n includes one neuron, where n is a positive integer not less than 2, and i in Figure 1A is a positive integer not greater than n and not less than 1. Each neuron has corresponding weights.
One iteration in the training process of neural network model 100 is described in detail below.
Training data is obtained from the training data set and used as the input of layer 1 of neural network model 100; the input to the first layer passes through the neurons of layers 1 through n, and a prediction result is output from layer n. Specifically, each layer of neurons has corresponding weights. The training data is input to the first-layer neurons, which output the first layer's output values based on the corresponding weights. The output values of the first-layer neurons serve as the input of the second-layer neurons, which output the second layer's output values based on the corresponding weights. The same applies layer by layer, until a prediction result is finally output from layer n.
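This layer-by-layer forward pass can be sketched as follows, assuming fully connected layers represented as NumPy weight matrices with an elementwise activation (both assumptions are made only for illustration):

    import numpy as np

    def forward(x, weights):
        # weights[i] is the weight matrix of layer i+1; the output of each
        # layer serves as the input of the next, and layer n emits the
        # prediction result.
        out = x
        for w in weights:           # layer 1 .. layer n, in order
            out = np.tanh(out @ w)  # each layer's output, based on its weights
        return out                  # prediction result output from layer n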
During the training of neural network model 100, the goal is for the prediction result output by layer n of neural network model 100 to be as close as possible to the prior knowledge of the training data. Prior knowledge, also called ground truth, generally comprises the true results corresponding to the training data, as provided by humans. By comparing the current prediction result with the prior knowledge, the weights of each layer of neurons in neural network model 100 can be updated according to the difference between the two (before the first update there is, of course, usually an initialization process, i.e., initializing the weights of each layer of neurons in neural network model 100). Therefore, after the prediction result output by layer n is obtained, an error algorithm is used to correct the weights of the neurons according to the prediction result and the prior knowledge, as follows.
A loss function is calculated from the prediction result and the prior knowledge; according to the loss function, the weights of each layer of neurons in neural network model 100 are corrected in the direction from layer n to layer 1. A weight can be corrected by calculating its gradient, which is obtained from the loss function by differentiating the loss function with respect to the weight.
Correcting the weights according to the prediction result and the prior knowledge includes calculating the loss function from the prediction result and the prior knowledge and, according to the loss function, calculating the gradients of the weights of each layer of neurons in the direction from layer n to layer 1. In other words, the gradients of the weights of the layers of neurons are calculated layer by layer, in order from layer n to layer 1: after the gradients of the weights of the i-th layer of neurons have been calculated, calculation of the gradients of the weights of the (i-1)-th layer of neurons begins. After the gradients of the weights of every layer of neurons have been obtained, the weights of each layer of neurons are corrected according to the gradients, completing one iteration.
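In common notation (a standard formulation, not quoted from the patent), with loss function L, a weight w_i of the i-th layer of neurons, and learning rate eta, the gradient and the layer-by-layer correction are:

    g_i = dL/dw_i,    w_i <- w_i - eta * g_i,    for i = n, n-1, ..., 1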
Over many iterations, the weights of each layer of neurons are continuously corrected, so that the prediction result output by neural network model 100 becomes as close as possible to the prior knowledge of the training data.
The gradient transmission method in this application is applied to a distributed training system for neural network models. The distributed training system includes multiple training devices. The same neural network model is deployed on each training device, and each training device obtains different training data from the training data set for training. After several iterations, the multiple training devices obtain multiple identical trained neural network models, and the trained neural network model on any training device is the neural network model that has completed distributed training.
In addition, each training device also includes a transmission module. The process of distributed training is illustrated with the distributed training system 200 in Figure 1B. Illustratively, only training device 210 and training device 220 of distributed training system 200 are drawn in this embodiment of the invention; the actual number of training devices can be larger.
In one iteration of distributed training, each training device obtains training data from the training data set, and the training data corresponding to each training device is different. Each training device inputs its training data into the neural network model on that device; for example, training device 210 inputs the training data it obtains into neural network model 211 on training device 210, and training device 220 inputs the training data it obtains into neural network model 221 on training device 220. Each neural network model obtains a prediction result for its input training data and calculates the gradients of the weights of each layer of neurons based on the prediction result and the prior knowledge. Since the training data input to each neural network model is different, in one iteration the gradients calculated by the neural network models are different. That is, in the distributed training system 200 shown in Figure 1B, in one iteration the training data input to neural network model 211 differs from that input to neural network model 221, and therefore, based on the different training data, the gradients obtained by neural network model 211 differ from those obtained by neural network model 221. If the weights of each layer of neurons in neural network model 211 were adjusted according to the gradients obtained by neural network model 211, and the weights of each layer of neurons in neural network model 221 according to the gradients obtained by neural network model 221, then after several iterations the weights of each layer of neurons in the trained neural network model 211 would differ from those in neural network model 221, and the trained neural network models 211 and 221 would be different.
Therefore, during each iteration, after each neural network model obtains the gradients of the weights, the gradients of the weights are first sent to the corresponding gradient buffer for storage. When the transmission module on a training device determines that the number of gradients stored in the gradient buffer is greater than the transmission threshold, the corresponding transmission module transmits the gradients of the weights. Specifically, transmission module 212 on training device 210 sends the gradients of the weights to gradient collection module 230, and transmission module 222 on training device 220 sends the gradients of the weights to gradient collection module 230. The gradients of the gradient buffers are stored in the gradient collection module, and the mean gradients of the weights are calculated from the gradients of the gradient buffers of the neural network models stored in the gradient collection module. Each neural network model then updates the corresponding weights according to the mean gradients. Optionally, the mean gradient of a weight can be obtained by averaging the gradients of that weight obtained by the neural network models, by computing a weighted average of those gradients, or by applying other processing to those gradients.
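The averaging step of the gradient collection module can be sketched as follows, assuming the gradients arrive as NumPy arrays; the plain and weighted variants correspond to the two options named above, and all names are illustrative:

    import numpy as np

    def mean_gradient(per_model_grads):
        # Plain average of the gradients of one weight across the models.
        return sum(per_model_grads) / len(per_model_grads)

    def weighted_mean_gradient(per_model_grads, model_weights):
        # Weighted average, e.g. weighting each model's gradient by its batch size.
        total = sum(model_weights)
        return sum(w * g for w, g in zip(model_weights, per_model_grads)) / total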
The training devices in distributed training system 200 can be deployed in a centralized or a decentralized manner. With centralized deployment, the gradient collection module is independent of the multiple training devices, or is deployed on one of the multiple training devices, and each training device transmits its gradients to that gradient collection module. With decentralized deployment, a gradient collection module is deployed on each training device. The distributed training system 200 drawn in Figure 1B uses centralized deployment, with gradient collection module 230 independent of training device 210 and training device 220. In the decentralized deployment of distributed training system 200 shown in Figure 1C, gradient collection module 2301 is deployed on training device 210, and gradient collection module 2302 is deployed on training device 220.
When the distributed training system is used to train a neural network model, the multiple training devices in the distributed training system execute the steps of training the neural network model in parallel. The process by which the distributed training system trains the neural network model is introduced below with reference to Figure 2A, taking one training device in the distributed training system as an example.
Step 21: Obtain training data from the training data set and input the obtained training data into the neural network model; obtain the prediction result for the input training data according to the weights of each layer of neurons in the neural network model.
During the first iteration, the weights of each layer of neurons are determined by an initialization process, which is the process of setting the weights of each layer of neurons for each neural network model.
Step 22: Obtain the prior knowledge of the training data, and calculate the loss function according to the prediction result and the prior knowledge of the training data.
Step 23: Obtain the transmission threshold, calculate the gradients of the weights of each layer of neurons according to the loss function, and transmit the gradients of the weights of each layer of neurons according to the transmission threshold.
When calculating the gradients of the weights of each layer of neurons according to the loss function, the gradients of the weights of each layer must be calculated layer by layer, in order from the last layer of neurons (the n-th layer) to the first layer of neurons.
Since the mean gradients of the weights need to be calculated, the gradients of the weights of each layer of neurons must be transmitted after they have been calculated. Transmitting the gradients of the weights of each layer of neurons means sending the gradients to the gradient collection module, so that the gradient collection module can calculate the mean gradients of the weights. Gradients could be transmitted only after the gradients of the weights of all layers of neurons have been calculated, transmitting all of them at once. But the transmission of gradients takes time; to improve efficiency, the calculated gradients can be transmitted after the gradients of the weights of several layers of neurons have been calculated, and during that transmission the training device can start calculating the gradients of the weights of the preceding layer of neurons (taking layer 1 as the front-most layer), reducing the delay caused by having to wait for the transmission of the gradients before the next iteration starts. Specifically, whenever the gradients of the weights of one layer of neurons have been calculated, the gradients are cached in the gradient buffer on the training device; a transmission threshold is set, the number of gradients cached in the gradient buffer is compared with the transmission threshold, and one transmission is started once the number of gradients cached in the gradient buffer exceeds the transmission threshold.
The setting of the transmission threshold is the key to improving training efficiency, and the appropriate transmission threshold differs between neural network models. Since starting one transmission takes time, a transmission threshold that is too small causes transmissions to be started too frequently, so a threshold that is too small cannot effectively reduce the delay brought by transmission; conversely, when the total number of weights of the layers of neurons in the neural network model is small, a transmission threshold that is too large cannot effectively improve transmission efficiency. Therefore, the transmission threshold used by a neural network model must be determined according to the neural network model, and the transmission thresholds determined for different neural network models are different. Optionally, the transmission threshold is determined according to the total number of weights of each layer of neurons of the neural network model. The detailed process of determining the transmission threshold according to the neural network model is described later.
The detailed process in step 23 of calculating the gradients of the weights of each layer of neurons according to the loss function and transmitting the gradients of the weights of each layer of neurons according to the transmission threshold is described in steps 231 to 236.
Step 24: For each weight among the weights of each layer of neurons, the gradient collection module calculates the mean gradient of that weight.
Step 25: Obtain the mean gradient of each weight, and update the corresponding weight using the mean gradient.
Through steps 21 to 25 above, the training device completes one iteration; it then obtains new training data from the training data set and carries out the next iteration based on the updated weights. Each training device completes an iteration according to steps 21 to 25, using different training data. After the weight update, the updated weights of all training devices are identical, and the multiple training devices can carry out the next iteration of the neural network model, until the loss function calculated in step 22 satisfies a set condition, at which point the neural network model is considered trained.
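Steps 21 to 25 on one training device can be sketched as a single loop, reusing the GradientBuffer sketch above; forward, layer_grads, collector, and the stopping condition are illustrative assumptions:

    def train(device, data_set, threshold, max_iters, tol):
        buf = GradientBuffer(threshold)
        for _ in range(max_iters):
            x, truth = data_set.next_batch()            # step 21: obtain training data
            pred = forward(x, device.weights)           #          prediction result
            loss = device.loss_fn(pred, truth)          # step 22: loss function
            if loss < tol:                              # set condition met: training done
                break
            for layer in range(device.n_layers, 0, -1): # step 23: layer n down to layer 1
                grads = device.layer_grads(loss, layer) #          gradients of this layer
                buf.push(grads, device.collector)       #          cache, transmit if over threshold
            buf.flush(device.collector)                 #          final transmission
            means = device.collector.mean_grads()       # step 24: mean gradient per weight
            device.apply_update(means)                  # step 25: update the weights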
The detailed process in step 23, within one iteration, of calculating the gradients of the weights of each layer of neurons according to the loss function and transmitting the gradients of the weights of each layer of neurons according to the transmission threshold is described below; see Figure 2B for the details:
Step 231: Calculate the gradient of each weight of the current layer of neurons; after the gradient of each weight has been calculated, send the calculated gradients to the gradient buffer for caching.
The neural network model includes n layers of neurons; the i-th layer of neurons is taken as the current layer of neurons, where i is a positive integer not greater than n. Within one iteration, the gradients of the weights of each layer are calculated layer by layer in order from the n-th layer of neurons to the 1st layer of neurons, i.e., the initial value of i is n.
Each time the gradient of one weight has been calculated, that gradient can be cached.
Step 232: Determine whether the calculation of the gradients of the weights of all layers of neurons of the neural network model has been completed. A specific way of determining this is to check whether the current layer of neurons is the first layer of neurons of the neural network model: if the current layer of neurons is the i-th layer, determine whether i is 1. If the calculation of the gradients of the weights of all layers of neurons has not yet been completed, execute step 233; if it has been completed, execute step 236 directly, in which the transmission of the gradients of the weights of all neurons is completed.
Step 233: Determine whether the number of gradients stored in the current gradient buffer is not less than the transmission threshold. If so, execute step 234 and transmit the cached gradients; if not, proceed to step 235 and continue to calculate the gradients of the weights of the preceding layer of neurons. In step 233, the size (storage volume) of the gradients stored in the current gradient buffer can also be compared with the transmission threshold: if the size of the gradients stored in the current gradient buffer is greater than the transmission threshold, execute step 234; otherwise execute step 235.
Step 234: Send the currently cached gradients to the gradient collection module.
Step 234 also includes, after the currently cached gradients have been sent to the gradient collection module, deleting the gradients that have been transmitted to the gradient collection module; after step 234 has executed, there are no transmitted gradients left in the gradient buffer.
Step 235: Take the (i-1)-th layer as the current layer of neurons and execute step 231 to calculate the gradients of the weights of the preceding layer of neurons. At this point the calculation of the gradients of the weights of all layers of neurons has not been completed, i.e., the layer of neurons whose calculation has just been completed is not the first layer of neurons, so the gradients of the weights of the layer preceding the layer just completed must still be calculated.
Step 236: Send the currently cached gradients to the gradient collection module.
At this point the gradients of the weights of the 1st layer of neurons have been calculated, which completes the gradient calculation for the weights of all layers of neurons, and the transmission of the gradients of the weights of all neurons must be completed. The currently cached gradients are still stored in the gradient buffer (the currently cached gradients include the gradients of the weights of the 1st layer of neurons calculated in step 231) and have not yet been transmitted, so the currently cached gradients must be transmitted, i.e., sent to the gradient collection module. Here, before the currently cached gradients are sent to the gradient collection module, there is no need to judge whether the number of gradients cached in the buffer is not less than the transmission threshold. After step 236, the transmitted gradients in the buffer can also be deleted, completing the calculation and transmission of the gradients of the weights of all layers of neurons within one iteration.
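Steps 231 to 236 can be sketched as one backward loop, using the count-based comparison of step 233; all helper names are illustrative assumptions:

    def backward_and_transmit(device, loss, threshold):
        cached = []                                     # the gradient buffer
        for i in range(device.n_layers, 0, -1):         # step 231: layer n down to layer 1
            cached.extend(device.layer_grads(loss, i))  # cache this layer's gradients
            if i == 1:                                  # step 232: all layers calculated?
                break                                   # yes: go to step 236
            if len(cached) >= threshold:                # step 233: enough gradients cached?
                device.collector.receive(cached)        # step 234: send cached gradients
                cached = []                             #           and delete the sent ones
            # step 235: otherwise continue with the preceding layer
        device.collector.receive(cached)                # step 236: final, unconditional send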
In this application, the transmission threshold is determined according to the neural network model, specifically according to the number of weights of each layer of neurons in the neural network model. For example, traverse each layer of neurons in the neural network model and obtain the total number of weights of each layer of neurons, the total number of weights of the i-th layer of neurons being q_i, where the neural network model includes n layers of neurons and i is a positive integer not greater than n. The maximum, minimum, median, or average of the per-layer totals of weights, among other statistics, can be used as the transmission threshold.
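These options can be sketched as follows, assuming the per-layer totals of weights are already known (the selectable statistics simply illustrate the alternatives named above):

    import statistics

    def threshold_from_layer_counts(q, mode="max"):
        # q[i] is the total number of weights of layer i+1 of the model.
        if mode == "max":
            return max(q)
        if mode == "min":
            return min(q)
        if mode == "median":
            return statistics.median(q)
        return sum(q) / len(q)      # otherwise: average of the per-layer totals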
Further, a threshold set including at least two alternate transmission thresholds can be determined in advance, and during the iterations of the distributed training system the transmission threshold is determined from the at least two alternate transmission thresholds, so as to maximize transmission efficiency. Specifically, the detailed process of determining each alternate transmission threshold in the threshold set according to the neural network model is given in steps 41 to 44 in Figure 4.
The process, shown in Figure 3, of choosing among the alternate transmission thresholds in the threshold set is described below. The threshold set includes m alternate transmission thresholds, where m is a positive integer not less than 2. Figure 3 provides an embodiment in which the transmission threshold is determined through m iterations in the training process of the neural network model.
Step 31: In the a-th iteration, input training data into the neural network model; obtain the prediction result for the input training data according to the weights of each layer of neurons in the neural network model. The initial value of a is 1, and the final value of a is m.
Step 31 here corresponds to step 21 above, and the detailed process is not repeated.
Step 32: Calculate the loss function according to the prediction result and the prior knowledge.
Step 32 here corresponds to step 22 above, and the detailed process is not repeated.
Step 33: Choose from the threshold set an alternate transmission threshold for which the corresponding transmission duration has not yet been obtained.
In this application, the gradients of the weights of each layer of neurons are transmitted according to the transmission threshold. To determine the final transmission threshold, the transmission duration corresponding to each alternate transmission threshold in the threshold set is determined. The transmission duration corresponding to an alternate transmission threshold is the duration, when one iteration is carried out using that alternate transmission threshold, from the start of calculating the gradients of the weights of the n-th layer of neurons to the completion of the transmission of the gradients of the weights of all n layers of neurons. In the a-th iteration, an alternate transmission threshold for which the corresponding transmission duration has not yet been obtained can be chosen from the threshold set.
Step 34: Calculate the gradients of the weights of each layer of neurons according to the loss function, and transmit the gradients of the weights of each layer of neurons according to the alternate transmission threshold.
Step 34 here is similar to step 23 above, and the detailed process is not repeated.
Step 35: Obtain the transmission duration corresponding to the alternate transmission threshold.
The transmission duration characterizes the time the training device requires to transmit the gradients of the weights of each layer of neurons according to the alternate transmission threshold. Specifically, the transmission duration of an alternate transmission threshold is the duration used by the training device from the start of calculating the gradients of the weights of the n-th layer of neurons to the completion of the transmission of the gradients of all layers of neurons.
Step 36: Determine the transmission duration corresponding to each alternate transmission threshold obtained in the threshold set.
Step 37: According to the transmission duration corresponding to each alternate transmission threshold, choose one threshold, and use the chosen threshold as the transmission threshold for the subsequent iterations; the process of iterating according to the transmission threshold is described above.
Specifically, the alternate transmission threshold with the shortest transmission duration can be chosen as the transmission threshold for subsequent training.
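Steps 31 to 37 can be sketched as a timing loop over the candidates, assuming one alternate transmission threshold is timed per iteration; run_iteration is a hypothetical helper that carries out steps 31 to 35 with the given threshold:

    import time

    def pick_threshold(device, data_set, candidates):
        durations = {}
        for threshold in candidates:                         # step 33: one untimed candidate
            start = time.monotonic()                         # start of n-th layer gradient calc
            run_iteration(device, data_set, threshold)       # steps 31-35 with this candidate
            durations[threshold] = time.monotonic() - start  # step 35: transmission duration
        # step 37: the candidate with the shortest duration becomes the threshold
        return min(durations, key=durations.get)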
It should be noted that the multiple alternate transmission thresholds in the threshold set can be alternate transmission thresholds that each training device determines separately according to the neural network model; or one or several training devices can determine the alternate transmission thresholds according to the neural network model and distribute the determined alternate transmission thresholds to the other training devices; or a management device can determine the alternate transmission thresholds according to the neural network model and distribute the determined alternate transmission thresholds to each training device.
The order of step 33 and step 31 is not limited in this application, nor is the order of step 33 and step 32.
The detailed process of determining each alternate transmission threshold in the threshold set is described in detail below, taking Figure 4 as an example:
Step 41: Traverse each layer of neurons in the neural network model to obtain the total number of weights of each layer of neurons, the total number of weights of the i-th layer of neurons being q_i, and calculate the total number s of weights in the neural network model, where the neural network model includes n layers of neurons and i is a positive integer not greater than n.
Step 42: Deduplicate the per-layer totals q_1 to q_n of the weights of each layer of neurons, and determine each total remaining after deduplication as an alternate transmission threshold in the threshold set.
Step 43: Add k alternate transmission thresholds to the threshold set.
In order to find a transmission threshold that reflects the performance of distributed training, alternate transmission thresholds can also be added to the threshold set; the number k of added alternate transmission thresholds can be preset, for example k = 5, 10, or 15.
The order of step 42 and step 43 is not limited.
In one implementable mode, the number k of added alternate transmission thresholds can be determined from the total number s of weights in the neural network model, the maximum q_max among the per-layer totals of weights of each layer of neurons of the neural network model, and a constant x, where x can be 8, 10, 15, and so on. Determining the value of k in this way is intended to ensure that at least x alternate transmission thresholds are added to the threshold set determined in step 42.
The k alternate transmission thresholds added to the threshold set can be generated randomly within a set value range. Generally, to suit the needs of the scheme, each alternate transmission threshold in the threshold set satisfies a certain value range. To avoid transmitting the gradients of the weights of a layer of neurons as soon as they are calculated, which would bring the problem of an excessive number of connection establishments and communications, the value of an alternate transmission threshold can be not less than q_min, the minimum among the per-layer totals of weights of each layer of neurons of the neural network model. To avoid transmitting the calculated gradients of the weights of all layers of neurons only after they have all been calculated, which would bring the problem of a long training time, the value range of the thresholds can be bounded above by the total number s of weights in the neural network model. Therefore, the k added thresholds can be k alternate transmission thresholds randomly selected within the value range [q_min, s] and added to the threshold set.
To make the alternate transmission thresholds in the threshold set more uniform, and to fill the search gap between q_max and the total number s, in one implementable mode the k added alternate transmission thresholds can be determined according to the following formula:

    p_j = q_max + (j + 1) * (s - q_max) / k,    j in [0, k),

where p_j is any one of the alternate transmission thresholds added to the threshold set.
The process of determining the k alternate transmission thresholds according to the above formula can be as follows: divide the difference between the total number s of weights of the neural network model and the maximum q_max among the per-layer totals q_i of weights of each layer of neurons by k, and determine the quotient as the target step value; take the sum of the maximum q_max and the target step value as the base; use the base as the first added alternate transmission threshold, and obtain each subsequent new alternate transmission threshold by adding the target step value to the most recently determined one, thereby adding the k alternate transmission thresholds to the threshold set.
Step 44: For each alternate transmission threshold in the threshold set, determine whether the alternate transmission threshold is less than a preset value; if so, filter that alternate transmission threshold out of the threshold set.
In order to use transmission resources reasonably and narrow the search space, smaller alternate transmission thresholds can also be rejected, i.e., the alternate transmission thresholds that are less than a preset value are filtered out of the threshold set; the preset value can be s/1000.
Step 44 above can be carried out before or after step 43 and/or step 42, or simultaneously with step 42 and/or step 43. In general, the filtering judgment on alternate transmission thresholds can be carried out once after each alternate transmission threshold is determined, or after some or all of the alternate transmission thresholds have been determined; the order of step 44 relative to the other steps that determine the alternate transmission thresholds is not specifically limited in this application.
Steps 41 to 44 above determine each alternate transmission threshold in the threshold set.
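Steps 41 to 44 can be sketched together, using the stepped formula for p_j above and the example preset value s/1000 from step 44; all names are illustrative:

    def build_threshold_set(layer_weight_counts, k):
        q = layer_weight_counts                # step 41: q[i] per layer
        s = sum(q)                             #          total number s of weights
        candidates = set(q)                    # step 42: deduplicated per-layer totals
        q_max = max(q)
        step = (s - q_max) / k                 # step 43: target step value
        for j in range(k):                     #          p_j = q_max + (j+1)*(s - q_max)/k
            candidates.add(q_max + (j + 1) * step)
        preset = s / 1000                      # step 44: filter out the small thresholds
        return sorted(c for c in candidates if c >= preset)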
Based on the same inventive concept as the gradient transmission method above, an embodiment of this application also provides a distributed training system. The distributed training system includes multiple neural network models and multiple transmission modules; the neural network model is specifically neural network model 211 or 221 in Figure 1B, and the transmission module is specifically transmission module 212 or 222 in Figure 1B.
The neural network model is configured to, in one iteration, obtain the gradients of the weights of the i-th layer of neurons of the neural network model according to the input training data, where i is a positive integer not greater than n.
The transmission module is configured to send the gradients of the weights of the i-th layer of neurons of the neural network model to the gradient buffer of the neural network model; to judge whether the number of gradients stored in the gradient buffer of the neural network model exceeds the determined transmission threshold; and, according to the judgment result, to send the gradients stored in the gradient buffer of the neural network model to the gradient collection module.
The neural network model is further configured to obtain the mean gradient of the weights of the i-th layer of neurons of the neural network model, computed from the gradients sent by the multiple neural network models and stored in the gradient collection module, and to update the weights of the i-th layer of neurons of the neural network model according to this mean gradient, so as to execute the next iteration of the neural network model.
Exemplarily, the neural network model is further configured to obtain, over m iterations, the transmission duration corresponding to each of the m alternate transmission thresholds in the transmission threshold set; to determine the transmission threshold according to the m transmission durations; and to trigger the transmission module to execute the step of judging whether the number of gradients stored in the gradient buffer of the neural network model exceeds the determined transmission threshold.
Exemplarily, when determining the transmission threshold according to the m transmission durations, the neural network model is specifically configured to select the alternate transmission threshold corresponding to the shortest of the m transmission durations as the transmission threshold.
Exemplarily, each of the m alternate transmission thresholds is not less than the minimum among n totals, where each of the n totals is the total number of weights of one layer of neurons among the n layers of neurons of the neural network model.
Based on the same inventive concept as the gradient transmission method above, as shown in Figure 5, an embodiment of this application also provides a distributed training system 500. The distributed training system includes multiple training devices; each neural network model includes n layers of neurons, each layer of neurons corresponds to at least one weight, and n is a positive integer. Illustratively, only three training devices of distributed training system 500 are drawn in Figure 5. As shown in Figure 5, the distributed training system includes training device 50, training device 51, and training device 52, and each training device includes a processor and a memory: training device 50 includes processor 501 and memory 502, training device 51 includes processor 511 and memory 512, and training device 52 includes processor 521 and memory 522. The memory in a training device of distributed training system 500 is used to store computer instructions, and the processor in the training device executes the computer instructions in the memory to realize the devices and modules in the centralized distributed training system 200 above or in the decentralized distributed training system 200 above. Specifically, processor 501 and processor 511 are used to realize neural network model 211 and transmission module 212 in distributed training system 200 of Figure 1B, and the gradient buffer in neural network model 211 is realized by memory 502 and memory 512; if gradient collection module 230 in distributed training system 200 of Figure 1B is realized by processor 521, then distributed training system 500 is centralized. In the decentralized distributed training system 500, processor 501, processor 511, and processor 521 are used to realize neural network model 211, transmission module 212, and gradient collection module 2301 as in distributed training system 200 of Figure 1C, and the gradient buffer in neural network model 211 is realized by memory 502, memory 512, and memory 522.
A computing device in distributed training system 500 can also include a communication interface. For example, computing device 50 includes communication interface 503, and computing device 51 includes communication interface 513. A computing device realizes communication through the communication interface on it.
The processor can be a central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), or any combination of the three.
The processor can further include a hardware chip or another general-purpose processor. The hardware chip can be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc., or any combination thereof. The general-purpose processor can be a microprocessor, or the processor can be any conventional processor, etc.
It should also be understood that the memory referred to in the embodiments of this application can be volatile memory or non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), used as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory described herein is intended to include, but not be limited to, these and any other suitable types of memory.
"and/or" in the application describes the incidence relation of affiliated partner, indicates may exist three kinds of relationships, for example, A And/or B, can indicate: individualism A exists simultaneously A and B, these three situations of individualism B.Before character "/" typicallys represent Affiliated partner is a kind of relationship of "or" afterwards.
It is multiple involved in the application, refer to two or more.
It should be understood by those skilled in the art that, embodiments herein can provide as method, system or computer program Product.Therefore, complete hardware embodiment, complete software embodiment or reality combining software and hardware aspects can be used in the application Apply the form of example.Moreover, the computer in one or more which includes computer usable program code can be used in the application The computer program implemented in usable storage medium (including but not limited to magnetic disk storage, CD-ROM, optical memory etc.) produces The form of product.
The application is referring to method, the process of equipment (system) and computer program product according to the embodiment of the present application Figure and/or block diagram describe.It should be understood that every one stream in flowchart and/or the block diagram can be realized by computer program instructions The combination of process and/or box in journey and/or box and flowchart and/or the block diagram.It can provide these computer programs Instruct the processor of general purpose computer, special purpose computer, Embedded Processor or other programmable data processing devices to produce A raw machine, so that being generated by the instruction that computer or the processor of other programmable data processing devices execute for real The device for the function of being specified in present one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions, which may also be stored in, is able to guide computer or other programmable data processing devices with spy Determine in the computer-readable memory that mode works, so that it includes referring to that instruction stored in the computer readable memory, which generates, Enable the manufacture of device, the command device realize in one box of one or more flows of the flowchart and/or block diagram or The function of being specified in multiple boxes.
These computer program instructions also can be loaded onto a computer or other programmable data processing device, so that counting Series of operation steps are executed on calculation machine or other programmable devices to generate computer implemented processing, thus in computer or The instruction executed on other programmable devices is provided for realizing in one or more flows of the flowchart and/or block diagram one The step of function of being specified in a box or multiple boxes.
Although the preferred embodiment of the application has been described, it is created once a person skilled in the art knows basic Property concept, then additional changes and modifications may be made to these embodiments.So it includes excellent that the following claims are intended to be interpreted as It selects embodiment and falls into all change and modification of the application range.
Obviously, those skilled in the art can carry out various modification and variations without departing from this Shen to the embodiment of the present application Please embodiment spirit and scope.In this way, if these modifications and variations of the embodiment of the present application belong to the claim of this application And its within the scope of equivalent technologies, then the application is also intended to including these modification and variations.

Claims (10)

1. A gradient transmission method, wherein the method is applied to a distributed training system for a neural network model, the distributed training system includes multiple neural network models, each neural network model includes n layers of neurons, and each layer of neurons corresponds to at least one weight, where n is a positive integer, the method comprising:
obtaining, by each neural network model in one iteration according to input training data, a gradient of the weight corresponding to the i-th layer of neurons of the neural network model, where i is a positive integer not greater than n;
sending the gradient of the weight corresponding to the i-th layer of neurons of the neural network model to a gradient buffer of the neural network model;
determining whether the number of gradients stored in the gradient buffer of the neural network model exceeds a determined transmission threshold;
sending, according to the determination result, the gradients stored in the gradient buffer of the neural network model to a gradient collection module;
obtaining a gradient mean of the weight corresponding to the i-th layer of neurons of the neural network model, where the gradient mean is obtained from the gradients that are sent by the multiple neural network models and stored in the gradient collection module; and
updating the weight corresponding to the i-th layer of neurons of the neural network model according to the gradient mean of the weight corresponding to the i-th layer of neurons of the neural network model, so as to execute the next iteration of the neural network model.
2. The method according to claim 1, wherein before the determining whether the number of gradients stored in the gradient buffer of the neural network model exceeds a determined transmission threshold, the method further comprises:
obtaining, through m iterations, a transmission duration corresponding to each of m candidate transmission thresholds in a transmission threshold set; and
determining the transmission threshold according to the m transmission durations.
3. The method according to claim 2, wherein the determining the transmission threshold according to the m transmission durations comprises:
selecting, as the transmission threshold, the candidate transmission threshold corresponding to the shortest of the m transmission durations.
4. The method according to any one of claims 2-3, wherein each candidate transmission threshold is not less than the number of weights corresponding to any layer of neurons of the neural network model.
5. A distributed training system, wherein the distributed training system includes multiple neural network models and a transmission module, each neural network model includes n layers of neurons, and each layer of neurons corresponds to at least one weight, where n is a positive integer;
the neural network model is configured to: in one iteration, obtain, according to input training data, a gradient of the weight corresponding to the i-th layer of neurons of the neural network model, where i is a positive integer not greater than n; and send the gradient of the weight corresponding to the i-th layer of neurons of the neural network model to a gradient buffer of the neural network model;
the transmission module is configured to: determine whether the number of gradients stored in the gradient buffer of the neural network model exceeds a determined transmission threshold; and send, according to the determination result, the gradients stored in the gradient buffer of the neural network model to a gradient collection module; and
the neural network model is further configured to: obtain a gradient mean of the weight corresponding to the i-th layer of neurons of the neural network model, where the gradient mean is obtained from the gradients that are sent by the multiple neural network models and stored in the gradient collection module; and update the weight corresponding to the i-th layer of neurons of the neural network model according to the gradient mean of the weight corresponding to the i-th layer of neurons of the neural network model, so as to execute the next iteration of the neural network model.
6. The distributed training system according to claim 5, wherein the transmission module is further configured to: obtain, through m iterations, a transmission duration corresponding to each of m candidate transmission thresholds in a transmission threshold set; and determine the transmission threshold according to the m transmission durations.
7. The distributed training system according to claim 6, wherein, when determining the transmission threshold according to the m transmission durations, the transmission module is specifically configured to:
select, as the transmission threshold, the candidate transmission threshold corresponding to the shortest of the m transmission durations.
8. The distributed training system according to any one of claims 6-7, wherein each candidate transmission threshold is not less than the number of weights corresponding to any layer of neurons of the neural network model.
9. A distributed training system, wherein the distributed training system includes at least one training device, and the training device includes a processor and a memory;
the memory is configured to store computer instructions; and
the processor is configured to execute the computer instructions in the memory, so as to implement the method according to any one of claims 1-4.
10. A non-volatile computer-readable storage medium, wherein the storage medium stores computer instructions, and the computer instructions are executed by a processor to implement the method according to any one of claims 1-4.
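To make the claimed procedure concrete, what follows is a minimal, self-contained Python sketch of the method of claim 1, not the patented implementation: the GradientCollector class stands in for the gradient collection module, the gradients are simulated with random values (the claims concern transmission, not the backward-pass arithmetic), and reading the transmission threshold as a count of buffered gradient values is an assumption consistent with claim 4.

    import numpy as np

    rng = np.random.default_rng(0)

    class GradientCollector:
        # Stands in for the gradient collection module: it accumulates the gradients
        # sent by all model replicas and returns the per-layer gradient mean.
        def __init__(self, n_layers):
            self.sums = [None] * n_layers
            self.counts = [0] * n_layers

        def push(self, buffered):
            # buffered: list of (layer_index, gradient) pairs from one gradient buffer
            for i, g in buffered:
                self.sums[i] = g if self.sums[i] is None else self.sums[i] + g
                self.counts[i] += 1

        def mean(self, i):
            return self.sums[i] / self.counts[i]

    def run_iteration(replicas, threshold, lr=0.01):
        # One iteration of the claim-1 method; each replica is a list of per-layer
        # weight arrays standing in for one neural network model.
        n_layers = len(replicas[0])
        collector = GradientCollector(n_layers)
        for weights in replicas:
            buffer, buffered_values = [], 0                    # the model's gradient buffer
            for i in reversed(range(n_layers)):                # backward pass, layer n down to layer 1
                grad = rng.standard_normal(weights[i].shape)   # simulated gradient of layer i's weights
                buffer.append((i, grad))                       # send the gradient to the gradient buffer
                buffered_values += grad.size
                if buffered_values > threshold:                # buffered quantity exceeds the transmission threshold
                    collector.push(buffer)                     # send the buffered gradients onward
                    buffer, buffered_values = [], 0
            if buffer:
                collector.push(buffer)                         # flush whatever remains at the end of the pass
        for weights in replicas:                               # update every replica with the gradient mean
            for i in range(n_layers):
                weights[i] -= lr * collector.mean(i)

    replicas = [[rng.standard_normal((4, 4)) for _ in range(3)] for _ in range(2)]
    run_iteration(replicas, threshold=20)

Claims 2-3 pick the transmission threshold empirically. Below is a sketch under the same assumptions, reusing run_iteration from above; timing each whole iteration with a wall clock is an illustrative choice, since the claims only require that a transmission duration be measured for each of the m candidates.

    import time

    def tune_threshold(candidates, run_with_threshold):
        # Claims 2-3: run m iterations, one per candidate transmission threshold,
        # record each duration, and keep the candidate with the shortest one.
        durations = []
        for t in candidates:
            start = time.perf_counter()
            run_with_threshold(t)
            durations.append(time.perf_counter() - start)
        return candidates[durations.index(min(durations))]

    # Claim 4 suggests each candidate be no smaller than the weight count of the
    # largest layer (16 values in the toy replicas above).
    best = tune_threshold([16, 32, 48], lambda t: run_iteration(replicas, t))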
CN201910101338.9A 2019-01-31 2019-01-31 Gradient transmission method and distributed training system Active CN109919313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910101338.9A CN109919313B (en) 2019-01-31 2019-01-31 Gradient transmission method and distributed training system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910101338.9A CN109919313B (en) 2019-01-31 2019-01-31 Gradient transmission method and distributed training system

Publications (2)

Publication Number Publication Date
CN109919313A true CN109919313A (en) 2019-06-21
CN109919313B (en) 2021-06-08

Family

ID=66961321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910101338.9A Active CN109919313B (en) 2019-01-31 2019-01-31 Gradient transmission method and distributed training system

Country Status (1)

Country Link
CN (1) CN109919313B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342887A1 (en) * 2015-05-21 2016-11-24 minds.ai inc. Scalable neural network system
US20160350651A1 (en) * 2015-05-29 2016-12-01 North Carolina State University Automatically constructing training sets for electronic sentiment analysis
CN106297774A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 The distributed parallel training method of a kind of neutral net acoustic model and system
CN106991478A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 Apparatus and method for performing artificial neural network reverse train
CN105825269A (en) * 2016-03-15 2016-08-03 中国科学院计算技术研究所 Parallel autoencoder based feature learning method and system
US20170286830A1 (en) * 2016-04-04 2017-10-05 Technion Research & Development Foundation Limited Quantized neural network training and inference
CN107301454A (en) * 2016-04-15 2017-10-27 北京中科寒武纪科技有限公司 The artificial neural network reverse train apparatus and method for supporting discrete data to represent
CN107316078A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 Apparatus and method for performing artificial neural network self study computing
CN106372402A (en) * 2016-08-30 2017-02-01 中国石油大学(华东) Parallelization method of convolutional neural networks in fuzzy region under big-data environment
JP2018036779A (en) * 2016-08-30 2018-03-08 株式会社東芝 Electronic device, method, and information processing system
CN108122032A (en) * 2016-11-29 2018-06-05 华为技术有限公司 A kind of neural network model training method, device, chip and system
WO2018155232A1 (en) * 2017-02-23 2018-08-30 ソニー株式会社 Information processing apparatus, information processing method, and program
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN108304918A (en) * 2018-01-18 2018-07-20 中兴飞流信息科技有限公司 A kind of the parameter exchange method and system of the deep learning of data parallel
CN109102075A (en) * 2018-07-26 2018-12-28 联想(北京)有限公司 Gradient updating method and relevant device during a kind of distribution is trained

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU Hongyi, "Research on parallel face recognition method based on deep learning open source framework", China Master's Theses Full-text Database, Information Science and Technology Series *
ZHANG Han, "Research on GPU-based deep neural network model parallelism and optimization methods", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619388A (en) * 2019-09-20 2019-12-27 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
CN110619388B (en) * 2019-09-20 2024-04-02 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
CN113469355A (en) * 2020-03-30 2021-10-01 亚马逊技术股份有限公司 Multi-model training pipeline in distributed system
CN113469355B (en) * 2020-03-30 2024-03-15 亚马逊技术股份有限公司 Multi-model training pipeline in distributed system
WO2021232907A1 (en) * 2020-05-22 2021-11-25 华为技术有限公司 Neural network model training apparatus and method, and related device
CN111723933A (en) * 2020-06-03 2020-09-29 上海商汤智能科技有限公司 Training method of neural network model and related product
WO2021244354A1 (en) * 2020-06-03 2021-12-09 上海商汤智能科技有限公司 Training method for neural network model, and related product
CN111723933B (en) * 2020-06-03 2024-04-16 上海商汤智能科技有限公司 Training method of neural network model and related products
CN111756602A (en) * 2020-06-29 2020-10-09 上海商汤智能科技有限公司 Communication timeout detection method in neural network model training and related product
CN113282933A (en) * 2020-07-17 2021-08-20 中兴通讯股份有限公司 Federal learning method, device and system, electronic equipment and storage medium
CN113282933B (en) * 2020-07-17 2022-03-01 中兴通讯股份有限公司 Federal learning method, device and system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109919313B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN109919313A (en) A kind of method and distribution training system of gradient transmission
US20230252327A1 (en) Neural architecture search for convolutional neural networks
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
Chuang et al. The annealing robust backpropagation (ARBP) learning algorithm
CN108021983A (en) Neural framework search
CN109690576A (en) The training machine learning model in multiple machine learning tasks
CN109711534A (en) Dimensionality reduction model training method, device and electronic equipment
CN108280207A (en) A method of the perfect Hash of construction
CN111461284A (en) Data discretization method, device, equipment and medium
CN108197307A (en) The selection method and system of a kind of text feature
KR20080052940A (en) Method for controlling game character
JP7073171B2 (en) Learning equipment, learning methods and programs
US6813390B2 (en) Scalable expandable system and method for optimizing a random system of algorithms for image quality
CN109697511B (en) Data reasoning method and device and computer equipment
CN110610231A (en) Information processing method, electronic equipment and storage medium
CN113705724B (en) Batch learning method of deep neural network based on self-adaptive L-BFGS algorithm
CN108965016A (en) A kind of mapping method and device of virtual network
CN113487870A (en) Method for generating anti-disturbance to intelligent single intersection based on CW (continuous wave) attack
CN114417999A (en) Pedestrian re-identification method based on federal split learning
JP6926045B2 (en) Neural networks, learning devices, learning methods, and programs
CN113554169A (en) Model optimization method and device, electronic equipment and readable storage medium
CN115210717A (en) Hardware optimized neural architecture search
CN113297310A (en) Method for selecting block chain fragmentation verifier in Internet of things
WO2020087254A1 (en) Optimization method for convolutional neural network, and related product
WO2002080563A2 (en) Scalable expandable system and method for optimizing a random system of algorithms for image quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220214

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.