CN109508785A - An asynchronous parallel optimization method for neural network training - Google Patents

An asynchronous parallel optimization method for neural network training

Info

Publication number
CN109508785A
CN109508785A (application CN201811265027.8A)
Authority
CN
China
Prior art keywords
computer
neural network
computers
data
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811265027.8A
Other languages
Chinese (zh)
Inventor
游科友
张家绮
宋士吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201811265027.8A
Publication of CN109508785A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The present invention proposes an asynchronous parallel optimization method for neural network training, belonging to the field of deep learning. The method first distributes the data pairs in the data set used to train the neural network across n computers. A communication topology of the n computers is then determined, giving each computer the set of computers it sends data to and the set it receives data from. Each computer initializes its own neural network and the related parameters and then trains its network iteratively; after each iteration it sends its updated weighted parameters, weighted consistency variable and total step-size variable to all computers it communicates with. After all computers have finished their iterative training, the final neural network parameters on any one of the computers are the trained parameters, and the optimization of the neural network is complete. The invention is simple to implement, trains networks quickly, and scales well to large data sets and computer clusters.

Description

An asynchronous parallel optimization method for neural network training
Technical field
The invention belongs to the field of deep learning, and in particular relates to an asynchronous parallel optimization method for neural network training.
Background technique
In recent years, artificial intelligence and the related fields of machine learning and deep learning have received widespread attention. Their applications in face recognition, face detection, natural language processing, speech recognition and other fields have also demonstrated powerful capabilities.
A very common and important technique in artificial intelligence and deep learning is the artificial neural network (neural network for short). A neural network consists of several neurons; a neuron can be viewed as a unit that receives several input signals and outputs one signal according to some rule, and this rule corresponds to a number of adjustable parameters. Connecting neurons in series and in parallel according to certain rules yields a neural network; common structures include the fully connected neural network and the convolutional neural network (CNN). Fig. 1 is a schematic diagram of a simple three-layer neural network in which the input layer and the hidden layer each have three neurons, the output layer has one neuron, and adjacent layers are fully connected. The network receives three input signals and outputs one signal. In general, a neural network can be viewed externally as a function that receives several input signals and outputs several signals according to some rule; this rule is controlled by the parameters of all the neurons inside the network, which are also called the parameters of the neural network. Two networks with the same structure but different parameters generally realize different functions, i.e., they may produce different outputs for the same input signal. Training a neural network refers to the process of repeatedly adjusting its parameters in some way so that the output of the network meets expectations.
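As an illustration only (this example is not part of the patent), the three-layer network of Fig. 1 can be written as a parameterized function f(x, w); the tanh activation, the parameter layout and all names below are assumptions made for this sketch.

```python
import numpy as np

# Minimal sketch of the Fig. 1 network: 3 inputs -> 3 hidden neurons -> 1 output.
# The activation function and parameter layout are assumptions for illustration.
def init_params(rng, n_in=3, n_hidden=3, n_out=1):
    return {
        "W1": 0.1 * rng.standard_normal((n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "W2": 0.1 * rng.standard_normal((n_out, n_hidden)),
        "b2": np.zeros(n_out),
    }

def forward(x, w):
    """The network viewed as a function f(x, w): three inputs in, one output out."""
    h = np.tanh(w["W1"] @ x + w["b1"])   # fully connected hidden layer
    return w["W2"] @ h + w["b2"]         # fully connected output layer

rng = np.random.default_rng(0)
w = init_params(rng)
print(forward(np.array([0.2, -1.0, 0.5]), w))   # one output signal for three inputs
```

Changing the parameters w changes the function the network realizes, which is exactly what training adjusts.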
The most commonly used training method for neural networks is the stochastic gradient method based on a data set. A data set is a given collection of input-output data pairs and is tied to the problem to be solved. In an image classification problem, for example, each item in the data set typically contains a picture (the input) and the class of that picture (the output). Data sets are usually collected from real life and require some manual labeling. The goal of data-set-based training of a neural network is to adjust the parameters of the network so that, for a given input (such as a picture), the output of the network is as close as possible to the corresponding output in the data set (such as the class of that picture). The stochastic gradient method is a widely used training method. At each step it randomly draws a certain number of data pairs from the data set, uses them to compute the value of an objective function defined on the neural network and the gradient of that objective with respect to the network parameters, and then adjusts the parameters according to the gradient. This step is repeated during training until the output of the neural network is satisfactory. The word "stochastic" therefore refers to drawing a random subset of the data at each step, and "gradient" refers to updating the network parameters with a gradient method.
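The following minimal sketch, again not taken from the patent, illustrates the "stochastic" and "gradient" parts of the method; a linear model with a squared loss is used so the gradient has a closed form, whereas a neural network would obtain it by backpropagation.

```python
import numpy as np

# Illustrative minibatch stochastic gradient descent on a toy linear model.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))                       # inputs, N = 1000
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(1000)

w = np.zeros(3)                                          # parameters to be trained
batch_size, step_size = 32, 0.05
for step in range(500):
    idx = rng.choice(len(X), batch_size, replace=False)  # "stochastic": random subset of the data
    xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * xb.T @ (xb @ w - yb)       # gradient of the minibatch squared loss
    w -= step_size * grad                                # "gradient": gradient-descent update
print(w)   # close to [1.0, -2.0, 0.5]
```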
Because the data sets used in real problems are generally very large, and the corresponding neural networks are also large, training a neural network with the stochastic gradient method on a single computer or device requires a large amount of time to obtain a satisfactory result. One possible way to accelerate training is to use multiple computers or devices to train the network simultaneously, which we call a distributed method. The main problem a distributed method must solve is the exchange and synchronization of information between the different computers, and two approaches are mainly used at present. The first chooses one of the computers as the host and uses the remaining computers as slaves: at each step of training every slave sends its computed results, such as gradients, to the host, which integrates the data obtained from the different slaves and then updates the neural network. This approach requires every slave to be able to communicate with the host, so the training speed is limited by the processing speed and communication bandwidth of the host.
In the second approach, every computer keeps a copy of the neural network. At each step of training, every computer first updates its own copy using the data assigned to it, then sends the copy to several neighboring computers, receives their copies in return, and finally fuses its own copy with the received copies according to some strategy. There is no distinction between host and slave in this approach, and the communication load of all computers during training is essentially the same, so the training speed is not easily limited by any single computer.
In the second approach, although the overall training speed is not limited by the bandwidth of a single computer, all computers must update with the same clock and frequency, i.e., the training processes of the different computers are required to be synchronized. This prevents each computer from updating according to its own clock: it must wait until the information from the other computers has been received before updating. When the computing capabilities of the computers differ, the fast-updating, more capable computers are forced to wait for the slower, less capable ones, wasting resources.
Summary of the invention
The purpose of the invention is to overcome the shortcomings of existing neural network training methods and improve training efficiency by proposing an asynchronous parallel optimization method for neural network training. The invention is simple to implement, trains networks quickly, and scales well to large data sets and computer clusters.
The present invention proposes an asynchronous parallel optimization method for neural network training, characterized by comprising the following steps:
(1) Distribute the data pairs in the data set D used to train the neural network across n ≥ 1 computers, where the data set D contains N input-output data pairs; the sub-data-set assigned to the i-th computer is denoted D_i, and the number of data pairs in D_i is N_i;
(2) Determine the communication topology of the n computers, obtaining for every computer the set of computers it receives data from and the set of computers it sends data to; the set of computers that receive the data sent by the i-th computer is its out-neighbor set, which includes the i-th computer itself and whose size is the out-degree of the i-th computer; the set of computers that send data to the i-th computer is its in-neighbor set;
(3) Every computer initializes its own neural network and the related parameters, specifically as follows:
(3-1) Every computer initializes a neural network according to the predefined network structure; the initial parameters of the network on the i-th computer are denoted w_i(0), the iteration counter k_i of every computer is set to 0, and the consistency variable and total step-size variable of every computer are initialized to z_i(0) = 1 and l_i(0) = 1;
(3-2) Every computer establishes three buffers, W_rec,i, Z_rec,i and L_rec,i, initialized to be empty; the three buffers respectively store the weighted parameters, weighted consistency variables and total step-size variables this computer receives from other computers;
(3-3) Every computer defines its own trigger event;
(3-4) Every computer i computes its initial weighted parameters and initial weighted consistency variable, and then sends them, together with l_i(0), to all of its out-neighbors;
(4) Every computer trains its own neural network iteratively until the optimization of the neural network is complete, specifically as follows:
(4-1) Before its trigger event arrives, every computer i receives the weighted parameters, weighted consistency variables and total step-size variables of its in-neighbors and stores them in the corresponding buffers W_rec,i, Z_rec,i and L_rec,i;
(4-2) When the trigger event of computer i arrives, this computer performs the following operations:
(4-2-1) Computer i updates each of its variables according to the update formulas of the method, where ρ(t) is a step-size sequence, t is the index into the sequence, and the stochastic gradient of the loss function L(D, w) with respect to the neural network parameters w_i is used in the update;
(4-2-2) Computer i updates its weighted parameters and weighted consistency variable, and then sends them, together with l_i(k_i + 1), to all of its out-neighbors;
(4-2-3) Computer i increments the iteration counter k_i by 1 and returns to step (4-2-1); when computer i meets its iteration termination condition, this computer ends its iterative training;
(4-3) After all computers have ended their iterative training, the final neural network parameters on any one of the computers are the trained parameters of the neural network, and the optimization of the neural network is complete.
The features and beneficial effects of the present invention are:
(1) In the present invention the neural network is trained by multiple computers simultaneously, which, for the same network structure, increases training speed. In addition, the distributed nature of the method makes it possible to use a larger network, i.e., the design of the neural network is not limited by the computing capability of a single computer. Compared with the widely used single-computer training methods, this method can use a cluster of computers to significantly increase training speed.
(2) In the present invention all computers are equal: the amount of communication per computer at each step is roughly the same, and every computer only needs to communicate with a small fraction of the computers; no host or similar central computer is needed to communicate with every other computer. Compared with the commonly used scheme in which one computer is the host and the rest are slaves that all communicate with the host, this method prevents the communication volume of any single computer from becoming the bottleneck of the whole system, and it is better suited to cases where the number of computers is especially large.
(3) In the present invention every computer updates its own copy of the neural network according to its own clock or trigger events, without keeping synchronized with the other computers, i.e., the algorithm is asynchronous. In an asynchronous algorithm every computer updates the network at its own clock and frequency, and communication delays between computers are allowed, so the computing capability of every computer can be fully exploited, no computer sits idle waiting for synchronization, the training process is accelerated, and the method is easier to scale to a large number of computers.
(4) The present invention can in the future be applied to image classification, object detection, reinforcement learning and other applications that use neural network techniques, and therefore has high application value.
Detailed description of the invention
Fig. 1 is a schematic diagram of a three-layer neural network.
Fig. 2 is a schematic diagram of how the loss function error changes over time for the method of the present invention and for the distributed stochastic gradient method.
Specific embodiment
The present invention proposes an asynchronous parallel optimization method for neural network training, which is further described below with reference to the accompanying drawings and a specific embodiment.
The present invention proposes an asynchronous parallel optimization method for neural network training; as an example, consider training one neural network with n computers. Let (x_i, y_i) denote one input-output data pair, where x_i is the input feature vector and y_i is the true output corresponding to x_i. Let D = {(x_i, y_i), i = 1, ..., N} denote a data set of N data pairs. The neural network can be viewed as a function whose input is x_i and whose output is an estimate f(x_i, w) of the true output y_i, where w is the parameter of the neural network to be adjusted; networks of different structures correspond to different functions f(x_i, w). The purpose of training the neural network is to make the gap between f(x_i, w) and y_i as small as possible. To this end, the specific objective of training is to minimize a loss function, for example a squared-error loss between the network outputs and the true outputs over the data set; other common choices are the cross-entropy loss and either of these two losses with an added regularization term, where the regularization weight λ is a fixed positive number whose size is chosen according to the practical problem.
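The loss functions above appear as figures in the patent; the sketch below shows the standard forms the surrounding text appears to describe, namely a mean-squared-error loss, a cross-entropy loss, and either loss with an added λ-weighted regularization term. The normalization choices are assumptions of this example.

```python
import numpy as np

# Hedged sketch of the loss functions described in the text (the patent gives
# them as figures): mean-squared error, cross-entropy, and either loss plus a
# lambda * ||w||^2 regularization term.
def mse_loss(y_hat, y):
    """Mean-squared-error loss over the data pairs."""
    return np.mean(np.sum((y_hat - y) ** 2, axis=-1))

def cross_entropy_loss(y_hat, y, eps=1e-12):
    """Cross-entropy loss; y are one-hot labels, y_hat are predicted probabilities."""
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=-1))

def l2_penalty(w, lam):
    """Regularization term: lambda times the squared norm of all parameters."""
    return lam * sum(np.sum(p ** 2) for p in w.values())

def regularized_loss(base_loss, y_hat, y, w, lam):
    """Either of the two losses above with the regularization term added."""
    return base_loss(y_hat, y) + l2_penalty(w, lam)
```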
The asynchronous parallel optimization method for neural network training proposed by the present invention comprises the following steps:
(1) data set for being used to train neural network is assigned to n platform computer;
According to the computing capability, storage capacity or other practical factors of the different computers, the data in the data set D, which contains N data pairs, is distributed across n ≥ 1 computers (general-purpose computers, typically with a CPU, memory and hard disk); in practical problems N is usually much larger than n. Any distribution scheme can be used; for example, the data can be distributed evenly across the n computers so that every computer receives N/n data pairs. The sub-data-set assigned to the i-th computer is denoted D_i and contains N_i data pairs. During training, every computer can only access the data assigned to it, i.e., the i-th computer can only access D_i.
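A minimal sketch of step (1), assuming an even split of the data (the patent allows any distribution scheme); the function and variable names are introduced only for this example.

```python
import numpy as np

# Minimal sketch of step (1): split the N data pairs of D evenly across n computers.
def partition_dataset(X, y, n, rng):
    """Return a list of n sub-data-sets [(X_1, y_1), ..., (X_n, y_n)]."""
    chunks = np.array_split(rng.permutation(len(X)), n)
    return [(X[idx], y[idx]) for idx in chunks]

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))           # N = 1000 input feature vectors
y = rng.standard_normal((1000, 1))           # corresponding true outputs
subsets = partition_dataset(X, y, n=4, rng=rng)
print([len(Xi) for Xi, _ in subsets])        # the sizes N_i of D_1, ..., D_4
```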
(2) Determine the communication topology of the n computers, obtaining for every computer the set of computers it can receive data from and the set it sends data to.
The communication pattern of the n computers is determined according to the actual situation, i.e., which computers each computer can send data to and which computers it can receive data from. Ideally, every computer could send data to every other computer, but in practice bandwidth is limited and full communication may lower the overall efficiency, so every computer may only send data to a few other computers. For convenience, the computers are first numbered 1, ..., n. For the i-th computer, the set of computers that can receive the data it sends is its out-neighbor set, which by convention includes the i-th computer itself; the number of computers in this set is its out-degree. The set of computers that send data to the i-th computer is its in-neighbor set.
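A minimal sketch of step (2), assuming a directed-ring topology; the patent leaves the topology to the implementer, and the names out_neighbors and in_neighbors are introduced only for this example.

```python
# Minimal sketch of step (2): build, for each computer i, the set of computers
# that receive its data (out-neighbors, including i itself) and the set that
# send data to it (in-neighbors). The directed ring is an illustrative choice.
def ring_topology(n):
    out_neighbors = {i: {i, (i + 1) % n} for i in range(n)}   # i sends to itself and to i+1
    in_neighbors = {i: set() for i in range(n)}
    for i, outs in out_neighbors.items():
        for j in outs:
            in_neighbors[j].add(i)
    return out_neighbors, in_neighbors

out_nb, in_nb = ring_topology(4)
print(out_nb)   # e.g. computer 0 sends to {0, 1}
print(in_nb)    # e.g. computer 0 receives from {0, 3}
```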
(3) Every computer initializes its own neural network and the related parameters, specifically as follows:
(3-1) Every computer initializes a neural network according to the predefined network structure; the parameters of each network can be chosen randomly at initialization. The initial parameters of the network on the i-th computer are denoted w_i(0); the subscript i here and below refers to the i-th computer. The network structure is the same on all computers, but the parameters may differ. In addition, every computer sets its iteration counter k_i to 0 and initializes its consistency variable and total step-size variable to z_i(0) = 1 and l_i(0) = 1. (3-2) Every computer establishes three buffers in its own memory, W_rec,i, Z_rec,i and L_rec,i, which respectively store the weighted parameters, weighted consistency variables and total step-size variables received from the other computers.
(3-3) Every computer also defines its own trigger event according to the actual situation (the trigger events of different computers may differ); the operations in step (4) are executed every time the event is triggered. Trigger events can be defined in different ways depending on the situation, for example triggering once every second, or triggering once each time data is received.
(3-4) Every computer i computes its initial weighted parameters and initial weighted consistency variable, and then sends them, together with l_i(0), to all of its out-neighbors; the 0 in parentheses denotes the initial value of the corresponding quantity.
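A hedged sketch of the per-computer state of step (3) and the initial send of step (3-4). The patent gives the initial weighting formulas as figures; dividing w_i(0) and z_i(0) by the out-degree, in the style of push-sum methods, is an assumption of this sketch, as are all names used.

```python
import numpy as np

# Hedged sketch of steps (3-1) to (3-4): per-computer state, empty buffers,
# and the initial weighted send. The division by the out-degree is an assumption.
class ComputerState:
    def __init__(self, i, w0, out_neighbors):
        self.i = i
        self.w = w0                       # w_i(0): initial network parameters
        self.z = 1.0                      # z_i(0) = 1: consistency variable
        self.l = 1                        # l_i(0) = 1: total step-size variable
        self.k = 0                        # iteration counter k_i
        self.out = out_neighbors          # out-neighbor set (includes i itself)
        self.W_rec, self.Z_rec, self.L_rec = [], [], []   # the three buffers, initially empty

    def message(self):
        d = len(self.out)                 # out-degree of computer i
        return self.w / d, self.z / d, self.l   # weighted parameters, weighted z, and l_i

def send(msg, receivers, mailboxes):
    """Deliver (weighted w, weighted z, l) to the mailbox of every out-neighbor."""
    for j in receivers:
        mailboxes[j].append(msg)

mailboxes = {j: [] for j in range(4)}
s0 = ComputerState(0, np.zeros(3), out_neighbors={0, 1})
send(s0.message(), s0.out, mailboxes)     # the initial send of step (3-4)
```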
(4) Every computer trains its own neural network iteratively until the optimization of the neural network is complete, specifically as follows:
(4-1) Every computer i waits for its predefined trigger event to arrive. During this period, every computer stores the weighted parameters, weighted consistency variables and total step-size variables it receives from its in-neighbors in the corresponding buffers W_rec,i, Z_rec,i and L_rec,i (the first transmission of every computer contains the initial values, so the values received first are also initial values). Before the first trigger event of a computer arrives, its three quantities, the weighted parameters, the weighted consistency variable and l_i(0), do not change; the computer updates them for the first time after the first trigger, and they change after every subsequent trigger. The sending frequency of each computer is not fixed and can be chosen freely according to the actual situation, although in practice it is recommended that all computers keep roughly the same sending frequency, which gives better results. Each computer only needs to keep one copy of its parameters, consistency variable and total step-size variable, but these quantities change continuously during the optimization.
For example, if computer i receives two transmissions from computer j before its next trigger event, containing the weighted parameters, weighted consistency variables and total step-size variables l_j(5) and l_j(6) of iterations 5 and 6 of computer j, then computer i stores the two weighted parameter values in W_rec,i, the two weighted consistency values in Z_rec,i, and l_j(5) and l_j(6) in L_rec,i. The trigger events of different computers are independent of each other: before a computer (say j) is triggered, all of its quantities (its weighted parameters, weighted consistency variable and l_j) remain unchanged, and it sends no data; it may only wait for data to arrive.
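Continuing the sketch above, step (4-1) amounts to appending everything that arrives between two trigger events to the three buffers; the message layout is again an assumption of the example.

```python
# Hedged sketch of step (4-1), continuing the ComputerState sketch above.
def receive_pending(state, mailbox):
    """Store every message that arrived since the last trigger into the three buffers."""
    while mailbox:
        w_weighted, z_weighted, l = mailbox.pop(0)
        state.W_rec.append(w_weighted)    # goes into W_rec,i
        state.Z_rec.append(z_weighted)    # goes into Z_rec,i
        state.L_rec.append(l)             # goes into L_rec,i
```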
(4-2) Once computer i is triggered, it performs the following operations:
(4-2-1) Computer i computes and updates each of its variables according to the update formulas of the method, where ρ(t) is a predefined step-size sequence, for example a constant sequence ρ(t) = ρ, and t is the index into the sequence; the stochastic gradient of the loss function L(D, w) with respect to the neural network parameters w_i is used in the update. The quantities c_i(k_i+1), z_i(k_i+1), m_i(k_i+1) and α_i(k_i+1) appearing in the formulas are intermediate variables introduced to simplify the formulas for updating the neural network parameters w_i(k_i+1), the consistency variable z_i(k_i+1) and the total step-size variable l_i(k_i+1); they have no particular physical meaning and do not need to be sent to the other computers.
When the loss function is the squared-error loss above, the i-th computer computes its stochastic gradient as follows: it first draws p data pairs at random from its sub-data-set D_i, denoted (x_i,1, y_i,1), (x_i,2, y_i,2), ..., (x_i,p, y_i,p), where p is a positive integer specified in advance, for example 16 or 32. It then uses the neural network to compute the outputs corresponding to these data pairs and finally computes the stochastic gradient from them; here ∇f_w(x_i,j, m_i(k_i+1)) denotes the gradient of the network output with respect to w at input x_i,j, evaluated at the parameter value m_i(k_i+1). When another loss function is used, the stochastic gradient is replaced accordingly by the stochastic gradient of that loss function on the same p data pairs, for example the stochastic gradient of the cross-entropy loss or of one of the regularized losses described above.
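A sketch of the minibatch stochastic gradient just described, for the squared-error loss and the small network of the earlier Fig. 1 sketch; the 2/p scaling and the backpropagation details are standard choices assumed here, since the patent's own gradient formula is given as a figure.

```python
import numpy as np

# Hedged sketch of the minibatch stochastic gradient for the squared-error loss.
def forward(x, w):
    """Variant of the earlier forward() that also returns the hidden activations."""
    h = np.tanh(w["W1"] @ x + w["b1"])
    return w["W2"] @ h + w["b2"], h

def stochastic_gradient(Xi, yi, w, p, rng):
    """Draw p pairs from the local sub-data-set D_i and average the per-sample gradients."""
    idx = rng.choice(len(Xi), p, replace=False)
    g = {k: np.zeros_like(v) for k, v in w.items()}
    for x, y in zip(Xi[idx], yi[idx]):
        y_hat, h = forward(x, w)
        e = 2.0 * (y_hat - y)                      # d(loss)/d(y_hat) for the squared error
        g["W2"] += np.outer(e, h)
        g["b2"] += e
        dh = (w["W2"].T @ e) * (1.0 - h ** 2)      # backpropagate through the tanh layer
        g["W1"] += np.outer(dh, x)
        g["b1"] += dh
    return {k: v / p for k, v in g.items()}        # average over the p samples
```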
(4-2-2) Computer i updates its weighted parameters and its weighted consistency variable, and then sends them, together with the updated total step-size variable l_i(k_i+1), to all of its out-neighbors.
(4-2-3) Computer i increments its iteration counter k_i by 1 and returns to step (4-2-1); when the termination condition defined in advance by computer i is met, this computer ends its iterative training. The termination condition can be, for example, that the training time exceeds a preset duration, or that l_i(k_i+1) exceeds a number specified in advance.
(4-3) After the iterative training on all computers has finished, the final neural network parameters w_i(k_i+1) on any one of the computers can serve as the trained parameters of the neural network, and the optimization of the neural network is complete.
In the present invention, a computer that reaches its predefined stopping condition first may end its training first. In practice the times at which different computers finish training do not differ by much, and they can approximately be regarded as finishing simultaneously. An illustrative sketch that combines the steps above into one triggered update is given below.
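The update formulas of step (4-2-1) appear as figures in the patent and are not reproduced in the text above. The sketch below therefore only follows the structure suggested by the surrounding description, namely parameters and consistency variables weighted by the out-degree, a consensus estimate m_i at which the stochastic gradient is evaluated, and a step size accumulated from the sequence ρ(t) through the total step-size variable, in the spirit of the AsySPA algorithm cited in the non-patent references. Every formula in it is an assumption made for illustration, not the patent's exact update, and the n computers are simulated in a single process.

```python
import numpy as np

def local_gradient(w, Xi, yi, p, rng):
    """Minibatch stochastic gradient of a least-squares loss on the local sub-data-set."""
    idx = rng.choice(len(Xi), p, replace=False)
    xb, yb = Xi[idx], yi[idx]
    return 2.0 / p * xb.T @ (xb @ w - yb)

def rho(t):
    return 0.1 / np.sqrt(t)   # assumed diminishing step-size sequence rho(t)

def run(n=4, dim=3, iters=200, p=8, seed=0):
    rng = np.random.default_rng(seed)
    # step (1): synthetic data, evenly partitioned across the n simulated computers
    w_true = rng.standard_normal(dim)
    X = rng.standard_normal((400, dim))
    y = X @ w_true
    data = [(X[idx], y[idx]) for idx in np.array_split(np.arange(400), n)]

    # step (2): directed ring; out-neighbors include the computer itself
    out_nb = [{i, (i + 1) % n} for i in range(n)]

    # step (3): per-computer state, empty buffers, and the initial weighted send
    w = [rng.standard_normal(dim) for _ in range(n)]
    z = [1.0] * n
    l = [1] * n
    boxes = [{"W": [], "Z": [], "L": []} for _ in range(n)]

    def send(i):
        d = len(out_nb[i])                       # out-degree of computer i
        for j in out_nb[i]:
            boxes[j]["W"].append(w[i] / d)       # weighted parameters
            boxes[j]["Z"].append(z[i] / d)       # weighted consistency variable
            boxes[j]["L"].append(l[i])           # total step-size variable

    for i in range(n):
        send(i)

    # step (4): at each tick a randomly chosen computer is "triggered"
    for _ in range(iters * n):
        i = int(rng.integers(n))
        if not boxes[i]["W"]:
            continue                             # nothing received yet, keep waiting
        c_i = np.sum(boxes[i]["W"], axis=0)      # aggregate the weighted parameters
        z[i] = float(np.sum(boxes[i]["Z"]))      # aggregate the weighted consistency variables
        m_i = c_i / z[i]                         # consensus estimate used for the gradient
        l_new = max(boxes[i]["L"]) + 1           # advance the total step-size variable
        alpha = sum(rho(t) for t in range(l[i], l_new))   # accumulated step size
        g = local_gradient(m_i, *data[i], p, rng)
        w[i] = c_i - alpha * g                   # gradient step on the aggregated parameters
        l[i] = l_new
        boxes[i] = {"W": [], "Z": [], "L": []}   # the buffers are consumed by the update
        send(i)                                  # send the new weighted quantities

    return [float(np.linalg.norm(w[i] / z[i] - w_true)) for i in range(n)]

print(run())   # per-computer estimation errors after training
```

In a real deployment each computer would run this triggered update in its own process and exchange the three quantities over the network instead of through in-memory mailboxes.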
Fig. 2 compares, for one embodiment of the invention, how the loss function error changes over time when using this method and when using the currently common distributed stochastic gradient method. The abscissa is the training time in seconds and the ordinate is the value of the loss function during training, on a logarithmic scale. The solid line shows the performance of the proposed method in this example and the dotted line shows the performance of the distributed stochastic gradient method. The figure shows that the convergence speed of the proposed algorithm is far faster than that of the distributed stochastic gradient method, which demonstrates its effectiveness; its drawback is that the error oscillates more strongly during training.

Claims (1)

1. An asynchronous parallel optimization method for neural network training, characterized by comprising the following steps:
(1) distribute the data pairs in the data set D used to train the neural network across n ≥ 1 computers, where the data set D contains N input-output data pairs;
the sub-data-set assigned to the i-th computer is denoted D_i, and the number of data pairs in D_i is N_i;
(2) determine the communication topology of the n computers, obtaining for every computer the set of computers it receives data from and the set of computers it sends data to; the set of computers that receive the data sent by the i-th computer is its out-neighbor set, which includes the i-th computer itself and whose size is the out-degree of the i-th computer; the set of computers that send data to the i-th computer is its in-neighbor set;
(3) every computer initializes its own neural network and the related parameters, specifically as follows:
(3-1) every computer initializes a neural network according to the predefined network structure; the initial parameters of the network on the i-th computer are denoted w_i(0), the iteration counter k_i of every computer is set to 0, and the consistency variable and total step-size variable of every computer are initialized to z_i(0) = 1 and l_i(0) = 1;
(3-2) every computer establishes three buffers, W_rec,i, Z_rec,i and L_rec,i, initialized to be empty; the three buffers respectively store the weighted parameters, weighted consistency variables and total step-size variables this computer receives from other computers;
(3-3) every computer defines its own trigger event;
(3-4) every computer i computes its initial weighted parameters and initial weighted consistency variable, and then sends them, together with l_i(0), to all of its out-neighbors;
(4) every computer trains its own neural network iteratively until the optimization of the neural network is complete, specifically as follows:
(4-1) before its trigger event arrives, every computer i receives the weighted parameters, weighted consistency variables and total step-size variables of its in-neighbors and stores them in the corresponding buffers W_rec,i, Z_rec,i and L_rec,i;
(4-2) when the trigger event of computer i arrives, this computer performs the following operations:
(4-2-1) computer i updates each of its variables according to the update formulas of the method, where ρ(t) is a step-size sequence, t is the index into the sequence, and the stochastic gradient of the loss function L(D, w) with respect to the neural network parameters w_i is used in the update;
(4-2-2) computer i updates its weighted parameters and weighted consistency variable, and then sends them, together with l_i(k_i + 1), to all of its out-neighbors;
(4-2-3) computer i increments the iteration counter k_i by 1 and returns to step (4-2-1); when computer i meets its iteration termination condition, this computer ends its iterative training;
(4-3) after all computers have ended their iterative training, the final neural network parameters on any one of the computers are the trained parameters of the neural network, and the optimization of the neural network is complete.
CN201811265027.8A 2018-10-29 2018-10-29 An asynchronous parallel optimization method for neural network training Pending CN109508785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811265027.8A CN109508785A (en) 2018-10-29 2018-10-29 An asynchronous parallel optimization method for neural network training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811265027.8A CN109508785A (en) 2018-10-29 2018-10-29 An asynchronous parallel optimization method for neural network training

Publications (1)

Publication Number Publication Date
CN109508785A (en) 2019-03-22

Family

ID=65746922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811265027.8A Pending CN109508785A (en) 2018-10-29 2018-10-29 An asynchronous parallel optimization method for neural network training

Country Status (1)

Country Link
CN (1) CN109508785A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150302295A1 (en) * 2012-07-31 2015-10-22 International Business Machines Corporation Globally asynchronous and locally synchronous (gals) neuromorphic network
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system
CN106293942A (en) * 2016-08-10 2017-01-04 中国科学技术大学苏州研究院 Neutral net load balance optimization method based on the many cards of multimachine and system
CN107784364A (en) * 2016-08-25 2018-03-09 微软技术许可有限责任公司 The asynchronous training of machine learning model
CN108073986A (en) * 2016-11-16 2018-05-25 北京搜狗科技发展有限公司 A kind of neural network model training method, device and electronic equipment
CN107018184A (en) * 2017-03-28 2017-08-04 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
US20180293492A1 (en) * 2017-04-10 2018-10-11 Intel Corporation Abstraction library to enable scalable distributed machine learning
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAQI ZHANG et al.: "AsySPA: An Exact Asynchronous Algorithm for Convex Optimization Over Digraphs", arXiv *
TIANYU WU et al.: "Decentralized Consensus Optimization with Asynchrony and Delays", arXiv *
WILLIAM CHAN: "Distributed Asynchronous Optimization of Convolutional Neural Networks", INTERSPEECH *
谢佩: "Research progress of networked distributed convex optimization algorithms", Control Theory & Applications *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175680A (en) * 2019-04-03 2019-08-27 西安电子科技大学 Internet of things data analysis method utilizing distributed asynchronous update online machine learning
CN110175680B (en) * 2019-04-03 2024-01-23 西安电子科技大学 Internet of things data analysis method utilizing distributed asynchronous update online machine learning
CN111582494A (en) * 2020-04-17 2020-08-25 浙江大学 Hybrid distributed machine learning updating method based on delay processing
CN111582494B (en) * 2020-04-17 2023-07-07 浙江大学 Mixed distributed machine learning updating method based on delay processing
CN112633480A (en) * 2020-12-31 2021-04-09 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network
CN112633480B (en) * 2020-12-31 2024-01-23 中山大学 Calculation optimization method and system of semi-asynchronous parallel neural network

Similar Documents

Publication Publication Date Title
Zhong et al. Practical block-wise neural network architecture generation
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
Zhang et al. Adaptive federated learning on non-iid data with resource constraint
CN110134636B (en) Model training method, server, and computer-readable storage medium
CN106297774B (en) A kind of the distributed parallel training method and system of neural network acoustic model
CN109508785A (en) An asynchronous parallel optimization method for neural network training
CN111858009A (en) Task scheduling method of mobile edge computing system based on migration and reinforcement learning
Cai et al. Dynamic sample selection for federated learning with heterogeneous data in fog computing
CN114584581B (en) Federal learning system and federal learning training method for intelligent city internet of things (IOT) letter fusion
CN108446770B (en) Distributed machine learning slow node processing system and method based on sampling
CN110362380A (en) A kind of multiple-objection optimization virtual machine deployment method in network-oriented target range
Zhan et al. Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking
CN109063041A (en) The method and device of relational network figure insertion
Tanaka et al. Automatic graph partitioning for very large-scale deep learning
Nie et al. HetuMoE: An efficient trillion-scale mixture-of-expert distributed training system
CN114399018B EfficientNet ceramic fragment classification method based on sparrow optimization of rotary control strategy
Hu et al. Improved methods of BP neural network algorithm and its limitation
CN109636709A (en) A kind of figure calculation method suitable for heterogeneous platform
CN116702925A (en) Distributed random gradient optimization method and system based on event triggering mechanism
Ho et al. Adaptive communication for distributed deep learning on commodity GPU cluster
Cheng et al. Bandwidth reduction using importance weighted pruning on ring allreduce
CN114995157A (en) Anti-synchronization optimization control method of multi-agent system under cooperative competition relationship
CN113672684A (en) Layered user training management system and method for non-independent same-distribution data
Lu et al. Adaptive asynchronous federated learning
CN108334939B (en) Convolutional neural network acceleration device and method based on multi-FPGA annular communication

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication

Application publication date: 20190322