CN107784364A - Asynchronous training of machine learning models - Google Patents


Info

Publication number
CN107784364A
CN107784364A (application CN201610730381.8A); granted as CN107784364B
Authority
CN
China
Prior art keywords
value
parameter set
rate
current value
worker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610730381.8A
Other languages
Chinese (zh)
Other versions
CN107784364B (en)
Inventor
王太峰
陈薇
刘铁岩
高飞
叶启威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to CN201610730381.8A (granted as CN107784364B)
Priority to EP17758721.9A (EP3504666B1)
Priority to PCT/US2017/047247 (WO2018039011A1)
Priority to US16/327,679 (US20190197404A1)
Publication of CN107784364A
Application granted
Publication of CN107784364B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present disclosure relate to asynchronous training of a machine learning model. A server receives, from a worker, feedback data generated by training the machine learning model. The feedback data is obtained by the worker using its own training data and is associated with a previous value that the parameter set of the machine learning model had at that particular worker. The server determines the difference between that previous value and the current value of the parameter set at the server; the current value may have undergone one or more updates in the meantime due to the operation of other workers. The server can then update the current value of the parameter set based on the feedback data and the difference between the two values of the parameter set. The update thus not only takes the training result of each worker into account, but also appropriately compensates for the delay between different workers.

Description

Asynchronous training of machine learning models
Background
Machine learning is widely applied in numerous areas such as speech recognition, computer vision and natural language processing. For example, deep neural networks (DNNs) are machine learning models with multiple layers and a large number of parameters that can be trained in parallel on the basis of big data and powerful computing resources. In the training stage, one or more parameters of the model need to be trained according to a given training data set and an optimization objective. For the training of a neural network, for example, stochastic gradient descent may be used.
It is known to distribute the training data set among multiple workers. The workers optimize the model parameters using their respective training data and return their results to a central server. A key problem of distributed, in other words asynchronous, model training, however, is the mismatch between the workers. For example, by the time a worker returns its parameter update, the model parameters at the server may already have been updated one or more times by other workers. In the asynchronous training of a machine learning model it is therefore desirable to reduce or eliminate this delay, or mismatch.
Summary
All traditional schemes are based on the theoretical understanding that the delay, or mismatch, between workers is caused by factors such as differences in performance between the workers and/or the server and inconsistencies in the communication between workers. The focus of traditional schemes is therefore to reduce the delay, for example by optimized scheduling. The inventors have found through research, however, that this delay is in fact intrinsic to the asynchronous architecture and cannot be eliminated by optimized scheduling. Embodiments of the present disclosure therefore aim to appropriately compensate for the delay between workers rather than attempt to eliminate it, which differs markedly in principle and mechanism from any known scheme.
In general, in accordance with embodiments of the present disclosure, a server receives, from a worker, feedback data generated by training a machine learning model. The feedback data is obtained by the worker using its own training data and is associated with a previous value that the parameter set of the machine learning model had at that particular worker. The server determines the difference between that previous value and the current value of the parameter set at the server. It will be understood that the current value may have undergone one or more updates due to the operation of other workers. The server can then update the current value of the parameter set based on the feedback data and the difference between the values of the parameter set. The update thus not only takes the training result of each worker into account, but also appropriately compensates for the delay between workers. Practice has shown that, compared with traditional schemes that attempt to forcibly eliminate the delay, embodiments of the present disclosure can significantly reduce the mismatch between workers and achieve effective and efficient asynchronous training of machine learning models.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief description of the drawings
Fig. 1 shows a block diagram of an environment in which embodiments of the present disclosure can be implemented;
Fig. 2 shows a flowchart of a method of training a model in accordance with an embodiment of the present disclosure;
Figs. 3A-3D show performance comparisons between a scheme in accordance with an embodiment of the present disclosure and traditional schemes;
Figs. 4A-4D show performance comparisons between a scheme in accordance with an embodiment of the present disclosure and traditional schemes; and
Fig. 5 shows a block diagram of a computing system/server in which one or more embodiments of the present disclosure can be implemented.
Throughout the drawings, the same or similar reference symbols are used to denote the same or similar elements.
Detailed description
The present disclosure is now discussed with reference to several embodiments. It should be appreciated that these embodiments are discussed only to enable those of ordinary skill in the art to better understand and thus implement the present disclosure, and do not imply any limitation on the scope of the present disclosure.
As used herein, the term "comprising" and its variants are to be read as open-ended terms meaning "including but not limited to". The term "based on" is to be read as "based at least in part on". The terms "an embodiment" and "one embodiment" are to be read as "at least one embodiment". The term "another embodiment" is to be read as "at least one other embodiment". The terms "first", "second" and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
Asynchronous training framework
Fig. 1 shows a block diagram of a parallel computing environment 100 in which embodiments of the present disclosure can be implemented. It is to be understood that the structure and function of the environment 100 are described only for illustrative purposes and do not imply any limitation on the scope of the present disclosure. The present disclosure can be embodied in different structures and/or functions.
The parallel computing environment 100 includes a server 102, a worker 104 and a worker 106. It should be appreciated that the numbers of servers and workers shown in Fig. 1 are given only for illustrative purposes and are not intended to be limiting; there may be any appropriate number of servers and workers. For example, in a parameter-server architecture, the server 102 may be implemented in a distributed manner by multiple servers. In some embodiments, the workers 104 and 106 may be implemented by one or more graphics processing units (GPUs) or GPU clusters.
In operation, each worker has its own training data. For example, the training data of each worker may be a subset of the full training data set, obtained by random sampling from the full set. According to a preset optimization objective, each worker trains the model independently and returns its result to the server 102. The server 102 updates the parameters of the model according to the feedback results of the workers 104 and 106, so that the parameters eventually satisfy the optimization objective. As described above, in this process the delay and mismatch between the workers 104 and 106 are the main bottleneck limiting the training effect.
General principle
For a clearer description of the present disclosure, the following discussion is given in the context of multi-class learning based on a neural network model. It will be appreciated, however, that the concepts of the present disclosure are applicable to various appropriate machine learning models, in particular neural network models.
In a multi-class classification problem, let X ⊆ R^d denote the input space, Y = {1, 2, ..., K} denote the output space, and P denote the underlying joint distribution over the space X × Y, where d denotes the dimension of the input space, R denotes the set of real numbers, and K denotes the number of classes in the output space.
In the training process, a training set {(x_1, y_1), ..., (x_S, y_S)} containing multiple data items usually needs to be provided, where S denotes the number of elements of the training set. Each element of the training set consists of a pair of an input and an output, and can be regarded as sampled independently and identically distributed (i.i.d.) from the distribution P; for example, the element (x_1, y_1) denotes the data item consisting of the input x_1 and the output y_1. The overall goal of training is to learn, based on the training set, a neural network model drawn from a space of mappings from the input space to real-valued vectors. It will be appreciated, however, that the term "training" as used herein may also refer to part of a training process. The parameter set of the model can be represented by an n-dimensional real vector, i.e. w ∈ R^n, where n is a natural number. That is, the parameter set may include one or more parameters. For brevity, the parameter set is sometimes also referred to simply as a parameter or a parameter vector.
A neural network model generally has a layered structure, in which each node computes a linear combination of the connected nodes of the previous layer followed by a nonlinear activation. The model parameters represent the weights on the edges between two layers. For each input, the neural network model produces an output vector representing the likelihood that the input belongs to each of the different classes.
Since the underlying distribution P is unknown, the commonly used approach for learning or training the model is to minimize a loss function. A loss function is one form of optimization objective. Alternatively, the model may be trained by maximizing a utility function. A utility function generally has an equivalent loss-function form; for example, it may be represented by the negative of a loss function. For the sake of simplicity, therefore, embodiments of the present disclosure are described below mainly in terms of a loss function.
A loss function may represent a measure of the overall loss for model optimization, where the loss may reflect various factors such as misclassification error. A commonly used loss function for deep neural networks is the cross-entropy loss function, which can be defined as
$$f(x, y; w) = -\sum_{k=1}^{K} I_{[y=k]} \log \sigma_k(x; w) \qquad (1)$$
where I denotes the indicator function, log denotes the logarithmic function, and σ_k(x; w) denotes the k-th output of the softmax operation applied to the outputs of the network. The softmax operation is well known in the art and is widely used in multi-class learning problems, and is therefore not described in further detail here.
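As a concrete illustration (not part of the original patent text), the following is a minimal numerical sketch of the cross-entropy loss of equation (1), computing the softmax probabilities σ_k from raw network outputs; the function and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy_loss(scores, label):
    """Cross-entropy loss of equation (1) for a single example.

    scores: length-K vector of raw network outputs for one input x.
    label:  integer class index y in {0, ..., K-1}.
    """
    probs = softmax(scores)          # sigma_k(x; w) for k = 1..K
    return -np.log(probs[label])     # only the indicator-selected term survives

# Example: 3-class problem, true class index 2.
print(cross_entropy_loss(np.array([1.0, 0.5, 2.0]), 2))
```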
In minimizing the empirical loss function, it is usually necessary to set an initial value for w and then change w iteratively according to the training data until it converges to a parameter value w that minimizes the loss function.
As described above, in synchronous stochastic gradient descent, each worker (for example, the workers 104 and 106) computes gradients based on its own mini-batch of data, and these gradients are added to the global model. By setting a barrier, the local workers wait for one another until the gradients from all local workers have been added to the global model. Because of the barrier, however, the training speed of the model is determined by the slowest worker. To improve training efficiency, asynchronous stochastic gradient descent can be used. In asynchronous stochastic gradient descent no barrier is used: after a worker adds its gradient to the global model, it continues the training process without waiting. Asynchronous stochastic gradient descent is therefore more efficient than synchronous stochastic gradient descent because there is no waiting time.
The computing environment 100 shown in Fig. 1 can be used to implement asynchronous stochastic gradient descent. At time t, the worker 104 receives the model parameters w_t from the server 102. For example, the worker 104 may send the server 102 a request for the current model parameters, and the server 102, upon receiving the request, sends the model parameters w_t to the worker 104. The worker 104 then computes a local gradient g(w_t) according to the data x_t. The worker may receive the data x_t from another server, and the data x_t may be obtained by random sampling from the training set. The number of data items x_t may be one or more; the present disclosure is not limited in this respect. It should be understood that the term "stochastic gradient" covers not only the case of training on a single data item x_t but also the case of training on multiple data items x_t, the latter being sometimes referred to as a mini-batch gradient.
At the worker 104, the local gradient g(w_t) can be obtained by computation. For example, the value of g(w_t) can be obtained by substituting the data x_t into the expression for the gradient. The worker 104 then sends the gradient to the server 102, and the server 102 adds the local gradient g(w_t) to the global model parameters. However, as shown in Fig. 1, before this happens another τ workers may already have added their local gradients to the global model parameters, so that the global model parameters have been updated τ times and have become w_{t+τ}. The traditional asynchronous stochastic gradient descent algorithm ignores this problem and simply adds the local gradient g(w_t) to the global model parameters w_{t+τ}:
$$w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t) \qquad (2)$$
where η denotes the learning rate; this equation is commonly referred to as the update rule. It can be seen that the update rule of asynchronous stochastic gradient descent is not equivalent to the update rule of sequential stochastic gradient descent (also referred to as single-machine stochastic gradient descent). In asynchronous stochastic gradient descent, a "delayed" or "stale" local gradient g(w_t) is in fact added to the current global model parameters w_{t+τ}, whereas in sequential stochastic gradient descent the global model parameters w_{t+τ} should be updated based on the gradient with respect to w_{t+τ}, i.e. g(w_{t+τ}).
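To make the staleness in update rule (2) concrete, here is a minimal single-process simulation on a toy objective (an illustrative sketch, not taken from the patent): the server applies whatever gradient arrives, even though it was computed against an older copy of the parameters.

```python
import numpy as np

# Toy objective: f(w) = 0.5 * ||w||^2, so g(w) = w.
def gradient(w):
    return w

eta = 0.1
w_server = np.array([1.0, -2.0])   # global model parameters at the server

# Worker pulls parameters at "time t" ...
w_t = w_server.copy()

# ... meanwhile other workers update the global model tau times.
tau = 3
for _ in range(tau):
    w_server = w_server - eta * gradient(w_server)

# The worker's gradient g(w_t) is now stale, but plain asynchronous SGD
# applies it to w_{t+tau} anyway, exactly as in equation (2).
w_server = w_server - eta * gradient(w_t)

# Sequential SGD would instead have used g(w_{t+tau}); the gap between
# g(w_t) and g(w_{t+tau}) is the delay the disclosure compensates for.
```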
Traditional theory holds that asynchronous stochastic gradient descent needs more iterations to reach the same accuracy as sequential stochastic gradient descent. In some cases asynchronous stochastic gradient descent cannot even reach the same accuracy as sequential stochastic gradient descent, particularly when the number of workers is large. Various methods exist to address this problem in asynchronous stochastic gradient descent. In some schemes, the delay of the local gradients is reduced by setting various scheduling strategies. In some schemes, smaller weights are assigned to local gradients with larger delays and larger weights to local gradients with smaller delays. As another example, local gradients whose delay exceeds a certain threshold are discarded, and so on. None of these methods, however, makes full use of every local gradient, and they therefore waste computing resources to some extent.
The traditional understanding is that this delay is caused by differences in performance between workers and/or the server and by inconsistencies in the communication between workers, and that the delay should therefore be reduced by methods such as optimized scheduling. The inventors have recognized, however, that this understanding is inapt. The delay is intrinsic to the asynchronous architecture and cannot be completely eliminated. As shown in Fig. 1, when the local gradient g(w_t) is added to the global model parameters w_{t+τ}, there will necessarily be a delay of τ updates. Therefore, in accordance with embodiments of the present disclosure, in asynchronous model training the delay of the different workers (for example, the delay of the gradients) is appropriately compensated for, rather than attempting to reduce the delay.
Example process
The principles in accordance with embodiments of the present disclosure have been described above in conjunction with Fig. 1. It should be appreciated that these principles can easily be extended to any appropriate model and scenario in which asynchronous stochastic gradient descent is used. An example process of an embodiment of the present disclosure is described below with reference to Fig. 2. For convenience of description, Fig. 2 is described in conjunction with the computing environment 100 of Fig. 1.
At 202, the server 102 receives from the worker 104, at time t, a request for the current value w_t of the model parameters. Since the model parameters are generally represented by a plurality of parameters, they are also referred to as a parameter set. In response to the request, the server 102 sends the current value w_t of the model parameters to the worker 104. In some embodiments, the time t may be represented by a count; for example, each time the global model parameters w_t are updated, the time t is incremented by one count.
The worker 104 thus obtains the current value w_t of the global model parameters from the server 102. In addition, the worker 104 may receive one or more items of training data x_t from a device such as a server hosting the data set. The number of training data items, also referred to as the mini-batch size, depends on the user's setting. In some embodiments, the training data may be obtained by random sampling from the data set.
The worker 104 generates feedback data by training the model. The feedback data is associated with the current value w_t of the model parameters. The training process performed by the worker 104 may be part of the overall training process of the model. As an example, in some embodiments the feedback data indicates a significant change trend of the optimization objective of the model with respect to the current value w_t of the model parameters. For example, in some embodiments the significant change trend may be the maximal change trend, and may thus be represented by the gradient g(w_t) of the optimization objective with respect to the current value w_t of the model parameters.
In particular, it should be noted that the scope of the present disclosure is not limited to the mathematical characterization of the "significant change trend" or of any other physical quantity. The mathematical characterizations used here (for example, mathematical quantities, expressions, formulas and the like) are described merely as examples, and their sole purpose is to help those skilled in the art understand the ideas and implementations of the present disclosure.
As described above, the optimization objective may be represented by a loss function; therefore, for discussion purposes, the discussion continues with reference to a loss function. The worker 104 can compute, according to the training data, the local gradient g(w_t) of the loss function at the current value w_t of the parameters. For many common loss functions, the local gradient g(w_t) has an explicit expression, so the value of the local gradient g(w_t) can be obtained by substituting the training data into that expression. In this case, the training process performed by the worker 104 simply determines the local gradient g(w_t) based on the explicit expression, i.e. it is part of the overall training process of the model.
The worker 104 then sends the feedback data (for example, the local gradient g(w_t)) back to the server 102. In some embodiments, when the worker 104 obtains the current value w_t of the global model parameters from the server 102, the server 102 stores the current value w_t of the model parameters as backup model parameters w_bak(m), where m denotes the identifier of the worker 104.
As shown in Fig. 1, at time t+τ the server 102 receives from the worker 104 the feedback data generated by training the model, for example the local gradient g(w_t). By this time the current value of the model parameters has been updated to w_{t+τ}, so in some embodiments the feedback data essentially represents the significant change trend of the optimization objective of the model with respect to the previous value w_t of the model parameters.
Still referring to Fig. 2, at 204 the server 102 can determine the difference between the previous value w_t of the model parameters and the current value w_{t+τ}. At 206, the server 102 updates the current value w_{t+τ} of the model parameters based on the feedback data and the difference, thereby obtaining the updated value w_{t+τ+1} of the model parameters.
In some embodiments, the previous model parameters w_t may be the model parameters stored as the backup model parameters w_bak(m), as described above. The update amount of the model parameters, i.e. the difference between the updated value w_{t+τ+1} and the current value w_{t+τ} of the parameters, can be regarded as a transformation of the difference between the current value w_{t+τ} of the model parameters and the previous value w_t. In some embodiments, the coefficient of this transformation can be determined based on the significant change trend, and the difference between the current value and the updated value of the model parameters, i.e. the update amount of the model parameters, can be determined by applying this transformation to that difference. By applying this update amount to the current value w_{t+τ}, the updated value w_{t+τ+1} of the parameter set can be obtained. For example, the transformation may be a linear transformation and the coefficient of the transformation may be a linear change rate; of course, any other appropriate transformation is also feasible.
For discussion purposes, the following discussion again refers to asynchronous stochastic gradient descent. Note again, however, that this is intended only to explain the principles and ideas of the present disclosure and is not intended to limit the scope of the present disclosure in any way.
Ideally, asynchronous stochastic gradient descent would, like sequential stochastic gradient descent, add the gradient g(w_{t+τ}) to the global model parameters w_{t+τ}. Asynchronous stochastic gradient descent, however, adds the delayed gradient g(w_t) to the global model parameters w_{t+τ}.
This gap can be illustrated by a Taylor expansion. For example, the Taylor expansion of g(w_{t+τ}) at w_t can be expressed as
$$g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) + \mathcal{O}\big((w_{t+\tau} - w_t)^2\big)\, I_n \qquad (3)$$
where I_n denotes the n-dimensional unit vector and the calligraphic O symbol denotes the higher-order terms of order two and above.
It can be seen from equation (3) that the asynchronous stochastic gradient method uses the zeroth-order term of the Taylor expansion as an approximation of the gradient g(w_{t+τ}) and ignores all the other terms. Therefore, to fully close the gap between the delayed gradient g(w_t) and the gradient g(w_{t+τ}), all the other terms would have to be taken into account and computed separately. This is impractical, however, because it involves computing the sum of infinitely many terms. In accordance with embodiments of the present disclosure, only the zeroth-order and first-order terms of the Taylor expansion are retained, and only the simplest compensation is applied to the delayed gradient:
$$g(w_{t+\tau}) \approx g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) \qquad (4)$$
The first derivative of the gradient reflects the rate of change of the gradient and corresponds to the second derivative of the loss function (for example, the cross-entropy loss function represented by equation (1)). The first derivative of the gradient can be represented by the Hessian matrix, which can be defined as Hf(w) = [h_ij], i, j = 1, ..., n, where h_ij = ∂²f(w)/∂w_i∂w_j.
Thus, by combining equation (4) with the update rule of equation (2), the update amount for the parameters can be determined. The update amount consists of two terms: the first term is the product of the delayed gradient and the learning rate, and the second term is a compensation term. The update amount of the parameters can therefore be regarded as a linear transformation of the difference between the current value w_{t+τ} of the parameter set and the previous value w_t, with the linear change rate being the product of the learning rate and the Hessian matrix. Since the learning rate is an empirical parameter that can be predefined, the linear change rate can be regarded as equivalent to the Hessian matrix.
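Written out explicitly (a reconstruction assembled from equations (2) and (4), not reproduced verbatim from the original), the delay-compensated update described in the preceding paragraph reads:

```latex
\begin{align*}
w_{t+\tau+1}
  &= w_{t+\tau} - \eta\, g(w_{t+\tau})
     && \text{(ideal sequential update)} \\
  &\approx w_{t+\tau} - \eta \Big( g(w_t)
     + \nabla g(w_t)\,\big(w_{t+\tau} - w_t\big) \Big)
     && \text{(delay-compensated update, from (2) and (4))}
\end{align*}
% Here \eta\,\nabla g(w_t) = \eta\, Hf(w_t) plays the role of the linear
% change rate applied to the difference (w_{t+\tau} - w_t).
```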
In some embodiments the Hessian matrix can be computed directly, but this process may be difficult. For example, for a neural network model with one million parameters, the corresponding Hessian matrix would have one trillion elements. The computational complexity of obtaining such a huge matrix is therefore very high, and such a matrix also requires a large amount of storage space. Accordingly, in other embodiments the Hessian matrix is approximated by an approximator that is comparatively easy to compute and/or store, so that the delay compensation is easy to implement.
For example, in some embodiments it is desirable to obtain an approximation of the first derivative of the gradient (for example, of the Hessian matrix Hf(w)) based on the feedback data itself (for example, the gradient g(w_t)). In this way the computational complexity is not significantly increased.
For discussion purposes, assume that the model is a neural network model and that its optimization objective is represented by the cross-entropy loss function commonly used in neural network models. Accordingly, for the cross-entropy loss function, assume that Y is a discrete random variable following the distribution P(Y = k) = σ_k(x; w), where k ∈ {1, 2, ..., K}. It can then be proven that
$$\mathbb{E}_{Y}\!\left[\frac{\partial^2 f(x, Y; w)}{\partial w\, \partial w}\right] = \mathbb{E}_{Y}\!\left[\frac{\partial f(x, Y; w)}{\partial w} \otimes \frac{\partial f(x, Y; w)}{\partial w}\right]$$
where ⊗ denotes the outer product (tensor product) of vectors, the left-hand side denotes the expected value of the second derivative of the loss function with respect to the model parameters when Y follows the above distribution, and the right-hand side denotes the expected value of the outer product of the first derivative of the loss function with respect to the model parameters when Y follows the same distribution. For brevity, the detailed proof is not described here.
For the cross-entropy loss function, as a consequence of this identity, Gf(x, y, w) is an unbiased estimator of Hf(x, y, w), where Gf(w) denotes the outer product matrix of the gradient vector g(w), i.e. Gf(w) = [g_ij], i, j = 1, ..., n, with g_ij denoting the elements of the matrix Gf(w). As noted above, since the Hessian matrix can be regarded as the linear change rate of the linear transformation used for the parameter update, the tensor product can be used as an unbiased estimate of the linear change rate.
The matrix Gf(w) can be obtained by performing the tensor product operation on the gradient vector g(w). Since the tensor product operation has relatively low computational complexity, the computation cost can be reduced significantly. Moreover, in such embodiments, because this matrix simply takes the place of the Hessian matrix in the linear transformation, there is no need to use additional storage space to store additional variables, and no substantial demand is placed on the storage space.
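As an illustration (a sketch, not code from the patent), the outer-product estimator can be formed from a gradient vector with a single call; note that its diagonal is simply the element-wise square of the gradient, which is what the diagonal approximations below rely on.

```python
import numpy as np

g = np.array([0.3, -1.2, 0.7])   # local gradient g(w_t) received from a worker

G = np.outer(g, g)               # Gf(w) = g ⊗ g, unbiased estimator of the Hessian
diag_G = np.diag(G)              # diagonal of Gf(w), equal to g * g element-wise

assert np.allclose(diag_G, g * g)
```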
It will be appreciated that although the outer product matrix of the gradient vector g(w) has been shown here to be an unbiased estimator of the Hessian matrix for the cross-entropy loss function, this conclusion can easily be applied to an unknown loss function or optimization objective, as long as the error of the estimate lies within an acceptable tolerance range.
However, because the unbiased estimator of the Hessian matrix does not take the influence of variance into account, it may lead to a relatively high approximation error. Therefore, in some embodiments, bias and variance may be considered simultaneously. For example, the mean square error may be used to characterize the quality of the approximation.
In some embodiments, in order to reduce the variance, a further approximation of the Hessian matrix, λ1 √Gf(w) (with the square root taken element-wise, so that its diagonal elements are λ1|g_i(w)|), may be used, where λ1 denotes a control parameter. To simplify notation, for all σ_k the range of the relevant quantity is written as [l_i, u_i], and corresponding auxiliary quantities are defined from these bounds.
For the cross-entropy loss function it can be proven that if the control parameter λ1 satisfies sign(λ1) = sign(g_ij) together with an appropriate bound on its magnitude determined by these quantities, then the mean square error of the resulting approximator is smaller than the mean square error of the approximator g_ij, where sign denotes the sign function. Therefore, by setting λ1 appropriately, a mean square error smaller than that of g_ij can be achieved.
It should be appreciated that the above approximation is merely exemplary and is not intended to limit the scope of the present disclosure. For example, in other embodiments, in order to reduce the variance, another approximation of the Hessian matrix, λ2 Gf(w), may be used, where λ2 denotes a control parameter.
For the cross-entropy loss function it can be proven that if λ2 ∈ [0, 1] and σ_k satisfies an appropriate condition, then the mean square error of the approximator λ2 g_ij is smaller than the mean square error of the approximator g_ij. Therefore, by setting λ2 appropriately, λ2 g_ij can achieve a mean square error smaller than that of g_ij.
The above conclusions give ranges of the control parameters λ1 and λ2 for the cross-entropy loss function, but it should be understood that corresponding appropriate ranges may also exist for an unknown loss function. Moreover, in a specific implementation, the ranges of the control parameters λ1 and λ2 can be configured empirically, within a wider range, according to the specific implementation situation.
The approximations λ1 √Gf(w) and λ2 Gf(w) can easily be computed based on the local gradient g(w_t) and achieve a good balance between the bias and the variance of the approximation. The Hessian matrix can thus be approximated in a more economical manner.
Further, in some embodiments, the magnitude of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient g(w_t), where the magnitude of a derivative denotes its size or absolute value. The linear change rate can then be determined based on the magnitudes of the derivatives. Further specific embodiments are described below in conjunction with the approximations λ1 √Gf(w) and λ2 Gf(w), respectively.
In some embodiments, only the diagonal elements of the approximation λ1 √Gf(w) of the Hessian matrix may be used. The update rule for the global model parameters then becomes
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_1 \operatorname{diag}\!\big(\sqrt{Gf(w_t)}\big) \odot (w_{t+\tau} - w_t)\right) \qquad (6)$$
which is equivalent to
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_1\, |g(w_t)| \odot (w_{t+\tau} - w_t)\right) \qquad (7)$$
where diag denotes taking the diagonal elements of a matrix and ⊙ denotes element-wise multiplication; equations (6) and (7) are two fully equivalent representations.
According to equation (7), the magnitude of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient g(w_t). Mathematically, the magnitude of a derivative can be represented by its absolute value. In some embodiments, the magnitudes of the derivatives can be used directly to determine the linear change rate and thereby the compensation term. In other words, the vector formed by the absolute values of the elements of g(w_t) can be used as the linear change rate, and the linear change rate and the compensation term differ by at most an adjustment factor, such as the product of the learning rate η and the control parameter λ1. Since the absolute-value operation has very low computational complexity, the computation cost can be significantly reduced. In addition, no additional variables need to be stored in additional storage space, so no additional demand is placed on the storage space.
Alternatively, in other embodiments, only the diagonal elements of the approximation λ2 Gf(w) of the Hessian matrix may be used. The update rule for the global model parameters then becomes
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_2 \operatorname{diag}\!\big(Gf(w_t)\big) \odot (w_{t+\tau} - w_t)\right) \qquad (8)$$
which is equivalent to
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_2\, g(w_t) \odot g(w_t) \odot (w_{t+\tau} - w_t)\right) \qquad (9)$$
According to equation (9), the square of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient g(w_t), and the linear change rate, and hence the compensation term, can be determined directly from the squares of the derivatives. In other words, in such embodiments the vector formed by the squares of the absolute values of the elements of g(w_t) (rather than the absolute values themselves) can be used as the linear change rate, and the linear change rate and the compensation term differ by at most an adjustment factor, such as the product of the learning rate η and the control parameter λ2. Since the squaring operation has very low computational complexity, the computation cost can be significantly reduced. In addition, no additional variables need to be stored in additional storage space, so no additional demand is placed on the storage space.
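The following is a minimal sketch of the delay-compensated server-side update of equations (7) and (9) (an illustration under the notation above; the function name dc_update and its arguments are assumptions, not taken from the patent).

```python
import numpy as np

def dc_update(w_current, w_backup, g, eta, lam, variant="abs"):
    """Delay-compensated parameter update at the server.

    w_current: w_{t+tau}, current global parameters.
    w_backup:  w_t, the parameters the worker used to compute its gradient
               (stored by the server as w_bak(m) for that worker).
    g:         delayed local gradient g(w_t) received from the worker.
    variant:   "abs"    -> compensation rate lam * |g|   (equation (7))
               "square" -> compensation rate lam * g * g (equation (9))
    """
    if variant == "abs":
        rate = lam * np.abs(g)   # diagonal of lambda_1 * sqrt(Gf(w_t))
    else:
        rate = lam * g * g       # diagonal of lambda_2 * Gf(w_t)
    compensation = rate * (w_current - w_backup)
    return w_current - eta * (g + compensation)

# Example usage with toy values:
w_cur = np.array([0.8, -1.1])
w_bak = np.array([1.0, -2.0])
g_t = np.array([0.5, -0.3])
w_new = dc_update(w_cur, w_bak, g_t, eta=0.1, lam=0.04, variant="square")
```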
Since the learning rate η may decrease continuously as the training of the model progresses, the control parameter may need to be adjusted accordingly. As can be seen from the update rules above, the coefficient of the compensation term is affected by the product of the control parameter itself and the learning rate. In some embodiments, the control parameter is adjusted so that the product of the control parameter and the learning rate remains substantially unchanged. In this case, the overall control coefficient of the compensation term can also be regarded as remaining constant.
In some embodiments, as shown in Fig. 1, when the model parameters are updated from w_{t+τ} to w_{t+τ+1}, the server 102 may automatically send the updated model parameters to the worker 104. Alternatively or additionally, the server 102 may send the updated model parameters w_{t+τ+1} to the worker 104 in response to a request for the model parameters from the worker 104.
In accordance with embodiments of the present disclosure, relative to traditional asynchronous stochastic gradient descent, the workers (for example, the workers 104 and/or 106) do not need to perform any additional computation; they only need to compute the local gradient g(w_t). The server 102 also only needs to perform additional computations of fairly low complexity. Even in the approximation that requires computing the outer product matrix Gf(w), only an outer product of vectors needs to be computed. When only the diagonal elements of the approximations λ1 √Gf(w) and λ2 Gf(w) are considered, the computational complexity is reduced even further.
For each worker, the server 102 only needs to store the corresponding backup model parameters w_bak(m), where m may be 1, 2, ..., M and M denotes the total number of workers. This generally does not affect or degrade system performance. In some embodiments, the server 102 is implemented in a distributed manner, so the capacity of its available storage space is considerably larger than that of a single machine. Alternatively or additionally, a worker (for example, the worker 104 and/or 106) may send its corresponding global parameters to the server 102 at the same time as it sends its gradient. In this way, no large-capacity storage needs to be deployed on the server 102 side, but the communication cost between the workers (for example, the workers 104 and/or 106) and the server is doubled.
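Putting the pieces together, the server/worker interaction described above can be sketched as follows (illustrative only; names such as ParameterServer and worker_step are hypothetical and not from the patent). The server keeps one backup copy w_bak(m) per worker when parameters are pulled and applies the delay-compensated update when that worker's gradient arrives.

```python
import numpy as np

class ParameterServer:
    def __init__(self, w_init, eta, lam):
        self.w = np.asarray(w_init, dtype=float)   # global parameters
        self.eta, self.lam = eta, lam
        self.backup = {}                            # w_bak(m), keyed by worker id

    def pull(self, worker_id):
        # Worker m requests the current parameters; remember what it received.
        self.backup[worker_id] = self.w.copy()
        return self.w.copy()

    def push(self, worker_id, g):
        # Delay-compensated update (equation (9) variant) using the backup.
        w_bak = self.backup[worker_id]
        compensation = self.lam * g * g * (self.w - w_bak)
        self.w = self.w - self.eta * (g + compensation)
        return self.w.copy()

def worker_step(server, worker_id, local_gradient_fn):
    w_t = server.pull(worker_id)       # obtain w_t; the server backs it up
    g = local_gradient_fn(w_t)         # compute g(w_t) on the worker's mini-batch
    return server.push(worker_id, g)   # server compensates for the delay

# Toy usage: two workers optimizing f(w) = 0.5 * ||w||^2, whose gradient is g(w) = w.
server = ParameterServer(w_init=[1.0, -2.0], eta=0.1, lam=0.04)
for step in range(5):
    for m in (0, 1):
        worker_step(server, m, lambda w: w)
```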
Experiments and performance
Figs. 3A-3D and Figs. 4A-4D show performance comparisons between a scheme in accordance with an embodiment of the present disclosure and traditional schemes. The experiments of Figs. 3A-3D and Figs. 4A-4D were carried out on the CIFAR-10 data set. For all methods, 160 iterations were run, the mini-batch size was 128, and the initial learning rate was 0.5, which was reduced to one tenth of its previous value after 80 iterations and again after 120 iterations. For the embodiment of the present disclosure, the control parameters λ1 and λ2 were initially set to 2 and 0.04, respectively, and were increased to ten times their previous values whenever the learning rate changed.
Figs. 3A-3D show convergence curves for a fixed number of effective passes. Sequential stochastic gradient descent (SGD) achieves the best training accuracy, with a final test error of 8.75%. Asynchronous stochastic gradient descent (Async SGD) and synchronous stochastic gradient descent (Sync SGD) both perform worse, and their test errors grow as the number of workers increases. With 4 workers (M=4), their test errors are 9.39% and 9.35%, respectively; with 8 workers (M=8), they are 10.4% and 10.1%, respectively. This is because asynchronous stochastic gradient descent is affected by delayed gradients, which become more severe as the number of workers increases, while synchronous stochastic gradient descent needs to increase the mini-batch size, which affects the training performance of the model. By comparison, for the delay-compensated stochastic gradient descent (DC-ASGD) in accordance with embodiments of the present disclosure, both approximations (namely λ1|g| and λ2 g⊙g, based on the diagonal elements of the gradient outer product) perform significantly better than asynchronous stochastic gradient descent and synchronous stochastic gradient descent, and almost catch up with sequential stochastic gradient descent. For example, with 4 workers, the test error of delay-compensated stochastic gradient descent is 8.69%, essentially indistinguishable from that of sequential stochastic gradient descent. With 8 workers, the test error of delay-compensated stochastic gradient descent can be reduced to 9.27%, which is significantly better than traditional asynchronous stochastic gradient descent and synchronous stochastic gradient descent.
Figs. 4A-4D show comparisons of convergence speed between a scheme in accordance with an embodiment of the present disclosure and traditional schemes. The convergence of asynchronous stochastic gradient descent is very fast compared with sequential stochastic gradient descent, achieving an almost linear speed-up, but its convergence point is worse. Synchronous stochastic gradient descent is also faster than sequential stochastic gradient descent, but because of the synchronization cost it is significantly slower than asynchronous stochastic gradient descent. Delay-compensated asynchronous stochastic gradient descent achieves a very good balance between accuracy and speed: its convergence speed is very similar to that of traditional asynchronous stochastic gradient descent, while its convergence point is essentially as good as that of sequential stochastic gradient descent.
Example apparatus
Fig. 5 shows a block diagram of an example computing system/server 500 in which one or more embodiments of the subject matter described herein can be implemented. The model estimation system 50, the model execution system 120, or both, can be implemented by the computing system/server 500. The computing system/server 500 shown in Fig. 5 is only an example and is not intended to suggest any limitation as to the scope of use or functionality of the implementations described herein.
As shown in Fig. 5, the computing system/server 500 is in the form of a general-purpose computing device. The components of the computing system/server 500 may include, but are not limited to, one or more processors or processing units 500, a memory 520, one or more input devices 530, one or more output devices 540, a storage device 550, and one or more communication units 560. The processing unit 500 may be a real or virtual processor and can perform various processes according to programs stored in the memory 520. In a multiprocessing system, multiple processing units execute computer-executable instructions to increase processing power.
The computing system/server 500 typically includes a variety of computer media. Such media may be any available media that are accessible by the computing system/server 500, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 520 may be volatile memory (for example, registers, cache, random access memory (RAM)), non-volatile memory (for example, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 550 may be removable or non-removable, and may include machine-readable media, such as a flash drive, a magnetic disk or any other medium, which can be used to store information and which can be accessed within the computing system/server 500.
The computing system/server 500 may further include other removable/non-removable, volatile/non-volatile computer system storage media. Although not shown in Fig. 5, a disk drive for reading from or writing to a removable, non-volatile magnetic disk (for example, a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to the bus 18 by one or more data media interfaces. The memory 520 may include at least one program product having a set of (for example, at least one) program modules that are configured to carry out the functions of the various embodiments described herein.
A program/utility tool 522 having a set of one or more program modules 524 may be stored, for example, in the memory 520. Such program modules 524 may include, but are not limited to, an operating system, one or more application programs, other program modules and program data. Each of these examples, or a particular combination thereof, may include an implementation of a networking environment. The program modules 524 generally carry out the functions and/or methods of the embodiments of the subject matter described herein, for example the method 200.
The input unit 530 may be one or more of various input devices. For example, the input unit 530 may include a user device such as a mouse, a keyboard, a trackball or the like. The communication unit 560 enables communication with further computing entities over a communication medium. Additionally, the functions of the components of the computing system/server 500 may be implemented by a single computing cluster or by multiple computing machines that are able to communicate over communication links. Therefore, the computing system/server 500 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs) or another general network node. By way of example, and not limitation, communication media include wired or wireless networking techniques.
The computing system/server 500 may also communicate, as required, with one or more external devices (not shown) such as storage devices, display devices and the like, with one or more devices that enable users to interact with the computing system/server 500, or with any device (for example, a network card, a modem, etc.) that enables the computing system/server 500 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
The functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SOCs), complex programmable logic devices (CPLDs), and the like.
Program code for carrying out the methods of the subject matter described herein may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the subject matter described herein. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination.
Example implementations
Some example implementations of the present disclosure are listed below.
In some embodiments, a computer-implemented method is provided. The method comprises: receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determining a difference between the previous value and a current value of the parameter set; and updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
In some embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
In some embodiments, updating the current value comprises: determining a coefficient of a transformation based on the significant change trend; and determining a difference amount between the current value and the updated value by applying the transformation to the difference.
In some embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
In some embodiments, determining the coefficient of the transformation comprises: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In some embodiments, determining the coefficient of the transformation comprises: determining, based on the gradient, a magnitude of a rate of change of the optimization objective with respect to each parameter in the parameter set; and determining the linear change rate based on the magnitude of the rate of change.
In some embodiments, determining the linear change rate based on the magnitude of the rate of change comprises: computing a square of the magnitude of the rate of change; and determining the linear change rate based on the square of the magnitude of the rate of change.
In some embodiments, the method further comprises: receiving a request for the parameter set from the worker; and in response to the request, sending the updated value of the parameter set to the worker.
In some embodiments, the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
In some embodiments, a device is provided. The device comprises: a processing unit; and a memory coupled to the processing unit and storing instructions which, when executed by the processing unit, perform the following actions: receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determining a difference between the previous value and a current value of the parameter set; and updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
In some embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
In some embodiments, updating the current value comprises: determining a coefficient of a transformation based on the significant change trend; and determining a difference amount between the current value and the updated value by applying the transformation to the difference.
In some embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
In some embodiments, determining the coefficient of the transformation comprises: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In some embodiments, determining the coefficient of the transformation comprises: determining, based on the gradient, a magnitude of a rate of change of the optimization objective with respect to each parameter in the parameter set; and determining the linear change rate based on the magnitude of the rate of change.
In some embodiments, determining the linear change rate based on the magnitude of the rate of change comprises: computing a square of the magnitude of the rate of change; and determining the linear change rate based on the square of the magnitude of the rate of change.
In some embodiments, the actions further comprise: receiving a request for the parameter set from the worker; and in response to the request, sending the updated value of the parameter set to the worker.
In some embodiments, the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
In certain embodiments, there is provided a kind of computer program product.The computer program product is stored in non-transient In computer-readable storage medium and including machine-executable instruction, when machine-executable instruction is run in a device so that equipment: Received from working machine by being trained the feedback data to generate, feedback data and machine learning model to machine learning model Preceding value of the parameter set at working machine be associated;Determine the difference between preceding value and the currency of parameter set;And base In feedback data and difference, currency is updated to obtain the updated value of parameter set.
In certain embodiments, feedback data indicates the optimization aim of machine learning model relative to the preceding value of parameter set Significant changes trend.
In certain embodiments, renewal currency includes:The coefficient of conversion is determined based on significant changes trend;It is and logical Cross the residual quantity for bringing and determining between currency and updated value using becoming to difference.
In certain embodiments, conversion is linear transformation, and coefficient is linear change rate, and significant changes trend is by optimizing Target represents relative to the gradient of the preceding value of parameter set.
In certain embodiments, it is determined that the coefficient of conversion includes:The tensor product of gradient is calculated as to linear change rate Unbiased esti-mator.
In some embodiments, determining the coefficient of the transformation includes: determining, based on the gradient, values of rates of change of the optimization objective with respect to individual parameters in the parameter set; and determining the linear change rate based on the values of the rates of change.
In some embodiments, determining the linear change rate based on the values of the rates of change includes: calculating squares of the values of the rates of change; and determining the linear change rate based on the squares of the values of the rates of change.
In some embodiments, the machine-executable instructions further cause the device to: receive a request for the parameter set from the working machine; and in response to the request, send the updated value of the parameter set to the working machine.
In some embodiments, the machine learning model includes a neural network model, and the optimization objective is represented by a cross-entropy loss function.
Although the present disclosure has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (20)

1. A computer-implemented method, comprising:
receiving, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determining a difference between the previous value and a current value of the parameter set; and
updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
2. The method of claim 1, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
3. The method of claim 2, wherein updating the current value comprises:
determining a coefficient of a transformation based on the significant change trend; and
determining a delta between the current value and the updated value by applying the transformation to the difference.
4. The method of claim 3, wherein the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
5. The method of claim 4, wherein determining the coefficient of the transformation comprises:
calculating a tensor product of the gradient as an unbiased estimator of the linear change rate.
6. The method of claim 4, wherein determining the coefficient of the transformation comprises:
determining, based on the gradient, values of rates of change of the optimization objective with respect to individual parameters in the parameter set; and
determining the linear change rate based on the values of the rates of change.
7. The method of claim 6, wherein determining the linear change rate based on the values of the rates of change comprises:
calculating squares of the values of the rates of change; and
determining the linear change rate based on the squares of the values of the rates of change.
8. The method of claim 1, further comprising:
receiving a request for the parameter set from the working machine; and
in response to the request, sending the updated value of the parameter set to the working machine.
9. The method of claim 1, wherein the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
10. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and storing instructions which, when executed by the processing unit, perform the following actions:
receiving, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determining a difference between the previous value and a current value of the parameter set; and
updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
11. The device of claim 10, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
12. The device of claim 11, wherein updating the current value comprises:
determining a coefficient of a transformation based on the significant change trend; and
determining a delta between the current value and the updated value by applying the transformation to the difference.
13. The device of claim 12, wherein the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
14. The device of claim 13, wherein determining the coefficient of the transformation comprises:
calculating a tensor product of the gradient as an unbiased estimator of the linear change rate.
15. The device of claim 13, wherein determining the coefficient of the transformation comprises:
determining, based on the gradient, values of rates of change of the optimization objective with respect to individual parameters in the parameter set; and
determining the linear change rate based on the values of the rates of change.
16. The device of claim 15, wherein determining the linear change rate based on the values of the rates of change comprises:
calculating squares of the values of the rates of change; and
determining the linear change rate based on the squares of the values of the rates of change.
17. The device of claim 10, wherein the actions further include:
receiving a request for the parameter set from the working machine; and
in response to the request, sending the updated value of the parameter set to the working machine.
18. The device of claim 10, wherein the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
19. A computer program product, stored in a non-transitory computer storage medium and comprising machine-executable instructions which, when run on a device, cause the device to:
receive, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determine a difference between the previous value and a current value of the parameter set; and
update the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
20. The computer program product of claim 19, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
CN201610730381.8A 2016-08-25 2016-08-25 Asynchronous training of machine learning models Active CN107784364B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201610730381.8A CN107784364B (en) 2016-08-25 2016-08-25 Asynchronous training of machine learning models
EP17758721.9A EP3504666B1 (en) 2016-08-25 2017-08-17 Asychronous training of machine learning model
PCT/US2017/047247 WO2018039011A1 (en) 2016-08-25 2017-08-17 Asychronous training of machine learning model
US16/327,679 US20190197404A1 (en) 2016-08-25 2017-08-17 Asychronous training of machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610730381.8A CN107784364B (en) 2016-08-25 2016-08-25 Asynchronous training of machine learning models

Publications (2)

Publication Number Publication Date
CN107784364A true CN107784364A (en) 2018-03-09
CN107784364B CN107784364B (en) 2021-06-15

Family

ID=59738469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610730381.8A Active CN107784364B (en) 2016-08-25 2016-08-25 Asynchronous training of machine learning models

Country Status (4)

Country Link
US (1) US20190197404A1 (en)
EP (1) EP3504666B1 (en)
CN (1) CN107784364B (en)
WO (1) WO2018039011A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046475A1 (en) * 2016-08-11 2018-02-15 Twitter, Inc. Detecting scripted or otherwise anomalous interactions with social media platform
US11182674B2 (en) * 2017-03-17 2021-11-23 International Business Machines Corporation Model training by discarding relatively less relevant parameters
JP6886112B2 (en) * 2017-10-04 2021-06-16 富士通株式会社 Learning program, learning device and learning method
US11669914B2 (en) 2018-05-06 2023-06-06 Strong Force TX Portfolio 2018, LLC Adaptive intelligence and shared infrastructure lending transaction enablement platform responsive to crowd sourced information
CN112534452A (en) 2018-05-06 2021-03-19 强力交易投资组合2018有限公司 Method and system for improving machines and systems for automatically performing distributed ledger and other transactions in spot and forward markets for energy, computing, storage, and other resources
US11544782B2 (en) 2018-05-06 2023-01-03 Strong Force TX Portfolio 2018, LLC System and method of a smart contract and distributed ledger platform with blockchain custody service
US11550299B2 (en) 2020-02-03 2023-01-10 Strong Force TX Portfolio 2018, LLC Automated robotic process selection and configuration
CN109165515A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Model parameter acquisition method and system based on federated learning, and readable storage medium
US11715010B2 (en) * 2019-08-16 2023-08-01 Google Llc Cross replica reduction on networks having degraded nodes
WO2021038592A2 (en) * 2019-08-30 2021-03-04 Tata Consultancy Services Limited System and method for handling popularity bias in item recommendations
US11982993B2 (en) 2020-02-03 2024-05-14 Strong Force TX Portfolio 2018, LLC AI solution selection for an automated robotic process
CN111814965A (en) * 2020-08-14 2020-10-23 Oppo广东移动通信有限公司 Hyper-parameter adjusting method, device, equipment and storage medium
US20220076130A1 (en) * 2020-08-31 2022-03-10 International Business Machines Corporation Deep surrogate langevin sampling for multi-objective constraint black box optimization with applications to optimal inverse design problems
US11829799B2 (en) 2020-10-13 2023-11-28 International Business Machines Corporation Distributed resource-aware training of machine learning pipelines
CN117151239B (en) * 2023-03-17 2024-08-27 荣耀终端有限公司 Gradient updating method and related device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152676B1 (en) * 2013-11-22 2018-12-11 Amazon Technologies, Inc. Distributed training of models using stochastic gradient descent

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009742A1 (en) * 2000-12-06 2003-01-09 Bass Michael D. Automated job training and performance tool
US8027938B1 (en) * 2007-03-26 2011-09-27 Google Inc. Discriminative training in machine learning
US20140379386A1 (en) * 2013-06-25 2014-12-25 Arthur Paul Drennan, III System and method for evaluating text to support multiple insurance applications
CN104346214A (en) * 2013-07-30 2015-02-11 中国银联股份有限公司 Device and method for managing asynchronous tasks in distributed environments
CN105683944A (en) * 2013-11-04 2016-06-15 谷歌公司 Systems and methods for layered training in machine-learning architectures
US9269057B1 (en) * 2013-12-11 2016-02-23 Google, Inc. Using specialized workers to improve performance in machine learning
CN105894087A (en) * 2015-01-26 2016-08-24 华为技术有限公司 System and method for training parameter set in neural network
CN105022699A (en) * 2015-07-14 2015-11-04 惠龙易通国际物流股份有限公司 Cache region data preprocessing method and system
CN105825269A (en) * 2016-03-15 2016-08-03 中国科学院计算技术研究所 Parallel autoencoder based feature learning method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI MENG et al.: "Asynchronous Accelerated Stochastic Gradient Descent", Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322020B (en) * 2018-03-28 2023-05-12 国际商业机器公司 Adaptive learning rate scheduling for distributed random gradient descent
CN110322020A (en) * 2018-03-28 2019-10-11 国际商业机器公司 Adaptive learning rate scheduling for distributed stochastic gradient descent
CN112166444A (en) * 2018-05-31 2021-01-01 微软技术许可有限责任公司 Distributed computing system with integrated data as a service feedback loop engine
CN109032630A (en) * 2018-06-29 2018-12-18 电子科技大学 Method for updating global parameters in a parameter server
CN109102075A (en) * 2018-07-26 2018-12-28 联想(北京)有限公司 Gradient update method and related device in distributed training
CN108803348A (en) * 2018-08-03 2018-11-13 北京深度奇点科技有限公司 PID parameter optimization method and PID parameter optimization device
CN108803348B (en) * 2018-08-03 2021-07-13 北京深度奇点科技有限公司 PID parameter optimization method and PID parameter optimization device
CN109508785A (en) * 2018-10-29 2019-03-22 清华大学 Asynchronous parallel optimization method for neural network training
CN109460826A (en) * 2018-10-31 2019-03-12 北京字节跳动网络技术有限公司 Method and apparatus for distributing data, and model update system
CN110070116B (en) * 2019-04-08 2022-09-20 云南大学 Segmented selection integration image classification method based on deep tree training strategy
CN110070116A (en) * 2019-04-08 2019-07-30 云南大学 Segmented selection integration image classification method based on deep tree training strategy
CN110689137A (en) * 2019-09-24 2020-01-14 网易传媒科技(北京)有限公司 Parameter determination method, system, medium, and electronic device
CN112906792A (en) * 2021-02-22 2021-06-04 中国科学技术大学 Image recognition model rapid training method and system based on many-core processor

Also Published As

Publication number Publication date
WO2018039011A1 (en) 2018-03-01
EP3504666B1 (en) 2021-02-17
US20190197404A1 (en) 2019-06-27
CN107784364B (en) 2021-06-15
EP3504666A1 (en) 2019-07-03

Similar Documents

Publication Publication Date Title
CN107784364A (en) The asynchronous training of machine learning model
Jørgensen et al. Exploiting the causal tensor network structure of quantum processes to efficiently simulate non-Markovian path integrals
Binois et al. Practical heteroscedastic Gaussian process modeling for large simulation experiments
Xue et al. Amortized finite element analysis for fast pde-constrained optimization
Wang et al. Reduced-order deep learning for flow dynamics. The interplay between deep learning and model reduction
Ferdinand et al. Anytime stochastic gradient descent: A time to hear from all the workers
US11675951B2 (en) Methods and systems for congestion prediction in logic synthesis using graph neural networks
Kallrath Polylithic modeling and solution approaches using algebraic modeling systems
EP2680157B1 (en) Co-simulation procedures using full derivatives of output variables
CN113377964A (en) Knowledge graph link prediction method, device, equipment and storage medium
Korondi et al. Multi-fidelity design optimisation strategy under uncertainty with limited computational budget
CN114840322A (en) Task scheduling method and device, electronic equipment and storage
Tang et al. DAS: A deep adaptive sampling method for solving partial differential equations
Kaneda et al. A deep conjugate direction method for iteratively solving linear systems
KR20230132369A (en) Reducing resources in quantum circuits
CN110377769A (en) Modeling Platform system, method, server and medium based on graph data structure
CN113508404A (en) Apparatus and method for quantum circuit simulator
CN110009091A (en) Optimization of the learning network in Class Spaces
Javed et al. Random neural network learning heuristics
Borovska et al. Searchless Intelligent System of Modern Production Control
Chang et al. Designing a Framework for Solving Multiobjective Simulation Optimization Problems
Hoda et al. A gradient-based approach for computing Nash equilibria of large sequential games
WO2020149919A1 (en) Inertial damping for enhanced simulation of elastic bodies
Charlier et al. VecHGrad for solving accurately complex tensor decomposition
Wu et al. Finding quantum many-body ground states with artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant