CN107784364A - Asynchronous training of machine learning models - Google Patents


Info

Publication number
CN107784364A
CN107784364A (application CN201610730381.8A); granted as CN107784364B
Authority
CN
China
Prior art keywords
value
parameter set
rate
current value
worker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610730381.8A
Other languages
Chinese (zh)
Other versions
CN107784364B (en)
Inventor
王太峰
陈薇
刘铁岩
高飞
叶启威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to CN201610730381.8A (granted as CN107784364B)
Priority to EP17758721.9A (EP3504666B1)
Priority to PCT/US2017/047247 (WO2018039011A1)
Priority to US16/327,679 (US20190197404A1)
Publication of CN107784364A
Application granted
Publication of CN107784364B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present disclosure relate to asynchronous training of a machine learning model. A server receives, from a worker, feedback data generated by training the machine learning model. The feedback data is obtained by the worker using its own training data and is associated with a previous value that the parameter set of the machine learning model had at that particular worker. The server determines the difference between that previous value and the current value of the parameter set at the server; the current value may have undergone one or more updates in the meantime due to the operation of other workers. The server can then update the current value of the parameter set based on the feedback data and the difference between the two values of the parameter set. The update thus not only takes the training result of each worker into account, but also appropriately compensates for the delay between different workers.

Description

Asynchronous training of machine learning models
Background
Machine learning is widely applied in numerous areas such as speech recognition, computer vision and natural language processing. For example, deep neural networks (DNNs) are machine learning models with multiple layers and a large number of parameters that can be trained in parallel on the basis of big data and powerful computing resources. In the training stage, one or more parameters of the model need to be trained according to a given training data set and an optimization objective. For the training of a neural network, for example, stochastic gradient descent may be used.
It is known to distribute the training data set among multiple workers. The workers optimize the model parameters using their respective training data and return their results to a central server. A key problem of distributed, in other words asynchronous, model training, however, is the mismatch between the workers. For example, by the time a worker returns its parameter update, the model parameters at the server may already have been updated one or more times by other workers. In the asynchronous training of a machine learning model it is therefore desirable to reduce or eliminate this delay, or mismatch.
Summary
All traditional schemes are based on the theoretical understanding that the delay, or mismatch, between workers is caused by factors such as differences in performance between the workers and/or the server and inconsistencies in the communication between workers. The focus of traditional schemes is therefore to reduce the delay, for example by optimized scheduling. The inventors have found through research, however, that this delay is in fact intrinsic to the asynchronous architecture and cannot be eliminated by optimized scheduling. Embodiments of the present disclosure therefore aim to appropriately compensate for the delay between workers rather than attempt to eliminate it, which differs markedly in principle and mechanism from any known scheme.
In general, in accordance with embodiments of the present disclosure, a server receives, from a worker, feedback data generated by training a machine learning model. The feedback data is obtained by the worker using its own training data and is associated with a previous value that the parameter set of the machine learning model had at that particular worker. The server determines the difference between that previous value and the current value of the parameter set at the server. It will be understood that the current value may have undergone one or more updates due to the operation of other workers. The server can then update the current value of the parameter set based on the feedback data and the difference between the values of the parameter set. The update thus not only takes the training result of each worker into account, but also appropriately compensates for the delay between workers. Practice has shown that, compared with traditional schemes that attempt to forcibly eliminate the delay, embodiments of the present disclosure can significantly reduce the mismatch between workers and achieve effective and efficient asynchronous training of machine learning models.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief description of the drawings
Fig. 1 shows a block diagram of an environment in which embodiments of the present disclosure can be implemented;
Fig. 2 shows a flowchart of a method of training a model in accordance with an embodiment of the present disclosure;
Figs. 3A-3D show performance comparisons between a scheme in accordance with an embodiment of the present disclosure and traditional schemes;
Figs. 4A-4D show performance comparisons between a scheme in accordance with an embodiment of the present disclosure and traditional schemes; and
Fig. 5 shows a block diagram of a computing system/server in which one or more embodiments of the present disclosure can be implemented.
Throughout the drawings, the same or similar reference symbols are used to denote the same or similar elements.
Detailed description
The present disclosure is now discussed with reference to several embodiments. It should be appreciated that these embodiments are discussed only to enable those of ordinary skill in the art to better understand and thus implement the present disclosure, and do not imply any limitation on the scope of the present disclosure.
As used herein, the term "comprising" and its variants are to be read as open-ended terms meaning "including but not limited to". The term "based on" is to be read as "based at least in part on". The terms "an embodiment" and "one embodiment" are to be read as "at least one embodiment". The term "another embodiment" is to be read as "at least one other embodiment". The terms "first", "second" and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
Asynchronous training framework
Fig. 1 shows a block diagram of a parallel computing environment 100 in which embodiments of the present disclosure can be implemented. It is to be understood that the structure and function of the environment 100 are described only for illustrative purposes and do not imply any limitation on the scope of the present disclosure. The present disclosure can be embodied in different structures and/or functions.
The parallel computing environment 100 includes a server 102, a worker 104 and a worker 106. It should be appreciated that the numbers of servers and workers shown in Fig. 1 are given only for illustrative purposes and are not intended to be limiting; there may be any appropriate number of servers and workers. For example, in a parameter-server architecture, the server 102 may be implemented in a distributed manner by multiple servers. In some embodiments, the workers 104 and 106 may be implemented by one or more graphics processing units (GPUs) or GPU clusters.
In operation, each worker has its own training data. For example, the training data of each worker may be a subset of the full training data set, obtained by random sampling from the full set. According to a preset optimization objective, each worker trains the model independently and returns its result to the server 102. The server 102 updates the parameters of the model according to the feedback results of the workers 104 and 106, so that the parameters eventually satisfy the optimization objective. As described above, in this process the delay and mismatch between the workers 104 and 106 are the main bottleneck limiting the training effect.
General principle
For a clearer description of the present disclosure, the following discussion is given in the context of multi-class learning based on a neural network model. It will be appreciated, however, that the concepts of the present disclosure are applicable to various appropriate machine learning models, in particular neural network models.
In a multi-class classification problem, let X ⊆ R^d denote the input space, Y = {1, 2, ..., K} denote the output space, and P denote the underlying joint distribution over the space X × Y, where d denotes the dimension of the input space, R denotes the set of real numbers, and K denotes the number of classes in the output space.
In the training process, a training set {(x_1, y_1), ..., (x_S, y_S)} containing multiple data items usually needs to be provided, where S denotes the number of elements of the training set. Each element of the training set consists of a pair of an input and an output, and can be regarded as sampled independently and identically distributed (i.i.d.) from the distribution P; for example, the element (x_1, y_1) denotes the data item consisting of the input x_1 and the output y_1. The overall goal of training is to learn, based on the training set, a neural network model drawn from a space of mappings from the input space to real-valued vectors. It will be appreciated, however, that the term "training" as used herein may also refer to part of a training process. The parameter set of the model can be represented by an n-dimensional real vector, i.e. w ∈ R^n, where n is a natural number. That is, the parameter set may include one or more parameters. For brevity, the parameter set is sometimes also referred to simply as a parameter or a parameter vector.
A neural network model generally has a layered structure, in which each node computes a linear combination of the connected nodes of the previous layer followed by a nonlinear activation. The model parameters represent the weights on the edges between two layers. For each input, the neural network model produces an output vector representing the likelihood that the input belongs to each of the different classes.
Since the underlying distribution P is unknown, the commonly used approach for learning or training the model is to minimize a loss function. A loss function is one form of optimization objective. Alternatively, the model may be trained by maximizing a utility function. A utility function generally has an equivalent loss-function form; for example, it may be represented by the negative of a loss function. For the sake of simplicity, therefore, embodiments of the present disclosure are described below mainly in terms of a loss function.
A loss function may represent a measure of the overall loss for model optimization, where the loss may reflect various factors such as misclassification error. A commonly used loss function for deep neural networks is the cross-entropy loss function, which can be defined as
$$f(x, y; w) = -\sum_{k=1}^{K} I_{[y=k]} \log \sigma_k(x; w) \qquad (1)$$
where I denotes the indicator function, log denotes the logarithmic function, and σ_k(x; w) denotes the k-th output of the softmax operation applied to the outputs of the network. The softmax operation is well known in the art and is widely used in multi-class learning problems, and is therefore not described in further detail here.
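As a concrete illustration (not part of the original patent text), the following is a minimal numerical sketch of the cross-entropy loss of equation (1), computing the softmax probabilities σ_k from raw network outputs; the function and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy_loss(scores, label):
    """Cross-entropy loss of equation (1) for a single example.

    scores: length-K vector of raw network outputs for one input x.
    label:  integer class index y in {0, ..., K-1}.
    """
    probs = softmax(scores)          # sigma_k(x; w) for k = 1..K
    return -np.log(probs[label])     # only the indicator-selected term survives

# Example: 3-class problem, true class index 2.
print(cross_entropy_loss(np.array([1.0, 0.5, 2.0]), 2))
```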
In minimizing the empirical loss function, it is usually necessary to set an initial value for w and then change w iteratively according to the training data until it converges to a parameter value w that minimizes the loss function.
As described above, in synchronous stochastic gradient descent, each worker (for example, the workers 104 and 106) computes gradients based on its own mini-batch of data, and these gradients are added to the global model. By setting a barrier, the local workers wait for one another until the gradients from all local workers have been added to the global model. Because of the barrier, however, the training speed of the model is determined by the slowest worker. To improve training efficiency, asynchronous stochastic gradient descent can be used. In asynchronous stochastic gradient descent no barrier is used: after a worker adds its gradient to the global model, it continues the training process without waiting. Asynchronous stochastic gradient descent is therefore more efficient than synchronous stochastic gradient descent because there is no waiting time.
The computing environment 100 shown in Fig. 1 can be used to implement asynchronous stochastic gradient descent. At time t, the worker 104 receives the model parameters w_t from the server 102. For example, the worker 104 may send the server 102 a request for the current model parameters, and the server 102, upon receiving the request, sends the model parameters w_t to the worker 104. The worker 104 then computes a local gradient g(w_t) according to the data x_t. The worker may receive the data x_t from another server, and the data x_t may be obtained by random sampling from the training set. The number of data items x_t may be one or more; the present disclosure is not limited in this respect. It should be understood that the term "stochastic gradient" covers not only the case of training on a single data item x_t but also the case of training on multiple data items x_t, the latter being sometimes referred to as a mini-batch gradient.
At the worker 104, the local gradient g(w_t) can be obtained by computation. For example, the value of g(w_t) can be obtained by substituting the data x_t into the expression for the gradient. The worker 104 then sends the gradient to the server 102, and the server 102 adds the local gradient g(w_t) to the global model parameters. However, as shown in Fig. 1, before this happens another τ workers may already have added their local gradients to the global model parameters, so that the global model parameters have been updated τ times and have become w_{t+τ}. The traditional asynchronous stochastic gradient descent algorithm ignores this problem and simply adds the local gradient g(w_t) to the global model parameters w_{t+τ}:
$$w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t) \qquad (2)$$
where η denotes the learning rate; this equation is commonly referred to as the update rule. It can be seen that the update rule of asynchronous stochastic gradient descent is not equivalent to the update rule of sequential stochastic gradient descent (also referred to as single-machine stochastic gradient descent). In asynchronous stochastic gradient descent, a "delayed" or "stale" local gradient g(w_t) is in fact added to the current global model parameters w_{t+τ}, whereas in sequential stochastic gradient descent the global model parameters w_{t+τ} should be updated based on the gradient with respect to w_{t+τ}, i.e. g(w_{t+τ}).
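To make the staleness in update rule (2) concrete, here is a minimal single-process simulation on a toy objective (an illustrative sketch, not taken from the patent): the server applies whatever gradient arrives, even though it was computed against an older copy of the parameters.

```python
import numpy as np

# Toy objective: f(w) = 0.5 * ||w||^2, so g(w) = w.
def gradient(w):
    return w

eta = 0.1
w_server = np.array([1.0, -2.0])   # global model parameters at the server

# Worker pulls parameters at "time t" ...
w_t = w_server.copy()

# ... meanwhile other workers update the global model tau times.
tau = 3
for _ in range(tau):
    w_server = w_server - eta * gradient(w_server)

# The worker's gradient g(w_t) is now stale, but plain asynchronous SGD
# applies it to w_{t+tau} anyway, exactly as in equation (2).
w_server = w_server - eta * gradient(w_t)

# Sequential SGD would instead have used g(w_{t+tau}); the gap between
# g(w_t) and g(w_{t+tau}) is the delay the disclosure compensates for.
```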
Traditional theory holds that asynchronous stochastic gradient descent needs more iterations to reach the same accuracy as sequential stochastic gradient descent. In some cases asynchronous stochastic gradient descent cannot even reach the same accuracy as sequential stochastic gradient descent, particularly when the number of workers is large. Various methods exist to address this problem in asynchronous stochastic gradient descent. In some schemes, the delay of the local gradients is reduced by setting various scheduling strategies. In some schemes, smaller weights are assigned to local gradients with larger delays and larger weights to local gradients with smaller delays. As another example, local gradients whose delay exceeds a certain threshold are discarded, and so on. None of these methods, however, makes full use of every local gradient, and they therefore waste computing resources to some extent.
The traditional understanding is that this delay is caused by differences in performance between workers and/or the server and by inconsistencies in the communication between workers, and that the delay should therefore be reduced by methods such as optimized scheduling. The inventors have recognized, however, that this understanding is inapt. The delay is intrinsic to the asynchronous architecture and cannot be completely eliminated. As shown in Fig. 1, when the local gradient g(w_t) is added to the global model parameters w_{t+τ}, there will necessarily be a delay of τ updates. Therefore, in accordance with embodiments of the present disclosure, in asynchronous model training the delay of the different workers (for example, the delay of the gradients) is appropriately compensated for, rather than attempting to reduce the delay.
Example process
The principles in accordance with embodiments of the present disclosure have been described above in conjunction with Fig. 1. It should be appreciated that these principles can easily be extended to any appropriate model and scenario in which asynchronous stochastic gradient descent is used. An example process of an embodiment of the present disclosure is described below with reference to Fig. 2. For convenience of description, Fig. 2 is described in conjunction with the computing environment 100 of Fig. 1.
At 202, the server 102 receives from the worker 104, at time t, a request for the current value w_t of the model parameters. Since the model parameters are generally represented by a plurality of parameters, they are also referred to as a parameter set. In response to the request, the server 102 sends the current value w_t of the model parameters to the worker 104. In some embodiments, the time t may be represented by a count; for example, each time the global model parameters w_t are updated, the time t is incremented by one count.
The worker 104 thus obtains the current value w_t of the global model parameters from the server 102. In addition, the worker 104 may receive one or more items of training data x_t from a device such as a server hosting the data set. The number of training data items, also referred to as the mini-batch size, depends on the user's setting. In some embodiments, the training data may be obtained by random sampling from the data set.
The worker 104 generates feedback data by training the model. The feedback data is associated with the current value w_t of the model parameters. The training process performed by the worker 104 may be part of the overall training process of the model. As an example, in some embodiments the feedback data indicates a significant change trend of the optimization objective of the model with respect to the current value w_t of the model parameters. For example, in some embodiments the significant change trend may be the maximal change trend, and may thus be represented by the gradient g(w_t) of the optimization objective with respect to the current value w_t of the model parameters.
In particular, it should be noted that the scope of the present disclosure is not limited to the mathematical characterization of the "significant change trend" or of any other physical quantity. The mathematical characterizations used here (for example, mathematical quantities, expressions, formulas and the like) are described merely as examples, and their sole purpose is to help those skilled in the art understand the ideas and implementations of the present disclosure.
As described above, the optimization objective may be represented by a loss function; therefore, for discussion purposes, the discussion continues with reference to a loss function. The worker 104 can compute, according to the training data, the local gradient g(w_t) of the loss function at the current value w_t of the parameters. For many common loss functions, the local gradient g(w_t) has an explicit expression, so the value of the local gradient g(w_t) can be obtained by substituting the training data into that expression. In this case, the training process performed by the worker 104 simply determines the local gradient g(w_t) based on the explicit expression, i.e. it is part of the overall training process of the model.
The worker 104 then sends the feedback data (for example, the local gradient g(w_t)) back to the server 102. In some embodiments, when the worker 104 obtains the current value w_t of the global model parameters from the server 102, the server 102 stores the current value w_t of the model parameters as backup model parameters w_bak(m), where m denotes the identifier of the worker 104.
As shown in Fig. 1, at time t+τ the server 102 receives from the worker 104 the feedback data generated by training the model, for example the local gradient g(w_t). By this time the current value of the model parameters has been updated to w_{t+τ}, so in some embodiments the feedback data essentially represents the significant change trend of the optimization objective of the model with respect to the previous value w_t of the model parameters.
Still referring to Fig. 2, at 204 the server 102 can determine the difference between the previous value w_t of the model parameters and the current value w_{t+τ}. At 206, the server 102 updates the current value w_{t+τ} of the model parameters based on the feedback data and the difference, thereby obtaining the updated value w_{t+τ+1} of the model parameters.
In some embodiments, the previous model parameters w_t may be the model parameters stored as the backup model parameters w_bak(m), as described above. The update amount of the model parameters, i.e. the difference between the updated value w_{t+τ+1} and the current value w_{t+τ} of the parameters, can be regarded as a transformation of the difference between the current value w_{t+τ} of the model parameters and the previous value w_t. In some embodiments, the coefficient of this transformation can be determined based on the significant change trend, and the difference between the current value and the updated value of the model parameters, i.e. the update amount of the model parameters, can be determined by applying this transformation to that difference. By applying this update amount to the current value w_{t+τ}, the updated value w_{t+τ+1} of the parameter set can be obtained. For example, the transformation may be a linear transformation and the coefficient of the transformation may be a linear change rate; of course, any other appropriate transformation is also feasible.
For discussion purposes, the following discussion again refers to asynchronous stochastic gradient descent. Note again, however, that this is intended only to explain the principles and ideas of the present disclosure and is not intended to limit the scope of the present disclosure in any way.
Ideally, asynchronous stochastic gradient descent would, like sequential stochastic gradient descent, add the gradient g(w_{t+τ}) to the global model parameters w_{t+τ}. Asynchronous stochastic gradient descent, however, adds the delayed gradient g(w_t) to the global model parameters w_{t+τ}.
This gap can be illustrated by a Taylor expansion. For example, the Taylor expansion of g(w_{t+τ}) at w_t can be expressed as
$$g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) + \mathcal{O}\big((w_{t+\tau} - w_t)^2\big)\, I_n \qquad (3)$$
where I_n denotes the n-dimensional unit vector and the calligraphic O symbol denotes the higher-order terms of order two and above.
It can be seen from equation (3) that the asynchronous stochastic gradient method uses the zeroth-order term of the Taylor expansion as an approximation of the gradient g(w_{t+τ}) and ignores all the other terms. Therefore, to fully close the gap between the delayed gradient g(w_t) and the gradient g(w_{t+τ}), all the other terms would have to be taken into account and computed separately. This is impractical, however, because it involves computing the sum of infinitely many terms. In accordance with embodiments of the present disclosure, only the zeroth-order and first-order terms of the Taylor expansion are retained, and only the simplest compensation is applied to the delayed gradient:
$$g(w_{t+\tau}) \approx g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) \qquad (4)$$
The first derivative of the gradient reflects the rate of change of the gradient and corresponds to the second derivative of the loss function (for example, the cross-entropy loss function represented by equation (1)). The first derivative of the gradient can be represented by the Hessian matrix, which can be defined as Hf(w) = [h_ij], i, j = 1, ..., n, where h_ij = ∂²f(w)/∂w_i∂w_j.
Thus, by combining equation (4) with the update rule of equation (2), the update amount for the parameters can be determined. The update amount consists of two terms: the first term is the product of the delayed gradient and the learning rate, and the second term is a compensation term. The update amount of the parameters can therefore be regarded as a linear transformation of the difference between the current value w_{t+τ} of the parameter set and the previous value w_t, with the linear change rate being the product of the learning rate and the Hessian matrix. Since the learning rate is an empirical parameter that can be predefined, the linear change rate can be regarded as equivalent to the Hessian matrix.
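Written out explicitly (a reconstruction assembled from equations (2) and (4), not reproduced verbatim from the original), the delay-compensated update described in the preceding paragraph reads:

```latex
\begin{align*}
w_{t+\tau+1}
  &= w_{t+\tau} - \eta\, g(w_{t+\tau})
     && \text{(ideal sequential update)} \\
  &\approx w_{t+\tau} - \eta \Big( g(w_t)
     + \nabla g(w_t)\,\big(w_{t+\tau} - w_t\big) \Big)
     && \text{(delay-compensated update, from (2) and (4))}
\end{align*}
% Here \eta\,\nabla g(w_t) = \eta\, Hf(w_t) plays the role of the linear
% change rate applied to the difference (w_{t+\tau} - w_t).
```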
In some embodiments the Hessian matrix can be computed directly, but this process may be difficult. For example, for a neural network model with one million parameters, the corresponding Hessian matrix would have one trillion elements. The computational complexity of obtaining such a huge matrix is therefore very high, and such a matrix also requires a large amount of storage space. Accordingly, in other embodiments the Hessian matrix is approximated by an approximator that is comparatively easy to compute and/or store, so that the delay compensation is easy to implement.
For example, in some embodiments it is desirable to obtain an approximation of the first derivative of the gradient (for example, of the Hessian matrix Hf(w)) based on the feedback data itself (for example, the gradient g(w_t)). In this way the computational complexity is not significantly increased.
For discussion purposes, assume that the model is a neural network model and that its optimization objective is represented by the cross-entropy loss function commonly used in neural network models. Accordingly, for the cross-entropy loss function, assume that Y is a discrete random variable following the distribution P(Y = k) = σ_k(x; w), where k ∈ {1, 2, ..., K}. It can then be proven that
$$\mathbb{E}_{Y}\!\left[\frac{\partial^2 f(x, Y; w)}{\partial w\, \partial w}\right] = \mathbb{E}_{Y}\!\left[\frac{\partial f(x, Y; w)}{\partial w} \otimes \frac{\partial f(x, Y; w)}{\partial w}\right]$$
where ⊗ denotes the outer product (tensor product) of vectors, the left-hand side denotes the expected value of the second derivative of the loss function with respect to the model parameters when Y follows the above distribution, and the right-hand side denotes the expected value of the outer product of the first derivative of the loss function with respect to the model parameters when Y follows the same distribution. For brevity, the detailed proof is not described here.
For the cross-entropy loss function, as a consequence of this identity, Gf(x, y, w) is an unbiased estimator of Hf(x, y, w), where Gf(w) denotes the outer product matrix of the gradient vector g(w), i.e. Gf(w) = [g_ij], i, j = 1, ..., n, with g_ij denoting the elements of the matrix Gf(w). As noted above, since the Hessian matrix can be regarded as the linear change rate of the linear transformation used for the parameter update, the tensor product can be used as an unbiased estimate of the linear change rate.
The matrix Gf(w) can be obtained by performing the tensor product operation on the gradient vector g(w). Since the tensor product operation has relatively low computational complexity, the computation cost can be reduced significantly. Moreover, in such embodiments, because this matrix simply takes the place of the Hessian matrix in the linear transformation, there is no need to use additional storage space to store additional variables, and no substantial demand is placed on the storage space.
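As an illustration (a sketch, not code from the patent), the outer-product estimator can be formed from a gradient vector with a single call; note that its diagonal is simply the element-wise square of the gradient, which is what the diagonal approximations below rely on.

```python
import numpy as np

g = np.array([0.3, -1.2, 0.7])   # local gradient g(w_t) received from a worker

G = np.outer(g, g)               # Gf(w) = g ⊗ g, unbiased estimator of the Hessian
diag_G = np.diag(G)              # diagonal of Gf(w), equal to g * g element-wise

assert np.allclose(diag_G, g * g)
```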
It will be appreciated that although the outer product matrix of the gradient vector g(w) has been shown here to be an unbiased estimator of the Hessian matrix for the cross-entropy loss function, this conclusion can easily be applied to an unknown loss function or optimization objective, as long as the error of the estimate lies within an acceptable tolerance range.
However, because the unbiased estimator of the Hessian matrix does not take the influence of variance into account, it may lead to a relatively high approximation error. Therefore, in some embodiments, bias and variance may be considered simultaneously. For example, the mean square error may be used to characterize the quality of the approximation.
In some embodiments, in order to reduce the variance, a further approximation of the Hessian matrix, λ1 √Gf(w) (with the square root taken element-wise, so that its diagonal elements are λ1|g_i(w)|), may be used, where λ1 denotes a control parameter. To simplify notation, for all σ_k the range of the relevant quantity is written as [l_i, u_i], and corresponding auxiliary quantities are defined from these bounds.
For the cross-entropy loss function it can be proven that if the control parameter λ1 satisfies sign(λ1) = sign(g_ij) together with an appropriate bound on its magnitude determined by these quantities, then the mean square error of the resulting approximator is smaller than the mean square error of the approximator g_ij, where sign denotes the sign function. Therefore, by setting λ1 appropriately, a mean square error smaller than that of g_ij can be achieved.
It should be appreciated that the above approximation is merely exemplary and is not intended to limit the scope of the present disclosure. For example, in other embodiments, in order to reduce the variance, another approximation of the Hessian matrix, λ2 Gf(w), may be used, where λ2 denotes a control parameter.
For the cross-entropy loss function it can be proven that if λ2 ∈ [0, 1] and σ_k satisfies an appropriate condition, then the mean square error of the approximator λ2 g_ij is smaller than the mean square error of the approximator g_ij. Therefore, by setting λ2 appropriately, λ2 g_ij can achieve a mean square error smaller than that of g_ij.
The above conclusions give ranges of the control parameters λ1 and λ2 for the cross-entropy loss function, but it should be understood that corresponding appropriate ranges may also exist for an unknown loss function. Moreover, in a specific implementation, the ranges of the control parameters λ1 and λ2 can be configured empirically, within a wider range, according to the specific implementation situation.
The approximations λ1 √Gf(w) and λ2 Gf(w) can easily be computed based on the local gradient g(w_t) and achieve a good balance between the bias and the variance of the approximation. The Hessian matrix can thus be approximated in a more economical manner.
Further, in some embodiments, the magnitude of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient g(w_t), where the magnitude of a derivative denotes its size or absolute value. The linear change rate can then be determined based on the magnitudes of the derivatives. Further specific embodiments are described below in conjunction with the approximations λ1 √Gf(w) and λ2 Gf(w), respectively.
In some embodiments, only the diagonal elements of the approximation λ1 √Gf(w) of the Hessian matrix may be used. The update rule for the global model parameters then becomes
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_1 \operatorname{diag}\!\big(\sqrt{Gf(w_t)}\big) \odot (w_{t+\tau} - w_t)\right) \qquad (6)$$
which is equivalent to
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_1\, |g(w_t)| \odot (w_{t+\tau} - w_t)\right) \qquad (7)$$
where diag denotes taking the diagonal elements of a matrix and ⊙ denotes element-wise multiplication; equations (6) and (7) are two fully equivalent representations.
According to equation (7), the magnitude of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient g(w_t). Mathematically, the magnitude of a derivative can be represented by its absolute value. In some embodiments, the magnitudes of the derivatives can be used directly to determine the linear change rate and thereby the compensation term. In other words, the vector formed by the absolute values of the elements of g(w_t) can be used as the linear change rate, and the linear change rate and the compensation term differ by at most an adjustment factor, such as the product of the learning rate η and the control parameter λ1. Since the absolute-value operation has very low computational complexity, the computation cost can be significantly reduced. In addition, no additional variables need to be stored in additional storage space, so no additional demand is placed on the storage space.
Alternatively, in other embodiments, only the diagonal elements of the approximation λ2 Gf(w) of the Hessian matrix may be used. The update rule for the global model parameters then becomes
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_2 \operatorname{diag}\!\big(Gf(w_t)\big) \odot (w_{t+\tau} - w_t)\right) \qquad (8)$$
which is equivalent to
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_2\, g(w_t) \odot g(w_t) \odot (w_{t+\tau} - w_t)\right) \qquad (9)$$
According to equation (9), the square of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient g(w_t), and the linear change rate, and hence the compensation term, can be determined directly from the squares of the derivatives. In other words, in such embodiments the vector formed by the squares of the absolute values of the elements of g(w_t) (rather than the absolute values themselves) can be used as the linear change rate, and the linear change rate and the compensation term differ by at most an adjustment factor, such as the product of the learning rate η and the control parameter λ2. Since the squaring operation has very low computational complexity, the computation cost can be significantly reduced. In addition, no additional variables need to be stored in additional storage space, so no additional demand is placed on the storage space.
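The following is a minimal sketch of the delay-compensated server-side update of equations (7) and (9) (an illustration under the notation above; the function name dc_update and its arguments are assumptions, not taken from the patent).

```python
import numpy as np

def dc_update(w_current, w_backup, g, eta, lam, variant="abs"):
    """Delay-compensated parameter update at the server.

    w_current: w_{t+tau}, current global parameters.
    w_backup:  w_t, the parameters the worker used to compute its gradient
               (stored by the server as w_bak(m) for that worker).
    g:         delayed local gradient g(w_t) received from the worker.
    variant:   "abs"    -> compensation rate lam * |g|   (equation (7))
               "square" -> compensation rate lam * g * g (equation (9))
    """
    if variant == "abs":
        rate = lam * np.abs(g)   # diagonal of lambda_1 * sqrt(Gf(w_t))
    else:
        rate = lam * g * g       # diagonal of lambda_2 * Gf(w_t)
    compensation = rate * (w_current - w_backup)
    return w_current - eta * (g + compensation)

# Example usage with toy values:
w_cur = np.array([0.8, -1.1])
w_bak = np.array([1.0, -2.0])
g_t = np.array([0.5, -0.3])
w_new = dc_update(w_cur, w_bak, g_t, eta=0.1, lam=0.04, variant="square")
```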
Since the learning rate η may decrease continuously as the training of the model progresses, the control parameter may need to be adjusted accordingly. As can be seen from the update rules above, the coefficient of the compensation term is affected by the product of the control parameter itself and the learning rate. In some embodiments, the control parameter is adjusted so that the product of the control parameter and the learning rate remains substantially unchanged. In this case, the overall control coefficient of the compensation term can also be regarded as remaining constant.
In some embodiments, as shown in Fig. 1, when the model parameters are updated from w_{t+τ} to w_{t+τ+1}, the server 102 may automatically send the updated model parameters to the worker 104. Alternatively or additionally, the server 102 may send the updated model parameters w_{t+τ+1} to the worker 104 in response to a request for the model parameters from the worker 104.
In accordance with embodiments of the present disclosure, relative to traditional asynchronous stochastic gradient descent, the workers (for example, the workers 104 and/or 106) do not need to perform any additional computation; they only need to compute the local gradient g(w_t). The server 102 also only needs to perform additional computations of fairly low complexity. Even in the approximation that requires computing the outer product matrix Gf(w), only an outer product of vectors needs to be computed. When only the diagonal elements of the approximations λ1 √Gf(w) and λ2 Gf(w) are considered, the computational complexity is reduced even further.
For each worker, the server 102 only needs to store the corresponding backup model parameters w_bak(m), where m may be 1, 2, ..., M and M denotes the total number of workers. This generally does not affect or degrade system performance. In some embodiments, the server 102 is implemented in a distributed manner, so the capacity of its available storage space is considerably larger than that of a single machine. Alternatively or additionally, a worker (for example, the worker 104 and/or 106) may send its corresponding global parameters to the server 102 at the same time as it sends its gradient. In this way, no large-capacity storage needs to be deployed on the server 102 side, but the communication cost between the workers (for example, the workers 104 and/or 106) and the server is doubled.
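Putting the pieces together, the server/worker interaction described above can be sketched as follows (illustrative only; names such as ParameterServer and worker_step are hypothetical and not from the patent). The server keeps one backup copy w_bak(m) per worker when parameters are pulled and applies the delay-compensated update when that worker's gradient arrives.

```python
import numpy as np

class ParameterServer:
    def __init__(self, w_init, eta, lam):
        self.w = np.asarray(w_init, dtype=float)   # global parameters
        self.eta, self.lam = eta, lam
        self.backup = {}                            # w_bak(m), keyed by worker id

    def pull(self, worker_id):
        # Worker m requests the current parameters; remember what it received.
        self.backup[worker_id] = self.w.copy()
        return self.w.copy()

    def push(self, worker_id, g):
        # Delay-compensated update (equation (9) variant) using the backup.
        w_bak = self.backup[worker_id]
        compensation = self.lam * g * g * (self.w - w_bak)
        self.w = self.w - self.eta * (g + compensation)
        return self.w.copy()

def worker_step(server, worker_id, local_gradient_fn):
    w_t = server.pull(worker_id)       # obtain w_t; the server backs it up
    g = local_gradient_fn(w_t)         # compute g(w_t) on the worker's mini-batch
    return server.push(worker_id, g)   # server compensates for the delay

# Toy usage: two workers optimizing f(w) = 0.5 * ||w||^2, whose gradient is g(w) = w.
server = ParameterServer(w_init=[1.0, -2.0], eta=0.1, lam=0.04)
for step in range(5):
    for m in (0, 1):
        worker_step(server, m, lambda w: w)
```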
Experiments and performance
Figs. 3A-3D and Figs. 4A-4D show performance comparisons between a scheme in accordance with an embodiment of the present disclosure and traditional schemes. The experiments of Figs. 3A-3D and Figs. 4A-4D were carried out on the CIFAR-10 data set. For all methods, 160 iterations were run, the mini-batch size was 128, and the initial learning rate was 0.5, which was reduced to one tenth of its previous value after 80 iterations and again after 120 iterations. For the embodiment of the present disclosure, the control parameters λ1 and λ2 were initially set to 2 and 0.04, respectively, and were increased to ten times their previous values whenever the learning rate changed.
Figs. 3A-3D show convergence curves for a fixed number of effective passes. Sequential stochastic gradient descent (SGD) achieves the best training accuracy, with a final test error of 8.75%. Asynchronous stochastic gradient descent (Async SGD) and synchronous stochastic gradient descent (Sync SGD) both perform worse, and their test errors grow as the number of workers increases. With 4 workers (M=4), their test errors are 9.39% and 9.35%, respectively; with 8 workers (M=8), they are 10.4% and 10.1%, respectively. This is because asynchronous stochastic gradient descent is affected by delayed gradients, which become more severe as the number of workers increases, while synchronous stochastic gradient descent needs to increase the mini-batch size, which affects the training performance of the model. By comparison, for the delay-compensated stochastic gradient descent (DC-ASGD) in accordance with embodiments of the present disclosure, both approximations (namely λ1|g| and λ2 g⊙g, based on the diagonal elements of the gradient outer product) perform significantly better than asynchronous stochastic gradient descent and synchronous stochastic gradient descent, and almost catch up with sequential stochastic gradient descent. For example, with 4 workers, the test error of delay-compensated stochastic gradient descent is 8.69%, essentially indistinguishable from that of sequential stochastic gradient descent. With 8 workers, the test error of delay-compensated stochastic gradient descent can be reduced to 9.27%, which is significantly better than traditional asynchronous stochastic gradient descent and synchronous stochastic gradient descent.
Figs. 4A-4D show comparisons of convergence speed between a scheme in accordance with an embodiment of the present disclosure and traditional schemes. The convergence of asynchronous stochastic gradient descent is very fast compared with sequential stochastic gradient descent, achieving an almost linear speed-up, but its convergence point is worse. Synchronous stochastic gradient descent is also faster than sequential stochastic gradient descent, but because of the synchronization cost it is significantly slower than asynchronous stochastic gradient descent. Delay-compensated asynchronous stochastic gradient descent achieves a very good balance between accuracy and speed: its convergence speed is very similar to that of traditional asynchronous stochastic gradient descent, while its convergence point is essentially as good as that of sequential stochastic gradient descent.
Example apparatus
Fig. 5 shows a block diagram of an example computing system/server 500 in which one or more embodiments of the subject matter described herein can be implemented. The model estimation system 50, the model execution system 120, or both, can be implemented by the computing system/server 500. The computing system/server 500 shown in Fig. 5 is only an example and is not intended to suggest any limitation as to the scope of use or functionality of the implementations described herein.
As shown in Fig. 5, the computing system/server 500 is in the form of a general-purpose computing device. The components of the computing system/server 500 may include, but are not limited to, one or more processors or processing units 500, a memory 520, one or more input devices 530, one or more output devices 540, a storage device 550, and one or more communication units 560. The processing unit 500 may be a real or virtual processor and can perform various processes according to programs stored in the memory 520. In a multiprocessing system, multiple processing units execute computer-executable instructions to increase processing power.
The computing system/server 500 typically includes a variety of computer media. Such media may be any available media that are accessible by the computing system/server 500, including, but not limited to, volatile and non-volatile media, and removable and non-removable media. The memory 520 may be volatile memory (for example, registers, cache, random access memory (RAM)), non-volatile memory (for example, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 550 may be removable or non-removable, and may include machine-readable media, such as a flash drive, a magnetic disk or any other medium, which can be used to store information and which can be accessed within the computing system/server 500.
The computing system/server 500 may further include other removable/non-removable, volatile/non-volatile computer system storage media. Although not shown in Fig. 5, a disk drive for reading from or writing to a removable, non-volatile magnetic disk (for example, a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to the bus 18 by one or more data media interfaces. The memory 520 may include at least one program product having a set of (for example, at least one) program modules that are configured to carry out the functions of the various embodiments described herein.
A program/utility tool 522 having a set of one or more program modules 524 may be stored, for example, in the memory 520. Such program modules 524 may include, but are not limited to, an operating system, one or more application programs, other program modules and program data. Each of these examples, or a particular combination thereof, may include an implementation of a networking environment. The program modules 524 generally carry out the functions and/or methods of the embodiments of the subject matter described herein, for example the method 200.
The input unit 530 may be one or more of various input devices. For example, the input unit 530 may include a user device such as a mouse, a keyboard, a trackball or the like. The communication unit 560 enables communication with further computing entities over a communication medium. Additionally, the functions of the components of the computing system/server 500 may be implemented by a single computing cluster or by multiple computing machines that are able to communicate over communication links. Therefore, the computing system/server 500 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs) or another general network node. By way of example, and not limitation, communication media include wired or wireless networking techniques.
The computing system/server 500 may also communicate, as required, with one or more external devices (not shown) such as storage devices, display devices and the like, with one or more devices that enable users to interact with the computing system/server 500, or with any device (for example, a network card, a modem, etc.) that enables the computing system/server 500 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
The functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SOCs), complex programmable logic devices (CPLDs), and the like.
Program code for carrying out the methods of the subject matter described herein may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the subject matter described herein. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination.
Example implementations
Some example implementations of the present disclosure are listed below.
In some embodiments, a computer-implemented method is provided. The method comprises: receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determining a difference between the previous value and a current value of the parameter set; and updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
In some embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
In some embodiments, updating the current value comprises: determining a coefficient of a transformation based on the significant change trend; and determining a difference amount between the current value and the updated value by applying the transformation to the difference.
In some embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
In some embodiments, determining the coefficient of the transformation comprises: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In some embodiments, determining the coefficient of the transformation comprises: determining, based on the gradient, a magnitude of a rate of change of the optimization objective with respect to each parameter in the parameter set; and determining the linear change rate based on the magnitude of the rate of change.
In some embodiments, determining the linear change rate based on the magnitude of the rate of change comprises: computing a square of the magnitude of the rate of change; and determining the linear change rate based on the square of the magnitude of the rate of change.
In some embodiments, the method further comprises: receiving a request for the parameter set from the worker; and in response to the request, sending the updated value of the parameter set to the worker.
In some embodiments, the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
In some embodiments, a device is provided. The device comprises: a processing unit; and a memory coupled to the processing unit and storing instructions which, when executed by the processing unit, perform the following actions: receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determining a difference between the previous value and a current value of the parameter set; and updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
In some embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
In some embodiments, updating the current value comprises: determining a coefficient of a transformation based on the significant change trend; and determining a difference amount between the current value and the updated value by applying the transformation to the difference.
In some embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
In some embodiments, determining the coefficient of the transformation comprises: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In some embodiments, determining the coefficient of the transformation comprises: determining, based on the gradient, a magnitude of a rate of change of the optimization objective with respect to each parameter in the parameter set; and determining the linear change rate based on the magnitude of the rate of change.
In some embodiments, determining the linear change rate based on the magnitude of the rate of change comprises: computing a square of the magnitude of the rate of change; and determining the linear change rate based on the square of the magnitude of the rate of change.
In some embodiments, the actions further comprise: receiving a request for the parameter set from the worker; and in response to the request, sending the updated value of the parameter set to the worker.
In some embodiments, the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
In certain embodiments, there is provided a kind of computer program product.The computer program product is stored in non-transient In computer-readable storage medium and including machine-executable instruction, when machine-executable instruction is run in a device so that equipment: Received from working machine by being trained the feedback data to generate, feedback data and machine learning model to machine learning model Preceding value of the parameter set at working machine be associated;Determine the difference between preceding value and the currency of parameter set;And base In feedback data and difference, currency is updated to obtain the updated value of parameter set.
In certain embodiments, feedback data indicates the optimization aim of machine learning model relative to the preceding value of parameter set Significant changes trend.
In certain embodiments, renewal currency includes:The coefficient of conversion is determined based on significant changes trend;It is and logical Cross the residual quantity for bringing and determining between currency and updated value using becoming to difference.
In certain embodiments, conversion is linear transformation, and coefficient is linear change rate, and significant changes trend is by optimizing Target represents relative to the gradient of the preceding value of parameter set.
In certain embodiments, it is determined that the coefficient of conversion includes:The tensor product of gradient is calculated as to linear change rate Unbiased esti-mator.
In some embodiments, determining the coefficient of the transformation includes: determining, based on the gradient, values of rates of change of the optimization objective with respect to individual parameters in the parameter set; and determining the linear change rate based on the values of the rates of change.
In some embodiments, determining the linear change rate based on the values of the rates of change includes: calculating squares of the values of the rates of change; and determining the linear change rate based on the squares of the values of the rates of change.
In some embodiments, the machine-executable instructions further cause the device to: receive a request for the parameter set from the working machine; and in response to the request, send the updated value of the parameter set to the working machine.
In some embodiments, the machine learning model includes a neural network model, and the optimization objective is represented by a cross-entropy loss function.
Although the present disclosure has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (20)

1. A computer-implemented method, comprising:
receiving, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determining a difference between the previous value and a current value of the parameter set; and
updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
2. The method of claim 1, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
3. The method of claim 2, wherein updating the current value comprises:
determining a coefficient of a transformation based on the significant change trend; and
determining a delta between the current value and the updated value by applying the transformation to the difference.
4. The method of claim 3, wherein the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
5. The method of claim 4, wherein determining the coefficient of the transformation comprises:
calculating a tensor product of the gradient as an unbiased estimator of the linear change rate.
6. The method of claim 4, wherein determining the coefficient of the transformation comprises:
determining, based on the gradient, values of rates of change of the optimization objective with respect to individual parameters in the parameter set; and
determining the linear change rate based on the values of the rates of change.
7. The method of claim 6, wherein determining the linear change rate based on the values of the rates of change comprises:
calculating squares of the values of the rates of change; and
determining the linear change rate based on the squares of the values of the rates of change.
8. The method of claim 1, further comprising:
receiving a request for the parameter set from the working machine; and
in response to the request, sending the updated value of the parameter set to the working machine.
9. The method of claim 1, wherein the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
10. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and storing instructions which, when executed by the processing unit, perform the following actions:
receiving, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determining a difference between the previous value and a current value of the parameter set; and
updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
11. The device of claim 10, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
12. The device of claim 11, wherein updating the current value comprises:
determining a coefficient of a transformation based on the significant change trend; and
determining a delta between the current value and the updated value by applying the transformation to the difference.
13. The device of claim 12, wherein the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective with respect to the previous value of the parameter set.
14. The device of claim 13, wherein determining the coefficient of the transformation comprises:
calculating a tensor product of the gradient as an unbiased estimator of the linear change rate.
15. The device of claim 13, wherein determining the coefficient of the transformation comprises:
determining, based on the gradient, values of rates of change of the optimization objective with respect to individual parameters in the parameter set; and
determining the linear change rate based on the values of the rates of change.
16. The device of claim 15, wherein determining the linear change rate based on the values of the rates of change comprises:
calculating squares of the values of the rates of change; and
determining the linear change rate based on the squares of the values of the rates of change.
17. The device of claim 10, wherein the actions further include:
receiving a request for the parameter set from the working machine; and
in response to the request, sending the updated value of the parameter set to the working machine.
18. The device of claim 10, wherein the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
19. A computer program product, stored in a non-transitory computer storage medium and comprising machine-executable instructions which, when run on a device, cause the device to:
receive, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determine a difference between the previous value and a current value of the parameter set; and
update the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
20. The computer program product of claim 19, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model with respect to the previous value of the parameter set.
CN201610730381.8A 2016-08-25 2016-08-25 Asynchronous training of machine learning models Active CN107784364B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201610730381.8A CN107784364B (en) 2016-08-25 2016-08-25 Asynchronous training of machine learning models
EP17758721.9A EP3504666B1 (en) 2016-08-25 2017-08-17 Asychronous training of machine learning model
PCT/US2017/047247 WO2018039011A1 (en) 2016-08-25 2017-08-17 Asychronous training of machine learning model
US16/327,679 US20190197404A1 (en) 2016-08-25 2017-08-17 Asychronous training of machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610730381.8A CN107784364B (en) 2016-08-25 2016-08-25 Asynchronous training of machine learning models

Publications (2)

Publication Number Publication Date
CN107784364A true CN107784364A (en) 2018-03-09
CN107784364B CN107784364B (en) 2021-06-15

Family

ID=59738469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610730381.8A Active CN107784364B (en) 2016-08-25 2016-08-25 Asynchronous training of machine learning models

Country Status (4)

Country Link
US (1) US20190197404A1 (en)
EP (1) EP3504666B1 (en)
CN (1) CN107784364B (en)
WO (1) WO2018039011A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046475A1 (en) * 2016-08-11 2018-02-15 Twitter, Inc. Detecting scripted or otherwise anomalous interactions with social media platform
US11182674B2 (en) * 2017-03-17 2021-11-23 International Business Machines Corporation Model training by discarding relatively less relevant parameters
JP6886112B2 (en) * 2017-10-04 2021-06-16 富士通株式会社 Learning program, learning device and learning method
US11669914B2 (en) 2018-05-06 2023-06-06 Strong Force TX Portfolio 2018, LLC Adaptive intelligence and shared infrastructure lending transaction enablement platform responsive to crowd sourced information
CN112534452A (en) 2018-05-06 2021-03-19 强力交易投资组合2018有限公司 Method and system for improving machines and systems for automatically performing distributed ledger and other transactions in spot and forward markets for energy, computing, storage, and other resources
US11544782B2 (en) 2018-05-06 2023-01-03 Strong Force TX Portfolio 2018, LLC System and method of a smart contract and distributed ledger platform with blockchain custody service
US11550299B2 (en) 2020-02-03 2023-01-10 Strong Force TX Portfolio 2018, LLC Automated robotic process selection and configuration
CN109165515A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Model parameter acquisition method and system based on federated learning, and readable storage medium
US11715010B2 (en) * 2019-08-16 2023-08-01 Google Llc Cross replica reduction on networks having degraded nodes
WO2021038592A2 (en) * 2019-08-30 2021-03-04 Tata Consultancy Services Limited System and method for handling popularity bias in item recommendations
US11982993B2 (en) 2020-02-03 2024-05-14 Strong Force TX Portfolio 2018, LLC AI solution selection for an automated robotic process
CN111814965A (en) * 2020-08-14 2020-10-23 Oppo广东移动通信有限公司 Hyper-parameter adjusting method, device, equipment and storage medium
US20220076130A1 (en) * 2020-08-31 2022-03-10 International Business Machines Corporation Deep surrogate langevin sampling for multi-objective constraint black box optimization with applications to optimal inverse design problems
US11829799B2 (en) 2020-10-13 2023-11-28 International Business Machines Corporation Distributed resource-aware training of machine learning pipelines
CN117151239B (en) * 2023-03-17 2024-08-27 荣耀终端有限公司 Gradient updating method and related device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152676B1 (en) * 2013-11-22 2018-12-11 Amazon Technologies, Inc. Distributed training of models using stochastic gradient descent

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009742A1 (en) * 2000-12-06 2003-01-09 Bass Michael D. Automated job training and performance tool
US8027938B1 (en) * 2007-03-26 2011-09-27 Google Inc. Discriminative training in machine learning
US20140379386A1 (en) * 2013-06-25 2014-12-25 Arthur Paul Drennan, III System and method for evaluating text to support multiple insurance applications
CN104346214A (en) * 2013-07-30 2015-02-11 中国银联股份有限公司 Device and method for managing asynchronous tasks in distributed environments
CN105683944A (en) * 2013-11-04 2016-06-15 谷歌公司 Systems and methods for layered training in machine-learning architectures
US9269057B1 (en) * 2013-12-11 2016-02-23 Google, Inc. Using specialized workers to improve performance in machine learning
CN105894087A (en) * 2015-01-26 2016-08-24 华为技术有限公司 System and method for training parameter set in neural network
CN105022699A (en) * 2015-07-14 2015-11-04 惠龙易通国际物流股份有限公司 Cache region data preprocessing method and system
CN105825269A (en) * 2016-03-15 2016-08-03 中国科学院计算技术研究所 Parallel autoencoder based feature learning method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI MENG et al.: "Asynchronous Accelerated Stochastic Gradient Descent", Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322020B (en) * 2018-03-28 2023-05-12 国际商业机器公司 Adaptive learning rate scheduling for distributed random gradient descent
CN110322020A (en) * 2018-03-28 2019-10-11 国际商业机器公司 Adaptive learning rate scheduling for distributed stochastic gradient descent
CN112166444A (en) * 2018-05-31 2021-01-01 微软技术许可有限责任公司 Distributed computing system with integrated data as a service feedback loop engine
CN109032630A (en) * 2018-06-29 2018-12-18 电子科技大学 Method for updating global parameters in a parameter server
CN109102075A (en) * 2018-07-26 2018-12-28 联想(北京)有限公司 Gradient update method and related device in distributed training
CN108803348A (en) * 2018-08-03 2018-11-13 北京深度奇点科技有限公司 PID parameter optimization method and PID parameter optimization device
CN108803348B (en) * 2018-08-03 2021-07-13 北京深度奇点科技有限公司 PID parameter optimization method and PID parameter optimization device
CN109508785A (en) * 2018-10-29 2019-03-22 清华大学 Asynchronous parallel optimization method for neural network training
CN109460826A (en) * 2018-10-31 2019-03-12 北京字节跳动网络技术有限公司 Method and apparatus for distributing data, and model update system
CN110070116B (en) * 2019-04-08 2022-09-20 云南大学 Segmented selection integration image classification method based on deep tree training strategy
CN110070116A (en) * 2019-04-08 2019-07-30 云南大学 Segmented selection integration image classification method based on deep tree training strategy
CN110689137A (en) * 2019-09-24 2020-01-14 网易传媒科技(北京)有限公司 Parameter determination method, system, medium, and electronic device
CN112906792A (en) * 2021-02-22 2021-06-04 中国科学技术大学 Image recognition model rapid training method and system based on many-core processor

Also Published As

Publication number Publication date
WO2018039011A1 (en) 2018-03-01
EP3504666B1 (en) 2021-02-17
US20190197404A1 (en) 2019-06-27
CN107784364B (en) 2021-06-15
EP3504666A1 (en) 2019-07-03

Similar Documents

Publication Publication Date Title
CN107784364A (en) The asynchronous training of machine learning model
Jørgensen et al. Exploiting the causal tensor network structure of quantum processes to efficiently simulate non-Markovian path integrals
Binois et al. Practical heteroscedastic Gaussian process modeling for large simulation experiments
Xue et al. Amortized finite element analysis for fast pde-constrained optimization
Wang et al. Reduced-order deep learning for flow dynamics. The interplay between deep learning and model reduction
Ferdinand et al. Anytime stochastic gradient descent: A time to hear from all the workers
US11675951B2 (en) Methods and systems for congestion prediction in logic synthesis using graph neural networks
Kallrath Polylithic modeling and solution approaches using algebraic modeling systems
EP2680157B1 (en) Co-simulation procedures using full derivatives of output variables
CN113377964A (en) Knowledge graph link prediction method, device, equipment and storage medium
Korondi et al. Multi-fidelity design optimisation strategy under uncertainty with limited computational budget
CN114840322A (en) Task scheduling method and device, electronic equipment and storage
Tang et al. DAS: A deep adaptive sampling method for solving partial differential equations
Kaneda et al. A deep conjugate direction method for iteratively solving linear systems
KR20230132369A (en) Reducing resources in quantum circuits
CN110377769A (en) Modeling Platform system, method, server and medium based on graph data structure
CN113508404A (en) Apparatus and method for quantum circuit simulator
CN110009091A (en) Optimization of the learning network in Class Spaces
Javed et al. Random neural network learning heuristics
Borovska et al. Searchless Intelligent System of Modern Production Control
Chang et al. Designing a Framework for Solving Multiobjective Simulation Optimization Problems
Hoda et al. A gradient-based approach for computing Nash equilibria of large sequential games
WO2020149919A1 (en) Inertial damping for enhanced simulation of elastic bodies
Charlier et al. VecHGrad for solving accurately complex tensor decomposition
Wu et al. Finding quantum many-body ground states with artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant