CN107784364A - The asynchronous training of machine learning model - Google Patents
The asynchronous training of machine learning model
- Publication number
- CN107784364A (application CN201610730381.8A)
- Authority
- CN
- China
- Prior art keywords
- value
- parameter set
- rate
- current value
- worker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Embodiments of the disclosure relate to the asynchronous training of a machine learning model. A server receives, from workers, feedback data generated by training the machine learning model. The feedback data is obtained by each worker using its own training data and is associated with a previous value of the parameter set of the machine learning model at that particular worker. The server determines the difference between the previous value and the current value of the parameter set at the server; the current value may have been updated one or more times due to the operation of other workers. The server then updates the current value of the parameter set based on the feedback data and the difference between the values of the parameter set. The update thus not only takes into account the training result of each worker, but also appropriately compensates for the delay between different workers.
Description
Background
Machine learning has wide application in numerous areas such as speech recognition, computer vision and natural language processing. For example, deep neural networks (DNN) can, on the basis of big data and powerful computing resources, be trained in parallel as machine learning models with multiple layers and a large number of parameters. In the training stage, one or more parameters of the model need to be trained according to a given training data set and an optimization objective. For the training of neural networks, for example, stochastic gradient descent can be used.
It is known to distribute a training data set among multiple workers. The workers optimize the model parameters using their respective training data and return their results to a central server. A key problem of such distributed, or asynchronous, model training, however, is the mismatch between the workers. For example, by the time one worker returns its parameter update, the model parameters at the server may already have been updated one or more times by other workers. In the asynchronous training of machine learning models, it is therefore desirable to reduce or eliminate this delay or mismatch.
Summary of the invention
Traditional schemes are all based on the understanding that the delay or mismatch between the workers is caused by factors such as differences in performance between workers and/or inconsistent communication between the server and the workers. The focus of traditional schemes has therefore been to reduce the delay by methods such as optimized scheduling. The inventors have found through study, however, that this delay is in fact intrinsic to the asynchronous architecture and cannot be eliminated by optimized scheduling. Embodiments of the disclosure are therefore intended to appropriately compensate for the delay between different workers rather than attempting to eliminate it, which differs markedly in principle and mechanism from any known scheme.
Generally, in accordance with embodiments of the disclosure, a server receives, from a worker, feedback data generated by training a machine learning model. The feedback data is obtained by the worker using its own training data and is associated with a previous value of the parameter set of the machine learning model at that particular worker. The server determines the difference between the previous value and the current value of the parameter set at the server. It will be understood that the current value may have been updated one or more times due to the operation of other workers. The server can then update the current value of the parameter set based on the feedback data and the difference between the values of the parameter set. The update thus not only takes into account the training result of each worker, but also appropriately compensates for the delay between different workers. Practice has proved that, compared with traditional schemes that attempt to forcibly eliminate the delay, embodiments of the disclosure can significantly reduce the mismatch between different workers and realize effective and efficient asynchronous training of machine learning models.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described in the Detailed Description below. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief description of the drawings
Fig. 1 shows a block diagram of an environment in which embodiments of the disclosure can be implemented;
Fig. 2 shows a flow chart of a method of training a model in accordance with embodiments of the disclosure;
Fig. 3A-3D show performance comparisons between a scheme in accordance with embodiments of the disclosure and traditional schemes;
Fig. 4A-4D show performance comparisons between a scheme in accordance with embodiments of the disclosure and traditional schemes; and
Fig. 5 shows a block diagram of a computing system/server in which one or more embodiments of the disclosure can be implemented.
In the accompanying drawings, the same or similar reference symbols are used to represent the same or similar elements.
Detailed description
The disclosure is now discussed with reference to several embodiments. It should be appreciated that these embodiments are discussed only to enable those of ordinary skill in the art to better understand and thus implement the disclosure, rather than to imply any limitation on the scope of the disclosure.
As used herein, the term "comprising" and its variants are to be read as open-ended terms meaning "including but not limited to". The term "based on" is to be read as "based at least in part on". The terms "an embodiment" and "one embodiment" are to be read as "at least one embodiment". The term "another embodiment" is to be read as "at least one other embodiment". The terms "first", "second" and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
Asynchronous training framework
Fig. 1 shows a block diagram of a parallel computing environment 100 in which embodiments of the disclosure can be implemented. It is to be understood that the structure and function of the environment 100 are described merely for exemplary purposes and do not imply any limitation on the scope of the disclosure. The disclosure can be embodied in different structures and/or functions.
The parallel computing environment 100 includes a server 102, a worker 104 and a worker 106. It should be appreciated that the numbers of servers and workers shown in Fig. 1 are given merely for purposes of illustration and are not intended to be limiting; there may be any appropriate number of servers and workers. For example, in a parameter server architecture, the server 102 can be implemented in a distributed fashion by multiple servers. In certain embodiments, the workers 104 and 106 and the like can be implemented by one or more graphics processing units (GPUs) or GPU clusters.
In operation, each worker has its own training data. For example, the training data of each worker can be a subset of the full training data set; for each worker, the subset can be obtained by random sampling from the full training data set. According to a preset optimization objective, each worker trains the model independently and returns its result to the server 102. The server 102 updates the model parameters according to the feedback results of the workers 104 and 106 so that the parameters eventually satisfy the optimization objective. As described above, the delay and mismatch between the workers 104 and 106 in this process are the main bottleneck restricting the training effect. A minimal sketch of this round-trip is given below.
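For illustration only, the following is a minimal sketch of the worker/server round-trip described above, assuming a toy linear model and the traditional (uncompensated) asynchronous update; the class names `ParameterServer` and `Worker` and all numeric settings are assumptions made for the example, not elements of the disclosure.

```python
import numpy as np

class ParameterServer:
    """Holds the global model parameters (a plain vector in this toy example)."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()              # worker requests the current value w_t

    def push(self, gradient):
        self.w -= self.lr * gradient      # traditional rule: apply the (possibly stale) gradient

class Worker:
    """Owns a random subset of the training data and computes local gradients."""
    def __init__(self, server, data_x, data_y):
        self.server = server
        self.x, self.y = data_x, data_y

    def compute_gradient(self, batch_size=8):
        w = self.server.pull()                           # value w_t at pull time
        idx = np.random.choice(len(self.x), batch_size)
        xb, yb = self.x[idx], self.y[idx]
        return xb.T @ (xb @ w - yb) / batch_size         # least-squares gradient g(w_t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 5))
    true_w = rng.normal(size=5)
    y = x @ true_w
    server = ParameterServer(dim=5)
    workers = [Worker(server, x[i::2], y[i::2]) for i in range(2)]
    for _ in range(500):
        grads = [wk.compute_gradient() for wk in workers]  # both pull the same value
        for g in grads:
            server.push(g)   # the second gradient arrives after the model has already moved
    print("distance to true_w:", np.linalg.norm(server.w - true_w))
```

The last loop makes the delay visible even in a single-threaded simulation: the second worker's gradient was computed against a parameter value that the server has already moved past by the time the gradient is applied.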
General principle
In order to describe the disclosure more clearly, the following description is given in conjunction with multi-class learning based on a neural network model. It will be appreciated, however, that the concepts of the disclosure can be applied to various appropriate machine learning models, particularly neural network models.
In a multi-class classification problem, let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the input space, let $\mathcal{Y} = \{1, 2, \ldots, K\}$ denote the output space, and let $P$ denote the underlying joint distribution over the space $\mathcal{X} \times \mathcal{Y}$, where $d$ denotes the dimension of the input space, $\mathbb{R}$ denotes the set of real numbers, and $K$ denotes the number of classes in the output space.
In the training process, a training set $\{(x_1, y_1), \ldots, (x_S, y_S)\}$ containing multiple data items is usually provided, where $S$ denotes the number of elements in the training set. Each element of the training set comprises a pair of input and output, which can be regarded as sampled independently and identically distributed (i.i.d.) from the distribution $P$; for example, the element $(x_1, y_1)$ represents a data item composed of the input $x_1$ and the output $y_1$. The overall goal of training is to learn, based on the training set, a neural network model $f \in \mathcal{F}$, where $\mathcal{F}$ denotes the space of mappings from the space $\mathcal{X}$ to real-valued outputs. It will be appreciated, however, that the term "training" as used herein may also refer to a part of the training process. The parameter set of the model can be represented by an $n$-dimensional real vector, i.e. $w \in \mathbb{R}^n$, where $n$ is a natural number. That is, the parameter set may include one or more parameters. For simplicity, the parameter set is also sometimes referred to as a parameter or a parameter vector.
A neural network model generally has a hierarchical structure in which each node performs a linear combination of the connected nodes of the previous layer followed by a nonlinear activation. The model parameters represent the weights of the edges between two layers. For each input, the neural network model produces an output vector whose entries represent the likelihoods that the input belongs to the different classes.
Because the underlying distribution $P$ is unknown, a method commonly used to learn or train the model is to minimize a loss function. A loss function is one form of the optimization objective. Alternatively, the model can also be trained by maximizing a utility function. A utility function generally has a form equivalent to a corresponding loss function; for example, it can be represented by the negative of a loss function. For the sake of simplicity, embodiments of the disclosure are described below mainly in conjunction with a loss function.
The loss function can represent a measure of the overall loss used for model optimization, where the loss can represent various factors such as misclassification errors. A common loss function for deep neural networks is the cross-entropy loss function, which is defined as
$$f(w; x, y) = -\sum_{k=1}^{K} I_{[y=k]}\, \log \sigma_k(w; x), \qquad (1)$$
where $I$ denotes the indicator function, $\log$ denotes the logarithmic function, and $\sigma_k(w; x)$ denotes the softmax operation giving the predicted probability of class $k$. The softmax operation is well known in the art and is a common operation widely used in multi-class learning problems, so it is not described further here.
In the course of minimizing the empirical loss function, an initial value is usually set for $w$, and $w$ is then changed iteratively according to the training data until it converges to a parameter value $w$ that minimizes the loss function.
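As a purely illustrative sketch (the single-layer linear-softmax model and the function names are assumptions made for this example, not part of the disclosure), the cross-entropy loss of equation (1) and its gradient with respect to the weights can be computed as follows; a deep network would obtain the same kind of gradient by backpropagation.

```python
import numpy as np

def softmax(scores):
    # sigma_k(w; x): subtract the max for numerical stability
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy_loss_and_grad(W, x, y):
    """Equation (1) and its gradient for a single linear-softmax layer.

    W : (K, d) weight matrix (the parameter set w, reshaped)
    x : (d,)   input
    y : int    class label in {0, ..., K-1}
    """
    probs = softmax(W @ x)                 # sigma(w; x)
    loss = -np.log(probs[y])               # -sum_k I[y=k] log sigma_k(w; x)
    # d loss / d scores = probs - one_hot(y); the chain rule gives the weight gradient
    d_scores = probs.copy()
    d_scores[y] -= 1.0
    grad_W = np.outer(d_scores, x)         # g(w), reshaped to the shape of W
    return loss, grad_W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))
    x = rng.normal(size=4)
    loss, grad = cross_entropy_loss_and_grad(W, x, y=2)
    print(loss, grad.shape)
```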
As described above, in the synchronous stochastic gradient descent method, each worker (for example, the workers 104 and 106) computes a gradient based on its own mini-batch of data, and these gradients are added to the global model. By means of a barrier, the local workers wait for one another until the gradients from all local workers have been added to the global model. Because of the barrier, however, the training speed of the model is determined by the slowest worker. In order to improve training efficiency, the asynchronous stochastic gradient descent method can be used. In the asynchronous stochastic gradient descent method no barrier is used; after a worker adds its gradient to the global model, it continues its training process without waiting. The asynchronous stochastic gradient descent method therefore has higher efficiency than the synchronous stochastic gradient descent method because there is no waiting time.
The computing environment 100 shown in Fig. 1 can be used to implement the asynchronous stochastic gradient descent method. At time $t$, the worker 104 receives the model parameters $w_t$ from the server 102. For example, the worker 104 can send a request for the current model parameters to the server 102, and the server 102, upon receiving the request, sends the model parameters $w_t$ to the worker 104. The worker 104 then computes a local gradient $g(w_t)$ from data $x_t$. The worker can receive the data $x_t$ from another server, and the data $x_t$ can be obtained by random sampling from the training set. The number of data items $x_t$ can be one or more, and the disclosure is not limited in this regard. It should be understood that the term "stochastic gradient" covers not only the case of training on a single data item $x_t$ but also the case of training on multiple data items $x_t$; the latter case is sometimes referred to as a mini-batch gradient.
At the worker 104, the local gradient $g(w_t)$ can be obtained by computation. For example, the value of $g(w_t)$ can be obtained by substituting the data $x_t$ into the expression for the gradient $g(w_t)$. The worker 104 then sends the gradient to the server 102, and the server 102 adds the local gradient $g(w_t)$ to the global model parameters. However, as shown in Fig. 1, before this happens another $\tau$ workers may have added their local gradients to the global model parameters, so that the global model parameters have already been updated $\tau$ times and have become $w_{t+\tau}$. The traditional asynchronous stochastic gradient descent algorithm ignores this problem and simply adds the local gradient $g(w_t)$ to the global model parameters:
$$w_{t+\tau+1} = w_{t+\tau} - \eta\, g(w_t), \qquad (2)$$
where $\eta$ denotes the learning rate; this equation is commonly referred to as the update rule. It can be seen that the update rule of the asynchronous stochastic gradient descent method is not equivalent to the update rule of the sequential stochastic gradient descent method (also referred to as the single-machine stochastic gradient descent method). In the asynchronous stochastic gradient descent method, a "delayed" or "stale" local gradient $g(w_t)$ is actually added to the current global model parameters $w_{t+\tau}$, whereas in the sequential stochastic gradient descent method the global model parameters should be updated based on the gradient $g(w_{t+\tau})$ computed with respect to $w_{t+\tau}$.
Traditional theory holds that the asynchronous stochastic gradient descent method needs more iterations to reach the same precision as the sequential stochastic gradient descent method, and that in some cases, particularly when the number of workers is large, the asynchronous stochastic gradient descent method cannot obtain the same precision as the sequential stochastic gradient descent method at all. A variety of methods exist for addressing the problems of the asynchronous stochastic gradient descent method. In some schemes, the delay of the local gradients is reduced by setting various scheduling strategies. In some schemes, a smaller weight is given to local gradients with a larger delay and a larger weight to local gradients with a smaller delay. As another example, local gradients whose delay exceeds a certain threshold are discarded, and so on. None of these methods, however, makes full use of each local gradient, and computing resources are wasted to a certain extent.
The traditional understanding is that this delay is caused by differences in performance between workers and/or inconsistent communication between the server and the workers, and the delay is therefore reduced by methods such as optimized scheduling. The inventors have recognized, however, that this understanding is inapt. The delay is intrinsic to the asynchronous architecture and cannot be completely eliminated. As shown in Fig. 1, when the local gradient $g(w_t)$ is added to the global model parameters $w_{t+\tau}$, there is necessarily a delay of $\tau$ updates. Therefore, in accordance with embodiments of the disclosure, in asynchronous model training the delay of the different workers (for example, the delay of the gradients) is adequately compensated for, rather than attempting to reduce the delay.
Example process
The principle in accordance with embodiments of the disclosure has been described above in conjunction with Fig. 1. It should be appreciated that the principle can easily be extended to any appropriate model and scenario that uses the asynchronous stochastic gradient descent method. An example process of embodiments of the disclosure is described below with reference to Fig. 2. For convenience of description, Fig. 2 is described in conjunction with the computing environment 100 of Fig. 1.
At 202, the server 102 receives, at time $t$, a request from the worker 104 for the current value $w_t$ of the model parameters. Because the model parameters are generally represented by multiple parameters, they are referred to as a parameter set. In response to the request, the server 102 sends the current value $w_t$ of the model parameters to the worker 104. In certain embodiments, the time $t$ can be represented by a count; for example, each time the global model parameters $w_t$ are updated, the time $t$ is incremented by one count.
The worker 104 thereby obtains the current value $w_t$ of the global model parameters from the server 102. In addition, the worker 104 can also receive one or more training data items $x_t$ from a device such as a server hosting the data set. The number of training data items, also referred to as the mini-batch size, depends on the user's setting. In certain embodiments, the training data can be obtained by random sampling from the data set.
The worker 104 generates feedback data by training the model. The feedback data is associated with the current value $w_t$ of the model parameters. The training process performed by the worker 104 can be a part of the whole training process of the model. As an example, in certain embodiments the feedback data indicates a significant change trend of the optimization objective of the model relative to the current value $w_t$ of the model parameters. For example, in certain embodiments the significant change trend can be the maximum change trend, and it can thus be represented by the gradient $g(w_t)$ of the optimization objective relative to the current value $w_t$ of the model parameters.
In particular, it should be noted that the scope of the disclosure is not limited to the mathematical characterization of the "significant change trend" or of any other physical quantity. The mathematical characterizations used (for example, mathematical quantities, expressions, formulas and the like) are all given merely as examples, whose sole purpose is to help those skilled in the art understand the idea and implementation of the disclosure.
As described above, the optimization objective can be represented by a loss function; therefore, for purposes of discussion, the loss function is still referred to in the following. The worker 104 can compute, from the training data set, the local gradient $g(w_t)$ of the loss function at the current value $w_t$ of the parameters. For many common loss functions, the local gradient $g(w_t)$ has an explicit expression, so the value of the local gradient $g(w_t)$ can be obtained by substituting the training data into the expression. In this case, the training process performed by the worker 104 consists only of determining the local gradient $g(w_t)$ based on the explicit expression, namely a part of the whole training process of the model.
The worker 104 then sends the feedback data (for example, the local gradient $g(w_t)$) back to the server 102. In some embodiments, when the worker 104 obtains the current value $w_t$ of the global model parameters from the server 102, the server 102 stores the current value $w_t$ of the model parameters as backup model parameters $w_{bak}(m)$, where $m$ denotes the identifier of the worker 104.
As shown in Fig. 1, at time $t+\tau$ the server 102 receives, from the worker 104, the feedback data generated by training the model, for example the local gradient $g(w_t)$. At this point the current value of the model parameters has been updated to $w_{t+\tau}$. In certain embodiments, the feedback data thus essentially represents a significant change trend of the optimization objective of the model relative to the previous value $w_t$ of the model parameters.
Referring still to Fig. 2, at 204 the server 102 can determine the difference between the previous value $w_t$ of the model parameters and the current value $w_{t+\tau}$. At 206, the server 102 updates the current value $w_{t+\tau}$ of the model parameters based on the feedback data and the difference, so as to obtain an updated value $w_{t+\tau+1}$ of the model parameters.
In certain embodiments, the previous model parameters can be the model parameters $w_t$ stored as the backup model parameters $w_{bak}(m)$ as described above. The update amount of the model parameters, i.e. the delta between the updated value $w_{t+\tau+1}$ and the current value $w_{t+\tau}$, can be regarded as a transformation of the difference between the current value $w_{t+\tau}$ of the model parameters and the previous value $w_t$. In certain embodiments, the coefficient of this transformation can be determined based on the significant change trend, and the delta between the current value and the updated value of the model parameters, i.e. the update amount of the model parameters, is determined by applying the transformation to the difference. By applying this update amount to the current value $w_{t+\tau}$, the updated value $w_{t+\tau+1}$ of the parameter set can be obtained. For example, the transformation can be a linear transformation and the coefficient can be a linear change rate; of course, any other appropriate transformation is also feasible. A sketch of this server-side flow is given below.
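The server-side bookkeeping of steps 202 to 206 might be sketched as follows. This is a sketch under stated assumptions: the class name, the per-worker dictionary of backups and the pluggable `transform` argument are illustrative, and the compensating transform itself is kept abstract here; concrete delay-compensating transforms are given after equations (7) and (9) below.

```python
import numpy as np

class DelayCompensatingServer:
    """Sketch of the flow of Fig. 2: keep a backup per worker at pull time,
    then compensate each delayed gradient with a transformation of the
    difference between the current and the backed-up parameter value."""

    def __init__(self, dim, lr=0.1, transform=None):
        self.w = np.zeros(dim)          # current value of the parameter set
        self.lr = lr
        self.backup = {}                # w_bak(m), one entry per worker m
        # transform(gradient, difference) -> compensation term; zero by default
        self.transform = transform or (lambda g, diff: np.zeros_like(diff))

    def pull(self, worker_id):
        # 202: worker m requests the current value w_t; keep it as a backup
        self.backup[worker_id] = self.w.copy()
        return self.w.copy()

    def push(self, worker_id, gradient):
        # 204: difference between the current value and the previous value w_t
        diff = self.w - self.backup[worker_id]
        # 206: update the current value based on the feedback data and the difference
        self.w -= self.lr * (gradient + self.transform(gradient, diff))
        return self.w.copy()
```

With the default transform, the sketch degenerates to the traditional update rule of equation (2); the compensation only takes effect once a non-trivial transform is supplied.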
For purposes of discussion, the following discussion still refers to the asynchronous stochastic gradient descent method. Note again, however, that this is used only to explain the principle and idea of the disclosure and is not intended to limit the scope of the disclosure in any way.
In the ideal case, the asynchronous stochastic gradient descent method should, like the sequential stochastic gradient descent method, add the gradient $g(w_{t+\tau})$ to the global model parameters $w_{t+\tau}$. The asynchronous stochastic gradient descent method, however, adds the delayed gradient $g(w_t)$ to the global model parameters $w_{t+\tau}$.
This gap can be illustrated by a Taylor expansion. For example, the Taylor expansion of $g(w_{t+\tau})$ at $w_t$ can be represented as
$$g(w_{t+\tau}) = g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) + \mathcal{O}\big((w_{t+\tau} - w_t)^2\big)\, I_n, \qquad (3)$$
where $I_n$ denotes the $n$-dimensional unit vector and the $\mathcal{O}$ symbol, written in calligraphic font, denotes the terms of order 2 and higher.
It can be seen from equation (3) that the asynchronous stochastic gradient method uses the zeroth-order term of the Taylor expansion as an approximation to the gradient $g(w_{t+\tau})$ and ignores all the other terms. Therefore, to fully close the gap between the delayed gradient $g(w_t)$ and the gradient $g(w_{t+\tau})$, all the other terms would have to be taken into account and computed separately. This is impractical, however, because it involves computing the sum of infinitely many terms. In accordance with embodiments of the disclosure, only the zeroth-order and first-order terms of the Taylor expansion are retained, and the delayed gradient is given only the simplest compensation:
$$g(w_{t+\tau}) \approx g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t). \qquad (4)$$
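As a purely illustrative one-dimensional worked example (the numbers are assumed, not taken from the disclosure), let $f(w) = w^2$, so that $g(w) = 2w$ and $\nabla g(w) = 2$. Suppose the worker pulled $w_t = 1.0$ and that, by the time its gradient arrives, the global parameter has moved to $w_{t+\tau} = 0.6$. Then

$$g(w_t) = 2.0, \qquad g(w_{t+\tau}) = 1.2, \qquad g(w_t) + \nabla g(w_t)\,(w_{t+\tau} - w_t) = 2.0 + 2 \times (-0.4) = 1.2.$$

For this quadratic loss the first-order compensation of equation (4) recovers the ideal gradient exactly; for a general loss it removes only the first-order part of the gap, which is exactly the compensation that embodiments of the disclosure aim to provide.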
The first derivative of the gradient reflects the rate of change of the gradient and corresponds to the second derivative of the loss function (for example, the cross-entropy loss function represented by equation (1)). The first derivative of the gradient can be represented by the Hessian matrix, which can be defined as $Hf(w) = [h_{ij}]_{i,j=1,\ldots,n}$ with $h_{ij} = \partial^2 f(w) / \partial w_i\, \partial w_j$.
Thus, by combining formula (4) with the update rule of formula (2), the update amount for the parameters can be determined. The update amount consists of two terms: the first term is the product of the delayed gradient and the learning rate, and the second term is a compensation term. The update amount of the parameters can therefore be regarded as a linear transformation of the difference between the current value $w_{t+\tau}$ of the parameter set and the previous value $w_t$, where the linear change rate is the product of the learning rate and the Hessian matrix. Because the learning rate is an empirical parameter that can be predefined, the linear change rate can be regarded as equivalent to the Hessian matrix.
Although the Hessian matrix can be computed directly in certain embodiments, this process may be difficult. For example, for a neural network model with one million parameters, the corresponding Hessian matrix has one trillion elements. The computational complexity of obtaining such a huge matrix is very high, and such a matrix also occupies a large storage space. In further embodiments, therefore, the Hessian matrix is approximated using an approximator that is relatively easy to compute and/or store, so that the delay compensation is easy to implement.
For example, in some embodiments it is desirable that an approximation to the first derivative of the gradient (for example, the Hessian matrix $Hf(w)$) can be obtained based on the feedback data (for example, the gradient $g(w_t)$). In this way the computational complexity is not increased dramatically.
For purposes of discussion, assume that the model is a neural network model and that its optimization objective is represented by the cross-entropy loss function common in neural network models. Accordingly, for the cross-entropy loss function, assume that $Y$ is a discrete random variable obeying the distribution $P(Y = k \mid x) = \sigma_k(w; x)$, where $k \in \{1, 2, \ldots, K\}$. It can then be proved that
$$\mathbb{E}_{Y}\!\left[\frac{\partial^2 f(w; x, Y)}{\partial w_i\, \partial w_j}\right] = \mathbb{E}_{Y}\!\left[\frac{\partial f(w; x, Y)}{\partial w_i} \otimes \frac{\partial f(w; x, Y)}{\partial w_j}\right], \qquad (5)$$
where $\otimes$ denotes the outer product (tensor product) of vectors, the left-hand side denotes the expected value, with $Y$ obeying the distribution $\sigma(w; x)$, of the second derivative of the loss function with respect to the model parameters, and the right-hand side denotes the expected value, with $Y$ obeying the distribution $\sigma(w; x)$, of the outer product of the first derivatives of the loss function with respect to the model parameters. For simplicity, the detailed proof is not given here.
For the cross-entropy loss function, it follows from equation (5) that $Gf(x, y, w)$ is an unbiased estimator of $Hf(x, Y, w)$, where $Gf(w)$ denotes the outer product matrix of the gradient vector $g(w)$, that is, $Gf(w) = [g_{ij}]_{i,j=1,\ldots,n}$, where $g_{ij}$ denotes an element of the matrix $Gf(w)$. As noted above, since the Hessian matrix can be regarded as the linear change rate of the linear transformation used for the parameter update, the tensor product can be used as an unbiased estimate of the linear change rate.
The matrix $Gf(w)$ can be obtained by performing a tensor product operation on the gradient vector $g(w)$. Because the tensor product operation has relatively low computational complexity, the computational cost can be reduced significantly. In addition, in such embodiments, because the Hessian matrix is simply replaced in the linear transformation, there is no need to use additional storage space for additional variables, and no substantial demand is placed on the storage space.
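For illustration only, the outer-product estimator $Gf(w)$ and its diagonal can be sketched as follows, assuming the parameter set has been flattened into a single vector; the function names are not taken from the patent.

```python
import numpy as np

def outer_product_estimate(g):
    """Gf(w) = g(w) ⊗ g(w): an n x n estimator of the Hessian for the
    cross-entropy loss, built from the gradient alone."""
    g = np.asarray(g).ravel()       # flatten the parameter gradient to length n
    return np.outer(g, g)           # [g_ij] with g_ij = g_i * g_j

def diagonal_estimate(g):
    """Only the diagonal of Gf(w), i.e. g ⊙ g: O(n) storage instead of O(n^2)."""
    g = np.asarray(g).ravel()
    return g * g

if __name__ == "__main__":
    g = np.array([0.5, -1.0, 2.0])
    print(outer_product_estimate(g))
    print(diagonal_estimate(g))
```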
It will be appreciated that, although the outer product matrix of the gradient vector $g(w)$ is shown here to be an unbiased estimator of the Hessian matrix for the cross-entropy loss function, the conclusion can readily be applied to an unknown loss function or optimization objective, as long as the error of such an estimate is within an acceptable tolerance.
However, because the unbiased estimator of the Hessian matrix does not take the influence of variance into account, it may lead to a high approximation error for the Hessian matrix. Therefore, in certain embodiments, bias and variance can be considered simultaneously; for example, the mean square error can be used to characterize the quality of the approximation.
In certain embodiments, in order to reduce the variance, a further approximation to the Hessian matrix, defined ($\triangleq$) as $\lambda_1\sqrt{Gf(w)}$ (taken element-wise), can be used, where $\lambda_1$ denotes a control parameter. To simplify the notation, for all $\sigma_k$ the range of the associated quantity is written $[l_i, u_i]$, with corresponding auxiliary quantities defined from these bounds. For the cross-entropy loss function it can be proved that if the control parameter $\lambda_1$ satisfies $\mathrm{sign}(\lambda_1) = \mathrm{sign}(g_{ij})$ together with a bound expressed in terms of these ranges, then the mean square error of the approximator $\lambda_1\sqrt{g_{ij}}$ is smaller than the mean square error of the approximator $g_{ij}$, where $\mathrm{sign}$ denotes the sign function. Therefore, by setting $\lambda_1$ appropriately, $\lambda_1\sqrt{g_{ij}}$ can achieve a smaller mean square error than $g_{ij}$.
It should be appreciated that the above approximation is merely exemplary and is not intended to limit the scope of the disclosure. For example, in other embodiments, in order to reduce the variance, another approximation to the Hessian matrix, $\lambda_2 Gf(w)$, can be used, where $\lambda_2$ denotes a control parameter.
For the cross-entropy loss function it can be proved that if $\lambda_2 \in [0, 1]$ and $\sigma_k$ satisfies a corresponding condition, then the mean square error of the approximator $\lambda_2 g_{ij}$ is smaller than the mean square error of the approximator $g_{ij}$. Therefore, by setting $\lambda_2$ appropriately, $\lambda_2 g_{ij}$ can achieve a smaller mean square error than $g_{ij}$.
The above conclusions give ranges of the control parameters $\lambda_1$ and $\lambda_2$ for the cross-entropy loss function, but it should be understood that corresponding appropriate ranges may also exist for an unknown loss function. In addition, in a specific implementation, the ranges of the control parameters $\lambda_1$ and $\lambda_2$ can be configured empirically over a wider range according to the specific implementation situation.
The approximations $\lambda_1\sqrt{Gf(w)}$ and $\lambda_2 Gf(w)$ can be computed easily from the local gradient $g(w_t)$ and achieve a good balance between the bias and the variance of the approximation. The Hessian matrix can thus be obtained in a more economical manner.
Further, in certain embodiments, the magnitude of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient $g(w_t)$, the magnitude of a derivative representing its size or absolute value. The linear change rate can then be determined based on the magnitudes of the derivatives. Further specific embodiments are described below in conjunction with the approximations $\lambda_1\sqrt{Gf(w)}$ and $\lambda_2 Gf(w)$, respectively.
In certain embodiments, only the diagonal elements of the approximation $\lambda_1\sqrt{Gf(w)}$ of the Hessian matrix can be used. The update rule for the global model parameters then becomes
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_1\, \mathrm{diag}\big(\sqrt{Gf(w_t)}\big) \odot (w_{t+\tau} - w_t)\right), \qquad (6)$$
which is equivalent to
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_1\, |g(w_t)| \odot (w_{t+\tau} - w_t)\right), \qquad (7)$$
where $\mathrm{diag}$ denotes taking the diagonal elements of a matrix, $\odot$ denotes element-wise multiplication, and equations (6) and (7) are two completely equivalent representations.
According to equation (7), the magnitude of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient $g(w_t)$; mathematically, the magnitude of a derivative can be represented by its absolute value. In certain embodiments, the magnitudes of the derivatives can be used directly to determine the linear change rate and hence the compensation term. In other words, the vector formed by the absolute values of the elements of the gradient $g(w_t)$ can be used as the linear change rate; the linear change rate and the compensation term then differ at most by an adjustment parameter, such as the product of the learning rate $\eta$ and the control parameter $\lambda_1$. Because the absolute-value operation has very low computational complexity, the computational cost can be reduced significantly. In addition, no additional storage space is needed for additional variables, so no additional demand is placed on the storage space.
Alternatively, in further embodiments, only the diagonal elements of the approximation $\lambda_2 Gf(w)$ of the Hessian matrix can be used. The update rule for the global model parameters then becomes
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_2\, \mathrm{diag}\big(Gf(w_t)\big) \odot (w_{t+\tau} - w_t)\right), \qquad (8)$$
which is equivalent to
$$w_{t+\tau+1} = w_{t+\tau} - \eta\left(g(w_t) + \lambda_2\, g(w_t) \odot g(w_t) \odot (w_{t+\tau} - w_t)\right). \qquad (9)$$
According to equation (9), the square of the derivative of the loss function with respect to each parameter in the parameter set can be determined based on the gradient $g(w_t)$, and the linear change rate, and hence the compensation term, can be determined directly from the squares of the derivatives. In other words, in such embodiments the vector formed by the squares of the absolute values of the elements of the gradient $g(w_t)$ (rather than the absolute values themselves) can be used as the linear change rate; the linear change rate and the compensation term then differ at most by an adjustment parameter, such as the product of the learning rate $\eta$ and the control parameter $\lambda_2$. Because the squaring operation has very low computational complexity, the computational cost can be reduced significantly. In addition, no additional storage space is needed for additional variables, so no additional demand is placed on the storage space. A sketch combining both compensation rules is given below.
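Putting the pieces together, the two diagonal compensation rules of equations (7) and (9) can be plugged into the server-side sketch given earlier. This is a minimal illustration with assumed example values for $\eta$, $\lambda_1$ and $\lambda_2$, not a definitive implementation of the claimed method.

```python
import numpy as np

def make_dc_transform(lam=0.04, variant="square"):
    """Build a transform(gradient, difference) for the server sketch above.

    variant "abs"    -> equation (7): lam * |g(w_t)|        ⊙ (w_{t+tau} - w_t)
    variant "square" -> equation (9): lam * g(w_t) ⊙ g(w_t) ⊙ (w_{t+tau} - w_t)
    """
    def transform(grad, diff):
        factor = np.abs(grad) if variant == "abs" else grad * grad
        return lam * factor * diff
    return transform

if __name__ == "__main__":
    w_t   = np.array([1.0, -2.0])   # value pulled by the worker (previous value)
    w_now = np.array([0.8, -1.5])   # current value after tau updates by other workers
    g     = np.array([0.4, -0.9])   # delayed gradient g(w_t)
    lr = 0.1
    for variant in ("abs", "square"):
        comp = make_dc_transform(lam=0.04, variant=variant)(g, w_now - w_t)
        print(variant, w_now - lr * (g + comp))
```

If the learning rate is decayed during training, the control parameter can be scaled up by the same factor so that the product of learning rate and control parameter multiplying the compensation term stays roughly constant, as discussed in the next paragraph.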
Because the learning rate $\eta$ may decrease continually as the training of the model progresses, the control parameter may need to be adjusted accordingly. As can be seen from the update rules above, the coefficient of the compensation term is affected by the product of the control parameter itself and the learning rate. In certain embodiments, the control parameter is adapted so that the product of the control parameter and the learning rate remains substantially unchanged; in this case the overall control of the compensation term can also be regarded as remaining constant.
In certain embodiments, as shown in Fig. 1, when the model parameters are updated from $w_{t+\tau}$ to $w_{t+\tau+1}$, the server 102 can automatically send the updated model parameters to the worker 104. Alternatively or additionally, the server 102 can send the updated model parameters $w_{t+\tau+1}$ to the worker 104 in response to a request for the model parameters from the worker 104.
In accordance with embodiments of the disclosure, relative to traditional asynchronous stochastic gradient descent, the workers (for example, the workers 104 and/or 106) require no additional computation: they only need to compute the local gradient $g(w_t)$. In addition, the server 102 only needs to perform additional computations whose complexity is not high. Even in the case where the outer product matrix $Gf(w)$ needs to be computed, only an outer product operation on a vector is required. In the case where only the diagonal elements of the approximations $\lambda_1\sqrt{Gf(w)}$ and $\lambda_2 Gf(w)$ are considered, the computational complexity is reduced further.
For each worker, the server 102 only needs to store the corresponding backup model parameters $w_{bak}(m)$, where $m$ can be $1, 2, \ldots, M$ and $M$ denotes the total number of workers. This generally does not affect or degrade system performance. In some embodiments, the server 102 is implemented in a distributed fashion, so the capacity of its available storage space considerably exceeds that of a single machine. Alternatively or additionally, a worker (for example, the worker 104 and/or 106) can send its corresponding global parameters to the server 102 together with the gradient. In this way there is no need to deploy large-capacity storage on the server 102 side, but the communication cost between the workers (for example, the workers 104 and/or 106) and the server may be doubled.
Experiments and performance
Fig. 3A-3D and Fig. 4A-4D show performance comparisons between a scheme in accordance with embodiments of the disclosure and traditional schemes. The experiments of Fig. 3A-3D and Fig. 4A-4D were carried out on the CIFAR-10 data set. For all methods, 160 iterations were carried out with a mini-batch size of 128 and an initial learning rate of 0.5, which was reduced to one tenth of its previous value after 80 and 120 iterations respectively. In accordance with embodiments of the disclosure, the control parameters $\lambda_1$ and $\lambda_2$ were initially set to 2 and 0.04 respectively and were increased to ten times their previous value whenever the learning rate changed.
Fig. 3A-3D illustrate the convergence curves for a fixed number of effective passes. Sequential stochastic gradient descent (SGD) achieves the best training accuracy, with a final test error of 8.75%. The performance of asynchronous stochastic gradient descent (Async SGD) and synchronous stochastic gradient descent (Sync SGD) is not as good, and their test errors increase as the number of workers increases. With 4 workers (M=4) the test errors of the two are 9.39% and 9.35% respectively, and with 8 workers (M=8) they are 10.4% and 10.1% respectively. This is because asynchronous stochastic gradient descent is affected by the delayed gradients, which become more severe as the number of workers increases, while synchronous stochastic gradient descent has to increase the mini-batch size, which affects the training performance of the model. By comparison, for the delay-compensated stochastic gradient descent (DC-ASGD) in accordance with embodiments of the disclosure, both approximations (namely $\lambda_1 |g|$ and $\lambda_2\, g \odot g$, based on the diagonal elements of the gradient outer product) perform significantly better than asynchronous stochastic gradient descent and synchronous stochastic gradient descent, and almost catch up with sequential stochastic gradient descent. For example, with 4 workers the test error of the delay-compensated stochastic gradient descent is 8.69%, essentially indistinguishable from the test error of sequential stochastic gradient descent. With 8 workers the test error of the delay-compensated stochastic gradient descent can be lowered to 9.27%, which is significantly better than traditional asynchronous stochastic gradient descent and synchronous stochastic gradient descent.
Fig. 4A-4D show a comparison of convergence speed between a scheme in accordance with embodiments of the disclosure and traditional schemes. Asynchronous stochastic gradient descent converges quickly, achieving an almost linear speed-up relative to sequential stochastic gradient descent, but its convergence point is not as good. Synchronous stochastic gradient descent is also faster than sequential stochastic gradient descent, but because of the synchronization cost it is significantly slower than asynchronous stochastic gradient descent. The delay-compensated asynchronous stochastic gradient descent achieves a very good balance between accuracy and speed: its convergence speed is very similar to that of traditional asynchronous stochastic gradient descent, and its convergence point is essentially as good as that of sequential stochastic gradient descent.
Example apparatus
Fig. 5 shows a block diagram of an exemplary computing system/server 500 in which one or more embodiments of the subject matter described herein can be implemented. The model estimation system 50, the model execution system 120, or both, can be implemented by the computing system/server 500. The computing system/server 500 shown in Fig. 5 is only an example and should not be construed as limiting the function or scope of the implementations described herein.
As shown in Fig. 5, the computing system/server 500 is in the form of a general-purpose computing device. The components of the computing system/server 500 can include, but are not limited to, one or more processors or processing units 500, a memory 520, one or more input devices 530, one or more output devices 540, a storage device 550, and one or more communication units 560. The processing unit 500 can be a real or virtual processor and can perform various processing according to programs stored in the memory 520. In a multi-processor system, multiple processing units execute computer-executable instructions to increase processing capability.
The computing system/server 500 typically includes multiple computer media. Such media can be any available media that are accessible by the computing system/server 500, including but not limited to volatile and non-volatile media and detachable and non-detachable media. The memory 520 can be volatile memory (for example registers, cache, random access memory (RAM)), non-volatile memory (for example read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory) or some combination thereof. The storage device 550 can be detachable or non-detachable and can include machine-readable media such as a flash drive, a disk or any other medium that can be used to store information and can be accessed within the computing system/server 500.
The computing system/server 500 may further include other detachable/non-detachable, volatile/non-volatile computer system storage media. Although not shown in Fig. 5, a disk drive for reading from or writing to a detachable, non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from or writing to a detachable, non-volatile optical disk can be provided. In these cases, each drive can be connected to the bus 18 by one or more data media interfaces. The memory 520 can include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the various embodiments described herein.
A program/utility tool 522 having a set of one or more program modules 524 can be stored, for example, in the memory 520. Such program modules 524 can include, but are not limited to, an operating system, one or more application programs, other program modules and operating data. Each of these examples, or a particular combination thereof, may include an implementation of a networked environment. The program modules 524 generally carry out the functions and/or methods of the embodiments of the subject matter described herein, such as the method 200.
The input unit 530 can be one or more of various input devices, for example a user device, a mouse, a keyboard, a trackball and the like. The communication unit 560 realizes communication with other computing entities over a communication medium. Additionally, the functions of the components of the computing system/server 500 can be realized by a single computing cluster or by multiple computing machines that can communicate over communication links. Therefore, the computing system/server 500 can operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC) or another general network node. By way of example and not limitation, communication media include wired or wireless networking technologies.
The computing system/server 500 can also communicate as needed with one or more external devices (not shown) such as a storage device, a display device and the like, with one or more devices that enable a user to interact with the computing system/server 500, or with any device (for example a network card, a modem and the like) that enables the computing system/server 500 to communicate with one or more other computing devices. Such communication can be performed via an input/output (I/O) interface (not shown).
The functions described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SOCs), complex programmable logic devices (CPLDs) and the like.
Program code for implementing the methods of the subject matter described herein can be written in any combination of one or more programming languages. The program code can be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flow charts and/or block diagrams to be carried out. The program code can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of this disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the discussion above, these should not be construed as limiting the scope of the subject matter described herein. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Example implementations
Some example implementations of the disclosure are listed below.
In certain embodiments, a computer-implemented method is provided. The method includes: receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determining a difference between the previous value and a current value of the parameter set; and based on the feedback data and the difference, updating the current value to obtain an updated value of the parameter set.
In certain embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model relative to the previous value of the parameter set.
In certain embodiments, updating the current value includes: determining a coefficient of a transformation based on the significant change trend; and determining a delta between the current value and the updated value by applying the transformation to the difference.
In certain embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective relative to the previous value of the parameter set.
In certain embodiments, determining the coefficient of the transformation includes: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In certain embodiments, determining the coefficient of the transformation includes: determining, based on the gradient, magnitudes of rates of change of the optimization objective relative to respective parameters in the parameter set; and determining the linear change rate based on the magnitudes of the rates of change.
In certain embodiments, determining the linear change rate based on the magnitudes of the rates of change includes: computing squares of the magnitudes of the rates of change; and determining the linear change rate based on the squares of the magnitudes of the rates of change.
In certain embodiments, the method further includes: receiving a request for the parameter set from the worker; and in response to the request, sending the updated value of the parameter set to the worker.
In certain embodiments, the machine learning model includes a neural network model, and the optimization objective is represented by a cross-entropy loss function.
In certain embodiments, a device is provided. The device includes: a processing unit; and a memory coupled to the processing unit and storing instructions which, when executed by the processing unit, perform the following actions: receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determining a difference between the previous value and a current value of the parameter set; and based on the feedback data and the difference, updating the current value to obtain an updated value of the parameter set.
In certain embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model relative to the previous value of the parameter set.
In certain embodiments, updating the current value includes: determining a coefficient of a transformation based on the significant change trend; and determining a delta between the current value and the updated value by applying the transformation to the difference.
In certain embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective relative to the previous value of the parameter set.
In certain embodiments, determining the coefficient of the transformation includes: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In certain embodiments, determining the coefficient of the transformation includes: determining, based on the gradient, magnitudes of rates of change of the optimization objective relative to respective parameters in the parameter set; and determining the linear change rate based on the magnitudes of the rates of change.
In certain embodiments, determining the linear change rate based on the magnitudes of the rates of change includes: computing squares of the magnitudes of the rates of change; and determining the linear change rate based on the squares of the magnitudes of the rates of change.
In certain embodiments, the actions further include: receiving a request for the parameter set from the worker; and in response to the request, sending the updated value of the parameter set to the worker.
In certain embodiments, the machine learning model includes a neural network model, and the optimization objective is represented by a cross-entropy loss function.
In certain embodiments, a computer program product is provided. The computer program product is stored in a non-transitory computer storage medium and includes machine-executable instructions which, when run in a device, cause the device to: receive, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker; determine a difference between the previous value and a current value of the parameter set; and based on the feedback data and the difference, update the current value to obtain an updated value of the parameter set.
In certain embodiments, the feedback data indicates a significant change trend of an optimization objective of the machine learning model relative to the previous value of the parameter set.
In certain embodiments, updating the current value includes: determining a coefficient of a transformation based on the significant change trend; and determining a delta between the current value and the updated value by applying the transformation to the difference.
In certain embodiments, the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective relative to the previous value of the parameter set.
In certain embodiments, determining the coefficient of the transformation includes: computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
In certain embodiments, determining the coefficient of the transformation includes: determining, based on the gradient, magnitudes of rates of change of the optimization objective relative to respective parameters in the parameter set; and determining the linear change rate based on the magnitudes of the rates of change.
In certain embodiments, determining the linear change rate based on the magnitudes of the rates of change includes: computing squares of the magnitudes of the rates of change; and determining the linear change rate based on the squares of the magnitudes of the rates of change.
In certain embodiments, the machine-executable instructions further cause the device to: receive a request for the parameter set from the worker; and in response to the request, send the updated value of the parameter set to the worker.
In certain embodiments, the machine learning model includes a neural network model, and the optimization objective is represented by a cross-entropy loss function.
Although the disclosure has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely exemplary forms of implementing the claims.
Claims (20)
1. A computer-implemented method, comprising:
receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the worker;
determining a difference between the previous value and a current value of the parameter set; and
based on the feedback data and the difference, updating the current value to obtain an updated value of the parameter set.
2. according to the method for claim 1, wherein the feedback data indicates the optimization aim of the machine learning model
Relative to the significant changes trend of the preceding value of the parameter set.
3. according to the method for claim 2, wherein updating the currency includes:
The coefficient of conversion is determined based on the significant changes trend;And
By bringing the residual quantity determined between the currency and the updated value using the change to the difference.
4. according to the method for claim 3, wherein the conversion is linear transformation, the coefficient is linear change rate, and
And the significant changes trend is represented by the optimization aim relative to the gradient of the preceding value of the parameter set.
5. The method of claim 4, wherein determining the coefficient of the transformation comprises:
computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
6. The method of claim 4, wherein determining the coefficient of the transformation comprises:
determining, based on the gradient, a value of a rate of change of the optimization objective relative to each parameter in the parameter set; and
determining the linear change rate based on the value of the rate of change.
7. The method of claim 6, wherein determining the linear change rate based on the value of the rate of change comprises:
computing a square of the value of the rate of change; and
determining the linear change rate based on the square of the value of the rate of change.
8. The method of claim 1, further comprising:
receiving, from the working machine, a request for the parameter set; and
in response to the request, sending the updated value of the parameter set to the working machine.
9. The method of claim 1, wherein the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
10. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and storing instructions which, when executed by the processing unit, perform the following actions:
receiving, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determining a difference between the previous value and a current value of the parameter set; and
updating the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
11. The device of claim 10, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model relative to the previous value of the parameter set.
12. The device of claim 11, wherein updating the current value comprises:
determining a coefficient of a transformation based on the significant change trend; and
determining a delta between the current value and the updated value by applying the transformation to the difference.
13. The device of claim 12, wherein the transformation is a linear transformation, the coefficient is a linear change rate, and the significant change trend is represented by a gradient of the optimization objective relative to the previous value of the parameter set.
14. The device of claim 13, wherein determining the coefficient of the transformation comprises:
computing a tensor product of the gradient as an unbiased estimate of the linear change rate.
15. The device of claim 13, wherein determining the coefficient of the transformation comprises:
determining, based on the gradient, a value of a rate of change of the optimization objective relative to each parameter in the parameter set; and
determining the linear change rate based on the value of the rate of change.
16. The device of claim 15, wherein determining the linear change rate based on the value of the rate of change comprises:
computing a square of the value of the rate of change; and
determining the linear change rate based on the square of the value of the rate of change.
17. The device of claim 10, wherein the actions further comprise:
receiving, from the working machine, a request for the parameter set; and
in response to the request, sending the updated value of the parameter set to the working machine.
18. The device of claim 10, wherein the machine learning model comprises a neural network model, and the optimization objective is represented by a cross-entropy loss function.
19. A computer program product, stored in a non-transitory computer storage medium and comprising machine-executable instructions which, when run in a device, cause the device to:
receive, from a working machine, feedback data generated by training a machine learning model, the feedback data being associated with a previous value of a parameter set of the machine learning model at the working machine;
determine a difference between the previous value and a current value of the parameter set; and
update the current value based on the feedback data and the difference to obtain an updated value of the parameter set.
20. The computer program product of claim 19, wherein the feedback data indicates a significant change trend of an optimization objective of the machine learning model relative to the previous value of the parameter set.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610730381.8A CN107784364B (en) | 2016-08-25 | 2016-08-25 | Asynchronous training of machine learning models |
EP17758721.9A EP3504666B1 (en) | 2016-08-25 | 2017-08-17 | Asychronous training of machine learning model |
PCT/US2017/047247 WO2018039011A1 (en) | 2016-08-25 | 2017-08-17 | Asychronous training of machine learning model |
US16/327,679 US20190197404A1 (en) | 2016-08-25 | 2017-08-17 | Asychronous training of machine learning model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610730381.8A CN107784364B (en) | 2016-08-25 | 2016-08-25 | Asynchronous training of machine learning models |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107784364A true CN107784364A (en) | 2018-03-09 |
CN107784364B CN107784364B (en) | 2021-06-15 |
Family
ID=59738469
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610730381.8A Active CN107784364B (en) | 2016-08-25 | 2016-08-25 | Asynchronous training of machine learning models |
Country Status (4)
Country | Link |
---|---|
US (1) | US20190197404A1 (en) |
EP (1) | EP3504666B1 (en) |
CN (1) | CN107784364B (en) |
WO (1) | WO2018039011A1 (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180046475A1 (en) * | 2016-08-11 | 2018-02-15 | Twitter, Inc. | Detecting scripted or otherwise anomalous interactions with social media platform |
US11182674B2 (en) * | 2017-03-17 | 2021-11-23 | International Business Machines Corporation | Model training by discarding relatively less relevant parameters |
JP6886112B2 (en) * | 2017-10-04 | 2021-06-16 | 富士通株式会社 | Learning program, learning device and learning method |
US11669914B2 (en) | 2018-05-06 | 2023-06-06 | Strong Force TX Portfolio 2018, LLC | Adaptive intelligence and shared infrastructure lending transaction enablement platform responsive to crowd sourced information |
CN112534452A (en) | 2018-05-06 | 2021-03-19 | 强力交易投资组合2018有限公司 | Method and system for improving machines and systems for automatically performing distributed ledger and other transactions in spot and forward markets for energy, computing, storage, and other resources |
US11544782B2 (en) | 2018-05-06 | 2023-01-03 | Strong Force TX Portfolio 2018, LLC | System and method of a smart contract and distributed ledger platform with blockchain custody service |
US11550299B2 (en) | 2020-02-03 | 2023-01-10 | Strong Force TX Portfolio 2018, LLC | Automated robotic process selection and configuration |
CN109165515A (en) * | 2018-08-10 | 2019-01-08 | 深圳前海微众银行股份有限公司 | Model parameter acquisition methods, system and readable storage medium storing program for executing based on federation's study |
US11715010B2 (en) * | 2019-08-16 | 2023-08-01 | Google Llc | Cross replica reduction on networks having degraded nodes |
WO2021038592A2 (en) * | 2019-08-30 | 2021-03-04 | Tata Consultancy Services Limited | System and method for handling popularity bias in item recommendations |
US11982993B2 (en) | 2020-02-03 | 2024-05-14 | Strong Force TX Portfolio 2018, LLC | AI solution selection for an automated robotic process |
CN111814965A (en) * | 2020-08-14 | 2020-10-23 | Oppo广东移动通信有限公司 | Hyper-parameter adjusting method, device, equipment and storage medium |
US20220076130A1 (en) * | 2020-08-31 | 2022-03-10 | International Business Machines Corporation | Deep surrogate langevin sampling for multi-objective constraint black box optimization with applications to optimal inverse design problems |
US11829799B2 (en) | 2020-10-13 | 2023-11-28 | International Business Machines Corporation | Distributed resource-aware training of machine learning pipelines |
CN117151239B (en) * | 2023-03-17 | 2024-08-27 | 荣耀终端有限公司 | Gradient updating method and related device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10152676B1 (en) * | 2013-11-22 | 2018-12-11 | Amazon Technologies, Inc. | Distributed training of models using stochastic gradient descent |
2016
- 2016-08-25 CN CN201610730381.8A patent/CN107784364B/en active Active
2017
- 2017-08-17 WO PCT/US2017/047247 patent/WO2018039011A1/en unknown
- 2017-08-17 US US16/327,679 patent/US20190197404A1/en active Pending
- 2017-08-17 EP EP17758721.9A patent/EP3504666B1/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030009742A1 (en) * | 2000-12-06 | 2003-01-09 | Bass Michael D. | Automated job training and performance tool |
US8027938B1 (en) * | 2007-03-26 | 2011-09-27 | Google Inc. | Discriminative training in machine learning |
US20140379386A1 (en) * | 2013-06-25 | 2014-12-25 | Arthur Paul Drennan, III | System and method for evaluating text to support multiple insurance applications |
CN104346214A (en) * | 2013-07-30 | 2015-02-11 | 中国银联股份有限公司 | Device and method for managing asynchronous tasks in distributed environments |
CN105683944A (en) * | 2013-11-04 | 2016-06-15 | 谷歌公司 | Systems and methods for layered training in machine-learning architectures |
US9269057B1 (en) * | 2013-12-11 | 2016-02-23 | Google, Inc. | Using specialized workers to improve performance in machine learning |
CN105894087A (en) * | 2015-01-26 | 2016-08-24 | 华为技术有限公司 | System and method for training parameter set in neural network |
CN105022699A (en) * | 2015-07-14 | 2015-11-04 | 惠龙易通国际物流股份有限公司 | Cache region data preprocessing method and system |
CN105825269A (en) * | 2016-03-15 | 2016-08-03 | 中国科学院计算技术研究所 | Parallel autoencoder based feature learning method and system |
Non-Patent Citations (1)
Title |
---|
QI MENG et al.: "Asynchronous Accelerated Stochastic Gradient Descent", Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110322020B (en) * | 2018-03-28 | 2023-05-12 | 国际商业机器公司 | Adaptive learning rate scheduling for distributed random gradient descent |
CN110322020A (en) * | 2018-03-28 | 2019-10-11 | 国际商业机器公司 | The autoadapted learning rate scheduling of distributed random gradient decline |
CN112166444A (en) * | 2018-05-31 | 2021-01-01 | 微软技术许可有限责任公司 | Distributed computing system with integrated data as a service feedback loop engine |
CN109032630A (en) * | 2018-06-29 | 2018-12-18 | 电子科技大学 | The update method of global parameter in a kind of parameter server |
CN109102075A (en) * | 2018-07-26 | 2018-12-28 | 联想(北京)有限公司 | Gradient updating method and relevant device during a kind of distribution is trained |
CN108803348A (en) * | 2018-08-03 | 2018-11-13 | 北京深度奇点科技有限公司 | A kind of optimization method of pid parameter and the optimization device of pid parameter |
CN108803348B (en) * | 2018-08-03 | 2021-07-13 | 北京深度奇点科技有限公司 | PID parameter optimization method and PID parameter optimization device |
CN109508785A (en) * | 2018-10-29 | 2019-03-22 | 清华大学 | A kind of asynchronous parallel optimization method for neural metwork training |
CN109460826A (en) * | 2018-10-31 | 2019-03-12 | 北京字节跳动网络技术有限公司 | For distributing the method, apparatus and model modification system of data |
CN110070116B (en) * | 2019-04-08 | 2022-09-20 | 云南大学 | Segmented selection integration image classification method based on deep tree training strategy |
CN110070116A (en) * | 2019-04-08 | 2019-07-30 | 云南大学 | Segmented based on the tree-shaped Training strategy of depth selects integrated image classification method |
CN110689137A (en) * | 2019-09-24 | 2020-01-14 | 网易传媒科技(北京)有限公司 | Parameter determination method, system, medium, and electronic device |
CN112906792A (en) * | 2021-02-22 | 2021-06-04 | 中国科学技术大学 | Image recognition model rapid training method and system based on many-core processor |
Also Published As
Publication number | Publication date |
---|---|
WO2018039011A1 (en) | 2018-03-01 |
EP3504666B1 (en) | 2021-02-17 |
US20190197404A1 (en) | 2019-06-27 |
CN107784364B (en) | 2021-06-15 |
EP3504666A1 (en) | 2019-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||