CN111738408A - Method, device and equipment for optimizing loss function and storage medium


Info

Publication number
CN111738408A
Authority
CN
China
Prior art keywords
weight
slow
machine learning
optimizer
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010405723.5A
Other languages
Chinese (zh)
Inventor
郭跃超
谯轶轩
唐义君
王俊
高鹏
谢国彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010405723.5A priority Critical patent/CN111738408A/en
Priority to PCT/CN2020/118303 priority patent/WO2021139237A1/en
Publication of CN111738408A publication Critical patent/CN111738408A/en
Legal status: Pending

Classifications

    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 20/00 Machine learning
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of pedestal operation and maintenance, and discloses a method, an apparatus, a device and a storage medium for optimizing a loss function, which are used for solving the problem of low convergence accuracy of the loss function. The method for optimizing the loss function comprises the following steps: obtaining a machine learning task to be optimized, wherein the machine learning task is used for indicating the loss function to be converged in a machine learning model; training the machine learning task with a first optimizer to obtain a first slow weight, wherein the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer; training the machine learning task with a second optimizer to obtain a second slow weight, wherein the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer; merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target update weight; and calculating the target update weight of each iteration stage until the convergence of the loss function is completed.

Description

Method, device and equipment for optimizing loss function and storage medium
Technical Field
The present invention relates to the field of pedestal operation and maintenance, and in particular, to a method, an apparatus, a device, and a storage medium for optimizing a loss function.
Background
With the popularization of neural networks in computing, deep learning enables a neural network to learn how to capture data features; because the captured data features differ from the real data features, the loss function needs to be optimized in time. The optimizer is therefore an important tool for optimizing the loss function in a deep learning network. At present, the optimizer used in deep learning is usually stochastic gradient descent (SGD): when the loss function is optimized with SGD, the gradient is descended randomly on small batches of data, and the optimal loss function is obtained through continuous iteration and convergence.
However, in the later stage of optimizing the loss function, SGD is prone to fall into a local minimum, so that the loss function jitters abnormally during convergence and the optimal convergence cannot be achieved; as a result, the convergence accuracy and efficiency of the loss function are low.
Disclosure of Invention
The invention mainly aims to solve the problem of low accuracy of loss function convergence.
The first aspect of the present invention provides a method for optimizing a loss function, including: obtaining a machine learning task to be optimized, wherein the machine learning task is used for indicating the loss function to be converged in a machine learning model; training the machine learning task with a first optimizer to obtain a first slow weight, wherein the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer; training the machine learning task with a second optimizer to obtain a second slow weight, wherein the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer; merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target update weight; and calculating the target update weight of each iteration stage until the convergence of the loss function is completed.
Optionally, in a first implementation manner of the first aspect of the present invention, the training of the machine learning task with the first optimizer to obtain the first slow weight, where the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer, includes: randomly selecting one sample i_s, i_s ∈ {1, 2, …, n}, from the n training samples of the machine learning task with the first optimizer, where n is an integer greater than 1; using a first preset formula W_{t+1} = W_t - η_t·g_t to calculate the updated first short-time fast weight W_{t+1} of i_s, where in the first preset formula t is the current time, W_t is the weight parameter of the first optimizer at time t, η_t is the learning rate and g_t is the gradient, with g_t = ΔJ_{i_s}(W_t, X(i_s), Y(i_s)), where J(W) is the cost function, ΔJ(W) is the gradient, X(i_s) is the input sample and Y(i_s) is the output sample; and performing an integrated calculation on the values of k first short-time fast weights to obtain the first slow weight, where the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer, and k ∈ {2, 3, …, n}.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing an integrated calculation on the values of k first short-time fast weights to obtain the first slow weight, where the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer and k ∈ {2, 3, …, n}, includes: obtaining k consecutive values of the first short-time fast weight, where k ∈ {2, 3, …, n}; and calculating the first slow weight according to a second preset formula and the k consecutive values of the first short-time fast weight, where in the second preset formula t is the current time, the first slow weight at time t is computed from the starting point of the first short-time fast weights together with the weight parameter W_t at time t and the weight parameter W_{t+k} at time t+k, and the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer.
Optionally, in a third implementation manner of the first aspect of the present invention, the training of the machine learning task with the second optimizer to obtain the second slow weight, where the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer, includes: randomly selecting one sample i, i ∈ {1, 2, …, n}, from the n training samples of the machine learning task with the second optimizer, where n is an integer greater than 1, and calculating the updated second short-time fast weight W'_{t+1} by using a third preset formula W'_{t+1} = W'_t - η·m̂_t/(√v̂_t + ε), where t is the current time, W'_t is the weight parameter of the second optimizer at time t, η is the initial learning rate, ε is a numerical stability quantity, and m̂_t is the correction value of the first-order momentum term, with m_t = β1·m_{t-1} + (1-β1)·g'_t, m̂_t = m_t/(1 - β1^t), v_t = β2·v_{t-1} + (1-β2)·(g'_t)² and v̂_t = v_t/(1 - β2^t), where m_t is the first-order momentum term, v_t is the second-order momentum term, β1 is the first-order momentum decay coefficient, β2 is the second-order momentum decay coefficient, m̂_t is the correction value of m_t, v̂_t is the correction value of v_t, and the gradient g'_t = ΔJ(W_t, i'), where J(W') is the cost function and ΔJ(W_t, i') is the gradient of the cost function of sample i with respect to the weight W at time t; and performing an integrated calculation on the values of k second short-time fast weights to obtain the second slow weight, where the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer, and k ∈ {2, 3, …, n}.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the performing an integrated calculation on the values of k second short-time fast weights to obtain the second slow weight, where the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer and k ∈ {2, 3, …, n}, includes: obtaining k consecutive values of the second short-time fast weight, where k ∈ {2, 3, …, n}; and calculating the second slow weight according to a fourth preset formula and the k consecutive values of the second short-time fast weight, where in the fourth preset formula t is the current time, the second slow weight at time t is computed from the starting point of the second short-time fast weights together with the weight parameter W'_t at time t and the weight parameter W_{t+k} at time t+k, and the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the merging the first slow weight and the second slow weight according to a preset merging formula to obtain the target update weight includes: extracting the first slow weight and the second slow weight at time t, where t ∈ {0, 1, 2, …, n}, and substituting the first slow weight and the second slow weight at time t into the preset merging formula to obtain the target update weight, where in the preset merging formula the target update weight at time t is obtained from the first slow weight at time t, the second slow weight at time t and a coefficient parameter α, and α is calculated from t, the current update time, and T, the number of iterations of the whole training.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the calculating the target update weight at each iteration stage until the convergence of the loss function is completed includes: acquiring a target updating weight of a first iteration stage, taking the target updating weight of the first iteration stage as a starting point of a short-time fast weight of a second iteration stage, and calculating the target updating weight of the second iteration stage; and taking the target updating weight of the second iteration stage as a starting point of the short-time fast weight of the third iteration stage, calculating the target updating weight of the third iteration stage, and calculating the target updating weights of the rest iteration stages until the convergence of the loss function is completed.
A second aspect of the present invention provides an apparatus for optimizing a loss function, including: an obtaining module, configured to obtain a machine learning task to be optimized, where the machine learning task is used to indicate a loss function in a converged machine learning model; the first optimization module is used for training the machine learning task by using a first optimizer to obtain a first slow weight, and the first slow weight is used for indicating a result obtained after the machine learning task is iterated by using the first optimizer; the second optimization module is used for training the machine learning task by using a second optimizer to obtain a second slow weight, and the second slow weight is used for indicating the machine learning task to adopt the second optimizer to perform iteration to obtain a result; the merging module is used for merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target updating weight; and the iteration module is used for calculating the target updating weight of each iteration stage until the convergence of the loss function is completed.
Optionally, in a first implementation manner of the second aspect of the present invention, the first optimization module includes: a first selection unit, configured to randomly select one sample i_s, i_s ∈ {1, 2, …, n}, from the n training samples of the machine learning task by using the first optimizer, where n is an integer greater than 1; a first calculation unit, configured to use a first preset formula W_{t+1} = W_t - η_t·g_t to calculate the updated first short-time fast weight W_{t+1} of i_s, where in the first preset formula t is the current time, W_t is the weight parameter of the first optimizer at time t, η_t is the learning rate and g_t is the gradient, with g_t = ΔJ_{i_s}(W_t, X(i_s), Y(i_s)), where J(W) is the cost function, ΔJ(W) is the gradient, X(i_s) is the input sample and Y(i_s) is the output sample; and a first integration unit, configured to perform an integrated calculation on the values of k first short-time fast weights to obtain the first slow weight, where the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer, and k ∈ {2, 3, …, n}.
Optionally, in a second implementation manner of the second aspect of the present invention, the first integration unit is specifically configured to: obtain k consecutive values of the first short-time fast weight, where k ∈ {2, 3, …, n}; and calculate the first slow weight according to a second preset formula and the k consecutive values of the first short-time fast weight, where in the second preset formula t is the current time, the first slow weight at time t is computed from the starting point of the first short-time fast weights together with the weight parameter W_t at time t and the weight parameter W_{t+k} at time t+k, and the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer.
Optionally, in a third implementation manner of the second aspect of the present invention, the second optimization module includes: a second selection unit, configured to randomly select one sample i, i ∈ {1, 2, …, n}, from the n training samples of the machine learning task by using the second optimizer, where n is an integer greater than 1; a second calculation unit, configured to calculate the updated second short-time fast weight W'_{t+1} of i by using a third preset formula W'_{t+1} = W'_t - η·m̂_t/(√v̂_t + ε), where t is the current time, W'_t is the weight parameter of the second optimizer at time t, η is the initial learning rate, ε is a numerical stability quantity, and m̂_t is the correction value of the first-order momentum term, with m_t = β1·m_{t-1} + (1-β1)·g'_t, m̂_t = m_t/(1 - β1^t), v_t = β2·v_{t-1} + (1-β2)·(g'_t)² and v̂_t = v_t/(1 - β2^t), where m_t is the first-order momentum term, v_t is the second-order momentum term, β1 is the first-order momentum decay coefficient, β2 is the second-order momentum decay coefficient, m̂_t is the correction value of m_t, v̂_t is the correction value of v_t, and the gradient g'_t = ΔJ(W_t, i'), where J(W') is the cost function and ΔJ(W_t, i') is the gradient of the cost function of sample i with respect to the weight W at time t; and a second integration unit, configured to perform an integrated calculation on the values of k second short-time fast weights to obtain the second slow weight, where the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer, and k ∈ {2, 3, …, n}.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the second integration unit is specifically configured to: obtain k consecutive values of the second short-time fast weight, where k ∈ {2, 3, …, n}; and calculate the second slow weight according to a fourth preset formula and the k consecutive values of the second short-time fast weight, where in the fourth preset formula t is the current time, the second slow weight at time t is computed from the starting point of the second short-time fast weights together with the weight parameter W'_t at time t and the weight parameter W_{t+k} at time t+k, and the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the merging module is specifically configured to: extract the first slow weight and the second slow weight at time t, where t ∈ {0, 1, 2, …, n}, and substitute the first slow weight and the second slow weight at time t into the preset merging formula to obtain the target update weight, where in the preset merging formula the target update weight at time t is obtained from the first slow weight at time t, the second slow weight at time t and a coefficient parameter α, and α is calculated from t, the current update time, and T, the number of iterations of the whole training.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the iteration module is specifically configured to: acquiring a target updating weight of a first iteration stage, taking the target updating weight of the first iteration stage as a starting point of a short-time fast weight of a second iteration stage, and calculating the target updating weight of the second iteration stage; and taking the target updating weight of the second iteration stage as a starting point of the short-time fast weight of the third iteration stage, calculating the target updating weight of the third iteration stage, and calculating the target updating weights of the rest iteration stages until the convergence of the loss function is completed.
A third aspect of the present invention provides a device for optimizing a loss function, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor calls the instructions in the memory to cause the optimization device of the loss function to perform the optimization method of the loss function described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the above-described method of optimization of a loss function.
According to the technical scheme, a machine learning task to be optimized is obtained, and the machine learning task is used for indicating a loss function in a convergence machine learning model; training the machine learning task by using a first optimizer to obtain a first slow weight, wherein the first slow weight is used for indicating a result obtained after the machine learning task is iterated by using the first optimizer; training the machine learning task by using a second optimizer to obtain a second slow weight, wherein the second slow weight is used for indicating a result obtained after the machine learning task is iterated by using the second optimizer; merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target updating weight; calculating the target update weight for each iteration stage until the loss function convergence is completed. In the embodiment of the invention, the first slow weight calculated by the first optimizer and the second slow weight calculated by the second optimizer are integrated and calculated to obtain the target update weight, and finally iterative calculation is carried out until the loss function is converged, so that the calculation time of calculating the weight and the abnormal jitter of the loss function during convergence are reduced, and the convergence accuracy and the convergence efficiency of the loss function are improved.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a method for optimizing a loss function according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating another embodiment of the method for optimizing the loss function according to the embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of an apparatus for optimizing a loss function according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of the loss function optimization apparatus according to the embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of an optimization apparatus for a loss function in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a method, a device, equipment and a storage medium for optimizing a loss function, wherein a first slow weight calculated by a first optimizer and a second slow weight calculated by a second optimizer are integrated and calculated to obtain a target update weight, so that the calculation time of the calculated weights and the abnormal jitter of the loss function during convergence are reduced, and the convergence accuracy and the convergence efficiency of the loss function are improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, an embodiment of the method for optimizing a loss function in the embodiment of the present invention includes:
101. obtaining a machine learning task to be optimized, wherein the machine learning task is used for indicating a loss function in a convergence machine learning model;
the server obtains a machine learning task to be optimized for indicating a loss function in the converged machine learning model.
It is to be understood that the execution subject of the present invention may be an optimization apparatus of the loss function, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
It should be noted that, in the process of deep learning, each machine learning model has a loss function, and the purpose of deep learning is to minimize that loss function. However, not all machine learning models can find the minimum value of the loss function quickly and accurately, and some loss functions have no minimum value at all, so a general machine learning model uses a convex function as the loss function, which guarantees that a minimum value exists. The most common method for finding the minimum value of a convex function in deep learning is the gradient descent method, and this is the main role of the optimizer in deep learning. The machine learning task here refers to the loss function that needs to be optimized in deep learning; the loss function plays a very important role in deep learning, and many deep learning loss functions are constructed on sample pairs or sample triplets, so the magnitude of the sample space is very large.
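As an illustration of how gradient descent finds the minimum of a convex loss function, here is a minimal sketch in Python; the loss J(w) = (w - 3)^2 and every constant in it are examples chosen for this sketch, not values from the patent:
```python
# Plain gradient descent on the convex loss J(w) = (w - 3)^2, whose minimum is at w = 3.
def grad_J(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0    # initial weight
lr = 0.1   # learning rate
for step in range(100):
    w = w - lr * grad_J(w)  # step against the gradient

print(round(w, 4))  # approaches 3.0, the minimizer of the convex loss
```
Because the loss is convex, the iterates approach the unique minimum regardless of the starting point, which is exactly the property the passage above relies on.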
102. Training the machine learning task by using a first optimizer to obtain a first slow weight, wherein the first slow weight is used for indicating a result obtained after the machine learning task is iterated by using the first optimizer;
the server trains the machine learning task by using the first optimizer to obtain a first slow weight used for indicating a result obtained after the machine learning task is iterated by using the first optimizer.
The first optimizer here uses stochastic gradient descent (SGD). SGD has low requirements on the gradient, so the gradient can be calculated quickly, and SGD can still converge in the presence of noise within a certain range. For example, compared with the standard gradient descent method, which traverses all samples, SGD updates the parameters once for each input sample, so the time required for a single parameter update is greatly shortened.
It should be noted that, when SGD performs a loss update for each sample, setting the learning rate in the algorithm is critical: if the learning rate set by the server is small, the convergence speed is slow, and if the learning rate set by the server is large, the loss function jitters abnormally during convergence and the optimal convergence cannot be achieved, so a large amount of training is required to obtain an appropriate learning rate. In addition, in the later stage of optimizing the loss function, SGD is prone to fall into a local minimum and cannot effectively carry out later-stage optimization. Therefore, in this application SGD is adopted as the first optimizer to train large-scale data sets rapidly, which improves the efficiency of optimizing the loss function, and it is combined with the second optimizer to achieve both improved efficiency and improved accuracy.
103. Training the machine learning task by using a second optimizer to obtain a second slow weight, wherein the second slow weight is used for indicating a result obtained after the machine learning task is iterated by using the second optimizer;
and the server trains the machine learning task by using the second optimizer to obtain a second slow weight used for indicating a result obtained after the machine learning task is iterated by using the second optimizer.
The second optimizer adopts the adaptive moment estimation algorithm (Adam). Adam belongs to the adaptive learning rate optimization algorithms, which optimize the learning rate in the machine learning model. A traditional optimization algorithm generally sets the learning rate as a constant or adjusts it according to the number of training rounds; this setting ignores other possible changes of the learning rate and thus causes deviations in the loss optimization. The Adam optimizer can dynamically adjust the learning rate of each parameter by using the first moment estimate and the second moment estimate of the gradient, so as to optimize the loss function accurately. The advantage of Adam is that, after offset correction, the learning rate of each iteration lies within a determined range, so the adjusted parameters are relatively stable and the loss function is optimized more accurately.
Adam iterates as follows: in the machine learning task, the server randomly selects one sample i from the n training samples with the second optimizer, where n is an integer greater than 1, and performs a loss update on sample i. The server uses g'_t = ΔJ(W_t, i') to calculate g'_t, the gradient of the cost function with respect to W' after t iterations, where J(W') is the cost function and ΔJ(W_t, i') is the gradient of the cost function with respect to W for the given sample i at time t. The server then calculates the first-order momentum term m_t and the second-order momentum term v_t, where m_t = β1·m_{t-1} + (1-β1)·g'_t with correction value m̂_t = m_t/(1 - β1^t), β1 being the first-order momentum decay coefficient, generally 0.9, and v_t = β2·v_{t-1} + (1-β2)·(g'_t)² with correction value v̂_t = v_t/(1 - β2^t), β2 being the second-order momentum decay coefficient, generally 0.999. After the server has calculated the correction values of m_t and v_t, it calculates the second short-time fast weight according to the third preset formula W'_{t+1} = W'_t - η·m̂_t/(√v̂_t + ε), where η is the initial learning rate, generally 0.01, and ε is a numerical stability quantity, generally 10^-8, which ensures that the denominator of the fraction is not zero. This completes one loss optimization.
104. Merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target updating weight;
and the server combines the first slow weight and the second slow weight according to a preset combination formula to obtain a target update weight.
It should be noted that the first slow weight calculated by the server can be understood as the result of the first optimizer optimizing the loss function, and the second slow weight calculated by the server can be understood as the result of the second optimizer optimizing the same loss function. The results of the first optimizer and the second optimizer both contain errors, so the server combines the optimized first slow weight and second slow weight, and the target update weight obtained through the preset merging formula conforms more closely to the true value. In the preset merging formula used by the server, the target update weight is obtained from the first slow weight, the second slow weight and a coefficient parameter α, where, without loss of generality, α conforms to a certain probability distribution, t is the current update time and T is the number of iterations of the whole training. Since the first slow weight and the second slow weight can be obtained through the calculations of steps 102-103, the server can combine the first slow weight with the second slow weight to obtain the target update weight.
105. And calculating the target updating weight of each iteration stage until the convergence of the loss function is completed.
The server calculates the target update weights for each iteration stage until the loss function convergence is complete.
It can be understood that the server calculates the target update weight using the results of the k iteration steps and connects the target update weights calculated at the different stages to obtain the result after loss optimization, until the convergence of the loss function is completed. This iterative updating method effectively overcomes the jitter of the error function during gradient updates at the end of training and ensures both the convergence speed and the convergence accuracy.
In the embodiment of the invention, the first slow weight calculated by the first optimizer and the second slow weight calculated by the second optimizer are integrated and calculated to obtain the target update weight, and finally iterative calculation is carried out until the loss function is converged, so that the calculation time of calculating the weight and the abnormal jitter of the loss function during convergence are reduced, and the convergence accuracy and the convergence efficiency of the loss function are improved.
Referring to fig. 2, another embodiment of the method for optimizing a loss function according to the embodiment of the present invention includes:
201. obtaining a machine learning task to be optimized, wherein the machine learning task is used for indicating a loss function in a convergence machine learning model;
the server obtains a machine learning task to be optimized for indicating a loss function in the converged machine learning model.
It should be noted that the machine learning task here refers to the loss function that needs to be optimized in deep learning. The loss function plays a very important role in deep metric learning, and many loss functions in deep metric learning are constructed on sample pairs or sample triplets, so the magnitude of the sample space is very large. In the later-stage training of the learning model, the gradient values of the sample pairs and sample triplets are almost 0; if no targeted optimization is performed, the convergence rate of the learning algorithm is very slow and it easily falls into a local optimum, which reduces the accuracy of the loss function.
In the method, the machine learning task is solved by using a gradient descent method, which is the most common algorithm for optimizing a loss function in an optimizer. Generally, the stochastic gradient descent method (SGD) is used: the gradient is descended randomly on small batches of data, and the optimal loss function is obtained through continuous iteration and convergence. Although each SGD update has a certain deviation in direction and each computation batch is small, the more updates are performed, the better the convergence effect of the loss function. Variant optimizers derived from SGD fall into two broad categories, and the optimization method of this application combines an SGD optimizer with an adaptive learning rate mechanism optimizer to obtain the optimal loss function.
202. Training the machine learning task by using a first optimizer to obtain a first slow weight, wherein the first slow weight is used for indicating a result obtained after the machine learning task is iterated by using the first optimizer;
the server trains the machine learning task by using the first optimizer to obtain a first slow weight used for indicating a result obtained after the machine learning task is iterated by using the first optimizer. Specifically, the method comprises the following steps:
The server randomly selects one sample i_s, i_s ∈ {1, 2, …, n}, from the n training samples of the machine learning task with the first optimizer, where n is an integer greater than 1, and uses the first preset formula W_{t+1} = W_t - η_t·g_t to calculate the updated first short-time fast weight W_{t+1} of i_s. In the first preset formula, t is the current time, W_t is the weight parameter of the first optimizer at time t, η_t is the learning rate and g_t is the gradient, with g_t = ΔJ_{i_s}(W_t, X(i_s), Y(i_s)), where J(W) is the cost function, ΔJ(W) is the gradient, X(i_s) is the input sample and Y(i_s) is the output sample. The server then performs an integrated calculation on the values of k first short-time fast weights to obtain the first slow weight, which indicates the result obtained after the machine learning task is iterated with the first optimizer, k ∈ {2, 3, …, n}.
The first optimizer here is the SGD optimizer. SGD has low requirements on the gradient, so the gradient can be calculated quickly, and SGD can tolerate noise within a certain range. Compared with the standard gradient descent method, which traverses all samples, SGD updates the parameters once for each input sample, so the time required per update is greatly shortened.
For example, in the machine learning task the server randomly selects one sample i_s from the n training samples, where n is an integer greater than 1, and performs a loss update for sample i_s: the server uses g_t = ΔJ_{i_s}(W_t, X(i_s), Y(i_s)) to calculate g_t, the current gradient of SGD, where J(W) is the cost function, ΔJ(W) is the gradient, X(i_s) is the input sample and Y(i_s) is the output sample; the server then uses the first preset formula W_{t+1} = W_t - η_t·g_t together with the calculated g_t to compute the first short-time fast weight, where t is the current time, W_t is the weight parameter of the first optimizer at time t and η_t is the learning rate. This completes one loss optimization.
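As an illustration, the following minimal sketch implements this SGD fast-weight step for a linear model with a squared-error cost; the data, names and hyperparameters are illustrative and not taken from the patent:
```python
import numpy as np

def sgd_fast_step(W, X, Y, eta_t, rng):
    # One SGD fast-weight update W_{t+1} = W_t - eta_t * g_t on a randomly chosen sample i_s.
    i_s = rng.integers(len(X))            # randomly select one sample i_s
    pred = X[i_s] @ W                     # model output for the input sample X(i_s)
    g_t = 2.0 * (pred - Y[i_s]) * X[i_s]  # gradient of the squared-error cost for i_s
    return W - eta_t * g_t                # first short-time fast weight W_{t+1}

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))                         # toy input samples
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])         # toy output samples
W = np.zeros(5)
for t in range(200):
    W = sgd_fast_step(W, X, Y, eta_t=0.05, rng=rng)  # k such steps feed the slow weight
```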
The server performs an integrated calculation on the values of the k first short-time fast weights to obtain the first slow weight, where the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer and k ∈ {2, 3, …, n}. Specifically: the server obtains k consecutive values of the first short-time fast weight, where k ∈ {2, 3, …, n}, and calculates the first slow weight according to the second preset formula and the k consecutive values of the first short-time fast weight. In the second preset formula, t is the current time, and the first slow weight at time t is computed from the starting point of the first short-time fast weights together with the weight parameter W_t at time t and the weight parameter W_{t+k} at time t+k; the first slow weight is used for indicating the result obtained after the machine learning task is iterated with the first optimizer.
It is further explained that, after the server performs the training iterations, it obtains the values of the first short-time fast weights updated at different points. After the server has calculated the first short-time fast weight at each moment, the values of k consecutive first short-time fast weights need to be calculated in an integrated way: the k first short-time fast weights are used to calculate the first slow weight, and using several values for the calculation ensures the accuracy of the first slow weight, so that its value fits the whole sequence of first short-time fast weights more closely.
In general, the value of k is 4, so that the calculated value is more suitable for optimization of data, but the value of k may be modified according to actual conditions, and the value of k is not limited in this application.
For example, when k is 4, the server calculates the first slow weight as follows: the server knows the starting point of the first short-time fast weights, and the consecutive first short-time fast weights are W_1, W_2, W_3 and W_4; the server then calculates the first slow weight from these values according to the second preset formula, which yields a first slow weight that fits the first short-time fast weights more closely.
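To make the integration step concrete, here is a small sketch; because the second preset formula is reproduced only as an image in the original publication, the sketch assumes a simple arithmetic mean of the k fast weights as one plausible reading, and should be adjusted if the original formula differs:
```python
import numpy as np

def integrate_slow_weight(fast_weights):
    # fast_weights: list of k consecutive fast-weight vectors, e.g. W_1..W_4 for k = 4.
    # Assumption: the "integrated calculation" is taken here to be their arithmetic mean.
    return np.mean(np.stack(fast_weights), axis=0)

fast = [np.array([1.0, 2.0]), np.array([1.2, 1.8]),
        np.array([0.9, 2.1]), np.array([1.1, 1.9])]  # k = 4 consecutive fast weights
slow = integrate_slow_weight(fast)                    # first slow weight, here [1.05, 1.95]
```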
203. Training the machine learning task by using a second optimizer to obtain a second slow weight, wherein the second slow weight is used for indicating a result obtained after the machine learning task is iterated by using the second optimizer;
and the server trains the machine learning task by using the second optimizer to obtain a second slow weight used for indicating a result obtained after the machine learning task is iterated by using the second optimizer. Specifically, the method comprises the following steps:
The server first randomly selects one sample i, i ∈ {1, 2, …, n}, from the n training samples of the machine learning task with the second optimizer, where n is an integer greater than 1, and then uses the third preset formula to calculate the updated second short-time fast weight W'_{t+1}: W'_{t+1} = W'_t - η·m̂_t/(√v̂_t + ε), where t is the current time, W'_t is the weight parameter of the second optimizer at time t, η is the initial learning rate, ε is a numerical stability quantity, and m̂_t is the correction value of the first-order momentum term, with m_t = β1·m_{t-1} + (1-β1)·g'_t, m̂_t = m_t/(1 - β1^t), v_t = β2·v_{t-1} + (1-β2)·(g'_t)² and v̂_t = v_t/(1 - β2^t), where m_t is the first-order momentum term, v_t is the second-order momentum term, β1 is the first-order momentum decay coefficient, β2 is the second-order momentum decay coefficient, m̂_t is the correction value of m_t, v̂_t is the correction value of v_t, and the gradient g'_t = ΔJ(W_t, i'), where J(W') is the cost function and ΔJ(W_t, i') is the gradient of the cost function of sample i with respect to the weight W at time t. Finally, the server performs an integrated calculation on the values of k second short-time fast weights to obtain the second slow weight, which indicates the result obtained after the machine learning task is iterated with the second optimizer, k ∈ {2, 3, …, n}.
The second optimizer adopts the adaptive moment estimation algorithm (Adam). Adam belongs to the adaptive learning rate optimization algorithms, which optimize the learning rate in the machine learning model. The Adam optimizer can dynamically adjust the learning rate of each parameter by using the first moment estimate and the second moment estimate of the gradient, so as to optimize the loss function accurately. The advantage of Adam is that, after offset correction, the learning rate of each iteration lies within a determined range, so the adjusted parameters are relatively stable and the loss function is optimized more accurately. Adam iterates as follows: in the machine learning task, the server randomly selects one sample i from the n training samples with the second optimizer, where n is an integer greater than 1, and performs a loss update on sample i. The server uses g'_t = ΔJ(W_t, i') to calculate g'_t, the gradient of the cost function with respect to W' after t iterations, where J(W') is the cost function and ΔJ(W_t, i') is the gradient of the cost function with respect to W for the given sample i at time t. The server then calculates the first-order momentum term m_t = β1·m_{t-1} + (1-β1)·g'_t and its correction value m̂_t = m_t/(1 - β1^t), where β1 is the first-order momentum decay coefficient, generally 0.9, and the second-order momentum term v_t = β2·v_{t-1} + (1-β2)·(g'_t)² and its correction value v̂_t = v_t/(1 - β2^t), where β2 is the second-order momentum decay coefficient, generally 0.999. After calculating the correction values of m_t and v_t, the server calculates the second short-time fast weight according to the third preset formula W'_{t+1} = W'_t - η·m̂_t/(√v̂_t + ε), where η is the initial learning rate, generally 0.01, and ε is a numerical stability quantity, generally 10^-8, which ensures that the denominator of the fraction is not zero. This completes one loss optimization.
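The following sketch performs one Adam step for the second optimizer along the lines described above (bias-corrected first- and second-order momentum terms, with the quoted defaults β1 = 0.9, β2 = 0.999, η = 0.01, ε = 10^-8); the toy gradient used to drive it is illustrative:
```python
import numpy as np

def adam_fast_step(W, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first-order momentum term m_t
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-order momentum term v_t
    m_hat = m / (1 - beta1 ** t)                  # correction value of m_t
    v_hat = v / (1 - beta2 ** t)                  # correction value of v_t
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)  # second short-time fast weight W'_{t+1}
    return W, m, v

W = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
target = np.array([0.5, -1.0, 2.0])            # toy optimum
for t in range(1, 501):
    grad = 2.0 * (W - target)                  # toy gradient of ||W - target||^2
    W, m, v = adam_fast_step(W, grad, m, v, t) # W converges towards the toy optimum
```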
The server performs an integrated calculation on the values of the k second short-time fast weights to obtain the second slow weight, where the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer and k ∈ {2, 3, …, n}. Specifically: the server obtains k consecutive values of the second short-time fast weight, where k ∈ {2, 3, …, n}, and then calculates the second slow weight according to the fourth preset formula and the k consecutive values of the second short-time fast weight. In the fourth preset formula, t is the current time, and the second slow weight at time t is computed from the starting point of the second short-time fast weights together with the weight parameter W'_t at time t and the weight parameter W_{t+k} at time t+k; the second slow weight is used for indicating the result obtained after the machine learning task is iterated with the second optimizer.
It should be noted that, after the server performs the training iterations, it obtains the values of the second short-time fast weights updated at different points. After the server has calculated the second short-time fast weight at each moment, the values of k consecutive second short-time fast weights need to be calculated: the k second short-time fast weights are used to calculate the second slow weight, and using several values for the calculation ensures the accuracy of the second slow weight, so that its value fits the whole sequence of second short-time fast weights more closely.
In general, the value of k is 4, so that the calculated value is more suitable for optimization of data, but the value of k may be modified according to actual conditions, and the value of k is not limited in this application.
For example, when k is 4, the server calculates the second slow weight as follows: the server knows the starting point of the second short-time fast weights, and the consecutive second short-time fast weights are W_1, W_2, W_3 and W_4; the server then calculates the second slow weight from these values according to the fourth preset formula, which yields a second slow weight that fits the second short-time fast weights more closely.
204. Merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target updating weight;
and the server combines the first slow weight and the second slow weight according to a preset combination formula to obtain a target update weight. Specifically, the method comprises the following steps:
The server extracts the first slow weight and the second slow weight at time t, where t ∈ {0, 1, 2, …, n}, and then substitutes the first slow weight and the second slow weight at time t into the preset merging formula to obtain the target update weight. In the preset merging formula, the target update weight at time t is obtained from the first slow weight at time t, the second slow weight at time t and a coefficient parameter α, where α is calculated from t, the current update time, and T, the number of iterations of the whole training.
It should be noted that the first slow weight calculated by the server can be understood as the result of the first optimizer optimizing the loss function, and the second slow weight calculated by the server can be understood as the result of the second optimizer optimizing the same loss function. The results of the first optimizer and the second optimizer both contain errors, so the server combines the optimized first slow weight and second slow weight, and the target update weight obtained through the preset merging formula conforms more closely to the true value. In the preset merging formula used by the server, α conforms, without loss of generality, to a certain probability distribution, t is the current update time and T is the number of iterations of the whole training. Since the first slow weight and the second slow weight can be obtained through the calculations of step 202 and step 203, the server can combine the first slow weight and the second slow weight to obtain the target update weight.
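As a sketch of this merging step: the preset merging formula and the exact schedule of α are given only as images in the original publication, so the convex combination and the linearly decaying α below are assumptions made purely for illustration:
```python
import numpy as np

def merge_slow_weights(slow_sgd, slow_adam, t, T):
    # Assumed merging rule: a convex combination of the two slow weights with a
    # coefficient alpha that decays linearly over the T training iterations.
    alpha = 1.0 - t / T
    return alpha * slow_sgd + (1.0 - alpha) * slow_adam

target_w = merge_slow_weights(np.array([1.05, 1.95]),  # first slow weight (SGD branch)
                              np.array([0.98, 2.02]),  # second slow weight (Adam branch)
                              t=10, T=100)
```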
205. Acquiring a target updating weight of a first iteration stage, taking the target updating weight of the first iteration stage as a starting point of a short-time fast weight of a second iteration stage, and calculating the target updating weight of the second iteration stage;
and the server acquires the target updating weight of the first iteration stage, takes the target updating weight of the first iteration stage as the starting point of the short-time fast weight of the second iteration stage, and calculates the target updating weight of the second iteration stage.
It is understood that the server takes the target update weight calculated in the first iteration stage as the starting point of the short-time fast weights of the second iteration stage, and then calculates the target update weight of the second iteration stage by the method of steps 202-204. The steps by which the server calculates the target update weight in the second iteration stage are the same as those in the first iteration stage, so they are not described here again in detail.
206. And taking the target updating weight of the second iteration stage as the starting point of the short-time fast weight of the third iteration stage, calculating the target updating weight of the third iteration stage, and calculating the target updating weights of the rest iteration stages until the convergence of the loss function is completed.
And the server takes the target updating weight of the second iteration stage as the starting point of the short-time fast weight of the third iteration stage, calculates the target updating weight of the third iteration stage and calculates the target updating weights of the rest iteration stages until the convergence of the loss function is completed.
It can be understood that the server calculates the target update weight using the results of the k iteration steps and connects the target update weights calculated at the different stages to obtain the result after loss optimization, until the convergence of the loss function is completed. This iterative updating method effectively overcomes the jitter of the error function during gradient updates at the end of training and ensures both the convergence speed and the convergence accuracy.
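Putting the stages together, the following sketch outlines the overall flow of steps 201 to 206; it reuses the helper functions sketched earlier (sgd_fast_step, adam_fast_step, integrate_slow_weight, merge_slow_weights), inherits their stated assumptions, and uses an illustrative stand-in for the convergence test:
```python
import numpy as np

def optimize_loss(X, Y, k=4, stages=50, tol=1e-6):
    rng = np.random.default_rng(0)
    W = np.zeros(X.shape[1])               # target update weight at the start of a stage
    m = np.zeros_like(W)
    v = np.zeros_like(W)
    step = 0
    for stage in range(1, stages + 1):
        W_sgd, W_adam = W.copy(), W.copy() # both optimizers start from the same point
        sgd_hist, adam_hist = [], []
        for _ in range(k):                 # k fast-weight iterations per stage
            step += 1
            W_sgd = sgd_fast_step(W_sgd, X, Y, eta_t=0.05, rng=rng)
            grad = 2.0 * X.T @ (X @ W_adam - Y) / len(Y)  # illustrative gradient
            W_adam, m, v = adam_fast_step(W_adam, grad, m, v, step)
            sgd_hist.append(W_sgd)
            adam_hist.append(W_adam)
        slow_sgd = integrate_slow_weight(sgd_hist)        # first slow weight
        slow_adam = integrate_slow_weight(adam_hist)      # second slow weight
        W_new = merge_slow_weights(slow_sgd, slow_adam, stage, stages)
        if np.linalg.norm(W_new - W) < tol:               # stand-in convergence test
            return W_new
        W = W_new                          # starting point of the next iteration stage
    return W
```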
In the embodiment of the invention, the first slow weight calculated by the first optimizer and the second slow weight calculated by the second optimizer are integrated and calculated to obtain the target update weight, and finally iterative calculation is carried out until the loss function is converged, so that the calculation time of calculating the weight and the abnormal jitter of the loss function during convergence are reduced, and the convergence accuracy and the convergence efficiency of the loss function are improved.
With reference to fig. 3, the method for optimizing a loss function in an embodiment of the present invention is described above, and an embodiment of an apparatus for optimizing a loss function in an embodiment of the present invention is described below, where:
an obtaining module 301, configured to obtain a machine learning task to be optimized, where the machine learning task is used to indicate a loss function in a converged machine learning model;
the first optimization module 302 is configured to train the machine learning task by using a first optimizer to obtain a first slow weight, where the first slow weight is used to instruct the machine learning task to use the first optimizer to perform iteration to obtain a result;
the second optimization module 303 is configured to train the machine learning task by using a second optimizer to obtain a second slow weight, where the second slow weight is used to instruct the machine learning task to use the second optimizer for iteration to obtain a result;
a combining module 304, configured to combine the first slow weight and the second slow weight according to a preset combining formula, to obtain a target update weight;
an iteration module 305 for calculating the target update weight for each iteration stage until the loss function convergence is completed.
In the embodiment of the invention, the first slow weight calculated by the first optimizer and the second slow weight calculated by the second optimizer are integrated and calculated to obtain the target update weight, and finally iterative calculation is carried out until the loss function is converged, so that the calculation time of calculating the weight and the abnormal jitter of the loss function during convergence are reduced, and the convergence accuracy and the convergence efficiency of the loss function are improved.
Referring to fig. 4, another embodiment of the apparatus for optimizing a loss function according to the embodiment of the present invention includes:
an obtaining module 301, configured to obtain a machine learning task to be optimized, where the machine learning task is used to indicate a loss function in a converged machine learning model;
the first optimization module 302 is configured to train the machine learning task by using a first optimizer to obtain a first slow weight, where the first slow weight is used to indicate the result obtained after the machine learning task is iterated by the first optimizer;
the second optimization module 303 is configured to train the machine learning task by using a second optimizer to obtain a second slow weight, where the second slow weight is used to indicate the result obtained after the machine learning task is iterated by the second optimizer;
a combining module 304, configured to combine the first slow weight and the second slow weight according to a preset combining formula, to obtain a target update weight;
an iteration module 305 for calculating the target update weight for each iteration stage until the loss function convergence is completed.
Optionally, the first optimization module 302 includes:
a first selecting unit 3021, configured to randomly select one sample i_s from the n training samples of the machine learning task by using the first optimizer, where i_s ∈ {1,2,…,n} and n is an integer greater than 1;
a first calculating unit 3022, configured to calculate the first short-time fast weight W_{t+1} obtained after the update on sample i_s by using a first preset formula W_{t+1} = W_t − η_t·g_t, where t is the current time, W_t is the weight parameter of the first optimizer at time t, η_t is the learning rate, and g_t is the gradient, with g_t = ΔJ(W_t, i_s), i.e. the gradient at time t of the cost function J(W) with respect to the weight W on the input sample X(i_s) and the output sample Y(i_s);
the first integrating unit 3023 is configured to perform integration calculation on the values of the k first short-time fast weights to obtain first slow weights, where the first slow weights are used to indicate results obtained after the machine learning task is iterated by using a first optimizer, and k belongs to {2,3, …, n }.
Optionally, the first integration unit 3023 may be further specifically configured to:
obtaining the values of k consecutive first short-time fast weights, where k ∈ {2,3,…,n};
calculating a first slow weight according to a second preset formula and the values of the k consecutive first short-time fast weights, where the second preset formula (reproduced only as an image in the original publication) relates the first slow weight at time t to the starting point of the first short-time fast weight, the weight parameter W_t at time t, and the weight parameter W_{t+k} at time t+k, t being the current time; the first slow weight is used to indicate the result obtained after the machine learning task is iterated by the first optimizer.
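To make the first optimizer branch above concrete, the following minimal Python sketch runs k stochastic-gradient fast steps from a given starting point and integrates them into a first slow weight. The least-squares cost, the helper name first_optimizer_slow_weight, and the use of a simple average as the integration rule are illustrative assumptions, since the second preset formula is published only as an image.

```python
# A minimal sketch of the first optimizer's inner loop, assuming a least-squares
# cost J(W) = 0.5 * (X[i_s] @ W - Y[i_s])**2 and assuming (as a stand-in for the
# second preset formula) that the k short-time fast weights are integrated by a
# simple average.
import numpy as np

def first_optimizer_slow_weight(W_start, X, Y, k, lr=0.01, rng=None):
    rng = np.random.default_rng(rng)
    W = W_start.copy()
    fast_weights = []
    for _ in range(k):
        i_s = rng.integers(len(X))           # randomly pick one sample i_s
        g = (X[i_s] @ W - Y[i_s]) * X[i_s]   # gradient g_t of the cost on sample i_s
        W = W - lr * g                       # first preset formula: W_{t+1} = W_t - eta_t * g_t
        fast_weights.append(W.copy())
    return np.mean(fast_weights, axis=0)     # assumed integration: average of the k fast weights

# usage (toy): W_slow1 = first_optimizer_slow_weight(np.zeros(3), X, Y, k=5)
```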
Optionally, the second optimization module 303 includes:
a second selecting unit 3031, configured to randomly select, by using a second optimizer, one sample i from n training samples of a machine learning task, where i belongs to {1,2, …, n }, and n is an integer greater than 1;
a second calculating unit 3032, configured to calculate the second short-time fast weight W'_{t+1} obtained after the update on sample i by using a third preset formula, the third preset formula being:
W'_{t+1} = W'_t − η·m̂_t/(√v̂_t + ε),
where t is the current time, W'_t is the weight parameter of the second optimizer at time t, η is the initial learning rate, ε is a numerical stability quantity, and m̂_t and v̂_t are the correction values of the first-order and second-order momentum terms, whose expressions are as follows:
m̂_t = m_t/(1 − β1^t), m_t = β1·m_{t−1} + (1 − β1)·g'_t,
v̂_t = v_t/(1 − β2^t), v_t = β2·v_{t−1} + (1 − β2)·(g'_t)²,
where m_t is the first-order momentum term, v_t is the second-order momentum term, β1 is the first-order momentum decay coefficient, β2 is the second-order momentum decay coefficient, and the gradient g'_t = ΔJ(W'_t, i) is the gradient at time t of the cost function J(W') with respect to the weight W' on the sample i;
a second integrating unit 3033, configured to perform integration calculation on the values of the k second short-time fast weights to obtain a second slow weight, where the second slow weight is used to indicate a result obtained after the machine learning task is iterated by using a second optimizer, and k belongs to {2,3, …, n }.
Optionally, the second integration unit 3033 may be further specifically configured to:
obtaining the values of k consecutive second short-time fast weights, where k ∈ {2,3,…,n};
calculating a second slow weight according to a fourth preset formula and the values of the k consecutive second short-time fast weights, where the fourth preset formula (reproduced only as an image in the original publication) relates the second slow weight at time t to the starting point of the second short-time fast weight, the weight parameter W'_t at time t, and the weight parameter W'_{t+k} at time t+k, t being the current time; the second slow weight is used to indicate the result obtained after the machine learning task is iterated by the second optimizer.
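The second optimizer branch can be sketched in the same way. The Adam-style update below follows the m_t, v_t and bias-correction expressions given above; the least-squares cost, the helper name second_optimizer_slow_weight, and the averaging used for the slow-weight integration are again illustrative assumptions.

```python
# A minimal sketch of the second optimizer's inner loop: first/second-order momentum
# with bias correction, followed by an assumed average of the k fast weights.
import numpy as np

def second_optimizer_slow_weight(W_start, X, Y, k, lr=0.001,
                                 beta1=0.9, beta2=0.999, eps=1e-8, rng=None):
    rng = np.random.default_rng(rng)
    W = W_start.copy()
    m = np.zeros_like(W)                            # first-order momentum term m_t
    v = np.zeros_like(W)                            # second-order momentum term v_t
    fast_weights = []
    for t in range(1, k + 1):
        i = rng.integers(len(X))
        g = (X[i] @ W - Y[i]) * X[i]                # gradient g'_t on the sampled example
        m = beta1 * m + (1 - beta1) * g             # m_t = beta1*m_{t-1} + (1-beta1)*g'_t
        v = beta2 * v + (1 - beta2) * g ** 2        # v_t = beta2*v_{t-1} + (1-beta2)*g'_t^2
        m_hat = m / (1 - beta1 ** t)                # bias-corrected first-order term
        v_hat = v / (1 - beta2 ** t)                # bias-corrected second-order term
        W = W - lr * m_hat / (np.sqrt(v_hat) + eps) # third preset formula (Adam-style step)
        fast_weights.append(W.copy())
    return np.mean(fast_weights, axis=0)            # assumed integration into the second slow weight
```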
Optionally, the merging module 304 may be further specifically configured to:
extracting the first slow weight and the second slow weight at time t, where t ∈ {0,1,2,…,n};
and substituting the first slow weight and the second slow weight at time t into a preset combination formula to obtain the target update weight, where the preset combination formula (reproduced only as an image in the original publication) combines the first slow weight at time t and the second slow weight at time t into the target update weight at time t through a coefficient parameter α, and the calculation formula of α (likewise reproduced only as an image) depends on the current update time t and the total number T of iterations of the whole training.
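A minimal sketch of the merging step is given below. Because the preset combination formula and the expression for α are published only as images, the convex combination and the linear schedule α = t/T used here are assumptions that rely only on the quantities named in the text (the two slow weights, the current update time t and the total iteration count T).

```python
# A minimal sketch of the merging module; both formulas below are assumptions.
def merge_slow_weights(W_slow1, W_slow2, t, T):
    alpha = t / T                                     # assumed schedule for the coefficient parameter
    return alpha * W_slow1 + (1.0 - alpha) * W_slow2  # assumed convex combination of the two slow weights
```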
Optionally, the iteration module 305 may be further specifically configured to:
acquiring a target updating weight of a first iteration stage, taking the target updating weight of the first iteration stage as a starting point of a short-time fast weight of a second iteration stage, and calculating the target updating weight of the second iteration stage;
and taking the target update weight of the second iteration stage as the starting point of the short-time fast weight of the third iteration stage, calculating the target update weight of the third iteration stage, and calculating the target update weights of the remaining iteration stages until the convergence of the loss function is completed.
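The staged iteration performed by the iteration module 305 can be sketched as follows: the target update weight of each stage becomes the starting point of the short-time fast weights of the next stage. The driver below is a sketch that accepts the two optimizer branches and the merging rule as callables (for example, the hypothetical helpers sketched earlier in this document).

```python
import numpy as np

def optimize(W0, X, Y, first_branch, second_branch, merge, k=5, num_stages=20):
    """Chain the iteration stages: each stage's target update weight becomes the
    starting point of the short-time fast weights of the next stage."""
    W = W0.copy()
    for stage in range(1, num_stages + 1):
        W_slow1 = first_branch(W, X, Y, k)               # first slow weight of this stage
        W_slow2 = second_branch(W, X, Y, k)              # second slow weight of this stage
        W = merge(W_slow1, W_slow2, stage, num_stages)   # target update weight of this stage
    return W

# usage (with the hypothetical helpers sketched above, on a toy regression problem):
# rng = np.random.default_rng(0)
# X = rng.normal(size=(100, 3)); W_true = np.array([1.0, -2.0, 0.5]); Y = X @ W_true
# W_star = optimize(np.zeros(3), X, Y,
#                   first_optimizer_slow_weight, second_optimizer_slow_weight,
#                   merge_slow_weights)
```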
In the embodiment of the invention, the first slow weight calculated by the first optimizer and the second slow weight calculated by the second optimizer are combined to obtain the target update weight, and the iterative calculation is then carried out until the loss function converges, so that the time spent calculating the weights and the abnormal jitter of the loss function during convergence are reduced, and the convergence accuracy and convergence efficiency of the loss function are improved.
Fig. 3 and 4 describe the optimization apparatus of the loss function in the embodiment of the present invention in detail from the perspective of the modular functional entity, and the optimization device of the loss function in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of an optimization apparatus for loss function according to an embodiment of the present invention, where the optimization apparatus 500 for loss function may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 510 (e.g., one or more processors) and a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. Memory 520 and storage media 530 may be, among other things, transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations in the optimization apparatus 500 for a loss function. Still further, the processor 510 may be configured to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the loss function optimization apparatus 500.
The loss function optimization apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the loss function optimization apparatus configuration shown in fig. 5 does not constitute a limitation of the loss function optimization apparatus, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, which may also be a volatile computer readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the method for optimizing a loss function.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for optimizing a loss function, the method comprising:
obtaining a machine learning task to be optimized, wherein the machine learning task is used for indicating a loss function in a convergence machine learning model;
training the machine learning task by using a first optimizer to obtain a first slow weight, wherein the first slow weight is used for indicating a result obtained after the machine learning task is iterated by using the first optimizer;
training the machine learning task by using a second optimizer to obtain a second slow weight, wherein the second slow weight is used for indicating a result obtained after the machine learning task is iterated by using the second optimizer;
merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target updating weight;
and calculating the target updating weight of each iteration stage until the convergence of the loss function is completed.
2. The method for optimizing the loss function according to claim 1, wherein the training of the machine learning task by using the first optimizer to obtain the first slow weight, the first slow weight being used for indicating the result obtained after the machine learning task is iterated by using the first optimizer, comprises:
randomly selecting a sample i_s from the n training samples of the machine learning task by using the first optimizer, wherein i_s ∈ {1,2,…,n} and n is an integer greater than 1;
calculating the first short-time fast weight W_{t+1} obtained after the update on the sample i_s by using a first preset formula W_{t+1} = W_t − η_t·g_t, wherein t is the current time, W_t is the weight parameter of the first optimizer at time t, η_t is the learning rate, and g_t is the gradient, with g_t = ΔJ(W_t, i_s), i.e. the gradient at time t of the cost function J(W) with respect to the weight W on the input sample X(i_s) and the output sample Y(i_s);
and performing integrated calculation on the k first short-time fast weights to obtain a first slow weight, wherein the first slow weight is used for indicating a result obtained after the machine learning task adopts the first optimizer to iterate, and k belongs to {2,3, …, n }.
3. The method for optimizing a loss function according to claim 2, wherein the performing of the integration calculation on the values of the k first short-time fast weights to obtain the first slow weight, the first slow weight being used for indicating the result obtained after the machine learning task is iterated by using the first optimizer, with k ∈ {2,3,…,n}, comprises:
obtaining the values of k consecutive first short-time fast weights, wherein k ∈ {2,3,…,n};
calculating the first slow weight according to a second preset formula and the values of the k consecutive first short-time fast weights, wherein the second preset formula (reproduced only as an image in the original publication) relates the first slow weight at time t to the starting point of the first short-time fast weight, the weight parameter W_t at time t, and the weight parameter W_{t+k} at time t+k, t being the current time; the first slow weight is used for indicating the result obtained after the machine learning task is iterated by using the first optimizer.
4. The method for optimizing the loss function according to claim 1, wherein the training of the machine learning task by using the second optimizer to obtain the second slow weight, the second slow weight being used for indicating the result obtained after the machine learning task is iterated by using the second optimizer, comprises:
randomly selecting a sample i from n training samples of the machine learning task by using a second optimizer, wherein i belongs to {1,2, …, n }, and n is an integer greater than 1;
calculating the second short-time fast weight W'_{t+1} obtained after the update on the sample i by using a third preset formula, the third preset formula being:
W'_{t+1} = W'_t − η·m̂_t/(√v̂_t + ε),
wherein t is the current time, W'_t is the weight parameter of the second optimizer at time t, η is the initial learning rate, ε is a numerical stability quantity, and m̂_t and v̂_t are the correction values of the first-order and second-order momentum terms, whose expressions are as follows:
m̂_t = m_t/(1 − β1^t), m_t = β1·m_{t−1} + (1 − β1)·g'_t,
v̂_t = v_t/(1 − β2^t), v_t = β2·v_{t−1} + (1 − β2)·(g'_t)²,
wherein m_t is the first-order momentum term, v_t is the second-order momentum term, β1 is the first-order momentum decay coefficient, β2 is the second-order momentum decay coefficient, and the gradient g'_t = ΔJ(W'_t, i) is the gradient at time t of the cost function J(W') with respect to the weight W' on the sample i;
and performing integrated calculation on the k second short-time fast weights to obtain a second slow weight, wherein the second slow weight is used for indicating a result obtained after the machine learning task adopts the second optimizer to perform iteration, and k belongs to {2,3, …, n }.
5. The method for optimizing a loss function according to claim 4, wherein the performing of the integration calculation on the values of the k second short-time fast weights to obtain the second slow weight, the second slow weight being used for indicating the result obtained after the machine learning task is iterated by using the second optimizer, with k ∈ {2,3,…,n}, comprises:
obtaining the values of k consecutive second short-time fast weights, wherein k ∈ {2,3,…,n};
calculating the second slow weight according to a fourth preset formula and the values of the k consecutive second short-time fast weights, wherein the fourth preset formula (reproduced only as an image in the original publication) relates the second slow weight at time t to the starting point of the second short-time fast weight, the weight parameter W'_t at time t, and the weight parameter W'_{t+k} at time t+k, t being the current time; the second slow weight is used for indicating the result obtained after the machine learning task is iterated by using the second optimizer.
6. The method of claim 1, wherein the combining the first slow weight and the second slow weight according to a preset combining formula to obtain a target update weight comprises:
extracting the first slow weight and the second slow weight at time t, wherein t ∈ {0,1,2,…,n};
and substituting the first slow weight and the second slow weight at time t into a preset combination formula to obtain the target update weight, wherein the preset combination formula (reproduced only as an image in the original publication) combines the first slow weight at time t and the second slow weight at time t into the target update weight at time t through a coefficient parameter α, and the calculation formula of α (likewise reproduced only as an image) depends on the current update time t and the total number T of iterations of the whole training.
7. The method of claim 1, wherein the calculating the target update weight for each iteration stage until the loss function convergence is complete comprises:
acquiring a target updating weight of a first iteration stage, taking the target updating weight of the first iteration stage as a starting point of a short-time fast weight of a second iteration stage, and calculating the target updating weight of the second iteration stage;
and taking the target updating weight of the second iteration stage as a starting point of the short-time fast weight of the third iteration stage, calculating the target updating weight of the third iteration stage, and calculating the target updating weights of the rest iteration stages until the convergence of the loss function is completed.
8. An optimization apparatus for a loss function, comprising:
an obtaining module, configured to obtain a machine learning task to be optimized, where the machine learning task is used to indicate a loss function in a converged machine learning model;
the first optimization module is used for training the machine learning task by using a first optimizer to obtain a first slow weight, and the first slow weight is used for indicating a result obtained after the machine learning task is iterated by using the first optimizer;
the second optimization module is used for training the machine learning task by using a second optimizer to obtain a second slow weight, and the second slow weight is used for indicating the result obtained after the machine learning task is iterated by using the second optimizer;
the merging module is used for merging the first slow weight and the second slow weight according to a preset merging formula to obtain a target updating weight;
and the iteration module is used for calculating the target updating weight of each iteration stage until the convergence of the loss function is completed.
9. An optimization apparatus for a loss function, comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor calls the instructions in the memory to cause the optimization device of the loss function to perform the optimization method of the loss function according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of optimizing a loss function according to any one of claims 1 to 7.
CN202010405723.5A 2020-05-14 2020-05-14 Method, device and equipment for optimizing loss function and storage medium Pending CN111738408A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010405723.5A CN111738408A (en) 2020-05-14 2020-05-14 Method, device and equipment for optimizing loss function and storage medium
PCT/CN2020/118303 WO2021139237A1 (en) 2020-05-14 2020-09-28 Method and apparatus for loss function optimization, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010405723.5A CN111738408A (en) 2020-05-14 2020-05-14 Method, device and equipment for optimizing loss function and storage medium

Publications (1)

Publication Number Publication Date
CN111738408A true CN111738408A (en) 2020-10-02

Family

ID=72647265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010405723.5A Pending CN111738408A (en) 2020-05-14 2020-05-14 Method, device and equipment for optimizing loss function and storage medium

Country Status (2)

Country Link
CN (1) CN111738408A (en)
WO (1) WO2021139237A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287089A (en) * 2020-11-23 2021-01-29 腾讯科技(深圳)有限公司 Classification model training and automatic question-answering method and device for automatic question-answering system
CN113763501A (en) * 2021-09-08 2021-12-07 上海壁仞智能科技有限公司 Iteration method of image reconstruction model and image reconstruction method
WO2022095432A1 (en) * 2020-11-05 2022-05-12 平安科技(深圳)有限公司 Neural network model training method and apparatus, computer device, and storage medium
CN114647387A (en) * 2022-05-23 2022-06-21 南京道成网络科技有限公司 Cache optimization method suitable for cloud storage
CN116237935A (en) * 2023-02-03 2023-06-09 兰州大学 Mechanical arm collaborative grabbing method, system, mechanical arm and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463576B (en) * 2021-12-24 2024-04-09 中国科学技术大学 Network training method based on re-weighting strategy
CN114254763B (en) * 2021-12-27 2024-04-05 西安交通大学 Machine learning model repairing method, system, computer equipment and storage medium
CN114692562B (en) * 2022-03-16 2024-05-24 北京理工大学 High-precision hybrid dynamic priority multi-objective optimization method
CN116630398B (en) * 2023-07-21 2024-05-10 浙江大学海南研究院 Optimizer momentum coefficient regulation and control method based on data set concave-convex characteristic

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909142A (en) * 2017-11-14 2018-04-13 深圳先进技术研究院 A kind of parameter optimization method of neutral net, system and electronic equipment
KR102068576B1 (en) * 2018-04-10 2020-01-21 배재대학교 산학협력단 Convolutional neural network based image processing system and method
CN108960318A (en) * 2018-06-28 2018-12-07 武汉市哈哈便利科技有限公司 A kind of commodity recognizer using binocular vision technology for self-service cabinet
CN111028306B (en) * 2019-11-06 2023-07-14 杭州电子科技大学 AR2U-Net neural network-based rapid magnetic resonance imaging method
CN111079896A (en) * 2019-11-15 2020-04-28 苏州浪潮智能科技有限公司 Hyper-parameter self-adaptive adjustment method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095432A1 (en) * 2020-11-05 2022-05-12 平安科技(深圳)有限公司 Neural network model training method and apparatus, computer device, and storage medium
CN112287089A (en) * 2020-11-23 2021-01-29 腾讯科技(深圳)有限公司 Classification model training and automatic question-answering method and device for automatic question-answering system
CN112287089B (en) * 2020-11-23 2022-09-20 腾讯科技(深圳)有限公司 Classification model training and automatic question-answering method and device for automatic question-answering system
CN113763501A (en) * 2021-09-08 2021-12-07 上海壁仞智能科技有限公司 Iteration method of image reconstruction model and image reconstruction method
CN113763501B (en) * 2021-09-08 2024-02-27 上海壁仞智能科技有限公司 Iterative method of image reconstruction model and image reconstruction method
CN114647387A (en) * 2022-05-23 2022-06-21 南京道成网络科技有限公司 Cache optimization method suitable for cloud storage
CN116237935A (en) * 2023-02-03 2023-06-09 兰州大学 Mechanical arm collaborative grabbing method, system, mechanical arm and storage medium
CN116237935B (en) * 2023-02-03 2023-09-15 兰州大学 Mechanical arm collaborative grabbing method, system, mechanical arm and storage medium

Also Published As

Publication number Publication date
WO2021139237A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
CN111738408A (en) Method, device and equipment for optimizing loss function and storage medium
CN107729322B (en) Word segmentation method and device and sentence vector generation model establishment method and device
CN109947940B (en) Text classification method, device, terminal and storage medium
CN110458287B (en) Parameter updating method, device, terminal and storage medium of neural network optimizer
CN105224959A (en) The training method of order models and device
US8478688B1 (en) Rapid transaction processing
US9967275B1 (en) Efficient detection of network anomalies
CN112101674B (en) Resource allocation matching method, device, equipment and medium based on group intelligent algorithm
CN113434859A (en) Intrusion detection method, device, equipment and storage medium
CN110661727A (en) Data transmission optimization method and device, computer equipment and storage medium
WO2015161782A1 (en) Method and system for mining churn factor causing user churn for network application
CN110851333B (en) Root partition monitoring method and device and monitoring server
CN110880014A (en) Data processing method and device, computer equipment and storage medium
CN112669078A (en) Behavior prediction model training method, device, equipment and storage medium
CN112445690A (en) Information acquisition method and device and electronic equipment
CN106982250A (en) Information-pushing method and device
JP6652398B2 (en) Parameter selection method, parameter selection program and parameter selection device
JP6494258B2 (en) Prediction system, prediction method, and prediction program
CN111461329B (en) Model training method, device, equipment and readable storage medium
JP2016031739A (en) Production control support device and production control support method
JP6203313B2 (en) Feature selection device, feature selection method, and program
Vakili et al. Delayed feedback in kernel bandits
CN114819398A (en) Beidou satellite clock error sequence combination prediction method based on gray Markov chain
CN114912627A (en) Recommendation model training method, system, computer device and storage medium
CN111047016A (en) Model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201002