CN112085524B - Q learning model-based result pushing method and system - Google Patents


Info

Publication number
CN112085524B
Authority
CN
China
Prior art keywords
value
gradient
learning model
network parameter
result
Prior art date
Legal status
Active
Application number
CN202010896316.9A
Other languages
Chinese (zh)
Other versions
CN112085524A (en)
Inventor
徐君
贾浩男
张骁
蒋昊
文继荣
Current Assignee
Huawei Technologies Co Ltd
Renmin University of China
Original Assignee
Huawei Technologies Co Ltd
Renmin University of China
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Renmin University of China
Priority to CN202010896316.9A
Publication of CN112085524A
Application granted
Publication of CN112085524B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0255 Targeted advertisements based on user history
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0255 Targeted advertisements based on user history
    • G06Q30/0256 User search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0271 Personalized advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)

Abstract

The invention relates to a result pushing method and system based on a Q learning model, comprising the following steps: form the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and store it in an experience pool D; extract several data groups from the experience pool D and calculate the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; randomly extract data groups from the previous step, calculate the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substitute the gradient values and the full gradient mean into a variance reduction formula to update the gradient; repeat the above steps until training is finished to obtain the final Q learning model, and input the state to be tested into the final Q learning model to obtain the optimal push result. By introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent, the stability of the reinforcement learning training process is improved.

Description

Q learning model-based result pushing method and system
Technical Field
The invention relates to a result pushing method and system based on a Q learning model, and belongs to the field of Internet technology.
Background
In information retrieval, pushing results to the user, or ranking them by their relevance to the query, can greatly reduce the searcher's workload and improve the efficiency of information acquisition. At present many reinforcement learning models, such as the deep Q learning model, are applied to search result pushing: training the model on a searcher's historical search records makes the pushed results better match the searcher's needs and further improves search efficiency. However, the conventional way of generating push results with a deep Q learning model has the following problems:
On the one hand, the deep Q learning model (DQN) plays a dominant role in value-function-based deep reinforcement learning, so improvements to the DQN algorithm have concentrated on its network structure in order to raise its efficiency. On the other hand, reinforcement learning is trained by trial and error, so its training process is usually unstable, and this instability is caused mainly by the excessively high variance of quantities such as the reward value and the Q value.
Disclosure of Invention
In view of the above deficiencies of the prior art, an object of the present invention is to provide a result pushing method and system based on a Q learning model that reduces the variance of the reward value or the Q value and improves the stability of the reinforcement learning training process by introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent.
In order to achieve the above object, the present invention provides a result pushing method based on a Q learning model, comprising the following steps: S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value; S2, pushing the original push result to the user, and obtaining a reward value r_{t+1} by recording the user's browsing; S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D; S4, extracting several data groups from the experience pool D and, according to the extracted data groups, calculating the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient; and S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal push result.
Further, the variance reduction formula in step S5 is:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
Further, the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the anchor network parameters: ∇_θ l(s, a; θ_0), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_0);

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the anchor network parameters, θ_0 is the anchor network parameter, and Q( ) is the Q network.
Further, the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_0)

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
Further, the full gradient mean is calculated as:

g = (1/N) · Σ_{i=1}^{N} ∇_θ l(s_i, a_i; θ_0)

wherein N is the number of data groups and l( ) is the loss function.
The invention also discloses another result pushing method based on a Q learning model, comprising the following steps: S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value; S2, pushing the original push result to the user, and obtaining a reward value r_{t+1} by recording the user's browsing; S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D; S4, extracting several data groups from the experience pool D, calculating the full gradient mean g_m under the network parameter θ_m according to the extracted data groups, and performing a gradient optimization with the full gradient mean:

θ_{m+1} ← θ_m - α · g_m

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, and g_m is the full gradient mean under the current network parameters; S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the last network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient; and S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal push result.
Further, the variance reduction formula in step S5 is:

g_m = ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_{m-1}) + g_{m-1}

wherein l( ) is the loss function, θ_{m-1} is the last network parameter, θ_m is the current network parameter, g_{m-1} is the full gradient mean under the last network parameter, and g_m is the full gradient mean under the current network parameters.
Further, the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the last network parameters: ∇_θ l(s, a; θ_{m-1}), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_{m-1});

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the last network parameters, θ_{m-1} is the last network parameter, and Q( ) is the Q network.
Further, the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the last network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_{m-1})

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
The invention also discloses a result pushing system based on a Q learning model, comprising: an original push result generation module for determining the current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value; a reward value generation module for pushing the original push result to the user and obtaining the reward value r_{t+1} by recording the user's browsing; a storage module for forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D; a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating, according to the extracted data groups, the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; a gradient updating module for randomly extracting data groups from those extracted by the full gradient mean calculation module, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into the variance reduction formula to update the gradient; and an output module for repeating the above operations until training is finished to obtain a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal push result.
Due to the adoption of the above technical solution, the invention has the following advantages:

1. By introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent, the variance of the reward value or the Q value is reduced, and the precision and stability of the reinforcement learning training process are improved.

2. By adopting the StochAstic Recursive grAdient algoritHm (SARAH), the invention avoids the problem of the Stochastic Variance Reduced Gradient (SVRG) technique that the network parameters, which are not fixed during training, gradually drift away from the parameters used at sampling time so that the information gap grows larger and larger; the model calculation is therefore more accurate.
Drawings
FIG. 1 is a schematic flow chart of a result pushing method based on a Q learning model in an embodiment of the present invention;

FIG. 2 is a schematic diagram of gradient optimization algorithms in an embodiment of the present invention, in which FIG. 2(a) shows a conventional gradient optimization algorithm and FIG. 2(b) shows a gradient optimization algorithm with stochastic gradient descent;

FIG. 3 is a logic diagram of the variance-reduction-based deep Q learning model training framework in an embodiment of the present invention.
Detailed Description
The present invention is described in detail below by way of specific embodiments so that those skilled in the art can better understand its technical solutions. It should be understood, however, that the detailed description is provided only for a better understanding of the invention and should not be taken as limiting it. In describing the present invention, it is to be understood that the terminology used is for the purpose of description only and is not to be interpreted as indicating or implying relative importance.
Example one
This embodiment discloses a result pushing method based on a Q learning model, as shown in FIG. 1, comprising the following steps:

S1, first set an initial Q learning model and determine the current state s_t, where the initial state s_0 is obtained by recording the user's current browsing activity and each subsequent state is obtained from the browsing history after the user's last interaction; feed the current state s_t into the initial Q learning model to obtain a Q value, and obtain the original push result a_t according to the Q value, the push result comprising the pushed content and the position of the pushed content.

S2, push the original push result to the user, and obtain the reward value r_{t+1} by recording the user's browsing.

S3, form the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and store it in the experience pool D.
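As a concrete illustration of steps S1-S3, the following Python sketch shows one way the data group (s_t, a_t, r_{t+1}, s_{t+1}) and the finite experience pool D could be represented. The class and field names are illustrative assumptions, not taken from the patent.

```python
import random
from collections import deque, namedtuple

# One data group: state, push result (action), reward, next state.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ExperiencePool:
    """Finite experience pool D: stores data groups and supports random batch extraction."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest data groups are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Random extraction of several data groups, as required by steps S4 and S5.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```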
S4, extract several data groups from the experience pool D and, according to the extracted data groups, calculate the full gradient mean g under the network parameter θ_0; the network parameter at this moment is the anchor network parameter.
The full gradient mean is calculated as

g = (1/N) · Σ_{i=1}^{N} ∇_θ l(s_i, a_i; θ_0)

where N is the number of data groups and l( ) is the loss function.
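A minimal sketch of this computation is given below; grad_fn stands in for the gradient of the loss l for a single data group, whose exact form depends on the Q network and is not fixed by the patent.

```python
import numpy as np

def full_gradient_mean(batch, grad_fn, anchor_params):
    """g = (1/N) * sum_i grad_theta l(s_i, a_i; theta_0) over the extracted data groups.

    batch         - list of data groups extracted from the experience pool D
    grad_fn       - grad_fn(transition, params) -> gradient vector for one data group
    anchor_params - the anchor network parameters theta_0
    """
    grads = [grad_fn(transition, anchor_params) for transition in batch]
    return np.mean(grads, axis=0)
```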
S5, randomly extract data groups from those of step S4, calculate the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substitute the gradient values and the full gradient mean into the variance reduction formula to update the gradient.
The target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_0)

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
If a target network Q′(s, a; θ⁻) is introduced, the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q′(s′, a′; θ⁻)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q′(s′, a′; θ⁻)

wherein the parameter θ⁻ denotes the parameter values copied at the end of the previous training round from the training network Q(s, a; θ) to the target network Q′(s, a; θ⁻), the target network having the same structure as the training network Q but different network parameters.
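The sketch below illustrates this Bellman-style target computation; the linear Q function is only a stand-in, since the patent does not fix a network architecture. Passing the current parameters θ_m yields q_m, passing the anchor parameters θ_0 yields q_0, and passing separate target-network parameters θ⁻ reproduces the target-network variant.

```python
import numpy as np

def q_values(state, params, n_actions, features):
    """Q(s, a; theta) for every candidate push result, with a linear Q function as a stand-in."""
    return np.array([features(state, a) @ params for a in range(n_actions)])

def target_q(transition, params, gamma, n_actions, features):
    """Bellman target: q <- r + gamma * max_a' Q(s', a'; params)."""
    next_q = q_values(transition.next_state, params, n_actions, features)
    return transition.reward + gamma * float(np.max(next_q))
```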
The gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the anchor network parameters: ∇_θ l(s, a; θ_0), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_0);

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the anchor network parameters, θ_0 is the anchor network parameter, and Q( ) is the Q network.
The variance reduction formula is:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
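The following sketch shows one variance-reduced update of step S5 under these formulas; grad_fn again stands in for the gradient of the loss for a single data group and is an assumption of the sketch.

```python
import numpy as np

def svrg_update(params, anchor_params, transition, grad_fn, full_grad_mean, lr):
    """theta_{m+1} = theta_m - alpha * (grad l(s,a; theta_m) - grad l(s,a; theta_0) + g)."""
    grad_current = grad_fn(transition, params)          # gradient value under theta_m
    grad_anchor = grad_fn(transition, anchor_params)    # gradient value under the anchor theta_0
    # (full_grad_mean - grad_anchor) averages to zero over the batch, so the step stays
    # unbiased for the batch gradient while its variance is reduced.
    return params - lr * (grad_current - grad_anchor + full_grad_mean)
```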
S6, repeat steps S4-S5 until training is finished to obtain the final Q learning model, and input the state to be tested into the final Q learning model to obtain the optimal push result.
This embodiment is mainly implemented with a Q learning model based on the Stochastic Variance Reduced Gradient (SVRG) technique. As shown in FIG. 2, among conventional gradient optimization algorithms, those based on full Gradient Descent (GD) can ensure that the parameters to be optimized reach a global optimum, but because every step involves computing the full gradient, they usually incur a large computational cost when the data volume is large, which makes the training process sluggish. To avoid this large per-step cost, the Stochastic Gradient Descent (SGD) algorithm abandons the full gradient computation and trains the model by sampling one data point (or a small batch) at each step. Although convergence of the optimization objective can still be ensured, the randomness of the sampling leaves the gradient variance too high, which limits the convergence speed.
To solve this problem, a variance reduction technique is introduced into the stochastic gradient descent procedure. The mathematical definition of variance reduction is

Z_α = α(X - Y) + E[Y]

wherein X denotes the random variable whose variance is to be reduced, Y denotes another random variable positively correlated with X, E[Y] denotes the mathematical expectation of the random variable Y, and Z_α denotes the random variable obtained after optimization by variance reduction.
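A small numerical illustration of this definition (with α = 1 and synthetic, positively correlated X and Y) is sketched below; the numbers in the comments are approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

base = rng.normal(size=100_000)
X = base + 0.3 * rng.normal(size=100_000)   # variable whose variance should be reduced
Y = base + 0.3 * rng.normal(size=100_000)   # positively correlated auxiliary variable

alpha = 1.0
Z = alpha * (X - Y) + Y.mean()              # Y.mean() plays the role of E[Y]

print("Var(X):", round(X.var(), 3))         # about 1.09
print("Var(Z):", round(Z.var(), 3))         # about 0.18, much smaller, same expectation
print("E[X]:", round(X.mean(), 4), "E[Z]:", round(Z.mean(), 4))
```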
The stochastic variance reduced gradient descent technique changes the original parameter update step into the form of Z_α above, with a periodically sampled batch of training data playing the role of Y in the definition of variance reduction. The gradient update formula is

θ_{t+1} = θ_t - η · ( ∇_θ l(s, a; θ_t) - ∇_θ l(s, a; θ_old) + μ )

wherein θ_t is the parameter being optimized at the t-th training step, θ_old denotes the parameter values at which the full gradient is calculated, μ denotes the expectation of the full gradient of the batch-data loss function, ∇_θ l(s, a; θ_t) and ∇_θ l(s, a; θ_old) denote the gradient values of the single-sample loss function, and η denotes the learning rate.
The invention takes the gradient ∇_θ l(s, a; θ) of the loss function l(s, a; θ) with respect to the parameters of each network layer as the random variable X whose variance is to be reduced. FIG. 3 shows the variance-reduction-based deep Q learning training framework: the current network Q represents the learning model, and the environment represents the object that interacts with the network Q. The network Q accepts the current state s of the environment as input, evaluates under the current network parameters θ_m the Q value of every action that can be executed in state s, selects the optimal action a according to the Q values and outputs it to the environment, and the environment receives the action and transfers to the next state s′. The framework takes the current network Q as input and outputs the network optimized by variance reduction; specifically, it takes the network parameters θ_0 as input and outputs the optimized network parameters obtained by variance-reduced training. During training, the environment continuously interacts with the current network to generate transition data groups (s, a, r, s′), and a finite experience pool D stores the generated data and periodically sends it to the network for training. According to the characteristics of the SVRG algorithm, a batch of data must first be sampled from the experience pool, and the full gradient mean g of this batch under the network parameters θ_0 in force at sampling time is calculated to serve as the expectation E[Y] in the SVRG optimization. The gradient of an individual sample of the batch under the sampling-time network parameters θ_0 then serves as the auxiliary variable Y in the optimization.
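A compact sketch of one such training epoch follows, combining steps S4 and S5; the function and variable names are illustrative, and grad_fn is assumed to return the per-sample loss gradient for the chosen Q network.

```python
import random
import numpy as np

def train_epoch(pool, params, grad_fn, lr=0.01, batch_size=64, inner_steps=64):
    """One outer iteration of the variance-reduced deep Q learning loop (FIG. 3)."""
    # S4: sample a batch from D, freeze the anchor parameters, compute the full gradient mean g.
    batch = pool.sample(batch_size)
    anchor = params.copy()                                    # anchor network parameters theta_0
    g = np.mean([grad_fn(t, anchor) for t in batch], axis=0)  # plays the role of E[Y]

    # S5: repeated single-sample variance-reduced updates.
    for _ in range(inner_steps):
        transition = random.choice(batch)                     # random extraction from the batch
        grad_m = grad_fn(transition, params)                  # gradient under theta_m (X)
        grad_0 = grad_fn(transition, anchor)                  # gradient under the anchor (Y)
        params = params - lr * (grad_m - grad_0 + g)
    return params
```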
Example two
Based on the same inventive concept, this embodiment discloses another result pushing method based on a Q learning model, comprising the following steps:

S1, first set an initial Q learning model and determine the current state s_t, where the initial state s_0 is obtained by recording the user's current browsing activity and each subsequent state is obtained from the browsing history after the user's last interaction; feed the current state s_t into the initial Q learning model to obtain a Q value, and obtain the original push result a_t according to the Q value, the push result comprising the pushed content and the position of the pushed content.

S2, push the original push result to the user, and obtain the reward value r_{t+1} by recording the user's browsing.

S3, form the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and store it in the experience pool D.

S4, extract several data groups from the experience pool D, calculate the full gradient mean g_m under the network parameter θ_m according to the extracted data groups, and perform a gradient optimization with the full gradient mean:

θ_{m+1} ← θ_m - α · g_m

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, and g_m is the full gradient mean under the current network parameters.
S5, randomly extract data groups from those of step S4, calculate the target Q value and the gradient value of each data group under the current network parameter and under the last network parameter, and substitute the gradient values and the full gradient mean into the variance reduction formula to update the gradient.
The target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the last network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_{m-1})

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
The variance reduction formula in step S5 is:

g_m = ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_{m-1}) + g_{m-1}

wherein l( ) is the loss function, θ_{m-1} is the last network parameter, θ_m is the current network parameter, g_{m-1} is the full gradient mean under the last network parameter, and g_m is the full gradient mean under the current network parameters.
The gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the last network parameters: ∇_θ l(s, a; θ_{m-1}), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_{m-1});

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the last network parameters, θ_{m-1} is the last network parameter, and Q( ) is the Q network.
S6, repeat steps S4-S5 until training is finished to obtain the final Q learning model, and input the state to be tested into the final Q learning model to obtain the optimal push result.
This embodiment is mainly implemented with a Q learning model based on the StochAstic Recursive grAdient algoritHm (SARAH). The SVRG algorithm uses the fixed full gradient mean g of a batch of data as the correction term E[Y], and uses the fixed network parameters θ_0 in force when the batch was sampled to calculate the single-sample gradient serving as Y; the parameters of the network, however, are not fixed during training and gradually drift away from the sampling-time parameter θ_0, so the information gap becomes larger and larger.
To address this problem, SARAH processes the gradient and full-gradient estimates with a recursive, adaptively updated scheme: it abandons the fixed batch full gradient mean g and the fixed sampling parameter θ_old, gradually updates the full gradient mean g during training, and uses the parameter θ_{t-1} of the previous step instead of θ_old. In summary, in the SARAH algorithm the update step with the variance-reducing gradient is:

g_t = ∇_θ l(s, a; θ_t) - ∇_θ l(s, a; θ_{t-1}) + g_{t-1}

θ_{t+1} = θ_t - η · g_t

Compared with the SVRG algorithm of FIG. 3, this embodiment replaces the SVRG operation unit with a SARAH updating unit: the full gradient mean g is kept updated while the parameters are updated, and the fixed sampling-time network θ_0 is replaced by the network θ_{t-1} of the previous training step.
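A sketch of this SARAH-style step is given below. At the start of each epoch, g would be initialized to the full gradient mean of the sampled batch and the previous parameters to the sampling-time parameters; the function and variable names are illustrative.

```python
import numpy as np

def sarah_step(params, prev_params, g_prev, transition, grad_fn, lr):
    """g_t = grad l(s,a; theta_t) - grad l(s,a; theta_{t-1}) + g_{t-1}; theta_{t+1} = theta_t - eta * g_t."""
    g_t = grad_fn(transition, params) - grad_fn(transition, prev_params) + g_prev
    new_params = params - lr * g_t
    # Return the updated parameters, the new "previous" parameters and the new gradient estimate,
    # so the reference point keeps tracking the iterates instead of a fixed anchor.
    return new_params, params, g_t
```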
Example three
Based on the same inventive concept, this embodiment discloses a result pushing system based on a Q learning model, comprising:

an original push result generation module for determining the current state s_t, feeding the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value;

a reward value generation module for pushing the original push result to the user and obtaining the reward value r_{t+1} by recording the user's browsing;

a storage module for forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in the experience pool D;

a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating, according to the extracted data groups, the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter;

a gradient updating module for randomly extracting data groups from those extracted by the full gradient mean calculation module, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into the variance reduction formula to update the gradient;

and an output module for repeating the above operations until training is finished to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result.
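The skeleton below shows how these modules could map onto a single class, one method per module; all class, method and reward-field names are illustrative assumptions, and the training method is left as a stub to be filled by a variance-reduced trainer such as the epoch sketch in embodiment one.

```python
class QLearningPushSystem:
    """Skeleton mirroring the modules of the result pushing system of this embodiment."""

    def __init__(self, q_model, experience_pool):
        self.q_model = q_model        # current Q learning model (Q network)
        self.pool = experience_pool   # experience pool D

    def generate_push_result(self, state):
        """Original push result generation module: pick the action with the largest Q value."""
        return int(self.q_model.q_values(state).argmax())

    def observe_reward(self, browsing_record):
        """Reward value generation module: turn the recorded browsing into r_{t+1} (illustrative)."""
        return float(browsing_record.get("clicked", 0))

    def store(self, state, action, reward, next_state):
        """Storage module: store the data group (s_t, a_t, r_{t+1}, s_{t+1}) in D."""
        self.pool.store(state, action, reward, next_state)

    def train(self, batch_size, lr):
        """Full gradient mean calculation module + gradient updating module (steps S4-S5)."""
        raise NotImplementedError("depends on the chosen Q network, loss and optimizer")

    def best_push_result(self, state):
        """Output module: after training, return the optimal push result for a given state."""
        return self.generate_push_result(state)
```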
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments without departing from the spirit and scope of the invention, and such modifications and equivalents are to be covered by the claims. The above description concerns only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the protection scope of the present application should be defined by the claims.

Claims (9)

1. A result pushing method based on a Q learning model, characterized by comprising the following steps:

S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value;

S2, pushing the original push result to a user, and obtaining a reward value r_{t+1} by recording the user's browsing;

S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D;

S4, extracting several data groups from the experience pool D and, according to the extracted data groups, calculating the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter;

S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient;

S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting a state to be tested into the final Q learning model to obtain an optimal push result;

the variance reduction formula in step S5 being:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
2. The result pushing method based on the Q learning model according to claim 1, wherein the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the anchor network parameters: ∇_θ l(s, a; θ_0), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_0);

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the anchor network parameters, θ_0 is the anchor network parameter, and Q( ) is the Q network.
3. The result pushing method based on the Q learning model according to claim 2, wherein the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_0)

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
4. The Q learning model-based result pushing method according to claim 3, wherein the full gradient mean is calculated as:

g = (1/N) · Σ_{i=1}^{N} ∇_θ l(s_i, a_i; θ_0)

wherein N is the number of data groups and l( ) is the loss function.
5. A result pushing method based on a Q learning model, characterized by comprising the following steps:

S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value;

S2, pushing the original push result to a user, and obtaining a reward value r_{t+1} by recording the user's browsing;

S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D;

S4, extracting several data groups from the experience pool D, calculating the full gradient mean g_m under the network parameter θ_m according to the extracted data groups, and performing a gradient optimization with the full gradient mean:

θ_{m+1} ← θ_m - α · g_m

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, and g_m is the full gradient mean under the current network parameters;

S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the last network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient;

S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting a state to be tested into the final Q learning model to obtain an optimal push result;
the variance reduction formula in step S5:
Figure FDA0003831718560000027
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA0003831718560000028
is the next network parameter;
Figure FDA0003831718560000029
is the current network parameter; α is the learning rate;
Figure FDA00038317185600000210
is a gradient value; g is the full gradient mean.
6. The Q learning model-based result pushing method according to claim 5, wherein the variance reduction formula in step S5 is:

g_m = ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_{m-1}) + g_{m-1}

wherein l( ) is the loss function, θ_{m-1} is the last network parameter, θ_m is the current network parameter, g_{m-1} is the full gradient mean under the last network parameter, and g_m is the full gradient mean under the current network parameters.
7. The Q learning model-based result pushing method according to claim 6, wherein the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the last network parameters: ∇_θ l(s, a; θ_{m-1}), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_{m-1});

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the last network parameters, θ_{m-1} is the last network parameter, and Q( ) is the Q network.
8. The result pushing method based on the Q learning model of claim 7, wherein the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the last network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_{m-1})

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
9. A result pushing system based on a Q learning model, characterized by comprising:

an original push result generation module for determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value;

a reward value generation module for pushing the original push result to a user and obtaining a reward value r_{t+1} by recording the user's browsing;

a storage module for forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D;

a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating, according to the extracted data groups, the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter;

a gradient updating module for randomly extracting data groups from those extracted by the full gradient mean calculation module, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient; and

an output module for repeating the operations of the full gradient mean calculation module and the gradient updating module until training is finished to obtain a final Q learning model, and inputting a state to be tested into the final Q learning model to obtain an optimal push result;

the variance reduction formula being:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
CN202010896316.9A 2020-08-31 2020-08-31 Q learning model-based result pushing method and system Active CN112085524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010896316.9A CN112085524B (en) 2020-08-31 2020-08-31 Q learning model-based result pushing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010896316.9A CN112085524B (en) 2020-08-31 2020-08-31 Q learning model-based result pushing method and system

Publications (2)

Publication Number Publication Date
CN112085524A CN112085524A (en) 2020-12-15
CN112085524B (en) 2022-11-15

Family

ID=73731256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010896316.9A Active CN112085524B (en) 2020-08-31 2020-08-31 Q learning model-based result pushing method and system

Country Status (1)

Country Link
CN (1) CN112085524B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A recommendation algorithm based on deep reinforcement learning
CN110084378A (en) * 2019-05-07 2019-08-02 南京大学 A distributed machine learning method based on a local learning strategy
KR20190132193A (en) * 2018-05-18 2019-11-27 한양대학교 에리카산학협력단 A Dynamic Pricing Demand Response Method and System for Smart Grid Systems

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190132193A (en) * 2018-05-18 2019-11-27 한양대학교 에리카산학협력단 A Dynamic Pricing Demand Response Method and System for Smart Grid Systems
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A recommendation algorithm based on deep reinforcement learning
CN110084378A (en) * 2019-05-07 2019-08-02 南京大学 A distributed machine learning method based on a local learning strategy

Also Published As

Publication number Publication date
CN112085524A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN110674604B (en) Transformer DGA data prediction method based on multi-dimensional time sequence frame convolution LSTM
CN108875916B (en) Advertisement click rate prediction method based on GRU neural network
CN111260030B (en) A-TCN-based power load prediction method and device, computer equipment and storage medium
WO2021109644A1 (en) Hybrid vehicle working condition prediction method based on meta-learning
CN110942194A (en) Wind power prediction error interval evaluation method based on TCN
CN112381673B (en) Park electricity utilization information analysis method and device based on digital twin
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN112015719A (en) Regularization and adaptive genetic algorithm-based hydrological prediction model construction method
CN115271219A (en) Short-term load prediction method and prediction system based on causal relationship analysis
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN114742209A (en) Short-term traffic flow prediction method and system
CN113807596B (en) Management method and system for informatization project cost
CN114971090A (en) Electric heating load prediction method, system, equipment and medium
CN112085524B (en) Q learning model-based result pushing method and system
CN103607219B (en) A kind of noise prediction method of electric line communication system
CN112951209A (en) Voice recognition method, device, equipment and computer readable storage medium
CN109740221B (en) Intelligent industrial design algorithm based on search tree
CN115829123A (en) Natural gas demand prediction method and device based on grey model and neural network
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
CN113705878B (en) Method and device for determining water yield of horizontal well, computer equipment and storage medium
CN115035304A (en) Image description generation method and system based on course learning
CN112348275A (en) Regional ecological environment change prediction method based on online incremental learning
CN111859807A (en) Initial pressure optimizing method, device, equipment and storage medium for steam turbine
CN111369046A (en) Wind-solar complementary power prediction method based on grey neural network
CN110580548A (en) Multi-step traffic speed prediction method based on class integration learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant