CN112085524B - Q learning model-based result pushing method and system
- Publication number: CN112085524B
- Application number: CN202010896316.9A
- Authority: CN (China)
- Prior art keywords: value, gradient, learning model, network parameter, result
- Legal status: Active (the status listed is an assumption and is not a legal conclusion)
Classifications
- G06Q30/0255 - Targeted advertisements based on user history
- G06Q30/0256 - User search
- G06Q30/0271 - Personalized advertisement
- G06N20/00 - Machine learning
- G06N3/045 - Combinations of networks
- G06N3/08 - Learning methods (neural networks)
- H04L67/55 - Push-based network services
Abstract
The invention relates to a result pushing method and system based on a Q learning model, comprising the following steps: form the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and store it in an experience pool D; extract several data groups from the experience pool D and calculate the full gradient mean under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; randomly draw data groups from those extracted in the previous step, calculate their target Q values and gradient values under the current network parameter and the anchor network parameter, and substitute the gradient value and the full gradient mean into a variance reduction formula to update the gradient; repeat these steps until training finishes to obtain the final Q learning model, then input the state to be tested into the final Q learning model to obtain the optimal push result. By introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent, the stability of the reinforcement learning training process is improved.
Description
Technical Field
The invention relates to a result pushing method and system based on a Q learning model, and belongs to the technical field of the Internet.
Background
In information retrieval, pushing results or ranking them according to their relevance to the query greatly reduces the searcher's workload and improves the efficiency of information acquisition. At present, many reinforcement learning models, such as the deep Q learning model, are applied to search result pushing: by training the reinforcement learning model on the searcher's historical search records, the pushed results better meet the searcher's needs, further improving search efficiency. However, the conventional approach of generating push results with a deep Q learning model has the following problems:
On the one hand, the deep Q learning model (DQN) plays an absolutely leading role in value-function-based deep reinforcement learning, so improvements to the DQN algorithm have focused on its network structure to improve efficiency. On the other hand, reinforcement learning trains by trial and error, so its training process is usually unstable, and this instability is mainly caused by the excessively high variance of quantities such as the reward value and the Q value.
Disclosure of Invention
In view of the above deficiencies of the prior art, an object of the present invention is to provide a result pushing method and system based on a Q learning model, which reduces the variance of the reward value or Q value and improves the stability of the reinforcement learning training process by introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent.
In order to achieve the above object, the present invention provides a result pushing method based on a Q learning model, comprising the following steps: S1, determining the current state s_t, substituting the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value; S2, pushing the original push result to the user, and obtaining the reward value r_{t+1} by recording the user's browsing; S3, forming the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and storing it in the experience pool D; S4, extracting several data groups from the experience pool D, and calculating the full gradient mean under the network parameter θ_0 from the extracted data groups; the network parameter at this moment is the anchor network parameter; S5, randomly drawing data groups from those extracted in step S4, calculating their target Q values and gradient values under the current network parameter and the anchor network parameter, and substituting the gradient value and the full gradient mean into a variance reduction formula to update the gradient; S6, repeating steps S4-S5 until training finishes to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result.
Further, the variance reduction formula in step S5 is:

θ_{m+1} = θ_m − α(Δ_m + g)

where θ_{m+1} is the next network parameter; θ_m is the current network parameter; α is the learning rate; Δ_m is the gradient value; and g is the full gradient mean.
Further, the gradient value is calculated by the formula:

Δ_m = (Q(s, a; θ_m) − q_m)∇Q(s, a; θ_m) − (Q(s, a; θ_0) − q_0)∇Q(s, a; θ_0)

where s, a are respectively a state in a data group randomly drawn in step S5 and the push result corresponding to that state; q_m is the target Q value under the current network parameter θ_m; q_0 is the target Q value under the anchor network parameter; θ_0 is the anchor network parameter; and Q() is the Q network.
Further, the target Q value is calculated by the formula:

q ← r + γ max_{a′} Q(s′, a′; θ)

where s′, a′ are the next state in the data group randomly drawn in step S5 and the push result corresponding to that next state; r is the reward value; and γ is the discount coefficient.
Further, the full gradient mean is calculated by the formula:

g = (1/N) Σ_{i=1}^{N} ∇l(s_i, a_i; θ_0)

where N is the number of data groups and l() is the loss function.
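Read together, the four formulas above define one parameter update. The following is a minimal Python sketch of that update; the linear Q function, its dimensions, and all helper names are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

# Toy stand-in for the Q network, assumed linear for illustration:
# Q(s, a; theta) = theta[a] . s, with one parameter row per action.
def q_value(theta, s, a):
    return theta[a] @ s

def target_q(theta, r, s_next, gamma=0.99):
    """Target Q value: q <- r + gamma * max_a' Q(s', a'; theta)."""
    return r + gamma * max(q_value(theta, s_next, a) for a in range(theta.shape[0]))

def grad_l(theta, s, a, q_target):
    """Gradient of the loss l = (Q(s, a; theta) - q_target)^2 / 2."""
    g = np.zeros_like(theta)
    g[a] = (q_value(theta, s, a) - q_target) * s
    return g

def variance_reduced_step(theta_m, theta_0, s, a, q_m, q_0, g_full, alpha=0.01):
    """theta_{m+1} = theta_m - alpha * (Delta_m + g), where Delta_m is the gradient
    value under the current parameter theta_m minus that under the anchor theta_0."""
    delta = grad_l(theta_m, s, a, q_m) - grad_l(theta_0, s, a, q_0)
    return theta_m - alpha * (delta + g_full)
```

Here g_full is the full gradient mean computed once per batch under the anchor θ_0, so each single-sample step pays only two extra gradient evaluations.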
The invention also discloses another result pushing method based on a Q learning model, comprising the following steps: S1, determining the current state s_t, substituting the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value; S2, pushing the original push result to the user, and obtaining the reward value r_{t+1} by recording the user's browsing; S3, forming the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and storing it in the experience pool D; S4, extracting several data groups from the experience pool D, calculating the full gradient mean under the network parameter θ_0 from the extracted data groups, and performing gradient optimization with the full gradient mean:

θ_1 = θ_0 − α g_0

where θ_1 is the next network parameter; θ_0 is the current network parameter; and g_0 is the full gradient mean under the current network parameter; S5, randomly drawing data groups from those extracted in step S4, calculating their target Q values and gradient values under the current network parameter and under the last network parameter, and substituting the gradient value and the full gradient mean into a variance reduction formula to update the gradient; S6, repeating steps S4-S5 until training finishes to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result.
Further, the variance reduction formula in step S5 is:

g_t = ∇l(s, a; θ_t) − ∇l(s, a; θ_{t−1}) + g_{t−1}

where l() is the loss function; θ_{t−1} is the last network parameter; θ_t is the current network parameter; g_{t−1} is the full gradient mean under the last network parameter; and g_t is the full gradient mean under the current network parameter.
Further, the gradient value is calculated by the formula:

Δ_t = (Q(s, a; θ_t) − q_m)∇Q(s, a; θ_t) − (Q(s, a; θ_{t−1}) − q_0)∇Q(s, a; θ_{t−1})

where s, a are respectively a state in a data group randomly drawn in step S5 and the push result corresponding to that state; q_m is the target Q value under the current network parameter; q_0 is the target Q value under the last network parameter θ_{t−1}, which plays the role of the anchor in this method; and Q() is the Q network.
Further, the target Q value is calculated by the formula:

q ← r + γ max_{a′} Q(s′, a′; θ)

where s′, a′ are the next state in the data group randomly drawn in step S5 and the push result corresponding to that next state; r is the reward value; and γ is the discount coefficient.
The invention also discloses a result pushing system based on a Q learning model, comprising: an original push result generation module for determining the current state s_t, substituting the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value; a reward value generation module for pushing the original push result to the user and obtaining the reward value r_{t+1} by recording the user's browsing; a storage module for forming the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and storing it in the experience pool D; a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating the full gradient mean under the network parameter θ_0 from the extracted data groups, the network parameter at this moment being the anchor network parameter; a gradient updating module for randomly drawing data groups from those extracted by the full gradient mean calculation module, calculating their target Q values and gradient values under the current network parameter and the anchor network parameter, and substituting the gradient value and the full gradient mean into a variance reduction formula to update the gradient; and an output module for repeating the above operations until training finishes to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. By introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent, the variance of the reward value or Q value is reduced, and the accuracy and stability of the reinforcement learning training process are improved.
2. By adopting the stochastic recursive gradient algorithm (SARAH), the invention solves the problem in the stochastic variance reduced gradient descent (SVRG) technique that the network parameters are not fixed during training and gradually drift away from the parameters at sampling time, causing an ever larger information mismatch; the model calculation is therefore more accurate.
Drawings
FIG. 1 is a flow diagram of a result pushing method based on a Q learning model in an embodiment of the present invention;
FIG. 2 is a schematic diagram of gradient optimization algorithms in an embodiment of the present invention; FIG. 2(a) is a schematic diagram of a conventional gradient optimization algorithm, and FIG. 2(b) is a schematic diagram of a gradient optimization algorithm with stochastic gradient descent;
FIG. 3 is a logic diagram of a deep Q learning model training framework based on variance reduction in an embodiment of the present invention.
Detailed Description
The present invention is described in detail through specific embodiments so that those skilled in the art can better understand its technical direction. It should be understood, however, that the detailed description is provided only for a better understanding of the invention and should not be taken as limiting it. In describing the present invention, it is to be understood that the terminology used is for description only and is not intended to indicate or imply relative importance.
Example one
The embodiment discloses a result pushing method based on a Q learning model, as shown in FIG. 1, comprising the following steps:

S1, first set an initial Q learning model and determine the current state s_t, where the initial state s_0 is obtained by recording the user's current browsing activity, and subsequent states are obtained from the browsing history after the user's last interaction. Substitute the current state s_t into the initial Q learning model to obtain a Q value, and obtain the original push result a_t according to the Q value; the push result comprises the push content and the position of the push content.
S2, push the original push result to the user, and obtain the reward value r_{t+1} by recording the user's browsing;

S3, form the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and store it in the experience pool D;

S4, extract several data groups from the experience pool D, and calculate the full gradient mean under the network parameter θ_0 from the extracted data groups; the network parameter at this moment is the anchor network parameter.

The full gradient mean is calculated by the formula:

g = (1/N) Σ_{i=1}^{N} ∇l(s_i, a_i; θ_0)

where N is the number of data groups and l() is the loss function.

S5, randomly draw data groups from those extracted in step S4, calculate their target Q values and gradient values under the current network parameter and the anchor network parameter, and substitute the gradient value and the full gradient mean into a variance reduction formula to update the gradient.

The target Q value is calculated by the formula:

q ← r + γ max_{a′} Q(s′, a′; θ)

where s′, a′ are the next state in the data group randomly drawn in step S5 and the push result corresponding to that next state; r is the reward value; and γ is the discount coefficient.
If a target network Q′(s, a; θ⁻) is introduced, the target Q values are calculated as follows.

Target Q value under the current network parameter:

q_m ← r + γ max_{a′} Q′(s′, a′; θ⁻)

Target Q value under the anchor network parameter:

q_0 ← r + γ max_{a′} Q′(s′, a′; θ⁻)

where the parameter θ⁻ holds the parameter values periodically copied from the training network Q(s, a; θ) into the target network Q′(s, a; θ⁻), a network with the same structure as the training network Q but different network parameters.
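A short sketch of this target-network bookkeeping, assuming a PyTorch-style model; the toy network shape and the synchronization period C are illustrative assumptions:

```python
import copy
import torch

# Training network Q(s, a; theta) and target network Q'(s, a; theta^-):
# identical structure, separately held parameters (a toy linear Q for illustration).
policy_net = torch.nn.Linear(8, 4)        # 8-dim state in, one Q value per action out
target_net = copy.deepcopy(policy_net)    # theta^- starts as a copy of theta

def target_q(r, s_next, gamma=0.99):
    """q <- r + gamma * max_a' Q'(s', a'; theta^-); s_next is a (8,) float tensor."""
    with torch.no_grad():
        return r + gamma * target_net(s_next).max().item()

def sync_target():
    """Called every C training steps: copy theta into theta^-."""
    target_net.load_state_dict(policy_net.state_dict())
```

Because both q_m and q_0 are evaluated under the same frozen θ⁻, the targets only change when sync_target() runs, which keeps the regression targets stable between syncs.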
The gradient value is calculated by the formula:

Δ_m = (Q(s, a; θ_m) − q_m)∇Q(s, a; θ_m) − (Q(s, a; θ_0) − q_0)∇Q(s, a; θ_0)

where s, a are respectively a state in a data group randomly drawn in step S5 and the push result corresponding to that state; q_m is the target Q value under the current network parameter; q_0 is the target Q value under the anchor network parameter; θ_0 is the anchor network parameter; and Q() is the Q network.
The variance reduction formula is:

θ_{m+1} = θ_m − α(Δ_m + g)

where θ_{m+1} is the next network parameter; θ_m is the current network parameter; α is the learning rate; Δ_m is the gradient value; and g is the full gradient mean.
And S6, repeating the steps S4-S5 until the training is finished, obtaining a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal pushing result.
The embodiment is mainly realized with a Q learning model based on the stochastic variance reduced gradient (SVRG) descent technique. As shown in FIG. 2, among conventional gradient optimization algorithms, those based on gradient descent (GD) can ensure that the parameters to be optimized reach a global optimum, but because every step involves computing the full gradient, they usually incur a large computational cost when the data volume is large, making the training process sluggish. To avoid this per-step cost, the stochastic gradient descent (SGD) algorithm abandons the full gradient computation and trains the model by sampling one data point (or a small batch) at each step; although convergence of the optimization target can still be ensured, the randomness of sampling leaves it with a slow convergence rate caused by excessively high gradient variance.
In order to solve this problem, a variance reduction technique is introduced into the stochastic gradient descent optimization. The mathematical definition of variance reduction is:

Z_α = α(X − Y) + E[Y]

where X denotes the random variable whose variance is to be reduced; Y denotes another random variable positively correlated with X; E[Y] denotes the mathematical expectation of the random variable Y; and Z_α denotes the random variable after variance-reduction optimization.
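As a quick numeric check of this definition (taking α = 1, for which E[Z_α] = E[X]): when Y is positively correlated with X, Z_α keeps the mean of X but has a much smaller variance. A toy numpy illustration, with distributions chosen only for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(0.0, 1.0, 100_000)        # control variate with known E[Y] = 0
X = Y + rng.normal(0.0, 0.3, 100_000)    # X is positively correlated with Y
E_Y = 0.0                                # the mathematical expectation of Y

alpha = 1.0
Z = alpha * (X - Y) + E_Y                # Z_alpha = alpha * (X - Y) + E[Y]

print(X.mean(), Z.mean())  # both are approximately E[X] = 0
print(X.var(), Z.var())    # roughly 1.09 vs 0.09: an order-of-magnitude reduction
```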
The stochastic variance reduced gradient descent technique changes the original parameter update step into the Z_α form above, periodically sampling a batch of training data to play the role of Y in the definition of variance reduction. The gradient update formula is:

θ_{t+1} = θ_t − η(∇l_i(θ_t) − ∇l_i(θ_old) + μ)

where θ_t is the parameter to be optimized at training step t; θ_old denotes the parameter value at which the full gradient was computed; μ denotes the expectation of the full gradient of the batch-data loss function; ∇l_i() denotes the gradient of the loss function of a single data sample; and η denotes the learning rate.
The invention takes the gradient ∇l(s, a; θ) of the loss function l(s, a; θ) with respect to each layer's parameters of the network as the random variable X whose variance is to be reduced. FIG. 3 shows the variance-reduction-based deep Q learning training framework, where the current network Q denotes the learning model and the environment denotes the object interacting with the network Q. The network Q accepts the current state s of the environment as input, evaluates, under the current network parameter θ_m, the Q value of each action executable in state s, selects the optimal action a according to the Q value and outputs it to the environment; the environment receives the action and transfers to the next state s′. The framework takes the current network Q as input and the variance-reduction-optimized network as output; specifically, it takes the network parameter θ_0 as input and outputs the optimized network parameter trained with variance reduction.
During training, the environment continuously interacts with the current network to generate transition data groups (s, a, r, s′), and a finite experience pool D stores the generated data and periodically sends it to the network for training. Following the SVRG algorithm, a batch of data is first sampled from the experience pool, and the network at sampling time, θ_0, is used to calculate the full gradient mean g of this batch, which serves as the expectation E[Y] in the SVRG optimization. The gradient value of an individual sample in the batch, evaluated under the network at sampling time θ_0, then serves as the auxiliary variable Y in the optimization.
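Putting these pieces together, one training epoch of the framework might look like the following sketch, a toy numpy version under an assumed linear Q function; the batch size, learning rate, inner-step count, and replay-pool format are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim = 4, 8
gamma, alpha = 0.99, 0.01

def q_values(theta, s):
    """Toy linear Q network: one parameter row per action."""
    return theta @ s

def grad_l(theta, s, a, q_target):
    """Gradient of l = (Q(s, a; theta) - q_target)^2 / 2 with respect to theta."""
    g = np.zeros_like(theta)
    g[a] = (q_values(theta, s)[a] - q_target) * s
    return g

def svrg_epoch(theta, pool, batch_size=64, inner_steps=32):
    # 1) Sample a batch from the experience pool D and freeze the anchor theta_0.
    idx = rng.choice(len(pool), size=batch_size, replace=False)
    batch = [pool[i] for i in idx]
    theta_0 = theta.copy()
    # 2) Full gradient mean g over the batch under theta_0: plays the role of E[Y].
    targets_0 = [r + gamma * q_values(theta_0, s2).max() for (s, a, r, s2) in batch]
    g_full = np.mean([grad_l(theta_0, s, a, q0)
                      for (s, a, r, s2), q0 in zip(batch, targets_0)], axis=0)
    # 3) Inner loop: single-sample variance-reduced updates in the Z_alpha form.
    for _ in range(inner_steps):
        i = int(rng.integers(batch_size))
        s, a, r, s2 = batch[i]
        q_m = r + gamma * q_values(theta, s2).max()  # target under current theta_m
        q_0 = targets_0[i]                           # target under anchor theta_0
        delta = grad_l(theta, s, a, q_m) - grad_l(theta_0, s, a, q_0)
        theta = theta - alpha * (delta + g_full)     # theta_{m+1} = theta_m - alpha*(Delta_m + g)
    return theta

# Toy usage: random transitions (s, a, r, s') standing in for real interaction data.
pool = [(rng.normal(size=state_dim), int(rng.integers(n_actions)),
         float(rng.normal()), rng.normal(size=state_dim)) for _ in range(500)]
theta = svrg_epoch(np.zeros((n_actions, state_dim)), pool)
```

In a real system q_values would be the deep Q network of FIG. 3, and the targets q_m and q_0 would typically come from the target network Q′ described in this embodiment.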
Example two
Based on the same inventive concept, this embodiment discloses another result pushing method based on a Q learning model, comprising the following steps:

S1, first set an initial Q learning model and determine the current state s_t, where the initial state s_0 is obtained by recording the user's current browsing activity, and subsequent states are obtained from the browsing history after the user's last interaction. Substitute the current state s_t into the initial Q learning model to obtain a Q value, and obtain the original push result a_t according to the Q value; the push result comprises the push content and the position of the push content.

S2, push the original push result to the user, and obtain the reward value r_{t+1} by recording the user's browsing;

S3, form the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and store it in the experience pool D;

S4, extract several data groups from the experience pool D, calculate the full gradient mean under the network parameter θ_0 from the extracted data groups, and perform gradient optimization with the full gradient mean:

θ_1 = θ_0 − α g_0

where θ_1 is the next network parameter; θ_0 is the current network parameter; and g_0 is the full gradient mean under the current network parameter.

S5, randomly draw data groups from those extracted in step S4, calculate their target Q values and gradient values under the current network parameter and under the last network parameter, and substitute the gradient value and the full gradient mean into a variance reduction formula to update the gradient.

The target Q value is calculated by the formula:

q ← r + γ max_{a′} Q(s′, a′; θ)

where s′, a′ are the next state in the data group randomly drawn in step S5 and the push result corresponding to that next state; r is the reward value; and γ is the discount coefficient.
The variance reduction formula in step S5 is:

g_t = ∇l(s, a; θ_t) − ∇l(s, a; θ_{t−1}) + g_{t−1}

where l() is the loss function; θ_{t−1} is the last network parameter; θ_t is the current network parameter; g_{t−1} is the full gradient mean under the last network parameter; and g_t is the full gradient mean under the current network parameter.
The gradient value is calculated by the formula:

Δ_t = (Q(s, a; θ_t) − q_m)∇Q(s, a; θ_t) − (Q(s, a; θ_{t−1}) − q_0)∇Q(s, a; θ_{t−1})

where s, a are respectively a state in a data group randomly drawn in step S5 and the push result corresponding to that state; q_m is the target Q value under the current network parameter; q_0 is the target Q value under the last network parameter θ_{t−1}, which plays the role of the anchor in this embodiment; and Q() is the Q network.
And S6, repeating the steps S4-S5 until the training is finished, obtaining a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal pushing result.
The embodiment is mainly implemented with a Q learning model based on the stochastic recursive gradient algorithm (SARAH). The SVRG algorithm uses a fixed batch-data full gradient mean g as the correction term E[Y] and uses a fixed network (the network at batch sampling time, with parameter θ_0) to compute the gradient value of a single sample as Y; however, the network parameters are not fixed during training and gradually drift away from the sampling-time parameter θ_0, causing an ever larger information mismatch.
To address this problem, SARAH processes the gradient and full-gradient estimates with a recursively (adaptively) updated method: it forgoes the fixed batch-data full gradient mean g and the fixed sampling parameter θ_old, gradually updates the full gradient mean g during training, and uses the previous step's parameter θ_{t−1} in place of θ_old. In summary, in the SARAH algorithm the variance-reduced gradient update step is:

g_t = ∇l_i(θ_t) − ∇l_i(θ_{t−1}) + g_{t−1}

θ_{t+1} = θ_t − η g_t
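A minimal numpy sketch of this recursive update on a generic finite-sum objective; the quadratic per-sample loss, problem sizes, and step counts are assumptions chosen for illustration, and in the patent's setting ∇l would be the Q-network gradient described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim, eta = 200, 5, 0.05
A = rng.normal(size=(N, dim))   # toy finite-sum problem: least squares
b = rng.normal(size=N)

def grad_i(theta, i):
    """Per-sample gradient of l_i(theta) = (a_i . theta - b_i)^2 / 2."""
    return (A[i] @ theta - b[i]) * A[i]

theta_prev = rng.normal(size=dim)
g = np.mean([grad_i(theta_prev, i) for i in range(N)], axis=0)  # initial full gradient mean
theta = theta_prev - eta * g                                    # first step uses the full gradient

for _ in range(100):
    i = int(rng.integers(N))
    # SARAH recursion: g_t = grad_i(theta_t) - grad_i(theta_{t-1}) + g_{t-1}
    g = grad_i(theta, i) - grad_i(theta_prev, i) + g
    theta_prev, theta = theta, theta - eta * g                  # theta_{t+1} = theta_t - eta * g_t
```

Unlike SVRG, no fixed anchor appears here: the estimate g is carried forward recursively, so each step compares only the current parameter θ_t with the immediately preceding θ_{t−1}.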
compared with the SVRG algorithm in fig. 3, in the present embodiment, the SVRG operation unit is replaced by the SARAH updating unit, and the parameters are updated while the update of the full gradient mean g is maintained, and in addition, the network in the fixed sampling is replaced by the network in the previous training, that is, the network in the previous training is adopted in the present embodiment
Example three
Based on the same inventive concept, the embodiment discloses a result pushing system based on a Q learning model, which comprises:
an original push result generation module for determining the current state s t From the current state s t The initial Q learning model is brought in to obtain a Q value, and an original pushing result a is obtained according to the Q value t ;
The reward value generation module is used for pushing the original pushing result to the user and obtaining the reward value r by recording the browsing of the user t+1 ;
A storage module for storing the state s t Push result a t Next state s t+1 And a prize value r t+1 Forming a data group and storing the data group into an experience pool D;
a full gradient mean calculation module for extracting several data sets from the experience pool D and calculating network parameters according to the extracted data setsThe network parameter at the moment is an anchor point network parameter;
a gradient updating module for randomly extracting the data group in the step S4, calculating a target Q value and a gradient value of the data group under the current network parameter and the anchor network parameter, and substituting the gradient value and the full gradient mean value into a variance reduction formula to realize gradient updating;
and the output module is used for repeating the steps S4-S5 until the training is finished, obtaining a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal pushing result.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments without departing from the spirit and scope of the invention, and that any changes or substitutions that a person skilled in the art can easily conceive within the technical scope of the present application shall likewise be covered. Therefore, the protection scope of the present application should be defined by the claims.
Claims (9)
1. A result pushing method based on a Q learning model, characterized by comprising the following steps:

S1, determining the current state s_t, substituting the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value;

S2, pushing the original push result to a user, and obtaining the reward value r_{t+1} by recording the user's browsing;

S3, forming the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and storing it in the experience pool D;

S4, extracting several data groups from the experience pool D, and calculating the full gradient mean under the network parameter θ_0 from the extracted data groups; the network parameter at this moment is the anchor network parameter;

S5, randomly drawing data groups from those extracted in step S4, calculating their target Q values and gradient values under the current network parameter and the anchor network parameter, and substituting the gradient value and the full gradient mean into a variance reduction formula to update the gradient;

S6, repeating steps S4-S5 until training finishes to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result;

wherein the variance reduction formula in step S5 is:

θ_{m+1} = θ_m − α(Δ_m + g)

where θ_{m+1} is the next network parameter; θ_m is the current network parameter; α is the learning rate; Δ_m is the gradient value; and g is the full gradient mean.
2. The result pushing method based on the Q learning model according to claim 1, wherein the gradient value is calculated by the formula:

Δ_m = (Q(s, a; θ_m) − q_m)∇Q(s, a; θ_m) − (Q(s, a; θ_0) − q_0)∇Q(s, a; θ_0)

where s, a are respectively a state in a data group randomly drawn in step S5 and the push result corresponding to that state; q_m is the target Q value under the current network parameter; q_0 is the target Q value under the anchor network parameter; θ_0 is the anchor network parameter; and Q() is the Q network.
3. The result pushing method based on the Q learning model according to claim 2, wherein the target Q value is calculated by the formula:

q ← r + γ max_{a′} Q(s′, a′; θ)

where s′, a′ are the next state in the data group randomly drawn in step S5 and the push result corresponding to that next state; r is the reward value; and γ is the discount coefficient.
5. A result pushing method based on a Q learning model, characterized by comprising the following steps:

S1, determining the current state s_t, substituting the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value;

S2, pushing the original push result to a user, and obtaining the reward value r_{t+1} by recording the user's browsing;

S3, forming the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and storing it in the experience pool D;

S4, extracting several data groups from the experience pool D, calculating the full gradient mean under the network parameter θ_0 from the extracted data groups, and performing gradient optimization with the full gradient mean:

θ_1 = θ_0 − α g_0

where θ_1 is the next network parameter; θ_0 is the current network parameter; and g_0 is the full gradient mean under the current network parameter;

S5, randomly drawing data groups from those extracted in step S4, calculating their target Q values and gradient values under the current network parameter and the last network parameter, and substituting the gradient value and the full gradient mean into a variance reduction formula to update the gradient;

S6, repeating steps S4-S5 until training finishes to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result;

wherein the variance reduction formula in step S5 is:

g_t = ∇l(s, a; θ_t) − ∇l(s, a; θ_{t−1}) + g_{t−1},  θ_{t+1} = θ_t − η g_t

where θ_t is the current network parameter; θ_{t−1} is the last network parameter; g_t and g_{t−1} are the full gradient means under the current and last network parameters respectively; and η is the learning rate.
6. The result pushing method based on the Q learning model according to claim 5, wherein the variance reduction formula in step S5 is:

g_t = ∇l(s, a; θ_t) − ∇l(s, a; θ_{t−1}) + g_{t−1}

where l() is the loss function.
7. The result pushing method based on the Q learning model according to claim 6, wherein the gradient value is calculated by the formula:

Δ_t = (Q(s, a; θ_t) − q_m)∇Q(s, a; θ_t) − (Q(s, a; θ_{t−1}) − q_0)∇Q(s, a; θ_{t−1})

where s, a are respectively a state in a data group randomly drawn in step S5 and the push result corresponding to that state; q_m is the target Q value under the current network parameter; q_0 is the target Q value under the last network parameter θ_{t−1}, which plays the role of the anchor in this method; and Q() is the Q network.
8. The result pushing method based on the Q learning model according to claim 7, wherein the target Q value is calculated by the formula:

q ← r + γ max_{a′} Q(s′, a′; θ)

where s′, a′ are the next state in the data group randomly drawn in step S5 and the push result corresponding to that next state; r is the reward value; and γ is the discount coefficient.
9. A result pushing system based on a Q learning model, characterized by comprising:

an original push result generation module for determining the current state s_t, substituting the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value;

a reward value generation module for pushing the original push result to a user and obtaining the reward value r_{t+1} by recording the user's browsing;

a storage module for forming the state s_t, push result a_t, next state s_{t+1} and reward value r_{t+1} into a data group and storing it in the experience pool D;

a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating the full gradient mean under the network parameter θ_0 from the extracted data groups, the network parameter at this moment being the anchor network parameter;

a gradient updating module for randomly drawing data groups from those extracted by the full gradient mean calculation module, calculating their target Q values and gradient values under the current network parameter and the anchor network parameter, and substituting the gradient value and the full gradient mean into a variance reduction formula to update the gradient; and

an output module for repeating the above operations until training finishes to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result;

wherein the variance reduction formula is:

θ_{m+1} = θ_m − α(Δ_m + g)

where θ_{m+1} is the next network parameter; θ_m is the current network parameter; α is the learning rate; Δ_m is the gradient value; and g is the full gradient mean.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010896316.9A (CN112085524B) | 2020-08-31 | 2020-08-31 | Q learning model-based result pushing method and system |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010896316.9A (CN112085524B) | 2020-08-31 | 2020-08-31 | Q learning model-based result pushing method and system |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN112085524A | 2020-12-15 |
| CN112085524B | 2022-11-15 |
Family ID: 73731256

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010896316.9A (Active) | Q learning model-based result pushing method and system | 2020-08-31 | 2020-08-31 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN112085524B (en) |
Patent Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| KR20190132193A * | 2018-05-18 | 2019-11-27 | 한양대학교 에리카산학협력단 | A Dynamic Pricing Demand Response Method and System for Smart Grid Systems |
| CN109471963A * | 2018-09-13 | 2019-03-15 | 广州丰石科技有限公司 | A recommendation algorithm based on deep reinforcement learning |
| CN110084378A * | 2019-05-07 | 2019-08-02 | 南京大学 | A distributed machine learning method based on a local learning strategy |
Also Published As

| Publication number | Publication date |
| --- | --- |
| CN112085524A | 2020-12-15 |
Similar Documents

| Publication | Title |
| --- | --- |
| CN110674604B | Transformer DGA data prediction method based on multi-dimensional time sequence frame convolution LSTM |
| CN108875916B | Advertisement click rate prediction method based on GRU neural network |
| CN111260030B | A-TCN-based power load prediction method and device, computer equipment and storage medium |
| WO2021109644A1 | Hybrid vehicle working condition prediction method based on meta-learning |
| CN110942194A | Wind power prediction error interval evaluation method based on TCN |
| CN112381673B | Park electricity utilization information analysis method and device based on digital twin |
| CN113449919B | Power consumption prediction method and system based on feature and trend perception |
| CN112015719A | Regularization and adaptive genetic algorithm-based hydrological prediction model construction method |
| CN115271219A | Short-term load prediction method and prediction system based on causal relationship analysis |
| CN114548591A | Time sequence data prediction method and system based on hybrid deep learning model and Stacking |
| CN114742209A | Short-term traffic flow prediction method and system |
| CN113807596B | Management method and system for informatization project cost |
| CN114971090A | Electric heating load prediction method, system, equipment and medium |
| CN112085524B | Q learning model-based result pushing method and system |
| CN103607219B | Noise prediction method for an electric line communication system |
| CN112951209A | Voice recognition method, device, equipment and computer readable storage medium |
| CN109740221B | Intelligent industrial design algorithm based on search tree |
| CN115829123A | Natural gas demand prediction method and device based on grey model and neural network |
| CN116151581A | Flexible workshop scheduling method and system and electronic equipment |
| CN113705878B | Method and device for determining water yield of horizontal well, computer equipment and storage medium |
| CN115035304A | Image description generation method and system based on course learning |
| CN112348275A | Regional ecological environment change prediction method based on online incremental learning |
| CN111859807A | Initial pressure optimizing method, device, equipment and storage medium for steam turbine |
| CN111369046A | Wind-solar complementary power prediction method based on grey neural network |
| CN110580548A | Multi-step traffic speed prediction method based on class integration learning |
Legal Events

| Date | Code | Title |
| --- | --- | --- |
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |