CN112085524B - Q learning model-based result pushing method and system - Google Patents


Info

Publication number
CN112085524B
Authority
CN
China
Prior art keywords
value
gradient
learning model
network parameter
result
Prior art date
Legal status
Active
Application number
CN202010896316.9A
Other languages
Chinese (zh)
Other versions
CN112085524A (en)
Inventor
徐君
贾浩男
张骁
蒋昊
文继荣
Current Assignee
Huawei Technologies Co Ltd
Renmin University of China
Original Assignee
Huawei Technologies Co Ltd
Renmin University of China
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Renmin University of China
Priority to CN202010896316.9A
Publication of CN112085524A
Application granted
Publication of CN112085524B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0255 Targeted advertisements based on user history
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0255 Targeted advertisements based on user history
    • G06Q30/0256 User search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0271 Personalized advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)

Abstract

The invention relates to a result pushing method and system based on a Q learning model, comprising the following steps: form the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and store it in an experience pool D; extract several data groups from the experience pool D and calculate the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; randomly extract data groups from the previous step, calculate the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substitute the gradient values and the full gradient mean into a variance reduction formula to update the gradient; repeat the above steps until training is finished to obtain the final Q learning model, and input the state to be tested into the final Q learning model to obtain the optimal push result. By introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent, the stability of the reinforcement learning training process is improved.

Description

Q learning model-based result pushing method and system
Technical Field
The invention relates to a result pushing method and system based on a Q learning model, and belongs to the field of Internet technology.
Background
In information retrieval, pushing results to the user, or ranking them by their relevance to the query, can greatly reduce the searcher's workload and improve the efficiency of information acquisition. At present many reinforcement learning models, such as the deep Q learning model, are applied to search result pushing: training the model on a searcher's historical search records makes the pushed results better match the searcher's needs and further improves search efficiency. However, the conventional way of generating push results with a deep Q learning model has the following problems:
On the one hand, the deep Q learning model (DQN) plays a dominant role in value-function-based deep reinforcement learning, so improvements to the DQN algorithm have concentrated on its network structure in order to raise its efficiency. On the other hand, reinforcement learning is trained by trial and error, so its training process is usually unstable, and this instability is caused mainly by the excessively high variance of quantities such as the reward value and the Q value.
Disclosure of Invention
In view of the above deficiencies of the prior art, an object of the present invention is to provide a result pushing method and system based on a Q learning model that reduces the variance of the reward value or the Q value and improves the stability of the reinforcement learning training process by introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent.
In order to achieve the above object, the present invention provides a result pushing method based on a Q learning model, comprising the following steps: S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value; S2, pushing the original push result to the user, and obtaining a reward value r_{t+1} by recording the user's browsing; S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D; S4, extracting several data groups from the experience pool D and, according to the extracted data groups, calculating the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient; and S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal push result.
Further, the variance reduction formula in step S5 is:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
Further, the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the anchor network parameters: ∇_θ l(s, a; θ_0), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_0);

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the anchor network parameters, θ_0 is the anchor network parameter, and Q( ) is the Q network.
Further, the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_0)

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
Further, the full gradient mean is calculated as:

g = (1/N) · Σ_{i=1}^{N} ∇_θ l(s_i, a_i; θ_0)

wherein N is the number of data groups and l( ) is the loss function.
The invention also discloses another result pushing method based on a Q learning model, comprising the following steps: S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value; S2, pushing the original push result to the user, and obtaining a reward value r_{t+1} by recording the user's browsing; S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D; S4, extracting several data groups from the experience pool D, calculating the full gradient mean g_m under the network parameter θ_m according to the extracted data groups, and performing a gradient optimization with the full gradient mean:

θ_{m+1} ← θ_m - α · g_m

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, and g_m is the full gradient mean under the current network parameters; S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the last network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient; and S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal push result.
Further, the variance reduction formula in step S5 is:

g_m = ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_{m-1}) + g_{m-1}

wherein l( ) is the loss function, θ_{m-1} is the last network parameter, θ_m is the current network parameter, g_{m-1} is the full gradient mean under the last network parameter, and g_m is the full gradient mean under the current network parameters.
Further, the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the last network parameters: ∇_θ l(s, a; θ_{m-1}), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_{m-1});

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the last network parameters, θ_{m-1} is the last network parameter, and Q( ) is the Q network.
Further, the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the last network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_{m-1})

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
The invention also discloses a result pushing system based on a Q learning model, comprising: an original push result generation module for determining the current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value; a reward value generation module for pushing the original push result to the user and obtaining the reward value r_{t+1} by recording the user's browsing; a storage module for forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D; a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating, according to the extracted data groups, the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter; a gradient updating module for randomly extracting data groups from those extracted by the full gradient mean calculation module, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into the variance reduction formula to update the gradient; and an output module for repeating the above operations until training is finished to obtain a final Q learning model, and inputting the state to be tested into the final Q learning model to obtain an optimal push result.
Due to the adoption of the above technical solution, the invention has the following advantages:

1. By introducing a variance reduction technique into a Q learning model trained with stochastic gradient descent, the variance of the reward value or the Q value is reduced, and the precision and stability of the reinforcement learning training process are improved.

2. By adopting the StochAstic Recursive grAdient algoritHm (SARAH), the invention avoids the problem of the Stochastic Variance Reduced Gradient (SVRG) technique that the network parameters, which are not fixed during training, gradually drift away from the parameters used at sampling time so that the information gap grows larger and larger; the model calculation is therefore more accurate.
Drawings
FIG. 1 is a schematic flow chart of a result pushing method based on a Q learning model in an embodiment of the present invention;

FIG. 2 is a schematic diagram of gradient optimization algorithms in an embodiment of the present invention, in which FIG. 2(a) shows a conventional gradient optimization algorithm and FIG. 2(b) shows a gradient optimization algorithm with stochastic gradient descent;

FIG. 3 is a logic diagram of the variance-reduction-based deep Q learning model training framework in an embodiment of the present invention.
Detailed Description
The present invention is described in detail below by way of specific embodiments so that those skilled in the art can better understand its technical solutions. It should be understood, however, that the detailed description is provided only for a better understanding of the invention and should not be taken as limiting it. In describing the present invention, it is to be understood that the terminology used is for the purpose of description only and is not to be interpreted as indicating or implying relative importance.
Example one
This embodiment discloses a result pushing method based on a Q learning model, as shown in FIG. 1, comprising the following steps:

S1, first set an initial Q learning model and determine the current state s_t, where the initial state s_0 is obtained by recording the user's current browsing activity and each subsequent state is obtained from the browsing history after the user's last interaction; feed the current state s_t into the initial Q learning model to obtain a Q value, and obtain the original push result a_t according to the Q value, the push result comprising the pushed content and the position of the pushed content.

S2, push the original push result to the user, and obtain the reward value r_{t+1} by recording the user's browsing.

S3, form the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and store it in the experience pool D.
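As a concrete illustration of steps S1-S3, the following Python sketch shows one way the data group (s_t, a_t, r_{t+1}, s_{t+1}) and the finite experience pool D could be represented. The class and field names are illustrative assumptions, not taken from the patent.

```python
import random
from collections import deque, namedtuple

# One data group: state, push result (action), reward, next state.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ExperiencePool:
    """Finite experience pool D: stores data groups and supports random batch extraction."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest data groups are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Random extraction of several data groups, as required by steps S4 and S5.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```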
S4, extract several data groups from the experience pool D and, according to the extracted data groups, calculate the full gradient mean g under the network parameter θ_0; the network parameter at this moment is the anchor network parameter.
The full gradient mean is calculated as

g = (1/N) · Σ_{i=1}^{N} ∇_θ l(s_i, a_i; θ_0)

where N is the number of data groups and l( ) is the loss function.
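A minimal sketch of this computation is given below; grad_fn stands in for the gradient of the loss l for a single data group, whose exact form depends on the Q network and is not fixed by the patent.

```python
import numpy as np

def full_gradient_mean(batch, grad_fn, anchor_params):
    """g = (1/N) * sum_i grad_theta l(s_i, a_i; theta_0) over the extracted data groups.

    batch         - list of data groups extracted from the experience pool D
    grad_fn       - grad_fn(transition, params) -> gradient vector for one data group
    anchor_params - the anchor network parameters theta_0
    """
    grads = [grad_fn(transition, anchor_params) for transition in batch]
    return np.mean(grads, axis=0)
```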
S5, randomly extract data groups from those of step S4, calculate the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substitute the gradient values and the full gradient mean into the variance reduction formula to update the gradient.
The target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_0)

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
If a target network Q′(s, a; θ⁻) is introduced, the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q′(s′, a′; θ⁻)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q′(s′, a′; θ⁻)

wherein the parameter θ⁻ denotes the parameter values copied at the end of the previous training round from the training network Q(s, a; θ) to the target network Q′(s, a; θ⁻), the target network having the same structure as the training network Q but different network parameters.
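The sketch below illustrates this Bellman-style target computation; the linear Q function is only a stand-in, since the patent does not fix a network architecture. Passing the current parameters θ_m yields q_m, passing the anchor parameters θ_0 yields q_0, and passing separate target-network parameters θ⁻ reproduces the target-network variant.

```python
import numpy as np

def q_values(state, params, n_actions, features):
    """Q(s, a; theta) for every candidate push result, with a linear Q function as a stand-in."""
    return np.array([features(state, a) @ params for a in range(n_actions)])

def target_q(transition, params, gamma, n_actions, features):
    """Bellman target: q <- r + gamma * max_a' Q(s', a'; params)."""
    next_q = q_values(transition.next_state, params, n_actions, features)
    return transition.reward + gamma * float(np.max(next_q))
```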
The gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the anchor network parameters: ∇_θ l(s, a; θ_0), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_0);

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the anchor network parameters, θ_0 is the anchor network parameter, and Q( ) is the Q network.
The variance reduction formula is:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
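The following sketch shows one variance-reduced update of step S5 under these formulas; grad_fn again stands in for the gradient of the loss for a single data group and is an assumption of the sketch.

```python
import numpy as np

def svrg_update(params, anchor_params, transition, grad_fn, full_grad_mean, lr):
    """theta_{m+1} = theta_m - alpha * (grad l(s,a; theta_m) - grad l(s,a; theta_0) + g)."""
    grad_current = grad_fn(transition, params)          # gradient value under theta_m
    grad_anchor = grad_fn(transition, anchor_params)    # gradient value under the anchor theta_0
    # (full_grad_mean - grad_anchor) averages to zero over the batch, so the step stays
    # unbiased for the batch gradient while its variance is reduced.
    return params - lr * (grad_current - grad_anchor + full_grad_mean)
```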
S6, repeat steps S4-S5 until training is finished to obtain the final Q learning model, and input the state to be tested into the final Q learning model to obtain the optimal push result.
This embodiment is mainly implemented with a Q learning model based on the Stochastic Variance Reduced Gradient (SVRG) technique. As shown in FIG. 2, among conventional gradient optimization algorithms, those based on full Gradient Descent (GD) can ensure that the parameters to be optimized reach a global optimum, but because every step involves computing the full gradient, they usually incur a large computational cost when the data volume is large, which makes the training process sluggish. To avoid this large per-step cost, the Stochastic Gradient Descent (SGD) algorithm abandons the full gradient computation and trains the model by sampling one data point (or a small batch) at each step. Although convergence of the optimization objective can still be ensured, the randomness of the sampling leaves the gradient variance too high, which limits the convergence speed.
To solve this problem, a variance reduction technique is introduced into the stochastic gradient descent procedure. The mathematical definition of variance reduction is

Z_α = α(X - Y) + E[Y]

wherein X denotes the random variable whose variance is to be reduced, Y denotes another random variable positively correlated with X, E[Y] denotes the mathematical expectation of the random variable Y, and Z_α denotes the random variable obtained after optimization by variance reduction.
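A small numerical illustration of this definition (with α = 1 and synthetic, positively correlated X and Y) is sketched below; the numbers in the comments are approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

base = rng.normal(size=100_000)
X = base + 0.3 * rng.normal(size=100_000)   # variable whose variance should be reduced
Y = base + 0.3 * rng.normal(size=100_000)   # positively correlated auxiliary variable

alpha = 1.0
Z = alpha * (X - Y) + Y.mean()              # Y.mean() plays the role of E[Y]

print("Var(X):", round(X.var(), 3))         # about 1.09
print("Var(Z):", round(Z.var(), 3))         # about 0.18, much smaller, same expectation
print("E[X]:", round(X.mean(), 4), "E[Z]:", round(Z.mean(), 4))
```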
The stochastic variance reduced gradient descent technique changes the original parameter update step into the form of Z_α above, with a periodically sampled batch of training data playing the role of Y in the definition of variance reduction. The gradient update formula is

θ_{t+1} = θ_t - η · ( ∇_θ l(s, a; θ_t) - ∇_θ l(s, a; θ_old) + μ )

wherein θ_t is the parameter being optimized at the t-th training step, θ_old denotes the parameter values at which the full gradient is calculated, μ denotes the expectation of the full gradient of the batch-data loss function, ∇_θ l(s, a; θ_t) and ∇_θ l(s, a; θ_old) denote the gradient values of the single-sample loss function, and η denotes the learning rate.
The invention takes the gradient ∇_θ l(s, a; θ) of the loss function l(s, a; θ) with respect to the parameters of each network layer as the random variable X whose variance is to be reduced. FIG. 3 shows the variance-reduction-based deep Q learning training framework: the current network Q represents the learning model, and the environment represents the object that interacts with the network Q. The network Q accepts the current state s of the environment as input, evaluates under the current network parameters θ_m the Q value of every action that can be executed in state s, selects the optimal action a according to the Q values and outputs it to the environment, and the environment receives the action and transfers to the next state s′. The framework takes the current network Q as input and outputs the network optimized by variance reduction; specifically, it takes the network parameters θ_0 as input and outputs the optimized network parameters obtained by variance-reduced training. During training, the environment continuously interacts with the current network to generate transition data groups (s, a, r, s′), and a finite experience pool D stores the generated data and periodically sends it to the network for training. According to the characteristics of the SVRG algorithm, a batch of data must first be sampled from the experience pool, and the full gradient mean g of this batch under the network parameters θ_0 in force at sampling time is calculated to serve as the expectation E[Y] in the SVRG optimization. The gradient of an individual sample of the batch under the sampling-time network parameters θ_0 then serves as the auxiliary variable Y in the optimization.
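A compact sketch of one such training epoch follows, combining steps S4 and S5; the function and variable names are illustrative, and grad_fn is assumed to return the per-sample loss gradient for the chosen Q network.

```python
import random
import numpy as np

def train_epoch(pool, params, grad_fn, lr=0.01, batch_size=64, inner_steps=64):
    """One outer iteration of the variance-reduced deep Q learning loop (FIG. 3)."""
    # S4: sample a batch from D, freeze the anchor parameters, compute the full gradient mean g.
    batch = pool.sample(batch_size)
    anchor = params.copy()                                    # anchor network parameters theta_0
    g = np.mean([grad_fn(t, anchor) for t in batch], axis=0)  # plays the role of E[Y]

    # S5: repeated single-sample variance-reduced updates.
    for _ in range(inner_steps):
        transition = random.choice(batch)                     # random extraction from the batch
        grad_m = grad_fn(transition, params)                  # gradient under theta_m (X)
        grad_0 = grad_fn(transition, anchor)                  # gradient under the anchor (Y)
        params = params - lr * (grad_m - grad_0 + g)
    return params
```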
Example two
Based on the same inventive concept, this embodiment discloses another result pushing method based on a Q learning model, comprising the following steps:

S1, first set an initial Q learning model and determine the current state s_t, where the initial state s_0 is obtained by recording the user's current browsing activity and each subsequent state is obtained from the browsing history after the user's last interaction; feed the current state s_t into the initial Q learning model to obtain a Q value, and obtain the original push result a_t according to the Q value, the push result comprising the pushed content and the position of the pushed content.

S2, push the original push result to the user, and obtain the reward value r_{t+1} by recording the user's browsing.

S3, form the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and store it in the experience pool D.

S4, extract several data groups from the experience pool D, calculate the full gradient mean g_m under the network parameter θ_m according to the extracted data groups, and perform a gradient optimization with the full gradient mean:

θ_{m+1} ← θ_m - α · g_m

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, and g_m is the full gradient mean under the current network parameters.
S5, randomly extract data groups from those of step S4, calculate the target Q value and the gradient value of each data group under the current network parameter and under the last network parameter, and substitute the gradient values and the full gradient mean into the variance reduction formula to update the gradient.
The target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the last network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_{m-1})

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
The variance reduction formula in step S5 is:

g_m = ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_{m-1}) + g_{m-1}

wherein l( ) is the loss function, θ_{m-1} is the last network parameter, θ_m is the current network parameter, g_{m-1} is the full gradient mean under the last network parameter, and g_m is the full gradient mean under the current network parameters.
The gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the last network parameters: ∇_θ l(s, a; θ_{m-1}), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_{m-1});

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the last network parameters, θ_{m-1} is the last network parameter, and Q( ) is the Q network.
S6, repeat steps S4-S5 until training is finished to obtain the final Q learning model, and input the state to be tested into the final Q learning model to obtain the optimal push result.
This embodiment is mainly implemented with a Q learning model based on the StochAstic Recursive grAdient algoritHm (SARAH). The SVRG algorithm uses the fixed full gradient mean g of a batch of data as the correction term E[Y], and uses the fixed network parameters θ_0 in force when the batch was sampled to calculate the single-sample gradient serving as Y; the parameters of the network, however, are not fixed during training and gradually drift away from the sampling-time parameter θ_0, so the information gap becomes larger and larger.
To address this problem, SARAH processes the gradient and full-gradient estimates with a recursive, adaptively updated scheme: it abandons the fixed batch full gradient mean g and the fixed sampling parameter θ_old, gradually updates the full gradient mean g during training, and uses the parameter θ_{t-1} of the previous step instead of θ_old. In summary, in the SARAH algorithm the update step with the variance-reducing gradient is:

g_t = ∇_θ l(s, a; θ_t) - ∇_θ l(s, a; θ_{t-1}) + g_{t-1}

θ_{t+1} = θ_t - η · g_t

Compared with the SVRG algorithm of FIG. 3, this embodiment replaces the SVRG operation unit with a SARAH updating unit: the full gradient mean g is kept updated while the parameters are updated, and the fixed sampling-time network θ_0 is replaced by the network θ_{t-1} of the previous training step.
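A sketch of this SARAH-style step is given below. At the start of each epoch, g would be initialized to the full gradient mean of the sampled batch and the previous parameters to the sampling-time parameters; the function and variable names are illustrative.

```python
import numpy as np

def sarah_step(params, prev_params, g_prev, transition, grad_fn, lr):
    """g_t = grad l(s,a; theta_t) - grad l(s,a; theta_{t-1}) + g_{t-1}; theta_{t+1} = theta_t - eta * g_t."""
    g_t = grad_fn(transition, params) - grad_fn(transition, prev_params) + g_prev
    new_params = params - lr * g_t
    # Return the updated parameters, the new "previous" parameters and the new gradient estimate,
    # so the reference point keeps tracking the iterates instead of a fixed anchor.
    return new_params, params, g_t
```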
Example three
Based on the same inventive concept, this embodiment discloses a result pushing system based on a Q learning model, comprising:

an original push result generation module for determining the current state s_t, feeding the current state s_t into the initial Q learning model to obtain a Q value, and obtaining the original push result a_t according to the Q value;

a reward value generation module for pushing the original push result to the user and obtaining the reward value r_{t+1} by recording the user's browsing;

a storage module for forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in the experience pool D;

a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating, according to the extracted data groups, the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter;

a gradient updating module for randomly extracting data groups from those extracted by the full gradient mean calculation module, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into the variance reduction formula to update the gradient;

and an output module for repeating the above operations until training is finished to obtain the final Q learning model, and inputting the state to be tested into the final Q learning model to obtain the optimal push result.
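The skeleton below shows how these modules could map onto a single class, one method per module; all class, method and reward-field names are illustrative assumptions, and the training method is left as a stub to be filled by a variance-reduced trainer such as the epoch sketch in embodiment one.

```python
class QLearningPushSystem:
    """Skeleton mirroring the modules of the result pushing system of this embodiment."""

    def __init__(self, q_model, experience_pool):
        self.q_model = q_model        # current Q learning model (Q network)
        self.pool = experience_pool   # experience pool D

    def generate_push_result(self, state):
        """Original push result generation module: pick the action with the largest Q value."""
        return int(self.q_model.q_values(state).argmax())

    def observe_reward(self, browsing_record):
        """Reward value generation module: turn the recorded browsing into r_{t+1} (illustrative)."""
        return float(browsing_record.get("clicked", 0))

    def store(self, state, action, reward, next_state):
        """Storage module: store the data group (s_t, a_t, r_{t+1}, s_{t+1}) in D."""
        self.pool.store(state, action, reward, next_state)

    def train(self, batch_size, lr):
        """Full gradient mean calculation module + gradient updating module (steps S4-S5)."""
        raise NotImplementedError("depends on the chosen Q network, loss and optimizer")

    def best_push_result(self, state):
        """Output module: after training, return the optimal push result for a given state."""
        return self.generate_push_result(state)
```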
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments without departing from the spirit and scope of the invention, and such modifications and equivalents are to be covered by the claims. The above description concerns only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the protection scope of the present application should be defined by the claims.

Claims (9)

1. A result pushing method based on a Q learning model, characterized by comprising the following steps:

S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value;

S2, pushing the original push result to a user, and obtaining a reward value r_{t+1} by recording the user's browsing;

S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D;

S4, extracting several data groups from the experience pool D and, according to the extracted data groups, calculating the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter;

S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient;

S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting a state to be tested into the final Q learning model to obtain an optimal push result;

the variance reduction formula in step S5 being:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
2. The result pushing method based on the Q learning model according to claim 1, wherein the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the anchor network parameters: ∇_θ l(s, a; θ_0), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_0);

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the anchor network parameters, θ_0 is the anchor network parameter, and Q( ) is the Q network.
3. The result pushing method based on the Q learning model according to claim 2, wherein the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the anchor network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_0)

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
4. The Q learning model-based result pushing method according to claim 3, wherein the full gradient mean is calculated as:

g = (1/N) · Σ_{i=1}^{N} ∇_θ l(s_i, a_i; θ_0)

wherein N is the number of data groups and l( ) is the loss function.
5. A result pushing method based on a Q learning model, characterized by comprising the following steps:

S1, determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value;

S2, pushing the original push result to a user, and obtaining a reward value r_{t+1} by recording the user's browsing;

S3, forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D;

S4, extracting several data groups from the experience pool D, calculating the full gradient mean g_m under the network parameter θ_m according to the extracted data groups, and performing a gradient optimization with the full gradient mean:

θ_{m+1} ← θ_m - α · g_m

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, and g_m is the full gradient mean under the current network parameters;

S5, randomly extracting data groups from those of step S4, calculating the target Q value and the gradient value of each data group under the current network parameter and under the last network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient;

S6, repeating steps S4-S5 until training is finished to obtain a final Q learning model, and inputting a state to be tested into the final Q learning model to obtain an optimal push result;
the variance reduction formula in step S5:
Figure FDA0003831718560000027
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA0003831718560000028
is the next network parameter;
Figure FDA0003831718560000029
is the current network parameter; α is the learning rate;
Figure FDA00038317185600000210
is a gradient value; g is the full gradient mean.
6. The Q learning model-based result pushing method according to claim 5, wherein the variance reduction formula in step S5 is:

g_m = ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_{m-1}) + g_{m-1}

wherein l( ) is the loss function, θ_{m-1} is the last network parameter, θ_m is the current network parameter, g_{m-1} is the full gradient mean under the last network parameter, and g_m is the full gradient mean under the current network parameters.
7. The Q learning model-based result pushing method according to claim 6, wherein the gradient values are calculated as follows:

gradient value under the current network parameters: ∇_θ l(s, a; θ_m), the gradient of the loss between the target value q_m and the network output Q(s, a; θ_m);

gradient value under the last network parameters: ∇_θ l(s, a; θ_{m-1}), the gradient of the loss between the target value q_0 and the network output Q(s, a; θ_{m-1});

wherein s and a are, respectively, the state in a data group randomly extracted in step S5 and the push result corresponding to that state, q_m is the target Q value under the current network parameters, q_0 is the target Q value under the last network parameters, θ_{m-1} is the last network parameter, and Q( ) is the Q network.
8. The result pushing method based on the Q learning model of claim 7, wherein the target Q value is calculated as follows:

target Q value under the current network parameters: q_m ← r + γ · max_{a′} Q(s′, a′; θ_m)

target Q value under the last network parameters: q_0 ← r + γ · max_{a′} Q(s′, a′; θ_{m-1})

wherein s′ and a′ are, respectively, the next state in a data group randomly extracted in step S5 and the push result corresponding to that next state, r is the reward value, and γ is the discount coefficient.
9. A result pushing system based on a Q learning model, characterized by comprising:

an original push result generation module for determining a current state s_t, feeding the current state s_t into an initial Q learning model to obtain a Q value, and obtaining an original push result a_t according to the Q value;

a reward value generation module for pushing the original push result to a user and obtaining a reward value r_{t+1} by recording the user's browsing;

a storage module for forming the state s_t, the push result a_t, the next state s_{t+1} and the reward value r_{t+1} into a data group and storing it in an experience pool D;

a full gradient mean calculation module for extracting several data groups from the experience pool D and calculating, according to the extracted data groups, the full gradient mean g under the network parameter θ_0, the network parameter at this moment being the anchor network parameter;

a gradient updating module for randomly extracting data groups from those extracted by the full gradient mean calculation module, calculating the target Q value and the gradient value of each data group under the current network parameter and under the anchor network parameter, and substituting the gradient values and the full gradient mean into a variance reduction formula to update the gradient; and

an output module for repeating the operations of the full gradient mean calculation module and the gradient updating module until training is finished to obtain a final Q learning model, and inputting a state to be tested into the final Q learning model to obtain an optimal push result;

the variance reduction formula being:

θ_{m+1} ← θ_m - α · ( ∇_θ l(s, a; θ_m) - ∇_θ l(s, a; θ_0) + g )

wherein θ_{m+1} is the next network parameter, θ_m is the current network parameter, α is the learning rate, ∇_θ l(s, a; θ_m) and ∇_θ l(s, a; θ_0) are the gradient values, and g is the full gradient mean.
CN202010896316.9A 2020-08-31 2020-08-31 Q learning model-based result pushing method and system Active CN112085524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010896316.9A CN112085524B (en) 2020-08-31 2020-08-31 Q learning model-based result pushing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010896316.9A CN112085524B (en) 2020-08-31 2020-08-31 Q learning model-based result pushing method and system

Publications (2)

Publication Number Publication Date
CN112085524A CN112085524A (en) 2020-12-15
CN112085524B (en) 2022-11-15

Family

ID=73731256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010896316.9A Active CN112085524B (en) 2020-08-31 2020-08-31 Q learning model-based result pushing method and system

Country Status (1)

Country Link
CN (1) CN112085524B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A recommendation algorithm based on deep reinforcement learning
CN110084378A (en) * 2019-05-07 2019-08-02 南京大学 A distributed machine learning method based on a local learning strategy
KR20190132193A (en) * 2018-05-18 2019-11-27 한양대학교 에리카산학협력단 A Dynamic Pricing Demand Response Method and System for Smart Grid Systems

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190132193A (en) * 2018-05-18 2019-11-27 한양대학교 에리카산학협력단 A Dynamic Pricing Demand Response Method and System for Smart Grid Systems
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A recommendation algorithm based on deep reinforcement learning
CN110084378A (en) * 2019-05-07 2019-08-02 南京大学 A distributed machine learning method based on a local learning strategy

Also Published As

Publication number Publication date
CN112085524A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN110674604B (en) Transformer DGA data prediction method based on multi-dimensional time sequence frame convolution LSTM
CN108875916B (en) Advertisement click rate prediction method based on GRU neural network
CN111260030B (en) A-TCN-based power load prediction method and device, computer equipment and storage medium
WO2021109644A1 (en) Hybrid vehicle working condition prediction method based on meta-learning
CN110942194A (en) Wind power prediction error interval evaluation method based on TCN
CN112381673B (en) Park electricity utilization information analysis method and device based on digital twin
CN113449919B (en) Power consumption prediction method and system based on feature and trend perception
CN112015719A (en) Regularization and adaptive genetic algorithm-based hydrological prediction model construction method
CN115271219A (en) Short-term load prediction method and prediction system based on causal relationship analysis
CN114548591A (en) Time sequence data prediction method and system based on hybrid deep learning model and Stacking
CN114742209A (en) Short-term traffic flow prediction method and system
CN113807596B (en) Management method and system for informatization project cost
CN114971090A (en) Electric heating load prediction method, system, equipment and medium
CN112085524B (en) Q learning model-based result pushing method and system
CN103607219B (en) A kind of noise prediction method of electric line communication system
CN112951209A (en) Voice recognition method, device, equipment and computer readable storage medium
CN109740221B (en) Intelligent industrial design algorithm based on search tree
CN115829123A (en) Natural gas demand prediction method and device based on grey model and neural network
CN116151581A (en) Flexible workshop scheduling method and system and electronic equipment
CN113705878B (en) Method and device for determining water yield of horizontal well, computer equipment and storage medium
CN115035304A (en) Image description generation method and system based on course learning
CN112348275A (en) Regional ecological environment change prediction method based on online incremental learning
CN111859807A (en) Initial pressure optimizing method, device, equipment and storage medium for steam turbine
CN111369046A (en) Wind-solar complementary power prediction method based on grey neural network
CN110580548A (en) Multi-step traffic speed prediction method based on class integration learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant