CN113688306A - Recommendation strategy generation method and device based on reinforcement learning - Google Patents

Recommendation strategy generation method and device based on reinforcement learning

Info

Publication number
CN113688306A
Authority
CN
China
Prior art keywords
user
recommendation
generating
simulator
strategy
Prior art date
Legal status
Pending
Application number
CN202110726927.3A
Other languages
Chinese (zh)
Inventor
李成钢
黄莹
李忠
李金岭
杜忠田
王彦君
夏海轮
张碧昭
余清华
卜理超
张天正
李凤文
袁福碧
Current Assignee
China Telecom Group System Integration Co Ltd
Original Assignee
China Telecom Group System Integration Co Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Group System Integration Co Ltd filed Critical China Telecom Group System Integration Co Ltd
Priority to CN202110726927.3A
Publication of CN113688306A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention discloses a recommendation strategy generation method and device based on reinforcement learning, belonging to the field of intelligent recommendation. The method comprises the following steps: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommendation strategy model through the simulation environment by adopting a strategy gradient algorithm. The invention solves the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards; it achieves the technical effect of efficiently, reasonably and accurately mining the user's hidden preferences while satisfying the user's current interests.

Description

Recommendation strategy generation method and device based on reinforcement learning
Technical Field
The invention belongs to the field of intelligent recommendation, and particularly relates to a recommendation strategy generation method and device based on reinforcement learning.
Background
With the continuous development of intelligent technology, people increasingly use intelligent devices in daily life, work and study; intelligent technologies have improved people's quality of life and increased the efficiency of study and work.
At present, most recommendation algorithms are designed and trained on the basis of users' historical data: after the users' interests in different items are determined, personalized recommendation is carried out. In such recommendation algorithms, the user's interests are inferred from the collected historical data, and it is assumed that they remain unchanged for a certain time. However, for most recommendation systems, such as music and movie recommendation systems, the users' interests change constantly, even under the influence of the content recommended by the system itself. Thus, current recommendation systems face two challenges: (1) the user's interests are not constant and change over time, so a recommendation algorithm must consider the user's short-term interests while also mining the user's potential interests in order to obtain higher long-term benefit; (2) current recommendation algorithms tend to continually recommend similar items to a user, which may reduce the user's interest in similar topics and thereby reduce satisfaction with the overall recommendation service. These two challenges constitute the Exploration and Exploitation (EE) problem in recommendation systems, i.e. how to mine the user's hidden preferences while satisfying the user's current interests.
Since the recommendation problem can be converted into a sequential decision problem and the three elements of reinforcement learning (state, action and reward) can be defined, a reinforcement learning framework can be applied to the recommendation algorithm to solve the above problem in the recommendation scenario. Some existing reinforcement learning methods add randomness to the decision of recommending new items in order to solve the EE problem, for example the simple ε-greedy policy and the Upper Confidence Bound (UCB) policy based on the multi-armed bandit algorithm. However, both policies damage the recommendation effect in the short term: the ε-greedy policy may recommend items the user is not interested in at all in the early stage of recommendation, and the UCB algorithm needs a large number of attempts to obtain relatively accurate item rewards.
Disclosure of Invention
The invention provides a recommendation strategy generation method and device based on reinforcement learning, which solve the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards; the invention achieves the technical effect of efficiently, reasonably and accurately mining the user's hidden preferences while satisfying the user's current interests.
In one aspect of the present invention, a recommendation policy generation method based on reinforcement learning is provided, including: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
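The four steps above can be pictured, purely for illustration, as the minimal Python sketch below; every function and class name is a hypothetical placeholder introduced for this sketch and is not part of the claimed method.

from dataclasses import dataclass
from typing import List

@dataclass
class SceneInfo:
    # illustrative container for the "scene information" of step 1
    click_logs: List[List[int]]   # per-user historical click sequences
    num_items: int                # size of the candidate item set
    slate_size: int               # Y: number of items shown per page

def acquire_scene_information(raw_logs) -> SceneInfo:
    ...  # step 1: collect or construct the recommendation-scenario data

def build_user_simulator(scene: SceneInfo):
    ...  # step 2: fit the user simulator from the scene information

def build_simulation_environment(simulator):
    ...  # step 3: wrap the simulator as an interactive environment

def train_policy_with_policy_gradient(env):
    ...  # step 4: obtain the recommendation strategy model via a strategy gradient algorithm

def generate_recommendation_policy(raw_logs):
    scene = acquire_scene_information(raw_logs)
    simulator = build_user_simulator(scene)
    env = build_simulation_environment(simulator)
    return train_policy_with_policy_gradient(env)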
Further, before the acquiring the scene information, the method further includes: and defining a recommendation scene according to the user requirement.
Further, the generating a user simulator according to the scene information includes: describing the user state of the scene information according to an attention mechanism to obtain the user state; determining a user decision function and a user reward function according to the user state; and constructing the user simulator according to the user decision function and the user reward function.
Further, after the generating a recommended policy model by a policy gradient algorithm through the simulation environment, the method further comprises: and outputting the recommended strategy model.
In another aspect of the present invention, a recommendation policy generation apparatus based on reinforcement learning is further provided, including: the acquisition module is used for acquiring scene information; the generating module is used for generating a user simulator according to the scene information; the simulation module is used for generating a simulation environment according to the user simulator; and the recommendation module is used for generating a recommendation strategy model by adopting a strategy gradient algorithm through the simulation environment.
Further, the apparatus further comprises: and the definition module is used for defining the recommendation scene according to the user requirement.
Further, the generating module includes: the description unit is used for describing the user state of the scene information according to an attention mechanism to obtain the user state; the determining unit is used for determining a user decision function and a user reward function according to the user state; and the construction unit is used for constructing the user simulator according to the user decision function and the user reward function.
Further, the apparatus further comprises: and the output module is used for outputting the recommended strategy model.
In another aspect of the present invention, a non-volatile storage medium is further provided, where the non-volatile storage medium includes a stored program, and the program controls, when running, a device in which the non-volatile storage medium is located to execute a recommendation policy generation method based on reinforcement learning.
In another aspect of the present invention, an electronic device is further provided, which includes a processor and a memory; the memory stores computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a recommendation strategy generation method based on reinforcement learning.
Compared with the prior art, the invention has the beneficial effects that:
The invention fully extracts the user's state features by adopting an attention mechanism, so that changes in the user's interests can be captured more deeply and accurately; the decision process of the recommendation scenario is then modeled as a user simulator, and, in order to reduce the deviation between the user simulator and the real user decision process, the simulator is trained with the min-max principle of a generative adversarial network so as to fit the distribution of real user decision behaviors; finally, the obtained user simulator is used as a simulation environment, and a recommendation strategy is obtained with a reinforcement-learning strategy gradient method. This solves the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards; high user-behavior prediction accuracy can be obtained, the recommendation performance is effectively improved, and the technical effect of efficiently, reasonably and accurately mining the user's hidden preferences while satisfying the user's current interests is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the reinforcement-learning-based recommendation scenario according to an embodiment of the present invention;
FIG. 2 is a user state characterization scheme based on an attention mechanism according to an embodiment of the present invention;
FIG. 3 is a data generation flow of the user simulator according to an embodiment of the present invention;
FIG. 4 is a recommendation algorithm framework based on reinforcement learning according to an embodiment of the present invention;
FIG. 5 is a flowchart of a recommendation strategy generation method based on reinforcement learning according to an embodiment of the present invention;
fig. 6 is a block diagram of a recommendation policy generation apparatus based on reinforcement learning according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a method embodiment of a reinforcement learning based recommendation policy generation method, it should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Example one
Fig. 5 is a flowchart of a recommendation strategy generation method based on reinforcement learning according to an embodiment of the present invention, as shown in fig. 5, the method includes the following steps:
step S502, scene information is acquired.
In order to solve the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards, scene information needs to be set before scenario analysis and recommendation strategy generation are carried out. The scene information may comprise scene construction data for the implementation scenario in which the user is located, or a scene data set generated according to parameters preset by the user, which is used for the subsequent model generation and training operations.
Specifically, the set recommendation scenario is as follows: the recommending agent presents the user with Y items in the page, the user provides feedback by clicking on one of the items or choosing not to click on any of the items, and then the agent displays a new page containing Y items.
Optionally, before the acquiring the scene information, the method further includes: and defining a recommendation scene according to the user requirement.
It should be noted that the recommendation process is mapped into a reinforcement learning framework, as shown in fig. 1; fig. 1 is a schematic diagram of the reinforcement-learning-based recommendation scenario according to an embodiment of the present invention. The environment in reinforcement learning corresponds to the online user in the recommendation algorithm. The state s_t corresponds to an ordered sequence of the user's click history. The recommending agent corresponds to the execution center of the recommendation algorithm: according to the recommendation strategy it selects an item list I_t from the candidate set, from which Y items are selected to be shown to the user, i.e. the recommendation action, and the recommendation list is denoted by A_t. When the user interacts with the page and clicks a certain item a_t in the recommendation list as feedback, the reward of this state is obtained and the next state s_{t+1} is entered. The transition probability model P predicts, according to the current state s_t and the selected action a_t, the probability of entering the next state s_{t+1}.
In addition, the state transition probability expression (1) denotes the probability of transitioning from state s_t to state s_{t+1}; the reward function expression (2) corresponds to the reward, i.e. the short-term benefit, obtained after the user gives feedback by clicking on item a_t. Because the user can only make the selection action a_t ∈ A_t from the recommendations given by the recommendation system, r(s_t, a_t) can be used instead of r(s_t, A_t, a_t), and P(· | s_t, a_t) instead of P(· | s_t, A_t, a_t).
P(s_{t+1} | s_t, A_t, a_t)    (1)
r(s_t, A_t, a_t)    (2)
The policy π corresponds to the recommendation strategy of the recommending agent: in state s_t, the agent obtains the recommendation list A_t from the candidate set I_t according to a certain strategy π.
It is also noted that an object of embodiments of the present invention is to maximize the long-term reward while ensuring recommendation accuracy. Therefore, the improved recommendation algorithm based on reinforcement learning of this embodiment aims to find an optimal strategy π*(s_t, I_t) that, in state s_t, selects Y items from the candidate set I_t to recommend to the user such that the expected reward is maximized. The objective function is defined as shown in equation (3):
π* = argmax_π E[ Σ_t r(s_t, a_t) ]    (3)
Among reinforcement learning methods, value-based methods have great advantages for continuous off-policy learning, but the convergence of the policy function is poor. In contrast, policy-based approaches perform well in terms of policy-function convergence. Therefore, this embodiment trains the recommendation strategy of the recommending agent with REINFORCE, a strategy-gradient-based reinforcement learning method with better convergence, using a user simulator obtained by generative adversarial network training as the simulation environment. Here E is the expectation and r(s_t, a_t) is the reward function.
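As a concrete picture of this mapping, the interaction loop of fig. 1 can be sketched as a simulated environment in Python; the class name, the react() interface of the user simulator and the gym-style reset()/step() methods are assumptions made only for this sketch.

class RecommendationEnv:
    # Illustrative environment: the state s_t is the ordered click history,
    # an action A_t is a slate of Y item ids, and the user simulator returns
    # the clicked item a_t (or None) together with the reward r(s_t, a_t).
    def __init__(self, user_simulator, slate_size):
        self.sim = user_simulator
        self.Y = slate_size
        self.state = []          # s_t: ordered sequence of clicked items

    def reset(self):
        self.state = []
        return tuple(self.state)

    def step(self, slate):
        assert len(slate) == self.Y
        clicked, reward = self.sim.react(tuple(self.state), tuple(slate))
        if clicked is not None:  # the user's feedback a_t produces s_{t+1}
            self.state = self.state + [clicked]
        done = False             # episode termination is left to the caller
        return tuple(self.state), reward, done, {}

A recommendation strategy can then be trained against this environment in the same way as against any other episodic reinforcement-learning environment.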
And step S504, generating a user simulator according to the scene information.
Optionally, the generating a user simulator according to the scene information includes: describing the user state of the scene information according to an attention mechanism to obtain the user state; determining a user decision function and a user reward function according to the user state; and constructing the user simulator according to the user decision function and the user reward function.
In particular, the user state s_t is composed of the sequence of historical items c_0, c_1, …, c_{t-1} clicked on by the user before time t, where c_* denotes a clicked item. The sequence c_0, c_1, …, c_{t-1} is converted into embedding-layer vectors {f_1, f_2, …, f_{t-1}}, and the user state is then defined as shown in equation (4):
s_t = h(f_1, f_2, …, f_{t-1})    (4)
where the vector f_τ (τ = 1, 2, …, t-1) is the embedding-layer vector of the item clicked at time τ, and h(·) is a feature embedding function whose purpose is to generate a vector of fixed length to represent the user state. Therefore, if the user state is represented by a history sequence {f_{t-m}, …, f_{t-1}} of length m, it can be written as:
s_t = h(f_{t-m}, f_{t-(m-1)}, …, f_{t-1})    (5)
If F_{t-m:t-1} denotes the user history sequence {f_{t-m}, …, f_{t-1}} of length m, the user state can be represented as:
s_t = h(F_{t-m:t-1}) := σ(F_{t-m:t-1} W + B)    (6)
where W is an m-row, n-column matrix of weighting coefficients, B is a d-row, n-column bias matrix, and σ(·) is the activation function.
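A direct numerical reading of equation (6), with the dimensions stated above (F as a d x m matrix of stacked click embeddings, W an m x n weight matrix, B a d x n bias); taking σ as a sigmoid is an assumption of this sketch, since the text only calls it "the activation function".

import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))      # assumed sigmoid activation

def user_state(F, W, B):
    # F: d x m matrix whose columns are the click embeddings f_{t-m}, ..., f_{t-1}
    # W: m x n position-mixing weights, B: d x n bias  ->  s_t has shape d x n
    return sigma(F @ W + B)

# toy usage with made-up sizes
d, m, n = 8, 5, 4
rng = np.random.default_rng(0)
s_t = user_state(rng.normal(size=(d, m)), rng.normal(size=(m, n)), np.zeros((d, n)))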
This embodiment is described by taking a news recommendation system as an example, considering that the user's interests and behavior state change over time. Two contextual scenarios are common in news recommendation systems. (1) If the interval between two click/browse operations of the user is long, then when describing the state s_τ at a time point τ after the two clicks, it cannot simply be assumed that the content at every position of the user's historical interaction sequence has the same influence weight on the user's decision strategy. That is, in the history sequence of length m, if the interval between the initial position t-m and the end position t-1 is long, the user behavior at time t-m has no or little influence on the user's decision at the current time t. (2) If a user becomes interested in news about "a virus" after browsing the news titled "a virus variation in uk", the user's interest may be changed by the influence of that news, and the user may then want to browse news related to "a virus". However, if every position of the history sequence before time τ has the same influence on the user's decision strategy, the finally generated recommendation items may not contain the related news the user most wants to browse. Both contextual scenarios illustrate that the feature representation (6) cannot distinguish the influence of behaviors at different sequence positions on the user's decision strategy. In order to solve this problem, a scheme for representing the user state based on an Attention Mechanism is proposed, which adjusts the influence of different positions of the historical click sequence on the user state. The degree of influence of each position on time t is determined by the attention weight coefficient a_τ:
[Equation (7), defining the attention weight coefficient a_τ, is given as an image in the original.]
where d represents the position of the currently clicked item in the state sequence.
This scheme is illustrated in FIG. 2, which shows the user state characterization scheme based on the attention mechanism according to an embodiment of the present invention. Here {w_1, w_2, …} are PWM (Position Weight Matrix) parameters, which account for the user state s_t being influenced by where in the user interaction sequence an item is located (i.e., its time of occurrence). If H_{t-m:t-1} denotes the user state given by expression (6), then the user state based on the attention mechanism can be expressed as:
[Equation (8), the attention-based user state s_t, is given as an image in the original.]
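Since equations (7) and (8) are reproduced only as images, the following is one plausible instantiation of the idea described above: re-weight each history position by an attention coefficient a_τ before applying the mapping of equation (6). The softmax normalisation and the way the position scores are obtained are assumptions of this sketch, not the original formulas.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_user_state(F, W, B, position_scores):
    # F: d x m click-embedding matrix; position_scores: length-m vector playing
    # the role of the position-weight (PWM) parameters w_1, w_2, ...
    a = softmax(position_scores)              # a_tau: influence of each history position
    F_weighted = F * a[np.newaxis, :]         # more relevant positions weigh more
    return 1.0 / (1.0 + np.exp(-(F_weighted @ W + B)))   # reuse the form of equation (6)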
In addition, in order to determine the user decision function and reward function and thus simplify the model, the embodiment of the present invention sets the recommendation scenario as follows: the user is presented with Y items, and the user decides either to click the one item of greatest interest or to click none of the items. The user simulator is the interaction model in this recommendation scenario. In the simulator, the user's satisfaction with, or interest in, an item is measured by a reward r, and the optimization goal of the user decision strategy is to maximize the long-term reward. In the real user decision process, the items pushed to the user by the recommendation algorithm have a certain influence on changes in the user's interests. Taking a news recommendation service as an example, a certain user may not be interested in NBA news at first, but if the recommendation algorithm recommends such news to this user, the user may like it and then become interested in other NBA news. Similarly, a user may become bored after repeatedly seeing similar news. Thus, the user's satisfaction with the same item may be affected by the user's behavior history sequence. In summary, the reward function is related to the user state s_t and the user's decision behavior a_t, and the reward is therefore represented as a reward function r(s_t, a_t). The optimal user decision model φ* is the set of parameters that, in user state s_t, clicks item a_t from the item set A_t recommended by the recommending agent so as to maximize the reward function r(s_t, a_t). The user decision function can thus be expressed as:
[Equation (9), the user decision function φ*, is given as an image in the original.]
where y denotes an item number in the recommending agent's push list, Y is the total number of items, and Δ^Y is a Y-dimensional probability simplex, as shown in equation (10):
[Equation (10), the definition of the probability simplex Δ^Y, is given as an image in the original.]
Δ^Y represents the probabilities of the user clicking on each recommended item, which sum to 1. L_2(φ) is an L2 regularization function used to encourage exploration. η is the exploration rate; as an exploration-exploitation balance parameter, the larger η is, the more exploratory the user is. It is assumed that the reward of the recommender system is the same as the user utility; therefore, optimizing the accumulated reward of the recommendation system can satisfy the user's needs over the long term and improve user satisfaction. The defined reward function is determined by the user's utility after making a click decision, as shown in equation (11):
r(s_t, a_t) := reg(W[s_t, a_t] + b)    (11)
where W is the reward weight matrix, b is the corresponding bias vector, and reg(·) is the final regression function.
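Equation (11) can be read directly as a regression head on the concatenated state/item features; the decision function over the Y displayed items is sketched here as a temperature softmax controlled by the exploration rate η, which is an assumption standing in for equations (9)-(10) (given only as images in the original).

import numpy as np

def reward(s, a, W, b):
    # Equation (11): r(s_t, a_t) := reg(W [s_t, a_t] + b), with [s_t, a_t] read as
    # concatenation and reg(.) as a linear regression head; to keep the sketch
    # scalar, W is used here as a weight vector rather than a full matrix.
    x = np.concatenate([s, a])
    return float(W @ x + b)

def click_distribution(s, slate_features, W, b, eta=1.0):
    # Sketch of phi(s_t, A_t): a probability vector over the Y displayed items
    # that favours high-reward items; a larger eta flattens the distribution,
    # i.e. the user behaves more exploratorily.
    scores = np.array([reward(s, a, W, b) for a in slate_features]) / max(eta, 1e-8)
    e = np.exp(scores - scores.max())
    return e / e.sum()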
Step S506, generating a simulation environment according to the user simulator.
Specifically, after a mature user simulator is obtained, the embodiment of the present invention outputs the simulation environment according to the characteristic values of the user simulator in the early stage of recommendation strategy generation; the simulation environment is used for generating the final recommendation strategy model.
And step S508, generating a recommendation strategy model by adopting a strategy gradient algorithm through the simulation environment.
Optionally, after generating the recommended policy model by using a policy gradient algorithm through the simulation environment, the method further includes: and outputting the recommended strategy model.
Specifically, in the embodiment of the present invention, the reward and the user decision strategy in the defined user simulator are unknown and are trained from data. In the training process, the user decision function φ(s_t, A_t) models the click behavior sequence of real users; the click behavior of real users and the behavior motivation of the user simulator are both to maximize the long-term reward value. This is fitted with the generator and discriminator of a generative adversarial network, where the generator is the user decision function, which generates the user's next click behavior from the user's historical behavior, and the discriminator is the reward function r(s_t, A_t), which distinguishes real clicks of the user from clicks generated by the user simulator. In this embodiment, a state-action sequence trajectory of length T is adopted for training: given a user historical click sequence of length T and the corresponding user click item features f_1, f_2, …, f_T, the user decision function and the reward function are obtained by solving the min-max optimization problem of equation (12):
[Equation (12), the min-max objective of the generative adversarial training, is given as an image in the original.]
where s_t^true represents the state of the real user, a_t^true represents the item clicked by the real user, a_t represents the click item generated by the user simulator, φ represents the user decision function φ(s_t, A_t), r represents the reward function r(s_t, A_t), and E_φ is the expectation.
The user simulator is built on the basis of a generative adversarial network. The reward function r(s_t, a_t) extracts features from real user behaviors and from behaviors generated by the user simulator and trains a network that amplifies the difference between the two. In contrast, the goal of the user decision function φ(s_t, A_t) is to narrow the difference between real user behavior and the behavior generated by the user simulator, generating samples that approximate real user behavior as closely as possible. This user simulator is named MRLG-Attention; the flow of the generated data is shown in FIG. 3. The generative adversarial model can be interpreted as a game between an adversary and a learner, in which the adversary adjusts the reward function r(s_t, a_t) to minimize the learner's reward, while the learner adjusts the user decision function φ(s_t, A_t) to maximize the reward. This provides a large amount of training data for user simulator training, and the trained model has less deviation.
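A minimal PyTorch-style sketch of one alternating update of the adversarial training described above, assuming two hypothetical modules reward_net(states, items) -> per-sample rewards and decision_net(states, slates) -> logits over the slate; the module interfaces, batch layout and optimisers are all assumptions of this sketch.

import torch

def adversarial_step(reward_net, decision_net, states, real_clicks, slates,
                     reward_opt, decision_opt):
    # --- reward (discriminator) step: enlarge the gap between real clicks and
    #     clicks sampled from the current user decision function --------------
    with torch.no_grad():
        probs = torch.softmax(decision_net(states, slates), dim=-1)
        idx = torch.multinomial(probs, 1)                       # sampled slate position
    fake_clicks = torch.gather(slates, 1, idx).squeeze(-1)      # simulator click a_t
    r_loss = reward_net(states, fake_clicks).mean() - reward_net(states, real_clicks).mean()
    reward_opt.zero_grad(); r_loss.backward(); reward_opt.step()

    # --- decision (generator) step: make generated clicks earn high reward ----
    probs = torch.softmax(decision_net(states, slates), dim=-1)
    slate_rewards = torch.stack(
        [reward_net(states, slates[:, y]) for y in range(slates.shape[1])], dim=-1)
    d_loss = -(probs * slate_rewards.detach()).sum(dim=-1).mean()
    decision_opt.zero_grad(); d_loss.backward(); decision_opt.step()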
It should be noted that the recommendation algorithm based on the user simulator and reinforcement learning may be a recommendation strategy model obtained by training in a simulation environment with the policy gradient method REINFORCE, using the user simulator MRLG-Attention learned from the real environment as the simulation environment, as shown in fig. 4. Let the recommendation strategy be π_θ(s_t, a_t), where θ is the recommendation strategy function parameter. For an action-state sequence τ of length l generated by the user simulator, the return of the sequence τ is shown in equation (13):
R(τ) = Σ_{t=1}^{l} r(s_t, a_t)    (13)
If P(τ; θ) represents the probability of the occurrence of the sequence τ, then the target expected reward function is as shown in equation (14):
J(θ) = Σ_τ P(τ; θ) R(τ)    (14)
To find the optimal parameters of the objective function such that J(θ) is maximized, this embodiment solves with a gradient-ascent method as shown in equation (15):
θ ← θ + α ∇_θ J(θ)    (15)
Taking the derivative of the objective function (14) yields equation (16):
∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} Σ_{t=1}^{l} ∇_θ log π_θ(s_t^{(i)}, a_t^{(i)}) R(τ^{(i)})    (16)
where the mean over the m sequences is used to approximate the expectation of the strategy gradient.
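The policy update of equations (13)-(16) corresponds to the standard REINFORCE estimator. A minimal sketch follows, assuming trajectories collected from the simulated environment are stored as (log_probs, rewards) pairs; that data layout, and maximising J(θ) by minimising its negative, are choices of this sketch.

import torch

def reinforce_update(trajectories, optimizer):
    # trajectories: list of (log_probs, rewards) for m simulated sequences tau^(i);
    # log_probs are the log pi_theta(s_t, a_t) tensors recorded while acting.
    losses = []
    for log_probs, rewards in trajectories:
        ret = sum(rewards)                                # R(tau): return of the sequence
        losses.append(-torch.stack(log_probs).sum() * ret)
    loss = torch.stack(losses).mean()                     # mean over the m sequences, eq. (16)
    optimizer.zero_grad()
    loss.backward()                                       # gradient ascent on J(theta), eq. (15)
    optimizer.step()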
Example two
Fig. 6 is a block diagram of a recommendation policy generation apparatus based on reinforcement learning according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes:
and an obtaining module 60, configured to obtain scene information.
Specifically, in order to solve the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards, scene information needs to be set before scenario analysis and recommendation strategy generation are carried out. The scene information may comprise scene construction data for the implementation scenario in which the user is located, or a scene data set generated according to parameters preset by the user, which is used for the subsequent model generation and training operations.
Specifically, the set recommendation scenario is as follows: the recommending agent presents Y items to the user in a page; the user provides feedback by clicking on one of the items or choosing not to click on any of them, and then the agent displays a new page containing Y items.
Optionally, the apparatus further comprises: and the definition module is used for defining the recommendation scene according to the user requirement.
It should be noted that the recommendation process is mapped into a reinforcement learning framework, as shown in fig. 1; fig. 1 is a schematic diagram of the reinforcement-learning-based recommendation scenario according to an embodiment of the present invention. The environment in reinforcement learning corresponds to the online user in the recommendation algorithm. The state s_t corresponds to an ordered sequence of the user's click history. The recommending agent corresponds to the execution center of the recommendation algorithm: according to the recommendation strategy it selects an item list I_t from the candidate set, from which Y items are selected to be shown to the user, i.e. the recommendation action, and the recommendation list is denoted by A_t. When the user interacts with the page and clicks a certain item a_t in the recommendation list as feedback, the reward of this state is obtained and the next state s_{t+1} is entered. The transition probability model P predicts, according to the current state s_t and the selected action a_t, the probability of entering the next state s_{t+1}.
In addition, the state transition probability expression (1) denotes the probability of transitioning from state s_t to state s_{t+1}; the reward function expression (2) corresponds to the reward, i.e. the short-term benefit, obtained after the user gives feedback by clicking on item a_t. Because the user can only make the selection action a_t ∈ A_t from the recommendations given by the recommendation system, r(s_t, a_t) can be used instead of r(s_t, A_t, a_t), and P(· | s_t, a_t) instead of P(· | s_t, A_t, a_t).
P(s_{t+1} | s_t, A_t, a_t)    (1)
r(s_t, A_t, a_t)    (2)
The policy π corresponds to the recommendation strategy of the recommending agent: in state s_t, the agent obtains the recommendation list A_t from the candidate set I_t according to a certain strategy π.
It is also noted that an object of embodiments of the present invention is to maximize the long-term reward while ensuring recommendation accuracy. Therefore, the improved recommendation algorithm based on reinforcement learning aims to find an optimal strategy π*(s_t, I_t) that, in state s_t, selects Y items from the candidate set I_t to recommend to the user such that the expected reward is maximized. The objective function is defined as shown in equation (3):
π* = argmax_π E[ Σ_t r(s_t, a_t) ]    (3)
Among reinforcement learning methods, value-based methods have great advantages for continuous off-policy learning, but the convergence of the policy function is poor. In contrast, policy-based approaches perform well in terms of policy-function convergence. Therefore, the invention uses REINFORCE, a strategy-gradient-based reinforcement learning method with better convergence, and trains the recommendation strategy of the recommending agent with a user simulator obtained by generative adversarial network training as the simulation environment. Here E is the expectation and r(s_t, a_t) is the reward function.
And a generating module 62, configured to generate a user simulator according to the scene information.
Optionally, the generating module includes: the description unit is used for describing the user state of the scene information according to an attention mechanism to obtain the user state; the determining unit is used for determining a user decision function and a user reward function according to the user state; and the construction unit is used for constructing the user simulator according to the user decision function and the user reward function.
In particular, the user state s_t is composed of the sequence of historical items c_0, c_1, …, c_{t-1} clicked on by the user before time t, where c_* denotes a clicked item. The sequence c_0, c_1, …, c_{t-1} is converted into embedding-layer vectors {f_1, f_2, …, f_{t-1}}, and the user state is then defined as shown in equation (4):
s_t = h(f_1, f_2, …, f_{t-1})    (4)
where the vector f_τ (τ = 1, 2, …, t-1) is the embedding-layer vector of the item clicked at time τ, and h(·) is a feature embedding function whose purpose is to generate a vector of fixed length to represent the user state. Therefore, if the user state is represented by a history sequence {f_{t-m}, …, f_{t-1}} of length m, it can be written as:
s_t = h(f_{t-m}, f_{t-(m-1)}, …, f_{t-1})    (5)
If F_{t-m:t-1} denotes the user history sequence {f_{t-m}, …, f_{t-1}} of length m, the user state can be represented as:
s_t = h(F_{t-m:t-1}) := σ(F_{t-m:t-1} W + B)    (6)
where W is an m-row, n-column matrix of weighting coefficients, B is a d-row, n-column bias matrix, and σ(·) is the activation function. This embodiment is described by taking a news recommendation system as an example, considering that the user's interests and behavior state change over time. Two contextual scenarios are common in news recommendation systems. (1) If the interval between two click/browse operations of the user is long, then when describing the state s_τ at a time point τ after the two clicks, it cannot simply be assumed that the content at every position of the user's historical interaction sequence has the same influence weight on the user's decision strategy. That is, in the history sequence of length m, if the interval between the initial position t-m and the end position t-1 is long, the user behavior at time t-m has no or little influence on the user's decision at the current time t. (2) If a user becomes interested in news about "a virus" after browsing the news titled "a virus variation in uk", the user's interest may be changed by the influence of that news, and the user may then want to browse news related to "a virus". However, if every position of the history sequence before time τ has the same influence on the user's decision strategy, the finally generated recommendation items may not contain the related news the user most wants to browse. Both contextual scenarios illustrate that the feature representation (6) cannot distinguish the influence of behaviors at different sequence positions on the user's decision strategy. In order to solve this problem, a scheme for representing the user state based on an Attention Mechanism is proposed, which adjusts the influence of different positions of the historical click sequence on the user state. The degree of influence of each position on time t is determined by the attention weight coefficient a_τ:
[Equation (7), defining the attention weight coefficient a_τ, is given as an image in the original.]
where d represents the position of the currently clicked item in the state sequence.
This scheme is illustrated in FIG. 2, which shows the user state characterization scheme based on the attention mechanism according to an embodiment of the present invention. Here {w_1, w_2, …} are PWM (Position Weight Matrix) parameters, which account for the user state s_t being influenced by where in the user interaction sequence an item is located (i.e., its time of occurrence). If H_{t-m:t-1} denotes the user state given by expression (6), then the user state based on the attention mechanism can be expressed as:
[Equation (8), the attention-based user state s_t, is given as an image in the original.]
In addition, in order to determine the user decision function and reward function and thus simplify the model, the embodiment of the present invention sets the recommendation scenario as follows: the user is presented with Y items, and the user decides either to click the one item of greatest interest or to click none of the items. The user simulator is the interaction model in this recommendation scenario. In the simulator, the user's satisfaction with, or interest in, an item is measured by a reward r, and the optimization goal of the user decision strategy is to maximize the long-term reward. In the real user decision process, the items pushed to the user by the recommendation algorithm have a certain influence on changes in the user's interests. Taking a news recommendation service as an example, a certain user may not be interested in NBA news at first, but if the recommendation algorithm recommends such news to this user, the user may like it and then become interested in other NBA news. Similarly, a user may become bored after repeatedly seeing similar news. Thus, the user's satisfaction with the same item may be affected by the user's behavior history sequence. In summary, the reward function is related to the user state s_t and the user's decision behavior a_t, and the reward is therefore represented as a reward function r(s_t, a_t). The optimal user decision model φ* is the set of parameters that, in user state s_t, clicks item a_t from the item set A_t recommended by the recommending agent so as to maximize the reward function r(s_t, a_t). The user decision function can thus be expressed as:
[Equation (9), the user decision function φ*, is given as an image in the original.]
where y denotes an item number in the recommending agent's push list, Y is the total number of items, and Δ^Y is a Y-dimensional probability simplex, as shown in equation (10):
[Equation (10), the definition of the probability simplex Δ^Y, is given as an image in the original.]
Δ^Y represents the probabilities of the user clicking on each recommended item, which sum to 1. L_2(φ) is an L2 regularization function used to encourage exploration. η is the exploration rate; as an exploration-exploitation balance parameter, the larger η is, the more exploratory the user is. It is assumed that the reward of the recommender system is the same as the user utility; therefore, optimizing the accumulated reward of the recommendation system can satisfy the user's needs over the long term and improve user satisfaction. The defined reward function is determined by the user's utility after making a click decision, as shown in equation (11):
r(s_t, a_t) := reg(W[s_t, a_t] + b)    (11)
where W is the reward weight matrix, b is the corresponding bias vector, and reg(·) is the final regression function.
A simulation module 64 for generating a simulated environment from the user simulator.
Specifically, after a mature user simulator is obtained, the embodiment of the present invention outputs the simulation environment according to the characteristic values of the user simulator in the early stage of recommendation strategy generation; the simulation environment is used for generating the final recommendation strategy model.
And the recommending module 66 is used for generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
Optionally, the apparatus further comprises: and the output module is used for outputting the recommended strategy model.
Specifically, in the embodiment of the present invention, the reward and the user decision strategy in the defined user simulator are unknown and are trained from data. In the training process, the user decision function φ(s_t, A_t) models the click behavior sequence of real users; the click behavior of real users and the behavior motivation of the user simulator are both to maximize the long-term reward value. This is fitted with the generator and discriminator of a generative adversarial network, where the generator is the user decision function, which generates the user's next click behavior from the user's historical behavior, and the discriminator is the reward function r(s_t, A_t), which distinguishes real clicks of the user from clicks generated by the user simulator. In this embodiment, a state-action sequence trajectory of length T is adopted for training: given a user historical click sequence of length T and the corresponding user click item features f_1, f_2, …, f_T, the user decision function and the reward function are obtained by solving the min-max optimization problem of equation (12):
[Equation (12), the min-max objective of the generative adversarial training, is given as an image in the original.]
where s_t^true represents the state of the real user, a_t^true represents the item clicked by the real user, a_t represents the click item generated by the user simulator, φ represents the user decision function φ(s_t, A_t), r represents the reward function r(s_t, A_t), and E_φ is the expectation.
The user simulator is built on the basis of a generative adversarial network. The reward function r(s_t, a_t) extracts features from real user behaviors and from behaviors generated by the user simulator and trains a network that amplifies the difference between the two. In contrast, the goal of the user decision function φ(s_t, A_t) is to narrow the difference between real user behavior and the behavior generated by the user simulator, generating samples that approximate real user behavior as closely as possible. This user simulator is named MRLG-Attention; the flow of the generated data is shown in FIG. 3. The generative adversarial model can be interpreted as a game between an adversary and a learner, in which the adversary adjusts the reward function r(s_t, a_t) to minimize the learner's reward, while the learner adjusts the user decision function φ(s_t, A_t) to maximize the reward. This provides a large amount of training data for user simulator training, and the trained model has less deviation.
It should be noted that the recommendation algorithm based on the user simulator and reinforcement learning may be a recommendation strategy model obtained by training in a simulation environment with the policy gradient method REINFORCE, using the user simulator MRLG-Attention learned from the real environment as the simulation environment, as shown in fig. 4. Let the recommendation strategy be π_θ(s_t, a_t), where θ is the recommendation strategy function parameter. For an action-state sequence τ of length l generated by the user simulator, the return of the sequence τ is shown in equation (13):
R(τ) = Σ_{t=1}^{l} r(s_t, a_t)    (13)
If P(τ; θ) represents the probability of the occurrence of the sequence τ, then the target expected reward function is as shown in equation (14):
J(θ) = Σ_τ P(τ; θ) R(τ)    (14)
To find the optimal parameters of the objective function such that J(θ) is maximized, this embodiment solves with a gradient-ascent method as shown in equation (15):
θ ← θ + α ∇_θ J(θ)    (15)
Taking the derivative of the objective function (14) yields equation (16):
∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} Σ_{t=1}^{l} ∇_θ log π_θ(s_t^{(i)}, a_t^{(i)}) R(τ^{(i)})    (16)
where the mean over the m sequences is used to approximate the expectation of the strategy gradient.
EXAMPLE III
According to another aspect of the embodiment of the present invention, a non-volatile storage medium is further provided, where the non-volatile storage medium includes a stored program, and the program controls, when running, a device in which the non-volatile storage medium is located to execute a recommendation policy generation method based on reinforcement learning.
Specifically, the method comprises the following steps: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
Example four
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor and a memory; the memory stores computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a recommendation strategy generation method based on reinforcement learning.
Specifically, the method comprises the following steps: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
Effects of the embodiment
Compared with the prior art, when solving the related technical problems, the user simulator of the invention can adapt to changes in the user's behavior characteristics and obtains higher user-behavior prediction accuracy; the recommendation algorithm based on the user simulator obtains a higher click-through rate and long-term reward, effectively improving recommendation performance.
(1) In order to evaluate the accuracy of the user simulator MRLG-Attention in predicting user click behavior, the following widely used prediction models are selected for comparison. W & D LR: a logistic regression method used to estimate the user's click-through rate. W & D CCF: a collaborative filtering approach that considers contextual information and models the user decision process to learn user preferences. XGBoost: an ensemble learning method based on decision trees, an end-to-end decision-tree boosting algorithm improved on the basis of algorithms such as the Gradient Boosting Decision Tree (GBDT). MRLGAN: an algorithm obtained by removing the attention mechanism from the user simulator provided in this embodiment; it differs only in the user's state feature representation.
In the comparative experiments of the embodiment of the present invention, the user data is randomly divided into two parts: 80% of the user data is used as the training set and 20% as the test set. The evaluation index is Top-K precision (Precision@K, abbreviated Prec@K), the proportion of cases in which the item actually clicked by the user appears in the first K positions of the prediction list, averaged over the tested users. The results of the experiment are shown in Table 1.
TABLE 1 comparison of prediction accuracy performance of different models on different datasets
[Table 1 is given as an image in the original.]
In Table 1, the bolded data represents the best algorithm model on the given data set under the current evaluation criterion. The "Improved" row indicates the improvement rate of the proposed user simulator MRLG-Attention over the best-performing model on that data set (excluding MRLG-Attention) under the current criterion. As can be seen from the data in the table, on both data sets the user-behavior prediction accuracy of the user simulator MRLG-Attention proposed in this embodiment is higher than that of the other algorithm models. On both data sets, the prediction accuracy of all models improves as the k value increases; in particular, the accuracy of MRLG-Attention on the Yelp data set is 19.3% higher at k = 2 than at k = 1. In addition, as can be seen from Table 1, compared with MRLGAN, the MRLG-Attention algorithm increases the prediction accuracy by 0.6% at k = 1 and by 3.8% at k = 2, which indicates that the attention mechanism helps to improve prediction accuracy. Here the k value is the number of recommended items in the recommendation list displayed on the page.
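For reference, the Prec@K criterion used in Table 1 can be computed as in the sketch below; the argument layout is an assumption of the sketch.

def precision_at_k(ranked_lists, clicked_items, k):
    # Fraction of test interactions whose actually clicked item appears in the
    # first k positions of the predicted ranking, averaged over the test users.
    hits = sum(1 for ranking, clicked in zip(ranked_lists, clicked_items)
               if clicked in ranking[:k])
    return hits / max(len(ranked_lists), 1)

# toy usage
print(precision_at_k([[3, 1, 2], [5, 4, 6]], [1, 6], k=2))   # -> 0.5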
(2) In order to verify the recommendation accuracy of the reinforcement-learning-based recommendation algorithm MRLG Rec, historical behavior data of 2000 users were collected from a supported information platform project. The data were randomly divided into two non-overlapping data sets: one data set is used to train the user simulator, and the other is used to train the reinforcement learning strategy. The deep learning models W & D LR and W & D CCF and the off-policy model-free reinforcement learning model DQN are selected as comparison algorithms.
Two performance evaluation criteria selected by the embodiment of the invention are as follows:
(1) Cumulative Reward (CR): the exploratory nature of reinforcement learning enables the recommendation algorithm MRLG Rec designed by the invention to consider long-term benefit, so the cumulative reward of the algorithm should be higher than that of the non-reinforcement-learning algorithms W & D LR and W & D CCF. For each recommendation action in the recommendation sequence, the user's reward can be calculated by the reward function of the user simulator. Since the users' rewards are not calculated when training the reinforcement-learning-based recommendation strategy, this section uses the average of the cumulative rewards of all users as the CR value.
(2) CTR: the number of times an item is clicked on by the user is divided by the number of times the item is presented in the recommendation list. Since the behavior of the user is uncertain, 10 repeated experiments were performed for each recommended strategy, and the results of the performance comparison are shown in table 2.
TABLE 2 Recommendation performance comparison
[Table 2 is given as an image in the original.]
In Table 2, the k value represents the number of recommended items in the recommendation list presented on the page. It can be seen from the table that as the k value increases, the performance indexes of all algorithms in the comparative experiment improve. At k = 3, the cumulative reward of the proposed algorithm increases by 66.78% and 20.67% relative to the deep learning algorithms W & D LR and W & D CCF respectively, and by 13.34% relative to the model-free reinforcement learning algorithm DQN; at k = 5, the increases relative to the three algorithms are 67.06%, 21.11% and 9.45% respectively. The results show that the long-term benefit of the recommendation algorithm MRLG Rec, which adopts the reinforcement learning method, is higher than that of the deep learning algorithms W & D LR and W & D CCF, indicating that better long-term benefit can be obtained with reinforcement learning; and, because a dynamic sequence model is adopted, the long-term benefit of the model-based reinforcement learning method MRLG Rec is higher than that of the model-free reinforcement learning method DQN. Although the training goal of the proposed recommendation algorithm MRLG Rec is to maximize the cumulative reward, it also obtains a relatively high click-through rate compared with the other three algorithms, improving the click-through rate by 19.35%, 21.88% and 11.43% respectively. This benefits from the learning of the reward function by the model-based reinforcement learning method, which, while pursuing maximization of the cumulative reward, also considers the consistency between the click distribution of the user simulator and that of real users.
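The two evaluation criteria above (CR and CTR) can be computed with the simple sketch below; treating CR as the per-user sum of simulator rewards averaged over users follows the description in this section, while the exact data layout is an assumption.

def cumulative_reward(per_user_rewards):
    # CR: average over users of the summed per-step rewards r(s_t, a_t)
    # returned by the user simulator's reward function.
    return sum(sum(seq) for seq in per_user_rewards) / max(len(per_user_rewards), 1)

def click_through_rate(clicks, impressions):
    # CTR: number of clicks on an item divided by the number of times the item
    # was presented in the recommendation list.
    return clicks / impressions if impressions else 0.0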
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A recommendation strategy generation method based on reinforcement learning is characterized by comprising the following steps:
acquiring scene information;
generating a user simulator according to the scene information;
generating a simulation environment according to the user simulator;
and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
2. The method of claim 1, wherein prior to said obtaining context information, the method further comprises:
and defining a recommendation scene according to the user requirement.
3. The method of claim 1, wherein generating a user simulator based on the context information comprises:
describing the user state of the scene information according to an attention mechanism to obtain the user state;
determining a user decision function and a user reward function according to the user state;
and constructing the user simulator according to the user decision function and the user reward function.
4. The method of claim 1, wherein after said generating a recommended policy model using a policy gradient algorithm through said simulation environment, said method further comprises:
and outputting the recommended strategy model.
5. A recommendation policy generation device based on reinforcement learning, comprising:
the acquisition module is used for acquiring scene information;
the generating module is used for generating a user simulator according to the scene information;
the simulation module is used for generating a simulation environment according to the user simulator;
and the recommendation module is used for generating a recommendation strategy model by adopting a strategy gradient algorithm through the simulation environment.
6. The apparatus of claim 5, further comprising:
and the definition module is used for defining the recommendation scene according to the user requirement.
7. The apparatus of claim 5, wherein the generating module comprises:
the description unit is used for describing the user state of the scene information according to an attention mechanism to obtain the user state;
the determining unit is used for determining a user decision function and a user reward function according to the user state;
and the construction unit is used for constructing the user simulator according to the user decision function and the user reward function.
8. The apparatus of claim 5, further comprising:
and the output module is used for outputting the recommended strategy model.
9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.
10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.
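For illustration only, the following is a minimal sketch of how the method of claims 1 and 3 could be realized: a user simulator with an attention-based user state, a user decision function and a user reward function, used as a simulation environment in which a softmax policy is trained with a REINFORCE-style policy gradient. All class and function names, the linear attention, the sigmoid click model and the one-step episodes are simplifying assumptions of this sketch, not features disclosed by the invention.

```python
import numpy as np

class UserSimulator:
    """Toy user simulator: attention-based user state, decision function, reward function."""

    def __init__(self, n_items: int, dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.item_vecs = rng.normal(size=(n_items, dim))   # item embeddings
        self.pref = rng.normal(size=dim)                   # hidden user preference
        self.history = [int(rng.integers(n_items))]        # interaction history

    def state(self) -> np.ndarray:
        """Attention-weighted summary of the interaction history (the user state)."""
        h = self.item_vecs[self.history]                   # (t, dim)
        scores = h @ self.pref
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ h                                       # (dim,)

    def decide(self, item: int) -> int:
        """User decision function: click with probability sigmoid(item . preference)."""
        p = 1.0 / (1.0 + np.exp(-self.item_vecs[item] @ self.pref))
        clicked = int(np.random.random() < p)
        if clicked:
            self.history.append(item)
        return clicked

    def reward(self, clicked: int) -> float:
        """User reward function: 1 for a click, 0 otherwise."""
        return float(clicked)


def train_policy(n_items: int = 20, dim: int = 8, episodes: int = 500, lr: float = 0.05):
    """REINFORCE-style policy gradient trained entirely inside the simulated environment."""
    theta = np.zeros((dim, n_items))              # linear softmax policy parameters
    for _ in range(episodes):
        sim = UserSimulator(n_items, dim)         # simulation environment built from the user simulator
        s = sim.state()
        logits = s @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = int(np.random.choice(n_items, p=probs))
        r = sim.reward(sim.decide(a))             # reward returned by the simulated user
        grad = -np.outer(s, probs)                # d log pi(a|s) / d theta, softmax term
        grad[:, a] += s                           # chosen-action term
        theta += lr * r * grad                    # ascend the policy gradient
    return theta


policy_params = train_policy()
```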
CN202110726927.3A 2021-06-29 2021-06-29 Recommendation strategy generation method and device based on reinforcement learning Pending CN113688306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110726927.3A CN113688306A (en) 2021-06-29 2021-06-29 Recommendation strategy generation method and device based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN113688306A true CN113688306A (en) 2021-11-23

Family

ID=78576492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110726927.3A Pending CN113688306A (en) 2021-06-29 2021-06-29 Recommendation strategy generation method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113688306A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104595A (en) * 2019-12-16 2020-05-05 华中科技大学 Deep reinforcement learning interactive recommendation method and system based on text information
CN112149824A (en) * 2020-09-15 2020-12-29 支付宝(杭州)信息技术有限公司 Method and device for updating recommendation model by game theory
CN112597392A (en) * 2020-12-25 2021-04-02 厦门大学 Recommendation system based on dynamic attention and hierarchical reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU Ziwen: "A reinforcement learning based recommendation algorithm fusing users' long- and short-term preferences", Modern Computer (现代计算机), no. 06, page 31 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091174A (en) * 2023-04-07 2023-05-09 湖南工商大学 Recommendation policy optimization system, method and device and related equipment
CN116823408A (en) * 2023-08-29 2023-09-29 小舟科技有限公司 Commodity recommendation method, device, terminal and storage medium based on virtual reality
CN116823408B (en) * 2023-08-29 2023-12-01 小舟科技有限公司 Commodity recommendation method, device, terminal and storage medium based on virtual reality

Similar Documents

Publication Publication Date Title
CN108648049B (en) Sequence recommendation method based on user behavior difference modeling
Sun et al. Learning multiple-question decision trees for cold-start recommendation
Pandey et al. Bandits for taxonomies: A model-based approach
CN110046304A (en) A kind of user's recommended method and device
Salehi Application of implicit and explicit attribute based collaborative filtering and BIDE for learning resource recommendation
EP4181026A1 (en) Recommendation model training method and apparatus, recommendation method and apparatus, and computer-readable medium
CN109961142B (en) Neural network optimization method and device based on meta learning
CN110781409B (en) Article recommendation method based on collaborative filtering
Liu et al. Balancing between accuracy and fairness for interactive recommendation with reinforcement learning
CN103678518A (en) Method and device for adjusting recommendation lists
CN111242310A (en) Feature validity evaluation method and device, electronic equipment and storage medium
WO2012119245A1 (en) System and method for identifying and ranking user preferences
CN113688306A (en) Recommendation strategy generation method and device based on reinforcement learning
CN111949886A (en) Sample data generation method and related device for information recommendation
CN112528164A (en) User collaborative filtering recall method and device
CN112053188A (en) Internet advertisement recommendation method based on hybrid deep neural network model
Han et al. Optimizing ranking algorithm in recommender system via deep reinforcement learning
CN115600009A (en) Deep reinforcement learning-based recommendation method considering future preference of user
CN110347916A (en) Cross-scenario item recommendation method, device, electronic equipment and storage medium
CN115345635A (en) Processing method and device for recommended content, computer equipment and storage medium
Huang et al. C-3PO: Click-sequence-aware deeP neural network (DNN)-based Pop-uPs recOmmendation: I know you'll click
Zhang Design and Implementation of Sports Industry Information Service Management System Based on Data Mining
CN114298118B (en) Data processing method based on deep learning, related equipment and storage medium
Shinde et al. Scenario analysis of technology products with an agent-based simulation and data mining framework
Lei et al. Advertising click-through rate prediction model based on an attention mechanism and a neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant after: China Telecom Digital Intelligence Technology Co.,Ltd.

Address before: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant before: CHINA TELECOM GROUP SYSTEM INTEGRATION Co.,Ltd.
