CN113688306A - Recommendation strategy generation method and device based on reinforcement learning - Google Patents

Recommendation strategy generation method and device based on reinforcement learning

Info

Publication number
CN113688306A
Authority
CN
China
Prior art keywords
user
recommendation
generating
simulator
strategy
Prior art date
Legal status
Pending
Application number
CN202110726927.3A
Other languages
Chinese (zh)
Inventor
李成钢
黄莹
李忠
李金岭
杜忠田
王彦君
夏海轮
张碧昭
余清华
卜理超
张天正
李凤文
袁福碧
Current Assignee
China Telecom Group System Integration Co Ltd
Original Assignee
China Telecom Group System Integration Co Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Group System Integration Co Ltd filed Critical China Telecom Group System Integration Co Ltd
Priority to CN202110726927.3A
Publication of CN113688306A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention discloses a recommendation strategy generation method and device based on reinforcement learning, belonging to the field of intelligent recommendation. The method comprises the following steps: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommendation strategy model through the simulation environment by adopting a strategy gradient algorithm. The invention solves the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards; it achieves the technical effect of efficiently, reasonably and accurately mining the user's hidden preferences while satisfying the user's current interests.

Description

Recommendation strategy generation method and device based on reinforcement learning
Technical Field
The invention belongs to the field of intelligent recommendation, and particularly relates to a recommendation strategy generation method and device based on reinforcement learning.
Background
With the continuous development of intelligent technology, people increasingly use intelligent devices in daily life, work and study; intelligent technologies have improved people's quality of life and increased the efficiency of study and work.
At present, most recommendation algorithms are designed and trained on the basis of users' historical data: after the users' interests in different items are determined, personalized recommendation is carried out. In such recommendation algorithms, the user's interests are inferred from the collected historical data, and it is assumed that they remain unchanged for a certain time. However, for most recommendation systems, such as music and movie recommendation systems, the users' interests change constantly, even under the influence of the content recommended by the system itself. Thus, current recommendation systems face two challenges: (1) the user's interests are not constant and change over time, so a recommendation algorithm must consider the user's short-term interests while also mining the user's potential interests in order to obtain higher long-term benefit; (2) current recommendation algorithms tend to continually recommend similar items to a user, which may reduce the user's interest in similar topics and thereby reduce satisfaction with the overall recommendation service. These two challenges constitute the Exploration and Exploitation (EE) problem in recommendation systems, i.e. how to mine the user's hidden preferences while satisfying the user's current interests.
Since the recommendation problem can be converted into a sequential decision problem and the three elements of reinforcement learning (state, action and reward) can be defined, a reinforcement learning framework can be applied to the recommendation algorithm to solve the above problem in the recommendation scenario. Some existing reinforcement learning methods add randomness to the decision of recommending new items in order to solve the EE problem, for example the simple ε-greedy policy and the Upper Confidence Bound (UCB) policy based on the multi-armed bandit algorithm. However, both policies damage the recommendation effect in the short term: the ε-greedy policy may recommend items the user is not interested in at all in the early stage of recommendation, and the UCB algorithm needs a large number of attempts to obtain relatively accurate item rewards.
Disclosure of Invention
The invention provides a recommendation strategy generation method and device based on reinforcement learning, which solve the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards; the invention achieves the technical effect of efficiently, reasonably and accurately mining the user's hidden preferences while satisfying the user's current interests.
In one aspect of the present invention, a recommendation policy generation method based on reinforcement learning is provided, including: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
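The four steps above can be pictured, purely for illustration, as the minimal Python sketch below; every function and class name is a hypothetical placeholder introduced for this sketch and is not part of the claimed method.

from dataclasses import dataclass
from typing import List

@dataclass
class SceneInfo:
    # illustrative container for the "scene information" of step 1
    click_logs: List[List[int]]   # per-user historical click sequences
    num_items: int                # size of the candidate item set
    slate_size: int               # Y: number of items shown per page

def acquire_scene_information(raw_logs) -> SceneInfo:
    ...  # step 1: collect or construct the recommendation-scenario data

def build_user_simulator(scene: SceneInfo):
    ...  # step 2: fit the user simulator from the scene information

def build_simulation_environment(simulator):
    ...  # step 3: wrap the simulator as an interactive environment

def train_policy_with_policy_gradient(env):
    ...  # step 4: obtain the recommendation strategy model via a strategy gradient algorithm

def generate_recommendation_policy(raw_logs):
    scene = acquire_scene_information(raw_logs)
    simulator = build_user_simulator(scene)
    env = build_simulation_environment(simulator)
    return train_policy_with_policy_gradient(env)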
Further, before the acquiring the scene information, the method further includes: and defining a recommendation scene according to the user requirement.
Further, the generating a user simulator according to the scene information includes: describing the user state of the scene information according to an attention mechanism to obtain the user state; determining a user decision function and a user reward function according to the user state; and constructing the user simulator according to the user decision function and the user reward function.
Further, after the generating a recommended policy model by a policy gradient algorithm through the simulation environment, the method further comprises: and outputting the recommended strategy model.
In another aspect of the present invention, a recommendation policy generation apparatus based on reinforcement learning is further provided, including: the acquisition module is used for acquiring scene information; the generating module is used for generating a user simulator according to the scene information; the simulation module is used for generating a simulation environment according to the user simulator; and the recommendation module is used for generating a recommendation strategy model by adopting a strategy gradient algorithm through the simulation environment.
Further, the apparatus further comprises: and the definition module is used for defining the recommendation scene according to the user requirement.
Further, the generating module includes: the description unit is used for describing the user state of the scene information according to an attention mechanism to obtain the user state; the determining unit is used for determining a user decision function and a user reward function according to the user state; and the construction unit is used for constructing the user simulator according to the user decision function and the user reward function.
Further, the apparatus further comprises: and the output module is used for outputting the recommended strategy model.
In another aspect of the present invention, a non-volatile storage medium is further provided, where the non-volatile storage medium includes a stored program, and the program controls, when running, a device in which the non-volatile storage medium is located to execute a recommendation policy generation method based on reinforcement learning.
In another aspect of the present invention, an electronic device is further provided, which includes a processor and a memory; the memory stores computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a recommendation strategy generation method based on reinforcement learning.
Compared with the prior art, the invention has the beneficial effects that:
The invention fully extracts the user's state features by adopting an attention mechanism, so that changes in the user's interests can be captured more deeply and accurately; the decision process of the recommendation scenario is then modeled as a user simulator, and, in order to reduce the deviation between the user simulator and the real user decision process, the simulator is trained with the min-max principle of a generative adversarial network so as to fit the distribution of real user decision behaviors; finally, the obtained user simulator is used as a simulation environment, and a recommendation strategy is obtained with a reinforcement-learning strategy gradient method. This solves the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards; high user-behavior prediction accuracy can be obtained, the recommendation performance is effectively improved, and the technical effect of efficiently, reasonably and accurately mining the user's hidden preferences while satisfying the user's current interests is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the reinforcement-learning-based recommendation scenario according to an embodiment of the present invention;
FIG. 2 is a user state characterization scheme based on an attention mechanism according to an embodiment of the present invention;
FIG. 3 is a data generation flow of the user simulator according to an embodiment of the present invention;
FIG. 4 is a recommendation algorithm framework based on reinforcement learning according to an embodiment of the present invention;
FIG. 5 is a flowchart of a recommendation strategy generation method based on reinforcement learning according to an embodiment of the present invention;
fig. 6 is a block diagram of a recommendation policy generation apparatus based on reinforcement learning according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a method embodiment of a reinforcement learning based recommendation policy generation method, it should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Example one
Fig. 5 is a flowchart of a recommendation strategy generation method based on reinforcement learning according to an embodiment of the present invention, as shown in fig. 5, the method includes the following steps:
step S502, scene information is acquired.
In order to solve the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards, scene information needs to be set before scenario analysis and recommendation strategy generation are carried out. The scene information may comprise scene construction data for the implementation scenario in which the user is located, or a scene data set generated according to parameters preset by the user, which is used for the subsequent model generation and training operations.
Specifically, the set recommendation scenario is as follows: the recommending agent presents the user with Y items in the page, the user provides feedback by clicking on one of the items or choosing not to click on any of the items, and then the agent displays a new page containing Y items.
Optionally, before the acquiring the scene information, the method further includes: and defining a recommendation scene according to the user requirement.
It should be noted that the recommendation process is mapped into a reinforcement learning framework, as shown in fig. 1; fig. 1 is a schematic diagram of the reinforcement-learning-based recommendation scenario according to an embodiment of the present invention. The environment in reinforcement learning corresponds to the online user in the recommendation algorithm. The state s_t corresponds to an ordered sequence of the user's click history. The recommending agent corresponds to the execution center of the recommendation algorithm: according to the recommendation strategy it selects an item list I_t from the candidate set, from which Y items are selected to be shown to the user, i.e. the recommendation action, and the recommendation list is denoted by A_t. When the user interacts with the page and clicks a certain item a_t in the recommendation list as feedback, the reward of this state is obtained and the next state s_{t+1} is entered. The transition probability model P predicts, according to the current state s_t and the selected action a_t, the probability of entering the next state s_{t+1}.
In addition, the state transition probability expression (1) denotes the probability of transitioning from state s_t to state s_{t+1}; the reward function expression (2) corresponds to the reward, i.e. the short-term benefit, obtained after the user gives feedback by clicking on item a_t. Because the user can only make the selection action a_t ∈ A_t from the recommendations given by the recommendation system, r(s_t, a_t) can be used instead of r(s_t, A_t, a_t), and P(· | s_t, a_t) instead of P(· | s_t, A_t, a_t).
P(s_{t+1} | s_t, A_t, a_t)    (1)
r(s_t, A_t, a_t)    (2)
The policy π corresponds to the recommendation strategy of the recommending agent: in state s_t, the agent obtains the recommendation list A_t from the candidate set I_t according to a certain strategy π.
It is also noted that an object of embodiments of the present invention is to maximize the long-term reward while ensuring recommendation accuracy. Therefore, the improved recommendation algorithm based on reinforcement learning of this embodiment aims to find an optimal strategy π*(s_t, I_t) that, in state s_t, selects Y items from the candidate set I_t to recommend to the user such that the expected reward is maximized. The objective function is defined as shown in equation (3):
π* = argmax_π E[ Σ_t r(s_t, a_t) ]    (3)
Among reinforcement learning methods, value-based methods have great advantages for continuous off-policy learning, but the convergence of the policy function is poor. In contrast, policy-based approaches perform well in terms of policy-function convergence. Therefore, this embodiment trains the recommendation strategy of the recommending agent with REINFORCE, a strategy-gradient-based reinforcement learning method with better convergence, using a user simulator obtained by generative adversarial network training as the simulation environment. Here E is the expectation and r(s_t, a_t) is the reward function.
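As a concrete picture of this mapping, the interaction loop of fig. 1 can be sketched as a simulated environment in Python; the class name, the react() interface of the user simulator and the gym-style reset()/step() methods are assumptions made only for this sketch.

class RecommendationEnv:
    # Illustrative environment: the state s_t is the ordered click history,
    # an action A_t is a slate of Y item ids, and the user simulator returns
    # the clicked item a_t (or None) together with the reward r(s_t, a_t).
    def __init__(self, user_simulator, slate_size):
        self.sim = user_simulator
        self.Y = slate_size
        self.state = []          # s_t: ordered sequence of clicked items

    def reset(self):
        self.state = []
        return tuple(self.state)

    def step(self, slate):
        assert len(slate) == self.Y
        clicked, reward = self.sim.react(tuple(self.state), tuple(slate))
        if clicked is not None:  # the user's feedback a_t produces s_{t+1}
            self.state = self.state + [clicked]
        done = False             # episode termination is left to the caller
        return tuple(self.state), reward, done, {}

A recommendation strategy can then be trained against this environment in the same way as against any other episodic reinforcement-learning environment.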
And step S504, generating a user simulator according to the scene information.
Optionally, the generating a user simulator according to the scene information includes: describing the user state of the scene information according to an attention mechanism to obtain the user state; determining a user decision function and a user reward function according to the user state; and constructing the user simulator according to the user decision function and the user reward function.
In particular, the user state s_t is composed of the sequence of historical items c_0, c_1, …, c_{t-1} clicked on by the user before time t, where c_* denotes a clicked item. The sequence c_0, c_1, …, c_{t-1} is converted into embedding-layer vectors {f_1, f_2, …, f_{t-1}}, and the user state is then defined as shown in equation (4):
s_t = h(f_1, f_2, …, f_{t-1})    (4)
where the vector f_τ (τ = 1, 2, …, t-1) is the embedding-layer vector of the item clicked at time τ, and h(·) is a feature embedding function whose purpose is to generate a vector of fixed length to represent the user state. Therefore, if the user state is represented by a history sequence {f_{t-m}, …, f_{t-1}} of length m, it can be written as:
s_t = h(f_{t-m}, f_{t-(m-1)}, …, f_{t-1})    (5)
If F_{t-m:t-1} denotes the user history sequence {f_{t-m}, …, f_{t-1}} of length m, the user state can be represented as:
s_t = h(F_{t-m:t-1}) := σ(F_{t-m:t-1} W + B)    (6)
where W is an m-row, n-column matrix of weighting coefficients, B is a d-row, n-column bias matrix, and σ(·) is the activation function.
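A direct numerical reading of equation (6), with the dimensions stated above (F as a d x m matrix of stacked click embeddings, W an m x n weight matrix, B a d x n bias); taking σ as a sigmoid is an assumption of this sketch, since the text only calls it "the activation function".

import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))      # assumed sigmoid activation

def user_state(F, W, B):
    # F: d x m matrix whose columns are the click embeddings f_{t-m}, ..., f_{t-1}
    # W: m x n position-mixing weights, B: d x n bias  ->  s_t has shape d x n
    return sigma(F @ W + B)

# toy usage with made-up sizes
d, m, n = 8, 5, 4
rng = np.random.default_rng(0)
s_t = user_state(rng.normal(size=(d, m)), rng.normal(size=(m, n)), np.zeros((d, n)))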
This embodiment is described by taking a news recommendation system as an example, considering that the user's interests and behavior state change over time. Two contextual scenarios are common in news recommendation systems. (1) If the interval between two click/browse operations of the user is long, then when describing the state s_τ at a time point τ after the two clicks, it cannot simply be assumed that the content at every position of the user's historical interaction sequence has the same influence weight on the user's decision strategy. That is, in the history sequence of length m, if the interval between the initial position t-m and the end position t-1 is long, the user behavior at time t-m has no or little influence on the user's decision at the current time t. (2) If a user becomes interested in news about "a virus" after browsing the news titled "a virus variation in uk", the user's interest may be changed by the influence of that news, and the user may then want to browse news related to "a virus". However, if every position of the history sequence before time τ has the same influence on the user's decision strategy, the finally generated recommendation items may not contain the related news the user most wants to browse. Both contextual scenarios illustrate that the feature representation (6) cannot distinguish the influence of behaviors at different sequence positions on the user's decision strategy. In order to solve this problem, a scheme for representing the user state based on an Attention Mechanism is proposed, which adjusts the influence of different positions of the historical click sequence on the user state. The degree of influence of each position on time t is determined by the attention weight coefficient a_τ:
[Equation (7), defining the attention weight coefficient a_τ, is given as an image in the original.]
where d represents the position of the currently clicked item in the state sequence.
This scheme is illustrated in FIG. 2, which shows the user state characterization scheme based on the attention mechanism according to an embodiment of the present invention. Here {w_1, w_2, …} are PWM (Position Weight Matrix) parameters, which account for the user state s_t being influenced by where in the user interaction sequence an item is located (i.e., its time of occurrence). If H_{t-m:t-1} denotes the user state given by expression (6), then the user state based on the attention mechanism can be expressed as:
[Equation (8), the attention-based user state s_t, is given as an image in the original.]
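Since equations (7) and (8) are reproduced only as images, the following is one plausible instantiation of the idea described above: re-weight each history position by an attention coefficient a_τ before applying the mapping of equation (6). The softmax normalisation and the way the position scores are obtained are assumptions of this sketch, not the original formulas.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_user_state(F, W, B, position_scores):
    # F: d x m click-embedding matrix; position_scores: length-m vector playing
    # the role of the position-weight (PWM) parameters w_1, w_2, ...
    a = softmax(position_scores)              # a_tau: influence of each history position
    F_weighted = F * a[np.newaxis, :]         # more relevant positions weigh more
    return 1.0 / (1.0 + np.exp(-(F_weighted @ W + B)))   # reuse the form of equation (6)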
In addition, in order to determine the user decision function and reward function and thus simplify the model, the embodiment of the present invention sets the recommendation scenario as follows: the user is presented with Y items, and the user decides either to click the one item of greatest interest or to click none of the items. The user simulator is the interaction model in this recommendation scenario. In the simulator, the user's satisfaction with, or interest in, an item is measured by a reward r, and the optimization goal of the user decision strategy is to maximize the long-term reward. In the real user decision process, the items pushed to the user by the recommendation algorithm have a certain influence on changes in the user's interests. Taking a news recommendation service as an example, a certain user may not be interested in NBA news at first, but if the recommendation algorithm recommends such news to this user, the user may like it and then become interested in other NBA news. Similarly, a user may become bored after repeatedly seeing similar news. Thus, the user's satisfaction with the same item may be affected by the user's behavior history sequence. In summary, the reward function is related to the user state s_t and the user's decision behavior a_t, and the reward is therefore represented as a reward function r(s_t, a_t). The optimal user decision model φ* is the set of parameters that, in user state s_t, clicks item a_t from the item set A_t recommended by the recommending agent so as to maximize the reward function r(s_t, a_t). The user decision function can thus be expressed as:
[Equation (9), the user decision function φ*, is given as an image in the original.]
where y denotes an item number in the recommending agent's push list, Y is the total number of items, and Δ^Y is a Y-dimensional probability simplex, as shown in equation (10):
[Equation (10), the definition of the probability simplex Δ^Y, is given as an image in the original.]
Δ^Y represents the probabilities of the user clicking on each recommended item, which sum to 1. L_2(φ) is an L2 regularization function used to encourage exploration. η is the exploration rate; as an exploration-exploitation balance parameter, the larger η is, the more exploratory the user is. It is assumed that the reward of the recommender system is the same as the user utility; therefore, optimizing the accumulated reward of the recommendation system can satisfy the user's needs over the long term and improve user satisfaction. The defined reward function is determined by the user's utility after making a click decision, as shown in equation (11):
r(s_t, a_t) := reg(W[s_t, a_t] + b)    (11)
where W is the reward weight matrix, b is the corresponding bias vector, and reg(·) is the final regression function.
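Equation (11) can be read directly as a regression head on the concatenated state/item features; the decision function over the Y displayed items is sketched here as a temperature softmax controlled by the exploration rate η, which is an assumption standing in for equations (9)-(10) (given only as images in the original).

import numpy as np

def reward(s, a, W, b):
    # Equation (11): r(s_t, a_t) := reg(W [s_t, a_t] + b), with [s_t, a_t] read as
    # concatenation and reg(.) as a linear regression head; to keep the sketch
    # scalar, W is used here as a weight vector rather than a full matrix.
    x = np.concatenate([s, a])
    return float(W @ x + b)

def click_distribution(s, slate_features, W, b, eta=1.0):
    # Sketch of phi(s_t, A_t): a probability vector over the Y displayed items
    # that favours high-reward items; a larger eta flattens the distribution,
    # i.e. the user behaves more exploratorily.
    scores = np.array([reward(s, a, W, b) for a in slate_features]) / max(eta, 1e-8)
    e = np.exp(scores - scores.max())
    return e / e.sum()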
Step S506, generating a simulation environment according to the user simulator.
Specifically, after a mature user simulator is obtained, the embodiment of the present invention outputs the simulation environment according to the characteristic values of the user simulator in the early stage of recommendation strategy generation; the simulation environment is used for generating the final recommendation strategy model.
And step S508, generating a recommendation strategy model by adopting a strategy gradient algorithm through the simulation environment.
Optionally, after generating the recommended policy model by using a policy gradient algorithm through the simulation environment, the method further includes: and outputting the recommended strategy model.
Specifically, in the embodiment of the present invention, the reward and the user decision strategy in the defined user simulator are unknown and are trained from data. In the training process, the user decision function φ(s_t, A_t) models the click behavior sequence of real users; the click behavior of real users and the behavior motivation of the user simulator are both to maximize the long-term reward value. This is fitted with the generator and discriminator of a generative adversarial network, where the generator is the user decision function, which generates the user's next click behavior from the user's historical behavior, and the discriminator is the reward function r(s_t, A_t), which distinguishes real clicks of the user from clicks generated by the user simulator. In this embodiment, a state-action sequence trajectory of length T is adopted for training: given a user historical click sequence of length T and the corresponding user click item features f_1, f_2, …, f_T, the user decision function and the reward function are obtained by solving the min-max optimization problem of equation (12):
[Equation (12), the min-max objective of the generative adversarial training, is given as an image in the original.]
where s_t^true represents the state of the real user, a_t^true represents the item clicked by the real user, a_t represents the click item generated by the user simulator, φ represents the user decision function φ(s_t, A_t), r represents the reward function r(s_t, A_t), and E_φ is the expectation.
The user simulator is built on the basis of a generative adversarial network. The reward function r(s_t, a_t) extracts features from real user behaviors and from behaviors generated by the user simulator and trains a network that amplifies the difference between the two. In contrast, the goal of the user decision function φ(s_t, A_t) is to narrow the difference between real user behavior and the behavior generated by the user simulator, generating samples that approximate real user behavior as closely as possible. This user simulator is named MRLG-Attention; the flow of the generated data is shown in FIG. 3. The generative adversarial model can be interpreted as a game between an adversary and a learner, in which the adversary adjusts the reward function r(s_t, a_t) to minimize the learner's reward, while the learner adjusts the user decision function φ(s_t, A_t) to maximize the reward. This provides a large amount of training data for user simulator training, and the trained model has less deviation.
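A minimal PyTorch-style sketch of one alternating update of the adversarial training described above, assuming two hypothetical modules reward_net(states, items) -> per-sample rewards and decision_net(states, slates) -> logits over the slate; the module interfaces, batch layout and optimisers are all assumptions of this sketch.

import torch

def adversarial_step(reward_net, decision_net, states, real_clicks, slates,
                     reward_opt, decision_opt):
    # --- reward (discriminator) step: enlarge the gap between real clicks and
    #     clicks sampled from the current user decision function --------------
    with torch.no_grad():
        probs = torch.softmax(decision_net(states, slates), dim=-1)
        idx = torch.multinomial(probs, 1)                       # sampled slate position
    fake_clicks = torch.gather(slates, 1, idx).squeeze(-1)      # simulator click a_t
    r_loss = reward_net(states, fake_clicks).mean() - reward_net(states, real_clicks).mean()
    reward_opt.zero_grad(); r_loss.backward(); reward_opt.step()

    # --- decision (generator) step: make generated clicks earn high reward ----
    probs = torch.softmax(decision_net(states, slates), dim=-1)
    slate_rewards = torch.stack(
        [reward_net(states, slates[:, y]) for y in range(slates.shape[1])], dim=-1)
    d_loss = -(probs * slate_rewards.detach()).sum(dim=-1).mean()
    decision_opt.zero_grad(); d_loss.backward(); decision_opt.step()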
It should be noted that the recommendation algorithm based on the user simulator and reinforcement learning may be a recommendation strategy model obtained by training in a simulation environment with the policy gradient method REINFORCE, using the user simulator MRLG-Attention learned from the real environment as the simulation environment, as shown in fig. 4. Let the recommendation strategy be π_θ(s_t, a_t), where θ is the recommendation strategy function parameter. For an action-state sequence τ of length l generated by the user simulator, the return of the sequence τ is shown in equation (13):
R(τ) = Σ_{t=1}^{l} r(s_t, a_t)    (13)
If P(τ; θ) represents the probability of the occurrence of the sequence τ, then the target expected reward function is as shown in equation (14):
J(θ) = Σ_τ P(τ; θ) R(τ)    (14)
To find the optimal parameters of the objective function such that J(θ) is maximized, this embodiment solves with a gradient-ascent method as shown in equation (15):
θ ← θ + α ∇_θ J(θ)    (15)
Taking the derivative of the objective function (14) yields equation (16):
∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} Σ_{t=1}^{l} ∇_θ log π_θ(s_t^{(i)}, a_t^{(i)}) R(τ^{(i)})    (16)
where the mean over the m sequences is used to approximate the expectation of the strategy gradient.
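The policy update of equations (13)-(16) corresponds to the standard REINFORCE estimator. A minimal sketch follows, assuming trajectories collected from the simulated environment are stored as (log_probs, rewards) pairs; that data layout, and maximising J(θ) by minimising its negative, are choices of this sketch.

import torch

def reinforce_update(trajectories, optimizer):
    # trajectories: list of (log_probs, rewards) for m simulated sequences tau^(i);
    # log_probs are the log pi_theta(s_t, a_t) tensors recorded while acting.
    losses = []
    for log_probs, rewards in trajectories:
        ret = sum(rewards)                                # R(tau): return of the sequence
        losses.append(-torch.stack(log_probs).sum() * ret)
    loss = torch.stack(losses).mean()                     # mean over the m sequences, eq. (16)
    optimizer.zero_grad()
    loss.backward()                                       # gradient ascent on J(theta), eq. (15)
    optimizer.step()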
Example two
Fig. 6 is a block diagram of a recommendation policy generation apparatus based on reinforcement learning according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes:
and an obtaining module 60, configured to obtain scene information.
Specifically, in order to solve the technical problems of existing recommendation methods, namely that the recommendation effect is harmed in the short term, items the user is not interested in at all are recommended in the early stage of recommendation, and a large number of attempts are needed to obtain relatively accurate item rewards, scene information needs to be set before scenario analysis and recommendation strategy generation are carried out. The scene information may comprise scene construction data for the implementation scenario in which the user is located, or a scene data set generated according to parameters preset by the user, which is used for the subsequent model generation and training operations.
Specifically, the set recommendation scenario is as follows: the recommending agent presents Y items to the user in a page; the user provides feedback by clicking on one of the items or choosing not to click on any of them, and then the agent displays a new page containing Y items.
Optionally, the apparatus further comprises: and the definition module is used for defining the recommendation scene according to the user requirement.
It should be noted that the recommendation process is mapped into a reinforcement learning framework, as shown in fig. 1; fig. 1 is a schematic diagram of the reinforcement-learning-based recommendation scenario according to an embodiment of the present invention. The environment in reinforcement learning corresponds to the online user in the recommendation algorithm. The state s_t corresponds to an ordered sequence of the user's click history. The recommending agent corresponds to the execution center of the recommendation algorithm: according to the recommendation strategy it selects an item list I_t from the candidate set, from which Y items are selected to be shown to the user, i.e. the recommendation action, and the recommendation list is denoted by A_t. When the user interacts with the page and clicks a certain item a_t in the recommendation list as feedback, the reward of this state is obtained and the next state s_{t+1} is entered. The transition probability model P predicts, according to the current state s_t and the selected action a_t, the probability of entering the next state s_{t+1}.
In addition, the state transition probability expression (1) denotes the probability of transitioning from state s_t to state s_{t+1}; the reward function expression (2) corresponds to the reward, i.e. the short-term benefit, obtained after the user gives feedback by clicking on item a_t. Because the user can only make the selection action a_t ∈ A_t from the recommendations given by the recommendation system, r(s_t, a_t) can be used instead of r(s_t, A_t, a_t), and P(· | s_t, a_t) instead of P(· | s_t, A_t, a_t).
P(s_{t+1} | s_t, A_t, a_t)    (1)
r(s_t, A_t, a_t)    (2)
The policy π corresponds to the recommendation strategy of the recommending agent: in state s_t, the agent obtains the recommendation list A_t from the candidate set I_t according to a certain strategy π.
It is also noted that an object of embodiments of the present invention is to maximize the long-term reward while ensuring recommendation accuracy. Therefore, the improved recommendation algorithm based on reinforcement learning aims to find an optimal strategy π*(s_t, I_t) that, in state s_t, selects Y items from the candidate set I_t to recommend to the user such that the expected reward is maximized. The objective function is defined as shown in equation (3):
π* = argmax_π E[ Σ_t r(s_t, a_t) ]    (3)
Among reinforcement learning methods, value-based methods have great advantages for continuous off-policy learning, but the convergence of the policy function is poor. In contrast, policy-based approaches perform well in terms of policy-function convergence. Therefore, the invention uses REINFORCE, a strategy-gradient-based reinforcement learning method with better convergence, and trains the recommendation strategy of the recommending agent with a user simulator obtained by generative adversarial network training as the simulation environment. Here E is the expectation and r(s_t, a_t) is the reward function.
And a generating module 62, configured to generate a user simulator according to the scene information.
Optionally, the generating module includes: the description unit is used for describing the user state of the scene information according to an attention mechanism to obtain the user state; the determining unit is used for determining a user decision function and a user reward function according to the user state; and the construction unit is used for constructing the user simulator according to the user decision function and the user reward function.
In particular, the user state s_t is composed of the sequence of historical items c_0, c_1, …, c_{t-1} clicked on by the user before time t, where c_* denotes a clicked item. The sequence c_0, c_1, …, c_{t-1} is converted into embedding-layer vectors {f_1, f_2, …, f_{t-1}}, and the user state is then defined as shown in equation (4):
s_t = h(f_1, f_2, …, f_{t-1})    (4)
where the vector f_τ (τ = 1, 2, …, t-1) is the embedding-layer vector of the item clicked at time τ, and h(·) is a feature embedding function whose purpose is to generate a vector of fixed length to represent the user state. Therefore, if the user state is represented by a history sequence {f_{t-m}, …, f_{t-1}} of length m, it can be written as:
s_t = h(f_{t-m}, f_{t-(m-1)}, …, f_{t-1})    (5)
If F_{t-m:t-1} denotes the user history sequence {f_{t-m}, …, f_{t-1}} of length m, the user state can be represented as:
s_t = h(F_{t-m:t-1}) := σ(F_{t-m:t-1} W + B)    (6)
where W is an m-row, n-column matrix of weighting coefficients, B is a d-row, n-column bias matrix, and σ(·) is the activation function. This embodiment is described by taking a news recommendation system as an example, considering that the user's interests and behavior state change over time. Two contextual scenarios are common in news recommendation systems. (1) If the interval between two click/browse operations of the user is long, then when describing the state s_τ at a time point τ after the two clicks, it cannot simply be assumed that the content at every position of the user's historical interaction sequence has the same influence weight on the user's decision strategy. That is, in the history sequence of length m, if the interval between the initial position t-m and the end position t-1 is long, the user behavior at time t-m has no or little influence on the user's decision at the current time t. (2) If a user becomes interested in news about "a virus" after browsing the news titled "a virus variation in uk", the user's interest may be changed by the influence of that news, and the user may then want to browse news related to "a virus". However, if every position of the history sequence before time τ has the same influence on the user's decision strategy, the finally generated recommendation items may not contain the related news the user most wants to browse. Both contextual scenarios illustrate that the feature representation (6) cannot distinguish the influence of behaviors at different sequence positions on the user's decision strategy. In order to solve this problem, a scheme for representing the user state based on an Attention Mechanism is proposed, which adjusts the influence of different positions of the historical click sequence on the user state. The degree of influence of each position on time t is determined by the attention weight coefficient a_τ:
[Equation (7), defining the attention weight coefficient a_τ, is given as an image in the original.]
where d represents the position of the currently clicked item in the state sequence.
This scheme is illustrated in FIG. 2, which shows the user state characterization scheme based on the attention mechanism according to an embodiment of the present invention. Here {w_1, w_2, …} are PWM (Position Weight Matrix) parameters, which account for the user state s_t being influenced by where in the user interaction sequence an item is located (i.e., its time of occurrence). If H_{t-m:t-1} denotes the user state given by expression (6), then the user state based on the attention mechanism can be expressed as:
[Equation (8), the attention-based user state s_t, is given as an image in the original.]
In addition, in order to determine the user decision function and reward function and thus simplify the model, the embodiment of the present invention sets the recommendation scenario as follows: the user is presented with Y items, and the user decides either to click the one item of greatest interest or to click none of the items. The user simulator is the interaction model in this recommendation scenario. In the simulator, the user's satisfaction with, or interest in, an item is measured by a reward r, and the optimization goal of the user decision strategy is to maximize the long-term reward. In the real user decision process, the items pushed to the user by the recommendation algorithm have a certain influence on changes in the user's interests. Taking a news recommendation service as an example, a certain user may not be interested in NBA news at first, but if the recommendation algorithm recommends such news to this user, the user may like it and then become interested in other NBA news. Similarly, a user may become bored after repeatedly seeing similar news. Thus, the user's satisfaction with the same item may be affected by the user's behavior history sequence. In summary, the reward function is related to the user state s_t and the user's decision behavior a_t, and the reward is therefore represented as a reward function r(s_t, a_t). The optimal user decision model φ* is the set of parameters that, in user state s_t, clicks item a_t from the item set A_t recommended by the recommending agent so as to maximize the reward function r(s_t, a_t). The user decision function can thus be expressed as:
[Equation (9), the user decision function φ*, is given as an image in the original.]
where y denotes an item number in the recommending agent's push list, Y is the total number of items, and Δ^Y is a Y-dimensional probability simplex, as shown in equation (10):
[Equation (10), the definition of the probability simplex Δ^Y, is given as an image in the original.]
Δ^Y represents the probabilities of the user clicking on each recommended item, which sum to 1. L_2(φ) is an L2 regularization function used to encourage exploration. η is the exploration rate; as an exploration-exploitation balance parameter, the larger η is, the more exploratory the user is. It is assumed that the reward of the recommender system is the same as the user utility; therefore, optimizing the accumulated reward of the recommendation system can satisfy the user's needs over the long term and improve user satisfaction. The defined reward function is determined by the user's utility after making a click decision, as shown in equation (11):
r(s_t, a_t) := reg(W[s_t, a_t] + b)    (11)
where W is the reward weight matrix, b is the corresponding bias vector, and reg(·) is the final regression function.
A simulation module 64 for generating a simulated environment from the user simulator.
Specifically, after a mature user simulator is obtained, the embodiment of the present invention outputs the simulation environment according to the characteristic values of the user simulator in the early stage of recommendation strategy generation; the simulation environment is used for generating the final recommendation strategy model.
And the recommending module 66 is used for generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
Optionally, the apparatus further comprises: and the output module is used for outputting the recommended strategy model.
Specifically, in the embodiment of the present invention, the reward and the user decision strategy in the defined user simulator are unknown and are trained from data. In the training process, the user decision function φ(s_t, A_t) models the click behavior sequence of real users; the click behavior of real users and the behavior motivation of the user simulator are both to maximize the long-term reward value. This is fitted with the generator and discriminator of a generative adversarial network, where the generator is the user decision function, which generates the user's next click behavior from the user's historical behavior, and the discriminator is the reward function r(s_t, A_t), which distinguishes real clicks of the user from clicks generated by the user simulator. In this embodiment, a state-action sequence trajectory of length T is adopted for training: given a user historical click sequence of length T and the corresponding user click item features f_1, f_2, …, f_T, the user decision function and the reward function are obtained by solving the min-max optimization problem of equation (12):
[Equation (12), the min-max objective of the generative adversarial training, is given as an image in the original.]
where s_t^true represents the state of the real user, a_t^true represents the item clicked by the real user, a_t represents the click item generated by the user simulator, φ represents the user decision function φ(s_t, A_t), r represents the reward function r(s_t, A_t), and E_φ is the expectation.
The user simulator is built on the basis of a generative adversarial network. The reward function r(s_t, a_t) extracts features from real user behaviors and from behaviors generated by the user simulator and trains a network that amplifies the difference between the two. In contrast, the goal of the user decision function φ(s_t, A_t) is to narrow the difference between real user behavior and the behavior generated by the user simulator, generating samples that approximate real user behavior as closely as possible. This user simulator is named MRLG-Attention; the flow of the generated data is shown in FIG. 3. The generative adversarial model can be interpreted as a game between an adversary and a learner, in which the adversary adjusts the reward function r(s_t, a_t) to minimize the learner's reward, while the learner adjusts the user decision function φ(s_t, A_t) to maximize the reward. This provides a large amount of training data for user simulator training, and the trained model has less deviation.
It should be noted that the recommendation algorithm based on the user simulator and reinforcement learning may be a recommendation strategy model obtained by training in a simulation environment with the policy gradient method REINFORCE, using the user simulator MRLG-Attention learned from the real environment as the simulation environment, as shown in fig. 4. Let the recommendation strategy be π_θ(s_t, a_t), where θ is the recommendation strategy function parameter. For an action-state sequence τ of length l generated by the user simulator, the return of the sequence τ is shown in equation (13):
R(τ) = Σ_{t=1}^{l} r(s_t, a_t)    (13)
If P(τ; θ) represents the probability of the occurrence of the sequence τ, then the target expected reward function is as shown in equation (14):
J(θ) = Σ_τ P(τ; θ) R(τ)    (14)
To find the optimal parameters of the objective function such that J(θ) is maximized, this embodiment solves with a gradient-ascent method as shown in equation (15):
θ ← θ + α ∇_θ J(θ)    (15)
Taking the derivative of the objective function (14) yields equation (16):
∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} Σ_{t=1}^{l} ∇_θ log π_θ(s_t^{(i)}, a_t^{(i)}) R(τ^{(i)})    (16)
where the mean over the m sequences is used to approximate the expectation of the strategy gradient.
EXAMPLE III
According to another aspect of the embodiment of the present invention, a non-volatile storage medium is further provided, where the non-volatile storage medium includes a stored program, and the program controls, when running, a device in which the non-volatile storage medium is located to execute a recommendation policy generation method based on reinforcement learning.
Specifically, the method comprises the following steps: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
Example four
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor and a memory; the memory stores computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a recommendation strategy generation method based on reinforcement learning.
Specifically, the method comprises the following steps: acquiring scene information; generating a user simulator according to the scene information; generating a simulation environment according to the user simulator; and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
Effects of the embodiment
Compared with the prior art, when solving the related technical problems, the user simulator of the invention can adapt to changes in the user's behavior characteristics and obtains higher user-behavior prediction accuracy; the recommendation algorithm based on the user simulator obtains a higher click-through rate and long-term reward, effectively improving recommendation performance.
(1) In order to evaluate the accuracy of the user simulator MRLG-Attention in predicting user click behavior, the following widely used prediction models are selected for comparison. W & D LR: a logistic regression method used to estimate the user's click-through rate. W & D CCF: a collaborative filtering approach that considers contextual information and models the user decision process to learn user preferences. XGBoost: an ensemble learning method based on decision trees, an end-to-end decision-tree boosting algorithm improved on the basis of algorithms such as the Gradient Boosting Decision Tree (GBDT). MRLGAN: an algorithm obtained by removing the attention mechanism from the user simulator provided in this embodiment; it differs only in the user's state feature representation.
In the comparative experiments of the embodiment of the present invention, the user data is randomly divided into two parts: 80% of the user data is used as the training set and 20% as the test set. The evaluation index is Top-K precision (Precision@K, abbreviated Prec@K), the proportion of cases in which the item actually clicked by the user appears in the first K positions of the prediction list, averaged over the tested users. The results of the experiment are shown in Table 1.
TABLE 1 comparison of prediction accuracy performance of different models on different datasets
[Table 1 is given as an image in the original.]
In Table 1, the bolded data represents the best algorithm model on the given data set under the current evaluation criterion. The "Improved" row indicates the improvement rate of the proposed user simulator MRLG-Attention over the best-performing model on that data set (excluding MRLG-Attention) under the current criterion. As can be seen from the data in the table, on both data sets the user-behavior prediction accuracy of the user simulator MRLG-Attention proposed in this embodiment is higher than that of the other algorithm models. On both data sets, the prediction accuracy of all models improves as the k value increases; in particular, the accuracy of MRLG-Attention on the Yelp data set is 19.3% higher at k = 2 than at k = 1. In addition, as can be seen from Table 1, compared with MRLGAN, the MRLG-Attention algorithm increases the prediction accuracy by 0.6% at k = 1 and by 3.8% at k = 2, which indicates that the attention mechanism helps to improve prediction accuracy. Here the k value is the number of recommended items in the recommendation list displayed on the page.
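For reference, the Prec@K criterion used in Table 1 can be computed as in the sketch below; the argument layout is an assumption of the sketch.

def precision_at_k(ranked_lists, clicked_items, k):
    # Fraction of test interactions whose actually clicked item appears in the
    # first k positions of the predicted ranking, averaged over the test users.
    hits = sum(1 for ranking, clicked in zip(ranked_lists, clicked_items)
               if clicked in ranking[:k])
    return hits / max(len(ranked_lists), 1)

# toy usage
print(precision_at_k([[3, 1, 2], [5, 4, 6]], [1, 6], k=2))   # -> 0.5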
(2) In order to verify the recommendation accuracy of the reinforcement-learning-based recommendation algorithm MRLG Rec, historical behavior data of 2000 users were collected from a supported information platform project. The data were randomly divided into two non-overlapping data sets: one data set is used to train the user simulator, and the other is used to train the reinforcement learning strategy. The deep learning models W & D LR and W & D CCF and the off-policy model-free reinforcement learning model DQN are selected as comparison algorithms.
Two performance evaluation criteria selected by the embodiment of the invention are as follows:
(1) Cumulative Reward (CR): the exploratory nature of reinforcement learning enables the recommendation algorithm MRLG Rec designed by the invention to consider long-term benefit, so the cumulative reward of the algorithm should be higher than that of the non-reinforcement-learning algorithms W & D LR and W & D CCF. For each recommendation action in the recommendation sequence, the user's reward can be calculated by the reward function of the user simulator. Since the users' rewards are not calculated when training the reinforcement-learning-based recommendation strategy, this section uses the average of the cumulative rewards of all users as the CR value.
(2) CTR: the number of times an item is clicked on by the user is divided by the number of times the item is presented in the recommendation list. Since the behavior of the user is uncertain, 10 repeated experiments were performed for each recommended strategy, and the results of the performance comparison are shown in table 2.
TABLE 2 Recommendation performance comparison
[Table 2 is given as an image in the original.]
In Table 2, the k value represents the number of recommended items in the recommendation list presented on the page. It can be seen from the table that as the k value increases, the performance indexes of all algorithms in the comparative experiment improve. At k = 3, the cumulative reward of the proposed algorithm increases by 66.78% and 20.67% relative to the deep learning algorithms W & D LR and W & D CCF respectively, and by 13.34% relative to the model-free reinforcement learning algorithm DQN; at k = 5, the increases relative to the three algorithms are 67.06%, 21.11% and 9.45% respectively. The results show that the long-term benefit of the recommendation algorithm MRLG Rec, which adopts the reinforcement learning method, is higher than that of the deep learning algorithms W & D LR and W & D CCF, indicating that better long-term benefit can be obtained with reinforcement learning; and, because a dynamic sequence model is adopted, the long-term benefit of the model-based reinforcement learning method MRLG Rec is higher than that of the model-free reinforcement learning method DQN. Although the training goal of the proposed recommendation algorithm MRLG Rec is to maximize the cumulative reward, it also obtains a relatively high click-through rate compared with the other three algorithms, improving the click-through rate by 19.35%, 21.88% and 11.43% respectively. This benefits from the learning of the reward function by the model-based reinforcement learning method, which, while pursuing maximization of the cumulative reward, also considers the consistency between the click distribution of the user simulator and that of real users.
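The two evaluation criteria above (CR and CTR) can be computed with the simple sketch below; treating CR as the per-user sum of simulator rewards averaged over users follows the description in this section, while the exact data layout is an assumption.

def cumulative_reward(per_user_rewards):
    # CR: average over users of the summed per-step rewards r(s_t, a_t)
    # returned by the user simulator's reward function.
    return sum(sum(seq) for seq in per_user_rewards) / max(len(per_user_rewards), 1)

def click_through_rate(clicks, impressions):
    # CTR: number of clicks on an item divided by the number of times the item
    # was presented in the recommendation list.
    return clicks / impressions if impressions else 0.0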
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A recommendation strategy generation method based on reinforcement learning is characterized by comprising the following steps:
acquiring scene information;
generating a user simulator according to the scene information;
generating a simulation environment according to the user simulator;
and generating a recommended strategy model by adopting a strategy gradient algorithm through the simulation environment.
2. The method of claim 1, wherein prior to said obtaining context information, the method further comprises:
and defining a recommendation scene according to the user requirement.
3. The method of claim 1, wherein generating a user simulator based on the context information comprises:
describing the user state of the scene information according to an attention mechanism to obtain the user state;
determining a user decision function and a user reward function according to the user state;
and constructing the user simulator according to the user decision function and the user reward function.
4. The method of claim 1, wherein after said generating a recommended policy model using a policy gradient algorithm through said simulation environment, said method further comprises:
and outputting the recommended strategy model.
5. A recommendation policy generation device based on reinforcement learning, comprising:
the acquisition module is used for acquiring scene information;
the generating module is used for generating a user simulator according to the scene information;
the simulation module is used for generating a simulation environment according to the user simulator;
and the recommendation module is used for generating a recommendation strategy model by adopting a strategy gradient algorithm through the simulation environment.
6. The apparatus of claim 5, further comprising:
and the definition module is used for defining the recommendation scene according to the user requirement.
7. The apparatus of claim 5, wherein the generating module comprises:
the description unit is used for describing the user state of the scene information according to an attention mechanism to obtain the user state;
the determining unit is used for determining a user decision function and a user reward function according to the user state;
and the construction unit is used for constructing the user simulator according to the user decision function and the user reward function.
8. The apparatus of claim 5, further comprising:
and the output module is used for outputting the recommended strategy model.
9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.
10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.
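For illustration only, the following is a minimal sketch of how the method of claims 1 and 3 could be realized: a user simulator with an attention-based user state, a user decision function and a user reward function, used as a simulation environment in which a softmax policy is trained with a REINFORCE-style policy gradient. All class and function names, the linear attention, the sigmoid click model and the one-step episodes are simplifying assumptions of this sketch, not features disclosed by the invention.

```python
import numpy as np

class UserSimulator:
    """Toy user simulator: attention-based user state, decision function, reward function."""

    def __init__(self, n_items: int, dim: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.item_vecs = rng.normal(size=(n_items, dim))   # item embeddings
        self.pref = rng.normal(size=dim)                   # hidden user preference
        self.history = [int(rng.integers(n_items))]        # interaction history

    def state(self) -> np.ndarray:
        """Attention-weighted summary of the interaction history (the user state)."""
        h = self.item_vecs[self.history]                   # (t, dim)
        scores = h @ self.pref
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ h                                       # (dim,)

    def decide(self, item: int) -> int:
        """User decision function: click with probability sigmoid(item . preference)."""
        p = 1.0 / (1.0 + np.exp(-self.item_vecs[item] @ self.pref))
        clicked = int(np.random.random() < p)
        if clicked:
            self.history.append(item)
        return clicked

    def reward(self, clicked: int) -> float:
        """User reward function: 1 for a click, 0 otherwise."""
        return float(clicked)


def train_policy(n_items: int = 20, dim: int = 8, episodes: int = 500, lr: float = 0.05):
    """REINFORCE-style policy gradient trained entirely inside the simulated environment."""
    theta = np.zeros((dim, n_items))              # linear softmax policy parameters
    for _ in range(episodes):
        sim = UserSimulator(n_items, dim)         # simulation environment built from the user simulator
        s = sim.state()
        logits = s @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = int(np.random.choice(n_items, p=probs))
        r = sim.reward(sim.decide(a))             # reward returned by the simulated user
        grad = -np.outer(s, probs)                # d log pi(a|s) / d theta, softmax term
        grad[:, a] += s                           # chosen-action term
        theta += lr * r * grad                    # ascend the policy gradient
    return theta


policy_params = train_policy()
```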
CN202110726927.3A 2021-06-29 2021-06-29 Recommendation strategy generation method and device based on reinforcement learning Pending CN113688306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110726927.3A CN113688306A (en) 2021-06-29 2021-06-29 Recommendation strategy generation method and device based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN113688306A true CN113688306A (en) 2021-11-23

Family

ID=78576492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110726927.3A Pending CN113688306A (en) 2021-06-29 2021-06-29 Recommendation strategy generation method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113688306A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104595A (en) * 2019-12-16 2020-05-05 华中科技大学 Deep reinforcement learning interactive recommendation method and system based on text information
CN112149824A (en) * 2020-09-15 2020-12-29 支付宝(杭州)信息技术有限公司 Method and device for updating recommendation model by game theory
CN112597392A (en) * 2020-12-25 2021-04-02 厦门大学 Recommendation system based on dynamic attention and hierarchical reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU Ziwen: "A reinforcement learning based recommendation algorithm fusing users' long- and short-term preferences", Modern Computer (现代计算机), no. 06, page 31 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091174A (en) * 2023-04-07 2023-05-09 湖南工商大学 Recommendation policy optimization system, method and device and related equipment
CN116823408A (en) * 2023-08-29 2023-09-29 小舟科技有限公司 Commodity recommendation method, device, terminal and storage medium based on virtual reality
CN116823408B (en) * 2023-08-29 2023-12-01 小舟科技有限公司 Commodity recommendation method, device, terminal and storage medium based on virtual reality

Similar Documents

Publication Publication Date Title
CN108648049B (en) Sequence recommendation method based on user behavior difference modeling
Sun et al. Learning multiple-question decision trees for cold-start recommendation
Pandey et al. Bandits for taxonomies: A model-based approach
CN110046304A (en) A kind of user's recommended method and device
Salehi Application of implicit and explicit attribute based collaborative filtering and BIDE for learning resource recommendation
EP4181026A1 (en) Recommendation model training method and apparatus, recommendation method and apparatus, and computer-readable medium
CN109961142B (en) Neural network optimization method and device based on meta learning
CN110781409B (en) Article recommendation method based on collaborative filtering
Liu et al. Balancing between accuracy and fairness for interactive recommendation with reinforcement learning
CN103678518A (en) Method and device for adjusting recommendation lists
CN111242310A (en) Feature validity evaluation method and device, electronic equipment and storage medium
WO2012119245A1 (en) System and method for identifying and ranking user preferences
CN113688306A (en) Recommendation strategy generation method and device based on reinforcement learning
CN111949886A (en) Sample data generation method and related device for information recommendation
CN112528164A (en) User collaborative filtering recall method and device
CN112053188A (en) Internet advertisement recommendation method based on hybrid deep neural network model
Han et al. Optimizing ranking algorithm in recommender system via deep reinforcement learning
CN115600009A (en) Deep reinforcement learning-based recommendation method considering future preference of user
CN110347916A (en) Cross-scenario item recommendation method, device, electronic equipment and storage medium
CN115345635A (en) Processing method and device for recommended content, computer equipment and storage medium
Huang et al. C-3PO: Click-sequence-aware deeP neural network (DNN)-based Pop-uPs recOmmendation: I know you'll click
Zhang Design and Implementation of Sports Industry Information Service Management System Based on Data Mining
CN114298118B (en) Data processing method based on deep learning, related equipment and storage medium
Shinde et al. Scenario analysis of technology products with an agent-based simulation and data mining framework
Lei et al. Advertising click-through rate prediction model based on an attention mechanism and a neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant after: China Telecom Digital Intelligence Technology Co.,Ltd.

Address before: Room 1308, 13th floor, East Tower, 33 Fuxing Road, Haidian District, Beijing 100036

Applicant before: CHINA TELECOM GROUP SYSTEM INTEGRATION Co.,Ltd.
