CN111199458B

CN111199458B - Recommendation system based on meta learning and reinforcement learning

Info

Publication number: CN111199458B
Application number: CN201911393658.2A
Authority: CN
Inventors: 李建欣; 张帅; 朱琪山; 杨继远; 周号益
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2023-06-02
Anticipated expiration: 2039-12-30
Also published as: CN111199458A

Abstract

The invention realizes a recommendation system based on meta learning and reinforcement learning by means of methods in the fields of meta learning, reinforcement learning and data mining, defines and constructs an internal updating module and a meta updating module model, and forms a system model by the two modules; the training process of the system model is as follows: and finally, generating errors by inputting feedback of the user to the recommended content, deriving an initial model, and updating to obtain a new model. After model training is completed, the system accepts the user's feature data, recommends push content for the user, and collects feedback from the user for these content thereafter.

Description

Recommendation system based on meta learning and reinforcement learning

Technical Field

The invention relates to the fields of meta learning, reinforcement learning and data mining, in particular to a recommendation system based on meta learning and reinforcement learning.

Background

At present, a recommendation system is almost ubiquitous, a plurality of apps use the recommendation system, travel, shopping, video, news, social interaction and the like, and people can see the shadow of the recommendation system, which is closely related to people's daily lives. For the reason, for users, facing massive data, it is always hoped to quickly find information which is interesting or valuable for the users; it is always desirable for information producers to attract more customers to their own content, but different customers have different preferences, so there should be different recommended content for different users. Although the recommender system has many benefits for both the business and the user, if the recommender system performs poorly, significant losses are incurred for the business. Many recommendation systems at present recommend based on the similarity between users or commodities, and the recommendation mode based on supervised learning has certain limitation:

1. these recommendation systems tend to recommend based on short-term behavior without taking into account the long-term behavior of the user. Such as: when a person purchases a set of headphones on the panning, the recommendation system still recommends headphones to the user for the last time, which can seriously affect the user's experience.

2. Personalized recommendations according to the user's behavior or preferences cannot be quickly made. Because of the recommendation based on the similarity, the system needs to collect certain user information and behaviors before personalized recommendation can be performed. This requires long user feedback and may lead to user churn.

3. Deviations of the recommendation system. When the recommendation system recommends the user of the two commodities A and B, the recommendation system only pays attention to the feedback of the user on the commodities A and B, and the user preference degree of the user on other videos is not known.

Reinforcement learning has gained much attention in recent years to maximize the splendid in the go arena, to be able to play without any play, and to capture the eyes of people even in the performance of autopilot. Reinforcement learning is one area of machine learning, emphasizing how to learn in interactions with the environment to obtain maximum benefit. The core of meta learning is how to learn the machine learning model, and the effect of rapid adaptation along with the change of a user can be achieved after the combination of the dynamic adjustment strategy mode of reinforcement learning, which is very consistent with the operation mechanism like a recommendation system which needs to respond agilely according to the response of a new user to recommendation, so that the meta learning and reinforcement learning technology can be combined to form the recommendation system.

Disclosure of Invention

To solve some drawbacks of the current recommendation system, we propose to use meta learning and reinforcement learning methods to make the recommendation system. On the one hand, the dynamic interaction strategy improvement method can well avoid some defects judged by similarity. On the other hand, in recent years, meta learning is rapidly developed, and the problem of how to learn under the condition of few samples and how to achieve good effects is solved, and the combination of the meta learning and reinforcement learning achieves good effects on some problems, so that the problem of recommending learning to rapidly learn user preferences is solved by using a method of reinforcement learning and meta learning.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a recommendation system based on meta learning and reinforcement learning specifically comprises the following steps:

step one: inputting user basic information and a tour record or a purchase record, defining an internal updating module and a meta updating module model, and forming a system model by the internal updating module and the meta updating module;

the internal updating module model firstly receives feedback of various characteristic data of a user for a period of time, and then optimizes the model by using a gradient descent method to obtain a model after rapid adaptation, and then receives the current various characteristic data of the user;

the meta-updating module calculates various index changes of a user through a Markov process, and the recommended excitation energy provided in the whole stage is maximized through a defined meta-loss function;

step two: the system model is subjected to model correction by the internal updating module through a gradient descent method by combining the previous recommended content of the user and the feedback generated by the user, and personalized adaptation is performed.

Step three: receiving current characteristic data of a user through the internal updating module, and recommending new content for the user by using the strategy corrected in the step two;

step four: and D, calculating the recommended incentive by inputting feedback of the user on the recommended content in the step three, wherein the feedback is a label record of whether the recommended content is liked or not.

Step five: after obtaining the excitation obtained by the recommendation, conducting derivative updating on the initial model, and adjusting the strategy.

The internal update module may be expressed as:

and define

Where θ is the vector of the input data,

for the model internally updated at time i, step M, τ is the ordered sequence vector representing the policy, and M is the total number of internally updated steps.

The meta-update module error calculation is expressed as:

wherein the method comprises the steps of

Representing the initial state and transition probability of the Markov process,/for the Markov process>

For the meta-loss function->

Is a mathematical expectation of the corresponding element.

The meta-loss function is defined as:

compared with the traditional recommendation system method, the meta learning and reinforcement learning based method has the following advantages:

1. the first accurate recommendation is realized by comprehensively learning the behavior or preference of the user through the basic data of the user and the browsing records or purchasing records of the user and automatically extracting the hidden relationship between the personal behavior of the user and the possibly favorite articles.

2. By combining the user's reactions (like or dislike) to the recommended content, our recommendation strategy is dynamically adjusted.

3. Making a recommendation decision by using a reinforcement learning method, and learning an optimal recommendation scheme for a certain user by dynamically adjusting recommendation contents to the user through interaction of an agent (a decision maker of a recommendation system) and an environment (the user);

4. the algorithm is multi-parameter adjustable, the expansibility is strong, the system can continuously acquire rewards (like or dislike of the user), environment states (tour record or purchase record of the current user) and take actions (recommended content) through interaction with the user, the algorithm can be set according to the problem requirement, and the portability of the algorithm is good.

Drawings

FIG. 1 is an interaction diagram with user data in model training;

FIG. 2 model global update strategy;

Detailed Description

The following is a preferred embodiment of the present invention and a technical solution of the present invention is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.

The invention is based on meta learning and reinforcement learning methods, through learning basic information of users and tour records or purchase records, extracting hidden rules therein, recommending the users, and taking the reactions (like or dislike) of the users to recommended contents as model feedback, thereby adjusting the strategy learned by the model and giving new recommended contents for the users.

The overall flow of the model is shown in fig. 2, firstly, a recommended strategy is generated by the model for the characteristic data input of a user, then the error is calculated according to the strategy, then the parameters of the model are optimized through an optimization method in the internal updating process to obtain the next strategy, the final strategy is obtained according to the set internal updating step number, finally, the error is generated by the feedback of the input user on the recommended content, then the initial model is derived, the new model is obtained by updating, and the training is finished until the last time.

Because the method of our system adopts different strategies during training and working, the training is that for a certain user, the browsing behavior (like or dislike) after each recommendation of content is changed (T _i ，T _i+1 ) Corresponding recommended content Y _i The parameter theta and the learning rate alpha of the model are updated together by meta learning and gradient descent methods. The learned strategy is changed because the feedback of the user on the recommended content cannot be obtained immediately when the method is executed. When in execution: obtaining basic data and browsing data T of a certain user _i Then make a recommendation content Y for the user _i And updated according to the weight of each different recommendation.

First, the data involved in the scheme are defined:

the basic data of the user is recorded as X= { X ₁ ，...，X _n }, wherein each data point X _i Represents the ith piece of data, each piece of data represents the behavior record (like or dislike articles or purchase history, etc.) of the user at a certain moment, X _i ＝{x _i1 ，...，x _im -each x _ij The j-th feature of the i-th data is represented, and each data has m features. The set to be recommended is recorded as y= { Y ₁ ，...，y _k }, wherein each data point y _i The following data definition, representing the i-th item or content, defines the recommendation system model proposed by the present invention:

1. state space: a vector X with length of m, each component value of the vector represents the index value corresponding to the user basic information and each index in the history record in the current state;

2. action space: a vector A, A of length k _i E {0,1}, 1.ltoreq.i.ltoreq.k, corresponding to whether the content recommended in the current state contains the item Y _i If 0, it means that the article is not contained, and if 1, it means that the article is contained;

3. rewarding: a scalar R represents the feedback of the user after receiving action a decided by the recommender system decision maker.

4. Strategy sequence: user features X _i And corresponding recommended content Y _i Excitation R _i Ordered sequence of (2)

Loss function: l (L) _T Loss function of recommended content effect for a particular user:

an internal updating module:

one of the main parts of the whole system, the model theta first accepts each X of the user _i Item characteristic data X, corresponding prescription data Y are predicted, and then a gradient descent method is used for optimizing the model to obtain a model phi after fast adaptation:

the above-mentioned M is the number of steps of an internal update,

the learning rate of each step in the internal updating process is optimized along with the parameters of the model in the whole training process. The derivation of LT we use a strategy gradient method to optimize:

and a meta updating module:

the meta update (meta update) module is used for checking a generalization capability of the model, and the behavior of the user is dynamically changed continuously, so that the model is adapted to the dynamic environment better, and the best recommendation suggestion can be provided for the user when the user index is changed in a variety.

In the meta-update phase we have to realize that the next prediction for the user after the recommendation can reach the maximum incentive R. We consider the user's index changes as a markov process, and we can refer to the model as a whole and the interaction with the data as shown in fig. 1.

The goal of the meta-update phase is to minimize the error in the recommendations provided throughout the phase, expressed as:

wherein the method comprises the steps of

Represented are the initial state and transition probabilities of the markov process. Note that our model is divided into two layers, the upper one is model parameters that are optimized continuously according to input, and the lower one is dynamic variation of each index of the user (we regard it as a markov decision process). In addition, we have defined the meta-loss function (metaloss) is between two successive states before and after the user recommendation:

it can be understood that we use the current input T _i To change the model to enable the user to have various feature data T _i+1 The most efficient recommendation suggestions are generated.

The whole flow is as follows: first model

Characteristic data input T for primary user _i Generating a recommended strategy τ _i Further, the error is determined according to this strategy>

Then optimizing the parameters of the model by means of the optimization method mentioned in the internal update procedure to get +.>

Obtaining final step number M according to the set internal update, and generating error by inputting user feedback to recommended content>

Then ∈the initial model>

Conduct derivative and update to get new model +.>

One training is finished

Output module of recommendation system based on meta learning and reinforcement learning:

after model training is completed, the system firstly receives the characteristic data of the user, then operates the model to recommend push content for the user, and the output of the model is combinedThe result is a motion vector A, A _i E {0,1}, 1.ltoreq.i.ltoreq.k, k being the length of the set to be recommended, A _i If 1 indicates that content Y is currently recommended for the user _i Otherwise, not recommending content Y for the user _i And then pushing corresponding contents to the user according to the result of the model, collecting feedback (like or dislike) of the user on the contents, calculating a recommendation reward R according to the feedback, increasing R if the recommended contents are like by the user, decreasing R if the user does not like, and adjusting model parameters (strategy adjustment) according to the recommendation result if no feedback R is a default value. And then the next recommendation is made.

Claims

1. A recommendation system based on meta learning and reinforcement learning is characterized in that:

step one: inputting user basic information and a tour record or a purchase record, defining an internal updating module and a meta updating module, and forming a system model by the internal updating module and the meta updating module;

the internal updating module firstly receives various characteristic data of a user in a period of time before, wherein the various characteristic data comprise feedback of the user on the data, a gradient descent method is used for optimizing the model to obtain a model which is quickly adapted, and then the various characteristic data of the user at present are received;

the meta-updating module receives the model finally updated by the internal updating module, obtains feedback of the user on recommended content, calculates various index changes of the user through a Markov process, maximizes the recommended excitation energy provided in the whole stage through a defined meta-loss function, and finally updates the model to obtain a new model;

step two: the system model is subjected to model correction by the internal updating module through a gradient descent method by combining the previous recommended content of the user and the feedback generated by the user, and personalized adaptation is performed;

step three: receiving current characteristic data of a user through the internal updating module, and outputting the current characteristic data by using the strategy corrected in the step two, wherein the output is the recommendation of new content to the user;

step four: calculating and obtaining recommended motivation by inputting feedback of a user on recommended content in the step three, wherein the feedback is a label record of whether the recommended content is liked or not;

step five: after obtaining the excitation obtained after the recommendation in the step four, conducting derivative updating on the initial model;

in the internal updating module, the model theta firstly accepts each X of the user _i The item characteristic data X are predicted to be corresponding recommended data Y, and then the model is optimized by a gradient descent method to obtain a rapidly-adapted model phi, wherein the basic data of a user are X= { X ₁ ，...，X _n }, wherein each data point X _i Representing the ith data, each data representing the behavior record of the user at a certain moment, X _i ＝{x _i1 ，...，x _im -each x _ij The j features of the ith data are represented, each data has m features, and the set to be recommended is recorded as Y= { Y ₁ ，...，y _k }, wherein each data point y _i Data representing the ith item or content, then the recommendation system model is defined as follows: the state space is a vector X with the length of m, and each component value of the vector represents the index value corresponding to each index in the user basic information and the history record in the current state; the motion space is a vector A, A with a length k _i E {0,1}, 1.ltoreq.i.ltoreq.k, corresponding to whether the content recommended in the current state contains the item Y _i If 0, it means that the article is not contained, and if 1, it means that the article is contained; the reward is a scalar R, which represents feedback of the user after receiving the action A decided by the decision maker of the recommendation system; the policy sequence is the user's various characteristics X _i And corresponding recommended content Y _i Excitation R _i Ordered sequence of (2)

The internal update module may be expressed as:

and define

Wherein, theta is a model,

for the model internally updated at the moment i and the mth step, an ordered sequence vector representing a strategy, and M is the total number of internally updated steps;

the meta-update module error calculation is expressed as:

wherein the method comprises the steps of

For the meta-loss function->

Representing mathematical expectations of corresponding elements;

the meta-loss function is defined as:

the training process of the model is as follows: first, for an internal update module, a model

Characteristic data input T for primary user _i Generating a recommendation strategy τ _i Further, the error is determined according to this strategy>

Obtaining a final +.>

Finally, for the meta-update module, by inputting the user feedback T of the recommended content _i+1 To generate error->

Then ∈the initial model>

Conduct derivative and update to get new model +.>

/>