CN111199458B - Recommendation system based on meta learning and reinforcement learning - Google Patents
Recommendation system based on meta learning and reinforcement learning Download PDFInfo
- Publication number
- CN111199458B CN111199458B CN201911393658.2A CN201911393658A CN111199458B CN 111199458 B CN111199458 B CN 111199458B CN 201911393658 A CN201911393658 A CN 201911393658A CN 111199458 B CN111199458 B CN 111199458B
- Authority
- CN
- China
- Prior art keywords
- user
- model
- meta
- data
- recommended
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention realizes a recommendation system based on meta learning and reinforcement learning by means of methods in the fields of meta learning, reinforcement learning and data mining, defines and constructs an internal updating module and a meta updating module model, and forms a system model by the two modules; the training process of the system model is as follows: and finally, generating errors by inputting feedback of the user to the recommended content, deriving an initial model, and updating to obtain a new model. After model training is completed, the system accepts the user's feature data, recommends push content for the user, and collects feedback from the user for these content thereafter.
Description
Technical Field
The invention relates to the fields of meta learning, reinforcement learning and data mining, in particular to a recommendation system based on meta learning and reinforcement learning.
Background
At present, a recommendation system is almost ubiquitous, a plurality of apps use the recommendation system, travel, shopping, video, news, social interaction and the like, and people can see the shadow of the recommendation system, which is closely related to people's daily lives. For the reason, for users, facing massive data, it is always hoped to quickly find information which is interesting or valuable for the users; it is always desirable for information producers to attract more customers to their own content, but different customers have different preferences, so there should be different recommended content for different users. Although the recommender system has many benefits for both the business and the user, if the recommender system performs poorly, significant losses are incurred for the business. Many recommendation systems at present recommend based on the similarity between users or commodities, and the recommendation mode based on supervised learning has certain limitation:
1. these recommendation systems tend to recommend based on short-term behavior without taking into account the long-term behavior of the user. Such as: when a person purchases a set of headphones on the panning, the recommendation system still recommends headphones to the user for the last time, which can seriously affect the user's experience.
2. Personalized recommendations according to the user's behavior or preferences cannot be quickly made. Because of the recommendation based on the similarity, the system needs to collect certain user information and behaviors before personalized recommendation can be performed. This requires long user feedback and may lead to user churn.
3. Deviations of the recommendation system. When the recommendation system recommends the user of the two commodities A and B, the recommendation system only pays attention to the feedback of the user on the commodities A and B, and the user preference degree of the user on other videos is not known.
Reinforcement learning has gained much attention in recent years to maximize the splendid in the go arena, to be able to play without any play, and to capture the eyes of people even in the performance of autopilot. Reinforcement learning is one area of machine learning, emphasizing how to learn in interactions with the environment to obtain maximum benefit. The core of meta learning is how to learn the machine learning model, and the effect of rapid adaptation along with the change of a user can be achieved after the combination of the dynamic adjustment strategy mode of reinforcement learning, which is very consistent with the operation mechanism like a recommendation system which needs to respond agilely according to the response of a new user to recommendation, so that the meta learning and reinforcement learning technology can be combined to form the recommendation system.
Disclosure of Invention
To solve some drawbacks of the current recommendation system, we propose to use meta learning and reinforcement learning methods to make the recommendation system. On the one hand, the dynamic interaction strategy improvement method can well avoid some defects judged by similarity. On the other hand, in recent years, meta learning is rapidly developed, and the problem of how to learn under the condition of few samples and how to achieve good effects is solved, and the combination of the meta learning and reinforcement learning achieves good effects on some problems, so that the problem of recommending learning to rapidly learn user preferences is solved by using a method of reinforcement learning and meta learning.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a recommendation system based on meta learning and reinforcement learning specifically comprises the following steps:
step one: inputting user basic information and a tour record or a purchase record, defining an internal updating module and a meta updating module model, and forming a system model by the internal updating module and the meta updating module;
the internal updating module model firstly receives feedback of various characteristic data of a user for a period of time, and then optimizes the model by using a gradient descent method to obtain a model after rapid adaptation, and then receives the current various characteristic data of the user;
the meta-updating module calculates various index changes of a user through a Markov process, and the recommended excitation energy provided in the whole stage is maximized through a defined meta-loss function;
step two: the system model is subjected to model correction by the internal updating module through a gradient descent method by combining the previous recommended content of the user and the feedback generated by the user, and personalized adaptation is performed.
Step three: receiving current characteristic data of a user through the internal updating module, and recommending new content for the user by using the strategy corrected in the step two;
step four: and D, calculating the recommended incentive by inputting feedback of the user on the recommended content in the step three, wherein the feedback is a label record of whether the recommended content is liked or not.
Step five: after obtaining the excitation obtained by the recommendation, conducting derivative updating on the initial model, and adjusting the strategy.
The internal update module may be expressed as:
Where θ is the vector of the input data,for the model internally updated at time i, step M, τ is the ordered sequence vector representing the policy, and M is the total number of internally updated steps.
The meta-update module error calculation is expressed as:
wherein the method comprises the steps ofRepresenting the initial state and transition probability of the Markov process,/for the Markov process>For the meta-loss function->Is a mathematical expectation of the corresponding element.
The meta-loss function is defined as:
compared with the traditional recommendation system method, the meta learning and reinforcement learning based method has the following advantages:
1. the first accurate recommendation is realized by comprehensively learning the behavior or preference of the user through the basic data of the user and the browsing records or purchasing records of the user and automatically extracting the hidden relationship between the personal behavior of the user and the possibly favorite articles.
2. By combining the user's reactions (like or dislike) to the recommended content, our recommendation strategy is dynamically adjusted.
3. Making a recommendation decision by using a reinforcement learning method, and learning an optimal recommendation scheme for a certain user by dynamically adjusting recommendation contents to the user through interaction of an agent (a decision maker of a recommendation system) and an environment (the user);
4. the algorithm is multi-parameter adjustable, the expansibility is strong, the system can continuously acquire rewards (like or dislike of the user), environment states (tour record or purchase record of the current user) and take actions (recommended content) through interaction with the user, the algorithm can be set according to the problem requirement, and the portability of the algorithm is good.
Drawings
FIG. 1 is an interaction diagram with user data in model training;
FIG. 2 model global update strategy;
Detailed Description
The following is a preferred embodiment of the present invention and a technical solution of the present invention is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
The invention is based on meta learning and reinforcement learning methods, through learning basic information of users and tour records or purchase records, extracting hidden rules therein, recommending the users, and taking the reactions (like or dislike) of the users to recommended contents as model feedback, thereby adjusting the strategy learned by the model and giving new recommended contents for the users.
The overall flow of the model is shown in fig. 2, firstly, a recommended strategy is generated by the model for the characteristic data input of a user, then the error is calculated according to the strategy, then the parameters of the model are optimized through an optimization method in the internal updating process to obtain the next strategy, the final strategy is obtained according to the set internal updating step number, finally, the error is generated by the feedback of the input user on the recommended content, then the initial model is derived, the new model is obtained by updating, and the training is finished until the last time.
Because the method of our system adopts different strategies during training and working, the training is that for a certain user, the browsing behavior (like or dislike) after each recommendation of content is changed (T i ,T i+1 ) Corresponding recommended content Y i The parameter theta and the learning rate alpha of the model are updated together by meta learning and gradient descent methods. The learned strategy is changed because the feedback of the user on the recommended content cannot be obtained immediately when the method is executed. When in execution: obtaining basic data and browsing data T of a certain user i Then make a recommendation content Y for the user i And updated according to the weight of each different recommendation.
First, the data involved in the scheme are defined:
the basic data of the user is recorded as X= { X 1 ,...,X n }, wherein each data point X i Represents the ith piece of data, each piece of data represents the behavior record (like or dislike articles or purchase history, etc.) of the user at a certain moment, X i ={x i1 ,...,x im -each x ij The j-th feature of the i-th data is represented, and each data has m features. The set to be recommended is recorded as y= { Y 1 ,...,y k }, wherein each data point y i The following data definition, representing the i-th item or content, defines the recommendation system model proposed by the present invention:
1. state space: a vector X with length of m, each component value of the vector represents the index value corresponding to the user basic information and each index in the history record in the current state;
2. action space: a vector A, A of length k i E {0,1}, 1.ltoreq.i.ltoreq.k, corresponding to whether the content recommended in the current state contains the item Y i If 0, it means that the article is not contained, and if 1, it means that the article is contained;
3. rewarding: a scalar R represents the feedback of the user after receiving action a decided by the recommender system decision maker.
4. Strategy sequence: user features X i And corresponding recommended content Y i Excitation R i Ordered sequence of (2)
an internal updating module:
one of the main parts of the whole system, the model theta first accepts each X of the user i Item characteristic data X, corresponding prescription data Y are predicted, and then a gradient descent method is used for optimizing the model to obtain a model phi after fast adaptation:
the above-mentioned M is the number of steps of an internal update,the learning rate of each step in the internal updating process is optimized along with the parameters of the model in the whole training process. The derivation of LT we use a strategy gradient method to optimize:
and a meta updating module:
the meta update (meta update) module is used for checking a generalization capability of the model, and the behavior of the user is dynamically changed continuously, so that the model is adapted to the dynamic environment better, and the best recommendation suggestion can be provided for the user when the user index is changed in a variety.
In the meta-update phase we have to realize that the next prediction for the user after the recommendation can reach the maximum incentive R. We consider the user's index changes as a markov process, and we can refer to the model as a whole and the interaction with the data as shown in fig. 1.
The goal of the meta-update phase is to minimize the error in the recommendations provided throughout the phase, expressed as:
wherein the method comprises the steps ofRepresented are the initial state and transition probabilities of the markov process. Note that our model is divided into two layers, the upper one is model parameters that are optimized continuously according to input, and the lower one is dynamic variation of each index of the user (we regard it as a markov decision process). In addition, we have defined the meta-loss function (metaloss) is between two successive states before and after the user recommendation:
it can be understood that we use the current input T i To change the model to enable the user to have various feature data T i+1 The most efficient recommendation suggestions are generated.
The whole flow is as follows: first modelCharacteristic data input T for primary user i Generating a recommended strategy τ i Further, the error is determined according to this strategy>Then optimizing the parameters of the model by means of the optimization method mentioned in the internal update procedure to get +.>Obtaining final step number M according to the set internal update, and generating error by inputting user feedback to recommended content>Then ∈the initial model>Conduct derivative and update to get new model +.>One training is finished
Output module of recommendation system based on meta learning and reinforcement learning:
after model training is completed, the system firstly receives the characteristic data of the user, then operates the model to recommend push content for the user, and the output of the model is combinedThe result is a motion vector A, A i E {0,1}, 1.ltoreq.i.ltoreq.k, k being the length of the set to be recommended, A i If 1 indicates that content Y is currently recommended for the user i Otherwise, not recommending content Y for the user i And then pushing corresponding contents to the user according to the result of the model, collecting feedback (like or dislike) of the user on the contents, calculating a recommendation reward R according to the feedback, increasing R if the recommended contents are like by the user, decreasing R if the user does not like, and adjusting model parameters (strategy adjustment) according to the recommendation result if no feedback R is a default value. And then the next recommendation is made.
Claims (1)
1. A recommendation system based on meta learning and reinforcement learning is characterized in that:
step one: inputting user basic information and a tour record or a purchase record, defining an internal updating module and a meta updating module, and forming a system model by the internal updating module and the meta updating module;
the internal updating module firstly receives various characteristic data of a user in a period of time before, wherein the various characteristic data comprise feedback of the user on the data, a gradient descent method is used for optimizing the model to obtain a model which is quickly adapted, and then the various characteristic data of the user at present are received;
the meta-updating module receives the model finally updated by the internal updating module, obtains feedback of the user on recommended content, calculates various index changes of the user through a Markov process, maximizes the recommended excitation energy provided in the whole stage through a defined meta-loss function, and finally updates the model to obtain a new model;
step two: the system model is subjected to model correction by the internal updating module through a gradient descent method by combining the previous recommended content of the user and the feedback generated by the user, and personalized adaptation is performed;
step three: receiving current characteristic data of a user through the internal updating module, and outputting the current characteristic data by using the strategy corrected in the step two, wherein the output is the recommendation of new content to the user;
step four: calculating and obtaining recommended motivation by inputting feedback of a user on recommended content in the step three, wherein the feedback is a label record of whether the recommended content is liked or not;
step five: after obtaining the excitation obtained after the recommendation in the step four, conducting derivative updating on the initial model;
in the internal updating module, the model theta firstly accepts each X of the user i The item characteristic data X are predicted to be corresponding recommended data Y, and then the model is optimized by a gradient descent method to obtain a rapidly-adapted model phi, wherein the basic data of a user are X= { X 1 ,...,X n }, wherein each data point X i Representing the ith data, each data representing the behavior record of the user at a certain moment, X i ={x i1 ,...,x im -each x ij The j features of the ith data are represented, each data has m features, and the set to be recommended is recorded as Y= { Y 1 ,...,y k }, wherein each data point y i Data representing the ith item or content, then the recommendation system model is defined as follows: the state space is a vector X with the length of m, and each component value of the vector represents the index value corresponding to each index in the user basic information and the history record in the current state; the motion space is a vector A, A with a length k i E {0,1}, 1.ltoreq.i.ltoreq.k, corresponding to whether the content recommended in the current state contains the item Y i If 0, it means that the article is not contained, and if 1, it means that the article is contained; the reward is a scalar R, which represents feedback of the user after receiving the action A decided by the decision maker of the recommendation system; the policy sequence is the user's various characteristics X i And corresponding recommended content Y i Excitation R i Ordered sequence of (2)
The internal update module may be expressed as:
Wherein, theta is a model,for the model internally updated at the moment i and the mth step, an ordered sequence vector representing a strategy, and M is the total number of internally updated steps;
the meta-update module error calculation is expressed as:
wherein the method comprises the steps ofRepresenting the initial state and transition probability of the Markov process,/for the Markov process>For the meta-loss function->Representing mathematical expectations of corresponding elements;
the meta-loss function is defined as:
the training process of the model is as follows: first, for an internal update module, a modelCharacteristic data input T for primary user i Generating a recommendation strategy τ i Further, the error is determined according to this strategy>Then optimizing the parameters of the model by means of the optimization method mentioned in the internal update procedure to get +.>Obtaining a final +.>Finally, for the meta-update module, by inputting the user feedback T of the recommended content i+1 To generate error->Then ∈the initial model>Conduct derivative and update to get new model +.>/>
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911393658.2A CN111199458B (en) | 2019-12-30 | 2019-12-30 | Recommendation system based on meta learning and reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911393658.2A CN111199458B (en) | 2019-12-30 | 2019-12-30 | Recommendation system based on meta learning and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111199458A CN111199458A (en) | 2020-05-26 |
CN111199458B true CN111199458B (en) | 2023-06-02 |
Family
ID=70746290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911393658.2A Active CN111199458B (en) | 2019-12-30 | 2019-12-30 | Recommendation system based on meta learning and reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111199458B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112180726A (en) * | 2020-09-29 | 2021-01-05 | 北京航空航天大学 | Spacecraft relative motion trajectory planning method based on meta-learning |
CN112417319B (en) * | 2020-11-24 | 2022-11-18 | 清华大学 | Site recommendation method and device based on difficulty sampling meta-learning |
CN112509392B (en) * | 2020-12-16 | 2022-11-29 | 复旦大学 | Robot behavior teaching method based on meta-learning |
CN112597391B (en) * | 2020-12-25 | 2022-08-12 | 厦门大学 | Dynamic recursion mechanism-based hierarchical reinforcement learning recommendation system |
CN113031520B (en) * | 2021-03-02 | 2022-03-22 | 南京航空航天大学 | Meta-invariant feature space learning method for cross-domain prediction |
CN113158086B (en) * | 2021-04-06 | 2023-05-05 | 浙江贝迩熊科技有限公司 | Personalized customer recommendation system and method based on deep reinforcement learning |
CN115017418B (en) * | 2022-08-10 | 2022-11-01 | 北京数慧时空信息技术有限公司 | Remote sensing image recommendation system and method based on reinforcement learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537114A (en) * | 2015-01-21 | 2015-04-22 | 清华大学 | Individual recommendation method |
CN108230057A (en) * | 2016-12-09 | 2018-06-29 | 阿里巴巴集团控股有限公司 | A kind of intelligent recommendation method and system |
CN109919299A (en) * | 2019-02-19 | 2019-06-21 | 西安交通大学 | A kind of meta learning algorithm based on meta learning device gradually gradient calibration |
CN109978660A (en) * | 2019-03-13 | 2019-07-05 | 南京航空航天大学 | A kind of recommender system off-line training method based on intensified learning frame |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10311467B2 (en) * | 2015-03-24 | 2019-06-04 | Adobe Inc. | Selecting digital advertising recommendation policies in light of risk and expected return |
-
2019
- 2019-12-30 CN CN201911393658.2A patent/CN111199458B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537114A (en) * | 2015-01-21 | 2015-04-22 | 清华大学 | Individual recommendation method |
CN108230057A (en) * | 2016-12-09 | 2018-06-29 | 阿里巴巴集团控股有限公司 | A kind of intelligent recommendation method and system |
CN109919299A (en) * | 2019-02-19 | 2019-06-21 | 西安交通大学 | A kind of meta learning algorithm based on meta learning device gradually gradient calibration |
CN109978660A (en) * | 2019-03-13 | 2019-07-05 | 南京航空航天大学 | A kind of recommender system off-line training method based on intensified learning frame |
Also Published As
Publication number | Publication date |
---|---|
CN111199458A (en) | 2020-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111199458B (en) | Recommendation system based on meta learning and reinforcement learning | |
CN110046304B (en) | User recommendation method and device | |
US20190392330A1 (en) | System and method for generating aspect-enhanced explainable description-based recommendations | |
CN111310063B (en) | Neural network-based article recommendation method for memory perception gated factorization machine | |
CN112074857A (en) | Combining machine learning and social data to generate personalized recommendations | |
WO2022016522A1 (en) | Recommendation model training method and apparatus, recommendation method and apparatus, and computer-readable medium | |
US11663661B2 (en) | Apparatus and method for training a similarity model used to predict similarity between items | |
Jiao et al. | A novel learning rate function and its application on the SVD++ recommendation algorithm | |
US20230316378A1 (en) | System and methods for determining an object property | |
Li et al. | Sparse online collaborative filtering with dynamic regularization | |
US20230206076A1 (en) | Graph structure aware incremental learning for recommender system | |
CN112699310A (en) | Cold start cross-domain hybrid recommendation method and system based on deep neural network | |
CN116830100A (en) | Neighborhood selection recommendation system with adaptive threshold | |
US20210201146A1 (en) | Computing device and operation method thereof | |
Pramod et al. | Conversational recommender systems techniques, tools, acceptance, and adoption: A state of the art review | |
KR20210012730A (en) | Learning method of artificial intelligence model and electronic apparatus | |
CN114936901A (en) | Visual perception recommendation method and system based on cross-modal semantic reasoning and fusion | |
CN114565436A (en) | Vehicle model recommendation system, method, device and storage medium based on time sequence modeling | |
CN112084415A (en) | Recommendation method based on analysis of long-term and short-term coupling relationship between user and project | |
Kalidindi et al. | Discrete Deep Learning Based Collaborative Filtering Approach for Cold Start Problem. | |
Jin et al. | Hybrid recommender system with core users selection | |
Le | MetaRec: Meta-Learning Meets Recommendation Systems | |
CN117290598A (en) | Method for constructing sequence recommendation model, sequence recommendation method and device | |
Tahir et al. | Movies Recommendation System Using Machine Learning Algorithms | |
CN117540080A (en) | Method and device for recommending articles in cold start mode for strengthening new user characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |