CN112612948B - Deep reinforcement learning-based recommendation system construction method - Google Patents

Deep reinforcement learning-based recommendation system construction method Download PDF

Info

Publication number
CN112612948B
CN112612948B (Application CN202011473950.8A)
Authority
CN
China
Prior art keywords
user
recommendation system
module
actor
items
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011473950.8A
Other languages
Chinese (zh)
Other versions
CN112612948A (en)
Inventor
石龙翔
金苍宏
李卓蓉
吴明晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Woerfu Information Technology Co ltd
Original Assignee
Hangzhou City University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou City University filed Critical Hangzhou City University
Priority to CN202011473950.8A
Publication of CN112612948A
Application granted
Publication of CN112612948B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q 10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a recommendation system construction method based on deep reinforcement learning, comprising the following steps: S1) establishing a feature characterization set of the interaction between the user and the recommendation system; S2) establishing a state representation of that interaction; S3) modeling the recommendation system; S4) training the recommendation system; S5) deploying the recommendation system. The advantages of the invention are: the action selection space of one-dimensional discrete items is mapped to a multi-dimensional continuous real-valued space, and the action space of the recommended items is simplified by a base-conversion method, which reduces the training difficulty of the recommendation system; the behavior characteristics of the user are modeled with a convolutional recurrent neural network, which improves the performance of the recommendation system.

Description

Deep reinforcement learning-based recommendation system construction method
Technical Field
The invention relates to the technical field of recommendation systems, in particular to a recommendation system construction method based on deep reinforcement learning.
Background
A recommendation system is an indispensable part of an intelligent e-commerce system; its main task is to recommend items a user may need based on the user's historical browsing data. Common recommendation methods include collaborative filtering, matrix factorization, and content-based ranking. However, these conventional methods typically model user preference as a static process and recommend through greedy ranking, so they cannot account for the dynamic change of user preference. Recent studies show that recommendation can be modeled as a reinforcement learning problem: by taking the maximization of the total score the user may give in the future as the optimization target, the user's preference can be modeled dynamically and the performance of the recommendation system improved.
However, owing to characteristics of the recommendation setting itself, two problems are commonly encountered when constructing a recommendation system with reinforcement learning: 1) the number of recommendable items is usually huge, so the optimization difficulty caused by the enormous action space must be addressed; 2) the interaction between the user and the recommendation system produces time-series feedback, and how to extract this temporal information, model the user's observed state, and characterize the user's behavior is a key problem for recommendation performance.
Disclosure of Invention
The object of the invention is to provide a recommendation system construction method based on deep reinforcement learning that simplifies the action space by mapping the one-dimensional discrete item selection space to a multi-dimensional continuous real-valued space, and improves recommendation performance by modeling the user's behavior characteristics with a convolutional recurrent neural network.
To achieve this object, the invention is realized by the following technical scheme:
a recommendation system construction method based on deep reinforcement learning comprises the following steps:
S1) Establishing a feature characterization set of user interaction with the recommendation system
The interaction process between the user and the recommendation system is represented by a Markov decision process <S, A, P, R>, where S is the state set, i.e. the feature characterization set of the user's interaction with the recommendation system; A is the set of items selectable by the user, and the item selected at time t is denoted a_t; P(s_{t+1} | s_t, a_t) is the state transition function, i.e. the probability of reaching state s_{t+1} at the next moment after the user selects item a_t in the current state s_t; and R(s_{t+1} | s_t, a_t) is the reward function, i.e. the score the user gives after selecting item a_t in the current state s_t. The optimization goal of the recommendation system is to maximize the user's expected total score, i.e.

$$\max \; \mathbb{E}\left[\sum_{t=0}^{T} R(s_{t+1} \mid s_t, a_t)\right]$$

where T is the termination time;
S2) Establishing the state representation of the interaction between the user and the recommendation system

A user scoring matrix with U rows and M columns is built from the historical records of all users, where U is the total number of users and M is the total number of recommendable items; the entry in row i, column j is the score given by the i-th user to the j-th item, set to 0 if the user has not rated that item, and each column of the matrix is used as the feature of the corresponding item;

For each user, all items selected before the current moment and their scores are extracted from the history of the user's interaction with the recommendation system (i.e. the items the user selected and the scores the user gave) and fed into a recurrent neural network; the outputs of all LSTM (long short-term memory) units are processed by a convolutional layer, and the output of the convolutional layer is processed by maximum pooling and mean pooling to obtain the state representation of the user's interaction with the recommendation system;

The parameters of the neural network representing the user state are learned through the training process of step S4);
S3) Modeling of the recommendation system

The recommendation system is modeled with an Actor-Critic framework: the actor module takes the currently observed user state as input and outputs a recommended item based on that state; the critic module receives the current state and the representation of the selected item as input, and is responsible for evaluating selectable items under the currently observed user state, i.e. estimating the expected return that may be received in the future;

Both the actor module and the critic module are represented by neural networks;

To handle the very large number of recommendable items, the index codes of the items are mapped to continuous real values in base B, so that the action output dimension of the actor module is log_B(M), with each output value in the range [0, B-1];
S4) Training of the recommendation system

The actor network and the critic network are trained on the users' historical data with the deep deterministic policy gradient algorithm (DDPG);

Specifically, the rating data of each user in the training set is first processed into quadruples <s_t, a_t, r_t, s_{t+1}>, where s_t is the user's state at time t, a_t is the base-B code of the item the user selected, r_t is the user's score, and s_{t+1} is the state at time t+1; the quadruples are then stored in the experience replay buffer of DDPG, and the actor network and the critic network are trained;
S5) Deployment of the recommendation system

On the online deployment platform, the actor network trained in step S4) is used to recommend items according to the user's current state, combining the outputs of the actor module and the critic module;

Specifically, the trained actor module outputs an item code for the user's current state; this code is matched against the codes of all items with a similarity measure, the K most similar items are selected, the critic module's value for each of these items in the current state is evaluated, and the N items with the largest critic output are returned to the user as recommendations.
Compared with the prior art, the invention has the following advantages:
according to the recommendation system construction method based on deep reinforcement learning, action selection spaces of one-dimensional discrete items are mapped to a multi-dimensional continuous real-valued space, and an action space of recommended items is simplified by adopting a method of binary conversion, so that the difficulty of training of a recommendation system is reduced; the behavior characteristics of the user are modeled by adopting the convolution recurrent neural network, so that the performance of the recommendation system is improved.
Drawings
FIG. 1 is a schematic diagram illustrating a state representation of user interaction with a recommendation system in an embodiment of the present invention.
FIG. 2 is a schematic diagram of the actor module used to model the recommendation system in an embodiment of the invention.
FIG. 3 is a schematic diagram of the critic module used to model the recommendation system in an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
A recommendation system construction method based on deep reinforcement learning comprises the following steps:
S1) Establishing a feature characterization set of user interaction with the recommendation system

The interaction process between the user and the recommendation system is represented by a Markov decision process <S, A, P, R>, where S is the state set, i.e. the feature characterization set of the user's interaction with the recommendation system; A is the set of items selectable by the user, and the item selected at time t is denoted a_t; P(s_{t+1} | s_t, a_t) is the state transition function, i.e. the probability of reaching state s_{t+1} at the next moment after the user selects item a_t in the current state s_t; and R(s_{t+1} | s_t, a_t) is the reward function, i.e. the score the user gives after selecting item a_t in the current state s_t. The optimization goal of the recommendation system is to maximize the user's expected total score, i.e.

$$\max \; \mathbb{E}\left[\sum_{t=0}^{T} R(s_{t+1} \mid s_t, a_t)\right]$$

where T is the termination time.
S2) Establishing the state representation of the interaction between the user and the recommendation system

A user scoring matrix with U rows and M columns is built from the historical records of all users, where U is the total number of users and M is the total number of recommendable items; the entry in row i, column j is the score given by the i-th user to the j-th item, set to 0 if the user has not rated that item, and each column of the matrix is used as the feature of the corresponding item;
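As a minimal illustrative sketch of this step (assuming the history is available as (user_id, item_id, rating) triples with 0-based ids; the function and field names are hypothetical, not fixed by the patent), the scoring matrix and the per-item feature columns can be assembled as follows:

    import numpy as np

    def build_rating_matrix(records, num_users, num_items):
        """Build the U x M user scoring matrix described in step S2).

        `records` is assumed to be an iterable of (user_id, item_id, rating)
        triples; entries the user has not rated stay at 0.
        """
        R = np.zeros((num_users, num_items), dtype=np.float32)
        for user_id, item_id, rating in records:
            R[user_id, item_id] = rating
        # Each column of R is used as the feature vector of the corresponding item.
        item_features = R.T  # shape: (M, U)
        return R, item_features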
For each user, all items selected before the current moment and their scores are extracted from the history of the user's interaction with the recommendation system (i.e. the items the user selected and the scores the user gave) and fed into a recurrent neural network; the outputs of all LSTM (long short-term memory) units are processed by a convolutional layer, and the output of the convolutional layer is processed by maximum pooling and mean pooling to obtain the state representation of the user's interaction with the recommendation system;

The parameters of the neural network representing the user state are learned through the training process of step S4), and this network can be regarded as part of the actor-critic framework. The state representation of a user's interaction with the recommendation system is shown in FIG. 1.
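A possible PyTorch sketch of such a state encoder, under the assumption that each time step is the concatenation of an item's feature column and the user's score; the class name, layer sizes, and kernel size are illustrative choices, not values specified in the patent:

    import torch
    import torch.nn as nn

    class StateEncoder(nn.Module):
        """Encodes a user's interaction history into a state vector (step S2)."""

        def __init__(self, item_feat_dim, hidden_dim=64, conv_channels=32):
            super().__init__()
            # Input per time step: item feature vector plus the user's score.
            self.lstm = nn.LSTM(item_feat_dim + 1, hidden_dim, batch_first=True)
            self.conv = nn.Conv1d(hidden_dim, conv_channels, kernel_size=3, padding=1)

        def forward(self, history):
            # history: (batch, seq_len, item_feat_dim + 1)
            outputs, _ = self.lstm(history)             # (batch, seq_len, hidden_dim)
            feats = self.conv(outputs.transpose(1, 2))  # (batch, conv_channels, seq_len)
            max_pool = feats.max(dim=2).values          # (batch, conv_channels)
            mean_pool = feats.mean(dim=2)               # (batch, conv_channels)
            # Concatenated max- and mean-pooled features form the state representation.
            return torch.cat([max_pool, mean_pool], dim=1)

The output of this encoder is what the actor and critic modules below consume as the "currently observed user state".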
S3) Modeling of the recommendation system

The recommendation system is modeled with an Actor-Critic framework: the actor module takes the currently observed user state as input and outputs a recommended item based on that state; the critic module receives the current state and the representation of the selected item as input, and is responsible for evaluating selectable items under the currently observed user state, i.e. estimating the expected return that may be received in the future, namely

$$Q(s_t, a_t) = \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{\,k-t}\, r_k \;\middle|\; s_t, a_t\right]$$

where γ is the discount factor used in training (step S4)).

Both the actor module and the critic module are represented by neural networks;

To handle the very large number of recommendable items, the index codes of the items are mapped to continuous real values in base B, so that the action output dimension of the actor module is log_B(M), with each output value in the range [0, B-1]; the actor module thus outputs the base-B code of the recommended item. The actor module and the critic module are shown in FIG. 2 and FIG. 3, respectively.
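The base-B item coding can be illustrated with a small helper. This is a sketch of one plausible reading: the patent only fixes that each item index becomes roughly log_B(M) values in [0, B-1], so the integer-digit encoding below (and the rounding used when decoding the actor's continuous output) is an assumption:

    import math

    def encode_item(index, num_items, base):
        """Map a 0-based item index to its base-B digit vector (length ceil(log_B M))."""
        width = max(1, math.ceil(math.log(num_items, base)))
        digits = []
        for _ in range(width):
            digits.append(index % base)
            index //= base
        return list(reversed(digits))  # each digit lies in [0, base - 1]

    def decode_item(digits, base):
        """Inverse mapping: round each (possibly continuous) digit and rebuild the index."""
        index = 0
        for d in digits:
            index = index * base + int(round(d))
        return index

    # Example: with M = 4000 items and B = 10, each item becomes a 4-dimensional
    # action vector whose entries range over [0, 9].
    assert decode_item(encode_item(1234, 4000, 10), 10) == 1234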
S4) Training of the recommendation system

The actor network and the critic network are trained on the users' historical data with the deep deterministic policy gradient algorithm (DDPG);

Specifically, the rating data of each user in the training set is first processed into quadruples <s_t, a_t, r_t, s_{t+1}>, where s_t is the user's state at time t, a_t is the base-B code of the item the user selected, r_t is the user's score, and s_{t+1} is the state at time t+1; the quadruples are then stored in the experience replay buffer of DDPG, and the actor network and the critic network are trained;

The detailed training procedure is given in Algorithm 1, whose hyperparameters are: the discount factor γ, the number of learning iterations v, the number of learning samples w, and the soft-update factor τ.

[Algorithm 1: DDPG training of the actor and critic networks (shown as an image in the original document).]
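Since Algorithm 1 is only available as an image here, the following is a schematic DDPG update step consistent with the description above (experience replay of the <s_t, a_t, r_t, s_{t+1}> quadruples, critic regression to a bootstrapped target with discount γ, deterministic policy-gradient update of the actor, and soft target updates with factor τ). The network/optimizer objects and the representation of the replay buffer as a Python list of tensor quadruples are assumptions, not the patent's exact procedure:

    import random
    import torch
    import torch.nn.functional as F

    def ddpg_train_step(actor, critic, actor_target, critic_target,
                        actor_opt, critic_opt, replay_buffer,
                        gamma=0.99, batch_size=64, tau=0.005):
        """One DDPG update over a minibatch of <s_t, a_t, r_t, s_{t+1}> quadruples."""
        batch = random.sample(replay_buffer, batch_size)
        s, a, r, s_next = (torch.stack(x) for x in zip(*batch))

        # Critic update: regress Q(s, a) toward r + gamma * Q'(s', pi'(s')).
        with torch.no_grad():
            target_q = r + gamma * critic_target(s_next, actor_target(s_next)).squeeze(-1)
        critic_loss = F.mse_loss(critic(s, a).squeeze(-1), target_q)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor update: maximize the critic's value of the actor's own action.
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft update of the target networks with factor tau.
        for net, target in ((actor, actor_target), (critic, critic_target)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1 - tau).add_(tau * p.data)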
S5) Deployment of the recommendation system

On the online deployment platform, the actor network trained in step S4) is used to recommend items according to the user's current state, combining the outputs of the actor module and the critic module;

Specifically, on the online deployment platform the trained actor module outputs an item code for the user's current state; this code is matched against the codes of all items with a similarity measure (Euclidean distance or cosine similarity may be used), the K items with the largest similarity are selected, the critic module's value for each of these items in the current state is evaluated, and the N items with the largest critic output are returned to the user as recommendations. The recommendation procedure is given in Algorithm 2.

[Algorithm 2: online recommendation procedure (shown as an image in the original document).]
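Algorithm 2 is likewise only given as an image; a sketch of the described recommendation step (actor output, cosine-similarity match against all item codes, keep the K most similar, rank them by critic value, return the top N) might look like the following. Cosine similarity is chosen here for illustration (the patent also allows Euclidean distance), and the tensor shapes are assumptions:

    import torch
    import torch.nn.functional as F

    def recommend(actor, critic, state, item_codes, k=50, n=10):
        """Recommend N items for one user state (step S5).

        `state` is the encoded user state, shape (1, state_dim);
        `item_codes` holds the base-B codes of all M items, shape (M, code_dim).
        """
        with torch.no_grad():
            action = actor(state)                                  # (1, code_dim)
            sims = F.cosine_similarity(action, item_codes, dim=1)  # (M,)
            top_k = sims.topk(k).indices                           # K most similar items
            states = state.expand(k, -1)
            q_values = critic(states, item_codes[top_k]).squeeze(-1)
            best = q_values.topk(n).indices
            return top_k[best].tolist()                            # ids of the N recommended items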
Experimental results: experiments were performed on the MovieLens-1M dataset (https://grouplens.org/datasets/movielens/1m/), which is commonly used to evaluate recommendation systems and contains about 1,000,000 ratings given by about 6,000 users to about 4,000 movies. 80% of the data were used as the training set and 20% as the test set.
The experimental results of the invention were compared with several other recommendation methods, including popularity-based recommendation and Singular Value Decomposition (SVD). As shown in Table 1, the comparison reports precision and recall over the top 30 recommended items; higher values indicate better recommendation quality, and the proposed method clearly outperforms both baselines, demonstrating its effectiveness.
TABLE 1. Comparison of the proposed method with other recommendation methods

Recommendation method              Precision    Recall
Popularity-based recommendation    32.4%        0.713%
SVD                                30.3%        0.558%
Proposed method                    45.7%        2.12%
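For reference, the precision and recall over the top 30 items reported in Table 1 are typically computed per user and then averaged; a sketch of the per-user metric (the definition of which held-out ratings count as "relevant" is an assumption, since the patent does not state a threshold):

    def precision_recall_at_k(recommended, relevant, k=30):
        """Precision@k and recall@k for one user.

        `recommended` is the ranked list of item ids returned by the system;
        `relevant` is the set of item ids the user actually rated positively
        in the test split.
        """
        top_k = recommended[:k]
        hits = sum(1 for item in top_k if item in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall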
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and improvements can be made without departing from the spirit of the present invention, and these modifications and improvements should also be considered as within the scope of the present invention.

Claims (3)

1. A recommendation system construction method based on deep reinforcement learning is characterized by comprising the following steps:
S1) Establishing a feature characterization set of the interaction between the user and the recommendation system

The interaction process between the user and the recommendation system is represented by a Markov decision process <S, A, P, R>, where S is the state set, i.e. the feature characterization set of the user's interaction with the recommendation system; A is the set of items selectable by the user, and the item selected at time t is denoted a_t; P(s_{t+1} | s_t, a_t) is the state transition function, i.e. the probability of reaching state s_{t+1} at the next moment after the user selects item a_t in the current state s_t; R(s_{t+1} | s_t, a_t) is the reward function, i.e. the score the user gives after selecting item a_t in the current state s_t; and the optimization goal of the recommendation system is to maximize the user's expected total score;
S2) Establishing the state representation of the interaction between the user and the recommendation system
Establishing a user scoring matrix with U rows and M columns according to the historical records of all users, wherein U is the total number of the users, M is the total number of recommended items, the ith row and the jth column of the scoring matrix represent the scoring of the ith user on the jth item, if the user does not evaluate the item, the scoring is set to be 0, and each column of the matrix is used as the characteristic of each recommended item;
Extracting the user's historical item selections and scores from the records of each user's interaction with the recommendation system, modeling each historical item with a recurrent neural network, processing the output of the recurrent neural network with a convolutional neural network, and processing the convolutional features with a pooling layer to obtain the feature representation of the user's interaction with the recommendation system;
S3) Modeling of the recommendation system

The recommendation system is modeled with an Actor-Critic framework, in which the actor module takes the currently observed user state as input and outputs a recommended item based on that state, and the critic module receives the current state and the representation of the selected item as input and is responsible for evaluating selectable items under the currently observed user state, i.e. estimating the expected return that may be received in the future;

The actor module and the critic module are both represented by neural networks; the actor module maps the items of the recommendation system by base conversion, representing the code of each recommendable item in base B, so that the output of the actor neural network consists of log_B(M) real values, each in the range [0, B-1];
S4) Training of the recommendation system

Training the actor network and the critic network on the users' historical data with the deep deterministic policy gradient algorithm (DDPG);
S5) Deployment of the recommendation system

On the online deployment platform, recommending items according to the user's current state by using the actor network trained in step S4) and combining the output values of the actor module and the critic module.
2. The deep reinforcement learning-based recommendation system construction method according to claim 1, wherein:
in the step S4), the method for training the actor module and the critic module using the historical data includes: extracting feature S of the current time from the score data of each user according to the feature representation method of step S2)tAnd the feature s of the next timet+1Current user rating rtAs the instant report, the index number of the item selected by the user is converted into B-system code as the action a, and the action a is stored<st,at,rt,st+1>To DDPG algorithmAnd training the actor module and the critics module by using a DDPG algorithm.
3. The deep reinforcement learning-based recommendation system construction method according to claim 1, wherein:
in step S5), the deployment method of the recommendation system includes: and obtaining codes of the selected items according to the current state input to the actor module, calculating the similarity between the B-system codes of all recommended items and the output codes of the actor module by using a similarity evaluation method, selecting K items with the maximum similarity, evaluating the output value of the critic module in the current state, and selecting N items with the maximum output of the critic module as recommended items to be output to a user.
CN202011473950.8A 2020-12-14 2020-12-14 Deep reinforcement learning-based recommendation system construction method Active CN112612948B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011473950.8A CN112612948B (en) 2020-12-14 2020-12-14 Deep reinforcement learning-based recommendation system construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011473950.8A CN112612948B (en) 2020-12-14 2020-12-14 Deep reinforcement learning-based recommendation system construction method

Publications (2)

Publication Number Publication Date
CN112612948A CN112612948A (en) 2021-04-06
CN112612948B true CN112612948B (en) 2022-07-08

Family

ID=75234216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011473950.8A Active CN112612948B (en) 2020-12-14 2020-12-14 Deep reinforcement learning-based recommendation system construction method

Country Status (1)

Country Link
CN (1) CN112612948B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883285B (en) * 2021-04-28 2021-08-13 北京搜狐新媒体信息技术有限公司 Information recommendation method and device
CN113449183B (en) * 2021-06-18 2022-07-12 华中科技大学 Interactive recommendation method and system based on offline user environment and dynamic rewards
CN114417124A (en) * 2021-11-30 2022-04-29 哈尔滨工程大学 Multi-task reinforcement learning recommendation method
CN116932893B (en) * 2023-06-21 2024-06-04 江苏大学 Sequence recommendation method, system, equipment and medium based on graph convolution network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062919A (en) * 2018-05-31 2018-12-21 腾讯科技(深圳)有限公司 A kind of content recommendation method and device based on deeply study
US10229205B1 (en) * 2016-10-14 2019-03-12 Slack Technologies, Inc. Messaging search and management apparatuses, methods and systems
CN110569443A (en) * 2019-03-11 2019-12-13 北京航空航天大学 Self-adaptive learning path planning system based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059262B (en) * 2019-04-19 2021-07-02 武汉大学 Project recommendation model construction method and device based on hybrid neural network and project recommendation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229205B1 (en) * 2016-10-14 2019-03-12 Slack Technologies, Inc. Messaging search and management apparatuses, methods and systems
CN109062919A (en) * 2018-05-31 2018-12-21 腾讯科技(深圳)有限公司 A kind of content recommendation method and device based on deeply study
CN110569443A (en) * 2019-03-11 2019-12-13 北京航空航天大学 Self-adaptive learning path planning system based on reinforcement learning
CN111860649A (en) * 2020-07-21 2020-10-30 赵佳 Action set output method and system based on multi-agent reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Recommendation Systems Based on Deep Reinforcement Learning; Liu Yangjun (刘洋军); China Master's and Doctoral Dissertations Full-text Database, Information Science and Technology; 2020-01-15; Chapter 3 (p. 20) to Chapter 5 (p. 61) *

Also Published As

Publication number Publication date
CN112612948A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112612948B (en) Deep reinforcement learning-based recommendation system construction method
CN106021364B (en) Foundation, image searching method and the device of picture searching dependency prediction model
CN111538908B (en) Search ranking method and device, computer equipment and storage medium
CN111444394B (en) Method, system and equipment for obtaining relation expression between entities and advertisement recall system
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
CN103116639B (en) Based on article recommend method and the system of user-article bipartite graph model
CN108573399B (en) Merchant recommendation method and system based on transition probability network
CN104268142B (en) Based on the Meta Search Engine result ordering method for being rejected by strategy
CN110647995A (en) Rule training method, device, equipment and storage medium
CN112131261A (en) Community query method and device based on community network and computer equipment
CN112529415A (en) Article scoring method based on combined multi-receptive-field-map neural network
CN112612951A (en) Unbiased learning sorting method for income improvement
CN109214882A (en) A kind of Method of Commodity Recommendation
CN117635238A (en) Commodity recommendation method, device, equipment and storage medium
CN111079011A (en) Deep learning-based information recommendation method
CN113342994B (en) Recommendation system based on non-sampling cooperative knowledge graph network
CN118014652A (en) Advertisement creative design method and system based on artificial intelligence and big data analysis technology
CN113034193A (en) Working method for modeling of APP2VEC in wind control system
CN115455279A (en) Recommendation method based on deep reinforcement learning recommendation system with high calculation efficiency
He et al. Semisupervised network embedding with differentiable deep quantization
CN112905591B (en) Data table connection sequence selection method based on machine learning
CN115422396A (en) Implicit video recommendation algorithm
CN114880490A (en) Knowledge graph completion method based on graph attention network
CN111967600B (en) Feature derivation method based on genetic algorithm in wind control scene
CN114330291A (en) Text recommendation system based on dual attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240314

Address after: Room B407, Building 2, No. 452, 6th Street, Baiyang Street, Hangzhou Economic and Technological Development Zone, Zhejiang Province, 310018

Patentee after: Hangzhou Woerfu Information Technology Co.,Ltd.

Country or region after: China

Address before: 310000 No.51 Huzhou street, Gongshu District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU City University

Country or region before: China
