CN113626720A - Recommendation method and device based on action pruning, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113626720A
Authority
CN
China
Prior art keywords
regret
candidate
score
value
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111185124.8A
Other languages
Chinese (zh)
Other versions
CN113626720B (en)
Inventor
张俊格
白栋栋
黄凯奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111185124.8A priority Critical patent/CN113626720B/en
Publication of CN113626720A publication Critical patent/CN113626720A/en
Application granted granted Critical
Publication of CN113626720B publication Critical patent/CN113626720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a recommendation method and device based on action pruning, an electronic device, and a storage medium. The recommendation method comprises: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the corresponding state and a score prediction model, and recommending to the target user based on those scores. The score prediction model is obtained through reinforcement learning. During reinforcement learning, the score prediction model obtains the regret values of the candidate scores in the current sample state from a regret value set and performs score prediction based only on the candidate scores whose regret values are larger than a preset threshold. The regret value set stores historical states and the regret values corresponding to those historical states; the regret values are determined based on the advantages of the candidate scores in the historical states, and the historical states are sample states preceding the current sample state. This accelerates the convergence of reinforcement learning and enables personalized, accurate recommendation for users.

Description

Recommendation method and device based on action pruning, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a recommendation method and device based on action pruning, an electronic device and a storage medium.
Background
Because reinforcement learning can perceive a dynamic environment and obtain rewards from the environment so as to continually adapt to it, it is particularly suitable for business scenarios involving interaction, such as recommending content to a user.
However, in practical applications, conventional reinforcement learning algorithms perform policy improvement based only on the immediate rewards obtained from the environment, converge slowly, and are suitable only for learning tasks with small-scale action spaces. How to increase the convergence rate of reinforcement learning is therefore an important problem that the industry urgently needs to solve.
Disclosure of Invention
The invention provides a recommendation method and device based on action pruning, electronic equipment and a storage medium, which are used for solving the defect of low convergence speed of reinforcement learning in the prior art and realizing the improvement of the convergence speed of the reinforcement learning.
The invention provides a recommendation method based on action pruning, which comprises the following steps:
determining the corresponding state of each content to be recommended based on the user characteristics of the target user and the content characteristics of each content to be recommended;
predicting the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and recommending the contents to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of candidate scores in the current sample state from a regret value set, and performs scoring prediction based on the candidate scores with the regret values larger than a preset threshold value, wherein the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of the candidate scores in the historical states, and the historical states are sample states before the current sample state.
According to the action pruning-based recommendation method provided by the invention, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set, and carries out scoring prediction based on the candidate score with the regret value larger than the preset threshold, and the method comprises the following steps:
querying the current sample state in the regret value set;
if the current sample state exists in the regret value set, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, and scoring prediction is carried out based on the candidate scores with the regret values larger than a preset threshold;
otherwise, the scoring prediction model adds the regret value of each candidate score in the current sample state in the regret value set, sets the regret value of each added candidate score as an initial value, and performs scoring prediction based on each candidate score.
According to the action pruning-based recommendation method provided by the invention, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than the preset threshold, and then the action pruning-based recommendation method further comprises the following steps:
and the score prediction model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
According to the action pruning-based recommendation method provided by the present invention, the updating each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score includes:
if the regret value of any candidate score in the current sample state in the regret value set is larger than the preset threshold, superimposing the stored regret value of that candidate score with its current regret value to obtain the updated regret value of that candidate score;
and if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, not updating the regret value of that candidate score.
According to the recommendation method based on action pruning provided by the invention, the score prediction based on the candidate score with the regret value larger than the preset threshold value comprises the following steps:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantages of each current candidate score in the current sample state, wherein the current candidate score is a candidate score with the regret value larger than a preset threshold value;
and taking the current candidate score corresponding to the maximum value in the values of the current candidate scores as the current score.
According to the recommendation method based on action pruning provided by the invention, the regret value Re(s, a_i) of the i-th candidate score a_i in the history state s is determined from the value V(s) of the history state and the advantage A(s, a_i) of the i-th candidate score in the history state.
The invention also provides a recommendation device based on action pruning, which comprises the following components:
the determining module is used for determining the state corresponding to each content to be recommended based on the user characteristics of the target user and the content characteristics of each content to be recommended;
the recommendation module is used for predicting the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and recommending the contents to be recommended to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of candidate scores in the current sample state from a regret value set, and performs scoring prediction based on the candidate scores with the regret values larger than a preset threshold value, wherein the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of the candidate scores in the historical states, and the historical states are sample states before the current sample state.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the recommendation method based on action pruning.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for recommending based on action pruning as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the method for recommending based on action pruning as defined in any of the above.
According to the action pruning-based recommendation method and device, the electronic device and the storage medium provided by the invention, the regret value set accumulates the regret values of the candidate scores in the state visited by the agent at each decision. During reinforcement learning, the score prediction model prunes the candidate scores with lower regret values based on the regret value set, so that action pruning increases the convergence rate and learning efficiency of reinforcement learning. The scores of the contents to be recommended are then obtained through the score prediction model, which enables personalized, accurate recommendation for different users and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a recommendation method based on action pruning according to the present invention;
FIG. 2 is a schematic flow chart of a method for determining a score prediction model according to the present invention;
FIG. 3 is a diagram of a reinforcement learning framework based on action pruning according to the present invention;
FIG. 4 is a schematic structural diagram of a recommending device based on action pruning provided by the invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Traditional reinforcement learning algorithms converge slowly in practical applications and are suitable only for learning tasks with small-scale action spaces. In view of this, the embodiment of the present invention provides a new technique in the field of reinforcement learning, namely an action pruning technique, which continuously prunes candidate actions with low regret values before the agent makes each decision, thereby reducing the agent's action space, improving the convergence rate of the reinforcement learning algorithm, and shortening the learning time. On this basis, the embodiment of the invention provides a recommendation method based on action pruning.
Fig. 1 is a schematic flow chart of a recommendation method based on action pruning provided by the present invention, as shown in fig. 1, the method includes:
step 110, determining a state corresponding to each content to be recommended based on the user characteristics of the target user and the content characteristics of each content to be recommended;
step 120, predicting the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and recommending the contents to be recommended to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, scoring prediction is carried out based on the candidate scores with the regret values larger than a preset threshold value, the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.
Specifically, the target user is a user to be subjected to content recommendation, and the user characteristics are used for characterizing attribute information of the target user, such as gender, age, education level, occupation and the like of the target user. The content to be recommended may be content recommended to the user, and the specific type of the content to be recommended may be movies, music, news, and the like, which is not specifically limited in the embodiment of the present invention. The content features are used for characterizing attribute information of the content to be recommended, such as movie types, theme content and the like. Further, the content to be recommended may be specifically determined based on the browsing record of the target user, and it is understood that since the browsing record covers the preference information of the user, determining the content to be recommended based on the browsing record may help the subsequently recommended content to be accepted by the user.
In order to perform personalized content recommendation for different users, the embodiment of the invention first determines the state corresponding to each content to be recommended according to the user characteristics of the target user and the content characteristics of each content to be recommended, then inputs the state corresponding to each content to be recommended into the score prediction model, which predicts the score the target user would give to each content to be recommended; finally, the contents to be recommended are sorted according to the scores output by the score prediction model, and the content recommended to the target user is determined according to the sorting result. Here, the state corresponding to each content to be recommended, that is, the state of the scoring and recommendation environment in which that content and the target user are located, may be obtained through environment feedback.
It can be understood that the scoring prediction model is obtained by performing reinforcement learning based on the sample state corresponding to the sample content, in the reinforcement learning process, the scoring prediction model can learn the scoring modes of different users by obtaining rewards from the environment and continuously performing policy optimization, on the basis, the step 120 is executed by applying the scoring prediction model, the obtained score of each content to be recommended can accurately represent the preference degree of a target user for each content to be recommended, the target user is recommended based on the score of each content to be recommended, personalized accurate recommendation can be performed on different users, and therefore user experience is greatly improved.
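As a concrete illustration of the recommendation step described above, the following sketch ranks candidate contents for one target user. It assumes the trained score prediction model exposes a predict_score function and that each candidate carries a feature vector and an id; these names are illustrative, not part of the invention.

```python
import numpy as np

def recommend(user_features, candidates, score_model, top_k=10):
    """Build one state per candidate by concatenating user features with content
    features, predict a score for each, and return the top-k content ids."""
    states = [np.concatenate([user_features, c["features"]]) for c in candidates]
    scores = np.array([score_model.predict_score(s) for s in states])   # assumed model interface
    ranked = np.argsort(scores)[::-1]                                   # highest predicted score first
    return [candidates[i]["id"] for i in ranked[:top_k]]
```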
Before step 120 is executed, the scoring prediction model may be trained specifically as follows: first, user characteristics of a large number of sample users and content characteristics of sample contents are collected, and a sample state corresponding to each sample content is determined based on the user characteristics of each sample user and the content characteristics of each sample content. And then, performing reinforcement learning on the initial model based on the sample state corresponding to each sample content, thereby obtaining a grading prediction model.
Existing reinforcement learning methods do not consider reducing the agent's action space, so their convergence rate is low and they are only suitable for learning tasks with small-scale discrete states or discrete actions. To solve this problem, during reinforcement learning the score prediction model obtains the regret value of each candidate score in the current sample state from the regret value set, prunes the candidate scores whose regret values are less than or equal to a preset threshold, uses the remaining candidate scores whose regret values are greater than the preset threshold as the agent's current score space, and performs score prediction based on that score space. Here, the preset threshold may be set arbitrarily according to actual requirements, for example to 0 or 0.1.
In the reinforcement learning process, the scoring prediction model essentially realizes interaction with the environment through the intelligent agent controlled by the scoring prediction model. The agent accesses a sample state of the environment at each decision, and the sample state accessed by the agent before the current sample state can be used as a historical state. The regret value set stores the historical states accessed by the agent, an action table is maintained corresponding to each accessed historical state, and regret values of all candidate actions, namely candidate scores, executed by the agent in the historical states are recorded. Here, the regret value may be specifically determined according to the advantage of each candidate score in the history state calculated by the action advantage function, and is used to represent the loss degree of the yield obtained by selecting each candidate score in the history state, where the candidate score is a score selectable in the score space.
For the current decision, the score prediction model first obtains the regret value of each candidate score in the current sample state from the regret value set and prunes the candidate scores whose regret values are less than or equal to the preset threshold. On this basis, the score prediction model selects the current decision action from the pruned score space, that is, selects the current score of the sample content from that space, thereby completing the score prediction. The score prediction model then controls the agent to score the sample content with the selected current score, so that the environment reacts to the score and sends the next sample state of the environment to the agent.
It should be noted that the regret value set collects the regret values of the candidate scores in the state visited by the agent at each past decision, forming cumulative regrets, and the candidate scores with lower regret values are pruned. Locally optimal scoring strategies can thus be excluded accurately, and selecting the current decision action from the pruned score space keeps reinforcement learning moving toward the target, so that the globally optimal strategy is found more quickly and the convergence of reinforcement learning is accelerated. In addition, since the action pruning technique continuously reduces the agent's action space during reinforcement learning, the reinforcement learning method provided by the invention can still guarantee a high convergence rate for learning tasks with large-scale action spaces.
According to the method provided by the embodiment of the invention, the regret value set accumulates the regret values of the candidate scores in the state visited by the agent at each decision. During reinforcement learning, the score prediction model prunes the candidate scores with lower regret values based on the regret value set, so that action pruning increases the convergence rate and learning efficiency of reinforcement learning. Applying the score prediction model to obtain the scores of the contents to be recommended then enables personalized, accurate recommendation for different users and improves the user experience.
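To make the regret value set described above concrete, the following is a minimal sketch of one possible container: each visited state maps to an action table holding one accumulated regret value per candidate score. The class name, the state hashing and the initial value are assumptions for illustration only.

```python
import numpy as np

class RegretPool:
    """Illustrative regret value set: visited state -> one regret per candidate score."""

    def __init__(self, num_candidates, init_value=1.0):
        self.num_candidates = num_candidates
        self.init_value = init_value      # regret assigned to every candidate of a newly seen state
        self.table = {}                   # state key -> np.ndarray of accumulated regrets

    @staticmethod
    def key(state):
        # Hash a (possibly continuous) state vector by rounding; purely illustrative.
        return tuple(np.round(np.asarray(state, dtype=float), 6))

    def contains(self, state):
        return self.key(state) in self.table

    def get(self, state):
        return self.table[self.key(state)]

    def create(self, state):
        regrets = np.full(self.num_candidates, self.init_value)
        self.table[self.key(state)] = regrets
        return regrets
```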
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, including:
querying the current sample state in the regret value set;
if the regret value set has the current sample state, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, and scoring prediction is carried out based on the candidate scores of which the regret values are larger than a preset threshold value;
otherwise, the scoring prediction model adds the regret value of each candidate score in the current sample state in the regret value set, sets the regret value of each added candidate score as an initial value, and performs scoring prediction based on each candidate score.
Specifically, after the current sample state is obtained, the scoring prediction model firstly queries whether the current sample state exists in the regret value set:
if the current sample state exists in the regret value set, the score prediction model can obtain from the regret value set the regret value list corresponding to the current sample state, which contains one regret value for each candidate score in the score space; considering that actions with small regret values are not conducive to learning the globally optimal strategy, the score prediction model prunes the candidate scores whose regret values are less than or equal to the preset threshold, takes the remaining candidate scores whose regret values are greater than the preset threshold as the agent's current score space, and performs score prediction based on this score space;
otherwise, that is, if the current sample state does not exist in the regret value set, the score prediction model may create a regret value list for the current sample state in the regret value set, where each list element corresponds to the regret value of one candidate score in the current sample state, set each list element to an initial value, for example 1, and then perform score prediction based on each candidate score in the original score space.
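A sketch of the query-and-prune step just described, built on the hypothetical RegretPool above: if the current sample state is unseen, a fresh regret list is created at the initial value; otherwise candidates whose accumulated regret does not exceed the preset threshold are pruned.

```python
import numpy as np

def pruned_candidates(regret_pool, state, threshold=0.0):
    """Return the indices of candidate scores kept for the current decision."""
    if regret_pool.contains(state):
        regrets = regret_pool.get(state)          # state already visited: reuse stored regrets
    else:
        regrets = regret_pool.create(state)       # unseen state: all candidates stay available
    return np.flatnonzero(regrets > threshold)    # prune candidates with regret <= threshold
```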
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, and then further includes:
and the scoring prediction model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
Specifically, after the scoring prediction model completes scoring prediction, the scoring prediction model may calculate a current regret value of each candidate score according to the advantage of each candidate score in the current sample state, and then update each regret value corresponding to the current sample state in the regret value set according to the calculated current regret value of each candidate score. Here, the specific updating manner may be to superimpose an originally stored regret value of each candidate score in the current sample state in the regret value set with the calculated regret value of each candidate score, and then take the superimposed result as each regret value corresponding to the updated current sample state, where the superimposing manner may be direct superimposing or weighted superimposing, which is not specifically limited in this embodiment of the present invention.
For example, if in the score prediction stage it is found that the current sample state does not exist in the regret value set, a regret value list is created for the current sample state in the regret value set and each list element is set to 1; in this step, each element of the created regret value list and the current regret value of each candidate score can then be directly superimposed, so that each regret value corresponding to the current sample state in the updated regret value set equals 1 plus the current regret value of the corresponding candidate score.
It should be noted that this step is to update each state and corresponding regret value stored in the regret value set after each scoring prediction is completed, and accumulate and store the past decision experience by continuously overlapping regret values corresponding to the same state, so that the subsequent action pruning can accurately exclude the locally optimal scoring strategy, thereby improving the learning efficiency of reinforcement learning. In addition, whether the scoring prediction is performed based on the pruned scoring space or the original scoring space, this step should be performed after the scoring prediction is completed, i.e. the states stored in the regret value set and the corresponding regret values are updated.
Based on any of the above embodiments, updating each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score, including:
if the regret value of any candidate score in the current sample state in the regret value set is larger than a preset threshold value, overlapping the regret value of the candidate score and the current regret value of the candidate score to obtain an updated regret value of the candidate score;
if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, the regret value of the candidate score is not updated.
Specifically, the various states stored in the regret set and the corresponding regret values may be updated as follows: for candidate scores in which the regret value in the current sample state in the regret value set is greater than the preset threshold, the regret value of the candidate score stored originally in the regret value set and the calculated current regret value of the candidate score can be superposed, so that the regret value of the candidate score updated in the current sample state in the regret value set is obtained; for candidate scores in which the regret value in the current sample state in the regret value set is less than or equal to the preset threshold, the regret value of the candidate score is not updated, that is, the regret value of the candidate score in the current sample state stored in the regret value set is continuously less than or equal to the preset threshold.
It can be understood that if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, that candidate score will be pruned in the action pruning stage of the current decision. Since this step leaves its stored regret value at or below the preset threshold, the candidate score will also be pruned in the action pruning stage of every subsequent decision. The agent's action space can therefore be continuously reduced, and in the late stage of reinforcement learning one or several actions can be locked in for a specific state, which greatly improves the learning efficiency of reinforcement learning.
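The selective update described above could look like the following sketch (again using the hypothetical RegretPool): freshly computed regrets are superimposed only onto entries whose stored regret is still above the threshold, so pruned candidates stay pruned in all later decisions.

```python
import numpy as np

def update_regrets(regret_pool, state, current_regrets, threshold=0.0):
    """Superimpose the current regrets onto the stored list for entries above the threshold."""
    stored = regret_pool.get(state)                                  # mutable view into the pool
    mask = stored > threshold                                        # entries that are not yet pruned
    stored[mask] += np.asarray(current_regrets, dtype=float)[mask]   # direct superposition
    return stored
```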
Based on any of the above embodiments, performing score prediction based on a candidate score whose regret value is greater than a preset threshold includes:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantages of each current candidate score in the current sample state, wherein the current candidate score is a candidate score with an regret value larger than a preset threshold value;
and taking the current candidate score corresponding to the maximum value in the values of the current candidate scores as the current score.
Specifically, after the regret value of each candidate score in the current sample state is obtained from the regret value set, the candidate scores whose regret value is less than or equal to the preset threshold may be pruned, and the remaining candidate scores whose regret value is greater than the preset threshold are the current candidate scores. Then, the score prediction model may calculate the value of each current candidate score in the current sample state by the following formula:
Q(s, a_i) = V(s) + A(s, a_i)
wherein Q(s, a_i) is the value of the i-th current candidate score in the current sample state, A(s, a_i) is the advantage of the i-th current candidate score in the current sample state, V(s) is the value of the current sample state, s is the current sample state, and a_i is the i-th current candidate score. Here, V(s) can be calculated by the state value function and A(s, a_i) can be calculated by the action advantage function.
The values of all current candidate scores in the current sample state can be obtained through the formula, and then the values are compared, and the current candidate score corresponding to the maximum value is determined as the current score. Here, the current score is a result obtained by the score prediction model executing the current decision, and after the current score is determined, the score prediction model may control the agent to execute a score action, that is, to score the sample content according to the current score.
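The selection rule above can be sketched as follows, assuming value_net(state) returns the scalar V(s) and advantage_net(state) returns a vector with one advantage per candidate score; both interfaces are illustrative.

```python
import numpy as np

def select_score(state, keep_indices, value_net, advantage_net):
    """Q(s, a_i) = V(s) + A(s, a_i); return the kept candidate with the largest Q value."""
    v = float(value_net(state))                   # V(s) from the state value function
    adv = np.asarray(advantage_net(state))        # A(s, a_i) for every candidate score
    q = v + adv
    return int(keep_indices[np.argmax(q[keep_indices])])   # argmax over the pruned score space
```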
Based on any of the above embodiments, the regret value Re(s, a_i) of the i-th candidate score a_i in the history state s is determined from the value V(s) of the history state and the advantage A(s, a_i) of the i-th candidate score in the history state.
Specifically, in order to implement action pruning guided by regret values, the embodiment of the invention draws on game theory and defines the regret value of each candidate score in terms of the value of the state and the advantage of each candidate score in that state, so as to estimate the degree of loss in the return obtained by selecting each candidate score in the current state. For each history state s obtained from the environment, the value V(s) of the history state can first be calculated by the state value function, and the advantage A(s, a_i) of the i-th candidate score a_i in the history state can be calculated by the action advantage function; it is then judged whether A(s, a_i) is greater than 0, and the regret value Re(s, a_i) of the i-th candidate score in the history state is calculated from V(s) and A(s, a_i) according to the judgment result, so as to obtain the regret values of all candidate scores in the history state s; finally, the history state s and all its corresponding regret values are stored
Here, the state value function may be specifically implemented by a state value function network, and the action dominance function may be specifically implemented by an action dominance function network. It can be understood that the values of the states respectively output by the two networks and the advantages of the candidate scores are more accurate by continuously optimizing the parameters of the state value function network and the action advantage function network in the reinforcement learning process, so that the error in the regret value calculation process can be effectively reduced, and the effectiveness of action pruning is further improved.
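The exact regret formula is published only as an image, so the sketch below shows one plausible instantiation consistent with the description above: the regret of a candidate is its advantage over the state value, kept only when the advantage is positive. This form is an assumption, not the formula of the patent.

```python
import numpy as np

def current_regrets(state, value_net, advantage_net):
    """Assumed regret: positive part of each candidate's advantage in the given state."""
    v = float(value_net(state))             # V(s) from the state value function network
    adv = np.asarray(advantage_net(state))  # A(s, a_i) from the action advantage function network
    q = v + adv                             # Q(s, a_i) = V(s) + A(s, a_i)
    return np.maximum(q - v, 0.0)           # regret_i = max(A(s, a_i), 0), by assumption
```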
Based on any of the above embodiments, fig. 2 is a schematic flow chart of the determination method of the score prediction model provided by the present invention, as shown in fig. 2, taking movie recommendation as an example, a specific flow of the method is as follows:
Step 1: construct a recommendation system interaction environment;
Step 2: create a regret value set, namely a regret value pool Re_Pool, and an experience pool Ex_Pool, both pools being empty initially;
Step 3: initialize the state value function network and the action advantage function network in the score prediction model; set the reward function, the discount factor and the exploration probability ε;
Step 4: the agent makes random decisions and accumulates experience: the agent interacts with the environment and obtains from the environment the sample state s composed of the user features of the sample user and the movie features of the sample movie; according to the sample state s, the agent randomly selects a score a in the score space; the regret value of each candidate score in the sample state s is calculated and stored in Re_Pool; the score a is fed back to the environment to obtain the reward r, the state s' at the next moment and the flag sign indicating whether the termination state has been reached. Here, whether the termination state is reached is judged according to whether the browsing records of the current sample user have been completely traversed: if not, s' is composed of the user features of the same user and the movie features of the next sample movie; if the traversal is complete, the termination flag is set and the user is switched at the next interaction. The experience (s, a, r, s', sign) is then stored in Ex_Pool;
Step 5: repeat step 4 until the number of decisions exceeds a preset number;
Step 6: the agent interacts with the environment, obtains the current sample state s from the environment, and queries whether s exists in the regret value pool Re_Pool. If it exists, the regret value list corresponding to the state, containing one regret value for each candidate score in the score space, is extracted from the regret value pool, and an all-zero list of the same size as the regret value list is created; the elements at the positions whose regret values are greater than the preset threshold are filled in to obtain the zero list zo_list, whose non-zero entries mark the candidate scores that are kept. The value of each candidate score in the score space, namely the Q value, is then calculated according to the formula Q(s, a_i) = V(s) + A(s, a_i), and the score a is selected according to the ε-greedy policy, that is, the Q values of the candidate scores whose corresponding regret values are not 0 are screened out first, and the candidate score corresponding to the maximum of these Q values is taken as the agent's current decision. If s does not exist in the pool, a regret value list is created for the state and stored in Re_Pool, each element in the list is initialized to 1, and the score a is selected according to the ε-greedy policy, taking the candidate score corresponding to the maximum of the Q values of all candidate scores as the agent's current decision;
Step 7: the current sample state s is input into the state value function network to calculate the value V(s) of the state, and s is input into the action advantage function network to calculate the advantage A(s, a_i) of each candidate score, from which the current regret value of each candidate score under s is calculated; the regret values of the candidate scores under s in Re_Pool are then updated according to the calculated current regret values. The specific updating method is as follows: the existing regret value list composed of the regret values of the candidate scores under s, together with the position index of this list in the regret value pool, is obtained from the regret value pool; the current regret values of the candidate scores under s calculated as above form a latest regret value list; the elements at the corresponding positions of the existing regret value list and the latest regret value list are superimposed; finally, the regret value list at position index in the regret value pool is replaced by the superimposed regret value list;
Step 8: the score a is returned to the environment, and the reward r, the state s' at the next moment and the flag sign are received; the experience (s, a, r, s', sign) is stored in Ex_Pool;
Step 9: return to step 6 and continue the iterative training until the convergence condition of the score prediction model is met, for example until the precision of the predicted score no longer changes, where the precision can be the difference between the predicted score and the real score; at this point, reinforcement learning has learned a stable strategy, and the score prediction model is obtained when the training ends.
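The following sketch compresses steps 3 to 9 into a single loop, reusing the hypothetical helpers sketched earlier (RegretPool, pruned_candidates, select_score, current_regrets, update_regrets). The environment interface, the warm-up length, and the omission of ε-greedy exploration and of the network updates are all simplifying assumptions.

```python
import random

def run_training_loop(env, regret_pool, ex_pool, value_net, advantage_net,
                      num_candidates, warmup_decisions=1000, max_decisions=100000,
                      threshold=0.0):
    """Random decisions during warm-up, then decisions on the pruned score space;
    regrets and experience are accumulated at every step (sketch only)."""
    state = env.reset()                                        # sample state from the environment
    for step in range(max_decisions):
        if step < warmup_decisions:
            score = random.randrange(num_candidates)           # step 4: random decision phase
            if not regret_pool.contains(state):
                regret_pool.create(state)
        else:
            keep = pruned_candidates(regret_pool, state, threshold)       # step 6: query and prune
            score = select_score(state, keep, value_net, advantage_net)   # greedy on the pruned space
        regrets = current_regrets(state, value_net, advantage_net)        # step 7: compute regrets
        update_regrets(regret_pool, state, regrets, threshold)            # step 7: superimpose
        next_state, reward, done = env.step(score)             # step 8: feed the score back
        ex_pool.append((state, score, reward, next_state, done))
        state = env.reset() if done else next_state            # switch users when records run out
```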
Based on any of the above embodiments, fig. 3 is a schematic diagram of a reinforcement learning framework based on action pruning according to the present invention. As shown in fig. 3, the framework includes components such as a simulation environment, an action regret value pool, an experience replay pool, a state value function network and an action advantage function network. The specific flow of the reinforcement learning method based on this framework is as follows:
Step 1: create a regret value pool Re_Pool (i.e., the action regret value pool in FIG. 3) and an experience pool Ex_Pool (i.e., the experience replay pool in FIG. 3), both pools being empty initially;
Step 2: initialize the state value function network and the action advantage function network; set the reward function, the discount factor and the exploration probability ε;
Step 3: the agent interacts with the environment, makes random decisions, and stores the experience into Ex_Pool;
Step 4: repeat step 3 until the number of decisions exceeds a preset number;
Step 5: the agent interacts with the environment, obtains the current sample state s from the environment, and queries whether s exists in the regret value pool Re_Pool. If s is already in the pool, the action regret value list corresponding to the state, containing one regret value for each candidate action in the action space, is extracted from the regret value pool, and an all-zero list of the same size as the regret value list is created; the elements at the positions whose regret values are greater than the preset threshold are filled in to obtain the zero list zo_list, whose non-zero entries mark the candidate actions that are kept. The value of each candidate action in the action space, namely the Q value, is then calculated according to the formula Q(s, a_i) = V(s) + A(s, a_i), and the action a is selected according to the ε-greedy policy, that is, the Q values of the candidate actions whose corresponding regret values are not 0 are screened out first, and the candidate action corresponding to the maximum of these Q values is taken as the agent's current decision. If s is not in the pool, an action regret value list is created for s and stored in Re_Pool, the regret value of each action is initialized to 1, and the action a is selected according to the ε-greedy policy, taking the candidate action corresponding to the maximum of the Q values of all candidate actions as the agent's current decision;
Step 6: the current sample state s is input into the state value function network to calculate the value V(s) of the state, and s is input into the action advantage function network to calculate the advantage A(s, a_i) of each candidate action, from which the current regret value of each candidate action under s is calculated; the regret values of the candidate actions under s in Re_Pool are then updated according to the calculated current regret values. The specific updating method is as follows: the existing regret value list composed of the regret values of the candidate actions under s, together with the position index of this list in the regret value pool, is obtained from the regret value pool; the calculated current regret values of the candidate actions under s form a latest regret value list; the elements at the corresponding positions of the existing regret value list and the latest regret value list are superimposed; finally, the regret value list at position index in the regret value pool is replaced by the superimposed regret value list;
Step 7: the action a is returned to the simulation environment, the reward r and the state s' at the next moment are obtained, and the experience (s, a, r, s') is stored in Ex_Pool;
Step 8: return to step 5 and continue the iterative training until the networks converge; the strategy model is obtained when the training ends.
It should be noted that at the beginning of the reinforcement learning process the experience pool Ex_Pool has not yet been filled; at this time, for each interaction between the agent and the environment, the experience is only stored in the experience pool, and the state value function network and the action advantage function network are not updated. After the experience pool Ex_Pool is full, for each interaction between the agent and the environment, the experience (s, a, r, s') is first stored in Ex_Pool, and then a batch of experience (a mini-batch) is sampled from the experience pool to update the state value function network V and the action advantage function network A.
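Once the experience pool holds enough samples, the two networks could be refreshed with a mini-batch as in the sketch below. It assumes PyTorch, with value_net(s) returning a (batch,) tensor of V(s) and advantage_net(s) returning a (batch, num_candidates) tensor of A(s, a_i); the one-step TD target, the mean-squared loss and the shared optimizer are illustrative choices, not prescribed by the invention.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def update_networks(ex_pool, value_net, advantage_net, optimizer, batch_size=64, gamma=0.99):
    """One mini-batch TD update of the state value and action advantage networks (sketch)."""
    batch = random.sample(ex_pool, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    s = torch.as_tensor(np.stack(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    d = torch.as_tensor(dones, dtype=torch.float32)

    q = value_net(s) + advantage_net(s).gather(1, a).squeeze(1)            # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = (value_net(s2).unsqueeze(1) + advantage_net(s2)).max(dim=1).values
        target = r + gamma * (1.0 - d) * q_next                            # one-step TD target
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```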
The invention aims to provide a new idea and method for improving the convergence rate of reinforcement learning. During reinforcement learning, experience is first accumulated through the agent's random decisions; later, the regret value of each candidate action in each state is calculated from the existing experience, and when the regret value of a candidate action is less than or equal to a regret threshold parameter (namely the preset threshold in the above embodiments), the candidate action is pruned and is never selected again. Eventually one or several actions are locked in for a specific state, which greatly improves the convergence rate of reinforcement learning. In addition, the action pruning technique provided by the invention can be combined with any reinforcement learning algorithm to improve its learning efficiency, so it has a wide range of applications.
The following describes a recommendation device based on action pruning according to the present invention, and the recommendation device based on action pruning described below and the recommendation method based on action pruning described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of a recommendation device based on action pruning, as shown in fig. 4, the device includes:
a determining module 410, configured to determine, based on the user characteristics of the target user and the content characteristics of each content to be recommended, a state corresponding to each content to be recommended;
the recommending module 420 is configured to predict the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and to recommend to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, scoring prediction is carried out based on the candidate scores with the regret values larger than a preset threshold value, the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.
According to the device provided by the embodiment of the invention, the regret value set accumulates the regret values of the candidate scores in the state visited by the agent at each decision. During reinforcement learning, the score prediction model prunes the candidate scores with lower regret values based on the regret value set, so that action pruning increases the convergence rate and learning efficiency of reinforcement learning. Applying the score prediction model to obtain the scores of the contents to be recommended then enables personalized, accurate recommendation for different users and improves the user experience.
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, including:
querying the current sample state in the regret value set;
if the regret value set has the current sample state, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, and scoring prediction is carried out based on the candidate scores of which the regret values are larger than a preset threshold value;
otherwise, the scoring prediction model adds the regret value of each candidate score in the current sample state in the regret value set, sets the regret value of each added candidate score as an initial value, and performs scoring prediction based on each candidate score.
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, and then further includes:
and the scoring prediction model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
Based on any of the above embodiments, updating each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score, including:
if the regret value of any candidate score in the current sample state in the regret value set is larger than the preset threshold, superimposing the stored regret value of that candidate score with its current regret value to obtain the updated regret value of that candidate score;
and if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, the regret value of any candidate score is not updated.
Based on any of the above embodiments, performing scoring prediction based on the candidate scores whose regret values are larger than a preset threshold includes:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantage of each current candidate score in the current sample state, wherein a current candidate score is a candidate score whose regret value is larger than the preset threshold;
taking the current candidate score with the largest value among the current candidate scores as the current score.
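A sketch of this selection step. It assumes the value V(s) of the current sample state and the per-candidate advantages A(s, a_i) are available, for example from a dueling-style network head; that interface is an assumption made for illustration, since the patent only states that the value of a candidate score is determined from the state value and the candidate's advantage.

```python
import numpy as np


def select_score(state_value: float, advantages: np.ndarray,
                 mask: np.ndarray, candidate_scores: np.ndarray) -> float:
    """Choose the current score among the candidates that survived pruning.

    The value of each candidate score is taken as V(s) + A(s, a_i); pruned
    candidates are excluded, and the candidate with the largest value is chosen.
    """
    values = state_value + advantages            # value of every candidate score
    values = np.where(mask, values, -np.inf)     # drop pruned candidates
    return float(candidate_scores[int(np.argmax(values))])
```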
Based on any of the above embodiments, the regret value is determined based on the following formula:

$$R(s, a_i) = \big(V(s) + A(s, a_i)\big) - V(s)$$

wherein $R(s, a_i)$ is the regret value of the $i$-th candidate score in the historical state, $V(s)$ is the value of the historical state, $A(s, a_i)$ is the advantage of the $i$-th candidate score in the historical state, $s$ is the historical state, and $a_i$ is the $i$-th candidate score.
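Read this way, the regret of a candidate score in a historical state reduces to that candidate's advantage in the state; a one-line sketch of the computation, keeping the V(s) and A(s, a_i) terms explicit to mirror the formula above:

```python
import numpy as np


def compute_regrets(state_value: float, advantages: np.ndarray) -> np.ndarray:
    """Regret of each candidate score in a historical state, computed as
    (V(s) + A(s, a_i)) - V(s); this equals the candidate's advantage."""
    return (state_value + advantages) - state_value
```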
Fig. 5 illustrates a schematic physical structure diagram of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the action pruning-based recommendation method, which includes: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended; the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, wherein the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention further provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, wherein the computer program, when executed by a processor, enables a computer to execute the action pruning-based recommendation method provided by the above embodiments, the method including: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended; the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, wherein the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the action pruning-based recommendation method provided by the above embodiments, the method including: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended; the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, wherein the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A recommendation method based on action pruning is characterized by comprising the following steps:
determining the state corresponding to each content to be recommended based on user characteristics of a target user and content characteristics of each content to be recommended;
predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended;
wherein the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from a regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold; the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
2. The action pruning-based recommendation method according to claim 1, wherein the scoring prediction model obtaining the regret value of each candidate score in the current sample state from the regret value set and performing scoring prediction based on the candidate scores whose regret values are larger than a preset threshold comprises:
querying the regret value set for the current sample state;
if the current sample state exists in the regret value set, obtaining, by the scoring prediction model, the regret value of each candidate score in the current sample state from the regret value set, and performing scoring prediction based on the candidate scores whose regret values are larger than the preset threshold;
otherwise, adding, by the scoring prediction model, the regret value of each candidate score in the current sample state to the regret value set, setting each added regret value to an initial value, and performing scoring prediction based on all candidate scores.
3. The action pruning-based recommendation method according to claim 1 or 2, wherein after the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, the method further comprises:
determining, by the scoring prediction model, the current regret value of each candidate score based on the advantage of each candidate score in the current sample state, and updating the regret values corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
4. The action pruning-based recommendation method according to claim 3, wherein the updating the regret values corresponding to the current sample state in the regret value set based on the current regret value of each candidate score comprises:
if the regret value of any candidate score in the current sample state stored in the regret value set is larger than a preset threshold, superimposing the current regret value of that candidate score on its stored regret value to obtain an updated regret value of that candidate score;
if the regret value of any candidate score in the current sample state stored in the regret value set is less than or equal to the preset threshold, not updating the regret value of that candidate score.
5. The action pruning-based recommendation method according to claim 1 or 2, wherein the performing scoring prediction based on the candidate scores whose regret values are larger than a preset threshold comprises:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantage of each current candidate score in the current sample state, wherein a current candidate score is a candidate score whose regret value is larger than the preset threshold;
taking the current candidate score with the largest value among the current candidate scores as the current score.
6. The action pruning-based recommendation method according to claim 1 or 2, wherein the regret value is determined based on the following formula:

$$R(s, a_i) = \big(V(s) + A(s, a_i)\big) - V(s)$$

wherein $R(s, a_i)$ is the regret value of the $i$-th candidate score in the historical state, $V(s)$ is the value of the historical state, $A(s, a_i)$ is the advantage of the $i$-th candidate score in the historical state, $s$ is the historical state, and $a_i$ is the $i$-th candidate score.
7. A recommendation device based on action pruning, comprising:
the determining module is used for determining the state corresponding to each content to be recommended based on user characteristics of a target user and content characteristics of each content to be recommended;
the recommendation module is used for predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended;
wherein the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from a regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold; the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the action pruning-based recommendation method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the action pruning-based recommendation method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the action pruning-based recommendation method according to any one of claims 1 to 6.
CN202111185124.8A 2021-10-12 2021-10-12 Recommendation method and device based on action pruning, electronic equipment and storage medium Active CN113626720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111185124.8A CN113626720B (en) 2021-10-12 2021-10-12 Recommendation method and device based on action pruning, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113626720A true CN113626720A (en) 2021-11-09
CN113626720B CN113626720B (en) 2022-02-25

Family

ID=78391005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111185124.8A Active CN113626720B (en) 2021-10-12 2021-10-12 Recommendation method and device based on action pruning, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113626720B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160239738A1 (en) * 2013-10-23 2016-08-18 Tencent Technology (Shenzhen) Company Limited Question recommending method, apparatus and system
CN110404265A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) A kind of non-complete information machine game method of more people based on game final phase of a chess game online resolution, device, system and storage medium
CN111476639A (en) * 2020-04-10 2020-07-31 深圳市物语智联科技有限公司 Commodity recommendation strategy determining method and device, computer equipment and storage medium
CN111986005A (en) * 2020-08-31 2020-11-24 上海博泰悦臻电子设备制造有限公司 Activity recommendation method and related equipment
CN112149824A (en) * 2020-09-15 2020-12-29 支付宝(杭州)信息技术有限公司 Method and device for updating recommendation model by game theory

Also Published As

Publication number Publication date
CN113626720B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
CN107515909B (en) Video recommendation method and system
CN110138612B (en) Cloud software service resource allocation method based on QoS model self-correction
JP2024026276A (en) Computer-based system, computer component and computer object configured to implement dynamic outlier bias reduction in machine learning model
CN112329948B (en) Multi-agent strategy prediction method and device
CN109408731A (en) A kind of multiple target recommended method, multiple target recommended models generation method and device
JP7224395B2 (en) Optimization method, device, device and computer storage medium for recommender system
Xu et al. Learning to explore with meta-policy gradient
CN112149824B (en) Method and device for updating recommendation model by game theory
KR102203253B1 (en) Rating augmentation and item recommendation method and system based on generative adversarial networks
CN111159542A (en) Cross-domain sequence recommendation method based on self-adaptive fine-tuning strategy
US20230311003A1 (en) Decision model training method and apparatus, device, storage medium, and program product
WO2021055442A1 (en) Small and fast video processing networks via neural architecture search
CN113626720B (en) Recommendation method and device based on action pruning, electronic equipment and storage medium
CN113761388A (en) Recommendation method and device, electronic equipment and storage medium
CN117056595A (en) Interactive project recommendation method and device and computer readable storage medium
Pinto et al. Learning partial policies to speedup MDP tree search via reduction to IID learning
CN116992151A (en) Online course recommendation method based on double-tower graph convolution neural network
CN113449176A (en) Recommendation method and device based on knowledge graph
CN113626721B (en) Regrettful exploration-based recommendation method and device, electronic equipment and storage medium
CN115600818A (en) Multi-dimensional scoring method and device, electronic equipment and storage medium
WO2022166125A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
CN110533192B (en) Reinforced learning method and device, computer readable medium and electronic equipment
CN113221017B (en) Rough arrangement method and device and storage medium
CN110765345A (en) Searching method, device and equipment
CN113468436A (en) Reinforced learning recommendation method, system, terminal and medium based on user evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant