CN113626720A - Recommendation method and device based on action pruning, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113626720A
Authority
CN
China
Prior art keywords
regret
candidate
score
value
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111185124.8A
Other languages
Chinese (zh)
Other versions
CN113626720B (en)
Inventor
张俊格
白栋栋
黄凯奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111185124.8A priority Critical patent/CN113626720B/en
Publication of CN113626720A publication Critical patent/CN113626720A/en
Application granted granted Critical
Publication of CN113626720B publication Critical patent/CN113626720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a recommendation method and device based on action pruning, an electronic device, and a storage medium. The recommendation method comprises: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the corresponding state and a score prediction model, and recommending to the target user based on those scores. The score prediction model is obtained through reinforcement learning. During reinforcement learning, the score prediction model obtains the regret values of the candidate scores in the current sample state from a regret value set and performs score prediction based only on the candidate scores whose regret values are larger than a preset threshold. The regret value set stores historical states and the regret values corresponding to those historical states; the regret values are determined based on the advantages of the candidate scores in the historical states, and the historical states are sample states preceding the current sample state. This accelerates the convergence of reinforcement learning and enables personalized, accurate recommendation for users.

Description

Recommendation method and device based on action pruning, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a recommendation method and device based on action pruning, an electronic device and a storage medium.
Background
Because reinforcement learning can perceive a dynamic environment and obtain rewards from the environment so as to continually adapt to it, it is particularly suitable for business scenarios involving interaction, such as recommending content to a user.
However, in practical applications, conventional reinforcement learning algorithms perform policy improvement based only on the immediate rewards obtained from the environment, converge slowly, and are suitable only for learning tasks with small-scale action spaces. How to increase the convergence rate of reinforcement learning is therefore an important problem that the industry urgently needs to solve.
Disclosure of Invention
The invention provides a recommendation method and device based on action pruning, electronic equipment and a storage medium, which are used for solving the defect of low convergence speed of reinforcement learning in the prior art and realizing the improvement of the convergence speed of the reinforcement learning.
The invention provides a recommendation method based on action pruning, which comprises the following steps:
determining the corresponding state of each content to be recommended based on the user characteristics of the target user and the content characteristics of each content to be recommended;
predicting the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and recommending the contents to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of candidate scores in the current sample state from a regret value set, and performs scoring prediction based on the candidate scores with the regret values larger than a preset threshold value, wherein the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of the candidate scores in the historical states, and the historical states are sample states before the current sample state.
According to the action pruning-based recommendation method provided by the invention, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set, and carries out scoring prediction based on the candidate score with the regret value larger than the preset threshold, and the method comprises the following steps:
querying the current sample state in the regret value set;
if the current sample state exists in the regret value set, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, and scoring prediction is carried out based on the candidate scores with the regret values larger than a preset threshold;
otherwise, the scoring prediction model adds the regret value of each candidate score in the current sample state in the regret value set, sets the regret value of each added candidate score as an initial value, and performs scoring prediction based on each candidate score.
According to the action pruning-based recommendation method provided by the invention, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than the preset threshold, and then the action pruning-based recommendation method further comprises the following steps:
and the score prediction model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
According to the action pruning-based recommendation method provided by the present invention, the updating each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score includes:
if the regret value of any candidate score in the current sample state in the regret value set is larger than the preset threshold, superimposing the stored regret value of that candidate score with its current regret value to obtain the updated regret value of that candidate score;
and if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, not updating the regret value of that candidate score.
According to the recommendation method based on action pruning provided by the invention, the score prediction based on the candidate score with the regret value larger than the preset threshold value comprises the following steps:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantages of each current candidate score in the current sample state, wherein the current candidate score is a candidate score with the regret value larger than a preset threshold value;
and taking the current candidate score corresponding to the maximum value in the values of the current candidate scores as the current score.
According to the recommendation method based on action pruning provided by the invention, the regret value Re(s, a_i) of the i-th candidate score a_i in the history state s is determined from the value V(s) of the history state and the advantage A(s, a_i) of the i-th candidate score in the history state.
The invention also provides a recommendation device based on action pruning, which comprises the following components:
the determining module is used for determining the state corresponding to each content to be recommended based on the user characteristics of the target user and the content characteristics of each content to be recommended;
the recommendation module is used for predicting the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and recommending the contents to be recommended to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of candidate scores in the current sample state from a regret value set, and performs scoring prediction based on the candidate scores with the regret values larger than a preset threshold value, wherein the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of the candidate scores in the historical states, and the historical states are sample states before the current sample state.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the recommendation method based on action pruning.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for recommending based on action pruning as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the method for recommending based on action pruning as defined in any of the above.
According to the action pruning-based recommendation method and device, the electronic device and the storage medium provided by the invention, the regret value set accumulates the regret values of the candidate scores in the state visited by the agent at each decision. During reinforcement learning, the score prediction model prunes the candidate scores with lower regret values based on the regret value set, so that action pruning increases the convergence rate and learning efficiency of reinforcement learning. The scores of the contents to be recommended are then obtained through the score prediction model, which enables personalized, accurate recommendation for different users and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a recommendation method based on action pruning according to the present invention;
FIG. 2 is a schematic flow chart of a method for determining a score prediction model according to the present invention;
FIG. 3 is a diagram of a reinforcement learning framework based on action pruning according to the present invention;
FIG. 4 is a schematic structural diagram of a recommending device based on action pruning provided by the invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Traditional reinforcement learning algorithms converge slowly in practical applications and are suitable only for learning tasks with small-scale action spaces. In view of this, the embodiment of the present invention provides a new technique in the field of reinforcement learning, namely an action pruning technique, which continuously prunes candidate actions with low regret values before the agent makes each decision, thereby reducing the agent's action space, improving the convergence rate of the reinforcement learning algorithm, and shortening the learning time. On this basis, the embodiment of the invention provides a recommendation method based on action pruning.
Fig. 1 is a schematic flow chart of a recommendation method based on action pruning provided by the present invention, as shown in fig. 1, the method includes:
step 110, determining a state corresponding to each content to be recommended based on the user characteristics of the target user and the content characteristics of each content to be recommended;
step 120, predicting the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and recommending the contents to be recommended to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, scoring prediction is carried out based on the candidate scores with the regret values larger than a preset threshold value, the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.
Specifically, the target user is a user to be subjected to content recommendation, and the user characteristics are used for characterizing attribute information of the target user, such as gender, age, education level, occupation and the like of the target user. The content to be recommended may be content recommended to the user, and the specific type of the content to be recommended may be movies, music, news, and the like, which is not specifically limited in the embodiment of the present invention. The content features are used for characterizing attribute information of the content to be recommended, such as movie types, theme content and the like. Further, the content to be recommended may be specifically determined based on the browsing record of the target user, and it is understood that since the browsing record covers the preference information of the user, determining the content to be recommended based on the browsing record may help the subsequently recommended content to be accepted by the user.
In order to perform personalized content recommendation for different users, the embodiment of the invention first determines the state corresponding to each content to be recommended according to the user characteristics of the target user and the content characteristics of each content to be recommended, then inputs the state corresponding to each content to be recommended into the score prediction model, which predicts the score the target user would give to each content to be recommended; finally, the contents to be recommended are sorted according to the scores output by the score prediction model, and the content recommended to the target user is determined according to the sorting result. Here, the state corresponding to each content to be recommended, that is, the state of the scoring and recommendation environment in which that content and the target user are located, may be obtained through environment feedback.
It can be understood that the scoring prediction model is obtained by performing reinforcement learning based on the sample state corresponding to the sample content, in the reinforcement learning process, the scoring prediction model can learn the scoring modes of different users by obtaining rewards from the environment and continuously performing policy optimization, on the basis, the step 120 is executed by applying the scoring prediction model, the obtained score of each content to be recommended can accurately represent the preference degree of a target user for each content to be recommended, the target user is recommended based on the score of each content to be recommended, personalized accurate recommendation can be performed on different users, and therefore user experience is greatly improved.
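As a concrete illustration of the recommendation step described above, the following sketch ranks candidate contents for one target user. It assumes the trained score prediction model exposes a predict_score function and that each candidate carries a feature vector and an id; these names are illustrative, not part of the invention.

```python
import numpy as np

def recommend(user_features, candidates, score_model, top_k=10):
    """Build one state per candidate by concatenating user features with content
    features, predict a score for each, and return the top-k content ids."""
    states = [np.concatenate([user_features, c["features"]]) for c in candidates]
    scores = np.array([score_model.predict_score(s) for s in states])   # assumed model interface
    ranked = np.argsort(scores)[::-1]                                   # highest predicted score first
    return [candidates[i]["id"] for i in ranked[:top_k]]
```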
Before step 120 is executed, the scoring prediction model may be trained specifically as follows: first, user characteristics of a large number of sample users and content characteristics of sample contents are collected, and a sample state corresponding to each sample content is determined based on the user characteristics of each sample user and the content characteristics of each sample content. And then, performing reinforcement learning on the initial model based on the sample state corresponding to each sample content, thereby obtaining a grading prediction model.
Existing reinforcement learning methods do not consider reducing the agent's action space, so their convergence rate is low and they are only suitable for learning tasks with small-scale discrete states or discrete actions. To solve this problem, during reinforcement learning the score prediction model obtains the regret value of each candidate score in the current sample state from the regret value set, prunes the candidate scores whose regret values are less than or equal to a preset threshold, uses the remaining candidate scores whose regret values are greater than the preset threshold as the agent's current score space, and performs score prediction based on that score space. Here, the preset threshold may be set arbitrarily according to actual requirements, for example to 0 or 0.1.
In the reinforcement learning process, the scoring prediction model essentially realizes interaction with the environment through the intelligent agent controlled by the scoring prediction model. The agent accesses a sample state of the environment at each decision, and the sample state accessed by the agent before the current sample state can be used as a historical state. The regret value set stores the historical states accessed by the agent, an action table is maintained corresponding to each accessed historical state, and regret values of all candidate actions, namely candidate scores, executed by the agent in the historical states are recorded. Here, the regret value may be specifically determined according to the advantage of each candidate score in the history state calculated by the action advantage function, and is used to represent the loss degree of the yield obtained by selecting each candidate score in the history state, where the candidate score is a score selectable in the score space.
For the current decision, the score prediction model first obtains the regret value of each candidate score in the current sample state from the regret value set and prunes the candidate scores whose regret values are less than or equal to the preset threshold. On this basis, the score prediction model selects the current decision action from the pruned score space, that is, selects the current score of the sample content from that space, thereby completing the score prediction. The score prediction model then controls the agent to score the sample content with the selected current score, so that the environment reacts to the score and sends the next sample state of the environment to the agent.
It should be noted that the regret value set collects the regret values of the candidate scores in the state visited by the agent at each past decision, forming cumulative regrets, and the candidate scores with lower regret values are pruned. Locally optimal scoring strategies can thus be excluded accurately, and selecting the current decision action from the pruned score space keeps reinforcement learning moving toward the target, so that the globally optimal strategy is found more quickly and the convergence of reinforcement learning is accelerated. In addition, since the action pruning technique continuously reduces the agent's action space during reinforcement learning, the reinforcement learning method provided by the invention can still guarantee a high convergence rate for learning tasks with large-scale action spaces.
According to the method provided by the embodiment of the invention, the regret value set accumulates the regret values of the candidate scores in the state visited by the agent at each decision. During reinforcement learning, the score prediction model prunes the candidate scores with lower regret values based on the regret value set, so that action pruning increases the convergence rate and learning efficiency of reinforcement learning. Applying the score prediction model to obtain the scores of the contents to be recommended then enables personalized, accurate recommendation for different users and improves the user experience.
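To make the regret value set described above concrete, the following is a minimal sketch of one possible container: each visited state maps to an action table holding one accumulated regret value per candidate score. The class name, the state hashing and the initial value are assumptions for illustration only.

```python
import numpy as np

class RegretPool:
    """Illustrative regret value set: visited state -> one regret per candidate score."""

    def __init__(self, num_candidates, init_value=1.0):
        self.num_candidates = num_candidates
        self.init_value = init_value      # regret assigned to every candidate of a newly seen state
        self.table = {}                   # state key -> np.ndarray of accumulated regrets

    @staticmethod
    def key(state):
        # Hash a (possibly continuous) state vector by rounding; purely illustrative.
        return tuple(np.round(np.asarray(state, dtype=float), 6))

    def contains(self, state):
        return self.key(state) in self.table

    def get(self, state):
        return self.table[self.key(state)]

    def create(self, state):
        regrets = np.full(self.num_candidates, self.init_value)
        self.table[self.key(state)] = regrets
        return regrets
```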
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, including:
querying the current sample state in the regret value set;
if the regret value set has the current sample state, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, and scoring prediction is carried out based on the candidate scores of which the regret values are larger than a preset threshold value;
otherwise, the scoring prediction model adds the regret value of each candidate score in the current sample state in the regret value set, sets the regret value of each added candidate score as an initial value, and performs scoring prediction based on each candidate score.
Specifically, after the current sample state is obtained, the scoring prediction model firstly queries whether the current sample state exists in the regret value set:
if the current sample state exists in the regret value set, the score prediction model can obtain from the regret value set the regret value list corresponding to the current sample state, which contains one regret value for each candidate score in the score space; considering that actions with small regret values are not conducive to learning the globally optimal strategy, the score prediction model prunes the candidate scores whose regret values are less than or equal to the preset threshold, takes the remaining candidate scores whose regret values are greater than the preset threshold as the agent's current score space, and performs score prediction based on this score space;
otherwise, that is, if the current sample state does not exist in the regret value set, the score prediction model may create a regret value list for the current sample state in the regret value set, where each list element corresponds to the regret value of one candidate score in the current sample state, set each list element to an initial value, for example 1, and then perform score prediction based on each candidate score in the original score space.
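A sketch of the query-and-prune step just described, built on the hypothetical RegretPool above: if the current sample state is unseen, a fresh regret list is created at the initial value; otherwise candidates whose accumulated regret does not exceed the preset threshold are pruned.

```python
import numpy as np

def pruned_candidates(regret_pool, state, threshold=0.0):
    """Return the indices of candidate scores kept for the current decision."""
    if regret_pool.contains(state):
        regrets = regret_pool.get(state)          # state already visited: reuse stored regrets
    else:
        regrets = regret_pool.create(state)       # unseen state: all candidates stay available
    return np.flatnonzero(regrets > threshold)    # prune candidates with regret <= threshold
```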
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, and then further includes:
and the scoring prediction model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
Specifically, after the scoring prediction model completes scoring prediction, the scoring prediction model may calculate a current regret value of each candidate score according to the advantage of each candidate score in the current sample state, and then update each regret value corresponding to the current sample state in the regret value set according to the calculated current regret value of each candidate score. Here, the specific updating manner may be to superimpose an originally stored regret value of each candidate score in the current sample state in the regret value set with the calculated regret value of each candidate score, and then take the superimposed result as each regret value corresponding to the updated current sample state, where the superimposing manner may be direct superimposing or weighted superimposing, which is not specifically limited in this embodiment of the present invention.
For example, if in the score prediction stage it is found that the current sample state does not exist in the regret value set, a regret value list is created for the current sample state in the regret value set and each list element is set to 1; in this step, each element of the created regret value list and the current regret value of each candidate score can then be directly superimposed, so that each regret value corresponding to the current sample state in the updated regret value set equals 1 plus the current regret value of the corresponding candidate score.
It should be noted that this step is to update each state and corresponding regret value stored in the regret value set after each scoring prediction is completed, and accumulate and store the past decision experience by continuously overlapping regret values corresponding to the same state, so that the subsequent action pruning can accurately exclude the locally optimal scoring strategy, thereby improving the learning efficiency of reinforcement learning. In addition, whether the scoring prediction is performed based on the pruned scoring space or the original scoring space, this step should be performed after the scoring prediction is completed, i.e. the states stored in the regret value set and the corresponding regret values are updated.
Based on any of the above embodiments, updating each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score, including:
if the regret value of any candidate score in the current sample state in the regret value set is larger than a preset threshold value, overlapping the regret value of the candidate score and the current regret value of the candidate score to obtain an updated regret value of the candidate score;
if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, the regret value of the candidate score is not updated.
Specifically, the various states stored in the regret set and the corresponding regret values may be updated as follows: for candidate scores in which the regret value in the current sample state in the regret value set is greater than the preset threshold, the regret value of the candidate score stored originally in the regret value set and the calculated current regret value of the candidate score can be superposed, so that the regret value of the candidate score updated in the current sample state in the regret value set is obtained; for candidate scores in which the regret value in the current sample state in the regret value set is less than or equal to the preset threshold, the regret value of the candidate score is not updated, that is, the regret value of the candidate score in the current sample state stored in the regret value set is continuously less than or equal to the preset threshold.
It can be understood that if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, that candidate score will be pruned in the action pruning stage of the current decision. Since this step leaves its stored regret value at or below the preset threshold, the candidate score will also be pruned in the action pruning stage of every subsequent decision. The agent's action space can therefore be continuously reduced, and in the late stage of reinforcement learning one or several actions can be locked in for a specific state, which greatly improves the learning efficiency of reinforcement learning.
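The selective update described above could look like the following sketch (again using the hypothetical RegretPool): freshly computed regrets are superimposed only onto entries whose stored regret is still above the threshold, so pruned candidates stay pruned in all later decisions.

```python
import numpy as np

def update_regrets(regret_pool, state, current_regrets, threshold=0.0):
    """Superimpose the current regrets onto the stored list for entries above the threshold."""
    stored = regret_pool.get(state)                                  # mutable view into the pool
    mask = stored > threshold                                        # entries that are not yet pruned
    stored[mask] += np.asarray(current_regrets, dtype=float)[mask]   # direct superposition
    return stored
```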
Based on any of the above embodiments, performing score prediction based on a candidate score whose regret value is greater than a preset threshold includes:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantages of each current candidate score in the current sample state, wherein the current candidate score is a candidate score with an regret value larger than a preset threshold value;
and taking the current candidate score corresponding to the maximum value in the values of the current candidate scores as the current score.
Specifically, after the regret value of each candidate score in the current sample state is obtained from the regret value set, the candidate scores whose regret value is less than or equal to the preset threshold may be pruned, and the remaining candidate scores whose regret value is greater than the preset threshold are the current candidate scores. Then, the score prediction model may calculate the value of each current candidate score in the current sample state by the following formula:
Q(s, a_i) = V(s) + A(s, a_i)
wherein Q(s, a_i) is the value of the i-th current candidate score in the current sample state, A(s, a_i) is the advantage of the i-th current candidate score in the current sample state, V(s) is the value of the current sample state, s is the current sample state, and a_i is the i-th current candidate score. Here, V(s) can be calculated by the state value function and A(s, a_i) can be calculated by the action advantage function.
The values of all current candidate scores in the current sample state can be obtained through the formula, and then the values are compared, and the current candidate score corresponding to the maximum value is determined as the current score. Here, the current score is a result obtained by the score prediction model executing the current decision, and after the current score is determined, the score prediction model may control the agent to execute a score action, that is, to score the sample content according to the current score.
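The selection rule above can be sketched as follows, assuming value_net(state) returns the scalar V(s) and advantage_net(state) returns a vector with one advantage per candidate score; both interfaces are illustrative.

```python
import numpy as np

def select_score(state, keep_indices, value_net, advantage_net):
    """Q(s, a_i) = V(s) + A(s, a_i); return the kept candidate with the largest Q value."""
    v = float(value_net(state))                   # V(s) from the state value function
    adv = np.asarray(advantage_net(state))        # A(s, a_i) for every candidate score
    q = v + adv
    return int(keep_indices[np.argmax(q[keep_indices])])   # argmax over the pruned score space
```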
Based on any of the above embodiments, the regret value Re(s, a_i) of the i-th candidate score a_i in the history state s is determined from the value V(s) of the history state and the advantage A(s, a_i) of the i-th candidate score in the history state.
Specifically, in order to implement action pruning guided by regret values, the embodiment of the invention draws on game theory and defines the regret value of each candidate score in terms of the value of the state and the advantage of each candidate score in that state, so as to estimate the degree of loss in the return obtained by selecting each candidate score in the current state. For each history state s obtained from the environment, the value V(s) of the history state can first be calculated by the state value function, and the advantage A(s, a_i) of the i-th candidate score a_i in the history state can be calculated by the action advantage function; it is then judged whether A(s, a_i) is greater than 0, and the regret value Re(s, a_i) of the i-th candidate score in the history state is calculated from V(s) and A(s, a_i) according to the judgment result, so as to obtain the regret values of all candidate scores in the history state s; finally, the history state s and all its corresponding regret values are stored
Here, the state value function may be specifically implemented by a state value function network, and the action dominance function may be specifically implemented by an action dominance function network. It can be understood that the values of the states respectively output by the two networks and the advantages of the candidate scores are more accurate by continuously optimizing the parameters of the state value function network and the action advantage function network in the reinforcement learning process, so that the error in the regret value calculation process can be effectively reduced, and the effectiveness of action pruning is further improved.
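The exact regret formula is published only as an image, so the sketch below shows one plausible instantiation consistent with the description above: the regret of a candidate is its advantage over the state value, kept only when the advantage is positive. This form is an assumption, not the formula of the patent.

```python
import numpy as np

def current_regrets(state, value_net, advantage_net):
    """Assumed regret: positive part of each candidate's advantage in the given state."""
    v = float(value_net(state))             # V(s) from the state value function network
    adv = np.asarray(advantage_net(state))  # A(s, a_i) from the action advantage function network
    q = v + adv                             # Q(s, a_i) = V(s) + A(s, a_i)
    return np.maximum(q - v, 0.0)           # regret_i = max(A(s, a_i), 0), by assumption
```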
Based on any of the above embodiments, fig. 2 is a schematic flow chart of the determination method of the score prediction model provided by the present invention, as shown in fig. 2, taking movie recommendation as an example, a specific flow of the method is as follows:
Step 1: construct a recommendation system interaction environment;
Step 2: create a regret value set, namely a regret value pool Re_Pool, and an experience pool Ex_Pool, both pools being empty initially;
Step 3: initialize the state value function network and the action advantage function network in the score prediction model; set the reward function, the discount factor and the exploration probability ε;
Step 4: the agent makes random decisions and accumulates experience: the agent interacts with the environment and obtains from the environment the sample state s composed of the user features of the sample user and the movie features of the sample movie; according to the sample state s, the agent randomly selects a score a in the score space; the regret value of each candidate score in the sample state s is calculated and stored in Re_Pool; the score a is fed back to the environment to obtain the reward r, the state s' at the next moment and the flag sign indicating whether the termination state has been reached. Here, whether the termination state is reached is judged according to whether the browsing records of the current sample user have been completely traversed: if not, s' is composed of the user features of the same user and the movie features of the next sample movie; if the traversal is complete, the termination flag is set and the user is switched at the next interaction. The experience (s, a, r, s', sign) is then stored in Ex_Pool;
Step 5: repeat step 4 until the number of decisions exceeds a preset number;
Step 6: the agent interacts with the environment, obtains the current sample state s from the environment, and queries whether s exists in the regret value pool Re_Pool. If it exists, the regret value list corresponding to the state, containing one regret value for each candidate score in the score space, is extracted from the regret value pool, and an all-zero list of the same size as the regret value list is created; the elements at the positions whose regret values are greater than the preset threshold are filled in to obtain the zero list zo_list, whose non-zero entries mark the candidate scores that are kept. The value of each candidate score in the score space, namely the Q value, is then calculated according to the formula Q(s, a_i) = V(s) + A(s, a_i), and the score a is selected according to the ε-greedy policy, that is, the Q values of the candidate scores whose corresponding regret values are not 0 are screened out first, and the candidate score corresponding to the maximum of these Q values is taken as the agent's current decision. If s does not exist in the pool, a regret value list is created for the state and stored in Re_Pool, each element in the list is initialized to 1, and the score a is selected according to the ε-greedy policy, taking the candidate score corresponding to the maximum of the Q values of all candidate scores as the agent's current decision;
Step 7: the current sample state s is input into the state value function network to calculate the value V(s) of the state, and s is input into the action advantage function network to calculate the advantage A(s, a_i) of each candidate score, from which the current regret value of each candidate score under s is calculated; the regret values of the candidate scores under s in Re_Pool are then updated according to the calculated current regret values. The specific updating method is as follows: the existing regret value list composed of the regret values of the candidate scores under s, together with the position index of this list in the regret value pool, is obtained from the regret value pool; the current regret values of the candidate scores under s calculated as above form a latest regret value list; the elements at the corresponding positions of the existing regret value list and the latest regret value list are superimposed; finally, the regret value list at position index in the regret value pool is replaced by the superimposed regret value list;
Step 8: the score a is returned to the environment, and the reward r, the state s' at the next moment and the flag sign are received; the experience (s, a, r, s', sign) is stored in Ex_Pool;
Step 9: return to step 6 and continue the iterative training until the convergence condition of the score prediction model is met, for example until the precision of the predicted score no longer changes, where the precision can be the difference between the predicted score and the real score; at this point, reinforcement learning has learned a stable strategy, and the score prediction model is obtained when the training ends.
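The following sketch compresses steps 3 to 9 into a single loop, reusing the hypothetical helpers sketched earlier (RegretPool, pruned_candidates, select_score, current_regrets, update_regrets). The environment interface, the warm-up length, and the omission of ε-greedy exploration and of the network updates are all simplifying assumptions.

```python
import random

def run_training_loop(env, regret_pool, ex_pool, value_net, advantage_net,
                      num_candidates, warmup_decisions=1000, max_decisions=100000,
                      threshold=0.0):
    """Random decisions during warm-up, then decisions on the pruned score space;
    regrets and experience are accumulated at every step (sketch only)."""
    state = env.reset()                                        # sample state from the environment
    for step in range(max_decisions):
        if step < warmup_decisions:
            score = random.randrange(num_candidates)           # step 4: random decision phase
            if not regret_pool.contains(state):
                regret_pool.create(state)
        else:
            keep = pruned_candidates(regret_pool, state, threshold)       # step 6: query and prune
            score = select_score(state, keep, value_net, advantage_net)   # greedy on the pruned space
        regrets = current_regrets(state, value_net, advantage_net)        # step 7: compute regrets
        update_regrets(regret_pool, state, regrets, threshold)            # step 7: superimpose
        next_state, reward, done = env.step(score)             # step 8: feed the score back
        ex_pool.append((state, score, reward, next_state, done))
        state = env.reset() if done else next_state            # switch users when records run out
```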
Based on any of the above embodiments, fig. 3 is a schematic diagram of a reinforcement learning framework based on action pruning according to the present invention. As shown in fig. 3, the framework includes components such as a simulation environment, an action regret value pool, an experience replay pool, a state value function network and an action advantage function network. The specific flow of the reinforcement learning method based on this framework is as follows:
Step 1: create a regret value pool Re_Pool (i.e., the action regret value pool in FIG. 3) and an experience pool Ex_Pool (i.e., the experience replay pool in FIG. 3), both pools being empty initially;
Step 2: initialize the state value function network and the action advantage function network; set the reward function, the discount factor and the exploration probability ε;
Step 3: the agent interacts with the environment, makes random decisions, and stores the experience into Ex_Pool;
Step 4: repeat step 3 until the number of decisions exceeds a preset number;
Step 5: the agent interacts with the environment, obtains the current sample state s from the environment, and queries whether s exists in the regret value pool Re_Pool. If s is already in the pool, the action regret value list corresponding to the state, containing one regret value for each candidate action in the action space, is extracted from the regret value pool, and an all-zero list of the same size as the regret value list is created; the elements at the positions whose regret values are greater than the preset threshold are filled in to obtain the zero list zo_list, whose non-zero entries mark the candidate actions that are kept. The value of each candidate action in the action space, namely the Q value, is then calculated according to the formula Q(s, a_i) = V(s) + A(s, a_i), and the action a is selected according to the ε-greedy policy, that is, the Q values of the candidate actions whose corresponding regret values are not 0 are screened out first, and the candidate action corresponding to the maximum of these Q values is taken as the agent's current decision. If s is not in the pool, an action regret value list is created for s and stored in Re_Pool, the regret value of each action is initialized to 1, and the action a is selected according to the ε-greedy policy, taking the candidate action corresponding to the maximum of the Q values of all candidate actions as the agent's current decision;
Step 6: the current sample state s is input into the state value function network to calculate the value V(s) of the state, and s is input into the action advantage function network to calculate the advantage A(s, a_i) of each candidate action, from which the current regret value of each candidate action under s is calculated; the regret values of the candidate actions under s in Re_Pool are then updated according to the calculated current regret values. The specific updating method is as follows: the existing regret value list composed of the regret values of the candidate actions under s, together with the position index of this list in the regret value pool, is obtained from the regret value pool; the calculated current regret values of the candidate actions under s form a latest regret value list; the elements at the corresponding positions of the existing regret value list and the latest regret value list are superimposed; finally, the regret value list at position index in the regret value pool is replaced by the superimposed regret value list;
Step 7: the action a is returned to the simulation environment, the reward r and the state s' at the next moment are obtained, and the experience (s, a, r, s') is stored in Ex_Pool;
Step 8: return to step 5 and continue the iterative training until the networks converge; the strategy model is obtained when the training ends.
It should be noted that at the beginning of the reinforcement learning process the experience pool Ex_Pool has not yet been filled; at this time, for each interaction between the agent and the environment, the experience is only stored in the experience pool, and the state value function network and the action advantage function network are not updated. After the experience pool Ex_Pool is full, for each interaction between the agent and the environment, the experience (s, a, r, s') is first stored in Ex_Pool, and then a batch of experience (a mini-batch) is sampled from the experience pool to update the state value function network V and the action advantage function network A.
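Once the experience pool holds enough samples, the two networks could be refreshed with a mini-batch as in the sketch below. It assumes PyTorch, with value_net(s) returning a (batch,) tensor of V(s) and advantage_net(s) returning a (batch, num_candidates) tensor of A(s, a_i); the one-step TD target, the mean-squared loss and the shared optimizer are illustrative choices, not prescribed by the invention.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def update_networks(ex_pool, value_net, advantage_net, optimizer, batch_size=64, gamma=0.99):
    """One mini-batch TD update of the state value and action advantage networks (sketch)."""
    batch = random.sample(ex_pool, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    s = torch.as_tensor(np.stack(states), dtype=torch.float32)
    a = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(rewards, dtype=torch.float32)
    s2 = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    d = torch.as_tensor(dones, dtype=torch.float32)

    q = value_net(s) + advantage_net(s).gather(1, a).squeeze(1)            # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = (value_net(s2).unsqueeze(1) + advantage_net(s2)).max(dim=1).values
        target = r + gamma * (1.0 - d) * q_next                            # one-step TD target
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```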
The invention aims to provide a new idea and method for improving the convergence rate of reinforcement learning. During reinforcement learning, experience is first accumulated through the agent's random decisions; later, the regret value of each candidate action in each state is calculated from the existing experience, and when the regret value of a candidate action is less than or equal to a regret threshold parameter (namely the preset threshold in the above embodiments), the candidate action is pruned and is never selected again. Eventually one or several actions are locked in for a specific state, which greatly improves the convergence rate of reinforcement learning. In addition, the action pruning technique provided by the invention can be combined with any reinforcement learning algorithm to improve its learning efficiency, so it has a wide range of applications.
The following describes a recommendation device based on action pruning according to the present invention, and the recommendation device based on action pruning described below and the recommendation method based on action pruning described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of a recommendation device based on action pruning, as shown in fig. 4, the device includes:
a determining module 410, configured to determine, based on the user characteristics of the target user and the content characteristics of each content to be recommended, a state corresponding to each content to be recommended;
the recommending module 420 is configured to predict the scores of the contents to be recommended based on the states corresponding to the contents to be recommended and a score prediction model, and to recommend to the target user based on the scores of the contents to be recommended;
the scoring prediction model is obtained by performing reinforcement learning based on a sample state corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, scoring prediction is carried out based on the candidate scores with the regret values larger than a preset threshold value, the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.
According to the device provided by the embodiment of the invention, the regret value set accumulates the regret values of the candidate scores in the state visited by the agent at each decision. During reinforcement learning, the score prediction model prunes the candidate scores with lower regret values based on the regret value set, so that action pruning increases the convergence rate and learning efficiency of reinforcement learning. Applying the score prediction model to obtain the scores of the contents to be recommended then enables personalized, accurate recommendation for different users and improves the user experience.
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, including:
querying the current sample state in the regret value set;
if the regret value set has the current sample state, the scoring prediction model obtains regret values of all candidate scores in the current sample state from the regret value set, and scoring prediction is carried out based on the candidate scores of which the regret values are larger than a preset threshold value;
otherwise, the scoring prediction model adds the regret value of each candidate score in the current sample state in the regret value set, sets the regret value of each added candidate score as an initial value, and performs scoring prediction based on each candidate score.
Based on any of the above embodiments, the scoring prediction model obtains an regret value of each candidate score in the current sample state from the regret value set, and performs scoring prediction based on the candidate score with the regret value larger than a preset threshold, and then further includes:
and the scoring prediction model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
Based on any of the above embodiments, updating each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score, including:
if the regret value of any candidate score in the current sample state in the regret value set is larger than the preset threshold, superimposing the stored regret value of that candidate score with its current regret value to obtain the updated regret value of that candidate score;
and if the regret value of any candidate score in the current sample state in the regret value set is less than or equal to the preset threshold, the regret value of any candidate score is not updated.
Based on any of the above embodiments, performing scoring prediction based on the candidate scores whose regret values are larger than a preset threshold includes:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantage of each current candidate score in the current sample state, wherein a current candidate score is a candidate score whose regret value is larger than the preset threshold;
taking the current candidate score with the largest value among the current candidate scores as the current score.
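A sketch of this selection step. It assumes the value V(s) of the current sample state and the per-candidate advantages A(s, a_i) are available, for example from a dueling-style network head; that interface is an assumption made for illustration, since the patent only states that the value of a candidate score is determined from the state value and the candidate's advantage.

```python
import numpy as np


def select_score(state_value: float, advantages: np.ndarray,
                 mask: np.ndarray, candidate_scores: np.ndarray) -> float:
    """Choose the current score among the candidates that survived pruning.

    The value of each candidate score is taken as V(s) + A(s, a_i); pruned
    candidates are excluded, and the candidate with the largest value is chosen.
    """
    values = state_value + advantages            # value of every candidate score
    values = np.where(mask, values, -np.inf)     # drop pruned candidates
    return float(candidate_scores[int(np.argmax(values))])
```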
Based on any of the above embodiments, the regret value is determined based on the following formula:

$$R(s, a_i) = \big(V(s) + A(s, a_i)\big) - V(s)$$

wherein $R(s, a_i)$ is the regret value of the $i$-th candidate score in the historical state, $V(s)$ is the value of the historical state, $A(s, a_i)$ is the advantage of the $i$-th candidate score in the historical state, $s$ is the historical state, and $a_i$ is the $i$-th candidate score.
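Read this way, the regret of a candidate score in a historical state reduces to that candidate's advantage in the state; a one-line sketch of the computation, keeping the V(s) and A(s, a_i) terms explicit to mirror the formula above:

```python
import numpy as np


def compute_regrets(state_value: float, advantages: np.ndarray) -> np.ndarray:
    """Regret of each candidate score in a historical state, computed as
    (V(s) + A(s, a_i)) - V(s); this equals the candidate's advantage."""
    return (state_value + advantages) - state_value
```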
Fig. 5 illustrates a schematic physical structure diagram of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the action pruning-based recommendation method, which includes: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended; the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, wherein the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention further provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, wherein the computer program, when executed by a processor, enables a computer to execute the action pruning-based recommendation method provided by the above embodiments, the method including: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended; the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, wherein the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the action pruning-based recommendation method provided by the above embodiments, the method including: determining the state corresponding to each content to be recommended based on the user characteristics of a target user and the content characteristics of each content to be recommended; predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended; the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, wherein the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A recommendation method based on action pruning is characterized by comprising the following steps:
determining the state corresponding to each content to be recommended based on user characteristics of a target user and content characteristics of each content to be recommended;
predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended;
wherein the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from a regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold; the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
2. The action pruning-based recommendation method according to claim 1, wherein the scoring prediction model obtaining the regret value of each candidate score in the current sample state from the regret value set and performing scoring prediction based on the candidate scores whose regret values are larger than a preset threshold comprises:
querying the regret value set for the current sample state;
if the current sample state exists in the regret value set, obtaining, by the scoring prediction model, the regret value of each candidate score in the current sample state from the regret value set, and performing scoring prediction based on the candidate scores whose regret values are larger than the preset threshold;
otherwise, adding, by the scoring prediction model, the regret value of each candidate score in the current sample state to the regret value set, setting each added regret value to an initial value, and performing scoring prediction based on all candidate scores.
3. The action pruning-based recommendation method according to claim 1 or 2, wherein after the scoring prediction model obtains the regret value of each candidate score in the current sample state from the regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold, the method further comprises:
determining, by the scoring prediction model, the current regret value of each candidate score based on the advantage of each candidate score in the current sample state, and updating the regret values corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
4. The action pruning-based recommendation method according to claim 3, wherein the updating the regret values corresponding to the current sample state in the regret value set based on the current regret value of each candidate score comprises:
if the regret value of any candidate score in the current sample state stored in the regret value set is larger than a preset threshold, superimposing the current regret value of that candidate score on its stored regret value to obtain an updated regret value of that candidate score;
if the regret value of any candidate score in the current sample state stored in the regret value set is less than or equal to the preset threshold, not updating the regret value of that candidate score.
5. The action pruning-based recommendation method according to claim 1 or 2, wherein the performing scoring prediction based on the candidate scores whose regret values are larger than a preset threshold comprises:
determining the value of each current candidate score in the current sample state based on the value of the current sample state and the advantage of each current candidate score in the current sample state, wherein a current candidate score is a candidate score whose regret value is larger than the preset threshold;
taking the current candidate score with the largest value among the current candidate scores as the current score.
6. The action pruning-based recommendation method according to claim 1 or 2, wherein the regret value is determined based on the following formula:

$$R(s, a_i) = \big(V(s) + A(s, a_i)\big) - V(s)$$

wherein $R(s, a_i)$ is the regret value of the $i$-th candidate score in the historical state, $V(s)$ is the value of the historical state, $A(s, a_i)$ is the advantage of the $i$-th candidate score in the historical state, $s$ is the historical state, and $a_i$ is the $i$-th candidate score.
7. A recommendation device based on action pruning, comprising:
the determining module is used for determining the state corresponding to each content to be recommended based on user characteristics of a target user and content characteristics of each content to be recommended;
the recommendation module is used for predicting the score of each content to be recommended based on the state corresponding to each content to be recommended and a scoring prediction model, and recommending content to the target user based on the score of each content to be recommended;
wherein the scoring prediction model is obtained by performing reinforcement learning based on sample states corresponding to sample content; in the reinforcement learning process, the scoring prediction model obtains the regret value of each candidate score in the current sample state from a regret value set and performs scoring prediction based on the candidate scores whose regret values are larger than a preset threshold; the regret value set stores historical states and the regret values corresponding to the historical states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state before the current sample state.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the action pruning-based recommendation method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the action pruning-based recommendation method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the action pruning-based recommendation method according to any one of claims 1 to 6.
CN202111185124.8A 2021-10-12 2021-10-12 Recommendation method and device based on action pruning, electronic equipment and storage medium Active CN113626720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111185124.8A CN113626720B (en) 2021-10-12 2021-10-12 Recommendation method and device based on action pruning, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113626720A true CN113626720A (en) 2021-11-09
CN113626720B CN113626720B (en) 2022-02-25

Family

ID=78391005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111185124.8A Active CN113626720B (en) 2021-10-12 2021-10-12 Recommendation method and device based on action pruning, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113626720B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160239738A1 (en) * 2013-10-23 2016-08-18 Tencent Technology (Shenzhen) Company Limited Question recommending method, apparatus and system
CN110404265A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) A kind of non-complete information machine game method of more people based on game final phase of a chess game online resolution, device, system and storage medium
CN111476639A (en) * 2020-04-10 2020-07-31 深圳市物语智联科技有限公司 Commodity recommendation strategy determining method and device, computer equipment and storage medium
CN111986005A (en) * 2020-08-31 2020-11-24 上海博泰悦臻电子设备制造有限公司 Activity recommendation method and related equipment
CN112149824A (en) * 2020-09-15 2020-12-29 支付宝(杭州)信息技术有限公司 Method and device for updating recommendation model by game theory

Also Published As

Publication number Publication date
CN113626720B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
CN107515909B (en) Video recommendation method and system
CN110138612B (en) Cloud software service resource allocation method based on QoS model self-correction
JP2024026276A (en) Computer-based system, computer component and computer object configured to implement dynamic outlier bias reduction in machine learning model
CN112329948B (en) Multi-agent strategy prediction method and device
CN109408731A (en) A kind of multiple target recommended method, multiple target recommended models generation method and device
JP7224395B2 (en) Optimization method, device, device and computer storage medium for recommender system
Xu et al. Learning to explore with meta-policy gradient
CN112149824B (en) Method and device for updating recommendation model by game theory
KR102203253B1 (en) Rating augmentation and item recommendation method and system based on generative adversarial networks
CN111159542A (en) Cross-domain sequence recommendation method based on self-adaptive fine-tuning strategy
US20230311003A1 (en) Decision model training method and apparatus, device, storage medium, and program product
WO2021055442A1 (en) Small and fast video processing networks via neural architecture search
CN113626720B (en) Recommendation method and device based on action pruning, electronic equipment and storage medium
CN113761388A (en) Recommendation method and device, electronic equipment and storage medium
CN117056595A (en) Interactive project recommendation method and device and computer readable storage medium
Pinto et al. Learning partial policies to speedup MDP tree search via reduction to IID learning
CN116992151A (en) Online course recommendation method based on double-tower graph convolution neural network
CN113449176A (en) Recommendation method and device based on knowledge graph
CN113626721B (en) Regrettful exploration-based recommendation method and device, electronic equipment and storage medium
CN115600818A (en) Multi-dimensional scoring method and device, electronic equipment and storage medium
WO2022166125A1 (en) Recommendation system with adaptive weighted baysian personalized ranking loss
CN110533192B (en) Reinforced learning method and device, computer readable medium and electronic equipment
CN113221017B (en) Rough arrangement method and device and storage medium
CN110765345A (en) Searching method, device and equipment
CN113468436A (en) Reinforced learning recommendation method, system, terminal and medium based on user evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant