CN113626721B - Regret-exploration-based recommendation method and device, electronic equipment and storage medium - Google Patents

Regret-exploration-based recommendation method and device, electronic equipment and storage medium

Info

Publication number
CN113626721B
CN113626721B (application CN202111185156.8A)
Authority
CN
China
Prior art keywords
candidate
state
regret
score
exploration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111185156.8A
Other languages
Chinese (zh)
Other versions
CN113626721A (en)
Inventor
白栋栋
洪志理
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111185156.8A priority Critical patent/CN113626721B/en
Publication of CN113626721A publication Critical patent/CN113626721A/en
Application granted granted Critical
Publication of CN113626721B publication Critical patent/CN113626721B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a recommendation method and device, electronic equipment and a storage medium based on regret exploration. The method comprises: determining the state of each candidate object based on the user characteristics of a target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; and determining the object recommended to the target user based on the scores of the candidate objects. The scoring model is obtained by reinforcement learning based on the sample states of sample objects. During reinforcement learning, the scoring model performs score exploration based on a regret value set and the current sample state; the regret value set stores historical states and the regret values corresponding to those states, each regret value is determined from the advantages of the candidate scores in the historical state, and a historical state is a sample state preceding the current sample state. This improves exploration efficiency, achieves personalized and accurate recommendation for different users, and improves user experience.

Description

Regret-exploration-based recommendation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a recommendation method and apparatus, an electronic device, and a storage medium based on regret exploration.
Background
Because it can perceive a dynamic environment and obtain rewards from that environment to continually adapt to it, reinforcement learning is particularly suitable for interactive business scenarios, such as recommending content to a user. Exploration and exploitation have always been difficult points in reinforcement learning, and exploration efficiency determines whether the algorithm can converge to the maximum cumulative return.
Current exploration methods are numerous, the more classical among them being ε-greedy, Thompson sampling and the like; however, these methods do not explore actions from a global perspective for specific states, which greatly limits exploration efficiency.
Disclosure of Invention
The invention provides a recommendation method and device, electronic equipment and a storage medium based on regret exploration, which are used to overcome the defect of low exploration efficiency in the prior art and to improve exploration efficiency.
The invention provides a recommendation method based on regret exploration, which comprises the following steps:
determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;
inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;
determining an object recommended to the target user based on the scores of the candidate objects;
the scoring model is obtained by reinforcement learning based on the sample states of the sample objects; during reinforcement learning, the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
According to the recommendation method based on regret exploration provided by the invention, the scoring model performing score exploration based on the regret value set and the current sample state comprises:
determining a currently generated random number;
if the random number is greater than or equal to a preset exploration probability, the scoring model performing score exploitation based on the current sample state;
otherwise, the scoring model performing score exploration based on the regret value set and the current sample state.
According to the recommendation method based on regret exploration provided by the invention, the regret value $R(s, a_i)$ of the $i$-th candidate score $a_i$ in a historical state $s$ is determined from the value $V(s)$ of the historical state and the advantage $A(s, a_i)$ of that candidate score in the historical state.
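The patent gives the regret formula only as an embedded image; from the symbols above and the check on whether the advantage is positive described in the embodiments, a plausible reconstruction is the following, in which the regret of a candidate score is its advantage over the state value, clipped at zero (an assumption, not the patent's literal equation):

$$
Q(s, a_i) = V(s) + A(s, a_i), \qquad
R(s, a_i) = \max\bigl(Q(s, a_i) - V(s),\, 0\bigr) =
\begin{cases}
A(s, a_i), & A(s, a_i) > 0,\\
0, & \text{otherwise.}
\end{cases}
$$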
According to the recommendation method based on regret exploration provided by the invention, performing score exploration based on the regret value set and the current sample state comprises:
if the regret value set includes the current sample state, obtaining the regret values corresponding to the current sample state from the regret value set, and taking the candidate score corresponding to the maximum of these regret values as the current score;
otherwise, setting the regret value of each candidate score in the current sample state to an initial value in the regret value set, and taking a candidate score selected from the candidate scores with equal probability as the current score.
According to the recommendation method based on regret exploration provided by the invention, after the scoring model performs score exploration based on the regret value set and the current sample state, the method further comprises:
if the regret value set includes the current sample state, the scoring model determines the current regret value of each candidate score based on the advantages of the candidate scores in the current sample state, and updates the regret values corresponding to the current sample state in the regret value set based on the current regret values of the candidate scores.
According to the recommendation method based on regret exploration provided by the invention, performing score exploitation based on the current sample state comprises:
determining the value of each candidate score in the current sample state based on the value of the current sample state and the advantage of each candidate score in the current sample state;
and taking the candidate score corresponding to the maximum of these values as the current score.
The invention also provides a recommendation device based on regret exploration, which comprises:
the determining module is used for determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;
the input module is used for inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;
the recommending module is used for determining the objects recommended to the target user based on the scores of the candidate objects;
the scoring model is obtained by reinforcement learning based on the sample states of the sample objects; during reinforcement learning, the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the above recommendation methods based on regret exploration.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above recommendation method based on regret exploration.
According to the recommendation method and device based on regret exploration, the electronic equipment and the storage medium, the scoring model performs score exploration based on the regret value set and the current sample state during reinforcement learning, so exploration builds on accumulated historical decision experience and actions are explored from a global perspective for specific states. This greatly improves exploration efficiency and, in turn, the learning efficiency of reinforcement learning. Applying the scoring model to obtain the score of each candidate object then enables personalized and accurate recommendation for different users and improves user experience.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a recommendation method based on regret exploration provided by the present invention;
FIG. 2 is a schematic flow chart of a scoring model determination method provided by the present invention;
FIG. 3 is a schematic diagram of a reinforcement learning framework based on regret exploration provided by the present invention;
FIG. 4 is a schematic diagram of a recommendation apparatus based on regret exploration according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Exploration and exploitation have always been difficult points in reinforcement learning. Exploitation selects the currently optimal strategy according to known knowledge and experience; exploration tries different strategies to find out whether a better one exists. Exploitation can obtain the maximum immediate reward, but when learning is insufficient the algorithm tends to fall into a local optimum. Exploration can fully learn the reward of each strategy, find the optimal strategy, avoid local optima, and help maximize the cumulative return, but it takes more learning time and can slow down the convergence of the algorithm.
The existing exploration methods are numerous, the more classical among them being ε-greedy, Thompson sampling and the like; however, these methods do not explore actions from a global perspective for specific states, which greatly limits exploration efficiency and, in turn, the learning efficiency of reinforcement learning.
To address this, an embodiment of the present invention provides a recommendation method based on regret exploration. Fig. 1 is a schematic flow chart of the recommendation method based on regret exploration provided by the present invention; as shown in fig. 1, the method comprises:
step 110, determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;
step 120, inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;
step 130, determining an object recommended to a target user based on the scores of the candidate objects;
the scoring model is obtained by reinforcement learning based on the sample states of the sample objects; during reinforcement learning, the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
Specifically, the target user refers to a user to be subjected to object recommendation, and the user characteristics are used for representing attribute information of the target user, such as gender, age, education level, occupation and the like of the target user. The candidate object refers to an object to be recommended, and may be specifically determined based on the browsing record of the target user, and it may be understood that the browsing record covers preference information of the user, so as to facilitate acceptance of a subsequently recommended object by the user. The specific type of the candidate object may be a movie, music, news, etc., and this is not particularly limited by the embodiment of the present invention. The object features are used to characterize attribute information of the candidate object, e.g., movie type, subject content, etc.
In order to perform personalized content recommendation for different users, the embodiment of the invention firstly determines the corresponding state of each candidate object according to the user characteristics of the target user and the object characteristics of each candidate object, then inputs the corresponding state of each candidate object into the scoring model, scores each candidate object by the scoring model so as to output the score of each candidate object, finally sorts each candidate object according to the score of each candidate object, and determines the object recommended to the target user according to the sorting result. Here, the state corresponding to each candidate object, that is, the state of the scoring recommendation environment where each candidate object and the target user are located, may be obtained by environment feedback.
It can be understood that, since the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object, in the reinforcement learning process, the scoring model can learn the scoring modes of different users by obtaining rewards from the environment and continuously performing policy optimization, and on the basis, the scoring model is applied to execute the step 120, and the obtained scores of the candidate objects can accurately represent the preference degree of the target user for the candidate objects, so that personalized accurate recommendation is performed on different users, and the user experience is greatly improved.
Before step 120 is executed, the scoring model may be trained specifically as follows: first, user characteristics of a large number of sample users and object characteristics of sample objects are collected, and a sample state corresponding to each sample object is determined based on the user characteristics of each sample user and the object characteristics of each sample object. And then, performing reinforcement learning on the initial model based on the sample state corresponding to each sample object, thereby obtaining a grading model.
In view of the fact that existing exploration methods do not explore actions from a global perspective for specific states, and that exploration efficiency therefore suffers, during reinforcement learning the scoring model performs score exploration in the environment based on the regret value set and the current sample state. Here, the regret value set stores historical states and the regret values of the candidate scores corresponding to each historical state; the regret values can be determined according to the advantages of the candidate scores in the historical state, the candidate scores are the scores selectable in the scoring space, and the advantages of the candidate scores can be computed by an action advantage function. Score exploration corresponds to exploration in the scoring scene: different scoring strategies are tried to find out whether a better one exists.
During reinforcement learning, the scoring model interacts with the environment essentially through an agent it controls. After the scoring model determines the current decision action through score exploration, the agent can be controlled to execute that decision action, i.e. to score the sample object; the environment reacts to the score, and the next sample state of the environment is obtained. A historical state is a sample state before the current sample state: the agent visits one sample state of the environment at each decision, and that sample state can serve as a historical state regardless of whether the corresponding decision before the current sample state was made in exploration or exploitation mode.
It should be noted that the regret value set stores the sample states visited by the agent; for each visited sample state it maintains an action table that records the regret value of each candidate action, i.e., each candidate score, the agent can take in that state. During exploration, score exploration is performed based on the regret value set instead of the existing random exploration, so exploration builds on past decision experience, which greatly improves exploration efficiency and, in turn, the learning efficiency of reinforcement learning. In addition, the regret value set collects the agent's past decision experience into cumulative regret, and this cumulative regret together with the specific state guides score exploration, so actions are explored purposefully from a global perspective for each specific state, further improving exploration efficiency.
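The patent does not give a concrete data structure for the regret value set; the following is a minimal sketch assuming a discrete scoring space and feature-vector states that can be turned into hashable keys (class and method names such as RegretPool, init_state and accumulate are illustrative, not from the patent):

```python
from collections import defaultdict
import numpy as np

NUM_SCORES = 5          # assumed size of the scoring space, e.g. ratings 1-5
INIT_REGRET = 1.0       # the embodiment initialises every regret entry to 1

class RegretPool:
    """Regret value set: maps each visited sample state to a list holding
    the cumulative regret of every candidate score in that state."""

    def __init__(self, num_scores=NUM_SCORES):
        self.num_scores = num_scores
        self.table = {}                       # state_key -> np.ndarray of regrets

    def _key(self, state):
        # States are feature vectors; a hashable key is assumed here.
        return tuple(np.asarray(state).round(6).tolist())

    def contains(self, state):
        return self._key(state) in self.table

    def get(self, state):
        return self.table[self._key(state)]

    def init_state(self, state):
        # Create the regret list for a newly visited state, all entries = 1.
        self.table[self._key(state)] = np.full(self.num_scores, INIT_REGRET)

    def accumulate(self, state, current_regrets):
        # Superimpose the newly computed regrets onto the stored ones.
        self.table[self._key(state)] += np.asarray(current_regrets)
```

Keeping one fixed-length array per visited state mirrors the "action table per visited sample state" described above and makes both the argmax lookup during exploration and the element-wise superposition during updates linear in the number of candidate scores.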
According to the method provided by the embodiment of the invention, the scoring model performs score exploration based on the regret value set and the current sample state during reinforcement learning, so exploration builds on accumulated historical decision experience and actions are explored from a global perspective for specific states. This greatly improves exploration efficiency and, in turn, the learning efficiency of reinforcement learning. Applying the scoring model to obtain the score of each candidate object then enables personalized and accurate recommendation for different users and improves user experience.
Based on any of the above embodiments, the scoring model performing score exploration based on the regret value set and the current sample state includes:
determining a currently generated random number;
if the random number is greater than or equal to a preset exploration probability, the scoring model performs score exploitation based on the current sample state;
otherwise, the scoring model performs score exploration based on the regret value set and the current sample state.
Specifically, score exploitation corresponds to exploitation in the scoring scene: the optimal scoring strategy in the current sample state is selected directly. To balance exploration and exploitation, after obtaining the current sample state the scoring model first determines a currently generated random number, compares it with a preset exploration probability, and then, based on the comparison, decides whether the current decision action, i.e. the current score chosen from the scoring space, is determined by exploitation or by exploration:
if the random number is greater than or equal to the exploration probability, the scoring model determines the current score by exploitation, specifically according to the current sample state; otherwise, i.e. the random number is smaller than the exploration probability, the scoring model determines the current score by exploration, specifically according to the regret value set and the current sample state.
Here, the random number may be generated by a random module, and the exploration probability, i.e. the probability that the agent scores exploratorily, may be preset according to actual requirements; this is not specifically limited in the embodiment of the present invention.
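As a minimal illustration of the random-number test described above (choose_score, exploit_score and explore_score are illustrative helper names standing in for the score-exploitation and score-exploration procedures sketched with the corresponding embodiments below):

```python
import random

def choose_score(model, regret_pool, state, num_scores, explore_prob=0.1):
    """Pick the current score: exploit with probability 1 - explore_prob,
    otherwise perform regret-guided score exploration."""
    rand_num = random.random()                  # the currently generated random number
    if rand_num >= explore_prob:
        return exploit_score(model, state)                   # score exploitation
    return explore_score(regret_pool, state, num_scores)     # regret-guided score exploration
```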
Based on any of the above embodiments, the regret value $R(s, a_i)$ of the $i$-th candidate score $a_i$ in a historical state $s$ is determined from the value $V(s)$ of the historical state and the advantage $A(s, a_i)$ of that candidate score in the historical state.
Specifically, in order to achieve purposeful score exploration, the embodiment of the invention draws on game theory and uses the value of a state together with the advantages of the candidate scores in that state to define the regret value of each candidate score, which estimates how much return is lost by selecting that candidate score in the current state. For each historical state $s$ obtained from the environment, the value $V(s)$ of the historical state can first be computed by the state value function, and the advantage $A(s, a_i)$ of the $i$-th candidate score $a_i$ in the historical state can be computed by the action advantage function; it is then judged whether $A(s, a_i)$ is greater than 0, and the regret value $R(s, a_i)$ of $a_i$ in the historical state is computed from this judgment and the formula, thereby obtaining the regret values of all candidate scores in the historical state $s$; finally, all regret values corresponding to the historical state $s$ are stored in the regret value set for subsequent score exploration.
Here, the state value function may be implemented by a state value function network and the action advantage function by an action advantage function network. It can be understood that, by continuously optimizing the parameters of the two networks during reinforcement learning, the state values and candidate-score advantages they output become more accurate, so errors in the regret value calculation are effectively reduced and the efficiency of score exploration is further improved.
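A minimal sketch of how the two networks could be realised, assuming PyTorch and a dueling-style shared trunk (layer sizes and names are illustrative); current_regrets uses the clipped-advantage reading of the regret formula discussed above, which is an assumption rather than the patent's literal equation:

```python
import torch
import torch.nn as nn

class DuelingScorer(nn.Module):
    """State value network V(s) and action advantage network A(s, a_i)
    sharing a feature trunk; layer sizes are illustrative."""

    def __init__(self, state_dim, num_scores, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)           # V(s)
        self.adv_head = nn.Linear(hidden, num_scores)    # A(s, a_i) for every candidate score

    def forward(self, state):
        h = self.trunk(state)
        return self.value_head(h), self.adv_head(h)

def current_regrets(model, state):
    """Regret values of all candidate scores in `state`, using the
    clipped-advantage reading of the regret formula (an assumption)."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        v, adv = model(state_t)
    return torch.clamp(adv, min=0.0)        # non-zero only where A(s, a_i) > 0
```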
Based on any of the above embodiments, performing score exploration based on the regret value set and the current sample state includes:
if the regret value set includes the current sample state, obtaining the regret values corresponding to the current sample state from the regret value set, and taking the candidate score corresponding to the maximum of these regret values as the current score;
otherwise, setting the regret value of each candidate score in the current sample state to an initial value in the regret value set, and taking a candidate score selected from the candidate scores with equal probability as the current score.
Specifically, after determining that score exploration is to be performed, the scoring model may first look up the current sample state in the regret value set. If the current sample state is found, i.e. the regret value set includes the current sample state, the regret value list $[R_1, R_2, \dots, R_N]$ corresponding to the current sample state can be obtained from the regret value set, where $N$ is the number of candidate scores in the scoring space. Considering actions with large regret values facilitates the learning of the exploration process, so that a scoring strategy better than the current locally optimal one can be explored; the scoring model therefore takes the candidate score corresponding to the maximum regret value in the obtained regret value list as the current score.
Otherwise, i.e. the regret value set does not include the current sample state, a regret value list may be created for the current sample state in the regret value set, each element of which corresponds to the regret value of one candidate score in the current sample state, and every element is set to an initial value, for example 1; a candidate score is then selected from the scoring space with equal probability and taken as the current score.
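A minimal sketch of this exploration branch, reusing the RegretPool sketch from above (helper names are illustrative):

```python
import random
import numpy as np

def explore_score(regret_pool, state, num_scores):
    """Regret-guided score exploration: pick the candidate score with the largest
    stored regret if the state has been visited, otherwise initialise its regret
    list (all ones) and pick a candidate score uniformly at random."""
    if regret_pool.contains(state):
        regrets = regret_pool.get(state)        # [R_1, ..., R_N] for this state
        return int(np.argmax(regrets))          # index of the chosen candidate score
    regret_pool.init_state(state)               # every entry initialised to 1
    return random.randrange(num_scores)         # equal-probability choice
```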
Based on any of the above embodiments, after the scoring model performs score exploration based on the regret value set and the current sample state, the method further includes:
if the regret value set comprises the current sample state, the scoring model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
Specifically, after determining the current decision-making action, the scoring model may first query whether the current sample state exists in the regret value set, and if the regret value set includes the current sample state, the scoring model may calculate the current regret value of each candidate score according to the advantages of each candidate score in the current sample state, and then update each regret value corresponding to the current sample state in the regret value set according to the calculated current regret value of each candidate score.
Here, the specific update may be to superimpose the regret value of each candidate score in the current sample state originally stored in the regret value set with the newly calculated regret value of that candidate score, and to take the superimposed result as the updated regret value corresponding to the current sample state; the superposition may be direct or weighted, which is not specifically limited in this embodiment of the present invention.
For example, if in the score exploration step the regret value set did not include the current sample state, a regret value list was created for it with every element set to 1; in this step the regret value set therefore does include the current sample state, and each updated regret value corresponding to the current sample state may be 1 plus the current regret value of the respective candidate score.
It should be noted that this step updates the states and corresponding regret values stored in the regret value set after every determination of the current decision action; by continuously superimposing regret values corresponding to the same state, past decision experience is accumulated, so subsequent regret exploration can proceed accurately toward the target direction from a global perspective, further improving exploration efficiency. Moreover, regardless of whether the decision action was determined by score exploration or by score exploitation, this step should be executed after the decision action is determined, i.e. the states and corresponding regret values stored in the regret value set are updated.
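A minimal sketch of the update, reusing the RegretPool and current_regrets sketches above; the element-wise superposition shown is the direct (unweighted) variant mentioned in the text:

```python
def update_regret_pool(regret_pool, model, state):
    """After each decision (exploration or exploitation), superimpose the newly
    computed regrets onto the stored list for the current sample state."""
    if not regret_pool.contains(state):
        regret_pool.init_state(state)                       # unseen state: list of ones
    regrets_now = current_regrets(model, state).numpy()     # see the sketch above
    regret_pool.accumulate(state, regrets_now)              # direct element-wise superposition
```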
Based on any of the above embodiments, performing score exploitation based on the current sample state comprises:
determining the value of each candidate score in the current sample state based on the value of the current sample state and the advantage of each candidate score in the current sample state;
and taking the candidate score corresponding to the maximum of these values as the current score.
Specifically, after determining that score exploitation is to be performed, the scoring model may first calculate the value of each candidate score in the current sample state as
$$Q(s, a_i) = V(s) + A(s, a_i),$$
where $Q(s, a_i)$ is the value of the $i$-th candidate score in the current sample state, $A(s, a_i)$ is the advantage of the $i$-th candidate score in the current sample state, $V(s)$ is the value of the current sample state, $s$ is the current sample state, and $a_i$ is the $i$-th candidate score. Here, $V(s)$ can be calculated by the state value function and $A(s, a_i)$ by the action advantage function.
The value of all candidate scores in the current sample state can be obtained through the formula. Then, the values are compared, and the candidate score corresponding to the maximum value is determined as the current score.
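A minimal sketch of the exploitation branch, again using the DuelingScorer sketch above ($Q(s, a_i) = V(s) + A(s, a_i)$, argmax over candidate scores):

```python
import torch

def exploit_score(model, state):
    """Score exploitation: compute Q(s, a_i) = V(s) + A(s, a_i) for every candidate
    score and take the candidate score with the largest Q value as the current score."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        v, adv = model(state_t)             # DuelingScorer sketch above
        q = v + adv                         # value of each candidate score
    return int(torch.argmax(q))             # index into the scoring space
```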
Based on any of the above embodiments, fig. 2 is a schematic flow chart of the method for determining the scoring model provided by the present invention. As shown in fig. 2, taking movie recommendation as an example, the specific flow of the method is as follows:
Step 1: build a simulation environment, such as a scoring recommendation environment;
Step 2: create the regret value set, i.e. the regret value pool Re_Pool, and the experience pool Ex_Pool; both pools are initially empty;
Step 3: initialize the state value function network and the action advantage function network in the scoring model; set the reward function, the discount factor and the exploration probability $\varepsilon$;
Step 4: the agent makes random decisions and accumulates experience: the agent interacts with the environment and obtains from the environment the sample state $s$ composed of the user characteristics of a sample user and the movie characteristics of a sample movie; according to the sample state $s$, the agent randomly selects a score $a$ from the scoring space, computes the regret value of each candidate score in the sample state $s$ and stores them in Re_Pool, and feeds the score $a$ back to the environment to obtain the reward $r$, the next-moment state $s'$ and the flag sign indicating whether the termination state has been reached. Here, whether the termination state has been reached is judged by whether the browsing records of the current sample user have been completely traversed; if not, $s'$ is composed of the user characteristics of the sample user and the movie characteristics of the next sample movie; if the traversal is complete, the termination flag sign is set and the user is switched at the next interaction. The experience $(s, a, r, s', \text{sign})$ is then stored in Ex_Pool;
Step 5: repeat step 4 until the number of decisions exceeds a preset number of times;
Step 6: the agent interacts with the environment and obtains the current sample state $s$ from the environment; $s$ is input into the state value function to compute the value $V(s)$ of the state, and into the action advantage function to compute the advantage $A(s, a_i)$ of each candidate score in this state; from these, the current regret value $R(s, a_i)$ of each candidate score under $s$ is computed;
Step 7: a score is selected according to the current sample state $s$, specifically: the random module random() generates a random number (i.e. from-num in fig. 2); if the random number is greater than or equal to $\varepsilon$, the value of each candidate score, i.e. the Q value $Q(s, a_i) = V(s) + A(s, a_i)$, is calculated, and the candidate score $a$ with the largest Q value is selected as the decision of the agent; if the random number is less than $\varepsilon$, the regret value pool Re_Pool is queried for $s$: if it is present, the regret value list $[R_1, R_2, \dots, R_N]$ corresponding to the state is extracted from the regret value pool and the candidate score $a$ with the greatest regret value among the $N$ candidate scores is taken as the decision of the agent, where $N$ is the number of candidate scores in the scoring space; if it is not present, a regret value list is created for the state and stored in Re_Pool with every element initialized to 1, and a score $a$ is selected with equal probability in the scoring space as the decision of the agent;
Step 8: according to the value of the current sample state $s$ and the advantage of each candidate score, the current regret value of each candidate score is calculated, and the regret values of the candidate scores under $s$ in Re_Pool are updated. The specific update is as follows: the agent obtains $s$ from the environment and searches the regret value pool for this state; if it exists, the existing regret value list $[R_1, R_2, \dots, R_N]$ and the position index of the list in the regret value pool are returned; the current regret value of each candidate score under $s$ is calculated in the way described in step 6 to obtain the latest regret value list; the existing regret value list and the latest regret value list are added element-wise at corresponding positions; and finally the summed regret value list replaces the list at position index in the regret value pool;
Step 9: the score $a$ is returned to the environment to obtain the reward $r$, the next-moment state $s'$ and sign, and the experience $(s, a, r, s', \text{sign})$ is stored in Ex_Pool;
Step 10: return to step 6 and keep iterating the training until the convergence condition of the scoring model is met, for example the precision of the predicted score no longer changes between iterations, where the precision may be the difference between the predicted score and the actual score; this indicates that reinforcement learning has learned a stable strategy, and training ends with the scoring model obtained.
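Pulling the per-step sketches above together, a condensed, illustrative training loop might look as follows; the environment interface (reset/step), the warm-up length, the pool capacity and the update_networks routine are all assumptions, since the patent only describes the flow at the level of steps 1-10 (a possible form of update_networks is sketched after the framework description below):

```python
import random
from collections import deque

def train_scoring_model(env, model, optimizer, regret_pool, num_scores,
                        explore_prob=0.1, warmup_steps=1000, max_steps=100_000):
    """Condensed sketch of the training flow (steps 4-10). `env` is an assumed
    scoring-recommendation environment with reset() -> state and
    step(score) -> (next_state, reward, done)."""
    ex_pool = deque(maxlen=50_000)                          # experience pool Ex_Pool

    state = env.reset()
    for step in range(max_steps):
        if step < warmup_steps:                             # steps 4-5: random decisions
            score = random.randrange(num_scores)
        else:                                               # steps 6-7: exploit or explore
            score = choose_score(model, regret_pool, state, num_scores, explore_prob)

        update_regret_pool(regret_pool, model, state)       # step 8: superimpose regrets

        next_state, reward, done = env.step(score)          # step 9: feed the score back
        ex_pool.append((state, score, reward, next_state, done))

        if step >= warmup_steps and len(ex_pool) == ex_pool.maxlen:
            batch = random.sample(ex_pool, 64)              # mini-batch of experience
            update_networks(model, optimizer, batch)        # assumed update routine

        state = env.reset() if done else next_state         # step 10: iterate until converged
    return model
```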
Based on any of the above embodiments, fig. 3 is a schematic diagram of the reinforcement learning framework based on regret exploration provided by the present invention. As shown in fig. 3, the specific flow of the reinforcement learning method based on this framework is as follows:
Step 1: create the regret value pool Re_Pool (i.e. the action regret pool in fig. 3) and the experience pool Ex_Pool (i.e. the experience replay pool in fig. 3); both pools are initially empty;
Step 2: initialize the state value function network and the action advantage function network; set the reward function, the discount factor and the exploration probability $\varepsilon$;
Step 3: the agent interacts with the environment, makes a random decision and stores the experience into Ex_Pool;
Step 4: repeat step 3 until the number of decisions exceeds a preset number of times;
Step 5: the agent interacts with the environment and obtains the current sample state $s$ from the environment; $s$ is input into the state value function to compute the value $V(s)$ of the state, and into the action advantage function to compute the advantage $A(s, a_i)$ of each candidate action in this state; from these, the current regret value $R(s, a_i)$ of each candidate action under $s$ is computed;
Step 6: an action is selected according to the current sample state $s$, specifically: the random module random() generates a random number (i.e. from-num in fig. 3); if the random number is greater than or equal to $\varepsilon$, the value of each candidate action under $s$, i.e. the Q value $Q(s, a_i) = V(s) + A(s, a_i)$, is calculated, and the action $a$ with the largest Q value is selected as the decision of the agent; if the random number is less than $\varepsilon$, Re_Pool is queried for $s$: if it is present, the action regret value list corresponding to the state is extracted from the regret value pool and the action $a$ with the greatest regret value among the $N$ candidate actions is selected as the decision of the agent, where $N$ is the number of candidate actions in the action space; if it is not present, an action regret value list is created for $s$ and stored in Re_Pool, and an action $a$ is selected with equal probability in the action space as the decision of the agent;
Step 7: according to the value of the current sample state $s$ and the advantage of each candidate action under $s$, the current regret value of each candidate action is calculated, and the regret values of the candidate actions under the current sample state in Re_Pool are updated. The specific update is as follows: the agent obtains $s$ from the environment and searches the regret value pool for this state; if $s$ is already in the pool, the existing regret value list and the position index of the list in the regret value pool are returned, the current regret value of each candidate action under $s$ is calculated in the way described in step 5 to obtain the latest regret value list, the existing regret value list and the latest regret value list are added element-wise at corresponding positions, and the summed regret value list replaces the list at position index in the regret value pool; if $s$ is not in the pool, an action regret value list is created for $s$ and stored in Re_Pool, with every action regret value initialized to 1;
Step 8: the action $a$ is returned to the simulation environment to obtain the reward $r$ and the next-moment state $s'$, and the experience $(s, a, r, s')$ is stored in Ex_Pool;
Step 9: return to step 5 and keep iterating the training until the networks converge; training then ends and the policy model is obtained.
It should be noted that at the beginning of reinforcement learning the experience pool Ex_Pool has not yet been filled; in that case, for each interaction between the agent and the environment, the experience is only stored in the experience pool and the state value function network and the action advantage function network are not updated. Once Ex_Pool is full, for each interaction between the agent and the environment the experience $(s, a, r, s')$ is first stored in Ex_Pool, and then a mini-batch of experience is sampled to update the state value function network and the action advantage function network.
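The patent does not specify the loss used when updating the two networks from a mini-batch; the following sketch assumes a standard one-step temporal-difference target on $Q(s, a) = V(s) + A(s, a)$ and uses the DuelingScorer sketch above (batch handling and hyperparameters are illustrative):

```python
import numpy as np
import torch
import torch.nn.functional as F

def update_networks(model, optimizer, batch, gamma=0.99):
    """One mini-batch update of the state value / action advantage networks,
    assuming a one-step TD target; the loss is an assumption, not the patent's."""
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    v, adv = model(states)                                       # DuelingScorer sketch above
    q = (v + adv).gather(1, actions.unsqueeze(1)).squeeze(1)     # Q(s, a) of the taken actions

    with torch.no_grad():
        v_next, adv_next = model(next_states)
        q_next = (v_next + adv_next).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next        # TD target

    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```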
In the following, the recommendation device based on regret exploration provided by the present invention is described; the recommendation device based on regret exploration described below and the recommendation method based on regret exploration described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of a recommendation apparatus based on regret exploration provided by the present invention, as shown in fig. 4, the apparatus includes:
a determining module 410, configured to determine a state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;
the input module 420 is configured to input the state of each candidate object to the scoring model, so as to obtain a score of each candidate object output by the scoring model;
a recommending module 430, configured to determine an object recommended to the target user based on the score of each candidate object;
the scoring model is obtained by reinforcement learning based on the sample states of the sample objects; during reinforcement learning, the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
According to the device provided by the embodiment of the invention, the scoring model performs score exploration based on the regret value set and the current sample state during reinforcement learning, so exploration builds on accumulated historical decision experience and actions are explored from a global perspective for specific states. This greatly improves exploration efficiency and, in turn, the learning efficiency of reinforcement learning. Applying the scoring model to obtain the score of each candidate object then enables personalized and accurate recommendation for different users and improves user experience.
Based on any of the above embodiments, the scoring model performing score exploration based on the regret value set and the current sample state includes:
determining a currently generated random number;
if the random number is greater than or equal to a preset exploration probability, the scoring model performs score exploitation based on the current sample state;
otherwise, the scoring model performs score exploration based on the regret value set and the current sample state.
Based on any of the above embodiments, the regret value $R(s, a_i)$ of the $i$-th candidate score $a_i$ in a historical state $s$ is determined from the value $V(s)$ of the historical state and the advantage $A(s, a_i)$ of that candidate score in the historical state.
Based on any of the above embodiments, performing score exploration based on the regret value set and the current sample state includes:
if the regret value set includes the current sample state, obtaining the regret values corresponding to the current sample state from the regret value set, and taking the candidate score corresponding to the maximum of these regret values as the current score;
otherwise, setting the regret value of each candidate score in the current sample state to an initial value in the regret value set, and taking a candidate score selected from the candidate scores with equal probability as the current score.
Based on any of the above embodiments, after the scoring model performs score exploration based on the regret value set and the current sample state, the method further includes:
if the regret value set comprises the current sample state, the scoring model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
Based on any of the above embodiments, performing score exploitation based on the current sample state comprises:
determining the value of each candidate score in the current sample state based on the value of the current sample state and the advantage of each candidate score in the current sample state;
and taking the candidate score corresponding to the maximum of these values as the current score.
Fig. 5 is a schematic diagram of the physical structure of an electronic device. As shown in fig. 5, the electronic device may include: a processor 510, a communications interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communications interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may call logic instructions in the memory 530 to perform the recommendation method based on regret exploration, the method comprising: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; and determining the object recommended to the target user based on the scores of the candidate objects; wherein the scoring model is obtained by reinforcement learning based on the sample states of the sample objects, and during reinforcement learning the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the recommendation method based on regret exploration provided by the above methods, the method comprising: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; and determining the object recommended to the target user based on the scores of the candidate objects; wherein the scoring model is obtained by reinforcement learning based on the sample states of the sample objects, and during reinforcement learning the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the recommendation method based on regret exploration provided by the above methods, the method comprising: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; and determining the object recommended to the target user based on the scores of the candidate objects; wherein the scoring model is obtained by reinforcement learning based on the sample states of the sample objects, and during reinforcement learning the scoring model performs score exploration based on a regret value set and the current sample state, the regret value set stores historical states and the regret values corresponding to those states, the regret values are determined based on the advantages of the candidate scores in the historical states, and a historical state is a sample state preceding the current sample state.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A regret exploration-based recommendation method, comprising:
determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;
inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;
determining an object recommended to the target user based on the scores of the candidate objects;
the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and unfortunate values corresponding to the unfortunate values, the unfortunate values are determined based on the advantages of candidate scores in the historical states, the historical states are sample states before the current sample state, and the candidate scores are scores which can be selected in a scoring space.
2. The regret exploration-based recommendation method according to claim 1, wherein the scoring model performing scoring based on the regret value set and the current sample state comprises:
determining a currently generated random number;
if the random number is greater than or equal to a preset exploration probability, the scoring model performs score utilization based on the current sample state;
otherwise, the scoring model performs scoring exploration based on the regret value set and the current sample state.
3. The regret exploration-based recommendation method according to claim 1, wherein the regret value is determined based on the following formula:
[formula published as an image; not reproduced in text]
wherein R(s, a_i) is the regret value of the i-th candidate score in the historical state, V(s) is the value of the historical state, A(s, a_i) is the advantage of the i-th candidate score in the historical state, s is the historical state, and a_i is the i-th candidate score.
4. The regret exploration-based recommendation method according to any one of claims 1 to 3, wherein the scoring exploration based on the regret value set and the current sample state comprises:
if the regret value set comprises the current sample state, acquiring the regret values corresponding to the current sample state from the regret value set, and taking the candidate score corresponding to the maximum among the regret values as the current score;
otherwise, setting the regret value of each candidate score in the current sample state to an initial value in the regret value set, and taking a candidate score selected from the candidate scores with equal probability as the current score.
5. The regret exploration-based recommendation method according to any one of claims 1 to 3, wherein after the scoring model performs scoring exploration based on the regret value set and the current sample state, the method further comprises:
if the regret value set comprises the current sample state, the scoring model determines the current regret value of each candidate score based on the advantage of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.
6. The regret exploration-based recommendation method according to claim 2, wherein the score utilization based on the current sample state comprises:
determining the value of each candidate score in the current sample state based on the value of the current sample state and the advantage of each candidate score in the current sample state;
and taking the candidate score corresponding to the maximum among the values of the candidate scores as the current score.
7. A regret exploration-based recommendation apparatus, comprising:
the determining module is used for determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;
the input module is used for inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;
the recommending module is used for determining the objects recommended to the target user based on the scores of the candidate objects;
the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and unfortunate values corresponding to the unfortunate values, the unfortunate values are determined based on the advantages of candidate scores in the historical states, the historical states are sample states before the current sample state, and the candidate scores are scores which can be selected in a scoring space.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the regret exploration-based recommendation method according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the regret exploration-based recommendation method according to any one of claims 1 to 6.
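To make the exploration logic recited in claims 2, 4, and 6 concrete, the following minimal Python sketch shows one way the explore-or-exploit decision and the regret-guided exploration could be realized. All names (choose_score, regret_table, score_space, explore_prob) are hypothetical, and the sketch is an illustration under the stated assumptions rather than the claimed implementation.

import random

def choose_score(state_key, q_values, regret_table, score_space, explore_prob=0.1):
    """Pick the current score for one sample state (cf. claims 2, 4 and 6).

    state_key    -- hashable representation of the current sample state
    q_values     -- value of each candidate score in this state, V(s) + A(s, a_i)
    regret_table -- dict mapping state_key to a list of regret values, one per candidate score
    score_space  -- the selectable candidate scores, e.g. [1, 2, 3, 4, 5]
    """
    if random.random() >= explore_prob:
        # Score utilization: candidate score with the largest value.
        return score_space[max(range(len(q_values)), key=lambda i: q_values[i])]
    if state_key in regret_table:
        # Scoring exploration: candidate score with the largest stored regret value.
        regrets = regret_table[state_key]
        return score_space[max(range(len(regrets)), key=lambda i: regrets[i])]
    # Unseen state: initialize its regret values and pick a candidate score uniformly.
    regret_table[state_key] = [0.0] * len(score_space)
    return random.choice(score_space)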
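Claim 5 updates the regret value set after an exploration step, and claim 3 defines the regret value through a formula that the text publication renders only as an image. The sketch below therefore relies on a stand-in assumption: the regret of the i-th candidate score is taken to be its advantage in the state, (V(s) + A(s, a_i)) - V(s), accumulated across visits. Both the formula and the accumulation rule are assumptions, not the granted claim language.

def update_regrets(state_key, state_value, advantages, regret_table):
    """Update the regret values stored for a visited sample state (cf. claim 5).

    state_value -- V(s), the value of the state
    advantages  -- list of A(s, a_i), one advantage per candidate score
    Assumption: regret of a_i = (V(s) + A(s, a_i)) - V(s), i.e. its advantage,
    accumulated over visits; claim 3's exact formula is not reproduced in the text.
    """
    if state_key not in regret_table:
        return
    current = [(state_value + adv) - state_value for adv in advantages]
    regret_table[state_key] = [
        old + new for old, new in zip(regret_table[state_key], current)
    ]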
CN202111185156.8A 2021-10-12 2021-10-12 Regrettful exploration-based recommendation method and device, electronic equipment and storage medium Active CN113626721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111185156.8A CN113626721B (en) 2021-10-12 2021-10-12 Regrettful exploration-based recommendation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111185156.8A CN113626721B (en) 2021-10-12 2021-10-12 Regrettful exploration-based recommendation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113626721A (en) 2021-11-09
CN113626721B (en) 2022-01-25

Family

ID=78391013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111185156.8A Active CN113626721B (en) 2021-10-12 2021-10-12 Regrettful exploration-based recommendation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113626721B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2531959T3 (en) * 2010-02-05 2017-10-30 Ecole polytechnique fédérale de Lausanne (EPFL) ORGANIZATION OF NEURAL NETWORKS
WO2019238483A1 (en) * 2018-06-11 2019-12-19 Inait Sa Characterizing activity in a recurrent artificial neural network and encoding and decoding information
CN112149824B (en) * 2020-09-15 2022-07-22 支付宝(杭州)信息技术有限公司 Method and device for updating recommendation model by game theory

Also Published As

Publication number Publication date
CN113626721A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN110458663B (en) Vehicle recommendation method, device, equipment and storage medium
CN112149824B (en) Method and device for updating recommendation model by game theory
WO2017197330A1 (en) Two-stage training of a spoken dialogue system
CN110222838B (en) Document sorting method and device, electronic equipment and storage medium
CN111046188A (en) User preference degree determining method and device, electronic equipment and readable storage medium
CN111159382B (en) Method and device for constructing and using session system knowledge model
CN109977029A (en) A kind of training method and device of page jump model
CN112785005A (en) Multi-target task assistant decision-making method and device, computer equipment and medium
CN112765484A (en) Short video pushing method and device, electronic equipment and storage medium
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN110971683A (en) Service combination method based on reinforcement learning
US20210374604A1 (en) Apparatus and method for training reinforcement learning model for use in combinatorial optimization
CN113626721B (en) Regrettful exploration-based recommendation method and device, electronic equipment and storage medium
CN117056595A (en) Interactive project recommendation method and device and computer readable storage medium
CN114764603B (en) Method and device for determining characteristics aiming at user classification model and service prediction model
CN109508424B (en) Feature evolution-based streaming data recommendation method
CN113626720B (en) Recommendation method and device based on action pruning, electronic equipment and storage medium
CN112121439B (en) Intelligent cloud game engine optimization method and device based on reinforcement learning
CN115394295A (en) Segmentation processing method, device, equipment and storage medium
CN111125541A (en) Method for acquiring sustainable multi-cloud service combination for multiple users
CN118036756B (en) Method, device, computer equipment and storage medium for large model multi-round dialogue
CN117725190B (en) Multi-round question-answering method, system, terminal and storage medium based on large language model
CN113468436A (en) Reinforced learning recommendation method, system, terminal and medium based on user evaluation
CN114282101A (en) Training method and device of product recommendation model, electronic equipment and storage medium
CN112837116A (en) Product recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant