CN113626721B

CN113626721B - Regrettful exploration-based recommendation method and device, electronic equipment and storage medium

Info

Publication number: CN113626721B
Application number: CN202111185156.8A
Authority: CN
Inventors: 白栋栋; 洪志理
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2021-10-12
Filing date: 2021-10-12
Publication date: 2022-01-25
Anticipated expiration: 2041-10-12
Also published as: CN113626721A

Abstract

The invention provides a recommendation method, a recommendation device, electronic equipment and a storage medium based on regret exploration, wherein the method comprises the following steps: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; determining an object recommended to a target user based on the scores of the candidate objects; the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model performs scoring exploration based on the regret value set and the current sample state, the regret value set stores historical states and regret values corresponding to the historical states, the regret values are determined based on the advantages of candidate scores in the historical states, and the historical states are sample states before the current sample state, so that exploration efficiency is improved, personalized accurate recommendation for different users is achieved, and user experience is improved.

Description

Regrettful exploration-based recommendation method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a recommendation method and apparatus, an electronic device, and a storage medium based on regret exploration.

Background

Reinforcement learning is particularly suitable for business scenarios involving interactions, such as scenarios in which content is recommended to a user, due to the advantages of being able to perceive a dynamic environment and obtain rewards from the environment to continually adapt to the environment. Exploration and utilization are always difficult points in reinforcement learning, and the exploration efficiency influences whether the algorithm can be converged to the maximum accumulated return position or not.

The current exploration methods are numerous and are more classically epsilon-greedy, Topson sampling and the like, however, the methods do not explore actions from the global perspective aiming at specific states, and therefore the exploration efficiency is greatly influenced.

Disclosure of Invention

The invention provides a recommendation method, a recommendation device, electronic equipment and a storage medium based on regret exploration, which are used for solving the defect of low exploration efficiency in the prior art and realizing the improvement of exploration efficiency.

The invention provides a recommendation method based on regret exploration, which comprises the following steps:

determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;

inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;

determining an object recommended to the target user based on the scores of the candidate objects;

the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and regret values corresponding to the unfortunate values, the regret values are determined based on the advantages of candidate scores in the historical states, and the historical states are sample states before the current sample state.

According to the regret exploration-based recommendation method provided by the invention, the scoring model carries out scoring exploration on the basis of the regret value set and the current sample state, and the method comprises the following steps:

determining a currently generated random number;

if the random number is larger than or equal to a preset exploration probability, the scoring model carries out scoring utilization based on the current sample state;

otherwise, the scoring model carries out scoring exploration based on the regrettable value set and the current sample state.

According to the recommendation method based on the regret exploration provided by the invention, the regret value is determined based on the following formula:

wherein,

is the first in the history state

An unfortunate value of the score of an individual candidate,

is the value of the history state, is the first value in the history state

The dominance of the individual candidate scores is,

in order to be able to take account of the history status,

is the first

And (4) scoring the candidate.

According to the recommendation method based on the regret exploration provided by the invention, the scoring exploration is carried out based on the regret value set and the current sample state, and the method comprises the following steps:

if the regret value set comprises the current sample state, acquiring each regret value corresponding to the current sample state from the regret value set, and taking a candidate score corresponding to the maximum value in each regret value as a current score;

otherwise, setting the regret value of each candidate score in the current sample state as an initial value in the regret value set, and taking the candidate score selected from each candidate score with equal probability as the current score.

According to the regret exploration-based recommendation method provided by the invention, the scoring model carries out scoring exploration on the basis of the regret value set and the current sample state, and then the method further comprises the following steps:

if the regret value set comprises the current sample state, the scoring model determines the current regret value of each candidate score based on the advantages of each candidate score in the current sample state, and updates each regret value corresponding to the current sample state in the regret value set based on the current regret value of each candidate score.

According to the regret exploration-based recommendation method provided by the invention, the scoring utilization based on the current sample state comprises the following steps:

determining the value of each candidate score in the current sample state based on the value of the current sample state and the advantage of each candidate score in the current sample state;

and taking the candidate score corresponding to the maximum value in the values of the candidate scores as the current score.

The invention also provides a recommendation device based on the regret exploration, which comprises:

the determining module is used for determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;

the input module is used for inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;

the recommending module is used for determining the objects recommended to the target user based on the scores of the candidate objects;

The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the above-mentioned methods for recommending an object based on an unfortunate exploration.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described unfortunately exploration based recommendation method.

According to the recommendation method and device based on the regret exploration, the electronic equipment and the storage medium, the scoring model carries out scoring exploration based on the regret value set and the current sample state in the reinforcement learning process, so that the exploration is carried out on the accumulated historical decision experience, the exploration of actions from the global angle for specific states is realized, the exploration efficiency is greatly improved, the learning efficiency of the reinforcement learning is further improved, the scoring of each candidate object is obtained by applying the scoring model, the individual accurate recommendation for different users is realized, and the user experience is improved.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic flow chart of an exemplary method for providing an exemplary recommendation based on an unfortunate exploration;

FIG. 2 is a schematic flow chart of a scoring model determination method provided by the present invention;

FIG. 3 is a schematic diagram of an enhanced learning framework based on regret exploration according to the present invention;

FIG. 4 is a schematic diagram of a recommendation apparatus based on regret exploration according to the present invention;

fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Exploring and utilizing are always difficult points in reinforcement learning, wherein the utilization is to select an optimal strategy according to the current known knowledge and experience, and the exploration is to try different strategies to explore whether a better strategy exists. The maximum immediate reward is available, but when learning is inadequate, the algorithm tends to fall into local optimality. The exploration can fully learn the reward of each strategy, find the optimal strategy, is not easy to fall into local optimization, is beneficial to maximizing accumulated return, but takes more learning time and can slow down the convergence speed of the algorithm.

The existing exploration methods are numerous and have epsilon-greedy, Topson sampling and the like in a more classical way, however, the methods do not explore actions from the global perspective aiming at specific states, so that the exploration efficiency is greatly influenced, and the learning efficiency of reinforcement learning is further greatly influenced.

In contrast, the embodiment of the invention provides a recommendation method based on regret exploration. Fig. 1 is a schematic flow chart of an unfortunately-exploration-based recommendation method provided in the present invention, as shown in fig. 1, the method includes:

step 110, determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;

step 120, inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model;

step 130, determining an object recommended to a target user based on the scores of the candidate objects;

the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and regret values corresponding to the unfortunate value set, the regret values are determined based on the advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.

Specifically, the target user refers to a user to be subjected to object recommendation, and the user characteristics are used for representing attribute information of the target user, such as gender, age, education level, occupation and the like of the target user. The candidate object refers to an object to be recommended, and may be specifically determined based on the browsing record of the target user, and it may be understood that the browsing record covers preference information of the user, so as to facilitate acceptance of a subsequently recommended object by the user. The specific type of the candidate object may be a movie, music, news, etc., and this is not particularly limited by the embodiment of the present invention. The object features are used to characterize attribute information of the candidate object, e.g., movie type, subject content, etc.

In order to perform personalized content recommendation for different users, the embodiment of the invention firstly determines the corresponding state of each candidate object according to the user characteristics of the target user and the object characteristics of each candidate object, then inputs the corresponding state of each candidate object into the scoring model, scores each candidate object by the scoring model so as to output the score of each candidate object, finally sorts each candidate object according to the score of each candidate object, and determines the object recommended to the target user according to the sorting result. Here, the state corresponding to each candidate object, that is, the state of the scoring recommendation environment where each candidate object and the target user are located, may be obtained by environment feedback.

It can be understood that, since the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object, in the reinforcement learning process, the scoring model can learn the scoring modes of different users by obtaining rewards from the environment and continuously performing policy optimization, and on the basis, the scoring model is applied to execute the step 120, and the obtained scores of the candidate objects can accurately represent the preference degree of the target user for the candidate objects, so that personalized accurate recommendation is performed on different users, and the user experience is greatly improved.

Before step 120 is executed, the scoring model may be trained specifically as follows: first, user characteristics of a large number of sample users and object characteristics of sample objects are collected, and a sample state corresponding to each sample object is determined based on the user characteristics of each sample user and the object characteristics of each sample object. And then, performing reinforcement learning on the initial model based on the sample state corresponding to each sample object, thereby obtaining a grading model.

In view of the fact that the existing exploration methods do not explore actions from the global perspective for specific states, and therefore exploration efficiency is poor, in the reinforcement learning process, the scoring model conducts scoring exploration in the environment based on the regret value set and the current sample state. Here, the regret value set stores the historical state and regret values of the candidate scores corresponding to the historical state, the regret values can be determined according to advantages of the candidate scores in the historical state, the candidate scores are scores available in a score space, and the advantages of the candidate scores can be obtained through calculation of an action advantage function. The scoring exploration corresponds to the exploration in a scoring scene, and whether a better scoring strategy exists is explored by trying different scoring strategies.

In the reinforcement learning process, the scoring model realizes interaction with the environment substantially through an agent controlled by the scoring model, after the scoring model determines a current decision-making action through scoring exploration, the agent can be controlled to execute the decision-making action, namely scoring the sample object, the environment can react to the score, and a next sample state of the environment is obtained. The historical state is a sample state before the current sample state, the intelligent agent can access one sample state of the environment at each decision, and the corresponding sample state can be used as the historical state no matter whether the decision at each time before the current sample state is in an exploration or utilization mode.

It should be noted that, the unfortunate value set stores the sample states accessed by the agent, maintains an action table corresponding to each accessed sample state, and records the regret value of each candidate action, i.e., candidate score, executed by the agent in the sample state. In the exploration process, scoring exploration is carried out based on the regret value set, and the existing random exploration mode is replaced, so that the exploration is carried out on the past decision experience, the exploration efficiency is greatly improved, and the learning efficiency of reinforcement learning is further improved. And in addition, the traditional decision-making experience of the intelligent agent every time is collected through the regret value set to form accumulated regret, and the accumulated regret and the specific state are used for scoring and exploring during exploration, so that purposeful exploration is realized by exploring actions from the global perspective aiming at the specific state, and the exploration efficiency is further improved.

According to the method provided by the embodiment of the invention, in the reinforcement learning process, the scoring model is used for scoring and exploring based on the regret value set and the current sample state, so that exploration is carried out on accumulated historical decision-making experience, the exploration of actions from the global perspective aiming at specific states is realized, the exploration efficiency is greatly improved, the learning efficiency of the reinforcement learning is further improved, the scoring of each candidate object is obtained by applying the scoring model, the individualized accurate recommendation of different users is realized, and the user experience is improved.

Based on any of the above embodiments, the scoring model performs scoring exploration based on the regrettable value set and the current sample state, including:

determining a currently generated random number;

if the random number is larger than or equal to the preset exploration probability, the scoring model carries out scoring utilization based on the current sample state;

otherwise, the scoring model performs scoring exploration based on the regrettable value set and the current sample state.

Specifically, scoring utilization is corresponding to utilization in a scoring scene, and an optimal scoring strategy in the current sample state is directly selected. In order to balance exploration and utilization, after a current sample state is obtained, a scoring model firstly determines a current generated random number, compares the random number with a preset exploration probability, and then determines whether a decision action is currently determined in a utilization or exploration mode based on a comparison result of the random number and the exploration probability, namely determines a current score from a scoring space:

if the random number is greater than or equal to the exploration probability, the scoring model can determine the current score in a utilization mode, and the current score can be determined according to the current sample state; otherwise, that is, the random number is smaller than the exploration probability, the scoring model may determine the current score in an exploration manner, and the current score may be specifically determined according to the regrettable value set and the current sample state.

Here, the random number may be specifically generated randomly by a random module, and the exploration probability, that is, the probability of exploratory property for representing the score of the agent, may be preset according to an actual requirement, which is not specifically limited in the embodiment of the present invention.

Based on any of the above embodiments, the unfortunate value is determined based on the following equation:

wherein,

is in a history state

An unfortunate value of the score of an individual candidate,

in order to be the value of the historical state,

is in a history state

The dominance of the individual candidate scores is,

in the case of a history state, the history state,

is as follows

And (4) scoring the candidate.

Specifically, in order to realize purposeful score exploration, the embodiment of the invention uses the game theory for reference, and utilizes the value of the state and the advantages of each candidate score in the state to define the regret value of each candidate score for estimating the loss degree of the income obtained by selecting each candidate score in the current state. For each history shape derived from the environmentState of the art

Can be first calculated by a state value function

I.e. the value of the historical state, is calculated by the action dominance function

I.e. in the history state

A candidate score

Is then judged

Whether the current time is more than 0 or not, and calculating to obtain the current time in the history state according to the judgment result and the formula

Unfortunately

Thereby obtaining the history status

Regrettable values of all the candidate scores, and finally the historical state

And storing all the corresponding regretted values into the regretted value set for subsequent grading exploration.

Here, the state value function may be specifically implemented by a state value function network, and the action advantage function may be specifically implemented by an action advantage function network, and it can be understood that by continuously optimizing parameters of the state value function network and the action advantage function network in the reinforcement learning process, values of states respectively output by the two networks and advantages of candidate scores are more accurate, so that errors in the calculation process of the regret value may be effectively reduced, and efficiency of score exploration is further improved.

Based on any of the above embodiments, performing score exploration based on the regret value set and the current sample state includes:

if the regret value set comprises the current sample state, obtaining each regret value corresponding to the current sample state from the regret value set, and taking the candidate score corresponding to the maximum value in each regret value as the current score;

Specifically, after it is determined that the scoring exploration is to be performed, the scoring model may first query the current sample state in the regret value set, and if the current sample state is queried, that is, the regret value set includes the current sample state, the regret value list corresponding to the current sample state may be obtained from the regret value set

，

The number of candidate scores in a scoring space is represented, and then the learning of an exploration process is facilitated by considering actions with large regrettage values, so that a scoring strategy which is better than a current local optimal strategy is explored, and for the scoring strategy, a candidate score corresponding to the maximum regrettage value in an acquired regrettage value list is determined as a current score by a scoring model;

otherwise, that is, the regret value set does not include the current sample state, an regret value list may be created for the current sample state in the regret value set, each list element in the regret value list corresponds to the regret value of each candidate score in the current sample state stored in the regret value set, and each list element in the regret value list is set to an initial value, for example, set to 1, and then, a candidate score is selected in the scoring space with equal probability, and the candidate score is determined as the current score.

Based on any of the above embodiments, the scoring model performs scoring exploration based on the regrettable value set and the current sample state, and then further includes:

Specifically, after determining the current decision-making action, the scoring model may first query whether the current sample state exists in the regret value set, and if the regret value set includes the current sample state, the scoring model may calculate the current regret value of each candidate score according to the advantages of each candidate score in the current sample state, and then update each regret value corresponding to the current sample state in the regret value set according to the calculated current regret value of each candidate score.

Here, the specific updating manner may be to superimpose an originally stored regret value of each candidate score in the current sample state in the regret value set with the calculated regret value of each candidate score, and then take the superimposed result as each regret value corresponding to the updated current sample state, where the superimposing manner may be direct superimposing or weighted superimposing, which is not specifically limited in this embodiment of the present invention.

For example, in the step of searching for a score, if the regret value set does not include the current sample state, an regret value list may be created for the current sample state in the regret value set, and each list element in the regret value list is set to 1, and then, in this step, the regret value set includes the current sample state, and at this time, each regret value corresponding to the current sample state in the updated regret value set may be 1 plus the current regret value of each candidate score.

It should be noted that this step is to update each state and corresponding regret value stored in the regret value set after determining the current decision-making action each time, and accumulate and store the past decision-making experience by continuously overlapping regret values corresponding to the same state, so that the subsequent regret exploration can be accurately performed toward the target direction from the global perspective, thereby further improving the efficiency of exploration. In addition, no matter whether the decision action is determined by score exploration or score utilization, the step should be executed after the decision action is determined, namely, the states stored in the regret value set and the corresponding regret values are updated.

Based on any embodiment, the scoring based on the current sample state comprises:

determining the value of each candidate score in the current sample state based on the value of the current sample state and the advantages of each candidate score in the current sample state;

Specifically, after determining that score utilization is to be performed, the scoring model may first calculate the value of each candidate score in the current sample state by:

wherein,

is the first in the current sample state

The value of the score of each of the candidate scores,

is the first in the current sample state

A candidate scoreIn the sense of the advantages of (a) and (b),

as a function of the value of the current sample state,

in order to be in the current state of the sample,

is as follows

And (4) scoring the candidate. Here, the number of the first and second electrodes,

can be calculated by a function of the state value,

can be calculated by the action dominance function.

The value of all candidate scores in the current sample state can be obtained through the formula. Then, the values are compared, and the candidate score corresponding to the maximum value is determined as the current score.

Based on any of the above embodiments, fig. 2 is a schematic flow chart of the determination method of the scoring model provided by the present invention, as shown in fig. 2, taking movie recommendation as an example, a specific flow of the method is as follows:

step 1: building a simulation environment, such as a scoring recommendation environment;

step 2: creating an unfortunate value set, namely an unfortunate value Pool Re _ Pool and an experience Pool Ex _ Pool, wherein the two pools are empty initially;

and step 3: initializing a state value function network and an action advantage function network in the scoring model; setting reward functions, discount factors and exploration probabilities

；

And 4, step 4: and (3) the intelligent agent makes random decision and accumulates experience: intelligent agent andinteracting with the environment, obtaining user characteristics of the users from the environment

And movie features of sample movies

Sample state of composition

The agent is in the scoring space according to the state of the sample

In the random selection of scores

Calculating the state of the sample

The regret value of each candidate score is stored in Re _ Pool, and the score is stored

Feedback to the environment to obtain the reward

The state of the next moment

And whether the flag of the termination state is reachedsign(ii) a Here, the determination of whether the termination state is reached is based on the current sample useruserWhether the browsing records are completely traversed or not, if not, whether the browsing records are completely traversed or not

By user features

Movie feature of next sample movie

Composition, if the traversal is complete

And switching users in the next interaction, and then experience

Storing in Ex _ Pool;

and 5: repeating the step 4 until the decision times are more than

Secondly;

step 6: the agent interacts with the environment, and the agent obtains the current sample state from the environment

Will be

The value of the state is calculated by inputting the value into a state value function

Will be

Inputting the calculated advantages into the action advantage function to obtain the advantage of each candidate score in the state

And calculate by this

Current regret value of next candidate score

：

And 7: according to the current sample state

The specific way of selecting the score is as follows: the random module random () generates a random number (i.e., from-num in fig. 2), if the random number is greater than or equal to

Then according to the formula

Calculating the value of each candidate score, i.e. Q value, according to

Policy selection scoring

As a decision by the agent; if the random number is less than

Then query in the regret Pool Re _ Pool

If yes, extracting the regret value list corresponding to the state from the regret value pool

According to the formula

Selection score

That candidate score with the greatest regret value among all candidate scores is taken as the decision of the agent, wherein

To be the number of candidate scores in the scoring space,

if not, creating an unfortunate value list for the state, storing the unfortunate value list in Re _ Pool, initializing each element in the list to be 1, and selecting a score with equal probability in a score space

As a decision by the agent;

and 8: according to the current sample state

The current regret value of each candidate score is calculated according to the value and the advantages of each candidate score, and the regret value of each candidate score is updated into Re _ Pool

The specific updating method of the regret values of the candidate scores is as follows: agent obtaining from the environment

Searching the regret value pool for the state, and if the regret value pool exists, returning to the existing regret value list

And the location of the list in the regret poolindexCalculating by using the regret value calculation mode in step 6

Obtaining the current regret value corresponding to each candidate score to obtain the latest regret value list

And overlapping the corresponding positions of the existing regret list and the latest regret list according to the following formula:

finally, the superposed regret list is replaced in the regret poolindexAn unfortunate list of locations;

and step 9: will score

Return to the environment and obtain the reward

The state of the next moment

Andsignwill experience

Storing in Ex _ Pool;

step 10: returning to the step 6, continuously iterating the training until the convergence condition of the scoring model is met, for example, the precision of the predicted scoring does not change any more each time, where the precision may be the difference between the predicted scoring and the actual scoring, which indicates that the reinforcement learning has learned a stable strategy, and ending the training to obtain the scoring model.

Based on any of the above embodiments, fig. 3 is a schematic diagram of a reinforcement learning framework based on regret exploration provided by the present invention, as shown in fig. 3, a specific flow of the reinforcement learning method based on the framework is as follows:

step 1: create an unfortunate Pool Re _ Pool (i.e., the action regret Pool in FIG. 3), and an experience Pool Ex _ Pool (i.e., the experience playback Pool in FIG. 3), both pools initially being empty;

step 2: initializing a state value function network and an action advantage function network; setting reward functions, discount factors and exploration probabilities

；

And step 3: the intelligent agent interacts with the environment, makes a random decision and stores the experience into Ex _ Pool;

and 4, step 4: repeating the step 3 until the decision times are more than

Secondly;

and 5: the agent interacts with the environment, and the agent obtains the current sample state from the environment

Will be

Will be

The advantage of each candidate action in the state is calculated and obtained by inputting the function of the action advantage

And calculate by this

Current regret value of next candidate actions

：

Step 6: according to the current sample state

The specific way of selecting the action is as follows: the random module random () generates a random number (i.e., from-num in fig. 3), if the random number is greater than or equal to

Then according to the formula

Calculate out

Value of next candidate actions

I.e. Q value, according to

Policy selection action

As a decision by the agent; if the random number is less than or equal to

Then query in Re _ Pool

If yes, extracting the action regret value list corresponding to the state from the regret value pool

According to the formula

Selection actions

As a decision of the agent, wherein

The number of candidate actions in the action space,

if not, it is

Creating an action regret list, storing the action regret list in Re _ Pool, and selecting an action with equal probability in an action space

As a decision by the agent;

and 7: according to the current sample state

Value of and

calculating the current regret value of each candidate action according to the advantages of each candidate action, and updating the regret value of each candidate action in the current sample state in the regret value Pool Re _ Pool, wherein the specific updating mode is as follows: agent obtaining state from environment

Search for the state in the regret pool if

Already in the pool, then return to the list of existing regrettes

And the location of the list in the regret poolindexCalculating the state by using the regret value calculation method in step 5

Obtaining the current regret value corresponding to each candidate action

finally, the superposed regret list is replaced in the regret poolindexAn unfortunate list of locations; if it is

If not in the pool, then

Creating an action regret value list and storing the action regret value list into Re _ Pool, wherein each action regret value is initialized to be 1;

and 8: will act

Return to the simulation environment and obtain the reward

The state of the next moment

Will experience

Storing in Ex _ Pool;

and step 9: and returning to the step 5, continuously iterating the training until the network is converged, and finishing the training to obtain the strategy model.

It should be noted that, at the beginning of the reinforcement learning process, the experience Pool Ex _ Pool does not store full experience, and at this time, for each interaction between the agent and the environment, only the experience is stored in the experience Pool, and the state value function network and the action dominance function network are not updated; after storing full experience in the experience Pool Ex _ Pool, for each interaction between the agent and the environment, experience is first stored

Store in Ex _ Pool, and select a batch of experience (mini-batch, minibatch) versus state value function network

And action dominance function

And (6) updating.

In the following, the recommendation apparatus based on the regret search provided by the present invention is described, and the recommendation apparatus based on the regret search described below and the recommendation method based on the regret search described above may be referred to correspondingly.

Based on any of the above embodiments, fig. 4 is a schematic structural diagram of a recommendation apparatus based on regret exploration provided by the present invention, as shown in fig. 4, the apparatus includes:

a determining module 410, configured to determine a state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object;

the input module 420 is configured to input the state of each candidate object to the scoring model, so as to obtain a score of each candidate object output by the scoring model;

a recommending module 430, configured to determine an object recommended to the target user based on the score of each candidate object;

According to the device provided by the embodiment of the invention, in the reinforcement learning process, the scoring model is used for scoring and exploring based on the regret value set and the current sample state, so that exploration is carried out on accumulated historical decision-making experience, the exploration of actions from the global perspective for specific states is realized, the exploration efficiency is greatly improved, the learning efficiency of the reinforcement learning is further improved, the scoring of each candidate object is obtained by applying the scoring model, the individualized accurate recommendation of different users is realized, and the user experience is improved.

determining a currently generated random number;

wherein,

is in a history state

An unfortunate value of the score of an individual candidate,

in order to be the value of the historical state,

is in a history state

The dominance of the individual candidate scores is,

in the case of a history state, the history state,

is as follows

And (4) scoring the candidate.

Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may call logic instructions in memory 530 to perform an unfortunately exploration-based recommendation method comprising: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; determining an object recommended to a target user based on the scores of the candidate objects; the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and regret values corresponding to the unfortunate value set, the regret values are determined based on the advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.

Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium, and when the computer program is executed by a processor, the computer can execute the method for recommending based on the regret search provided by the above methods, the method including: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; determining an object recommended to a target user based on the scores of the candidate objects; the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and regret values corresponding to the unfortunate value set, the regret values are determined based on the advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements an unfortunately exploration-based recommendation method provided by the above methods, the method comprising: determining the state of each candidate object based on the user characteristics of the target user and the object characteristics of each candidate object; inputting the state of each candidate object into a scoring model to obtain the score of each candidate object output by the scoring model; determining an object recommended to a target user based on the scores of the candidate objects; the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and regret values corresponding to the unfortunate value set, the regret values are determined based on the advantages of all candidate scores in the historical states, and the historical states are sample states before the current sample state.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An unfortunate exploration-based recommendation method, comprising:

the scoring model is obtained by performing reinforcement learning based on the sample state of the sample object; in the reinforcement learning process, the scoring model carries out scoring exploration based on an unfortunate value set and a current sample state, the unfortunate value set stores historical states and unfortunate values corresponding to the unfortunate values, the unfortunate values are determined based on the advantages of candidate scores in the historical states, the historical states are sample states before the current sample state, and the candidate scores are scores which can be selected in a scoring space.

2. The method of claim 1, wherein the scoring model performs scoring based on a set of regrettable values and a current sample state, and comprises:

determining a currently generated random number;

3. The method of claim 1, wherein the regret value is determined based on the following formula:

wherein,

is the first in the history state

An unfortunate value of the score of an individual candidate,

for the value of the said historical state,

is the first in the history state

The dominance of the individual candidate scores is,

in order to be able to take account of the history status,

is the first

And (4) scoring the candidate.

4. The method for recommendation based on regret exploration according to any of claims 1 to 3, characterized in that said scoring exploration based on regret value set and current sample state comprises:

5. The method for recommending based on regret exploration according to any of claims 1 to 3, characterized in that said scoring model carries out scoring exploration based on regret value set and current sample state, and then further comprising:

6. The unfortunately-explored based recommendation method according to claim 2, wherein said score utilization based on the current sample state comprises:

7. An apparatus for providing an unfortunate exploration-based recommendation, comprising:

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the unfortunately-based recommendation method according to any of claims 1 to 6.

9. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the unfortunately exploration based recommendation method according to any of claims 1 to 6.