CN111859099A - Recommendation method, device, terminal and storage medium based on reinforcement learning - Google Patents

Recommendation method, device, terminal and storage medium based on reinforcement learning

Info

Publication number
CN111859099A
CN111859099A CN201911236964.5A
Authority
CN
China
Prior art keywords
recommendation
action
value table
intention
last
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911236964.5A
Other languages
Chinese (zh)
Other versions
CN111859099B (en)
Inventor
乔宏利
高砚
权圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd filed Critical Mashang Consumer Finance Co Ltd
Priority to CN201911236964.5A priority Critical patent/CN111859099B/en
Publication of CN111859099A publication Critical patent/CN111859099A/en
Application granted granted Critical
Publication of CN111859099B publication Critical patent/CN111859099B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a recommendation method and a recommendation device based on reinforcement learning, comprising the following steps: updating the accumulated profit value corresponding to the last recommended action in an online Q value table by using a dual-core Q-learning reinforcement learning model, wherein the dual-core Q-learning reinforcement learning model comprises the online Q value table and an exploration Q value table; judging whether the recommendation type of the last recommendation intention is an exploration action recommendation type; if so, updating the accumulated profit value of the exploration Q value table by using the dual-core Q-learning reinforcement learning model; and obtaining the current recommendation intention according to the updated online Q value table and a preset rule, and recommending it. Through these steps, the recommendation method provided by the invention does not depend on personalized features of the user, and can therefore serve as a fallback scheme for an intelligent recommendation application or a cold-start scheme for a platform-type recommendation service.

Description

Recommendation method, device, terminal and storage medium based on reinforcement learning
Technical Field
The invention relates to the field of artificial intelligence, in particular to a recommendation method, a recommendation device, a recommendation terminal and a storage medium based on reinforcement learning.
Background
A recommendation system is one of the main components of an intelligent question-answering system. Question recommendation reduces the amount of user input and improves the user experience, and during recommendation the user can be guided into the normal question-handling flow, which reduces topic drift.
Reinforcement learning is a hot area of artificial intelligence. Compared with traditional supervised machine learning it offers stronger adaptability, robustness and interpretability, and in industrial applications it reduces the dependence on large-scale labeled corpora and frequent model updates, lowering implementation cost. However, reinforcement learning is harder to implement, and its modeling must fit the scenario more closely.
Existing reinforcement-learning-based recommendation systems can only be applied to some service scenarios. In sensitive application fields such as finance, insurance and securities, where personalized user features are not obvious, and in platform-type intelligent question-answering tools that serve many industries, enough personalized features cannot be obtained at first to achieve a good recommendation effect.
Disclosure of Invention
The invention mainly solves the technical problem of providing a recommendation method, a recommendation device, a recommendation terminal and a storage medium based on reinforcement learning so as to be suitable for various different service scenes and achieve a better recommendation effect.
In order to solve the technical problem, the invention provides a technical scheme that: updating the accumulated profit value corresponding to the last recommended action in an online Q value table by using a dual-core Q-learning reinforcement learning model, wherein the dual-core Q-learning reinforcement learning model comprises the online Q value table and an exploration Q value table; judging whether the recommendation type of the last recommendation intention is an exploration action recommendation type; if so, updating the accumulated profit value of the exploration Q value table by using the dual-core Q-learning reinforcement learning model; and obtaining the current recommendation intention according to the updated online Q value table and a preset rule, and recommending the current recommendation intention.
Wherein, the step of updating the accumulated profit value of the exploration Q value table by using the dual-core Q-learning reinforcement learning model comprises the following steps: and inputting the state before the last recommendation, the last recommended action and the corresponding user action into a dual-core Q-learning reinforcement learning model for calculation based on the recommendation type, and updating the exploration Q value table according to the calculation result.
The step of updating the accumulated profit value of the online Q value table corresponding to the last recommended action by using the dual-core Q-learning reinforcement learning model comprises the following steps: acquiring a state before last recommendation, a last recommended action and a user action corresponding to the last recommended action; the last recommended action comprises a last recommended intention, the user action comprises a click action, or the user action comprises an input action and an intention of the input action; and inputting the state before the last recommendation, the last recommendation action and the user action corresponding to the last recommendation action into a dual-core Q-learning reinforcement learning model, calculating to obtain the accumulated profit value of the last recommendation state under the recommendation action, and updating the accumulated profit value into an online Q value table.
Wherein, before the step of obtaining the current recommendation intention according to the updated online Q value table and a preset rule, the method further comprises: judging whether the updated exploration Q value table meets a preset condition; if so, replacing the online Q value table with the exploration Q value table to obtain a new online Q value table, and initializing the exploration Q value table; and the step of obtaining the current recommendation intention according to the updated online Q value table and the preset rule comprises: obtaining the current recommendation intention according to the new online Q value table and the preset rule.
Wherein, the step of judging whether the updated exploration Q value table meets the preset condition comprises: judging whether the click rate corresponding to the recommendation results of the updated exploration Q value table is larger than the click rate corresponding to the recommendation results generated by the online Q value table; if yes, judging whether the number of recommendations since the exploration Q value table last replaced the online Q value table exceeds a preset number of recommendations; and if yes, determining that the preset condition is met.
Wherein, the step of obtaining the current recommendation intention according to the updated online Q value table and the preset rule specifically comprises: randomly generating a decimal between [0, 1]; when the decimal falls within the interval [0, ε), determining the recommendation intention corresponding to the recommended action with the maximum Q value in the online Q value table as the current recommendation intention; and when the decimal falls within the interval [ε, 1], randomly selecting a recommendation intention according to the convergence strategy of the exploration action and taking it as the current recommendation intention, where ε is the greedy coefficient.
The recommendation method based on reinforcement learning further comprises the following steps: obtaining the reward value of the last recommended action according to the last recommended action and the action intention of the corresponding user action; and calculating to obtain the accumulated profit value of the last recommended action in the last recommended state through a Q-learning reinforcement learning model based on the reward value, and updating a Q value table corresponding to the last recommended action.
Wherein, after the step of recommending the current recommendation intention to the user, the method further comprises: when the user action is an input action, determining that the current recommendation fails and counting it into the historical failure count of the recommended intention; acquiring the action intention corresponding to the input action and judging whether it is the same as the recommendation intention; if they are the same, determining the current recommendation as intent-ambiguous and counting it into the intent-ambiguity count; obtaining the intent-ambiguity probability of the recommendation intention as the ratio of the intent-ambiguity count to the historical failure count; and judging whether the intent-ambiguity probability is greater than a preset threshold, and if so, determining the recommendation intention of the current recommended action as a fuzzy intent.
In order to solve the above technical problem, the present invention further provides a recommendation apparatus based on reinforcement learning: the recommendation device comprises a processing module, a judgment module and a recommendation module, wherein the processing module is used for updating the accumulated profit value corresponding to the last recommended action of the online Q value table by using a dual-core Q-learning reinforcement learning model, and the dual-core Q-learning reinforcement learning model comprises the online Q value table and an exploration Q value table; the judging module is used for judging whether the recommendation type of the last recommendation intention is an exploration action recommendation type; if so, the processing module is further used for updating the accumulated profit value of the exploration Q value table by using a dual-core Q-learning reinforcement learning model; and the recommendation module is used for obtaining the current recommendation intention according to the updated online Q value table and the preset rule and recommending the current recommendation intention.
In order to solve the above technical problem, the present invention further provides a recommendation terminal based on reinforcement learning, wherein the recommendation terminal based on reinforcement learning includes: a processor and a memory, wherein the memory stores program data, and the processor is used for executing the program data to realize the recommended method of the technical scheme.
In order to solve the above technical problem, the present invention also proposes a storage medium storing program data that can be executed to implement the recommendation method as described above.
The invention has the beneficial effects that: different from the prior art, the recommendation method based on reinforcement learning provided by the invention comprises the steps of obtaining the recommendation intention of the last recommended action and the corresponding user action, wherein the user action comprises a click action or an input action; inputting the last recommendation intention and the corresponding user action into a dual-core Q-learning reinforcement learning model, and calculating the current recommendation intention; and recommending the current recommendation intention to the user. Intention recommendation is carried out with the reinforcement learning model without depending on personalized features of the user, so the method can serve as a fallback scheme for an intelligent recommendation application or a cold-start scheme for a platform-type recommendation service.
Drawings
FIG. 1 is a flowchart illustrating an embodiment of a reinforcement learning-based recommendation method according to the present invention;
FIG. 2 is a flow chart illustrating data updating of a dual Q table according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a recommendation method based on reinforcement learning according to another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of a reinforcement learning-based recommendation apparatus according to the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a reinforcement learning-based recommendation terminal provided in the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a storage medium provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reinforcement learning is a class of algorithms that lets a computer start by acting completely at random, learn from its mistakes through continuous trial and error, and finally discover regularities, thereby achieving the learning goal.
It mainly comprises four elements: the agent, the environment state, the action and the reward. The goal of reinforcement learning is to obtain the largest accumulated reward; in the recommendation system of the present invention, the recommendation strategy is what is learned.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a recommendation method based on reinforcement learning according to the present invention, wherein the recommendation method based on reinforcement learning of the present embodiment includes the following steps:
s11: and updating the accumulated income value corresponding to the last recommended action of the online Q value table by using a dual-core Q-learning reinforcement learning model, wherein the dual-core Q-learning reinforcement learning model comprises the online Q value table and an exploration Q value table.
The accumulated profit value corresponding to the last recommended action is obtained from the last recommendation state. A recommendation state is a combination of a recommendation and the user's response; the user response is the information of the user action and comprises click-action information, input-action information and the intention of the input action. In this embodiment, the user taking a certain action under a certain recommendation intention and thereby expressing a certain intention defines a state, so the recommendation intention, the user action and the user action intention together constitute a recommendation state. The user has two possible reactions to the question recommended by the system: clicking a recommended intention, or manually inputting the question to be solved. The user action is therefore divided into a click action, or an input action together with the intention of the input action.
According to the application requirements, the user questions to be solved by the intelligent system are manually classified and labeled to form an intention set, denoted I; the set of possible user reactions is denoted U = {C, E}, where C denotes a click action and E denotes an input action.
Assuming the system has N intents, the set of the N intents is defined as the intent set A = {I1, I2, I3, … IN}.
For example: suppose the system has three intents { I1, I2, I3 }; the set of recommended states is then:
S={<I1,C,I1>,<I2,C,I2>,<I3,C,I3>,<I1,E,I1>,<I1,E,I2>,<I1,E,I3>,<I2,E,I1>,<I2,E,I2>,<I2,E,I3>,<I3,E,I1>,<I3,E,I2>,<I3,E,I3>};
wherein <I1, C, I1> indicates the recommendation state in which the system recommends I1 to the user, the user clicks I1, and the user's action intention is also I1; <I1, E, I1> means the system recommends I1 to the user, the user performs an input action, and the action intention is also I1; <I1, E, I3> means the system recommends I1 to the user, the user performs an input action, and the action intention is I3. Taking click-rate improvement as the target, any state of the form <IX, C, IX>, i.e., a state in which the user clicks on the recommendation, is rewarded directly: if the system enters such a state, the corresponding recommendation-intention action receives a reward.
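As an illustrative note (not part of the original patent text), the recommendation-state set above can be enumerated mechanically; the following Python sketch assumes the tuple layout <recommended intention, user action, user action intention> and uses the example intent labels:

```python
# Illustrative sketch: enumerating the recommendation-state set for three intents.
from itertools import product

intents = ["I1", "I2", "I3"]      # the intent set A from the example above
CLICK, INPUT = "C", "E"           # the user reaction set U = {C, E}

states = []
for rec in intents:
    # A click on a recommendation always expresses the recommended intent itself.
    states.append((rec, CLICK, rec))
for rec, real in product(intents, intents):
    # A manual input can express any intent, including the recommended one.
    states.append((rec, INPUT, real))

print(len(states))  # 12 states: 3 click states + 9 input states, matching the set S above
```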
The last recommendation state is therefore related both to the last recommended action and to the user's intention in response to that recommendation; the recommendation state is process-like.
By defining the interaction process <recommendation intention, user action, user action intention> as the state and then performing Q-learning reinforcement learning, the whole reinforcement learning environment can be modeled and evaluated without relying on personalized user data, and the accumulated value of each recommended action in each state can be calculated. A recommendation is a prediction of the user's next intention, and that next intention depends not only on the intention last recommended to the user but also on the user's feedback on it; the intention recommendation is an action, and the action is made based on the state. Therefore, the definition of the reinforcement learning state in the invention is an abstraction of the interaction process, i.e., the dynamic process is made stateful, and the method is applicable to all recommendation application scenarios.
In this step, the accumulated profit value corresponding to the last recommended action in the online Q value table is updated by using the dual-core Q-learning reinforcement learning model, wherein the dual-core Q-learning reinforcement learning model comprises the online Q value table and the exploration Q value table.
In an embodiment of the present invention, two Q value tables are introduced into the reinforcement-learning-based recommendation system to form the two cores of reinforcement learning, yielding a dual-core Q-learning reinforcement learning model (algorithm). The two Q value tables are an online Q value table and an exploration Q value table. The online Q value table serves the recommendation system in use; it continuously converges its calculation according to the profits of exploration actions and utilization actions and generates recommendation intentions. The exploration Q value table is offline: it does not directly generate recommendation intentions for online service, and only continuously performs exploration actions to calculate and generate recommendation intentions and evaluate the value of each recommendation intention.
Wherein, the calculation result of the reinforcement learning model of the dual-core Q-learning is updated in the online Q value table and the exploration Q value table.
The exploration action and the utilization action are the two action types of Q-learning-based reinforcement learning. The exploration action refers to randomly generating a recommendation intention, acquiring rules from the random actions and converging the policy. The convergence strategy works as follows: the system recommends intentions at random; when the user clicks the recommended intention, the system is rewarded and obtains a certain profit value, and when the user does not click it, the system receives no reward and obtains a profit value of 0. In other words, starting from random actions, the system searches for the intentions with high profit values that earn rewards and recommends them, thereby learning the relevant rules, converging, and gradually forming a mature strategy.
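As an illustrative note, the click-based reward behind this convergence strategy could be sketched as follows; the reward magnitude of 1.0 is an assumption, since the text only speaks of "a certain profit value":

```python
def reward(user_action: str) -> float:
    # A click ("C") on the recommended intent earns a reward; any other reaction earns none.
    # The magnitude 1.0 is an illustrative choice, not specified by the patent text.
    return 1.0 if user_action == "C" else 0.0
```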
The utilization action generates a recommendation intention according to the maximum-profit principle: after the Q value table is updated according to the recommendation state of the last recommended action, the current recommendation intention is calculated from the updated Q value table; in one embodiment, the intention with the maximum Q value in the current recommendation state may be used as the recommendation intention.
In one embodiment, a recommendation type of a last intent to recommend is determined; the recommendation types comprise a utilization action recommendation type and an exploration action recommendation type; inputting the state before last recommendation, the last recommended action and the corresponding user action into a dual-core Q-learning reinforcement learning model for calculation based on the recommendation type, and updating a corresponding Q value table according to the calculation result;
In one implementation, the state before the last recommendation, the last recommended action and the corresponding user action can be obtained from the request parameters passed back when the display terminal calls the model server interface. These are input into the dual-core Q-learning reinforcement learning model, the accumulated profit value of the recommendation state under the recommended action is calculated, and the accumulated profit value is updated into the online Q value table; the recommendation types comprise the utilization action recommendation type and the exploration action recommendation type, and the last recommended action comprises the last recommendation intention.
Specifically, if the recommendation type of the intention recommended by the last recommended action is an exploration action type and the user action is a click, the latest Q value, that is, the accumulated profit value, is obtained by calculating through a dual-core Q-learning reinforcement learning algorithm according to the exploration action type and the user action, and the online Q value table is updated. In addition, if the recommendation type of the intention of recommending the action last time is the utilization action type, the latest Q value is obtained by calculation through a dual-core Q-learning reinforcement learning algorithm according to the utilization action type and the user action, and the online Q value table is updated.
In a specific practical application, the parameters carried by the recommendation request sent by the client include: the last recommendation type, the state S(Last) before the last recommendation, the last recommendation intention, the last user action, and the intention of that user action, wherein <last recommendation intention, last user action, intention of that action> constitutes the current state S(Now).
S12: and judging whether the recommendation type of the last recommendation intention is the exploration action recommendation type.
For the exploration Q value table, it is also necessary to determine whether the recommendation type of the last recommendation intention is the exploration action recommendation type. The recommendation types comprise the utilization action recommendation type and the exploration action recommendation type, and a recommendation intention produced by an exploration action is of the exploration action recommendation type.
S13: if yes, updating the accumulated profit value of the exploration Q value table by using a dual-core Q-learning reinforcement learning model.
And if the recommendation type of the last recommendation intention is the exploration action recommendation type, inputting the state before the last recommendation, the last recommendation action and the corresponding user action, namely the accumulated profit value corresponding to the last recommendation action, into the dual-core Q-learning reinforcement learning model for calculation, and updating the exploration Q value table according to the calculation result.
In a specific practical application, the background algorithm server enters a learning stage after receiving the request parameters. The learning stage comprises the following steps: if the last action type is a utilization action, the accumulated profit value of the action in the last state S(Last) is calculated according to the Q-learning algorithm formula. Since the last state S(Last), the action taken in that state (i.e., the last recommendation intention) and the result of the action (i.e., the user's feedback) are already known, the new Q value of S(Last) under the last action (the last recommendation intention) can be calculated and updated into the online Q value table. Similarly, if the last action type is an exploration action, the same formula is used to update the exploration Q value table.
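As an illustrative note, the "Q-learning algorithm formula" referred to here is, in its standard form, Q(s, a) ← Q(s, a) + α·[r + γ·max_a' Q(s', a') − Q(s, a)]. A minimal Python sketch of this update is given below; the dictionary-based table representation and the values of the learning rate α and discount factor γ are assumptions, not taken from the patent:

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative assumption)
GAMMA = 0.9   # discount factor (illustrative assumption)

def update_q(q_table, last_state, last_action, reward_value, current_state, intents):
    """Update the accumulated profit value Q(last_state, last_action) with the standard
    Q-learning rule. q_table maps (state, action) -> accumulated profit value; this
    in-memory representation is an assumption for illustration."""
    best_next = max(q_table[(current_state, a)] for a in intents)
    target = reward_value + GAMMA * best_next
    q_table[(last_state, last_action)] += ALPHA * (target - q_table[(last_state, last_action)])
    return q_table[(last_state, last_action)]

# Usage sketch: the same routine updates either the online table or the exploration table,
# depending on whether the last recommendation was a utilization or an exploration action.
q_online = defaultdict(float)
s_last = ("I1", "C", "I1")    # state before the last recommendation
s_now = ("I2", "C", "I2")     # current state formed by the user's feedback
update_q(q_online, s_last, "I2", 1.0, s_now, ["I1", "I2", "I3"])
```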
It is then judged whether the click rate corresponding to the recommendation results generated by the updated online Q value table is smaller than the click rate corresponding to the recommendation results generated by the exploration Q value table; if yes, it is judged whether the number of recommendations since the exploration Q value table last replaced the online Q value table exceeds a preset number of recommendations. The preset number of recommendations can be set according to the practical application, for example 100 times or 50 times the total number of intentions, without limitation. If both conditions hold, the exploration Q value table replaces the online Q value table, i.e., the data of the double Q value tables is updated: a new online Q value table is obtained, and the data of the exploration Q value table is initialized to obtain a new exploration Q value table.
Referring to fig. 2, fig. 2 is a schematic flow chart illustrating data updating of a dual Q-value table according to an embodiment of the invention.
The online Q value tables Q11, Q12 and Q13 converge their calculations according to the profits of both utilization actions and exploration actions, while the exploration Q value tables Q21, Q22 and Q23 calculate only according to the profits of exploration actions. An evaluation monitor F is arranged between the two Q value tables. When the evaluation monitor F detects from user feedback that the click rate of the recommendations generated by the exploration Q value table Q22 is higher than that of the recommendations generated by the online Q value table Q12, and the preset number of recommendations has elapsed since the previous double-Q-value-table update, the data of the online Q value table is updated: the data of the exploration Q value table Q22 is copied in its entirety into the online Q value table Q13, replacing the original data of the online Q value table. After the replacement is completed, the online Q value table Q13 continues to generate online recommendations according to the original preset calculation rules (converging according to both utilization actions and exploration actions), and the exploration Q value table Q23, after being initialized, continues to generate recommendation intentions according to exploration actions. At this point, the data update between the double Q value tables is complete.
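As an illustrative note, the behaviour of the evaluation monitor F could be sketched as follows; all names, the dictionary-based table representation and the parameter layout are assumptions for illustration only:

```python
import copy

def evaluation_monitor(q_online, q_explore, ctr_online, ctr_explore,
                       recs_since_last_swap, preset_recommendation_count):
    """Sketch of the evaluation monitor F between the double Q value tables.
    The patent only requires the exploration table's click rate to exceed the online
    table's, and a preset number of recommendations to have elapsed since the last swap."""
    if ctr_explore > ctr_online and recs_since_last_swap >= preset_recommendation_count:
        q_online = copy.deepcopy(q_explore)   # copy the exploration data over the online table
        q_explore = {}                        # initialize a fresh exploration Q value table
        recs_since_last_swap = 0
    return q_online, q_explore, recs_since_last_swap
```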
S13: and obtaining the recommendation intention according to the updated online Q value table and a preset rule, and recommending the recommendation intention.
The current recommendation intention is obtained according to the updated online Q value table and the preset rule. Specifically, a decimal between [0, 1] is randomly generated; when the decimal falls within the interval [0, ε), the intention with the maximum Q value in the updated online Q value table is determined as the recommendation intention; when the decimal falls within the interval [ε, 1], the recommendation intention is selected randomly according to the convergence strategy of the exploration action of the updated online Q value table. Because the online Q value table converges according to both the exploration action and the utilization action, a certain balance is required between them, so a greedy coefficient ε is set to define the proportion of the two actions in the online Q value table. The rule is as follows: a decimal between [0, 1] is randomly generated; when the decimal falls within the interval [0, ε), the online Q value table takes the intention corresponding to the maximum Q value, according to the utilization action, as the recommendation intention; when the decimal falls within the interval [ε, 1], the online Q value table picks the recommendation intention according to the convergence strategy of the exploration action. The greedy coefficient lies between [0, 1], and its specific value can be adjusted according to the practical application, for example 0.95 or 0.97, without limitation.
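As an illustrative note, the greedy-coefficient selection between utilization and exploration could be sketched as follows; the function name, the dictionary-based table and the default value 0.95 (one of the example values above) are assumptions:

```python
import random

def pick_intention(q_online, state, intents, greedy_coefficient=0.95):
    """Pick the current recommendation intention under the greedy coefficient (sketch)."""
    if random.random() < greedy_coefficient:
        # Utilization action: take the intention with the maximum Q value in this state.
        return max(intents, key=lambda a: q_online.get((state, a), 0.0))
    # Exploration action: randomly pick an intention per the exploration convergence strategy.
    return random.choice(intents)
```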
In practical application, the background server sends the recommended content to the client. There may be multiple recommended contents, for example 3 or 5; each is recommended independently based on the recommendation method, and when recommending with a utilization action, if 5 contents are to be recommended, the 5 recommended actions with the largest Q values are selected. Further, the recommended content carries the state S(Now) before the current recommendation, which the client sends back to the server as part of the next recommendation request; in addition, the request information also includes the user's action on the recommendation intention and the real intention of that user action.
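As an illustrative note, selecting several recommended contents at once under a utilization action amounts to taking the top-k Q values in the current state; a hedged sketch, with k = 5 as an assumed example:

```python
def top_k_intentions(q_online, state, intents, k=5):
    """Sketch: take the k intentions with the largest Q values in the current state.
    The helper name and k = 5 are assumptions for illustration."""
    ranked = sorted(intents, key=lambda a: q_online.get((state, a), 0.0), reverse=True)
    return ranked[:k]
```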
After recommending the recommendation intention to the user, collecting the recommendation intention and the user action of the recommendation action, determining whether to give the recommendation state reward according to the user action, and judging whether to update Q value data of the online Q value table according to a reward result so as to improve the accuracy of the recommendation intention next time.
In addition, during recommendation it may happen that the system recommends the question corresponding to an intention, but the user does not click it and instead inputs a question manually, and after recognition the system finds that the intention of the user's input is exactly the recommended intention. Two situations may cause this: first, the wording of the question corresponding to the recommendation intention is problematic and not understood by the user; second, the recommendation intention itself is problematic, perhaps too general or ambiguous. That is to say, after an intention is recommended to the user in the current recommended action, if the user action is an input action, the current recommendation fails and the failure is counted into the historical failure count of the recommended intention; it is then further judged whether the intention of the user's input action is the same as the recommended intention, and if they are the same, the current recommendation is determined to be intent-ambiguous and counted into the intent-ambiguity count of the recommendation intention. The intent-ambiguity probability of the recommendation intention is obtained as the ratio of the intent-ambiguity count to the historical failure count; if this probability is greater than a preset threshold, the recommendation intention of the current recommended action is determined to be a fuzzy intent. The data is recorded in the system for service analysis and correction by human annotators to improve the efficiency of the whole system. The preset threshold may be set according to the practical application, for example 50% or 30%, without limitation.
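As an illustrative note, the intent-ambiguity bookkeeping described above could be sketched as follows; the counter layout, helper name and the 0.5 threshold are assumptions rather than values taken from the patent:

```python
def record_feedback_and_flag(stats, recommended_intent, user_action, user_intent, threshold=0.5):
    """Sketch of the intent-ambiguity bookkeeping: count failed recommendations and the
    subset whose input expressed the recommended intent, then compare the ratio to a threshold."""
    entry = stats.setdefault(recommended_intent, {"failures": 0, "ambiguous": 0})
    if user_action == "E":                       # user typed instead of clicking: a failed recommendation
        entry["failures"] += 1
        if user_intent == recommended_intent:    # the input expressed the recommended intent: ambiguity
            entry["ambiguous"] += 1
    probability = entry["ambiguous"] / entry["failures"] if entry["failures"] else 0.0
    return probability > threshold               # True: mark this recommendation intent as a fuzzy intent
```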
In a specific practical application, the system then enters the recommendation phase, which comprises: the current system state has reached S(Now). First, a dice roll is made according to the greedy coefficient (a random decimal is generated); if the roll lands on an exploration action, the intention to be recommended is generated at random, in some random manner, from the whole intention space; if it lands on a utilization action, the online Q value table is queried directly, and in the row of the current state S(Now) the action (i.e., intention to be recommended) with the largest value among all actions is selected as the current intention to be recommended.
Therefore, in the present embodiment, with the above solution the recommendation system can operate solely on the user feedback to the last recommended action, without depending on personalized user features. The reinforcement-learning-based recommendation method of this embodiment can therefore be applied to various platforms, for example as a fallback scheme for an intelligent recommendation application or a cold-start scheme for a platform-type recommendation service.
Referring to fig. 3, fig. 3 is a flowchart illustrating a recommendation method based on reinforcement learning according to another embodiment of the present invention.
S210: and acquiring the state before last recommendation, the last recommended action and the corresponding user action.
Firstly, obtaining a state before last recommendation, a last recommended action and a user action, namely a cumulative profit value corresponding to the last recommended action, wherein the last recommended action comprises a last recommendation intention, and the user action comprises a click action or the user action comprises an input action and an intention of the input action.
S211: and updating the online Q value table.
And inputting the state before the last recommendation, the last recommendation action and the corresponding user action into a dual-core Q-learning reinforcement learning model, calculating to obtain an accumulated profit value of the recommended state under the recommendation action, and updating the accumulated profit value into an online Q value table.
S212: whether the last recommendation is an exploration action recommendation type.
According to the information acquired in the previous step, it is judged whether the recommendation intention of the last recommendation is of the exploration action recommendation type. The recommendation types comprise: the utilization action recommendation type, i.e., an intention recommended by the online Q value table according to a utilization action, and the exploration action recommendation type, i.e., an intention recommended by the online Q value table according to an exploration action. If it is the exploration action recommendation type, step S213 is executed; if not, step S217 is executed.
S213: and updating the exploration Q value table.
If the recommendation intention of the last recommendation is of the exploration action recommendation type, the exploration Q value table obtains the profit from that recommendation, and the exploration strategy of the exploration Q value table converges to improve recommendation accuracy.
S214: and exploring whether the click rate of the Q value table is larger than that of the online Q value table or not.
It is judged whether the click rate of the exploration Q value table is greater than that of the online Q value table; if so, it is further judged whether the number of recommendations since the exploration Q value table last replaced the online Q value table exceeds the preset number of recommendations, and if so, step S216 is executed. If either condition is not satisfied, step S215 is executed.
S215: data between the double Q-value tables is not updated.
And the data of the exploration Q value table and the online Q value table are not updated, and the operation is continued on the basis of the original data.
S216: the exploration Q-value table replaces the online Q-value table.
Copying and covering the data of the exploration Q value table into the online Q value table, covering the original data of the online Q value table, continuing online service on the basis of the data of the exploration Q value table, and calculating by using the utilization action and the exploration action; after the data of the exploration Q value table is copied, the exploration Q value table is initialized, and calculation convergence is continued according to the exploration action.
S217: randomly determining a decimal between [0, 1] to determine whether the decimal falls within the interval of [0, ].
Because the online Q value table is calculated and converged according to both the exploration action and the utilization action, a certain balance is needed between them; therefore, a greedy coefficient ε is set to define the proportion of the two actions in the online Q value table. The rule is as follows: a decimal between [0, 1] is randomly generated and it is judged whether it falls within the interval [0, ε); if not, step S218 is executed; if yes, step S219 is executed.
S218: and selecting the recommendation intention according to the exploration action.
When the decimal falls within the interval [ε, 1], the recommendation intention is selected according to the convergence strategy of the exploration action.
S219: and selecting the recommendation intention according to the online Q value table.
And acquiring an intention corresponding to the recommended action corresponding to the maximum Q value in the corresponding state as a recommended intention.
S220: and displaying the recommendation intention to the user.
And pushing the selected recommendation intention to the user for selection.
S221: and obtaining the recommended state.
After recommending the recommendation intention to the user, the recommendation intention, the user action and the user action intention of the recommendation action are collected.
S222: and identifying that the recommendation intention is fuzzy.
And after collecting the recommendation state of the current recommendation action, identifying the fuzzy intention, wherein the fuzzy intention refers to the situation that the recommendation intention is the same as the action intention of the user, but the action of the user is not clicking, and identifying the recommendation action of the current recommendation intention as the fuzzy intention when the situation occurs.
S223: the probability of intent ambiguity > a preset threshold.
And judging whether the intent fuzzy probability of the recommendation intent is greater than a preset threshold, if so, executing step S225, and if not, executing step S224.
S224: the recommendation is not intended to be recorded.
No fuzzy-intent record is made for this recommendation intention.
S225: and recording the fuzzy intention of the recommendation intention.
The recommendation intention is identified as a fuzzy intent and recorded for business analysis and for annotation personnel to improve or correct.
Therefore, in the present embodiment, with the above solution the recommendation system can operate solely on the feedback to the last recommended action, without depending on personalized user features. The reinforcement-learning-based recommendation method of this embodiment can therefore be applied to various platforms, for example as a fallback scheme for an intelligent recommendation application or a cold-start scheme for a platform-type recommendation service.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of a recommendation device based on reinforcement learning according to the present invention.
The reinforcement-learning-based recommendation device of this embodiment comprises a processing module 31, a judging module 32 and a recommending module 33. The processing module 31 is configured to input the state before the last recommendation, the last recommended action (that is, the accumulated profit value corresponding to the last recommended action) and the corresponding user action into the dual-core Q-learning reinforcement learning model, calculate the accumulated profit value of the recommendation state under the recommended action, and update the accumulated profit value into the online Q value table. A Q value table is composed of the accumulated profit values of the corresponding recommended actions in each recommendation state, and the dual-core Q-learning reinforcement learning model comprises an online Q value table and an exploration Q value table.
The processing module 31 may obtain the state before the last recommendation, the last recommended action and the corresponding user action from the request parameters passed back when the display terminal calls the model server interface, input them into the dual-core Q-learning reinforcement learning model, calculate the accumulated profit value of the recommendation state under the recommended action, and update it into the online Q value table; the recommendation types comprise the utilization action recommendation type and the exploration action recommendation type, and the last recommended action comprises the last recommendation intention. Specifically, if the recommendation type of the intention recommended by the last recommended action is the utilization action type and the user action is a click, the latest Q value is calculated with the dual-core Q-learning reinforcement learning algorithm according to the utilization action type and the user action, and the online Q value table is updated. When the judging module 32 determines that the recommendation type of the last recommendation intention is the exploration action recommendation type, the processing module 31 further inputs the state before the last recommendation, the last recommended action and the corresponding user action into the dual-core Q-learning reinforcement learning model for calculation, and updates the exploration Q value table according to the calculation result.
The processing module 31 judges whether the click rate corresponding to the recommendation results generated by the updated online Q value table is smaller than the click rate corresponding to the recommendation results generated by the exploration Q value table; if yes, it judges whether the number of recommendations since the exploration Q value table last replaced the online Q value table exceeds the preset number of recommendations, which can be set according to the practical application, for example 100 times or 50 times the total number of intentions, without limitation; if yes, the exploration Q value table replaces the online Q value table, i.e., the data of the double Q value tables is updated to obtain a new online Q value table, and the data of the exploration Q value table is initialized to obtain a new exploration Q value table.
The determining module 32 is configured to determine whether the recommendation type of the previous recommendation intent is the exploration action recommendation type, and if so, the processing module is further configured to update the accumulated profit value of the exploration Q-value table by using a dual-core Q-learning reinforcement learning model.
Specifically, if the recommendation type of the last recommendation intention is the exploration action recommendation type, the state before the last recommendation, the last recommendation action, and the corresponding user action, that is, the accumulated profit value corresponding to the last recommendation action, are input into the dual-core Q-learning reinforcement learning model for calculation, and the exploration Q value table is also updated according to the calculation result.
The recommending module 33 is configured to obtain the recommending intention according to the updated online Q-value table and the preset rule, and recommend the recommending intention.
The recommending module 33 obtains the current recommendation intention according to the updated online Q value table and the preset rule. Further, a decimal between [0, 1] is randomly generated; when the decimal falls within the interval [0, ε), the intention with the maximum Q value in the updated online Q value table is determined as the recommendation intention; when the decimal falls within the interval [ε, 1], the recommendation intention is selected randomly according to the convergence strategy of the exploration action of the updated online Q value table. Because the online Q value table converges according to both the exploration action and the utilization action and a certain balance is needed between them, a greedy coefficient ε is set to define the proportion of the two actions in the online Q value table. The rule is as follows: a decimal between [0, 1] is randomly generated; when the decimal falls within the interval [0, ε), the online Q value table takes the intention corresponding to the maximum Q value, according to the utilization action, as the recommendation intention; when the decimal falls within the interval [ε, 1], the online Q value table picks the recommendation intention according to the convergence strategy of the exploration action. The greedy coefficient lies between [0, 1], and its specific value can be adjusted according to the practical application, for example 0.03 or 0.05, without limitation. After the dual-core Q-learning reinforcement learning calculation is completed, the recommendation intention calculated from the online Q value table is recommended to the user.
When the system recommends the question of an intention during recommendation, but the user does not click it and instead inputs a question manually, and the system, after recognition, finds that the intention of the user's input is exactly among the intentions recommended by the system, there may be two causes: first, the wording of the question corresponding to the recommendation intention is problematic and not understood by the user; second, the recommendation intention itself is problematic, perhaps too general or ambiguous. When these situations occur, the intent-ambiguity probability of each intention is calculated, and when it exceeds the preset threshold, the recommending module 33 marks the corresponding intention as a fuzzy intent and records it in the system for business analysis and for human annotators to analyze and correct, so as to improve the efficiency of the whole system.
Therefore, in the present embodiment, with the above solution the recommendation system can operate solely on the feedback to the last recommended action, without depending on personalized user features. The reinforcement-learning-based recommendation method of this embodiment can therefore be applied to various platforms, for example as a fallback scheme for an intelligent recommendation application or a cold-start scheme for a platform-type recommendation service.
Based on the same inventive concept, the present invention further provides a recommendation terminal based on reinforcement learning, which can be executed to implement the recommendation method based on reinforcement learning of any of the above embodiments, please refer to fig. 5, where fig. 5 is a schematic structural diagram of an embodiment of the recommendation terminal based on reinforcement learning provided by the present invention, and the recommendation terminal based on reinforcement learning includes a processor 41 and a memory 42.
The memory 42 is used for storing the recommended state, the recommended action, the user action, the fuzzy intention and the data of the dual Q value table each time.
The processor 41 is configured to calculate the current recommendation intention with the dual-core Q-learning algorithm, based on the accumulated profit value corresponding to the last recommended action. Specifically, the recommendation intention is calculated by the dual-core Q-learning algorithm, which specifically includes: introducing two Q value tables on the basis of the Q-learning reinforcement learning algorithm, the two Q value tables calculating the recommendation intention according to a preset calculation rule; meanwhile, data updating is carried out between the two Q value tables according to an updating rule; wherein the two Q value tables comprise an online Q value table and an exploration Q value table.
In the step of the two Q value tables calculating the recommendation intention according to the preset calculation rule, the preset calculation rule is specifically as follows: the online Q value table converges its calculation according to the exploration action and the utilization action, generates a recommendation intention and recommends it to the user, and a greedy coefficient is set in the online Q value table to define the action proportion between the exploration action and the utilization action; the exploration Q value table does not serve online and calculates the recommendation intention only according to the exploration action.
The specific process by which the processor 41 sets the greedy coefficient to define the action proportion between the exploration action and the utilization action is as follows: a greedy coefficient ε is set, and a decimal between [0, 1] is randomly generated; when the decimal falls within the interval [0, ε), the online Q value table takes the intention corresponding to the maximum Q value, according to the utilization action, as the recommendation intention; when the decimal falls within the interval [ε, 1], the online Q value table picks the recommendation intention according to the convergence strategy of the exploration action.
Meanwhile, data updating is carried out between the two Q value tables, specifically as follows: when the click rate of the recommendation intentions calculated by the exploration Q value table exceeds the click rate of the recommendation intentions calculated by the online Q value table, and the number of recommendations since the exploration Q value table last replaced the online Q value table exceeds the preset number of recommendations, the data between the double Q value tables is updated: the data of the exploration Q value table is copied over into the online Q value table, the data of the exploration Q value table is initialized, and the two Q value tables then continue calculating according to the preset calculation rule.
When the system recommends the question of an intention during recommendation, but the user does not click it and instead inputs a question manually, and the system, after recognition, finds that the intention of the user's input is exactly among the intentions recommended by the system, there may be two causes: first, the wording of the question corresponding to the recommendation intention is problematic and not understood by the user; second, the recommendation intention itself is problematic, perhaps too general or ambiguous. When these situations occur, the intent-ambiguity probability of each intention is calculated, and when it exceeds the preset threshold, the processor 41 marks the corresponding intention as a fuzzy intent and records it in the system for business analysis and for human annotators to analyze and correct, so as to improve the efficiency of the whole system.
By adopting the above scheme, the recommendation system can operate according to the feedback of the last recommended action without depending on personalized features of the user, and therefore the reinforcement-learning-based recommendation method of this embodiment can be applied to various platforms, for example as a fallback scheme for an intelligent recommendation application or a cold-start scheme for a platform-type recommendation service.
Based on the same inventive concept, the present invention further provides a storage medium, please refer to fig. 6, and fig. 6 is a schematic structural diagram of an embodiment of the storage medium provided by the present invention. The storage medium 50 stores therein program data 51, and the program data 51 may be a program or an instruction, which is capable of executing updating of an accumulated profit value corresponding to a last recommended action of the online Q-value table by using a dual-core Q-learning reinforcement learning model, and determining whether a recommendation type of a last recommendation intention is an exploration action recommendation type; if so, updating the accumulated income value of the exploration Q value table by using a dual-core Q-learning reinforcement learning model; the Q value table is composed of accumulated income values of corresponding recommended actions in each recommended state, and the dual-core Q-learning reinforcement learning model comprises an online Q value table and an exploration Q value table; and obtaining the recommendation intention according to the updated online Q value table and a preset rule, and recommending the recommendation intention to the user.
Likewise, with the above scheme, the recommendation system can operate on the feedback of the last recommended action alone, without depending on the user's personalized features; the reinforcement-learning-based recommendation method of this embodiment can therefore be applied to various platforms, for example as a fallback (bottom-of-pocket) scheme for an intelligent recommendation application or as a bootstrap scheme for a platform-level recommendation service.
The above description is only an embodiment of the present invention and is not intended to limit its scope; all equivalent structural or process modifications made on the basis of the present specification and drawings, whether applied directly or indirectly in other related technical fields, fall within the scope of the present invention.

Claims (11)

1. A recommendation method based on reinforcement learning is characterized by comprising the following steps:
updating an accumulated income value corresponding to the last recommended action of an online Q value table by using a dual-core Q-learning reinforcement learning model, wherein the dual-core Q-learning reinforcement learning model comprises the online Q value table and an exploration Q value table;
judging whether the recommendation type of the last recommendation intention is an exploration action recommendation type;
if so, updating the accumulated income value of the exploration Q value table by using the dual-core Q-learning reinforcement learning model;
and obtaining the current recommendation intention according to the updated online Q value table and a preset rule, and recommending the current recommendation intention.
2. The reinforcement learning-based recommendation method according to claim 1, wherein the step of updating the accumulated profit value of the exploration Q-value table by using the dual-core Q-learning reinforcement learning model comprises:
inputting, based on the recommendation type, the state before the last recommendation, the last recommended action and the corresponding user action into the dual-core Q-learning reinforcement learning model for calculation, and updating the exploration Q value table according to the calculation result.
3. The reinforcement learning-based recommendation method according to claim 2, wherein the step of updating the accumulated profit value of the online Q-value table corresponding to the last recommended action by using a dual-core Q-learning reinforcement learning model comprises:
acquiring the state before the last recommendation, the last recommended action and the user action corresponding to the last recommended action, wherein the last recommended action comprises the last recommendation intention, and the user action comprises a click action, or comprises an input action and the intention of the input action;
inputting the state before the last recommendation, the last recommended action and the user action corresponding to the last recommended action into the dual-core Q-learning reinforcement learning model, calculating the accumulated profit value of the last recommended action in the state before the last recommendation, and updating the accumulated profit value into the online Q value table.
4. The reinforcement learning-based recommendation method according to claim 2, wherein the step of obtaining the recommendation intent of this time according to the updated online Q-value table and a preset rule further comprises:
judging whether the updated exploration Q value table meets a preset condition or not;
if so, replacing the online Q value table by the exploration Q value table to obtain a new online Q value table, and initializing the exploration Q value table;
obtaining the recommendation intention according to the updated online Q value table and a preset rule comprises the following steps:
and obtaining the recommendation intention according to the new online Q value table and a preset rule.
5. The reinforcement learning-based recommendation method according to claim 4, wherein the step of determining whether the updated search Q-value table satisfies a preset condition comprises:
judging whether the click rate corresponding to the updated recommendation result of the exploration Q value table is larger than the click rate corresponding to the recommendation result generated by the online Q value table or not;
if yes, judging whether the number of recommendations since the exploration Q value table last replaced the online Q value table exceeds a preset number of recommendations;
if yes, the preset condition is met.
6. The reinforcement learning-based recommendation method according to claim 4, wherein the step of obtaining the recommendation intent of this time according to the updated Q-value table and a preset rule specifically comprises:
randomly generating a decimal in [0, 1];
when the decimal falls within [0, ε), ε being a preset greedy coefficient, determining the recommendation intention corresponding to the recommended action with the maximum Q value in the online Q value table as the current recommendation intention; and when the decimal falls within [ε, 1], selecting a recommendation intention corresponding to a recommended action according to the convergence strategy of the exploration action and determining it as the current recommendation intention.
7. The reinforcement learning-based recommendation method according to claim 2, further comprising:
obtaining the reward value of the last recommended action according to the last recommended action and the action intention of the corresponding user action;
calculating, based on the reward value, the accumulated profit value of the last recommended action in the last recommendation state through the Q-learning reinforcement learning model, and updating the Q value table corresponding to the last recommended action.
8. The reinforcement learning-based recommendation method according to any one of claims 1-7, wherein the step of recommending the recommendation intent to the user further comprises:
when the user action is an input action, determining that the recommendation of the recommended action has failed, and counting the failure into the historical failure times;
acquiring the action intention corresponding to the input action, and judging whether the action intention corresponding to the input action is the same as the recommendation intention; if they are the same, determining that the recommendation intention of the recommended action is intent-fuzzy, and counting the occurrence into the intent fuzzy times;
acquiring the intent fuzzy probability of the recommended action according to the ratio of the intent fuzzy times to the historical failure times;
and judging whether the fuzzy probability of the intention is greater than a preset threshold value, and if so, determining the recommendation intention of the recommended action as the fuzzy intention.
9. The recommendation device based on reinforcement learning is characterized by comprising a processing module, a judging module and a recommendation module,
the processing module is used for updating an accumulated profit value corresponding to the last recommended action of an online Q value table by using a dual-core Q-learning reinforcement learning model, wherein the dual-core Q-learning reinforcement learning model comprises the online Q value table and an exploration Q value table;
the judging module is used for judging whether the recommendation type of the last recommendation intention is an exploration action recommendation type;
if so, the processing module is further configured to update the accumulated profit value of the exploration Q-value table by using the dual-core Q-learning reinforcement learning model;
and the recommendation module is used for obtaining the current recommendation intention according to the updated online Q value table and a preset rule, and recommending the current recommendation intention.
10. A reinforcement learning-based recommendation terminal, comprising a processor and a memory, the memory having program data stored therein, the processor being configured to execute the program data to implement the recommendation method of any one of claims 1-8.
11. A storage medium characterized in that the storage medium stores program data executable to implement the recommendation method according to any one of claims 1-8.
CN201911236964.5A 2019-12-05 2019-12-05 Recommendation method, device, terminal and storage medium based on reinforcement learning Active CN111859099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911236964.5A CN111859099B (en) 2019-12-05 2019-12-05 Recommendation method, device, terminal and storage medium based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911236964.5A CN111859099B (en) 2019-12-05 2019-12-05 Recommendation method, device, terminal and storage medium based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111859099A (en) 2020-10-30
CN111859099B (en) 2021-08-31

Family

ID=72970692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911236964.5A Active CN111859099B (en) 2019-12-05 2019-12-05 Recommendation method, device, terminal and storage medium based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111859099B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140156031A1 (en) * 2011-08-11 2014-06-05 The Trustees Of Columbia University In The City Of New York Adaptive Stochastic Controller for Dynamic Treatment of Cyber-Physical Systems
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN107241213A (en) * 2017-04-28 2017-10-10 东南大学 A kind of web service composition method learnt based on deeply
CN107292392A (en) * 2017-05-11 2017-10-24 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN108093409A (en) * 2017-12-06 2018-05-29 重庆邮电大学 A kind of LTE-U systems and WiFi system are in the coexistence method of unauthorized frequency range
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A kind of proposed algorithm based on deeply study
CN109857848A (en) * 2019-01-18 2019-06-07 深圳壹账通智能科技有限公司 Interaction content generation method, device, computer equipment and storage medium
CN109817329A (en) * 2019-01-21 2019-05-28 暗物智能科技(广州)有限公司 A kind of medical treatment interrogation conversational system and the intensified learning method applied to the system
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
CN109978660A (en) * 2019-03-13 2019-07-05 南京航空航天大学 A kind of recommender system off-line training method based on intensified learning frame
CN110390108A (en) * 2019-07-29 2019-10-29 中国工商银行股份有限公司 Task exchange method and system based on deeply study
CN110378439A (en) * 2019-08-09 2019-10-25 重庆理工大学 Single robot path planning method based on Q-Learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AMIT KONAR ET AL.: "A Deterministic Improved Q-Learning for Path Planning of a Mobile Robot", 《IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS》 *
CAO SHIJIE ET AL.: "Consensus control of unmanned surface vehicle swarms based on improved reinforcement learning", 《JOURNAL OF HUAZHONG UNIVERSITY OF SCIENCE AND TECHNOLOGY (NATURAL SCIENCE EDITION)》 *

Also Published As

Publication number Publication date
CN111859099B (en) 2021-08-31

Similar Documents

Publication Publication Date Title
Bartz-Beielstein et al. Benchmarking in optimization: Best practice and open issues
CN111460130B (en) Information recommendation method, device, equipment and readable storage medium
CN111291266A (en) Artificial intelligence based recommendation method and device, electronic equipment and storage medium
Rehbach et al. Expected improvement versus predicted value in surrogate-based optimization
Douissa et al. A non-compensatory classification approach for multi-criteria ABC analysis
CN112149824B (en) Method and device for updating recommendation model by game theory
CN111967971A (en) Bank client data processing method and device
CN114418035A (en) Decision tree model generation method and data recommendation method based on decision tree model
CN111476622B (en) Article pushing method and device and computer readable storage medium
Özmen et al. Optimum assembly sequence planning system using discrete artificial bee colony algorithm
WO2021176795A1 (en) Matching system, matching method, and matching program
CN113743615A (en) Feature removal framework to simplify machine learning
CN113032676A (en) Recommendation method and system based on micro-feedback
CN115203550A (en) Social recommendation method and system for enhancing neighbor relation
CN115456707A (en) Method and device for providing commodity recommendation information and electronic equipment
CN111340605A (en) Method and device for training user behavior prediction model and user behavior prediction
CN111859099B (en) Recommendation method, device, terminal and storage medium based on reinforcement learning
CN114169906B (en) Electronic coupon pushing method and device
Stahl Combining case-based and similarity-based product recommendation
US20240127079A1 (en) Contextually relevant content sharing in high-dimensional conceptual content mapping
KR102441837B1 (en) Operating method of a platform that manages inventory and settles commissions by connecting big data-based agencies and customers in real time
CN117151247B (en) Method, apparatus, computer device and storage medium for modeling machine learning task
KR102654931B1 (en) Natural language processing-based user-customized blockchain content optimization and provision service provision method, device, and system
CORREIA CONVERSATIONAL INTERACTIVE RECOMMENDATION
EP4312152A1 (en) Material selection for designing a manufacturing product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant