CN110263136B - Method and device for pushing object to user based on reinforcement learning model - Google Patents

Method and device for pushing object to user based on reinforcement learning model

Info

Publication number
CN110263136B
CN110263136B
Authority
CN
China
Prior art keywords
pushing
push
round
user
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910463434.8A
Other languages
Chinese (zh)
Other versions
CN110263136A (en)
Inventor
陈岑
胡旭
傅驰林
安蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910463434.8A
Publication of CN110263136A
Application granted
Publication of CN110263136B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the present specification provide a method and an apparatus for pushing objects to a user based on a reinforcement learning model. The method comprises up to N consecutive rounds of pushing for a first user, wherein each round of pushing has a corresponding predetermined candidate object set, each round of pushing from the second round onward begins after the first user clicks on an object pushed in the previous round of pushing, and the candidate object set of each round from the second round onward comprises the respective sub-classes of the candidate objects of the previous round. The i-th round of pushing comprises the steps of: acquiring i-th state information; and inputting the i-th state information into the reinforcement learning model to determine the respective identifiers of a predetermined number of push objects of the i-th round of pushing.

Description

Method and device for pushing object to user based on reinforcement learning model
Technical Field
The embodiment of the specification relates to the technical field of machine learning, in particular to a method and a device for pushing objects to users based on a reinforcement learning model.
Background
Conventional customer service is labor- and resource-intensive and time-consuming, so it is important to build an intelligent assistant that can automatically answer the questions a user is likely to ask. Recently, increasing attention has been paid to building such intelligent assistants with machine learning. As a core function of a customer-service robot, user question prediction aims to automatically predict the questions a user may want to ask and to present candidate questions for the user to select from, alleviating the user's cognitive burden. In essence, question prediction infers the questions a user is likely to pose from the user's historical behavior, helping the user solve problems, improving user satisfaction, and saving labor cost in customer service. Existing question prediction methods are usually single-round recommendations based on supervised learning, in which questions are pushed directly. In complex scenarios where the user's intent is unclear, however, the accuracy of such recommendations is generally low.
Thus, there is a need for a more efficient solution to push questions to users.
Disclosure of Invention
Embodiments of the present disclosure aim to provide a more efficient solution for pushing objects to a user, so as to solve the deficiencies in the prior art.
To achieve the above object, one aspect of the present specification provides a method of pushing objects to a user based on a reinforcement learning model, the method comprising up to N consecutive rounds of pushing for a first user, wherein each round of pushing has a corresponding predetermined candidate object set, each round of pushing from the second round onward starts after the first user clicks on an object pushed in the previous round of pushing, and the candidate object set of each round of pushing from the second round onward comprises the respective sub-classes of the candidate objects of the previous round of pushing, wherein an i-th round of pushing among the up to N rounds of pushing comprises the steps of:
acquiring i-th state information, wherein the i-th state information comprises static features and dynamic features, the static features comprise the features the first user already has before the method is executed, and the dynamic features comprise the identifications of the objects clicked by the first user in the previous i-1 rounds of pushing; and
inputting the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th round of pushing, the respective identifications of a predetermined number of push objects of the i-th round of pushing.
In one embodiment, causing the reinforcement learning model to determine, from the candidate set of ith round of pushes, respective identifications of a predetermined number of the ith round of push objects comprises causing the reinforcement learning model to: based on the ith state information and object identifiers of all candidate objects in the candidate object set of the ith round of pushing, calculating the pushing probability of all candidate objects of the ith round of pushing, and determining the preset number of pushing objects of the ith round of pushing based on all the pushing probabilities.
In one embodiment, the first user clicks on a first push object in the round of push for an i-1 th round of push, wherein determining the predetermined number of push objects for the i-th round of push based on the respective push probabilities includes determining a first candidate object belonging to a subclass of the first push object among respective candidate objects for the i-th round of push, and determining the predetermined number of push objects for the i-th round of push based on the push probabilities of the respective first candidate objects.
In one embodiment, the first user clicks on a first push object in the round of push for the i-1 th round of push, wherein causing the reinforcement learning model to determine the respective identifications of the predetermined number of push objects of the i-th round of push from the candidate set of the i-th round of push comprises causing the reinforcement learning model to determine the respective identifications of the predetermined number of push objects of the i-th round of push from a subset of the candidate set of the i-th round of push, wherein the subset comprises a plurality of sub-categories of the first push object.
In one embodiment, the ith round of pushing further includes, after determining a pushing object of the ith round of pushing, pushing the pushing object to the first user to obtain feedback of the first user.
In one embodiment, i ≠ N, and in the case where the feedback of the first user is not clicking on one of the push objects, the method comprises i consecutive rounds of pushing for the first user, and the method further comprises optimizing the model by a policy gradient algorithm based on multiple sets of data corresponding respectively to multiple push objects in the i rounds of pushing, wherein a set of data corresponding to a second push object in a j-th round of pushing comprises: state information corresponding to the j-th round of pushing, an identification of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on the first user's feedback on the second push object.
In one embodiment, i = N, the method comprises N consecutive rounds of pushing for the first user, and the method further comprises, after obtaining the first user's feedback, optimizing the model by a policy gradient algorithm based on multiple sets of data corresponding to multiple push objects in the N rounds of pushing, wherein a set of data corresponding to a second push object in a j-th round of pushing of the N rounds comprises: state information corresponding to the j-th round of pushing, an identification of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.
In one embodiment, the push object corresponding to the N-th round of pushing is a query question, and the return value is a positive value in the case where the first user clicks on the second push object and zero in the case where the first user does not click on the second push object.
In one embodiment, the return value takes a first value in the case where j = N and the first user clicks on the second push object, and takes a second value in the case where j ≠ N and the first user clicks on the second push object, where the first value is greater than the second value.
Another aspect of the present specification provides an apparatus for pushing objects to a user based on a reinforcement learning model, the apparatus comprising at most N pushing modules deployed in succession for a first user, wherein each pushing module has a corresponding set of predetermined candidate objects, each pushing module from a second pushing module is deployed after the first user clicks on an object pushed by a previous pushing module, and each candidate object set of pushing modules from the second pushing module comprises a respective plurality of sub-categories of a plurality of candidate objects of the previous pushing module, wherein an ith pushing module of the at most N pushing modules comprises the following units:
the device comprises an acquisition unit, a storage unit and a storage unit, wherein the acquisition unit is configured to acquire ith state information, the ith state information comprises static characteristics and dynamic characteristics, the static characteristics comprise the existing characteristics of a first user before the device is deployed, and the dynamic characteristics comprise the identification of each object clicked by the first user for a previous i-1 push module; and
a determining unit configured to input the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines the identities of each of a predetermined number of push objects of the i-th push module from a candidate object set of the i-th push module.
In one embodiment, the determining unit includes a unit configured to deploy in the reinforcement learning model: a calculating subunit configured to calculate, based on the i-th state information and the object identification of each candidate object in the candidate object set of the i-th push module, a push probability of each candidate object of the i-th push module, and a determining subunit configured to determine, based on each push probability, a predetermined number of push objects of the i-th push module.
In an embodiment, the first user clicks on the first pushed object pushed by the i-1 th push module, wherein the determining subunit is further configured to determine a first candidate object belonging to the subclass of the first pushed object among the candidate objects of the i-th push module, and determine the predetermined number of pushed objects of the i-th push module based on the push probability of each first candidate object.
In an embodiment, the first user clicks on a first push object pushed by the i-1 th push module, wherein the determining unit is further configured to cause the reinforcement learning model to determine, from a subset of a candidate set of i-th push modules, an identity of each of a predetermined number of push objects of the i-th push module, wherein the subset includes a plurality of sub-categories of the first push object.
In an embodiment, the ith pushing module further includes a pushing unit configured to, after determining the pushing object of the ith pushing module, push the pushing object to the first user to obtain feedback of the first user.
In one embodiment, i ≠ N, and in the case where the feedback of the first user is not clicking on one of the push objects, the apparatus comprises a succession of i push modules for the first user, and the apparatus further comprises an optimization module configured to optimize the model by a policy gradient algorithm based on sets of data corresponding to respective push objects of the i push modules, wherein a set of data corresponding to a second push object of a j-th push module comprises: state information corresponding to the j-th push module, an identification of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on the first user's feedback on the second push object.
In one embodiment, i = N, the apparatus includes N consecutive push modules for the first user, and the apparatus further includes an optimization module configured to optimize the model by a policy gradient algorithm, after obtaining the first user's feedback, based on multiple sets of data corresponding to multiple push objects of the N push modules, wherein a set of data corresponding to a second push object of a j-th push module of the N push modules includes: state information corresponding to the j-th push module, an identification of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.
Another aspect of the present description provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform any of the methods described above.
Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, performs any of the methods described above.
In the object pushing scheme according to the embodiments of the present specification, a novel structured object pushing flow is provided: the user is guided step by step, the state transition process across the whole multi-round pushing is modeled through reinforcement learning, and the user's dynamic click information is taken into account in the model, thereby improving prediction accuracy.
Drawings
The embodiments of the present specification may be further clarified by describing the embodiments of the present specification with reference to the accompanying drawings:
FIG. 1 shows a schematic diagram of a process of pushing objects to a user according to an embodiment of the present description;
FIG. 2 illustrates a method of pushing objects to a user based on a reinforcement learning model in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates push objects respectively presented to a user in three rounds of pushing;
Fig. 4 illustrates an apparatus 400 for pushing objects to a user based on a reinforcement learning model in accordance with an embodiment of the present disclosure.
Detailed Description
Embodiments of the present specification will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a process of pushing objects to a user according to an embodiment of the present specification. The figure shows the reinforcement learning model 11 (i.e., the agent) making three consecutive decisions for the user 12 (i.e., the environment), producing three corresponding rounds of pushing. The reinforcement learning model is used, for example, in an intelligent customer service system to predict the question a user wants to ask. In this embodiment, the reinforcement learning model is constructed by modeling how a person who encounters a problem narrows down, through structured hierarchical thinking, the question he or she finally wants to ask. That is, across the three decisions the question is predicted hierarchically: first the major class the question belongs to is predicted, then the sub-class the question belongs to, and finally the specific question within that sub-class.
Specifically, in the first decision, an initial state s_1 is input to the model 11 based on the current state of the user 12. The state s_1 includes static features (shown as white boxes in the ellipse labeled s_1 in the figure) and dynamic features (not shown). The static features are the user's existing features, including the attribute features, historical behavioral features, and so on that the user had before this episode; the dynamic features are the objects the user has already clicked in this episode, and since this is the first decision, the dynamic features are empty. After s_1 is input to the model 11, the model 11 calculates, through a policy gradient algorithm, the probability of each candidate major class of the first round of pushing corresponding to this decision, and determines the major classes to be pushed in this round based on these probabilities, e.g. (a_11, a_12, a_13); the pushed major classes are the action output by the model 11. Several major classes can thus be presented (i.e., pushed) to the user based on the output of the model. For example, in the Alipay intelligent customer service, three major classes, "Huabei", "Jiebei" and "Yu'ebao", are first presented to the user based on the output of the model 11. After the presentation, the user's feedback can be obtained: the user may click on one of the major classes or click on none of them, and the corresponding return values (r_11, r_12, r_13) can be obtained accordingly. After the user clicks on a major class, e.g. "Huabei", the model 11 starts the second decision process. Specifically, a second state s_2 is input to the model 11 based on the current state of the user. The second state s_2 likewise includes static features (shown as white boxes in the ellipse labeled s_2) and dynamic features (shown as grey boxes in the ellipse labeled s_2). The static features are the same as those of state s_1, and the dynamic features include the identifier of the clicked major class "Huabei", e.g. a_11. After state s_2 is input to the model 11, the model 11 likewise outputs, based on state s_2, three sub-classes (a_21, a_22, a_23), which correspond, for example, to the sub-classes "Huabei bill", "Huabei repayment" and "Huabei fee" of the next level under the "Huabei" major class. Likewise, after the second round of pushing, the user's feedback can be obtained, e.g. the user clicks "Huabei repayment", the return values (r_21, r_22, r_23) can be obtained, and the state s_3 of the third decision can be acquired accordingly. By inputting the state s_3 into the model 11, the three questions (a_31, a_32, a_33) of the third round of pushing can be output, which correspond, for example, to "Can Huabei be repaid in advance?", "In what order are automatic Huabei repayments deducted?" and "How to repay Huabei?", and the return values (r_31, r_32, r_33) can be obtained. After the three rounds of pushing to the user are performed as described above, the model can be optimized based on the data of these three rounds of pushing, thereby improving the prediction accuracy of the model.
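The interaction loop just described, up to N decisions where each round is conditioned on the previous click and the episode ends early when the user clicks nothing, can be sketched roughly as follows. This is a minimal illustration only; every name in it (run_episode, build_state, select_top_k, get_user_feedback, reward, candidate_sets) is an assumption of the sketch rather than something prescribed by this specification.

# Sketch of one episode: up to n_rounds of pushing, each conditioned on the
# previous click; the episode ends early if the user clicks nothing.
def run_episode(policy, candidate_sets, static_features, n_rounds=3, top_k=3):
    clicked_ids = []    # dynamic features: identifiers of objects clicked so far
    samples = []        # (state, candidates, chosen object id, return value)
    for i in range(n_rounds):
        state = build_state(static_features, clicked_ids, n_rounds)
        candidates = candidate_sets[i]
        if clicked_ids:  # keep only sub-classes of the object clicked last round
            candidates = [c for c in candidates if c["parent_id"] == clicked_ids[-1]]
        pushed = select_top_k(policy, state, candidates, k=top_k)
        clicked = get_user_feedback(pushed)       # None if nothing is clicked
        for obj in pushed:
            samples.append((state, candidates, obj["id"],
                            reward(obj, clicked, i, n_rounds)))
        if clicked is None:
            break        # no click: the episode stops at round i
        clicked_ids.append(clicked["id"])
    return samples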
It will be appreciated that the above description of Fig. 1 is only illustrative and not restrictive. For example, the push object is not limited to a question to be queried by the user, but may be another kind of push object, such as merchandise or reviews, in which case the corresponding major classes and sub-classes, the actions the user may take on the push objects, and the way return values are calculated change accordingly; the number of rounds of pushing is not limited to three and can be set according to the specific scenario; and the model is not limited to reinforcement learning by a policy gradient algorithm, and so on.
The above-described pushing process is specifically described below.
Fig. 2 shows a method of pushing objects to a user based on a reinforcement learning model according to an embodiment of the present specification. The method comprises up to N consecutive rounds of pushing for a first user, wherein each round of pushing has a corresponding predetermined candidate object set, each round of pushing from the second round onward starts after the first user clicks on an object pushed in the previous round of pushing, and the candidate object set of each round of pushing from the second round onward comprises the respective sub-classes of the candidate objects of the previous round of pushing. The i-th round of pushing among the up to N rounds comprises the following steps:
Step S202, acquiring i-th state information, wherein the i-th state information comprises static features and dynamic features, the static features comprise the features the first user already has before the method is executed, and the dynamic features comprise the identifiers of the objects clicked by the first user in the previous i-1 rounds of pushing; and
Step S204, inputting the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th round of pushing, the respective identifiers of a predetermined number of push objects of the i-th round of pushing.
The up to N rounds of pushing constitute one episode of reinforcement learning. As described above, the push object of the N-th round of pushing is, for example, a query question of the user, and the push objects of rounds 1 to N-1 are correspondingly the major class, the sub-class, and so on in which the query question is located. A corresponding candidate object set is preset for each round of pushing. For example, in the 1st round of pushing, the preset candidate object set includes the major classes; in the Alipay intelligent customer service, the candidate object set of the 1st round of pushing may include "Huabei", "Jiebei", "Yu'ebao", "Zhima Credit", "Ant Insurance", "Ant Forest", and so on. In the 2nd round of pushing, the preset candidate object set includes the sub-classes of the major classes; for example, the sub-classes of "Huabei" are ("Huabei bill", "Huabei repayment", "Huabei fee", "activating Huabei"), the sub-classes of "Jiebei" include, for example, ("Jiebei repayment", "borrowing limit", "activating Jiebei"), and so on. In the 3rd round of pushing, the preset candidate object set includes the questions under each sub-class, each question being a child of its sub-class, for example the questions under the sub-class "Huabei repayment" (i.e., the query questions related to Huabei repayment), the questions under the sub-class "Huabei bill", and so on.
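As an illustration of how these per-round candidate object sets could be organised, the Alipay example above might be laid out as a small hierarchy such as the following. The identifier strings and the parent_id field are assumptions of this sketch, not a format required by the specification.

# Round 1: major classes; round 2: their sub-classes; round 3: concrete questions.
candidate_sets = [
    [{"id": "a11", "label": "Huabei", "parent_id": None},
     {"id": "a12", "label": "Jiebei", "parent_id": None},
     {"id": "a13", "label": "Yu'ebao", "parent_id": None}],
    [{"id": "a21", "label": "Huabei bill", "parent_id": "a11"},
     {"id": "a22", "label": "Huabei repayment", "parent_id": "a11"},
     {"id": "a23", "label": "Huabei fee", "parent_id": "a11"}],
    [{"id": "a31", "label": "Can Huabei be repaid in advance?", "parent_id": "a22"},
     {"id": "a32", "label": "In what order are automatic Huabei repayments deducted?", "parent_id": "a22"},
     {"id": "a33", "label": "How to repay Huabei?", "parent_id": "a22"}],
]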
It will be appreciated that the candidate object set of each round of pushing from the second round onward is not limited to the definition above. For example, when the model is based on a policy gradient algorithm, the model calculates a push probability for each candidate object from the input state and determines the output push objects based on the ranking of these push probabilities. To save model computation time after the first user has clicked on a certain push object in the first round, the candidate object set of the second round may be defined to include only the sub-classes of the clicked push object. In this case, a specific identifier may be used, for example, to indicate which major class of the first round each sub-class of the second round belongs to.
Each round of pushing from the second round onward starts after the first user clicks on an object pushed in the previous round of pushing. For example, as shown in Fig. 1, after the major classes (e.g., "Huabei", "Jiebei" and "Yu'ebao") are pushed in the 1st round, if the user clicks on one of the major classes (e.g., "Huabei"), the second round of pushing of this episode is entered; if the user does not click on any of them, the episode ends, i.e., the episode includes only one round of pushing.
The process included in each of the up to N rounds of pushing is the same; the i-th round of pushing may include the following steps.
First, in step S202, the i-th state information is acquired, where the i-th state information includes static features and dynamic features: the static features include the features the first user already has before the method is executed, and the dynamic features include the identifiers of the objects clicked by the first user in the previous i-1 rounds of pushing.
The i-th state information is the i-th state s_i input for the i-th model prediction of this episode. The state s_i takes the form of, for example, a feature vector comprising a plurality of elements. The elements of a predetermined set of dimensions of s_i correspond to the static features of the user, i.e., the features that already exist before this episode, such as the user's attribute features, profile features, historical behavioral features, and so on, so the static features are the same in all states of one episode. The elements of another predetermined set of dimensions of s_i correspond to the dynamic features of the user, i.e., the identifiers of the objects the first user clicked in the rounds of pushing of this episode that precede the current round. For example, referring to the description of Fig. 1 above, in the first round of pushing the first user has not yet clicked anything, so the dynamic features of the input state s_1 may be represented as [0, 0]; in the second round of pushing, the first user has, for example, clicked "Huabei" after the first round, so the dynamic features of the input second state s_2 include the identifier of "Huabei", e.g., the dynamic features of s_2 may be represented as [a_11, 0]; in the third round of pushing, the first user has, for example, clicked "Huabei repayment" after the second round, so the dynamic features of the input third state s_3 include the identifier of "Huabei" clicked after the first round and the identifier of "Huabei repayment" clicked after the second round, e.g., the dynamic features of s_3 may be represented as [a_11, a_22].
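A minimal sketch of assembling such a state vector, assuming the clicked-object identifiers have already been mapped to integer indices (0 meaning no click) and that the static features are given as a numeric vector; the function name and layout are illustrative only:

import numpy as np

def build_state(static_features, clicked_indices, n_rounds):
    # Dynamic part: one slot per possible preceding round, zero-padded,
    # e.g. [0, 0] in round 1, [a11, 0] in round 2, [a11, a22] in round 3.
    dynamic = np.zeros(n_rounds - 1)
    for pos, idx in enumerate(clicked_indices[: n_rounds - 1]):
        dynamic[pos] = idx
    # Static part: attribute, profile and historical-behaviour features,
    # identical across all states of one episode.
    return np.concatenate([np.asarray(static_features, dtype=float), dynamic])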
Then, in step S204, the i-th state information is input into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th round of pushing, the respective identifiers of a predetermined number of push objects of the i-th round of pushing.
The reinforcement learning model is, for example, a model based on a policy gradient algorithm. In that case the model comprises a policy function π(a|s, θ) over the state s and the action a, where θ is the model parameter of the reinforcement learning model and π(a|s, θ) is the probability of taking action a in state s. For example, after the state s_i is input to the model, the push probability of each candidate object (i.e., each action) a_ij of the i-th decision is obtained in the model based on the policy function π(a|s, θ), and a predetermined number of push objects for this round of pushing are determined based on the push probabilities of the candidate objects.
For example, in the intelligent customer service scenario described in Fig. 1, after the state s_1 is input to the model in the first round of pushing, the candidate objects of the model's first decision include, as described above, "Huabei", "Jiebei", "Yu'ebao", "Zhima Credit", "Ant Insurance" and "Ant Forest", identified by a_11, a_12, ..., a_16 respectively. The model calculates in turn the push probability π(a_1j | s_1, θ) of each candidate object, where j is 1, 2, ..., 6, and ranks the 6 probabilities, so that a predetermined number (e.g., 3) of top-ranked candidate objects are determined as push objects. It will be appreciated that the predetermined number may be set to be the same for every decision, or may be set separately for each decision. For example, the predetermined number may be set to be proportional to the number of candidate objects of the decision, so that in the first round of pushing, where there are fewer candidate objects, fewer objects are pushed, and in the third round of pushing, where there are more candidate objects, correspondingly more objects are pushed.
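The specification does not fix how π(a|s, θ) is parameterised; it only requires that, given the state, the model yields a push probability for each candidate object of the current round and keeps the top-ranked ones. A minimal sketch of one possible parameterisation with a small neural scorer (the architecture, the use of PyTorch, and all names are assumptions of this sketch):

import torch
import torch.nn as nn

class PushPolicy(nn.Module):
    def __init__(self, state_dim, n_objects, emb_dim=16):
        super().__init__()
        self.obj_emb = nn.Embedding(n_objects, emb_dim)   # one embedding per object id
        self.scorer = nn.Sequential(nn.Linear(state_dim + emb_dim, 64),
                                    nn.ReLU(),
                                    nn.Linear(64, 1))

    def probs(self, state, candidate_ids):
        # One score per candidate from (state, candidate embedding), normalised
        # with softmax over the candidates of this round: pi(a | s, theta).
        emb = self.obj_emb(candidate_ids)                  # (n_cand, emb_dim)
        s = state.unsqueeze(0).expand(emb.size(0), -1)     # repeat the state per candidate
        logits = self.scorer(torch.cat([s, emb], dim=-1)).squeeze(-1)
        return torch.softmax(logits, dim=-1)

    def top_k(self, state, candidate_ids, k=3):
        p = self.probs(state, candidate_ids)
        top = torch.topk(p, k=min(k, p.numel())).indices
        return candidate_ids[top], p[top]

In the first round of the example above, probs(s_1, candidate_ids) would play the role of π(a_1j | s_1, θ) for j = 1, ..., 6, and top_k would keep the three highest-probability major classes.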
Typically, after the first round of pushing is performed as described above and the first user clicks, for example, "Huabei" in the first round, the push objects the model predicts based on the new state s_2 should be sub-classes under "Huabei". However, to exclude model errors, the output of the model may be filtered and then ranked. For example, after calculating the push probability of each candidate object of the second round of pushing, the model filters out the candidate objects that are not sub-classes of "Huabei" and ranks the push probabilities of the remaining candidate objects, thereby finally determining the push objects of the second round of pushing. In one embodiment, the candidate objects of the second round of pushing may instead be filtered before the model calculates the push probabilities, that is, the candidate objects that are not sub-classes of "Huabei" are removed from the candidate object set so that only the sub-classes of "Huabei" remain; the push probability of each candidate object is then calculated based on the filtered candidate object set (a subset of the original candidate object set), and the push objects of this round of pushing are determined based on the ranking of these push probabilities.
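Both variants just described, ranking all candidates and then filtering, or filtering the candidate set first and ranking only the subset, can be sketched as follows; here policy.probs is simply assumed to return one probability per candidate identifier, and all names are illustrative:

def rank_then_filter(policy, state, candidates, clicked_id, k=3):
    # Score every candidate, then keep only sub-classes of the clicked object.
    p = policy.probs(state, [c["id"] for c in candidates])
    kept = [(c, float(pi)) for c, pi in zip(candidates, p)
            if c["parent_id"] == clicked_id]
    return sorted(kept, key=lambda x: x[1], reverse=True)[:k]

def filter_then_rank(policy, state, candidates, clicked_id, k=3):
    # Restrict to the relevant subset first, so probabilities are only
    # computed over the sub-classes of the clicked object.
    subset = [c for c in candidates if c["parent_id"] == clicked_id]
    p = policy.probs(state, [c["id"] for c in subset])
    return sorted(((c, float(pi)) for c, pi in zip(subset, p)),
                  key=lambda x: x[1], reverse=True)[:k]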
It will be appreciated that the reinforcement learning model is not limited to using a policy gradient algorithm, but may use other algorithms, such as Q-learning, actor-critic algorithms, and so on, which are not described in detail here.
After the push objects of a round of pushing are determined as described above, they are pushed to the first user to obtain the first user's feedback. For example, in the intelligent customer service scenario, the predetermined number of push objects of the round may be presented to the first user in order on a window page. Fig. 3 schematically shows the push objects presented to the user in the three rounds of pushing; as shown in Fig. 3, "Huabei", "Jiebei" and "Yu'ebao" are presented to the user in order in the first round of pushing. After the presentation, the first user's feedback, i.e., which push object the first user clicks, if any, can be obtained, and a corresponding return value can be obtained based on this feedback; the mouse cursor in Fig. 3 indicates the user's click on a push object. For example, it may be preset that when the first user clicks on a certain push object a_1j, the return value r_1j corresponding to that push object in this round of pushing is recorded as 0.1, and that if the first user does not click on push object a_1j, the return value r_1j corresponding to that object is recorded as 0. That is, after the user clicks "Huabei" in the first round of pushing, it follows that r_11 = 0.1, r_12 = 0, r_13 = 0. It will be appreciated that this preset is only illustrative. For example, when the first user clicks on a certain push object a_1j, the return value r_1j may be set based on the rank of that push object among the predetermined number of push objects, with earlier ranks giving larger return values. For example, suppose that in this first round of pushing the model outputs three push objects in the order a_11, a_12, a_13; the return value r_11 for the first user clicking push object a_11 may then be set larger than the return value r_12 for the first user clicking push object a_12.
After the first user clicks on a certain push object of a round of pushing, the model proceeds to the next round of pushing, so that the return values of the next round can be obtained as described above. For example, if the first user clicks "Huabei" in the first round of pushing, then in the next round, according to the model's second prediction, the top-ranked sub-classes under the "Huabei" major class are determined for display in the second round of pushing; as shown in Fig. 3, the three sub-classes predicted by the model, "Huabei bill", "Huabei repayment" and "Huabei fee", are displayed to the user in the second round of pushing. After the user clicks on one of these sub-classes (e.g., "Huabei repayment"), the model makes a third prediction to determine the three specific questions of the next level under the sub-class "Huabei repayment" to be displayed in the third round of pushing. As shown in Fig. 3, three questions predicted by the model are presented to the user in the third round of pushing: "Can Huabei be repaid in advance?", "In what order are automatic Huabei repayments deducted?" and "How to repay Huabei?".
For the N-th round of pushing, it may further be set that when the user clicks on a certain push object of the N-th round, the return value is larger than the click return value of any of the previous N-1 rounds. For example, in the above intelligent customer service scenario, when the first user clicks on a push object a_3j of the third round (i.e., a question), the return value corresponding to that push object may be set to r_3j = 1, whereas when the first user clicks on a push object of the first or second round, the return value (r_1j or r_2j) is set to 0.1. For any of the three rounds of pushing, if the user does not click on any push object of that round, the episode ends, i.e., the model does not proceed to the next round of pushing; in other words, one episode of the model includes at most N rounds of pushing.
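The return-value rule described above, 0 for an un-clicked push object, 0.1 for a clicked object in rounds 1 to N-1, and 1 for a clicked object in the final round, can be written in the notation of the earlier sketches roughly as follows (function and argument names are illustrative):

def reward(pushed_obj, clicked_obj, round_idx, n_rounds):
    # No click in this round, or a click on a different object: return value 0.
    if clicked_obj is None or pushed_obj["id"] != clicked_obj["id"]:
        return 0.0
    # A click on a final-round object (a concrete question) is rewarded more
    # strongly than a click in an earlier round.
    return 1.0 if round_idx == n_rounds - 1 else 0.1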
After one episode of the model is completed, the model may be trained with the input-output data and feedback data of that episode. For example, in one case the episode includes three rounds of pushing to the first user, that is, the first user clicked in both the first and second rounds of pushing and the third round was therefore reached. In this case, at least three training updates of the model may be performed. Specifically, assume that the first user clicks the push object identified as a_11 after the first round of pushing, clicks the push object identified as a_22 after the second round of pushing, and clicks the push object identified as a_32 after the third round of pushing. Three sets of training data (s_1, a_11, r_11), (s_2, a_22, r_22) and (s_3, a_32, r_32) can then be acquired, where, as follows from the above, r_11 = 0.1, r_22 = 0.1, r_32 = 1. Based on each of the three sets of training data, the model parameters may be updated according to the policy gradient algorithm by the following equation (1):

θ ← θ + α · E[ r · ∇_θ log π(a | s, θ) ]    (1)
where E[·] denotes the expected value and α denotes the update step size. For example, when training the model with (s_1, a_11, r_11), the expectation term in equation (1) can be estimated by equation (2) as follows:

E[ r · ∇_θ log π(a | s, θ) ] ≈ r_11 · ∇_θ log π(a_11 | s_1, θ)    (2)
When training the model with (s_2, a_22, r_22) and (s_3, a_32, r_32), the corresponding terms r_22 · ∇_θ log π(a_22 | s_2, θ) and r_32 · ∇_θ log π(a_32 | s_3, θ) can likewise be calculated based on the respective return values r_22 and r_32.
In this case, in addition to the three sets of training data above, training data may also be acquired for the push objects the user did not click in each round of pushing. For example, for push object a_12 in the first round of pushing, a set of training data (s_1, a_12, r_12) can be acquired; since the user did not click on this push object, r_12 = 0, and accordingly the corresponding update term r_12 · ∇_θ log π(a_12 | s_1, θ) is also 0.
In another case, the first user does not click on any push object in, for example, the second round of pushing, in which case the episode ends after the second round of pushing is performed. Specifically, assume that the first user clicks the push object identified as a_11 after the first round of pushing and does not click any push object after the second round of pushing. A set of training data (s_1, a_11, r_11) can then be acquired, where r_11 = 0.1. Model training can thus be performed similarly through equation (1), where the expectation term of equation (1) is obtained through equation (2), i.e., r_11 · ∇_θ log π(a_11 | s_1, θ). Similarly, from this episode, multiple sets of training data corresponding to the non-clicked push objects of the first and second rounds of pushing can also be acquired for model training.
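Putting equations (1) and (2) together, every (state, candidate set, chosen object, return value) sample collected in an episode can drive one gradient step on θ. The following is a minimal sketch of such an update, assuming a probs interface like the PushPolicy sketch above and a standard PyTorch optimizer; it is an illustration of the policy gradient update, not a prescribed implementation:

import torch

def update_policy(policy, optimizer, samples):
    # One REINFORCE-style step per sample:
    #   theta <- theta + alpha * r * grad_theta log pi(a | s, theta).
    # action_idx is the position of the chosen object within candidate_ids.
    for state, candidate_ids, action_idx, r in samples:
        if r == 0:
            continue                     # a zero return value contributes nothing
        p = policy.probs(state, candidate_ids)
        loss = -r * torch.log(p[action_idx])   # gradient ascent on r * log pi
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()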
Fig. 4 shows an apparatus 400 for pushing objects to a user based on a reinforcement learning model according to an embodiment of the present description, the apparatus comprising at most N pushing modules 41 deployed consecutively for a first user, wherein each pushing module has a corresponding set of predetermined candidate objects, each pushing module starting from a second pushing module is deployed starting after the first user clicks on an object pushed by a previous pushing module, and each candidate object set of pushing modules starting from the second pushing module comprises a plurality of sub-categories of each of the plurality of candidate objects of the previous pushing module, wherein an ith pushing module of the at most N pushing modules comprises the following units:
An obtaining unit 411, configured to obtain ith status information, where the ith status information includes a static feature and a dynamic feature, where the static feature includes an existing feature of the first user before the device is deployed, and the dynamic feature includes an identifier of each object clicked by the first user for the previous i-1 push modules; and
a determining unit 412 is configured to input the ith status information into the reinforcement learning model, so that the reinforcement learning model determines the identities of each of the predetermined number of push objects of the ith push module from the candidate object set of the ith push module.
In one embodiment, the determining unit 412 includes a software component deployed in the reinforcement learning model: a calculating subunit 4121 configured to calculate, based on the i-th state information and the object identification of each candidate object in the candidate object set of the i-th push module, a push probability of each candidate object of the i-th push module, and a determining subunit 4122 configured to determine, based on each push probability, a predetermined number of push objects of the i-th push module.
In an embodiment, the first user clicks on the first pushed object pushed by the i-1 th push module, wherein the determining subunit is further configured to determine a first candidate object belonging to the subclass of the first pushed object among the candidate objects of the i-th push module, and determine the predetermined number of pushed objects of the i-th push module based on the push probability of each first candidate object.
In an embodiment, the first user clicks on a first push object pushed by the i-1 th push module, wherein the determining unit is further configured to cause the reinforcement learning model to determine, from a subset of a candidate set of i-th push modules, an identity of each of a predetermined number of push objects of the i-th push module, wherein the subset includes a plurality of sub-categories of the first push object.
In one embodiment, the ith pushing module 41 further includes a pushing unit 413 configured to, after determining the pushing object of the ith pushing module, push the pushing object to the first user, so as to obtain feedback of the first user.
In an embodiment, i ≠ N, and in the case where the feedback of the first user is not clicking on one of the push objects, the apparatus comprises a succession of i push modules for the first user, and the apparatus further comprises an optimization module 42 configured to optimize the model by a policy gradient algorithm based on sets of data corresponding to respective push objects of the i push modules, wherein a set of data corresponding to a second push object of a j-th one of the i push modules comprises: state information corresponding to the j-th push module, an identification of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.
In one embodiment, i = N, the apparatus includes N consecutive push modules for the first user, and the apparatus further includes an optimization module 42 configured to optimize the model by a policy gradient algorithm, after obtaining the first user's feedback, based on multiple sets of data corresponding to multiple push objects of the N push modules, wherein a set of data corresponding to a second push object of a j-th push module of the N push modules includes: state information corresponding to the j-th push module, an identification of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.
Another aspect of the present description provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform any of the methods described above.
Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, performs any of the methods described above.
In the object pushing scheme according to the embodiments of the present specification, a novel structured object pushing flow is provided: the user is guided step by step, the state transition process across the whole multi-round pushing is modeled through reinforcement learning, and the user's dynamic click information is taken into account in the model, thereby improving prediction accuracy.
It should be understood that the description of "first," "second," etc. herein is merely for simplicity of description and does not have other limiting effect on the similar concepts.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different approaches for each particular application, but such implementation is not considered to be beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments described; any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (20)

1. A method of pushing objects to a user based on a reinforcement learning model, the method comprising a succession of up to N rounds of pushing for a first user, wherein each round of pushing has a corresponding set of predetermined candidate objects, each round of pushing starting from a second round of pushing begins after the first user clicks on an object pushed in a previous round of pushing, and each round of pushing candidate object set starting from the second round of pushing comprises a respective plurality of sub-classes of a plurality of candidate objects of the previous round of pushing, wherein an ith round of pushing in the up to N rounds of pushing comprises the steps of:
acquiring ith state information, wherein the ith state information comprises static characteristics and dynamic characteristics, the static characteristics comprise the existing characteristics of a first user before the method is executed, and the dynamic characteristics comprise the identification of each clicked object for the previous i-1 round push of the first user; and
Inputting the ith state information into the reinforcement learning model, so that the reinforcement learning model determines the respective identifications of the preset number of pushing objects of the ith pushing from the candidate object set of the ith pushing.
2. The method of claim 1, wherein causing the reinforcement learning model to determine, from the candidate set of ith round of pushes, an identity of each of a predetermined number of push objects of the ith round of pushes comprises causing the reinforcement learning model to: based on the ith state information and object identifiers of all candidate objects in the candidate object set of the ith round of pushing, calculating the pushing probability of all candidate objects of the ith round of pushing, and determining the preset number of pushing objects of the ith round of pushing based on all the pushing probabilities.
3. The method of claim 2, wherein the first user clicks on a first push object in the round of push for an i-1 th round of push, wherein determining a predetermined number of push objects for the i-th round of push based on the respective push probabilities comprises determining a first candidate object belonging to a subclass of the first push object among the respective candidate objects for the i-th round of push, and determining a predetermined number of push objects for the i-th round of push based on the push probabilities of the respective first candidate objects.
4. The method of claim 1, wherein the first user clicks on a first pushed object in the round of pushing for an i-1 th round of pushing, wherein causing the reinforcement learning model to determine respective identifications of a predetermined number of pushed objects of the i-th round of pushing from a set of candidate objects of the i-th round of pushing comprises causing the reinforcement learning model to determine respective identifications of a predetermined number of pushed objects of the i-th round of pushing from a subset of the set of candidate objects of the i-th round of pushing, wherein the subset comprises a plurality of sub-categories of the first pushed object.
5. The method of claim 2, the ith round of pushing further comprising, after determining a push object of the ith round of pushing, pushing the push object to the first user to obtain feedback of the first user.
6. The method of claim 5, wherein, in the case where i ≠ N and the feedback of the first user is not clicking on one of the push objects, the method comprises successive i rounds of pushing for the first user, the method further comprising optimizing the model by a policy gradient algorithm based on sets of data corresponding respectively to a plurality of push objects in the i rounds of pushing, wherein a set of data corresponding to a second push object in a j-th round of pushing comprises: state information corresponding to the j-th round of pushing, an identification of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user on the second push object.
7. The method of claim 5, wherein i = N, the method comprising successive N-round pushes for a first user, the method further comprising, after obtaining feedback for the first user, optimizing the model by a policy gradient algorithm based on multiple sets of data corresponding to multiple push objects in the N-round pushes, wherein a set of data corresponding to a second push object in a j-th round of the N-round pushes comprises: status information corresponding to the j-th round of pushing, an identification of a second pushing object, and a return value corresponding to the second pushing object, wherein the return value is obtained based on feedback of the first user on the second pushing object.
8. The method of claim 7, wherein a push object corresponding to an nth round of push is an inquiry question, the return value taking a positive value if the first user clicks on the second push object, and zero if the first user does not click on the second push object.
9. The method of claim 8, the reward value taking a first value if j = N and the first user clicks on the second push object, the reward value taking a second value if j ≠ N and the first user clicks on the second push object, wherein the first value is greater than the second value.
10. An apparatus for pushing objects to a user based on a reinforcement learning model, the apparatus comprising at most N pushing modules deployed consecutively for a first user, wherein each pushing module has a corresponding set of predetermined candidate objects, each pushing module starting from a second pushing module begins deployment after the first user clicks on an object pushed by a previous pushing module, and each candidate object set of pushing modules starting from the second pushing module comprises a respective plurality of sub-categories of a plurality of candidate objects of the previous pushing module, wherein an ith pushing module of the at most N pushing modules comprises the following elements:
the device comprises an acquisition unit, a storage unit and a storage unit, wherein the acquisition unit is configured to acquire ith state information, the ith state information comprises static characteristics and dynamic characteristics, the static characteristics comprise the existing characteristics of a first user before the device is deployed, and the dynamic characteristics comprise the identification of each object clicked by the first user for a previous i-1 push module; and
a determining unit configured to input the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines the identities of each of a predetermined number of push objects of the i-th push module from a candidate object set of the i-th push module.
11. The apparatus of claim 10, wherein the determining unit comprises a component disposed in the reinforcement learning model: a calculating subunit configured to calculate, based on the i-th state information and the object identification of each candidate object in the candidate object set of the i-th push module, a push probability of each candidate object of the i-th push module, and a determining subunit configured to determine, based on each push probability, a predetermined number of push objects of the i-th push module.
12. The apparatus of claim 11, wherein the first user clicks on a first pushed object pushed by an i-1 th push module for that module, wherein the determination subunit is further configured to determine a first candidate object belonging to a subclass of the first pushed object among each candidate object of the i-th push module, and determine a predetermined number of pushed objects of the i-th push module based on a probability of pushing each first candidate object.
13. The apparatus of claim 10, wherein the first user clicks on a first pushed object pushed by an i-1 th push module for that module, wherein the determination unit is further configured to cause the reinforcement learning model to determine, from a subset of a candidate set of objects for the i-th push module, an identity of each of a predetermined number of pushed objects for the i-th push module, wherein the subset includes a plurality of sub-categories of the first pushed object.
14. The apparatus of claim 11, wherein the i-th push module further comprises a pushing unit configured to, after the push objects of the i-th push module are determined, push the push objects to the first user to obtain feedback from the first user.
15. The apparatus of claim 14, wherein i ≠ N, and in the case that the feedback of the first user is not clicking on any of the push objects, the apparatus comprises i consecutive push modules for the first user, the apparatus further comprising an optimization module configured to optimize the model by a policy gradient algorithm based on sets of data respectively corresponding to a plurality of push objects of the i push modules, wherein the set of data corresponding to a second push object of the j-th push module comprises: the state information corresponding to the j-th push module, an identification of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user on the second push object.
16. The apparatus of claim 14, wherein i = N, and the apparatus comprises N consecutive push modules for the first user, the apparatus further comprising an optimization module configured to, after the feedback of the first user is obtained, optimize the model by a policy gradient algorithm based on sets of data respectively corresponding to a plurality of push objects of the N push modules, wherein the set of data corresponding to a second push object of the j-th push module of the N push modules comprises: the state information corresponding to the j-th push module, an identification of the second push object, and a return value corresponding to the second push object, the return value being obtained based on feedback of the first user on the second push object.
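For the optimization module described in the two preceding claims, a policy gradient algorithm such as REINFORCE can update the policy from per-module tuples of (state information, pushed-object identification, return value). The linear-softmax policy below is only a sketch of that idea under stated assumptions; the claims do not specify the parameterization, learning rate, or network architecture, and all names here are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(theta, episodes, lr=0.01):
    """theta: (state_dim, num_objects) logit weights of a linear policy.
    episodes: list of (state_vector, action_index, return_value) tuples,
    one tuple per push module in which an object was pushed."""
    grad = np.zeros_like(theta)
    for state, action, ret in episodes:
        probs = softmax(state @ theta)          # push probabilities over all objects
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        # gradient of log pi(action | state) for a linear-softmax policy, scaled by the return
        grad += np.outer(state, one_hot - probs) * ret
    return theta + lr * grad / max(len(episodes), 1)
```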
17. The apparatus of claim 16, wherein a push object corresponding to the Nth push module is an inquiry question, the return value taking a positive value if the first user clicks on the second push object, and zero if the first user does not click on the second push object.
18. The apparatus of claim 17, wherein the return value takes a first value if j = N and the first user clicks on the second push object, and takes a second value if j ≠ N and the first user clicks on the second push object, wherein the first value is greater than the second value.
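The return-value rule of the two preceding claims can be summarized as follows: zero when the user does not click, a larger value for a click on the inquiry question of the final (Nth) push module than for a click in an earlier module. The numeric values in this sketch are placeholders chosen only to satisfy that ordering and are not taken from the patent.

```python
def return_value(clicked, j, N, final_round_value=1.0, earlier_round_value=0.5):
    """Assign the return for the object pushed by the j-th push module."""
    if not clicked:
        return 0.0
    return final_round_value if j == N else earlier_round_value

# The ordering required by the claims: click on round N > click on an earlier round > no click.
assert return_value(True, j=3, N=3) > return_value(True, j=1, N=3) > return_value(False, j=1, N=3)
```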
19. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-9.
20. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-9.
CN201910463434.8A 2019-05-30 2019-05-30 Method and device for pushing object to user based on reinforcement learning model Active CN110263136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910463434.8A CN110263136B (en) 2019-05-30 2019-05-30 Method and device for pushing object to user based on reinforcement learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910463434.8A CN110263136B (en) 2019-05-30 2019-05-30 Method and device for pushing object to user based on reinforcement learning model

Publications (2)

Publication Number Publication Date
CN110263136A CN110263136A (en) 2019-09-20
CN110263136B true CN110263136B (en) 2023-10-20

Family

ID=67916018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910463434.8A Active CN110263136B (en) 2019-05-30 2019-05-30 Method and device for pushing object to user based on reinforcement learning model

Country Status (1)

Country Link
CN (1) CN110263136B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704754B (en) * 2019-10-18 2023-03-28 支付宝(杭州)信息技术有限公司 Push model optimization method and device executed by user terminal
CN110766086B (en) * 2019-10-28 2022-07-22 支付宝(杭州)信息技术有限公司 Method and device for fusing multiple classification models based on reinforcement learning model
CN111046156B (en) * 2019-11-29 2023-10-13 支付宝(杭州)信息技术有限公司 Method, device and server for determining rewarding data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107369443B (en) * 2017-06-29 2020-09-25 北京百度网讯科技有限公司 Dialog management method and device based on artificial intelligence

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016057001A1 (en) * 2014-10-09 2016-04-14 Cloudradigm Pte. Ltd. A computer implemented method and system for automatically modelling a problem and orchestrating candidate algorithms to solve the problem
CN108230058A (en) * 2016-12-09 2018-06-29 阿里巴巴集团控股有限公司 Products Show method and system
CN108230057A (en) * 2016-12-09 2018-06-29 阿里巴巴集团控股有限公司 A kind of intelligent recommendation method and system
CN109003143A (en) * 2018-08-03 2018-12-14 阿里巴巴集团控股有限公司 Recommend using deeply study the method and device of marketing
CN109192300A (en) * 2018-08-17 2019-01-11 百度在线网络技术(北京)有限公司 Intelligent way of inquisition, system, computer equipment and storage medium
CN109451038A (en) * 2018-12-06 2019-03-08 北京达佳互联信息技术有限公司 A kind of information-pushing method, device, server and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Application of a Tag-Based Reinforcement Learning Recommendation Algorithm; Li Yiqun et al.; Application Research of Computers; 2010-08-15 (No. 08); full text *

Also Published As

Publication number Publication date
CN110263136A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
Dockner Differential games in economics and management science
US7328218B2 (en) Constrained tree structure method and system
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN112149824B (en) Method and device for updating recommendation model by game theory
CN111080225A (en) Automated evaluation of project acceleration
CN111062774B (en) Activity delivery method and device, electronic equipment and computer readable medium
CN109978575A (en) A kind of method and device excavated customer flow and manage scene
EP3961507A1 (en) Optimal policy learning and recommendation for distribution task using deep reinforcement learning model
CN111046156B (en) Method, device and server for determining rewarding data
CN110717537B (en) Method and device for training user classification model and executing user classification prediction
CN110705889A (en) Enterprise screening method, device, equipment and storage medium
CN115829722A (en) Training method of credit risk scoring model and credit risk scoring method
CN115600818A (en) Multi-dimensional scoring method and device, electronic equipment and storage medium
Dhaya et al. Fuzzy based quantitative evaluation of architectures using architectural knowledge
CN113191527A (en) Prediction method and device for population prediction based on prediction model
JP2021103382A (en) Automatic evaluation of project acceleration
CN110009159A (en) Financial Loan Demand prediction technique and system based on network big data
CN110288091A (en) Parametric learning method, device, terminal device and readable storage medium storing program for executing
CN110210959A (en) Analysis method, device and the storage medium of financial data
CN112232944B (en) Method and device for creating scoring card and electronic equipment
Brešić Knowledge acquisition in databases
CN109325811A (en) Value of house prediction technique, device, computer equipment and storage medium
CN114548523B (en) User viewing information prediction method, device, equipment and storage medium
CN114757763A (en) Object capability prediction method and device, storage medium and electronic device
CN117807452A (en) Ordering method, device, equipment and storage medium based on target matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant