CN110543596A

CN110543596A - Method and device for pushing object to user based on reinforcement learning model

Info

Publication number: CN110543596A
Application number: CN201910740883.2A
Authority: CN
Inventors: 陈岑; 傅驰林; 张晓露
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2019-12-06

Abstract

The embodiment of the specification provides a method for pushing an object to a user based on a reinforcement learning model, wherein the method comprises at most N continuous pushing rounds for a first user, and the ith pushing round in the at most N pushing rounds comprises the following steps: acquiring ith state information, wherein the ith state information comprises static features and dynamic features, the static features comprise existing features of a first user before the first user executes the method, and the dynamic features comprise identifiers of all clicked objects pushed by the first user for the previous i-1 round; inputting the ith state information into the reinforcement learning model and determining, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects.

Description

method and device for pushing object to user based on reinforcement learning model

Technical Field

The embodiment of the specification relates to the technical field of machine learning, in particular to a method and a device for pushing an object to a user based on a reinforcement learning model.

Background

Traditional customer service is labor/resource intensive and time consuming, and therefore it is important to build intelligent assistants that can automatically answer user-facing questions. Recently, there has been an increasing interest in how to better construct such intelligent assistants with machine learning. As a core function of the customer service robot, user question prediction aims to automatically predict questions that a user may want to ask, and present candidate questions to the user for selection thereof to reduce the cognitive burden of the user. The existing method for identifying intentions (problem recommendation) is to directly display a recommendation problem list to a user according to the characteristics of the user and the historical behavior track of the user. However, when a user encounters a problem, the user cannot directly think of an accurate problem description, but can think of a series of concepts. For example, when the user fails to use the flower over payment, the concepts of 'payment', 'flower', 'failure', etc. may be conceived. For such a scenario, it is desirable to better guide and guess the user's intent based on a series of concepts that the user thinks.

Therefore, a more efficient solution for pushing questions to the user is needed.

disclosure of Invention

The embodiment of the specification aims to provide a more effective scheme for pushing an object to a user so as to solve the defects in the prior art.

To achieve the above object, one aspect of the present specification provides a method for pushing objects to a user based on a reinforcement learning model, the method including at most N consecutive rounds of pushing for a first user, where each round of pushing has the same set of candidate objects, the set of candidate objects includes a plurality of first candidate objects and a plurality of second candidate objects, the plurality of second candidate objects are keypoints obtained from the plurality of first candidate objects, and an ith round of pushing in the at most N rounds of pushing includes the following steps:

Acquiring ith state information, wherein the ith state information comprises static features and dynamic features, the static features comprise existing features of a first user before the first user executes the method, and the dynamic features comprise identifiers of all clicked objects pushed by the first user for the previous i-1 round;

Inputting the ith state information into the reinforcement learning model; and

A first predetermined number of first push objects is determined from the plurality of first candidate objects by a reinforcement learning model and a second predetermined number of second push objects is determined from the plurality of second candidate objects.

In one embodiment, determining a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects by the reinforcement learning model includes calculating, by the reinforcement learning model, a push probability of each first candidate object based on the ith state information and an object identification of each first candidate object and determining a first predetermined number of first push objects based on each push probability, and calculating a push probability of each second candidate object based on the ith state information and an object identification of each second candidate object and determining a second predetermined number of second push objects based on each push probability.

In one embodiment, the method further comprises, after determining, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects, pushing the first predetermined number of first push objects and the second predetermined number of second push objects, respectively, to the first user in a push page.

In one embodiment, the first candidate object is a question and the second candidate objects are keywords obtained from the question.

in one embodiment, the plurality of second push objects includes a primary push object that is highlighted.

In one embodiment, in each push round starting from the second push round, the main push object is the second push object clicked by the first user in the previous push round.

in one embodiment, the method further comprises, after pushing the first predetermined number of first push objects and the second predetermined number of second push objects to the first user in a push page, respectively, determining, by the reinforcement learning model, a third predetermined number of third push objects from a plurality of second candidate objects excluding the second predetermined number of second push objects, based on a predetermined indication of the first user, to replace other second push objects in the push page except the main push object.

In one embodiment, in a case that the feedback of the first user to the push is exiting from the push page, the at most N rounds of push are ended, the method further includes optimizing the model by a policy gradient algorithm based on a plurality of sets of data respectively corresponding to a plurality of push objects in the i round of push, where the set of data corresponding to one push object in the j-th round of push includes: the information processing method comprises the steps of state information corresponding to the jth round of pushing, identification of a pushed object and a return value corresponding to the pushed object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user to the pushed object.

In one embodiment, in a case that the push object is a first push object and the first user clicks the first push object, the reward value takes a first value, and in a case that the push object is a second push object and the first user clicks the second push object, the reward value takes a second value, where the first value is greater than the second value.

In one embodiment, in a case that the feedback is clicking any one of the first push objects or any one of the second push objects, the method enters an i +1 th round of pushing.

In one embodiment, in the case that the feedback is clicking any one of the second push objects, the method enters the (i + 1) th round of pushing.

Another aspect of the present specification provides an apparatus for pushing objects to a user based on a reinforcement learning model, the apparatus including at most N consecutive push modules for a first user, where each push module has a same candidate object set, the candidate object set includes a plurality of first candidate objects and a plurality of second candidate objects, the plurality of second candidate objects are keypoints obtained from the plurality of first candidate objects, and an ith push module of the at most N push modules includes the following units:

The obtaining unit is configured to obtain ith state information, wherein the ith state information comprises static features and dynamic features, the static features comprise existing features of a first user before the first user executes the method, and the dynamic features comprise identifiers of objects clicked by the first user in the previous i-1 rounds of pushing;

An input unit configured to input the ith state information into the reinforcement learning model; and

A first determining unit configured to determine, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects.

In one embodiment, the first determination unit is further configured to, by the reinforcement learning model: calculating a push probability of each first candidate object based on the ith state information and the object identification of each first candidate object, and determining a first predetermined number of first push objects based on each push probability, and calculating a push probability of each second candidate object based on the ith state information and the object identification of each second candidate object, and determining a second predetermined number of second push objects based on each push probability.

In one embodiment, the apparatus further includes a pushing unit configured to, after determining, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects, push the first predetermined number of first push objects and the second predetermined number of second push objects, respectively, to the first user in a push page.

in one embodiment, the apparatus further includes a second determining unit configured to, after the first predetermined number of first push objects and the second predetermined number of second push objects are pushed to the first user in a push page, respectively, determine, based on a predetermined indication of the first user, a third predetermined number of third push objects from a plurality of second candidate objects excluding the second predetermined number of second push objects, through the reinforcement learning model, to replace other second push objects in the push page except the main push object.

in one embodiment, in a case that the feedback of the first user to the push is exiting from the push page, the at most N rounds of push are ended, the apparatus further includes an optimization module configured to optimize the model by a policy gradient algorithm based on a plurality of sets of data respectively corresponding to a plurality of push objects in the i round of push, where the set of data corresponding to one push object in the jth round of push includes: the information processing method comprises the steps of state information corresponding to the jth round of pushing, identification of a pushed object and a return value corresponding to the pushed object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user to the pushed object.

in one embodiment, in a case that the feedback is a click on any one of the first push objects or any one of the second push objects, the apparatus starts to deploy an i +1 th push module.

In one embodiment, in a case that the feedback is clicking any one of the second push objects, the apparatus starts to deploy the (i + 1) th push module.

Another aspect of the present specification provides a computer readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform any one of the above methods.

Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor implements any one of the above methods when executing the executable code.

In the scheme of pushing the object according to the embodiment of the specification, a novel interactive problem recommendation process is provided, the user is guided step by step, the whole multi-round recommended state skip process is modeled through reinforcement learning, and dynamic click information of the user is considered in the model, so that the prediction accuracy rate is improved.

Drawings

the embodiments of the present specification may be made more clear by describing the embodiments with reference to the attached drawings:

FIG. 1 is a diagram illustrating a process for pushing an object to a user according to an embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating a method for pushing objects to a user based on a reinforcement learning model according to an embodiment of the present disclosure;

FIG. 3 illustrates a push page schematic;

Fig. 4 illustrates an apparatus 400 for pushing objects to a user based on a reinforcement learning model according to the present description.

Detailed Description

the embodiments of the present specification will be described below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram illustrating a process of pushing an object to a user according to an embodiment of the present specification. The figure shows the process of three consecutive decisions for a user 12, i.e. an environment, by a reinforcement learning model 11, i.e. an agent, to push three times each. The reinforcement learning model is used, for example, in intelligent customer service for predicting a question that a user wants to ask. In the embodiment of the specification, a mode that a simulation human can think of a series of labels and keywords corresponding to a problem when encountering the problem is designed, and an interactive problem recommendation mode is designed, so that a traditional problem recommendation mode is converted into interactive problem recommendation with user participation. And dynamically recommending a label and the like of a desired question to the user layer by layer according to the feedback of the user in the interactive process, and guiding to obtain the final question recommendation.

Specifically, when the user opens the push page, the first decision is started. In the first decision, an initial state s1 is input to the model 11 based on the current state of the user 12, the state s1 including static features (as indicated by the white box in the ellipse labeled s 1) and dynamic features (at this time, the dynamic features are empty and therefore not shown), the static features being the current features of the user, including attribute features, historical behavior features, etc. that the user had before the round, the dynamic features being the problem that the user had clicked before the round of pushing in the round (epamode), here, the dynamic features are empty because of the first decision. After input s1 to the model 11, the model 11 calculates the probability of each candidate object by a policy gradient algorithm. In the embodiment of the present specification, the candidate object set includes a first class of candidate objects, for example, a question, and a second class of candidate objects, for example, a keyword related to the question. The model may compute probabilities for each question and for each keyword, such that m (a first predetermined number) of push questions (e.g., identified as a11, a12, a13, respectively) and n (a second predetermined number) of push keywords (e.g., identified as b11, b12, respectively) may be determined based on the probabilities. Thus, the determined questions and keywords may be presented (i.e., pushed) to the user based on the output of the model. After the above-mentioned exhibition is performed to the user, the user may perform feedback on the exhibition, for example, the user may click one of the keywords, or may click one question, or may click "back" to exit, and the return values corresponding to the questions and keywords in the round of pushing may be obtained based on the feedback (ra11, ra12, ra13, rb11, rb 12). After the user clicks on a keyword or a question (e.g., b11), the model 11 begins the second decision making process. Specifically, the second state s2 is input to the model 11 based on the current state of the user, and the second state s2 also includes a static feature (as indicated by a white box in the ellipse labeled s2 in the figure) which is the same as the static feature of the state s1, and a dynamic feature (as indicated by a gray box in the ellipse labeled s2 in the figure) which includes an identification of an object that the user has clicked on, such as "b 11" or a numerical identification corresponding to "b 11", and the like. After state s2 is input into model 11, model 11 likewise outputs, based on state s2, the identifications of the m push questions (a21, a22, a23) and the identifications of the n push keywords (b21, b22) for the second round of pushes corresponding to the second decision. Likewise, after a second push round, the user's reward value may be similarly obtained, and the state of the third decision s3 may be obtained accordingly. By inputting the state s3 into the model 11, m questions (not shown in the figure) and n keywords (not shown in the figure) of the third round of pushing may be output, in this embodiment, the user may click on the "back" button in the push page in the third round of pushing, so that the round may be ended, i.e., the round includes three rounds of pushing, and the value of the user's return for the third round of pushing may be similarly obtained. It is understood that three rounds of pushing are merely illustrative, and that the user may perform multiple rounds of pushing according to his or her own needs, and in one embodiment, the number of times of pushing may be preset to be at most N times. After pushing of one round of the user is performed as described above, model optimization can be performed based on the above data in each round of pushing in the round, so that prediction accuracy of the model is improved.

It is to be understood that the above description of fig. 1 is only illustrative and not restrictive, for example, the first kind of candidate object is not limited to query questions for users, but may be other push objects, such as commodities, movie reviews, etc., and correspondingly, the second kind of candidate object is not limited to keywords of questions, but may be tags, various attributes, etc. of the first kind of candidate object; in addition, the model is not limited to reinforcement learning by a policy gradient algorithm or the like.

The above-described pushing process is described in detail below.

Fig. 2 shows a flowchart of a method for pushing objects to a user based on a reinforcement learning model according to an embodiment of the present specification, where the method includes at most N consecutive rounds of pushing for a first user, where each round of pushing has a same set of candidate objects, and the set of candidate objects includes a plurality of first candidate objects and a plurality of second candidate objects, where the plurality of second candidate objects are keypoints obtained from the plurality of first candidate objects, and an ith round of pushing in the at most N rounds of pushing includes the following steps:

Step S202, obtaining ith state information, wherein the ith state information comprises static features and dynamic features, the static features comprise existing features of a first user before the first user executes the method, and the dynamic features comprise identifiers of all clicked objects pushed by the first user for the previous i-1 round;

Step S204, inputting the ith state information into the reinforcement learning model; and

Step S206, determining a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects by the reinforcement learning model.

The at most N push cycles are one round of reinforcement learning (episode). After a user (e.g., a first user) opens a push page, the reinforcement learning model according to the embodiments of the present specification starts outputting a prediction result based on the existing state of the first user, so that the first user can be pushed for a first time based on the prediction result. When the first user clicks any one of the objects for the first time of pushing, the state of the first user changes, and the model performs second output for the changed state of the first user, so that the first user can be pushed for the second time. In the case that the first user continuously clicks for each push, multiple pushes may be continuously performed, and an upper limit of the multiple pushes may be a preset larger number N. When the user exits the push page, the push round ends.

The reinforcement learning model is, for example, a model based on a policy gradient algorithm, a candidate object set is preset in the model, and in each pushing, the model calculates the probability of each candidate object based on the input user state(s), so that the object to be pushed is based on the probability of each candidate object. Unlike the prior art, in the embodiment of the present specification, the first class candidate object and the second class candidate object are included in the candidate object set, the model may determine m first push objects to be pushed based on the probability of each first class candidate object, and determine n second push objects to be pushed based on the probability of each second class candidate object, where m and n are preset numbers. Wherein each candidate object of the second class is a feature which the at least one candidate object has in common. For example, the first category candidate may include a plurality of questions collected by the platform and frequently asked by the user, the second category candidate may be a higher-frequency keyword extracted from the plurality of questions, or the second category candidate may include respective types to which the plurality of questions belong, or the second category candidate may include respective tags to which the plurality of questions belong, and so on. Therefore, when pushing, m questions and n keywords can be pushed simultaneously on the push page, and usually, the user first clicks on the matched keywords based on the keywords thought by the user. After the user clicks the keyword, the dynamic characteristics of the user include the identifier of the clicked keyword, and the model is learned in advance based on the feedback of other users, so that each question related to the keyword has a high probability when the probability of each question is determined, and thus at least one question corresponding to the keyword can be output, the probability of the user clicking the question is improved, and the total profit of the model is improved.

The process involved in each of the at most N rounds of pushing is substantially the same, and the specific process of the ith round of pushing in the at most N rounds of pushing will be described in detail below.

First, in step S202, an ith status information is obtained, where the ith status information includes a static feature and a dynamic feature, where the static feature includes an existing feature of a first user before executing the method, and the dynamic feature includes an identifier of each object clicked by the first user for the previous i-1 rounds of pushing.

The ith state information is also the ith state si input in the ith model prediction of the round, for example, in the form of a feature vector including a plurality of elements. Wherein elements of predetermined multiple dimensions in the state si correspond to static features of the user, i.e. existing features before the round is performed, such as attribute features, portrait features, historical behavior features, etc. of the user, so that the static features in the respective states respectively corresponding to the model predictions in one round are the same. Wherein the elements of the predetermined plurality of dimensions in the state si correspond to dynamic characteristics of the user, the dynamic characteristics being identifications of objects that have been clicked by the first user in respective pushes prior to the round of pushing of the round. For example, referring to the description of fig. 1 above, in the first push round, since the first user has not clicked any earlier, the dynamic feature in the input state s1 can be represented as [0,0, …,0], for example, where the dimension of the dynamic feature is N-1 dimension; in the second round of pushing, for example, the first user clicks the keyword corresponding to "b 11" after the first round of pushing, so that "b 11" is included in the dynamic feature in the second state s2, for example, the dynamic feature in the state s2 may be represented as [ b11,0, …,0 ]; in the third round of pushing, for example, the first user clicked on a corresponding question, e.g., "a 21", after the second round of pushing, and thus the entered dynamic features in the third state s3 include the identifications "b 11" and "a 21" of the clicked keywords and questions for the first round of pushing and the second round of pushing, e.g., the dynamic features in state s3 may be represented as [ b11, a21, …,0 ].

In step S204, inputting the ith state information into the reinforcement learning model; and determining a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects by the reinforcement learning model at step S206.

The reinforcement learning model is for example a model based on a policy gradient algorithm, in which case the model comprises policy functions pi (a | s, theta) with respect to a state s and an action a, where theta is a model parameter of the reinforcement learning model and pi (a | s, theta) is a probability of adopting the action a in the state s. In this embodiment, the candidate set includes a plurality of first candidates and a plurality of second candidates, after the state si is input to the model, a push probability of each first candidate aij of the ith decision is obtained in the model based on a policy function pi (a | s, θ), m first push objects of the round of pushing are determined based on the push probability of each first candidate, a push probability of each second candidate bik of the ith decision is obtained based on the policy function pi (a | s, θ), and n second push objects of the round of pushing are determined based on the push probability of each second candidate.

For example, 100 question questions are preset in the model as first candidate objects, 50 keywords are extracted from the 100 question questions as second candidate objects, and for example, 7 push keywords and 15 question questions are preset to be output. Thus, after the state of the user is input to the model, the model calculates the probability of each question and the probability of each keyword based on the state and the policy function in the model, sorts the 100 questions based on their respective probabilities, and takes the top 15 questions as the questions to be pushed, sorts the top 50 keywords based on their respective probabilities, and takes the top 7 keywords as the keywords to be pushed.

It is understood that the reinforcement learning model is not limited to the use of a policy gradient algorithm, but other algorithms, such as a Q learning algorithm, an action-critique algorithm (action-critic), etc., may be used, and are not described in detail herein.

After determining the push object of the round of pushing as described above, pushing the push object to the first user. For example, in a push page of a smart customer service scenario, the first predetermined number of first push objects and the second predetermined number of second push objects may be pushed to the first user in two portions, respectively. FIG. 3 shows a push page schematic. As shown in fig. 3, in a push page displayed by, for example, a mobile phone screen, a keyword and an inquiry question are pushed to a user in upper and lower two sections, respectively. Here, as shown in the figure, the upper space is large, and 7 keywords are surrounded by seven bubbles (bubbles are schematically shown as circles in the figure), respectively. Thus, by displaying the keywords in such a conspicuous manner, the user can see each of the pushed keywords at a glance after opening the push page, and can immediately determine whether there is a keyword that is desired to be clicked among the pushed keywords. It is understood that the bubbles in the figures are merely illustrative and that in embodiments of the present description, various graphics may be used for highlighting keywords, such as love hearts, boxes, oval boxes, round boxes, arrows, and the like. In the embodiment of the present specification, the display is not limited to the manner in which the keyword is surrounded by the graphic, and for example, if the graphic is an arrow, the keyword may be placed at a position pointed by the arrow to highlight the keyword.

In one embodiment, as shown in FIG. 3, the main push object may be highlighted by a different color (e.g., a shaded bubble in the figure). It is to be understood that in this embodiment, the main push object is not limited to be highlighted by different colors, and may be highlighted by different shapes, different sizes, or the like. For example, when a user opens a push page, the model may determine a keyword with the highest probability as a main push object, and in a case where the user clicks a certain keyword in the previous push, the model may determine the clicked keyword as the main push object of the push.

In one embodiment, a predetermined button (e.g., the "spit bubble" button in FIG. 3) may also be provided in the push page and the user may be prompted to click on the button in the page. The button may be set such that, in a case where neither a keyword that the user wishes to click nor a question that the user wishes to click is displayed in the page other than the main push object, the keyword in the page other than the main push object may be replaced by clicking the button. For example, as shown in FIG. 3, after clicking the "spit bubble" button, the user will replace the keywords other than "Account". Specifically, for example, as described above, 50 candidate keywords are preset in the model, so that after the user clicks the "spit bubble" button, the model determines that six keywords ranked at 8-13 are used to replace the six keywords "modify", "turn wrong", "deduct money", "how", "query", "pay for money" in the graph. By doing so, the user can be made to click on the keyword sequence he/she wishes to click on in successive steps, so that the question presented in the lower part of the page can be simultaneously related to the keyword sequence that the user clicks on, and the user's intention can be captured more accurately.

As also shown in fig. 3, a predetermined number of first push objects, for example, 15 question questions, are displayed in the lower part of the display page. As shown in the drawing, the lower portion of the display page may be set to have a small space, and thus, the 15 question questions may be scrollably displayed in the lower portion. Wherein, the problem with higher probability can be arranged in the front, and the problem with lower probability can be arranged in the back. Additionally, in the display page, the user may be prompted to scroll through the list of display questions.

After being displayed through, for example, the display page, feedback of the first user, that is, a click condition of the first user on the first user may be obtained, and a corresponding reward value may be obtained based on the feedback of the first user. For example, it may be preset that when a first user clicks a first push object aij (for example, ask a question) for an ith round of pushing, a return value rij corresponding to the first push object aij in the round of pushing may be recorded as 1, and if the first user does not click the first push object aij, the return value rij corresponding to the first push object aij may be recorded as 0. When the first user clicks a second push object bik (e.g., a keyword) for the ith round of push, the reward value rik corresponding to the push object in the round of push may be recorded as 0.5, and if the first user does not click the first push object aik, the reward value rik corresponding to the object may be recorded as 0. It will be appreciated that the above-described reward values rij and rik in the click scenario are not limited to 1 and 0.5, but that appropriate reward value values may be determined based on the model gains of the respective values by performing multiple tests at different values in the training. It will be appreciated that assuming the reward values rij and rik in the case of a click take a first value and a second value respectively, in the case where the final purpose of the push is to expect the user to click on a question, then the first value should be greater than the second value.

for example, after the user clicks "pay treasure" for the first push round as shown in fig. 3, the "pay treasure" is identified as "b 11", for example. It follows that rb11 is 0.5 and the reward values corresponding to other keywords and question questions in the pushed page are all 0. For example, after the user has clicked the first question in the lower part of the page for the second push round shown in fig. 3, the identification of this question is "a 21", for example, so that it can be concluded that ra21 is 1, and the reward values corresponding to other keywords and question questions in the pushed page are all 0.

In one embodiment, it may be preset in the model that, in the case that the feedback is clicking any one of the first push objects or any one of the second push objects, the method enters the next push round. In one embodiment, it may be preset in the model that, in case the feedback is a click on any of the second push objects, the method enters the next push round. That is, in this embodiment, when a push page as shown in fig. 3 is pushed to a user, the model is triggered to determine the push object of the next round only when the user clicks a keyword in the upper part of the figure, and the model is not triggered to run when the user clicks an inquiry question in the lower part of the figure.

In one embodiment, as shown in fig. 3, a "back" button is further provided in the upper left corner of the push page, and when the user clicks the "back" button, the push page is exited, so that the push round is ended. After one round of the model is over, the model may be trained with the input-output data and the feedback data in that round. For example, in one case, the round includes three pushes to the first user, that is, the first user clicks on actions in both the first and second pushes, and clicks "back" on the exit push page in the third push. In this case, at least two training sessions of the model may be performed. Specifically, assuming that the first user clicks the push object identified as b11 after the first round of push, clicks the push object identified as a21 after the second round of push, and does not click the push object after the third round of push, at least two sets of training data (s1, b11, rb11) and (s2, a21, ra21) may be obtained, where rb11 is 0.5 and ra21 is 1 as described above. Based on each of the two sets of training data, according to the policy gradient algorithm, model parameter update can be performed by the following formula (1):

where the expected value is indicated. For example, when a model is trained using (s1, b11, rb11), the calculation in formula (1) can be performed by the following formula (2)

when the model is trained using (s2, a21, ra21), it can be calculated based on the reported value ra21 similarly to the above

In this case, in addition to the two sets of training data, training data may be acquired based on a push target that is not clicked by the user in each push. For example, for the second push object b12 that was not clicked in the first round of pushing, a set of training data (s1, b12, rb12) may be acquired, in which case, since the user did not click on the push object, rb12 is 0, and accordingly, with reference to equation (2), γ ra21 may be obtained.

Fig. 4 shows an apparatus 400 for pushing objects to a user based on a reinforcement learning model according to the present specification, the apparatus including at most N consecutive push modules for a first user, wherein each push module has the same candidate object set, the candidate object set includes a plurality of first candidate objects and a plurality of second candidate objects, the plurality of second candidate objects are keypoints obtained from the plurality of first candidate objects, and an ith push module 41 of the at most N push modules includes the following units:

An obtaining unit 411, configured to obtain ith status information, where the ith status information includes a static feature and a dynamic feature, where the static feature includes an existing feature of a first user before executing the method, and the dynamic feature includes an identifier of each object clicked by the first user in a previous i-1 round of pushing;

An input unit 412 configured to input the ith state information into the reinforcement learning model; and

a first determining unit 413 configured to determine a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects by the reinforcement learning model.

In one embodiment, the first determining unit 413 is further configured to, by the reinforcement learning model: calculating a push probability of each first candidate object based on the ith state information and the object identification of each first candidate object, and determining a first predetermined number of first push objects based on each push probability, and calculating a push probability of each second candidate object based on the ith state information and the object identification of each second candidate object, and determining a second predetermined number of second push objects based on each push probability.

In one embodiment, the apparatus 400 further includes a pushing unit 414 configured to, after determining a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects through the reinforcement learning model, push the first predetermined number of first push objects and the second predetermined number of second push objects to the first user in a push page, respectively.

In one embodiment, the apparatus further includes a second determining unit 415 configured to, after the first predetermined number of first push objects and the second predetermined number of second push objects are pushed to the first user in a push page, respectively, determine, by the reinforcement learning model, a third predetermined number of third push objects from a plurality of second candidate objects excluding the second predetermined number of second push objects, based on a predetermined indication of the first user, to replace other second push objects in the push page except the main push object.

In an embodiment, in a case that the feedback of the first user to the push is exiting from the push page, the at most N rounds of push are ended, the apparatus further includes an optimization module 42 configured to optimize the model through a policy gradient algorithm based on multiple sets of data respectively corresponding to multiple push objects in the i round of push, where the set of data corresponding to one push object in the jth round of push includes: the information processing method comprises the steps of state information corresponding to the jth round of pushing, identification of a pushed object and a return value corresponding to the pushed object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user to the pushed object.

it is to be understood that the terms "first," "second," and the like, herein are used for descriptive purposes only and not for purposes of limitation, to distinguish between similar concepts.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

It will be further appreciated by those of ordinary skill in the art that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application of the solution and design constraints. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for pushing objects to a user based on a reinforcement learning model, the method comprising at most N consecutive rounds of pushing for a first user, wherein each round of pushing has the same set of candidate objects, the set of candidate objects comprises a plurality of first candidate objects and a plurality of second candidate objects, wherein the plurality of second candidate objects are keypoints obtained from the plurality of first candidate objects, and an ith round of pushing in the at most N rounds of pushing comprises the following steps:

inputting the ith state information into the reinforcement learning model; and

Determining, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects.

2. The method of claim 1, wherein determining, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects comprises calculating, by the reinforcement learning model, a push probability for each first candidate object based on the ith state information and an object identification of the each first candidate object and determining the first predetermined number of first push objects based on the each push probability, and calculating a push probability for each second candidate object based on the ith state information and an object identification of the each second candidate object and determining the second predetermined number of second push objects based on the each push probability.

3. The method of claim 1, wherein the method further comprises, after determining, by the reinforcement learning model, a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects, pushing the first predetermined number of first push objects and the second predetermined number of second push objects, respectively, to the first user in a push page.

4. The method of claim 3, wherein the first candidate object is a question and the second candidate objects are keywords obtained from the questions.

5. The method of claim 3, wherein the plurality of second push objects includes a primary push object that is highlighted.

6. The method of claim 5, wherein in each push round starting with a second push round, the primary push object is the second push object that the first user clicked on in the last push round.

7. The method of claim 5, further comprising, after pushing the first predetermined number of first push objects and the second predetermined number of second push objects to the first user, respectively, in a push page, determining, by the reinforcement learning model, a third predetermined number of third push objects from a plurality of second candidate objects excluding the second predetermined number of second push objects to replace other second push objects in the push page other than the primary push object based on a predetermined indication of the first user.

8. The method of claim 3, wherein in a case that the feedback of the first user to the push is to exit a push page, the at most N rounds of push are ended, the method further comprising optimizing the model by a policy gradient algorithm based on a plurality of sets of data respectively corresponding to a plurality of push objects in the i round of push, wherein the set of data corresponding to one push object in the j-th round of push comprises: the information processing method comprises the steps of state information corresponding to the jth round of pushing, identification of a pushed object and a return value corresponding to the pushed object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user to the pushed object.

9. the method of claim 8, wherein the reward value takes a first value in a case where the push object is a first push object and the first user clicks on the first push object, and a second value in a case where the push object is a second push object and the first user clicks on the second push object, wherein the first value is greater than the second value.

10. The method of claim 8, wherein in a case that the feedback is a click on any of the first push objects or any of the second push objects, the method enters an i +1 th push.

11. The method of claim 8, wherein in the case that the feedback is a click on any of the second push objects, the method enters an i +1 th push.

12. An apparatus for pushing objects to a user based on a reinforcement learning model, the apparatus comprising at most N consecutive push modules for a first user, wherein each push module has a same set of candidate objects, the set of candidate objects comprises a plurality of first candidate objects and a plurality of second candidate objects, the plurality of second candidate objects are keypoints obtained from the plurality of first candidate objects, an ith push module of the at most N push modules comprises the following units:

13. The apparatus of claim 12, wherein the first determining unit is further configured to, by the reinforcement learning model: calculating a push probability of each first candidate object based on the ith state information and the object identification of each first candidate object, and determining a first predetermined number of first push objects based on each push probability, and calculating a push probability of each second candidate object based on the ith state information and the object identification of each second candidate object, and determining a second predetermined number of second push objects based on each push probability.

14. The apparatus of claim 12, further comprising a pushing unit configured to, after determining a first predetermined number of first push objects from the plurality of first candidate objects and a second predetermined number of second push objects from the plurality of second candidate objects by the reinforcement learning model, push the first predetermined number of first push objects and the second predetermined number of second push objects, respectively, to the first user in a push page.

15. the apparatus of claim 14, wherein the first candidate object is a question and the plurality of second candidate objects are a plurality of keywords obtained from the plurality of questions.

16. the apparatus of claim 14, wherein the plurality of second push objects includes a primary push object that is highlighted.

17. The apparatus of claim 16, wherein in each push round starting with a second push round, the primary push object is the second push object that the first user clicked on in the last push round.

18. the apparatus of claim 16, further comprising a second determining unit configured to, after pushing the first predetermined number of first push objects and the second predetermined number of second push objects to the first user in a push page, respectively, determine, based on a predetermined indication of the first user, a third predetermined number of third push objects from a plurality of second candidate objects excluding the second predetermined number of second push objects, by the reinforcement learning model, to replace other second push objects in the push page except the main push object.

19. The apparatus of claim 14, wherein in a case that the feedback of the first user to the push is to exit a push page, the at most N rounds of push are ended, the apparatus further comprising an optimization module configured to optimize the model by a policy gradient algorithm based on a plurality of sets of data respectively corresponding to a plurality of push objects in the i round of push, wherein the set of data corresponding to one push object in the jth round of push comprises: the information processing method comprises the steps of state information corresponding to the jth round of pushing, identification of a pushed object and a return value corresponding to the pushed object, wherein j is any natural number from 1 to i, and the return value is obtained based on feedback of the first user to the pushed object.

20. the apparatus of claim 19, wherein the reward value takes a first value in a case where the push object is a first push object and the first user clicks on the first push object, and a second value in a case where the push object is a second push object and the first user clicks on the second push object, wherein the first value is greater than the second value.

21. The apparatus of claim 19, wherein in a case that the feedback is a click on any of the first push objects or any of the second push objects, the apparatus starts to deploy an i +1 th push module.

22. The apparatus of claim 19, wherein in case the feedback is a click on any of the second push objects, the apparatus starts deploying the (i + 1) th push module.

23. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.

24. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-11.