CN110263136A - Method and apparatus for pushing objects to a user based on a reinforcement learning model

Info

Publication number: CN110263136A
Application number: CN201910463434.8A
Authority: CN
Inventors: 陈岑; 胡旭; 傅驰林; 安蓉
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-09-20
Anticipated expiration: 2039-05-30
Also published as: CN110263136B

Abstract

Embodiments of this specification provide a method and apparatus for pushing objects to a user based on a reinforcement learning model. The method comprises at most N consecutive rounds of pushing for a first user, wherein each round has a corresponding predetermined candidate object set, each round from the second round onward starts after the first user clicks an object pushed in the previous round, and the candidate object set of each round from the second round onward comprises the respective subclasses of the candidate objects of the previous round. The i-th round of pushing comprises the following steps: obtaining i-th state information; and inputting the i-th state information into the reinforcement learning model to determine the respective identifiers of a predetermined number of push objects for the i-th round.

Description

Method and apparatus for pushing objects to a user based on a reinforcement learning model

Technical field

Embodiments of this specification relate to the field of machine learning, and more particularly to a method and apparatus for pushing objects to a user based on a reinforcement learning model.

Background art

Traditional customer service is labor- and resource-intensive and time-consuming, so it is important to build intelligent assistants that can automatically answer the questions users face. Recently, there has been growing interest in how to better build such intelligent assistants with machine learning. As a core function of a customer service robot, user question prediction aims to automatically predict the questions a user may want to ask and to present candidate questions for the user to choose from, which reduces the user's cognitive load. In essence, question prediction predicts the questions a user may raise based on the user's historical behavior, which helps users solve their problems, improves user satisfaction, and saves the labor cost of customer service. Existing question prediction methods are typically single-round question recommendation based on supervised learning, which guesses the question directly. However, in complex scenarios where the user's intent is ambiguous, the accuracy of such recommendations is generally low.

Therefore, a scheme for pushing questions to users more effectively is needed.

Summary of the invention

Embodiments of this specification aim to provide a scheme for pushing objects to users more effectively, so as to remedy deficiencies in the prior art.

To achieve the above object, one aspect of this specification provides a method for pushing objects to a user based on a reinforcement learning model. The method comprises at most N consecutive rounds of pushing for a first user, wherein each round has a corresponding predetermined candidate object set, each round from the second round onward starts after the first user clicks an object pushed in the previous round, and the candidate object set of each round from the second round onward comprises the respective subclasses of the candidate objects of the previous round. The i-th round of the at most N rounds comprises the following steps:

obtaining i-th state information, the i-th state information comprising static features and dynamic features, wherein the static features comprise features the first user already had before the method is carried out, and the dynamic features comprise the identifiers of the objects the first user has clicked in the preceding i-1 rounds; and

inputting the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th round, the respective identifiers of a predetermined number of push objects for the i-th round.

In one embodiment, causing the reinforcement learning model to determine, from the candidate object set of the i-th round, the respective identifiers of the predetermined number of push objects for the i-th round comprises causing the reinforcement learning model to: calculate a push probability for each candidate object of the i-th round based on the i-th state information and the object identifier of each candidate object in the candidate object set of the i-th round; and determine the predetermined number of push objects for the i-th round based on the push probabilities.

In one embodiment, the first user clicked a first push object in the (i-1)-th round, wherein determining the predetermined number of push objects for the i-th round based on the push probabilities comprises determining, among the candidate objects of the i-th round, first candidate objects that are subclasses of the first push object, and determining the predetermined number of push objects for the i-th round based on the push probabilities of the first candidate objects.

In one embodiment, the first user clicked a first push object in the (i-1)-th round, wherein causing the reinforcement learning model to determine, from the candidate object set of the i-th round, the respective identifiers of the predetermined number of push objects for the i-th round comprises causing the reinforcement learning model to determine those identifiers from a subset of the candidate object set of the i-th round, wherein the subset comprises the subclasses of the first push object.

In one embodiment, the i-th round further comprises, after the push objects of the i-th round are determined, pushing the push objects to the first user to obtain feedback from the first user.

In one embodiment, i ≠ N, and where the feedback of the first user is that none of the push objects is clicked, the method comprises i consecutive rounds of pushing for the first user, and the method further comprises optimizing the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the i rounds, wherein the group of data corresponding to a second push object in the j-th round comprises: the state information corresponding to the j-th round, the identifier of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on the first user's feedback on the second push object.

In one embodiment, i = N, the method comprises N consecutive rounds of pushing for the first user, and the method further comprises, after the feedback of the first user is obtained, optimizing the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the N rounds, wherein the group of data corresponding to a second push object in the j-th round of the N rounds comprises: the state information corresponding to the j-th round, the identifier of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.

In one embodiment, the push objects corresponding to the N-th round are query questions, and the return value takes a positive value where the first user clicks the second push object and is zero where the first user does not click the second push object.

In one embodiment, the return value takes a first value where j = N and the first user clicks the second push object, and takes a second value where j ≠ N and the first user clicks the second push object, wherein the first value is greater than the second value.

Another aspect of this specification provides an apparatus for pushing objects to a user based on a reinforcement learning model. The apparatus comprises at most N push modules deployed consecutively for a first user, wherein each push module has a corresponding predetermined candidate object set, each push module from the second push module onward is deployed after the first user clicks an object pushed by the previous push module, and the candidate object set of each push module from the second push module onward comprises the respective subclasses of the candidate objects of the previous push module. The i-th push module of the at most N push modules comprises the following units:

an acquiring unit, configured to obtain i-th state information, the i-th state information comprising static features and dynamic features, wherein the static features comprise features the first user already had before the apparatus is deployed, and the dynamic features comprise the identifiers of the objects the first user has clicked for the preceding i-1 push modules; and

a determining unit, configured to input the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th push module, the respective identifiers of a predetermined number of push objects of the i-th push module.

In one embodiment, the determining unit comprises the following subunits deployed in the reinforcement learning model: a calculating subunit, configured to calculate a push probability for each candidate object of the i-th push module based on the i-th state information and the object identifier of each candidate object in the candidate object set of the i-th push module; and a determining subunit, configured to determine the predetermined number of push objects of the i-th push module based on the push probabilities.

In one embodiment, the first user clicked, for the (i-1)-th push module, a first push object pushed by that module, wherein the determining subunit is further configured to determine, among the candidate objects of the i-th push module, first candidate objects that are subclasses of the first push object, and to determine the predetermined number of push objects of the i-th push module based on the push probabilities of the first candidate objects.

In one embodiment, the first user clicked, for the (i-1)-th push module, a first push object pushed by that module, wherein the determining unit is further configured to cause the reinforcement learning model to determine the respective identifiers of the predetermined number of push objects of the i-th push module from a subset of the candidate object set of the i-th push module, wherein the subset comprises the subclasses of the first push object.

In one embodiment, the i-th push module further comprises a push unit, configured to push the push objects to the first user after the push objects of the i-th push module are determined, so as to obtain feedback from the first user.

In one embodiment, i ≠ N, and where the feedback of the first user is that none of the push objects is clicked, the apparatus comprises i consecutive push modules for the first user, and the apparatus further comprises an optimization module, configured to optimize the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects of the i push modules, wherein the group of data corresponding to a second push object of the j-th push module comprises: the state information corresponding to the j-th push module, the identifier of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on the first user's feedback on the second push object.

In one embodiment, i = N, the apparatus comprises N consecutive push modules for the first user, and the apparatus further comprises an optimization module, configured to, after the feedback of the first user is obtained, optimize the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects of the N push modules, wherein the group of data corresponding to a second push object of the j-th push module of the N push modules comprises: the state information corresponding to the j-th push module, the identifier of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.

Another aspect of this specification provides a computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform any of the above methods.

Another aspect of this specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements any of the above methods.

The object-pushing scheme according to embodiments of this specification proposes a novel structured object-pushing process that guides the user step by step, models the state transitions of the whole multi-round pushing process with reinforcement learning, and takes the user's dynamic click information into account in the model, thereby improving prediction accuracy.

Brief description of the drawings

Embodiments of this specification will become clearer from the following description taken in conjunction with the accompanying drawings:

Fig. 1 is a schematic diagram of the process of pushing objects to a user according to an embodiment of this specification;

Fig. 2 shows a method for pushing objects to a user based on a reinforcement learning model according to an embodiment of this specification;

Fig. 3 schematically shows the push objects shown to the user in each of three rounds of pushing;

Fig. 4 shows an apparatus 400 for pushing objects to a user based on a reinforcement learning model according to an embodiment of this specification.

Detailed description

Embodiments of this specification are described below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of the process of pushing objects to a user according to an embodiment of this specification. The figure shows a reinforcement learning model 11 (the agent) making three successive decisions for a user 12 (the environment) so as to perform three pushes. The reinforcement learning model is used, for example, in intelligent customer service to predict the question a user wants to ask. In embodiments of this specification, the reinforcement learning model can be built by imitating the way a person who encounters a problem thinks it through in a structured, layered manner to arrive at the question they finally want to ask. That is, across the three decisions the question is predicted hierarchically: first the major class the question belongs to, then the group under that major class, and finally the question under that group.

Specifically, in the first decision, an initial state s₁ is input to model 11 based on the current state of user 12. State s₁ comprises static features (shown by the white box in the ellipse labeled s₁) and dynamic features (not shown). The static features are the user's current features, including the attribute features, historical behavior features and the like that the user had before the episode. The dynamic features are the items the user has clicked during the episode; because this is the first decision, they are empty here. After s₁ is input, model 11 calculates, based on the policy gradient algorithm, the probability of each candidate major class of the first round of pushing corresponding to this decision, and determines the major classes to push in this round based on those probabilities, for example (a₁₁, a₁₂, a₁₃). These pushed major classes are the actions output by model 11, and they can be shown (pushed) to the user based on the model output. For example, in Alipay intelligent customer service, the three major classes "flower", "borrow" and "Yuebao" are first shown to the user based on the output of model 11. After they are shown, the user can give feedback: the user may click one of the major classes or make no click at all, and the return values (r₁₁, r₁₂, r₁₃) corresponding to the respective actions in this round can be obtained from that feedback.

After the user clicks a major class, for example "flower", model 11 starts the second decision. Specifically, a second state s₂ is input to model 11 based on the current state of the user. State s₂ likewise comprises static features (shown by the white box in the ellipse labeled s₂) and dynamic features (shown by the grey box in the ellipse labeled s₂). The static features are identical to those of s₁, and the dynamic features include the identifier of the major class "flower" that the user has clicked, for example a₁₁. After s₂ is input, model 11 similarly outputs the three groups (a₂₁, a₂₂, a₂₃) of the second round of pushing corresponding to the second decision, for example the groups "bill", "refund" and "service charge" on the next layer under the "flower" major class. After the second round of pushing, the user's feedback can be obtained, for example a click on "refund"; the return values (r₂₁, r₂₂, r₂₃) corresponding to the respective groups can be obtained from that feedback, and the state s₃ of the third decision can be obtained accordingly.

By inputting s₃ into model 11, the three questions (a₃₁, a₃₂, a₃₃) of the third round of pushing can be output, for example "Can flower be repaid early?", "flower automatic refund deduction order" and "How to refund flower?", and the return values (r₃₁, r₃₂, r₃₃) corresponding to the respective questions can be obtained from the user's feedback on the third round. After the three rounds of pushing have been carried out as described above, the model can be optimized based on the data from the three rounds so as to improve its prediction accuracy.

It will be appreciated that the above description of Fig. 1 is merely illustrative and not restrictive. For example, the push objects are not limited to user query questions but may be other push objects such as commodities or film reviews, in which case the corresponding major classes and groups change accordingly, the user's actions on the push objects may change accordingly, and the method for calculating the return value changes accordingly. The model is not limited to pushing to the user in three rounds; the number of rounds may be set according to the specific scenario. Nor is the model limited to reinforcement learning with the policy gradient algorithm.

The push process is described in detail below.

Fig. 2 shows a method for pushing objects to a user based on a reinforcement learning model according to an embodiment of this specification. The method comprises at most N consecutive rounds of pushing for a first user, wherein each round has a corresponding predetermined candidate object set, each round from the second round onward starts after the first user clicks an object pushed in the previous round, and the candidate object set of each round from the second round onward comprises the respective subclasses of the candidate objects of the previous round. The i-th round of the at most N rounds comprises the following steps:

Step S202: obtaining i-th state information, the i-th state information comprising static features and dynamic features, wherein the static features comprise features the first user already had before the method is carried out, and the dynamic features comprise the identifiers of the objects the first user has clicked in the preceding i-1 rounds; and

Step S204: inputting the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th round, the respective identifiers of a predetermined number of push objects for the i-th round.

The at most N rounds of pushing constitute one episode of reinforcement learning. As described above, where the push objects of the N-th round are, for example, user query questions, the push objects of rounds 1 to N-1 are correspondingly the groups, major classes and so on to which the query questions belong. A candidate object set is preset for each round. For example, in the 1st round the preset candidate object set comprises the major classes; in Alipay intelligent customer service, the candidate object set of the 1st round may comprise "flower", "borrow", "Yuebao", "sesame credit", "ant insurance", "ant forest" and so on. In the 2nd round the preset candidate object set comprises the groups, which are the respective subclasses of the major classes, for example the subclasses of "flower" ("flower bill", "flower refund", "flower service charge", "open flower") and the subclasses of "borrow" ("borrow interest", "borrow refund", "borrowing amount", "open borrow"). In the 3rd round the preset candidate object set comprises the questions under the groups, which are the respective subclasses of the groups, for example the subclasses of "flower refund" (the query questions related to "flower refund") and the subclasses of "flower bill" (the query questions related to "flower bill").
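The per-round candidate object sets described above can be derived from a single category tree. The following is a minimal sketch, not the patent's implementation: the tree literal, the placeholder questions and the function name `build_candidate_sets` are assumptions made for illustration, and only the class and group names come from the examples in the text.

```python
# Hypothetical category tree: major class -> group -> list of questions.
# The "placeholder" questions are illustrative only.
TREE = {
    "flower": {
        "flower bill": ["placeholder bill question"],
        "flower refund": [
            "Can flower be repaid early?",
            "flower automatic refund deduction order",
            "How to refund flower?",
        ],
    },
    "borrow": {
        "borrow interest": ["placeholder interest question"],
        "borrow refund": ["placeholder borrow refund question"],
    },
}


def build_candidate_sets(tree):
    """Return one {object: parent} mapping per round; round-1 parents are None."""
    rounds = []
    level = {name: (None, sub) for name, sub in tree.items()}
    while level:
        rounds.append({name: parent for name, (parent, _) in level.items()})
        next_level = {}
        for name, (_, sub) in level.items():
            if isinstance(sub, dict):
                children = list(sub.items())
            else:
                # A list of leaf objects (or None once the leaves are reached).
                children = [(child, None) for child in (sub or [])]
            for child, grandchildren in children:
                next_level[child] = (name, grandchildren)
        level = next_level
    return rounds


candidate_sets = build_candidate_sets(TREE)
```

Keeping the parent of every candidate alongside the candidate itself is one way to record which object of the previous round each candidate is a subclass of.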

It will be appreciated that the candidate object set of each round from the second round onward is not limited to the above. For example, where the model is based on the policy gradient algorithm, the model calculates a push probability for each candidate object based on the input state, and the push objects to output are determined by ranking those push probabilities. After the first user has clicked a push object in the first round, the candidate object set of the second round may be limited to the subclasses of the clicked push object in order to save model computation time. In that case, a specific identifier can be used to indicate, for example, which first-round major class each second-round group belongs to.

As noted, each round from the second round onward starts after the first user clicks an object pushed in the previous round. For example, as shown in Fig. 1, after the major classes (for example "flower", "borrow" and "Yuebao") are pushed in the 1st round, if the user clicks one of them (for example "flower"), the second round of the episode begins; if the user clicks none of them, the episode ends, that is, the episode comprises only one round of pushing.

Every round of the at most N rounds comprises the same process; the i-th round may comprise the following steps.

First, in step S202, i-th state information is obtained, the i-th state information comprising static features and dynamic features, wherein the static features comprise features the first user already had before the method is carried out, and the dynamic features comprise the identifiers of the objects the first user has clicked in the preceding i-1 rounds.

The i-th state information is the state s_i input for the i-th model prediction of the episode. State s_i takes the form of, for example, a feature vector comprising multiple elements. A predetermined set of dimensions of s_i corresponds to the user's static features, that is, features that existed before the episode is executed, such as the user's attribute features, profile features and historical behavior features; the static features are therefore identical in all the states corresponding to the model predictions of one episode. Another predetermined set of dimensions of s_i corresponds to the user's dynamic features, which are the identifiers of the objects the first user has clicked in the rounds of the episode preceding the current round. For example, referring to the description of Fig. 1 above, in the first round the first user has not yet clicked anything, so the dynamic features in the input state s₁ can be expressed as, for example, [0,0]. In the second round, the first user has clicked "flower" after the first round, so the dynamic features in the input state s₂ include the identifier of "flower" and can be expressed as, for example, [a₁₁,0]. In the third round, the first user has clicked "refund" after the second round, so the dynamic features in the input state s₃ include the identifiers of "flower" and "refund", which the first user clicked in the first and second rounds, and can be expressed as, for example, [a₁₁,a₂₂].
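The construction of the state vector can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_state`, the integer identifiers `A11` and `A22`, and the two static feature values are all hypothetical; only the zero-padded layout [0,0], [a₁₁,0], [a₁₁,a₂₂] comes from the text.

```python
def build_state(static_features, clicked_ids, n_rounds):
    """Concatenate static features with the identifiers clicked so far.

    The dynamic part always has n_rounds - 1 slots; slots for rounds that
    have not been clicked yet are filled with 0.
    """
    n_slots = n_rounds - 1
    if len(clicked_ids) > n_slots:
        raise ValueError("more clicks than preceding rounds")
    dynamic = list(clicked_ids) + [0] * (n_slots - len(clicked_ids))
    return list(static_features) + dynamic


# Hypothetical integer identifiers for the clicked objects a11 and a22.
A11, A22 = 11, 22
static = [0.3, 1.0]  # hypothetical static feature values

s1 = build_state(static, [], n_rounds=3)          # dynamic part [0, 0]
s2 = build_state(static, [A11], n_rounds=3)       # dynamic part [A11, 0]
s3 = build_state(static, [A11, A22], n_rounds=3)  # dynamic part [A11, A22]
```

The static part is copied unchanged into every state of the episode, so only the trailing dynamic slots differ between s1, s2 and s3.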

The reinforcement learning model is, for example, a model based on the policy gradient algorithm. In this case the model comprises a policy function π(a|s, θ) of state s and action a, where θ denotes the parameters of the reinforcement learning model and π(a|s, θ) is the probability of taking action a in state s. For example, after state s_i is input, the model obtains the push probability of each candidate object (that is, each action) a_ij of the i-th decision based on the policy function π(a|s, θ), and determines the predetermined number of push objects for the round based on the push probabilities of the candidate objects.
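The text does not fix the functional form of π(a|s, θ). As one possible illustration, the sketch below assumes a linear-softmax policy in which θ holds one weight vector per candidate action; this parameterization, the function name `policy_probs` and all numeric values are assumptions rather than anything stated in the source.

```python
import math


def policy_probs(state, theta):
    """Return pi(a | s, theta) for every action a under a linear-softmax policy.

    theta maps each action identifier to a weight vector of the same length
    as the state vector (an assumed parameterization).
    """
    scores = {
        action: sum(w * x for w, x in zip(weights, state))
        for action, weights in theta.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {action: math.exp(score - top) for action, score in scores.items()}
    total = sum(exps.values())
    return {action: e / total for action, e in exps.items()}


# Hypothetical state (two static features, two empty dynamic slots) and weights.
state = [1.0, 0.5, 0.0, 0.0]
theta = {
    "a11": [0.8, 0.2, 0.0, 0.0],
    "a12": [0.1, 0.4, 0.0, 0.0],
    "a13": [-0.3, 0.1, 0.0, 0.0],
}
probs = policy_probs(state, theta)
```

The returned values are non-negative and sum to 1, so they can be read directly as the push probabilities of the candidate objects.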

For example, in the intelligent customer service scenario described with reference to Fig. 1, after state s₁ is input to the model in the first round, the candidate objects of the model's first decision comprise, as described above, "flower", "borrow", "Yuebao", "sesame credit", "ant insurance" and "ant forest", identified for example as a₁₁, a₁₂, …, a₁₆ respectively. The model calculates the probability π(a_1j|s₁, θ) corresponding to each candidate object in turn, where j = 1, 2, …, 6, ranks these 6 probabilities, and determines the predetermined number (for example 3) of top-ranked candidate objects as the push objects. It will be appreciated that the predetermined number may be set to be the same for every decision, or may be set separately for each decision. For example, the predetermined number may be set proportional to the number of candidate objects of the decision: in the first round the number of candidate objects is small, so the number of push objects is also small, whereas in the third round the number of candidate objects is larger, so the number of push objects is correspondingly larger.
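The ranking step can be sketched in a few lines. The probabilities below are made-up values for the six candidates a₁₁ to a₁₆, and `select_push_objects`, `proportional_k` and the ratio of 0.5 are hypothetical names and settings chosen for illustration.

```python
import math


def select_push_objects(probs, k):
    """Rank candidates by push probability and keep the top k identifiers."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def proportional_k(n_candidates, ratio=0.5, minimum=1):
    """Choose the number of push objects in proportion to the candidate count."""
    return max(minimum, math.ceil(n_candidates * ratio))


# Made-up push probabilities for the six first-round candidates.
probs = {"a11": 0.40, "a12": 0.25, "a13": 0.15,
         "a14": 0.10, "a15": 0.06, "a16": 0.04}

fixed = select_push_objects(probs, k=3)
scaled = select_push_objects(probs, k=proportional_k(len(probs)))
```

With a fixed predetermined number every decision pushes the same count, while the proportional variant pushes more objects in rounds that have more candidates.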

In general, after the first round is carried out as described above and the first user clicks, for example, "flower" in the first round, the push objects the model predicts from the new state s₂ are subclasses of "flower". However, to rule out model errors, the model output can be filtered before it is ranked. For example, after the model calculates the push probability of each candidate object of the second round, the candidate objects that are not subclasses of "flower" are filtered out, and the push probabilities of the remaining candidate objects are ranked to finally determine the push objects of the second round. In one embodiment, the candidate objects of the second round may instead be filtered before the model calculates the push probabilities: the candidate objects that are not subclasses of "flower" are removed from the candidate object set so that only subclasses of "flower" remain, the push probability of each candidate object is calculated for the filtered candidate object set (a subset of the original candidate object set), and the push objects of the round are determined by ranking those push probabilities.
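Both filtering variants can be sketched as below. The parent mapping, the probability values and the function names `rank_children` and `restrict_candidates` are assumptions for illustration; the group names follow the examples in the text.

```python
def rank_children(probs, parent_of, clicked, k):
    """Post-filter: drop candidates that are not subclasses of `clicked`,
    then rank the remaining push probabilities and keep the top k."""
    kept = {obj: p for obj, p in probs.items() if parent_of.get(obj) == clicked}
    return sorted(kept, key=kept.get, reverse=True)[:k]


def restrict_candidates(candidates, parent_of, clicked):
    """Pre-filter: shrink the candidate set before any probability is computed."""
    return [obj for obj in candidates if parent_of.get(obj) == clicked]


# Hypothetical parent mapping and made-up second-round push probabilities.
parent_of = {
    "flower bill": "flower",
    "flower refund": "flower",
    "flower service charge": "flower",
    "open flower": "flower",
    "borrow interest": "borrow",
}
probs = {
    "borrow interest": 0.35,  # a model error: not a subclass of "flower"
    "flower bill": 0.30,
    "flower refund": 0.20,
    "flower service charge": 0.10,
    "open flower": 0.05,
}

pushed = rank_children(probs, parent_of, clicked="flower", k=3)
subset = restrict_candidates(list(probs), parent_of, clicked="flower")
```

In this made-up example the highest-probability candidate is a wrong-branch group, and the post-filter removes it before ranking; the pre-filter variant avoids scoring it at all.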

It will be appreciated that the reinforcement learning model is not limited to the policy gradient algorithm; other algorithms may be used, such as the Q-learning algorithm or the actor-critic algorithm, which are not described one by one here.

After the push objects of the round are determined as described above, they are pushed to the first user to obtain the first user's feedback. For example, in the intelligent customer service scenario, the predetermined number of push objects of the round can be shown to the first user in ranked order on a window page. Fig. 3 schematically shows the push objects shown to the user in each of three rounds of pushing. As shown in Fig. 3, "flower", "borrow" and "Yuebao" are shown to the user in order in the first round. After they are shown, the first user's feedback, that is, whether and what the first user clicks, can be obtained, and the corresponding return values can be obtained based on that feedback; the mouse pointer in Fig. 3 indicates the user's click on a push object. For example, it can be preset that when the first user clicks a push object a_1j, the return value r_1j corresponding to that push object in the round is recorded as 0.1, and when the first user does not click push object a_1j, the return value r_1j corresponding to that object is recorded as 0. That is, after the user clicks "flower" in the first round, r₁₁ = 0.1, r₁₂ = 0 and r₁₃ = 0 are obtained. It will be appreciated that this preset is merely illustrative. For example, when the first user clicks a push object a_1j, the return value r_1j can be set based on the rank of that push object among the predetermined number of push objects, with a higher rank giving a larger return value. For example, assuming the model outputs the three push objects a₁₁, a₁₂, a₁₃ in that order in the first round, the return value r₁₁ when the first user clicks a₁₁ can be set greater than the return value r₁₂ when the first user clicks a₁₂.
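The two return-value presets can be sketched as follows. The fixed values 0.1 and 0 come from the text; the linear rank weighting in `rank_weighted_reward` is only one possible way to make a higher rank yield a larger return value and is an assumption, as are the function names.

```python
def click_rewards(pushed, clicked, click_value=0.1):
    """Fixed preset: click_value for the clicked object, 0 for the others."""
    return {obj: (click_value if obj == clicked else 0.0) for obj in pushed}


def rank_weighted_reward(pushed, clicked, click_value=0.1):
    """Assumed rank-based preset: scale the click value linearly by position,
    so a click on a higher-ranked push object yields a larger return value."""
    if clicked not in pushed:
        return 0.0
    rank = pushed.index(clicked)  # 0 for the top-ranked push object
    return click_value * (len(pushed) - rank) / len(pushed)


pushed = ["a11", "a12", "a13"]  # push objects in ranked order
rewards = click_rewards(pushed, clicked="a11")
no_click = click_rewards(pushed, clicked=None)
r11 = rank_weighted_reward(pushed, clicked="a11")
r12 = rank_weighted_reward(pushed, clicked="a12")
```

Here `rewards` reproduces r₁₁ = 0.1, r₁₂ = 0, r₁₃ = 0 from the example, and `r11` is greater than `r12` under the rank-based variant.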

After the first user clicks a push object of a round, the model enters the next round of pushing, and the return values of the next round can be obtained as described above. For example, if the first user clicks "flower" in the first round, then in the next round the second model prediction determines the several top-ranked groups under the "flower" major class to show in the second push. As shown in Fig. 3, the three groups predicted by the model are shown to the user in the second round: flower bill, flower refund and flower service charge. After the user clicks one of these groups (for example "flower refund"), the model makes its third prediction to determine the three specific questions on the next layer under the group "flower refund" to show in the third push. As shown in Fig. 3, the three questions predicted by the model are shown to the user in the third round: "Can flower be repaid early?", "flower automatic refund deduction order" and "How to refund flower?".

For the N-th round, for example, it can be set that when the user clicks a push object of the N-th round, the return value is larger than the return value for a click in any of the first N-1 rounds. For example, in the above intelligent customer service scenario, when the first user clicks a push object a₃ⱼ of the third round (that is, a question), the return value corresponding to that push object can be set to r₃ⱼ=1, and when the first user clicks a push object of the first or second round, the return value (r₁ⱼ or r₂ⱼ) can be set to 0.1. For any of the three rounds, if the user does not click any push object of that round, the episode of the model ends, that is, the model does not enter the next round. In other words, one episode of the model includes at most N rounds of pushing.
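The episode structure described above (at most N rounds, a larger return value in the last round, and early termination when nothing is clicked) can be sketched as below. The function names and the representation of a click as an index into the pushed list are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def click_reward(round_idx: int, n_rounds: int,
                 intermediate: float = 0.1, final: float = 1.0) -> float:
    """Return value for a click: `final` in round N, `intermediate` before that."""
    return final if round_idx == n_rounds else intermediate


def run_episode(clicks: List[Optional[int]],
                n_rounds: int = 3) -> List[Tuple[int, Optional[int], float]]:
    """Replay one episode.

    clicks[k] is the index of the object clicked in round k+1, or None if the
    user clicked nothing. The episode stops after the first round with no click.
    """
    history: List[Tuple[int, Optional[int], float]] = []
    for round_idx in range(1, n_rounds + 1):
        clicked = clicks[round_idx - 1] if round_idx - 1 < len(clicks) else None
        reward = click_reward(round_idx, n_rounds) if clicked is not None else 0.0
        history.append((round_idx, clicked, reward))
        if clicked is None:
            break  # no click: the episode ends and no further round is pushed
    return history


print(run_episode([0, 1, 1]))     # clicks in all three rounds
print(run_episode([0, None, 2]))  # no click in round 2, so round 3 never happens
```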

After an episode of the model ends, the model can be trained with the input data, output data and feedback data of the episode. For example, in one case, the episode includes three rounds of pushing to the first user, that is, the first user clicks in both the first and second rounds and thus finally enters the third round. In this case, the model can be trained at least three times. Specifically, assume that the first user clicks the push object identified as a₁₁ after the first round, the push object identified as a₂₂ after the second round, and the push object identified as a₃₂ after the third round. Three groups of training data (s₁, a₁₁, r₁₁), (s₂, a₂₂, r₂₂) and (s₃, a₃₂, r₃₂) are then obtained, where, as described above, r₁₁=0.1, r₂₂=0.1 and r₃₂=1. Based on each of the three groups of training data, the model parameters can be updated according to the policy gradient algorithm by the following formula (1):

where the corresponding term in formula (1) denotes the expected value. For example, when the model is trained with (s₁, a₁₁, r₁₁), this term in formula (1) can be calculated by the following formula (2):

When the model is trained with (s₂, a₂₂, r₂₂) and (s₃, a₃₂, r₃₂), the corresponding expected values can be calculated similarly, based on the return values r₂₂ and r₃₂ respectively.
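Formulas (1) and (2) are not legible in this text. As a hedged sketch only, a standard REINFORCE-style policy gradient update that is consistent with the surrounding description could be written as follows. The symbols θ (model parameters), α (learning rate), π_θ (the policy given by the model), γ (discount factor) and v (the expected value that weights the update) are assumptions introduced for illustration and are not taken from the original formulas.

```latex
% Assumed form of the parameter update (role of formula (1)):
\theta \leftarrow \theta + \alpha \, v \, \nabla_\theta \log \pi_\theta(a \mid s)

% Assumed form of the expected values for the three-round example
% (role of formula (2) and its analogues for the later rounds):
\begin{aligned}
v_1 &= r_{11} + \gamma\, r_{22} + \gamma^{2} r_{32}, \\
v_2 &= r_{22} + \gamma\, r_{32}, \\
v_3 &= r_{32}.
\end{aligned}
```

Under this assumed form, an object that is not clicked has a zero return value and no subsequent rounds, so its weight v is 0, and an episode that ends after a single click gives v₁ = r₁₁.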

In this case, in addition to the above three groups of training data, training data can also be obtained from the push objects in each round that the user did not click. For example, for the push object a₁₂ in the first round, a group of training data (s₁, a₁₂, r₁₂) can be obtained. Since the user did not click this push object, r₁₂=0, and correspondingly the expected value for this group is also 0.
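To make the role of the expected value concrete, the sketch below applies one policy gradient step to a softmax policy whose parameters are simply one logit per candidate. This tabular parameterization, the learning rate `lr`, and the function names are assumptions for illustration and stand in for the actual model. The point shown is that a clicked object with a positive weight has its probability raised, while a group of data whose weight is 0 leaves the parameters unchanged.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits: List[float], action: int, v: float,
                   lr: float = 0.5) -> List[float]:
    """One gradient-ascent step on v * log pi(action) for a softmax policy.

    For a softmax over logits, d log pi(action) / d logit_k equals
    1[k == action] - pi(k).
    """
    probs = softmax(logits)
    return [x + lr * v * ((1.0 if k == action else 0.0) - p)
            for k, (x, p) in enumerate(zip(logits, probs))]


logits = [0.0, 0.0, 0.0]
after_click = reinforce_step(logits, action=0, v=1.2)    # clicked object, positive weight
after_no_click = reinforce_step(logits, action=1, v=0.0) # unclicked object, zero weight
print(softmax(after_click))
print(after_no_click)
```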

In another case, for example, the first user does not click any push object in the second round, so the episode ends after the second round is executed. Specifically, assume that the first user clicks the push object identified as a₁₁ after the first round and does not click any push object after the second round. A group of training data (s₁, a₁₁, r₁₁) is then obtained, where r₁₁=0.1. The model can likewise be trained by formula (1), with the expected value in formula (1) obtained by formula (2). Similarly, multiple groups of training data can also be obtained from this episode for the push objects that were not clicked in the first and second rounds, for use in model training.
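The per-round expected values for both cases (a full three-round episode and an episode that ends early) can be computed with a single backward pass over the return values. This is a sketch under the assumption that the expected value is a discounted sum of the current and later return values. The discount factor `gamma` and the function name are not taken from the source.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """For each round t, compute r_t + gamma * r_{t+1} + gamma**2 * r_{t+2} + ..."""
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Full episode: clicks in rounds 1, 2 and 3 (return values from the example).
print(discounted_returns([0.1, 0.1, 1.0]))
# Early termination: click in round 1, no click in round 2.
print(discounted_returns([0.1, 0.0]))
```

In the early-termination case the later return values are all zero, so the value for the first round reduces to r₁₁ itself.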

Fig. 4 shows an apparatus 400 for pushing objects to a user based on a reinforcement learning model according to an embodiment of this specification. The apparatus includes at most N push modules 41 deployed consecutively for a first user, wherein each push module has a corresponding predetermined candidate object set, each push module from the second push module onward is deployed after the first user clicks an object pushed by the previous push module, and the candidate object set of each push module from the second push module onward includes multiple subcategories of each of multiple candidate objects of the previous push module, wherein the i-th push module of the at most N push modules includes the following units:

an acquiring unit 411, configured to obtain i-th state information, the i-th state information including static features and dynamic features, wherein the static features include features of the first user that exist before the apparatus is deployed, and the dynamic features include the identifiers of the objects the first user has clicked for the first i-1 push modules; and

a determination unit 412, configured to input the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th push module, the respective identifiers of a predetermined number of push objects of the i-th push module.
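The state information handled by the acquiring unit can be sketched as a small record that combines the static features with the identifiers clicked so far. The `State` class, the `build_state` helper, the numeric feature values and the object identifiers are all hypothetical and serve only to illustrate how the i-th state depends on the first i-1 clicks.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class State:
    static_features: Tuple[float, ...]  # user features known before pushing starts
    clicked_ids: Tuple[str, ...]        # identifiers clicked in rounds 1 .. i-1


def build_state(static_features: Sequence[float],
                clicked_ids: Sequence[str], i: int) -> State:
    """Assemble the i-th state from static features and the first i-1 clicks."""
    if i < 1 or len(clicked_ids) < i - 1:
        raise ValueError("round i requires the clicks of the first i-1 rounds")
    return State(tuple(static_features), tuple(clicked_ids[: i - 1]))


features = [0.3, 1.0]                  # placeholder static features
clicks = ["flower", "flower/group-2"]  # placeholder identifiers
print(build_state(features, clicks, 1))  # first round: no dynamic features yet
print(build_state(features, clicks, 3))  # third round: two earlier clicks
```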

In one embodiment, the determination unit 412 includes the following subunits deployed in the reinforcement learning model: a calculation subunit 4121, configured to calculate the push probability of each candidate object of the i-th push module based on the i-th state information and the object identifier of each candidate object in the candidate object set of the i-th push module; and a determination subunit 4122, configured to determine the predetermined number of push objects of the i-th push module based on the push probabilities.
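The two subunits can be sketched as a scoring step followed by a top-k selection. In this sketch the per-candidate scores are assumed to have been produced already by the model from the state and the candidate identifiers. The softmax normalization, the function names and the example scores are assumptions, since the text does not specify how the push probabilities are computed.

```python
import math
from typing import Dict, List


def push_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn hypothetical per-candidate model scores into push probabilities."""
    m = max(scores.values())
    exps = {obj: math.exp(s - m) for obj, s in scores.items()}
    total = sum(exps.values())
    return {obj: e / total for obj, e in exps.items()}


def select_push_objects(probs: Dict[str, float], k: int) -> List[str]:
    """Identifiers of the k candidates with the highest push probability, best first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


scores = {"candidate-a": 2.0, "candidate-b": 0.5, "candidate-c": 1.0}
probs = push_probabilities(scores)
print(select_push_objects(probs, 2))
```

Selecting the highest-probability candidates is only one possible rule. Sampling from the probabilities would be another choice that favors exploration during training.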

In one embodiment, the i-th push module 41 further includes a push unit 413, configured to, after the push objects of the i-th push module are determined, push the push objects to the first user to obtain the first user's feedback.

In one embodiment, i ≠ N, and in the case where the first user's feedback is that none of the push objects is clicked, the apparatus includes i consecutive push modules for the first user, and the apparatus further includes an optimization module 42, configured to optimize the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the i push modules, wherein a group of data corresponding to a second push object of the j-th push module of the i push modules includes: state information corresponding to the j-th push module, the identifier of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.

In one embodiment, i = N, the apparatus includes N consecutive push modules for the first user, and the apparatus further includes an optimization module 42, configured to, after the first user's feedback is obtained, optimize the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the N push modules, wherein a group of data corresponding to a second push object of the j-th push module of the N push modules includes: state information corresponding to the j-th push module, the identifier of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.

It should be understood that descriptions such as "first" and "second" herein are used only to distinguish similar concepts for simplicity of description and have no other limiting effect.

The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant parts can be found in the description of the method embodiment.

Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Those of ordinary skill in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The specific embodiments described above further explain the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims

1. A method for pushing objects to a user based on a reinforcement learning model, the method comprising at most N consecutive rounds of pushing for a first user, wherein each round of pushing has a corresponding predetermined candidate object set, each round of pushing from the second round onward starts after the first user clicks an object pushed in the previous round, and the candidate object set of each round of pushing from the second round onward includes multiple subcategories of each of multiple candidate objects of the previous round, wherein the i-th round of the at most N rounds of pushing comprises the following steps:

obtaining i-th state information, the i-th state information including static features and dynamic features, wherein the static features include features of the first user that exist before the method is performed, and the dynamic features include the identifiers of the objects the first user has clicked in the first i-1 rounds of pushing; and

inputting the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th round of pushing, the respective identifiers of a predetermined number of push objects of the i-th round of pushing.

2. The method according to claim 1, wherein causing the reinforcement learning model to determine, from the candidate object set of the i-th round of pushing, the respective identifiers of the predetermined number of push objects of the i-th round of pushing comprises causing the reinforcement learning model to: calculate a push probability of each candidate object of the i-th round of pushing based on the i-th state information and the object identifier of each candidate object in the candidate object set of the i-th round of pushing; and determine the predetermined number of push objects of the i-th round of pushing based on the push probabilities.

3. The method according to claim 2, wherein, for the (i-1)-th round of pushing, the first user clicks a first push object pushed in that round, and wherein determining the predetermined number of push objects of the i-th round of pushing based on the push probabilities comprises: determining, among the candidate objects of the i-th round of pushing, first candidate objects that belong to the subcategories of the first push object, and determining the predetermined number of push objects of the i-th round of pushing based on the push probabilities of the first candidate objects.

4. The method according to claim 1, wherein, for the (i-1)-th round of pushing, the first user clicks a first push object pushed in that round, and wherein causing the reinforcement learning model to determine, from the candidate object set of the i-th round of pushing, the respective identifiers of the predetermined number of push objects of the i-th round of pushing comprises causing the reinforcement learning model to determine, from a subset of the candidate object set of the i-th round of pushing, the respective identifiers of the predetermined number of push objects of the i-th round of pushing, wherein the subset includes the multiple subcategories of the first push object.

5. The method according to claim 2, wherein the i-th round of pushing further comprises, after the push objects of the i-th round of pushing are determined, pushing the push objects to the first user to obtain the first user's feedback.

6. The method according to claim 5, wherein i ≠ N, and in the case where the first user's feedback is that none of the push objects is clicked, the method comprises i consecutive rounds of pushing for the first user, and the method further comprises optimizing the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the i rounds of pushing, wherein a group of data corresponding to a second push object in the j-th round of pushing includes: state information corresponding to the j-th round of pushing, the identifier of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on the first user's feedback on the second push object.

7. The method according to claim 5, wherein i = N, the method comprises N consecutive rounds of pushing for the first user, and the method further comprises, after the first user's feedback is obtained, optimizing the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the N rounds of pushing, wherein a group of data corresponding to a second push object in the j-th round of the N rounds of pushing includes: state information corresponding to the j-th round of pushing, the identifier of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.

8. The method according to claim 7, wherein the push objects corresponding to the N-th round of pushing are inquiry questions, and the return value takes a positive value in the case where the first user clicks the second push object and is zero in the case where the first user does not click the second push object.

9. The method according to claim 8, wherein the return value takes a first value in the case where j = N and the first user clicks the second push object, and takes a second value in the case where j ≠ N and the first user clicks the second push object, wherein the first value is greater than the second value.

10. An apparatus for pushing objects to a user based on a reinforcement learning model, the apparatus comprising at most N push modules deployed consecutively for a first user, wherein each push module has a corresponding predetermined candidate object set, each push module from the second push module onward is deployed after the first user clicks an object pushed by the previous push module, and the candidate object set of each push module from the second push module onward includes multiple subcategories of each of multiple candidate objects of the previous push module, wherein the i-th push module of the at most N push modules comprises the following units:

an acquiring unit, configured to obtain i-th state information, the i-th state information including static features and dynamic features, wherein the static features include features of the first user that exist before the apparatus is deployed, and the dynamic features include the identifiers of the objects the first user has clicked for the first i-1 push modules; and

a determination unit, configured to input the i-th state information into the reinforcement learning model, so that the reinforcement learning model determines, from the candidate object set of the i-th push module, the respective identifiers of a predetermined number of push objects of the i-th push module.

11. The apparatus according to claim 10, wherein the determination unit comprises the following subunits deployed in the reinforcement learning model: a calculation subunit, configured to calculate a push probability of each candidate object of the i-th push module based on the i-th state information and the object identifier of each candidate object in the candidate object set of the i-th push module; and a determination subunit, configured to determine the predetermined number of push objects of the i-th push module based on the push probabilities.

12. The apparatus according to claim 11, wherein, for the (i-1)-th push module, the first user clicks a first push object pushed by that module, and wherein the determination subunit is further configured to determine, among the candidate objects of the i-th push module, first candidate objects that belong to the subcategories of the first push object, and to determine the predetermined number of push objects of the i-th push module based on the push probabilities of the first candidate objects.

13. The apparatus according to claim 10, wherein, for the (i-1)-th push module, the first user clicks a first push object pushed by that module, and wherein the determination unit is further configured to cause the reinforcement learning model to determine, from a subset of the candidate object set of the i-th push module, the respective identifiers of the predetermined number of push objects of the i-th push module, wherein the subset includes the multiple subcategories of the first push object.

14. The apparatus according to claim 11, wherein the i-th push module further comprises a push unit, configured to, after the push objects of the i-th push module are determined, push the push objects to the first user to obtain the first user's feedback.

15. The apparatus according to claim 14, wherein i ≠ N, and in the case where the first user's feedback is that none of the push objects is clicked, the apparatus comprises i consecutive push modules for the first user, and the apparatus further comprises an optimization module, configured to optimize the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the i push modules, wherein a group of data corresponding to a second push object of the j-th push module includes: state information corresponding to the j-th push module, the identifier of the second push object, and a return value corresponding to the second push object, wherein j is any natural number from 1 to i, and the return value is obtained based on the first user's feedback on the second push object.

16. The apparatus according to claim 14, wherein i = N, the apparatus comprises N consecutive push modules for the first user, and the apparatus further comprises an optimization module, configured to, after the first user's feedback is obtained, optimize the model by a policy gradient algorithm based on multiple groups of data respectively corresponding to multiple push objects in the N push modules, wherein a group of data corresponding to a second push object of the j-th push module of the N push modules includes: state information corresponding to the j-th push module, the identifier of the second push object, and a return value corresponding to the second push object, wherein the return value is obtained based on the first user's feedback on the second push object.

17. The apparatus according to claim 16, wherein the push objects corresponding to the N-th push module are inquiry questions, and the return value takes a positive value in the case where the first user clicks the second push object and is zero in the case where the first user does not click the second push object.

18. The apparatus according to claim 17, wherein the return value takes a first value in the case where j = N and the first user clicks the second push object, and takes a second value in the case where j ≠ N and the first user clicks the second push object, wherein the first value is greater than the second value.

19. A computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method of any one of claims 1-9.

20. A computing device, comprising a memory and a processor, characterized in that the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-9.