WO2020220757A1 - 基于强化学习模型向用户推送对象的方法和装置 - Google Patents

基于强化学习模型向用户推送对象的方法和装置 Download PDF

Info

Publication number
WO2020220757A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
list
lists
objects
feature vector
Prior art date
Application number
PCT/CN2020/071699
Other languages
English (en)
French (fr)
Inventor
陈岑
胡旭
傅驰林
张晓露
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Priority to US16/813,654 priority Critical patent/US10902298B2/en
Publication of WO2020220757A1 publication Critical patent/WO2020220757A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the embodiments of this specification relate to the field of machine learning, and more specifically, to a method and device for determining a push object list for a user based on a reinforcement learning model.
  • user intention prediction aims to automatically predict the questions that the user may want to ask, and present candidate questions to the user for selection to reduce the user's cognitive burden.
  • More specifically, the user intention prediction task can be regarded as a top-N recommendation task, where each predetermined question is an intent class.
  • Existing methods treat this task as a multi-class classification problem and, given the current user state, predict the list of objects (items), i.e. questions, that the user is most likely to be interested in. These methods aim to maximize an immediate reward, such as a click, while ignoring the influence that earlier recommended objects in the list have on later ones.
  • the embodiments of the present specification aim to provide a more effective solution for determining a push object list for users based on a reinforcement learning model, so as to solve the deficiencies in the prior art.
  • One aspect of this specification provides a method for determining a push object list for a user based on a reinforcement learning model, where, for a first user, M groups of object lists have been determined in advance through the method, each group of object lists currently including i-1 objects, where M and i are both integers greater than or equal to 1 and i is less than or equal to a predetermined integer N. The method includes, for each group of object lists:
  • obtaining an i-th state feature vector, which includes static features and dynamic features, where the static features include attribute features of the first user and the dynamic features include the respective attribute features of the i-1 objects in that group of object lists;
  • inputting the i-th state feature vector into the reinforcement learning model so that the reinforcement learning model outputs a weight vector corresponding to the i-th state feature vector, the weight vector including the respective weights of a predetermined number of ranking features;
  • obtaining a ranking feature vector of each object in the candidate object set corresponding to that group of object lists, the ranking feature vector including the respective feature values of the predetermined number of ranking features, and calculating the score of each object in the candidate object set based on the dot product of its ranking feature vector and the weight vector; and
  • for the M groups of object lists, determining updated M groups of object lists based on the scores of the objects in the M candidate object sets respectively corresponding to the M groups of object lists, where each group of object lists in the updated M groups of object lists includes i objects.
  • the dynamic characteristics include at least the following attribute characteristics of each of the i-1 objects: current popularity, object identification, and type of object.
  • In one embodiment, the M groups of object lists include a first group of object lists, the candidate object set corresponding to the first group of object lists includes a first object, and the ranking feature vector corresponding to the first object includes at least the values of the following ranking features: the estimated click-through rate of the first user for the first object, the current popularity of the first object, and the diversity of the first object relative to the i-1 objects in the first group of object lists.
  • In one embodiment, the M groups of object lists determined in advance through the method consist of a single group of object lists, and determining the updated M groups of object lists based on the scores of the objects in the M candidate object sets includes: based on the scores of the objects in the candidate object set corresponding to that group of object lists, taking the object with the highest score in the candidate object set as the i-th object of that group of object lists, and taking that group of object lists as the updated group of object lists.
  • In one embodiment, M is greater than or equal to 2, and determining the updated M groups of object lists based on the scores of the objects in the M candidate object sets respectively corresponding to the M groups of object lists includes determining the updated M groups of object lists through a beam search algorithm based on those scores.
  • In one embodiment, i is equal to N, and the method further includes determining a push object list for the first user from the updated M groups of object lists through the beam search algorithm.
  • In one embodiment, the method further includes: pushing the objects to the first user in the order of the objects in the push object list so as to obtain feedback from the first user; obtaining N reward values based on that order and that feedback, the N reward values respectively corresponding to the N cycles of the method from i=1 to N;
  • obtaining an (N+1)-th state feature vector, which includes static features and dynamic features, where the static features include the attribute features of the first user and the dynamic features include the respective attribute features of the N objects in the push object list; and
  • training the reinforcement learning model based on the N sets of data respectively corresponding to the N cycles, so as to optimize the reinforcement learning model, where the N sets of data include the 1st to the N-th sets of data and the i-th set of data includes: the i-th state feature vector corresponding to the push object list, the weight vector corresponding to that i-th state feature vector, the (i+1)-th state feature vector corresponding to the push object list, and the reward value corresponding to the i-th cycle.
  • In one embodiment, the object is a question to be asked.
  • In one embodiment, for the i-th cycle among the 1st to (N-1)-th cycles, the reward value corresponding to the i-th cycle is obtained based on the following feedback from the first user: whether the i-th question in the push object list is clicked.
  • In one embodiment, the reward value corresponding to the N-th cycle is obtained based on the following feedback from the first user: whether the N-th question in the push object list is clicked, and the submitted satisfaction information.
  • In one embodiment, the reinforcement learning model is a model based on the deep deterministic policy gradient (DDPG) algorithm.
  • Another aspect of this specification provides a device for determining a push object list for a user based on a reinforcement learning model.
  • For a first user, M groups of object lists have been determined in advance through the method, each group of object lists currently including i-1 objects, where M and i are both integers greater than or equal to 1 and i is less than or equal to a predetermined integer N. The device includes, for each group of object lists:
  • a first obtaining unit configured to obtain an i-th state feature vector, which includes static features and dynamic features, where the static features include attribute features of the first user and the dynamic features include the respective attribute features of the i-1 objects in that group of object lists;
  • an input unit configured to input the i-th state feature vector into the reinforcement learning model so that the reinforcement learning model outputs a weight vector corresponding to the i-th state feature vector, the weight vector including the respective weights of a predetermined number of ranking features;
  • the second acquiring unit is configured to acquire the ranking feature vector of each object in the candidate object set corresponding to the group of object lists, the ranking feature vector including the respective feature values of the predetermined number of ranking features;
  • a calculation unit configured to calculate the score of each object in the candidate object set based on the dot product of the ranking feature vector of each object in the candidate object set and the weight vector;
  • a first determining unit configured to, for the M groups of object lists, determine updated M groups of object lists based on the scores of the objects in the M candidate object sets corresponding to the M groups of object lists, where each group of object lists in the updated M groups of object lists includes i objects.
  • In one embodiment, the first determining unit is further configured to, based on the scores of the objects in the corresponding candidate object set, take the object with the highest score in the candidate object set as the i-th object of that group of object lists and take that group of object lists as the updated group of object lists.
  • In one embodiment, the first determining unit is further configured to determine the updated M groups of object lists through a beam search algorithm based on the scores of the objects in the M candidate object sets respectively corresponding to the M groups of object lists.
  • In one embodiment, i is equal to N, and the device further includes a second determining unit configured to determine a push object list for the first user from the updated M groups of object lists through the beam search algorithm.
  • the device further includes:
  • a pushing unit configured to push each object to the first user in the order of the objects in the push object list to obtain feedback from the first user;
  • a fourth obtaining unit configured to obtain an (N+1)-th state feature vector, which includes static features and dynamic features, where the static features include the attribute features of the first user and the dynamic features include the respective attribute features of the N objects in the push object list; and
  • a training unit configured to train the reinforcement learning model based on the N sets of data respectively corresponding to the N cycles, so as to optimize the reinforcement learning model, where the N sets of data include the 1st to the N-th sets of data and the i-th set of data includes: the i-th state feature vector corresponding to the push object list, the weight vector corresponding to that i-th state feature vector, the (i+1)-th state feature vector corresponding to the push object list, and the reward value corresponding to the i-th cycle.
  • Another aspect of this specification provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed in a computer, the computer is caused to execute any of the above methods.
  • Another aspect of this specification provides a computing device including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, any one of the above methods is implemented.
  • the solution of determining the push object list for users based on the reinforcement learning model according to the embodiment of this specification aims to optimize the long-term accumulated mixed rewards.
  • the final reward value can be obtained based on multiple dimensions such as user clicks and user satisfaction.
  • In addition, the policy function can be dynamically updated and adjusted as question popularity and user behavior patterns change, which helps increase the click-through rate.
  • Fig. 1 shows a schematic diagram of an object pushing system 100 according to an embodiment of the present specification
  • the method shown in FIG. 2 is, for example, a decision-making process of the model unit 11 shown in FIG. 1;
  • FIG. 4 schematically shows the process of determining the push object list in the system shown in FIG. 1 when the greedy search method is adopted;
  • Fig. 5 schematically shows the process of determining two groups of object lists by means of beam search;
  • Fig. 6 shows an apparatus 6000 for determining a push object list for a user based on a reinforcement learning model according to an embodiment of the present specification.
  • Fig. 1 shows a schematic diagram of an object pushing system 100 according to an embodiment of the present specification.
  • The object push system is, for example, a question prediction system which, when a user contacts customer service, automatically predicts a list of questions that the user may want to ask and displays that question list on the customer service page, improving the user experience and saving the cost of human customer service.
  • the object pushing system 100 according to the embodiment of the present specification is not limited to pushing a list of asked questions, but can be used to push lists of various objects, such as commodities, film and television works, news, and so on.
  • the system 100 includes a model unit 11, a training unit 12 and a ranking unit 13.
  • the model unit 11 includes, for example, a neural network for implementing reinforcement learning algorithms.
  • Various reinforcement learning models can be used, such as models based on any of the following algorithms: DDPG, DPG, Actor-Critic, and so on, which are not listed one by one here; the DDPG algorithm is taken as an example in the description below.
  • In the case of pushing, for example, a question list through the system 100, N consecutive states (s_1, s_2 ... s_N) are input to the model unit 11 in sequence, and the ranking unit 13 finally obtains a push question list including N questions.
  • For example, when s_1 is input, the model unit 11 outputs the corresponding behavior a_1 based on s_1.
  • In the ranking unit 13, each candidate question is scored based on a_1 and the ranking features of the candidate questions, and the first question of the push question list is determined based on those scores.
  • Here the first question can be determined by a greedy search algorithm; it is understandable that the embodiments of the present specification are not limited to this, and, for example, a beam search algorithm can also be used.
  • After the first question is determined, the second state s_2 of the environment is determined accordingly; that is, the current state of the environment is related to the characteristics of the user and to the questions already in the determined push question list. After the second state s_2 is determined, the behavior a_2 and the second question of the push question list can be determined accordingly. Therefore, in the case that the push question list is preset to include N questions, a push question list including N questions can be obtained through N decision-making processes of the model.
  • After the push question list is obtained and shown to the user, the user's feedback can be obtained, and the reward value r_i of each decision of the model can be obtained based on that feedback. The reinforcement learning model can then be trained in the training unit 12 based on the above states, behaviors and reward values (i.e. the N groups (s_i, a_i, s_{i+1}, r_i)), and the updated parameters are transmitted to the model unit 11 to update it.
  • Fig. 2 shows a method for determining a push object list for a user based on a reinforcement learning model according to an embodiment of the present specification, where, for a first user, M groups of object lists have been determined in advance through the method, each group of object lists currently including i-1 objects, where M and i are both integers greater than or equal to 1 and i is less than or equal to a predetermined integer N. The method includes, for each group of object lists:
  • Step S202: obtain an i-th state feature vector, which includes static features and dynamic features, where the static features include attribute features of the first user and the dynamic features include the respective attribute features of the i-1 objects in that group of object lists;
  • Step S204: input the i-th state feature vector into the reinforcement learning model so that the reinforcement learning model outputs a weight vector corresponding to the i-th state feature vector, the weight vector including the respective weights of a predetermined number of ranking features;
  • Step S206: obtain a ranking feature vector of each object in the candidate object set corresponding to that group of object lists, the ranking feature vector including the respective feature values of the predetermined number of ranking features;
  • Step S208: calculate the score of each object in the candidate object set based on the dot product of its ranking feature vector and the weight vector; and
  • Step S210: for the M groups of object lists, determine updated M groups of object lists based on the scores of the objects in the M candidate object sets respectively corresponding to the M groups of object lists, where each group of object lists in the updated M groups of object lists includes i objects.
  • The method shown in Fig. 2 is, for example, one decision-making process of the model unit 11 shown in Fig. 1, i.e. the process of inputting any one of the states s_1, s_2 ... s_N to the reinforcement learning model in order to add one question to the ranked question list.
  • For example, the state s_i, where 1 ≤ i ≤ N, is about to be input to the model.
  • As described above, in the case of ranking questions with a greedy search algorithm, one group of object lists has already been determined in the model's decision-making processes based on s_1, s_2 ... s_{i-1}, and that group of object lists currently includes i-1 objects. In the case of ranking with a beam search algorithm with, for example, a preset beam width of 2 (i.e. M=2), two groups of object lists have been determined, each currently including i-1 objects.
  • step S202 to step S208 are steps for each group of object lists in the above-mentioned existing M group of object lists, that is, step S202 to step S208 are respectively implemented for each group of object lists in the M group of object lists.
  • First, in step S202, an i-th state feature vector is obtained; it includes static features and dynamic features, where the static features include attribute features of the first user and the dynamic features include the respective attribute features of the i-1 objects in that group of object lists.
  • The i-th state feature vector is the state s_i described above.
  • At the time the method is to be carried out, each of the predetermined groups of object lists currently includes i-1 objects.
  • In the embodiments of this specification, s_i is set to be related not only to the user's static features but also to the i-1 objects already determined, so that the attributes of the objects already in the list are taken into account when determining the i-th object.
  • the static characteristics of the user are, for example, the user's age, educational background, geographic location, and so on.
  • the dynamic feature is, for example, the current popularity of each of the i-1 objects, object identification (for example, question number), object type, etc.
  • a predetermined number of questions can be preset as the candidate question set for this decision.
  • the popularity of each candidate question can be determined according to the number of questions asked by multiple users for each candidate question within a predetermined time period.
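  • As a toy illustration of the popularity signal just described, the following Python sketch counts asks per candidate question within a sliding time window; the (question, timestamp) log format and the window length are assumptions of the sketch, not details from the specification.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hedged sketch: popularity of each candidate question = number of times it was
# asked by users within a recent, predetermined time window.
def question_popularity(ask_log, window_days=7, now=None):
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    return Counter(q for q, t in ask_log if t >= cutoff)

popularity = question_popularity([("huabei_limit", datetime.now()),
                                  ("refund_status", datetime.now())])
```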
  • the above predetermined number of questions can be classified in advance to determine the type of each question.
  • the types of questions include, for example, questions about Huabei, questions about shopping, hot issues, etc.
  • Fig. 3 shows the N-step (N=6) decision-making process of the model according to an embodiment of the present specification, including the input states s_1 to s_6 of the respective steps. As shown in Fig. 3, in each state the lower data bar corresponds to the static features and the upper data bar schematically shows a part of the dynamic features.
  • In the dynamic-feature part, each square represents one dimension, and the value in each square represents an attribute, such as the question type, of a question determined in the preceding decision steps.
  • As shown in the figure, before s_1 is input, no question of the question list has yet been determined, so the value in each square is 0.
  • Before s_2 is input, the model has already made its first decision based on the input s_1 and thus determined the first question of the question list; therefore, the dynamic features of s_2 can be determined based on that first question.
  • As shown in the figure, the first square of the dynamic features of s_2 holds the value 5, which represents, for example, the type identifier of the first question.
  • Similarly, the value 5 in the first square and the value 2 in the second square of the dynamic features of s_3 correspond to the types of the first and second questions in the corresponding question list.
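  • To make the state construction concrete, the following Python sketch builds a state vector of the kind shown in Fig. 3 from static user attributes plus one slot per list position; the field names and encodings are assumptions of the sketch, not taken from the specification.

```python
import numpy as np

# Illustrative sketch of the i-th state feature vector s_i: static user
# attributes concatenated with one slot per list position, where a filled
# position holds an attribute (here the question-type id) of the already-chosen
# question and the remaining positions stay 0.
def build_state_vector(user_static, chosen_questions, n_slots=6):
    dynamic = np.zeros(n_slots)
    for pos, question in enumerate(chosen_questions):   # the i-1 already-chosen objects
        dynamic[pos] = question["type_id"]               # e.g. 5 for the first question
    return np.concatenate([np.asarray(user_static, dtype=float), dynamic])

# Mirroring Fig. 3: before the second decision (i = 2) one question of type 5
# has been chosen, so the first dynamic slot is 5 and the rest are 0.
s_2 = build_state_vector([30.0, 1.0, 2.0], [{"type_id": 5}])
```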
  • Step S204: input the i-th state feature vector into the reinforcement learning model so that the reinforcement learning model outputs a weight vector corresponding to the i-th state feature vector, the weight vector including the respective weights of a predetermined number of ranking features.
  • As still shown in Fig. 3, in each decision step, after the state s_i is determined, s_i is input to the reinforcement learning model so that the model outputs the corresponding behavior (i.e. weight vector) a_i = {w_i0, w_i1, ..., w_im}, where i = 1, 2, ..., 6 and w_ij denotes the weight of the ranking feature f_ij; each circle in a weight vector a_i in the figure represents one dimension of that vector, i.e. one value w_ij. The ranking features f_ij, which are used to obtain each object's ranking score, are described in detail below.
  • The reinforcement learning model is, for example, a DDPG model, which is learned with a neural network.
  • The neural network includes a policy network and a value network.
  • In the embodiments of this specification, the policy network includes, for example, two fully connected layers.
  • In the policy network, a_i is calculated from s_i by the following formulas (1) and (2): a_i = μ(s_i) = tanh(W_2 H_i + b_2) (1), and H_i = tanh(W_1 s_i + b_1) (2),
  • where W_1, W_2, b_1 and b_2 are the parameters of the policy network.
  • Through the activation function tanh(), the value of each element w_ij of a_i is restricted to [-1, 1].
  • It can be understood that the above description is merely illustrative: the reinforcement learning model is not limited to the DDPG model, so a_i need not be obtained from s_i through a policy network; moreover, the structure of the policy network is not limited to using the activation function tanh, so the value of w_ij is not necessarily restricted to [-1, 1].
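  • The two-fully-connected-layer policy network of formulas (1) and (2) can be sketched as follows; the layer sizes and initialization are arbitrary assumptions, and only the tanh structure follows the text.

```python
import numpy as np

# Sketch of formulas (1) and (2): H_i = tanh(W_1 s_i + b_1), a_i = tanh(W_2 H_i + b_2).
class PolicyNetwork:
    def __init__(self, state_dim, hidden_dim, n_ranking_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(hidden_dim, state_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(scale=0.1, size=(n_ranking_features, hidden_dim))
        self.b2 = np.zeros(n_ranking_features)

    def __call__(self, s_i):
        h_i = np.tanh(self.W1 @ s_i + self.b1)      # formula (2)
        return np.tanh(self.W2 @ h_i + self.b2)     # formula (1): each w_ij lies in [-1, 1]

policy = PolicyNetwork(state_dim=9, hidden_dim=16, n_ranking_features=3)
a_i = policy(np.zeros(9))   # weight vector for three ranking features
```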
  • Step S206: obtain the ranking feature vector of each object in the candidate object set corresponding to that group of object lists, the ranking feature vector including the respective feature values of the predetermined number of ranking features.
  • As described above, with the object being a question to be asked, a predetermined number of questions can be preset as the candidate question set before s_1 is input for the model's first decision. After s_1 is input, at least one group of question lists is determined according to the model's decision result; for such a group of question lists, its first question is already included, so when s_2 is input for the second decision, the candidate question set corresponding to that group of question lists is the set obtained by removing that first question from the predetermined number of questions. In each subsequent decision, the candidate question set corresponding to a group of question lists is determined in the same way, i.e. it is the set obtained by removing the questions already included in that group of question lists from the initially preset question set.
  • The ranking feature vector of object k in the model's i-th decision can be denoted f_{i,k}; its dimension is the same as that of the behavior vector a_i output by the model, and its components correspond to the respective ranking features of the object.
  • The ranking features may be determined based on the factors affecting object ranking in a specific scene. For example, where the objects are questions asked in a customer-service scene, the ranking features include, for example, the user's estimated click-through rate in that scene, the current popularity of the question, and the diversity of the question.
  • The estimated click-through rate can be obtained through an existing click-through-rate estimation model (CTR model) based on, for example, the user's historical click behavior and user characteristics.
  • The estimated click-through rate reflects the user's preference, the question popularity reflects the real-time popularity of questions, and the question diversity reflects the diversity of the recommended questions.
  • For example, before the model makes its i-th decision, a first group of question lists has already been determined and the candidate question set corresponding to that first group of question lists includes a first question; the diversity feature value of the first question is then determined based on the types of the i-1 questions already in that group of question lists.
  • If the type of the first question is not among the types of those i-1 questions, its diversity feature value may be determined to be 1; if its type is already included among them, its diversity feature value may be determined to be 0.
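  • The 0/1 diversity rule just described can be sketched as a one-line check; the type names used are made up for illustration.

```python
# Minimal sketch of the diversity feature: 1 if the candidate's type is not yet
# present among the i-1 questions already in the list, 0 otherwise.
def diversity_feature(candidate_type, chosen_types):
    return 1.0 if candidate_type not in chosen_types else 0.0

assert diversity_feature("huabei", {"shopping"}) == 1.0    # new type -> diverse
assert diversity_feature("shopping", {"shopping"}) == 0.0  # type already present
```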
  • Step S208: calculate the score of each object in the candidate object set based on the dot product of the ranking feature vector of that object and the weight vector.
  • After the weight vector of the i-th decision and the ranking feature vectors of the objects in the candidate object set have been obtained through the above steps, the following formula (3) can be used, for example, to calculate the ranking score of question k in the i-th decision, namely the dot product of a_i and f_{i,k}.
  • Formula (3) is only one optional calculation method, and the calculation of the score is not limited to it; for example, the ranking feature vector and the weight vector may first be normalized and their dot product then taken as the score.
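  • The dot-product scoring of step S208 can be illustrated as follows; the weight vector and feature values are made-up numbers, with features ordered as (estimated CTR, popularity, diversity).

```python
import numpy as np

# Sketch of the scoring: score of candidate k in the i-th decision is the dot
# product of its ranking feature vector f_ik with the weight vector a_i.
a_i = np.array([0.8, 0.3, 0.5])                 # weights output by the policy network
candidates = {
    "q1": np.array([0.12, 0.9, 1.0]),           # [ctr, popularity, diversity]
    "q2": np.array([0.30, 0.4, 0.0]),
    "q3": np.array([0.05, 0.7, 1.0]),
}
scores = {k: float(np.dot(f_ik, a_i)) for k, f_ik in candidates.items()}
best = max(scores, key=scores.get)              # greedy choice for this list
```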
  • Step S210: for the M groups of object lists, determine updated M groups of object lists based on the scores of the objects in the M candidate object sets corresponding to the M groups of object lists, where each group of object lists in the updated M groups of object lists includes i objects.
  • When determining the object lists from the objects' scores, the determination can be made through a greedy search or a beam search.
  • When the greedy search method is adopted, in each decision of the model only the object with the highest score in the candidate object set is selected as the next object of the push object list. Fig. 4 schematically shows the process of determining the push object list in the system shown in Fig. 1 when the greedy search method is adopted; the figure includes the model unit 11 and the ranking unit 13 of Fig. 1. Initially, no object of the object list has been determined in the ranking unit; at this point the object list can be regarded as including 0 objects.
  • Based on that object list including 0 objects, the state s_1 of the first decision is determined and input to the model unit; the reinforcement learning model of the model unit obtains the behavior a_1 based on s_1, and the ranking unit 13 obtains the score of each object in the candidate object set based on a_1, so that the object with the highest score is determined as the first object of the object list. After the first object is determined, the state s_2 of the model's second decision can be determined based on it.
  • Similarly, by inputting s_2 into the model unit, the behavior a_2 is obtained, the score of each object in the candidate object set is then obtained based on a_2, and the second object of the object list is determined based on those scores, so that the state s_3 of the third decision can be determined based on the first and second objects of the object list. It can be understood that the candidate object set of the second decision differs from that of the first decision: it no longer includes the first object.
  • Each subsequent decision proceeds similarly. For example, after the model's fifth decision the behavior a_5 is determined, the score of each object in the corresponding candidate object set can be calculated, and the fifth object of the object list is determined; then, based on the five objects already in the object list, the state s_6 of the sixth decision is determined, the behavior a_6 is obtained by inputting s_6 into the model, and the sixth object of the object list is determined based on a_6.
  • Thus an object list including 6 objects can be determined through six decision-making processes of the model, and that object list can be pushed to the corresponding user, such as the first user, as the push object list.
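  • The greedy construction of Fig. 4 can be sketched as the following loop, where score_fn stands in for the full state-vector, policy-network and dot-product pipeline described above and is an assumption of the sketch.

```python
# Sketch of the greedy procedure: N decisions, each appending the
# highest-scoring remaining candidate to a single list.
def greedy_push_list(candidate_features, score_fn, n=6):
    push_list, remaining = [], dict(candidate_features)
    for _ in range(n):                                   # N decisions of the model
        scores = {k: score_fn(push_list, f) for k, f in remaining.items()}
        best = max(scores, key=scores.get)               # highest-scoring candidate
        push_list.append(best)
        remaining.pop(best)                              # it leaves the candidate set
    return push_list

# Toy usage: the score is just a fixed per-question value, ignoring the state.
demo = greedy_push_list({"q%d" % i: float(i) for i in range(1, 10)},
                        lambda chosen, value: value, n=6)
```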
  • When the beam search method is adopted with, for example, a beam width of 2, two groups of object lists are determined in each decision of the model.
  • Fig. 5 schematically shows the process of determining two groups of object lists by means of beam search.
  • As shown in the left part of the figure, in the model's first decision, after s_1 is input, the score of each object in the candidate object set can be calculated in the same way as in the greedy search method above, and the two objects with the top two scores (for example, object 1 and object 2) are taken as the first objects of the two groups of object lists respectively; the "s_1" on the left side of the two groups of object lists indicates that both are obtained based on the state s_1.
  • As shown in the right part of the figure, after the two object lists in the left part are obtained, new states s_21 and s_22 can be determined based on each object list. Similarly, the model's second decision can be performed based on the state s_21 and the state s_22 respectively, and two corresponding object lists can be determined for each, i.e. a total of four object lists in the right part of the figure. As shown in the figure, the upper two lists in the right part correspond to the state s_21, i.e. their first object is object 1, and the lower two lists correspond to the state s_22, i.e. their first object is object 2.
  • For each of these four object lists, the sum of the scores of its first and second objects can be calculated, and the two object lists with the top two score sums are taken as the two object lists determined in the second decision, e.g. the two object lists in the two dashed boxes in the figure.
  • For example, when the push object list is to be determined through 6 decision processes of the model, i.e. N=6, then in the 6th decision, after two object lists are obtained as described above (each including 6 determined objects), the object list with the higher sum of object scores can be pushed to the corresponding user as the push object list.
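  • The beam-search variant of Fig. 5 (beam width M=2) can be sketched as follows, again with score_fn standing in for the scoring pipeline; partial lists are compared by the cumulative score of their objects.

```python
# Sketch of beam search over object lists with beam width 2: every decision
# extends each kept list by every remaining candidate, then keeps the two
# lists with the highest cumulative score.
def beam_push_lists(candidate_features, score_fn, n=6, beam_width=2):
    beams = [([], 0.0)]                                   # (partial list, cumulative score)
    for _ in range(n):
        extensions = []
        for chosen, total in beams:
            for k, f in candidate_features.items():
                if k in chosen:                           # per-list candidate set excludes chosen objects
                    continue
                extensions.append((chosen + [k], total + score_fn(chosen, f)))
        extensions.sort(key=lambda x: x[1], reverse=True)
        beams = extensions[:beam_width]                   # keep the top-M lists
    return beams                                          # the best one is pushed to the user

lists = beam_push_lists({"q%d" % i: float(i) for i in range(1, 10)},
                        lambda chosen, value: value)
```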
  • After the push object list for a user (for example, the first user) is obtained as described above, the N objects in the list may be pushed to the first user in the order of the objects in the list, for example by displaying the N questions arranged in order in the customer-service interface, or by displaying the N questions one after another, and so on.
  • After the above push, the first user's feedback can be obtained, such as the first user's click on any of the N questions, satisfaction information submitted by the user, and so on.
  • In one example, a satisfaction button is displayed in the customer-service interface so that the user's satisfaction is reflected by the user's click. In the case where the user clicks the p-th object in the push object list, the reward value r_i corresponding to the model's i-th decision can be determined by formula (4), in which r' is set to 1 when the user clicks the satisfaction button and to 0 otherwise: if the user clicks the i-th question, r_i equals α_p when i≠N and α_p + r' when i=N (the model's last decision); if the user does not click the i-th question, r_i equals 0 when i≠N and r' when i=N.
  • After the push object list is obtained through the model's N decision processes and the reward values r_i of the decisions from i=1 to i=N are obtained based on that push object list, the N groups of data (s_i, a_i, s_{i+1}, r_i) corresponding to that push object list can be obtained, where s_{N+1} in the N-th group can be determined based on the N objects of the object list. The reinforcement learning model can then be trained based on these N groups of data.
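  • Putting the feedback together, the following sketch computes the per-decision reward implied by the description above and assembles the N training tuples (s_i, a_i, s_{i+1}, r_i); the position weights α_p are placeholders, since their concrete values are not given here.

```python
# Sketch of the reward described above: decision i earns a position weight
# alpha_p if the user clicked the i-th pushed question, and the satisfaction
# signal r' (1 if the satisfaction button was clicked, else 0) is added only at
# the last decision i = N. The alpha values below are assumed placeholders.
def reward(i, clicked_position, satisfied, n=6, alpha=None):
    alpha = alpha or {p: 1.0 for p in range(1, n + 1)}    # assumed position weights
    r = alpha[i] if clicked_position == i else 0.0
    if i == n:
        r += 1.0 if satisfied else 0.0                     # r' in the text
    return r

# One training tuple per decision; s_{N+1} is built from all N pushed objects.
transitions = [(f"s_{i}", f"a_{i}", f"s_{i+1}", reward(i, clicked_position=3, satisfied=True))
               for i in range(1, 7)]
```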
  • For example, in the case where the reinforcement learning model is a DDPG model, the neural network that implements the model includes, in addition to the policy network described above, a value network, and the parameters of the policy network and of the value network can each be tuned by, for example, gradient descent.
  • For example, with B denoting the set of the above N groups (s_i, a_i, s_{i+1}, r_i) and Ω denoting the parameters of the value network, Ω can be updated by formula (5), in which the target y(r_i, s_{i+1}) is obtained by formula (6).
  • In formula (6), Ω_tgt is the target parameter of the value network and Θ_tgt is the target parameter of the policy network, applied in a function of the form shown in formula (1); Ω_tgt and Θ_tgt are values obtained through soft updates, and after Ω is updated by formula (5), Ω_tgt can be updated from Ω by a soft update.
  • The target parameter Θ_tgt of the policy network can likewise be updated by gradient descent based on the above N groups of data and the output Q of the value network, which is not described in detail here.
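  • Since formulas (5) and (6) are given as images in the original filing and are not reproduced here, the following is a generic DDPG-style sketch of what the text describes: the value network is fit by gradient descent toward a target computed with the target networks Ω_tgt and Θ_tgt, which are in turn maintained by soft updates. Network sizes, γ and τ are assumptions of the sketch, not values from the specification.

```python
import copy
import torch

# Generic DDPG-style update sketch (not the specification's exact formulas).
state_dim, act_dim, gamma, tau = 9, 3, 0.95, 0.01
policy = torch.nn.Sequential(torch.nn.Linear(state_dim, 16), torch.nn.Tanh(),
                             torch.nn.Linear(16, act_dim), torch.nn.Tanh())
value = torch.nn.Sequential(torch.nn.Linear(state_dim + act_dim, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))
policy_tgt, value_tgt = copy.deepcopy(policy), copy.deepcopy(value)
value_opt = torch.optim.Adam(value.parameters(), lr=1e-3)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train_on_batch(batch):          # batch B of tuples (s_i, a_i, s_next, r_i) as tensors
    s, a, s_next, r = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():           # target y(r_i, s_{i+1}) from the target networks
        y = r.unsqueeze(1) + gamma * value_tgt(torch.cat([s_next, policy_tgt(s_next)], dim=1))
    critic_loss = ((value(torch.cat([s, a], dim=1)) - y) ** 2).mean()
    value_opt.zero_grad(); critic_loss.backward(); value_opt.step()

    actor_loss = -value(torch.cat([s, policy(s)], dim=1)).mean()   # maximize Q
    policy_opt.zero_grad(); actor_loss.backward(); policy_opt.step()

    for net, tgt in ((value, value_tgt), (policy, policy_tgt)):    # soft updates
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```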
  • Fig. 6 shows an apparatus 6000 for determining a push object list for a user based on a reinforcement learning model according to an embodiment of the present specification.
  • For a first user, M groups of object lists have been determined in advance through the method, each group of object lists currently including i-1 objects, where M and i are both integers greater than or equal to 1 and i is less than or equal to a predetermined integer N. The device includes, for each group of object lists:
  • the first obtaining unit 601 is configured to obtain an i-th state feature vector, where the i-th state feature vector includes a static feature and a dynamic feature, wherein the static feature includes the attribute feature of the first user, and the The dynamic characteristics include the respective attribute characteristics of the i-1 objects in the group of object lists;
  • the input unit 602 is configured to input the i-th state feature vector into the reinforcement learning model, so that the reinforcement learning model outputs a weight vector corresponding to the i-th state feature vector, and the weight vector includes a predetermined The respective weights of the number of ranking features;
  • the second acquiring unit 603 is configured to acquire the ranking feature vector of each object in the candidate object set corresponding to the group of object lists, the ranking feature vector including the respective feature values of the predetermined number of ranking features;
  • the calculation unit 604 is configured to calculate the score of each object in the candidate object set based on the dot product of the ranking feature vector of each object in the candidate object set and the weight vector;
  • the first determining unit 605 is configured to, for the M groups of object lists, determine updated M groups of object lists based on the scores of the objects in the M candidate object sets corresponding to the M groups of object lists, where each group of object lists in the updated M groups of object lists includes i objects.
  • In one embodiment, the first determining unit is further configured to, based on the scores of the objects in the corresponding candidate object set, take the object with the highest score in the candidate object set as the i-th object of that group of object lists and take that group of object lists as the updated group of object lists.
  • In one embodiment, the first determining unit is further configured to determine the updated M groups of object lists through a beam search algorithm based on the scores of the objects in the M candidate object sets respectively corresponding to the M groups of object lists.
  • In one embodiment, i is equal to N, and the device further includes a second determining unit 606 configured to determine the push object list for the first user from the updated M groups of object lists through the beam search algorithm.
  • the device 6000 further includes:
  • the pushing unit 607 is configured to push each object to the first user in the order of the objects in the push object list to obtain feedback from the first user;
  • the fourth acquiring unit 609 is configured to acquire an (N+1)-th state feature vector, which includes static features and dynamic features, where the static features include the attribute features of the first user and the dynamic features include the respective attribute features of the N objects in the push object list; and
  • the training unit 610 is configured to train the reinforcement learning model based on the N sets of data respectively corresponding to the N cycles, so as to optimize the reinforcement learning model, where the N sets of data include the 1st to the N-th sets of data and the i-th set of data includes: the i-th state feature vector corresponding to the push object list, the weight vector corresponding to that i-th state feature vector, the (i+1)-th state feature vector corresponding to the push object list, and the reward value corresponding to the i-th cycle.
  • Another aspect of this specification provides a computer-readable storage medium on which a computer program is stored.
  • the computer program is executed in a computer, the computer is caused to execute any of the above methods.
  • Another aspect of this specification provides a computing device including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, any one of the above methods is implemented.
  • Compared with existing click-prediction classification models, the solution of determining the push object list for a user based on a reinforcement learning model according to the embodiments of this specification has the following advantages. First, in addition to the user's click-through rate, it also considers the position of the object the user clicks and other user feedback (such as whether the user is satisfied), and this additional information is reflected in the model's reward values.
  • Second, the reinforcement learning model according to the embodiments of this specification takes CTR-model scores and some real-time features as input, so the feature space is small and the model can be iterated and updated quickly; it combines real-time data from different sliding time windows for comprehensive scoring, so that, while making full use of the CTR model, real-time changes in the environment can be applied in time.
  • Finally, in the embodiments of this specification, the model state contains user, scene and hierarchical information, so that the diversity and exploratory nature of object pushing can be controlled; in addition, the model parameters according to the embodiments of this specification can be intervened in and adjusted as needed for data collection, user experience and guaranteeing effectiveness.
  • the steps of the method or algorithm described in the embodiments disclosed in this document can be implemented by hardware, a software module executed by a processor, or a combination of the two.
  • The software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Abstract

A method and device for determining a push object list for a user based on a reinforcement learning model. The method includes: for each group of object lists, obtaining an i-th state feature vector (S202); inputting the i-th state feature vector into the reinforcement learning model so that the reinforcement learning model outputs a weight vector corresponding to the i-th state feature vector (S204); obtaining a ranking feature vector of each object in the candidate object set corresponding to that group of object lists (S206); and calculating the score of each object in the candidate object set based on the dot product of its ranking feature vector and the weight vector (S208); and, for the M groups of object lists, determining updated M groups of object lists based on the scores of the objects in the M candidate object sets respectively corresponding to the M groups of object lists (S210), where each group of object lists in the updated M groups of object lists includes i objects.

Description

基于强化学习模型向用户推送对象的方法和装置 技术领域
本说明书实施例涉及机器学习领域,更具体地,涉及一种基于强化学习模型确定针对用户的推送对象列表的方法和装置。
背景技术
传统的客户服务是人力/资源密集型和耗时的,因此,构建能够自动回答用户面临问题的智能助手非常重要。最近,人们越来越关注如何用机器学习来更好地构建这样的智能助手。作为客户服务机器人的核心功能,用户意图预测旨在自动预测用户可能想要询问的问题,并向用户呈现候选问题以供其选择以减轻用户的认知负担。更具体地说,用户意图预测任务可以被视为前N项(Top N)推荐的任务,其中每个预定好的问题是一个意图类(class)。目前的现有方法将该任务视为一个多分类问题,在给定当前用户状态下预测用户最可能感兴趣的对象(item)列表,即问题列表。这些方法旨在最大化即时奖励,例如点击,而忽略了推荐列表中在前的推荐对象对在后的推荐对象的影响。
因此,需要一种更有效的向用户推送一组对象列表的方案。
发明内容
本说明书实施例旨在提供一种更有效的基于强化学习模型确定针对用户的推送对象列表的方案,以解决现有技术中的不足。
为实现上述目的,本说明书一个方面提供一种基于强化学习模型确定针对用户的推送对象列表的方法,其中,对于第一用户,已预先通过所述方法确定有M组对象列表,每组对象列表中当前包括i-1个对象,其中,M、i都为大于等于1的整数,其中,i小于等于预定整数N,所述方法包括:
对于每组对象列表,
获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括该组对象列表中所述i-1个对象各自的属性特征;
将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出 与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重;
获取与该组对象列表对应的候选对象集合中各个对象的排序特征向量,所述排序特征向量包括所述预定数目的排序特征各自的特征值;以及
基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数;以及
对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
在一个实施例中,所述动态特征至少包括所述i-1个对象各自的以下属性特征:当前热度、对象标识、对象所属类型。
在一个实施例中,所述M组对象列表中包括第一组对象列表,与该第一组对象列表对应的候选对象集合中包括第一对象,与该第一对象对应的排序特征向量至少包括以下排序特征的值:所述第一用户对该第一对象的预估点击率、该第一对象的当前热度、该第一对象相对于所述第一组对象列表中的i-1个对象的多样性。
在一个实施例中,已预先通过所述方法确定有M组对象列表包括,已预先通过所述方法确定有一组对象列表,其中,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表包括,基于与该组对象列表对应的候选对象集合中各个对象的分数,以所述候选对象集合中分数最高的对象作为该组对象列表的第i个对象,并将该组对象列表作为更新的一组对象列表。
在一个实施例中,M大于等于2,其中,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表包括,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,通过集束搜索算法确定更新的M组对象列表。
在一个实施例中,i等于N,所述方法还包括,通过集束搜索算法,从所述更新的M组对象列表中确定针对所述第一用户的推送对象列表。
在一个实施例中,所述方法还包括,
以所述推送对象列表中各个对象的排列顺序,向所述第一用户推送所述各个对象, 以获取所述第一用户的反馈;
基于所述排列顺序和所述反馈获取N个回报值,所述N个回报值与对所述方法的从i=1至N的N次循环分别对应;
获取第N+1个状态特征向量,所述第N+1个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括所述推送对象列表中N个对象各自的属性特征;以及
基于与所述N次循环分别对应的N组数据训练所述强化学习模型,以优化所述强化学习模型,其中,所述N组数据包括第1至第N组数据,其中,第i组数据包括:与所述推送对象列表对应的第i个状态特征向量、与该第i个状态特征向量对应的权重向量、与所述推送对象列表对应的第i+1个状态特征向量、以及与第i次循环对应的回报值。
在一个实施例中,所述对象为询问问题,对于第1至N-1次循环中的第i次循环,与所述第i次循环对应的回报值基于所述第一用户的如下反馈获取:是否点击所述推送对象列表中的第i个问题。
在一个实施例中,与所述第N次循环对应的回报值基于所述第一用户的如下反馈获取:是否点击所述推送对象列表中的第N个问题、以及提交的满意度信息。
在一个实施例中,所述强化学习模型为基于深度确定策略梯度算法的模型。
本说明书另一方面提供一种基于强化学习模型确定针对用户的推送对象列表的装置,其中,对于第一用户,已预先通过所述方法确定有M组对象列表,每组对象列表中当前包括i-1个对象,其中,M、i都为大于等于1的整数,其中,i小于等于预定整数N,所述装置包括:
用于每组对象列表的,
第一获取单元,配置为,获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括该组对象列表中所述i-1个对象各自的属性特征;
输入单元,配置为,将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重;
第二获取单元,配置为,获取与该组对象列表对应的候选对象集合中各个对象的排 序特征向量,所述排序特征向量包括所述预定数目的排序特征各自的特征值;以及
计算单元,配置为,基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数;以及
第一确定单元,配置为,对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
在一个实施例中,已预先通过所述方法确定有M组对象列表包括,已预先通过所述方法确定有一组对象列表,其中,所述第一确定单元还配置为,基于与该组对象列表对应的候选对象集合中各个对象的分数,以所述候选对象集合中分数最高的对象作为该组对象列表的第i个对象,并将该组对象列表作为更新的一组对象列表。
在一个实施例中,其中,M大于等于2,其中,所述第一确定单元还配置为,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,通过集束搜索算法确定更新的M组对象列表。
在一个实施例中,i等于N,所述装置还包括,第二确定单元,配置为,通过集束搜索算法,从所述更新的M组对象列表中确定针对所述第一用户的推送对象列表。
在一个实施例中,所述装置还包括,
推送单元,配置为,以所述推送对象列表中各个对象的排列顺序,向所述第一用户推送所述各个对象,以获取所述第一用户的反馈;
第三获取单元,配置为,基于所述排列顺序和所述反馈获取N个回报值,所述N个回报值与对所述方法的从i=1至N的N次循环分别对应;
第四获取单元,配置为,获取第N+1个状态特征向量,所述第N+1个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括所述推送对象列表中N个对象各自的属性特征;以及
训练单元,配置为,基于与所述N次循环分别对应的N组数据训练所述强化学习模型,以优化所述强化学习模型,其中,所述N组数据包括第1至第N组数据,其中,第i组数据包括:与所述推送对象列表对应的第i个状态特征向量、与该第i个状态特征向量对应的权重向量、与所述推送对象列表对应的第i+1个状态特征向量、以及与第i次循环对应的回报值。
本说明书另一方面提供一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行上述任一项方法。
本说明书另一方面提供一种计算设备,包括存储器和处理器,其特征在于,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现上述任一项方法。
通过根据本说明书实施例的基于强化学习模型确定针对用户的推送对象列表的方案旨在优化长期累积的混合奖励,例如最后的回报值可基于用户点击和用户满意度等多个维度获取,另外,策略函数可以随着问题火热度和用户行为模式的变化而不断动态更新和调整,从而有助于提升点击率。
附图说明
通过结合附图描述本说明书实施例,可以使得本说明书实施例更加清楚:
图1示出根据本说明书实施例的对象推送系统100的示意图;
图2所示方法为例如图1所示的模型单元11的一次决策过程;
图3示出根据本说明书实施例的模型的N(N=6)步决策过程;
图4示意示出当采用贪婪搜索方式时在图1所示系统中确定推送对象列表的过程;
图5示意示出通过集束搜索方式确定两组对象列表的过程;
图6示出根据本说明书实施例的一种基于强化学习模型确定针对用户的推送对象列表的装置6000。
具体实施方式
下面将结合附图描述本说明书实施例。
图1示出根据本说明书实施例的对象推送系统100的示意图。所述对象推送系统例如为问题预测系统,其使得用户联系客服时,可自动预测该用户可能想要询问的问题的问题列表,并在客服页面显示该问题列表,以提高用户的使用体验,并节省人工客服成本。可以理解,根据本说明书实施例的对象推送系统100不限于进行询问问题列表的推送,而可以用于推送各种对象的列表,如商品、影视作品、新闻等等。如图1所示,系统100包括模型单元11、训练单元12和排序单元13。所述模型单元11例如包括神经网络,以用于实施强化学习算法,在本说明书实施例中,可使用各种强化学习模型,如 基于如下任一算法DDPG、DPG、Actor-critic的模型等等,在此不一一列出,下文中将以DDPG算法为例进行描述。
在通过系统100进行例如问题列表推送的情况中,通过依次向模型单元11输入连续的N个状态(s 1,s 2…s N),而最终在排序单元13获取包括N个问题的推送问题列表。其中,例如,在输入s 1的情况中,模型单元11基于s 1输出相应的行为a 1,在排序单元13中,基于a 1和候选问题各自的排序特征对各个问题打分,并基于各个问题的分数,确定推送问题列表的第一个问题。这里可通过贪婪搜索的算法确定第一问题,可以理解,本说明书实施例不限于此,例如还可以采用集束搜索的算法确定。在确定了上述第一问题之后,也相应地确定了环境的第二个状态s 2,即环境的当前状态与用户的特征和已经确定的推送问题列表中的问题相关,而在确定第二个状态s 2之后,可相应地确定行为a 2,及推送问题列表中的第二个问题。从而,在推送问题列表被预设为包括N个问题的情况中,可通过模型的N次决策过程,获取包括N个问题的推送问题列表。
在获取上述推送问题列表之后,通过将该列表展示给用户,可获取用户的反馈,从而可基于该反馈获取模型的每次决策的回报值r i。从而可在训练模型12中,基于上述各个状态和行为以及各个回报值(即N组(s i,a i,s i+1,r i)),训练所述强化学习模型,并将更新的参数传送给模型单元11,以对其进行更新。
下面将对上述模型决策过程和模型训练过程进行详细描述。
图2示出根据本说明书实施例的一种基于强化学习模型确定针对用户的推送对象列表的方法,其中,对于第一用户,已预先通过所述方法确定有M组对象列表,每组对象列表中当前包括i-1个对象,其中,M、i都为大于等于1的整数,其中,i小于等于预定整数N,所述方法包括:
对于每组对象列表,
步骤S202,获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括该组对象列表中所述i-1个对象各自的属性特征;
步骤S204,将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重;
步骤S206,获取与该组对象列表对应的候选对象集合中各个对象的排序特征向量, 所述排序特征向量包括所述预定数目的排序特征各自的特征值;以及
步骤S208,基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数;以及
步骤S210,对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
图2所示方法为例如图1所示的模型单元11的一次决策过程,即向强化学习模型输入s 1,s 2…s N中任一状态以在排序问题列表中增加一个问题的过程。例如,将要对模型输入状态s i,其中1≤i≤N。如上文所述,在通过贪婪搜索算法对问题排序的情况中,在模型的分别基于s 1,s 2…s i-1的决策过程中,已确定有一组对象列表,该组对象列表中当前包括i-1个对象。在通过集束搜索的算法排序的情况中,例如将集束宽度预设为2,即M=2,从而,在模型的分别基于s 1,s 2…s i-1的决策过程中,已确定有两组对象列表,并且每组对象列表中当前包括i-1个对象。
下面详细描述该方法中的每个步骤。
其中,步骤S202-步骤S208是针对上述已有的M组对象列表中每组对象列表的步骤,即针对M组对象列表中的每组对象列表分别实施步骤S202-步骤S208。
首先,在步骤S202,获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括该组对象列表中所述i-1个对象各自的属性特征。
所述第i个状态特征向量即为上述状态s i,如上文所述,在将要实施该方法的当前,在预先确定的每组对象列表中当前包括i-1个对象,在本说明书实施例中,将s i设定为不仅与用户的静态特征相关,还与所述已经确定的i-1个对象相关,从而可在确定第i个对象的过程中,考虑列表中已有的对象的属性。其中,所述用户的静态特征例如为该用户的年龄、教育背景、地理位置等等。所述动态特征例如为所述i-1个对象各自的当前热度、对象标识(例如问题序号)、对象类型等。例如,所述对象为用户的询问问题,在对模型输入s1以执行模型的第一次决策之前,可预设预定数目的问题作为该次决策的候选问题集合。可根据多个用户在预定时段内对每个候选问题的提问次数,确定每个候选问题的热度。可预先对上述预定数目的问题进行分类,从而确定每个问题的类型,例如,在支付宝的客服系统中,问题的类型例如包括:关于花呗的问题、关于购物的问 题、热点问题等等。
图3示出根据本说明书实施例的模型的N(N=6)步决策过程,其中包括各步的输入状态s 1-s 6。如图3中所示,在各个状态中,下部的数据条对应于静态特征,上部的数据条示意示出动态特征的一部分。在动态特征部分中,每个方形表示动态特征部分中的一个维度,每个方形对应的数值表示前面各步决策中确定的各个问题的属性,例如问题类型。如图中所示,在输入s 1之前,还未确定问题列表中的问题,因此,各个方框中的数值为0,在输入s 2之前,模型已经基于输入的s 1进行了第一次决策从而确定了问题列表中的第一个问题,因此,可基于该第一个问题确定s 2的动态特征,如图中所示,s 2的动态特征的第一个方框对应于数值5,其例如表示第一个问题的类型标识,类似地,s 3的动态特征的第一个方框中的数值5和第二个方框中的数值2分别对应于相应问题列表中第一个问题和第二个问题的类型。
步骤S204,将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重。仍然如图3中所示,在每步决策中,在确定状态s i之后,通过将si输入强化学习模型,从而可使得模型输出相应的行为(即权重向量)a i={w i0,w i1,…,w im},其中,i=1,2,…6,其中w ij表示排序特征f ij的权重,如图中所示,每个权重向量a i中的圆形表示该向量的一个维度,即对应于一个w ij的值,其中,三个圆形表示j=3,所述排序特征f ij为各个对象的用于获取排序分数的特征,将在下文对其详细描述。
如上文所述,所述强化学习模型例如为DDPG模型,该模型通过基于神经网络进行学习而获取。所述神经网络中包括策略网络和价值网络,在本说明书实施例中,策略网络例如包括两层全连接层,在策略网络中通过如下公式(1)和(2)基于s i计算a i
a i=μ(s i)=tanh(W 2H i+b 2)        (1)
H i=tanh(W 1S i+b 1)                (2)
其中,W 1、W 2、b 1和b 2为该策略网络中的参数,通过激活函数tanh(),将a i各个元素w ij的值限制在[-1,1]之间。可以理解,上述描述仅仅是示意性的,所述强化学习模型不限于DDPG模型,从而不限于通过策略网络来基于s i获得a i,另外,所述策略网络的结构不限于使用激活函数tanh,从而w ij的值不必需限制在[-1,1]之间。
步骤S206,获取与该组对象列表对应的候选对象集合中各个对象的排序特征向量,所述排序特征向量包括所述预定数目的排序特征各自的特征值。
如上文所述,例如,所述对象为询问问题,在对模型输入s 1以执行模型的第一次决策之前,可预设预定数目的问题作为该次决策的候选问题集合。在输入s 1之后,将根据模型决策结果确定至少一组问题列表,对于其中的一组问题列表,其中已经包括该问题列表的第一个问题,从而,在对模型输入例如s 2,以进行第2步决策的过程中,与该组问题列表对应的候选问题集合为从上述预定数目的问题中移除上述第一个问题所获取的候选问题集合。在后续次的决策过程中,可同样地确定与该组问题列表对应的候选问题集合,即,所述候选问题集合为通过从初始预设的问题集合中移除该组问题列表中包括的问题所获取的问题集合。
可将模型的第i次决策中对象k的排序特征向量表示为
Figure PCTCN2020071699-appb-000001
该排序特征向量的维度与上述模型输出的行为向量a i的维度相同,与对象的各个排序特征分别对应。所述各个排序特征可基于具体场景中对对象排序的影响因素确定,例如,在所述对象为客服场景中的询问问题的情况中,所述排序特征例如包括:该场景中用户的预估点击率、该问题的当前热度和问题多样性。所述预估点击率可通过现有的点击率预估模型(CTR模型)基于例如用户的点击历史行为和用户特征等获取。其中,预估点击率用于体现用户的偏好,问题热度用于体现实时问题流行度,问题多样性用于体现推荐问题的多样性。例如,在模型进行第i步决策之前,当前已确定有第一组问题列表,与该第一组问题列表对应的候选问题集合中包括第一问题,从而该第一问题的问题多样性特征值基于该组问题列表中已有的i-1个问题的类型确定,例如,在所述i-1个问题的类型中不包括第一问题的类型的情况中,可将第一问题的多样性特征值确定为1,在所述i-1个问题的类型中已经包括第一问题的类型的情况中,可将第一问题的多样性特征值确定为0。
步骤S208,基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数。
在通过上述步骤获取了在第i次决策中的权重向量和候选对象集合中各个对象的排序特征向量之后,对于候选对象集合中的问题k,可通过例如如下公式(3)计算问题k在第i次决策中的排序分
Figure PCTCN2020071699-appb-000002
Figure PCTCN2020071699-appb-000003
可以理解,所述公式(3)仅仅是可选的一种计算方法,所述分数的计算不限于此,例如,还可以通过分别对排序特征向量和权重向量进行归一化,然后再通过对其进行点积获取相应的分数,等等。
步骤S210,对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
如上文所述,在根据各个对象的分数确定对象列表中,可通过贪婪搜索的方式或集束搜索的方式进行确定。
在采用贪婪搜索方式的情况中,在模型的每次决策中,仅选出候选对象集合中分数最高的对象作为推送对象列表的第一个对象。图4示意示出当采用贪婪搜索方式时在图1所示系统中确定推送对象列表的过程。如图4所示,图中包括图1中的模型单元11和排序单元13。在初始,排序单元中还未确定对象列表,此时,可认为对象列表中包括0个对象。基于该包括0个对象的对象列表,确定第一次决策的状态s 1,并将该状态s 1输入模型单元,模型单元中的强化学习模型基于该状态s 1获取行为a 1,在排序单元13中,基于a 1获取候选对象集合中各个对象的分数,从而将其中分数最高的对象确定为该对象列表中的第一个对象。在确定第一对象之后,可基于该第一个对象,确定模型的第二次决策的状态s 2,类似地,通过将s 2输入模型单元,从而获取行为a 2,继而基于a2获取候选对象集合中各个对象的分数,并基于分数确定该对象列表中的第二个对象,从而可基于该对象列表中的第一个对象和第二个对象,确定第三次决策的状态s 3,可以理解,在第二次决策中的候选对象集合与第一次决策中的候选对象集合已经不同,其中已经不包括所述第一个对象。类似地,后续的每次决策过程都可与前述决策过程类似地进行,例如,在模型的第5次决策过程之后,确定行为a 5,从而可计算相应的候选对象集合中每个对象的分数,从而确定该对象列表中的第5个对象,并继而基于该对象列表中已有的5个对象,确定第6次决策的状态s 6,通过将状态s 6输入模型,获取行为a 6,并基于a 6确定该对象列表中的第6个对象。从而可通过模型的六次决策过程确定包括6个对象的对象列表,并且可将该对象列表作为推送对象列表推送给相应的用户,如第一用户。
在采用集束搜索的方式的情况中,例如集束宽度为2,也就是说,在模型的每次决策中,将确定两组对象列表。图5示意示出通过集束搜索方式确定两组对象列表的过程。如图中左部所示,在进行模型的第一次决策中,在对模型输入s 1之后,可与上述贪婪搜索方式中相同地,计算候选对象集合中各个对象的分数,从而可获取分数排在前两位的两个对象(例如对象1和对象)分别作为两组对象列表中的第一个对象,该两组对象列表左侧的“s 1”用于指示其都是基于状态s 1获取的。如图中右部所示,在获取图中左部两 个对象列表之后,基于每个对象列表可确定新的状态s 21和s 22,类似地,可分别基于状态s 21和状态s 22进行模型的第二次决策,从而可分别确定相应的两个对象列表,即共确定图中右部的四个对象列表中,如图中所示,图中右部上面两个列表与状态s 21对应,即,其中第一个对象都为对象1,下面两个列表与状态s 22对应,即其中第一个对象都为对象2。在该四个对象列表中,可分别计算其中第一个对象和第二个对象的分数和,并取分数和排在前两位的两个对象列表作为在第二次决策中确定的两个对象列表,例如图中两个虚线框中的两个对象列表。例如将要通过模型的6次决策过程确定推送对象列表,即,N=6,在该情况中,在第6次决策中,在如上所述获取两个对象列表(该每个对象列表包括6个已确定的对象)之后,可将该两个对象列表中各个对象的分数和最高的对象列表作为推送对象列表推送给相应的用户。
在如上所述获取用户(例如第一用户)的推送对象列表之后,可以以该列表中的对象的排列顺序向第一用户推送该列表中的N个对象。例如在客服界面中展示顺序排列的N个问题,或者顺序展示所述N个问题等等。在对第一用户进行上述推送之后,可获取第一用户的反馈,如第一用户对N个问题中任一问题的点击、用户提交的满意度信息等等。在一个实例中,客服界面中显示有满意度按钮,以通过用户的点击体现用户的满意,在用户点击对象列表中推送对象列表中第p个对象的情况中,可通过如下公式(4)确定模型的第i次决策对应的回报值r i
Figure PCTCN2020071699-appb-000004
其中,当用户点击满意度按钮时,可将r’设置为1,否则将r’设置为0。也就是说,当用户点击第i个问题,当i≠N时,r i等于α p,当i=N(即模型的最后一次决策)时,r i等于α p+r′;当用户未点击第i个问题,当i≠N时,r i等于0,当i=N时,r i等于r′。
在通过模型的N次决策过程获取推送对象列表、并基于该推送对象列表获取模型的从i=1至i=N的各次决策的回报值r i之后,可获取与该组推送对象列表对应的N组(s i,a i,s i+1,r i),其中第N组数据中的s N+1可基于该对象列表中的N个对象确定。从而,可基于上述N组数据训练上述强化学习模型。例如,在所述强化学习模型为DDPG模型的情况中,实施模型计算的神经网络除了包括上文所述的策略网络之外,还包括价值网络,从而可根据例如梯度下降法分别对所述策略网络和价值网络进行调参。例如通过B表示上述N组(s i,a i,s i+1,r i)的集合,通过Ω表示价值网络的参数,则可通过如下的公式(5)更新Ω:
Figure PCTCN2020071699-appb-000005
其中,通过如下公式(6)获取公式(5)中的y(r i,s i+1):
Figure PCTCN2020071699-appb-000006
其中,Ω tgt为价值网络的目标参数,Θ tgt为策略网络的目标参数,
Figure PCTCN2020071699-appb-000007
为例如公式(1)所示的函数,所述Ω tgt、Θ tgt为基于软更新所获取的值,在上述通过公式(5)更新Ω之后,可通过软更新基于Ω更新Ω tgt。对策略网络的目标参数Θ tgt的更新也可通过梯度下降法基于上述N组数据和价值网络的输出Q进行,在此不再详述。
图6示出根据本说明书实施例的一种基于强化学习模型确定针对用户的推送对象列表的装置6000,其中,对于第一用户,已预先通过所述方法确定有M组对象列表,每组对象列表中当前包括i-1个对象,其中,M、i都为大于等于1的整数,其中,i小于等于预定整数N,所述装置包括:
用于每组对象列表的,
第一获取单元601,配置为,获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括该组对象列表中所述i-1个对象各自的属性特征;
输入单元602,配置为,将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重;
第二获取单元603,配置为,获取与该组对象列表对应的候选对象集合中各个对象的排序特征向量,所述排序特征向量包括所述预定数目的排序特征各自的特征值;以及
计算单元604,配置为,基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数;以及
第一确定单元605,配置为,对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
在一个实施例中,已预先通过所述方法确定有M组对象列表包括,已预先通过所述方法确定有一组对象列表,其中,所述第一确定单元还配置为,基于与该组对象列表对 应的候选对象集合中各个对象的分数,以所述候选对象集合中分数最高的对象作为该组对象列表的第i个对象,并将该组对象列表作为更新的一组对象列表。
在一个实施例中,其中,M大于等于2,其中,所述第一确定单元还配置为,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,通过集束搜索算法确定更新的M组对象列表。
在一个实施例中,i等于N,所述装置还包括,第二确定单元606,配置为,通过集束搜索算法,从所述更新的M组对象列表中确定针对所述第一用户的推送对象列表。
在一个实施例中,所述装置6000还包括,
推送单元607,配置为,以所述推送对象列表中各个对象的排列顺序,向所述第一用户推送所述各个对象,以获取所述第一用户的反馈;
第三获取单元608,配置为,基于所述排列顺序和所述反馈获取N个回报值,所述N个回报值与对所述方法的从i=1至N的N次循环分别对应;
第四获取单元609,配置为,获取第N+1个状态特征向量,所述第N+1个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括所述推送对象列表中N个对象各自的属性特征;以及
训练单元610,配置为,基于与所述N次循环分别对应的N组数据训练所述强化学习模型,以优化所述强化学习模型,其中,所述N组数据包括第1至第N组数据,其中,第i组数据包括:与所述推送对象列表对应的第i个状态特征向量、与该第i个状态特征向量对应的权重向量、与所述推送对象列表对应的第i+1个状态特征向量、以及与第i次循环对应的回报值。
本说明书另一方面提供一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行上述任一项方法。
本说明书另一方面提供一种计算设备,包括存储器和处理器,其特征在于,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现上述任一项方法。
通过根据本说明书实施例的基于强化学习模型确定针对用户的推送对象列表的方案,相比于现有的点击预测分类模型具有如下优势:首先,本说明书实施例的方案在考虑用户点击率之外,还考虑了用户点击对象的位置和用户的其它反馈(如用户是否满 意等),这些额外的信息体现在模型的回报值中;其次,根据本说明书实施例的强化学习模型以CTR模型打分及一些实时特征作为输入,特征空间小,模型的迭代更新可以很快,辅助不同滑动时间窗口的实时数据进行综合打分,在充分利用ctr模型的前提下,能及时应用环境的实时变化;最后在本说明书实施例中,模型状态中包含用户、场景及层次化的信息,从而可控制对象推送的多样性和探索性,另外,根据本说明书实施例的模型参数可根据在收集数据、用户体验和保证效果各方面的需要进行干预调节。
需要理解,本文中的“第一”,“第二”等描述,仅仅为了描述的简单而对相似概念进行区分,并不具有其他限定作用。
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
本领域普通技术人员应该还可以进一步意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执轨道,取决于技术方案的特定应用和设计约束条件。本领域普通技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
结合本文中所公开的实施例描述的方法或算法的步骤可以用硬件、处理器执轨道的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
以上所述的具体实施方式,对本发明的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本发明的具体实施方式而已,并不用于限定 本发明的保护范围,凡在本发明的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。

Claims (22)

  1. 一种基于强化学习模型确定针对用户的推送对象列表的方法,其中,对于第一用户,已预先通过所述方法确定有M组对象列表,每组对象列表中当前包括i-1个对象,其中,M、i都为大于等于1的整数,其中,i小于等于预定整数N,所述方法包括:
    对于每组对象列表,
    获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括该组对象列表中所述i-1个对象各自的属性特征;
    将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重;
    获取与该组对象列表对应的候选对象集合中各个对象的排序特征向量,所述排序特征向量包括所述预定数目的排序特征各自的特征值;以及
    基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数;以及
    对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
  2. 根据权利要求1所述的方法,其中,所述动态特征至少包括所述i-1个对象各自的以下属性特征:当前热度、对象标识、对象所属类型。
  3. 根据权利要求1所述的方法,其中,所述M组对象列表中包括第一组对象列表,与该第一组对象列表对应的候选对象集合中包括第一对象,与该第一对象对应的排序特征向量至少包括以下排序特征的值:所述第一用户对该第一对象的预估点击率、该第一对象的当前热度、该第一对象相对于所述第一组对象列表中的i-1个对象的多样性。
  4. 根据权利要求1所述的方法,其中,已预先通过所述方法确定有M组对象列表包括,已预先通过所述方法确定有一组对象列表,其中,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表包括,基于与该组对象列表对应的候选对象集合中各个对象的分数,以所述候选对象集合中分数最高的对象作为该组对象列表的第i个对象,并将该组对象列表作为更新的一组对象列表。
  5. 根据权利要求1所述的方法,其中,M大于等于2,其中,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表包括, 基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,通过集束搜索算法确定更新的M组对象列表。
  6. 根据权利要求5所述的方法,其中,i等于N,所述方法还包括,通过集束搜索算法,从所述更新的M组对象列表中确定针对所述第一用户的推送对象列表。
  7. 根据权利要求6所述的方法,还包括,
    以所述推送对象列表中各个对象的排列顺序,向所述第一用户推送所述各个对象,以获取所述第一用户的反馈;
    基于所述排列顺序和所述反馈获取N个回报值,所述N个回报值与对所述方法的从i=1至N的N次循环分别对应;
    获取第N+1个状态特征向量,所述第N+1个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括所述推送对象列表中N个对象各自的属性特征;以及
    基于与所述N次循环分别对应的N组数据训练所述强化学习模型,以优化所述强化学习模型,其中,所述N组数据包括第1至第N组数据,其中,第i组数据包括:与所述推送对象列表对应的第i个状态特征向量、与该第i个状态特征向量对应的权重向量、与所述推送对象列表对应的第i+1个状态特征向量、以及与第i次循环对应的回报值。
  8. 根据权利要求7所述的方法,其中,所述对象为询问问题,对于第1至N-1次循环中的第i次循环,与所述第i次循环对应的回报值基于所述第一用户的如下反馈获取:是否点击所述推送对象列表中的第i个问题。
  9. 根据权利要求8所述的方法,与所述第N次循环对应的回报值基于所述第一用户的如下反馈获取:是否点击所述推送对象列表中的第N个问题、以及提交的满意度信息。
  10. 根据权利要求7所述的方法,其中,所述强化学习模型为基于深度确定策略梯度算法的模型。
  11. 一种基于强化学习模型确定针对用户的推送对象列表的装置,其中,对于第一用户,已预先通过所述方法确定有M组对象列表,每组对象列表中当前包括i-1个对象,其中,M、i都为大于等于1的整数,其中,i小于等于预定整数N,所述装置包括:
    用于每组对象列表的,
    第一获取单元,配置为,获取第i个状态特征向量,所述第i个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特 征包括该组对象列表中所述i-1个对象各自的属性特征;
    输入单元,配置为,将所述第i个状态特征向量输入所述强化学习模型,以使得所述强化学习模型输出与该第i个状态特征向量对应的权重向量,所述权重向量包括预定数目的排序特征各自的权重;
    第二获取单元,配置为,获取与该组对象列表对应的候选对象集合中各个对象的排序特征向量,所述排序特征向量包括所述预定数目的排序特征各自的特征值;以及
    计算单元,配置为,基于所述候选对象集合中各个对象的排序特征向量与所述权重向量的点积,计算所述候选对象集合中各个对象的分数;以及
    第一确定单元,配置为,对于所述M组对象列表,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,确定更新的M组对象列表,其中,所述更新的M组对象列表中的每组对象列表包括i个对象。
  12. 根据权利要求11所述的装置,其中,所述动态特征至少包括所述i-1个对象各自的以下属性特征:当前热度、对象标识、对象所属类型。
  13. 根据权利要求11所述的装置,其中,所述M组对象列表中包括第一组对象列表,与该第一组对象列表对应的候选对象集合中包括第一对象,与该第一对象对应的排序特征向量至少包括以下排序特征的值:所述第一用户对该第一对象的预估点击率、该第一对象的当前热度、该第一对象相对于所述第一组对象列表中的i-1个对象的多样性。
  14. 根据权利要求11所述的装置,其中,已预先通过所述方法确定有M组对象列表包括,已预先通过所述方法确定有一组对象列表,其中,所述第一确定单元还配置为,基于与该组对象列表对应的候选对象集合中各个对象的分数,以所述候选对象集合中分数最高的对象作为该组对象列表的第i个对象,并将该组对象列表作为更新的一组对象列表。
  15. 根据权利要求11所述的装置,其中,M大于等于2,其中,所述第一确定单元还配置为,基于与所述M组对象列表分别对应的M个候选对象集合中各个对象的分数,通过集束搜索算法确定更新的M组对象列表。
  16. 根据权利要求15所述的装置,其中,i等于N,所述装置还包括,第二确定单元,配置为,通过集束搜索算法,从所述更新的M组对象列表中确定针对所述第一用户的推送对象列表。
  17. 根据权利要求16所述的装置,还包括,
    推送单元,配置为,以所述推送对象列表中各个对象的排列顺序,向所述第一用户推送所述各个对象,以获取所述第一用户的反馈;
    第三获取单元,配置为,基于所述排列顺序和所述反馈获取N个回报值,所述N个回报值与对所述方法的从i=1至N的N次循环分别对应;
    第四获取单元,配置为,获取第N+1个状态特征向量,所述第N+1个状态特征向量包括静态特征和动态特征,其中,所述静态特征包括所述第一用户的属性特征,所述动态特征包括所述推送对象列表中N个对象各自的属性特征;以及
    训练单元,配置为,基于与所述N次循环分别对应的N组数据训练所述强化学习模型,以优化所述强化学习模型,其中,所述N组数据包括第1至第N组数据,其中,第i组数据包括:与所述推送对象列表对应的第i个状态特征向量、与该第i个状态特征向量对应的权重向量、与所述推送对象列表对应的第i+1个状态特征向量、以及与第i次循环对应的回报值。
  18. 根据权利要求17所述的装置,其中,所述对象为询问问题,对于第1至N-1次循环中的第i次循环,与所述第i次循环对应的回报值基于所述第一用户的如下反馈获取:是否点击所述推送对象列表中的第i个问题。
  19. 根据权利要求18所述的装置,与所述第N次循环对应的回报值基于所述第一用户的如下反馈获取:是否点击所述推送对象列表中的第N个问题、以及提交的满意度信息。
  20. 根据权利要求17所述的装置,其中,所述强化学习模型为基于深度确定策略梯度算法的模型。
  21. 一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行权利要求1-10中任一项的所述的方法。
  22. 一种计算设备,包括存储器和处理器,其特征在于,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现权利要求1-10中任一项所述的方法。
PCT/CN2020/071699 2019-04-29 2020-01-13 基于强化学习模型向用户推送对象的方法和装置 WO2020220757A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/813,654 US10902298B2 (en) 2019-04-29 2020-03-09 Pushing items to users based on a reinforcement learning model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910355868.6A CN110263245B (zh) 2019-04-29 2019-04-29 基于强化学习模型向用户推送对象的方法和装置
CN201910355868.6 2019-04-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/813,654 Continuation US10902298B2 (en) 2019-04-29 2020-03-09 Pushing items to users based on a reinforcement learning model

Publications (1)

Publication Number Publication Date
WO2020220757A1 true WO2020220757A1 (zh) 2020-11-05

Family

ID=67914122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/071699 WO2020220757A1 (zh) 2019-04-29 2020-01-13 基于强化学习模型向用户推送对象的方法和装置

Country Status (2)

Country Link
CN (1) CN110263245B (zh)
WO (1) WO2020220757A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10902298B2 (en) 2019-04-29 2021-01-26 Alibaba Group Holding Limited Pushing items to users based on a reinforcement learning model
CN110263245B (zh) * 2019-04-29 2020-08-21 阿里巴巴集团控股有限公司 基于强化学习模型向用户推送对象的方法和装置
CN110766086B (zh) * 2019-10-28 2022-07-22 支付宝(杭州)信息技术有限公司 基于强化学习模型对多个分类模型进行融合的方法和装置
CN110866587B (zh) * 2019-11-07 2021-10-15 支付宝(杭州)信息技术有限公司 一种基于对话系统对用户问句提出反问的方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140297476A1 (en) * 2013-03-28 2014-10-02 Alibaba Group Holding Limited Ranking product search results
CN104869464A (zh) * 2015-05-08 2015-08-26 海信集团有限公司 一种生成推荐节目列表的方法及装置
CN108304440A (zh) * 2017-11-01 2018-07-20 腾讯科技(深圳)有限公司 游戏推送的方法、装置、计算机设备及存储介质
CN108805594A (zh) * 2017-04-27 2018-11-13 北京京东尚科信息技术有限公司 信息推送方法和装置
CN110263245A (zh) * 2019-04-29 2019-09-20 阿里巴巴集团控股有限公司 基于强化学习模型向用户推送对象的方法和装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230057A (zh) * 2016-12-09 2018-06-29 阿里巴巴集团控股有限公司 一种智能推荐方法及系统
CN108230058B (zh) * 2016-12-09 2022-05-13 阿里巴巴集团控股有限公司 产品推荐方法及系统
US11915152B2 (en) * 2017-03-24 2024-02-27 D5Ai Llc Learning coach for machine learning system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140297476A1 (en) * 2013-03-28 2014-10-02 Alibaba Group Holding Limited Ranking product search results
CN104869464A (zh) * 2015-05-08 2015-08-26 海信集团有限公司 一种生成推荐节目列表的方法及装置
CN108805594A (zh) * 2017-04-27 2018-11-13 北京京东尚科信息技术有限公司 信息推送方法和装置
CN108304440A (zh) * 2017-11-01 2018-07-20 腾讯科技(深圳)有限公司 游戏推送的方法、装置、计算机设备及存储介质
CN110263245A (zh) * 2019-04-29 2019-09-20 阿里巴巴集团控股有限公司 基于强化学习模型向用户推送对象的方法和装置

Also Published As

Publication number Publication date
CN110263245B (zh) 2020-08-21
CN110263245A (zh) 2019-09-20

Similar Documents

Publication Publication Date Title
WO2020220757A1 (zh) 基于强化学习模型向用户推送对象的方法和装置
Nassar et al. A novel deep multi-criteria collaborative filtering model for recommendation system
US10902298B2 (en) Pushing items to users based on a reinforcement learning model
CN109102127B (zh) 商品推荐方法及装置
US8190537B1 (en) Feature selection for large scale models
WO2019029046A1 (zh) 一种视频推荐方法及系统
CN109783738B (zh) 一种基于多相似度的双极限学习机混合协同过滤推荐方法
WO2021135562A1 (zh) 特征有效性评估方法、装置、电子设备及存储介质
CN108230058A (zh) 产品推荐方法及系统
JPWO2017159403A1 (ja) 予測システム、方法およびプログラム
CN110781409A (zh) 一种基于协同过滤的物品推荐方法
JP6311851B2 (ja) 共クラスタリングシステム、方法およびプログラム
JP2019113943A (ja) 情報提供装置、情報提供方法、およびプログラム
US20210366006A1 (en) Ranking of business object
Navgaran et al. Evolutionary based matrix factorization method for collaborative filtering systems
WO2020135642A1 (zh) 一种基于生成对抗网络的模型训练方法及设备
Chen et al. Reinforcement learning for user intent prediction in customer service bots
CN113158024A (zh) 一种纠正推荐系统流行度偏差的因果推理方法
WO2020065611A1 (en) Recommendation method and system and method and system for improving a machine learning system
US20210166131A1 (en) Training spectral inference neural networks using bilevel optimization
CN113330462A (zh) 使用软最近邻损失的神经网络训练
CN112734510B (zh) 基于融合改进模糊聚类和兴趣衰减的商品推荐方法
CN113449182A (zh) 一种知识信息个性化推荐方法及系统
Bharadhwaj Layer-wise relevance propagation for explainable recommendations
Wang et al. Multi‐feedback Pairwise Ranking via Adversarial Training for Recommender

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20797995

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20797995

Country of ref document: EP

Kind code of ref document: A1