CN111159558A - Recommendation list generation method and device and electronic equipment - Google Patents

Recommendation list generation method and device and electronic equipment

Info

Publication number
CN111159558A
CN111159558A (application CN201911409205.4A)
Authority
CN
China
Prior art keywords
user
recommendation list
reward
reinforcement learning
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911409205.4A
Other languages
Chinese (zh)
Other versions
CN111159558B (en)
Inventor
刘俊宏
张望舒
温祖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911409205.4A priority Critical patent/CN111159558B/en
Publication of CN111159558A publication Critical patent/CN111159558A/en
Application granted granted Critical
Publication of CN111159558B publication Critical patent/CN111159558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • G06F16/637Administration of user profiles, e.g. generation, initialization, adaptation or distribution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/638Presentation of query results
    • G06F16/639Presentation of query results using playlists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

One or more embodiments of the present specification provide a method, an apparatus, and an electronic device for generating a recommendation list. The method includes: acquiring user characteristics of a user; obtaining, according to the user characteristics and a pre-trained reinforcement learning model, a prediction result of the user clicking list items in a recommendation list; obtaining a click result in response to the user's click operations on the list items in the recommendation list; determining a reward score corresponding to the prediction result according to the prediction result and the click result; determining a benchmark reward score; and optimizing the reinforcement learning model with a policy gradient algorithm according to the benchmark reward score, the optimized reinforcement learning model being used to generate a recommendation list corresponding to the user.

Description

Recommendation list generation method and device and electronic equipment
Technical Field
One or more embodiments of the present specification relate to the field of computer technologies, and in particular, to a method and an apparatus for generating a recommendation list, and an electronic device.
Background
With the rapid development of internet technology, web servers provide users with many different types of online services, such as news, goods, pictures, video, audio, and documents. These services are typically pushed to users through a recommendation list: the list contains list items corresponding to different services, and a user can acquire or link to the corresponding service by clicking a list item in the recommendation list. How to implement personalized recommendation, that is, how to make the list items in the recommendation list match the interests of the user and thereby improve the pertinence and accuracy of the recommendation list, has become an important problem in the field of list recommendation.
Disclosure of Invention
In view of this, an object of one or more embodiments of the present specification is to provide a recommendation list generation method, apparatus, and electronic device.
In view of the above, one or more embodiments of the present specification provide a recommendation list generation method, including:
acquiring user characteristics of a user;
according to the user characteristics and a pre-trained reinforcement learning model, obtaining a prediction result of a user clicking a list item in a recommendation list;
responding to the clicking operation of the user on the list items in the recommendation list to obtain a clicking result;
determining a reward score corresponding to the prediction result according to the prediction result and the click result;
determining a benchmark reward score;
and optimizing the reinforcement learning model by adopting a strategy gradient algorithm according to the benchmark reward score, wherein the optimized reinforcement learning model is used for generating a recommendation list corresponding to the user.
In another aspect, one or more embodiments of the present specification further provide a recommendation list generation apparatus, including:
an acquisition module configured to acquire user characteristics of a user;
the prediction module is configured to obtain a prediction result of the user clicking a list item in the recommendation list according to the user characteristics and a pre-trained reinforcement learning model;
the response module is configured to respond to the clicking operation of the user on the list items in the recommendation list and obtain a clicking result;
a first determination module configured to determine a reward score corresponding to the predicted result according to the predicted result and the click result;
a second determination module configured to determine a benchmark reward score;
an optimization module configured to optimize the reinforcement learning model using a policy gradient algorithm according to the benchmark reward score, the optimized reinforcement learning model being used to generate a recommendation list corresponding to the user.
In yet another aspect, one or more embodiments of the present specification further provide an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method as described in any one of the above when executing the program.
As can be seen from the foregoing, the method, the apparatus, and the electronic device for generating a recommendation list provided in one or more embodiments of the present disclosure generate a personalized recommendation list based on a reinforcement learning technique. When determining the reward score for the reinforcement learning model, the recommendation list is treated as a whole and the prediction is compared with the user's actual clicks on the recommendation list. Because the interrelation among the list items in the recommendation list is considered as a whole, the reward better reflects the user's actual clicks on the recommendation list, the accuracy of the prediction result is significantly improved, and more targeted and accurate recommendation for the user can be realized.
Drawings
In order to more clearly illustrate the technical solutions of one or more embodiments of the present disclosure or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the disclosure, and that other drawings may be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a typical reinforcement learning system;
FIG. 2 is a schematic diagram of an implementation scenario of one or more embodiments of the present description;
FIG. 3 is a flow diagram of a method for generating a recommendation list according to one embodiment of the present disclosure;
FIG. 4 is a flow chart of the step of determining the reward score in one embodiment of the present description;
FIG. 5 is a flow chart of the step of determining a benchmark reward score in one embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a recommendation list generation apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
In order to provide a recommendation list to a user in a personalized manner, so that the list items in the recommendation list are of interest to or needed by the user, that is, to improve the pertinence and accuracy of the recommendation list, one or more embodiments of the present specification provide a recommendation list generation method, apparatus, and electronic device in which the recommendation list pushed or displayed to the user is generated based on a reinforcement learning technique. As known to those skilled in the art, reinforcement learning is a label-free learning method based on feedback from sequential behaviors, in which a strategy is learned through continuous trial and error.
Fig. 1 is a schematic diagram of a typical reinforcement learning system. As shown in fig. 1, a reinforcement learning system generally includes a model (also called an agent) and an environment; the model continuously learns and optimizes its strategy through interaction with and feedback from the environment. Specifically, the model observes the state of the environment and, according to a certain strategy, determines the action to take for the current state. This action acts on the environment, changes its state, and produces feedback to the model, also known as a reward score (reward). From the obtained reward score, the model judges whether the previous action was correct and whether the strategy needs to be adjusted, and then updates the strategy. By repeatedly observing states, determining actions, and receiving feedback, the model continuously updates its strategy, with the final goal of learning a strategy that maximizes the accumulated reward.
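To make the loop concrete, the following is a minimal sketch (illustrative only, not part of the claimed method) of the observe-act-reward cycle described above; the env and policy objects, their method names, and the update rule are placeholders standing in for the environment and the model (agent) of Fig. 1.

def run_episode(env, policy, steps=10):
    """Minimal observe -> act -> reward loop of a reinforcement learning system.

    `env` is assumed to expose observe() and step(action) -> reward; `policy`
    is assumed to expose act(state) and update(state, action, reward).
    """
    total_reward = 0.0
    for _ in range(steps):
        state = env.observe()                  # model observes the environment state
        action = policy.act(state)             # model chooses an action under its current strategy
        reward = env.step(action)              # the action changes the environment, which feeds back a reward
        policy.update(state, action, reward)   # the strategy is adjusted according to the feedback
        total_reward += reward
    return total_reward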
Fig. 2 is a schematic diagram of an implementation scenario of one or more embodiments of the present specification.
In one or more embodiments of the present description, based on the reinforcement learning technique, when a user clicks a list item in the recommendation list, this indicates that the user is interested in or needs the clicked list item, and the user characteristics of that user are obtained accordingly. The user characteristics are taken as the state s of the environment in the reinforcement learning system, and the model's prediction of which list items the user will click is taken as the action a. The difference between the user's actual clicks on the list items in the recommendation list and the prediction given by the model is then used as the feedback reward score r, which serves as a learning sample for updating the prediction strategy. By repeating this learning and continuously exploring through trial and error, the prediction strategy is optimized. The optimized strategy can accurately predict which list items in the recommendation list the user will click, and the recommendation list is pushed or displayed to the user according to that prediction.
Hereinafter, the technical means of the present specification will be described by way of specific examples.
One embodiment of the present specification provides a method of generating a recommendation list. It is to be understood that the method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities.
Referring to fig. 3, the method for generating a recommendation list includes the following steps:
and S301, acquiring user characteristics of the user.
This embodiment is based on an existing recommendation list that includes a plurality of list items, each of which is an operable object that, when clicked, links to or retrieves the corresponding service content. For example, for music service software, the recommendation list may be a song recommendation list in which the list items correspond to songs; for service consultation software, the recommendation list may be a navigation list of an intelligent customer service, and the list items correspond to service descriptions.
In this embodiment, the user refers to any user to whom the recommendation list is pushed or displayed. The user characteristics of the user are features that can characterize the user. Specifically, the user characteristics may include at least one of portrait characteristics, behavior trace characteristics, and service usage characteristics.
The portrait characteristics may specifically include age, gender, place of residence, education, occupation, and the like. The behavior trace characteristics may be operation trace information, i.e., which operations the user performed before clicking the recommendation list, or page trace information, i.e., which pages the user browsed before clicking the recommendation list. The service usage characteristics reflect how the user uses the services provided by the software to which the recommendation list belongs, for example whether the user has activated a membership, enabled a payment function, or uses a cloud disk.
Depending on the specific implementation requirements, the user characteristics that are used, and the specific content of each characteristic, may differ from those listed above.
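Purely as an illustration of what such characteristics might look like in code (the feature names, vocabulary, and encoding below are assumptions, not prescribed by this specification), the three kinds of characteristics could be flattened into a single numeric state vector:

import numpy as np

# Hypothetical raw user characteristics: portrait, behavior trace, service usage.
user = {
    "age": 31,
    "gender": "female",
    "recent_pages": ["home", "search", "order_help"],  # behavior trace
    "is_member": True,                                  # service usage
    "uses_cloud_disk": False,
}

def encode_user(u, page_vocab=("home", "search", "order_help", "payment")):
    """Flatten the characteristics into the numeric vector used as the state s."""
    portrait = [u["age"] / 100.0, 1.0 if u["gender"] == "female" else 0.0]
    trace = [1.0 if p in u["recent_pages"] else 0.0 for p in page_vocab]
    usage = [float(u["is_member"]), float(u["uses_cloud_disk"])]
    return np.array(portrait + trace + usage)

state = encode_user(user)  # plays the role of the environment state s in Fig. 1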
In addition, since the method of this embodiment is intended to generate a recommendation list personalized for the user, in the remainder of this embodiment "the user" refers to the user corresponding to the extracted user characteristics; this will not be repeated below.
Step S302, obtaining a prediction result of the user clicking list items in the recommendation list according to the user characteristics and a pre-trained reinforcement learning model.
In this embodiment, a pre-trained reinforcement learning model already exists. The reinforcement learning model may have been trained with any algorithm and can, for a recommendation list, predict the user's clicks on its list items according to the user's characteristics. This embodiment aims to optimize the reinforcement learning model so that it makes more accurate predictions.
In this step, the user characteristics obtained in the previous step are input into the reinforcement learning model, which outputs a prediction result of the user clicking list items in the recommendation list; for a given recommendation list, the prediction result indicates the list items that the user corresponding to the user characteristics is likely to click.
Specifically, the prediction result may be represented by a vector. For example, suppose the recommendation list includes six list items in sequence. According to the user characteristics, the prediction result output by the reinforcement learning model indicates that the user will click the second and third list items in the recommendation list, so the representation vector of the prediction result is (0, 1, 1, 0, 0, 0). This representation vector contains six elements, one for each of the six list items.
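As a sketch of how such a prediction vector might be produced (the linear-plus-sigmoid policy and the 0.5 threshold below are assumptions; the specification does not fix the model architecture), the model can output a click probability per list item and threshold it into a 0/1 vector:

import numpy as np

rng = np.random.default_rng(0)

def predict_clicks(state, weights, threshold=0.5):
    """Toy policy: for each of the six list items, a sigmoid over a linear score
    gives the probability that the user clicks it; thresholding yields a 0/1
    prediction vector such as (0, 1, 1, 0, 0, 0)."""
    logits = weights @ state                      # weights shape: (6, len(state))
    probs = 1.0 / (1.0 + np.exp(-logits))
    prediction = (probs >= threshold).astype(int)
    return prediction, probs

state = rng.normal(size=8)         # placeholder user-feature vector
weights = rng.normal(size=(6, 8))  # placeholder model parameters
prediction, probs = predict_clicks(state, weights)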
Step S303, obtaining a click result in response to the user's click operations on the list items in the recommendation list.
In the previous step, the user's clicks on the recommendation list were predicted by the reinforcement learning model, yielding a prediction result. This step considers how the user actually clicks the recommendation list. Specifically, after the recommendation list is pushed or displayed to the user, the user's clicks on it are detected and recorded to obtain a click result, which indicates which list items in the recommendation list were actually clicked by the user.
Specifically, the click result may be represented by a vector. For example, for the same recommendation list of six ordered list items, the user receives the recommendation list and clicks the second and fourth list items, indicating interest in the services or content corresponding to those items. Accordingly, the representation vector of the click result is (0, 1, 0, 1, 0, 0).
Step S304, determining a reward score corresponding to the prediction result according to the prediction result and the click result.
In this step, the reward score for the prediction result given by the reinforcement learning model is determined. Specifically, the reward score is determined by judging the degree of similarity between the prediction result and the click result obtained in the preceding steps. The closer the prediction result is to the click result, the more accurate the prediction is and the higher the reward score; conversely, a less accurate prediction result receives a lower reward score.
As an alternative embodiment, referring to fig. 4, the reward score may be determined by the following steps:
Step S401, respectively determining a prediction result representation vector and a click result representation vector according to the prediction result and the click result;
Step S402, calculating the vector distance between the prediction result representation vector and the click result representation vector, and determining the reward score according to the vector distance.
Specifically, the vectors representing the prediction result and the click result are first determined, and then the vector distance between the two vectors is calculated; this distance reflects the similarity of the two vectors, and the reward score is determined from it.
For the vector distance, the cosine distance between the prediction result representation vector and the click result representation vector may be chosen and calculated. For example, for the prediction result representation vector (0, 1, 1, 0, 0, 0) and the click result representation vector (0, 1, 0, 1, 0, 0) in the foregoing example, the computed cosine distance is 0.5. Based on the nature of the cosine distance, the calculated values range from 0 to 2, with larger values indicating that the two vectors are closer. If desired, the computed cosine distance between the prediction result representation vector and the click result representation vector can be used directly as the reward score.
Of course, the vector distance between the prediction result representation vector and the click result representation vector may also be calculated with other vector distance measures, such as Euclidean distance, Manhattan distance, Chebyshev distance, Mahalanobis distance, Hamming distance, and Jaccard distance. For a given vector distance measure, the calculation result can be mathematically transformed accordingly, so that similar vectors correspond to a higher reward score and dissimilar vectors to a lower one.
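A minimal sketch of steps S401-S402: the code below uses the plain cosine similarity between the two 0/1 vectors, which evaluates to 0.5 for the example vectors above, and uses it directly as the reward score; treating the similarity itself as the reward (rather than a rescaled distance) is an assumption, since the specification also allows other distance measures and transformations.

import numpy as np

def reward_from_vectors(prediction, clicks):
    """Reward score from the similarity between the prediction representation
    vector and the click result representation vector; closer vectors give a
    higher reward."""
    p = np.asarray(prediction, dtype=float)
    c = np.asarray(clicks, dtype=float)
    if not p.any() or not c.any():
        return 0.0  # guard against all-zero vectors (no predicted or actual clicks)
    return float(p @ c / (np.linalg.norm(p) * np.linalg.norm(c)))

reward = reward_from_vectors([0, 1, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0])  # -> 0.5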
Besides determining the reward score through the vector distance as described above, the reward score may also be obtained in other ways, for example by computing it with the reward functions used by different reinforcement learning training algorithms, such as the SARSA algorithm, evolutionary strategy algorithms, the Q-learning algorithm, and the cross-entropy method.
Step S305, determining a benchmark reward score.
In this step, a benchmark reward score to be used in the subsequent optimization of the reinforcement learning model is determined. This is based on the baseline technique in policy gradient algorithms: while training the reinforcement learning model, in each iteration a reference value is subtracted from the reward score obtained in that iteration, which reduces the variance and speeds up convergence.
The benchmark reward score in this embodiment is that reference value. Specifically, following the baseline technique in policy gradient algorithms, the benchmark reward score may simply be set to an empirical value, i.e., a constant.
Further, referring to FIG. 5, the benchmark reward score may also be determined by the following steps:
Step S501, obtaining a plurality of historical reward scores, each corresponding to a historical prediction result output by the reinforcement learning model;
Step S502, taking the average of the historical reward scores as the benchmark reward score.
Specifically, historical prediction results output by the reinforcement learning model are obtained; these are prediction results that the reinforcement learning model has produced for different users for a recommendation list. For each of these prediction results, a reward score is determined using any of the reward score determination methods above. The average of these historical reward scores is then calculated and used as the benchmark reward score.
Determining the benchmark reward score from historical reward scores allows the reinforcement learning model, in the subsequent optimization based on the policy gradient algorithm, to be compared against the reward scores it has obtained historically, so that better adaptive optimization can be achieved.
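A sketch of steps S501-S502; the rolling window is an added assumption, since the specification only requires averaging a number of historical reward scores:

from collections import deque

class BaselineTracker:
    """Keeps recent reward scores and exposes their mean as the benchmark reward score."""

    def __init__(self, window=1000):
        self.history = deque(maxlen=window)  # historical reward scores

    def add(self, reward):
        self.history.append(reward)

    def benchmark(self, default=0.0):
        # average of the historical reward scores (or a default before any are seen)
        return sum(self.history) / len(self.history) if self.history else default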
Step S306, optimizing the reinforcement learning model with a policy gradient algorithm according to the benchmark reward score, where the optimized reinforcement learning model is used to generate a recommendation list corresponding to the user.
In this step, the reinforcement learning model is optimized with a policy gradient algorithm according to the benchmark reward score determined in the previous step. As described above, in each iteration of optimizing the reinforcement learning model, the benchmark reward score is subtracted from the reward score obtained in that iteration, implementing the baseline technique of the policy gradient algorithm; the benchmark reward score effectively represents the average level of the reward scores obtained by different actions. Specifically, when the difference between the reward score and the benchmark reward score is positive, the reward obtained by the current action is above average, and the reinforcement learning model reinforces the current prediction result, i.e., it outputs the corresponding action as the prediction result with higher probability; conversely, when the difference is negative, the reward obtained by the current action is below average, and the reinforcement learning model suppresses the current prediction result, i.e., it outputs the corresponding action as the prediction result with lower probability.
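The following is a sketch of one update consistent with this description, written for the toy sigmoid policy sketched after step S302: the advantage (reward score minus benchmark reward score) scales the gradient of the log-probability of the taken prediction, so above-average predictions are reinforced and below-average ones suppressed. Treating the list items as independent Bernoulli outputs and the specific learning rate are assumptions.

import numpy as np

def policy_gradient_step(weights, state, prediction, reward, benchmark, lr=0.01):
    """One REINFORCE-style update with a baseline for a sigmoid click policy.

    advantage > 0 -> push the click probabilities toward the taken prediction;
    advantage < 0 -> push them away from it."""
    logits = weights @ state
    probs = 1.0 / (1.0 + np.exp(-logits))
    advantage = reward - benchmark                # reward score minus benchmark reward score
    # gradient of sum_i log p(prediction_i | state) w.r.t. the logits (Bernoulli outputs)
    grad_logits = np.asarray(prediction, dtype=float) - probs
    weights += lr * advantage * np.outer(grad_logits, state)
    return weights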
The optimized reinforcement learning model can then be used to generate a recommendation list corresponding to the user, i.e., a personalized recommendation list whose list items the user is more likely to find interesting or useful.
Specifically, the prediction result output by the reinforcement learning model of this embodiment describes the user's clicks on list items. Accordingly, at least one list item is determined from the prediction result output by the optimized reinforcement learning model; the services or content corresponding to these list items are what the user is predicted to be interested in or to need. A recommendation list is then generated from the determined list items. For example, the generated recommendation list may contain only the determined list items; or the determined list items may be placed at the front of the list with the other list items filled in behind them; or the determined list items may be emphasized in the generated recommendation list, for example by highlighting, flashing, or enlarging the font.
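A small sketch of one of the presentation strategies mentioned above, namely placing the predicted list items at the front and filling the remaining items behind them (item names and the helper name are illustrative only):

def build_recommendation_list(all_items, predicted_clicks):
    """Put the items the optimized model predicts the user will click at the
    front, and fill the remaining items behind them."""
    preferred = [item for item, hit in zip(all_items, predicted_clicks) if hit]
    others = [item for item, hit in zip(all_items, predicted_clicks) if not hit]
    return preferred + others

items = ["song A", "song B", "song C", "song D", "song E", "song F"]
print(build_recommendation_list(items, [0, 1, 1, 0, 0, 0]))
# -> ['song B', 'song C', 'song A', 'song D', 'song E', 'song F']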
As can be seen, the method for generating a recommendation list according to this embodiment generates a personalized recommendation list based on a reinforcement learning technique. When determining the reward score for the reinforcement learning model, the recommendation list is treated as a whole and the prediction is compared with the user's actual clicks on the recommendation list. Because the interrelation among the list items in the recommendation list is considered as a whole, the reward better reflects the user's actual clicks on the recommendation list, the accuracy of the prediction result is significantly improved, and more targeted and accurate recommendation for the user can be realized.
As an alternative embodiment, in the method for generating the recommendation list, the benchmark reward score may also be produced by a pre-trained reward estimation model. Specifically, the user characteristics are input into the pre-trained reward estimation model, and the reward estimation model outputs the benchmark reward score.
The reward estimation model may, like the reinforcement learning model, be obtained by training with any algorithm; in this case only the reward score output by its reward function is taken as the benchmark reward score of this embodiment. Alternatively, the reward estimation model may be a classification model trained with any algorithm on sample data consisting of user characteristics and reward scores; given user characteristics as input, it outputs the corresponding reward score, which is used as the benchmark reward score of this embodiment.
In this embodiment, the benchmark reward score is determined by a separate reward estimation model. Because the reward estimation model is trained on sample data different from that of the reinforcement learning model of this embodiment, the benchmark reward score determined here can reflect the average level of reward scores over a broader range of sample data, making the method of this embodiment applicable to more application scenarios involving different types of users and recommendation lists.
Further, this embodiment may also include a step of training the reward estimation model, specifically: training the reward estimation model with the reward score determined in the recommendation list generation method of this embodiment as the training target.
That is, while the output of the reward estimation model serves as the reference value for the reward score in the method of this embodiment, the reward score determined by this method is in turn used as a training target for the reward estimation model. The goal is that, over the course of training, the output of the reward estimation model comes closer to the reward score determined in this method, i.e., the accuracy of the benchmark reward score is correspondingly improved, which further improves the accuracy of the final prediction result.
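A sketch of this alternative: a separate reward estimation model maps user characteristics to a benchmark reward score and is regressed toward the reward scores observed by the method above; the linear model and squared-error objective are assumptions, since the specification leaves the training algorithm open.

import numpy as np

class RewardEstimator:
    """Predicts a benchmark reward score from user features and is trained with
    the reward score observed by the method above as its target."""

    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, state):
        return float(self.w @ state)  # benchmark reward score for this user

    def train_step(self, state, observed_reward):
        error = self.predict(state) - observed_reward
        self.w -= self.lr * error * np.asarray(state, dtype=float)  # gradient step on squared error
        return 0.5 * error ** 2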
Based on the same inventive concept, the embodiment of the specification further provides a recommendation list generation device. Referring to fig. 6, the recommendation list generation apparatus includes:
an obtaining module 601 configured to obtain a user characteristic of a user;
the prediction module 602 is configured to obtain a prediction result of the user clicking a list item in the recommendation list according to the user characteristics and a pre-trained reinforcement learning model;
a response module 603 configured to obtain a click result in response to a click operation of the user on a list item in the recommendation list;
a first determining module 604 configured to determine a reward score corresponding to the predicted result according to the predicted result and the click result;
a second determination module 605 configured to determine a benchmark reward score;
an optimization module 606 configured to optimize the reinforcement learning model using a policy gradient algorithm according to the benchmark reward score, the optimized reinforcement learning model being used to generate a recommendation list corresponding to the user.
As an optional embodiment, the user characteristics comprise at least one of: portrait features, behavioral trace features, service usage features.
As an alternative embodiment, the first determining module 604 is specifically configured to determine a prediction result representation vector and a click result representation vector according to the prediction result and the click result, respectively; calculating the vector distance between the prediction result expression vector and the click result expression vector; and taking the vector distance as the reward point.
As an alternative embodiment, the second determination module 605 is specifically configured to obtain a plurality of historical reward scores, each corresponding to a historical prediction result output by the reinforcement learning model, and to take the average of the historical reward scores as the benchmark reward score.
As an alternative embodiment, the second determination module 605 is specifically configured to obtain the benchmark reward score according to the user characteristics and a pre-trained reward estimation model.
As an optional embodiment, the apparatus further comprises: a training module configured to train the reward estimation model with the reward score as the training target.
As an alternative embodiment, the optimization module 606 is specifically configured to subtract the benchmark reward score from the reward score obtained in the current iteration in each iteration of optimizing the reinforcement learning model.
As an optional embodiment, the apparatus further comprises: a push module configured to determine at least one list item according to the optimized reinforcement learning model; and generating a recommendation list corresponding to the user according to the list item, and pushing or displaying the recommendation list to the user.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, the embodiment of the specification further provides the electronic equipment. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the method according to any one of the above embodiments.
Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A recommendation list generation method comprises the following steps:
acquiring user characteristics of a user;
according to the user characteristics and a pre-trained reinforcement learning model, obtaining a prediction result of a user clicking a list item in a recommendation list;
responding to the clicking operation of the user on the list items in the recommendation list to obtain a clicking result;
determining a reward score corresponding to the prediction result according to the prediction result and the click result;
determining a benchmark reward score;
and optimizing the reinforcement learning model by adopting a strategy gradient algorithm according to the benchmark reward score, wherein the optimized reinforcement learning model is used for generating a recommendation list corresponding to the user.
2. The method of generating a recommendation list of claim 1, said user characteristics comprising at least one of: portrait features, behavioral trace features, service usage features.
3. The method for generating a recommendation list according to claim 1, wherein determining a reward score corresponding to the prediction result according to the prediction result and the click result comprises:
respectively determining a prediction result expression vector and a click result expression vector according to the prediction result and the click result;
and calculating the vector distance between the predicted result expression vector and the click result expression vector, and determining the reward score according to the vector distance.
4. The recommendation list generation method according to claim 1, wherein the determining a benchmark reward score comprises:
obtaining a plurality of historical reward scores; the historical reward scores correspond to historical prediction results output by the reinforcement learning model;
taking the average value of the plurality of historical reward scores as the benchmark reward score.
5. The recommendation list generation method according to claim 1, wherein the determining a benchmark reward score comprises:
obtaining the benchmark reward score according to the user characteristics and a pre-trained reward estimation model.
6. The method for generating a recommendation list according to claim 5, further comprising, after obtaining the benchmark reward score:
training the reward estimation model by taking the reward score as a training target.
7. The method for generating a recommendation list according to claim 1, wherein optimizing the reinforcement learning model by using a policy gradient algorithm according to the benchmark reward score comprises:
and in each iteration of optimizing the reinforcement learning model, subtracting the reference reward score from the reward score obtained in the current iteration.
8. The recommendation list generation method of claim 1, further comprising:
determining at least one list item according to the optimized reinforcement learning model;
and generating a recommendation list corresponding to the user according to the list item, and pushing or displaying the recommendation list to the user.
9. An apparatus for generating a recommendation list, comprising:
an acquisition module configured to acquire user characteristics of a user;
the prediction module is configured to obtain a prediction result of the user clicking a list item in the recommendation list according to the user characteristics and a pre-trained reinforcement learning model;
the response module is configured to respond to the clicking operation of the user on the list items in the recommendation list and obtain a clicking result;
a first determination module configured to determine a reward score corresponding to the predicted result according to the predicted result and the click result;
a second determination module configured to determine a benchmark reward score;
an optimization module configured to optimize the reinforcement learning model using a policy gradient algorithm according to the benchmark reward score, the optimized reinforcement learning model being used to generate a recommendation list corresponding to the user.
10. The apparatus of claim 9, the user characteristics comprising at least one of: portrait features, behavioral trace features, service usage features.
11. The apparatus according to claim 9, the first determination module being specifically configured to determine, from the prediction result and the click result, a prediction result representation vector and a click result representation vector, respectively; calculating the vector distance between the prediction result expression vector and the click result expression vector; and taking the vector distance as the reward point.
12. The apparatus of claim 9, wherein the second determination module is specifically configured to obtain a plurality of historical reward scores, each corresponding to a historical prediction result output by the reinforcement learning model, and to take the average value of the plurality of historical reward scores as the benchmark reward score.
13. The apparatus of claim 9, wherein the second determination module is specifically configured to derive the benchmark reward score based on the user characteristics and a pre-trained reward estimation model.
14. The apparatus of claim 13, further comprising:
a training module configured to train the reward estimation model by taking the reward score as a training target.
15. The apparatus of claim 9, wherein the optimization module is specifically configured to, in each iteration of optimizing the reinforcement learning model, subtract the benchmark reward score from the reward score obtained in the current iteration.
16. The apparatus of claim 9, further comprising:
a push module configured to determine at least one list item according to the optimized reinforcement learning model, generate a recommendation list corresponding to the user according to the list items, and push or display the recommendation list to the user.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 8 when executing the program.
CN201911409205.4A 2019-12-31 2019-12-31 Recommendation list generation method and device and electronic equipment Active CN111159558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911409205.4A CN111159558B (en) 2019-12-31 2019-12-31 Recommendation list generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911409205.4A CN111159558B (en) 2019-12-31 2019-12-31 Recommendation list generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111159558A true CN111159558A (en) 2020-05-15
CN111159558B CN111159558B (en) 2023-07-18

Family

ID=70559987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911409205.4A Active CN111159558B (en) 2019-12-31 2019-12-31 Recommendation list generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111159558B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170140053A1 (en) * 2015-11-17 2017-05-18 Yandex Europe Ag Method and system of processing a search query
CN107515909A (en) * 2017-08-11 2017-12-26 深圳市耐飞科技有限公司 A kind of video recommendation method and system
CN108153879A (en) * 2017-12-26 2018-06-12 爱因互动科技发展(北京)有限公司 The method and device of recommendation information is provided a user by human-computer interaction
CN110069699A (en) * 2018-07-27 2019-07-30 阿里巴巴集团控股有限公司 Order models training method and device
CN109255648A (en) * 2018-08-03 2019-01-22 阿里巴巴集团控股有限公司 Recommend by deeply study the method and device of marketing
CN109471963A (en) * 2018-09-13 2019-03-15 广州丰石科技有限公司 A kind of proposed algorithm based on deeply study
CN110263979A (en) * 2019-05-29 2019-09-20 阿里巴巴集团控股有限公司 Method and device based on intensified learning model prediction sample label

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422751A (en) * 2020-08-27 2021-09-21 阿里巴巴集团控股有限公司 Streaming media processing method and device based on online reinforcement learning and electronic equipment
CN113422751B (en) * 2020-08-27 2023-12-05 阿里巴巴集团控股有限公司 Streaming media processing method and device based on online reinforcement learning and electronic equipment

Also Published As

Publication number Publication date
CN111159558B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
EP4181026A1 (en) Recommendation model training method and apparatus, recommendation method and apparatus, and computer-readable medium
CN110008973B (en) Model training method, method and device for determining target user based on model
JP2017174062A (en) Purchase behavior analyzing device and program
CN108491540B (en) Text information pushing method and device and intelligent terminal
CN111275205B (en) Virtual sample generation method, terminal equipment and storage medium
CN113688310B (en) Content recommendation method, device, equipment and storage medium
CN110598120A (en) Behavior data based financing recommendation method, device and equipment
CN107729473B (en) Article recommendation method and device
CN115841366A (en) Article recommendation model training method and device, electronic equipment and storage medium
CN111159558B (en) Recommendation list generation method and device and electronic equipment
CN110378486B (en) Network embedding method and device, electronic equipment and storage medium
KR102059017B1 (en) Control method, apparatus and system for knowledge sharing platform
CN117217284A (en) Data processing method and device
CN111026973A (en) Commodity interest degree prediction method and device and electronic equipment
CN112000872A (en) Recommendation method based on user vector, training method of model and training device
WO2023050143A1 (en) Recommendation model training method and apparatus
CN112269942B (en) Method, device and system for recommending object and electronic equipment
CN111340605B (en) Method and device for training user behavior prediction model and user behavior prediction
CN112766995A (en) Article recommendation method and device, terminal device and storage medium
CN114398501B (en) Multimedia resource grouping method, device, equipment and storage medium
CN111915339A (en) Data processing method, device and equipment
CN114048392B (en) Multimedia resource pushing method and device, electronic equipment and storage medium
CN116501993B (en) House source data recommendation method and device
KR102071199B1 (en) Control method, device and system of knowledge sharing platform for 2d / 3d design result management
CN111382346B (en) Method and system for recommending content

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant