CN116308658A - Recommendation method and device


Info

Publication number
CN116308658A
Authority
CN
China
Prior art keywords
target
recommendation
user
target object
model
Prior art date
Legal status
Pending
Application number
CN202310271141.6A
Other languages
Chinese (zh)
Inventor
姜佳
暴宇健
Current Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Original Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Longzhi Digital Technology Service Co Ltd
Priority to CN202310271141.6A
Publication of CN116308658A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of artificial intelligence, and provides a recommendation method, a recommendation apparatus, a computer device, and a computer-readable storage medium. The method can improve the accuracy with which a recommendation policy model predicts the target recommendation policy corresponding to a target object, further improve online indexes, flexibly assign different recommendation policies to different users and objects (such as commodities), and compensate for the lag in reacting to users' real-time online behavior. At the same time, the data quality of the whole link can be further improved, contributing cleaner, higher-quality training samples to subsequent recall and ranking model iterations and reducing the probability of bias in the recommendation policy model; the model gains higher interpretability, which facilitates handover and upgrading of model policies, and because the model is sensitive to online data, the target recommendation policy corresponding to the target object can be adjusted quickly according to the data distribution.

Description

Recommendation method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a recommendation method and device.
Background
With the development of deep learning, industrial recommendation systems have continuously advanced in mining user interests and alleviating information overload. In a typical industrial recommendation scenario (such as the common home-page information stream, item detail pages, and short videos), a final recommendation list, ordered to be closest to the user's interests, is recommended to the user. A standard industrial recommendation system is typically composed of three stages in sequence: recall, ranking, and rearrangement (i.e., re-ranking). Recall and ranking have long received attention and development, and rearrangement is drawing increasing attention and showing great potential, since it directly determines the final goods to be displayed and their display order. As the rearrangement problem and its characteristics have become better understood, various rearrangement methods have been proposed. Rearrangement strategies for different user groups, obtained by analyzing the business scenario along different dimensions, constitute a rearrangement module that is easier to bring online and yields considerable benefits. How to select a suitable rearrangement strategy for different user groups is a key problem that directly influences the effect of the online strategy.
In the Internet e-commerce recommendation scenario, the system presents, through the recall and ranking stages, a list of items that is closest to the user's interests. However, with different business goals of Internet services, such as the common click-through rate, conversion rate, or GMV, different rearrangement strategies are required to promote different business indexes. At present, although conventional commodity recommendation models consider abundant user interest features in the process of predicting a commodity recommendation strategy, when the user's historical online behavior is insufficient, the training samples acquired online may be inconsistent with the real online effect, and the model's recommendation effect for the commodities predicted for the user is poor. Thus, there is a need to adjust, in a timely manner, the strategies by which items are presented to the user. The traditional adjustment mode divides the population for commodity recommendation strategies by manual experience and then verifies online whether the benefit meets expectations; it depends directly on manual experience, the strategy is single, it cannot switch in real time as user behavior changes online, and it is not suitable for rapid iteration.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a recommendation method, apparatus, computer device, and computer-readable storage medium, so as to solve the problem in the prior art that dividing the population for commodity recommendation strategies by manual experience, and then verifying online whether the benefit meets expectations, depends directly on manual experience; the strategy is single, cannot switch in real time as user behavior changes online, and is not suitable for rapid iteration.
In a first aspect of embodiments of the present disclosure, there is provided a recommendation method, the method including:
acquiring user attribute characteristics of a target user, object attribute characteristics of a target object and interaction behavior characteristics between the target user and the target object;
inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into a trained recommendation strategy model to obtain a target recommendation strategy corresponding to the target object;
recommending the target object to the target user based on a target recommendation strategy corresponding to the target object;
acquiring a target interaction result of the target user aiming at the target object, and determining a reward value corresponding to the target recommendation strategy according to the target interaction result;
and adjusting model parameters of the recommendation strategy model by using user attribute characteristics of the target user, object attribute characteristics of the target object, interaction behavior characteristics between the target user and the target object, the target recommendation strategy, and the reward value corresponding to the target recommendation strategy.
In a second aspect of the embodiments of the present disclosure, there is provided a recommendation apparatus, the apparatus including:
a feature acquisition unit, used for acquiring user attribute characteristics of a target user, object attribute characteristics of a target object, and interaction behavior characteristics between the target user and the target object;
the strategy acquisition unit is used for inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into a trained recommendation strategy model to obtain a target recommendation strategy corresponding to the target object;
the object recommending unit is used for recommending the target object to the target user based on a target recommending strategy corresponding to the target object;
the result acquisition unit is used for acquiring a target interaction result of the target user aiming at the target object and determining a reward value corresponding to the target recommendation strategy according to the target interaction result;
and a model adjustment unit, used for adjusting model parameters of the recommendation strategy model by using user attribute characteristics of the target user, object attribute characteristics of the target object, interaction behavior characteristics between the target user and the target object, the target recommendation strategy, and the reward value corresponding to the target recommendation strategy.
In a third aspect of the disclosed embodiments, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. An embodiment of the disclosure first acquires the user attribute features of a target user, the object attribute features of a target object, and the interaction behavior features between the target user and the target object; these are then input into a trained recommendation policy model to obtain the target recommendation policy corresponding to the target object; the target object is then recommended to the target user based on that policy; next, the target interaction result of the target user with respect to the target object is acquired, and the reward value corresponding to the target recommendation policy is determined according to the target interaction result; finally, the model parameters of the recommendation policy model are adjusted using the user attribute features, the object attribute features, the interaction behavior features, the target recommendation policy, and the reward value corresponding to the target recommendation policy. It can be understood that, in this embodiment, the recommendation policy model first obtains the target recommendation policy based on the user attribute features, the object attribute features, and the interaction behavior features, so the model can extract the rich information and dynamic semantic representations carried by these features and select different recommendation policies accordingly, which improves the accuracy with which the model predicts the target recommendation policy for the target object. The real-time target interaction results of the target user with respect to the target object (i.e., feedback from the user's online behavior) are then used to adjust the model parameters in a timely manner, which further improves that prediction accuracy.
Therefore, the target recommendation policy that the recommendation policy model determines for the target user and the target object better fits the current environment, so online indexes can be further improved, different recommendation policies can be flexibly assigned to different users and objects (such as commodities), and the lag in reacting to users' real-time online behavior can be compensated for. At the same time, the method provided by this embodiment can further improve the data quality of the whole link, contribute cleaner, higher-quality training samples to subsequent recall and ranking model iterations, and reduce the probability of bias in the recommendation policy model; the model gains higher interpretability, which facilitates handover and upgrading of model policies, and because it is sensitive to online data, the target recommendation policy corresponding to the target object can be adjusted quickly according to the data distribution (i.e., the target interaction results of target users with respect to target objects).
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a scene schematic diagram of an application scene of an embodiment of the present disclosure;
FIG. 2 is a flow chart of a recommendation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a processing procedure of the recommendation policy model provided in an embodiment of the present disclosure;
FIG. 4 is a block diagram of a recommendation device provided by an embodiment of the present disclosure;
fig. 5 is a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A recommendation method and apparatus according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
In the prior art, existing commodity recommendation models consider rich user interest features in the process of predicting a commodity recommendation strategy, but when the user's historical online behavior is insufficient, the training samples acquired online may be inconsistent with the real online effect, and the model's recommendation effect for the commodities predicted for the user is poor. Thus, there is a need to adjust, in a timely manner, the strategies by which items are presented to the user. The traditional adjustment mode divides the population for commodity recommendation strategies by manual experience and then verifies online whether the benefit meets expectations; it depends directly on manual experience, the strategy is single, it cannot switch in real time as user behavior changes online, and it is not suitable for rapid iteration.
In order to solve the above problems, the present disclosure provides a recommendation method. In this method, the recommendation policy model is first used to obtain the target recommendation policy corresponding to the target object based on the user attribute features of the target user, the object attribute features of the target object, and the interaction behavior features between the target user and the target object. The model can therefore extract the rich information and dynamic semantic representations carried by these features and select different recommendation policies accordingly, which improves the accuracy with which the model predicts the target recommendation policy for the target object. The real-time target interaction results of the target user with respect to the target object (i.e., feedback from the user's online behavior) are then used to adjust the model parameters of the recommendation policy model in a timely manner, which further improves that prediction accuracy. Therefore, the target recommendation policy that the model determines for the target user and the target object better fits the current environment, so online indexes can be further improved, different recommendation policies can be flexibly assigned to different users and objects (such as commodities), and the lag in reacting to users' real-time online behavior can be compensated for. At the same time, the method can further improve the data quality of the whole link, contribute cleaner, higher-quality training samples to subsequent recall and ranking model iterations, and reduce the probability of bias in the model; the model gains higher interpretability, which facilitates handover and upgrading of model policies, and because it is sensitive to online data, the target recommendation policy corresponding to the target object can be adjusted quickly according to the data distribution (i.e., the target interaction results of target users with respect to target objects).
For example, the embodiments of the present disclosure may be applied to an application scenario as shown in fig. 1. This scenario may include a terminal device 1 and a server 2.
The terminal device 1 may be hardware or software. When the terminal device 1 is hardware, it may be various electronic devices having a display screen and supporting communication with the server 2, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like; when the terminal device 1 is software, it may be installed in the electronic device as described above. The terminal device 1 may be implemented as a plurality of software or software modules, or as a single software or software module, to which the embodiments of the present disclosure are not limited. Further, various applications, such as a data processing application, an instant messaging tool, social platform software, a search class application, a shopping class application, and the like, may be installed on the terminal device 1.
The server 2 may be a server that provides various services, for example, a background server that receives a request transmitted from a terminal device with which communication connection is established, and the background server may perform processing such as receiving and analyzing the request transmitted from the terminal device and generate a processing result. The server 2 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center, which is not limited in the embodiment of the present disclosure.
The server 2 may be hardware or software. When the server 2 is hardware, it may be various electronic devices that provide various services to the terminal device 1. When the server 2 is software, it may be a plurality of software or software modules providing various services to the terminal device 1, or may be a single software or software module providing various services to the terminal device 1, which is not limited by the embodiments of the present disclosure.
The terminal device 1 and the server 2 may be communicatively connected via a network. The network may be a wired network using coaxial cable, twisted pair wire, and optical fiber connection, or may be a wireless network that can implement interconnection of various communication devices without wiring, for example, bluetooth (Bluetooth), near field communication (Near Field Communication, NFC), infrared (Infrared), etc., which are not limited by the embodiments of the present disclosure.
Specifically, a user may input, through the terminal device 1, the user attribute features of a target user, the object attribute features of a target object, and the interaction behavior features between the target user and the target object; the terminal device 1 sends these features to the server 2. The server 2 stores a trained recommendation policy model, and may input the features into it to obtain the target recommendation policy corresponding to the target object. The server 2 may then recommend the target object to the target user based on that policy; next, the server 2 may acquire the target interaction result of the target user with respect to the target object and determine the reward value corresponding to the target recommendation policy according to the target interaction result; finally, the server 2 may adjust the model parameters of the recommendation policy model using the user attribute features, the object attribute features, the interaction behavior features, the target recommendation policy, and the reward value corresponding to the target recommendation policy. In this way, the recommendation policy model first obtains the target recommendation policy based on the user attribute features, the object attribute features, and the interaction behavior features, so it can extract the rich information and dynamic semantic representations carried by these features and select different recommendation policies accordingly, improving the accuracy with which the model predicts the target recommendation policy for the target object; the real-time target interaction results (i.e., feedback from the user's online behavior) are then used to adjust the model parameters in a timely manner, further improving that prediction accuracy.
Therefore, the target recommendation policy that the recommendation policy model determines for the target user and the target object better fits the current environment, so online indexes can be further improved, different recommendation policies can be flexibly assigned to different users and objects (such as commodities), and the lag in reacting to users' real-time online behavior can be compensated for. At the same time, the method provided by this embodiment can further improve the data quality of the whole link, contribute cleaner, higher-quality training samples to subsequent recall and ranking model iterations, and reduce the probability of bias in the model; the model gains higher interpretability, which facilitates handover and upgrading of model policies, and because it is sensitive to online data, the target recommendation policy corresponding to the target object can be adjusted quickly according to the data distribution (i.e., the target interaction results of target users with respect to target objects).
It should be noted that the specific types, numbers and combinations of the terminal device 1 and the server 2 and the network may be adjusted according to the actual requirements of the application scenario, which is not limited in the embodiment of the present disclosure.
It should be noted that the above application scenario is only shown for the convenience of understanding the present disclosure, and embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Fig. 2 is a flowchart of a recommendation method provided in an embodiment of the present disclosure. The recommendation method of fig. 2 may be performed by the terminal device or the server of fig. 1. As shown in fig. 2, the recommendation method includes:
s201: and acquiring user attribute characteristics of a target user, object attribute characteristics of a target object and interaction behavior characteristics between the target user and the target object.
In this embodiment, a user to whom a recommendation needs to be made may be referred to as a target user. The user attribute features may be understood as feature information reflecting the attributes of the target user who performs the interaction, and the target user may be understood as an account or a client that generates interaction behavior with respect to the target object. For example, the user attribute features may reflect the model of the mobile phone used by the user (i.e., the model of the mobile phone on which the account is logged in), the location of the account (such as province or city), and the like. It will be appreciated that the user attribute features of the target user are feature information about the attributes of the target user itself.
The target object may be understood as an object that needs to be recommended to the target user, or may be an object to which the interactive behavior is performed, for example, the target object may be a short video, a commodity, a service, or the like. Object attribute characteristics may be understood as characteristic information that can reflect the attributes of the target object itself. For example, when the target object is a commodity or service, the object attribute feature may be a feature capable of reflecting an attribute such as a price of the commodity or service, a sales amount per day, a product type, a price of the commodity that the target user has last browsed, or the like.
The interaction behavior features between the target user and the target object may be understood as feature information reflecting the operations performed by the target user on the target object. For example, assuming the target object is a commodity, the interaction behavior features may include features reflecting the number of times the target user clicked the commodity and whether the commodity was favorited and/or purchased by the user. In an online e-commerce scenario, a user often browses multiple goods or services within the same e-commerce website or mobile application; the behaviors may be operations such as staying on a certain goods page or clicking on goods to view details, and these operations may be collectively referred to as interaction behaviors.
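To make the inputs of step S201 concrete, the sketch below assembles the three feature groups into a single dense state vector in Python. This is a minimal illustration only; the field names (device_model_id, province_id, price, click_count, and so on) and the flat-concatenation encoding are assumptions, not features prescribed by the disclosure.

```python
import numpy as np

def build_state_vector(user_features: dict,
                       item_features: dict,
                       interaction_features: dict) -> np.ndarray:
    """Concatenate the three feature groups from S201 into one dense
    state vector; all field names here are illustrative assumptions."""
    user = [user_features["device_model_id"], user_features["province_id"]]
    item = [item_features["price"], item_features["daily_sales"],
            item_features["category_id"]]
    inter = [interaction_features["click_count"],
             float(interaction_features["favorited"]),
             float(interaction_features["purchased"])]
    return np.asarray(user + item + inter, dtype=np.float32)

state = build_state_vector(
    {"device_model_id": 17, "province_id": 3},
    {"price": 59.9, "daily_sales": 120, "category_id": 8},
    {"click_count": 4, "favorited": True, "purchased": False},
)
print(state.shape)  # (8,)
```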
S202: and inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into a trained recommendation strategy model to obtain a target recommendation strategy corresponding to the target object.
In this embodiment, the target recommendation policy corresponding to the target object may be understood as the policy determined for recommending the target object to the target user. It can be appreciated that, because the target recommendation policy is determined by the recommendation policy model based on the user attribute features of the target user, the object attribute features of the target object, and the interaction behavior features between the target user and the target object, the determined policy not only considers rich user interest features and object attribute features but also incorporates the interaction behavior between the target user and the target object. The model can therefore select different recommendation policies based on the rich information and dynamic semantic representations carried by these features, which improves the accuracy with which the model predicts the target recommendation policy for the target object. It is emphasized that, in one implementation, the recommendation policies may also differ across different users and different objects.
In this embodiment, after the user attribute feature of the target user, the object attribute feature of the target object, and the interaction behavior feature between the target user and the target object are obtained, in order to determine the target recommendation policy corresponding to the target object of the target user, the user attribute feature of the target user, the object attribute feature of the target object, and the interaction behavior feature between the target user and the target object may be input into the trained recommendation policy model, so as to obtain the target recommendation policy corresponding to the target object. The target recommendation policy corresponding to the target object may include at least one of the following: the target object to be recommended, the recommendation time of the target object, the recommendation mode, and the like. For example, assuming that the target object is a commodity, the target recommendation policy corresponding to the target object may be: taking the commodity with the highest click rate in nearly 1 day as a target object, and recommending the target object to a target user in a popup window mode when the target user opens a client; the target recommendation policy may also be: taking the commodity with the highest click number in the last 5 days of the target user as a target object, and recommending the commodity to the target user in a message pushing mode when the target user starts browsing the commodity; the target recommendation policy may also be: and randomly selecting one commodity from the ten categories in the sales volume as a target object, and recommending the commodity to a target user in a message pushing mode.
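The three example policies above can be pictured as entries in a small policy table. The sketch below is one possible, hypothetical representation; the class name and field names are illustrative, and the disclosure does not prescribe any particular data structure.

```python
from dataclasses import dataclass

@dataclass
class RecommendationPolicy:
    """Illustrative container for a target recommendation policy:
    what to recommend, when, and through which channel."""
    item_selector: str   # which target object to pick
    trigger: str         # when to recommend it
    channel: str         # how to present it

POLICY_1 = RecommendationPolicy(
    "commodity with the highest click-through rate in the last 1 day",
    "when the target user opens the client", "popup window")
POLICY_2 = RecommendationPolicy(
    "commodity the target user clicked most in the last 5 days",
    "when the target user starts browsing", "message push")
POLICY_3 = RecommendationPolicy(
    "random commodity from the top ten categories by sales volume",
    "when the target user starts browsing", "message push")
```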
It should be noted that, in one implementation, the recommendation policy model may be a classification model, for example, a Deep Q Network, a DNN, a CNN, a Transformer, a multi-layer fully-connected network, or a multi-layer self-attention network; the specific neural network model is not limited in this embodiment.
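As a minimal sketch of how such a classification-style policy model could map the state features to a distribution over candidate strategies, the snippet below uses a linear softmax policy in NumPy. The state dimension, the number of strategies, and the parameter shape are assumptions for illustration; any of the network types listed above could take the place of the linear layer.

```python
import numpy as np

STATE_DIM = 8        # length of the state vector built from S201 (assumed)
NUM_STRATEGIES = 3   # e.g. the three candidate policies described above

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.01, size=(STATE_DIM, NUM_STRATEGIES))  # model parameters

def strategy_probabilities(state: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """p(a | s; theta): softmax over the candidate recommendation strategies."""
    logits = state @ theta
    logits -= logits.max()              # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

probs = strategy_probabilities(rng.normal(size=STATE_DIM), theta)
print(probs, probs.sum())               # three probabilities summing to 1
```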
S203: recommending the target object to the target user based on a target recommendation strategy corresponding to the target object.
In this embodiment, after determining the target recommendation policy corresponding to the target object, the target object may be recommended to the target user according to the target recommendation policy corresponding to the target object. For example, assuming that the target object is a commodity, the target recommendation policy corresponding to the target object is: taking the commodity with the highest click rate in nearly 1 day as a target object, and recommending the target object to a target user in a popup window mode when the target user opens a client; and when the client corresponding to the target user is detected to be started, displaying the commodity with the highest click rate in the near 1 day in a popup window mode.
S204: and acquiring a target interaction result of the target user aiming at the target object, and determining a reward value corresponding to the target recommendation strategy according to the target interaction result.
In this embodiment, after the target object is displayed to the target user, the target interaction result of the target user on the target object may be obtained. The target interaction result may be understood as a real conversion result of the target user for the target object, that is, an operation content performed by the target user for the recommended target object. For example, assuming that the target user is the account a and the target object is the commodity a, after recommending and displaying the commodity a to the account a by adopting a target recommendation policy corresponding to the commodity a, if the account a browses the commodity a but does not purchase or collect the commodity a, the real conversion result of the target user for the target object is not purchased, that is, the target interaction result of the target user for the target object is not purchased. It should be noted that, assuming that the target object is a commodity, if the target user makes a further action such as ordering or reserving on the target commodity in a period of time after the target user interacts with the last commodity, the further action of the target user may be referred to as conversion.
After the target interaction result is acquired, the reward value corresponding to the target recommendation policy can be determined according to the target interaction result. It will be appreciated that the reward value reflects the target interaction result: the closer the target interaction result is to the ideal result, the higher the reward value, and the greater the gap between the target interaction result and the ideal result, the lower the reward value. For example, as shown in fig. 2, if the target interaction result is that the target user clicked the target object, the corresponding reward value is 1; if the target user did not click the target object, the corresponding reward value is 0.
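The click-based reward just described reduces to a one-line function. The sketch below assumes the simple click = 1 / no click = 0 scheme given above; richer reward shaping (for example, for favorites or purchases) would follow the same pattern but is not specified by the disclosure.

```python
def reward(clicked: bool) -> float:
    """Binary reward from S204: 1.0 if the target user clicked the
    recommended target object, 0.0 otherwise."""
    return 1.0 if clicked else 0.0

assert reward(True) == 1.0 and reward(False) == 0.0
```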
It should be noted that, because the reward value reflects the target interaction result, the recommendation policy model can be continuously refined using the reward value, so that the model's training data is generated autonomously and serves as a direct source of experience for reaching the target. By receiving the reward value, the recommendation policy model judges whether its predicted result is good, and it tends toward the target state by selecting behaviors with higher returns.
That is, in this embodiment, in order to better train the recommendation policy model, the target interaction results of the target user with respect to the target object need to be collected, so as to learn whether the target user ultimately converted on the target object. It can be understood that, after determining that the target user produced an actual conversion behavior on the target object, the recommendation policy model can be trained with the target interaction result corresponding to that conversion behavior and the corresponding reward value. In use, the trained model can, for different target objects, push the target object to the target user with a target recommendation policy that has a high conversion probability for that object, thereby improving the target user's conversion rate on the target object, for example, the likelihood that the target user clicks, favorites, or purchases the target object.
S205: and adjusting model parameters of the recommendation strategy model by using user attribute characteristics of the target user, object attribute characteristics of the target object, interaction behavior characteristics between the target user and the target object, the target recommendation strategy and reward values corresponding to the target recommendation strategy.
After the latest target recommendation policy and its corresponding reward value are obtained, the model parameters of the recommendation policy model can be adjusted using the real-time user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward value corresponding to the target recommendation policy. In this way, the manner in which the recommendation policy model selects among different target recommendation policies can be adjusted in real time using feedback from the user's online real-time behavior. The approach provided by this embodiment can flexibly assign different policies to different users, can adjust to the actual situation in real time, and can compensate for the lag of prior-art recommendation models in reacting to users' real-time online behavior.
It can be seen that, compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. An embodiment of the disclosure first acquires the user attribute features of a target user, the object attribute features of a target object, and the interaction behavior features between the target user and the target object; these are then input into a trained recommendation policy model to obtain the target recommendation policy corresponding to the target object; the target object is then recommended to the target user based on that policy; next, the target interaction result of the target user with respect to the target object is acquired, and the reward value corresponding to the target recommendation policy is determined according to the target interaction result; finally, the model parameters of the recommendation policy model are adjusted using the user attribute features, the object attribute features, the interaction behavior features, the target recommendation policy, and the reward value corresponding to the target recommendation policy. It can be understood that, in this embodiment, the recommendation policy model first obtains the target recommendation policy based on the user attribute features, the object attribute features, and the interaction behavior features, so the model can extract the rich information and dynamic semantic representations carried by these features and select different recommendation policies accordingly, which improves the accuracy with which the model predicts the target recommendation policy for the target object. The real-time target interaction results of the target user with respect to the target object (i.e., feedback from the user's online behavior) are then used to adjust the model parameters in a timely manner, which further improves that prediction accuracy.
Therefore, the target recommendation policy that the recommendation policy model determines for the target user and the target object better fits the current environment, so online indexes can be further improved, different recommendation policies can be flexibly assigned to different users and objects (such as commodities), and the lag in reacting to users' real-time online behavior can be compensated for. At the same time, the method provided by this embodiment can further improve the data quality of the whole link, contribute cleaner, higher-quality training samples to subsequent recall and ranking model iterations, and reduce the probability of bias in the recommendation policy model; the model gains higher interpretability, which facilitates handover and upgrading of model policies, and because it is sensitive to online data, the target recommendation policy corresponding to the target object can be adjusted quickly according to the data distribution (i.e., the target interaction results of target users with respect to target objects).
In some embodiments, the step of inputting the user attribute feature of the target user, the object attribute feature of the target object, and the interaction behavior feature between the target user and the target object in S202 to the trained recommendation policy model to obtain the target recommendation policy corresponding to the target object may include the following steps:
s202a: and inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into a trained recommendation strategy model to obtain corresponding candidate recommendation strategies of the target object and interaction conversion probabilities corresponding to the candidate recommendation strategies.
In this embodiment, a plurality of candidate recommendation policies may be preset. After the user attribute features of the target user, the object attribute features of the target object, and the interaction behavior features between the target user and the target object are input into the trained recommendation policy model, the model may determine the interaction conversion probability corresponding to each candidate recommendation policy according to the rich information and dynamic semantic representations of those features. It should be noted that the interaction conversion probability corresponding to a candidate recommendation policy may be understood as the probability that the target object is successfully converted after being recommended to the target user with that policy, that is, the probability that the target user performs a successful interaction behavior on the target object (such as browsing, ordering, or purchasing). It can be understood that the higher the interaction conversion probability of a candidate policy, the higher the probability of successful conversion after recommending the target object with that policy; conversely, the lower the interaction conversion probability, the lower the probability of successful conversion.
S202b: and determining the target recommendation strategy corresponding to the target object according to the candidate recommendation strategy corresponding to the target object and the interaction conversion probability corresponding to each candidate recommendation strategy.
After determining the interaction conversion probability corresponding to each candidate recommendation policy, determining the target recommendation policy corresponding to the target object according to the corresponding candidate recommendation policy of the target object and the interaction conversion probability corresponding to each candidate recommendation policy. For example, the candidate recommendation policy with the highest interaction transformation probability may be used as the target recommendation policy corresponding to the target object.
Next, an illustration is given in connection with fig. 3. Assume that the target object is a commodity and the pre-specified candidate recommendation policies include policy 1, policy 2, and policy 3. Policy 1: recommend the commodity with the highest click-through rate in the last 1 day. Policy 2: recommend the commodity the target user clicked most in the last 5 days. Policy 3: randomly select one commodity from the top ten categories by sales volume and recommend it. The user attribute features of the target user, the object attribute features of the target object, and the interaction behavior features between the target user and the target object are input into the trained recommendation policy model, yielding the candidate recommendation policies corresponding to the target object and the interaction conversion probability of each: specifically, policy 1 corresponds to 0.1, policy 2 to 0.2, and policy 3 to 0.7. Since policy 3 has the highest interaction conversion probability, policy 3 can be used as the target recommendation policy corresponding to the target object.
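Selecting the candidate policy with the highest interaction conversion probability is a simple argmax, shown below with the probabilities from the fig. 3 example (0.1, 0.2, 0.7); the variable names are illustrative.

```python
import numpy as np

# Interaction-conversion probabilities from the worked example above:
# policy 1 -> 0.1, policy 2 -> 0.2, policy 3 -> 0.7.
conversion_probs = np.array([0.1, 0.2, 0.7])
target_index = int(np.argmax(conversion_probs))
print(f"target recommendation policy: policy {target_index + 1}")  # policy 3
```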
In some embodiments, the adjusting the model parameters of the recommendation policy model in S205 by using the user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward values corresponding to the target recommendation policy includes:
based on a preset strategy optimization algorithm, the model parameters of the recommendation strategy model are adjusted by utilizing the user attribute characteristics of the target user, the object attribute characteristics of the target object, the interaction behavior characteristics between the target user and the target object and the target recommendation strategy and the reward values corresponding to the target recommendation strategy.
In this embodiment, a preset policy optimization algorithm may be used to adjust the model parameters of the recommendation policy model. In one implementation, the preset policy optimization algorithm may include at least one of: a policy gradient algorithm, an actor-critic algorithm, and the PPO (proximal policy optimization) algorithm.
Next, a policy gradient algorithm will be exemplified as a preset policy optimization algorithm. After the user attribute characteristics of the target user, the object attribute characteristics of the target object, the interaction behavior characteristics between the target user and the target object, the target recommendation strategy and the rewarding values corresponding to the target recommendation strategy are obtained, a typical strategy gradient algorithm can be adopted, and model parameters of a recommendation strategy model are updated by utilizing the user attribute characteristics of the target user, the object attribute characteristics of the target object, the interaction behavior characteristics between the target user and the target object, the target recommendation strategy and the rewarding values corresponding to the target recommendation strategy.
First, equation 1 may be used to determine the loss function value corresponding to the user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward value corresponding to the target recommendation policy:

$$\nabla_{\theta}\bar{R} = \frac{1}{N}\sum_{n=1}^{N} R(\tau_n)\,\nabla_{\theta}\log p(a_n \mid s_n, \theta) \qquad \text{(equation 1)}$$

where $\nabla_{\theta}\bar{R}$ is the loss function value; $\nabla$ is the full differential operator, which takes the gradient with respect to the components of the model parameters $\theta$ of the recommendation policy model and indicates the direction in which those parameters are updated; $\theta$ denotes the model parameters of the recommendation policy model; $n$ indexes one complete interaction and $N$ is the total number of complete interactions; $R(\tau_n)$ is the score of the target recommendation policy in the $n$-th interaction, i.e., the reward value corresponding to the target recommendation policy, where $R(\cdot)$ can simply be set to 1 when the target user clicks the target object after the recommendation and to 0 when the target user does not; $\tau_n$ is the trajectory of the $n$-th complete interaction, comprising the operation actions performed by the target user and the target recommendation policy output by the recommendation policy model; $a_n$ is the target recommendation policy output by the recommendation policy model; $s_n$ is the input features of the recommendation policy model (i.e., the user attribute features of the target user, the object attribute features of the target object, and the interaction behavior features between the target user and the target object); and $p(a_n \mid s_n, \theta)$ is the probability that the recommendation policy model outputs the target recommendation policy $a_n$ after observing $s_n$ in the $n$-th interaction.

Then, the updated model parameters of the recommendation policy model can be determined using the loss function value corresponding to the user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward value corresponding to the target recommendation policy. As an example, the model parameters may be updated with equation 2:

$$\theta_{\text{new}} = \theta_{\text{old}} + \eta\,\nabla_{\theta}\bar{R} \qquad \text{(equation 2)}$$

where $\theta_{\text{new}}$ are the updated model parameters of the recommendation policy model; $\theta_{\text{old}}$ are the model parameters before the update; $\eta$ is the learning rate; and $\nabla_{\theta}\bar{R}$ is the loss function value of the recommendation policy model before the update.
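A compact sketch of how equations 1 and 2 could be applied in code is given below, again assuming the illustrative linear softmax policy; the batch format, shapes, and learning rate are assumptions, and a production system would typically use an autodiff framework rather than these hand-written gradients.

```python
import numpy as np

def reinforce_update(theta: np.ndarray,
                     trajectories: list,
                     learning_rate: float = 0.01) -> np.ndarray:
    """One policy-gradient step in the spirit of equations 1 and 2.

    trajectories: list of (state, action, reward) tuples, one per complete
    interaction. A linear softmax policy p(a|s;theta) = softmax(s @ theta)
    is assumed here purely for illustration.
    """
    grad = np.zeros_like(theta)
    for state, action, rwd in trajectories:
        logits = state @ theta
        logits -= logits.max()
        probs = np.exp(logits) / np.exp(logits).sum()
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        # d log p(a|s;theta) / d theta for a linear softmax policy:
        grad += rwd * np.outer(state, one_hot - probs)
    grad /= len(trajectories)            # the 1/N average of equation 1
    return theta + learning_rate * grad  # the ascent step of equation 2

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.01, size=(8, 3))  # 8 state dims, 3 strategies
batch = [(rng.normal(size=8), 2, 1.0),       # clicked after policy 3
         (rng.normal(size=8), 0, 0.0)]       # not clicked after policy 1
theta = reinforce_update(theta, batch)
```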
In some embodiments, the recommendation policy model may be trained based on a set of historical interaction training samples. The historical interaction training sample set may include multiple groups of historical interaction training samples, and each group includes historical object attribute features, historical interaction behavior features, historical user attribute features, and the real recommendation policy corresponding to a historical target object. The historical object attribute features may be understood as the object attribute features of a historical target object (i.e., a target object recommended in the past); the historical interaction behavior features may be understood as historical data of the interaction behavior features between the target user and the historical target object; the historical user attribute features may be understood as the user attribute features of the target user at the historical time; and the real recommendation policy corresponding to the historical target object may be understood as the recommendation policy actually adopted when the historical target object was recommended.
It should be noted that, in one implementation, the historical interaction training samples in the set may be collected as follows: the initialized recommendation policy model is first placed in a simulator or an online environment to collect data, that is, the interaction results of simulated users or real users with the recommendation policy model are collected; after N pieces of data are collected, they can be used as historical interaction training samples to train the recommendation policy model.
As an example, the training process of the recommended policy model may include the steps of:
for each group of historical interaction training samples in the historical interaction training sample set, inputting the historical object attribute features, the historical interaction behavior features, and the historical user attribute features in the historical interaction training sample into the recommendation policy model to obtain a predicted recommendation policy corresponding to the historical target object; and adjusting model parameters of the recommendation policy model according to the predicted recommendation policy and the real recommendation policy corresponding to the historical target object, to obtain a trained recommendation policy model.
Specifically, the historical object attribute features, the historical interaction behavior features and the historical user attribute features in a historical interaction training sample may be input into the recommendation policy model to obtain the predicted recommendation policy corresponding to the historical target object. Then, a loss function value may be calculated from the predicted recommendation policy corresponding to the historical target object and the real recommendation policy in the historical interaction training sample. If the loss function value does not meet a preset condition, the model parameters of the recommendation policy model may be adjusted according to the loss function value to obtain an adjusted recommendation policy model, and the step of inputting the historical object attribute features, the historical interaction behavior features and the historical user attribute features into the recommendation policy model to obtain the predicted recommendation policy is executed again, until the loss function value meets the preset condition or the number of training iterations reaches a preset count.
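The loop just described can be sketched as follows, assuming the sample layout shown earlier, a classification-style model with one output per recommendation policy, and a cross-entropy loss; the loss threshold and epoch cap stand in for the preset condition and the preset number of training iterations, and are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def train_recommendation_policy_model(model, optimizer, samples,
                                      loss_threshold=1e-3, max_epochs=100):
    """Iterate until the loss meets the preset condition or the
    number of training rounds reaches the preset count."""
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for s in samples:
            # Concatenate object, interaction, and user features into one input.
            x = torch.tensor(list(s.object_attribute_features)
                             + list(s.interaction_behavior_features)
                             + list(s.user_attribute_features),
                             dtype=torch.float32)
            logits = model(x)  # scores over candidate recommendation policies
            target = torch.tensor(s.real_recommendation_policy)
            loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(samples) < loss_threshold:
            break  # preset condition met
    return model
```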
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 4 is a schematic diagram of a recommendation device provided in an embodiment of the present disclosure. As shown in fig. 4, the recommendation device includes:
a feature obtaining unit 401, configured to obtain a user attribute feature of a target user, an object attribute feature of a target object, and an interaction behavior feature between the target user and the target object;
a policy obtaining unit 402, configured to input a user attribute feature of the target user, an object attribute feature of the target object, and an interaction behavior feature between the target user and the target object into a trained recommendation policy model, so as to obtain a target recommendation policy corresponding to the target object;
an object recommending unit 403, configured to recommend the target object to the target user based on a target recommendation policy corresponding to the target object;
a result obtaining unit 404, configured to obtain a target interaction result of the target user for the target object, and determine a reward value corresponding to the target recommendation policy according to the target interaction result (one possible mapping from interaction result to reward is sketched after this list);
and a model adjustment unit 405, configured to adjust the model parameters of the recommendation policy model by using the user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward value corresponding to the target recommendation policy.
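As a sketch of how the result obtaining unit 404 might map a target interaction result to a reward value: the event names and numeric values below are illustrative assumptions, not values given by the disclosure.

```python
def reward_from_interaction(interaction_result: str) -> float:
    """Map the target user's interaction result on the target object
    to a reward value (illustrative values only)."""
    rewards = {
        "purchase": 1.0,   # strongest positive feedback
        "click": 0.3,      # weak positive feedback
        "ignore": 0.0,     # no engagement
        "dislike": -0.5,   # explicit negative feedback
    }
    return rewards.get(interaction_result, 0.0)
```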
Optionally, the recommendation policy model is a classification model.
Optionally, the policy obtaining unit 402 is specifically configured to:
inputting the user attribute features of the target user, the object attribute features of the target object and the interaction behavior features between the target user and the target object into the trained recommendation policy model to obtain candidate recommendation policies corresponding to the target object and an interaction conversion probability corresponding to each candidate recommendation policy; and

determining the target recommendation policy corresponding to the target object according to the candidate recommendation policies corresponding to the target object and the interaction conversion probability corresponding to each candidate recommendation policy.
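As a sketch of this selection step, one simple rule is to pick the candidate with the highest predicted interaction conversion probability. The disclosure does not fix the selection rule, so the argmax choice below is an assumption; sampling in proportion to the probabilities would be another option.

```python
from typing import Sequence

def select_target_policy(candidate_policies: Sequence,
                         conversion_probabilities: Sequence[float]):
    """Pick the candidate recommendation policy with the highest
    predicted interaction conversion probability."""
    best_index = max(range(len(candidate_policies)),
                     key=lambda i: conversion_probabilities[i])
    return candidate_policies[best_index]
```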
Optionally, the recommendation policy model is obtained by training based on a historical interaction training sample set, the historical interaction training sample set includes a plurality of groups of historical interaction training samples, and each group of historical interaction training samples includes a historical object attribute feature, a historical interaction behavior feature, a historical user attribute feature and a real recommendation policy corresponding to a historical target object.
Optionally, the training process of the recommendation policy model includes:
for each group of historical interaction training samples in the historical interaction training sample set, inputting the historical object attribute features, the historical interaction behavior features and the historical user attribute features in that sample into the recommendation policy model to obtain a predicted recommendation policy corresponding to the historical target object; and adjusting the model parameters of the recommendation policy model according to the predicted recommendation policy and the real recommendation policy corresponding to the historical target object, so as to obtain the trained recommendation policy model.
Optionally, the model adjustment unit 405 is specifically configured to:
adjusting, based on a preset policy optimization algorithm, the model parameters of the recommendation policy model by using the user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward value corresponding to the target recommendation policy.
Optionally, the preset policy optimization algorithm includes at least one of the following: a policy gradient algorithm, an actor-critic algorithm, and a proximal policy optimization (PPO) algorithm.
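Of these options, the clipped surrogate objective of proximal policy optimization can be sketched as follows. The clipping coefficient `clip_eps` and the advantage estimates (derived from the reward values) are assumptions of the sketch, not parameters fixed by the disclosure.

```python
import torch

def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective of proximal policy optimization (PPO).
    All arguments are tensors of the same shape."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign: minimizing this loss maximizes the clipped objective.
    return -torch.min(unclipped, clipped).mean()
```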
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. The embodiments of the present disclosure provide a recommendation device, which includes: a feature obtaining unit, configured to obtain the user attribute features of a target user, the object attribute features of a target object, and the interaction behavior features between the target user and the target object; a policy obtaining unit, configured to input the user attribute features of the target user, the object attribute features of the target object and the interaction behavior features between the target user and the target object into a trained recommendation policy model to obtain a target recommendation policy corresponding to the target object; an object recommending unit, configured to recommend the target object to the target user based on the target recommendation policy corresponding to the target object; a result obtaining unit, configured to obtain a target interaction result of the target user for the target object, and determine a reward value corresponding to the target recommendation policy according to the target interaction result; and a model adjustment unit, configured to adjust the model parameters of the recommendation policy model by using the user attribute features of the target user, the object attribute features of the target object, the interaction behavior features between the target user and the target object, the target recommendation policy, and the reward value corresponding to the target recommendation policy. It can be understood that, in this embodiment, the recommendation policy model first obtains the target recommendation policy corresponding to the target object based on the user attribute features of the target user, the object attribute features of the target object and the interaction behavior features between the target user and the target object; the recommendation policy model can therefore extract the rich information and the dynamic semantic representation information contained in these features and select different recommendation policies accordingly, which improves the prediction accuracy of the recommendation policy model for the target recommendation policy corresponding to the target object. Then, the real-time target interaction result of the target user for the target object (i.e., feedback of online user behavior) can be used to adjust the model parameters of the recommendation policy model in a timely manner, so that the prediction accuracy of the recommendation policy model for the target recommendation policy corresponding to the target object can be further improved.
Therefore, the target recommendation policy determined by the recommendation policy model for the target user and the target object is a policy better suited to the current environment, so that the online indexes can be further improved, different recommendation policies can be flexibly given for different users and objects (such as commodities), and the lag in reacting to users' real-time online behavior can be compensated for. Meanwhile, the method provided by this embodiment can further improve the data quality of the whole link, contribute cleaner and higher-quality model training sample data for subsequent recall and ranking model iteration, reduce the probability of the recommendation policy model becoming biased, give the recommendation policy model higher interpretability, and facilitate model policy handover and upgrading. The recommendation policy model is also sensitive to online data, so the target recommendation policy corresponding to the target object can be quickly adjusted according to the data distribution (i.e., the target interaction results of target users for the target object).
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 5 is a schematic diagram of a computer device 5 provided by an embodiment of the present disclosure. As shown in fig. 5, the computer device 5 of this embodiment includes: a processor 501, a memory 502, and a computer program 503 stored in the memory 502 and executable on the processor 501. The steps of the various method embodiments described above are implemented when the processor 501 executes the computer program 503. Alternatively, when executing the computer program 503, the processor 501 performs the functions of the modules/units in the apparatus embodiments described above.
Illustratively, the computer program 503 may be divided into one or more modules/units, which are stored in the memory 502 and executed by the processor 501 to complete the present disclosure. The one or more modules/units may be a series of computer program instruction segments capable of performing particular functions, and the instruction segments are used to describe the execution of the computer program 503 in the computer device 5.
The computer device 5 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or the like. The computer device 5 may include, but is not limited to, the processor 501 and the memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the computer device 5 and does not limit the computer device 5, which may include more or fewer components than shown, or combine certain components, or use different components; for example, the computer device may also include input and output devices, network access devices, buses, and the like.
The processor 501 may be a central processing unit (Central Processing Unit, CPU) or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The memory 502 may be an internal storage unit of the computer device 5, for example, a hard disk or an internal memory of the computer device 5. The memory 502 may also be an external storage device of the computer device 5, for example, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the computer device 5. Further, the memory 502 may include both an internal storage unit of the computer device 5 and an external storage device. The memory 502 is used to store the computer program and other programs and data required by the computer device. The memory 502 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing them from each other, and are not used to limit the protection scope of the present disclosure. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative; the division of modules is merely a logical function division, and there may be other divisions in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices or modules, and may be in electrical, mechanical or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present disclosure may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
The integrated modules/units, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flows of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to the legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (10)

1. A recommendation method, the method comprising:
acquiring user attribute characteristics of a target user, object attribute characteristics of a target object and interaction behavior characteristics between the target user and the target object;
inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into a trained recommendation strategy model to obtain a target recommendation strategy corresponding to the target object;
recommending the target object to the target user based on a target recommendation strategy corresponding to the target object;
acquiring a target interaction result of the target user for the target object, and determining a reward value corresponding to the target recommendation strategy according to the target interaction result;
and adjusting model parameters of the recommendation strategy model by using user attribute characteristics of the target user, object attribute characteristics of the target object, interaction behavior characteristics between the target user and the target object, the target recommendation strategy and reward values corresponding to the target recommendation strategy.
2. The method of claim 1, wherein the recommendation strategy model is a classification model.
3. The method according to claim 1, wherein the inputting the user attribute feature of the target user, the object attribute feature of the target object, and the interaction behavior feature between the target user and the target object into the trained recommendation policy model, to obtain the target recommendation policy corresponding to the target object, includes:
inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into the trained recommendation strategy model to obtain candidate recommendation strategies corresponding to the target object and an interaction conversion probability corresponding to each candidate recommendation strategy; and

determining the target recommendation strategy corresponding to the target object according to the candidate recommendation strategies corresponding to the target object and the interaction conversion probability corresponding to each candidate recommendation strategy.
4. The method of claim 1, wherein the recommendation policy model is trained based on a set of historical interaction training samples, the set of historical interaction training samples comprising a plurality of sets of historical interaction training samples, and each set of historical interaction training samples comprising a historical object attribute feature, a historical interaction behavior feature, a historical user attribute feature, and a real recommendation policy corresponding to a historical target object.
5. The method of claim 4, wherein the training process of the recommendation policy model comprises:
for each group of historical interaction training samples in the historical interaction training sample set, inputting the historical object attribute features, the historical interaction behavior features and the historical user attribute features in that sample into the recommendation strategy model to obtain a predicted recommendation strategy corresponding to the historical target object; and adjusting the model parameters of the recommendation strategy model according to the predicted recommendation strategy and the real recommendation strategy corresponding to the historical target object, so as to obtain the trained recommendation strategy model.
6. The method according to claim 1, wherein the adjusting model parameters of the recommendation strategy model by using the user attribute characteristics of the target user, the object attribute characteristics of the target object, the interaction behavior characteristics between the target user and the target object, the target recommendation strategy, and the reward values corresponding to the target recommendation strategy comprises:
adjusting, based on a preset strategy optimization algorithm, the model parameters of the recommendation strategy model by using the user attribute characteristics of the target user, the object attribute characteristics of the target object, the interaction behavior characteristics between the target user and the target object, the target recommendation strategy, and the reward values corresponding to the target recommendation strategy.
7. The method of claim 6, wherein the preset strategy optimization algorithm comprises at least one of the following: a policy gradient algorithm, an actor-critic algorithm, and a proximal policy optimization (PPO) algorithm.
8. A recommendation device, the device comprising:
the device comprises a feature acquisition unit, a strategy acquisition unit, an object recommending unit, a result acquisition unit and a model adjustment unit, wherein the feature acquisition unit is used for acquiring user attribute features of a target user, object attribute features of a target object and interaction behavior features between the target user and the target object;
The strategy acquisition unit is used for inputting the user attribute characteristics of the target user, the object attribute characteristics of the target object and the interaction behavior characteristics between the target user and the target object into a trained recommendation strategy model to obtain a target recommendation strategy corresponding to the target object;
the object recommending unit is used for recommending the target object to the target user based on a target recommending strategy corresponding to the target object;
the result acquisition unit is used for acquiring a target interaction result of the target user for the target object and determining a reward value corresponding to the target recommendation strategy according to the target interaction result;
and the model adjustment unit is used for adjusting model parameters of the recommendation strategy model by using the user attribute characteristics of the target user, the object attribute characteristics of the target object, the interaction behavior characteristics between the target user and the target object, the target recommendation strategy, and the reward values corresponding to the target recommendation strategy.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202310271141.6A 2023-03-17 2023-03-17 Recommendation method and device Pending CN116308658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310271141.6A CN116308658A (en) 2023-03-17 2023-03-17 Recommendation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310271141.6A CN116308658A (en) 2023-03-17 2023-03-17 Recommendation method and device

Publications (1)

Publication Number Publication Date
CN116308658A true CN116308658A (en) 2023-06-23

Family

ID=86795624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310271141.6A Pending CN116308658A (en) 2023-03-17 2023-03-17 Recommendation method and device

Country Status (1)

Country Link
CN (1) CN116308658A (en)

Similar Documents

Publication Publication Date Title
US10497013B2 (en) Purchasing behavior analysis apparatus and non-transitory computer readable medium
US8799186B2 (en) Choice modelling system and method
KR101998400B1 (en) System and method for recommending mobile commerce information using big data
CN116541610B (en) Training method and device for recommendation model
US9911130B1 (en) Attribution modeling using regression analysis
KR102453535B1 (en) Method and apparatus for providing an online shopping platform
US20180012284A1 (en) Information analysis apparatus, information analysis method, and non-transitory computer readable storage medium
CN110717597A (en) Method and device for acquiring time sequence characteristics by using machine learning model
CN111104590A (en) Information recommendation method, device, medium and electronic equipment
CN113407854A (en) Application recommendation method, device and equipment and computer readable storage medium
CN115860870A (en) Commodity recommendation method, system and device and readable medium
KR101998399B1 (en) Mobile commerce system and service method using big data
US20170365014A1 (en) Systems, methods and non-transitory computer readable storage media for tracking and evaluating predictions regarding relationships
CN110502639B (en) Information recommendation method and device based on problem contribution degree and computer equipment
CN110209944B (en) Stock analyst recommendation method and device, computer equipment and storage medium
KR102402551B1 (en) Method, apparatus and computer program for providing influencer searching service
CN116127188A (en) Target feedback value determining method and device, electronic equipment and storage medium
CN116308658A (en) Recommendation method and device
CN111507471B (en) Model training method, device, equipment and storage medium
CN114925275A (en) Product recommendation method and device, computer equipment and storage medium
CN116109354A (en) Content recommendation method, apparatus, device, storage medium, and computer program product
CN113792952A (en) Method and apparatus for generating a model
CN113592315A (en) Method and device for processing dispute order
CN116911912B (en) Method and device for predicting interaction objects and interaction results
CN117893279A (en) Object recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination