CN113836388A - Information recommendation method and device, server and storage medium - Google Patents

Information recommendation method and device, server and storage medium

Info

Publication number
CN113836388A
CN113836388A
Authority
CN
China
Prior art keywords
information
historical
current account
recommendation
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010512868.5A
Other languages
Chinese (zh)
Other versions
CN113836388B (en)
Inventor
王琳
叶璨
黄俊逸
胥凯
闫阳辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010512868.5A priority Critical patent/CN113836388B/en
Publication of CN113836388A publication Critical patent/CN113836388A/en
Application granted granted Critical
Publication of CN113836388B publication Critical patent/CN113836388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to an information recommendation method, apparatus, server, and storage medium, the method comprising: acquiring historical state information of a current account, where the historical state information records operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and the operation information records at least the operation types of the interactive operations between the current account and the historical information; screening at least one target operation type from the operation types according to the operation information of the current account on the historical information; and acquiring a candidate information set according to the historical state information and the target operation type, where the candidate information set is used for pushing target information in the candidate information set to a terminal corresponding to the current account. With the method and apparatus, information recommendation can be personalized according to the operation information of interactive operations that different user accounts perform on historical information, improving the matching degree of the information recommendation.

Description

Information recommendation method and device, server and storage medium
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to an information recommendation method, an information recommendation apparatus, a server, and a storage medium.
Background
With the popularization of the mobile internet, recommendation systems play an increasingly important role in applications. Faced with hundreds of millions of pieces of multimedia information, it is important to accurately recommend information content of interest to users. A conventional recommendation system recommends by selecting content information that balances its various feedback operations. For example, if an application supports a click operation and a praise operation on recommended information, the content the application recommends is usually the content whose sum of the probability of being clicked and the probability of being praised is the largest. However, when the target user has never clicked at all, screening recommended content based on click behavior results in a low matching degree between the recommended information and the user.
Disclosure of Invention
The present disclosure provides an information recommendation method, an information recommendation apparatus, a server, and a storage medium, to at least solve the problem of the low matching degree of information recommendation in the related art. The technical solution of the present disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an information recommendation method, including:
acquiring historical state information of a current account; the historical state information is used for recording operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and at least the operation types of the interactive operations between the current account and the historical information are recorded in the operation information;
screening at least one target operation type from the operation types according to the operation information of the current account on the historical information;
and acquiring a candidate information set according to the historical state information and the target operation type, wherein the candidate information set is used for pushing target information in the candidate information set to a terminal corresponding to the current account.
In one embodiment, the step of screening out at least one target operation type from the operation types according to the operation information of the current account on the historical information includes:
inputting the operation information of the current account on the historical information into a pre-constructed first recommendation model, and acquiring a target operation type through the first recommendation model;
the step of obtaining a candidate information set according to the historical state information and the target operation type includes:
inputting the historical state information of the current account and the target operation type into a pre-constructed second recommendation model, and acquiring the candidate information set through the second recommendation model;
after the step of obtaining a candidate information set according to the historical state information and the target operation type, the method further includes:
acquiring feedback operation information of the current account on the target information in the candidate information set;
and obtaining a feedback value corresponding to the target operation type according to the feedback operation information, wherein the feedback value is used for carrying out iterative update on the second recommendation model.
In one embodiment, after the step of obtaining the candidate information set according to the historical state information and the target operation type, the method further includes:
determining a feedback value corresponding to each operation type according to the feedback operation information;
and obtaining a sum of the feedback values corresponding to the operation types, and determining the sum of the feedback values corresponding to the operation types as update information for performing iterative update on the first recommendation model.
In one embodiment, after the step of obtaining feedback operation information of the current account on each piece of target information in the candidate information set, the method further includes:
and updating the historical state information of the current account according to the candidate information set and the feedback operation information.
In one embodiment, before the step of inputting the operation information of the current account on the historical information into the first pre-constructed recommendation model, the method includes:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, update information of the first historical moment, and historical state information of a second historical moment;
inputting the historical state information of the second historical moment into the first recommendation model to obtain an update information predicted value of the second historical moment, and calculating an update information accumulated value of the first historical moment according to the update information predicted value of the second historical moment and the update information of the first historical moment;
inputting the historical state information of the first historical moment into the first recommendation model to obtain an update information predicted value of the first historical moment;
and updating the network parameters of the first recommendation model according to the difference between the update information predicted value of the first historical moment and the update information accumulated value of the first historical moment.
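The training procedure above resembles a temporal-difference value update: the accumulated value for the first moment is the observed update information plus the (discounted) predicted value of the second moment, and the model is trained to reduce the gap between its own prediction for the first moment and that target. A minimal numerical sketch, where the tabular value store, discount factor, and learning rate are illustrative assumptions not stated in the disclosure:

```python
def td_update(value, s1, r1, s2, gamma=0.9, lr=0.5):
    """One TD(0)-style step: move V(s1) toward r1 + gamma * V(s2)."""
    target = r1 + gamma * value.get(s2, 0.0)   # accumulated value of the first moment
    error = target - value.get(s1, 0.0)        # difference used to update parameters
    value[s1] = value.get(s1, 0.0) + lr * error
    return value

V = {"s2": 1.0}
V = td_update(V, s1="s1", r1=2.0, s2="s2")
# V["s1"] moves halfway toward 2.0 + 0.9 * 1.0 = 2.9, i.e. to 1.45
```

A neural-network first recommendation model would replace the table with learned parameters updated by gradient descent on the same difference.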
In one embodiment, before the step of inputting the historical state information of the current account and the target operation type into a pre-constructed second recommendation model, the method further includes:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, a target operation type of the first historical moment, a feedback value of the first historical moment, historical state information of a second historical moment, and a target operation type of the second historical moment;
inputting the historical state information of the second historical moment and the target operation type of the second historical moment into the second recommendation model, obtaining a feedback predicted value of the second historical moment, and calculating a predicted accumulated feedback value of the first historical moment according to the feedback predicted value of the second historical moment and the feedback value of the first historical moment;
inputting the historical state information of the first historical moment and the target operation type of the first historical moment into the second recommendation model, and obtaining a feedback predicted value of the first historical moment;
and updating the network parameters of the second recommendation model according to the difference between the predicted accumulated feedback value at the first historical moment and the predicted feedback value at the first historical moment.
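The second model's training follows the same pattern with the target operation type acting as an action input, i.e. a SARSA-style update on state–action pairs. Again a numerical sketch; the tabular store, discount factor, and learning rate are illustrative assumptions:

```python
def sarsa_update(q, s1, a1, r1, s2, a2, gamma=0.9, lr=0.5):
    """Move Q(s1, a1) toward r1 + gamma * Q(s2, a2)."""
    target = r1 + gamma * q.get((s2, a2), 0.0)  # predicted accumulated feedback value
    error = target - q.get((s1, a1), 0.0)       # difference driving the parameter update
    q[(s1, a1)] = q.get((s1, a1), 0.0) + lr * error
    return q

Q = {("s2", "praise"): 1.0}
Q = sarsa_update(Q, "s1", "click", r1=1.0, s2="s2", a2="praise")
# Q[("s1", "click")] moves halfway toward 1.0 + 0.9 * 1.0 = 1.9, i.e. to 0.95
```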
According to a second aspect of the embodiments of the present disclosure, there is provided an information recommendation apparatus including:
the state information acquisition module is configured to execute acquisition of historical state information of the current account; the historical state information is used for recording operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and at least the operation types of the interactive operations between the current account and the historical information are recorded in the operation information;
the target operation type acquisition module is configured to execute screening of at least one target operation type from the operation types according to the operation information of the current account on the historical information;
and the information recommendation module is configured to execute acquisition of a candidate information set according to the historical state information and the target operation type, wherein the candidate information set is used for pushing target information in the candidate information set to the terminal corresponding to the current account.
In an exemplary embodiment, the target operation type obtaining module is configured to perform: inputting the operation information of the current account on the historical information into a pre-constructed first recommendation model, and acquiring a target operation type through the first recommendation model;
an information recommendation module configured to perform: inputting the historical state information of the current account and the target operation type into a pre-constructed second recommendation model, and acquiring the candidate information set through the second recommendation model;
the information recommendation device further comprises a model update module configured to perform: acquiring feedback operation information of the current account on the target information in the candidate information set;
and obtaining a feedback value corresponding to the target operation type according to the feedback operation information, wherein the feedback value is used for carrying out iterative update on the second recommendation model.
In an exemplary embodiment, the model update module is configured to perform:
determining a feedback value corresponding to each operation type according to the feedback operation information;
and obtaining a sum of the feedback values corresponding to the operation types, and determining the sum of the feedback values corresponding to the operation types as update information for performing iterative update on the first recommendation model.
In an exemplary embodiment, the apparatus further comprises a status information updating module configured to perform: and updating the historical state information of the current account according to the candidate information set and the feedback operation information.
In an exemplary embodiment, the model update module is configured to perform:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, update information of the first historical moment, and historical state information of a second historical moment;
inputting the historical state information of the second historical moment into the first recommendation model to obtain an update information predicted value of the second historical moment, and calculating an update information accumulated value of the first historical moment according to the update information predicted value of the second historical moment and the update information of the first historical moment;
inputting the historical state information of the first historical moment into the first recommendation model to obtain an update information predicted value of the first historical moment;
and updating the network parameters of the first recommendation model according to the difference between the update information predicted value of the first historical moment and the update information accumulated value of the first historical moment.
In an exemplary embodiment, the model update module is configured to perform:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, a target operation type of the first historical moment, a feedback value of the first historical moment, historical state information of a second historical moment, and a target operation type of the second historical moment;
inputting the historical state information of the second historical moment and the target operation type of the second historical moment into the second recommendation model, obtaining a feedback predicted value of the second historical moment, and calculating a predicted accumulated feedback value of the first historical moment according to the feedback predicted value of the second historical moment and the feedback value of the first historical moment;
inputting the historical state information of the first historical moment and the target operation type of the first historical moment into the second recommendation model, and obtaining a feedback predicted value of the first historical moment;
and updating the network parameters of the second recommendation model according to the difference between the predicted accumulated feedback value at the first historical moment and the predicted feedback value at the first historical moment.
According to a third aspect of the embodiments of the present disclosure, there is provided a server, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the information recommendation method as described in any embodiment of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions that, when executed by a processor of a server, enable the server to perform the information recommendation method as described in any one of the embodiments of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, the program product comprising a computer program, the computer program being stored in a readable storage medium, from which the computer program is read and executed by at least one processor of an apparatus, such that the apparatus performs the information recommendation method described in any one of the embodiments of the first aspect.
The technical solution provided by the embodiments of the present disclosure brings at least the following beneficial effects: the historical state information of the current account is acquired, where the historical state information records operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and the operation information records at least the operation types of the interactive operations between the current account and the historical information; at least one target operation type is screened out from the operation types according to the operation information of the current account on the historical information; and a candidate information set is acquired according to the historical state information and the target operation type, the candidate information set being used for pushing target information in the candidate information set to the terminal corresponding to the current account. In this way, information recommendation is personalized according to the operation information of interactive operations that different user accounts perform on historical information, and the matching degree of the information recommendation is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram illustrating an application environment of an information recommendation method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating an information recommendation method according to an example embodiment.
Fig. 3 is a flowchart illustrating an information recommendation method according to yet another exemplary embodiment.
FIG. 4 is a block diagram illustrating a first recommendation model and a second recommendation model, according to an example embodiment.
Fig. 5 is a block diagram illustrating an information recommendation apparatus according to an example embodiment.
Fig. 6 is an internal block diagram of a server according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The information recommendation method provided by the present disclosure may be applied to an application environment as shown in fig. 1. Wherein the terminal 110 interacts with the server 120 through the network. The terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 120 may be implemented as an independent server or a server cluster formed by a plurality of servers.
Fig. 2 is a flowchart illustrating an information recommendation method according to an exemplary embodiment. The information recommendation method is used in the server shown in fig. 1 and, as shown in fig. 2, includes the following steps:
in step S210, acquiring historical state information of the current account; the historical state information is used for recording operation information of the current account performing interactive operation on the historical information, the historical information is information historically recommended to the current account, and at least an operation type of the interactive operation between the current account and the historical information is recorded in the operation information.
The current account refers to the object of information recommendation. For example, the current account may be a user account logged in to a client. Taking a short-video application client as an example, the current account is the account of a viewer user logged in to the short-video application, and the information to be recommended is short-video data. It can be understood that the server may recommend different information to the client corresponding to the current account, and the corresponding user may perform interactive operations of different operation types on the pushed information through the terminal, which may include, but are not limited to, a click operation, a praise operation, a follow operation, a long-play operation, and the like.
The historical state information is used for recording operation information of interactive operations performed by the current account on the historical information; for example, it records the operation types of the interactive operations between the current account and the historical information, and may include an information list recently clicked by the current account, an information list of praised items, a list of followed user accounts, and the like. Further, the historical state information may also include, but is not limited to, account attribute information and account login information of the current account; for example, the account attributes may include the account age, the user gender, the terminal model with which the user account logs in, and the like, and the account login information may include the login duration of historical logins and the number of items clicked during the current login.
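As a rough illustration only, the historical state information described above could be represented as a record such as the following; all field names are hypothetical and not specified by the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class HistoricalState:
    """Hypothetical container for one account's historical state information."""
    account_id: str
    clicked_items: list = field(default_factory=list)      # recently clicked information
    praised_items: list = field(default_factory=list)      # praised information
    followed_accounts: list = field(default_factory=list)  # followed user accounts
    account_age_days: int = 0       # account attribute information
    login_duration_s: float = 0.0   # account login information
    clicks_this_login: int = 0

state = HistoricalState(account_id="u123", clicked_items=["v1", "v2"])
```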
Specifically, the user corresponding to the current account may perform an operation through a terminal, and the terminal, in response to the operation, sends an information recommendation request to the server. After receiving the information recommendation request sent by the terminal corresponding to the current account, the server acquires the historical state information of the current account according to the account identifier carried in the request, and obtains from that historical state information the recorded operation information of interactive operations performed by the current account on the historical information.
In step S220, at least one target operation type is screened out from the operation types according to the operation information of the current account on the historical information.
The target operation type refers to the operation type of the interactive operation that needs to be optimized in the information recommendation process, and includes at least one of interactive operations such as the click operation, praise operation, follow operation, and long-play operation. For example, suppose the historical operation information in the state information of the current account shows that a click operation has never been fed back, i.e., the probability that the user clicks recommended information is low; then the recommendation information pushed to the current account should be the content with the maximum probability of being clicked and read, guiding the user to click the recommended information so that more users click. In this case, the target operation type is the click operation. For another example, suppose the historical operation information shows many click-operation feedbacks but a praise operation has never been fed back; then the recommendation information pushed to the current account should be the content with the maximum probability of being praised, guiding the user to praise the recommended information so that more users exhibit praise behavior. In this case, the target operation type is the praise operation.
After acquiring the operation information of the current account on the historical information, the server obtains the target operation type according to this operation information. Specifically, the target operation type may be obtained from the state information of the current account by using model algorithms such as a neural network model or a random forest model.
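As a non-authoritative sketch of step S220, a simple stand-in for the model-based screening is to count historical feedbacks per operation type and pick the least-fed-back type; the disclosure instead uses a trained model such as a neural network or random forest, so this heuristic is only illustrative:

```python
def select_target_operation_type(operation_counts: dict) -> str:
    """Pick the operation type with the fewest historical feedbacks,
    i.e. the interaction we most want to encourage (illustrative heuristic)."""
    return min(operation_counts, key=operation_counts.get)

# An account that clicks a lot but never praises -> optimize for praise.
counts = {"click": 57, "praise": 0, "follow": 3, "long_play": 12}
target = select_target_operation_type(counts)  # "praise"
```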
In step S230, a candidate information set is obtained according to the historical state information and the target operation type, where the candidate information set is used to push target information in the candidate information set to a terminal corresponding to the current account.
After the target operation type is determined, a candidate information set is selected according to the historical state information of the current account and the target operation type, and is recommended to the current account. The candidate information set contains the information to be recommended for which the probability that the current account performs the target operation type is high. Specifically, the state information of the current account, the target operation type, and the information to be recommended are input into a pre-constructed neural network model for predicting recommendation probability; this model predicts, for each piece of information to be recommended, the probability that the current account performs the interactive operation corresponding to the target operation type after the information is pushed to the terminal corresponding to the current account; the candidate information set is then selected according to the probability value corresponding to each piece of information to be pushed.
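Step S230 can be sketched as scoring each piece of candidate information with a probability model and keeping the top-scoring items. This is a minimal sketch: the `predict_prob` callable is a placeholder for the pre-constructed neural network, and the cutoff `k` is an assumed parameter:

```python
def build_candidate_set(state_features, target_op, items, predict_prob, k=2):
    """Score every item by the predicted probability that the current account
    performs the target operation on it, then keep the top-k items."""
    scored = [(predict_prob(state_features, target_op, item), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]

# Toy stand-in for the neural network's probability prediction.
probs = {"v1": 0.9, "v2": 0.2, "v3": 0.6}
top = build_candidate_set({}, "praise", ["v1", "v2", "v3"],
                          lambda s, a, item: probs[item], k=2)  # ["v1", "v3"]
```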
According to the above information recommendation method, the historical state information of the current account is acquired, where the historical state information records operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and the operation information records at least the operation types of the interactive operations between the current account and the historical information; at least one target operation type is screened out from the operation types according to the operation information of the current account on the historical information; and a candidate information set is acquired according to the historical state information and the target operation type, the candidate information set being used for pushing target information in the candidate information set to the terminal corresponding to the current account. In this way, information recommendation is carried out in a personalized manner according to the operation information of interactive operations that different user accounts perform on historical information, and the matching degree of the information recommendation is improved.
In an exemplary embodiment, as shown in fig. 3, fig. 3 is a flowchart illustrating an information recommendation method according to an exemplary embodiment, the information recommendation method including the steps of:
in step S310, acquiring historical status information of the current account; the historical state information is used for recording operation information of the current account performing interactive operation on the historical information, the historical information is information historically recommended to the current account, and at least an operation type of the interactive operation between the current account and the historical information is recorded in the operation information;
in step S320, inputting the operation information of the current account on the historical information into a first recommendation model which is constructed in advance, and obtaining a target operation type through the first recommendation model;
in step S330, the historical state information of the current account and the target operation type are input into a second pre-constructed recommendation model, and a candidate information set is obtained through the second recommendation model, where the candidate information set is used to push target information in the candidate information set to a terminal corresponding to the current account.
After obtaining the historical state information of the current account, the server may input the operation information in the historical state information into the first recommendation model, which outputs probability values for the various operation types, so that the target operation type to be optimized is determined according to these probability values. After the target operation type is determined, the server inputs the historical state information of the current account and the target operation type into the second recommendation model, which outputs, for each piece of information to be recommended, the probability that the current account will execute the interactive operation corresponding to the target operation type after that information is pushed to the terminal corresponding to the current account. After the probability value of each piece of information to be recommended is obtained, the server selects the candidate information set according to these probability values and recommends the candidate information set to the current account.
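To make the two-stage flow concrete, here is a minimal sketch in Python. The scoring rules, the per-item dictionary `p` of precomputed operation probabilities, and the operation-type names are illustrative stand-ins for the trained neural network models, not the models themselves:

```python
# Hypothetical two-stage selection mirroring the first/second recommendation
# models; all names and scoring rules below are illustrative assumptions.
OPERATION_TYPES = ["click", "like", "follow", "long_play"]

def first_model(history_state):
    """Stand-in for the first recommendation model: returns a probability
    value for each operation type given the account's historical state."""
    counts = {op: history_state.get(op, 0) for op in OPERATION_TYPES}
    # Favour under-represented operation types (illustrative policy only).
    weights = {op: 1.0 / (1 + counts[op]) for op in OPERATION_TYPES}
    total = sum(weights.values())
    return {op: w / total for op, w in weights.items()}

def second_model(history_state, target_op, candidates):
    """Stand-in for the second recommendation model: ranks each item by the
    (assumed, precomputed) probability the account performs target_op on it."""
    return sorted(candidates, key=lambda item: item["p"][target_op], reverse=True)

def recommend(history_state, candidates, k=2):
    probs = first_model(history_state)
    target_op = max(probs, key=probs.get)   # operation type to be optimized
    ranked = second_model(history_state, target_op, candidates)
    return target_op, ranked[:k]            # the candidate information set
```

For instance, an account with many clicks but no praise operations would be steered toward items likely to be praised, matching the short-video example given below.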
The first recommendation model and the second recommendation model are trained neural network models; the first recommendation model is used for predicting a probability value for each preset optimization operation type according to the input historical state information, and the second recommendation model is used for predicting a sampling probability for each piece of information to be recommended according to the input historical state information and the target operation type. Specifically, the first recommendation model and the second recommendation model may be reinforcement learning models; further, they may be Markov decision process models.
The reinforcement learning model can be understood as two interacting bodies: an Agent and an Environment. The agent can sense the State of the environment and the Reward fed back by the environment, and learn and decide based on the sensed state and reward. That is, the agent has the dual functions of learning and decision making. The decision function means that the agent can take different actions according to the state of the environment and its policy. The learning function means that the agent can sense the state of the external environment and the reward fed back, and learn and improve its policy based on them. In this exemplary embodiment, the current account may be regarded as the environment, the first recommendation model and the second recommendation model are different agents, and the state spaces of both models are the state information of the current account. Specifically, as shown in fig. 4, fig. 4 is a schematic diagram of the first recommendation model and the second recommendation model in an embodiment. Assuming that the current time is time t, the first recommendation model generates a target operation type g_t (i.e., the action taken by the agent, here the first recommendation model, according to the state of the environment and its policy) based on the historical state information s_t of the current account (i.e., the state of the environment), where the target operation type is one of the preset optimization operation types, for example one of a click operation, a praise operation, an attention operation, and a long-play operation.
The second recommendation model selects recommendation information a_t based on the historical state information s_t of the current account and the target operation type g_t given by the first recommendation model (i.e., the state of the environment includes s_t and g_t), and obtains the candidate information set (the second recommendation model being an agent that acts according to the state of the environment and its policy). That is, at each moment the second recommendation model selects only the information to be recommended for which the probability that the user performs the interactive operation of the target operation type is relatively high.
For example, assume that in a short-video application the operation types of interactive operations on recommended video information include a click operation and a praise operation. When the current account belongs to a new user with no history of clicking to watch short videos, the server inputs the historical state information of the current account into the first recommendation model, which outputs the click operation as the target operation type; the target operation type (click operation) and the historical state information are then input into the second recommendation model, which selects video information with a high probability of being clicked by the current account, and a candidate information set is generated and pushed to the terminal corresponding to the current account. After the user corresponding to the current account has used the short-video application for a period of time, the historical operation information contains many click operations but few praise operations; the server then inputs the historical state information of the current account into the first recommendation model, which outputs the praise operation as the target operation type; the target operation type (praise operation) and the historical state information are input into the second recommendation model, which selects video information with a high probability of receiving a praise operation from the current account, and a candidate information set is generated and pushed to the terminal corresponding to the current account.
In this way, the reinforcement learning models personalize information recommendation according to the operation information of interactive operations performed by different user accounts on historical information, and the matching degree of the information recommendation is improved.
Further, in an exemplary embodiment, after the step of obtaining a candidate information set according to the historical state information and the target operation type, the method further includes: acquiring feedback operation information of the current account on the target information in the candidate information set; and obtaining a feedback value corresponding to the target operation type according to the feedback operation information, wherein the feedback value is used for carrying out iterative update on the second recommendation model.
The present exemplary embodiment describes the iterative update process of the second recommendation model. The feedback operation information comprises the operation types of the different interactive operations that the user, through the terminal, causes the current account to execute on each piece of target information in the candidate information set. In a specific application process, the server acquires the feedback operation information of the current account for the target information, obtains a feedback value corresponding to the target operation type based on the feedback operation information, determines this feedback value as the feedback value of the second recommendation model, and updates the second recommendation model according to it. Specifically, interactive operations of different operation types correspond to different reward values, namely feedback values; the feedback value used as the reward given to the second recommendation model by the environment this time is determined according to the reward value of the interactive operation corresponding to the target operation type and according to whether the feedback operation information includes the target operation type.
For example, suppose the reward score corresponding to a click operation is 10 points, that corresponding to a praise operation is 20 points, that corresponding to an attention operation is 30 points, and that corresponding to a long-play operation is 40 points. Suppose further that in this information recommendation round the target operation type is a praise operation. After the server recommends the candidate information set to the client corresponding to the current account, the user may click target information in the candidate set through the client to read it and then perform a praise operation on it; that is, the feedback operation information of the current account for the target information includes the click operation and the praise operation. The feedback value of the second recommendation model is then determined according to the reward value corresponding to the target operation type, namely the praise operation, i.e., 20 points. In other words, the reward function optimized by the second recommendation model covers only one of all the user's feedback operations, and which one is determined by the first recommendation model. In this embodiment, the reward function of the second recommendation model is designed automatically according to the state information corresponding to different accounts, so that the recommended information matches each account and the matching degree of the recommended information is improved.
In an exemplary embodiment, after the step of obtaining a candidate information set according to the historical state information and the target operation type, the method further includes: determining a feedback value corresponding to each operation type according to the feedback operation information; and obtaining a sum of the feedback values corresponding to the operation types, and determining the sum of the feedback values corresponding to the operation types as update information for performing iterative update on the first recommendation model.
The present exemplary embodiment describes the iterative update process of the first recommendation model. Since the first recommendation model and the second recommendation model are different agents, each agent can sense the state of the external environment and the reward fed back, and learn and improve its policy based on them. In a specific application process, the server acquires the feedback operation information of the current account for the target recommendation information and obtains the update information of the first recommendation model based on the feedback operation information.
Specifically, interactive operations of different operation types correspond to different reward scores, namely feedback values, and the update information used as the reward given to the first recommendation model by the environment this time may be determined as the sum of the reward scores corresponding to all feedback operation information. For example, suppose the reward score corresponding to a click operation is 10 points, that corresponding to a praise operation is 20 points, that corresponding to an attention operation is 30 points, and that corresponding to a long-play operation is 40 points. In the information recommendation process, after the server recommends the candidate information set to the terminal corresponding to the current account, the user may perform a click operation on target information in the candidate information set through the terminal to read it and then perform a praise operation on it; that is, the feedback operation information of the current account for the target information includes the click operation and the praise operation, and the update information of the first recommendation model is the sum of the reward scores corresponding to the click operation and the praise operation, namely 30 points. In other words, the reward function optimized by the first recommendation model is the sum of the reward values corresponding to all feedback operations of the account.
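The two reward signals described above can be sketched as follows. The reward scores are the ones used in the document's example, and the operation-type names are illustrative:

```python
# Illustrative reward scheme from the example above: click 10, praise ("like")
# 20, attention ("follow") 30, long play 40. Values and names are examples only.
REWARD = {"click": 10, "like": 20, "follow": 30, "long_play": 40}

def second_model_reward(feedback_ops, target_op):
    """Feedback value for the second model: only the target operation type's
    reward counts, and only if the user actually performed that operation."""
    return REWARD[target_op] if target_op in feedback_ops else 0

def first_model_reward(feedback_ops):
    """Update information for the first model: the sum of reward values over
    all feedback operations the user performed."""
    return sum(REWARD[op] for op in feedback_ops)
```

With feedback {click, like} and target operation "like", the second model receives 20 points while the first model receives 10 + 20 = 30 points, matching the worked example in the text.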
The above exemplary embodiments are explained in conjunction with the first recommendation model and the second recommendation model shown in fig. 4. At the current time t, the first recommendation model generates a target operation type g_t according to the historical state information s_t of the current account; g_t represents the objective that the second recommendation model is required to optimize. The second recommendation model selects target recommendation information a_t according to the historical state information s_t of the current account and the target operation type g_t given by the first recommendation model, and recommends a_t to the current account. The user corresponding to the current account performs certain feedback operations on the target recommendation information a_t. The first recommendation model takes the sum of the reward values corresponding to all the feedback operation information as the reward given by the environment this time, i.e., acquires the update information, and iteratively updates its network parameters; the second recommendation model takes only the reward value of the feedback operation corresponding to the target operation type g_t as the reward given by the environment, i.e., obtains the feedback value, and iteratively updates its network parameters.
In an exemplary embodiment, after the step of obtaining the feedback operation information of the current account for the target recommendation information, the method further includes: and updating the historical state information of the current account according to the candidate information set and the feedback operation information.
After the feedback operation information of the current account for the candidate information set is acquired, the server may add the feedback operation information to the state information of the current account to update the historical state information of the current account.
For example, the historical state information often includes operation information of the account, such as the list of information the account has recently clicked, the list of information the account has praised, the list of followed user accounts, and the like. After the server pushes the candidate information set to the client corresponding to the account, the user corresponding to the account performs feedback operations, such as click, praise, attention, and long-play operations, on the target information through the terminal. The server then acquires the feedback operation information of the account for the target information in the candidate information set; taking a click operation as an example, the server adds the clicked target information in the candidate information set to the clicked-information list in the historical state information based on the feedback operation information, so as to update the state information of the user account.
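A minimal sketch of this state-update step, with hypothetical field names for the per-operation lists:

```python
def update_state(history_state, target_items, feedback_ops):
    """Append the recommended items that received feedback to the matching
    lists in the account's historical state (field names are illustrative)."""
    lists = {"click": "clicked", "like": "liked", "follow": "followed"}
    for op in feedback_ops:
        key = lists.get(op)
        if key is not None:
            history_state.setdefault(key, []).extend(target_items)
    return history_state
```

After a click on item 202, for instance, the clicked-information list grows by that item while the other lists are untouched.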
In an exemplary embodiment, before the step of inputting the operation information of the current account on the historical information into the pre-constructed first recommendation model, the method includes: acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments, wherein the training sample includes: the historical state information of a first historical moment, the update information of the first historical moment, and the historical state information of a second historical moment; inputting the historical state information of the second historical moment into the first recommendation model to obtain an update-information predicted value of the second historical moment, and calculating an update-information accumulated value of the first historical moment according to the update-information predicted value of the second historical moment and the update information of the first historical moment; inputting the historical state information of the first historical moment into the first recommendation model to predict the update-information predicted value of the first historical moment; and updating the network parameters of the first recommendation model according to the difference between the update-information predicted value of the first historical moment and the update-information accumulated value of the first historical moment.
The present exemplary embodiment describes the training process of the first recommendation model. The first recommendation model comprises a first policy model and a first evaluation model. The first policy model is a neural network whose input is the state information s_t of the user account and whose output is the probability value of selecting the operation type corresponding to each interactive operation, so that the target operation type is determined according to these probability values. The first evaluation model is also a neural network with the same structure as the first policy model, except that the first evaluation model outputs a scalar value representing the expected cumulative reward that the first policy model can obtain under the current policy.
In an exemplary embodiment, before the step of inputting the historical state information of the current account and the target operation type into the pre-constructed second recommendation model, the method further includes: acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments, wherein the training sample includes: the historical state information of a first historical moment, the target operation type of the first historical moment, the feedback value of the first historical moment, the historical state information of a second historical moment, and the target operation type of the second historical moment; inputting the historical state information of the second historical moment and the target operation type of the second historical moment into the second recommendation model to obtain a feedback predicted value of the second historical moment, and calculating a predicted accumulated feedback value of the first historical moment according to the feedback predicted value of the second historical moment and the feedback value of the first historical moment; inputting the historical state information of the first historical moment and the target operation type of the first historical moment into the second recommendation model to obtain a feedback predicted value of the first historical moment; and updating the network parameters of the second recommendation model according to the difference between the predicted accumulated feedback value of the first historical moment and the feedback predicted value of the first historical moment.
The present exemplary embodiment describes the training process of the second recommendation model. The second recommendation model comprises a second policy model and a second evaluation model. The second policy model is a neural network whose inputs are the state information of the user account and the target operation type and whose output is the probability value of each piece of information to be recommended, so that the target recommendation information is determined according to these probability values. The second evaluation model is also a neural network with the same structure as the second policy model, except that the second evaluation model outputs a scalar value representing the expected cumulative reward that the second policy model can obtain under the current policy.
The technical solution of the present disclosure is further explained with reference to fig. 4. As shown in fig. 4, the first recommendation model and the second recommendation model may be Markov decision process models. Assuming that the current time is time t, the first recommendation model generates a target operation type g_t (i.e., the action taken by the agent, here the first recommendation model, according to the state of the environment and its policy) based on the state information s_t of the current account (i.e., the state of the environment), where the target operation type is one of the preset optimization operation types, such as one of a click operation, a praise operation, an attention operation, and a long-play operation. The second recommendation model selects target recommendation information a_t (i.e., the action taken by the agent, here the second recommendation model, according to the state of the environment and its policy) based on the state information s_t of the current account and the target operation type g_t given by the first recommendation model (i.e., the state of the environment includes s_t and g_t). That is, at each moment the second recommendation model selects only the information to be recommended for which the probability that the user performs the interactive operation corresponding to the target operation type is relatively high.
The state space of the first recommendation model and the second recommendation model is the historical state information of the current account. The historical state information may include operation information of the user account, account attribute information, and user login information: the operation information may include the list of information recently clicked by the user account, the list of praised information, the list of followed user accounts, and the like; the account attribute information may include the account age, the user's gender, the type of terminal the user account logs in from, and the like; and the user login information may include the duration of the user account's historical logins and the number of items clicked in the current login.
The action space G of the first recommendation model includes the operation types g ∈ G corresponding to the different interactive operations, and corresponds to the objective to be optimized by the second recommendation model; specifically, the target operation type includes a click operation, a praise operation, an attention operation, and a long-play operation.
The reward r of the first recommendation model is: after target recommendation information is recommended to the current account, the sum of the reward values of the interactive operations corresponding to all feedback operations of the user corresponding to the current account.
The state transition P of the first recommendation model is: after target recommendation information is recommended to the user account, the probability of transferring to state s' after taking action g in the current state s.
The action space A of the second recommendation model is: a specific action a ∈ A represents the target information actually selected in this information recommendation round.
The reward r_g of the second recommendation model is: after target recommendation information is recommended to the user account, the reward value of the interactive operation corresponding to the target operation type selected by the first recommendation model.
The discount rate γ ∈ [0,1] of the first recommendation model and the second recommendation model attenuates future rewards in a certain proportion.
If p(u, i) denotes the probability that user u clicks recommendation information i, then the probability that user u does not click i is q(u, i) = 1 − p(u, i).
The probability that user u clicks none of the recommended information is q(u) = ∏_{i=0}^{n−1} q(u, i), so the probability of at least one click is:
p(u) = 1 − ∏_{i=0}^{n−1} q(u, i)
where n denotes the number of pieces of recommendation information browsed by user u this time.
From the above formula it can be seen that increasing p(u) depends mainly on two factors: displaying more recommendation information, and raising the user's click rate on each item. Because the target operation type optimized by the second recommendation model differs from round to round, the recommended information is more diverse than when a well-balanced, comprehensive mix (click + praise + attention + long play) is selected every time, so empirically the user browses more recommendation information. On the other hand, suppose the total
∑_{i=0}^{n−1} q(u, i)
is fixed, which corresponds to the overall click rate being stable after the first and second recommendation models have been trained. By the inequality of arithmetic and geometric means,
∏_{i=0}^{n−1} q(u, i) ≤ ( (1/n) ∑_{i=0}^{n−1} q(u, i) )^n
with equality if and only if q(u,0) = q(u,1) = … = q(u,n−1). Therefore, when recommendation information is selected in a relatively balanced, comprehensive way, the click rates of the individual items differ little, the product ∏ q(u, i) is at its largest, and the probability that the user clicks at least once is smallest; conversely, when only one operation type is optimized each time, the click rates of the individual items differ greatly, and the probability that the user clicks the recommended information at least once is larger.
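This can be checked numerically: holding the total of the per-item click probabilities fixed, a skewed distribution yields a higher at-least-one-click probability p(u) than a balanced one (the probability values below are arbitrary examples):

```python
from math import prod

def p_at_least_one_click(click_probs):
    """p(u) = 1 - prod_i (1 - p(u, i))."""
    return 1 - prod(1 - p for p in click_probs)

# Same total click mass (0.6) distributed two ways over three items:
balanced = [0.2, 0.2, 0.2]
skewed   = [0.5, 0.05, 0.05]

p_bal = p_at_least_one_click(balanced)   # 1 - 0.8**3  = 0.488
p_skw = p_at_least_one_click(skewed)     # 1 - 0.5*0.95*0.95 = 0.54875
assert p_skw > p_bal   # skewed click rates raise the at-least-once probability
```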
In practice, the Actor-Critic algorithm may be used for both the first recommendation model and the second recommendation model. The Actor-Critic algorithm consists of a policy model and an evaluation model. The policy model is a neural network that outputs the probability π_θ(a|s) of selecting each action, where θ is the parameter of the policy model; in the first recommendation model the action probabilities output by the policy model are the probabilities of selecting the operation type corresponding to each interactive operation, and in the second recommendation model they are the probabilities of selecting each piece of information to be recommended. The evaluation model is also a neural network whose bottom-layer parameters have the same structure as, and are shared with, those of the policy model, except that at its end the evaluation model outputs a single scalar value representing the expected cumulative reward the agent can obtain under the current policy π_θ:
V_w(s) = ∑_a π_θ(a|s) ( r(a, s) + γ ∑_{s'} P(s'|s, a) V_w(s') )
where w is the network parameter of the evaluation model.
Specifically, the first recommendation model and the second recommendation model need to be trained in advance. Suppose the network parameter of the policy model in the first recommendation model is θ, the network parameter of its evaluation model is w, the network parameter of the policy model in the second recommendation model is θ', and the network parameter of its evaluation model is w'. The first and second recommendation models are trained with the historical state information of the user account; first, sample data (s, g, a, r_g, r, s', g', T) is collected, where s denotes the state information of the user account at the current moment; g denotes the target operation type at the current moment; a denotes the target information selected at the current moment; r denotes the sum of the reward values corresponding to all feedback operation information, i.e., the update information; r_g denotes the reward value corresponding to the target operation type g, i.e., the feedback value; T indicates whether the moment following the current moment is the termination moment: T is marked 1 when the next moment is the termination moment and 0 otherwise; s' denotes the state information of the user account at the next moment; and g' denotes the target operation type at the next moment.
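For illustration, a collected sample tuple could be held in a simple container such as the following (the field names simply mirror the tuple described above, and the example values are hypothetical):

```python
from collections import namedtuple

# One collected training transition (s, g, a, r_g, r, s', g', T).
Sample = namedtuple("Sample", ["s", "g", "a", "r_g", "r", "s_next", "g_next", "T"])

# e.g. a non-terminal transition where the target operation type was a praise
# operation rewarded 20 points, with total feedback reward 30 points:
sample = Sample(s="state_t", g="praise", a="item_7",
                r_g=20, r=30, s_next="state_t1", g_next="click", T=0)
```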
Specifically, the parameters of the policy network of the first recommendation model may be updated from the collected sample data by the following formula (reconstructed here from the surrounding definitions, the original formula image being unavailable):
θ ← θ + α ( r + γ(1−T) V_w(s') − V_w(s) ) ∇_θ log π_θ(g|s)
where α is the learning rate of the policy network of the first recommendation model; the factor (1−T) indicates whether the next moment is the termination moment, so that γ(1−T) equals 0 when the next moment is the termination moment and γ otherwise; and ∇_θ log π_θ(g|s) represents the gradient used during the training of the policy network of the first recommendation model.
Likewise, the parameters of the policy network of the second recommendation model are updated by the following formula:
θ' ← θ' + α ( r_g + γ(1−T) V_{w'}(s', g') − V_{w'}(s, g) ) ∇_{θ'} log π_{θ'}(a|s, g)
where α is the learning rate of the policy network of the second recommendation model; γ(1−T) equals 0 when the next moment is the termination moment and γ otherwise; and ∇_{θ'} log π_{θ'}(a|s, g) represents the gradient used during the training of the policy network of the second recommendation model.
Likewise, the parameters of the evaluation network of the first recommendation model may be updated by the following formula:
w ← w + α ( r + γ(1−T) V_w(s') − V_w(s) ) ∇_w V_w(s)
where α is the learning rate of the evaluation network of the first recommendation model; γ(1−T) equals 0 when the next moment is the termination moment and γ otherwise; and ∇_w V_w(s) represents the gradient used during the training of the evaluation network of the first recommendation model.
Likewise, the parameters of the evaluation network of the second recommendation model may be updated by the following formula:
w' ← w' + α ( r_g + γ(1−T) V_{w'}(s', g') − V_{w'}(s, g) ) ∇_{w'} V_{w'}(s, g)
where α is the learning rate of the evaluation network of the second recommendation model; γ(1−T) equals 0 when the next moment is the termination moment and γ otherwise; and ∇_{w'} V_{w'}(s, g) represents the gradient used during the training of the evaluation network of the second recommendation model.
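The update rules above can be sketched as a minimal tabular actor-critic step, under the simplifying assumption that the policy and evaluation networks are replaced by lookup tables over a toy state/action space (the state indices, learning rate, and discount values are arbitrary):

```python
import numpy as np

# Toy setting: 3 states, 2 actions; tables stand in for the neural networks.
n_states, n_actions, alpha, gamma = 3, 2, 0.1, 0.9

theta = np.zeros((n_states, n_actions))   # policy logits for pi_theta(a|s)
V = np.zeros(n_states)                    # evaluation (value) table V_w(s)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ac_step(s, a, r, s_next, T):
    """One actor-critic update: the same TD error drives both the policy
    update and the evaluation update; gamma*(1-T) drops the bootstrap term
    at the termination moment, as in the formulas above."""
    delta = r + gamma * (1 - T) * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta                            # evaluation-table update
    grad_log = -softmax(theta[s])                    # d log pi(a|s) / d theta
    grad_log[a] += 1.0
    theta[s] += alpha * delta * grad_log             # policy-table update
    return delta
```

A positive TD error raises both the value estimate of s and the logit of the action taken, which mirrors how each model's network parameters move along the gradient scaled by its reward signal.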
It should be understood that, although the steps in the flowcharts of figs. 2 to 3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2 to 3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages need not be performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Fig. 5 is a block diagram illustrating an information recommendation apparatus according to an example embodiment. Referring to fig. 5, the apparatus includes a status information acquiring module 510, a target operation type acquiring module 520, and an information recommending module 530.
a status information obtaining module 510, configured to obtain historical state information of the current account; the historical state information is used for recording operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and the operation information records at least the operation types of the interactive operations performed by the current account on the historical information;
a target operation type obtaining module 520, configured to screen out at least one target operation type from the operation types according to the operation information of the current account on the historical information;
and an information recommending module 530, configured to acquire a candidate information set according to the historical state information and the target operation type, wherein the candidate information set is used for pushing target information in the candidate information set to the terminal corresponding to the current account.
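The interaction of the three modules above can be illustrated with a minimal sketch. All names below (`HistoricalState`, `select_target_operations`, `recommend`) are hypothetical stand-ins; the patent does not specify an implementation, and its actual screening is done by learned models rather than the simple frequency heuristic used here.

```python
from dataclasses import dataclass, field

@dataclass
class HistoricalState:
    # operation_counts maps each interaction type to how often the current
    # account performed it on historically recommended information
    operation_counts: dict = field(default_factory=dict)

def select_target_operations(state, top_k=2):
    """Screen out the account's most frequent operation types as targets."""
    ranked = sorted(state.operation_counts.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [op for op, _ in ranked[:top_k]]

def recommend(state, catalog, top_k_ops=2, top_n=3):
    """Build a candidate information set matching the target operation types."""
    targets = select_target_operations(state, top_k_ops)
    candidates = [item for item in catalog if item["op_type"] in targets]
    return candidates[:top_n]

state = HistoricalState({"like": 5, "share": 1, "comment": 3})
catalog = [
    {"id": 1, "op_type": "like"},
    {"id": 2, "op_type": "share"},
    {"id": 3, "op_type": "comment"},
    {"id": 4, "op_type": "like"},
]
print(recommend(state, catalog))  # candidates whose op_type is "like" or "comment"
```

In the patent itself, the frequency ranking is replaced by the first recommendation model and the candidate screening by the second recommendation model; the sketch only mirrors the data flow between the three modules.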
In an exemplary embodiment, the target operation type obtaining module is configured to perform: inputting the operation information of the current account on the historical information into a pre-constructed first recommendation model, and acquiring a target operation type through the first recommendation model;
an information recommendation module configured to perform: inputting the historical state information of the current account and the target operation type into a pre-constructed second recommendation model, and acquiring a candidate information set through the second recommendation model;
the information recommendation device further comprises a model updating module configured to perform: acquiring feedback operation information of the current account on the target information in the candidate information set; and obtaining a feedback value corresponding to the target operation type according to the feedback operation information, wherein the feedback value is used for carrying out iterative updating on the second recommendation model.
In an exemplary embodiment, the model update module is configured to perform: determining a feedback value corresponding to each operation type according to the feedback operation information; and obtaining the sum of the feedback values corresponding to the operation types, and determining the sum of the feedback values corresponding to the operation types as update information for performing iterative update on the first recommendation model.
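The per-type aggregation just described can be sketched as follows. The concrete feedback counts and per-type values are hypothetical examples, since the patent does not fix numeric reward values for each operation type.

```python
def update_info_from_feedback(feedback_counts, type_values):
    """Sum the feedback values over all operation types.

    feedback_counts: how often the current account performed each operation
    type on the pushed target information; type_values: the feedback value
    assigned to one operation of each type (hypothetical numbers). The sum
    is the update information used to iteratively update the first model.
    """
    return sum(type_values.get(op, 0.0) * n for op, n in feedback_counts.items())

feedback = {"like": 2, "share": 1, "skip": 3}
values = {"like": 1.0, "share": 2.0, "skip": -0.5}
print(update_info_from_feedback(feedback, values))  # 2.0 + 2.0 - 1.5 = 2.5
```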
In an exemplary embodiment, the apparatus further comprises a status information update module configured to perform: and updating the historical state information of the current account according to the candidate information set and the feedback operation information.
In an exemplary embodiment, the model update module is configured to perform:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, update information of the first historical moment, and historical state information of a second historical moment;
inputting the historical state information of the second historical moment into the first recommendation model, acquiring an update information predicted value of the second historical moment, and calculating an update information accumulated value of the first historical moment according to the update information predicted value of the second historical moment and the update information of the first historical moment;
inputting the historical state information of the first historical moment into the first recommendation model to obtain an update information predicted value of the first historical moment;
and updating the network parameters of the first recommendation model according to the difference between the predicted update information value at the first historical moment and the accumulated update information value at the first historical moment.
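The four training steps above can be sketched as a single TD-style update. A linear model over a feature vector stands in for the first recommendation model's network, and `train_step_first_model` is a hypothetical name; this is an illustration of the described procedure, not the patent's implementation.

```python
def dot(w, s):
    return sum(wi * si for wi, si in zip(w, s))

def train_step_first_model(w, sample, alpha=0.1, gamma=0.9):
    """One update for the first recommendation model.

    sample = (s1, u1, s2): historical state at the first moment, update
    information observed at the first moment, state at the second moment.
    """
    s1, u1, s2 = sample
    target = u1 + gamma * dot(w, s2)  # update information accumulated value at the first moment
    pred = dot(w, s1)                 # update information predicted value at the first moment
    err = pred - target               # the "difference" driving the parameter update
    # for a linear model, the gradient of 0.5*err**2 w.r.t. w is err * s1
    return [wi - alpha * err * si for wi, si in zip(w, s1)]

w = [0.0, 0.0]
sample = ([1.0, 0.0], 2.0, [0.0, 1.0])
print(train_step_first_model(w, sample))  # [0.2, 0.0]
```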
In an exemplary embodiment, the model update module is configured to perform:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, a target operation type of the first historical moment, a feedback value of the first historical moment, historical state information of a second historical moment, and a target operation type of the second historical moment;
inputting the historical state information of the second historical moment and the target operation type of the second historical moment into the second recommendation model, obtaining a feedback predicted value of the second historical moment, and calculating a predicted accumulated feedback value of the first historical moment according to the feedback predicted value of the second historical moment and the feedback value of the first historical moment;
inputting the historical state information of the first historical moment and the target operation type of the first historical moment into the second recommendation model, and obtaining a feedback predicted value of the first historical moment;
and updating the network parameters of the second recommendation model according to the difference between the predicted accumulated feedback value at the first historical moment and the predicted feedback value at the first historical moment.
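The second recommendation model's update conditions on both the state and the target operation type, which can be sketched with a SARSA-style tabular stand-in. The dictionary `q` and the function name are hypothetical; the patent's model is a learned network, not a lookup table.

```python
def train_step_second_model(q, sample, alpha=0.1, gamma=0.9):
    """One update for the second recommendation model.

    sample = (s1, a1, r1, s2, a2): state and target operation type at the
    first moment, feedback value at the first moment, then state and target
    operation type at the second moment. q maps (state, operation type)
    pairs to predicted cumulative feedback.
    """
    s1, a1, r1, s2, a2 = sample
    target = r1 + gamma * q.get((s2, a2), 0.0)  # predicted accumulated feedback at the first moment
    pred = q.get((s1, a1), 0.0)                 # feedback predicted value at the first moment
    # move the prediction toward the accumulated target by their difference
    q[(s1, a1)] = pred - alpha * (pred - target)
    return q

q = {}
train_step_second_model(q, ("s1", "like", 1.0, "s2", "like"))
print(q[("s1", "like")])  # 0.1
```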
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating an apparatus 600 for information recommendation, according to an example embodiment. For example, the device 600 may be a server. Referring to fig. 6, device 600 includes a processing component 620 that further includes one or more processors and memory resources, represented by memory 622, for storing instructions, such as applications, that are executable by processing component 620. The application programs stored in memory 622 may include one or more modules that each correspond to a set of instructions. Further, the processing component 620 is configured to execute instructions to perform the above-described methods.
The device 600 may also include a power component 624 configured to perform power management for the device 600, a wired or wireless network interface 626 configured to connect the device 600 to a network, and an input/output (I/O) interface 628. The device 600 may operate based on an operating system stored in the memory 622, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a storage medium including instructions, such as the memory 622 including instructions, is also provided; the instructions are executable by the processor of the device 600 to perform the above-described method. The storage medium may be a non-transitory computer-readable storage medium, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An information recommendation method, comprising:
acquiring historical state information of a current account; wherein the historical state information is used for recording operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and the operation information records at least the operation types of the interactive operations performed by the current account on the historical information;
screening at least one target operation type from the operation types according to the operation information of the current account on the historical information;
and acquiring a candidate information set according to the historical state information and the target operation type, wherein the candidate information set is used for pushing target information in the candidate information set to a terminal corresponding to the current account.
2. The information recommendation method according to claim 1, wherein the step of screening at least one target operation type from the operation types according to the operation information of the current account on the historical information comprises:
inputting the operation information of the current account on the historical information into a pre-constructed first recommendation model, and acquiring a target operation type through the first recommendation model;
the step of obtaining a candidate information set according to the historical state information and the target operation type includes:
inputting the historical state information of the current account and the target operation type into a pre-constructed second recommendation model, and acquiring the candidate information set through the second recommendation model;
after the step of obtaining a candidate information set according to the historical state information and the target operation type, the method further includes:
acquiring feedback operation information of the current account on the target information in the candidate information set;
and obtaining a feedback value corresponding to the target operation type according to the feedback operation information, wherein the feedback value is used for carrying out iterative update on the second recommendation model.
3. The information recommendation method according to claim 2, wherein after the step of obtaining a candidate information set according to the historical state information and the target operation type, further comprising:
determining a feedback value corresponding to each operation type according to the feedback operation information;
and obtaining a sum of the feedback values corresponding to the operation types, and determining the sum of the feedback values corresponding to the operation types as update information for performing iterative update on the first recommendation model.
4. The information recommendation method according to claim 3, wherein after the step of obtaining the feedback operation information of the current account on each target information in the candidate information set, the method further comprises:
and updating the historical state information of the current account according to the candidate information set and the feedback operation information.
5. The information recommendation method according to claim 2, wherein before the step of inputting the operation information of the current account on the historical information into the pre-constructed first recommendation model, the method comprises:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, update information of the first historical moment, and historical state information of a second historical moment;
inputting the historical state information of the second historical moment into the first recommendation model, obtaining an update information predicted value of the second historical moment, and calculating an update information accumulated value of the first historical moment according to the update information predicted value of the second historical moment and the update information of the first historical moment;
inputting the historical state information of the first historical moment into the first recommendation model to predict the updating information prediction value of the first historical moment;
and updating the network parameters of the first recommendation model according to the difference between the update information predicted value at the first historical moment and the update information accumulated value at the first historical moment.
6. The information recommendation method according to claim 2, wherein before the step of inputting the historical state information of the current account and the target operation type into the pre-constructed second recommendation model, the method further comprises:
acquiring historical state information of the current account at different historical moments, and generating a training sample according to the historical state information at the different historical moments; wherein the training sample includes: historical state information of a first historical moment, a target operation type of the first historical moment, a feedback value of the first historical moment, historical state information of a second historical moment, and a target operation type of the second historical moment;
inputting the historical state information of the second historical moment and the target operation type of the second historical moment into the second recommendation model, obtaining a feedback predicted value of the second historical moment, and calculating a predicted accumulated feedback value of the first historical moment according to the feedback predicted value of the second historical moment and the feedback value of the first historical moment;
inputting the historical state information of the first historical moment and the target operation type of the first historical moment into the second recommendation model, and obtaining a feedback predicted value of the first historical moment;
and updating the network parameters of the second recommendation model according to the difference between the predicted accumulated feedback value at the first historical moment and the predicted feedback value at the first historical moment.
7. An information recommendation apparatus, comprising:
a state information acquisition module configured to perform: acquiring historical state information of the current account; wherein the historical state information is used for recording operation information of interactive operations performed by the current account on historical information, the historical information is information historically recommended to the current account, and the operation information records at least the operation types of the interactive operations performed by the current account on the historical information;
a target operation type acquisition module configured to perform: screening out at least one target operation type from the operation types according to the operation information of the current account on the historical information;
and an information recommendation module configured to perform: acquiring a candidate information set according to the historical state information and the target operation type, wherein the candidate information set is used for pushing target information in the candidate information set to the terminal corresponding to the current account.
8. The information recommendation device according to claim 7, wherein the target operation type obtaining module is configured to perform: inputting the operation information of the current account on the historical information into a pre-constructed first recommendation model, and acquiring a target operation type through the first recommendation model;
an information recommendation module configured to perform: inputting the historical state information of the current account and the target operation type into a pre-constructed second recommendation model, and acquiring the candidate information set through the second recommendation model;
the information recommendation device further comprises a model update module configured to perform: acquiring feedback operation information of the current account on the target information in the candidate information set; and obtaining a feedback value corresponding to the target operation type according to the feedback operation information, wherein the feedback value is used for iteratively updating the second recommendation model.
9. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the information recommendation method of any one of claims 1 to 6.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of a server, enable the server to perform the information recommendation method according to any one of claims 1 to 6.
CN202010512868.5A 2020-06-08 2020-06-08 Information recommendation method, device, server and storage medium Active CN113836388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010512868.5A CN113836388B (en) 2020-06-08 2020-06-08 Information recommendation method, device, server and storage medium


Publications (2)

Publication Number Publication Date
CN113836388A true CN113836388A (en) 2021-12-24
CN113836388B CN113836388B (en) 2024-01-23

Family

ID=78963604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010512868.5A Active CN113836388B (en) 2020-06-08 2020-06-08 Information recommendation method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113836388B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902706A (en) * 2018-11-09 2019-06-18 华为技术有限公司 Recommended method and device
CN110276446A (en) * 2019-06-26 2019-09-24 北京百度网讯科技有限公司 The method and apparatus of model training and selection recommendation information
CN110413877A (en) * 2019-07-02 2019-11-05 阿里巴巴集团控股有限公司 A kind of resource recommendation method, device and electronic equipment
CN110502648A (en) * 2019-07-17 2019-11-26 北京达佳互联信息技术有限公司 Recommended models acquisition methods and device for multimedia messages
CN110825957A (en) * 2019-09-17 2020-02-21 中国平安人寿保险股份有限公司 Deep learning-based information recommendation method, device, equipment and storage medium
CN110851699A (en) * 2019-09-16 2020-02-28 中国平安人寿保险股份有限公司 Deep reinforcement learning-based information flow recommendation method, device, equipment and medium
CN111143543A (en) * 2019-12-04 2020-05-12 北京达佳互联信息技术有限公司 Object recommendation method, device, equipment and medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703498A (en) * 2023-04-23 2023-09-05 北京元灵数智科技有限公司 Commodity recommendation method and device, electronic equipment and storage medium
CN116703498B (en) * 2023-04-23 2024-03-26 北京元灵数智科技有限公司 Commodity recommendation method and device, electronic equipment and storage medium
CN116959686A (en) * 2023-07-27 2023-10-27 常州云燕医疗科技有限公司 Medical information management system and method based on digital integration
CN116959686B (en) * 2023-07-27 2024-02-13 常州云燕医疗科技有限公司 Medical information management system and method based on digital integration

Also Published As

Publication number Publication date
CN113836388B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
US10958748B2 (en) Resource push method and apparatus
CN110781321B (en) Multimedia content recommendation method and device
EP3819821B1 (en) User feature generating method, device, and apparatus, and computer-readable storage medium
CN110321422B (en) Method for training model on line, pushing method, device and equipment
US20210326674A1 (en) Content recommendation method and apparatus, device, and storage medium
RU2725659C2 (en) Method and system for evaluating data on user-element interactions
CN111242310B (en) Feature validity evaluation method and device, electronic equipment and storage medium
CN110413894B (en) Training method of content recommendation model, content recommendation method and related device
CN110413867B (en) Method and system for content recommendation
CN111552835B (en) File recommendation method, device and server
CN111597446B (en) Content pushing method and device based on artificial intelligence, server and storage medium
CN112989146A (en) Method, apparatus, device, medium, and program product for recommending resources to a target user
CN113836388B (en) Information recommendation method, device, server and storage medium
CN113015010B (en) Push parameter determination method, device, equipment and computer readable storage medium
CN113204699B (en) Information recommendation method and device, electronic equipment and storage medium
CN113822734A (en) Method and apparatus for generating information
CN109299351B (en) Content recommendation method and device, electronic equipment and computer readable medium
CN115878839A (en) Video recommendation method and device, computer equipment and computer program product
CN114417156B (en) Training method and device for content recommendation model, server and storage medium
CN112749335B (en) Lifecycle state prediction method, lifecycle state prediction apparatus, computer device, and storage medium
CN112000888B (en) Information pushing method, device, server and storage medium
CN117806740A (en) Split screen application matching method and device of terminal, electronic equipment and storage medium
CN116383497A (en) Material recommendation method and device with multiple recommendation targets and computer readable storage medium
CN117745382A (en) Object recommendation method and device
CN117893279A (en) Object recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant