WO2022198983A1 - Conversation recommendation method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2022198983A1
Authority
WO
WIPO (PCT)
Prior art keywords
item, candidate, attribute, preference, value
Prior art date
Application number
PCT/CN2021/122167
Other languages
French (fr)
Chinese (zh)
Inventor
赵朋朋
田鑫涛
郝永静
Original Assignee
苏州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学
Publication of WO2022198983A1 publication Critical patent/WO2022198983A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the present invention relates to the field of dialogue recommendation, in particular to a dialogue recommendation method, device, electronic device and storage medium.
  • Conversational Recommender System is a recommender system that can actively obtain preference attributes from users and use the attributes to recommend items.
  • existing dialogue recommendation can use the preference attributes currently asked of the user for recommendation, and it can ask the user for preference attributes or recommend items while considering the historical items the user has interacted with, but it ignores the impact of the order of interaction between the user and historical items on recommendation, that is, the importance of the sequence of interaction history items, which makes it difficult for existing dialogue recommendation methods to conduct dialogue recommendation with users efficiently and accurately.
  • the purpose of the present invention is to provide a dialogue recommendation method, device, electronic device and storage medium, which can use the historical interaction sequence reflecting the user's historical item preferences to train the recommendation network model and generate the candidate item set, thereby ensuring that the dialogue recommendation can conduct targeted dialogues with users and improving the efficiency and accuracy of dialogue recommendation.
  • the present invention provides a dialogue recommendation method, including:
  • a dialog recommendation is made to the user using the action decision.
  • the performing a dialogue recommendation to the user by using the action decision includes:
  • the attribute of the candidate item with the largest predicted preference value is sent to the user terminal, and the feedback data is received;
  • the updating the candidate item set using the preference item attribute includes:
  • an item that does not have the preference item attribute is removed from the candidate item set.
  • generating a candidate attribute set using the interactive predicted value and the candidate item, and calculating the preference predicted value of each candidate item attribute in the candidate attribute set using the interactive predicted value includes:
  • the present invention also provides a dialogue recommendation device, comprising:
  • a candidate item set generation module configured to generate a candidate item set by using the item preference value and the items that the user has not interacted with
  • a first calculation module configured to update the candidate item set using the preference item attribute when the preference item attribute sent by the user is received, and to use the item correlation value to calculate the interaction prediction value of each candidate item in the updated candidate item set;
  • a second calculation module configured to generate a candidate attribute set using the interactive predicted value and the candidate item, and use the interactive predicted value to calculate the preference predicted value of each candidate item attribute in the candidate attribute set;
  • the dialogue recommendation module is used for inputting the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning, and performing dialogue recommendation to the user.
  • the present invention also provides an electronic device, comprising:
  • the processor is configured to implement the above-mentioned dialogue recommendation method when executing the computer program.
  • the present invention further provides a storage medium, where computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the above-mentioned dialog recommendation method is implemented.
  • the present invention provides a dialogue recommendation method, comprising: acquiring a historical interaction sequence between a user and items, and inputting the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block and a prediction layer for training, to generate item preference values, wherein each item includes item attributes; generating a candidate item set using the item preference values and the items that the user has not interacted with; when a preference item attribute sent by the user is received, updating the candidate item set using the preference item attribute, and using the item correlation values to calculate the interaction prediction value of each candidate item in the updated candidate item set; generating a candidate attribute set using the interaction prediction values and the candidate items, and using the interaction prediction values to calculate the preference prediction value of each candidate item attribute in the candidate attribute set; and inputting the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning, and making dialogue recommendations to the user.
  • this method first obtains the historical interaction sequence between the user and items. Since the sequence reflects both the user's preference for historical items and the order of interaction, using the historical interaction sequence in the training of the recommendation network model ensures that the candidate item set generated by the model also takes the user's historical preferences and interaction order into account, producing a candidate item set that better matches the user's preferences. This in turn ensures that the candidate item set can be used in dialogue recommendation to conduct targeted dialogues with users, effectively reducing the number of dialogue rounds while improving the accuracy of dialogue recommendation.
  • the present invention also provides a dialogue recommendation device, an electronic device and a storage medium, which have the above beneficial effects.
  • FIG. 1 is a flowchart of a dialog recommendation method provided by an embodiment of the present invention
  • FIG. 3 is a structural block diagram of a dialogue recommendation system based on historical interaction sequences provided by an embodiment of the present invention.
  • Conversational Recommender System is a recommender system that can actively obtain preference attributes from users and use the attributes to recommend items.
  • in the related art, dialogue recommendation can only make recommendations using the preference attributes currently asked of the user, and it is difficult to take the user's historical item preferences into account, making it hard to conduct dialogue recommendation with users efficiently and accurately.
  • the present invention provides a dialogue recommendation method, which can use the historical interaction sequence reflecting the user's historical item preferences to train the recommendation network model and generate the candidate item set, thereby ensuring that the dialogue recommendation can conduct targeted dialogues with the user and improving the efficiency and accuracy of dialogue recommendation. Please refer to FIG. 1.
  • FIG. 1 is a flowchart of a dialog recommendation method provided by an embodiment of the present invention. The method may include:
  • S101 Obtain a historical interaction sequence between a user and an item, and input the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block, and a prediction layer for training to generate an item preference value; the item includes an item attribute.
  • the historical interaction sequence is the record of the user's past interactions with items, with the items sorted by the time of interaction; an interaction can be a click-to-view, a favorite, a purchase, etc. Since the historical interaction sequence includes both the user's historical item preferences and the order in which the user interacted with historical items, the embodiment of the present invention integrates it into the training of the recommendation network model, ensuring that the model captures the user's historical preferences together with the sequential characteristics of the historical items. Dialogue recommendation can then conduct targeted dialogues according to the user's historical preferences, which effectively improves the efficiency and accuracy of dialogue recommendation. It should be noted that the embodiments of the present invention do not limit the specific item.
  • the item is a specific item, for example, a physical item such as a book, or a virtual item such as a movie.
  • This embodiment of the present invention also does not limit the specific item attributes.
  • the item attributes reflect a certain feature of the item.
  • the item attribute can be the type of the book, such as novel, biography, or textbook; it can also be an attribute indicating whether the book is popular, or another item attribute related to books.
  • This embodiment of the present invention also does not limit the number of item attributes that an item can contain, and an item can have one or more item attributes.
  • the recommendation network model used in the embodiment of the present invention is based on a deep learning neural network.
  • the embodiments of the present invention do not limit the specific structures and learning methods of the embedding layer, the self-attention block, and the prediction layer in the recommendation network model, and users may refer to related technologies of deep learning neural networks.
  • the embodiment of the present invention does not limit the number of items in the historical interaction sequence; it may contain one or more items. Since the length of the historical interaction sequence differs between users, the sequence can be converted into a training sequence of preset length to facilitate training of the recommendation network model. Understandably, when the historical interaction sequence contains too few items, it is difficult to compute a reliable preference for the user, so a minimum number of items can be set for the historical interaction sequence; when the sequence contains fewer items than this minimum, network model training is not performed.
  • the embodiments of the present invention do not limit the specific value of the minimum number of items, which can be set by the user according to actual application requirements.
  • the minimum number of items may be 10. It should be noted that the embodiment of the present invention does not limit the specific value of the preset length, which can be set according to actual application requirements.
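As an illustration, the preset-length conversion described above can be sketched as follows; the concrete length values and the left-padding scheme are assumptions, not specified by the text:

```python
import numpy as np

MIN_ITEMS = 10   # minimum number of items (the text suggests 10 as an example)
MAX_LEN = 8      # hypothetical preset training-sequence length

def make_training_sequence(history, max_len=MAX_LEN, min_items=MIN_ITEMS, pad_id=0):
    """Truncate long histories to the most recent max_len items and
    left-pad short ones with pad_id; reject histories that are too short."""
    if len(history) < min_items:
        return None  # too few interactions to compute a reliable preference
    seq = history[-max_len:]                       # keep the most recent items
    padded = [pad_id] * (max_len - len(seq)) + list(seq)
    return np.array(padded)

seq = make_training_sequence(list(range(1, 13)))   # 12 interactions, ids 1..12
```

With 12 interactions and a preset length of 8, only the 8 most recent item ids survive; a 3-item history falls below the minimum and is rejected.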
  • the historical interaction sequence is input into the recommendation network model including the embedding layer, the self-attention block and the prediction layer for training, and the process of generating the item preference value may include:
  • Step 11 Use the historical interaction sequence to generate a training sequence of preset length.
  • Step 12 Input all items and training sequences into the embedding layer, and output the integrated embedding matrix.
  • a training sequence embedding matrix E ∈ R^{n×d} is obtained, where n is the preset sequence length and d is the embedding dimension. The training sequence embedding matrix E is integrated with a learnable position embedding matrix P ∈ R^{n×d} to obtain the integrated embedding matrix Ê = E + P.
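A minimal sketch of this integration step, with random matrices standing in for the learned embeddings (the values of n, d and the padding convention are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_items = 8, 16, 100                     # sequence length, embedding dim, vocabulary size

item_emb = rng.normal(size=(num_items + 1, d))   # row 0 reserved for padding
P = rng.normal(size=(n, d))                      # position embedding (learnable in training; random here)

def integrated_embedding(seq):
    """E_hat = E + P: look up the item embedding for each position of the
    training sequence and add the position embedding row-wise."""
    E = item_emb[seq]                            # (n, d)
    return E + P

seq = np.array([0, 0, 0, 5, 7, 9, 11, 13])       # left-padded training sequence
E_hat = integrated_embedding(seq)
```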
  • Step 13 Input the integrated embedding matrix into the self-attention block for iterative feature learning to generate a learning matrix.
  • the embodiment of the present invention does not limit the specific learning method of the self-attention block, and the user may refer to the related technology of self-attention.
  • the embodiments of the present invention also do not limit the specific structure of the self-attention block.
  • the self-attention block includes a self-attention layer and a point-wise feed-forward network.
  • the present invention also does not limit whether to use multiple self-attention blocks for stacking computation. It can be understood that the more layers are stacked, the more features the self-attention block can learn.
  • the self-attention layer has three matrices: Q (query), K (key) and V (value), all of which come from the same input to the self-attention block.
  • the dot product between Q and K is calculated first; to prevent the dot-product result from being too large, it is divided by a scale factor √d, where d is the dimension of the query and key vectors. Finally, the result is normalized into a probability distribution through a softmax operation and multiplied by the matrix V to obtain a weighted-sum representation. Scaled dot-product attention is thus defined as: Attention(Q, K, V) = softmax(QKᵀ/√d)·V.
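The scaled dot-product attention just described can be sketched in plain numpy; the causal mask shown here anticipates the masking operation discussed next and is an assumption about the exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with an optional
    boolean mask that blocks Q_i from attending to K_j where mask is False."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) scaled dot products
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked positions get ~0 weight
    return softmax(scores, axis=-1) @ V

n, d = 4, 8
rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(n, d))                  # self-attention: one shared input
causal = np.tril(np.ones((n, n), dtype=bool))        # allow attending to j <= i only
out = scaled_dot_product_attention(Q, K, V, causal)
```

Under the causal mask the first position can attend only to itself, so its output equals its own value vector.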
  • to keep the self-attention layer from seeing future items while it learns the complete training sequence, a mask operation can be used to block the connection between Q_i and K_j (j > i). After the self-attention layer outputs the learning result S_k based on the first k items, a point-wise two-layer feed-forward network can be used to turn the self-attention layer from a linear model into a nonlinear one. Meanwhile, to learn more complex item transitions, iterative feature learning can be performed by stacking self-attention blocks; the embodiment of the present invention does not limit the specific number of stacked layers, which can be set according to actual application requirements. The b-th (b > 1) self-attention block can be defined as: S^(b) = SA(F^(b−1)), F^(b) = FFN(S^(b)), where:
  • SA: the self-attention layer
  • FFN: the point-wise feed-forward network
  • ReLU: the rectified linear unit activation
  • W Q , W K , W V , W 1 , W 2 ⁇ R d ⁇ d are all learnable matrices
  • b 1 , b 2 are d-dimensional vectors.
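A sketch of the point-wise two-layer feed-forward network using the learnable parameters W1, W2, b1, b2 named above; the random weights are placeholders for learned values:

```python
import numpy as np

d = 8
rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # learnable matrices (random stand-ins)
b1, b2 = np.zeros(d), np.zeros(d)                          # d-dimensional bias vectors

def relu(x):
    return np.maximum(0.0, x)

def ffn(x):
    """Point-wise feed-forward network applied independently to each position:
    FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    return relu(x @ W1 + b1) @ W2 + b2

S = rng.normal(size=(4, d))   # output of the self-attention layer
F = ffn(S)
```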
  • a multi-layer neural network has strong feature learning ability, but simply adding more network layers can cause problems such as overfitting and longer training time: as the network becomes deeper, the risk of vanishing gradients grows and model performance degrades.
  • the above situation can be alleviated by residual connections.
  • the process of residual connection can be as follows: normalize the input x of the self-attention layer and the feed-forward network, perform a dropout operation on the output of the self-attention layer and the feed-forward network, and finally add the original input x to the output after dropout as the final output.
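The residual-connection procedure just described can be sketched as follows; the dropout rate and the inverted-dropout scaling are assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-8):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, sublayer, drop_rate=0.5, rng=None, training=False):
    """g(x) = x + Dropout(sublayer(LayerNorm(x))): normalize the input,
    apply the sub-layer (self-attention or FFN), drop units during training,
    and add the original input x back as the residual."""
    y = sublayer(layer_norm(x))
    if training and rng is not None:
        keep = (rng.random(y.shape) > drop_rate).astype(y.dtype)
        y = y * keep / (1.0 - drop_rate)   # inverted dropout: rescale kept units
    return x + y

x = np.arange(8.0).reshape(2, 4)
out = residual_block(x, lambda z: z * 0.0)  # sub-layer that outputs zero
```

With a zero sub-layer the residual path carries the input through unchanged, illustrating why residual connections ease the training of deeper stacks.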
  • Step 14 Input the learning matrix into the prediction layer for matrix factorization, and calculate the initial item preference value.
  • the learning matrix is input into the MF layer (Matrix Factorization layer) to perform matrix factorization and calculate the item correlation value of item i, which predicts whether item i is a recommendable item.
  • the item-related value can be calculated by the following formula:
  • r i,k represents the relevance of item i becoming the next (recommendable) item based on the first k items; N ∈ R^{|V|×d} is a learnable item embedding matrix, with |V| the number of items and d the embedding dimension.
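A toy sketch of this prediction-layer computation; the shape of the item embedding matrix N and the argmax selection are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
num_items, d = 50, 8
N = rng.normal(size=(num_items, d))   # item embedding matrix (assumed shape |V| x d)
F_k = rng.normal(size=(d,))           # self-attention output for the first k items

def item_relevance(F_k, N):
    """r_{i,k} = F_k . N_i: the inner product of the sequence representation
    with each item embedding scores item i as a candidate next item."""
    return N @ F_k                    # one relevance score per item

r = item_relevance(F_k, N)
best = int(np.argmax(r))              # most relevant candidate next item
```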
  • Step 15 Optimize the recommendation network model with the binary cross-entropy loss function until its output value is minimized, and take the initial item preference values at the minimum as the item preference values.
  • the embodiments of the present invention do not limit the specific process of network optimization using a binary cross entropy loss function (Binary cross entropy), and reference may be made to related technologies.
  • the binary cross-entropy loss function can be expressed as L = −Σ_t [ e_t·log σ(r_t) + (1 − e_t)·log(1 − σ(r_t)) ], where σ is the sigmoid function
  • e t represents the expected output of the network at time step t
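A minimal sketch of a binary cross-entropy loss under the usual sigmoid formulation (the exact expression is assumed here, as the text does not spell it out):

```python
import numpy as np

def bce_loss(r, e, eps=1e-9):
    """Binary cross-entropy over time steps:
    L = -sum_t [ e_t*log(sigmoid(r_t)) + (1 - e_t)*log(1 - sigmoid(r_t)) ],
    where e_t is the expected output at step t (1 for the true next item)."""
    p = 1.0 / (1.0 + np.exp(-r))      # sigmoid turns raw scores into probabilities
    return -np.sum(e * np.log(p + eps) + (1 - e) * np.log(1 - p + eps))

r = np.array([4.0, -3.0, 5.0])        # scores for positive/negative samples
e = np.array([1.0, 0.0, 1.0])         # expected outputs
loss = bce_loss(r, e)
```

Confident, correct scores yield a loss close to zero, which is what the optimization in Step 15 drives toward.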
  • the item preference value can represent the user's interest preference for the item
  • the item preference value is used as the indicator for sorting items when generating the candidate item set. It can be understood that, to predict items the user may be interested in in the future, this embodiment generates the candidate item set from the items the user has not interacted with, which can be obtained as the difference between the total set containing all items and the historical item set corresponding to the historical interaction sequence.
  • in dialogue recommendation, the dialogue needs to be initiated by the user first sending a preference item attribute.
  • the embodiment of the present invention does not limit the specific process for the user to send the preference item attribute.
  • the user can select from a preset item attribute list and send the selection data, or the user can send a dialogue recommendation request, in response to which the dialogue recommendation system sends the item attribute list for selection.
  • the embodiment of the present invention does not limit the specific method of using the preference item attribute and the candidate item set to update the candidate item set. It can be understood that the items that do not have the preference item attribute can be removed from the candidate item set; equivalently, only the candidate items that have the preference item attribute are retained.
  • updating the candidate item set with the preference item attribute may include:
  • Step 21 Remove the items that do not have the preference item attribute from the candidate item set.
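Step 21 can be sketched with a toy catalogue; the items and their attribute sets below are hypothetical:

```python
# item id -> set of item attributes (hypothetical toy catalogue)
items = {
    1: {"novel", "popular"},
    2: {"textbook"},
    3: {"novel"},
    4: {"biography", "popular"},
}

def update_candidates(candidates, preferred_attr):
    """Step 21: keep only the candidate items that carry the user's
    preferred item attribute, removing all others."""
    return {v for v in candidates if preferred_attr in items[v]}

cand = update_candidates({1, 2, 3, 4}, "novel")
```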
  • the following specifically describes the process of using the item correlation value to calculate the interactive prediction value of each candidate item in the updated candidate item set.
  • before the dialogue starts, the set of attributes accepted by the user and the set of attributes rejected by the user are both empty
  • the candidate item set V cand contains items that the user has not interacted with
  • the candidate attribute set P cand is also empty.
  • the preference item attribute can be used to first remove the items that do not have the preference item attribute from the candidate item set, updating the candidate item set; this can be expressed as V_cand ← {v ∈ V_cand : the preference item attribute ∈ P_v}, where P_v is the attribute set of item v.
  • the output of the multi-layer self-attention block, obtained by inputting the initial training sequence V′_u of the historical interaction sequence into the b-layer self-attention block, can then be exploited to calculate the interaction prediction value of each candidate item in the candidate item set.
  • the calculation process of the interactive predicted value can be expressed as:
  • item attributes belong to items, and when users like certain items they also have preferences for certain attributes of those items; therefore, after obtaining the interaction prediction values of the candidate items, the preference prediction value of each candidate item attribute can also be calculated to determine which candidate item attributes the user prefers.
  • the embodiment of the present invention does not limit the specific method of calculating the attribute preference prediction value of candidate items. It can be understood that multiple items can have the same item attribute.
  • for example, the interaction prediction values of the items to which an item attribute belongs can be averaged, and the average used as the preference prediction value of that item attribute. Alternatively, the information entropy of the candidate item attribute can be calculated and used as its preference prediction value, where information entropy is a measure of removed uncertainty: the lower the probability of an event, the more information it provides when it occurs. A candidate item attribute with high information entropy can efficiently filter the candidate item set, so in the embodiment of the present invention the preference prediction value of the candidate item attribute is calculated via information entropy.
  • however, the calculation of information entropy is expensive, so the candidate items can be sorted by interaction prediction value, the candidate attribute set can be generated from the first preset number of candidate items with the largest interaction prediction values, and the information entropy can then be computed only over this candidate attribute set. It should be noted that the embodiment of the present invention does not limit the specific value of the preset number, which can be set according to actual application requirements.
  • the process of using the interactive predicted value and the candidate item to generate a candidate attribute set, and using the interactive predicted value to calculate the preference predicted value of each candidate item attribute in the candidate attribute set may include:
  • Step 31 Generate a candidate attribute set using the item attributes contained in the first preset number of candidate items, taken in descending order of interaction prediction value;
  • the candidate item set V cand is sorted in descending order of interaction prediction value, and the candidate attribute set is generated from the attributes of the first L candidate items.
  • Step 32 Calculate the information entropy of each candidate item attribute in the candidate attribute set by using the interactive prediction value of the candidate item, and use the information entropy as the preference prediction value of the candidate item attribute.
  • the information entropy of the candidate item attribute can be calculated by the following formula:
  • σ(·) is the sigmoid function (activation function), s v is the interaction prediction value of candidate item v, and V p represents the set of items containing the item attribute p.
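A hedged sketch of the weighted information entropy: each candidate item is weighted by the sigmoid of its interaction prediction value, and the entropy of the resulting split is returned. The exact weighting formula is an assumption in the spirit of the text, not the patent's verbatim definition:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attribute_entropy(scores, has_attr):
    """Weighted information entropy of one candidate attribute p.
    scores: interaction prediction values s_v of the candidate items;
    has_attr: boolean mask, True where the item contains attribute p.
    Items are weighted by sigmoid(s_v) rather than treated equally."""
    w = sigmoid(np.asarray(scores, dtype=float))
    prob = w[has_attr].sum() / w.sum()      # weighted share of candidates with p
    if prob in (0.0, 1.0):
        return 0.0                          # p cannot split the candidate set
    return float(-prob * np.log2(prob) - (1 - prob) * np.log2(1 - prob))

scores = [2.0, 1.0, -1.0, 0.5]
h_half = attribute_entropy(scores, np.array([True, False, True, False]))
h_all = attribute_entropy(scores, np.array([True, True, True, True]))
```

An attribute contained by every candidate has entropy 0 and cannot filter anything, matching the observation in the text; an attribute splitting the weighted candidates roughly in half scores near the maximum of 1 bit.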
  • the embodiment of the present invention uses weighted entropy, assigning higher weights to important candidate items in the information entropy calculation instead of treating each item equally. Simply put, if most of the candidate items contain attribute p, then asking about p will do little to filter the candidate items and p is not a suitable attribute; conversely, an attribute that splits the candidate items well can quickly filter them.
  • the policy network is based on a deep reinforcement learning network, which can generate an action decision for the current dialogue round according to the input state and the user's historical dialogue results, and then use the action decision to recommend this round of dialogue.
  • the embodiments of the present invention do not limit the reinforcement learning process of the policy network, and reference may be made to the related technologies of the deep reinforcement learning network.
  • the embodiments of the present invention also do not limit the network optimization method of the policy network, and reference may also be made to the related technologies of the deep reinforcement learning network.
  • the calculated candidate item set and candidate attribute set are input into the policy network for reinforcement learning, and the process of recommending dialogue to the user may include:
  • Step 41 Input the candidate item set and the candidate attribute set into the policy network for reinforcement learning to generate action decisions.
  • the policy network determines through reinforcement learning whether the action of the current dialogue round is to ask or to recommend.
  • the policy network involves four values, namely state, action, reward and policy.
  • the state contains the dialogue history and the length of the current candidate item set.
  • the dialogue history is encoded by s his , the size of which is the maximum round T of dialogue recommendation, and each dimension represents the user's dialogue history in the t-th round.
  • the user's conversation history can be represented by a special value. It should be noted that the present invention does not limit specific special values, which can be set according to actual application requirements.
  • the embodiment of the present invention includes two actions, ie, asking a ask and recommending a rec .
  • the intermediate reward r t in round t is the weighted sum of these five reward terms.
  • the embodiment of the present invention does not limit the specific setting method of the above-mentioned reward value, which can be set according to actual application requirements.
  • after the current dialogue state s is input into the policy network, the policy network outputs the action decision values Q(s, a) for the two actions, and the action of this round of dialogue can then be determined according to the action decision values.
  • the policy network uses standard deep Q-learning to optimize the network.
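The choice between the two actions given Q(s, a) can be sketched as follows; the epsilon-greedy exploration is an assumption, since the text only states that standard deep Q-learning is used:

```python
import numpy as np

ACTIONS = ("ask", "recommend")   # a_ask and a_rec from the text

def choose_action(q_values, epsilon=0.1, rng=None):
    """Epsilon-greedy choice between asking an attribute and recommending an
    item, given the policy network's action decision values Q(s, a)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return ACTIONS[rng.integers(len(ACTIONS))]   # explore uniformly
    return ACTIONS[int(np.argmax(q_values))]         # exploit the larger Q-value

action = choose_action(np.array([0.3, 0.9]), epsilon=0.0)
```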
  • Step 42 Use action decision to make dialogue recommendation to the user.
  • the process of recommending dialogues to users by using action decisions may include:
  • Step 51 when the action decision is an item recommendation, send the candidate item with the largest interaction prediction value to the user terminal, and receive feedback data;
  • Step 52 When the feedback data indicates that the candidate item is accepted, exit the dialogue recommendation
  • Step 53 When the feedback data indicates that the candidate item is rejected, remove the candidate item from the candidate item set, and use the reduced candidate item set to re-execute the step of calculating, with the item correlation values, the interaction prediction value of each candidate item in the updated candidate item set;
  • Step 54 when the action decision is an attribute query, send the attribute of the candidate item with the largest preference prediction value to the user terminal, and receive the feedback data;
  • Step 55 Use the feedback data to verify the items in the candidate item set, remove the items that fail the verification, and finally use the reduced candidate item set to re-execute the step of calculating, with the item correlation values, the interaction prediction value of each candidate item in the updated candidate item set.
  • the candidate item attribute p i ⁇ P cand with the largest preference prediction value is selected from the candidate item attribute set P cand .
  • the candidate item can be verified by using the candidate item attribute in the feedback data.
  • when the user accepts the attribute, the verification process is: add the attribute to the set of attributes accepted by the user, and update the candidate item set accordingly
  • when the user rejects the attribute, the verification process is: add the attribute to the set of attributes rejected by the user, and update the candidate item set accordingly
  • when the dialogue recommendation cannot find an item acceptable to the user, the dialogue will continue for several rounds.
  • the dialogue rounds can be limited by a preset threshold, and when the dialogue round reaches the preset threshold, the dialogue recommendation is exited.
  • using the action decision to make a dialogue recommendation to the user may further include:
  • Step 61 Determine whether the dialogue round corresponding to the action decision is greater than a preset threshold
  • the present invention does not limit the specific value of the preset threshold, which can be set according to actual application requirements.
  • Step 62 If yes, exit the dialogue recommendation
  • Step 63 If not, execute the step of recommending dialogue to the user by using the action decision.
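Steps 41 through 63 can be summarized in a skeleton dialogue loop; `policy`, `user`, and the state interface below are hypothetical stand-ins for the policy network, the user terminal, and the candidate item/attribute sets:

```python
def run_dialogue(policy, user, state, max_rounds=15):
    """Skeleton of Steps 41-63: each round the policy picks an action; a
    recommendation either ends the dialogue or shrinks the candidate set,
    an attribute query filters it, and the loop stops after the preset
    round threshold (Steps 61-63)."""
    for t in range(max_rounds):                      # Step 61: bound the rounds
        action = policy(state)                       # Step 41: ask or recommend
        if action == "recommend":
            item = state.best_item()                 # Step 51: largest prediction value
            if user.accepts(item):                   # Step 52: accepted -> exit
                return item
            state.remove_item(item)                  # Step 53: rejected -> filter
        else:
            attr = state.best_attribute()            # Step 54: top preference attribute
            state.filter_by(attr, user.likes(attr))  # Step 55: verify candidates
    return None                                      # Step 62: threshold reached

class ToyState:
    def __init__(self, items): self.items = list(items)
    def best_item(self): return self.items[0]
    def remove_item(self, it): self.items.remove(it)
    def best_attribute(self): return "popular"
    def filter_by(self, attr, liked): pass           # no-op in this toy sketch

class ToyUser:
    def __init__(self, target): self.target = target
    def accepts(self, item): return item == self.target
    def likes(self, attr): return True

result = run_dialogue(lambda s: "recommend", ToyUser(3), ToyState([1, 2, 3]))
```

With a policy that always recommends, the loop discards rejected items until it reaches the user's target item.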
  • the method first obtains the historical interaction sequence between the user and items. Since the sequence reflects the user's preference for historical items and the order of interaction, using the historical interaction sequence in the training of the recommendation network model ensures that the candidate item set generated by the model also considers the user's historical preferences and interaction order, producing a candidate item set more in line with the user's preferences. This ensures that the candidate item set can be used in dialogue recommendation to conduct targeted dialogues with users, effectively reducing the number of dialogue rounds while improving the accuracy of dialogue recommendation.
  • the following describes a dialogue recommendation apparatus, electronic device, and storage medium provided by the embodiments of the present invention.
  • the dialogue recommendation apparatus, electronic equipment, and storage medium described below and the dialogue recommendation method described above may refer to each other correspondingly.
  • FIG. 2 is a structural block diagram of a dialogue recommendation apparatus provided by an embodiment of the present invention.
  • the apparatus may include:
  • the recommendation network module 201 is used to obtain the historical interaction sequence between the user and items, input the historical interaction sequence into the recommendation network model including the embedding layer, the self-attention block and the prediction layer for training, and generate the item preference value; wherein an item contains item attributes;
  • a candidate item set generation module 202 configured to generate a candidate item set by utilizing the item preference value and items that the user has not interacted with;
  • the first calculation module 203 is configured to update the candidate item set using the preference item attribute when receiving the preference item attribute sent by the user, and use the item correlation value to calculate the interactive prediction value of each candidate item in the updated candidate item set;
  • the second calculation module 204 is configured to generate a candidate attribute set using the interactive predicted value and the candidate item, and use the interactive predicted value to calculate the preference predicted value of each candidate item attribute in the candidate attribute set;
  • the dialogue recommendation module 205 is used for inputting the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning, and performing dialogue recommendation to the user.
  • the recommendation network module 201 may include:
  • the self-attention block sub-module is used to input the integrated embedding matrix into the self-attention block for iterative feature learning and generate a learning matrix;
  • the prediction layer sub-module is used to input the learning matrix into the prediction layer for matrix decomposition and calculate the initial item preference value.
  • the dialogue recommendation module 205 may include:
  • the reinforcement learning sub-module is used to input the candidate item set and the candidate attribute set into the policy network for reinforcement learning to generate action decisions;
  • the dialogue recommendation submodule is used to make dialogue recommendations to users using action decisions.
  • the dialogue recommendation sub-module may include:
  • a first sending unit configured to send the candidate item with the largest interaction prediction value to the user terminal when the action decision is an item recommendation, and receive feedback data;
  • a first processing unit configured to quit the dialogue recommendation when the feedback data indicates that the candidate item is accepted;
  • the second processing unit is configured to, when the feedback data indicates that the candidate item is rejected, remove the candidate item from the candidate item set, and use the candidate item set after removal to execute the step of calculating the interaction prediction value of each candidate item in the updated candidate item set by using the item correlation value;
  • a second sending unit configured to send the item attribute with the largest preference prediction value to the user terminal when the action decision is an attribute query, and receive feedback data;
  • the third processing unit is used to verify the items in the candidate item set by using the feedback data, remove the items that fail the verification, and finally use the candidate item set after removal to execute the step of calculating the interaction prediction value of each candidate item in the updated candidate item set by using the item correlation value.
  • the dialogue recommendation module 205 may further include:
  • the dialogue round judgment sub-module is used to judge whether the dialogue round corresponding to the action decision is greater than the preset threshold; if so, exit the dialogue recommendation; if not, execute the step of using the action decision to recommend the dialogue to the user.
  • the first computing module 203 includes:
  • the removal operation submodule is used to remove, from the candidate item set, the items that do not have the preference item attribute.
  • the second computing module 204 includes:
  • the candidate attribute set generation sub-module is used to generate a candidate attribute set from the item attributes contained in the first preset number of candidate items, taken in descending order of the interaction prediction value;
  • the second calculation submodule is used for calculating the information entropy of each candidate item attribute in the candidate attribute set by using the interactive prediction value of the candidate item, and using the information entropy as the preference prediction value of the candidate item attribute.
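As a rough sketch of how the attribute scoring described above could work, the snippet below builds the candidate attribute set from the top-k items by interaction prediction value and scores each attribute by information entropy. The top-k cutoff, the normalization of item scores into a probability, and the binary-entropy formula are assumptions made for illustration; the document states only that the entropy is computed from the candidate items' interaction prediction values.

```python
import math

def attribute_entropy_scores(item_scores, item_attributes, top_k=10):
    """Score candidate attributes by information entropy.

    item_scores: dict item_id -> interaction prediction value (assumed non-negative)
    item_attributes: dict item_id -> set of attribute ids
    Returns: dict attribute_id -> entropy-based preference prediction value.
    """
    # Candidate attribute set: attributes of the top_k items by predicted score.
    top_items = sorted(item_scores, key=item_scores.get, reverse=True)[:top_k]
    candidate_attrs = set().union(*(item_attributes[i] for i in top_items))

    total = sum(item_scores[i] for i in top_items) or 1.0
    scores = {}
    for attr in candidate_attrs:
        # Share of prediction-value mass held by top items carrying this attribute.
        p = sum(item_scores[i] for i in top_items if attr in item_attributes[i]) / total
        # Binary entropy: maximal when the attribute splits the candidates evenly,
        # so asking the user about it is most informative.
        scores[attr] = 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return scores
```

An attribute shared by all highly scored candidates gets entropy 0, so the policy would gain nothing by asking about it; an attribute that splits the candidates roughly in half scores highest.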
  • FIG. 3 is a structural block diagram of a dialogue recommendation system based on a historical interaction sequence provided by an embodiment of the present invention.
  • SeqCR (Sequential Conversation Recommender)
  • the Policy Network module is used to complete the functions of the dialogue recommendation module 205 in the above embodiment, such as receiving feedback, updating preferences, and passing updated candidate items and attributes (Update Candidate Items and Attributes) to the Sequential module.
  • the Sequential module is used to complete the function of the recommendation network module 201 in the above embodiment, and includes the Embedding Layer, the Self-attention Block and the Prediction Layer, where the self-attention block consists of a self-attention layer and a Feed-Forward Network. The Scoring module is used to complete the functions of the candidate item set generation module 202, the first calculation module 203 and the second calculation module 204 in the above embodiment, namely Item Scoring (calculating the interaction prediction value) and Attribute Scoring (calculating the preference prediction value). User represents the user, who can receive queries sent by the policy network and provide feedback to the policy network.
  • the processor is configured to implement the steps of the above dialogue recommendation method when executing the computer program.
  • since the embodiments of the electronic device part correspond to the embodiments of the dialogue recommendation method part, for the embodiments of the electronic device part, reference may be made to the description of the embodiments of the dialogue recommendation method part, which will not be repeated here.
  • Embodiments of the present invention further provide a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the dialog recommendation method of any of the foregoing embodiments are implemented.
  • since the embodiments of the storage medium part correspond to the embodiments of the dialogue recommendation method part, for the embodiments of the storage medium part, reference may be made to the description of the embodiments of the dialogue recommendation method part, which will not be repeated here.
  • a software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A conversation recommendation method and apparatus, an electronic device, and a storage medium. A historical sequence is used to improve the recommendation efficiency. The method comprises: obtaining a historical interaction sequence between a user and an item, inputting the historical interaction sequence into a recommendation network model comprising an embedding layer, a self-attention block, and a prediction layer for training, and generating an item preference value, wherein the item comprises an item attribute; generating a candidate item set by using the item preference value and items having no interaction with the user; when a preference item attribute sent by the user is received, updating the candidate item set by using the preference item attribute, and calculating an interaction prediction value of each candidate item in the updated candidate item set by using an item correlation value; generating a candidate attribute set by using the interaction prediction value and the candidate item, and calculating a preference prediction value of each candidate item attribute in the candidate attribute set by using the interaction prediction value; and inputting the calculated candidate item set and the calculated candidate attribute set into a policy network for reinforcement learning, and carrying out conversation recommendation to the user.

Description

Dialogue recommendation method, apparatus, electronic device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on March 23, 2021, with application number 202110308759.6 and entitled "Dialogue recommendation method, apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of dialogue recommendation, and in particular to a dialogue recommendation method, apparatus, electronic device and storage medium.
Background
A conversational recommender system (CRS) is a recommender system that can actively obtain preference attributes from a user and use these attributes to recommend items. In the related art, dialogue recommendation can make recommendations using the preference attributes currently asked of the user, and it can consider the historical items that the user has interacted with when asking the user for preference attributes or recommending items, but it ignores the influence of the order of interaction between the user and historical items on the recommendation, that is, the importance of the sequence of historical interaction items. As a result, existing dialogue recommendation methods find it difficult to conduct dialogue recommendation with users efficiently and accurately.
Summary of the Invention
The purpose of the present invention is to provide a dialogue recommendation method, apparatus, electronic device and storage medium that can use the historical interaction sequence, which reflects the user's historical item preferences, to train the recommendation network model and generate the candidate item set, thereby ensuring that dialogue recommendation can conduct targeted dialogues with the user and improving the efficiency and accuracy of dialogue recommendation.
To solve the above technical problem, the present invention provides a dialogue recommendation method, including:
obtaining a historical interaction sequence between a user and items, and inputting the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block and a prediction layer for training, to generate item preference values, wherein an item contains item attributes;
generating a candidate item set by using the item preference values and items that the user has not interacted with;
when a preference item attribute sent by the user is received, updating the candidate item set by using the preference item attribute, and calculating an interaction prediction value of each candidate item in the updated candidate item set by using item correlation values;
generating a candidate attribute set by using the interaction prediction values and the candidate items, and calculating a preference prediction value of each candidate item attribute in the candidate attribute set by using the interaction prediction values;
and inputting the calculated candidate item set and candidate attribute set into a policy network for reinforcement learning, and performing dialogue recommendation to the user.
Optionally, inputting the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block and a prediction layer for training, to generate item preference values, includes:
generating a training sequence of a preset length by using the historical interaction sequence;
inputting all items and the training sequence into the embedding layer, and outputting an integrated embedding matrix;
inputting the integrated embedding matrix into the self-attention block for iterative feature learning, to generate a learning matrix;
inputting the learning matrix into the prediction layer for matrix decomposition, and calculating initial item preference values;
and performing network optimization on the recommendation network model by using a binary cross-entropy loss function until the output value of the binary cross-entropy loss function is minimal, and taking the initial item preference values at the minimal output value as the item preference values.
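The binary cross-entropy optimization mentioned in the last step can be illustrated with a minimal sketch. Pairing each positive (actually interacted) next item with one sampled negative item is an assumption borrowed from common self-attentive sequential recommenders and is not spelled out in this document:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(pos_logits, neg_logits, eps=1e-12):
    """Binary cross-entropy over predicted logits: positive logits score the
    item the user actually interacted with next, negative logits score a
    sampled item the user did not interact with."""
    loss = 0.0
    for p, n in zip(pos_logits, neg_logits):
        # Push sigmoid(p) toward 1 and sigmoid(n) toward 0.
        loss += -math.log(sigmoid(p) + eps) - math.log(1.0 - sigmoid(n) + eps)
    return loss / len(pos_logits)
```

Training drives this loss toward its minimum, at which point the model's item prediction values are taken as the item preference values.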
Optionally, inputting the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning and performing dialogue recommendation to the user includes:
inputting the candidate item set and the candidate attribute set into the policy network for the reinforcement learning, to generate an action decision;
and performing dialogue recommendation to the user by using the action decision.
Optionally, performing dialogue recommendation to the user by using the action decision includes:
when the action decision is an item recommendation, sending the candidate item with the largest interaction prediction value to the user terminal, and receiving feedback data;
when the feedback data indicates that the candidate item is accepted, exiting the dialogue recommendation;
when the feedback data indicates that the candidate item is rejected, removing the candidate item from the candidate item set, and using the candidate item set after removal to execute the step of calculating the interaction prediction value of each candidate item in the updated candidate item set by using the item correlation values;
when the action decision is an attribute query, sending the candidate item attribute with the largest preference prediction value to the user terminal, and receiving the feedback data;
and verifying the items in the candidate item set by using the feedback data, removing the items that fail the verification, and finally using the candidate item set after removal to execute the step of calculating the interaction prediction value of each candidate item in the updated candidate item set by using the item correlation values.
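The recommend-or-ask loop described by these optional steps can be sketched as follows. The `policy`, `scorer` and `user` interfaces are invented for illustration and do not correspond to concrete modules of this method:

```python
def run_conversation(policy, scorer, user, candidate_items, max_turns=15):
    """One user session: each turn the policy either recommends the highest-
    scored candidate item or asks about the highest-scored attribute, then
    the candidate set is pruned according to the user's feedback."""
    for _turn in range(max_turns):
        if not candidate_items:
            return None
        item_scores = scorer.score_items(candidate_items)
        attr_scores = scorer.score_attributes(candidate_items, item_scores)
        action = policy.decide(item_scores, attr_scores)  # "recommend" or "ask"
        if action == "recommend":
            best = max(item_scores, key=item_scores.get)
            if user.accepts_item(best):
                return best                    # accepted: exit the dialogue
            candidate_items.discard(best)      # rejected: remove and re-score
        else:
            attr = max(attr_scores, key=attr_scores.get)
            liked = user.likes_attribute(attr)
            # keep only candidates consistent with the feedback
            candidate_items = {i for i in candidate_items
                               if (attr in scorer.attributes_of(i)) == liked}
    return None                                # turn budget exhausted
```

The turn budget plays the role of the preset round threshold: when it is exceeded, the dialogue exits without a recommendation.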
Optionally, before performing dialogue recommendation to the user by using the action decision, the method further includes:
judging whether the dialogue round corresponding to the action decision is greater than a preset threshold;
if so, exiting the dialogue recommendation;
if not, executing the step of performing dialogue recommendation to the user by using the action decision.
Optionally, updating the candidate item set by using the preference item attribute includes:
removing, from the candidate item set, the items that do not have the preference item attribute.
Optionally, generating a candidate attribute set by using the interaction prediction values and the candidate items, and calculating a preference prediction value of each candidate item attribute in the candidate attribute set by using the interaction prediction values, includes:
generating a candidate attribute set from the item attributes contained in the first preset number of candidate items, taken in descending order of the interaction prediction values;
and calculating the information entropy of each candidate item attribute in the candidate attribute set by using the interaction prediction values of the candidate items, and taking the information entropy as the preference prediction value of the candidate item attribute.
The present invention further provides a dialogue recommendation apparatus, including:
a recommendation network module, configured to obtain the historical interaction sequence between a user and items, and input the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block and a prediction layer for training, to generate a candidate item set and item correlation values; wherein an item contains item attributes;
a candidate item set generation module, configured to generate a candidate item set by using the item preference values and items that the user has not interacted with;
a first calculation module, configured to, when a preference item attribute sent by the user is received, update the candidate item set by using the preference item attribute, and calculate an interaction prediction value of each candidate item in the updated candidate item set by using the item correlation values;
a second calculation module, configured to generate a candidate attribute set by using the interaction prediction values and the candidate items, and calculate a preference prediction value of each candidate item attribute in the candidate attribute set by using the interaction prediction values;
and a dialogue recommendation module, configured to input the calculated candidate item set and candidate attribute set into a policy network for reinforcement learning, and perform dialogue recommendation to the user.
The present invention further provides an electronic device, including:
a memory, configured to store a computer program;
and a processor, configured to implement the above dialogue recommendation method when executing the computer program.
The present invention further provides a storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the above dialogue recommendation method.
The present invention provides a dialogue recommendation method, including: obtaining a historical interaction sequence between a user and items, and inputting the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block and a prediction layer for training, to generate item preference values, wherein an item contains item attributes; generating a candidate item set by using the item preference values and items that the user has not interacted with; when a preference item attribute sent by the user is received, updating the candidate item set by using the preference item attribute, and calculating an interaction prediction value of each candidate item in the updated candidate item set by using item correlation values; generating a candidate attribute set by using the interaction prediction values and the candidate items, and calculating a preference prediction value of each candidate item attribute in the candidate attribute set by using the interaction prediction values; and inputting the calculated candidate item set and candidate attribute set into a policy network for reinforcement learning, and performing dialogue recommendation to the user.
It can be seen that this method first obtains the historical interaction sequence between the user and items. Because this sequence reflects both the user's preference for historical items and the order of interaction, using the historical interaction sequence in training the recommendation network model ensures that the candidate item set generated by the model takes both the user's historical preferences and the interaction order into account, yielding a candidate item set that better matches the user's preferences. This in turn ensures that dialogue recommendation can use the candidate item set to conduct targeted dialogues with the user, effectively reducing the number of dialogue rounds while improving the accuracy of dialogue recommendation. The present invention further provides a dialogue recommendation apparatus, an electronic device and a storage medium, which have the above beneficial effects.
Brief Description of the Drawings
To describe the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings required in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of a dialogue recommendation method provided by an embodiment of the present invention;
FIG. 2 is a structural block diagram of a dialogue recommendation apparatus provided by an embodiment of the present invention;
FIG. 3 is a structural block diagram of a dialogue recommendation system based on historical interaction sequences provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
A conversational recommender system (CRS) is a recommender system that can actively obtain preference attributes from a user and use these attributes to recommend items. In the related art, dialogue recommendation can only make recommendations using the preference attributes currently asked of the user; it is difficult for it to take the user's historical item preferences into account, so it can only blindly ask the user about preference attributes or recommend items, which ultimately makes it difficult for existing dialogue recommendation methods to conduct dialogue recommendation with users efficiently and accurately. In view of this, the present invention provides a dialogue recommendation method, which can use the historical interaction sequence reflecting the user's historical item preferences to train the recommendation network model and generate the candidate item set, thereby ensuring that dialogue recommendation can conduct targeted dialogues with the user and improving the efficiency and accuracy of dialogue recommendation. Please refer to FIG. 1, which is a flowchart of a dialogue recommendation method provided by an embodiment of the present invention. The method may include:
S101: Obtain a historical interaction sequence between a user and items, and input the historical interaction sequence into a recommendation network model including an embedding layer, a self-attention block and a prediction layer for training, to generate item preference values, wherein an item contains item attributes.
The historical interaction sequence is a record of the user's past interactions with items, and the items in the historical interaction sequence are sorted in the chronological order of the user's interactions with them; an interaction can be a click-to-view, a favorite, a purchase, and so on. Since the historical interaction sequence contains both the user's historical preferences for items and the order in which the user interacted with historical items, the embodiment of the present invention integrates the historical interaction sequence into the training of the recommendation network model, which ensures that the recommendation network model can fuse both the user's historical preferences and the sequential characteristics of interactions with historical items. This in turn ensures that dialogue recommendation can conduct targeted dialogues according to the user's historical preferences, effectively improving the efficiency and accuracy of dialogue recommendation. It should be noted that the embodiment of the present invention does not limit what an item specifically is; it can be understood that an item is a concrete object, which may be a physical object such as a book, or a virtual object such as a movie. The embodiment of the present invention also does not limit the specific item attributes. It can be understood that an item attribute reflects some characteristic of the item; for example, when the item is a book, the item attribute may be the genre of the book, such as novel, biography or textbook, it may be an attribute indicating whether the book is a best-seller, or it may be another book-related attribute. The embodiment of the present invention also does not limit the number of item attributes an item can contain; an item may have one or more item attributes.
It should be noted that the recommendation network model used in the embodiment of the present invention is based on a deep-learning neural network. The embodiment of the present invention does not limit the specific structures and learning methods of the embedding layer, the self-attention block and the prediction layer in the recommendation network model; the reader may refer to the related art of deep-learning neural networks.
Further, the embodiment of the present invention does not limit the number of items that the historical interaction sequence can contain; the historical interaction sequence may contain one or more items. Since the lengths of the historical interaction sequences of different users differ, to facilitate the training of the recommendation network model, the historical interaction sequence can be converted into a training sequence of a preset length. It can be understood that when the historical interaction sequence contains too few items, it will be difficult to compute a reliable preference for the user; therefore a minimum item count can be set for the historical interaction sequence, and when the number of items contained in the historical interaction sequence is less than the minimum item count, network model training will not be performed. The embodiment of the present invention does not limit the specific value of the minimum item count, which can be set according to actual application requirements; in one possible case, the minimum item count may be 10. It should be noted that the embodiment of the present invention also does not limit the specific value of the preset length, which can be set according to actual application requirements.
在一种可能的情况中,将历史交互序列输入至包含嵌入层、自注意块及预测层的推荐网络模型中进行训练,生成项目偏好值的过程,可以包括:In a possible situation, the historical interaction sequence is input into the recommendation network model including the embedding layer, the self-attention block and the prediction layer for training, and the process of generating the item preference value may include:
步骤11:利用历史交互序列生成预设长度的训练序列。Step 11: Use the historical interaction sequence to generate a training sequence of preset length.
Specifically, for a user's historical interaction sequence V_u = (v_1, v_2, …, v_{|V_u|}), an initial training sequence V′_u consisting of its first |V_u| − 1 items can first be generated. Since historical interaction sequences differ in length between users, the initial training sequence can be converted into a fixed-length training sequence s = (s_1, s_2, s_3, …, s_n), where n denotes the preset length of the new sequence. For a user with few historical visits, the length of the initial training sequence is smaller than n; in that case, the initial training sequence can be left-padded with a preset padding item, the number of padded entries being n − (|V_u| − 1). Conversely, it can be understood that if the user's historical sequence is longer than n, only the n most recently interacted items are retained.
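The left-padding and truncation described above can be sketched as follows (an illustrative sketch only; the padding item ID 0 and the example sequences are assumptions, not part of the claimed embodiment):

```python
def to_fixed_length(seq, n, pad_item=0):
    """Left-pad a short interaction sequence to length n, or keep only
    the n most recent items if it is longer than n."""
    if len(seq) >= n:
        return list(seq[-n:])                    # keep the n most recent interactions
    return [pad_item] * (n - len(seq)) + list(seq)  # left-pad with the padding item

# A short history is left-padded; a long one is truncated from the left.
print(to_fixed_length([5, 7, 9], 5))             # [0, 0, 5, 7, 9]
print(to_fixed_length([1, 2, 3, 4, 5, 6], 5))    # [2, 3, 4, 5, 6]
```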
Step 12: input all items and the training sequence into the embedding layer, and output an integrated embedding matrix.
Specifically, in the embedding layer, all items are first embedded to obtain an item embedding matrix M ∈ R^{|V|×d}, where d denotes the embedding dimension. This dimension can be set arbitrarily; the larger it is, the more latent features the embedding matrix can carry. A lookup operation is then performed on the item embedding matrix to generate the training-sequence embedding matrix E ∈ R^{n×d} of the training sequence s, where the lookup is E_i = M_{s_i}: E_i denotes the vector in the training-sequence embedding matrix corresponding to the i-th element s_i of the training sequence s, and M_{s_i} denotes the vector corresponding to s_i in the item embedding matrix. Finally, the training-sequence embedding matrix E is integrated with a learnable position embedding matrix P ∈ R^{n×d} to obtain the integrated embedding matrix:
E^s = E + P
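The lookup E_i = M_{s_i} and the integration E^s = E + P can be illustrated with NumPy (the matrix sizes, random initialization, and sample sequence are assumptions for illustration; in the embodiment, M and P are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, n, d = 100, 8, 16           # |V|, preset sequence length, embedding dimension

M = rng.normal(size=(num_items, d))    # item embedding matrix, |V| x d
P = rng.normal(size=(n, d))            # learnable position embedding matrix, n x d

s = np.array([0, 0, 0, 5, 7, 9, 11, 3])  # a left-padded training sequence of item IDs

E = M[s]                               # lookup: row i of E is M[s_i], shape n x d
E_s = E + P                            # integrated embedding matrix E^s = E + P
print(E_s.shape)                       # (8, 16)
```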
Step 13: input the integrated embedding matrix into the self-attention blocks for iterative feature learning, and generate a learning matrix.
It should be noted that the embodiments of the present invention do not limit the specific learning method of the self-attention block; users may refer to the related technologies of self-attention. The embodiments of the present invention likewise do not limit the specific structure of the self-attention block; in one possible case, a self-attention block comprises a self-attention layer and a point-wise feed-forward network. The present invention also does not limit whether multiple self-attention blocks are stacked for computation; it can be understood that the more layers are stacked, the more features the self-attention blocks can learn.
In one possible case, the self-attention layer operates on three matrices, Q (query), K (key), and V (value), all derived from the same input to the self-attention block. In the computation of the self-attention block, the dot product of Q and K is calculated first; to prevent the dot-product result from becoming too large, it is divided by the scale factor √d, where d is the dimension of the query and key vectors. Finally, a softmax operation normalizes the result into a probability distribution, which is multiplied by the matrix V to obtain a weighted-sum representation. Attention is thus defined via a scaled dot product as:
Attention(Q, K, V) = softmax(QK^T / √d) V
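The scaled dot-product attention above can be sketched as follows (a minimal single-head NumPy sketch; the shapes and random inputs are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # divide by the scale factor sqrt(d)
    return softmax(scores, axis=-1) @ V       # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)                              # (4, 8)
```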
In the embodiments of the present invention, when predicting the (k+1)-th item, only the first k items in the training sequence should be used in the computation. The self-attention layer, however, learns over the complete training sequence; a mask operation can therefore be used to block the connections between Q_i and K_j (i < j) in the self-attention layer. After the self-attention layer outputs the learning result S_k based on the first k items, a point-wise two-layer feed-forward network can be applied to turn the self-attention layer from a linear model into a nonlinear one. Meanwhile, in order to learn more complex item transitions, iterative feature learning can be performed by stacking self-attention blocks; the embodiments of the present invention do not limit the specific number of stacked layers, which can be set according to actual application requirements. The b-th (b > 1) self-attention block can be defined as:
SA(F^{(b−1)}) = Attention(F^{(b−1)}W^Q, F^{(b−1)}W^K, F^{(b−1)}W^V)
F_k^{(b)} = FFN(S_k^{(b)}) = RELU(S_k^{(b)}W_1 + b_1)W_2 + b_2
where SA (Self-Attention) denotes the self-attention layer and FFN (Feed-Forward Network) denotes the feed-forward network; F_k^{(b)} denotes the learning matrix of the b-th self-attention block based on the first k (k ∈ {1, 2, 3, …, n}) items; RELU (Rectified Linear Unit) is the rectified linear activation function; and S_k^{(b)} denotes the output of the self-attention layer aggregating the first k items in the b-th self-attention block. W^Q, W^K, W^V, W_1, W_2 ∈ R^{d×d} are all learnable matrices, and b_1, b_2 are d-dimensional vectors.
For the first self-attention block, it can be defined as S^{(1)} = SA(E^s) and F^{(1)} = FFN(S^{(1)}).
Multi-layer neural networks have strong feature-learning ability, but simply adding more network layers leads to problems such as overfitting and longer training time: as the network deepens, the risk of vanishing gradients grows and model performance degrades. This can be alleviated by residual connections. Specifically, the residual connection process can be as follows: normalize the input x of the self-attention layer and of the feed-forward network, apply a dropout operation to the output of the self-attention layer and of the feed-forward network, and finally add the original input x to the post-dropout output as the final output.
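The causal mask (blocking the Q_i–K_j connections with i < j) and the residual connection can be sketched together as follows (an illustrative sketch; the layer normalization is simplified and dropout is omitted, since it is only active during training):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention with a mask so position k only aggregates the first k items."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions with j > i
    scores = np.where(mask, -1e9, scores)                  # blocked before the softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def residual_block(x, sublayer):
    """Residual connection: normalize the input, apply the sub-layer,
    then add the original input x back (dropout omitted in this sketch)."""
    h = (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-8)
    return x + sublayer(h)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
out = residual_block(X, lambda h: causal_attention(h, h, h))
print(out.shape)   # (5, 8)
```

With the mask in place, the first position can only attend to itself, so the first output row of `causal_attention` equals the first value row, which is exactly the "first k items only" behavior described above.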
Step 14: input the learning matrix into the prediction layer for matrix factorization, and calculate the initial item relevance values.
Specifically, after b self-attention blocks, F_k^{(b)} can be used for item prediction. F_k^{(b)} is input into the MF (Matrix Factorization) layer, which performs matrix factorization to calculate the relevance value of item i, so as to predict whether item i is a recommendable item. The item relevance value can be calculated by the following formula:
r_{i,k} = F_k^{(b)} N_i^T
where r_{i,k} denotes the relevance of item i becoming the next item (i.e., being recommendable) given the first k items, and N ∈ R^{|V|×d} is an item embedding matrix.
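The MF-layer scoring r_{i,k} = F_k^{(b)} N_i^T amounts to a matrix–vector product against the item embedding matrix; a sketch under assumed shapes (random stand-ins for the learned F_k^{(b)} and N):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d = 100, 16
F_k = rng.normal(size=(d,))            # learned representation after the first k items
N = rng.normal(size=(num_items, d))    # item embedding matrix used by the MF layer

r = N @ F_k                            # r[i] = relevance r_{i,k} = F_k . N_i
top = np.argsort(-r)[:5]               # the 5 most relevant next-item candidates
print(r.shape, top.shape)              # (100,) (5,)
```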
Step 15: optimize the recommendation network model with a binary cross-entropy loss function until the output value of the loss function is minimized, and take the initial item preference values at that minimum as the item preference values.
It should be noted that the embodiments of the present invention do not limit the specific process of network optimization using the binary cross-entropy loss function, for which related technologies may be consulted. In one possible case, the binary cross-entropy loss function can be expressed as:
L = −Σ_{t=1}^{n} [ log σ(r_{e_t,t}) + Σ_{v′∉s} log(1 − σ(r_{v′,t})) ]
where e_t denotes the expected output of the network at time step t and s_t denotes the actual output of the network at time step t. Since the training sequence may contain the preset padding item, when s_t is the padding item, the expected output e_t should also be set to the padding item. When s_t is a normal item, e_t should be set as e_t = s_{t+1}. When t = n, i.e., the training sequence has reached its end at time step t, e_t = v_{|V_u|}, the last item of the historical interaction sequence.
S102: generate a candidate item set using the item preference values and the items the user has not interacted with.
Since an item preference value can express the user's interest in an item, in the embodiments of the present invention the item preference values are used as the indicator for generating the candidate item set, i.e., for ranking items. It can be understood that, in order to predict items the user may be interested in in the future, the candidate item set is generated from items the user has not interacted with; these items can be obtained as the set difference between the total set of all items and the set of historical items corresponding to the historical interaction sequence.
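The set difference described above can be sketched with plain Python sets (the item IDs are hypothetical):

```python
# Candidate item set: all items minus the user's historical items.
all_items = set(range(1, 11))   # hypothetical item IDs 1..10
history = {2, 5, 7}             # items from the historical interaction sequence
v_cand = all_items - history    # items the user has never interacted with
print(sorted(v_cand))           # [1, 3, 4, 6, 8, 9, 10]
```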
S103: when a preference item attribute sent by the user is received, update the candidate item set using the preference item attribute, and calculate the interaction prediction value of each candidate item in the updated candidate item set using the item relevance values.
In dialogue recommendation, the dialogue is initiated by the user first sending a preferred item attribute. The embodiments of the present invention do not limit the specific process by which the user sends the preference item attribute: for example, the user may pick from a preset list of item attributes and send the selection data; alternatively, the user may send a dialogue recommendation request, whereupon the dialogue recommendation system sends the list of item attributes to the user for selection and finally receives the selection data sent by the user.
Further, the embodiments of the present invention do not limit the specific manner of updating the candidate item set with the preference item attribute: it can be understood that the items in the candidate item set that do not have the preference item attribute may be removed, or, equivalently, only the items that have the preference item attribute may be retained.
In one possible case, updating the candidate item set with the preference item attribute may include:
Step 21: remove the items in the candidate item set that do not have the preference item attribute.
The following describes in detail the process of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item relevance values. In one possible case, specifically, before the dialogue starts, the set of attributes accepted by the user, P_acc, and the set of attributes rejected by the user, P_rej, are both empty; the candidate item set V_cand contains the items the user has never interacted with, and the candidate attribute set P_cand is also empty. After a preference item attribute p_0 sent by the user is received, the items without the preference item attribute can first be removed from the candidate item set to update it, which can be expressed as:
V_cand = V_cand ∩ V_{p_0}
where V_{p_0} denotes the set of items carrying the preference item attribute p_0. After the candidate item set is updated, the output F_n^{(b)} of the stacked self-attention blocks — obtained by feeding the initial training sequence V′_u of the historical interaction sequence through the b self-attention blocks — can be used to calculate the interaction prediction value of each candidate item in the candidate item set:
s_v = F_n^{(b)} N_v^T,  v ∈ V_cand
When the policy network determines that the current dialogue round is an item recommendation, the preset number n of candidate items V_rec with the largest interaction prediction values are extracted from V_cand. If the user accepts one of these items, the dialogue recommendation ends; otherwise, the recommended items should be removed from the candidate item set, i.e., V_cand = V_cand − V_rec.
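The extraction of the top-n candidates and their removal after a rejected recommendation can be sketched as follows (the scores and item IDs are hypothetical):

```python
def recommend_round(v_cand, scores, n):
    """Pick the n candidates with the largest interaction prediction values.
    If the user rejects them, the caller removes them: v_cand -= set(v_rec)."""
    return sorted(v_cand, key=lambda v: scores[v], reverse=True)[:n]

scores = {1: 0.9, 3: 0.2, 4: 0.7, 6: 0.5}   # interaction prediction values s_v
v_cand = {1, 3, 4, 6}
v_rec = recommend_round(v_cand, scores, 2)
print(v_rec)                 # [1, 4]
v_cand -= set(v_rec)         # user rejected: shrink the candidate set
print(sorted(v_cand))        # [3, 6]
```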
S104: generate a candidate attribute set using the interaction prediction values and the candidate items, and calculate the preference prediction value of each candidate item attribute in the candidate attribute set using the interaction prediction values.
Since item attributes belong to items, and a user who likes certain items also tends to prefer some of those items' attributes, the preference prediction values of the candidate item attributes can also be calculated after the interaction prediction values of the candidate items are obtained, so as to determine the candidate item attributes the user prefers.
It should be noted that the embodiments of the present invention do not limit the specific manner of calculating the preference prediction values of the candidate item attributes. It can be understood that multiple items may share the same item attribute; in that case, the average of the interaction prediction values of the items carrying the attribute can be calculated and used as the attribute's preference prediction value. Alternatively, the information entropy of a candidate item attribute can be calculated and used as its preference prediction value, where information entropy is a measure of the elimination of information uncertainty: the lower the probability of an event, the more information entropy its occurrence provides. Considering that using information entropy as the preference prediction value of candidate item attributes allows the candidate item set to be filtered efficiently, in the embodiments of the present invention the preference prediction values of candidate item attributes can be calculated by way of information entropy.
It can be understood that computing information entropy is relatively expensive. The candidate items can therefore first be ranked by their interaction prediction values, the candidate attribute set can be generated from the preset number of candidate items with the largest interaction prediction values, and the information entropy can finally be computed over this candidate attribute set. It should be noted that the embodiments of the present invention do not limit the specific value of the preset number, which can be set according to actual application requirements.
In one possible case, the process of generating the candidate attribute set using the interaction prediction values and the candidate items, and calculating the preference prediction value of each candidate item attribute in the candidate attribute set using the interaction prediction values, may include:
Step 31: in descending order of interaction prediction value, generate the candidate attribute set from the item attributes contained in the top preset number of candidate items.
Specifically, the candidate item set V_cand is sorted in descending order of interaction prediction value, and the candidate attribute set is generated from the first L candidate items:
P_{v_{1:L}} = ∪_{i=1}^{L} P_{v_i}
P_cand = P_{v_{1:L}} − (P_acc ∪ P_rej)
where P_{v_{1:i}} denotes the attribute set of the first i candidate items in V_cand, P_{v_i} denotes the attribute set of the i-th candidate item in V_cand, and P_acc and P_rej denote the sets of attributes accepted and rejected by the user, respectively.
Step 32: calculate the information entropy of each candidate item attribute in the candidate attribute set using the interaction prediction values of the candidate items, and use the information entropy as the preference prediction value of the candidate item attribute.
Specifically, the information entropy of a candidate item attribute can be calculated by the following formula:
f_att(P_cand, V_cand) = −prob(p) · log_2(prob(p)),  p ∈ P_cand
prob(p) = Σ_{v∈V_p} σ(s_v) / Σ_{v∈V_cand} σ(s_v)
where σ is the sigmoid (activation) function, s_v is the interaction prediction value of candidate item v, and V_p denotes the set of items containing item attribute p. By computing information entropy in this weighted form, the embodiments of the present invention assign higher weights to important candidate items instead of treating every item equally. Put simply, if many of the candidate items contain attribute p, then attribute p can hardly filter the candidate items and is therefore not a suitable attribute to ask about; the weighted entropy ensures that, once the user responds about a candidate item attribute, that attribute can rapidly filter the candidate items.
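The weighted-entropy computation above — prob(p) as the sigmoid-score mass of the items carrying p over all candidates, then −prob(p)·log_2(prob(p)) — can be sketched as follows (the scores and item IDs are hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_entropy(p_items, cand_scores):
    """Weight-entropy score of attribute p: -prob(p) * log2(prob(p)), where
    prob(p) = sum of sigmoid(s_v) over items with p / sum over all candidates."""
    total = sum(sigmoid(s) for s in cand_scores.values())
    mass = sum(sigmoid(cand_scores[v]) for v in p_items)
    prob = mass / total
    if prob in (0.0, 1.0):
        return 0.0   # an attribute held by none or by all candidates cannot filter
    return -prob * math.log2(prob)

cand_scores = {1: 2.0, 2: 0.5, 3: -1.0}          # interaction prediction values s_v
print(weighted_entropy({1, 2, 3}, cand_scores))  # 0.0: shared by every candidate
print(weighted_entropy({1}, cand_scores))        # > 0: can actually split the set
```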
S105: input the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning, and make a dialogue recommendation to the user.
It should be noted that the policy network is based on a deep reinforcement learning network; according to the input state and the user's historical dialogue results, it generates an action decision for the current dialogue round, which can then be used to make this round's dialogue recommendation. The embodiments of the present invention do not limit the reinforcement learning process of the policy network, for which the related technologies of deep reinforcement learning networks may be consulted; nor do they limit the network optimization method of the policy network, for which the same related technologies may likewise be consulted.
In one possible case, the process of inputting the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning and making a dialogue recommendation to the user may include:
Step 41: input the candidate item set and the candidate attribute set into the policy network for reinforcement learning, and generate an action decision.
Specifically, the policy network determines through reinforcement learning whether the action of the current dialogue round is to ask a question or to make a recommendation. The policy network involves four notions: state, action, reward, and policy. The state contains the dialogue history and the length of the current candidate item set. The dialogue history is encoded by s_his, whose size is the maximum number of dialogue-recommendation rounds T, each dimension representing the user's dialogue history in round t. Special values can be used to represent the user's dialogue history; it should be noted that the present invention does not limit the specific special values, which can be set according to actual application requirements. In one possible case, −2 may indicate a failed recommendation, −1 that the user rejected the queried item attribute, 2 a successful recommendation, and 1 that the user accepted the queried item attribute. The length of the current candidate item set is encoded by s_len. The two states can be concatenated into the total state:
s = s_his ⊕ s_len
The embodiments of the present invention include two actions: asking, a_ask, and recommending, a_rec.
Five kinds of rewards are adopted in the embodiments of the present invention:
(1) r_rec_suc: a strong positive reward if the recommendation succeeds;
(2) r_rec_fail: a strong negative reward if the recommendation fails;
(3) r_ask_suc: a mild positive reward if the queried attribute is accepted by the user;
(4) r_ask_fail: a negative reward if the queried attribute is rejected by the user;
(5) r_quit: a strong negative reward if the user quits or the maximum number of rounds is reached.
The intermediate reward r_t in round t is the weighted sum of these five rewards.
It should be noted that the embodiments of the present invention do not limit the specific settings of the above reward values, which can be set according to actual application requirements.
After the current dialogue state s is input into the policy network, the policy network outputs action decision values Q(s, a) for the two actions, from which the action of the current dialogue round can be determined. In the embodiments of the present invention, the policy network is optimized with standard deep Q-learning.
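The state encoding s = s_his ⊕ s_len and a greedy choice over the two action values Q(s, a) can be sketched as follows (T, the history codes, and the stand-in Q-values are assumptions; in the embodiment, Q(s, a) comes from the trained deep Q-network):

```python
import numpy as np

T = 15                                  # assumed maximum number of dialogue rounds
s_his = np.zeros(T)                     # per-round codes: +-1 for asked attributes, +-2 for recommendations
s_his[0], s_his[1] = 1, -1              # round 1: attribute accepted; round 2: attribute rejected
s_len = np.array([42.0])                # encoded size of the current candidate item set
s = np.concatenate([s_his, s_len])      # total state: s = s_his (+) s_len

q_values = {'ask': 0.3, 'recommend': 0.8}   # stand-in for the Q-network output Q(s, a)
action = max(q_values, key=q_values.get)    # greedy action for this round
print(s.shape, action)                      # (16,) recommend
```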
Step 42: make a dialogue recommendation to the user according to the action decision.
Specifically, the process of making a dialogue recommendation to the user according to the action decision may include:
Step 51: when the action decision is an item recommendation, send the candidate items with the largest interaction prediction values to the user terminal, and receive feedback data;
Step 52: when the feedback data indicates acceptance of a candidate item, exit the dialogue recommendation;
Step 53: when the feedback data indicates rejection of the candidate items, remove those candidate items from the candidate item set, and with the resulting candidate item set perform the step of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item relevance values.
Specifically, when the policy network determines that the current dialogue round is an item recommendation, the preset number n of candidate items V_rec with the largest interaction prediction values are extracted from V_cand. If the user accepts one of these items, the dialogue recommendation ends; otherwise, the recommended items should be removed from the candidate item set, i.e., V_cand = V_cand − V_rec.
Step 54: when the action decision is an attribute query, send the candidate item attribute with the largest preference prediction value to the user terminal, and receive feedback data;
Step 55: verify the items in the candidate item set using the feedback data, remove the items that fail the verification, and with the resulting candidate item set perform the step of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item relevance values.
Specifically, when the policy network determines that the current dialogue round is an item attribute query, the candidate item attribute p_i ∈ P_cand with the largest preference prediction value is selected from the candidate attribute set P_cand.
When the feedback data sent back by the user is received, the candidate items can be verified using the candidate item attribute in the feedback data. When the feedback data indicates that the user accepted the candidate item attribute p, the verification process is: update the set of attributes accepted by the user, P_acc = P_acc ∪ {p}, and update the candidate item set, V_cand = V_cand ∩ V_p. When the feedback data indicates that the user rejected the candidate item attribute p, the verification process is: update the set of attributes rejected by the user, P_rej = P_rej ∪ {p}, and update the candidate item set, V_cand = V_cand − V_p.
Finally, it can be understood that when the dialogue recommendation keeps failing to produce an item acceptable to the user, it will continue for many rounds. To ensure that the number of dialogue rounds does not grow without bound, the rounds can be limited by a preset threshold: when the number of rounds reaches the preset threshold, the dialogue recommendation is exited.
In one possible case, before making a dialogue recommendation to the user according to the action decision, the method may further include:
Step 61: determine whether the dialogue round corresponding to the action decision is greater than a preset threshold; it should be noted that the present invention does not limit the specific value of the preset threshold, which can be set according to actual application requirements.
Step 62: if so, exit the dialogue recommendation;
Step 63: if not, perform the step of making a dialogue recommendation to the user according to the action decision.
Based on the above embodiment, the method first obtains the historical interaction sequence between the user and the items. Since this sequence reflects both the user's preferences for historical items and the order of interaction, using it to train the recommendation network model ensures that the candidate item set generated by the model takes both into account, yielding candidate sets that better match the user's preferences. This in turn ensures that the candidate item set can be used during the conversation to question the user in a targeted way, which effectively reduces the number of dialogue rounds while improving the accuracy of the dialogue recommendation.
The following introduces a dialogue recommendation apparatus, an electronic device, and a storage medium provided by embodiments of the present invention; the apparatus, device, and medium described below and the dialogue recommendation method described above may be referred to in correspondence with each other.
Please refer to FIG. 2, which is a structural block diagram of a dialogue recommendation apparatus provided by an embodiment of the present invention. The apparatus may include:
a recommendation network module 201, configured to obtain a historical interaction sequence between a user and items, and input the historical interaction sequence into a recommendation network model comprising an embedding layer, a self-attention block, and a prediction layer for training, generating item preference values; wherein the items contain item attributes;
a candidate item set generation module 202, configured to generate a candidate item set using the item preference values and the items that the user has not interacted with;
a first calculation module 203, configured to, when a preferred item attribute sent by the user is received, update the candidate item set using the preferred item attribute, and calculate an interaction prediction value of each candidate item in the updated candidate item set using the item correlation values;
a second calculation module 204, configured to generate a candidate attribute set using the interaction prediction values and the candidate items, and calculate a preference prediction value of each candidate item attribute in the candidate attribute set using the interaction prediction values;
a dialogue recommendation module 205, configured to input the calculated candidate item set and candidate attribute set into a policy network for reinforcement learning, and make dialogue recommendations to the user.
Optionally, the recommendation network module 201 may include:
a training sequence generation sub-module, configured to generate a training sequence of preset length from the historical interaction sequence;
an embedding layer sub-module, configured to input all items and the training sequence into the embedding layer and output an integrated embedding matrix;
a self-attention block sub-module, configured to input the integrated embedding matrix into the self-attention block for iterative feature learning, generating a learning matrix;
a prediction layer sub-module, configured to input the learning matrix into the prediction layer for matrix factorization and calculate initial item preference values;
a network optimization sub-module, configured to optimize the recommendation network model using a binary cross-entropy loss function until the output value of the binary cross-entropy loss function is minimal, and take the initial item preference values at that minimum as the item preference values.
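The binary cross-entropy objective used by the network optimization sub-module can be written down directly. A minimal NumPy sketch under the usual convention that positives are observed next items and negatives are sampled, with the model itself abstracted into its sigmoid scores:

```python
import numpy as np

def bce_loss(scores, labels, eps=1e-12):
    """Binary cross-entropy over predicted item-preference scores.

    scores -- sigmoid outputs in (0, 1), one per (position, item) pair
    labels -- 1 for the observed next item, 0 for a sampled negative item
    """
    scores = np.clip(scores, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(labels * np.log(scores) + (1 - labels) * np.log(1 - scores))
```

As expected, confident correct predictions give a smaller loss than uninformative ones, which is what the optimization drives toward.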
Optionally, the dialogue recommendation module 205 may include:
a reinforcement learning sub-module, configured to input the candidate item set and the candidate attribute set into the policy network for reinforcement learning, generating an action decision;
a dialogue recommendation sub-module, configured to make a dialogue recommendation to the user using the action decision.
Optionally, the dialogue recommendation sub-module may include:
a first sending unit, configured to, when the action decision is an item recommendation, send the candidate item with the largest interaction prediction value to the user terminal and receive feedback data;
a first processing unit, configured to exit the dialogue recommendation when the feedback data indicates that the candidate item is accepted;
a second processing unit, configured to, when the feedback data indicates that the candidate item is rejected, remove the candidate item from the candidate item set and, using the candidate item set after removal, perform the step of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item correlation values;
a second sending unit, configured to, when the action decision is an attribute query, send the item attribute with the largest preference prediction value to the user terminal and receive feedback data;
a third processing unit, configured to verify the items in the candidate item set using the feedback data, remove the items that fail verification, and finally, using the candidate item set after removal, perform the step of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item correlation values.
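The two branches handled by the sending and processing units can be sketched as one turn of the conversation loop. This is an illustrative simplification — `run_turn`, the `user` callable, and the score dictionaries are hypothetical names, and the attribute-feedback verification step is elided:

```python
def run_turn(action, candidates, item_scores, attr_scores, user):
    """One conversation turn driven by the policy network's action decision.

    action      -- "recommend" (item recommendation) or "ask" (attribute query)
    candidates  -- current candidate item set
    item_scores -- item -> interaction prediction value
    attr_scores -- attribute -> preference prediction value
    user        -- callable(kind, value) -> bool, simulating user feedback
    """
    if action == "recommend":
        best = max(candidates, key=item_scores.get)    # largest interaction prediction value
        if user("item", best):
            return "quit", candidates                  # accepted: exit the dialogue
        return "continue", candidates - {best}         # rejected: remove it and rescore
    best_attr = max(attr_scores, key=attr_scores.get)  # largest preference prediction value
    user("attribute", best_attr)  # returned feedback drives candidate verification (elided here)
    return "continue", candidates
```

Accepting the recommended item ends the dialogue; rejecting it shrinks the candidate set so the interaction prediction values can be recomputed on the next turn.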
Optionally, the dialogue recommendation module 205 may further include:
a dialogue round judgment sub-module, configured to determine whether the dialogue round corresponding to the action decision is greater than a preset threshold; if so, exit the dialogue recommendation; if not, perform the step of making a dialogue recommendation to the user using the action decision.
Optionally, the first calculation module 203 includes:
a removal operation sub-module, configured to remove from the candidate item set the preference items that do not have the preferred item attribute.
Optionally, the second calculation module 204 includes:
a candidate attribute set generation sub-module, configured to generate the candidate attribute set from the item attributes contained in a preset number of top candidate items, taken in descending order of interaction prediction value;
a second calculation sub-module, configured to calculate the information entropy of each candidate item attribute in the candidate attribute set using the interaction prediction values of the candidate items, and take the information entropy as the preference prediction value of the candidate item attribute.
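One common way to realize the entropy scoring described above is to normalize the candidate items' interaction prediction values into the probability mass carried by items having the attribute, and score the attribute by the binary entropy of that split. This is a sketch of that idea, not the patent's exact formula, and the function and argument names are assumptions:

```python
import math

def attribute_entropy(attr, candidates, item_scores, item_attrs):
    """Preference prediction value of an attribute as an information entropy term.

    Normalizes the interaction prediction values of the candidate items into a
    probability p that a preferred item carries the attribute, then returns the
    binary entropy -p*log(p) - (1-p)*log(1-p). An attribute that splits the
    candidates most evenly (p near 0.5) scores highest, i.e. asking about it
    is most informative.
    """
    total = sum(item_scores[v] for v in candidates)
    with_attr = sum(item_scores[v] for v in candidates if attr in item_attrs[v])
    p = with_attr / total
    if p == 0.0 or p == 1.0:
        return 0.0  # attribute carried by none or all candidates: asking gains nothing
    return -p * math.log(p) - (1 - p) * math.log(1 - p)
```

With two equally scored candidates of which exactly one carries the attribute, the entropy reaches its maximum of log 2; an attribute no candidate carries scores zero.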
Based on the above embodiments, please refer to FIG. 3, which is a structural block diagram of a dialogue recommendation system based on historical interaction sequences provided by an embodiment of the present invention. In this system, SeqCR (Sequential Conversation Recommender), the Policy Network Module implements the functions of the dialogue recommendation module 205 of the above embodiment, for example receiving feedback, updating preferences (Update preference), and sending candidate item and attribute updates (Update Candidate items and attributes) to the Sequential Module.
The Sequential Module implements the functions of the recommendation network module 201 of the above embodiment, comprising an Embedding Layer, a Self-attention Block, and a Prediction Layer, where the Self-attention Block consists of a Self-attention layer and a Feed Forward Network. The Scoring Module implements the functions of the candidate item set generation module 202, the first calculation module 203, and the second calculation module 204, namely Item Scoring (calculating interaction prediction values) and Attribute Scoring (calculating preference prediction values). User denotes the user, who receives queries sent by the policy network and provides feedback to it.
An embodiment of the present invention further provides an electronic device, comprising:
a memory for storing a computer program;
a processor configured to implement the steps of the dialogue recommendation method described above when executing the computer program.
Since the embodiments of the electronic device correspond to the embodiments of the dialogue recommendation method, for the former please refer to the description of the latter; details are not repeated here.
An embodiment of the present invention further provides a storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the dialogue recommendation method of any of the above embodiments are implemented.
Since the embodiments of the storage medium correspond to the embodiments of the dialogue recommendation method, for the former please refer to the description of the latter; details are not repeated here.
The various embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts that are the same or similar among the embodiments, reference may be made to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and for relevant details reference may be made to the description of the method.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the above description has generally set out the components and steps of each example in terms of their functionality. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functionality for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The dialogue recommendation method, apparatus, electronic device, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made to the present invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (10)

  1. A dialogue recommendation method, characterized by comprising:
    obtaining a historical interaction sequence between a user and items, and inputting the historical interaction sequence into a recommendation network model comprising an embedding layer, a self-attention block, and a prediction layer for training, generating item preference values, wherein the items contain item attributes;
    generating a candidate item set using the item preference values and the items that the user has not interacted with;
    when a preferred item attribute sent by the user is received, updating the candidate item set using the preferred item attribute, and calculating an interaction prediction value of each candidate item in the updated candidate item set using the item correlation values;
    generating a candidate attribute set using the interaction prediction values and the candidate items, and calculating a preference prediction value of each candidate item attribute in the candidate attribute set using the interaction prediction values;
    inputting the calculated candidate item set and candidate attribute set into a policy network for reinforcement learning, and making a dialogue recommendation to the user.
  2. The dialogue recommendation method according to claim 1, wherein inputting the historical interaction sequence into the recommendation network model comprising the embedding layer, the self-attention block, and the prediction layer for training, generating the item preference values, comprises:
    generating a training sequence of preset length from the historical interaction sequence;
    inputting all the items and the training sequence into the embedding layer, and outputting an integrated embedding matrix;
    inputting the integrated embedding matrix into the self-attention block for iterative feature learning, generating a learning matrix;
    inputting the learning matrix into the prediction layer for matrix factorization, and calculating initial item preference values;
    optimizing the recommendation network model using a binary cross-entropy loss function until the output value of the binary cross-entropy loss function is minimal, and taking the initial item preference values at that minimum as the item preference values.
  3. The dialogue recommendation method according to claim 1, wherein inputting the calculated candidate item set and candidate attribute set into the policy network for reinforcement learning and making a dialogue recommendation to the user comprises:
    inputting the candidate item set and the candidate attribute set into the policy network for the reinforcement learning, generating an action decision;
    making a dialogue recommendation to the user using the action decision.
  4. The dialogue recommendation method according to claim 3, wherein making a dialogue recommendation to the user using the action decision comprises:
    when the action decision is an item recommendation, sending the candidate item with the largest interaction prediction value to the user terminal, and receiving feedback data;
    when the feedback data indicates that the candidate item is accepted, exiting the dialogue recommendation;
    when the feedback data indicates that the candidate item is rejected, removing the candidate item from the candidate item set, and, using the candidate item set after removal, performing the step of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item correlation values;
    when the action decision is an attribute query, sending the candidate item attribute with the largest preference prediction value to the user terminal, and receiving the feedback data;
    verifying the items in the candidate item set using the feedback data, removing the items that fail the verification, and finally, using the candidate item set after removal, performing the step of calculating the interaction prediction value of each candidate item in the updated candidate item set using the item correlation values.
  5. The dialogue recommendation method according to claim 3, wherein before making a dialogue recommendation to the user using the action decision, the method further comprises:
    determining whether the dialogue round corresponding to the action decision is greater than a preset threshold;
    if so, exiting the dialogue recommendation;
    if not, performing the step of making a dialogue recommendation to the user using the action decision.
  6. The dialogue recommendation method according to claim 1, wherein updating the candidate item set using the preferred item attribute comprises:
    removing from the candidate item set the preference items that do not have the preferred item attribute.
  7. The dialogue recommendation method according to any one of claims 1 to 6, wherein generating the candidate attribute set using the interaction prediction values and the candidate items, and calculating the preference prediction value of each candidate item attribute in the candidate attribute set using the interaction prediction values, comprises:
    generating the candidate attribute set from the item attributes contained in a preset number of top candidate items, taken in descending order of the interaction prediction values;
    calculating the information entropy of each candidate item attribute in the candidate attribute set using the interaction prediction values of the candidate items, and taking the information entropy as the preference prediction value of the candidate item attribute.
  8. A dialogue recommendation apparatus, characterized by comprising:
    a recommendation network module, configured to obtain a historical interaction sequence between a user and items, and input the historical interaction sequence into a recommendation network model comprising an embedding layer, a self-attention block, and a prediction layer for training, generating item preference values, wherein the items contain item attributes;
    a candidate item set generation module, configured to generate a candidate item set using the item preference values and the items that the user has not interacted with;
    a first calculation module, configured to, when a preferred item attribute sent by the user is received, update the candidate item set using the preferred item attribute, and calculate an interaction prediction value of each candidate item in the updated candidate item set using the item correlation values;
    a second calculation module, configured to generate a candidate attribute set using the interaction prediction values and the candidate items, and calculate a preference prediction value of each candidate item attribute in the candidate attribute set using the interaction prediction values;
    a dialogue recommendation module, configured to input the calculated candidate item set and candidate attribute set into a policy network for reinforcement learning, and make a dialogue recommendation to the user.
  9. An electronic device, characterized by comprising:
    a memory for storing a computer program;
    a processor configured to implement the dialogue recommendation method according to any one of claims 1 to 7 when executing the computer program.
  10. A storage medium, characterized in that the storage medium stores computer-executable instructions which, when loaded and executed by a processor, implement the dialogue recommendation method according to any one of claims 1 to 7.
PCT/CN2021/122167 2021-03-23 2021-09-30 Conversation recommendation method and apparatus, electronic device, and storage medium WO2022198983A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110308759.6 2021-03-23
CN202110308759.6A CN112925892B (en) 2021-03-23 2021-03-23 Dialogue recommendation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022198983A1 true WO2022198983A1 (en) 2022-09-29

Family

ID=76175614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122167 WO2022198983A1 (en) 2021-03-23 2021-09-30 Conversation recommendation method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112925892B (en)
WO (1) WO2022198983A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925892B (en) * 2021-03-23 2023-08-15 苏州大学 Dialogue recommendation method and device, electronic equipment and storage medium
CN113487379B (en) * 2021-06-24 2023-01-13 上海淇馥信息技术有限公司 Product recommendation method and device based on conversation mode and electronic equipment
CN113468420B (en) * 2021-06-29 2024-04-05 杭州摸象大数据科技有限公司 Product recommendation method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004114154A1 (en) * 2003-06-23 2004-12-29 University College Dublin, National University Of Ireland, Dublin A retrieval system and method
CN110008409A (en) * 2019-04-12 2019-07-12 苏州市职业大学 Based on the sequence of recommendation method, device and equipment from attention mechanism
CN110390108A (en) * 2019-07-29 2019-10-29 中国工商银行股份有限公司 Task exchange method and system based on deeply study
CN111797321A (en) * 2020-07-07 2020-10-20 山东大学 Personalized knowledge recommendation method and system for different scenes
CN112925892A (en) * 2021-03-23 2021-06-08 苏州大学 Conversation recommendation method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11470165B2 (en) * 2018-02-13 2022-10-11 Ebay, Inc. System, method, and medium for generating physical product customization parameters based on multiple disparate sources of computing activity

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004114154A1 (en) * 2003-06-23 2004-12-29 University College Dublin, National University Of Ireland, Dublin A retrieval system and method
CN110008409A (en) * 2019-04-12 2019-07-12 苏州市职业大学 Based on the sequence of recommendation method, device and equipment from attention mechanism
CN110390108A (en) * 2019-07-29 2019-10-29 中国工商银行股份有限公司 Task exchange method and system based on deeply study
CN111797321A (en) * 2020-07-07 2020-10-20 山东大学 Personalized knowledge recommendation method and system for different scenes
CN112925892A (en) * 2021-03-23 2021-06-08 苏州大学 Conversation recommendation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112925892A (en) 2021-06-08
CN112925892B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
WO2022198983A1 (en) Conversation recommendation method and apparatus, electronic device, and storage medium
EP3711000B1 (en) Regularized neural network architecture search
CN109299396B (en) Convolutional neural network collaborative filtering recommendation method and system fusing attention model
CN110457589B (en) Vehicle recommendation method, device, equipment and storage medium
CN111506820B (en) Recommendation model, recommendation method, recommendation device, recommendation equipment and recommendation storage medium
CN113705811B (en) Model training method, device, computer program product and equipment
JP6819355B2 (en) Recommendation generation
US11423307B2 (en) Taxonomy construction via graph-based cross-domain knowledge transfer
CN108921342B (en) Logistics customer loss prediction method, medium and system
TW202001749A (en) Arbitrage identification method and device
WO2010048758A1 (en) Classification of a document according to a weighted search tree created by genetic algorithms
Wang et al. Learning to augment for casual user recommendation
CN115130542A (en) Model training method, text processing device and electronic equipment
CN111010595B (en) New program recommendation method and device
CN116401522A (en) Financial service dynamic recommendation method and device
CN113362852A (en) User attribute identification method and device
CN111445032A (en) Method and device for decision processing by using business decision model
CN116308551A (en) Content recommendation method and system based on digital financial AI platform
WO2023009766A1 (en) Evaluating output sequences using an auto-regressive language model neural network
US20220261683A1 (en) Constraint sampling reinforcement learning for recommendation systems
KR102612986B1 (en) Online recomending system, method and apparatus for updating recommender based on meta-leaining
Candan et al. Non stationary operator selection with island models
CN116776870B (en) Intention recognition method, device, computer equipment and medium
CN113779396B (en) Question recommending method and device, electronic equipment and storage medium
KR102618066B1 (en) Method, device and system for strengthening military security based on natural language process and image compare in soldier based community application

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21932592

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE