CN111062775B - Recommendation system recall method based on attention mechanism


Info

Publication number
CN111062775B
CN111062775B (application CN201911222216.1A)
Authority
CN
China
Prior art keywords
user
commodity
vector
layer
attention
Prior art date
Legal status
Active
Application number
CN201911222216.1A
Other languages
Chinese (zh)
Other versions
CN111062775A (en)
Inventor
郑子彬
李威琪
周晓聪
Current Assignee
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat-sen University
Priority to CN201911222216.1A
Publication of CN111062775A
Application granted
Publication of CN111062775B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G06Q 30/0601 Electronic shopping [e-shopping]
    • G06Q 30/0631 Item recommendations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a recommendation system recall method based on an attention mechanism, comprising the following steps: extracting the user features and commodity features from the training samples, converting the user features into user embedded vectors and the commodity features into commodity embedded vectors; inputting the user embedded vectors and commodity embedded vectors into an attention mechanism model for training, learning a weight for each feature through the attention networks in the model, and computing the weighted sum of the embedded vectors of all features according to those weights to obtain a user characterization vector and a commodity characterization vector; calculating the inner product of the user characterization vector and the commodity characterization vector to obtain the matching degree of the user's willingness to purchase the commodity for the training sample, establishing a cross-entropy loss function of this matching degree, and minimizing the cross-entropy loss function until the attention mechanism model converges; and inputting the sample to be tested into the converged attention mechanism model, obtaining the matching degree of the user's willingness to purchase the commodity for the sample to be tested, and selecting the commodities whose matching degree falls within a preset interval as the recall result to be recommended. The invention enhances generalization and greatly reduces the computation required for recall recommendation.

Description

Recommendation system recall method based on attention mechanism
Technical Field
The invention relates to the field of computer recommendation systems, in particular to a recommendation system recall method based on an attention mechanism.
Background
With the improvement of living standards, consumers face more and more choices: where goods once came from a handful of vendors, today they come from tens of thousands, and finding the goods we need often takes a great deal of time; even what we find is not necessarily the best fit for us. A recommendation system can help us find the relevant commodities among the mass of goods and recommend those that suit us best. Recommendation systems are now used very widely and are ubiquitous in daily life. When shopping online, a user wants the products he intends to buy to be recommended; when listening to music, he wants to hear songs that suit his own taste; and when searching, he wants to find the intended result. Fast and accurate prediction of user preference is therefore the primary goal of a recommendation system.
Recall, as one of the stages of a recommendation system, must select hundreds or tens of related commodities from a very large pool and feed them into a ranking model. Unlike ranking, which demands high precision, recall amounts to a coarse ranking: it does not need to be highly precise, but it must quickly pick out, from a huge set of candidates, the commodities related to the user's search.
The earliest recall in recommendation systems was based on collaborative filtering, but collaborative filtering models users and commodities only by their IDs, which is equivalent to using the ID as the sole feature; it therefore suffers from the cold-start problem and cannot effectively exploit important information such as the other attributes of users and commodities. For feature-based modeling, the simplest choice is the LR model, which is easy to implement but ignores the feature combination problem.
The FM model was once widely used because it considers the feature combination problem, learning a weight for each feature combination, and it made the idea of feature vectorization popular across deep learning models. However, the FM model only performs low-order crosses between pairs of features and cannot capture higher-order feature crosses.
In recent years, with the development of deep learning, many deep models have been applied to recommendation systems. Deep recommendation models add activation functions such as sigmoid and tanh to provide nonlinearity, and a multi-layer neural network performs implicit multi-order feature crossing. The DeepFM model was proposed to combine low-order and high-order feature crosses, with remarkable effect. Because the high-order crosses of a neural network are implicit and not very interpretable, Google proposed the DCN (Deep & Cross Network) model, which combines explicit and implicit feature crossing. Since these crosses all operate at the element level, Microsoft further proposed the xDeepFM model, directing the study to feature crossing in the vector dimension. Feature combination is thus an important part of such models, but these models are complex; they are generally applied in the ranking stage and rarely in the recall stage.
The attention mechanism, first applied in natural language processing, can selectively extract the important information in long sentences and focus attention on it while ignoring the unimportant information, and the distribution of attention can differ from sample to sample. This property suits most fields, so the mechanism is widely used; however, there is still little research applying attention mechanisms to recommendation system models.
Disclosure of Invention
The main purpose of the invention is to provide a recommendation system recall method based on an attention mechanism, aiming at overcoming the above problems.
In order to achieve the above object, the present invention provides a recommendation system recall method based on an attention mechanism, comprising the following steps:
S10, extracting the user features and commodity features in the training samples, converting the user features into user embedded vectors, and converting the commodity features into commodity embedded vectors;
S20, inputting the user embedded vectors and the commodity embedded vectors into an attention mechanism model for training, learning the weight of each feature through the attention networks in the model, and computing the weighted sum of the embedded vectors of all features according to the weights to obtain a user characterization vector and a commodity characterization vector; calculating the inner product of the user characterization vector and the commodity characterization vector to obtain the matching degree of the user's willingness to purchase the commodity for the training sample, establishing a cross-entropy loss function of this matching degree, and minimizing the cross-entropy loss function until the attention mechanism model converges;
S30, inputting the sample to be tested into the converged attention mechanism model, obtaining the matching degree of the user's willingness to purchase the commodity for the sample to be tested, and selecting the commodities whose matching degree falls within a preset interval as the recall result to be recommended.
Preferably, the attention mechanism model is a bidirectional attention mechanism model comprising a multi-layer user attention network and a multi-layer commodity attention network; each layer of the user attention network comprises a two-layer feedforward neural network FNN and a normalization layer Softmax, each layer of the commodity attention network likewise comprises a two-layer feedforward neural network FNN and a normalization layer Softmax, and the user attention network and the commodity attention network are each in a layer-by-layer recursive relationship.
Preferably, the multi-layer user attention network in S20 comprises K layers of user attention network, in which the user characterization vector u^(k) is given by the formulas:

u^(k) = U_Attention(E_u, m^(k-1))

m^(k) = m^(k-1) + u^(k)

wherein the superscript (k) or (k-1) on a variable denotes the k-th or (k-1)-th layer of the attention network, and U_Attention denotes the user attention network; every layer has the same structure, whose specific operation is given by the formulas below. The input of the network is the set of embedded vectors of the user features E_u = {u_1, …, u_T} together with the output m^(k-1) of the previous layer; the output of the network is the user characterization vector u^(k) of this layer, and m^(k) is a storage vector holding the accumulated sum of the characterization vectors obtained by the previous k layers. After the input is obtained, the attention network first passes through the two-layer feedforward neural network FNN and the softmax normalization layer to obtain the attention weights a_t^(k), and this weight vector is used to compute the weighted average of the T user feature vectors, yielding the characterization vector u^(k) of this layer.

At the k-th layer, for t = 1, 2, 3, …, T, the weight a_t^(k) of the user's t-th embedded vector at this layer is obtained first:

h_t^(k) = tanh( (W_1^(k) u_t) ⊙ (W_2^(k) m^(k-1)) )

ĥ_t^(k) = w_3^(k) h_t^(k)

a_t^(k) = exp(ĥ_t^(k)) / Σ_{j=1}^{T} exp(ĥ_j^(k))

wherein W_1^(k), W_2^(k) and w_3^(k) are all network parameter matrices: W_1^(k) is the parameter matrix by which u_t, the embedded vector of the user's t-th feature, enters the neural network in the k-th layer user attention network; W_2^(k) is the parameter matrix by which m^(k-1), the storage vector output by the layer above, enters the neural network; and w_3^(k) is the parameter matrix acting on the hidden-layer variable. h_t^(k) is the hidden-layer vector obtained from the user's t-th feature, tanh is the activation function, and ⊙ is the element-wise vector multiplication, i.e., two vectors of the same length are multiplied element by element at the same positions to obtain a new vector. Multiplying h_t^(k) by w_3^(k), a matrix with a single row, yields a scalar ĥ_t^(k), which is then converted through softmax into the user's k-th-layer characterization-vector weight a_t^(k); e is the natural constant;

then, according to a_t^(k), the weighted sum of the user's embedded vectors is calculated to obtain the characterization vector u^(k) of the user's k-th layer:

u^(k) = Σ_{t=1}^{T} a_t^(k) u_t
Preferably, the multi-layer commodity attention network in S20 comprises K layers of commodity attention network, in which the commodity characterization vector v^(k) is given by the formulas:

v^(k) = V_Attention(E_v, m_v^(k-1))

m_v^(k) = m_v^(k-1) + v^(k)

wherein V_Attention denotes the commodity attention network, whose structure is the same as that of the user attention network: the weight a_n^(k) of each commodity's n-th embedded vector is obtained first, and the weighted sum of all commodity embedded vectors is then computed according to these weights to obtain the commodity characterization vector v^(k) of the k-th layer.

At the k-th layer, for n = 1, 2, 3, …, N, the weight a_n^(k) of the commodity's n-th embedded vector at this layer is obtained first:

h_n^(k) = tanh( (W_4^(k) v_n) ⊙ (W_5^(k) m_v^(k-1)) )

ĥ_n^(k) = w_6^(k) h_n^(k)

a_n^(k) = exp(ĥ_n^(k)) / Σ_{j=1}^{N} exp(ĥ_j^(k))

wherein W_4^(k), W_5^(k) and w_6^(k) are parameter matrices of the commodity attention network: W_4^(k) is the parameter matrix by which v_n, the embedded vector of the commodity's n-th feature, enters the neural network in the k-th layer commodity attention network; W_5^(k) is the parameter matrix by which m_v^(k-1), the storage vector output by the layer above, enters the neural network; and w_6^(k) is the parameter matrix acting on the hidden-layer variable. h_n^(k) is the hidden-layer vector obtained from the commodity's n-th feature; multiplying it by the single-row matrix w_6^(k) yields a scalar ĥ_n^(k), which is then converted through softmax into the commodity's k-th-layer characterization-vector weight a_n^(k);

then, according to a_n^(k), the weighted sum of the commodity embedded vectors is calculated to obtain the characterization vector v^(k) of the commodity's k-th layer:

v^(k) = Σ_{n=1}^{N} a_n^(k) v_n
Preferably, in step S20, calculating the inner product of the user characterization vector and the commodity characterization vector to obtain the matching degree of the user's willingness to purchase the commodity for the training sample specifically comprises:

the multi-layer user attention network splices the characterization vectors u^(k) of the user attention networks of all layers to obtain the final user characterization vector z_u = [u^(0); …; u^(K)];

the multi-layer commodity attention network splices the characterization vectors v^(k) of the commodity attention networks of all layers to obtain the final commodity characterization vector z_v = [v^(0); …; v^(K)];

the inner product of the final user characterization vector z_u and the final commodity characterization vector z_v is calculated to obtain the final matching degree of the user's willingness to purchase the commodity.
Preferably, the cross-entropy loss function of the matching degree of the user's willingness to purchase the commodity is specifically:

L = -(1/m) Σ_{i=1}^{m} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

wherein m is the number of samples, y_i is the sample label, and ŷ_i is the predicted matching probability of sample i, i.e., the matching degree passed through a sigmoid. A sample with click behavior is treated as a positive sample and labeled 1; a sample without click behavior is treated as a negative sample and labeled 0. For each user, the pair <u,v+> formed with each clicked commodity is regarded as a positive sample pair, and the pair <u,v-> formed with an un-clicked commodity is regarded as a negative sample pair. Model training proceeds by minimizing the loss function L, i.e., continually narrowing the distance between positive sample pairs and widening the distance between negative sample pairs.
Preferably, S10 specifically comprises:

dividing the training samples into user data and commodity data, processing the user data into the set of sparse user vectors E_u^sparse = {x_u,1, x_u,2, …, x_u,T}, where T is the total number of user features, t indexes the current user feature, and the subscript u denotes the user; and processing the commodity data into the set of sparse commodity vectors E_v^sparse = {x_v,1, x_v,2, …, x_v,N}, where N is the total number of commodity features, n indexes the current commodity feature, and the subscript v denotes the commodity;

classifying the training sample data into categorical features and continuous features according to the attribute of the data. A categorical feature is encoded as a one-hot vector x_i: the vector length of x_i is the sum of the numbers of values of all features of the current training sample, the position of the categorical value takes the value 1 while all other positions are 0, and a feature dictionary is established for the position numbers of the feature values in the vector. A continuous feature uses the same vector length, stores the value of the continuous feature as the feature value at its position, with all other positions 0, and is thereby likewise encoded into a sparse vector.
Preferably, the attention mechanism model is a representation learning model.
Preferably, the vector lengths of the user attention network and the commodity attention network are equal.
Preferably, the training sample data is collected in the manner of click-through-rate (CTR) estimation.
The invention provides an attention-based recall method for recommendation systems. The basic process is: convert the features of the user and of the commodity into embedded vectors; find the important feature combinations through the attention mechanism network; compute the weighted sum of the embedded vectors of all features to obtain the respective characterization vectors of the user and the commodity in a shared space; and finally calculate the matching degree from the distance between the user and the commodity in that vector space. The highlights of the invention are:
First, the proposed method is a deep learning model, and it operates at the feature level: the model first turns each feature vector into a low-dimensional dense embedded vector, and the model output is equivalent to a weighted sum of all feature vectors. On the one hand, the model can automatically learn feature crosses without manual feature engineering. On the other hand, most existing recall models are tree models or simple discriminative models, because deep models are considered too complex: mainstream deep models such as DCN and DeepFM must compute cross combinations over all elements of every feature vector, which is computationally expensive, whereas at the feature level the number of combinations is small and the computation is light, so applying this deep learning model to recall is highly feasible.
Second, the present invention innovatively applies an attention mechanism to feature combination: for each sample, the attention mechanism finds the important feature combinations and ignores the many unimportant ones. The model takes the embedded vectors of the features as input, learns a set of attention weights through the neural network, one weight per feature, and finally computes the weighted sum of all features according to these weights to obtain the final vector; giving each feature a different weight realizes combinations of the feature vectors to different degrees. The attention mechanism model thus combines deep learning, the attention mechanism and feature engineering, which gives it great advantages.
Finally, the model in the invention is a representation learning model with strong generalization. It learns the characterization vectors of users and commodities in the same space, so it can serve various downstream tasks. Here the downstream task is recall, and an end-to-end model can be trained. Moreover, the model splits into two parts, a user model and a commodity model, which are learned simultaneously; it is thus a bidirectional attention mechanism model. In the prediction stage, the characterization vectors of all commodities can be computed independently and stored; then the characterization vector of each user is computed, and the top M commodities closest in the vector space are retrieved from the stored commodity vectors. Independent prediction avoids repeatedly computing the commodity characterization vectors and greatly reduces computation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of the overall structure of an embodiment of the bidirectional attention model when K=2;
Fig. 2 is a diagram of the user attention network structure of the k-th layer when T=3 according to the present invention;
Fig. 3 is a diagram of the commodity attention network structure of the k-th layer when N=3 according to the present invention;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and rear … …) are included in the embodiments of the present invention, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indications are correspondingly changed.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but it is necessary to base that the technical solutions can be realized by those skilled in the art, and when the technical solutions are contradictory or cannot be realized, the combination of the technical solutions should be considered to be absent and not within the scope of protection claimed in the present invention.
Referring to Figs. 1-3, the recommendation system recall method based on the attention mechanism provided by the invention comprises the following steps:
S10, extracting the user features and commodity features in the training samples, converting the user features into user embedded vectors, and converting the commodity features into commodity embedded vectors;
S20, inputting the user embedded vectors and the commodity embedded vectors into an attention mechanism model for training, learning the weight of each feature through the attention networks in the model, and computing the weighted sum of the embedded vectors of all features according to the weights to obtain a user characterization vector and a commodity characterization vector; calculating the inner product of the user characterization vector and the commodity characterization vector to obtain the matching degree of the user's willingness to purchase the commodity for the training sample, establishing a cross-entropy loss function of this matching degree, and minimizing the cross-entropy loss function until the attention mechanism model converges;
S30, inputting the sample to be tested into the converged attention mechanism model, obtaining the matching degree of the user's willingness to purchase the commodity for the sample to be tested, and selecting the commodities whose matching degree falls within a preset interval as the recall result to be recommended.
Preferably, the attention mechanism model is a bidirectional attention mechanism model comprising a multi-layer user attention network and a multi-layer commodity attention network; each layer of the user attention network comprises a two-layer feedforward neural network FNN and a normalization layer Softmax, each layer of the commodity attention network likewise comprises a two-layer feedforward neural network FNN and a normalization layer Softmax, and the user attention network and the commodity attention network are each in a layer-by-layer recursive relationship.
Preferably, the multi-layer user attention network in S20 comprises K layers of user attention network, in which the user characterization vector u^(k) is given by the formulas:

u^(k) = U_Attention(E_u, m^(k-1))

m^(k) = m^(k-1) + u^(k)

wherein the superscript (k) or (k-1) on a variable denotes the k-th or (k-1)-th layer of the attention network, and U_Attention denotes the user attention network; every layer has the same structure, whose specific operation is given by the formulas below. The input of the network is the set of embedded vectors of the user features E_u = {u_1, …, u_T} together with the output m^(k-1) of the previous layer; the output of the network is the user characterization vector u^(k) of this layer, and m^(k) is a storage vector holding the accumulated sum of the characterization vectors obtained by the previous k layers. After the input is obtained, the attention network first passes through the two-layer feedforward neural network FNN and the softmax normalization layer to obtain the attention weights a_t^(k), and this weight vector is used to compute the weighted average of the T user feature vectors, yielding the characterization vector u^(k) of this layer.

At the k-th layer, for t = 1, 2, 3, …, T, the weight a_t^(k) of the user's t-th embedded vector at this layer is obtained first:

h_t^(k) = tanh( (W_1^(k) u_t) ⊙ (W_2^(k) m^(k-1)) )

ĥ_t^(k) = w_3^(k) h_t^(k)

a_t^(k) = exp(ĥ_t^(k)) / Σ_{j=1}^{T} exp(ĥ_j^(k))

wherein W_1^(k), W_2^(k) and w_3^(k) are all network parameter matrices: W_1^(k) is the parameter matrix by which u_t, the embedded vector of the user's t-th feature, enters the neural network in the k-th layer user attention network; W_2^(k) is the parameter matrix by which m^(k-1), the storage vector output by the layer above, enters the neural network; and w_3^(k) is the parameter matrix acting on the hidden-layer variable. h_t^(k) is the hidden-layer vector obtained from the user's t-th feature, tanh is the activation function, and ⊙ is the element-wise vector multiplication, i.e., two vectors of the same length are multiplied element by element at the same positions to obtain a new vector. Multiplying h_t^(k) by w_3^(k), a matrix with a single row, yields a scalar ĥ_t^(k), which is then converted through softmax into the user's k-th-layer characterization-vector weight a_t^(k); e is the natural constant;

then, according to a_t^(k), the weighted sum of the user's embedded vectors is calculated to obtain the characterization vector u^(k) of the user's k-th layer:

u^(k) = Σ_{t=1}^{T} a_t^(k) u_t
Preferably, the multi-layer commodity attention network in S20 comprises K layers of commodity attention network, in which the commodity characterization vector v^(k) is given by the formulas:

v^(k) = V_Attention(E_v, m_v^(k-1))

m_v^(k) = m_v^(k-1) + v^(k)

wherein V_Attention denotes the commodity attention network, whose structure is the same as that of the user attention network: the weight a_n^(k) of each commodity's n-th embedded vector is obtained first, and the weighted sum of all commodity embedded vectors is then computed according to these weights to obtain the commodity characterization vector v^(k) of the k-th layer.

At the k-th layer, for n = 1, 2, 3, …, N, the weight a_n^(k) of the commodity's n-th embedded vector at this layer is obtained first:

h_n^(k) = tanh( (W_4^(k) v_n) ⊙ (W_5^(k) m_v^(k-1)) )

ĥ_n^(k) = w_6^(k) h_n^(k)

a_n^(k) = exp(ĥ_n^(k)) / Σ_{j=1}^{N} exp(ĥ_j^(k))

wherein W_4^(k), W_5^(k) and w_6^(k) are parameter matrices of the commodity attention network: W_4^(k) is the parameter matrix by which v_n, the embedded vector of the commodity's n-th feature, enters the neural network in the k-th layer commodity attention network; W_5^(k) is the parameter matrix by which m_v^(k-1), the storage vector output by the layer above, enters the neural network; and w_6^(k) is the parameter matrix acting on the hidden-layer variable. h_n^(k) is the hidden-layer vector obtained from the commodity's n-th feature; multiplying it by the single-row matrix w_6^(k) yields a scalar ĥ_n^(k), which is then converted through softmax into the commodity's k-th-layer characterization-vector weight a_n^(k);

then, according to a_n^(k), the weighted sum of the commodity embedded vectors is calculated to obtain the characterization vector v^(k) of the commodity's k-th layer:

v^(k) = Σ_{n=1}^{N} a_n^(k) v_n
Preferably, in step S20, calculating the inner product of the user characterization vector and the commodity characterization vector to obtain the matching degree of the user's willingness to purchase the commodity for the training sample specifically comprises:

the multi-layer user attention network splices the characterization vectors u^(k) of the user attention networks of all layers to obtain the final user characterization vector z_u = [u^(0); …; u^(K)];

the multi-layer commodity attention network splices the characterization vectors v^(k) of the commodity attention networks of all layers to obtain the final commodity characterization vector z_v = [v^(0); …; v^(K)];

the inner product of the final user characterization vector z_u and the final commodity characterization vector z_v is calculated to obtain the final matching degree of the user's willingness to purchase the commodity.
Preferably, the cross-entropy loss function of the matching degree of the user's willingness to purchase the commodity is specifically:

L = -(1/m) Σ_{i=1}^{m} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

wherein m is the number of samples, y_i is the sample label, and ŷ_i is the predicted matching probability of sample i, i.e., the matching degree passed through a sigmoid. A sample with click behavior is treated as a positive sample and labeled 1; a sample without click behavior is treated as a negative sample and labeled 0. For each user, the pair <u,v+> formed with each clicked commodity is regarded as a positive sample pair, and the pair <u,v-> formed with an un-clicked commodity is regarded as a negative sample pair. Model training proceeds by minimizing the loss function L, i.e., continually narrowing the distance between positive sample pairs and widening the distance between negative sample pairs.
Preferably, S10 specifically comprises:

dividing the training samples into user data and commodity data, processing the user data into the set of sparse user vectors E_u^sparse = {x_u,1, x_u,2, …, x_u,T}, where T is the total number of user features, t indexes the current user feature, and the subscript u denotes the user; and processing the commodity data into the set of sparse commodity vectors E_v^sparse = {x_v,1, x_v,2, …, x_v,N}, where N is the total number of commodity features, n indexes the current commodity feature, and the subscript v denotes the commodity;

classifying the training sample data into categorical features and continuous features according to the attribute of the data. A categorical feature is encoded as a one-hot vector x_i: the vector length of x_i is the sum of the numbers of values of all features of the current training sample, the position of the categorical value takes the value 1 while all other positions are 0, and a feature dictionary is established for the position numbers of the feature values in the vector. A continuous feature uses the same vector length, stores the value of the continuous feature as the feature value at its position, with all other positions 0, and is thereby likewise encoded into a sparse vector.
Preferably, the attention mechanism model is a representation learning model.
Preferably, the vector lengths of the user attention network and the commodity attention network are equal.
Preferably, the training sample data is collected in the manner of click-through-rate (CTR) estimation.
Actual operation example:
the data of the CTR model is collected in a similar click rate estimation mode, the characteristics of each sample can be divided into two parts, one part is the characteristics of a user, such as gender, age and the like, the other part is the characteristics of goods, such as category, price and the like, each sample corresponds to a label, and the value of the label is 1 or 0, so that whether the user purchases the data (whether the user clicks or stores the data as the label in actual conditions) is indicated. I.e. each sample represents the purchase of a certain commodity by a certain user. The problem to be solved is a classification problem, a classification model is trained by the samples, the model is output to judge whether the user purchases the commodity, the model outputs a probability value of 0 to 1, the probability value represents the likelihood of the user purchasing the commodity, and the probability value is larger the probability value represents the likelihood of purchasing.
In the prediction stage, to recall M commodities from all commodities for a given user, the user's features are combined with the features of every commodity to form I samples, where I is the number of commodities. The I samples are input into the model to obtain I probability values, representing the probability that the user purchases each commodity. The probabilities are sorted, and the M commodities with the largest values, i.e., the M commodities the user is most likely to purchase, are recommended to the user.
Let us take the 4 samples in Table 1 as an example:

Sample | User | Gender | Age | Commodity | Category | Price | Label
1 | Zhang San | Male | 9 | Pencil | Stationery | 2 | 1
2 | Li Si | Male | 38 | Trousers | Clothing | 56 | 0
3 | Wang Wu | Female | 12 | Facial mask | Cosmetics | 35 | 1
4 | Zhao Liu | Female | 27 | Basketball | Sporting goods | 67 | 0

Table 1
In actual data preprocessing, we discard the "user" column and the "commodity" column.
The scheme comprises the following steps:
1) Embedding layer: convert the input sparse feature data of users and commodities into low-dimensional dense embedded vector representations, respectively;
2) Attention mechanism layer: take the embedded vectors of all features as input, learn the weight of each feature through the attention network, and compute the weighted sum of the embedded feature vectors according to these weights to obtain the respective characterization vectors of the user and the commodity;
3) Output layer: obtain the matching degree of the user and the commodity by calculating the inner product of their characterization vectors.
The specific operation of each step is described in detail below:
1) Embedding layer
At this layer, we turn the input user and commodity feature data into embedded vector representations, respectively. The features of the user and of the commodity are input separately, as can be seen from the model structure diagram, so they also need to be processed separately.
We first clarify the concepts of feature number and vocabulary. Take "gender" as an example of a feature: the number of values of this feature is 2, namely "gender=male" and "gender=female". The feature number is the number of features, and the vocabulary is the sum of the numbers of values of all features.
First, we need to process the data into sparse vectors E_u^sparse = {x_u,1, x_u,2, x_u,3, …, x_u,T} and E_v^sparse = {x_v,1, x_v,2, x_v,3, …, x_v,N}, where E_u^sparse is the set of sparse vectors of the user features, T is the number of user features, and the subscript u denotes the user; E_v^sparse is the set of sparse vectors of the commodity features, N is the number of commodity features, and the subscript v denotes the commodity.

The data generally comprise continuous-value features and categorical features. A categorical feature such as "gender" is typically encoded as a one-hot vector x_i; e.g., "gender=male" is coded as "[0, …, 0, 1, 0, …, 0]", with the vector length equal to the vocabulary size. A continuous-value feature such as "age=10" can be viewed as a kind of categorical feature and is likewise encoded as a sparse vector, e.g., "[0, …, 0, 10, 0, …, 0]". Each of these sparse vectors has a value at exactly one position (specifically, the value is 1 for a categorical feature, while a continuous feature keeps its own value) and 0 elsewhere. A feature dictionary needs to be built for the specific position numbers so that each position represents one feature.
Taking the samples in Tables 2 and 3 below as an example, the feature dictionaries of users and commodities are first established, and a position number is assigned to each feature:

Feature | Position number
Gender=male | 0
Gender=female | 1
Age | 2

Table 2

Feature | Position number
Category=stationery | 0
Category=clothing | 1
Category=cosmetics | 2
Category=sporting goods | 3
Price | 4

Table 3
Taking the user features as an example, the user feature dictionary contains 3 entries, namely "gender=male", "gender=female" and "age", so the vocabulary, i.e., the dictionary length, is 3, and the resulting sparse vectors also have length 3:

"gender=male": position number 0, sparse vector [1,0,0]
"gender=female": position number 1, sparse vector [0,1,0]
"age=10": position number 2, sparse vector [0,0,10]

Taking the commodity features as an example, the commodity feature dictionary contains 5 entries, namely "category=stationery", "category=clothing", "category=cosmetics", "category=sporting goods" and "price", so the vocabulary, i.e., the dictionary length, is 5, and the resulting sparse vectors also have length 5:

"category=stationery": position number 0, sparse vector [1,0,0,0,0]
"category=clothing": position number 1, sparse vector [0,1,0,0,0]
"category=cosmetics": position number 2, sparse vector [0,0,1,0,0]
"category=sporting goods": position number 3, sparse vector [0,0,0,1,0]
"price=2": position number 4, sparse vector [0,0,0,0,2]
Then we can obtain a sparse vector for each feature of sample 1:

"gender=male": [1,0,0];
"age=9": [0,0,9];
"category=stationery": [1,0,0,0,0];
"price=2": [0,0,0,0,2].

That is, E_u^sparse = {[1,0,0], [0,0,9]} and E_v^sparse = {[1,0,0,0,0], [0,0,0,0,2]}, where x_u,1 = [1,0,0] is the sparse vector of the user's 1st feature, "gender=male", and so on for the other three sparse vectors. A common practice is to integrate the sparse vectors of all features of a sample into a single vector; for example, the user features of sample 1 can be integrated into one length-3 vector [1, 0, 9]. But a vector encoded this way is high-dimensional and sparse, and if the vocabulary is large, as with some ID-class features, it cannot be trained effectively when fed directly into a neural network.
So, to reduce the dimensionality, we use another widely adopted method: transform these long sparse feature vectors into low-dimensional dense vectors (i.e., embedded vectors) by multiplying each sparse vector with an embedding matrix. Since a sparse vector has a value at only one position, the multiplication is equivalent to selecting one column of the embedding matrix and scaling it by that value; and since the value of a categorical feature is 1, each column of the embedding matrix can serve directly as the embedded vector of one feature, except that a continuous feature additionally multiplies its column by a number:

u_t = W_embed,u x_u,t

v_n = W_embed,v x_v,n

wherein u_t denotes the embedded vector of the user's t-th feature (the subscript u denoting the user) and v_n is the embedded vector of the commodity's n-th feature (the subscript v denoting the commodity); W_embed,u ∈ R^(d×L_u) is the user embedding matrix, W_embed,v ∈ R^(d×L_v) is the commodity embedding matrix, d is the embedded-vector length, and L_u and L_v are the vocabulary sizes of the user features and the commodity features, respectively. Because d << L_u and d << L_v (<< meaning far smaller), the original L_u- or L_v-dimensional vector is converted into a d-dimensional vector, achieving the purpose of shortening the vector. Both embedding matrices are parameters that the embedding layer needs to learn, optimized along with the other parameters of the network.

Finally, we obtain the set of embedded vectors of the user features and the set of embedded vectors of the commodity features, respectively:

E_u = {u_1, u_2, …, u_T}

E_v = {v_1, v_2, …, v_N}

wherein u_t is the embedded vector of the user's t-th feature, T is the number of user features, v_n is the embedded vector of the commodity's n-th feature, and N is the number of commodity features.
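The dimensionality reduction by embedding-matrix multiplication can be sketched in the same vein; this is a sketch under the assumption of random parameters (in training, the embedding matrices are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedded-vector length (illustrative)
W_embed_u = rng.normal(size=(d, 3))          # user embedding matrix, vocabulary size 3
W_embed_v = rng.normal(size=(d, 5))          # commodity embedding matrix, vocabulary size 5

# Sparse vectors of sample 1 from the worked example above
E_u_sparse = [np.array([1., 0., 0.]), np.array([0., 0., 9.])]
E_v_sparse = [np.array([1., 0., 0., 0., 0.]), np.array([0., 0., 0., 0., 2.])]

# u_t = W_embed,u x_{u,t}: the one-hot vector selects one column of the
# matrix; a continuous feature additionally scales that column by its value.
E_u = np.stack([W_embed_u @ x for x in E_u_sparse])   # shape (T, d) = (2, 4)
E_v = np.stack([W_embed_v @ x for x in E_v_sparse])   # shape (N, d) = (2, 4)
```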
2) Attention mechanism layer
The attention mechanism can focus on the important information within a multi-layer attention network at every step. At the feature level of our model, each layer of the network represents a cross-combination between features, i.e., the attention mechanism can find important feature combinations and increase their weights. On the other hand, since the mechanism comprises multiple attention layers, higher-order feature crosses are derived layer by layer, so important high-order feature combinations can be extracted. By screening for the core feature combinations, the attention mechanism also reduces the amount of information to be processed.
Our approach attends to the features of the user and of the commodity simultaneously through multi-layer attention networks, both sides using the same network structure to extract important feature combinations. In this section, we explain the attention mechanism used in each attention layer; these layers ultimately make up the whole model. For simplicity, we omit the bias term b in the following formulas.
As can be seen from the model structure diagram, the attention mechanism model is divided into a left part and a right part: the user attention mechanism and the commodity attention mechanism. The user attention mechanism takes the embedded vectors of the user features as input and comprises multiple layers of user attention networks; the commodity attention mechanism takes the embedded vectors of the commodity features as input and comprises multiple layers of commodity attention networks. The attention networks are in a layer-by-layer recursive relationship: the input of the current layer is the output of the previous layer, and the output of the current layer serves as the input of the next layer. The number of layers can be chosen according to the characteristics of the data, keeping the user and commodity attention mechanisms at the same depth. Here, a two-layer attention network is used as an example to make up the attention mechanism.
[User attention mechanism]

The user attention mechanism aims to find important feature combinations among the features of the user, and is made up of K user attention networks. In the k-th layer attention network, the user characterization vector u^(k) is given by the formulas:

u^(k) = U_Attention(E_u, m^(k-1))

m^(k) = m^(k-1) + u^(k)

wherein the superscript (k) or (k-1) on a variable denotes the k-th or (k-1)-th layer of the attention network, and U_Attention denotes the user attention network; every layer has the same structure, whose specific operation is given by the following formulas. The input of the network is the set of embedded vectors of the user features E_u = {u_1, …, u_T} together with the output m^(k-1) of the previous layer; the output of the network is the user characterization vector u^(k) of this layer, and m^(k) is a storage vector that holds the accumulated sum of the characterization vectors obtained from the previous k layers. After the input is obtained, the attention network is normalized through a two-layer feedforward neural network (FNN) and a softmax layer to obtain the attention weights a_t^(k), and this weight vector is used to compute the weighted average of the T user feature vectors, yielding the characterization vector u^(k) of this layer.

At the k-th layer, for t = 1, 2, 3, …, T, the weight a_t^(k) of the user's t-th embedded vector at this layer is obtained first:

h_t^(k) = tanh( (W_1^(k) u_t) ⊙ (W_2^(k) m^(k-1)) )

ĥ_t^(k) = w_3^(k) h_t^(k)

a_t^(k) = exp(ĥ_t^(k)) / Σ_{j=1}^{T} exp(ĥ_j^(k))

wherein W_1^(k), W_2^(k) and w_3^(k) are all network parameter matrices: W_1^(k) is the parameter matrix by which u_t, the embedded vector of the user's t-th feature, enters the neural network in the k-th layer user attention network; W_2^(k) is the parameter matrix by which m^(k-1), the storage vector output by the layer above, enters the neural network; and w_3^(k) is the parameter matrix acting on the hidden-layer variable. h_t^(k) is the hidden-layer vector obtained from the user's t-th feature, tanh is the activation function, and ⊙ is the element-wise vector multiplication, i.e., two vectors of the same length are multiplied element by element at the same positions to obtain a new vector. Multiplying h_t^(k) by w_3^(k), a matrix with a single row, yields a scalar ĥ_t^(k), which is then converted through softmax into the user's k-th-layer characterization-vector weight a_t^(k); e is the natural constant.

Then, the weighted sum of the user's embedded vectors is calculated according to the obtained weights to obtain the characterization vector u^(k) of the user's k-th layer:

u^(k) = Σ_{t=1}^{T} a_t^(k) u_t
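As a concrete reference, one layer of this attention network can be sketched in numpy as follows; the parameter names W1, W2 and w3 are our own labels for the three weight matrices, and random values stand in for learned parameters:

```python
import numpy as np

def attention_layer(E, m_prev, W1, W2, w3):
    """One attention layer: weights via a two-layer FNN plus softmax,
    then a weighted sum of the embedded feature vectors.

    E      : (T, d) embedded feature vectors
    m_prev : (d,)   storage vector m^(k-1) from the previous layer
    W1, W2 : (d, d) parameter matrices for the features and the storage vector
    w3     : (d,)   single-row matrix mapping each hidden vector to a scalar
    """
    h = np.tanh((E @ W1.T) * (W2 @ m_prev))      # h_t = tanh((W1 u_t) ⊙ (W2 m^(k-1)))
    scores = h @ w3                              # one scalar per feature
    a = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights a_t
    return a @ E                                 # u^(k): weighted sum, shape (d,)

rng = np.random.default_rng(1)
T, d = 3, 4
E_u = rng.normal(size=(T, d))
m0 = E_u.mean(axis=0)                            # u^(0) = mean of the embeddings
u1 = attention_layer(E_u, m0,
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                     rng.normal(size=d))
```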
[Commodity attention mechanism]

The commodity attention mechanism aims to find important feature combinations among the features of the commodity; the whole network is composed of K commodity attention networks. In the k-th layer attention network, the characterization vector of the commodity is obtained by the formulas:

v^(k) = V_Attention(E_v, m_v^(k-1))

m_v^(k) = m_v^(k-1) + v^(k)

wherein V_Attention denotes the commodity attention network, whose structure is the same as that of the user attention network. The weight a_n^(k) of each commodity's n-th embedded vector is obtained first, and the weighted sum of all commodity embedded vectors is then computed according to these weights to obtain the commodity characterization vector v^(k) of the k-th layer.

At the k-th layer, for n = 1, 2, 3, …, N, the weight a_n^(k) of the commodity's n-th embedded vector at this layer is obtained first:

h_n^(k) = tanh( (W_4^(k) v_n) ⊙ (W_5^(k) m_v^(k-1)) )

ĥ_n^(k) = w_6^(k) h_n^(k)

a_n^(k) = exp(ĥ_n^(k)) / Σ_{j=1}^{N} exp(ĥ_j^(k))

wherein W_4^(k), W_5^(k) and w_6^(k) are parameter matrices of the commodity attention network (learned separately from those of the user network): W_4^(k) is the parameter matrix by which v_n, the embedded vector of the commodity's n-th feature, enters the neural network in the k-th layer commodity attention network; W_5^(k) is the parameter matrix by which m_v^(k-1), the storage vector output by the layer above, enters the neural network; and w_6^(k) is the parameter matrix acting on the hidden-layer variable. h_n^(k) is the hidden-layer vector obtained from the commodity's n-th feature; multiplying it by the single-row matrix w_6^(k) yields a scalar ĥ_n^(k), which is then converted through softmax into the commodity's k-th-layer characterization-vector weight a_n^(k).

Then, according to a_n^(k), the weighted sum of the commodity embedded vectors is calculated to obtain the characterization vector v^(k) of the commodity's k-th layer:

v^(k) = Σ_{n=1}^{N} a_n^(k) v_n
In particular, u^(0) and v^(0), the inputs to the layer-1 attention networks, are initialized to the means of the feature vectors, and m_u^(0) and m_v^(0) are set equal to u^(0) and v^(0), respectively:

u^(0) = (1/T) Σ_{t=1}^{T} u_t

v^(0) = (1/N) Σ_{n=1}^{N} v_n

m_u^(0) = u^(0)

m_v^(0) = v^(0)
3) Output layer
After the characterization vectors of the user and the commodity are obtained, a common way to measure how well a user matches a commodity is to take the inner product of the two vectors: the larger the inner product, the closer the two vectors are in the characterization space, and thus the higher the matching degree. The purpose of the output layer is to take the characterization vectors of the user and the commodity and calculate their matching degree.
Through the attention mechanism layer, we obtain the characterization vector of each network layer for the user and for the commodity; combining them finally yields the characterization vectors z_u and z_v of the user and the commodity, and the final matching degree is obtained by taking the inner product:

z_u = [u^(0); …; u^(K)]

z_v = [v^(0); …; v^(K)]

S = z_u · z_v

wherein [ ; ] is the splicing (concatenation) operation and · is the dot-product operation, i.e., two vectors of the same length are multiplied element by element at the same positions and summed to obtain the inner product of the two vectors.
Fig. 1 shows the overall structure of the model when K=2.
Model training uses the cross-entropy loss function, a widely used loss:

L = -(1/m) Σ_{i=1}^{m} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

wherein m is the number of samples, y_i is the sample label (1 for a positive sample with click behavior, otherwise 0), and ŷ_i is the predicted matching probability of sample i, i.e., the matching degree S_i passed through a sigmoid. For each user, the pair <u,v+> formed with each clicked commodity is a positive sample pair; the un-clicked commodities are far too many, so several of them are randomly sampled to form negative sample pairs <u,v->. Model training is accomplished by minimizing the loss function L, i.e., continually narrowing the distance between positive sample pairs and widening the distance between negative sample pairs.
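A minimal sketch of the output layer and this loss, assuming the matching degree is squashed by a sigmoid as described above (function names are our own):

```python
import numpy as np

def match_score(user_outs, item_outs):
    """S = z_u · z_v, with z_u = [u^(0); ...; u^(K)] by concatenation."""
    return float(np.concatenate(user_outs) @ np.concatenate(item_outs))

def cross_entropy(scores, labels):
    """L = -(1/m) Σ [ y log ŷ + (1-y) log(1-ŷ) ] with ŷ = sigmoid(S)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    y = np.asarray(labels, dtype=float)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())

# e.g. one positive pair and one randomly sampled negative pair:
rng = np.random.default_rng(3)
u = [rng.normal(size=4) for _ in range(3)]       # [u^(0), u^(1), u^(2)]
v_pos = [rng.normal(size=4) for _ in range(3)]
v_neg = [rng.normal(size=4) for _ in range(3)]
loss = cross_entropy([match_score(u, v_pos), match_score(u, v_neg)], [1, 0])
```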
In the prediction stage, the characterization vectors of all commodities are first calculated through the right half of the model, the commodity attention mechanism, and stored. For each user, the user's characterization vector is obtained through the left half of the model, the user attention mechanism; the matching degrees between the user and all commodities are then calculated, and the top P commodities with the highest matching degree are selected as the recall result. Recall rate may be chosen as the evaluation metric.
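The saving this enables can be seen in a sketch of the prediction stage: the commodity characterization vectors are computed once and cached, and recall for a user reduces to one matrix-vector product plus a top-P selection (names such as `recall_top_p` are illustrative):

```python
import numpy as np

def recall_top_p(z_user, Z_items, P):
    """Return the indices of the P commodities with the highest
    inner-product matching degree against one user vector."""
    scores = Z_items @ z_user            # matching degree for every commodity
    return np.argsort(-scores)[:P]       # indices of the P largest scores

rng = np.random.default_rng(4)
Z_items = rng.normal(size=(1000, 12))    # 1000 cached commodity vectors z_v
z_user = rng.normal(size=12)             # one user's vector z_u
top_items = recall_top_p(z_user, Z_items, P=5)
```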
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the invention.

Claims (9)

1. A recommendation system recall method based on an attention mechanism, comprising the steps of:
S10, extracting the user features and commodity features in the training samples, converting the user features into user embedded vectors, and converting the commodity features into commodity embedded vectors;
S20, inputting the user embedded vectors and the commodity embedded vectors into an attention mechanism model for training, learning the weight of each feature through the attention networks in the model, and computing the weighted sum of the embedded vectors of all features according to the weights to obtain a user characterization vector and a commodity characterization vector; calculating the inner product of the user characterization vector and the commodity characterization vector to obtain the matching degree of the user's willingness to purchase the commodity for the training sample, establishing a cross-entropy loss function of this matching degree, and minimizing the cross-entropy loss function until the attention mechanism model converges;
the multi-layer user attention network in S20 comprising K layers of user attention network, in which the user characterization vector u^(k) is given by the formulas:

u^(k) = U_Attention(E_u, m^(k-1))

m^(k) = m^(k-1) + u^(k)

wherein the superscript (k) or (k-1) on a variable denotes the k-th or (k-1)-th layer of the attention network, and U_Attention denotes the user attention network; every layer has the same structure, whose specific operation is given by the formulas below; the input of the network is the set of embedded vectors of the user features E_u = {u_1, …, u_T} together with the output m^(k-1) of the previous layer; the output of the network is the user characterization vector u^(k) of this layer, and m^(k) is a storage vector holding the accumulated sum of the characterization vectors obtained from the previous k layers; after the input is obtained, the attention network is normalized through the two-layer feedforward neural network FNN and the softmax layer to obtain the attention weights a_t^(k), and this weight vector is used to compute the weighted average of the T user feature vectors, yielding the characterization vector u^(k) of this layer;

at the k-th layer, for t = 1, 2, 3, …, T, the weight a_t^(k) of the user's t-th embedded vector at this layer is obtained first:

h_t^(k) = tanh( (W_1^(k) u_t) ⊙ (W_2^(k) m^(k-1)) )

ĥ_t^(k) = w_3^(k) h_t^(k)

a_t^(k) = exp(ĥ_t^(k)) / Σ_{j=1}^{T} exp(ĥ_j^(k))

wherein W_1^(k), W_2^(k) and w_3^(k) are all network parameter matrices: W_1^(k) is the parameter matrix by which u_t, the embedded vector of the user's t-th feature, enters the neural network in the k-th layer user attention network; W_2^(k) is the parameter matrix by which m^(k-1), the storage vector output by the layer above, enters the neural network; and w_3^(k) is the parameter matrix acting on the hidden-layer variable; h_t^(k) is the hidden-layer vector obtained from the user's t-th feature, tanh is the activation function, and ⊙ is the element-wise vector multiplication, i.e., two vectors of the same length are multiplied element by element at the same positions to obtain a new vector; multiplying h_t^(k) by w_3^(k), a matrix with a single row, yields a scalar ĥ_t^(k), which is then converted through softmax into the user's k-th-layer characterization-vector weight a_t^(k); e is a natural constant;

then, according to a_t^(k), the weighted sum of the user's embedded vectors is calculated to obtain the characterization vector u^(k) of the user's k-th layer:

u^(k) = Σ_{t=1}^{T} a_t^(k) u_t ;
S30, inputting the sample to be tested into the converged attention mechanism model, obtaining the matching degree of the user's willingness to purchase commodities for the sample to be tested, and selecting the commodities whose matching degree falls within a preset interval as the recall result to be recommended.
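The following is a minimal sketch, under the reconstruction of claim 1 above, of one pass through the K-layer user attention network: $h = \tanh(W_1 u_t) \odot \tanh(W_2 m)$, softmax weights from the single-row matrix $W_3$, and the storage-vector update $m^{(k)} = m^{(k-1)} + u^{(k)}$. The parameter shapes, the initial storage vector, and the sharing of one parameter set across layers (each layer would normally carry its own $W$'s) are all assumptions.

```python
# Sketch: one layer of the user attention network, applied K = 2 times.
import numpy as np

def user_attention_layer(U, m, W1, W2, W3):
    """U: (T, d) embedded feature vectors; m: (d,) storage vector."""
    H = np.tanh(U @ W1.T) * np.tanh(m @ W2.T)   # (T, h) hidden vectors; * is element-wise
    logits = H @ W3.ravel()                     # single-row W3 -> one scalar per feature
    a = np.exp(logits) / np.exp(logits).sum()   # softmax attention weights a_t^(k)
    u_k = a @ U                                 # weighted sum of the T embeddings
    return u_k, m + u_k                         # layer output and updated storage vector

rng = np.random.default_rng(2)
T, d, h = 5, 8, 16
U = rng.normal(size=(T, d))
m = U.mean(axis=0)                              # assumed initialization of the storage vector
W1 = rng.normal(size=(h, d))
W2 = rng.normal(size=(h, d))
W3 = rng.normal(size=(1, h))

for _ in range(2):                              # K = 2 layers of identical structure
    u_k, m = user_attention_layer(U, m, W1, W2, W3)
print(u_k)
```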
2. The attention mechanism based recommendation system recall method of claim 1, wherein the attention mechanism model is a bi-directional attention mechanism model comprising a multi-layer user attention network and a multi-layer commodity attention network, each layer of the user attention network comprising a two-layer feedforward neural network FNN and a normalization layer Softmax, each layer of the commodity attention network comprising a two-layer feedforward neural network FNN and a normalization layer Softmax, the user attention network and the commodity attention network each being in a layer-by-layer recursive relationship.
3. The attention mechanism based recommendation system recall method of claim 2, wherein the multi-layer commodity attention network in S20 comprises a K-layer commodity attention network, in which the commodity characterization vector $v^{(k)}$ is given by:

$$v^{(k)} = \mathrm{V\_Attention}\big(\{v_n\}_{n=1}^{N},\; m^{(k-1)}\big)$$
$$m^{(k)} = m^{(k-1)} + v^{(k)}$$

wherein V_Attention denotes the commodity attention network, whose structure is the same as that of the user attention network; the weight $a_n^{(k)}$ of each commodity's $n$-th embedded vector is obtained first, and the weighted sum of all commodity embedded vectors according to these weights then yields the commodity characterization vector $v^{(k)}$ of the $k$-th layer;

at the $k$-th layer, for $n = 1, 2, 3, \dots, N$, the weight $a_n^{(k)}$ of the commodity's $n$-th embedded vector at that layer is first obtained:

$$h_n^{(k)} = \tanh\big(W_1^{(k)} v_n\big) \odot \tanh\big(W_2^{(k)} m^{(k-1)}\big)$$
$$\hat{a}_n^{(k)} = W_3^{(k)} h_n^{(k)}$$
$$a_n^{(k)} = \frac{e^{\hat{a}_n^{(k)}}}{\sum_{j=1}^{N} e^{\hat{a}_j^{(k)}}}$$

wherein $W_1^{(k)}$, $W_2^{(k)}$ and $W_3^{(k)}$ are parameter matrices of the commodity attention network: $W_1^{(k)}$ is the parameter matrix applied to the embedded vector $v_n$ of the commodity's $n$-th feature input to the $k$-th layer commodity attention network; $W_2^{(k)}$ is the parameter matrix applied to the storage vector $m^{(k-1)}$ output by the previous layer; and $h_n^{(k)}$ is the hidden layer vector obtained from the commodity's $n$-th feature. Multiplying $h_n^{(k)}$ by the single-row matrix $W_3^{(k)}$ yields the scalar $\hat{a}_n^{(k)}$, which is then converted by softmax into the final characterization weight $a_n^{(k)}$ of the commodity at the $k$-th layer;

then, according to the weights $a_n^{(k)}$, the weighted sum of the commodity embedded vectors is computed to obtain the commodity's characterization vector $v^{(k)}$ at the $k$-th layer:

$$v^{(k)} = \sum_{n=1}^{N} a_n^{(k)} v_n$$
4. The attention mechanism based recommendation system recall method of claim 1, wherein in S20 the inner product of the user characterization vector and the commodity characterization vector is calculated to obtain the matching degree of the training sample user's willingness to purchase the commodity, specifically as follows:
the multi-layer user attention network concatenates the characterization vectors $u^{(k)}$ of the user attention networks of all layers to obtain the final user characterization vector $z_u = [u^{(0)}; \dots; u^{(K)}]$;
the multi-layer commodity attention network concatenates the characterization vectors $v^{(k)}$ of the commodity attention networks of all layers to obtain the final commodity characterization vector $z_v = [v^{(0)}; \dots; v^{(K)}]$;
the inner product of the final user characterization vector $z_u$ and the final commodity characterization vector $z_v$ is calculated to obtain the final matching degree of the user's willingness to purchase the commodity.
5. The attention mechanism based recommendation system recall method of claim 1, wherein the cross entropy loss function of the matching degree of the user's willingness to purchase the commodity is specifically:

$$L = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log\hat{y}_i + (1-y_i)\log\left(1-\hat{y}_i\right)\right]$$

wherein m is the number of samples and $y_i$ is the sample label, samples with click behavior being treated as positive and labeled 1 and samples without click behavior being treated as negative and labeled 0, and $\hat{y}_i$ is the predicted matching degree of the $i$-th sample; for each user, each clicked commodity together with the user forms a positive sample pair <u, v^+>, and randomly sampled non-clicked commodities form negative sample pairs <u, v^->; model training is performed by minimizing the loss function L, i.e., continually narrowing the distance between positive sample pairs and widening the distance between negative sample pairs.
6. The attention mechanism based recommendation system recall method of claim 1, wherein in S10:
user data and commodity data are separated from the training samples; the user data are processed into sparse user vectors $\{u_t\}_{t=1}^{T}$, where T is the total number of user features, t indexes the current user feature, and u denotes the user; the commodity data are processed into sparse commodity vectors $\{v_n\}_{n=1}^{N}$, where N is the total number of commodity features, n indexes the current commodity feature, and v denotes the commodity;
the training sample data are classified into categorical features and continuous features according to the attributes of the data. For a categorical feature, a one-hot encoded vector $x_i$ is used: the vector length of $x_i$ is the sum of the sizes of all the features of the current training sample, the position of the category value is set to 1 and all other positions to 0, and a feature dictionary is built mapping each category value to its position index in the vector. For a continuous feature, the vector length is likewise the sum of the sizes of all the features of the current training sample, the value of the continuous feature is placed at its feature position and all other positions are 0, so that the feature is also encoded as a sparse vector.
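As an editorial illustration of the encoding in claim 6, here is a hedged sketch in which categorical features become one-hot vectors over the combined feature space and continuous features place their raw value at a reserved position. The tiny feature dictionary, the feature names, and the values are made up for the example.

```python
# Sketch: one-hot encoding for categorical features, value-at-position for continuous ones.
import numpy as np

feature_dict = {"gender=male": 0, "gender=female": 1, "city=SZ": 2, "age": 3}
DIM = len(feature_dict)            # sum of the sizes of all feature fields

def encode_categorical(name):
    x = np.zeros(DIM)
    x[feature_dict[name]] = 1.0    # category position set to 1, all others 0
    return x

def encode_continuous(name, value):
    x = np.zeros(DIM)
    x[feature_dict[name]] = value  # raw value at the feature's reserved position
    return x

print(encode_categorical("gender=female"))
print(encode_continuous("age", 0.27))   # e.g. a normalized age
```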
7. The attention mechanism based recommendation system recall method of claim 1, wherein the attention mechanism model is a representation learning model.
8. The attention mechanism based recommendation system recall method of claim 1, wherein the vector lengths of the user attention network and the commodity attention network are equal.
9. The attention mechanism based recommendation system recall method of claim 1, wherein the training sample data are collected from a click-through rate (CTR) estimation model.
CN201911222216.1A 2019-12-03 2019-12-03 Recommendation system recall method based on attention mechanism Active CN111062775B (en)



