CN113822742A - Recommendation method based on self-attention mechanism - Google Patents

Recommendation method based on self-attention mechanism

Info

Publication number
CN113822742A
CN113822742A (application CN202111098120.6A; granted publication CN113822742B)
Authority
CN
China
Prior art keywords
user
layer
item
sequence
recommendation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111098120.6A
Other languages
Chinese (zh)
Other versions
CN113822742B (en)
Inventor
田玲
闫科
康昭
惠孛
罗光春
张天舒
曾翰林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202111098120.6A priority Critical patent/CN113822742B/en
Publication of CN113822742A publication Critical patent/CN113822742A/en
Application granted granted Critical
Publication of CN113822742B publication Critical patent/CN113822742B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Accounting & Taxation (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Finance (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of item recommendation and discloses a recommendation method based on a self-attention mechanism that improves the training efficiency and the personalized recommendation effect of a recommendation model. The method comprises the following steps: first, historical interaction information of users is collected and preprocessed to form a training sample set; then a recommendation model is designed, the training sample set is used as its input, and a squared loss function is adopted as the optimization target to train the model; finally, the trained recommendation model computes the interaction probability between a user and each candidate item, and the items are ranked by interaction probability to generate the user's recommendation candidate set.

Description

Recommendation method based on self-attention mechanism
Technical Field
The invention relates to the field of item recommendation, and in particular to a recommendation method based on a self-attention mechanism.
Background
In the era of information explosion, the number of goods grows rapidly and users find it difficult to locate content of interest within a short time, so how to use users' historical records to generate personalized recommendations quickly and accurately has become a current research hotspot. Recommendation algorithms can help users discover content they are interested in; the current mainstream methods include collaborative filtering recommendation algorithms, content-based recommendation algorithms, and recommendation algorithms based on deep learning models. Recommendation algorithms based on deep learning models have gradually become the dominant approach owing to their strong fitting and generalization capability.
With respect to recommendation algorithms based on deep learning models, Jin et al. proposed a neural-perception recommendation method that captures a user's preference vector with a neural network and then infers the user's degree of preference for an item. The method takes the quality of the goods associated with a store into account and assumes that a user's purchase decision is determined both by personal interest and by the quality of the goods. DeepCoNN adopts two parallel convolution modules: one focuses on learning user behavior from the reviews written by the user, while the other network learns product attributes from the product reviews; the extracted features are fed into convolution layers with different kernels, max-pooling layers and fully connected layers to obtain a user representation X_u and a product representation Y_i, from which the user's expected rating of the product is finally computed.
The above research mainly makes recommendations based on users' historical information. The neural-perception recommendation method proposed by Jin et al. does not consider preference weights between a user and different items: all interactions use the same weight, and the information lost in this way may affect the learning effect of the model. Meanwhile, the method has to additionally compute an overall quality score of the store from user feedback, and since the store's score and reviews change frequently, the training overhead of the model increases. DeepCoNN uses two feature extraction modules, so the features of each user and each product must be obtained from the user reviews and product reviews through both modules; the computation cost is high, and learning from a large amount of text may even hurt the recommendation effect.
In fact, the positions at which the chosen items appear in a user's record reflect the user's interest at the corresponding time; the above research neither uses this position information in the sequence nor considers the degree of association between items in the historical information.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a recommendation method based on a self-attention mechanism is provided, and the training efficiency and the personalized recommendation effect of a recommendation model are improved.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a recommendation method based on a self-attention mechanism comprises the following steps:
s1, training a recommendation model:
s11, collecting and preprocessing historical interaction information of a user to form a training sample set;
s12, designing a recommendation model, taking the training sample set as the input of the recommendation model, and adopting a square loss function as an optimization target to train the recommendation model;
s2, recommending according to the trained recommendation model:
and calculating through the trained recommendation model to obtain the interaction probability between the user and the item to be recommended, and sequencing according to the interaction probability to generate a recommendation candidate set of the user.
As a further optimization, step S11 specifically includes:
s111, acquiring the historical interaction records of users and converting them into a user-item interaction matrix, whose entries comprise user codes, item codes and item categories;
s112, filling item category information that is null with 0;
s113, converting the user codes, item codes and item categories into one-hot codes and performing numerical compression;
s114, sorting each user's interactions in the user-item interaction matrix in chronological order to obtain the historical interaction sequence of each user, wherein each item in the historical interaction sequence is represented by its one-hot code.
As a further optimization, the recommendation model designed in step S12 comprises an input layer, a coding layer, a feature fusion layer and an output layer. The input layer converts the input data into low-dimensional embedded representations; the coding layer is responsible for obtaining long-term and short-term dependency representations of the user's historical interaction sequence; the feature fusion layer fuses the sequence features with the item category features and converts them into denser features; and the output layer combines the features obtained from the user-code embedding, the item-code embedding and the feature fusion layer to generate the final result as the interaction probability between the user and the item.
As a further optimization, in step S12 the input layer converts the input data into low-dimensional embedded representations as follows:
the one-hot codes of the user code, the item code, the item category and the historical interaction sequence are looked up in randomly initialized embedding tables, so that each one-hot index is mapped to its corresponding vector and thereby converted into a low-dimensional embedding:
the embedded representation of the user codes is E_u ∈ R^{N×d_u}, where N is the total number of users and d_u = 128;
the embedded representation of the item codes is E_d ∈ R^{M×d_d}, where M is the total number of items and d_d = 128;
the embedded representation of the item categories is E_c ∈ R^{T×d_c}, where T is the total number of item categories and d_c = 32;
the embedded representation of the historical interaction sequence is E_T = [e_1 + p_1, e_2 + p_2, ..., e_l + p_l] ∈ R^{l×d_d}, where at most l items are kept per sequence, e_i is the embedding of the item code and p_i is the position encoding of the item.
As a further optimization, in step S12 the coding layer is responsible for obtaining the long-term and short-term dependency representations of the user's historical interaction sequence:
a Transformer module is adopted as the long-term dependency learning module; it takes the embedded representation E_T of the historical interaction sequence as input and produces the sequence feature h_L;
a GRU module is adopted as the short-term dependency learning module; it takes the embeddings of the last k items of the historical interaction sequence as input and selects the output of the k-th GRU unit as the short-term dependency representation, capturing the user's recent-interest feature h_S.
As a further optimization, the Transformer module comprises a multi-head self-attention layer, a feed-forward network layer and a normalization layer; the multi-head self-attention layer uses a plurality of self-attention modules to learn different hidden-layer representations; the feed-forward network layer adopts the GELU activation function; the normalization layer employs a residual network.
As a further optimization, in step S12 the feature fusion layer fuses the sequence features with the item category features and converts them into denser features:
the item category embedding e_c, the sequence feature h_L obtained by the Transformer module and the output h_S of the GRU module are concatenated into a new vector, and a multilayer perceptron then converts the concatenated vector into an embedded representation h_f of dimension d_m, where d_m = 128:
h_f = MLP(e_c ⊕ h_L ⊕ h_S),
where ⊕ denotes the concatenation operation and the weights and biases of the multilayer perceptron are parameters that need to be learned.
As a further optimization, in step S12 the output layer combines the features obtained from the user-code embedding, the item-code embedding and the feature fusion layer to generate the final result as the interaction probability between the user and the item:
the embedded representation e_{u_i} of user u_i, the embedded representation e_{d_j} of item d_j and the fused vector h_f are concatenated, and the final output ŷ_ij is then obtained through an MLP (multilayer perceptron):
ŷ_ij = f(W·(e_{u_i} ⊕ e_{d_j} ⊕ h_f) + b),
where ŷ_ij is the interaction probability of user u_i and item d_j, f is the ReLU activation function, and W and b are parameters that need to be learned.
As a further optimization, in step S12 the recommendation model is trained with the squared loss function as the optimization target:
1) the loss function is computed as
L = Σ_{(i,j)} (y_ij − ŷ_ij)² + λ‖Φ‖²,
where y_ij is the data label indicating whether user u_i and item d_j have interacted: if d_j appears in the record of u_i then y_ij = 1, otherwise y_ij = 0; ŷ_ij is the predicted interaction probability of user u_i and item d_j; λ is the regularization coefficient controlling the degree of parameter regularization; and Φ is the set of parameters that require regularization;
2) the processing steps of the input layer, coding layer, feature fusion layer and output layer are iterated with stochastic gradient descent until the training period ends, and the model with the minimum loss is taken as the trained recommendation model.
The invention has the following beneficial effects:
computationally expensive feature extraction is avoided; deep sequence features are captured from the user history with a highly parallel Transformer module, so training efficiency is high, and combining the captured deep sequence features with the item attributes enriches the representational capability of the model embeddings;
the model parameters are updated with a multi-head attention mechanism, which effectively improves the learning ability of the model, makes the recommendation more accurate, and makes the recommended content better match the user's interests.
Drawings
FIG. 1 is a flow chart of a recommendation method based on a self-attention mechanism in an embodiment;
fig. 2 is a diagram of a recommendation model structure in the embodiment.
Detailed Description
The purpose of the invention is to provide a recommendation method based on a self-attention mechanism that improves the training efficiency and the personalized recommendation effect of a recommendation model. First, historical interaction information of users is collected and preprocessed to form a training sample set; then a recommendation model is designed, the training sample set is used as its input, and a squared loss function is adopted as the optimization target to train the model; finally, the trained recommendation model computes the interaction probability between the user and each candidate item, and the items are ranked by interaction probability to generate the user's recommendation candidate set.
Embodiment:
as shown in fig. 1, the recommendation method based on the self-attention mechanism in this embodiment includes the following implementation steps:
A. training a recommendation model:
a1, collecting and preprocessing user historical interaction information to form a training sample set, and specifically comprising the following steps A11-A14:
a11, acquiring the historical interaction records of users and converting them into a user-item interaction matrix, whose entries comprise user codes, item codes and item categories; through this conversion, each user's number u_i ∈ {u_1, u_2, ..., u_N}, each item's number d_j ∈ {d_1, d_2, ..., d_M} and the category c_{d_j} ∈ {c_1, c_2, ..., c_T} corresponding to item d_j are obtained, where N is the number of users, M is the number of items, and T is the number of item categories.
A12, carrying out 0 setting and filling on the item type information which is null;
a13, converting the user code, the item code and the item type into one-hot codes and performing numerical compression processing; the processed numerical information is a continuous integer, and the serial numbers start from 0;
a14, sorting each user's interactions in the user-item interaction matrix in chronological order to obtain each user u_i's historical interaction sequence S^{u_i} = {s_1, s_2, ..., s_L}, where L is the length of the sequence and each item in the historical interaction sequence is represented by its one-hot code.
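As an illustration only, the preprocessing of steps A11-A14 could look like the following sketch; it assumes the raw log is a pandas DataFrame with user_id, item_id, category and timestamp columns, and all column and function names are hypothetical rather than taken from the patent:

```python
import pandas as pd

def build_training_sequences(log: pd.DataFrame, max_len: int = 100):
    """Hypothetical sketch of steps A11-A14."""
    log = log.copy()
    # A12: fill null item categories with 0
    log["category"] = log["category"].fillna(0)
    # A13: map user/item/category identifiers to consecutive integers starting from 0
    log["user"] = log["user_id"].astype("category").cat.codes
    log["item"] = log["item_id"].astype("category").cat.codes
    log["cat"] = log["category"].astype("category").cat.codes
    # A14: sort each user's interactions by time and keep at most max_len items
    log = log.sort_values(["user", "timestamp"])
    sequences = (
        log.groupby("user")["item"]
           .apply(lambda items: list(items)[-max_len:])
           .to_dict()
    )
    return log[["user", "item", "cat"]], sequences
```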
A2, designing a recommendation model, taking a training sample set as an input of the recommendation model, and taking a square loss function as an optimization target to train the recommendation model, wherein the method specifically comprises the following steps of A21-A22:
a21, model construction:
In this step, a recommendation model based on the self-attention mechanism is designed for product recommendation. The model consists of an input layer, a coding layer, a feature fusion layer and an output layer. The input layer converts the one-hot-represented data into low-dimensional embedded representations; the coding layer obtains long-term and short-term dependency representations of the user's historical sequence through a Transformer module and a GRU module; the feature fusion layer fuses the sequence features with the item category features and converts them into denser features; and the output layer combines the features obtained from the user embedding, the item embedding and the feature fusion layer to generate the final result as the interaction probability between the user and the item.
The four layers of the model are specified as follows:
(1) Input layer:
At the input layer, the one-hot-represented data (user code, item code, item category and historical interaction sequence) needs to be converted into low-dimensional embedded values:
the embedded representation of the user code is
Figure BDA0003269788620000052
Where N is the total number of users, here let the embedding dimension du=128。
The embedded representation of the item code is
Figure BDA0003269788620000053
Where M is the total number of items, let d be the embedding dimensiond=128。
The embedded representation of the item category is
Figure BDA0003269788620000054
Where T is the number of item categories, here let the embedding dimension dc=32。
The order of the items in a sequence reflects the change of user behavior, but because the Transformer module designed in the coding layer does not carry time-order information the way a recurrent neural network does, an additional position encoding is needed so that the model can learn the importance of position when processing sequence information; the position encoding is P = {p_1, p_2, ..., p_l}, where l is the maximum length of the sequence.
The output of the input layer combines the position codes with the original item input embeddings:
E_T = [e_1 + p_1, e_2 + p_2, ..., e_l + p_l] ∈ R^{l×d_d},
where at most l items are kept per sequence.
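A minimal PyTorch sketch of this input layer, under the dimensions stated above (d_u = d_d = 128, d_c = 32, l = 100); the class and variable names are illustrative assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class InputLayer(nn.Module):
    def __init__(self, n_users, n_items, n_cats, max_len=100, d_u=128, d_d=128, d_c=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d_u)   # E_u
        self.item_emb = nn.Embedding(n_items, d_d)   # E_d
        self.cat_emb = nn.Embedding(n_cats, d_c)     # E_c
        self.pos_emb = nn.Embedding(max_len, d_d)    # position codes p_1 ... p_l

    def forward(self, user, item, cat, seq):
        # seq holds the (padded) item indices of the historical sequence, shape (batch, l)
        positions = torch.arange(seq.size(1), device=seq.device)
        e_t = self.item_emb(seq) + self.pos_emb(positions)   # E_T = item embedding + position code
        return self.user_emb(user), self.item_emb(item), self.cat_emb(cat), e_t
```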
(2) Coding layer:
the combination of long-term and short-term dependence can learn the related information of short-term dependence on the basis of effectively acquiring the sequence characteristics, and effectively captures the recent interest characteristics of the user. The Transformer module has the advantage of long-term dependence capture, and the characteristic representation of the sequence is effectively obtained by learning terms and weights among the terms through a self-attention mechanism. In addition, the parallelism of the Transformer is high, and a serialization learning mode of a recurrent neural network is avoided.
Therefore, we choose a Transformer as the long-term dependent learning module at the coding layer. Although GRU (gated round robin unit) is not as effective in extracting long sequence features as the Transformer, the performance is almost consistent on short sequences, and the parameter amount of the GRU module is much smaller than that of the Transformer, so that the GRU module is more suitable for learning short-term dependence. Therefore, we choose the GRU module as the learning module for short term dependency at the coding level.
The Transformer consists of a multi-head attention layer, a feed-forward network layer and a normalization layer. The key part of the module is the multi-head attention layer, which is based on the self-attention mechanism and uses multiple heads to learn representation vectors in different subspaces. The basic attention mechanism is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V,
where Q, K and V denote the query, key and value respectively, i.e. the degree of association between a query and a key determines the weight of the corresponding value. Since the invention uses a self-attention mechanism, Q, K and V are generated from the same input: Q = HW^Q, K = HW^K, V = HW^V.
Multi-head attention uses several self-attention modules to learn different hidden-layer features, defined as:
h_i = Attention(H_{L−1} W_i^Q, H_{L−1} W_i^K, H_{L−1} W_i^V),
H_L = [h_1, h_2, ..., h_n] W^h,
where H_L is the hidden-layer output of the L-th layer. Each head computes its own attention weight distribution and produces a new parameter matrix; W_i^Q, W_i^K and W_i^V are independent weight matrices that are not shared between heads. Finally, the n heads are concatenated and transformed by the weight matrix W^h to give the multi-head attention output of the L-th layer.
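The multi-head self-attention described above could be sketched as follows; the head count and dimensions are assumptions, and a production implementation could equally use torch.nn.MultiheadAttention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # per-head W_i^Q stacked into one matrix
        self.w_k = nn.Linear(d_model, d_model)   # per-head W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # per-head W_i^V
        self.w_h = nn.Linear(d_model, d_model)   # W^h that recombines the concatenated heads

    def forward(self, h):                         # h: (batch, l, d_model)
        b, l, _ = h.shape
        split = lambda x: x.view(b, l, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(h)), split(self.w_k(h)), split(self.w_v(h))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # QK^T / sqrt(d_k)
        out = F.softmax(scores, dim=-1) @ v                  # Attention(Q, K, V)
        out = out.transpose(1, 2).reshape(b, l, -1)          # concatenate the n heads
        return self.w_h(out)
```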
The purpose of the feed-forward network layer FFN is to give the model nonlinear modelling capability; it adopts the GELU activation function.
Compared with ReLU, the GELU activation introduces a stochastic regularization effect and improves convergence to some extent. The feed-forward layer and its activation are:
FFN(x) = Gelu(xW_1 + b_1)W_2 + b_2,
Gelu(x) = x·φ(x),
where φ(x) is the cumulative distribution function of the standard Gaussian distribution, and W_1, b_1, W_2, b_2 are learned parameters shared across the Transformer layers.
In the normalization layer, a residual network is used to guarantee the learning effect of the deep network parameters. Combined with the multi-head attention layer MH and the feed-forward network layer FFN, the overall flow of the Transformer is:
AN_L = LN(H_{L−1} + MH(H_{L−1})),
h_L = LN(FFN(AN_L) + AN_L).
The whole long-term dependency encoding unit is composed of several Transformers whose parameters are shared between layers, which greatly reduces the total number of model parameters and leaves room for scaling the model up.
Taking the input matrix E_T, the output of the first layer can be expressed as:
h_1 = LN(FFN(AN_0) + AN_0) = LN(FFN(LN(E_T + MH(E_T))) + LN(E_T + MH(E_T))).
After the processing of each coding layer, the output of the L-th layer is obtained; here l = 100. The output vectors are summed to obtain the final long-term dependency representation h_L of the sequence.
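A sketch of one encoder block following this flow, reusing the MultiHeadSelfAttention sketch above; the hidden size of the feed-forward layer, the number of stacked layers and the use of a single shared block instance are assumptions consistent with the parameter sharing mentioned here:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512):
        super().__init__()
        self.mh = MultiHeadSelfAttention(d_model, n_heads)   # from the sketch above
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h):
        an = self.ln1(h + self.mh(h))          # AN_L = LN(H_{L-1} + MH(H_{L-1}))
        return self.ln2(self.ffn(an) + an)     # h_L = LN(FFN(AN_L) + AN_L)

class LongTermEncoder(nn.Module):
    """Applies one shared TransformerBlock n_layers times and sums the outputs
    over sequence positions to obtain the long-term feature h_L."""
    def __init__(self, d_model=128, n_layers=2):
        super().__init__()
        self.block, self.n_layers = TransformerBlock(d_model), n_layers

    def forward(self, e_t):                    # e_t: (batch, l, d_model)
        h = e_t
        for _ in range(self.n_layers):
            h = self.block(h)                  # parameters shared between layers
        return h.sum(dim=1)                    # h_L: (batch, d_model)
```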
The GRU module responsible for short-term dependency learning is composed of several GRU units. Each GRU unit contains a reset gate r_t and an update gate z_t, both of which depend on the hidden-state variable h_{t−1} of the previous time step and the input x_t of the current time step:
r_t = σ(W_r x_t + U_r h_{t−1}),
z_t = σ(W_z x_t + U_z h_{t−1}),
where r_t and z_t use the sigmoid activation function σ and W_r, U_r, W_z, U_z are weight parameters to be learned. The candidate hidden state h̃_t depends on the reset gate r_t, the hidden-state variable h_{t−1} and the input x_t; it plays a role similar to the new information in long short-term memory. r_t determines how much historical information is kept: r_t = 0 means that the historical information is completely reset and the new information h̃_t depends only on x_t. The expression for h̃_t is:
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})).
The hidden-state variable h_t of the current time step is computed from the update gate z_t, the hidden-state variable h_{t−1} and the new information h̃_t; z_t determines the forgetting rate of h_{t−1} and the retention rate of the new information h̃_t:
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t.
The input of the GRU module is the embedding of the last k items of the sequence, {e_{l−k+1}, ..., e_l}, and the output of the k-th GRU unit is selected as the short-term dependency representation, i.e. h_S = h_k.
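A sketch of this short-term module: the embeddings of the last k items are fed to a GRU and the hidden state after the k-th unit is taken as h_S (the value of k is an assumption):

```python
import torch.nn as nn

class ShortTermEncoder(nn.Module):
    def __init__(self, d_model=128, k=10):
        super().__init__()
        self.k = k
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, e_t):                    # e_t: (batch, l, d_model)
        tail = e_t[:, -self.k:, :]             # embeddings of the last k items of the sequence
        _, h_n = self.gru(tail)                # h_n: (1, batch, d_model), state after the k-th unit
        return h_n.squeeze(0)                  # h_S
```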
(3) Feature fusion layer:
The role of this layer is to transform and compress the obtained features into a denser embedded representation. First, the item category embedding e_c, the sequence feature h_L obtained from the Transformer layer and the output h_S of the GRU module are concatenated into a new vector; a multilayer perceptron then converts this vector into an embedded representation h_f of dimension d_m, where d_m = 128. The process is:
h_f = MLP(e_c ⊕ h_L ⊕ h_S),
where ⊕ denotes the concatenation operation and the weights and biases of the multilayer perceptron are parameters that need to be learned.
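A sketch of this fusion step: the item-category embedding, h_L and h_S are concatenated and an MLP maps them to the d_m = 128 dimensional representation h_f (the hidden width of the MLP is an assumption):

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, d_c=32, d_model=128, d_m=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_c + 2 * d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_m))

    def forward(self, e_c, h_l, h_s):
        return self.mlp(torch.cat([e_c, h_l, h_s], dim=-1))   # h_f
```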
(4) Output layer:
Given user u_i and item d_j, the interaction probability of u_i and d_j is computed from u_i, d_j and h_f. The embedded representations e_{u_i} and e_{d_j} corresponding to u_i and d_j are first concatenated with the fused vector h_f into a new representation, and the final output ŷ_ij is then obtained through an MLP (multilayer perceptron):
ŷ_ij = f(W·(e_{u_i} ⊕ e_{d_j} ⊕ h_f) + b),
where f is the ReLU activation function and W and b are parameters that need to be learned.
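A sketch of this output step: the user embedding, the candidate-item embedding and the fused vector h_f are concatenated and passed through an MLP whose activation is ReLU, as stated above; the hidden width is an assumption:

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, d_u=128, d_d=128, d_m=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_u + d_d + d_m, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.ReLU())   # f = ReLU

    def forward(self, e_u, e_d, h_f):
        # returns y_hat_ij, the predicted interaction score of user u_i and item d_j
        return self.mlp(torch.cat([e_u, e_d, h_f], dim=-1)).squeeze(-1)
```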
The structure of the proposed model designed as above is shown in fig. 2.
A22, model training and saving:
In this step, the squared loss function is used as the optimization objective of the model, iterative training is carried out for a specified number of epochs, and the best model obtained during training is saved.
Specifically, since the interaction probability of a user and an item is either 1 or 0, the objective of the model is to fit the interaction probabilities in the original data as closely as possible; therefore the squared loss function is chosen as the optimization objective of the model. The loss function can be expressed as:
L = Σ_{(i,j)} (y_ij − ŷ_ij)² + λ‖Φ‖²,
where y_ij is the data label indicating whether user u_i and item d_j have interacted: if d_j appears in the record of u_i then y_ij = 1, otherwise y_ij = 0; ŷ_ij is the predicted interaction probability of user u_i and item d_j. λ is the regularization coefficient controlling the degree of parameter regularization, and Φ is the set of parameters that need to be regularized.
Training uses stochastic gradient descent with the Adam optimizer; the learning rate of the Transformer module is set to 0.001, the learning rate of the GRU module is set to 0.01, and the Dropout regularization parameter is set to 0.5. The number of training epochs is initialized to 100, and the model with the lowest loss within the set period is saved as the model for subsequent deployment.
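A minimal training sketch matching A22, with squared loss against the 0/1 labels, the Adam optimizer and lowest-loss checkpointing; the data loader, the model interface and the weight-decay value standing in for λ are assumptions, and the distinct per-module learning rates quoted above could be realized with Adam parameter groups:

```python
import torch

def train(model, loader, epochs=100, lr=1e-3, weight_decay=1e-5, ckpt="best_model.pt"):
    # weight_decay plays the role of the regularization coefficient lambda over the parameter set Phi
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best = float("inf")
    for epoch in range(epochs):
        total = 0.0
        for user, item, cat, seq, label in loader:        # label y_ij in {0, 1}
            y_hat = model(user, item, cat, seq)
            loss = ((label.float() - y_hat) ** 2).mean()  # squared loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total < best:                                   # keep the model with the lowest loss
            best = total
            torch.save(model.state_dict(), ckpt)
    return ckpt
```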
B. Recommending according to the trained recommendation model:
the method comprises the following steps of utilizing a trained recommendation model to carry out actual application recommendation, and acquiring a user u to be recommended in specific applicationiHistory of (3) calculating a user u to be recommendediAnd interaction probability values between the items. The items are processed according to the sequence of the probability values from large to smallAnd sorting, selecting a certain number of items (such as 20 items) as a recommendation set and pushing the recommendation set to the user.

Claims (9)

1. A recommendation method based on a self-attention mechanism is characterized by comprising the following steps:
s1, training a recommendation model:
s11, collecting and preprocessing historical interaction information of a user to form a training sample set;
s12, designing a recommendation model, taking the training sample set as the input of the recommendation model, and adopting a square loss function as an optimization target to train the recommendation model;
s2, recommending according to the trained recommendation model:
and calculating through the trained recommendation model to obtain the interaction probability between the user and the item to be recommended, and sequencing according to the interaction probability to generate a recommendation candidate set of the user.
2. The self-attention mechanism-based recommendation method of claim 1,
step S11 specifically includes:
s111, acquiring the historical interaction records of users and converting them into a user-item interaction matrix, whose entries comprise user codes, item codes and item categories;
s112, filling item category information that is null with 0;
s113, converting the user codes, item codes and item categories into one-hot codes and performing numerical compression;
s114, sorting each user's interactions in the user-item interaction matrix in chronological order to obtain the historical interaction sequence of each user, wherein each item in the historical interaction sequence is represented by its one-hot code.
3. A recommendation method based on the self-attention mechanism as claimed in claim 1 or 2,
in step S12, the designed recommendation model comprises an input layer, a coding layer, a feature fusion layer and an output layer; the input layer converts the input data into low-dimensional embedded representations; the coding layer is responsible for obtaining long-term and short-term dependency representations of the user's historical interaction sequence; the feature fusion layer fuses the sequence features with the item category features and converts them into denser features; and the output layer combines the features obtained from the user-code embedding, the item-code embedding and the feature fusion layer to generate the final result as the interaction probability between the user and the item.
4. A recommendation method based on the self-attention mechanism as claimed in claim 3,
in step S12, the input layer converts the input data into low-dimensional embedded representations as follows:
the one-hot codes of the user code, the item code, the item category and the historical interaction sequence are looked up in randomly initialized embedding tables, so that each one-hot index is mapped to its corresponding vector and thereby converted into a low-dimensional embedding:
the embedded representation of the user codes is E_u ∈ R^{N×d_u}, where N is the total number of users and d_u = 128;
the embedded representation of the item codes is E_d ∈ R^{M×d_d}, where M is the total number of items and d_d = 128;
the embedded representation of the item categories is E_c ∈ R^{T×d_c}, where T is the total number of item categories and d_c = 32;
the embedded representation of the historical interaction sequence is E_T = [e_1 + p_1, e_2 + p_2, ..., e_l + p_l] ∈ R^{l×d_d}, where at most l items are kept per sequence, e_i is the embedding of the item code and p_i is the position encoding of the item.
5. A recommendation method based on the self-attention mechanism as claimed in claim 3,
in step S12, the coding layer is responsible for obtaining the long-term and short-term dependency representations of the user's historical interaction sequence, comprising: adopting a Transformer module as the long-term dependency learning module, which takes the embedded representation E_T of the historical interaction sequence as input and produces the sequence feature h_L;
adopting a GRU module as the short-term dependency learning module, which takes the embeddings of the last k items of the historical interaction sequence as input and selects the output of the k-th GRU unit as the short-term dependency representation, capturing the user's recent-interest feature h_S.
6. A recommendation method based on the self-attention mechanism as claimed in claim 5,
the Transformer module comprises a multi-head self-attention layer, a feed-forward network layer and a normalization layer; the multi-head self-attention layer uses a plurality of self-attention modules to learn different hidden-layer representations; the feed-forward network layer adopts the GELU activation function; the normalization layer employs a residual network.
7. A recommendation method based on the self-attention mechanism as claimed in claim 5,
in step S12, the feature fusion layer fuses the sequence features with the item category features and converts them into denser features, comprising:
concatenating the item category embedding e_c, the sequence feature h_L obtained by the Transformer module and the output h_S of the GRU module into a new vector, and then using a multilayer perceptron to convert the concatenated vector into an embedded representation h_f of dimension d_m, where d_m = 128, as follows:
h_f = MLP(e_c ⊕ h_L ⊕ h_S),
where ⊕ denotes the concatenation operation and the weights and biases of the multilayer perceptron are parameters that need to be learned.
8. The self-attention mechanism-based recommendation method of claim 7,
in step S12, the output layer combines the features obtained from the user-code embedding, the item-code embedding and the feature fusion layer to generate the final result as the interaction probability between the user and the item, comprising:
concatenating the embedded representation e_{u_i} of user u_i, the embedded representation e_{d_j} of item d_j and the fused vector h_f, and then obtaining the final output ŷ_ij through an MLP:
ŷ_ij = f(W·(e_{u_i} ⊕ e_{d_j} ⊕ h_f) + b),
where ŷ_ij is the interaction probability of user u_i and item d_j, f is the ReLU activation function, and W and b are parameters that need to be learned.
9. A recommendation method based on the self-attention mechanism as claimed in claim 3,
in step S12, training the recommendation model with the squared loss function as the optimization target comprises:
1) computing the loss function
L = Σ_{(i,j)} (y_ij − ŷ_ij)² + λ‖Φ‖²,
where y_ij is the data label indicating whether user u_i and item d_j have interacted: if d_j appears in the record of u_i then y_ij = 1, otherwise y_ij = 0; ŷ_ij is the predicted interaction probability of user u_i and item d_j; λ is the regularization coefficient controlling the degree of parameter regularization; and Φ is the set of parameters that need to be regularized;
2) iterating the processing steps of the input layer, the coding layer, the feature fusion layer and the output layer with stochastic gradient descent until the training period ends, and taking the model with the minimum loss as the trained recommendation model.
CN202111098120.6A 2021-09-18 2021-09-18 Recommendation method based on self-attention mechanism Active CN113822742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111098120.6A CN113822742B (en) 2021-09-18 2021-09-18 Recommendation method based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111098120.6A CN113822742B (en) 2021-09-18 2021-09-18 Recommendation method based on self-attention mechanism

Publications (2)

Publication Number Publication Date
CN113822742A true CN113822742A (en) 2021-12-21
CN113822742B CN113822742B (en) 2023-05-12

Family

ID=78914855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111098120.6A Active CN113822742B (en) 2021-09-18 2021-09-18 Recommendation method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN113822742B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114493755A (en) * 2021-12-28 2022-05-13 电子科技大学 Self-attention sequence recommendation method fusing time sequence information
CN114528434A (en) * 2022-01-19 2022-05-24 华南理工大学 IPTV live channel fusion recommendation method based on self-attention mechanism
CN114615524A (en) * 2022-02-18 2022-06-10 聚好看科技股份有限公司 Server, training method of media asset recommendation network and media asset recommendation method
CN114861783A (en) * 2022-04-26 2022-08-05 北京三快在线科技有限公司 Recommendation model training method and device, electronic equipment and storage medium
CN114912984A (en) * 2022-05-31 2022-08-16 重庆师范大学 Self-attention-based time scoring context-aware recommendation method and system
CN116521971A (en) * 2022-01-19 2023-08-01 腾讯科技(深圳)有限公司 Content recommendation method, apparatus, device, storage medium, and computer program product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429234A (en) * 2020-04-16 2020-07-17 电子科技大学中山学院 Deep learning-based commodity sequence recommendation method
WO2021066903A1 (en) * 2019-09-30 2021-04-08 Microsoft Technology Licensing, Llc Providing explainable product recommendation in a session
CN112732936A (en) * 2021-01-11 2021-04-30 电子科技大学 Radio and television program recommendation method based on knowledge graph and user microscopic behaviors
CN113268633A (en) * 2021-06-25 2021-08-17 北京邮电大学 Short video recommendation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021066903A1 (en) * 2019-09-30 2021-04-08 Microsoft Technology Licensing, Llc Providing explainable product recommendation in a session
CN111429234A (en) * 2020-04-16 2020-07-17 电子科技大学中山学院 Deep learning-based commodity sequence recommendation method
CN112732936A (en) * 2021-01-11 2021-04-30 电子科技大学 Radio and television program recommendation method based on knowledge graph and user microscopic behaviors
CN113268633A (en) * 2021-06-25 2021-08-17 北京邮电大学 Short video recommendation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEINAN LI et al.: "Session-based recommendation with temporal convolutional network to balance numerical gaps" *
段超 et al.: "融合注意力机制的深度混合推荐算法" (Deep hybrid recommendation algorithm fusing an attention mechanism) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114493755A (en) * 2021-12-28 2022-05-13 电子科技大学 Self-attention sequence recommendation method fusing time sequence information
CN114493755B (en) * 2021-12-28 2022-10-14 电子科技大学 Self-attention sequence recommendation method fusing time sequence information
CN114528434A (en) * 2022-01-19 2022-05-24 华南理工大学 IPTV live channel fusion recommendation method based on self-attention mechanism
CN116521971A (en) * 2022-01-19 2023-08-01 腾讯科技(深圳)有限公司 Content recommendation method, apparatus, device, storage medium, and computer program product
CN114615524A (en) * 2022-02-18 2022-06-10 聚好看科技股份有限公司 Server, training method of media asset recommendation network and media asset recommendation method
CN114615524B (en) * 2022-02-18 2023-10-24 聚好看科技股份有限公司 Training method of server and media asset recommendation network and media asset recommendation method
CN114861783A (en) * 2022-04-26 2022-08-05 北京三快在线科技有限公司 Recommendation model training method and device, electronic equipment and storage medium
CN114912984A (en) * 2022-05-31 2022-08-16 重庆师范大学 Self-attention-based time scoring context-aware recommendation method and system

Also Published As

Publication number Publication date
CN113822742B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN113822742B (en) Recommendation method based on self-attention mechanism
CN109299396B (en) Convolutional neural network collaborative filtering recommendation method and system fusing attention model
CN112598462B (en) Personalized recommendation method and system based on collaborative filtering and deep learning
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN110083770B (en) Sequence recommendation method based on deeper feature level self-attention network
CN111310063B (en) Neural network-based article recommendation method for memory perception gated factorization machine
CN110276406B (en) Expression classification method, apparatus, computer device and storage medium
CN111127165A (en) Sequence recommendation method based on self-attention self-encoder
CN114693397B (en) Attention neural network-based multi-view multi-mode commodity recommendation method
CN110659411B (en) Personalized recommendation method based on neural attention self-encoder
CN112328900A (en) Deep learning recommendation method integrating scoring matrix and comment text
CN112800344B (en) Deep neural network-based movie recommendation method
CN114386534A (en) Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network
CN114896434B (en) Hash code generation method and device based on center similarity learning
Tong et al. Collaborative generative adversarial network for recommendation systems
CN114020964A (en) Method for realizing video abstraction by using memory network and gated cyclic unit
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN113177141A (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN114549850A (en) Multi-modal image aesthetic quality evaluation method for solving modal loss problem
CN114462420A (en) False news detection method based on feature fusion model
CN114564651A (en) Self-supervision recommendation method combined with contrast learning method
CN112860930A (en) Text-to-commodity image retrieval method based on hierarchical similarity learning
CN113887836B (en) Descriptive event prediction method integrating event environment information
CN115408603A (en) Online question-answer community expert recommendation method based on multi-head self-attention mechanism
CN114139066A (en) Collaborative filtering recommendation system based on graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant