CN110781409B - Article recommendation method based on collaborative filtering - Google Patents

Article recommendation method based on collaborative filtering

Info

Publication number
CN110781409B
CN110781409B
Authority
CN
China
Prior art keywords
attention
item
user
layer
recommendation
Prior art date
Legal status
Active
Application number
CN201911022328.2A
Other languages
Chinese (zh)
Other versions
CN110781409A (en)
Inventor
郑莹
吕艳霞
Current Assignee
Northeastern University Qinhuangdao Branch
Original Assignee
Northeastern University Qinhuangdao Branch
Priority date
Filing date
Publication date
Application filed by Northeastern University Qinhuangdao Branch filed Critical Northeastern University Qinhuangdao Branch
Priority to CN201911022328.2A priority Critical patent/CN110781409B/en
Publication of CN110781409A publication Critical patent/CN110781409A/en
Application granted granted Critical
Publication of CN110781409B publication Critical patent/CN110781409B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an item recommendation method based on collaborative filtering, relating to the technical field of recommendation systems. A dedicated dynamic weight is introduced to better predict the preference of a user u for an item i; this dynamic weight is estimated with an attention mechanism. Recommendation performance is evaluated with recall and precision, improving the effectiveness and recommendation quality of the recommendation system. The attention mechanism is shown to help estimate the contribution of a user's historically interacted items to the representation of the user's preferences, making personalized recommendation more accurate. Attention scores are computed with pointwise attention and with self-attention, both with notable effect. In addition, the Transformer model is combined with the recommendation algorithm and compared with conventional embedding models, showing an improvement in recommendation effect.

Description

Article recommendation method based on collaborative filtering
Technical Field
The invention relates to the technical field of recommendation systems, in particular to an article recommendation method based on collaborative filtering.
Background
Collaborative Filtering (CF) is the earliest and best-known recommendation algorithm. Its main functions are prediction and recommendation; it has been studied in depth in academia and is widely applied in industry. The algorithm discovers user preferences by mining users' historical behavior data and recommends items of similar taste to users based on those preferences. Collaborative filtering recommendation algorithms fall into two main categories: User-based Collaborative Filtering (UserCF) and Item-based Collaborative Filtering (ItemCF). In brief: birds of a feather flock together, and people fall into groups of like minds. User-based collaborative filtering identifies, from historical behavior data, what a user likes about goods or content (such as purchasing, collecting, commenting on, or sharing), and measures and scores those preferences. It then computes relationships between users from their attitudes towards, and degrees of preference for, the same goods or content, and recommends goods among users with the same preferences. For example, if users A and B both purchased books x, y, and z and gave each a five-star review, then A and B belong to the same class, so a book w viewed by A can be recommended to user B. UserCF found application on some websites (e.g., Digg), but the algorithm has shortcomings. First, as the number of users of a website grows, computing the user interest similarity matrix becomes harder and harder: its time and space complexity grow roughly quadratically with the number of users. Second, user-based collaborative filtering makes it difficult to explain recommendation results. For these reasons, the well-known e-commerce company Amazon proposed another algorithm, item-based collaborative filtering.
An item-based collaborative filtering algorithm (ICF) recommends to users items similar to those they previously liked. For example, the algorithm may recommend a machine learning book because you purchased a data mining guide. However, the ICF algorithm does not compute item similarity from the content attributes of items; it computes similarity mainly by analyzing users' behavior records. ICF not only provides convincing explanations of its predictions in many recommendation scenarios, but also facilitates real-time personalization. In particular, the main computation, estimating the similarities between items, can be done offline, while the online recommendation module only needs to perform a series of lookups on similar items, which is easily done in real time.
The earliest item-based collaborative filtering method, ItemCF, decides whether to add a target item to a user's recommendation list by computing the similarity between the items the user has interacted with in the past and the current target item. That is, the predicted score of user u for a particular item i is the sum, over all items j that u has interacted with, of the similarity $s_{ij}$ between item j and item i multiplied by u's score $r_{uj}$ for j. The calculation formula is as follows:

$$\hat{r}_{ui} = \sum_{j \in R_u} s_{ij}\, r_{uj}$$

where $R_u$ denotes the set of items user u has interacted with.
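As a concrete illustration, a minimal numpy sketch of this prediction rule follows; the dense matrix layout and the function name are illustrative assumptions, not part of the patent.

```python
import numpy as np

def itemcf_score(sim, ratings, u, i):
    """Classic ItemCF prediction: sum of s_ij * r_uj over the items j that
    user u has interacted with. `sim` is a precomputed item-item similarity
    matrix; `ratings` is a dense user-item matrix with 0 for missing entries."""
    interacted = np.nonzero(ratings[u])[0]       # items j with r_uj > 0
    interacted = interacted[interacted != i]     # exclude the target item itself
    return float(sim[i, interacted] @ ratings[u, interacted])
```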
Early ItemCF methods used statistical measures, such as the Pearson coefficient and cosine similarity, to compute the similarity between a user's historical items and the target item. This approach is simple, but such heuristic estimates of item similarity lack any optimization tailored to the recommendation task and can therefore produce suboptimal performance. Moreover, under sparse data, cosine similarity effectively treats a user's unrated items as 0, and the set of items co-rated by users (needed for the Pearson coefficient) may be small. These methods therefore need adaptation and optimization to fit different data sets to the recommendation task. With the development of machine learning, a learning-based approach called SLIM was proposed. It customizes a recommendation objective function to learn adaptive item-item similarities directly from data; that is, it minimizes the loss between the original user-item interaction matrix and the interaction matrix reconstructed by the item-based CF model. Although SLIM can achieve better recommendation accuracy, it has two inherent limitations. First, offline training can be very time-consuming on large-scale data, since learning the similarity matrix S directly has time complexity on the order of O(|I|²). Second, it can only estimate the similarity between two items that have been purchased or rated together; it cannot estimate the similarity between unrelated items and therefore cannot capture transitive relationships between items. In actual recommendation tasks, particularly when data is sparse, SLIM's recommendation quality degrades.
FISM addresses these limitations well. It represents items as low-dimensional embedding vectors, so that the similarity $s_{ij}$ between items is parameterized as the inner product of the embedding vectors of items i and j. As the numbers of users and items grow and the interaction matrix becomes sparse, the effectiveness of existing Top-K recommendation methods declines; the FISM algorithm therefore provides an item-based method for generating Top-K recommendations, in which the item similarity matrix is learned as the product of two low-dimensional latent factor matrices. A full set of experiments on multiple data sets at several sparsity levels shows that the method proposed in the FISM algorithm handles sparse data sets efficiently. For this reason, FISM's recommendation accuracy is superior to that of other popular Top-K recommendation algorithms, and its advantage grows as the data set becomes sparser. Despite this superior performance, the assumption that all of a user's historically interacted items contribute equally to the representation of the user's preferences is clearly unreasonable. For example, basketballs and everyday household items should not carry the same weight when recommending basketball shoes. Therefore, a dedicated dynamic weight is introduced to better predict the preference of user u for item i, and this dynamic weight is estimated with an attention mechanism.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an article recommendation method based on collaborative filtering.
An article recommendation method based on collaborative filtering comprises the following steps:
Step 1: compute the prediction score of user u for target item i. Using one-hot coding, obtain the embedding vectors p and q through an embedding layer, where p denotes that an item plays the role of predicted item and q that it plays the role of historical interacted item, and obtain the item's predicted score. The attention-based ItemCF formula is defined as follows:

$$\hat{r}_{ui} = \frac{1}{|R_u \setminus \{i\}|^{\alpha}} \sum_{j \in R_u \setminus \{i\}} a_{ij}\, \mathbf{p}_i^{\mathsf{T}} \mathbf{q}_j$$

$$a_{ij} = f(\mathbf{p}_i, \mathbf{q}_j)$$

where i is the predicted target item, j a historically interacted item of the user, $a_{ij}$ the weight, computed by an attention network, of historical item j's contribution to the representation of the user's preference, $\mathbf{p}_i$ and $\mathbf{q}_j$ the embedding vectors of the predicted item and of the user-interacted items respectively, $R_u$ the positive-example set of user u, $R_u \setminus \{i\}$ that set with item i removed, and $\frac{1}{|R_u \setminus \{i\}|^{\alpha}}$ a normalization coefficient;
Step 1.1: concatenate the embedding vector $\mathbf{p}_i$ of the predicted item and the embedding vector $\mathbf{q}_j$ of a user-interacted item to obtain the concatenation vector

$$\mathbf{c} = [\mathbf{p}_i \,;\, \mathbf{q}_j]$$

The concatenation vector serves as the input of the pointwise attention model; this first attempt at an attention mechanism is named Dot;
Step 1.1.1: apply three independent linear transformations to the concatenation vector c, with coefficient matrices $W_Q$, $W_K$, and $W_V$ respectively, obtaining the attention network inputs Query, Key, and Value (Q, K, V);
Step 1.1.2: compute the dot product of Q with the transpose of K using a highly optimized matrix multiplication, apply softmax, then multiply by V to obtain the weight matrix. Expressing the attention function as Attention(Q, K, V), the calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$$

where $d_k$ denotes the dimension of K and the softmax function converts the values into a probability distribution; if Q, K, and V have the same dimensions, the output attention weight matrix has those same dimensions;
Step 1.2: feed the concatenation vector $\mathbf{c} = [\mathbf{p}_i \,;\, \mathbf{q}_j]$ into the network as input, repeat the preceding single dot-product attention h times, concatenate the h result matrices, and finally convert the result to the required dimensionality through a linear transformation. That is, the attention function is set as a self-attention model to compute the weight of historical item j's contribution to user u's predicted score for target item i; this variant is named Self;
Step 1.3: use the main framework of the Transformer model, which divides into an encoder module and a decoder module. The input of the first sub-module of the encoder module is the embedding vector $\mathbf{p}_i$ of the target item to be predicted; the input of each remaining sub-module is the output of the previous one. Each encoder sub-module consists of two layers: the first is a self-attention model layer and the second a feed-forward layer. After the attention operation, both the encoder and the decoder contain a fully connected forward network comprising two linear transformations and a ReLU activation, with the formula:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

The input of the first sub-module of the decoder module is the set $\mathbf{q}_j$ of the user's historically interacted items; the input of each remaining sub-module is the output of the previous one. Each decoder sub-module consists of three layers: the first and second are self-attention layers, except that the second layer's input Q is the output of the previous layer while K and V are the encoder outputs; the third is a feed-forward layer. An "Add & Normalize" layer is added after each layer to prevent gradient vanishing or explosion while also preventing overfitting. The output of the model is converted to the required size by a fully connected layer and a softmax function to obtain the attention weight $a_{ij}$ for the subsequent work; this model is defined as Trans;
Step 1.4: customize an objective function. Treat observed user-item interactions as positive examples and draw negative examples from the remaining unobserved interactions, with $R^{+}$ and $R^{-}$ denoting the positive and negative example sets. Use the log loss as the loss term and penalize the embedding vectors and the coefficient and bias terms of each network with the L2 norm. The loss function is then:

$$L = -\frac{1}{N}\left[\sum_{(u,i) \in R^{+}} \log \sigma(\hat{r}_{ui}) + \sum_{(u,j) \in R^{-}} \log\left(1 - \sigma(\hat{r}_{uj})\right)\right] + \lambda \lVert \Theta \rVert^{2}$$

where N is the total number of training examples, σ is the sigmoid function converting predicted values into probability values, the hyperparameter λ controls the strength of the L2 penalty used to prevent overfitting, and $\Theta = \{\{\mathbf{p}_i\}, \{\mathbf{q}_j\}, W, b, h\}$ denotes all trainable parameters, where W, b, h and all parameters of the linear transformations carry the regularization penalty. A variant of stochastic gradient descent called Adagrad is used to optimize the objective function; it applies an adaptive learning rate to each parameter, draws random samples from the training examples, and updates the relevant parameters in the negative direction of the gradient. A mini-batch method randomly picks a user and then uses all of that user's interacted items as one small batch.
Step 2: conduct experiments on real item data sets using the evaluation indices, judge performance from the recommendation results, and compare the experimental results with other recommendation methods.
The invention has the beneficial effects that:
the method applies a machine translation attention mechanism transformer in natural language processing to a recommendation model, performs experiments on the method provided by the invention on two real data sets of a movie and a picture, and evaluates by using two common recommendation model evaluation indexes of recall ratio and precision ratio. Based on the recall ratio, the method realizes the improvement of 3.2 percent relatively, and based on the precision ratio, the method realizes the improvement of 4.3 percent relatively, so the method can generate a more accurate personalized recommendation list for the user. The efficient recommendation system can provide an efficient and intelligent information filtering technology for the user under the condition that the user lacks experience in related fields or cannot process massive data, explores potential consumption tendency of the user, and provides personalized services for numerous users. Through recommending articles to the user accurately, the interest of the user can be improved, the browsing amount of the website, the click rate and the purchase rate are improved, and great convenience is brought to the life and leisure of the user while income is brought to the website. The better recommendation method can bring business value to the enterprise entity, optimize sales boundary and profit, help the product to expand the boundary, provide more various and more intimate experience through scene construction, and finally improve profit and the like.
Drawings
FIG. 1 is the basic framework of the attention-based ItemCF model;
FIG. 2 is the structure of the pointwise attention model;
FIG. 3 is the basic framework of the Transformer model;
FIG. 4 compares the performance of the models FISM, Dot, Self, and Trans at an embedding size of 16;
panel (a) shows ML-1M HR, panel (b) ML-1M NDCG, panel (c) Pinterest-20 HR, and panel (d) Pinterest-20 NDCG.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The specific embodiments described herein merely illustrate the invention and are not intended to limit it.
An item recommendation method based on collaborative filtering comprises the following steps:
Step 1: as shown in FIG. 1, user u is represented by multi-hot coding, i.e., by all items the user has interacted with under implicit feedback. The user's multi-hot code passes through the embedding layer and generates a set of vectors, each representing a historical item associated with the user; the target item to be predicted obtains its embedding vector through the embedding layer using one-hot coding. To compute the prediction score of user u for target item i, obtain the embedding vectors p and q through the embedding layer using one-hot coding, where p denotes that an item plays the role of predicted item and q that it plays the role of historical interacted item. As shown in FIG. 1, the attention-based ItemCF formula is defined as follows:

$$\hat{r}_{ui} = \frac{1}{|R_u \setminus \{i\}|^{\alpha}} \sum_{j \in R_u \setminus \{i\}} a_{ij}\, \mathbf{p}_i^{\mathsf{T}} \mathbf{q}_j$$

$$a_{ij} = f(\mathbf{p}_i, \mathbf{q}_j)$$

where i is the predicted target item, j a historically interacted item of the user, $a_{ij}$ the weight, computed by an attention network, of historical item j's contribution to the representation of the user's preference, $\mathbf{p}_i$ and $\mathbf{q}_j$ the embedding vectors of the predicted item and of the user-interacted items respectively, $R_u$ the positive-example set of user u, $R_u \setminus \{i\}$ that set with item i removed, and $\frac{1}{|R_u \setminus \{i\}|^{\alpha}}$ a normalization coefficient;
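The following minimal Python sketch illustrates this prediction rule. The attention function f is left abstract, and the FISM-style exponent α on the normalization coefficient is an assumption for illustration, since the text only calls it "a coefficient".

```python
import numpy as np

def attentive_itemcf_score(p, q, attention, u_items, i, alpha=0.5):
    """Attention-based ItemCF prediction: a normalized, attention-weighted
    sum of inner products between the target item's embedding p[i] and the
    embeddings q[j] of the user's historical items."""
    js = [j for j in u_items if j != i]          # R_u \ {i}
    if not js:
        return 0.0
    coeff = len(js) ** (-alpha)                  # normalization coefficient
    return coeff * sum(attention(p[i], q[j]) * float(p[i] @ q[j]) for j in js)
```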
Step 1.1: concatenate the embedding vector $\mathbf{p}_i$ of the predicted item and the embedding vector $\mathbf{q}_j$ of a user-interacted item to obtain the concatenation vector $\mathbf{c} = [\mathbf{p}_i \,;\, \mathbf{q}_j]$, from which the interaction weights are learned. The concatenation vector serves as the input of the pointwise attention model, as shown in FIG. 2; this first attempt at an attention mechanism is named Dot;
Step 1.1.1: apply three independent linear transformations to the concatenation vector c, with coefficient matrices $W_Q$, $W_K$, and $W_V$ respectively, obtaining the attention network inputs Query, Key, and Value (Q, K, V);
Step 1.1.2: compute the dot product of Q with the transpose of K using a highly optimized matrix multiplication; dot products are faster and more space-efficient and lend themselves to such optimized implementations. The product is scaled by the factor $\frac{1}{\sqrt{d_k}}$ so that the inner products do not grow too large: otherwise the softmax outputs saturate towards 0 or 1, causing gradient vanishing or explosion, whereas scaling keeps the values in a region where the gradient is large. The softmax function converts the values into a probability distribution, which is friendly to gradient computation. After softmax, multiply by V to obtain the weight matrix. Expressing the attention function as Attention(Q, K, V), the calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$$

where $d_k$ denotes the dimension of K; if Q, K, and V have the same dimensions, the output attention weight matrix has those same dimensions;
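A minimal numpy sketch of this scaled dot-product attention follows; the max-subtraction is a standard numerical-stability detail, not part of the formula:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled similarity logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```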
Step 1.2: feed the concatenation vector $\mathbf{c} = [\mathbf{p}_i \,;\, \mathbf{q}_j]$ into the network as input, repeat the preceding single dot-product attention h times, concatenate the h result matrices, and finally convert the result to the required dimensionality through a linear transformation. That is, the attention function is set as a self-attention model to compute the weight of historical item j's contribution to user u's predicted score for target item i; this variant is named Self.
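A sketch of this h-head construction, reusing the scaled_dot_product_attention function above; slicing shared projection matrices into per-head blocks is an assumption made for brevity:

```python
import numpy as np

def multi_head_self_attention(c, W_Q, W_K, W_V, W_O, h):
    """Run dot-product attention h times on inputs c, concatenate the h
    results, and map them to the required output dimension with W_O.
    Shapes: c (n, d); W_Q/W_K/W_V (d, d) with d divisible by h; W_O (d, d_out)."""
    d = c.shape[-1]
    d_h = d // h
    heads = []
    for k in range(h):
        block = slice(k * d_h, (k + 1) * d_h)        # this head's projection slice
        Q, K, V = c @ W_Q[:, block], c @ W_K[:, block], c @ W_V[:, block]
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_O      # final linear transformation
```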
Step 1.3: the main framework of the Transformer model, shown in FIG. 3, divides into an encoder module and a decoder module. The input of the first sub-module of the encoder module is the embedding vector $\mathbf{p}_i$ of the target item to be predicted; the input of each remaining sub-module is the output of the previous one. Each encoder sub-module consists of two layers: the first is a self-attention model layer and the second a feed-forward layer. After the attention operation, both the encoder and the decoder contain a fully connected forward network comprising two linear transformations and a ReLU activation, with the formula:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

The input of the first sub-module of the decoder module is the set $\mathbf{q}_j$ of the user's historically interacted items, which greatly enhances the model's interpretability. The input of each remaining sub-module is the output of the previous one. Each decoder sub-module consists of three layers: the first and second are self-attention layers, except that the second layer's input Q is the output of the previous layer while K and V are the encoder outputs; the third is a feed-forward layer. An "Add & Normalize" layer is added after each layer to prevent gradient vanishing or explosion while also preventing overfitting. The output of the model is converted to the required size by a fully connected layer and a softmax function to obtain the attention weight $a_{ij}$ for the subsequent work; this model is defined as Trans;
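The two building blocks named here, the feed-forward network and the "Add & Normalize" step, can be sketched as follows; reading "Add & Normalize" as a residual connection plus layer normalization follows the standard Transformer design the text references:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: two linear transformations
    around a ReLU activation."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def add_and_normalize(x, sublayer_out, eps=1e-6):
    """Residual connection followed by layer normalization, applied after
    each attention or feed-forward layer."""
    y = x + sublayer_out                             # "Add": residual connection
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)                  # "Normalize"
```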
Step 1.4: customize an objective function. Treat observed user-item interactions as positive examples and draw negative examples from the remaining unobserved interactions, with $R^{+}$ and $R^{-}$ denoting the positive and negative example sets. Use the cross-entropy loss as the objective and penalize the embedding vectors and the coefficient and bias terms of each network with the L2 norm. The objective function is then:

$$L = -\frac{1}{N}\left[\sum_{(u,i) \in R^{+}} \log \sigma(\hat{r}_{ui}) + \sum_{(u,j) \in R^{-}} \log\left(1 - \sigma(\hat{r}_{uj})\right)\right] + \lambda \lVert \Theta \rVert^{2}$$

where N is the total number of training examples, σ is the sigmoid function converting a predicted value into a probability representing the likelihood that user u interacts with item i, the hyperparameter λ controls the strength of the L2 penalty used to prevent overfitting, and $\Theta = \{\{\mathbf{p}_i\}, \{\mathbf{q}_j\}, W, b, h\}$ denotes all trainable parameters, where W, b, h and all parameters of the linear transformations carry the regularization penalty. A variant of stochastic gradient descent called Adagrad is used to optimize the objective function; it applies an adaptive learning rate to each parameter, draws random samples from the training examples, and updates the relevant parameters in the negative direction of the gradient.
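A numpy sketch of this objective on a batch of scored examples; the helper names and the batching interface are illustrative assumptions:

```python
import numpy as np

def bce_l2_loss(r_hat_pos, r_hat_neg, params, lam):
    """Log loss over positive predictions (R+) and sampled negative
    predictions (R-), plus a lambda-weighted L2 penalty on all trainable
    parameter arrays."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    n = len(r_hat_pos) + len(r_hat_neg)              # total training examples N
    log_loss = -(np.sum(np.log(sigmoid(r_hat_pos)))
                 + np.sum(np.log(1.0 - sigmoid(r_hat_neg)))) / n
    l2 = lam * sum(float(np.sum(w ** 2)) for w in params)
    return log_loss + l2
```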
The present embodiment implements all models using TensorFlow, which requires that all training instances in a batch have the same length; since some active users may have interacted with thousands of items, a sampled mini-batch training set can still be very large. To solve this problem, this embodiment uses a mini-batch method that randomly picks one user and then uses all of that user's interacted items as one small batch, instead of randomly drawing a fixed number of training examples as the mini-batch training set. This approach has two advantages: 1) no masking trick is needed, so it is faster; 2) no batch size needs to be specified, which avoids tuning the batch size. If the attention network and the item embedding vectors are trained simultaneously, the output of the attention network changes the item embeddings, so joint training easily causes co-adaptation and slows convergence. To solve this practical training problem, this embodiment pre-trains the model with the FISM algorithm proposed by Kabbur et al., initializing it with the item embedding vectors learned by FISM. Since the FISM algorithm has no co-adaptation problem, it can learn the similarity of the embedded items well. Initializing the model with FISM therefore greatly assists the learning of the attention network, yielding better performance and fast convergence.
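A sketch of this per-user batching strategy; the data-structure names and the negative-sampling ratio are assumptions for illustration:

```python
import random

def sample_user_minibatch(user_pos_items, all_items, neg_ratio=4):
    """Pick one user at random and use all of that user's interacted items
    as the batch (no padding or masking needed), pairing the positives with
    sampled unobserved items as negatives."""
    u = random.choice(list(user_pos_items))          # user_pos_items: dict user -> set
    positives = list(user_pos_items[u])
    candidates = list(all_items - user_pos_items[u])
    k = min(len(candidates), neg_ratio * len(positives))
    negatives = random.sample(candidates, k)
    return u, positives, negatives
```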
Step 2: conduct experiments on real item data sets using the evaluation indices, judge performance from the recommendation results, and compare the experimental results with other recommendation methods.
This embodiment assigns a weight to each item in the user's interaction history, so that when predicting the user's score for a target item, the set of historical items represents the user's preferences more accurately, improving the recommendation effect and making personalized recommendation more precise. These improvements are attributed to an effective attention mechanism introduced to distinguish the importance of historical items in the user representation. We performed comprehensive experiments on two real item data sets, ML-1M and Pinterest-20, using the evaluation indices HR and NDCG to evaluate Top-K recommendation. Performance is evaluated by the Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) of the first 10 positions of the recommendation results. These two indicators are widely used for evaluating Top-K recommendation and in the information retrieval literature. HR@10 can be interpreted as a recall-based metric, the percentage of users served successfully (i.e., the positive example appears in the top 10), while NDCG@10 additionally accounts for the predicted position of the positive example; for both metrics, larger is better.
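A sketch of the two metrics under the leave-one-out protocol implied above, where each user holds out one positive item:

```python
import numpy as np

def hr_ndcg_at_k(ranked_items, positive_item, k=10):
    """HR@k is 1 when the held-out positive appears in the top k of the
    ranked list; NDCG@k discounts that hit by the logarithm of its rank."""
    topk = list(ranked_items[:k])
    if positive_item in topk:
        rank = topk.index(positive_item)             # 0-based position of the hit
        return 1.0, 1.0 / np.log2(rank + 2)          # (HR@k, NDCG@k)
    return 0.0, 0.0
```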
We compared the experimental results with several popular recommendation methods. For the embedding-based methods (MF, MLP, FISM, and the models herein), the embedding size controls modeling capacity, so we set it to 16 for all methods. As shown in Table 1, the attention-based models all achieve better results, and their final results are similar: they obtain the highest NDCG and HR scores on all data sets. On the ML-1M data set, all three models improve on FISM to some degree; the Self model improves on FISM by 3.1% in HR and 4.3% in NDCG. This may be because a model of relatively simple structure captures user features more fully on a less sparse data set, characterizing user preferences well. On Pinterest-20, Trans is better than the other two: it reaches the highest score and improves on FISM by 3.2% in NDCG, probably because a deeper network captures sparse data better. The learning-based collaborative filtering methods generally perform much better than heuristic ones such as Pop and ItemKNN; in particular, FISM scores much higher than ItemKNN. Considering that both methods use the same predictive model and differ only in how item similarity is estimated, the positive impact of customized optimization on recommendation is clearly visible.
Table 1 Comparison of experimental results
As shown in FIG. 4, with an item embedding size of 16, the performance of FISM and of the proposed Dot, Self, and Trans is tracked at each epoch. The three proposed models reach the highest HR and NDCG scores on both data sets, attain the same performance level, and achieve a significant improvement over FISM. We attribute these advantages to the efficient design of the attention network when learning item-to-item interactions. Even at the first epoch, our models already clearly exceed FISM, and as training proceeds the experimental results keep improving until convergence.
Based on the above discussion, this work on item-based collaborative filtering incorporates various attention models to improve the learning of the dynamic weight coefficients, implements and evaluates them, and achieves better results than the compared models. The main contributions are: (1) demonstrating that the attention mechanism helps capture the dynamic weight of a new item's contribution to the similarity computed against the set of historical items the user has interacted with, making personalized recommendation more accurate; (2) computing attention scores with both pointwise attention and self-attention, with good results; (3) combining the Transformer model with the recommendation algorithm and comparing it with conventional embedding models, showing an improvement in recommendation effect.
Finally, it should be noted that the above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.

Claims (1)

1. An article recommendation method based on collaborative filtering, characterized in that the method comprises the following steps:

Step 1: compute the prediction score of user u for target item i. Using one-hot coding, obtain the embedding vectors p and q through an embedding layer, where p denotes that an item plays the role of predicted item and q that it plays the role of historical interacted item, and obtain the item's predicted score. The attention-based ItemCF formula is defined as follows:

$$\hat{r}_{ui} = \frac{1}{|R_u \setminus \{i\}|^{\alpha}} \sum_{j \in R_u \setminus \{i\}} a_{ij}\, \mathbf{p}_i^{\mathsf{T}} \mathbf{q}_j$$

$$a_{ij} = f(\mathbf{p}_i, \mathbf{q}_j)$$

where i is the predicted target item, j a historically interacted item of the user, $a_{ij}$ the weight, computed by an attention network, of historical item j's contribution to the representation of the user's preference, $\mathbf{p}_i$ and $\mathbf{q}_j$ the embedding vectors of the predicted item and of the user-interacted items respectively, $R_u$ the positive-example set of user u, $R_u \setminus \{i\}$ that set with item i removed, and $\frac{1}{|R_u \setminus \{i\}|^{\alpha}}$ a normalization coefficient;
Step 1.1: concatenate the embedding vector $\mathbf{p}_i$ of the predicted item and the embedding vector $\mathbf{q}_j$ of a user-interacted item to obtain the concatenation vector $\mathbf{c} = [\mathbf{p}_i \,;\, \mathbf{q}_j]$. The concatenation vector serves as the input of the pointwise attention model; this first attempt at an attention mechanism is named Dot;

Step 1.1.1: apply three independent linear transformations to the concatenation vector c, with coefficient matrices $W_Q$, $W_K$, and $W_V$ respectively, obtaining the attention network inputs Query, Key, and Value (Q, K, V);

Step 1.1.2: compute the dot product of Q with the transpose of K using a highly optimized matrix multiplication, apply softmax, then multiply by V to obtain the weight matrix. Expressing the attention function as Attention(Q, K, V), the calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$$

where $d_k$ denotes the dimension of K and the softmax function converts the values into a probability distribution; if Q, K, and V have the same dimensions, the output attention weight matrix has those same dimensions;
Step 1.2: feed the concatenation vector $\mathbf{c} = [\mathbf{p}_i \,;\, \mathbf{q}_j]$ into the network as input, repeat the preceding single dot-product attention h times, concatenate the h result matrices, and finally convert the result to the required dimensionality through a linear transformation; that is, the attention function is set as a self-attention model to compute the weight of historical item j's contribution to user u's predicted score for target item i, named Self;

Step 1.3: use the main framework of the Transformer model, which divides into an encoder module and a decoder module. The input of the first sub-module of the encoder module is the embedding vector $\mathbf{p}_i$ of the target item to be predicted; the input of each remaining sub-module is the output of the previous one. Each encoder sub-module consists of two layers: the first is a self-attention model layer and the second a feed-forward layer. After the attention operation, both the encoder and the decoder contain a fully connected forward network comprising two linear transformations and a ReLU activation, with the formula:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

The input of the first sub-module of the decoder module is the set $\mathbf{q}_j$ of the user's historically interacted items; the input of each remaining sub-module is the output of the previous one. Each decoder sub-module consists of three layers: the first and second are self-attention layers, except that the second layer's input Q is the output of the previous layer while K and V are the encoder outputs; the third is a feed-forward layer. An "Add & Normalize" layer is added after each layer to prevent gradient vanishing or explosion while also preventing overfitting. The output of the model is converted to the required size by a fully connected layer and a softmax function to obtain the attention weight $a_{ij}$ for the subsequent work; this model is defined as Trans;
Step 1.4: customize an objective function. Treat observed user-item interactions as positive examples and draw negative examples from the remaining unobserved interactions, with $R^{+}$ and $R^{-}$ denoting the positive and negative example sets. Use the log loss as the loss term and penalize the embedding vectors and the coefficient and bias terms of each network with the L2 norm. The loss function is then:

$$L = -\frac{1}{N}\left[\sum_{(u,i) \in R^{+}} \log \sigma(\hat{r}_{ui}) + \sum_{(u,j) \in R^{-}} \log\left(1 - \sigma(\hat{r}_{uj})\right)\right] + \lambda \lVert \Theta \rVert^{2}$$

where N is the total number of training examples, σ is the sigmoid function converting predicted values into probability values, the hyperparameter λ controls the strength of the L2 penalty used to prevent overfitting, and $\Theta = \{\{\mathbf{p}_i\}, \{\mathbf{q}_j\}, W, b, h\}$ denotes all trainable parameters, where W, b, h and all parameters of the linear transformations carry the regularization penalty. A variant of stochastic gradient descent called Adagrad is used to optimize the objective function; it applies an adaptive learning rate to each parameter, draws random samples from the training examples, and updates the relevant parameters in the negative direction of the gradient. A mini-batch method randomly picks a user and uses all of that user's interacted items as one small batch;

Step 2: conduct experiments on real item data sets using the evaluation indices, judge performance from the recommendation results, and compare the experimental results with other recommendation methods.
CN201911022328.2A 2019-10-25 2019-10-25 Article recommendation method based on collaborative filtering Active CN110781409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911022328.2A CN110781409B (en) 2019-10-25 2019-10-25 Article recommendation method based on collaborative filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911022328.2A CN110781409B (en) 2019-10-25 2019-10-25 Article recommendation method based on collaborative filtering

Publications (2)

Publication Number Publication Date
CN110781409A CN110781409A (en) 2020-02-11
CN110781409B true CN110781409B (en) 2022-02-01

Family

ID=69388037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911022328.2A Active CN110781409B (en) 2019-10-25 2019-10-25 Article recommendation method based on collaborative filtering

Country Status (1)

Country Link
CN (1) CN110781409B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737573A (en) * 2020-06-17 2020-10-02 北京三快在线科技有限公司 Resource recommendation method, device, equipment and storage medium
CN112182156B (en) * 2020-09-28 2023-02-07 齐鲁工业大学 Aspect-level interpretable deep network scoring prediction recommendation method based on text processing
CN112328908B (en) * 2020-11-11 2022-10-28 北京工业大学 Personalized recommendation method based on collaborative filtering
CN112529414B (en) * 2020-12-11 2023-08-01 西安电子科技大学 Article scoring method based on multi-task neural collaborative filtering network
CN112784153B (en) * 2020-12-31 2022-09-20 山西大学 Tourist attraction recommendation method integrating attribute feature attention and heterogeneous type information
CN113158024B (en) * 2021-02-26 2022-07-15 中国科学技术大学 Causal reasoning method for correcting popularity deviation of recommendation system
CN112967101B (en) * 2021-04-07 2023-04-07 重庆大学 Collaborative filtering article recommendation method based on multi-interaction information of social users
CN115712828A (en) * 2021-08-18 2023-02-24 华为技术有限公司 Image classification method and related equipment thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018212710A1 (en) * 2017-05-19 2018-11-22 National University Of Singapore Predictive analysis methods and systems
CN109087130A (en) * 2018-07-17 2018-12-25 深圳先进技术研究院 A kind of recommender system and recommended method based on attention mechanism
CN109299396A (en) * 2018-11-28 2019-02-01 东北师范大学 Merge the convolutional neural networks collaborative filtering recommending method and system of attention model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018212710A1 (en) * 2017-05-19 2018-11-22 National University Of Singapore Predictive analysis methods and systems
CN109087130A (en) * 2018-07-17 2018-12-25 深圳先进技术研究院 A kind of recommender system and recommended method based on attention mechanism
CN109299396A (en) * 2018-11-28 2019-02-01 东北师范大学 Merge the convolutional neural networks collaborative filtering recommending method and system of attention model

Also Published As

Publication number Publication date
CN110781409A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
CN110781409B (en) Article recommendation method based on collaborative filtering
Huang et al. A deep reinforcement learning based long-term recommender system
CN109299396B (en) Convolutional neural network collaborative filtering recommendation method and system fusing attention model
CN111538912B (en) Content recommendation method, device, equipment and readable storage medium
CN110717098B (en) Meta-path-based context-aware user modeling method and sequence recommendation method
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
Karatzoglou et al. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering
CN111737578B (en) Recommendation method and system
CA2634020A1 (en) System and method for multi-level online learning
CN112364976A (en) User preference prediction method based on session recommendation system
CN114202061A (en) Article recommendation method, electronic device and medium based on generation of confrontation network model and deep reinforcement learning
CN111581520A (en) Item recommendation method and system based on item importance in session
Chen et al. Generative inverse deep reinforcement learning for online recommendation
CN114693397A (en) Multi-view multi-modal commodity recommendation method based on attention neural network
CN110727872A (en) Method and device for mining ambiguous selection behavior based on implicit feedback
Chen et al. Session-based recommendation: Learning multi-dimension interests via a multi-head attention graph neural network
Wang et al. Modeling uncertainty to improve personalized recommendations via Bayesian deep learning
CN116228368A (en) Advertisement click rate prediction method based on deep multi-behavior network
CN117216281A (en) Knowledge graph-based user interest diffusion recommendation method and system
CN113763031A (en) Commodity recommendation method and device, electronic equipment and storage medium
Gasmi et al. Context-aware based evolutionary collaborative filtering algorithm
CN110851705A (en) Project-based collaborative storage recommendation method and recommendation device thereof
CN115687757A (en) Recommendation method fusing hierarchical attention and feature interaction and application system thereof
CN110956528B (en) Recommendation method and system for e-commerce platform
Ren et al. A hybrid recommender approach based on widrow-hoff learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant