CN108363804B - Local model weighted fusion Top-N movie recommendation method based on user clustering - Google Patents


Info

Publication number
CN108363804B
CN108363804B
Authority
CN
China
Prior art keywords
user
model
movie
document
topic
Prior art date
Legal status
Active
Application number
CN201810169922.3A
Other languages
Chinese (zh)
Other versions
CN108363804A (en)
Inventor
汤颖
孙康高
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810169922.3A priority Critical patent/CN108363804B/en
Publication of CN108363804A publication Critical patent/CN108363804A/en
Application granted granted Critical
Publication of CN108363804B publication Critical patent/CN108363804B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The local model weighted fusion Top-N movie recommendation method based on user clustering comprises the following steps. Step 1: data preprocessing; remove inactive users and movies with low popularity; construct user movie-tag documents; convert the explicit rating information into implicit feedback and build the user-movie implicit feedback matrix A. Step 2: user clustering; train an LDA topic model on the movie tag information to obtain user feature vectors and cluster the users with a spectral clustering algorithm. Step 3: determine the local recommendation models and train the global recommendation model. Step 4: model weighted fusion recommendation. Step 5: verify the effectiveness of the model by leave-one-out cross-validation.

Description

Local model weighted fusion Top-N movie recommendation method based on user clustering
Technical Field
The invention relates to a movie recommendation method on a network.
Background
With the rapid development of information technology and social networks, the data generated on the internet has grown exponentially, ushering in the era of big data. As data volume increases, it becomes harder and harder for people to find the information they actually want in massive data sets. This is where recommendation systems deliver their greatest value: based on user data, item information and users' historical behavior, a recommendation algorithm can accurately predict user preferences and recommend items a user is likely to be interested in, greatly reducing the cost of finding the target information.
Recommendation algorithms can be divided into content-based recommendation and collaborative filtering. Modern recommendation systems have two main tasks: rating prediction, and Top-N recommendation, the latter being the most widely used in real business scenarios. A Top-N recommendation algorithm presents the user with a ranked list of N items from which to pick something of interest. Top-N recommendation models fall into two main families, neighborhood-based collaborative filtering and model-based collaborative filtering. The former can be further subdivided into user-based neighborhood models (UserKNN) and item-based neighborhood models (ItemKNN); the latter is typified by latent factor models.
As the saying goes, "birds of a feather flock together": different user groups often form their own behavior patterns, so the similarity between the same two items can change from one group to another. A single recommendation model, however, usually cannot capture such local differences in similarity; it treats the similarity of two items as identical in every context, fails to accurately capture users' real preferences, and lowers the quality of personalized recommendation. Algorithms that train multiple local recommendation models and fuse them to improve the overall recommendation effect can alleviate this problem to some extent, but they often do not make full use of the data available in the recommendation scenario; the data they use is one-dimensional, and the final recommendation effect is mediocre.
Disclosure of Invention
To address the problems that a single model cannot accurately capture user preferences and that existing multi-model fusion algorithms rely on a single kind of training data, the invention proposes a new local model weighted fusion movie recommendation algorithm based on user clustering to realize Top-N personalized movie recommendation.
The invention uses the textual content information of movies to compute semantic-level user feature vectors through an LDA topic model, and clusters users with a spectral clustering algorithm on these feature vectors to build local user groups. It further uses the users' ratings of movies to build local recommendation models and a global recommendation model with a sparse linear model, and realizes the final Top-N personalized movie recommendation through linear weighted fusion of the local and global models.
The general flow of the local model weighted fusion Top-N movie recommendation method based on user clustering is shown in FIG. 1; the method specifically comprises the following steps:
Step 1: data preprocessing stage. Data cleaning is performed on inactive users and movies with low popularity; user movie-tag documents are constructed; the explicit rating information is converted into implicit feedback, and the user-movie implicit feedback matrix A is built;
1.1 Perform data cleaning on the original data set: remove users who have rated fewer than 20 movies and remove movies that have been rated fewer than 20 times, obtaining a new training data set;
1.2 Collect the tags assigned by all users to the movies in the new data set to build a tag dictionary. Each user is represented by a document consisting of the tags of all the movies the user has watched; the documents of all users form a corpus, and the TF-IDF value of each word in the corpus is computed. The term frequency TF, the inverse document frequency IDF and the term frequency-inverse document frequency TF-IDF are computed as in formulas (1), (2) and (3);
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} is the number of occurrences of t_i in d_j, and \sum_k n_{k,j} is the total number of word occurrences in d_j. IDF_i denotes the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |\{\,j : t_i \in d_j\,\}| is the number of documents containing t_i. TFIDF_{i,j} is the TF-IDF weight of word t_i in document d_j;
1.3 Convert the explicit 1-5 ratings into 0-1 implicit feedback: if the current user has rated the current movie, the entry is marked 1; if the user has not rated it (i.e. it is a candidate movie to be recommended), the entry is marked 0. This yields an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
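A minimal sketch of this preprocessing stage follows. It assumes MovieLens-style pandas DataFrames named ratings (userId, movieId, rating) and tags (userId, movieId, tag); these names, the helper preprocess, and the use of scikit-learn's TfidfVectorizer (whose IDF is a smoothed variant of formula (2)) are illustrative assumptions rather than part of the patent.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(ratings: pd.DataFrame, tags: pd.DataFrame, min_count: int = 20):
    # 1.1 drop users with fewer than 20 rated movies and movies rated fewer than 20 times
    user_counts = ratings.groupby("userId").size()
    movie_counts = ratings.groupby("movieId").size()
    ratings = ratings[ratings["userId"].isin(user_counts[user_counts >= min_count].index)
                      & ratings["movieId"].isin(movie_counts[movie_counts >= min_count].index)]

    users = sorted(ratings["userId"].unique())
    movies = sorted(ratings["movieId"].unique())
    u_idx = {u: i for i, u in enumerate(users)}
    m_idx = {m: j for j, m in enumerate(movies)}

    # 1.2 one "document" per user: the concatenated tags of every movie the user rated
    tags = tags[tags["userId"].isin(u_idx) & tags["movieId"].isin(m_idx)]
    docs = tags.groupby("userId")["tag"].apply(lambda t: " ".join(map(str, t)))
    corpus = [docs.get(u, "") for u in users]
    X_tfidf = TfidfVectorizer().fit_transform(corpus)   # n_users x dictionary_size TF-IDF vectors

    # 1.3 explicit 1-5 ratings -> 0/1 implicit feedback matrix A (n_users x n_movies)
    rows = ratings["userId"].map(u_idx).to_numpy()
    cols = ratings["movieId"].map(m_idx).to_numpy()
    A = csr_matrix((np.ones(len(ratings)), (rows, cols)), shape=(len(users), len(movies)))
    A.data[:] = 1.0                                      # rated -> 1, unrated -> 0
    return A, X_tfidf, users, movies
```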
Step 2: user clustering stage. Using the movie tag information, an LDA topic model is trained to obtain user feature vectors, and the users are clustered with a spectral clustering algorithm;
2.1 The LDA topic model is a document-topic-word three-layer Bayesian network: given a corpus, it can infer the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is shown in formula (4);
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
\theta denotes the topic distribution of a document, z denotes a topic, w denotes a document, \alpha denotes the Dirichlet prior of the per-document multinomial distribution over topics, \beta denotes the Dirichlet prior of the per-topic multinomial distribution over words, N denotes the number of words in the document, z_n denotes the topic of the nth word of the document, and w_n denotes the nth word of the document;
Each movie carries several tags assigned by users. A movie tag is mapped to a word w_n, the set of tags of all movies viewed by a user is mapped to a document w, and a particular type of movie preferred by the user is mapped to a topic z. If the data set contains n users, a corpus of n documents and a tag dictionary can be generated; each document in the corpus is represented by a vector of dictionary length whose entries are the TF-IDF values, in the user document and the corpus, of the tags at the corresponding dictionary positions;
To distinguish more distinct user groups, the larger the difference between topics, the better. To determine the optimal number of topics, several LDA models are trained with different topic counts, the average similarity between the topic vectors obtained from each trained model is computed, and the topic count of the model with the smallest average topic-vector similarity is taken as the optimal number of topics. LDA training yields the topic distribution \theta of each document, which is used as the feature vector of the corresponding user;
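The topic-number selection and user feature extraction described above can be sketched as follows, assuming scikit-learn's LatentDirichletAllocation as the LDA implementation (the patent names no specific one) and cosine similarity as the measure of similarity between topic vectors; the candidate topic counts and function names are illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def avg_topic_similarity(topic_word: np.ndarray) -> float:
    # mean pairwise cosine similarity of the topic-word vectors (diagonal excluded)
    sims = cosine_similarity(topic_word)
    k = topic_word.shape[0]
    return (sims.sum() - k) / (k * (k - 1))

def fit_user_features(X_tfidf, candidate_ks=(5, 10, 15, 20), seed=0):
    best_score, best_lda = None, None
    for k in candidate_ks:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(X_tfidf)
        # row-normalize components_ to obtain the topic-word distributions
        topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        score = avg_topic_similarity(topic_word)
        if best_score is None or score < best_score:
            best_score, best_lda = score, lda        # keep the model with least similar topics
    theta = best_lda.transform(X_tfidf)              # document-topic distribution = user features
    return theta, best_lda
```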
2.2 Cluster the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
The number of clusters must be determined before clustering. Since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, the importance of each topic in the current user population is determined by summing all user feature vectors dimension-wise and averaging, which gives a topic strength vector for the population as a whole; the optimal number of clusters is then chosen by inspecting the value distribution of this vector. For example, in an LDA training run with 10 topics, the 10-dimensional topic strength vector obtained this way is visualized in FIG. 2 (vertical axis: topic strength; horizontal axis: topic). Inspection shows that topics 2, 9, 3, 8 and 6 have the highest strength in the current data set, meaning that most people like to watch these types of movies, so in this situation it is preferable to cluster the users into 5 groups with the spectral clustering algorithm. The specific steps of the spectral clustering algorithm are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors t_1, t_2, ..., t_k of L;
(4) stack the k column vectors t_1, t_2, ..., t_k into a matrix T \in R^{n \times k};
(5) for i = 1, ..., n, let y_i \in R^k be the ith row vector of T;
(6) cluster the points (y_i)_{i=1,...,n} into clusters C_1, C_2, ..., C_k with the K-Means algorithm;
For each user cluster, all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster are set to 0, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster index and P_u \in \{1, ..., k\};
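A sketch of this clustering stage, under the assumption that scikit-learn's SpectralClustering (which internally performs the Laplacian, eigenvector and K-Means steps listed above) stands in for the spectral clustering algorithm; the cluster count n_clusters is chosen manually from the topic strength vector, as in the FIG. 2 example.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from sklearn.cluster import SpectralClustering

def topic_strength(theta: np.ndarray) -> np.ndarray:
    # dimension-wise average of all user feature vectors (used to choose the cluster count)
    return theta.mean(axis=0)

def cluster_users(theta: np.ndarray, n_clusters: int, seed=0) -> np.ndarray:
    sc = SpectralClustering(n_clusters=n_clusters, affinity="rbf", random_state=seed)
    return sc.fit_predict(theta)                 # cluster label P_u for every user

def local_matrices(A: csr_matrix, labels: np.ndarray) -> dict:
    # A^{P_u}: zero out the rows of A whose users do not belong to cluster P_u
    out = {}
    for c in np.unique(labels):
        row_mask = (labels == c).astype(float)
        A_c = (diags(row_mask) @ A).tocsr()      # row-mask A without densifying it
        A_c.eliminate_zeros()
        out[c] = A_c
    return out
```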
Step 3: local recommendation model determination and global recommendation model training. The loss function of the sparse linear model SLIM is shown in formula (5);
\min_{W}\; \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\rho}{2}\lVert W\rVert_F^2 + \alpha\lVert W\rVert_1 \quad \text{s.t.}\ \operatorname{diag}(W) = 0 \qquad (5)
where A denotes the user-movie implicit feedback matrix, and \alpha and \rho control the weights of the L1 and L2 norms. Minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, while the L2 norm controls model complexity and prevents overfitting. The model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6);
\min_{w_j}\; \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\rho}{2}\lVert w_j\rVert_2^2 + \alpha\lVert w_j\rVert_1 \quad \text{s.t.}\ w_{jj} = 0 \qquad (6)
where a_j denotes the jth column of matrix A. The predicted recommendation score \tilde{a}_{ij} of user i for movie j is computed as in formula (7);

\tilde{a}_{ij} = a_i^{T} w_j = \sum_{l} a_{il}\, w_{lj} \qquad (7)
Using the sparse linear model SLIM as the base recommendation model, a global recommendation model and local recommendation models are constructed: training on the user-movie implicit feedback matrix A yields the global movie similarity matrix W, and training on each local implicit feedback training matrix A^{P_u} yields the local movie similarity matrix W^{P_u} corresponding to each cluster;
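A sketch of this training step. It solves the per-column problem (6) with scikit-learn's ElasticNet (coordinate descent) rather than the stochastic gradient descent named in the text, maps \alpha and \rho onto sklearn's alpha/l1_ratio only approximately, and enforces w_jj = 0 by zeroing column j of the design matrix; all of these are simplifying assumptions.

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.linear_model import ElasticNet

def train_slim(A, l1_reg=1e-3, l2_reg=1e-4, max_iter=200) -> np.ndarray:
    A = csc_matrix(A)                            # cheap column slicing
    m = A.shape[1]
    W = np.zeros((m, m))
    alpha = l1_reg + l2_reg                      # rough mapping onto sklearn's penalties
    l1_ratio = l1_reg / alpha
    for j in range(m):                           # columns are independent, so this loop parallelizes
        a_j = np.asarray(A[:, j].todense()).ravel()
        X = A.copy()
        X.data[X.indptr[j]:X.indptr[j + 1]] = 0.0    # drop column j so that w_jj = 0
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           fit_intercept=False, max_iter=max_iter)
        model.fit(X, a_j)                        # per-column elastic-net problem, cf. formula (6)
        W[:, j] = model.coef_
        W[j, j] = 0.0
    return W

def train_global_and_local(A, A_locals: dict, **kw):
    # global similarity matrix W from A, one local matrix W^{P_u} per cluster from A^{P_u}
    W_global = train_slim(A, **kw)
    W_locals = {c: train_slim(A_c, **kw) for c, A_c in A_locals.items()}
    return W_global, W_locals
```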
Step 4: weighted fusion of the models. The weighted fusion recommendation score of the local and global models is computed as in formula (8);
\tilde{a}_{uj} = g \sum_{l \in R_u} w_{lj} + (1 - g) \sum_{l \in R_u} w_{lj}^{P_u} \qquad (8)

where \tilde{a}_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model. Adjusting g controls the relative weight of the global and local models in the fused model, and the best recommendation effect is obtained by determining the optimal weight parameter g; the optimal global-model weight for the current data set can be determined experimentally. After all model parameters are determined, the weighted fusion recommendation scores of all movies are computed for the current user u, the movies the user has already interacted with are removed, the remaining movies are sorted by score in descending order, and the top N movies are recommended to the current user;
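A sketch of the weighted fusion recommendation of formula (8); A, W_global, W_locals and labels are assumed to come from the previous steps, and g and N are the fusion weight and list length.

```python
import numpy as np

def recommend_top_n(u, A, W_global, W_locals, labels, g=0.5, N=10):
    a_u = np.asarray(A[u].todense()).ravel()                      # implicit feedback row of user u
    W_local = W_locals[labels[u]]                                 # similarity matrix of cluster P_u
    scores = g * (a_u @ W_global) + (1.0 - g) * (a_u @ W_local)   # formula (8)
    scores[a_u > 0] = -np.inf                     # remove movies u has already interacted with
    return np.argsort(-scores)[:N]                # indices of the N highest fused scores
```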
and 5, the recommendation method can prove the effectiveness of the model through leave-one-out cross validation. One movie may be randomly drawn from the movie score set for each user into a test set, with the other movies used as training sets for the model. Then, a list of Top-N movies is recommended for each user by using the trained model, and whether the corresponding movie of the user appears in the recommended list and the specific position p of the movie appears in the list is observed in the test seti. Finally, the recommendation quality of the model can be measured by two indexes of Hit Rate (HR) and Average Ranking Hit Rate (ARHR), where # hits represents the number of recommendation hits, # users represents the total number of users, and their definitions are shown in equations (9), (10);
HR = \frac{\#hits}{\#users} \qquad (9)

ARHR = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{p_i} \qquad (10)
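A sketch of the leave-one-out evaluation with HR and ARHR as defined in formulas (9) and (10); the held_out mapping and the recommend callback are assumed to be produced by the earlier steps (with the held-out movies excluded from training).

```python
def evaluate(held_out: dict, recommend, N: int = 10):
    """held_out maps user index -> held-out movie index; recommend(u, N) returns that
    user's Top-N list from a model trained without the held-out movie."""
    hits, arhr_sum = 0, 0.0
    for u, m in held_out.items():
        top_n = list(recommend(u, N))
        if m in top_n:
            hits += 1
            arhr_sum += 1.0 / (top_n.index(m) + 1)   # hit position p_i is 1-based
    n_users = len(held_out)
    hr = hits / n_users                               # formula (9)
    arhr = arhr_sum / n_users                         # formula (10)
    return hr, arhr
```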
the recommendation method flow steps are thus complete.
Combining the above techniques, the invention provides a local model weighted fusion Top-N movie recommendation algorithm based on user clustering. To address the problem that a traditional single recommendation model cannot accurately estimate local differences between items and therefore cannot accurately capture user preferences, a global recommendation model and user-clustering-based local recommendation models are trained separately, and Top-N movie recommendation is realized through linear weighted fusion of these models. In addition, to make full use of the data available in the movie recommendation scenario and to improve recommendation quality along multiple dimensions, the invention uses movie tag information to compute semantic-level user feature vectors through an LDA topic model and thereby realizes semantic-level grouping of the users.
The invention has the following advantages: (1) A novel algorithmic approach. With the sparse linear model as the base recommendation model, a global recommendation model and user-clustering-based local recommendation models are trained separately, and the final fused model is produced by linear weighted fusion. (2) Recommendation quality is improved along multiple dimensions. Besides training recommendation models on traditional rating data, the user clustering stage introduces movie tag data: an LDA topic model analyzes the topic attributes of the user population at the semantic level to obtain user feature vectors, and a spectral clustering algorithm groups the users, further improving recommendation quality. (3) The algorithm is simple and fast to implement. In the training stage of the local and global models, the models are independent of one another, as is each column of their similarity matrices, so training can be parallelized, greatly reducing training time and improving training efficiency. (4) Better recommendation quality. The proposed local model weighted fusion recommendation algorithm is an effective combination of content-based recommendation, neighborhood-based collaborative filtering and model-based collaborative filtering; it exploits the strengths of each approach and compensates for their weaknesses, greatly improving recommendation quality compared with using any single algorithm alone.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention;
FIG. 2 is a plot of the subject intensity profile of the method of the present invention.
Detailed Description
Referring to the general flow chart of FIG. 1, the invention has four stages: a data preprocessing stage, a user clustering stage, a global and local recommendation model training stage, and a linear weighted fusion recommendation stage. In the data preprocessing stage, the data set is cleaned, inactive users and cold movies are removed, and the corpus used for LDA topic model training and the user-movie implicit feedback training matrix used for sparse linear model training are constructed. In the user clustering stage, the user corpus from the first stage is used to train an LDA topic model and obtain user feature vectors, the users are clustered with a spectral clustering algorithm, and a local implicit feedback training matrix is generated for each cluster. In the model training stage, the original implicit feedback matrix and the local implicit feedback matrices are used to train the global model and the local models with the sparse linear model, respectively. In the linear weighted fusion recommendation stage, the global model and the local models obtained above are fused by linear weighting to produce the final recommendation model.
The input of the method is the users' movie rating data and the movies' tag data; the output is a Top-N personalized movie recommendation list for each user.
The method comprises the following specific steps:
Step 1: data preprocessing stage. Data cleaning is performed on inactive users and movies with low popularity; user movie-tag documents are constructed; the explicit rating information is converted into implicit feedback, and the user-movie implicit feedback matrix A is built;
1.1 Perform data cleaning on the original data set: remove users who have rated fewer than 20 movies and remove movies that have been rated fewer than 20 times, obtaining a new training data set;
1.2 Collect the tags assigned by all users to the movies in the new data set to build a tag dictionary. Each user is represented by a document consisting of the tags of all the movies the user has watched; the documents of all users form a corpus, and the TF-IDF value of each word in the corpus is computed. The formulas of TF (term frequency), IDF (inverse document frequency) and TF-IDF (term frequency-inverse document frequency) are shown in (1), (2) and (3);
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} is the number of occurrences of t_i in d_j, and \sum_k n_{k,j} is the total number of word occurrences in d_j. IDF_i denotes the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |\{\,j : t_i \in d_j\,\}| is the number of documents containing t_i. TFIDF_{i,j} is the TF-IDF weight of word t_i in document d_j;
1.3 Convert the explicit 1-5 ratings into 0-1 implicit feedback: if the current user has rated the current movie, the entry is marked 1; if the user has not rated it (i.e. it is a candidate movie to be recommended), the entry is marked 0. This yields an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
Step 2: user clustering stage. Using the movie tag information, an LDA topic model is trained to obtain user feature vectors, and the users are clustered with a spectral clustering algorithm;
2.1 The LDA topic model is a document-topic-word three-layer Bayesian network: given a corpus, it can infer the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is shown in formula (4);
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
\theta denotes the topic distribution of a document, z denotes a topic, w denotes a document, \alpha denotes the Dirichlet prior of the per-document multinomial distribution over topics, \beta denotes the Dirichlet prior of the per-topic multinomial distribution over words, N denotes the number of words in the document, z_n denotes the topic of the nth word of the document, and w_n denotes the nth word of the document;
Each movie carries several tags assigned by users. A movie tag is mapped to a word w_n, the set of tags of all movies viewed by a user is mapped to a document w, and a particular type of movie preferred by the user is mapped to a topic z. If the data set contains n users, a corpus of n documents and a tag dictionary can be generated; each document in the corpus is represented by a vector of dictionary length whose entries are the TF-IDF values, in the user document and the corpus, of the tags at the corresponding dictionary positions;
To distinguish more distinct user groups, the larger the difference between topics, the better. To determine the optimal number of topics, several LDA models are trained with different topic counts, the average similarity between the topic vectors obtained from each trained model is computed, and the topic count of the model with the smallest average topic-vector similarity is taken as the optimal number of topics. LDA training yields the topic distribution \theta of each document, which is used as the feature vector of the corresponding user;
2.2 Cluster the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
The number of clusters must be determined before clustering. Since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, the importance of each topic in the current user population is determined by summing all user feature vectors dimension-wise and averaging, which gives a topic strength vector for the population as a whole; the optimal number of clusters is then chosen by inspecting the value distribution of this vector. For example, in an LDA training run with 10 topics, the 10-dimensional topic strength vector obtained this way is visualized in FIG. 2 (vertical axis: topic strength; horizontal axis: topic). Inspection shows that topics 2, 9, 3, 8 and 6 have the highest strength in the current data set, meaning that most people like to watch these types of movies, so in this situation it is preferable to cluster the users into 5 groups with the spectral clustering algorithm. The specific steps of the spectral clustering algorithm are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors t_1, t_2, ..., t_k of L;
(4) stack the k column vectors t_1, t_2, ..., t_k into a matrix T \in R^{n \times k};
(5) for i = 1, ..., n, let y_i \in R^k be the ith row vector of T;
(6) cluster the points (y_i)_{i=1,...,n} into clusters C_1, C_2, ..., C_k with the K-Means algorithm;
For each user cluster, all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster are set to 0, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster index and P_u \in \{1, ..., k\};
Step 3: local recommendation model determination and global recommendation model training. The loss function of the sparse linear model SLIM is shown in formula (5);
\min_{W}\; \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\rho}{2}\lVert W\rVert_F^2 + \alpha\lVert W\rVert_1 \quad \text{s.t.}\ \operatorname{diag}(W) = 0 \qquad (5)
where A denotes the user-movie implicit feedback matrix, and \alpha and \rho control the weights of the L1 and L2 norms. Minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, while the L2 norm controls model complexity and prevents overfitting. The model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6);
\min_{w_j}\; \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\rho}{2}\lVert w_j\rVert_2^2 + \alpha\lVert w_j\rVert_1 \quad \text{s.t.}\ w_{jj} = 0 \qquad (6)
where a_j denotes the jth column of matrix A. The predicted recommendation score \tilde{a}_{ij} of user i for movie j is computed as in formula (7);

\tilde{a}_{ij} = a_i^{T} w_j = \sum_{l} a_{il}\, w_{lj} \qquad (7)
Using the sparse linear model SLIM as the base recommendation model, a global recommendation model and local recommendation models are constructed: training on the user-movie implicit feedback matrix A yields the global movie similarity matrix W, and training on each local implicit feedback training matrix A^{P_u} yields the local movie similarity matrix W^{P_u} corresponding to each cluster;
Step 4: weighted fusion of the models. The weighted fusion recommendation score of the local and global models is computed as in formula (8);
\tilde{a}_{uj} = g \sum_{l \in R_u} w_{lj} + (1 - g) \sum_{l \in R_u} w_{lj}^{P_u} \qquad (8)

where \tilde{a}_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model. Adjusting g controls the relative weight of the global and local models in the fused model, and the best recommendation effect is obtained by determining the optimal weight parameter g; the optimal global-model weight for the current data set can be determined experimentally. After all model parameters are determined, the weighted fusion recommendation scores of all movies are computed for the current user u, the movies the user has already interacted with are removed, the remaining movies are sorted by score in descending order, and the top N movies are recommended to the current user;
and 5, the recommendation method can prove the effectiveness of the model through leave-one-out cross validation. One movie may be randomly drawn from the movie score set for each user into a test set, with the other movies used as training sets for the model. Then, a list of Top-N movies is recommended for each user by using the trained model, and whether the corresponding movie of the user appears in the recommended list and the specific position p of the movie appears in the list is observed in the test seti. Finally, the recommendation quality of the model can be measured by two indexes of Hit Rate (HR) and Average Ranking Hit Rate (ARHR), where # hits represents recommendation qualityThe number of hits, # users represents the total number of users, and their definitions are shown in formulas (9), (10);
HR = \frac{\#hits}{\#users} \qquad (9)

ARHR = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{p_i} \qquad (10)
the recommendation method flow steps are thus complete.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims (1)

1. The local model weighted fusion Top-N movie recommendation method based on user clustering specifically comprises the following steps:
step 1: data preprocessing; performing data cleaning on inactive users and movies with low popularity; constructing user movie-tag documents; converting the explicit rating information into implicit feedback and constructing the user-movie implicit feedback matrix A;
1.1 performing data cleaning on the original data set: removing users who have rated fewer than 20 movies and removing movies that have been rated fewer than 20 times to obtain a new training data set;
1.2 collecting the tags assigned by all users to the movies in the new data set to build a tag dictionary, representing the current user by a document consisting of the tags of all the movies the user has watched, forming a corpus from the documents of all users, and computing the TF-IDF value of each word of the corpus in its document; the term frequency TF, the inverse document frequency IDF and the term frequency-inverse document frequency TF-IDF are computed as in formulas (1), (2) and (3);
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
wherein TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} is the number of occurrences of t_i in d_j, and \sum_k n_{k,j} is the total number of word occurrences in d_j; IDF_i denotes the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |\{\,j : t_i \in d_j\,\}| is the number of documents containing t_i; TFIDF_{i,j} is the TF-IDF weight of word t_i in document d_j;
1.3 converting the explicit 1-5 ratings into 0-1 implicit feedback: if the current user has rated the current movie, marking the entry 1, and if the user has not rated it, i.e. it is a candidate movie to be recommended, marking the entry 0, so as to obtain an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
step 2: user clustering; using the movie tag information, training an LDA topic model to obtain user feature vectors, and clustering the users with a spectral clustering algorithm;
2.1 the LDA topic model is a document-topic-word three-layer Bayesian network; given a corpus, the LDA topic model infers the topic distribution of each document in the corpus and the word distribution of each topic; its joint probability is shown in formula (4);
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
\theta denotes the topic distribution of a document, z denotes a topic, w denotes a document, \alpha denotes the Dirichlet prior of the per-document multinomial distribution over topics, \beta denotes the Dirichlet prior of the per-topic multinomial distribution over words, N denotes the number of words in the document, z_n denotes the topic of the nth word of the document, and w_n denotes the nth word of the document;
each movie carries several tags assigned by users; a movie tag is mapped to a word w_n, the set of tags of all movies viewed by a user is mapped to a document w, and a particular type of movie preferred by the user is mapped to a topic z; if the data set contains n users, a corpus of n documents and a tag dictionary can be generated, each document in the corpus being represented by a vector of dictionary length whose entries are the TF-IDF values, in the user document and the corpus, of the tags at the corresponding dictionary positions;
in order to distinguish more distinct user groups, the larger the difference between topics, the better; in order to determine the optimal number of topics, several LDA models are trained with different topic counts, the average similarity between the topic vectors obtained from each trained model is computed, and the topic count of the model with the smallest average topic-vector similarity is taken as the optimal number of topics; LDA training yields the topic distribution \theta of each document, which is used as the feature vector of the corresponding user;
2.2 clustering the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
before clustering, the number of clusters is determined first; since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, in order to determine the importance of each topic in the current user population, all user feature vectors are summed dimension-wise and averaged to obtain a topic strength vector for the whole population, and the optimal number of clusters is determined by inspecting the value distribution of this vector; the specific steps of the spectral clustering algorithm are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors t_1, t_2, ..., t_k of L;
(4) stack the k column vectors t_1, t_2, ..., t_k into a matrix T \in R^{n \times k};
(5) for i = 1, ..., n, let y_i \in R^k be the ith row vector of T;
(6) cluster the points (y_i)_{i=1,...,n} into clusters C_1, C_2, ..., C_k with the K-Means algorithm;
for each user cluster, all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster are set to 0, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster index and P_u \in \{1, ..., k\};
Step 3, determining a local recommendation model and carrying out global recommendation model training; the loss function of the sparse linear model SLIM is shown in formula (5);
\min_{W}\; \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\rho}{2}\lVert W\rVert_F^2 + \alpha\lVert W\rVert_1 \quad \text{s.t.}\ \operatorname{diag}(W) = 0 \qquad (5)

wherein A represents the user-movie implicit feedback matrix, and \alpha and \rho control the weights of the L1 and L2 norms; minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, while the L2 norm controls model complexity and prevents overfitting; the model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6);
\min_{w_j}\; \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\rho}{2}\lVert w_j\rVert_2^2 + \alpha\lVert w_j\rVert_1 \quad \text{s.t.}\ w_{jj} = 0 \qquad (6)
wherein a_j represents the jth column of matrix A; the predicted recommendation score \tilde{a}_{ij} of user i for movie j is computed as in formula (7);

\tilde{a}_{ij} = a_i^{T} w_j = \sum_{l} a_{il}\, w_{lj} \qquad (7)
constructing a global recommendation model and local recommendation models with the sparse linear model SLIM as the base recommendation model: training on the user-movie implicit feedback matrix A to obtain the global movie similarity matrix W, and training on each local implicit feedback training matrix A^{P_u} to obtain the local movie similarity matrix W^{P_u} corresponding to each cluster;
Step 4, model weighting fusion recommendation; the calculation formula of the local model weighted fusion recommendation degree is shown as a formula (8);
\tilde{a}_{uj} = g \sum_{l \in R_u} w_{lj} + (1 - g) \sum_{l \in R_u} w_{lj}^{P_u} \qquad (8)

wherein \tilde{a}_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model; adjusting g controls the relative weight of the global and local models in the fused model, and the best recommendation effect is obtained by determining the optimal weight parameter g; the optimal global-model weight for the current data set is determined experimentally; after all model parameters are determined, the weighted fusion recommendation scores of all movies are computed for the current user u, the movies the user has already interacted with are removed, the remaining movies are sorted by score in descending order, and the top N movies are recommended to the current user;
step 5, verifying the effectiveness of the model through leave-one-out cross-validation; randomly extracting one movie from each user's set of rated movies into a test set, with the remaining movies serving as the training set of the model; then recommending a Top-N movie list to each user with the trained model and observing whether the user's held-out test movie appears in the recommended list and, if so, at which position p_i; finally, measuring the recommendation quality of the model by the two metrics hit rate HR and average reciprocal hit rank ARHR, where #hits denotes the number of recommendation hits and #users denotes the total number of users, as shown in formulas (9) and (10);
HR = \frac{\#hits}{\#users} \qquad (9)

ARHR = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{p_i} \qquad (10)
CN201810169922.3A 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering Active CN108363804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810169922.3A CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810169922.3A CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Publications (2)

Publication Number Publication Date
CN108363804A CN108363804A (en) 2018-08-03
CN108363804B true CN108363804B (en) 2020-08-21

Family

ID=63002919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810169922.3A Active CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Country Status (1)

Country Link
CN (1) CN108363804B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408702B (en) * 2018-08-29 2021-07-16 昆明理工大学 Mixed recommendation method based on sparse edge noise reduction automatic coding
CN111309874A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN111309873A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN110008377B (en) * 2019-03-27 2021-09-21 华南理工大学 Method for recommending movies by using user attributes
CN110084670B (en) * 2019-04-15 2022-03-25 东北大学 Shelf commodity combination recommendation method based on LDA-MLP
CN110069663B (en) * 2019-04-29 2021-06-04 厦门美图之家科技有限公司 Video recommendation method and device
CN111984856A (en) * 2019-07-25 2020-11-24 北京嘀嘀无限科技发展有限公司 Information pushing method and device, server and computer readable storage medium
CN110443502A (en) * 2019-08-06 2019-11-12 合肥工业大学 Crowdsourcing task recommendation method and system based on worker's capability comparison
CN112395487B (en) * 2019-08-14 2024-04-26 腾讯科技(深圳)有限公司 Information recommendation method and device, computer readable storage medium and electronic equipment
CN110795570B (en) * 2019-10-11 2022-06-17 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN111008334B (en) * 2019-12-04 2023-04-18 华中科技大学 Top-K recommendation method and system based on local pairwise ordering and global decision fusion
CN113111251A (en) * 2020-01-10 2021-07-13 阿里巴巴集团控股有限公司 Project recommendation method, device and system
CN111460046A (en) * 2020-03-06 2020-07-28 合肥海策科技信息服务有限公司 Scientific and technological information clustering method based on big data
CN111581522B (en) * 2020-06-05 2021-03-09 预见你情感(北京)教育咨询有限公司 Social analysis method based on user identity identification
CN111897999B (en) * 2020-07-27 2023-06-16 九江学院 Deep learning model construction method for video recommendation and based on LDA
CN112184391B (en) * 2020-10-16 2023-10-10 中国科学院计算技术研究所 Training method of recommendation model, medium, electronic equipment and recommendation model
CN112348629A (en) * 2020-10-26 2021-02-09 邦道科技有限公司 Commodity information pushing method and device
CN112364245B (en) * 2020-11-20 2021-12-21 浙江工业大学 Top-K movie recommendation method based on heterogeneous information network embedding
CN112925926B (en) * 2021-01-28 2022-04-22 北京达佳互联信息技术有限公司 Training method and device of multimedia recommendation model, server and storage medium
CN113342963B (en) * 2021-04-29 2022-03-04 山东大学 Service recommendation method and system based on transfer learning
CN113268670B (en) * 2021-06-16 2022-09-27 中移(杭州)信息技术有限公司 Latent factor hybrid recommendation method, device, equipment and computer storage medium
CN113449147A (en) * 2021-07-06 2021-09-28 乐视云计算有限公司 Video recommendation method and device based on theme

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544216A (en) * 2013-09-23 2014-01-29 Tcl集团股份有限公司 Information recommendation method and system combining image content and keywords
CN107609201A (en) * 2017-10-25 2018-01-19 广东工业大学 A kind of recommended models generation method and relevant apparatus based on commending system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514496B (en) * 2012-06-21 2017-05-17 腾讯科技(深圳)有限公司 Method and system for processing recommended target software

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544216A (en) * 2013-09-23 2014-01-29 Tcl集团股份有限公司 Information recommendation method and system combining image content and keywords
CN107609201A (en) * 2017-10-25 2018-01-19 广东工业大学 A kind of recommended models generation method and relevant apparatus based on commending system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Local item-item models for top-n recommendation; Evangelia Christakopoulou; ACM; 20160919; full text *
基于谱聚类与多因子融合的协同过滤推荐算法 (Collaborative filtering recommendation algorithm based on spectral clustering and multi-factor fusion); 李倩; 《计算机应用研究》; 20171031; Vol. 34, No. 10; full text *

Also Published As

Publication number Publication date
CN108363804A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN108363804B (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
CN108763362B (en) Local model weighted fusion Top-N movie recommendation method based on random anchor point pair selection
WO2020207196A1 (en) Method and apparatus for generating user tag, storage medium and computer device
CN108665323B (en) Integration method for financial product recommendation system
CN107357793B (en) Information recommendation method and device
Huang et al. Sentiment and topic analysis on social media: a multi-task multi-label classification approach
Liang et al. A probabilistic rating auto-encoder for personalized recommender systems
CN111563164A (en) Specific target emotion classification method based on graph neural network
CN107220365A (en) Accurate commending system and method based on collaborative filtering and correlation rule parallel processing
CN109508385B (en) Character relation analysis method in webpage news data based on Bayesian network
Chen et al. Research on personalized recommendation hybrid algorithm for interactive experience equipment
CN107895303B (en) Personalized recommendation method based on OCEAN model
CN107103093B (en) Short text recommendation method and device based on user behavior and emotion analysis
CN107944485A (en) The commending system and method, personalized recommendation system found based on cluster group
CN105701516B (en) A kind of automatic image marking method differentiated based on attribute
Zhang et al. Learning to match clothing from textual feature-based compatible relationships
Harakawa et al. Extracting hierarchical structure of web video groups based on sentiment-aware signed network analysis
Duan et al. A hybrid intelligent service recommendation by latent semantics and explicit ratings
Menaria et al. Tweet sentiment classification by semantic and frequency base features using hybrid classifier
Najafabadi et al. Tag recommendation model using feature learning via word embedding
Ma et al. Book recommendation model based on wide and deep model
Patil et al. A survey on artificial intelligence (AI) based job recommendation systems
Abdi et al. Using an auxiliary dataset to improve emotion estimation in users’ opinions
Long et al. Domain-specific user preference prediction based on multiple user activities
Bharadhwaj Layer-wise relevance propagation for explainable recommendations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant