CN108363804B - Local model weighted fusion Top-N movie recommendation method based on user clustering - Google Patents
Local model weighted fusion Top-N movie recommendation method based on user clustering Download PDFInfo
- Publication number
- CN108363804B CN108363804B CN201810169922.3A CN201810169922A CN108363804B CN 108363804 B CN108363804 B CN 108363804B CN 201810169922 A CN201810169922 A CN 201810169922A CN 108363804 B CN108363804 B CN 108363804B
- Authority
- CN
- China
- Prior art keywords
- user
- model
- movie
- document
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The local model weighted fusion Top-N movie recommendation method based on user clustering comprises the following steps. Step 1: data preprocessing; inactive users and movies with low popularity are removed by data cleaning; user movie tag documents are constructed; explicit rating information is converted into implicit feedback, yielding a user-movie implicit feedback matrix A. Step 2: user clustering; using movie tag information, user feature vectors are obtained by training an LDA topic model, and user clustering is realized with a spectral clustering algorithm. Step 3: the local recommendation models are determined and the global recommendation model is trained. Step 4: model weighted fusion recommendation. Step 5: the effectiveness of the model is verified by leave-one-out cross validation.
Description
Technical Field
The invention relates to a network-based movie recommendation method.
Background
With the rapid development of information technology and social networks, the data generated on the internet has grown exponentially, and the big data era has arrived. As data volume increases, it becomes ever harder for people to find the information they really want in the mass of data. This is where a recommendation system delivers its greatest value. Based on user data, item information, and the user's historical behavior, a recommendation algorithm can accurately predict the user's preferences and recommend items the user may be interested in, greatly reducing the cost to the user of finding the target information.
Recommendation algorithms can be divided into content-based recommendation and collaborative filtering. A modern recommendation system has two main tasks: score prediction, and Top-N recommendation, which is the most widely applied in real business scenarios. A Top-N recommendation algorithm lets the user select something of interest by presenting a ranked list of N items. Top-N recommendation models fall mainly into two types: neighborhood-based collaborative filtering and model-based collaborative filtering. The former can be further subdivided into the user-based neighborhood model (UserKNN) and the item-based neighborhood model (ItemKNN); the latter is represented by latent factor models.
As the saying goes, "birds of a feather flock together": different user groups often form distinctive behavior patterns, so the similarity between the same two items changes across groups. A single recommendation model, however, often cannot capture such local similarity differences; it treats the similarity of two items as identical in every scenario, fails to accurately capture the real preferences of users, and lowers the quality of personalized recommendation. Algorithms that improve overall recommendation quality by training multiple local recommendation models and fusing them can solve this problem to some extent, but they often do not make full use of the data a recommendation scenario provides; the data they use is one-dimensional, and the final recommendation effect is mediocre.
Disclosure of Invention
To solve the problems that a single model in the prior art cannot accurately capture user preferences and that multi-model fusion algorithms use only a single kind of training data, the invention provides a novel local model weighted fusion movie recommendation algorithm based on user clustering to realize Top-N personalized movie recommendation.
The invention uses the textual content of movies: semantic-level user feature vectors are computed through an LDA topic model, and user clustering is realized through a spectral clustering algorithm on these vectors to construct local user groups. The method further uses the users' movie ratings to construct a local recommendation model and a global recommendation model through sparse linear models, and realizes the final Top-N personalized movie recommendation through linear weighted fusion of the local and global models.
The general flow of the local model weighted fusion Top-N movie recommendation method based on user clustering is shown in FIG. 1; the method specifically comprises the following steps:
step 1: and (3) a data preprocessing stage. Data cleansing is performed on some inactive users and movies with small popularity; constructing a user movie label document; converting the explicit grading information into implicit feedback information, and constructing a user-movie implicit feedback matrix A;
1.1, carrying out data cleaning work on an original data set, eliminating users with the number of movies smaller than 20, and eliminating movies with the number of times of scoring smaller than 20 to obtain a new training data set;
1.2 count the tags that all users have assigned to movies in the new data set to generate a tag dictionary. Represent the current user by the document formed from the tags of all movies the user has watched; the documents of all users form a corpus. Compute the TF-IDF value of each word in the corpus. Term frequency TF, inverse document frequency IDF, and term frequency-inverse document frequency TF-IDF are computed by formulas (1), (2) and (3):
TF_{i,j} = n_{i,j} / Σ_k n_{k,j} (1)
IDF_i = log( |D| / |{j : t_i ∈ d_j}| ) (2)
TFIDF_{i,j} = TF_{i,j} × IDF_i (3)
where TF_{i,j} denotes the frequency of word t_i in document d_j, n_{i,j} denotes the number of occurrences of t_i in d_j, and Σ_k n_{k,j} denotes the total number of word occurrences in d_j. IDF_i denotes the inverse document frequency of word t_i, |D| denotes the total number of documents in the corpus, and |{j : t_i ∈ d_j}| denotes the number of documents containing t_i. TFIDF_{i,j} denotes the TF-IDF value of word t_i in document d_j;
1.3 convert the explicit 1-5 point rating information into 0-1 implicit feedback: if the current user has rated the movie, mark it as 1; if the user has not rated it (i.e., it is a candidate movie to be recommended), mark it as 0. This yields an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
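Step 1 above can be sketched in a few lines. The following is a minimal plain-Python illustration of formulas (1)-(3) and the 0-1 conversion of step 1.3; the user documents, tags and ratings are hypothetical toy data, not from the patent:

```python
import math

# Hypothetical toy corpus: one "document" per user, built from movie tags.
docs = {
    "u1": ["sci-fi", "action", "space", "action"],
    "u2": ["romance", "drama", "romance"],
    "u3": ["sci-fi", "space", "drama"],
}

def tf(term, doc):
    # Formula (1): occurrences of the term over total word count of the document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Formula (2): log of total documents over documents containing the term.
    containing = sum(1 for d in corpus.values() if term in d)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    # Formula (3): TF x IDF.
    return tf(term, doc) * idf(term, corpus)

# Each user's feature vector over the tag dictionary.
dictionary = sorted({t for d in docs.values() for t in d})
vectors = {u: [tfidf(t, d, docs) for t in dictionary] for u, d in docs.items()}

# Step 1.3: explicit 1-5 ratings -> 0-1 implicit feedback matrix A (n x m).
ratings = {("u1", "m1"): 4, ("u2", "m2"): 2, ("u3", "m1"): 5}
users, movies = ["u1", "u2", "u3"], ["m1", "m2"]
A = [[1 if (u, m) in ratings else 0 for m in movies] for u in users]
```

Each user vector then serves both as an LDA training document representation (step 2) and, via A, as input to the sparse linear models (step 3).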
step 2: and (5) a user clustering stage. Using film label information, training through an LDA topic model to obtain a user characteristic vector, and using a spectral clustering algorithm to realize user clustering;
2.1 the LDA topic model is a document-topic-word three-layer Bayesian network; given a corpus, it can analyze the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is shown in formula (4):
p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β) (4)
where θ denotes the topic distribution of a document, z denotes the topics, w denotes a document, α denotes the Dirichlet prior parameter of the multinomial topic distribution of each document, β denotes the Dirichlet prior parameter of the multinomial word distribution of each topic, N denotes the number of words in the document, z_n denotes the topic of the n-th word of the document, and w_n denotes the n-th word of the document;
each movie has a number of tags assigned to it by users. A movie tag is mapped to a word w_n, the set of tags of all movies watched by a user is mapped to a document w, and a type of movie the user prefers is mapped to a topic z. If the data set has n users, a corpus containing n documents and a tag dictionary can be generated; each document in the corpus is represented by a vector of the dictionary length, and each value in the vector is the TF-IDF value of the corresponding dictionary tag in the user document and the corpus;
to distinguish user groups more clearly, the greater the difference between topics, the better. To determine the optimal number of topics, several LDA models are trained with different topic numbers, the average similarity between the topic vectors obtained from each trained model is calculated, and the topic number of the model with the minimum average topic-vector similarity is taken as the optimal number of topics. The topic distribution θ of each document is obtained through LDA model training and used as the feature vector of the corresponding user;
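The topic-number selection just described can be sketched as follows. This assumes topic-word distributions have already been obtained from LDA runs at several candidate topic counts (here random Dirichlet draws stand in for trained models); the criterion is the average pairwise cosine similarity between topic vectors, minimized over the candidates:

```python
import numpy as np

def avg_topic_similarity(topic_word):
    """Mean pairwise cosine similarity between the rows (topics) of a
    topic-word matrix; lower means the topics are more distinct."""
    t = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    sim = t @ t.T
    k = sim.shape[0]
    return (sim.sum() - np.trace(sim)) / (k * (k - 1))   # off-diagonal average

# Hypothetical topic-word matrices standing in for LDA models trained with
# different topic counts (random Dirichlet draws over a 50-word dictionary).
rng = np.random.default_rng(0)
candidates = {k: rng.dirichlet(np.ones(50), size=k) for k in (5, 10)}
best_k = min(candidates, key=lambda k: avg_topic_similarity(candidates[k]))
```

In practice the candidate matrices would come from actual LDA training on the user tag corpus; only the selection rule is illustrated here.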
2.2 cluster the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
the number of clusters must be determined before clustering. Each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic. To determine the importance of each topic in the current user population, all user feature vectors are summed dimension-wise and averaged to obtain an overall topic intensity vector, and the optimal number of clusters is determined by observing the value distribution of this vector. For example, in an LDA training run with 10 topics, a 10-dimensional topic intensity vector obtained by the above method is visualized as in FIG. 2 (the vertical axis is topic intensity, the horizontal axis is the topic). Observation shows that topics 2, 9, 3, 8 and 6 have the greatest intensity in the current data set, indicating that most people like to watch these types of movies, so in this situation it is preferable to cluster the users into 5 categories with the spectral clustering algorithm. The spectral clustering algorithm proceeds as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D − W;
(3) compute the first k eigenvectors t_1, t_2, …, t_k of L;
(4) stack the k column vectors t_1, t_2, …, t_k into a matrix T, T ∈ R^{n×k};
(5) for i = 1, …, n, let y_i ∈ R^k be the i-th row vector of T;
(6) use the K-Means algorithm to cluster the points (y_i)_{i=1,…,n} into clusters C_1, C_2, …, C_k;
for each user cluster, set to 0 all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster, generating for each cluster a corresponding local implicit feedback training matrix A^(P_u), where P_u denotes the cluster number and P_u ∈ {1, …, k};
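The six spectral clustering steps and the construction of the local training matrices can be sketched as follows. This is a minimal numpy version with a Gaussian similarity, a tiny deterministic k-means, and hypothetical toy feature vectors; a production system would use a tuned clustering library instead:

```python
import numpy as np

def spectral_clusters(X, k, iters=50):
    """Minimal unnormalized spectral clustering: Gaussian similarity,
    Laplacian L = D - W, first k eigenvectors, then a tiny k-means."""
    n = X.shape[0]
    # (1) n x n similarity matrix W and degree matrix D.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / 2.0)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # (2) Laplacian matrix; (3)-(4) first k eigenvectors as columns of T.
    L = D - W
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    T = vecs[:, :k]
    # (5)-(6) k-means on the row vectors y_i of T (farthest-point init).
    centers = [T[0]]
    for _ in range(1, k):
        d = np.min(((T[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(T[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((T[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = T[labels == j].mean(axis=0)
    return labels

def local_matrices(A, labels, k):
    """Zero the rows of users outside each cluster -> one local matrix per cluster."""
    return [np.where((labels == p)[:, None], A, 0) for p in range(k)]

# Two well-separated toy user groups (hypothetical feature vectors).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = spectral_clusters(X, k=2)
```

Each matrix returned by `local_matrices` plays the role of one A^(P_u) in the local model training of step 3.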
step 3, determine the local recommendation models and train the global recommendation model. The loss function of the sparse linear model SLIM is shown in formula (5):
min_W (1/2)·||A − AW||_F² + (ρ/2)·||W||_F² + α·||W||_1, subject to W ≥ 0 and diag(W) = 0 (5)
where A denotes the user-movie implicit feedback matrix and α and ρ control the weights of the L1 and L2 norms. Minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, and the L2 norm controls the complexity of the model to prevent overfitting. Because the columns are mutually independent, the model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6):
min_{w_j} (1/2)·||a_j − A w_j||² + (ρ/2)·||w_j||² + α·||w_j||_1, subject to w_j ≥ 0 and w_{jj} = 0 (6)
where a_j denotes the j-th column of matrix A. The predicted recommendation degree ã_{ij} of user i for movie j is computed by formula (7):
ã_{ij} = a_i^T · w_j (7)
where a_i denotes the i-th row of A;
using the sparse linear model SLIM as the basic recommendation model, a global recommendation model and local recommendation models are constructed: the user-movie implicit feedback matrix A is used to train the global movie similarity matrix W, and each local implicit feedback training matrix A^(P) is used to train the local movie similarity matrix W^(P) of the corresponding cluster;
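A minimal sketch of the per-column SLIM training follows. It uses a projected (sub)gradient loop rather than a tuned stochastic solver, with hypothetical regularization weights, so it illustrates formulas (5)-(7) only in outline:

```python
import numpy as np

def slim_column(A, j, l1=0.01, l2=0.01, lr=0.05, iters=300):
    """Projected (sub)gradient sketch for one column w_j of the SLIM matrix W:
    minimize 1/2*||a_j - A w||^2 + l2/2*||w||^2 + l1*||w||_1
    subject to w >= 0 and w_j = 0 (a movie must not explain itself)."""
    a_j = A[:, j].astype(float)
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ w - a_j) + l2 * w + l1
        w = np.maximum(w - lr * grad, 0.0)   # gradient step + nonnegativity
        w[j] = 0.0                           # zero-diagonal constraint
    return w

def slim_train(A, **kw):
    # The columns are mutually independent, so they could be trained in parallel.
    return np.column_stack([slim_column(A, j, **kw) for j in range(A.shape[1])])

# Toy implicit feedback matrix A: movies 0 and 1 are always co-consumed.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)
W = slim_train(A)
scores = A @ W   # formula (7): predicted recommendation degree per user and movie
```

On this toy matrix the learned similarity between the co-consumed movies 0 and 1 comes out much larger than their similarity to movie 2, which is what the fusion step relies on.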
step 4, weighted fusion of the models. The local model weighted fusion recommendation degree is computed by formula (8):
r̂_{uj} = Σ_{l ∈ R_u} ( g·w_{lj} + (1 − g)·w^{(P_u)}_{lj} ) (8)
where r̂_{uj} denotes the weighted fusion recommendation degree of movie j for user u, R_u is the set of all movies user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w^{(P_u)}_{lj} is the similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model. Adjusting the parameter g controls the weight proportion of the global and local models in the fusion model, and the optimal recommendation effect of the fusion model is obtained by determining the optimal weight parameter g; the optimal global model weight parameter for the current data set may be determined experimentally. After all parameters of the model are determined, the weighted fusion recommendation degrees of all movies are computed for the current user u, the movies are ranked in descending order, those the user has already interacted with are removed, and the top N movies are selected and recommended to the current user;
step 5, the effectiveness of the model is verified by leave-one-out cross validation. For each user, one movie is randomly drawn from the user's rated movies into the test set, and the remaining movies are used as the training set of the model. Then a Top-N movie list is recommended to each user with the trained model, and for each test movie it is observed whether it appears in the user's recommendation list, and at which position p_i of the list it appears. Finally, the recommendation quality of the model is measured by two indexes, hit rate (HR) and average reciprocal hit rank (ARHR), defined in formulas (9) and (10):
HR = #hits / #users (9)
ARHR = (1/#users)·Σ_{i=1}^{#hits} (1/p_i) (10)
where #hits denotes the number of recommendation hits and #users denotes the total number of users;
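The two evaluation indexes of formulas (9) and (10) can be computed directly; the sketch below uses hypothetical Top-3 lists and one held-out movie per user:

```python
def hit_rate_metrics(recommended, held_out):
    """Formulas (9)-(10) sketch: HR = #hits / #users and
    ARHR = (1/#users) * sum of 1/p_i over hits, p_i the 1-based hit position."""
    n_users = len(held_out)
    hits, rank_sum = 0, 0.0
    for user, movie in held_out.items():
        rec_list = recommended[user]
        if movie in rec_list:
            hits += 1
            rank_sum += 1.0 / (rec_list.index(movie) + 1)
    return hits / n_users, rank_sum / n_users

# Hypothetical Top-3 lists and one held-out (leave-one-out) movie per user.
recommended = {"u1": ["m2", "m5", "m9"],
               "u2": ["m1", "m4", "m7"],
               "u3": ["m3", "m6", "m8"]}
held_out = {"u1": "m5", "u2": "m1", "u3": "m9"}
hr, arhr = hit_rate_metrics(recommended, held_out)   # hits at ranks 2 and 1
```

ARHR weights a hit by the reciprocal of its position, so hits near the top of the Top-N list count more than hits near the bottom.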
the recommendation method flow steps are thus complete.
Integrating the above techniques, the invention provides a local model weighted fusion Top-N movie recommendation algorithm based on user clustering. To solve the problem that a traditional single recommendation model cannot accurately estimate local differences between items and therefore cannot accurately capture user preferences, a global recommendation model and local recommendation models based on user clustering are trained separately, and Top-N movie recommendation is realized through linear weighted fusion of the models. In addition, to make full use of the data in a movie recommendation scenario and improve recommendation quality along multiple dimensions, the invention uses movie tag information to compute semantic-level user feature vectors through an LDA topic model, and realizes a semantic-level grouping of users.
The invention has the following advantages. (1) The approach is novel: with a sparse linear model as the basic recommendation model, a global recommendation model and local recommendation models based on user clustering are trained separately, and the final fusion model is generated by linear weighted fusion. (2) Recommendation quality is improved along multiple dimensions: besides training recommendation models on traditional rating data, the user clustering stage introduces movie tag data and analyzes the topic attributes of the population at the semantic level with an LDA topic model to obtain user feature vectors, and realizes clustering with a spectral clustering algorithm, further improving recommendation quality. (3) The algorithm is simple and fast to implement: in the training stage of the local and global models, because the models are mutually independent and each column of a model's similarity matrix is also independent, parallel training can be adopted, greatly reducing training time and improving training efficiency. (4) Recommendation quality is better: the proposed local model weighted fusion recommendation algorithm is an effective combination of content-based recommendation, neighborhood-based collaborative filtering and model-based collaborative filtering; it makes full use of the advantages of each algorithm and compensates for their respective shortcomings, greatly improving recommendation quality compared with using any single algorithm.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention;
FIG. 2 is a plot of the subject intensity profile of the method of the present invention.
Detailed Description
Referring to the general flow chart of FIG. 1, the invention has four stages: a data preprocessing stage, a user clustering stage, a global and local recommendation model training stage, and a recommendation model linear weighted fusion stage. In the data preprocessing stage, the data set is cleaned, inactive users and cold movies are removed, and the corpus used for LDA topic model training and the user movie implicit feedback training matrix used for sparse linear model training are constructed. In the user clustering stage, user feature vectors are obtained by training an LDA topic model on the user corpus from the first stage, user clustering is realized by a spectral clustering algorithm, and each cluster generates a local implicit feedback training matrix. In the model training stage, the original implicit feedback matrix and the local implicit feedback matrices are used to obtain the global model and the local models, respectively, through sparse linear model training. In the model linear weighted fusion recommendation stage, the global model and local models obtained in the previous step are fused by linear weighting to obtain the final recommendation model.
The input of the method is the users' movie rating data and the movies' tag data; the output is a Top-N personalized movie recommendation list for each user.
The specific steps are as described in steps 1 to 5 above.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.
Claims (1)
1. The local model weighted fusion Top-N movie recommendation method based on user clustering specifically comprises the following steps:
step 1: data preprocessing; data cleaning is carried out to remove inactive users and movies of low popularity; a user movie-tag document is constructed; explicit rating information is converted into implicit feedback information, and a user-movie implicit feedback matrix A is constructed;
1.1, carrying out data cleaning on the original dataset, removing users who have rated fewer than 20 movies and removing movies that have been rated fewer than 20 times, to obtain a new training dataset;
1.2 counting the tags that all users have applied to movies in the new dataset to generate a tag dictionary; each user is represented by a document composed of the tags of all movies that user has watched, the documents of all users form a corpus, and the TF-IDF value of each word of the corpus in each document is calculated; the term frequency TF, the inverse document frequency IDF, and the term frequency-inverse document frequency TF-IDF are computed as shown in formulas (1), (2) and (3);
TFIDF_{i,j} = TF_{i,j} × IDF_i (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} denotes the number of occurrences of word t_i in document d_j, and Σ_k n_{k,j} denotes the total number of occurrences of all words in document d_j; IDF_i denotes the inverse document frequency of word t_i, |D| denotes the total number of documents in the corpus, and |{j : t_i ∈ d_j}| denotes the number of documents containing word t_i; TFIDF_{i,j} denotes the term frequency-inverse document frequency of word t_i in document d_j;
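The images for formulas (1) and (2) are not reproduced in this text; from the definitions above, the standard TF-IDF formulas they refer to can be reconstructed as follows (the logarithm in (2) is the usual convention and an assumption here):

```latex
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
```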
1.3, converting the explicit rating information of 1-5 points into implicit feedback information represented by 0-1: if the current user has rated the current movie, it is marked as 1; if the user has not rated the movie, i.e., it is a movie to be recommended, it is marked as 0; this yields an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
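A minimal sketch of step 1.3 (the function name and toy data are illustrative, not from the patent): explicit 1-5 point ratings become a 0-1 implicit feedback matrix A in which A[u, m] = 1 exactly when user u has rated movie m.

```python
import numpy as np

def build_implicit_matrix(ratings, n_users, n_movies):
    """ratings: iterable of (user_idx, movie_idx, score) with scores in 1..5.
    Returns an n_users x n_movies 0-1 implicit feedback matrix A."""
    A = np.zeros((n_users, n_movies), dtype=np.int8)
    for u, m, _score in ratings:  # any rating, high or low, counts as an interaction
        A[u, m] = 1
    return A

ratings = [(0, 0, 5.0), (0, 2, 2.0), (1, 1, 4.0)]
A = build_implicit_matrix(ratings, n_users=2, n_movies=3)
```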
step 2: user clustering; using the movie tag information, user feature vectors are obtained by training an LDA topic model, and user clustering is realized with a spectral clustering algorithm;
2.1 the LDA topic model is a document-topic-word three-layer Bayesian network; given a corpus, the LDA topic model infers the topic distribution of each document in the corpus and the word distribution of each topic; the joint probability over the topic distribution, topics, and words is shown in formula (4);
where θ denotes the topic distribution of a document, z denotes a topic, w denotes a document, α denotes the Dirichlet prior parameter of the multinomial distribution of topics under each document, β denotes the Dirichlet prior parameter of the multinomial distribution of words under each topic, N denotes the number of words in the document, z_n denotes the topic of the nth word in the document, and w_n denotes the nth word of the document;
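The image for formula (4) is likewise missing; given the variables defined above, the standard LDA joint probability it denotes (following Blei et al.'s formulation, an assumption here) is:

```latex
p(\theta, z, w \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
```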
each movie has a plurality of tags assigned to it by users; a movie tag is mapped to a word w_n, the set of tags of all movies a user has watched is mapped to a document w, and a specific type of movie preferred by the user is mapped to a topic z; if the dataset has n users, a corpus containing n documents and a tag dictionary can be generated; each document in the corpus is represented by a vector of the dictionary length, and each value in the vector is the TF-IDF value of the corresponding dictionary tag in that user document and the corpus;
in order to distinguish more distinct user groups, the larger the difference between different topics, the better; to determine the optimal number of topics, several LDA models are trained with different topic numbers, the average similarity between the topic vectors obtained by each trained LDA model is calculated, and the topic number of the model with the minimum average topic-vector similarity is taken as the optimal number of topics; the topic distribution θ of each document is obtained through LDA model training and used as the feature vector of the corresponding user;
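A sketch of the topic-count selection, using scikit-learn's LatentDirichletAllocation (the patent does not specify an implementation; function names and the cosine measure of topic similarity are assumptions): the candidate k whose topic-word vectors have the lowest average pairwise similarity is chosen, and the per-user topic distribution θ serves as the feature vector.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def best_topic_count(X, candidate_ks, seed=0):
    """X: (n_users, vocab_size) matrix of per-user tag documents.
    Returns the k whose trained topic vectors have the lowest
    average pairwise cosine similarity (i.e., the most distinct topics)."""
    scores = {}
    for k in candidate_ks:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(X)
        sims = cosine_similarity(lda.components_)   # (k, k) topic-topic similarity
        off_diag = sims[~np.eye(k, dtype=bool)]     # drop self-similarity
        scores[k] = off_diag.mean()
    return min(scores, key=scores.get)

def user_feature_vectors(X, k, seed=0):
    """Per-user topic distribution theta, used as the user feature vector."""
    lda = LatentDirichletAllocation(n_components=k, random_state=seed)
    return lda.fit_transform(X)  # each row is a distribution over k topics
```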
2.2, clustering the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
before clustering, the number of clusters is determined first; since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, in order to determine the importance of each topic in the current user group, all user feature vectors are summed dimension-wise and then averaged to obtain an overall topic intensity vector, and the optimal cluster number is determined by observing the value distribution of this topic intensity vector; the spectral clustering algorithm comprises the following specific steps:
(1) calculating an n × n similarity matrix W and the corresponding degree matrix D;
(2) calculating the Laplacian matrix L = D − W;
(3) computing the first k eigenvectors t_1, t_2, …, t_k of L;
(4) stacking the k column vectors t_1, t_2, …, t_k into a matrix T, T ∈ R^{n×k};
(5) for i = 1, …, n, letting y_i ∈ R^k be the ith row vector of T;
(6) using the K-Means algorithm to cluster the points (y_i)_{i=1,2,…,n} into clusters C_1, C_2, …, C_k;
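Steps (1)-(6) can be sketched as follows (a sketch under stated assumptions: the patent does not name the similarity function or Laplacian variant, so a Gaussian/RBF similarity and the unnormalized Laplacian L = D − W are assumed, taking the eigenvectors of the k smallest eigenvalues):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_cluster(features, k, gamma=1.0, seed=0):
    """features: (n, d) user feature vectors. Returns cluster labels in 0..k-1."""
    W = rbf_kernel(features, gamma=gamma)      # (1) n x n similarity matrix
    D = np.diag(W.sum(axis=1))                 #     degree matrix
    L = D - W                                  # (2) unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # (3) eigenpairs, ascending order
    T = eigvecs[:, :k]                         # (4) first k eigenvectors as columns
    # (5) row i of T is the embedded point y_i in R^k
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(T)                   # (6) clusters C_1..C_k
```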
For each user cluster, all user row vectors not belonging to that cluster are set to 0 in the user-movie implicit feedback matrix A, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster number and P_u ∈ {1, …, k};
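The generation of the local training matrices can be sketched as a simple row-masking step (names are illustrative; clusters are indexed from 0 here rather than 1):

```python
import numpy as np

def local_feedback_matrices(A, labels, k):
    """A: (n_users, n_movies) implicit feedback matrix; labels: cluster id per user.
    Returns a list of k local matrices in which rows of users outside the
    cluster are zeroed out."""
    matrices = []
    for c in range(k):
        Ac = A.copy()
        Ac[labels != c, :] = 0   # keep only this cluster's users
        matrices.append(Ac)
    return matrices
```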
Step 3, global and local recommendation model training; the loss function of the sparse linear model SLIM is shown in formula (5);
A represents the user-movie implicit feedback matrix, and α and ρ control the weights of the L1 and L2 norms; an m × m sparse movie similarity matrix W is obtained by minimizing this loss function; the L1 norm in the model controls the sparsity of W, and the L2 norm controls the complexity of the model to prevent overfitting; the model trains each column w_j of the W matrix in parallel by the stochastic gradient descent method to obtain the final W matrix, as shown in formula (6);
where a_j represents the jth column of matrix A; the predicted recommendation degree ã_{ij} of user i for movie j is computed as shown in formula (7);
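The images for formulas (5), (6) and (7) are not reproduced in this text; in the usual SLIM notation, matched to the surrounding description (α weighting the L1 norm and ρ the L2 norm; the nonnegativity and zero-diagonal constraints are the standard SLIM assumptions), they read:

```latex
\min_{W} \; \tfrac{1}{2} \lVert A - AW \rVert_F^2
  + \tfrac{\rho}{2} \lVert W \rVert_F^2
  + \alpha \lVert W \rVert_1
\quad \text{s.t. } W \ge 0,\ \operatorname{diag}(W) = 0 \qquad (5)

\min_{w_j} \; \tfrac{1}{2} \lVert a_j - A w_j \rVert_2^2
  + \tfrac{\rho}{2} \lVert w_j \rVert_2^2
  + \alpha \lVert w_j \rVert_1
\quad \text{s.t. } w_j \ge 0,\ w_{jj} = 0 \qquad (6)

\tilde{a}_{ij} = a_i^{\top} w_j \qquad (7)
```

where a_i^⊤ denotes the ith row of A.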
constructing a global recommendation model and local recommendation models using the sparse linear model SLIM as the basic recommendation model: training with the user-movie implicit feedback matrix A yields the global movie similarity matrix W, and training with each local implicit feedback matrix A^{P_u} yields the local movie similarity matrix W^{P_u} corresponding to each cluster;
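One common way to fit SLIM (a sketch, not necessarily the patent's implementation, which uses stochastic gradient descent) is to solve each column w_j as a nonnegative ElasticNet regression, zeroing column j of the feature matrix to enforce w_jj = 0:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def train_slim(A, alpha=0.1, l1_ratio=0.5):
    """A: (n_users, n_movies) implicit feedback matrix.
    Returns the m x m nonnegative movie-movie similarity matrix W."""
    A = A.astype(float)
    m = A.shape[1]
    W = np.zeros((m, m))
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       positive=True, fit_intercept=False)
    for j in range(m):
        X = A.copy()
        X[:, j] = 0.0                 # exclude movie j itself, so w_jj stays 0
        model.fit(X, A[:, j])         # regress column a_j on the other columns
        W[:, j] = model.coef_
    return W
```

The `alpha`/`l1_ratio` parameterization is scikit-learn's combined form of the separate L1/L2 weights in formula (5).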
Step 4, model weighting fusion recommendation; the calculation formula of the local model weighted fusion recommendation degree is shown as a formula (8);
where r̃_{uj} represents the weighted fusion recommendation degree of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and the parameter g is the weight parameter of the global model; the weight proportion of the global model and the local model in the fusion model is controlled by adjusting the parameter g, and the optimal recommendation effect of the fusion model is obtained by determining the optimal weight parameter g; the optimal global model weight parameter for the current dataset is determined experimentally; after all parameters in the model are determined, the weighted fusion recommendation degrees of all movies are calculated for the current user u and sorted from large to small, movies the current user has already interacted with are deleted, and the top N movies are selected and recommended to the current user;
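A sketch of formula (8) and the Top-N selection, assuming the local model receives the complementary weight (1 − g) (consistent with g controlling the weight proportion between the two models, though the exact split is not spelled out here):

```python
import numpy as np

def recommend_top_n(u, A, W_global, W_locals, labels, g=0.5, N=10):
    """Weighted fusion recommendation for user u.
    A: (n, m) implicit feedback matrix; W_global: (m, m) global similarity;
    W_locals: list of (m, m) local similarity matrices per cluster;
    labels: cluster id per user; g: global-model weight parameter."""
    W_local = W_locals[labels[u]]
    a_u = A[u].astype(float)          # indicator of the interacted set R_u
    # formula (8): r_uj = g * sum_{l in R_u} w_lj + (1-g) * sum_{l in R_u} w_lj^{P_u}
    r = g * a_u @ W_global + (1.0 - g) * a_u @ W_local
    r[A[u] > 0] = -np.inf             # delete movies the user already interacted with
    return np.argsort(-r)[:N]         # indices of the Top-N movies, best first
```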
step 5, proving the effectiveness of the model through leave-one-out cross validation; one movie is randomly drawn from each user's movie rating set and placed into a test set, with the remaining movies used as the training set for the model; then a list of Top-N movies is recommended for each user by the trained model, and it is observed whether the user's held-out movie appears in the recommended list and, if so, at which specific position p_i in the list; finally, the recommendation quality of the model is measured by two indexes, the hit rate HR and the average ranking hit rate ARHR, where #hits represents the number of recommendation hits and #users represents the total number of users, as shown in formulas (9) and (10);
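Formulas (9) and (10) correspond to the standard definitions HR = #hits / #users and ARHR = (1/#users) · Σ 1/p_i; a sketch of the evaluation step (function name is illustrative):

```python
def hit_rate_metrics(recommendations, held_out):
    """recommendations: per-user Top-N lists; held_out: each user's test movie.
    Returns (HR, ARHR) per formulas (9) and (10)."""
    n_users = len(held_out)
    hits, reciprocal_ranks = 0, 0.0
    for rec_list, movie in zip(recommendations, held_out):
        if movie in rec_list:
            hits += 1
            p_i = rec_list.index(movie) + 1   # 1-based position in the list
            reciprocal_ranks += 1.0 / p_i
    return hits / n_users, reciprocal_ranks / n_users
```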
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810169922.3A CN108363804B (en) | 2018-03-01 | 2018-03-01 | Local model weighted fusion Top-N movie recommendation method based on user clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108363804A CN108363804A (en) | 2018-08-03 |
CN108363804B true CN108363804B (en) | 2020-08-21 |
Family
ID=63002919
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810169922.3A Active CN108363804B (en) | 2018-03-01 | 2018-03-01 | Local model weighted fusion Top-N movie recommendation method based on user clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108363804B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408702B (en) * | 2018-08-29 | 2021-07-16 | 昆明理工大学 | Mixed recommendation method based on sparse edge noise reduction automatic coding |
CN111309874A (en) * | 2018-11-23 | 2020-06-19 | 北京嘀嘀无限科技发展有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111309873A (en) * | 2018-11-23 | 2020-06-19 | 北京嘀嘀无限科技发展有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110008377B (en) * | 2019-03-27 | 2021-09-21 | 华南理工大学 | Method for recommending movies by using user attributes |
CN110084670B (en) * | 2019-04-15 | 2022-03-25 | 东北大学 | Shelf commodity combination recommendation method based on LDA-MLP |
CN110069663B (en) * | 2019-04-29 | 2021-06-04 | 厦门美图之家科技有限公司 | Video recommendation method and device |
CN111984856A (en) * | 2019-07-25 | 2020-11-24 | 北京嘀嘀无限科技发展有限公司 | Information pushing method and device, server and computer readable storage medium |
CN110443502A (en) * | 2019-08-06 | 2019-11-12 | 合肥工业大学 | Crowdsourcing task recommendation method and system based on worker's capability comparison |
CN112395487B (en) * | 2019-08-14 | 2024-04-26 | 腾讯科技(深圳)有限公司 | Information recommendation method and device, computer readable storage medium and electronic equipment |
CN110795570B (en) * | 2019-10-11 | 2022-06-17 | 上海上湖信息技术有限公司 | Method and device for extracting user time sequence behavior characteristics |
CN111008334B (en) * | 2019-12-04 | 2023-04-18 | 华中科技大学 | Top-K recommendation method and system based on local pairwise ordering and global decision fusion |
CN113111251A (en) * | 2020-01-10 | 2021-07-13 | 阿里巴巴集团控股有限公司 | Project recommendation method, device and system |
CN111460046A (en) * | 2020-03-06 | 2020-07-28 | 合肥海策科技信息服务有限公司 | Scientific and technological information clustering method based on big data |
CN111581522B (en) * | 2020-06-05 | 2021-03-09 | 预见你情感(北京)教育咨询有限公司 | Social analysis method based on user identity identification |
CN111897999B (en) * | 2020-07-27 | 2023-06-16 | 九江学院 | Deep learning model construction method for video recommendation and based on LDA |
CN112184391B (en) * | 2020-10-16 | 2023-10-10 | 中国科学院计算技术研究所 | Training method of recommendation model, medium, electronic equipment and recommendation model |
CN112348629A (en) * | 2020-10-26 | 2021-02-09 | 邦道科技有限公司 | Commodity information pushing method and device |
CN112364245B (en) * | 2020-11-20 | 2021-12-21 | 浙江工业大学 | Top-K movie recommendation method based on heterogeneous information network embedding |
CN112925926B (en) * | 2021-01-28 | 2022-04-22 | 北京达佳互联信息技术有限公司 | Training method and device of multimedia recommendation model, server and storage medium |
CN113342963B (en) * | 2021-04-29 | 2022-03-04 | 山东大学 | Service recommendation method and system based on transfer learning |
CN113268670B (en) * | 2021-06-16 | 2022-09-27 | 中移(杭州)信息技术有限公司 | Latent factor hybrid recommendation method, device, equipment and computer storage medium |
CN113449147A (en) * | 2021-07-06 | 2021-09-28 | 乐视云计算有限公司 | Video recommendation method and device based on theme |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544216A (en) * | 2013-09-23 | 2014-01-29 | Tcl集团股份有限公司 | Information recommendation method and system combining image content and keywords |
CN107609201A (en) * | 2017-10-25 | 2018-01-19 | 广东工业大学 | A kind of recommended models generation method and relevant apparatus based on commending system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514496B (en) * | 2012-06-21 | 2017-05-17 | 腾讯科技(深圳)有限公司 | Method and system for processing recommended target software |
2018-03-01: CN CN201810169922.3A patent/CN108363804B/en active Active
Non-Patent Citations (2)
Title |
---|
Local item-item models for top-N recommendation; Evangelia Christakopoulou; ACM; 2016-09-19; full text *
Collaborative filtering recommendation algorithm based on spectral clustering and multi-factor fusion; Li Qian; Application Research of Computers; 2017-10-31; Vol. 34, No. 10; full text *
Also Published As
Publication number | Publication date |
---|---|
CN108363804A (en) | 2018-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363804B (en) | Local model weighted fusion Top-N movie recommendation method based on user clustering | |
CN108763362B (en) | Local model weighted fusion Top-N movie recommendation method based on random anchor point pair selection | |
WO2020207196A1 (en) | Method and apparatus for generating user tag, storage medium and computer device | |
CN108665323B (en) | Integration method for financial product recommendation system | |
CN107357793B (en) | Information recommendation method and device | |
Huang et al. | Sentiment and topic analysis on social media: a multi-task multi-label classification approach | |
Liang et al. | A probabilistic rating auto-encoder for personalized recommender systems | |
CN111563164A (en) | Specific target emotion classification method based on graph neural network | |
CN107220365A (en) | Accurate commending system and method based on collaborative filtering and correlation rule parallel processing | |
CN109508385B (en) | Character relation analysis method in webpage news data based on Bayesian network | |
Chen et al. | Research on personalized recommendation hybrid algorithm for interactive experience equipment | |
CN107895303B (en) | Personalized recommendation method based on OCEAN model | |
CN107103093B (en) | Short text recommendation method and device based on user behavior and emotion analysis | |
CN107944485A (en) | The commending system and method, personalized recommendation system found based on cluster group | |
CN105701516B (en) | A kind of automatic image marking method differentiated based on attribute | |
Zhang et al. | Learning to match clothing from textual feature-based compatible relationships | |
Harakawa et al. | Extracting hierarchical structure of web video groups based on sentiment-aware signed network analysis | |
Duan et al. | A hybrid intelligent service recommendation by latent semantics and explicit ratings | |
Menaria et al. | Tweet sentiment classification by semantic and frequency base features using hybrid classifier | |
Najafabadi et al. | Tag recommendation model using feature learning via word embedding | |
Ma et al. | Book recommendation model based on wide and deep model | |
Patil et al. | A survey on artificial intelligence (AI) based job recommendation systems | |
Abdi et al. | Using an auxiliary dataset to improve emotion estimation in users’ opinions | |
Long et al. | Domain-specific user preference prediction based on multiple user activities | |
Bharadhwaj | Layer-wise relevance propagation for explainable recommendations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||