CN108363804B - Local model weighted fusion Top-N movie recommendation method based on user clustering - Google Patents


Info

Publication number
CN108363804B
CN108363804B
Authority
CN
China
Prior art keywords
user
model
movie
document
topic
Prior art date
Legal status
Active
Application number
CN201810169922.3A
Other languages
Chinese (zh)
Other versions
CN108363804A (en)
Inventor
汤颖
孙康高
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810169922.3A priority Critical patent/CN108363804B/en
Publication of CN108363804A publication Critical patent/CN108363804A/en
Application granted granted Critical
Publication of CN108363804B publication Critical patent/CN108363804B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The local model weighted fusion Top-N movie recommendation method based on user clustering comprises the following steps. Step 1: data preprocessing; remove inactive users and movies with low popularity; construct user movie-tag documents; convert the explicit rating information into implicit feedback and build the user-movie implicit feedback matrix A. Step 2: user clustering; train an LDA topic model on the movie tag information to obtain user feature vectors and cluster the users with a spectral clustering algorithm. Step 3: determine the local recommendation models and train the global recommendation model. Step 4: model weighted fusion recommendation. Step 5: verify the effectiveness of the model by leave-one-out cross-validation.

Description

Local model weighted fusion Top-N movie recommendation method based on user clustering
Technical Field
The invention relates to a movie recommendation method on a network.
Background
With the rapid development of information technology and social networks, the data generated on the internet has grown exponentially, ushering in the era of big data. As data volume increases, it becomes harder and harder for people to find the information they actually want in massive data sets. This is where recommendation systems deliver their greatest value: based on user data, item information and users' historical behavior, a recommendation algorithm can accurately predict user preferences and recommend items a user is likely to be interested in, greatly reducing the cost of finding the target information.
Recommendation algorithms can be divided into content-based recommendation and collaborative filtering. Modern recommendation systems have two main tasks: rating prediction, and Top-N recommendation, the latter being the most widely used in real business scenarios. A Top-N recommendation algorithm presents the user with a ranked list of N items from which to pick something of interest. Top-N recommendation models fall into two main families, neighborhood-based collaborative filtering and model-based collaborative filtering. The former can be further subdivided into user-based neighborhood models (UserKNN) and item-based neighborhood models (ItemKNN); the latter is typified by latent factor models.
As the saying goes, "birds of a feather flock together": different user groups often form their own behavior patterns, so the similarity between the same two items can change from one group to another. A single recommendation model, however, usually cannot capture such local differences in similarity; it treats the similarity of two items as identical in every context, fails to accurately capture users' real preferences, and lowers the quality of personalized recommendation. Algorithms that train multiple local recommendation models and fuse them to improve the overall recommendation effect can alleviate this problem to some extent, but they often do not make full use of the data available in the recommendation scenario; the data they use is one-dimensional, and the final recommendation effect is mediocre.
Disclosure of Invention
To address the problems that a single model cannot accurately capture user preferences and that existing multi-model fusion algorithms rely on a single kind of training data, the invention proposes a new local model weighted fusion movie recommendation algorithm based on user clustering to realize Top-N personalized movie recommendation.
The invention uses the textual content information of movies to compute semantic-level user feature vectors through an LDA topic model, and clusters users with a spectral clustering algorithm on these feature vectors to build local user groups. It further uses the users' ratings of movies to build local recommendation models and a global recommendation model with a sparse linear model, and realizes the final Top-N personalized movie recommendation through linear weighted fusion of the local and global models.
The general flow of the local model weighted fusion Top-N movie recommendation method based on user clustering is shown in FIG. 1; the method specifically comprises the following steps:
Step 1: data preprocessing stage. Data cleaning is performed on inactive users and movies with low popularity; user movie-tag documents are constructed; the explicit rating information is converted into implicit feedback, and the user-movie implicit feedback matrix A is built;
1.1 Perform data cleaning on the original data set: remove users who have rated fewer than 20 movies and remove movies that have been rated fewer than 20 times, obtaining a new training data set;
1.2 Collect the tags assigned by all users to the movies in the new data set to build a tag dictionary. Each user is represented by a document consisting of the tags of all the movies the user has watched; the documents of all users form a corpus, and the TF-IDF value of each word in the corpus is computed. The term frequency TF, the inverse document frequency IDF and the term frequency-inverse document frequency TF-IDF are computed as in formulas (1), (2) and (3);
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} is the number of occurrences of t_i in d_j, and \sum_k n_{k,j} is the total number of word occurrences in d_j. IDF_i denotes the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |\{\,j : t_i \in d_j\,\}| is the number of documents containing t_i. TFIDF_{i,j} is the TF-IDF weight of word t_i in document d_j;
1.3 Convert the explicit 1-5 ratings into 0-1 implicit feedback: if the current user has rated the current movie, the entry is marked 1; if the user has not rated it (i.e. it is a candidate movie to be recommended), the entry is marked 0. This yields an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
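A minimal sketch of this preprocessing stage follows. It assumes MovieLens-style pandas DataFrames named ratings (userId, movieId, rating) and tags (userId, movieId, tag); these names, the helper preprocess, and the use of scikit-learn's TfidfVectorizer (whose IDF is a smoothed variant of formula (2)) are illustrative assumptions rather than part of the patent.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(ratings: pd.DataFrame, tags: pd.DataFrame, min_count: int = 20):
    # 1.1 drop users with fewer than 20 rated movies and movies rated fewer than 20 times
    user_counts = ratings.groupby("userId").size()
    movie_counts = ratings.groupby("movieId").size()
    ratings = ratings[ratings["userId"].isin(user_counts[user_counts >= min_count].index)
                      & ratings["movieId"].isin(movie_counts[movie_counts >= min_count].index)]

    users = sorted(ratings["userId"].unique())
    movies = sorted(ratings["movieId"].unique())
    u_idx = {u: i for i, u in enumerate(users)}
    m_idx = {m: j for j, m in enumerate(movies)}

    # 1.2 one "document" per user: the concatenated tags of every movie the user rated
    tags = tags[tags["userId"].isin(u_idx) & tags["movieId"].isin(m_idx)]
    docs = tags.groupby("userId")["tag"].apply(lambda t: " ".join(map(str, t)))
    corpus = [docs.get(u, "") for u in users]
    X_tfidf = TfidfVectorizer().fit_transform(corpus)   # n_users x dictionary_size TF-IDF vectors

    # 1.3 explicit 1-5 ratings -> 0/1 implicit feedback matrix A (n_users x n_movies)
    rows = ratings["userId"].map(u_idx).to_numpy()
    cols = ratings["movieId"].map(m_idx).to_numpy()
    A = csr_matrix((np.ones(len(ratings)), (rows, cols)), shape=(len(users), len(movies)))
    A.data[:] = 1.0                                      # rated -> 1, unrated -> 0
    return A, X_tfidf, users, movies
```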
Step 2: user clustering stage. Using the movie tag information, an LDA topic model is trained to obtain user feature vectors, and the users are clustered with a spectral clustering algorithm;
2.1 The LDA topic model is a document-topic-word three-layer Bayesian network: given a corpus, it can infer the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is shown in formula (4);
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
\theta denotes the topic distribution of a document, z denotes a topic, w denotes a document, \alpha denotes the Dirichlet prior of the per-document multinomial distribution over topics, \beta denotes the Dirichlet prior of the per-topic multinomial distribution over words, N denotes the number of words in the document, z_n denotes the topic of the nth word of the document, and w_n denotes the nth word of the document;
Each movie carries several tags assigned by users. A movie tag is mapped to a word w_n, the set of tags of all movies viewed by a user is mapped to a document w, and a particular type of movie preferred by the user is mapped to a topic z. If the data set contains n users, a corpus of n documents and a tag dictionary can be generated; each document in the corpus is represented by a vector of dictionary length whose entries are the TF-IDF values, in the user document and the corpus, of the tags at the corresponding dictionary positions;
To distinguish more distinct user groups, the larger the difference between topics, the better. To determine the optimal number of topics, several LDA models are trained with different topic counts, the average similarity between the topic vectors obtained from each trained model is computed, and the topic count of the model with the smallest average topic-vector similarity is taken as the optimal number of topics. LDA training yields the topic distribution \theta of each document, which is used as the feature vector of the corresponding user;
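The topic-number selection and user feature extraction described above can be sketched as follows, assuming scikit-learn's LatentDirichletAllocation as the LDA implementation (the patent names no specific one) and cosine similarity as the measure of similarity between topic vectors; the candidate topic counts and function names are illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def avg_topic_similarity(topic_word: np.ndarray) -> float:
    # mean pairwise cosine similarity of the topic-word vectors (diagonal excluded)
    sims = cosine_similarity(topic_word)
    k = topic_word.shape[0]
    return (sims.sum() - k) / (k * (k - 1))

def fit_user_features(X_tfidf, candidate_ks=(5, 10, 15, 20), seed=0):
    best_score, best_lda = None, None
    for k in candidate_ks:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(X_tfidf)
        # row-normalize components_ to obtain the topic-word distributions
        topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
        score = avg_topic_similarity(topic_word)
        if best_score is None or score < best_score:
            best_score, best_lda = score, lda        # keep the model with least similar topics
    theta = best_lda.transform(X_tfidf)              # document-topic distribution = user features
    return theta, best_lda
```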
2.2 Cluster the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
The number of clusters must be determined before clustering. Since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, the importance of each topic in the current user population is determined by summing all user feature vectors dimension-wise and averaging, which gives a topic strength vector for the population as a whole; the optimal number of clusters is then chosen by inspecting the value distribution of this vector. For example, in an LDA training run with 10 topics, the 10-dimensional topic strength vector obtained this way is visualized in FIG. 2 (vertical axis: topic strength; horizontal axis: topic). Inspection shows that topics 2, 9, 3, 8 and 6 have the highest strength in the current data set, meaning that most people like to watch these types of movies, so in this situation it is preferable to cluster the users into 5 groups with the spectral clustering algorithm. The specific steps of the spectral clustering algorithm are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors t_1, t_2, ..., t_k of L;
(4) stack the k column vectors t_1, t_2, ..., t_k into a matrix T \in R^{n \times k};
(5) for i = 1, ..., n, let y_i \in R^k be the ith row vector of T;
(6) cluster the points (y_i)_{i=1,...,n} into clusters C_1, C_2, ..., C_k with the K-Means algorithm;
For each user cluster, all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster are set to 0, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster index and P_u \in \{1, ..., k\};
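A sketch of this clustering stage, under the assumption that scikit-learn's SpectralClustering (which internally performs the Laplacian, eigenvector and K-Means steps listed above) stands in for the spectral clustering algorithm; the cluster count n_clusters is chosen manually from the topic strength vector, as in the FIG. 2 example.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from sklearn.cluster import SpectralClustering

def topic_strength(theta: np.ndarray) -> np.ndarray:
    # dimension-wise average of all user feature vectors (used to choose the cluster count)
    return theta.mean(axis=0)

def cluster_users(theta: np.ndarray, n_clusters: int, seed=0) -> np.ndarray:
    sc = SpectralClustering(n_clusters=n_clusters, affinity="rbf", random_state=seed)
    return sc.fit_predict(theta)                 # cluster label P_u for every user

def local_matrices(A: csr_matrix, labels: np.ndarray) -> dict:
    # A^{P_u}: zero out the rows of A whose users do not belong to cluster P_u
    out = {}
    for c in np.unique(labels):
        row_mask = (labels == c).astype(float)
        A_c = (diags(row_mask) @ A).tocsr()      # row-mask A without densifying it
        A_c.eliminate_zeros()
        out[c] = A_c
    return out
```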
Step 3: local recommendation model determination and global recommendation model training. The loss function of the sparse linear model SLIM is shown in formula (5);
\min_{W}\; \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\rho}{2}\lVert W\rVert_F^2 + \alpha\lVert W\rVert_1 \quad \text{s.t.}\ \operatorname{diag}(W) = 0 \qquad (5)
where A denotes the user-movie implicit feedback matrix, and \alpha and \rho control the weights of the L1 and L2 norms. Minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, while the L2 norm controls model complexity and prevents overfitting. The model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6);
\min_{w_j}\; \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\rho}{2}\lVert w_j\rVert_2^2 + \alpha\lVert w_j\rVert_1 \quad \text{s.t.}\ w_{jj} = 0 \qquad (6)
where a_j denotes the jth column of matrix A. The predicted recommendation score \tilde{a}_{ij} of user i for movie j is computed as in formula (7);

\tilde{a}_{ij} = a_i^{T} w_j = \sum_{l} a_{il}\, w_{lj} \qquad (7)
Using the sparse linear model SLIM as the base recommendation model, a global recommendation model and local recommendation models are constructed: training on the user-movie implicit feedback matrix A yields the global movie similarity matrix W, and training on each local implicit feedback training matrix A^{P_u} yields the local movie similarity matrix W^{P_u} corresponding to each cluster;
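A sketch of this training step. It solves the per-column problem (6) with scikit-learn's ElasticNet (coordinate descent) rather than the stochastic gradient descent named in the text, maps \alpha and \rho onto sklearn's alpha/l1_ratio only approximately, and enforces w_jj = 0 by zeroing column j of the design matrix; all of these are simplifying assumptions.

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.linear_model import ElasticNet

def train_slim(A, l1_reg=1e-3, l2_reg=1e-4, max_iter=200) -> np.ndarray:
    A = csc_matrix(A)                            # cheap column slicing
    m = A.shape[1]
    W = np.zeros((m, m))
    alpha = l1_reg + l2_reg                      # rough mapping onto sklearn's penalties
    l1_ratio = l1_reg / alpha
    for j in range(m):                           # columns are independent, so this loop parallelizes
        a_j = np.asarray(A[:, j].todense()).ravel()
        X = A.copy()
        X.data[X.indptr[j]:X.indptr[j + 1]] = 0.0    # drop column j so that w_jj = 0
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                           fit_intercept=False, max_iter=max_iter)
        model.fit(X, a_j)                        # per-column elastic-net problem, cf. formula (6)
        W[:, j] = model.coef_
        W[j, j] = 0.0
    return W

def train_global_and_local(A, A_locals: dict, **kw):
    # global similarity matrix W from A, one local matrix W^{P_u} per cluster from A^{P_u}
    W_global = train_slim(A, **kw)
    W_locals = {c: train_slim(A_c, **kw) for c, A_c in A_locals.items()}
    return W_global, W_locals
```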
Step 4: weighted fusion of the models. The weighted fusion recommendation score of the local and global models is computed as in formula (8);
\tilde{a}_{uj} = g \sum_{l \in R_u} w_{lj} + (1 - g) \sum_{l \in R_u} w_{lj}^{P_u} \qquad (8)

where \tilde{a}_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model. Adjusting g controls the relative weight of the global and local models in the fused model, and the best recommendation effect is obtained by determining the optimal weight parameter g; the optimal global-model weight for the current data set can be determined experimentally. After all model parameters are determined, the weighted fusion recommendation scores of all movies are computed for the current user u, the movies the user has already interacted with are removed, the remaining movies are sorted by score in descending order, and the top N movies are recommended to the current user;
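A sketch of the weighted fusion recommendation of formula (8); A, W_global, W_locals and labels are assumed to come from the previous steps, and g and N are the fusion weight and list length.

```python
import numpy as np

def recommend_top_n(u, A, W_global, W_locals, labels, g=0.5, N=10):
    a_u = np.asarray(A[u].todense()).ravel()                      # implicit feedback row of user u
    W_local = W_locals[labels[u]]                                 # similarity matrix of cluster P_u
    scores = g * (a_u @ W_global) + (1.0 - g) * (a_u @ W_local)   # formula (8)
    scores[a_u > 0] = -np.inf                     # remove movies u has already interacted with
    return np.argsort(-scores)[:N]                # indices of the N highest fused scores
```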
and 5, the recommendation method can prove the effectiveness of the model through leave-one-out cross validation. One movie may be randomly drawn from the movie score set for each user into a test set, with the other movies used as training sets for the model. Then, a list of Top-N movies is recommended for each user by using the trained model, and whether the corresponding movie of the user appears in the recommended list and the specific position p of the movie appears in the list is observed in the test seti. Finally, the recommendation quality of the model can be measured by two indexes of Hit Rate (HR) and Average Ranking Hit Rate (ARHR), where # hits represents the number of recommendation hits, # users represents the total number of users, and their definitions are shown in equations (9), (10);
HR = \frac{\#hits}{\#users} \qquad (9)

ARHR = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{p_i} \qquad (10)
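A sketch of the leave-one-out evaluation with HR and ARHR as defined in formulas (9) and (10); the held_out mapping and the recommend callback are assumed to be produced by the earlier steps (with the held-out movies excluded from training).

```python
def evaluate(held_out: dict, recommend, N: int = 10):
    """held_out maps user index -> held-out movie index; recommend(u, N) returns that
    user's Top-N list from a model trained without the held-out movie."""
    hits, arhr_sum = 0, 0.0
    for u, m in held_out.items():
        top_n = list(recommend(u, N))
        if m in top_n:
            hits += 1
            arhr_sum += 1.0 / (top_n.index(m) + 1)   # hit position p_i is 1-based
    n_users = len(held_out)
    hr = hits / n_users                               # formula (9)
    arhr = arhr_sum / n_users                         # formula (10)
    return hr, arhr
```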
the recommendation method flow steps are thus complete.
Combining the above techniques, the invention provides a local model weighted fusion Top-N movie recommendation algorithm based on user clustering. To address the problem that a traditional single recommendation model cannot accurately estimate local differences between items and therefore cannot accurately capture user preferences, a global recommendation model and user-clustering-based local recommendation models are trained separately, and Top-N movie recommendation is realized through linear weighted fusion of these models. In addition, to make full use of the data available in the movie recommendation scenario and to improve recommendation quality along multiple dimensions, the invention uses movie tag information to compute semantic-level user feature vectors through an LDA topic model and thereby realizes semantic-level grouping of the users.
The invention has the following advantages: (1) A novel algorithmic approach. With the sparse linear model as the base recommendation model, a global recommendation model and user-clustering-based local recommendation models are trained separately, and the final fused model is produced by linear weighted fusion. (2) Recommendation quality is improved along multiple dimensions. Besides training recommendation models on traditional rating data, the user clustering stage introduces movie tag data: an LDA topic model analyzes the topic attributes of the user population at the semantic level to obtain user feature vectors, and a spectral clustering algorithm groups the users, further improving recommendation quality. (3) The algorithm is simple and fast to implement. In the training stage of the local and global models, the models are independent of one another, as is each column of their similarity matrices, so training can be parallelized, greatly reducing training time and improving training efficiency. (4) Better recommendation quality. The proposed local model weighted fusion recommendation algorithm is an effective combination of content-based recommendation, neighborhood-based collaborative filtering and model-based collaborative filtering; it exploits the strengths of each approach and compensates for their weaknesses, greatly improving recommendation quality compared with using any single algorithm alone.
Drawings
FIG. 1 is a general flow diagram of the process of the present invention;
FIG. 2 is a plot of the subject intensity profile of the method of the present invention.
Detailed Description
Referring to the general flow chart of FIG. 1, the invention has four stages: a data preprocessing stage, a user clustering stage, a global and local recommendation model training stage, and a linear weighted fusion recommendation stage. In the data preprocessing stage, the data set is cleaned, inactive users and cold movies are removed, and the corpus used for LDA topic model training and the user-movie implicit feedback training matrix used for sparse linear model training are constructed. In the user clustering stage, the user corpus from the first stage is used to train an LDA topic model and obtain user feature vectors, the users are clustered with a spectral clustering algorithm, and a local implicit feedback training matrix is generated for each cluster. In the model training stage, the original implicit feedback matrix and the local implicit feedback matrices are used to train the global model and the local models with the sparse linear model, respectively. In the linear weighted fusion recommendation stage, the global model and the local models obtained above are fused by linear weighting to produce the final recommendation model.
The input of the method is the users' movie rating data and the movies' tag data; the output is a Top-N personalized movie recommendation list for each user.
The method comprises the following specific steps:
Step 1: data preprocessing stage. Data cleaning is performed on inactive users and movies with low popularity; user movie-tag documents are constructed; the explicit rating information is converted into implicit feedback, and the user-movie implicit feedback matrix A is built;
1.1 Perform data cleaning on the original data set: remove users who have rated fewer than 20 movies and remove movies that have been rated fewer than 20 times, obtaining a new training data set;
1.2 Collect the tags assigned by all users to the movies in the new data set to build a tag dictionary. Each user is represented by a document consisting of the tags of all the movies the user has watched; the documents of all users form a corpus, and the TF-IDF value of each word in the corpus is computed. The formulas of TF (term frequency), IDF (inverse document frequency) and TF-IDF (term frequency-inverse document frequency) are shown in (1), (2) and (3);
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} is the number of occurrences of t_i in d_j, and \sum_k n_{k,j} is the total number of word occurrences in d_j. IDF_i denotes the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |\{\,j : t_i \in d_j\,\}| is the number of documents containing t_i. TFIDF_{i,j} is the TF-IDF weight of word t_i in document d_j;
1.3 Convert the explicit 1-5 ratings into 0-1 implicit feedback: if the current user has rated the current movie, the entry is marked 1; if the user has not rated it (i.e. it is a candidate movie to be recommended), the entry is marked 0. This yields an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
Step 2: user clustering stage. Using the movie tag information, an LDA topic model is trained to obtain user feature vectors, and the users are clustered with a spectral clustering algorithm;
2.1 The LDA topic model is a document-topic-word three-layer Bayesian network: given a corpus, it can infer the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is shown in formula (4);
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
\theta denotes the topic distribution of a document, z denotes a topic, w denotes a document, \alpha denotes the Dirichlet prior of the per-document multinomial distribution over topics, \beta denotes the Dirichlet prior of the per-topic multinomial distribution over words, N denotes the number of words in the document, z_n denotes the topic of the nth word of the document, and w_n denotes the nth word of the document;
Each movie carries several tags assigned by users. A movie tag is mapped to a word w_n, the set of tags of all movies viewed by a user is mapped to a document w, and a particular type of movie preferred by the user is mapped to a topic z. If the data set contains n users, a corpus of n documents and a tag dictionary can be generated; each document in the corpus is represented by a vector of dictionary length whose entries are the TF-IDF values, in the user document and the corpus, of the tags at the corresponding dictionary positions;
To distinguish more distinct user groups, the larger the difference between topics, the better. To determine the optimal number of topics, several LDA models are trained with different topic counts, the average similarity between the topic vectors obtained from each trained model is computed, and the topic count of the model with the smallest average topic-vector similarity is taken as the optimal number of topics. LDA training yields the topic distribution \theta of each document, which is used as the feature vector of the corresponding user;
2.2 Cluster the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
The number of clusters must be determined before clustering. Since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, the importance of each topic in the current user population is determined by summing all user feature vectors dimension-wise and averaging, which gives a topic strength vector for the population as a whole; the optimal number of clusters is then chosen by inspecting the value distribution of this vector. For example, in an LDA training run with 10 topics, the 10-dimensional topic strength vector obtained this way is visualized in FIG. 2 (vertical axis: topic strength; horizontal axis: topic). Inspection shows that topics 2, 9, 3, 8 and 6 have the highest strength in the current data set, meaning that most people like to watch these types of movies, so in this situation it is preferable to cluster the users into 5 groups with the spectral clustering algorithm. The specific steps of the spectral clustering algorithm are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors t_1, t_2, ..., t_k of L;
(4) stack the k column vectors t_1, t_2, ..., t_k into a matrix T \in R^{n \times k};
(5) for i = 1, ..., n, let y_i \in R^k be the ith row vector of T;
(6) cluster the points (y_i)_{i=1,...,n} into clusters C_1, C_2, ..., C_k with the K-Means algorithm;
For each user cluster, all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster are set to 0, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster index and P_u \in \{1, ..., k\};
Step 3: local recommendation model determination and global recommendation model training. The loss function of the sparse linear model SLIM is shown in formula (5);
\min_{W}\; \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\rho}{2}\lVert W\rVert_F^2 + \alpha\lVert W\rVert_1 \quad \text{s.t.}\ \operatorname{diag}(W) = 0 \qquad (5)
where A denotes the user-movie implicit feedback matrix, and \alpha and \rho control the weights of the L1 and L2 norms. Minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, while the L2 norm controls model complexity and prevents overfitting. The model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6);
\min_{w_j}\; \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\rho}{2}\lVert w_j\rVert_2^2 + \alpha\lVert w_j\rVert_1 \quad \text{s.t.}\ w_{jj} = 0 \qquad (6)
where a_j denotes the jth column of matrix A. The predicted recommendation score \tilde{a}_{ij} of user i for movie j is computed as in formula (7);

\tilde{a}_{ij} = a_i^{T} w_j = \sum_{l} a_{il}\, w_{lj} \qquad (7)
Using the sparse linear model SLIM as the base recommendation model, a global recommendation model and local recommendation models are constructed: training on the user-movie implicit feedback matrix A yields the global movie similarity matrix W, and training on each local implicit feedback training matrix A^{P_u} yields the local movie similarity matrix W^{P_u} corresponding to each cluster;
Step 4: weighted fusion of the models. The weighted fusion recommendation score of the local and global models is computed as in formula (8);
\tilde{a}_{uj} = g \sum_{l \in R_u} w_{lj} + (1 - g) \sum_{l \in R_u} w_{lj}^{P_u} \qquad (8)

where \tilde{a}_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model. Adjusting g controls the relative weight of the global and local models in the fused model, and the best recommendation effect is obtained by determining the optimal weight parameter g; the optimal global-model weight for the current data set can be determined experimentally. After all model parameters are determined, the weighted fusion recommendation scores of all movies are computed for the current user u, the movies the user has already interacted with are removed, the remaining movies are sorted by score in descending order, and the top N movies are recommended to the current user;
and 5, the recommendation method can prove the effectiveness of the model through leave-one-out cross validation. One movie may be randomly drawn from the movie score set for each user into a test set, with the other movies used as training sets for the model. Then, a list of Top-N movies is recommended for each user by using the trained model, and whether the corresponding movie of the user appears in the recommended list and the specific position p of the movie appears in the list is observed in the test seti. Finally, the recommendation quality of the model can be measured by two indexes of Hit Rate (HR) and Average Ranking Hit Rate (ARHR), where # hits represents recommendation qualityThe number of hits, # users represents the total number of users, and their definitions are shown in formulas (9), (10);
HR = \frac{\#hits}{\#users} \qquad (9)

ARHR = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{p_i} \qquad (10)
the recommendation method flow steps are thus complete.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims (1)

1. The local model weighted fusion Top-N movie recommendation method based on user clustering specifically comprises the following steps:
step 1: data preprocessing; performing data cleaning on inactive users and movies with low popularity; constructing user movie-tag documents; converting the explicit rating information into implicit feedback and constructing the user-movie implicit feedback matrix A;
1.1 performing data cleaning on the original data set: removing users who have rated fewer than 20 movies and removing movies that have been rated fewer than 20 times to obtain a new training data set;
1.2 collecting the tags assigned by all users to the movies in the new data set to build a tag dictionary, representing the current user by a document consisting of the tags of all the movies the user has watched, forming a corpus from the documents of all users, and computing the TF-IDF value of each word of the corpus in its document; the term frequency TF, the inverse document frequency IDF and the term frequency-inverse document frequency TF-IDF are computed as in formulas (1), (2) and (3);
TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

IDF_{i} = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|} \qquad (2)

TFIDF_{i,j} = TF_{i,j} \times IDF_{i} \qquad (3)
wherein TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} is the number of occurrences of t_i in d_j, and \sum_k n_{k,j} is the total number of word occurrences in d_j; IDF_i denotes the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |\{\,j : t_i \in d_j\,\}| is the number of documents containing t_i; TFIDF_{i,j} is the TF-IDF weight of word t_i in document d_j;
1.3 converting the explicit 1-5 ratings into 0-1 implicit feedback: if the current user has rated the current movie, marking the entry 1, and if the user has not rated it, i.e. it is a candidate movie to be recommended, marking the entry 0, so as to obtain an n × m user-movie implicit feedback matrix A, where n is the number of users and m is the number of movies;
step 2: user clustering; using the movie tag information, training an LDA topic model to obtain user feature vectors, and clustering the users with a spectral clustering algorithm;
2.1 the LDA topic model is a document-topic-word three-layer Bayesian network; given a corpus, the LDA topic model infers the topic distribution of each document in the corpus and the word distribution of each topic; its joint probability is shown in formula (4);
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (4)
\theta denotes the topic distribution of a document, z denotes a topic, w denotes a document, \alpha denotes the Dirichlet prior of the per-document multinomial distribution over topics, \beta denotes the Dirichlet prior of the per-topic multinomial distribution over words, N denotes the number of words in the document, z_n denotes the topic of the nth word of the document, and w_n denotes the nth word of the document;
each movie carries several tags assigned by users; a movie tag is mapped to a word w_n, the set of tags of all movies viewed by a user is mapped to a document w, and a particular type of movie preferred by the user is mapped to a topic z; if the data set contains n users, a corpus of n documents and a tag dictionary can be generated, each document in the corpus being represented by a vector of dictionary length whose entries are the TF-IDF values, in the user document and the corpus, of the tags at the corresponding dictionary positions;
in order to distinguish more distinct user groups, the larger the difference between topics, the better; in order to determine the optimal number of topics, several LDA models are trained with different topic counts, the average similarity between the topic vectors obtained from each trained model is computed, and the topic count of the model with the smallest average topic-vector similarity is taken as the optimal number of topics; LDA training yields the topic distribution \theta of each document, which is used as the feature vector of the corresponding user;
2.2 clustering the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
before clustering, the number of clusters is determined first; since each dimension of a trained user vector represents the degree to which the user belongs to the corresponding topic, in order to determine the importance of each topic in the current user population, all user feature vectors are summed dimension-wise and averaged to obtain a topic strength vector for the whole population, and the optimal number of clusters is determined by inspecting the value distribution of this vector; the specific steps of the spectral clustering algorithm are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors t_1, t_2, ..., t_k of L;
(4) stack the k column vectors t_1, t_2, ..., t_k into a matrix T \in R^{n \times k};
(5) for i = 1, ..., n, let y_i \in R^k be the ith row vector of T;
(6) cluster the points (y_i)_{i=1,...,n} into clusters C_1, C_2, ..., C_k with the K-Means algorithm;
for each user cluster, all user row vectors in the user-movie implicit feedback matrix A that do not belong to the cluster are set to 0, generating a corresponding local implicit feedback training matrix A^{P_u} for each cluster, where P_u denotes the cluster index and P_u \in \{1, ..., k\};
Step 3, determining a local recommendation model and carrying out global recommendation model training; the loss function of the sparse linear model SLIM is shown in formula (5);
\min_{W}\; \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\rho}{2}\lVert W\rVert_F^2 + \alpha\lVert W\rVert_1 \quad \text{s.t.}\ \operatorname{diag}(W) = 0 \qquad (5)

wherein A represents the user-movie implicit feedback matrix, and \alpha and \rho control the weights of the L1 and L2 norms; minimizing this loss function yields an m × m sparse movie similarity matrix W; the L1 norm controls the sparsity of W, while the L2 norm controls model complexity and prevents overfitting; the model trains each column w_j of the W matrix in parallel by stochastic gradient descent to obtain the final W matrix, as shown in formula (6);
\min_{w_j}\; \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\rho}{2}\lVert w_j\rVert_2^2 + \alpha\lVert w_j\rVert_1 \quad \text{s.t.}\ w_{jj} = 0 \qquad (6)
wherein a_j represents the jth column of matrix A; the predicted recommendation score \tilde{a}_{ij} of user i for movie j is computed as in formula (7);

\tilde{a}_{ij} = a_i^{T} w_j = \sum_{l} a_{il}\, w_{lj} \qquad (7)
constructing a global recommendation model and local recommendation models with the sparse linear model SLIM as the base recommendation model: training on the user-movie implicit feedback matrix A to obtain the global movie similarity matrix W, and training on each local implicit feedback training matrix A^{P_u} to obtain the local movie similarity matrix W^{P_u} corresponding to each cluster;
Step 4, model weighting fusion recommendation; the calculation formula of the local model weighted fusion recommendation degree is shown as a formula (8);
\tilde{a}_{uj} = g \sum_{l \in R_u} w_{lj} + (1 - g) \sum_{l \in R_u} w_{lj}^{P_u} \qquad (8)

wherein \tilde{a}_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the corresponding similarity of movie l and movie j in the local model of the cluster P_u to which user u belongs, and g is the weight parameter of the global model; adjusting g controls the relative weight of the global and local models in the fused model, and the best recommendation effect is obtained by determining the optimal weight parameter g; the optimal global-model weight for the current data set is determined experimentally; after all model parameters are determined, the weighted fusion recommendation scores of all movies are computed for the current user u, the movies the user has already interacted with are removed, the remaining movies are sorted by score in descending order, and the top N movies are recommended to the current user;
step 5, verifying the effectiveness of the model through leave-one-out cross-validation; randomly extracting one movie from each user's set of rated movies into a test set, with the remaining movies serving as the training set of the model; then recommending a Top-N movie list to each user with the trained model and observing whether the user's held-out test movie appears in the recommended list and, if so, at which position p_i; finally, measuring the recommendation quality of the model by the two metrics hit rate HR and average reciprocal hit rank ARHR, where #hits denotes the number of recommendation hits and #users denotes the total number of users, as shown in formulas (9) and (10);
HR = \frac{\#hits}{\#users} \qquad (9)

ARHR = \frac{1}{\#users}\sum_{i=1}^{\#hits}\frac{1}{p_i} \qquad (10)
CN201810169922.3A 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering Active CN108363804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810169922.3A CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810169922.3A CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Publications (2)

Publication Number Publication Date
CN108363804A CN108363804A (en) 2018-08-03
CN108363804B true CN108363804B (en) 2020-08-21

Family

ID=63002919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810169922.3A Active CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Country Status (1)

Country Link
CN (1) CN108363804B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408702B (en) * 2018-08-29 2021-07-16 昆明理工大学 Mixed recommendation method based on sparse edge noise reduction automatic coding
CN111309874A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN111309873A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN110008377B (en) * 2019-03-27 2021-09-21 华南理工大学 Method for recommending movies by using user attributes
CN110084670B (en) * 2019-04-15 2022-03-25 东北大学 Shelf commodity combination recommendation method based on LDA-MLP
CN110069663B (en) * 2019-04-29 2021-06-04 厦门美图之家科技有限公司 Video recommendation method and device
CN111984856A (en) * 2019-07-25 2020-11-24 北京嘀嘀无限科技发展有限公司 Information pushing method and device, server and computer readable storage medium
CN110443502A (en) * 2019-08-06 2019-11-12 合肥工业大学 Crowdsourcing task recommendation method and system based on worker's capability comparison
CN112395487B (en) * 2019-08-14 2024-04-26 腾讯科技(深圳)有限公司 Information recommendation method and device, computer readable storage medium and electronic equipment
CN110795570B (en) * 2019-10-11 2022-06-17 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN111008334B (en) * 2019-12-04 2023-04-18 华中科技大学 Top-K recommendation method and system based on local pairwise ordering and global decision fusion
CN113111251A (en) * 2020-01-10 2021-07-13 阿里巴巴集团控股有限公司 Project recommendation method, device and system
CN111460046A (en) * 2020-03-06 2020-07-28 合肥海策科技信息服务有限公司 Scientific and technological information clustering method based on big data
CN111581522B (en) * 2020-06-05 2021-03-09 预见你情感(北京)教育咨询有限公司 Social analysis method based on user identity identification
CN111897999B (en) * 2020-07-27 2023-06-16 九江学院 Deep learning model construction method for video recommendation and based on LDA
CN112184391B (en) * 2020-10-16 2023-10-10 中国科学院计算技术研究所 Training method of recommendation model, medium, electronic equipment and recommendation model
CN112348629A (en) * 2020-10-26 2021-02-09 邦道科技有限公司 Commodity information pushing method and device
CN112364245B (en) * 2020-11-20 2021-12-21 浙江工业大学 Top-K movie recommendation method based on heterogeneous information network embedding
CN112925926B (en) * 2021-01-28 2022-04-22 北京达佳互联信息技术有限公司 Training method and device of multimedia recommendation model, server and storage medium
CN113342963B (en) * 2021-04-29 2022-03-04 山东大学 Service recommendation method and system based on transfer learning
CN113268670B (en) * 2021-06-16 2022-09-27 中移(杭州)信息技术有限公司 Latent factor hybrid recommendation method, device, equipment and computer storage medium
CN113449147A (en) * 2021-07-06 2021-09-28 乐视云计算有限公司 Video recommendation method and device based on theme

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544216A (en) * 2013-09-23 2014-01-29 Tcl集团股份有限公司 Information recommendation method and system combining image content and keywords
CN107609201A (en) * 2017-10-25 2018-01-19 广东工业大学 A kind of recommended models generation method and relevant apparatus based on commending system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514496B (en) * 2012-06-21 2017-05-17 腾讯科技(深圳)有限公司 Method and system for processing recommended target software

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544216A (en) * 2013-09-23 2014-01-29 Tcl集团股份有限公司 Information recommendation method and system combining image content and keywords
CN107609201A (en) * 2017-10-25 2018-01-19 广东工业大学 A kind of recommended models generation method and relevant apparatus based on commending system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Local item-item models for top-n recommendation; Evangelia Christakopoulou; ACM; 20160919; full text *
基于谱聚类与多因子融合的协同过滤推荐算法 (Collaborative filtering recommendation algorithm based on spectral clustering and multi-factor fusion); 李倩; 《计算机应用研究》; 20171031; Vol. 34, No. 10; full text *

Also Published As

Publication number Publication date
CN108363804A (en) 2018-08-03

Similar Documents

Publication Publication Date Title
CN108363804B (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
CN108763362B (en) Local model weighted fusion Top-N movie recommendation method based on random anchor point pair selection
WO2020207196A1 (en) Method and apparatus for generating user tag, storage medium and computer device
CN108665323B (en) Integration method for financial product recommendation system
CN107357793B (en) Information recommendation method and device
Huang et al. Sentiment and topic analysis on social media: a multi-task multi-label classification approach
Liang et al. A probabilistic rating auto-encoder for personalized recommender systems
CN111563164A (en) Specific target emotion classification method based on graph neural network
CN107220365A (en) Accurate commending system and method based on collaborative filtering and correlation rule parallel processing
CN109508385B (en) Character relation analysis method in webpage news data based on Bayesian network
Chen et al. Research on personalized recommendation hybrid algorithm for interactive experience equipment
CN107895303B (en) Personalized recommendation method based on OCEAN model
CN107103093B (en) Short text recommendation method and device based on user behavior and emotion analysis
CN107944485A (en) The commending system and method, personalized recommendation system found based on cluster group
CN105701516B (en) A kind of automatic image marking method differentiated based on attribute
Zhang et al. Learning to match clothing from textual feature-based compatible relationships
Harakawa et al. Extracting hierarchical structure of web video groups based on sentiment-aware signed network analysis
Duan et al. A hybrid intelligent service recommendation by latent semantics and explicit ratings
Menaria et al. Tweet sentiment classification by semantic and frequency base features using hybrid classifier
Najafabadi et al. Tag recommendation model using feature learning via word embedding
Ma et al. Book recommendation model based on wide and deep model
Patil et al. A survey on artificial intelligence (AI) based job recommendation systems
Abdi et al. Using an auxiliary dataset to improve emotion estimation in users’ opinions
Long et al. Domain-specific user preference prediction based on multiple user activities
Bharadhwaj Layer-wise relevance propagation for explainable recommendations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant