CN108363804A

CN108363804A - Local model weighted fusion Top-N movie recommendation method based on user clustering

Info

Publication number: CN108363804A
Application number: CN201810169922.3A
Authority: CN
Inventors: 汤颖; 孙康高
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-03-01
Filing date: 2018-03-01
Publication date: 2018-08-03
Anticipated expiration: 2038-03-01
Also published as: CN108363804B

Abstract

A local model weighted fusion Top-N movie recommendation method based on user clustering, comprising: step 1, data preprocessing: cleaning out inactive users and low-popularity movies, constructing user movie-tag documents, converting explicit rating information into implicit feedback information, and constructing a user-movie implicit feedback matrix A; step 2, user clustering: training an LDA topic model on the movie tag information to obtain user feature vectors, and clustering the users with a spectral clustering algorithm; step 3, determining the local recommendation models and training the global recommendation model; step 4, recommending by weighted fusion of the models; and step 5, demonstrating the effectiveness of the model by leave-one-out cross-validation.

Description

Local model weighted fusion Top-N movie recommendation method based on user clustering

Technical field

The invention relates to a method for recommending movies over a network.

Background

With the rapid development of information technology and social networks, the volume of data generated on the Internet has recently grown exponentially, and the era of big data has arrived. As the amount of data grows, it becomes increasingly difficult for people to find the information they actually want in it. This is where recommender systems are most valuable. Based on user profiles, item information, and users' historical behavior data, a recommendation algorithm can accurately predict user preferences and recommend items likely to interest each user, greatly reducing the cost of discovering the target information.

Recommendation algorithms can be divided into content-based recommendation and collaborative filtering. Modern recommender systems have two main tasks: rating prediction, and Top-N recommendation, which is the more widely used of the two in real business scenarios. A Top-N recommendation algorithm presents the user with a ranked list of n items from which the user picks what interests them. Top-N recommendation models fall into two main types: neighborhood-based collaborative filtering and model-based collaborative filtering. The former is subdivided into user-based neighborhood models (UserKNN) and item-based neighborhood models (ItemKNN); the latter is represented by latent factor models.

As the saying goes, "birds of a feather flock together." Different user groups tend to form their own distinctive behavior patterns, so the similarity between the same two items differs from one group to another. A single recommendation model often fails to capture these local differences in similarity: it treats the similarity between the same two items as identical in every context. Such models cannot accurately capture users' real preferences, which lowers the quality of personalized recommendation. Recommendation algorithms that train several local recommendation models and then fuse them to improve overall recommendation quality can address this problem to some extent, but they often fail to make full use of the data available in the recommendation scenario; the data they use is of a single kind, and the resulting recommendation quality is mediocre.

Summary of the invention

To overcome the problems that a single model in the prior art cannot accurately capture user preferences and that multi-model fusion algorithms train on a single kind of data, the present invention provides a new local model weighted fusion movie recommendation algorithm based on user clustering to realize Top-N personalized movie recommendation.

The invention uses the textual content information of movies to compute semantic-level user feature vectors with an LDA topic model, and on that basis clusters users with a spectral clustering algorithm to form local user groups. The invention further uses users' movie rating information to build local recommendation models and a global recommendation model with a sparse linear model, and realizes the final Top-N personalized movie recommendation through linear weighted fusion of the local and global models.

The overall flow of the local model weighted fusion Top-N movie recommendation method based on user clustering is shown in Figure 1. The method comprises the following steps:

Step 1: Data preprocessing. Clean out inactive users and low-popularity movies; construct the user movie-tag documents; convert the explicit rating information into implicit feedback information and construct the user-movie implicit feedback matrix A;

1.1 Clean the original data set: remove users who have watched fewer than 20 movies and movies that have been rated fewer than 20 times, yielding a new training data set;
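The cleaning rule in step 1.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `clean_ratings`, the `(user, movie, score)` tuple layout, and the choice to count once on the raw data (rather than re-filtering until stable) are all assumptions.

```python
from collections import Counter

def clean_ratings(ratings, min_user_movies=20, min_movie_ratings=20):
    """Drop inactive users and unpopular movies (hypothetical helper).

    ratings: list of (user, movie, score) tuples, one per rated pair.
    Counts are taken once on the raw data; the text does not say whether
    the filter should be repeated until no more rows are removed.
    """
    user_counts = Counter(user for user, _, _ in ratings)
    movie_counts = Counter(movie for _, movie, _ in ratings)
    return [
        (user, movie, score)
        for user, movie, score in ratings
        if user_counts[user] >= min_user_movies
        and movie_counts[movie] >= min_movie_ratings
    ]
```

Both thresholds default to 20, the value stated in the text, and can be lowered for small experiments.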

1.2 Collect all the tags that users have assigned to movies in the new data set into a tag dictionary. Represent each user by a document made up of the tags of all the movies that user has watched; the documents of all users form a corpus. Compute the TF-IDF value of each word of each document with respect to the corpus. The term frequency TF, the inverse document frequency IDF, and the term frequency-inverse document frequency TF-IDF are computed by formulas (1), (2), and (3);

TFIDF_{i,j} = TF_{i,j} × IDF_i    (3)

where TF_{i,j} is the term frequency of word t_i in document d_j, n_{i,j} is the number of times word t_i occurs in document d_j, and ∑_k n_{k,j} is the total number of occurrences of all words in document d_j. IDF_i is the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |{j : t_i ∈ d_j}| is the number of documents that contain word t_i. TFIDF_{i,j} is the term frequency-inverse document frequency of word t_i in document d_j;
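Formulas (1) and (2) are not reproduced in this text. Going by the symbol definitions above, they presumably take the standard forms TF_{i,j} = n_{i,j} / ∑_k n_{k,j} and IDF_i = log(|D| / |{j : t_i ∈ d_j}|). The sketch below computes TF-IDF under that assumption; the natural logarithm, the function name `tf_idf`, and the dictionary-based document layout are illustrative choices, not taken from the patent.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights per user document (hypothetical helper).

    documents: dict mapping user -> list of tags (repeats allowed).
    Returns a dict mapping user -> {tag: TF-IDF weight}.
    """
    n_docs = len(documents)
    # Document frequency: in how many user documents each tag occurs.
    doc_freq = Counter()
    for tags in documents.values():
        doc_freq.update(set(tags))
    weights = {}
    for user, tags in documents.items():
        counts = Counter(tags)
        total = sum(counts.values())
        weights[user] = {
            # TF (assumed formula 1) times IDF (assumed formula 2, natural log).
            tag: (n / total) * math.log(n_docs / doc_freq[tag])
            for tag, n in counts.items()
        }
    return weights
```

Under this form, a tag that occurs in every user's document gets weight 0, so very common tags contribute nothing to the user vectors.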

1.3 Convert the explicit rating information (for example, ratings of 1 to 5) into implicit feedback information represented by 0 and 1: an entry is 1 if the user has rated the movie and 0 if not, the unrated movies being the candidates for recommendation. This gives an n×m user-movie implicit feedback matrix, where n is the number of users and m is the number of movies;
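A minimal sketch of the conversion in step 1.3, assuming the ratings are available as `(user, movie, score)` tuples. The function name `build_implicit_matrix` and the sorted ordering of users and movies are illustrative assumptions.

```python
def build_implicit_matrix(ratings):
    """Build the n x m 0/1 user-movie matrix A from explicit ratings.

    ratings: list of (user, movie, score) tuples. The score itself is
    discarded; only the fact that a rating exists is kept.
    Returns (A, users, movies), where A[i][j] == 1 iff users[i] rated movies[j].
    """
    users = sorted({user for user, _, _ in ratings})
    movies = sorted({movie for _, movie, _ in ratings})
    user_index = {user: i for i, user in enumerate(users)}
    movie_index = {movie: j for j, movie in enumerate(movies)}
    A = [[0] * len(movies) for _ in users]
    for user, movie, _ in ratings:
        A[user_index[user]][movie_index[movie]] = 1
    return A, users, movies
```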

Step 2: User clustering. Using the movie tag information, train an LDA topic model to obtain user feature vectors, and cluster the users with a spectral clustering algorithm;

2.1 The LDA topic model is a three-layer document-topic-word Bayesian network. Given a corpus, the model infers the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is given by formula (4);

Here θ is the topic distribution of a document, z is a topic, w is a document, α is the Dirichlet prior parameter of the multinomial distribution over topics for each document, β is the Dirichlet prior parameter of the multinomial distribution over words for each topic, N is the number of documents in the corpus, z_n is the topic of the n-th word of a document, and w_n is the n-th word of a document;
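Formula (4) is not reproduced in this text. For reference, the joint distribution of the standard LDA model for a single document, which the symbols above appear to follow, is usually written as below. Note that in this standard form z and w stand for the sequences of topics and words of one document, and N is the number of words in that document.

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```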

Each movie carries tags assigned to it by multiple users. A movie tag is mapped to a word w_n, the set of tags of all the movies a user has watched is mapped to a document w, and a particular kind of movie that the user prefers is mapped to a topic z. If the data set contains n users, this yields a corpus of n documents together with a dictionary. Each document in the corpus is represented by a vector whose length is the size of the dictionary, and each entry of the vector is the TF-IDF value, for that user's document and the corpus, of the corresponding tag in the dictionary;

To separate user groups that are as distinctive as possible, the topics should differ from one another as much as possible. To determine the best number of topics, several LDA models are trained with different numbers of topics, the average similarity between the topic vectors of each trained model is computed, and the number of topics of the model with the smallest average topic-vector similarity is taken as the best number of topics. Training the LDA model yields the topic distribution θ of each document, which serves as the feature vector of the corresponding user;
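The selection rule for the number of topics can be sketched as follows, given the topic-word vectors of several already trained LDA models. The text does not name the similarity measure, so cosine similarity is an assumption here, as are the function names and the `models` dictionary layout.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_topic_similarity(topic_vectors):
    """Mean pairwise similarity between the topic vectors of one model.

    topic_vectors: list of at least two topic-word vectors.
    """
    k = len(topic_vectors)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return sum(cosine(topic_vectors[i], topic_vectors[j]) for i, j in pairs) / len(pairs)

def pick_num_topics(models):
    """models: dict mapping number of topics -> topic vectors of that model.

    Returns the number of topics whose model has the least similar topics.
    """
    return min(models, key=lambda k: average_topic_similarity(models[k]))
```

Training the candidate LDA models themselves is left to whichever topic-modeling library is in use.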

2.2 Cluster the users by applying a spectral clustering algorithm to the n user feature vectors obtained above;

Before clustering, the number of clusters must be determined. Each dimension of a trained user vector is that user's degree of membership in the corresponding topic. To gauge the importance of each topic across the user population, all user feature vectors are therefore summed dimension by dimension and averaged, giving a topic strength vector that represents the population as a whole; the best number of clusters is determined by inspecting the distribution of values in this vector. For example, in one LDA training run with 10 topics, this procedure gave a 10-dimensional topic strength vector, visualized in Figure 2 (the vertical axis is topic strength and the horizontal axis is the topic). Topics 2, 9, 3, 8, and 6 have the greatest strength in that data set, meaning that movies of these types have the largest audiences, so in that case it is appropriate to cluster the users into 5 classes with the spectral clustering algorithm. The spectral clustering algorithm proceeds as follows:
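The topic strength vector described above is a per-dimension mean of the user feature vectors. A minimal sketch follows; the helper names are hypothetical, and the final choice of the cluster count is still made by inspecting the values, as in the text.

```python
def topic_strength(user_vectors):
    """Average the user topic vectors dimension by dimension.

    user_vectors: list of n equal-length topic-membership vectors.
    Returns one strength value per topic.
    """
    n = len(user_vectors)
    k = len(user_vectors[0])
    return [sum(vec[t] for vec in user_vectors) / n for t in range(k)]

def rank_topics(strength):
    """Topic indices ordered from strongest to weakest."""
    return sorted(range(len(strength)), key=lambda t: strength[t], reverse=True)
```

Plotting `topic_strength(...)` as a bar chart gives the kind of view described for Figure 2.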

(1) Compute the n×n similarity matrix W and the degree matrix D;

(2) Compute the Laplacian matrix L = D - W;

(3) Compute the first k eigenvectors t_1, t_2, …, t_k of L;

(4) Form the matrix T ∈ R^{n×k} whose columns are the k vectors t_1, t_2, …, t_k;

(5) For i = 1, …, n, let y_i ∈ R^k be the i-th row vector of T;

(6) Use the K-Means algorithm to cluster the users (y_i)_{i=1,2,…,n} into clusters C_1, C_2, …, C_k;
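Steps (1) to (6) can be sketched with NumPy as follows. Several details are assumptions, since the text does not fix them: the Gaussian kernel used for the similarity matrix, the reading of "first k eigenvectors" as those with the k smallest eigenvalues, and the deterministic farthest-point initialization of K-Means. Cluster labels here run from 0 to k-1.

```python
import numpy as np

def _kmeans(Y, k, n_iter=100):
    """Plain K-Means with deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        # Pick the point farthest from all centers chosen so far.
        d = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(Y), dtype=int)
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Y[labels == c].mean(axis=0)
    return labels

def spectral_clustering(X, k, sigma=1.0):
    """Cluster the rows of X (user feature vectors) into k groups."""
    X = np.asarray(X, dtype=float)
    # (1) Similarity matrix W (Gaussian kernel, an assumption) and degree matrix D.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # (2) Unnormalized graph Laplacian.
    L = D - W
    # (3)-(4) Eigenvectors of the k smallest eigenvalues as the columns of T.
    _, vecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    T = vecs[:, :k]
    # (5)-(6) K-Means on the rows y_i of T.
    return _kmeans(T, k)
```

The kernel width `sigma` controls how quickly similarity decays with distance and would normally be tuned to the scale of the user vectors.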

For each user cluster, the row vectors of the original implicit feedback training matrix A that belong to users outside the cluster are set to 0, so that each cluster yields a corresponding local implicit feedback training matrix; P_u denotes the cluster index, with P_u ∈ {1, …, k};
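The masking that produces the local training matrices can be sketched as follows. The function name `local_matrices` is hypothetical, and cluster indices run from 0 to k-1 here rather than 1 to k as in the text.

```python
def local_matrices(A, labels, k):
    """Return one masked copy of A per cluster.

    A: n x m 0/1 matrix as a list of rows; labels[u] is the cluster of user u.
    In the copy for cluster c, rows of users outside c are all zeros.
    """
    return [
        [row[:] if labels[u] == c else [0] * len(row) for u, row in enumerate(A)]
        for c in range(k)
    ]
```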

Step 3: Determine the local recommendation models and train the global recommendation model. The loss function of the sparse linear model SLIM is given by formula (5);

where A is the original user-movie implicit feedback matrix and α and ρ control the weights of the L1 and L2 norms. Minimizing this loss function yields a sparse m×m movie similarity matrix W. In this model the L1 norm controls the sparsity of W, while the L2 norm controls the complexity of the model and prevents overfitting. The model is trained by stochastic gradient descent, with each column w_j of W trained in parallel to obtain the final matrix W, as shown in formula (6);

where a_j is the j-th column of matrix A. The predicted recommendation score of user i for movie j is computed by formula (7);
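Formulas (5) to (7) are not reproduced in this text. For orientation, the SLIM model as commonly stated in the literature has the form below, with a global objective, an equivalent per-column problem, and a prediction rule. The weights β and λ are the usual literature symbols; the patent parameterizes the two penalties with α and ρ instead, so its exact form may differ.

```latex
\min_{W}\ \tfrac{1}{2}\lVert A - AW\rVert_F^2 + \tfrac{\beta}{2}\lVert W\rVert_F^2 + \lambda\lVert W\rVert_1
\quad \text{s.t. } W \ge 0,\ \operatorname{diag}(W) = 0

\min_{w_j}\ \tfrac{1}{2}\lVert a_j - A w_j\rVert_2^2 + \tfrac{\beta}{2}\lVert w_j\rVert_2^2 + \lambda\lVert w_j\rVert_1
\quad \text{s.t. } w_j \ge 0,\ w_{jj} = 0

\tilde{a}_{ij} = a_i^{\mathsf{T}} w_j
```

Here a_i^T is the i-th row of A. Because the objective decouples over the columns of W, each w_j can be fitted independently, which is what makes the parallel training mentioned above possible.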

The sparse linear model SLIM is used as the base recommendation model for both the global recommendation model and the local recommendation models. The global movie similarity matrix W is trained on the global implicit feedback training matrix A, and the local movie similarity matrix of each cluster is trained on that cluster's local implicit feedback training matrix;
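A minimal NumPy sketch of column-wise SLIM training, applicable to the global matrix A and to each local matrix alike. It assumes the standard SLIM objective (squared error plus L1 and L2 penalties, non-negative weights, zero diagonal) and uses proximal gradient descent rather than the stochastic gradient descent named in the text; the parameter names `l1` and `l2` are illustrative and are not the patent's α and ρ.

```python
import numpy as np

def train_slim_column(A, j, l1=0.1, l2=0.1, n_iter=500):
    """Fit column j of the item similarity matrix (l2 must be > 0)."""
    A = np.asarray(A, dtype=float)
    a_j = A[:, j]
    gram = A.T @ A
    # Step size 1/Lipschitz constant of the smooth part of the objective.
    step = 1.0 / (np.linalg.eigvalsh(gram)[-1] + l2)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = gram @ w - A.T @ a_j + l2 * w
        # Proximal step for the L1 penalty combined with the constraint w >= 0.
        w = np.maximum(w - step * (grad + l1), 0.0)
        w[j] = 0.0  # a movie must not be used to predict itself
    return w

def train_slim(A, **kwargs):
    """Fit all columns independently; the loop is trivially parallelizable."""
    m = np.asarray(A).shape[1]
    return np.column_stack([train_slim_column(A, j, **kwargs) for j in range(m)])
```

Calling `train_slim` once on A and once per masked local matrix gives the global and local similarity matrices.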

Step 4: Recommendation by weighted fusion of the models. The weighted fusion recommendation score of the local models is computed by formula (8);

Formula (8) gives the weighted fusion recommendation score of movie j for user u, where R_u is the set of all movies that user u has interacted with, w_lj is the similarity between movie l and movie j in the global model, the local similarity term is the similarity between movie l and movie j in the local model of the cluster P_u to which user u belongs, and the parameter g is the weight of the global model. Adjusting g controls the relative weights of the global and local models in the fused model, and the best recommendation quality of the fused model is obtained by finding the optimal weight g. The best global-model weight for a given data set can be determined by experiment. Once all parameters of the model have been fixed, the weighted fusion recommendation scores of all movies for the current user u are computed and sorted in descending order, the movies the user has already interacted with are removed, and the top N movies are recommended to the user;
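Formula (8) is not reproduced in this text. The sketch below assumes the fused score is a convex combination, summing g times the global similarity plus (1 - g) times the local similarity over the movies in R_u; that form, and the function names, are assumptions consistent with the description rather than the patent's exact formula.

```python
def fused_scores(user_row, W_global, W_local, g):
    """Score every movie j for one user.

    user_row: 0/1 list of length m (the user's row of A).
    W_global, W_local: m x m similarity matrices (lists of rows); W_local is
    the matrix of the cluster the user belongs to. g in [0, 1] weights the
    global model.
    """
    m = len(user_row)
    watched = [l for l in range(m) if user_row[l]]  # the set R_u
    return [
        sum(g * W_global[l][j] + (1 - g) * W_local[l][j] for l in watched)
        for j in range(m)
    ]

def top_n(user_row, scores, n):
    """Rank unseen movies by score, highest first, and keep the first n."""
    candidates = [j for j in range(len(scores)) if not user_row[j]]
    candidates.sort(key=lambda j: scores[j], reverse=True)
    return candidates[:n]
```

Sweeping g over a grid such as 0.0, 0.1, …, 1.0 and comparing a validation metric is one simple way to carry out the experimental search the text mentions.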

Step 5: The effectiveness of the model can be demonstrated by leave-one-out cross-validation. One movie is drawn at random from each user's set of rated movies and placed in the test set; the remaining movies form the training set. The trained model then recommends a Top-N movie list to each user, and one checks whether that user's test-set movie appears in the list and, if so, its position p_i in the list. Finally, the recommendation quality of the model can be measured by two metrics, the hit rate (HR) and the average reciprocal hit rank (ARHR), defined in formulas (9) and (10), where #hits is the number of recommendation hits and #users is the total number of users;
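Formulas (9) and (10) are not reproduced in this text. The sketch below uses the definitions commonly given for these metrics, HR = #hits / #users and ARHR = (1 / #users) · ∑ 1/p_i summed over the hits, which match the symbols described above. The function name and the dictionary inputs are illustrative assumptions.

```python
def hit_rate_and_arhr(recommendations, held_out):
    """Compute (HR, ARHR) for leave-one-out evaluation.

    recommendations: dict mapping user -> ranked Top-N list of movies.
    held_out: dict mapping user -> the single movie withheld for testing.
    """
    hits = 0
    reciprocal_sum = 0.0
    for user, movie in held_out.items():
        ranked = recommendations.get(user, [])
        if movie in ranked:
            hits += 1
            # p_i is the 1-based position of the held-out movie in the list.
            reciprocal_sum += 1.0 / (ranked.index(movie) + 1)
    n_users = len(held_out)
    return hits / n_users, reciprocal_sum / n_users
```

ARHR rewards hits near the top of the list, so two models with the same HR can still differ in ARHR.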

This concludes the steps of the recommendation method.

Combining the techniques above, the present invention proposes a local model weighted fusion Top-N movie recommendation algorithm based on user clustering. To address the problem that a conventional single recommendation model cannot accurately estimate local differences between items and therefore cannot accurately capture user preferences, it trains a global recommendation model and user-cluster-based local recommendation models separately and realizes Top-N movie recommendation through linear weighted fusion of the models. In addition, to make full use of the data available in the movie recommendation scenario and improve recommendation quality along several dimensions, the invention uses movie tag information and an LDA topic model to compute semantic-level feature vectors for users, and thereby partitions users into groups at the semantic level.

The advantages of the present invention are: (1) A novel algorithmic approach. A sparse linear model serves as the base recommendation model; a global recommendation model and user-cluster-based local recommendation models are trained separately, and the final fused model is produced by linear weighted fusion. This approach handles the differences in movie similarity across user groups and overcomes the inability of a single model to capture user preferences accurately. (2) Recommendation quality improved along several dimensions. Besides using conventional rating data to train the recommendation models, the invention brings in movie tag data at the user clustering stage: an LDA topic model analyzes the semantic-level topic attributes of the users to obtain user feature vectors, and a spectral clustering algorithm clusters the users, which further improves recommendation quality. (3) Simple and fast implementation. In the training stage of the local and global models, the models are independent of one another, and so are the columns of each model's similarity matrix, so training can be parallelized, which greatly reduces training time and improves training efficiency. (4) Better recommendation quality. The proposed local model weighted fusion recommendation algorithm effectively combines content-based recommendation, neighborhood-based collaborative filtering, and model-based collaborative filtering; it exploits the strengths of each and compensates for their individual weaknesses, and it greatly improves recommendation quality compared with using any one of these algorithms alone.

Description of the drawings

Figure 1 is the overall flow chart of the method of the invention;

Figure 2 is the topic strength distribution chart of the method of the invention.

Detailed description

Referring to the overall flow chart of the technical solution in Figure 1, the invention has four stages: data preprocessing, user clustering, training of the global and local recommendation models, and linear weighted fusion of the recommendation models. In the data preprocessing stage, the data set is cleaned by removing inactive users and unpopular movies, and two artifacts are constructed: the corpus used to train the LDA topic model and the user-movie implicit feedback training matrix used to train the sparse linear model. In the user clustering stage, the user corpus from the first stage is used to train the LDA topic model and obtain user feature vectors, the users are clustered with a spectral clustering algorithm, and a local implicit feedback training matrix is generated for each cluster. In the model training stage, the global model and the local models are obtained by training the sparse linear model on the original implicit feedback matrix and on the local implicit feedback matrices, respectively. In the fusion stage, the global model and the local models from the previous stage are fused by linear weighting to obtain the final recommendation model.

The inputs of the invention are users' movie rating data and movie tag data; the output is a Top-N personalized movie recommendation list for each user.

The specific steps are Steps 1 to 5 as set out above.

The content described in the embodiments of this specification merely enumerates forms in which the inventive concept may be realized. The scope of protection of the invention should not be regarded as limited to the specific forms set out in the embodiments; it also extends to equivalent technical means that a person skilled in the art could conceive on the basis of the inventive concept.

Claims

1. A local model weighted fusion Top-N movie recommendation method based on user clustering, comprising the following steps:

Step 1: Data preprocessing; clean out inactive users and low-popularity movies; construct the user movie-tag documents; convert the explicit rating information into implicit feedback information and construct the user-movie implicit feedback matrix A;

1.1 Clean the original data set: remove users who have watched fewer than 20 movies and movies that have been rated fewer than 20 times, obtaining a new training data set;

1.2 Collect all the tags that users have assigned to movies in the new data set into a tag dictionary; represent each user by a document made up of the tags of all the movies that user has watched, the documents of all users forming a corpus; compute the TF-IDF value of each word of each document with respect to the corpus; the term frequency TF, the inverse document frequency IDF, and the term frequency-inverse document frequency TF-IDF are computed by formulas (1), (2), and (3);

TFIDF_{i,j} = TF_{i,j} × IDF_i    (3)

where TF_{i,j} is the term frequency of word t_i in document d_j, n_{i,j} is the number of times word t_i occurs in document d_j, and ∑_k n_{k,j} is the total number of occurrences of all words in document d_j; IDF_i is the inverse document frequency of word t_i, |D| is the total number of documents in the corpus, and |{j : t_i ∈ d_j}| is the number of documents that contain word t_i; TFIDF_{i,j} is the term frequency-inverse document frequency of word t_i in document d_j;

1.3 Convert the explicit rating information, such as ratings of 1 to 5, into implicit feedback information represented by 0 and 1: an entry is 1 if the user has rated the movie and 0 if not, the unrated movies being the candidates for recommendation; this gives an n×m user-movie implicit feedback matrix, where n is the number of users and m is the number of movies;

Step 2: User clustering; using the movie tag information, train an LDA topic model to obtain user feature vectors, and cluster the users with a spectral clustering algorithm;

2.1 The LDA topic model is a three-layer document-topic-word Bayesian network; given a corpus, the LDA topic model infers the topic distribution of each document in the corpus and the word distribution of each topic; its joint probability is given by formula (4);

θ is the topic distribution of a document, z is a topic, w is a document, α is the Dirichlet prior parameter of the multinomial distribution over topics for each document, β is the Dirichlet prior parameter of the multinomial distribution over words for each topic, N is the number of documents in the corpus, z_n is the topic of the n-th word of a document, and w_n is the n-th word of a document;

Each movie carries tags assigned to it by multiple users; a movie tag is mapped to a word w_n, the set of tags of all the movies a user has watched is mapped to a document w, and a particular kind of movie that the user prefers is mapped to a topic z; if the data set contains n users, this yields a corpus of n documents together with a dictionary; each document in the corpus is represented by a vector whose length is the size of the dictionary, and each entry of the vector is the TF-IDF value, for that user's document and the corpus, of the corresponding tag in the dictionary;

To separate user groups that are as distinctive as possible, the topics should differ from one another as much as possible; to determine the best number of topics, several LDA models are trained with different numbers of topics, the average similarity between the topic vectors of each trained model is computed, and the number of topics of the model with the smallest average topic-vector similarity is taken as the best number of topics; training the LDA model yields the topic distribution θ of each document, which serves as the feature vector of the corresponding user;

2.2 Cluster the users by applying a spectral clustering algorithm to the n user feature vectors obtained above;

Before clustering, the number of clusters must be determined; because each dimension of a trained user vector is that user's degree of membership in the corresponding topic, all user feature vectors are summed dimension by dimension and averaged in order to gauge the importance of each topic across the user population, giving a topic strength vector that represents the population as a whole, and the best number of clusters is determined by inspecting the distribution of values in this vector; the spectral clustering algorithm proceeds as follows:

(1) Compute the n×n similarity matrix W and the degree matrix D;

(2) Compute the Laplacian matrix L = D - W;

(3) Compute the first k eigenvectors t_1, t_2, …, t_k of L;

(4) Form the matrix T ∈ R^{n×k} whose columns are the k vectors t_1, t_2, …, t_k;

(5) For i = 1, …, n, let y_i ∈ R^k be the i-th row vector of T;

(6) Use the K-Means algorithm to cluster the users (y_i)_{i=1,2,…,n} into clusters C_1, C_2, …, C_k;

For each user cluster, the row vectors of the original implicit feedback training matrix A that belong to users outside the cluster are set to 0, so that each cluster yields a corresponding local implicit feedback training matrix; P_u denotes the cluster index, with P_u ∈ {1, …, k};

Step 3: Determine the local recommendation models and train the global recommendation model; the loss function of the sparse linear model SLIM is given by formula (5);

where A is the original user-movie implicit feedback matrix and α and ρ control the weights of the L1 and L2 norms; minimizing this loss function yields a sparse m×m movie similarity matrix W; in this model the L1 norm controls the sparsity of W, while the L2 norm controls the complexity of the model and prevents overfitting; the model is trained by stochastic gradient descent, with each column w_j of W trained in parallel to obtain the final matrix W, as shown in formula (6);

where a_j is the j-th column of matrix A; the predicted recommendation score of user i for movie j is computed by formula (7);

The sparse linear model SLIM is used as the base recommendation model for both the global recommendation model and the local recommendation models; the global movie similarity matrix W is trained on the global implicit feedback training matrix A, and the local movie similarity matrix of each cluster is trained on that cluster's local implicit feedback training matrix;

Step 4: Recommendation by weighted fusion of the models; the weighted fusion recommendation score of the local models is computed by formula (8);

Formula (8) gives the weighted fusion recommendation score of movie j for user u, where R_u is the set of all movies that user u has interacted with, w_lj is the similarity between movie l and movie j in the global model, the local similarity term is the similarity between movie l and movie j in the local model of the cluster P_u to which user u belongs, and the parameter g is the weight of the global model; adjusting g controls the relative weights of the global and local models in the fused model, and the best recommendation quality of the fused model is obtained by finding the optimal weight g; the best global-model weight for the current data set is determined by experiment; once all parameters of the model have been fixed, the weighted fusion recommendation scores of all movies for the current user u are computed and sorted in descending order, the movies the user has already interacted with are removed, and the top N movies are recommended to the user;

Step 5: Demonstrate the effectiveness of the model by leave-one-out cross-validation; one movie is drawn at random from each user's set of rated movies and placed in the test set, and the remaining movies form the training set; the trained model then recommends a Top-N movie list to each user, and one checks whether that user's test-set movie appears in the list and, if so, its position p_i in the list; finally, the recommendation quality of the model is measured by two metrics, the hit rate HR and the average reciprocal hit rank ARHR, where #hits is the number of recommendation hits and #users is the total number of users, as shown in formulas (9) and (10).