CN108269172B - Collaborative filtering method based on comprehensive similarity transfer

Info

Publication number: CN108269172B
Application number: CN201810050004.9A
Authority: CN
Inventors: 琚生根; 孙界平; 陈黎; 夏欣; 金玉; 王婧研
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2018-01-18
Filing date: 2018-01-18
Publication date: 2020-02-18
Anticipated expiration: 2038-01-18
Also published as: CN108269172A

Abstract

The present invention discloses a collaborative filtering method based on comprehensive similarity transfer. Compared with the prior art, the invention uses both user rating information and user attribute information when computing similarity. It also accounts for the fact that different users apply different scoring standards when expressing satisfaction, and it measures user rating similarity by the consistency of the users' rating distributions. This improves the accuracy of the similarity calculation and therefore the quality of the transferred data. Experimental results show that the model alleviates the data sparsity problem more effectively than the compared methods. In future work, item similarity or other knowledge, such as text information, could be combined to transfer data from the auxiliary domain, which could further improve the quality of the transferred data and thus the recommendation accuracy.

Description

Collaborative filtering method based on comprehensive similarity transfer

Technical Field

The invention relates to the field of network information computing, and in particular to a collaborative filtering method based on comprehensive similarity transfer.

Background

The amount of information on the network is growing exponentially. Network users can obtain rich information, but they also face information overload and find it difficult to mine the information that is useful to them from the mass of available data. A recommender system filters out, according to a user's interests, the part of the data the user is likely to care about. Recommender systems are now widely used, for example on platforms such as Amazon, eBay, MovieLens and GroupLens.

Collaborative filtering is one of the most widely used techniques in recommender systems. Its basic idea is to use users' historical rating data to predict a user's interest in unrated items and to recommend the items with the highest predicted interest. For traditional collaborative filtering algorithms, the key step is computing the similarity between users or between items. As the data grows, however, user rating data becomes extremely sparse and recommendation quality declines.

Several kinds of solutions to the data sparsity problem [1] currently exist. The first reduces the sparsity of the dataset by filling in ratings for unrated items [2-4]. This approach suits scenarios in which items are updated infrequently and the number of items is much smaller than the number of users; it depends on user behavior and suffers from the cold start problem. The second reduces sparsity through matrix factorization [5]: it exploits the latent relationship between users and items by performing singular value decomposition on the rating matrix. Training is expensive and the method does not adapt to changes in user interests. The third applies transfer learning, using the overlap between domains to promote learning in the target domain [6-7]. It assists training in the target domain by discovering the latent relationship between the target domain and the auxiliary domain, so it depends on how reliable that relationship is; an unreliable relationship leads to negative transfer.

Some researchers have proposed using multi-domain data to alleviate data sparsity in the target domain. Jamali et al. [8-9] proposed HeteroMF, a context-dependent matrix factorization model. Its main idea is to use the entities shared across domains, and to share the latent factors of those entities, so that multiple matrices are factorized jointly; the algorithm has many parameters to train and spends a great deal of time computing gradients. Li Bin et al. [10] proposed the Rating Matrix Generative Model (RMGM). Its main idea is to find a shared, implicit cluster-level rating matrix and then use that matrix to fill the missing values of the original matrix in the target domain; the method applies only to strongly related domains and lacks theoretical support. Li Chao et al. [11] proposed TSUCF (Transfer Similarity User-based Collaborative Filtering), a model based on user similarity transfer. Its main idea is to use cross-domain data to link the auxiliary domain to the target domain and thereby assist the target domain. The method uses only user rating information, and it measures rating similarity solely by the number of common items, without considering user preferences.

Although the above algorithms all use auxiliary-domain knowledge to improve recommendation accuracy, they still have the following shortcomings: first, the models based on matrix transformation have many parameters to train; second, they require the auxiliary domain and the target domain to be strongly related, so the models apply to few scenarios; third, when computing user rating similarity, they ignore differences in the scoring standards that users apply when expressing satisfaction.

Data sparsity is one of the main bottlenecks of traditional collaborative filtering algorithms. Transfer learning usually exploits the latent relationship between the target domain and the auxiliary domain to transfer knowledge from the auxiliary domain, thereby improving recommendation quality in the target domain. Existing similarity-transfer models generally use only user rating information and ignore differences in users' scoring standards when computing rating similarity. To address these problems, the present invention proposes a recommendation method based on comprehensive similarity transfer.

References:

[1] Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data [J]. Neurocomputing, 2016, 177(C): 447-453.

[2] Lemire D, Maclachlan A. Slope one predictors for online rating-based collaborative filtering [C]// Proceedings of the 2005 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2005: 471-475.

[3] Wang P, Ye H W. A personalized recommendation algorithm combining slope one scheme and user based collaborative filtering [C]// Industrial and Information Systems, 2009. IIS'09. International Conference on. IEEE, 2009: 152-154.

[4] Sun Z, Luo N, Kuang W. One real-time personalized recommendation systems based on Slope One algorithm [C]// Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on. IEEE, 2011, 3: 1826-1830.

[5] Sarwar B, Karypis G, Konstan J, et al. Application of dimensionality reduction in recommender system - a case study [R]. Minnesota Univ Minneapolis Dept of Computer Science, 2000.

[6] Li B, Yang Q, Xue X. Transfer learning for collaborative filtering via a rating-matrix generative model [C]// International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June. DBLP, 2009: 617-624.

[7] Pan W, Xiang E W, Liu N N, et al. Transfer learning in collaborative filtering for sparsity reduction [C]// AAAI. 2010, 10: 230-235.

[8] Jamali M, Lakshmanan L. HeteroMF: recommendation in heterogeneous information networks using context dependent factor models [C]// Proceedings of the 22nd International Conference on World Wide Web. ACM, 2013: 643-654.

[9] Wu S, Liu Q, Wang L, et al. Contextual operation for recommender systems [J]. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(8): 2000-2012.

[10] Li B, Yang Q, Xue X. Transfer learning for collaborative filtering via a rating-matrix generative model [C]// Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009: 617-624.

[11] Li Chao, Zhou Tao, Huang Junming, et al. Cross-platform cross-recommendation algorithm based on user similarity transfer [J]. Journal of Chinese Information Processing (中文信息学报), 2016, 30(2): 90-98.

Summary of the Invention

The purpose of the present invention is to solve the above problems by providing a collaborative filtering method based on comprehensive similarity transfer.

The invention achieves this purpose through the following technical solution.

The invention comprises the following steps:

(1) Recommendation method based on comprehensive similarity transfer: suppose there are two platforms e₁ and e₂. U₁ denotes the users with historical behavior information only on platform e₁, U₂ denotes the users with historical behavior information only on platform e₂, and U_c denotes the users with historical behavior information on both e₁ and e₂, who are defined as cross users. In practice, the number of cross users is far smaller than the number of non-cross users. Through the cross users, a similarity link is established between the non-cross users U₁ and U₂, which helps recommendation in the target domain;

(2) Similarity transfer: the similarity between a non-cross user in U₁ and a non-cross user in U₂ cannot be computed directly. However, the similarity of each of them to the cross users U_c can be computed, so the cross users U_c can serve as a bridge for establishing the similarity between U₁ and U₂;

Similarity transfer steps: first, find the set U_c of users common to platform 1 and platform 2. Then compute the similarity of U₁ to U_c, recorded as the vector S(U₁, U_c), and the similarity of U₂ to U_c, recorded as the vector S(U₂, U_c). Finally, compute the inner product of S(U₁, U_c) and S(U₂, U_c), which is the transferred similarity of U₁ and U₂.

Here U₁₁ denotes non-cross user 1 of platform 1; U₂₁ and U₂₂ denote non-cross users of platform 2; U_c1, U_c2, etc. denote cross users; and S₁, S₂, etc. denote similarities. To compute the similarity between U₁₁ and U₂₁, one passes through U_c1, U_c2 and U_c3 and computes it indirectly:

[formula not reproduced]

In summary, the similarity between U₁ and U₂ can be formalized as:

sim(U₁, U₂) = S(U₁, U_c) · S(U₂, U_c) = Σ_{c ∈ U_c} sim(U₁, c) · sim(U₂, c)

(3) Similarity calculation: before the similarity between the non-cross users U₁ and U₂ can be computed, the similarity of U₁ and of U₂ to the cross users must be computed first. The similarities are computed as follows:

1) User rating similarity

User rating similarity is measured from two aspects: rating distribution consistency and confidence;

Rating distribution consistency is determined by the distribution of the ratings that two users gave to the same items; the more consistent the distributions, the more similar the two users' interests. Let {ur₁, ur₂, ..., ur_n} and {vr₁, vr₂, ..., vr_n} be the ratings of user u and user v on their common items. Sort each set in ascending order, giving {ur₁, ur₂, ..., ur_n} and {vr_x₁, vr_x₂, ..., vr_x_n}. The better the sequence 1, 2, ..., n matches the sequence x₁, x₂, ..., x_n, the higher the consistency. The formula for dist(u,v) is as follows:

[formula not reproduced]

Confidence is determined by the number of items that both users have rated; if that number is small, a consistent rating distribution does not necessarily mean that the two users are similar. The formula for conf(u,v) is as follows:

[formula not reproduced]

where I_u denotes the set of items rated by user u;

The user rating similarity is computed as follows:

sim₁(u,v) = dist(u,v) · conf(u,v)    (1-4)

2) User attribute similarity

User attribute similarity is measured from user attributes; users who share attributes are generally assumed to have similar interests to some extent. The formula for sim₂(u,v) is as follows:

[formula not reproduced]

where n is the number of attributes; sim(u,v,i) indicates whether the two users have the same value for the i-th attribute, and equals 1 if they do and 0 otherwise; and d_i is the discriminative power of the i-th attribute. If the users who share a given attribute have rated all items, that attribute has no discriminative power. The value of d_i depends on the dataset;

3) Final similarity

In general, when a user has rated items, the rating information should be used as far as possible; when a user has no ratings, the user attribute information should be used instead. As the number of items a user has rated grows, the method should transition smoothly to recommending from rating information. A sigmoid function is used for this smoothing, and the final user similarity is defined as:

sim(u,v) = α·sim₁(u,v) + (1−α)·sim₂(u,v)    (1-6)

[formula for α not reproduced]

where C_uv denotes the set of items rated by both user u and user v. As the formula shows, the user similarity calculation shifts smoothly toward rating information as the number of rated items increases, and this smooth transition can improve prediction accuracy in the cold start state;

(4) Method description:

A) Computing user similarity: step one, compute the user attribute similarity from the user attribute information; step two, compute the user rating similarity from the user rating information; step three, compute the final user similarity from the user attribute similarity and the user rating similarity;

B) Recommendation based on transfer learning: step one, compute the similarity S(U₁, U_c) between U₁ and U_c; step two, compute the similarity S(U₂, U_c) between U₂ and U_c; step three, compute the transferred similarity sim(U₁, U₂); step four, use the transferred similarity sim(U₁, U₂) together with the user-based collaborative filtering (UCF) method to make recommendations.

The beneficial effects of the present invention are as follows:

The invention is a collaborative filtering method based on comprehensive similarity transfer. Compared with the prior art, it uses both user rating information and user attribute information when computing similarity. It also accounts for differences in the scoring standards that users apply when expressing satisfaction, and it measures user rating similarity by the consistency of the users' rating distributions. This improves the accuracy of the similarity calculation and therefore the quality of the transferred data. Experimental results show that the model alleviates the data sparsity problem more effectively than the compared methods. In future work, item similarity or other knowledge, such as text information, could be combined to transfer data from the auxiliary domain, which could further improve the quality of the transferred data and thus the recommendation accuracy.

Description of the Drawings

Fig. 1 is the user rating matrix diagram of the present invention;

Fig. 2 is a schematic diagram of similarity transfer in the present invention;

Fig. 3 compares the RMSE values of the different methods for group A;

Fig. 4 compares the RMSE values of the different methods for group B;

Fig. 5 compares the RMSE values of the different methods for group C;

Fig. 6 compares the RMSE values of the different methods for group D;

Fig. 7 compares the RMSE values of the different methods for group E;

Fig. 8 compares the RMSE values of the different methods for N = 5;

Fig. 9 compares the RMSE values of the different methods for N = 10;

Fig. 10 compares the RMSE values of the different methods for N = 20;

Fig. 11 compares the RMSE values of the different methods for N = 30;

Fig. 12 compares the RMSE values of the different methods for N = 40.

Detailed Description

The present invention is further described below with reference to the accompanying drawings.

Recommendation method based on comprehensive similarity transfer:

The invention proposes a recommendation method based on comprehensive similarity transfer, which uses auxiliary-domain information to alleviate the data sparsity problem in the target domain.

The method is described below using two movie platforms as an example. Suppose there are two platforms e₁ and e₂. U₁ denotes the users with historical behavior information only on platform e₁, U₂ denotes the users with historical behavior information only on platform e₂, and U_c denotes the users with historical behavior information on both e₁ and e₂, who are defined as cross users. The user behavior matrix is shown in Fig. 1.

In practice, the number of cross users is far smaller than the number of non-cross users.

Traditional recommendation methods use the small proportion of cross users to make recommendations for the much larger proportion of non-cross users, which gives rise to cold start and data sparsity problems and results in low recommendation quality.

The method of the invention uses the cross users to establish a similarity link between the non-cross users U₁ and U₂, which helps recommendation in the target domain.
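The three user groups defined above can be sketched with plain set operations. This is only an illustration: it assumes each platform's users are available as a collection of identifiers, and the function name partition_users and the sample identifiers are hypothetical, not taken from the patent.

```python
def partition_users(users_e1, users_e2):
    """Split the users of two platforms into (U1, U2, Uc)."""
    users_e1, users_e2 = set(users_e1), set(users_e2)
    cross = users_e1 & users_e2      # Uc: history on both platforms
    only_e1 = users_e1 - cross       # U1: history only on platform e1
    only_e2 = users_e2 - cross       # U2: history only on platform e2
    return only_e1, only_e2, cross


# Hypothetical user identifiers, for illustration only.
u1, u2, uc = partition_users({"ann", "bob", "cy"}, {"cy", "dee", "eve"})
print(sorted(u1), sorted(u2), sorted(uc))
```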

Similarity transfer:

As shown in Fig. 1, the similarity between a non-cross user in U₁ and a non-cross user in U₂ cannot be computed directly. However, the similarity of each of them to the cross users U_c can be computed, so the cross users U_c can serve as a bridge for establishing the similarity between U₁ and U₂.

The similarity transfer proceeds as follows: first, find the set U_c of users common to platform 1 and platform 2. Then compute the similarity of U₁ to U_c, recorded as the vector S(U₁, U_c), and the similarity of U₂ to U_c, recorded as the vector S(U₂, U_c). Finally, compute the inner product of S(U₁, U_c) and S(U₂, U_c), which is the transferred similarity of U₁ and U₂.

Similarity transfer is illustrated in Fig. 2, where U₁₁ denotes non-cross user 1 of platform 1; U₂₁ and U₂₂ denote non-cross users of platform 2; U_c1, U_c2, etc. denote cross users; and S₁, S₂, etc. denote similarities. To compute the similarity between U₁₁ and U₂₁, one passes through U_c1, U_c2 and U_c3 and computes it indirectly:

[formula not reproduced]

In summary, the similarity between U₁ and U₂ can be formalized as:

sim(U₁, U₂) = S(U₁, U_c) · S(U₂, U_c) = Σ_{c ∈ U_c} sim(U₁, c) · sim(U₂, c)
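The inner-product step described above can be sketched as follows. The sketch assumes that each non-cross user already has a list of similarities to the same cross users, in the same order; the function name transfer_similarity and the numeric values are hypothetical.

```python
def transfer_similarity(sim_to_cross_1, sim_to_cross_2):
    """Inner product of two similarity vectors indexed by the same cross users."""
    if len(sim_to_cross_1) != len(sim_to_cross_2):
        raise ValueError("both vectors must cover the same cross users")
    return sum(a * b for a, b in zip(sim_to_cross_1, sim_to_cross_2))


# Hypothetical similarities to three cross users Uc1, Uc2, Uc3.
s_u11 = [0.9, 0.1, 0.4]   # user U11 on platform 1
s_u21 = [0.8, 0.2, 0.5]   # user U21 on platform 2
s_u22 = [0.1, 0.7, 0.0]   # user U22 on platform 2

sim_u11_u21 = transfer_similarity(s_u11, s_u21)
sim_u11_u22 = transfer_similarity(s_u11, s_u22)
print(sim_u11_u21 > sim_u11_u22)
```

In this toy data U11 and U21 resemble the same cross users, so their transferred similarity is the larger of the two.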

Similarity calculation:

Based on the above analysis, before the similarity between the non-cross users U₁ and U₂ can be computed, the similarity of U₁ and of U₂ to the cross users must be computed first. The similarities are computed as follows:

1) User rating similarity

The invention measures user rating similarity from two aspects: rating distribution consistency and confidence.

Rating distribution consistency is determined by the distribution of the ratings that two users gave to the same items. The more consistent the distributions, the more similar the two users' interests. Let {ur₁, ur₂, ..., ur_n} and {vr₁, vr₂, ..., vr_n} be the ratings of user u and user v on their common items. Sort each set in ascending order, giving {ur₁, ur₂, ..., ur_n} and {vr_x₁, vr_x₂, ..., vr_x_n}. The better the sequence 1, 2, ..., n matches the sequence x₁, x₂, ..., x_n, the higher the consistency. The formula for dist(u,v) is as follows:

[formula not reproduced]

Confidence is determined by the number of items that both users have rated; if that number is small, a consistent rating distribution does not necessarily mean that the two users are similar. The formula for conf(u,v) is as follows:

[formula not reproduced]

where I_u denotes the set of items rated by user u.

The user rating similarity is computed as follows:

sim₁(u,v) = dist(u,v) · conf(u,v)    (1-4)
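The following sketch illustrates formula (1-4) only. The expressions for dist(u,v) and conf(u,v) are not available in this text, so two stand-ins are assumed here and should not be read as the patent's definitions: dist is taken as the Spearman rank correlation of the common ratings rescaled to [0, 1], and conf as the Jaccard overlap |I_u ∩ I_v| / |I_u ∪ I_v|. All names and ratings are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def dist(ratings_u, ratings_v):
    """ASSUMED stand-in: Spearman rank correlation on common items, mapped to [0, 1]."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    ru = average_ranks([ratings_u[i] for i in common])
    rv = average_ranks([ratings_v[i] for i in common])
    mean = (len(common) + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(ru, rv))
    var_u = sum((a - mean) ** 2 for a in ru)
    var_v = sum((b - mean) ** 2 for b in rv)
    if var_u == 0 or var_v == 0:
        return 0.0  # one user gave identical ratings, so no ordering exists
    return (cov / math.sqrt(var_u * var_v) + 1) / 2


def conf(ratings_u, ratings_v):
    """ASSUMED stand-in: Jaccard overlap of the two rated item sets."""
    items_u, items_v = set(ratings_u), set(ratings_v)
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0


def rating_similarity(ratings_u, ratings_v):
    """sim1(u, v) = dist(u, v) * conf(u, v), as in formula (1-4)."""
    return dist(ratings_u, ratings_v) * conf(ratings_u, ratings_v)


# Hypothetical ratings: v scores lower than u but orders the common items identically.
u = {"a": 5, "b": 3, "c": 1, "d": 4}
v = {"a": 3, "b": 2, "c": 1, "e": 5}
sim1_uv = rating_similarity(u, v)
print(sim1_uv)
```

Because the stand-in for dist compares orderings rather than raw scores, a strict and a generous rater who rank the common items the same way still count as consistent, which is the behavior the text asks of the rating distribution consistency.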

2) User attribute similarity

User attribute similarity is measured from user attributes. Users who share attributes are generally assumed to have similar interests to some extent. The formula for sim₂(u,v) is as follows:

[formula not reproduced]

where n is the number of attributes; sim(u,v,i) indicates whether the two users have the same value for the i-th attribute, and equals 1 if they do and 0 otherwise; and d_i is the discriminative power of the i-th attribute. If the users who share a given attribute have rated all items, that attribute has no discriminative power. The value of d_i depends on the dataset.
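A possible reading of the attribute similarity is sketched below. Since the formula itself is not available in this text, the sketch assumes a weighted match ratio, Σ d_i · sim(u,v,i) / Σ d_i; the normalization, the function name, the attribute names and the weights are all assumptions made for illustration.

```python
def attribute_similarity(attrs_u, attrs_v, weights):
    """ASSUMED form: weighted share of attributes on which both users agree.

    weights maps each attribute name to its discriminative power d_i.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    matched = sum(
        d_i
        for name, d_i in weights.items()
        # sim(u, v, i) is 1 when both users have the attribute with equal values
        if name in attrs_u and name in attrs_v and attrs_u[name] == attrs_v[name]
    )
    return matched / total


# Hypothetical users and weights, for illustration only.
user_u = {"gender": "F", "age_group": "25-34", "occupation": "student"}
user_v = {"gender": "F", "age_group": "35-44", "occupation": "student"}
d = {"gender": 1, "age_group": 2, "occupation": 1}

sim2_uv = attribute_similarity(user_u, user_v, d)
print(sim2_uv)
```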

3) Final similarity

In general, when a user has rated items, the rating information should be used as far as possible; when a user has no ratings, the user attribute information should be used instead. As the number of items a user has rated grows, the method should transition smoothly to recommending from rating information. The invention uses a sigmoid function for this smoothing, and the final user similarity is defined as:

sim(u,v) = α·sim₁(u,v) + (1−α)·sim₂(u,v)    (1-6)

[formula for α not reproduced]

where C_uv denotes the set of items rated by both user u and user v. As the formula shows, the user similarity calculation shifts smoothly toward rating information as the number of rated items increases, and this smooth transition can improve prediction accuracy in the cold start state.
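The blend in formula (1-6) can be sketched as below. The expression for α is not available in this text; the sketch assumes α = 2·sigmoid(|C_uv| / scale) − 1, chosen only because it equals 0 when the users share no rated items and approaches 1 as the overlap grows. The scale parameter and the function names are hypothetical.

```python
import math


def rating_weight(n_common, scale=5.0):
    """ASSUMED form of alpha: a shifted sigmoid of the number of co-rated items."""
    return 2.0 / (1.0 + math.exp(-n_common / scale)) - 1.0


def final_similarity(sim_rating, sim_attr, n_common, scale=5.0):
    """sim(u, v) = alpha * sim1(u, v) + (1 - alpha) * sim2(u, v), as in (1-6)."""
    alpha = rating_weight(n_common, scale)
    return alpha * sim_rating + (1.0 - alpha) * sim_attr


# With no common items the result is the attribute similarity alone;
# with many common items it approaches the rating similarity.
cold = final_similarity(sim_rating=0.9, sim_attr=0.4, n_common=0)
warm = final_similarity(sim_rating=0.9, sim_attr=0.4, n_common=1000)
print(cold, warm)
```

A plain sigmoid of |C_uv| would give α = 0.5 at zero overlap, which is why a shifted form is assumed here; any monotone weight with the same limits would serve the purpose described in the text.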

Method description:

1) Computing user similarity: step one, compute the user attribute similarity from the user attribute information; step two, compute the user rating similarity from the user rating information; step three, compute the final user similarity from the user attribute similarity and the user rating similarity.

2) Recommendation based on transfer learning: step one, compute the similarity S(U₁, U_c) between U₁ and U_c; step two, compute the similarity S(U₂, U_c) between U₂ and U_c; step three, compute the transferred similarity sim(U₁, U₂); step four, use the transferred similarity sim(U₁, U₂) together with the user-based collaborative filtering (UCF) method to make recommendations.
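The last step, combining the transferred similarity with UCF, can be sketched as follows. The text names UCF without spelling out its prediction rule, so the sketch assumes the common mean-centered weighted average over the top-N most similar neighbors; the function names, similarities and ratings are hypothetical.

```python
def mean_rating(ratings):
    """Average of one user's ratings."""
    return sum(ratings.values()) / len(ratings)


def predict_rating(target_mean, neighbors, item, top_n=5):
    """Predict the target user's rating of `item`.

    neighbors is a list of (transferred_similarity, ratings_dict) pairs.
    ASSUMED rule: mean-centered weighted average over the top_n neighbors.
    """
    rated = [(sim, ratings) for sim, ratings in neighbors
             if sim > 0 and item in ratings]
    rated.sort(key=lambda pair: pair[0], reverse=True)
    top = rated[:top_n]
    norm = sum(sim for sim, _ in top)
    if norm == 0:
        return target_mean  # no usable neighbor: fall back to the user's mean
    offset = sum(sim * (ratings[item] - mean_rating(ratings))
                 for sim, ratings in top)
    return target_mean + offset / norm


# Hypothetical neighbors from the other platform with transferred similarities.
neighbors = [
    (0.94, {"m1": 5, "m2": 3}),
    (0.16, {"m1": 2, "m3": 4}),
]
predicted = predict_rating(target_mean=3.0, neighbors=neighbors, item="m1")
print(predicted)
```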

Experiment:

Experimental data:

The experiment uses the dataset of the MovieLens movie website. The dataset is described below.

Table 1 MovieLens data description

The experimental dataset is divided as shown below.

Table 5 Dataset division

Evaluation metric:

To measure the prediction accuracy of the method, the experiment uses the root mean squared error (RMSE) to quantify the gap between the predictions of the method and the users' true ratings.

RMSE is computed as follows:

RMSE = sqrt( (1/|T|) · Σ_{(u,i) ∈ T} (r_ui − pre_ui)² )

where r_ui is the true rating of item i by user u, pre_ui is the predicted rating of item i by user u, T is the test set, and |T| is the size of the test set. The smaller the RMSE, the closer the predicted values are to the actual values and the more accurate the predictions.
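The RMSE definition above translates directly into code. In this sketch the test set is assumed to be a list of (user, item, true rating) triples and the predictor a callable; both representations, the names and the numbers are hypothetical.

```python
import math


def rmse(test_set, predict):
    """Root mean squared error over (user, item, true_rating) triples."""
    squared_errors = [(r_ui - predict(u, i)) ** 2 for u, i, r_ui in test_set]
    if not squared_errors:
        raise ValueError("the test set is empty")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical test set and predictions, for illustration only.
test_set = [("u1", "m1", 4.0), ("u1", "m2", 2.0), ("u2", "m1", 5.0)]
predictions = {("u1", "m1"): 3.0, ("u1", "m2"): 2.0, ("u2", "m1"): 3.0}

score = rmse(test_set, lambda u, i: predictions[(u, i)])
print(score)
```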

Compared methods:

1) UCF: user-based collaborative filtering, which can use only the cross users for recommendation.

2) Recommendation based on user similarity transfer (TSUCF): uses the rating information of the small proportion of cross users as a bridge to connect the users of two different e-commerce platforms and thereby make recommendations.

3) The method of the invention: an improvement on TSUCF. First, it makes full use of user attribute information. Second, it accounts for differences in users' scoring standards by measuring user rating similarity through the consistency of the rating distributions on common items.

实验结果：Experimental results:

考虑到最近邻居数N的大小会对结果有影响，实验分别在最近邻居数为5、10、20、30、40前提下进行方法对比。Considering that the size of the number of nearest neighbors N will affect the results, the experiments are carried out on the premise that the number of nearest neighbors is 5, 10, 20, 30, and 40.

从上图2-图7可以明显看出，本发明方法在不同最近邻居数下均能取得较好的推荐效果。It can be clearly seen from the above Figures 2 to 7 that the method of the present invention can achieve better recommendation results under different numbers of nearest neighbors.

考虑到交叉用户数目对实验结果的影响，实验分别在交叉用户数目为95、189、283、377、471下进行方法对比。Considering the influence of the number of cross-users on the experimental results, the experiments were carried out to compare the methods under the number of cross-users of 95, 189, 283, 377, and 471 respectively.

从上图8-图12可以明显看出，本发明方法在不同交叉用户数下均能取得最好的推荐效果。It can be clearly seen from the above Figures 8 to 12 that the method of the present invention can achieve the best recommendation effect under different cross-user numbers.

本发明方法利用用户属性相似度及用户评分相似度对辅助领域的数据进行迁移以解决目标领域的数据稀疏性问题，未来可以考虑联合项目相似度或其他知识，如文本信息，对辅助领域的数据进行迁移，通过这种方式可以提高迁移数据的质量，从而提高推荐精确度。The method of the present invention uses the similarity of user attributes and the similarity of user ratings to migrate the data in the auxiliary field to solve the problem of data sparsity in the target field. In this way, the quality of the migrated data can be improved, thereby improving the recommendation accuracy.

The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. A collaborative filtering method based on comprehensive similarity migration, characterized by comprising the following steps:

(1) Recommendation based on comprehensive similarity migration: given two platforms e₁ and e₂, U₁ denotes the users who have historical behavior information only on platform e₁, U₂ denotes the users who have historical behavior information only on platform e₂, and U_c denotes the users who have historical behavior information on both platforms e₁ and e₂, who are defined as cross users; in practice, the number of cross users is much smaller than the number of non-cross users; the cross users are used to establish a similarity relation between the non-cross users U₁ and U₂, which helps recommendation in the target domain;

(2) Similarity migration: the similarity between a non-cross user U₁ and a non-cross user U₂ cannot be calculated directly; however, the similarity of each of them with the cross users U_c can be calculated, so the cross users U_c serve as a link for establishing the similarity between U₁ and U₂;

the similarity migration step is as follows: first, find the set U_c of users common to platform e₁ and platform e₂; then compute the similarity between U₁ and U_c and the similarity between U₂ and U_c, each recorded as a vector; finally, compute the inner product of these two vectors, which is the transfer similarity of U₁ and U₂;

wherein U₁₁ denotes non-cross user 1 on platform e₁; U₂₁ and U₂₂ denote non-cross users on platform e₂; U_c1, U_c2, etc. denote cross users; and S₁, S₂, etc. denote similarities; if the similarity between U₁₁ and U₂₁ is to be calculated, it can be calculated indirectly through the intermediaries U_c1, U_c2, and U_c3;

in summary, the similarity calculation between U₁ and U₂ can be formalized as:
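The transfer step can be illustrated with a minimal Python sketch. It assumes each non-cross user already has a similarity value for every cross user, stored in a dictionary keyed by cross-user id; the function name `transfer_similarity`, the ids, and the numbers are hypothetical and serve only to show the inner product described above.

```python
from typing import Dict


def transfer_similarity(sim_u1_to_cross: Dict[str, float],
                        sim_u2_to_cross: Dict[str, float]) -> float:
    """Inner product of two similarity vectors indexed by cross-user id.

    Cross users missing from either vector contribute nothing, which is
    equivalent to treating the missing similarity as zero (an assumption).
    """
    shared = sim_u1_to_cross.keys() & sim_u2_to_cross.keys()
    return sum(sim_u1_to_cross[c] * sim_u2_to_cross[c] for c in shared)


# Hypothetical similarities of U11 (platform e1) and U21 (platform e2)
# to three cross users.
s_u11 = {"Uc1": 0.5, "Uc2": 0.25, "Uc3": 0.0}
s_u21 = {"Uc1": 0.5, "Uc2": 1.0, "Uc3": 0.75}

print(transfer_similarity(s_u11, s_u21))  # 0.5*0.5 + 0.25*1.0 + 0.0*0.75 = 0.5
```

The sketch applies no normalization; whether the formalization in the claim normalizes the inner product cannot be determined from the text here.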

(3) Similarity calculation: before the similarity between the non-cross users U₁ and U₂ is calculated, the similarity of each of U₁ and U₂ with the cross users must be calculated, as follows:

1) User rating similarity

User rating similarity is measured in two respects: rating distribution consistency and credibility;

rating distribution consistency concerns the distribution of the ratings that two users gave to the same items; the more consistent the rating distributions, the more similar the interests of the two users; let {ur₁, ur₂, ..., ur_n} and the corresponding set for user v be the rating sets of user u and user v over their common items, and sort each of the two groups in increasing order; the better the index sequence 1, 2, ..., n of one sorted group matches the index sequence x₁, x₂, ..., x_n of the other, the higher the consistency of the two; the calculation formula is as follows:

the credibility is determined by the number of items that both users have rated; if that number is small, the two users may not be similar even if their rating distributions are consistent; the calculation formula is as follows:

wherein I_u denotes the set of items rated by user u;

the user rating similarity is calculated as follows:

sim₁(u,v)＝dist(u,v)conf(u,v) (1-4)
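Formula (1-4) can be sketched in Python as follows. The formulas for dist and conf are not legible in this text, so both functions below are stand-ins chosen only to match the verbal description: `dist` is assumed to be one minus the normalized Spearman footrule between the two rank orders of the common items, and `conf` is assumed to be the Jaccard overlap of the rated-item sets I_u and I_v. Only the product in (1-4) comes from the claim; all names and data are hypothetical.

```python
from typing import Dict, List

Ratings = Dict[str, float]  # item id -> rating


def rank_positions(ratings: Ratings, items: List[str]) -> Dict[str, int]:
    """Position of each item after sorting by rating in increasing order.

    Ties are broken by item id (an assumption; the text does not say).
    """
    ordered = sorted(items, key=lambda item: (ratings[item], item))
    return {item: pos for pos, item in enumerate(ordered)}


def dist(ru: Ratings, rv: Ratings) -> float:
    """Assumed consistency measure: 1 minus the normalized Spearman footrule."""
    common = sorted(ru.keys() & rv.keys())
    n = len(common)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0  # a single common item is trivially in the same position
    pu, pv = rank_positions(ru, common), rank_positions(rv, common)
    displacement = sum(abs(pu[i] - pv[i]) for i in common)
    return 1.0 - displacement / (n * n // 2)  # n*n//2 is the largest possible displacement


def conf(ru: Ratings, rv: Ratings) -> float:
    """Assumed credibility: Jaccard overlap of the two rated-item sets."""
    union = ru.keys() | rv.keys()
    return len(ru.keys() & rv.keys()) / len(union) if union else 0.0


def sim1(ru: Ratings, rv: Ratings) -> float:
    """Formula (1-4): product of consistency and credibility."""
    return dist(ru, rv) * conf(ru, rv)


# Hypothetical ratings: u and v share items a, b, c and rank them identically.
u = {"a": 1, "b": 3, "c": 5, "d": 4}
v = {"a": 2, "b": 3, "c": 4, "e": 1}
print(sim1(u, v))  # dist = 1.0, conf = 3/5
```

Because `dist` compares rank orders rather than raw values, a user who rates everything one point higher than another still scores as fully consistent, which matches the stated aim of tolerating different personal rating standards.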

2) User attribute similarity

User attribute similarity is measured from the users' attributes; users with the same attributes have similar interests to some extent; the calculation formula is as follows:

wherein n denotes the number of attributes; sim(u, v, i) indicates whether the two users have the same value for the i-th attribute, taking the value 1 if they do and 0 otherwise; and d_i denotes the discrimination degree of the i-th attribute: if all items are rated by users having a certain attribute, that attribute has no discriminating power; the value of the discrimination degree is determined by the data set;
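A minimal Python sketch of an attribute similarity of this kind follows. The formula itself is not legible in this text, so the sketch assumes a weighted match count: the sum of d_i over matching attributes, divided by the sum of all d_i so that the result lies in [0, 1]. The normalization, the attribute names, and the weights are all assumptions made for illustration.

```python
from typing import Dict


def attribute_similarity(attrs_u: Dict[str, str],
                         attrs_v: Dict[str, str],
                         discrimination: Dict[str, float]) -> float:
    """Weighted share of attributes on which two users agree.

    discrimination maps attribute name -> d_i. Dividing by the sum of the
    weights is an assumption, not something stated in the claim.
    """
    total = sum(discrimination.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight
        for name, weight in discrimination.items()
        # sim(u, v, i) is 1 when both users have the attribute with equal values
        if name in attrs_u and attrs_u.get(name) == attrs_v.get(name)
    )
    return matched / total


# Hypothetical users and weights.
user_u = {"gender": "F", "age_group": "25-34", "occupation": "engineer"}
user_v = {"gender": "F", "age_group": "35-44", "occupation": "engineer"}
d = {"gender": 0.25, "age_group": 0.5, "occupation": 0.25}

print(attribute_similarity(user_u, user_v, d))  # (0.25 + 0.25) / 1.0 = 0.5
```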

3) Final similarity

When a user has rated items, the rating information is used; when the user has not rated any items, the user attribute information is used; as the number of items rated by the user increases, the method transitions smoothly to recommending with rating information; the smoothing is performed with a sigmoid function, and the final user similarity is defined as follows:

sim(u,v)＝αsim₁(u,v)+(1-α)sim₂(u,v) (1-6)

wherein C_uv denotes the set of items rated by both user u and user v; as the number of items rated by the users increases, the user similarity calculation transitions smoothly to using rating information, and this smooth transition can improve prediction accuracy in the cold-start state.
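The blending in formula (1-6) can be sketched in Python as follows. The exact definition of α is not legible in this text; the sketch assumes a rescaled sigmoid of |C_uv|, namely α = 2·sigmoid(k·|C_uv|) − 1, so that α is 0 when the two users share no rated items and approaches 1 as the overlap grows. The steepness parameter `k` and the function names are hypothetical.

```python
import math


def alpha(n_common: int, k: float = 1.0) -> float:
    """Assumed weight: rescaled sigmoid of |C_uv|, equal to 0 at n_common = 0."""
    return 2.0 / (1.0 + math.exp(-k * n_common)) - 1.0


def final_similarity(sim1: float, sim2: float, n_common: int, k: float = 1.0) -> float:
    """Formula (1-6): alpha * sim1 + (1 - alpha) * sim2."""
    a = alpha(n_common, k)
    return a * sim1 + (1.0 - a) * sim2


# Hypothetical rating similarity 0.8 and attribute similarity 0.4.
for n in (0, 1, 3, 10):
    print(n, round(final_similarity(0.8, 0.4, n), 4))
```

With no common items the result equals the attribute similarity, and with many common items it approaches the rating similarity, which is the cold-start behavior the claim describes.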