CN109446420A

CN109446420A - Cross-domain collaborative filtering method and system

Info

Publication number: CN109446420A
Application number: CN201811209371.5A
Authority: CN
Inventors: 于旭; 付裕; 徐凌伟; 杜军威; 巩敦卫
Original assignee: Qingdao University of Science and Technology
Current assignee: Qingdao University of Science and Technology
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2019-03-08
Anticipated expiration: 2038-10-17
Also published as: CN109446420B

Abstract

The invention discloses a cross-domain collaborative filtering method. User-item rating data is first converted into a training sample set. Funk-SVD decomposition is applied to the user-item rating matrix of each auxiliary domain to obtain user latent vectors, which are used to extend the training sample set into a first extended training sample set. Item feature information is then added to extend the first extended training sample set into a second extended training sample set, which is used to train an imbalanced classifier. Finally, the missing data of the user-item rating data is predicted with the imbalanced classifier and recommendations are generated. Extending the samples with auxiliary-domain data addresses the sparsity of the target-domain data, and training an imbalanced classifier on the extended samples to predict the missing target-domain entries addresses the imbalance, so the method addresses both the sparsity and the imbalance of existing recommender system data sets.

Description

Cross-domain collaborative filtering method and system

Technical field

The invention belongs to the technical field of information recommendation and, specifically, relates to a cross-domain collaborative filtering method and system.

Background

The rapid growth of information on the Internet calls for effective intelligent information agents that can filter all available information and find the information most valuable to the user.

In recent years, recommender systems have been widely used on e-commerce sites and online social media. The main recommendation approaches are content-based recommendation, collaborative filtering, association-rule-based recommendation, utility-based recommendation, knowledge-based recommendation and hybrid recommendation. Among these, collaborative filtering is the most successful strategy. Its basic idea is that a user is likely to enjoy the resources liked by similar users, and that a user who likes a resource is likely to enjoy other resources similar to it. In other words, users collaborate through their own behaviour on a site, such as rating and browsing resources, to help one another filter out the content that interests them.

However, in real recommender systems users are usually reluctant to rate items they do not like, so most rating data sets are imbalanced.

Summary of the invention

This application provides a cross-domain collaborative filtering method and system that address the technical problem of imbalanced data sets in existing recommender systems.

To solve the above technical problem, the application adopts the following technical solution:

A cross-domain collaborative filtering method is proposed, comprising the following steps: converting user-item rating data into a training sample set for a classification algorithm; performing Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain user latent vectors; extending the feature vectors of the users in the training sample set with the user latent vectors to obtain a first extended training sample set; adding item feature information to extend the feature vectors of the items in the first extended training sample set, obtaining a second extended training sample set; training an imbalanced classifier on the second extended training sample set; and predicting the missing data of the user-item rating data with the imbalanced classifier and generating recommendations.

Further, converting the user-item rating data into a training sample set for a classification algorithm specifically comprises: using L_u to denote the row of user u in the user-item rating matrix and L_i to denote the column of item i in the user-item rating matrix; and constructing, based on the feature vector (L_u, L_i), the training sample set {(L_u, L_i, R_ui) | (u, i) ∈ κ} of the user-item rating data, where κ is the set of (user, item) pairs that have a rating in the rating matrix and R_ui is the rating given by user u to item i.

Further, performing Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain the user latent vectors specifically comprises: setting the objective function min_{p*, q*} Σ_{(u,i)∈κ} (R_ui - q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2); optimizing the objective function by updating p_u and q_i with p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), where e_ui = R_ui - q_i^T p_u is the prediction error, λ is the regularization parameter and γ is the learning rate; and obtaining, from the optimization result, the latent vector p_u^j of user u on the j-th auxiliary domain, where j runs from 1 to K and K is the number of auxiliary domains.

Further, training the imbalanced classifier on the second extended training sample set specifically comprises: initializing the sample weight of each sample in the second extended training sample set as D_1(x_a) = 1/A, where A is the number of samples and 1 ≤ a ≤ A; and repeating the following steps T times: 1) training a weak classifier h_t according to all sample weights {D_t(x_a) | 1 ≤ a ≤ A} of the t-th iteration, where t runs from 1 to T; 2) computing, for each training sample x_a, the penalty term p_t = 1 - |amb|, where amb is the ambiguity term, and computing the weight α_t of the weak classifier;

3) updating the sample weights, where Z_t is the normalization factor and λ ∈ [0.5, 12] is the update step of the penalty term;

and then computing the imbalanced classifier H from the T weak classifiers.

A cross-domain collaborative filtering system is also proposed, comprising a training sample conversion module, a user latent vector generation module, a first training sample extension module, a second training sample extension module, an imbalanced classifier training module and a recommendation module. The training sample conversion module is configured to convert user-item rating data into a training sample set for a classification algorithm. The user latent vector generation module is configured to perform Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain user latent vectors. The first training sample extension module is configured to extend the feature vectors of the users in the training sample set with the user latent vectors, obtaining a first extended training sample set. The second training sample extension module is configured to add item feature information to extend the feature vectors of the items in the first extended training sample set, obtaining a second extended training sample set. The imbalanced classifier training module is configured to train an imbalanced classifier on the second extended training sample set. The recommendation module is configured to predict the missing data of the user-item rating data with the imbalanced classifier and generate recommendations.

Further, the training sample conversion module is specifically configured to use L_u to denote the row of user u in the user-item rating matrix and L_i to denote the column of item i in the user-item rating matrix, and to construct, based on the feature vector (L_u, L_i), the training sample set {(L_u, L_i, R_ui) | (u, i) ∈ κ} of the user-item rating data, where κ is the set of (user, item) pairs that have a rating in the rating matrix and R_ui is the rating given by user u to item i.

Further, the user latent vector generation module comprises an objective function setting unit, an objective function optimization unit and a user latent vector generation unit.

The objective function setting unit is configured to set the objective function min_{p*, q*} Σ_{(u,i)∈κ} (R_ui - q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2). The objective function optimization unit is configured to optimize the objective function by updating p_u and q_i with p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), where e_ui = R_ui - q_i^T p_u is the prediction error, λ is the regularization parameter and γ is the learning rate. The user latent vector generation unit is configured to obtain, from the optimization result, the latent vector p_u^j of user u on the j-th auxiliary domain, where j runs from 1 to K and K is the number of auxiliary domains.

Further, the imbalanced classifier training module comprises a sample weight initialization unit, a weak classifier training unit, a sample weight update unit and an imbalanced classifier generation unit. The sample weight initialization unit is configured to initialize the sample weight of each sample in the second extended training sample set as D_1(x_a) = 1/A, where A is the number of samples and 1 ≤ a ≤ A. The weak classifier training unit is configured to train a weak classifier h_t according to all sample weights {D_t(x_a) | 1 ≤ a ≤ A} of the t-th iteration, where t runs from 1 to T. The sample weight update unit is configured to compute, for each training sample x_a, the penalty term p_t = 1 - |amb|, where amb is the ambiguity term, to compute the weight α_t of the weak classifier, and to update the sample weights, where Z_t is the normalization factor and λ ∈ [0.5, 12] is the update step of the penalty term. The imbalanced classifier generation unit is configured to compute the imbalanced classifier H after the weak classifier training unit and the sample weight update unit have repeated their computation T times.

Compared with the prior art, the advantages and positive effects of the application are as follows. In the proposed cross-domain collaborative filtering method and system, the rating data in the user-item rating matrix is converted into training samples whose feature vectors are the positions of the ratings in the matrix. User latent vectors are then obtained by Funk-SVD decomposition from auxiliary domains that contain relatively rich information, and are used to extend the training sample set into the first extended training sample set, which reduces the sparsity of the target domain. The item feature information of the auxiliary domains is then used to extend the first extended training sample set into the second extended training sample set. Finally, an imbalanced classifier is trained on the extended training sample set; that is, the converted and extended training set is classified in order to predict the missing data of the target-domain user-item rating matrix and generate recommendation data for users. By using an imbalanced classification model, the application addresses the data set imbalance of existing recommender systems and effectively overcomes the skewed distribution of ratings.

Other features and advantages of the application will become clearer after reading the detailed description of its embodiments in conjunction with the accompanying drawings.

Brief description of the drawings

Fig. 1 is a flow chart of the cross-domain collaborative filtering method proposed by the application;

Fig. 2 is an architecture diagram of the cross-domain collaborative filtering system proposed by the application.

Detailed description of the embodiments

Specific embodiments of the application are described in more detail below with reference to the accompanying drawings.

The cross-domain collaborative filtering method proposed by the application converts the user-item rating matrix of the target domain into a training sample set, extends it with auxiliary-domain data to address the sparsity of the target-domain data, trains an imbalanced classifier on the extended training samples, and uses the imbalanced classifier to predict the missing entries of the target domain and so obtain recommendation data. It thereby addresses both the sparsity and the imbalance of existing recommender system data sets. Specifically, the method comprises the following steps:

Step S11: convert the user-item rating data into a training sample set for a classification algorithm.

In this embodiment, let the target domain be T, and let u and i denote a user and an item respectively. The relationship between users and items is expressed as u × i → R, where R is the rating and its range is set to {1, 2, 3, 4, 5}. L_u denotes the row of user u in the user-item rating matrix and L_i denotes the column of item i in the user-item rating matrix, so each rating in the user-item rating data can be expressed as a training sample, giving the set {(L_u, L_i, R_ui) | (u, i) ∈ κ}, where κ is the set of (user, item) pairs that have a rating in the rating matrix. In this way the user-item rating matrix shown in Table one is converted into the training sample set shown in Table two:

Table one

	i₁	i₂	i₃	i₄
u₁	5	-	4	-
u₂	-	5	-	1
u₃	2	-	4	3

Table two

L_u	L_i	label
1	1	5
1	3	4
2	2	5
2	4	1
3	1	2
3	3	4
3	4	3

In Table one, u₁, u₂ and u₃ are three users and i₁, i₂, i₃ and i₄ are four items. The row position of a user in the user-item rating matrix is used as L_u and the column position of an item is used as L_i, so the sample (1, 1, 5) expresses that user u₁ gave item i₁ a rating of 5. In this way the user-item rating matrix of Table one is converted into the training sample set of Table two; that is, the training sample set of the user-item rating data is generated from the feature vector (L_u, L_i).
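The conversion from Table one to Table two can be sketched in a few lines of Python. This is only an illustration and not part of the original disclosure: the function name `to_training_samples` and the use of `None` for a missing rating are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Table one from the description; None marks a missing rating.
RATINGS: List[List[Optional[int]]] = [
    [5, None, 4, None],   # u1
    [None, 5, None, 1],   # u2
    [2, None, 4, 3],      # u3
]


def to_training_samples(
    matrix: List[List[Optional[int]]],
) -> List[Tuple[int, int, int]]:
    """Return one (L_u, L_i, R_ui) triple per observed rating, 1-indexed."""
    samples = []
    for l_u, row in enumerate(matrix, start=1):
        for l_i, rating in enumerate(row, start=1):
            if rating is not None:      # (u, i) belongs to kappa
                samples.append((l_u, l_i, rating))
    return samples


samples = to_training_samples(RATINGS)
for sample in samples:
    print(sample)
```

Each printed triple corresponds to one row of Table two: the first two entries form the feature vector (L_u, L_i) and the third is the class label.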

Step S12: perform Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain the user latent vectors.

To address the sparsity of the user-item rating matrix, traditional collaborative filtering methods usually look for useful information within the same domain, for example inferring the relationship between users and items from social networks, trust relationships or review text. Such same-domain information is not easy to obtain, however, so this embodiment takes a cross-domain approach and extracts useful information from auxiliary domains to address the data sparsity of the target domain.

In this embodiment, Funk-SVD decomposition is applied to the user-item rating matrix of each auxiliary domain to obtain the user latent vectors. Funk-SVD factorizes the user-item rating matrix into the product of user latent factors and item latent factors, so that the high-dimensional rating matrix is decomposed into two low-dimensional matrices: for example, X (m × n) is decomposed into U (m × k) × V (k × n), where m and n are the numbers of rows and columns of the user-item rating matrix, k is the latent factor dimension, and k is much smaller than min(m, n). Funk-SVD fits the known entries of X as closely as possible in order to predict the unknown entries of X. If k is too small the model may fail to fit the data, and if k is too large it may overfit. Let r̂_ui denote the predicted rating of user u for item i; then r̂_ui = q_i^T p_u, where p_u is the latent factor vector of user u and q_i is the latent factor vector of item i.

In the decomposition, the objective function is set as min_{p*, q*} Σ_{(u,i)∈κ} (R_ui - q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2), where p* = {p_user | user ∈ userset} is the set of all user latent vectors and q* = {q_item | item ∈ itemset} is the set of all item latent vectors.

The objective function is optimized by updating p_u and q_i with p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), where e_ui = R_ui - q_i^T p_u.

Finally, the latent vector p_u^j of user u on the j-th auxiliary domain is obtained from the optimization result, where j runs from 1 to K and K is the number of auxiliary domains. Here λ is the regularization parameter and γ is the learning rate: if γ is too large the algorithm may fail to converge, and if it is too small the algorithm needs much more time to converge.
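The stochastic gradient updates of step S12 can be illustrated with a short pure-Python sketch. It applies the update rules p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), taking e_ui as the usual Funk-SVD prediction error R_ui - q_i^T p_u. The function names (`funk_svd`, `rmse`), the toy auxiliary-domain ratings, the initialization range and the default values of k, γ, λ and the number of epochs are all assumptions chosen for illustration; the source does not specify them.

```python
import random
from typing import List, Sequence, Tuple

Rating = Tuple[int, int, float]  # (user index, item index, rating), 0-indexed


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rmse(ratings: List[Rating], p, q) -> float:
    """Root-mean-square error of the predictions q_i^T p_u on known ratings."""
    return (sum((r - dot(q[i], p[u])) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


def funk_svd(ratings: List[Rating], n_users: int, n_items: int, k: int = 2,
             gamma: float = 0.01, lam: float = 0.02, epochs: int = 200, seed: int = 0):
    """Return (p, q): user and item latent vectors learned by SGD."""
    rng = random.Random(seed)
    p = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            e_ui = r - dot(q[i], p[u])          # prediction error
            for f in range(k):
                p_uf, q_if = p[u][f], q[i][f]   # use the old values in both updates
                p[u][f] = p_uf + gamma * (e_ui * q_if - lam * p_uf)
                q[i][f] = q_if + gamma * (e_ui * p_uf - lam * q_if)
    return p, q


# Hypothetical ratings of one auxiliary domain (3 users, 4 items).
AUX_RATINGS: List[Rating] = [
    (0, 0, 5.0), (0, 2, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 0, 2.0), (2, 2, 4.0), (2, 3, 3.0),
]

p_init, q_init = funk_svd(AUX_RATINGS, 3, 4, epochs=0)   # initial factors only
p, q = funk_svd(AUX_RATINGS, 3, 4)
print("RMSE before:", rmse(AUX_RATINGS, p_init, q_init))
print("RMSE after: ", rmse(AUX_RATINGS, p, q))
```

In the method described here, the rows of `p` returned for the j-th auxiliary domain would play the role of the user latent vectors p_u^j; running the function once per auxiliary domain gives the K vectors used in the next step.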

Step S13: extend the feature vectors of the users in the training sample set with the user latent vectors to obtain the first extended training sample set. Step S14: add item feature information to extend the feature vectors of the items in the first extended training sample set, obtaining the second extended training sample set.

The user latent vectors obtained in step S12 are added to the training samples of the target domain; that is, the user latent vectors are added to the feature vector (L_u, L_i), giving the first extended training sample set.

In addition, item feature information is added to extend the first extended training sample set into the second extended training sample set, in order to improve recommendation performance. Taking the movie domain as an example, movie attributes can be added to the feature vector: the attribute information of each movie is retrieved by movie title, and a set number of attributes, such as director, genre, actors, country and language, are selected as the item features added to the feature vector. This gives the second extended training sample set, in which Q denotes the number of item features.
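The two extensions of steps S13 and S14 amount to concatenating extra columns onto each training sample. The sketch below is a hedged illustration: the function name `extend_sample`, the order in which the parts are concatenated, the toy latent vectors and the numeric encoding of the item attributes are all assumptions, since the source does not reproduce the extended feature vector explicitly.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[int, int, int]  # (L_u, L_i, R_ui)


def extend_sample(
    sample: Sample,
    user_latents: Sequence[Dict[int, List[float]]],
    item_features: Dict[int, List[float]],
) -> Tuple[List[float], int]:
    """Build the extended feature vector and label for one training sample.

    user_latents[j][L_u] is the latent vector of the user in auxiliary domain j.
    item_features[L_i] holds the Q encoded attributes of the item.
    """
    l_u, l_i, rating = sample
    features: List[float] = [float(l_u), float(l_i)]
    for latent_by_user in user_latents:      # first extension (step S13)
        features.extend(latent_by_user[l_u])
    features.extend(item_features[l_i])      # second extension (step S14)
    return features, rating


# Hypothetical data: K = 2 auxiliary domains, latent dimension 2, Q = 3 item features.
USER_LATENTS = [
    {1: [0.9, 0.1], 2: [0.2, 0.7], 3: [0.5, 0.5]},
    {1: [0.3, 0.8], 2: [0.6, 0.4], 3: [0.1, 0.9]},
]
ITEM_FEATURES = {
    1: [0.0, 2.0, 1.0],
    2: [1.0, 0.0, 1.0],
    3: [2.0, 1.0, 0.0],
    4: [0.0, 0.0, 2.0],
}

features, label = extend_sample((1, 3, 4), USER_LATENTS, ITEM_FEATURES)
print(features, label)
```

With these toy inputs the sample (1, 3, 4) grows from 2 features to 2 + K·k + Q = 9 features while keeping its label.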

Step S15: train the imbalanced classifier on the second extended training sample set.

In this embodiment, the AdaBoost.NC imbalanced-learning algorithm is used to classify the second extended training sample set obtained after conversion and extension. Its basic principle is that several weak classifiers, combined appropriately, form a strong classifier. The algorithm is iterative: each iteration trains only one weak classifier, and the trained weak classifiers take part in the following iterations. After the N-th iteration there are therefore N weak classifiers in total, of which N-1 were trained earlier and have fixed parameters, while the N-th is trained in the current iteration. The N-th weak classifier is more likely to classify correctly the data that the previous N-1 weak classifiers misclassified, and the final classification output depends on the combined effect of all N classifiers. AdaBoost.NC uses two kinds of weights. The first is the sample weight of each sample in the training sample set, represented by the vector D: after each round of learning the sample weights are readjusted, changing the weights of the samples misclassified in that round so that the next round focuses on them. The second is the weight of each weak classifier, represented by the vector α. Because there are multiple classifiers, an ambiguity term amb is introduced to measure the difference between the classifiers. In the ambiguity term, h_t denotes the classification result of the t-th weak classifier: h_t is 1 if training sample x is classified correctly by the t-th weak classifier and -1 otherwise; H is the classification result of the combination of all classifiers.

Specifically, in this embodiment the sample weight of each sample in the second extended training sample set is first initialized as D_1(x_a) = 1/A, where A is the number of samples and 1 ≤ a ≤ A. With the number of weak classifiers set to T, the following steps are repeated for T iterations in accordance with the AdaBoost.NC algorithm: 1) train a weak classifier h_t according to all sample weights {D_t(x_a) | 1 ≤ a ≤ A} of the t-th iteration, where t starts at 1 and increases by 1 with each repetition until it reaches T; 2) compute, for each training sample x_a, the penalty term p_t = 1 - |amb|, and compute the weight α_t of the weak classifier; 3) update the sample weights, where Z_t is the normalization factor and λ ∈ [0.5, 12] is the update step of the penalty term. After the T iterations are complete, the imbalanced classifier H is computed.
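The loop of step S15 can be sketched as below. Because the source text does not reproduce the formulas for the ambiguity term, the classifier weight, the weight update or the final classifier, this sketch borrows them from the commonly cited AdaBoost.NC formulation and should be read as an assumption, not as the patent's exact method: amb is the mean over the first t classifiers of [H_t correct] - [h_i correct] with 0/1 indicators, α_t = ½·log(Σ_correct D_t·p_t^λ / Σ_wrong D_t·p_t^λ), D_{t+1} ∝ p_t^λ·D_t·exp(-α_t·[h_t correct]), and H is the α-weighted vote. Taking H_t as an unweighted majority vote, using decision stumps as weak classifiers, and all function names are further assumptions.

```python
import math
from collections import defaultdict


def train_stump(X, y, w):
    """Weighted decision stump: one feature, one threshold, one label per side."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            votes = (defaultdict(float), defaultdict(float))
            for row, label, weight in zip(X, y, w):
                votes[row[f] > thr][label] += weight
            left = max(votes[0], key=votes[0].get)
            right = max(votes[1], key=votes[1].get) if votes[1] else left
            err = sum(weight for row, label, weight in zip(X, y, w)
                      if (right if row[f] > thr else left) != label)
            if best is None or err < best[0]:
                best = (err, f, thr, left, right)
    _, f, thr, left, right = best
    return lambda row: right if row[f] > thr else left


def majority(classifiers, row):
    """Unweighted majority vote, used here as H_t (an assumption)."""
    counts = defaultdict(int)
    for h in classifiers:
        counts[h(row)] += 1
    return max(counts, key=counts.get)


def adaboost_nc(X, y, T=10, lam=2.0, eps=1e-10):
    n = len(X)
    D = [1.0 / n] * n                          # D_1(x_a) = 1/A
    classifiers, alphas = [], []
    for t in range(1, T + 1):
        h = train_stump(X, y, D)               # step 1: weak classifier h_t
        classifiers.append(h)
        hit = [h(row) == label for row, label in zip(X, y)]
        pen = []                               # step 2: (1 - |amb|) ** lam
        for row, label in zip(X, y):
            ens_ok = majority(classifiers, row) == label
            amb = sum(ens_ok - (c(row) == label) for c in classifiers) / t
            pen.append((1.0 - abs(amb)) ** lam)
        good = sum(d * p for d, p, ok in zip(D, pen, hit) if ok)
        bad = sum(d * p for d, p, ok in zip(D, pen, hit) if not ok)
        alpha = 0.5 * math.log((good + eps) / (bad + eps))
        alphas.append(alpha)
        # step 3: update the sample weights and normalise by Z_t
        D = [d * p * math.exp(-alpha * ok) for d, p, ok in zip(D, pen, hit)]
        z = sum(D)
        D = [d / z for d in D]

    def predict(row):
        scores = defaultdict(float)            # alpha-weighted vote
        for h, a in zip(classifiers, alphas):
            scores[h(row)] += a
        return max(scores, key=scores.get)
    return predict


# Toy imbalanced data: two samples of rating 1, six samples of rating 5.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]]
y = [1, 1, 5, 5, 5, 5, 5, 5]
predict = adaboost_nc(X, y, T=5)
print([predict(row) for row in X])
```

The toy data is deliberately separable so that the result is easy to check by hand; on real rating data the penalty term would come into play once the weak classifiers start to disagree. The default `lam=2.0` is simply one value inside the range [0.5, 12] stated in the text.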

Step S16: predict the missing data of the user-item rating data with the imbalanced classifier and generate recommendations.
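A minimal sketch of step S16 is given below, assuming that recommendations are produced by ranking a user's unrated items by predicted rating; the source does not state the ranking rule. The function name `recommend`, the `top_n` parameter and the stand-in `predict` callable are hypothetical, and in the full method `predict` would be the trained imbalanced classifier applied to the extended feature vector of each (user, item) pair.

```python
from typing import Callable, List, Optional, Tuple


def recommend(
    matrix: List[List[Optional[int]]],
    predict: Callable[[Tuple[int, int]], int],
    user_row: int,
    top_n: int = 2,
) -> List[int]:
    """Return the columns of the top_n unrated items of one user (1-indexed)."""
    candidates = []
    for l_i, rating in enumerate(matrix[user_row - 1], start=1):
        if rating is None:                       # missing entry to be predicted
            candidates.append((predict((user_row, l_i)), l_i))
    # Highest predicted rating first; ties broken by the smaller column index.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [l_i for _, l_i in candidates[:top_n]]


# Table one again; None marks a missing rating.
RATING_MATRIX = [
    [5, None, 4, None],
    [None, 5, None, 1],
    [2, None, 4, 3],
]

# Stand-in predictions for user u1, purely for illustration.
FAKE_PREDICTIONS = {(1, 2): 3, (1, 4): 5}
top_items = recommend(RATING_MATRIX, lambda pair: FAKE_PREDICTIONS[pair], user_row=1)
print(top_items)
```

For user u₁ the unrated items are i₂ and i₄; with the stand-in predictions, i₄ is ranked ahead of i₂.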

As can be seen from the above, in the cross-domain collaborative filtering method proposed by the application, the rating data in the user-item rating matrix is converted into training samples whose feature vectors are the positions of the ratings in the matrix. User latent vectors are then obtained by Funk-SVD decomposition from auxiliary domains that contain relatively rich information, and are used to extend the training sample set into the first extended training sample set, which reduces the sparsity of the target domain. The item feature information of the auxiliary domains is then used to extend the first extended training sample set into the second extended training sample set. Finally, the imbalanced classifier is trained on the second extended training sample set; that is, the converted and extended training set is classified in order to predict the missing data of the target-domain user-item rating matrix and generate recommendation data for users. By using an imbalanced classification model, the application addresses the data set imbalance of existing recommender systems and effectively overcomes the skewed distribution of ratings.

Based on the cross-domain collaborative filtering method proposed above, the application also proposes a cross-domain collaborative filtering system which, as shown in Fig. 2, comprises a training sample conversion module 21, a user latent vector generation module 22, a first training sample extension module 23, a second training sample extension module 24, an imbalanced classifier training module 25 and a recommendation module 26.

The training sample conversion module 21 converts the user-item rating data into a training sample set for a classification algorithm. The user latent vector generation module 22 performs Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain the user latent vectors. The first training sample extension module 23 extends the feature vectors of the users in the training sample set with the user latent vectors to obtain the first extended training sample set. The second training sample extension module 24 adds item feature information to extend the feature vectors of the items in the first extended training sample set, obtaining the second extended training sample set. The imbalanced classifier training module 25 trains the imbalanced classifier on the second extended training sample set. The recommendation module 26 predicts the missing data of the user-item rating data with the imbalanced classifier and generates recommendations.

Specifically, the training sample conversion module 21 uses L_u to denote the row of user u in the user-item rating matrix and L_i to denote the column of item i in the user-item rating matrix, and generates, based on the feature vector (L_u, L_i), the training sample set {(L_u, L_i, R_ui) | (u, i) ∈ κ} of the user-item rating data, where κ is the set of (user, item) pairs that have a rating in the rating matrix and R_ui is the rating given by user u to item i.

In this embodiment, the user latent vector generation module 22 comprises an objective function setting unit 221, an objective function optimization unit 222 and a user latent vector generation unit 223. The objective function setting unit 221 sets the objective function min_{p*, q*} Σ_{(u,i)∈κ} (R_ui - q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2). The objective function optimization unit 222 optimizes the objective function by updating p_u and q_i with p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), where e_ui = R_ui - q_i^T p_u. The user latent vector generation unit 223 obtains, from the optimization result, the latent vector p_u^j of user u on the j-th auxiliary domain, where j runs from 1 to K and K is the number of auxiliary domains.

The imbalanced classifier training module 25 comprises a sample weight initialization unit 251, a weak classifier training unit 252, a sample weight update unit 253 and an imbalanced classifier generation unit 254. The sample weight initialization unit 251 initializes the sample weight of each sample in the second extended training sample set as D_1(x_a) = 1/A, where A is the number of samples and 1 ≤ a ≤ A. The weak classifier training unit 252 trains a weak classifier h_t according to all sample weights {D_t(x_a) | 1 ≤ a ≤ A} of the t-th iteration, where t runs from 1 to T. The sample weight update unit 253 computes, for each training sample x_a, the penalty term p_t = 1 - |amb|, computes the weight α_t of the weak classifier, and updates the sample weights, where Z_t is the normalization factor and λ ∈ [0.5, 12] is the update step of the penalty term. The imbalanced classifier generation unit 254 computes the imbalanced classifier H after the weak classifier training unit and the sample weight update unit have repeated their computation T times.

The recommendation procedure of the cross-domain collaborative filtering system has been described in detail for the cross-domain collaborative filtering method above and is not repeated here.

It should be noted that the above description does not limit the invention, and the invention is not limited to the above examples. Variations, modifications, additions or substitutions made by those skilled in the art within the essential scope of the invention also fall within the scope of protection of the invention.

Claims

1. A cross-domain collaborative filtering method, characterized by comprising the following steps:

converting user-item rating data into a training sample set for a classification algorithm;

performing Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain user latent vectors;

extending the feature vectors of the users in the training sample set with the user latent vectors to obtain a first extended training sample set;

adding item feature information to extend the feature vectors of the items in the first extended training sample set, obtaining a second extended training sample set;

training an imbalanced classifier on the second extended training sample set;

predicting the missing data of the user-item rating data with the imbalanced classifier and generating recommendations.

2. The cross-domain collaborative filtering method according to claim 1, characterized in that converting the user-item rating data into a training sample set for a classification algorithm specifically comprises:

using L_u to denote the row of user u in the user-item rating matrix and L_i to denote the column of item i in the user-item rating matrix;

constructing, based on the feature vector (L_u, L_i), the training sample set {(L_u, L_i, R_ui) | (u, i) ∈ κ} of the user-item rating data, where κ is the set of (user, item) pairs that have a rating in the rating matrix and R_ui is the rating given by user u to item i.

3. The cross-domain collaborative filtering method according to claim 2, characterized in that performing Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain the user latent vectors specifically comprises:

setting the objective function min_{p*, q*} Σ_{(u,i)∈κ} (R_ui - q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2);

optimizing the objective function by updating p_u and q_i with p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), where e_ui = R_ui - q_i^T p_u, λ is the regularization parameter and γ is the learning rate;

obtaining, from the optimization result, the latent vector p_u^j of user u on the j-th auxiliary domain, where j runs from 1 to K and K is the number of auxiliary domains.

4. The cross-domain collaborative filtering method according to claim 1, characterized in that training the imbalanced classifier on the second extended training sample set specifically comprises:

initializing the sample weight of each sample in the second extended training sample set as D_1(x_a) = 1/A, where A is the number of samples and 1 ≤ a ≤ A;

repeating the following steps T times:

1) training a weak classifier h_t according to all sample weights {D_t(x_a) | 1 ≤ a ≤ A} of the t-th iteration, where t runs from 1 to T;

2) computing the penalty term p_t = 1 - |amb| of each training sample x_a and the weight α_t of the weak classifier;

computing the imbalanced classifier.

5. A cross-domain collaborative filtering system, characterized by comprising a training sample conversion module, a user latent vector generation module, a first training sample extension module, a second training sample extension module, an imbalanced classifier training module and a recommendation module;

the training sample conversion module is configured to convert user-item rating data into a training sample set for a classification algorithm;

the user latent vector generation module is configured to perform Funk-SVD decomposition on the user-item rating matrix of each auxiliary domain to obtain user latent vectors;

the first training sample extension module is configured to extend the feature vectors of the users in the training sample set with the user latent vectors, obtaining a first extended training sample set; the second training sample extension module is configured to add item feature information to extend the feature vectors of the items in the first extended training sample set, obtaining a second extended training sample set;

the imbalanced classifier training module is configured to train an imbalanced classifier on the second extended training sample set;

the recommendation module is configured to predict the missing data of the user-item rating data with the imbalanced classifier and generate recommendations.

6. The cross-domain collaborative filtering system according to claim 5, characterized in that the training sample conversion module is specifically configured to use L_u to denote the row of user u in the user-item rating matrix and L_i to denote the column of item i in the user-item rating matrix, and to construct, based on the feature vector (L_u, L_i), the training sample set {(L_u, L_i, R_ui) | (u, i) ∈ κ} of the user-item rating data, where κ is the set of (user, item) pairs that have a rating in the rating matrix and R_ui is the rating given by user u to item i.

7. The cross-domain collaborative filtering system according to claim 6, characterized in that the user latent vector generation module comprises an objective function setting unit, an objective function optimization unit and a user latent vector generation unit;

the objective function setting unit is configured to set the objective function min_{p*, q*} Σ_{(u,i)∈κ} (R_ui - q_i^T p_u)^2 + λ(||q_i||^2 + ||p_u||^2);

the objective function optimization unit is configured to optimize the objective function by updating p_u and q_i with p_u ← p_u + γ(e_ui q_i - λ p_u) and q_i ← q_i + γ(e_ui p_u - λ q_i), where e_ui = R_ui - q_i^T p_u, λ is the regularization parameter and γ is the learning rate;

the user latent vector generation unit is configured to obtain, from the optimization result, the latent vector p_u^j of user u on the j-th auxiliary domain, where j runs from 1 to K and K is the number of auxiliary domains.

8. The cross-domain collaborative filtering system according to claim 5, characterized in that the imbalanced classifier training module comprises a sample weight initialization unit, a weak classifier training unit, a sample weight update unit and an imbalanced classifier generation unit;

the sample weight initialization unit is configured to initialize the sample weight of each sample in the second extended training sample set as D_1(x_a) = 1/A, where A is the number of samples and 1 ≤ a ≤ A;

the weak classifier training unit is configured to train a weak classifier h_t according to all sample weights {D_t(x_a) | 1 ≤ a ≤ A} of the t-th iteration, where t runs from 1 to T;

the sample weight update unit is configured to compute the penalty term p_t = 1 - |amb| of each training sample x_a and the weight α_t of the weak classifier, and to update the sample weights, where Z_t is the normalization factor and λ ∈ [0.5, 12] is the update step of the penalty term;

the imbalanced classifier generation unit is configured to compute the imbalanced classifier after the weak classifier training unit and the sample weight update unit have repeated their computation T times.