CN110321485A - Recommendation algorithm combining user reviews and rating information - Google Patents

Recommendation algorithm combining user reviews and rating information Download PDF

Info

Publication number
CN110321485A
CN110321485A
Authority
CN
China
Prior art keywords
user
formula
theme
parameter
comment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910531413.5A
Other languages
Chinese (zh)
Inventor
李慧
张舒
刘飞
施珺
戴红伟
樊宁
杨玉
李海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaihai Institute of Technology
Original Assignee
Huaihai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaihai Institute of Technology
Priority to CN201910531413.5A
Publication of CN110321485A
Current legal status: Pending

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • G06F16/337Profile generation, learning or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a recommendation algorithm combining user reviews and rating information. The specific steps are: construct a generative probabilistic model for discovering latent topic dimensions in user review text; construct a recommendation objective function that combines a user rating matrix factorization model with the topic discovery model; and realize product recommendation prediction based on user review text and rating data by iteratively optimizing the objective function. The algorithm fully exploits user review information: by using the latent topic distribution in the review text, user rating data is combined with user review text, which effectively alleviates the cold-start problem in recommender systems. At the same time, it predicts ratings more accurately than methods that consider only one of the two data sources, and is especially suitable for rating prediction for new products and new users.

Description

Recommendation algorithm combining user reviews and rating information
Technical field:
The present invention relates to the field of recommendation algorithms, and in particular to a recommendation algorithm combining user reviews and rating information.
Background art:
Recommender systems are widely used on various network platforms and have changed the way users discover and evaluate products online. Existing recommendation methods fall into two broad classes: collaborative filtering methods and content-based methods. Collaborative filtering models explicit rating information; although it can achieve fairly good recommendation results, it suffers from the sparsity of rating data. Content-based methods recommend items that share identical or similar attributes, which tends to produce overly homogeneous recommendations. There has been much research on rating modeling; however, the sparsity of rating data, the cold-start problem and drift in user preferences have remained open problems that are not well solved. Meanwhile, the other feedback channel on such websites, the reviews themselves, is often ignored. Researchers have therefore recently tried to further improve recommendation quality by mining information such as user relationships, user reviews and item tags. Although much work studies both ratings and user review text, the two are almost always studied in isolation, and few studies attempt to combine both information sources in a recommendation algorithm.
Summary of the invention:
The purpose of the present invention is to address the drawbacks of the prior art by providing a recommendation algorithm combining user reviews and rating information, so as to solve the problems raised in the background above.
To achieve the above object, the invention provides the following technical scheme: a recommendation algorithm combining user reviews and rating information, comprising the following steps:
Step (1): construct a generative probabilistic model for discovering latent topic dimensions in user review text; the calculation formula is as follows:
where N_d denotes the number of words in document d;
where θ_i denotes the K-dimensional topic distribution of item i;
where z_{u,i,j} denotes the topic of the j-th word of user u's review of item i;
where w_{u,i,j} denotes the j-th word of user u's review of item i;
Step (2): construct a recommendation objective function that combines the user rating matrix factorization model with the topic discovery model; the calculation formula is as follows:
where Θ denotes the rating parameters, i.e. the parameters of the rating matrix factorization model;
where Φ = {θ, φ} denotes the topic parameters, i.e. the parameter set of the LDA topic discovery model over the reviews;
where κ denotes the weight controlling the transformation function;
where z holds the topic assignment of each word in the corpus T;
where rec(u, i) is the predicted rating of user u for item i;
where r_{u,i} denotes the rating given by user u to item i;
where λ is a hyperparameter controlling the weight of the two parts;
where logL(T | θ, φ, z) denotes the log-likelihood of the LDA model on the user review corpus;
Step (3): realize product recommendation prediction based on user review text and rating data through iterative computation of the objective function; the objective function to be minimized is:
As a preferred technical solution of the present invention, the specific process of constructing, in step (1), the generative probabilistic model for discovering latent topic dimensions in user review text is as follows:
Step (1a): each document is regarded as a sequence of N words; M is the number of documents in the document set; z denotes the topic assignment of a word; α and β are the hyperparameters of the topic distribution θ and of the word distribution φ respectively, both obeying Dirichlet priors;
Step (1b): each document d ∈ D is associated with a K-dimensional topic distribution θ_d, i.e. the words of document d discuss topic k with probability θ_{d,k};
Step (1c): the topic distribution θ_d is assumed to itself obey a Dirichlet distribution; the final model contains the word distribution φ_k of each topic, the topic distribution θ_d of each document, and the topic assignment z_{d,j} of each word;
Step (1d): the parameters Φ = {θ, φ} and the topic assignments z are updated by sampling;
Step (1e): given the word distributions φ_k and the topic assignment of each word, the probability of the particular text corpus can be constructed; the calculation formula is as follows:
where N_d denotes the number of words in document d.
As a preferred technical solution of the present invention, the specific process of constructing, in step (2), the recommendation objective function that combines the user rating matrix factorization model with the topic discovery model is as follows:
Step (2a): a "document" is defined in this model as follows:
documents are derived from the review text: the set of all reviews of a specific item i is defined as document d_i;
Step (2b): the rating parameter γ_i to be learned and the review topic parameter θ_i are linked. It is implicitly assumed here that the number K of latent factors in the rating matrix factorization equals the number K of latent topics in the review text, and that the latent factors carry the same weight. The transformation between latent factors and latent topics is constructed as follows (a reconstruction is given below):
where θ_{i,k} denotes the weight of item i on latent topic k; the exponential form used here ensures that every θ_{i,k} is positive and that Σ_k θ_{i,k} = 1; γ_{i,k} is the value of item i's latent factor vector on factor k; the parameter κ is introduced to control the "peakedness" of the transformation, in other words κ is the weight controlling the conversion. As κ → ∞, θ_i approaches a unit vector that is 1 only at the maximal entry of γ_i; as κ → 0, θ_i approaches the uniform distribution. Intuitively, a large κ means users discuss only the most important topic, while a small κ means users discuss all topics evenly;
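The transformation formula itself is reproduced only as an image in the filing; a plausible LaTeX reconstruction, offered as an assumption consistent with the properties just listed (positivity, normalization to 1, and the limiting behavior in κ), is:

\theta_{i,k} = \frac{\exp(\kappa\,\gamma_{i,k})}{\sum_{k'=1}^{K} \exp(\kappa\,\gamma_{i,k'})}, \qquad k = 1,\dots,K

Under this exponential (softmax-style) form every θ_{i,k} is positive, Σ_k θ_{i,k} = 1, θ_i tends to a one-hot vector at the largest entry of γ_i as κ → ∞, and tends to the uniform distribution as κ → 0.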
Step (2c): define the objective function over the corpus T based on user ratings and user reviews:
where Θ and Φ = {θ, φ} denote the rating parameters and the topic parameters respectively, i.e. the parameters of the rating matrix factorization model and the parameter set of the LDA topic discovery model over the reviews; κ is the weight controlling the transformation function; z holds the topic assignment of each word in the corpus T; and logL(T | θ, φ, z) is the log-likelihood of the LDA model on the user review corpus. The first part of this equation is the rating prediction error, the second part is the log-likelihood of the review-text topic model, and λ is a hyperparameter controlling the weight of the two parts. rec(u, i) is the predicted rating of user u for item i, obtained from the following formula:
rec(u, i) = α + β_u + β_i + γ_u · γ_i
where α denotes the global bias, i.e. the average of all the rating data, reflecting the influence of the particular data set on user ratings; β_u and β_i are the user bias and item bias, reflecting the influence of different users and different items on the rating; γ_u and γ_i denote the K-dimensional latent topic vectors of user u and item i respectively, where γ_i can intuitively be regarded as the "attributes" of item i and γ_u as user u's "preference" for those attributes. Meanwhile, given a training corpus of ratings T, the parameters Θ = {α, β_u, β_i, γ_u, γ_i} are usually chosen by minimizing the mean squared error (MSE), that is (a reconstruction of this minimization is given below):
where Ω(Θ) is the regularizer penalizing "complex" models.
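The minimization itself is likewise shown only as an image; a hedged reconstruction of the standard form implied by the surrounding text (squared prediction error over the training corpus T plus the regularizer Ω(Θ), whose weighting is left implicit here) is:

\hat{\Theta} = \arg\min_{\Theta} \frac{1}{|T|} \sum_{(u,i) \in T} \bigl(\mathrm{rec}(u,i) - r_{u,i}\bigr)^{2} + \Omega(\Theta)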
As a preferred technical solution of the present invention, the specific process by which step (3) realizes product recommendation prediction based on user review text and rating data through iterative computation of the objective function is as follows:
Step (3a): substitute the formula of step (1e) into the formula of step (2c) to build the objective function to be minimized; the calculation formula is as follows:
Step (3b): after fixing the topic assignments z_{d,j} of the words, gradient descent can be used to solve for the matrix factorization parameter set Θ of the TMF model, the topic discovery parameter set Φ, and the parameter κ of the latent dimension/topic transformation; the calculation formula is as follows:
the derivatives of the objective function used in the above update are obtained as follows:
where n_{d,k} denotes the number of times topic k occurs in document d;
Step (3c): iterate step (3b) and Gibbs sampling alternately until the output parameters no longer change or a certain threshold is reached, at which point the algorithm has converged; the Gibbs sampling formula is as follows (its standard form is reconstructed below):
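The Gibbs sampling formula appears only as an image in the filing. The sampler described is collapsed Gibbs sampling for LDA, whose standard update (stated here as an assumption about what the image contains) resamples the topic of word j in document d according to

p(z_{d,j} = k \mid z_{\neg(d,j)}, w) \propto \bigl(n_{d,k}^{\neg(d,j)} + \alpha\bigr) \cdot \frac{n_{k,w_{d,j}}^{\neg(d,j)} + \beta}{\sum_{v}\bigl(n_{k,v}^{\neg(d,j)} + \beta\bigr)}

where n_{d,k} counts the words of document d assigned to topic k (as defined above), n_{k,v} counts the assignments of vocabulary word v to topic k, and the superscript ¬(d,j) means the word currently being resampled is excluded from the counts.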
Beneficial effects of the present invention: the algorithm fully exploits user review information. By using the latent topic distribution in the review text, user rating data is combined with user review text, which effectively alleviates the cold-start problem in recommender systems. At the same time, it predicts ratings more accurately than methods that consider only one of the two data sources, and is especially suitable for rating prediction for new products and new users, because such new users may have too little historical rating data for their latent factors to be modeled.
Description of the drawings:
Fig. 1 is a graphical representation of the probabilistic generative model LDA of the present invention;
Fig. 2 compares the influence of different regularization parameter λ values on the mean squared error (MSE);
Fig. 3 compares the algorithm of the invention with various conventional recommendation algorithms on the MSE index;
Fig. 4 compares the algorithm of the invention with various conventional recommendation algorithms on the ACC index.
Specific embodiments:
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.
The present invention provides the following technical solution: a recommendation algorithm combining user reviews and rating information, comprising the following steps:
Step (1): construct a generative probabilistic model for discovering latent topic dimensions in user review text; the calculation formula is as follows:
where N_d denotes the number of words in document d;
where θ_i denotes the K-dimensional topic distribution of item i;
where z_{u,i,j} denotes the topic of the j-th word of user u's review of item i;
where w_{u,i,j} denotes the j-th word of user u's review of item i;
Step (2): construct a recommendation objective function that combines the user rating matrix factorization model with the topic discovery model; the calculation formula is as follows:
where Θ denotes the rating parameters, i.e. the parameters of the rating matrix factorization model;
where Φ = {θ, φ} denotes the topic parameters, i.e. the parameter set of the LDA topic discovery model over the reviews;
where κ denotes the weight controlling the transformation function;
where z holds the topic assignment of each word in the corpus T;
where rec(u, i) is the predicted rating of user u for item i;
where r_{u,i} denotes the rating given by user u to item i;
where λ is a hyperparameter controlling the weight of the two parts;
where logL(T | θ, φ, z) denotes the log-likelihood of the LDA model on the user review corpus;
Step (3): realize product recommendation prediction based on user review text and rating data through iterative computation of the objective function; the objective function to be minimized is:
The specific process of constructing, in step (1), the generative probabilistic model for discovering latent topic dimensions in user review text is as follows:
Step (1a): each document is regarded as a sequence of N words; M is the number of documents in the document set; z denotes the topic assignment of a word; α and β are the hyperparameters of the topic distribution θ and of the word distribution φ respectively, both obeying Dirichlet priors;
Step (1b): each document d ∈ D is associated with a K-dimensional topic distribution θ_d, i.e. the words of document d discuss topic k with probability θ_{d,k};
Step (1c): the topic distribution θ_d is assumed to itself obey a Dirichlet distribution; the final model contains the word distribution φ_k of each topic, the topic distribution θ_d of each document, and the topic assignment z_{d,j} of each word;
Step (1d): the parameters Φ = {θ, φ} and the topic assignments z are updated by sampling;
Step (1e): given the word distributions φ_k and the topic assignment of each word, the probability of the particular text corpus can be constructed; the calculation formula is as follows:
where N_d denotes the number of words in document d (a generative sketch is given below).
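To make the generative process of steps (1a)-(1e) concrete, the following minimal Python sketch generates one review document under the model. The vocabulary size and document length are illustrative placeholders; the Dirichlet hyperparameters use the empirical values reported later in the experiments.

import numpy as np

rng = np.random.default_rng(0)

K = 10                  # number of latent topics (the default chosen later in the experiments)
V = 5000                # vocabulary size (illustrative placeholder)
alpha, beta = 0.2, 0.1  # Dirichlet hyperparameters (empirical values used in the experiments)

# Per-topic word distributions phi_k ~ Dirichlet(beta), one row per topic
phi = rng.dirichlet([beta] * V, size=K)

def generate_review(n_words):
    """Generate one review document d with n_words words under the LDA model."""
    theta_d = rng.dirichlet([alpha] * K)            # document topic distribution theta_d
    z = rng.choice(K, size=n_words, p=theta_d)      # topic assignment z_{d,j} of each word
    words = np.array([rng.choice(V, p=phi[k]) for k in z])  # word w_{d,j} drawn from phi_{z_{d,j}}
    return theta_d, z, words

theta_d, z, words = generate_review(n_words=80)     # N_d = 80 words, illustrative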
The specific process of constructing, in step (2), the recommendation objective function that combines the user rating matrix factorization model with the topic discovery model is as follows:
Step (2a): a "document" is defined in this model as follows:
documents are derived from the review text: the set of all reviews of a specific item i is defined as document d_i;
Step (2b): the rating parameter γ_i to be learned and the review topic parameter θ_i are linked. It is implicitly assumed here that the number K of latent factors in the rating matrix factorization equals the number K of latent topics in the review text, and that the latent factors carry the same weight. The transformation between latent factors and latent topics is constructed as follows:
where θ_{i,k} denotes the weight of item i on latent topic k; the exponential form used here ensures that every θ_{i,k} is positive and that Σ_k θ_{i,k} = 1; γ_{i,k} is the value of item i's latent factor vector on factor k; the parameter κ is introduced to control the "peakedness" of the transformation, in other words κ is the weight controlling the conversion. As κ → ∞, θ_i approaches a unit vector that is 1 only at the maximal entry of γ_i; as κ → 0, θ_i approaches the uniform distribution. Intuitively, a large κ means users discuss only the most important topic, while a small κ means users discuss all topics evenly;
Step (2c): define the objective function over the corpus T based on user ratings and user reviews:
where Θ and Φ = {θ, φ} denote the rating parameters and the topic parameters respectively, i.e. the parameters of the rating matrix factorization model and the parameter set of the LDA topic discovery model over the reviews; κ is the weight controlling the transformation function; z holds the topic assignment of each word in the corpus T; and logL(T | θ, φ, z) is the log-likelihood of the LDA model on the user review corpus. The first part of this equation is the rating prediction error, the second part is the log-likelihood of the review-text topic model, and λ is a hyperparameter controlling the weight of the two parts. rec(u, i) is the predicted rating of user u for item i, obtained from the following formula:
rec(u, i) = α + β_u + β_i + γ_u · γ_i
where α denotes the global bias, i.e. the average of all the rating data, reflecting the influence of the particular data set on user ratings; β_u and β_i are the user bias and item bias, reflecting the influence of different users and different items on the rating; γ_u and γ_i denote the K-dimensional latent topic vectors of user u and item i respectively, where γ_i can intuitively be regarded as the "attributes" of item i and γ_u as user u's "preference" for those attributes. Meanwhile, given a training corpus of ratings T, the parameters Θ = {α, β_u, β_i, γ_u, γ_i} are usually chosen by minimizing the mean squared error (MSE), that is:
where Ω(Θ) is the regularizer penalizing "complex" models (a sketch evaluating the combined objective is given below).
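As an illustration of step (2c), the following Python sketch evaluates rec(u, i) and the combined objective. The sign convention (subtracting λ times the review-corpus log-likelihood, so that minimizing the objective maximizes the likelihood) and the array shapes are assumptions made for the sketch, not prescriptions of the filing.

import numpy as np

def rec(alpha, beta_u, beta_i, gamma_u, gamma_i, u, i):
    """Predicted rating rec(u, i) = alpha + beta_u + beta_i + gamma_u . gamma_i."""
    return alpha + beta_u[u] + beta_i[i] + gamma_u[u] @ gamma_i[i]

def combined_objective(ratings, alpha, beta_u, beta_i, gamma_u, gamma_i,
                       log_likelihood_T, lam):
    """Squared rating error over the training triples minus lam * logL(T | theta, phi, z).

    ratings is an iterable of (u, i, r_ui) triples; log_likelihood_T is the LDA
    log-likelihood of the review corpus under the current topic assignments,
    computed elsewhere and passed in as a number.
    """
    err = sum((rec(alpha, beta_u, beta_i, gamma_u, gamma_i, u, i) - r) ** 2
              for u, i, r in ratings)
    return err - lam * log_likelihood_T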
The specific process by which step (3) realizes product recommendation prediction based on user review text and rating data through iterative computation of the objective function is as follows:
Step (3a): substitute the formula of step (1e) into the formula of step (2c) to build the objective function to be minimized; the calculation formula is as follows:
Step (3b): after fixing the topic assignments z_{d,j} of the words, gradient descent can be used to solve for the matrix factorization parameter set Θ of the TMF model, the topic discovery parameter set Φ, and the parameter κ of the latent dimension/topic transformation; the calculation formula is as follows:
the derivatives of the objective function used in the above update are obtained as follows:
where n_{d,k} denotes the number of times topic k occurs in document d;
Step (3c): iterate step (3b) and Gibbs sampling alternately until the output parameters no longer change or a certain threshold is reached, at which point the algorithm has converged; the Gibbs sampling formula is as follows (a sketch of the overall alternating procedure is also given below):
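Step (3) alternates gradient descent on the continuous parameters with Gibbs resampling of the word topics. The following Python sketch shows only the shape of that loop: the three routines it takes as arguments (initializing topic assignments, one gradient-descent step on Θ, Φ and κ, and one Gibbs sweep over z) are assumed to be supplied by the implementer and are not defined in the filing.

import numpy as np

def train_tmf(ratings, corpus, params, kappa, lam,
              init_assignments, gradient_step, gibbs_sweep,
              max_iters=100, tol=1e-4):
    """Alternating optimization sketch for the TMF model.

    params maps parameter names (alpha, beta_u, beta_i, gamma_u, gamma_i,
    theta, phi) to numpy arrays; the three callables are placeholders.
    """
    z = init_assignments(corpus)                     # initial topic assignments z_{d,j}
    for _ in range(max_iters):
        old = {name: np.copy(value) for name, value in params.items()}
        # step (3b): with z fixed, gradient-descent update of Theta, Phi and kappa
        params, kappa = gradient_step(params, kappa, ratings, corpus, z, lam)
        # step (3c): with the continuous parameters fixed, resample topics by Gibbs sampling
        z = gibbs_sweep(corpus, params, z)
        # stop once the output parameters no longer change beyond a small threshold
        change = max(np.max(np.abs(params[name] - old[name])) for name in params)
        if change < tol:
            break
    return params, kappa, z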
To verify the performance of the optimized algorithm of the present invention, the following conventional product recommendation methods are chosen for comparative experiments: Offset, LFM (Latent Factor Model), the SVD++ model, and SlopeOne.
Offset method: a collaborative filtering model based on a global bias. In building the model, the average of all ratings of an item is used as the item's predicted value, i.e. the average rating of an item serves as the predicted rating of any user for that item.
LFM (Latent Factor Model): predicts ratings for unseen items through matrix factorization (SVD). This model only considers users' rating information and does not consider users' review text.
SVD++ model: adds neighborhood item information to the SVD model, using the accumulated latent factors of the items a user has historically reviewed as the neighborhood item information.
SlopeOne: a widely used item-based collaborative filtering method; the algorithm is efficient and simple and is available in open-source tools.
To quantify performance, the present invention adopts the mean squared error (MSE) commonly used to evaluate recommender systems as the evaluation index, defined as MSE = (1/M) Σ (r̂_{u,i} − r_{u,i})²,
where M denotes the total number of predicted ratings, r̂_{u,i} denotes the predicted rating of user u for item i, and r_{u,i} denotes the actual rating of user u for item i.
Besides MSE, the experiments also introduce accuracy (ACC) as a second index measuring prediction accuracy, defined as ACC = m / M,
where m denotes the number of times the predicted rating equals the user's actual rating (a sketch of both indices is given below).
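A minimal Python sketch of the two evaluation indices, assuming the 1-5 integer rating scale and the nearest-integer rounding of decimal predictions described for Fig. 4:

import numpy as np

def mse(pred, actual):
    """Mean squared error over the M predicted ratings."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.mean((pred - actual) ** 2))

def acc(pred, actual):
    """Share of predictions matching the actual rating after rounding to the nearest integer in 1..5."""
    rounded = np.clip(np.rint(np.asarray(pred, dtype=float)), 1, 5)
    return float(np.mean(rounded == np.asarray(actual, dtype=float)))

# Illustrative values only
print(mse([4.2, 3.8, 5.0, 2.1], [4, 4, 5, 3]))   # 0.2225
print(acc([4.2, 3.8, 5.0, 2.1], [4, 4, 5, 3]))   # 0.75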
For the hyperparameters α and β of the LDA model, empirical values are used in the experiments, namely α = 0.2 and β = 0.1.
The statistics of the data set used in the present invention (Table 1) come from user review data collected from various public resources. The main source is Amazon, from which about 35 million user reviews were obtained. To obtain these data, a list of 75 million ASIN-like strings (Amazon's own product numbers) was first compiled from the Internet Archive; among these, about 2.5 million products have at least one user review. The data set is further divided into 26 parts according to the top-level category of each product (such as books or movies). This data set is a superset of the existing publicly available Amazon data set. In total, 42 million user reviews were obtained from 10 million users and 3 million items, covering 5.1 billion words.
The parameter K validation experiment (Table 2) shows the MSE and accuracy ACC of the proposed algorithm under different numbers of topics. Comparing the MSE and ACC results for different values of K, it can be seen that the topic partition is clearest when K = 10. As the value of K increases, system performance keeps improving; however, when K increases from 10 to 20 the improvement is small, so the experiments finally set K = 10 as the default number of topics. In order for the LDA model to converge quickly on the review data, the number of iterations is set to 100.
Influence of different regularization parameter λ values on the MSE (Fig. 2). The regularization parameter λ controls the regularization weight of the topic preferences; the regularization term is one of the effective means in machine learning for preventing the model from over-fitting. The experimental results in Fig. 2 show the behavior of the system under different values of the parameter; from the data in the figure, the MSE of the system levels off around λ = 0.5, so the experiments of the present invention finally set the regularization parameter λ to 0.5.
Comparison of the algorithm of the invention with various conventional recommendation algorithms on the MSE index (Fig. 3). From the experimental results it can be seen that latent factor matrix factorization effectively improves recommendation quality: LFM performs better than the global-bias-based Offset method, and SVD++ performs better than LFM. The TMF method, because it considers both user ratings and review information, obtains the best recommendation performance.
Comparison of the algorithm of the invention with various conventional recommendation algorithms on the MSE index after randomly selecting 10 product categories (Table 3). To verify the effectiveness of topic discovery, the experiment randomly selected 10 of the 28 Amazon sub-category data sets, including first-level categories such as mother-and-baby, food, audio-video and cosmetics. Looking at all data subsets together, the TMF model outperforms the conventional models in all 10 sub-categories; in several categories such as "clothing" and "shoes and hats", the improvement of the TMF algorithm over the other algorithms is most obvious. The categories where TMF performs best are all relatively subjective ones: for these categories users mention numerous aspects of the product in their reviews, so by using the review text TMF can better "separate" the objective properties of a product from the reviewers' subjective opinions about it.
Comparison of the algorithm of the invention with various conventional recommendation algorithms on the ACC index (Fig. 4). The actual ratings are integers from 1 to 5; when the predicted rating produced by an algorithm is a decimal, it is rounded to the nearest integer before computing ACC. From the results, the TMF algorithm performs best. The test set is split in two ways, random assignment and chronological assignment; on the whole, rating prediction performs better under random assignment than under chronological assignment.
Table 1
Table 2
Table 3
The embodiments described above express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the inventive concept, and all of these fall within the protection scope of the invention.

Claims (4)

1. A recommendation algorithm combining user reviews and rating information, characterized by comprising the following steps:
Step (1): construct a generative probabilistic model for discovering latent topic dimensions in user review text; the calculation formula is as follows:
where N_d denotes the number of words in document d;
where θ_i denotes the K-dimensional topic distribution of item i;
where z_{u,i,j} denotes the topic of the j-th word of user u's review of item i;
where w_{u,i,j} denotes the j-th word of user u's review of item i;
Step (2): construct a recommendation objective function that combines the user rating matrix factorization model with the topic discovery model; the calculation formula is as follows:
where Θ denotes the rating parameters, i.e. the parameters of the rating matrix factorization model;
where Φ = {θ, φ} denotes the topic parameters, i.e. the parameter set of the LDA topic discovery model over the reviews;
where κ denotes the weight controlling the transformation function;
where z holds the topic assignment of each word in the corpus T;
where rec(u, i) is the predicted rating of user u for item i;
where r_{u,i} denotes the rating given by user u to item i;
where λ is a hyperparameter controlling the weight of the two parts;
where logL(T | θ, φ, z) denotes the log-likelihood of the LDA model on the user review corpus;
Step (3): realize product recommendation prediction based on user review text and rating data through iterative computation of the objective function; the objective function to be minimized is:
2. The recommendation algorithm combining user reviews and rating information according to claim 1, characterized in that the specific process of constructing, in step (1), the generative probabilistic model for discovering latent topic dimensions in user review text is as follows:
Step (1a): each document is regarded as a sequence of N words; M is the number of documents in the document set; z denotes the topic assignment of a word; α and β are the hyperparameters of the topic distribution θ and of the word distribution φ respectively, both obeying Dirichlet priors;
Step (1b): each document d ∈ D is associated with a K-dimensional topic distribution θ_d, i.e. the words of document d discuss topic k with probability θ_{d,k};
Step (1c): the topic distribution θ_d is assumed to itself obey a Dirichlet distribution; the final model contains the word distribution φ_k of each topic, the topic distribution θ_d of each document, and the topic assignment z_{d,j} of each word;
Step (1d): the parameters Φ = {θ, φ} and the topic assignments z are updated by sampling;
Step (1e): given the word distributions φ_k and the topic assignment of each word, the probability of the particular text corpus can be constructed; the calculation formula is as follows:
where N_d denotes the number of words in document d.
3. The recommendation algorithm combining user reviews and rating information according to claim 1, characterized in that the specific process of constructing, in step (2), the recommendation objective function that combines the user rating matrix factorization model with the topic discovery model is as follows:
Step (2a): a "document" is defined in this model as follows:
documents are derived from the review text: the set of all reviews of a specific item i is defined as document d_i;
Step (2b): the rating parameter γ_i to be learned and the review topic parameter θ_i are linked. It is implicitly assumed here that the number K of latent factors in the rating matrix factorization equals the number K of latent topics in the review text, and that the latent factors carry the same weight. The transformation between latent factors and latent topics is constructed as follows:
where θ_{i,k} denotes the weight of item i on latent topic k; the exponential form used here ensures that every θ_{i,k} is positive and that Σ_k θ_{i,k} = 1; γ_{i,k} is the value of item i's latent factor vector on factor k; the parameter κ is introduced to control the "peakedness" of the transformation, in other words κ is the weight controlling the conversion. As κ → ∞, θ_i approaches a unit vector that is 1 only at the maximal entry of γ_i; as κ → 0, θ_i approaches the uniform distribution. Intuitively, a large κ means users discuss only the most important topic, while a small κ means users discuss all topics evenly;
Step (2c): define the objective function over the corpus T based on user ratings and user reviews:
where Θ and Φ = {θ, φ} denote the rating parameters and the topic parameters respectively, i.e. the parameters of the rating matrix factorization model and the parameter set of the LDA topic discovery model over the reviews; κ is the weight controlling the transformation function; z holds the topic assignment of each word in the corpus T; and logL(T | θ, φ, z) is the log-likelihood of the LDA model on the user review corpus. The first part of this equation is the rating prediction error, the second part is the log-likelihood of the review-text topic model, and λ is a hyperparameter controlling the weight of the two parts. rec(u, i) is the predicted rating of user u for item i, obtained from the following formula:
rec(u, i) = α + β_u + β_i + γ_u · γ_i
where α denotes the global bias, i.e. the average of all the rating data, reflecting the influence of the particular data set on user ratings; β_u and β_i are the user bias and item bias, reflecting the influence of different users and different items on the rating; γ_u and γ_i denote the K-dimensional latent topic vectors of user u and item i respectively, where γ_i can intuitively be regarded as the "attributes" of item i and γ_u as user u's "preference" for those attributes. Meanwhile, given a training corpus of ratings T, the parameters Θ = {α, β_u, β_i, γ_u, γ_i} are usually chosen by minimizing the mean squared error (MSE), that is:
where Ω(Θ) is the regularizer penalizing "complex" models.
4. The recommendation algorithm combining user reviews and rating information according to claim 1, characterized in that the specific process by which step (3) realizes product recommendation prediction based on user review text and rating data through iterative computation of the objective function is as follows:
Step (3a): substitute the formula of step (1e) into the formula of step (2c) to build the objective function to be minimized; the calculation formula is as follows:
Step (3b): after fixing the topic assignments z_{d,j} of the words, gradient descent can be used to solve for the matrix factorization parameter set Θ of the TMF model, the topic discovery parameter set Φ, and the parameter κ of the latent dimension/topic transformation; the calculation formula is as follows:
the derivatives of the objective function used in the above update are obtained as follows:
where n_{d,k} denotes the number of times topic k occurs in document d;
Step (3c): iterate step (3b) and Gibbs sampling alternately until the output parameters no longer change or a certain threshold is reached, at which point the algorithm has converged; the Gibbs sampling formula is as follows:
CN201910531413.5A 2019-06-19 2019-06-19 Recommendation algorithm combining user reviews and rating information Pending CN110321485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910531413.5A CN110321485A (en) 2019-06-19 2019-06-19 A kind of proposed algorithm of combination user comment and score information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910531413.5A CN110321485A (en) 2019-06-19 2019-06-19 A kind of proposed algorithm of combination user comment and score information

Publications (1)

Publication Number Publication Date
CN110321485A true CN110321485A (en) 2019-10-11

Family

ID=68119820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910531413.5A Pending CN110321485A (en) 2019-06-19 2019-06-19 A kind of proposed algorithm of combination user comment and score information

Country Status (1)

Country Link
CN (1) CN110321485A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061962A (en) * 2019-11-25 2020-04-24 上海海事大学 Recommendation method based on user score analysis
CN111259238A (en) * 2020-01-13 2020-06-09 山西大学 Post-interpretable recommendation method and device based on matrix decomposition
CN111563787A (en) * 2020-03-19 2020-08-21 天津大学 Recommendation system and method based on user comments and scores
CN111667344A (en) * 2020-06-08 2020-09-15 中森云链(成都)科技有限责任公司 Personalized recommendation method integrating comments and scores
CN111899063A (en) * 2020-06-17 2020-11-06 东南大学 Fresh agricultural product online recommendation method considering customer consumption behaviors and preference
CN112905908A (en) * 2021-03-04 2021-06-04 浙江机电职业技术学院 Collaborative filtering algorithm based on score LDA
CN112966203A (en) * 2021-03-12 2021-06-15 杨虡 Grade determination method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202519A (en) * 2016-07-22 2016-12-07 桂林电子科技大学 A kind of combination user comment content and the item recommendation method of scoring
CN109903099A (en) * 2019-03-12 2019-06-18 合肥工业大学 Model building method and system for score in predicting
CN109902229A (en) * 2019-02-01 2019-06-18 中森云链(成都)科技有限责任公司 A kind of interpretable recommended method based on comment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202519A (en) * 2016-07-22 2016-12-07 桂林电子科技大学 A kind of combination user comment content and the item recommendation method of scoring
CN109902229A (en) * 2019-02-01 2019-06-18 中森云链(成都)科技有限责任公司 A kind of interpretable recommended method based on comment
CN109903099A (en) * 2019-03-12 2019-06-18 合肥工业大学 Model building method and system for score in predicting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JULIAN MCAULEY et al.: "Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text", RecSys '13: Proceedings of the 7th ACM Conference on Recommender Systems *
李琳 et al.: "融合评分矩阵与评论文本的商品推荐模型" (A product recommendation model fusing the rating matrix and review text), 《计算机学报》 (Chinese Journal of Computers) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061962A (en) * 2019-11-25 2020-04-24 上海海事大学 Recommendation method based on user score analysis
CN111061962B (en) * 2019-11-25 2023-09-29 上海海事大学 Recommendation method based on user scoring analysis
CN111259238A (en) * 2020-01-13 2020-06-09 山西大学 Post-interpretable recommendation method and device based on matrix decomposition
CN111259238B (en) * 2020-01-13 2023-04-14 山西大学 Post-interpretable recommendation method and device based on matrix decomposition
CN111563787A (en) * 2020-03-19 2020-08-21 天津大学 Recommendation system and method based on user comments and scores
CN111667344A (en) * 2020-06-08 2020-09-15 中森云链(成都)科技有限责任公司 Personalized recommendation method integrating comments and scores
CN111899063A (en) * 2020-06-17 2020-11-06 东南大学 Fresh agricultural product online recommendation method considering customer consumption behaviors and preference
CN112905908A (en) * 2021-03-04 2021-06-04 浙江机电职业技术学院 Collaborative filtering algorithm based on score LDA
CN112966203A (en) * 2021-03-12 2021-06-15 杨虡 Grade determination method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110321485A (en) A kind of proposed algorithm of combination user comment and score information
CN108959603B (en) Personalized recommendation system and method based on deep neural network
Jacobs et al. Model-based purchase predictions for large assortments
Zhang et al. Taxonomy discovery for personalized recommendation
CN103164463B (en) Method and device for recommending labels
Koren et al. Ordrec: an ordinal model for predicting personalized item rating distributions
CN107451894B (en) Data processing method, device and computer readable storage medium
Shams et al. A non-parametric LDA-based induction method for sentiment analysis
Li et al. Personalization recommendation algorithm based on trust correlation degree and matrix factorization
Liu et al. Towards a dynamic top-n recommendation framework
Zhang et al. Recommender systems based on ranking performance optimization
CN113420221B (en) Interpretable recommendation method integrating implicit article preference and explicit feature preference of user
Hazrati et al. Simulating the impact of recommender systems on the evolution of collective users' choices
Sridhar et al. Content-Based Movie Recommendation System Using MBO with DBN.
Du et al. Personalized product service scheme recommendation based on trust and cloud model
Sang et al. A ranking based recommender system for cold start & data sparsity problem
CN108182264B (en) Ranking recommendation method based on cross-domain ranking recommendation model
Claeys et al. Dynamic allocation optimization in a/b-tests using classification-based preprocessing
Krasnoshchok et al. Extended content-boosted matrix factorization algorithm for recommender systems
Mathur et al. A graph-based recommender system for food products
Lv et al. Supplier recommendation based on knowledge graph embedding
Yan et al. Tackling the achilles heel of social networks: Influence propagation based language model smoothing
Aslanyan et al. Utilizing textual reviews in latent factor models for recommender systems
Duan et al. An adaptive dirichlet multinomial mixture model for short text streaming clustering
Abbas et al. A deep learning approach for context-aware citation recommendation using rhetorical zone classification and similarity to overcome cold-start problem

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination