CN108363804A - Local model weighted fusion Top-N movie recommendation method based on user clustering - Google Patents

Local model weighted fusion Top-N movie recommendation method based on user clustering

Info

Publication number
CN108363804A
CN108363804A, CN108363804B (application number CN201810169922.3A)
Authority
CN
China
Prior art keywords
user
model
film
document
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810169922.3A
Other languages
Chinese (zh)
Other versions
CN108363804B (en)
Inventor
汤颖
孙康高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810169922.3A priority Critical patent/CN108363804B/en
Publication of CN108363804A publication Critical patent/CN108363804A/en
Application granted granted Critical
Publication of CN108363804B publication Critical patent/CN108363804B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A local model weighted fusion Top-N movie recommendation method based on user clustering comprises: step 1, data preprocessing: cleaning the data of inactive users and movies with low popularity, constructing user movie-tag documents, converting explicit rating information into implicit feedback information, and constructing a user-movie implicit feedback matrix A; step 2, user clustering: training an LDA topic model with the movie tag information to obtain user feature vectors and clustering users with a spectral clustering algorithm; step 3, determining local recommendation models and training the global recommendation model; step 4, a model weighted fusion recommendation stage; and step 5, proving the effectiveness of the model by leave-one-out cross validation.

Description

Local model weighted fusion Top-N movie recommendation method based on user clustering
Technical field
The present invention relates to a method for recommending movies on a network.
Background technology
With the rapid development of information technology and social networks, the data generated on the Internet has grown exponentially in recent years, and the era of big data has arrived. As the volume of data grows, it becomes increasingly difficult for people to find the information they really want in the mass of data. This is where a recommender system can deliver its greatest application value. Based on user profiles, item information and users' historical behavior data, a recommendation algorithm can accurately predict users' preferences and recommend, in a personalized way, the things they are likely to be interested in, greatly reducing the cost for users of finding the target information.
Recommendation algorithms can be divided into content-based recommendation and collaborative filtering. A modern recommender system mainly has two tasks: one is rating prediction, and the other is Top-N recommendation, which is the most widely used in current commercial scenarios. A Top-N recommendation algorithm presents each user with a ranked list of n items, from which the user can pick the items of interest. Top-N recommendation models fall into two broad categories: neighborhood-based collaborative filtering and model-based collaborative filtering. The former can be further divided into user-based neighborhood models (UserKNN) and item-based neighborhood models (ItemKNN), while the latter is represented by latent factor models.
As the saying goes, "birds of a feather flock together": different user groups often form their own distinctive behavior patterns, so the similarity between the same two items varies across different crowds. A single recommendation model often cannot capture these local similarity differences; it assumes that the similarity of two items is the same in every scenario, so it cannot accurately capture users' actual preferences, which lowers the quality of personalized recommendation. Algorithms that train multiple local recommendation models and then fuse them to improve the global recommendation effect can solve this problem to some extent, but they often do not make full use of the data that the recommendation scenario provides; the data they use are relatively simple, and the final recommendation effect is mediocre.
Summary of the invention
In order to overcome the problems that a single model of the prior art cannot accurately capture user preferences and that existing multi-model fusion algorithms use only a single kind of training data, the present invention provides a new local model weighted fusion movie recommendation algorithm based on user clustering, which realizes Top-N personalized recommendation of movies.
The present invention uses the textual content information of movies to compute semantic-level user feature vectors through an LDA topic model, and on this basis clusters users by spectral clustering to construct local crowds. The present invention further uses users' movie rating information to build local recommendation models and a global recommendation model with sparse linear models, and realizes the final Top-N personalized movie recommendation by linearly weighted fusion of the local models and the global model.
The local model weighted fusion Top-N movie recommendation method based on user clustering, whose overall procedure is shown in Figure 1, specifically comprises the following steps:
Step 1: Data preprocessing stage. Clean the data of inactive users and movies with very low popularity; construct user movie-tag documents; convert explicit rating information into implicit feedback information and construct the user-movie implicit feedback matrix A;
1.1 Perform data cleaning on the raw data set: remove users who have watched fewer than 20 movies and, at the same time, remove movies that have been rated fewer than 20 times, obtaining a new training data set;
1.2 Collect the tags that all users have assigned to movies in the new data set to generate a tag dictionary. Each user is represented by the document formed by the tags of all the movies that user has watched, and the documents of all users form a corpus. Compute the TF-IDF value of every word in each document over the corpus. The term frequency TF, inverse document frequency IDF and term frequency-inverse document frequency TF-IDF are calculated as shown in formulas (1), (2) and (3):
TF_{i,j} = n_{i,j} / Σ_k n_{k,j}  (1)
IDF_i = log( |D| / |{j : t_i ∈ d_j}| )  (2)
TFIDF_{i,j} = TF_{i,j} × IDF_i  (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} denotes the number of times word t_i appears in document d_j, and Σ_k n_{k,j} denotes the total number of occurrences of all words in document d_j. IDF_i denotes the inverse document frequency of word t_i, |D| denotes the total number of documents in the corpus, and |{j : t_i ∈ d_j}| denotes the number of documents containing word t_i. TFIDF_{i,j} denotes the term frequency-inverse document frequency of word t_i in document d_j;
1.3 Convert explicit rating information, e.g. scores of 1-5, into implicit feedback information represented by 0-1: if the current user has rated the current movie, record 1; movies the user has not rated, i.e. the movies to be recommended, are recorded as 0. This yields an n × m user-movie implicit feedback matrix, where n is the number of users and m is the number of movies;
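By way of illustration only (a minimal sketch, not part of the patent text), step 1 can be mocked up in Python as follows, assuming the ratings are available as a pandas DataFrame with columns user, movie, rating and the tags as a DataFrame with columns user, movie, tag; every function, variable and column name here is an illustrative assumption:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    def preprocess(ratings: pd.DataFrame, tags: pd.DataFrame, min_count: int = 20):
        # 1.1 Remove users with fewer than 20 watched movies and movies rated fewer than 20 times.
        user_counts = ratings.groupby("user")["movie"].count()
        movie_counts = ratings.groupby("movie")["user"].count()
        ratings = ratings[ratings["user"].isin(user_counts[user_counts >= min_count].index)]
        ratings = ratings[ratings["movie"].isin(movie_counts[movie_counts >= min_count].index)]

        users = sorted(ratings["user"].unique())
        movies = sorted(ratings["movie"].unique())
        u_idx = {u: i for i, u in enumerate(users)}
        m_idx = {m: j for j, m in enumerate(movies)}

        # 1.3 Binary implicit feedback matrix A (n users x m movies): 1 = rated, 0 = not rated.
        A = np.zeros((len(users), len(movies)))
        for row in ratings.itertuples():
            A[u_idx[row.user], m_idx[row.movie]] = 1.0

        # 1.2 One tag document per user: the tags of all movies the user has watched, TF-IDF weighted.
        tags = tags[tags["movie"].isin(m_idx)]
        movie_tags = tags.groupby("movie")["tag"].apply(lambda t: " ".join(map(str, t)))
        docs = [" ".join(movie_tags.get(m, "") for m in ratings.loc[ratings["user"] == u, "movie"])
                for u in users]
        user_tfidf = TfidfVectorizer().fit_transform(docs)   # corpus of n user tag documents
        return A, user_tfidf, users, movies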
Step 2: User clustering stage. Use the movie tag information to train an LDA topic model to obtain user feature vectors, and cluster the users with a spectral clustering algorithm;
2.1 An LDA topic model is a three-layer document-topic-word Bayesian network. Given a corpus, the model can infer the topic distribution of every document in the corpus and the word distribution of each topic. Its joint probability is shown in formula (4):
p(θ, z, w | α, β) = p(θ | α) · Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)  (4)
where θ denotes the topic distribution of a document, z denotes a topic, w denotes a document, α denotes the Dirichlet prior parameter of the per-document topic multinomial distribution, β denotes the Dirichlet prior parameter of the per-topic word multinomial distribution, N denotes the number of words in the document, z_n denotes the topic of the n-th word in the document, and w_n denotes the n-th word of the document;
Every movie carries tags assigned to it by multiple users. A movie tag is mapped to a word w_n, the collection of the tags of all the movies a user has watched is mapped to a document w, and a specific type of movie preferred by a user is mapped to a topic z. If there are n users in the data set, a corpus containing n documents and a dictionary are produced; every document in the corpus is represented by a vector of dictionary length, in which each value is the TF-IDF value of the corresponding dictionary tag in that user document over the corpus;
In order to distinguish more distinctive user groups, the larger the difference between different topics, the better. To determine the best number of topics, multiple LDA models are trained with different topic numbers, the average similarity between the topic vectors obtained by each trained LDA model is computed, and the topic number of the model with the smallest average topic-vector similarity is taken as the best topic number. Through LDA model training, the topic distribution θ of each document is obtained and used as the feature vector of the corresponding user;
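As an illustration only, the topic-number selection described above can be sketched with scikit-learn's LatentDirichletAllocation; note that scikit-learn's LDA expects a bag-of-words count matrix X of the user tag documents rather than the TF-IDF weights mentioned above, and the candidate topic numbers and the cosine similarity measure are assumptions of this sketch:

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.metrics.pairwise import cosine_similarity

    def select_topic_number(X, candidate_k=(5, 10, 15, 20), seed=0):
        best_k, best_sim, best_theta = None, np.inf, None
        for k in candidate_k:
            lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(X)
            # Topic vectors: per-topic word distributions.
            topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
            sim = cosine_similarity(topics)
            # Average pairwise similarity between distinct topics; smaller means more distinctive topics.
            avg_sim = (sim.sum() - np.trace(sim)) / (k * (k - 1))
            if avg_sim < best_sim:
                best_k, best_sim = k, avg_sim
                best_theta = lda.transform(X)   # per-user topic distribution θ = user feature vectors
        return best_k, best_theta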
2.2 Using the n user feature vectors obtained in the above steps, cluster the users with spectral clustering;
The number of clusters must be determined before clustering. Because each dimension of a trained user vector indicates the degree to which that user belongs to the corresponding topic, the importance of each topic in the current user population can be determined by summing all user feature vectors dimension by dimension and then averaging, which yields a topic intensity vector representing the whole population. The best number of clusters is determined by observing the distribution of values in the topic intensity vector. For example, in an LDA training run with 10 topics, a 10-dimensional topic intensity vector is obtained by the above process and visualized as shown in Figure 2 (the vertical axis is topic intensity and the horizontal axis is the topic). Observation shows that topics 2, 9, 3, 8 and 6 concentrate most of the intensity in the current data, which indicates that most people like watching these types of movies, so in this case spectral clustering is used to group the users into 5 clusters. The specific steps of spectral clustering are as follows:
(1) Compute the n × n similarity matrix W and the degree matrix D;
(2) Compute the Laplacian matrix L = D - W;
(3) Compute the first k eigenvectors t_1, t_2, …, t_k of L;
(4) Form the matrix T ∈ R^{n×k} from the k column vectors t_1, t_2, …, t_k;
(5) For i = 1, …, n, let y_i ∈ R^k be the i-th row vector of T;
(6) Cluster the points (y_i), i = 1, 2, …, n, into clusters C_1, C_2, …, C_k with the K-Means algorithm;
For each user cluster, the row vectors in the original implicit feedback training matrix A of all users not belonging to that cluster are set to 0, so each cluster generates a corresponding local implicit feedback training matrix A^{P_u}, where P_u denotes the cluster number and P_u ∈ {1, …, k};
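A minimal illustrative sketch of spectral clustering steps (1)-(6) and of the construction of the local matrices A^{P_u}; the Gaussian similarity kernel and its bandwidth gamma are assumptions of the sketch, since the description does not state which similarity function is used:

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(user_features, k, gamma=1.0):
        # (1) Similarity matrix W (Gaussian kernel, an assumed choice) and degree matrix D.
        sq_dists = ((user_features[:, None, :] - user_features[None, :, :]) ** 2).sum(-1)
        W = np.exp(-gamma * sq_dists)
        D = np.diag(W.sum(axis=1))
        # (2) Laplacian matrix.
        L = D - W
        # (3)-(4) The first k eigenvectors of L (smallest eigenvalues) form T of shape n x k.
        _, eigvecs = np.linalg.eigh(L)
        T = eigvecs[:, :k]
        # (5)-(6) Treat each row y_i of T as a point and cluster the points with K-Means.
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(T)

    def local_matrices(A, labels, k):
        # Zero out the rows of users that do not belong to cluster P_u to obtain A^{P_u}.
        local = []
        for c in range(k):
            A_c = A.copy()
            A_c[labels != c, :] = 0.0
            local.append(A_c)
        return local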
Step 3: Determine the local recommendation models and carry out global recommendation model training. The loss function of the sparse linear model SLIM is shown in formula (5):
min_W (1/2)·||A - AW||_F^2 + (ρ/2)·||W||_F^2 + α·||W||_1, subject to W ≥ 0 and diag(W) = 0  (5)
where A denotes the original user-movie implicit feedback matrix, and α and ρ control the weights of the L1 and L2 norms. Minimizing this loss function yields an m × m sparse movie similarity matrix W. In this model the L1 norm controls the sparsity of W, and the L2 norm controls the complexity of the model and prevents over-fitting. The model trains each column w_j of the matrix W in parallel by stochastic gradient descent to obtain the final matrix W, as shown in formula (6):
min_{w_j} (1/2)·||a_j - A·w_j||_2^2 + (ρ/2)·||w_j||_2^2 + α·||w_j||_1, subject to w_j ≥ 0 and w_{jj} = 0  (6)
where a_j denotes the j-th column of matrix A. The predicted recommendation score ã_{ij} of user i for movie j is computed as shown in formula (7):
ã_{ij} = a_i^T · w_j  (7)
where a_i^T denotes the i-th row of matrix A (the implicit feedback of user i) and w_j denotes the j-th column of W;
Using the sparse linear model SLIM as the basic recommendation model, the global recommendation model and the local recommendation models are constructed: the global implicit feedback training matrix A is used to train the global movie similarity matrix W, and the local implicit feedback training matrices A^{P_u} are used to train the local movie similarity matrix W^{P_u} corresponding to each cluster;
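A minimal illustrative sketch of the SLIM training just described; instead of the stochastic gradient descent named above it solves each column's sub-problem (6) with scikit-learn's ElasticNet (coordinate descent with a non-negativity constraint), and the hyperparameters alpha and l1_ratio only stand in for α and ρ up to ElasticNet's own parameterization:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    def train_slim(A, alpha=0.1, l1_ratio=0.5):
        # Train the m x m movie similarity matrix W column by column.
        n_users, n_movies = A.shape
        W = np.zeros((n_movies, n_movies))
        for j in range(n_movies):
            a_j = A[:, j].copy()
            A_j = A.copy()
            A_j[:, j] = 0.0   # zero out column j so that movie j cannot "explain" itself (w_jj = 0)
            model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, positive=True,
                               fit_intercept=False, max_iter=200)
            model.fit(A_j, a_j)
            W[:, j] = model.coef_
        return W

    # Global model on A, and one local model per cluster on A^{P_u}:
    # W_global = train_slim(A)
    # W_locals = [train_slim(A_c) for A_c in local_matrices(A, labels, k)]

Because the columns of W are independent of one another, the loop over j can be run in parallel, which corresponds to the parallel training described above.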
Step 4: Model weighted fusion recommendation stage. The local model weighted fusion recommendation score is computed as shown in formula (8):
r̃_{uj} = g · Σ_{l ∈ R_u} w_{lj} + (1 - g) · Σ_{l ∈ R_u} w_{lj}^{P_u}  (8)
where r̃_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the similarity of movie l and movie j in the local model corresponding to the cluster P_u that user u belongs to, and the parameter g is the weight parameter of the global model. The weight ratio of the global model and the local models in the fusion model is controlled by adjusting the parameter g, and the best recommendation effect of the fusion model is obtained by determining the optimal weight parameter g. The best global model weight parameter can be determined by experiments on the current data set. After all parameters of the model are determined, the weighted fusion recommendation scores of all movies for the current user u are computed and sorted in descending order, the movies the user has already interacted with are removed, and the top N movies are recommended to the current user;
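A minimal illustrative sketch of the fusion and ranking of step 4, reusing A, W_global, W_locals and the cluster labels from the sketches above; the values of g and N are illustrative:

    import numpy as np

    def recommend_top_n(u, A, W_global, W_locals, labels, g=0.5, N=10):
        seen = A[u] > 0                              # R_u: movies user u has already interacted with
        W_local = W_locals[labels[u]]                # local model of the cluster P_u that u belongs to
        # Formula (8): weighted fusion of global and local similarity scores.
        scores = g * (A[u] @ W_global) + (1 - g) * (A[u] @ W_local)
        scores[seen] = -np.inf                       # never recommend movies the user has already seen
        return np.argsort(-scores)[:N]               # indices of the Top-N recommended movies

Since A[u] is a 0-1 vector, A[u] @ W_global equals the sum of w_{lj} over the movies l in R_u, i.e. the global term of formula (8), and likewise for the local term.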
Step 5: The effectiveness of the model can be proved by leave-one-out cross validation. One movie can be randomly drawn from each user's rating set and put into the test set, and the remaining movies are used as the training set of the model. The trained model is then used to recommend a Top-N movie list to each user, and it is observed whether the held-out movie of each user in the test set appears in that user's recommendation list and, if so, at which position p_i in the list. Finally, the hit rate (HR) and the average reciprocal hit rank (ARHR) can be used to measure the recommendation quality of the model, where #hits denotes the number of recommendation hits and #users denotes the total number of users; their definitions are shown in formulas (9) and (10):
HR = #hits / #users  (9)
ARHR = (1 / #users) · Σ_{i=1}^{#hits} (1 / p_i)  (10)
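A minimal illustrative sketch of the leave-one-out evaluation, assuming rec_lists holds each user's Top-N recommendation list and test_item holds the index of each user's held-out movie (both names are illustrative):

    def evaluate(rec_lists, test_item):
        # Leave-one-out HR and ARHR over all users, formulas (9) and (10).
        hits, reciprocal_rank_sum, n_users = 0, 0.0, len(test_item)
        for u, rec in enumerate(rec_lists):
            rec = list(rec)
            if test_item[u] in rec:
                hits += 1
                reciprocal_rank_sum += 1.0 / (rec.index(test_item[u]) + 1)   # position p_i is 1-based
        return hits / n_users, reciprocal_rank_sum / n_users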
The flow of the recommendation method ends here.
In summary, the present invention proposes a local model weighted fusion Top-N movie recommendation algorithm based on user clustering. To solve the problem that a traditional single recommendation model cannot accurately estimate the local differences in item similarity and therefore cannot accurately capture user preferences, it proposes to train a global recommendation model and local recommendation models based on user clustering separately, and to realize Top-N movie recommendation by linearly weighted fusion between the models. In addition, in order to make full use of the data in the movie recommendation scenario and improve the recommendation quality from multiple dimensions, the present invention uses movie tag information and an LDA topic model to compute user feature vectors at the semantic level, thereby dividing users into semantic-level groups.
The advantages of the invention are: (1) The algorithm idea is novel. Using the sparse linear model as the basic recommendation model, a global recommendation model and local recommendation models based on user clustering are trained separately, and the final fusion model is generated by linearly weighted fusion. This approach can handle the differences in movie similarity across different crowds and effectively overcomes the problem that a single model cannot accurately capture user preferences. (2) Recommendation quality is improved from multiple dimensions. In addition to training the models with traditional rating data, in the user clustering stage the invention introduces movie tag data, uses an LDA topic model to analyze the topic properties of the population at the semantic level to obtain user feature vectors, and realizes crowd clustering with spectral clustering, which further improves the recommendation quality. (3) The algorithm is simple and fast to implement. In the local model and global model training stage, the models are independent of one another and the columns of each similarity matrix are independent of one another, so parallel training can be used, which greatly reduces the training time of the models and improves training efficiency. (4) The recommendation quality is better. The local model weighted fusion recommendation algorithm proposed by the invention is an effective combination of content-based recommendation, neighborhood-based collaborative filtering and model-based collaborative filtering; it makes full use of the advantages of each algorithm and compensates for their respective shortcomings, and achieves a considerable improvement in recommendation quality compared with using any single algorithm.
Description of the drawings
Fig. 1 is the overall flow chart of the method of the present invention;
Fig. 2 is the topic intensity distribution diagram of the method of the present invention.
Detailed description of the embodiments
Referring to the overall flow chart of the technical solution in Figure 1, the present invention comprises four stages: the data preprocessing stage, the user clustering stage, the global and local recommendation model training stage, and the recommendation model linear weighted fusion stage. In the data preprocessing stage, the data set is cleaned, inactive users and unpopular movies are removed, and the corpus for LDA topic model training and the user-movie implicit feedback training matrix for sparse linear model training are constructed. In the user clustering stage, the user corpus obtained in the first stage is used to train an LDA topic model to obtain user feature vectors, the users are clustered by the spectral clustering algorithm, and each cluster generates a local implicit feedback training matrix. In the global and local recommendation model training stage, the global model and the local models are obtained by training sparse linear models on the original implicit feedback matrix and the local implicit feedback matrices respectively. In the model linear weighted fusion recommendation stage, the global model and the local models obtained in the previous stages are fused by linear weighting to obtain the final recommendation model.
The input of the present invention is the users' movie rating data and the movie tag data, and the output is a Top-N personalized movie recommendation list for each user.
The specific implementation follows Steps 1 to 5 exactly as described above.
The content described in the embodiments of this specification is merely an enumeration of the forms in which the inventive concept may be realized. The protection scope of the present invention should not be regarded as limited to the specific forms stated in the embodiments; it also covers equivalent technical means that those skilled in the art can conceive of according to the inventive concept.

Claims (1)

1. A local model weighted fusion Top-N movie recommendation method based on user clustering, specifically comprising the following steps:
Step 1: Data preprocessing; perform data cleaning on inactive users and movies with very low popularity; construct user movie-tag documents; convert explicit rating information into implicit feedback information and construct the user-movie implicit feedback matrix A;
1.1 Perform data cleaning on the raw data set: remove users who have watched fewer than 20 movies and, at the same time, remove movies rated fewer than 20 times, obtaining a new training data set;
1.2 Collect the tags that all users have assigned to movies in the new data set to generate a tag dictionary; represent each user by the document formed by the tags of all the movies the user has watched, so that the documents of all users form a corpus; compute the TF-IDF value of every word in each document over the corpus; the term frequency TF, inverse document frequency IDF and term frequency-inverse document frequency TF-IDF are calculated as shown in formulas (1), (2) and (3):
TF_{i,j} = n_{i,j} / Σ_k n_{k,j}  (1)
IDF_i = log( |D| / |{j : t_i ∈ d_j}| )  (2)
TFIDF_{i,j} = TF_{i,j} × IDF_i  (3)
where TF_{i,j} denotes the term frequency of word t_i in document d_j, n_{i,j} denotes the number of times word t_i appears in document d_j, and Σ_k n_{k,j} denotes the total number of occurrences of all words in document d_j; IDF_i denotes the inverse document frequency of word t_i, |D| denotes the total number of documents in the corpus, and |{j : t_i ∈ d_j}| denotes the number of documents containing word t_i; TFIDF_{i,j} denotes the term frequency-inverse document frequency of word t_i in document d_j;
1.3 Convert explicit rating information, e.g. scores of 1-5, into implicit feedback information represented by 0-1: if the current user has rated the current movie, record 1; movies the user has not rated, i.e. the movies to be recommended, are recorded as 0; this yields an n × m user-movie implicit feedback matrix, where n is the number of users and m is the number of movies;
Step 2: User clustering; use the movie tag information to train an LDA topic model to obtain user feature vectors, and cluster the users with a spectral clustering algorithm;
2.1 An LDA topic model is a three-layer document-topic-word Bayesian network; given a corpus, the LDA topic model infers the topic distribution of every document in the corpus and the word distribution of each topic; its joint probability is shown in formula (4):
p(θ, z, w | α, β) = p(θ | α) · Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)  (4)
where θ denotes the topic distribution of a document, z denotes a topic, w denotes a document, α denotes the Dirichlet prior parameter of the per-document topic multinomial distribution, β denotes the Dirichlet prior parameter of the per-topic word multinomial distribution, N denotes the number of words in the document, z_n denotes the topic of the n-th word in the document, and w_n denotes the n-th word of the document;
every movie carries tags assigned to it by multiple users; a movie tag is mapped to a word w_n, the collection of the tags of all the movies a user has watched is mapped to a document w, and a specific type of movie preferred by a user is mapped to a topic z; if there are n users in the data set, a corpus containing n documents and a dictionary are produced, and every document in the corpus is represented by a vector of dictionary length, in which each value is the TF-IDF value of the corresponding dictionary tag in that user document over the corpus;
in order to distinguish more distinctive user groups, the larger the difference between different topics, the better; to determine the best number of topics, multiple LDA models are trained with different topic numbers, the average similarity between the topic vectors obtained by each trained LDA model is computed, and the topic number of the model with the smallest average topic-vector similarity is taken as the best topic number; through LDA model training, the topic distribution θ of each document is obtained and used as the feature vector of the corresponding user;
2.2 Using the n user feature vectors obtained in the above steps, cluster the users with spectral clustering;
the number of clusters must be determined before clustering; because each dimension of a trained user vector indicates the degree to which that user belongs to the corresponding topic, the importance of each topic in the current user population is determined by summing all user feature vectors dimension by dimension and then averaging, which yields a topic intensity vector representing the whole population; the best number of clusters is determined by observing the distribution of values in the topic intensity vector; the specific steps of spectral clustering are as follows:
(1) Compute the n × n similarity matrix W and the degree matrix D;
(2) Compute the Laplacian matrix L = D - W;
(3) Compute the first k eigenvectors t_1, t_2, …, t_k of L;
(4) Form the matrix T ∈ R^{n×k} from the k column vectors t_1, t_2, …, t_k;
(5) For i = 1, …, n, let y_i ∈ R^k be the i-th row vector of T;
(6) Cluster the points (y_i), i = 1, 2, …, n, into clusters C_1, C_2, …, C_k with the K-Means algorithm;
For each user cluster, the row vectors in the original implicit feedback training matrix A of all users not belonging to that cluster are set to 0, so each cluster generates a corresponding local implicit feedback training matrix A^{P_u}, where P_u denotes the cluster number and P_u ∈ {1, …, k};
Step 3: Determine the local recommendation models and carry out global recommendation model training; the loss function of the sparse linear model SLIM is shown in formula (5):
min_W (1/2)·||A - AW||_F^2 + (ρ/2)·||W||_F^2 + α·||W||_1, subject to W ≥ 0 and diag(W) = 0  (5)
where A denotes the original user-movie implicit feedback matrix, and α and ρ control the weights of the L1 and L2 norms; minimizing this loss function yields an m × m sparse movie similarity matrix W; in this model the L1 norm controls the sparsity of W, and the L2 norm controls the complexity of the model and prevents over-fitting; the model trains each column w_j of the matrix W in parallel by stochastic gradient descent to obtain the final matrix W, as shown in formula (6):
min_{w_j} (1/2)·||a_j - A·w_j||_2^2 + (ρ/2)·||w_j||_2^2 + α·||w_j||_1, subject to w_j ≥ 0 and w_{jj} = 0  (6)
where a_j denotes the j-th column of matrix A; the predicted recommendation score ã_{ij} of user i for movie j is computed as shown in formula (7):
ã_{ij} = a_i^T · w_j  (7)
where a_i^T denotes the i-th row of matrix A and w_j denotes the j-th column of W;
using the sparse linear model SLIM as the basic recommendation model, the global recommendation model and the local recommendation models are constructed: the global implicit feedback training matrix A is used to train the global movie similarity matrix W, and the local implicit feedback training matrices A^{P_u} are used to train the local movie similarity matrix W^{P_u} corresponding to each cluster;
Step 4: Model weighted fusion recommendation stage; the local model weighted fusion recommendation score is computed as shown in formula (8):
r̃_{uj} = g · Σ_{l ∈ R_u} w_{lj} + (1 - g) · Σ_{l ∈ R_u} w_{lj}^{P_u}  (8)
where r̃_{uj} denotes the weighted fusion recommendation score of movie j for user u, R_u is the set of all movies that user u has interacted with, w_{lj} is the similarity of movie l and movie j in the global model, w_{lj}^{P_u} is the similarity of movie l and movie j in the local model corresponding to the cluster P_u that user u belongs to, and the parameter g is the weight parameter of the global model; the weight ratio of the global model and the local models in the fusion model is controlled by adjusting the parameter g, and the best recommendation effect of the fusion model is obtained by determining the optimal weight parameter g; the best global model weight parameter is determined by experiments on the current data set; after all parameters of the model are determined, the weighted fusion recommendation scores of all movies for the current user u are computed and sorted in descending order, the movies the user has already interacted with are removed, and the top N movies are recommended to the current user;
Step 5: Prove the validity of the model by leave-one-out cross validation; one movie is randomly drawn from each user's rating set and put into the test set, and the remaining movies are used as the training set of the model; the trained model is then used to recommend a Top-N movie list to each user, and it is observed whether the held-out movie of each user appears in the recommendation list and at which position p_i in the list it appears; finally, the hit rate HR and the average reciprocal hit rank ARHR are used to measure the recommendation quality of the model, where #hits denotes the number of recommendation hits and #users denotes the total number of users, as shown in formulas (9) and (10):
HR = #hits / #users  (9)
ARHR = (1 / #users) · Σ_{i=1}^{#hits} (1 / p_i)  (10)
CN201810169922.3A 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering Active CN108363804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810169922.3A CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810169922.3A CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Publications (2)

Publication Number Publication Date
CN108363804A true CN108363804A (en) 2018-08-03
CN108363804B CN108363804B (en) 2020-08-21

Family

ID=63002919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810169922.3A Active CN108363804B (en) 2018-03-01 2018-03-01 Local model weighted fusion Top-N movie recommendation method based on user clustering

Country Status (1)

Country Link
CN (1) CN108363804B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408702A (en) * 2018-08-29 2019-03-01 昆明理工大学 A kind of mixed recommendation method based on sparse edge noise reduction autocoding
CN110008377A (en) * 2019-03-27 2019-07-12 华南理工大学 A method of film recommendation is carried out using user property
CN110069663A (en) * 2019-04-29 2019-07-30 厦门美图之家科技有限公司 Video recommendation method and device
CN110084670A (en) * 2019-04-15 2019-08-02 东北大学 A kind of commodity on shelf combined recommendation method based on LDA-MLP
CN110443502A (en) * 2019-08-06 2019-11-12 合肥工业大学 Crowdsourcing task recommendation method and system based on worker's capability comparison
CN110795570A (en) * 2019-10-11 2020-02-14 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN111008334A (en) * 2019-12-04 2020-04-14 华中科技大学 Top-K recommendation method and system based on local pairwise ordering and global decision fusion
CN111309873A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN111309874A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN111460046A (en) * 2020-03-06 2020-07-28 合肥海策科技信息服务有限公司 Scientific and technological information clustering method based on big data
CN111581522A (en) * 2020-06-05 2020-08-25 预见你情感(北京)教育咨询有限公司 Social analysis method based on user identity identification
CN111897999A (en) * 2020-07-27 2020-11-06 九江学院 LDA-based deep learning model construction method for video recommendation
CN111984856A (en) * 2019-07-25 2020-11-24 北京嘀嘀无限科技发展有限公司 Information pushing method and device, server and computer readable storage medium
CN112184391A (en) * 2020-10-16 2021-01-05 中国科学院计算技术研究所 Recommendation model training method, medium, electronic device and recommendation model
CN112348629A (en) * 2020-10-26 2021-02-09 邦道科技有限公司 Commodity information pushing method and device
CN112364245A (en) * 2020-11-20 2021-02-12 浙江工业大学 Top-K movie recommendation method based on heterogeneous information network embedding
CN112395487A (en) * 2019-08-14 2021-02-23 腾讯科技(深圳)有限公司 Information recommendation method and device, computer-readable storage medium and electronic equipment
CN112925926A (en) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Training method and device of multimedia recommendation model, server and storage medium
CN113111251A (en) * 2020-01-10 2021-07-13 阿里巴巴集团控股有限公司 Project recommendation method, device and system
CN113268670A (en) * 2021-06-16 2021-08-17 中移(杭州)信息技术有限公司 Latent factor hybrid recommendation method, device, equipment and computer storage medium
CN113342963A (en) * 2021-04-29 2021-09-03 山东大学 Service recommendation method and system based on transfer learning
CN113449147A (en) * 2021-07-06 2021-09-28 乐视云计算有限公司 Video recommendation method and device based on theme
CN114936314A (en) * 2022-03-24 2022-08-23 阿里巴巴达摩院(杭州)科技有限公司 Recommendation information generation method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544216A (en) * 2013-09-23 2014-01-29 Tcl集团股份有限公司 Information recommendation method and system combining image content and keywords
US20150120742A1 (en) * 2012-06-21 2015-04-30 Tencent Technology (Shenzhen) Company Limited Method and system for processing recommended target software
CN107609201A (en) * 2017-10-25 2018-01-19 广东工业大学 A kind of recommended models generation method and relevant apparatus based on commending system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150120742A1 (en) * 2012-06-21 2015-04-30 Tencent Technology (Shenzhen) Company Limited Method and system for processing recommended target software
CN103544216A (en) * 2013-09-23 2014-01-29 Tcl集团股份有限公司 Information recommendation method and system combining image content and keywords
CN107609201A (en) * 2017-10-25 2018-01-19 广东工业大学 A kind of recommended models generation method and relevant apparatus based on commending system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EVANGELIA CHRISTAKOPOULOU: "Local item-item models for top-n recommendation", 《ACM》 *
李倩 (LI, Qian): "Collaborative filtering recommendation algorithm based on spectral clustering and multi-factor fusion", 《计算机应用研究》 (Application Research of Computers) *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408702B (en) * 2018-08-29 2021-07-16 昆明理工大学 Mixed recommendation method based on sparse edge noise reduction automatic coding
CN109408702A (en) * 2018-08-29 2019-03-01 昆明理工大学 A kind of mixed recommendation method based on sparse edge noise reduction autocoding
CN111309873A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN111309874A (en) * 2018-11-23 2020-06-19 北京嘀嘀无限科技发展有限公司 Data processing method and device, electronic equipment and storage medium
CN110008377A (en) * 2019-03-27 2019-07-12 华南理工大学 A method of film recommendation is carried out using user property
CN110008377B (en) * 2019-03-27 2021-09-21 华南理工大学 Method for recommending movies by using user attributes
CN110084670A (en) * 2019-04-15 2019-08-02 东北大学 A kind of commodity on shelf combined recommendation method based on LDA-MLP
CN110084670B (en) * 2019-04-15 2022-03-25 东北大学 Shelf commodity combination recommendation method based on LDA-MLP
CN110069663A (en) * 2019-04-29 2019-07-30 厦门美图之家科技有限公司 Video recommendation method and device
CN110069663B (en) * 2019-04-29 2021-06-04 厦门美图之家科技有限公司 Video recommendation method and device
CN111984856A (en) * 2019-07-25 2020-11-24 北京嘀嘀无限科技发展有限公司 Information pushing method and device, server and computer readable storage medium
CN110443502A (en) * 2019-08-06 2019-11-12 合肥工业大学 Crowdsourcing task recommendation method and system based on worker's capability comparison
CN112395487A (en) * 2019-08-14 2021-02-23 腾讯科技(深圳)有限公司 Information recommendation method and device, computer-readable storage medium and electronic equipment
CN112395487B (en) * 2019-08-14 2024-04-26 腾讯科技(深圳)有限公司 Information recommendation method and device, computer readable storage medium and electronic equipment
CN110795570B (en) * 2019-10-11 2022-06-17 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN110795570A (en) * 2019-10-11 2020-02-14 上海上湖信息技术有限公司 Method and device for extracting user time sequence behavior characteristics
CN111008334B (en) * 2019-12-04 2023-04-18 华中科技大学 Top-K recommendation method and system based on local pairwise ordering and global decision fusion
CN111008334A (en) * 2019-12-04 2020-04-14 华中科技大学 Top-K recommendation method and system based on local pairwise ordering and global decision fusion
CN113111251A (en) * 2020-01-10 2021-07-13 阿里巴巴集团控股有限公司 Project recommendation method, device and system
CN111460046A (en) * 2020-03-06 2020-07-28 合肥海策科技信息服务有限公司 Scientific and technological information clustering method based on big data
CN111581522A (en) * 2020-06-05 2020-08-25 预见你情感(北京)教育咨询有限公司 Social analysis method based on user identity identification
CN111897999A (en) * 2020-07-27 2020-11-06 九江学院 LDA-based deep learning model construction method for video recommendation
CN111897999B (en) * 2020-07-27 2023-06-16 九江学院 Deep learning model construction method for video recommendation and based on LDA
CN112184391B (en) * 2020-10-16 2023-10-10 中国科学院计算技术研究所 Training method of recommendation model, medium, electronic equipment and recommendation model
CN112184391A (en) * 2020-10-16 2021-01-05 中国科学院计算技术研究所 Recommendation model training method, medium, electronic device and recommendation model
CN112348629A (en) * 2020-10-26 2021-02-09 邦道科技有限公司 Commodity information pushing method and device
CN112364245B (en) * 2020-11-20 2021-12-21 浙江工业大学 Top-K movie recommendation method based on heterogeneous information network embedding
CN112364245A (en) * 2020-11-20 2021-02-12 浙江工业大学 Top-K movie recommendation method based on heterogeneous information network embedding
CN112925926A (en) * 2021-01-28 2021-06-08 北京达佳互联信息技术有限公司 Training method and device of multimedia recommendation model, server and storage medium
CN113342963A (en) * 2021-04-29 2021-09-03 山东大学 Service recommendation method and system based on transfer learning
CN113268670B (en) * 2021-06-16 2022-09-27 中移(杭州)信息技术有限公司 Latent factor hybrid recommendation method, device, equipment and computer storage medium
CN113268670A (en) * 2021-06-16 2021-08-17 中移(杭州)信息技术有限公司 Latent factor hybrid recommendation method, device, equipment and computer storage medium
CN113449147A (en) * 2021-07-06 2021-09-28 乐视云计算有限公司 Video recommendation method and device based on theme
CN114936314A (en) * 2022-03-24 2022-08-23 阿里巴巴达摩院(杭州)科技有限公司 Recommendation information generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN108363804B (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN108363804A (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
CN108763362B (en) Local model weighted fusion Top-N movie recommendation method based on random anchor point pair selection
CN110162693B (en) Information recommendation method and server
Phorasim et al. Movies recommendation system using collaborative filtering and k-means
CN108665323B (en) Integration method for financial product recommendation system
CN109241440A (en) It is a kind of based on deep learning towards implicit feedback recommended method
CN109145112A (en) A kind of comment on commodity classification method based on global information attention mechanism
CN108256119A (en) A kind of construction method of resource recommendation model and the resource recommendation method based on the model
CN110263257A (en) Multi-source heterogeneous data mixing recommended models based on deep learning
CN106610970A (en) Collaborative filtering-based content recommendation system and method
JP2009099088A (en) Sns user profile extraction device, extraction method and extraction program, and device using user profile
CN110083764A (en) A kind of collaborative filtering cold start-up way to solve the problem
KR20170027576A (en) Apparatus and method of researcher rcommendation based on matching studying career
CN107180078A (en) A kind of method for vertical search based on user profile learning
Li et al. On integrating multiple type preferences into competitive analyses of customer requirements in product planning
Desai Sentiment analysis of Twitter data
CN110457477A (en) A kind of Interest Community discovery method towards social networks
CN114461879B (en) Semantic social network multi-view community discovery method based on text feature integration
CN112000804A (en) Microblog hot topic user group emotion tendentiousness analysis method
Xie et al. A probabilistic recommendation method inspired by latent Dirichlet allocation model
Barkan et al. Cold start revisited: A deep hybrid recommender with cold-warm item harmonization
CN118250516A (en) Hierarchical processing method for users
CN110795640A (en) Adaptive group recommendation method for compensating group member difference
CN109902131A (en) A kind of group recommended method based on antithesis self-encoding encoder
CN113672818B (en) Method and system for acquiring social media user portraits

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant