CN108363804A - Local model weighted fusion Top-N movie recommendation method based on user clustering - Google Patents
- Publication number: CN108363804A
- Application number: CN201810169922.3A
- Authority: CN (China)
- Prior art keywords: user, model, film, document, theme
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
A local-model weighted-fusion Top-N movie recommendation method based on user clustering, comprising the following steps: (1) data preprocessing: remove inactive users and low-popularity movies, construct a movie-tag document for each user, convert explicit ratings into implicit feedback, and construct the user-movie implicit feedback matrix A; (2) user clustering: train an LDA topic model on the movie tag information to obtain user feature vectors, and cluster the users with a spectral clustering algorithm; (3) train the local recommendation models and the global recommendation model; (4) produce recommendations by weighted fusion of the models; and (5) verify the effectiveness of the model by leave-one-out cross-validation.
Description
Technical field
The present invention relates to a method for recommending movies over a network.
Background art
With the rapid development of information technology and social networks, the volume of data generated on the Internet has grown exponentially in recent years, and the era of big data has arrived. As data volumes increase, it becomes harder and harder for people to find the information they actually want. This is where recommender systems deliver their greatest value. Based on user profiles, item information, and users' historical behavior data, a recommendation algorithm can accurately predict a user's preferences and recommend items of likely interest in a personalized way, greatly reducing the cost of finding the target information.
Recommendation algorithms can be divided into content-based recommendation and collaborative filtering. Modern recommender systems have two main tasks: rating prediction, and Top-N recommendation, which is the more widely used in commercial settings today. A Top-N recommendation algorithm presents the user with a ranked list of n items from which the user picks what interests them. Top-N recommendation models fall into two main types: neighborhood-based collaborative filtering and model-based collaborative filtering. The former is subdivided into user-based neighborhood models (UserKNN) and item-based neighborhood models (ItemKNN); the latter is typified by latent factor models.
As the saying goes, "birds of a feather flock together." Different user groups tend to form their own distinctive behavior patterns, so the similarity between the same two items varies from one group to another. A single recommendation model usually cannot capture these local differences in similarity: it treats the similarity of two items as the same in every context, so it cannot accurately capture users' actual preferences, which lowers the quality of personalized recommendation. Algorithms that train several local recommendation models and then fuse them to improve global recommendation quality solve this problem to some extent, but they often fail to make full use of the data available in the recommendation scenario; the data they use is rather limited, and the final recommendation quality is mediocre.
Summary of the invention
To overcome the inability of prior-art single models to capture user preferences accurately, and the reliance of multi-model fusion algorithms on a single kind of training data, the present invention provides a new local-model weighted-fusion movie recommendation algorithm based on user clustering that realizes personalized Top-N movie recommendation.
The invention uses the textual content information of movies: an LDA topic model computes semantic-level user feature vectors, on the basis of which spectral clustering groups the users into local populations. The invention further uses users' rating information on movies to build local recommendation models and a global recommendation model with a sparse linear model, and obtains the final personalized Top-N movie recommendation through a linear weighted fusion of the local and global models.
The overall flow of the local-model weighted-fusion Top-N movie recommendation method based on user clustering is shown in Fig. 1. The method comprises the following steps:
Step 1: Data preprocessing. Remove inactive users and low-popularity movies; construct the user movie-tag documents; convert explicit ratings into implicit feedback and construct the user-movie implicit feedback matrix A;
1.1 Clean the raw data set: remove users who have watched fewer than 20 movies and, at the same time, remove movies that have been rated fewer than 20 times, obtaining a new training data set;
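The cleaning rule in step 1.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `clean_ratings` and the `(user, movie, score)` tuple layout are hypothetical, a user's number of ratings is assumed to stand in for the number of movies watched, and the filter is assumed to run as a single pass over counts taken from the raw data.

```python
from collections import Counter

MIN_COUNT = 20  # threshold stated in step 1.1


def clean_ratings(ratings, min_count=MIN_COUNT):
    """Drop users with fewer than min_count ratings and movies with fewer
    than min_count ratings.

    ratings: list of (user, movie, score) tuples (hypothetical layout).
    Counts are taken once, on the raw data, so both filters apply together.
    """
    per_user = Counter(u for u, _, _ in ratings)
    per_movie = Counter(m for _, m, _ in ratings)
    return [
        (u, m, s)
        for u, m, s in ratings
        if per_user[u] >= min_count and per_movie[m] >= min_count
    ]
```

Because the counts come from the raw data, a user may end up slightly below the threshold once unpopular movies are removed; repeating the filter until nothing changes would be a stricter variant.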
1.2 Collect the tags that all users have assigned to movies in the new data set to build a tag dictionary. Represent each user by a document composed of the tags of all movies that the user has watched; the documents of all users form a corpus. Compute the TF-IDF value of each word of each document with respect to the corpus. The term frequency TF, the inverse document frequency IDF, and the term frequency-inverse document frequency TF-IDF are computed by formulas (1), (2), and (3):
$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \quad (1)$$
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \quad (2)$$
$$TFIDF_{i,j} = TF_{i,j} \times IDF_i \quad (3)$$
where $TF_{i,j}$ is the term frequency of word $t_i$ in document $d_j$, $n_{i,j}$ is the number of times $t_i$ occurs in $d_j$, and $\sum_k n_{k,j}$ is the total number of occurrences of all words in $d_j$; $IDF_i$ is the inverse document frequency of $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents that contain $t_i$; $TFIDF_{i,j}$ is the term frequency-inverse document frequency of $t_i$ in $d_j$;
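As a worked illustration of the TF-IDF weighting in step 1.2, the sketch below computes the weights for a toy corpus in which each document is the list of tags of one user. The function name `tf_idf` is hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: tf-idf weight} dict per document.

    corpus: list of documents, each a list of tag words.
    TF is the word count divided by the document length; IDF is
    log(number of documents / number of documents containing the word).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in counts.items()
        })
    return weights
```

A tag that appears in every user document gets weight 0, so tags shared by the whole population do not contribute to distinguishing users.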
1.3 Convert the explicit ratings (for example, scores from 1 to 5) into implicit feedback expressed as 0 or 1: an entry is 1 if the user has rated the movie and 0 otherwise, the unrated movies being the candidates for recommendation. This yields an n × m user-movie implicit feedback matrix, where n is the number of users and m is the number of movies;
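The conversion in step 1.3 amounts to recording which user-movie pairs have a rating at all. A minimal sketch, assuming the same hypothetical `(user, movie, score)` tuples as input and a dense list-of-lists matrix for readability (a real data set would normally use a sparse matrix):

```python
def implicit_matrix(ratings):
    """Build the n x m 0/1 user-movie matrix A from explicit ratings.

    ratings: list of (user, movie, score) tuples (hypothetical layout).
    Returns (A, users, movies), where A[i][j] is 1 if users[i] rated movies[j].
    """
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    u_idx = {u: i for i, u in enumerate(users)}
    m_idx = {m: j for j, m in enumerate(movies)}
    A = [[0] * len(movies) for _ in users]
    for u, m, _ in ratings:
        A[u_idx[u]][m_idx[m]] = 1  # the score itself is discarded
    return A, users, movies
```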
Step 2: User clustering. Using the movie tag information, train an LDA topic model to obtain user feature vectors, and cluster the users with a spectral clustering algorithm;
2.1 The LDA topic model is a three-layer document-topic-word Bayesian network. Given a corpus, the model infers the topic distribution of each document in the corpus and the word distribution of each topic. Its joint probability is given by formula (4):
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \quad (4)$$
where θ is the topic distribution of a document, z denotes a topic, w denotes a document, α is the Dirichlet prior parameter of the multinomial topic distribution of each document, β is the Dirichlet prior parameter of the multinomial word distribution of each topic, N is the number of words in the document, $z_n$ is the topic of the n-th word of the document, and $w_n$ is the n-th word of the document;
Every movie carries tags assigned by multiple users. Each movie tag is mapped to a word $w_n$; the set of tags of all movies that a user has watched is mapped to a document w; and a particular type of movie that users prefer is mapped to a topic z. If the data set contains n users, a corpus of n documents and a dictionary are produced. Each document in the corpus is represented by a vector whose length equals the dictionary size, each component being the TF-IDF value of the corresponding dictionary tag for that user document within the corpus;
To distinguish more distinctive user groups, the differences between topics should be as large as possible. To determine the best number of topics, train several LDA models with different topic counts, compute for each model the average similarity between its learned topic vectors, and take the topic count of the model with the smallest average topic-vector similarity as the best topic count. LDA training yields the topic distribution θ of each document, which is used as the feature vector of the corresponding user;
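The topic-count selection rule described in step 2.1 can be sketched as follows, assuming the LDA models have already been trained elsewhere and that each model is available as a list of topic vectors (topic-word distributions). Cosine similarity is an assumption here, since the text does not name the similarity measure, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def average_topic_similarity(topics):
    """Mean pairwise similarity of a model's topic vectors (needs >= 2 topics)."""
    pairs = list(combinations(topics, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def best_topic_count(models):
    """models maps a topic count to that model's list of topic vectors.

    Returns the topic count whose topics are least similar on average.
    """
    return min(models, key=lambda k: average_topic_similarity(models[k]))
```

A lower average similarity means the topics overlap less, which matches the stated goal of making the topics as different from one another as possible.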
2.2 Cluster the users with a spectral clustering algorithm, using the n user feature vectors obtained above;
The number of clusters must be determined before clustering. Each dimension of a trained user vector represents the user's degree of membership in the corresponding topic. To determine the importance of each topic within the current user population, sum the feature vectors of all users dimension by dimension and take the average, obtaining a topic intensity vector that represents the population as a whole; the best number of clusters is determined by inspecting the distribution of values in this vector. For example, in an LDA training run with 10 topics, the above procedure yields a 10-dimensional topic intensity vector, visualized in Fig. 2 (the vertical axis is topic intensity, the horizontal axis is the topic). Inspection shows that topics 2, 9, 3, 8, and 6 have the greatest intensity in the current data set, which indicates that these types of movies have the most fans, so in this case the users are grouped into 5 clusters by spectral clustering. The spectral clustering steps are as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors $t_1, t_2, \dots, t_k$ of L;
(4) form the matrix $T \in R^{n \times k}$ from the k column vectors $t_1, t_2, \dots, t_k$;
(5) for i = 1, …, n, let $y_i \in R^k$ be the i-th row vector of T;
(6) use the K-Means algorithm to cluster the users $(y_i)_{i=1,2,\dots,n}$ into clusters $C_1, C_2, \dots, C_k$;
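The topic intensity vector and the first two spectral clustering steps can be sketched in plain Python as follows. The similarity measure between users is not specified in the text, so cosine similarity is assumed; the function names are hypothetical. The eigen-decomposition and K-Means steps (3) to (6) are left out, since in practice they would be delegated to a numerical library.

```python
import math


def topic_intensity(thetas):
    """Average the user topic vectors dimension by dimension."""
    n = len(thetas)
    return [sum(column) / n for column in zip(*thetas)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def laplacian(thetas):
    """Steps (1)-(2): similarity matrix W, node degrees, and L = D - W.

    The diagonal of W is set to 0 (no self-similarity); D is diagonal, so it
    is returned as the list of its diagonal entries.
    """
    n = len(thetas)
    W = [[0.0 if i == j else cosine(thetas[i], thetas[j]) for j in range(n)]
         for i in range(n)]
    degrees = [sum(row) for row in W]
    L = [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return W, degrees, L
```

Sorting the topic intensity vector in descending order and looking for the point where the values drop off is one way to read the number of dominant topics, which the text then uses as the number of clusters k.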
For each user cluster, set to 0 all row vectors of the original implicit feedback training matrix A that belong to users outside the cluster, so that each cluster produces a corresponding local implicit feedback training matrix $A^{P_u}$, where $P_u$ denotes the cluster index and $P_u \in \{1, \dots, k\}$;
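The construction of the local training matrices can be sketched as below. The function name `local_matrices` is hypothetical, and cluster labels are assumed to be 0-based (0 to k-1) rather than the 1 to k used in the text.

```python
def local_matrices(A, labels, k):
    """Return k copies of A, one per cluster, with outside users' rows zeroed.

    A: n x m 0/1 matrix as a list of rows.
    labels: labels[u] is the cluster index (0..k-1) of user u.
    """
    return [
        [row[:] if labels[u] == c else [0] * len(row) for u, row in enumerate(A)]
        for c in range(k)
    ]
```

Each local matrix keeps the full n × m shape, so the same training routine can be applied unchanged to the global matrix and to every local matrix.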
Step 3: Train the local recommendation models and the global recommendation model. The loss function of the sparse linear model SLIM is given by formula (5), where A is the original user-movie implicit feedback matrix and α and ρ control the weights of the L1-norm and L2-norm terms. Minimizing this loss function yields a sparse movie-similarity matrix W of size m × m. In this model the L1 norm controls the sparsity of W, and the L2 norm controls the complexity of the model and prevents overfitting. The model is trained by stochastic gradient descent, with the columns $w_j$ of W trained in parallel to obtain the final matrix W, as shown in formula (6), where $a_j$ is the j-th column of A. The predicted recommendation score $\tilde{a}_{ij}$ of user i for movie j is computed by formula (7):
$$\tilde{a}_{ij} = a_i^{\mathsf{T}} w_j \quad (7)$$
where $a_i^{\mathsf{T}}$ is the i-th row of A;
The global and local recommendation models are both built with the sparse linear model SLIM as the base recommendation model: the global movie-similarity matrix W is trained on the global implicit feedback training matrix A, and the local movie-similarity matrix $W^{P_u}$ of each cluster is trained on the corresponding local implicit feedback training matrix $A^{P_u}$;
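The per-column training of W can be illustrated with the sketch below. Several points are assumptions rather than the patent's stated method: the objective is taken to be the standard SLIM form, 1/2·‖a_j - A·w_j‖² + ρ/2·‖w_j‖² + α·‖w_j‖₁ with w_j ≥ 0 and w_jj = 0; α is assumed to weight the L1 term and ρ the L2 term; and full-batch projected gradient steps are used in place of stochastic gradient descent to keep the example short and deterministic. The function names are hypothetical.

```python
def train_slim_column(A, j, alpha=0.01, rho=0.1, lr=0.01, epochs=1000):
    """Learn column w_j of the item-similarity matrix for target movie j.

    A: n x m 0/1 matrix as a list of rows. The step size lr must be small
    relative to the scale of A for the iteration to converge.
    """
    n, m = len(A), len(A[0])
    w = [0.0] * m
    for _ in range(epochs):
        # residual of the current reconstruction of column j
        residual = [A[i][j] - sum(A[i][l] * w[l] for l in range(m))
                    for i in range(n)]
        for l in range(m):
            if l == j:
                continue  # keep w_jj = 0 so a movie cannot predict itself
            grad = -sum(A[i][l] * residual[i] for i in range(n)) + rho * w[l]
            # gradient step plus L1 shrinkage, then projection onto w >= 0
            w[l] = max(0.0, w[l] - lr * (grad + alpha))
    return w


def predict_score(a_i, w_j):
    """Formula (7): score of a user with row a_i for the movie of column w_j."""
    return sum(x * y for x, y in zip(a_i, w_j))
```

Because each column depends only on A and on its own index j, the columns can be trained independently, which is what makes the parallel training mentioned in the text possible.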
Step 4: Weighted-fusion recommendation. The weighted-fusion recommendation score of the local and global models is computed by formula (8):
$$\tilde{r}_{uj} = \sum_{l \in R_u} \left( g\, w_{lj} + (1 - g)\, w_{lj}^{P_u} \right) \quad (8)$$
where $\tilde{r}_{uj}$ is the weighted-fusion recommendation score of movie j for user u, $R_u$ is the set of all movies that user u has interacted with, $w_{lj}$ is the similarity of movies l and j in the global model, $w_{lj}^{P_u}$ is the similarity of movies l and j in the local model of the cluster $P_u$ to which user u belongs, and the parameter g is the weight of the global model. Adjusting g controls the relative weights of the global and local models in the fused model; the fused model achieves its best recommendation quality at the optimal value of g, which can be determined experimentally on the current data set. Once all model parameters have been determined, compute the weighted-fusion recommendation score of every movie for the current user u, sort the scores in descending order, remove the movies that the user has already interacted with, and recommend the top N movies to the user;
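The fusion and ranking procedure of Step 4 can be sketched as follows, assuming that the fused score of movie j sums g·w_lj + (1 - g)·w_lj^{P_u} over the movies l that the user has interacted with, so that the local model receives the remaining weight 1 - g. The function names are hypothetical, and similarity matrices are plain lists of rows indexed as `W[l][j]`.

```python
def fused_scores(interacted, W_global, W_local, g):
    """Weighted-fusion score of every movie for one user.

    interacted: set of movie indices the user has interacted with (R_u).
    W_global, W_local: m x m similarity matrices; W_local is the matrix of
    the user's own cluster. g in [0, 1] is the weight of the global model.
    """
    m = len(W_global)
    return [
        sum(g * W_global[l][j] + (1 - g) * W_local[l][j] for l in interacted)
        for j in range(m)
    ]


def top_n(interacted, W_global, W_local, g, n):
    """Rank unseen movies by fused score and return the best n indices."""
    scores = fused_scores(interacted, W_global, W_local, g)
    candidates = [j for j in range(len(scores)) if j not in interacted]
    candidates.sort(key=lambda j: scores[j], reverse=True)
    return candidates[:n]
```

With g = 1 the recommendation falls back to the global model alone, and with g = 0 to the user's local model alone, which makes a simple sweep over g a natural way to search for the best weight experimentally.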
Step 5: The effectiveness of the model can be verified by leave-one-out cross-validation. One movie is drawn at random from each user's set of rated movies and placed in the test set; the remaining movies form the training set. The trained model then recommends a Top-N movie list to each user, and it is checked whether the user's held-out movie appears in the recommendation list and, if so, at which position $p_i$. Finally, recommendation quality is measured with two metrics, the hit rate (HR) and the average reciprocal hit rank (ARHR), defined in formulas (9) and (10), where #hits is the number of hits and #users is the total number of users:
$$HR = \frac{\#hits}{\#users} \quad (9)$$
$$ARHR = \frac{1}{\#users} \sum_{i=1}^{\#hits} \frac{1}{p_i} \quad (10)$$
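The two evaluation metrics can be computed as in the sketch below, assuming HR is the fraction of users whose held-out movie appears in their Top-N list and ARHR averages the reciprocal of the 1-based hit position over all users. The function name and the dictionary-based inputs are hypothetical.

```python
def hit_rate_and_arhr(recommendations, held_out):
    """Compute (HR, ARHR) for a leave-one-out evaluation.

    recommendations: {user: ranked list of recommended movies}.
    held_out: {user: the single movie withheld for testing}.
    """
    hits = 0
    reciprocal_sum = 0.0
    for user, movie in held_out.items():
        ranked = recommendations.get(user, [])
        if movie in ranked:
            hits += 1
            reciprocal_sum += 1.0 / (ranked.index(movie) + 1)  # 1-based position
    n_users = len(held_out)
    return hits / n_users, reciprocal_sum / n_users
```

HR only records whether the held-out movie was found, while ARHR also rewards placing it near the top of the list, so the two are usually reported together.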
This concludes the steps of the recommendation method.
In summary, the present invention proposes a local-model weighted-fusion Top-N movie recommendation algorithm based on user clustering. To address the problem that a traditional single recommendation model cannot accurately estimate local differences between items and therefore cannot accurately capture user preferences, the invention trains a global recommendation model and local recommendation models based on user clustering, and realizes Top-N movie recommendation through a linear weighted fusion of the models. In addition, to make full use of the data available in the movie recommendation scenario and to improve recommendation quality along several dimensions, the invention uses movie tag information: an LDA topic model computes semantic-level feature vectors for users, and users are divided into groups at the semantic level.
The advantages of the invention are:
(1) A novel algorithmic approach. Using the sparse linear model as the base recommendation model, a global recommendation model and local recommendation models based on user clustering are trained separately and then combined by linear weighted fusion into the final fused model. This approach accounts for differences in movie similarity across user groups and effectively overcomes the inability of a single model to capture user preferences accurately.
(2) Recommendation quality improved along multiple dimensions. Besides training the recommendation models on conventional rating data, in the user clustering stage the invention introduces movie tag data, uses the LDA topic model to analyze the semantic-level topic attributes of the user population, obtains user feature vectors, and clusters users with spectral clustering, further improving recommendation quality.
(3) Simple and fast implementation. In the training stage of the local and global models, the models are mutually independent, and the columns of each weight matrix are also mutually independent, so parallel training can be used, which greatly reduces training time and improves training efficiency.
(4) Better recommendation quality. The proposed local-model weighted-fusion algorithm effectively combines content-based recommendation, neighborhood-based collaborative filtering, and model-based collaborative filtering; it exploits the strengths of each and compensates for their respective weaknesses, and substantially improves recommendation quality compared with using any single recommendation algorithm.
Description of the drawings
Fig. 1 is the overall flow chart of the method of the invention;
Fig. 2 is the topic intensity distribution chart of the method of the invention.
Detailed description
With reference to the overall flow chart in Fig. 1, the technical solution of the invention comprises four stages: data preprocessing, user clustering, training of the global and local recommendation models, and linear weighted fusion of the recommendation models. The data preprocessing stage cleans the data set, removes inactive users and unpopular movies, and constructs the corpus for LDA topic model training and the user-movie implicit feedback training matrix for sparse linear model training. The user clustering stage trains the LDA topic model on the user corpus from the first stage to obtain user feature vectors and clusters the users with a spectral clustering algorithm; each cluster produces a local implicit feedback training matrix. The model training stage trains sparse linear models on the original implicit feedback matrix and on the local implicit feedback matrices to obtain the global model and the local models respectively. The weighted-fusion recommendation stage fuses the global and local models from the previous stage by linear weighting to obtain the final recommendation model.
The inputs of the invention are the users' movie rating data and the movie tag data; the output is a personalized Top-N movie recommendation list for each user.
The specific steps are the same as Steps 1 through 5 described above.
The content described in the embodiments of this specification merely enumerates forms in which the inventive concept may be realized. The scope of protection of the invention shall not be construed as limited to the specific forms set out in the embodiments; it also extends to equivalent technical means that a person skilled in the art can conceive of on the basis of the inventive concept.
Claims (1)
1. A Top-N movie recommendation method based on user clustering and weighted fusion of local models, specifically comprising the following steps:
Step 1: Data preprocessing; perform data cleaning to remove inactive users and movies of very low popularity; construct the user movie-tag documents; convert explicit rating information into implicit feedback information and construct the user-movie implicit feedback matrix A;
1.1 Perform data cleaning on the raw data set: remove users who have watched fewer than 20 movies and, at the same time, remove movies that have been rated fewer than 20 times, obtaining a new training data set;
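A minimal sketch of this cleaning step is given below. It assumes ratings are available as (user, movie, score) tuples and that both thresholds are applied in a single pass against counts taken from the raw data; the function name `clean_ratings` and the tuple layout are hypothetical.

```python
from collections import Counter


def clean_ratings(ratings, min_user_movies=20, min_movie_ratings=20):
    """Drop inactive users and unpopular movies from a rating list.

    ratings : list of (user, movie, score) tuples from the raw data set
    """
    user_counts = Counter(user for user, _, _ in ratings)
    movie_counts = Counter(movie for _, movie, _ in ratings)
    # Keep a rating only if both its user and its movie pass the thresholds.
    return [
        (user, movie, score)
        for user, movie, score in ratings
        if user_counts[user] >= min_user_movies
        and movie_counts[movie] >= min_movie_ratings
    ]


if __name__ == "__main__":
    raw = [("u1", "m1", 5), ("u1", "m2", 4), ("u2", "m1", 3), ("u3", "m3", 2)]
    # Small thresholds are used here only so the toy data is not emptied.
    print(clean_ratings(raw, min_user_movies=2, min_movie_ratings=2))
```

Because removing movies can push some users back below the threshold, an implementation could also repeat the filter until nothing changes; the text does not say whether that is intended.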
1.2 Collect the tags that all users in the new data set have assigned to movies and generate a tag dictionary; represent each user by the document composed of the tags of all movies that user has watched; the documents of all users form a corpus; compute the TF-IDF value in the corpus of every word in each document; the term frequency TF, the inverse document frequency IDF and the term frequency-inverse document frequency TF-IDF are computed as shown in formulas (1), (2) and (3);
$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \quad (1)$$
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \quad (2)$$
$$TFIDF_{i,j} = TF_{i,j} \times IDF_i \quad (3)$$
where $TF_{i,j}$ denotes the term frequency of word $t_i$ in document $d_j$, $n_{i,j}$ denotes the number of times word $t_i$ occurs in document $d_j$, and $\sum_k n_{k,j}$ denotes the total number of occurrences of all words in document $d_j$; $IDF_i$ denotes the inverse document frequency of word $t_i$, $|D|$ denotes the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ denotes the number of documents containing word $t_i$; $TFIDF_{i,j}$ denotes the term frequency-inverse document frequency of word $t_i$ in document $d_j$;
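The TF-IDF computation over the user tag documents can be sketched as follows. The sketch assumes TF is the tag count divided by the document length and IDF is the natural logarithm of the number of documents divided by the number of documents containing the tag; the logarithm base and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Compute TF-IDF weights for every tag of every user document.

    corpus : list of documents, each a list of tag strings (one per user)
    Returns a list of dicts mapping tag -> TF-IDF weight.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each tag once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append(
            {
                tag: (count / total) * math.log(n_docs / doc_freq[tag])
                for tag, count in counts.items()
            }
        )
    return weights


if __name__ == "__main__":
    docs = [["comedy", "comedy", "drama"], ["drama"]]
    for row in tf_idf(docs):
        print(row)
```

A tag that appears in every user's document gets weight 0 under this definition, which is the intended effect: such a tag does not help to tell users apart.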
1.3 Convert explicit rating information, such as 1-5 point scores, into implicit feedback information represented by 0 and 1: an entry is set to 1 if the current user has rated the current movie, and to 0 for movies the user has not rated, which are the movies to be recommended; this yields an n × m user-movie implicit feedback matrix, where n is the number of users and m is the number of movies;
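Building the 0/1 matrix is straightforward; a small sketch follows. The function name `build_implicit_matrix` and the explicit `users` and `movies` ordering lists are hypothetical choices made for the example.

```python
def build_implicit_matrix(ratings, users, movies):
    """Turn explicit ratings into an n x m 0/1 implicit feedback matrix.

    ratings : list of (user, movie, score) tuples
    users   : ordered list of the n user ids (row order)
    movies  : ordered list of the m movie ids (column order)
    """
    user_index = {user: i for i, user in enumerate(users)}
    movie_index = {movie: j for j, movie in enumerate(movies)}
    matrix = [[0] * len(movies) for _ in users]
    for user, movie, _score in ratings:
        # Any rating, whatever its value, counts as an interaction.
        matrix[user_index[user]][movie_index[movie]] = 1
    return matrix


if __name__ == "__main__":
    ratings = [("u1", "m1", 5), ("u2", "m2", 1)]
    for row in build_implicit_matrix(ratings, ["u1", "u2"], ["m1", "m2", "m3"]):
        print(row)
```

Note that the score itself is discarded, so a 1-point rating and a 5-point rating both become 1.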
Step 2: User clustering; using the movie tag information, obtain user feature vectors by training an LDA topic model, and cluster the users with a spectral clustering algorithm;
2.1 The LDA topic model is a three-layer document-topic-word Bayesian network; given a corpus, the LDA topic model mainly analyzes the topic distribution of every document in the corpus and the word distribution of every topic; the joint probability of the topic and word distributions is shown in formula (4);
θ denotes the topic distribution of a document, z denotes a topic, w denotes a document, α denotes the Dirichlet prior parameter of the multinomial distribution of topics under each document, β denotes the Dirichlet prior parameter of the multinomial distribution of words under each topic, N denotes the number of documents in the corpus, $z_n$ denotes the topic of the n-th word in a document, and $w_n$ denotes the n-th word of a document;
Every movie carries tags assigned to it by multiple users; one movie tag is mapped to one word $w_n$, the set of tags of all movies a user has watched is mapped to one document w, and a specific movie genre preferred by users is mapped to one topic z; if the data set contains n users, a corpus of n documents and a dictionary can be generated; every document in the corpus is represented by a vector whose length equals the dictionary size, and each entry of the vector is the TF-IDF value of the corresponding dictionary tag with respect to that user's document and the corpus;
To distinguish more distinctive user groups, the differences between topics should be as large as possible; to determine the best number of topics, several LDA models are trained with different numbers of topics, the average similarity between the topic vectors obtained from each LDA model is computed, and the number of topics of the model with the smallest average topic-vector similarity is taken as the best number of topics; LDA training yields the topic distribution θ of every document, which is used as the feature vector of the corresponding user;
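The topic-number selection rule can be sketched independently of any particular LDA library. The sketch below assumes the topic-word vectors of each candidate model have already been obtained from some LDA trainer, and it assumes cosine similarity as the similarity measure, which the text does not specify; the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_topic_similarity(topic_vectors):
    """Mean pairwise similarity between the topic vectors of one model."""
    pairs = list(combinations(topic_vectors, 2))
    if not pairs:
        raise ValueError("at least two topics are needed")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def pick_topic_count(models):
    """models : dict mapping topic count -> list of topic-word vectors.

    Returns the topic count whose topics are, on average, least similar.
    """
    return min(models, key=lambda k: average_topic_similarity(models[k]))


if __name__ == "__main__":
    candidates = {
        2: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        3: [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    }
    print(pick_topic_count(candidates))
```

In the toy input the three-topic model contains two identical topics, so its average similarity is higher and the two-topic model is preferred.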
2.2 Using the n user feature vectors obtained in the preceding step, cluster the users with the spectral clustering algorithm; the number of clusters must be determined before clustering; because each dimension of a trained user vector expresses that user's degree of membership in the corresponding topic, the importance of each topic in the current user population is determined by summing all user feature vectors dimension by dimension and then averaging, which yields a vector representing the overall topic intensity; the best number of clusters is determined by inspecting the distribution of values of this topic intensity vector; the spectral clustering algorithm is as follows:
(1) compute the n × n similarity matrix W and the degree matrix D;
(2) compute the Laplacian matrix L = D - W;
(3) compute the first k eigenvectors $t_1, t_2, \ldots, t_k$ of L;
(4) form the matrix $T \in \mathbb{R}^{n \times k}$ from the k column vectors $t_1, t_2, \ldots, t_k$;
(5) for i = 1, ..., n, let $y_i \in \mathbb{R}^k$ be the i-th row vector of T;
(6) use the K-Means algorithm to cluster the users $(y_i)_{i=1,2,\ldots,n}$ into clusters $C_1, C_2, \ldots, C_k$;
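Steps (1) to (6) can be sketched with NumPy as follows. Several choices here are assumptions not fixed by the text: a Gaussian kernel with bandwidth `sigma` is used for the similarity matrix, "first k eigenvectors" is read as those of the k smallest eigenvalues, and K-Means is a plain Lloyd iteration with random initial centers. The function name `spectral_cluster` is hypothetical.

```python
import numpy as np


def spectral_cluster(features, k, sigma=1.0, n_iter=50, seed=0):
    """Cluster user feature vectors with unnormalized spectral clustering."""
    X = np.asarray(features, dtype=float)
    n = X.shape[0]
    # (1) similarity matrix W (Gaussian kernel, an assumption) and degree matrix D
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # (2) Laplacian matrix L = D - W
    L = D - W
    # (3)-(4) eigenvectors of the k smallest eigenvalues form T (n x k)
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    T = eigvecs[:, :k]
    # (5)-(6) K-Means on the rows y_i of T
    rng = np.random.default_rng(seed)
    centers = T[rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        dists = ((T[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave an empty cluster's center unchanged
                centers[c] = T[labels == c].mean(axis=0)
    return labels


if __name__ == "__main__":
    users = [[0.9, 0.1], [0.88, 0.12], [0.92, 0.08],
             [0.1, 0.9], [0.12, 0.88], [0.08, 0.92]]
    print(spectral_cluster(users, k=2, sigma=0.5))
```

The returned labels are 0-based, whereas the text numbers clusters from 1 to k; which integer a given group receives depends on the random initialization.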
For each user cluster, all rows of the original implicit feedback training matrix A that belong to users outside the cluster are set to 0, so that each cluster generates a corresponding local implicit feedback training matrix; $P_u$ denotes the cluster index, and $P_u \in \{1, \ldots, k\}$;
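Deriving the per-cluster training matrices amounts to masking rows, as the sketch below shows. The function name `local_matrices` is hypothetical, and cluster labels are taken to be 0-based here rather than 1 to k.

```python
def local_matrices(matrix, labels, k):
    """Build one local implicit feedback matrix per user cluster.

    matrix : n x m 0/1 implicit feedback matrix A (list of lists)
    labels : cluster index (0-based) of each of the n users
    k      : number of clusters
    """
    result = []
    for cluster in range(k):
        local = [
            # Rows of users in the cluster are copied, all others are zeroed.
            list(row) if labels[u] == cluster else [0] * len(row)
            for u, row in enumerate(matrix)
        ]
        result.append(local)
    return result


if __name__ == "__main__":
    A = [[1, 0], [0, 1], [1, 1]]
    for local in local_matrices(A, labels=[0, 1, 0], k=2):
        print(local)
```

Each local matrix keeps the full n × m shape, so the same training routine can be applied to the global matrix and to every local one.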
Step 3: Determine the local recommendation models and train the global recommendation model; the loss function of the sparse linear model SLIM is shown in formula (5);
where A denotes the original user-movie implicit feedback matrix, and α and ρ control the weights of the L1 and L2 norms; minimizing this loss function yields a sparse movie similarity matrix W of size m × m; in this model the L1 norm controls the sparsity of W, and the L2 norm controls the complexity of the model and prevents overfitting; the model is trained by stochastic gradient descent, and the columns $w_j$ of the W matrix are trained in parallel to obtain the final W matrix, as shown in formula (6);
where $a_j$ denotes the j-th column of matrix A; the predicted recommendation score of user i for movie j is computed as shown in formula (7);
Using the sparse linear model SLIM as the base recommendation model, construct the global recommendation model and the local recommendation models: train on the global implicit feedback training matrix A to obtain the global movie similarity matrix W, and train on each local implicit feedback training matrix to obtain the local movie similarity matrix corresponding to each cluster;
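A compact way to see how such a similarity matrix is learned is the NumPy sketch below. It is not the patent's training procedure: it assumes the common SLIM objective 0.5·||A − AW||² + (l2/2)·||W||² + l1·||W||₁ with W ≥ 0 and a zero diagonal, and it swaps stochastic gradient descent for full-batch proximal gradient descent to keep the example short. The parameters `l1` and `l2` stand in for the L1 and L2 weights and do not reproduce the exact roles of α and ρ; the function name `train_slim` is hypothetical.

```python
import numpy as np


def train_slim(A, l1=0.01, l2=0.1, n_iter=200):
    """Learn an m x m sparse item similarity matrix from a 0/1 matrix A."""
    A = np.asarray(A, dtype=float)
    m = A.shape[1]
    gram = A.T @ A
    # Step size from the Lipschitz constant of the smooth part of the loss.
    step = 1.0 / (np.linalg.eigvalsh(gram)[-1] + l2)
    W = np.zeros((m, m))
    for _ in range(n_iter):
        # Gradient of 0.5*||A - A W||_F^2 + (l2/2)*||W||_F^2 with respect to W.
        grad = gram @ W - gram + l2 * W
        W = W - step * grad
        # Proximal step: L1 shrinkage combined with the non-negativity constraint.
        W = np.maximum(W - step * l1, 0.0)
        # A movie must not be used to predict itself.
        np.fill_diagonal(W, 0.0)
    return W


if __name__ == "__main__":
    A = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
    W = train_slim(A)
    print(np.round(W, 3))
    # Predicted scores for all users and movies under the assumed model.
    print(np.round(np.asarray(A, dtype=float) @ W, 3))
```

Calling the same function on the global matrix and on each local matrix would give the global and per-cluster similarity matrices; since the columns of W do not interact in this objective, they could equally be fitted one at a time or in parallel.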
Step 4: Model weighted fusion recommendation stage; the weighted fusion recommendation score of the local models is computed as shown in formula (8);
where the formula gives the weighted fusion recommendation score of movie j for user u, $R_u$ is the set of all movies with which user u has interacted, $w_{lj}$ is the similarity of movie l and movie j in the global model, the local counterpart is the similarity of movie l and movie j in the local model corresponding to the cluster $P_u$ to which user u belongs, and parameter g is the weight parameter of the global model; adjusting g controls the relative weight of the global model and the local models in the fusion model, and the best recommendation performance of the fusion model is obtained by determining the optimal weight parameter g; the best global-model weight parameter is determined experimentally on the current data set; after all parameters of the model are determined, the weighted fusion recommendation score of every movie for the current user u is computed, the movies are sorted by score in descending order, movies with which the current user has already interacted are removed, and the top N movies are recommended to the current user;
Step 5: Demonstrate the validity of the model by leave-one-out cross-validation; one movie is randomly drawn from each user's set of rated movies and placed in the test set, and the remaining movies serve as the training set of the model; the trained model then recommends a Top-N movie list to each user, and it is observed whether that user's held-out movie in the test set appears in the recommendation list and, if so, at which position $p_i$ in the list; finally, the hit rate HR and the average reciprocal hit rank ARHR are used to measure the recommendation quality of the model, where #hits denotes the number of recommendation hits and #users denotes the total number of users, as shown in formulas (9) and (10);
$$\mathrm{HR} = \frac{\#\text{hits}}{\#\text{users}} \quad (9)$$
$$\mathrm{ARHR} = \frac{1}{\#\text{users}} \sum_{i=1}^{\#\text{hits}} \frac{1}{p_i} \quad (10)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810169922.3A CN108363804B (en) | 2018-03-01 | 2018-03-01 | Local model weighted fusion Top-N movie recommendation method based on user clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108363804A (en) | 2018-08-03 |
CN108363804B (en) | 2020-08-21 |
Family
ID=63002919
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810169922.3A Active CN108363804B (en) | 2018-03-01 | 2018-03-01 | Local model weighted fusion Top-N movie recommendation method based on user clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108363804B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544216A (en) * | 2013-09-23 | 2014-01-29 | Tcl集团股份有限公司 | Information recommendation method and system combining image content and keywords |
US20150120742A1 (en) * | 2012-06-21 | 2015-04-30 | Tencent Technology (Shenzhen) Company Limited | Method and system for processing recommended target software |
CN107609201A (en) * | 2017-10-25 | 2018-01-19 | 广东工业大学 | A kind of recommended models generation method and relevant apparatus based on commending system |
Non-Patent Citations (2)
Title |
---|
EVANGELIA CHRISTAKOPOULOU: "Local item-item models for top-n recommendation", 《ACM》 * |
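李倩 (Li Qian): "基于谱聚类与多因子融合的协同过滤推荐算法" [Collaborative filtering recommendation algorithm based on spectral clustering and multi-factor fusion], 《计算机应用研究》 [Application Research of Computers] * |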
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408702B (en) * | 2018-08-29 | 2021-07-16 | 昆明理工大学 | Mixed recommendation method based on sparse edge noise reduction automatic coding |
CN109408702A (en) * | 2018-08-29 | 2019-03-01 | 昆明理工大学 | A kind of mixed recommendation method based on sparse edge noise reduction autocoding |
CN111309873A (en) * | 2018-11-23 | 2020-06-19 | 北京嘀嘀无限科技发展有限公司 | Data processing method and device, electronic equipment and storage medium |
CN111309874A (en) * | 2018-11-23 | 2020-06-19 | 北京嘀嘀无限科技发展有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110008377A (en) * | 2019-03-27 | 2019-07-12 | 华南理工大学 | A method of film recommendation is carried out using user property |
CN110008377B (en) * | 2019-03-27 | 2021-09-21 | 华南理工大学 | Method for recommending movies by using user attributes |
CN110084670A (en) * | 2019-04-15 | 2019-08-02 | 东北大学 | A kind of commodity on shelf combined recommendation method based on LDA-MLP |
CN110084670B (en) * | 2019-04-15 | 2022-03-25 | 东北大学 | Shelf commodity combination recommendation method based on LDA-MLP |
CN110069663A (en) * | 2019-04-29 | 2019-07-30 | 厦门美图之家科技有限公司 | Video recommendation method and device |
CN110069663B (en) * | 2019-04-29 | 2021-06-04 | 厦门美图之家科技有限公司 | Video recommendation method and device |
CN111984856A (en) * | 2019-07-25 | 2020-11-24 | 北京嘀嘀无限科技发展有限公司 | Information pushing method and device, server and computer readable storage medium |
CN110443502A (en) * | 2019-08-06 | 2019-11-12 | 合肥工业大学 | Crowdsourcing task recommendation method and system based on worker's capability comparison |
CN112395487A (en) * | 2019-08-14 | 2021-02-23 | 腾讯科技(深圳)有限公司 | Information recommendation method and device, computer-readable storage medium and electronic equipment |
CN112395487B (en) * | 2019-08-14 | 2024-04-26 | 腾讯科技(深圳)有限公司 | Information recommendation method and device, computer readable storage medium and electronic equipment |
CN110795570B (en) * | 2019-10-11 | 2022-06-17 | 上海上湖信息技术有限公司 | Method and device for extracting user time sequence behavior characteristics |
CN110795570A (en) * | 2019-10-11 | 2020-02-14 | 上海上湖信息技术有限公司 | Method and device for extracting user time sequence behavior characteristics |
CN111008334B (en) * | 2019-12-04 | 2023-04-18 | 华中科技大学 | Top-K recommendation method and system based on local pairwise ordering and global decision fusion |
CN111008334A (en) * | 2019-12-04 | 2020-04-14 | 华中科技大学 | Top-K recommendation method and system based on local pairwise ordering and global decision fusion |
CN113111251A (en) * | 2020-01-10 | 2021-07-13 | 阿里巴巴集团控股有限公司 | Project recommendation method, device and system |
CN111460046A (en) * | 2020-03-06 | 2020-07-28 | 合肥海策科技信息服务有限公司 | Scientific and technological information clustering method based on big data |
CN111581522A (en) * | 2020-06-05 | 2020-08-25 | 预见你情感(北京)教育咨询有限公司 | Social analysis method based on user identity identification |
CN111897999A (en) * | 2020-07-27 | 2020-11-06 | 九江学院 | LDA-based deep learning model construction method for video recommendation |
CN111897999B (en) * | 2020-07-27 | 2023-06-16 | 九江学院 | Deep learning model construction method for video recommendation and based on LDA |
CN112184391B (en) * | 2020-10-16 | 2023-10-10 | 中国科学院计算技术研究所 | Training method of recommendation model, medium, electronic equipment and recommendation model |
CN112184391A (en) * | 2020-10-16 | 2021-01-05 | 中国科学院计算技术研究所 | Recommendation model training method, medium, electronic device and recommendation model |
CN112348629A (en) * | 2020-10-26 | 2021-02-09 | 邦道科技有限公司 | Commodity information pushing method and device |
CN112364245B (en) * | 2020-11-20 | 2021-12-21 | 浙江工业大学 | Top-K movie recommendation method based on heterogeneous information network embedding |
CN112364245A (en) * | 2020-11-20 | 2021-02-12 | 浙江工业大学 | Top-K movie recommendation method based on heterogeneous information network embedding |
CN112925926A (en) * | 2021-01-28 | 2021-06-08 | 北京达佳互联信息技术有限公司 | Training method and device of multimedia recommendation model, server and storage medium |
CN113342963A (en) * | 2021-04-29 | 2021-09-03 | 山东大学 | Service recommendation method and system based on transfer learning |
CN113268670B (en) * | 2021-06-16 | 2022-09-27 | 中移(杭州)信息技术有限公司 | Latent factor hybrid recommendation method, device, equipment and computer storage medium |
CN113268670A (en) * | 2021-06-16 | 2021-08-17 | 中移(杭州)信息技术有限公司 | Latent factor hybrid recommendation method, device, equipment and computer storage medium |
CN113449147A (en) * | 2021-07-06 | 2021-09-28 | 乐视云计算有限公司 | Video recommendation method and device based on theme |
CN114936314A (en) * | 2022-03-24 | 2022-08-23 | 阿里巴巴达摩院(杭州)科技有限公司 | Recommendation information generation method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108363804B (en) | 2020-08-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||