CN110020141A - Personalized recommendation method and system based on improved clustering and the Spark framework - Google Patents
- Publication number
- CN110020141A, CN201711132268.0A
- Authority
- CN
- China
- Prior art keywords
- project
- cluster
- cluster centre
- degree
- membership
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention discloses a personalized recommendation method based on improved clustering and the Spark framework, comprising: determining a valid rating data set; pre-clustering the items with the Canopy algorithm to generate at least one Canopy cluster center; initializing the cluster centers of the FCM (fuzzy c-means) algorithm, updating each item's membership degree in every cluster center with the membership formula, updating the cluster centers from the updated membership degrees, and iterating until the stop condition is met to determine the final cluster set; computing the similarity between the target item and each cluster center in the final cluster set, forming the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computing the similarity between the target item and every item in the candidate item space, and finding the K-nearest-neighbor set of the target item; and obtaining the user's predicted preference for the target item and recommending the N items with the highest predicted preference using the top-N recommendation method.
Description
Technical field
The present invention relates to personalized recommendation for big data, and more particularly to a personalized recommendation method and system based on improved clustering and the Spark framework.
Background art
The rapid development of the mobile Internet has brought mankind into the era of big data, and severe information overload makes it hard for users to obtain the information they need. Personalized recommendation technology emerged in this context. Collaborative filtering is one of the most successful and most widely used recommendation techniques, and it falls into two main categories: user-based collaborative filtering and item-based collaborative filtering.
Because items are updated relatively slowly and the number of items is much smaller than the number of users, computing item similarity costs far less than computing user similarity. In a real recommender system, users cannot rate every item; the items that carry user ratings account for only about 1%-2% of all items. Computing the nearest-neighbor set from such a sparse user-item rating matrix leads directly to inaccurate recommendations.
Finding the K nearest neighbors of the target item is the core of collaborative filtering. The Pearson correlation coefficient is a common measure of the similarity between two items, but it does not reflect how reliable that similarity is: two items may share only a very small set of common raters and still have a very large Pearson correlation coefficient. In addition, in a big data setting, collaborative filtering recommendation algorithms suffer from poor scalability and low recommendation efficiency.
It is therefore necessary to parallelize the algorithm, to improve computational efficiency and speed up the generation of recommendations. Current parallel data processing platforms include Hadoop and Spark. Spark is an open-source, in-memory parallel computing framework that runs on distributed clusters. Compared with Hadoop, Spark can keep the intermediate output and results of a job in memory, which reduces disk I/O and makes it more efficient. Running a collaborative filtering recommendation algorithm based on improved clustering and improved item similarity computation in parallel on Spark therefore has important theoretical value and reference significance for providing users with personalized recommendations quickly and accurately.
Summary of the invention
The present invention provides a personalized recommendation method and system based on improved clustering and the Spark framework, to solve the problem of how to provide users with personalized recommendations quickly and accurately.
To solve the above problem, according to one aspect of the invention, a personalized recommendation method based on improved clustering and the Spark framework is provided, characterized in that the method comprises:
performing data preprocessing on the user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data;
pre-clustering the items with the Canopy algorithm to generate at least one Canopy cluster center;
initializing the cluster centers of the FCM (fuzzy c-means) algorithm from the set of the at least one Canopy cluster center, updating each item's membership degree in every cluster center with the membership formula, updating the cluster centers from the updated membership degrees, and iterating until the stop condition is met to determine the final cluster set;
computing the similarity between the target item and each cluster center in the final cluster set, forming the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computing the similarity between the target item and every item in the candidate item space with the weighted Pearson correlation coefficient, and finding the K-nearest-neighbor set of the target item;
obtaining the user's predicted preference for the target item from the K-nearest-neighbor set of the target item, and recommending the N items with the highest predicted preference using the top-N recommendation method.
Preferably, performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that do not meet the conditions, reading the qualifying rating data line by line with textFile to determine the initial rating data set, and setting the number of partitions;
converting each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)> using map, where Uid is the user ID, Iid is the item ID, and Rating is the rating;
converting each rating record of the form <Uid, (Iid, Rating)> into the element format <Iid, (Uid, Rating)> using map, to determine the valid rating data set.
Preferably, initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's membership degree in every cluster center with the membership formula, updating the cluster centers from the updated membership degrees, and iterating until the stop condition is met to determine the final cluster set comprises:
Step 1: initialize the cluster centers $v_i$ of the FCM algorithm from the set of the at least one Canopy cluster center, and determine the number of clusters c, the fuzziness index m and the allowable error threshold ε;
Step 2: update each item's membership degree $u_{ik}^{(p+1)}$ in every cluster center using the membership formula;
Step 3: from the membership degrees $u_{ik}^{(p+1)}$, update the cluster centers $v_i^{(p+1)}$ using the cluster center formula;
Step 4: compare the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$; if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stop and determine the final cluster set; otherwise return to Step 2 and repeat until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, then determine the final cluster set.
Preferably, the membership formula is:
$$u_{ik}^{(p+1)} = \left[\sum_{j=1}^{c}\left(\frac{d_{ik}^{(p)}}{d_{jk}^{(p)}}\right)^{\frac{2}{m-1}}\right]^{-1}$$
where $u_{ik}^{(p+1)}$ is the membership degree relating the i-th cluster center to the k-th item at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration.
Preferably, the cluster center formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m} x_k}{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m}}$$
where $x_k$ is the rating vector of the k-th item.
Preferably, the similarity between items is computed with the weighted Pearson correlation coefficient, in which: $r_{ui}$ and $r_{uj}$ denote the ratings given by user u to item i and item j; $\bar{r}_i$ and $\bar{r}_j$ denote the average ratings of items i and j; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the set of common raters between items.
Preferably, the user's predicted preference for the target item is obtained from the K-nearest-neighbor set of the target item, in which: $p_{ui}$ is the predicted preference of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating given by user u to item j.
According to another aspect of the invention, a personalized recommendation system based on improved clustering and the Spark framework is provided, characterized in that the system comprises:
a data preprocessing unit, for performing data preprocessing on the user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data;
a Canopy cluster center generation unit, for pre-clustering the items with the Canopy algorithm to generate at least one Canopy cluster center;
a final cluster set determination unit, for initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's membership degree in every cluster center with the membership formula, updating the cluster centers from the updated membership degrees, and iterating until the stop condition is met to determine the final cluster set;
a K-nearest-neighbor set determination unit, for computing the similarity between the target item and each cluster center in the final cluster set, forming the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computing the similarity between the target item and every item in the candidate item space with the weighted Pearson correlation coefficient, and finding the K-nearest-neighbor set of the target item;
a recommendation unit, for obtaining the user's predicted preference for the target item from the K-nearest-neighbor set of the target item, and recommending the N items with the highest predicted preference using the top-N recommendation method.
Preferably, the data preprocessing unit performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that do not meet the conditions, reading the qualifying rating data line by line with textFile to determine the initial rating data set, and setting the number of partitions;
converting each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)> using map, where Uid is the user ID, Iid is the item ID, and Rating is the rating;
converting each rating record of the form <Uid, (Iid, Rating)> into the element format <Iid, (Uid, Rating)> using map, to determine the valid rating data set.
Preferably, the final cluster set determination unit, which initializes the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updates each item's membership degree in every cluster center with the membership formula, updates the cluster centers from the updated membership degrees, and iterates until the stop condition is met to determine the final cluster set, comprises:
a cluster center generation subunit, for initializing the cluster centers $v_i$ of the FCM algorithm from the set of the at least one Canopy cluster center, and determining the number of clusters c, the fuzziness index m and the allowable error threshold ε;
a membership computation subunit, for updating each item's membership degree $u_{ik}^{(p+1)}$ in every cluster center using the membership formula;
a cluster center update subunit, for updating the cluster centers $v_i^{(p+1)}$ from the membership degrees $u_{ik}^{(p+1)}$ using the cluster center formula;
a final cluster center set determination subunit, for comparing the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$; if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, computation stops and the final cluster set is determined; otherwise the membership degrees and cluster centers are updated again until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, and the final cluster set is then determined.
Preferably, the membership formula is:
$$u_{ik}^{(p+1)} = \left[\sum_{j=1}^{c}\left(\frac{d_{ik}^{(p)}}{d_{jk}^{(p)}}\right)^{\frac{2}{m-1}}\right]^{-1}$$
where $u_{ik}^{(p+1)}$ is the membership degree relating the i-th cluster center to the k-th item at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration.
Preferably, the cluster center update formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m} x_k}{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m}}$$
where $x_k$ is the rating vector of the k-th item.
Preferably, the similarity between items is computed with the weighted Pearson correlation coefficient, in which: $r_{ui}$ and $r_{uj}$ denote the ratings given by user u to item i and item j; $\bar{r}_i$ and $\bar{r}_j$ denote the average ratings of items i and j; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the set of common raters between items.
Preferably, the user's predicted preference for the target item is obtained from the K-nearest-neighbor set of the target item, in which: $p_{ui}$ is the predicted preference of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating given by user u to item j.
The personalized recommendation method and system based on improved clustering and the Spark framework of the invention pre-cluster the items with the Canopy clustering algorithm to generate Canopy cluster centers, and then complete the final clustering of the items with the FCM algorithm starting from those Canopy cluster centers. To reduce cases in which two items share only a very small set of common raters yet have a very large Pearson correlation coefficient, the weighted Pearson correlation coefficient is used as the measure of similarity between two items. The algorithm is parallelized by exploiting Spark's strengths in in-memory and iterative computation, which addresses the high computational complexity and slow processing that traditional collaborative filtering recommendation algorithms face in a big data setting. The method of the invention effectively alleviates data sparsity and addresses the poor scalability and low recommendation efficiency of traditional collaborative filtering recommendation algorithms on big data. In addition, the improved measure of item similarity improves recommendation precision. The invention is of value for mining useful information from massive data and completing personalized recommendation quickly and accurately.
Brief description of the drawings
Exemplary embodiments of the invention can be understood more fully by reference to the following drawings:
Fig. 1 is a flow chart of the personalized recommendation method 100 based on improved clustering and the Spark framework according to an embodiment of the invention; and
Fig. 2 is a structural schematic diagram of the personalized recommendation system 200 based on improved clustering and the Spark framework according to an embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the invention are now described with reference to the drawings. The invention can, however, be implemented in many different forms and is not limited to the embodiments described here; these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the invention to those skilled in the art. The terms used in the exemplary embodiments illustrated in the drawings do not limit the invention. In the drawings, identical units or elements carry identical reference numerals.
Unless otherwise indicated, the terms used here (including scientific and technical terms) have the meaning commonly understood by those skilled in the art. It will further be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with the context of the related field, and should not be interpreted in an idealized or overly formal sense.
Fig. 1 is a flow chart of the personalized recommendation method 100 based on improved clustering and the Spark framework according to an embodiment of the invention. As shown in Fig. 1, the method pre-clusters the items with the Canopy algorithm to generate at least one Canopy cluster center, and initializes the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, which avoids choosing the initial cluster centers blindly and thereby improves clustering accuracy. It updates each item's membership degree in every cluster center with the membership formula, updates the cluster centers from the updated membership degrees, and iterates until the stop condition is met to determine the final cluster set. It computes the similarity between the target item and each cluster center in the final cluster set, forms the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computes the similarity between the target item and every item in the candidate item space with the weighted Pearson correlation coefficient, and finds the K-nearest-neighbor set of the target item, which improves recommendation precision. From the K-nearest-neighbor set of the target item it obtains the user's predicted preference for the target item and recommends the N items with the highest predicted preference using the top-N recommendation method, which improves the real-time response speed and the recommendation accuracy of the algorithm. Running the collaborative filtering recommendation algorithm based on improved clustering and improved item similarity computation in parallel on Spark has important theoretical value and reference significance for providing users with personalized recommendations quickly and accurately.
The personalized recommendation method 100 based on improved clustering and the Spark framework according to an embodiment of the invention starts at step 101. In step 101, data preprocessing is performed on the user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data.
Preferably, performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that do not meet the conditions, reading the qualifying rating data line by line with textFile to determine the initial rating data set, and setting the number of partitions;
converting each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)> using map, where Uid is the user ID, Iid is the item ID, and Rating is the rating;
converting each rating record of the form <Uid, (Iid, Rating)> into the element format <Iid, (Uid, Rating)> using map, to determine the valid rating data set.
Common rating data sets are generally stored as text, where each line is one user's rating record for one item. Before computation, the data set must be processed: rating records that do not meet the conditions are filtered out, and the rest are converted into the data format needed during computation. In embodiments of the invention, Uid, Iid and Rating denote the user, the item and the rating in a rating record, the number of users is Nu, and the number of items is n. Data preprocessing is performed on the user-item rating matrix to determine the valid rating data set, with the following specific steps:
read the rating data file line by line with textFile to form the initial RDD (Resilient Distributed Dataset), and set the number of partitions;
convert each rating record in the initial RDD into a key-value pair of the form <Uid, (Iid, Rating)> using map, and name the resulting new RDD train;
apply map to train to obtain key-value pairs with Iid as the key and the tuple (Uid, Rating) as the value, forming RDD2 with element format <Iid, (Uid, Rating)>.
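The preprocessing steps above can be sketched in plain Scala collections, which mirror the RDD `map` operations without requiring a Spark cluster. This is a minimal sketch under stated assumptions: the comma separator, the 1-5 rating range used as the validity filter, and the names `RatingPrep`, `parse`, `byUser` and `byItem` are all hypothetical, since the text does not specify the file layout or the filtering condition.

```scala
object RatingPrep {
  // One parsed rating record: user id, item id, rating value.
  final case class Rating(uid: Int, iid: Int, rating: Double)

  // Parse a line such as "1,10,4.0". Malformed lines and ratings outside
  // the assumed 1-5 range are dropped (the validity rule is an assumption).
  def parse(line: String, sep: String = ","): Option[Rating] =
    line.trim.split(sep) match {
      case Array(u, i, r) =>
        scala.util.Try(Rating(u.trim.toInt, i.trim.toInt, r.trim.toDouble))
          .toOption
          .filter(x => x.rating >= 1.0 && x.rating <= 5.0)
      case _ => None
    }

  // Mirrors the first map: <Uid, (Iid, Rating)>, the "train" data set.
  def byUser(lines: Seq[String]): Seq[(Int, (Int, Double))] =
    lines.flatMap(parse(_)).map(r => (r.uid, (r.iid, r.rating)))

  // Mirrors the second map: re-key to <Iid, (Uid, Rating)>, i.e. RDD2.
  def byItem(train: Seq[(Int, (Int, Double))]): Seq[(Int, (Int, Double))] =
    train.map { case (uid, (iid, rating)) => (iid, (uid, rating)) }
}
```

On Spark, the same two functions would typically be passed to `map` on the RDD returned by `textFile`; the collection version is used here only so the sketch stays self-contained.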
Preferably, in step 102, the items are pre-clustered with the Canopy algorithm to generate at least one Canopy cluster center.
Preferably, in step 103, the cluster centers of the FCM algorithm are initialized from the set of the at least one Canopy cluster center, each item's membership degree in every cluster center is updated with the membership formula, the cluster centers are updated from the updated membership degrees, and iteration continues until the stop condition is met to determine the final cluster set.
Preferably, initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's membership degree in every cluster center with the membership formula, updating the cluster centers from the updated membership degrees, and iterating until the stop condition is met to determine the final cluster set comprises:
Step 1031: initialize the cluster centers $v_i$ of the FCM algorithm from the set of the at least one Canopy cluster center, and determine the number of clusters c, the fuzziness index m and the allowable error threshold ε;
Step 1032: update each item's membership degree $u_{ik}^{(p+1)}$ in every cluster center using the membership formula;
Step 1033: from the membership degrees $u_{ik}^{(p+1)}$, update the cluster centers $v_i^{(p+1)}$ using the cluster center formula;
Step 1034: compare the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$; if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stop and determine the final cluster set; otherwise return to step 1032 and repeat until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, then determine the final cluster set.
Preferably, the membership formula is:
$$u_{ik}^{(p+1)} = \left[\sum_{j=1}^{c}\left(\frac{d_{ik}^{(p)}}{d_{jk}^{(p)}}\right)^{\frac{2}{m-1}}\right]^{-1}$$
where $u_{ik}^{(p+1)}$ is the membership degree relating the i-th cluster center to the k-th item at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration.
Preferably, the cluster center formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m} x_k}{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m}}$$
where $x_k$ is the rating vector of the k-th item.
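The iteration in steps 1031 to 1034 can be illustrated with a small single-machine sketch of the standard fuzzy c-means updates. It is not the parallel Spark implementation: the Euclidean distance, the `1e-12` floor that guards against division by zero, the `maxIter` cap and all identifiers (`FcmSketch`, `memberships`, `updateCenters`, `run`) are assumptions introduced for illustration.

```scala
object FcmSketch {
  type Vec = Array[Double]

  // Euclidean distance between two rating vectors (an assumed metric).
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Membership update: u(i)(k) = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
  def memberships(xs: Seq[Vec], centers: Seq[Vec], m: Double): Seq[Seq[Double]] = {
    // d(i)(k): distance from center i to item k, floored to avoid division by zero.
    val d = centers.map(v => xs.map(x => math.max(dist(v, x), 1e-12)))
    d.map { di =>
      di.indices.map { k =>
        1.0 / d.map(dj => math.pow(di(k) / dj(k), 2.0 / (m - 1.0))).sum
      }
    }
  }

  // Center update: v_i = sum_k u_ik^m * x_k / sum_k u_ik^m.
  def updateCenters(xs: Seq[Vec], u: Seq[Seq[Double]], m: Double): Seq[Vec] =
    u.map { ui =>
      val w = ui.map(math.pow(_, m))
      val total = w.sum
      xs.head.indices.map { dim =>
        xs.zip(w).map { case (xk, wk) => wk * xk(dim) }.sum / total
      }.toArray
    }

  // Iterate until the largest center shift falls below eps (or maxIter runs out).
  @annotation.tailrec
  def run(xs: Seq[Vec], centers: Seq[Vec], m: Double, eps: Double,
          maxIter: Int = 100): Seq[Vec] = {
    val next = updateCenters(xs, memberships(xs, centers, m), m)
    val shift = centers.zip(next).map { case (a, b) => dist(a, b) }.max
    if (shift < eps || maxIter <= 1) next else run(xs, next, m, eps, maxIter - 1)
  }
}
```

Here the initial `centers` argument stands in for the Canopy cluster centers, and `m` must be greater than 1 for the exponent to be defined.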
In embodiments of the invention, the items are pre-clustered with the Canopy algorithm: the data in the set of item rating vectors are divided into different Canopies and the corresponding cluster centers are generated. The Canopy cluster centers so formed are then used as the initial cluster centers of the FCM algorithm, each item's membership degree in every cluster center is updated with the membership formula, the cluster centers are updated from the membership degrees, and iteration continues until the stop condition is met to determine the final cluster set. The algorithm is parallelized by exploiting Spark's strengths in in-memory and iterative computation, which addresses the high computational complexity and slow processing that traditional collaborative filtering recommendation algorithms face in a big data setting. The specific steps are as follows:
Step 1: create the initial Canopy cluster center list C_List, set it to empty, and obtain the Canopy distance thresholds $t_1$ and $t_2$ ($t_1 > t_2$) by cross-validation;
Step 2: denote the set of item rating vectors <Iid, (Uid, Rating)> obtained from RDD2 as Item; take one item rating vector from Item, denote it Item1, add it to C_List as a Canopy cluster center point, and then delete Item1 from Item;
Step 3: choose any item rating vector from the remaining elements of Item and denote it Item2; compute its distance to all cluster center points in C_List; if the distance is less than $t_2$ or greater than $t_1$, add Item2 to C_List and delete it from Item; if the distance lies between $t_1$ and $t_2$, add Item2 to the corresponding Canopy;
Step 4: repeat step 3 until Item is empty, and output the cluster center set Q;
Step 5: initialize the cluster centers $v_i^{(0)}$ from the cluster center set Q, and determine the number of clusters c, the fuzziness index m and the allowable error ε;
Step 6: update each item's membership degree in every cluster center according to the membership formula, wherein the membership formula is:
$$u_{ik}^{(p+1)} = \left[\sum_{j=1}^{c}\left(\frac{d_{ik}^{(p)}}{d_{jk}^{(p)}}\right)^{\frac{2}{m-1}}\right]^{-1}$$
where $u_{ik}^{(p+1)}$ is the membership degree relating the i-th cluster center to the k-th item at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration;
Step 7: from the membership degrees obtained in step 6, update the cluster centers using the cluster center formula, wherein the cluster center formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m} x_k}{\sum_{k=1}^{n}\left(u_{ik}^{(p+1)}\right)^{m}}$$
where $x_k$ is the rating vector of the k-th item;
Step 8: compare the cluster centers $v_i^{(p)}$ with the new cluster centers $v_i^{(p+1)}$; if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stop and determine the final cluster set; otherwise return to step 6 and repeat until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, then determine the final cluster set. Output the final cluster center set and the l item clusters, and broadcast them to each worker node.
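For comparison with steps 1 to 4, the sketch below implements the commonly described form of Canopy pre-clustering: each chosen center collects the remaining points within the loose threshold `t1`, and points within the tight threshold `t2` are removed from further consideration. This textbook rule is swapped in as an assumption and is worded differently from step 3 above; the Euclidean distance and the names `CanopySketch` and `canopies` are likewise hypothetical.

```scala
object CanopySketch {
  type Vec = Array[Double]

  // Euclidean distance between two rating vectors (an assumed metric).
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Returns (center, members) pairs in the order the centers were chosen.
  def canopies(items: List[Vec], t1: Double, t2: Double): List[(Vec, List[Vec])] = {
    require(t1 > t2, "t1 must be greater than t2")

    @annotation.tailrec
    def loop(remaining: List[Vec], acc: List[(Vec, List[Vec])]): List[(Vec, List[Vec])] =
      remaining match {
        case Nil => acc.reverse
        case c :: rest =>
          // Loose threshold: everything still in play within t1 joins this canopy.
          val members = remaining.filter(x => dist(c, x) < t1)
          // Tight threshold: points within t2 of c cannot become centers later.
          loop(rest.filter(x => dist(c, x) >= t2), (c, members) :: acc)
      }

    loop(items, Nil)
  }
}
```

The centers returned by `canopies` (the first element of each pair) play the role of the cluster center set Q that seeds the FCM iteration in step 5.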
Preferably, in step 104, the similarity between the target item and each cluster center in the final cluster set is computed, the candidate item space is formed from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, the similarity between the target item and every item in the candidate item space is computed with the weighted Pearson correlation coefficient, and the K-nearest-neighbor set of the target item is found.
Preferably, the similarity between items is computed with the weighted Pearson correlation coefficient, in which: $r_{ui}$ and $r_{uj}$ denote the ratings given by user u to item i and item j; $\bar{r}_i$ and $\bar{r}_j$ denote the average ratings of items i and j; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the set of common raters between items.
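A common significance-weighted form that is consistent with the symbols defined above can be written as follows. The weighting factor $\min(Num, a)/a$ is an assumption made for illustration: the text fixes the symbols $Num$ and $a$ but not the exact form of the weighting, so this should be read as one plausible instance rather than the definitive formula.

```latex
\[
sim(i,j) \;=\; \frac{\min(Num,\,a)}{a}\cdot
\frac{\sum_{u \in U_{ij}} \left(r_{ui}-\bar{r}_i\right)\left(r_{uj}-\bar{r}_j\right)}
     {\sqrt{\sum_{u \in U_{ij}} \left(r_{ui}-\bar{r}_i\right)^{2}}\;
      \sqrt{\sum_{u \in U_{ij}} \left(r_{uj}-\bar{r}_j\right)^{2}}}
\]
```

Under this form, a pair of items with at least $a$ common raters keeps its plain Pearson coefficient, while a pair with fewer common raters is scaled down in proportion to $Num/a$, which is the behavior the weighting is meant to achieve.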
To reduce cases in which two items share only a very small set of common raters yet have a very large Pearson correlation coefficient, the weighted Pearson correlation coefficient is used as the measure of similarity between two items. In embodiments of the invention, the specific parallel computation steps are as follows:
Step1: use groupByKey to merge the item feature vectors <Iid, (Uid, Rating)> in the candidate item space into key-value pairs with Iid as the key and List((Uid, Rating), ...) as the value.
Step2: use flatMap to iterate over each item's rating list and compute the average of all users' ratings of that item. Because the items rated by the same user must then be paired, convert the element format into key-value pairs with the user as the key and the tuple (Iid, Rating-avg) as the value, where Rating-avg is the difference between the user's rating of the item and the item's mean rating.
Step3: use groupByKey to merge the items rated by the same user into RDD3, with element format <Uid, List((Iid, Rating-avg), ...)>.
Step4: use flatMap to pair the items rated by the same user two by two in ascending order, forming RDD4 with element format <(Iid1, Iid2), (Rating1-avg1, Rating2-avg2)>, where Iid1 and Iid2 are two items rated by some user u with Iid1 less than Iid2, avg1 is the average rating of item Iid1 over all users, and Rating1 is user u's rating of item Iid1.
Step5: use aggregateByKey with custom seqOp and combOp functions to merge the elements that share a key into key-value pairs with the tuple (Iid1, Iid2) as the key and (v1, v2, v3) as the value, where v1, v2 and v3 correspond to $\sum_{u \in U_{ij}} (r_{ui}-\bar{r}_i)(r_{uj}-\bar{r}_j)$, $\sum_{u \in U_{ij}} (r_{ui}-\bar{r}_i)^2$ and $\sum_{u \in U_{ij}} (r_{uj}-\bar{r}_j)^2$ in the weighted Pearson correlation coefficient formula.
Step6: use mapValues to compute the similarity between the paired items, forming an RDD with element format <(Iid1, Iid2), sim>, where sim is the similarity between the two paired items. This step can be computed offline in advance and the result saved to HDFS, so that online prediction reads it directly, which improves the running efficiency of the algorithm.
Step7: use map to convert the resulting RDD into RDD5 with element format <Iid1, (Iid2, sim)>. To reduce computation, the items were sorted before pairing, so RDD5 records only <Iid1, (Iid2, sim)> and not <Iid2, (Iid1, sim)>. Computing the nearest-neighbor set of the target item, however, requires the list of similarities between the target item and all other items, so the item similarities must be converted into their final form: swap the positions of the item IDs to form RDD6 with element format <Iid2, (Iid1, sim)>, then apply union to RDD5 and RDD6 to generate RDD7, in which both <Iid1, (Iid2, sim)> and <Iid2, (Iid1, sim)> exist for every pair of items.
Step8: use groupByKey to form an RDD with element format <Iid1, List((Iid2, sim), (Iid3, sim), ...)>, then select the record whose key is the target item j, denoted neighbor.
Step9: the following code selects the K-nearest-neighbor set Kneighbor of the target item j:
val Kneighbor = neighbor.map { x =>
  // sort the (Iid, sim) pairs by similarity in descending order and keep the first K
  val a = x._2.toSeq.sortWith((o, p) => o._2 > p._2).take(K)
  (x._1, a)
}
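The nine steps can be traced on a small in-memory data set. The following is a minimal sketch in plain Python, where dictionaries and lists stand in for the Spark RDD operators (groupByKey, flatMap, aggregateByKey, union). The function names `item_similarities` and `k_nearest`, the extra co-rating counter, and the weighting factor min(Num, a)/a are assumptions made for illustration and are not taken from the patent text.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarities(ratings, a=2):
    """ratings: iterable of (uid, iid, rating). Returns {(iid1, iid2): sim}, iid1 < iid2."""
    # Step1: group ratings by item -> {iid: [(uid, rating), ...]}
    by_item = defaultdict(list)
    for uid, iid, r in ratings:
        by_item[iid].append((uid, r))
    # Step2-3: subtract each item's mean rating and regroup by user
    by_user = defaultdict(list)
    for iid, pairs in by_item.items():
        avg = sum(r for _, r in pairs) / len(pairs)
        for uid, r in pairs:
            by_user[uid].append((iid, r - avg))
    # Step4-5: pair each user's items in ascending ID order and accumulate
    # [sum of products, sum of squares (item 1), sum of squares (item 2), count]
    acc = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for items in by_user.values():
        for (i1, d1), (i2, d2) in combinations(sorted(items), 2):
            s = acc[(i1, i2)]
            s[0] += d1 * d2
            s[1] += d1 * d1
            s[2] += d2 * d2
            s[3] += 1
    # Step6: Pearson correlation scaled by min(Num, a) / a (assumed weighting)
    sims = {}
    for key, (v1, v2, v3, num) in acc.items():
        denom = sqrt(v2) * sqrt(v3)
        sims[key] = (min(num, a) / a) * v1 / denom if denom > 0 else 0.0
    return sims

def k_nearest(sims, target, k):
    # Step7-9: make the similarity list symmetric, then keep the top K for the target
    neighbors = defaultdict(list)
    for (i1, i2), s in sims.items():
        neighbors[i1].append((i2, s))
        neighbors[i2].append((i1, s))
    return sorted(neighbors[target], key=lambda p: p[1], reverse=True)[:k]
```

Because each unordered pair is computed only once and mirrored afterwards, the pairing work is roughly halved, which matches the motivation given for Step7.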
Preferably, in step 105, the user's preference prediction value for the target item is obtained according to the K-nearest-neighbor set of the target item, and the N items with the highest preference prediction values are selected for recommendation using the top-N recommendation method.
Preferably, obtaining the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item comprises:
where $p_{ui}$ is the preference prediction value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating of user u for item j.
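The prediction expression itself is not legible in this text. As a hedged illustration only, a common item-based prediction rule that uses the symbols listed here would be:

```latex
p_{ui} = \bar{r}_i + \frac{\sum_{j \in Kn} \mathrm{sim}(i,j)\,\bigl(r_{uj} - \bar{r}_j\bigr)}{\sum_{j \in Kn} \lvert \mathrm{sim}(i,j) \rvert}
```

The term $\bar{r}_j$ (the average rating of neighbor item j) is not among the listed symbols and is part of the assumption; the patent's own expression may differ.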
In an embodiment of the present invention, the implementation steps are:
Step1: flatMap pairs user u one by one with the items in the neighbor set of item j, forming RDD1 with element format <(u, Iid1), sim>, where Iid1 is an item in the neighbor set and sim is the similarity between Iid1 and the item j to be rated.
Step2: filter is applied to the training set train, forming RDD2 with element format <(u, Iid1), Rating>.
Step3: join merges the elements of RDD1 and RDD2 that share the same key, forming RDD3 with element format <(u, Iid1), (sim, Rating)>.
Step4: map and reduce are used to obtain the final predicted rating.
Step5: the same approach is used to predict ratings for all items that the user has not rated, and the N items with the highest predicted ratings are selected to generate the recommendation list.
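The five steps can be sketched in plain Python, with a dictionary lookup standing in for the Spark join. The function names `predict` and `top_n`, the mean-centered weighted average, and the fallback to the item mean when no neighbor has been rated are assumptions made for illustration, not the patent's stated formula.

```python
def predict(user_ratings, neighbors, item_means, target):
    """user_ratings: {iid: rating} of one user; neighbors: [(iid, sim)] of the target item."""
    num = den = 0.0
    for iid, sim in neighbors:            # Step1: pair the user with each neighbor item
        if iid in user_ratings:           # Step2-3: keep neighbors the user has rated (join)
            num += sim * (user_ratings[iid] - item_means[iid])
            den += abs(sim)
    base = item_means[target]
    # Step4: combine the joined (sim, Rating) pairs into one predicted rating
    return base + num / den if den > 0 else base

def top_n(user_ratings, all_neighbors, item_means, n):
    # Step5: score every item the user has not rated and keep the N best
    scores = {i: predict(user_ratings, nb, item_means, i)
              for i, nb in all_neighbors.items() if i not in user_ratings}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

In a Spark job the membership test would be the join of RDD1 and RDD2 on the key (u, Iid1), and the two running sums would be produced by map followed by reduce.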
Fig. 2 is a schematic structural diagram of a personalized recommendation system 200 based on improved clustering and the Spark framework according to an embodiment of the present invention. As shown in Fig. 2, the system 200 comprises: a data preprocessing unit 201, a Canopy cluster center generation unit 202, a final cluster set determination unit 203, a K-nearest-neighbor set determination unit 204 and a recommendation unit 205. Preferably, the data preprocessing unit 201 performs data preprocessing on the user-item rating matrix and determines a valid rating data set, wherein the rating data set comprises user data, item data and rating data.
Preferably, the data preprocessing unit 201 performing data preprocessing on the user-item rating matrix and determining the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that do not meet the conditions, reading the qualifying rating data line by line from a text file to determine an initial rating data set, and setting the number of partitions;
using map to convert each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)>, where Uid is the user ID, Iid is the item ID and Rating is the rating;
using map to convert each rating record held in the <Uid, (Iid, Rating)> key-value form into the element format <Iid, (Uid, Rating)>, thereby determining the valid rating data set.
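A minimal sketch of this preprocessing in plain Python follows. The comma-separated line layout, the 1 to 5 rating range used as the eligibility condition, and the function name `preprocess` are assumptions made for illustration; the text does not state which records count as ineligible.

```python
def preprocess(lines, min_rating=1.0, max_rating=5.0):
    """lines: text lines 'uid,iid,rating'. Returns a list of (iid, (uid, rating))."""
    by_user = []
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue                      # malformed record (assumed ineligible)
        uid, iid, raw = parts
        try:
            rating = float(raw)
        except ValueError:
            continue                      # non-numeric rating (assumed ineligible)
        if not (min_rating <= rating <= max_rating):
            continue                      # out-of-range rating (assumed ineligible)
        by_user.append((uid, (iid, rating)))       # first map: <Uid, (Iid, Rating)>
    # second map: re-key every record by item -> <Iid, (Uid, Rating)>
    return [(iid, (uid, rating)) for uid, (iid, rating) in by_user]
```

Re-keying by item prepares the data for the later per-item grouping, since the similarity computation starts from each item's list of ratings.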
Preferably, the Canopy cluster center generation unit 202 performs clustering preprocessing on the items using the Canopy algorithm and generates at least one Canopy cluster center.
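For orientation, the textbook Canopy procedure can be sketched as below. The two distance thresholds `t1` (loose) and `t2` (tight), the Euclidean distance, and the function name `canopy` are assumptions; this passage does not specify them.

```python
from math import dist

def canopy(points, t1, t2):
    """points: list of equal-length numeric tuples; expects t1 > t2 > 0.
    Returns a list of (center, members) pairs."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)         # next unassigned point becomes a center
        # every point within the loose threshold t1 joins this canopy
        members = [p for p in points if dist(center, p) < t1]
        canopies.append((center, members))
        # points within the tight threshold t2 can no longer become centers
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies
```

The number of canopies found this way can serve as the cluster count c, and the canopy centers as the initial FCM cluster centers.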
Preferably, the final cluster set determination unit 203 initializes the cluster centers of the FCM algorithm according to the set of the at least one Canopy cluster center, updates the membership degree of each item to the cluster centers using the membership degree calculation formula, updates the cluster centers according to the updated membership degrees, and iterates until the stop condition is met, thereby determining the final cluster set.
Preferably, the final cluster set determination unit 203 comprises:
a cluster center generation subunit, configured to initialize the cluster centers $v_i$ of the FCM algorithm according to the set of the at least one Canopy cluster center, and to determine the number of clusters c, the fuzziness index m and the allowable error threshold ε;
a membership degree calculation subunit, configured to update the membership degree $u_{ik}^{(p+1)}$ of each item to the cluster centers using the membership degree calculation formula;
a cluster center update subunit, configured to update the cluster centers $v_i^{(p+1)}$ from the membership degrees $u_{ik}^{(p+1)}$ using the cluster center calculation formula;
a final cluster set determination subunit, configured to compare the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$: if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, the calculation stops and the final cluster set is determined; otherwise, the process returns to step 2 until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, and the final cluster set is determined.
Preferably, the membership degree calculation formula is:
$$u_{ik}^{(p+1)} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik}^{(p)} / d_{jk}^{(p)} \right)^{2/(m-1)}}$$
where $u_{ik}^{(p+1)}$ is the membership degree of the k-th item with respect to the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after p iterations.
Preferably, the cluster center update formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m} x_k}{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m}}$$
where $x_k$ is the rating vector of the k-th item.
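The iteration carried out by the subunits can be sketched in plain Python using the standard fuzzy C-means updates. The Euclidean distance, the `max_iter` safeguard, the handling of an item that coincides with a center, and the function name `fcm` are assumptions made for illustration.

```python
from math import dist

def fcm(items, centers, m=2.0, eps=1e-4, max_iter=100):
    """items: list of rating vectors; centers: initial centers (e.g. Canopy centers).
    Returns (final centers, membership matrix u[i][k])."""
    centers = [tuple(map(float, v)) for v in centers]
    for _ in range(max_iter):
        # membership update for every center i and item k
        u = []
        for vi in centers:
            row = []
            for x in items:
                d_ik = dist(vi, x)
                if d_ik == 0:
                    row.append(1.0)       # item coincides with this center
                else:
                    row.append(1.0 / sum(
                        (d_ik / max(dist(vj, x), 1e-12)) ** (2 / (m - 1))
                        for vj in centers))
            u.append(row)
        # center update: mean of the items weighted by membership ** m
        new_centers = []
        for row in u:
            w = [uik ** m for uik in row]
            total = sum(w)
            new_centers.append(tuple(
                sum(wk * x[d] for wk, x in zip(w, items)) / total
                for d in range(len(items[0]))))
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:                   # stop once every center moved less than eps
            break
    return centers, u
```

Starting from Canopy centers instead of random ones is what lets the cluster count c be fixed before the loop begins.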
Preferably, the K-nearest-neighbor set determination unit 204 calculates the similarity between the target item and each cluster center in the final cluster set, selects the items in the clusters whose similarity is greater than or equal to a preset similarity threshold to form the candidate item space, calculates the similarity between the target item and each item in the candidate item space using the weighted Pearson correlation coefficient, and finds the K-nearest-neighbor set of the target item.
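Restricting the neighbor search to a candidate item space can be sketched as follows. The use of cosine similarity between the target item's rating vector and a cluster center, the assumption that each item has already been assigned to one cluster, and the names `cosine` and `candidate_space` are illustrative choices that the text does not specify.

```python
from math import sqrt

def cosine(a, b):
    # cosine similarity of two equal-length vectors; 0.0 if either is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def candidate_space(target_vec, centers, clusters, threshold):
    """clusters[i] holds the item IDs assigned to centers[i]."""
    candidates = []
    for center, members in zip(centers, clusters):
        if cosine(target_vec, center) >= threshold:
            candidates.extend(members)    # keep every item of a sufficiently similar cluster
    return candidates
```

Only the items returned here need a pairwise similarity against the target item, which is where the reduction in computation comes from.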
Preferably, calculating the similarity between items using the weighted Pearson correlation coefficient comprises:
where $r_{ui}$ and $r_{uj}$ are the ratings of user u for item i and item j, respectively; $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of items i and j, respectively; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the intersection of the rating users of the two items.
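The expression itself is not legible in this text. A commonly used significance-weighted form that is consistent with the listed symbols is sketched below; the weighting factor min(Num, a)/a in particular is an assumption, and the patent's own expression may differ.

```latex
\mathrm{sim}(i,j) = \frac{\min(\mathrm{Num},\, a)}{a} \cdot
\frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}
     {\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2}\;\sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}
```

Under this form, a pair of items co-rated by fewer than a users has its correlation scaled down in proportion to Num, so a high correlation based on very few common users carries less weight.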
Preferably, the recommendation unit 205 obtains the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item, and selects the N items with the highest preference prediction values for recommendation using the top-N recommendation method.
Preferably, obtaining the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item comprises:
where $p_{ui}$ is the preference prediction value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating of user u for item j.
The personalized recommendation system 200 based on improved clustering and the Spark framework of this embodiment corresponds to the personalized recommendation method 100 based on improved clustering and the Spark framework of another embodiment of the present invention, and details are not repeated here.
The present invention has been described with reference to a small number of embodiments. However, as is known to those skilled in the art, and as defined by the appended patent claims, embodiments other than those disclosed above equally fall within the scope of the present invention.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/the [device, component, etc.]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein need not be performed in the exact order disclosed, unless explicitly stated.
Claims (14)
1. A personalized recommendation method based on improved clustering and the Spark framework, characterized in that the method comprises:
performing data preprocessing on a user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data;
performing clustering preprocessing on the items using the Canopy algorithm to generate at least one Canopy cluster center;
initializing the cluster centers of the FCM algorithm according to the set of the at least one Canopy cluster center, updating the membership degree of each item to the cluster centers using a membership degree calculation formula, updating the cluster centers according to the updated membership degrees, and iterating until a stop condition is met to determine a final cluster set;
calculating the similarity between a target item and each cluster center in the final cluster set, selecting the items in the clusters whose similarity is greater than or equal to a preset similarity threshold to form a candidate item space, calculating the similarity between the target item and each item in the candidate item space using the weighted Pearson correlation coefficient, and finding the K-nearest-neighbor set of the target item;
obtaining the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item, and selecting the N items with the highest preference prediction values for recommendation using the top-N recommendation method.
2. The method according to claim 1, characterized in that performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that do not meet the conditions, reading the qualifying rating data line by line from a text file to determine an initial rating data set, and setting the number of partitions;
using map to convert each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)>, where Uid is the user ID, Iid is the item ID and Rating is the rating;
using map to convert each rating record held in the <Uid, (Iid, Rating)> key-value form into the element format <Iid, (Uid, Rating)>, thereby determining the valid rating data set.
3. The method according to claim 1, characterized in that initializing the cluster centers of the FCM algorithm according to the set of the at least one Canopy cluster center, updating the membership degree of each item to the cluster centers using the membership degree calculation formula, updating the cluster centers according to the updated membership degrees, and iterating until the stop condition is met to determine the final cluster set comprises:
Step 1: initializing the cluster centers $v_i$ of the FCM algorithm according to the set of the at least one Canopy cluster center, and determining the number of clusters c, the fuzziness index m and the allowable error threshold ε;
Step 2: updating the membership degree $u_{ik}^{(p+1)}$ of each item to the cluster centers using the membership degree calculation formula;
Step 3: updating the cluster centers $v_i^{(p+1)}$ from the membership degrees $u_{ik}^{(p+1)}$ using the cluster center calculation formula;
Step 4: comparing the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$; if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stopping the calculation and determining the final cluster set; otherwise, returning to Step 2 until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, and determining the final cluster set.
4. The method according to claim 3, characterized in that the membership degree calculation formula is:
$$u_{ik}^{(p+1)} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik}^{(p)} / d_{jk}^{(p)} \right)^{2/(m-1)}}$$
where $u_{ik}^{(p+1)}$ is the membership degree of the k-th item with respect to the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after p iterations.
5. The method according to claim 4, characterized in that the cluster center update formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m} x_k}{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m}}$$
where $x_k$ is the rating vector of the k-th item.
6. The method according to claim 1, characterized in that calculating the similarity between items using the weighted Pearson correlation coefficient comprises:
where $r_{ui}$ and $r_{uj}$ are the ratings of user u for item i and item j, respectively; $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of items i and j, respectively; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the intersection of the rating users of the two items.
7. The method according to claim 1, characterized in that obtaining the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item comprises:
where $p_{ui}$ is the preference prediction value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating of user u for item j.
8. A personalized recommendation system based on improved clustering and the Spark framework, characterized in that the system comprises:
a data preprocessing unit, configured to perform data preprocessing on a user-item rating matrix and determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data;
a Canopy cluster center generation unit, configured to perform clustering preprocessing on the items using the Canopy algorithm and generate at least one Canopy cluster center;
a final cluster set determination unit, configured to initialize the cluster centers of the FCM algorithm according to the set of the at least one Canopy cluster center, update the membership degree of each item to the cluster centers using a membership degree calculation formula, update the cluster centers according to the updated membership degrees, and iterate until a stop condition is met to determine a final cluster set;
a K-nearest-neighbor set determination unit, configured to calculate the similarity between a target item and each cluster center in the final cluster set, select the items in the clusters whose similarity is greater than or equal to a preset similarity threshold to form a candidate item space, calculate the similarity between the target item and each item in the candidate item space using the weighted Pearson correlation coefficient, and find the K-nearest-neighbor set of the target item;
a recommendation unit, configured to obtain the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item, and select the N items with the highest preference prediction values for recommendation using the top-N recommendation method.
9. The system according to claim 8, characterized in that the data preprocessing unit performing data preprocessing on the user-item rating matrix and determining the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that do not meet the conditions, reading the qualifying rating data line by line from a text file to determine an initial rating data set, and setting the number of partitions;
using map to convert each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)>, where Uid is the user ID, Iid is the item ID and Rating is the rating;
using map to convert each rating record held in the <Uid, (Iid, Rating)> key-value form into the element format <Iid, (Uid, Rating)>, thereby determining the valid rating data set.
10. The system according to claim 8, characterized in that the final cluster set determination unit, which initializes the cluster centers of the FCM algorithm according to the set of the at least one Canopy cluster center, updates the membership degree of each item to the cluster centers using the membership degree calculation formula, updates the cluster centers according to the updated membership degrees, and iterates until the stop condition is met to determine the final cluster set, comprises:
a cluster center generation subunit, configured to initialize the cluster centers $v_i$ of the FCM algorithm according to the set of the at least one Canopy cluster center, and to determine the number of clusters c, the fuzziness index m and the allowable error threshold ε;
a membership degree calculation subunit, configured to update the membership degree $u_{ik}^{(p+1)}$ of each item to the cluster centers using the membership degree calculation formula;
a cluster center update subunit, configured to update the cluster centers $v_i^{(p+1)}$ from the membership degrees $u_{ik}^{(p+1)}$ using the cluster center calculation formula;
a final cluster set determination subunit, configured to compare the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$: if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, the calculation stops and the final cluster set is determined; otherwise, the process returns to step 2 until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, and the final cluster set is determined.
11. The system according to claim 10, characterized in that the membership degree calculation formula is:
$$u_{ik}^{(p+1)} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik}^{(p)} / d_{jk}^{(p)} \right)^{2/(m-1)}}$$
where $u_{ik}^{(p+1)}$ is the membership degree of the k-th item with respect to the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after p iterations.
12. The system according to claim 11, characterized in that the cluster center calculation formula is:
$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m} x_k}{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m}}$$
where $x_k$ is the rating vector of the k-th item.
13. The system according to claim 8, characterized in that calculating the similarity between items using the weighted Pearson correlation coefficient comprises:
where $r_{ui}$ and $r_{uj}$ are the ratings of user u for item i and item j, respectively; $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of items i and j, respectively; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the intersection of the rating users of the two items.
14. The system according to claim 8, characterized in that obtaining the user's preference prediction value for the target item according to the K-nearest-neighbor set of the target item comprises:
where $p_{ui}$ is the preference prediction value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating of user u for item j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711132268.0A CN110020141A (en) | 2017-11-15 | 2017-11-15 | A kind of personalized recommendation method and system based on improvement cluster and Spark frame |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110020141A true CN110020141A (en) | 2019-07-16 |
Family
ID=67186788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711132268.0A Pending CN110020141A (en) | 2017-11-15 | 2017-11-15 | A kind of personalized recommendation method and system based on improvement cluster and Spark frame |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020141A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209953A (en) * | 2020-01-03 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Method and device for recalling neighbor vector, computer equipment and storage medium |
CN111367901A (en) * | 2020-02-27 | 2020-07-03 | 智慧航海(青岛)科技有限公司 | Ship data denoising method |
CN112487276A (en) * | 2019-09-11 | 2021-03-12 | 腾讯科技(深圳)有限公司 | Object acquisition method, device, equipment and storage medium |
CN113139021A (en) * | 2021-04-23 | 2021-07-20 | 上海中通吉网络技术有限公司 | Express delivery network calling center data identification method |
CN115063877A (en) * | 2022-06-06 | 2022-09-16 | 南通大学 | Parallel superpixel Spark clustering method for big data fundus image |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412948A (en) * | 2013-08-27 | 2013-11-27 | 北京交通大学 | Cluster-based collaborative filtering commodity recommendation method and system |
CN104239496A (en) * | 2014-09-10 | 2014-12-24 | 西安电子科技大学 | Collaborative filtering method based on integration of fuzzy weight similarity measurement and clustering |
CN107153846A (en) * | 2017-05-26 | 2017-09-12 | 南京邮电大学 | A kind of road traffic state modeling method based on Fuzzy C-Means Cluster Algorithm |
Non-Patent Citations (2)
Title |
---|
廖彬等: "基于Spark的ItemBased推荐算法性能优化", 《计算机应用》 * |
王晓军等: "基于模糊聚类的可扩展的协同过滤方法", 《南京邮电大学学报(自然科学版)》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487276A (en) * | 2019-09-11 | 2021-03-12 | 腾讯科技(深圳)有限公司 | Object acquisition method, device, equipment and storage medium |
CN112487276B (en) * | 2019-09-11 | 2023-10-17 | 腾讯科技(深圳)有限公司 | Object acquisition method, device, equipment and storage medium |
CN111209953A (en) * | 2020-01-03 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Method and device for recalling neighbor vector, computer equipment and storage medium |
CN111209953B (en) * | 2020-01-03 | 2024-01-16 | 腾讯科技(深圳)有限公司 | Recall method, recall device, computer equipment and storage medium for neighbor vector |
CN111367901A (en) * | 2020-02-27 | 2020-07-03 | 智慧航海(青岛)科技有限公司 | Ship data denoising method |
CN113139021A (en) * | 2021-04-23 | 2021-07-20 | 上海中通吉网络技术有限公司 | Express delivery network calling center data identification method |
CN115063877A (en) * | 2022-06-06 | 2022-09-16 | 南通大学 | Parallel superpixel Spark clustering method for big data fundus image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190716 |