CN110020141A - Personalized recommendation method and system based on improved clustering and the Spark framework

Publication number
CN110020141A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711132268.0A
Other languages
Chinese (zh)
Inventor
刘芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aisino Corp
Original Assignee
Aisino Corp
Application filed by Aisino Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

The invention discloses a personalized recommendation method based on improved clustering and the Spark framework, comprising: determining a valid rating data set; pre-clustering the items with the Canopy algorithm to generate at least one Canopy cluster center; initializing the cluster centers of the FCM (fuzzy C-means) algorithm, updating each item's degree of membership in each cluster center with the membership formula, updating the cluster centers from the updated memberships, and iterating until the stop condition is met to determine the final cluster set; computing the similarity between the target item and each cluster center in the final cluster set, forming the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computing the similarity between the target item and each item in the candidate item space, and finding the K-nearest-neighbor set of the target item; and obtaining the user's predicted preference for the target item and recommending the N items with the highest predicted preference using the top-N recommendation method.

Description

Personalized recommendation method and system based on improved clustering and the Spark framework
Technical field
The present invention relates to personalized recommendation for big data, and more particularly to a personalized recommendation method and system based on improved clustering and the Spark framework.
Background
The rapid development of the mobile Internet has brought humanity into the big data era. Severe information overload makes it hard for users to find the information they need, and personalized recommendation technology emerged in response. Collaborative filtering is one of the most successful and most widely used recommendation techniques; it falls into two main classes, user-based collaborative filtering and item-based collaborative filtering.
Because items change relatively slowly and there are far fewer items than users, computing item similarity costs far less than computing user similarity. In a real recommender system, users cannot rate every item; the items that carry user ratings account for only about 1%-2% of all items. Computing the nearest-neighbor set from such a sparse user-item rating matrix leads directly to inaccurate recommendations.
Finding the K nearest neighbors of the target item is the core of collaborative filtering. The Pearson correlation coefficient is the usual measure of similarity between two items, but it does not indicate how reliable that similarity is: two items may share only a very small set of co-rating users and still have a very large Pearson correlation coefficient. In addition, in big data settings, collaborative filtering recommendation algorithms suffer from poor scalability and low recommendation efficiency.
It is therefore necessary to parallelize the algorithm to improve computational efficiency and speed up the generation of recommendations. Current parallel data processing platforms include Hadoop and Spark. Spark is an open-source, in-memory parallel computing framework that runs on distributed clusters. Compared with Hadoop, Spark can keep the intermediate output and results of a job in memory, which reduces disk I/O and makes it more efficient. Running a collaborative filtering recommendation algorithm based on improved clustering and improved item similarity computation in parallel on Spark therefore has significant theoretical and reference value for providing users with fast and accurate personalized recommendations.
Summary of the invention
The present invention provides a personalized recommendation method and system based on improved clustering and the Spark framework, to solve the problem of how to provide users with personalized recommendations quickly and accurately.
To solve the above problem, according to one aspect of the invention, a personalized recommendation method based on improved clustering and the Spark framework is provided, characterized in that the method comprises:
Preprocessing the user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data;
Pre-clustering the items with the Canopy algorithm to generate at least one Canopy cluster center;
Initializing the cluster centers of the FCM algorithm from the set of at least one Canopy cluster center, updating each item's degree of membership in each cluster center with the membership formula, updating the cluster centers from the updated memberships, and iterating until the stop condition is met to determine the final cluster set;
Computing the similarity between the target item and each cluster center in the final cluster set, forming the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computing the similarity between the target item and each item in the candidate item space with the weighted Pearson correlation coefficient, and finding the K-nearest-neighbor set of the target item;
Obtaining the user's predicted preference for the target item from the K-nearest-neighbor set of the target item, and recommending the N items with the highest predicted preference using the top-N recommendation method.
Preferably, preprocessing the user-item rating matrix to determine the valid rating data set comprises:
Filtering out the rating data in the user-item rating matrix that does not meet the conditions, reading the qualifying rating data line by line with textFile to determine the initial rating data set, and setting the number of partitions;
Converting each rating record in the initial rating data set into a <Uid, (Iid, Rating)> key-value pair using map, where Uid is the user ID, Iid is the item ID and Rating is the rating;
Converting each rating record held as a <Uid, (Iid, Rating)> key-value pair into the element format <Iid, (Uid, Rating)> using map, which determines the valid rating data set.
Preferably, initializing the cluster centers of the FCM algorithm from the set of at least one Canopy cluster center, updating each item's degree of membership in each cluster center with the membership formula, updating the cluster centers from the updated memberships, and iterating until the stop condition is met to determine the final cluster set comprises:
Step 1: initialize the FCM cluster centers $v_i$ from the set of at least one Canopy cluster center, and determine the number of clusters c, the fuzzy index m and the allowable error threshold ε;
Step 2: update each item's degree of membership $u_{ik}^{(p+1)}$ in each cluster center using the membership formula;
Step 3: from the memberships $u_{ik}^{(p+1)}$, update the cluster centers $v_i^{(p+1)}$ using the cluster center formula;
Step 4: compare the cluster center $v_i^{(p)}$ with the updated cluster center $v_i^{(p+1)}$. If $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stop and determine the final cluster set; otherwise return to step 2 and repeat until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, then determine the final cluster set.
Preferably, the membership formula is:
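For reference, the standard fuzzy C-means membership update can be written as follows, with $u_{ik}$ for the membership of item k in cluster i and $d_{ik}$ for the distance between them. This is a reconstruction that assumes the usual FCM form with fuzzy index m > 1; the symbol names are assumptions.

$$u_{ik}^{(p+1)} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d_{ik}^{(p)}}{d_{jk}^{(p)}} \right)^{\frac{2}{m-1}}}$$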
where $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item in the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration.
Preferably, the cluster center formula is:
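In the standard fuzzy C-means formulation, each cluster center is the membership-weighted mean of the item rating vectors. A reconstruction under that assumption, with $u_{ik}$ for the membership of item k in cluster i and m for the fuzzy index, is:

$$v_i^{(p+1)} = \frac{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m} x_k}{\sum_{k=1}^{n} \left( u_{ik}^{(p+1)} \right)^{m}}$$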
where $x_k$ is the rating vector of the k-th item.
Preferably, computing the similarity between items with the weighted Pearson correlation coefficient comprises:
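One common way to weight the Pearson coefficient by the size of the co-rating set is to scale it by min(Num, a)/a, so that pairs with fewer than a common raters are discounted. The form below is a sketch under that assumption; the exact weighting factor used by the filing is not confirmed here.

$$\mathrm{sim}(i,j) = \frac{\min(Num,\, a)}{a} \cdot \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2}\ \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}$$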
where $r_{ui}$ and $r_{uj}$ are user u's ratings of item i and item j; $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of items i and j; $U_{ij}$ is the set of users who rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the co-rating user set between items.
Preferably, obtaining the user's predicted preference for the target item from the K-nearest-neighbor set of the target item comprises:
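The usual mean-centered item-based prediction, written over the K-nearest-neighbor set Kn, would read as follows. This is a reconstruction assuming the standard form; in particular, the use of the neighbor's average rating $\bar{r}_j$ and of absolute similarities in the denominator are assumptions.

$$p_{ui} = \bar{r}_i + \frac{\sum_{j \in Kn} \mathrm{sim}(i,j)\,(r_{uj} - \bar{r}_j)}{\sum_{j \in Kn} \left| \mathrm{sim}(i,j) \right|}$$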
where $p_{ui}$ is user u's predicted preference for the target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of the target item i; sim(i, j) is the similarity between the target item i and item j; and $r_{uj}$ is user u's rating of item j.
According to another aspect of the present invention, a personalized recommendation system based on improved clustering and the Spark framework is provided, characterized in that the system comprises:
A data preprocessing unit, for preprocessing the user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data;
A Canopy cluster center generation unit, for pre-clustering the items with the Canopy algorithm to generate at least one Canopy cluster center;
A final cluster set determination unit, for initializing the cluster centers of the FCM algorithm from the set of at least one Canopy cluster center, updating each item's degree of membership in each cluster center with the membership formula, updating the cluster centers from the updated memberships, and iterating until the stop condition is met to determine the final cluster set;
A K-nearest-neighbor set determination unit, for computing the similarity between the target item and each cluster center in the final cluster set, forming the candidate item space from the items in the clusters whose similarity is greater than or equal to a preset similarity threshold, computing the similarity between the target item and each item in the candidate item space with the weighted Pearson correlation coefficient, and finding the K-nearest-neighbor set of the target item;
A recommendation unit, for obtaining the user's predicted preference for the target item from the K-nearest-neighbor set of the target item, and recommending the N items with the highest predicted preference using the top-N recommendation method.
Preferably, the data preprocessing unit preprocesses the user-item rating matrix to determine the valid rating data set by:
Filtering out the rating data in the user-item rating matrix that does not meet the conditions, reading the qualifying rating data line by line with textFile to determine the initial rating data set, and setting the number of partitions;
Converting each rating record in the initial rating data set into a <Uid, (Iid, Rating)> key-value pair using map, where Uid is the user ID, Iid is the item ID and Rating is the rating;
Converting each rating record held as a <Uid, (Iid, Rating)> key-value pair into the element format <Iid, (Uid, Rating)> using map, which determines the valid rating data set.
Preferably, the final cluster set determination unit, which initializes the cluster centers of the FCM algorithm from the set of at least one Canopy cluster center, updates each item's degree of membership in each cluster center with the membership formula, updates the cluster centers from the updated memberships, and iterates until the stop condition is met to determine the final cluster set, comprises:
A cluster center generation subunit, for initializing the FCM cluster centers $v_i$ from the set of at least one Canopy cluster center and determining the number of clusters c, the fuzzy index m and the allowable error threshold ε;
A membership computation subunit, for updating each item's degree of membership $u_{ik}^{(p+1)}$ in each cluster center using the membership formula;
An updated cluster center determination subunit, for updating the cluster centers $v_i^{(p+1)}$ from the memberships $u_{ik}^{(p+1)}$ using the cluster center formula;
A final cluster set determination subunit, for comparing the cluster center $v_i^{(p)}$ with the updated cluster center $v_i^{(p+1)}$. If $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, the computation stops and the final cluster set is determined; otherwise the membership update is repeated until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, and the final cluster set is then determined.
Preferably, the membership formula is:
where $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item in the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration.
Preferably, the cluster center update formula is:
where $x_k$ is the rating vector of the k-th item.
Preferably, computing the similarity between items with the weighted Pearson correlation coefficient comprises:
where $r_{ui}$ and $r_{uj}$ are user u's ratings of item i and item j; $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of items i and j; $U_{ij}$ is the set of users who rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the co-rating user set between items.
Preferably, obtaining the user's predicted preference for the target item from the K-nearest-neighbor set of the target item comprises:
where $p_{ui}$ is user u's predicted preference for the target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of the target item i; sim(i, j) is the similarity between the target item i and item j; and $r_{uj}$ is user u's rating of item j.
The personalized recommendation method and system based on improved clustering and the Spark framework of the present invention pre-cluster the items with the Canopy clustering algorithm to generate Canopy cluster centers, and then complete the final clustering of the items with the FCM algorithm starting from those centers. To reduce cases where two items share too few co-rating users yet have a very large Pearson correlation coefficient, the weighted Pearson correlation coefficient is used as the measure of similarity between two items. Spark's strengths in in-memory and iterative computation are used to parallelize the algorithm, which addresses the high computational complexity and slow processing that traditional collaborative filtering recommendation algorithms face in big data settings. The method of the invention effectively alleviates data sparsity and addresses the poor scalability and low recommendation efficiency of traditional collaborative filtering recommendation algorithms in big data settings. In addition, the improved item similarity measure can raise recommendation precision. The invention is therefore of value for mining useful information from massive data and completing personalized recommendation quickly and accurately.
Brief description of the drawings
Exemplary embodiments of the present invention can be understood more fully by reference to the following drawings:
Fig. 1 is a flowchart of the personalized recommendation method 100 based on improved clustering and the Spark framework according to an embodiment of the present invention; and
Fig. 2 is a structural diagram of the personalized recommendation system 200 based on improved clustering and the Spark framework according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are now described with reference to the drawings. The present invention can, however, be implemented in many different forms and is not limited to the embodiments described here; these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the invention to those of ordinary skill in the art. The terms used in the exemplary embodiments illustrated in the drawings do not limit the invention. In the drawings, identical units or elements carry identical reference numerals.
Unless otherwise indicated, the terms used here (including scientific and technical terms) have the meaning commonly understood by those of ordinary skill in the art. Terms defined in commonly used dictionaries should be understood to have a meaning consistent with the context of the related field, and should not be interpreted in an idealized or overly formal sense.
Fig. 1 is a flowchart of the personalized recommendation method 100 based on improved clustering and the Spark framework according to an embodiment of the present invention. As shown in Fig. 1, the method pre-clusters the items with the Canopy algorithm to generate at least one Canopy cluster center and initializes the FCM cluster centers from the Canopy cluster center set, which avoids blind selection of the initial cluster centers and thus improves clustering accuracy. Each item's degree of membership in each cluster center is updated with the membership formula, the cluster centers are updated from the updated memberships, and the iteration continues until the stop condition is met, which determines the final cluster set. The similarity between the target item and each cluster center in the final cluster set is computed, and the items in the clusters whose similarity is greater than or equal to the preset similarity threshold form the candidate item space. The weighted Pearson correlation coefficient is then used to compute the similarity between the target item and each item in the candidate item space and to find the K-nearest-neighbor set of the target item, which improves recommendation precision. From the K-nearest-neighbor set, the user's predicted preference for the target item is obtained, and the N items with the highest predicted preference are recommended using the top-N recommendation method, which improves the real-time response speed and recommendation accuracy of the algorithm. Running the collaborative filtering recommendation algorithm based on improved clustering and improved item similarity computation in parallel on Spark has significant theoretical and reference value for providing users with fast and accurate personalized recommendations.
The personalized recommendation method 100 based on improved clustering and the Spark framework starts at step 101, in which the user-item rating matrix is preprocessed to determine a valid rating data set, wherein the rating data set comprises user data, item data and rating data.
Preferably, preprocessing the user-item rating matrix to determine the valid rating data set comprises:
Filtering out the rating data in the user-item rating matrix that does not meet the conditions, reading the qualifying rating data line by line with textFile to determine the initial rating data set, and setting the number of partitions;
Converting each rating record in the initial rating data set into a <Uid, (Iid, Rating)> key-value pair using map, where Uid is the user ID, Iid is the item ID and Rating is the rating;
Converting each rating record held as a <Uid, (Iid, Rating)> key-value pair into the element format <Iid, (Uid, Rating)> using map, which determines the valid rating data set.
Common rating data sets are generally stored as text, with each line of the text holding one user's rating of one item. Before computation, the data set must be processed: rating records that do not meet the conditions are filtered out, and the rest are converted into the data format needed by the computation. In an embodiment of the present invention, Uid, Iid and Rating denote the user, the item and the rating in a rating record; the number of users is Nu and the number of items is n. The user-item rating matrix is preprocessed to determine the valid rating data set in the following steps:
Read the rating data file line by line with textFile to form the initial RDD (Resilient Distributed Dataset), and set the number of partitions;
Convert each rating record in the initial RDD into a <Uid, (Iid, Rating)> key-value pair with map, and name the resulting RDD train;
Apply map to train to obtain key-value pairs with Iid as the key and the tuple (Uid, Rating) as the value, forming RDD2 with the element format <Iid, (Uid, Rating)>.
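The two map transformations above can be sketched on local Scala collections. This is a minimal illustration, not the filing's code: the comma-separated line layout, the 1.0 to 5.0 validity range and the names `Preprocess`, `parse` and `byItem` are all assumptions. On Spark, the same functions would be applied to an RDD obtained from `sc.textFile`.

```scala
object Preprocess {
  // Parse one "Uid,Iid,Rating" line into <Uid, (Iid, Rating)>.
  // Assumption: malformed lines and ratings outside 1.0..5.0 are the
  // "records that do not meet the conditions" and are dropped.
  def parse(line: String): Option[(Int, (Int, Double))] =
    line.split(",").map(_.trim) match {
      case Array(u, i, r) =>
        scala.util.Try((u.toInt, (i.toInt, r.toDouble))).toOption
          .filter { case (_, (_, rating)) => rating >= 1.0 && rating <= 5.0 }
      case _ => None
    }

  // Re-key <Uid, (Iid, Rating)> as <Iid, (Uid, Rating)>.
  def byItem(train: Seq[(Int, (Int, Double))]): Seq[(Int, (Int, Double))] =
    train.map { case (uid, (iid, rating)) => (iid, (uid, rating)) }
}
```

Keying by Iid in the second step puts all ratings of one item under the same key, which is what the item-based clustering and similarity steps consume.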
Preferably, in step 102, the items are pre-clustered with the Canopy algorithm to generate at least one Canopy cluster center.
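Canopy pre-clustering can be sketched as follows on local collections. The sketch follows the commonly used T1/T2 rule (an item within t1 of a center joins that canopy, and an item within t2 is also dropped from the candidate list), which is an assumption and may differ in detail from the procedure in this filing. Euclidean distance and the names `Canopy`, `dist` and `run` are likewise assumptions.

```scala
object Canopy {
  type Vec = Vector[Double]

  // Euclidean distance between two rating vectors of equal length.
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Returns (center, members) pairs; requires the loose threshold t1 > tight threshold t2.
  def run(items: Seq[Vec], t1: Double, t2: Double): List[(Vec, Seq[Vec])] = {
    require(t1 > t2, "t1 must exceed t2")
    @scala.annotation.tailrec
    def loop(remaining: Seq[Vec], acc: List[(Vec, Seq[Vec])]): List[(Vec, Seq[Vec])] =
      remaining match {
        case Seq() => acc.reverse
        case center +: rest =>
          // Items within t1 of the new center belong to its canopy.
          val members = center +: rest.filter(x => dist(center, x) < t1)
          // Items within t2 are bound tightly enough to leave the candidate list.
          val next = rest.filter(x => dist(center, x) >= t2)
          loop(next, (center, members) :: acc)
      }
    loop(items, Nil)
  }
}
```

The number of canopies produced gives a data-driven value for the FCM cluster count c, and the canopy centers serve as its initial cluster centers.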
Preferably, in step 103, the cluster centers of the FCM algorithm are initialized from the set of at least one Canopy cluster center, each item's degree of membership in each cluster center is updated with the membership formula, the cluster centers are updated from the updated memberships, and the iteration continues until the stop condition is met, which determines the final cluster set.
Preferably, this comprises:
Step 1031: initialize the FCM cluster centers $v_i$ from the set of at least one Canopy cluster center, and determine the number of clusters c, the fuzzy index m and the allowable error threshold ε;
Step 1032: update each item's degree of membership $u_{ik}^{(p+1)}$ in each cluster center using the membership formula;
Step 1033: from the memberships $u_{ik}^{(p+1)}$, update the cluster centers $v_i^{(p+1)}$ using the cluster center formula;
Step 1034: compare the cluster center $v_i^{(p)}$ with the updated cluster center $v_i^{(p+1)}$. If $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stop and determine the final cluster set; otherwise return to step 1032 and repeat until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, then determine the final cluster set.
Preferably, the membership formula is:
where $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item in the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration.
Preferably, the cluster center formula is:
where $x_k$ is the rating vector of the k-th item.
In an embodiment of the present invention, the items are pre-clustered with the Canopy algorithm: the data in the set of item rating vectors is divided into different Canopies, and a cluster center is generated for each. The resulting Canopy cluster centers are then used as the initial cluster centers of the FCM algorithm; each item's degree of membership in each cluster center is updated with the membership formula, the cluster centers are updated from the memberships, and the iteration continues until the stop condition is met, which determines the final cluster set. Spark's strengths in in-memory and iterative computation are used to parallelize the algorithm, which addresses the high computational complexity and slow processing that traditional collaborative filtering recommendation algorithms face in big data settings. The specific steps are as follows:
Step 1: create the initial Canopy cluster center list C_List and set it to empty; obtain the Canopy distance thresholds $t_1$ and $t_2$ ($t_1 > t_2$) by cross-validation;
Step 2: denote the set of item rating vectors <Iid, (Uid, Rating)> obtained from RDD2 as Item; take one item rating vector from Item, denote it Item1, add it to C_List as a Canopy cluster center, and then delete Item1 from Item;
Step 3: pick any item rating vector from the remaining elements of Item, denote it Item2, and compute its distance to every cluster center in C_List. If the distance is less than $t_2$ or greater than $t_1$, add Item2 to C_List and delete it from Item; if the distance lies between $t_1$ and $t_2$, add Item2 to the corresponding Canopy;
Step 4: repeat step 3 until Item is empty, and output the cluster center set Q;
Step 5: initialize the cluster centers $v_i^{(0)}$ from the cluster center set Q, and determine the number of clusters c, the fuzzy index m and the allowable error ε;
Step 6: update each item's degree of membership in each cluster center using the membership formula, where the membership formula is:
where $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item in the i-th cluster center at iteration p+1, and $d_{ik}^{(p)}$ is the distance between the i-th cluster center (i = 1, 2, ..., c) and the k-th item (k = 1, 2, ..., n) after the p-th iteration;
Step 7: from the memberships obtained in step 6, update the cluster centers using the cluster center formula, where the cluster center formula is:
where $x_k$ is the rating vector of the k-th item;
Step 8: compare the cluster center $v_i^{(p)}$ with the new cluster center $v_i^{(p+1)}$. If $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stop and determine the final cluster set; otherwise return to step 6 and repeat until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, then determine the final cluster set. Output the final cluster center set and the l item clusters, and broadcast them to every worker node.
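Steps 5 to 8 can be sketched as a plain fuzzy C-means loop on local collections. This is an illustration under assumptions, not the filing's implementation: it uses the standard FCM update rules, Euclidean distance, a small floor on distances to avoid division by zero, and an iteration cap `maxIter`; the names `Fcm`, `memberships`, `updateCenters` and `run` are hypothetical.

```scala
object Fcm {
  type Vec = Vector[Double]

  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // u(i)(k): membership of item k in cluster i, computed from the current centers.
  def memberships(centers: Seq[Vec], items: Seq[Vec], m: Double): Seq[Seq[Double]] = {
    val e = 2.0 / (m - 1.0)
    centers.map { vi =>
      items.map { xk =>
        val dik = math.max(dist(vi, xk), 1e-12) // floor guards against a zero distance
        1.0 / centers.map(vj => math.pow(dik / math.max(dist(vj, xk), 1e-12), e)).sum
      }
    }
  }

  // Each new center is the mean of the items weighted by membership^m.
  def updateCenters(u: Seq[Seq[Double]], items: Seq[Vec], m: Double): Seq[Vec] =
    u.map { ui =>
      val w = ui.map(x => math.pow(x, m))
      val total = w.sum
      items.zip(w)
        .map { case (xk, wk) => xk.map(_ * wk) }
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
        .map(_ / total)
    }

  // Iterate until every center moves by less than eps, or maxIter is reached.
  def run(init: Seq[Vec], items: Seq[Vec], m: Double, eps: Double, maxIter: Int = 100): Seq[Vec] = {
    var centers = init
    var shift = Double.MaxValue
    var p = 0
    while (shift >= eps && p < maxIter) {
      val next = updateCenters(memberships(centers, items, m), items, m)
      shift = centers.zip(next).map { case (a, b) => dist(a, b) }.max
      centers = next
      p += 1
    }
    centers
  }
}
```

Passing the Canopy centers as `init` is what replaces the random initialization of ordinary FCM; the memberships for each item always sum to 1 across clusters.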
Preferably, in step 104, the similarity between the target item and each cluster center in the final cluster set is computed, the items in the clusters whose similarity is greater than or equal to the preset similarity threshold form the candidate item space, and the weighted Pearson correlation coefficient is used to compute the similarity between the target item and each item in the candidate item space and to find the K-nearest-neighbor set of the target item.
Preferably, computing the similarity between items with the weighted Pearson correlation coefficient comprises:
where $r_{ui}$ and $r_{uj}$ are user u's ratings of item i and item j; $\bar{r}_i$ and $\bar{r}_j$ are the average ratings of items i and j; $U_{ij}$ is the set of users who rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the co-rating user set between items.
To reduce cases where two items share too few co-rating users yet have a very large Pearson correlation coefficient, the weighted Pearson correlation coefficient is used as the measure of similarity between two items. In an embodiment of the present invention, the parallel computation proceeds as follows:
Step 1: use groupByKey to merge the item feature vectors <Iid, (Uid, Rating)> in the candidate item space, forming key-value pairs with Iid as the key and List((Uid, Rating), ...) as the value.
Step 2: use flatMap to iterate over each item's rating list and compute the average of all users' ratings of that item. Because items rated by the same user must later be paired, the element format is converted at the same time into key-value pairs with the user as the key and the pair (Iid, Rating-avg) as the value, where Rating-avg is the difference between the user's rating of the item and the item's average rating.
Step 3: use groupByKey to merge the items rated by the same user, forming RDD3 with the element format <Uid, List((Iid, Rating-avg), ...)>.
Step 4: use flatMap to pair the items rated by the same user two by two in ascending order of item ID, forming RDD4 with the element format <(Iid1, Iid2), (Rating1-avg1, Rating2-avg2)>, where Iid1 and Iid2 are two items rated by some user u with Iid1 less than Iid2, avg1 is the average of all users' ratings of item Iid1, and Rating1 is user u's rating of item Iid1.
Step 5: use aggregateByKey with custom seqOp and combOp functions to merge the elements that share a key, forming key-value pairs with the tuple (Iid1, Iid2) as the key and (v1, v2, v3) as the value, where v1, v2 and v3 correspond to $\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)$, $\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2$ and $\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2$ in the weighted Pearson correlation coefficient formula.
Step 6: use mapValues to compute the similarity of each item pair, forming an RDD with the element format <(Iid1, Iid2), sim>, where sim is the rating similarity of the pair. This step can be computed offline in advance and the result saved to HDFS; online prediction then reads it directly, which improves the running efficiency of the algorithm.
Step 7: use map to turn the resulting RDD into RDD5 with the element format <Iid1, (Iid2, sim)>. To reduce the amount of computation, the items were sorted before pairing, so RDD5 records only <Iid1, (Iid2, sim)> and not <Iid2, (Iid1, sim)>. Computing the nearest-neighbor set of the target item, however, requires the full list of similarities between the target item and the other items, so the final item similarities must be expanded. Swap the positions of the item IDs to form RDD6 with the element format <Iid2, (Iid1, sim)>, then apply union to RDD5 and RDD6 to generate RDD7, which holds both <Iid1, (Iid2, sim)> and <Iid2, (Iid1, sim)> for every pair of items.
Step 8: use groupByKey to form an RDD with the element format <Iid1, List((Iid2, sim), (Iid3, sim), ...)>, then keep the record whose key is the target item j and denote it neighbor.
Step9: use the following code

    neighbor.map { x =>
      // sort the (Iid, sim) pairs by similarity in descending order and keep the first K
      val a = x._2.toSeq.sortWith((o, p) => o._2 > p._2).take(K)
      (x._1, a)
    }

to select the K-nearest-neighbor set Kneighbor of the target item j.
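Steps 1 to 9 can be sketched end to end with the Spark RDD API as follows. This is a minimal sketch under stated assumptions, not the patent's own implementation: the object and method names (`ItemNeighbors`, `pairSimilarities`, `topK`) are hypothetical, IDs are assumed to be integers, a fourth accumulator component counts the common raters (Num), and the weighting factor min(Num, a) / a is an assumed form of the weighted coefficient.

```scala
import org.apache.spark.rdd.RDD

object ItemNeighbors {
  /** items: <Iid, (Uid, Rating)>; a: co-rater threshold.
    * Returns <(Iid1, Iid2), sim> with Iid1 < Iid2. */
  def pairSimilarities(items: RDD[(Int, (Int, Double))], a: Int): RDD[((Int, Int), Double)] = {
    // Step1-2: group ratings by item, subtract the item mean, re-key by user.
    val byUser = items.groupByKey().flatMap { case (iid, rs) =>
      val avg = rs.map(_._2).sum / rs.size
      rs.map { case (uid, r) => (uid, (iid, r - avg)) }
    }
    // Step3-4: group by user and emit every pair of co-rated items, smaller Iid first.
    val pairs = byUser.groupByKey().flatMap { case (_, rs) =>
      val s = rs.toVector.sortBy(_._1)
      for (i <- s.indices; j <- (i + 1) until s.size)
        yield ((s(i)._1, s(j)._1), (s(i)._2, s(j)._2))
    }
    // Step5: accumulate (v1, v2, v3, Num) per item pair; Num is an added assumption.
    val sums = pairs.aggregateByKey((0.0, 0.0, 0.0, 0))(
      (acc, d) => (acc._1 + d._1 * d._2, acc._2 + d._1 * d._1, acc._3 + d._2 * d._2, acc._4 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4))
    // Step6: damped Pearson coefficient; pairs with zero variance get similarity 0.
    sums.mapValues { case (v1, v2, v3, n) =>
      val denom = math.sqrt(v2) * math.sqrt(v3)
      if (denom == 0.0) 0.0 else (math.min(n, a).toDouble / a) * v1 / denom
    }
  }

  /** Step7-9: make the pairs symmetric, group by item, keep the k most similar to `target`. */
  def topK(sims: RDD[((Int, Int), Double)], target: Int, k: Int): Seq[(Int, Double)] = {
    val rdd5 = sims.map { case ((i1, i2), s) => (i1, (i2, s)) }
    val rdd6 = sims.map { case ((i1, i2), s) => (i2, (i1, s)) }
    rdd5.union(rdd6)
      .groupByKey()
      .filter(_._1 == target)
      .map(x => x._2.toSeq.sortWith((o, p) => o._2 > p._2).take(k))
      .collect().headOption.getOrElse(Seq.empty)
  }
}
```

In practice the filter on the target item could be applied before groupByKey to shrink the shuffle; the order above simply follows the steps as written.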
Preferably, in step 105, the user's predicted preference value for the target item is obtained according to the K-nearest-neighbor set of the target item, and the N items with the highest predicted preference values are selected for recommendation using the top-N recommendation method.
Preferably, obtaining the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item comprises:
Here, $p_{ui}$ is the predicted preference value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating given by user u to item j.
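The prediction formula itself is not reproduced in this text. A common item-based form that uses every symbol defined above is given below as a sketch; centering each neighbor's rating on its own mean $\bar{r}_j$ and normalizing by the sum of absolute similarities are assumptions, not a confirmed reading of the patent's expression.

```latex
p_{ui} = \bar{r}_i + \frac{\sum_{j \in Kn} \mathrm{sim}(i,j)\,\bigl(r_{uj} - \bar{r}_j\bigr)}{\sum_{j \in Kn} \lvert \mathrm{sim}(i,j) \rvert}
```

Under this form, a neighbor that user u rated above its own average pulls the prediction above $\bar{r}_i$ in proportion to its similarity to the target item.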
In this embodiment of the present invention, the specific implementation steps are:
Step1: use flatMap to pair user u with each item in the neighbor set of item j, forming RDD1 with element format <(u, Iid1), sim>, where Iid1 is an item in the neighbor set and sim is the similarity between Iid1 and the item j to be rated.
Step2: use filter to filter the training set train, forming RDD2 with element format <(u, Iid1), Rating>.
Step3: use join to merge the elements of RDD1 and RDD2 that have the same key, forming RDD3 with element format <(u, Iid1), (sim, Rating)>.
Step4: use map and reduce to obtain the final predicted rating.
Step5: use the same approach to predict ratings for all items that the user has not rated, and select the N items with the highest predicted ratings to generate the recommendation list.
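Steps 1 to 4 above might look like the following on Spark RDDs. This is a hedged sketch: `NeighborPrediction`, `predict`, `targetMean` and `itemMeans` are hypothetical names, the item means are assumed to fit in a local map, the prediction is assumed to be the target item's mean plus a similarity-weighted average of mean-centered neighbor ratings, and `fold` is used in place of `reduce` so that a user with no rated neighbors falls back to the target item's mean.

```scala
import org.apache.spark.rdd.RDD

object NeighborPrediction {
  /** neighbors: (Iid1, sim) pairs for the target item;
    * train: <(Uid, Iid), Rating>; itemMeans: Iid -> average rating (assumed precomputed). */
  def predict(u: Int, targetMean: Double, neighbors: Seq[(Int, Double)],
              train: RDD[((Int, Int), Double)], itemMeans: Map[Int, Double]): Double = {
    // Step1: pair user u with every neighbor item -> <(u, Iid1), sim>
    val rdd1 = train.sparkContext.parallelize(neighbors)
      .map { case (iid, sim) => ((u, iid), sim) }
    // Step2: keep only the ratings made by user u -> <(u, Iid1), Rating>
    val rdd2 = train.filter { case ((uid, _), _) => uid == u }
    // Step3: join on the (user, item) key -> <(u, Iid1), (sim, Rating)>
    val rdd3 = rdd1.join(rdd2)
    // Step4: sum the weighted deviations and the absolute similarities
    val (num, den) = rdd3
      .map { case ((_, iid), (sim, r)) => (sim * (r - itemMeans(iid)), math.abs(sim)) }
      .fold((0.0, 0.0))((x, y) => (x._1 + y._1, x._2 + y._2))
    if (den == 0.0) targetMean else targetMean + num / den
  }
}
```

Step5 would call `predict` for each item the user has not rated and keep the N largest values.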
Fig. 2 is a structural schematic diagram of the personalized recommendation system 200 based on improved clustering and the Spark framework according to an embodiment of the present invention. As shown in Fig. 2, the system 200 comprises: a data preprocessing unit 201, a Canopy cluster center generation unit 202, a final cluster set determination unit 203, a K-nearest-neighbor set determination unit 204 and a recommendation unit 205. Preferably, the data preprocessing unit 201 performs data preprocessing on the user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises: user data, item data and rating data.
Preferably, the data preprocessing unit 201 performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that does not meet the conditions, reading the qualifying rating data line by line from a text file to determine an initial rating data set, and setting the number of partitions;
using map to convert each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)>, where Uid is the user ID, Iid is the item ID, and Rating is the rating;
using map to convert each rating record held in the <Uid, (Iid, Rating)> key-value form into the element format <Iid, (Uid, Rating)>, thereby determining the valid rating data set.
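The preprocessing described above could be sketched on Spark as follows. The input layout (one `uid,iid,rating` record per line) and the validity rule (a rating between 1 and 5) are assumptions made for illustration, since the text does not state the filtering conditions; `Preprocess`, `parse` and `load` are hypothetical names.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.util.Try

object Preprocess {
  /** Parse one "uid,iid,rating" line into <Uid, (Iid, Rating)>.
    * Returns None for malformed rows or ratings outside the assumed 1..5 range. */
  def parse(line: String): Option[(Int, (Int, Double))] = line.split(",") match {
    case Array(u, i, r) =>
      Try((u.trim.toInt, (i.trim.toInt, r.trim.toDouble))).toOption
        .filter { case (_, (_, rating)) => rating >= 1.0 && rating <= 5.0 }
    case _ => None
  }

  /** Read the file line by line with the given number of partitions,
    * drop invalid rows, and re-key each record by item. */
  def load(sc: SparkContext, path: String, partitions: Int): RDD[(Int, (Int, Double))] =
    sc.textFile(path, partitions)
      .flatMap(line => parse(line).toList)               // <Uid, (Iid, Rating)>
      .map { case (uid, (iid, r)) => (iid, (uid, r)) }   // <Iid, (Uid, Rating)>
}
```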
Preferably, the Canopy cluster center generation unit 202 performs clustering preprocessing on the items using the Canopy algorithm, generating at least one Canopy cluster center.
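As an illustration of how Canopy can produce initial centers, the sketch below runs on plain Scala collections rather than RDDs. It assumes Euclidean distance between item rating vectors and a single tight threshold `t2`; both are assumptions, as the passage does not give the distance measure or thresholds, and the names `Canopy`, `centers` and `t2` are hypothetical.

```scala
object Canopy {
  type Vec = Vector[Double]

  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Pick the first remaining point as a canopy center, discard every remaining
    * point within distance t2 of it, and repeat until no points are left. */
  def centers(points: Seq[Vec], t2: Double): Seq[Vec] = {
    @annotation.tailrec
    def loop(remaining: List[Vec], acc: List[Vec]): List[Vec] = remaining match {
      case Nil => acc.reverse
      case c :: rest => loop(rest.filter(p => dist(c, p) > t2), c :: acc)
    }
    loop(points.toList, Nil)
  }
}
```

The number of centers returned can then serve as the cluster count c for the fuzzy clustering stage, which is why Canopy is useful as a cheap preprocessing pass.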
Preferably, the final cluster set determination unit 203 initializes the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updates each item's degree of membership in the cluster centers using the membership calculation formula, updates the cluster centers according to the updated degrees of membership, and iterates until the stopping condition is met, determining the final cluster set.
Preferably, the final cluster set determination unit 203, in initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's degree of membership in the cluster centers using the membership calculation formula, updating the cluster centers according to the updated degrees of membership, and iterating until the stopping condition is met to determine the final cluster set, comprises:
a cluster center generation subunit, configured to initialize the cluster centers $v_i$ of the FCM algorithm from the set of the at least one Canopy cluster center, and to determine the number of clusters c, the fuzziness index m and the allowable error threshold ε;
a membership calculation subunit, configured to update each item's degree of membership $u_{ik}^{(p+1)}$ in the cluster centers using the membership calculation formula;
a cluster center update subunit, configured to update the cluster centers $v_i^{(p+1)}$ according to the degrees of membership $u_{ik}^{(p+1)}$ using the cluster center calculation formula;
a final cluster set determination subunit, configured to compare the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$: if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, the computation stops and the final cluster set is determined; otherwise, the procedure returns to step 2 (the membership update) and repeats until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, at which point the final cluster set is determined.
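The iteration carried out by these subunits can be sketched in plain Scala as below. The sketch assumes the standard fuzzy c-means updates with Euclidean distance; `Fcm`, `memberships`, `updateCenters`, `run` and the `maxIter` safety cap are hypothetical additions, and the initial centers are simply passed in (for example, from a Canopy pass).

```scala
object Fcm {
  type Vec = Vector[Double]

  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** u(i)(k): degree of membership of item k in cluster i, for fuzziness index m > 1. */
  def memberships(xs: Seq[Vec], vs: Seq[Vec], m: Double): Seq[Seq[Double]] = {
    val e = 2.0 / (m - 1.0)
    vs.map { vi =>
      xs.map { xk =>
        val dik = dist(vi, xk)
        val ds = vs.map(vj => dist(vj, xk))
        if (dik == 0.0) 1.0                 // item coincides with this center
        else if (ds.contains(0.0)) 0.0      // item coincides with another center
        else 1.0 / ds.map(djk => math.pow(dik / djk, e)).sum
      }
    }
  }

  /** New center i = weighted mean of the items, with weights u(i)(k)^m. */
  def updateCenters(xs: Seq[Vec], u: Seq[Seq[Double]], m: Double): Seq[Vec] =
    u.map { ui =>
      val w = ui.map(math.pow(_, m))
      val total = w.sum
      xs.zip(w).map { case (x, wk) => x.map(_ * wk) }
        .reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
        .map(_ / total)
    }

  /** Alternate the two updates until every center moves by less than eps. */
  def run(xs: Seq[Vec], init: Seq[Vec], m: Double, eps: Double, maxIter: Int = 100): Seq[Vec] = {
    var vs = init
    var moved = Double.MaxValue
    var it = 0
    while (moved >= eps && it < maxIter) {
      val next = updateCenters(xs, memberships(xs, vs, m), m)
      moved = vs.zip(next).map { case (a, b) => dist(a, b) }.max
      vs = next
      it += 1
    }
    vs
  }
}
```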
Preferably, the membership calculation formula is:
Here, $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item with respect to the i-th cluster center at the (p+1)-th iteration, and $d_{ik}^{(p)}$ is the distance between the i-th (i = 1, 2, ..., c) cluster center and the k-th (k = 1, 2, ..., n) item after the p-th iteration.
Preferably, the cluster center update formula is:
Here, $x_k$ is the rating vector of the k-th item.
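The two formulas are not reproduced in this text. For reference, the standard fuzzy c-means updates are shown below in notation matching the definitions above; that the patent uses exactly these standard forms, and the symbols $u_{ik}$ and $d_{ik}$ themselves, are assumptions.

```latex
u_{ik}^{(p+1)} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d_{ik}^{(p)}}{d_{jk}^{(p)}} \right)^{\frac{2}{m-1}}},
\qquad
v_i^{(p+1)} = \frac{\sum_{k=1}^{n} \bigl(u_{ik}^{(p+1)}\bigr)^{m} x_k}{\sum_{k=1}^{n} \bigl(u_{ik}^{(p+1)}\bigr)^{m}}
```

As a worked instance with m = 2 and two centers at distances 1 and 2 from an item, the memberships are 1 / (1 + 1/4) = 0.8 and 1 / (4 + 1) = 0.2, which sum to 1 as required.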
Preferably, the K-nearest-neighbor set determination unit 204 calculates the similarity between the target item and each cluster center in the final cluster set, selects the items in the clusters whose similarity is greater than or equal to a preset similarity threshold to form the candidate item space, and uses the weighted Pearson correlation coefficient to calculate the similarity between the target item and each item in the candidate item space, finding the K-nearest-neighbor set of the target item.
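The candidate-space selection can be illustrated with a small plain-Scala sketch. The passage does not say which similarity is used between the target item and a cluster center, nor how items are assigned to clusters; cosine similarity and a precomputed set of item IDs per cluster are assumptions here, and `CandidateSpace`, `cosine` and `select` are hypothetical names.

```scala
object CandidateSpace {
  type Vec = Vector[Double]

  /** Cosine similarity; returns 0 when either vector is all zeros. */
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
  }

  /** clusters: (center, IDs of the items assigned to that cluster).
    * Keeps the items of every cluster whose center is similar enough to the target. */
  def select(target: Vec, clusters: Seq[(Vec, Set[Int])], threshold: Double): Set[Int] =
    clusters
      .collect { case (center, items) if cosine(target, center) >= threshold => items }
      .foldLeft(Set.empty[Int])(_ ++ _)
}
```

Restricting the neighbor search to this candidate space is what keeps the pairwise similarity computation from running over the full item set.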
Preferably, calculating the similarity between items using the weighted Pearson correlation coefficient comprises:
Here, $r_{ui}$ and $r_{uj}$ denote the ratings given by user u to item i and item j, respectively; $\bar{r}_i$ and $\bar{r}_j$ denote the average ratings of items i and j; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the intersection of users who rated both items.
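For a single pair of items, the weighted coefficient can be sketched in plain Scala as follows. The damping factor min(Num, a) / a is an assumed, commonly used form of the weighting; the names `WeightedPearson` and `similarity` are hypothetical, and the item means are passed in because they are taken over all raters of each item, not only the common ones.

```scala
object WeightedPearson {
  /** ratingsI / ratingsJ: Uid -> rating for items i and j.
    * meanI / meanJ: average rating of each item; a: co-rater threshold. */
  def similarity(ratingsI: Map[Int, Double], ratingsJ: Map[Int, Double],
                 meanI: Double, meanJ: Double, a: Int): Double = {
    val common = ratingsI.keySet intersect ratingsJ.keySet   // U_ij
    val num = common.size                                    // Num
    var v1 = 0.0; var v2 = 0.0; var v3 = 0.0
    for (u <- common) {
      val di = ratingsI(u) - meanI
      val dj = ratingsJ(u) - meanJ
      v1 += di * dj; v2 += di * di; v3 += dj * dj
    }
    val denom = math.sqrt(v2) * math.sqrt(v3)
    // Pairs with fewer than a common raters are scaled down by Num / a.
    if (denom == 0.0) 0.0 else (math.min(num, a).toDouble / a) * v1 / denom
  }
}
```

With two common raters and a = 4, a perfectly correlated pair receives a similarity of about 0.5 instead of 1, which is the intended penalty for a small overlap.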
Preferably, the recommendation unit 205 obtains the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item, and selects the N items with the highest predicted preference values for recommendation using the top-N recommendation method.
Preferably, obtaining the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item comprises:
Here, $p_{ui}$ is the predicted preference value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating given by user u to item j.
The personalized recommendation system 200 based on improved clustering and the Spark framework of this embodiment corresponds to the personalized recommendation method 100 based on improved clustering and the Spark framework of another embodiment of the present invention, and is not described again here.
The present invention has been described with reference to a small number of embodiments. However, as is known to those skilled in the art, and as defined by the appended patent claims, embodiments other than those disclosed above equally fall within the scope of the present invention.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/the/said [device, component, etc.]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Claims (14)

1. A personalized recommendation method based on improved clustering and the Spark framework, characterized in that the method comprises:
performing data preprocessing on a user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises: user data, item data and rating data;
performing clustering preprocessing on the items using the Canopy algorithm to generate at least one Canopy cluster center;
initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's degree of membership in the cluster centers using a membership calculation formula, updating the cluster centers according to the updated degrees of membership, and iterating until a stopping condition is met to determine a final cluster set;
calculating the similarity between a target item and each cluster center in the final cluster set, selecting the items in the clusters whose similarity is greater than or equal to a preset similarity threshold to form a candidate item space, and using the weighted Pearson correlation coefficient to calculate the similarity between the target item and each item in the candidate item space to find the K-nearest-neighbor set of the target item;
obtaining the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item, and selecting the N items with the highest predicted preference values for recommendation using the top-N recommendation method.
2. The method according to claim 1, characterized in that performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that does not meet the conditions, reading the qualifying rating data line by line from a text file to determine an initial rating data set, and setting the number of partitions;
using map to convert each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)>, where Uid is the user ID, Iid is the item ID, and Rating is the rating;
using map to convert each rating record held in the <Uid, (Iid, Rating)> key-value form into the element format <Iid, (Uid, Rating)>, thereby determining the valid rating data set.
3. The method according to claim 1, characterized in that initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's degree of membership in the cluster centers using the membership calculation formula, updating the cluster centers according to the updated degrees of membership, and iterating until the stopping condition is met to determine the final cluster set comprises:
Step 1: initializing the cluster centers $v_i$ of the FCM algorithm from the set of the at least one Canopy cluster center, and determining the number of clusters c, the fuzziness index m and the allowable error threshold ε;
Step 2: updating each item's degree of membership $u_{ik}^{(p+1)}$ in the cluster centers using the membership calculation formula;
Step 3: updating the cluster centers $v_i^{(p+1)}$ according to the degrees of membership $u_{ik}^{(p+1)}$ using the cluster center calculation formula;
Step 4: comparing the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$; if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, stopping the computation and determining the final cluster set; otherwise, returning to Step 2 until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, and then determining the final cluster set.
4. The method according to claim 3, characterized in that the membership calculation formula is:
Here, $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item with respect to the i-th cluster center at the (p+1)-th iteration, and $d_{ik}^{(p)}$ is the distance between the i-th (i = 1, 2, ..., c) cluster center and the k-th (k = 1, 2, ..., n) item after the p-th iteration.
5. The method according to claim 4, characterized in that the cluster center update formula is:
Here, $x_k$ is the rating vector of the k-th item.
6. The method according to claim 1, characterized in that calculating the similarity between items using the weighted Pearson correlation coefficient comprises:
Here, $r_{ui}$ and $r_{uj}$ denote the ratings given by user u to item i and item j, respectively; $\bar{r}_i$ and $\bar{r}_j$ denote the average ratings of items i and j; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the intersection of users who rated both items.
7. The method according to claim 1, characterized in that obtaining the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item comprises:
Here, $p_{ui}$ is the predicted preference value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating given by user u to item j.
8. A personalized recommendation system based on improved clustering and the Spark framework, characterized in that the system comprises:
a data preprocessing unit, configured to perform data preprocessing on a user-item rating matrix to determine a valid rating data set, wherein the rating data set comprises: user data, item data and rating data;
a Canopy cluster center generation unit, configured to perform clustering preprocessing on the items using the Canopy algorithm to generate at least one Canopy cluster center;
a final cluster set determination unit, configured to initialize the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, update each item's degree of membership in the cluster centers using a membership calculation formula, update the cluster centers according to the updated degrees of membership, and iterate until a stopping condition is met to determine a final cluster set;
a K-nearest-neighbor set determination unit, configured to calculate the similarity between a target item and each cluster center in the final cluster set, select the items in the clusters whose similarity is greater than or equal to a preset similarity threshold to form a candidate item space, and use the weighted Pearson correlation coefficient to calculate the similarity between the target item and each item in the candidate item space to find the K-nearest-neighbor set of the target item;
a recommendation unit, configured to obtain the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item, and to select the N items with the highest predicted preference values for recommendation using the top-N recommendation method.
9. The system according to claim 8, characterized in that the data preprocessing unit performing data preprocessing on the user-item rating matrix to determine the valid rating data set comprises:
filtering out the rating data in the user-item rating matrix that does not meet the conditions, reading the qualifying rating data line by line from a text file to determine an initial rating data set, and setting the number of partitions;
using map to convert each rating record in the initial rating data set into a key-value pair of the form <Uid, (Iid, Rating)>, where Uid is the user ID, Iid is the item ID, and Rating is the rating;
using map to convert each rating record held in the <Uid, (Iid, Rating)> key-value form into the element format <Iid, (Uid, Rating)>, thereby determining the valid rating data set.
10. The system according to claim 8, characterized in that the final cluster set determination unit, in initializing the cluster centers of the FCM algorithm from the set of the at least one Canopy cluster center, updating each item's degree of membership in the cluster centers using the membership calculation formula, updating the cluster centers according to the updated degrees of membership, and iterating until the stopping condition is met to determine the final cluster set, comprises:
a cluster center generation subunit, configured to initialize the cluster centers $v_i$ of the FCM algorithm from the set of the at least one Canopy cluster center, and to determine the number of clusters c, the fuzziness index m and the allowable error threshold ε;
a membership calculation subunit, configured to update each item's degree of membership $u_{ik}^{(p+1)}$ in the cluster centers using the membership calculation formula;
a cluster center update subunit, configured to update the cluster centers $v_i^{(p+1)}$ according to the degrees of membership $u_{ik}^{(p+1)}$ using the cluster center calculation formula;
a final cluster set determination subunit, configured to compare the cluster centers $v_i^{(p)}$ with the updated cluster centers $v_i^{(p+1)}$: if $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, the computation stops and the final cluster set is determined; otherwise, the procedure returns to step 2 (the membership update) and repeats until $\|v_i^{(p+1)} - v_i^{(p)}\| < \varepsilon$, at which point the final cluster set is determined.
11. The system according to claim 10, characterized in that the membership calculation formula is:
Here, $u_{ik}^{(p+1)}$ is the degree of membership of the k-th item with respect to the i-th cluster center at the (p+1)-th iteration, and $d_{ik}^{(p)}$ is the distance between the i-th (i = 1, 2, ..., c) cluster center and the k-th (k = 1, 2, ..., n) item after the p-th iteration.
12. The system according to claim 11, characterized in that the cluster center calculation formula is:
Here, $x_k$ is the rating vector of the k-th item.
13. The system according to claim 8, characterized in that calculating the similarity between items using the weighted Pearson correlation coefficient comprises:
Here, $r_{ui}$ and $r_{uj}$ denote the ratings given by user u to item i and item j, respectively; $\bar{r}_i$ and $\bar{r}_j$ denote the average ratings of items i and j; $U_{ij}$ is the set of users who have rated both item i and item j; Num is the number of elements in $U_{ij}$; and a is the threshold on the size of the intersection of users who rated both items.
14. The system according to claim 8, characterized in that obtaining the user's predicted preference value for the target item according to the K-nearest-neighbor set of the target item comprises:
Here, $p_{ui}$ is the predicted preference value of user u for target item i; Kn is the K-nearest-neighbor set of the target item; $\bar{r}_i$ is the average rating of target item i; sim(i, j) is the similarity between target item i and item j; and $r_{uj}$ is the rating given by user u to item j.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711132268.0A CN110020141A (en) 2017-11-15 2017-11-15 A kind of personalized recommendation method and system based on improvement cluster and Spark frame

Publications (1)

Publication Number Publication Date
CN110020141A true CN110020141A (en) 2019-07-16

Family

ID=67186788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711132268.0A Pending CN110020141A (en) 2017-11-15 2017-11-15 A kind of personalized recommendation method and system based on improvement cluster and Spark frame

Country Status (1)

Country Link
CN (1) CN110020141A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412948A (en) * 2013-08-27 2013-11-27 北京交通大学 Cluster-based collaborative filtering commodity recommendation method and system
CN104239496A (en) * 2014-09-10 2014-12-24 西安电子科技大学 Collaborative filtering method based on integration of fuzzy weight similarity measurement and clustering
CN107153846A (en) * 2017-05-26 2017-09-12 南京邮电大学 A kind of road traffic state modeling method based on Fuzzy C-Means Cluster Algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
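Liao Bin et al.: "Performance optimization of the ItemBased recommendation algorithm based on Spark" (基于Spark的ItemBased推荐算法性能优化), Journal of Computer Applications (《计算机应用》) *
Wang Xiaojun et al.: "A scalable collaborative filtering method based on fuzzy clustering" (基于模糊聚类的可扩展的协同过滤方法), Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) (《南京邮电大学学报(自然科学版)》) *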

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487276A (en) * 2019-09-11 2021-03-12 腾讯科技(深圳)有限公司 Object acquisition method, device, equipment and storage medium
CN112487276B (en) * 2019-09-11 2023-10-17 腾讯科技(深圳)有限公司 Object acquisition method, device, equipment and storage medium
CN111209953A (en) * 2020-01-03 2020-05-29 腾讯科技(深圳)有限公司 Method and device for recalling neighbor vector, computer equipment and storage medium
CN111209953B (en) * 2020-01-03 2024-01-16 腾讯科技(深圳)有限公司 Recall method, recall device, computer equipment and storage medium for neighbor vector
CN111367901A (en) * 2020-02-27 2020-07-03 智慧航海(青岛)科技有限公司 Ship data denoising method
CN113139021A (en) * 2021-04-23 2021-07-20 上海中通吉网络技术有限公司 Express delivery network calling center data identification method
CN115063877A (en) * 2022-06-06 2022-09-16 南通大学 Parallel superpixel Spark clustering method for big data fundus image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190716