CN106126727A - Big data processing method for a recommender system - Google Patents
Big data processing method for a recommender system
- Publication number
- CN106126727A CN201610515790.6A
- Authority
- CN
- China
- Prior art keywords
- user
- project
- list
- data
- data processing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The invention discloses a big data processing method for a recommender system. It belongs to the technical field of recommender systems and addresses the technical problem that traditional big data processing methods for recommender systems are difficult to develop and inefficient. The method includes: obtaining a user-preference utility matrix and item data; performing distributed data processing on the utility matrix to obtain the list of items each user has rated and, for each item, the list of users who rated it and their ratings, and performing distributed data processing on the item data to obtain the similar-item list of each item; performing distributed data processing on each item's list of rating users and ratings and on each item's similar-item list to obtain each user's recommended items and their predicted-weight list; and performing distributed data processing on the list of items the user has rated and on the user's recommended items and predicted-weight list to obtain the user's item recommendation result.
Description
Technical field
The present invention relates to the technical field of recommender systems, and in particular to a big data processing method for a recommender system.
Background art
A recommender system, that is, personalized recommendation, essentially recommends goods or item information that a user is interested in, based on the user's interests and historical behavior records. As the Internet keeps growing and the mobile Internet becomes ubiquitous, the information that represents a user on the Internet is also diversifying: from basic demographic information to dynamically changing geographic location, and from a simple content browser to a content provider. Individuals in the real world are becoming real "virtual persons" on the Internet. Personalized recommender systems emerged to analyze the large volume of unstructured, heterogeneous user information effectively, draw reliable and satisfactory results, and solve the information overload problem. A personalized recommender system is an advanced business intelligence platform built on mining massive data; its core is to use the large amount of user information to establish connections between users and items, thereby achieving information discovery. In essence a recommender system is more of a service: it helps users free themselves from massive data and provides recommendations that help them make better decisions. Such services will permeate the whole Internet and every aspect of people's future lives, such as news, travel, and dining. A recommender system is a subset of cloud computing, whose essence is providing elastic services.
The implementation of a recommender system must be supported by big data processing; it is no longer just a traditional algorithm implementation. Processing the collected data on a scalable platform to obtain a reliable model is the fundamental role of big data processing in a recommender system.
At present there are many platform options for developing big data processing for recommender systems, but all of them require fairly complex programming skills and architectures. They place high demands on developers, who must control a large number of design details, so development is difficult and computation is inefficient.
Summary of the invention
The object of the present invention is to provide a big data processing method for a recommender system, so as to solve the technical problem that traditional big data processing methods for recommender systems are difficult to develop and inefficient.
The present invention provides a big data processing method for a recommender system, the method including:
Obtaining a user-preference utility matrix and item data;
Performing distributed data processing on the utility matrix to obtain the list of items each user has rated in the utility matrix and, for each item, the list of users who rated it and their ratings, and performing distributed data processing on the item data to obtain the similar-item list of each item;
Performing distributed data processing on each item's list of rating users and ratings and on each item's similar-item list to obtain each user's recommended items and their predicted-weight list;
Performing distributed data processing on the list of items the user has rated and on the user's recommended items and predicted-weight list to obtain the user's item recommendation result.
The big data processing method for a recommender system provided by the present invention further includes:
Performing distributed data processing on the list of items each user has rated in the utility matrix to obtain the user's similar-user list;
Performing distributed data processing on the user's similar-user list and the item data to obtain the user's item recommendation result.
The big data processing method for a recommender system provided by the present invention further includes:
Obtaining user data and performing distributed clustering to obtain a clustering result;
Performing distributed data processing on the clustering result, the list of items the users have rated, and the item data to obtain the materialized features of each cluster;
Obtaining the materialized features of a new user from the materialized-feature data of each cluster according to the cluster the new user selects;
Performing a distributed search of the item index based on the new user's materialized features to obtain the new user's item recommendation result.
The big data processing method for a recommender system provided by the present invention further includes:
Obtaining a training dataset containing item IDs, the IDs of the users who rated each item, and the rating values, and a test dataset containing user IDs and the IDs of the items to be predicted for each user;
Performing distributed data processing on the training dataset to compute the item pairs in it, their differences, and the list of users producing each difference;
Performing distributed data processing on the item pairs, their differences, and the lists of users producing the differences to compute, for the training dataset, the item pairs that have a difference relationship, the total number of users, the rating sum, and the average difference;
Performing distributed data processing on the training dataset and the test dataset to obtain the items each user has rated with their ratings and the items to be predicted;
Performing distributed data processing on the item pairs that have a difference relationship, the total number of users, the rating sum, the average difference, the items the user has rated with their ratings, and the items to be predicted, to obtain the values required for the SlopeOne prediction;
Performing distributed data processing on the values required for the SlopeOne prediction to compute the predicted value of each item to be predicted for the user.
The step of performing distributed data processing on the utility matrix and the item data includes:
Obtaining, from the utility matrix through a first Map-Reduce module, first output data whose key-value pair is user: list of items the user has rated;
Obtaining, from the utility matrix through a second Map-Reduce module, second output data whose key-value pair is item: list of users who rated the item and their ratings;
Obtaining, from the item data through a third Map-Reduce module, third output data whose key-value pair is item: similar-item list of the item.
The step of obtaining the user's recommended items and their predicted-weight list includes:
Obtaining, from the second and third output data through a fourth Map-Reduce module, fourth output data whose key-value pair is user: the user's recommended items and predicted-weight list.
The step of obtaining the user's item recommendation result includes:
Obtaining, from the first and fourth output data through a fifth Map-Reduce module, an output result whose key-value pair is user: the user's item recommendation result list.
The step of obtaining the user's similar-user list includes:
Obtaining, from the list of items the user has rated through a sixth Map-Reduce module, fifth output data whose key-value pair is user: the user's similar-user list.
The step of performing distributed data processing on the user's similar-user list and the item data includes:
Obtaining, from the fifth output data and the item data through a seventh Map-Reduce module, an output result whose key-value pair is user: the user's item recommendation result list.
The step of obtaining the materialized features of each cluster includes:
Obtaining, from the list of items the users have rated and the clustering result data through an eighth Map-Reduce module, sixth output data whose key-value pair is cluster: list of items corresponding to the cluster;
Obtaining, from the sixth output data and the item data through a ninth Map-Reduce module, seventh output data whose key-value pair is cluster: materialized-feature list of the cluster.
The step of obtaining the new user's item recommendation result includes:
Searching the item index based on the new user's materialized features through a tenth Map-Reduce module to obtain an output result whose key-value pair is user: the user's item recommendation result list.
The step of performing distributed data processing on the training dataset includes:
Obtaining, from the training dataset through a first map module and a shuffle module, output data whose key-value pair is user: list of items the user has rated and the corresponding ratings;
Obtaining, from that output data through a first reduce module, output data whose key-value pair is item pair: its difference and the user who produced the difference.
The step of computing, for the training dataset, the item pairs that have a difference relationship, the total number of users, the rating sum, and the average difference includes:
Generating, through a combiner module from the output data whose key-value pair is item pair: its difference and the user who produced the difference, output data whose key-value pair is item pair: its difference and a count list of the users who produced the difference;
Obtaining, from that output data through a second reduce module, output data whose key-value pair is item: computation flag, item pairs that have a difference relationship, total number of users, rating sum, and average difference.
The step of performing distributed data processing on the training dataset and the test dataset includes:
Joining the training dataset with the test dataset through a third map module to obtain output data whose key-value pair is user: list of joined records of the user;
Obtaining, from that output data through a third reduce module, output whose key-value pair is user: prediction-required flag, the items the user has rated with their ratings, and the items to be predicted.
The big data processing method for a recommender system provided by the embodiments of the present invention uses the Hadoop platform to implement a big data processing design for content-based recommendation, a big data processing design for item-based collaborative recommendation, a big data processing design for a cold-start optimization scheme, and a big data processing design for the SlopeOne algorithm. Through a parallel execution mechanism it greatly improves the computational efficiency of big data processing. Moreover, the big data processing design for the SlopeOne algorithm supports incremental computation after parallelization and solves the problem of insufficient memory that may arise during intermediate computation; it also uses a combiner module (Combiner) to optimize the algorithm further, which further improves computational efficiency and reliability.
Other features and advantages of the present invention will be set forth in the description that follows, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below:
Fig. 1 is a flow diagram of the big data processing method for a recommender system provided by an embodiment of the present invention;
Fig. 2 is an application flow diagram of the content-based recommendation big data processing scheme provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of the item-based collaborative recommendation big data processing scheme provided by an embodiment of the present invention;
Fig. 4 is an application flow diagram of the item-based collaborative recommendation big data processing scheme provided by an embodiment of the present invention;
Fig. 5 is a flow diagram of the optimization scheme big data processing scheme provided by an embodiment of the present invention;
Fig. 6 is an application flow diagram of the optimization scheme big data processing scheme provided by an embodiment of the present invention;
Fig. 7 is a flow diagram of the SlopeOne algorithm processing scheme provided by an embodiment of the present invention;
Fig. 8 is an application flow diagram of the SlopeOne algorithm processing scheme provided by an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. It should be noted that, as long as no conflict arises, the embodiments of the invention and the features in the embodiments can be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.
An embodiment of the present invention provides a big data processing method for a recommender system. The method is implemented on the big data processing platform Hadoop and includes a big data processing design for content-based recommendation, a big data processing design for item-based collaborative recommendation, a big data processing design for a cold-start optimization scheme, and a big data processing design for the SlopeOne algorithm.
As shown in Fig. 1 and Fig. 2, the big data processing design for content-based recommendation in the method provided by the embodiment of the present invention includes steps 101 to 104. In step 101, a user-preference utility matrix and item data are obtained. A recommender system generally has two classes of elements: users and items. Users have preferences for certain items; once this preference information has been extracted from the data, it is represented as a utility matrix, and data processing then only needs to operate on the utility matrix. A utility matrix is generally represented as in Table 1 below:
Table 1 utility matrix
Each value in the utility matrix represents how much a user likes an item. A blank means the user's preference for that item is unknown, and predicting it is the goal of the recommender system.
In step 102, distributed data processing is performed on the utility matrix to obtain the list of items each user has rated and, for each item, the list of users who rated it and their ratings; distributed data processing is performed on the item data to obtain the similar-item list of each item. More specifically, in step 102 a first Map-Reduce module obtains from the utility matrix first output data whose key-value pair is user: list of items the user has rated. The Map-Reduce processing framework is mainly used for parallel computation over large-scale datasets. Its main idea comes from divide-and-conquer algorithms and functional programming, and the core methods of the model are "map" and "reduce", that is, the Map method and the Reduce method. During development, a Map function is specified to map one set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function is specified to combine all mapped values that share the same key, which yields the result. Specifically, the Map phase splits the data and turns each input record into key-value form; after shuffling and sorting, its output is passed to the Reduce phase, which performs the reduction. The Map-Reduce framework as a whole splits big data and sends the pieces to different machines for processing. In this step the utility matrix is the input of the first Map-Reduce module; after the corresponding Map-Reduce processing the module outputs <User:{Item1, Item2, Itemn ...}>, where the key is the user User and the value is the list of items Item the user has rated. This gives the items that each user in the utility matrix has rated.
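The map, shuffle, and reduce phases described above can be sketched in a few lines of Python. This is a minimal single-process stand-in, not Hadoop code; the helper `run_map_reduce`, the sample rows, and the (user, item, rating) row layout are assumptions made for illustration.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """In-memory stand-in for one Map-Reduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # Map: emit key-value pairs
            groups[key].append(value)          # Shuffle: group values by key
    # Reduce: combine all values that share a key, in sorted key order
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# Hypothetical utility-matrix rows in (user, item, rating) form.
utility = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i3", 2)]

def map_user_item(row):
    user, item, _rating = row
    yield user, item

def reduce_item_list(_user, items):
    return sorted(items)

# First output data: <User:{Item1, Item2, ...}>
first_output = run_map_reduce(utility, map_user_item, reduce_item_list)
```

In a real cluster the grouping step is carried out by the framework across machines; here a dictionary plays that role.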
In parallel with the above, a second Map-Reduce module obtains from the utility matrix second output data whose key-value pair is item: list of users who rated the item and their ratings. The utility matrix is the input of the second Map-Reduce module; after the corresponding Map-Reduce processing the module outputs <Item:{User1, User2 ...}>, where the key is the item Item and the value is the list of rating users User for that item and their ratings. This gives, for each item in the utility matrix, which users rated it and the specific ratings.
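As a rough illustration of the second output data, the following Python sketch groups the same kind of rows by item instead of by user. The function name `second_job` and the sample rows are hypothetical.

```python
from collections import defaultdict

def second_job(rows):
    """Group (user, item, rating) rows by item: <Item:{(User, rating), ...}>."""
    by_item = defaultdict(list)
    for user, item, rating in rows:
        by_item[item].append((user, rating))   # map emits item -> (user, rating)
    # reduce: one sorted list of (user, rating) pairs per item
    return {item: sorted(pairs) for item, pairs in by_item.items()}

# Hypothetical utility-matrix rows.
utility = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i3", 2)]
second_output = second_job(utility)
```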
In parallel with the above, a third Map-Reduce module obtains from the item data third output data whose key-value pair is item: similar-item list of the item. The third Map-Reduce module has two item-data inputs: the first is the standard Map input, and the second is data that the module reads in at the same time during the Map computation. For each item arriving on the standard Map input, the module scans the whole item data read in during the Map computation, computes the N most similar items from the feature relationships between items, and outputs the result, where the key is the item and the value is the item's similar-item list: <Item1:{Item2, Item3 ...}>. This gives the items that are most similar to the key item.
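The text does not say which similarity measure the third module uses. The sketch below assumes Jaccard similarity over hypothetical item feature sets, purely to show the "scan the full item data for each input item and keep the top N" pattern.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_items(item_features, n):
    result = {}
    for item, feats in item_features.items():           # standard Map input
        scored = [(jaccard(feats, other_feats), other)   # scan of the side input
                  for other, other_feats in item_features.items()
                  if other != item]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by name
        result[item] = [other for score, other in scored[:n] if score > 0]
    return result

# Hypothetical item data: item -> set of features.
item_features = {
    "i1": {"action", "scifi"},
    "i2": {"action", "scifi", "space"},
    "i3": {"romance"},
    "i4": {"scifi", "space"},
}
third_output = similar_items(item_features, 2)
```

Scanning every item against every other item is quadratic in the number of items, which is why the text distributes the outer loop across Map tasks.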
In step 103, distributed data processing is performed on each item's list of rating users and ratings and on each item's similar-item list to obtain each user's recommended items and their predicted-weight list. The processing is as follows: from the second and third output data, a fourth Map-Reduce module obtains fourth output data whose key-value pair is user: the user's recommended items and predicted-weight list. The output <Item:{User1, User2 ...}> of the second Map-Reduce module and the output <Item1:{Item2, Item3 ...}> of the third Map-Reduce module are the inputs of the fourth Map-Reduce module, which joins the two datasets to obtain each user's recommended item list: <User:{predictItem ...}>, where the key is the user User and the value is the list of the user's recommended items predictItem and the corresponding predicted weights.
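The join in the fourth module can be pictured as follows. How the predicted weight is computed is not stated in the text; this sketch assumes it is the mean of the ratings the user gave to items whose similar-item lists contain the candidate. All names and data are hypothetical.

```python
from collections import defaultdict

def fourth_job(item_raters, item_similar):
    """Join item -> raters with item -> similar items on the item key."""
    contributions = defaultdict(lambda: defaultdict(list))
    for item, raters in item_raters.items():
        for candidate in item_similar.get(item, []):
            for user, rating in raters:
                contributions[user][candidate].append(rating)
    output = {}
    for user, candidates in contributions.items():
        # Assumed weighting: average of the contributing ratings.
        weighted = [(cand, sum(rs) / len(rs)) for cand, rs in candidates.items()]
        output[user] = sorted(weighted, key=lambda pair: (-pair[1], pair[0]))
    return output

item_raters = {"i1": [("u1", 5), ("u2", 4)], "i2": [("u1", 3)]}   # second output
item_similar = {"i1": ["i2", "i3"], "i2": ["i1", "i3"]}           # third output
fourth_output = fourth_job(item_raters, item_similar)
```

Note that the candidate list can still contain items the user has already rated; removing them is left to the next step.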
Then, in step 104, distributed data processing is performed on the list of items the user has rated and on the user's recommended items and predicted-weight list to obtain the user's item recommendation result. From the first and fourth output data, a fifth Map-Reduce module obtains an output result whose key-value pair is user: the user's item recommendation result. The first output data <User:{Item1, Item2, Itemn ...}> from the first Map-Reduce module and the fourth output data <User:{predictItem ...}> from the fourth Map-Reduce module are the inputs of the fifth Map-Reduce module, which joins the data and removes duplicate items to obtain the final recommendation result: the items a user has already rated are removed from the fourth output data, which leaves the final recommendations for that user among the items the user has not yet rated.
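The filtering performed by the fifth module amounts to a per-user set difference. A minimal sketch, with hypothetical names and data:

```python
def remove_rated(first_output, fourth_output):
    """Drop candidates the user has already rated, keeping the weight order."""
    final = {}
    for user, candidates in fourth_output.items():
        seen = set(first_output.get(user, []))
        final[user] = [(item, weight) for item, weight in candidates
                       if item not in seen]
    return final

first_output = {"u1": ["i1", "i2"], "u2": ["i1"]}
fourth_output = {
    "u1": [("i2", 5.0), ("i3", 4.0), ("i1", 3.0)],
    "u2": [("i2", 4.0), ("i3", 4.0)],
}
recommendations = remove_rated(first_output, fourth_output)
```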
As shown in Fig. 3 and Fig. 4, the big data processing method for a recommender system provided by the embodiment of the present invention also includes a big data processing design for item-based collaborative recommendation, which includes steps 201 to 204.
Steps 201 and 202 are implemented in the same way as in the content-based recommendation design above. The utility matrix is the input of the first Map-Reduce module, which after the corresponding Map-Reduce processing outputs <User:{Item1, Item2, Itemn ...}>, where the key is the user User and the value is the list of items Item the user has rated. This gives the items that each user in the utility matrix has rated.
In step 203, distributed data processing is first performed on the list of items each user has rated in the utility matrix to obtain the user's similar-user list. That is, a sixth Map-Reduce module obtains from the users' rated-item lists fifth output data whose key-value pair is user: the user's similar-user list. The sixth Map-Reduce module has two data inputs: the first is the standard Map input, and the second is data that the module reads in at the same time during the Map computation. For each user's behavior information in the first output data arriving on the standard Map input, the module scans all user behavior information in the first output data read in during the Map computation, computes the N users with the most similar behavior from the feature relationships of user behavior, and outputs <User:{sUser ...}>, where the key is the user User and the value is the list of users sUser similar to that user.
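One simple way to read "most similar behavior" is the number of items two users have both rated. The sketch below uses that overlap count as an assumed similarity and builds an inverted index so that only users sharing at least one item are compared; names and data are hypothetical.

```python
from collections import Counter, defaultdict

def similar_users(user_items, n):
    # Inverted index: item -> users who rated it.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    result = {}
    for user, items in user_items.items():
        overlap = Counter()
        for item in items:
            for other in item_users[item]:
                if other != user:
                    overlap[other] += 1        # one more co-rated item
        ranked = sorted(overlap.items(), key=lambda pair: (-pair[1], pair[0]))
        result[user] = [other for other, _count in ranked[:n]]
    return result

# Hypothetical first output data: user -> rated items.
user_items = {"u1": ["i1", "i2", "i3"], "u2": ["i1", "i2"], "u3": ["i3"], "u4": ["i9"]}
fifth_output = similar_users(user_items, 2)
```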
Finally, in step 204, distributed data processing is performed on the user's similar-user list and the item data to obtain the user's item recommendation result. From the fifth output data and the item data, a seventh Map-Reduce module obtains an output result whose key-value pair is user: the user's item recommendation result list. The fifth output data is the input of the seventh Map-Reduce module, which processes it in parallel and retrieves the item data from the HBase database to obtain the final recommendation result for each user, <User:{rItem ...}>, where the key is the user User and the value is the list of recommended items rItem for that user.
As shown in Fig. 5 and Fig. 6, the big data processing method for a recommender system provided by the embodiment of the present invention also includes a big data processing design for the cold-start optimization scheme, which includes steps 301 to 304.
In step 301, user data is obtained and distributed clustering is performed to obtain a clustering result. In this step the open-source Hadoop-based machine learning library Mahout can be used to perform distributed clustering on the user data and obtain the clustering result.
In step 302, distributed data processing is performed on the clustering result, the list of items the users have rated, and the item data to obtain the materialized features of each cluster. The list of items a user has rated is obtained in the same way as in the content-based recommendation design and the item-based collaborative recommendation design above: the first Map-Reduce module outputs <User:{Item1, Item2, Itemn ...}>, where the key is the user User and the value is the list of items Item the user has rated.
Then, from the list of items the users have rated and the clustering result data, an eighth Map-Reduce module obtains sixth output data whose key-value pair is cluster: list of items corresponding to the cluster. That is, the eighth Map-Reduce module performs a Join operation and outputs <Cluster:{Item ...}>, where the key is the cluster Cluster and the value is the list of items Item contained in that cluster, which shows which items each cluster has.
Next, from the sixth output data and the item data, a ninth Map-Reduce module obtains seventh output data whose key-value pair is cluster: materialized-feature list of the cluster. The seventh output data obtained by this Map-Reduce processing is <Cluster:{feature ...}>, where the key is the cluster Cluster and the value is the materialized-feature list of that cluster.
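The two joins in step 302 can be sketched as below. The text does not define how materialized features are derived from the items of a cluster; this sketch assumes they are the most frequent item features within the cluster. All names, the `top_k` parameter, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_items(user_cluster, user_items):
    """Eighth job: join user -> cluster with user -> items, keyed by cluster."""
    out = defaultdict(set)
    for user, cluster in user_cluster.items():
        out[cluster].update(user_items.get(user, []))
    return {cluster: sorted(items) for cluster, items in out.items()}

def cluster_features(cluster_to_items, item_features, top_k):
    """Ninth job: join cluster -> items with item data, keep frequent features."""
    out = {}
    for cluster, items in cluster_to_items.items():
        counts = Counter(f for item in items for f in item_features.get(item, []))
        ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
        out[cluster] = [feature for feature, _count in ranked[:top_k]]
    return out

user_cluster = {"u1": "c0", "u2": "c0", "u3": "c1"}                 # clustering result
user_items = {"u1": ["i1", "i2"], "u2": ["i2"], "u3": ["i3"]}      # first output data
item_features = {"i1": ["action", "scifi"], "i2": ["scifi"], "i3": ["romance"]}

sixth_output = cluster_items(user_cluster, user_items)
seventh_output = cluster_features(sixth_output, item_features, 2)
```

A new user who picks cluster c0 would then inherit the feature list of c0 as a starting profile.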
In step 303, the materialized features of a new user are obtained from the materialized-feature data of each cluster according to the cluster the new user selects. After a new user enters the system and selects a cluster center, the new user's materialized features are obtained by referring to the materialized features of each cluster obtained in step 302.
In step 304, a distributed search of the item index is performed based on the new user's materialized features to obtain the new user's item recommendation result. A tenth Map-Reduce module performs a distributed search of the item index on HDFS based on the new user's materialized features and obtains an output result whose key-value pair is user: the user's item recommendation result list.
As shown in Fig. 7 and Fig. 8, the big data processing method for a recommender system provided by the embodiment of the present invention also includes a big data processing design for the SlopeOne algorithm. At present the SlopeOne algorithm is implemented only on a single machine, so the present invention proposes an effective parallel SlopeOne computation scheme that builds on the simplicity and efficiency of SlopeOne to implement big data processing for the recommender system.
This scheme decomposes each part of the SlopeOne algorithm into Map-Reduce jobs and lets each Map-Reduce job handle part of the parallel SlopeOne computation. The Map-Reduce jobs have serial dependencies: a later job needs the output of one or more earlier jobs as its input. Hadoop provides the dedicated programming interfaces JobControl and ControlledJob to control the dependencies between related Map-Reduce jobs, which ultimately forms a serial structure that implements the whole SlopeOne algorithm.
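The dependency control described above is a topological ordering of jobs. The following Python sketch is not the Hadoop JobControl API; it only illustrates the idea with the standard-library `graphlib` module (Python 3.9+). The job names and the dependency edges are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs whose output it needs.
dependencies = {
    "collect_user_ratings": set(),
    "join_train_and_test": set(),
    "item_pair_differences": {"collect_user_ratings"},
    "difference_statistics": {"item_pair_differences"},
    "slope_one_prediction": {"difference_statistics", "join_train_and_test"},
}

def execution_order(deps):
    """Return one valid serial order in which every job follows its inputs."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

A scheduler such as JobControl additionally runs independent jobs concurrently and tracks failures; the ordering constraint is the part shown here.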
Specifically, the scheme includes steps 401 to 404. In step 401, a training dataset containing item IDs, the IDs of the users who rated each item, and the rating values is obtained, together with a test dataset containing user IDs and the IDs of the items to be predicted for each user. The present invention divides the input data of parallel SlopeOne into a training dataset and a test dataset. The formats of the two datasets are agreed as follows, and both are plain text:
For the training dataset: <item ID, user ID, the user's rating of the item, time>
Example data:
1 1 3 881250949
2 2 3 891717742
For the test dataset: <user ID, ID of the item to be predicted>
Example data:
1 1
1 2
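A small parser for the two text formats might look like the sketch below. It assumes whitespace-separated fields and the column order read from the format descriptions above (item ID first in training records, user ID first in test records); the type and function names are hypothetical.

```python
from typing import NamedTuple

class TrainingRecord(NamedTuple):
    item_id: int
    user_id: int
    rating: float
    timestamp: int

class PredictionRequest(NamedTuple):
    user_id: int
    item_id: int

def parse_training_line(line):
    # Assumed column order: item ID, user ID, rating, time.
    item_id, user_id, rating, timestamp = line.split()
    return TrainingRecord(int(item_id), int(user_id), float(rating), int(timestamp))

def parse_test_line(line):
    # Assumed column order: user ID, item ID to be predicted.
    user_id, item_id = line.split()
    return PredictionRequest(int(user_id), int(item_id))

sample = parse_training_line("2 2 3 891717742")
```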
In the present invention the training dataset is computed from Netflix data and the test dataset is obtained from the recommender system. The invention allows the test dataset to contain <item ID, user ID> pairs that already have ratings in the training dataset; such duplicate data is removed when the algorithm runs, so it does not affect the result.
After the training dataset and the test dataset are obtained, this scheme is designed as a chain of five Map-Reduce jobs. First the training dataset is read in and the users' rating information is collected. Because SlopeOne must use the training dataset and the test dataset together, the join of the training dataset with the test dataset is performed at the same time as the rating information is collected from the training set. After the users' rating information is collected, the differences between items and the number of users are computed; this step produces a large amount of intermediate data. When it completes, the difference data and the joined data are assembled and combined with a Join operation, and finally the SlopeOne formula is used to compute the predicted ratings required by the test set.
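For reference, the quantity the five jobs compute can be written as a short single-machine function. This is a sketch of the commonly used weighted Slope One predictor, not the patent's parallel implementation: for a target item j, the average difference between j and each item i the user has rated is added to the user's rating of i, weighted by the number of users who rated both. The data and names are hypothetical.

```python
from collections import defaultdict

def slope_one_predict(ratings, user, target):
    """Weighted Slope One prediction of `user`'s rating for `target`."""
    diff_sum = defaultdict(float)   # sum of (r_target - r_item) over co-raters
    count = defaultdict(int)        # number of users who rated both items
    for user_ratings in ratings.values():
        if target not in user_ratings:
            continue
        for item, rating in user_ratings.items():
            if item != target:
                diff_sum[item] += user_ratings[target] - rating
                count[item] += 1
    numerator = denominator = 0.0
    for item, rating in ratings[user].items():
        if count[item]:
            average_diff = diff_sum[item] / count[item]
            numerator += (average_diff + rating) * count[item]
            denominator += count[item]
    return numerator / denominator if denominator else None

# Hypothetical training data: user -> {item: rating}.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
prediction = slope_one_predict(ratings, "lucy", "A")
```

With this data the average differences are A minus B = 0.5 (two users) and A minus C = 3 (one user), so the prediction for lucy on A is (2.5 × 2 + 8 × 1) / 3 = 13/3.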
In step 402, distributed data processing is performed on the training dataset to compute the item difference pairs, their differences, and the list of users who produce each difference. The user rating data, i.e. the training data, is first read in full into one Map-Reduce job. The first mapping (Map) module outputs one record <User:Item> per user rating, i.e. a key-value pair of the form user: rated item and corresponding rating. User and Item are two custom Hadoop key and value types: the key User stores the ID of the user, and the value Item stores the ID of the item the user rated and the corresponding rating.
After the Map output, the Shuffle process gathers all the items rated by a given user, and the users are fed into the first reduction (Reduce) module sorted in natural order of user ID. The input of Reduce has the form <User:List{Item, Item, ...}>, and its output is <KeyPair:KeyPairValue>, i.e. a key-value pair of the form item difference pair: difference Rating and the list of Users producing the difference. KeyPair records the IDs of the item pair, and KeyPairValue records the difference between the two items (including the user who produced it). To reduce intermediate data, only the upper triangle of the matrix is computed; the lower triangle is ignored, because in normal calculation it can be obtained simply by negating the upper triangle. This step converts List{Item, Item, ...} into KeyPairs, and the Reduce iterator in Hadoop cannot be traversed more than once. For <User:List{Item, Item, ...}> record sets that exceed memory, the present invention therefore proposes the following solution: check the size of the stored object against a given threshold (for example 0.9); if the stored object exceeds this value, write the record set to the local hard disk and then process it iteratively. This incurs some performance loss but does not cause the task to fail.
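The Map, Shuffle, and Reduce stages of step 402 can be sketched in plain Python as follows. This is an illustration, not the Hadoop implementation: the function names are hypothetical, timestamps are dropped, and the sign convention (rating of the lower item ID minus rating of the higher item ID) is an assumption, since the text does not state one.

```python
from collections import defaultdict
from itertools import combinations


def map_ratings(training):
    """Map: emit (user, (item, rating)) for every training record."""
    for item, user, rating in training:
        yield user, (item, rating)


def shuffle(pairs):
    """Shuffle: group values by key and return the keys in natural order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_item_pairs(user, items):
    """Reduce: emit ((item_a, item_b), (difference, user)) for the upper triangle only."""
    for (item_a, rating_a), (item_b, rating_b) in combinations(sorted(items), 2):
        yield (item_a, item_b), (rating_a - rating_b, user)


# (item, user, rating) records of two hypothetical users.
training = [(1, 1, 5.0), (2, 1, 3.0), (3, 1, 2.0), (1, 2, 3.0), (2, 2, 4.0)]
pair_diffs = [
    record
    for user, items in shuffle(map_ratings(training))
    for record in reduce_item_pairs(user, items)
]
```

Because the items are sorted before pairing, every emitted key has the lower item ID first, which corresponds to keeping only the upper triangle of the difference matrix.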
In step 403, based on the item pairs in the training dataset, their differences, and the lists of users producing the differences, distributed data processing computes output data containing the item pairs that have a difference relationship, the total number of users, the rating sum, and the mean difference. This step builds on step 402; its main purpose is to compute the mean difference of every item pair in the training set. The Reduce of step 402 outputs item pairs, each of which has the following form:
<Item1-Item2:{<User1,Rating>}>
That is, the key-value pair is item difference pair: its difference and the user who produced that difference.
This step generates:
<Item1-Item2:{<averageDiffRating>}>
That is, the key-value pair is item difference pair: its mean rating difference. In the Map stage, the data from the first step could be output directly without any processing. In practice, however, this makes the amount of data the reduce must process excessive and, given the efficiency of network transfer, puts considerable pressure on the reduce, so this process needs suitable optimization to improve the efficiency of the reduce. Hadoop allows a merging module (Combiner) to be invoked after the Map function has run on each machine, performing a preliminary Reduce on the machine where that Map ran.
In this step, the meaningful quantities in the Map output are, for each identical item pair, the total number of users and the rating sum (averaging directly would be wrong, because a local average is not equal to the global average). The merging module Combiner is therefore used to compute the count and the rating sum, which then serve as the input of the Reducer. Thus, from output data whose key-value pairs are item difference pair: its difference and the user producing the difference, the merging module generates output data whose key-value pairs are item difference pair: its difference and the count of users producing the difference: <KeyPair, KeyPairValue>. The output key KeyPair is the same key as in the previous step, while in KeyPairValue the number of differences between the items (including the users producing them) is replaced by the count computed by the Combiner. After copy and shuffle, these data become the input of the Reducer, which has the form <KeyPair, List{KeyPairValue, KeyPairValue, ...}>.
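The reason the Combiner forwards counts and sums rather than averages can be shown with a small Python sketch. The function names and the sample data are hypothetical; each list stands for the output of one Map task.

```python
from collections import defaultdict


def combine(local_records):
    """Combiner: per item pair, fold one Map task's output into (count, sum)."""
    partial = defaultdict(lambda: [0, 0.0])
    for pair, (diff, _user) in local_records:
        partial[pair][0] += 1
        partial[pair][1] += diff
    return [(pair, (count, total)) for pair, (count, total) in partial.items()]


def reduce_mean(combined_records):
    """Reducer: merge the partial (count, sum) values and derive the mean."""
    merged = defaultdict(lambda: [0, 0.0])
    for pair, (count, total) in combined_records:
        merged[pair][0] += count
        merged[pair][1] += total
    return {pair: (count, total, total / count) for pair, (count, total) in merged.items()}


# Output of two hypothetical Map tasks for the same item pair (1, 2).
task_a = [((1, 2), (2.0, 1)), ((1, 2), (4.0, 2)), ((1, 2), (0.0, 3))]
task_b = [((1, 2), (-2.0, 4))]
result = reduce_mean(combine(task_a) + combine(task_b))
```

Here the local averages are 2.0 and -2.0, whose mean is 0.0, while the true mean over the four differences is 1.0. Carrying (count, sum) through the Combiner gives the Reducer what it needs to recover the global value.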
After the Reducer receives its input, it aggregates by KeyPair; the corresponding output value is the number of KeyPairValue entries and their mean. This Map-Reduce job is in fact the end point of the statistical computation over all the training data, so we must now consider how to retain this result so that this part of the data need not be recomputed when computing again, i.e. incremental computation. The scheme given by the present invention is to write the number of item pairs and the current rating sum into every record. During incremental computation the result of the previous computation can then be restored, and the counts and rating sums of the incrementally computed item pairs can conveniently be added to it. In addition, the records produced by the Reducer may optionally be tagged with a special flag, 0, indicating that they are data from the training dataset.
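A minimal Python sketch of this incremental idea follows. The function name and the dictionary layout are assumptions for illustration; the point is only that storing the count and the sum, not just the mean, lets new statistics be added without revisiting the old training data.

```python
def merge_increment(previous, increment):
    """Add the (count, sum) statistics of new ratings to the stored ones.

    Both arguments map an item pair to (total_users, total_rating); the mean
    is derived again from the merged totals rather than from stored averages.
    """
    merged = dict(previous)
    for pair, (count, total) in increment.items():
        old_count, old_total = merged.get(pair, (0, 0.0))
        merged[pair] = (old_count + count, old_total + total)
    return {pair: (c, t, t / c) for pair, (c, t) in merged.items()}


stored = {(1, 2): (4, 4.0)}                        # result of the last computation
new_batch = {(1, 2): (1, 6.0), (2, 3): (2, 3.0)}   # statistics of newly arrived ratings
updated = merge_increment(stored, new_batch)
```

The pair (1, 2) moves from 4 users with sum 4.0 to 5 users with sum 10.0, and the pair (2, 3) is added as a new record.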
That is, from the output data <KeyPair, List{KeyPairValue, KeyPairValue, ...}>, whose key-value pairs are item pair: its difference and the list of counts of users producing the difference, the second reduction module obtains output data whose key-value pairs are item: computation flag, item with a difference relationship, total number of users, rating sum, and mean difference. The data finally output by the above process has the following format:
<baseItemID:{<flag,compareid,totaluser,totalrating,averageRating>}>
Here baseItemID is the item ID, flag is the computation flag, compareid is the ID of the item that has a difference relationship with it, totaluser is the total number of users, totalrating is the rating sum, and averageRating is the mean difference.
At this point the key no longer takes the item-pair form, mainly to accommodate the subsequent join with the test set. Because the item pairs in step 402 are in upper-triangular form, each record is already unique.
In step 404, distributed data processing is performed on the training dataset and the test dataset to obtain the items each user has rated together with their ratings, and the items to be predicted. This step runs in parallel with steps 402 and 403 and does not depend on their computation. It mainly processes the training dataset and the test dataset and obtains the items that need prediction.
In this step, the third mapping module joins the training dataset with the test dataset to obtain output data whose key-value pairs are user: list of joined records of the user. In the first step of the SlopeOne design, the training dataset and the test dataset have different formats: each line of the training dataset has 4 fields, while each line of the test dataset has only 2. This difference in format is used to tell which dataset a record input to the Map module comes from. The Map output is the joined record set of the two datasets: <LongWritable, ItemPredictStatus>, where LongWritable is the ID of the user and ItemPredictStatus, a custom Hadoop value type, represents the joined record list of the user. Optionally, the data it records is <item, tag, user, rating>. For the training dataset the tag is 0x0 and the other three values are those read in; for the test dataset the tag is 0x1, the rating is set to 0, and the other two values are those read in.
Then, from the output data whose key-value pairs are user: list of joined records of the user, the third reduction module obtains key-value pairs of the form user: prediction-required flag, item the user has rated and its rating, and item to be predicted. The output of the third reduction (Reduce) module relates each item the user has not rated to the rated items that may have a difference relationship with it (SlopeOne requires the user to have rating records). The format is <userID:{<flag, basicid_has, rating, basicid_no>}>, where the prediction-required flag, when set to 1, indicates that this record requires a prediction for the user; basicid_has is an item the user has rated; basicid_no is the ID of the item to be predicted; and rating is the user's rating of basicid_has.
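The join in step 404 can be sketched in plain Python as below. The function names are hypothetical and tuples stand in for the custom ItemPredictStatus type; only the field-count test, the 0x0 and 0x1 tags, and the output layout <flag, basicid_has, rating, basicid_no> are taken from the description.

```python
from collections import defaultdict

TRAINING_TAG = 0x0
TEST_TAG = 0x1


def map_join(line):
    """Map: tell the datasets apart by field count and emit (user, record)."""
    fields = line.split()
    if len(fields) == 4:                      # training: item, user, rating, time
        item, user, rating = int(fields[0]), int(fields[1]), float(fields[2])
        return user, (item, TRAINING_TAG, user, rating)
    if len(fields) == 2:                      # test: user, item to predict
        user, item = int(fields[0]), int(fields[1])
        return user, (item, TEST_TAG, user, 0.0)
    raise ValueError(f"unexpected record: {line!r}")


def reduce_join(user, records):
    """Reduce: pair every item to predict with every item the user has rated."""
    rated = {item: rating for item, tag, _u, rating in records if tag == TRAINING_TAG}
    wanted = [item for item, tag, _u, _r in records if tag == TEST_TAG]
    return [
        (user, (1, has, rating, no))
        for no in wanted if no not in rated   # pairs already rated are rejected
        for has, rating in sorted(rated.items())
    ]


lines = ["1 1 5 881250949", "2 1 3 881250950", "1 3", "1 1"]
grouped = defaultdict(list)
for user, record in map(map_join, lines):
    grouped[user].append(record)
needs = [row for user, records in sorted(grouped.items()) for row in reduce_join(user, records)]
```

In this example user 1 asks for items 3 and 1; item 1 is already rated and is dropped, so item 3 is paired with each of the two rated items.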
In step 405, based on the item pairs with a difference relationship in the training dataset, the total number of users, the rating sum, the mean difference, the items each user has rated with their ratings, and the items to be predicted, distributed data processing obtains the values needed for the SlopeOne prediction. These are: the item the user has rated, the item whose rating is to be computed with SlopeOne, the rating of the rated item, the rating difference over all users, the total number of users, and the users who have rated both the rated item and the item whose rating is to be computed.
Step 405 uses the results of steps 403 and 404 to determine for which items a user's rating can be predicted. Step 403 has already computed the difference relationships between item pairs in the training set, and step 404 has determined which items each user needs rated. Consider, however, a case such as Item1-Item4 in Table 1: if no difference exists between them, then no user's rating of Item1 can be used to predict Item4, nor can Item4 be used to predict Item1; only items such as Item2 or Item3 that do have a difference with Item1 or Item4 can be used for the calculation. The purpose of this step is precisely to handle this situation.
The input of Map is the Reduce output of steps 403 and 404. The output key is the item pair reconstructed according to the flag in the outputs of steps 403 and 404; the format is <KeyPair, value>, where value is output directly, unchanged.
The Reduce input is <KeyPair, List{<value>, ...}>. By relating the item-pair data computed from the training set to the item pairs in the test set, Reduce finally outputs the computable records <userID, <basicItem, targetItem, user, basicRating, diffRating, totaluser>>. The key is userID, and the value contains: basicItem, the item already rated; targetItem, the item whose rating is to be computed with SlopeOne; basicRating, the rating of basicItem; diffRating, the rating difference over all users (between targetItem and basicItem); totaluser, the total number of users; and user, the users who have rated both basicItem and targetItem.
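A Python sketch of the rekeying and join in step 405 follows. It is an illustration under several assumptions: the function names and string tags are hypothetical, the `user` list field of the output record is omitted, and the stored mean is assumed to be the rating of the lower item ID minus the rating of the higher item ID, so it is negated when the target item has the higher ID.

```python
from collections import defaultdict


def map_diff(base_item, value):
    """Map a step 403 record: the key becomes the item pair."""
    _flag, compare_id, total_user, _total_rating, average = value
    return (base_item, compare_id), ("diff", total_user, average)


def map_need(user, value):
    """Map a step 404 record onto the same upper-triangular pair key."""
    _flag, has, rating, no = value
    return (min(has, no), max(has, no)), ("need", user, has, rating, no)


def reduce_computable(pair, values):
    """Emit (user, (basic, target, basic_rating, diff, total_user)) when a difference exists."""
    diffs = [v for v in values if v[0] == "diff"]
    if not diffs:
        return []                             # no difference for this pair: not computable
    _tag, total_user, average = diffs[0]
    out = []
    for _tag, user, has, rating, no in (v for v in values if v[0] == "need"):
        # Orient the stored difference as (target - basic).
        diff = average if no == pair[0] else -average
        out.append((user, (has, no, rating, diff, total_user)))
    return out


diff_records = [(1, (0, 2, 4, 4.0, 1.0)), (2, (0, 3, 2, 3.0, 1.5))]
need_records = [(7, (1, 1, 5.0, 3)), (7, (1, 2, 3.0, 3))]
grouped = defaultdict(list)
for key, value in [map_diff(*r) for r in diff_records] + [map_need(*r) for r in need_records]:
    grouped[key].append(value)
computable = [row for pair, values in sorted(grouped.items()) for row in reduce_computable(pair, values)]
```

In this example no difference is known for the pair (1, 3), so user 7's rating of item 1 cannot be used to predict item 3; only the rating of item 2 yields a computable record.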
In step 406, based on the values obtained for the SlopeOne prediction, distributed data processing computes the predicted values of the items each user needs predicted. This is the final step of the parallel computation; it mainly computes the predicted values from the result of step 405 and the SlopeOne formula. The output key of Map is user ID-ID of the item to be predicted, and the output value is passed through unprocessed. Reduce implements the SlopeOne formula and outputs <user ID-ID of the item to be predicted, predicted value>.
The key code of Reduce is as follows:
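The patent's own Reduce listing does not appear in this text. As a stand-in, the following Python sketch shows one common form of the calculation, the weighted SlopeOne formula, in which each rated item contributes its rating plus the difference, weighted by the number of users behind that difference. Whether the patent uses exactly this weighting is an assumption, and the function names are hypothetical. Records have the layout (basic item, target item, basic rating, difference oriented as target minus basic, total users).

```python
from collections import defaultdict


def map_prediction_key(user, value):
    """Map: the key becomes (user ID, item ID to predict); the value passes through."""
    _basic_item, target_item, *_rest = value
    return (user, target_item), value


def reduce_slope_one(values):
    """Reduce: weighted SlopeOne over all rated items that share a difference."""
    numerator = sum((basic_rating + diff) * total_user
                    for _basic, _target, basic_rating, diff, total_user in values)
    denominator = sum(total_user for *_head, total_user in values)
    return numerator / denominator


records = [(7, (1, 3, 5.0, -2.0, 3)), (7, (2, 3, 3.0, -1.5, 1))]
grouped = defaultdict(list)
for key, value in (map_prediction_key(*r) for r in records):
    grouped[key].append(value)
predictions = {key: reduce_slope_one(values) for key, values in grouped.items()}
```

For user 7 and item 3, the two contributions are (5.0 - 2.0) with weight 3 and (3.0 - 1.5) with weight 1, giving 10.5 / 4 = 2.625.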
In the above design of the SlopeOne big data processing flow, step 402 is mainly the statistical processing of the training dataset and produces a large amount of intermediate data. Step 403 is where the incremental algorithm of the whole scheme is implemented; because it uses compact storage, step 403 can save a great deal of storage space. Step 404 processes the test set and is also the place for later optimization of the test dataset. Step 405 uses the results of steps 403 and 404 to analyze and compute the feasibility of prediction, and is likewise the computation node that addresses the sparsity of SlopeOne. Step 406 is the Map-Reduce job that produces the predictions of the whole algorithm. Throughout the design, steps 403 and 405 must take insufficient memory into account because of the Hadoop iterator; we solve this problem by spilling data to disk.
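The dependencies among the five Map-Reduce jobs described above can be written down as a small graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) purely to illustrate the ordering; it is not the Hadoop job-control API, and the job names are hypothetical labels.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose output it consumes.
dependencies = {
    "step402": set(),
    "step403": {"step402"},
    "step404": set(),                      # independent of steps 402 and 403
    "step405": {"step403", "step404"},
    "step406": {"step405"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())     # jobs whose prerequisites are all done
    waves.append(ready)
    sorter.done(*ready)
```

Each entry of `waves` lists jobs that could run at the same time: steps 402 and 404 first, then step 403, then step 405, then step 406.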
The big data processing method for a recommendation system provided by the embodiments of the present invention implements, on the Hadoop platform, big data processing designs for content-based recommendation, item-based collaborative recommendation, a cold-start optimization scheme, and the SlopeOne algorithm; through its parallel execution mechanism it greatly improves the computational efficiency of big data processing. In addition, the SlopeOne big data processing design provided by the present invention supports incremental computation after parallelization, solves the memory shortage that may arise in intermediate computation, and uses the merging module Combiner to optimize the algorithm further, which further improves computational efficiency and reliability.
Although the embodiments are disclosed as above, the content described is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the art to which the present invention pertains may make any modification or change in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be defined by the appended claims.
Claims (10)
1. A big data processing method for a recommendation system, characterized by comprising:
obtaining a user preference utility matrix and item data;
performing distributed data processing on the utility matrix to obtain the list of items evaluated by each user in the utility matrix and the list of evaluating users and their ratings corresponding to each item, and performing distributed data processing on the item data to obtain the similar-item list of each item;
performing distributed data processing on the list of evaluating users and ratings corresponding to each item and the similar-item list of each item to obtain the recommended items of the user and their predicted weight list;
performing distributed data processing on the list of items evaluated by the user and the recommended items of the user and their predicted weight list to obtain the item recommendation result for the user.
2. The big data processing method for a recommendation system according to claim 1, characterized by further comprising:
performing distributed data processing on the list of items evaluated by the user in the utility matrix to obtain the similar-user list of the user;
performing distributed data processing on the similar-user list of the user and the item data to obtain the item recommendation result for the user.
3. The big data processing method for a recommendation system according to claim 1, characterized by further comprising:
obtaining user data and performing distributed clustering on it to obtain a clustering result;
performing distributed data processing on the clustering result, the list of items evaluated by the user, and the item data to obtain the materialized features of each cluster;
obtaining the materialized features of a new user from the materialized feature data of each cluster according to the cluster selected for the new user;
performing a distributed search of the item index based on the materialized features of the new user to obtain the item recommendation result for the new user.
4. The big data processing method for a recommendation system according to claim 1, characterized by further comprising:
obtaining a training dataset containing the item ID, the ID of the user who rated the item, and the rating value, and a test dataset containing the user ID and the ID of the item to be predicted for that user;
performing distributed data processing on the training dataset to compute the item pairs therein, their differences, and the lists of users producing the differences;
performing distributed data processing on the item pairs in the training dataset, their differences, and the lists of users producing the differences to compute the item pairs having a difference relationship in the training dataset, the total number of users, the rating sum, and the mean difference;
performing distributed data processing on the training dataset and the test dataset to obtain the items the user has rated and their ratings, and the items to be predicted;
performing distributed data processing on the item pairs having a difference relationship in the training dataset, the total number of users, the rating sum, the mean difference, the items the user has rated and their ratings, and the items to be predicted to obtain the values needed for the SlopeOne prediction;
performing distributed data processing on the values needed for the SlopeOne prediction to compute the predicted values of the items the user needs predicted.
5. The big data processing method for a recommendation system according to claim 1, characterized in that the step of performing distributed data processing on the utility matrix and the item data comprises:
obtaining, from the utility matrix through a first Map-Reduce module, first output data whose key-value pairs are user: list of items evaluated by the user;
obtaining, from the utility matrix through a second Map-Reduce module, second output data whose key-value pairs are item: list of evaluating users and ratings corresponding to the item;
obtaining, from the item data through a third Map-Reduce module, third output data whose key-value pairs are item: similar-item list of the item;
the step of obtaining the recommended items of the user and their predicted weight list comprises:
obtaining, from the second and third output data through a fourth Map-Reduce module, fourth output data whose key-value pairs are user: recommended items of the user and their predicted weight list;
the step of obtaining the item recommendation result for the user comprises:
obtaining, from the first and fourth output data through a fifth Map-Reduce module, an output result whose key-value pairs are user: item recommendation result list of the user.
6. The big data processing method for a recommendation system according to claim 2, characterized in that the step of obtaining the similar-user list of the user comprises:
obtaining, from the list of items evaluated by the user through a sixth Map-Reduce module, fifth output data whose key-value pairs are user: similar-user list of the user;
the step of performing distributed data processing on the similar-user list of the user and the item data comprises:
obtaining, from the fifth output data and the item data through a seventh Map-Reduce module, an output result whose key-value pairs are user: item recommendation result list of the user.
7. The big data processing method for a recommendation system according to claim 3, characterized in that the step of obtaining the materialized features of each cluster comprises:
obtaining, from the list of items evaluated by the user and the clustering result data through an eighth Map-Reduce module, sixth output data whose key-value pairs are cluster: list of items corresponding to the cluster;
obtaining, from the sixth output data and the item data through a ninth Map-Reduce module, seventh output data whose key-value pairs are cluster: materialized feature list of the cluster;
the step of obtaining the item recommendation result for the new user comprises:
searching the item index based on the materialized features of the new user through a tenth Map-Reduce module to obtain an output result whose key-value pairs are user: item recommendation result list of the user.
8. The big data processing method for a recommendation system according to claim 4, characterized in that the step of performing distributed data processing on the training dataset comprises:
obtaining, from the training dataset through a first mapping module and a shuffle module, output data whose key-value pairs are user: items rated by the user and the corresponding rating list;
obtaining, from the output data whose key-value pairs are user: items rated by the user and the corresponding rating list, through a first reduction module, output data whose key-value pairs are item pair: its difference and the user producing the difference.
9. The big data processing method for a recommendation system according to claim 4, characterized in that the step of computing the item pairs having a difference relationship in the training dataset, the total number of users, the rating sum, and the mean difference comprises:
generating, from the output data whose key-value pairs are item pair: its difference and the user producing the difference, through a merging module, output data whose key-value pairs are item pair: its difference and the list of counts of users producing the difference;
obtaining, from the output data whose key-value pairs are item pair: its difference and the list of counts of users producing the difference, through a second reduction module, output data whose key-value pairs are item: computation flag, item pair having a difference relationship, total number of users, rating sum, and mean difference.
10. The big data processing method for a recommendation system according to claim 4, characterized in that the step of performing distributed data processing on the training dataset and the test dataset comprises:
joining the training dataset with the test dataset through a third mapping module to obtain output data whose key-value pairs are user: list of joined records of the user;
obtaining, from the output data whose key-value pairs are user: list of joined records of the user, through a third reduction module, key-value pairs of the form user: prediction-required flag, items the user has rated and their ratings, and items to be predicted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610515790.6A CN106126727A (en) | 2016-07-01 | 2016-07-01 | A kind of big data processing method of commending system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106126727A true CN106126727A (en) | 2016-11-16 |
Family
ID=57469110
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610515790.6A Pending CN106126727A (en) | 2016-07-01 | 2016-07-01 | A kind of big data processing method of commending system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126727A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106779867A (en) * | 2016-12-30 | 2017-05-31 | 中国民航信息网络股份有限公司 | Support vector regression based on context-aware recommends method and system |
WO2019128394A1 (en) * | 2017-12-29 | 2019-07-04 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method for processing fusion data and information recommendation system |
CN110020130A (en) * | 2017-10-27 | 2019-07-16 | 镇江雅迅软件有限责任公司 | A kind of news recommender system based on utility matrix |
CN111930731A (en) * | 2020-07-28 | 2020-11-13 | 苏州亿歌网络科技有限公司 | Data dump method, device, equipment and storage medium |
CN112541119A (en) * | 2020-12-08 | 2021-03-23 | 厦门诚创网络股份有限公司 | Efficient and energy-saving small recommendation system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329683A (en) * | 2008-07-25 | 2008-12-24 | 华为技术有限公司 | Recommendation system and method |
CN102609533A (en) * | 2012-02-15 | 2012-07-25 | 中国科学技术大学 | Kernel method-based collaborative filtering recommendation system and method |
CN102663128A (en) * | 2012-04-24 | 2012-09-12 | 南京师范大学 | Recommending system of large-scale collaborative filtering |
CN103678431A (en) * | 2013-03-26 | 2014-03-26 | 南京邮电大学 | Recommendation method based on standard labels and item grades |
CN105069072A (en) * | 2015-07-30 | 2015-11-18 | 天津大学 | Emotional analysis based mixed user scoring information recommendation method and apparatus |
CN105141508A (en) * | 2015-09-10 | 2015-12-09 | 天津师范大学 | Microblog system friend recommending method based on neighbor relations |
Non-Patent Citations (1)
Title |
---|
李星: "个性化推荐系统优化及其大数据处理研究", 《万方数据》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106779867A (en) * | 2016-12-30 | 2017-05-31 | 中国民航信息网络股份有限公司 | Support vector regression based on context-aware recommends method and system |
CN106779867B (en) * | 2016-12-30 | 2020-10-23 | 中国民航信息网络股份有限公司 | Support vector regression recommendation method and system based on context awareness |
CN110020130A (en) * | 2017-10-27 | 2019-07-16 | 镇江雅迅软件有限责任公司 | A kind of news recommender system based on utility matrix |
WO2019128394A1 (en) * | 2017-12-29 | 2019-07-04 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method for processing fusion data and information recommendation system |
US11061966B2 (en) | 2017-12-29 | 2021-07-13 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method for processing fusion data and information recommendation system |
CN111930731A (en) * | 2020-07-28 | 2020-11-13 | 苏州亿歌网络科技有限公司 | Data dump method, device, equipment and storage medium |
CN112541119A (en) * | 2020-12-08 | 2021-03-23 | 厦门诚创网络股份有限公司 | Efficient and energy-saving small recommendation system |
CN112541119B (en) * | 2020-12-08 | 2022-07-05 | 厦门诚创网络股份有限公司 | Efficient and energy-saving small recommendation system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106126727A (en) | A kind of big data processing method of commending system | |
CN103336790B (en) | Hadoop-based fast neighborhood rough set attribute reduction method | |
US8326760B2 (en) | Computer-based collective intelligence recommendations for transaction review | |
CN104798043B (en) | A kind of data processing method and computer system | |
CN105631003B (en) | Support intelligent index construct, inquiry and the maintaining method of mass data classified statistic | |
CN107515898B (en) | Tire enterprise sales prediction method based on data diversity and task diversity | |
CN103336791B (en) | Hadoop-based fast rough set attribute reduction method | |
CN102750286A (en) | Novel decision tree classifier method for processing missing data | |
CN107423279A (en) | A kind of information extraction and analysis method of credit financing short message | |
CN102737126A (en) | Classification rule mining method under cloud computing environment | |
CN103678436A (en) | Information processing system and information processing method | |
CN104008420A (en) | Distributed outlier detection method and system based on automatic coding machine | |
CN108345908A (en) | Sorting technique, sorting device and the storage medium of electric network data | |
CN105117426A (en) | Intelligent search system for HSCODE | |
CN102456064B (en) | Method for realizing community discovery in social networking | |
CN102622609A (en) | Method for automatically classifying three-dimensional models based on support vector machine | |
CN105808582A (en) | Parallel generation method and device of decision tree on the basis of layered strategy | |
CN107729939A (en) | A kind of CIM extended method and device towards newly-increased power network resources | |
CN103761286B (en) | A kind of Service Source search method based on user interest | |
CN110020141A (en) | A kind of personalized recommendation method and system based on improvement cluster and Spark frame | |
Hashem et al. | An Integrative Modeling of BigData Processing. | |
Mai et al. | Detecting the intellectual pathway of resilience thinking in urban and regional studies: A critical reflection on resilience literature | |
CN105426392A (en) | Collaborative filtering recommendation method and system | |
CN112214488A (en) | European style spatial data index tree and construction and retrieval method | |
CN110389932A (en) | Electric power automatic document classifying method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161116 |