CN106126727A

CN106126727A - Big data processing method for a recommender system

Info

Publication number: CN106126727A
Application number: CN201610515790.6A
Authority: CN
Inventors: 杨成; 李晨; 李星
Original assignee: Communication University of China
Current assignee: Communication University of China
Priority date: 2016-07-01
Filing date: 2016-07-01
Publication date: 2016-11-16

Abstract

The invention discloses a big data processing method for a recommender system. It belongs to the technical field of recommender systems and addresses the technical problem that traditional big data processing methods for recommender systems are difficult to develop and inefficient. The method includes: obtaining a user preference utility matrix and item data; performing distributed data processing on the utility matrix to obtain the list of items each user has rated and, for each item, the users who rated it and their ratings; performing distributed data processing on the item data to obtain the similar-item list of each item; performing distributed data processing on each item's list of rating users and ratings and on each item's similar-item list to obtain each user's recommended items and their prediction weights; and performing distributed data processing on the list of items the user has rated and on the user's recommended items and prediction weights to obtain the item recommendation result for the user.

Description

Big data processing method for a recommender system

Technical field

The invention relates to the technical field of recommender systems, and in particular to a big data processing method for a recommender system.

Background

A recommender system performs personalized recommendation: based on a user's interest characteristics and historical behavior records, it recommends goods or items that the user is likely to be interested in. As the Internet keeps growing and the mobile Internet becomes ubiquitous, the information that represents a user on the Internet is becoming more diverse, ranging from basic demographics to dynamically changing geographic location, and users have moved from simply browsing content to supplying it. Individuals in the real world are becoming true "virtual persons" on the Internet. Personalized recommender systems arose from the need to analyze large amounts of unstructured, heterogeneous user information effectively, to draw reliable and satisfactory conclusions, and to solve the information overload problem. A personalized recommender system is an advanced business intelligence platform built on mining massive data; its core is to use large amounts of user information to establish links between users and items, thereby achieving information discovery. In essence a recommender system is more of a service: it frees users from massive data and offers recommendations that help them make better decisions. Such services will permeate the whole Internet and every aspect of people's future lives, such as news, travel, and dining. A recommender system can be viewed as a subset of cloud computing, whose essence is the provision of elastic services.

Implementing a recommender system depends on big data processing; it is no longer just the implementation of a traditional algorithm. Processing the collected data on a scalable platform to obtain a reliable model is the fundamental role of big data processing in a recommender system.

At present there are many platform options for developing big data processing for recommender systems, but all of them require fairly complex programming skills and architectures. They place high demands on developers, who must control a large number of design details, so development is difficult and computation is inefficient.

Summary of the invention

The object of the invention is to provide a big data processing method for a recommender system, in order to solve the technical problem that traditional big data processing methods for recommender systems are difficult to develop and inefficient.

The invention provides a big data processing method for a recommender system, the method including:

obtaining a user preference utility matrix and item data;

performing distributed data processing on the utility matrix to obtain the list of items each user has rated and, for each item, the users who rated it and their ratings, and performing distributed data processing on the item data to obtain the similar-item list of each item;

performing distributed data processing on each item's list of rating users and ratings and on each item's similar-item list to obtain each user's recommended items and their prediction weights;

performing distributed data processing on the list of items the user has rated and on the user's recommended items and prediction weights to obtain the item recommendation result for the user.

The big data processing method for a recommender system provided by the invention further includes:

performing distributed data processing on the list of items each user has rated in the utility matrix to obtain the similar-user list of the user;

performing distributed data processing on the user's similar-user list and the item data to obtain the item recommendation result for the user.

Obtaining user data and performing distributed clustering on it to obtain a cluster result;

performing distributed data processing on the cluster result, the list of items each user has rated, and the item data to obtain the materialized features of each cluster;

obtaining the materialized features of a new user from the materialized feature data of each cluster according to the cluster the new user selects;

performing a distributed search of the item index based on the materialized features of the new user to obtain the item recommendation result for the new user.

Obtaining a training data set that includes item IDs, the IDs of the users who rated each item, and the rating values, and a test data set that includes user IDs and the IDs of the items to be predicted for each user;

performing distributed data processing on the training data set to compute the item pairs in it, their differences, and the list of users who produced each difference;

performing distributed data processing on the item pairs, their differences, and the lists of users who produced them to compute, for the training data set, the items having a difference relationship, the total number of users, the sum of ratings, and the mean difference;

performing distributed data processing on the training data set and the test data set to obtain the items each user has rated together with the ratings, and the items to be predicted;

performing distributed data processing on the items having a difference relationship in the training data set, the total number of users, the sum of ratings, the mean difference, the items each user has rated with their ratings, and the items to be predicted, to obtain the values required for the SlopeOne prediction;

performing distributed data processing on the values required for the SlopeOne prediction to compute the predicted value of each item the user needs predicted.

The step of performing distributed data processing on the utility matrix and the item data includes:

processing the utility matrix with a first Map-Reduce module to obtain first output data with key-value pairs of the form user: list of items rated by the user;

processing the utility matrix with a second Map-Reduce module to obtain second output data with key-value pairs of the form item: list of users who rated the item and their ratings;

processing the item data with a third Map-Reduce module to obtain third output data with key-value pairs of the form item: similar-item list of the item.

The step of obtaining the user's recommended items and their prediction weights includes:

processing the second and third output data with a fourth Map-Reduce module to obtain fourth output data with key-value pairs of the form user: list of the user's recommended items and prediction weights.

The step of obtaining the item recommendation result for the user includes:

processing the first and fourth output data with a fifth Map-Reduce module to obtain an output result with key-value pairs of the form user: list of item recommendation results for the user.

The step of obtaining the similar-user list of the user includes:

processing the list of items rated by the user with a sixth Map-Reduce module to obtain fifth output data with key-value pairs of the form user: similar-user list of the user.

The step of performing distributed data processing on the user's similar-user list and the item data includes:

processing the fifth output data and the item data with a seventh Map-Reduce module to obtain an output result with key-value pairs of the form user: list of item recommendation results for the user.

The step of obtaining the materialized features of each cluster includes:

processing the list of items rated by the user and the cluster result data with an eighth Map-Reduce module to obtain sixth output data with key-value pairs of the form cluster: list of items corresponding to the cluster;

processing the sixth output data and the item data with a ninth Map-Reduce module to obtain seventh output data with key-value pairs of the form cluster: list of materialized features of the cluster.

The step of obtaining the item recommendation result for the new user includes:

searching the item index with a tenth Map-Reduce module based on the materialized features of the new user to obtain an output result with key-value pairs of the form user: list of item recommendation results for the user.

The step of performing distributed data processing on the training data set includes:

processing the training data set with a first Map module and a shuffle module to obtain output data with key-value pairs of the form user: list of items rated by the user and the corresponding ratings;

processing that output data with a first Reduce module to obtain output data with key-value pairs of the form item pair: its difference and the user who produced the difference.

The step of computing the items having a difference relationship in the training data set, the total number of users, the sum of ratings, and the mean difference includes:

processing the output data with key-value pairs of the form item pair: its difference and the user who produced the difference with a combiner module to generate output data with key-value pairs of the form item pair: its difference and the count of users who produced the difference;

processing that output data with a second Reduce module to obtain output data with key-value pairs of the form item: computation flag, the items having a difference relationship with it, the total number of users, the sum of ratings, and the mean difference.

The step of performing distributed data processing on the training data set and the test data set includes:

joining the training data set with the test data set using a third Map module to obtain output data with key-value pairs of the form user: list of joined records for the user;

processing that output data with a third Reduce module to obtain key-value pairs of the form user: prediction-required flag, the items the user has rated with their ratings, and the items to be predicted.

The big data processing method for a recommender system provided by the embodiments of the invention uses the Hadoop platform to implement a big data processing design for content-based recommendation, a big data processing design for item-based collaborative recommendation, a big data processing design for a cold-start optimization scheme, and a big data processing design for the SlopeOne algorithm. Through its parallel execution mechanism, it greatly improves the computational efficiency of big data processing. In addition, the SlopeOne big data processing design provided by the invention supports incremental computation after parallelization, solves the out-of-memory problem that may occur during intermediate computation, and uses the Combiner module to optimize the algorithm further, improving computational efficiency and reliability.

Other features and advantages of the invention are set forth in the description that follows; in part they will become apparent from the description, or may be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.

Brief description of the drawings

To explain the technical solutions in the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below:

Fig. 1 is a flowchart of the big data processing method for a recommender system provided by an embodiment of the invention;

Fig. 2 is an application flow diagram of the content-based recommendation big data processing scheme provided by an embodiment of the invention;

Fig. 3 is a flowchart of the item-based collaborative recommendation big data processing scheme provided by an embodiment of the invention;

Fig. 4 is an application flow diagram of the item-based collaborative recommendation big data processing scheme provided by an embodiment of the invention;

Fig. 5 is a flowchart of the optimization scheme big data processing scheme provided by an embodiment of the invention;

Fig. 6 is an application flow diagram of the optimization scheme big data processing scheme provided by an embodiment of the invention;

Fig. 7 is a flowchart of the SlopeOne algorithm processing scheme provided by an embodiment of the invention;

Fig. 8 is an application flow diagram of the SlopeOne algorithm processing scheme provided by an embodiment of the invention.

Detailed description of the invention

Embodiments of the invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and put into practice. It should be noted that, provided they do not conflict, the embodiments of the invention and the features within them may be combined with one another, and the resulting technical solutions all fall within the scope of protection of the invention.

The embodiments of the invention provide a big data processing method for a recommender system. The method is implemented on the big data processing platform Hadoop and includes a big data processing design for content-based recommendation, a big data processing design for item-based collaborative recommendation, a big data processing design for a cold-start optimization scheme, and a big data processing design for the SlopeOne algorithm.

As shown in Fig. 1 and Fig. 2, the big data processing design for content-based recommendation in the method provided by the embodiments of the invention includes steps 101 to 104. In step 101, the user preference utility matrix and the item data are obtained. A recommender system generally has two kinds of elements: users and items. Users have preferences for certain items; once this preference information has been extracted from the data it is expressed as a utility matrix, and data processing then only needs to operate on the utility matrix. A utility matrix is generally represented as in Table 1 below:

Table 1 utility matrix

In the utility matrix, each value represents a user's degree of preference for an item. A blank means that the user's preference for the item is unknown, and these unknown preferences are what the recommender system aims to predict.

In step 102, distributed data processing is performed on the utility matrix to obtain the list of items each user has rated and, for each item, the users who rated it and their ratings; distributed data processing is performed on the item data to obtain the similar-item list of each item. More specifically, the first Map-Reduce module processes the utility matrix to obtain first output data with key-value pairs of the form user: list of items rated by the user.

The Map-Reduce processing framework is mainly used for parallel computation over large data sets. Its main idea comes from divide-and-conquer algorithms and functional programming, and its core methods are Map and Reduce. During development, a Map function is specified to map one set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function is specified to merge all mapped values that share the same key, which yields the result. Specifically, the Map phase splits the data and turns each input record into key-value pairs; after shuffling and sorting, these are passed to the Reduce phase, which performs the reduction. The Map-Reduce framework as a whole splits big data and distributes it to different machines for processing.

In this step the utility matrix is the input of the first Map-Reduce module, and after the corresponding Map-Reduce processing the output is <User:{Item1, Item2, ..., Itemn}>, where the key is the user (User) and the value is the list of items (Item) the user has rated. This gives the items that each user in the utility matrix has rated.
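The first Map-Reduce module can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop code: the helper `run_map_reduce`, the sample `RATINGS` triples, and the function names are assumptions introduced here and do not appear in the patent.

```python
from collections import defaultdict

# Hypothetical utility-matrix entries as (user, item, rating) triples.
RATINGS = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i2", 4), ("u2", "i3", 1)]

def run_map_reduce(records, mapper, reducer):
    """Toy single-process stand-in for one Map-Reduce job (not the Hadoop API)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def map_user_item(record):
    user, item, _rating = record
    yield user, item  # key = user, value = one rated item

def reduce_item_list(_user, items):
    return sorted(items)  # value = list of items the user rated

first_output = run_map_reduce(RATINGS, map_user_item, reduce_item_list)
```

Here `first_output` plays the role of the first output data, with one `user: item list` entry per user.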

In parallel with the above, the second Map-Reduce module processes the utility matrix to obtain second output data with key-value pairs of the form item: list of users who rated the item and their ratings. The utility matrix is the input of the second Map-Reduce module, and after the corresponding Map-Reduce processing the output is <Item:{User1, User2, ...}>, where the key is the item (Item) and the value is the list of users (User) who rated the item together with their ratings. This shows which users rated each item in the utility matrix and what ratings they gave.
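The second module differs from the first only in what it uses as key and value. A minimal sketch under the same assumptions (toy in-memory helper, hypothetical sample data and function names, not the Hadoop API):

```python
from collections import defaultdict

# Hypothetical utility-matrix entries as (user, item, rating) triples.
RATINGS = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i2", 4), ("u2", "i3", 1)]

def run_map_reduce(records, mapper, reducer):
    """Toy single-process stand-in for one Map-Reduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def map_item_rater(record):
    user, item, rating = record
    yield item, (user, rating)  # key = item, value = (rating user, rating)

def reduce_rater_list(_item, raters):
    return sorted(raters)

second_output = run_map_reduce(RATINGS, map_item_rater, reduce_rater_list)
```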

In parallel with the above, the third Map-Reduce module processes the item data to obtain third output data with key-value pairs of the form item: similar-item list of the item. The third Map-Reduce module has two item data inputs: the first is the standard Map input, and the second is data that the module reads in its entirety while Map is computing. For each item arriving through the standard Map input, the module scans all of the item data read in during the Map computation, uses the feature relationships between items to compute the N most similar items, and outputs <Item1:{Item2, Item3, ...}>, where the key is the item and the value is its similar-item list. This gives the items that are most similar to the key item.
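The two-input pattern of the third module (one item per Map call, plus the full item data available to every call) can be sketched as follows. The patent does not say how item similarity is measured; Jaccard similarity over feature sets, the sample `ITEM_FEATURES`, and `TOP_N` are assumptions made only for illustration.

```python
# Hypothetical item data: item ID -> set of content features.
ITEM_FEATURES = {
    "i1": {"drama", "history"},
    "i2": {"drama", "romance"},
    "i3": {"comedy"},
    "i4": {"drama", "history", "war"},
}
TOP_N = 2

def jaccard(a, b):
    """Assumed similarity measure: |intersection| / |union| of feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def map_similar_items(item, features, all_items, n=TOP_N):
    # `item`/`features` come from the standard Map input; `all_items` is the
    # side input that every Map call scans in full.
    scored = [(jaccard(features, other_features), other)
              for other, other_features in all_items.items() if other != item]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return item, [other for score, other in scored[:n] if score > 0]

third_output = dict(
    map_similar_items(item, features, ITEM_FEATURES)
    for item, features in ITEM_FEATURES.items()
)
```

Because each call only needs its own item plus the shared side input, the calls are independent of each other, which is what lets the Map phase run in parallel.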

In step 103, distributed data processing is performed on each item's list of rating users and ratings and on each item's similar-item list to obtain each user's recommended items and their prediction weights. Specifically, the fourth Map-Reduce module processes the second and third output data to obtain fourth output data with key-value pairs of the form user: list of the user's recommended items and prediction weights. The output of the second Map-Reduce module, <Item:{User1, User2, ...}>, and the output of the third Map-Reduce module, <Item1:{Item2, Item3, ...}>, are the inputs of the fourth Map-Reduce module, which performs a Join operation on them to obtain the recommended item list of each user: <User:{predictItem, ...}>, where the key is the user (User) and the value is the list of the user's recommended items (predictItem) and the corresponding prediction weights.
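The Join in the fourth module matches the two inputs on their shared item key. The sketch below shows one way this could work; the patent does not define how the prediction weight is computed, so accumulating the user's ratings of the source items is purely an assumption, as are the sample inputs and the function name.

```python
from collections import defaultdict

# Hypothetical second output: item -> [(rating user, rating)].
second_output = {"i1": [("u1", 5)], "i2": [("u1", 3), ("u2", 4)]}
# Hypothetical third output: item -> similar-item list.
third_output = {"i1": ["i4", "i2"], "i2": ["i1", "i4"]}

def join_and_predict(item_raters, item_similar):
    """Join both inputs on the item key and accumulate a weight per (user, candidate)."""
    weights = defaultdict(lambda: defaultdict(float))
    for item, raters in item_raters.items():
        for similar in item_similar.get(item, []):
            for user, rating in raters:
                # Assumed weighting: add the rating the user gave the source item.
                weights[user][similar] += rating
    return {user: dict(sorted(w.items())) for user, w in sorted(weights.items())}

fourth_output = join_and_predict(second_output, third_output)
```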

Then, in step 104, distributed data processing is performed on the list of items the user has rated and on the user's recommended items and prediction weights to obtain the item recommendation result for the user. The fifth Map-Reduce module processes the first and fourth output data to obtain an output result with key-value pairs of the form user: item recommendation result of the user. The first output data <User:{Item1, Item2, ..., Itemn}> from the first Map-Reduce module and the fourth output data <User:{predictItem, ...}> from the fourth Map-Reduce module are the inputs of the fifth Map-Reduce module, which performs a Join operation and removes duplicate items to obtain the final recommendation result. That is, the items a user has already rated are removed from the fourth output data, leaving the final predictions for the items the user has not yet rated.
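The filtering performed by the fifth module can be sketched as a join on the user key followed by a set difference. The sample inputs and the function name are hypothetical; sorting the result by descending weight is an added convenience, not something the patent specifies.

```python
# Hypothetical first output: user -> items already rated.
first_output = {"u1": ["i1", "i2"], "u2": ["i2"]}
# Hypothetical fourth output: user -> {candidate item: prediction weight}.
fourth_output = {
    "u1": {"i1": 3.0, "i2": 5.0, "i4": 8.0},
    "u2": {"i1": 4.0, "i4": 4.0},
}

def final_recommendations(rated, predicted):
    """Join on user and drop candidates the user has already rated."""
    result = {}
    for user, weights in predicted.items():
        seen = set(rated.get(user, []))
        unseen = [(item, w) for item, w in weights.items() if item not in seen]
        # Highest weight first; ties broken by item ID for determinism.
        result[user] = sorted(unseen, key=lambda pair: (-pair[1], pair[0]))
    return result

recommendations = final_recommendations(first_output, fourth_output)
```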

As shown in Fig. 3 and Fig. 4, the big data processing method for a recommender system provided by the embodiments of the invention also includes a big data processing design for item-based collaborative recommendation, which includes steps 201 to 204.

Steps 201 and 202 are implemented in the same way as in the big data processing design for content-based recommendation described above. The utility matrix is the input of the first Map-Reduce module, and after the corresponding Map-Reduce processing the output is <User:{Item1, Item2, ..., Itemn}>, where the key is the user (User) and the value is the list of items (Item) the user has rated. This gives the items that each user in the utility matrix has rated.

In step 203, distributed data processing is first performed on the list of items each user has rated in the utility matrix to obtain the similar-user list of the user. That is, the sixth Map-Reduce module processes the list of items rated by the user to obtain fifth output data with key-value pairs of the form user: similar-user list of the user. The sixth Map-Reduce module has two data inputs: the first is the standard Map input, and the second is data that the module reads in its entirety while Map is computing. For each user's behavior information arriving from the first output data through the standard Map input, the module scans the behavior information of all users in the first output data read in during the Map computation, uses the feature relationships of user behavior to compute the N users whose behavior is most similar, and outputs <User:{sUser, ...}>, where the key is the user (User) and the value is the list of users similar to that user (sUser).

Finally, in step 204, distributed data processing is performed on the user's similar-user list and the item data to obtain the item recommendation result for the user. The seventh Map-Reduce module processes the fifth output data and the item data to obtain an output result with key-value pairs of the form user: list of item recommendation results for the user. The fifth output data is the input of the seventh Map-Reduce module, which processes it in parallel and retrieves the item data from the HBase database to obtain the final recommendation result <User:{rItem, ...}>, where the key is the user (User) and the value is the list of item recommendation results (rItem) for that user.
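The sixth module follows the same two-input pattern as the third, applied to users instead of items. A minimal sketch, assuming Jaccard overlap of rated-item sets as the behavior similarity (the patent does not name a measure) and using hypothetical sample data:

```python
# Hypothetical first output: user -> items the user has rated.
first_output = {
    "u1": ["i1", "i2", "i3"],
    "u2": ["i2", "i3"],
    "u3": ["i4"],
    "u4": ["i1", "i2"],
}
TOP_N = 2

def behavior_similarity(items_a, items_b):
    """Assumed measure: Jaccard overlap of the two users' rated-item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def map_similar_users(user, items, all_users, n=TOP_N):
    # `all_users` is the full first output, read as a side input by every Map call.
    scored = [(behavior_similarity(items, other_items), other)
              for other, other_items in all_users.items() if other != user]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return user, [other for score, other in scored[:n] if score > 0]

fifth_output = dict(
    map_similar_users(user, items, first_output)
    for user, items in first_output.items()
)
```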

As shown in Fig. 5 and Fig. 6, the big data processing method for a recommender system provided by the embodiments of the invention also includes a big data processing design for the cold-start optimization scheme, which includes steps 301 to 304.

In step 301, user data is obtained and distributed clustering is performed on it to obtain a cluster result. In this step, the open-source Hadoop-based machine learning library Mahout can be used to perform the distributed clustering of the user data.

In step 302, distributed data processing is performed on the cluster result, the list of items each user has rated, and the item data to obtain the materialized features of each cluster. The list of items rated by each user is obtained in the same way as in the content-based and item-based collaborative designs above: the first Map-Reduce module produces the output <User:{Item1, Item2, ..., Itemn}>, where the key is the user (User) and the value is the list of items (Item) the user has rated.

Next, the eighth Map-Reduce module processes the list of items rated by each user and the cluster result data to obtain sixth output data with key-value pairs of the form cluster: list of items corresponding to the cluster. That is, the eighth Map-Reduce module performs a Join operation and outputs <Cluster:{Item, ...}>, where the key is the cluster (Cluster) and the value is the list of items (Item) the cluster contains, showing which items belong to each cluster.

Then the ninth Map-Reduce module processes the sixth output data and the item data to obtain seventh output data with key-value pairs of the form cluster: list of materialized features of the cluster. The seventh output data produced by this Map-Reduce job is <Cluster:{feature, ...}>, where the key is the cluster (Cluster) and the value is the list of materialized features of that cluster.

In step 303, the materialized features of a new user are obtained from the materialized feature data of each cluster according to the cluster the new user selects. After a new user enters the system and selects a cluster center, the new user's materialized features are obtained by referring to the materialized features of each cluster obtained in step 302.

In step 304, a distributed search of the item index is performed based on the materialized features of the new user to obtain the item recommendation result for the new user. The tenth Map-Reduce module performs a distributed search of the item index on HDFS based on the new user's materialized features to obtain an output result with key-value pairs of the form user: list of item recommendation results for the user.
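The eighth and ninth modules can be sketched as two small functions: a join of user clusters with rated-item lists, followed by a per-cluster feature summary. The patent does not define how materialized features are derived; taking the most frequent item features per cluster, along with all names and sample data below, is an assumption for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: cluster assignment, rated-item lists, and item features.
USER_CLUSTER = {"u1": "c0", "u2": "c0", "u3": "c1"}
first_output = {"u1": ["i1", "i2"], "u2": ["i2"], "u3": ["i3"]}
ITEM_FEATURES = {"i1": ["drama"], "i2": ["drama", "romance"], "i3": ["comedy"]}

def cluster_items(user_cluster, user_items):
    """Eighth module (sketch): join on user to get cluster -> item list."""
    out = defaultdict(list)
    for user, items in user_items.items():
        out[user_cluster[user]].extend(items)
    return dict(out)

def cluster_features(cluster_to_items, item_features, top_k=2):
    """Ninth module (sketch): assumed rule, keep the top_k most frequent features."""
    out = {}
    for cluster, items in cluster_to_items.items():
        counts = Counter(f for item in items for f in item_features[item])
        ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
        out[cluster] = [feature for feature, _count in ranked[:top_k]]
    return out

sixth_output = cluster_items(USER_CLUSTER, first_output)
seventh_output = cluster_features(sixth_output, ITEM_FEATURES)
```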

As shown in Fig. 7 and Fig. 8, the big data processing method for a recommender system provided by the embodiments of the invention also includes a big data processing design for the SlopeOne algorithm. At present the SlopeOne algorithm is implemented only on a single machine, so the invention proposes an effective parallel SlopeOne computation scheme that builds on the simplicity and efficiency of SlopeOne to realize big data processing for recommender systems.

This scheme decomposes the parts of the SlopeOne algorithm into Map-Reduce jobs, with each Map-Reduce job responsible for one part of the parallel SlopeOne computation. The Map-Reduce jobs depend on one another serially: a later Map-Reduce job needs the output of one or more earlier Map-Reduce jobs as its input. Hadoop provides the dedicated programming interfaces JobControl and ControlledJob for controlling dependencies between Map-Reduce jobs; these are used to form the serial structure that implements the whole SlopeOne algorithm.
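The dependency control described here amounts to running each job only after its upstream jobs have finished. The Python sketch below illustrates that ordering logic only; it does not use or imitate the Hadoop JobControl API, and the job names and dependency edges are hypothetical, chosen to resemble the five-job chain described in this section.

```python
# Hypothetical job graph: job name -> jobs whose output it needs.
JOB_DEPENDENCIES = {
    "user_ratings": [],
    "join_train_test": [],
    "pair_differences": ["user_ratings"],
    "pair_statistics": ["pair_differences"],
    "predict": ["pair_statistics", "join_train_test"],
}

def execution_order(dependencies):
    """Return an order in which every job runs after all of its upstream jobs.

    Assumes the graph has no cycles; a real scheduler would have to check.
    """
    order, done = [], set()

    def visit(job):
        if job in done:
            return
        for upstream in dependencies[job]:
            visit(upstream)
        done.add(job)
        order.append(job)

    for job in dependencies:
        visit(job)
    return order

order = execution_order(JOB_DEPENDENCIES)
```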

Specifically, the scheme includes steps 401 to 404. In step 401, a training data set that includes item IDs, the IDs of the users who rated each item, and the rating values is obtained, together with a test data set that includes user IDs and the IDs of the items to be predicted for each user. The invention divides the input data of parallel SlopeOne into a training data set and a test data set. The formats agreed for the two data sets are as follows, and both are plain text:

For the training data set: <user ID, item ID, the user's rating of the item, time>

Example records:

1 1 3 881250949

2 2 3 891717742

For the test data set: <user ID, ID of the item to be predicted>

Example records:

1 1

1 2

In the invention, the training data set is obtained from Netflix data and the test data set is supplied by the recommender system. The test data set is allowed to contain <user ID, item ID> pairs that already have a rating in the training data set; these duplicates are removed when the algorithm runs, so they do not affect the result.
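Reading the two text formats and dropping test pairs that already have a rating could look like the sketch below. The field order (user ID first, then item ID) is an assumption inferred from the test-set format; the function names are hypothetical and the sample lines are the example records shown above.

```python
def parse_training_line(line):
    """Parse one whitespace-separated training record.

    Assumed field order: user ID, item ID, rating, timestamp.
    """
    user_id, item_id, rating, timestamp = line.split()
    return int(user_id), int(item_id), float(rating), int(timestamp)

def parse_test_line(line):
    """Parse one test record: user ID, ID of the item to predict."""
    user_id, item_id = line.split()
    return int(user_id), int(item_id)

training = [parse_training_line(l) for l in ["1 1 3 881250949", "2 2 3 891717742"]]
test = [parse_test_line(l) for l in ["1 1", "1 2"]]

# Test pairs that already have a rating in the training set are dropped.
already_rated = {(user, item) for user, item, _rating, _ts in training}
to_predict = [pair for pair in test if pair not in already_rated]
```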

After the training and test data sets are obtained, the scheme is designed as a chain of five Map-Reduce jobs. The training data set is read first and each user's rating information is collected. Because SlopeOne needs both the training and the test data set for its computation, the join of the training data set with the test data set is performed at the same time as the user rating information is collected from the training set. Once the user rating information has been collected, the differences between items and the number of users behind each difference are computed; this step produces a large amount of intermediate data. When it completes, the difference data and the output of the join operation are combined with a further Join operation, and finally the SlopeOne formula is applied to compute the predicted ratings required by the test set.
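For reference, the SlopeOne formula mentioned here is commonly written as follows. This is the standard formulation of the algorithm in its weighted variant, not text from the patent, and the symbols are introduced here: $r_{u,i}$ is user $u$'s rating of item $i$, $S_{j,i}$ is the set of users who rated both items $j$ and $i$, and $R(u)$ is the set of items rated by $u$.

```latex
% Mean difference between items j and i over the users who rated both
\mathrm{dev}_{j,i} = \frac{1}{|S_{j,i}|} \sum_{u \in S_{j,i}} \left( r_{u,j} - r_{u,i} \right)

% Weighted Slope One prediction of user u's rating for item j
P(u)_j = \frac{\sum_{i \in R(u),\, i \neq j} \left( \mathrm{dev}_{j,i} + r_{u,i} \right) |S_{j,i}|}
              {\sum_{i \in R(u),\, i \neq j} |S_{j,i}|}
```

The per-pair user count $|S_{j,i}|$ and the sum of differences are the quantities the later jobs accumulate, which is consistent with the scheme tracking the total number of users and the mean difference for each item pair.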

In step 402, distributed data processing is performed on the training data set to compute the item pairs in it, their differences, and the list of users who produced each difference. The user rating data, i.e. the training data, is first read in its entirety into one Map-Reduce job. The first Map module outputs one record <User:Item> per rating, i.e. a key-value pair of the form user: item rated by the user and the corresponding rating. User and Item are two custom Hadoop key and value types: the key User holds the user's ID, and the value Item holds the ID of the rated item and the corresponding rating.

After Map emits its results, the Shuffle process gathers all items rated by a given user, and users are fed to the first Reduce module sorted in natural order of user ID. The input of Reduce has the form <User:List{Item, Item, ...}>, and its output is <KeyPair:KeyPairValue>, i.e. a key-value pair of the form item pair: rating difference (Rating) and the user (User) who produced it. KeyPair records the IDs of the item pair, and KeyPairValue records the difference between the two items, including the user who produced it. To reduce intermediate data, only the upper triangle of the matrix is computed; the lower triangle is ignored, since in later computation it can be obtained by negating the upper triangle. This step converts List{Item, Item, ...} into KeyPair records, and in Hadoop the Reduce iterator cannot be traversed more than once. For a <User:List{Item, Item, ...}> record set that would exceed memory, the invention therefore proposes the following solution: monitor the size of the stored objects against a given threshold (for example 0.9); if the stored objects exceed this threshold, write the record set to the local disk and then process it iteratively from there. This causes some performance loss but does not cause the task to fail.
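The first Reduce step, which turns one user's rated-item list into item-pair differences while keeping only the upper triangle, can be sketched as follows. The function name and sample ratings are hypothetical, and the sketch assumes each user rates an item at most once; the spill-to-disk safeguard described above is not modeled.

```python
from itertools import combinations

def reduce_user_to_pair_diffs(user, rated_items):
    """Emit ((item_a, item_b), (difference, user)) with item_a < item_b only.

    `rated_items` is a list of (item_id, rating) for one user. Emitting only
    ordered pairs corresponds to computing the upper triangle; the lower
    triangle is the negation and is not emitted.
    """
    ordered = sorted(rated_items)  # sort by item ID
    for (item_a, rating_a), (item_b, rating_b) in combinations(ordered, 2):
        yield (item_a, item_b), (rating_a - rating_b, user)

pair_diffs = list(reduce_user_to_pair_diffs(1, [(2, 3.0), (1, 5.0), (3, 4.0)]))
```

A user with n rated items produces n(n-1)/2 records here instead of n(n-1), which is the intermediate-data saving the text describes.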

In step 403, based on the item pairs in the training dataset, their differences, and the lists of users who produced the differences, distributed data processing is performed to compute output data comprising the item pairs in the training dataset that have a difference relationship, the total number of users, the rating sum, and the mean difference. This step builds on step 402, and its main purpose is to calculate the mean difference of all item pairs in the training set. The Reduce output of step 402 consists of item pairs, with the characteristic that each item pair appears in the following form:

<Item1-Item2:{<User1,Rating>}>

That is, item difference pair: its difference and the user who produced this difference.

This step generates:

<Item1-Item2:{<averageDiffRating>}>

That is, item difference pair: its mean rating difference. In the Map stage, the data from the previous step is left unprocessed and output directly. In practice, however, this makes the amount of data handled by Reduce too large; because of the limited efficiency of network transmission, the load on Reduce becomes heavy. This process is therefore optimized to improve the efficiency of Reduce: Hadoop allows a merge module (Combiner) to be invoked after the Map function has run on each machine, to perform a preliminary Reduce calculation on the machine where that Map ran.

In this step, the most meaningful quantities in the Map output are the total number of users and the rating sum for identical item pairs (taking the average directly is not appropriate, because a local average is not equal to the global average). The merge module (Combiner) is therefore used in this step to compute the count and the rating sum, which then serve as the input of the Reducer. Thus, from the output data whose key-value pairs are item difference pair: its difference and the users who produced this difference, the merge module generates output data whose key-value pairs are item difference pair: its difference and the count of users who produced this difference: <KeyPair, KeyPairValue>. The output key KeyPair is the same key as in the previous step, while in KeyPairValue, which holds the difference between the items (including the users who produced the difference), the number is replaced by the count computed by the Combiner. After copy and shuffle, these data become the input of the Reducer; the input after copying is <KeyPair, List{KeyPairValue, KeyPairValue, ...}>.
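The reason the Combiner passes on a count and a sum rather than a local average can be illustrated with a short Python sketch. The function names and the example values are hypothetical; the sketch only shows the arithmetic, not the Hadoop Combiner interface.

```python
def combine(diffs):
    """Combiner: collapse the differences seen on one node into (count, total)."""
    return (len(diffs), sum(diffs))

def reduce_partials(partials):
    """Reducer: merge the partial (count, total) values and derive the global mean."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total, total / count

node_a = [1.0, 2.0, 3.0]   # differences for one item pair seen on node A
node_b = [5.0]             # differences for the same pair seen on node B

# Averaging the local averages gives the wrong answer ...
mean_of_means = (sum(node_a) / len(node_a) + sum(node_b) / len(node_b)) / 2

# ... whereas merging counts and sums reproduces the global mean.
result = reduce_partials([combine(node_a), combine(node_b)])
```

Here the mean of the local means is 3.5, while the global mean computed from the merged counts and sums is 2.75.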

After the Reducer receives its input, it aggregates by KeyPair, and the corresponding output values are the number of KeyPairValue entries and their mean. This Map-Reduce job is in fact the end point of the statistical calculation over all of the training data. What must now be considered is how to preserve this result so that this part of the data need not be recalculated when the calculation is run again, i.e. incremental computation. The scheme given by the present invention is to write the number of item pairs and the current rating sum into every record. During incremental computation, the result of the previous calculation can then be restored, so that the item-pair counts and rating sums from the increment can be added conveniently. In addition, optionally, the records produced by the Reducer are given a special flag, 0, indicating that they are data from the training dataset.

That is, from the output data <KeyPair, List{KeyPairValue, KeyPairValue, ...}>, whose key-value pairs are item pair: its difference and the count list of users who produced this difference, the second reduction module obtains output data whose key-value pairs are item: calculation flag, the item with which it has a difference relationship, total number of users, rating sum, and mean difference. The data finally output by the above process has the following format:

<baseItemID:{<flag,compareid,totaluser,totalrating,averageRating>}>

Here, baseItemID is the item ID, flag is the calculation flag, compareid is the ID of the item with which there is a difference relationship, totaluser is the total number of users, totalrating is the rating sum, and averageRating is the mean difference.

The key here no longer takes the form of an item pair, mainly in view of the subsequent join processing with the test set. Because the item pairs in step 402 are in upper-triangular form, each record is already unique.
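The incremental scheme described for step 403, in which each record carries its own count and rating sum, can be sketched as follows. The PairStats type and the merge_increment function are hypothetical names introduced for illustration; only the field names mirror the record format given above.

```python
from typing import NamedTuple

class PairStats(NamedTuple):
    """One value of the step 403 output record (field names follow the text)."""
    flag: int
    compare_id: str
    total_user: int
    total_rating: float
    average_rating: float

def merge_increment(old: PairStats, new_count: int, new_sum: float) -> PairStats:
    """Fold newly observed differences into a stored record without
    re-reading the original training data."""
    total_user = old.total_user + new_count
    total_rating = old.total_rating + new_sum
    return PairStats(old.flag, old.compare_id, total_user, total_rating,
                     total_rating / total_user)

stored = PairStats(flag=0, compare_id="i2", total_user=4,
                   total_rating=11.0, average_rating=2.75)
updated = merge_increment(stored, new_count=1, new_sum=1.5)
```

Because the mean alone cannot be updated without knowing how many users contributed to it, keeping the count and the sum in every record is what makes the increment a simple addition.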

In step 404, distributed data processing is performed on the training dataset and the test dataset to obtain the items each user has already rated together with their ratings, and the items to be predicted. This step runs in parallel with steps 402 and 403 and does not depend on their calculation. It mainly processes the training dataset and the test dataset to obtain the items that need prediction.

In this step, the third mapping module joins the training dataset with the test dataset to obtain output data whose key-value pairs are user: the join record list of this user. In the SlopeOne design of step one, the training dataset and the test dataset have different data formats: each row of the training dataset contains 4 entries, whereas each row of the test dataset contains only 2. This inconsistency in format is used to distinguish which dataset the data entering the Map module comes from. The Map output is the join record set of the two datasets: <LongWritable, ItemPredictStatus>, where LongWritable is the user's ID and ItemPredictStatus, a custom Hadoop value type, represents the join record list of this user. Optionally, the data it records is <item, tag, user, rating>. For the training dataset the flag tag is 0x0 and the other three values are the data as read in; for the test dataset the flag tag is 0x1, the score rating is set to 0, and the other two are the values as read.

Then, from the output data whose key-value pairs are user: the join record list of this user, the third reduction module obtains key-value pairs of user: prediction-required flag, the items the user has rated and their ratings, and the items to be predicted. The output of the third reduction module (Reduce) is the relation between each item the user has not rated and the rated items with which it may have a difference relationship (because SlopeOne requires the user to have rating records). Its format is <userID:{<flag, basicid_has, rating, basicid_no>}>, where the prediction-required flag being set to 1 indicates that this record is one the user needs predicted, basicid_has is an item the user has rated, basicid_no is the ID of the item to be predicted, and rating is the user's rating of the rated item, i.e. the rating of basicid_has.
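The join in step 404 can be sketched in Python as below. This is a single-process illustration under stated assumptions: the training row layout (user, item, rating, plus one unspecified fourth entry), the test row layout (user, item), and all function names are hypothetical; only the tag values 0x0 and 0x1 and the output field order come from the text.

```python
from collections import defaultdict

TRAIN, TEST = 0x0, 0x1

def map_row(fields):
    """Map: tag each row by the dataset it came from, judged by its length."""
    if len(fields) == 2:                    # test row: (user, item), assumed layout
        user, item = fields
        return user, (item, TEST, user, 0.0)
    user, item, rating = fields[:3]         # training row; the 4th entry is not used
    return user, (item, TRAIN, user, float(rating))

def reduce_user(user, records):
    """Reduce: pair every item to be predicted with every item the user has rated."""
    rated = [(item, rating) for item, tag, _, rating in records if tag == TRAIN]
    targets = [item for item, tag, _, _ in records if tag == TEST]
    return [(user, (1, has, rating, no)) for no in targets for has, rating in rated]

def step_404(rows):
    groups = defaultdict(list)
    for fields in rows:
        user, record = map_row(fields)
        groups[user].append(record)
    out = []
    for user in sorted(groups):
        out.extend(reduce_user(user, groups[user]))
    return out

rows = [("u1", "i1", 5.0, 0), ("u1", "i2", 3.0, 0), ("u1", "i3")]
requests = step_404(rows)
```

A user with test rows but no training rows produces no output, which matches the remark that SlopeOne needs the user to have rating records.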

In step 405, based on the item pairs in the training dataset that have a difference relationship, the total number of users, the ratings, the mean difference, the items the user has rated and their ratings, and the items to be predicted, distributed data processing is performed to obtain the values required for the SlopeOne prediction calculation. These include: the item the user has rated, the item whose rating is to be calculated with SlopeOne, the rating of the rated item, the mean rating difference over all users, the total number of users, and the users who have rated both the rated item and the item whose rating is to be calculated.

Step 405 uses the results of steps 403 and 404 to determine for which items a rating prediction can be made for the user. In step 403 this scheme has already calculated the difference relationships between items from the training set, and in step 404 it has determined which items need to be rated for the user. Consider, however, a case such as Item1-Item4 in Table 1: if no difference exists between them, no user's rating of Item1 can be used to predict Item4, nor Item4 to predict Item1; the calculation is only possible through item pairs such as Item2 or Item3 that have a difference with Item1 or Item4. The purpose of this step is precisely to handle this situation.

The input of Map is the Reduce output of steps 403 and 404. The output key (Key) is the item pair reconstructed according to the flag in the output results of steps 403 and 404; the format is <KeyPair, value>, where value is unchanged and output directly.

The Reduce input is <KeyPair, List{<value>...}>. By matching the item pairs in the training-set data against the item pairs in the test set, Reduce finally outputs the computable records <userID, <basicItem, targetItem, user, basicRating, diffRating, totaluser>>, where the key is userID and the value fields are: basicItem, the item that has been rated; targetItem, the item whose rating is to be calculated with SlopeOne; basicRating, the rating of basicItem; diffRating, the mean rating difference over all users (between targetItem and basicItem); totaluser, the total number of users; and user, the users who have rated both basicItem and targetItem.
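The matching performed in step 405 can be sketched as a keyed join in Python. The sketch makes several assumptions that the text does not state: the stored mean for a pair (a, b) with a < b is taken to be the mean of r_a - r_b, diffRating is oriented as target minus basic, the list of co-rating users is omitted for brevity, and an in-memory dictionary stands in for the Map-Reduce join.

```python
def step_405(pair_stats, requests):
    """pair_stats: {(a, b): (total_user, avg_diff)} with a < b and
    avg_diff = mean(r_a - r_b) (assumed convention).
    requests: (user, rated_item, rating, target_item) records from step 404."""
    out = []
    for user, basic, rating, target in requests:
        key = (basic, target) if basic < target else (target, basic)
        if key not in pair_stats:
            continue  # no difference relationship: this pair cannot contribute
        total_user, avg = pair_stats[key]
        # Only the upper triangle is stored, so negate when the stored
        # orientation is basic - target rather than target - basic.
        diff = avg if key[0] == target else -avg
        out.append((user, (basic, target, rating, diff, total_user)))
    return out

pair_stats = {("i1", "i3"): (2, 1.0)}          # mean(r_i1 - r_i3) = 1.0 over 2 users
requests = [("u1", "i1", 5.0, "i3"), ("u1", "i2", 3.0, "i3")]
computable = step_405(pair_stats, requests)
```

The second request is dropped because no difference exists between i2 and i3, which is the sparsity case this step is meant to filter out.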

In step 406, based on the values obtained for the SlopeOne prediction calculation, distributed data processing is performed to calculate the predicted values of the items the user needs predicted. This is the final step of the parallel computation; it calculates the predicted values mainly from the result of step 405 and the SlopeOne formula. The output key (Key) of Map is: user ID-ID of the item to be predicted; the output value is left unprocessed. Reduce implements the SlopeOne formula, and its output is <user ID-ID of the item to be predicted, predicted value>. The key Reduce code is as follows:
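The original listing is not reproduced in this text. As a stand-in, the following minimal Python sketch shows the commonly used weighted Slope One formula applied to the fields produced in step 405. It is not the patent's code: the function name, the choice of the weighted variant, and the orientation of diffRating (target minus basic) are assumptions.

```python
def slope_one_predict(records):
    """Weighted Slope One for one (user, target item) key.

    records: iterable of (basic_rating, diff_rating, total_user), where
    diff_rating is assumed to be the mean of (target - basic) over
    total_user co-rating users.
    """
    records = list(records)
    weight = sum(n for _, _, n in records)
    if weight == 0:
        return None  # nothing to predict from
    weighted = sum((basic + diff) * n for basic, diff, n in records)
    return weighted / weight

# Two rated items contribute to the same target item.
prediction = slope_one_predict([(5.0, -1.0, 3), (3.0, 1.0, 1)])
```

Each rated item proposes basic_rating + diff_rating as the target's rating, and proposals backed by more co-rating users count for more.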

In the above design of the SlopeOne big data processing flow, step 402 is mainly the statistical processing of the training dataset and produces a large amount of intermediate data. Step 403 is where the incremental calculation of the whole algorithm is implemented; because compact storage is used, step 403 can greatly reduce the storage space required. Step 404 is the part that processes the test set, and is also where the test dataset can be optimized at a later stage. Step 405 analyzes and calculates the feasibility of prediction from the results of steps 403 and 404, and is likewise the calculation node that addresses the sparsity of SlopeOne. Step 406 is the Map-Reduce job in which the whole algorithm produces the prediction results. In the whole design, steps 403 and 405 need to take insufficient memory into account because of the Hadoop iterator; this problem is solved by spilling data to disk.

Although the embodiments disclosed by the present invention are as described above, the content described is only an embodiment adopted to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the technical field of the present invention may make any modification and change in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be defined by the appended claims.

Claims

1. A big data processing method for a recommender system, characterized by comprising:

obtaining a user preference utility matrix and item data;

performing distributed data processing based on the utility matrix to obtain the list of items each user has evaluated in the utility matrix and, for each item, the corresponding evaluating users and their score list, and performing distributed data processing based on the item data to obtain a similar-item list for each item;

performing distributed data processing based on the evaluating users and score list corresponding to each item and the similar-item list of each item to obtain each user's recommended items and their predicted weight list;

performing distributed data processing based on the list of items the user has evaluated and the user's recommended items and their predicted weight list to obtain the item recommendation result for the user.

The big data processing method for a recommender system according to claim 1, characterized by further comprising:

performing distributed data processing based on the list of items each user has evaluated in the utility matrix to obtain a similar-user list for the user;

performing distributed data processing based on the user's similar-user list and the item data to obtain the item recommendation result for the user.

performing distributed data processing based on the clustering result, the list of items the user has evaluated, and the item data to obtain the materialized features of each cluster;

obtaining a training dataset comprising item IDs, the IDs of the users who rated each item, and score information, and a test dataset comprising user IDs and the IDs of the items to be predicted for each user;

performing distributed data processing on the training dataset to collect the item pairs therein, their differences, and the list of users who produced each difference;

performing distributed data processing based on the item pairs in the training dataset, their differences, and the lists of users who produced the differences, to compute the item pairs in the training dataset that have a difference relationship, the total number of users, the rating sum, and the mean difference;

performing distributed data processing on the training dataset and the test dataset to obtain the items the user has rated and their ratings, and the items to be predicted;

performing distributed data processing based on the item pairs in the training dataset that have a difference relationship, the total number of users, the ratings, the mean difference, the items the user has rated and their ratings, and the items to be predicted, to obtain the values required for the SlopeOne prediction calculation;

performing distributed data processing based on the values required for the SlopeOne prediction calculation to calculate the predicted values of the items the user needs predicted.

The big data processing method for a recommender system according to claim 1, characterized in that the step of performing distributed data processing based on the utility matrix and the item data comprises:

obtaining, based on the utility matrix and by a first map-reduce module, first output data whose key-value pairs are user: the list of items the user has evaluated;

obtaining, based on the utility matrix and by a second map-reduce module, second output data whose key-value pairs are item: the evaluating users and score list corresponding to the item;

obtaining, based on the item data and by a third map-reduce module, third output data whose key-value pairs are item: the similar-item list of the item;

obtaining, based on the second and third output data and by a fourth map-reduce module, fourth output data whose key-value pairs are user: the user's recommended items and their predicted weight list;

and the step of obtaining the item recommendation result for the user comprises:

obtaining, based on the first and fourth output data and by the fourth map-reduce module, an output result whose key-value pairs are user: the item recommendation result list of the user.

The big data processing method for a recommender system according to claim 2, characterized in that the step of obtaining the similar-user list of the user comprises:

obtaining, based on the list of items the user has evaluated and by a sixth map-reduce module, fifth output data whose key-value pairs are user: the similar-user list of the user;

obtaining, based on the fifth output data and the item data and by a seventh map-reduce module, an output result whose key-value pairs are user: the item recommendation result list of the user.

The big data processing method for a recommender system according to claim 3, characterized in that the step of obtaining the materialized features of each cluster comprises:

obtaining, based on the list of items the user has evaluated and the clustering result data and by an eighth map-reduce module, sixth output data whose key-value pairs are cluster: the list of items corresponding to the cluster;

obtaining, based on the sixth output data and the item data and by a ninth map-reduce module, seventh output data whose key-value pairs are cluster: the materialized feature list of the cluster;

and the step of obtaining the item recommendation result for a new user comprises:

retrieving, based on the materialized features of the new user and by a tenth map-reduce module, from the item index to obtain an output result whose key-value pairs are user: the item recommendation result list of the user.

The big data processing method for a recommender system according to claim 4, characterized in that the step of performing distributed data processing on the training dataset comprises:

obtaining, from the training dataset and by a first mapping module and a shuffle module, output data whose key-value pairs are user: the user's rated items and the corresponding score list;

obtaining, by a first reduction module and from the output data whose key-value pairs are user: the user's rated items and the corresponding score list, output data whose key-value pairs are item pair: its difference and the user who produced this difference.

The big data processing method for a recommender system according to claim 4, characterized in that the step of computing the item pairs in the training dataset that have a difference relationship, the total number of users, the rating sum, and the mean difference comprises:

generating, by a merge module and from the output data whose key-value pairs are item pair: its difference and the user who produced this difference, output data whose key-value pairs are item pair: its difference and the count list of users who produced this difference;

obtaining, by a second reduction module and from the output data whose key-value pairs are item pair: its difference and the count list of users who produced this difference, output data whose key-value pairs are item: calculation flag, the item with which it has a difference relationship, total number of users, rating sum, and mean difference.

The big data processing method for a recommender system according to claim 4, characterized in that the step of performing distributed data processing on the training dataset and the test dataset comprises:

joining, by a third mapping module, the training dataset with the test dataset to obtain output data whose key-value pairs are user: the join record list of this user;

obtaining, by a third reduction module and from the output data whose key-value pairs are user: the join record list of this user, key-value pairs of user: prediction-required flag, the items the user has rated and their ratings, and the items to be predicted.