CN109271421A - A MapReduce-based big data clustering method - Google Patents


Publication number
CN109271421A
CN109271421A (application CN201811099090.9A)
Authority
CN
China
Prior art keywords
canopy
data
cluster
clustering
clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811099090.9A
Other languages
Chinese (zh)
Inventor
韦鹏程
蔡银应
邹杨
黄思行
张艳霞
Current Assignee
Chongqing University of Education
Original Assignee
Chongqing University of Education
Priority date
Filing date
Publication date
Application filed by Chongqing University of Education filed Critical Chongqing University of Education
Priority to CN201811099090.9A priority Critical patent/CN109271421A/en
Publication of CN109271421A publication Critical patent/CN109271421A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the field of big data processing technology and discloses a MapReduce-based big data clustering method and its application, comprising: input and format conversion of the raw data; Canopy partition and screening to obtain the initial cluster division; K-Means iteration, using the Canopy clustering result as the initial cluster division; and assignment of data points, which yields the complete information of the k clusters once the K-Means iteration finishes. Addressing the initial-centre selection problem and the excessive amount of iterative computation in the traditional K-Means algorithm, an improved K-Means algorithm based on Canopy partitioning and filtering is proposed, implemented within the MapReduce framework, and studied in depth. The results show that the improved algorithm offers clear performance improvements in clustering accuracy and related respects.

Description

A MapReduce-based big data clustering method
Technical field
The invention belongs to the field of big data processing technology, and in particular relates to a MapReduce-based big data clustering method.
Background technique
With the arrival of the big data era, in more and more application scenarios the scale of the data to be processed has grown to the TB or even PB level, and it is desirable to quickly and effectively mine reliable, useful hidden information from it (Alexey B et al. 2018). How to mine valuable information from big data rapidly and accurately is therefore of great current significance. Cluster analysis, one of the core techniques of data mining, is often used as a preprocessing step for other data mining algorithms (Treu T et al. 2018). Faced with data of such enormous scale, however, traditional clustering methods cannot meet practical needs in terms of data storage, computational overhead, and so on (Efstathiou G et al. 2018).
The MapReduce computation model is a distributed computing method proposed by Google. It is highly reliable, highly scalable, and easy to program; it hides from the programmer extremely complex details of distributed processing such as data storage, error handling, and load balancing, and has become a popular distributed processing technique (Driver S P et al. 2018). The Hadoop platform, an open-source project under the Apache Foundation, implements the MapReduce model and manages data with HDFS (a distributed file system); it can serve as a powerful tool for research on parallel clustering techniques (Humphrey P J et al. 2018).
The well-known big data expert Viktor Mayer-Schönberger once said that the essence of the world is data, and that the big data era would usher in a great transformation of the age (Barentsen G et al. 2018). The rise of big data is not merely a technological boom or a bout of hype; it may well be a revolution that changes people's way of life and their way of understanding the world (Littlefair S P et al. 2018). Just as humanity came to perceive the universe through the telescope and to observe microorganisms precisely through the microscope, the development of big data technology means that people no longer feel powerless when submerged in an ocean of data, but instead seek value in rich data deposits, with implications for commerce, public health, security, politics and beyond. In fact, the information explosion is happening all around us (Clark C D. 2017). In 2003 the Human Genome Project took a full decade to complete the first decoding of the genetic code, whereas today gene sequencers can finish a task of the same workload in fifteen minutes. In finance, as mathematical models and computer algorithms have matured, automated trading programs have flourished: of the up to seven billion shares changing hands on the US stock market, an estimated two thirds of trades are completed by programs. The major Internet companies understand the value of data even better, through the massive user data they collect; companies such as Google and Amazon are, without exception, leaders and promoters of big data technology (Peng H et al. 2017). Faced with the enormous scale and rapid accumulation of data, people have gradually moved from unease to recognition: the scientific and social value of big data lies precisely in its being "big", and a grasp of big data can be converted into real economic value (Mukherjee A P et al. 2017). Meanwhile, big data can become a powerful tool for solving pressing problems such as environmental issues and disease control, and for improving governments' capacity to govern.
In conclusion problem of the existing technology is: in face of huge data scale, traditional clustering method is in data Storage, computing cost etc. are not able to satisfy real needs, and accuracy rate is low, cannot effectively excavate reliable, useful hide Information.
The difficulty and significance of solving these problems:
(1) The improved algorithm obtains the canopy clusters and their centres through a Canopy partition of the whole data set, and screens the canopy cluster centres according to certain discrimination conditions. The K-Means initial centres obtained in this way are more accurate than those produced by the random-selection method of the traditional K-Means algorithm, which reduces the influence of the local-optimum problem that may exist in the K-Means algorithm and makes the final clustering result more accurate.
(2) Whether the more accurate initial centres obtained from the Canopy initial division can accelerate the convergence of the K-Means iteration and effectively reduce the number of iterations.
(3) The improved algorithm has a faster convergence rate and better cluster compactness.
(4) The rate of decrease of the improved algorithm's sum of squared errors during iteration is not faster than that of the traditional algorithm.
Summary of the invention
In view of the problems in the prior art, the present invention provides a MapReduce-based big data clustering method.
The invention is realized as follows. The MapReduce-based big data clustering method comprises the following steps:
Step 1: input and format conversion of the raw data. Hadoop defines three input data formats: TextInputFormat, KeyValueInputFormat and SequenceFileInputFormat. Since the data to be clustered are high-dimensional vectors, SequenceFileInputFormat is selected, and Hadoop's built-in InputDriver class is called.
Step 2: Canopy partition and screening. Obtain the initial cluster division, and determine according to the screening conditions a suitable number of clusters K for the data set, to serve as the K value of the subsequent K-Means algorithm. Obtain the centre information of the K initial clusters, including the feature vector representing the cluster centre, its weight, the number of data points within its T2 radius, and the number of data points outside T2 but within T1. Perform Canopy clustering on all the data and, according to the T1 and T2 thresholds, strongly mark the data points falling within the T2 radius. The design goals are realized by one MapReduce job comprising one map phase and one reduce phase.
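The Canopy pass in step 2 can be sketched in plain Python. This is a single-machine illustration of the logic only, not the patent's MapReduce implementation; the function and variable names are assumptions.

```python
import math

def canopy(points, t1, t2):
    """One pass of Canopy clustering over a list of point tuples.

    A point within t2 of a canopy centre is consumed ("strongly marked")
    and cannot seed another canopy; a point within t1 joins the canopy's
    member list but stays available. Assumes t1 > t2.
    """
    canopies = []              # list of (centre, members)
    remaining = list(points)
    while remaining:
        centre = remaining[0]
        members = [p for p in remaining if math.dist(centre, p) < t1]
        # keep only points outside t2 as candidates for future centres
        remaining = [p for p in remaining if math.dist(centre, p) >= t2]
        canopies.append((centre, members))
    return canopies
```

Because the distance function is evaluated with cheap thresholds and each point seeds at most one canopy, a single pass suffices, which is why the patent uses it as a preprocessing step.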
Step 3: K-Means iteration. Use the Canopy clustering result as the initial cluster division, and let each Canopy centre, endowed with a higher weight, substitute for the set of data points falling within its T2 radius when participating in the K-Means iteration, thereby realizing the filtering. Each iteration is completed by one full MapReduce job comprising one map phase and one reduce phase.
Step 4: assign data points. After the K-Means iteration completes, the complete information of the k clusters is obtained and all data points are assigned to the corresponding clusters. This is realized with one MapReduce job: on each mapper the distances between the local data points and each centre in the global cluster set are computed, each data point is added to the nearest cluster, and the result is written to HDFS.
Further, in step 2 the AddObjtoCanopyList method adds each data point to the corresponding Canopy cluster. In the Cleanup() method, the program outputs <"centroid", Canopycenter> key-value pairs as intermediate results for the next phase. Hadoop transmits the local Canopy information formed on each mapper over the network to a single reducer.
In the Canopy reduce phase, the single reducer processes the local Canopy information from each mapper, forms the global Canopy set and writes it to HDFS. Here the weights of the Canopy centres in the global Canopy set must be modified, and the Canopy set is screened according to the value of n1/n2 to obtain the number of clusters K for the subsequent K-Means clustering. The AddCanopycentertoCanopyList method adds each local Canopy centre to the global canopy information and recomputes that information, including the updated centre vector, weight, n1 and n2. Before the canopy centre information is written to the distributed file system, the Cleanup method performs a check to remove the canopy centres that do not satisfy the condition.
Further, in step 3, on the first iteration KMeansMapper reads the Canopy clustering result from HDFS; every subsequent iteration reads the previous K-Means clustering result from HDFS as its input file. Data points strongly marked in the Canopy clustering phase do not participate in the distance computations. After the distance computation each data point is added to the nearest cluster centre, and its influence on the cluster is recorded, represented by clusterObservation. The NearestCluster method adds the data points on the local machine to the nearest cluster.
A corresponding reducer is arranged for each cluster; the specific mapping is executed by Hadoop's jobtracker, hiding the implementation details from the user. The reducer aggregates the local cluster information sent by each mapper, forms the global cluster information, and writes it to HDFS as the input file of the next iteration. The Computeconvergence() method determines whether the iteration stopping condition has been reached; if converged, the iterative process ends here. Otherwise the cluster.computeParameters() method is executed, i.e. the parameters of each cluster are recomputed, so that the intermediate results of the current iteration do not affect the next iteration.
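One iteration of the kind described — mapper-side partial sums per cluster (the role played by clusterObservation) followed by a reducer-side recomputation of the centres and a convergence test — can be sketched on a single machine as follows. The names are illustrative assumptions, not the patent's classes.

```python
import math

def kmeans_step(weighted_points, centres):
    """One weighted K-Means iteration as one simulated MapReduce round."""
    k, dim = len(centres), len(centres[0])
    # 'map' phase: assign each (point, weight) pair to its nearest centre
    # and accumulate per-centre weighted sums
    acc = [[0.0] * dim for _ in range(k)]
    wts = [0.0] * k
    for p, w in weighted_points:
        i = min(range(k), key=lambda j: math.dist(p, centres[j]))
        wts[i] += w
        for d in range(dim):
            acc[i][d] += w * p[d]
    # 'reduce' phase: new centre = weighted mean of the points assigned to it
    return [tuple(s / wts[i] for s in acc[i]) if wts[i] else centres[i]
            for i in range(k)]

def converged(old, new, eps=1e-6):
    """Iteration-stopping test: every centre moved less than eps."""
    return all(math.dist(a, b) < eps for a, b in zip(old, new))
```

A driver loop would call `kmeans_step` until `converged` returns True, mirroring the patent's one-MapReduce-job-per-iteration design.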
Further, the sampling and filtering method of the MapReduce-based big data clustering method comprises: before clustering the entire set of data points, identifying the main cluster modes by sample analysis; filtering out, during cluster analysis, the data points that fall within the identified cluster modes; and finally merging the clusters obtained in the two stages.
For the raw data set, Canopy clustering is performed first: by setting reasonable T1 and T2 thresholds, the raw data set is given an initial division, yielding several Canopy clusters and their centre information. For each Canopy cluster, one part of its data points lies within the T2 radius, while another part lies within T1 but outside T2. The data points falling within T2 are considered close enough to the Canopy centre and compact; all data points within T2 are replaced by the Canopy centre endowed with a larger weight. That is, a Canopy centre whose weight is greater than that of an ordinary data point replaces all the data points within its T2 radius in the subsequent, stricter cluster-analysis process, realizing the filtering. For the initial data, the weight of every data point is uniformly set to 1; the weight of a substituting Canopy centre must then satisfy the condition that the more data points fall within its T2 radius, the larger its weight should be. The number of data points falling within the T2 radius can therefore serve as the weight of the Canopy centre.
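The weight convention just described — each raw point weighs 1, and a substituting Canopy centre weighs the number of points inside its T2 radius — keeps weighted means consistent. The following small check illustrates this under the assumption that the centre coincides with the exact mean of the points it replaces (names are illustrative):

```python
def weighted_mean(items):
    """items: list of (vector, weight) pairs; returns the weighted mean vector."""
    total = sum(w for _, w in items)
    dim = len(items[0][0])
    return tuple(sum(v[d] * w for v, w in items) / total for d in range(dim))
```

Replacing a compact group of unit-weight points by its centre carrying the group's count leaves every downstream weighted-mean computation unchanged, which is why the filtering does not bias the K-Means centre updates.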
The raw data are first deployed in the distributed file system: Hadoop assigns the data to the individual machines according to the specific configuration of the computer cluster, so all the data are stored on different machines in the cluster. During cluster analysis, the Canopy clustering algorithm divides all the data once, yielding the three main clusters in the figure. The data points falling within the T2 radius are uniformly substituted by the cluster centre. Each raw data point is assigned the same weight 1, and the weight of each substituting Canopy cluster centre is numerically equal to the number of data points within its T2 radius. After more accurate cluster information is obtained through K-Means clustering, all the actual data points are finally assigned to the corresponding clusters, completing the clustering process.
Further, the Canopy-partition-and-filter K-Means algorithm of the MapReduce-based big data clustering method performs one cluster analysis of the whole data set with the Canopy clustering algorithm, then uses the Canopy clustering result to make an initial division of the whole data set and to filter it.
Another object of the present invention is to provide a big data processing system applying the described MapReduce-based big data clustering method.
The advantages and positive effects of the present invention are as follows. On the basis of a survey of traditional clustering algorithms, the present invention illustrates the general characteristics of parallelized clustering methods through four concrete algorithms suited to MapReduce realization. Addressing the initial-centre selection problem and the excessive amount of iterative computation in the traditional K-Means algorithm, an improved K-Means algorithm based on Canopy partitioning and filtering is proposed, implemented within the MapReduce framework, and studied in depth. The results show that the improved algorithm offers clear performance improvements in clustering accuracy and related respects.
Detailed description of the invention
Fig. 1 is a flow diagram of the MapReduce-based big data clustering method provided by an embodiment of the present invention.
Fig. 2 is the convergence curve on data set D11 provided by an embodiment of the present invention.
Fig. 3 is the convergence curve on data set D12 provided by an embodiment of the present invention.
Fig. 4 is the convergence curve on data set D13 provided by an embodiment of the present invention.
Fig. 5 is a line chart of speed-up versus the number of machines in the cluster, provided by an embodiment of the present invention.
Fig. 6 is a line chart of cluster efficiency versus the number of machines in the cluster, provided by an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the embodiments. It should be understood that the specific embodiments described here serve only to illustrate the present invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the MapReduce-based big data clustering method provided by an embodiment of the present invention comprises the following steps:
S101: input and format conversion of the raw data. Hadoop defines three input data formats: TextInputFormat, KeyValueInputFormat and SequenceFileInputFormat. Since the data to be clustered are high-dimensional vectors, SequenceFileInputFormat is selected, and Hadoop's built-in InputDriver class is called.
S102: Canopy partition and screening. Obtain the initial cluster division, and determine according to the screening conditions a suitable number of clusters K for the data set, to serve as the K value of the subsequent K-Means algorithm. Obtain the centre information of the K initial clusters, including the feature vector representing the cluster centre, its weight, the number of data points within its T2 radius, and the number of data points outside T2 but within T1. Perform Canopy clustering on all the data and, according to the T1 and T2 thresholds, strongly mark the data points falling within the T2 radius. The design goals are realized by one MapReduce job comprising one map phase and one reduce phase.
S103: K-Means iteration. Use the Canopy clustering result as the initial cluster division, and let each Canopy centre, endowed with a higher weight, substitute for the set of data points falling within its T2 radius when participating in the K-Means iteration, realizing the filtering. Each iteration is completed by one full MapReduce job comprising one map phase and one reduce phase.
S104: assign data points. After the K-Means iteration completes, the complete information of the k clusters is obtained and all data points are assigned to the corresponding clusters. This is realized with one MapReduce job: on each mapper the distances between the local data points and each centre in the global cluster set are computed, each data point is added to the nearest cluster, and the result is written to HDFS.
In a preferred embodiment of the invention, in step S102 the AddObjtoCanopyList method adds each data point to the corresponding Canopy cluster. In the Cleanup() method, the program outputs <"centroid", Canopycenter> key-value pairs as intermediate results for the next phase. Here Hadoop transmits the local Canopy information formed on each mapper over the network to a single reducer (the keys are identical, so there is only one reducer); clearly the network overhead here is small.
In the Canopy reduce phase, the single reducer processes the local Canopy information from each mapper, forms the global Canopy set and writes it to HDFS. Here the weights of the Canopy centres in the global Canopy set must be modified, and the Canopy set is screened according to the value of n1/n2 to obtain the number of clusters K suited to the subsequent K-Means clustering. Here the AddCanopycentertoCanopyList method adds each local Canopy centre to the global canopy information and recomputes that information, including the updated centre vector, weight, n1, n2 and so on. Before the canopy central information is written to the distributed file system, the Cleanup method performs a check to remove the canopy centres that do not satisfy the condition.
In a preferred embodiment of the invention, in step S103, on the first iteration KMeansMapper reads the Canopy clustering result from HDFS; every later iteration reads the previous K-Means clustering result from HDFS as its input file. The data points strongly marked in the Canopy clustering phase (those falling within the T2 radius) do not participate in the distance computation, which reduces the computation scale. After the distance computation each data point is added to the nearest cluster centre, and its influence on the cluster is recorded, represented by clusterObservation. The NearestCluster method adds the (filtered) data points on the local machine to the nearest cluster. Since each new data point added to a cluster affects the cluster information, this influence is described by the clusterObservation class.
In the K-Means clustering algorithm, the number of clusters is fixed. Therefore one reducer is arranged for each cluster; the specific mapping is executed by Hadoop's jobtracker, hiding the implementation details from the user. The operation on the reducer is very simple: it aggregates the local cluster information sent by each mapper, forms the global cluster information, and writes it to HDFS as the input file of the next iteration. The Computeconvergence() method determines whether the iteration stopping condition has been reached; if converged, the iterative process ends here. Otherwise cluster.computeParameters() is executed to recompute the parameters of each cluster, so that the intermediate results of the current iteration do not affect the next iteration, guaranteeing the accuracy of the recomputed cluster centres in every iteration. Whether or not the iteration stops, the global cluster information of the current iteration is written to HDFS.
The sampling and filtering method provided by the invention is as follows:
One reasonable idea is to identify the main cluster modes by sample analysis before starting the cluster analysis of the full set of data points, then to filter out, during the later cluster analysis, the data points falling within the identified cluster modes, and finally to merge the clusters obtained in the two stages. Robson L. F. Cordeiro proposed a MapReduce realization based on this idea, the SnI algorithm (Sample and Ignore). The algorithm first reads the whole data set and extracts a portion of the data points as a sample according to a certain sampling strategy, obtaining the main cluster modes by clustering the sample data. It then performs cluster analysis on all the data, at which point it filters out the data points falling within the main cluster modes and discovers the cluster modes of the remaining data points. Finally, the cluster modes obtained in the two stages are merged to obtain the global clustering result.
Two problems arise. First, the value of the cluster modes must be judged against the original collection, so a well-performing sampling method becomes an important prerequisite for accurate cluster-analysis results. Second, in the merging stage, how to guarantee that the clusters produced in the two stages are merged accurately and effectively will significantly affect the size, shape and number of the clusters in the final result. Therefore, although sampling the whole data set can effectively reduce the computation scale of the clustering method, the inherent uncertainty of sampling brings new instability to the clustering process. On the other hand, "ignoring" is an attractive idea for improvement, but its premise is a convincing ignore condition. Inspired by one of the three great changes discussed in Viktor Mayer-Schönberger's account of the big data era — all data, not samples — the author believes that, rather than painstakingly designing a suitable sampling algorithm before filtering, it is better to treat all the data as the sample. Here the present invention proposes a new sampling idea: the sample is the whole!
Once it is decided to use all the data as the "sample", a new problem follows: which algorithm should be used to discover the initial clusters? The Canopy clustering algorithm exhibits many good properties here, such as a simple single traversal, lightweight distance-function computation, and fast, fairly accurate division. This makes Canopy an excellent algorithm for determining the initial clusters. How sampling and filtering are realized with the Canopy algorithm is described below.
For the raw data set, Canopy clustering is performed first: by setting reasonable T1 and T2 thresholds, the raw data set is given an initial division, yielding the centre information of several Canopy clusters. For each Canopy cluster, one part of its data points lies within the T2 radius, while another part lies within T1 but outside T2. The data points falling within T2 are considered close enough to the Canopy centre and compact; all data points within T2 can be approximately replaced by the Canopy centre endowed with a larger weight, i.e. a Canopy centre whose weight is greater than that of an ordinary data point replaces all the data points within its T2 radius in the subsequent, stricter cluster-analysis process (such as K-Means clustering), thereby realizing the filtering. As mentioned above, for the initial data the weight of every data point is uniformly set to 1; the weight of a substituting Canopy centre must then satisfy the condition that the more data points fall within its T2 radius, the larger its weight should be. Here it is approximately held that the number of data points falling within the T2 radius can serve as the weight of the Canopy centre.
The figure depicts the process of obtaining the initial cluster division by the Canopy clustering method, replacing the data points within the T2 radius of each Canopy cluster with the Canopy centre point, participating in the subsequent K-Means cluster analysis, and finally obtaining the global clustering result.
The raw data are first deployed in the distributed file system: Hadoop assigns the data to the individual machines according to the specific configuration of the computer cluster, so that all the data are stored on different machines in the cluster. During cluster analysis, the Canopy clustering algorithm first divides all the data once, yielding the three main clusters in the figure. The data points falling within the small circles (the T2 radius) are uniformly substituted by the cluster centre. In this way the number of data points entering the next step, the K-Means clustering iteration, is effectively reduced. In particular, for each cluster centre the weight is proportional to the number of data points within its T2 radius, i.e. the number of data points the centre itself substitutes. In the later concrete realization of the algorithm, each raw data point is assigned the same weight 1, so the weight of each substituting Canopy cluster centre is numerically equal to the number of data points within its T2 radius. The number of data points after filtering is markedly smaller than in the raw data, so the computation scale of the K-Means iterative process can be greatly reduced. After more accurate cluster information is obtained through K-Means clustering, all the actual data points are finally assigned to the corresponding clusters, completing the clustering process.
The threshold selection method provided by the invention is as follows:
If the K-Means clustering algorithm is selected as the subsequent clustering method, two problems need to be considered here. The first is the selection of the T1 and T2 thresholds: in fact, because of the filtering idea described above, choosing T2 too large makes the elements within the T2 radius insufficiently compact around the Canopy centre and degrades the clustering result, while choosing T2 too small produces too many Canopy clusters. The second is that the K-Means algorithm requires the K value to be defined in advance, while the number of Canopy clusters has already been obtained in the Canopy clustering preprocessing; what is the relationship between these two parameters — in other words, can the former be used directly as the K value of the K-Means algorithm?
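The T2 trade-off just described — a smaller T2 consumes fewer points per canopy and therefore leaves more canopies — can be observed with a minimal single-machine Canopy pass (illustrative code under the simplifying assumption that only the T2 radius matters for the canopy count; not the patent's implementation):

```python
import math

def canopy_count(points, t2):
    """Count the canopies produced by one Canopy pass with threshold t2.

    Each canopy seed consumes every point within t2 of it; the canopy
    count depends only on t2, since T1 controls membership, not seeding.
    """
    remaining = list(points)
    count = 0
    while remaining:
        centre = remaining[0]
        remaining = [p for p in remaining if math.dist(centre, p) >= t2]
        count += 1
    return count
```

On evenly spaced points the effect is easy to see: halving T2 roughly doubles the number of canopies, which is the over-fragmentation the text warns about when T2 is chosen too small.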
The Canopy-partition-and-filter K-Means algorithm provided by the invention:
From the above elaboration of the sampling and filtering ideas, the main ideas of the present invention for improving the K-Means algorithm are clear. First, the idea of local sampling is abandoned in favour of the new approach of taking the sample to be the whole: one cluster analysis of the entire data set is performed with the Canopy clustering algorithm. This can be done because the Canopy algorithm is fast and simple enough; even if sampling analysis based on all the data consumes more computation time than local sampling, the more accurate cluster result of whole-data sampling correspondingly reduces the computation of the subsequent K-Means iterations, offsetting the extra computational cost of sampling everything.
Then, the Canopy clustering result is used to make an initial division of the entire data set and to filter it. This not only yields the K value and the initial centres needed by the K-Means clustering algorithm, but also substitutes cluster centres for large numbers of raw data points, greatly reducing the amount of computation during the K-Means iterations. Finally, the cluster information obtained with the K-Means clustering algorithm is used to assign all data points to suitable clusters.
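The overall flow described above — a Canopy pass for K and the initial centres, weighted K-Means on the filtered points, then final assignment of every raw point — can be sketched end to end on a single machine. All names are illustrative assumptions, and for brevity the sketch treats points between T2 and T1 as handled by the nearest centre; the patent realizes each stage as a MapReduce job.

```python
import math

def canopy_filter(points, t2):
    """Canopy pass: return (initial centres, filtered weighted points).

    Each canopy centre substitutes, with weight = count, the points
    inside its T2 radius.
    """
    remaining = list(points)
    weighted = []
    while remaining:
        centre = remaining[0]
        inside = [p for p in remaining if math.dist(centre, p) < t2]
        remaining = [p for p in remaining if math.dist(centre, p) >= t2]
        weighted.append((centre, len(inside)))
    return [c for c, _ in weighted], weighted

def kmeans(weighted, centres, iters=20):
    """Weighted K-Means on the filtered point set."""
    k, dim = len(centres), len(centres[0])
    for _ in range(iters):
        acc = [[0.0] * dim for _ in range(k)]
        wts = [0.0] * k
        for p, w in weighted:
            i = min(range(k), key=lambda j: math.dist(p, centres[j]))
            wts[i] += w
            for d in range(dim):
                acc[i][d] += w * p[d]
        centres = [tuple(s / wts[i] for s in acc[i]) if wts[i] else centres[i]
                   for i in range(k)]
    return centres

def assign(points, centres):
    """Final step: put every raw point into its nearest cluster."""
    clusters = {i: [] for i in range(len(centres))}
    for p in points:
        i = min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
        clusters[i].append(p)
    return clusters
```

Only the small weighted set enters the iteration loop, while the final `assign` touches every raw point exactly once, mirroring the computation-scale argument made in the text.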
The MapReduce realization of the Canopy-partition-and-filter K-Means clustering algorithm is specifically described below. A MapReduce implementation of the Canopy algorithm and of the K-Means algorithm has each been illustrated above; their greatest common feature is trying to avoid moving data: only the necessary descriptive cluster information is transmitted over the network, and all data-point-related computation is completed on the local machine, realizing the transfer of computation rather than of data. Based on the same design idea, the present invention devises a MapReduce realization of the Canopy-partition-and-filter K-Means clustering algorithm in four steps in total:
Step 1: the input of initial data and format conversion.Hadoop defines three kinds of input data format modes: TextInputFormat, KeyValueInputFormat and SequenceFileInputFormat need to carry out clustering Data be high dimension vector form, therefore select SequenceFileInputFormat.Hadoop need to be only called to carry InputDriver class, user, which need not pay close attention to, realizes details.
Step 2: Canopy division and screening. Three goals are achieved in this step. First, obtain the initial cluster division and determine, according to the screening conditions, a cluster count K suited to the data set, to serve as the K value of the subsequent K-Means algorithm. Second, obtain the center information of the K initial clusters, including the feature vector representing each cluster center, its weight, the number of data points within its T2 range, and the number of data points outside T2 but within T1. Third, perform Canopy clustering on all the data and, according to the T1 and T2 thresholds, apply a "strong" mark to the data points that fall within the T2 range. The design goal is realized with one MapReduce job comprising one map phase and one reduce phase.
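As a single-machine illustration of the Canopy pass in this step (a sketch outside the MapReduce framework; `canopy_pass` and its variable names are assumptions, not the patent's code), the T1/T2 division with strong marking might look like:

```python
import math

def canopy_pass(points, t1, t2):
    """One Canopy division (requires t1 > t2): each canopy records its
    center, n2 = points within T2 (strong-marked, removed from the pool)
    and n1 = points outside T2 but within T1 (left in the pool)."""
    pool = list(points)
    canopies = []
    while pool:
        center = pool.pop(0)                  # arbitrary seed point
        n2, n1, keep = 1, 0, []               # the center counts toward n2
        for p in pool:
            d = math.dist(center, p)
            if d < t2:
                n2 += 1                       # strong mark: consumed
            else:
                keep.append(p)                # stays available as a seed
                if d < t1:
                    n1 += 1                   # loosely covered
        pool = keep
        canopies.append({"center": center, "n2": n2, "n1": n1})
    return canopies

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.9)]
cans = canopy_pass(pts, t1=2.0, t2=0.5)
```

With these five points the pass yields two canopies, mirroring the fast, cheap first division that the step relies on.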
The AddObjtoCanopyList method adds each data point to the corresponding Canopy cluster; its concrete implementation can refer to the earlier description. In the Cleanup() method, the program outputs <"centroid", Canopycenter> key-value pairs as intermediate results for the next phase. Hadoop sends the local Canopy information formed on each mapper over the network to a single reducer (the key is identical, so there is only one reducer); the network overhead here is clearly small.
In the Canopy reduce phase, the single reducer processes the local Canopy information from each mapper, forms the global Canopy set and writes it to HDFS. Unlike the basic Canopy clustering realization described above, here the weights of the Canopy centers in the global Canopy set must be modified, and the Canopy set must be screened according to the value of n1/n2 in order to obtain a cluster count K suited to the subsequent K-Means clustering. The AddCanopycentertoCanopyList method adds each local Canopy center, as a data point, into the global canopy information and recomputes that information, including the updated center vector, weight, n1 and n2. Before the canopy center information is written to the distributed file system, the Cleanup method screens it first, removing those canopy centers that do not satisfy the condition.
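The n1/n2 screening described here can be sketched as follows; the direction of the criterion and the threshold value are assumptions for illustration, since the text does not fix a concrete cut-off:

```python
def screen_canopies(canopies, max_ratio=0.5):
    """Keep only canopy centers whose points are concentrated inside T2
    (small n1/n2); the number of survivors becomes the K of the
    subsequent K-Means clustering."""
    kept = [c for c in canopies
            if c["n2"] > 0 and c["n1"] / c["n2"] <= max_ratio]
    return kept, len(kept)

canopies = [
    {"center": (0, 0), "n2": 40, "n1": 5},   # compact canopy: kept
    {"center": (9, 9), "n2": 30, "n1": 4},   # compact canopy: kept
    {"center": (4, 4), "n2": 2,  "n1": 7},   # diffuse canopy: removed
]
kept, k = screen_canopies(canopies)
```

The surviving count plays the role of the K value handed to the K-Means phase.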
Step 3: K-Means iteration. The iteration body here is similar to the basic K-Means iteration body, except that the Canopy cluster result serves as the initial cluster division, and each Canopy center, endowed with a higher weight, substitutes for the set of data points that fall within that Canopy cluster's T2 range, participating in the K-Means iteration in their place and thereby realizing the "filtering". Each iteration is completed by one complete MapReduce job comprising one map phase and one reduce phase.
On the first iteration, KMeansMapper reads the Canopy cluster result from HDFS; every later iteration reads the previous K-Means cluster result from HDFS as its input file. The data points strong-marked in the Canopy clustering phase (those falling within the T2 range) do not participate in the distance-function computation, which reduces the scale of the computation. After the distance computation, each data point is added to the nearest cluster center and its influence on the formation of the cluster is recorded, represented by ClusterObservation. The NearestCluster method adds the (filtered) data points on the local machine to the nearest cluster. Since each newly added data point affects the cluster's information, this influence is described by the ClusterObservation class.
In the K-Means clustering algorithm the number of clusters is fixed, so one corresponding reducer is arranged for each cluster; the concrete mapping is executed by Hadoop's jobtracker, with the implementation details hidden from the user. The operation on the reducer is very simple: it aggregates the local cluster information sent by each mapper into global cluster information and outputs it to HDFS as the input file of the next iteration. The Computeconvergence() method checks whether the iteration-stopping condition is reached; if converged, the iterative process ends here. Otherwise the cluster.computeParameters() method is executed to recompute each cluster's parameters, so that the intermediate results of the current iteration do not affect the next iteration, guaranteeing that the cluster centers are recomputed accurately in every iteration. Whether or not iteration stops, the global cluster information of the current iteration is written to HDFS.
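A single-machine sketch of the weighted, filtered iteration of this step (outside MapReduce; the function name, convergence tolerance and example data are assumptions) could be:

```python
import math

def weighted_kmeans(points, weights, centers, tol=1e-6, max_iter=100):
    """K-Means over weighted points: a substituting Canopy center carries
    the weight of the strong-marked points it replaces, while ordinary
    points carry weight 1, so the weighted mean tracks the full data."""
    centers = [tuple(c) for c in centers]
    dim = len(centers[0])
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in centers]
        wsum = [0.0] * len(centers)
        for p, w in zip(points, weights):
            # assign each weighted point to its nearest current center
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            wsum[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        new_centers = [tuple(s / wsum[j] for s in sums[j]) if wsum[j] else centers[j]
                       for j in range(len(centers))]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                      # iteration-stopping condition
            break
    return centers

# two substituting Canopy centers (high weight) plus two unfiltered points
pts = [(0.1, 0.1), (4.9, 5.1), (0.4, 0.0), (5.0, 4.8)]
wts = [10, 8, 1, 1]
cents = weighted_kmeans(pts, wts, centers=[(0.1, 0.1), (4.9, 5.1)])
```

The weighting is what lets a single Canopy center stand in for all the strong-marked points it absorbed without shifting the resulting means.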
Step 4: distribute the data points. After the K-Means iteration completes, the complete information of the k clusters has been obtained; the task of the final step is to assign all data points to their corresponding clusters. This process is realized with one MapReduce job that needs only a map phase: on each mapper, the distance from each local data point to every cluster center in the global cluster set is computed, the data point is added to the nearest cluster, and the result is output to HDFS. This step is very simple, so its pseudocode implementation is not repeated here. At this point, the entire Canopy-division-and-filtering K-Means clustering algorithm has been realized in the MapReduce framework.
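Although the pseudocode is omitted here, the map-side logic of this step amounts to no more than the following sketch (illustrative names):

```python
import math

def assign_points(points, centers):
    """Map-only final step: each data point joins the nearest global center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

labels = assign_points([(0.2, 0.1), (4.8, 5.2), (0.0, 0.4)],
                       [(0.0, 0.0), (5.0, 5.0)])
```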
1. Analysis of the results:
First, three indices for evaluating clustering accuracy are introduced: precision, recall and the sum of squared errors. Assume that in the clustering result, R is the number of examples that belong to cluster C and are assigned to cluster C, S is the number that belong to cluster C but are not assigned to cluster C, T is the number that do not belong to cluster C but are assigned to cluster C, and K is the number of clusters. Precision refers to the proportion of examples in the current cluster that truly belong to it, that is: Precision = R / (R + T); the average precision of the entire data set is the mean of the K per-cluster precisions. Recall refers to the proportion of the examples a cluster actually possesses that are correctly assigned to it, that is: Recall = R / (R + S); the average recall of the entire data set is defined analogously. The sum of squared errors is the sum of the squared distances from the data points to their corresponding cluster centers; it is usually used both as the iteration-convergence condition and as an important index for evaluating clustering methods. Let the sum of squared errors be J, the number of clusters K, the total number of data points N, a data point x and a cluster center c; then: J = Σ_{i=1}^{K} Σ_{x∈C_i} ||x − c_i||².
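Under the assumption that predicted clusters are already aligned with the true clusters by label (real evaluations first match them), these three indices can be computed as in this sketch:

```python
import math

def avg_precision_recall(true_labels, pred_labels, k):
    """Per-cluster precision R/(R+T) and recall R/(R+S),
    averaged over the K clusters."""
    precs, recs = [], []
    for c in range(k):
        r = sum(t == c and p == c for t, p in zip(true_labels, pred_labels))
        t_fp = sum(t != c and p == c for t, p in zip(true_labels, pred_labels))
        s_fn = sum(t == c and p != c for t, p in zip(true_labels, pred_labels))
        precs.append(r / (r + t_fp) if r + t_fp else 0.0)
        recs.append(r / (r + s_fn) if r + s_fn else 0.0)
    return sum(precs) / k, sum(recs) / k

def sse(points, labels, centers):
    """J: sum of squared distances from each point to its cluster center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

prec, rec = avg_precision_recall([0, 0, 1, 1], [0, 0, 1, 0], k=2)
j = sse([(0.0, 0.0), (1.0, 0.0)], [0, 0], [(0.0, 0.0)])
```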
Table 1 shows the comparative experimental results on the three data sets D11, D12 and D13, where (T1, T2) is chosen as (2.5, 1.5), (150, 100) and (1000, 800) respectively.
Table 1 Running results on data set D11
The Canopy-division-and-filtering K-Means algorithm is superior to the traditional K-Means algorithm in precision, recall and minimum error, and can obtain more accurate cluster results. Analyzing the reason: the improved algorithm obtains canopy cluster centers by Canopy division over the entire data set and screens those centers according to a fixed criterion, so the K-Means initial centers obtained this way are more accurate than the initial centers picked by the random selection method of the traditional K-Means algorithm. This lessens the influence of the local-optimum problem that may exist in the K-Means algorithm and makes the final cluster result more accurate. On the other hand, can the more accurate initial centers obtained through the Canopy initial division also accelerate the convergence of the K-Means iteration and effectively reduce the number of iterations? Fig. 2, Fig. 3 and Fig. 4 respectively depict the iteration-convergence curves of the two algorithms on the three data sets D11, D12 and D13.
Fig. 5 and Fig. 6 show the convergence curves of the traditional K-Means algorithm and the improved K-Means algorithm, from which it can clearly be seen that the improved algorithm has a faster convergence speed and better cluster compactness. Comparing the convergence curves in each figure, it is found that their shapes are basically similar; that is, the rate at which the improved algorithm's sum of squared errors decreases during iteration is no faster than the traditional algorithm's. This is understandable, because the improved K-Means algorithm does not modify the iteration method. The faster convergence speed can therefore essentially be attributed to the accuracy of the initial division, which leads to two results: more accurate cluster centers, which reduce the number of iterations, and a more accurate cluster division, which reduces the sum of squared errors.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (6)

1. A MapReduce-based large-scale data clustering method, characterized in that the MapReduce-based large-scale data clustering method comprises the following steps:
Step 1: input and format conversion of the raw data; Hadoop defines three input data formats: TextInputFormat, KeyValueInputFormat and SequenceFileInputFormat; the data to be clustered are in the form of high-dimensional vectors, so SequenceFileInputFormat is selected; Hadoop's built-in InputDriver class is called;
Step 2: Canopy division and screening; the initial cluster division is obtained, and a cluster count K suited to the data set is determined according to the screening conditions, to serve as the K value of the subsequent K-Means algorithm; the center information of the K initial clusters is obtained, including the feature vector representing each cluster center, its weight, the number of data points within its T2 range and the number of data points outside T2 but within T1; Canopy clustering is performed on all the data, and according to the T1 and T2 thresholds a strong mark is applied to the data points falling within the T2 range; the design goal is realized with one MapReduce job comprising one map phase and one reduce phase;
Step 3: K-Means iteration; the Canopy cluster result serves as the initial cluster division, and each Canopy center, endowed with a higher weight, substitutes for the set of data points falling within that Canopy cluster's T2 range, participating in the K-Means iteration and realizing filtering; each iteration is completed by one complete MapReduce job comprising one map phase and one reduce phase;
Step 4: distribution of the data points; after the K-Means iteration completes, the complete information of the k clusters is obtained and all data points are assigned to their corresponding clusters; this is realized with one MapReduce job; on each mapper, the distance from each local data point to every cluster center in the global cluster set is computed, the data point is added to the nearest cluster, and the result is output to HDFS.
2. The MapReduce-based large-scale data clustering method of claim 1, characterized in that in step two the AddObjtoCanopyList method adds each data point to the corresponding Canopy cluster; in the Cleanup() method, the program outputs <"centroid", Canopycenter> key-value pairs as intermediate results for the next phase; Hadoop sends the local Canopy information formed on each mapper to a single reducer over the network;
in the Canopy reduce phase, the single reducer processes the local Canopy information from each mapper, forms the global Canopy set and writes it to HDFS; the weights of the Canopy centers in the global Canopy set are modified, and the Canopy set is screened according to the value of n1/n2 to obtain the cluster count K of the subsequent K-Means clustering; the AddCanopycentertoCanopyList method adds each local Canopy center, as a data point, into the global canopy information and recomputes that information, including the updated center vector, weight, n1 and n2; before the canopy center information is written to the distributed file system, the Cleanup method screens it first, removing the canopy centers that do not satisfy the condition.
3. The MapReduce-based large-scale data clustering method of claim 1, characterized in that in step three KMeansMapper reads the Canopy cluster result from HDFS on the first iteration, and every later iteration reads the previous K-Means cluster result from HDFS as its input file; the data points strong-marked in the Canopy clustering phase do not participate in the distance-function computation; after the distance computation each data point is added to the nearest cluster center, and its influence on the formation of the cluster is recorded, represented by clusterObservation; the NearestCluster method adds the data points on the local machine to the nearest cluster;
one corresponding reducer is arranged for each cluster, and the concrete mapping is executed by Hadoop's jobtracker, with the implementation details hidden from the user; the reducer aggregates the local cluster information sent by each mapper into global cluster information and outputs it to HDFS as the input file of the next iteration; the Computeconvergence() method checks whether the iteration-stopping condition is reached; if converged, the iterative process ends; otherwise the Cluster.computeParameters() method is executed to recompute each cluster's parameters, so that the intermediate results of the current iteration do not affect the next iteration.
4. The MapReduce-based large-scale data clustering method of claim 1, characterized in that the sampling and filtering method of the MapReduce-based large-scale data clustering method comprises: performing cluster analysis on all the data points, identifying the main cluster modes through sample analysis, and filtering out in the cluster analysis those data points that fall within the identified cluster modes; the cluster analyses obtained in the two stages are merged once;
for the raw data set, Canopy clustering is performed on it first, making an initial division by setting reasonable T1 and T2 thresholds and obtaining several Canopy clusters together with their center information; for each Canopy cluster, one part of the data points belonging to it lies within the T2 range, and the other part lies within the T1 range but outside the T2 range; the data points falling within the T2 range are considered sufficiently close to, and compact around, the Canopy center, so all data points within the T2 range are replaced by the Canopy center endowed with a larger weight, that is, a Canopy center whose weight exceeds that of an ordinary data point replaces all the data points within T2 in the stricter cluster-analysis process that follows, realizing filtering; for the initial data, the weight of every data point is uniformly set to 1; for a Canopy center serving as a substitute, the condition its weight must satisfy is that the more data points fall within the T2 range, the larger the weight of the Canopy center should be; the number of data points falling within the T2 range can therefore serve as the weight of the Canopy center;
the raw data are first deployed in the distributed file system; Hadoop assigns the data to each machine according to the concrete situation of the computer cluster, and all the data are stored on different machines in the cluster; in the cluster-analysis process, the Canopy clustering algorithm divides all the data once, obtaining the 3 main clusters in the figure; the data points falling within the T2 range are uniformly substituted by their cluster center; each raw data point carries the same weight 1, while the weight of each Canopy cluster center serving as a substitute is numerically equal to the number of data points within its T2 range; after the more accurate cluster information is obtained through K-Means clustering, all actual data points are finally assigned to their corresponding clusters, completing the clustering process.
5. The MapReduce-based large-scale data clustering method of claim 1, characterized in that the Canopy threshold division and filtering K-Means algorithm of the MapReduce-based large-scale data clustering method performs one cluster analysis of the entire data set with the Canopy clustering algorithm, and uses the Canopy cluster result to make the initial division of, and to filter, the entire data set.
6. A big-data processing system applying the MapReduce-based large-scale data clustering method of any one of claims 1 to 5.
CN201811099090.9A 2018-09-20 2018-09-20 A kind of large data clustering method based on MapReduce Pending CN109271421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811099090.9A CN109271421A (en) 2018-09-20 2018-09-20 A kind of large data clustering method based on MapReduce


Publications (1)

Publication Number Publication Date
CN109271421A true CN109271421A (en) 2019-01-25

Family

ID=65197727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811099090.9A Pending CN109271421A (en) 2018-09-20 2018-09-20 A kind of large data clustering method based on MapReduce

Country Status (1)

Country Link
CN (1) CN109271421A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222248A (en) * 2019-05-28 2019-09-10 长江大学 A kind of big data clustering method and device
CN110232398A (en) * 2019-04-24 2019-09-13 广东交通职业技术学院 A kind of road network sub-area division and its appraisal procedure based on Canopy+Kmeans cluster
CN110378550A (en) * 2019-06-03 2019-10-25 东南大学 The processing method of the extensive food data of multi-source based on distributed structure/architecture
CN111310843A (en) * 2020-02-25 2020-06-19 苏州浪潮智能科技有限公司 Mass streaming data clustering method and system based on K-means
CN112035454A (en) * 2020-08-28 2020-12-04 江苏徐工信息技术股份有限公司 Black-work land detection algorithm based on cluster integration

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Yanan, "Research on the Parallelization of Clustering Algorithms Based on the Hadoop Cloud Computing Platform", China Master's Theses Full-text Database, Information Science and Technology *
Chen Aiping, "Parallelization Analysis and Application Research of Clustering Algorithms Based on Hadoop", China Master's Theses Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190125)