CN103838863A

CN103838863A - Big-data clustering algorithm based on cloud computing platform

Info

Publication number: CN103838863A
Application number: CN201410104227.0A
Authority: CN
Inventors: 孟海东; 任敬佩; 宋宇辰
Original assignee: Inner Mongolia University of Science and Technology
Current assignee: Inner Mongolia University of Science and Technology
Priority date: 2014-03-14
Filing date: 2014-03-14
Publication date: 2014-06-04
Anticipated expiration: 2034-03-14
Also published as: CN103838863B

Abstract

The invention discloses a big-data clustering algorithm based on a cloud computing platform. The raw data is preprocessed; the data is divided into M sub-datasets that are distributed to M Map functions, and local clustering is carried out on each sub-dataset; clusters with the same key are merged; if the actual number of clusters R is smaller than the number of clusters k, the number of representative points c and the contraction factor a are adjusted and clustering is carried out again until the termination condition is reached. If a new data set is generated, local clustering is carried out according to whether the number of cluster centers K of the new data source is larger than the number of clusters K obtained before the update, or the number of points in the new data source is larger than the number of points in the data source before the update. The algorithm uses the parallel computing capacity of the high-performance cluster systems of cloud computing to solve the problem of processing massive data in clustering, so that relationships in the data can be mined quickly and efficiently.

Description

A big-data clustering algorithm based on a cloud computing platform

Technical field

The invention belongs to the technical field of data mining and relates to a big-data clustering algorithm based on a cloud computing platform.

Background art

Cluster analysis, an interdisciplinary subject spanning statistics, machine learning and data mining, has attracted many researchers and has become a very active research topic in the data mining field. Researchers in China and abroad have proposed many clustering algorithms to date. The main clustering methods can be divided into partitioning-based methods, hierarchy-based methods, density-based methods, grid-based methods, model-based methods, and so on.

At the 6th International Symposium on Mobile Internet, held on August 21, 2012, Dr. Deng Kan, who holds a doctorate in computer science and robotics from Carnegie Mellon University in the United States, stated that finding the value in big data depends on data mining algorithms, and that those algorithms must be combined with the parallel computing of cloud computing. A distributed cloud storage platform offers lower cost and high processing performance; combined with efficient data mining algorithms, it becomes an effective remedy for big-data problems.

The University of Southampton study "Research on Mining Massive Data under Cloud Computing" notes that the emergence of cloud computing gives more and more small and medium-sized enterprises an inexpensive solution for analyzing massive data. Building on the Hadoop cluster framework of cloud computing and on the SPRINT classification algorithm from data mining (Scalable Parallelizable Induction of Decision Trees, a scalable decision-tree algorithm), the study describes in detail the execution flow of the parallel SPRINT algorithm on the MapReduce programming model (a data processing model) in Hadoop (a distributed programming framework), and uses the resulting decision-tree model to classify input data.

At present, data mining work based on cloud computing platforms has produced many achievements. The Apache Mahout project (an open-source project of the Apache Software Foundation) has developed parallel data mining algorithms for a variety of commercial applications. The Parallel Distributed Miner (PDMiner), a parallel distributed data mining platform released by the Institute of Computing Technology, Chinese Academy of Sciences, can process massive data at the terabyte level. The parallel data mining tool of China Mobile (BC-PDM, Blue Carrier based Parallel Data Mining) additionally provides a Web-based service model. These achievements have greatly promoted the development of the field. Several data mining algorithms have been implemented on the cloud computing programming model MapReduce. In 2007, Chu et al. proposed a naive Bayes classification algorithm based on MapReduce. The algorithm follows the idea of distributed processing and constructs the classifier by computing statistics over the samples in a decentralized way and then integrating them centrally; however, it can only process discrete data and does not effectively support continuous data. In addition, to the best of our knowledge, no authoritative report on MapReduce implementations of general clustering algorithms has yet appeared.

At present, research on clustering methods in China and abroad mostly remains at the level of optimizing serial methods. Serial clustering algorithms have been widely studied and applied in statistics and databases, for example the K-Means algorithm, the BIRCH algorithm (Balanced Iterative Reducing and Clustering Using Hierarchies), a comprehensive hierarchical clustering algorithm for large-scale database systems, and the STING algorithm (Statistical Information Grid) for processing spatial data. In the face of ever-growing databases and high-dimensional data types, studying clustering algorithms under a parallel model and using the high-speed computing capability of clusters to solve the clustering of big data is of great significance.

With the diversified development of the Internet, real-time data streams and connected devices, and driven by demands such as search services, social networks, mobile commerce and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing will bring revolutionary change to the Internet model as a whole and to enterprise management models. Several major companies were pioneers of cloud computing.

Google is the largest user of cloud computing. At present, Google allows third parties to run large-scale parallel applications on its cloud through Google App Engine. MapReduce is a distributed computing programming framework first proposed by Google in 2004, and it supports the distributed processing of large volumes of data.

Hadoop is an open-source distributed computing framework from the Apache open-source organization and has been applied on many large websites. The core of the Hadoop framework is MapReduce and the Hadoop Distributed File System (HDFS). Amazon uses the Elastic Compute Cloud (EC2) and the Simple Storage Service (S3) to provide computing and storage services for enterprises. In November 2007, IBM released the "game-changing" "Blue Cloud" computing platform, bringing customers a cloud computing platform that is ready to use on purchase. Microsoft followed closely, releasing the Windows Azure operating system in October 2008. Azure (translated as "blue sky") is another disruptive transition for Microsoft after Windows replaced DOS: by building a new cloud computing platform on the Internet architecture, it lets Windows truly extend from the PC to the "blue sky".

In China, cloud computing is also developing very quickly. In 2008, IBM set up two cloud computing centers in China, in Wuxi and in Beijing. 21Vianet released the CloudEx product line (an elastic cloud computing platform), providing Internet hosting services, online storage virtualization services and so on. The China Mobile Research Institute set up a cloud computing experimental center with 1024 CPUs. The PLA University of Science and Technology developed the cloud storage system MassCloud (a massive cloud storage platform) and uses it to support large-scale video surveillance applications and a 3G-based digital earth system.

Given the current state of clustering research in data mining, existing approaches to mining big-data clusters mostly sample the data, select representative data, and perform cluster analysis in which a few points stand in for the whole. When facing big data, sampling methods based on probability are generally adopted, but sampling does not consider the global relative distances between data points or between intervals, nor the uneven distribution of the data, so the interval partitioning becomes too rigid. Although clustering, fuzzy concepts, cloud models and the like were later introduced to soften the overly rigid interval partitioning and achieved good results, none of these methods considers the different roles that the data points of big data play in the knowledge discovery task. Therefore, to make the mined clustering rules more effective and faster to obtain, cluster analysis must be studied in greater depth, starting from full consideration of the different roles of data points. Cloud computing was proposed precisely for processing the data points of real-world big data, which provides a strong theoretical foundation for mining more effective clustering rules.

Summary of the invention

The object of the invention is to overcome the defects of the above techniques by providing a big-data clustering algorithm based on a cloud computing platform. The method uses the parallel computing capability of the high-performance cluster systems of cloud computing to solve the big-data processing problem that clustering faces, so that relationships in the data can be mined quickly and effectively. The concrete technical scheme is as follows:

A big-data clustering algorithm based on a cloud computing platform comprises the following steps:

(1) Preprocess the raw data;

The basic idea is as follows. First, the whole data source is scanned to check for null values, and missing values are filled in using the mean of the dimension in which the null value occurs. Next, the data set is vectorized and partitioned; the resulting data blocks are distributed to the nodes, and each node assigns its data blocks to M Map functions. In the function, a threshold T (the distance between points) and M (the minimum number of points allowed in a cluster) are set, and the c points that are farthest apart are chosen as representative points for clustering. Points that satisfy the T requirement are grouped into one class and placed in one cluster; this is repeated until no points satisfy the requirement, and the remaining points are then grouped into one class to form one cluster. Each cluster center is represented by (N, SUM, SUMSQ), where N is the number of points in the cluster, SUM is the per-dimension sum of all points, and SUMSQ is the per-dimension sum of squared components of all points. Finally, the number of points in each resulting cluster is checked: if it is less than M, all points in that cluster are deleted; otherwise the clusters form a data set U, and a cluster number K is obtained. The concrete steps are as follows:

1: scan the whole data set, check whether each dimension contains null values, and fill in the missing values;

2: vectorize the data set;

3: partition the data set into M sub-datasets and assign them to the child nodes;

4: assign the M sub-datasets to M Map functions, with each Map task processing one data split;

5: in the Map stage, perform local clustering on each sub-dataset, choosing the c points that are farthest apart as representative points;

6: if the distance between points is less than T, group them into one class; otherwise, place the remaining points in one cluster;

7: count the points in each cluster, and delete any cluster whose number of points is less than M;

8: in the Reduce stage, combine all clusters into a new data set U, compute the number of clusters K, and represent the center point of each cluster by (N, SUM, SUMSQ).
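The preprocessing steps above can be illustrated with a minimal single-machine sketch. It is not the patented implementation: the function names are hypothetical, Euclidean distance is assumed, and the seed-based grouping is only one possible reading of "points whose distance is less than T are grouped into one class". The parameter min_points plays the role of the minimum cluster size that the text also calls M.

```python
import math
from typing import List, Optional, Tuple

Point = List[float]
Summary = Tuple[int, List[float], List[float]]  # (N, SUM, SUMSQ)


def impute_missing(rows: List[List[Optional[float]]]) -> List[Point]:
    """Fill each None with the mean of the known values in the same dimension."""
    dims = len(rows[0])
    means = []
    for d in range(dims):
        known = [r[d] for r in rows if r[d] is not None]
        means.append(sum(known) / len(known) if known else 0.0)
    return [[means[d] if r[d] is None else r[d] for d in range(dims)] for r in rows]


def threshold_clusters(points: List[Point], t: float) -> List[List[Point]]:
    """Assumed reading of step 6: group points closer than t to a seed point;
    points with no neighbour within t end up together in one leftover cluster."""
    unassigned = list(points)
    clusters: List[List[Point]] = []
    leftover: List[Point] = []
    while unassigned:
        seed = unassigned.pop(0)
        near = [p for p in unassigned if math.dist(seed, p) < t]
        if near:
            unassigned = [p for p in unassigned if math.dist(seed, p) >= t]
            clusters.append([seed] + near)
        else:
            leftover.append(seed)
    if leftover:
        clusters.append(leftover)
    return clusters


def summarize(cluster: List[Point]) -> Summary:
    """Represent a cluster by (N, SUM, SUMSQ), computed per dimension."""
    dims = len(cluster[0])
    total = [sum(p[d] for p in cluster) for d in range(dims)]
    total_sq = [sum(p[d] ** 2 for p in cluster) for d in range(dims)]
    return len(cluster), total, total_sq


def preprocess(rows: List[List[Optional[float]]], t: float, min_points: int) -> List[Summary]:
    """Impute, group by threshold, drop clusters smaller than min_points, summarize."""
    points = impute_missing(rows)
    kept = [c for c in threshold_clusters(points, t) if len(c) >= min_points]
    return [summarize(c) for c in kept]
```

The number of summaries returned by preprocess corresponds to the cluster number K of step 8; in a real deployment each Map task would run the grouping on its own data split.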

(2) Divide the data set U into M sub-datasets and assign them to M Map functions;

(3) In the Map stage, perform local clustering on each sub-dataset, choosing the c points that are farthest apart as representative points;

(4) Compute the center point (N, SUM, SUMSQ) of each cluster;

(5) In the Reduce stage, merge the classes that have the same key; the center of the merged cluster is (N1 + N2 + ... + Ni, SUM1 + SUM2 + ... + SUMi, SUMSQ1 + SUMSQ2 + ... + SUMSQi);

(6) If the actual number of clusters R is less than the number of clusters K, adjust the number of representative points c and the contraction factor a and cluster again, until the termination condition is reached.
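The merge in step (5) works because (N, SUM, SUMSQ) is additive: the summary of a merged cluster is the component-wise sum of the partial summaries. A minimal sketch follows, assuming string keys and hypothetical function names; the text does not specify how keys are assigned to local clusters, so the keys here are placeholders.

```python
from typing import Dict, Iterable, List, Tuple

Summary = Tuple[int, List[float], List[float]]  # (N, SUM, SUMSQ)


def merge(a: Summary, b: Summary) -> Summary:
    """Component-wise sum of two cluster summaries."""
    n = a[0] + b[0]
    total = [x + y for x, y in zip(a[1], b[1])]
    total_sq = [x + y for x, y in zip(a[2], b[2])]
    return n, total, total_sq


def reduce_by_key(pairs: Iterable[Tuple[str, Summary]]) -> Dict[str, Summary]:
    """Reduce stage: merge all summaries that share the same key."""
    merged: Dict[str, Summary] = {}
    for key, summary in pairs:
        merged[key] = merge(merged[key], summary) if key in merged else summary
    return merged


def centroid(summary: Summary) -> List[float]:
    """Mean position of a cluster, recovered as SUM / N per dimension."""
    n, total, _ = summary
    return [v / n for v in total]
```

Because only the summaries are exchanged, the Reduce stage never needs the raw points of a cluster to obtain its merged center.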

(7) Big data is characterized not only by high dimensionality and massive volume, but also by rapid data generation and rapid data updates; the algorithm therefore handles updates with the following method;

The basic idea is as follows. First, the new data source is preprocessed (as above), giving the data set U of the new data source, its total number of data points N, and the number K of cluster center points. Next, if the number of centers K of the new data source is greater than the number of clusters K obtained before the update, or the number of points in the new data source is greater than the number of points in the data source before the update, the new data source and the data source before the update are partitioned again together; otherwise, the K cluster center points obtained from the data set before the update are treated as K points and combined with the new data source to form a new data set, which is then partitioned. The subsets are then assigned to the child nodes and distributed to several Map functions for local clustering. In the first case, K is set to [(K_new + K_old)/2]; otherwise K keeps its value from before the update. Stages 3, 4, 5 and 6 are then repeated (preprocessing stage). The concrete steps are as follows:

1: preprocess the new data source (as above);

2: vectorize the data set;

3: compare the number of points N and the number of center points K of the new data source with the number of points N and the number of center points K of the data source before the update;

4: if N_new > N_old or K_new > K_old, partition the two data sets again together, with K = [(K_new + K_old)/2]; otherwise, treat the K cluster center points obtained from the data set before the update as K points, combine them with the new data source into a new data set and partition it, with K = K_old;

5: divide the data set U into M sub-datasets and assign them to M Map functions;

6: in the Map stage, perform local clustering on each sub-dataset, choosing the c points that are farthest apart as representative points;

7: compute the center point (N, SUM, SUMSQ) of each cluster;

8: in the Reduce stage, merge the classes that have the same key; the center of the merged cluster is (N1 + N2 + ... + Ni, SUM1 + SUM2 + ... + SUMi, SUMSQ1 + SUMSQ2 + ... + SUMSQi);

9: if the actual number of clusters R is less than the number of clusters K, adjust the number of representative points c and the contraction factor a and cluster again, until the termination condition is reached.
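Steps 6 and 9 rely on c representative points and a contraction factor a, but the text does not give formulas for either. The sketch below is therefore an assumption: it borrows the well-known farthest-point selection and shrink-toward-centroid scheme used by CURE-style clustering, with hypothetical function names, and should be read as one plausible interpretation rather than the patented procedure.

```python
import math
from typing import List

Point = List[float]


def centroid(points: List[Point]) -> Point:
    """Per-dimension mean of a cluster."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def pick_representatives(points: List[Point], c: int) -> List[Point]:
    """Assumed selection rule: start with the point farthest from the centroid,
    then repeatedly add the point farthest from those already chosen."""
    center = centroid(points)
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < c:
        candidates = [p for p in points if p not in chosen]
        if not candidates:
            break  # fewer distinct points than c
        chosen.append(
            max(candidates, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def shrink(reps: List[Point], center: Point, a: float) -> List[Point]:
    """Assumed use of the contraction factor: move each representative
    a fraction a of the way toward the cluster centroid (0 <= a <= 1)."""
    return [[x + a * (cx - x) for x, cx in zip(p, center)] for p in reps]
```

Under this reading, a larger c captures more of a cluster's shape and a larger a damps the influence of outlying points, which is why step 9 adjusts both when the actual cluster count R falls short of K.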

Compared with the prior art, the beneficial effects of the present invention are as follows. The invention uses the parallel computing capability of the high-performance cluster systems of cloud computing to solve the big-data processing problem that clustering faces. With parallel clustering as the goal, it proposes a new clustering approach and an improved method. The data processing cost of an enterprise is greatly reduced, and the enterprise no longer depends on high-performance machines. Big-data mining development based on cloud computing is convenient, because the lower layers are hidden. Under parallelization, cloud computing can use existing equipment to improve the capacity and speed of processing large-scale data, and it both ensures fault tolerance and allows nodes to be added. The invention applies cloud computing to cluster analysis in data mining and realizes a new abstract model in which messy details such as parallelization, fault tolerance, data distribution and load balancing are hidden, so that data can be processed quickly, the associations between data can be mined, the great influence of big data on modern life can be captured, and the problem of data mining in the face of big data is solved.

Brief description of the drawings

Fig. 1 is the flow chart of big-data preprocessing in the big-data clustering algorithm based on a cloud computing platform according to the present invention;

Fig. 2 is the flow chart of the big-data clustering algorithm based on a cloud computing platform according to the present invention;

Fig. 3 is the flow chart of the clustering algorithm after a big-data update on the cloud computing platform according to the present invention.

Detailed description of the embodiments

The technical scheme of the present invention is described in more detail below with reference to specific embodiments.

Referring to Figs. 1, 2 and 3. In Fig. 1, T: the distance between points; M: the number of points contained in a cluster; N: the number of points in a cluster; SUM: the per-dimension sum of all points; SUMSQ: the per-dimension sum of squared components of all points. In Fig. 3, N1: the number of points in the initial data source; N2: the number of points in the new data source; K1: the initial number of clusters; K2: the number of clusters after preprocessing the new data; Pi: the center points of the initial clusters; K = [(K1 + K2)/2].
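The update decision of Fig. 3 can be sketched with the symbols N1, N2, K1, K2 and Pi defined above. The function name plan_update is hypothetical, and reading the square brackets in K = [(K1 + K2)/2] as the integer part (floor) is an assumption, since the text does not state the rounding rule.

```python
from typing import List, Tuple

Point = List[float]


def plan_update(
    n1: int,
    k1: int,
    n2: int,
    k2: int,
    old_points: List[Point],
    old_centers: List[Point],
    new_points: List[Point],
) -> Tuple[int, List[Point]]:
    """Return the target cluster number K and the data set to re-partition.

    n1, k1: point count and cluster count before the update.
    n2, k2: point count and cluster count of the preprocessed new data source.
    old_centers: the center points Pi of the clusters found before the update.
    """
    if n2 > n1 or k2 > k1:
        # The new data dominates: re-partition old and new data together.
        k = (k1 + k2) // 2  # assumed floor interpretation of [(K1 + K2)/2]
        data = old_points + new_points
    else:
        # Small update: old clusters are represented only by their centers.
        k = k1
        data = old_centers + new_points
    return k, data
```

In the second branch the old data contributes only K1 points, so a small update can be clustered without reprocessing the full original data set.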

(1) Preprocess the raw data;

2: vectorize the data set;

3: partition the data set into M sub-datasets and assign them to the child nodes;

(2) Divide the data set U into M sub-datasets and assign them to M Map functions;

(4) Compute the center point (N, SUM, SUMSQ) of each cluster;

1: preprocess the new data source (as above);

2: vectorize the data set;

5: divide the data set U into M sub-datasets and assign them to M Map functions;

7: compute the center point (N, SUM, SUMSQ) of each cluster;

Verifying the effectiveness and timeliness of the algorithm

To verify the effectiveness and timeliness of the big-data clustering algorithm on the Hadoop platform, the algorithm is verified on several groups of test data sets. Classical UCI data sets and Public Data Sets (development data sets of tens of TB that Amazon has provided to developers since 2008) are used to test the effectiveness and timeliness of the big-data clustering results on the cloud computing platform.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited thereto; any simple change or equivalent replacement of the technical scheme that a person skilled in the art could obviously obtain within the technical scope disclosed by the present invention falls within the protection scope of the present invention.

Claims

1. A big-data clustering algorithm based on a cloud computing platform, characterized in that it comprises the following steps:

1) preprocessing the raw data set;

2) dividing the data U into M sub-datasets and assigning them to M Map functions;

3) in the Map stage, performing local clustering on each sub-dataset;

4) in the Reduce stage, merging the classes that have the same key;

5) if the actual number of clusters R is less than the number of clusters k, adjusting the number of representative points c and the contraction factor and clustering again, until the termination condition is reached;

6) if N_new > N_old or K_new > K_old, partitioning the two data sets again together, with K = [(K_new + K_old)/2]; otherwise, treating the K cluster center points obtained from the data set before the update as K points, combining them with the new data source into a new data set and partitioning it, with K = K_old;

7) repeating stages 3), 4) and 5) until the termination condition is reached.