CN103544328A - Parallel k-means clustering method based on Hadoop

Info

Publication number: CN103544328A
Application number: CN201310568611.1A
Authority: CN
Inventors: 高阳; 王睿; 史颖欢
Original assignee: JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd; Nanjing University
Current assignee: JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd; Nanjing University
Priority date: 2013-11-15
Filing date: 2013-11-15
Publication date: 2014-01-29

Abstract

The invention discloses a parallel k-means clustering method based on Hadoop. The method comprises the following steps: the data are preprocessed and initial centers for the k classes are selected; at the Mapper side of each node of the Hadoop platform, the distance from each data object to every center is calculated; the Mapper selects the nearest center and sends the local data to the Combiner side of each node of the Hadoop platform; at the Combiner side, the data objects belonging to the same center are grouped and summed; the local data of each cluster are sent to the Reducer side of each node of the Hadoop platform; at the Reducer side, the local data of all clusters are gathered and the new center of every cluster is calculated; these steps are repeated until the clustering result converges. The method can effectively solve the clustering problem for massive data and greatly improve clustering speed.

Description

A parallel k-means clustering method based on Hadoop

Technical field

The present invention relates to a parallel k-means clustering method based on Hadoop.

Background art

The k-means algorithm is the most widely used partitioning method among clustering methods. Given a specified number of clusters k, it partitions the data set to generate clusters, so it is a clustering algorithm for which the number of clusters is known in advance. The algorithm iteratively performs distance calculations for the data objects and recomputes the cluster means; when the data objects no longer change cluster, the iteration ends and the clustering result is obtained. The algorithm therefore has to keep reassigning data objects during iteration and keep recomputing the new cluster centers after each adjustment. When the data volume reaches massive scale, the efficiency of the algorithm itself is put to the test: its running time becomes hard to reconcile with practical requirements, so speeding up the algorithm must be considered. Massive data also strain the processing speed and memory of a computer. Each iteration involves two main computations: first, calculating the distance between each data object and every cluster center; second, generating the new cluster centers, which requires averaging all data objects in each cluster. As the data scale keeps growing, the time spent calculating distances between data objects and cluster centers grows as well.
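The two computations named above can be sketched for a single machine as follows. This is a minimal illustration, not code from the patent: the function names `assign` and `update`, the sample points, and the choice to leave an empty cluster's center unchanged are all assumptions made for the example.

```python
import math

def assign(points, centers):
    """Part 1: label each point with the index of its nearest center.

    This costs one distance computation per (point, center) pair.
    """
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels

def update(points, labels, centers):
    """Part 2: recompute each center as the mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old center.
            new_centers.append(old)
            continue
        dim = len(old)
        new_centers.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        )
    return new_centers

# Hypothetical toy data: two obvious groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
labels = assign(points, centers)
new_centers = update(points, labels, centers)
```

With n data objects and k centers, `assign` performs n times k distance computations per iteration, which is the part whose cost grows with the data scale.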

Summary of the invention

Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a parallel k-means clustering method based on Hadoop that can effectively solve the clustering problem for massive data and greatly improve clustering speed.

Technical solution: to achieve the above object, the invention adopts a parallel k-means clustering method based on Hadoop, comprising the following steps:

(1) preprocess the data;

(2) select the initial centers of the k classes;

(3) at the Mapper side of each node of the Hadoop platform, calculate the distance from each data object to all initial centers;

(4) at the Mapper side, select the center with the minimum distance, and send the local data to the Combiner side of each node of the Hadoop platform;

(5) at the Combiner side, group the data objects that belong to the same center, calculate the sum of the data objects belonging to that center, and send the local data of each cluster to the Reducer side of each node of the Hadoop platform;

(6) at the Reducer side, gather the local data of all clusters and calculate the new center of every cluster;

(7) repeat steps (3) to (6); if all k cluster centers remain unchanged, the iteration ends and the clustering result is obtained; otherwise, continue iterating.

Further, the local data take the form (key, value), where the key is the center and the value is a subset of the data objects.
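Steps (3) to (6) and the (key, value) form of the local data can be sketched in plain Python, with ordinary functions standing in for the Hadoop Mapper, Combiner, and Reducer. This is a hedged illustration only: it does not use the Hadoop API, the function names and sample data are hypothetical, and carrying an object count next to each partial sum is an assumption added here so that the Reducer can compute a mean.

```python
import math

def mapper(split, centers):
    """Steps (3)-(4): emit (key, value) = (index of nearest center, data object)."""
    for p in split:
        key = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        yield key, p

def combiner(pairs):
    """Step (5): on one node, sum the objects that share a center.

    The count stored beside each sum is an assumption of this sketch.
    """
    partial = {}
    for key, p in pairs:
        s, n = partial.get(key, ((0.0,) * len(p), 0))
        partial[key] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return list(partial.items())

def reducer(all_partials, centers):
    """Step (6): merge the partial sums of all nodes and compute new centers."""
    totals = {}
    for key, (s, n) in all_partials:
        ts, tn = totals.get(key, ((0.0,) * len(s), 0))
        totals[key] = (tuple(a + b for a, b in zip(ts, s)), tn + n)
    # A center that received no objects is left unchanged (assumption).
    return [
        tuple(x / totals[j][1] for x in totals[j][0]) if j in totals else c
        for j, c in enumerate(centers)
    ]

# Hypothetical input: two splits, as if stored on two different nodes.
splits = [[(1.0, 1.0), (9.0, 9.0)], [(3.0, 1.0), (11.0, 9.0)]]
centers = [(0.0, 0.0), (10.0, 10.0)]
partials = [kv for split in splits for kv in combiner(mapper(split, centers))]
new_centers = reducer(partials, centers)
```

Because each Combiner forwards one partial sum per center instead of every data object, the amount of data passed to the Reducer depends on k and the number of nodes rather than on the number of objects.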

Beneficial effects: the invention makes full use of the data-intensive nature of the problem, computes the k-means on the data in parallel, and from the partial results obtains the k global cluster centers. Experimental results show that the method greatly improves clustering efficiency, works well on large-scale data classification problems, achieves a good speedup ratio, and lowers the method's demands on memory and processor capability.

Brief description of the drawings

Fig. 1 is a flowchart of the method of the invention;

Fig. 2 is a schematic comparison of the theoretical speedup ratio and the actual speedup ratio of the method;

Fig. 3 is a schematic diagram of the running time of the method on different numbers of nodes.

Embodiments

The invention is further illustrated below with reference to the drawings and specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims of this application.

The steps of the method are described in detail below, as shown in Fig. 1:

Step 1: preprocess the data, for example by calculating the TF-IDF (term frequency-inverse document frequency) values of text data.

Step 2: select suitable initial cluster centers for the k classes, for example at random.

Step 3: at the Mapper side of each node of the Hadoop platform, calculate the distance from each data object to all initial cluster centers (for example, the cosine distance for text data, or the Euclidean distance for vector data).

Step 4: at the Mapper side, select the center with the minimum distance; with the center as the key and the subset of data objects as the value, form a data pair, called the local data, and send the local data to the Combiner side of each node of the Hadoop platform.

Step 5: at the Combiner side, group the data objects that belong to the same center, calculate the sum of the data objects belonging to that center, and send the local data of each cluster to the Reducer side of each node of the Hadoop platform.

Step 6: at the Reducer side, gather the local data of all clusters and calculate the new center of every cluster.

Step 7: repeat steps 3 to 6; if all k cluster centers remain unchanged, the iteration ends and the clustering result is obtained; otherwise, continue iterating.
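The distance choices of Step 3 and the stopping rule of Step 7 can be sketched as below. This is an illustrative single-process stand-in, not the patented Hadoop implementation: `one_pass` collapses a whole Mapper/Combiner/Reducer round into one function, and the names, the `max_iter` safety cap, and the toy data are assumptions. The exact equality test mirrors the "centers remain unchanged" criterion; a real job might compare against a small tolerance instead.

```python
import math

def euclidean_distance(a, b):
    """Distance suggested in Step 3 for vector data."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """Distance suggested in Step 3 for text data (e.g. TF-IDF vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: treat a zero vector as maximally distant.
    return 1.0 - dot / norm if norm else 1.0

def one_pass(points, centers, distance):
    """Stand-in for one round of steps 3 to 6 on a single machine."""
    sums = {}
    for p in points:
        j = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        s, n = sums.get(j, ((0.0,) * len(p), 0))
        sums[j] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return [
        tuple(x / sums[j][1] for x in sums[j][0]) if j in sums else c
        for j, c in enumerate(centers)
    ]

def kmeans(points, centers, distance=euclidean_distance, max_iter=100):
    """Step 7: repeat until no center changes; max_iter is a safety cap."""
    for iteration in range(1, max_iter + 1):
        new_centers = one_pass(points, centers, distance)
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter

# Hypothetical toy data and initial centers.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
final_centers, iterations = kmeans(points, [(1.0, 1.0), (9.0, 1.0)])
```

On this toy input the first pass moves both centers and the second pass leaves them unchanged, so the loop stops after two iterations.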

As shown in Figs. 2 and 3, when the data volume is very large, the gap between the speedup ratio achieved by the method and the ideal speedup ratio is small across different numbers of nodes, and the two are almost indistinguishable when the number of nodes does not exceed 3.
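For readers unfamiliar with the term, the speedup ratio can be computed as sketched below, assuming the usual definition (single-node running time divided by the running time on n nodes, with an ideal value of n). The patent text does not state its formula, and the timings in this example are made-up placeholders, not measurements from Figs. 2 or 3.

```python
def speedup(t_single, t_parallel):
    """Actual speedup: running time on one node over running time on n nodes."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, nodes):
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup(t_single, t_parallel) / nodes

# Hypothetical running times in seconds, keyed by number of nodes.
# These are placeholders for illustration, not results from the patent.
hypothetical_seconds = {1: 600.0, 2: 300.0, 4: 200.0}

t1 = hypothetical_seconds[1]
report = {
    nodes: (speedup(t1, t), efficiency(t1, t, nodes))
    for nodes, t in hypothetical_seconds.items()
}
```

An efficiency close to 1.0 corresponds to the case described above, where the actual and ideal speedup ratios are almost indistinguishable.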

Claims

1. A parallel k-means clustering method based on Hadoop, comprising the following steps:

(1) preprocess the data;

(2) select the initial centers of the k classes;

2. The parallel k-means clustering method based on Hadoop according to claim 1, characterized in that the local data take the form (key, value), where the key is the center and the value is a subset of the data objects.