CN101996198B

CN101996198B - Clustering implementation method and system

Info

Publication number: CN101996198B
Application number: CN200910091866.7A
Authority: CN
Inventors: 徐萌; 高丹; 邓超; 罗治国; 周文辉; 孙少陵; 何清; 赵卫中; 马慧芳
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2009-08-31
Filing date: 2009-08-31
Publication date: 2016-06-29
Anticipated expiration: 2029-08-31
Also published as: CN101996198A

Abstract

The invention discloses a clustering implementation method and system. In one method, a master node shards the candidate samples in a candidate queue, and at least two compute nodes determine in parallel, according to the set ε-neighborhood and minimum density, whether each sample in their assigned shard is a core sample; because the compute nodes work in parallel, the cluster to which each sample in the sample database belongs is labeled faster. In another method, the master node partitions the currently unlabeled samples in the sample database into blocks and distributes the blocks to at least two compute nodes; the compute nodes process the candidate samples in the candidate queue in parallel, and a merge node then merges the results of the compute nodes. Because each compute node processes only part of the samples, the problem that massive data cannot be processed on a single machine is solved, and because multiple compute nodes and multiple merge nodes work in parallel, processing efficiency is greatly improved.

Description

Clustering implementation method and system

Technical field

The present invention relates to the field of data mining, and in particular to a clustering implementation method for massive sample data and a corresponding system.

Background

In the field of data mining, existing clustering algorithms fall into several classes, including partition-based, hierarchy-based, density-based, grid-based, and model-based methods.

Data mining requires computing over and analyzing all of the data item by item, so the algorithmic complexity is high. Massive data is a challenge for every clustering algorithm. Most existing clustering algorithms have not left the laboratory stage; faced with massive data, some cannot process it effectively, and others process it with very low efficiency.

DBSCAN is a clustering algorithm based on spatial density. It divides regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a sample space that contains "noise" (that is, some non-core sample points).

The basic principle of the DBSCAN algorithm is as follows:

An ε-neighborhood (the region within radius ε of a given object is called the ε-neighborhood of that object) and a minimum density (the minimum number of samples required within the specified ε-neighborhood) are set for the data mining task. When the number of samples in the ε-neighborhood of an unlabeled sample that have not yet been labeled with a cluster exceeds the set minimum density, the sample is determined to be a core sample. The core sample is labeled as belonging to the current cluster, and each sample in its ε-neighborhood is inserted into a candidate queue and labeled as belonging to the current cluster. It is then determined whether each candidate sample in the candidate queue is a core sample; if so, each sample in the ε-neighborhood of that core sample is in turn inserted into the candidate queue. This repeats until every sample in the sample database has been traversed and labeled with the cluster to which it belongs.
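The principle above can be sketched on a single machine as follows. This is a minimal illustration, not the patent's implementation: the function name `dbscan`, the brute-force neighborhood search, and the use of -1 for noise are assumptions, and the core-sample test follows the common textbook convention (at least `min_density` samples in the ε-neighborhood, counting the sample itself).

```python
from collections import deque
from math import dist

NOISE = -1  # assumed marker for samples that end up in no cluster

def dbscan(samples, eps, min_density):
    """Return one cluster id per sample (NOISE for non-clustered samples)."""
    labels = [None] * len(samples)  # None means "not yet labeled"

    def neighborhood(i):
        # Brute-force eps-neighborhood of sample i (includes i itself).
        return [j for j, p in enumerate(samples) if dist(samples[i], p) <= eps]

    cluster = -1
    for i in range(len(samples)):
        if labels[i] is not None:
            continue
        seed = neighborhood(i)
        if len(seed) < min_density:   # assumption: textbook ">=" core test
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seed)           # the candidate queue
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster   # border sample: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            around = neighborhood(j)
            if len(around) >= min_density:
                queue.extend(around)  # j is a core sample: expand through it
    return labels
```

The candidate queue grows and shrinks while a cluster is being expanded, which is the dynamic update that makes the single-machine version slow on very large sample sets.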

For a small number of samples, the DBSCAN clustering algorithm above is easy to run on a single machine. For massive samples, however, the memory of a single machine is limited and cannot hold the sample data; moreover, because the candidate queue must be updated dynamically during clustering, labeling the cluster of every sample in the sample database takes a very long time, and efficiency in real data services is very low.

Therefore, when processing massive data in practice, how to improve processing efficiency effectively is a major problem that data mining needs to solve.

Summary of the invention

Embodiments of the present invention provide clustering implementation methods and clustering implementation systems that use parallel processing on multiple nodes to solve the prior-art problems that massive data cannot be clustered and that processing efficiency is low.

A clustering implementation method provided by an embodiment of the present invention includes:

Step 1: a master node determines a core sample from the samples in a sample database that are not yet labeled with a cluster, labels each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and stores each sample in the ε-neighborhood of the core sample in a candidate queue;

Step 2: the master node shards the candidate samples in the candidate queue and distributes the shards to at least two compute nodes;

Step 3: each compute node determines, according to the currently unlabeled samples in the sample database and the set ε-neighborhood and minimum density, whether each sample in its assigned shard is a core sample; stores each sample in the ε-neighborhood of each core sample so determined in the candidate queue; and notifies the master node once every sample in its shard has been processed;

Step 4: after receiving the notifications sent by the compute nodes, the master node checks whether the candidate queue contains candidate samples. If it does, the master node labels each candidate sample as belonging to the current cluster and returns to step 2; if it does not, the master node returns to step 1. This continues until every sample in the sample database has been labeled with its cluster.

Another clustering implementation method provided by an embodiment of the present invention includes:

Step 1: a master node partitions the currently unlabeled samples in an original database into blocks and distributes the blocks to at least two compute nodes; the master node also determines a core sample from the currently unlabeled samples in the original database, labels each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and stores each sample in the ε-neighborhood of the core sample in a candidate queue;

Step 2: the master node sends the candidate samples in the candidate queue to each compute node;

Step 3: each compute node counts, according to its assigned block of samples and the set ε-neighborhood, the number of samples in the local ε-neighborhood of each candidate sample, and sends the counts to a merge node;

Step 4: for each candidate sample, the merge node accumulates the corresponding counts sent by the compute nodes and determines from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when core samples exist, the merge node notifies each compute node of the core samples determined; when no core sample exists, the merge node notifies the master node;

Step 5: after receiving the core-sample notification sent by the merge node, each compute node stores each sample in the local ε-neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notifies the master node;

Step 6: after receiving the notification sent by the merge node, the master node returns to step 1; after receiving the notifications sent by the compute nodes, the master node labels each candidate sample in the candidate queue as belonging to the current cluster and returns to step 2. This continues until every sample in the original database has been labeled with its cluster.

A clustering implementation system provided by an embodiment of the present invention includes a master node and at least two compute nodes;

The master node is configured to determine a core sample from the samples in a sample database that are not yet labeled with a cluster, label each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and store each sample in the ε-neighborhood of the core sample in a candidate queue; to shard the candidate samples in the candidate queue and distribute the shards to the at least two compute nodes; to receive the notifications sent by the compute nodes, check whether the candidate queue contains candidate samples and, when it does, label each candidate sample as belonging to the current cluster and again shard and distribute the candidate samples to the at least two compute nodes; and, when the candidate queue is found to be empty, to determine the next core sample and repeat the above process until every sample in the sample database has been labeled with its cluster;

Each compute node is configured to determine, according to the currently unlabeled samples in the sample database and the set ε-neighborhood and minimum density, whether each sample in its assigned shard is a core sample; to store each sample in the ε-neighborhood of each core sample so determined in the candidate queue; and to notify the master node once every sample in its shard has been processed.

Another clustering implementation system provided by an embodiment of the present invention includes a master node, at least two compute nodes, and a merge node;

The master node is configured to partition the currently unlabeled samples in an original database into blocks and distribute the blocks to the at least two compute nodes; to determine a core sample from the currently unlabeled samples in the original database, label each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and store each sample in the ε-neighborhood of the core sample in a candidate queue; and to send the candidate samples in the candidate queue to each compute node. It is further configured, on receiving the notifications of the compute nodes, to label each candidate sample in the candidate queue as belonging to the current cluster and again send the candidate samples in the candidate queue to the at least two compute nodes; and, on receiving the notification of the merge node, to determine the next core sample and repeat the above process until every sample in the original database has been labeled with its cluster;

Each compute node is configured to count, according to its assigned block of samples and the set ε-neighborhood, the number of samples in the local ε-neighborhood of each candidate sample and send the counts to the merge node; and, on receiving the core-sample notification sent by the merge node, to store each sample in the local ε-neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notify the master node;

The merge node is configured, for each candidate sample, to accumulate the corresponding counts sent by the compute nodes and determine from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when core samples exist, to notify each compute node of the core samples determined; and when no core sample exists, to notify the master node.

In the first clustering implementation method and corresponding system provided by the invention, the candidate samples in the candidate queue are sharded and processed in parallel by multiple compute nodes (core samples are determined in parallel). This makes full use of the computing resources of each node in the system, greatly shortens the waiting time when clustering massive data, and improves computational efficiency.

In the other clustering implementation method and corresponding system provided by the invention, the samples to be processed are partitioned into blocks that are distributed to different compute nodes, which solves the problem that massive data cannot be read entirely into the memory of a single machine for computation. At least two compute nodes take part in the clustering computation in parallel, which speeds up the computation, and the merge node then merges their results. This makes full use of the computing resources of each node in the system and effectively solves the prior-art problems that massive data cannot be clustered and that processing efficiency is low.

Brief description of the drawings

Fig. 1 is a flowchart of clustering implementation method one provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a practical application of clustering implementation method one provided by an embodiment of the present invention;

Fig. 3 is a flowchart of the steps of clustering implementation method two provided by an embodiment of the present invention;

Fig. 4 is a flowchart of a practical application of clustering implementation method two provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the clustering implementation system corresponding to clustering implementation method one provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the clustering implementation system corresponding to clustering implementation method two provided by an embodiment of the present invention.

Detailed description of the invention

The clustering implementation methods and systems provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of clustering implementation method one provided by an embodiment of the present invention, which includes the following steps:

Step S101: the master node determines a core sample from the samples in the sample database that are not yet labeled with a cluster, labels each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and stores each sample in the ε-neighborhood of the core sample in the candidate queue.

Step S102: the master node shards the candidate samples in the candidate queue and distributes the shards to at least two compute nodes.

Step S103: each compute node determines, according to the currently unlabeled samples in the sample database and the set ε-neighborhood and minimum density, whether each sample in its assigned shard is a core sample; stores each sample in the ε-neighborhood of each core sample so determined in the candidate queue; and notifies the master node once every sample in its shard has been processed.

Step S104: after receiving the notifications sent by the compute nodes, the master node checks whether the candidate queue contains candidate samples. If it does, the master node labels each candidate sample as belonging to the current cluster and returns to step S102; if it does not, the master node returns to step S101. This continues until every sample in the sample database has been labeled with its cluster.

Fig. 2 is a flowchart of a practical application of clustering implementation method one provided by an embodiment of the present invention, which includes:

Step S201: the master node determines a core sample from the samples in the sample database that are not yet labeled with a cluster, labels each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and stores each sample in the ε-neighborhood of the core sample in the candidate queue.

Step S202: the master node shards the candidate samples in the candidate queue and distributes the shards to at least two compute nodes.

In practice, the master node divides all candidate samples in the candidate queue into as many shards as there are compute nodes taking part in the computation (assume N nodes, hence N shards) and distributes each shard to a different compute node.

In practice, the master node can also decide flexibly how many compute nodes take part in parallel processing according to the number of candidate samples to be distributed in the current round: when many candidate samples need processing, it starts more compute nodes; when there are few, it correspondingly reduces the number of compute nodes taking part.

Step S203: each compute node reads, in order, one sample from the shard assigned to it and determines, according to the currently unlabeled samples in the sample database and the set ε-neighborhood and minimum density, whether the sample is a core sample; if it is a core sample, step S204 is performed; otherwise, the flow goes to step S205.

Step S204: each sample in the ε-neighborhood of the core sample so determined is stored in the candidate queue, and the flow continues with step S205.

Step S205: each compute node checks whether every sample in the shard assigned to it has been processed; if so, step S206 is performed; otherwise, the flow returns to step S203.

Step S206: the compute node notifies the master node that processing is complete.

Step S207: after receiving the notifications sent by the compute nodes, the master node checks whether the candidate queue contains candidate samples; if it does, step S208 is performed; if it does not, step S209 is performed.

Step S208: each candidate sample is labeled as belonging to the current cluster, and the flow returns to step S202.

Step S209: the master node checks whether the sample database still contains unlabeled samples; if so, the flow returns to step S201; otherwise, step S210 is performed.

Step S210: with every sample labeled with its cluster, the samples belonging to each cluster are obtained.
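Steps S201 to S210 can be sketched as follows. This is a hedged illustration only: a thread pool stands in for the compute nodes, the shards are formed round-robin, and all names (`compute_node`, `cluster_method_one`, `n_nodes`) are hypothetical. The core-sample test uses the textbook convention of at least `min_density` samples in the ε-neighborhood, which is an assumption about how the threshold is applied.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def neighborhood(samples, i, eps):
    # Brute-force eps-neighborhood of sample i (includes i itself).
    return [j for j, p in enumerate(samples) if dist(samples[i], p) <= eps]

def compute_node(samples, shard, eps, min_density):
    """S203-S206: collect the neighbors of every core sample in one shard."""
    found = []
    for i in shard:
        around = neighborhood(samples, i, eps)
        if len(around) >= min_density:      # assumption: ">=" core test
            found.extend(around)
    return found                            # returning stands in for "notify"

def cluster_method_one(samples, eps, min_density, n_nodes=2):
    labels = [None] * len(samples)
    cluster = -1
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        for seed in range(len(samples)):    # S201 / S209: next unlabeled sample
            if labels[seed] is not None:
                continue
            around = neighborhood(samples, seed, eps)
            if len(around) < min_density:
                continue                    # not a core sample; try the next
            cluster += 1
            candidates = [j for j in around if labels[j] is None]
            while candidates:
                for j in candidates:        # S208: label the candidates
                    labels[j] = cluster
                # S202: one shard per compute node (round-robin split)
                shards = [candidates[k::n_nodes] for k in range(n_nodes)]
                results = pool.map(
                    lambda s: compute_node(samples, s, eps, min_density), shards)
                # S207: merge duplicates, keep only samples still unlabeled
                merged = {j for found in results for j in found
                          if labels[j] is None}
                candidates = sorted(merged)
    # S210: samples never reached by any cluster are reported as noise (-1)
    return [-1 if c is None else c for c in labels]
```

Collecting each shard's result into a set before the next round also removes samples that several compute nodes reported, so each candidate is labeled and redistributed only once.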

Clustering implementation method one provided by the invention is particularly suitable for clustering the samples of a database with a relatively small number of samples. At least two compute nodes take part in the clustering computation in parallel, which speeds up the computation.

When the number of samples in the original database is especially large, the original database can first be sampled to generate a sample database. After the cluster of each sample in the generated sample database has been determined by clustering implementation method one, the clusters of the remaining samples in the original database are determined as follows:

For each current cluster, the master node calculates the mean of the sample values of the samples belonging to that cluster, thereby determining the cluster center point of each cluster; and

The master node partitions into blocks the remaining samples of the original database, that is, those not contained in the generated sample database, assigns the blocks to at least two compute nodes, and sends each compute node its assigned block together with the calculated cluster center points;

Each compute node calculates, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point, assigns the sample to the cluster of the nearest cluster center point, and returns the sample identifier and the identifier of its cluster to the master node;

From the sample identifiers and cluster identifiers returned by the compute nodes, the master node determines the cluster of each remaining sample in the original database.
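The assignment of the remaining samples can be sketched as below. This is an illustrative sketch under stated assumptions: samples are tuples of numbers, distance is Euclidean, noise in the sampled result is marked -1 and ignored when computing centers, and the function names are hypothetical. The loop over blocks runs sequentially here and stands in for the compute nodes.

```python
from math import dist

def cluster_centers(samples, labels):
    """Master: mean of the sample values in each cluster (label -1 skipped)."""
    groups = {}
    for point, label in zip(samples, labels):
        if label != -1:
            groups.setdefault(label, []).append(point)
    return {label: tuple(sum(axis) / len(points) for axis in zip(*points))
            for label, points in groups.items()}

def assign_block(block, centers):
    """Compute node: map each sample id to the id of the nearest center."""
    return {sid: min(centers, key=lambda c: dist(point, centers[c]))
            for sid, point in block.items()}

def assign_remaining(remaining, centers, n_nodes=2):
    """Master: split the remaining samples into blocks, merge the answers."""
    ids = sorted(remaining)
    blocks = [{i: remaining[i] for i in ids[k::n_nodes]}
              for k in range(n_nodes)]
    result = {}
    for block in blocks:   # each iteration stands in for one compute node
        result.update(assign_block(block, centers))
    return result
```

Only the cluster center points are shipped to every compute node, so the amount of shared data depends on the number of clusters rather than on the size of the original database.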

In the above flow, when the candidate queue contains candidate samples, the master node also merges identical samples stored in the candidate queue by different compute nodes before labeling each candidate sample as belonging to the current cluster.

In clustering implementation method one, the candidate samples in the candidate queue are sharded and processed in parallel by multiple compute nodes (core samples are determined in parallel, and the samples in the ε-neighborhood of each core sample are stored in the candidate queue in parallel). This makes full use of the computing resources of each node in the system, greatly shortens the waiting time when clustering massive data, and improves computational efficiency.

Fig. 3 is a flowchart of the steps of clustering implementation method two provided by an embodiment of the present invention, which includes the following steps:

Step S301: the master node partitions the currently unlabeled samples in the original database into blocks and distributes the blocks to at least two compute nodes; the master node also determines a core sample from the currently unlabeled samples in the original database, labels each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and stores each sample in the ε-neighborhood of the core sample in the candidate queue;

Step S302: the master node sends the candidate samples in the candidate queue to each compute node;

Step S303: each compute node counts, according to its assigned block of samples and the set ε-neighborhood, the number of samples in the local ε-neighborhood of each candidate sample, and sends the counts to the merge node;

Step S304: for each candidate sample, the merge node accumulates the corresponding counts sent by the compute nodes and determines from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when core samples exist, the merge node notifies each compute node of the core samples determined; when no core sample exists, the merge node notifies the master node;

Step S305: after receiving the core-sample notification sent by the merge node, each compute node stores each sample in the local ε-neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notifies the master node;

Step S306: after receiving the notification sent by the merge node, the master node returns to step S301; after receiving the notifications sent by the compute nodes, the master node labels each candidate sample in the candidate queue as belonging to the current cluster and returns to step S302. This continues until every sample in the original database has been labeled with its cluster.

Fig. 4 is a flowchart of a practical application of clustering implementation method two provided by an embodiment of the present invention, which includes:

Step S401: the master node partitions the currently unlabeled samples in the original database into blocks and distributes the blocks to at least two compute nodes; the master node also determines a core sample from the currently unlabeled samples in the original database, labels each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and stores each sample in the ε-neighborhood of the core sample in the candidate queue.

In practice, the master node divides all samples to be processed into as many blocks as there are compute nodes taking part in the computation (assume N nodes, hence N blocks) and distributes each block to a different compute node.

Step S402: the master node sends the candidate samples in the candidate queue to each compute node.

Step S403: each compute node counts, according to its assigned block of samples and the set ε-neighborhood, the number of samples in the local ε-neighborhood of each candidate sample, and sends the counts to the merge node.

Step S404: for each candidate sample, the merge node accumulates the corresponding counts sent by the compute nodes and determines from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when at least one core sample exists, step S405 is performed; otherwise, the flow goes to step S406.

Step S405: the merge node notifies each compute node of the core samples determined, and the flow goes to step S407.

Step S406: the merge node notifies the master node that the candidate samples issued in this round contain no core sample, and the flow goes to step S409.

Steps S405 and S406 are two alternative branches and cannot both occur.

Step S407: after receiving the core-sample notification sent by the merge node, each compute node stores each sample in the local ε-neighborhood of the corresponding core sample in the candidate queue, notifies the master node once storing is complete, and the flow continues with step S408.

Step S408: the master node labels each candidate sample in the candidate queue as belonging to the current cluster, and the flow returns to step S402.

Step S409: the master node checks whether the original database still contains samples not yet labeled with a cluster; if so, the flow returns to step S401; otherwise, step S410 is performed.

Step S410: with every sample labeled with its cluster, the samples belonging to each cluster are obtained.
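The inner loop of this flow (steps S402 to S408, for one cluster) can be sketched as follows. The sketch is an assumption-laden simulation rather than the patent's implementation: each block is a dict from sample id to point, the compute nodes and the merge node are plain function calls executed in sequence, `points` is a lookup the master is assumed to hold for candidate coordinates, and the core test is "accumulated count at least `min_density`".

```python
from math import dist

def local_counts(block, candidates, eps):
    """S403: per candidate, count this block's samples inside its eps-radius."""
    return {cid: sum(dist(p, q) <= eps for q in block.values())
            for cid, p in candidates.items()}

def merge_counts(per_node_counts, min_density):
    """S404: sum the local counts; return ids whose total marks them core."""
    totals = {}
    for counts in per_node_counts:
        for cid, n in counts.items():
            totals[cid] = totals.get(cid, 0) + n
    return {cid for cid, n in totals.items() if n >= min_density}

def local_neighbors(block, core_points, eps):
    """S407: ids of local samples within eps of any notified core sample."""
    return {sid for sid, q in block.items()
            if any(dist(p, q) <= eps for p in core_points)}

def expand_cluster(blocks, points, seeds, eps, min_density):
    """S402-S408 for one cluster; returns the ids labeled into the cluster."""
    members = set(seeds)
    candidates = set(seeds)
    while candidates:
        cand_points = {i: points[i] for i in candidates}          # S402
        counts = [local_counts(b, cand_points, eps) for b in blocks]
        core = merge_counts(counts, min_density)
        if not core:
            break                                                 # S406
        core_points = [points[i] for i in core]                   # S405
        found = set().union(
            *(local_neighbors(b, core_points, eps) for b in blocks))
        candidates = found - members   # S408: duplicates merged, labeled skipped
        members |= candidates
    return members
```

No compute node ever needs the whole database in this arrangement: it sees only its own block plus the current candidates, and the merge node sees only counts.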

In step S408 above, before labeling each candidate sample in the candidate queue as belonging to the current cluster, the master node also merges identical samples stored in the candidate queue by different compute nodes.

In one embodiment, there are at least two merge nodes; the candidate samples that each merge node is responsible for merging are allocated in advance by the master node.

In the above flow, each compute node counts the number of samples in the local ε-neighborhood of each candidate sample and sends the counts to the merge nodes. Specific approaches include:

Method one: the master node notifies each compute node in advance of the candidate samples that each merge node is responsible for merging; each compute node then reports its locally counted sample count for each candidate sample to the merge node responsible for that candidate sample;

Method two: each merge node, according to the candidate samples it is responsible for merging, requests each compute node to upload the statistics of those candidate samples; each compute node returns its locally counted sample counts for those candidate samples to the requesting merge node.
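Method one above, the push variant, might look like the following sketch. The round-robin allocation and the names `assign_candidates` and `route_counts` are assumptions made for illustration; the text does not say how the master node allocates candidate samples to merge nodes.

```python
def assign_candidates(candidate_ids, n_merge_nodes):
    """Master: pre-assign each candidate to one merge node (round-robin)."""
    return {cid: k % n_merge_nodes
            for k, cid in enumerate(sorted(candidate_ids))}

def route_counts(local_counts, assignment, n_merge_nodes):
    """Compute node: group its local counts by the responsible merge node."""
    outbox = [{} for _ in range(n_merge_nodes)]
    for cid, n in local_counts.items():
        outbox[assignment[cid]][cid] = n
    return outbox   # outbox[m] is the message for merge node m
```

For example, `assign_candidates([7, 3, 5], 2)` gives `{3: 0, 5: 1, 7: 0}`, so a compute node holding counts `{3: 2, 5: 1, 7: 4}` would send `{3: 2, 7: 4}` to merge node 0 and `{5: 1}` to merge node 1. Method two would use the same assignment, with each merge node requesting its own candidates instead of waiting for reports.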

In clustering implementation method two, two or more compute nodes take part in the clustering computation in parallel, which improves computational efficiency. Each compute node processes only part of the sample data, which solves the problem that massive data cannot be processed on a single machine. In addition, a number of merge nodes matched to the number of candidate samples can be set up so that the merging is also parallelized, further increasing merging speed.

The clustering implementation methods provided by the above embodiments of the present invention can be implemented with Map/Reduce functions. Each compute node uses a Map function to obtain, for each candidate sample, the number of samples in its assigned block that lie in the local ε-neighborhood of that candidate sample, and sends the counts to the merge node; the merge node uses a Reduce function to merge the counts sent by the compute nodes. The Key and Value of each pair in the Map/Reduce functions are determined by the objects actually being processed and the results required.
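One possible choice of Key and Value is sketched below in plain Python, without any Map/Reduce framework: the key is the candidate sample's id and the value is a local neighbor count. That choice, the in-process shuffle in `run`, and the function names are all assumptions for illustration.

```python
from collections import defaultdict
from math import dist

def map_counts(block, candidates, eps):
    """Map: emit (candidate id, local neighbor count) pairs for one block."""
    for cid, p in candidates.items():
        yield cid, sum(dist(p, q) <= eps for q in block)

def reduce_counts(cid, counts, min_density):
    """Reduce: sum the counts of one key and decide whether it is core."""
    total = sum(counts)
    return cid, total, total >= min_density

def run(blocks, candidates, eps, min_density):
    grouped = defaultdict(list)        # shuffle phase: group values by key
    for block in blocks:
        for cid, n in map_counts(block, candidates, eps):
            grouped[cid].append(n)
    return [reduce_counts(cid, ns, min_density)
            for cid, ns in sorted(grouped.items())]
```

Because the reduce step is a plain sum per key, the keys can be spread over several reducers, which corresponds to the use of more than one merge node.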

Based on the same inventive concept, and corresponding to clustering implementation method one provided by the above embodiment, the present invention provides a clustering implementation system whose structure is shown in Fig. 5. The system includes a master node 51 and at least two compute nodes 52.

The master node 51 is configured to determine a core sample from the samples in the sample database that are not yet labeled with a cluster, label each sample in the ε-neighborhood of the core sample as belonging to the current cluster, and store each sample in the ε-neighborhood of the core sample in the candidate queue; to shard the candidate samples in the candidate queue and distribute the shards to the compute nodes 52; to receive the notifications sent by the compute nodes 52, check whether the candidate queue contains candidate samples and, when it does, label each candidate sample as belonging to the current cluster and again shard and distribute the candidate samples to the compute nodes 52; and, when the candidate queue is found to be empty, to determine the next core sample and repeat the above process until every sample in the sample database has been labeled with its cluster;

Each compute node 52 is configured to determine, according to the currently unlabeled samples in the sample database and the set ε-neighborhood and minimum density, whether each sample in its assigned shard is a core sample; to store each sample in the ε-neighborhood of each core sample so determined in the candidate queue; and to notify the master node 51 once every sample in its shard has been processed.

In one specific embodiment, the master node 51 is further configured to sample the original database to obtain the sample database and, for each cluster obtained from the sample database, to calculate the mean of the sample values of the samples currently belonging to that cluster, thereby determining the cluster center point of each cluster. The master node 51 is further configured to partition into blocks the remaining samples of the original database, that is, those not contained in the sample database, assign the blocks to the compute nodes 52, and send each compute node 52 its assigned block together with the determined cluster center points; and to determine, from the sample identifiers and cluster identifiers returned by the compute nodes 52, the cluster of each remaining sample in the original database;

Each compute node 52 is further configured to calculate, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point, assign the sample to the cluster of the nearest cluster center point, and return the sample identifier and the identifier of its cluster to the master node 51.

Based on the same inventive concept, and corresponding to the second cluster realizing method provided by the above embodiments of the present invention, the present invention provides a corresponding cluster realizing system, whose structure is shown in Figure 6 and which includes: a main controlled node 61, at least two computing nodes 62, and a merging node 63;

Main controlled node 61 is configured to: partition the currently unmarked samples in the raw database into blocks and distribute the blocks to at least two computing nodes; determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each of those samples in a candidate queue; send the candidate samples in the candidate queue to each computing node 62; on receiving the notifications from the computing nodes 62, mark each candidate sample in the candidate queue as belonging to the current cluster and again send the candidate samples in the candidate queue to the computing nodes 62; and, on receiving the notification from merging node 63, determine the next core sample and repeat the above process until every sample in the raw database has been marked with the cluster to which it belongs;

Computing node 62 is configured to: count, according to its assigned block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and send the counts to merging node 63; and, on receiving a core sample notification from merging node 63, store each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notify main controlled node 61;

Merging node 63 is configured to: accumulate, for each candidate sample, the corresponding sample counts sent by the computing nodes 62, and determine from the accumulated sum and the set minimum density whether each candidate sample is a core sample; when a core sample is determined to exist, notify each computing node 62 of the core samples so determined; and when no core sample is determined to exist, notify main controlled node 61.
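The division of work between the computing nodes (local counting) and the merging node (accumulation and core-sample test) can be sketched as below. The names `local_counts` and `merge_counts`, the dictionary message format, and the Euclidean distance are illustrative assumptions; the source does not specify how counts are encoded or transmitted.

```python
from collections import Counter
from math import dist


def local_counts(block, candidates, eps):
    """Computing-node role (hypothetical): for each candidate, count the
    samples of the locally held block that lie within eps of it."""
    return {
        cid: sum(1 for p in block if dist(point, p) <= eps)
        for cid, point in candidates.items()
    }


def merge_counts(per_node_counts, min_pts):
    """Merging-node role (hypothetical): sum the counts reported by all
    computing nodes and return the candidates that reach min_pts."""
    totals = Counter()
    for counts in per_node_counts:  # one dict per computing node
        totals.update(counts)       # Counter.update adds counts per key
    return sorted(cid for cid, n in totals.items() if n >= min_pts)
```

An empty result from `merge_counts` corresponds to the case in which no core sample exists and the merging node notifies the main controlled node instead of the computing nodes.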

There are at least two merging nodes 63. Main controlled node 61 is further configured to allocate in advance the candidate samples that each merging node is responsible for merging.
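One possible way to pre-allocate candidate samples to several merging nodes is shown below. The source states only that the allocation is made in advance; the round-robin scheme and the names `allocate_candidates` and `route_counts` are assumptions chosen for illustration.

```python
def allocate_candidates(candidate_ids, n_merge_nodes):
    """Main-controlled-node role (hypothetical): map each candidate sample ID
    to the index of the merging node responsible for it (round-robin)."""
    return {cid: k % n_merge_nodes for k, cid in enumerate(candidate_ids)}


def route_counts(counts, allocation, n_merge_nodes):
    """Computing-node role (hypothetical): group locally computed counts by
    destination merging node, one outgoing dict per merging node."""
    outbox = [dict() for _ in range(n_merge_nodes)]
    for cid, n in counts.items():
        outbox[allocation[cid]][cid] = n
    return outbox
```

With such a mapping, every computing node sends the count for a given candidate to the same merging node, so each candidate's total is accumulated in exactly one place.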

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass them.

Claims

1. A cluster realizing method, characterized in that it includes:

Step 2: the main controlled node splits the candidate samples in the candidate queue into shards according to the number of computing nodes participating in the calculation, and distributes the shards to at least two computing nodes;

Step 4: after receiving the notification sent by each computing node, the main controlled node judges whether the candidate queue contains candidate samples; when candidate samples exist, it marks each candidate sample as belonging to the current cluster and goes to step 2 above; when no candidate samples exist, it goes to step 1 above, until every sample in the sample database has been marked with the cluster to which it belongs.

2. The cluster realizing method as claimed in claim 1, characterized in that, before step 1, it further includes: the main controlled node sampling the raw database to obtain the sample database; and

after step 4, it further includes:

The main controlled node, for each current cluster, calculates the mean of the sample values of the samples belonging to that cluster to determine the center point of each cluster;

The main controlled node partitions the remaining samples in the raw database, other than those contained in the sample database, into blocks and assigns the blocks to at least two computing nodes, sending each computing node its assigned block together with the determined cluster center points;

Each computing node calculates, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point, assigns each sample to the cluster of the center point at minimum distance, and returns the sample identifier and the identifier of its cluster to the main controlled node;

The main controlled node determines, from the sample identifiers and cluster identifiers returned by the computing nodes, the cluster to which each of the remaining samples belongs.

3. The cluster realizing method as claimed in claim 1 or 2, characterized in that, in step 4, when the candidate queue contains candidate samples, the main controlled node, before marking each candidate sample as belonging to the current cluster, also merges identical samples stored in the candidate queue by different computing nodes.

4. A cluster realizing method, characterized in that it includes:

Step 1: the main controlled node partitions the currently unmarked samples in the raw database into blocks according to the number of computing nodes participating in the calculation, and distributes the blocks to at least two computing nodes; and determines a core sample from the currently unmarked samples in the raw database, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each of those samples in a candidate queue;

5. The cluster realizing method as claimed in claim 4, characterized in that, in step 6, the main controlled node, before marking each candidate sample in the candidate queue as belonging to the current cluster, also merges identical samples stored in the candidate queue by different computing nodes.

6. The cluster realizing method as claimed in claim 5, characterized in that there are at least two merging nodes; the candidate samples that each merging node is responsible for merging are allocated in advance by the main controlled node; and the counting, in step 3, of the number of samples in the local epsilon neighborhood of each candidate sample and the sending of the counts to the merging node specifically include:

The computing node, according to the candidate samples that each merging node is responsible for merging, reports the locally counted sample counts of those candidate samples to the corresponding merging node; or

Each merging node, according to the candidate samples that it is itself responsible for merging, requests each computing node to upload the statistics for those candidate samples; each computing node returns to each merging node the locally counted sample counts of the corresponding candidate samples.

7. A cluster realizing system, characterized in that it includes: a main controlled node and at least two computing nodes;

The main controlled node is configured to: determine a core sample from the samples in the sample database that are not yet marked as belonging to a cluster; mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster and store each of those samples in a candidate queue; split the candidate samples in the candidate queue into shards according to the number of computing nodes participating in the calculation, and distribute the shards to the at least two computing nodes; receive the notification sent by each computing node and judge whether the candidate queue contains candidate samples; when candidate samples exist, mark each candidate sample as belonging to the current cluster and again shard and distribute them to the at least two computing nodes; and, once the candidate queue is judged to be empty, determine the next core sample and repeat the above process until every sample in the sample database has been marked with the cluster to which it belongs;

8. The cluster realizing system as claimed in claim 7, characterized in that the main controlled node is further configured to: sample the raw database to obtain the sample database; for each cluster obtained from the sample database, calculate the mean of the sample values of the samples currently belonging to that cluster to determine the center point of each cluster; partition the remaining samples in the raw database, other than those contained in the sample database, into blocks and assign the blocks to at least two computing nodes, sending each computing node its assigned block together with the determined cluster center points; and determine, from the sample identifiers and cluster identifiers returned by the computing nodes, the cluster to which each of the remaining samples belongs;

Each computing node is further configured to: calculate, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point; assign each sample to the cluster of the center point at minimum distance; and return the sample identifier and the identifier of its cluster to the main controlled node.

9. A cluster realizing system, characterized in that it includes: a main controlled node, at least two computing nodes, and a merging node;

The main controlled node is configured to: partition the currently unmarked samples in the raw database into blocks according to the number of computing nodes participating in the calculation, and distribute the blocks to at least two computing nodes; determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each of those samples in a candidate queue; send the candidate samples in the candidate queue to each computing node; on receiving the notifications from the computing nodes, mark each candidate sample in the candidate queue as belonging to the current cluster and again send the candidate samples in the candidate queue to the at least two computing nodes; and, on receiving the notification from the merging node, determine the next core sample and repeat the above process until every sample in the raw database has been marked with the cluster to which it belongs;

10. The cluster realizing system as claimed in claim 9, characterized in that there are at least two merging nodes;

The main controlled node is further configured to allocate in advance the candidate samples that each merging node is responsible for merging.