CN101996198B - Cluster realizing method and system - Google Patents

Publication number
CN101996198B
Authority
CN
China
Prior art keywords
sample
cluster
node
computing node
candidate
Legal status
Active
Application number
CN200910091866.7A
Other languages
Chinese (zh)
Other versions
CN101996198A (en)
Inventor
徐萌
高丹
邓超
罗治国
周文辉
孙少陵
何清
赵卫中
马慧芳
Current Assignee
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cluster realizing method and system. In one method, a main control node shards the candidate samples in a candidate queue, and at least two computing nodes determine in parallel, according to the set epsilon neighborhood and minimum density, whether each sample in their assigned shard is a core sample. Because the computing nodes work in parallel, the cluster to which each sample in the sample database belongs is marked faster. In another method, the main control node divides the currently unmarked samples in the sample database into blocks and distributes the blocks to at least two computing nodes; the computing nodes process the candidate samples in the candidate queue in parallel, and a merging node merges the results of the computing nodes. Because each computing node processes only part of the samples, the problem that mass data cannot be processed on a single machine is solved, and because multiple computing nodes and multiple merging nodes work in parallel, processing efficiency is greatly improved.

Description

Cluster realizing method and system
Technical field
The present invention relates to the field of data mining, and in particular to a cluster realizing method and a corresponding system for massive sample data.
Background
In the current data mining field, existing clustering algorithms fall into several classes: partition-based methods, hierarchy-based methods, density-based methods, grid-based methods, and model-based methods.
Data mining requires computing over and analyzing all of the data item by item, so the time complexity of the algorithms is high. Mass data is a challenge for every clustering algorithm. Most existing clustering algorithms remain at the laboratory stage; for mass data, some either cannot process it effectively or process it with very low efficiency.
The DBSCAN algorithm is a clustering algorithm based on spatial density. It divides regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a sample space containing "noise" (that is, some non-core sample points).
The basic principle of the DBSCAN algorithm is as follows:
An epsilon neighborhood (the region within radius ε of a given object is called the epsilon neighborhood of that object) and a minimum density (the minimum number of samples required within the specified epsilon neighborhood) are set for the data mining task. When the number of samples in the epsilon neighborhood of an unmarked sample that are not yet marked as belonging to a cluster exceeds the set minimum density, the sample is determined to be a core sample. The core sample is marked as belonging to the current cluster, and each sample in its epsilon neighborhood is inserted into the candidate queue and marked as belonging to the current cluster. It is then determined whether each candidate sample in the candidate queue is a core sample; if so, each sample in the epsilon neighborhood of that core sample is likewise inserted into the candidate queue. This repeats until every sample in the sample database has been traversed and the cluster to which each sample belongs has been marked.
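The principle described above can be sketched on a single machine as follows. This is a minimal illustration of the textbook form of DBSCAN, not the patented method: it assumes Euclidean distance, counts every sample in the epsilon neighborhood (rather than only the unmarked ones), and treats a sample as a core sample when the count is at least `min_pts`. The names `region_query`, `dbscan`, `eps` and `min_pts` are hypothetical.

```python
from collections import deque
from math import dist

def region_query(samples, i, eps):
    """Indices of all samples within distance eps of samples[i] (i included)."""
    return [j for j, p in enumerate(samples) if dist(samples[i], p) <= eps]

def dbscan(samples, eps, min_pts):
    """Return one label per sample: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(samples)           # None means "not yet marked"
    cluster = -1
    for i in range(len(samples)):
        if labels[i] is not None:
            continue
        neighbors = region_query(samples, i, eps)
        if len(neighbors) < min_pts:         # assumption: core means >= min_pts
            labels[i] = -1                   # provisional noise, may become a border sample
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque()                      # the candidate queue
        for j in neighbors:
            if labels[j] is None or labels[j] == -1:
                labels[j] = cluster
                queue.append(j)
        while queue:
            c = queue.popleft()
            c_neighbors = region_query(samples, c, eps)
            if len(c_neighbors) >= min_pts:  # the candidate is itself a core sample
                for j in c_neighbors:
                    if labels[j] is None or labels[j] == -1:
                        labels[j] = cluster
                        queue.append(j)
    return labels
```

Every candidate triggers a full scan of the sample set in `region_query`, and the queue grows while it is being consumed; these are the two costs that the text below identifies as the obstacle for massive samples.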
For a small number of samples, the above DBSCAN clustering algorithm is easily implemented on a single machine. For massive samples, however, the memory of a single machine is limited and cannot hold the massive sample data; moreover, the candidate queue must be updated dynamically during clustering in order to mark the cluster to which each sample in the sample database belongs, so processing takes a very long time and efficiency in real data service applications is very low.
Therefore, how to effectively improve processing efficiency for mass data in practical applications is the main problem to be solved in data mining.
Summary of the invention
The embodiments of the present invention provide cluster realizing methods and cluster realizing systems which, by processing on multiple nodes in parallel, solve the problems that the prior art cannot cluster mass data and has low processing efficiency.
A cluster realizing method provided by an embodiment of the present invention includes:
Step 1: the main control node determines a core sample from the samples in the sample database that are not yet marked as belonging to a cluster, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue;
Step 2: the main control node shards the candidate samples in the candidate queue and distributes the sample shards to at least two computing nodes;
Step 3: each computing node determines, according to the currently unmarked samples in the sample database and the set epsilon neighborhood and minimum density, whether each sample in its assigned shard is a core sample; stores each sample in the epsilon neighborhood of each determined core sample in the candidate queue; and notifies the main control node once all samples in its assigned shard have been processed;
Step 4: after receiving the notification sent by each computing node, the main control node judges whether the candidate queue contains candidate samples; if it does, the main control node marks each candidate sample as belonging to the current cluster and returns to step 2; if it does not, the main control node returns to step 1, until every sample in the sample database has been marked with the cluster to which it belongs.
Another cluster realizing method provided by an embodiment of the present invention includes:
Step 1: the main control node divides the currently unmarked samples in the raw database into blocks and distributes the sample blocks to at least two computing nodes; it also determines a core sample from the currently unmarked samples in the raw database, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue;
Step 2: the main control node sends the candidate samples in the candidate queue to each computing node;
Step 3: each computing node counts, according to its assigned sample block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the counts to the merging node;
Step 4: for each candidate sample, the merging node accumulates the corresponding sample counts sent by the computing nodes and determines, according to the accumulated sum and the set minimum density, whether the candidate sample is a core sample; when core samples are determined to exist, the merging node notifies each computing node of the determined core samples; when no core sample is determined to exist, the merging node notifies the main control node;
Step 5: after receiving the core sample notification sent by the merging node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue and, once storing is complete, notifies the main control node;
Step 6: after receiving the notification sent by the merging node, the main control node returns to step 1; after receiving the notification sent by each computing node, the main control node marks each candidate sample in the candidate queue as belonging to the current cluster and returns to step 2; this continues until every sample in the raw database has been marked with the cluster to which it belongs.
A cluster realizing system provided by an embodiment of the present invention includes a main control node and at least two computing nodes;
The main control node is configured to determine a core sample from the samples in the sample database that are not yet marked as belonging to a cluster, mark each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of this core sample in a candidate queue; to shard the candidate samples in the candidate queue and distribute the sample shards to the at least two computing nodes; to receive the notification sent by each computing node and judge whether the candidate queue contains candidate samples and, when it does, to mark each candidate sample as belonging to the current cluster and again shard and distribute the candidate samples to the at least two computing nodes; and, after judging that the candidate queue is empty, to determine the next core sample and repeat the above process, until every sample in the sample database has been marked with the cluster to which it belongs;
Each computing node is configured to determine, according to the currently unmarked samples in the sample database and the set epsilon neighborhood and minimum density, whether each sample in its assigned shard is a core sample; to store each sample in the epsilon neighborhood of each determined core sample in the candidate queue; and to notify the main control node once all samples in its assigned shard have been processed.
Another cluster realizing system provided by an embodiment of the present invention includes a main control node, at least two computing nodes, and a merging node;
The main control node is configured to divide the currently unmarked samples in the raw database into blocks and distribute the sample blocks to the at least two computing nodes; to determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of this core sample in a candidate queue; to send the candidate samples in the candidate queue to each computing node; to receive the notification of the computing nodes, mark each candidate sample in the candidate queue as belonging to the current cluster, and again send the candidate samples in the candidate queue to the at least two computing nodes; and, after receiving the notification of the merging node, to determine the next core sample and repeat the above process, until every sample in the raw database has been marked with the cluster to which it belongs;
Each computing node is configured to count, according to its assigned sample block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample and send the counts to the merging node; and to receive the core sample notification sent by the merging node, store each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue and, once storing is complete, notify the main control node;
The merging node is configured, for each candidate sample, to accumulate the corresponding sample counts sent by the computing nodes and determine, according to the accumulated sum and the set minimum density, whether the candidate sample is a core sample; to notify each computing node of the determined core samples when core samples are determined to exist; and to notify the main control node when no core sample is determined to exist.
In the first cluster realizing method and corresponding system provided by the present invention, the candidate samples in the candidate queue are sharded and processed by multiple computing nodes in parallel (core samples are determined in parallel), which makes full use of the computing resources of each node in the system, greatly shortens the waiting time when mass data undergoes cluster mining, and improves computational efficiency.
In the other cluster realizing method and corresponding system provided by the present invention, the samples to be processed are divided into blocks that are distributed to different computing nodes for processing, which solves the problem that mass data cannot all be read into the memory of a single machine for computation; at least two computing nodes take part in the cluster computation in parallel, which speeds up computation; and the results are then merged by the merging node. This makes full use of the computing resources of each node in the system and effectively solves the problems that the prior art cannot cluster mass data and has low processing efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of cluster realizing method one provided by an embodiment of the present invention;
Fig. 2 is a practical application flow chart of cluster realizing method one provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the steps of cluster realizing method two provided by an embodiment of the present invention;
Fig. 4 is a practical application flow chart of cluster realizing method two provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the cluster realizing system corresponding to cluster realizing method one provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the cluster realizing system corresponding to cluster realizing method two provided by an embodiment of the present invention.
Detailed description
The cluster realizing methods and systems provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, the flow of cluster realizing method one provided by an embodiment of the present invention includes the following steps:
Step S101: the main control node determines a core sample from the samples in the sample database that are not yet marked as belonging to a cluster, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue.
Step S102: the main control node shards the candidate samples in the candidate queue and distributes the sample shards to at least two computing nodes.
Step S103: each computing node determines, according to the currently unmarked samples in the sample database and the set epsilon neighborhood and minimum density, whether each sample in its assigned shard is a core sample; stores each sample in the epsilon neighborhood of each determined core sample in the candidate queue; and notifies the main control node once all samples in its assigned shard have been processed.
Step S104: after receiving the notification sent by each computing node, the main control node judges whether the candidate queue contains candidate samples; if it does, the main control node marks each candidate sample as belonging to the current cluster and returns to step S102; if it does not, the main control node returns to step S101, until every sample in the sample database has been marked with the cluster to which it belongs.
Referring to Fig. 2, the practical application flow of cluster realizing method one provided by an embodiment of the present invention includes:
Step S201: the main control node determines a core sample from the samples in the sample database that are not yet marked as belonging to a cluster, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue.
Step S202: the main control node shards the candidate samples in the candidate queue and distributes the sample shards to at least two computing nodes.
In practice, according to the number of computing nodes participating in the computation (assumed to be N), the main control node divides all candidate samples in the candidate queue into a corresponding number of shards (N shards) and distributes each sample shard to a different computing node.
In practice, the main control node can flexibly determine the number of computing nodes participating in parallel processing according to the number of candidate samples to be distributed in the current round: when many candidate samples need processing, more computing nodes are started; when there are few candidate samples, the number of computing nodes participating in parallel processing is reduced accordingly.
Step S203: each computing node reads, in order, one sample from the shard assigned to it and determines, according to the currently unmarked samples in the sample database and the set epsilon neighborhood and minimum density, whether this sample is a core sample; if it is a core sample, step S204 is performed; otherwise, the flow goes to step S205.
Step S204: each sample in the epsilon neighborhood of the determined core sample is stored in the candidate queue, and the flow continues with step S205.
Step S205: each computing node judges whether every sample in the shard assigned to it has been processed; if so, step S206 is performed; otherwise, the flow returns to step S203.
Step S206: the computing node notifies the main control node that processing is complete.
Step S207: after receiving the notification sent by each computing node, the main control node judges whether the candidate queue contains candidate samples; if it does, step S208 is performed; if it does not, step S209 is performed.
Step S208: each candidate sample is marked as belonging to the current cluster, and the flow returns to step S202.
Step S209: the main control node judges whether the sample database still contains unmarked samples; if so, the flow returns to step S201; otherwise, step S210 is performed.
Step S210: according to the cluster marked for each sample, the samples belonging to each cluster are obtained.
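The round structure of steps S201 to S210 can be sketched as below. This is only an illustration under stated assumptions, not the patented implementation: the computing nodes are simulated with a local thread pool, distance is Euclidean, a core sample is one with at least `min_pts` samples in its neighborhood, and each simulated node returns its newly found candidates to the main control node instead of writing to a shared queue. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def neighbors(samples, i, eps):
    """Indices of all samples within distance eps of samples[i]."""
    return [j for j, p in enumerate(samples) if dist(samples[i], p) <= eps]

def process_shard(samples, labels, shard, eps, min_pts):
    """Computing-node work (S203-S205): test each candidate in the shard and
    collect the unmarked neighbors of every candidate that is a core sample."""
    found = set()
    for c in shard:
        nbrs = neighbors(samples, c, eps)
        if len(nbrs) >= min_pts:
            found.update(j for j in nbrs if labels[j] is None)
    return found

def parallel_dbscan(samples, eps, min_pts, n_nodes=2):
    labels = [None] * len(samples)
    cluster = -1
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        for seed in range(len(samples)):
            if labels[seed] is not None:
                continue
            nbrs = neighbors(samples, seed, eps)
            if len(nbrs) < min_pts:
                continue                       # not a core sample; try the next one (S201)
            cluster += 1
            labels[seed] = cluster
            candidates = [j for j in nbrs if labels[j] is None]
            while candidates:                  # S207: the candidate queue is not empty
                for j in candidates:           # S208: mark the candidates
                    labels[j] = cluster
                # S202: one shard per node (round-robin split)
                shards = [candidates[k::n_nodes] for k in range(n_nodes)]
                results = pool.map(
                    lambda s: process_shard(samples, labels, s, eps, min_pts), shards)
                # Merge duplicates reported by different nodes into one queue
                candidates = sorted(set().union(*results))
    return [-1 if lab is None else lab for lab in labels]   # -1 marks noise
```

Labels are written only by the main loop between rounds, so the simulated nodes read a stable snapshot while they work; the set union plays the role of merging identical samples reported by different nodes.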
Cluster realizing method one provided by the present invention is particularly suitable for clustering the samples of databases with a relatively small number of samples. At least two computing nodes take part in the cluster computation in parallel, which speeds up computation.
When the number of samples in the raw database is especially large, the raw database can first be sampled to generate a sample database. After the cluster to which each sample in the generated sample database belongs has been determined according to cluster realizing method one above, the cluster to which each remaining sample in the raw database belongs is then determined as follows:
For each current cluster, the main control node calculates the mean of the sample values of the samples belonging to that cluster, thereby determining the cluster center point of each cluster; and
The main control node divides the remaining samples in the raw database, other than the samples contained in the generated sample database, into blocks and distributes the sample blocks to at least two computing nodes, sending each computing node its assigned sample block together with the calculated cluster center point information;
Each computing node calculates the distance between the sample point, in the sample space, of each sample in its assigned sample block and each cluster center point, assigns each sample to the cluster of the cluster center point at minimum distance, and returns the sample identifier and the identifier of the cluster to which it belongs to the main control node;
According to the sample identifiers and cluster identifiers returned by the computing nodes, the main control node determines the cluster to which each remaining sample in the raw database belongs.
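The center computation and nearest-center assignment described above can be sketched as follows. The sketch assumes numeric sample vectors and Euclidean distance, and it skips samples labeled -1 (a hypothetical noise marker) when computing centers; the names `cluster_centers` and `assign_block` are illustrative only.

```python
from math import dist

def cluster_centers(samples, labels):
    """Main-control-node work: mean of the sample values of each cluster."""
    groups = {}
    for point, label in zip(samples, labels):
        if label != -1:                        # assumption: -1 marks noise
            groups.setdefault(label, []).append(point)
    return {label: tuple(sum(col) / len(pts) for col in zip(*pts))
            for label, pts in groups.items()}

def assign_block(block, centers):
    """Computing-node work: for each (sample_id, point) in the block, return
    (sample_id, cluster_id) of the nearest cluster center point."""
    return [(sid, min(centers, key=lambda label: dist(point, centers[label])))
            for sid, point in block]
```

Each block can be processed independently because a computing node needs only its own samples and the small table of center points.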
In the above flow, when candidate samples exist in the candidate queue, the main control node, before marking each candidate sample as belonging to the current cluster, also merges identical samples stored in the candidate queue by different computing nodes.
In cluster realizing method one above, the candidate samples in the candidate queue are sharded and processed by multiple computing nodes in parallel (core samples are determined in parallel, and each sample in the epsilon neighborhood of a core sample is stored in the candidate queue in parallel), which makes full use of the computing resources of each node in the system, greatly shortens the waiting time when mass data undergoes cluster mining, and improves computational efficiency.
Referring to Fig. 3, the flow of cluster realizing method two provided by an embodiment of the present invention includes the following steps:
Step S301: the main control node divides the currently unmarked samples in the raw database into blocks and distributes the sample blocks to at least two computing nodes; it also determines a core sample from the currently unmarked samples in the raw database, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue;
Step S302: the main control node sends the candidate samples in the candidate queue to each computing node;
Step S303: each computing node counts, according to its assigned sample block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the counts to the merging node;
Step S304: for each candidate sample, the merging node accumulates the corresponding sample counts sent by the computing nodes and determines, according to the accumulated sum and the set minimum density, whether the candidate sample is a core sample; when core samples are determined to exist, the merging node notifies each computing node of the determined core samples; when no core sample is determined to exist, the merging node notifies the main control node;
Step S305: after receiving the core sample notification sent by the merging node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue and, once storing is complete, notifies the main control node;
Step S306: after receiving the notification sent by the merging node, the main control node returns to step S301; after receiving the notification sent by each computing node, the main control node marks each candidate sample in the candidate queue as belonging to the current cluster and returns to step S302; this continues until every sample in the raw database has been marked with the cluster to which it belongs.
Referring to Fig. 4, the practical application flow of cluster realizing method two provided by an embodiment of the present invention includes:
Step S401: the main control node divides the currently unmarked samples in the raw database into blocks and distributes the sample blocks to at least two computing nodes; it also determines a core sample from the currently unmarked samples in the raw database, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue.
In practice, according to the number of computing nodes participating in the computation (assumed to be N), the main control node divides all samples to be processed into a corresponding number of blocks (N blocks) and distributes each sample block to a different computing node.
Step S402: the main control node sends the candidate samples in the candidate queue to each computing node.
Step S403: each computing node counts, according to its assigned sample block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the counts to the merging node.
Step S404: for each candidate sample, the merging node accumulates the corresponding sample counts sent by the computing nodes and determines, according to the accumulated sum and the set minimum density, whether the candidate sample is a core sample; when at least one core sample is determined to exist, step S405 is performed; otherwise, the flow goes to step S406.
Step S405: the merging node notifies each computing node of the determined core samples, and the flow goes to step S407.
Step S406: the merging node notifies the main control node that the candidate samples sent in this round contain no core sample, and the flow goes to step S409.
Steps S405 and S406 above are two alternative branches and cannot both occur.
Step S407: after receiving the core sample notification sent by the merging node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue and, once storing is complete, notifies the main control node; the flow continues with step S408.
Step S408: the main control node marks each candidate sample in the candidate queue as belonging to the current cluster, and the flow returns to step S402.
Step S409: the main control node judges whether the raw database still contains samples not marked as belonging to a cluster; if so, the flow returns to step S401; otherwise, step S410 is performed.
Step S410: according to the cluster marked for each sample, the samples belonging to each cluster are obtained.
In step S408 above, before marking each candidate sample in the candidate queue as belonging to the current cluster, the main control node also merges identical samples stored in the candidate queue by different computing nodes.
In one embodiment, there are at least two merging nodes; the candidate samples that each merging node is to merge are allocated in advance by the main control node.
In the above flow, each computing node counts the number of samples in the local epsilon neighborhood of each candidate sample and sends the counts to the merging nodes. Specific methods include:
Method one: the main control node notifies each computing node in advance of the candidate samples that each merging node is to merge; according to this allocation, each computing node reports its locally counted sample count for each candidate sample to the corresponding merging node;
Method two: according to the candidate samples that it is to merge, each merging node requests each computing node to upload the statistics for those candidate samples; each computing node returns its locally counted sample counts for those candidate samples to the requesting merging node.
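Method one above can be illustrated with a small routing sketch. The text does not say how the main control node allocates candidate samples to merging nodes; the modulo rule on integer candidate identifiers used here is purely an assumption for illustration, as are the names `merge_node_for` and `route_counts`.

```python
def merge_node_for(candidate_id, n_merge):
    """Hypothetical allocation rule: candidate id modulo the number of merging nodes."""
    return candidate_id % n_merge

def route_counts(local_counts, n_merge):
    """Computing-node work: split the locally counted {candidate_id: count}
    into one outgoing message per merging node."""
    outbox = [{} for _ in range(n_merge)]
    for candidate_id, count in local_counts.items():
        outbox[merge_node_for(candidate_id, n_merge)][candidate_id] = count
    return outbox
```

Because every computing node applies the same allocation, all partial counts for a given candidate sample arrive at the same merging node, which can then accumulate them without consulting the other merging nodes.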
In cluster realizing method two above, two or more computing nodes take part in the cluster computation in parallel, which improves computational efficiency. Each computing node processes only part of the sample data, which solves the problem that mass data cannot be processed on a single machine. In addition, a corresponding number of merging nodes can be set according to the number of candidate samples, so that the merging process is also parallelized, further improving merge speed.
The cluster realizing methods provided by the above embodiments of the present invention can be implemented with Map/Reduce functions. Each computing node uses a Map function to obtain, within its assigned sample block, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the counts to the merging node; the merging node uses a Reduce function to merge the corresponding sample counts sent by the computing nodes. The Key and Value pairs of the Map/Reduce functions are determined according to the objects to be processed and the results required in practice.
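A plain-Python sketch of this Map/Reduce split is given below. It does not use any particular Map/Reduce framework; the choice of key (candidate identifier) and value (local neighbor count), the Euclidean distance, and the "at least `min_pts`" threshold are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from math import dist

def map_block(block, candidates, eps):
    """Map (computing node): for one sample block, emit
    (candidate_id, number of block samples within eps of the candidate)."""
    for cid, cpoint in candidates.items():
        yield cid, sum(1 for p in block if dist(cpoint, p) <= eps)

def reduce_counts(pairs, min_pts):
    """Reduce (merging node): sum the local counts per candidate and return
    the ids whose accumulated count reaches min_pts (the core samples)."""
    totals = defaultdict(int)
    for cid, count in pairs:
        totals[cid] += count
    return {cid for cid, total in totals.items() if total >= min_pts}
```

A possible driver would chain the map outputs of all blocks into one stream, for example `reduce_counts((kv for b in blocks for kv in map_block(b, candidates, eps)), min_pts)`; no single node ever needs the whole data set in memory, since each map call sees only its own block.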
Based on same inventive concept, according to the cluster realizing method one that the above embodiment of the present invention provides, the present invention provides a kind of corresponding cluster to realize system, and its structural representation is as it is shown in figure 5, include: main controlled node main controlled node 51 and at least two computing node 52.
Main controlled node 51, for determining a core sample according to the sample of the current unmarked affiliated cluster in sample database, and sample each in the epsilon neighborhood of this core sample is labeled as belongs to current cluster, sample each in the epsilon neighborhood of this core sample is stored in candidate queue;And the candidate samples in candidate queue is carried out burst, burst sample is distributed and is handed down to computing node 52;And receive the notice that computing node 52 sends, it is judged that whether candidate queue exists candidate samples, when there is candidate item sample, each candidate samples being labeled as and belongs to current cluster, and burst is handed down to computing node 52 again;After judging that candidate queue is sky, then determine that next core sample repeats said process, until cluster belonging to each sample labelling in sample database;
Whether computing node 52, be core sample for determining each sample in the burst sample of distribution respectively according to the current unmarked sample in sample database, the epsilon neighborhood of setting and minimum density;Each sample in the epsilon neighborhood of the core sample determined is stored in described candidate queue;And after the burst sample of distribution is all disposed, notify described main controlled node.
In a specific embodiment, main controlled node 51 is further configured to sample the raw database to obtain the sample database and, for each cluster obtained from the sample database, compute the mean of the sample values of the samples currently belonging to that cluster, thereby determining the cluster center point of each cluster. Main controlled node 51 is further configured to divide the remaining samples of the raw database, that is, those not included in the sample database, into blocks, assign the blocks to the computing nodes 52, and send each computing node 52 its assigned block together with the determined cluster center points; and to determine, from the sample identifiers and cluster identifiers returned by the computing nodes 52, the cluster to which each of the remaining samples belongs.
Each computing node 52 is further configured to compute, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point, assign the sample to the cluster of the nearest cluster center point, and return the sample identifier and the identifier of that cluster to main controlled node 51.
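The two roles in this embodiment can be illustrated with a short sketch: the cluster center point is the per-dimension mean of the samples of a cluster, and each remaining sample goes to the cluster whose center point is nearest. The function names `cluster_centers` and `assign_block`, the Euclidean distance and the example data are assumptions made for illustration.

```python
from math import dist


def cluster_centers(samples, labels):
    """Main-controlled-node role: mean of the sample values of each cluster."""
    sums, counts = {}, {}
    for sid, cid in labels.items():
        point = samples[sid]
        acc = sums.setdefault(cid, [0.0] * len(point))
        for k, value in enumerate(point):
            acc[k] += value
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: tuple(v / counts[cid] for v in acc) for cid, acc in sums.items()}


def assign_block(block, centers):
    """Computing-node role: return (sample id, cluster id) for each sample in the block."""
    return [
        (sid, min(centers, key=lambda cid: dist(point, centers[cid])))
        for sid, point in block.items()
    ]


# Hypothetical sample database that has already been clustered.
samples = {"s1": (0, 0), "s2": (2, 0), "s3": (10, 10), "s4": (12, 10)}
labels = {"s1": 1, "s2": 1, "s3": 2, "s4": 2}
centers = cluster_centers(samples, labels)   # {1: (1.0, 0.0), 2: (11.0, 10.0)}

# Hypothetical block of remaining samples from the raw database.
block = {"r1": (1, 1), "r2": (9, 9)}
assigned = assign_block(block, centers)      # [("r1", 1), ("r2", 2)]
```

Because each block is assigned independently of the others, the computing nodes need only the center points and their own block, which is what allows this stage to run in parallel.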
Based on the same inventive concept, and corresponding to cluster realizing method two provided by the above embodiments, the present invention provides a cluster realizing system whose structure is shown in Figure 6. The system includes a main controlled node 61, at least two computing nodes 62 and a merging node 63.
Main controlled node 61 is configured to: divide the currently unmarked samples in the raw database into blocks and distribute the blocks to the at least two computing nodes; determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each of those samples in a candidate queue; issue the candidate samples in the candidate queue to each computing node 62; on receiving the notifications of the computing nodes 62, mark each candidate sample in the candidate queue as belonging to the current cluster and again issue the candidate samples in the candidate queue to the computing nodes 62; and, on receiving the notification of merging node 63, determine the next core sample and repeat the above process until every sample in the raw database has been marked with the cluster it belongs to.
Computing node 62 is configured to: count, according to its assigned block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample and send the counts to merging node 63; and, on receiving the core sample notification sent by merging node 63, store each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notify main controlled node 61.
Merging node 63 is configured to: for each candidate sample, sum the corresponding sample counts sent by the computing nodes 62 and determine from the sum and the set minimum density whether the candidate sample is a core sample; when a core sample is determined to exist, notify each computing node 62 of the core samples so determined; and when no core sample is determined to exist, notify main controlled node 61.
Merging node 63 includes at least two merging nodes. Main controlled node 61 is further configured to allocate in advance the candidate samples that each merging node is to merge.
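How counts might flow when several merging nodes share the work can be sketched as follows. The round-robin allocation, the function names `allocate_candidates`, `route_counts` and `find_core_samples`, and the rule that a candidate is a core sample when its summed count is at least `min_pts` are all assumptions made for illustration; the embodiment only requires that the allocation be made in advance by the main controlled node.

```python
def allocate_candidates(candidate_ids, n_mergers):
    """Main-controlled-node role: pre-assign each candidate to one merging node."""
    return {cid: k % n_mergers for k, cid in enumerate(candidate_ids)}


def route_counts(local_counts, allocation, n_mergers):
    """Computing-node role: group local counts by the responsible merging node."""
    outbox = [dict() for _ in range(n_mergers)]
    for cid, count in local_counts.items():
        outbox[allocation[cid]][cid] = count
    return outbox


def find_core_samples(reports, min_pts):
    """Merging-node role: sum the reported counts and apply the minimum density."""
    totals = {}
    for report in reports:
        for cid, count in report.items():
            totals[cid] = totals.get(cid, 0) + count
    return sorted(cid for cid, total in totals.items() if total >= min_pts)


allocation = allocate_candidates(["a", "b", "c"], n_mergers=2)

# Hypothetical local neighborhood counts from two computing nodes.
node_counts = [{"a": 2, "b": 1, "c": 0}, {"a": 1, "b": 0, "c": 1}]
outboxes = [route_counts(counts, allocation, 2) for counts in node_counts]

# Merging node m receives the m-th outbox entry of every computing node.
core_by_merger = [
    find_core_samples([outbox[m] for outbox in outboxes], min_pts=3)
    for m in range(2)
]
```

In this example candidate "a" reaches a total of 3 and is reported as a core sample by the first merging node, while the second merging node finds none among its candidates.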
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to cover them.

Claims (10)

1. A cluster realizing method, characterized by comprising:
Step 1: a main controlled node determines a core sample from the samples in a sample database that are not yet marked with a cluster, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each of those samples in a candidate queue;
Step 2: the main controlled node shards the candidate samples in the candidate queue according to the number of computing nodes participating in the computation, and distributes the shards to at least two computing nodes;
Step 3: each computing node determines, according to the samples in the sample database that are currently unmarked, the set epsilon neighborhood and the minimum density, whether each sample in its assigned shard is a core sample; stores each sample in the epsilon neighborhood of every core sample so determined in the candidate queue; and notifies the main controlled node once all samples in its assigned shard have been processed;
Step 4: after receiving the notifications sent by the computing nodes, the main controlled node judges whether the candidate queue contains candidate samples; when it does, the main controlled node marks each candidate sample as belonging to the current cluster and returns to Step 2; when it does not, the main controlled node returns to Step 1, until every sample in the sample database has been marked with the cluster it belongs to.
2. The cluster realizing method as claimed in claim 1, characterized in that, before Step 1, the method further comprises: the main controlled node samples a raw database to obtain the sample database; and
after Step 4, the method further comprises:
the main controlled node computes, for each current cluster, the mean of the sample values of the samples belonging to that cluster, thereby determining the cluster center point of each cluster;
the main controlled node divides the remaining samples of the raw database, that is, those not included in the sample database, into blocks and assigns the blocks to at least two computing nodes, sending each computing node its assigned block together with the determined cluster center points;
each computing node computes, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point, assigns the sample to the cluster of the nearest cluster center point, and returns the sample identifier and the identifier of that cluster to the main controlled node;
the main controlled node determines, from the sample identifiers and cluster identifiers returned by the computing nodes, the cluster to which each of the remaining samples belongs.
3. The cluster realizing method as claimed in claim 1 or 2, characterized in that, in Step 4, when the candidate queue contains candidate samples, the main controlled node, before marking each candidate sample as belonging to the current cluster, also merges duplicates of the same sample that different computing nodes have stored in the candidate queue.
4. A cluster realizing method, characterized by comprising:
Step 1: a main controlled node divides the currently unmarked samples in a raw database into blocks according to the number of computing nodes participating in the computation, and distributes the blocks to at least two computing nodes; and determines a core sample from the currently unmarked samples in the raw database, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each of those samples in a candidate queue;
Step 2: the main controlled node issues the candidate samples in the candidate queue to each computing node;
Step 3: each computing node counts, according to its assigned block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the counts to a merging node;
Step 4: the merging node sums, for each candidate sample, the corresponding sample counts sent by the computing nodes, and determines from the sum and the set minimum density whether the candidate sample is a core sample; when a core sample is determined to exist, the merging node notifies each computing node of the core samples so determined; and when no core sample is determined to exist, the merging node notifies the main controlled node;
Step 5: after receiving the core sample notification sent by the merging node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notifies the main controlled node;
Step 6: after receiving the notification sent by the merging node, the main controlled node returns to Step 1; and after receiving the notifications sent by the computing nodes, the main controlled node marks each candidate sample in the candidate queue as belonging to the current cluster and returns to Step 2; until every sample in the raw database has been marked with the cluster it belongs to.
5. The cluster realizing method as claimed in claim 4, characterized in that, in Step 6, the main controlled node, before marking each candidate sample in the candidate queue as belonging to the current cluster, also merges duplicates of the same sample that different computing nodes have stored in the candidate queue.
6. The cluster realizing method as claimed in claim 5, characterized in that there are at least two merging nodes; the candidate samples that each merging node is to merge are allocated in advance by the main controlled node; and counting the number of samples in the local epsilon neighborhood of each candidate sample and sending the counts to the merging node in Step 3 specifically comprises:
each computing node, according to the candidate samples that each merging node is to merge, reports its locally counted sample number for each candidate sample to the corresponding merging node; or
each merging node, according to the candidate samples it is to merge, requests each computing node to upload the statistics of those candidate samples; and each computing node returns to each merging node its locally counted sample numbers for the corresponding candidate samples.
7. A cluster realizing system, characterized by comprising: a main controlled node and at least two computing nodes;
the main controlled node is configured to: determine a core sample from the samples in a sample database that are not yet marked with a cluster, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each of those samples in a candidate queue; shard the candidate samples in the candidate queue according to the number of computing nodes participating in the computation and distribute the shards to the at least two computing nodes; receive the notifications sent by the computing nodes and judge whether the candidate queue contains candidate samples and, when it does, mark each candidate sample as belonging to the current cluster and again distribute shards to the at least two computing nodes; and, once the candidate queue is judged to be empty, determine the next core sample and repeat the above process until every sample in the sample database has been marked with the cluster it belongs to;
each computing node is configured to: determine, according to the samples in the sample database that are currently unmarked, the set epsilon neighborhood and the minimum density, whether each sample in its assigned shard is a core sample; store each sample in the epsilon neighborhood of every core sample so determined in the candidate queue; and notify the main controlled node once all samples in its assigned shard have been processed.
8. The cluster realizing system as claimed in claim 7, characterized in that the main controlled node is further configured to: sample a raw database to obtain the sample database; for each cluster obtained from the sample database, compute the mean of the sample values of the samples currently belonging to that cluster, thereby determining the cluster center point of each cluster; divide the remaining samples of the raw database, that is, those not included in the sample database, into blocks, assign the blocks to the at least two computing nodes, and send each computing node its assigned block together with the determined cluster center points; and determine, from the sample identifiers and cluster identifiers returned by the computing nodes, the cluster to which each of the remaining samples belongs;
each computing node is further configured to compute, for each sample in its assigned block, the distance between the sample's point in the sample space and each cluster center point, assign the sample to the cluster of the nearest cluster center point, and return the sample identifier and the identifier of that cluster to the main controlled node.
9. A cluster realizing system, characterized by comprising: a main controlled node, at least two computing nodes and a merging node;
the main controlled node is configured to: divide the currently unmarked samples in a raw database into blocks according to the number of computing nodes participating in the computation and distribute the blocks to the at least two computing nodes; determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each of those samples in a candidate queue; issue the candidate samples in the candidate queue to each computing node; on receiving the notifications of the computing nodes, mark each candidate sample in the candidate queue as belonging to the current cluster and again issue the candidate samples in the candidate queue to the at least two computing nodes; and, on receiving the notification of the merging node, determine the next core sample and repeat the above process until every sample in the raw database has been marked with the cluster it belongs to;
each computing node is configured to: count, according to its assigned block and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample and send the counts to the merging node; and, on receiving the core sample notification sent by the merging node, store each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is complete, notify the main controlled node;
the merging node is configured to: for each candidate sample, sum the corresponding sample counts sent by the computing nodes and determine from the sum and the set minimum density whether the candidate sample is a core sample; when a core sample is determined to exist, notify each computing node of the core samples so determined; and when no core sample is determined to exist, notify the main controlled node.
10. The cluster realizing system as claimed in claim 9, characterized in that there are at least two merging nodes;
the main controlled node is further configured to allocate in advance the candidate samples that each merging node is to merge.
CN200910091866.7A 2009-08-31 2009-08-31 Cluster realizing method and system Active CN101996198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910091866.7A CN101996198B (en) 2009-08-31 2009-08-31 Cluster realizing method and system

Publications (2)

Publication Number Publication Date
CN101996198A CN101996198A (en) 2011-03-30
CN101996198B true CN101996198B (en) 2016-06-29

Family

ID=43786365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910091866.7A Active CN101996198B (en) 2009-08-31 2009-08-31 Cluster realizing method and system

Country Status (1)

Country Link
CN (1) CN101996198B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902655B (en) * 2014-02-28 2017-01-04 小米科技有限责任公司 Clustering method, device and terminal unit
CN105447008A (en) * 2014-08-11 2016-03-30 中国移动通信集团四川有限公司 Distributed processing method and system for time series clustering
CN105912598A (en) * 2016-04-05 2016-08-31 中国农业大学 Method and system for determining high-frequency regions for roadside stall business in urban streets
CN108628954B (en) * 2018-04-10 2021-05-25 北京京东尚科信息技术有限公司 Mass data self-service query method and device
CN109165639B (en) * 2018-10-15 2021-12-10 广州广电运通金融电子股份有限公司 Finger vein identification method, device and equipment
CN111444544B (en) * 2020-06-12 2020-09-11 支付宝(杭州)信息技术有限公司 Method and device for clustering private data of multiple parties

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080154942A1 (en) * 2006-12-22 2008-06-26 Cheng-Fa Tsai Method for Grid-Based Data Clustering
CN101339553A (en) * 2008-01-14 2009-01-07 浙江大学 Approximate quick clustering and index method for mass data

Also Published As

Publication number Publication date
CN101996198A (en) 2011-03-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant