CN101996198A - Cluster implementation method and system - Google Patents

Publication number
CN101996198A
Authority
CN
China
Prior art keywords
sample
cluster
node
candidate
main controlled
Legal status
Granted
Application number
CN2009100918667A
Other languages
Chinese (zh)
Other versions
CN101996198B (en)
Inventor
徐萌
高丹
邓超
罗治国
周文辉
孙少陵
何清
赵卫中
马慧芳
Current Assignee
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Application filed by China Mobile Communications Group Co Ltd
Priority to CN200910091866.7A
Publication of CN101996198A
Application granted
Publication of CN101996198B

Abstract

The invention discloses a clustering method and system. In the method, a master control node shards the candidate samples in a candidate queue, and at least two computing nodes determine in parallel, according to a preset epsilon neighborhood and minimum density, whether each sample in their allocated shard is a core sample. Because the computing nodes work in parallel, the cluster to which each sample in a sample database belongs is marked more quickly. The invention also discloses another clustering method and system, in which a master control node partitions the currently unmarked samples in a sample database into blocks and allocates and issues the blocks to at least two computing nodes; the computing nodes process the candidate samples in a candidate queue in parallel; and merge nodes combine the processing results obtained from the computing nodes. Because each computing node processes only part of the samples, the problem that mass data cannot be processed by a single computer is solved, and because the mass data can be processed in parallel by multiple computing nodes and multiple merge nodes, processing efficiency is greatly improved.

Description

Cluster implementation method and system
Technical field
The present invention relates to the field of data mining, and in particular to a clustering method and a corresponding system for massive sample data.
Background art
In the field of data mining, existing clustering algorithms fall into several classes, including partitioning-based, hierarchy-based, density-based, grid-based, and model-based methods.
Data mining requires computing over and analysing all of the data item by item, so the time complexity of the algorithms is high. Mass data is a challenge for every clustering algorithm. Most existing clustering algorithms have not moved beyond the laboratory stage; when faced with mass data, some cannot process it effectively and others process it very inefficiently.
The DBSCAN algorithm is a clustering algorithm based on spatial density. It groups regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a sample space containing "noise" (that is, some non-core sample points).
The basic principle of the DBSCAN algorithm is as follows:
An epsilon neighborhood and a minimum density are set for the data mining task. The epsilon neighborhood of an object is the region within radius ε of that object, and the minimum density is the minimum number of samples required inside an epsilon neighborhood. When the number of samples not yet assigned to a cluster within the epsilon neighborhood of an unmarked sample is greater than the set minimum density, that sample is determined to be a core sample. The core sample is marked as belonging to the current cluster, and each sample in its epsilon neighborhood is inserted into a candidate queue and marked as belonging to the current cluster. Each candidate sample in the candidate queue is then checked in turn: if it is a core sample, each sample in its epsilon neighborhood is likewise inserted into the candidate queue. This is repeated until every sample in the sample database has been traversed and the cluster to which each sample belongs has been marked.
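The single-machine procedure described above can be sketched as follows. This is a minimal illustration rather than the patent's own code: the function names are hypothetical, distances are assumed to be Euclidean, and the core test uses the common DBSCAN convention (at least min_pts samples in the neighborhood, counting the sample itself), which differs slightly from the "greater than" wording in the text.

```python
from collections import deque
from math import dist

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)          # None means "not yet marked"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:      # assumed convention: count >= min_pts
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbours)          # the candidate queue
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster        # former noise becomes a border sample
            if labels[j] is not None:
                continue                   # already marked, do not expand again
            labels[j] = cluster
            nb = region_query(points, j, eps)
            if len(nb) >= min_pts:         # j is itself a core sample
                queue.extend(nb)
    return labels
```

Every pass through the while loop must wait for the queue to be refilled by the previous pass, which is the sequential bottleneck that the background section goes on to describe.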
For a small number of samples, the DBSCAN clustering algorithm described above is easy to implement on a single machine. For massive sample sets, however, there are two problems. On the one hand, the memory of a single machine is limited, so the massive sample data cannot all be read in. On the other hand, the clustering process must wait for the queue to be updated dynamically before the samples in the sample database can be marked with their clusters, so the processing time is very long and efficiency in real data services is very low.
Therefore, how to effectively improve processing efficiency when handling mass data in practical applications is a major problem to be solved in data mining.
Summary of the invention
Embodiments of the invention provide clustering methods and clustering systems that use parallel processing across multiple nodes to solve the problems that the prior art cannot cluster mass data and has low processing efficiency.
A clustering method provided by an embodiment of the invention comprises:
Step 1: the master control node determines a core sample from the samples in the sample database that are not yet marked with a cluster, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of that core sample in the candidate queue;
Step 2: the master control node shards the candidate samples in the candidate queue, and allocates and issues the shards to at least two computing nodes;
Step 3: each computing node determines, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether each sample in its allocated shard is a core sample; stores each sample in the epsilon neighborhood of every core sample so determined in the candidate queue; and notifies the master control node once all samples in its allocated shard have been processed;
Step 4: after receiving the notifications sent by the computing nodes, the master control node judges whether the candidate queue contains candidate samples. If it does, the master control node marks each candidate sample as belonging to the current cluster and returns to step 2; if it does not, the master control node returns to step 1. This continues until every sample in the sample database has been marked with a cluster.
Another clustering method provided by an embodiment of the invention comprises:
Step 1: the master control node partitions the currently unmarked samples in the original database into blocks, and allocates and issues the blocks to at least two computing nodes; it also determines a core sample from the currently unmarked samples in the original database, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of that core sample in the candidate queue;
Step 2: the master control node issues the candidate samples in the candidate queue to each computing node;
Step 3: each computing node counts, according to its allocated block and the set epsilon neighborhood, the number of samples within the local epsilon neighborhood of each candidate sample, and sends the counts to the merge node;
Step 4: for each candidate sample, the merge node accumulates the corresponding sample counts sent by the computing nodes, and determines from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when core samples are found, it notifies each computing node of them; when no core sample is found, it notifies the master control node;
Step 5: after receiving the core-sample notification sent by the merge node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue, and notifies the master control node once storing is complete;
Step 6: after receiving the notification sent by the merge node, the master control node returns to step 1; after receiving the notifications sent by the computing nodes, it marks each candidate sample in the candidate queue as belonging to the current cluster and returns to step 2. This continues until every sample in the original database has been marked with a cluster.
A clustering system provided by an embodiment of the invention comprises a master control node and at least two computing nodes;
The master control node is configured to: determine a core sample from the samples in the sample database that are not yet marked with a cluster, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of that core sample in the candidate queue; shard the candidate samples in the candidate queue, and allocate and issue the shards to the at least two computing nodes; receive the notifications sent by the computing nodes and judge whether the candidate queue contains candidate samples, and if it does, mark each candidate sample as belonging to the current cluster and again shard and issue the candidate samples to the at least two computing nodes; and, after judging that the candidate queue is empty, determine the next core sample and repeat the above process until every sample in the sample database has been marked with a cluster;
Each computing node is configured to: determine, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether each sample in its allocated shard is a core sample; store each sample in the epsilon neighborhood of every core sample so determined in the candidate queue; and notify the master control node once all samples in its allocated shard have been processed.
Another clustering system provided by an embodiment of the invention comprises a master control node, at least two computing nodes, and a merge node;
The master control node is configured to: partition the currently unmarked samples in the original database into blocks, and allocate and issue the blocks to the at least two computing nodes; determine a core sample from the currently unmarked samples in the original database, mark each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of that core sample in the candidate queue; issue the candidate samples in the candidate queue to each computing node; on receiving the notifications of the computing nodes, mark each candidate sample in the candidate queue as belonging to the current cluster and again issue the candidate samples in the candidate queue to the at least two computing nodes; and, on receiving the notification of the merge node, determine the next core sample and repeat the above process until every sample in the original database has been marked with a cluster;
Each computing node is configured to: count, according to its allocated block and the set epsilon neighborhood, the number of samples within the local epsilon neighborhood of each candidate sample, and send the counts to the merge node; and, on receiving the core-sample notification sent by the merge node, store each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue and notify the master control node once storing is complete;
The merge node is configured to: for each candidate sample, accumulate the corresponding sample counts sent by the computing nodes, and determine from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when core samples are found, notify each computing node of them; and when no core sample is found, notify the master control node.
In the first clustering method and corresponding system provided by the invention, the candidate samples in the candidate queue are sharded and processed in parallel by multiple computing nodes (core samples are determined in parallel). This makes full use of the computing resources of each node in the system, greatly shortens the waiting time when mass data is clustered, and improves computing efficiency.
In the other clustering method and corresponding system provided by the invention, the samples to be processed are partitioned into blocks and the blocks are distributed to different computing nodes, which solves the problem that mass data cannot all be read into the memory of a single machine for computation. At least two computing nodes take part in the clustering computation in parallel, which speeds up computation, and the merge node then merges their results. This makes full use of the computing resources of each node in the system and effectively solves the problems that the prior art cannot cluster mass data and has low processing efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of clustering method one provided by an embodiment of the invention;
Fig. 2 is a flowchart of a practical application of clustering method one provided by an embodiment of the invention;
Fig. 3 is a flowchart of the steps of clustering method two provided by an embodiment of the invention;
Fig. 4 is a flowchart of a practical application of clustering method two provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of the clustering system corresponding to clustering method one provided by an embodiment of the invention;
Fig. 6 is a schematic structural diagram of the clustering system corresponding to clustering method two provided by an embodiment of the invention.
Detailed description of the embodiments
The clustering methods and systems provided by the embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of clustering method one provided by an embodiment of the invention, which comprises the following steps:
Step S101: the master control node determines a core sample from the samples in the sample database that are not yet marked with a cluster, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of that core sample in the candidate queue.
Step S102: the master control node shards the candidate samples in the candidate queue, and allocates and issues the shards to at least two computing nodes.
Step S103: each computing node determines, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether each sample in its allocated shard is a core sample; stores each sample in the epsilon neighborhood of every core sample so determined in the candidate queue; and notifies the master control node once all samples in its allocated shard have been processed.
Step S104: after receiving the notifications sent by the computing nodes, the master control node judges whether the candidate queue contains candidate samples. If it does, the master control node marks each candidate sample as belonging to the current cluster and returns to step S102; if it does not, the master control node returns to step S101. This continues until every sample in the sample database has been marked with a cluster.
Fig. 2 is a flowchart of a practical application of clustering method one provided by an embodiment of the invention, which comprises:
Step S201: the master control node determines a core sample from the samples in the sample database that are not yet marked with a cluster, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of that core sample in the candidate queue.
Step S202: the master control node shards the candidate samples in the candidate queue, and allocates and issues the shards to at least two computing nodes.
In practice, the master control node divides all the candidate samples in the candidate queue into a number of shards equal to the number of computing nodes taking part in the computation (assumed to be N, giving N shards), and allocates each shard to a different computing node.
In practice, the master control node can flexibly decide how many computing nodes take part in parallel processing according to the number of candidate samples to be allocated in the current round. When many candidate samples need to be processed, more computing nodes are started; when there are few candidate samples, the number of computing nodes taking part in parallel processing is reduced accordingly.
Step S203: each computing node reads, in order, one sample from the shard allocated to it, and determines, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether that sample is a core sample. If it is, step S204 is performed; otherwise, the flow goes to step S205.
Step S204: each sample in the epsilon neighborhood of the core sample so determined is stored in the candidate queue, and the flow continues with step S205.
Step S205: each computing node judges whether every sample in the shard allocated to it has been processed. If so, step S206 is performed; otherwise, the flow returns to step S203.
Step S206: the computing node notifies the master control node that processing is complete.
Step S207: after receiving the notifications sent by the computing nodes, the master control node judges whether the candidate queue contains candidate samples. If it does, step S208 is performed; if it does not, step S209 is performed.
Step S208: each candidate sample is marked as belonging to the current cluster, and the flow returns to step S202.
Step S209: the master control node judges whether the sample database contains any sample that has not been marked. If so, the flow returns to step S201; otherwise, step S210 is performed.
Step S210: the samples belonging to each cluster are obtained from the cluster marked for each sample.
Clustering method one described above is particularly suitable for clustering the samples in a database whose sample count is not very large. At least two computing nodes take part in the clustering computation in parallel, which speeds up computation.
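The flow of steps S201 to S210 can be sketched on a single machine as follows. This is only an illustration under stated assumptions: worker threads stand in for the computing nodes, all function names are hypothetical, shards are formed round-robin, and the core test counts every sample within eps and uses "at least min_pts", which is a simplification of the wording in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def neighbours(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

def make_shards(items, n):
    """Master: split the candidate queue into at most n shards (round robin)."""
    shards = [items[k::n] for k in range(n)]
    return [s for s in shards if s]

def check_shard(points, labels, shard, eps, min_pts):
    """Computing node: unmarked neighbours of every core sample in the shard."""
    found = []
    for i in shard:
        nb = neighbours(points, i, eps)
        if len(nb) >= min_pts:                       # i is a core sample
            found.extend(j for j in nb if labels[j] is None)
    return found

def cluster_method_one(points, eps, min_pts, n_nodes=2):
    labels = [None] * len(points)                    # None means "not yet marked"
    cluster = -1
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        for seed in range(len(points)):              # S201: look for a core sample
            if labels[seed] is not None:
                continue
            nb = neighbours(points, seed, eps)
            if len(nb) < min_pts:
                continue                             # not core; may be claimed later
            cluster += 1
            candidates = [j for j in nb if labels[j] is None]
            for j in candidates:
                labels[j] = cluster
            while candidates:                        # S202 to S208
                shards = make_shards(candidates, n_nodes)
                results = pool.map(
                    lambda s: check_shard(points, labels, s, eps, min_pts), shards)
                merged = set()
                for r in results:
                    merged.update(r)                 # merge duplicate candidates
                candidates = sorted(merged)
                for j in candidates:
                    labels[j] = cluster
    return [-1 if c is None else c for c in labels]  # unmarked samples are noise
```

The master only writes to labels after every shard of a round has reported back, which mirrors the notification step S206 and keeps the workers' reads consistent within a round.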
When the number of samples in the original database is especially large, the original database can first be sampled to generate a sample database. After the cluster of each sample in the generated sample database has been determined by clustering method one, the clusters of the remaining samples in the original database are then determined as follows:
For each current cluster, the master control node calculates the mean of the sample values of the samples belonging to that cluster, and thereby determines the cluster centre point of each cluster; and
The master control node partitions into blocks the remaining samples in the original database, that is, the samples other than those contained in the generated sample database, and allocates the blocks to at least two computing nodes; it sends each computing node its allocated block together with the information on each calculated cluster centre point;
Each computing node calculates, for each sample in its allocated block, the distance between the sample's point in the sample space and each cluster centre point, assigns the sample to the cluster whose centre point is at the minimum distance, and returns the sample identifier and the identifier of its cluster to the master control node;
The master control node determines the cluster of each remaining sample in the original database from the sample identifiers and cluster identifiers returned by the computing nodes.
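The centre-point assignment described above can be sketched as follows. The function names, the use of Euclidean distance, and the convention that a label of -1 marks noise are assumptions made for illustration; the text does not specify them.

```python
from math import dist

def cluster_centres(samples, labels):
    """Master: mean of the sample values in each cluster; noise (-1) is ignored."""
    sums, counts = {}, {}
    for p, c in zip(samples, labels):
        if c == -1:
            continue
        acc = sums.setdefault(c, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: tuple(v / counts[c] for v in acc) for c, acc in sums.items()}

def assign_block(block, centres):
    """Computing node: (sample id, cluster id) for each sample in its block."""
    return [(sid, min(centres, key=lambda c: dist(p, centres[c])))
            for sid, p in block]
```

Each block can be processed independently because a computing node needs only its own block and the list of centre points, so this stage parallelizes without any shared queue.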
In the above flow, when the candidate queue contains candidate samples, the master control node also merges identical samples stored in the candidate queue by different computing nodes before marking each candidate sample as belonging to the current cluster.
In clustering method one described above, the candidate samples in the candidate queue are sharded and processed in parallel by multiple computing nodes (core samples are determined in parallel, and each sample in the epsilon neighborhood of a core sample is stored in the candidate queue in parallel). This makes full use of the computing resources of each node in the system, greatly shortens the waiting time when mass data is clustered, and improves computing efficiency.
Fig. 3 is a flowchart of the steps of clustering method two provided by an embodiment of the invention, which comprises the following steps:
Step S301: the master control node partitions the currently unmarked samples in the original database into blocks, and allocates and issues the blocks to at least two computing nodes; it also determines a core sample from the currently unmarked samples in the original database, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of that core sample in the candidate queue;
Step S302: the master control node issues the candidate samples in the candidate queue to each computing node;
Step S303: each computing node counts, according to its allocated block and the set epsilon neighborhood, the number of samples within the local epsilon neighborhood of each candidate sample, and sends the counts to the merge node;
Step S304: for each candidate sample, the merge node accumulates the corresponding sample counts sent by the computing nodes, and determines from the accumulated sum and the set minimum density whether the candidate sample is a core sample; when core samples are found, it notifies each computing node of them; when no core sample is found, it notifies the master control node;
Step S305: after receiving the core-sample notification sent by the merge node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue, and notifies the master control node once storing is complete;
Step S306: after receiving the notification sent by the merge node, the master control node returns to step S301; after receiving the notifications sent by the computing nodes, it marks each candidate sample in the candidate queue as belonging to the current cluster and returns to step S302. This continues until every sample in the original database has been marked with a cluster.
Fig. 4 is a flowchart of a practical application of clustering method two provided by an embodiment of the invention, which comprises:
Step S401: the master control node partitions the currently unmarked samples in the original database into blocks, and allocates and issues the blocks to at least two computing nodes; it also determines a core sample from the currently unmarked samples in the original database, marks each sample in the epsilon neighborhood of that core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of that core sample in the candidate queue.
In practice, the master control node divides all the samples to be processed into a number of blocks equal to the number of computing nodes taking part in the computation (assumed to be N, giving N blocks), and allocates each block to a different computing node.
Step S402: the master control node issues the candidate samples in the candidate queue to each computing node.
Step S403: each computing node counts, according to its allocated block and the set epsilon neighborhood, the number of samples within the local epsilon neighborhood of each candidate sample, and sends the counts to the merge node.
Step S404: for each candidate sample, the merge node accumulates the corresponding sample counts sent by the computing nodes, and determines from the accumulated sum and the set minimum density whether the candidate sample is a core sample. If at least one core sample is found, step S405 is performed; otherwise, the flow goes to step S406.
Step S405: the merge node notifies each computing node of the core samples so determined, and the flow goes to step S407.
Step S406: the merge node notifies the master control node that the candidate samples issued this time contain no core sample, and the flow goes to step S409.
Steps S405 and S406 above are two alternative branches and cannot both occur.
Step S407: after receiving the core-sample notification sent by the merge node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core samples in the candidate queue, notifies the master control node once storing is complete, and the flow continues with step S408.
Step S408: the master control node marks each candidate sample in the candidate queue as belonging to the current cluster, and the flow returns to step S402.
Step S409: the master control node judges whether the original database contains any sample that has not been marked with a cluster. If so, the flow returns to step S401; otherwise, step S410 is performed.
Step S410: the samples belonging to each cluster are obtained from the cluster marked for each sample.
In step S408 above, before marking each candidate sample in the candidate queue as belonging to the current cluster, the master control node also merges identical samples stored in the candidate queue by different computing nodes.
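One round of steps S403 to S408 can be sketched as follows. All names are hypothetical, a block is assumed to be a list of (sample id, point) pairs, candidates are assumed to be a mapping from sample id to point, and the core test again assumes "at least min_pts".

```python
from math import dist

def local_counts(block, candidates, eps):
    """Computing node: number of block samples within eps of each candidate."""
    return {cid: sum(1 for _, p in block if dist(cp, p) <= eps)
            for cid, cp in candidates.items()}

def find_core(per_node_counts, min_pts):
    """Merge node: sum the per-node counts and keep candidates reaching min_pts."""
    totals = {}
    for counts in per_node_counts:
        for cid, n in counts.items():
            totals[cid] = totals.get(cid, 0) + n
    return {cid for cid, n in totals.items() if n >= min_pts}

def local_neighbours(block, candidates, core_ids, eps):
    """Computing node: ids of block samples near any of the core samples."""
    return {sid for sid, p in block
            for cid in core_ids if dist(candidates[cid], p) <= eps}

def one_round(blocks, candidates, eps, min_pts):
    """Returns the core samples found and the next candidate queue."""
    core = find_core([local_counts(b, candidates, eps) for b in blocks], min_pts)
    queue = set()
    for b in blocks:
        # Set union merges samples reported by more than one computing node.
        queue |= local_neighbours(b, candidates, core, eps)
    return core, queue
```

No node in this sketch ever holds more than one block, which is the property the method relies on when the full data set does not fit in the memory of a single machine.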
In one embodiment, there are at least two merge nodes, and the master control node allocates in advance the candidate samples that each merge node is to merge.
In the above flow, each computing node counts the number of samples in the local epsilon neighborhood of each candidate sample and sends the counts to the merge nodes. This can be done in either of two ways:
Method one: the master control node notifies each computing node in advance of the candidate samples that each merge node is to merge; each computing node then reports the locally counted sample numbers of those candidate samples to the corresponding merge node;
Method two: each merge node requests from each computing node the statistics for the candidate samples that it is to merge; each computing node returns to each merge node the locally counted sample numbers of the corresponding candidate samples.
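Method one above (the push variant) might look like the following sketch. The round-robin assignment rule and both function names are assumptions for illustration; the text only says that the master control node allocates candidates to merge nodes in advance, not how.

```python
def assign_candidates(candidate_ids, n_merge_nodes):
    """Master: map each candidate id to a merge-node index (round robin, assumed)."""
    return {cid: k % n_merge_nodes for k, cid in enumerate(candidate_ids)}

def route_counts(local_counts, assignment):
    """Computing node: group its local counts by destination merge node."""
    outbox = {}
    for cid, n in local_counts.items():
        outbox.setdefault(assignment[cid], {})[cid] = n
    return outbox
```

Under method two the same assignment table would be used in the other direction: each merge node would look up its own candidate ids and request the matching counts from every computing node.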
In clustering method two described above, two or more computing nodes take part in the clustering computation in parallel, which improves computing efficiency. Each computing node processes only part of the sample data, which solves the problem that mass data cannot be processed by a single machine. In addition, a corresponding number of merge nodes can be set according to the number of candidate samples, so that the merging process is also parallelized and merging is further accelerated.
The clustering methods provided by the above embodiments of the invention can be implemented with Map/Reduce functions. Each computing node uses a Map function to obtain, from the block allocated to it, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the result to a merge node; the merge node uses a Reduce function to merge the corresponding sample counts sent by the computing nodes. The Key/Value pairs of the Map/Reduce functions, the objects processed, and the results are each determined according to actual needs.
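As an illustration of the Map/Reduce formulation, the sketch below simulates the map, shuffle, and reduce phases in one process. The choice of key (candidate id) and value (local count) is one possible assignment and is an assumption, as are the function names; the text leaves the Key/Value design open, and no particular Map/Reduce framework API is implied.

```python
from itertools import groupby
from math import dist
from operator import itemgetter

def map_fn(block, candidates, eps):
    """Map: emit (candidate id, local neighbour count) for one block."""
    for cid, cpoint in candidates:
        yield cid, sum(1 for p in block if dist(cpoint, p) <= eps)

def reduce_fn(cid, counts, min_pts):
    """Reduce: sum the counts for one candidate and test it against min_pts."""
    total = sum(counts)
    return cid, total, total >= min_pts

def run(blocks, candidates, eps, min_pts):
    """Simulate map, shuffle (sort and group by key), and reduce in one process."""
    pairs = sorted((kv for b in blocks for kv in map_fn(b, candidates, eps)),
                   key=itemgetter(0))
    return [reduce_fn(cid, [n for _, n in group], min_pts)
            for cid, group in groupby(pairs, key=itemgetter(0))]
```

Because the reduce step for one key never needs the values of another key, candidates can be spread over several merge nodes, which matches the parallel merging described above.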
Based on the same inventive concept, and corresponding to cluster implementation method one provided by the above embodiment, the present invention provides a cluster implementation system. As shown in Fig. 5, the system comprises a master control node 51 and at least two computing nodes 52.
Master control node 51 is configured to: determine a core sample from the samples in the sample database that are not yet marked with a cluster, mark each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of this core sample in the candidate queue; shard the candidate samples in the candidate queue, and allocate and issue the sharded samples to computing nodes 52; receive the notifications sent by computing nodes 52 and judge whether candidate samples exist in the candidate queue and, when candidate samples exist, mark each candidate sample as belonging to the current cluster and again shard and issue the candidate samples to computing nodes 52; and, after judging that the candidate queue is empty, determine the next core sample and repeat the above process until every sample in the sample database has been marked with the cluster to which it belongs.
Computing node 52 is configured to: determine, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether each sample in its allocated shard is a core sample; store each sample in the epsilon neighborhood of each determined core sample in the candidate queue; and notify the master control node after all of its allocated shard samples have been processed.
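The interaction between the master control node and the computing nodes in this system can be sketched as below. It is an illustration under stated assumptions rather than the patented implementation: a thread pool stands in for the computing nodes, the candidate queue is a plain list sharded round-robin, the density test counts all samples within the epsilon neighborhood (as in standard DBSCAN), and only unmarked neighbors are enqueued. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def neighbors(samples, i, eps):
    """Indices of all samples within eps of sample i (including i itself)."""
    return [j for j, p in enumerate(samples) if dist(samples[i], p) <= eps]

def process_shard(samples, labels, shard, eps, min_pts):
    """Computing node: test each shard sample for core status and collect
    the still-unmarked neighbors of every core sample as new candidates."""
    found = set()
    for i in shard:
        nb = neighbors(samples, i, eps)
        if len(nb) >= min_pts:  # core sample
            found.update(j for j in nb if labels[j] is None)
    return found

def cluster(samples, eps, min_pts, n_nodes=2):
    """Master control node loop; returns one cluster id (or None) per sample."""
    labels = [None] * len(samples)
    cluster_id = 0
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        for seed in range(len(samples)):
            if labels[seed] is not None:
                continue
            nb = neighbors(samples, seed, eps)
            if len(nb) < min_pts:
                continue  # not a core sample; stays unmarked unless absorbed later
            new = [j for j in nb if labels[j] is None]
            for j in new:
                labels[j] = cluster_id
            queue = [j for j in new if j != seed]
            while queue:
                # Shard the candidate queue and hand one shard to each node.
                shards = [queue[k::n_nodes] for k in range(n_nodes)]
                results = pool.map(
                    lambda s: process_shard(samples, labels, s, eps, min_pts), shards
                )
                merged = set().union(*results)  # drop duplicates across nodes
                for j in merged:
                    labels[j] = cluster_id
                queue = sorted(merged)
            cluster_id += 1
    return labels

samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = cluster(samples, eps=1.5, min_pts=3)
```

The master control node writes labels only after every shard of a round has returned, so the computing nodes read a stable view of the marks during each round.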
In a specific embodiment, master control node 51 is further configured to: perform sampling on the raw database to obtain the sample database; and, for each cluster obtained from the sample database, calculate the mean of the sample values of the samples currently belonging to that cluster, thereby determining the cluster center point of each cluster. Master control node 51 is further configured to: partition into blocks the remaining samples in the raw database other than the samples contained in the sample database, allocate the block samples to computing nodes 52, and send the allocated block samples and the determined cluster center points to each computing node 52; and determine, from the sample identifiers and the cluster identifiers of the assigned clusters returned by computing nodes 52, the cluster to which each of the remaining samples in the raw database belongs.
Each computing node 52 is further configured to: calculate, for each sample in its allocated block, the distance between the sample's corresponding point in the sample space and each cluster center point; assign each sample to the cluster of the cluster center point at the minimum distance; and return the sample identifier and the cluster identifier of the assigned cluster to master control node 51.
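The center computation and nearest-center assignment described in this embodiment can be sketched as follows. The helper names, the tuple representation of samples, and the Euclidean distance are assumptions chosen for the example; the source does not specify a distance measure.

```python
from math import dist

def cluster_centers(samples, labels):
    """Master control node: mean of the sample values in each cluster."""
    sums, counts = {}, {}
    for point, cid in zip(samples, labels):
        if cid is None:  # sample not assigned to any cluster
            continue
        acc = sums.setdefault(cid, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: tuple(v / counts[cid] for v in acc) for cid, acc in sums.items()}

def assign_block(block, centers):
    """Computing node: return (sample_id, cluster_id) for each block sample,
    choosing the cluster whose center point is nearest."""
    return [
        (sid, min(centers, key=lambda cid: dist(point, centers[cid])))
        for sid, point in block
    ]

# Clustered sample database (made-up data) and one block of remaining samples.
sampled = [(0, 0), (2, 0), (10, 10), (12, 10)]
sampled_labels = [0, 0, 1, 1]
centers = cluster_centers(sampled, sampled_labels)
assigned = assign_block([("x", (1, 1)), ("y", (9, 9))], centers)
```

Because each block is assigned independently given the centers, the blocks can be processed by different computing nodes without coordination.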
Based on the same inventive concept, and corresponding to cluster implementation method two provided by the above embodiment, the present invention provides a cluster implementation system. As shown in Fig. 6, the system comprises a master control node 61, at least two computing nodes 62, and a merge node 63.
Master control node 61 is configured to: partition the currently unmarked samples in the raw database into blocks, and allocate and issue the block samples to at least two computing nodes; determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of this core sample in the candidate queue; issue the candidate samples in the candidate queue to each computing node 62; receive the notifications of computing nodes 62, mark each candidate sample in the candidate queue as belonging to the current cluster, and issue the candidate samples in the candidate queue to computing nodes 62 again; and, after receiving the notification of merge node 63, determine the next core sample and repeat the above process until every sample in the raw database has been marked with the cluster to which it belongs.
Computing node 62 is configured to: count, according to its allocated block samples and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and send the counts to merge node 63; and, on receiving a core sample notification sent by merge node 63, store each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is finished, notify master control node 61.
Merge node 63 is configured to: accumulate, for each candidate sample, the corresponding sample counts sent by computing nodes 62, and determine from the accumulated sum and the set minimum density whether each candidate sample is a core sample; when core samples are determined to exist, notify each computing node 62 of the determined core samples; and when no core sample is determined to exist, notify master control node 61.
There are at least two merge nodes 63. Master control node 61 is further configured to allocate in advance the candidate samples that each merge node is responsible for merging.
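The pre-allocation of candidate samples to merge nodes and the core-sample decision can be sketched as follows. The round-robin allocation rule and all helper names are assumptions made for illustration; the source only states that the master control node allocates the candidates in advance.

```python
from collections import defaultdict

def allocate_candidates(candidate_ids, n_merge_nodes):
    """Master control node: pre-assign each candidate to one merge node
    (round-robin here, which is an assumption)."""
    return {cid: k % n_merge_nodes for k, cid in enumerate(candidate_ids)}

def route_counts(local_counts, allocation, n_merge_nodes):
    """Computing node: group its local neighborhood counts by the merge
    node responsible for each candidate."""
    outbox = [[] for _ in range(n_merge_nodes)]
    for cid, count in local_counts.items():
        outbox[allocation[cid]].append((cid, count))
    return outbox

def merge_and_decide(received, min_pts):
    """Merge node: accumulate counts per candidate and return the candidates
    whose total reaches the minimum density (the core samples)."""
    totals = defaultdict(int)
    for cid, count in received:
        totals[cid] += count
    return sorted(cid for cid, total in totals.items() if total >= min_pts)

allocation = allocate_candidates(["a", "b", "c"], n_merge_nodes=2)
node_counts = [{"a": 2, "b": 1, "c": 0}, {"a": 1, "b": 1, "c": 1}]  # made-up counts

inboxes = [[], []]
for counts in node_counts:
    for m, batch in enumerate(route_counts(counts, allocation, 2)):
        inboxes[m].extend(batch)

core_per_merge_node = [merge_and_decide(inbox, min_pts=3) for inbox in inboxes]
```

Since every count for a given candidate reaches the same merge node, each merge node can decide core status for its own candidates without consulting the others.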
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A cluster implementation method, characterized by comprising:
Step 1: a master control node determines a core sample from the samples in a sample database that are not yet marked with a cluster, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue;
Step 2: the master control node shards the candidate samples in the candidate queue, and allocates and issues the sharded samples to at least two computing nodes;
Step 3: each computing node determines, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether each sample in its allocated shard is a core sample; stores each sample in the epsilon neighborhood of each determined core sample in the candidate queue; and notifies the master control node after all of its allocated shard samples have been processed;
Step 4: after receiving the notifications sent by the computing nodes, the master control node judges whether candidate samples exist in the candidate queue; when candidate samples exist, it marks each candidate sample as belonging to the current cluster and returns to Step 2; when no candidate samples exist, it returns to Step 1, until every sample in the sample database has been marked with the cluster to which it belongs.
2. The cluster implementation method as claimed in claim 1, characterized by further comprising, before Step 1: the master control node performs sampling on a raw database to obtain the sample database; and
after Step 4, further comprising:
the master control node calculates, for each current cluster, the mean of the sample values of the samples belonging to that cluster, and determines the cluster center point of each cluster;
the master control node partitions into blocks the remaining samples in the raw database other than the samples contained in the sample database, and allocates the block samples to at least two computing nodes; and sends the allocated block samples and the determined cluster center points to each computing node;
each computing node calculates, for each sample in its allocated block, the distance between the sample's corresponding point in the sample space and each cluster center point, assigns each sample to the cluster of the cluster center point at the minimum distance, and returns the sample identifier and the cluster identifier of the assigned cluster to the master control node;
the master control node determines, from the sample identifiers and the cluster identifiers of the assigned clusters returned by the computing nodes, the cluster to which each of the remaining samples belongs.
3. The cluster implementation method as claimed in claim 1 or 2, characterized in that, in Step 4, when candidate samples exist in the candidate queue, before marking each candidate sample as belonging to the current cluster, the master control node further merges identical samples stored in the candidate queue by different computing nodes.
4. A cluster implementation method, characterized by comprising:
Step 1: a master control node partitions the currently unmarked samples in a raw database into blocks, and allocates and issues the block samples to at least two computing nodes; and determines a core sample from the currently unmarked samples in the raw database, marks each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and stores each sample in the epsilon neighborhood of this core sample in a candidate queue;
Step 2: the master control node issues the candidate samples in the candidate queue to each computing node;
Step 3: each computing node counts, according to its allocated block samples and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and sends the counts to a merge node;
Step 4: the merge node accumulates, for each candidate sample, the corresponding sample counts sent by the computing nodes, and determines from the accumulated sum and the set minimum density whether each candidate sample is a core sample; when core samples are determined to exist, it notifies each computing node of the determined core samples; and when no core sample is determined to exist, it notifies the master control node;
Step 5: after receiving the core sample notification sent by the merge node, each computing node stores each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is finished, notifies the master control node;
Step 6: after receiving the notification sent by the merge node, the master control node returns to Step 1; and after receiving the notifications sent by the computing nodes, it marks each candidate sample in the candidate queue as belonging to the current cluster and returns to Step 2; until every sample in the raw database has been marked with the cluster to which it belongs.
5. The cluster implementation method as claimed in claim 4, characterized in that, in Step 6, before marking each candidate sample in the candidate queue as belonging to the current cluster, the master control node further merges identical samples stored in the candidate queue by different computing nodes.
6. The cluster implementation method as claimed in claim 5, characterized in that there are at least two merge nodes; the master control node allocates in advance the candidate samples that each merge node is responsible for merging; and, in Step 3, counting the number of samples in the local epsilon neighborhood of each candidate sample and sending the counts to a merge node specifically comprises:
each computing node, according to the candidate samples that each merge node is responsible for merging, reports its locally counted sample numbers for the corresponding candidate samples to the corresponding merge node; or
each merge node, according to the candidate samples that it is responsible for merging, requests each computing node to upload the statistics for those candidate samples; and each computing node returns to each merge node its locally counted sample numbers for the corresponding candidate samples.
7. A cluster implementation system, characterized by comprising: a master control node and at least two computing nodes;
the master control node is configured to: determine a core sample from the samples in a sample database that are not yet marked with a cluster, mark each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of this core sample in a candidate queue; shard the candidate samples in the candidate queue, and allocate and issue the sharded samples to the at least two computing nodes; receive the notifications sent by the computing nodes and judge whether candidate samples exist in the candidate queue and, when candidate samples exist, mark each candidate sample as belonging to the current cluster and again shard and issue the candidate samples to the at least two computing nodes; and, after judging that the candidate queue is empty, determine the next core sample and repeat the above process until every sample in the sample database has been marked with the cluster to which it belongs;
the computing node is configured to: determine, according to the currently unmarked samples in the sample database, the set epsilon neighborhood, and the minimum density, whether each sample in its allocated shard is a core sample; store each sample in the epsilon neighborhood of each determined core sample in the candidate queue; and notify the master control node after all of its allocated shard samples have been processed.
8. The cluster implementation system as claimed in claim 7, characterized in that the master control node is further configured to: perform sampling on a raw database to obtain the sample database; for each cluster obtained from the sample database, calculate the mean of the sample values of the samples currently belonging to that cluster, and determine the cluster center point of each cluster; partition into blocks the remaining samples in the raw database other than the samples contained in the sample database, allocate the block samples to at least two computing nodes, and send the allocated block samples and the determined cluster center points to each computing node; and determine, from the sample identifiers and the cluster identifiers of the assigned clusters returned by the computing nodes, the cluster to which each of the remaining samples belongs;
each computing node is further configured to: calculate, for each sample in its allocated block, the distance between the sample's corresponding point in the sample space and each cluster center point; assign each sample to the cluster of the cluster center point at the minimum distance; and return the sample identifier and the cluster identifier of the assigned cluster to the master control node.
9. A cluster implementation system, characterized by comprising: a master control node, at least two computing nodes, and a merge node;
the master control node is configured to: partition the currently unmarked samples in a raw database into blocks, and allocate and issue the block samples to the at least two computing nodes; determine a core sample from the currently unmarked samples in the raw database, mark each sample in the epsilon neighborhood of this core sample as belonging to the current cluster, and store each sample in the epsilon neighborhood of this core sample in a candidate queue; issue the candidate samples in the candidate queue to each computing node; receive the notifications of the computing nodes, mark each candidate sample in the candidate queue as belonging to the current cluster, and issue the candidate samples in the candidate queue to the at least two computing nodes again; and, after receiving the notification of the merge node, determine the next core sample and repeat the above process until every sample in the raw database has been marked with the cluster to which it belongs;
the computing node is configured to: count, according to its allocated block samples and the set epsilon neighborhood, the number of samples in the local epsilon neighborhood of each candidate sample, and send the counts to the merge node; and, on receiving a core sample notification sent by the merge node, store each sample in the local epsilon neighborhood of the corresponding core sample in the candidate queue and, once storing is finished, notify the master control node;
the merge node is configured to: accumulate, for each candidate sample, the corresponding sample counts sent by the computing nodes, and determine from the accumulated sum and the set minimum density whether each candidate sample is a core sample; when core samples are determined to exist, notify each computing node of the determined core samples; and when no core sample is determined to exist, notify the master control node.
10. The cluster implementation system as claimed in claim 9, characterized in that there are at least two merge nodes;
the master control node is further configured to allocate in advance the candidate samples that each merge node is responsible for merging.
CN200910091866.7A 2009-08-31 2009-08-31 Cluster realizing method and system Active CN101996198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910091866.7A CN101996198B (en) 2009-08-31 2009-08-31 Cluster realizing method and system

Publications (2)

Publication Number Publication Date
CN101996198A true CN101996198A (en) 2011-03-30
CN101996198B CN101996198B (en) 2016-06-29

Family

ID=43786365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910091866.7A Active CN101996198B (en) 2009-08-31 2009-08-31 Cluster realizing method and system

Country Status (1)

Country Link
CN (1) CN101996198B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant