CN107038248A

CN107038248A - A density clustering method for massive spatial data based on resilient distributed datasets

Info

Publication number: CN107038248A
Application number: CN201710298705.XA
Authority: CN
Inventors: 沈晔; 周天和; 李思剑; 任培荣
Original assignee: Hangzhou Yang Fan Technology Co Ltd
Current assignee: Hangzhou Yang Fan Technology Co Ltd
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2017-08-11

Abstract

The present invention relates to a density clustering method for massive spatial data based on resilient distributed datasets (RDDs). The method targets fast mining of the aggregation characteristics of large-scale spatial data and is designed around the idea of "RDD partitioning, parallel computation within partitions, merging of local results". First, according to the spatial distribution of the data, grids are divided automatically and the data are allocated to them so that the data volume in each grid is relatively balanced, which balances the load across compute nodes. Then, a definition of local density suited to parallel computation is proposed and the calculation of cluster centers is improved, overcoming the original algorithm's need to identify cluster-center objects by drawing a decision graph. Finally, optimization strategies such as merging clusters within and between grids achieve fast clustering of large-scale spatial data. Compared with traditional density clustering methods, the invention clusters large-scale spatial data quickly and effectively, with higher accuracy and better system processing performance.

Description

A density clustering method for massive spatial data based on resilient distributed datasets

Technical field

The present invention relates to mobile devices, and more particularly to a density clustering method for massive spatial data based on resilient distributed datasets.

Background technology

Clustering plays an important role in spatial data mining. Spatial cluster analysis divides spatial data into clusters according to its aggregation characteristics, so that data in the same cluster are highly similar and data in different clusters are highly dissimilar. According to their guiding principles, clustering algorithms can be divided into partition-based, hierarchical, density-based, grid-based, and model-based clustering. The classical partitioning algorithm k-means and its improved variants k-medoids and k-means++ determine cluster centers and assign data to them through repeated iteration. These algorithms are simple to implement, but they are sensitive to noise and handle non-spherical clusters poorly.

With the surge in data scale, traditional clustering algorithms often cannot run because the data volume is too large, and fast, effective, highly flexible clustering algorithms for massive data are urgently needed. The computer-cluster-oriented technologies GFS, BigTable, and MapReduce offer an approach to clustering massive data. As an open-source implementation of these technologies, the Hadoop parallel computing framework is widely used in the clustering field. Because it pursues high throughput, a parallel clustering algorithm based on the Hadoop MapReduce framework must repeatedly read and write disk to access intermediate results, which causes large I/O overhead and high latency, so it cannot be used for real-time clustering.

Summary of the invention

To overcome the above shortcomings, the invention provides a density clustering method for massive spatial data based on resilient distributed datasets (RDDs). The method targets fast mining of the aggregation characteristics of large-scale spatial data and is designed around the idea of "RDD partitioning, parallel computation within partitions, merging of local results". First, according to the spatial distribution of the data, grids are divided automatically and the data are allocated to them so that the data volume in each grid is relatively balanced, which balances the load across compute nodes. Then, a definition of local density suited to parallel computation is proposed and the calculation of cluster centers is improved, overcoming the original algorithm's need to identify cluster-center objects by drawing a decision graph. Finally, optimization strategies such as merging clusters within and between grids achieve fast clustering of large-scale spatial data.

The invention achieves the above purpose through the following technical solution: a density clustering method for massive spatial data based on resilient distributed datasets, comprising the following steps:

(1) Based on the spatial distribution of the data, introduce a spatial grid index and generate grid-based RDD partitions:

(1.1) Generate spatial grids using a binary index: following a top-down strategy, divide the space layer by layer and build the grid index until the boundary length of the generated sub-grids is not greater than a given preset value;

(1.2) Following the MapReduce idea, count the data objects in each layer of grids: broadcast the index structure and distribute the data objects to be clustered evenly to the compute nodes, then merge the per-grid data-volume information to obtain the complete grid index structure;

(1.3) Traverse the index to find the largest grids whose data volume is less than a given value, generate a key-value RDD keyed by grid number from the search result, and generate the grid-based RDD partitions;

(2) Perform the clustering computation within partitions: using the improved local density definition as the basis, run the cluster_dp algorithm in parallel on each resulting partition to determine the cluster centers, so that data objects have the same local density;

(3) Merge the local results using the optimization strategy of merging clusters within grids and between adjacent grids, completing the clustering.

Preferably, the spatial grid is defined as follows:

The space S is divided into several non-overlapping sub-regions; each region is then a spatial grid, denoted G, where […] is the projection of the grid boundary endpoints onto the k-th dimension axis.

Preferably, adjacent grids are defined as follows:

For any […], if there exists […], then grids g₁ and g₂ are said to be adjacent.

Preferably, step (1.1) is specifically as follows:

Spatial grids are generated using a binary index, in which each node records the basic information of a grid and the number of data objects in it. A top-down strategy divides the space layer by layer and builds the grid index: the space is halved and the information of the generated sub-grids is stored in the grid index; each newly generated grid is then visited, halved again, and stored, until the boundary length of the generated sub-grids is not greater than the given preset value.

Preferably, step (1.3) is specifically as follows: after the complete index structure is obtained, the index is traversed to find the largest grids whose data volume is less than the given value; once such a grid is found, traversal does not continue below it. The space-division result is thus obtained, data objects are mapped to grids according to it, and a key-value RDD keyed by grid number is generated. The MapPartitionWithIndex function interface of the key-value RDD is then used to generate the grid-based RDD partitions automatically.

Preferably, the improved local density is defined as follows: let ρ'_i be the improved local density of data object p_i; then:

where the k-dimensional region of radius dc centered on a given data object p_i is called the dc-neighborhood of p_i, and every data object p_j in the dc-neighborhood satisfies dist(p_i, p_j) < dc.

Preferably, when the cluster_dp algorithm is run, an auxiliary function γ is designed to determine the cluster centers automatically, as follows:

Given that the local density of a data object is ρ'_i and its minimum high-density distance is δ_i, set:

$$\gamma_i = \frac{\rho'_i \, \delta_i}{\max(\rho) \cdot \max(\delta)}$$

where max(ρ) * max(δ) is the product of the maximum local density and the maximum minimum high-density distance in the grid; ρ_i is the local density: the number of data objects in the dc-neighborhood of p_i is called the local density of p_i, denoted ρ_i, with the formula:

$$\rho_i = \sum_{j} \chi\big(\mathrm{dist}(p_i, p_j) - dc\big)$$

where

$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$

Preferably, the minimum high-density distance δ_i is defined as follows: if p_j is the object nearest to p_i among all data objects whose local density is higher than ρ_i, then NN(p_i) = p_j is called the nearest high-density neighbor of p_i, and δ_i = dist(p_i, p_j) is called the minimum high-density distance of p_i, with the formula:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} \mathrm{dist}(p_i, p_j)$$

Preferably, step (3) is specifically as follows: by computing the average local density between two clusters, cluster members are labeled as core members or halo members. Core members form the core of a cluster; they consist of high-density points and are a stable aggregation of data objects. The halo corresponds to the periphery of a cluster; it contains low-density data points and is the unstable part of the cluster. Using the concepts of core and halo, the following strategy for merging clusters between grids is proposed:

If the data objects near the grid border in adjacent grids are distributed in either of the following two ways, the cluster to which the data objects belong must be adjusted:

(a) Core objects of clusters exist near the border in adjacent grids and the core objects are close to each other; the two clusters are then merged;

(b) Halo objects exist at the border of two adjacent grids; the cluster to which each halo point belongs must then be re-evaluated.

Preferably, the method of re-evaluating the cluster to which a halo point belongs is as follows: in the grids adjacent to the grid containing the halo object, search for data objects whose density is higher than that of the halo object, and compute the distance from each qualifying data object to the halo point. If the minimum computed distance is less than the current minimum high-density distance of the halo object, update the halo point's nearest high-density neighbor and minimum high-density distance, and assign the halo object to a new cluster according to the updated nearest high-density neighbor.

The beneficial effect of the invention is that the method achieves fast clustering of large-scale spatial data and overcomes the latency problem in clustering.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the invention.

Embodiment

The invention is further described below with reference to a specific embodiment, but the scope of protection of the invention is not limited thereto:

Embodiment: In this embodiment, let D = {p₁, p₂, ..., p_n} be the dataset to be clustered; the k-dimensional region S in which it lies is the computing space of D. For a data object p_i = (p_i^1, p_i^2, ..., p_i^k) (1 ≤ i ≤ n) in the k-dimensional space, p_i^k is the projection of p_i onto the k-th dimension axis. The main steps by which the density clustering method for massive spatial data based on resilient distributed datasets produces clusters are shown in Fig. 1. The implementation of the invention rests on the following basic concepts:

Definition 1 (dc-neighborhood): the k-dimensional region of radius dc centered on a given data object p_i is called the dc-neighborhood of p_i. Every data object p_j in the dc-neighborhood satisfies dist(p_i, p_j) < dc.

Definition 2 (local density ρ_i): the number of data objects in the dc-neighborhood of p_i is called the local density of p_i, denoted ρ_i:

$$\rho_i = \sum_{j} \chi\big(\mathrm{dist}(p_i, p_j) - dc\big) \qquad (1)$$

where

$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$

Definition 3 (minimum high-density distance δ_i): if p_j is the object nearest to p_i among all data objects whose local density is higher than ρ_i, then NN(p_i) = p_j is called the nearest high-density neighbor of p_i, and δ_i = dist(p_i, p_j) is called the minimum high-density distance of p_i:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} \mathrm{dist}(p_i, p_j)$$
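As an illustration of Definitions 2 and 3, the following minimal Python sketch computes ρ_i and δ_i by brute force on a small point list. The function names are hypothetical, and two choices are assumptions not stated in the text: a point is not counted as its own neighbor, and the highest-density point, which has no higher-density neighbor, is given its maximum distance to any other point.

```python
import math

def local_density(points, dc):
    """rho[i] = number of other points strictly closer than dc to points[i]."""
    rho = []
    for i, p in enumerate(points):
        # Assumption: a point is not counted in its own dc-neighborhood.
        rho.append(sum(1 for j, q in enumerate(points)
                       if j != i and math.dist(p, q) < dc))
    return rho

def min_high_density_distance(points, rho):
    """Return (delta, nn): distance to, and index of, the nearest denser point."""
    n = len(points)
    delta, nn = [0.0] * n, [None] * n
    for i in range(n):
        best, best_j = math.inf, None
        for j in range(n):
            if rho[j] > rho[i]:
                d = math.dist(points[i], points[j])
                if d < best:
                    best, best_j = d, j
        if best_j is None:
            # Assumption: no denser point exists, so fall back to the
            # largest distance from this point to any other point.
            best = max((math.dist(points[i], q) for q in points), default=0.0)
        delta[i], nn[i] = best, best_j
    return delta, nn
```

For example, with the points (0,0), (1,0), (2,0), (10,0) and dc = 1.5, the middle point (1,0) has the highest density and the isolated point (10,0) the lowest. The quadratic scan is only meant to make the definitions concrete; it is not tuned for large datasets.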

Definition 4 (spatial grid): S is divided into several non-overlapping sub-regions; each region is then a spatial grid, denoted G, where […] is the projection of the grid boundary endpoints onto the k-th dimension axis.

Definition 5 (adjacent grids): for any […], if there exists […], then grids g₁ and g₂ are said to be adjacent.

Overall framework of PClusterdp: in this embodiment, PClusterdp stores data in RDDs, is designed around the idea of "RDD partitioning, parallel computation within partitions, merging of local results", and achieves RDD partitioning and parallelism by splitting S. The overall framework of the algorithm is as follows:

Input:

D: a set of points to be clustered

S: computing space of D

dc: a user-supplied radius distance

maxPointInGrid: a parameter that determines the maximum number of points in a grid

Output:

C: a set of clusters

Method:

/* Partitioning phase */

datasetRDD ← D

Execute a space partition algorithm to split S into grids

Get a grid set G using the split result

For each point in datasetRDD do

    Map the point to its corresponding grid g belonging to G

End for

Generate partitionedPointsRDD based on the assigned points

/* Parallel computing phase */

Map partition:

For each partition in partitionedPointsRDD do

    Execute a modified cluster_dp algorithm to generate the local cluster set C'

End map

/* Merging phase */

Execute the merge-local-clusters algorithm on C' to build the final clustered data C

After the data objects are mapped to the divided spatial grids G, a key-value RDD keyed by grid number is generated. Using the MapPartition interface, the RDD is partitioned by grid number, and data objects in the same partition are allocated to the same compute node. Each node independently runs the density clustering algorithm to obtain local clusters based on the grid division. The local clusters in adjacent grids are then merged to produce the final clustering result.
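The mapping of points to grid numbers can be illustrated without a cluster. The sketch below is plain Python standing in for the key-value RDD and its partitioning; it is not Spark code. The names `grid_of` and `partition_by_grid` are hypothetical, and treating each grid as a half-open box [lo, hi) is an assumption made so that a point on a shared border falls into exactly one grid.

```python
from collections import defaultdict

def grid_of(point, grids):
    """Return the number of the grid whose half-open box [lo, hi) contains point."""
    for gid, (lo, hi) in grids.items():
        if all(l <= x < h for x, l, h in zip(point, lo, hi)):
            return gid
    raise ValueError(f"point {point} lies outside every grid")

def partition_by_grid(points, grids):
    """Key each point by grid number, then group points that share a key."""
    keyed = [(grid_of(p, grids), p) for p in points]   # (key, value) pairs
    partitions = defaultdict(list)
    for gid, p in keyed:
        partitions[gid].append(p)
    return dict(partitions)
```

Each resulting group plays the role of one partition on which a local clustering pass would run independently.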

In this embodiment, the PClusterdp algorithm introduces a spatial grid index to keep the data volume in each grid relatively balanced, and generates the spatial grids using a binary index. Each node of the index records the basic information of a grid (grid) and the number of data objects in it (count); the root node stores S and D. The algorithm follows a top-down strategy, dividing the space layer by layer and building the grid index. S is halved and the information of the generated sub-grids is stored in the grid index. Each newly generated grid is then visited, halved again, and stored, until the boundary length of the generated sub-grids is less than or equal to the given preset value. For a fixed S, the grid index can be saved and reused, which saves index-construction time and improves the efficiency of the algorithm. Once the basic information of each layer of grids is obtained, the number of data objects in each layer of grids is counted. The counting algorithm follows the MapReduce idea to increase speed: the index structure is broadcast and the data objects to be clustered are distributed evenly to the compute nodes; each node counts the data in its grids, and the per-grid data-volume information is then merged. The specific space partitioning algorithm is as follows:
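A minimal single-machine Python sketch of the top-down bisection described above is given here for illustration only. `GridNode`, `build_index`, and `count_points` are hypothetical names. Splitting along the longest side, stopping when the longest side is no greater than `max_side`, and using half-open boxes are all assumptions; the text only says that the space is halved until the sub-grid boundary length reaches the preset value.

```python
from dataclasses import dataclass, field

@dataclass
class GridNode:
    lo: tuple                      # lower corner of the grid
    hi: tuple                      # upper corner of the grid
    count: int = 0                 # number of data objects in the grid
    children: list = field(default_factory=list)

def build_index(lo, hi, max_side):
    """Halve the box along its longest side until that side is <= max_side."""
    node = GridNode(tuple(lo), tuple(hi))
    sides = [h - l for l, h in zip(lo, hi)]
    axis = max(range(len(sides)), key=sides.__getitem__)
    if sides[axis] > max_side:
        mid = (lo[axis] + hi[axis]) / 2
        left_hi = list(hi)
        left_hi[axis] = mid
        right_lo = list(lo)
        right_lo[axis] = mid
        node.children = [build_index(lo, left_hi, max_side),
                         build_index(right_lo, hi, max_side)]
    return node

def contains(node, p):
    """Half-open membership test, so border points belong to one child only."""
    return all(l <= x < h for x, l, h in zip(p, node.lo, node.hi))

def count_points(node, points):
    """Fill in the data-object count for every layer of the index."""
    node.count = len(points)
    for child in node.children:
        count_points(child, [p for p in points if contains(child, p)])
```

In a distributed setting the counting step would be split across nodes and the partial counts merged; here it is done in one recursive pass to keep the sketch short.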

After the complete index structure is obtained, the index is traversed to find the largest grids whose data volume is less than the given value; once such a grid is found, traversal does not continue below it. The space-division result is thus obtained, data objects are mapped to grids according to it, and a key-value RDD keyed by grid number is generated. The MapPartitionWithIndex function interface of the key-value RDD is then used to generate the grid-based RDD partitions automatically.
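The traversal can be sketched as a short recursion. In the Python illustration below the index is represented as nested dictionaries with the hypothetical keys `box`, `count`, and `children`; accepting a leaf grid even when it still holds too many points is an assumption, since the text does not say what happens in that case.

```python
def select_grids(node, max_points):
    """Return the largest grids whose data volume is below max_points."""
    # Stop descending as soon as a grid is small enough. A leaf is accepted
    # even if it is still too full (assumption: it cannot be split further).
    if node["count"] < max_points or not node["children"]:
        return [node["box"]]
    selected = []
    for child in node["children"]:
        selected.extend(select_grids(child, max_points))
    return selected
```

Enumerating the returned list, for instance with `enumerate(select_grids(index, max_points))`, would give each selected grid a number usable as the partition key.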

In this embodiment, to balance computation speed against result accuracy and to enable parallel computation, the following improved local density calculation is defined.

Definition 6 (improved local density ρ'_i): let ρ'_i be the improved local density of data object p_i; then:

Formula (3) builds on formula (1) by taking into account how tightly the data objects in the neighborhood are packed. For data objects with the same number of neighbors in their dc-neighborhoods, the differences in local density are widened by computing the distances between each data object and its neighbors. In other words, among data objects with the same number of neighbors within the dc-neighborhood, the one that lies closer to its neighbors has the larger local density. Specifically, the local density definition proposed by the invention restricts the computation of local density to the neighborhood of the data object: only objects in the data object's own grid and its adjacent grids are considered when computing local density, which avoids traversing the whole dataset and reduces the workload of the compute nodes.

When determining cluster centers, the original cluster_dp must draw a decision graph and rely on human-computer interaction to make the judgment. To free the algorithm from its dependence on the decision graph and on human intervention, an auxiliary function γ is designed to determine the cluster centers automatically. Given that the local density of a data object is ρ'_i and its minimum high-density distance is δ_i, set:

$$\gamma_i = \frac{\rho'_i \, \delta_i}{\max(\rho) \cdot \max(\delta)}$$

where max(ρ) * max(δ) is the product of the maximum local density and the maximum minimum high-density distance in the grid. Because local density and minimum high-density distance have different scales, a simple normalization is performed using their maximum values in the grid. After γ_i is confined to [0, 1] and sorted in descending order, the γ values of non-center objects approach 0, whereas the γ values of cluster-center objects are scattered and far from the origin. The candidate cluster centers can therefore be determined with a preset threshold. The choice of threshold depends on the application environment; selecting data objects with γ > 0.2 as candidate cluster centers to generate local clusters yields satisfactory clustering results. Once the candidate cluster centers are obtained, the number of clusters is determined and the data in the grid are assigned to the corresponding local clusters.
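The threshold-based selection can be sketched in a few lines of Python. The sketch assumes that γ_i is the product of a point's local density and its minimum high-density distance divided by max(ρ) * max(δ), which is how the normalization described above is read here; the function names are hypothetical and the default threshold of 0.2 is the value suggested in the text.

```python
def gamma_scores(rho, delta):
    """Normalize rho[i] * delta[i] by the per-grid maxima (assumed form of gamma)."""
    denom = max(rho) * max(delta)
    if denom == 0:
        return [0.0] * len(rho)
    return [r * d / denom for r, d in zip(rho, delta)]

def center_candidates(rho, delta, threshold=0.2):
    """Indices with gamma above the threshold, in descending order of gamma."""
    gamma = gamma_scores(rho, delta)
    ranked = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    return [i for i in ranked if gamma[i] > threshold]
```

With densities [4, 10, 2, 8] and distances [1.0, 5.0, 0.5, 4.0], only the second and fourth objects exceed the threshold, so they would be taken as candidate centers without drawing a decision graph.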

Inter-partition cluster merging strategy: each RDD partition corresponds to one spatial grid. For data objects near a grid border, their aggregation characteristics must be re-assessed across adjacent grids to avoid classification errors caused by dividing the RDD. By computing the average local density between two clusters, cluster members are labeled as core members (cluster core) or halo members (cluster halo). Core members form the core of a cluster; they consist of high-density points and are a stable aggregation of data objects. The halo corresponds to the periphery of a cluster; it contains low-density data points and is the unstable part of the cluster. Using the concepts of core and halo, a method for merging clusters between grids is proposed: if the data objects near the grid border in adjacent grids are distributed in either of the following ways, the cluster to which the data objects belong must be adjusted.

Case 1: core objects of clusters exist near the border in adjacent grids, and the core objects are close to each other. Because of the grid division, data objects that should belong to the same cluster have been assigned to different clusters; in this case the two clusters are merged.
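One way to picture Case 1 is with a small union-find over local clusters. The Python sketch below is illustrative only: the names are hypothetical, the input is assumed to contain just the core points that lie near a grid border, and "close to each other" is interpreted as a distance below dc, which the text does not specify.

```python
import math

def find(parent, x):
    """Follow parent links to the representative, halving paths on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_border_clusters(border_cores, dc):
    """Map each (grid_id, local_cluster_id) key to a merged representative.

    border_cores maps (grid_id, local_cluster_id) to the core points of that
    local cluster lying near a grid border.
    """
    keys = list(border_cores)
    parent = {k: k for k in keys}
    for idx, a in enumerate(keys):
        for b in keys[idx + 1:]:
            if a[0] == b[0]:
                continue  # only clusters from different grids are compared
            # Assumption: two cores are "close" when their distance is < dc.
            if any(math.dist(p, q) < dc
                   for p in border_cores[a] for q in border_cores[b]):
                parent[find(parent, a)] = find(parent, b)
    return {k: find(parent, k) for k in keys}
```

Local clusters that end up with the same representative would be reported as one final cluster.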

Case 2: halo objects exist at the border of two adjacent grids; the cluster to which each halo point belongs must then be re-evaluated. The specific adjustment algorithm is as follows.

In the grids adjacent to the grid containing the halo object, search for data objects whose density is higher than that of the halo object, and compute the distance from each qualifying data object to the halo point. If the minimum computed distance is less than the current minimum high-density distance of the halo object, update the halo point's nearest high-density neighbor and minimum high-density distance, and assign the halo object to a new cluster according to the updated nearest high-density neighbor.
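The adjustment for a single halo object can be sketched as follows. This Python illustration uses plain dictionaries with the hypothetical keys `point`, `rho`, `delta`, `nn`, and `cluster`, and assumes the caller has already collected the data objects of the adjacent grids.

```python
import math

def reassign_halo(halo, neighbours):
    """Re-evaluate the cluster of one halo object against adjacent grids.

    halo: dict with keys point, rho, delta, nn, cluster.
    neighbours: data objects from adjacent grids (keys point, rho, cluster).
    Returns an updated copy of halo; the input dict is left unchanged.
    """
    best, best_obj = halo["delta"], None
    for obj in neighbours:
        if obj["rho"] > halo["rho"]:                 # denser objects only
            d = math.dist(halo["point"], obj["point"])
            if d < best:                             # closer than current NN
                best, best_obj = d, obj
    if best_obj is None:
        return dict(halo)
    return dict(halo, delta=best, nn=best_obj["point"],
                cluster=best_obj["cluster"])
```

If no denser object in the adjacent grids is closer than the current nearest high-density neighbor, the halo object keeps its original cluster.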

The above describes a specific embodiment of the invention and the technical principles employed. Any change made according to the concept of the invention whose resulting function does not go beyond the spirit covered by the specification and drawings shall fall within the scope of protection of the invention.

Claims

1. A density clustering method for massive spatial data based on resilient distributed datasets, characterized by comprising the following steps:

(1.1) Generate spatial grids using a binary index: following a top-down strategy, divide the space layer by layer and build the grid index until the boundary length of the generated sub-grids is not greater than a given preset value;

(1.2) Following the MapReduce idea, count the data objects in each layer of grids: broadcast the index structure and distribute the data objects to be clustered evenly to the compute nodes, then merge the per-grid data-volume information to obtain the complete grid index structure;

(1.3) Traverse the index to find the largest grids whose data volume is less than a given value, generate a key-value RDD keyed by grid number from the search result, and generate the grid-based RDD partitions;

(2) Perform the clustering computation within partitions: using the improved local density definition as the basis, run the cluster_dp algorithm in parallel on each resulting partition to determine the cluster centers, so that data objects have the same local density;

(3) Merge the local results using the optimization strategy of merging clusters within grids and between adjacent grids, completing the clustering.

2. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that the spatial grid is defined as follows:

The space S is divided into several non-overlapping sub-regions; each region is then a spatial grid, denoted G, where […] is the projection of the grid boundary endpoints onto the k-th dimension axis.

3. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that adjacent grids are defined as follows:

For any […], if there exists […], then grids g₁ and g₂ are said to be adjacent.

4. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that step (1.1) is specifically as follows:

Spatial grids are generated using a binary index, in which each node records the basic information of a grid and the number of data objects in it. A top-down strategy divides the space layer by layer and builds the grid index: the space is halved and the information of the generated sub-grids is stored in the grid index; each newly generated grid is then visited, halved again, and stored, until the boundary length of the generated sub-grids is not greater than the given preset value.

5. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that step (1.3) is specifically as follows: after the complete index structure is obtained, the index is traversed to find the largest grids whose data volume is less than the given value; once such a grid is found, traversal does not continue below it. The space-division result is thus obtained, data objects are mapped to grids according to it, and a key-value RDD keyed by grid number is generated. The MapPartitionWithIndex function interface of the key-value RDD is then used to generate the grid-based RDD partitions automatically.

6. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that the improved local density is defined as follows: let ρ'_i be the improved local density of data object p_i; then:

where the k-dimensional region of radius dc centered on a given data object p_i is called the dc-neighborhood of p_i, and every data object p_j in the dc-neighborhood satisfies dist(p_i, p_j) < dc.

7. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that, when the cluster_dp algorithm is run, an auxiliary function γ is designed to determine the cluster centers automatically, as follows:

$$\gamma_i = \frac{\rho'_i \, \delta_i}{\max(\rho) \cdot \max(\delta)}$$

where max(ρ) * max(δ) is the product of the maximum local density and the maximum minimum high-density distance in the grid; ρ_i is the local density: the number of data objects in the dc-neighborhood of p_i is called the local density of p_i, denoted ρ_i, with the formula:

$$\rho_i = \sum_{j} \chi\big(\mathrm{dist}(p_i, p_j) - dc\big)$$

where

$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$

8. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 7, characterized in that the minimum high-density distance δ_i is defined as follows: if p_j is the object nearest to p_i among all data objects whose local density is higher than ρ_i, then NN(p_i) = p_j is called the nearest high-density neighbor of p_i, and δ_i = dist(p_i, p_j) is called the minimum high-density distance of p_i, with the formula:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} \mathrm{dist}(p_i, p_j)$$

9. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 1, characterized in that step (3) is specifically as follows: by computing the average local density between two clusters, cluster members are labeled as core members or halo members. Core members form the core of a cluster; they consist of high-density points and are a stable aggregation of data objects. The halo corresponds to the periphery of a cluster; it contains low-density data points and is the unstable part of the cluster. Using the concepts of core and halo, a strategy for merging clusters between grids is proposed: if the data objects near the grid border in adjacent grids are distributed in either of the following two ways, the cluster to which the data objects belong must be adjusted:

(a) Core objects of clusters exist near the border in adjacent grids and the core objects are close to each other; the two clusters are then merged;

10. The density clustering method for massive spatial data based on resilient distributed datasets according to claim 9, characterized in that the method of re-evaluating the cluster to which a halo point belongs is as follows: in the grids adjacent to the grid containing the halo object, search for data objects whose density is higher than that of the halo object, and compute the distance from each qualifying data object to the halo point. If the minimum computed distance is less than the current minimum high-density distance of the halo object, update the halo point's nearest high-density neighbor and minimum high-density distance, and assign the halo object to a new cluster according to the updated nearest high-density neighbor.