CN106709503A - Large spatial data clustering algorithm K-DBSCAN based on density - Google Patents

Large spatial data clustering algorithm K-DBSCAN based on density

Info

Publication number
CN106709503A
CN106709503A (application CN201611047429.1A)
Authority
CN
China
Prior art keywords
data
subset
point
data subset
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611047429.1A
Other languages
Chinese (zh)
Other versions
CN106709503B (en)
Inventor
邓超
陈智斌
郭晓惠
农英雄
韦屹
黄聪
汪倍贝
钱方远
李喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Guangxi Industrial Co Ltd
Original Assignee
China Tobacco Guangxi Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Guangxi Industrial Co Ltd filed Critical China Tobacco Guangxi Industrial Co Ltd
Priority to CN201611047429.1A priority Critical patent/CN106709503B/en
Publication of CN106709503A publication Critical patent/CN106709503A/en
Application granted granted Critical
Publication of CN106709503B publication Critical patent/CN106709503B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Abstract

The invention relates to a density-based clustering algorithm for large spatial data, K-DBSCAN. The algorithm presets the density-based clustering parameters: radius R, minimum neighbor count Min_N, pre-division count K, and division iteration count T; divides the data set into K1 data subsets according to spatial distribution; computes the reachable subsets of each data subset to form a reachable-subset index; and, based on that index, performs density-based spatial clustering on the data of each subset. With the technical scheme provided by the invention, density-based unsupervised and semi-supervised clustering can be carried out on large spatial data sets, and efficient, fast parallel clustering computation is achieved.

Description

A density-based large spatial data clustering algorithm K-DBSCAN
Technical field
The present invention relates to the field of data mining and big data analysis, and in particular to a density-based large spatial data clustering algorithm, K-DBSCAN.
Background technology
Spatial data clustering is widely used in many areas of information technology, such as data mining, pattern recognition, machine learning, artificial intelligence, visual analysis, and geographic information systems. Especially in the big data era, it can be used to explore meaningful but still unknown patterns and phenomena, and it can be applied in many fields, such as social network analysis, economic network analysis, traffic network analysis, meteorological analysis, and smart city development. Traditional distance-based spatial data clustering methods fall into three main categories: 1) partition-based clustering; 2) density-based clustering; 3) hierarchical clustering.
Density-based clustering can effectively handle noise points and identify clusters of arbitrary shape. The main algorithms include DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), and DENCLUE (DENsity-based CLUstEring). Among them, DBSCAN is the best-known density-based spatial clustering algorithm. Its computational complexity is O(N²), i.e., when the data volume grows 100-fold, its running time grows roughly 10000-fold. Although many parallelization-based methods greatly reduce the required computation time, they are still limited by the number of CPUs or GPUs in the computing platform. For example, running DBSCAN on 100 times the data volume within the same computation time would require roughly 10000 times as many CPUs or GPUs. DBSCAN therefore cannot be widely applied when massive data are involved.
Summary of the invention
The present invention aims to solve the technical problem that, in the prior art, DBSCAN is not applicable when massive data need to be clustered.
To solve the above technical problem, the present invention provides the following technical scheme:
A density-based large spatial data clustering algorithm K-DBSCAN, comprising:
dividing a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets;
performing density-based spatial clustering on the data of each data subset according to the reachable-subset index.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, dividing the data set into K1 data subsets, where K1 is a natural number greater than 1, specifically includes:
obtaining the real spatial value range Dlen of the data set;
pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen of the data set, where K is a natural number greater than 1;
a pre-classification step: obtaining the center position of each pre-class and assigning each data point in the data set to the pre-class whose center is closest to it; if the number of data points in a pre-class is less than the preset minimum neighbor count Min_N, deleting that pre-class;
repeating the pre-classification step T times to obtain K1 data subsets, where T is the preset number of division iterations.
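As a rough, illustrative sketch of this pre-division (not taken from the patent; the function and variable names, the initial assignment rule, and the use of Python are all assumptions), the following code splits a list of (longitude, latitude) points into pre-classes and refines them for T iterations:

```python
import math
from collections import defaultdict

def pre_divide(points, K, T, min_n):
    """Simplified k-means-style pre-division (illustrative sketch).

    points: list of (lon, lat) tuples; K: number of pre-classes;
    T: number of refinement iterations; min_n: minimum neighbor count Min_N.
    Returns a list of data subsets (lists of points).
    """
    # Initial assignment: spread the points over K pre-classes by spatial order.
    # (The patent derives the pre-class index of each point from its longitude
    # and latitude; the exact formula is not reproduced here.)
    ranked = sorted(points, key=lambda p: p[0] + p[1])
    classes = defaultdict(list)
    for i, p in enumerate(ranked):
        classes[i * K // len(ranked)].append(p)

    for _ in range(T):
        # Drop pre-classes with fewer than Min_N points.
        classes = {c: pts for c, pts in classes.items() if len(pts) >= min_n}
        # Recompute the center of each surviving pre-class.
        centers = {c: (sum(p[0] for p in pts) / len(pts),
                       sum(p[1] for p in pts) / len(pts))
                   for c, pts in classes.items()}
        # Reassign every point to the pre-class of the nearest center.
        new_classes = defaultdict(list)
        for p in points:
            nearest = min(centers, key=lambda c: math.dist(p, centers[c]))
            new_classes[nearest].append(p)
        classes = new_classes

    return list(classes.values())
```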
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, obtaining the real spatial value range Dlen of the data set specifically includes:
obtaining the maximum spatial value Dmax and minimum spatial value Dmin of the data set, where Dmax = LNmax + LAmax and Dmin = LNmin + LAmin, LNmax and LNmin are the maximum and minimum longitude of all data points in the data set, and LAmax and LAmin are the maximum and minimum latitude of all data points in the data set;
obtaining the real spatial value range of the data set Dlen = Dmax − Dmin.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen, where K is a natural number greater than 1, specifically includes:
for any data point a in the data set, obtaining the pre-class to which it belongs according to the following formula, where LNa is the longitude of data point a and LAa is the latitude of data point a.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets specifically includes:
determining, from the longitude and latitude values of all data points in the data subset, the maximum longitude LOmax, minimum longitude LOmin, maximum latitude LQmax and minimum latitude LQmin of the data subset;
obtaining the reachable spatial coverage of each data subset: LOright = LOmax + d, LOleft = LOmin − d, LQup = LQmax + d, LQdown = LQmin − d, where d is the preset reach distance and LQup, LOright, LQdown, LOleft are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of the data subset;
computing the reachable-subset list of each data subset, the computation being: for any data subsets Pa and Pb, if the reachable spatial coverage of Pa and the reachable spatial coverage of Pb intersect, Pa and Pb are determined to be mutually reachable; each data subset is also a reachable subset of itself. This is recorded as RPLa = {a, b, ...} and RPLb = {a, b, ...}; all reachable subsets of a data subset constitute the reachable-subset list of that data subset;
after the reachable-subset list of each data subset is obtained, arranging the reachable-subset lists of all K1 data subsets in data-subset order to obtain the reachable-subset index.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, when the reachable spatial coverage of data subset Pa and the reachable spatial coverage of data subset Pb satisfy any one of the following relations, it can be determined that the two coverages intersect:
Relation one: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a < LQup_b < LQup_a);
Relation two: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a < LQdown_b < LQup_a);
Relation three: (LOleft_a < LOright_b < LOright_a) and (LQdown_a < LQup_b < LQup_a);
Relation four: (LOleft_a < LOright_b < LOright_a) and (LQdown_a < LQdown_b < LQup_a);
Relation five: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a > LQdown_b and LQup_a < LQup_b);
Relation six: (LOleft_a < LOright_b < LOright_a) and (LQdown_a > LQdown_b and LQup_a < LQup_b);
Relation seven: (LOleft_a > LOleft_b and LOright_a < LOright_b) and (LQdown_a < LQup_b < LQup_a);
Relation eight: (LOleft_a > LOleft_b and LOright_a < LOright_b) and (LQdown_a < LQdown_b < LQup_a);
Relation nine: (LOleft_a > LOleft_b) and (LOright_a < LOright_b) and (LQup_a < LQup_b) and (LQdown_a > LQdown_b);
where LQup_a, LOright_a, LQdown_a, LOleft_a are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of data subset Pa, and LQup_b, LOright_b, LQdown_b, LOleft_b are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of data subset Pb.
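Taken together, the nine relations enumerate the ways two axis-aligned rectangles can overlap. A minimal sketch of an equivalent overlap test, assuming a simple Python representation of a coverage (all names are hypothetical, and boundary-equality conventions are glossed over):

```python
from dataclasses import dataclass

@dataclass
class Coverage:
    """Reachable spatial coverage of a data subset (expanded bounding box)."""
    lo_left: float   # LOleft  = LOmin - d
    lo_right: float  # LOright = LOmax + d
    lq_down: float   # LQdown  = LQmin - d
    lq_up: float     # LQup    = LQmax + d

def coverages_intersect(a: Coverage, b: Coverage) -> bool:
    """True when the two coverages overlap, i.e. when one of the nine
    relations in the text holds (standard rectangle-intersection test)."""
    return (a.lo_left <= b.lo_right and b.lo_left <= a.lo_right and
            a.lq_down <= b.lq_up and b.lq_down <= a.lq_up)
```

The nine case-by-case comparisons thus collapse into two interval checks, one per axis.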
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, performing density-based spatial clustering on the data of each data subset according to the reachable-subset index specifically includes:
computing the core points of each data subset according to the reachable-subset index;
clustering the core points in each data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the core points of each data subset according to the reachable-subset index specifically includes:
Step 11: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 12: select data subset b from the reachable-subset list of data subset a in order;
Step 13: select data point Pj from data subset b in order;
Step 14: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 15: following steps 13 and 14, traverse all data points in data subset b and accumulate the neighbor count Ni of data point Pi according to step 14;
Step 16: following step 12, traverse all reachable data subsets in the reachable-subset list of data subset a and compute the neighbor count Ni of data point Pi according to step 15;
Step 17: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 11, select the next data point Pi+1 for core-point computation;
Step 18: following steps 11 to 17, traverse every data point in data subset a in order, completing the core-point computation of all data points in data subset a;
Step 19: following steps 11 to 18, traverse every data subset among the K1 data subsets in order, completing the core-point computation of all data points in all data subsets.
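A condensed sketch of steps 11 to 19, assuming the subsets are lists of (longitude, latitude) tuples and rpi[a] is the reachable-subset list of subset a (all names are assumptions; the early termination in step 17 and the MF labelling of the last neighbor point are omitted for brevity):

```python
import math

def mark_core_points(subsets, rpi, radius, min_n):
    """Core-point marking guided by the reachable-subset index (sketch).

    subsets: list of lists of (lon, lat) points; rpi[a]: indices of the
    subsets reachable from subset a (including a itself); radius: preset
    radius R; min_n: minimum neighbor count Min_N.
    Returns labels[a][i]: the classification label CF of point i of subset a,
    or None for points that are not core points.
    """
    labels = [[None] * len(s) for s in subsets]
    cf = 1
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            neighbors = 0
            # Count neighbors only inside the subsets reachable from a.
            for b in rpi[a]:
                for p_j in subsets[b]:
                    if math.dist(p_i, p_j) < radius:
                        neighbors += 1
            if neighbors > min_n:        # Pi is a core point
                labels[a][i] = cf
                cf += 1
    return labels
```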
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the core points of each data subset specifically includes:
Step 21: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 22: select data point Pj from data subset a in order;
Step 23: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 24: following step 22, traverse all data points in data subset a and compute the neighbor count Ni of data point Pi according to step 23;
Step 25: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 21, select the next data point Pi+1 for core-point computation;
Step 26: following steps 21 to 25, traverse every data point in data subset a, completing the core-point computation of all data points in data subset a;
Step 27: following steps 21 to 26, traverse every data subset among the K1 data subsets, completing the core-point computation of all data points in all data subsets.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, clustering the core points in each data subset according to the reachable-subset index specifically includes:
Step 31: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 32: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order;
Step 33: select data subset b from the reachable-subset list of data subset a in order;
Step 34: select data point Pj from data subset b in order;
Step 35: if the classification labels CFi and CFj of data points Pi and Pj are equal, go to step 37; otherwise compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 36, otherwise go to step 37;
Step 36: compare the classification labels CFi and CFj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 37: following steps 34 to 36, traverse every data point in data subset b;
Step 38: following steps 33 to 37, traverse every data point of every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of core point Pi of data subset a;
Step 39: following steps 32 to 38, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 310: following steps 31 to 39, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
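A single-pass sketch in the spirit of steps 31 to 310 (names assumed; the MF aggregation of step 31 is omitted). A full implementation would typically repeat the pass, or use a union-find structure, until the labels stop changing:

```python
import math

def merge_core_labels(subsets, rpi, labels, radius):
    """Merge classification labels of core points that lie within R of each
    other, scanning only the reachable subsets (illustrative sketch)."""
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            if labels[a][i] is None:            # only labelled (core) points take part
                continue
            for b in rpi[a]:
                for j, p_j in enumerate(subsets[b]):
                    if labels[b][j] is None or labels[a][i] == labels[b][j]:
                        continue
                    if math.dist(p_i, p_j) < radius:
                        # Keep the smaller label on both points.
                        low = min(labels[a][i], labels[b][j])
                        labels[a][i] = labels[b][j] = low
    return labels
```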
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, clustering the core points in each data subset specifically includes:
Step 41: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 42: select data subset a from the K1 data subsets in order, and select data points Pi and Pj from data subset a in order;
Step 43: judge whether the classification labels CFi and CFj of data points Pi and Pj are equal; if equal, go to step 45; if not, compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 44, otherwise go to step 45;
Step 44: compare the classification labels CFi and CFj of data points Pi and Pj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 45: following steps 41 to 44, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 46: following steps 42 to 45, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
Optionally, the above density-based large spatial data clustering algorithm K-DBSCAN further comprises the following step:
computing the neighbor points in each data subset and clustering the neighbor points in each data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the neighbor points in each data subset and clustering the neighbor points in each data subset specifically includes:
Step 51: select data subset a from the K1 data subsets in order, take the core points in data subset a as one subset, denoted CGa, and the non-core points as one subset NCGa;
Step 52: following step 51, traverse the K1 data subsets and partition every data subset into a core-point subset CG and a non-core-point subset NCG;
Step 53: select data subset a from the K1 data subsets in order, and select non-core point NCPi from the non-core-point subset NCGa of data subset a in order;
Step 54: select data subset b from the reachable-subset list of data subset a in order;
Step 55: select core point CPj from the core-point subset CGb of data subset b in order;
Step 56: compute the distance Di,j between non-core point NCPi and core point CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFi = CFj;
Step 57: following steps 55 to 56, traverse every core point in data subset b;
Step 58: following steps 54 to 56, traverse every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of non-core point NCPi;
Step 59: following steps 53 to 58, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 510: following steps 53 to 59, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
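An illustrative sketch of steps 51 to 510, assuming core points are identified by a non-None entry in core_labels (all names are assumptions):

```python
import math
from copy import deepcopy

def assign_non_core_points(subsets, rpi, core_labels, radius):
    """Attach each non-core point to the cluster of some core point within R,
    scanning only the reachable subsets (illustrative sketch).

    core_labels[a][i] is the label of a core point, or None for non-core
    points; the returned structure also labels the absorbed non-core points."""
    final = deepcopy(core_labels)
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            if core_labels[a][i] is not None:    # core points keep their label
                continue
            for b in rpi[a]:
                for j, p_j in enumerate(subsets[b]):
                    if core_labels[b][j] is None:  # only core points absorb
                        continue
                    if math.dist(p_i, p_j) < radius:
                        final[a][i] = core_labels[b][j]   # CFi = CFj
                        break
                if final[a][i] is not None:
                    break
    return final
```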
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the neighbor points in each data subset and clustering the neighbor points in each data subset specifically includes:
Step 61: in each of the K1 data subsets, denote the core points as one subset CG and the non-core points as one subset NCG;
Step 62: following step 61, traverse the K1 data subsets and partition every data subset into core points CG and non-core points NCG;
Step 63: select data subset a from the K1 data subsets in order, select non-core point NCPi from the non-core-point subset NCGa of data subset a in order, and select core point CPj from the core-point subset CGa of data subset a in order;
Step 64: compute the distance Di,j between NCPi and CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFj = CFi;
Step 65: following steps 63 to 64, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 66: following steps 63 to 65, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the number of pre-classes K is obtained by the following formula:
where N is the total number of data points in the data set, Min_N is the preset minimum neighbor count, and k is a constant.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset number of division iterations T satisfies 1 ≤ T ≤ 10.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset reach distance d satisfies R ≤ d ≤ 2R, where R is the preset radius.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset radius R may be replaced by R′, with R ≤ R′ ≤ 2R.
The present invention also provides an electronic device for executing the density-based large spatial data clustering algorithm K-DBSCAN, characterized in that it comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
divide a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtain the reachable subsets of each data subset and form the reachable-subset index corresponding to the data subsets;
perform density-based spatial clustering on the data of each data subset according to the reachable-subset index.
Compared with the prior art, the above technical scheme provided by the present invention has at least the following advantages: the present invention proposes a density-based large spatial data clustering algorithm K-DBSCAN, which first partitions the data with a simplified, parallelizable k-means algorithm, then uses a reachable-subset index to guide the clustering, and finally performs spatial clustering with an improved distributed parallel clustering algorithm. The algorithm greatly reduces the computational complexity of density-based spatial data clustering, so that it can be widely applied to clustering massive data.
Brief description of the drawings
To describe the technical schemes of the specific embodiments of the invention or of the prior art more clearly, the accompanying drawings needed for the description of the specific embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be derived from them without creative effort.
Fig. 1 is a flow chart of the density-based large spatial data clustering algorithm K-DBSCAN according to an embodiment of the present invention;
Fig. 2 is a flow chart of the step of spatially partitioning the data set with the improved k-means clustering algorithm according to an embodiment of the present invention;
Fig. 3 is a flow chart of an implementation of step S102 according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the spatial coverage of a data subset according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of mutually reachable data subsets according to an embodiment of the present invention;
Fig. 6 is a flow chart of computing the core points of each data subset according to the reachable-subset index, according to an embodiment of the present invention;
Fig. 7 is a flow chart of clustering the core points in each data subset according to the reachable-subset index, according to an embodiment of the present invention;
Fig. 8 is a flow chart of computing the neighbor points in each data subset and clustering them, according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the hardware configuration of an electronic device that executes the density-based large spatial data clustering algorithm K-DBSCAN, according to an embodiment of the present invention.
Specific embodiment
Embodiment 1
This embodiment provides a density-based large spatial data clustering algorithm K-DBSCAN which, as shown in Fig. 1, includes:
S101: dividing a data set into K1 data subsets, where K1 is a natural number greater than 1.
S102: obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets.
S103: performing density-based spatial clustering on the data of each data subset according to the reachable-subset index.
In the above scheme, the data set is first partitioned to obtain multiple data subsets; a reachable-subset index is then used to guide the clustering; finally, each of the resulting data subsets is spatially clustered with the clustering algorithm. This greatly reduces the computational complexity of density-based spatial data clustering, so that the algorithm can be widely applied to clustering massive data.
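As a purely illustrative sketch of how the three phases fit together (all function names are assumptions; pre_divide, mark_core_points, merge_core_labels and assign_non_core_points are the sketches given earlier in this description, and build_rpi is sketched under step S102 below):

```python
def k_dbscan(points, radius, min_n, K, T, reach_d):
    """End-to-end sketch of phases S101-S103 (illustrative only)."""
    # S101: divide the data set into K1 data subsets.
    subsets = pre_divide(points, K, T, min_n)

    # S102: build the reachable-subset index RPI from the expanded
    # bounding boxes (reachable spatial coverages) of the subsets.
    rpi = build_rpi(subsets, reach_d)

    # S103: density-based clustering guided by the index.
    labels = mark_core_points(subsets, rpi, radius, min_n)
    labels = merge_core_labels(subsets, rpi, labels, radius)
    labels = assign_non_core_points(subsets, rpi, labels, radius)
    return subsets, labels
```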
Embodiment 2
In the above step S101, the data set may be divided in various ways, as long as each data subset obtained after the division has a specific spatial extent and its own data points. This embodiment provides one implementation that spatially partitions the data set with an improved k-means clustering algorithm, comprising the following steps, as shown in Fig. 2:
S201: compute the maximum longitude LNmax, minimum longitude LNmin, maximum latitude LAmax and minimum latitude LAmin of all data points in the data set; obtain the maximum spatial value Dmax = LNmax + LAmax and the minimum spatial value Dmin = LNmin + LAmin of the data set, and then compute the real spatial value range of the data set Dlen = Dmax − Dmin.
S202: perform an initial division of the data points in the data set according to the real spatial value range Dlen. Specifically, for any data point a, the pre-class to which it belongs is computed by the following formula, where LNa is the longitude of data point a and LAa is the latitude of data point a. The value of K can be computed automatically by the following formula, where N is the total number of data points in the data set, Min_N is the minimum neighbor count, and k is an arbitrary constant.
S203: for any pre-class Ca, if its number of members is less than the preset minimum neighbor count Min_N, delete that pre-class;
S204: compute the center point of each pre-class;
S205: compute the distance from each data point to each center point and assign the data point to the pre-class of the nearest center point;
S206: repeat steps S203, S204 and S205 for T iterations, where T can be any natural number, preferably 1 ≤ T ≤ 10;
S207: the data points in the data set are thereby divided into K1 pre-classes, each of which serves as one data subset, so K1 data subsets are obtained.
The above step S102 can, as shown in Fig. 3, be implemented as follows:
S301: compute the reachable spatial coverage of each data subset P. The computation is: compare the longitude and latitude values of all data points in data subset P and find the maximum longitude LOmax, minimum longitude LOmin, maximum latitude LQmax and minimum latitude LQmin over all of its data points; then, adding the preset reach distance d, obtain the reachable spatial coverage of data subset P: LOright = LOmax + d, LOleft = LOmin − d, LQup = LQmax + d, LQdown = LQmin − d, where LQup, LOright, LQdown, LOleft are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of the data subset.
As shown in Fig. 4, the data points inside the rectangle represent one data subset (which may also be called a pre-class); the inner rectangle represents the spatial coverage of the data subset and the outer rectangle represents its reachable spatial coverage. The difference between the inner and outer rectangles is the preset reach distance d. As an optional implementation, the reach distance d may take any value; it is usually set such that R ≤ d ≤ 2R, where R is the preset radius.
S302: compute the reachable-subset list RPL of each data subset P. The computation is: for any data subsets Pa and Pb, if the reachable spatial coverage Ra of data subset Pa and the reachable spatial coverage Rb of data subset Pb intersect, Pa and Pb are defined to be mutually reachable, recorded as RPLa = {a, b, ...} and RPLb = {a, b, ...}; obviously, every data subset is a reachable subset of itself. As shown in Fig. 5, for data subset P1, its reachable spatial coverage R1 intersects only the reachable spatial coverages R2, R3, R4 of data subsets P2, P3, P4, so P1 and P2 are mutually reachable, P1 and P3 are mutually reachable, and P1 and P4 are mutually reachable. In this step, whether data subsets Pa and Pb intersect can be determined by judging whether a boundary of one data subset falls within the boundaries of the other; this can be summarized as the following 9 cases, computed by the formulas below:
(1) (LOleft_a < LOleft_b < LOright_a) & (LQdown_a < LQup_b < LQup_a)
(2) (LOleft_a < LOleft_b < LOright_a) & (LQdown_a < LQdown_b < LQup_a)
(3) (LOleft_a < LOright_b < LOright_a) & (LQdown_a < LQup_b < LQup_a)
(4) (LOleft_a < LOright_b < LOright_a) & (LQdown_a < LQdown_b < LQup_a)
(5) (LOleft_a < LOleft_b < LOright_a) & (LQdown_a > LQdown_b & LQup_a < LQup_b)
(6) (LOleft_a < LOright_b < LOright_a) & (LQdown_a > LQdown_b & LQup_a < LQup_b)
(7) (LOleft_a > LOleft_b & LOright_a < LOright_b) & (LQdown_a < LQup_b < LQup_a)
(8) (LOleft_a > LOleft_b & LOright_a < LOright_b) & (LQdown_a < LQdown_b < LQup_a)
(9) (LOleft_a > LOleft_b) & (LOright_a < LOright_b) & (LQup_a < LQup_b) & (LQdown_a > LQdown_b)
Here & denotes logical AND; in the subscripts, right denotes the right boundary, left the left boundary, up the upper boundary and down the lower boundary, and a and b denote data subsets Pa and Pb respectively, so the meaning of each symbol follows directly from its subscript.
S303: apply the formulas of step S302 to each of the K1 data subsets relative to the others to obtain the reachable-subset list RPL of each data subset, and arrange the reachable-subset lists RPL of all K1 data subsets in subset order to obtain one reachable-subset index RPI, denoted RPI = {RPL1, ..., RPLK1}.
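A sketch of this index construction, reusing the hypothetical Coverage type and coverages_intersect test sketched earlier (the names and representation are assumptions):

```python
def build_rpi(subsets, reach_d):
    """Build the reachable-subset index RPI = [RPL1, ..., RPLK1] (sketch)."""
    coverages = []
    for pts in subsets:
        lons = [p[0] for p in pts]
        lats = [p[1] for p in pts]
        coverages.append(Coverage(lo_left=min(lons) - reach_d,
                                  lo_right=max(lons) + reach_d,
                                  lq_down=min(lats) - reach_d,
                                  lq_up=max(lats) + reach_d))
    rpi = []
    for a, cov_a in enumerate(coverages):
        # Every subset is reachable from itself; add every subset whose
        # reachable spatial coverage intersects this one.
        rpl = [b for b, cov_b in enumerate(coverages)
               if b == a or coverages_intersect(cov_a, cov_b)]
        rpi.append(rpl)
    return rpi
```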
Preferably, the above step S103 may include the following steps: computing the core points of each data subset according to the reachable-subset index; clustering the core points in each data subset; computing the neighbor points in each data subset and clustering the neighbor points in each data subset.
When computing the core points of each data subset, the computation can follow the reachable-subset index, or it can be restricted to the data points of the data subset itself. Clustering only within each data subset effectively increases clustering speed, but precision is affected to some extent.
Accordingly, the core points of each data subset can be computed according to the reachable-subset index with the scheme shown in Fig. 6, which comprises the following steps:
Step 11: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 12: select data subset b from the reachable-subset list of data subset a in order;
Step 13: select data point Pj from data subset b in order;
Step 14: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 15: following steps 13 and 14, traverse all data points in data subset b and accumulate the neighbor count Ni of data point Pi according to step 14;
Step 16: following step 12, traverse all reachable data subsets in the reachable-subset list of data subset a and compute the neighbor count Ni of data point Pi according to step 15;
Step 17: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 11, select the next data point Pi+1 for core-point computation;
Step 18: following steps 11 to 17, traverse every data point in data subset a in order, completing the core-point computation of all data points in data subset a;
Step 19: following steps 11 to 18, traverse every data subset among the K1 data subsets in order, completing the core-point computation of all data points in all data subsets.
That is, for each data point of data subset a, every data point of every reachable subset is traversed in order; the data subsets other than a are then traversed in turn, until all K1 data subsets have been traversed.
It should be noted that the parameter subscripts given in the embodiments of the present invention may be distinguished with natural numbers, which places no substantive restriction on the parameters; for example, when a data subset contains 100 data points, Pi denotes one data point and i ranges from 1 to 100.
As shown in Fig. 7, the core points may also be computed for each subset without using the reachable-subset index RPI, which increases the computation speed of massive data clustering but loses some precision in the clustering result. The computation can be simplified to:
Step 21: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 22: select data point Pj from data subset a in order;
Step 23: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 24: following step 22, traverse all data points in data subset a and compute the neighbor count Ni of data point Pi according to step 23;
Step 25: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 21, select the next data point Pi+1 for core-point computation;
Step 26: following steps 21 to 25, traverse every data point in data subset a, completing the core-point computation of all data points in data subset a;
Step 27: following steps 21 to 26, traverse every data subset among the K1 data subsets, completing the core-point computation of all data points in all data subsets.
That is, for each data point in data subset a, the data points of the reachable subsets are not traversed; only the other data points of data subset a itself are traversed. After data subset a is finished, the other data subsets are traversed in turn, until all K1 data subsets have been traversed.
Similarly, clustering the core points in each data subset can also be done in two ways: one clusters the core points in each data subset according to the reachable-subset index, the other completes the clustering without using the reachable subsets. Specifically:
Mode one: clustering the core points in each data subset according to the reachable-subset index specifically includes:
Step 31: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 32: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order;
Step 33: select data subset b from the reachable-subset list of data subset a in order;
Step 34: select data point Pj from data subset b in order;
Step 35: if the classification labels CFi and CFj of data points Pi and Pj are equal, go to step 37; otherwise compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 36, otherwise go to step 37;
Step 36: compare the classification labels CFi and CFj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 37: following steps 34 to 36, traverse every data point in data subset b;
Step 38: following steps 33 to 37, traverse every data point of every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of core point Pi of data subset a;
Step 39: following steps 32 to 38, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 310: following steps 31 to 39, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
Mode two: the core points in each subset are not clustered through the reachable-subset index RPI, which increases the computation speed of massive data clustering but loses some precision in the clustering result. The computation can be simplified to:
Step 41: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 42: select data subset a from the K1 data subsets in order, and select data points Pi and Pj from data subset a in order;
Step 43: judge whether the classification labels CFi and CFj of data points Pi and Pj are equal; if equal, go to step 45; if not, compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 44, otherwise go to step 45;
Step 44: compare the classification labels CFi and CFj of data points Pi and Pj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 45: following steps 41 to 44, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 46: following steps 42 to 45, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
Similarly, the step of computing the neighbor points in each data subset and clustering them can also be realized in two ways.
Mode one: according to the reachable-subset index RPI, compute and cluster the neighbor points of each subset. The computation is:
Step 51: select data subset a from the K1 data subsets in order, take the core points in data subset a as one subset, denoted CGa, and the non-core points as one subset NCGa;
Step 52: following step 51, traverse the K1 data subsets and partition every data subset into a core-point subset CG and a non-core-point subset NCG;
Step 53: select data subset a from the K1 data subsets in order, and select non-core point NCPi from the non-core-point subset NCGa of data subset a in order;
Step 54: select data subset b from the reachable-subset list of data subset a in order;
Step 55: select core point CPj from the core-point subset CGb of data subset b in order;
Step 56: compute the distance Di,j between non-core point NCPi and core point CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFi = CFj;
Step 57: following steps 55 to 56, traverse every core point in data subset b;
Step 58: following steps 54 to 56, traverse every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of non-core point NCPi;
Step 59: following steps 53 to 58, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 510: following steps 53 to 59, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
Mode two: the neighbor points are not computed for each subset through the reachable-subset index RPI, which increases the computation speed of massive data clustering but loses some precision in the clustering result. The computation can be simplified to:
Step 61: in each of the K1 data subsets, denote the core points as one subset CG and the non-core points as one subset NCG;
Step 62: following step 61, traverse the K1 data subsets and partition every data subset into core points CG and non-core points NCG;
Step 63: select data subset a from the K1 data subsets in order, select non-core point NCPi from the non-core-point subset NCGa of data subset a in order, and select core point CPj from the core-point subset CGa of data subset a in order;
Step 64: compute the distance Di,j between NCPi and CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFj = CFi;
Step 65: following steps 63 to 64, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 66: following steps 63 to 65, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
In the above schemes, the preset radius R may be replaced by R′, with R ≤ R′ ≤ 2R. The technical scheme provided by this embodiment can perform density-based unsupervised and semi-supervised clustering on large spatial data sets and achieves efficient, fast parallel cluster computation.
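Because each data subset only needs its own points plus those of its reachable subsets, the per-subset passes lend themselves to parallel execution. A minimal sketch of that idea, using Python's multiprocessing as an assumed substrate (this is not the patent's implementation; on some platforms the pool must be created under an `if __name__ == "__main__":` guard):

```python
import math
from multiprocessing import Pool

def _core_flags_for_subset(args):
    """Worker: flag the core points of one subset against its reachable subsets."""
    a, subsets, rpl, radius, min_n = args
    flags = []
    for p_i in subsets[a]:
        neighbors = sum(1 for b in rpl for p_j in subsets[b]
                        if math.dist(p_i, p_j) < radius)
        flags.append(neighbors > min_n)
    return a, flags

def parallel_core_flags(subsets, rpi, radius, min_n, workers=4):
    """Run the core-point pass for every subset in a process pool (sketch)."""
    tasks = [(a, subsets, rpi[a], radius, min_n) for a in range(len(subsets))]
    with Pool(workers) as pool:
        results = pool.map(_core_flags_for_subset, tasks)
    return [flags for _, flags in sorted(results)]
```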
Embodiment 3
Fig. 9 is a hardware architecture diagram of the electronic device, provided by this embodiment, that executes the density-based large spatial data clustering algorithm K-DBSCAN. As shown in Fig. 9, the device includes:
one or more processors 701 and a memory 702, with one processor 701 taken as an example in Fig. 9.
The device for executing the density-based large spatial data clustering algorithm K-DBSCAN may further include an input means 703 and an output means 704.
The processor 701, the memory 702, the input means 703 and the output means 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 9.
The memory 702, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the density-based large spatial data clustering algorithm K-DBSCAN in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 702, the processor 701 executes the various functional applications and data processing of the server, i.e., realizes the density-based large spatial data clustering algorithm K-DBSCAN of the above method embodiments.
The memory 702 may include a program storage area and a data storage area; the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created by the use of the device executing the density-based large spatial data clustering algorithm K-DBSCAN, and so on. In addition, the memory 702 may include high-speed random-access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 702 optionally includes memories located remotely from the processor 701, and these remote memories may be connected through a network to the device executing the density-based large spatial data clustering algorithm K-DBSCAN. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input means 703 can receive input numeric or character information and generate key-signal inputs related to the user settings and function control of the device executing the density-based large spatial data clustering algorithm K-DBSCAN. The output means 704 may include a display device such as a display screen.
The one or more modules are stored in the memory 702 and, when executed by the one or more processors 701, perform the density-based large spatial data clustering algorithm K-DBSCAN of any of the above method embodiments.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present invention is described with reference to flow charts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that every flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device which realizes the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, may make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Claims (19)

1. A density-based large spatial data clustering algorithm K-DBSCAN, characterized by comprising:
dividing a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets;
performing density-based spatial clustering on the data of each data subset according to the reachable-subset index.
2. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 1, characterized in that dividing the data set into K1 data subsets, where K1 is a natural number greater than 1, specifically includes:
obtaining the real spatial value range Dlen of the data set;
pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen of the data set, where K is a natural number greater than 1;
a pre-classification step: obtaining the center position of each pre-class and assigning each data point in the data set to the pre-class whose center is closest to it; if the number of data points in a pre-class is less than the preset minimum neighbor count Min_N, deleting that pre-class;
repeating the pre-classification step T times to obtain K1 data subsets, where T is the preset number of division iterations.
3. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 2, characterized in that obtaining the real spatial value range Dlen of the data set specifically includes:
obtaining the maximum spatial value Dmax and minimum spatial value Dmin of the data set, where Dmax = LNmax + LAmax and Dmin = LNmin + LAmin, LNmax and LNmin are the maximum and minimum longitude of all data points in the data set, and LAmax and LAmin are the maximum and minimum latitude of all data points in the data set;
obtaining the real spatial value range of the data set Dlen = Dmax − Dmin.
4. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 3, characterized in that pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen, where K is a natural number greater than 1, specifically includes:
for any data point a in the data set, obtaining the pre-class to which it belongs according to the following formula, where LNa is the longitude of data point a and LAa is the latitude of data point a.
5. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 1, characterized in that obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets specifically includes:
determining, from the longitude and latitude values of all data points in the data subset, the maximum longitude LOmax, minimum longitude LOmin, maximum latitude LQmax and minimum latitude LQmin of the data subset;
obtaining the reachable spatial coverage of each data subset: LOright = LOmax + d, LOleft = LOmin − d, LQup = LQmax + d, LQdown = LQmin − d, where d is the preset reach distance and LQup, LOright, LQdown, LOleft are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of the data subset;
computing the reachable-subset list of each data subset, the computation being: for any data subsets Pa and Pb, if the reachable spatial coverage of Pa and the reachable spatial coverage of Pb intersect, Pa and Pb are determined to be mutually reachable; each data subset is also a reachable subset of itself. This is recorded as RPLa = {a, b, ...} and RPLb = {a, b, ...}; all reachable subsets of a data subset constitute the reachable-subset list of that data subset;
after the reachable-subset list of each data subset is obtained, arranging the reachable-subset lists of all K1 data subsets in data-subset order to obtain the reachable-subset index.
6. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 5, characterized in that the reachable coverage range of data subset P_a and the reachable coverage range of data subset P_b are determined to intersect when any one of the following relations holds between them:
Relation one: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation two: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation three: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation four: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation five: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a > LQ_down_b and LQ_up_a < LQ_up_b);
Relation six: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a > LQ_down_b and LQ_up_a < LQ_up_b);
Relation seven: (LO_left_a > LO_left_b and LO_right_a < LO_right_b) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation eight: (LO_left_a > LO_left_b and LO_right_a < LO_right_b) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation nine: (LO_left_a > LO_left_b) and (LO_right_a < LO_right_b) and (LQ_up_a < LQ_up_b) and (LQ_down_a > LQ_down_b);
where LQ_up_a, LO_right_a, LQ_down_a and LO_left_a are respectively the upper, right, lower and left boundaries of the reachable coverage range of data subset P_a, and LQ_up_b, LO_right_b, LQ_down_b and LO_left_b are respectively the upper, right, lower and left boundaries of the reachable coverage range of data subset P_b.
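Relations one to nine enumerate the ways two axis-aligned rectangles can overlap, so in code they can be summarized by a standard rectangle-intersection test. The sketch below does that and builds the reachable subset index from it; it uses non-strict comparisons, whereas the claim's relations are strict, and the (left, right, down, up) tuple layout and list-of-lists index layout are assumptions carried over from the previous sketch:

```python
def ranges_intersect(box_a, box_b):
    """True when two reachable coverage ranges overlap (claim 6, relations one to nine)."""
    a_left, a_right, a_down, a_up = box_a
    b_left, b_right, b_down, b_up = box_b
    return (a_left <= b_right and b_left <= a_right and
            a_down <= b_up and b_down <= a_up)

def reachable_subset_index(boxes):
    """Reachable subset list RPL_a of every data subset, arranged in subset order (claim 5).
    `boxes` holds one reachable coverage range per data subset."""
    index = []
    for a, box_a in enumerate(boxes):
        # a data subset is always a reachable subset of itself; the overlap test adds the rest
        rpl = [b for b, box_b in enumerate(boxes) if a == b or ranges_intersect(box_a, box_b)]
        index.append(rpl)
    return index
```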
7. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 6, characterized in that carrying out density-based spatial clustering on the data of each data subset according to the reachable subset index specifically includes:
Calculate the core points of each data subset according to the reachable subset index;
Cluster the core points in each data subset respectively.
8. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that calculating the core points of each data subset according to the reachable subset index specifically includes:
Step 11: choose a data subset a from the K1 data subsets in order and choose a data point P_i from data subset a in order; the initial value of the neighbor count N_i of data point P_i is 0 and the classification label CF is 1;
Step 12: choose a data subset b in order from the reachable subset list of data subset a;
Step 13: choose a data point P_j from data subset b in order;
Step 14: calculate the distance D_i,j between data point P_i and data point P_j; if the distance D_i,j is less than the preset radius R, determine that data point P_j is a neighbor of data point P_i and set N_i = N_i + 1;
Step 15: following the methods of steps 13 and 14, traverse all data points in data subset b and accumulate the neighbor count N_i of data point P_i according to step 14;
Step 16: following the method of step 12, traverse all reachable data subsets in the reachable subset list of data subset a and accumulate the neighbor count N_i of data point P_i according to step 15;
Step 17: each time the neighbor count N_i of data point P_i has been computed, judge whether N_i is greater than the minimum neighbor quantity Min_N; if so, data point P_i is a core point: assign it the classification label CF_i = CF, add the classification label MF_a to the last neighbor point P_k found in this computation, end the core point computation for data point P_i, set CF = CF + 1 and, following the method of step 11, choose the next data point P_(i+1) for core point computation;
Step 18: following the methods of steps 11 to 17, traverse each data point in data subset a in order to complete the core point computation for all data points in data subset a;
Step 19: following the methods of steps 11 to 18, traverse each of the K1 data subsets in order to complete the core point computation for all data points in all data subsets.
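A sketch of the neighbor counting of steps 11 to 19, assuming Euclidean distance on (longitude, latitude) pairs and a reachable subset index like the one sketched after claim 6; the CF/MF label bookkeeping of step 17 is simplified to returning, for each data subset, the indices of its core points:

```python
import math

def core_points(subsets, index, radius, min_n):
    """Core point computation over the reachable subset index (claim 8, steps 11-19).

    Returns, for every data subset a, the set of indices i whose point P_i has more
    than Min_N neighbors within the preset radius R among all reachable subsets."""
    cores = []
    for a, subset_a in enumerate(subsets):
        core_a = set()
        for i, p_i in enumerate(subset_a):
            n_i = 0                                    # neighbor count N_i starts at 0
            for b in index[a]:                         # step 12: each reachable subset b of a
                for j, p_j in enumerate(subsets[b]):   # steps 13-14: each point P_j of b
                    if (a, i) != (b, j) and math.dist(p_i, p_j) < radius:
                        n_i += 1
            if n_i > min_n:                            # step 17: P_i is a core point
                core_a.add(i)
        cores.append(core_a)
    return cores
```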
9. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 8, characterized in that calculating the core points of each data subset specifically includes:
Step 21: choose a data subset a from the K1 data subsets in order and choose a data point P_i from data subset a in order; the initial value of the neighbor count N_i of data point P_i is 0 and the classification label CF is 1;
Step 22: choose a data point P_j from data subset a in order;
Step 23: calculate the distance D_i,j between data point P_i and data point P_j; if the distance D_i,j is less than the preset radius R, determine that data point P_j is a neighbor of data point P_i and set N_i = N_i + 1;
Step 24: following the method of step 22, traverse all data points in data subset a and compute the neighbor count N_i of data point P_i according to the method of step 23;
Step 25: each time the neighbor count N_i of data point P_i has been computed, judge whether N_i is greater than the minimum neighbor quantity Min_N; if so, data point P_i is a core point: assign it the classification label CF_i = CF, add the classification label MF_a to the last neighbor point P_k found in this computation, end the core point computation for data point P_i, set CF = CF + 1 and, following the method of step 21, choose the next data point P_(i+1) for core point computation;
Step 26: following the methods of steps 21 to 25, traverse each data point in data subset a to complete the core point computation for all data points in data subset a;
Step 27: following the methods of steps 21 to 26, traverse each of the K1 data subsets to complete the core point computation for all data points in all data subsets.
10. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that clustering the core points in each data subset respectively according to the reachable subset index specifically includes:
Step 31: aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 32: choose a data subset a from the K1 data subsets in order and choose a data point P_i from data subset a in order;
Step 33: choose a data subset b in order from the reachable subset list of data subset a;
Step 34: choose a data point P_j from data subset b in order;
Step 35: if the classification labels CF_i and CF_j of data points P_i and P_j are equal, perform step 37; otherwise calculate the distance D_i,j between data points P_i and P_j; if the distance D_i,j is less than the preset radius R, perform step 36, otherwise perform step 37;
Step 36: compare the classification labels CF_i and CF_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 37: following the methods of steps 34 to 36, traverse each data point in data subset b;
Step 38: following the methods of steps 33 to 37, traverse each data point of each reachable subset in the reachable subset list of data subset a, completing the cluster computation for core point P_i in data subset a;
Step 39: following the methods of steps 32 to 38, traverse each data point in data subset a, completing the cluster computation for all core points in data subset a;
Step 310: following the methods of steps 31 to 39, traverse each of the K1 data subsets in order, completing the cluster computation for each core point in each data subset.
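The merging of steps 31 to 310 can be read as: whenever two core points from mutually reachable subsets lie within R of each other, the larger of their two labels is replaced by the smaller. The sketch below starts from one distinct label per core point and repeats the merge until the labels stop changing, which is a simplification of the claim's single ordered pass with MF aggregation; all names are assumptions:

```python
import math

def cluster_core_points(subsets, index, cores, radius):
    """Merge classification labels of core points (claim 10): core points closer than R,
    taken from mutually reachable subsets, end up sharing the smaller label."""
    cf = {}
    label = 1
    for a, core_a in enumerate(cores):        # assumed start: one distinct label per core point
        for i in sorted(core_a):
            cf[(a, i)] = label
            label += 1
    changed = True
    while changed:                            # repeat until no label changes (simplification)
        changed = False
        for a, core_a in enumerate(cores):
            for i in core_a:
                for b in index[a]:            # steps 33-34: reachable subsets and their core points
                    for j in cores[b]:
                        if cf[(a, i)] == cf[(b, j)]:
                            continue          # step 35: equal labels, nothing to merge
                        if math.dist(subsets[a][i], subsets[b][j]) < radius:
                            low = min(cf[(a, i)], cf[(b, j)])   # step 36: keep the smaller label
                            cf[(a, i)] = cf[(b, j)] = low
                            changed = True
    return cf
```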
11. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that clustering the core points in each data subset respectively specifically includes:
Step 41: aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 42: choose a data subset a from the K1 data subsets in order and choose data points P_i and P_j from data subset a in order;
Step 43: judge whether the classification labels CF_i and CF_j of data points P_i and P_j are equal; if equal, perform step 45; if not equal, calculate the distance D_i,j between data points P_i and P_j; if the distance D_i,j is less than the preset radius R, perform step 44, otherwise perform step 45;
Step 44: compare the classification labels CF_i and CF_j of data points P_i and P_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 45: following the methods of steps 41 to 44, traverse each data point in data subset a, completing the cluster computation for all core points in data subset a;
Step 46: following the methods of steps 41 to 45, traverse each of the K1 data subsets in order, completing the cluster computation for each core point in each data subset.
12. The density-based large spatial data clustering algorithm K-DBSCAN according to any one of claims 7-11, characterized in that it further comprises the following step:
Calculate the neighbor points in each data subset respectively and cluster the neighbor points in each data subset.
13. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 12, characterized in that calculating the neighbor points in each data subset respectively and clustering the neighbor points in each data subset specifically includes:
Step 51: choose a data subset a from the K1 data subsets in order; the core points in data subset a form one subset, recorded as CG_a, and the non-core points form one subset NCG_a;
Step 52: following the method of step 51, traverse the K1 data subsets and divide every data subset into a core point subset CG and a non-core point subset NCG;
Step 53: choose a data subset a from the K1 data subsets in order and choose a non-core point NCP_i in order from the non-core point subset NCG_a of data subset a;
Step 54: choose a data subset b in order from the reachable subset list of data subset a;
Step 55: choose a core point CP_j in order from the core point subset CG_b of data subset b;
Step 56: calculate the distance D_i,j between non-core point NCP_i and core point CP_j; if the distance D_i,j is less than the preset radius R, non-core point NCP_i is a neighbor of core point CP_j and is clustered accordingly: set CF_i = CF_j;
Step 57: following the methods of steps 55 to 56, traverse each core point in data subset b;
Step 58: following the methods of steps 54 to 56, traverse each reachable subset in the reachable subset list of data subset a, completing the cluster computation for non-core point NCP_i;
Step 59: following the methods of steps 53 to 58, traverse each non-core point in data subset a in order, completing the cluster computation for all non-core points in data subset a;
Step 510: following the methods of steps 53 to 59, traverse each of the K1 data subsets in order, completing the cluster computation for all non-core points in each data subset.
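A sketch of the non-core point assignment of steps 51 to 510: every non-core point within R of a core point in a reachable subset inherits that core point's label. `cores` and `cf` are assumed to come from the earlier sketches; as in the claim, the whole reachable subset list is traversed, so a later match overwrites an earlier one:

```python
import math

def cluster_non_core_points(subsets, index, cores, cf, radius):
    """Cluster the non-core (neighbor) points (claim 13): a non-core point NCP_i within
    the preset radius R of a core point CP_j in a reachable subset takes CF_i = CF_j."""
    for a, subset_a in enumerate(subsets):
        non_core_a = [i for i in range(len(subset_a)) if i not in cores[a]]   # NCG_a
        for i in non_core_a:                    # step 53: each non-core point NCP_i of a
            for b in index[a]:                  # step 54: each reachable subset b of a
                for j in cores[b]:              # step 55: each core point CP_j of b
                    if math.dist(subset_a[i], subsets[b][j]) < radius:
                        cf[(a, i)] = cf[(b, j)]     # step 56: CF_i = CF_j
    return cf
```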
14. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 12, characterized in that calculating the neighbor points in each data subset respectively and clustering the neighbor points in each data subset specifically includes:
Step 61: within each of the K1 data subsets, record the core points as one subset CG and the non-core points as one subset NCG;
Step 62: following the method of step 61, traverse the K1 data subsets and divide every data subset into core points CG and non-core points NCG;
Step 63: choose a data subset a from the K1 data subsets in order; choose a non-core point NCP_i in order from the non-core point subset NCG_a of data subset a, and choose a core point CP_j in order from the core point subset CG_a of data subset a;
Step 64: calculate the distance D_i,j between point NCP_i and point CP_j; if D_i,j is less than the preset radius R, point NCP_i is a neighbor of point CP_j and is clustered accordingly: set CF_i = CF_j;
Step 65: following the methods of steps 63 to 64, traverse each non-core point in data subset a in order, completing the cluster computation for all non-core points in data subset a;
Step 66: following the methods of steps 63 to 65, traverse each of the K1 data subsets in order, completing the cluster computation for all non-core points in each data subset.
15. The density-based large spatial data clustering algorithm K-DBSCAN according to any one of claims 1-14, characterized in that the number K of pre-classes is obtained in the following manner:
where N is the total number of data points in the data set, Min_N is the preset minimum neighbor quantity, and k is a constant.
16. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 15, characterized in that the preset number of division iterations T satisfies 1 ≤ T ≤ 10.
17. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 16, characterized in that the preset reachable distance d satisfies R ≤ d ≤ R × 2, where R is the preset radius.
18. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 17, characterized in that the preset radius R may be replaced by R', where R ≤ R' ≤ R × 2.
19. An electronic device for executing the density-based large spatial data clustering algorithm K-DBSCAN, characterized by comprising:
at least one processor; and
a memory communicatively connected with the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
divide a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtain the reachable subset of each data subset and form a reachable subset index corresponding to the data subsets; and
carry out density-based spatial clustering on the data of each data subset according to the reachable subset index.
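Tying the sketches above together, a hypothetical driver for the three operations of claim 19 could look as follows; every helper it calls (predivide, coverage_range, reachable_subset_index, core_points, cluster_core_points, cluster_non_core_points) is one of the assumed sketches given after the earlier claims, not an API defined by the patent:

```python
def k_dbscan(points, radius, min_n, k, t, d):
    """End-to-end sketch: pre-divide, build the reachable subset index, then cluster."""
    subsets = predivide(points, k, min_n, t)               # K1 data subsets
    boxes = [coverage_range(s, d) for s in subsets]        # reachable coverage ranges
    index = reachable_subset_index(boxes)                  # reachable subset index
    cores = core_points(subsets, index, radius, min_n)     # core points per subset
    cf = cluster_core_points(subsets, index, cores, radius)
    cf = cluster_non_core_points(subsets, index, cores, cf, radius)
    return subsets, cf                                     # cluster label per (subset, point) pair
```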
CN201611047429.1A 2016-11-23 2016-11-23 Large-scale spatial data clustering algorithm K-DBSCAN based on density Active CN106709503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611047429.1A CN106709503B (en) 2016-11-23 2016-11-23 Large-scale spatial data clustering algorithm K-DBSCAN based on density

Publications (2)

Publication Number Publication Date
CN106709503A true CN106709503A (en) 2017-05-24
CN106709503B CN106709503B (en) 2020-07-07

Family

ID=58934814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611047429.1A Active CN106709503B (en) 2016-11-23 2016-11-23 Large-scale spatial data clustering algorithm K-DBSCAN based on density

Country Status (1)

Country Link
CN (1) CN106709503B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559630A (en) * 2013-10-31 2014-02-05 华南师范大学 Customer segmentation method based on customer attribute and behavior characteristic analysis
CN104715160A (en) * 2015-04-03 2015-06-17 天津工业大学 Soft measurement modeling data outlier detecting method based on KMDB

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291873A (en) * 2017-06-16 2017-10-24 晶赞广告(上海)有限公司 Geographical position clustering method
CN107291873B (en) * 2017-06-16 2020-02-18 晶赞广告(上海)有限公司 Geographical position clustering method
CN110298371A (en) * 2018-03-22 2019-10-01 北京京东尚科信息技术有限公司 The method and apparatus of data clusters
CN109359682A (en) * 2018-10-11 2019-02-19 北京市交通信息中心 A kind of Shuttle Bus candidate's website screening technique based on F-DBSCAN iteration cluster
CN109239792A (en) * 2018-10-25 2019-01-18 中国石油大学(华东) The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine
CN109239792B (en) * 2018-10-25 2019-07-16 中国石油大学(华东) The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine
CN111160385A (en) * 2019-11-27 2020-05-15 北京中交兴路信息科技有限公司 Method, device, equipment and storage medium for aggregating mass location points
CN111160385B (en) * 2019-11-27 2023-04-18 北京中交兴路信息科技有限公司 Method, device, equipment and storage medium for aggregating mass location points
CN111815361A (en) * 2020-07-10 2020-10-23 北京思特奇信息技术股份有限公司 Region boundary calculation method and device, electronic equipment and storage medium
CN112183664A (en) * 2020-10-27 2021-01-05 中国人民解放军陆军工程大学 Novel density clustering method
CN116702304A (en) * 2023-08-08 2023-09-05 中建五局第三建设有限公司 Method and device for grouping foundation pit design schemes based on unsupervised learning
CN116702304B (en) * 2023-08-08 2023-10-20 中建五局第三建设有限公司 Method and device for grouping foundation pit design schemes based on unsupervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant