CN106709503A - Large spatial data clustering algorithm K-DBSCAN based on density
- Publication number
- CN106709503A (application CN201611047429.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- subset
- point
- data subset
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
Abstract
The invention relates to K-DBSCAN, a density-based clustering algorithm for large spatial data. The algorithm comprises the following steps: the density-based clustering parameters are preset, namely the radius R, the minimum neighbor number Min_N, the pre-division number K and the number of division iterations T; the data set is divided into K1 subsets according to its spatial distribution; the reachable subsets of each data subset are calculated to form a reachable subset index; and, based on the reachable subset index, density-based spatial clustering is carried out on the data of each subset. With this technical scheme, density-based unsupervised and semi-supervised clustering can be carried out on large spatial data sets, and efficient, fast parallel clustering is realized.
Description
Technical field
The present invention relates to the field of data mining and big data analysis, and in particular to K-DBSCAN, a density-based clustering algorithm for large spatial data.
Background
Spatial data clustering is widely used in many areas of information technology, such as data mining, pattern recognition, machine learning, artificial intelligence, visual analysis and geographic information systems (GIS). In the big data era in particular, it can be used to discover meaningful but still unknown latent patterns and phenomena, and it applies to many fields, such as social network analysis, economic network analysis, traffic network analysis, meteorological analysis and smart city development. Traditional distance-based spatial data clustering methods fall into three main kinds: 1) partition-based clustering; 2) density-based clustering; 3) hierarchical clustering.
Density-based clustering handles noise points effectively and identifies clusters of arbitrary shape. The main algorithms include DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure) and DENCLUE (DENsity-based CLUstEring). Of these, DBSCAN is the best-known density-based spatial clustering algorithm. Its computational complexity is O(N^2); that is, when the data volume increases 100-fold, the computation time increases about 10,000-fold. Although many methods based on parallel computing greatly reduce the required computation time, they remain limited by the number of CPUs or GPUs in the computing platform. For example, running DBSCAN on 100 times the data in the same computation time requires about 10,000 times as many CPUs or GPUs. DBSCAN therefore cannot be widely applied to massive data.
Summary of the invention
The present invention aims to solve the technical problem in the prior art that DBSCAN cannot be applied when massive data needs to be clustered.
To solve the above technical problem, the present invention provides the following technical scheme:
A density-based large spatial data clustering algorithm K-DBSCAN, comprising:
dividing a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtaining the reachable subsets of each data subset, and forming a reachable subset index corresponding to the data subsets;
performing density-based spatial clustering on the data of each data subset according to the reachable subset index.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, dividing the data set into K1 data subsets, where K1 is a natural number greater than 1, specifically comprises:
obtaining the actual spatial range value D_len of the data set;
pre-dividing the data points in the data set according to D_len to obtain K pre-classifications, where K is a natural number greater than 1;
a pre-classification step: obtaining the center point of each pre-classification, and assigning each data point in the data set to the pre-classification whose center point is nearest to it; if the number of data points in a pre-classification is less than the preset minimum neighbor number Min_N, deleting that pre-classification;
repeating the pre-classification step T times to obtain K1 data subsets, where T is the preset number of division iterations.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, obtaining the actual spatial range value D_len of the data set specifically comprises:
obtaining the maximum spatial range value D_max and the minimum spatial range value D_min of the data set, where D_max = LN_max + LA_max and D_min = LN_min + LA_min; LN_max and LN_min are the maximum and minimum longitude of all data points in the data set, and LA_max and LA_min are the maximum and minimum latitude of all data points in the data set;
obtaining the actual spatial range value of the data set, D_len = D_max - D_min.
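The range computation above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation; the function name `spatial_range` and the (longitude, latitude) tuple layout are assumptions introduced here.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (longitude, latitude); this layout is an assumption


def spatial_range(points: Iterable[Point]) -> float:
    """Return D_len = (LN_max + LA_max) - (LN_min + LA_min)."""
    pts = list(points)
    if not pts:
        raise ValueError("data set is empty")
    lons = [lon for lon, _ in pts]
    lats = [lat for _, lat in pts]
    d_max = max(lons) + max(lats)  # D_max
    d_min = min(lons) + min(lats)  # D_min
    return d_max - d_min
```

For the three points (1, 2), (3, 5) and (2, 1), D_max = 3 + 5 = 8 and D_min = 1 + 1 = 2, so the function returns 6.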
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, pre-dividing the data points in the data set according to the actual spatial range value D_len to obtain K pre-classifications, where K is a natural number greater than 1, specifically comprises:
for any data point a in the data set, obtaining the pre-classification to which it belongs by the following calculation: where LN_a is the longitude of data point a and LA_a is the latitude of data point a.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, obtaining the reachable subsets of each data subset and forming the reachable subset index corresponding to the data subsets specifically comprises:
determining, from the longitude and latitude values of all data points in the data subset, the maximum longitude LO_max, minimum longitude LO_min, maximum latitude LQ_max and minimum latitude LQ_min of the data subset;
obtaining the reachable space coverage of each data subset: LO_right = LO_max + d, LO_left = LO_min - d, LQ_up = LQ_max + d, LQ_down = LQ_min - d, where d is the preset reachable distance, and LQ_up, LO_right, LQ_down and LO_left are respectively the upper, right, lower and left boundaries of the reachable space coverage of the data subset;
calculating the reachable subset list of each data subset as follows: for any data subsets P_a and P_b, if the reachable space coverage of P_a intersects the reachable space coverage of P_b, then P_a and P_b are mutually reachable; each data subset is also a reachable subset of itself; this is recorded as RPL_a = {a, b, ...} and RPL_b = {a, b, ...}. All reachable subsets of a data subset constitute the reachable subset list of that data subset;
after the reachable subset list of each data subset is obtained, arranging the reachable subset lists of all K1 data subsets in data subset order to obtain the reachable subset index.
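The construction of the reachable subset index can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `reachable_box`, `boxes_intersect` and `build_reachable_index` are hypothetical, points are (longitude, latitude) tuples, and the intersection check is a compact interval-overlap test used here in place of a case-by-case comparison of boundaries.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (longitude, latitude), assumed layout
Box = Tuple[float, float, float, float]  # (LO_left, LO_right, LQ_down, LQ_up)


def reachable_box(subset: List[Point], d: float) -> Box:
    """Bounding box of a data subset, widened on every side by the reachable distance d."""
    lons = [lon for lon, _ in subset]
    lats = [lat for _, lat in subset]
    return (min(lons) - d, max(lons) + d, min(lats) - d, max(lats) + d)


def boxes_intersect(a: Box, b: Box) -> bool:
    """Compact overlap test (an assumption; strict, so boxes that only touch do not count)."""
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]


def build_reachable_index(subsets: List[List[Point]], d: float) -> List[List[int]]:
    """RPI[a] lists a itself plus every subset b whose reachable box intersects that of a."""
    boxes = [reachable_box(s, d) for s in subsets]
    return [
        [b for b in range(len(boxes)) if b == a or boxes_intersect(boxes[a], boxes[b])]
        for a in range(len(boxes))
    ]
```

With d = 1, the subsets [(0, 0), (1, 1)], [(2, 2), (3, 3)] and [(10, 10), (11, 11)] give the index [[0, 1], [0, 1], [2]]: the first two widened boxes overlap, and the third is reachable only from itself.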
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the reachable space coverage of data subset P_a and the reachable space coverage of data subset P_b are determined to intersect when any one of the following relations holds:
Relation 1: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation 2: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation 3: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation 4: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation 5: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a > LQ_down_b and LQ_up_a < LQ_up_b);
Relation 6: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a > LQ_down_b and LQ_up_a < LQ_up_b);
Relation 7: (LO_left_a > LO_left_b and LO_right_a < LO_right_b) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation 8: (LO_left_a > LO_left_b and LO_right_a < LO_right_b) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation 9: (LO_left_a > LO_left_b) and (LO_right_a < LO_right_b) and (LQ_up_a < LQ_up_b) and (LQ_down_a > LQ_down_b);
where LQ_up_a, LO_right_a, LQ_down_a and LO_left_a are respectively the upper, right, lower and left boundaries of the reachable space coverage of data subset P_a, and LQ_up_b, LO_right_b, LQ_down_b and LO_left_b are respectively the upper, right, lower and left boundaries of the reachable space coverage of data subset P_b.
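The nine relations combine three conditions on the longitude axis with three conditions on the latitude axis: an edge of P_b's coverage lies inside P_a's coverage on that axis (the lower edge or the upper edge), or P_b's coverage spans P_a's. The sketch below encodes them in that factored form and sets them beside a compact interval-overlap test. The function names and the box tuple layout are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (LO_left, LO_right, LQ_down, LQ_up)


def axis_cases(lo_a: float, hi_a: float, lo_b: float, hi_b: float) -> bool:
    """The three per-axis conditions that appear in the nine relations."""
    low_edge_inside = lo_a < lo_b < hi_a     # e.g. LO_left_a < LO_left_b < LO_right_a
    high_edge_inside = lo_a < hi_b < hi_a    # e.g. LO_left_a < LO_right_b < LO_right_a
    b_spans_a = lo_a > lo_b and hi_a < hi_b  # b's interval strictly contains a's
    return low_edge_inside or high_edge_inside or b_spans_a


def intersects_nine_relations(a: Box, b: Box) -> bool:
    """True when at least one of relations 1 to 9 holds for boxes a and b."""
    return axis_cases(a[0], a[1], b[0], b[1]) and axis_cases(a[2], a[3], b[2], b[3])


def intersects_compact(a: Box, b: Box) -> bool:
    """Compact interval-overlap test, shown for comparison only."""
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]
```

The two functions agree whenever the boxes share no boundary coordinate. When they do share one, for example two identical boxes, none of the strict inequalities in the nine relations holds, while the compact test still reports an overlap.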
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, performing density-based spatial clustering on the data of each data subset according to the reachable subset index specifically comprises:
calculating the core points of each data subset according to the reachable subset index;
clustering the core points in each data subset respectively.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, calculating the core points of each data subset according to the reachable subset index specifically comprises:
Step 11: select data subset a, in order, from the K1 data subsets, and select data point P_i, in order, from data subset a; set the initial value of the neighbor count N_i of data point P_i to 0 and the classification label CF to 1;
Step 12: select data subset b, in order, from the reachable subset list of data subset a;
Step 13: select data point P_j, in order, from data subset b;
Step 14: calculate the distance D_ij between data point P_i and data point P_j; if D_ij is less than the preset radius R, data point P_j is determined to be a neighbor of data point P_i, and the neighbor count of P_i is updated: N_i = N_i + 1;
Step 15: following the method of steps 13 and 14, traverse all data points in data subset b, calculating the neighbor count N_i of data point P_i by the method of step 14;
Step 16: following the method of step 12, traverse all reachable data subsets in the reachable subset list of data subset a, calculating the neighbor count N_i of data point P_i by the method of step 15;
Step 17: each time the neighbor count N_i of data point P_i is calculated, judge whether N_i exceeds the minimum neighbor number Min_N; if so, P_i is a core point: assign P_i the classification label CF_i = CF, assign the classification label MF_a to the last neighbor P_k found in this calculation, end the core point calculation for P_i, set CF = CF + 1, and, following the method of step 11, select the next data point P_(i+1) for core point calculation;
Step 18: following the method of steps 11 to 17, traverse each data point in data subset a in order, completing the core point calculation for all data points in data subset a;
Step 19: following the method of steps 11 to 18, traverse each of the K1 data subsets in order, completing the core point calculation for all data points in all data subsets.
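Steps 11 to 19 can be sketched as the nested loop below. This is a simplified illustration, not the patent's implementation: it assumes planar Euclidean distance (the distance measure is not specified above), it leaves out the MF label assigned to the last neighbor, and the function name `find_core_points` and the (subset id, position) keys are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int]  # (subset id, position inside the subset)


def find_core_points(
    subsets: List[List[Point]],
    reachable_index: List[List[int]],
    radius: float,
    min_n: int,
) -> Dict[Key, int]:
    """Map each core point to its initial classification label CF_i."""
    labels: Dict[Key, int] = {}
    cf = 1  # running classification label CF
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            n_i = 0  # neighbor count N_i
            is_core = False
            for b in reachable_index[a]:  # only reachable subsets are searched
                for p_j in subsets[b]:
                    if math.dist(p_i, p_j) < radius:
                        n_i += 1
                        if n_i > min_n:  # stop as soon as the threshold is passed
                            is_core = True
                            break
                if is_core:
                    break
            if is_core:
                labels[(a, i)] = cf
                cf += 1
    return labels
```

Because P_j ranges over every point of the reachable subsets, P_i is counted as its own neighbor here, as the steps above do not exclude it. Stopping at the first count above Min_N avoids scanning the remaining points once the outcome is decided.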
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, calculating the core points of each data subset specifically comprises:
Step 21: select data subset a, in order, from the K1 data subsets, and select data point P_i, in order, from data subset a; set the initial value of the neighbor count N_i of data point P_i to 0 and the classification label CF to 1;
Step 22: select data point P_j, in order, from data subset a;
Step 23: calculate the distance D_ij between data point P_i and data point P_j; if D_ij is less than the preset radius R, data point P_j is determined to be a neighbor of data point P_i, and the neighbor count of P_i is updated: N_i = N_i + 1;
Step 24: following the method of step 22, traverse all data points in data subset a, calculating the neighbor count N_i of data point P_i by the method of step 23;
Step 25: each time the neighbor count N_i of data point P_i is calculated, judge whether N_i exceeds the minimum neighbor number Min_N; if so, P_i is a core point: assign P_i the classification label CF_i = CF, assign the classification label MF_a to the last neighbor P_k found in this calculation, end the core point calculation for P_i, set CF = CF + 1, and, following the method of step 21, select the next data point P_(i+1) for core point calculation;
Step 26: following the method of steps 21 to 25, traverse each data point in data subset a, completing the core point calculation for all data points in data subset a;
Step 27: following the method of steps 21 to 26, traverse each of the K1 data subsets, completing the core point calculation for all data points in all data subsets.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, clustering the core points in each data subset respectively, according to the reachable subset index, specifically comprises:
Step 31: aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 32: select data subset a, in order, from the K1 data subsets, and select data point P_i, in order, from data subset a;
Step 33: select data subset b, in order, from the reachable subset list of data subset a;
Step 34: select data point P_j, in order, from data subset b;
Step 35: if the classification labels CF_i and CF_j of data points P_i and P_j are equal, perform step 37; otherwise calculate the distance D_ij between P_i and P_j; if D_ij is less than the preset radius R, perform step 36, otherwise perform step 37;
Step 36: compare the classification labels CF_i and CF_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 37: following the method of steps 34 to 36, traverse each data point in data subset b;
Step 38: following the method of steps 33 to 37, traverse each data point in each reachable subset in the reachable subset list of data subset a, completing the cluster calculation for core point P_i in data subset a;
Step 39: following the method of steps 32 to 38, traverse each data point in data subset a, completing the cluster calculation for all core points in data subset a;
Step 310: following the method of steps 31 to 39, traverse each of the K1 data subsets in order, completing the cluster calculation for each core point in each data subset.
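The label comparison of steps 35 and 36 can be sketched as follows. Several things here are assumptions rather than part of the steps above: planar Euclidean distance, the function name `merge_core_labels`, a dictionary keyed by (subset id, position) that holds core points only, and an outer loop that repeats the sweep until no label changes, so that chains of core points end up with one common label.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int]  # (subset id, position inside the subset)


def merge_core_labels(
    subsets: List[List[Point]],
    reachable_index: List[List[int]],
    labels: Dict[Key, int],
    radius: float,
) -> Dict[Key, int]:
    """Give core points closer than `radius` the smaller of their two labels."""
    cf = dict(labels)  # work on a copy; keys are core points only
    changed = True
    while changed:  # assumption: sweep again until the labels are stable
        changed = False
        for (a, i) in list(cf):
            for b in reachable_index[a]:
                for j in range(len(subsets[b])):
                    if (b, j) not in cf or cf[(a, i)] == cf[(b, j)]:
                        continue  # not a core point, or already in the same cluster
                    if math.dist(subsets[a][i], subsets[b][j]) < radius:
                        low = min(cf[(a, i)], cf[(b, j)])
                        cf[(a, i)] = cf[(b, j)] = low
                        changed = True
    return cf
```

Labels only ever decrease, so the loop terminates. Checking label equality before computing a distance, as in step 35, skips the distance calculation for pairs that are already merged.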
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, clustering the core points in each data subset respectively specifically comprises:
Step 41: aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 42: select data subset a, in order, from the K1 data subsets, and select data points P_i and P_j, in order, from data subset a;
Step 43: judge whether the classification labels CF_i and CF_j of data points P_i and P_j are equal; if they are equal, perform step 45; if not, calculate the distance D_ij between P_i and P_j; if D_ij is less than the preset radius R, perform step 44, otherwise perform step 45;
Step 44: compare the classification labels CF_i and CF_j of data points P_i and P_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 45: following the method of steps 41 to 44, traverse each data point in data subset a, completing the cluster calculation for all core points in data subset a;
Step 46: following the method of steps 41 to 45, traverse each of the K1 data subsets in order, completing the cluster calculation for each core point in each data subset.
Optionally, the above density-based large spatial data clustering algorithm K-DBSCAN further comprises the following step:
calculating the neighbor points in each data subset and clustering the neighbor points in each data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, calculating the neighbor points in each data subset and clustering the neighbor points in each data subset specifically comprises:
Step 51: select data subset a, in order, from the K1 data subsets; take the core points in data subset a as one subset, denoted CG_a, and the non-core points as one subset, denoted NCG_a;
Step 52: following the method of step 51, traverse the K1 data subsets, dividing every data subset into a core point subset CG and a non-core point subset NCG;
Step 53: select data subset a, in order, from the K1 data subsets, and select non-core point NCP_i, in order, from the non-core point subset NCG_a of data subset a;
Step 54: select data subset b, in order, from the reachable subset list of data subset a;
Step 55: select core point CP_j, in order, from the core point subset CG_b of data subset b;
Step 56: calculate the distance D_ij between non-core point NCP_i and core point CP_j; if D_ij is less than the preset radius R, non-core point NCP_i is a neighbor of core point CP_j and is clustered with it: set CF_i = CF_j;
Step 57: following the method of steps 55 and 56, traverse each core point in data subset b;
Step 58: following the method of steps 54 to 56, traverse each reachable subset in the reachable subset list of data subset a, completing the cluster calculation for non-core point NCP_i;
Step 59: following the method of steps 53 to 58, traverse each non-core point in data subset a in order, completing the cluster calculation for all non-core points in data subset a;
Step 510: following the method of steps 53 to 59, traverse each of the K1 data subsets in order, completing the cluster calculation for all non-core points in each data subset.
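Steps 51 to 510 can be sketched as below. As before, this is an illustration under assumptions: planar Euclidean distance, a hypothetical function name `assign_non_core`, and core points identified by their presence in a label dictionary keyed by (subset id, position).

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int]  # (subset id, position inside the subset)


def assign_non_core(
    subsets: List[List[Point]],
    reachable_index: List[List[int]],
    core_labels: Dict[Key, int],
    radius: float,
) -> Dict[Key, int]:
    """Copy to each non-core point the label of a core point within `radius`."""
    result = dict(core_labels)
    for a, subset_a in enumerate(subsets):
        for i, ncp in enumerate(subset_a):
            if (a, i) in core_labels:
                continue  # core points keep their own label
            for b in reachable_index[a]:
                for j, cp in enumerate(subsets[b]):
                    if (b, j) in core_labels and math.dist(ncp, cp) < radius:
                        result[(a, i)] = core_labels[(b, j)]  # CF_i = CF_j
    return result
```

In this sketch a non-core point with no core point inside the radius receives no label at all, and a non-core point near several core points keeps the label of the last one visited, since step 56 assigns on every match.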
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, calculating the neighbor points in each data subset and clustering the neighbor points in each data subset specifically comprises:
Step 61: for each of the K1 data subsets, take the core points as one subset, denoted CG, and the non-core points as one subset, denoted NCG;
Step 62: following the method of step 61, traverse the K1 data subsets, dividing every data subset into a core point subset CG and a non-core point subset NCG;
Step 63: select data subset a, in order, from the K1 data subsets; select non-core point NCP_i, in order, from the non-core point subset NCG_a of data subset a; select core point CP_j, in order, from the core point subset CG_a of data subset a;
Step 64: calculate the distance D_ij between point NCP_i and point CP_j; if D_ij is less than the preset radius R, point NCP_i is a neighbor of point CP_j and is clustered with it: set CF_i = CF_j;
Step 65: following the method of steps 63 and 64, traverse each non-core point in data subset a in order, completing the cluster calculation for all non-core points in data subset a;
Step 66: following the method of steps 63 to 65, traverse each of the K1 data subsets in order, completing the cluster calculation for all non-core points in each data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the number K of pre-classifications is obtained in the following manner:
where N is the total number of data points in the data set, Min_N is the preset minimum neighbor number, and k is a constant.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset number of division iterations T satisfies 1 ≤ T ≤ 10.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset reachable distance d satisfies R ≤ d ≤ R × 2, where R is the preset radius.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset radius R is replaced by R′, where R ≤ R′ ≤ R × 2.
The present invention also provides an electronic device for executing the density-based large spatial data clustering algorithm K-DBSCAN, characterized in that it comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
divide a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtain the reachable subsets of each data subset, and form a reachable subset index corresponding to the data subsets;
perform density-based spatial clustering on the data of each data subset according to the reachable subset index.
Compared with the prior art, the above technical scheme provided by the present invention has at least the following beneficial effects. The invention proposes a density-based large spatial data clustering algorithm, K-DBSCAN. The algorithm first divides the data using a simplified, parallelized k-means algorithm, then guides clustering with a reachable subset index, and finally performs spatial clustering with an improved distributed parallel clustering algorithm. The algorithm greatly reduces the computational complexity of density-based spatial data clustering, so that it can be widely applied to the clustering of massive data.
Brief description of the drawings
To describe the specific embodiments of the invention or the technical schemes of the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the density-based large spatial data clustering algorithm K-DBSCAN according to one embodiment of the invention;
Fig. 2 is a flow chart of the method of spatially partitioning the data set using an improved k-means clustering algorithm according to one embodiment of the invention;
Fig. 3 is a flow chart of an implementation of step S102 according to one embodiment of the invention;
Fig. 4 is a schematic diagram of the spatial coverage of a data subset according to one embodiment of the invention;
Fig. 5 is a schematic diagram of different data subsets being mutually reachable according to one embodiment of the invention;
Fig. 6 is a flow chart of an implementation of calculating the core points of each data subset according to the reachable subset index, according to one embodiment of the invention;
Fig. 7 is a flow chart of an implementation of clustering the core points in each data subset according to the reachable subset index, according to one embodiment of the invention;
Fig. 8 is a flow chart of an implementation of calculating the neighbor points in each data subset and clustering the neighbor points in each data subset, according to one embodiment of the invention;
Fig. 9 is a schematic diagram of the hardware structure of an electronic device that executes the density-based large spatial data clustering algorithm K-DBSCAN according to one embodiment of the invention.
Detailed description
Embodiment 1
This embodiment provides a density-based large spatial data clustering algorithm K-DBSCAN which, as shown in Fig. 1, comprises:
S101: divide the data set into K1 data subsets, where K1 is a natural number greater than 1.
S102: obtain the reachable subsets of each data subset, and form a reachable subset index corresponding to the data subsets.
S103: perform density-based spatial clustering on the data of each data subset according to the reachable subset index.
In this scheme, the data set is first divided into multiple data subsets, a reachable subset index is then used to guide clustering, and finally a clustering algorithm performs spatial clustering on each data subset obtained by the division. The algorithm greatly reduces the computational complexity of density-based spatial data clustering, so that it can be widely applied to the clustering of massive data.
Embodiment 2
In step S101 above, the data set can be divided in various ways, provided that each data subset obtained by the division has a definite spatial extent and definite data points. This embodiment provides one implementation, which spatially partitions the data set using an improved k-means clustering algorithm. Specifically, as shown in Fig. 2, it comprises the following steps:
S201: calculate the maximum longitude LN_max, minimum longitude LN_min, maximum latitude LA_max and minimum latitude LA_min of all data points in the data set; obtain the maximum spatial range value D_max = LN_max + LA_max and the minimum spatial range value D_min = LN_min + LA_min of the data set; then calculate the actual spatial range value of the data set, D_len = D_max - D_min.
S202: initially divide the data points in the data set according to the actual spatial range value D_len. The specific algorithm is: for any data point a, the pre-classification to which it belongs is given by the following formula: where LN_a is the longitude of data point a and LA_a is the latitude of data point a. The value of K can be calculated automatically by the following formula: where N is the total number of data points in the data set, Min_N is the minimum neighbor number, and k is an arbitrary constant.
S203: for any classification C_a, if its number of members is less than the preset minimum neighbor number Min_N, delete that pre-classification;
S204: calculate the center point of each pre-classification;
S205: calculate the distance from each data point to each center point, and assign the data point to the pre-classification of the nearest center point;
S206: repeat steps S203, S204 and S205 for T iterations, where T may be any natural number, preferably 1 ≤ T ≤ 10;
S207: the data points in the data set are thereby divided into K1 pre-classifications; each pre-classification serves as one data subset, so K1 data subsets are obtained.
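The iteration of steps S203 to S206 can be sketched as below. Because the S202 formula is not reproduced here, the sketch takes the initial pre-classification of each point as an input. That `initial` parameter, the function name `refine_partition`, the use of the arithmetic mean as the center point and planar Euclidean distance are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude), assumed layout


def refine_partition(
    points: Sequence[Point],
    initial: Sequence[int],  # hypothetical input: S202 pre-classification of each point
    min_n: int,
    iterations: int,
) -> List[List[Point]]:
    """Run S203 to S205 `iterations` times and return the resulting data subsets."""
    groups: Dict[int, List[Point]] = {}
    for point, group_id in zip(points, initial):
        groups.setdefault(group_id, []).append(point)
    for _ in range(iterations):
        # S203: delete pre-classifications with fewer than min_n members.
        kept = [members for members in groups.values() if len(members) >= min_n]
        if not kept:
            raise ValueError("every pre-classification is smaller than min_n")
        # S204: center of each remaining pre-classification (mean is an assumption).
        centers = [
            (sum(p[0] for p in m) / len(m), sum(p[1] for p in m) / len(m))
            for m in kept
        ]
        # S205: reassign every data point to the pre-classification of the nearest center.
        groups = {c: [] for c in range(len(centers))}
        for point in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
            groups[nearest].append(point)
    return [members for members in groups.values() if members]
```

The points of a deleted pre-classification are not discarded: in S205 they are reassigned to the nearest surviving center, so the number of subsets K1 can be smaller than K while every data point still belongs to some subset.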
Step S102 above can be realized as follows, as shown in Fig. 3:
S301: calculate the reachable space coverage of each data subset P as follows: compare the longitude and latitude values of all data points in data subset P to find the maximum longitude LO_max, minimum longitude LO_min, maximum latitude LQ_max and minimum latitude LQ_min among them; then, with the preset reachable distance d, obtain the reachable space coverage of data subset P: LO_right = LO_max + d, LO_left = LO_min - d, LQ_up = LQ_max + d, LQ_down = LQ_min - d, where LQ_up, LO_right, LQ_down and LO_left are respectively the upper, right, lower and left boundaries of the reachable space coverage of the data subset.
As shown in Fig. 4, the data points inside the rectangle represent one data subset (also called a pre-classification). The inner rectangle represents the spatial coverage of the data subset, and the outer rectangle represents its reachable space coverage. The gap between the inner and outer rectangles is the preset reachable distance d. As an optional implementation, the reachable distance d may take any value; it is usually set so that R ≤ d ≤ R × 2, where R is the preset radius.
S302:For data subset P calculates it up to subset list RPL each described, computational methods are:For arbitrary institute
State data subset PaAnd Pb, if there is data subset PaReachable tree coverage Ra, and data subset PbUp to empty
Between coverage Rb, and scope RaWith scope RbIt is intersecting, then define PaWith PbIt is reachable mutually, it is designated as RPLa={ a, b ... }, RPLb=
{ a, b ... }, it is clear that each data subset is the reachable subset of oneself.As shown in figure 5, for data subset P1, it is up to empty
Between coverage R1Only with data subset P2、P3、P4Reachable tree coverage R2、R3、R4Intersect, then P1With P2It is reachable mutually,
P1With P3Mutual reachable, P1With P4It is reachable mutually.In above-mentioned steps, data subset P is judgedaWith data subset PbWhether it is intersecting can be with
Whether fallen into the border of another data subset by the border for judging one of data subset, specifically can be summarized as
9 kinds of situations, can be calculated by below equation:
(1)(LOleft_a<LOleft_b<LOright_a)&(LQdown_a<LQup_b<LQup_a)
(2)(LOleft_a<LOleft_b<LOright_a)&(LQdown_a<LQdown_b<LQup_a)
(3)(LOleft_a<LOright_b<LOright_a)&(LQdown_a<LQup_b<LQup_a)
(4)(LOleft_a<LOright_b<LOright_a)&(LQdown_a<LQdown_b<LQup_a)
(5)(LOleft_a<LOleft_b<LOright_a)&(LQdown_a>LQdown_b&LQup_a<LQup_b)
(6)(LOleft_a<LOright_b<LOright_a)&(LQdown_a>LQdown_b&LQup_a<LQup_b)
(7)(LOleft_a>LOleft_b&LOright_a<LOright_b)&(LQdown_a<LQup_b<LQup_a)
(8)(LOleft_a>LOleft_b&LOright_a<LOright_b)&(LQdown_a<LQdown_b<LQup_a)
(9)(LOleft_a>LOleft_b)&(LOright_a<LOright_b)&(LQup_a<LQup_b)&(LQdown_a>LQdown_b)。
Here, & denotes logical AND. In the subscripts, right denotes the right boundary, left the left boundary, up the upper boundary and down the lower boundary; a denotes data subset Pa and b denotes data subset Pb. The meaning of each symbol can therefore be read directly from its subscript.
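The nine conditions can be sketched in Python as follows. The class name Coverage, the function name reachable and the sample rectangles are assumptions for illustration. The field names mirror LOleft, LOright, LQdown and LQup, and the numbered comments refer to conditions (1) to (9).

```python
from typing import NamedTuple


class Coverage(NamedTuple):
    lo_left: float
    lo_right: float
    lq_down: float
    lq_up: float


def reachable(a: Coverage, b: Coverage) -> bool:
    """Return True if any of conditions (1)-(9) holds for coverages a and b."""
    # Relations along the LO (left/right) axis
    b_left_in_a = a.lo_left < b.lo_left < a.lo_right
    b_right_in_a = a.lo_left < b.lo_right < a.lo_right
    a_lo_in_b = a.lo_left > b.lo_left and a.lo_right < b.lo_right
    # Relations along the LQ (down/up) axis
    b_up_in_a = a.lq_down < b.lq_up < a.lq_up
    b_down_in_a = a.lq_down < b.lq_down < a.lq_up
    a_lq_in_b = a.lq_down > b.lq_down and a.lq_up < b.lq_up
    return (
        (b_left_in_a and b_up_in_a)        # (1)
        or (b_left_in_a and b_down_in_a)   # (2)
        or (b_right_in_a and b_up_in_a)    # (3)
        or (b_right_in_a and b_down_in_a)  # (4)
        or (b_left_in_a and a_lq_in_b)     # (5)
        or (b_right_in_a and a_lq_in_b)    # (6)
        or (a_lo_in_b and b_up_in_a)       # (7)
        or (a_lo_in_b and b_down_in_a)     # (8)
        or (a_lo_in_b and a_lq_in_b)       # (9)
    )


a = Coverage(0.0, 4.0, 0.0, 4.0)
b = Coverage(3.0, 6.0, 3.0, 6.0)  # overlaps the upper-right corner of a
c = Coverage(5.0, 7.0, 5.0, 7.0)  # disjoint from a
```

Because every comparison is strict, this sketch does not report a coverage as intersecting an identical copy of itself, so the rule that each data subset is its own reachable subset has to be applied separately.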
S303: Using the formulas of step S302, evaluate each of the K1 data subsets against every other data subset to obtain the reachable subset list RPL of each data subset. Arrange the reachable subset lists RPL of all K1 data subsets in subset order to obtain a reachable subset index RPI, denoted RPI = {RPL1, ..., RPLK1}.
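A minimal sketch of assembling the index is shown below. The names build_rpi and boxes_intersect and the sample coverages are hypothetical. As a labeled simplification, a standard interval-overlap test stands in for the nine conditions of step S302.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, right, down, up)


def boxes_intersect(a: Box, b: Box) -> bool:
    # Assumption: a standard interval-overlap test stands in for the S302 conditions.
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]


def build_rpi(coverages: List[Box]) -> Dict[int, List[int]]:
    """Build RPI = {RPL_1, ..., RPL_K1}, keyed by 1-based subset number."""
    k1 = len(coverages)
    rpi: Dict[int, List[int]] = {}
    for a in range(k1):
        rpl = []
        for b in range(k1):
            # Every subset is a reachable subset of itself.
            if a == b or boxes_intersect(coverages[a], coverages[b]):
                rpl.append(b + 1)
        rpi[a + 1] = rpl
    return rpi


coverages = [
    (0.0, 4.0, 0.0, 4.0),      # subset 1
    (3.0, 7.0, 0.0, 4.0),      # subset 2, overlaps subset 1
    (10.0, 12.0, 10.0, 12.0),  # subset 3, isolated
]
rpi = build_rpi(coverages)
```

Each pair of subsets is tested independently of every other pair, so the lists could in principle be built in parallel; the sketch keeps a plain double loop for readability.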
Preferably, step S103 above may include the following steps: calculating the core points of each data subset according to the reachable subset index; clustering the core points in each data subset; and calculating the neighbor points in each data subset and clustering the neighbor points in each data subset.
When calculating the core points of each data subset, the calculation may be based on the reachable subset index, or it may be restricted to the data points of the data subset itself. Clustering only over the core points within each data subset can effectively improve clustering speed, although accuracy is affected to some degree.
Accordingly, the core points of each data subset can be calculated according to the reachable subset index using the scheme shown in Fig. 6, which comprises the following steps:
Step 11: Select data subset a, in order, from the K1 data subsets, and select data point Pi, in order, from data subset a. Set the initial value of the neighbor count Ni of data point Pi to 0 and the classification label CF to 1;
Step 12: Select data subset b, in order, from the reachable subset list of data subset a;
Step 13: Select data point Pj, in order, from data subset b;
Step 14: Calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, data point Pj is determined to be a neighbor point of data point Pi, and the neighbor count of Pi is updated: Ni = Ni + 1;
Step 15: Following the method of steps 13 and 14, traverse all data points in data subset b and calculate the neighbor count Ni of data point Pi according to the method of step 14;
Step 16: Following the method of step 12, traverse all reachable data subsets in the reachable subset list of data subset a and calculate the neighbor count Ni of data point Pi according to the method of step 15;
Step 17: Each time the neighbor count Ni of data point Pi is calculated, determine whether Ni is greater than the minimum neighbor count Min_N. If so, data point Pi is a core point: assign data point Pi the classification label CFi = CF, assign the classification label MFa to the last neighbor point Pk found in this step's calculation, end the core-point calculation for data point Pi, set CF = CF + 1, and select the next data point Pi+1 for core-point calculation according to the method of step 11;
Step 18: Following the method of steps 11 to 17, traverse each data point in data subset a in order, completing the core-point calculation for all data points in data subset a;
Step 19: Following the method of steps 11 to 18, traverse each of the K1 data subsets in order, completing the core-point calculation for all data points in all data subsets.
That is, for each data point of data subset a, every data point in every reachable subset is traversed in order; the other data subsets are then processed in the same way until all K1 data subsets have been traversed.
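The traversal of steps 11 to 19 can be sketched as follows. This is a simplified illustration under stated assumptions: the function name find_core_points, the dictionary layout and the sample points are hypothetical, the MF label of step 17 is omitted, and the early termination of step 17 is replaced by a full count.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int]  # (subset number, point index)


def find_core_points(
    subsets: Dict[int, List[Point]],
    rpi: Dict[int, List[int]],
    radius: float,
    min_n: int,
) -> Dict[Key, int]:
    """Return {(subset number, point index): CF label} for every core point."""
    labels: Dict[Key, int] = {}
    cf = 1  # next unused classification label
    for a in sorted(subsets):
        for i, p_i in enumerate(subsets[a]):
            n_i = 0
            for b in rpi[a]:          # reachable subsets, including a itself
                for p_j in subsets[b]:
                    if math.dist(p_i, p_j) < radius:
                        n_i += 1      # P_i itself is counted (distance 0)
            if n_i > min_n:
                labels[(a, i)] = cf
                cf += 1
    return labels


subsets = {
    1: [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)],
    2: [(0.4, 0.4), (5.0, 5.0)],
}
rpi = {1: [1, 2], 2: [1, 2]}
labels = find_core_points(subsets, rpi, radius=1.0, min_n=3)
```

In this sample the four points near the origin each have four neighbors within the radius (counting themselves) and become core points with labels 1 to 4, while the point at (5.0, 5.0) does not.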
It should be noted that the parameter subscripts used in the embodiments of the present invention are natural numbers used only to distinguish items and do not in substance limit the parameters. For example, when a data subset contains 100 data points, Pi denotes one data point and i may range from 1 to 100.
As shown in Fig. 7, core points may also be calculated for each subset without using the reachable subset index RPI, which improves the calculation speed of massive-data clustering at the cost of some accuracy in the clustering results. The specific calculation method can be simplified to:
Step 21: Select data subset a, in order, from the K1 data subsets, and select data point Pi, in order, from data subset a. Set the initial value of the neighbor count Ni of data point Pi to 0 and the classification label CF to 1;
Step 22: Select data point Pj, in order, from data subset a;
Step 23: Calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, data point Pj is determined to be a neighbor point of data point Pi, and the neighbor count of Pi is updated: Ni = Ni + 1;
Step 24: Following the method of step 22, traverse all data points in data subset a and calculate the neighbor count Ni of data point Pi according to the method of step 23;
Step 25: Each time the neighbor count Ni of data point Pi is calculated, determine whether Ni is greater than the minimum neighbor count Min_N. If so, data point Pi is a core point: assign data point Pi the classification label CFi = CF, assign the classification label MFa to the last neighbor point Pk found in this step's calculation, end the core-point calculation for data point Pi, set CF = CF + 1, and select the next data point Pi+1 for core-point calculation according to the method of step 21;
Step 26: Following the method of steps 21 to 25, traverse each data point in data subset a, completing the core-point calculation for all data points in data subset a;
Step 27: Following the method of steps 21 to 26, traverse each of the K1 data subsets, completing the core-point calculation for all data points in all data subsets.
That is, for each data point in data subset a, the data points in the reachable subsets are not traversed; only the other data points in data subset a itself are traversed. Once this is complete, the other data subsets are traversed in turn until all K1 data subsets have been traversed.
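The accuracy trade-off can be illustrated with a small sketch that counts the neighbors of one point in two ways: within its own subset only, and across its own subset plus one reachable subset. The helper name count_neighbors and the sample data are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def count_neighbors(p: Point, candidates: List[Point], radius: float) -> int:
    """Count candidates closer than radius to p (p itself is counted if present)."""
    return sum(1 for q in candidates if math.dist(p, q) < radius)


subsets: Dict[int, List[Point]] = {
    1: [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)],
    2: [(0.4, 0.4), (5.0, 5.0)],
}
radius, min_n = 1.0, 3
p = subsets[1][0]

# Subset a only, as in steps 21 to 27
local_count = count_neighbors(p, subsets[1], radius)
# Subset a plus reachable subset 2, as in steps 11 to 19
indexed_count = count_neighbors(p, subsets[1] + subsets[2], radius)
```

Here the point at the origin exceeds Min_N only when the neighbor in subset 2 is counted, so the subset-only variant would not mark it as a core point.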
Similarly, clustering the core points in each data subset can also be done in two ways: one clusters the core points in each data subset according to the reachable subset index, and the other does so without using the reachable subsets. Specifically:
Mode one: cluster the core points in each data subset according to the reachable subset index, which specifically includes:
Step 31: Aggregate the MF values of all core points in each data subset: for any data point Pi, if its MF value exists, set CFi = MFi;
Step 32: Select data subset a, in order, from the K1 data subsets, and select data point Pi, in order, from data subset a;
Step 33: Select data subset b, in order, from the reachable subset list of data subset a;
Step 34: Select data point Pj, in order, from data subset b;
Step 35: If the classification labels CFi and CFj of data point Pi and data point Pj are equal, go to step 37; otherwise calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, perform step 36; otherwise go to step 37;
Step 36: Compare the classification labels CFi and CFj: when CFi < CFj, set CFj = CFi; when CFi > CFj, set CFi = CFj;
Step 37: Following the method of steps 34 to 36, traverse each data point in data subset b;
Step 38: Following the method of steps 33 to 37, traverse each data point in each reachable subset in the reachable subset list of data subset a, completing the cluster calculation for core point Pi in data subset a;
Step 39: Following the method of steps 32 to 38, traverse each data point in data subset a, completing the cluster calculation for all core points in data subset a;
Step 310: Following the method of steps 31 to 39, traverse each of the K1 data subsets in order, completing the cluster calculation for each core point in each data subset.
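The label merging of steps 32 to 310 can be sketched as follows. The function name merge_core_labels, the data layout and the sample values are hypothetical, and step 31 (MF aggregation) is omitted. Repeating the pass until no label changes is an assumption added here so that chains of core points end up with one label; the text above does not state it.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int]  # (subset number, point index)


def merge_core_labels(
    subsets: Dict[int, List[Point]],
    rpi: Dict[int, List[int]],
    cf: Dict[Key, int],
    radius: float,
) -> Dict[Key, int]:
    """Merge labels of core points closer than radius, keeping the smaller label."""
    cf = dict(cf)
    changed = True
    while changed:  # assumption: repeat passes until labels stop changing
        changed = False
        for (a, i) in sorted(cf):
            for b in rpi[a]:
                for j in range(len(subsets[b])):
                    if (b, j) not in cf or cf[(a, i)] == cf[(b, j)]:
                        continue  # not a core point, or labels already equal
                    if math.dist(subsets[a][i], subsets[b][j]) < radius:
                        low = min(cf[(a, i)], cf[(b, j)])
                        cf[(a, i)] = cf[(b, j)] = low
                        changed = True
    return cf


subsets = {1: [(0.0, 0.0), (0.5, 0.0)], 2: [(1.0, 0.0), (5.0, 5.0)]}
rpi = {1: [1, 2], 2: [1, 2]}
initial = {(1, 0): 1, (1, 1): 2, (2, 0): 3, (2, 1): 4}
merged = merge_core_labels(subsets, rpi, initial, radius=0.6)
```

In the sample, the three collinear core points are linked pairwise at distance 0.5 and collapse to label 1, while the distant core point keeps label 4.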
Mode two: cluster the core points in each subset without using the reachable subset index RPI, which improves the calculation speed of massive-data clustering at the cost of some accuracy in the clustering results. The specific calculation method can be simplified to:
Step 41: Aggregate the MF values of all core points in each data subset: for any data point Pi, if its MF value exists, set CFi = MFi;
Step 42: Select data subset a, in order, from the K1 data subsets, and select data point Pi and data point Pj, in order, from data subset a;
Step 43: Determine whether the classification labels CFi and CFj of data point Pi and data point Pj are equal. If they are equal, go to step 45; if not, calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, perform step 44; otherwise go to step 45;
Step 44: Compare the classification labels CFi and CFj of data point Pi and data point Pj: when CFi < CFj, set CFj = CFi; when CFi > CFj, set CFi = CFj;
Step 45: Following the method of steps 41 to 44, traverse each data point in data subset a, completing the cluster calculation for all core points in data subset a;
Step 46: Following the method of steps 41 to 45, traverse each of the K1 data subsets in order, completing the cluster calculation for each core point in each data subset.
Similarly, the step of calculating the neighbor points in each data subset and clustering the neighbor points in each data subset can also be implemented in two ways.
Mode one: calculate and cluster the neighbor points for each subset according to the reachable subset index RPI. The calculation method is:
Step 51: Select data subset a, in order, from the K1 data subsets; take the core points in data subset a as one subset, denoted CGa, and the non-core points as another subset, denoted NCGa;
Step 52: Following the method of step 51, traverse the K1 data subsets, dividing every data subset into a core point subset CG and a non-core point subset NCG;
Step 53: Select data subset a, in order, from the K1 data subsets, and select non-core point NCPi, in order, from the non-core point subset NCGa of data subset a;
Step 54: Select data subset b, in order, from the reachable subset list of data subset a;
Step 55: Select core point CPj, in order, from the core point subset CGb of data subset b;
Step 56: Calculate the distance Di,j between non-core point NCPi and core point CPj. If Di,j is less than the preset radius R, non-core point NCPi is a neighbor point of core point CPj and is clustered: set CFi = CFj;
Step 57: Following the method of steps 55 to 56, traverse each core point in data subset b;
Step 58: Following the method of steps 54 to 56, traverse each reachable subset in the reachable subset list of data subset a, completing the cluster calculation for non-core point NCPi;
Step 59: Following the method of steps 53 to 58, traverse each non-core point in data subset a in order, completing the cluster calculation for all non-core points in data subset a;
Step 510: Following the method of steps 53 to 59, traverse each of the K1 data subsets in order, completing the cluster calculation for all non-core points in each data subset.
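Steps 51 to 510 can be sketched as follows. The function name assign_non_core, the data layout and the sample values are assumptions for illustration. The core point subsets CG are represented implicitly by the keys of the cf dictionary, and every other point is treated as belonging to NCG.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Key = Tuple[int, int]  # (subset number, point index)


def assign_non_core(
    subsets: Dict[int, List[Point]],
    rpi: Dict[int, List[int]],
    cf: Dict[Key, int],
    radius: float,
) -> Dict[Key, int]:
    """Give each non-core point within radius of a core point that point's label."""
    result = dict(cf)
    for a in sorted(subsets):
        for i, p in enumerate(subsets[a]):
            if (a, i) in cf:
                continue  # core point: already labeled
            for b in rpi[a]:
                for j, q in enumerate(subsets[b]):
                    if (b, j) in cf and math.dist(p, q) < radius:
                        result[(a, i)] = cf[(b, j)]  # CF_i = CF_j
    return result


subsets = {
    1: [(0.0, 0.0), (0.5, 0.0)],
    2: [(8.0, 8.0), (8.4, 8.0), (20.0, 20.0)],
}
rpi = {1: [1], 2: [2]}
core_labels = {(1, 0): 1, (2, 0): 2}
final = assign_non_core(subsets, rpi, core_labels, radius=0.6)
```

As in the steps above, the loop does not stop at the first match, so a non-core point near several core points takes the label of the last one visited. A non-core point with no core point within the radius, such as (20.0, 20.0) in the sample, remains unlabeled in this sketch.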
Mode two: calculate the neighbor points for each subset without using the reachable subset index RPI, which improves the calculation speed of massive-data clustering at the cost of some accuracy in the clustering results. The specific calculation method can be simplified to:
Step 61: Take the core points of each of the K1 data subsets as one subset, denoted CG, and the non-core points as another subset, denoted NCG;
Step 62: Following the method of step 61, traverse the K1 data subsets, dividing every data subset into core points CG and non-core points NCG;
Step 63: Select data subset a, in order, from the K1 data subsets; select non-core point NCPi, in order, from the non-core point subset NCGa of data subset a; and select core point CPj, in order, from the core point subset CGa of data subset a;
Step 64: Calculate the distance Di,j between point NCPi and point CPj. If Di,j is less than the preset radius R, point NCPi is a neighbor point of point CPj and is clustered: set CFj = CFi;
Step 65: Following the method of steps 63 to 64, traverse each non-core point in data subset a in order, completing the cluster calculation for all non-core points in data subset a;
Step 66: Following the method of steps 63 to 65, traverse each of the K1 data subsets in order, completing the cluster calculation for all non-core points in each data subset.
In the above scheme, the preset radius R may be replaced by R′, where R ≤ R′ ≤ R × 2. The technical solution provided by this embodiment can perform density-based unsupervised and semi-supervised clustering on large spatial data sets and achieves efficient, fast parallel clustering computation.
Embodiment 3
Fig. 9 is a schematic diagram of the hardware structure of an electronic device that executes the density-based large spatial data clustering algorithm K-DBSCAN provided by this embodiment. As shown in Fig. 9, the device includes:
one or more processors 701 and a memory 702; one processor 701 is taken as an example in Fig. 9.
The device that executes the density-based large spatial data clustering algorithm K-DBSCAN may further include an input device 703 and an output device 704.
The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 9.
As a non-volatile computer-readable storage medium, the memory 702 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the density-based large spatial data clustering algorithm K-DBSCAN in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 702, the processor 701 executes the various functional applications and data processing of the server, that is, implements the density-based large spatial data clustering algorithm K-DBSCAN of the above method embodiments.
The memory 702 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created through use of the device that executes the density-based large spatial data clustering algorithm K-DBSCAN, and the like. In addition, the memory 702 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 702 optionally includes memory located remotely from the processor 701; such remote memory may be connected over a network to the device that executes the density-based large spatial data clustering algorithm K-DBSCAN. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input device 703 can receive input numeric or character information and generate key signal inputs related to the user settings and function control of the device that executes the density-based large spatial data clustering algorithm K-DBSCAN. The output device 704 may include a display device such as a display screen.
The one or more modules are stored in the memory 702 and, when executed by the one or more processors 701, perform the density-based large spatial data clustering algorithm K-DBSCAN of any of the above method embodiments.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Claims (19)
1. A density-based large spatial data clustering algorithm K-DBSCAN, characterized by comprising:
dividing a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtaining the reachable subsets of each data subset, and forming a reachable subset index corresponding to the data subsets;
performing density-based spatial clustering on the data of each data subset according to the reachable subset index.
2. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 1, characterized in that dividing the data set into K1 data subsets, where K1 is a natural number greater than 1, specifically comprises:
obtaining the actual spatial range value Dlen of the data set;
pre-dividing the data points in the data set according to the actual spatial range value Dlen of the data set to obtain K pre-classifications, where K is a natural number greater than 1;
a pre-classification step: obtaining the center point position of each pre-classification, and assigning each data point in the data set to the pre-classification whose center point is closest to it; if the number of data points in a pre-classification is less than the preset minimum neighbor count Min_N, deleting that pre-classification;
repeating the pre-classification step T times to obtain K1 data subsets, where T is the preset number of division iterations.
3. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 2, characterized in that obtaining the actual spatial range value Dlen of the data set specifically comprises:
obtaining the maximum spatial range value Dmax and the minimum spatial range value Dmin of the data set, where Dmax = LNmax + LAmax and Dmin = LNmin + LAmin; LNmax is the maximum longitude of all data points in the data set, LNmin is the minimum longitude of all data points in the data set, LAmin is the minimum latitude of all data points in the data set, and LAmax is the maximum latitude of all data points in the data set;
obtaining the actual spatial range value of the data set, Dlen = Dmax - Dmin.
4. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 3, characterized in that pre-dividing the data points in the data set according to the actual spatial range value Dlen of the data set to obtain K pre-classifications, where K is a natural number greater than 1, specifically comprises:
for any data point a in the data set, obtaining the pre-classification to which it belongs according to the following calculation method: [formula not reproduced in this text], where LNa is the longitude value of data point a and LAa is the latitude value of data point a.
5. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 1, characterized in that obtaining the reachable subsets of each data subset and forming a reachable subset index corresponding to the data subsets specifically comprises:
determining, from the longitude and latitude values of all data points in the data subset, the maximum longitude value LOmax, minimum longitude value LOmin, minimum latitude value LQmin and maximum latitude value LQmax of the data subset;
obtaining the reachable-space coverage of each data subset: LOright = LOmax + d, LOleft = LOmin - d, LQup = LQmax + d, LQdown = LQmin - d, where d is the preset reachable distance, and LQup, LOright, LQdown and LOleft are respectively the upper boundary, right boundary, lower boundary and left boundary of the reachable-space coverage of the data subset;
calculating a reachable subset list for each data subset, as follows: for any data subset Pa and data subset Pb, if the reachable-space coverage of data subset Pa and the reachable-space coverage of data subset Pb have an intersection, data subset Pa and data subset Pb are determined to be mutually reachable, and every data subset is a reachable subset of itself, denoted RPLa = {a, b, ...} and RPLb = {a, b, ...}; all reachable subsets of each data subset constitute the reachable subset list of that data subset;
after the reachable subset list of each data subset is obtained, arranging the reachable subset lists of all K1 data subsets in data subset order to obtain the reachable subset index.
6. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 5, characterized in that when the reachable-space coverage of data subset Pa and the reachable-space coverage of data subset Pb satisfy any one of the following relations, it can be determined that the reachable-space coverage of data subset Pa and the reachable-space coverage of data subset Pb have an intersection:
Relation one: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a < LQup_b < LQup_a);
Relation two: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a < LQdown_b < LQup_a);
Relation three: (LOleft_a < LOright_b < LOright_a) and (LQdown_a < LQup_b < LQup_a);
Relation four: (LOleft_a < LOright_b < LOright_a) and (LQdown_a < LQdown_b < LQup_a);
Relation five: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a > LQdown_b and LQup_a < LQup_b);
Relation six: (LOleft_a < LOright_b < LOright_a) and (LQdown_a > LQdown_b and LQup_a < LQup_b);
Relation seven: (LOleft_a > LOleft_b and LOright_a < LOright_b) and (LQdown_a < LQup_b < LQup_a);
Relation eight: (LOleft_a > LOleft_b and LOright_a < LOright_b) and (LQdown_a < LQdown_b < LQup_a);
Relation nine: (LOleft_a > LOleft_b) and (LOright_a < LOright_b) and (LQup_a < LQup_b) and (LQdown_a > LQdown_b);
where LQup_a, LOright_a, LQdown_a and LOleft_a are respectively the upper boundary, right boundary, lower boundary and left boundary of the reachable-space coverage of data subset Pa, and LQup_b, LOright_b, LQdown_b and LOleft_b are respectively the upper boundary, right boundary, lower boundary and left boundary of the reachable-space coverage of data subset Pb.
7. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 6, characterized in that performing density-based spatial clustering on the data of each data subset according to the reachable subset index specifically comprises:
calculating the core points of each data subset according to the reachable subset index;
clustering the core points in each data subset respectively.
8. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that calculating the core points of each data subset according to the reachable subset index specifically comprises:
Step 11: Select data subset a, in order, from the K1 data subsets, and select data point Pi, in order, from data subset a. Set the initial value of the neighbor count Ni of data point Pi to 0 and the classification label CF to 1;
Step 12: Select data subset b, in order, from the reachable subset list of data subset a;
Step 13: Select data point Pj, in order, from data subset b;
Step 14: Calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, data point Pj is determined to be a neighbor point of data point Pi, and the neighbor count of Pi is updated: Ni = Ni + 1;
Step 15: Following the method of steps 13 and 14, traverse all data points in data subset b and calculate the neighbor count Ni of data point Pi according to the method of step 14;
Step 16: Following the method of step 12, traverse all reachable data subsets in the reachable subset list of data subset a and calculate the neighbor count Ni of data point Pi according to the method of step 15;
Step 17: Each time the neighbor count Ni of data point Pi is calculated, determine whether Ni is greater than the minimum neighbor count Min_N. If so, data point Pi is a core point: assign data point Pi the classification label CFi = CF, assign the classification label MFa to the last neighbor point Pk found in this step's calculation, end the core-point calculation for data point Pi, set CF = CF + 1, and select the next data point Pi+1 for core-point calculation according to the method of step 11;
Step 18: Following the method of steps 11 to 17, traverse each data point in data subset a in order, completing the core-point calculation for all data points in data subset a;
Step 19: Following the method of steps 11 to 18, traverse each of the K1 data subsets in order, completing the core-point calculation for all data points in all data subsets.
9. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 8, characterized in that calculating the core points of each data subset specifically comprises:
Step 21: Select data subset a, in order, from the K1 data subsets, and select data point Pi, in order, from data subset a. Set the initial value of the neighbor count Ni of data point Pi to 0 and the classification label CF to 1;
Step 22: Select data point Pj, in order, from data subset a;
Step 23: Calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, data point Pj is determined to be a neighbor point of data point Pi, and the neighbor count of Pi is updated: Ni = Ni + 1;
Step 24: Following the method of step 22, traverse all data points in data subset a and calculate the neighbor count Ni of data point Pi according to the method of step 23;
Step 25: Each time the neighbor count Ni of data point Pi is calculated, determine whether Ni is greater than the minimum neighbor count Min_N. If so, data point Pi is a core point: assign data point Pi the classification label CFi = CF, assign the classification label MFa to the last neighbor point Pk found in this step's calculation, end the core-point calculation for data point Pi, set CF = CF + 1, and select the next data point Pi+1 for core-point calculation according to the method of step 21;
Step 26: Following the method of steps 21 to 25, traverse each data point in data subset a, completing the core-point calculation for all data points in data subset a;
Step 27: Following the method of steps 21 to 26, traverse each of the K1 data subsets, completing the core-point calculation for all data points in all data subsets.
10. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that clustering the core points in each data subset respectively according to the reachable subset index specifically comprises:
Step 31: Aggregate the MF values of all core points in each data subset: for any data point Pi, if its MF value exists, set CFi = MFi;
Step 32: Select data subset a, in order, from the K1 data subsets, and select data point Pi, in order, from data subset a;
Step 33: Select data subset b, in order, from the reachable subset list of data subset a;
Step 34: Select data point Pj, in order, from data subset b;
Step 35: If the classification labels CFi and CFj of data point Pi and data point Pj are equal, go to step 37; otherwise calculate the distance Di,j between data point Pi and data point Pj. If Di,j is less than the preset radius R, perform step 36; otherwise go to step 37;
Step 36: Compare the classification labels CFi and CFj: when CFi < CFj, set CFj = CFi; when CFi > CFj, set CFi = CFj;
Step 37: Following the method of steps 34 to 36, traverse each data point in data subset b;
Step 38: Following the method of steps 33 to 37, traverse each data point in each reachable subset in the reachable subset list of data subset a, completing the cluster calculation for core point Pi in data subset a;
Step 39: Following the method of steps 32 to 38, traverse each data point in data subset a, completing the cluster calculation for all core points in data subset a;
Step 310: Following the method of steps 31 to 39, traverse each of the K1 data subsets in order, completing the cluster calculation for each core point in each data subset.
11. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that clustering the core points in each data subset separately specifically comprises:
Step 41: Aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 42: Select a data subset a, in order, from the K1 data subsets, and select data points P_i and P_j, in order, from data subset a;
Step 43: Determine whether the classification labels CF_i and CF_j of data points P_i and P_j are equal; if they are equal, perform step 45; if they are not equal, compute the distance D_i,j between P_i and P_j; if D_i,j is less than the preset radius R, perform step 44, otherwise perform step 45;
Step 44: Compare the classification labels CF_i and CF_j of data points P_i and P_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 45: Following steps 41 to 44, traverse each data point in data subset a, completing the cluster calculation of all core points in data subset a;
Step 46: Following steps 42 to 45, traverse in order each of the K1 data subsets, completing the cluster calculation of each core point in each data subset.
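The within-subset variant of claim 11 can be sketched as a pairwise loop over the core points of a single subset. The function name `cluster_cores_within_subset` and the data layout are hypothetical, and repeating the pairwise pass until no label changes is an assumption of this sketch: a single pass can leave a chain of nearby points with two different labels.

```python
import math
from itertools import combinations

def cluster_cores_within_subset(core_ids, points, cf, R):
    """Merge labels of core points in one subset (steps 42-45, sketch only).

    core_ids: list of core point ids in data subset a
    points:   dict point_id -> (x, y)
    cf:       dict point_id -> int label (modified in place)
    """
    changed = True
    while changed:                                  # assumption: repeat until stable
        changed = False
        for i, j in combinations(core_ids, 2):      # every pair P_i, P_j
            if cf[i] == cf[j]:
                continue                            # step 43: equal labels are skipped
            if math.dist(points[i], points[j]) < R:
                cf[i] = cf[j] = min(cf[i], cf[j])   # step 44: keep the smaller label
                changed = True
    return cf
```

With three collinear core points spaced 1 apart, R = 1.5, and starting labels 2, 3, 1, the first pass leaves the first point at label 2 while the other two reach label 1; the second pass brings all three to 1.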
12. The density-based large spatial data clustering algorithm K-DBSCAN according to any one of claims 7-11, characterized by further comprising the following step:
Computing the neighbor points in each data subset separately and clustering the neighbor points in each data subset.
13. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 12, characterized in that computing the neighbor points in each data subset separately and clustering the neighbor points in each data subset specifically comprises:
Step 51: Select a data subset a, in order, from the K1 data subsets; take the core points in data subset a as a subset, denoted CG_a, and the non-core points as a subset, denoted NCG_a;
Step 52: Following step 51, traverse the K1 data subsets, dividing every data subset into a core point subset CG and a non-core point subset NCG;
Step 53: Select a data subset a, in order, from the K1 data subsets, and select a non-core point NCP_i, in order, from the non-core point subset NCG_a of data subset a;
Step 54: Select a data subset b, in order, from the reachable subset list of data subset a;
Step 55: Select a core point CP_j, in order, from the core point subset CG_b of data subset b;
Step 56: Compute the distance D_i,j between non-core point NCP_i and core point CP_j; if D_i,j is less than the preset radius R, then NCP_i is a neighbor point of CP_j and is clustered by setting CF_i = CF_j;
Step 57: Following steps 55 to 56, traverse each core point in data subset b;
Step 58: Following steps 54 to 56, traverse each reachable subset in the reachable subset list of data subset a, completing the cluster calculation of non-core point NCP_i;
Step 59: Following steps 53 to 58, traverse in order each non-core point in data subset a, completing the cluster calculation of all non-core points in data subset a;
Step 510: Following steps 53 to 59, traverse in order each of the K1 data subsets, completing the cluster calculation of all non-core points in each data subset.
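Steps 53 to 510 can be sketched in Python as below. The names `assign_noncore_across_subsets` and `core`, and the dictionary layout, are assumptions made for illustration. As in the steps, the loop does not stop at the first match, so when several core points lie within R the last one visited determines the label. Non-core points with no core point within R simply stay unlabeled in this sketch.

```python
import math

def assign_noncore_across_subsets(points, cf, core, subsets, reachable, R):
    """Give each non-core point the label of a nearby core point (sketch only).

    points:    dict point_id -> (x, y)
    cf:        dict point_id -> int label, initially holding core points only
    core:      set of core point ids
    subsets:   dict subset_id -> list of point ids
    reachable: dict subset_id -> list of reachable subset ids
    """
    for a in sorted(subsets):                       # step 53: subset a
        for i in subsets[a]:
            if i in core:
                continue                            # only non-core points NCP_i
            for b in reachable[a]:                  # step 54: subset b
                for j in subsets[b]:                # step 55: core point CP_j
                    if j in core and math.dist(points[i], points[j]) < R:
                        cf[i] = cf[j]               # step 56: CF_i = CF_j
    return cf
```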
14. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 12, characterized in that computing the neighbor points in each data subset separately and clustering the neighbor points in each data subset specifically comprises:
Step 61: For each of the K1 data subsets, take its core points as a subset, denoted CG, and its non-core points as a subset, denoted NCG;
Step 62: Following step 61, traverse the K1 data subsets, dividing every data subset into core points CG and non-core points NCG;
Step 63: Select a data subset a, in order, from the K1 data subsets; select a non-core point NCP_i, in order, from the non-core point subset NCG_a of data subset a; select a core point CP_j, in order, from the core point subset CG_a of data subset a;
Step 64: Compute the distance D_i,j between point NCP_i and point CP_j; if D_i,j is less than the preset radius R, then NCP_i is a neighbor point of CP_j and is clustered by setting CF_j = CF_i;
Step 65: Following steps 63 to 64, traverse in order each non-core point in data subset a, completing the cluster calculation of all non-core points in data subset a;
Step 66: Following steps 63 to 65, traverse in order each of the K1 data subsets, completing the cluster calculation of all non-core points in each data subset.
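The split of step 61 and the within-subset assignment of steps 63 to 65 might look like the following Python sketch. Two points are assumptions rather than claim text. First, the sketch copies the core point's label onto the non-core point, the same direction as step 56 of claim 13, whereas step 64 is written with the assignment in the other direction. Second, the steps do not say which core point wins when several lie within R, so the sketch picks the nearest one. The function names are hypothetical.

```python
import math

def split_core_noncore(subset_ids, core):
    """Step 61: split one data subset into CG (core) and NCG (non-core)."""
    cg = [p for p in subset_ids if p in core]
    ncg = [p for p in subset_ids if p not in core]
    return cg, ncg

def assign_noncore_within_subset(subset_ids, core, points, cf, R):
    """Steps 63-65 for one subset; cf is modified in place and returned."""
    cg, ncg = split_core_noncore(subset_ids, core)
    for i in ncg:
        # distances from NCP_i to every core point of the same subset
        near = [(math.dist(points[i], points[j]), j) for j in cg]
        near = [t for t in near if t[0] < R]
        if near:
            # assumption: the nearest core point within R supplies the label
            cf[i] = cf[min(near)[1]]
    return cf
```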
15. The density-based large spatial data clustering algorithm K-DBSCAN according to any one of claims 1-14, characterized in that the pre-division number K is obtained in the following manner:
where N is the total number of data points in the data set, Min_N is the preset minimum number of neighbor points, and k is a constant.
16. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 15, characterized in that the preset number of division iterations T satisfies 1 ≤ T ≤ 10.
17. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 16, characterized in that the preset reachable distance d satisfies R ≤ d ≤ R × 2, where R is the preset radius.
18. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 17, characterized in that the preset radius R is replaced with R′, where R ≤ R′ ≤ R × 2.
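The numeric bounds in claims 16 and 17 lend themselves to a small validation sketch in Python. The class name `KDBSCANParams` and its field layout are hypothetical; only the two range checks come from the claims. The formula for K in claim 15 is not reproduced here, so K is taken as given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KDBSCANParams:
    """Hypothetical container for the preset parameters named in the claims."""
    R: float      # preset radius
    Min_N: int    # preset minimum number of neighbor points
    K: int        # pre-division number (supplied by the caller)
    T: int        # preset number of division iterations
    d: float      # preset reachable distance

    def __post_init__(self):
        # claim 16: 1 <= T <= 10
        if not 1 <= self.T <= 10:
            raise ValueError("T must satisfy 1 <= T <= 10")
        # claim 17: R <= d <= 2R
        if not self.R <= self.d <= 2 * self.R:
            raise ValueError("d must satisfy R <= d <= 2R")
```

For example, `KDBSCANParams(R=1.0, Min_N=5, K=4, T=3, d=1.5)` is accepted, while the same call with `T=11` or `d=2.5` raises `ValueError`.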
19. An electronic device for executing the density-based large spatial data clustering algorithm K-DBSCAN, characterized by comprising:
At least one processor; and
A memory communicatively connected to the at least one processor; wherein
The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
Divide the data set into K1 data subsets, where K1 is a natural number greater than 1;
Obtain the reachable subsets of each data subset, and form a reachable subset index corresponding to the data subset;
Perform density-based spatial clustering on the data of each data subset according to the reachable subset index.
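To illustrate what a reachable subset index could look like, here is a hedged Python sketch. The rule it uses is an assumption, not the rule defined in this patent: two subsets are treated as mutually reachable when the gap between their axis-aligned bounding boxes is at most the reachable distance d of claim 17. All function names are hypothetical.

```python
import math

def bounding_box(pts):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of 2-D points."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)

def box_gap(b1, b2):
    """Smallest distance between two boxes; 0.0 if they touch or overlap."""
    dx = max(b1[0] - b2[2], b2[0] - b1[2], 0.0)
    dy = max(b1[1] - b2[3], b2[1] - b1[3], 0.0)
    return math.hypot(dx, dy)

def build_reachable_index(subsets, d):
    """Map each subset id to the ids of subsets whose box lies within d.

    subsets: dict subset_id -> list of (x, y) coordinates
    """
    boxes = {k: bounding_box(v) for k, v in subsets.items()}
    return {
        a: [b for b in boxes if b != a and box_gap(boxes[a], boxes[b]) <= d]
        for a in boxes
    }
```

An index of this kind lets the later clustering steps compare a point only against points in nearby subsets instead of against the whole data set.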
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611047429.1A CN106709503B (en) | 2016-11-23 | 2016-11-23 | Large-scale spatial data clustering algorithm K-DBSCAN based on density |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611047429.1A CN106709503B (en) | 2016-11-23 | 2016-11-23 | Large-scale spatial data clustering algorithm K-DBSCAN based on density |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106709503A (en) | 2017-05-24 |
CN106709503B (en) | 2020-07-07 |
Family
ID=58934814
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611047429.1A Active CN106709503B (en) | 2016-11-23 | 2016-11-23 | Large-scale spatial data clustering algorithm K-DBSCAN based on density |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106709503B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291873A (en) * | 2017-06-16 | 2017-10-24 | 晶赞广告(上海)有限公司 | Geographical position clustering method |
CN109239792A (en) * | 2018-10-25 | 2019-01-18 | 中国石油大学(华东) | The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine |
CN109359682A (en) * | 2018-10-11 | 2019-02-19 | 北京市交通信息中心 | A kind of Shuttle Bus candidate's website screening technique based on F-DBSCAN iteration cluster |
CN110298371A (en) * | 2018-03-22 | 2019-10-01 | 北京京东尚科信息技术有限公司 | The method and apparatus of data clusters |
CN111160385A (en) * | 2019-11-27 | 2020-05-15 | 北京中交兴路信息科技有限公司 | Method, device, equipment and storage medium for aggregating mass location points |
CN111815361A (en) * | 2020-07-10 | 2020-10-23 | 北京思特奇信息技术股份有限公司 | Region boundary calculation method and device, electronic equipment and storage medium |
CN112183664A (en) * | 2020-10-27 | 2021-01-05 | 中国人民解放军陆军工程大学 | Novel density clustering method |
CN116702304A (en) * | 2023-08-08 | 2023-09-05 | 中建五局第三建设有限公司 | Method and device for grouping foundation pit design schemes based on unsupervised learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559630A (en) * | 2013-10-31 | 2014-02-05 | 华南师范大学 | Customer segmentation method based on customer attribute and behavior characteristic analysis |
CN104715160A (en) * | 2015-04-03 | 2015-06-17 | 天津工业大学 | Soft measurement modeling data outlier detecting method based on KMDB |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291873A (en) * | 2017-06-16 | 2017-10-24 | 晶赞广告(上海)有限公司 | Geographical position clustering method |
CN107291873B (en) * | 2017-06-16 | 2020-02-18 | 晶赞广告(上海)有限公司 | Geographical position clustering method |
CN110298371A (en) * | 2018-03-22 | 2019-10-01 | 北京京东尚科信息技术有限公司 | The method and apparatus of data clusters |
CN109359682A (en) * | 2018-10-11 | 2019-02-19 | 北京市交通信息中心 | A kind of Shuttle Bus candidate's website screening technique based on F-DBSCAN iteration cluster |
CN109239792A (en) * | 2018-10-25 | 2019-01-18 | 中国石油大学(华东) | The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine |
CN109239792B (en) * | 2018-10-25 | 2019-07-16 | 中国石油大学(华东) | The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine |
CN111160385A (en) * | 2019-11-27 | 2020-05-15 | 北京中交兴路信息科技有限公司 | Method, device, equipment and storage medium for aggregating mass location points |
CN111160385B (en) * | 2019-11-27 | 2023-04-18 | 北京中交兴路信息科技有限公司 | Method, device, equipment and storage medium for aggregating mass location points |
CN111815361A (en) * | 2020-07-10 | 2020-10-23 | 北京思特奇信息技术股份有限公司 | Region boundary calculation method and device, electronic equipment and storage medium |
CN112183664A (en) * | 2020-10-27 | 2021-01-05 | 中国人民解放军陆军工程大学 | Novel density clustering method |
CN116702304A (en) * | 2023-08-08 | 2023-09-05 | 中建五局第三建设有限公司 | Method and device for grouping foundation pit design schemes based on unsupervised learning |
CN116702304B (en) * | 2023-08-08 | 2023-10-20 | 中建五局第三建设有限公司 | Method and device for grouping foundation pit design schemes based on unsupervised learning |
Also Published As
Publication number | Publication date |
---|---|
CN106709503B (en) | 2020-07-07 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109255828B (en) | Method for performing intersection test to draw 3D scene image and ray tracing unit | |
CN106709503A (en) | Large spatial data clustering algorithm K-DBSCAN based on density | |
Pradhan et al. | Finding all-pairs shortest path for a large-scale transportation network using parallel Floyd-Warshall and parallel Dijkstra algorithms | |
CN112132287A (en) | Distributed quantum computing simulation method and device | |
CN114492782B (en) | On-chip core compiling and mapping method and device of neural network based on reinforcement learning | |
CN111985597B (en) | Model compression method and device | |
CN111984400A (en) | Memory allocation method and device of neural network | |
CN110132282A (en) | Unmanned plane paths planning method and device | |
JP2023546040A (en) | Data processing methods, devices, electronic devices, and computer programs | |
CN112906865B (en) | Neural network architecture searching method and device, electronic equipment and storage medium | |
CN114580606A (en) | Data processing method, data processing device, computer equipment and storage medium | |
CN112100450A (en) | Graph calculation data segmentation method, terminal device and storage medium | |
CN115756478A (en) | Method for automatically fusing operators of calculation graph and related product | |
CN115168281A (en) | Neural network on-chip mapping method and device based on tabu search algorithm | |
CN106484532B (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
CN109802859A (en) | Nodes recommendations method and server in a kind of network | |
CN113591629A (en) | Finger three-mode fusion recognition method, system, device and storage medium | |
CN116186571B (en) | Vehicle clustering method, device, computer equipment and storage medium | |
KR20190105147A (en) | Data clustering method using firefly algorithm and the system thereof | |
CN110175172B (en) | Extremely-large binary cluster parallel enumeration method based on sparse bipartite graph | |
CN113986816B (en) | Reconfigurable computing chip | |
CN108171785B (en) | SAH-KD tree design method for ray tracing | |
Dandachi et al. | A robust monte-carlo-based deep learning strategy for virtual network embedding | |
Hu et al. | Data optimization cnn accelerator design on fpga | |
CN106445960A (en) | Data clustering method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||