CN106709503A - Large spatial data clustering algorithm K-DBSCAN based on density - Google Patents

Large spatial data clustering algorithm K-DBSCAN based on density

Info

Publication number
CN106709503A
CN106709503A (application CN201611047429.1A)
Authority
CN
China
Prior art keywords
data
subset
point
data subset
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611047429.1A
Other languages
Chinese (zh)
Other versions
CN106709503B (en)
Inventor
邓超
陈智斌
郭晓惠
农英雄
韦屹
黄聪
汪倍贝
钱方远
李喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tobacco Guangxi Industrial Co Ltd
Original Assignee
China Tobacco Guangxi Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tobacco Guangxi Industrial Co Ltd filed Critical China Tobacco Guangxi Industrial Co Ltd
Priority to CN201611047429.1A priority Critical patent/CN106709503B/en
Publication of CN106709503A publication Critical patent/CN106709503A/en
Application granted granted Critical
Publication of CN106709503B publication Critical patent/CN106709503B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Abstract

The invention relates to a density-based clustering algorithm for large spatial data, K-DBSCAN. The algorithm presets the density-based clustering parameters: radius R, minimum neighbor count Min_N, pre-division count K, and division iteration count T; divides the data set into K1 data subsets according to spatial distribution; computes the reachable subsets of each data subset to form a reachable-subset index; and, based on that index, performs density-based spatial clustering on the data of each subset. With the technical scheme provided by the invention, density-based unsupervised and semi-supervised clustering can be carried out on large spatial data sets, and efficient, fast parallel clustering computation is achieved.

Description

A density-based large spatial data clustering algorithm K-DBSCAN
Technical field
The present invention relates to the field of data mining and big data analysis, and in particular to a density-based large spatial data clustering algorithm, K-DBSCAN.
Background technology
Spatial data clustering is widely used in many areas of information technology, such as data mining, pattern recognition, machine learning, artificial intelligence, visual analysis, and geographic information systems. Especially in the big data era, it can be used to explore meaningful but still unknown patterns and phenomena, and it can be applied in many fields, such as social network analysis, economic network analysis, traffic network analysis, meteorological analysis, and smart city development. Traditional distance-based spatial data clustering methods fall into three main categories: 1) partition-based clustering; 2) density-based clustering; 3) hierarchical clustering.
Density-based clustering can effectively handle noise points and identify clusters of arbitrary shape. The main algorithms include DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), and DENCLUE (DENsity-based CLUstEring). Among them, DBSCAN is the best-known density-based spatial clustering algorithm. Its computational complexity is O(N²), i.e., when the data volume grows 100-fold, its running time grows roughly 10000-fold. Although many parallelization-based methods greatly reduce the required computation time, they are still limited by the number of CPUs or GPUs in the computing platform. For example, running DBSCAN on 100 times the data volume within the same computation time would require roughly 10000 times as many CPUs or GPUs. DBSCAN therefore cannot be widely applied when massive data are involved.
Summary of the invention
The present invention aims to solve the technical problem that, in the prior art, DBSCAN is not applicable when massive data need to be clustered.
To solve the above technical problem, the present invention provides the following technical scheme:
A density-based large spatial data clustering algorithm K-DBSCAN, comprising:
dividing a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets;
performing density-based spatial clustering on the data of each data subset according to the reachable-subset index.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, dividing the data set into K1 data subsets, where K1 is a natural number greater than 1, specifically includes:
obtaining the real spatial value range Dlen of the data set;
pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen of the data set, where K is a natural number greater than 1;
a pre-classification step: obtaining the center position of each pre-class and assigning each data point in the data set to the pre-class whose center is closest to it; if the number of data points in a pre-class is less than the preset minimum neighbor count Min_N, deleting that pre-class;
repeating the pre-classification step T times to obtain K1 data subsets, where T is the preset number of division iterations.
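As a rough, illustrative sketch of this pre-division (not taken from the patent; the function and variable names, the initial assignment rule, and the use of Python are all assumptions), the following code splits a list of (longitude, latitude) points into pre-classes and refines them for T iterations:

```python
import math
from collections import defaultdict

def pre_divide(points, K, T, min_n):
    """Simplified k-means-style pre-division (illustrative sketch).

    points: list of (lon, lat) tuples; K: number of pre-classes;
    T: number of refinement iterations; min_n: minimum neighbor count Min_N.
    Returns a list of data subsets (lists of points).
    """
    # Initial assignment: spread the points over K pre-classes by spatial order.
    # (The patent derives the pre-class index of each point from its longitude
    # and latitude; the exact formula is not reproduced here.)
    ranked = sorted(points, key=lambda p: p[0] + p[1])
    classes = defaultdict(list)
    for i, p in enumerate(ranked):
        classes[i * K // len(ranked)].append(p)

    for _ in range(T):
        # Drop pre-classes with fewer than Min_N points.
        classes = {c: pts for c, pts in classes.items() if len(pts) >= min_n}
        # Recompute the center of each surviving pre-class.
        centers = {c: (sum(p[0] for p in pts) / len(pts),
                       sum(p[1] for p in pts) / len(pts))
                   for c, pts in classes.items()}
        # Reassign every point to the pre-class of the nearest center.
        new_classes = defaultdict(list)
        for p in points:
            nearest = min(centers, key=lambda c: math.dist(p, centers[c]))
            new_classes[nearest].append(p)
        classes = new_classes

    return list(classes.values())
```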
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, obtaining the real spatial value range Dlen of the data set specifically includes:
obtaining the maximum spatial value Dmax and minimum spatial value Dmin of the data set, where Dmax = LNmax + LAmax and Dmin = LNmin + LAmin, LNmax and LNmin are the maximum and minimum longitude of all data points in the data set, and LAmax and LAmin are the maximum and minimum latitude of all data points in the data set;
obtaining the real spatial value range of the data set Dlen = Dmax − Dmin.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen, where K is a natural number greater than 1, specifically includes:
for any data point a in the data set, obtaining the pre-class to which it belongs according to the following formula, where LNa is the longitude of data point a and LAa is the latitude of data point a.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets specifically includes:
determining, from the longitude and latitude values of all data points in the data subset, the maximum longitude LOmax, minimum longitude LOmin, maximum latitude LQmax and minimum latitude LQmin of the data subset;
obtaining the reachable spatial coverage of each data subset: LOright = LOmax + d, LOleft = LOmin − d, LQup = LQmax + d, LQdown = LQmin − d, where d is the preset reach distance and LQup, LOright, LQdown, LOleft are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of the data subset;
computing the reachable-subset list of each data subset, the computation being: for any data subsets Pa and Pb, if the reachable spatial coverage of Pa and the reachable spatial coverage of Pb intersect, Pa and Pb are determined to be mutually reachable; each data subset is also a reachable subset of itself. This is recorded as RPLa = {a, b, ...} and RPLb = {a, b, ...}; all reachable subsets of a data subset constitute the reachable-subset list of that data subset;
after the reachable-subset list of each data subset is obtained, arranging the reachable-subset lists of all K1 data subsets in data-subset order to obtain the reachable-subset index.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, when the reachable spatial coverage of data subset Pa and the reachable spatial coverage of data subset Pb satisfy any one of the following relations, it can be determined that the two coverages intersect:
Relation one: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a < LQup_b < LQup_a);
Relation two: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a < LQdown_b < LQup_a);
Relation three: (LOleft_a < LOright_b < LOright_a) and (LQdown_a < LQup_b < LQup_a);
Relation four: (LOleft_a < LOright_b < LOright_a) and (LQdown_a < LQdown_b < LQup_a);
Relation five: (LOleft_a < LOleft_b < LOright_a) and (LQdown_a > LQdown_b and LQup_a < LQup_b);
Relation six: (LOleft_a < LOright_b < LOright_a) and (LQdown_a > LQdown_b and LQup_a < LQup_b);
Relation seven: (LOleft_a > LOleft_b and LOright_a < LOright_b) and (LQdown_a < LQup_b < LQup_a);
Relation eight: (LOleft_a > LOleft_b and LOright_a < LOright_b) and (LQdown_a < LQdown_b < LQup_a);
Relation nine: (LOleft_a > LOleft_b) and (LOright_a < LOright_b) and (LQup_a < LQup_b) and (LQdown_a > LQdown_b);
where LQup_a, LOright_a, LQdown_a, LOleft_a are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of data subset Pa, and LQup_b, LOright_b, LQdown_b, LOleft_b are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of data subset Pb.
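Taken together, the nine relations enumerate the ways two axis-aligned rectangles can overlap. A minimal sketch of an equivalent overlap test, assuming a simple Python representation of a coverage (all names are hypothetical, and boundary-equality conventions are glossed over):

```python
from dataclasses import dataclass

@dataclass
class Coverage:
    """Reachable spatial coverage of a data subset (expanded bounding box)."""
    lo_left: float   # LOleft  = LOmin - d
    lo_right: float  # LOright = LOmax + d
    lq_down: float   # LQdown  = LQmin - d
    lq_up: float     # LQup    = LQmax + d

def coverages_intersect(a: Coverage, b: Coverage) -> bool:
    """True when the two coverages overlap, i.e. when one of the nine
    relations in the text holds (standard rectangle-intersection test)."""
    return (a.lo_left <= b.lo_right and b.lo_left <= a.lo_right and
            a.lq_down <= b.lq_up and b.lq_down <= a.lq_up)
```

The nine case-by-case comparisons thus collapse into two interval checks, one per axis.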
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, performing density-based spatial clustering on the data of each data subset according to the reachable-subset index specifically includes:
computing the core points of each data subset according to the reachable-subset index;
clustering the core points in each data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the core points of each data subset according to the reachable-subset index specifically includes:
Step 11: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 12: select data subset b from the reachable-subset list of data subset a in order;
Step 13: select data point Pj from data subset b in order;
Step 14: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 15: following steps 13 and 14, traverse all data points in data subset b and accumulate the neighbor count Ni of data point Pi according to step 14;
Step 16: following step 12, traverse all reachable data subsets in the reachable-subset list of data subset a and compute the neighbor count Ni of data point Pi according to step 15;
Step 17: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 11, select the next data point Pi+1 for core-point computation;
Step 18: following steps 11 to 17, traverse every data point in data subset a in order, completing the core-point computation of all data points in data subset a;
Step 19: following steps 11 to 18, traverse every data subset among the K1 data subsets in order, completing the core-point computation of all data points in all data subsets.
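A condensed sketch of steps 11 to 19, assuming the subsets are lists of (longitude, latitude) tuples and rpi[a] is the reachable-subset list of subset a (all names are assumptions; the early termination in step 17 and the MF labelling of the last neighbor point are omitted for brevity):

```python
import math

def mark_core_points(subsets, rpi, radius, min_n):
    """Core-point marking guided by the reachable-subset index (sketch).

    subsets: list of lists of (lon, lat) points; rpi[a]: indices of the
    subsets reachable from subset a (including a itself); radius: preset
    radius R; min_n: minimum neighbor count Min_N.
    Returns labels[a][i]: the classification label CF of point i of subset a,
    or None for points that are not core points.
    """
    labels = [[None] * len(s) for s in subsets]
    cf = 1
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            neighbors = 0
            # Count neighbors only inside the subsets reachable from a.
            for b in rpi[a]:
                for p_j in subsets[b]:
                    if math.dist(p_i, p_j) < radius:
                        neighbors += 1
            if neighbors > min_n:        # Pi is a core point
                labels[a][i] = cf
                cf += 1
    return labels
```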
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the core points of each data subset specifically includes:
Step 21: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 22: select data point Pj from data subset a in order;
Step 23: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 24: following step 22, traverse all data points in data subset a and compute the neighbor count Ni of data point Pi according to step 23;
Step 25: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 21, select the next data point Pi+1 for core-point computation;
Step 26: following steps 21 to 25, traverse every data point in data subset a, completing the core-point computation of all data points in data subset a;
Step 27: following steps 21 to 26, traverse every data subset among the K1 data subsets, completing the core-point computation of all data points in all data subsets.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, clustering the core points in each data subset according to the reachable-subset index specifically includes:
Step 31: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 32: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order;
Step 33: select data subset b from the reachable-subset list of data subset a in order;
Step 34: select data point Pj from data subset b in order;
Step 35: if the classification labels CFi and CFj of data points Pi and Pj are equal, go to step 37; otherwise compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 36, otherwise go to step 37;
Step 36: compare the classification labels CFi and CFj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 37: following steps 34 to 36, traverse every data point in data subset b;
Step 38: following steps 33 to 37, traverse every data point of every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of core point Pi of data subset a;
Step 39: following steps 32 to 38, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 310: following steps 31 to 39, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
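A single-pass sketch in the spirit of steps 31 to 310 (names assumed; the MF aggregation of step 31 is omitted). A full implementation would typically repeat the pass, or use a union-find structure, until the labels stop changing:

```python
import math

def merge_core_labels(subsets, rpi, labels, radius):
    """Merge classification labels of core points that lie within R of each
    other, scanning only the reachable subsets (illustrative sketch)."""
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            if labels[a][i] is None:            # only labelled (core) points take part
                continue
            for b in rpi[a]:
                for j, p_j in enumerate(subsets[b]):
                    if labels[b][j] is None or labels[a][i] == labels[b][j]:
                        continue
                    if math.dist(p_i, p_j) < radius:
                        # Keep the smaller label on both points.
                        low = min(labels[a][i], labels[b][j])
                        labels[a][i] = labels[b][j] = low
    return labels
```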
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, clustering the core points in each data subset specifically includes:
Step 41: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 42: select data subset a from the K1 data subsets in order, and select data points Pi and Pj from data subset a in order;
Step 43: judge whether the classification labels CFi and CFj of data points Pi and Pj are equal; if equal, go to step 45; if not, compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 44, otherwise go to step 45;
Step 44: compare the classification labels CFi and CFj of data points Pi and Pj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 45: following steps 41 to 44, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 46: following steps 42 to 45, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
Optionally, the above density-based large spatial data clustering algorithm K-DBSCAN further comprises the following step:
computing the neighbor points in each data subset and clustering the neighbor points in each data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the neighbor points in each data subset and clustering the neighbor points in each data subset specifically includes:
Step 51: select data subset a from the K1 data subsets in order, take the core points in data subset a as one subset, denoted CGa, and the non-core points as one subset NCGa;
Step 52: following step 51, traverse the K1 data subsets and partition every data subset into a core-point subset CG and a non-core-point subset NCG;
Step 53: select data subset a from the K1 data subsets in order, and select non-core point NCPi from the non-core-point subset NCGa of data subset a in order;
Step 54: select data subset b from the reachable-subset list of data subset a in order;
Step 55: select core point CPj from the core-point subset CGb of data subset b in order;
Step 56: compute the distance Di,j between non-core point NCPi and core point CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFi = CFj;
Step 57: following steps 55 to 56, traverse every core point in data subset b;
Step 58: following steps 54 to 56, traverse every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of non-core point NCPi;
Step 59: following steps 53 to 58, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 510: following steps 53 to 59, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
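An illustrative sketch of steps 51 to 510, assuming core points are identified by a non-None entry in core_labels (all names are assumptions):

```python
import math
from copy import deepcopy

def assign_non_core_points(subsets, rpi, core_labels, radius):
    """Attach each non-core point to the cluster of some core point within R,
    scanning only the reachable subsets (illustrative sketch).

    core_labels[a][i] is the label of a core point, or None for non-core
    points; the returned structure also labels the absorbed non-core points."""
    final = deepcopy(core_labels)
    for a, subset_a in enumerate(subsets):
        for i, p_i in enumerate(subset_a):
            if core_labels[a][i] is not None:    # core points keep their label
                continue
            for b in rpi[a]:
                for j, p_j in enumerate(subsets[b]):
                    if core_labels[b][j] is None:  # only core points absorb
                        continue
                    if math.dist(p_i, p_j) < radius:
                        final[a][i] = core_labels[b][j]   # CFi = CFj
                        break
                if final[a][i] is not None:
                    break
    return final
```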
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, computing the neighbor points in each data subset and clustering the neighbor points in each data subset specifically includes:
Step 61: in each of the K1 data subsets, denote the core points as one subset CG and the non-core points as one subset NCG;
Step 62: following step 61, traverse the K1 data subsets and partition every data subset into core points CG and non-core points NCG;
Step 63: select data subset a from the K1 data subsets in order, select non-core point NCPi from the non-core-point subset NCGa of data subset a in order, and select core point CPj from the core-point subset CGa of data subset a in order;
Step 64: compute the distance Di,j between NCPi and CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFj = CFi;
Step 65: following steps 63 to 64, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 66: following steps 63 to 65, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the number of pre-classes K is obtained by the following formula:
where N is the total number of data points in the data set, Min_N is the preset minimum neighbor count, and k is a constant.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset number of division iterations T satisfies 1 ≤ T ≤ 10.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset reach distance d satisfies R ≤ d ≤ 2R, where R is the preset radius.
Optionally, in the above density-based large spatial data clustering algorithm K-DBSCAN, the preset radius R may be replaced by R′, with R ≤ R′ ≤ 2R.
The present invention also provides an electronic device for executing the density-based large spatial data clustering algorithm K-DBSCAN, characterized in that it comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
divide a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtain the reachable subsets of each data subset and form the reachable-subset index corresponding to the data subsets;
perform density-based spatial clustering on the data of each data subset according to the reachable-subset index.
Compared with the prior art, the above technical scheme provided by the present invention has at least the following advantages: the present invention proposes a density-based large spatial data clustering algorithm K-DBSCAN, which first partitions the data with a simplified, parallelizable k-means algorithm, then uses a reachable-subset index to guide the clustering, and finally performs spatial clustering with an improved distributed parallel clustering algorithm. The algorithm greatly reduces the computational complexity of density-based spatial data clustering, so that it can be widely applied to clustering massive data.
Brief description of the drawings
To describe the technical schemes of the specific embodiments of the invention or of the prior art more clearly, the accompanying drawings needed for the description of the specific embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention; for a person of ordinary skill in the art, other drawings can be derived from them without creative effort.
Fig. 1 is a flow chart of the density-based large spatial data clustering algorithm K-DBSCAN according to an embodiment of the present invention;
Fig. 2 is a flow chart of the step of spatially partitioning the data set with the improved k-means clustering algorithm according to an embodiment of the present invention;
Fig. 3 is a flow chart of an implementation of step S102 according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the spatial coverage of a data subset according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of mutually reachable data subsets according to an embodiment of the present invention;
Fig. 6 is a flow chart of computing the core points of each data subset according to the reachable-subset index, according to an embodiment of the present invention;
Fig. 7 is a flow chart of clustering the core points in each data subset according to the reachable-subset index, according to an embodiment of the present invention;
Fig. 8 is a flow chart of computing the neighbor points in each data subset and clustering them, according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the hardware configuration of an electronic device that executes the density-based large spatial data clustering algorithm K-DBSCAN, according to an embodiment of the present invention.
Specific embodiment
Embodiment 1
This embodiment provides a density-based large spatial data clustering algorithm K-DBSCAN which, as shown in Fig. 1, includes:
S101: dividing a data set into K1 data subsets, where K1 is a natural number greater than 1.
S102: obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets.
S103: performing density-based spatial clustering on the data of each data subset according to the reachable-subset index.
In the above scheme, the data set is first partitioned to obtain multiple data subsets; a reachable-subset index is then used to guide the clustering; finally, each of the resulting data subsets is spatially clustered with the clustering algorithm. This greatly reduces the computational complexity of density-based spatial data clustering, so that the algorithm can be widely applied to clustering massive data.
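As a purely illustrative sketch of how the three phases fit together (all function names are assumptions; pre_divide, mark_core_points, merge_core_labels and assign_non_core_points are the sketches given earlier in this description, and build_rpi is sketched under step S102 below):

```python
def k_dbscan(points, radius, min_n, K, T, reach_d):
    """End-to-end sketch of phases S101-S103 (illustrative only)."""
    # S101: divide the data set into K1 data subsets.
    subsets = pre_divide(points, K, T, min_n)

    # S102: build the reachable-subset index RPI from the expanded
    # bounding boxes (reachable spatial coverages) of the subsets.
    rpi = build_rpi(subsets, reach_d)

    # S103: density-based clustering guided by the index.
    labels = mark_core_points(subsets, rpi, radius, min_n)
    labels = merge_core_labels(subsets, rpi, labels, radius)
    labels = assign_non_core_points(subsets, rpi, labels, radius)
    return subsets, labels
```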
Embodiment 2
In the above step S101, the data set may be divided in various ways, as long as each data subset obtained after the division has a specific spatial extent and its own data points. This embodiment provides one implementation that spatially partitions the data set with an improved k-means clustering algorithm, comprising the following steps, as shown in Fig. 2:
S201: compute the maximum longitude LNmax, minimum longitude LNmin, maximum latitude LAmax and minimum latitude LAmin of all data points in the data set; obtain the maximum spatial value Dmax = LNmax + LAmax and the minimum spatial value Dmin = LNmin + LAmin of the data set, and then compute the real spatial value range of the data set Dlen = Dmax − Dmin.
S202: perform an initial division of the data points in the data set according to the real spatial value range Dlen. Specifically, for any data point a, the pre-class to which it belongs is computed by the following formula, where LNa is the longitude of data point a and LAa is the latitude of data point a. The value of K can be computed automatically by the following formula, where N is the total number of data points in the data set, Min_N is the minimum neighbor count, and k is an arbitrary constant.
S203: for any pre-class Ca, if its number of members is less than the preset minimum neighbor count Min_N, delete that pre-class;
S204: compute the center point of each pre-class;
S205: compute the distance from each data point to each center point and assign the data point to the pre-class of the nearest center point;
S206: repeat steps S203, S204 and S205 for T iterations, where T can be any natural number, preferably 1 ≤ T ≤ 10;
S207: the data points in the data set are thereby divided into K1 pre-classes, each of which serves as one data subset, so K1 data subsets are obtained.
The above step S102 can, as shown in Fig. 3, be implemented as follows:
S301: compute the reachable spatial coverage of each data subset P. The computation is: compare the longitude and latitude values of all data points in data subset P and find the maximum longitude LOmax, minimum longitude LOmin, maximum latitude LQmax and minimum latitude LQmin over all of its data points; then, adding the preset reach distance d, obtain the reachable spatial coverage of data subset P: LOright = LOmax + d, LOleft = LOmin − d, LQup = LQmax + d, LQdown = LQmin − d, where LQup, LOright, LQdown, LOleft are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of the data subset.
As shown in Fig. 4, the data points inside the rectangle represent one data subset (which may also be called a pre-class); the inner rectangle represents the spatial coverage of the data subset and the outer rectangle represents its reachable spatial coverage. The difference between the inner and outer rectangles is the preset reach distance d. As an optional implementation, the reach distance d may take any value; it is usually set such that R ≤ d ≤ 2R, where R is the preset radius.
S302: compute the reachable-subset list RPL of each data subset P. The computation is: for any data subsets Pa and Pb, if the reachable spatial coverage Ra of data subset Pa and the reachable spatial coverage Rb of data subset Pb intersect, Pa and Pb are defined to be mutually reachable, recorded as RPLa = {a, b, ...} and RPLb = {a, b, ...}; obviously, every data subset is a reachable subset of itself. As shown in Fig. 5, for data subset P1, its reachable spatial coverage R1 intersects only the reachable spatial coverages R2, R3, R4 of data subsets P2, P3, P4, so P1 and P2 are mutually reachable, P1 and P3 are mutually reachable, and P1 and P4 are mutually reachable. In this step, whether data subsets Pa and Pb intersect can be determined by judging whether a boundary of one data subset falls within the boundaries of the other; this can be summarized as the following 9 cases, computed by the formulas below:
(1) (LOleft_a < LOleft_b < LOright_a) & (LQdown_a < LQup_b < LQup_a)
(2) (LOleft_a < LOleft_b < LOright_a) & (LQdown_a < LQdown_b < LQup_a)
(3) (LOleft_a < LOright_b < LOright_a) & (LQdown_a < LQup_b < LQup_a)
(4) (LOleft_a < LOright_b < LOright_a) & (LQdown_a < LQdown_b < LQup_a)
(5) (LOleft_a < LOleft_b < LOright_a) & (LQdown_a > LQdown_b & LQup_a < LQup_b)
(6) (LOleft_a < LOright_b < LOright_a) & (LQdown_a > LQdown_b & LQup_a < LQup_b)
(7) (LOleft_a > LOleft_b & LOright_a < LOright_b) & (LQdown_a < LQup_b < LQup_a)
(8) (LOleft_a > LOleft_b & LOright_a < LOright_b) & (LQdown_a < LQdown_b < LQup_a)
(9) (LOleft_a > LOleft_b) & (LOright_a < LOright_b) & (LQup_a < LQup_b) & (LQdown_a > LQdown_b)
Here & denotes logical AND; in the subscripts, right denotes the right boundary, left the left boundary, up the upper boundary and down the lower boundary, and a and b denote data subsets Pa and Pb respectively, so the meaning of each symbol follows directly from its subscript.
S303: apply the formulas of step S302 to each of the K1 data subsets relative to the others to obtain the reachable-subset list RPL of each data subset, and arrange the reachable-subset lists RPL of all K1 data subsets in subset order to obtain one reachable-subset index RPI, denoted RPI = {RPL1, ..., RPLK1}.
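A sketch of this index construction, reusing the hypothetical Coverage type and coverages_intersect test sketched earlier (the names and representation are assumptions):

```python
def build_rpi(subsets, reach_d):
    """Build the reachable-subset index RPI = [RPL1, ..., RPLK1] (sketch)."""
    coverages = []
    for pts in subsets:
        lons = [p[0] for p in pts]
        lats = [p[1] for p in pts]
        coverages.append(Coverage(lo_left=min(lons) - reach_d,
                                  lo_right=max(lons) + reach_d,
                                  lq_down=min(lats) - reach_d,
                                  lq_up=max(lats) + reach_d))
    rpi = []
    for a, cov_a in enumerate(coverages):
        # Every subset is reachable from itself; add every subset whose
        # reachable spatial coverage intersects this one.
        rpl = [b for b, cov_b in enumerate(coverages)
               if b == a or coverages_intersect(cov_a, cov_b)]
        rpi.append(rpl)
    return rpi
```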
Preferably, the above step S103 may include the following steps: computing the core points of each data subset according to the reachable-subset index; clustering the core points in each data subset; computing the neighbor points in each data subset and clustering the neighbor points in each data subset.
When computing the core points of each data subset, the computation can follow the reachable-subset index, or it can be restricted to the data points of the data subset itself. Clustering only within each data subset effectively increases clustering speed, but precision is affected to some extent.
Accordingly, the core points of each data subset can be computed according to the reachable-subset index with the scheme shown in Fig. 6, which comprises the following steps:
Step 11: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 12: select data subset b from the reachable-subset list of data subset a in order;
Step 13: select data point Pj from data subset b in order;
Step 14: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 15: following steps 13 and 14, traverse all data points in data subset b and accumulate the neighbor count Ni of data point Pi according to step 14;
Step 16: following step 12, traverse all reachable data subsets in the reachable-subset list of data subset a and compute the neighbor count Ni of data point Pi according to step 15;
Step 17: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 11, select the next data point Pi+1 for core-point computation;
Step 18: following steps 11 to 17, traverse every data point in data subset a in order, completing the core-point computation of all data points in data subset a;
Step 19: following steps 11 to 18, traverse every data subset among the K1 data subsets in order, completing the core-point computation of all data points in all data subsets.
That is, for each data point of data subset a, every data point of every reachable subset is traversed in order; the data subsets other than a are then traversed in turn, until all K1 data subsets have been traversed.
It should be noted that the parameter subscripts given in the embodiments of the present invention may be distinguished with natural numbers, which places no substantive restriction on the parameters; for example, when a data subset contains 100 data points, Pi denotes one data point and i ranges from 1 to 100.
As shown in Fig. 7, the core points may also be computed for each subset without using the reachable-subset index RPI, which increases the computation speed of massive data clustering but loses some precision in the clustering result. The computation can be simplified to:
Step 21: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order; the initial value of the neighbor count Ni of data point Pi is 0 and the classification label CF is 1;
Step 22: select data point Pj from data subset a in order;
Step 23: compute the distance Di,j between data point Pi and data point Pj; if Di,j is less than the preset radius R, determine that Pj is a neighbor of Pi and set Ni = Ni + 1;
Step 24: following step 22, traverse all data points in data subset a and compute the neighbor count Ni of data point Pi according to step 23;
Step 25: each time the neighbor count Ni of data point Pi has been updated, judge whether Ni exceeds the minimum neighbor count Min_N; if so, data point Pi is a core point: assign Pi the classification label CFi = CF, assign the last neighbor point Pk found in this step the label MFa, end the core-point computation for Pi, set CF = CF + 1, and, following step 21, select the next data point Pi+1 for core-point computation;
Step 26: following steps 21 to 25, traverse every data point in data subset a, completing the core-point computation of all data points in data subset a;
Step 27: following steps 21 to 26, traverse every data subset among the K1 data subsets, completing the core-point computation of all data points in all data subsets.
That is, for each data point in data subset a, the data points of the reachable subsets are not traversed; only the other data points of data subset a itself are traversed. After data subset a is finished, the other data subsets are traversed in turn, until all K1 data subsets have been traversed.
Similarly, clustering the core points in each data subset can also be done in two ways: one clusters the core points in each data subset according to the reachable-subset index, the other completes the clustering without using the reachable subsets. Specifically:
Mode one: clustering the core points in each data subset according to the reachable-subset index specifically includes:
Step 31: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 32: select data subset a from the K1 data subsets in order, and select data point Pi from data subset a in order;
Step 33: select data subset b from the reachable-subset list of data subset a in order;
Step 34: select data point Pj from data subset b in order;
Step 35: if the classification labels CFi and CFj of data points Pi and Pj are equal, go to step 37; otherwise compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 36, otherwise go to step 37;
Step 36: compare the classification labels CFi and CFj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 37: following steps 34 to 36, traverse every data point in data subset b;
Step 38: following steps 33 to 37, traverse every data point of every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of core point Pi of data subset a;
Step 39: following steps 32 to 38, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 310: following steps 31 to 39, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
Mode two: the core points in each subset are not clustered through the reachable-subset index RPI, which increases the computation speed of massive data clustering but loses some precision in the clustering result. The computation can be simplified to:
Step 41: aggregate the MF values of all core points in each data subset; for any data point Pi, if its MF value exists, set CFi = MFi;
Step 42: select data subset a from the K1 data subsets in order, and select data points Pi and Pj from data subset a in order;
Step 43: judge whether the classification labels CFi and CFj of data points Pi and Pj are equal; if equal, go to step 45; if not, compute the distance Di,j between Pi and Pj; if Di,j is less than the preset radius R, go to step 44, otherwise go to step 45;
Step 44: compare the classification labels CFi and CFj of data points Pi and Pj: if CFi < CFj, set CFj = CFi; if CFi > CFj, set CFi = CFj;
Step 45: following steps 41 to 44, traverse every data point in data subset a, completing the cluster computation of all core points in data subset a;
Step 46: following steps 42 to 45, traverse every data subset among the K1 data subsets in order, completing the cluster computation of every core point in every data subset.
Similarly, the step of computing the neighbor points in each data subset and clustering them can also be realized in two ways.
Mode one: according to the reachable-subset index RPI, compute and cluster the neighbor points of each subset. The computation is:
Step 51: select data subset a from the K1 data subsets in order, take the core points in data subset a as one subset, denoted CGa, and the non-core points as one subset NCGa;
Step 52: following step 51, traverse the K1 data subsets and partition every data subset into a core-point subset CG and a non-core-point subset NCG;
Step 53: select data subset a from the K1 data subsets in order, and select non-core point NCPi from the non-core-point subset NCGa of data subset a in order;
Step 54: select data subset b from the reachable-subset list of data subset a in order;
Step 55: select core point CPj from the core-point subset CGb of data subset b in order;
Step 56: compute the distance Di,j between non-core point NCPi and core point CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFi = CFj;
Step 57: following steps 55 to 56, traverse every core point in data subset b;
Step 58: following steps 54 to 56, traverse every reachable subset in the reachable-subset list of data subset a, completing the cluster computation of non-core point NCPi;
Step 59: following steps 53 to 58, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 510: following steps 53 to 59, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
Mode two: the neighbor points are not computed for each subset through the reachable-subset index RPI, which increases the computation speed of massive data clustering but loses some precision in the clustering result. The computation can be simplified to:
Step 61: in each of the K1 data subsets, denote the core points as one subset CG and the non-core points as one subset NCG;
Step 62: following step 61, traverse the K1 data subsets and partition every data subset into core points CG and non-core points NCG;
Step 63: select data subset a from the K1 data subsets in order, select non-core point NCPi from the non-core-point subset NCGa of data subset a in order, and select core point CPj from the core-point subset CGa of data subset a in order;
Step 64: compute the distance Di,j between NCPi and CPj; if Di,j is less than the preset radius R, NCPi is a neighbor of CPj and is clustered accordingly: set CFj = CFi;
Step 65: following steps 63 to 64, traverse every non-core point in data subset a in order, completing the cluster computation of all non-core points in data subset a;
Step 66: following steps 63 to 65, traverse every data subset among the K1 data subsets in order, completing the cluster computation of all non-core points in every data subset.
In the above schemes, the preset radius R may be replaced by R′, with R ≤ R′ ≤ 2R. The technical scheme provided by this embodiment can perform density-based unsupervised and semi-supervised clustering on large spatial data sets and achieves efficient, fast parallel cluster computation.
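Because each data subset only needs its own points plus those of its reachable subsets, the per-subset passes lend themselves to parallel execution. A minimal sketch of that idea, using Python's multiprocessing as an assumed substrate (this is not the patent's implementation; on some platforms the pool must be created under an `if __name__ == "__main__":` guard):

```python
import math
from multiprocessing import Pool

def _core_flags_for_subset(args):
    """Worker: flag the core points of one subset against its reachable subsets."""
    a, subsets, rpl, radius, min_n = args
    flags = []
    for p_i in subsets[a]:
        neighbors = sum(1 for b in rpl for p_j in subsets[b]
                        if math.dist(p_i, p_j) < radius)
        flags.append(neighbors > min_n)
    return a, flags

def parallel_core_flags(subsets, rpi, radius, min_n, workers=4):
    """Run the core-point pass for every subset in a process pool (sketch)."""
    tasks = [(a, subsets, rpi[a], radius, min_n) for a in range(len(subsets))]
    with Pool(workers) as pool:
        results = pool.map(_core_flags_for_subset, tasks)
    return [flags for _, flags in sorted(results)]
```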
Embodiment 3
Fig. 9 is a hardware architecture diagram of the electronic device, provided by this embodiment, that executes the density-based large spatial data clustering algorithm K-DBSCAN. As shown in Fig. 9, the device includes:
one or more processors 701 and a memory 702, with one processor 701 taken as an example in Fig. 9.
The device for executing the density-based large spatial data clustering algorithm K-DBSCAN may further include an input means 703 and an output means 704.
The processor 701, the memory 702, the input means 703 and the output means 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 9.
The memory 702, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the density-based large spatial data clustering algorithm K-DBSCAN in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 702, the processor 701 executes the various functional applications and data processing of the server, i.e., realizes the density-based large spatial data clustering algorithm K-DBSCAN of the above method embodiments.
The memory 702 may include a program storage area and a data storage area; the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created by the use of the device executing the density-based large spatial data clustering algorithm K-DBSCAN, and so on. In addition, the memory 702 may include high-speed random-access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 702 optionally includes memories located remotely from the processor 701, and these remote memories may be connected through a network to the device executing the density-based large spatial data clustering algorithm K-DBSCAN. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input means 703 can receive input numeric or character information and generate key-signal inputs related to the user settings and function control of the device executing the density-based large spatial data clustering algorithm K-DBSCAN. The output means 704 may include a display device such as a display screen.
The one or more modules are stored in the memory 702 and, when executed by the one or more processors 701, perform the density-based large spatial data clustering algorithm K-DBSCAN of any of the above method embodiments.
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present invention is described with reference to flow charts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that every flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device which realizes the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, may make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.

Claims (19)

1. A density-based large spatial data clustering algorithm K-DBSCAN, characterized by comprising:
dividing a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets;
performing density-based spatial clustering on the data of each data subset according to the reachable-subset index.
2. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 1, characterized in that dividing the data set into K1 data subsets, where K1 is a natural number greater than 1, specifically includes:
obtaining the real spatial value range Dlen of the data set;
pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen of the data set, where K is a natural number greater than 1;
a pre-classification step: obtaining the center position of each pre-class and assigning each data point in the data set to the pre-class whose center is closest to it; if the number of data points in a pre-class is less than the preset minimum neighbor count Min_N, deleting that pre-class;
repeating the pre-classification step T times to obtain K1 data subsets, where T is the preset number of division iterations.
3. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 2, characterized in that obtaining the real spatial value range Dlen of the data set specifically includes:
obtaining the maximum spatial value Dmax and minimum spatial value Dmin of the data set, where Dmax = LNmax + LAmax and Dmin = LNmin + LAmin, LNmax and LNmin are the maximum and minimum longitude of all data points in the data set, and LAmax and LAmin are the maximum and minimum latitude of all data points in the data set;
obtaining the real spatial value range of the data set Dlen = Dmax − Dmin.
4. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 3, characterized in that pre-dividing the data points in the data set into K pre-classes according to the real spatial value range Dlen, where K is a natural number greater than 1, specifically includes:
for any data point a in the data set, obtaining the pre-class to which it belongs according to the following formula, where LNa is the longitude of data point a and LAa is the latitude of data point a.
5. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 1, characterized in that obtaining the reachable subsets of each data subset and forming the reachable-subset index corresponding to the data subsets specifically includes:
determining, from the longitude and latitude values of all data points in the data subset, the maximum longitude LOmax, minimum longitude LOmin, maximum latitude LQmax and minimum latitude LQmin of the data subset;
obtaining the reachable spatial coverage of each data subset: LOright = LOmax + d, LOleft = LOmin − d, LQup = LQmax + d, LQdown = LQmin − d, where d is the preset reach distance and LQup, LOright, LQdown, LOleft are respectively the upper, right, lower and left boundaries of the reachable spatial coverage of the data subset;
computing the reachable-subset list of each data subset, the computation being: for any data subsets Pa and Pb, if the reachable spatial coverage of Pa and the reachable spatial coverage of Pb intersect, Pa and Pb are determined to be mutually reachable; each data subset is also a reachable subset of itself. This is recorded as RPLa = {a, b, ...} and RPLb = {a, b, ...}; all reachable subsets of a data subset constitute the reachable-subset list of that data subset;
after the reachable-subset list of each data subset is obtained, arranging the reachable-subset lists of all K1 data subsets in data-subset order to obtain the reachable-subset index.
6. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 5, characterized in that the reachable coverage range of data subset P_a and the reachable coverage range of data subset P_b are determined to intersect when any one of the following relations holds between them:
Relation one: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation two: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation three: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation four: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation five: (LO_left_a < LO_left_b < LO_right_a) and (LQ_down_a > LQ_down_b and LQ_up_a < LQ_up_b);
Relation six: (LO_left_a < LO_right_b < LO_right_a) and (LQ_down_a > LQ_down_b and LQ_up_a < LQ_up_b);
Relation seven: (LO_left_a > LO_left_b and LO_right_a < LO_right_b) and (LQ_down_a < LQ_up_b < LQ_up_a);
Relation eight: (LO_left_a > LO_left_b and LO_right_a < LO_right_b) and (LQ_down_a < LQ_down_b < LQ_up_a);
Relation nine: (LO_left_a > LO_left_b) and (LO_right_a < LO_right_b) and (LQ_up_a < LQ_up_b) and (LQ_down_a > LQ_down_b);
where LQ_up_a, LO_right_a, LQ_down_a and LO_left_a are respectively the upper, right, lower and left boundaries of the reachable coverage range of data subset P_a, and LQ_up_b, LO_right_b, LQ_down_b and LO_left_b are respectively the upper, right, lower and left boundaries of the reachable coverage range of data subset P_b.
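Relations one to nine enumerate the ways two axis-aligned rectangles can overlap, so in code they can be summarized by a standard rectangle-intersection test. The sketch below does that and builds the reachable subset index from it; it uses non-strict comparisons, whereas the claim's relations are strict, and the (left, right, down, up) tuple layout and list-of-lists index layout are assumptions carried over from the previous sketch:

```python
def ranges_intersect(box_a, box_b):
    """True when two reachable coverage ranges overlap (claim 6, relations one to nine)."""
    a_left, a_right, a_down, a_up = box_a
    b_left, b_right, b_down, b_up = box_b
    return (a_left <= b_right and b_left <= a_right and
            a_down <= b_up and b_down <= a_up)

def reachable_subset_index(boxes):
    """Reachable subset list RPL_a of every data subset, arranged in subset order (claim 5).
    `boxes` holds one reachable coverage range per data subset."""
    index = []
    for a, box_a in enumerate(boxes):
        # a data subset is always a reachable subset of itself; the overlap test adds the rest
        rpl = [b for b, box_b in enumerate(boxes) if a == b or ranges_intersect(box_a, box_b)]
        index.append(rpl)
    return index
```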
7. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 6, characterized in that carrying out density-based spatial clustering on the data of each data subset according to the reachable subset index specifically includes:
Calculate the core points of each data subset according to the reachable subset index;
Cluster the core points in each data subset respectively.
8. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that calculating the core points of each data subset according to the reachable subset index specifically includes:
Step 11: choose a data subset a from the K1 data subsets in order and choose a data point P_i from data subset a in order; the initial value of the neighbor count N_i of data point P_i is 0 and the classification label CF is 1;
Step 12: choose a data subset b in order from the reachable subset list of data subset a;
Step 13: choose a data point P_j from data subset b in order;
Step 14: calculate the distance D_i,j between data point P_i and data point P_j; if the distance D_i,j is less than the preset radius R, determine that data point P_j is a neighbor of data point P_i and set N_i = N_i + 1;
Step 15: following the methods of steps 13 and 14, traverse all data points in data subset b and accumulate the neighbor count N_i of data point P_i according to step 14;
Step 16: following the method of step 12, traverse all reachable data subsets in the reachable subset list of data subset a and accumulate the neighbor count N_i of data point P_i according to step 15;
Step 17: each time the neighbor count N_i of data point P_i has been computed, judge whether N_i is greater than the minimum neighbor quantity Min_N; if so, data point P_i is a core point: assign it the classification label CF_i = CF, add the classification label MF_a to the last neighbor point P_k found in this computation, end the core point computation for data point P_i, set CF = CF + 1 and, following the method of step 11, choose the next data point P_(i+1) for core point computation;
Step 18: following the methods of steps 11 to 17, traverse each data point in data subset a in order to complete the core point computation for all data points in data subset a;
Step 19: following the methods of steps 11 to 18, traverse each of the K1 data subsets in order to complete the core point computation for all data points in all data subsets.
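A sketch of the neighbor counting of steps 11 to 19, assuming Euclidean distance on (longitude, latitude) pairs and a reachable subset index like the one sketched after claim 6; the CF/MF label bookkeeping of step 17 is simplified to returning, for each data subset, the indices of its core points:

```python
import math

def core_points(subsets, index, radius, min_n):
    """Core point computation over the reachable subset index (claim 8, steps 11-19).

    Returns, for every data subset a, the set of indices i whose point P_i has more
    than Min_N neighbors within the preset radius R among all reachable subsets."""
    cores = []
    for a, subset_a in enumerate(subsets):
        core_a = set()
        for i, p_i in enumerate(subset_a):
            n_i = 0                                    # neighbor count N_i starts at 0
            for b in index[a]:                         # step 12: each reachable subset b of a
                for j, p_j in enumerate(subsets[b]):   # steps 13-14: each point P_j of b
                    if (a, i) != (b, j) and math.dist(p_i, p_j) < radius:
                        n_i += 1
            if n_i > min_n:                            # step 17: P_i is a core point
                core_a.add(i)
        cores.append(core_a)
    return cores
```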
9. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 8, characterized in that calculating the core points of each data subset specifically includes:
Step 21: choose a data subset a from the K1 data subsets in order and choose a data point P_i from data subset a in order; the initial value of the neighbor count N_i of data point P_i is 0 and the classification label CF is 1;
Step 22: choose a data point P_j from data subset a in order;
Step 23: calculate the distance D_i,j between data point P_i and data point P_j; if the distance D_i,j is less than the preset radius R, determine that data point P_j is a neighbor of data point P_i and set N_i = N_i + 1;
Step 24: following the method of step 22, traverse all data points in data subset a and compute the neighbor count N_i of data point P_i according to the method of step 23;
Step 25: each time the neighbor count N_i of data point P_i has been computed, judge whether N_i is greater than the minimum neighbor quantity Min_N; if so, data point P_i is a core point: assign it the classification label CF_i = CF, add the classification label MF_a to the last neighbor point P_k found in this computation, end the core point computation for data point P_i, set CF = CF + 1 and, following the method of step 21, choose the next data point P_(i+1) for core point computation;
Step 26: following the methods of steps 21 to 25, traverse each data point in data subset a to complete the core point computation for all data points in data subset a;
Step 27: following the methods of steps 21 to 26, traverse each of the K1 data subsets to complete the core point computation for all data points in all data subsets.
10. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that clustering the core points in each data subset respectively according to the reachable subset index specifically includes:
Step 31: aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 32: choose a data subset a from the K1 data subsets in order and choose a data point P_i from data subset a in order;
Step 33: choose a data subset b in order from the reachable subset list of data subset a;
Step 34: choose a data point P_j from data subset b in order;
Step 35: if the classification labels CF_i and CF_j of data points P_i and P_j are equal, perform step 37; otherwise calculate the distance D_i,j between data points P_i and P_j; if the distance D_i,j is less than the preset radius R, perform step 36, otherwise perform step 37;
Step 36: compare the classification labels CF_i and CF_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 37: following the methods of steps 34 to 36, traverse each data point in data subset b;
Step 38: following the methods of steps 33 to 37, traverse each data point of each reachable subset in the reachable subset list of data subset a, completing the cluster computation for core point P_i in data subset a;
Step 39: following the methods of steps 32 to 38, traverse each data point in data subset a, completing the cluster computation for all core points in data subset a;
Step 310: following the methods of steps 31 to 39, traverse each of the K1 data subsets in order, completing the cluster computation for each core point in each data subset.
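The merging of steps 31 to 310 can be read as: whenever two core points from mutually reachable subsets lie within R of each other, the larger of their two labels is replaced by the smaller. The sketch below starts from one distinct label per core point and repeats the merge until the labels stop changing, which is a simplification of the claim's single ordered pass with MF aggregation; all names are assumptions:

```python
import math

def cluster_core_points(subsets, index, cores, radius):
    """Merge classification labels of core points (claim 10): core points closer than R,
    taken from mutually reachable subsets, end up sharing the smaller label."""
    cf = {}
    label = 1
    for a, core_a in enumerate(cores):        # assumed start: one distinct label per core point
        for i in sorted(core_a):
            cf[(a, i)] = label
            label += 1
    changed = True
    while changed:                            # repeat until no label changes (simplification)
        changed = False
        for a, core_a in enumerate(cores):
            for i in core_a:
                for b in index[a]:            # steps 33-34: reachable subsets and their core points
                    for j in cores[b]:
                        if cf[(a, i)] == cf[(b, j)]:
                            continue          # step 35: equal labels, nothing to merge
                        if math.dist(subsets[a][i], subsets[b][j]) < radius:
                            low = min(cf[(a, i)], cf[(b, j)])   # step 36: keep the smaller label
                            cf[(a, i)] = cf[(b, j)] = low
                            changed = True
    return cf
```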
11. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 7, characterized in that clustering the core points in each data subset respectively specifically includes:
Step 41: aggregate the MF values of all core points in each data subset; for any data point P_i, if its MF value exists, set CF_i = MF_i;
Step 42: choose a data subset a from the K1 data subsets in order and choose data points P_i and P_j from data subset a in order;
Step 43: judge whether the classification labels CF_i and CF_j of data points P_i and P_j are equal; if equal, perform step 45; if not equal, calculate the distance D_i,j between data points P_i and P_j; if the distance D_i,j is less than the preset radius R, perform step 44, otherwise perform step 45;
Step 44: compare the classification labels CF_i and CF_j of data points P_i and P_j: when CF_i < CF_j, set CF_j = CF_i; when CF_i > CF_j, set CF_i = CF_j;
Step 45: following the methods of steps 41 to 44, traverse each data point in data subset a, completing the cluster computation for all core points in data subset a;
Step 46: following the methods of steps 41 to 45, traverse each of the K1 data subsets in order, completing the cluster computation for each core point in each data subset.
12. The density-based large spatial data clustering algorithm K-DBSCAN according to any one of claims 7-11, characterized in that it further comprises the following step:
Calculate the neighbor points in each data subset respectively and cluster the neighbor points in each data subset.
13. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 12, characterized in that calculating the neighbor points in each data subset respectively and clustering the neighbor points in each data subset specifically includes:
Step 51: choose a data subset a from the K1 data subsets in order; the core points in data subset a form one subset, recorded as CG_a, and the non-core points form one subset NCG_a;
Step 52: following the method of step 51, traverse the K1 data subsets and divide every data subset into a core point subset CG and a non-core point subset NCG;
Step 53: choose a data subset a from the K1 data subsets in order and choose a non-core point NCP_i in order from the non-core point subset NCG_a of data subset a;
Step 54: choose a data subset b in order from the reachable subset list of data subset a;
Step 55: choose a core point CP_j in order from the core point subset CG_b of data subset b;
Step 56: calculate the distance D_i,j between non-core point NCP_i and core point CP_j; if the distance D_i,j is less than the preset radius R, non-core point NCP_i is a neighbor of core point CP_j and is clustered accordingly: set CF_i = CF_j;
Step 57: following the methods of steps 55 to 56, traverse each core point in data subset b;
Step 58: following the methods of steps 54 to 56, traverse each reachable subset in the reachable subset list of data subset a, completing the cluster computation for non-core point NCP_i;
Step 59: following the methods of steps 53 to 58, traverse each non-core point in data subset a in order, completing the cluster computation for all non-core points in data subset a;
Step 510: following the methods of steps 53 to 59, traverse each of the K1 data subsets in order, completing the cluster computation for all non-core points in each data subset.
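A sketch of the non-core point assignment of steps 51 to 510: every non-core point within R of a core point in a reachable subset inherits that core point's label. `cores` and `cf` are assumed to come from the earlier sketches; as in the claim, the whole reachable subset list is traversed, so a later match overwrites an earlier one:

```python
import math

def cluster_non_core_points(subsets, index, cores, cf, radius):
    """Cluster the non-core (neighbor) points (claim 13): a non-core point NCP_i within
    the preset radius R of a core point CP_j in a reachable subset takes CF_i = CF_j."""
    for a, subset_a in enumerate(subsets):
        non_core_a = [i for i in range(len(subset_a)) if i not in cores[a]]   # NCG_a
        for i in non_core_a:                    # step 53: each non-core point NCP_i of a
            for b in index[a]:                  # step 54: each reachable subset b of a
                for j in cores[b]:              # step 55: each core point CP_j of b
                    if math.dist(subset_a[i], subsets[b][j]) < radius:
                        cf[(a, i)] = cf[(b, j)]     # step 56: CF_i = CF_j
    return cf
```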
14. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 12, characterized in that calculating the neighbor points in each data subset respectively and clustering the neighbor points in each data subset specifically includes:
Step 61: within each of the K1 data subsets, record the core points as one subset CG and the non-core points as one subset NCG;
Step 62: following the method of step 61, traverse the K1 data subsets and divide every data subset into core points CG and non-core points NCG;
Step 63: choose a data subset a from the K1 data subsets in order; choose a non-core point NCP_i in order from the non-core point subset NCG_a of data subset a, and choose a core point CP_j in order from the core point subset CG_a of data subset a;
Step 64: calculate the distance D_i,j between point NCP_i and point CP_j; if D_i,j is less than the preset radius R, point NCP_i is a neighbor of point CP_j and is clustered accordingly: set CF_i = CF_j;
Step 65: following the methods of steps 63 to 64, traverse each non-core point in data subset a in order, completing the cluster computation for all non-core points in data subset a;
Step 66: following the methods of steps 63 to 65, traverse each of the K1 data subsets in order, completing the cluster computation for all non-core points in each data subset.
15. The density-based large spatial data clustering algorithm K-DBSCAN according to any one of claims 1-14, characterized in that the number K of pre-classes is obtained in the following manner:
where N is the total number of data points in the data set, Min_N is the preset minimum neighbor quantity, and k is a constant.
16. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 15, characterized in that the preset number of division iterations T satisfies 1 ≤ T ≤ 10.
17. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 16, characterized in that the preset reachable distance d satisfies R ≤ d ≤ R × 2, where R is the preset radius.
18. The density-based large spatial data clustering algorithm K-DBSCAN according to claim 17, characterized in that the preset radius R may be replaced by R', where R ≤ R' ≤ R × 2.
19. An electronic device for executing the density-based large spatial data clustering algorithm K-DBSCAN, characterized by comprising:
at least one processor; and
a memory communicatively connected with the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
divide a data set into K1 data subsets, where K1 is a natural number greater than 1;
obtain the reachable subset of each data subset and form a reachable subset index corresponding to the data subsets; and
carry out density-based spatial clustering on the data of each data subset according to the reachable subset index.
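Tying the sketches above together, a hypothetical driver for the three operations of claim 19 could look as follows; every helper it calls (predivide, coverage_range, reachable_subset_index, core_points, cluster_core_points, cluster_non_core_points) is one of the assumed sketches given after the earlier claims, not an API defined by the patent:

```python
def k_dbscan(points, radius, min_n, k, t, d):
    """End-to-end sketch: pre-divide, build the reachable subset index, then cluster."""
    subsets = predivide(points, k, min_n, t)               # K1 data subsets
    boxes = [coverage_range(s, d) for s in subsets]        # reachable coverage ranges
    index = reachable_subset_index(boxes)                  # reachable subset index
    cores = core_points(subsets, index, radius, min_n)     # core points per subset
    cf = cluster_core_points(subsets, index, cores, radius)
    cf = cluster_non_core_points(subsets, index, cores, cf, radius)
    return subsets, cf                                     # cluster label per (subset, point) pair
```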
CN201611047429.1A 2016-11-23 2016-11-23 Large-scale spatial data clustering algorithm K-DBSCAN based on density Active CN106709503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611047429.1A CN106709503B (en) 2016-11-23 2016-11-23 Large-scale spatial data clustering algorithm K-DBSCAN based on density

Publications (2)

Publication Number Publication Date
CN106709503A true CN106709503A (en) 2017-05-24
CN106709503B CN106709503B (en) 2020-07-07

Family

ID=58934814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611047429.1A Active CN106709503B (en) 2016-11-23 2016-11-23 Large-scale spatial data clustering algorithm K-DBSCAN based on density

Country Status (1)

Country Link
CN (1) CN106709503B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559630A (en) * 2013-10-31 2014-02-05 华南师范大学 Customer segmentation method based on customer attribute and behavior characteristic analysis
CN104715160A (en) * 2015-04-03 2015-06-17 天津工业大学 Soft measurement modeling data outlier detecting method based on KMDB

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291873A (en) * 2017-06-16 2017-10-24 晶赞广告(上海)有限公司 Geographical position clustering method
CN107291873B (en) * 2017-06-16 2020-02-18 晶赞广告(上海)有限公司 Geographical position clustering method
CN110298371A (en) * 2018-03-22 2019-10-01 北京京东尚科信息技术有限公司 The method and apparatus of data clusters
CN109359682A (en) * 2018-10-11 2019-02-19 北京市交通信息中心 A kind of Shuttle Bus candidate's website screening technique based on F-DBSCAN iteration cluster
CN109239792A (en) * 2018-10-25 2019-01-18 中国石油大学(华东) The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine
CN109239792B (en) * 2018-10-25 2019-07-16 中国石油大学(华东) The satellite-derived gravity data data and shipborne gravimetric data data fusion method that fractal interpolation and net―function combine
CN111160385A (en) * 2019-11-27 2020-05-15 北京中交兴路信息科技有限公司 Method, device, equipment and storage medium for aggregating mass location points
CN111160385B (en) * 2019-11-27 2023-04-18 北京中交兴路信息科技有限公司 Method, device, equipment and storage medium for aggregating mass location points
CN111815361A (en) * 2020-07-10 2020-10-23 北京思特奇信息技术股份有限公司 Region boundary calculation method and device, electronic equipment and storage medium
CN112183664A (en) * 2020-10-27 2021-01-05 中国人民解放军陆军工程大学 Novel density clustering method
CN116702304A (en) * 2023-08-08 2023-09-05 中建五局第三建设有限公司 Method and device for grouping foundation pit design schemes based on unsupervised learning
CN116702304B (en) * 2023-08-08 2023-10-20 中建五局第三建设有限公司 Method and device for grouping foundation pit design schemes based on unsupervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant