CN105740371A - Density-based incremental clustering data mining method and system - Google Patents

Density-based incremental clustering data mining method and system

Info

Publication number
CN105740371A
Authority
CN
China
Prior art keywords
data
cluster
class label
increment
meet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610055222.2A
Other languages
Chinese (zh)
Inventor
毛睿
张贺
陆敏华
廖好
李荣华
王毅
刘刚
许红龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201610055222.2A
Publication of CN105740371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is suitable for the field of data mining, and provides a density-based incremental clustering data mining method. The method comprises the following steps: carrying out clustering on an original data set by adopting a DBSCAN algorithm so as to obtain data with class labels; when new data is added to the data with the class labels, carrying out incremental clustering on the data with the class labels by adopting an Incremental DBSCAN algorithm. The invention furthermore provides a density-based incremental clustering data mining system. Through the technical scheme provided by the invention, the waste of calculation resources caused by repeated clustering can be avoided, the efficiency of the incremental clustering can be improved, the timeliness of the data mining can be strengthened and the efficiency of the data mining can be improved.

Description

Density-based incremental clustering data mining method and system
Technical field
The present invention relates to the field of data mining, and in particular to a density-based incremental clustering data mining method and system.
Background art
With the rapid development of computer and network technologies, the scale and range of application of databases keep expanding. People have more and more ways of obtaining data, data acquisition is increasingly automated and therefore easier, and people's understanding of data is steadily deepening.
Many application fields, such as telecommunications, the Internet, logistics, economics, military affairs, finance and biomedicine, now generate large amounts of data of different types. These data can be roughly divided into deterministic data and uncertain data. Deterministic data can be further divided into spatial data (i.e. multidimensional data) and non-spatial data (such as DNA sequences), and uncertain data can be further divided into tuple-uncertain data and attribute-uncertain data. Moreover, unlike traditional static data, these data grow gradually over time. Traditional data analysis methods have difficulty analyzing them effectively, and how to mine valuable information from them quickly and efficiently has attracted the attention of many researchers.
In recent years, data mining has become a focus of attention. The goal of data mining is to extract latent, valuable patterns and knowledge from massive data. Cluster analysis is an important means of data mining. The clustering algorithms in wide use today are usually suitable only for static data sets; for a dynamic data set, the clustering algorithm has to be run again on the whole data set after new data is added, which inevitably leads to low clustering efficiency and wasted computing resources.
Therefore, improving the efficiency of data mining on dynamic data sets has long been a goal the industry urgently needs to achieve.
Summary of the invention
In view of this, the purpose of the embodiments of the present invention is to provide a density-based incremental clustering data mining method and system, intended to solve the problem in the prior art that data mining on dynamic data sets is inefficient.
An embodiment of the present invention is implemented as follows: a density-based incremental clustering data mining method, comprising:
clustering an original data set using the DBSCAN algorithm to obtain data with class labels;
when new data is added to the data with class labels, performing incremental clustering on the data with class labels using the Incremental DBSCAN algorithm; and
superimposing the results of the two clustering processes to form the final data mining result.
Preferably, the step of performing incremental clustering on the data with class labels using the Incremental DBSCAN algorithm when new data is added to the data with class labels comprises:
determining the data operation performed on the data with class labels, and processing the data with class labels according to the type of data operation; and
performing incremental clustering on the data with class labels according to the data type.
Preferably, the step of determining the data operation performed on the data with class labels and processing the data with class labels according to the type of data operation comprises:
when an insertion operation is performed: if the inserted object meets a preset noise condition, labeling the inserted object as noise; if the inserted object meets a preset condition for creating a new cluster, creating a new cluster for the inserted object; if the inserted object meets a preset condition for absorption into a single cluster, absorbing the inserted object into that cluster; and if the inserted object meets a preset condition for merging different clusters, merging the different clusters together with the inserted object;
when a deletion operation is performed: if the object to be deleted meets a preset deletion condition, deleting the object; if the object to be deleted meets a preset cluster-elimination condition, labeling the other objects in the same cluster as this object as noise; if the object to be deleted meets a preset condition for reducing the objects in a cluster, labeling the object as noise; and if the object to be deleted meets a preset cluster-splitting condition, splitting the cluster of the object into one or more clusters.
Preferably, the step of performing incremental clustering on the data with class labels according to the data type comprises:
when the data type is spatial data:
building a hash table according to the maximum and minimum values of the abscissas and of the ordinates of the data with class labels;
building an index according to the hash table;
searching the built index with a given data point to find the data points that satisfy a preset condition.
Preferably, the step of performing incremental clustering on the data with class labels according to the data type further comprises:
when the data type is non-spatial data:
partitioning the data with class labels according to a three-way partitioning algorithm;
building an index using an MVP-Tree structure; and
searching the index with a given data point to find the data points that satisfy a preset condition;
wherein the three-way partitioning algorithm specifically comprises:
computing the distance from each data point in the current subtask to the designated pivot (support point);
finding, among all the computed distances, the two quantiles that divide the data set into three parts;
if the absolute difference between the two quantiles is less than 2r, finding the median a of the data set and using the values a+r and a-r as the two values that divide the data set into three parts, where r is the search radius;
assigning the data points whose distances differ from the two quantiles to the appropriate part, and storing the data points whose distances equal a quantile together in a separate store;
assigning the data points in that store to the appropriate part according to variance.
In another aspect, the present invention also provides a density-based incremental clustering data mining system, comprising:
a first clustering module, configured to cluster an original data set using the DBSCAN algorithm to obtain data with class labels;
a second clustering module, configured to perform incremental clustering on the data with class labels using the Incremental DBSCAN algorithm when new data is added to the data with class labels; and
a cluster superposition module, configured to superimpose the results of the two clustering processes to form the final data mining result.
Preferably, the second clustering module specifically comprises:
an operation mode submodule, configured to determine the data operation performed on the data with class labels and to process the data with class labels according to the type of data operation; and
a data type submodule, configured to perform incremental clustering on the data with class labels according to the data type.
Preferably, the operation mode submodule specifically comprises:
an insertion operation submodule, configured to, when an insertion operation is performed: label the inserted object as noise if it meets a preset noise condition; create a new cluster for the inserted object if it meets a preset condition for creating a new cluster; absorb the inserted object into a single cluster if it meets a preset condition for absorption into that cluster; and merge the different clusters together with the inserted object if it meets a preset condition for merging different clusters;
a deletion operation submodule, configured to, when a deletion operation is performed: delete the object to be deleted if it meets a preset deletion condition; label the other objects in the same cluster as this object as noise if it meets a preset cluster-elimination condition; label the object as noise if it meets a preset condition for reducing the objects in a cluster; and split the cluster of the object into one or more clusters if it meets a preset cluster-splitting condition.
Preferably, the data type submodule specifically comprises a spatial data submodule, wherein
the spatial data submodule is configured to, when the data type is spatial data:
build a hash table according to the maximum and minimum values of the abscissas and of the ordinates of the data with class labels;
build an index according to the hash table;
search the built index with a given data point to find the data points that satisfy a preset condition.
Preferably, the data type submodule specifically comprises a non-spatial data submodule, wherein
the non-spatial data submodule is configured to, when the data type is non-spatial data:
partition the data with class labels according to a three-way partitioning algorithm;
build an index using an MVP-Tree structure; and
search the index with a given data point to find the data points that satisfy a preset condition;
wherein the three-way partitioning algorithm specifically comprises:
computing the distance from each data point in the current subtask to the designated pivot (support point);
finding, among all the computed distances, the two quantiles that divide the data set into three parts;
if the absolute difference between the two quantiles is less than 2r, finding the median a of the data set and using the values a+r and a-r as the two values that divide the data set into three parts, where r is the search radius;
assigning the data points whose distances differ from the two quantiles to the appropriate part, and storing the data points whose distances equal a quantile together in a separate store;
assigning the data points in that store to the appropriate part according to variance.
For spatial data, the technical solution of the present invention adopts a grid-based dynamic hash structure, which reduces the time complexity of a range search from the original O(n^α), 0≤α≤1, to O(1) and thereby improves the efficiency of incremental clustering. For non-spatial data, the technical solution adopts an MVP-Tree structure together with a three-way partitioning data partition algorithm, which reduces the range search time and improves the efficiency of incremental clustering. Building on the PDBSCAN clustering algorithm, the technical solution proposes for the first time an Incremental PDBSCAN clustering algorithm for dynamic attribute-uncertain data, improving clustering efficiency by several orders of magnitude. The technical solution provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, strengthens the timeliness of data mining, and improves the efficiency of data mining.
Brief description of the drawings
Fig. 1 is a flowchart of the density-based incremental clustering data mining method in an embodiment of the present invention;
Fig. 2 is a detailed sub-step flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention;
Fig. 3 is a detailed sub-step flowchart of step S121 shown in Fig. 2 in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the structure of the hash table in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the affected cells of the grid in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the three-way partitioning method in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the density-based incremental clustering data mining system in an embodiment of the present invention;
Fig. 8 is a schematic diagram of the internal structure of the second clustering module 12 shown in Fig. 7 in an embodiment of the present invention;
Fig. 9 is a schematic diagram of the internal structure of the operation mode submodule 121 shown in Fig. 8 in an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
A specific embodiment of the present invention provides a density-based incremental clustering data mining method, which mainly comprises the following steps:
S11: clustering an original data set using the DBSCAN algorithm to obtain data with class labels; and
S12: when new data is added to the data with class labels, performing incremental clustering on the data with class labels using the Incremental DBSCAN algorithm.
The incremental clustering data mining method provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, strengthens the timeliness of data mining, and improves the efficiency of data mining.
The density-based incremental clustering data mining method provided by the present invention is described in detail below.
Fig. 1 is a flowchart of the incremental clustering data mining method in an embodiment of the present invention.
In step S11, the original data set is clustered using the DBSCAN algorithm to obtain data with class labels.
In this embodiment, clustering means gathering objects into classes according to certain user-defined points of interest, forming several classes with good similarity, such that the similarity within a class is as large as possible and the similarity between classes is as small as possible. Clustering is an important means of data mining.
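As an illustration of step S11, the following is a minimal brute-force DBSCAN sketch. The patent names the algorithm but gives no code, so the function names, the `NOISE` marker and the use of Euclidean distance are assumptions made for this example only.

```python
from math import dist

NOISE = -1  # hypothetical label used here for noise points

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one class label per point; NOISE marks noise."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later become a border point
            continue
        labels[i] = cluster_id             # i is a core point: start a cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # noise reached from a core point is a border point
            if labels[j] is not None:
                continue                   # already labeled, do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels
```

The resulting label list plays the role of the "data with class labels" that step S12 then updates incrementally. The brute-force `region_query` is the part that the index structures described later are meant to replace.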
In step S12, when new data is added to the data with class labels, incremental clustering is performed on the data with class labels using the Incremental DBSCAN algorithm.
In this embodiment, incremental clustering has two features: (1) new data, arriving singly or in batches, is clustered on the basis of the existing clustering result; (2) the clustering result is updated dynamically. Incremental data mining targets large data sets and updates the mining result incrementally, rather than re-clustering the whole data set after every update.
In this embodiment, step S12 of performing incremental clustering on the data with class labels using the Incremental DBSCAN algorithm when new data is added specifically comprises two sub-steps, S121 and S122, as shown in Fig. 2.
Fig. 2 is a detailed sub-step flowchart of step S12 shown in Fig. 1 in an embodiment of the present invention.
In step S121, the data operation performed on the data with class labels is determined, and the data with class labels is processed according to the type of data operation.
In this embodiment, the data operations include insertion and deletion. Step S121 specifically comprises two sub-steps, S1211 and S1212, as shown in Fig. 3.
Fig. 3 is a detailed sub-step flowchart of step S121 shown in Fig. 2 in an embodiment of the present invention.
In step S1211, when an insertion operation is performed: if the inserted object meets a preset noise condition, the inserted object is labeled as noise; if the inserted object meets a preset condition for creating a new cluster, a new cluster is created for the inserted object; if the inserted object meets a preset condition for absorption into a single cluster, the inserted object is absorbed into that cluster; and if the inserted object meets a preset condition for merging different clusters, the different clusters are merged together with the inserted object.
In this embodiment, let D be the data set and p the inserted object.
UpSeed_ins(p) = {q | q is a core object in D ∪ {p}, ∃ q′: q′ is a core point in D ∪ {p}, q′ is not a core point in D, and q ∈ N_Eps(q′)}
where q and q′ are both obtained by similarity range searches.
When an insertion operation is performed, there are four different cases:
(1) Noise: UpSeed_ins(p) is empty and N_Eps(p) contains no core object; p is labeled as noise.
(2) Creation of a new cluster: UpSeed_ins(p) contains core objects, these core objects do not belong to any cluster, and none of their density-reachable objects is a core object of a known cluster.
(3) Absorption into a cluster, which covers: a) all core objects in UpSeed_ins(p) belonged to the same cluster before p was inserted; b) the core objects in UpSeed_ins(p) belong to different clusters and, after p is inserted, the objects of the different clusters are still not density-reachable from one another, so p is assigned to one of those clusters; c) UpSeed_ins(p) is empty but N_Eps(p) contains a core object.
(4) Merging of different clusters: the core objects in UpSeed_ins(p) did not all belong to the same cluster before p was inserted, and the insertion of p makes the core objects of these clusters density-reachable from one another; these clusters and p are then merged into one cluster.
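The four insertion cases can be sketched as a small decision function. This is only an illustration under simplifying assumptions: the inputs are taken as already computed by range searches, `upseed_labels` (a hypothetical name) holds the cluster label that each core object of UpSeed_ins(p) had before the insertion, with `None` meaning "no cluster", and the extra reachability check of case (2) is assumed to be reflected in those labels.

```python
def classify_insertion(upseed_labels, core_in_neighborhood,
                       reachable_after_insert=True):
    """Return which of the four insertion cases applies to the new object p.

    upseed_labels          -- pre-insertion cluster labels of the core objects
                              in UpSeed_ins(p); None means unclustered
    core_in_neighborhood   -- True if N_Eps(p) contains a core object
    reachable_after_insert -- True if cores of different clusters become
                              density-reachable once p is inserted
    """
    if not upseed_labels:
        # Cases (1) and (3c): no new or affected core objects.
        return "absorb" if core_in_neighborhood else "noise"
    clusters = {label for label in upseed_labels if label is not None}
    if not clusters:
        return "create"   # case (2): only previously unclustered cores
    if len(clusters) == 1:
        return "absorb"   # case (3a): all clustered cores share one cluster
    # Several clusters: merge them (case 4) unless they stay unreachable (3b).
    return "merge" if reachable_after_insert else "absorb"
```

For example, `classify_insertion([1, 2], True)` reports a merge of clusters 1 and 2, while an empty `upseed_labels` with no core object nearby reports noise.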
In step S1212, when a deletion operation is performed: if the object to be deleted meets a preset deletion condition, the object is deleted; if the object to be deleted meets a preset cluster-elimination condition, the other objects in the same cluster as this object are labeled as noise; if the object to be deleted meets a preset condition for reducing the objects in a cluster, the object is labeled as noise; and if the object to be deleted meets a preset cluster-splitting condition, the cluster of the object is split into one or more clusters.
In this embodiment, let D be the data set and p the deleted object.
UpSeed_del(p) = {q | q is a core object in D \ {p}, ∃ q′: q′ is a core point in D, q′ is not a core point in D \ {p}, and q ∈ N_Eps(q′)}
where q and q′ are both obtained by similarity range searches.
When a deletion operation is performed, there are four different cases:
(1) Noise: if p is noise, p is deleted and the other objects remain unchanged.
(2) Cluster elimination: UpSeed_del(p) is empty, p was not noise before the deletion, and after p is deleted N_Eps(p) no longer contains any core object; the other objects that belonged to the cluster of p are then labeled as noise.
(3) Reduction of the objects in a cluster: all objects in UpSeed_del(p) are directly density-reachable from one another, or UpSeed_del(p) is empty but N_Eps(p) still contains a core object; after p is deleted these objects still belong to the same cluster, so p is simply deleted, and some points may become noise.
(4) Cluster split: the objects in UpSeed_del(p) are not directly density-reachable from one another. In this case it must be checked whether these objects are still density-reachable through the other core objects of their original cluster, so the procedure has to consider the core objects of that cluster outside UpSeed_del(p). If the objects in UpSeed_del(p) are mutually density-reachable through other core objects of the original cluster, the cluster need not be split and the retrieval of further objects stops. If, after all density-reachable objects have been retrieved, the objects in UpSeed_del(p) are still not mutually density-reachable, the original cluster splits between the objects of UpSeed_del(p) into one or more clusters.
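The deletion cases admit a similar sketch. As before, this is an assumption-laden illustration rather than the patented procedure: the boolean inputs stand for facts that the range searches and reachability checks would have to establish, and all parameter names are hypothetical.

```python
def classify_deletion(p_is_noise, upseed, core_left_in_neighborhood,
                      directly_reachable=True, reachable_via_other_cores=True):
    """Return which deletion case applies to the deleted object p.

    p_is_noise                -- True if p was labeled noise before deletion
    upseed                    -- the objects in UpSeed_del(p)
    core_left_in_neighborhood -- True if N_Eps(p) still holds a core object
    directly_reachable        -- True if the objects in UpSeed_del(p) are
                                 directly density-reachable from one another
    reachable_via_other_cores -- True if they are density-reachable through
                                 other core objects of the original cluster
    """
    if p_is_noise:
        return "noise"        # case (1): just delete p
    if not upseed:
        # Case (2) if no core object remains near p, otherwise case (3).
        return "reduce" if core_left_in_neighborhood else "eliminate"
    if directly_reachable:
        return "reduce"       # case (3): the cluster only loses objects
    # Case (4): a split happens only if no connecting path through the
    # remaining core objects of the original cluster exists.
    return "keep" if reachable_via_other_cores else "split"
```

The `"keep"` outcome corresponds to the branch of case (4) in which the original cluster does not need to be divided.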
Referring again to Fig. 2, in step S122, incremental clustering is performed on the data with class labels according to the data type.
In this embodiment, the data types are spatial data, non-spatial data and uncertain data.
In this embodiment, when the data type is spatial data, step S122 of performing incremental clustering on the data with class labels according to the data type specifically comprises:
building a hash table according to the maximum and minimum values of the abscissas and of the ordinates of the data with class labels;
building an index according to the hash table;
searching the built index with a given data point to find the data points that satisfy a preset condition.
In this embodiment, each element of the hash table stores the root node of a tree; each root node points to its child nodes, and so on down to the leaf nodes. Suppose the maximum and minimum abscissas of the data with class labels are maxX and minX, and the maximum and minimum ordinates are maxY and minY. A hash table of size (maxX-minX)*(maxY-minY) is first created. Each element of the hash table is a structure containing: (1) the maximum and minimum abscissas; (2) the maximum and minimum ordinates; (3) the indices of the data points mapped to this bucket; (4) a set of pointers to the next-layer structure. The structure of the hash table is shown in Fig. 4. The grid-based hash structure is built in three general steps: first, determine the position of each data point in the grid; second, map the point into the hash table according to this position using a suitable hash function; third, continue mapping the point downwards layer by layer with the hash function until a leaf node is reached. Each element of the first layer of the hash table corresponds to one grid cell; each subsequent layer is a further division of that cell, and each division splits a cell into 4 parts. The information stored in each node is: 1. the upper and lower bounds of the abscissa of the cell; 2. the upper and lower bounds of the ordinate of the cell; 3. the data points it contains; 4. whether it is a leaf node; 5. its index number; 6. pointers to the next layer.
Fig. 4 is a schematic diagram of the structure of the hash table in an embodiment of the present invention.
In this embodiment, each entry of the hash table has four pointers to the next layer. The next layer is itself a hash table of the same form with four entries in total, each structured like an entry of the first layer. Mapping proceeds downwards layer by layer in this manner until the termination condition is reached.
In this embodiment, the index is built as follows. A regular two-dimensional grid is first built from the original data set D; the side length of each cell c in the grid is […]. Each non-empty cell is divided into 4 cells of equal size; if a cell c′ produced by the division is non-empty, the step is repeated until the side length of c′ is no more than […], where ∈ is the search radius and ρ is the approximation parameter. When building the index, the position of a data point in the grid is determined first, and the point is then mapped by the hash function into the corresponding bucket of the hash table. Suppose the two-dimensional grid has colNum columns, each cell has width Wcell, and the data point has grid abscissa axisX and grid ordinate axisY; the hash function can then be set to Hash(key)=axisX*colNum+axisY. When a new data point has to be added to the hash table, it is first placed at its position in the grid, then mapped into the corresponding bucket by the hash function, and finally mapped further within that bucket until a leaf node is reached.
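A minimal sketch of the first-layer mapping may make the hash function concrete. The text gives Hash(key)=axisX*colNum+axisY but not how axisX and axisY are derived from a point, so the floor-based `cell_of` below is an assumption, as are all function names. The sketch covers only the first layer and omits the recursive four-way subdivision of each bucket.

```python
from collections import defaultdict
from math import floor

def cell_of(x, y, min_x, min_y, w_cell):
    """Grid coordinates (axisX, axisY) of a point; assumed floor-based mapping."""
    return floor((x - min_x) / w_cell), floor((y - min_y) / w_cell)

def hash_key(axis_x, axis_y, col_num):
    """Hash(key) = axisX * colNum + axisY, as stated in the text."""
    return axis_x * col_num + axis_y

def build_table(points, min_x, min_y, w_cell, col_num):
    """Map each point index into the bucket of its first-layer cell."""
    table = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        axis_x, axis_y = cell_of(x, y, min_x, min_y, w_cell)
        table[hash_key(axis_x, axis_y, col_num)].append(idx)
    return table

def insert_point(table, points, point, min_x, min_y, w_cell, col_num):
    """Add a new point: locate its cell, then append it to that bucket."""
    points.append(point)
    axis_x, axis_y = cell_of(point[0], point[1], min_x, min_y, w_cell)
    table[hash_key(axis_x, axis_y, col_num)].append(len(points) - 1)
```

Note that this key is collision-free only while axisY stays below colNum; a grid with more rows than colNum would need a different multiplier.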
In this embodiment, a range search uses a given data point to search the built index. For a given data point A (A.X, A.Y), its position in the grid is computed according to […] and […]. Because the width of a first-layer cell is […], 21 cells are affected, as illustrated in Fig. 5.
Fig. 5 is a schematic diagram of the affected cells of the grid in an embodiment of the present invention.
In this embodiment, the range search examines the 21 affected cells of Fig. 5 to find the data points that satisfy the condition. For a single non-empty cell c, the range search for a point q distinguishes three cases: (1) if c and B(q, ∈) are disjoint, c is ignored; (2) if c is completely covered by B(q, ∈(1+ρ)), all points in c are treated as ∈-neighbors of q; (3) otherwise, check whether c is a leaf cell: if it is not, its child nodes are searched in the same way; if it is, all points in c are marked as ∈-neighbors of q. These steps find the candidate data points within the range ∈(1+ρ). The candidates must then be filtered again: the distance from each candidate to point A is computed, and the candidate is recorded if this distance is less than or equal to ∈ and discarded otherwise.
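The range search over the 21 affected cells can be sketched as follows. The cell width does not survive in the text, so this example assumes a first-layer width of eps / sqrt(2), a width for which the cells that can hold eps-neighbors are exactly the 5 x 5 block around the query cell minus its four corners. The sketch scans first-layer cells only and applies the exact distance filter directly, skipping the hierarchical ∈(1+ρ) approximation; all names are hypothetical.

```python
from collections import defaultdict
from math import dist, floor, sqrt

def cell_index(point, eps, min_x, min_y):
    """Grid cell of a point, assuming first-layer cells of width eps / sqrt(2)."""
    w = eps / sqrt(2)
    return floor((point[0] - min_x) / w), floor((point[1] - min_y) / w)

def build_grid(points, eps, min_x, min_y):
    """Bucket point indices by first-layer cell."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell_index(p, eps, min_x, min_y)].append(idx)
    return grid

def affected_offsets():
    """The 5 x 5 block around a cell minus its four corners: 21 cells."""
    return [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
            if abs(dx) + abs(dy) < 4]

def range_search(points, grid, a, eps, min_x, min_y):
    """Indices of all points within eps of a, scanning only affected cells."""
    cx, cy = cell_index(a, eps, min_x, min_y)
    hits = []
    for dx, dy in affected_offsets():
        for idx in grid.get((cx + dx, cy + dy), []):
            if dist(points[idx], a) <= eps:   # final exact filter
                hits.append(idx)
    return sorted(hits)
```

Because the number of cells examined is fixed, the cost of a query depends on how many points those cells hold rather than on the size of the whole data set.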
Referring again to Fig. 2, in this embodiment, when the data type is non-spatial data, step S122 of performing incremental clustering on the data with class labels according to the data type specifically comprises:
partitioning the data with class labels according to the three-way partitioning algorithm;
building an index using an MVP-Tree structure; and
searching the index with a given data point to find the data points that satisfy a preset condition;
wherein the three-way partitioning algorithm specifically comprises:
computing the distance from each data point in the current subtask to the designated pivot (support point);
finding, among all the computed distances, the two quantiles that divide the data set into three parts;
if the absolute difference between the two quantiles is less than 2r, finding the median a of the data set and using the values a+r and a-r as the two values that divide the data set into three parts, where r is the search radius;
assigning the data points whose distances differ from the two quantiles to the appropriate part, and storing the data points whose distances equal a quantile together in a separate store;
assigning the data points in that store to the appropriate part according to variance.
In this embodiment, the MVP-Tree is an index structure for metric spaces and is more efficient than the M-Tree index structure. However, the MVP-Tree has always used balanced partitioning for data division: this method divides the data points evenly into the specified number of parts without considering how the data points are distributed, which makes similarity search inefficient. To remedy this defect, the present invention adopts a new data partitioning method, namely three-way partitioning. This method takes the data distribution into account when the index is built and therefore produces a more reasonable division of the data, as shown in Fig. 6.
Refer to Fig. 6, which is a schematic diagram of the three-way partitioning method in an embodiment of the present invention.
In this embodiment, to avoid situations in which pruning is impossible, the three-way partitioning algorithm makes the width of the middle part at least 2r, where r is the search radius. If m2-m1 is greater than 2r, the width of the middle part is set to m2-m1; otherwise it is set to 2r. The left part spans 0 to m1, and the right part spans m2 to u.
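The three-way partitioning steps can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `three_way_split` is hypothetical, the two quantiles are assumed to be the terciles taken from the sorted distances, and distances equal to a split point are simply placed in the middle part, so the variance-based placement described above is not modeled.

```python
import statistics

def three_way_split(distances, r):
    """Split pivot distances into left / middle / right parts.

    distances: distances from each data point to the support point.
    r: search radius; the middle part is kept at least 2*r wide.
    """
    if not distances:
        raise ValueError("distances must not be empty")
    ds = sorted(distances)
    n = len(ds)
    # Assumption: use the terciles of the sorted distances as the two quantiles.
    m1, m2 = ds[n // 3], ds[(2 * n) // 3]
    if abs(m2 - m1) < 2 * r:
        # Middle part too narrow: centre it on the median with width 2*r.
        a = statistics.median(ds)
        m1, m2 = a - r, a + r
    left = [d for d in distances if d < m1]
    # Simplification: ties with a split point go to the middle part.
    middle = [d for d in distances if m1 <= d <= m2]
    right = [d for d in distances if d > m2]
    return (m1, m2), left, middle, right
```

With a middle part at least 2r wide, a range query of radius r around any query distance cannot overlap both the left and the right part, which is what allows at least one part to be pruned.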
Still referring to Fig. 2, in this embodiment, when the data type is uncertain data, a probability density function is used to represent each uncertain data object because of the nature of uncertain data. Consequently, concepts such as the distance between data objects, core point, ε-neighborhood, and direct density-reachability need to be redefined.
In this embodiment, given two uncertain data points o_i and o_j whose probability density functions are f_i and f_j respectively, where x is the uncertain dimension of o_i, y is the uncertain dimension of o_j, the two are mutually independent, and Eps is the distance threshold, the probability that d(o_i, o_j) ≤ Eps is defined as:
$$P_{d(o_i,o_j)\le Eps}=\int_{x\in R^m}\int_{y\in R^m} f_i(x)\,f_j(y)\,dx\,dy,\quad \forall\, d(o_i,o_j)\le Eps \tag{1}$$
Let o_p be an uncertain data point in data set D. The probabilistic ε-neighborhood of o_p is denoted PNeighborhood(o_p), where:
$$PNeighborhood(o_p)=\{\,o_i \mid P_{d(o_i,o_j)\le Eps}>0\,\} \tag{2}$$
Let o_p be an uncertain data point in data set D. Then PN_Eps(o_p) is defined as:
$$PN_{Eps}(o_p)=\sum_{o_i\in PNeighborhood(o_p)} P_{d(o_i,o_j)\le Eps} \tag{3}$$
Let MinPts be the minimum number of data points contained in the ε-neighborhood of a core point. If PN_Eps(o_p) ≥ MinPts, then o_p is called a core point.
Let o_p be an uncertain data point in data set D. The probability that o_p is a core point is denoted P^{core}_{Eps,MinPts,D}(o_p), where:
$$P^{core}_{Eps,MinPts,D}(o_p)=PN_{Eps}(o_p)\,/\,|PN_{Eps}(o_p)| \tag{4}$$
Given two uncertain data points o_p and o_q in data set D, where o_p is a core point, the probability that o_q is directly density-reachable from o_p is denoted P^{dir-reach}_{Eps,MinPts,D}(o_q, o_p), where:
$$P^{dir\text{-}reach}_{Eps,MinPts,D}(o_q,o_p)=P^{dir\text{-}reach}_{Eps,MinPts-1,D\setminus\{o_q\}}(o_p)\cdot P_{d(o_i,o_j)\le Eps}=\frac{PN_{Eps}(o_p)-P_{d(o_i,o_j)\le Eps}}{|PN_{Eps}(o_p)|-1}\cdot P_{d(o_i,o_j)\le Eps} \tag{5}$$
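Definitions (1) to (3) and the core point test can be made concrete with a small Python sketch. It rests on assumptions that are not in the source: each uncertain object is represented by a finite list of (location, probability) samples instead of a continuous density, so the double integral becomes a double sum; the neighborhood sum runs over every object in the data set, including o_p itself; and all function names are hypothetical.

```python
import math
from itertools import product

def prob_within(obj_i, obj_j, eps):
    """Discrete analogue of Eq. (1).

    obj_i, obj_j: lists of (location, probability) samples, drawn independently.
    Returns the total probability mass of sample pairs at distance <= eps.
    """
    return sum(px * py
               for (x, px), (y, py) in product(obj_i, obj_j)
               if math.dist(x, y) <= eps)

def pn_eps(p_index, objects, eps):
    """Eqs. (2)-(3): sum the probabilities of all objects with P > 0."""
    probs = (prob_within(objects[p_index], o, eps) for o in objects)
    return sum(pr for pr in probs if pr > 0)

def is_core(p_index, objects, eps, min_pts):
    """Core point test: PN_Eps(o_p) >= MinPts."""
    return pn_eps(p_index, objects, eps) >= min_pts
```

For example, an object with two equally likely locations (0, 0) and (1, 0) and a second object located at (0.5, 0) with certainty are within Eps = 0.6 of each other with probability 1.0, because both sample pairs are 0.5 apart.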
In this embodiment, the IncrementalPDBSCAN algorithm applies the PDBSCAN algorithm to an incremental database. It improves the PDBSCAN algorithm, which can only process static uncertain data, so that dynamic uncertain data can also be processed. The specific processing steps include: clustering the data with the PDBSCAN algorithm; calculating the seed objects and the ε-neighborhood of the updated point p; and performing incremental clustering according to the four cases of insertion and deletion.
For spatial data, the density-based incremental clustering data mining method provided by the present invention adopts a grid-based dynamic hash data structure, which reduces the time complexity of range search from the original O(n^α), 0 < α < 1, to O(1) and improves the efficiency of incremental clustering. For non-spatial data, the method adopts the MVP-Tree structure together with the three-way partitioning data partitioning algorithm, which reduces the time of range search and improves the efficiency of incremental clustering. For uncertain data, the IncrementalPDBSCAN algorithm can process dynamic uncertain data. Building on the PDBSCAN clustering algorithm, the technical solution proposes for the first time the IncrementalPDBSCAN clustering algorithm for dynamic attribute-uncertain data, improving clustering efficiency by several orders of magnitude. The technical solution provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, enhances the timeliness of data mining, and improves the efficiency of data mining.
A specific embodiment of the present invention also provides a density-based incremental clustering data mining system 10, which mainly includes:
A first clustering module 11, configured to cluster the original data set using the DBSCAN algorithm to obtain labeled data (data with class labels);
A second clustering module 12, configured to perform incremental clustering on the labeled data using the IncrementalDBSCAN algorithm when new data is added to the labeled data.
The incremental clustering data mining system 10 provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, enhances the timeliness of data mining, and improves the efficiency of data mining.
Refer to Fig. 7, which is a schematic structural diagram of the incremental clustering data mining system 10 in an embodiment of the present invention. In this embodiment, the incremental clustering data mining system 10 includes a first clustering module 11 and a second clustering module 12.
The first clustering module 11 is configured to cluster the original data set using the DBSCAN algorithm to obtain labeled data.
In this embodiment, clustering mining means grouping the transaction objects into classes according to certain manually defined points of interest, forming several classes with good similarity, such that the similarity within a class is as large as possible and the similarity between classes is as small as possible. In this embodiment, clustering mining is an important means of data mining.
The second clustering module 12 is configured to perform incremental clustering on the labeled data using the IncrementalDBSCAN algorithm when new data is added to the labeled data.
In this embodiment, incremental clustering has the following two features: (1) new data, arriving singly or in batches, is clustered on the basis of the existing clustering result; (2) the clustering result is updated dynamically. In this embodiment, incremental data mining targets large data sets and updates the data mining result incrementally, rather than re-clustering the whole data set after every update.
In this embodiment, the second clustering module 12 specifically includes an operation mode submodule 121 and a data type submodule 122, as shown in Fig. 8.
Refer to Fig. 8, which is a schematic diagram of the internal structure of the second clustering module 12 of Fig. 7 in an embodiment of the present invention.
The operation mode submodule 121 is configured to determine the data operation performed on the labeled data and to process the labeled data according to the respective data operation.
In this embodiment, the data operations include the insertion operation and the deletion operation. In this embodiment, the operation mode submodule 121 specifically includes an insertion operation submodule 1211 and a deletion operation submodule 1212, as shown in Fig. 9.
Refer to Fig. 9, which is a schematic diagram of the internal structure of the operation mode submodule 121 of Fig. 8 in an embodiment of the present invention.
The insertion operation submodule 1211 is configured, when an insertion operation is performed, to: mark the inserted object as noise if it satisfies the preset noise condition; create a new cluster for the inserted object if it satisfies the preset condition for creating a new cluster; absorb the inserted object into an existing cluster if it satisfies the preset condition for absorption into the same cluster; and merge the inserted object with the different clusters if it satisfies the preset condition for merging different clusters.
In this embodiment, let D be the data set and p the inserted object.
UpSeed_ins(p) = {q | q is a core object in D ∪ {p}, ∃ q': q' is a core point in D ∪ {p} but not a core point in D, and q ∈ N_Eps(q')}
Here, q and q' are both obtained by similarity range search.
When an insertion operation is performed, there are four different cases:
(1) Noise: UpSeed_ins(p) is empty and N_Eps(p) contains no core object, so p is marked as noise.
(2) Creating a new cluster: UpSeed_ins(p) contains core objects, these core objects do not belong to any cluster, and none of their density-reachable objects is a core object of a known cluster.
(3) Absorption into an existing cluster, including: a) all core objects contained in UpSeed_ins(p) belonged to the same cluster before p was inserted; b) the core objects contained in UpSeed_ins(p) belong to different clusters, and after p is inserted the objects of the different clusters are still not density-reachable from one another, so p is assigned to one of these clusters; c) UpSeed_ins(p) is empty, but N_Eps(p) contains a core object.
(4) Merging different clusters: the core objects contained in UpSeed_ins(p) did not all belong to the same cluster before p was inserted, and after p is inserted the core objects of these different clusters become density-reachable from one another, so these clusters and p are merged into one cluster.
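The four insertion cases can be sketched as a small decision function in Python. This is an illustrative simplification, not the patented procedure: the name `classify_insertion` and its two arguments are hypothetical, cluster membership is assumed to be already known for every core object, and case (3b) is not distinguished from case (4), because telling them apart requires a density-reachability check that is outside this sketch.

```python
def classify_insertion(upseed_clusters, neighbor_core_clusters):
    """Return (case, cluster_ids) for an inserted object p.

    upseed_clusters: cluster id (None if unclustered) of each core object
        in UpSeed_ins(p).
    neighbor_core_clusters: cluster id of each core object in N_Eps(p).
    """
    if not upseed_clusters:
        known = {c for c in neighbor_core_clusters if c is not None}
        if known:
            return "absorb", sorted(known)   # case (3c)
        return "noise", []                   # case (1)
    known = {c for c in upseed_clusters if c is not None}
    if not known:
        return "create", []                  # case (2)
    if len(known) == 1:
        return "absorb", sorted(known)       # case (3a)
    # Several clusters involved: treated as a merge candidate, case (4).
    return "merge", sorted(known)
```

For instance, `classify_insertion([1, 2], [])` reports a merge candidate involving clusters 1 and 2, while `classify_insertion([], [])` reports noise.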
The deletion operation submodule 1212 is configured, when a deletion operation is performed, to: delete the object if it satisfies the preset deletion condition; mark the other objects in the same cluster as the object as noise if the object satisfies the preset condition for eliminating a cluster; mark the object as noise if it satisfies the preset condition for reducing the objects in a cluster; and split the cluster of the object into one or more clusters if the object satisfies the preset condition for cluster splitting.
In this embodiment, let D be the data set and p the deleted object.
UpSeed_del(p) = {q | q is a core object in D \ {p}, ∃ q': q' is a core point in D but not a core point in D \ {p}, and q ∈ N_Eps(q')}
Here, q and q' are both obtained by similarity range search.
When a deletion operation is performed, there are four different cases:
(1) Noise: if p is noise, p is deleted and the other objects remain unchanged.
(2) Eliminating a cluster: UpSeed_del(p) is empty, p was not noise before the deletion, and after p is deleted N_Eps(p) no longer contains a core object; the other objects that originally belonged to the cluster of p are then marked as noise.
(3) Reducing the objects in a cluster: all objects in UpSeed_del(p) are directly density-reachable from one another, or UpSeed_del(p) is empty but N_Eps(p) still contains a core object. After p is deleted these objects still belong to the same cluster, so p is deleted and some points may become noise.
(4) Cluster splitting: the objects in UpSeed_del(p) are not directly density-reachable from one another. In this case it must be checked whether these objects can still satisfy the density-reachability relation through other core objects of the original cluster, so the clustering process has to consider the core objects of the original cluster outside UpSeed_del(p). If the objects in UpSeed_del(p) are density-reachable from one another through other core objects of the original cluster, the original cluster does not need to be split and the search of other objects stops. If all density-reachable objects have been searched and the objects in UpSeed_del(p) are still not density-reachable from one another, the original cluster splits between the objects in UpSeed_del(p) into one or more clusters.
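The split check of case (4) amounts to a connectivity test among the objects of UpSeed_del(p). The following Python sketch shows one way to express it, under assumptions that are not in the source: `core_graph` is a hypothetical symmetric adjacency map that links core objects which are directly density-reachable from each other after p has been removed, and the function name `seed_components` is likewise hypothetical.

```python
from collections import deque

def seed_components(seeds, core_graph):
    """Group the seed objects that remain mutually density-reachable.

    seeds: objects of UpSeed_del(p).
    core_graph: dict mapping a core object to the core objects directly
        density-reachable from it after the deletion (assumed symmetric).
    Returns a list of sets; more than one set means the cluster splits.
    """
    remaining = set(seeds)
    groups = []
    while remaining:
        start = remaining.pop()
        seen = {start}
        queue = deque([start])
        while queue:  # breadth-first search over the remaining core objects
            u = queue.popleft()
            for v in core_graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        groups.append({start} | (remaining & seen))
        remaining -= seen
    return groups
```

If the function returns a single group, the original cluster stays intact; each additional group corresponds to a part that has become disconnected from the others.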
Still referring to Fig. 8, the data type submodule 122 is configured to perform incremental clustering on the labeled data according to the data type.
In this embodiment, the data types include three types: spatial data, non-spatial data, and uncertain data.
In this embodiment, the data type submodule 122 specifically includes a spatial data submodule (not shown).
The spatial data submodule is configured, when the data type is spatial data, to: build a hash table according to the maximum and minimum values of the abscissas and the ordinates of the labeled data; build an index according to the hash table; and search the built index with a given data point to find the data points that satisfy a preset condition. In this embodiment, for the specific methods of building the hash table, building the index, and searching, refer to the related description in the preceding sections, which is not repeated here.
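Because the hash table construction is only referenced here and not restated, the following Python sketch shows one common way a grid-based hash index can support range search; it should not be read as the construction used by the invention. The class name `GridIndex`, the choice of a cell size equal to Eps, and the use of the coordinate minima as the grid origin are all assumptions.

```python
import math
from collections import defaultdict

class GridIndex:
    """Hash 2-D points into square cells of side eps (assumed cell size)."""

    def __init__(self, points, eps):
        self.eps = eps
        # Grid origin taken from the coordinate minima of the initial data.
        self.xmin = min(x for x, _ in points)
        self.ymin = min(y for _, y in points)
        self.cells = defaultdict(list)
        for p in points:
            self.insert(p)

    def _key(self, p):
        return (math.floor((p[0] - self.xmin) / self.eps),
                math.floor((p[1] - self.ymin) / self.eps))

    def insert(self, p):
        # Points outside the original bounds simply get negative or larger keys.
        self.cells[self._key(p)].append(p)

    def range_query(self, q):
        """Return all stored points within eps of q (checks 3x3 cells)."""
        cx, cy = self._key(q)
        found = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for p in self.cells.get((cx + dx, cy + dy), ()):
                    if math.dist(p, q) <= self.eps:
                        found.append(p)
        return found
```

Since any point within Eps of the query lies in the query's cell or one of its eight neighbors, a range query inspects a constant number of cells regardless of the data set size.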
In this embodiment, the data type submodule 122 further includes a non-spatial data submodule (not shown).
The non-spatial data submodule is configured, when the data type is non-spatial data, to: partition the labeled data according to the three-way partitioning algorithm; build an index using the MVP-Tree structure; and search the index with a given data point to find the data points that satisfy a preset condition;
Wherein the three-way partitioning algorithm specifically includes:
Calculating the distance from each data point in the current subtask to the designated support point;
Among all the calculated distances, finding two quantiles that divide the data set into three parts;
If the absolute value of the difference between the two quantiles is less than 2r, finding the median a of the data set and taking the values a+r and a-r as the two split points that divide the data set into three parts, where r is the search radius;
Assigning the data points whose distances differ from the two quantiles to the appropriate parts, and storing the data points whose distances equal a quantile together in one set;
According to the variance, assigning the data points in that set to the appropriate part.
In this embodiment, the data type submodule 122 further includes an uncertain data submodule (not shown).
In this embodiment, when the data type is uncertain data, a probability density function is used to represent each uncertain data object because of the nature of uncertain data. Consequently, concepts such as the distance between data objects, core point, ε-neighborhood, and direct density-reachability need to be redefined. In this embodiment, for the method of redefinition, refer to the related description in the preceding sections, which is not repeated here.
In this embodiment, the IncrementalPDBSCAN algorithm applies the PDBSCAN algorithm to an incremental database. It improves the PDBSCAN algorithm, which can only process static uncertain data, so that dynamic uncertain data can also be processed. The uncertain data submodule is configured to cluster the data with the PDBSCAN algorithm, calculate the seed objects and the ε-neighborhood of the updated point p, and perform incremental clustering according to the four cases of insertion and deletion.
For spatial data, the incremental clustering data mining system 10 provided by the present invention adopts a grid-based dynamic hash data structure, which reduces the time complexity of range search from the original O(n^α), 0 < α < 1, to O(1) and improves the efficiency of incremental clustering. For non-spatial data, the system adopts the MVP-Tree structure together with the three-way partitioning data partitioning algorithm, which reduces the time of range search and improves the efficiency of incremental clustering. For uncertain data, the IncrementalPDBSCAN algorithm can process dynamic uncertain data. Building on the PDBSCAN clustering algorithm, the technical solution proposes for the first time the IncrementalPDBSCAN clustering algorithm for dynamic attribute-uncertain data, improving clustering efficiency by several orders of magnitude. The incremental clustering data mining system 10 provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, enhances the timeliness of data mining, and improves the efficiency of data mining.
It should be noted that the units included in the above embodiments are divided according to functional logic, but the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and do not limit the protection scope of the present invention.
Furthermore, a person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A density-based incremental clustering data mining method, characterized in that the method includes:
Clustering the original data set using the DBSCAN algorithm to obtain labeled data (data with class labels); and
When new data is added to the labeled data, performing incremental clustering on the labeled data using the IncrementalDBSCAN algorithm.
2. The incremental clustering data mining method according to claim 1, characterized in that the step of performing incremental clustering on the labeled data using the IncrementalDBSCAN algorithm when new data is added to the labeled data includes:
Determining the data operation performed on the labeled data, and processing the labeled data according to the respective data operation; and
Performing incremental clustering on the labeled data according to the data type.
3. The incremental clustering data mining method according to claim 2, characterized in that the step of determining the data operation performed on the labeled data and processing the labeled data according to the respective data operation includes:
When an insertion operation is performed: marking the inserted object as noise if it satisfies the preset noise condition; creating a new cluster for the inserted object if it satisfies the preset condition for creating a new cluster; absorbing the inserted object into an existing cluster if it satisfies the preset condition for absorption into the same cluster; and merging the inserted object with the different clusters if it satisfies the preset condition for merging different clusters;
When a deletion operation is performed: deleting the object if it satisfies the preset deletion condition; marking the other objects in the same cluster as the object as noise if the object satisfies the preset condition for eliminating a cluster; marking the object as noise if it satisfies the preset condition for reducing the objects in a cluster; and splitting the cluster of the object into one or more clusters if the object satisfies the preset condition for cluster splitting.
4. The incremental clustering data mining method according to claim 2, characterized in that the step of performing incremental clustering on the labeled data according to the data type includes:
When the data type is spatial data:
Building a hash table according to the maximum and minimum values of the abscissas and the ordinates of the labeled data;
Building an index according to the hash table;
Searching the built index with a given data point to find the data points that satisfy a preset condition.
5. The incremental clustering data mining method according to claim 2, characterized in that the step of performing incremental clustering on the labeled data according to the data type further includes:
When the data type is non-spatial data:
Partitioning the labeled data according to the three-way partitioning algorithm;
Building an index using the MVP-Tree structure; and
Searching the index with a given data point to find the data points that satisfy a preset condition;
Wherein the three-way partitioning algorithm specifically includes:
Calculating the distance from each data point in the current subtask to the designated support point;
Among all the calculated distances, finding two quantiles that divide the data set into three parts;
If the absolute value of the difference between the two quantiles is less than 2r, finding the median a of the data set and taking the values a+r and a-r as the two split points that divide the data set into three parts, where r is the search radius;
Assigning the data points whose distances differ from the two quantiles to the appropriate parts, and storing the data points whose distances equal a quantile together in one set;
According to the variance, assigning the data points in that set to the appropriate part.
6. A density-based incremental clustering data mining system, characterized in that the incremental clustering data mining system includes:
A first clustering module, configured to cluster the original data set using the DBSCAN algorithm to obtain labeled data (data with class labels); and
A second clustering module, configured to perform incremental clustering on the labeled data using the IncrementalDBSCAN algorithm when new data is added to the labeled data.
7. The incremental clustering data mining system according to claim 6, characterized in that the second clustering module specifically includes:
An operation mode submodule, configured to determine the data operation performed on the labeled data and to process the labeled data according to the respective data operation; and
A data type submodule, configured to perform incremental clustering on the labeled data according to the data type.
8. The incremental clustering data mining system according to claim 7, characterized in that the operation mode submodule specifically includes:
An insertion operation submodule, configured, when an insertion operation is performed, to: mark the inserted object as noise if it satisfies the preset noise condition; create a new cluster for the inserted object if it satisfies the preset condition for creating a new cluster; absorb the inserted object into an existing cluster if it satisfies the preset condition for absorption into the same cluster; and merge the inserted object with the different clusters if it satisfies the preset condition for merging different clusters;
A deletion operation submodule, configured, when a deletion operation is performed, to: delete the object if it satisfies the preset deletion condition; mark the other objects in the same cluster as the object as noise if the object satisfies the preset condition for eliminating a cluster; mark the object as noise if it satisfies the preset condition for reducing the objects in a cluster; and split the cluster of the object into one or more clusters if the object satisfies the preset condition for cluster splitting.
9. The incremental clustering data mining system according to claim 7, characterized in that the data type submodule specifically includes a spatial data submodule, wherein
The spatial data submodule is configured, when the data type is spatial data, to perform:
Building a hash table according to the maximum and minimum values of the abscissas and the ordinates of the labeled data;
Building an index according to the hash table;
Searching the built index with a given data point to find the data points that satisfy a preset condition.
10. The incremental clustering data mining system according to claim 7, characterized in that the data type submodule specifically includes a non-spatial data submodule, wherein
The non-spatial data submodule is configured, when the data type is non-spatial data, to perform:
Partitioning the labeled data according to the three-way partitioning algorithm;
Building an index using the MVP-Tree structure; and
Searching the index with a given data point to find the data points that satisfy a preset condition;
Wherein the three-way partitioning algorithm specifically includes:
Calculating the distance from each data point in the current subtask to the designated support point;
Among all the calculated distances, finding two quantiles that divide the data set into three parts;
If the absolute value of the difference between the two quantiles is less than 2r, finding the median a of the data set and taking the values a+r and a-r as the two split points that divide the data set into three parts, where r is the search radius;
Assigning the data points whose distances differ from the two quantiles to the appropriate parts, and storing the data points whose distances equal a quantile together in one set;
According to the variance, assigning the data points in that set to the appropriate part.
CN201610055222.2A 2016-01-27 2016-01-27 Density-based incremental clustering data mining method and system Pending CN105740371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610055222.2A CN105740371A (en) 2016-01-27 2016-01-27 Density-based incremental clustering data mining method and system

Publications (1)

Publication Number Publication Date
CN105740371A true CN105740371A (en) 2016-07-06

Family

ID=56246604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610055222.2A Pending CN105740371A (en) 2016-01-27 2016-01-27 Density-based incremental clustering data mining method and system

Country Status (1)

Country Link
CN (1) CN105740371A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875760A (en) * 2017-05-11 2018-11-23 阿里巴巴集团控股有限公司 clustering method and device
CN109614415A (en) * 2018-09-29 2019-04-12 阿里巴巴集团控股有限公司 A kind of data mining, processing method, device, equipment and medium
WO2021232442A1 (en) * 2020-05-21 2021-11-25 深圳大学 Density clustering method and apparatus on basis of dynamic grid hash index

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281652A (en) * 2014-09-16 2015-01-14 深圳大学 One-by-one support point data dividing method in metric space
CN105205111A (en) * 2015-09-01 2015-12-30 西安交通大学 System and method for mining failure modes of time series data
US20160004762A1 (en) * 2014-07-07 2016-01-07 Edward-Robert Tyercha Hilbert Curve Partitioning for Parallelization of DBSCAN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄永平等: "数据仓库中基于密度的批量增量聚类算法", 《计算机工程与应用》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160706