CN105740371A

CN105740371A - Density-based incremental clustering data mining method and system

Info

Publication number: CN105740371A
Application number: CN201610055222.2A
Authority: CN
Inventors: 毛睿; 张贺; 陆敏华; 廖好; 李荣华; 王毅; 刘刚; 许红龙
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2016-01-27
Filing date: 2016-01-27
Publication date: 2016-07-06

Abstract

The invention is suitable for the field of data mining, and provides a density-based incremental clustering data mining method. The method comprises the following steps: carrying out clustering on an original data set by adopting a DBSCAN algorithm so as to obtain data with class labels; when new data is added to the data with the class labels, carrying out incremental clustering on the data with the class labels by adopting an Incremental DBSCAN algorithm. The invention furthermore provides a density-based incremental clustering data mining system. Through the technical scheme provided by the invention, the waste of calculation resources caused by repeated clustering can be avoided, the efficiency of the incremental clustering can be improved, the timeliness of the data mining can be strengthened and the efficiency of the data mining can be improved.

Description

A density-based incremental clustering data mining method and system

Technical field

The present invention relates to the field of data mining, and in particular to a density-based incremental clustering data mining method and system.

Background art

With the rapid development of computer and network technologies, the scale and range of application of databases keep expanding. People have more and more ways to obtain data, and data acquisition is becoming increasingly automated, which makes data easier to obtain and gradually deepens people's understanding of data.

At present, applications in fields such as telecommunications, the Internet, logistics, economics, military affairs, finance, and biomedicine all generate large amounts of data of different types. These data can be roughly divided into deterministic data and uncertain data. Deterministic data can be further divided into spatial data (i.e. multidimensional data) and non-spatial data (such as DNA sequences), and uncertain data can be further divided into tuple-level uncertain data and attribute-level uncertain data. Moreover, unlike traditional static data, these data grow gradually over time. Traditional data analysis methods have difficulty analyzing them effectively, and how to mine valuable information from them quickly and efficiently has attracted the attention of many researchers.

In recent years, data mining has become a focus of attention. The goal of data mining is to extract latent, valuable patterns and knowledge from massive data. Cluster analysis is an important means of data mining. The clustering algorithms widely used at present are usually applicable only to static data sets; for a dynamic data set, the clustering algorithm must be rerun on the whole data set after new data is added, which inevitably leads to low clustering efficiency and wasted computing resources.

Therefore, how to improve the efficiency of data mining on dynamic data sets has long been a goal the industry urgently needs to achieve.

Summary of the invention

In view of this, the purpose of the embodiments of the present invention is to provide a density-based incremental clustering data mining method and system, intended to solve the prior-art problem of inefficient data mining on dynamic data sets.

An embodiment of the present invention is implemented as a density-based incremental clustering data mining method, comprising:

clustering an original data set using the DBSCAN algorithm to obtain data with class labels;

when new data is added to the data with class labels, performing incremental clustering on the data with class labels using the IncrementalDBSCAN algorithm; and

superposing the results of the two clustering passes to form the final data mining result.

Preferably, the step of performing incremental clustering on the data with class labels using the IncrementalDBSCAN algorithm when new data is added to the data with class labels comprises:

determining the data operation performed on the data with class labels, and processing the data with class labels according to the type of data operation; and

performing incremental clustering on the data with class labels according to the data type.

Preferably, the step of determining the data operation performed on the data with class labels, and processing the data with class labels according to the type of data operation, comprises:

when an insertion is performed: if the inserted object meets a preset noise condition, marking the inserted object as noise; if the inserted object meets a preset cluster-creation condition, creating a new cluster for the inserted object; if the inserted object meets a preset absorption condition, absorbing the inserted object into a single existing cluster; and if the inserted object meets a preset merge condition, merging the different clusters together with the inserted object;

when a deletion is performed: if the object to be deleted meets a preset deletion condition, deleting the object; if the object meets a preset cluster-elimination condition, marking the other objects of the same cluster as noise; if the object meets a preset cluster-reduction condition, marking the object as noise; and if the object meets a preset cluster-splitting condition, splitting the cluster containing the object into one or more clusters.

Preferably, the step of performing incremental clustering on the data with class labels according to the data type comprises:

when the data type is spatial data:

building a hash table according to the maximum and minimum values of the abscissas and of the ordinates of the data with class labels;

building an index from the hash table; and

searching the built index with a given data point to find the data points that meet a preset condition.

Preferably, the step of performing incremental clustering on the data with class labels according to the data type further comprises:

when the data type is non-spatial data:

partitioning the data with class labels using a three-way partitioning algorithm;

building an index using the MVP-Tree structure; and

searching the index with a given data point to find the data points that meet a preset condition;

wherein the three-way partitioning algorithm specifically comprises:

computing the distance from each data point in the subtask to a designated support point;

among all computed distances, finding two quantiles that divide the data set into three parts;

if the absolute value of the difference between the two quantiles is less than 2r, finding the median a of the data set and using the values a+r and a-r as the two split values that divide the data set into three parts, where r is the search radius;

assigning the data points whose distances differ from the two quantiles to the appropriate parts, and storing the data points whose distances equal a quantile together in the same data store; and

assigning the data points in that data store to the appropriate parts according to variance.

In another aspect, the present invention also provides a density-based incremental clustering data mining system, comprising:

a first clustering module, configured to cluster an original data set using the DBSCAN algorithm to obtain data with class labels;

a second clustering module, configured to perform incremental clustering on the data with class labels using the IncrementalDBSCAN algorithm when new data is added to the data with class labels; and

a cluster superposition module, configured to superpose the results of the two clustering passes to form the final data mining result.

Preferably, the second clustering module specifically comprises:

an operation mode submodule, configured to determine the data operation performed on the data with class labels and to process the data with class labels according to the type of data operation; and

a data type submodule, configured to perform incremental clustering on the data with class labels according to the data type.

Preferably, the operation mode submodule specifically comprises:

an insertion operation submodule, configured to, when an insertion is performed: mark the inserted object as noise if it meets a preset noise condition; create a new cluster for the inserted object if it meets a preset cluster-creation condition; absorb the inserted object into a single existing cluster if it meets a preset absorption condition; and merge the different clusters together with the inserted object if it meets a preset merge condition;

a deletion operation submodule, configured to, when a deletion is performed: delete the object to be deleted if it meets a preset deletion condition; mark the other objects of the same cluster as noise if the object meets a preset cluster-elimination condition; mark the object as noise if it meets a preset cluster-reduction condition; and split the cluster containing the object into one or more clusters if it meets a preset cluster-splitting condition.

Preferably, the data type submodule specifically comprises a spatial data submodule, wherein

the spatial data submodule is configured to, when the data type is spatial data:

build an index from the hash table;

Preferably, the data type submodule specifically comprises a non-spatial data submodule, wherein

the non-spatial data submodule is configured to, when the data type is non-spatial data:

build an index using the MVP-Tree structure; and

wherein the three-way partitioning algorithm specifically comprises:

For spatial data, the technical solution of the present invention adopts a grid-based dynamic hash data structure, reducing the time complexity of a range search from the original O(n^α), 0≤α≤1, to O(1) and thereby improving the efficiency of incremental clustering. For non-spatial data, it adopts the MVP-Tree structure together with a data division algorithm based on three-way partitioning, which reduces range search time and improves the efficiency of incremental clustering. Building on the PDBSCAN clustering algorithm, it proposes for the first time the IncrementalPDBSCAN clustering algorithm for dynamic attribute-level uncertain data, improving clustering efficiency by several orders of magnitude. The technical solution provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, enhances the timeliness of data mining, and improves the efficiency of data mining.

Brief description of the drawings

Fig. 1 is a flow chart of the density-based incremental clustering data mining method in an embodiment of the present invention;

Fig. 2 is a detailed flow chart of the sub-steps of step S12 shown in Fig. 1 in an embodiment of the present invention;

Fig. 3 is a detailed flow chart of the sub-steps of step S121 shown in Fig. 2 in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the hash table in an embodiment of the present invention;

Fig. 5 is a schematic diagram of the affected part of the grid in an embodiment of the present invention;

Fig. 6 is a schematic diagram of the three-way partitioning method in an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of the density-based incremental clustering data mining system in an embodiment of the present invention;

Fig. 8 is a schematic diagram of the internal structure of the second clustering module 12 shown in Fig. 7 in an embodiment of the present invention;

Fig. 9 is a schematic diagram of the internal structure of the operation mode submodule 121 shown in Fig. 8 in an embodiment of the present invention.

Detailed description of the invention

To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and not to limit it.

A specific embodiment of the present invention provides a density-based incremental clustering data mining method, which mainly comprises the following steps:

S11: clustering an original data set using the DBSCAN algorithm to obtain data with class labels; and

S12: when new data is added to the data with class labels, performing incremental clustering on the data with class labels using the IncrementalDBSCAN algorithm.

The incremental clustering data mining method provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, enhances the timeliness of data mining, and improves the efficiency of data mining.

The density-based incremental clustering data mining method provided by the present invention is described in detail below.

Refer to Fig. 1, which is a flow chart of the incremental clustering data mining method in an embodiment of the present invention.

In step S11, the original data set is clustered using the DBSCAN algorithm to obtain data with class labels.

In this embodiment, clustering-based mining means grouping transaction objects into classes according to certain manually defined points of interest, forming several classes of good similarity such that similarity within a class is as large as possible and similarity between classes is as small as possible. In this embodiment, clustering-based mining is an important means of data mining.
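As an illustration of step S11, the following is a minimal brute-force sketch of DBSCAN in Python. It is not the implementation of the invention: the function names, the Euclidean distance, and the use of -1 as the noise label are assumptions made for the example, and the neighborhood query scans all points instead of using an index.

```python
from math import dist

NOISE = -1  # assumed label for noise points


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

The returned labels play the role of the "data with class labels" that the incremental step S12 later updates.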

In step S12, when new data is added to the data with class labels, incremental clustering is performed on the data with class labels using the IncrementalDBSCAN algorithm.

In this embodiment, incremental clustering has two features: (1) new data, arriving singly or in batches, is clustered on the basis of the existing clustering result; (2) the clustering result is updated dynamically. In this embodiment, incremental data mining targets large data sets and updates the mining result incrementally instead of re-clustering the whole data set after every update.

In this embodiment, step S12 of performing incremental clustering on the data with class labels using the IncrementalDBSCAN algorithm when new data is added specifically comprises two sub-steps, S121 and S122, as shown in Fig. 2.

Refer to Fig. 2, which is a detailed flow chart of the sub-steps of step S12 shown in Fig. 1 in an embodiment of the present invention.

In step S121, the data operation performed on the data with class labels is determined, and the data with class labels is processed according to the type of data operation.

In this embodiment, the data operations include insertion and deletion. Step S121 of determining the data operation performed on the data with class labels and processing the data accordingly specifically comprises two sub-steps, S1211 and S1212, as shown in Fig. 3.

Refer to Fig. 3, which is a detailed flow chart of the sub-steps of step S121 shown in Fig. 2 in an embodiment of the present invention.

In step S1211, when an insertion is performed: if the inserted object meets a preset noise condition, the inserted object is marked as noise; if the inserted object meets a preset cluster-creation condition, a new cluster is created for the inserted object; if the inserted object meets a preset absorption condition, the inserted object is absorbed into a single existing cluster; and if the inserted object meets a preset merge condition, the different clusters are merged together with the inserted object.

In this embodiment, let D be the data set and p the inserted object.

UpSeed_ins(p) = {q | q is a core object in D ∪ {p}, ∃q′: q′ is a core object in D ∪ {p} but not a core object in D, and q ∈ N_Eps(q′)}

where q and q′ are both obtained by similarity range searches.

An insertion falls into one of four cases: (1) Noise: UpSeed_ins(p) is empty and N_Eps(p) contains no core object, so p is marked as noise. (2) Creation of a new cluster: UpSeed_ins(p) contains core objects that belong to no cluster, and none of the objects density-reachable from them is a core object of a known cluster. (3) Absorption into an existing cluster, which covers: a) all core objects in UpSeed_ins(p) belonged to the same cluster before p was inserted; b) the core objects in UpSeed_ins(p) belong to different clusters, but the objects of those clusters are still not density-reachable from one another after p is inserted, so p is assigned to one of the clusters; c) UpSeed_ins(p) is empty and N_Eps(p) contains a core object. (4) Merging of different clusters: the core objects in UpSeed_ins(p) did not all belong to the same cluster before p was inserted, and the insertion of p makes the core objects of these clusters density-reachable from one another, so these clusters and p are merged into one cluster.
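The four insertion cases can be read as a small decision table. The sketch below is a simplified illustration, not the patented procedure: `classify_insertion`, its parameters, and the returned strings are hypothetical names, and the density-reachability check that separates case (3b) from case (4) is assumed to have been computed elsewhere and passed in as `clusters_connected`.

```python
def classify_insertion(upseed_labels, has_core_neighbor, clusters_connected=True):
    """Pick the insertion case for a new object p.

    upseed_labels:      cluster label of each core object in UpSeed_ins(p),
                        with None for core objects that belong to no cluster.
    has_core_neighbor:  True if N_Eps(p) contains a core object.
    clusters_connected: True if the clusters found in UpSeed_ins(p) become
                        density-reachable from one another after inserting p
                        (assumed to be determined by a separate check).
    """
    if not upseed_labels:
        # Cases (1) and (3c): no seed objects at all.
        return "absorb" if has_core_neighbor else "noise"
    clusters = {label for label in upseed_labels if label is not None}
    if not clusters:
        return "create"   # case (2): seeds exist but none is clustered yet
    if len(clusters) == 1:
        return "absorb"   # case (3a): every clustered seed shares one cluster
    # Several clusters are involved: case (4) if they connect, else case (3b).
    return "merge" if clusters_connected else "absorb"
```

For example, `classify_insertion([1, 2], True)` selects the merge case, while the same call with `clusters_connected=False` falls back to absorption into one of the clusters.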

In step S1212, when a deletion is performed: if the object to be deleted meets a preset deletion condition, the object is deleted; if the object meets a preset cluster-elimination condition, the other objects of the same cluster are marked as noise; if the object meets a preset cluster-reduction condition, the object is marked as noise; and if the object meets a preset cluster-splitting condition, the cluster containing the object is split into one or more clusters.

In this embodiment, let D be the data set and p the deleted object.

UpSeed_del(p) = {q | q is a core object in D \ {p}, ∃q′: q′ is a core object in D but not a core object in D \ {p}, and q ∈ N_Eps(q′)}

where q and q′ are both obtained by similarity range searches.

A deletion falls into one of four cases: (1) Noise: if p is noise, p is deleted and the other objects remain unchanged. (2) Elimination of a cluster: UpSeed_del(p) is empty, p was not noise before the deletion, and N_Eps(p) no longer contains any core object after p is deleted; the other objects of the cluster that contained p are then marked as noise. (3) Reduction of a cluster: all objects in UpSeed_del(p) are directly density-reachable from one another, or UpSeed_del(p) is empty but N_Eps(p) still contains core objects; after p is deleted these objects still belong to the same cluster, so p is simply deleted, and some points may become noise. (4) Splitting of a cluster: the objects in UpSeed_del(p) are not directly density-reachable from one another. In this case it must be checked whether these objects can still reach one another through the other core objects of the original cluster, so the procedure has to take into account the core objects of the original cluster outside UpSeed_del(p). If the objects in UpSeed_del(p) turn out to be density-reachable from one another through other core objects of the original cluster, the cluster need not be split and the retrieval of further objects stops. If all density-reachable objects have been retrieved and the objects in UpSeed_del(p) still cannot reach one another, the original cluster splits between the objects of UpSeed_del(p) into one or more clusters.
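The split check in case (4) amounts to asking whether the seed objects are still connected through the remaining core objects. A minimal sketch follows, assuming the direct density-reachability relation among the remaining core objects is available as an adjacency dictionary; `seed_components` and `core_neighbors` are hypothetical names, and the search is a plain breadth-first traversal that stops as soon as every remaining seed has been found.

```python
from collections import deque


def seed_components(seeds, core_neighbors):
    """Group the objects of UpSeed_del(p) by reachability through core objects.

    seeds:          identifiers of the objects in UpSeed_del(p).
    core_neighbors: dict mapping each remaining core object to the core
                    objects in its Eps-neighborhood (assumed precomputed).
    Returns a list of sets; one set means the cluster does not split.
    """
    unassigned = set(seeds)
    components = []
    while unassigned:
        start = min(unassigned)
        seen = {start}
        found = {start}
        queue = deque([start])
        # Stop early once every remaining seed has been reached.
        while queue and found != unassigned:
            node = queue.popleft()
            for nxt in core_neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if nxt in unassigned:
                        found.add(nxt)
        components.append(found)
        unassigned -= found
    return components
```

If the function returns a single group, the original cluster is kept; several groups correspond to the clusters produced by the split.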

Referring again to Fig. 2, in step S122 incremental clustering is performed on the data with class labels according to the data type.

In this embodiment, there are three data types: spatial data, non-spatial data, and uncertain data.

In this embodiment, when the data type is spatial data, step S122 of performing incremental clustering on the data with class labels according to the data type specifically comprises:

building an index from the hash table;

In this embodiment, each element of the hash table is the root node of a tree; each root node points to its child nodes, and so on down to the leaf nodes. Suppose the maximum and minimum abscissas of the data with class labels are maxX and minX, and the maximum and minimum ordinates are maxY and minY. A hash table of size (maxX-minX)*(maxY-minY) is first created. Each element of the hash table is a structure containing: (1) the maximum and minimum abscissas; (2) the maximum and minimum ordinates; (3) the indices of the data points mapped to this bucket; (4) a set of pointers to the next-level structures. The structure of the hash table is shown in Fig. 4. The grid-based hash data structure is built in three general steps: first, determine the position of each data point in the grid; second, map the point into the hash table according to that position using a suitable hash function; third, continue mapping the point downward level by level with the hash function until a leaf node is reached. Each first-level element of the hash table corresponds to one grid cell, and each subsequent level subdivides that cell, each subdivision producing 4 parts. Each node holds the following information: 1. the upper and lower bounds of the cell's abscissa; 2. the upper and lower bounds of the cell's ordinate; 3. the data points it contains; 4. whether it is a leaf node; 5. its index number; 6. pointers to the next level.

Refer to Fig. 4, which is a schematic structural diagram of the hash table in an embodiment of the present invention.

In this embodiment, each entry of the hash table has four pointers to the next level. The next level is a hash table of the same kind containing four entries in total, each with the same structure as a first-level entry. Mapping continues downward level by level in this way until the termination condition is reached.

In this embodiment, the index is built as follows. A regular two-dimensional grid is first constructed over the original data set D; the side length of each cell c in the grid is [expression missing from the source text]. Each non-empty cell is divided into 4 cells of equal size; if a cell c′ produced by the division is non-empty, the step is repeated until the side length of c′ is no greater than [expression missing from the source text], where ε is the search radius and ρ is the approximation parameter. During construction, the position of a data point in the grid is determined first, and the point is then mapped by the hash function to the corresponding bucket of the hash table. Suppose the two-dimensional grid has colNum columns, each cell has width Wcell, and the data point has abscissa axisX and ordinate axisY in the grid; the hash function can then be set to Hash(key)=axisX*colNum+axisY. When a new data point needs to be added to the hash table, it is first added at its position in the grid, then mapped to the corresponding bucket by the hash function, and finally mapped again inside that bucket until a leaf node is reached.
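The first-level mapping can be sketched as follows. This is an illustrative reading of the hash function Hash(key)=axisX*colNum+axisY only: the class name `GridHash`, the use of a Python dictionary as the bucket store, and computing the grid coordinates by floor division are assumptions, and the recursive four-way subdivision inside each bucket is left out. The cell width is passed in explicitly because the expression for it is missing from the text; a value of ε/√2 would be one plausible choice, since it yields the 21 affected cells mentioned for the range search.

```python
from collections import defaultdict
from math import floor


class GridHash:
    """First level of a grid-based hash index over 2-D points (sketch)."""

    def __init__(self, min_x, min_y, cell_width, col_num):
        self.min_x = min_x
        self.min_y = min_y
        self.cell_width = cell_width   # Wcell; assumed to be supplied by the caller
        self.col_num = col_num         # colNum in the text
        self.buckets = defaultdict(list)

    def cell(self, x, y):
        """Grid coordinates (axisX, axisY) of a point."""
        axis_x = floor((x - self.min_x) / self.cell_width)
        axis_y = floor((y - self.min_y) / self.cell_width)
        return axis_x, axis_y

    def key(self, x, y):
        """Hash(key) = axisX * colNum + axisY, as stated in the text.

        Keys are unique only while axisY stays below colNum.
        """
        axis_x, axis_y = self.cell(x, y)
        return axis_x * self.col_num + axis_y

    def insert(self, point_id, x, y):
        """Add a point to the bucket of the cell that contains it."""
        self.buckets[self.key(x, y)].append(point_id)
```

Inserting a new point only touches the bucket of its own cell, which is what makes the structure suitable for incremental updates.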

In this embodiment, a range search uses a given data point to search the built index. For a given data point A(A.X, A.Y), its position in the grid is computed according to [expressions missing from the source text]. Because the width of a first-level grid cell is [expression missing from the source text], 21 grid cells are affected, as illustrated in Fig. 5.

Refer to Fig. 5, which is a schematic diagram of the affected part of the grid in an embodiment of the present invention.

In this embodiment, the range search examines the 21 affected cells of Fig. 5 to find the data points that satisfy the condition. For a single non-empty cell c, the range search for point q distinguishes three cases: (1) if c does not intersect B(q, ε), c is ignored; (2) if c is completely covered by B(q, ε(1+ρ)), all points in c are taken as ε-neighbors of q; (3) otherwise, it is checked whether c is a leaf cell: if not, its child nodes are searched in the same way; if so, all points in c are marked as ε-neighbors of q. These steps find the candidate data points within the range ε(1+ρ). The candidates must then be filtered: the distance from each candidate to A is computed, and a point is recorded if this distance is less than or equal to ε and discarded otherwise.
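The three cases and the final filter can be sketched as a recursive function over one cell and its subdivisions. This is a minimal illustration under stated assumptions: the `Cell` class, the helper names, and the choice to store in every cell all points that fall inside it are made up for the example, and the caller is assumed to pass in the affected first-level cells.

```python
from dataclasses import dataclass, field
from math import dist, hypot


@dataclass
class Cell:
    x0: float
    y0: float
    x1: float
    y1: float
    points: list = field(default_factory=list)    # all points inside this cell
    children: list = field(default_factory=list)  # empty for a leaf cell


def min_dist(cell, q):
    """Distance from q to the nearest point of the cell rectangle."""
    dx = max(cell.x0 - q[0], 0.0, q[0] - cell.x1)
    dy = max(cell.y0 - q[1], 0.0, q[1] - cell.y1)
    return hypot(dx, dy)


def max_dist(cell, q):
    """Distance from q to the farthest corner of the cell rectangle."""
    dx = max(abs(q[0] - cell.x0), abs(q[0] - cell.x1))
    dy = max(abs(q[1] - cell.y0), abs(q[1] - cell.y1))
    return hypot(dx, dy)


def candidates(cell, q, eps, rho):
    """Candidate eps-neighbors of q found in one cell."""
    if not cell.points or min_dist(cell, q) > eps:
        return []                                  # case 1: no intersection
    if max_dist(cell, q) <= eps * (1 + rho):
        return list(cell.points)                   # case 2: fully covered
    if cell.children:                              # case 3, inner cell: recurse
        return [p for child in cell.children
                for p in candidates(child, q, eps, rho)]
    return list(cell.points)                       # case 3, leaf cell


def range_search(cells, q, eps, rho):
    """Collect candidates from the affected cells, then filter exactly."""
    found = [p for cell in cells for p in candidates(cell, q, eps, rho)]
    return [p for p in found if dist(p, q) <= eps]
```

The exact filter at the end is what turns the approximate candidate set within ε(1+ρ) back into the true ε-neighborhood.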

Referring again to Fig. 2, in this embodiment, when the data type is non-spatial data, step S122 of performing incremental clustering on the data with class labels according to the data type specifically comprises:

partitioning the data with class labels using the three-way partitioning algorithm;

building an index using the MVP-Tree structure; and

wherein the three-way partitioning algorithm specifically comprises:

In this embodiment, the MVP-Tree is an index structure for metric spaces and is more efficient than the M-Tree index structure. However, the MVP-Tree always uses balanced partitioning when dividing data: this method divides all data points evenly into the specified number of parts without considering how the data points are distributed, which makes similarity search inefficient. To remedy this defect, the present invention adopts a new data division method, three-way partitioning, which takes the data distribution into account when building the index and thus divides the data more reasonably, as shown in Fig. 6.

Refer to Fig. 6, which is a schematic diagram of the three-way partitioning method in an embodiment of the present invention.

In this embodiment, the three-way partitioning algorithm is designed to avoid the situation in which no pruning is possible. The width of the middle part is therefore at least 2r, where r is the search radius: when m₂-m₁ is greater than 2r, the width of the middle part is set to m₂-m₁; otherwise it is set to 2r. The left part spans 0 to m₁, and the right part spans m₂ to u.
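The choice of the two split values can be sketched as below, following the rule that the middle part must be at least 2r wide. This is a simplified illustration: `three_way_split` is a hypothetical name, the two quantiles are taken as the tertile cut points returned by Python's `statistics.quantiles`, and distances equal to a split value are simply placed in the middle part instead of being set aside and assigned by variance.

```python
from statistics import median, quantiles


def three_way_split(distances, r):
    """Split pivot distances into left, middle, and right parts.

    distances: distance of each data point to the chosen support point.
    r:         search radius; the middle part is kept at least 2r wide.
    """
    m1, m2 = quantiles(distances, n=3)    # two cut points, three parts
    if m2 - m1 < 2 * r:
        a = median(distances)
        m1, m2 = a - r, a + r             # widen the middle part to 2r
    left = [d for d in distances if d < m1]
    middle = [d for d in distances if m1 <= d <= m2]
    right = [d for d in distances if d > m2]
    return (m1, m2), left, middle, right
```

With a middle part of width at least 2r, a query ball of radius r can never overlap both the left and the right part, so at least one of the two can always be pruned.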

Referring again to Fig. 2, in this embodiment, when the data type is uncertain data, the nature of uncertain data means that uncertain data objects are represented by probability density functions. Concepts such as the distance between data objects, core point, ε-neighborhood, and direct density-reachability therefore need to be redefined.

In this embodiment, given two uncertain data points o_i and o_j with corresponding probability density functions f_i and f_j, let x be the uncertain dimension of o_i and y the uncertain dimension of o_j, the two being mutually independent, and let Eps be the distance threshold. The probability that d(o_i, o_j) ≤ Eps is defined as:

P_{d(o_i, o_j) \leq Eps} = \int_{x \in R^{m}} \int_{y \in R^{m}} f_i(x) f_j(y) \, dx \, dy, \quad \forall d(o_i, o_j) \leq Eps \qquad (1)

Let o_p be an uncertain data point in data set D. The probabilistic ε-neighborhood of o_p is denoted PNeighborhood(o_p), where:

PNeighborhood(o_p) = \{ o_i \mid P_{d(o_i, o_j) \leq Eps} > 0 \} \qquad (2)

Let o_p be an uncertain data point in data set D. Then PN_Eps(o_p) is defined as:

PN_{Eps}(o_p) = \sum_{o_i \in PNeighborhood(o_p)} P_{d(o_i, o_j) \leq Eps} \qquad (3)

Let MinPts be the minimum number of data points contained in the ε-neighborhood of a core point. If PN_Eps(o_p) ≥ MinPts, then o_p is called a core point.
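To make definitions (1) to (3) concrete, the sketch below evaluates them for uncertain objects given as finite lists of weighted samples. This discretization is an assumption made for the example, since the text defines the objects through continuous probability density functions; the function names are hypothetical, and the neighborhood is evaluated with respect to the target object o_p.

```python
from math import dist


def prob_within(samples_i, samples_j, eps):
    """Discrete analogue of (1): probability that d(o_i, o_j) <= eps.

    Each object is a list of (point, probability) pairs whose
    probabilities sum to 1; the two objects are assumed independent.
    """
    return sum(p_i * p_j
               for x, p_i in samples_i
               for y, p_j in samples_j
               if dist(x, y) <= eps)


def pn_eps(target, others, eps):
    """Analogue of (2) and (3): sum of the positive neighbor probabilities."""
    probs = (prob_within(o, target, eps) for o in others)
    return sum(p for p in probs if p > 0)


def is_core(target, others, eps, min_pts):
    """Core point test: PN_Eps(o_p) >= MinPts."""
    return pn_eps(target, others, eps) >= min_pts
```

For instance, an object that lies within Eps of the target in only half of its samples contributes 0.5 to PN_Eps of the target.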

Let o_p be an uncertain data point in data set D. The probability that o_p is a core point is denoted P_{Eps, MinPts, D}^{core}(o_p), where:

P_{Eps, MinPts, D}^{core}(o_p) = PN_{Eps}(o_p) / |PN_{Eps}(o_p)| \qquad (4)

Given two uncertain data points o_p and o_q in data set D, where o_p is a core point, the probability that o_q is directly density-reachable from o_p is denoted P_{Eps, MinPts, D}^{dir-reach}(o_q, o_p), where:

P_{Eps, MinPts, D}^{dir-reach}(o_q, o_p) = P_{Eps, MinPts-1, D \setminus \{o_q\}}^{dir-reach}(o_p) \cdot P_{d(o_i, o_j) \leq Eps} = \frac{PN_{Eps}(o_p) - P_{d(o_i, o_j) \leq Eps}}{|PN_{Eps}(o_p)| - 1} \cdot P_{d(o_i, o_j) \leq Eps} \qquad (5)

In this embodiment, the IncrementalPDBSCAN algorithm applies the PDBSCAN algorithm to an incremental database: PDBSCAN, which can only handle static uncertain data, is improved so that it can handle dynamic uncertain data. The specific processing steps are: clustering the data with the PDBSCAN algorithm; computing the seed objects and ε-neighborhood of the updated point p; and performing incremental clustering according to the four insertion cases and the four deletion cases.

For spatial data, the density-based incremental clustering data mining method provided by the present invention adopts a grid-based dynamic hash data structure, reducing the time complexity of a range search from the original O(n^α), 0 < α < 1, to O(1) and thereby improving the efficiency of incremental clustering. For non-spatial data, the method adopts the MVP-Tree structure together with a data division algorithm based on three-way partitioning, which reduces range search time and improves the efficiency of incremental clustering. For uncertain data, the IncrementalPDBSCAN algorithm can process dynamic uncertain data. Building on the PDBSCAN clustering algorithm, the technical solution proposes for the first time the IncrementalPDBSCAN clustering algorithm for dynamic attribute-level uncertain data, improving clustering efficiency by several orders of magnitude. The technical solution provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, enhances the timeliness of data mining, and improves the efficiency of data mining.

A specific embodiment of the present invention further provides a density-based incremental clustering data mining system 10, which mainly includes:

A first clustering module 11, configured to cluster the original data set using the DBSCAN algorithm to obtain data with class labels;

A second clustering module 12, configured to perform incremental clustering on the data with class labels using the Incremental DBSCAN algorithm when new data is added to the data with class labels.

The incremental clustering data mining system 10 provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, strengthens the timeliness of data mining, and improves the efficiency of data mining.

Referring to Fig. 7, a schematic structural diagram of the incremental clustering data mining system 10 in an embodiment of the present invention is shown. In this embodiment, the incremental clustering data mining system 10 includes the first clustering module 11 and the second clustering module 12.

The first clustering module 11 is configured to cluster the original data set using the DBSCAN algorithm to obtain data with class labels.
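To make the role of the first clustering module concrete, the following is a minimal Python sketch of a textbook DBSCAN pass that assigns a class label to every point. It uses a brute-force range search and Euclidean distance on tuples. The names `dbscan`, `region_query`, and `NOISE` are illustrative assumptions and do not come from the patent.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force, includes i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already labeled by an earlier expansion
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The resulting label list is the kind of "data with class labels" that the second clustering module would then update incrementally instead of re-running the whole pass.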

In this embodiment, the second clustering module 12 specifically includes an operation mode submodule 121 and a data type submodule 122, as shown in Fig. 8.

Referring to Fig. 8, a schematic diagram of the internal structure of the second clustering module 12 shown in Fig. 7 in an embodiment of the present invention is shown.

The operation mode submodule 121 is configured to determine the data operation mode applied to the data with class labels, and to process the data with class labels according to the respective data operation mode.

In this embodiment, the data operation modes include an insertion operation and a deletion operation. The operation mode submodule 121 specifically includes an insertion operation submodule 1211 and a deletion operation submodule 1212, as shown in Fig. 9.

Referring to Fig. 9, a schematic diagram of the internal structure of the operation mode submodule 121 shown in Fig. 8 in an embodiment of the present invention is shown.

The insertion operation submodule 1211 is configured to, when an insertion operation is performed: mark the inserted object as noise if it meets a preset noise condition; create a new cluster for the inserted object if it meets a preset condition for creating a new cluster; absorb the inserted object into an existing cluster if it meets a preset condition for absorption into the same cluster; and merge the different clusters through the inserted object if it meets a preset condition for merging different clusters.

In this embodiment, let D be the data set and p the inserted object.

UpSeed_ins(p) = \{ q \mid q \text{ is a core object in } D \cup \{p\},\ \exists q' : q' \text{ is a core object in } D \cup \{p\} \text{ but not in } D, \text{ and } q \in N_{Eps}(q') \}

Here, q and q' are both obtained by similarity range search.
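The seed set and the four insertion cases can be sketched in Python as follows. The sketch assumes the usual Incremental DBSCAN formulation: the seeds are the core objects of D ∪ {p} lying in the Eps-neighborhood of some object that became a core object only through the insertion, and the case is decided from the labels the seeds carried before the insertion. The brute-force `range_query`, the function names, and the case names are illustrative assumptions; the patent's "preset conditions" may differ in detail.

```python
from math import dist

NOISE = -1


def range_query(points, center, eps):
    """Indices of all points within eps of center (brute-force range search)."""
    return [i for i, q in enumerate(points) if dist(center, q) <= eps]


def upd_seed_ins(data, p, eps, min_pts):
    """Seed indices into data + [p]; index len(data) stands for p itself."""
    extended = data + [p]
    p_idx = len(data)
    seeds = set()
    # Only objects in the Eps-neighborhood of p can change their core status.
    for qp in range_query(extended, p, eps):
        nbrs_new = range_query(extended, extended[qp], eps)
        if len(nbrs_new) < min_pts:
            continue  # q' is not a core object in D + {p}
        if qp != p_idx and len(range_query(data, data[qp], eps)) >= min_pts:
            continue  # q' was already a core object in D
        for q in nbrs_new:  # q' is newly core: collect core objects near it
            if len(range_query(extended, extended[q], eps)) >= min_pts:
                seeds.add(q)
    return seeds


def insertion_case(seeds, labels):
    """labels holds the cluster label of every object in D before insertion."""
    p_idx = len(labels)
    if not seeds:
        return "noise"
    clusters = {labels[q] for q in seeds if q != p_idx} - {NOISE}
    if not clusters:
        return "create"
    return "absorb" if len(clusters) == 1 else "merge"
```

For example, inserting (1, 0) into the two noise points (0, 0) and (0, 1) with Eps = 1.5 and Minpts = 3 makes all three objects core, so the sketch reports the creation of a new cluster.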

The deletion operation submodule 1212 is configured to, when a deletion operation is performed: delete the object to be deleted if it meets a preset deletion condition; mark the other objects similar to the object as noise if it meets a preset condition for eliminating a cluster; mark the object as noise if it meets a preset condition for reducing the objects of a cluster; and split the cluster into one or more clusters if the object meets a preset condition for splitting a cluster.

In this embodiment, let D be the data set and p the deleted object.

Here, q and q' are both obtained by similarity range search.
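For deletion, a seed set can be sketched by mirroring the insertion case. The Python sketch below assumes the usual Incremental DBSCAN formulation, in which the seeds are the objects that remain core in D without p and lie in the Eps-neighborhood of some object that was core in D but loses that status (or is p itself). The names `range_query` and `upd_seed_del` are illustrative assumptions, and the definition is not quoted from this document.

```python
from math import dist


def range_query(points, center, eps):
    """Indices of all points within eps of center (brute-force range search)."""
    return [i for i, q in enumerate(points) if dist(center, q) <= eps]


def upd_seed_del(data, p_idx, eps, min_pts):
    """Seed indices (into data) affected by deleting data[p_idx]."""
    remaining = [q for i, q in enumerate(data) if i != p_idx]
    seeds = set()
    # Only objects in the Eps-neighborhood of p can lose their core status.
    for qp in range_query(data, data[p_idx], eps):
        nbrs_old = range_query(data, data[qp], eps)
        if len(nbrs_old) < min_pts:
            continue  # q' was not a core object in D
        if qp != p_idx and len(range_query(remaining, data[qp], eps)) >= min_pts:
            continue  # q' is still a core object after the deletion
        for q in nbrs_old:  # q' lost core status: collect surviving cores
            if q != p_idx and len(range_query(remaining, data[q], eps)) >= min_pts:
                seeds.add(q)
    return seeds


square = [(0, 0), (0, 1), (1, 0), (1, 1)]
still_core = upd_seed_del(square, 3, eps=1.5, min_pts=3)
none_left = upd_seed_del(square, 3, eps=1.5, min_pts=4)
```

An empty seed set, as in the second call, corresponds to the situation in which no core object survives near p, so the remaining nearby objects would fall back to noise.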

Still referring to Fig. 8, the data type submodule 122 is configured to perform incremental clustering on the data with class labels according to the data type.

In this embodiment, the data type submodule 122 specifically includes a spatial data submodule (not shown).

The spatial data submodule is configured to, when the data type is spatial data: build a hash table according to the respective maximum and minimum values of the abscissa and the ordinate of the data with class labels; build an index from the hash table; and search the built index with a given data point to find the data points that meet a preset condition. In this embodiment, for the specific methods of building the hash table, building the index, and searching, refer to the relevant description above; they are not repeated here.
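The grid-based hash index for spatial data can be sketched in Python as follows. The sketch assumes square cells whose side equals Eps, so a range search only has to inspect the 3 × 3 block of cells around the query point; with a bounded number of points per cell, this gives the constant expected cost the text refers to. It uses a dictionary keyed by cell coordinates and only the coordinate minima as the grid origin, since a dictionary needs no fixed extent. The class name `GridIndex` and its methods are illustrative assumptions, not the patent's own structure.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex:
    """Hash table from grid cell (column, row) to the points in that cell."""

    def __init__(self, points, eps):
        self.eps = eps
        self.x_min = min(x for x, _ in points)  # grid origin from the minima
        self.y_min = min(y for _, y in points)
        self.cells = defaultdict(list)
        for p in points:
            self.insert(p)

    def _key(self, p):
        return (floor((p[0] - self.x_min) / self.eps),
                floor((p[1] - self.y_min) / self.eps))

    def insert(self, p):
        """Dynamic update: a new point only touches its own cell."""
        self.cells[self._key(p)].append(p)

    def range_query(self, p):
        """All indexed points within eps of p, checking the 3 x 3 cell block."""
        cx, cy = self._key(p)
        hits = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in self.cells.get((cx + dx, cy + dy), ()):
                    if dist(p, q) <= self.eps:
                        hits.append(q)
        return hits


index = GridIndex([(0, 0), (0.5, 0.5), (3, 3), (0.9, 0)], eps=1.0)
near_origin = index.range_query((0, 0))
```

Because the cell side equals Eps, two points within Eps of each other can differ by at most one cell index per axis, which is why the 3 × 3 block is sufficient.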

In this embodiment, the data type submodule 122 specifically also includes a non-spatial data submodule (not shown).

The non-spatial data submodule is configured to, when the data type is non-spatial data: partition the data with class labels according to a three-way partitioning algorithm; build an index using the MVP-Tree structure; and search the index with a given data point to find the data points that meet a preset condition;

Wherein the three-way partitioning algorithm specifically includes:

In this embodiment, the data type submodule 122 specifically also includes an uncertain data submodule (not shown).

In this embodiment, when the data type is uncertain data, each uncertain data object is represented by a probability density function because of the nature of uncertain data. Concepts such as the distance between data objects, core point, ε-neighborhood, and direct density-reachability therefore need to be redefined. In this embodiment, for the redefinitions, refer to the relevant description above; they are not repeated here.
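One way to picture a probabilistic distance between two uncertain objects is to estimate the probability that their distance does not exceed Eps by sampling from their probability density functions. The Python sketch below does this with a simple Monte Carlo estimate. The Gaussian object model, the names `gaussian_object` and `distance_probability`, the sample count, and the fixed seed are all illustrative assumptions; the patent's own redefinitions may be computed differently.

```python
import random
from math import dist


def gaussian_object(cx, cy, sigma):
    """Hypothetical uncertain 2-D object: isotropic Gaussian around (cx, cy)."""
    def sample(rng):
        return (rng.gauss(cx, sigma), rng.gauss(cy, sigma))
    return sample


def distance_probability(sample_i, sample_j, eps, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(d(o_i, o_j) <= eps) for two sampled objects."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(n_samples):
        if dist(sample_i(rng), sample_j(rng)) <= eps:
            hits += 1
    return hits / n_samples


a = gaussian_object(0.0, 0.0, 1.0)
b = gaussian_object(0.0, 0.0, 1.0)
p_close = distance_probability(a, b, eps=1.0)
```

Under this model a pair of objects is no longer simply inside or outside each other's neighborhood; each pair contributes a probability, and quantities such as the core-point property are then derived from those probabilities.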

In this embodiment, the Incremental PDBSCAN algorithm applies the PDBSCAN algorithm to an incremental database. It improves on PDBSCAN, which can only process static uncertain data, so that dynamic uncertain data can also be processed. The uncertain data submodule is configured to cluster the data with the PDBSCAN algorithm, compute the seed objects and the ε-neighborhood of the update point p, and perform incremental clustering according to the four cases of insertion and deletion.

For spatial data, the incremental clustering data mining system 10 provided by the present invention adopts a grid-based dynamic hash data structure, which reduces the time complexity of a range search from the original O(n^α), 0 < α < 1, to O(1) and thereby improves the efficiency of incremental clustering. For non-spatial data, the system 10 adopts the MVP-Tree structure together with a three-way partitioning data partitioning algorithm, which reduces the time of a range search and improves the efficiency of incremental clustering. For uncertain data, the Incremental PDBSCAN algorithm can process dynamic uncertain data. Building on the PDBSCAN clustering algorithm, the technical solution proposes, for the first time, the Incremental PDBSCAN clustering algorithm for dynamic attribute-uncertain data, improving clustering efficiency by several orders of magnitude. The incremental clustering data mining system 10 provided by the present invention avoids the waste of computing resources caused by repeated clustering, improves the efficiency of incremental clustering, strengthens the timeliness of data mining, and improves the efficiency of data mining.

It should be noted that the units included in the above embodiments are divided according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the present invention.

In addition, those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A density-based incremental clustering data mining method, characterized in that the method includes:

Clustering an original data set using the DBSCAN algorithm to obtain data with class labels; and

When new data is added to the data with class labels, performing incremental clustering on the data with class labels using the Incremental DBSCAN algorithm.

2. The incremental clustering data mining method as claimed in claim 1, characterized in that the step of performing incremental clustering on the data with class labels using the Incremental DBSCAN algorithm when new data is added to the data with class labels includes:

3. The incremental clustering data mining method as claimed in claim 2, characterized in that the step of determining the data operation mode applied to the data with class labels and processing the data with class labels according to the respective data operation mode includes:

4. The incremental clustering data mining method as claimed in claim 2, characterized in that the step of performing incremental clustering on the data with class labels according to the data type includes:

When the data type is spatial data:

Building an index from the hash table;

5. The incremental clustering data mining method as claimed in claim 2, characterized in that the step of performing incremental clustering on the data with class labels according to the data type also includes:

When the data type is non-spatial data:

Building an index using the MVP-Tree structure; and

Wherein the three-way partitioning algorithm specifically includes:

6. A density-based incremental clustering data mining system, characterized in that the incremental clustering data mining system includes:

A first clustering module, configured to cluster an original data set using the DBSCAN algorithm to obtain data with class labels; and

A second clustering module, configured to perform incremental clustering on the data with class labels using the Incremental DBSCAN algorithm when new data is added to the data with class labels.

7. The incremental clustering data mining system as claimed in claim 6, characterized in that the second clustering module specifically includes:

8. The incremental clustering data mining system as claimed in claim 7, characterized in that the operation mode submodule specifically includes:

9. The incremental clustering data mining system as claimed in claim 7, characterized in that the data type submodule specifically includes a spatial data submodule, wherein

The spatial data submodule is configured to, when the data type is spatial data:

Build an index from the hash table;

10. The incremental clustering data mining system as claimed in claim 7, characterized in that the data type submodule specifically includes a non-spatial data submodule, wherein

The non-spatial data submodule is configured to, when the data type is non-spatial data:

Build an index using the MVP-Tree structure; and

Wherein the three-way partitioning algorithm specifically includes: