CN106203474A

CN106203474A - A stream data clustering method based on dynamically changing density values

Info

Publication number: CN106203474A
Application number: CN201610486506.7A
Authority: CN
Inventors: 巩树凤; 张岩峰
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2016-06-27
Filing date: 2016-06-27
Publication date: 2016-12-07

Abstract

The present invention provides a stream data clustering method based on dynamically changing density values. The method determines the radius r of the CluCell data structure from the distances between all pairs of points in a historical stream data set D; for newly arriving stream data, it uses a density-center clustering algorithm to build a stream data clustering model based on dynamically changing density values; and it updates the clustering model according to the distance between the newly arriving stream data and the CluCell data structures and the distance between the newly arriving data and the outliers, thereby processing the newly arriving stream data. The method can cluster stream data of arbitrary shape and can detect the creation, merging, and splitting of clusters that occur during stream clustering, so that users can follow the detected changes in the clustering result. The algorithm can also detect outliers while clustering; because outliers are often erroneous data produced by a system, the detected outliers can be used to judge whether the system has failed.

Description

A stream data clustering method based on dynamically changing density values

Technical field

The invention belongs to the technical field of stream data cluster analysis, and specifically relates to a stream data clustering method based on dynamically changing density values.

Background art

Many existing stream data clustering methods are one-sided and have certain defects. Although these algorithms can perform cluster analysis on stream data, they do not meet all the requirements of stream data cluster analysis, which must be able to do the following: cluster data of arbitrary shape, identify outliers (points that do not belong to any cluster), and detect cluster changes (merging and splitting). For example, the CluStream algorithm can cluster stream data, but it is only suitable for linearly separable data and performs poorly on data that are not linearly separable, such as concave shapes. In addition, CluStream consists of two phases: an online data-extraction phase and an offline clustering phase. Every time the clustering result is checked, an offline operation must be triggered, so clustering efficiency drops when the result is checked frequently. The D-Stream and DenStream algorithms extract and cluster data with density-based methods. Although these two algorithms can cluster stream data sets of arbitrary shape, they still require an offline operation to find out whether the cluster state has changed. E-Stream keeps statistics along the coordinate axes, updates the cluster state from the current clustering and the newly arrived data, and detects changes in the clustering result, such as the splitting and merging of clusters, from the change in cluster state. Although E-Stream can detect cluster state changes without an offline operation, it relies on per-coordinate statistics and can therefore only cluster linearly separable data sets; its results on clusters of arbitrary shape are not ideal.

Summary of the invention

In view of the deficiencies of the prior art, the present invention proposes a stream data clustering method based on dynamically changing density values.

The technical solution of the present invention is as follows:

A stream data clustering method based on dynamically changing density values, comprising the following steps:

Step 1: determine the radius r of the CluCell data structure from the distances between all pairs of points in a historical stream data set D;

Step 1.1: form the historical stream data set D from K buffered historical stream data points;

Step 1.2: calculate the distance between every pair of points in the historical stream data set D;

Step 1.3: sort the pairwise distances in the historical stream data set D in ascending order, and take the value at the first A% position as the radius r of the CluCell data structure, where 1 < A < 2;

Step 2: for newly arriving stream data, use a density-center clustering algorithm to build a stream data clustering model based on dynamically changing density values;

Step 2.1: set a threshold M on the number of CluCell data structures;

Step 2.2: receive the currently arriving stream data point p and determine whether any CluCell currently exists; if so, perform step 2.3; otherwise, perform step 2.5;

Step 2.3: decay the density of all current CluCells according to the current time, find the CluCell c_k nearest to the stream data point p, and determine the distance d_pk between them;

Step 2.4: compare the distance d_pk with the CluCell radius r; if d_pk ≤ r, perform step 2.6; if d_pk > r, perform step 2.5;

Step 2.5: create a CluCell c_p centered on the stream data point p, delete the stream data point p, and perform step 2.7;

Step 2.6: add 1 to the density value of the CluCell c_k nearest to the stream data point p, delete the stream data point p, and return to step 2.2;

Step 2.7: count the current number N of CluCells and determine whether it has reached the threshold M; if so, perform step 2.8; otherwise, return to step 2.2;

Step 2.8: calculate the repulsion value of every current CluCell, and draw the decision graph of the data structures from the density values and repulsion values of all current CluCells;

Step 2.9: determine the density center points from the decision graph, and take the minimum of the repulsion values of all density center points as the minimum repulsion value δ_min;

Step 2.10: from the dependency relations of the repulsion values of all current CluCells, obtain trees rooted at the density center points, i.e. clustering trees; all clustering trees together form a stream data clustering model;

Step 3: update the stream data clustering model according to the distance between the newly arriving stream data and the CluCells and the distance between the newly arriving data and the outliers, thereby processing the newly arriving stream data;

Step 3.1: set a time threshold Δt;

Step 3.2: receive the currently arriving stream data point p′;

Step 3.3: decay the density of all current CluCells according to the current time, and delete any CluCell into which no stream data has been inserted within the time threshold Δt;

Step 3.4: find the CluCell c_k′ nearest to the stream data point p′ and determine the distance d_p′k′ between them;

Step 3.5: compare the distance d_p′k′ with the CluCell radius r and the minimum repulsion value δ_min; if d_p′k′ ≤ r, perform step 3.6; if r < d_p′k′ ≤ δ_min, perform step 3.7; if d_p′k′ > δ_min, perform step 3.8;

Step 3.6: add 1 to the density value of the CluCell c_k′ nearest to the stream data point p′, delete the stream data point p′, and perform step 3.10;

Step 3.7: create a CluCell c_p′ centered on the stream data point p′, delete the stream data point p′, and perform step 3.9;

Step 3.8: insert the stream data point p′ into the outlier pool, which temporarily stores outliers, and perform step 3.11;

Step 3.9: calculate the distance d_p′o from each outlier in the outlier pool to the CluCell c_p′; if there are Z outliers with d_p′o ≤ r, add the freshness of these Z outliers to the density value of the CluCell c_p′ and delete these Z outliers; if there are Z′ outliers with r < d_p′o ≤ δ_min, create CluCells centered on these Z′ outliers and delete these Z′ outliers;

Step 3.10: update the repulsion value of each CluCell, update the clustering trees according to the updated repulsion values, and return to step 3.2;

Step 3.11: find the outlier o′ nearest to the stream data point p′ and determine the distance d_p′o′ between them;

Step 3.12: compare the distance d_p′o′ with the CluCell radius r and the minimum repulsion value δ_min; if d_p′o′ ≤ r, perform step 3.13; if r < d_p′o′ ≤ δ_min, perform step 3.14; if d_p′o′ > δ_min, return to step 3.2;

Step 3.13: create a CluCell c_p′ centered on the stream data point p′, add the freshness of the outlier o′ to the density value of the CluCell c_p′, delete the outlier o′ and the stream data point p′, and return to step 3.9;

Step 3.14: create a CluCell c_p′ centered on the stream data point p′, create a CluCell c_o′ centered on the outlier o′, delete the outlier o′ and the stream data point p′, and return to step 3.9.
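The three-way comparison that step 3.5 applies to the distance between a new point and its nearest CluCell can be sketched as a small routing function. This is only an illustration under stated assumptions: the function name `route` and the returned labels are hypothetical and do not appear in the patent; only the two thresholds r and δ_min come from the text.

```python
def route(distance, r, delta_min):
    """Classify a new point by its distance to the nearest CluCell (step 3.5).

    The labels are hypothetical names for the three branches:
      "absorb"   -> step 3.6 (point falls inside the nearest CluCell)
      "new_cell" -> step 3.7 (point seeds a new CluCell)
      "outlier"  -> step 3.8 (point goes to the outlier pool)
    """
    if distance <= r:
        return "absorb"
    if distance <= delta_min:
        return "new_cell"
    return "outlier"
```

Step 3.12 uses the same pair of thresholds on the distance to the nearest outlier, but its branches lead to steps 3.13, 3.14, and 3.2 instead.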

The density value of the CluCell data structure decays according to the following formula:

\rho^{t} = \sum_{i=0}^{n} f_{i}^{t} = \sum_{i=0}^{n} 2^{-\lambda (t - t_{i})} = 2^{-\lambda (t - t_{l})} \rho^{t_{l}};

where ρ^t is the density value of the CluCell at time t, f_i^t is the freshness of the i-th stream data point p_i at time t, 0 < i < n, n is the number of stream data points in the CluCell, t_l is the time of the last density decay of the CluCell, t_i is the generation time of the stream data point p_i, and λ is the freshness decay coefficient.

The repulsion value of the CluCell data structure is given by the following formula:

\delta_{v}^{t} = \min_{l : \rho_{l}^{t} > \rho_{v}^{t}} (| c_{l}, c_{v} |);

where δ_v^t is the repulsion value of the CluCell c_v at time t, |c_l, c_v| is the distance between the CluCell c_v and the CluCell c_l, ρ_l^t is the density value of the CluCell c_l at time t, and ρ_v^t is the density value of the CluCell c_v at time t.

The specific method for updating the clustering trees according to the repulsion value of each CluCell is as follows:

If the repulsion value δ_μ^t of a current CluCell c_μ satisfies δ_μ^t > δ_min, the subtree rooted at c_μ is split off from its original clustering tree to form a new clustering tree; if the repulsion value δ_m^t of the root CluCell c_m of a current clustering tree satisfies δ_m^t < δ_min, the clustering tree rooted at c_m is merged into the clustering tree containing the CluCell on which the repulsion value of c_m depends, and that CluCell becomes the parent node of c_m.

The decision graph of the data structures is drawn from the density values and repulsion values of all current CluCells as follows: the density values of all current CluCells are taken as the abscissa and their repulsion values as the ordinate.

Beneficial effects of the present invention:

The present invention proposes a stream data clustering method based on dynamically changing density values. The method can cluster stream data of arbitrary shape and can detect the creation, merging, and splitting of clusters that occur during stream clustering, so that users can follow the detected changes in the clustering result. The algorithm can also detect outliers while clustering; because outliers are often erroneous data produced by a system, the detected outliers can be used to judge whether the system has failed.

Brief description of the drawings

Fig. 1 is a flow chart of the stream data clustering method based on dynamically changing density values in an embodiment of the present invention;

Fig. 2 is a flow chart of using the density-center clustering algorithm to build the stream data clustering model based on dynamically changing density values in an embodiment of the present invention;

Fig. 3 is the decision graph of the data structures drawn in an embodiment of the present invention;

Fig. 4 is a flow chart of updating the stream data clustering model according to the distance between the newly arriving stream data and the CluCell data structures and the distance between the newly arriving data and the outliers in an embodiment of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

A stream data clustering method based on dynamically changing density values. In this embodiment, 17k two-dimensional data points are clustered. As shown in Fig. 1, the method comprises the following steps:

Step 1: determine the radius r of the CluCell data structure from the distances between all pairs of points in a historical stream data set D.

In this embodiment, the CluCell data structure is a high-dimensional sphere in space formed by n stream data points {p_1, p_2, ..., p_n} and described by four attributes {s, r, ρ^t, δ^t}, where s is the seed point (center point) of the CluCell, r is the radius of the CluCell, ρ^t is the density value of the CluCell at time t, and δ^t is the repulsion value of the CluCell at time t.
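The four attributes {s, r, ρ^t, δ^t} can be pictured as a small record type. The sketch below is illustrative only: the class layout, the `covers` helper, the `last_update` bookkeeping field, and the initial density of 1.0 (the freshness of a point at its own arrival time) are assumptions that do not appear in the patent.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CluCell:
    seed: Tuple[float, ...]        # s: seed (center) point of the cell
    radius: float                  # r: radius shared by all cells
    density: float = 1.0           # rho^t: time-decayed density (assumed initial value)
    delta: float = math.inf        # delta^t: repulsion value, unknown until computed
    last_update: float = 0.0       # hypothetical field: time of the last density decay

    def covers(self, point) -> bool:
        """Return True if the point lies within distance r of the seed."""
        return math.dist(self.seed, point) <= self.radius
```

Because the cell stores only its seed and aggregate values, individual stream data points can be discarded once they have been counted, which matches the "delete the stream data point" instruction in the steps.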

Step 1.1: form the historical stream data set D from K buffered historical stream data points.

Step 1.2: calculate the distance between every pair of points in the historical stream data set D.

Step 1.3: sort the pairwise distances in the historical stream data set D in ascending order, and take the value at the first A% position as the radius r of the CluCell data structure, where 1 < A < 2.

In this embodiment, the pairwise distances in the historical stream data set D are sorted in ascending order, and the value at the 1.5% position is taken as the radius r of the CluCell data structure. The resulting radius r is 3.
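Steps 1.1 to 1.3 can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a simple truncating index rule for "the value at the first A% position"; the patent does not state either detail, and the function name `cell_radius` is hypothetical.

```python
import math
from itertools import combinations


def cell_radius(points, a_percent=1.5):
    """Pick the CluCell radius r from the sorted pairwise distances of D.

    points: buffered historical stream data points (tuples of floats).
    a_percent: the A in "first A% position", with 1 < A < 2 in the patent.
    """
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    if not dists:
        raise ValueError("need at least two points to compute a distance")
    # Assumed rounding rule: truncate the fractional position.
    index = min(len(dists) - 1, int(len(dists) * a_percent / 100.0))
    return dists[index]
```

Computing all pairs is quadratic in K, which is acceptable here because it runs once on the buffered history rather than on the live stream.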

Step 2: for newly arriving stream data, use a density-center clustering algorithm to build a stream data clustering model based on dynamically changing density values, as shown in Fig. 2.

Step 2.1: set a threshold M on the number of CluCell data structures.

In this embodiment, the threshold M on the number of CluCells is set to 25.

Step 2.2: receive the currently arriving stream data point p and determine whether any CluCell currently exists; if so, perform step 2.3; otherwise, perform step 2.5.

Step 2.3: decay the density of all current CluCells according to the current time, find the CluCell c_k nearest to the stream data point p, and determine the distance d_pk between them.

In this embodiment, the density value of a CluCell decays according to formula (1):

\rho^{t} = \sum_{i=0}^{n} f_{i}^{t} = \sum_{i=0}^{n} 2^{-\lambda (t - t_{i})} = 2^{-\lambda (t - t_{l})} \rho^{t_{l}} \qquad (1)

where ρ^t is the density value of the CluCell at time t, f_i^t is the freshness of the i-th stream data point p_i at time t, 0 < i < n, n is the number of stream data points in the CluCell, t_l is the time of the last density decay of the CluCell, t_i is the generation time of the stream data point p_i, and λ is the freshness decay coefficient.
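The last equality in formula (1) means a CluCell never has to revisit its individual points: multiplying the stored density by a single decay factor gives the same result as summing the freshness of every point. A minimal sketch of both forms, with hypothetical function names:

```python
def freshness(t_i, now, lam):
    """Freshness f_i^t of a point generated at time t_i, evaluated at time now."""
    return 2.0 ** (-lam * (now - t_i))


def decay(density, last_time, now, lam):
    """Decay a stored density from its last decay time t_l to the current time."""
    return density * 2.0 ** (-lam * (now - last_time))
```

For example, adding 1 for each arriving point and calling `decay` between arrivals yields the same density as summing `freshness` over all points at the end.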

Step 2.4: compare the distance d_pk with the CluCell radius r; if d_pk ≤ r, perform step 2.6; if d_pk > r, perform step 2.5.

Step 2.5: create a CluCell c_p centered on the stream data point p, delete the stream data point p, and perform step 2.7.

Step 2.6: add 1 to the density value of the CluCell c_k nearest to the stream data point p, delete the stream data point p, and return to step 2.2.

Step 2.7: count the current number N of CluCells and determine whether it has reached the threshold M; if so, perform step 2.8; otherwise, return to step 2.2.
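The loop formed by steps 2.2 to 2.7 can be sketched as below. This is an illustration under assumptions, not the patent's implementation: cells are plain dictionaries with hypothetical keys `seed`, `density`, and `t`; a new cell starts with density 1.0; distance is Euclidean; and the stream is an iterable of (time, point) pairs.

```python
import math


def build_initial_cells(stream, r, m, lam=0.0):
    """Consume (time, point) pairs until M CluCells exist (steps 2.2-2.7)."""
    cells = []
    for t, p in stream:
        # Step 2.3: decay every existing cell to the current time.
        for c in cells:
            c["density"] *= 2.0 ** (-lam * (t - c["t"]))
            c["t"] = t
        nearest = min(cells, key=lambda c: math.dist(c["seed"], p), default=None)
        # Step 2.4 / 2.6: absorb the point if it falls inside the nearest cell.
        if nearest is not None and math.dist(nearest["seed"], p) <= r:
            nearest["density"] += 1.0
            continue
        # Step 2.5: otherwise the point seeds a new cell.
        cells.append({"seed": p, "density": 1.0, "t": t})
        # Step 2.7: stop once the number of cells reaches the threshold M.
        if len(cells) >= m:
            break
    return cells
```

As in the steps, the count check runs only after a new cell is created, since absorbing a point cannot change the number of cells.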

Step 2.8: calculate the repulsion value of every current CluCell, and draw the decision graph of the data structures from the density values and repulsion values of all current CluCells.

In this embodiment, the repulsion value of a CluCell is given by formula (2):

\delta_{v}^{t} = \min_{l : \rho_{l}^{t} > \rho_{v}^{t}} (| c_{l}, c_{v} |) \qquad (2)

In this embodiment, the density values of all current CluCells are taken as the abscissa and their repulsion values as the ordinate to draw the decision graph of the data structures, as shown in Fig. 3.
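Formula (2) can be evaluated with a direct double loop over the cells. The sketch below also records which higher-density cell attains the minimum, since that dependency is what the clustering trees are built from. Two points are assumptions: the distance is Euclidean, and the cell with the highest density, for which the set in formula (2) is empty, is given an infinite repulsion value. The function name is hypothetical.

```python
import math


def repulsion_values(seeds, densities):
    """Return (deltas, parents) for all cells according to formula (2).

    deltas[v]  : distance from cell v to the nearest cell of higher density.
    parents[v] : index of that nearest higher-density cell, or None.
    """
    n = len(seeds)
    deltas = [math.inf] * n
    parents = [None] * n
    for v in range(n):
        for l in range(n):
            if densities[l] > densities[v]:
                d = math.dist(seeds[l], seeds[v])
                if d < deltas[v]:
                    deltas[v], parents[v] = d, l
    return deltas, parents
```

Plotting `densities` on the horizontal axis against `deltas` on the vertical axis gives the decision graph described above.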

Step 2.9: determine the density center points from the decision graph, and take the minimum of the repulsion values of all density center points as the minimum repulsion value δ_min.

In this embodiment, according to Fig. 3, the three data points in the upper right corner of the decision graph are selected as density center points.
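In the embodiment the density centers are read off the decision graph by eye. One way to mimic "upper right corner" programmatically is to require both a high density and a high repulsion value. The sketch below does this with two cut-off parameters, `rho_min` and `delta_cut`, which are hypothetical and not part of the patent; only the definition of δ_min as the smallest repulsion value among the centers comes from step 2.9.

```python
def pick_centers(densities, deltas, rho_min, delta_cut):
    """Select density centers and compute delta_min (step 2.9).

    A cell counts as a center when its density is at least rho_min and its
    repulsion value is at least delta_cut (both thresholds are assumptions).
    """
    centers = [i for i, (rho, d) in enumerate(zip(densities, deltas))
               if rho >= rho_min and d >= delta_cut]
    if not centers:
        raise ValueError("no cell satisfies both thresholds")
    delta_min = min(deltas[i] for i in centers)
    return centers, delta_min
```

A cell with a large repulsion value but a low density lies in the upper left of the decision graph and is excluded, which is the region where isolated cells tend to fall.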

Step 2.10: from the dependency relations of the repulsion values of all current CluCells, obtain trees rooted at the density center points, i.e. clustering trees; all clustering trees together form a stream data clustering model.
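Step 2.10 can be illustrated by following each cell's dependency link upward until a density center is reached; the center found this way identifies the clustering tree, and hence the cluster, of that cell. The sketch assumes the dependencies are given as a list `parents`, where `parents[v]` is the index of the nearest higher-density cell or None; these names are hypothetical.

```python
def cluster_labels(parents, centers):
    """Label every cell with the root (density center) of its clustering tree."""
    roots = set(centers)
    labels = []
    for v in range(len(parents)):
        node = v
        # Each link leads to a strictly denser cell, so the walk terminates.
        while node not in roots and parents[node] is not None:
            node = parents[node]
        labels.append(node)
    return labels
```

For example, with `parents = [None, 0, 0, 2, 3]` and centers 0 and 2, cells 0 and 1 belong to the tree rooted at 0, and cells 2, 3, and 4 to the tree rooted at 2.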

Step 3: update the stream data clustering model according to the distance between the newly arriving stream data and the CluCells and the distance between the newly arriving data and the outliers, thereby processing the newly arriving stream data, as shown in Fig. 4.

Step 3.1: set a time threshold Δt.

In this embodiment, the time threshold Δt is set to 5 s.

Step 3.2: receive the currently arriving stream data point p′.

Step 3.3: decay the density of all current CluCells according to the current time, and delete any CluCell into which no stream data has been inserted within the time threshold Δt.
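Step 3.3 combines the density decay with the removal of stale cells. A minimal sketch, assuming each cell is a dictionary that tracks the time of its last decay (`t`) and the time a point was last inserted (`last_insert`); both keys, the function name, and the strict "older than Δt" comparison are assumptions.

```python
def decay_and_prune(cells, now, lam, dt):
    """Decay all cells to the current time and drop cells idle longer than dt."""
    kept = []
    for c in cells:
        if now - c["last_insert"] > dt:
            continue  # no point inserted within the time threshold: delete
        factor = 2.0 ** (-lam * (now - c["t"]))
        kept.append({**c, "density": c["density"] * factor, "t": now})
    return kept
```

With the embodiment's Δt of 5 s, a cell that last received a point 8 s ago would be removed, while one updated 2 s ago would only have its density decayed.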

Step 3.4: find the CluCell c_k′ nearest to the stream data point p′ and determine the distance d_p′k′ between them.

Step 3.5: compare the distance d_p′k′ with the CluCell radius r and the minimum repulsion value δ_min; if d_p′k′ ≤ r, perform step 3.6; if r < d_p′k′ ≤ δ_min, perform step 3.7; if d_p′k′ > δ_min, perform step 3.8.

Step 3.6: add 1 to the density value of the CluCell c_k′ nearest to the stream data point p′, delete the stream data point p′, and perform step 3.10.

Step 3.7: create a CluCell c_p′ centered on the stream data point p′, delete the stream data point p′, and perform step 3.9.

Step 3.8: insert the stream data point p′ into the outlier pool, which temporarily stores outliers, and perform step 3.11.

Step 3.9: calculate the distance d_p′o from each outlier in the outlier pool to the CluCell c_p′; if there are Z outliers with d_p′o ≤ r, add the freshness of these Z outliers to the density value of the CluCell c_p′ and delete these Z outliers; if there are Z′ outliers with r < d_p′o ≤ δ_min, create CluCells centered on these Z′ outliers and delete these Z′ outliers.
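Step 3.9 re-examines the outlier pool whenever a new CluCell c_p′ appears. The sketch below is an illustration under assumptions: the pool is a list of (point, generation time) pairs, cells are dictionaries with hypothetical keys, distance is Euclidean, and a cell created from an outlier starts with that outlier's freshness as its density, a detail the patent does not specify.

```python
import math


def absorb_outliers(cell, pool, now, r, delta_min, lam):
    """Apply step 3.9 to a newly created cell; return (new_cells, remaining_pool)."""
    new_cells, remaining = [], []
    for point, t_o in pool:
        d = math.dist(cell["seed"], point)
        fresh = 2.0 ** (-lam * (now - t_o))  # freshness of the outlier
        if d <= r:
            cell["density"] += fresh          # outlier joins the new cell
        elif d <= delta_min:
            new_cells.append({"seed": point, "density": fresh})
        else:
            remaining.append((point, t_o))    # still an outlier
    return new_cells, remaining
```

Adding the decayed freshness rather than 1 keeps the density consistent with formula (1), because the outlier was generated before the cell existed.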

Step 3.10: update the repulsion value of each CluCell, update the clustering trees according to the updated repulsion values, and return to step 3.2.

In this embodiment, if the repulsion value δ_μ^t of a current CluCell c_μ satisfies δ_μ^t > δ_min, the subtree rooted at c_μ is split off from its original clustering tree to form a new clustering tree; if the repulsion value δ_m^t of the root CluCell c_m of a current clustering tree satisfies δ_m^t < δ_min, the clustering tree rooted at c_m is merged into the clustering tree containing the CluCell on which the repulsion value of c_m depends, and that CluCell becomes the parent node of c_m.
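The split and merge rules can be sketched as an update of the set of tree roots. The function below is only an illustration with hypothetical names: it takes the current root indices and the updated repulsion values and returns the new root set, leaving the parent links to whatever routine recomputes the dependencies.

```python
def update_roots(roots, deltas, delta_min):
    """Apply the split/merge rules to the set of clustering-tree roots.

    Split: a non-root cell whose repulsion value exceeds delta_min becomes a root.
    Merge: a root whose repulsion value drops below delta_min stops being a root
           and rejoins the tree of the cell it depends on.
    """
    new_roots = set(roots)
    for v, d in enumerate(deltas):
        if v not in new_roots and d > delta_min:
            new_roots.add(v)        # subtree rooted at v splits off
        elif v in new_roots and d < delta_min:
            new_roots.discard(v)    # tree rooted at v merges into its parent's tree
    return new_roots
```

A root whose repulsion value equals δ_min exactly is left unchanged, because the text gives strict inequalities for both rules.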

Step 3.11: find the outlier o′ nearest to the stream data point p′ and determine the distance d_p′o′ between them.

Step 3.12: compare the distance d_p′o′ with the CluCell radius r and the minimum repulsion value δ_min; if d_p′o′ ≤ r, perform step 3.13; if r < d_p′o′ ≤ δ_min, perform step 3.14; if d_p′o′ > δ_min, return to step 3.2.

Claims

1. A stream data clustering method based on dynamically changing density values, characterized by comprising the following steps:

Step 1: determine the radius r of the CluCell data structure from the distances between all pairs of points in a historical stream data set D;

Step 1.2: calculate the distance between every pair of points in the historical stream data set D;

Step 1.3: sort the pairwise distances in the historical stream data set D in ascending order, and take the value at the first A% position as the radius r of the CluCell data structure, where 1 < A < 2;

Step 2: for newly arriving stream data, use a density-center clustering algorithm to build a stream data clustering model based on dynamically changing density values;

Step 2.1: set a threshold M on the number of CluCell data structures;

Step 2.2: receive the currently arriving stream data point p and determine whether any CluCell currently exists; if so, perform step 2.3; otherwise, perform step 2.5;

Step 2.3: decay the density of all current CluCells according to the current time, find the CluCell c_k nearest to the stream data point p, and determine the distance d_pk between them;

Step 2.4: compare the distance d_pk with the CluCell radius r; if d_pk ≤ r, perform step 2.6; if d_pk > r, perform step 2.5;

Step 2.5: create a CluCell c_p centered on the stream data point p, delete the stream data point p, and perform step 2.7;

Step 2.6: add 1 to the density value of the CluCell c_k nearest to the stream data point p, delete the stream data point p, and return to step 2.2;

Step 2.7: count the current number N of CluCells and determine whether it has reached the threshold M; if so, perform step 2.8; otherwise, return to step 2.2;

Step 2.8: calculate the repulsion value of every current CluCell, and draw the decision graph of the data structures from the density values and repulsion values of all current CluCells;

Step 2.9: determine the density center points from the decision graph, and take the minimum of the repulsion values of all density center points as the minimum repulsion value δ_min;

Step 2.10: from the dependency relations of the repulsion values of all current CluCells, obtain trees rooted at the density center points, i.e. clustering trees; all clustering trees together form a stream data clustering model;

Step 3: update the stream data clustering model according to the distance between the newly arriving stream data and the CluCells and the distance between the newly arriving data and the outliers, thereby processing the newly arriving stream data;

Step 3.1: set a time threshold Δt;

Step 3.2: receive the currently arriving stream data point p';

Step 3.3: decay the density of all current CluCells according to the current time, and delete any CluCell into which no stream data has been inserted within the time threshold Δt;

Step 3.4: find the CluCell c_k' nearest to the stream data point p' and determine the distance d_p'k' between them;

Step 3.5: compare the distance d_p'k' with the CluCell radius r and the minimum repulsion value δ_min; if d_p'k' ≤ r, perform step 3.6; if r < d_p'k' ≤ δ_min, perform step 3.7; if d_p'k' > δ_min, perform step 3.8;

Step 3.6: add 1 to the density value of the CluCell c_k' nearest to the stream data point p', delete the stream data point p', and perform step 3.10;

Step 3.7: create a CluCell c_p' centered on the stream data point p', delete the stream data point p', and perform step 3.9;

Step 3.8: insert the stream data point p' into the outlier pool, which temporarily stores outliers, and perform step 3.11;

Step 3.9: calculate the distance d_p'o from each outlier in the outlier pool to the CluCell c_p'; if there are Z outliers with d_p'o ≤ r, add the freshness of these Z outliers to the density value of the CluCell c_p' and delete these Z outliers; if there are Z' outliers with r < d_p'o ≤ δ_min, create CluCells centered on these Z' outliers and delete these Z' outliers;

Step 3.10: update the repulsion value of each CluCell, update the clustering trees according to the updated repulsion values, and return to step 3.2;

Step 3.11: find the outlier o' nearest to the stream data point p' and determine the distance d_p'o' between them;

Step 3.12: compare the distance d_p'o' with the CluCell radius r and the minimum repulsion value δ_min; if d_p'o' ≤ r, perform step 3.13; if r < d_p'o' ≤ δ_min, perform step 3.14; if d_p'o' > δ_min, return to step 3.2;

Step 3.13: create a CluCell c_p' centered on the stream data point p', add the freshness of the outlier o' to the density value of the CluCell c_p', delete the outlier o' and the stream data point p', and return to step 3.9;

Step 3.14: create a CluCell c_p' centered on the stream data point p', create a CluCell c_o' centered on the outlier o', delete the outlier o' and the stream data point p', and return to step 3.9.

2. The stream data clustering method based on dynamically changing density values according to claim 1, characterized in that the density value of the CluCell data structure decays according to the following formula:

\rho^{t} = \sum_{i=0}^{n} f_{i}^{t} = \sum_{i=0}^{n} 2^{-\lambda (t - t_{i})} = 2^{-\lambda (t - t_{l})} \rho^{t_{l}};

where ρ^t is the density value of the CluCell at time t, f_i^t is the freshness of the i-th stream data point p_i at time t, 0 < i < n, n is the number of stream data points in the CluCell, t_l is the time of the last density decay of the CluCell, t_i is the generation time of the stream data point p_i, and λ is the freshness decay coefficient.

3. The stream data clustering method based on dynamically changing density values according to claim 1, characterized in that the repulsion value of the CluCell data structure is given by the following formula:

\delta_{v}^{t} = \min_{l : \rho_{l}^{t} > \rho_{v}^{t}} (| c_{l}, c_{v} |);

where δ_v^t is the repulsion value of the CluCell c_v at time t, |c_l, c_v| is the distance between the CluCell c_v and the CluCell c_l, ρ_l^t is the density value of the CluCell c_l at time t, and ρ_v^t is the density value of the CluCell c_v at time t.

4. The stream data clustering method based on dynamically changing density values according to claim 1, characterized in that the specific method for updating the clustering trees according to the repulsion value of each CluCell is as follows:

If the repulsion value δ_μ^t of a current CluCell c_μ satisfies δ_μ^t > δ_min, the subtree rooted at c_μ is split off from its original clustering tree to form a new clustering tree; if the repulsion value δ_m^t of the root CluCell c_m of a current clustering tree satisfies δ_m^t < δ_min, the clustering tree rooted at c_m is merged into the clustering tree containing the CluCell on which the repulsion value of c_m depends, and that CluCell becomes the parent node of c_m.

5. The stream data clustering method based on dynamically changing density values according to claim 1, characterized in that the decision graph of the data structures is drawn from the density values and repulsion values of all current CluCells as follows: the density values of all current CluCells are taken as the abscissa and their repulsion values as the ordinate.