CN108319699A - Real-time clustering method for evolutionary data streams - Google Patents
Real-time clustering method for evolutionary data streams
- Publication number
- CN108319699A
- Authority
- CN
- China
- Prior art keywords
- class
- disappearance
- point
- class set
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an online clustering method for evolving data streams. The technical scheme comprises the following steps: ① establishing an effective class set, a vanished class set, and an outlier set; ② assigning the point to be processed, obtained at the current moment, to one of these sets; and ③ updating the outlier set, the effective class set, and the vanished class set.
Description
Technical field
The invention belongs to the technical field of data stream clustering, and specifically relates to a dynamic clustering method for evolving data streams.
Background art
A data stream is data that flows in continuously in real time, in contrast to data obtained in traditional batches. Depending on whether the data distribution changes, data streams are usually divided into static data streams (the distribution does not change) and evolving data streams (the distribution changes); evolving data streams are also called dynamic data streams. Data streams have become one of the main forms of data in the information society, for example financial transaction data, communication records, and sensor observations. Data stream clustering analyzes a data stream by means of clustering. Because it does not depend on prior information, it has become one of the main tools of data stream mining.
At present, data stream clustering methods mainly target static data streams. In practice, however, real data streams generally have an evolution characteristic (or dynamic characteristic): as data flows in, the stream may change dynamically through evolutionary forms such as the appearance of a new class, the disappearance of an old class, and the reappearance of a vanished class (hereinafter called emergence, disappearance, and recurrence, respectively). In practical applications, detecting these common evolutionary forms is usually of greater significance to the user; for example, it can serve monitoring and observation purposes in astronomy, medicine, finance, and networking. There is therefore an urgent need to develop data stream clustering techniques for evolving data. On the one hand, this improves the user's understanding of the current clustering of the data stream and of how each class evolves. On the other hand, it helps the user make accurate judgments before all the data has arrived, such as finding the time of an abnormal network intrusion, estimating the number of classes in a given time period, and finding the optimal adjustment time. Although scholars at home and abroad have made many attempts at clustering evolving data streams, most of this work addresses only the emergence of new classes. This severely limits the range of application of data stream clustering algorithms, so the ability of data stream clustering to handle multiple evolutionary forms needs to be extended.
Summary of the invention
The present invention provides an online clustering method for evolving data streams. Detection strategies are designed separately for the three typical evolutionary forms of an evolving data stream (emergence, disappearance, and recurrence), together with a processing framework that integrates the three strategies. This achieves real-time clustering of evolving data streams: new classes in the stream are added in time, vanished classes are removed in time, and recurring classes are restored in time without having to be formed again.
The technical scheme of the invention is a real-time clustering method for evolving data streams, characterized by comprising the following steps:
① establishing an effective class set, a vanished class set, and an outlier set;
② assigning the point to be processed, obtained at the current moment, to one of the sets;
③ updating the outlier set, the effective class set, and the vanished class set.
Wherein:
① The step of establishing the effective class set, the vanished class set, and the outlier set:
The initial value of the effective class set is the set of classes obtained by collecting a certain amount of data and clustering it with a static clustering method. The initial value of the vanished class set is the empty set. The initial value of the outlier set is the empty set.
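The initialization can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source does not name the static clustering method, so `leader_clustering` is a hypothetical stand-in, and the `Cluster` fields, the `radius` parameter, and the warm-up data are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Cluster:
    center: tuple               # class center (running mean of member points)
    count: int                  # number of points absorbed so far
    last_update: float          # time at which the class last received a point
    hits_since_vanish: int = 0  # points received since entering the vanished set


def leader_clustering(points, radius, now=0.0):
    """Placeholder static clusterer (assumption): a point joins the first
    cluster whose center lies within `radius`, otherwise it starts a new one."""
    clusters = []
    for p in points:
        for c in clusters:
            if dist(p, c.center) <= radius:
                n = c.count + 1
                # Incremental mean update of the class center.
                c.center = tuple(m + (x - m) / n for m, x in zip(c.center, p))
                c.count = n
                break
        else:
            clusters.append(Cluster(center=tuple(p), count=1, last_update=now))
    return clusters


# Hypothetical warm-up data collected before real-time processing starts.
warmup = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
effective = leader_clustering(warmup, radius=1.0)  # effective class set
vanished = []                                      # vanished class set: empty
outliers = []                                      # outlier set: empty
```

Any batch clustering algorithm could take the place of `leader_clustering`; only the shape of the result (a list of classes) matters for the later steps.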
② The step of assigning the point to be processed, obtained at the current moment, to one of the sets:
First, compute the Euclidean distance between the point to be processed and each class in the effective class set and the vanished class set, and take the minimum.
Then compare this minimum with the predetermined outlier threshold. If the minimum is greater than the threshold, the point is assigned to the outlier set. If the minimum is less than or equal to the threshold, the point is assigned to the class that attains the minimum, in whichever set contains that class.
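The assignment rule can be sketched in a few lines. The function names are hypothetical, classes are represented only by their centers, and breaking ties in favor of the effective class set is an assumption the source does not address.

```python
from math import dist, inf


def nearest(centers, x):
    """Return (distance, index) of the center closest to x, or (inf, None)
    when the list of centers is empty."""
    best_d, best_i = inf, None
    for i, c in enumerate(centers):
        d = dist(c, x)
        if d < best_d:
            best_d, best_i = d, i
    return best_d, best_i


def assign(x, effective_centers, vanished_centers, outlier_threshold):
    """Decide where point x goes: a class in one of the two class sets,
    or the outlier set."""
    d1, i1 = nearest(effective_centers, x)
    d2, i2 = nearest(vanished_centers, x)
    if min(d1, d2) > outlier_threshold:
        return ("outlier", None)
    # Tie-break toward the effective set (assumption, not from the source).
    return ("effective", i1) if d1 <= d2 else ("vanished", i2)
```

For example, with one effective center at the origin, one vanished center at (5, 5), and a threshold of 1.0, the point (0.2, 0) goes to the effective class, (5, 5.5) goes to the vanished class, and (10, 10) becomes an outlier.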
③ The step of updating the outlier set, the effective class set, and the vanished class set:
(a) The step of updating the outlier set:
First count the elements in the outlier set, then compare the count with the predetermined emergence threshold. If the count is greater than or equal to the threshold, cluster all elements of the outlier set with a static clustering method, add the resulting classes to the effective class set, and empty the outlier set. If the count is less than the threshold, leave the outlier set unchanged.
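A short sketch of the emergence check follows. `static_cluster` is a hypothetical callable standing in for the unspecified static clustering method, and `one_class` is a deliberately trivial placeholder used only to make the example runnable.

```python
def update_outlier_set(outliers, effective, emerge_threshold, static_cluster):
    """Promote buffered outliers to new classes once enough have accumulated.

    `static_cluster` (assumption) maps a list of points to a list of classes.
    Returns True if new classes were created.
    """
    if len(outliers) < emerge_threshold:
        return False                   # too few points: leave the set unchanged
    effective.extend(static_cluster(list(outliers)))
    outliers.clear()                   # empty the outlier set
    return True


def one_class(points):
    """Trivial placeholder clusterer: all points form a single class."""
    return [points]


effective = [[(0.0, 0.0)]]             # one existing class, stored as its points
outliers = [(9.0, 9.0), (9.1, 9.0)]
created_early = update_outlier_set(outliers, effective, 3, one_class)
outliers.append((9.0, 9.1))
created = update_outlier_set(outliers, effective, 3, one_class)
```

With an emergence threshold of 3, the first call leaves both sets untouched and the second call moves the three buffered points into a new class.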
(b) The step of updating the effective class set:
For every class in the effective class set, proceed as follows:
First compute the time interval between the current moment and the most recent update of the class. If the interval is greater than or equal to the predetermined disappearance threshold, delete the class from the effective class set and add it to the vanished class set.
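The disappearance check might look like the sketch below. Classes are plain dictionaries here; the keys `last_update` and `hits_since_vanish` are names invented for the example, and resetting the hit counter at the moment of the move is an assumption that follows from how the recurrence count is defined.

```python
def update_effective_set(effective, vanished, now, disappear_threshold):
    """Move classes that have been idle for too long into the vanished set."""
    still_effective = []
    for cls in effective:
        if now - cls["last_update"] >= disappear_threshold:
            cls["hits_since_vanish"] = 0   # start counting points for recurrence
            vanished.append(cls)
        else:
            still_effective.append(cls)
    effective[:] = still_effective         # update the caller's list in place


effective = [{"name": "a", "last_update": 2.0},
             {"name": "b", "last_update": 9.0}]
vanished = []
update_effective_set(effective, vanished, now=10.0, disappear_threshold=5.0)
```

In this run class "a" has been idle for 8 time units and is moved, while class "b" stays in the effective class set.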
(c) The step of updating the vanished class set:
For every class in the vanished class set, proceed as follows:
First count the processing points assigned to the class since it was added to the vanished class set. If the count is greater than or equal to the predetermined recurrence threshold, delete the class from the vanished class set and add it to the effective class set.
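The recurrence check mirrors the disappearance check. As before, the dictionary layout and the key `hits_since_vanish` are assumptions made for illustration.

```python
def update_vanished_set(vanished, effective, recur_threshold):
    """Restore vanished classes that have received enough new points."""
    still_vanished = []
    for cls in vanished:
        if cls["hits_since_vanish"] >= recur_threshold:
            effective.append(cls)          # the class is effective again
        else:
            still_vanished.append(cls)
    vanished[:] = still_vanished           # update the caller's list in place


vanished = [{"name": "a", "hits_since_vanish": 3},
            {"name": "b", "hits_since_vanish": 1}]
effective = []
update_vanished_set(vanished, effective, recur_threshold=3)
```

Because the restored class keeps its center and history, it does not have to be formed again from outliers, which is the point of keeping a vanished class set at all.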
The beneficial effects of the invention are:
(1) The invention designs a separate detection function for each of the three typical evolutionary forms in an evolving data stream (the emergence, disappearance, and recurrence of classes), namely the steps of updating the outlier set, the effective class set, and the vanished class set. This improves the ability of data stream clustering to process evolving data streams.
(2) By proposing a dynamic clustering algorithm for evolving data streams, the invention handles the three evolutionary forms in a unified, integrated way. This improves the stability of data stream clustering and extends its range of application.
Description of the drawings
Fig. 1 is a schematic flow diagram of the principle of the invention;
Fig. 2 is a flow chart of a specific implementation of the invention;
Fig. 3 shows the data set information for simulation experiment one;
Fig. 4 shows the experimental results of simulation experiment one;
Fig. 5 shows the data set information for simulation experiment two;
Fig. 6 shows the experimental results of simulation experiment two.
Specific embodiments
The following further describes the present invention with reference to the drawings.
Fig. 1 is a schematic flow diagram of the principle of the invention. First, the three sets (the effective class set, the vanished class set, and the outlier set) are established; next, the point to be processed is assigned to a set; finally, the three sets are updated. The basic elements of the effective class set and the vanished class set are classes. The basic elements of the outlier set are processing points, that is, the basic units that make up the data stream, also called data points.
The effective class set stores the classes that still have clustering significance for the data stream at the current moment. Its initial value is the set of classes obtained by collecting a certain amount of data to be processed and clustering it with a static clustering method. The vanished class set stores the classes that were deleted from the effective class set before the current moment because they had lost clustering significance for the data stream. Its initial value is the empty set. The outlier set stores outliers. An outlier is a processing point identified as follows: the minimum distance between the point and the classes in the effective class set and the vanished class set is compared with the predetermined outlier threshold, and if the minimum is greater than the threshold, the point is judged to be an outlier and is assigned to the outlier set.
The three sets are updated in response to the three typical evolutionary forms. Specifically, updating the outlier set corresponds to the emergence of a new class in the data stream. When a new class appears, a number of its points cannot find a suitable class and are assigned to the outlier set. Therefore, when the number of points in the outlier set reaches the predetermined emergence threshold, a new class is taken to have emerged, and the points in the outlier set need to be clustered. Updating the effective class set corresponds to the disappearance of an old class. If no point has been assigned to a class in the effective class set for a long time, and that interval reaches the predetermined disappearance threshold, the class has "expired"; it is deleted from the effective class set and added to the vanished class set. Updating the vanished class set corresponds to the recurrence of an old class. If points are still being assigned to a class in the vanished class set, and their number reaches the predetermined recurrence threshold, the class has changed from the "expired" state back to the "effective" state; it is deleted from the vanished class set and added to the effective class set. During real-time processing of the data stream, the step of assigning the point to be processed to a set and the step of updating the three sets are repeated.
Fig. 2 is a flow chart of a specific implementation of the invention. As shown in Fig. 2:
First, three sets are established: Θ, O, and Δ, which correspond to the effective class set, the outlier set, and the vanished class set, respectively.
The initial values of the outlier set O and the vanished class set Δ are both the empty set. To initialize the effective class set Θ, the data stream acquired so far is clustered with an existing clustering method to obtain the initial cluster set Θ = {c_j}; assuming J classes are obtained, j = 1, 2, ..., J. The amount of stream data collected for initialization is determined case by case.
Then the real-time clustering stage begins:
Assume that a multidimensional data point x_n arrives at an arbitrary time t_n. The following real-time clustering steps are carried out in turn:
First step: outlier detection.
①: If the effective class set Θ is non-empty, compute the Euclidean distance between x_n and each class c_j in the set; it is recommended here to use the Euclidean distance between the point x_n and the class center of c_j. Suppose class c_m is the closest to x_n in this set, that is, their Euclidean distance is the smallest; this minimum distance is defined as d_1.
If the effective class set Θ is empty, d_1 = ∞.
②: If the vanished class set Δ is non-empty, let Δ = {b_l} with L elements in total, so that l = 1, 2, ..., L, and compute the Euclidean distance between x_n and each class b_l. Suppose class b_r is the closest to x_n in Δ, that is, their Euclidean distance is the smallest; this minimum distance is defined as d_2.
If Δ is empty, d_2 = ∞.
③: Let d_3 = min(d_1, d_2), where min is the minimum function, so that d_3 is the smaller of d_1 and d_2. Compare d_3 with the predetermined outlier threshold d_0: if d_3 > d_0, assign x_n to the outlier set O; if d_3 ≤ d_0, assign x_n to the nearest class in the corresponding set.
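Written out compactly, the outlier test of the first step reads as follows. The notation μ(c) for the center of class c is introduced here for convenience and does not appear in the source; it assumes the recommended center-based distance is used.

```latex
d_1 = \begin{cases}
        \min_{c_j \in \Theta} \lVert x_n - \mu(c_j) \rVert_2, & \Theta \neq \emptyset \\
        \infty, & \Theta = \emptyset
      \end{cases}
\qquad
d_2 = \begin{cases}
        \min_{b_l \in \Delta} \lVert x_n - \mu(b_l) \rVert_2, & \Delta \neq \emptyset \\
        \infty, & \Delta = \emptyset
      \end{cases}

d_3 = \min(d_1, d_2), \qquad
x_n \text{ is an outlier} \iff d_3 > d_0
```

Setting the distance to ∞ for an empty set means a point can never be assigned to a set that holds no classes, so no separate empty-set branch is needed in the comparison.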
Third step: let N be the number of elements in the outlier set O, and compare N with the predetermined emergence threshold θ_emerge. If N ≥ θ_emerge, execute the fourth step; otherwise execute the fifth step.
Fourth step: cluster all elements of the outlier set O with an existing static clustering method, add the results to the effective class set Θ, and empty the outlier set O.
Fifth step: for each class c_j in the effective class set Θ, carry out the following operation:
First compute the time interval between the current moment and the most recent update of the class, using Δ_nj = t_n − t_j, where t_j is the time at which a processing point was last assigned to class c_j. Compare Δ_nj with the predetermined disappearance threshold θ_disp: if Δ_nj ≥ θ_disp, delete the element c_j from the effective class set Θ and add c_j to the vanished class set Δ as a class; otherwise execute the sixth step.
Sixth step: for each class b_l in the vanished class set Δ, count the number f_l of processing points assigned to it since it was added to that set, and compare f_l with the predetermined recurrence threshold θ_recur. If f_l ≥ θ_recur, delete class b_l from the vanished class set Δ and add it to the effective class set Θ; otherwise return to the first step and wait for the next data point to be processed.
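Putting the steps together, the per-point loop can be sketched as a small class. This is an illustrative reading of the flow, not the patent's implementation: the class name, the dictionary layout, the running-mean center update, the tie-break toward Θ, and the single-class `_static_cluster` placeholder are all assumptions, and d_0 is assumed to be finite.

```python
from math import dist, inf


class EvolvingStreamClusterer:
    """Sketch of the per-point loop; thresholds follow the text's symbols."""

    def __init__(self, init_centers, d0, theta_emerge, theta_disp, theta_recur, t0=0.0):
        # Each class: center, point count n, last-update time t, and
        # f = points received since the class entered the vanished set.
        self.effective = [{"center": tuple(c), "n": 1, "t": t0, "f": 0}
                          for c in init_centers]
        self.vanished, self.outliers = [], []
        self.d0, self.theta_emerge = d0, theta_emerge
        self.theta_disp, self.theta_recur = theta_disp, theta_recur

    @staticmethod
    def _nearest(classes, x):
        # (distance, class) of the closest class center; (inf, None) if empty.
        return min(((dist(c["center"], x), c) for c in classes),
                   key=lambda pair: pair[0], default=(inf, None))

    def _static_cluster(self, points, t):
        # Placeholder for the unspecified static method: one class at the mean.
        center = tuple(sum(col) / len(points) for col in zip(*points))
        return [{"center": center, "n": len(points), "t": t, "f": 0}]

    def process(self, x, t):
        # Outlier detection and assignment.
        d1, c1 = self._nearest(self.effective, x)
        d2, c2 = self._nearest(self.vanished, x)
        if min(d1, d2) > self.d0:
            self.outliers.append(tuple(x))
        else:
            cls = c1 if d1 <= d2 else c2      # tie-break toward the effective set
            if d1 > d2:
                cls["f"] += 1                 # point landed in a vanished class
            cls["n"] += 1
            cls["center"] = tuple(m + (v - m) / cls["n"]
                                  for m, v in zip(cls["center"], x))
            cls["t"] = t
        # Emergence: cluster the buffered outliers into new classes.
        if len(self.outliers) >= self.theta_emerge:
            self.effective.extend(self._static_cluster(self.outliers, t))
            self.outliers.clear()
        # Disappearance: retire classes that have been idle too long.
        keep = []
        for c in self.effective:
            if t - c["t"] >= self.theta_disp:
                c["f"] = 0
                self.vanished.append(c)
            else:
                keep.append(c)
        self.effective = keep
        # Recurrence: restore vanished classes that are receiving points again.
        gone = []
        for c in self.vanished:
            (self.effective if c["f"] >= self.theta_recur else gone).append(c)
        self.vanished = gone


model = EvolvingStreamClusterer([(0.0, 0.0)], d0=1.0, theta_emerge=3,
                                theta_disp=5.0, theta_recur=2)
stream = [(0.1, 0.0), (9.0, 9.0), (9.2, 9.0), (9.0, 9.2),
          (9.1, 9.1), (9.0, 9.0), (0.0, 0.1), (0.1, 0.1)]
log = []
for t, x in enumerate(stream, start=1):
    model.process(x, float(t))
    log.append((len(model.effective), len(model.vanished), len(model.outliers)))
```

In this toy stream a second class emerges at t = 4 from three buffered outliers, the original class disappears at t = 6 after five idle time units, and it recurs at t = 8 once two new points have been assigned to it.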
To illustrate how the invention dynamically clusters data streams with evolution characteristics, the following MATLAB simulation experiments were carried out. Experiment one applies the invention to simulated data set one, to demonstrate its ability to process a data stream in which classes emerge and disappear. Experiment two applies the invention to simulated data set two, to demonstrate its ability to process a data stream in which classes emerge, disappear, and recur.
Fig. 3 gives the class-number information for simulated data set one, used in experiment one. As shown, the horizontal axis represents the data points of the stream, 17200 in total, ordered by arrival; the vertical axis represents the classes corresponding to these data points, 12 in total, numbered 1 to 12. In the initial stage of the inflow, from the 1st data point to the 1600th data point, 8 classes appear in total; the 9th, 10th, and 11th classes then begin to appear. From the 5800th data point to the 8200th data point, 12 classes are present. Then classes 12, 11, 10, 9, 8, 7, and 6 disappear in turn, and between the 16200th data point and the 17200th data point only 5 classes remain.
Fig. 4 gives the experimental results of simulation experiment one. As shown in Fig. 4, the horizontal axis represents the data points, 17200 in total, and the vertical axis represents the class number corresponding to each data point. The true variation of the class number of simulated data set one across the data points is drawn as a dotted line, and the recognition result of the invention as a solid line. The * markers indicate the moments at which an old class disappears, the △ markers indicate the moments at which the current model is rebuilt, and × indicates the moment at which stream processing starts. As Fig. 4 shows, at time 1600 the invention completes initialization using the 1st through 1600th data points, and it processes the data stream in real time from the 1601st data point onward. Comparing the solid and dotted lines shows that their trends are essentially the same, which indicates that the invention can effectively detect the emergence of new classes and the disappearance of old classes in the data stream. In this experiment the classification accuracy over the 17200 data points reaches 99.99%.
Fig. 5 gives the class-number information for simulated data set two, used in experiment two. As shown in Fig. 5, the horizontal axis represents the data points of the stream, 22000 in total, ordered by arrival; the vertical axis represents the classes corresponding to these data points, 12 in total, numbered 1 to 12. From the 1st data point to the 2400th data point, 12 classes appear in total. The 12th class then begins to disappear at time 2400 and begins to reappear at the 3600th data point. Subsequently, classes 11, 12, and 10 each go through disappearance followed by recurrence.
Fig. 6 gives the experimental results of simulation experiment two. As shown in Fig. 6, the horizontal axis represents the data points, 22000 in total, and the vertical axis represents the class number corresponding to each data point. The true variation of the class number of simulated data set two across the data points is drawn as a dotted line, and the recognition result of the invention as a solid line. The * markers indicate the moments at which an old class disappears, the o markers indicate the moments of recurrence, and × indicates the moment at which stream processing starts. As Fig. 6 shows, the invention completes initialization at time 2400 and then begins processing the data stream in real time. Comparing the solid and dotted lines shows that their trends are essentially the same, which indicates that the invention can effectively detect the disappearance and recurrence of old classes in the data stream. In this experiment the classification accuracy over the 22000 data points reaches 99.99%.
Claims (4)
1. A real-time clustering method for evolving data streams, characterized by comprising the following steps:
① establishing an effective class set, a vanished class set, and an outlier set;
② assigning the point to be processed, obtained at the current moment, to one of the sets;
③ updating the outlier set, the effective class set, and the vanished class set.
2. The real-time clustering method for evolving data streams according to claim 1, characterized in that the step of updating the outlier set is:
First count the elements in the outlier set, then compare the count with the predetermined emergence threshold. If the count is greater than or equal to the threshold, cluster all elements of the outlier set with a static clustering method, add the resulting classes to the effective class set, and empty the outlier set. If the count is less than the threshold, leave the outlier set unchanged.
3. The real-time clustering method for evolving data streams according to claim 2, characterized in that the step of updating the effective class set is:
For every class in the effective class set, proceed as follows:
First compute the time interval between the current moment and the most recent update of the class. If the interval is greater than or equal to the predetermined disappearance threshold, delete the class from the effective class set and add it to the vanished class set.
4. The real-time clustering method for evolving data streams according to claim 3, characterized in that the step of updating the vanished class set is:
For every class in the vanished class set, proceed as follows:
First count the processing points assigned to the class since it was added to the vanished class set. If the count is greater than or equal to the predetermined recurrence threshold, delete the class from the vanished class set and add it to the effective class set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810109615.6A CN108319699A (en) | 2018-02-05 | 2018-02-05 | Real-time clustering method for evolutionary data streams |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810109615.6A CN108319699A (en) | 2018-02-05 | 2018-02-05 | Real-time clustering method for evolutionary data streams |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108319699A true CN108319699A (en) | 2018-07-24 |
Family
ID=62902333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810109615.6A Pending CN108319699A (en) | 2018-02-05 | 2018-02-05 | Real-time clustering method for evolutionary data streams |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108319699A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046279A (en) * | 2019-04-18 | 2019-07-23 | 网易传媒科技(北京)有限公司 | Prediction technique, medium, device and the calculating equipment of video file feature |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ren et al. | Anomaly detection based on a dynamic Markov model | |
CN105760888B (en) | A kind of neighborhood rough set integrated learning approach based on hierarchical cluster attribute | |
Nguyen et al. | Practical and theoretical aspects of mixture‐of‐experts modeling: An overview | |
Takemura et al. | Model extraction attacks on recurrent neural networks | |
CN108319720A (en) | Man-machine interaction method, device based on artificial intelligence and computer equipment | |
Saxena et al. | A comparative analysis of association rule mining algorithms | |
CN107292097A (en) | The feature selection approach of feature based group and traditional Chinese medical science primary symptom system of selection | |
CN112860675B (en) | Big data processing method under online cloud service environment and cloud computing server | |
CN114124460B (en) | Industrial control system intrusion detection method and device, computer equipment and storage medium | |
CN110458096A (en) | A kind of extensive commodity recognition method based on deep learning | |
CN111327949A (en) | Video time sequence action detection method, device, equipment and storage medium | |
EP4209959A1 (en) | Target identification method and apparatus, and electronic device | |
CN112084761B (en) | Hydraulic engineering information management method and device | |
Sui et al. | Dynamic clustering scheme for evolving data streams based on improved STRAP | |
CN106204267A (en) | A kind of based on improving k means and the customer segmentation system of neural network clustering | |
CN114741544B (en) | Image retrieval method, retrieval library construction method, device, electronic equipment and medium | |
CN108319699A (en) | Real-time clustering method for evolutionary data streams | |
KR20150029324A (en) | System for a real-time cashing event summarization in surveillance images and the method thereof | |
CN107943537A (en) | Using method for cleaning, device, storage medium and electronic equipment | |
CN107077617A (en) | fingerprint extraction method and device | |
CN103577532B (en) | Method and system for text-processing | |
Moreno-Garcia et al. | Fuzzy numbers from raw discrete data using linear regression | |
Liu et al. | A learning-based system for predicting sport injuries | |
Prakash et al. | ATM Card Fraud Detection System Using Machine Learning Techniques | |
CN115082071A (en) | Abnormal transaction account identification method and device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180724 |