CN108319699A - Real-time clustering method for evolutionary data streams - Google Patents

Info

Publication number
CN108319699A
Authority
CN
China
Prior art keywords
class
disappearance
point
class set
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810109615.6A
Other languages
Chinese (zh)
Inventor
隋金坪
刘振
黎湘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-02-05
Filing date
2018-02-05
Publication date
2018-07-24
Application filed by National University of Defense Technology
Priority to CN201810109615.6A
Publication of CN108319699A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an online clustering method for evolving data streams. The technical scheme comprises the steps of ① establishing an effective class set, a disappeared class set, and an outlier set; ② assigning the point to be processed obtained at the current time to one of these sets; and ③ updating the outlier set, the effective class set, and the disappeared class set.

Description

A real-time clustering method for evolving data streams
Technical field
The invention belongs to the technical field of data stream clustering, and specifically relates to a dynamic clustering method for evolving data streams.
Background
A data stream is data that arrives in real time, as opposed to data obtained in traditional batch fashion. Depending on whether the data distribution changes, data streams are usually divided into static data streams (the distribution does not change) and evolving data streams (the distribution changes); evolving data streams are also called dynamic data streams. Data streams have become one of the main forms of data in the information society; examples include financial transaction data, communication records, and sensor observations. Data stream clustering analyzes a data stream by means of clustering and, because it does not depend on prior information, has become one of the main tools of data stream mining.
At present, data stream clustering methods mainly target static data streams. In practice, however, data streams generally have an evolving (dynamic) character: as data flows in, new classes emerge, old classes disappear, and old disappeared classes reappear (referred to below as emergence, disappearance, and recurrence). Detecting these widespread forms of evolution is often of greater significance to users, for example for monitoring and observation in astronomy, medicine, finance, and networking. There is therefore an urgent need to develop data stream clustering techniques for evolving data. On the one hand, such techniques give the user a fuller understanding of the current clustering of the stream and of how each class evolves; on the other hand, they help the user make accurate judgments before all the data has arrived, such as finding the time of an anomalous network intrusion, estimating the number of classes in a given time period, or finding the best time to make an adjustment. Although researchers in China and abroad have made many attempts at clustering evolving data streams, that work has mainly addressed only the emergence of new classes, which severely limits the applicability of data stream clustering algorithms. The ability of data stream clustering to handle several forms of evolution therefore needs to be extended.
Summary of the invention
The present invention provides an online clustering method for evolving data streams. A detection strategy is designed for each of the three typical forms of evolution (emergence, disappearance, and recurrence), and a processing framework integrates the three strategies. This achieves real-time clustering of evolving data streams: new classes in the stream are added promptly, disappeared classes are removed promptly, and reappearing classes are restored promptly without having to be formed again.
The technical scheme of the invention is a real-time clustering method for evolving data streams, characterized by comprising the following steps:
1. establishing an effective class set, a disappeared class set, and an outlier set;
2. assigning the point to be processed obtained at the current time to one of the sets;
3. updating the outlier set, the effective class set, and the disappeared class set.
Wherein:
1. The step of establishing the effective class set, the disappeared class set, and the outlier set:
The initial value of the effective class set is the set of classes obtained by first collecting a certain amount of data and then clustering it with a static clustering method. The initial value of the disappeared class set is the empty set. The initial value of the outlier set is the empty set.
2. The step of assigning the point to be processed obtained at the current time to one of the sets comprises:
First, compute the Euclidean distance between the point to be processed and each class in the effective class set and in the disappeared class set, and take the minimum.
Then compare this minimum with the predetermined outlier threshold. If the minimum is greater than the threshold, place the point in the outlier set. If the minimum is less than or equal to the threshold, assign the point to the class that attains the minimum, in whichever set that class belongs to.
3. The step of updating the outlier set, the effective class set, and the disappeared class set:
(a) Updating the outlier set:
First count the elements in the outlier set, then compare that count with the predetermined emergence threshold. If the count is greater than or equal to the threshold, cluster all elements of the outlier set with a static clustering method, add the resulting classes to the effective class set, and empty the outlier set. If the count is less than the threshold, leave the outlier set unchanged.
(b) Updating the effective class set:
For every class in the effective class set, proceed as follows:
Compute the time interval between the current time and the time the class was last updated. If the interval is greater than or equal to the predetermined disappearance threshold, delete the class from the effective class set and add it to the disappeared class set.
(c) Updating the disappeared class set:
For every class in the disappeared class set, proceed as follows:
Count the processing points that have been assigned to the class since it was added to the disappeared class set. If the count is greater than or equal to the predetermined recurrence threshold, delete the class from the disappeared class set and add it to the effective class set.
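The state that the three steps above operate on can be sketched as a few small Python containers. This is only an illustrative layout, not part of the patent text: the names `Cluster`, `Thresholds`, and `StreamState`, the choice to represent a class by its center, and the numeric values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, ...]

@dataclass
class Cluster:
    center: Point                    # class center used for distance computation (assumed representation)
    last_update: float               # time at which the class last received a point
    hits_since_disappeared: int = 0  # points received while in the disappeared class set

@dataclass
class Thresholds:
    outlier: float        # predetermined outlier threshold
    emergence: int        # predetermined emergence threshold
    disappearance: float  # predetermined disappearance threshold
    recurrence: int       # predetermined recurrence threshold

@dataclass
class StreamState:
    effective: List[Cluster] = field(default_factory=list)    # effective class set
    disappeared: List[Cluster] = field(default_factory=list)  # disappeared class set, initially empty
    outliers: List[Point] = field(default_factory=list)       # outlier set, initially empty

# Hypothetical values; in the method the effective class set is seeded by static clustering.
limits = Thresholds(outlier=1.0, emergence=30, disappearance=500.0, recurrence=10)
state = StreamState(effective=[Cluster(center=(0.0, 0.0), last_update=0.0)])
```

Keeping the per-class bookkeeping (last update time, hit count) on the class record means each of the three update steps only needs to scan its own set.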
The beneficial effects of the invention are:
(1) The invention designs a separate detection function for each of the three typical forms of evolution in evolving data streams (the emergence, disappearance, and recurrence of classes), namely the steps of updating the outlier set, the effective class set, and the disappeared class set. This improves the ability of data stream clustering methods to handle evolving data streams.
(2) By proposing a dynamic clustering algorithm for evolving data streams, the invention handles the three forms of data stream evolution in one unified framework, which improves the stability of data stream clustering and extends its range of application.
Description of the drawings
Fig. 1 is a schematic flow diagram of the principle of the invention;
Fig. 2 is a flowchart of a specific implementation of the invention;
Fig. 3 shows the data set information for simulation experiment 1;
Fig. 4 shows the experimental result of simulation experiment 1;
Fig. 5 shows the data set information for simulation experiment 2;
Fig. 6 shows the experimental result of simulation experiment 2.
Detailed description
The invention is further described below with reference to the drawings.
Fig. 1 is a schematic flow diagram of the principle of the invention. First, the three sets (the effective class set, the disappeared class set, and the outlier set) are established; next, each point to be processed is assigned to a set; finally, the three sets are updated. The basic elements of the effective class set and the disappeared class set are classes. The basic elements of the outlier set are processing points, that is, the basic units that make up the data stream, also called data points.
The effective class set stores the classes that, at the current time, are still meaningful for clustering the data stream; its initial value is the set of classes obtained by collecting a certain amount of data and processing it with a static clustering method. The disappeared class set stores the classes that were deleted from the effective class set before the current time because they had lost their clustering meaning for the stream; its initial value is the empty set. The outlier set stores outliers. An outlier is a processing point with the following property: when the minimum of its distances to the classes in the effective class set and the disappeared class set is compared with the predetermined outlier threshold, the minimum exceeds the threshold. Such a point is judged to be an outlier and is placed in the outlier set.
The three set updates correspond to the three typical forms of evolution. Updating the outlier set corresponds to the emergence of new classes in the data stream: when a new class emerges, a certain number of its points cannot find a suitable class and are placed in the outlier set, so once the number of points in the outlier set reaches the predetermined emergence threshold, a new class has emerged and the points in the outlier set need to be clustered. Updating the effective class set corresponds to the disappearance of old classes: if a class in the effective class set receives no points for a long time and that interval reaches the predetermined disappearance threshold, the class has "failed", so it is deleted from the effective class set and added to the disappeared class set. Updating the disappeared class set corresponds to the recurrence of old classes: if a class in the disappeared class set still receives points and their number reaches the predetermined recurrence threshold, the class has changed from the "failed" state back to the "effective" state, so it is deleted from the disappeared class set and added to the effective class set. During real-time processing of the data stream, the step of assigning a point to a set and the step of updating the three sets are repeated.
Fig. 2 is a flowchart of a specific implementation of the invention. As shown in Fig. 2:
First, three sets are established: Θ, O, and Δ, corresponding to the effective class set, the outlier set, and the disappeared class set, respectively.
The initial values of the outlier set O and the disappeared class set Δ are both the empty set. The effective class set Θ is initialized by applying an existing clustering method to the stream data acquired so far, which gives the initial cluster set Θ = {c_j}; assuming J classes are obtained, j = 1, 2, ..., J. The amount of stream data collected for initialization is chosen according to the situation at hand.
Then the real-time clustering stage begins:
For a multidimensional data point x_n arriving at any time t_n, the following real-time clustering steps are carried out:
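The initialization can be sketched as follows. The text only requires "an existing clustering method", so the single-pass leader clustering used here is a stand-in chosen for brevity, not the method of the patent; the function name `leader_clustering`, the `radius` parameter, and the sample batch are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def leader_clustering(points: Sequence[Point], radius: float) -> List[Point]:
    """Return class centers from a single pass over `points`.

    A point joins the nearest class whose center lies within `radius`;
    otherwise it opens a new class. Centers are kept as running means.
    """
    centers: List[Point] = []
    counts: List[int] = []
    for p in points:
        best, best_d = -1, math.inf
        for i, c in enumerate(centers):
            d = math.dist(p, c)
            if d < best_d:
                best, best_d = i, d
        if best >= 0 and best_d <= radius:
            counts[best] += 1
            n = counts[best]
            # Incremental mean update of the class center.
            centers[best] = tuple(c + (x - c) / n for c, x in zip(centers[best], p))
        else:
            centers.append(tuple(p))
            counts.append(1)
    return centers

# Hypothetical initial batch collected from the stream.
initial_batch = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
effective = leader_clustering(initial_batch, radius=1.0)  # Θ = {c_j}, here J = 2
disappeared: List[Point] = []                             # Δ starts empty
outliers: List[Point] = []                                # O starts empty
```

Any static method that returns a set of classes (k-means, hierarchical clustering, and so on) could take the place of `leader_clustering` here.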
First step: outlier detection.
①: If the effective class set Θ is non-empty, compute the Euclidean distance between each class c_j in the set and x_n; it is recommended here to use the Euclidean distance between the point x_n and the center of class c_j. Suppose class c_m in the set is nearest to x_n, i.e. their Euclidean distance is the smallest; this minimum distance is defined as d_1.
If the effective class set Θ is empty, d_1 = ∞.
②: If the disappeared class set Δ is non-empty, let Δ = {b_l} with L elements in total, so that l = 1, 2, ..., L, and compute the Euclidean distance between each class b_l and x_n. Suppose class b_r in Δ is nearest to x_n, i.e. their Euclidean distance is the smallest; this minimum distance is defined as d_2.
If Δ is empty, d_2 = ∞.
③: Let d_3 = min(d_1, d_2), where min is the minimum function, so d_3 is the smaller of d_1 and d_2. Compare d_3 with the predetermined outlier threshold d_0: if d_3 > d_0, place x_n in the outlier set O; if d_3 ≤ d_0, assign x_n to the class that attains d_3 in the corresponding set.
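Sub-steps ① to ③ can be sketched as a small assignment function. Classes are represented by their centers, as the text recommends; the function names `nearest` and `assign`, the returned labels, and the rule that a tie d_1 = d_2 goes to the effective class set are assumptions not stated in the source.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

def nearest(x: Point, centers: List[Point]) -> Tuple[float, Optional[int]]:
    """Return (minimum distance, index of nearest center); (inf, None) for an empty set."""
    if not centers:
        return math.inf, None
    dists = [math.dist(x, c) for c in centers]
    i = min(range(len(dists)), key=dists.__getitem__)
    return dists[i], i

def assign(x: Point, effective: List[Point], disappeared: List[Point],
           d0: float) -> Tuple[str, Optional[int]]:
    """Return ('outlier', None), ('effective', m) or ('disappeared', r)."""
    d1, m = nearest(x, effective)    # step ①
    d2, r = nearest(x, disappeared)  # step ②
    d3 = min(d1, d2)                 # step ③
    if d3 > d0:
        return "outlier", None
    # Tie-breaking toward the effective class set is an assumption.
    return ("effective", m) if d1 <= d2 else ("disappeared", r)

# Hypothetical class centers and threshold.
effective = [(0.0, 0.0), (5.0, 5.0)]
disappeared = [(10.0, 10.0)]
print(assign((0.3, 0.4), effective, disappeared, d0=1.0))
print(assign((9.5, 10.0), effective, disappeared, d0=1.0))
print(assign((20.0, 20.0), effective, disappeared, d0=1.0))
```

Using ∞ for an empty set means the first points of a stream with no classes at all fall through to the outlier set without any special case.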
Third step: let N be the number of elements in the outlier set O, and compare N with the predetermined emergence threshold θ_emerge. If N ≥ θ_emerge, execute the fourth step; otherwise execute the fifth step.
Fourth step: cluster all elements of the outlier set O with an existing static clustering method, add the result to the effective class set Θ, and empty the outlier set O.
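The third and fourth steps can be sketched as one function that takes the static clustering method as a parameter. The names `update_outlier_set` and `single_centroid` are hypothetical, and `single_centroid` is only a placeholder clusterer that treats all outliers as one class; a real static clustering method would be supplied in its place.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
StaticClusterer = Callable[[Sequence[Point]], List[Point]]

def update_outlier_set(outliers: List[Point], effective: List[Point],
                       theta_emerge: int, static_cluster: StaticClusterer) -> bool:
    """If N >= theta_emerge, cluster O, add the new classes to Θ, and empty O.

    Returns True when new classes were added.
    """
    if len(outliers) < theta_emerge:
        return False                      # third step: too few outliers, no update
    effective.extend(static_cluster(outliers))  # fourth step: new classes join Θ
    outliers.clear()
    return True

def single_centroid(points: Sequence[Point]) -> List[Point]:
    """Placeholder clusterer (assumption): one class located at the mean of all points."""
    n = len(points)
    return [tuple(sum(col) / n for col in zip(*points))]

effective = [(0.0, 0.0)]
outliers = [(8.0, 8.0), (8.0, 10.0), (10.0, 8.0)]
added_early = update_outlier_set(outliers, effective, 4, single_centroid)  # N = 3 < 4
outliers.append((10.0, 10.0))
added = update_outlier_set(outliers, effective, 4, single_centroid)        # N = 4 >= 4
```

Passing the clusterer in as a callable keeps the streaming logic independent of whichever static method is used for the outliers.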
Fifth step: for each class c_j in the effective class set Θ, carry out the following operation:
Compute the time interval between the current time and the time the class was last updated, i.e. Δ_nj = t_n − t_j, where t_j is the time at which class c_j last received a processing point. Compare Δ_nj with the predetermined disappearance threshold θ_disp. If Δ_nj ≥ θ_disp, delete element c_j from the effective class set Θ and add c_j to the disappeared class set Δ as a class; otherwise, execute the sixth step.
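A minimal sketch of the fifth step, assuming each class record carries its last update time t_j. The `Cluster` fields, the function name `update_effective_set`, and the choice to reset the recurrence counter when a class is moved are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Cluster:
    name: str
    last_update: float    # t_j: time the class last received a processing point
    recur_count: int = 0  # f_l: points received since entering the disappeared class set

def update_effective_set(effective: List[Cluster], disappeared: List[Cluster],
                         t_n: float, theta_disp: float) -> None:
    """Move every class with t_n - t_j >= theta_disp from Θ to Δ."""
    still_effective = []
    for c in effective:
        if t_n - c.last_update >= theta_disp:
            c.recur_count = 0  # start counting from the moment of disappearance (assumption)
            disappeared.append(c)
        else:
            still_effective.append(c)
    effective[:] = still_effective  # update the caller's list in place

# Hypothetical classes: c1 was updated recently, c2 has been idle for 60 time units.
effective = [Cluster("c1", last_update=95.0), Cluster("c2", last_update=40.0)]
disappeared: List[Cluster] = []
update_effective_set(effective, disappeared, t_n=100.0, theta_disp=50.0)
```

With these values, c1 has Δ_nj = 5 and stays effective, while c2 has Δ_nj = 60 ≥ 50 and is moved to the disappeared class set.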
Sixth step: for every class b_l in the disappeared class set Δ, determine the number f_l of processing points assigned to it since it was added to that set, and compare f_l with the predetermined recurrence threshold θ_recur. If f_l ≥ θ_recur, delete class b_l from the disappeared class set Δ and add it to the effective class set Θ; otherwise, return to the first step and wait to process the next data point.
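The sixth step can be sketched in the same style. The `Cluster` record, the helper `record_hit` that maintains f_l, and the function name `update_disappeared_set` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Cluster:
    name: str
    recur_count: int = 0  # f_l: points assigned since the class entered Δ

def record_hit(b: Cluster) -> None:
    """Call when the first step assigns a point to a disappeared class."""
    b.recur_count += 1

def update_disappeared_set(effective: List[Cluster], disappeared: List[Cluster],
                           theta_recur: int) -> None:
    """Move every class with f_l >= theta_recur from Δ back to Θ."""
    remaining = []
    for b in disappeared:
        if b.recur_count >= theta_recur:
            effective.append(b)  # the class has reappeared
        else:
            remaining.append(b)
    disappeared[:] = remaining   # update the caller's list in place

# Hypothetical state: b1 has already received 9 points, b2 only 3.
effective: List[Cluster] = []
disappeared = [Cluster("b1", recur_count=9), Cluster("b2", recur_count=3)]
record_hit(disappeared[0])  # a new point lands in b1, so f_l becomes 10
update_disappeared_set(effective, disappeared, theta_recur=10)
```

Because the class record is moved rather than rebuilt, a reappearing class keeps whatever was stored for it, which matches the stated aim of restoring reappearing classes without forming them again.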
To illustrate the invention's dynamic clustering of data streams with evolving characteristics, the following MATLAB simulation experiments were carried out. Experiment 1 applies the invention to simulated data set 1, to demonstrate its ability to process a data stream exhibiting emergence and disappearance. Experiment 2 applies the invention to simulated data set 2, to demonstrate its ability to process a data stream exhibiting emergence, disappearance, and recurrence.
Fig. 3 shows the class label information for simulated data set 1, used in experiment 1. The horizontal axis represents the data points of the stream, 17200 in total, ordered by arrival. The vertical axis represents the label of the class to which each data point belongs; there are 12 classes, numbered 1 to 12. In the initial stage, from the 1st to the 1600th data point, 8 classes appear; classes 9, 10, and 11 then begin to appear. From the 5800th to the 8200th data point, all 12 classes are present. Classes 12, 11, 10, 9, 8, 7, and 6 then disappear in turn, and from the 16200th to the 17200th data point only 5 classes remain.
Fig. 4 shows the result of simulation experiment 1. The horizontal axis represents the 17200 data points, and the vertical axis represents the number of classes at each data point. The true variation of the number of classes in simulated data set 1 is drawn as a dashed line, and the result identified by the invention is drawn as a solid line. The * marks indicate the moments at which an old class disappears, the △ marks indicate the moments at which the current model is rebuilt, and the × mark indicates the start of stream processing. As Fig. 4 shows, the invention completes initialization at time 1600 using the 1st to the 1600th data points, and processes the stream in real time from the 1601st data point onward. Comparing the solid and dashed lines shows that their trends are almost identical, which indicates that the invention can effectively detect the emergence of new classes and the disappearance of old classes in the data stream. In this experiment the classification accuracy over the 17200 data points reaches 99.99%.
Fig. 5 shows the class label information for simulated data set 2, used in experiment 2. The horizontal axis represents the data points of the stream, 22000 in total, ordered by arrival. The vertical axis represents the label of the class to which each data point belongs; there are 12 classes, numbered 1 to 12. From the 1st to the 2400th data point, 12 classes appear in total. Class 12 then begins to disappear at time 2400 and begins to reappear at the 3600th data point. Subsequently, classes 11, 12, and 10 undergo disappearance followed by recurrence.
Fig. 6 shows the result of simulation experiment 2. The horizontal axis represents the 22000 data points, and the vertical axis represents the number of classes at each data point. The true variation of the number of classes in simulated data set 2 is drawn as a dashed line, and the result identified by the invention is drawn as a solid line. The * marks indicate the moments at which an old class disappears, the o marks indicate the moments at which a class reappears, and the × mark indicates the start of stream processing. As Fig. 6 shows, the invention completes initialization at time 2400 and then begins processing the stream in real time. Comparing the solid and dashed lines shows that their trends are almost identical, which indicates that the invention can effectively detect the disappearance and recurrence of old classes in the data stream. In this experiment the classification accuracy over the 22000 data points reaches 99.99%.

Claims (4)

1. A real-time clustering method for evolving data streams, characterized by comprising the following steps:
1. establishing an effective class set, a disappeared class set, and an outlier set;
2. assigning the point to be processed obtained at the current time to one of the sets;
3. updating the outlier set, the effective class set, and the disappeared class set.
2. The real-time clustering method for evolving data streams according to claim 1, characterized in that the step of updating the outlier set is:
First count the elements in the outlier set, then compare that count with the predetermined emergence threshold. If the count is greater than or equal to the threshold, cluster all elements of the outlier set with a static clustering method, add the resulting classes to the effective class set, and empty the outlier set. If the count is less than the threshold, leave the outlier set unchanged.
3. The real-time clustering method for evolving data streams according to claim 2, characterized in that the step of updating the effective class set is:
For every class in the effective class set, proceed as follows:
Compute the time interval between the current time and the time the class was last updated. If the interval is greater than or equal to the predetermined disappearance threshold, delete the class from the effective class set and add it to the disappeared class set.
4. The real-time clustering method for evolving data streams according to claim 3, characterized in that the step of updating the disappeared class set is:
For every class in the disappeared class set, proceed as follows:
Count the processing points that have been assigned to the class since it was added to the disappeared class set. If the count is greater than or equal to the predetermined recurrence threshold, delete the class from the disappeared class set and add it to the effective class set.
CN201810109615.6A 2018-02-05 2018-02-05 Real-time clustering method for evolutionary data streams Pending CN108319699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810109615.6A CN108319699A (en) 2018-02-05 2018-02-05 Real-time clustering method for evolutionary data streams

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810109615.6A CN108319699A (en) 2018-02-05 2018-02-05 Real-time clustering method for evolutionary data streams

Publications (1)

Publication Number Publication Date
CN108319699A true CN108319699A (en) 2018-07-24

Family

ID=62902333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810109615.6A Pending CN108319699A (en) 2018-02-05 2018-02-05 Real-time clustering method for evolutionary data streams

Country Status (1)

Country Link
CN (1) CN108319699A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046279A (en) * 2019-04-18 2019-07-23 网易传媒科技(北京)有限公司 Prediction technique, medium, device and the calculating equipment of video file feature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180724