CN108319699A - A real-time clustering method for evolutionary data flow - Google Patents

A real-time clustering method for evolutionary data flow

Info

Publication number
CN108319699A
Authority
CN
China
Prior art keywords
class
data
class set
disappearance
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810109615.6A
Other languages
Chinese (zh)
Inventor
隋金坪
刘振
黎湘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-02-05
Filing date
2018-02-05
Publication date
2018-07-24
Application filed by National University of Defense Technology
Priority to CN201810109615.6A
Publication of CN108319699A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24568: Data stream processing; Continuous queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an online clustering method for evolutionary data streams. The technical scheme comprises the steps of ① establishing an effective class set, a vanished class set, and an outlier set; ② assigning the point to be processed obtained at the current moment to one of the sets; and ③ updating the outlier set, the effective class set, and the vanished class set.

Description

A real-time clustering method for evolutionary data flow

Technical Field

The invention belongs to the technical field of data stream clustering, and in particular relates to a dynamic clustering method for evolutionary data streams.

Background

A data stream is data that flows in in real time, as opposed to data acquired in the traditional batch fashion. Depending on whether the data distribution changes, data streams are usually divided into static data streams (the distribution does not change) and evolutionary data streams (the distribution changes); evolutionary data streams are also called dynamic data streams. Data streams have become one of the main forms of data in the information society, for example financial transaction data, communication records, and sensor observations. Data stream clustering analyzes a data stream by means of some clustering technique; because it does not rely on prior information, it has become one of the main tools of data stream mining.

At present, data stream clustering methods mainly target static data streams. In practice, however, real data streams generally have evolutionary (or dynamic) characteristics: as data flows in, the stream undergoes evolutionary forms such as the emergence of new classes, the disappearance of old classes, and the recurrence of previously vanished classes (hereinafter referred to as emergence, disappearance, and recurrence, respectively). In practical applications, detecting these ubiquitous evolutionary forms is often of greater significance to users, for instance for monitoring and observation in astronomy, medicine, finance, and networking. Data stream clustering techniques for evolutionary data are therefore urgently needed. On the one hand, such techniques give users a fuller understanding of the stream's current clustering pattern and of how each class evolves; on the other hand, they help users make accurate judgments before all the data has arrived, such as locating the time of a network intrusion, estimating the number of classes in a given time period, or finding the optimal adjustment time. Although researchers in China and abroad have made many attempts at clustering evolutionary data streams, this work has mainly addressed only the emergence of new classes, which severely limits the range of application of data stream clustering algorithms. It is therefore necessary to extend data stream clustering so that it can handle data with multiple evolutionary forms.

Summary of the Invention

The present invention provides an online clustering method for evolutionary data streams. For the three typical evolutionary forms of such streams (emergence, disappearance, and recurrence), it designs a detection strategy for each, designs a processing framework, and integrates the three detection strategies. This achieves real-time clustering of evolutionary data streams: new classes in the stream are added promptly, vanished classes are removed promptly, and recurring classes are restored promptly without having to be formed again.

The technical scheme of the invention is a real-time clustering method for evolutionary data streams, characterized in that it comprises the following steps:

① establishing an effective class set, a vanished class set, and an outlier set;

② assigning the point to be processed obtained at the current moment to one of the sets;

③ updating the outlier set, the effective class set, and the vanished class set.

Specifically:

① The step of establishing the effective class set, the vanished class set, and the outlier set:

The initial value of the effective class set is the set of classes obtained by collecting a certain amount of data and clustering this initialization data with a static clustering method. The initial value of the vanished class set is the empty set, and the initial value of the outlier set is the empty set.

② The step of assigning the point to be processed obtained at the current moment to one of the sets:

First, compute the Euclidean distance between the point to be processed and each class in the effective class set and in the vanished class set, and take the minimum.

Then compare this minimum with a predetermined outlier threshold: if the minimum is greater than the threshold, place the point in the outlier set; if the minimum is less than or equal to the threshold, assign the point to the class that attains the minimum, in whichever of the two sets that class belongs to.

③ The step of updating the outlier set, the effective class set, and the vanished class set:

(a) Updating the outlier set:

First count the elements in the outlier set, then compare this count with a predetermined emergence threshold. If the count is greater than or equal to the threshold, cluster all elements of the outlier set with a static clustering method, add the resulting classes to the effective class set, and empty the outlier set. If the count is less than the threshold, leave the outlier set unchanged.

(b) Updating the effective class set:

For each class in the effective class set, do the following:

Compute the time interval between the current moment and the moment the class was last updated. If this interval is greater than or equal to a predetermined disappearance threshold, delete the class from the effective class set and add it to the vanished class set.

(c) Updating the vanished class set:

For each class in the vanished class set, do the following:

Count the processing points assigned to the class since it was added to the vanished class set. If this count is greater than or equal to a predetermined recurrence threshold, delete the class from the vanished class set and add it to the effective class set.

The beneficial effects of the invention are:

(1) The invention designs a separate detection function for each of the three typical evolutionary forms of evolutionary data streams (the emergence, disappearance, and recurrence of classes), namely the steps of updating the outlier set, the effective class set, and the vanished class set. This improves the ability of data stream clustering methods to process evolutionary data streams.

(2) By proposing a dynamic clustering algorithm for evolutionary data streams, the invention integrates and unifies the handling of the three evolutionary forms, improves the stability of data stream clustering, and broadens its range of application.

Description of the Drawings

Fig. 1 is a schematic flow chart of the principle of the invention;

Fig. 2 is a flow chart of a specific implementation of the invention;

Fig. 3 shows the data set information for simulation experiment 1;

Fig. 4 shows the results of simulation experiment 1;

Fig. 5 shows the data set information for simulation experiment 2;

Fig. 6 shows the results of simulation experiment 2.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of the principle of the invention. First, the three sets (the effective class set, the vanished class set, and the outlier set) are established; next, each point to be processed is assigned to one of the sets; finally, the three sets are updated. The basic elements of the effective class set and the vanished class set are classes, whereas the basic elements of the outlier set are processing points, that is, the basic units that make up the data stream, also called data points.

The effective class set holds the classes that still have clustering significance for the data stream at the current moment; its initial value is the set of classes obtained by collecting a certain amount of data to be processed and clustering it with a static clustering method. The vanished class set holds the classes that, before the current moment, were deleted from the effective class set because they had lost their clustering significance for the stream; its initial value is the empty set. The outlier set holds outliers. A processing point is judged to be an outlier, and is placed in the outlier set, if the minimum of its distances to the classes in the effective class set and the vanished class set is greater than the predetermined outlier threshold.

The three set updates correspond to the three typical evolutionary forms. Updating the outlier set corresponds to the emergence of a new class: when a new class emerges, a certain number of its points find no suitable class and are placed in the outlier set, so once the number of points in the outlier set reaches the predetermined emergence threshold, a new class is taken to have emerged and the points in the outlier set are clustered. Updating the effective class set corresponds to the disappearance of an old class: if no point has been assigned to a class in the effective class set for a long time, and that time interval reaches the predetermined disappearance threshold, the class is regarded as having become "invalid"; it is deleted from the effective class set and added to the vanished class set. Updating the vanished class set corresponds to the recurrence of an old class: if points are still being assigned to a class in the vanished class set and their number reaches the predetermined recurrence threshold, the class has changed from the "invalid" state back to the "valid" state; it is deleted from the vanished class set and added to the effective class set. During real-time processing of the data stream, the step of assigning points to the sets and the step of updating the three sets are carried out repeatedly.
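The repeated assign-and-update cycle can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the names `Cluster`, `StreamClusterer`, `absorb`, and `static_cluster` are hypothetical, a greedy leader clustering stands in for the unspecified static clustering method, each class is represented by a running-mean center, and time stamps are plain numbers.

```python
import math
from dataclasses import dataclass


@dataclass(eq=False)  # compare clusters by identity, not by field values
class Cluster:
    center: list         # class center, kept as a running mean (assumption)
    count: int           # points assigned to the class so far
    last_update: float   # time a point was last assigned to the class
    hits: int = 0        # points received since entering the vanished set


def absorb(c, x, t):
    """Assign point x to class c at time t and update its running-mean center."""
    c.count += 1
    c.center = [m + (xi - m) / c.count for m, xi in zip(c.center, x)]
    c.last_update = t


def static_cluster(points, radius, t):
    """Stand-in static clustering (greedy leader clustering), an assumption."""
    clusters = []
    for p in points:
        best = min(clusters, key=lambda c: math.dist(p, c.center), default=None)
        if best is not None and math.dist(p, best.center) <= radius:
            absorb(best, p, t)
        else:
            clusters.append(Cluster(list(p), 1, t))
    return clusters


class StreamClusterer:
    def __init__(self, init_points, d0, theta_emerge, theta_disp, theta_recur):
        self.d0 = d0
        self.theta_emerge = theta_emerge
        self.theta_disp = theta_disp
        self.theta_recur = theta_recur
        self.effective = static_cluster(init_points, d0, t=0)  # initial classes
        self.vanished = []                                     # starts empty
        self.outliers = []                                     # starts empty

    def process(self, x, t):
        # Assign x to the nearest class in either set, or to the outlier set.
        candidates = self.effective + self.vanished
        best = min(candidates, key=lambda c: math.dist(x, c.center), default=None)
        if best is None or math.dist(x, best.center) > self.d0:
            self.outliers.append(x)
        else:
            absorb(best, x, t)
            if best in self.vanished:
                best.hits += 1

        # Emergence: enough outliers have accumulated to form new classes.
        if len(self.outliers) >= self.theta_emerge:
            self.effective += static_cluster(self.outliers, self.d0, t)
            self.outliers = []

        # Disappearance: classes that have gone too long without a new point.
        expired = [c for c in self.effective if t - c.last_update >= self.theta_disp]
        for c in expired:
            c.hits = 0
        self.effective = [c for c in self.effective if c not in expired]
        self.vanished += expired

        # Recurrence: vanished classes that have received enough points again.
        back = [b for b in self.vanished if b.hits >= self.theta_recur]
        self.vanished = [b for b in self.vanished if b not in back]
        self.effective += back
```

In this sketch a class keeps its center while it sits in the vanished set, which is what allows a recurring class to be restored rather than formed again.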

Fig. 2 is a flow chart of a specific implementation of the invention. As shown in Fig. 2:

First, three sets are established: Θ, O, and Δ denote the effective class set, the outlier set, and the vanished class set, respectively.

The outlier set O and the vanished class set Δ are both initialized to the empty set. To initialize the effective class set Θ, an existing clustering method is applied to the portion of the data stream already obtained, giving the initial cluster set Θ = {c_j}; if J classes are obtained in total, then j = 1, 2, ..., J. The amount of stream data used for initialization is determined case by case.
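A minimal Python sketch of this initialization is given below. The text only requires "an existing clustering method"; k-means with farthest-point seeding is used here purely as an assumed stand-in, and the names `kmeans`, `initialize`, `n_init`, and `k` are hypothetical. Using the buffer length as the initial time stamp of each class is likewise an assumption.

```python
import math
from itertools import islice


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[j].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def initialize(stream, n_init, k):
    """Consume the first n_init points of the stream and build the three sets."""
    buffer = list(islice(stream, n_init))
    # Effective class set: one record per class c_j, j = 1..J (here J = k).
    theta = [{"center": c, "last_update": n_init} for c in kmeans(buffer, k)]
    outliers = []   # O starts empty
    vanished = []   # Delta starts empty
    return theta, outliers, vanished
```

The stream is consumed lazily, so the points after the first `n_init` remain available for the real-time stage.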

Then the real-time clustering stage begins.

For the multidimensional data point x_n arriving at any time t_n, the following real-time clustering steps are performed:

Step 1: perform outlier detection.

①: If the effective class set Θ is non-empty, compute the Euclidean distance between x_n and each class c_j in the set; it is recommended to use the Euclidean distance between x_n and the class center of c_j. Suppose class c_m is the closest to x_n, i.e. has the smallest Euclidean distance; this minimum distance is defined as d_1.

If the effective class set Θ is empty, then d_1 = ∞.

②: If the vanished class set Δ is non-empty, let Δ = {b_l} with L elements, l = 1, 2, ..., L, and compute the Euclidean distance between x_n and each class b_l. Suppose class b_r is the closest to x_n, i.e. has the smallest Euclidean distance; this minimum distance is defined as d_2.

If Δ is empty, then d_2 = ∞.

③: Let d_3 = min(d_1, d_2), where min denotes the minimum function. Compare d_3 with the predetermined outlier threshold d_0: if d_3 > d_0, place x_n in the outlier set O; if d_3 ≤ d_0, assign x_n to the nearest class in the corresponding set (c_m in Θ if the minimum is d_1, b_r in Δ if it is d_2).
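The computation of d_1, d_2, and d_3 can be sketched in Python as below. The function names `nearest` and `detect` are hypothetical, classes are represented only by their centers, and preferring the effective class set when d_1 equals d_2 is an assumption, since the text does not say how ties are resolved.

```python
import math


def nearest(x, centers):
    """Return (distance, index) of the closest center, or (inf, None) for an empty set."""
    if not centers:
        return math.inf, None
    i = min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
    return math.dist(x, centers[i]), i


def detect(x, theta_centers, delta_centers, d0):
    """Decide where point x goes: the outlier set, an effective class, or a vanished class."""
    d1, m = nearest(x, theta_centers)   # closest effective class c_m
    d2, r = nearest(x, delta_centers)   # closest vanished class b_r
    d3 = min(d1, d2)
    if d3 > d0:
        return ("outlier", None)
    # Tie-break in favour of the effective set (assumption).
    return ("effective", m) if d1 <= d2 else ("vanished", r)
```

Because an empty set yields an infinite distance, a point arriving when both Θ and Δ are empty is always routed to the outlier set.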

Step 3: Let N be the number of elements in the outlier set O. Compare N with the predetermined emergence threshold θ_emerge: if N ≥ θ_emerge, go to Step 4; otherwise, go to Step 5.

Step 4: Cluster all elements of the outlier set O with an existing static clustering method, add the resulting classes to the effective class set Θ, and empty the outlier set O.
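Steps 3 and 4 can be sketched in Python as follows. The static clustering method is left open in the text, so a simple single-linkage grouping with a distance cutoff is assumed here; `single_linkage`, `centroid`, `emerge`, and the `radius` parameter are hypothetical names introduced for illustration.

```python
import math


def single_linkage(points, radius):
    """Group points connected by chains of neighbours within `radius` (assumed stand-in)."""
    unvisited = list(range(len(points)))
    groups = []
    while unvisited:
        frontier = [unvisited.pop()]
        group = []
        while frontier:
            i = frontier.pop()
            group.append(points[i])
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= radius]
            unvisited = [j for j in unvisited if j not in near]
            frontier.extend(near)
        groups.append(group)
    return groups


def centroid(group):
    """Component-wise mean of a non-empty group of points."""
    return tuple(sum(col) / len(group) for col in zip(*group))


def emerge(outliers, theta, theta_emerge, radius):
    """Return the updated (theta, outliers) pair after the emergence check."""
    if len(outliers) < theta_emerge:
        return theta, outliers              # N < theta_emerge: nothing changes
    new_classes = [centroid(g) for g in single_linkage(outliers, radius)]
    return theta + new_classes, []          # add new classes, empty O
```

One batch of outliers may produce several new classes at once, which matters when more than one class emerges during the same period.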

Step 5: For each class c_j in the effective class set Θ, do the following:

Compute the time elapsed since the class was last updated, Δ_nj = t_n - t_j, where t_j is the most recent time at which a processing point was assigned to class c_j. Compare Δ_nj with the predetermined disappearance threshold θ_disp: if Δ_nj ≥ θ_disp, delete c_j from the effective class set Θ and add it, as a class, to the vanished class set Δ; otherwise, go to Step 6.
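The disappearance check can be sketched in Python as below. Storing each set as a dictionary from a class identifier to a record with `last_update` (t_j) and `hits` (f_l) is an assumption made for illustration, as is the function name `update_effective`; resetting `hits` on entry to the vanished set is one way to count only points received after the class vanished.

```python
def update_effective(theta, delta, t_n, theta_disp):
    """Move every class with t_n - t_j >= theta_disp from theta to delta (in place)."""
    expired = [cid for cid, rec in theta.items()
               if t_n - rec["last_update"] >= theta_disp]
    for cid in expired:
        rec = theta.pop(cid)
        rec["hits"] = 0      # start counting f_l afresh (assumption)
        delta[cid] = rec
```

For example, with t_n = 100 and θ_disp = 50, a class last updated at time 50 is moved (the interval equals the threshold), while a class last updated at time 90 stays in Θ.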

Step 6: For each class b_l in the vanished class set Δ, count the number f_l of processing points assigned to it since it was added to that set, and compare f_l with the predetermined recurrence threshold θ_recur: if f_l ≥ θ_recur, delete b_l from the vanished class set Δ and add it to the effective class set Θ; otherwise, return to Step 1 and wait for the next data point.
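The recurrence check, together with the bookkeeping that accumulates f_l, can be sketched in Python as follows. The dictionary layout (class identifier mapped to a record with `last_update` and `hits`) and the names `record_hit` and `update_vanished` are assumptions for illustration. Refreshing `last_update` when a vanished class receives a point is also an assumption; it keeps a restored class from being expired again immediately.

```python
def record_hit(delta, cid, t_n):
    """Note that a processing point was assigned to vanished class `cid` at time t_n."""
    delta[cid]["hits"] += 1          # f_l
    delta[cid]["last_update"] = t_n  # assumption: t_j is refreshed as well


def update_vanished(theta, delta, theta_recur):
    """Move every class with f_l >= theta_recur from delta back to theta (in place)."""
    recurring = [cid for cid, rec in delta.items() if rec["hits"] >= theta_recur]
    for cid in recurring:
        theta[cid] = delta.pop(cid)
```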

To show that the invention can dynamically cluster data streams with evolutionary characteristics, the following MATLAB simulation experiments were carried out. Experiment 1 applies the invention to simulation data set 1 to demonstrate its ability to handle a data stream exhibiting emergence and disappearance. Experiment 2 applies the invention to simulation data set 2 to demonstrate its ability to handle a data stream exhibiting emergence, disappearance, and recurrence.

Fig. 3 gives the class information for simulation data set 1, used in experiment 1. The horizontal axis shows the data points of the stream, 17,200 in total, in order of arrival; the vertical axis shows the label of the class each data point belongs to, with 12 classes labeled 1 to 12. At the start of the stream, from the 1st to the 1,600th data point, 8 classes are present; classes 9, 10, and 11 then begin to appear. From the 5,800th to the 8,200th data point all 12 classes are present; classes 12, 11, 10, 9, 8, 7, and 6 then disappear one after another, and from the 16,200th to the 17,200th data point only 5 classes remain.

Fig. 4 gives the results of simulation experiment 1. The horizontal axis shows the 17,200 data points and the vertical axis shows the number of classes at each data point. The true number of classes in simulation data set 1 is drawn as a dashed line and the number identified by the invention as a solid line; * marks the moments at which an old class disappears, △ marks the moments at which the current model is rebuilt, and × marks the start of stream processing. As Fig. 4 shows, the invention completes initialization at time 1600, using the 1st to 1,600th data points, and processes the stream in real time from the 1,601st data point onward. The solid and dashed lines follow essentially the same trend, which indicates that the invention effectively detects the emergence of new classes and the disappearance of old classes in the data stream; in this experiment the classification accuracy over the 17,200 data points reached 99.99%.

Fig. 5 gives the class information for simulation data set 2, used in experiment 2. The horizontal axis shows the data points of the stream, 22,000 in total, in order of arrival; the vertical axis shows the label of the class each data point belongs to, with 12 classes labeled 1 to 12. From the 1st to the 2,400th data point all 12 classes are present. Class 12 then begins to disappear at time 2400 and begins to recur at the 3,600th data point; subsequently classes 11, 12, and 10 each go through disappearance followed by recurrence.

Fig. 6 gives the results of simulation experiment 2. The horizontal axis shows the 22,000 data points and the vertical axis shows the number of classes at each data point. The true number of classes in simulation data set 2 is drawn as a dashed line and the number identified by the invention as a solid line; * marks the moments at which an old class disappears, o marks the moments of recurrence, and × marks the start of stream processing. As Fig. 6 shows, the invention completes initialization at time 2400 and then begins processing the stream in real time. The solid and dashed lines follow essentially the same trend, which indicates that the invention effectively detects the disappearance and recurrence of old classes in the data stream; in this experiment the classification accuracy over the 22,000 data points reached 99.99%.

Claims (4)

1. A real-time clustering method for evolutionary data streams, characterized by comprising the following steps:
① establishing an effective class set, a vanished class set, and an outlier set;
② assigning the point to be processed obtained at the current moment to one of the sets;
③ updating the outlier set, the effective class set, and the vanished class set.
2. The real-time clustering method for evolutionary data streams according to claim 1, characterized in that the step of updating the outlier set is:
First count the elements in the outlier set, then compare this count with a predetermined emergence threshold: if the count is greater than or equal to the threshold, cluster all elements of the outlier set with a static clustering method, add the resulting classes to the effective class set, and empty the outlier set; if the count is less than the threshold, make no update to the outlier set.
3. The real-time clustering method for evolutionary data streams according to claim 2, characterized in that the step of updating the effective class set is:
For each class in the effective class set, perform the following operation:
Compute the time interval between the current moment and the moment the class was last updated; if this interval is greater than or equal to a predetermined disappearance threshold, delete the class from the effective class set and add it to the vanished class set.
4. The real-time clustering method for evolutionary data streams according to claim 3, characterized in that the step of updating the vanished class set is:
For each class in the vanished class set, perform the following operation:
Count the processing points assigned to the class since it was added to the vanished class set; if this count is greater than or equal to a predetermined recurrence threshold, delete the class from the vanished class set and add it to the effective class set.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810109615.6A CN108319699A (en) 2018-02-05 2018-02-05 A real-time clustering method for evolutionary data flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810109615.6A CN108319699A (en) 2018-02-05 2018-02-05 A real-time clustering method for evolutionary data flow

Publications (1)

Publication Number Publication Date
CN108319699A true CN108319699A (en) 2018-07-24

Family

ID=62902333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810109615.6A Pending CN108319699A (en) 2018-02-05 2018-02-05 A real-time clustering method for evolutionary data flow

Country Status (1)

Country Link
CN (1) CN108319699A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046279A (en) * 2019-04-18 2019-07-23 网易传媒科技(北京)有限公司 Prediction technique, medium, device and the calculating equipment of video file feature

Similar Documents

Publication Publication Date Title
CN104484673B (en) The Supplementing Data method of real-time stream application of pattern recognition
CN109325471B (en) Double-current network pedestrian re-identification method combining apparent characteristics and space-time distribution
CN102096931B (en) Moving target real-time detection method based on layering background modeling
CN109753949B (en) A multi-window traffic sign detection method based on deep learning
CN104050556B (en) The feature selection approach and its detection method of a kind of spam
CN103177258A (en) Method for automatically extracting terrain characteristic line according to vector contour line data
CN105913051A (en) Device and method for updating template library for face image recognition
CN105931267B (en) A kind of moving object segmentation tracking based on improvement ViBe algorithm
CN106643734A (en) Grading processing method for space-time track data
CN103413323B (en) Based on the object tracking methods of component-level apparent model
CN108804731A (en) Based on the dual evaluation points time series trend feature extracting method of vital point
CN109325510B (en) An Image Feature Point Matching Method Based on Grid Statistics
CN110688940A (en) Rapid face tracking method based on face detection
CN104102706A (en) Hierarchical clustering-based suspicious taxpayer detection method
CN104915009B (en) The method and system of gesture anticipation
CN107909042A (en) A kind of continuous gesture cutting recognition methods
CN104731887B (en) A kind of user method for measuring similarity in collaborative filtering
CN108319699A (en) A real-time clustering method for evolutionary data flow
CN106127188B (en) A kind of Handwritten Digit Recognition method based on gyroscope
CN114970707A (en) A Trajectory Similarity Analysis Method Based on Trajectory Compression and Clustering
CN111832475B (en) A Semantic Feature-Based Face False Detection and Screening Method
CN110288572A (en) Method and device for automatic extraction of blood vessel centerline
CN106533784A (en) Method for improving application layer traffic classification accuracy
CN103136256A (en) Method and system for achieving information retrieval in network
CN106203474A (en) A kind of flow data clustering method dynamically changed based on density value

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180724