CN103577602A

CN103577602A - Secondary clustering method and system

Info

Publication number: CN103577602A
Application number: CN201310581217.1A
Authority: CN
Inventors: 侯德龙
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2013-11-18
Filing date: 2013-11-18
Publication date: 2014-02-12

Abstract

The invention provides a secondary clustering method for the technical field of data mining. The method comprises the following steps: partitioning a data stream into blocks and reading in the data blocks; clustering with the DBSCAN algorithm to obtain density cluster reference points; clustering the obtained density cluster reference points with the k-means algorithm; and storing the k-means reference points obtained by the k-means clustering in a layered structure. The method effectively improves the clustering quality, precision and efficiency for data streams that are irregularly distributed and contain noise.

Description

Secondary clustering method and system

Technical field

The invention belongs to the technical field of data stream mining, and in particular relates to a secondary clustering method and system.

Background technology

In recent years, with the development of hardware technology, more and more applications produce data streams. A data stream differs from traditional static data stored on disk: it is a new class of data object that is unbounded, continuous, ordered, fast-changing and massive. Typical data streams include monitoring data from network and road-traffic monitoring systems, call records of telecommunication departments, monitoring data returned by sensors, stock price data from stock exchanges, and environmental temperature monitoring data. These characteristics mean that, when a data stream is processed, the data can be scanned only once or twice and only a small amount of data can be stored temporarily. Many mature techniques for data mining, data analysis and data querying therefore become inapplicable to data streams, and new solutions are needed.

Data streams consequently attracted the attention of researchers as soon as they appeared, and many research results have studied them from aspects such as management, querying, analysis and mining algorithms. Data stream mining is a new problem in the field of data mining, and many mining algorithms need to be adapted for data streams. Data stream clustering, an important research direction in data stream mining, likewise faces great challenges; it has attracted wide attention from researchers, and many related results have appeared and been applied in practice.

Traditional clustering is built on the database operating model: a traditional database can store all of the data and supports complex query operations. Under this model, traditional methods can read the data repeatedly and access it randomly in order to cluster the stored data. In a data stream environment, however, these operations are infeasible, and the characteristics of the data stream itself mean that traditional clustering algorithms cannot be applied directly to data stream clustering.

Compared with traditional clustering methods, a data stream clustering algorithm should therefore have the following characteristics:

First, it uses limited memory and storage space. A data stream is continuous and unbounded, and its total volume far exceeds the capacity of the space (main memory) available to the clustering algorithm, so storing the data stream in full is infeasible. A data stream clustering algorithm cannot store every data object to be processed; it can only summarize the data or selectively discard it so that the space used stays bounded and reasonable.

Second, it processes the data incrementally by linear scan, or in a single scan. For the very large volumes of data in a data stream, linear scanning is the only effective way of reading the data, because random access incurs a very high computational cost. Even repeated linear scans are costly, because the data are usually stored on external devices with slow read speeds. Moreover, in many data stream environments the data change so quickly that they are not stored at all: they must be processed as they are produced and then discarded. A data stream clustering algorithm should therefore scan the data only once, or at least process it incrementally by linear scan.

Third, it processes data records in real time. The data in a data stream change very quickly, and the requirement on response speed is very high. The record-processing procedure of a data stream clustering algorithm must therefore be fast enough to avoid missing records that need to be processed.

However, most known data stream clustering algorithms are suited to data with a specific distribution and are rather sensitive to noise. Data streams in practical applications are mostly irregularly distributed and contain noise, which makes it difficult for existing data stream clustering algorithms to obtain satisfactory clustering quality.

Summary of the invention

The invention provides a secondary clustering method and system to solve the above problem.

The invention provides a secondary clustering method comprising the following steps:

partitioning the data stream into blocks and reading in a data block;

clustering with the DBSCAN algorithm to obtain density cluster reference points;

clustering the obtained density cluster reference points with the k-means algorithm, and storing the k-means reference points obtained by the k-means clustering in a layered structure.

The invention also provides a secondary clustering system comprising a block reading module, a density cluster reference point acquisition module and a k-means reference point acquisition module; the block reading module is connected to the k-means reference point acquisition module through the density cluster reference point acquisition module;

the block reading module is configured to partition the data stream into blocks and read in a data block;

the density cluster reference point acquisition module is configured to cluster with the DBSCAN algorithm and obtain density cluster reference points;

the k-means reference point acquisition module is configured to cluster the obtained density cluster reference points with the k-means algorithm and to store the k-means reference points obtained by the k-means clustering in a layered structure.

The invention proposes a secondary clustering method that first partitions the data stream into blocks and applies DBSCAN clustering to produce an intermediate clustering result, the set of density cluster reference points, and then applies k-means clustering to these density cluster reference points. The cluster reference points obtained in each round of clustering are stored in a layered structure. This effectively improves the clustering quality, precision and efficiency for data streams that are irregularly distributed and contain noise.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form a part of this application. The schematic embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is the processing flow chart of embodiment 1 of the invention;

Fig. 2 is the system implementation diagram of embodiment 2 of the invention.

Detailed description of the embodiments

The invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features of the embodiments may be combined with one another.

Fig. 1 is the processing flow chart of embodiment 1 of the invention, which comprises the following steps:

Step 101: partition the data stream into blocks and read in a data block;

Step 102: cluster with the DBSCAN algorithm to obtain density cluster reference points;

Step 103: cluster the obtained density cluster reference points with the k-means algorithm, and store the k-means reference points obtained by the k-means clustering in a layered structure.

The process of partitioning the data stream and reading in data blocks is as follows: within a sliding window, the data stream is processed block by block in a loop until the final clustering result is obtained; if the data have not all been processed, the next data block is read in, and so on until the whole data stream has been processed.
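The block-by-block reading of step 101 can be sketched as a generator. This is a minimal illustration only, assuming that the stream is any Python iterable of points; the names `read_blocks` and `block_size` are hypothetical and do not come from the patent.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def read_blocks(stream: Iterable[Point], block_size: int) -> Iterator[List[Point]]:
    """Yield consecutive blocks of at most `block_size` points, in arrival order."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    iterator = iter(stream)
    while True:
        block = list(islice(iterator, block_size))  # read in the next data block
        if not block:                               # the stream has been fully processed
            return
        yield block


# A finite generator stands in for the data stream; it is consumed only once.
stream = ((float(i), float(i % 3)) for i in range(7))
blocks = list(read_blocks(stream, block_size=3))
print([len(b) for b in blocks])  # [3, 3, 1]
```

Only one block is held in memory at a time, which matches the single-scan requirement described in the background section.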

The density cluster reference point is defined as follows:

Assume that the data in the data stream arrive in the form of blocks $X_1, X_2, \ldots, X_n$, that each block can be processed in memory, and that every data block contains the same number of data points;

Definition 1 (density cluster reference point): the data block $X_t$ arriving at time $t$ is clustered with a density-based clustering algorithm, generating $k_t$ ($k_t = 1, 2, \ldots$) clusters whose mean points are $c_1, \ldots, c_i, \ldots, c_{k_t}$. The data block is then represented by $k_t$ two-tuples of the form $(c_i, n_i)$, where $n_i$ is the number of data points in $X_t$ that belong to $c_i$; $r_d(c_i, n_i)$ is called a density cluster reference point of the data stream.
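Definition 1 can be illustrated with a small, self-contained sketch. The DBSCAN below is a plain quadratic-time textbook variant written for this example, not the patent's implementation; the function names and the sample block are assumptions, and `min_pts` counts the point itself as part of its neighbourhood.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, ...]
RefPoint = Tuple[Point, int]  # (mean point c_i, weight n_i)


def dbscan_labels(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Return one cluster id per point; -1 marks noise."""
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)

    def neighbours(i: int) -> List[int]:
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:          # not a core point: noise for now
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # noise reachable from a core point is a border point
                labels[j] = cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


def density_reference_points(block: List[Point], eps: float, min_pts: int) -> List[RefPoint]:
    """Summarize a block as (mean point, member count) per cluster, dropping noise."""
    labels = dbscan_labels(block, eps, min_pts)
    refs: List[RefPoint] = []
    for cluster in range(max(labels, default=-1) + 1):
        members = [p for p, label in zip(block, labels) if label == cluster]
        mean = tuple(sum(axis) / len(members) for axis in zip(*members))
        refs.append((mean, len(members)))
    return refs


# Hypothetical block: two tight groups and one outlier.
block = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
         (50.0, 50.0)]
refs = density_reference_points(block, eps=1.5, min_pts=3)
print(refs)  # [((0.5, 0.5), 4), ((10.5, 10.5), 4)]
```

The outlier at (50, 50) is labelled as noise and contributes to no reference point, which is how the first clustering stage filters noise before the second stage runs.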

The k-means reference point is defined as follows:

Definition 2 (k-means reference point): $m$ density cluster reference points are clustered with k-means, generating $2k$ clusters whose mean points are $c_1, \ldots, c_i, \ldots, c_{2k}$. The data are then represented by $2k$ two-tuples of the form $(c_i, n_i)$, where $n_i$ is the number of data points that belong to $c_i$; $r_k(c_i, n_i)$ is called a k-means reference point of the data stream.
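Definition 2 can be sketched as Lloyd-style k-means over weighted reference points. The patent does not say how the weights enter the computation or how the centres are initialised, so the weighted mean update and the deterministic seeding with the heaviest reference points are assumptions of this example, as are all names.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, ...]
RefPoint = Tuple[Point, int]  # (mean point c_i, weight n_i)


def weighted_mean(group: List[RefPoint]) -> Point:
    """Mean of the group's centres, each weighted by its point count."""
    total = sum(n for _, n in group)
    dims = len(group[0][0])
    return tuple(sum(c[d] * n for c, n in group) / total for d in range(dims))


def weighted_kmeans(refs: List[RefPoint], n_clusters: int,
                    max_iter: int = 50) -> List[RefPoint]:
    """Cluster reference points; return (mean point, total weight) per non-empty cluster."""
    if not 0 < n_clusters <= len(refs):
        raise ValueError("need 0 < n_clusters <= number of reference points")
    # Assumption: seed with the heaviest reference points so the result is deterministic.
    centres = [c for c, _ in sorted(refs, key=lambda r: -r[1])[:n_clusters]]
    for _ in range(max_iter):
        groups: List[List[RefPoint]] = [[] for _ in centres]
        for c, n in refs:
            nearest = min(range(len(centres)), key=lambda j: dist(c, centres[j]))
            groups[nearest].append((c, n))
        updated = [weighted_mean(g) if g else centres[j] for j, g in enumerate(groups)]
        if updated == centres:  # assignments are stable
            break
        centres = updated
    return [(centres[j], sum(n for _, n in g)) for j, g in enumerate(groups) if g]


# Hypothetical density cluster reference points.
refs = [((0.0, 0.0), 4), ((1.0, 0.0), 2), ((10.0, 10.0), 3), ((10.0, 12.0), 1)]
result = weighted_kmeans(refs, n_clusters=2)
print([n for _, n in result])  # [6, 4]
```

In the method described here the intermediate layers would call this with `n_clusters = 2 * k`, and the final pass with `n_clusters = k`.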

The invention proposes a secondary clustering method for data streams (Twice Clustering Streaming Algorithm, TCLUSA) to improve clustering precision.

TCLUSA is based on the idea of partitioning. It clusters each data block with the DBSCAN method, discards outliers and takes the mean point of each cluster as its representative point, which gives the local clustering result (the density cluster reference points). It then clusters the representative points with the k-means method to obtain the final result (the k-means reference points).

TCLUSA is described as follows:

Assume that the data in the data stream arrive in the form of blocks $X_1, X_2, \ldots, X_n$, that each block can be processed in memory, and that every data block contains the same number of data points.

Definition 1 (density cluster reference point): the data block $X_t$ arriving at time $t$ is clustered with a density-based clustering algorithm, generating $k_t$ ($k_t = 1, 2, \ldots$) clusters whose mean points are $c_1, \ldots, c_i, \ldots, c_{k_t}$. The data block is then represented by $k_t$ two-tuples of the form $(c_i, n_i)$, where $n_i$ is the number of data points in $X_t$ that belong to $c_i$; $r_d(c_i, n_i)$ is called a density cluster reference point of the data stream.

Definition 2 (k-means reference point): $m$ density cluster reference points are clustered with k-means, generating $2k$ clusters whose mean points are $c_1, \ldots, c_i, \ldots, c_{2k}$. The data are then represented by $2k$ two-tuples of the form $(c_i, n_i)$, where $n_i$ is the number of data points that belong to $c_i$; $r_k(c_i, n_i)$ is called a k-means reference point of the data stream.

In Definition 1 and Definition 2, $n_i$ can be interpreted as the weight of the reference point.

Following the idea of partitioning, the TCLUSA algorithm processes the data stream block by block and stores the cluster reference points obtained in each round of clustering in a layered structure, with each layer holding m reference points. This allows the algorithm to cluster the data stream within a limited amount of memory. The data stream is divided in chronological order so that every m data points form one data block.
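The layered storage can be sketched as a small data structure. This is an illustration under stated assumptions: a layer is emptied once it has been compressed, the fullness test is `>= m` rather than `== m`, and the reducer is pluggable. The class `LayeredStore` and the toy reducer `merge_all`, which stands in for k-means with 2k clusters, are hypothetical names.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, ...]
RefPoint = Tuple[Point, int]  # (mean point, weight)
Reducer = Callable[[List[RefPoint]], List[RefPoint]]


class LayeredStore:
    """Keep fewer than m reference points per layer; a full layer is
    compressed by `reduce_layer` and promoted to the layer above."""

    def __init__(self, m: int, reduce_layer: Reducer) -> None:
        self.m = m
        self.reduce_layer = reduce_layer
        self.layers: List[List[RefPoint]] = [[]]

    def add(self, refs: List[RefPoint], level: int = 0) -> None:
        while len(self.layers) <= level:
            self.layers.append([])
        self.layers[level].extend(refs)
        if len(self.layers[level]) >= self.m:       # layer is full
            full, self.layers[level] = self.layers[level], []
            reduced = self.reduce_layer(full)
            if len(reduced) >= len(full):
                raise ValueError("reduce_layer must return fewer points than it receives")
            self.add(reduced, level + 1)             # may cascade upwards

    def all_points(self) -> List[RefPoint]:
        return [ref for layer in self.layers for ref in layer]


def merge_all(refs: List[RefPoint]) -> List[RefPoint]:
    """Toy reducer: collapse a layer into one weighted mean."""
    total = sum(n for _, n in refs)
    dims = len(refs[0][0])
    mean = tuple(sum(c[d] * n for c, n in refs) / total for d in range(dims))
    return [(mean, total)]


store = LayeredStore(m=3, reduce_layer=merge_all)
for x in range(7):
    store.add([((float(x),), 1)])
print(store.layers)  # [[((6.0,), 1)], [((1.0,), 3), ((4.0,), 3)]]
```

Seven points are held as three reference points whose weights still sum to seven, which is the property that lets memory stay bounded while the stream grows.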

The algorithm clusters each data block with the DBSCAN algorithm and computes the density cluster reference points, which gives the local clustering result; it then clusters the density cluster reference points with the k-means algorithm until the final result (the k-means reference points) is obtained.

The implementation of the above principle is described in detail below:

The pseudocode is as follows:

Procedure TCLUSA

/* Function: accumulate the data stream for a period of time in a sliding window, then cluster the data stream with the TCLUSA algorithm and partition it into a plurality of clusters according to the result.

Input: number of records per data block oncepattern, number of clusters k, neighbourhood radius eps, minimum number of points minpits that the neighbourhood of a core point must contain.

Output: k clusters

*/

(1) do while (the data stream has not ended)
(2)   i = 1; // i is the layer being processed
(3)   read a data block;
(4)   cluster this data block with the DBSCAN algorithm;
(5)   compute the density cluster reference points and store them in layer i;
(6)   if (number of density cluster reference points == m) // m is the maximum number of intermediate density cluster reference points stored
(7)     f = 0; // f is a binary flag: f = 1 means that no layer is full; f = 0 means that some layer is full
(8)     do while (f == 0)
(9)       cluster the reference points of layer i with the k-means algorithm;
(10)      i++;
(11)      store the 2k k-means reference points obtained in layer i;
(12)      if (number of reference points in layer i != m)
(13)        f = 1;
(14)      end if
(15)    end // end while
(16)  end if
(17) end // end while
(18) cluster all stored density cluster reference points with the k-means algorithm to generate the final k clusters;

Step (1) processes the data stream block by block in a loop within the sliding window until the final clustering result is obtained: if the data have not all been processed, the next data block is read in, until the whole data stream has been processed. Steps (3)-(5) cluster the data block that has been read in with the DBSCAN algorithm, compute the density cluster reference points and save them in layer i.

Steps (6)-(8) check whether layer i is full of density cluster reference points. If it is, the k-means algorithm is applied to the reference points layer by layer and each result is saved in the next layer up, until no layer is full. Steps (9)-(11) cluster the reference points of layer i with the k-means algorithm, obtain 2k k-means reference points and save them in layer i+1. Step (12) checks whether that upper layer is not yet full; if it is not full, the processing ends. When the data stream ends, step (18) clusters all stored density cluster reference points with the k-means algorithm and outputs the resulting k clusters (k-means reference points).
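The control flow of steps (1)-(18) can be sketched in Python with the two clustering stages passed in as callables. Several points are assumptions of this sketch rather than statements of the patent: layers are 0-based, a layer is emptied after it has been compressed, the fullness test is `>= m`, and the demonstration uses deliberately trivial stand-ins (`one_cluster`, `toy_kmeans`) in place of real DBSCAN and k-means so that the example stays short.

```python
from itertools import islice
from typing import Callable, Iterable, List, Tuple

Point = Tuple[float, ...]
RefPoint = Tuple[Point, int]  # (mean point, weight)


def tclusa(stream: Iterable[Point], block_size: int, m: int, k: int,
           local_cluster: Callable[[List[Point]], List[RefPoint]],
           kmeans: Callable[[List[RefPoint], int], List[RefPoint]]) -> List[RefPoint]:
    if 2 * k >= m:
        raise ValueError("a layer must hold more than 2k reference points")
    layers: List[List[RefPoint]] = [[]]
    blocks = iter(stream)
    while True:                                   # step (1)
        block = list(islice(blocks, block_size))  # step (3)
        if not block:
            break
        i = 0                                     # step (2), 0-based here
        layers[i].extend(local_cluster(block))    # steps (4)-(5)
        while len(layers[i]) >= m:                # steps (6)-(8) and (12)
            reduced = kmeans(layers[i], 2 * k)    # step (9)
            layers[i] = []
            i += 1                                # step (10)
            if i == len(layers):
                layers.append([])
            layers[i].extend(reduced)             # step (11)
    stored = [ref for layer in layers for ref in layer]
    return kmeans(stored, k) if len(stored) > k else stored  # step (18)


def merge(refs: List[RefPoint]) -> RefPoint:
    total = sum(n for _, n in refs)
    dims = len(refs[0][0])
    return (tuple(sum(c[d] * n for c, n in refs) / total for d in range(dims)), total)


def one_cluster(block: List[Point]) -> List[RefPoint]:
    """Stand-in for DBSCAN: treat the whole block as a single cluster."""
    return [merge([(p, 1) for p in block])]


def toy_kmeans(refs: List[RefPoint], n_clusters: int) -> List[RefPoint]:
    """Stand-in for k-means: sort, cut into at most n_clusters runs, merge each run."""
    ordered = sorted(refs)
    size = -(-len(ordered) // n_clusters)  # ceiling division
    return [merge(ordered[s:s + size]) for s in range(0, len(ordered), size)]


result = tclusa([(float(x),) for x in range(20)], block_size=2, m=4, k=1,
                local_cluster=one_cluster, kmeans=toy_kmeans)
print(result)  # [((9.5,), 20)]
```

With real clustering functions substituted for the stand-ins, the same loop gives the two-stage behaviour described above: local density clustering per block, then repeated k-means compression of full layers.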

The algorithm is analysed below:

Let $m$ be the number of data points in a block, $k$ the number of clusters, $c$ the number of iterations, and $t$ the number of data blocks needed to obtain $m$ density cluster mean reference points. The time overhead of the algorithm consists of two parts. First, the data stream is partitioned so that every $m$ consecutive data points form a data block, and the $m$ data points of each block are clustered with DBSCAN to generate density cluster mean reference points; this costs $O(m^2)$ per block. Second, the $m$ density cluster mean reference points are clustered with k-means, which costs $O(mkc)$. The total time overhead is therefore $O(m^2 t + mkc)$.
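The two cost terms can be written out and compared numerically. In the illustration below, $m = 100$ and $k = 5$ are the values used in the experiment described in this document, while $t = 20$ and $c = 10$ are values assumed purely for the sake of the arithmetic.

```latex
\begin{aligned}
T_{\text{DBSCAN}} &= t \cdot O(m^2) = O(m^2 t) \\
T_{k\text{-means}} &= O(mkc) \\
T_{\text{total}} &= T_{\text{DBSCAN}} + T_{k\text{-means}} = O(m^2 t + mkc) \\[4pt]
m^2 t &= 100^2 \cdot 20 = 200\,000 \\
mkc &= 100 \cdot 5 \cdot 10 = 5\,000
\end{aligned}
```

Under these assumed values the DBSCAN term is larger than the k-means term by a factor of 40, which is consistent with the later observation that the first-layer clustering accounts for most of the running time.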

Because the number of data points $m$ in a block is generally small, the total time overhead is small. Once $m$ density cluster mean reference points have been generated, they are clustered with k-means, which is simple and efficient: its computational complexity is $O(mkc)$, and usually $k \ll m$ and $c < m$. This step is also executed less often than the computation of the density cluster mean reference points, so its time overhead is smaller. As for space overhead, the algorithm uses a layered structure and stores the data as mean-point information, so it can cluster the data stream within a limited amount of memory.

As for clustering quality, TCLUSA uses the DBSCAN algorithm for the first-layer clustering, which preprocesses the data stream and handles data that are irregularly distributed and contain a large amount of noise, so the generated density cluster reference points have higher precision. The density cluster reference points are then clustered with k-means, a complete partitioning algorithm, so all cluster reference point information is retained during clustering and the result better reflects the overall distribution of the data.

A concrete experiment is described below to illustrate the technical solution of the invention:

The hardware and software configuration used to implement the proposed data stream clustering algorithm TCLUSA was as follows: a Pentium 4 2.80 GHz CPU, 256 MB of DDR-II memory and an 80 GB hard disk; the operating system was Windows XP and the development language was C++. Clustering quality was measured with the Scat Comp coefficient [38] (the smaller the coefficient, the better the clustering quality). The data source was the network intrusion detection data set KDD CUP99, which consists of 494020 LAN connection records collected over 2 weeks at the MIT Lincoln Laboratory and contains noise and irregularly distributed data. Each record in the data set has 42 attributes and corresponds to the normal pattern or to some intrusion pattern. 120000 of the records were used to simulate a data stream for clustering; the non-numeric attributes were removed and only the 34 numeric attributes were used.
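The removal of non-numeric attributes can be sketched as follows. The rows are made-up examples in the general style of connection records, not actual KDD CUP99 data, and the helper names are hypothetical; the experiment itself was implemented in C++, so this only illustrates the preprocessing idea.

```python
from typing import List, Sequence, Tuple


def is_number(text: str) -> bool:
    """True if the field parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_columns(rows: Sequence[Sequence[str]]) -> List[int]:
    """Indices of the columns whose value is numeric in every row."""
    return [j for j in range(len(rows[0])) if all(is_number(row[j]) for row in rows)]


def to_points(rows: Sequence[Sequence[str]]) -> List[Tuple[float, ...]]:
    """Drop the non-numeric columns and convert the rest to float tuples."""
    keep = numeric_columns(rows)
    return [tuple(float(row[j]) for j in keep) for row in rows]


# Made-up records: duration, protocol, service, flag, two byte counts, label.
rows = [
    ["0", "tcp", "http", "SF", "181", "5450", "normal."],
    ["0", "udp", "private", "SF", "105", "146", "normal."],
]
points = to_points(rows)
print(points)  # [(0.0, 181.0, 5450.0), (0.0, 105.0, 146.0)]
```

The resulting tuples are the kind of numeric points that the block reader and the two clustering stages operate on.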

The data stream clustering algorithm TCLUSA was compared with the k-means-based data stream clustering algorithm of document [39] (called the KSCDC algorithm here). Every 100 data records formed one block; k was set to 5, minpits to 7 and eps to 0.5. KSCDC and TCLUSA were each used to cluster data sets of 20,000 to 100,000 records; the experimental results are shown in Table 1. As Table 1 shows, TCLUSA obtains better clustering quality than KSCDC under all the parameter settings. Because the DBSCAN algorithm used in the first-layer clustering takes more time than k-means, the time performance of TCLUSA is slightly worse than that of KSCDC; in settings where clustering quality matters more, however, this loss of time performance is within an acceptable range.

Table 1: Comparison of the time performance and clustering quality of algorithms KSCDC and TCLUSA

Fig. 2 is the system implementation diagram of embodiment 2 of the invention. The system comprises a block reading module 201, a density cluster reference point acquisition module 202 and a k-means reference point acquisition module 203; the block reading module 201 is connected to the k-means reference point acquisition module 203 through the density cluster reference point acquisition module 202;

the block reading module 201 is configured to partition the data stream into blocks and read in a data block;

the density cluster reference point acquisition module 202 is configured to cluster with the DBSCAN algorithm and obtain density cluster reference points;

the k-means reference point acquisition module 203 is configured to cluster the obtained density cluster reference points with the k-means algorithm and to store the k-means reference points obtained by the k-means clustering in a layered structure.

The block reading module 201 is further configured to partition the data stream and read in data blocks as follows: within a sliding window, the data stream is processed block by block in a loop until the final clustering result is obtained; if the data have not all been processed, the next data block is read in, and so on until the whole data stream has been processed.

The invention proposes a secondary clustering method that first partitions the data stream into blocks and applies DBSCAN clustering to produce an intermediate clustering result, the set of density cluster reference points, and then applies k-means clustering to these density cluster reference points. The cluster reference points (k-means reference points) obtained in each round of clustering are stored in a layered structure. This effectively improves the clustering quality, precision and efficiency for data streams that are irregularly distributed and contain noise.

The above are only preferred embodiments of the invention and are not intended to limit it; for a person skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A secondary clustering method, characterized by comprising the following steps:

partitioning the data stream into blocks and reading in a data block;

clustering with the DBSCAN algorithm to obtain density cluster reference points;

2. The method according to claim 1, characterized in that:

the process of partitioning the data stream and reading in data blocks is: within a sliding window, the data stream is processed block by block in a loop until the final clustering result is obtained; if the data have not all been processed, the next data block is read in, and so on until the whole data stream has been processed.

3. The method according to claim 1, characterized in that the density cluster reference point is defined as follows:

4. The method according to claim 1, characterized in that the k-means reference point is defined as follows:

5. A secondary clustering system, characterized by comprising:

a block reading module, a density cluster reference point acquisition module and a k-means reference point acquisition module; the block reading module is connected to the k-means reference point acquisition module through the density cluster reference point acquisition module;

6. The system according to claim 5, characterized in that

the block reading module is further configured to partition the data stream and read in data blocks as follows: within a sliding window, the data stream is processed block by block in a loop until the final clustering result is obtained; if the data have not all been processed, the next data block is read in, and so on until the whole data stream has been processed.