CN109214465A - Flow data clustering method based on selective sampling - Google Patents
- Publication number
- CN109214465A (application CN201811172699.4A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- data
- buffer area
- points
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a flow data clustering method based on selective sampling, comprising step S1: determining the parameters to be used for clustering, where the parameters include a data set D = {x1, x2, ..., xn}, a kernel function κ, the initial number of clusters c, the initial number of buffer points m, the maximum number of buffer points M, the cluster decay rate γ, and the cluster lifetime threshold η.
Description
Technical field
The present invention relates to the field of data mining technology, and more specifically to a flow data clustering method based on selective sampling.
Background technique
In the big data era, vast amounts of data are generated every day, so flow data has become an instant hot topic. Classical data clustering methods include: (1) an adaptive nonlinear stream clustering method, which uses an anomaly detection approach to divide the flow data into several parts according to temporal locality, clusters each part, and adaptively selects representative parts as initial classes for clustering the remaining points of the stream; although this method reduces time complexity and memory usage, it does not take into account the influence of each data point's own information within the flow data, so its clustering effect is unsatisfactory; (2) a sampling-based stream clustering method (Approximate Kernel Fuzzy C-Means, AKFCM), which performs random sampling on the stream and clusters the samples; this method greatly reduces time complexity, but its accuracy is low.
Summary of the invention
The object of the present invention is to address the shortcomings of the prior art by providing a flow data clustering method based on selective sampling, comprising the following steps:
Step S1: determine the parameters to be used for clustering, where the parameters include:
Data set D: D = {x1, x2, ..., xn}, where x denotes a data point in the data set;
Kernel function κ: κ(y, y`), where y and y` denote the two arguments of the kernel function and κ(y, y`) represents the similarity between data points;
The initial number of clusters c;
The initial number of points m in the buffer, where m > c;
The maximum number of points M allowed in the buffer, where m < M;
The cluster decay rate γ;
The cluster lifetime threshold η;
Step S2: initialize the cluster center S = {x1}, Vc = 1 and Σc = κ(x1, x1), where V denotes the eigenvector;
Step S3: sample using the importance sampling method, and construct a kernel matrix from the sampled set;
Step S4: cluster the kernel matrix using the kernel k-means method to obtain a labeled matrix;
Step S5: update the kernel matrix according to the labeled matrix;
Step S6: output the clustering result.
Compared with the prior art, the flow data clustering method based on selective sampling provided by the present invention samples the data set selectively, constructs a kernel matrix from the sample points, updates the kernel matrix while sampling and decaying the cluster set, projects it into the low-dimensional space spanned by the top eigenvectors, and then clusters the sample points with k-means, thereby improving accuracy while reducing time complexity.
Detailed description of the invention
Other objects and results of the present invention will be more clearly understood from the following description taken in conjunction with the accompanying drawings and the contents of the claims, together with a more complete understanding of the invention. In the drawings:
Fig. 1 is a flowchart of the flow data clustering method based on selective sampling according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the NMI values in the comparative experiment on the Imagenet data set according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the NMI values in the comparative experiment on the Network Intrusion data set according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the NMI values in the comparative experiment on the CIFAR-10 data set according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the NMI values in the comparative experiment on the Forest Cover Type data set according to an embodiment of the present invention.
Specific embodiment
In the following description, for purposes of illustration, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It will be apparent, however, that these embodiments may also be practiced without these specific details. In other instances, for ease of describing one or more embodiments, well-known structures and devices are shown in block diagram form.
The overall idea of the present invention is to sample the data using selective sampling, then construct a kernel matrix from the sample points, and finally update the kernel matrix while sampling and decaying the cluster set, thereby completing sampling, clustering, and updating.
Hereinafter, specific embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 shows the flow of the flow data clustering method based on selective sampling according to an embodiment of the present invention.
As shown in Fig. 1, the flow data clustering method based on selective sampling provided by the embodiment of the present invention includes the following steps:
Step S1: determine the parameters to be used for clustering.
The parameters include:
Data set D: D = {x1, x2, ..., xn}, where x denotes a data point in the data set;
Kernel function κ: κ(y, y`), where y and y` denote the two arguments of the kernel function and κ(y, y`) represents the similarity between data points;
The initial number of clusters c;
The initial number of points m in the buffer, where m > c;
The maximum number of points M allowed in the buffer, where m < M;
The cluster decay rate γ;
The cluster lifetime threshold η.
Step S2: initialize the cluster center S = {x1}, Vc = 1 and Σc = κ(x1, x1), where V denotes the eigenvector.
Step S3: sample using the importance sampling method, and construct a kernel matrix from the sampled set.
The simplest way to sample a data point xt is to perform an independent Bernoulli trial: xt is stored in S with probability pt = 1/2. However, Bernoulli sampling leads to a large kernel approximation error and requires a large number of samples. To alleviate this problem, the present invention uses importance sampling instead of Bernoulli sampling: the sampling probability pt of each point xt is based on its "importance", which is defined in terms of the statistical leverage score and allows the kernel matrix Kt to be decomposed at time t. By using importance sampling, the present invention obtains a good approximation of the true kernel by sampling only a small fraction of the data set (on the order of c ln c samples).
The statistical leverage score is a criterion for measuring the conformity or correlation of a row vector with a matrix, and can thus be used to judge the similarity between the vector and the matrix. The higher the leverage score, the greater the difference between that row vector and the points in the matrix, and the smaller the correlation. Statistical leverage scores are widely used: in outlier detection, to judge whether external data are anomalous; in random matrix analysis, to analyze the correlation of data with a random matrix; and in matrix completion, for example matrix filling, to estimate the missing part of a matrix.
The statistical leverage score is computed as follows:
Let A ∈ R^(n×d) and let A(i) ∈ R^(1×d) be the i-th row of A. With the thin singular value decomposition A = UΣV^T, the statistical leverage score l(i) of the i-th row of A is l(i) = ||U(i)||², i.e. the squared Euclidean norm of the i-th row of U.
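As an illustrative sketch (not the patent's literal procedure), the leverage scores above can be computed from the left singular vectors of the data matrix and turned into importance-sampling probabilities; the parameters `k` (subspace rank) and `budget` (expected sample size) are assumptions introduced for the example:

```python
import numpy as np

def leverage_scores(A, k):
    """Statistical leverage scores of the rows of A with respect to its
    top-k left singular subspace: l_i = ||U_(i)||^2 for U from the thin SVD."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U[:, :k] ** 2, axis=1)

def importance_sampling_probs(A, k, budget):
    """Turn leverage scores into per-row sampling probabilities whose
    sum (the expected sample size) is at most `budget`."""
    lev = leverage_scores(A, k)
    return np.minimum(1.0, budget * lev / lev.sum())

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))
p = importance_sampling_probs(A, k=3, budget=40)
sample = A[rng.random(len(A)) < p]   # points selected for the buffer
```

Points with high leverage (rows poorly explained by the dominant subspace) are kept with higher probability, which is the intuition behind replacing uniform Bernoulli sampling.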
Step S4: cluster the kernel matrix using the kernel k-means method to obtain a labeled matrix.
Kernel k-means is a nonlinear extension of the popular k-means method. The key principle behind kernel k-means is to project the data into a high-dimensional reproducing kernel Hilbert space (RKHS) Hκ using a nonlinear mapping φ(·), and to perform k-means on the projected data. Given an input data set D = {x1, x2, ..., xn} to be grouped into c clusters, and a user-defined nonlinear similarity function κ(·,·) that defines the similarity between data points, the c clusters are obtained by minimizing the sum of squared errors in Hκ:
min over U of Σ(k=1..c) Σ(i=1..n) Uki · ||φ(xi) − ck(·)||²(Hκ),
where ||·||(Hκ) denotes the norm in Hκ, ck(·) denotes the k-th cluster center in the RKHS, and U denotes the c × n cluster membership matrix, with Uki = 1 if xi belongs to the k-th cluster and Uki = 0 otherwise.
Clustering: at time t, let s be the number of data points in the buffer S and c the number of classes; by solving the kernel k-means problem, the data points in S can be partitioned among the c classes.
The running time complexity of this step is O(s²). By constraining the cluster centers to a smaller subspace spanned by the top c eigenvectors, this complexity can be further reduced.
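A minimal kernel k-means sketch over a precomputed kernel (Gram) matrix of the buffer points; this illustrates the standard algorithm the section describes rather than the patent's exact implementation, and the RBF kernel and toy data below are assumptions introduced for the example:

```python
import numpy as np

def kernel_kmeans(K, c, n_iter=20, seed=0):
    """Cluster s points, given their s x s kernel matrix K, into c clusters
    by minimizing the sum of squared errors in the RKHS."""
    s = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, c, size=s)
    for _ in range(n_iter):
        dist = np.empty((s, c))
        for k in range(c):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:                 # re-seed an empty cluster
                idx = rng.integers(0, s, size=1)
            # ||phi(x_i) - c_k||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# toy example: RBF kernel on two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
labels = kernel_kmeans(np.exp(-sq), c=2)
```

Note that the distance computation never needs the explicit mapping φ(·): every term is expressed through kernel entries, which is exactly why the O(s²) cost is tied to the kernel matrix.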
The kernel matrix Kt is handled along the lines of spectral clustering methods. The present invention formulates the clustering problem as the optimization problem (4), where Ha is spanned by (v1, ..., vC) and the cluster centers are expressed as linear combinations of the eigenvectors of the kernel matrix, as in (5), where nk is the number of points in the k-th cluster and uk = (uk1, uk2, ..., uks). Substituting (5) into (4) yields an equivalent problem that can be solved efficiently by performing k-means on the eigenvector matrix.
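Restricting the centers to the span of the top c eigenvectors reduces the problem to plain k-means on a low-dimensional embedding. A hedged sketch of that reduction (the exact eigenvector scaling in equations (4)-(5) is not reproduced here; square-root eigenvalue scaling is an assumption for illustration):

```python
import numpy as np

def eig_embed(K, c):
    """Embed buffer points into the space spanned by the top-c
    eigenvectors of the kernel matrix K (scaled by sqrt eigenvalues)."""
    w, V = np.linalg.eigh(K)                      # eigenvalues ascending
    return V[:, -c:] * np.sqrt(np.maximum(w[-c:], 0.0))

def lloyd_kmeans(Z, c, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(c - 1):
        d = np.min([((Z - ctr) ** 2).sum(-1) for ctr in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((Z[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(c):
            pts = Z[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return labels

K = np.array([[2.0, 1.9, 0.0, 0.0],
              [1.9, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.9],
              [0.0, 0.0, 1.9, 2.0]])             # two obvious blocks
labels = lloyd_kmeans(eig_embed(K, 2), 2)
```

Running k-means on the c-dimensional embedding costs far less than operating on the full s x s kernel matrix, which is the point of binding the centers to the top eigenvector subspace.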
It should be noted that the eigenvalues and eigenvectors do not need to be recomputed for clustering, because they were already computed while calculating the leverage scores. This also eliminates the need to compute and store the kernel matrix Kt, since only its top eigenvalues and the corresponding eigenvectors are required for sampling and clustering. Starting from Vc = 1 and c = 1, the present invention can update the system incrementally as data points arrive.
Step S5: update the kernel matrix according to the labeled matrix.
Step S6: output the clustering result.
To demonstrate the efficiency of the present invention, four benchmark data sets (CIFAR-10, Forest Cover Type, Imagenet and Network Intrusion) are first used to simulate flow data.
The following data sets are used:
CIFAR-10: the CIFAR-10 image data set contains 60,000 unique 32 × 32 color images in 10 classes. Images are represented by 384 GIST features. The present invention compares clustering quality on this medium-sized data set against kernel k-means-based methods.
Forest Cover Type: this data set contains 581,012 data points, each representing the attributes of a 30 × 30 square meter cell of US forest land. The data, represented by 54 features, belong to 7 classes, each representing a different forest cover type.
Imagenet: the Imagenet data set contains about 14,000,000 images, organized into a concept-based "synset" hierarchy. The present invention downloaded 1,262,102 images from 34 classes and represents them with 900 bag-of-words features over SIFT descriptors.
Network Intrusion: the network intrusion data set contains 4,897,988 50-dimensional data points in 10 classes.
NMI: normalized mutual information. Since NMI is based on a nonlinear similarity measure, the proposed method performs better under it than k-means.
Accuracy (A): the most common criterion for evaluating the performance of clustering results; its calculation formula is as follows:
Sum of squared errors (SSE): a function for evaluating intra-cluster differences, defined as SSE = Σ(k=1..c) Σ(x∈Ck) ||x − mk||², where mk denotes the center of cluster Ck.
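The two evaluation measures can be sketched directly from their definitions; these are generic reference implementations, not code taken from the patent:

```python
import numpy as np

def sse(X, labels):
    """Sum of squared errors: squared distance of each point to its
    cluster mean, summed over all clusters."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def nmi(a, b):
    """Normalized mutual information between two label vectors
    (square-root normalization)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    mi = 0.0
    for i in np.unique(a):
        for j in np.unique(b):
            nij = np.sum((a == i) & (b == j))
            if nij:
                mi += nij / n * np.log(n * nij / ((a == i).sum() * (b == j).sum()))
    h = lambda x: -sum((x == v).sum() / n * np.log((x == v).sum() / n)
                       for v in np.unique(x))
    return mi / max(np.sqrt(h(a) * h(b)), 1e-12)

y_true = np.array([0, 0, 1, 1])
perfect = nmi(y_true, np.array([1, 1, 0, 0]))   # a pure label permutation still scores 1
```

NMI is invariant to permutations of cluster labels, which is why it is preferred over raw label agreement when comparing clusterings.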
τ: from η = exp(−γτ), for τ ∈ {1, 2, 3, 4, 5}, it can be seen that the larger τ, the smaller η, and vice versa. When a value exceeds η, the data point is assigned to the corresponding class. To strictly screen the data points in the kernel matrix, η should be made as large as possible; however, this also increases the running time of the method, so the trade-off between time complexity and clustering effect must be weighed.
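The monotone relation between τ and η follows directly from η = exp(−γτ); γ = 1.0 below is an illustrative value, since the experimental decay rate is reported only through the tables:

```python
import math

gamma = 1.0                                   # illustrative decay rate
etas = {tau: math.exp(-gamma * tau) for tau in (1, 2, 3, 4, 5)}
# larger tau -> smaller lifetime threshold eta; stricter screening
# (larger eta) therefore corresponds to choosing a smaller tau
```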
The sample size is 100 for the Imagenet data set, 500 for the Network Intrusion data set, 2000 for the CIFAR-10 data set, and 4000 for the Forest Cover Type data set. From Table 1: for each data set, the running time keeps increasing as τ grows, so in terms of running time the value of τ is best set to 1. From Table 2: when τ = 1, the NMI values of all four data sets are maximal, and as τ increases, the NMI value of each data set decreases rapidly. From Table 3: when τ = 1, the A values of all four data sets are also maximal. Therefore τ is set to 1 in the experiments.
Table 1. Running time (unit: ms) under different values of τ
Table 2. NMI values under different values of τ
Table 3. A values under different values of τ
The parameter h ∈ (0, 1]: the smaller the value of h, the larger the statistical leverage score an arriving data point must have, indicating a greater difference between the new point and the existing points in the kernel matrix; this spreads out the distribution of points in the kernel matrix, so the information it contains is richer. However, a smaller h also means that fewer data points satisfy the condition, which requires continual screening and causes higher time complexity. The value of h must therefore be determined experimentally for each data set by weighing time complexity against clustering effect.
Table 4. Analysis of h values on the Imagenet data set
Table 5. Analysis of h values on the Network Intrusion data set
Table 6. Analysis of h values on the CIFAR-10 data set
Figs. 2-5 show the NMI values of the four methods on the different data sets. Since both the method provided by the invention and the AKFCM method sample the data, and their sample sizes differ, their clustering effects also differ, so their NMI values vary; the KFCM and FCM methods do not sample, so their NMI values are fixed. From an analysis of Figs. 2-5 it can be seen that:
(1) The values obtained by the method provided by the invention are consistently greater than those obtained by the AKFCM method, and the gap between the two gradually widens as the data scale grows, demonstrating that the sampling method in the proposed approach is better than the random sampling method used by AKFCM.
(2) On all four data sets, the NMI value of the proposed method is higher than those of the KFCM and FCM methods, and significantly higher than that of FCM; moreover, the NMI value of the proposed method increases gradually with the number of samples, demonstrating that for clustering flow data the proposed method outperforms traditional clustering methods.
The methods are mainly compared and analyzed in terms of running time (Time/ms), sum of squared errors (SSE) and accuracy (A). The sample sizes of the four data sets are 100, 500, 2000 and 4000, respectively. Tables 7-10 show that the proposed algorithm outperforms AKFCM and KFCM in running time (Time/ms), sum of squared errors (SSE) and accuracy (A); moreover, as the data set scale grows, the accuracy of the AKFCM and KFCM algorithms gradually decreases while the proposed algorithm still maintains high accuracy, proving that its clustering effect is better than that of algorithms that cluster flow data by random sampling. Although the proposed algorithm takes longer to run than the FCM algorithm, it is much better than FCM in terms of sum of squared errors and accuracy.
Table 7. Running time, SSE and A on the Imagenet data set
Table 8. Running time, SSE and A on the Network Intrusion data set
Table 9. Running time, SSE and A on the CIFAR-10 data set
Table 10. Running time, SSE and A on the Forest Cover Type data set
The invention proposes a flow data clustering method based on selective sampling. In the sampling phase, importance sampling is used; unlike Bernoulli trials, it does not lead to large kernel approximation errors. In the data update phase, a cluster decay mechanism is used: with the arrival of new data points, classes that can no longer reflect the features of the new data are deleted in real time and replaced with new data points, guaranteeing that the analysis yields, in real time, a data model that is representative of the distribution of all the data. Experimental results show that, while guaranteeing clustering quality, this method greatly reduces the time complexity of flow data clustering, and its clustering effect is not affected as the data set scale expands, proving that the method is more advantageous for high-volume flow data.
A flow data clustering method based on selective sampling proposed according to the present invention has been described above by way of example with reference to the accompanying drawings. However, it will be understood by those skilled in the art that various improvements may be made to the implementation details of the aforementioned flow data clustering method based on importance sampling without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.
Claims (1)
1. A flow data clustering method based on selective sampling, characterized by comprising the following steps:
Step S1: determining the parameters to be used for clustering, wherein the parameters include:
a data set D: D = {x1, x2, ..., xn}, where x denotes a data point in the data set;
a kernel function κ: κ(y, y`), where y and y` denote the two arguments of the kernel function and κ(y, y`) represents the similarity between data points;
the initial number of clusters c;
the initial number of points m in the buffer, where m > c;
the maximum number of points M allowed in the buffer, where m < M;
the cluster decay rate γ;
the cluster lifetime threshold η;
Step S2: initializing the cluster center S = {x1}, Vc = 1 and Σc = κ(x1, x1), where V denotes the eigenvector;
Step S3: sampling using the importance sampling method, and constructing a kernel matrix from the sampled set;
Step S4: clustering the kernel matrix using the kernel k-means method to obtain a labeled matrix;
Step S5: updating the kernel matrix according to the labeled matrix;
Step S6: outputting the clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811172699.4A CN109214465A (en) | 2018-10-09 | 2018-10-09 | Flow data clustering method based on selective sampling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109214465A true CN109214465A (en) | 2019-01-15 |
Family
ID=64983269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811172699.4A Pending CN109214465A (en) | 2018-10-09 | 2018-10-09 | Flow data clustering method based on selective sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109214465A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239964A (en) * | 2021-04-13 | 2021-08-10 | 联合汽车电子有限公司 | Vehicle data processing method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103456019A (en) * | 2013-09-08 | 2013-12-18 | 西安电子科技大学 | Image segmentation method of semi-supervised kernel k-mean clustering based on constraint pairs |
CN106991442A (en) * | 2017-03-30 | 2017-07-28 | 中国矿业大学 | The self-adaptive kernel k means method and systems of shuffled frog leaping algorithm |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103456019A (en) * | 2013-09-08 | 2013-12-18 | 西安电子科技大学 | Image segmentation method of semi-supervised kernel k-mean clustering based on constraint pairs |
CN106991442A (en) * | 2017-03-30 | 2017-07-28 | 中国矿业大学 | The self-adaptive kernel k means method and systems of shuffled frog leaping algorithm |
Non-Patent Citations (5)
Title |
---|
CHARU C. AGGARWAL et al.: "A Framework for Clustering Evolving Data Streams", Proceedings of the 29th VLDB Conference * |
RADHA CHITTA et al.: "Approximate Kernel k-means: Solution to Large Scale Kernel Clustering", Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining * |
RADHA CHITTA et al.: "Stream Clustering: Efficient Kernel-based Approximation using Importance Sampling", Department of Computer Science and Engineering * |
TIMOTHY C. HAVENS et al.: "Speedup of Fuzzy and Possibilistic Kernel c-Means for Large-Scale Clustering", 2011 IEEE International Conference on Fuzzy Systems * |
FEI Bowen et al.: "Fuzzy Clustering Ensemble Model Based on Distance Decision", Journal of Electronics & Information Technology * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239964A (en) * | 2021-04-13 | 2021-08-10 | 联合汽车电子有限公司 | Vehicle data processing method, device, equipment and storage medium |
CN113239964B (en) * | 2021-04-13 | 2024-03-01 | 联合汽车电子有限公司 | Method, device, equipment and storage medium for processing vehicle data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190115 ||