CN109214465A - Flow data clustering method based on selective sampling - Google Patents
- Publication number
- CN109214465A (application CN201811172699.4A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- data
- buffer area
- points
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a flow data clustering method based on selective sampling, comprising step S1: determining the parameters to be used for clustering, where the parameters include a data set D = {x1, x2, ..., xn}, a kernel function κ, the initial number of clusters c, the initial number of buffer points m, the maximum number of buffer points M, the cluster decay rate γ, and the cluster lifetime threshold η.
Description
Technical field
The present invention relates to the field of data mining technology, and more specifically to a flow data clustering method based on selective sampling.
Background technique
In the big data era, vast amounts of data are generated every day, so flow data has become an instant hot topic. Classical data clustering methods include: (1) an adaptive nonlinear stream clustering method, which uses an anomaly detection approach to divide the flow data into several parts according to temporal locality, clusters each part, and adaptively selects representative parts as initial classes for clustering the remaining points of the stream; although this method reduces time complexity and memory usage, it does not take into account the influence of each data point's own information within the flow data, so its clustering effect is unsatisfactory; (2) a sampling-based stream clustering method (Approximate Kernel Fuzzy C-Means, AKFCM), which performs random sampling on the stream and clusters the samples; this method greatly reduces time complexity, but its accuracy is low.
Summary of the invention
The object of the present invention is to address the shortcomings of the prior art by providing a flow data clustering method based on selective sampling, comprising the following steps:
Step S1: determine the parameters to be used for clustering, where the parameters include:
Data set D: D = {x1, x2, ..., xn}, where x denotes a data point in the data set;
Kernel function κ: κ(y, y`), where y and y` denote the two arguments of the kernel function and κ(y, y`) represents the similarity between data points;
The initial number of clusters c;
The initial number of points m in the buffer, where m > c;
The maximum number of points M allowed in the buffer, where m < M;
The cluster decay rate γ;
The cluster lifetime threshold η;
Step S2: initialize the cluster center S = {x1}, Vc = 1 and Σc = κ(x1, x1), where V denotes the eigenvector;
Step S3: sample using the importance sampling method, and construct a kernel matrix from the sampled set;
Step S4: cluster the kernel matrix using the kernel k-means method to obtain a labeled matrix;
Step S5: update the kernel matrix according to the labeled matrix;
Step S6: output the clustering result.
Compared with the prior art, the flow data clustering method based on selective sampling provided by the present invention samples the data set selectively, constructs a kernel matrix from the sample points, updates the kernel matrix while sampling and decaying the cluster set, projects it into the low-dimensional space spanned by the top eigenvectors, and then clusters the sample points with k-means, thereby improving accuracy while reducing time complexity.
Detailed description of the invention
Other objects and results of the present invention will be more clearly understood from the following description taken in conjunction with the accompanying drawings and the contents of the claims, together with a more complete understanding of the invention. In the drawings:
Fig. 1 is a flowchart of the flow data clustering method based on selective sampling according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the NMI values in the comparative experiment on the Imagenet data set according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the NMI values in the comparative experiment on the Network Intrusion data set according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the NMI values in the comparative experiment on the CIFAR-10 data set according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the NMI values in the comparative experiment on the Forest Cover Type data set according to an embodiment of the present invention.
Specific embodiment
In the following description, for purposes of illustration, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It will be apparent, however, that these embodiments may also be practiced without these specific details. In other instances, for ease of describing one or more embodiments, well-known structures and devices are shown in block diagram form.
The overall idea of the present invention is to sample the data using selective sampling, then construct a kernel matrix from the sample points, and finally update the kernel matrix while sampling and decaying the cluster set, thereby completing sampling, clustering, and updating.
Hereinafter, specific embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 shows the flow of the flow data clustering method based on selective sampling according to an embodiment of the present invention.
As shown in Fig. 1, the flow data clustering method based on selective sampling provided by the embodiment of the present invention includes the following steps:
Step S1: determine the parameters to be used for clustering.
The parameters include:
Data set D: D = {x1, x2, ..., xn}, where x denotes a data point in the data set;
Kernel function κ: κ(y, y`), where y and y` denote the two arguments of the kernel function and κ(y, y`) represents the similarity between data points;
The initial number of clusters c;
The initial number of points m in the buffer, where m > c;
The maximum number of points M allowed in the buffer, where m < M;
The cluster decay rate γ;
The cluster lifetime threshold η.
Step S2: initialize the cluster center S = {x1}, Vc = 1 and Σc = κ(x1, x1), where V denotes the eigenvector.
Step S3: sample using the importance sampling method, and construct a kernel matrix from the sampled set.
The simplest way to sample a data point xt is to perform an independent Bernoulli trial: xt is stored in S with probability pt = 1/2. However, Bernoulli sampling leads to a large kernel approximation error and requires a large number of samples. To alleviate this problem, the present invention uses importance sampling instead of Bernoulli sampling: the sampling probability pt of each point xt is based on its "importance", which is defined in terms of the statistical leverage score and allows the kernel matrix Kt to be decomposed at time t. By using importance sampling, the present invention obtains a good approximation of the true kernel by sampling only a small fraction of the data set (on the order of c ln c samples).
The statistical leverage score is a criterion for measuring the conformity or correlation of a row vector with a matrix, and can thus be used to judge the similarity between the vector and the matrix. The higher the leverage score, the greater the difference between that row vector and the points in the matrix, and the smaller the correlation. Statistical leverage scores are widely used: in outlier detection, to judge whether external data are anomalous; in random matrix analysis, to analyze the correlation of data with a random matrix; and in matrix completion, for example matrix filling, to estimate the missing part of a matrix.
The statistical leverage score is computed as follows:
Let A ∈ R^(n×d) and let A(i) ∈ R^(1×d) be the i-th row of A. With the thin singular value decomposition A = UΣV^T, the statistical leverage score l(i) of the i-th row of A is l(i) = ||U(i)||², i.e. the squared Euclidean norm of the i-th row of U.
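As an illustrative sketch (not the patent's literal procedure), the leverage scores above can be computed from the left singular vectors of the data matrix and turned into importance-sampling probabilities; the parameters `k` (subspace rank) and `budget` (expected sample size) are assumptions introduced for the example:

```python
import numpy as np

def leverage_scores(A, k):
    """Statistical leverage scores of the rows of A with respect to its
    top-k left singular subspace: l_i = ||U_(i)||^2 for U from the thin SVD."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U[:, :k] ** 2, axis=1)

def importance_sampling_probs(A, k, budget):
    """Turn leverage scores into per-row sampling probabilities whose
    sum (the expected sample size) is at most `budget`."""
    lev = leverage_scores(A, k)
    return np.minimum(1.0, budget * lev / lev.sum())

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))
p = importance_sampling_probs(A, k=3, budget=40)
sample = A[rng.random(len(A)) < p]   # points selected for the buffer
```

Points with high leverage (rows poorly explained by the dominant subspace) are kept with higher probability, which is the intuition behind replacing uniform Bernoulli sampling.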
Step S4: cluster the kernel matrix using the kernel k-means method to obtain a labeled matrix.
Kernel k-means is a nonlinear extension of the popular k-means method. The key principle behind kernel k-means is to project the data into a high-dimensional reproducing kernel Hilbert space (RKHS) Hκ using a nonlinear mapping φ(·), and to perform k-means on the projected data. Given an input data set D = {x1, x2, ..., xn} to be grouped into c clusters, and a user-defined nonlinear similarity function κ(·,·) that defines the similarity between data points, the c clusters are obtained by minimizing the sum of squared errors in Hκ:
min over U of Σ(k=1..c) Σ(i=1..n) Uki · ||φ(xi) − ck(·)||²(Hκ),
where ||·||(Hκ) denotes the norm in Hκ, ck(·) denotes the k-th cluster center in the RKHS, and U denotes the c × n cluster membership matrix, with Uki = 1 if xi belongs to the k-th cluster and Uki = 0 otherwise.
Clustering: at time t, let s be the number of data points in the buffer S and c the number of classes; by solving the kernel k-means problem, the data points in S can be partitioned among the c classes.
The running time complexity of this step is O(s²). By constraining the cluster centers to a smaller subspace spanned by the top c eigenvectors, this complexity can be further reduced.
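A minimal kernel k-means sketch over a precomputed kernel (Gram) matrix of the buffer points; this illustrates the standard algorithm the section describes rather than the patent's exact implementation, and the RBF kernel and toy data below are assumptions introduced for the example:

```python
import numpy as np

def kernel_kmeans(K, c, n_iter=20, seed=0):
    """Cluster s points, given their s x s kernel matrix K, into c clusters
    by minimizing the sum of squared errors in the RKHS."""
    s = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, c, size=s)
    for _ in range(n_iter):
        dist = np.empty((s, c))
        for k in range(c):
            idx = np.flatnonzero(labels == k)
            if idx.size == 0:                 # re-seed an empty cluster
                idx = rng.integers(0, s, size=1)
            # ||phi(x_i) - c_k||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# toy example: RBF kernel on two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
labels = kernel_kmeans(np.exp(-sq), c=2)
```

Note that the distance computation never needs the explicit mapping φ(·): every term is expressed through kernel entries, which is exactly why the O(s²) cost is tied to the kernel matrix.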
The kernel matrix Kt is handled along the lines of spectral clustering methods. The present invention formulates the clustering problem as the optimization problem (4), where Ha is spanned by (v1, ..., vC) and the cluster centers are expressed as linear combinations of the eigenvectors of the kernel matrix, as in (5), where nk is the number of points in the k-th cluster and uk = (uk1, uk2, ..., uks). Substituting (5) into (4) yields an equivalent problem that can be solved efficiently by performing k-means on the eigenvector matrix.
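Restricting the centers to the span of the top c eigenvectors reduces the problem to plain k-means on a low-dimensional embedding. A hedged sketch of that reduction (the exact eigenvector scaling in equations (4)-(5) is not reproduced here; square-root eigenvalue scaling is an assumption for illustration):

```python
import numpy as np

def eig_embed(K, c):
    """Embed buffer points into the space spanned by the top-c
    eigenvectors of the kernel matrix K (scaled by sqrt eigenvalues)."""
    w, V = np.linalg.eigh(K)                      # eigenvalues ascending
    return V[:, -c:] * np.sqrt(np.maximum(w[-c:], 0.0))

def lloyd_kmeans(Z, c, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(c - 1):
        d = np.min([((Z - ctr) ** 2).sum(-1) for ctr in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((Z[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(c):
            pts = Z[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return labels

K = np.array([[2.0, 1.9, 0.0, 0.0],
              [1.9, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.9],
              [0.0, 0.0, 1.9, 2.0]])             # two obvious blocks
labels = lloyd_kmeans(eig_embed(K, 2), 2)
```

Running k-means on the c-dimensional embedding costs far less than operating on the full s x s kernel matrix, which is the point of binding the centers to the top eigenvector subspace.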
It should be noted that the eigenvalues and eigenvectors do not need to be recomputed for clustering, because they were already computed while calculating the leverage scores. This also eliminates the need to compute and store the kernel matrix Kt, since only its top eigenvalues and the corresponding eigenvectors are required for sampling and clustering. Starting from Vc = 1 and c = 1, the present invention can update the system incrementally as data points arrive.
Step S5: update the kernel matrix according to the labeled matrix.
Step S6: output the clustering result.
To demonstrate the efficiency of the present invention, four benchmark data sets (CIFAR-10, Forest Cover Type, Imagenet and Network Intrusion) are first used to simulate flow data.
The following data sets are used:
CIFAR-10: the CIFAR-10 image data set contains 60,000 unique 32 × 32 color images in 10 classes. Images are represented by 384 GIST features. The present invention compares clustering quality on this medium-sized data set against kernel k-means-based methods.
Forest Cover Type: this data set contains 581,012 data points, each representing the attributes of a 30 × 30 square meter cell of US forest land. The data, represented by 54 features, belong to 7 classes, each representing a different forest cover type.
Imagenet: the Imagenet data set contains about 14,000,000 images, organized into a concept-based "synset" hierarchy. The present invention downloaded 1,262,102 images from 34 classes and represents them with 900 bag-of-words features over SIFT descriptors.
Network Intrusion: the network intrusion data set contains 4,897,988 50-dimensional data points in 10 classes.
NMI: normalized mutual information. Since NMI is based on a nonlinear similarity measure, the proposed method performs better under it than k-means.
Accuracy (A): the most common criterion for evaluating the performance of clustering results; its calculation formula is as follows:
Sum of squared errors (SSE): a function for evaluating intra-cluster differences, defined as SSE = Σ(k=1..c) Σ(x∈Ck) ||x − mk||², where mk denotes the center of cluster Ck.
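The two evaluation measures can be sketched directly from their definitions; these are generic reference implementations, not code taken from the patent:

```python
import numpy as np

def sse(X, labels):
    """Sum of squared errors: squared distance of each point to its
    cluster mean, summed over all clusters."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def nmi(a, b):
    """Normalized mutual information between two label vectors
    (square-root normalization)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    mi = 0.0
    for i in np.unique(a):
        for j in np.unique(b):
            nij = np.sum((a == i) & (b == j))
            if nij:
                mi += nij / n * np.log(n * nij / ((a == i).sum() * (b == j).sum()))
    h = lambda x: -sum((x == v).sum() / n * np.log((x == v).sum() / n)
                       for v in np.unique(x))
    return mi / max(np.sqrt(h(a) * h(b)), 1e-12)

y_true = np.array([0, 0, 1, 1])
perfect = nmi(y_true, np.array([1, 1, 0, 0]))   # a pure label permutation still scores 1
```

NMI is invariant to permutations of cluster labels, which is why it is preferred over raw label agreement when comparing clusterings.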
τ: from η = exp(−γτ), for τ ∈ {1, 2, 3, 4, 5}, it can be seen that the larger τ, the smaller η, and vice versa. When a value exceeds η, the data point is assigned to the corresponding class. To strictly screen the data points in the kernel matrix, η should be made as large as possible; however, this also increases the running time of the method, so the trade-off between time complexity and clustering effect must be weighed.
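The monotone relation between τ and η follows directly from η = exp(−γτ); γ = 1.0 below is an illustrative value, since the experimental decay rate is reported only through the tables:

```python
import math

gamma = 1.0                                   # illustrative decay rate
etas = {tau: math.exp(-gamma * tau) for tau in (1, 2, 3, 4, 5)}
# larger tau -> smaller lifetime threshold eta; stricter screening
# (larger eta) therefore corresponds to choosing a smaller tau
```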
The sample size is 100 for the Imagenet data set, 500 for the Network Intrusion data set, 2000 for the CIFAR-10 data set, and 4000 for the Forest Cover Type data set. From Table 1: for each data set, the running time keeps increasing as τ grows, so in terms of running time the value of τ is best set to 1. From Table 2: when τ = 1, the NMI values of all four data sets are maximal, and as τ increases, the NMI value of each data set decreases rapidly. From Table 3: when τ = 1, the A values of all four data sets are also maximal. Therefore τ is set to 1 in the experiments.
Table 1. Running time (unit: ms) under different values of τ
Table 2. NMI values under different values of τ
Table 3. A values under different values of τ
The parameter h ∈ (0, 1]: the smaller the value of h, the larger the statistical leverage score an arriving data point must have, indicating a greater difference between the new point and the existing points in the kernel matrix; this spreads out the distribution of points in the kernel matrix, so the information it contains is richer. However, a smaller h also means that fewer data points satisfy the condition, which requires continual screening and causes higher time complexity. The value of h must therefore be determined experimentally for each data set by weighing time complexity against clustering effect.
Table 4. Analysis of h values on the Imagenet data set
Table 5. Analysis of h values on the Network Intrusion data set
Table 6. Analysis of h values on the CIFAR-10 data set
Figs. 2-5 show the NMI values of the four methods on the different data sets. Since both the method provided by the invention and the AKFCM method sample the data, and their sample sizes differ, their clustering effects also differ, so their NMI values vary; the KFCM and FCM methods do not sample, so their NMI values are fixed. From an analysis of Figs. 2-5 it can be seen that:
(1) The values obtained by the method provided by the invention are consistently greater than those obtained by the AKFCM method, and the gap between the two gradually widens as the data scale grows, demonstrating that the sampling method in the proposed approach is better than the random sampling method used by AKFCM.
(2) On all four data sets, the NMI value of the proposed method is higher than those of the KFCM and FCM methods, and significantly higher than that of FCM; moreover, the NMI value of the proposed method increases gradually with the number of samples, demonstrating that for clustering flow data the proposed method outperforms traditional clustering methods.
The methods are mainly compared and analyzed in terms of running time (Time/ms), sum of squared errors (SSE) and accuracy (A). The sample sizes of the four data sets are 100, 500, 2000 and 4000, respectively. Tables 7-10 show that the proposed algorithm outperforms AKFCM and KFCM in running time (Time/ms), sum of squared errors (SSE) and accuracy (A); moreover, as the data set scale grows, the accuracy of the AKFCM and KFCM algorithms gradually decreases while the proposed algorithm still maintains high accuracy, proving that its clustering effect is better than that of algorithms that cluster flow data by random sampling. Although the proposed algorithm takes longer to run than the FCM algorithm, it is much better than FCM in terms of sum of squared errors and accuracy.
Table 7. Running time, SSE and A on the Imagenet data set
Table 8. Running time, SSE and A on the Network Intrusion data set
Table 9. Running time, SSE and A on the CIFAR-10 data set
Table 10. Running time, SSE and A on the Forest Cover Type data set
The invention proposes a flow data clustering method based on selective sampling. In the sampling phase, importance sampling is used; unlike Bernoulli trials, it does not lead to large kernel approximation errors. In the data update phase, a cluster decay mechanism is used: with the arrival of new data points, classes that can no longer reflect the features of the new data are deleted in real time and replaced with new data points, guaranteeing that the analysis yields, in real time, a data model that is representative of the distribution of all the data. Experimental results show that, while guaranteeing clustering quality, this method greatly reduces the time complexity of flow data clustering, and its clustering effect is not affected as the data set scale expands, proving that the method is more advantageous for high-volume flow data.
A flow data clustering method based on selective sampling proposed according to the present invention has been described above by way of example with reference to the accompanying drawings. However, it will be understood by those skilled in the art that various improvements may be made to the implementation details of the aforementioned flow data clustering method based on importance sampling without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.
Claims (1)
1. A flow data clustering method based on selective sampling, characterized by comprising the following steps:
Step S1: determining the parameters to be used for clustering, wherein the parameters include:
a data set D: D = {x1, x2, ..., xn}, where x denotes a data point in the data set;
a kernel function κ: κ(y, y`), where y and y` denote the two arguments of the kernel function and κ(y, y`) represents the similarity between data points;
the initial number of clusters c;
the initial number of points m in the buffer, where m > c;
the maximum number of points M allowed in the buffer, where m < M;
the cluster decay rate γ;
the cluster lifetime threshold η;
Step S2: initializing the cluster center S = {x1}, Vc = 1 and Σc = κ(x1, x1), where V denotes the eigenvector;
Step S3: sampling using the importance sampling method, and constructing a kernel matrix from the sampled set;
Step S4: clustering the kernel matrix using the kernel k-means method to obtain a labeled matrix;
Step S5: updating the kernel matrix according to the labeled matrix;
Step S6: outputting the clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811172699.4A CN109214465A (en) | 2018-10-09 | 2018-10-09 | Flow data clustering method based on selective sampling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109214465A true CN109214465A (en) | 2019-01-15 |
Family
ID=64983269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811172699.4A Pending CN109214465A (en) | 2018-10-09 | 2018-10-09 | Flow data clustering method based on selective sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109214465A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239964A (en) * | 2021-04-13 | 2021-08-10 | 联合汽车电子有限公司 | Vehicle data processing method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103456019A (en) * | 2013-09-08 | 2013-12-18 | 西安电子科技大学 | Image segmentation method of semi-supervised kernel k-mean clustering based on constraint pairs |
CN106991442A (en) * | 2017-03-30 | 2017-07-28 | 中国矿业大学 | The self-adaptive kernel k means method and systems of shuffled frog leaping algorithm |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103456019A (en) * | 2013-09-08 | 2013-12-18 | 西安电子科技大学 | Image segmentation method of semi-supervised kernel k-mean clustering based on constraint pairs |
CN106991442A (en) * | 2017-03-30 | 2017-07-28 | 中国矿业大学 | The self-adaptive kernel k means method and systems of shuffled frog leaping algorithm |
Non-Patent Citations (5)
Title |
---|
CHARU C. AGGARWAL et al.: "A Framework for Clustering Evolving Data Streams", Proceedings of the 29th VLDB Conference * |
RADHA CHITTA et al.: "Approximate Kernel k-means: Solution to Large Scale Kernel Clustering", Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining * |
RADHA CHITTA et al.: "Stream Clustering: Efficient Kernel-based Approximation using Importance Sampling", Department of Computer Science and Engineering * |
TIMOTHY C. HAVENS et al.: "Speedup of Fuzzy and Possibilistic Kernel c-Means for Large-Scale Clustering", 2011 IEEE International Conference on Fuzzy Systems * |
FEI Bowen et al.: "Fuzzy Clustering Ensemble Model Based on Distance Decision", Journal of Electronics & Information Technology * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239964A (en) * | 2021-04-13 | 2021-08-10 | 联合汽车电子有限公司 | Vehicle data processing method, device, equipment and storage medium |
CN113239964B (en) * | 2021-04-13 | 2024-03-01 | 联合汽车电子有限公司 | Method, device, equipment and storage medium for processing vehicle data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190115 ||