CN109214465A - Flow data clustering method based on selective sampling - Google Patents

Flow data clustering method based on selective sampling

Info

Publication number
CN109214465A
CN109214465A (application CN201811172699.4A)
Authority
CN
China
Prior art keywords
cluster
data
buffer area
points
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811172699.4A
Other languages
Chinese (zh)
Inventor
邱云飞
张哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN201811172699.4A priority Critical patent/CN109214465A/en
Publication of CN109214465A publication Critical patent/CN109214465A/en
Pending legal-status Critical Current

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/23 — Clustering techniques
    • G06F 18/232 — Non-hierarchical techniques
    • G06F 18/2321 — Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 — Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a stream-data clustering method based on selective sampling, comprising: S1, determining the parameters for clustering, wherein the parameters include the data set D = {x1, x2, …, xn} and a kernel function κ.

Description

Flow data clustering method based on selective sampling
Technical field
The present invention relates to the field of data-mining technology, and more specifically to a stream-data clustering method based on selective sampling.
Background technique
Vast amounts of data are generated every day in the big-data era, so stream data has become a hot topic. Classical stream clustering methods include: (1) an adaptive nonlinear stream clustering method, which uses an anomaly-detection approach to divide the stream into several parts according to temporal locality, clusters each part, and adaptively selects a representative part as the initial class for clustering the remaining points of the stream; although this method reduces time complexity and memory usage, it does not account for the influence of the information carried by each data point in the stream, so its clustering quality is unsatisfactory; (2) a sampling-based stream clustering method (Approximate Kernel Fuzzy C-Means, AKFCM), which randomly samples the stream and clusters the samples; this greatly reduces time complexity, but its accuracy is low.
Summary of the invention
The object of the present invention is to address the shortcomings of the prior art by providing a stream-data clustering method based on selective sampling, comprising the following steps:
Step S1: determine the parameters for clustering; wherein the parameters include:
Data set D = {x1, x2, …, xn}; where x denotes a data point in the data set;
Kernel function κ(y, y′); where y and y′ are the two arguments of the kernel function, and κ(y, y′) expresses the similarity between data points;
Initial number of clusters c;
Initial number of points m in the buffer, with m greater than the initial number of clusters c;
Maximum number of points M allowed in the buffer, with the initial number of buffer points m less than M;
Cluster decay rate γ;
Cluster lifetime threshold η;
Step S2: initialise the cluster-centre set S = {x1}, V_c = 1 and Σ_c = κ(x1, x1); where V denotes the eigenvector;
Step S3: sample using importance sampling and construct the kernel matrix from the sampled set;
Step S4: cluster the kernel matrix using kernel k-means to obtain a labelled matrix;
Step S5: update the kernel matrix according to the labelled matrix;
Step S6: output the clustering result.
Compared with the prior art, the stream-data clustering method based on selective sampling provided by the present invention samples the data set with selective sampling, builds the kernel matrix from the sampled points, updates the kernel matrix as decayed clusters are removed, projects the data into the low-dimensional space spanned by the top eigenvectors, and then clusters the sample points with k-means, thereby improving accuracy while reducing time complexity.
Detailed description of the invention
Other objects and results of the present invention will become clearer and more readily appreciated from the following description taken in conjunction with the accompanying drawings and the contents of the claims, which provide a more complete understanding of the invention. In the drawings:
Fig. 1 is a flowchart of the stream-data clustering method based on selective sampling according to an embodiment of the present invention;
Fig. 2 is a diagram of the NMI values of the comparative experiment on the Imagenet data set according to an embodiment of the present invention;
Fig. 3 is a diagram of the NMI values of the comparative experiment on the Network Intrusion data set according to an embodiment of the present invention;
Fig. 4 is a diagram of the NMI values of the comparative experiment on the CIFAR-10 data set according to an embodiment of the present invention;
Fig. 5 is a diagram of the NMI values of the comparative experiment on the Forest Cover Type data set according to an embodiment of the present invention.
Specific embodiment
In the following description, numerous specific details are set forth for purposes of illustration in order to provide a thorough understanding of one or more embodiments. It will be apparent, however, that these embodiments may also be practised without these specific details. In other instances, well-known structures and devices are shown in block-diagram form for ease of description.
The overall idea of the invention is to sample the data using selective sampling, then build the kernel matrix from the sampled points, and finally update the kernel matrix as decayed clusters are removed, thereby completing the sampling, clustering and updating.
Hereinafter, specific embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 shows the flow of the stream-data clustering method based on selective sampling according to an embodiment of the present invention.
As shown in Fig. 1, the stream-data clustering method based on selective sampling provided by the embodiment of the present invention comprises the following steps:
Step S1: determine the parameters for clustering.
The parameters include:
Data set D = {x1, x2, …, xn}; where x denotes a data point in the data set.
Kernel function κ(y, y′); where y and y′ are the two arguments of the kernel function, and κ(y, y′) expresses the similarity between data points;
Initial number of clusters c;
Initial number of points m in the buffer, with m greater than the initial number of clusters c;
Maximum number of points M allowed in the buffer, with the initial number of buffer points m less than M;
Cluster decay rate γ;
Cluster lifetime threshold η.
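The concrete kernel function is left open above (its formula appears only as an image in the original). A minimal sketch of one common choice, the Gaussian RBF kernel — the name `rbf_kernel` and the bandwidth σ are illustrative assumptions, not the patent's definition:

```python
import math

def rbf_kernel(y, y2, sigma=1.0):
    """Gaussian RBF kernel: similarity in (0, 1], equal to 1 when y == y2.

    Plays the role of kappa(y, y') above: a nonlinear similarity
    between two data points given as equal-length sequences.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(y, y2))
    return math.exp(-d2 / (2.0 * sigma ** 2))
```

Identical inputs score 1 and the similarity decays with distance, matching the role of κ(y, y′) as a similarity between data points.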
Step S2: initialise the cluster-centre set S = {x1}, V_c = 1 and Σ_c = κ(x1, x1).
Here V denotes the eigenvector.
Step S3: sample using importance sampling and construct the kernel matrix from the sampled set.
The simplest way to sample a data point x_t is to run an independent Bernoulli trial, i.e. x_t is stored in S with probability p_t = 1/2. However, Bernoulli sampling leads to a large kernel-approximation error and requires a large number of samples. To alleviate this problem, the present invention uses importance sampling instead of Bernoulli sampling: the sampling probability p_t of each point x_t is based on its "importance", defined in terms of the statistical leverage scores obtained by decomposing the kernel matrix K_t at time t. By using importance sampling, the present invention obtains a good approximation of the true kernel by sampling only a small fraction of the data set (about s = c·ln c samples).
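The text above only states that the sampling probability p_t is based on the point's leverage score; the exact rule is not reproduced. A hedged sketch of such an admission test, contrasted with fixed Bernoulli(1/2) sampling — the scale h and the cap at 1 are illustrative assumptions:

```python
import random

def admit(x_index, leverage_scores, h=0.5, rng=random.random):
    """Admit a point into the buffer S with probability proportional to
    its leverage score (capped at 1), instead of a fixed Bernoulli(1/2).

    High-leverage points (dissimilar to what is already in the kernel
    matrix) are kept with high probability; redundant points are dropped.
    """
    p = min(1.0, leverage_scores[x_index] / h)
    return rng() < p
```

With `rng` injectable, the decision is testable deterministically; in production one would use the default `random.random`.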
The statistical leverage score measures the consistency or correlation between a row vector and a matrix, and thus the similarity of that vector to the matrix. The higher the leverage score, the greater the difference between the row vector and the points in the matrix, and the smaller the correlation. Leverage scores are used fairly widely: in outlier detection, to judge whether external data are anomalous; in random-matrix analysis, to analyse the correlation between the data and a random matrix; and in matrix completion (e.g. matrix filling), to estimate the missing part of a matrix.
The statistical leverage score is computed as follows:
Let A ∈ R^(n×d) and let A_(i) ∈ R^(1×d) be the i-th row of A; the statistical leverage score l_i of the i-th row of A is:
l_i = A_(i) (A^T A)^+ A_(i)^T
(equivalently, the squared Euclidean norm of the i-th row of U in the thin SVD A = UΣV^T).
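The row leverage scores above can be computed directly from the definition. A self-contained sketch for a full-column-rank matrix A (pure Python, intended for small d), using the normal-equations form l_i = A_(i)(A^T A)^(-1)A_(i)^T:

```python
def leverage_scores(A):
    """Statistical leverage score of each row of a full-column-rank
    matrix A (given as a list of rows): l_i = A_i (A^T A)^-1 A_i^T.
    The scores sum to d (= rank of A), a handy sanity check."""
    n, d = len(A), len(A[0])
    # Gram matrix G = A^T A
    G = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(d)]
         for i in range(d)]
    # Invert G by Gauss-Jordan elimination with partial pivoting
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(d)]
         for i, row in enumerate(G)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(d):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    Ginv = [row[d:] for row in M]
    # l_i = A_i Ginv A_i^T for each row i
    scores = []
    for i in range(n):
        Ai = A[i]
        t = [sum(Ai[k] * Ginv[k][j] for k in range(d)) for j in range(d)]
        scores.append(sum(t[j] * Ai[j] for j in range(d)))
    return scores
```

For rank-deficient A one would use the pseudo-inverse (e.g. via SVD) instead of the plain inverse; this sketch assumes full column rank for brevity.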
Step S4: cluster the kernel matrix using kernel k-means to obtain a labelled matrix.
Kernel k-means is a nonlinear extension of the popular k-means method. Its key principle is to project the data into a high-dimensional reproducing kernel Hilbert space (RKHS) H_κ through a nonlinear map ω(·) and to run k-means on the projected data. Given an input data set D = {x1, x2, …, xn} to be grouped into c clusters, and a user-defined nonlinear similarity function κ(·,·) that defines the similarity between data points, the c clusters are obtained by minimising the sum of squared errors in H_κ.
Here ||·||_{H_κ} denotes the norm in H_κ, c_k(·) denotes the k-th cluster centre in the RKHS, and U is the c × n cluster-membership matrix, with U_ki = 1 if x_i belongs to the k-th cluster and 0 otherwise.
Clustering proceeds as follows: at time t, let s be the number of data points in the buffer S and c the number of classes; solving the kernel k-means problem partitions the s data points into the c classes.
The running-time complexity of this step is O(s²). This complexity is further reduced by restricting the cluster centres to the smaller subspace spanned by the top c eigenvectors.
Following the line of spectral clustering on the kernel matrix K_t, the present invention formulates the clustering problem as an optimisation problem:
where H_a is the subspace spanned by (v1, …, vc), so each cluster centre is expressed as a linear combination of the eigenvectors of the kernel matrix;
where n_k is the number of points in the k-th cluster and U_k = (U_k1, U_k2, …, U_ks); substituting (5) into (4) yields:
The above problem can be solved efficiently by running k-means on the resulting matrix.
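The kernel k-means step can be sketched without any eigendecomposition by running Lloyd-style updates directly with the kernel trick: the squared distance of a point to a cluster centre expands into kernel sums. This is a plain kernel k-means sketch, not the patent's eigenvector-subspace variant; the initial labelling is assumed given.

```python
def kernel_kmeans(K, c, labels, iters=20):
    """Lloyd-style kernel k-means on a precomputed kernel matrix K
    (list of lists). labels gives each point's initial cluster in
    range(c). Distance of point i to centre k expands via the kernel
    trick: K[i][i] - 2*mean_j K[i][j] + mean_{j,l} K[j][l] over C_k."""
    n = len(K)
    for _ in range(iters):
        changed = False
        clusters = [[i for i in range(n) if labels[i] == k] for k in range(c)]
        # within-cluster term sum_{j,l in C_k} K[j][l] / |C_k|^2
        within = []
        for idx in clusters:
            m = len(idx)
            within.append(sum(K[j][l] for j in idx for l in idx) / (m * m)
                          if m else float("inf"))
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for k in range(c):
                idx = clusters[k]
                if not idx:
                    continue
                d = K[i][i] - 2.0 * sum(K[i][j] for j in idx) / len(idx) + within[k]
                if d < best_d:
                    best_d, best = d, k
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

Cluster statistics are refreshed once per outer iteration (a batch-style update), which keeps the cost at the O(s²) per iteration stated above.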
It should be noted that the eigenvalues and eigenvectors do not need to be recomputed for clustering, because they were already computed while calculating the leverage scores; this eliminates the need to compute and store the full kernel matrix K_t, since only its top eigenvalues and the corresponding eigenvectors are required for both sampling and clustering. Starting from V_c = 1 and c = 1, the present invention can incrementally update the system as data points arrive.
Step S5: update the kernel matrix according to the labelled matrix.
Step S6: output the clustering result.
To demonstrate the efficiency of the invention, four benchmark data sets (CIFAR-10, Forest Cover Type, Imagenet and Network Intrusion) are first used to simulate data streams.
The following data sets are used:
CIFAR-10: the CIFAR-10 image data set contains 60,000 unique 32 × 32 colour images in 10 classes. Each image is represented by a 384-dimensional GIST feature vector. On this medium-sized data set the present invention is compared for clustering quality against kernel k-means-based methods.
Forest Cover Type: this data set contains 581,012 data points, each representing the attributes of a 30 × 30 metre cell of US forest land. The data are represented by 54 features and belong to 7 classes, each class corresponding to a different forest cover type.
Imagenet: the Imagenet data set contains about 14,000,000 images, organised into a concept-based "synset" hierarchy. The present invention downloaded 1,262,102 images from 34 classes and represents them with 900 bag-of-words features over SIFT descriptors.
Network Intrusion: the network-intrusion data set contains 4,897,988 50-dimensional data points in 10 classes.
NMI: normalised mutual information. Since NMI adopts a nonlinear similarity measure, the proposed method achieves better results under it than k-means.
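NMI can be computed from label counts alone. A minimal sketch using natural-log entropies and square-root normalisation (one common convention; the patent does not state which it uses):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalised mutual information NMI = I(A;B) / sqrt(H(A) * H(B)).

    Invariant to permutations of the label names; 1.0 for identical
    partitions, 0.0 for independent ones.
    """
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), nab in joint.items():
        mi += (nab / n) * math.log(n * nab / (ca[a] * cb[b]))
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)
```

Other normalisations (arithmetic or geometric mean of the entropies, max) exist; results in the tables could shift slightly under a different choice.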
Accuracy (A): the most common criterion for evaluating clustering performance; its calculation formula is as follows:
Sum of squared errors (SSE): a function for evaluating within-class differences; its formula is as follows:
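The SSE formula itself appears only as an image in the original; the usual definition — the sum of squared Euclidean distances from each point to its assigned cluster centre — can be sketched as:

```python
def sse(points, labels, centres):
    """Sum of squared Euclidean distances from each point to the centre
    of the cluster it is assigned to (lower = more compact clusters)."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(pt, centres[k]))
        for pt, k in zip(points, labels)
    )
```
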
τ: from η = exp(-γτ), for τ ∈ {1, 2, 3, 4, 5} it follows that the larger τ is, the smaller η is, and vice versa. When the direct-influence value of a data point is greater than η, the point is assigned to the corresponding class. To screen the data points in the kernel matrix strictly, η should be made as large as possible, but a larger η also increases the running time of the method, so the trade-off between time complexity and clustering quality must be weighed. The sample size is 100 for the Imagenet data set, 500 for Network Intrusion, 2000 for CIFAR-10 and 4000 for Forest Cover Type. Table 1 shows that for every data set the running time increases steadily with τ, so in terms of running time τ = 1 is preferable. Table 2 shows that at τ = 1 the NMI values of all four data sets are maximal, and that the NMI values drop quickly as τ grows. Table 3 shows that at τ = 1 the A values of all four data sets are also maximal. Therefore τ is set to 1 in the experiments.
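The decay relation η = exp(-γτ) used above can be sketched directly; the expiry helper `is_expired` and its threshold semantics are illustrative assumptions about how the lifetime threshold is applied, not the patent's exact rule.

```python
import math

def cluster_weight(gamma, tau):
    """Cluster decay weight eta = exp(-gamma * tau):
    the larger tau is, the smaller eta is, and vice versa."""
    return math.exp(-gamma * tau)

def is_expired(gamma, age, eta_threshold):
    """A cluster whose decayed weight has fallen below the lifetime
    threshold no longer reflects the new data and can be dropped."""
    return cluster_weight(gamma, age) < eta_threshold
```

This reproduces the monotonic trade-off discussed above: a small τ keeps η large (strict screening, longer running time), a large τ lets clusters expire quickly.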
Table 1: running time (unit: ms) under different values of τ
Table 2: NMI values under different values of τ
Table 3: A values under different values of τ
The parameter h ∈ (0, 1]: the smaller the value of h, the larger the statistical leverage score a newly arriving data point must have, indicating a larger difference between that data point and the points already in the kernel matrix; the points in the kernel matrix therefore become more spread out and carry richer information. However, a smaller h also means that fewer data points satisfy the condition, and the continual screening causes a larger time complexity. The value of h must therefore be determined experimentally for each data set by weighing time complexity against clustering quality.
Table 4: analysis of h values on the Imagenet data set
Table 5: analysis of h values on the Network Intrusion data set
Table 6: analysis of h values on the CIFAR-10 data set
Figs. 2-5 show the NMI values of the four methods on the different data sets. Since the proposed method and the AKFCM method both sample, their clustering quality varies with sample size, so their NMI values vary; KFCM and FCM are non-sampling methods, so their NMI values are fixed. Analysing Figs. 2-5 shows:
(1) The values obtained by the proposed method are consistently greater than those obtained by the AKFCM method, and the gap gradually widens as the data scale grows, proving that the sampling method used in the proposed method is superior to the random sampling adopted by AKFCM.
(2) On all four data sets the NMI value of the proposed method is higher than those of the KFCM and FCM methods, and markedly higher than that of FCM; moreover, the NMI value of the proposed method gradually increases with the number of samples, proving that for clustering stream data the proposed method outperforms the traditional clustering methods.
The comparison mainly analyses running time (Time/ms), sum of squared errors (SSE) and accuracy (A). The sample sizes of the four data sets are 100, 500, 2000 and 4000 respectively. Tables 7-10 show that in running time (Time/ms), SSE and accuracy (A) the proposed algorithm outperforms AKFCM and KFCM, and that, as the data-set scale grows, the accuracy of the AKFCM and KFCM algorithms gradually drops while the proposed algorithm still retains a high accuracy, proving that its clustering quality is better than that of algorithms that cluster stream data with random sampling. Although the proposed algorithm needs more running time than the FCM algorithm, it is far better than FCM in SSE and accuracy.
Table 7: Imagenet data set running time, SSE and A
Table 8: Network Intrusion data set running time, SSE and A
Table 9: CIFAR-10 data set running time, SSE and A
Table 10: Forest Cover Type data set running time, SSE and A
The present invention proposes a stream-data clustering method based on selective sampling. In the sampling phase it uses importance sampling which, unlike Bernoulli trials, does not cause a large kernel-approximation error. In the data-update phase it uses a cluster-decay mechanism: as new data points arrive, classes that no longer reflect the features of the new points are deleted in real time and replaced with new data points, guaranteeing that the analysis yields, in real time, a data model that is representative of the distribution of all the data. The experimental results show that, while preserving clustering quality, the method greatly reduces the time complexity of clustering stream data, and that its clustering quality is unaffected as the data-set scale grows, proving that the method is especially advantageous for high-volume data streams.
The stream-data clustering method based on selective sampling proposed by the present invention has been described above by way of example with reference to the accompanying drawings. Those skilled in the art will understand, however, that various improvements can be made to the implementation details of the above stream-data clustering method based on importance sampling without departing from the content of the present invention. Therefore, the protection scope of the present invention should be determined by the content of the appended claims.

Claims (1)

1. A stream-data clustering method based on selective sampling, characterised by comprising the following steps:
Step S1: determine the parameters for clustering; wherein the parameters include:
Data set D = {x1, x2, …, xn}; where x denotes a data point in the data set;
Kernel function κ(y, y′); where y and y′ are the two arguments of the kernel function, and κ(y, y′) expresses the similarity between data points;
Initial number of clusters c;
Initial number of points m in the buffer, with m greater than the initial number of clusters c;
Maximum number of points M allowed in the buffer, with the initial number of buffer points m less than M;
Cluster decay rate γ;
Cluster lifetime threshold η;
Step S2: initialise the cluster-centre set S = {x1}, V_c = 1 and Σ_c = κ(x1, x1); where V denotes the eigenvector;
Step S3: sample using importance sampling and construct the kernel matrix from the sampled set;
Step S4: cluster the kernel matrix using kernel k-means to obtain a labelled matrix;
Step S5: update the kernel matrix according to the labelled matrix;
Step S6: output the clustering result.
CN201811172699.4A 2018-10-09 2018-10-09 Flow data clustering method based on selective sampling Pending CN109214465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811172699.4A CN109214465A (en) 2018-10-09 2018-10-09 Flow data clustering method based on selective sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811172699.4A CN109214465A (en) 2018-10-09 2018-10-09 Flow data clustering method based on selective sampling

Publications (1)

Publication Number Publication Date
CN109214465A true CN109214465A (en) 2019-01-15

Family

ID=64983269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811172699.4A Pending CN109214465A (en) 2018-10-09 2018-10-09 Flow data clustering method based on selective sampling

Country Status (1)

Country Link
CN (1) CN109214465A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239964A (en) * 2021-04-13 2021-08-10 联合汽车电子有限公司 Vehicle data processing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456019A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Image segmentation method of semi-supervised kernel k-mean clustering based on constraint pairs
CN106991442A (en) * 2017-03-30 2017-07-28 中国矿业大学 The self-adaptive kernel k means method and systems of shuffled frog leaping algorithm

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456019A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Image segmentation method of semi-supervised kernel k-mean clustering based on constraint pairs
CN106991442A (en) * 2017-03-30 2017-07-28 中国矿业大学 The self-adaptive kernel k means method and systems of shuffled frog leaping algorithm

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHARU C. AGGARWAL等: ""A Framework for Clustering Evolving Data Streams"", 《PROCEEDINGS OF THE 29TH VLDB CONFERENCE》 *
RADHA CHITTA等: ""Approximate Kernel k-means: Solution to Large Scale Kernel Clustering"", 《PROCEEDINGS OF THE 17TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING》 *
RADHA CHITTA等: ""Stream Clustering: Efficient Kernel-based Approximation using Importance Sampling"", 《DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING》 *
TIMOTHY C. HAVENS等: ""Speedup of Fuzzy and Possibilistic Kernel c-Means for Large-Scale Clustering"", 《2011 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS》 *
FEI BOWEN et al.: ""Fuzzy Clustering Ensemble Model under Distance Decision"", 《JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239964A (en) * 2021-04-13 2021-08-10 联合汽车电子有限公司 Vehicle data processing method, device, equipment and storage medium
CN113239964B (en) * 2021-04-13 2024-03-01 联合汽车电子有限公司 Method, device, equipment and storage medium for processing vehicle data

Similar Documents

Publication Publication Date Title
Luo et al. Video anomaly detection with sparse coding inspired deep neural networks
Yang et al. Action recognition using super sparse coding vector with spatio-temporal awareness
CN104679818B (en) A kind of video key frame extracting method and system
CN102902978A (en) Object-oriented high-resolution remote-sensing image classification method
He et al. Fast kernel learning for spatial pyramid matching
CN108229674A (en) The training method and device of cluster neural network, clustering method and device
CN105354595A (en) Robust visual image classification method and system
CN106778714B (en) LDA face identification method based on nonlinear characteristic and model combination
CN106845358A (en) A kind of method and system of handwritten character characteristics of image identification
CN109948735A (en) A kind of multi-tag classification method, system, device and storage medium
CN106651915A (en) Target tracking method of multi-scale expression based on convolutional neural network
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN110322418A (en) A kind of super-resolution image generates the training method and device of confrontation network
CN108564111A (en) A kind of image classification method based on neighborhood rough set feature selecting
CN108664653A A K-means-based automatic classification method for medical-consumption clients
CN111125434A (en) Relation extraction method and system based on ensemble learning
WO2022152009A1 (en) Target detection method and apparatus, and device and storage medium
CN115545103A (en) Abnormal data identification method, label identification method and abnormal data identification device
CN109711442A (en) Unsupervised layer-by-layer generation fights character representation learning method
CN109523514A (en) To the batch imaging quality assessment method of Inverse Synthetic Aperture Radar ISAR
CN109214465A (en) Flow data clustering method based on selective sampling
CN105512675B (en) A kind of feature selection approach based on the search of Memorability multiple point crossover gravitation
CN112560105B (en) Joint modeling method and device for protecting multi-party data privacy
CN104200220B (en) Dynamic texture identification method based on static texture model aggregation
CN108549915A (en) Image hash code training pattern algorithm based on two-value weight and classification learning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190115

RJ01 Rejection of invention patent application after publication