CN109978042A - An adaptive fast K-means clustering method incorporating feature learning - Google Patents

An adaptive fast K-means clustering method incorporating feature learning

Info

Publication number
CN109978042A
CN109978042A
Authority
CN
China
Prior art keywords
matrix
data
feature
cluster
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910209441.5A
Other languages
Chinese (zh)
Inventor
王晓栋
严菲
曾志强
陈玉明
洪朝群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University of Technology
Original Assignee
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University of Technology filed Critical Xiamen University of Technology
Priority to CN201910209441.5A priority Critical patent/CN109978042A/en
Publication of CN109978042A publication Critical patent/CN109978042A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/10 - Pre-processing; Data cleansing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G06F18/21345 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis enforcing sparsity or involving a domain transformation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The present invention discloses an adaptive fast K-means clustering method incorporating feature learning. The data are first preprocessed to resolve missing attributes and duplicate records, and each data attribute is normalized. The total scatter matrix of the data is then computed, and a feature selection matrix with a sparsity structure is constructed. K-means clustering is performed on the feature subspace, and during the cluster-centre update an adaptive factor dynamically adjusts the weight of each data sample. The feature selection matrix is updated according to the discriminative information between clusters, thereby selecting an optimal feature subset. The method enables traditional K-means clustering to improve its accuracy by efficiently exploiting the discriminative information between and within clusters together with the correlations among features. It also incorporates an adaptive factor into the clustering process, so that cluster centres are updated according to the distribution characteristics of different types of data. The method offers high practicality and scalability and can effectively support applications in machine learning, computer vision, and related fields.

Description

An adaptive fast K-means clustering method incorporating feature learning
Technical field
The invention belongs to the field of machine learning, and in particular relates to an adaptive fast K-means clustering method that incorporates feature learning.
Background technique
Clustering is a technique very widely used in machine learning; among clustering methods, K-means is applied the most extensively and has achieved good results in fields such as data mining, healthcare, and education. However, with the rapid development of multimedia and Internet technologies, high-dimensional data are growing explosively, posing a huge challenge to the traditional K-means method. Because high-dimensional data often contain redundant and noisy features, applying K-means directly to such data not only consumes substantial computing resources but also degrades clustering accuracy. Recent studies show that if the dimensionality of the data features is reduced beforehand, the clustering efficiency of K-means improves markedly.
In recent years, some studies have combined dimensionality-reduction methods (such as Linear Discriminant Analysis (LDA) and the Orthogonal Centroid Method (OCM)) with K-means: the former provides an optimal subspace for the latter, while the clustering result of the latter on the subspace serves as label information for the former. Although such methods can effectively improve the clustering accuracy of K-means, they all rely on eigendecomposition to solve for the optimal feature subspace, whose computational cost grows quadratically with the feature dimension of the data, and the resulting feature space differs greatly from the original one. They are therefore ill-suited to processing high-dimensional data in real application scenarios.
Summary of the invention
The purpose of the present invention is to provide an adaptive fast K-means clustering method incorporating feature learning, so that traditional K-means clustering can improve its accuracy by efficiently exploiting the discriminative information between and within clusters together with the correlations among features. The method also incorporates an adaptive factor into the clustering process, allowing cluster centres to be updated according to the distribution characteristics of different types of data. It offers high practicality and scalability and can effectively support applications in machine learning, computer vision, and related fields.
To achieve the above objectives, the solution adopted by the invention is as follows.

An adaptive fast K-means clustering method incorporating feature learning comprises the following steps:
(1) Preprocess the data: resolve missing attributes and duplicate records, and normalize each data attribute, thereby obtaining n unlabeled samples with D features each, X = [x_1, x_2, \ldots, x_n] \in R^{D \times n}, where x_i \in R^{D \times 1} denotes the i-th data sample. Compute the total scatter matrix of the data:

S_t = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i
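As an illustration only, the following minimal numpy sketch computes the total scatter matrix of such a preprocessed data matrix (the function and variable names are the sketch's own, not the patent's):

```python
import numpy as np

def total_scatter(X):
    """Total scatter matrix S_t = sum_i (x_i - mu)(x_i - mu)^T
    for a D x n data matrix X whose columns are samples."""
    mu = X.mean(axis=1, keepdims=True)   # D x 1 sample mean
    Xc = X - mu                          # centred data
    return Xc @ Xc.T                     # D x D scatter matrix

# toy usage: n = 5 samples with D = 3 normalized features
X = np.random.rand(3, 5)
St = total_scatter(X)
print(St.shape)  # (3, 3)
```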
(2) Set the number d of features to be selected, and construct the feature selection matrix W:

W = [w_{I(1)}, w_{I(2)}, \ldots, w_{I(d)}]

where w_{I(i)} \in \{0,1\}^{D \times 1} is the indicator vector whose I(i)-th entry is 1 and all other entries are 0, I is a permutation of the set \{1, 2, \ldots, D\}, and I(i) is its i-th element;
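A short sketch of how such a column-indicator selection matrix could be built (a hypothetical helper, for illustration only):

```python
import numpy as np

def selection_matrix(idx, D):
    """W = [w_I(1), ..., w_I(d)]: column j is the 0/1 indicator vector
    of feature idx[j], so W.T @ x extracts exactly those d features."""
    W = np.zeros((D, len(idx)))
    W[np.asarray(idx), np.arange(len(idx))] = 1.0
    return W

W = selection_matrix([0, 3, 4], D=6)  # keep features 0, 3 and 4 out of 6
x = np.arange(6.0)
print(W.T @ x)                        # [0. 3. 4.]
```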
(3) Given a matrix A = [a_1, a_2, \ldots, a_n] \in R^{m \times n}, define its adaptive loss function as:

\mathrm{loss}_\sigma(A) = \sum_{i=1}^{n} \frac{(1+\sigma)\,\|a_i\|_2^2}{\|a_i\|_2 + \sigma}

where \sigma > 0 is the adaptive factor;
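Assuming the column-wise form given above, a small sketch of the loss and its two limiting behaviours (squared Frobenius norm as σ grows, L2,1 norm as σ shrinks):

```python
import numpy as np

def adaptive_loss(A, sigma):
    """sum_i (1+sigma)*||a_i||^2 / (||a_i|| + sigma) over columns a_i;
    tends to the squared Frobenius norm as sigma -> infinity and to
    the L2,1 norm as sigma -> 0."""
    t = np.linalg.norm(A, axis=0)
    return np.sum((1.0 + sigma) * t**2 / (t + sigma))

A = np.random.randn(4, 10)
print(adaptive_loss(A, 1e8), np.linalg.norm(A, 'fro')**2)       # ~ equal
print(adaptive_loss(A, 1e-8), np.linalg.norm(A, axis=0).sum())  # ~ equal
```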
(4) Set the number c of classes, and build the clustering model of the data set on the subspace on the basis of the K-means algorithm and step (3):

\min_{W, F, G} \ \mathrm{loss}_\sigma(X^\top W - F G^\top)

where F = [f_1, f_2, \ldots, f_n]^\top \in \{0,1\}^{n \times c} and G = [g_1, g_2, \ldots, g_c] \in R^{d \times c} denote the class label matrix and the cluster centre matrix to be solved, each row of F containing exactly one nonzero entry;
(5) To fully extract the discriminative information in the data, the scatter of the data set on its subspace should be as large as possible; the clustering model is therefore established as:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{loss}_\sigma(X^\top W - F G^\top)

where \lambda is a balance factor and W is a feature selection matrix of the form in step (2).

The optimal solution of the above model is equivalent to that of:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \sum_{i=1}^{n} \tau_i \, \| W^\top x_i - G f_i \|_2^2

where

\tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2}, \qquad u_i = W^\top x_i - G f_i.

Let \Delta = \mathrm{diag}(\tau_1, \tau_2, \ldots, \tau_n) be a diagonal matrix and U = [u_1, u_2, \ldots, u_n]^\top = X^\top W - F G^\top; the final objective function of the method can then be written as:

\max_{W, F, G, \Delta} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{Tr}(U^\top \Delta U)
(6) Solve the objective function.

Step 1: set the number d of selected features, the number c of classes, and the parameters \lambda and \sigma; initialize \Delta as the identity matrix, and randomly initialize W and G.
Step 2: with W, \Delta, and G fixed, optimize F. The objective function reduces to

\min_{F} \ \mathrm{Tr}\big( (X^\top W - F G^\top)^\top \Delta \, (X^\top W - F G^\top) \big)

In view of the discreteness of the matrix F to be solved, its value can be obtained by running K-means on the low-dimensional subspace, that is:

F_{ik} = 1 \ \text{if} \ k = \arg\min_{j} \| W^\top x_i - g_j \|_2^2, \quad F_{ik} = 0 \ \text{otherwise.}
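A sketch of this assignment step under the definitions above (illustrative only; note the sample weights τ_i are positive constants over the candidate centres, so they cancel in the per-sample argmin and the step is a plain nearest-centre assignment on the subspace):

```python
import numpy as np

def update_labels(X, W, G):
    """One K-means assignment step on the subspace W.T @ X:
    each projected sample joins its nearest cluster centre."""
    Y = W.T @ X                                              # d x n projected data
    d2 = ((Y[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)  # n x c squared distances
    F = np.zeros((X.shape[1], G.shape[1]))
    F[np.arange(X.shape[1]), d2.argmin(axis=1)] = 1.0        # one-hot label matrix
    return F
```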
Step 3: with F and \Delta fixed, optimize G and W. Taking the derivative of the objective with respect to G and setting it to zero gives G = W^\top X \Delta F (F^\top \Delta F)^{-1}; substituting G back into the objective yields:

\max_{W} \ \mathrm{Tr}\big( W^\top (S_t - \lambda S_w) W \big) = \max_{W} \ \mathrm{Tr}(W^\top M W)

where S_w = X(\Delta - \Delta F (F^\top \Delta F)^{-1} F^\top \Delta) X^\top is the weighted within-class scatter matrix and M = S_t - \lambda S_w.
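A sketch of this closed-form centre update and of M, under the definitions above (Δ is kept as a weight vector tau rather than an explicit diagonal matrix; names are illustrative):

```python
import numpy as np

def update_centres_and_M(X, W, F, tau, St, lam):
    """G = W^T X Delta F (F^T Delta F)^{-1}, then M = S_t - lam * S_w
    with S_w = X (Delta - Delta F (F^T Delta F)^{-1} F^T Delta) X^T."""
    DF = tau[:, None] * F                       # Delta @ F, without forming Delta
    P = DF @ np.linalg.inv(F.T @ DF)            # Delta F (F^T Delta F)^{-1}
    G = W.T @ X @ P                             # d x c cluster centres
    N = np.diag(tau) - P @ DF.T                 # n x n weighting matrix
    M = St - lam * (X @ N @ X.T)                # S_t - lam * S_w
    return G, M
```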
In view of the sparsity structure of W, the optimal W in the above formula is obtained from the d largest diagonal elements of the matrix M. Let N = \Delta - \Delta F (F^\top \Delta F)^{-1} F^\top \Delta; the i-th diagonal element M_{ii} of M can then be computed quickly as:

M_{ii} = \tilde{X}_{i:} \tilde{X}_{i:}^\top - \lambda \, X_{i:} N X_{i:}^\top

where X_{i:} and \tilde{X}_{i:} denote the vectors formed by the i-th rows of X and of the centred data matrix \tilde{X}, respectively.
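The same diagonal can be sketched without ever forming the D × D matrices, which is the point of the shortcut (illustrative numpy, assuming the row-wise formula above):

```python
import numpy as np

def diag_M(X, tau, F, lam):
    """M_ii = Xc[i] @ Xc[i] - lam * X[i] @ N @ X[i] per feature row i,
    avoiding the D x D matrices S_t and S_w entirely."""
    Xc = X - X.mean(axis=1, keepdims=True)                  # centred rows
    DF = tau[:, None] * F
    N = np.diag(tau) - DF @ np.linalg.inv(F.T @ DF) @ DF.T  # n x n
    st_diag = np.einsum('ij,ij->i', Xc, Xc)                 # diag of S_t
    sw_diag = np.einsum('ij,jk,ik->i', X, N, X)             # diag of S_w
    return st_diag - lam * sw_diag
```

The d features with the largest entries of this vector are then taken as the columns of W.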
Step 4: update \Delta using the F, W, and G computed in Steps 2 and 3; the i-th diagonal element of \Delta is updated as:

\tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2}, \qquad u_i = W^\top x_i - G f_i.
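A sketch of this reweighting step, assuming the update rule written above:

```python
import numpy as np

def update_tau(X, W, F, G, sigma):
    """tau_i = (1+sigma)(||u_i|| + 2*sigma) / (2*(||u_i|| + sigma)^2)
    with residual u_i = W^T x_i - G f_i; a small sigma shrinks the
    weight of samples far from their centre (outliers)."""
    U = X.T @ W - F @ G.T                    # n x d residual matrix
    t = np.linalg.norm(U, axis=1)            # ||u_i||_2 per sample
    return (1.0 + sigma) * (t + 2.0 * sigma) / (2.0 * (t + sigma) ** 2)
```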
Step 5: repeat Steps 2-4 until the termination condition is met, then output the feature selection matrix W, the cluster centre matrix G, and the class label matrix F. The optimal feature subset is determined by the indices of the d nonzero entries of the vector \sum_{i=1}^{d} W_{:i}, where W_{:i} denotes the column vector formed by the i-th column of W.
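Putting the pieces together, a skeleton of the whole alternation might look as follows. It reuses the helper sketches above and is an illustration under the same assumptions, not the patent's reference implementation:

```python
import numpy as np

def adaptive_fast_kmeans(X, d, c, lam, sigma, n_iter=30, seed=0):
    """Alternate the F, (W, G) and Delta updates of Steps 2-4."""
    rng = np.random.default_rng(seed)
    D, n = X.shape
    St = total_scatter(X)                                    # from the earlier sketch
    tau = np.ones(n)                                         # Delta initialized as identity
    W = selection_matrix(rng.choice(D, size=d, replace=False), D)
    G = rng.standard_normal((d, c))
    for _ in range(n_iter):
        F = update_labels(X, W, G)                           # Step 2: labels
        idx = np.sort(np.argsort(diag_M(X, tau, F, lam))[-d:])
        W = selection_matrix(idx, D)                         # Step 3: top-d features
        DF = tau[:, None] * F
        G = W.T @ X @ DF @ np.linalg.inv(F.T @ DF)           # Step 3: centres
        tau = update_tau(X, W, F, G, sigma)                  # Step 4: weights
    return W, G, F
```

A practical version would also monitor the change in objective value for the termination test and guard against empty clusters; both are omitted here for brevity.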
With the above scheme, the specific steps of the present invention are: first preprocess the data, removing records with missing attributes and duplicates and normalizing each data attribute; compute the total scatter matrix of the data and construct a feature selection matrix with a sparsity structure; run the K-means clustering method on the feature subspace, introducing an adaptive factor that dynamically adjusts the weight of each data sample during cluster-centre updates; and update the feature selection matrix according to the discriminative information between clusters, thereby selecting an optimal feature subset. The proposed method integrates feature selection and adaptive learning into the traditional K-means clustering model, can effectively handle data sets with different types of distributions, solves for the feature selection matrix without any costly eigendecomposition, and achieves clustering performance superior to traditional clustering methods.
Description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is a schematic comparison of the clustering behaviour of an embodiment of the present invention and of traditional K-means;
Fig. 3 compares the growth of running time with the number of features for the method provided by an embodiment of the present invention and for traditional K-means.
Specific embodiments

The technical solution and beneficial effects of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention provides an adaptive fast K-means clustering method incorporating feature learning. This embodiment clusters unlabeled data by combining feature selection with the K-means clustering method, and comprises the following steps:
Step 1: preprocess the data, removing records with missing attributes and duplicate records, and normalize each data attribute, thereby obtaining n unlabeled samples with D features each, X = [x_1, x_2, \ldots, x_n] \in R^{D \times n}, where x_i \in R^{D \times 1} denotes the i-th data sample, i = 1, 2, \ldots, n;
Step 2: compute the total scatter matrix of the data, S_t = \sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top, where \mu = \frac{1}{n}\sum_{i=1}^{n} x_i;
Step 3: set the number d of selected features, the number c of classes, and the parameters \lambda and \sigma; initialize the weight matrix \Delta as the identity matrix, and randomly initialize the feature selection matrix W and the cluster centre matrix G. The values of d and c can be determined by the number of features that need to be retained in the practical application; \lambda can be chosen from \{\epsilon, 1, 2, \infty\}, where \epsilon denotes an arbitrary positive number close to 0 and \infty a sufficiently large positive number, and \sigma can be selected by grid search;
Step 4: solve the class label matrix by running K-means on the subspace W^\top X. Since K-means clusters only on a subset of the features, this consumes far fewer computing resources than traditional K-means;
Step 5: exploit the sparsity structure of the feature selection matrix W to rapidly solve for W and G;
Step 6: update the weight of each sample, i.e. the weight matrix \Delta, according to the F, W, G, and \sigma computed in Steps 4 and 5. Fig. 2 illustrates the basic cluster-centre update process of this embodiment on a simulated data set. Fig. 2(a) shows the ground truth of the clustering result: the data contain two classes, class 1 (samples drawn as hollow circles) and class 2 (samples drawn as hollow triangles), and class 1 contains two outliers (points far from the other data points). Fig. 2(b) shows the cluster-centre update process of traditional K-means, where the weight of a sample is indicated by the thickness of the arrowed line: the thicker the line, the larger the weight. As can be seen, traditional K-means assigns every sample an equal weight, so the centre of class 1 is disturbed by the outliers and its update (indicated by the solid shape) drifts, which in turn causes classification errors. Fig. 2(c) shows the clustering result of this embodiment: because this embodiment introduces the weight regulating factor \sigma, when \sigma is set small, larger weights are assigned to points closer to the cluster centre during each optimization iteration, so the centre update is more stable and a better clustering result is obtained.
Step 7: repeat Steps 4-6 until the objective values of two successive iterations are sufficiently close;
Step 8: output the class label matrix F, the cluster centre matrix G, and the feature selection matrix W. The optimal feature subset is determined by the indices of the d nonzero entries of the vector \sum_{i=1}^{d} W_{:i}, where W_{:i} denotes the column vector formed by the i-th column of W.
To examine the validity of the method provided by this embodiment, verification analysis was carried out on the public data sets Yale, WebKB, and TDT2. Yale is a face recognition database containing 165 samples from 15 different faces; each sample consists of 32 × 32 grayscale pixels, i.e. 1024 pixel features, and in this embodiment its feature search range is {100, 200, ..., 900, 1000}. WebKB and TDT2 are text and document databases, respectively: the former contains 814 samples and 4029 features, the latter 653 samples and 36771 features, with feature search ranges of {50, 100, ..., 250, 300} and {10, 50, ..., 410, 450} in this embodiment. All data set attributes are normalized to the range [-1, 1]. Three mainstream methods are compared in this embodiment: K-means, Trace Ratio Formulation and K-means Clustering (TRACK), and Discriminative Embedded Clustering (DEC). The evaluation criterion of the clustering results is clustering accuracy (ACC). The results are summarized in the following tables:
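For reference, clustering accuracy (ACC) is conventionally computed by finding the best one-to-one mapping between predicted clusters and true classes. A minimal sketch using scipy's Hungarian solver is given below; this helper is illustrative and not part of the patent (y_true and y_pred are assumed to be integer numpy arrays):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC = fraction of samples whose cluster maps to their true class
    under the best permutation of cluster labels (Hungarian algorithm)."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                          # co-occurrence counts
    row, col = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[row, col].sum() / len(y_true)
```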
Table 1. Clustering accuracy (ACC) of the algorithms on different data sets (± standard deviation)
Table 2. Running time of the algorithms on different data sets, in seconds (± standard deviation)
Table 1 gives the clustering accuracy of the algorithms on the different data sets, where "--" indicates that the algorithm ran out of memory and no clustering result could be obtained. The results show that the method provided by the invention holds a clear advantage in clustering accuracy over the other methods, demonstrating its validity. Table 2 gives the running time of the algorithms on the different data sets, with the number of selected features set to c - 1. The method provided by the invention is only slightly slower than K-means on the Yale data set and consumes the least computation time on the other data sets. Fig. 3 shows how the running time of the proposed method and of K-means grows with the number of features; the features in the figure are randomly sampled from TDT2. As the number of features increases, the invention shows an ever greater speed advantage over K-means.
The above examples only illustrate the technical idea of the present invention and do not limit its scope of protection; any change made on the basis of the technical scheme according to the technical idea provided by the invention falls within the scope of protection of the present invention.

Claims (6)

1. An adaptive fast K-means clustering method incorporating feature learning, characterized by comprising the following steps:
Step 1: preprocess the data and normalize each data attribute, thereby obtaining n unlabeled samples with D features each, X = [x_1, x_2, \ldots, x_n] \in R^{D \times n}, where x_i \in R^{D \times 1} denotes the i-th data sample, i = 1, 2, \ldots, n; compute the total scatter matrix S_t of the data;
Step 2: set the number d of features to be selected, the number c of classes, the balance factor \lambda, and the adaptive factor \sigma; initialize the weight matrix \Delta as the identity matrix, and randomly initialize the feature selection matrix W and the cluster centre matrix G, where G = [g_1, g_2, \ldots, g_c] \in R^{d \times c} and g_k denotes the k-th cluster centre vector, k = 1, 2, \ldots, c;
Step 3: establish the clustering model:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{loss}_\sigma(X^\top W - F G^\top)

where F = [f_1, f_2, \ldots, f_n]^\top \in \{0,1\}^{n \times c} is the class label matrix and f_i denotes the class label of the i-th data sample, i = 1, 2, \ldots, n;

The optimal solution of the above model is equivalent to that of:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \sum_{i=1}^{n} \tau_i \, \|W^\top x_i - G f_i\|_2^2

where \tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2} and u_i = W^\top x_i - G f_i;

Let \Delta = \mathrm{diag}(\tau_1, \tau_2, \ldots, \tau_n) be a diagonal matrix and U = [u_1, u_2, \ldots, u_n]^\top = X^\top W - F G^\top; the final objective function is then:

\max_{W, F, G, \Delta} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{Tr}(U^\top \Delta U);
Step 4: solve the above objective function until the termination condition is met, and output the feature selection matrix W, the cluster centre matrix G, and the class label matrix F.
2. The method according to claim 1, characterized in that in Step 1 the preprocessing of the data includes resolving missing attributes and duplicate records in the data.
3. The method according to claim 1, characterized in that in Step 1 the total scatter matrix of the data is computed as S_t = \sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top, where \mu = \frac{1}{n}\sum_{i=1}^{n} x_i.
4. The method according to claim 1, characterized in that in Step 2 the balance factor \lambda is chosen from \{\epsilon, 1, 2, \infty\}, where \epsilon denotes an arbitrary positive number close to 0 and \infty a sufficiently large positive number, and the adaptive factor \sigma is chosen by grid search.
5. The method according to claim 1, characterized in that in Step 2 the feature selection matrix W has the form:

W = [w_{I(1)}, w_{I(2)}, \ldots, w_{I(d)}]

where w_{I(i)} \in \{0,1\}^{D \times 1} is the indicator vector whose I(i)-th entry is 1 and all other entries are 0, I is a permutation of the set \{1, 2, \ldots, D\}, and I(i) is its i-th element.
6. The method according to claim 1, characterized in that the particular content of Step 4 is:

Step 41: with W and \Delta fixed, optimize F; the objective function reduces to \min_F \mathrm{Tr}\big((X^\top W - F G^\top)^\top \Delta (X^\top W - F G^\top)\big), which is solved by running K-means on the low-dimensional subspace, that is:

F_{ik} = 1 \ \text{if} \ k = \arg\min_j \|W^\top x_i - g_j\|_2^2, \quad F_{ik} = 0 \ \text{otherwise};

Step 42: with F and \Delta fixed, optimize G and W; taking the derivative of the objective with respect to G and setting it to 0 gives G = W^\top X \Delta F (F^\top \Delta F)^{-1}; substituting G into the objective yields:

\max_W \ \mathrm{Tr}(W^\top M W)

where S_w = X(\Delta - \Delta F(F^\top \Delta F)^{-1} F^\top \Delta)X^\top is the weighted within-class scatter matrix and M = S_t - \lambda S_w;

the optimal W is obtained from the d largest diagonal elements of the matrix M; letting N = \Delta - \Delta F(F^\top \Delta F)^{-1} F^\top \Delta, the i-th diagonal element M_{ii} of M is computed quickly as:

M_{ii} = \tilde{X}_{i:}\tilde{X}_{i:}^\top - \lambda \, X_{i:} N X_{i:}^\top

where X_{i:} and \tilde{X}_{i:} denote the vectors formed by the i-th rows of X and of the centred data matrix \tilde{X}, respectively;

Step 43: update \Delta using the F, W, and G computed in Steps 41 and 42; the i-th diagonal element of \Delta is updated as \tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2} with u_i = W^\top x_i - G f_i;

Step 44: repeat Steps 41-43 until the termination condition is met, and output the feature selection matrix W, the cluster centre matrix G, and the class label matrix F.
CN201910209441.5A 2019-03-19 2019-03-19 An adaptive fast K-means clustering method incorporating feature learning Pending CN109978042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910209441.5A CN109978042A (en) 2019-03-19 2019-03-19 An adaptive fast K-means clustering method incorporating feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910209441.5A CN109978042A (en) 2019-03-19 2019-03-19 An adaptive fast K-means clustering method incorporating feature learning

Publications (1)

Publication Number Publication Date
CN109978042A true CN109978042A (en) 2019-07-05

Family

ID=67079551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910209441.5A Pending CN109978042A (en) An adaptive fast K-means clustering method incorporating feature learning

Country Status (1)

Country Link
CN (1) CN109978042A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670418A (en) * 2018-12-04 2019-04-23 厦门理工学院 In conjunction with the unsupervised object identification method of multi-source feature learning and group sparse constraint
CN109670418B (en) * 2018-12-04 2021-10-15 厦门理工学院 Unsupervised object identification method combining multi-source feature learning and group sparsity constraint
CN111160298A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Robot and pose estimation method and device thereof
CN111160298B (en) * 2019-12-31 2023-12-01 深圳市优必选科技股份有限公司 Robot and pose estimation method and device thereof
CN111611293A (en) * 2020-04-24 2020-09-01 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111611293B (en) * 2020-04-24 2023-09-29 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111578154A (en) * 2020-05-25 2020-08-25 吉林大学 LSDR-JMI-based water supply network multi-leakage pressure sensor optimal arrangement method
CN111578154B (en) * 2020-05-25 2021-03-26 吉林大学 LSDR-JMI-based water supply network multi-leakage pressure sensor optimal arrangement method
CN111898630A (en) * 2020-06-06 2020-11-06 东南大学 Characteristic method for noisy marked sample


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190705