CN109978042A - An adaptive fast K-means clustering method incorporating feature learning - Google Patents

An adaptive fast K-means clustering method incorporating feature learning

Info

Publication number
CN109978042A
CN109978042A
Authority
CN
China
Prior art keywords
matrix
data
feature
cluster
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910209441.5A
Other languages
Chinese (zh)
Inventor
王晓栋
严菲
曾志强
陈玉明
洪朝群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University of Technology
Original Assignee
Xiamen University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University of Technology filed Critical Xiamen University of Technology
Priority to CN201910209441.5A priority Critical patent/CN109978042A/en
Publication of CN109978042A publication Critical patent/CN109978042A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/10 - Pre-processing; Data cleansing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G06F18/21345 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis enforcing sparsity or involving a domain transformation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The present invention discloses an adaptive fast K-means clustering method incorporating feature learning. The data are first preprocessed to resolve missing attributes and duplicate records, and each data attribute is normalized. The total scatter matrix of the data is then computed, and a feature selection matrix with a sparsity structure is constructed. K-means clustering is performed on the feature subspace, and during the cluster-centre update an adaptive factor dynamically adjusts the weight of each data sample. The feature selection matrix is updated according to the discriminative information between clusters, thereby selecting an optimal feature subset. The method enables traditional K-means clustering to improve its accuracy by efficiently exploiting the discriminative information between and within clusters together with the correlations among features. It also incorporates an adaptive factor into the clustering process, so that cluster centres are updated according to the distribution characteristics of different types of data. The method offers high practicality and scalability and can effectively support applications in machine learning, computer vision, and related fields.

Description

An adaptive fast K-means clustering method incorporating feature learning
Technical field
The invention belongs to the field of machine learning, and in particular relates to an adaptive fast K-means clustering method that incorporates feature learning.
Background technique
Clustering is a technique very widely used in machine learning; among clustering methods, K-means is applied the most extensively and has achieved good results in fields such as data mining, healthcare, and education. However, with the rapid development of multimedia and Internet technologies, high-dimensional data are growing explosively, posing a huge challenge to the traditional K-means method. Because high-dimensional data often contain redundant and noisy features, applying K-means directly to such data not only consumes substantial computing resources but also degrades clustering accuracy. Recent studies show that if the dimensionality of the data features is reduced beforehand, the clustering efficiency of K-means improves markedly.
In recent years, some studies have combined dimensionality-reduction methods (such as Linear Discriminant Analysis (LDA) and the Orthogonal Centroid Method (OCM)) with K-means: the former provides an optimal subspace for the latter, while the clustering result of the latter on the subspace serves as label information for the former. Although such methods can effectively improve the clustering accuracy of K-means, they all rely on eigendecomposition to solve for the optimal feature subspace, whose computational cost grows quadratically with the feature dimension of the data, and the resulting feature space differs greatly from the original one. They are therefore ill-suited to processing high-dimensional data in real application scenarios.
Summary of the invention
The purpose of the present invention is to provide an adaptive fast K-means clustering method incorporating feature learning, so that traditional K-means clustering can improve its accuracy by efficiently exploiting the discriminative information between and within clusters together with the correlations among features. The method also incorporates an adaptive factor into the clustering process, allowing cluster centres to be updated according to the distribution characteristics of different types of data. It offers high practicality and scalability and can effectively support applications in machine learning, computer vision, and related fields.
To achieve the above objectives, the solution adopted by the invention is as follows.

An adaptive fast K-means clustering method incorporating feature learning comprises the following steps:
(1) Preprocess the data: resolve missing attributes and duplicate records, and normalize each data attribute, thereby obtaining n unlabeled samples with D features each, X = [x_1, x_2, \ldots, x_n] \in R^{D \times n}, where x_i \in R^{D \times 1} denotes the i-th data sample. Compute the total scatter matrix of the data:

S_t = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i
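As an illustration only, the following minimal numpy sketch computes the total scatter matrix of such a preprocessed data matrix (the function and variable names are the sketch's own, not the patent's):

```python
import numpy as np

def total_scatter(X):
    """Total scatter matrix S_t = sum_i (x_i - mu)(x_i - mu)^T
    for a D x n data matrix X whose columns are samples."""
    mu = X.mean(axis=1, keepdims=True)   # D x 1 sample mean
    Xc = X - mu                          # centred data
    return Xc @ Xc.T                     # D x D scatter matrix

# toy usage: n = 5 samples with D = 3 normalized features
X = np.random.rand(3, 5)
St = total_scatter(X)
print(St.shape)  # (3, 3)
```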
(2) Set the number d of features to be selected, and construct the feature selection matrix W:

W = [w_{I(1)}, w_{I(2)}, \ldots, w_{I(d)}]

where w_{I(i)} \in \{0,1\}^{D \times 1} is the indicator vector whose I(i)-th entry is 1 and all other entries are 0, I is a permutation of the set \{1, 2, \ldots, D\}, and I(i) is its i-th element;
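A short sketch of how such a column-indicator selection matrix could be built (a hypothetical helper, for illustration only):

```python
import numpy as np

def selection_matrix(idx, D):
    """W = [w_I(1), ..., w_I(d)]: column j is the 0/1 indicator vector
    of feature idx[j], so W.T @ x extracts exactly those d features."""
    W = np.zeros((D, len(idx)))
    W[np.asarray(idx), np.arange(len(idx))] = 1.0
    return W

W = selection_matrix([0, 3, 4], D=6)  # keep features 0, 3 and 4 out of 6
x = np.arange(6.0)
print(W.T @ x)                        # [0. 3. 4.]
```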
(3) Given a matrix A = [a_1, a_2, \ldots, a_n] \in R^{m \times n}, define its adaptive loss function as:

\mathrm{loss}_\sigma(A) = \sum_{i=1}^{n} \frac{(1+\sigma)\,\|a_i\|_2^2}{\|a_i\|_2 + \sigma}

where \sigma > 0 is the adaptive factor;
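Assuming the column-wise form given above, a small sketch of the loss and its two limiting behaviours (squared Frobenius norm as σ grows, L2,1 norm as σ shrinks):

```python
import numpy as np

def adaptive_loss(A, sigma):
    """sum_i (1+sigma)*||a_i||^2 / (||a_i|| + sigma) over columns a_i;
    tends to the squared Frobenius norm as sigma -> infinity and to
    the L2,1 norm as sigma -> 0."""
    t = np.linalg.norm(A, axis=0)
    return np.sum((1.0 + sigma) * t**2 / (t + sigma))

A = np.random.randn(4, 10)
print(adaptive_loss(A, 1e8), np.linalg.norm(A, 'fro')**2)       # ~ equal
print(adaptive_loss(A, 1e-8), np.linalg.norm(A, axis=0).sum())  # ~ equal
```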
(4) Set the number c of classes, and build the clustering model of the data set on the subspace on the basis of the K-means algorithm and step (3):

\min_{W, F, G} \ \mathrm{loss}_\sigma(X^\top W - F G^\top)

where F = [f_1, f_2, \ldots, f_n]^\top \in \{0,1\}^{n \times c} and G = [g_1, g_2, \ldots, g_c] \in R^{d \times c} denote the class label matrix and the cluster centre matrix to be solved, each row of F containing exactly one nonzero entry;
(5) To fully extract the discriminative information in the data, the scatter of the data set on its subspace should be as large as possible; the clustering model is therefore established as:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{loss}_\sigma(X^\top W - F G^\top)

where \lambda is a balance factor and W is a feature selection matrix of the form in step (2).

The optimal solution of the above model is equivalent to that of:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \sum_{i=1}^{n} \tau_i \, \| W^\top x_i - G f_i \|_2^2

where

\tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2}, \qquad u_i = W^\top x_i - G f_i.

Let \Delta = \mathrm{diag}(\tau_1, \tau_2, \ldots, \tau_n) be a diagonal matrix and U = [u_1, u_2, \ldots, u_n]^\top = X^\top W - F G^\top; the final objective function of the method can then be written as:

\max_{W, F, G, \Delta} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{Tr}(U^\top \Delta U)
(6) Solve the objective function.

Step 1: set the number d of selected features, the number c of classes, and the parameters \lambda and \sigma; initialize \Delta as the identity matrix, and randomly initialize W and G.
Step 2: with W, \Delta, and G fixed, optimize F. The objective function reduces to

\min_{F} \ \mathrm{Tr}\big( (X^\top W - F G^\top)^\top \Delta \, (X^\top W - F G^\top) \big)

In view of the discreteness of the matrix F to be solved, its value can be obtained by running K-means on the low-dimensional subspace, that is:

F_{ik} = 1 \ \text{if} \ k = \arg\min_{j} \| W^\top x_i - g_j \|_2^2, \quad F_{ik} = 0 \ \text{otherwise.}
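A sketch of this assignment step under the definitions above (illustrative only; note the sample weights τ_i are positive constants over the candidate centres, so they cancel in the per-sample argmin and the step is a plain nearest-centre assignment on the subspace):

```python
import numpy as np

def update_labels(X, W, G):
    """One K-means assignment step on the subspace W.T @ X:
    each projected sample joins its nearest cluster centre."""
    Y = W.T @ X                                              # d x n projected data
    d2 = ((Y[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)  # n x c squared distances
    F = np.zeros((X.shape[1], G.shape[1]))
    F[np.arange(X.shape[1]), d2.argmin(axis=1)] = 1.0        # one-hot label matrix
    return F
```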
Step 3: with F and \Delta fixed, optimize G and W. Taking the derivative of the objective with respect to G and setting it to zero gives G = W^\top X \Delta F (F^\top \Delta F)^{-1}; substituting G back into the objective yields:

\max_{W} \ \mathrm{Tr}\big( W^\top (S_t - \lambda S_w) W \big) = \max_{W} \ \mathrm{Tr}(W^\top M W)

where S_w = X(\Delta - \Delta F (F^\top \Delta F)^{-1} F^\top \Delta) X^\top is the weighted within-class scatter matrix and M = S_t - \lambda S_w.
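A sketch of this closed-form centre update and of M, under the definitions above (Δ is kept as a weight vector tau rather than an explicit diagonal matrix; names are illustrative):

```python
import numpy as np

def update_centres_and_M(X, W, F, tau, St, lam):
    """G = W^T X Delta F (F^T Delta F)^{-1}, then M = S_t - lam * S_w
    with S_w = X (Delta - Delta F (F^T Delta F)^{-1} F^T Delta) X^T."""
    DF = tau[:, None] * F                       # Delta @ F, without forming Delta
    P = DF @ np.linalg.inv(F.T @ DF)            # Delta F (F^T Delta F)^{-1}
    G = W.T @ X @ P                             # d x c cluster centres
    N = np.diag(tau) - P @ DF.T                 # n x n weighting matrix
    M = St - lam * (X @ N @ X.T)                # S_t - lam * S_w
    return G, M
```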
In view of the sparsity structure of W, the optimal W in the above formula is obtained from the d largest diagonal elements of the matrix M. Let N = \Delta - \Delta F (F^\top \Delta F)^{-1} F^\top \Delta; the i-th diagonal element M_{ii} of M can then be computed quickly as:

M_{ii} = \tilde{X}_{i:} \tilde{X}_{i:}^\top - \lambda \, X_{i:} N X_{i:}^\top

where X_{i:} and \tilde{X}_{i:} denote the vectors formed by the i-th rows of X and of the centred data matrix \tilde{X}, respectively.
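The same diagonal can be sketched without ever forming the D × D matrices, which is the point of the shortcut (illustrative numpy, assuming the row-wise formula above):

```python
import numpy as np

def diag_M(X, tau, F, lam):
    """M_ii = Xc[i] @ Xc[i] - lam * X[i] @ N @ X[i] per feature row i,
    avoiding the D x D matrices S_t and S_w entirely."""
    Xc = X - X.mean(axis=1, keepdims=True)                  # centred rows
    DF = tau[:, None] * F
    N = np.diag(tau) - DF @ np.linalg.inv(F.T @ DF) @ DF.T  # n x n
    st_diag = np.einsum('ij,ij->i', Xc, Xc)                 # diag of S_t
    sw_diag = np.einsum('ij,jk,ik->i', X, N, X)             # diag of S_w
    return st_diag - lam * sw_diag
```

The d features with the largest entries of this vector are then taken as the columns of W.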
Step 4: update \Delta using the F, W, and G computed in Steps 2 and 3; the i-th diagonal element of \Delta is updated as:

\tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2}, \qquad u_i = W^\top x_i - G f_i.
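A sketch of this reweighting step, assuming the update rule written above:

```python
import numpy as np

def update_tau(X, W, F, G, sigma):
    """tau_i = (1+sigma)(||u_i|| + 2*sigma) / (2*(||u_i|| + sigma)^2)
    with residual u_i = W^T x_i - G f_i; a small sigma shrinks the
    weight of samples far from their centre (outliers)."""
    U = X.T @ W - F @ G.T                    # n x d residual matrix
    t = np.linalg.norm(U, axis=1)            # ||u_i||_2 per sample
    return (1.0 + sigma) * (t + 2.0 * sigma) / (2.0 * (t + sigma) ** 2)
```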
Step 5: repeat Steps 2-4 until the termination condition is met, then output the feature selection matrix W, the cluster centre matrix G, and the class label matrix F. The optimal feature subset is determined by the indices of the d nonzero entries of the vector \sum_{i=1}^{d} W_{:i}, where W_{:i} denotes the column vector formed by the i-th column of W.
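Putting the pieces together, a skeleton of the whole alternation might look as follows. It reuses the helper sketches above and is an illustration under the same assumptions, not the patent's reference implementation:

```python
import numpy as np

def adaptive_fast_kmeans(X, d, c, lam, sigma, n_iter=30, seed=0):
    """Alternate the F, (W, G) and Delta updates of Steps 2-4."""
    rng = np.random.default_rng(seed)
    D, n = X.shape
    St = total_scatter(X)                                    # from the earlier sketch
    tau = np.ones(n)                                         # Delta initialized as identity
    W = selection_matrix(rng.choice(D, size=d, replace=False), D)
    G = rng.standard_normal((d, c))
    for _ in range(n_iter):
        F = update_labels(X, W, G)                           # Step 2: labels
        idx = np.sort(np.argsort(diag_M(X, tau, F, lam))[-d:])
        W = selection_matrix(idx, D)                         # Step 3: top-d features
        DF = tau[:, None] * F
        G = W.T @ X @ DF @ np.linalg.inv(F.T @ DF)           # Step 3: centres
        tau = update_tau(X, W, F, G, sigma)                  # Step 4: weights
    return W, G, F
```

A practical version would also monitor the change in objective value for the termination test and guard against empty clusters; both are omitted here for brevity.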
With the above scheme, the specific steps of the present invention are: first preprocess the data, removing records with missing attributes and duplicates and normalizing each data attribute; compute the total scatter matrix of the data and construct a feature selection matrix with a sparsity structure; run the K-means clustering method on the feature subspace, introducing an adaptive factor that dynamically adjusts the weight of each data sample during cluster-centre updates; and update the feature selection matrix according to the discriminative information between clusters, thereby selecting an optimal feature subset. The proposed method integrates feature selection and adaptive learning into the traditional K-means clustering model, can effectively handle data sets with different types of distributions, solves for the feature selection matrix without any costly eigendecomposition, and achieves clustering performance superior to traditional clustering methods.
Description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is a schematic comparison of the clustering behaviour of an embodiment of the present invention and of traditional K-means;
Fig. 3 compares the growth of running time with the number of features for the method provided by an embodiment of the present invention and for traditional K-means.
Specific embodiments

The technical solution and beneficial effects of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention provides an adaptive fast K-means clustering method incorporating feature learning. This embodiment clusters unlabeled data by combining feature selection with the K-means clustering method, and comprises the following steps:
Step 1: preprocess the data, removing records with missing attributes and duplicate records, and normalize each data attribute, thereby obtaining n unlabeled samples with D features each, X = [x_1, x_2, \ldots, x_n] \in R^{D \times n}, where x_i \in R^{D \times 1} denotes the i-th data sample, i = 1, 2, \ldots, n;
Step 2: compute the total scatter matrix of the data, S_t = \sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top, where \mu = \frac{1}{n}\sum_{i=1}^{n} x_i;
Step 3: set the number d of selected features, the number c of classes, and the parameters \lambda and \sigma; initialize the weight matrix \Delta as the identity matrix, and randomly initialize the feature selection matrix W and the cluster centre matrix G. The values of d and c can be determined by the number of features that need to be retained in the practical application; \lambda can be chosen from \{\epsilon, 1, 2, \infty\}, where \epsilon denotes an arbitrary positive number close to 0 and \infty a sufficiently large positive number, and \sigma can be selected by grid search;
Step 4: solve the class label matrix by running K-means on the subspace W^\top X. Since K-means clusters only on a subset of the features, this consumes far fewer computing resources than traditional K-means;
Step 5: exploit the sparsity structure of the feature selection matrix W to rapidly solve for W and G;
Step 6: update the weight of each sample, i.e. the weight matrix \Delta, according to the F, W, G, and \sigma computed in Steps 4 and 5. Fig. 2 illustrates the basic cluster-centre update process of this embodiment on a simulated data set. Fig. 2(a) shows the ground truth of the clustering result: the data contain two classes, class 1 (samples drawn as hollow circles) and class 2 (samples drawn as hollow triangles), and class 1 contains two outliers (points far from the other data points). Fig. 2(b) shows the cluster-centre update process of traditional K-means, where the weight of a sample is indicated by the thickness of the arrowed line: the thicker the line, the larger the weight. As can be seen, traditional K-means assigns every sample an equal weight, so the centre of class 1 is disturbed by the outliers and its update (indicated by the solid shape) drifts, which in turn causes classification errors. Fig. 2(c) shows the clustering result of this embodiment: because this embodiment introduces the weight regulating factor \sigma, when \sigma is set small, larger weights are assigned to points closer to the cluster centre during each optimization iteration, so the centre update is more stable and a better clustering result is obtained.
Step 7: repeat Steps 4-6 until the objective values of two successive iterations are sufficiently close;
Step 8: output the class label matrix F, the cluster centre matrix G, and the feature selection matrix W. The optimal feature subset is determined by the indices of the d nonzero entries of the vector \sum_{i=1}^{d} W_{:i}, where W_{:i} denotes the column vector formed by the i-th column of W.
To examine the validity of the method provided by this embodiment, verification analysis was carried out on the public data sets Yale, WebKB, and TDT2. Yale is a face recognition database containing 165 samples from 15 different faces; each sample consists of 32 × 32 grayscale pixels, i.e. 1024 pixel features, and in this embodiment its feature search range is {100, 200, ..., 900, 1000}. WebKB and TDT2 are text and document databases, respectively: the former contains 814 samples and 4029 features, the latter 653 samples and 36771 features, with feature search ranges of {50, 100, ..., 250, 300} and {10, 50, ..., 410, 450} in this embodiment. All data set attributes are normalized to the range [-1, 1]. Three mainstream methods are compared in this embodiment: K-means, Trace Ratio Formulation and K-means Clustering (TRACK), and Discriminative Embedded Clustering (DEC). The evaluation criterion of the clustering results is clustering accuracy (ACC). The results are summarized in the following tables:
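For reference, clustering accuracy (ACC) is conventionally computed by finding the best one-to-one mapping between predicted clusters and true classes. A minimal sketch using scipy's Hungarian solver is given below; this helper is illustrative and not part of the patent (y_true and y_pred are assumed to be integer numpy arrays):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC = fraction of samples whose cluster maps to their true class
    under the best permutation of cluster labels (Hungarian algorithm)."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                          # co-occurrence counts
    row, col = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[row, col].sum() / len(y_true)
```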
Table 1. Clustering accuracy (ACC) of the algorithms on different data sets (± standard deviation)
Table 2. Running time of the algorithms on different data sets, in seconds (± standard deviation)
Table 1 gives the clustering accuracy of the algorithms on the different data sets, where "--" indicates that the algorithm ran out of memory and no clustering result could be obtained. The results show that the method provided by the invention holds a clear advantage in clustering accuracy over the other methods, demonstrating its validity. Table 2 gives the running time of the algorithms on the different data sets, with the number of selected features set to c - 1. The method provided by the invention is only slightly slower than K-means on the Yale data set and consumes the least computation time on the other data sets. Fig. 3 shows how the running time of the proposed method and of K-means grows with the number of features; the features in the figure are randomly sampled from TDT2. As the number of features increases, the invention shows an ever greater speed advantage over K-means.
The above examples only illustrate the technical idea of the present invention and do not limit its scope of protection; any change made on the basis of the technical scheme according to the technical idea provided by the invention falls within the scope of protection of the present invention.

Claims (6)

1. An adaptive fast K-means clustering method incorporating feature learning, characterized by comprising the following steps:
Step 1: preprocess the data and normalize each data attribute, thereby obtaining n unlabeled samples with D features each, X = [x_1, x_2, \ldots, x_n] \in R^{D \times n}, where x_i \in R^{D \times 1} denotes the i-th data sample, i = 1, 2, \ldots, n; compute the total scatter matrix S_t of the data;
Step 2: set the number d of features to be selected, the number c of classes, the balance factor \lambda, and the adaptive factor \sigma; initialize the weight matrix \Delta as the identity matrix, and randomly initialize the feature selection matrix W and the cluster centre matrix G, where G = [g_1, g_2, \ldots, g_c] \in R^{d \times c} and g_k denotes the k-th cluster centre vector, k = 1, 2, \ldots, c;
Step 3: establish the clustering model:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{loss}_\sigma(X^\top W - F G^\top)

where F = [f_1, f_2, \ldots, f_n]^\top \in \{0,1\}^{n \times c} is the class label matrix and f_i denotes the class label of the i-th data sample, i = 1, 2, \ldots, n;

The optimal solution of the above model is equivalent to that of:

\max_{W, F, G} \ \mathrm{Tr}(W^\top S_t W) - \lambda \sum_{i=1}^{n} \tau_i \, \|W^\top x_i - G f_i\|_2^2

where \tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2} and u_i = W^\top x_i - G f_i;

Let \Delta = \mathrm{diag}(\tau_1, \tau_2, \ldots, \tau_n) be a diagonal matrix and U = [u_1, u_2, \ldots, u_n]^\top = X^\top W - F G^\top; the final objective function is then:

\max_{W, F, G, \Delta} \ \mathrm{Tr}(W^\top S_t W) - \lambda \, \mathrm{Tr}(U^\top \Delta U);
Step 4: solve the above objective function until the termination condition is met, and output the feature selection matrix W, the cluster centre matrix G, and the class label matrix F.
2. The method according to claim 1, characterized in that in Step 1 the preprocessing of the data includes resolving missing attributes and duplicate records in the data.
3. The method according to claim 1, characterized in that in Step 1 the total scatter matrix of the data is computed as S_t = \sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^\top, where \mu = \frac{1}{n}\sum_{i=1}^{n} x_i.
4. The method according to claim 1, characterized in that in Step 2 the balance factor \lambda is chosen from \{\epsilon, 1, 2, \infty\}, where \epsilon denotes an arbitrary positive number close to 0 and \infty a sufficiently large positive number, and the adaptive factor \sigma is chosen by grid search.
5. The method according to claim 1, characterized in that in Step 2 the feature selection matrix W has the form:

W = [w_{I(1)}, w_{I(2)}, \ldots, w_{I(d)}]

where w_{I(i)} \in \{0,1\}^{D \times 1} is the indicator vector whose I(i)-th entry is 1 and all other entries are 0, I is a permutation of the set \{1, 2, \ldots, D\}, and I(i) is its i-th element.
6. The method according to claim 1, characterized in that the particular content of Step 4 is:

Step 41: with W and \Delta fixed, optimize F; the objective function reduces to \min_F \mathrm{Tr}\big((X^\top W - F G^\top)^\top \Delta (X^\top W - F G^\top)\big), which is solved by running K-means on the low-dimensional subspace, that is:

F_{ik} = 1 \ \text{if} \ k = \arg\min_j \|W^\top x_i - g_j\|_2^2, \quad F_{ik} = 0 \ \text{otherwise};

Step 42: with F and \Delta fixed, optimize G and W; taking the derivative of the objective with respect to G and setting it to 0 gives G = W^\top X \Delta F (F^\top \Delta F)^{-1}; substituting G into the objective yields:

\max_W \ \mathrm{Tr}(W^\top M W)

where S_w = X(\Delta - \Delta F(F^\top \Delta F)^{-1} F^\top \Delta)X^\top is the weighted within-class scatter matrix and M = S_t - \lambda S_w;

the optimal W is obtained from the d largest diagonal elements of the matrix M; letting N = \Delta - \Delta F(F^\top \Delta F)^{-1} F^\top \Delta, the i-th diagonal element M_{ii} of M is computed quickly as:

M_{ii} = \tilde{X}_{i:}\tilde{X}_{i:}^\top - \lambda \, X_{i:} N X_{i:}^\top

where X_{i:} and \tilde{X}_{i:} denote the vectors formed by the i-th rows of X and of the centred data matrix \tilde{X}, respectively;

Step 43: update \Delta using the F, W, and G computed in Steps 41 and 42; the i-th diagonal element of \Delta is updated as \tau_i = \frac{(1+\sigma)\,(\|u_i\|_2 + 2\sigma)}{2\,(\|u_i\|_2 + \sigma)^2} with u_i = W^\top x_i - G f_i;

Step 44: repeat Steps 41-43 until the termination condition is met, and output the feature selection matrix W, the cluster centre matrix G, and the class label matrix F.
CN201910209441.5A 2019-03-19 2019-03-19 An adaptive fast K-means clustering method incorporating feature learning Pending CN109978042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910209441.5A CN109978042A (en) 2019-03-19 2019-03-19 An adaptive fast K-means clustering method incorporating feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910209441.5A CN109978042A (en) 2019-03-19 2019-03-19 An adaptive fast K-means clustering method incorporating feature learning

Publications (1)

Publication Number Publication Date
CN109978042A true CN109978042A (en) 2019-07-05

Family

ID=67079551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910209441.5A Pending CN109978042A (en) An adaptive fast K-means clustering method incorporating feature learning

Country Status (1)

Country Link
CN (1) CN109978042A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670418A (en) * 2018-12-04 2019-04-23 厦门理工学院 In conjunction with the unsupervised object identification method of multi-source feature learning and group sparse constraint
CN109670418B (en) * 2018-12-04 2021-10-15 厦门理工学院 Unsupervised object identification method combining multi-source feature learning and group sparsity constraint
CN111160298A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Robot and pose estimation method and device thereof
CN111160298B (en) * 2019-12-31 2023-12-01 深圳市优必选科技股份有限公司 Robot and pose estimation method and device thereof
CN111611293A (en) * 2020-04-24 2020-09-01 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111611293B (en) * 2020-04-24 2023-09-29 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111578154A (en) * 2020-05-25 2020-08-25 吉林大学 LSDR-JMI-based water supply network multi-leakage pressure sensor optimal arrangement method
CN111578154B (en) * 2020-05-25 2021-03-26 吉林大学 LSDR-JMI-based water supply network multi-leakage pressure sensor optimal arrangement method
CN111898630A (en) * 2020-06-06 2020-11-06 东南大学 Characteristic method for noisy marked sample


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190705