CN102495876A

CN102495876A - Nonnegative local coordinate factorization-based clustering method

Info

Publication number: CN102495876A
Application number: CN2011103946863A
Authority: CN
Inventors: 何晓飞; 陈琰
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2011-12-02
Filing date: 2011-12-02
Publication date: 2012-06-13

Abstract

The invention discloses a clustering method based on nonnegative local coordinate factorization, comprising the following steps: (1) constructing a sample feature matrix; (2) iteratively computing a low-dimensional sparse matrix; and (3) clustering the low-dimensional sparse matrix. The idea of sparse coding is introduced into nonnegative matrix factorization (NMF): nonnegative local coordinate factorization is applied to a high-dimensional sample feature matrix, the resulting coefficient matrix serves as a low-dimensional representation of that matrix, and cluster analysis is performed on the low-dimensional matrix, which makes the cluster analysis simple and effective. The dimension-reduced data also has good interpretability. Compared with prior-art dimensionality reduction methods, the discriminating power of the cluster analysis can be further improved.

Description

A clustering method based on nonnegative local coordinate factorization

Technical field

The invention belongs to the technical field of data processing, and specifically relates to a clustering method based on nonnegative local coordinate factorization.

Background art

Clustering is a common multivariate statistical analysis method in machine learning and data mining. It deals with large numbers of samples that must be grouped reasonably according to their own characteristics, with no existing pattern to refer to or follow; that is, it is carried out without prior knowledge. As an effective means of data analysis, clustering is now widely used in many fields. In business, cluster analysis is used to discover different customer groups and to characterize them through their purchasing patterns. In biology, it is used to classify animals, plants, and genes, and to gain insight into the inherent structure of populations. In geography, clustering helps reveal similarities in earth-observation databases. In the insurance industry, cluster analysis identifies groups of car insurance policyholders with a high average cost, and identifies groups of houses in a city according to house type, value, and geographic location. In Internet applications, cluster analysis is used to categorize documents on the web and to group users in virtual communities.

Common clustering methods mainly fall into the following categories:

(1) Partitioning methods first create K partitions, where K is the number of partitions to be created, and then use an iterative relocation technique to improve the partition quality by moving objects from one partition to another. Typical partitioning methods include Kmeans, Kmedoids, and CLARA (Clustering LARge Applications).

(2) Hierarchical methods create a hierarchical decomposition of the given data set. They can be divided into top-down (divisive) and bottom-up (agglomerative) modes of operation. To compensate for the shortcomings of splitting and merging, hierarchical merging is often combined with other clustering techniques, such as iterative relocation. Typical hierarchical methods include BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), CURE (Clustering Using REpresentatives), and CHAMELEON.

(3) Density-based methods cluster objects according to density, growing a cluster continually according to the density around its objects. Typical density-based methods include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).

(4) Grid-based methods first divide the object space into a finite number of cells to form a grid structure, and then use the grid structure to perform clustering.

(5) Model-based methods hypothesize a model for each cluster and find the data that best fits the corresponding model.

These traditional clustering methods have solved the clustering problem for low-dimensional data fairly successfully. However, because of the complexity of real-world data, they often fail on high-dimensional data. Traditional clustering methods run into two main problems on high-dimensional data sets: (1) the large number of irrelevant attributes in high-dimensional data makes it almost impossible for clusters to exist across all dimensions; (2) the curse of dimensionality renders some clustering algorithms almost useless in practice.

To address these two problems, that is, to overcome the curse of dimensionality and to remove information in the data that is redundant for clustering, it is necessary to reduce the dimensionality of the data before clustering. The main dimensionality reduction methods at present are:

(1) Principal component analysis (PCA): a classical unsupervised linear dimensionality reduction method. It captures the principal characteristics of the data: it extracts the main influencing factors from multivariate data, reveals the underlying structure, and simplifies complex problems.

(2) Linear discriminant analysis (LDA): a classical supervised dimensionality reduction method. It preserves the class-related structure in the low-dimensional subspace and is suited to dimensionality reduction for classification and recognition, but its reconstruction quality is not as good as that of PCA.

(3) Nonnegative matrix factorization (NMF): NMF reduces dimensionality by factorizing the data matrix into a basis matrix U and a coefficient matrix V, and keeps both the basis matrix and the coefficient matrix nonnegative throughout the factorization.

PCA is a traditional, classical unsupervised dimensionality reduction method that is widely used in practice. It finds the principal features of the data effectively, but it cannot effectively extract class-related features. LDA, as a supervised method, performs well but requires a large amount of labeled training data, so it is suitable for dimensionality reduction before classification and unsuitable before cluster analysis. NMF is a basic dimensionality reduction framework whose reduced data is highly interpretable, which has made it a current research focus; however, cluster analysis performed after NMF dimensionality reduction is unsatisfactory, and its discriminating power still has room for improvement.
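For reference, the plain NMF baseline discussed above can be sketched in a few lines. This is a minimal illustration that assumes the widely used multiplicative update rules for the squared reconstruction error; the function name `nmf`, the small constant `eps`, and the random initialization are choices made here for illustration and are not taken from this document.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Factor a nonnegative X (m x n) as U (m x k) @ V (k x n), with U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))  # random nonnegative basis matrix
    V = rng.random((k, n))  # random nonnegative coefficient matrix
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative;
        # eps guards against division by zero.
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V
```

Each column of `V` is then the k-dimensional representation of the corresponding sample in `X`.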

Summary of the invention

In view of the above technical deficiencies of the prior art, the invention provides a clustering method based on nonnegative local coordinate factorization, which improves the results and the discriminating power of cluster analysis.

A clustering method based on nonnegative local coordinate factorization comprises the following steps:

(1) obtaining a sample set, and constructing the sample feature matrix of the sample set;

(2) according to the sample feature matrix, solving for the low-dimensional sparse matrix of the sample set through a nonnegative local coordinate factorization iterative algorithm;

(3) clustering the low-dimensional sparse matrix.

In step (2), the low-dimensional sparse matrix of the sample set is solved through the following iterative equations:

u_{(j, p)}^{t} = u_{(j, p)}^{t - 1} \frac{\left( X (V^{t - 1})^{T} + \mu \sum_{i = 1}^{n} x_{i} l^{T} \Lambda_{i}^{t - 1} \right)_{(j, p)}}{\left( U^{t - 1} V^{t - 1} (V^{t - 1})^{T} + \mu \sum_{i = 1}^{n} U^{t - 1} \Lambda_{i}^{t - 1} \right)_{(j, p)}}

v_{(p, i)}^{t} = v_{(p, i)}^{t - 1} \frac{2 (\mu + 1) \left( (U^{t})^{T} X \right)_{(p, i)}}{\left( 2 (U^{t})^{T} U^{t} V^{t - 1} + \mu C + \mu D^{t} \right)_{(p, i)}}

\sum_{i = 1}^{n} \left( \| x_{i} - U^{t} v_{i}^{t} \|^{2} + \mu \sum_{p = 1}^{k} | v_{(p, i)}^{t} | \, \| u_{p}^{t} - x_{i} \|^{2} \right) < \rho

Where: X is the m × n sample feature matrix, n is the number of samples, m is the number of features per sample, and each element of X is the value of one feature of one sample; U is the m × k basis matrix, V is the k × n coefficient matrix, and k is the number of clusters. U^{t} and V^{t} are the basis matrix and the coefficient matrix after the t-th iteration, and U^{0} and V^{0} are the randomly initialized nonnegative basis matrix and coefficient matrix. u_{(j, p)}^{t} is the element in row j, column p of U^{t}; v_{(p, i)}^{t} is the element in row p, column i of V^{t}; v_{i}^{t - 1} is the i-th column vector of V^{t - 1}; u_{p}^{t} is the p-th column vector of U^{t}; and x_{i} is the i-th column vector of X. μ is the sparsity factor and ρ is the convergence threshold, both chosen empirically; l is the k-dimensional column vector whose elements are all 1. C and D^{t} are k × n matrices: every row vector of C is c^{T} with c = diag(X^{T} X), and every column vector of D^{t} is d^{t} with d^{t} = diag((U^{t})^{T} U^{t}).

When the iteration converges or the maximum number of iterations is reached, the corresponding V^{t} is the low-dimensional sparse matrix of the sample set.
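The iteration above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: it assumes that Λ_i is the k × k diagonal matrix whose diagonal is the i-th column of V (a definition not spelled out in the text above), which lets the two sums over i be written as matrix products. The names `nlcf` and `objective` and the small constant `eps` are introduced here for illustration.

```python
import numpy as np

def objective(X, U, V, mu):
    """Reconstruction error plus mu times the locality penalty (left side of the convergence test)."""
    recon = np.sum((X - U @ V) ** 2)
    # sq_dist[p, i] = ||u_p - x_i||^2, expanded as d_p - 2 (U^T X)_{p,i} + c_i
    sq_dist = (np.sum(U * U, axis=0)[:, None]
               - 2.0 * (U.T @ X)
               + np.sum(X * X, axis=0)[None, :])
    return recon + mu * np.sum(np.abs(V) * sq_dist)

def nlcf(X, k, mu=1.0, rho=1e-7, max_iter=200, eps=1e-12, seed=0):
    """Return U (m x k) and V (k x n) for a nonnegative X (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))             # random nonnegative U^0
    V = rng.random((k, n))             # random nonnegative V^0
    c = np.sum(X * X, axis=0)          # c = diag(X^T X), length n
    for _ in range(max_iter):
        # U update. Assumption: Lambda_i = diag(v_i), so that
        #   sum_i x_i l^T Lambda_i = X V^T   and   sum_i U Lambda_i = U diag(row sums of V).
        num_u = (1.0 + mu) * (X @ V.T)
        den_u = U @ (V @ V.T) + mu * U * V.sum(axis=1)
        U = U * num_u / (den_u + eps)
        # V update, using the freshly updated U.
        d = np.sum(U * U, axis=0)      # d = diag(U^T U), length k
        num_v = 2.0 * (mu + 1.0) * (U.T @ X)
        den_v = 2.0 * (U.T @ U @ V) + mu * c[None, :] + mu * d[:, None]
        V = V * num_v / (den_v + eps)
        # Stopping test as written in the text: objective below rho.
        if objective(X, U, V, mu) < rho:
            break
    return U, V
```

Because the stopping test compares the objective value itself with ρ, a run on real data will often end by reaching `max_iter` rather than by crossing the threshold.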

In step (3), the low-dimensional sparse matrix is clustered as follows: find the largest element in each column vector of the low-dimensional sparse matrix; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to the p-th class.
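The assignment rule in step (3) amounts to a column-wise argmax. A minimal sketch, in which the function name `assign_clusters` and the small example matrix are illustrative and row indices are 0-based (so label 0 corresponds to the first class):

```python
import numpy as np

def assign_clusters(V):
    """For each column i of V (k x n), return the row index p of its largest element."""
    return np.argmax(V, axis=0)

# Three samples, two classes: columns are samples, rows are classes.
V = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.3]])
labels = assign_clusters(V)  # array([0, 1, 1])
```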

By introducing the idea of sparse coding into NMF, the invention applies nonnegative local coordinate factorization to the high-dimensional sample feature matrix, takes the resulting coefficient matrix as the low-dimensional representation of that matrix, and performs cluster analysis on this low-dimensional matrix, which makes cluster analysis simple and effective. The dimension-reduced data also has good interpretability, and, compared with prior-art dimensionality reduction methods, the discriminating power of cluster analysis is further improved.

Description of drawings

Fig. 1 is a flowchart of the steps of the clustering method of the present invention.

Fig. 2(a) shows the accuracy curves of four clustering methods: Kmeans, NMF, NMF-SC, and the present invention.

Fig. 2(b) shows the normalized mutual information curves of four clustering methods: Kmeans, NMF, NMF-SC, and the present invention.

Embodiment

To describe the invention more specifically, the clustering method of the invention is explained in detail below with reference to the drawings and an embodiment.

As shown in Fig. 1, a clustering method based on nonnegative local coordinate factorization comprises the following steps:

(1) Construct the sample feature matrix.

This embodiment uses the ORL face data set as an example; its statistics are shown in Table 1.

Table 1: Statistics of the ORL face data set

Data set	Number of face images	Number of face classes	Number of image features
ORL	400	40	1024

The ORL face data set contains 400 face images of 40 different people, with 10 face images per person.

Two classes of instances are selected from the ORL face data set as the original high-dimensional data set, and the corresponding sample feature matrix X is constructed. X is an m × n matrix, where n is the number of samples (that is, the number of images), m is the number of features per sample, and each element of the sample feature matrix is the value of one feature of one sample; here n = 2 × 10 = 20 and m = 1024.
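Building X from images can be sketched as stacking one flattened image per column. The sketch below uses random arrays as stand-ins for the face images and assumes a 32 × 32 pixel layout, which matches m = 1024 but is not stated in the text; the function name `build_feature_matrix` is likewise illustrative.

```python
import numpy as np

def build_feature_matrix(images):
    """Flatten each image into an m-vector and stack the vectors as columns of X (m x n)."""
    cols = [np.asarray(img, dtype=float).ravel() for img in images]
    return np.stack(cols, axis=1)

# Stand-in data: 20 random nonnegative "images" (2 classes x 10 images each).
rng = np.random.default_rng(0)
images = [rng.random((32, 32)) for _ in range(20)]
X = build_feature_matrix(images)  # shape (1024, 20)
```

Pixel intensities are nonnegative, so a matrix built this way satisfies the nonnegativity that the factorization requires.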

(2) Iteratively compute the low-dimensional sparse matrix.

According to the sample feature matrix X, the low-dimensional sparse matrix of the sample set is solved through the following nonnegative local coordinate factorization iterative algorithm:

u_{(j, p)}^{t} = u_{(j, p)}^{t - 1} \frac{\left( X (V^{t - 1})^{T} + \mu \sum_{i = 1}^{n} x_{i} l^{T} \Lambda_{i}^{t - 1} \right)_{(j, p)}}{\left( U^{t - 1} V^{t - 1} (V^{t - 1})^{T} + \mu \sum_{i = 1}^{n} U^{t - 1} \Lambda_{i}^{t - 1} \right)_{(j, p)}}

v_{(p, i)}^{t} = v_{(p, i)}^{t - 1} \frac{2 (\mu + 1) \left( (U^{t})^{T} X \right)_{(p, i)}}{\left( 2 (U^{t})^{T} U^{t} V^{t - 1} + \mu C + \mu D^{t} \right)_{(p, i)}}

\sum_{i = 1}^{n} \left( \| x_{i} - U^{t} v_{i}^{t} \|^{2} + \mu \sum_{p = 1}^{k} | v_{(p, i)}^{t} | \, \| u_{p}^{t} - x_{i} \|^{2} \right) < \rho

Where: U is the m × k basis matrix, V is the k × n coefficient matrix, and k is the number of clusters, with k = 2 in this embodiment. U^{t} and V^{t} are the basis matrix and the coefficient matrix after the t-th iteration, and U^{0} and V^{0} are the randomly initialized nonnegative basis matrix and coefficient matrix. u_{(j, p)}^{t} is the element in row j, column p of U^{t}; v_{(p, i)}^{t} is the element in row p, column i of V^{t}; v_{i}^{t - 1} is the i-th column vector of V^{t - 1}; u_{p}^{t} is the p-th column vector of U^{t}; and x_{i} is the i-th column vector of X. μ is the sparsity factor, with μ = 1 in this embodiment; l is the k-dimensional column vector whose elements are all 1; ρ is the convergence threshold, with ρ = 10^{-7} in this embodiment. C and D^{t} are k × n matrices: every row vector of C is c^{T} with c = diag(X^{T} X), and every column vector of D^{t} is d^{t} with d^{t} = diag((U^{t})^{T} U^{t}).

When the iteration converges or the maximum number of iterations is reached, the corresponding V^{t} is the low-dimensional sparse matrix of the sample set; the maximum number of iterations is 200 in this embodiment.
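The two sums over i in the update for U collapse to ordinary matrix products, which is convenient when implementing the iteration. The derivation below is a sketch that assumes \Lambda_{i} = diag(v_{i}), the k × k diagonal matrix built from the i-th column of V; this definition and the n-dimensional all-ones vector \mathbf{1}_{n} are assumptions introduced here and are not stated in the text above. It also shows where the entries c_{i} = \| x_{i} \|^{2} of C and d_{p}^{t} = \| u_{p}^{t} \|^{2} of D^{t} come from.

```latex
% Assumption: \Lambda_i = \mathrm{diag}(v_i), with v_i \ge 0 the i-th column of V,
% so that l^{T} \Lambda_i = v_i^{T}.
\sum_{i=1}^{n} x_i \, l^{T} \Lambda_i = \sum_{i=1}^{n} x_i v_i^{T} = X V^{T}

\sum_{i=1}^{n} U \Lambda_i = U \, \mathrm{diag}\left( V \mathbf{1}_n \right)

% Hence the update for U reduces to
u_{(j,p)}^{t} = u_{(j,p)}^{t-1}
  \frac{(1 + \mu) \left( X (V^{t-1})^{T} \right)_{(j,p)}}
       {\left( U^{t-1} V^{t-1} (V^{t-1})^{T}
         + \mu \, U^{t-1} \mathrm{diag}\left( V^{t-1} \mathbf{1}_n \right) \right)_{(j,p)}}

% The matrices C and D^{t} arise from expanding the squared distance:
\| u_p^{t} - x_i \|^{2} = d_p^{t} - 2 \left( (U^{t})^{T} X \right)_{(p,i)} + c_i
```

With μ = 1 as in this embodiment, the factor (1 + μ) in the numerator is simply 2.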

(3) Cluster the low-dimensional sparse matrix.

Find the largest element in each column vector of the low-dimensional sparse matrix; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to the p-th class.

Next, the number of clusters k is set in turn to 2, 4, 8, 12, 16, 20, 25, 30, and 40, and the clustering results of four methods are compared: Kmeans clustering (without dimensionality reduction), NMF (nonnegative matrix factorization) clustering, NMF-SC (nonnegative matrix factorization with sparseness constraints) clustering, and the method of this embodiment. Two metrics are used, accuracy (abbreviated AC) and normalized mutual information (abbreviated \hat{MI}); the final results are shown in Table 2 and Fig. 2.

Accuracy measures the percentage of data that is correctly labeled:
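Cluster labels are arbitrary, so a common way to compute clustering accuracy, assumed here rather than taken from this text, is to try every one-to-one mapping from cluster labels to class labels and keep the best match rate. The sketch below does this by brute force over permutations, which is only practical for small k; the function name `clustering_accuracy` is illustrative.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, k):
    """Best fraction of matching labels over all relabelings of the k clusters.

    Both label arrays must contain integers in the range 0..k-1.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 0
    for perm in permutations(range(k)):
        mapped = np.array(perm)[pred_labels]  # relabel cluster p as perm[p]
        best = max(best, int(np.sum(mapped == true_labels)))
    return best / true_labels.size
```

For example, `clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 1], k=2)` evaluates to 0.75. The number of permutations grows as k!, so for the larger k values in the comparison an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the practical choice.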

Normalized mutual information measures the correlation between two sets. Given two sets C and C':

MI(C, C') = \sum_{c_{i} \in C, \, c_{j}' \in C'} p(c_{i}, c_{j}') \cdot \log \frac{p(c_{i}, c_{j}')}{p(c_{i}) \cdot p(c_{j}')}

\hat{MI}(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}

Where: p(c_{i}) and p(c_{j}') are the probabilities that a sample chosen arbitrarily from the data set belongs to c_{i} and c_{j}' respectively, and p(c_{i}, c_{j}') is the probability that it belongs to both at the same time; H(C) and H(C') are the entropies of C and C' respectively.
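The two formulas above can be evaluated directly from two label arrays by estimating the probabilities as relative frequencies. A minimal sketch, in which the function name `normalized_mutual_information` and the convention of returning 0 when both entropies are zero are choices made here:

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """MI(C, C') / max(H(C), H(C')), with probabilities estimated from label counts."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Joint distribution p(c_i, c_j') as a contingency table of relative frequencies.
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)
    joint /= a.size
    pa = joint.sum(axis=1)  # marginal p(c_i)
    pb = joint.sum(axis=0)  # marginal p(c_j')
    nz = joint > 0          # skip empty cells, whose contribution is zero
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    h_a = -np.sum(pa * np.log(pa))
    h_b = -np.sum(pb * np.log(pb))
    denom = max(h_a, h_b)
    return mi / denom if denom > 0 else 0.0
```

Two labelings that agree up to a renaming of the clusters score 1, and two independent labelings score 0.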

Table 2: Metrics of the four clustering methods: Kmeans, NMF, NMF-SC, and this embodiment

As Table 2 and Fig. 2 show, compared with the three prior-art clustering methods, this embodiment significantly improves the clustering results and the discriminating power.

Claims

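1. A clustering method based on nonnegative local coordinate factorization, comprising the following steps:

(1) obtaining a sample set, and constructing the sample feature matrix of the sample set;

(2) according to the sample feature matrix, solving for the low-dimensional sparse matrix of the sample set through a nonnegative local coordinate factorization iterative algorithm;

(3) clustering the low-dimensional sparse matrix.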

2. The clustering method based on nonnegative local coordinate factorization according to claim 1, characterized in that: in step (2), the low-dimensional sparse matrix of the sample set is solved through the following iterative equations:

u_{(j, p)}^{t} = u_{(j, p)}^{t - 1} \frac{\left( X (V^{t - 1})^{T} + \mu \sum_{i = 1}^{n} x_{i} l^{T} \Lambda_{i}^{t - 1} \right)_{(j, p)}}{\left( U^{t - 1} V^{t - 1} (V^{t - 1})^{T} + \mu \sum_{i = 1}^{n} U^{t - 1} \Lambda_{i}^{t - 1} \right)_{(j, p)}}

v_{(p, i)}^{t} = v_{(p, i)}^{t - 1} \frac{2 (\mu + 1) \left( (U^{t})^{T} X \right)_{(p, i)}}{\left( 2 (U^{t})^{T} U^{t} V^{t - 1} + \mu C + \mu D^{t} \right)_{(p, i)}}

\sum_{i = 1}^{n} \left( \| x_{i} - U^{t} v_{i}^{t} \|^{2} + \mu \sum_{p = 1}^{k} | v_{(p, i)}^{t} | \, \| u_{p}^{t} - x_{i} \|^{2} \right) < \rho

Where: X is the sample feature matrix, U is the basis matrix, and V is the coefficient matrix; U^{t} and V^{t} are the basis matrix and the coefficient matrix after the t-th iteration, and U^{0} and V^{0} are the randomly initialized nonnegative basis matrix and coefficient matrix; u_{(j, p)}^{t} is the element in row j, column p of U^{t}; v_{(p, i)}^{t} is the element in row p, column i of V^{t}; v_{i}^{t - 1} is the i-th column vector of V^{t - 1}; u_{p}^{t} is the p-th column vector of U^{t}; x_{i} is the i-th column vector of X; μ is the sparsity factor, l is the column vector whose elements are all 1, and ρ is the convergence threshold; C and D^{t} are matrices, wherein every row vector of C is c^{T} with c = diag(X^{T} X), and every column vector of D^{t} is d^{t} with d^{t} = diag((U^{t})^{T} U^{t}); when the iteration converges or the maximum number of iterations is reached, the corresponding V^{t} is the low-dimensional sparse matrix of the sample set.

3. The clustering method based on nonnegative local coordinate factorization according to claim 1, characterized in that: in step (3), the low-dimensional sparse matrix is clustered as follows: find the largest element in each column vector of the low-dimensional sparse matrix; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to the p-th class.