CN102495876A - Nonnegative local coordinate factorization-based clustering method - Google Patents


Info

Publication number
CN102495876A
CN102495876A CN2011103946863A CN201110394686A CN102495876A CN 102495876 A CN102495876 A CN 102495876A CN 2011103946863 A CN2011103946863 A CN 2011103946863A CN 201110394686 A CN201110394686 A CN 201110394686A CN 102495876 A CN102495876 A CN 102495876A
Authority
CN
China
Prior art keywords
matrix
column vector
local coordinate
low
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103946863A
Other languages
Chinese (zh)
Inventor
何晓飞
陈琰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN2011103946863A
Publication of CN102495876A
Legal status: Pending

Abstract

The invention discloses a clustering method based on nonnegative local coordinate factorization, comprising the following steps: (1) a sample feature matrix is built; (2) a low-dimensional sparse matrix is output iteratively; (3) the low-dimensional sparse matrix is clustered. The idea of sparse coding is introduced into nonnegative matrix factorization (NMF): nonnegative local coordinate factorization is applied to a high-dimensional sample feature matrix, the resulting coefficient matrix serves as a low-dimensional representation of that matrix, and this low-dimensional matrix is clustered, which makes cluster analysis simple and effective. The dimension-reduced data are also easy to interpret. Compared with prior-art dimensionality reduction methods, the discriminating power of cluster analysis is further improved.

Description

A clustering method based on nonnegative local coordinate factorization
Technical field
The invention belongs to the technical field of data processing, and specifically relates to a clustering method based on nonnegative local coordinate factorization.
Background
Clustering is a common multivariate statistical analysis method in machine learning and data mining. It deals with large numbers of samples that must be grouped sensibly according to their own characteristics, with no existing model to consult or follow; that is, it is carried out without prior knowledge. Today, as an effective data analysis tool, clustering is widely used in many fields. In business, cluster analysis is used to discover different customer groups and to characterize them through their purchasing patterns. In biology, it is used to classify plants and animals and to classify genes, giving insight into the inherent structure of populations. In geography, clustering can help identify similarities among the data observed in earth-observation databases. In the insurance industry, cluster analysis identifies groups of car-insurance policyholders with a high average claim cost, and identifies groups of houses in a city according to house type, value and geographic location. In Internet applications, it is used to categorize documents on the web and to group users in virtual communities.
Common clustering methods mainly include the following:
(1) Partitioning methods first create K partitions, where K is the number of partitions to create, and then use an iterative relocation technique to improve the partitioning by moving objects from one partition to another. Typical partitioning methods include K-means, K-medoids and CLARA (Clustering LARge Applications).
(2) Hierarchical methods create a hierarchical decomposition of the given data set. They can be divided into top-down (divisive) and bottom-up (agglomerative) modes of operation. To compensate for the shortcomings of splitting and merging, hierarchical merging is often combined with other clustering techniques, such as iterative relocation. Typical hierarchical methods include BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), CURE (Clustering Using REpresentatives) and CHAMELEON.
(3) Density-based methods cluster objects according to density, growing a cluster continually based on the density around its objects. Typical density-based methods include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).
(4) Grid-based methods first divide the object space into a finite number of cells to form a grid structure, and then use the grid to perform clustering.
(5) Model-based methods hypothesize a model for each cluster and find the data that best fit that model.
These traditional clustering methods solve the clustering of low-dimensional data fairly successfully, but because of the complexity of data in practical applications they often fail on high-dimensional data. When clustering a high-dimensional data set, traditional methods run into two main problems: (1) the data set contains many irrelevant attributes, so the chance that clusters exist across all dimensions is almost zero; (2) the curse of dimensionality makes some clustering algorithms practically unusable.
To address these two problems, that is, to overcome the curse of dimensionality and to remove information that is redundant for clustering, it is necessary to reduce the dimensionality of the data before clustering. The main dimensionality reduction methods at present are:
(1) Principal component analysis (PCA): the classical unsupervised linear dimensionality reduction method. It captures the principal characteristics of the data: it can extract the main influencing factors from multivariate data, reveal the essence of the data and simplify complex problems.
(2) Linear discriminant analysis (LDA): the classical supervised dimensionality reduction method. It preserves the class-related structure in the low-dimensional subspace and is suited to dimensionality reduction for classification and recognition, but its reconstruction quality is not as good as that of PCA.
(3) Nonnegative matrix factorization (NMF): NMF reduces dimensionality by factorizing the data matrix into a basis matrix U and a coefficient matrix V, and it keeps both U and V nonnegative throughout the factorization.
PCA is the traditional, classical unsupervised dimensionality reduction method and is widely used; it effectively finds the principal features of the data but cannot effectively extract class features. LDA, as a supervised method, works well but needs a large amount of labeled training data, so it suits dimensionality reduction for classification and not for cluster analysis. NMF is a basic dimensionality reduction framework that has attracted much attention because the reduced data are easy to interpret, but clustering after NMF is not satisfactory, and its discriminating power in cluster analysis still has room for improvement.
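As a point of reference for the NMF baseline described above, the following is a minimal sketch of plain NMF using the widely known multiplicative updates for squared error. It is not taken from the patent: the function name `nmf`, the small constant `eps`, the iteration count and the synthetic test matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Factor a nonnegative m x n matrix X into U (m x k) and V (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: ratios of nonnegative terms keep U and V nonnegative.
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V

# Synthetic nonnegative data of rank 3 (illustrative only, not patent data).
rng = np.random.default_rng(1)
X = rng.random((30, 3)) @ rng.random((3, 12))
U, V = nmf(X, k=3)
rel_err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
print(U.shape, V.shape, rel_err)
```

Each column of V is the k-dimensional representation of one sample, which is the quantity a subsequent clustering step would work on.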
Summary of the invention
In view of the above technical deficiencies of the prior art, the invention provides a clustering method based on nonnegative local coordinate factorization that improves both the results and the discriminating power of cluster analysis.
A clustering method based on nonnegative local coordinate factorization comprises the following steps:
(1) obtain a sample set and build the sample feature matrix of the sample set;
(2) from the sample feature matrix, solve the low-dimensional sparse matrix of the sample set with the nonnegative local coordinate factorization iterative algorithm;
(3) cluster the low-dimensional sparse matrix.
In step (2), the low-dimensional sparse matrix of the sample set is solved through the following set of iterative equations:
$$u^{t}_{(j,p)} = u^{t-1}_{(j,p)} \, \frac{\left( X (V^{t-1})^{T} + \mu \sum_{i=1}^{n} x_i l^{T} \Lambda_i^{t-1} \right)_{(j,p)}}{\left( U^{t-1} V^{t-1} (V^{t-1})^{T} + \mu \sum_{i=1}^{n} U^{t-1} \Lambda_i^{t-1} \right)_{(j,p)}}$$
$$v^{t}_{(p,i)} = v^{t-1}_{(p,i)} \, \frac{2(\mu + 1) \left( (U^{t})^{T} X \right)_{(p,i)}}{\left( 2 (U^{t})^{T} U^{t} V^{t-1} + \mu C + \mu D^{t} \right)_{(p,i)}}$$
$$\sum_{i=1}^{n} \left( \| x_i - U^{t} v_i^{t} \|^{2} + \mu \sum_{p=1}^{k} | v^{t}_{(p,i)} | \, \| u_p^{t} - x_i \|^{2} \right) < \rho$$
where X is the m × n sample feature matrix, n is the number of samples, m is the number of features per sample, and each element of X is the value of one feature of one sample; U is the m × k basis matrix, V is the k × n coefficient matrix, and k is the number of clusters. $U^{t}$ and $V^{t}$ are the basis matrix and the coefficient matrix after the t-th iteration, and $U^{0}$ and $V^{0}$ are the randomly initialized nonnegative basis matrix and coefficient matrix. $u^{t}_{(j,p)}$ is the element in row j, column p of $U^{t}$, and $v^{t}_{(p,i)}$ is the element in row p, column i of $V^{t}$. $\Lambda_i^{t-1} = \mathrm{diag}(|v_i^{t-1}|)$, where $v_i^{t-1}$ is the i-th column vector of $V^{t-1}$; $u_p^{t}$ is the p-th column vector of $U^{t}$, and $x_i$ is the i-th column vector of X. μ is the sparsity factor and is set empirically; l is the k-dimensional column vector whose elements are all 1; ρ is the convergence threshold and is also set empirically. C and $D^{t}$ are k × n matrices: every row vector of C is $c^{T}$, with $c = \mathrm{diag}(X^{T} X)$, and every column vector of $D^{t}$ is $d^{t}$, with $d^{t} = \mathrm{diag}((U^{t})^{T} U^{t})$.
When the iteration converges or the maximum number of iterations is reached, the corresponding $V^{t}$ is the low-dimensional sparse matrix of the sample set.
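Read together, the update rules and the stopping test suggest the objective below. This is a sketch under stated assumptions, not the patent's own wording: it assumes $\Lambda_i = \mathrm{diag}(|v_i|)$, and it introduces $l_n$ for an n-dimensional all-ones vector, a symbol that does not appear in the original text.

$$O(U, V) = \sum_{i=1}^{n} \left( \| x_i - U v_i \|^{2} + \mu \sum_{p=1}^{k} | v_{(p,i)} | \, \| u_p - x_i \|^{2} \right)$$

Under the same assumption, the two sums in the update for U collapse to matrix products:

$$\sum_{i=1}^{n} x_i l^{T} \Lambda_i = X \, |V|^{T}, \qquad \sum_{i=1}^{n} U \Lambda_i = U \, \mathrm{diag}(|V| \, l_n)$$

Expanding the squared distance shows where the matrices C and D in the update for V come from, with $d_p = \| u_p \|^{2}$ and $c_i = \| x_i \|^{2}$:

$$\| u_p - x_i \|^{2} = d_p - 2 (U^{T} X)_{(p,i)} + c_i$$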
In step (3), the low-dimensional sparse matrix is clustered as follows: find the largest element in each column vector of the low-dimensional sparse matrix; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to class p.
By introducing the idea of sparse coding into the NMF process, the invention performs nonnegative local coordinate factorization on the high-dimensional sample feature matrix and uses the resulting coefficient matrix as a low-dimensional representation of that matrix. Clustering this low-dimensional matrix makes cluster analysis simple and effective. The dimension-reduced data also have good interpretability, and compared with prior-art dimensionality reduction methods the discriminating power of cluster analysis is further improved.
Description of drawings
Fig. 1 is a flow diagram of the steps of the clustering method of the invention.
Fig. 2(a) shows the accuracy curves of four clustering methods: K-means, NMF, NMF-SC and the method of the invention.
Fig. 2(b) shows the normalized mutual information curves of four clustering methods: K-means, NMF, NMF-SC and the method of the invention.
Embodiment
To describe the invention more specifically, the clustering method is explained in detail below with reference to the drawings and an embodiment.
As shown in Fig. 1, a clustering method based on nonnegative local coordinate factorization comprises the following steps:
(1) Build the sample feature matrix.
This embodiment uses the ORL face data set as an example; its statistics are shown in Table 1.
Table 1: Statistics of the ORL face data set
Data set | Number of face images | Number of face classes | Number of image features
ORL | 400 | 40 | 1024
The ORL face data set contains 400 face images, made up of the face images of 40 different people (10 images per person).
Two classes from the ORL face data set are chosen as the original high-dimensional data set, and the corresponding sample feature matrix X is built. X is an m × n matrix, where n is the number of samples (the number of images) and m is the number of features per sample; each element of X is the value of one feature of one sample. Here n = 2 × 10 = 20 and m = 1024.
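The layout of X can be sketched as follows. The 32 × 32 image size is an assumption (the text only gives m = 1024), the helper name `build_feature_matrix` is hypothetical, and the random pixel values are placeholders, not ORL data.

```python
import numpy as np

def build_feature_matrix(images):
    """Flatten n images of shape (h, w) into an m x n matrix, one column per sample."""
    images = np.asarray(images, dtype=float)
    n = images.shape[0]
    return images.reshape(n, -1).T

# Placeholder data: 2 classes x 10 images, each assumed to be 32 x 32 pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(20, 32, 32))
X = build_feature_matrix(images)
print(X.shape)  # (1024, 20), i.e. m = 1024 features and n = 20 samples
```

Pixel intensities are nonnegative, so a matrix built this way satisfies the nonnegativity that the factorization requires of its input.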
(2) Iteratively output the low-dimensional sparse matrix.
From the sample feature matrix X, the low-dimensional sparse matrix of the sample set is solved with the following nonnegative local coordinate factorization iteration:
$$u^{t}_{(j,p)} = u^{t-1}_{(j,p)} \, \frac{\left( X (V^{t-1})^{T} + \mu \sum_{i=1}^{n} x_i l^{T} \Lambda_i^{t-1} \right)_{(j,p)}}{\left( U^{t-1} V^{t-1} (V^{t-1})^{T} + \mu \sum_{i=1}^{n} U^{t-1} \Lambda_i^{t-1} \right)_{(j,p)}}$$
$$v^{t}_{(p,i)} = v^{t-1}_{(p,i)} \, \frac{2(\mu + 1) \left( (U^{t})^{T} X \right)_{(p,i)}}{\left( 2 (U^{t})^{T} U^{t} V^{t-1} + \mu C + \mu D^{t} \right)_{(p,i)}}$$
$$\sum_{i=1}^{n} \left( \| x_i - U^{t} v_i^{t} \|^{2} + \mu \sum_{p=1}^{k} | v^{t}_{(p,i)} | \, \| u_p^{t} - x_i \|^{2} \right) < \rho$$
where U is the m × k basis matrix, V is the k × n coefficient matrix, and k is the number of clusters (k = 2 in this embodiment). $U^{t}$ and $V^{t}$ are the basis matrix and the coefficient matrix after the t-th iteration, and $U^{0}$ and $V^{0}$ are the randomly initialized nonnegative basis matrix and coefficient matrix. $u^{t}_{(j,p)}$ is the element in row j, column p of $U^{t}$, and $v^{t}_{(p,i)}$ is the element in row p, column i of $V^{t}$. $\Lambda_i^{t-1} = \mathrm{diag}(|v_i^{t-1}|)$, where $v_i^{t-1}$ is the i-th column vector of $V^{t-1}$; $u_p^{t}$ is the p-th column vector of $U^{t}$, and $x_i$ is the i-th column vector of X. μ is the sparsity factor (μ = 1 in this embodiment); l is the k-dimensional column vector whose elements are all 1; ρ is the convergence threshold ($\rho = 10^{-7}$ in this embodiment). C and $D^{t}$ are k × n matrices: every row vector of C is $c^{T}$, with $c = \mathrm{diag}(X^{T} X)$, and every column vector of $D^{t}$ is $d^{t}$, with $d^{t} = \mathrm{diag}((U^{t})^{T} U^{t})$.
When the iteration converges or the maximum number of iterations is reached, the corresponding $V^{t}$ is the low-dimensional sparse matrix of the sample set. The maximum number of iterations in this embodiment is 200.
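The iteration can be sketched in NumPy as below. This is an illustration under stated assumptions, not the inventors' implementation: it assumes $\Lambda_i = \mathrm{diag}(|v_i|)$, so that the sums over i become the matrix products `X @ |V|.T` and `U * rowsum(|V|)`; the names `nlcf` and `nlcf_objective`, the constant `eps` that guards against division by zero, and the synthetic two-group data are all hypothetical. The defaults μ = 1, ρ = 1e-7 and 200 iterations follow the embodiment.

```python
import numpy as np

def nlcf_objective(X, U, V, mu):
    """Left-hand side of the convergence test: reconstruction error plus locality penalty."""
    recon = np.sum((X - U @ V) ** 2)
    # sq_dist[p, i] = ||u_p - x_i||^2
    sq_dist = (np.sum(U ** 2, axis=0)[:, None]
               + np.sum(X ** 2, axis=0)[None, :]
               - 2.0 * (U.T @ X))
    return recon + mu * np.sum(np.abs(V) * sq_dist)

def nlcf(X, k, mu=1.0, rho=1e-7, max_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps           # random nonnegative U^0
    V = rng.random((k, n)) + eps           # random nonnegative V^0
    c = np.sum(X ** 2, axis=0)             # diag(X^T X), length n
    C = np.tile(c, (k, 1))                 # every row of C is c^T
    for _ in range(max_iter):
        abs_v = np.abs(V)
        # Update U from U^{t-1} and V^{t-1}.
        num_u = X @ V.T + mu * (X @ abs_v.T)
        den_u = U @ V @ V.T + mu * U * abs_v.sum(axis=1)[None, :]
        U = U * num_u / (den_u + eps)
        # Update V from the new U^t and the old V^{t-1}.
        d = np.sum(U ** 2, axis=0)         # diag((U^t)^T U^t), length k
        D = np.tile(d[:, None], (1, n))    # every column of D is d
        num_v = 2.0 * (mu + 1.0) * (U.T @ X)
        den_v = 2.0 * (U.T @ U @ V) + mu * C + mu * D
        V = V * num_v / (den_v + eps)
        if nlcf_objective(X, U, V, mu) < rho:
            break
    return U, V

# Synthetic nonnegative data: two groups of 10 samples around two prototypes.
rng = np.random.default_rng(1)
centers = rng.random((50, 2))
X = np.repeat(centers, 10, axis=1) + 0.05 * rng.random((50, 20))
U, V = nlcf(X, k=2)
print(V.shape, nlcf_objective(X, U, V, 1.0))
```

On noisy data the objective rarely falls below an absolute threshold as small as 1e-7, so in this sketch the loop will typically run until the iteration cap.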
(3) Cluster analysis of the low-dimensional sparse matrix.
Find the largest element in each column vector of the low-dimensional sparse matrix: if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to class p.
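The assignment rule amounts to a column-wise argmax. A minimal sketch, using a made-up 2 × 4 coefficient matrix:

```python
import numpy as np

# Hypothetical coefficient matrix: k = 2 rows (classes), n = 4 columns (samples).
V = np.array([[0.9, 0.1, 0.0, 0.6],
              [0.2, 0.7, 0.8, 0.1]])

# For each column, the row index of its largest entry is the class label (0-based here).
labels = np.argmax(V, axis=0)
print(labels.tolist())  # [0, 1, 1, 0]
```

Because the labels come straight from V, no separate clustering algorithm has to be run on the reduced data.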
Next, the number of clusters k is set in turn to 2, 4, 8, 12, 16, 20, 25, 30 and 40, and the clustering performance of four methods is compared: K-means clustering (without dimensionality reduction), NMF clustering (nonnegative matrix factorization), NMF-SC clustering (nonnegative matrix factorization with sparseness constraints) and the method of this embodiment. Two metrics are used: accuracy (abbreviated AC) and normalized mutual information (abbreviated $\widehat{MI}$). The final results are shown in Table 2 and Fig. 2.
Accuracy measures the percentage of data that are labeled correctly:
[Equation image: definition of AC; not reproduced in this text]
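Since the AC formula itself is not reproduced here, the following is only one common way to compute clustering accuracy, offered as an assumption: cluster labels are matched one-to-one to true classes so as to maximize the number of agreements, and the fraction of matching samples is reported. The function name and the brute-force search over permutations (practical only for small k) are illustrative choices.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of samples labeled correctly under the best one-to-one label mapping."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0
    # Try every one-to-one assignment of cluster ids to class ids.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_labels))
        best = max(best, hits)
    return best / len(true_labels)

truth = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 0, 0, 0, 0]   # cluster ids are arbitrary, so they need mapping
acc = clustering_accuracy(truth, predicted)
print(acc)  # 5 of the 6 samples agree under the best mapping
```

For larger k, an assignment solver such as the Hungarian algorithm would replace the permutation search.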
Normalized mutual information is an information measure of the correlation between two sets. Given two sets C and C′:
$$MI(C, C') = \sum_{c_i \in C,\; c'_j \in C'} p(c_i, c'_j) \cdot \log \frac{p(c_i, c'_j)}{p(c_i) \cdot p(c'_j)}$$
$$\widehat{MI}(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}$$
where $p(c_i)$ and $p(c'_j)$ are the probabilities that a data point chosen at random from the data set belongs to $c_i$ and to $c'_j$ respectively, $p(c_i, c'_j)$ is the probability that it belongs to both at once, and H(C) and H(C′) are the entropies of C and C′.
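The two formulas can be sketched directly from label lists, estimating each probability as a relative frequency. The function names are hypothetical, the natural logarithm is an assumption (the base cancels in the ratio), and returning 0 when both entropies are zero is a convention chosen here.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Entropy H of a labeling, with probabilities taken as relative frequencies."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(a, b):
    """MI(C, C') divided by max(H(C), H(C'))."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((nij / n) * log((nij / n) / ((pa[i] / n) * (pb[j] / n)))
             for (i, j), nij in pab.items())
    denom = max(entropy(a), entropy(b))
    return mi / denom if denom > 0 else 0.0

same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, renamed
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent partitions
print(same, unrelated)
```

The measure does not depend on how clusters are numbered, which is why it needs no label mapping step.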
Table 2: Metric data of the four clustering methods: K-means, NMF, NMF-SC and this embodiment
[Table 2 image; not reproduced in this text]
As Table 2 and Fig. 2 show, this embodiment clearly improves clustering performance and discriminating power compared with the three prior-art clustering methods.

Claims (3)

1. A clustering method based on nonnegative local coordinate factorization, comprising the following steps:
(1) obtain a sample set and build the sample feature matrix of the sample set;
(2) from the sample feature matrix, solve the low-dimensional sparse matrix of the sample set with the nonnegative local coordinate factorization iterative algorithm;
(3) cluster the low-dimensional sparse matrix.
2. The clustering method based on nonnegative local coordinate factorization according to claim 1, characterized in that in step (2) the low-dimensional sparse matrix of the sample set is solved through the following set of iterative equations:
$$u^{t}_{(j,p)} = u^{t-1}_{(j,p)} \, \frac{\left( X (V^{t-1})^{T} + \mu \sum_{i=1}^{n} x_i l^{T} \Lambda_i^{t-1} \right)_{(j,p)}}{\left( U^{t-1} V^{t-1} (V^{t-1})^{T} + \mu \sum_{i=1}^{n} U^{t-1} \Lambda_i^{t-1} \right)_{(j,p)}}$$
$$v^{t}_{(p,i)} = v^{t-1}_{(p,i)} \, \frac{2(\mu + 1) \left( (U^{t})^{T} X \right)_{(p,i)}}{\left( 2 (U^{t})^{T} U^{t} V^{t-1} + \mu C + \mu D^{t} \right)_{(p,i)}}$$
$$\sum_{i=1}^{n} \left( \| x_i - U^{t} v_i^{t} \|^{2} + \mu \sum_{p=1}^{k} | v^{t}_{(p,i)} | \, \| u_p^{t} - x_i \|^{2} \right) < \rho$$
where X is the sample feature matrix, U is the basis matrix and V is the coefficient matrix; $U^{t}$ and $V^{t}$ are the basis matrix and the coefficient matrix after the t-th iteration, and $U^{0}$ and $V^{0}$ are the randomly initialized nonnegative basis matrix and coefficient matrix; $u^{t}_{(j,p)}$ is the element in row j, column p of $U^{t}$, and $v^{t}_{(p,i)}$ is the element in row p, column i of $V^{t}$; $\Lambda_i^{t-1} = \mathrm{diag}(|v_i^{t-1}|)$, where $v_i^{t-1}$ is the i-th column vector of $V^{t-1}$; $u_p^{t}$ is the p-th column vector of $U^{t}$, and $x_i$ is the i-th column vector of X; μ is the sparsity factor, l is a column vector whose elements are all 1, and ρ is the convergence threshold; C and $D^{t}$ are matrices, where every row vector of C is $c^{T}$, with $c = \mathrm{diag}(X^{T} X)$, and every column vector of $D^{t}$ is $d^{t}$, with $d^{t} = \mathrm{diag}((U^{t})^{T} U^{t})$; when the iteration converges or the maximum number of iterations is reached, the corresponding $V^{t}$ is the low-dimensional sparse matrix of the sample set.
3. The clustering method based on nonnegative local coordinate factorization according to claim 1, characterized in that in step (3) the low-dimensional sparse matrix is clustered as follows: find the largest element in each column vector of the low-dimensional sparse matrix; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to class p.
CN2011103946863A 2011-12-02 2011-12-02 Nonnegative local coordinate factorization-based clustering method Pending CN102495876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103946863A CN102495876A (en) 2011-12-02 2011-12-02 Nonnegative local coordinate factorization-based clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103946863A CN102495876A (en) 2011-12-02 2011-12-02 Nonnegative local coordinate factorization-based clustering method

Publications (1)

Publication Number Publication Date
CN102495876A true CN102495876A (en) 2012-06-13

Family

ID=46187701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103946863A Pending CN102495876A (en) 2011-12-02 2011-12-02 Nonnegative local coordinate factorization-based clustering method

Country Status (1)

Country Link
CN (1) CN102495876A (en)


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120613