CN102495876A - Nonnegative local coordinate factorization-based clustering method - Google Patents
Abstract
The invention discloses a clustering method based on nonnegative local coordinate factorization, comprising the following steps: (1) constructing a sample feature matrix; (2) iteratively computing a low-dimensional sparse matrix; and (3) clustering the low-dimensional sparse matrix. The idea of sparse coding is introduced into nonnegative matrix factorization (NMF): nonnegative local coordinate factorization is applied to the high-dimensional sample feature matrix, the resulting coefficient matrix serves as the low-dimensional representation of that matrix, and cluster analysis is performed on this low-dimensional matrix, which makes the cluster analysis simple and effective. The dimension-reduced data also remain readily interpretable. Compared with prior-art dimensionality reduction methods, the discriminative power of the cluster analysis is further improved.
Description
Technical field
The invention belongs to the technical field of data processing and specifically relates to a clustering method based on nonnegative local coordinate factorization.
Background art
Clustering is a multivariate statistical analysis method commonly used in machine learning and data mining. Its object of study is a large number of samples, which must be grouped sensibly according to their own characteristics without any pattern to refer to or follow, that is, without prior knowledge. As an effective means of data analysis, clustering is now widely used in many fields. In business, cluster analysis is used to discover different customer groups and to characterize them by their purchasing patterns. In biology, it is used to classify plants, animals, and genes, giving insight into the inherent structure of populations. In geography, it helps identify similar regions in earth-observation databases. In the insurance industry, it identifies groups of automobile insurance policyholders with high average expenditure, and groups the houses of a city by house type, value, and geographic location. In Internet applications, it is used to categorize documents on the web and to group the users of virtual communities.
Common clustering methods mainly include the following:
(1) Partitioning methods first create K partitions, where K is the number of partitions to create, and then use an iterative relocation technique to improve the partitioning quality by moving objects from one partition to another. Typical partitioning methods include K-means, K-medoids, and CLARA (Clustering LARge Applications).
(2) Hierarchical methods create a hierarchical decomposition of the given data set. They can be divided into top-down (divisive) and bottom-up (agglomerative) approaches. To compensate for the shortcomings of splitting and merging, hierarchical merging is often combined with other clustering techniques such as iterative relocation. Typical hierarchical methods include BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), CURE (Clustering Using REpresentatives), and CHAMELEON.
(3) Density-based methods cluster objects according to density, growing each cluster according to the density around its objects. Typical density-based methods include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).
(4) Grid-based methods first divide the object space into a finite number of cells that form a grid structure, and then perform the clustering on that grid.
(5) Model-based methods hypothesize a model for each cluster and find the data that fit the corresponding model.
These traditional clustering methods solve the problem of clustering low-dimensional data fairly successfully. Because data in practical applications are complex, however, they often fail on high-dimensional data. When clustering a high-dimensional data set, traditional methods run into two main problems: (1) the data set contains a large number of irrelevant attributes, so the probability that clusters exist across all dimensions is almost zero; (2) the curse of dimensionality makes some clustering algorithms practically unusable.
To address these two problems, that is, to overcome the curse of dimensionality and to remove information that is redundant for clustering, the data must be reduced in dimension before clustering. The main dimensionality reduction methods at present are:
(1) Principal component analysis (PCA): the classical unsupervised linear dimensionality reduction method. It captures the principal characteristics of the data: it extracts the major influencing factors from multivariate data, reveals the underlying structure, and simplifies complex problems.
(2) Linear discriminant analysis (LDA): the classical supervised dimensionality reduction method. It preserves the class-dependent structure in the low-dimensional subspace and is suited to dimensionality reduction for classification and recognition, but its reconstruction quality is not as good as that of PCA.
(3) Nonnegative matrix factorization (NMF): NMF reduces dimensionality by factorizing the data matrix into a basis matrix U and a coefficient matrix V, keeping both the basis matrix and the coefficient matrix nonnegative throughout the factorization.
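For orientation, the NMF factorization described in item (3) is commonly written as the constrained least-squares problem below. The Frobenius-norm objective is an assumption (it is the usual choice, but the text does not name one); the dimensions m, n, and k follow the definitions used later in this document.

```latex
\min_{U \ge 0,\; V \ge 0} \; \lVert X - UV \rVert_F^2 ,
\qquad
X \in \mathbb{R}_{\ge 0}^{m \times n},\quad
U \in \mathbb{R}_{\ge 0}^{m \times k},\quad
V \in \mathbb{R}_{\ge 0}^{k \times n}
```

Each column of V is then the k-dimensional representation of the corresponding column (sample) of X.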
PCA is the traditional, classical unsupervised dimensionality reduction method and is widely used; it finds the principal features of the data effectively but cannot effectively extract their class-related features. LDA is a supervised method that works well, but it needs a large amount of labeled training data, so it is suitable for dimensionality reduction before classification and not before cluster analysis. NMF is a basic dimensionality reduction framework that has attracted much attention because the reduced data are readily interpretable; however, cluster analysis performed after NMF gives unsatisfactory results, and its discriminative power leaves room for improvement.
Summary of the invention
To address the above technical deficiencies of the prior art, the invention provides a clustering method based on nonnegative local coordinate factorization, which improves both the quality and the discriminative power of cluster analysis.
A clustering method based on nonnegative local coordinate factorization comprises the following steps:
(1) obtaining a sample set and constructing the sample feature matrix of the sample set;
(2) solving, from the sample feature matrix, the low-dimensional sparse matrix of the sample set by means of a nonnegative local coordinate factorization iterative algorithm;
(3) clustering the low-dimensional sparse matrix.
In step (2), the low-dimensional sparse matrix of the sample set is solved through the following iterative equation group:
[Equation group not reproduced in this text.]
Where: X is the m × n sample feature matrix, n is the number of samples, m is the number of features per sample, and each element of X is the value of one feature of one sample; U is the m × k basis matrix, V is the k × n coefficient matrix, and k is the number of clusters. U^t and V^t are the basis matrix and the coefficient matrix after the t-th iteration, and U^0 and V^0 are the randomly initialized nonnegative basis matrix and coefficient matrix. u_{jp}^t is the element in row j, column p of U^t, and v_{pi}^t is the element in row p, column i of V^t; v_i^{t-1} is the i-th column vector of V^{t-1}, u_p^t is the p-th column vector of U^t, and x_i is the i-th column vector of X. μ is the sparsity factor and is set empirically; l is the k-dimensional column vector whose elements are all 1; ρ is the convergence threshold and is set empirically. C and D^t are k × n matrices: every row of C is c^T, with c = diag(X^T X), and every column of D^t is d^t, with d^t = diag((U^t)^T U^t).
When the iteration converges or the maximum number of iterations is reached, the corresponding V^t is the low-dimensional sparse matrix of the sample set.
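The update equations themselves are not available in this text. As a hedged reconstruction, the definitions of C and D^t above are consistent with the multiplicative updates that follow from the published nonnegative local coordinate factorization objective, sketched below. The objective, the matrix H^t, and the n-dimensional ones vector are assumptions; how the source uses the k-dimensional ones vector l and the threshold ρ cannot be recovered and is not shown.

```latex
% Assumed objective (published NLCF form), not taken from the patent's own equations
\mathcal{O}(U,V) = \lVert X - UV \rVert_F^2
  + \mu \sum_{i=1}^{n} \sum_{p=1}^{k} v_{pi}\, \lVert u_p - x_i \rVert^2 ,
\qquad U \ge 0,\; V \ge 0

% Multiplicative updates derived from that objective
u_{jp}^{t+1} = u_{jp}^{t}\,
  \frac{\bigl((1+\mu)\, X (V^{t})^{T}\bigr)_{jp}}
       {\bigl(U^{t} V^{t} (V^{t})^{T} + \mu\, U^{t} H^{t}\bigr)_{jp}} ,
\qquad H^{t} = \operatorname{diag}\bigl(V^{t}\mathbf{1}_n\bigr)

v_{pi}^{t+1} = v_{pi}^{t}\,
  \frac{\bigl(2(\mu+1)\,(U^{t+1})^{T} X\bigr)_{pi}}
       {\bigl(2\,(U^{t+1})^{T} U^{t+1} V^{t} + \mu C + \mu D^{t+1}\bigr)_{pi}}
```

The V update follows because the gradient of the locality term with respect to v_{pi} is the squared distance between u_p and x_i, which expands to C_{pi} - 2 (U^T X)_{pi} + D_{pi}.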
In step (3), the low-dimensional sparse matrix is clustered as follows: the largest element of each column vector of the low-dimensional sparse matrix is located; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to class p.
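The assignment rule of step (3) amounts to a column-wise argmax. A minimal sketch, assuming V is stored as a list of k rows of length n; the function name `assign_clusters` and the sample values are illustrative only, and class indices are 0-based here.

```python
def assign_clusters(V):
    """For each column i of V (k rows, n columns), return the row index p of its largest entry."""
    k, n = len(V), len(V[0])
    # max() returns the first maximal index, so ties go to the lowest class index.
    return [max(range(k), key=lambda p: V[p][i]) for i in range(n)]

# Hypothetical 2 x 3 coefficient matrix: three samples, two classes.
V = [[0.9, 0.1, 0.4],
     [0.2, 0.7, 0.6]]
labels = assign_clusters(V)
print(labels)
```

How ties are broken is not specified in the text; taking the lowest index is simply the choice made in this sketch.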
The invention introduces the idea of sparse coding into NMF: nonnegative local coordinate factorization is applied to the high-dimensional sample feature matrix, the coefficient matrix obtained from the factorization is taken as the low-dimensional representation of that matrix, and cluster analysis is performed on this low-dimensional matrix, which makes the cluster analysis simple and effective. The dimension-reduced data also remain readily interpretable, and compared with prior-art dimensionality reduction methods the discriminative power of the cluster analysis is further improved.
Description of drawings
Fig. 1 is a flow chart of the steps of the clustering method of the invention.
Fig. 2(a) shows the accuracy curves of four clustering methods: K-means, NMF, NMF-SC, and the method of the invention.
Fig. 2(b) shows the normalized mutual information curves of the same four clustering methods: K-means, NMF, NMF-SC, and the method of the invention.
Embodiment
To describe the invention more specifically, the clustering method of the invention is explained in detail below with reference to the drawings and an embodiment.
As shown in Fig. 1, a clustering method based on nonnegative local coordinate factorization comprises the following steps:
(1) Construct the sample feature matrix.
This embodiment uses the ORL face data set as an example; its statistics are shown in Table 1.
Table 1: Statistics of the ORL face data set
Data set | Number of face images | Number of face classes | Number of image features |
ORL | 400 | 40 | 1024 |
The ORL face data set contains 400 face images of 40 different people, 10 images per person.
Instances of two classes from the ORL face data set are chosen as the original high-dimensional data set, and the corresponding sample feature matrix X is constructed. X is an m × n matrix, where n is the number of samples (the number of images) and m is the number of features per sample; each element of X is the value of one feature of one sample. Here n = 2 × 10 = 20 and m = 1024.
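One way to picture the construction of X is to flatten each image into a column. The sketch below assumes that the 1024 features are the pixels of a 32 × 32 grayscale image, which the text does not state; `build_feature_matrix` and the synthetic `images` are hypothetical stand-ins for the real data.

```python
def build_feature_matrix(images):
    """Stack flattened images as the columns of an m x n matrix (stored as a list of m rows)."""
    columns = [[float(px) for row in img for px in row] for img in images]
    m = len(columns[0])
    if any(len(col) != m for col in columns):
        raise ValueError("all images must have the same number of pixels")
    # Transpose so that row j holds feature j of every sample.
    return [[col[j] for col in columns] for j in range(m)]

# Synthetic stand-in for 20 grayscale images of 32 x 32 pixels (2 people x 10 images).
images = [[[(s + r + c) % 256 for c in range(32)] for r in range(32)] for s in range(20)]
X = build_feature_matrix(images)
print(len(X), len(X[0]))
```

With these inputs X has m = 1024 rows and n = 20 columns, matching the dimensions given above; all entries are nonnegative, as the factorization requires.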
(2) Iteratively compute the low-dimensional sparse matrix.
From the sample feature matrix X, the low-dimensional sparse matrix of the sample set is solved by the following nonnegative local coordinate factorization iterative algorithm:
[Equation group not reproduced in this text.]
Where: U is the m × k basis matrix, V is the k × n coefficient matrix, and k is the number of clusters; k = 2 in this embodiment. U^t and V^t are the basis matrix and the coefficient matrix after the t-th iteration, and U^0 and V^0 are the randomly initialized nonnegative basis matrix and coefficient matrix. u_{jp}^t is the element in row j, column p of U^t, and v_{pi}^t is the element in row p, column i of V^t; v_i^{t-1} is the i-th column vector of V^{t-1}, u_p^t is the p-th column vector of U^t, and x_i is the i-th column vector of X. μ is the sparsity factor; μ = 1 in this embodiment. l is the k-dimensional column vector whose elements are all 1. ρ is the convergence threshold; ρ = 10^-7 in this embodiment. C and D^t are k × n matrices: every row of C is c^T, with c = diag(X^T X), and every column of D^t is d^t, with d^t = diag((U^t)^T U^t).
When the iteration converges or the maximum number of iterations is reached, the corresponding V^t is the low-dimensional sparse matrix of the sample set; the maximum number of iterations is 200 in this embodiment.
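Because the iterative equations are not available in this text, the following is only a sketch of how such an iteration could look with the embodiment's parameters (μ = 1, ρ = 10^-7, at most 200 iterations). It assumes the published nonnegative local coordinate factorization objective and its multiplicative updates; the stopping rule based on the relative change of the objective, the function name `nlcf`, and the small constant `eps` are all assumptions of the sketch, and the random X stands in for the real face data.

```python
import numpy as np

def nlcf(X, k, mu=1.0, rho=1e-7, max_iter=200, eps=1e-12, seed=0):
    """Sketch: factorize nonnegative X (m x n) as U (m x k) times V (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps          # random nonnegative U^0
    V = rng.random((k, n)) + eps          # random nonnegative V^0
    c = np.sum(X * X, axis=0)             # c = diag(X^T X), length n

    def objective(U, V):
        d = np.sum(U * U, axis=0)         # d = diag(U^T U), length k
        dist = d[:, None] - 2.0 * (U.T @ X) + c[None, :]   # ||u_p - x_i||^2, k x n
        return np.linalg.norm(X - U @ V) ** 2 + mu * np.sum(V * dist)

    prev = objective(U, V)
    for _ in range(max_iter):
        h = V.sum(axis=1)                 # row sums of V, the diagonal of H
        U *= ((1.0 + mu) * (X @ V.T)) / (U @ (V @ V.T) + mu * U * h[None, :] + eps)
        d = np.sum(U * U, axis=0)
        C = np.tile(c, (k, 1))            # every row of C is c^T
        D = np.tile(d[:, None], (1, n))   # every column of D is d
        V *= (2.0 * (mu + 1.0) * (U.T @ X)) / (2.0 * (U.T @ U) @ V + mu * C + mu * D + eps)
        cur = objective(U, V)
        if abs(prev - cur) <= rho * max(prev, eps):   # assumed stopping rule
            break
        prev = cur
    return U, V

# Random nonnegative stand-in with the embodiment's dimensions (m = 1024, n = 20, k = 2).
X = np.random.default_rng(1).random((1024, 20))
U, V = nlcf(X, k=2)
print(U.shape, V.shape)
```

The multiplicative form keeps U and V nonnegative as long as the initial values are nonnegative, which is why no explicit projection step appears.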
(3) Cluster the low-dimensional sparse matrix.
The largest element of each column vector of the low-dimensional sparse matrix is located; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to class p.
Next, the number of clusters k is set in turn to 2, 4, 8, 12, 16, 20, 25, 30, and 40, and the clustering performance of four methods is compared: K-means clustering (without dimensionality reduction), NMF (nonnegative matrix factorization) clustering, NMF-SC (nonnegative matrix factorization with sparseness constraints) clustering, and the method of this embodiment. Two metrics are used: accuracy (AC) and normalized mutual information (NMI). The final results are shown in Table 2 and Fig. 2.
Accuracy measures the percentage of data that are correctly labeled:
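Since cluster indices are arbitrary, an accuracy of this kind is usually computed after mapping each cluster to a true class one-to-one. The sketch below assumes that convention (the text does not give the formula) and finds the best mapping by brute force over permutations, which is only practical for small k; the function name and the sample labels are illustrative.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of samples labeled correctly under the best one-to-one cluster-to-class mapping."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t)
        best = max(best, hits)
    return best / len(true_labels)

true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 0, 0]
ac = clustering_accuracy(true_labels, cluster_labels)
print(ac)
```

For larger k, such as the 40 classes used in the comparison, the permutation search would normally be replaced by an assignment algorithm such as Kuhn-Munkres.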
Normalized mutual information is an information measure of the correlation between two sets. Given two sets C and C':
[Equation not reproduced in this text.]
Where: p(c_i) and p(c'_j) denote the probabilities that a sample chosen arbitrarily from the data set belongs to c_i and to c'_j respectively, and p(c_i, c'_j) denotes the probability that it belongs to both at the same time; H(C) and H(C') denote the entropies of C and C' respectively.
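The quantities defined above can be estimated from label counts. The sketch below assumes the common definition in which the mutual information is divided by the larger of the two entropies; that normalization and the function name are assumptions, since the formula itself is not available in this text.

```python
from collections import Counter
from math import log2

def normalized_mutual_information(labels_a, labels_b):
    """NMI of two labelings of the same samples, normalized by max(H(C), H(C'))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # p(a, b) * log2( p(a, b) / (p(a) p(b)) ), written with raw counts.
    mi = sum((nab / n) * log2(nab * n / (count_a[a] * count_b[b]))
             for (a, b), nab in joint.items())
    h_a = -sum((c / n) * log2(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log2(c / n) for c in count_b.values())
    denom = max(h_a, h_b)
    # Convention chosen for this sketch: two trivial one-cluster labelings count as identical.
    return mi / denom if denom > 0 else 1.0

same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, independent)
```

Under this definition the value is 1 when the two labelings agree up to renaming and 0 when they are statistically independent.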
Table 2: Metric data for K-means, NMF, NMF-SC, and the method of this embodiment
As Table 2 and Fig. 2 show, compared with the three prior-art clustering methods, this embodiment markedly improves both the clustering quality and the discriminative power.
Claims (3)
1. A clustering method based on nonnegative local coordinate factorization, comprising the following steps:
(1) obtaining a sample set and constructing the sample feature matrix of the sample set;
(2) solving, from the sample feature matrix, the low-dimensional sparse matrix of the sample set by means of a nonnegative local coordinate factorization iterative algorithm;
(3) clustering the low-dimensional sparse matrix.
2. The clustering method based on nonnegative local coordinate factorization according to claim 1, characterized in that: in step (2), the low-dimensional sparse matrix of the sample set is solved through the following iterative equation group:
[Equation group not reproduced in this text.]
Where: X is the sample feature matrix, U is the basis matrix, and V is the coefficient matrix; U^t and V^t are the basis matrix and the coefficient matrix after the t-th iteration, and U^0 and V^0 are the randomly initialized nonnegative basis matrix and coefficient matrix; u_{jp}^t is the element in row j, column p of U^t, and v_{pi}^t is the element in row p, column i of V^t; v_i^{t-1} is the i-th column vector of V^{t-1}, u_p^t is the p-th column vector of U^t, and x_i is the i-th column vector of X; μ is the sparsity factor, l is the column vector whose elements are all 1, and ρ is the convergence threshold; C and D^t are matrices, wherein every row of C is c^T, with c = diag(X^T X), and every column of D^t is d^t, with d^t = diag((U^t)^T U^t); when the iteration converges or the maximum number of iterations is reached, the corresponding V^t is the low-dimensional sparse matrix of the sample set.
3. The clustering method based on nonnegative local coordinate factorization according to claim 1, characterized in that: in step (3), the low-dimensional sparse matrix is clustered as follows: the largest element of each column vector of the low-dimensional sparse matrix is located; if the largest element of the i-th column vector lies in row p, the sample corresponding to the i-th column vector belongs to class p.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011103946863A CN102495876A (en) | 2011-12-02 | 2011-12-02 | Nonnegative local coordinate factorization-based clustering method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102495876A true CN102495876A (en) | 2012-06-13 |
Family
ID=46187701
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011103946863A Pending CN102495876A (en) | 2011-12-02 | 2011-12-02 | Nonnegative local coordinate factorization-based clustering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102495876A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20120613 |