CN111930934B

CN111930934B - Clustering method based on constrained sparse concept factorization with dual local consistency

Info

Publication number: CN111930934B
Application number: CN202010507876.0A
Authority: CN
Inventors: 舒振球; 张云猛; 翁宗慧; 叶飞跃
Original assignee: Jiangsu University of Technology
Current assignee: Jiangsu University of Technology
Priority date: 2020-06-05
Filing date: 2020-06-05
Publication date: 2023-12-26
Anticipated expiration: 2040-06-05
Also published as: CN111930934A

Abstract

The invention discloses a clustering method based on constrained sparse concept factorization with dual local consistency, characterized by comprising the following steps: S10, acquiring samples to be clustered to form a sample data set to be clustered; S20, constructing an adjacency matrix for the sample data set to be clustered; S30, establishing an objective function $J_{DESCFS}$ based on concept factorization; S40, iterating a preset number of times using an iterative weighting method according to the objective function, and updating the basis matrix W, the label matrix A and the auxiliary matrix Z; and S50, performing cluster analysis on the coefficient matrix V using the K-Means clustering algorithm, where V = AZ. Compared with traditional clustering methods, the method reveals the intrinsic geometric structure and the discriminative structure of the data more effectively, which improves the clustering result.

Description

Clustering method based on constrained sparse concept factorization with dual local consistency

Technical Field

The invention relates to the technical field of text clustering, and in particular to a clustering method based on constrained sparse concept factorization with dual local consistency.

Background

Matrix factorization-based methods have attracted widespread attention in document clustering over the past decade. In such methods, a text document is typically represented as a point in a high-dimensional linear space, with one dimension per term. Central to every goal of cluster analysis is the notion of similarity between the objects being clustered. Research shows that similarity can be measured more accurately in a low-dimensional space, which improves clustering performance. The application of NMF (non-negative matrix factorization) to document clustering has achieved impressive results. In NMF, given a non-negative data matrix X, low-rank non-negative matrices U and V are sought such that UV provides a good approximation of X. How to perform NMF efficiently in a transformed data space, however, remains a major problem.

To address the limitations of NMF while inheriting all of its advantages, Xu and Gong proposed CF (concept factorization) for data clustering. CF models each cluster as a linear combination of the data points and each data point as a linear combination of the cluster centers. Data clustering is then accomplished by computing these two sets of linear coefficients, which are obtained as the non-negative solution that minimizes the reconstruction error of the data points. The main advantage of CF over NMF is that it can be applied to any data representation, whether in the original space or in a reproducing kernel Hilbert space (RKHS).

Many CF-based classification methods have been developed by extending classical CF in various ways (e.g., imposing additional constraints or incorporating supervisory information), essentially to learn low-dimensional discriminative features that are in turn fed into a typical classifier. However, these approaches ignore the dependency between the two processes, and the resulting low-dimensional features may not suit the classifier used, leading to suboptimal classification performance.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a clustering method based on constrained sparse concept factorization with dual local consistency, which effectively addresses the technical problem of the poor clustering results of existing clustering methods.

To achieve the above purpose, the invention adopts the following technical solution:

A clustering method based on constrained sparse concept factorization with dual local consistency comprises the following steps:

S10, acquiring samples to be clustered to form a sample data set to be clustered;

S20, constructing an adjacency matrix for the sample data set to be clustered;

S30, establishing an objective function $J_{DESCFS}$ based on concept factorization:

where $X = [x_1, x_2, \ldots, x_n]$ is the adjacency matrix constructed for the sample data set to be clustered; W is the basis matrix; $A = [A_l; A_u] = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^{n \times c}$ is the label matrix, in which $A_l \in \mathbb{R}^{w \times c}$ represents the classes of the labeled samples and $A_u \in \mathbb{R}^{(n-w) \times c}$ represents the classes of the unlabeled samples; $Z \in \mathbb{R}^{c \times d}$ is the auxiliary matrix; the position of the largest element of $a_i \in \mathbb{R}^{c \times 1}$ indicates the class to which sample $x_i$ belongs; $\mathbb{R}$ denotes the set of real numbers; $\alpha$ is the feature-space local consistency regularization parameter, $\beta$ is the class-space local consistency regularization parameter, $\lambda$ is the sparsity parameter, and $\mathrm{Tr}(\cdot)$ is the trace of a matrix; L is the Laplacian matrix of the weighted graph, $L = D - S$, where D is a diagonal matrix whose entries are the row (or column) sums of the edge weight matrix S, i.e. $D_{ii} = \sum_j S_{ij}$;

S40, iterating a preset number of times using an iterative weighting method according to the objective function, and updating the basis matrix W, the label matrix A and the auxiliary matrix Z;

and S50, performing cluster analysis on the coefficient matrix V using the K-Means clustering algorithm, where V = AZ.

Compared with traditional clustering methods, the clustering method based on constrained sparse concept factorization with dual local consistency combines local consistency in the feature space with prior class information to reveal the intrinsic geometry of the data, which gives the algorithm a degree of discriminative power. On this basis, it further combines local consistency in the class space with a sparsity constraint to reveal the intrinsic geometric structure and the discriminative structure of the data more effectively. By preserving both the class information of the samples and the intrinsic geometric manifold structure among them, the discriminative ability of the algorithm is enhanced, so that clustering performance is greatly improved.

Drawings

The invention and its attendant advantages and features will be more fully understood by reference to the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flow diagram of the clustering method based on constrained sparse concept factorization with dual local consistency according to the invention.

Detailed Description

To make the content of the invention clearer and easier to understand, it is further described below with reference to the accompanying drawings. The invention is of course not limited to this particular embodiment, and common alternatives known to those skilled in the art are also encompassed within its scope.

In the CF model, each basis vector $v_j$ is a non-negative linear combination of the data points $x_i$, i.e. $v_j = \sum_i w_{ij} x_i$, where the coefficients satisfy $w_{ij} \ge 0$. Let the basis matrix be $W = [w_{ij}] \in \mathbb{R}^{n \times k}$. The goal of the CF algorithm is to find two matrices W and V such that $X \approx XWV^T$. Thus, the objective function $J_{CF}$ of the CF algorithm can be expressed as formula (1):

where the data set is $X = [x_1, x_2, \ldots, x_n]$ and the coefficient matrix is $V = [v_1, v_2, \ldots, v_n]$.
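As a minimal sketch of how W and V might be computed, the following Python code applies the multiplicative update rules of Xu and Gong's concept factorization to the matrix $K = X^T X$, assuming the objective is the squared reconstruction error $\|X - XWV^T\|^2$ with non-negative data. The function names, the random initialization and the small constant `eps` are illustrative assumptions and are not taken from this document.

```python
import numpy as np

def concept_factorization(X, k, n_iter=200, eps=1e-10, seed=0):
    """Sketch of plain CF. X is (m, n) with samples as columns and
    non-negative entries. Returns W (n, k) and V (n, k), both non-negative."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    K = X.T @ X                      # (n, n) inner products between samples
    W = rng.random((n, k))           # assumed random non-negative start
    V = rng.random((n, k))
    for _ in range(n_iter):
        # Multiplicative updates keep W and V non-negative.
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)
    return W, V

def cf_error(X, W, V):
    """Squared Frobenius reconstruction error ||X - X W V^T||^2."""
    return np.linalg.norm(X - X @ W @ V.T) ** 2
```

Calling `concept_factorization(X, k)` on a non-negative matrix with one column per sample returns one row of V per sample, which can then be used as that sample's low-dimensional representation.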

According to the principle of local consistency, if $x_i$ and $x_j$ are two similar samples, then their corresponding low-dimensional representations $z_i$ and $z_j$ are also similar. Suppose the data set contains n samples in total; if the i-th sample is a neighbor of the j-th sample, an edge exists between them with weight $s_{ij}$. The edge weight matrix $S_{ij}$ is defined by formula (2):

where $N_p(x_j)$ denotes the set of the p nearest neighbors of sample $x_j$. The smoothness of the low-dimensional representation is measured on this p-nearest-neighbor graph. The smoothness of the data points in the low-dimensional space is typically measured by O, which can be expressed as formula (3):

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix; D is a diagonal matrix whose entries are the row (or column) sums of the edge weight matrix S (S is a symmetric matrix), i.e. $D_{ii} = \sum_j S_{ij}$; and L is the Laplacian matrix of the weighted graph, $L = D - S$.
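The graph construction described above can be sketched in Python as follows. Because the weighting formula itself is not shown here, the sketch assumes binary weights ($s_{ij} = 1$ when either sample is among the other's p nearest neighbors under Euclidean distance, and 0 otherwise). The function names are illustrative. The `smoothness` helper evaluates $\mathrm{Tr}(Z^T L Z)$, which by a standard identity equals $\tfrac{1}{2}\sum_{i,j} s_{ij}\,\|z_i - z_j\|^2$ when the $z_i$ are the rows of Z.

```python
import numpy as np

def knn_graph_laplacian(X, p):
    """X is (m, n) with samples as columns. Returns (S, D, L) for a
    symmetric p-nearest-neighbor graph with assumed 0/1 edge weights."""
    P = X.T                                    # (n, m), one row per sample
    n = P.shape[0]
    sq = np.sum(P ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * P @ P.T   # squared distances
    np.fill_diagonal(dist, np.inf)             # a sample is not its own neighbor
    nn = np.argsort(dist, axis=1)[:, :p]       # p nearest neighbors of each sample
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), p), nn.ravel()] = 1.0
    S = np.maximum(S, S.T)                     # edge if either is a neighbor of the other
    D = np.diag(S.sum(axis=1))                 # D_ii = sum_j S_ij
    L = D - S                                  # graph Laplacian
    return S, D, L

def smoothness(Z, L):
    """Tr(Z^T L Z) for representations stored as the rows of Z."""
    return np.trace(Z.T @ L @ Z)
```

A smaller value of `smoothness(Z, L)` means that samples joined by an edge have closer low-dimensional representations.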

The low-dimensional representation learned by concept factorization is further projected into the label space, as defined by formula (4):

$V = AZ$ (4)

where $A = [A_l; A_u] = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^{n \times c}$ is the label matrix, $A_l \in \mathbb{R}^{w \times c}$ represents the classes of the labeled samples, $A_u \in \mathbb{R}^{(n-w) \times c}$ represents the classes of the unlabeled samples, $Z \in \mathbb{R}^{c \times d}$ is the auxiliary matrix, and the position of the largest element of $a_i \in \mathbb{R}^{c \times 1}$ indicates the class to which sample $x_i$ belongs. Suppose two data samples $x_i$ and $x_j$ both belong to the k-th class. Then $a_i$ and $a_j$ both have a 1 in the k-th entry and 0 in the remaining entries, and by formula (4) the two samples have the same low-rank representation, i.e. $v_i = v_j$.
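A small numerical illustration of formula (4), assuming a fully one-hot label matrix: the class indices and the entries of Z below are made-up values chosen only to show that samples sharing a class receive identical rows of V.

```python
import numpy as np

labels = np.array([0, 2, 0, 1])      # hypothetical class index of four samples
c, d = 3, 2                          # number of classes, low dimension

A = np.eye(c)[labels]                # (n, c) label matrix, one-hot rows a_i
Z = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])           # (c, d) auxiliary matrix, illustrative values
V = A @ Z                            # (n, d) coefficient matrix, formula (4)

# Samples 0 and 2 are both in class 0, so their rows of V coincide.
same_class_same_row = np.array_equal(V[0], V[2])
# The position of the largest element of a_i recovers the class of sample i.
recovered = A.argmax(axis=1)
```

With a one-hot A, each row of V is simply the row of Z that corresponds to the sample's class.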

The label matrix A and the coefficient matrix V are the representations of the samples in the label space and in the low-dimensional feature space, respectively. To preserve the intrinsic geometry of the data samples, both representations can be further regularized with LE constraints, so that each sample is optimally reconstructed in the different spaces from the same linear combination of its local neighbors. The objective function $J_{DESCFS}$ then takes the form of formula (5):

where $\alpha$ is the feature-space local consistency regularization parameter, $\beta$ is the class-space local consistency regularization parameter, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.

Based on the above, as shown in FIG. 1, the invention provides a clustering method based on constrained sparse concept factorization with dual local consistency, characterized by comprising the following steps:

S10, acquiring samples to be clustered to form a sample data set to be clustered;

S20, constructing an adjacency matrix for the sample data set to be clustered;

S30, establishing an objective function $J_{DESCFS}$ based on concept factorization, as in formula (6):

where $X = [x_1, x_2, \ldots, x_n]$ is the adjacency matrix constructed for the sample data set to be clustered; W is the basis matrix; $A = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^{n \times c}$ is the label matrix; $Z \in \mathbb{R}^{c \times d}$ is the auxiliary matrix; $a_i \in \mathbb{R}^{c \times 1}$ is the representation of $x_i$ in the label space and indicates the class to which $x_i$ belongs; $\mathbb{R}$ denotes the set of real numbers; $\alpha$ is the feature-space local consistency regularization parameter, $\beta$ is the class-space local consistency regularization parameter, $\lambda$ is the sparsity parameter, and $\mathrm{Tr}(\cdot)$ is the trace of a matrix; L is the Laplacian matrix of the weighted graph, $L = D - S$, where D is a diagonal matrix whose entries are the row (or column) sums of the edge weight matrix S, i.e. $D_{ii} = \sum_j S_{ij}$;

S40, iterating a preset number of times using an iterative weighting method according to the objective function, and updating the basis matrix W, the label matrix A and the auxiliary matrix Z;

From the objective function $J_{DESCFS}$ it can be seen that the DESCFS algorithm is not jointly convex in the basis matrix W, the label matrix A and the auxiliary matrix Z, so the global optimum cannot be solved for. However, if two of the three variables W, A and Z are held fixed and only the third is varied, the objective function is convex, so an iterative method is adopted to find a local optimum of the objective function. Accordingly, the objective function $J_{DESCFS}$ can be reduced to a function $\Omega$, as in formula (7):

where $K = X^T X$.
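A reduction of this kind usually rests on rewriting the squared reconstruction error as a trace expression in K. The sketch below checks that identity numerically, assuming the reconstruction term has the form $\|X - XWV^T\|_F^2$ with $V = AZ$. The regularization and sparsity terms of the objective are not included, and all matrix sizes and values are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c, d = 6, 10, 3, 4             # arbitrary sizes for the check

X = rng.random((m, n))               # data, samples as columns
W = rng.random((n, d))               # basis matrix
A = np.eye(c)[rng.integers(0, c, size=n)]   # one-hot label matrix (n, c)
Z = rng.random((c, d))               # auxiliary matrix
V = A @ Z                            # coefficient matrix, V = AZ
K = X.T @ X                          # (n, n)

# Reconstruction error computed directly from X.
direct = np.linalg.norm(X - X @ W @ V.T, "fro") ** 2

# Same quantity written only in terms of K:
# Tr(K) - 2 Tr(K W V^T) + Tr(V W^T K W V^T)
M = W @ V.T
via_kernel = np.trace(K) - 2.0 * np.trace(K @ M) + np.trace(M.T @ K @ M)
```

Working with K rather than X means the data enter only through inner products between samples.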

Since $w_{ij} \ge 0$ and $z_{ij} \ge 0$, let $\Psi = [\psi_{ij}]$, $\Phi = [\phi_{ij}]$ and $\gamma = [\gamma_{ij}]$; the Lagrangian function La can then be expressed as formula (8):

$La = \Omega + \mathrm{Tr}(\Psi W^T) + \mathrm{Tr}(\Phi Z^T) + \mathrm{Tr}(\gamma A^T)$ (8)

where $\Psi$ is the Lagrange multiplier for W, $\Phi$ is the Lagrange multiplier for Z, and $\gamma$ is the Lagrange multiplier for A.

Taking the partial derivatives with respect to the basis matrix W, the label matrix A and the auxiliary matrix Z, respectively, yields formulas (9) to (11):

With the intermediate variable $K = X^T X$ and a second intermediate variable (formula not reproduced), the iterative update rules for the basis matrix W, the label matrix A and the auxiliary matrix Z are obtained from the KKT conditions, where

the update rule of the basis matrix W is given by formula (12):

the update rule of the label matrix A is given by formula (13):

the update rule of the auxiliary matrix Z is given by formula (14):

Finally, cluster analysis is performed on the coefficient matrix V using the K-Means clustering algorithm, thereby clustering the samples to be clustered.
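A minimal sketch of this final step, assuming A and Z are already available: it forms V = AZ and groups the rows of V with a plain NumPy implementation of Lloyd's K-Means iteration. The `kmeans` helper and the toy values of A and Z are illustrative assumptions; a library implementation of K-Means could be substituted.

```python
import numpy as np

def kmeans(V, k, n_iter=100, seed=0):
    """Lloyd's algorithm on the rows of V. Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(len(V), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every row of V to every center.
        d2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = np.array([
            V[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy example: six samples in two classes (made-up values).
truth = np.array([0, 0, 1, 1, 0, 1])
A = np.eye(2)[truth]                       # label matrix
Z = np.array([[0.0, 0.0], [10.0, 10.0]])   # auxiliary matrix
V = A @ Z                                  # coefficient matrix
labels, centers = kmeans(V, k=2)
```

The cluster indices returned by K-Means are arbitrary, so a result should be compared with known classes only up to a relabeling of the clusters.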

The foregoing is merely a preferred embodiment of the invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principle of the invention, and such modifications and adaptations are intended to fall within the scope of the invention.

Claims

1. A clustering method based on constrained sparse concept factorization with dual local consistency, used for text clustering, characterized by comprising the following steps:

S30, establishing an objective function $J_{DESCFS}$ based on concept factorization;

where $X = [x_1, x_2, \ldots, x_n]$ is the adjacency matrix constructed for the sample data set to be clustered; W is the basis matrix; $A = [A_l; A_u] = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^{n \times c}$ is the label matrix, in which $A_l \in \mathbb{R}^{w \times c}$ represents the classes of the labeled samples and $A_u \in \mathbb{R}^{(n-w) \times c}$ represents the classes of the unlabeled samples; $Z \in \mathbb{R}^{c \times d}$ is the auxiliary matrix; the position of the largest element of $a_i \in \mathbb{R}^{c \times 1}$ indicates the class to which sample $x_i$ belongs; $\mathbb{R}$ denotes the set of real numbers; $\alpha$ is the feature-space local consistency regularization parameter, $\beta$ is the class-space local consistency regularization parameter, $\lambda$ is the sparsity parameter, and $\mathrm{Tr}(\cdot)$ is the trace of a matrix; L is the Laplacian matrix of the weighted graph, $L = D - S$, where D is a diagonal matrix whose entries are the row (or column) sums of the edge weight matrix S, i.e. $D_{ii} = \sum_j S_{ij}$;

2. The clustering method according to claim 1, wherein in step S40 the update rules of the basis matrix W, the label matrix A and the auxiliary matrix Z are respectively:

the update rule of the basis matrix W is:

the update rule of the label matrix A is:

the update rule of the auxiliary matrix Z is:

where $K = X^T X$,