CN112417234A

CN112417234A - A data clustering method and apparatus, and computer-readable storage medium

Info

Publication number: CN112417234A
Application number: CN201910784526.6A
Authority: CN
Inventors: 赵剑; 邱思远
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2021-02-26
Anticipated expiration: 2039-08-23
Also published as: CN112417234B

Abstract

Embodiments of the present invention disclose a data clustering method and device, and a computer-readable storage medium. The data clustering method includes: receiving and converting an original data set; Weight matrix; according to the low-rank dictionary and weight matrix, determine the representation coefficient corresponding to the original data set; establish a similarity matrix corresponding to the original data set according to the representation coefficient; based on the similarity matrix, use spectral clustering to obtain the original data set corresponding Therefore, the ideal clustering effect can be obtained and the clustering performance can be effectively improved.

Description

Data clustering method and device and computer readable storage medium

Technical Field

The present invention relates to data detection technologies, and in particular, to a data clustering method and apparatus, and a computer-readable storage medium.

Background

When the data set of the high-dimensional data is clustered, the high-dimensional data from different subspaces can be divided into respective low-dimensional subspaces according to the potential subspace structure of the data set, and the different subspaces correspond to different categories. In many fields, subspace clustering algorithms are widely used, wherein linear representation-based subspace clustering algorithms represented by Sparse Subspace Clustering (SSC), Low rank representation (Low rank representation) subspace clustering (LRR) and Least Squares Regression (LSR) subspace clustering algorithms have attracted extensive interest of researchers due to the simplicity of the algorithms and the effectiveness of high-dimensional data clustering.

At present, the commonly used subspace clustering algorithm based on linear representation is obtained by l₁The norm, the kernel norm or the F-norm constraint representation coefficient is used to obtain a representation coefficient Z with a block diagonal structure, however, the obtained representation coefficient Z is usually insufficient due to the single norm constraint representation coefficient Z, so that the final clustering result is not ideal and the clustering performance is low.

Disclosure of Invention

To solve the above technical problems, embodiments of the present invention desirably provide a data clustering method and apparatus, and a computer-readable storage medium,

in order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:

the embodiment of the invention provides a data clustering method, which comprises the following steps:

receiving and converting an original data set;

determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;

determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;

establishing a similarity matrix corresponding to the original data set according to the representation coefficients;

and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.

The data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.

Drawings

FIG. 1 is a basic framework of a subspace clustering algorithm based on linear representations;

fig. 2 is a first schematic flow chart illustrating an implementation process of a data clustering method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a partial relationship;

fig. 4 is a schematic diagram illustrating a second implementation flow of a data clustering method according to an embodiment of the present application;

fig. 5 is a schematic diagram of a first structural component of a data clustering device according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of a data clustering device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant application and are not limiting of the application. It should be noted that, for the convenience of description, only the parts related to the related applications are shown in the drawings.

With the rapid development of information technology, data is ubiquitous in our daily life, the huge scale and the complex structure of the data bring many challenges to data processing, and how to effectively mine valuable information from the data becomes a big problem. With the introduction of a classical clustering algorithm, the clustering algorithm can effectively solve the problem of low-dimensional data clustering, but the application environment is changed day by day, high-dimensional data can be seen everywhere in work and life, the dimensionality of various image data, video data and text data is often as high as ten thousands of dimensions, for example, a picture shot by a smart phone can reach tens of thousands of pixels, and the traditional clustering algorithm cannot obtain an ideal result when the problem of high-dimensional data clustering is processed. The main problems faced by high-dimensional data clustering are: data distribution in a high-dimensional space is sparser than data distribution in a low-dimensional space, distances among the data are almost equal, and some irrelevant attributes exist in the data, so that clustering cannot be realized according to the distance relation among the data in the high-dimensional space generally. The subspace clustering algorithm is an extension of the conventional clustering algorithm, and high-dimensional data from different subspaces are divided into respective low-dimensional subspaces according to the potential subspace structure of a data set, wherein the different subspaces correspond to different categories. Subspace clustering algorithms are widely used in many fields, for example: image clustering, motion segmentation, etc. Currently, among subspace clustering algorithms, a subspace clustering algorithm based on linear representation is a research hotspot in the field due to the superior clustering performance of the subspace clustering algorithm.

Subspace clustering algorithms based on linear representations expect to better construct the similarity matrix by exploiting global information between data points. Linear representation-based subspace clustering algorithms, represented by Sparse Subspace Clustering (SSC), Low rank representation subspace clustering (LRR) and least squares regression subspace clustering (LSR), have attracted extensive interest to researchers due to the simplicity of their algorithms and the effectiveness of high-dimensional data clustering. The algorithm does not need to know the dimension of the subspace, the self-expression of the data is utilized to obtain the expression coefficient of each data point, the obtained expression coefficient is used for establishing a similarity matrix, and the similarity matrix is applied to spectral clustering to obtain a clustering result.

SSC algorithm under the assumption of linear representation by l₁Norm minimization forces the sparsity of the representation coefficient matrix to be zero between classes and sparse within classes. The LRR algorithm is able to well group together highly correlated data by minimizing the kernel norm to reveal the lowest rank representation of the global structure of the data. And when processing data containing noise and serious pollution, good robustness can be obtained. The LSR algorithm uses the F-norm to constrain the representation coefficients so that there is a grouping effect between the coefficients, maintaining the aggregate performance of the correlated data. Under the assumption of subspace independence, the representation matrix obtained by the LSR algorithm has a block diagonal structure. When the data points are insufficient, the obtained matrix of the representative coefficients also has a block diagonal structure under the assumption that the subspaces are orthogonal. Meanwhile, the objective function of the LSR algorithm can solve the analytic solution, so that the iterative solution process is avoided, and the time complexity of the algorithm is greatly reduced. Fig. 1 is a basic framework of a subspace clustering algorithm based on linear representation, and as shown in fig. 1, the subspace clustering algorithm based on linear representation mainly performs linear representation on an input data set to obtain a representation coefficient, then constructs a similarity matrix according to the representation coefficient, and performs spectral clustering by using the similarity matrix obtained by the construction, so as to obtain a clustering result.

Classical subspace clustering algorithm based on linear representation, by l₁Norm, normThe number or F-norm constraint representation coefficients to find the representation coefficients Z with a block diagonal structure, whereas a single norm constraint representation coefficient Z, which typically has a deficiency, such as the SSC algorithm by minimizing l₁Norm to obtain the sparsest representation of the samples as a coefficient matrix, minimizing l if the data from the same subspace has high correlation₁Norm, which usually randomly selects a small number of data points for linear representation, while ignoring other relevant data points, the coefficient matrix obtained does not guarantee the connection between the data points within the class, and thus, although the SSC algorithm can construct a sparse similarity matrix, it may not achieve satisfactory results. The LRR algorithm finds the lowest rank representation between the high dimensional data, and can obtain the global structure of the data. The LRR algorithm solves the optimization problem using a minimized kernel norm instead of rank minimization. Although the low-rank representation clustering algorithm can obtain a representation coefficient matrix with good block diagonal properties, the algorithm only focuses on the constraint of global rank, so that the final representation coefficient matrix lacks sparsity, a large number of nonzero elements still exist in inter-class representation coefficients, and the intra-class representation coefficients have large difference, so that the final clustering result is not ideal enough.

In order to overcome the defects of the classical subspace clustering algorithm based on linear representation, a non-negative low-rank sparse graph is used for semi-supervised learning to use l₁The norm and the kernel norm are simultaneously introduced into the objective function so as to achieve the effect of eliminating the representation coefficient which is too dense among classes. The low-rank representation algorithm with the structured constraint adds the structured sparse constraint in the low-rank representation subspace clustering algorithm, so that the algorithm can better represent coefficients among sparse classes, and can process more general subspace distribution structures. The smooth representation clustering restrains the representation coefficients through the local relation between data, so that the in-class representation coefficients tend to be smooth, and ideal clustering quality is obtained.

The data clustering method can utilize a smooth low-rank representation subspace clustering algorithm (SSLRR) to introduce local similarity constraint into an LRR target function, improve intra-class consistency of representation coefficients through a local relation between data points, and introduce Structured sparse constraint into the target function to increase inter-class sparsity of the representation coefficients. In order to enable the algorithm to better process data containing noise, the algorithm firstly obtains a low-rank structure dictionary through a low-rank recovery technology for linearly representing an original data set, so that the robustness of the algorithm for processing the noise data is improved, and meanwhile, higher clustering performance can be obtained.

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

Example one

An embodiment of the present invention provides a data clustering method, and fig. 2 is a schematic view illustrating an implementation flow of the data clustering method provided in the embodiment of the present application, as shown in fig. 2, in an embodiment of the present invention, a method for performing data clustering by a data clustering device may include the following steps:

step 101, receiving and converting an original data set.

In an embodiment of the present application, the data clustering device may receive the original data set first, and perform dimension conversion on the original data set after receiving the original data set.

Further, in an embodiment of the present application, the original data set may be high-dimensional data, for example, the original data set may be Extended Yale B face data set, Augmented Reality (AR) face data set, or high-dimensional data such as hand-written digital data set.

It should be noted that, in the embodiment of the present application, the data clustering device may be a device integrated with a data clustering algorithm, and the data clustering device may be used to perform clustering, analysis, and experiments on a data set. For example, the data clustering means may be installed with a subspace clustering application, for example, the data clustering means may be installed with a face clustering application or a handwritten digit clustering application.

Further, in embodiments of the present application, the raw data set may be a high dimensional data set, e.g., a raw data setX＝[x₁,x₂,...,x_n]∈R^m×nWhere each column represents a sample of data, n represents the number of data, m represents the dimension of the data, x_iRepresenting the ith sample in the dataset.

It should be noted that, in the embodiment of the present application, after receiving the original data set, the data clustering device may perform dimensionality reduction processing on the original data set, so as to perform dimensionality conversion on the original data set. Specifically, when the data clustering device performs the dimensionality reduction process on the original data set, the dimensionality of the data can be reduced to 6 × k dimensions by Principal Component Analysis (PCA), where k represents a category parameter.

And 102, determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set.

In the implementation of the present application, after receiving and converting the original data set, the data clustering device may determine the low-rank dictionary and the weight matrix corresponding to the original data set according to the original data set.

It should be noted that, in the embodiment of the present application, the original data set received by the data clustering device may carry random noise, that is, data contaminated by noise may exist in the original data set. In order to better handle the problem of noisy data clustering, the data clustering device may use Robust Principal Component Analysis (RPCA) to recover a discriminative low rank dictionary from the original data set.

Further, in the embodiment of the present application, the data clustering device may extract the low rank dictionary from the original data set according to a first objective function, wherein the first objective function may be used for denoising the original data set, specifically, the expression of the first objective function is shown in formula (1),

min_A,E‖A‖_*+γ‖E‖₁ s.t.X＝A+E (1)

wherein | A |_*Represents the kernel norm, | E | of the matrix₁L representing a matrix₁Norm, in particular, first orderThe calibration function can be solved by using an inaccurate Lagrange multiplier algorithm, and finally the low-rank dictionary A is obtained.

Further, in the embodiment of the present application, the data clustering device may further obtain a weight matrix corresponding to the original data set according to the original data set. The weight matrix may include a first weight matrix and a second weight matrix. Specifically, the first weight matrix is used for reducing the representation coefficient; the second weight matrix is used for representing the local relation of the data in the original data set in the original space.

It should be noted that, in the embodiment of the present application, the data clustering device may involve the first weight matrix and the second weight matrix in a third objective function used when clustering the original data set, and therefore, the data clustering device may determine the first weight matrix and the second weight matrix according to the original data set.

Further, in the embodiment of the present application, the weight values in the first weight matrix can be obtained by formula (2),

wherein, W_ijAre the weight values in the first weight matrix,

and

are respectively data points x_iAnd x_jThe matrix B is defined according to equation (3),

the parameter σ is the average of all elements in the matrix B. The first weight matrix can be defined by formula (2), so that the weights between data points in different subspaces in the original data set can be definedWith a larger value, the weight value between data points in the original data set at the same subspace tends to zero, which in turn can be reduced by minimizing the data item | W |₁To better reduce the significand coefficient between data points in different subspaces, wherein, as a Hadamard product, in the embodiment of the present application, it is defined that H | W | Z |₁。

Further, in the embodiment of the present application, in order to better characterize the local relationship between the data points in the original data set, the data clustering device may determine the local relationship between the data points through a Local Linear Embedding (LLE) graph. First determine each data point x_iK of (d) and then using data point x_iK near neighbor point pair x_iPerforming linear reconstruction, solving weight value by using minimized reconstruction error, and obtaining weight value M in second weight matrix_ijRepresenting the contribution of the jth data point to the reconstruction of the ith data point, the closer the two data points are, the greater the weight between the two data points. For example, fig. 3 is a schematic diagram of a local relationship, and in a high-dimensional space, when a neighbor point K is 3, a data point x_iWith 3 neighboring points x_j、x_k、x_lThe linear reconstruction relationship therebetween is shown in FIG. 3, wherein W_ij、W_ik、W_ilAre data points x, respectively_iAnd x_j、x_k、x_lThe weight value in between. Based on two constraints: (1) each data point is linearly reconstructed by K nearest neighbor data points when a certain data point x_jK neighbors not belonging to a data point, M_ij0; (2) the sum of the reconstruction weight coefficients of each data point is 1, the second objective function of the data clustering device for solving the second weight matrix can be expressed as formula (4),

where n represents the number of data points, Q_iRepresents each data point x_iThe subscript set of K neighbors of (1) defines equation (5),

V_jk＝(x_i-x_j)^T(x_i-x_k) (5)

then, M_ijCan be expressed as the formula (6),

further, in the embodiment of the present application, the data clustering device may determine the second weight matrix according to equation (6), specifically, the second weight matrix may be a symmetric non-negative weight matrix, for example, the second weight matrix M may be represented by equation (7),

it should be noted that, in the embodiment of the present application, after receiving the original data set, the data clustering device may determine, based on the original data set, the low-rank dictionary, the first weight matrix, and the second weight matrix according to the above formula (1) and the value formula (7), so as to continue to determine the representation coefficients according to the low-rank dictionary and the weight matrix.

And 103, determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.

In the embodiment of the application, after determining the low-rank dictionary and the weight matrix corresponding to the original data set according to the original data set, the data clustering device may further determine the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.

It should be noted that, in the embodiment of the present application, the LRR algorithm can well obtain the global structure of the data through the low rank criterion, but the inter-class representation coefficients generate a large number of non-zero elements, thereby affecting the accuracy of the clustering. In the embodiments of the present application, l may be₁The norm is introduced into the LRR objective function, i.e. into a third objective function for clustering, i.e. the third objective function may be an objective function corresponding to the LRR algorithm, so that it is possible to usel₁Norm-improvement represents sparsity of coefficients. Specifically, the third objective function can be expressed according to equation (8),

min_Z,E‖Z‖_*+β‖Z‖₁+γ‖E‖₁ s.t.X＝AZ+E (8)

where β, γ are used to balance the effects of low rank, sparse and noise terms. In particular, in embodiments of the present application, structured sparse constraint term minimization is superior to criterion/₁Norm minimization, so equation (8) can be converted to equation (9) to represent the third objective function,

min_Z,E‖Z‖_*+βH+γ‖E‖_2,1 s.t.X＝AZ+E (9)

wherein, W is the first weight matrix in the weight matrix. In order to be able to better obtain the local relationship of the data in the raw data set, it can be assumed that if data point x is obtained_iAnd x_jIf the data distribution is similar in the potential geometry, then the two data points are also similar when embedded or projected into a new space, and specifically, the data clustering device may first define L ═ D-M as a laplacian matrix and D as a degree matrix

Then, the formula (9) is converted by the formula laplace matrix, and the converted third target function is obtained as shown in the formula (10), that is, in mathematics, the assumed relationship can be expressed as the formula (10),

wherein M is a second weight matrix in the weight matrix, reflecting the local relationship of the data in the original data set in the original space, z_iAnd z_jAre respectively data points x_iAnd x_jThe corresponding representing coefficients. The formula (9) and the formula (10) are fused, and the expression coefficient is restricted through the local relation between data points, so that the in-class expression coefficient tends to be smooth, the improvement of the final clustering accuracy is promoted, and the conversion is realizedThe latter third objective function can be expressed by equation (11),

where α is used to balance the effects of the regularization term of the graph with the other three terms.

Further, in the embodiment of the present application, in order to effectively solve the above equation (11), the data clustering apparatus may use an alternating direction multiplier algorithm to iteratively solve the equation (11). Specifically, the data clustering device can introduce the preset auxiliary variable J, T E R^n×nThe above equation (11) can be converted into equation (12),

using lagrange multiplier reconstruction equation (12), equation (13) can be obtained,

wherein, Y^A、Y^BAnd Y^CRepresents a lagrange multiplier and mu represents a penalty parameter to control the convergence of the third objective function.

It should be noted that, in the embodiment of the present application, the data clustering device may utilize singular value soft threshold operation based on Y^CZ, updating and iterating J; also, the data clustering means may operate using a shrink threshold based on Y^bZ, updating and iterating T; further, the data clustering device can also use a Bartels-Stewart algorithm to solve, iteration is carried out based on a low-rank dictionary, and in the iteration process, the representation coefficient Z has a unique solution, so that the optimal value of the representation coefficient can be obtained.

And 104, establishing a similarity matrix corresponding to the original data set according to the representation coefficients.

In the embodiment of the application, after determining the representation coefficients corresponding to the original data set according to the low-rank dictionary and the weight matrix, the data clustering device may establish the similarity matrix corresponding to the original data set according to the representation coefficients.

It should be noted that, in the embodiment of the present application, after obtaining the representation coefficients, the data clustering device may construct the similarity matrix according to the representation coefficients, specifically, the data clustering device may establish the similarity matrix according to equation (14),

it should be noted that, in the embodiment of the present application, the similarity matrix determined by the data clustering device according to the formula (14) may be used for performing spectral clustering on the original data set.

And 105, based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.

In the embodiment of the application, after the data clustering device establishes the similarity matrix corresponding to the original data set according to the representation coefficient, the clustering result corresponding to the original data set can be obtained by utilizing spectral clustering based on the similarity matrix.

Further, in the embodiment of the present application, after performing dimensionality reduction processing on the original data set, the data clustering device may further determine a category parameter corresponding to the original data set.

It should be noted that, in the embodiment of the present application, after determining the similarity matrix, the data clustering device may further determine the normalized symmetric laplacian matrix according to the similarity matrix, then may obtain K eigenvectors in the normalized symmetric laplacian matrix according to the category parameter K of the original data set, and perform normalization processing on the target matrix formed by the K eigenvectors, and then may use a K-means clustering algorithm on the normalized target matrix, and may finally output the class allocation of the original data set, that is, obtain the clustering result corresponding to the original data set.

In the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.

Example two

Based on the first embodiment, in another embodiment of the present application, when the data clustering device solves the converted third objective function, that is, when the equation (11) is solved, the data clustering device may iteratively solve the converted third objective function according to preset auxiliary variables to obtain the expression coefficient.

Further, in the embodiment of the present application, the data clustering device may introduce the preset auxiliary variable J, T ∈ R^n×nAnd reconstructing the data by using an augmented Lagrange multiplier method after introducing the preset auxiliary variable to obtain the formula (13), and then sequentially updating the preset auxiliary variable J, the preset auxiliary variable T, Z, E, the Lagrange multiplier and the mu to obtain the optimal expression coefficient Z^*。

In the embodiments of the present application, it is exemplified that X is [ X ] for the original data set₁,x₂,...,x_n]∈R^m×nWhen the determination of the representation coefficients is performed, the smooth low-rank representation subspace clustering algorithm proposed by the data clustering device may include the following steps:

step 201, initializing variables.

Setting the maximum iteration number maxter as 1000, the current iteration number k as 0, initializing Z as J as T as 0, E as 0, Y^A＝0，Y^B＝Y^C＝0，μ＝10^-6，max_μ＝10¹⁰，ρ＝1.1，ε＝10^-8. Wherein | Z-J | Y_∞>Epsilon or Z-T Y phosphor_∞>Epsilon or | | X-AZ-E | | non-woven phosphor_∞>ε。

And step 202, updating a preset auxiliary variable J.

The fixed other variables update the preset auxiliary variable J,

specifically, when updating variable J, the singular value soft threshold operation is utilized to make

Performing singular value decomposition on P, and SVD (P) ([ U, Sigma, V)]And thresholding the singular value matrix sigma: g_τ(∑)＝diag((σ_i-τ)₊) Where σ is_iIs the main diagonal element of sigma and is also the singular value of the matrix P, tau is the threshold value, take

G_τ(∑) denotes: if the diagonal element σ is_iIf it is larger than τ, take σ_i＝σ_iτ, else σ_i0. The optimal solution of the final J per iteration is that J is UG_τ(∑)V^T。

And step 203, updating the preset auxiliary variable T.

The fixed other variables update the preset auxiliary variable T,

specifically, when updating the variable T, a shrink threshold operation is utilizedLet us order

In this case, the variable T may be expressed as T ═ S_ε(Q), for each element T in T_ijThe following relationship of formula (15) is satisfied:

and step 204, updating the variable Z.

Updating the variable Z by fixing other variables, and specifically, solving the equation mu A by using a Bartels-Stewart algorithm when updating the variable Z^TAZ+αZ(2I+L)+(-A^TY^A+Y^B+Y^C+μ(A^TE-A^TX-J-T))＝0。A^TA is a semi-positive definite matrix, so for A^TArbitrary characteristic value p of A_iSatisfies p_i≧ 0, 2I + L is the positive definite matrix, so for any eigenvalue μ of 2I + L_iSatisfies mu_i>0. Because for any characteristic value p_iAnd mu_iSatisfies p_i+μ_i>0, in the iteration process, the variable Z has a unique solution.

And step 205, updating the variable E.

The other variables are fixed to update the variable E, where E satisfies the following equation (16):

specifically, when the variable E is updated, it is set

u_iEach column of E, representing each column of matrix U, satisfies the condition of the following equation (17):

and step 206, updating the Lagrange multiplier.

For Lagrange multiplier Y^A、Y^BAnd Y^CAnd (6) updating. In particular, may be according to Y^A＝Y^A+μ(X-AZ-E)、Y^B＝Y^B+ mu (Z-T) and Y^C＝Y^C+ mu (Z-J) for Y^A、Y^BAnd Y^CAnd (6) updating.

And step 207, updating the penalty parameter mu.

In terms of μ ═ min (ρ μ, max)_μ) The penalty parameter is updated.

Step 208, let k equal to k +1, repeat the above steps 202 to 207 until the optimal expression coefficient Z is output^*。

In the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result. EXAMPLE III

Based on the first embodiment and the second embodiment, in a further embodiment of the present application, fig. 4 is a schematic diagram illustrating an implementation flow of a data clustering method provided in the embodiment of the present application, as shown in fig. 4, a method for obtaining a clustering result corresponding to an original data set by using spectral clustering based on a similarity matrix by a data clustering device may include the following steps:

and 301, calculating to obtain a normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix.

In the embodiment of the application, after the data clustering device determines the similarity matrix, the original data set can be clustered according to a normalized symmetric spectral clustering algorithm.

Further, in the embodiment of the present application, the data clustering device may first obtain the normalized symmetric laplacian matrix corresponding to the original data set according to the similarity matrix. For example, based on the similarity matrix C obtained by the above equation (14), a normalized symmetric laplacian matrix L corresponding to the original data set is obtained by calculation_sym。

And 302, forming a target matrix according to the class parameters and the normalized symmetrical Laplace matrix.

In the embodiment of the application, after the data clustering device obtains the normalized symmetric laplacian matrix according to the similarity matrix, the target matrix can be further constructed by combining the class parameters corresponding to the original data set.

It should be noted that, in the embodiment of the present application, when the type parameter is k, the data clustering device may first calculate the laplacian matrix L_symThe first k eigenvectors u₁,u₂,…,u_kThen according to k eigenvectors u₁,u₂,…,u_kForm an object matrix U ═ U₁,u₂,…,u_k]∈R^n×k。

Step 303, normalizing the target matrix to obtain a normalized target matrix.

In the embodiment of the application, after the data clustering device constructs the target matrix according to the class parameters and the normalized symmetric laplacian matrix, the target matrix may be normalized, so that the normalized target matrix may be obtained. Specifically, the data clustering device may normalize the target matrix U by rowsTo the normalized target matrix T ∈ R^n×k。

And 304, clustering the normalized target matrix to obtain a clustering result corresponding to the original data set.

In the embodiment of the application, the data clustering device may perform the clustering process on the normalized target matrix after performing the normalization process on the target matrix to obtain the normalized target matrix, so as to obtain the clustering result corresponding to the original data set.

Further, in the embodiment of the present application, the data clustering device may classify each row q in the normalized target matrix T_i∈R^kIs regarded as R^kAnd (3) one point in the space is subjected to a K-means clustering algorithm, so that a clustering result corresponding to the original data set can be obtained.

Example four

Based on the first to third embodiments, the data clustering device performs clustering processing on the original data set according to the SSLRR to obtain a corresponding clustering result, and in order to verify the clustering effect of the SSLRR, the embodiments of the present application propose the following two proof ways from a theoretical perspective.

The first method is as follows: the optimal solution for SSLRR has a block diagonal structure.

For the problem of equation (18) without considering noise:

given a set of m-dimensional datasets, X ═ X₁,x₂,...,x_n]＝[X₁,X₂,…,X_k]∈R^m×nAnd data set X is taken from k independent linear subspaces

Wherein X_iIs m × n_iEach column of which is from the same subspace S_iAnd n is₁+n₂+…+n_i＝n，Z^*Is the optimal solution to the minimization problem (18), then the coefficient Z is represented^*Has a block diagonal structure.

Suppose Z^*Is the optimal solution of the objective function (18), defining a formula (19),

and Z^C＝Z^*-Z^D，Z^CIs greater than or equal to 0, and Z is the orthogonality assumption of subspace^DIs also a feasible solution to the objective function (17), and is derived from the kernel-norm nature of the matrix, | | Z^*||_*≥||Z^D||_*. From Z^CMore than or equal to 0, tr (Z) can be deduced^*LZ^*T)＝tr((Z^D+Z^C)L(Z^D+Z^C)^T)≥tr(Z^DL(Z^D)^T) Since the weight matrix W is a non-negative momentArray, therefore, for H, one can obtain:

wherein L ═ W-^D‖₁From | | | Z^*||_*≥||Z^D||_*、tr(Z^*LZ^*T)≥tr(Z^DL(Z^D)^T) And | W | Z^*‖₁≥‖W⊙Z^D‖₁It can be deduced that:

||Z^*||_*+tr(Z^*LZ^*T)+‖W⊙Z^*‖₁≥||Z^D||_*+tr(Z^DL(Z^D)^T)+L (21)

and because of Z^*Is the optimal solution of equation (18), and therefore, | | Z can be obtained^*||_*+tr(Z^*LZ^*T)+‖W⊙Z^*‖₁＝||Z^D||_*+tr(Z^DL(Z^D)^T)+L，Z^C0, to obtain Z^*＝Z^DTherefore, the optimal solution Z of equation (18) has a block diagonal structure.

The second method comprises the following steps: and (5) analyzing time complexity.

For a data set X ═ X₁,x₂,...,x_n]∈R^m×nIn the above step 101, the time complexity of recovering a low rank dictionary A using RPCA is O (t)₁n³)，t₁Representing the number of iterations of the algorithm. The updating J, T, E and the Lagrangian multiplier Y in the above steps 202 to 207^A、Y^B、Y^cRespectively, is O (n)³)、O(n²)、O(mn²)、O(mn²)、O(n²)、O(n²) When updating Z, the Bartels-Stewart algorithm is used to solve the Sylvester equation, so the time complexity is O (n)³) Therefore, the overall time complexity in the above steps 202 to 207 is O (3 t)₂n²+2t₂mn²+2t₂n³) If m is<n, timeComplexity O (2 t)₂n³)，t₂Representing the number of iterations of the alternating direction multiplier algorithm. Step 105 spectral clustering has an overall temporal complexity of O (n)³). Therefore, the temporal complexity of the SSLRR algorithm proposed in this chapter is O ((t)₁+2t₂+1)n³)。

EXAMPLE five

Based on the first to fourth embodiments, fig. 5 is a schematic structural diagram of a data clustering device according to an embodiment of the present application, as shown in fig. 5, in an embodiment of the present invention, a data clustering device 1 includes a receiving unit 11, a converting unit 12, a determining unit 13, an establishing unit 14, and an obtaining unit 15,

the receiving unit 11 is configured to receive an original data set.

The conversion unit 12 is configured to convert the original data set.

The determining unit 13 is configured to determine, according to the original data set, a low-rank dictionary and a weight matrix corresponding to the original data set; and determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.

The establishing unit 14 is configured to establish a similarity matrix corresponding to the original data set according to the representation coefficient.

The obtaining unit 15 is configured to obtain a clustering result corresponding to the original data set by using spectral clustering based on the similarity matrix.

Further, in the embodiment of the present application, the converting unit 12 is specifically configured to perform dimensionality reduction processing on the original data set after receiving the original data set.

Further, in an embodiment of the present application, the determining unit 13 is specifically configured to determine the low-rank dictionary from the original data set according to a first objective function; the first objective function is used for denoising the original data set; or, the determining unit 13 is further specifically configured to obtain a third objective function according to the first weight matrix; obtaining a Laplace matrix according to the second weight matrix; converting the third objective function according to the Laplace matrix to obtain a converted third objective function; and solving the converted third objective function to obtain the representation coefficient.

Further, in an embodiment of the present application, the weight matrix includes a first weight matrix and the second weight matrix, and the determining unit 13 is further specifically configured to calculate the first weight matrix according to the original data set; wherein the first weight matrix is used for reducing the representation coefficient; determining the second weight matrix according to a second objective function and the original data set; the second weight matrix is used for representing the local relation of the data in the original data set in the original space.

Further, in an embodiment of the present application, the determining unit 13 is further specifically configured to perform iterative solution on the converted third objective function according to a preset auxiliary variable, so as to obtain the representation coefficient.

Further, in an embodiment of the present application, the determining unit 13 is further configured to determine a category parameter corresponding to the original data set after performing dimensionality reduction processing on the original data set.

Further, in an embodiment of the present application, the obtaining unit 15 is specifically configured to obtain a normalized symmetric laplacian matrix corresponding to the original data set according to the similarity matrix calculation; forming a target matrix according to the category parameters and the normalized symmetrical Laplace matrix; carrying out normalization processing on the target matrix to obtain a normalized target matrix; and clustering the normalized target matrix to obtain a clustering result corresponding to the original data set.

Fig. 6 is a schematic diagram of a composition structure of the data clustering device according to the embodiment of the present application, and as shown in fig. 6, the data clustering device 1 according to the embodiment of the present application may further include a processor 16 and a memory 17 storing executable instructions of the processor 16, and further, the data clustering device 1 may further include a communication interface 18, and a bus 19 for connecting the processor 16, the memory 17, and the communication interface 18.

In the embodiment of the present Application, the Processor 16 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a ProgRAMmable Logic Device (PLD), a Field ProgRAMmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the processor functions may be other devices, and the embodiments of the present application are not limited in particular. The data clustering device 1 may further comprise a memory 17, which memory 17 may be connected to the processor 16, wherein the memory 17 is adapted to store executable program code comprising computer operating instructions, and the memory 17 may comprise a high speed RAM memory and may further comprise a non-volatile memory, e.g. at least two disk memories.

In the embodiment of the present application, the bus 19 is used to connect the communication interface 18, the processor 16, and the memory 17 and the intercommunication among these devices.

In the embodiment of the present application, the memory 17 is used for storing instructions and data.

Further, in an embodiment of the present application, a processor 16 for receiving and converting the original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.

In practical applications, the Memory 17 may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor 16.

In addition, each functional module in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.

Based on the understanding that the technical solution of the present embodiment essentially or a part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the method of the present embodiment. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The data clustering device provided by the embodiment of the application receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.

An embodiment of the present application provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the program implements the data clustering method as described above.

Specifically, the program instructions corresponding to a data clustering method in this embodiment may be stored in a storage medium such as an optical disc, a hard disc, or a usb disk, and when the program instructions corresponding to a data clustering method in the storage medium are read or executed by an electronic device, the method includes the following steps:

receiving and converting an original data set;

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, display, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or flowchart block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks in the flowchart and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims

1. a data clustering method, is characterized in that, described method comprises:

receive and transform raw datasets;

Determine a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;

According to the low-rank dictionary and the weight matrix, determine the representation coefficient corresponding to the original data set;

establishing a similarity matrix corresponding to the original data set according to the representation coefficient;

Based on the similarity matrix, spectral clustering is used to obtain the clustering result corresponding to the original data set.

2. The method according to claim 1, wherein the converting the original data set comprises:

After receiving the original data set, dimensionality reduction processing is performed on the original data set.

3. The method according to claim 1, wherein the determining, according to the original data set, a low-rank dictionary corresponding to the original data set comprises:

The low-rank dictionary is determined from the original data set according to a first objective function; wherein the first objective function is used to perform denoising processing on the original data set.

4. The method according to claim 1, wherein the weight matrix comprises a first weight matrix and the second weight matrix, and the corresponding determination of the original data set according to the original data set The weight matrix of , including:

Calculate the first weight matrix according to the original data set; wherein, the first weight matrix is used to reduce the representation coefficient;

The second weight matrix is determined according to the second objective function and the original data set; wherein, the second weight matrix is used to represent the local relationship of the data in the original data set in the original space.

5 . The method according to claim 1 , wherein, determining the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix, comprising: 6 .

Obtain a third objective function according to the first weight matrix; obtain a Laplacian matrix according to the second weight matrix;

Convert the third objective function according to the Laplace matrix to obtain the converted third objective function;

The transformed third objective function is solved to obtain the representation coefficient.

6. The method according to claim 5, characterized in that, said solving the converted third objective function to obtain said representation coefficient, comprising:

Iteratively solves the converted third objective function according to preset auxiliary variables to obtain the representation coefficient.

7. The method according to claim 2, wherein after the dimension reduction processing is performed on the original data set, the method further comprises:

Determine the category parameter corresponding to the original data set.

8. The method according to claim 7, characterized in that, based on the similarity matrix, using spectral clustering to obtain a clustering result corresponding to the original data set, comprising:

Calculate and obtain a normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix;

According to the category parameter and the normalized symmetric Laplacian matrix, a target matrix is formed;

Normalize the target matrix to obtain a normalized target matrix;

Perform clustering processing on the normalized target matrix to obtain a clustering result corresponding to the original data set.

9. A data clustering device, characterized in that the data clustering device comprises: a receiving unit, a converting unit, a determining unit, a establishing unit and an obtaining unit,

the receiving unit, for receiving the original data set;

the conversion unit, configured to convert the original data set;

The determining unit is configured to determine, according to the original data set, a low-rank dictionary and a weight matrix corresponding to the original data set; and determine, according to the low-rank dictionary and the weight matrix, corresponding to the original data set The expression coefficient of ;

The establishment unit is configured to establish a similarity matrix corresponding to the original data set according to the representation coefficient;

The obtaining unit is configured to obtain a clustering result corresponding to the original data set by using spectral clustering based on the similarity matrix.

10. The data clustering device according to claim 9, wherein,

The conversion unit is specifically configured to perform dimension reduction processing on the original data set after receiving the original data set.

11. The data clustering device according to claim 9, wherein,

The determining unit is specifically configured to determine the low-rank dictionary from the original data set according to a first objective function; wherein, the first objective function is used to perform denoising processing on the original data set;

Alternatively, the determining unit is further configured to obtain a third objective function according to the first weight matrix; obtain a Laplacian matrix according to the second weight matrix; and obtain a Laplacian matrix according to the Laplacian matrix The third objective function is converted to obtain the converted third objective function; and the converted third objective function is solved to obtain the representation coefficient.

12. The data clustering apparatus according to claim 9, wherein the weight matrix comprises a first weight matrix and the second weight matrix,

The determining unit is also specifically configured to calculate the first weight matrix according to the original data set; wherein, the first weight matrix is used to reduce the representation coefficient; and according to the second objective function and The original data set determines the second weight matrix; wherein, the second weight matrix is used to represent the local relationship of the data in the original data set in the original space.

13. The data clustering device according to claim 11, wherein,

The determining unit is further configured to iteratively solve the converted third objective function according to preset auxiliary variables to obtain the representation coefficient.

14. The data clustering device according to claim 10, wherein,

The determining unit is further configured to determine a category parameter corresponding to the original data set after performing dimension reduction processing on the original data set.

15. The data clustering device according to claim 14, wherein,

The obtaining unit is specifically configured to calculate and obtain a normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix; and form a target according to the category parameter and the normalized symmetric Laplacian matrix and performing normalization processing on the target matrix to obtain a normalized target matrix; and performing clustering processing on the normalized target matrix to obtain a clustering result corresponding to the original data set .

16. A data clustering device, characterized in that the data clustering device comprises a processor, a memory storing executable instructions of the processor, a communication interface, and a communication interface for connecting the processor, the memory and the bus of the communication interface, when the instructions are executed by the processor, implement the method of any one of claims 1-8.

17. A computer-readable storage medium on which a program is stored and applied in a data clustering device, wherein when the program is executed by a processor, the program according to any one of claims 1-8 is implemented. method.