CN112417234A - A data clustering method and apparatus, and computer-readable storage medium


Info

Publication number
CN112417234A
Authority
CN
China
Prior art keywords
data set
original data
matrix
clustering
weight matrix
Prior art date
Legal status
Granted
Application number
CN201910784526.6A
Other languages
Chinese (zh)
Other versions
CN112417234B (en)
Inventor
赵剑
邱思远
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN201910784526.6A
Publication of CN112417234A
Application granted
Publication of CN112417234B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention disclose a data clustering method and apparatus, and a computer-readable storage medium. The data clustering method includes: receiving and converting an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficient; and, based on the similarity matrix, obtaining a clustering result corresponding to the original data set by spectral clustering. An ideal clustering effect can thereby be obtained, effectively improving clustering performance.


Description

Data clustering method and device and computer readable storage medium
Technical Field
The present invention relates to data detection technologies, and in particular, to a data clustering method and apparatus, and a computer-readable storage medium.
Background
When a data set of high-dimensional data is clustered, the high-dimensional data from different subspaces can be divided into their respective low-dimensional subspaces according to the latent subspace structure of the data set, where different subspaces correspond to different categories. Subspace clustering algorithms are widely used in many fields. Among them, linear-representation-based subspace clustering algorithms, represented by Sparse Subspace Clustering (SSC), Low-Rank Representation (LRR) subspace clustering, and Least Squares Regression (LSR) subspace clustering, have attracted extensive interest from researchers due to the simplicity of the algorithms and their effectiveness for high-dimensional data clustering.
At present, commonly used linear-representation-based subspace clustering algorithms constrain the representation coefficient with the l1 norm, the nuclear norm, or the F-norm to obtain a representation coefficient Z with a block-diagonal structure. However, because only a single norm is used to constrain Z, the obtained representation coefficient Z is usually insufficient, so the final clustering result is not ideal and the clustering performance is low.
Disclosure of Invention
To solve the above technical problems, embodiments of the present invention provide a data clustering method and apparatus, and a computer-readable storage medium.
In order to achieve the above purpose, the technical solution of the embodiments of the present invention is realized as follows:
the embodiment of the invention provides a data clustering method, which comprises the following steps:
receiving and converting an original data set;
determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;
determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;
establishing a similarity matrix corresponding to the original data set according to the representation coefficients;
and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.
The data clustering device receives and converts an original data set; determines a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determines a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishes a similarity matrix corresponding to the original data set according to the representation coefficient; and, based on the similarity matrix, obtains a clustering result corresponding to the original data set by spectral clustering. Thus, in the embodiments of the present application, the data clustering device can obtain a denoised low-rank dictionary from the original data set and combine it with the weight matrix obtained from the original data set to determine the representation coefficient, so as to obtain the similarity matrix corresponding to the original data set, cluster the original data set with that similarity matrix, and obtain the corresponding clustering result.
Drawings
FIG. 1 is a basic framework of a subspace clustering algorithm based on linear representations;
fig. 2 is a first schematic flow chart illustrating an implementation process of a data clustering method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a local relationship;
fig. 4 is a schematic diagram illustrating a second implementation flow of a data clustering method according to an embodiment of the present application;
fig. 5 is a schematic diagram of a first structural component of a data clustering device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a data clustering device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant application and are not limiting of the application. It should be noted that, for the convenience of description, only the parts related to the related applications are shown in the drawings.
With the rapid development of information technology, data is ubiquitous in daily life, and its huge scale and complex structure bring many challenges to data processing; how to effectively mine valuable information from data has become a major problem. Classical clustering algorithms can effectively solve the problem of low-dimensional data clustering, but application environments change by the day, and high-dimensional data is now seen everywhere in work and life: the dimensionality of image, video, and text data often reaches tens of thousands of dimensions. For example, a picture taken by a smartphone can contain tens of thousands of pixels, and traditional clustering algorithms cannot obtain ideal results on such high-dimensional data. The main problems faced by high-dimensional data clustering are: data distribution in a high-dimensional space is sparser than in a low-dimensional space, the distances between data points become almost equal, and some irrelevant attributes exist in the data, so clustering generally cannot be performed according to distance relationships in the high-dimensional space. The subspace clustering algorithm is an extension of conventional clustering: high-dimensional data from different subspaces are divided into their respective low-dimensional subspaces according to the latent subspace structure of the data set, where different subspaces correspond to different categories. Subspace clustering algorithms are widely used in many fields, for example image clustering and motion segmentation. Currently, among subspace clustering algorithms, those based on linear representation are a research hotspot in the field due to their superior clustering performance.
Subspace clustering algorithms based on linear representation aim to construct a better similarity matrix by exploiting global information between data points. Represented by Sparse Subspace Clustering (SSC), Low-Rank Representation (LRR) subspace clustering, and Least Squares Regression (LSR) subspace clustering, they have attracted extensive interest from researchers due to their simplicity and their effectiveness for high-dimensional data clustering. These algorithms do not need to know the dimensions of the subspaces: the self-expressiveness of the data is used to obtain the representation coefficient of each data point, the obtained representation coefficients are used to build a similarity matrix, and the similarity matrix is applied to spectral clustering to obtain the clustering result.
Under the linear-representation assumption, the SSC algorithm enforces sparsity of the representation coefficient matrix through l1-norm minimization, so that inter-class coefficients are zero and intra-class coefficients are sparse. The LRR algorithm can group highly correlated data well by minimizing the nuclear norm to reveal the lowest-rank representation of the global structure of the data, and it obtains good robustness when processing data containing noise and severe contamination. The LSR algorithm uses the F-norm to constrain the representation coefficients so that a grouping effect exists between the coefficients, maintaining the aggregation of correlated data. Under the assumption of independent subspaces, the representation matrix obtained by the LSR algorithm has a block-diagonal structure; even when the data points are insufficient, the obtained representation coefficient matrix still has a block-diagonal structure under the assumption that the subspaces are orthogonal. Moreover, the objective function of the LSR algorithm admits an analytic solution, which avoids an iterative solving process and greatly reduces the time complexity of the algorithm. Fig. 1 shows the basic framework of a subspace clustering algorithm based on linear representation: as shown in fig. 1, the algorithm mainly performs a linear representation of the input data set to obtain representation coefficients, then constructs a similarity matrix from the representation coefficients, and performs spectral clustering using the constructed similarity matrix to obtain the clustering result.
Classical subspace clustering algorithms based on linear representation constrain the representation coefficients with the l1 norm, the nuclear norm, or the F-norm to find a representation coefficient Z with a block-diagonal structure, but a single norm constraint on Z typically has deficiencies. For example, the SSC algorithm obtains the sparsest representation of the samples as the coefficient matrix by minimizing the l1 norm; when data from the same subspace are highly correlated, minimizing the l1 norm usually selects a small number of data points at random for the linear representation while ignoring other relevant data points, so the obtained coefficient matrix does not guarantee connections between data points within a class. Thus, although the SSC algorithm can construct a sparse similarity matrix, it may not achieve satisfactory results. The LRR algorithm finds the lowest-rank representation among the high-dimensional data and can capture the global structure of the data, solving the optimization problem by minimizing the nuclear norm instead of the rank. Although the low-rank representation clustering algorithm can obtain a representation coefficient matrix with good block-diagonal properties, it attends only to the global rank constraint, so the final representation coefficient matrix lacks sparsity: a large number of nonzero elements still exist among the inter-class representation coefficients, and the intra-class representation coefficients differ greatly, so the final clustering result is not ideal.
To overcome the shortcomings of the classical linear-representation-based subspace clustering algorithms, the non-negative low-rank sparse graph for semi-supervised learning introduces the l1 norm and the nuclear norm into the objective function simultaneously, so as to eliminate overly dense inter-class representation coefficients. The low-rank representation algorithm with structured constraints adds a structured sparse constraint to the low-rank representation subspace clustering algorithm, so that it can better sparsify the inter-class representation coefficients and handle more general subspace distribution structures. Smooth representation clustering constrains the representation coefficients through the local relationships between data, so that the intra-class representation coefficients tend to be smooth, obtaining ideal clustering quality.
The data clustering method of the present application can use a smooth structured low-rank representation subspace clustering algorithm (SSLRR) that introduces a local similarity constraint into the LRR objective function, improving the intra-class consistency of the representation coefficients through the local relationships between data points, and introduces a structured sparse constraint into the objective function to increase the inter-class sparsity of the representation coefficients. To better handle data containing noise, the algorithm first obtains a low-rank structured dictionary through a low-rank recovery technique and uses it to linearly represent the original data set, which improves the robustness of the algorithm on noisy data while achieving higher clustering performance.
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Example one
An embodiment of the present invention provides a data clustering method, and fig. 2 is a schematic view illustrating an implementation flow of the data clustering method provided in the embodiment of the present application, as shown in fig. 2, in an embodiment of the present invention, a method for performing data clustering by a data clustering device may include the following steps:
step 101, receiving and converting an original data set.
In an embodiment of the present application, the data clustering device may receive the original data set first, and perform dimension conversion on the original data set after receiving the original data set.
Further, in an embodiment of the present application, the original data set may be high-dimensional data; for example, the original data set may be the Extended Yale B face data set, the AR face data set, or other high-dimensional data such as a handwritten digit data set.
It should be noted that, in the embodiment of the present application, the data clustering device may be a device integrated with a data clustering algorithm, and the data clustering device may be used to perform clustering, analysis, and experiments on a data set. For example, the data clustering means may be installed with a subspace clustering application, for example, the data clustering means may be installed with a face clustering application or a handwritten digit clustering application.
Further, in embodiments of the present application, the raw data set may be a high-dimensional data set, e.g., a raw data set X = [x_1, x_2, ..., x_n] ∈ R^{m×n}, where each column represents a data sample, n represents the number of samples, m represents the dimension of the data, and x_i represents the i-th sample in the data set.
It should be noted that, in the embodiment of the present application, after receiving the original data set, the data clustering device may perform dimensionality reduction processing on the original data set, so as to perform dimensionality conversion on the original data set. Specifically, when the data clustering device performs the dimensionality reduction process on the original data set, the dimensionality of the data can be reduced to 6 × k dimensions by Principal Component Analysis (PCA), where k represents a category parameter.
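As a concrete illustration of the dimension-conversion step above, the sketch below reduces a column-sample data matrix to 6 × k dimensions with PCA. The function name and the synthetic data are illustrative, not from the patent; the cap at min(m, n) is a PCA requirement, not part of the description.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dimension(X, k):
    """Reduce an m x n data matrix (columns are samples, as in
    X = [x_1, ..., x_n]) to 6*k dimensions with PCA, where k is the
    category parameter, as described for step 101."""
    n_components = min(6 * k, *X.shape)  # PCA cannot exceed min(m, n)
    # scikit-learn expects samples in rows, so transpose in and out.
    return PCA(n_components=n_components).fit_transform(X.T).T

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 80))   # 200-dimensional data, 80 samples
X_low = reduce_dimension(X, k=5)     # k = 5 categories -> 30 dimensions
```

Keeping samples in columns matches the X ∈ R^{m×n} convention used throughout the description.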
And 102, determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set.
In the implementation of the present application, after receiving and converting the original data set, the data clustering device may determine the low-rank dictionary and the weight matrix corresponding to the original data set according to the original data set.
It should be noted that, in the embodiment of the present application, the original data set received by the data clustering device may carry random noise, that is, data contaminated by noise may exist in the original data set. In order to better handle the problem of noisy data clustering, the data clustering device may use Robust Principal Component Analysis (RPCA) to recover a discriminative low rank dictionary from the original data set.
Further, in the embodiment of the present application, the data clustering device may extract the low rank dictionary from the original data set according to a first objective function, wherein the first objective function may be used for denoising the original data set, specifically, the expression of the first objective function is shown in formula (1),
min_{A,E} ‖A‖_* + γ‖E‖_1   s.t.  X = A + E   (1)
where ‖A‖_* denotes the nuclear norm of the matrix A and ‖E‖_1 denotes the l1 norm of the matrix E. Specifically, the first objective function can be solved with the inexact augmented Lagrange multiplier algorithm, finally obtaining the low-rank dictionary A.
Further, in the embodiment of the present application, the data clustering device may further obtain a weight matrix corresponding to the original data set according to the original data set. The weight matrix may include a first weight matrix and a second weight matrix. Specifically, the first weight matrix is used for reducing the representation coefficient; the second weight matrix is used for representing the local relation of the data in the original data set in the original space.
It should be noted that, in the embodiment of the present application, the data clustering device may involve the first weight matrix and the second weight matrix in a third objective function used when clustering the original data set, and therefore, the data clustering device may determine the first weight matrix and the second weight matrix according to the original data set.
Further, in the embodiment of the present application, the weight values in the first weight matrix can be obtained by formula (2):

W_ij = 1 - exp(-B_ij / σ)   (2)

where W_ij is the weight value in the first weight matrix, and x̃_i and x̃_j denote the normalized data points x_i and x_j, respectively. The matrix B is defined according to formula (3):

B_ij = 1 - |x̃_i^T x̃_j|   (3)

and the parameter σ is the average of all elements in the matrix B. With the first weight matrix defined by formula (2), the weight values between data points in different subspaces of the original data set take larger values, while the weight values between data points in the same subspace tend to zero; the representation coefficients between data points in different subspaces can then be better suppressed by minimizing the data term ‖W ⊙ Z‖_1, where ⊙ denotes the Hadamard product. In the embodiment of the present application, H = ‖W ⊙ Z‖_1 is defined.
Further, in the embodiment of the present application, in order to better characterize the local relationships between the data points in the original data set, the data clustering device may determine the local relationships through a Local Linear Embedding (LLE) graph. It first determines the K nearest neighbors of each data point x_i, then linearly reconstructs x_i from its K neighbors, solving for the weight values by minimizing the reconstruction error; the weight value M_ij in the second weight matrix represents the contribution of the j-th data point to the reconstruction of the i-th data point, and the closer two data points are, the greater the weight between them. For example, fig. 3 is a schematic diagram of a local relationship: in a high-dimensional space, when the number of neighbors is K = 3, the linear reconstruction relationship between a data point x_i and its three neighbors x_j, x_k, x_l is shown in fig. 3, where M_ij, M_ik, M_il are the weight values between x_i and x_j, x_k, x_l, respectively. Based on two constraints, (1) each data point is linearly reconstructed from its K nearest neighbors, with M_ij = 0 when a data point x_j does not belong to the K neighbors of x_i, and (2) the reconstruction weight coefficients of each data point sum to 1, the second objective function by which the data clustering device solves for the second weight matrix can be expressed as formula (4):

min_M Σ_{i=1}^{n} ‖x_i - Σ_{j∈Q_i} M_ij x_j‖²   s.t.  Σ_{j∈Q_i} M_ij = 1   (4)

where n represents the number of data points and Q_i represents the index set of the K neighbors of each data point x_i. Defining formula (5):

V_jk = (x_i - x_j)^T (x_i - x_k)   (5)

M_ij can then be expressed as formula (6):

M_ij = Σ_{k∈Q_i} (V^{-1})_{jk} / Σ_{l,s∈Q_i} (V^{-1})_{ls}   (6)

Further, in the embodiment of the present application, the data clustering device may determine the second weight matrix according to formula (6); specifically, the second weight matrix may be a symmetric non-negative weight matrix, for example represented by formula (7):

M = (M + M^T) / 2   (7)
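The LLE reconstruction weights of formulas (4) through (7) can be sketched as follows. Solving the local linear system V w = 1 and normalizing is the standard closed form equivalent to formula (6); the small regularizer `reg` is a numerical safeguard for near-singular local Gram matrices, not part of the patent text.

```python
import numpy as np

def lle_weights(X, K=3, reg=1e-3):
    """Second weight matrix via LLE reconstruction weights: each column
    x_i is linearly reconstructed from its K nearest neighbours with
    weights summing to 1 (formula (4)); non-neighbours get weight 0;
    the result is symmetrised as M = (M + M^T) / 2 (formula (7))."""
    n = X.shape[1]
    M = np.zeros((n, n))
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    for i in range(n):
        Q = np.argsort(D2[i])[1:K + 1]         # K nearest neighbours of x_i
        G = X[:, Q] - X[:, [i]]                # neighbours shifted by x_i
        V = G.T @ G                            # local Gram matrix, cf. formula (5)
        V = V + reg * np.trace(V) * np.eye(K)  # regularise near-singular V
        w = np.linalg.solve(V, np.ones(K))     # closed form behind formula (6)
        M[i, Q] = w / w.sum()                  # each row sums to 1
    return (M + M.T) / 2.0                     # symmetrise, formula (7)

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 8))  # 4-dimensional data, 8 samples
M = lle_weights(X, K=3)
```

Symmetrization preserves the total weight mass, so the matrix sum stays equal to the number of data points.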
it should be noted that, in the embodiment of the present application, after receiving the original data set, the data clustering device may determine, based on the original data set, the low-rank dictionary, the first weight matrix, and the second weight matrix according to the above formula (1) and the value formula (7), so as to continue to determine the representation coefficients according to the low-rank dictionary and the weight matrix.
And 103, determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.
In the embodiment of the application, after determining the low-rank dictionary and the weight matrix corresponding to the original data set according to the original data set, the data clustering device may further determine the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.
It should be noted that, in the embodiment of the present application, the LRR algorithm can capture the global structure of the data well through the low-rank criterion, but the inter-class representation coefficients contain a large number of non-zero elements, which affects the accuracy of the clustering. In the embodiments of the present application, the l1 norm may be introduced into the LRR objective function, i.e., into the third objective function used for clustering (the objective function corresponding to the LRR algorithm), so that the l1 norm improves the sparsity of the representation coefficients. Specifically, the third objective function can be expressed according to formula (8):
min_{Z,E} ‖Z‖_* + β‖Z‖_1 + γ‖E‖_1   s.t.  X = AZ + E   (8)
where β and γ are used to balance the effects of the low-rank, sparse, and noise terms. In particular, in embodiments of the present application, minimizing a structured sparse constraint term is superior to standard l1-norm minimization, so formula (8) can be converted into formula (9) to represent the third objective function:
min_{Z,E} ‖Z‖_* + βH + γ‖E‖_{2,1}   s.t.  X = AZ + E   (9)
where W is the first weight matrix among the weight matrices. In order to better capture the local relationships of the data in the original data set, it can be assumed that if data points x_i and x_j are close in the latent geometric distribution, then the two data points remain similar when embedded or projected into a new space. Specifically, the data clustering device may first define L = D - M as the Laplacian matrix, where D is the degree matrix with

D_ii = Σ_j M_ij

Formula (9) is then converted using the Laplacian matrix, and the converted third objective function is obtained as shown in formula (10); that is, mathematically, the assumed relationship can be expressed as formula (10):

min_Z (1/2) Σ_{i,j} ‖z_i - z_j‖² M_ij = min_Z tr(Z L Z^T)   (10)
where M is the second weight matrix among the weight matrices, reflecting the local relationships of the data in the original data set in the original space, and z_i and z_j are the representation coefficients corresponding to data points x_i and x_j, respectively. Fusing formula (9) and formula (10) and constraining the representation coefficients through the local relationships between data points makes the intra-class representation coefficients tend to be smooth, which promotes the final clustering accuracy. The converted third objective function can then be expressed by formula (11):

min_{Z,E} ‖Z‖_* + β‖W ⊙ Z‖_1 + α tr(Z L Z^T) + γ‖E‖_{2,1}   s.t.  X = AZ + E   (11)
where α is used to balance the effects of the regularization term of the graph with the other three terms.
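The identity behind formula (10), that the pairwise smoothness penalty over the LLE graph equals the Laplacian quadratic form tr(Z L Z^T), can be checked numerically; the data below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
Z = rng.standard_normal((4, n))   # columns z_i: representation coefficients
M = rng.random((n, n))
M = (M + M.T) / 2.0               # symmetric non-negative weight matrix
np.fill_diagonal(M, 0.0)

D = np.diag(M.sum(axis=1))        # degree matrix D_ii = sum_j M_ij
L = D - M                         # graph Laplacian used in formula (10)

lhs = np.trace(Z @ L @ Z.T)       # tr(Z L Z^T)
rhs = 0.5 * sum(M[i, j] * np.linalg.norm(Z[:, i] - Z[:, j]) ** 2
                for i in range(n) for j in range(n))
```

The equality holds for any symmetric M, which is why the symmetrization of formula (7) matters for this step.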
Further, in the embodiment of the present application, in order to solve the above formula (11) effectively, the data clustering device may solve it iteratively using the alternating direction method of multipliers. Specifically, by introducing the preset auxiliary variables J, T ∈ R^{n×n}, formula (11) can be converted into formula (12):

min_{Z,J,T,E} ‖J‖_* + β‖W ⊙ T‖_1 + α tr(Z L Z^T) + γ‖E‖_{2,1}   s.t.  X = AZ + E, Z = J, Z = T   (12)
Reconstructing formula (12) with Lagrange multipliers, formula (13) can be obtained:

‖J‖_* + β‖W ⊙ T‖_1 + α tr(Z L Z^T) + γ‖E‖_{2,1} + ⟨Y_A, X - AZ - E⟩ + ⟨Y_B, Z - T⟩ + ⟨Y_C, Z - J⟩ + (μ/2)(‖X - AZ - E‖_F² + ‖Z - T‖_F² + ‖Z - J‖_F²)   (13)

where Y_A, Y_B, and Y_C denote the Lagrange multipliers and μ denotes a penalty parameter that controls the convergence of the third objective function.
It should be noted that, in the embodiment of the present application, the data clustering device may update and iterate J using the singular value soft-thresholding operation based on Y_C and Z; likewise, the data clustering device may update and iterate T using the shrinkage-thresholding operation based on Y_B and Z; further, the data clustering device may solve for Z using the Bartels-Stewart algorithm, iterating based on the low-rank dictionary. During the iteration, the representation coefficient Z has a unique solution, so the optimal value of the representation coefficient can be obtained.
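The three sub-problem solvers named here can be sketched as below: singular value soft-thresholding for the J-update, elementwise shrinkage for the T-update, and a Sylvester-equation solve (Bartels-Stewart, as implemented by scipy) for the Z-update. The Sylvester matrices P, Q, R are only illustrative; the patent does not spell out the exact Z-update coefficients.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def svt(M, tau):
    """Singular value soft-thresholding: the proximal operator of the
    nuclear norm, used for the J-update."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Elementwise soft-thresholding: the proximal operator of the
    (weighted) l1 norm, used for the T-update."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# The Z-update of formula (13) is linear in Z and can be arranged as a
# Sylvester equation P Z + Z Q = R, which scipy solves with the
# Bartels-Stewart algorithm.
rng = np.random.default_rng(4)
P = rng.standard_normal((5, 5))
P = P @ P.T + 5.0 * np.eye(5)    # positive definite left factor
Q = rng.standard_normal((5, 5))
Q = Q @ Q.T                      # positive semidefinite right factor
R = rng.standard_normal((5, 5))
Z_sol = solve_sylvester(P, Q, R)
```

Because the Sylvester system has a unique solution whenever P and -Q share no eigenvalues, the Z-update is well defined at every iteration, matching the uniqueness claim in the text.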
And 104, establishing a similarity matrix corresponding to the original data set according to the representation coefficients.
In the embodiment of the application, after determining the representation coefficients corresponding to the original data set according to the low-rank dictionary and the weight matrix, the data clustering device may establish the similarity matrix corresponding to the original data set according to the representation coefficients.
It should be noted that, in the embodiment of the present application, after obtaining the representation coefficients, the data clustering device may construct the similarity matrix according to the representation coefficients, specifically, the data clustering device may establish the similarity matrix according to equation (14),
S_ij = (|Z_ij| + |Z_ji|) / 2   (14)
it should be noted that, in the embodiment of the present application, the similarity matrix determined by the data clustering device according to the formula (14) may be used for performing spectral clustering on the original data set.
And 105, based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.
In the embodiment of the application, after the data clustering device establishes the similarity matrix corresponding to the original data set according to the representation coefficient, the clustering result corresponding to the original data set can be obtained by utilizing spectral clustering based on the similarity matrix.
Further, in the embodiment of the present application, after performing dimensionality reduction processing on the original data set, the data clustering device may further determine a category parameter corresponding to the original data set.
It should be noted that, in the embodiment of the present application, after determining the similarity matrix, the data clustering device may further determine the normalized symmetric Laplacian matrix according to the similarity matrix. It may then obtain K eigenvectors of the normalized symmetric Laplacian matrix according to the category parameter K of the original data set, normalize the target matrix formed by the K eigenvectors, apply a K-means clustering algorithm to the normalized target matrix, and finally output the class assignment of the original data set, that is, obtain the clustering result corresponding to the original data set.
In the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.
Example two
Based on the first embodiment, in another embodiment of the present application, when the data clustering device solves the converted third objective function, that is, when equation (11) is solved, the data clustering device may iteratively solve the converted third objective function according to preset auxiliary variables to obtain the representation coefficient.
Further, in the embodiment of the present application, the data clustering device may introduce preset auxiliary variables J, T ∈ R^(n×n), reconstruct the problem using the augmented Lagrange multiplier method after introducing the preset auxiliary variables to obtain equation (13), and then sequentially update the preset auxiliary variable J, the preset auxiliary variable T, Z, E, the Lagrange multipliers and μ to obtain the optimal representation coefficient Z*.
In the embodiments of the present application, taking the original data set X = [x_1, x_2, ..., x_n] ∈ R^(m×n) as an example, when the determination of the representation coefficients is performed, the smooth low-rank representation subspace clustering algorithm proposed by the data clustering device may include the following steps:
Step 201: initialize variables.

Set the maximum iteration number maxIter = 1000 and the current iteration number k = 0; initialize Z = J = T = 0, E = 0, Y_A = 0, Y_B = Y_C = 0, μ = 10^-6, max_μ = 10^10, ρ = 1.1, ε = 10^-8. The iteration continues while ||Z − J|| > ε, ||Z − T|| > ε or ||X − AZ − E|| > ε.
Step 202: update the preset auxiliary variable J.

Fix the other variables and update the preset auxiliary variable J:

J = argmin_J (1/μ)||J||_* + (1/2)||J − (Z + Y_C/μ)||_F^2

Specifically, when updating the variable J, the singular value soft-thresholding operation is used. Let

P = Z + Y_C/μ

Perform singular value decomposition on P, SVD(P) = [U, Σ, V], and threshold the singular value matrix Σ: G_τ(Σ) = diag((σ_i − τ)_+), where σ_i is a main diagonal element of Σ and also a singular value of the matrix P, and τ is the threshold, taken as

τ = 1/μ

G_τ(Σ) means: if the diagonal element σ_i is larger than τ, take σ_i = σ_i − τ; otherwise take σ_i = 0. The optimal solution for J in each iteration is then J = U G_τ(Σ) V^T.
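The singular value soft-thresholding operator used in step 202 can be sketched as follows (a minimal illustration of the operator itself, not the full iteration):

```python
import numpy as np

def svt(P, tau):
    # Singular value soft-thresholding: SVD(P) = [U, Sigma, V], shrink
    # each singular value by tau, truncate negatives to zero, rebuild.
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

In the J-update this would be applied as J = svt(Z + Y_C/μ, 1/μ), assuming the threshold τ = 1/μ. Shrinking singular values in this way also reduces the rank of the result, which is why the operator is the proximal step for the nuclear norm.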
Step 203: update the preset auxiliary variable T.

Fix the other variables and update the preset auxiliary variable T:

T = argmin_T (1/μ)‖W ⊙ T‖_1 + (1/2)||T − (Z + Y_B/μ)||_F^2

Specifically, when updating the variable T, a shrinkage-thresholding operation is used. Let

Q = Z + Y_B/μ

In this case, the variable T may be expressed as T = S_ε(Q), and each element T_ij of T satisfies the relationship of equation (15):

T_ij = sign(Q_ij) · max(|Q_ij| − W_ij/μ, 0)     (15)
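The element-wise shrinkage operator S_ε used for the T-update can be sketched as follows (the per-entry threshold derived from the weight matrix W and μ is an assumption stated in the comment):

```python
import numpy as np

def shrink(Q, eps):
    # Element-wise soft-thresholding S_eps: eps may be a scalar or a
    # matrix of per-entry thresholds (e.g. W/mu, an assumption here),
    # applied entry by entry as sign(q) * max(|q| - eps, 0).
    return np.sign(Q) * np.maximum(np.abs(Q) - eps, 0.0)
```

This is the standard proximal step for a (weighted) l1 penalty: entries whose magnitude falls below the threshold are set exactly to zero, which sparsifies T.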
Step 204: update the variable Z.

Fix the other variables and update the variable Z. Specifically, when updating the variable Z, the Bartels-Stewart algorithm is used to solve the equation μA^TAZ + αZ(2I + L) + (−A^TY_A + Y_B + Y_C + μ(A^TE − A^TX − J − T)) = 0. A^TA is a positive semi-definite matrix, so any eigenvalue p_i of A^TA satisfies p_i ≥ 0; 2I + L is a positive definite matrix, so any eigenvalue μ_i of 2I + L satisfies μ_i > 0. Because p_i + μ_i > 0 for any eigenvalues p_i and μ_i, the variable Z has a unique solution in the iteration process.
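The Z-update is a Sylvester equation of the form (μA^TA)Z + Z(α(2I + L)) = RHS, where RHS collects the known terms. A hedged sketch using SciPy's Bartels-Stewart solver (the helper name and argument layout are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_z(A, L, rhs, mu, alpha):
    # Solve mu*(A^T A) Z + Z * alpha*(2I + L) = rhs via Bartels-Stewart.
    # A^T A is positive semi-definite and 2I + L is positive definite,
    # so the eigenvalue sums are strictly positive and Z is unique.
    lhs_left = mu * (A.T @ A)
    lhs_right = alpha * (2.0 * np.eye(L.shape[0]) + L)
    return solve_sylvester(lhs_left, lhs_right, rhs)
```

`scipy.linalg.solve_sylvester` implements exactly the Bartels-Stewart algorithm named in the text, with O(n^3) cost per solve.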
Step 205: update the variable E.

Fix the other variables and update the variable E, where E satisfies the following equation (16):

E = argmin_E (λ/μ)||E||_2,1 + (1/2)||E − (X − AZ + Y_A/μ)||_F^2     (16)

Specifically, when updating the variable E, set

U = X − AZ + Y_A/μ

and let u_i denote the i-th column of the matrix U. Each column e_i of E then satisfies the condition of the following equation (17):

e_i = ((||u_i||_2 − λ/μ)/||u_i||_2) · u_i, if ||u_i||_2 > λ/μ; e_i = 0, otherwise     (17)
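The column-wise l2,1 proximal step of the E-update can be sketched as follows (the helper name is an assumption; the threshold would be λ/μ in the iteration):

```python
import numpy as np

def column_shrink(U, thr):
    # l2,1-norm proximal step: scale each column u_i of U by
    # (||u_i|| - thr)/||u_i|| if ||u_i|| > thr, otherwise zero it.
    E = np.zeros_like(U)
    norms = np.linalg.norm(U, axis=0)
    keep = norms > thr
    E[:, keep] = U[:, keep] * (norms[keep] - thr) / norms[keep]
    return E
```

Columns with small norm are zeroed out entirely, which is what makes the l2,1 penalty model sample-specific (column-wise) noise.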
and step 206, updating the Lagrange multiplier.
For Lagrange multiplier YA、YBAnd YCAnd (6) updating. In particular, may be according to YA=YA+μ(X-AZ-E)、YB=YB+ mu (Z-T) and YC=YC+ mu (Z-J) for YA、YBAnd YCAnd (6) updating.
Step 207: update the penalty parameter μ.

The penalty parameter is updated as μ = min(ρμ, max_μ).
Step 208: let k = k + 1 and repeat steps 202 to 207 above until the optimal representation coefficient Z* is output.
In the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.

Example three
Based on the first embodiment and the second embodiment, in a further embodiment of the present application, Fig. 4 is a schematic diagram illustrating an implementation flow of a data clustering method provided in the embodiment of the present application. As shown in Fig. 4, the method for the data clustering device to obtain, based on the similarity matrix, a clustering result corresponding to the original data set by using spectral clustering may include the following steps:
Step 301: calculate the normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix.
In the embodiment of the application, after the data clustering device determines the similarity matrix, the original data set can be clustered according to a normalized symmetric spectral clustering algorithm.
Further, in the embodiment of the present application, the data clustering device may first obtain the normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix. For example, based on the similarity matrix C obtained by the above equation (14), the normalized symmetric Laplacian matrix L_sym corresponding to the original data set is obtained by calculation.
Step 302: form a target matrix according to the category parameter and the normalized symmetric Laplacian matrix.
In the embodiment of the application, after the data clustering device obtains the normalized symmetric Laplacian matrix according to the similarity matrix, the target matrix can be further constructed by combining the category parameter corresponding to the original data set.

It should be noted that, in the embodiment of the present application, when the category parameter is k, the data clustering device may first calculate the first k eigenvectors u_1, u_2, …, u_k of the Laplacian matrix L_sym, and then form the target matrix U = [u_1, u_2, …, u_k] ∈ R^(n×k) from the k eigenvectors.
Step 303, normalizing the target matrix to obtain a normalized target matrix.
In the embodiment of the application, after the data clustering device constructs the target matrix according to the category parameter and the normalized symmetric Laplacian matrix, the target matrix may be normalized, so that the normalized target matrix may be obtained. Specifically, the data clustering device may normalize the target matrix U by rows to obtain the normalized target matrix T ∈ R^(n×k).
Step 304: cluster the normalized target matrix to obtain a clustering result corresponding to the original data set.
In the embodiment of the application, after normalizing the target matrix to obtain the normalized target matrix, the data clustering device may perform clustering processing on the normalized target matrix, so as to obtain the clustering result corresponding to the original data set.
Further, in the embodiment of the present application, the data clustering device may regard each row q_i ∈ R^k of the normalized target matrix T as a point in the space R^k and apply the K-means clustering algorithm, so that the clustering result corresponding to the original data set can be obtained.
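Steps 301 to 304 can be sketched as follows. This is a minimal illustration under stated assumptions: L_sym is taken as the common definition I − D^(-1/2) C D^(-1/2) (the patent does not reproduce its formula here), and SciPy's `kmeans2` stands in for the K-means step.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(C, k, seed=0):
    # Step 301: normalized symmetric Laplacian (assumed definition)
    # L_sym = I - D^{-1/2} C D^{-1/2}, D = diag of row sums of C.
    d = C.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(C)) - d_is[:, None] * C * d_is[None, :]
    # Step 302: target matrix U from the k eigenvectors with the
    # smallest eigenvalues (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]
    # Step 303: normalize U by rows to obtain T
    T = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Step 304: treat each row of T as a point in R^k and run K-means
    _, labels = kmeans2(T, k, minit='++', seed=seed)
    return labels
```

On a block-structured similarity matrix, the row-normalized eigenvector embedding collapses each block to (nearly) a single point, which is why the final K-means step recovers the classes.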
In the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.
Example four
Based on the first to third embodiments, the data clustering device performs clustering processing on the original data set according to the SSLRR to obtain a corresponding clustering result. In order to verify the clustering effect of the SSLRR, the embodiments of the present application provide the following two theoretical analyses.
The first method is as follows: the optimal solution of SSLRR has a block diagonal structure.
For the problem of equation (18), without considering noise:

min_Z ||Z||_* + tr(ZLZ^T) + ‖W ⊙ Z‖_1, s.t. X = AZ     (18)

given a set of m-dimensional data X = [x_1, x_2, ..., x_n] = [X_1, X_2, …, X_k] ∈ R^(m×n), where the data set X is drawn from k independent linear subspaces {S_i, i = 1, …, k}, X_i is an m × n_i matrix each column of which comes from the same subspace S_i, and n_1 + n_2 + … + n_k = n. If Z* is the optimal solution of the minimization problem (18), then the representation coefficient Z* has a block diagonal structure.
Suppose Z* is the optimal solution of the objective function (18). Define Z_D by equation (19):

[Z_D]_ij = [Z*]_ij, if x_i and x_j belong to the same subspace; [Z_D]_ij = 0, otherwise     (19)

and let Z_C = Z* − Z_D, with |Z_C| ≥ 0. By the orthogonality assumption on the subspaces, Z_D is also a feasible solution of the objective function (18), and from the nuclear-norm property of the matrix, ||Z*||_* ≥ ||Z_D||_*. From |Z_C| ≥ 0, it can be deduced that tr(Z*LZ*^T) = tr((Z_D + Z_C)L(Z_D + Z_C)^T) ≥ tr(Z_D L (Z_D)^T). Since the weight matrix W is a non-negative matrix, for the weighted l1 term one can obtain:

‖W ⊙ Z*‖_1 ≥ ‖W ⊙ Z_D‖_1     (20)

where L = D − W is the Laplacian matrix. From ||Z*||_* ≥ ||Z_D||_*, tr(Z*LZ*^T) ≥ tr(Z_D L (Z_D)^T) and ‖W ⊙ Z*‖_1 ≥ ‖W ⊙ Z_D‖_1, it can be deduced that:

||Z*||_* + tr(Z*LZ*^T) + ‖W ⊙ Z*‖_1 ≥ ||Z_D||_* + tr(Z_D L (Z_D)^T) + ‖W ⊙ Z_D‖_1     (21)

Since Z* is the optimal solution of equation (18), equality must hold in (21), which forces Z_C = 0 and hence Z* = Z_D. Therefore, the optimal solution Z* of equation (18) has a block diagonal structure.
The second method is as follows: time complexity analysis.
For a data set X = [x_1, x_2, ..., x_n] ∈ R^(m×n), in the above step 101, the time complexity of recovering the low-rank dictionary A using RPCA is O(t_1·n^3), where t_1 denotes the number of iterations of that algorithm. The time complexities of updating J, T and E and the Lagrange multipliers Y_A, Y_B and Y_C in the above steps 202 to 207 are O(n^3), O(n^2), O(mn^2), O(mn^2), O(n^2) and O(n^2), respectively. When updating Z, the Bartels-Stewart algorithm is used to solve the Sylvester equation, so the time complexity is O(n^3). Therefore, the overall time complexity of the above steps 202 to 207 is O(3t_2·n^2 + 2t_2·mn^2 + 2t_2·n^3); if m < n, the time complexity is O(2t_2·n^3), where t_2 denotes the number of iterations of the alternating direction method of multipliers. The spectral clustering in step 105 has an overall time complexity of O(n^3). Therefore, the time complexity of the SSLRR algorithm proposed in the embodiment of the present application is O((t_1 + 2t_2 + 1)·n^3).
In the data clustering method provided by the embodiment of the application, a data clustering device receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.
Example five
Based on the first to fourth embodiments, fig. 5 is a schematic structural diagram of a data clustering device according to an embodiment of the present application, as shown in fig. 5, in an embodiment of the present invention, a data clustering device 1 includes a receiving unit 11, a converting unit 12, a determining unit 13, an establishing unit 14, and an obtaining unit 15,
the receiving unit 11 is configured to receive an original data set.
The conversion unit 12 is configured to convert the original data set.
The determining unit 13 is configured to determine, according to the original data set, a low-rank dictionary and a weight matrix corresponding to the original data set; and determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix.
The establishing unit 14 is configured to establish a similarity matrix corresponding to the original data set according to the representation coefficient.
The obtaining unit 15 is configured to obtain a clustering result corresponding to the original data set by using spectral clustering based on the similarity matrix.
Further, in the embodiment of the present application, the converting unit 12 is specifically configured to perform dimensionality reduction processing on the original data set after receiving the original data set.
Further, in an embodiment of the present application, the determining unit 13 is specifically configured to determine the low-rank dictionary from the original data set according to a first objective function; the first objective function is used for denoising the original data set; or, the determining unit 13 is further specifically configured to obtain a third objective function according to the first weight matrix; obtaining a Laplace matrix according to the second weight matrix; converting the third objective function according to the Laplace matrix to obtain a converted third objective function; and solving the converted third objective function to obtain the representation coefficient.
Further, in an embodiment of the present application, the weight matrix includes a first weight matrix and the second weight matrix, and the determining unit 13 is further specifically configured to calculate the first weight matrix according to the original data set; wherein the first weight matrix is used for reducing the representation coefficient; determining the second weight matrix according to a second objective function and the original data set; the second weight matrix is used for representing the local relation of the data in the original data set in the original space.
Further, in an embodiment of the present application, the determining unit 13 is further specifically configured to perform iterative solution on the converted third objective function according to a preset auxiliary variable, so as to obtain the representation coefficient.
Further, in an embodiment of the present application, the determining unit 13 is further configured to determine a category parameter corresponding to the original data set after performing dimensionality reduction processing on the original data set.
Further, in an embodiment of the present application, the obtaining unit 15 is specifically configured to obtain a normalized symmetric laplacian matrix corresponding to the original data set according to the similarity matrix calculation; forming a target matrix according to the category parameters and the normalized symmetrical Laplace matrix; carrying out normalization processing on the target matrix to obtain a normalized target matrix; and clustering the normalized target matrix to obtain a clustering result corresponding to the original data set.
Fig. 6 is a schematic diagram of a composition structure of the data clustering device according to the embodiment of the present application, and as shown in fig. 6, the data clustering device 1 according to the embodiment of the present application may further include a processor 16 and a memory 17 storing executable instructions of the processor 16, and further, the data clustering device 1 may further include a communication interface 18, and a bus 19 for connecting the processor 16, the memory 17, and the communication interface 18.
In the embodiment of the present application, the processor 16 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the processor functions may be other devices, and the embodiments of the present application are not limited in particular. The data clustering device 1 may further comprise a memory 17, which may be connected to the processor 16, wherein the memory 17 is adapted to store executable program code comprising computer operating instructions, and the memory 17 may comprise a high speed RAM memory and may further comprise a non-volatile memory, for example, at least two disk memories.
In the embodiment of the present application, the bus 19 is used to connect the communication interface 18, the processor 16 and the memory 17, and to enable intercommunication among these devices.
In the embodiment of the present application, the memory 17 is used for storing instructions and data.
Further, in an embodiment of the present application, a processor 16 for receiving and converting the original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.
In practical applications, the Memory 17 may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor 16.
In addition, each functional module in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on such understanding, the technical solution of this embodiment, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The data clustering device provided by the embodiment of the application receives and converts an original data set; determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix; establishing a similarity matrix corresponding to the original data set according to the representation coefficients; and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set. Therefore, in the embodiment of the application, the data clustering device can obtain a denoised low-rank dictionary from the original data set, and then combine the weight matrix obtained according to the original data set to construct the target coefficient, so as to obtain the similarity matrix corresponding to the original data set, so as to perform clustering processing on the original data set by using the similarity matrix, and obtain the corresponding clustering result.
An embodiment of the present application provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the program implements the data clustering method as described above.
Specifically, the program instructions corresponding to a data clustering method in this embodiment may be stored in a storage medium such as an optical disc, a hard disc, or a usb disk, and when the program instructions corresponding to a data clustering method in the storage medium are read or executed by an electronic device, the method includes the following steps:
receiving and converting an original data set;
determining a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set;
determining a representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix;
establishing a similarity matrix corresponding to the original data set according to the representation coefficients;
and based on the similarity matrix, utilizing spectral clustering to obtain a clustering result corresponding to the original data set.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or flowchart block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks in the flowchart and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (17)

1.一种数据聚类方法,其特征在于,所述方法包括:1. a data clustering method, is characterized in that, described method comprises: 接收并转换原始数据集;receive and transform raw datasets; 根据所述原始数据集确定所述原始数据集对应的低秩字典和权值矩阵;Determine a low-rank dictionary and a weight matrix corresponding to the original data set according to the original data set; 根据所述低秩字典和所述权值矩阵,确定所述原始数据集对应的表示系数;According to the low-rank dictionary and the weight matrix, determine the representation coefficient corresponding to the original data set; 按照所述表示系数建立与所述原始数据集对应的相似度矩阵;establishing a similarity matrix corresponding to the original data set according to the representation coefficient; 基于所述相似度矩阵,利用谱聚类获得所述原始数据集对应的聚类结果。Based on the similarity matrix, spectral clustering is used to obtain the clustering result corresponding to the original data set. 2.根据权利要求1所述的方法,其特征在于,所述转换原始数据集,包括:2. The method according to claim 1, wherein the converting the original data set comprises: 在接收所述原始数据集之后,对所述原始数据集进行降低维度处理。After receiving the original data set, dimensionality reduction processing is performed on the original data set. 3.根据权利要求1所述的方法,其特征在于,所述根据所述原始数据集确定所述原始数据集对应的低秩字典,包括:3. The method according to claim 1, wherein the determining, according to the original data set, a low-rank dictionary corresponding to the original data set comprises: 按照第一目标函数从所述原始数据集中确定所述低秩字典;其中,所述第一目标函数用于对所述原始数据集进行去噪处理。The low-rank dictionary is determined from the original data set according to a first objective function; wherein the first objective function is used to perform denoising processing on the original data set. 4.根据权利要求1所述的方法,其特征在于,所述权值矩阵包括第一权值矩阵和所述第二权值矩阵,所述根据所述原始数据集确定所述原始数据集对应的权值矩阵,包括:4. 
The method according to claim 1, wherein the weight matrix comprises a first weight matrix and the second weight matrix, and the corresponding determination of the original data set according to the original data set The weight matrix of , including: 按照所述原始数据集计算所述第一权值矩阵;其中,所述第一权值矩阵用于对所述表示系数进行降低;Calculate the first weight matrix according to the original data set; wherein, the first weight matrix is used to reduce the representation coefficient; 按照第二目标函数和所述原始数据集确定所述第二权值矩阵;其中,所述第二权值矩阵用于表征所述原始数据集中的数据在原始空间中的局部关系。The second weight matrix is determined according to the second objective function and the original data set; wherein, the second weight matrix is used to represent the local relationship of the data in the original data set in the original space. 5.根据权利要求1所述的方法,其特征在于,所述根据所述低秩字典和所述权值矩阵,确定所述原始数据集对应的表示系数,包括:5 . The method according to claim 1 , wherein, determining the representation coefficient corresponding to the original data set according to the low-rank dictionary and the weight matrix, comprising: 6 . 根据所述第一权值矩阵获得第三目标函数;按照所述第二权值矩阵获得拉普拉斯矩阵;Obtain a third objective function according to the first weight matrix; obtain a Laplacian matrix according to the second weight matrix; 根据所述拉普拉斯矩阵对所述第三目标函数进行转换,获得转换后的第三目标函数;Convert the third objective function according to the Laplace matrix to obtain the converted third objective function; 求解所述转换后的第三目标函数,获得所述表示系数。The transformed third objective function is solved to obtain the representation coefficient. 6.根据权利要求5所述的方法,其特征在于,所述求解所述转换后的第三目标函数,获得所述表示系数,包括:6. The method according to claim 5, characterized in that, said solving the converted third objective function to obtain said representation coefficient, comprising: 按照预设辅助变量对所述转换后的第三目标函数进行迭代求解,获得所述表示系数。Iteratively solves the converted third objective function according to preset auxiliary variables to obtain the representation coefficient. 7.根据权利要求2所述的方法,其特征在于,所述对所述原始数据集进行降低维度处理之后,所述方法还包括:7. 
The method according to claim 2, wherein after the dimension reduction processing is performed on the original data set, the method further comprises: 确定所述原始数据集对应的类别参数。Determine the category parameter corresponding to the original data set. 8.根据权利要求7所述的方法,其特征在于,所述基于所述相似度矩阵,利用谱聚类获得所述原始数据集对应的聚类结果,包括:8. The method according to claim 7, characterized in that, based on the similarity matrix, using spectral clustering to obtain a clustering result corresponding to the original data set, comprising: 根据所述相似度矩阵计算获得所述原始数据集对应的规范化对称拉普拉斯矩阵;Calculate and obtain a normalized symmetric Laplacian matrix corresponding to the original data set according to the similarity matrix; 按照所述类别参数和所述规范化对称拉普拉斯矩阵,构成目标矩阵;According to the category parameter and the normalized symmetric Laplacian matrix, a target matrix is formed; 对所述目标矩阵进行归一化处理,获得归一化后的目标矩阵;Normalize the target matrix to obtain a normalized target matrix; 对所述归一化后的目标矩阵进行聚类处理,获得所述原始数据集对应的聚类结果。Perform clustering processing on the normalized target matrix to obtain a clustering result corresponding to the original data set. 9.一种数据聚类装置,其特征在于,所述数据聚类装置包括:接收单元,转换单元,确定单元,建立单元以及获取单元,9. 
9. A data clustering device, comprising a receiving unit, a converting unit, a determining unit, an establishing unit and an obtaining unit, wherein:
the receiving unit is configured to receive an original data set;
the converting unit is configured to convert the original data set;
the determining unit is configured to determine, according to the original data set, a low-rank dictionary and a weight matrix corresponding to the original data set, and to determine, according to the low-rank dictionary and the weight matrix, a representation coefficient corresponding to the original data set;
the establishing unit is configured to establish, according to the representation coefficient, a similarity matrix corresponding to the original data set; and
the obtaining unit is configured to obtain, based on the similarity matrix and by using spectral clustering, a clustering result corresponding to the original data set.
10. The data clustering device according to claim 9, wherein the converting unit is specifically configured to perform dimension reduction processing on the original data set after the original data set is received.
11. The data clustering device according to claim 9, wherein the determining unit is specifically configured to determine the low-rank dictionary from the original data set according to a first objective function, wherein the first objective function is used to perform denoising processing on the original data set;
or the determining unit is further specifically configured to obtain a third objective function according to the first weight matrix; obtain a Laplacian matrix according to the second weight matrix; convert the third objective function according to the Laplacian matrix to obtain a converted third objective function; and solve the converted third objective function to obtain the representation coefficient.
12. The data clustering device according to claim 9, wherein the weight matrix comprises a first weight matrix and a second weight matrix, and the determining unit is further specifically configured to calculate the first weight matrix according to the original data set, wherein the first weight matrix is used to reduce the representation coefficient; and to determine the second weight matrix according to a second objective function and the original data set, wherein the second weight matrix is used to represent the local relationships, in the original space, of the data in the original data set.
13. The data clustering device according to claim 11, wherein the determining unit is further specifically configured to iteratively solve the converted third objective function according to a preset auxiliary variable to obtain the representation coefficient.
14. The data clustering device according to claim 10, wherein the determining unit is further configured to determine a category parameter corresponding to the original data set after the dimension reduction processing is performed on the original data set.
15. The data clustering device according to claim 14, wherein the obtaining unit is specifically configured to calculate, according to the similarity matrix, a normalized symmetric Laplacian matrix corresponding to the original data set; form a target matrix according to the category parameter and the normalized symmetric Laplacian matrix; normalize the target matrix to obtain a normalized target matrix; and perform clustering processing on the normalized target matrix to obtain the clustering result corresponding to the original data set.
16. A data clustering device, comprising a processor, a memory storing instructions executable by the processor, a communication interface, and a bus connecting the processor, the memory and the communication interface, wherein when the instructions are executed by the processor, the method according to any one of claims 1-8 is implemented.
17. A computer-readable storage medium having stored thereon a program for use in a data clustering device, wherein when the program is executed by a processor, the method according to any one of claims 1-8 is implemented.
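Taken together, the claims describe a pipeline in the general low-rank representation (LRR) family: compute a representation coefficient under a low-rank objective, build a similarity matrix from it, then apply spectral clustering. This excerpt does not give the patent's objective functions, so the sketch below uses the textbook noiseless LRR special case, whose minimizer has a closed form (Z = V·Vᵀ from the skinny SVD of the data), together with the symmetrization W = (|Z| + |Zᵀ|)/2 that is conventional in LRR clustering. Both are stand-ins, not the patent's exact formulation.

```python
import numpy as np

def lrr_closed_form(X, tol=1e-8):
    """Closed-form minimizer of ||Z||_* subject to X = X Z: Z = V V^T,
    where V holds the right singular vectors of X for nonzero singular
    values (the 'shape interaction matrix'). An illustrative stand-in
    for the patent's representation-coefficient step.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s.max()).sum()) if s.size else 0
    Vr = Vt[:r].T
    return Vr @ Vr.T

def coefficient_to_similarity(Z):
    """Symmetric similarity matrix from the representation coefficient,
    W = (|Z| + |Z^T|) / 2 (the usual LRR convention; an assumption,
    since the claims only say the similarity matrix is established
    'according to the representation coefficient')."""
    A = np.abs(Z)
    return (A + A.T) / 2.0
```

The closed-form Z exactly reconstructs the data (X·Z = X), and W is symmetric and nonnegative, so it can be fed directly to a spectral clustering step like the one in claim 8.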
CN201910784526.6A 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium Active CN112417234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910784526.6A CN112417234B (en) 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910784526.6A CN112417234B (en) 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112417234A (en) 2021-02-26
CN112417234B (en) 2024-01-26

Family

ID=74779690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910784526.6A Active CN112417234B (en) 2019-08-23 2019-08-23 Data clustering method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112417234B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601232A (en) * 2022-12-14 2023-01-13 East China Jiaotong University A color image decolorization method and system based on singular value decomposition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130191425A1 (en) * 2012-01-20 2013-07-25 Fatih Porikli Method for Recovering Low-Rank Matrices and Subspaces from Data in High-Dimensional Matrices
CN106446924A (en) * 2016-06-23 2017-02-22 Capital Normal University Construction of spectral clustering adjacency matrix based on L3CRSC and application thereof
CN107292258A (en) * 2017-06-14 2017-10-24 Nanjing University of Science and Technology Hyperspectral image low-rank representation clustering method based on bilateral weighted modulation and filtering


Also Published As

Publication number Publication date
CN112417234B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
Ma et al. A statistical perspective on algorithmic leveraging
Qiu et al. Learning transformations for clustering and classification.
Liu et al. On the performance of manhattan nonnegative matrix factorization
Zhou et al. Double shrinking sparse dimension reduction
Yang et al. ℓ 0-sparse subspace clustering
Patel et al. Kernel sparse subspace clustering
CN107229757B (en) Video retrieval method based on deep learning and hash coding
Ding et al. Robust multi-view subspace learning through dual low-rank decompositions
Jin et al. Low-rank matrix factorization with multiple hypergraph regularizer
Shao et al. Deep Linear Coding for Fast Graph Clustering.
Chen et al. A generalized model for robust tensor factorization with noise modeling by mixture of Gaussians
US9141885B2 (en) Visual pattern recognition in an image
Qi et al. Multi-dimensional sparse models
Wang et al. Region-aware hierarchical latent feature representation learning-guided clustering for hyperspectral band selection
CN105469063B (en) The facial image principal component feature extracting method and identification device of robust
Peng et al. Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction
Hidru et al. EquiNMF: Graph regularized multiview nonnegative matrix factorization
CN110032704B (en) Data processing method, device, terminal and storage medium
Xie et al. Inducing wavelets into random fields via generative boosting
Wang et al. Modal regression based greedy algorithm for robust sparse signal recovery, clustering and classification
Abrol et al. A geometric approach to archetypal analysis via sparse projections
CN110633732B (en) A low-rank and joint sparsity-based multimodal image recognition method
Shi et al. Hyperspectral Image denoising via Double Subspace Deep Prior
CN112417234B (en) Data clustering method and device and computer readable storage medium
Meng et al. A general framework for understanding compressed subspace clustering algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant