CN109858529B - Scalable image clustering method - Google Patents
- Publication number
- CN109858529B
- Authority
- CN
- China
- Prior art keywords
- clustering
- matrix
- image data
- label
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a scalable image clustering method for the field of image clustering, which addresses the low accuracy and the high time and memory consumption of prior-art methods when clustering large-scale data. The invention selects representative points from the large-scale data, constructs a similarity matrix from them, and substitutes the similarity matrix into the robust spectral ensemble clustering model. This makes the clustering model simpler to run, and the computational efficiency is markedly higher than in the original method, which builds the similarity matrix from all of the data for the iterative computation.
Description
Technical Field
The invention relates to the field of image clustering, in particular to a scalable image clustering method.
Background
With the rapid development of computer and internet technology and the widespread use of smartphones and cameras, a massive body of image and video information has formed. Faced with such massive image data, finding useful image information has become an urgent problem, and image clustering provides strong support for discovering repeated image content. Research on image clustering bears on image search engines, personalized management of digital photos, identification and filtering of sensitive images, recognition of artistic images, and so on, and therefore has considerable practical significance. The latest knowledge and results from related research fields can be applied to problems in many areas of everyday life, which in turn promotes the rapid development of image clustering technology.
Image feature extraction and the clustering algorithm are the two key technologies in image clustering. Conventional clustering algorithms such as Sparse Subspace Clustering (SSC), Low Rank Representation (LRR), and Robust Spectral Ensemble Clustering (RSEC) have successfully solved the clustering problem for low-dimensional data, but because of the complexity of data in practical applications, existing algorithms often fail on many problems, especially with high-dimensional data and large data sets. Many clustering algorithms work well on small data sets of fewer than 200 data objects, but a large-scale database may contain millions of objects, and clustering such large data sets leads not only to low accuracy but also to high memory consumption.
Disclosure of Invention
The invention provides a scalable image clustering method for overcoming the defects of low accuracy and large memory consumption in the process of clustering large-scale data in the prior art.
In order to solve the technical problems, the technical scheme of the invention is as follows: a scalable image clustering method, comprising the steps of:
S1: selecting and normalizing an image data set X, and performing a fast clustering operation on the normalized data set X p times to obtain the label vectors y_1, y_2, y_3, ..., y_p of p basic clusterings, which form a label matrix Y = [y_1, y_2, y_3, ..., y_p], where p is a positive integer;
S2: converting the label matrix Y into a binary sparse matrix Y*;
S3: clustering Y* into n cluster partitions using the fast clustering operation, selecting the cluster center of each cluster partition as a representative point and placing it into a matrix R, and retaining the label vector T corresponding to each original image in the image data set X as the correspondence between the representative points and the original image data in X;
S4: constructing a similarity matrix S from the matrix R, substituting the similarity matrix S into the robust spectral ensemble clustering model to obtain an objective function, and performing iterative optimization on the objective function;
S5: iterating the optimization until the objective function converges, performing the fast clustering operation on the indicator matrix H of the final objective function, and retaining the cluster label of each cluster partition after clustering to obtain a label matrix P;
S6: associating the original image data in the image data set X with the label matrix P, and assigning each original image represented by a representative point to the cluster partition that the representative point received in S5 when the iterated indicator matrix H was clustered, to obtain the final clustering result.
Preferably, the fast clustering algorithm is a K-means clustering algorithm.
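As an illustration of the fast clustering operation, the following is a minimal pure-Python sketch of K-means (Lloyd's algorithm). It is not the patent's implementation: the function name `kmeans`, the seeded random initialization, the iteration cap, and the toy points are all assumptions made for the example.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's K-means on a list of equal-length float lists.

    Returns (labels, centers). The initial centers are k distinct input
    points drawn with a seeded RNG (an assumption; the patent does not
    specify the initialization).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centers = kmeans(points, k=2)
```

In step S1 such a routine would be run p times (for example with different seeds) so that each run contributes one label vector to Y.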
Preferably, the formula for constructing the similarity matrix S by using the matrix R in S4 is as follows:
Preferably, in S4, substituting the similarity matrix S into the robust spectral ensemble clustering model gives the objective function of the scalable robust spectral ensemble clustering model as follows:
where tr denotes the trace; H ∈ R^(n×k) is the indicator matrix, which represents the final global clustering result; L_Z is the normalized Laplacian matrix; Z ∈ R^(n×n) is the learned coefficient matrix; E ∈ R^(n×n) is the error matrix, which captures the noisy entries of the similarity matrix S; λ_1 and λ_2 are penalty factors in the augmented Lagrangian; and D_Z is the degree matrix, which records the number of edges connected to each data point:
D_Z = diag([d_1, ..., d_n])
where d_i is the sum of all entries in the i-th row of the matrix (Z + Z^T)/2 + HH^T, for 1 ≤ i ≤ n.
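The degree matrix definition can be checked on a small example. The sketch below computes the row sums of (Z + Z^T)/2 + HH^T and places them on a diagonal; the function name `degree_matrix` and the 3×3 values of Z and H are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def degree_matrix(Z, H):
    """Return (D, d) where d[i] is the i-th row sum of (Z + Z^T)/2 + H H^T
    and D = diag(d). Z is n x n and H is n x k, both as nested lists."""
    n = len(Z)
    k = len(H[0])
    # Symmetrized coefficients plus the co-association term H H^T.
    W = [[(Z[i][j] + Z[j][i]) / 2 + sum(H[i][c] * H[j][c] for c in range(k))
          for j in range(n)]
         for i in range(n)]
    d = [sum(row) for row in W]
    D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return D, d


# Hypothetical values (not from the patent), for illustration only.
Z = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
H = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0]]
D, d = degree_matrix(Z, H)
print(d)  # [2.5, 2.5, 1.0]
```

Here (Z + Z^T)/2 contributes 0.5 to the first two rows and HH^T contributes 2, 2, and 1, which gives the printed degrees.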
Preferably, the above objective function is solved by iterative optimization using ADMM (the alternating direction method of multipliers).
Preferably, solving the objective function by iterative optimization with ADMM comprises the following steps:
initializing Z and J with Z = J, and then solving for Z and J separately, where J is an auxiliary variable matrix; the augmented Lagrangian form of the robust spectral ensemble clustering model is as follows:
where Y_1 and Y_2 are two Lagrange multipliers, and μ is a penalty factor used to balance the influence of the associated terms;
the formula for the updated J^(t+1) is:
where t denotes the iteration number; the formula for Z^(t+1) is:
the formula for E^(t+1) is:
the formula for H^(t+1) is:
after the updated J^(t+1), Z^(t+1), E^(t+1), and H^(t+1) are obtained, the Lagrange multipliers are updated by a gradient step with step size μ^(t):
where μ^(t+1) is updated as μ^(t+1) = ρμ^(t), and ρ is an update coefficient.
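The final bookkeeping step of each ADMM iteration can be sketched as follows. The Y_1 update follows the form stated in claim 6, Y_1 + μ(S − SZ − E), and the penalty update is μ ← ρμ. The Y_2 update shown here, Y_2 + μ(Z − J), is an assumption derived from the constraint Z = J, since its formula is not reproduced in the text. The helper names and the 2×2 values are hypothetical.

```python
def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def combine(A, B, alpha=1.0, beta=1.0):
    """Return alpha*A + beta*B element-wise."""
    return [[alpha * a + beta * b for a, b in zip(ra, rb)]
            for ra, rb in zip(A, B)]


def update_multipliers(S, Z, J, E, Y1, Y2, mu, rho):
    # Residual of the constraint S = SZ + E (form given in claim 6).
    r1 = combine(combine(S, matmul(S, Z), 1.0, -1.0), E, 1.0, -1.0)
    # Residual of the constraint Z = J (ASSUMED form of the Y2 update).
    r2 = combine(Z, J, 1.0, -1.0)
    Y1_new = combine(Y1, r1, 1.0, mu)
    Y2_new = combine(Y2, r2, 1.0, mu)
    return Y1_new, Y2_new, rho * mu  # penalty grows by the factor rho


# Hypothetical 2 x 2 example values.
S = [[1.0, 0.0], [0.0, 1.0]]
Z = [[0.5, 0.0], [0.0, 0.5]]
J = [[0.5, 0.0], [0.0, 0.5]]
E = [[0.0, 0.0], [0.0, 0.0]]
Y1 = [[0.0, 0.0], [0.0, 0.0]]
Y2 = [[0.0, 0.0], [0.0, 0.0]]
Y1_new, Y2_new, mu_new = update_multipliers(S, Z, J, E, Y1, Y2, mu=0.1, rho=1.5)
```

With these values the first residual is 0.5·I, so Y_1 moves to 0.05·I, while Y_2 stays at zero because Z already equals J.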
Preferably, in S6, the original image data in the image data set X is associated with the label matrix P, and each original image represented by a representative point is assigned to the cluster partition obtained for that representative point after the iterated indicator matrix H is clustered, giving the final clustering result according to the following formula:
L_i = P[T[i]]
where i indexes the i-th original image in the image data set X before iteration; T[i] is the label corresponding to the i-th original image in X; P[T[i]] is the label of the cluster partition, obtained by clustering the iterated indicator matrix H, of the representative point corresponding to label T[i]; P holds the labels of the final cluster partitions after iterative optimization; and L_i is the final cluster partition after iterative optimization.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the method selects representative points from the data, constructs the similarity matrix from them, and substitutes the similarity matrix into the robust spectral ensemble clustering model, which makes the clustering model simpler to run; the computational efficiency is markedly higher than in the original method, which builds the similarity matrix from all of the data for the iterative computation. The method consumes less time and memory, scales well, and is suitable for data sets of various sizes.
Drawings
FIG. 1 is a general flow diagram of the present invention.
FIG. 2 is a schematic flow chart of the present invention.
FIG. 3 is a schematic diagram of the conversion of the label matrix Y into the binary sparse matrix Y* according to the present invention.
FIG. 4 is a graph comparing accuracy of the present invention with that of the prior art for different size data sets.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in the general flow diagram of fig. 1, the present invention comprises the following steps:
S1: selecting and normalizing an image data set X, and performing a fast clustering operation on the normalized data set X p times to obtain the label vectors y_1, y_2, y_3, ..., y_p of p clustering partitions, which form a label matrix Y = [y_1, y_2, y_3, ..., y_p], where p is a positive integer;
In a specific embodiment, clustering image data sets of different types and sizes are selected as the image data set X according to experimental needs, including the handwritten digit image data sets MNIST and USPS, the face image data sets Yale and ORL, and the still image data set COIL20. The data set information is shown in Table 1 below:
Data set | Objects | Dimensions | Classes |
---|---|---|---|
Yale | 165 | 1024 | 15 |
ORL | 400 | 1024 | 40 |
COIL20 | 1440 | 1024 | 20 |
MNIST | 4000 | 784 | 10 |
USPS | 9298 | 256 | 10 |
Table 1
S2: converting the label matrix Y into a binary sparse matrix Y*. Y* can be regarded as a feature representation of each data object whose elements are only 0 and 1: an element equal to 1 means that the data object has the corresponding feature, and an element equal to 0 means that it does not. The conversion of the label matrix Y into the binary sparse matrix Y* is shown in FIG. 3.
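One common way to carry out the conversion in S2 is to one-hot encode each basic clustering and concatenate the results, so that every (clustering, cluster) pair becomes one binary feature. The sketch below assumes this encoding; the exact layout used in FIG. 3 is not reproduced in the text, and the function name `to_binary_matrix` and the toy label matrix are hypothetical.

```python
def to_binary_matrix(Y):
    """One-hot encode each column of the n x p label matrix Y and
    concatenate the encodings, returning an n x (total clusters) 0/1 matrix."""
    n, p = len(Y), len(Y[0])
    rows = [[] for _ in range(n)]
    for j in range(p):
        column = [Y[i][j] for i in range(n)]
        cluster_ids = sorted(set(column))  # clusters found by clustering j
        for i in range(n):
            rows[i].extend(1 if column[i] == c else 0 for c in cluster_ids)
    return rows


# Hypothetical label matrix: 4 objects, p = 2 basic clusterings.
Y = [[0, 1],
     [0, 0],
     [1, 0],
     [1, 1]]
Y_star = to_binary_matrix(Y)
print(Y_star)
# [[1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1]]
```

Each row contains exactly p ones, which is why Y* stays sparse as the number of clusters per basic clustering grows.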
S3: clustering Y* into n cluster partitions using the K-means fast clustering operation, selecting the cluster center of each cluster partition as a representative point and placing it into a matrix R, and retaining the label vector T corresponding to each original image in the image data set X as the correspondence between the representative points and the original image data in X;
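Once K-means has assigned every row of Y* to one of the n partitions, the representative matrix R can be assembled from the partition labels. The sketch below assumes that the "cluster center" is the centroid (the mean of the member rows), as in standard K-means; the function name `representative_points` and the example values are hypothetical.

```python
def representative_points(Y_star, T, n_partitions):
    """Return R, whose c-th row is the centroid of the rows of Y_star
    that the label vector T assigns to partition c."""
    dim = len(Y_star[0])
    sums = [[0.0] * dim for _ in range(n_partitions)]
    counts = [0] * n_partitions
    for row, t in zip(Y_star, T):
        counts[t] += 1
        for j, v in enumerate(row):
            sums[t][j] += v
    if 0 in counts:
        raise ValueError("every partition needs at least one member")
    return [[s / counts[c] for s in sums[c]] for c in range(n_partitions)]


# Hypothetical binary matrix (4 objects) and partition labels T.
Y_star = [[1, 0, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 1, 0],
          [0, 1, 0, 1]]
T = [0, 0, 1, 1]
R = representative_points(Y_star, T, n_partitions=2)
print(R)  # [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5]]
```

The vector T is kept alongside R because it is the only record of which original image each representative point stands for.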
S4: constructing a similarity matrix S from the matrix R according to the following formula:
substituting the similarity matrix S into the robust spectral ensemble clustering model to obtain the objective function:
where tr denotes the trace; H ∈ R^(n×k) is the indicator matrix, which represents the final global clustering result; L_Z is the normalized Laplacian matrix; Z ∈ R^(n×n) is the learned coefficient matrix; E ∈ R^(n×n) is the error matrix, which captures the noisy entries of the similarity matrix S; λ_1 and λ_2 are penalty factors in the augmented Lagrangian; and D_Z is the degree matrix, which records the number of edges connected to each data point:
D_Z = diag([d_1, ..., d_n])
where d_i is the sum of all entries in the i-th row of the matrix (Z + Z^T)/2 + HH^T, for 1 ≤ i ≤ n.
Iterative optimization of the objective function is performed using ADMM:
initializing Z and J with Z = J, and then solving for Z and J separately, where J is an auxiliary variable matrix; the augmented Lagrangian form of the robust spectral ensemble clustering model is as follows:
where Y_1 and Y_2 are two Lagrange multipliers, and μ is a penalty factor used to balance the influence of the associated terms;
the formula for the updated J^(t+1) is:
where t denotes the iteration number; the formula for Z^(t+1) is:
the formula for E^(t+1) is:
the formula for H^(t+1) is:
after the updated J^(t+1), Z^(t+1), E^(t+1), and H^(t+1) are obtained, the Lagrange multipliers are updated by a gradient step with step size μ^(t):
where μ^(t+1) is updated as μ^(t+1) = ρμ^(t), and ρ is an update coefficient.
S5: iterating the optimization until the objective function converges, performing the fast clustering operation on the indicator matrix H of the final objective function, and retaining the cluster label of each cluster partition after clustering to obtain a label matrix P;
S6: associating the original image data in the image data set X with the label matrix P, and assigning each original image represented by a representative point to the cluster partition that the representative point received in S5 when the iterated indicator matrix H was clustered, to obtain the final clustering result according to the following formula:
L_i = P[T[i]]
where i indexes the i-th original image in the image data set X before iteration; T[i] is the label corresponding to the i-th original image in X; P[T[i]] is the label of the cluster partition, obtained by clustering the iterated indicator matrix H, of the representative point corresponding to label T[i]; P holds the labels of the final cluster partitions after iterative optimization; and L_i is the final cluster partition after iterative optimization.
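The assignment L_i = P[T[i]] is a plain index lookup: T maps each original image to its representative point, and P maps each representative point to its final cluster. A minimal sketch, with a hypothetical function name and made-up values for T and P:

```python
def propagate_labels(P, T):
    """Return L with L[i] = P[T[i]]: every original image inherits the
    final cluster label of the representative point it belongs to."""
    return [P[t] for t in T]


# Hypothetical example: 6 images, 3 representative points, 2 final clusters.
T = [0, 2, 1, 2, 0, 1]   # representative point of each original image
P = [1, 0, 1]            # final cluster label of each representative point
L = propagate_labels(P, T)
print(L)  # [1, 1, 0, 1, 1, 0]
```

Because the costly optimization only ever sees the representative points, this lookup is the only step whose cost grows with the number of original images after S3.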
A schematic diagram of the above steps is shown in FIG. 2.
Two evaluation measures, clustering accuracy (ACC) and normalized mutual information (NMI), are used, and the method is compared with the reference document "Robust Spectral Ensemble Clustering" (RSEC). The comparison results are shown in Tables 2 and 3 below:
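For readers unfamiliar with the two measures, the sketch below shows one way to compute them. It makes two assumptions that the patent does not state: ACC is taken over the best one-to-one mapping between predicted clusters and true classes (found here by brute force, which is only practical for a handful of clusters), and NMI is normalized by the geometric mean of the two entropies. The function names and label vectors are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def clustering_accuracy(truth, pred):
    """ACC under the best one-to-one cluster-to-class mapping (brute force)."""
    true_ids = sorted(set(truth))
    pred_ids = sorted(set(pred))
    # Pad with dummy targets if there are more clusters than classes.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by sqrt(H(truth) * H(pred)) (assumed)."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum((c / n) * log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    denom = sqrt(entropy(truth) * entropy(pred))
    return mi / denom if denom > 0 else 0.0  # degenerate case: conventions vary


truth = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]    # same partition, permuted cluster names
imperfect = [1, 1, 0, 2, 2, 2]  # one object placed in the wrong cluster
acc_perfect = clustering_accuracy(truth, perfect)
acc_imperfect = clustering_accuracy(truth, imperfect)
nmi_perfect = nmi(truth, perfect)
nmi_imperfect = nmi(truth, imperfect)
```

Both measures ignore how clusters are named, so the permuted but otherwise identical partition scores 1 on each, while the partition with one misplaced object scores 5/6 on ACC and strictly below 1 on NMI.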
TABLE 2
TABLE 3
The running times of the two schemes on the same data are compared in Table 4:
Time cost/s | SRSEC | RSEC |
Yale | 0.5329144 | 0.6049372 |
ORL | 1.0633575 | 2.7546931 |
COIL20 | 3.4406807 | 45.379373 |
MNIST | 12.250594 | 1251.1258 |
USPS | 14.639404 | 10938 |
TABLE 4
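The ratios implied by Table 4 can be worked out directly. The short sketch below divides the RSEC time by the SRSEC time for each data set; the numbers are copied from Table 4, while the variable names are illustrative and SRSEC is taken to denote the scalable method described here.

```python
# (SRSEC seconds, RSEC seconds) per data set, copied from Table 4.
times = {
    "Yale": (0.5329144, 0.6049372),
    "ORL": (1.0633575, 2.7546931),
    "COIL20": (3.4406807, 45.379373),
    "MNIST": (12.250594, 1251.1258),
    "USPS": (14.639404, 10938.0),
}

# Speedup factor = RSEC time / SRSEC time.
speedup = {name: rsec / srsec for name, (srsec, rsec) in times.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

The factor grows from roughly 1.1x on Yale (165 objects) to roughly 747x on USPS (9298 objects), which is consistent with the claim that the benefit increases with data set size.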
To compare data sets of different orders of magnitude, the accuracy on the handwritten image data set MNIST is compared for subsets containing 4,000, 10,000, and 70,000 samples; the experimental result is shown in FIG. 4.
The comparison of the three sets of experimental data shows that the invention reduces the amount of computation, greatly improves image clustering efficiency, and lowers time and space costs while achieving a better image clustering result.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (7)
1. A scalable image clustering method, characterized by comprising the following steps:
S1: selecting and normalizing an image data set X, and performing a fast clustering operation on the normalized data set X p times to obtain the label vectors y_1, y_2, y_3, ..., y_p of p basic clusterings, which form a label matrix Y = [y_1, y_2, y_3, ..., y_p], where p is a positive integer;
S2: converting the label matrix Y into a binary sparse matrix Y*;
S3: clustering Y* into n cluster partitions using the fast clustering operation, selecting the cluster center of each cluster partition as a representative point and placing it into a matrix R, and retaining the label vector T corresponding to each original image in the image data set X as the correspondence between the representative points and the original image data in X;
S4: constructing a similarity matrix S from the matrix R, substituting the similarity matrix S into the robust spectral ensemble clustering model to obtain an objective function, and performing iterative optimization on the objective function;
S5: iterating the optimization until the objective function converges, performing the fast clustering operation on the indicator matrix H of the final objective function, and retaining the cluster label of each cluster partition after clustering to obtain a label matrix P;
S6: associating the original image data in the image data set X with the label matrix P, and assigning each original image represented by a representative point to the cluster partition that the representative point received in S5 when the iterated indicator matrix H was clustered, to obtain the final clustering result.
2. A scalable image clustering method according to claim 1, wherein the fast clustering algorithm is a K-means clustering algorithm.
4. The scalable image clustering method according to claim 1, wherein in S4 the similarity matrix S is substituted into the robust spectral ensemble clustering model to obtain the objective function of the scalable robust spectral ensemble clustering model according to the following formula:
where tr denotes the trace; H ∈ R^(n×k) is the indicator matrix, which represents the final global clustering result; L_Z is the normalized Laplacian matrix; Z ∈ R^(n×n) is the learned coefficient matrix; E ∈ R^(n×n) is the error matrix, which captures the noisy entries of the similarity matrix S; λ_1 and λ_2 are penalty factors in the augmented Lagrangian; and D_Z is the degree matrix, which records the number of edges connected to each data point:
D_Z = diag([d_1, ..., d_n])
where d_i is the sum of all entries in the i-th row of the matrix (Z + Z^T)/2 + HH^T, for 1 ≤ i ≤ n.
5. The scalable image clustering method according to claim 4, wherein the objective function is iteratively optimized and solved by using ADMM.
6. The scalable image clustering method according to claim 5, wherein solving the objective function by iterative optimization with ADMM comprises the following specific steps:
initializing Z and J with Z = J, and then solving for Z and J separately, where J is an auxiliary variable matrix; the augmented Lagrangian form of the robust spectral ensemble clustering model is as follows:
where Y_1 and Y_2 are two Lagrange multipliers, and μ is a penalty factor used to balance the influence of the associated terms;
the formula for the updated J^(t+1) is:
where t denotes the iteration number; the formula for Z^(t+1) is:
the formula for E^(t+1) is:
the formula for H^(t+1) is:
after the updated J^(t+1), Z^(t+1), E^(t+1), and H^(t+1) are obtained, the Lagrange multipliers are updated by a gradient step with step size μ^(t):
Y_1^(t+1) = Y_1^(t) + μ^(t)(S − SZ^(t+1) − E^(t+1));
where μ^(t+1) is updated as μ^(t+1) = ρμ^(t), and ρ is an update coefficient.
7. The scalable image clustering method according to claim 1, wherein in S6 the original image data in the image data set X is associated with the label matrix P, and each original image represented by a representative point is assigned to the cluster partition obtained for that representative point after the iterated indicator matrix H is clustered, giving the final clustering result according to the following formula:
L_i = P[T[i]]
where i indexes the i-th original image in the image data set X before iteration; T[i] is the label corresponding to the i-th original image in X; P[T[i]] is the label of the cluster partition, obtained by clustering the iterated indicator matrix H, of the representative point corresponding to label T[i]; P holds the labels of the final cluster partitions after iterative optimization; and L_i is the final cluster partition after iterative optimization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910028637.4A CN109858529B (en) | 2019-01-11 | 2019-01-11 | Scalable image clustering method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910028637.4A CN109858529B (en) | 2019-01-11 | 2019-01-11 | Scalable image clustering method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109858529A CN109858529A (en) | 2019-06-07 |
CN109858529B true CN109858529B (en) | 2022-11-01 |
Family
ID=66894497
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910028637.4A Active CN109858529B (en) | 2019-01-11 | 2019-01-11 | Scalable image clustering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109858529B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110660132A (en) * | 2019-10-11 | 2020-01-07 | 杨再毅 | Three-dimensional model construction method and device |
CN111767941B (en) * | 2020-05-15 | 2022-11-18 | 上海大学 | Improved spectral clustering and parallelization method based on symmetric nonnegative matrix factorization |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732545A (en) * | 2015-04-02 | 2015-06-24 | 西安电子科技大学 | Texture image segmentation method combined with sparse neighbor propagation and rapid spectral clustering |
CN106355202A (en) * | 2016-08-31 | 2017-01-25 | 广州精点计算机科技有限公司 | Image feature extraction method based on K-means clustering |
CN107316050A (en) * | 2017-05-19 | 2017-11-03 | 中国科学院西安光学精密机械研究所 | Subspace based on Cauchy's loss function is from expression model clustering method |
CN108764276A (en) * | 2018-04-12 | 2018-11-06 | 西北大学 | A kind of robust weights multi-characters clusterl method automatically |
CN108921853A (en) * | 2018-06-22 | 2018-11-30 | 西安电子科技大学 | Image partition method based on super-pixel and clustering of immunity sparse spectrums |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7831538B2 (en) * | 2007-05-23 | 2010-11-09 | Nec Laboratories America, Inc. | Evolutionary spectral clustering by incorporating temporal smoothness |
Non-Patent Citations (2)
Title |
---|
"Robust Spectral Ensemble Clustering"; Zhiqiang Tao et al.; Proceedings of the 25th ACM International on Conference on Information and Knowledge Management; 20161031; pp. 367-376 * |
"Robust spectral clustering algorithm based on path similarity measurement"; Fan Min et al.; Application Research of Computers (计算机应用研究); 20150228; Vol. 32, No. 2; pp. 372-375 * |
Also Published As
Publication number | Publication date |
---|---|
CN109858529A (en) | 2019-06-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Guo et al. | Deep embedded clustering with data augmentation | |
CN108334574B (en) | Cross-modal retrieval method based on collaborative matrix decomposition | |
CN106777318B (en) | Matrix decomposition cross-modal Hash retrieval method based on collaborative training | |
Tao et al. | Latent complete row space recovery for multi-view subspace clustering | |
CN108415883B (en) | Convex non-negative matrix factorization method based on subspace clustering | |
CN109858529B (en) | Scalable image clustering method | |
CN113191385A (en) | Unknown image classification automatic labeling method based on pre-training labeling data | |
Zhang et al. | A Robust k‐Means Clustering Algorithm Based on Observation Point Mechanism | |
Peng et al. | Adaptive attribute and structure subspace clustering network | |
CN110751027A (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN109857892B (en) | Semi-supervised cross-modal Hash retrieval method based on class label transfer | |
Yamada et al. | Guiding labelling effort for efficient learning with georeferenced images | |
CN111027582A (en) | Semi-supervised feature subspace learning method and device based on low-rank graph learning | |
Pengcheng et al. | Fast Chinese calligraphic character recognition with large-scale data | |
CN109886315A (en) | A kind of Measurement of Similarity between Two Images method kept based on core | |
CN113705674A (en) | Non-negative matrix factorization clustering method and device and readable storage medium | |
Luo et al. | Attention regularized Laplace graph for domain adaptation | |
Ng et al. | Incremental hashing with sample selection using dominant sets | |
CN115661504A (en) | Remote sensing sample classification method based on transfer learning and visual word package | |
Chang et al. | Robust subspace clustering by learning an optimal structured bipartite graph via low-rank representation | |
CN111259176B (en) | Cross-modal Hash retrieval method based on matrix decomposition and integrated with supervision information | |
Jammula | Content based image retrieval system using integrated ML and DL-CNN | |
CN114399653A (en) | Fast multi-view discrete clustering method and system based on anchor point diagram | |
Rad et al. | A multi-view-group non-negative matrix factorization approach for automatic image annotation | |
CN112800138A (en) | Big data classification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||