CN109858529B - Scalable image clustering method

Info

Publication number: CN109858529B
Application number: CN201910028637.4A
Authority: CN
Inventors: 梁奕念; 吴宗泽; 任志刚; 谢胜利; 李建中; 曾德宇
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2019-01-11
Filing date: 2019-01-11
Publication date: 2022-11-01
Anticipated expiration: 2039-01-11
Also published as: CN109858529A

Abstract

The invention discloses a scalable image clustering method, applied in the field of image clustering, which addresses the low accuracy and the high time and memory consumption of the prior art when clustering large-scale data. The invention constructs the similarity matrix of the data from representative points selected in the large-scale data and substitutes the similarity matrix into the robust spectral ensemble clustering model, so that the clustering model is simpler and more convenient to operate, and the computational efficiency is significantly improved compared with the original method, which constructs the similarity matrix from all of the data for iterative operation.

Description

Scalable image clustering method

Technical Field

The invention relates to the field of image clustering, in particular to a scalable image clustering method.

Background

With the rapid development of computer and Internet technology and the widespread use of smartphones and cameras, a massive body of image and video information has formed. Faced with massive image data, finding the useful image information has become an urgent and difficult problem, and image clustering technology provides strong support for finding repeated image information. Research on image clustering methods is relevant to image search engines, personalized management of digital photos, identification and filtering of sensitive images, identification of artistic images and the like, and therefore has considerable practical significance. The latest knowledge and research results in related fields can be applied to problems in many aspects of daily life, which in turn promotes the rapid development of image clustering technology.

Image feature extraction and the clustering algorithm are the two key technologies involved in image clustering. Conventional clustering algorithms, for example Sparse Subspace Clustering (SSC), Low-Rank Representation (LRR) and Robust Spectral Ensemble Clustering (RSEC), have successfully solved the clustering problem for low-dimensional data. However, because data in practical applications are complex, existing algorithms often fail on many problems, especially with high-dimensional data and large data sets. Many clustering algorithms work well on small data sets of fewer than 200 data objects, but a large-scale database may contain millions of objects, and clustering on such large data sets results not only in low accuracy but also in large memory consumption.

Disclosure of Invention

The invention provides a scalable image clustering method for overcoming the defects of low accuracy and large memory consumption in the process of clustering large-scale data in the prior art.

In order to solve the technical problems, the technical scheme of the invention is as follows: a scalable image clustering method, comprising the steps of:

S1: selecting and normalizing an image data set X, and performing a fast clustering operation on the normalized data set X p times to obtain the label vectors y¹, y², y³, ..., y^p of p basic clusterings, which form a label matrix Y = [y¹, y², y³, ..., y^p], wherein p is a positive integer greater than 0;

S2: converting the label matrix Y into a binary sparse matrix Y^*;

S3: clustering Y^* into n clustering partitions using the fast clustering operation, selecting the cluster center point of each clustering partition as a representative point, placing the representative points into a matrix R, and retaining the label vector T corresponding to each original image data in the image data set X as the correspondence between the representative points and the original image data in the image data set X;

S4: constructing a similarity matrix S from the matrix R, substituting the similarity matrix S into the robust spectral ensemble clustering model to obtain an objective function, and performing an iterative optimization operation on the objective function;

S5: performing the iterative optimization until the objective function converges, performing the fast clustering operation on the indicator matrix H finally obtained from the objective function, and retaining the clustering label of each clustering partition after clustering to obtain a label matrix P;

S6: associating the original image data in the image data set X with the label matrix P, and assigning the original image data in X represented by each representative point to the clustering partition that the representative point received in S5 when the indicator matrix H was clustered after iteration, to obtain the final clustering result.

Preferably, the fast clustering algorithm is a K-means clustering algorithm.

Preferably, the formula for constructing the similarity matrix S by using the matrix R in S4 is as follows:

Preferably, in S4, the similarity matrix S is substituted into the robust spectral ensemble clustering model, and the objective function of the resulting scalable robust spectral ensemble clustering model is calculated as follows:

wherein tr denotes the trace; H ∈ ℝ^(n×k) is the indicator matrix, representing the final global clustering result; L_z is the normalized Laplacian matrix; Z ∈ ℝ^(n×n) is the learned coefficient matrix; E ∈ ℝ^(n×n) is the error matrix, which contains the noise points in the similarity matrix S; λ₁ and λ₂ are penalty factors in the augmented Lagrangian; and D_z is the degree matrix, used to calculate the number of connected edges of each data point:

D_z = diag([d₁, ..., d_n])

d_i is the sum of all entries in the i-th row of the matrix (Z + Z^T)/2 + HH^T, where 1 ≤ i ≤ n.

Preferably, the above objective function is solved by iterative optimization using ADMM (the alternating direction method of multipliers).

Preferably, the iterative optimization solution of the objective function using ADMM specifically includes the following steps:

Z and J are initialized with Z = J and are then solved for separately, wherein J is an auxiliary variable matrix; the augmented Lagrangian form corresponding to the robust spectral ensemble clustering model is as follows:

wherein Y₁ and Y₂ are two Lagrange multipliers, and μ is a penalty factor used to balance the influence of the related terms;

the formula for solving the updated coefficient J^(t+1) is:

where t denotes the iteration number; the formula for solving Z^(t+1) is:

the formula for solving E^(t+1) is:

the formula for solving H^(t+1) is:

after the updated coefficients J^(t+1), Z^(t+1), E^(t+1) and H^(t+1) are obtained, the Lagrange multipliers are updated with μ^(t) as the step size of the gradient descent method:

wherein μ^(t+1) is updated as μ^(t+1) = ρμ^(t), and ρ is an update coefficient.

Preferably, in S6, the original image data in the image data set X is associated with the label matrix P, and the original image data in X represented by each representative point is assigned to the clustering partition that the representative point received when the indicator matrix H was clustered after iteration, to obtain the final clustering result, according to the following formula:

L_i = P[T[i]]

where i indexes the i-th original image data in the image data set X before iteration; T[i] is the label corresponding to the i-th original image data in X, identifying its representative point; P[T[i]] is the clustering partition label, obtained by clustering the indicator matrix H after iteration, of the representative point with label T[i]; P holds the final clustering partition labels after iterative optimization; and L_i is the final clustering partition of the i-th original image data, inherited from its representative point.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the method constructs the similarity matrix of the data from representative points selected in the data and substitutes the similarity matrix into the robust spectral ensemble clustering model, so that the clustering model is simpler and more convenient to operate, and the computational efficiency is significantly improved compared with the original method, which constructs the similarity matrix from all of the data for iterative operation. The method consumes less time and memory, scales well, and is suitable for data sets of various sizes.

Drawings

FIG. 1 is a general flow diagram of the present invention.

FIG. 2 is a schematic flow chart of the present invention.

FIG. 3 is a schematic diagram of the conversion of the label matrix Y into the binary sparse matrix Y^* according to the present invention.

FIG. 4 is a graph comparing accuracy of the present invention with that of the prior art for different size data sets.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

As shown in the general flow diagram of fig. 1, the present invention comprises the following steps:

S1: selecting and normalizing an image data set X, and performing a fast clustering operation on the normalized data set X p times to obtain the label vectors y¹, y², y³, ..., y^p of p clustering partitions, which form a label matrix Y = [y¹, y², y³, ..., y^p], wherein p is a positive integer greater than 0;

In a specific embodiment, clustering image data sets of different types and sizes are selected as the image data set X according to experimental needs, including the handwritten digit image data sets MNIST and USPS, the face image data sets Yale and ORL, and the still-object image data set COIL20. The data set information is shown in Table 1 below:

Data set	Objects	Dimension (d)	Classes
Yale	165	1024	15
ORL	400	1024	40
COIL20	1440	1024	20
MNIST	4000	784	10
USPS	9298	256	10

Table 1

S2: converting the label matrix Y into a binary sparse matrix Y^*; Y^* is regarded as containing the features of each data object and contains only the elements 0 and 1: an element of 1 indicates that the data object has the corresponding feature, and an element of 0 indicates that it does not. The conversion of the label matrix Y into the binary sparse matrix Y^* is shown in FIG. 3.
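By way of non-limiting illustration, steps S1 and S2 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function names `build_label_matrix` and `to_binary_sparse`, the parameters `k` and `seed`, the use of L2 row normalization, the use of the same number of clusters in every run, and the one-hot encoding of each label vector as the form of Y^* are all assumptions introduced here for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def build_label_matrix(X, p, k, seed=0):
    """S1 (sketch): run K-means p times on the normalized data.

    Column j of the returned (N, p) matrix plays the role of y^j.
    """
    X_norm = normalize(X)  # assumption: L2 normalization of each row
    columns = [
        KMeans(n_clusters=k, n_init=1, random_state=seed + j).fit_predict(X_norm)
        for j in range(p)
    ]
    return np.column_stack(columns)


def to_binary_sparse(Y):
    """S2 (sketch): one-hot encode every column of Y and concatenate.

    Assumption: each base clustering contributes one 0/1 block with a
    column per cluster, so every row of Y^* contains exactly p ones.
    """
    n_rows = Y.shape[0]
    blocks = []
    for j in range(Y.shape[1]):
        labels, codes = np.unique(Y[:, j], return_inverse=True)
        block = sparse.csr_matrix(
            (np.ones(n_rows), (np.arange(n_rows), codes)),
            shape=(n_rows, labels.size),
        )
        blocks.append(block)
    return sparse.hstack(blocks, format="csr")
```

Varying the random seed between runs is one simple way to obtain p different base clusterings; other sources of diversity could equally be used.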

S3: clustering Y^* into n clustering partitions using the K-means fast clustering operation, selecting the cluster center point of each clustering partition as a representative point, placing the representative points into a matrix R, and retaining the label vector T corresponding to each original image data in the image data set X as the correspondence between the representative points and the original image data in the image data set X;
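By way of non-limiting illustration, step S3 may be sketched as follows using the K-means implementation of scikit-learn. The function name `select_representatives` and the parameter `seed` are hypothetical; taking the K-means centroids directly as the rows of R is one reading of "cluster center point" and is an assumption of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans


def select_representatives(Y_star, n, seed=0):
    """S3 (sketch): cluster the rows of Y^* into n partitions.

    Returns R, whose n rows are the cluster centers (the representative
    points), and T, where T[i] is the index of the representative point
    that stands for the i-th original image.
    """
    km = KMeans(n_clusters=n, n_init=10, random_state=seed).fit(Y_star)
    R = km.cluster_centers_  # shape (n, number of columns of Y^*)
    T = km.labels_           # shape (N,)
    return R, T
```

Because only the n rows of R enter the later optimization, the matrices Z and E are of size n × n rather than N × N, which is where the saving in time and memory comes from.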

S4: constructing the similarity matrix S from the matrix R, with the following formula:

substituting the similarity matrix S into the robust spectral ensemble clustering model to obtain the objective function:

D_z = diag([d₁, ..., d_n])
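By way of non-limiting illustration, the degree matrix may be computed as sketched below, taking d_i to be the sum of the i-th row of (Z + Z^T)/2 + HH^T as defined in the disclosure above. The function names are hypothetical. The second function applies the standard symmetric normalized Laplacian I − D^(−1/2) W D^(−1/2) to the same matrix W; whether L_z in the objective function has exactly this form is an assumption of the sketch, and it requires all degrees to be strictly positive.

```python
import numpy as np


def degree_matrix(Z, H):
    """D_z = diag([d_1, ..., d_n]) with d_i the i-th row sum of W."""
    W = (Z + Z.T) / 2.0 + H @ H.T
    return np.diag(W.sum(axis=1))


def normalized_laplacian(Z, H):
    """Assumed form: L = I - D^(-1/2) W D^(-1/2), degrees must be > 0."""
    W = (Z + Z.T) / 2.0 + H @ H.T
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Scale row i by d_i^(-1/2) and column j by d_j^(-1/2).
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
```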

performing the iterative optimization operation on the objective function using ADMM:

wherein Y₁ and Y₂ are two Lagrange multipliers, and μ is a penalty factor used to balance the influence of the related terms;

the formula for solving the updated coefficient J^(t+1) is:

the formula for solving E^(t+1) is:

the formula for solving H^(t+1) is:

wherein μ^(t+1) is updated as μ^(t+1) = ρμ^(t), and ρ is an update coefficient.
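By way of non-limiting illustration, the end of one ADMM iteration may be sketched as follows. The sketch uses only the two updates that are stated in this document: the multiplier update recited in claim 6, Y₁^(t+1) = Y₁^(t) + μ^(t)(S − SZ^(t+1) − E^(t+1)), and the penalty update μ^(t+1) = ρμ^(t). The function name is hypothetical, and the updates of J, Z, E, H and Y₂ are deliberately omitted because their formulas are not reproduced in the text.

```python
import numpy as np


def update_multiplier_and_penalty(Y1, S, Z_next, E_next, mu, rho):
    """Update Y1 and mu after J, Z, E and H have been solved for.

    Y1_next = Y1 + mu * (S - S @ Z_next - E_next)
    mu_next = rho * mu
    """
    residual = S - S @ Z_next - E_next  # violation of S = SZ + E
    Y1_next = Y1 + mu * residual
    mu_next = rho * mu
    return Y1_next, mu_next
```

With ρ > 1 the penalty grows geometrically, so the constraint residual is penalized more heavily in later iterations.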

S5: performing the iterative optimization until the objective function converges, performing the fast clustering operation on the indicator matrix H finally obtained from the objective function, and retaining the clustering label of each clustering partition after clustering to obtain a label matrix P;

S6: associating the original image data in the image data set X with the label matrix P, and assigning the original image data in X represented by each representative point to the clustering partition that the representative point received in S5 when the indicator matrix H was clustered after iteration, to obtain the final clustering result, according to the following formula:

L_i = P[T[i]]

where i indexes the i-th original image data in the image data set X before iteration; T[i] is the label corresponding to the i-th original image data in X, identifying its representative point; P[T[i]] is the clustering partition label, obtained by clustering the indicator matrix H after iteration, of the representative point with label T[i]; P holds the final clustering partition labels after iterative optimization; and L_i is the final clustering partition of the i-th original image data, inherited from its representative point.
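By way of non-limiting illustration, steps S5 and S6 may be sketched as follows. The function name `assign_final_labels` and the parameters `k` and `seed` are hypothetical, and treating P as a one-dimensional array with one cluster label per representative point is an assumption of the sketch. The formula L_i = P[T[i]] then becomes a single indexing operation.

```python
import numpy as np
from sklearn.cluster import KMeans


def assign_final_labels(H, T, k, seed=0):
    """S5 and S6 (sketch).

    H: (n, k) indicator matrix, one row per representative point.
    T: (N,) index of the representative point of each original image.
    Returns P (labels of the n representative points) and L (labels of
    the N original images), where L[i] = P[T[i]].
    """
    P = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(H)
    L = P[np.asarray(T)]  # every image inherits its representative's label
    return P, L
```

The final assignment costs only one lookup per original image, regardless of how large the image data set X is.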

A schematic diagram of the above steps is shown in FIG. 2.

Two evaluation measures are adopted, namely clustering accuracy (ACC) and normalized mutual information (NMI), and the invention is compared with the reference method Robust Spectral Ensemble Clustering (RSEC). The comparison results are shown in Table 2 and Table 3 below:

TABLE 2

TABLE 3
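By way of non-limiting illustration, the two measures may be computed as sketched below. The document does not state how ACC is defined; the sketch assumes the common definition in which predicted clusters are matched one-to-one to ground-truth classes by the Hungarian algorithm before counting correct assignments. The function name `clustering_accuracy` is hypothetical, and NMI is taken from scikit-learn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one matching of clusters to classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids, true_idx = np.unique(y_true, return_inverse=True)
    pred_ids, pred_idx = np.unique(y_pred, return_inverse=True)
    # counts[a, b] = number of samples in predicted cluster a and class b
    counts = np.zeros((pred_ids.size, true_ids.size), dtype=np.int64)
    np.add.at(counts, (pred_idx, true_idx), 1)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / y_true.size


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    found = [1, 1, 2, 2, 0, 0]  # same partition, permuted label names
    print(clustering_accuracy(truth, found))
    print(normalized_mutual_info_score(truth, found))
```

Both measures are invariant to a renaming of the cluster labels, which is why the matching step is needed for ACC.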

The running times of the two schemes on the same data are compared in Table 4:

Time cost/s	SRSEC	RSEC
Yale	0.5329144	0.6049372
ORL	1.0633575	2.7546931
COIL20	3.4406807	45.379373
MNIST	12.250594	1251.1258
USPS	14.639404	10938

Table 4

When data sets of different orders of magnitude are used, the accuracy on the handwritten digit image data set MNIST with 4,000, 10,000 and 70,000 data objects is compared; the experimental result is shown in FIG. 4.

As can be seen from the comparison of the experimental data in the above three respects, the invention reduces the amount of data to be computed, greatly improves image clustering efficiency, and reduces time and space cost while obtaining a better image clustering result.

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A scalable image clustering method, characterized by comprising the following steps:

S1: selecting and normalizing an image data set X, and performing a fast clustering operation on the normalized data set X p times to obtain the label vectors y¹, y², y³, ..., y^p of p basic clusterings, which form a label matrix Y = [y¹, y², y³, ..., y^p], wherein p is a positive integer greater than 0;

S2: converting the label matrix Y into a binary sparse matrix Y^*;

S3: clustering Y^* into n clustering partitions using the fast clustering operation, selecting the cluster center point of each clustering partition as a representative point, placing the representative points into a matrix R, and retaining the label vector T corresponding to each original image data in the image data set X as the correspondence between the representative points and the original image data in the image data set X;

2. A scalable image clustering method according to claim 1, wherein the fast clustering algorithm is a K-means clustering algorithm.

3. A scalable image clustering method according to claim 1, wherein the formula for constructing the similarity matrix S by using the matrix R in S4 is:

4. The scalable image clustering method according to claim 1, wherein in S4 the similarity matrix S is substituted into the robust spectral ensemble clustering model, and the objective function of the resulting scalable robust spectral ensemble clustering model is calculated according to the following formula:

wherein tr denotes the trace; H ∈ ℝ^(n×k) is the indicator matrix, representing the final global clustering result; L_z is the normalized Laplacian matrix; Z ∈ ℝ^(n×n) is the learned coefficient matrix; E ∈ ℝ^(n×n) is the error matrix, which contains the noise points in the similarity matrix S; λ₁ and λ₂ are penalty factors in the augmented Lagrangian; and D_z is the degree matrix, used to calculate the number of connected edges of each data point:

D_z = diag([d₁, ..., d_n])

5. The scalable image clustering method according to claim 4, wherein the objective function is iteratively optimized and solved by using ADMM.

6. The scalable image clustering method according to claim 5, wherein the iterative optimization solution of the objective function using ADMM comprises the following specific steps:

Z and J are initialized with Z = J and are then solved for separately, wherein J is an auxiliary variable matrix; the augmented Lagrangian form corresponding to the robust spectral ensemble clustering model is as follows:

wherein Y₁ and Y₂ are two Lagrange multipliers, and μ is a penalty factor used to balance the influence of the related terms;

the formula for solving the updated coefficient J^(t+1) is:

where t denotes the iteration number; the formula for solving Z^(t+1) is:

the formula for solving E^(t+1) is:

the formula for solving H^(t+1) is:

Y₁^(t+1) = Y₁^(t) + μ^(t)(S − SZ^(t+1) − E^(t+1));

7. The scalable image clustering method according to claim 1, wherein in S6 the original image data in the image data set X is associated with the label matrix P, and the original image data in X represented by each representative point is assigned to the clustering partition that the representative point received when the indicator matrix H was clustered after iteration, to obtain the final clustering result, according to the following formula:

L_i = P[T[i]]

where i indexes the i-th original image data in the image data set X before iteration; T[i] is the label corresponding to the i-th original image data in X, identifying its representative point; P[T[i]] is the clustering partition label, obtained by clustering the indicator matrix H after iteration, of the representative point with label T[i]; P holds the final clustering partition labels after iterative optimization; and L_i is the final clustering partition of the i-th original image data, inherited from its representative point.