CN109447098B - Image clustering algorithm based on deep semantic embedding - Google Patents

Image clustering algorithm based on deep semantic embedding

Info

Publication number
CN109447098B
CN109447098B (application CN201810982183.XA)
Authority
CN
China
Prior art keywords
semantic
space
data
clustering
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810982183.XA
Other languages
Chinese (zh)
Other versions
CN109447098A (en)
Inventor
郭军
袁璇
许鹏飞
柏浩
刘宝英
陈锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN201810982183.XA priority Critical patent/CN109447098B/en
Publication of CN109447098A publication Critical patent/CN109447098A/en
Application granted granted Critical
Publication of CN109447098B publication Critical patent/CN109447098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention relates to an image clustering algorithm based on deep semantic embedding, which comprises the following steps. Step 1: divide an image data set into a training set and a test set and obtain their respective data feature spaces. Step 2: solve for the mapping function that maps image data from the training-set data feature space obtained in step 1 to the training-set semantic space, and obtain the semantic space of the test set through this mapping function. Step 3: take the result obtained in step 2 as an input layer and perform fusion and dimensionality reduction through autoencoding to obtain a low-dimensional embedding space carrying both semantic information and the original features. Step 4: cluster with the KL divergence function in the low-dimensional embedding space with semantic information and original features obtained in step 3; if the KL divergence function converges, finish; otherwise, return to step 3 and update the input layer of step 3. The invention effectively improves the discriminability of the data features and improves the clustering effect.

Description

Image clustering algorithm based on deep semantic embedding
Technical Field
The invention belongs to the technical field of image clustering and deep learning, and particularly relates to an image clustering algorithm based on deep semantic embedding.
Background
In machine learning and computer vision, clustering high-dimensional images is a major challenge. Traditional clustering algorithms focus mainly on optimizing a distance function and designing the grouping scheme; they include the k-means algorithm, Gaussian mixture models, spectral clustering, and the like. However, these algorithms are limited to clustering in linear embeddings, which makes it difficult for them to handle more complex, high-dimensional data. To cluster such data, clustering algorithms that incorporate dimensionality reduction have therefore been developed.
Unlike traditional clustering algorithms, deep clustering algorithms obtain considerably better clustering results by learning more representative data features. Existing deep clustering algorithms fall into two categories. The first focuses on feature extraction: a dimensionality-reduction algorithm produces a lower-dimensional, more representative feature expression, and a traditional clustering algorithm is then applied on top of it; this achieves a certain clustering effect. The second focuses on combining feature extraction with clustering: it extracts representative feature expressions while also improving the clustering algorithm, so that feature extraction and clustering become more compatible and a better clustering effect is obtained. An important dimensionality-reduction tool in the second category is the autoencoder. Compared with traditional dimensionality-reduction algorithms, autoencoders come in various forms, such as the conventional autoencoder, the sparse autoencoder, and the denoising autoencoder. All of these variants reconstruct the original data, extracting its most representative low-dimensional feature expression by minimizing a reconstruction loss. An autoencoder generally has a three-layer network structure: an input layer, a hidden layer, and an output layer. Increasing the number of hidden layers changes the internal structure of the autoencoder network so that more complex feature relationships of high-dimensional data can be captured.
However, even though an autoencoder performs dimensionality reduction by analyzing the complex feature relationships of high-dimensional data, some discriminative information of the high-dimensional image data is inevitably lost. Compensating for this lost useful information is therefore a problem that clustering algorithms urgently need to solve.
Disclosure of Invention
To address the problems in the prior art, the invention provides an image clustering algorithm based on deep semantic embedding, which comprises the following steps:
step 1: dividing an image data set into a training set and a testing set, and respectively extracting the characteristics of the training set and the testing set as respective data characteristic spaces;
step 2: obtaining a mapping function of mapping the image data from the data feature space of the training set obtained in the step 1 to the semantic space of the training set, and obtaining the semantic space of the test set through the mapping function;
and step 3: performing deep semantic embedding combination on the semantic space of the test set obtained in the step 2 and the data characteristic space of the test set extracted in the step 1, taking the semantic space of the test set and the test data characteristic space after combination as an input layer of ascending dimension, and then performing fusion dimension reduction on the input layer of ascending dimension through self-coding to obtain a low-dimensional embedding space with semantic information and original characteristics;
and 4, step 4: clustering in the embedded space with the semantic information and the original characteristics obtained in the step 3 after the dimension reduction by using the KL divergence function, and finishing if the KL divergence function is converged; otherwise, returning to the step 3 and updating the input layer in the step 3.
Further, step 1 comprises the following substeps:
step 1.1: dividing an image data set with visual attributes into a training set and a test set, wherein the categories of the training set and the test set have no intersection but are associated through the shared visual attributes;
step 1.2: extracting the features of the training set and the test set respectively with the convolutional neural network GoogLeNet, and taking the obtained training-set and test-set features as their respective data feature spaces.
Further, step 2 comprises the following substeps:
step 2.1: constructing a model of mapping data from a data feature space to a semantic space through formula 1, and solving a mapping function W by using a Sylvester function for formula 1:
min_W ||X - W^T S||_F^2 + δ||WX - S||_F^2 (Formula 1)
where X denotes the data feature space of the training set, S denotes the semantic space of the training set, W denotes the mapping function that maps data from the feature space to the semantic space, W^T denotes the mapping function that maps data from the semantic space back to the original feature space, δ denotes a weight coefficient whose value is 50000, and ||·||_F denotes the Frobenius norm of a matrix;
step 2.2: and (3) obtaining the semantic space of the test set by using the mapping function W obtained in the step 2.1 through a formula 2:
S_u = WU (Formula 2)
where U is the data feature space of the test set and S_u is the semantic space of the test set.
Further, step 3 comprises the following substeps:
step 3: performing deep semantic embedding combination of the semantic space of the test set obtained in step 2 with the data feature space of the test set extracted in step 1; taking the combined test-set semantic space and test-set feature space T as a dimension-raised input layer; and then reducing the dimension of this input layer through a stacked encoder composed of several denoising autoencoder layers, finally obtaining the low-dimensional semantic embedding feature layer Z.
Further, step 4 comprises the following substeps:
step 4.1: in the embedded space with semantic information and original features obtained in step 3 after dimensionality reduction, clustering is performed by using a KL divergence function shown in formula 3:
L = KL(Q||P) = Σ_i Σ_j q_ij log(q_ij / p_ij) (Formula 3)
where P denotes the soft assignment, Q denotes the auxiliary target distribution, z_i denotes a semantic embedding point, μ_j denotes a cluster center, and L denotes the loss to be converged; i denotes the i-th sample and ranges from 1 to the total number of samples in the test set; j denotes the j-th class and ranges from 1 to the total number of classes in the test set; p_ij denotes the probability that the i-th sample is assigned to the j-th class, and q_ij denotes the auxiliary target probability that the i-th sample is assigned to the j-th class.
Step 4.2: iterating through a stochastic gradient descent algorithm; if the KL divergence function has converged, ending the iteration; otherwise, after each iteration, updating the semantic embedding points z_i and the cluster centers μ_j through Formula 4, then substituting the updated z_i into the input layer of step 3 and returning to execute step 3;
the updating formula in the random gradient descent algorithm is as follows:
z_i ← z_i - λ ∂L/∂z_i (Formula 4)
μ_j ← μ_j - λ ∂L/∂μ_j
in the formula: λ represents the learning rate, and its value is 0.1.
The technical scheme provided by the invention has the beneficial effects that:
a self-coding method based on a deep neural network is provided for extracting and embedding original features and semantic information, so that a low-dimensional embedding space with the semantic information and the original features is obtained. And finally, completing the clustering task of the image in a lower-dimensional feature space in which semantic information is embedded, effectively improving the discriminability of data features and improving the clustering effect.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of semantic feature extraction;
FIG. 3 shows the stacked autoencoder used for deep semantic embedding;
FIG. 4 shows the clustering accuracy results on the AwA dataset;
FIG. 5 shows the clustering accuracy results on the CUB dataset;
FIG. 6 shows the clustering accuracy results on the SUN dataset;
FIG. 7 shows the clustering NMI results on the AwA dataset;
FIG. 8 shows the clustering NMI results on the CUB dataset;
FIG. 9 shows the clustering NMI results on the SUN dataset.
Detailed Description
First, three image data sets with attributes, AwA, SUN, and CUB, are introduced:
Animals with Attributes dataset (AwA): it collects pictures of 50 animal classes from major image websites, 30475 pictures in total. All samples of 40 classes are used as training data (24295 samples), and the other 10 classes are used as test data (6180 samples). The dataset defines 85 attributes and provides, for each training sample, attribute values defined from prior knowledge.
CUB dataset: a bird database from the California Institute of Technology containing 11788 samples divided into 200 categories. As required by the experiment, 175 classes are selected as training data and the other 25 classes as test data. The dataset contains 312 attributes, and again each training example provides manually defined attribute values.
SUN dataset: the SUN dataset is built mainly on the SUN category database and contains 14340 samples in 717 categories. Researchers generally adopt one of two splits, 645/72 or 707/10, to divide the categories into training data and test data. In this experiment, 707 classes are selected as training data and the remaining 10 classes as test data. This dataset also contains 102 attributes, and again each training example provides manually defined attribute values.
The following embodiments of the present invention are provided, and it should be noted that the present invention is not limited to the following embodiments, and all equivalent changes based on the technical solutions of the present invention are within the protection scope of the present invention.
An image clustering method based on deep semantic embedding comprises the following steps:
step 1: dividing an image data set into a training set and a testing set, and respectively extracting the features of the training set and the testing set as respective data feature spaces, wherein the method comprises the following substeps:
step 1.1: dividing an image data set with visual attributes into a training set and a test set, wherein the categories of the training set and the test set have no intersection but are associated through the shared visual attributes;
step 1.2: extracting the features of the training set and the test set respectively with the convolutional neural network GoogLeNet, and taking the training-set and test-set features as their respective data feature spaces.
Specifically, in step 1.2, 1024-dimensional features are extracted for the training set and the test set, i.e., one 1024-dimensional GoogLeNet feature vector per image.
The method has the following advantage: visual attributes are used as the semantic space linking different image objects because, compared with other attributes, the visual attributes of an image provide more intuitively understandable discriminative information; this information is unaffected by compression, rotation, and scaling of the image, can be shared by many target objects, and is easy to acquire. In real life, humans are adept at recognizing unseen target objects through attributes they share with objects already seen; for example, we can identify the species of a bird we have never seen from attribute features it has in common with birds we know. Applying this human ability in the method, the visual attributes serve as a pivot connecting the training set and the test set, so that conclusions obtained on the training set can be used directly on the test set.
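As an illustration of the feature extraction in step 1.2, the sketch below obtains one 1024-dimensional GoogLeNet feature vector per image; the use of torchvision, ImageNet-pretrained weights, and this exact preprocessing are assumptions rather than details given in the patent.

```python
import torch
import torchvision.models as models
from PIL import Image

# Hypothetical feature extractor: GoogLeNet with its 1000-way classifier removed,
# so that the forward pass returns the 1024-dimensional pooled feature.
weights = models.GoogLeNet_Weights.IMAGENET1K_V1
model = models.googlenet(weights=weights)
model.fc = torch.nn.Identity()
model.eval()
preprocess = weights.transforms()  # standard ImageNet resizing and normalization

def extract_feature(image_path):
    """Return one 1024-dimensional feature vector for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)  # shape: (1024,)
```

Stacking these vectors for all training images and all test images yields the data feature spaces X and U used in step 2.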
Step 2: obtaining a mapping function of mapping the image data from the data feature space of the training set obtained in the step 1 to the semantic space of the training set, and obtaining the semantic space of the test set through the mapping function, wherein the mapping function comprises the following substeps:
step 2.1: a model of mapping data from a feature space to a semantic space is constructed through formula 1, and a mapping matrix W can be obtained by using a Sylvester function for formula 1:
min_W ||X - W^T S||_F^2 + δ||WX - S||_F^2 (Formula 1)
where X denotes the data feature space of the training set, S denotes the semantic space of the training set, W denotes the mapping function that maps data from the feature space to the semantic space, W^T denotes the mapping function that maps data from the semantic space back to the original feature space, δ denotes a weight coefficient whose value is 50000, and ||·||_F denotes the Frobenius norm of a matrix.
Specifically, for step 2.1, when solving equation 1,
first, the objective function of the definition method is:
min_W ||X - W^T WX||_F^2   s.t.  WX = S
Substituting the constraint WX = S into the above formula, the objective can be rewritten as:
min_W ||X - W^T S||_F^2   s.t.  WX = S
A weight coefficient δ is then defined to relax the constraint WX = S, turning the objective function into a standard quadratic form that is easy to solve:
min_W ||X - W^T S||_F^2 + δ||WX - S||_F^2
Taking the derivative of this function and setting it to zero finally yields:
-S(X^T - S^T W) + δ(WX - S)X^T = 0
SS^T W + δWXX^T = SX^T + δSX^T
The above equation has the basic form of a Sylvester equation, through which the mapping function W is solved.
Step 2.2: and (3) obtaining the semantic space of the test set by using the mapping function W obtained in the step 2.1 through a formula 2:
S_u = WU (Formula 2)
where U is the data feature space of the test set and S_u is the semantic space of the test set.
The method has the following advantage: the training set is fed into a basic three-layer autoencoding network structure, and the semantic space of the test set is extracted from the original features and semantic features of the training data; that is, X is the input layer, W is the hidden layer, and the attributes provide the semantic expression of the data. While the mapping function is being learned, X' is the reconstructed training-set feature space; constraining the reconstruction in this way requires that the learned semantic space express the data themselves as accurately as possible, i.e., the reconstruction error is minimized, making the semantic feature extraction more accurate.
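Since the equation derived above, SS^T W + δWXX^T = SX^T + δSX^T = (1 + δ)SX^T, has the standard Sylvester form AW + WB = C, steps 2.1 and 2.2 can be sketched with SciPy's solve_sylvester as follows; the matrix shapes (X: d×n training features, S: s×n training attributes, U: d×m test features) and the use of SciPy are assumptions.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def learn_mapping(X, S, delta=50000.0):
    """Solve SS^T W + delta * W XX^T = (1 + delta) S X^T for the mapping W (s x d)."""
    A = S @ S.T                      # s x s
    B = delta * (X @ X.T)            # d x d
    C = (1.0 + delta) * (S @ X.T)    # s x d
    return solve_sylvester(A, B, C)  # solves A W + W B = C

def test_semantic_space(W, U):
    """Formula 2: S_u = W U, the semantic space of the test set (s x m)."""
    return W @ U
```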
And step 3: performing deep semantic embedding combination on the semantic space of the test set obtained in the step 2 and the data feature space of the test set extracted in the step 1, taking the semantic space and the test data feature space of the test set after combination as an input layer of ascending dimension, and then performing fusion dimension reduction on the input layer of ascending dimension through self-coding to obtain a low-dimensional embedding space with semantic information and original features, wherein:
the stacked encoder comprises encoding and decoding processes, each layer is a denoising self-encoder, data is mapped randomly, certain dimensionalities of the data are set to be 0, a layer-by-layer training mode is adopted, the output of each layer serves as the input of the next layer to continue training, and the encoding process is completed after all layers of encoding training are completed. The decoding process is to fine-tune the parameters through inverse training to achieve the purpose of minimizing the reconstruction loss function. The method adopts a three-layer denoising autoencoder to finally form a layer h containing three middle layers1,h2,h3Is self-encoder.
We first use the fusion T ∈ R^((d+s)×n) of the original test-set features and the complete semantic expression as the input layer, where R^((d+s)×n) denotes a matrix of (d+s) rows and n columns (n samples, each of dimension d+s). A non-linear mapping function f_θ is defined, where θ is the weight parameter of the iterative process and is initialized with a zero-mean Gaussian distribution, giving the final low-dimensional semantic embedding feature layer Z ∈ R^(k×n), a matrix of k rows and n columns in which k is the dimension of the finally extracted low-dimensional semantic embedding, with z_i = f_θ(t_i) ∈ Z, t_i ∈ T. In the decoding stage, t_i' = f'_θ'(z_i) ∈ T, z_i ∈ Z, and the parameters are fine-tuned by the back-propagation algorithm so as to minimize the reconstruction loss function. The result is a stacked autoencoder containing three intermediate layers h1, h2, h3.
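A minimal PyTorch sketch of such a stacked denoising autoencoder is given below; the masking noise implemented with dropout, the hidden widths (500, 500, 2000), and the embedding size k = 10 are assumptions, since the description does not fix these numbers.

```python
import torch
import torch.nn as nn

class StackedDenoisingAE(nn.Module):
    """Encoder f_theta: T -> Z and decoder f'_theta': Z -> T' with three hidden layers."""
    def __init__(self, in_dim, hidden=(500, 500, 2000), k=10, corrupt_p=0.2):
        super().__init__()
        dims = [in_dim, *hidden, k]
        enc, dec = [], []
        for i in range(len(dims) - 1):
            enc += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
            dec = [nn.Linear(dims[i + 1], dims[i]), nn.ReLU()] + dec
        self.corrupt = nn.Dropout(corrupt_p)      # randomly zero some input dimensions
        self.encoder = nn.Sequential(*enc[:-1])   # no activation on the embedding layer Z
        self.decoder = nn.Sequential(*dec[:-1])   # no activation on the reconstruction

    def forward(self, t):
        z = self.encoder(self.corrupt(t))         # low-dimensional semantic embedding z_i
        t_rec = self.decoder(z)                   # reconstruction t_i'
        return z, t_rec

# Pretraining minimizes the reconstruction loss, e.g.
#   loss = nn.functional.mse_loss(t_rec, t)
```

In this sketch the layers are trained jointly for brevity, whereas the description trains them greedily, layer by layer, before fine-tuning by back-propagation.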
The method has the following advantage: the three complex intermediate layers can express the complex feature relationships of high-dimensional data, so that the finally extracted low-dimensional features with semantic embedding can represent the data themselves; performing image clustering on this basis therefore improves the clustering effect.
Step 4: clustering with the KL divergence function in the dimension-reduced embedding space with semantic information and original features obtained in step 3; if the KL divergence function converges, finishing; otherwise, returning to step 3 and updating the input layer of step 3. This specifically includes the following substeps:
step 4.1: in the embedded space with semantic information and original features obtained in step 3 after dimensionality reduction, clustering is performed by using a KL divergence function shown in formula 3:
L = KL(Q||P) = Σ_i Σ_j q_ij log(q_ij / p_ij) (Formula 3)
where P denotes the soft assignment, i.e., the similarity probability between each low-dimensional semantic embedding point z_i and each cluster center μ_j; Q denotes the auxiliary target distribution, which emphasizes high-confidence assignments and normalizes the loss contribution of each center, thereby ensuring the confidence of the cluster assignments; z_i denotes a semantic embedding point, μ_j denotes a cluster center, and L denotes the loss to be converged; i denotes the i-th sample and ranges from 1 to the total number of samples in the test set; j denotes the j-th class and ranges from 1 to the total number of classes in the test set; p_ij denotes the probability that the i-th sample is assigned to the j-th class, and q_ij denotes the auxiliary target probability that the i-th sample is assigned to the j-th class.
For the first clustering operation, the cluster centers {μ_j}, j = 1, ..., c, are initialized with the k-means clustering algorithm, where c is the number of clusters.
The method has the following advantage: in KL clustering, the auxiliary target distribution is derived from the soft assignment. This emphasizes high-confidence assignments, strengthening the predictions, and normalizes the loss contribution of each centroid so that large classes do not distort the hidden feature space; as a result, the KL optimization process improves clustering and feature expression simultaneously.
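Purely as an illustration, the following sketch writes out the soft assignment P and the auxiliary target Q in the DEC-style form of the cited prior art, using the patent's naming; the Student's t kernel with alpha = 1 is an assumption, since the description does not spell out the soft-assignment kernel.

```python
import torch

def soft_assignment(z, mu, alpha=1.0):
    """P: p_ij, similarity probability between embedding z_i and cluster center mu_j."""
    d2 = torch.cdist(z, mu).pow(2)                      # squared distances, shape (n, c)
    num = (1.0 + d2 / alpha).pow(-(alpha + 1.0) / 2.0)  # Student's t kernel
    return num / num.sum(dim=1, keepdim=True)

def target_distribution(p):
    """Q: q_ij, auxiliary target that sharpens high-confidence assignments and
    normalizes each center's contribution by its soft cluster frequency."""
    w = p.pow(2) / p.sum(dim=0, keepdim=True)
    return w / w.sum(dim=1, keepdim=True)

def kl_clustering_loss(p, q):
    """Formula 3: L = KL(Q || P) = sum_ij q_ij log(q_ij / p_ij)."""
    return (q * (q.log() - p.log())).sum()
```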
Step 4.2: iterating through a stochastic gradient descent algorithm; if the KL divergence function has converged, ending the iteration. Otherwise, after each iteration, the semantic embedding points z_i and the cluster centers μ_j are updated through Formula 4, and the updated z_i is then substituted into the input layer of step 3 before returning to execute step 3. The update formulas of the stochastic gradient descent algorithm are as follows:
z_i ← z_i - λ ∂L/∂z_i (Formula 4)
μ_j ← μ_j - λ ∂L/∂μ_j
The updating of the parameter z_i is stopped when the minimum reconstruction rate falls below 0.1.
The method has the following advantage: the stochastic gradient descent (SGD) algorithm divides the data set into N buckets and uses the data of one bucket for each update, allowing Formula 3 to converge more quickly.
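Continuing the sketch, one possible form of the step 4.2 iteration is shown below, reusing the soft_assignment, target_distribution, and kl_clustering_loss helpers defined above; the full-batch updates, plain SGD with learning rate λ = 0.1, the tolerance-based convergence test, and scikit-learn's k-means initialization are simplifying assumptions rather than details fixed by the patent.

```python
import torch
from sklearn.cluster import KMeans

def deep_semantic_embedded_clustering(ae, T, c, n_iter=200, lr=0.1, tol=1e-4):
    """ae: a trained StackedDenoisingAE (see the step-3 sketch); T: fused test tensor,
    arranged row-wise as (n, d+s), i.e., one sample per row; c: number of clusters.
    Returns one cluster label per test sample."""
    with torch.no_grad():
        z0 = ae(T)[0]
    centers = KMeans(n_clusters=c, n_init=10).fit(z0.numpy()).cluster_centers_
    mu = torch.tensor(centers, dtype=T.dtype, requires_grad=True)  # cluster centers mu_j
    opt = torch.optim.SGD(list(ae.parameters()) + [mu], lr=lr)

    prev = float("inf")
    for _ in range(n_iter):
        z, _ = ae(T)
        p = soft_assignment(z, mu)
        q = target_distribution(p).detach()   # target held fixed within the update
        loss = kl_clustering_loss(p, q)
        opt.zero_grad()
        loss.backward()                       # gradients w.r.t. both z_i (via ae) and mu_j
        opt.step()
        if abs(prev - loss.item()) < tol:     # treated here as convergence of Formula 3
            break
        prev = loss.item()

    with torch.no_grad():
        return soft_assignment(ae(T)[0], mu).argmax(dim=1)
```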
The method is applied to specific data sets below to demonstrate its superior clustering effect.
(1) The table below gives the specific information of the three data sets.

Data set    Number of samples    Semantic space    Semantic dimension    Training/test classes
AwA         30475                Attributes        85                    40/10
CUB         11788                Attributes        312                   175/25
SUN         14340                Attributes        102                   707/10

TABLE 1
(2) Evaluation criteria: the image clustering task is completed according to the implementation steps of the clustering algorithm described above. We use accuracy (ACC) and normalized mutual information (NMI) as the evaluation criteria for clustering performance; the two criteria are briefly described below.
ACC: for the i-th sample in the data set, define g_i as the finally obtained cluster label and h_i as the true label. ACC is then calculated as:
ACC = ( Σ_{i=1}^N δ(h_i, map(g_i)) ) / N
where N is the number of clustered samples and map(g_i) is a mapping function that maps the obtained cluster labels onto the true labels. δ is the matching function between x and y: δ(x, y) = 1 if x = y, and 0 otherwise.
NMI: normalized mutual information is another important clustering evaluation criterion. For any two variables C and D,
NMI(C, D) = I(C, D) / sqrt( H(C) · H(D) )
where I(C, D) computes the mutual information of C and D, and H(C) and H(D) are functions that compute the entropies of C and D, respectively. We define t_l as the number of samples in cluster C_l, t̃_h as the number of samples in the h-th true class, t_{l,h} as the number of samples shared between cluster C_l and the h-th true class, and N as the total number of samples. NMI can then be calculated by the following formula:
NMI = Σ_{l,h} t_{l,h} log( (N · t_{l,h}) / (t_l · t̃_h) ) / sqrt( ( Σ_l t_l log(t_l / N) ) · ( Σ_h t̃_h log(t̃_h / N) ) )
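As an illustration of these two criteria, the sketch below computes ACC with the Hungarian algorithm (SciPy's linear_sum_assignment) for the map(·) step, and NMI with scikit-learn's geometric-mean normalization; both implementation choices are assumptions rather than details prescribed by the description.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """ACC = (1/N) * sum_i delta(h_i, map(g_i)), with map(.) found by optimal matching."""
    h = np.asarray(true_labels)
    g = np.asarray(cluster_labels)
    k = int(max(h.max(), g.max())) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for hi, gi in zip(h, g):
        counts[gi, hi] += 1                        # co-occurrence of cluster gi and class hi
    rows, cols = linear_sum_assignment(-counts)    # maximize the number of matched samples
    mapping = dict(zip(rows, cols))                # cluster label -> true label
    return float(np.mean([mapping[gi] == hi for hi, gi in zip(h, g)]))

def clustering_nmi(true_labels, cluster_labels):
    """NMI = I(C, D) / sqrt(H(C) * H(D)), i.e., geometric-mean normalization."""
    return normalized_mutual_info_score(true_labels, cluster_labels,
                                        average_method="geometric")
```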
(3) Result analysis:
algorithm AWA CUB SUN
K-means 0.8427 0.4703 0.7417
K-means++ 0.8436 0.4649 0.8
SAE+k-means 0.9159 0.4969 0.845
DEC 0.9125 0.4902 0.8449
DSEC 0.9307 0.5138 0.875
TABLE 2
Table 2 shows the clustering accuracy of all the methods on the three data sets. It can be seen that our method outperforms the other algorithms, especially on the AwA and SUN data sets. This means that deep autoencoding based on deep neural networks plays a significant role in high-dimensional data clustering tasks. Comparing the SAE+k-means algorithm with the DEC algorithm shows that semantic expression is as important as pixel features in image clustering, and the realization of semantic embedding allows the finally obtained low-dimensional features to better express the data themselves.
Figs. 4, 5, and 6 depict the clustering accuracy results of the DEC and DSEC algorithms on the three data sets.
It can be seen from the figures that the ACC values of both algorithms gradually increase as the number of iterations grows, finally reaching convergence. Our method performs better than the DEC algorithm, while the AwA and CUB results show that our algorithm needs slightly more iterations than DEC and takes slightly more time. This is because the input layer of our autoencoder contains semantic features, so its dimensionality is higher than that of the original features. The SUN data set has relatively few samples, so the two algorithms show no obvious difference in the number of iterations and little difference in the time spent.
Algorithm     AwA       CUB       SUN
K-means       0.8373    0.6059    0.775
K-means++     0.8364    0.6009    0.8096
DEC           0.9029    0.6304    0.8118
DSEC          0.9212    0.6447    0.8374

TABLE 3
Table 3 shows the clustering results of all the clustering methods on the three data sets under the NMI criterion. It can be seen from the table that the deep clustering algorithms, DEC and our DSEC, are significantly superior to the other conventional clustering algorithms. At the same time, the clustering performance of our algorithm is about two percentage points higher than that of the DEC algorithm.
Figs. 7, 8, and 9 show the iterative convergence processes of the DEC and DSEC algorithms under the NMI criterion. The figures likewise show that our algorithm is superior to the DEC algorithm.

Claims (3)

1. An image clustering algorithm based on deep semantic embedding, which is characterized by comprising the following steps:
step 1: dividing an image data set into a training set and a testing set, and respectively extracting the characteristics of the training set and the testing set as respective data characteristic spaces;
step 2: obtaining a mapping function of the image data from the data feature space of the training set obtained in the step 1 to the semantic space of the training set, and obtaining the semantic space of the test set through the mapping function;
and step 3: performing deep semantic embedding combination on the semantic space of the test set obtained in the step 2 and the data characteristic space of the test set extracted in the step 1, taking the semantic space of the test set and the test data characteristic space T after combination as an upscaled input layer, and then performing dimensionality reduction on the upscaled input layer through a stacked encoder consisting of a plurality of layers of denoising autoencoders to finally obtain a low-dimensional semantic embedding characteristic layer Z;
and 4, step 4: clustering in the embedded space with the semantic information and the original characteristics obtained in the step 3 after the dimension reduction by using the KL divergence function, and finishing if the KL divergence function is converged; otherwise, returning to the step 3 and updating the input layer in the step 3;
step 4 comprises the following substeps:
step 4.1: embedding the low-dimensional semantics obtained in the step 3 into a feature layer Z, and clustering by using a KL divergence function shown in a formula 3:
L = KL(Q||P) = Σ_i Σ_j q_ij log(q_ij / p_ij) (Formula 3)
wherein P denotes the soft assignment, Q denotes the auxiliary target distribution, z_i denotes a semantic embedding point, μ_j denotes a cluster center, and L denotes the loss to be converged; i denotes the i-th sample and ranges from 1 to the total number of samples in the test set; j denotes the j-th class and ranges from 1 to the total number of classes in the test set; p_ij denotes the probability that the i-th sample is assigned to the j-th class, and q_ij denotes the auxiliary target probability that the i-th sample is assigned to the j-th class;
step 4.2: iterating through a stochastic gradient descent algorithm; if the KL divergence function has converged, ending the iteration; otherwise, after each iteration, updating the semantic embedding points z_i and the cluster centers μ_j through Formula 4, then substituting the updated z_i into the input layer of step 3 and returning to execute step 3;
wherein the update formulas of the stochastic gradient descent algorithm are as follows:
z_i ← z_i - λ ∂L/∂z_i (Formula 4)
μ_j ← μ_j - λ ∂L/∂μ_j
in the formula: λ represents a learning rate and takes a value of 0.1.
2. The depth semantic embedding-based image clustering algorithm according to claim 1, wherein step 1 comprises the following sub-steps:
step 1.1: dividing an image data set with visual attributes into a training set and a test set, wherein the categories of the training set and the test set have no intersection but are associated through the shared visual attributes;
step 1.2: extracting the features of the training set and the test set respectively with the convolutional neural network GoogLeNet, and taking the obtained training-set and test-set features as their respective data feature spaces.
3. The depth semantic embedding-based image clustering algorithm according to claim 1, wherein the step 2 comprises the following sub-steps:
step 2.1: constructing a model of mapping data from a data feature space to a semantic space through formula 1, and solving a mapping function W by using a Sylvester function for formula 1:
min_W ||X - W^T S||_F^2 + δ||WX - S||_F^2 (Formula 1)
wherein X denotes the data feature space of the training set, S denotes the semantic space of the training set, W denotes the mapping function that maps data from the feature space to the semantic space, W^T denotes the mapping function that maps data from the semantic space back to the original feature space, δ denotes a weight coefficient whose value is 50000, and ||·||_F denotes the Frobenius norm of a matrix;
step 2.2: and (3) obtaining the semantic space of the test set by using the mapping function W obtained in the step 2.1 through a formula 2:
S_u = WU (Formula 2)
wherein U is the data feature space of the test set and S_u is the semantic space of the test set.
CN201810982183.XA 2018-08-27 2018-08-27 Image clustering algorithm based on deep semantic embedding Active CN109447098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810982183.XA CN109447098B (en) 2018-08-27 2018-08-27 Image clustering algorithm based on deep semantic embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810982183.XA CN109447098B (en) 2018-08-27 2018-08-27 Image clustering algorithm based on deep semantic embedding

Publications (2)

Publication Number Publication Date
CN109447098A CN109447098A (en) 2019-03-08
CN109447098B true CN109447098B (en) 2022-03-18

Family

ID=65530191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810982183.XA Active CN109447098B (en) 2018-08-27 2018-08-27 Image clustering algorithm based on deep semantic embedding

Country Status (1)

Country Link
CN (1) CN109447098B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309853B (en) * 2019-05-20 2022-09-09 湖南大学 Medical image clustering method based on variational self-encoder
CN110472514B (en) * 2019-07-22 2022-05-31 电子科技大学 Adaptive vehicle target detection algorithm model and construction method thereof
CN110751191A (en) * 2019-09-27 2020-02-04 广东浪潮大数据研究有限公司 Image classification method and system
CN111243045B (en) * 2020-01-10 2023-04-07 杭州电子科技大学 Image generation method based on Gaussian mixture model prior variation self-encoder
CN111582912B (en) * 2020-04-20 2023-04-25 佛山科学技术学院 Portrait modeling method based on deep embedding clustering algorithm
CN116522143B (en) * 2023-05-08 2024-04-05 深圳市大数据研究院 Model training method, clustering method, equipment and medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886336B (en) * 2014-04-09 2017-02-01 西安电子科技大学 Polarized SAR image classifying method based on sparse automatic encoder
CN104778281A (en) * 2015-05-06 2015-07-15 苏州搜客信息技术有限公司 Image index parallel construction method based on community analysis
US11238362B2 (en) * 2016-01-15 2022-02-01 Adobe Inc. Modeling semantic concepts in an embedding space as distributions
US20180107682A1 (en) * 2016-10-16 2018-04-19 Ebay Inc. Category prediction from semantic image clustering
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
CN106919951B (en) * 2017-01-24 2020-04-21 杭州电子科技大学 Weak supervision bilinear deep learning method based on click and vision fusion
CN107657276B (en) * 2017-09-15 2023-07-07 赤峰学院 Weak supervision semantic segmentation method based on searching semantic class clusters
CN108108687A (en) * 2017-12-18 2018-06-01 苏州大学 A kind of handwriting digital image clustering method, system and equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253996A (en) * 2011-07-08 2011-11-23 北京航空航天大学 Multi-visual angle stagewise image clustering method
CN102609719A (en) * 2012-01-19 2012-07-25 北京工业大学 Method for identifying place image on the basis of improved probabilistic topic model
CN106909944A (en) * 2017-03-01 2017-06-30 西北大学 A kind of method of face picture cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Semantic Autoencoder for Zero-Shot Learning; Elyor Kodirov et al.; https://arxiv.org/pdf/1704.08345.pdf; 2017-04-29; Sections 3, 4.1-4.2, 5.1 *
Unsupervised Deep Embedding for Clustering Analysis; Junyuan Xie et al.; http://de.arxiv.org/pdf/1511.06335; 2016-05-27; Abstract, Section 3 *

Also Published As

Publication number Publication date
CN109447098A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN109447098B (en) Image clustering algorithm based on deep semantic embedding
CN111259979B (en) Deep semi-supervised image clustering method based on label self-adaptive strategy
Huang et al. Analysis and synthesis of 3D shape families via deep‐learned generative models of surfaces
CN107122809B (en) Neural network feature learning method based on image self-coding
CN108108854B (en) Urban road network link prediction method, system and storage medium
Van Der Maaten Accelerating t-SNE using tree-based algorithms
Unnikrishnan et al. Toward objective evaluation of image segmentation algorithms
CN107316294B (en) Lung nodule feature extraction method based on improved depth Boltzmann machine
CN114048331A (en) Knowledge graph recommendation method and system based on improved KGAT model
CN112101364B (en) Semantic segmentation method based on parameter importance increment learning
CN115640842A (en) Network representation learning method based on graph attention self-encoder
CN114067915A (en) scRNA-seq data dimension reduction method based on deep antithetical variational self-encoder
Lin et al. A deep clustering algorithm based on gaussian mixture model
CN109063725B (en) Multi-view clustering-oriented multi-graph regularization depth matrix decomposition method
CN113920210A (en) Image low-rank reconstruction method based on adaptive graph learning principal component analysis method
CN105160598B (en) Power grid service classification method based on improved EM algorithm
CN117093849A (en) Digital matrix feature analysis method based on automatic generation model
CN109063766B (en) Image classification method based on discriminant prediction sparse decomposition model
CN116304518A (en) Heterogeneous graph convolution neural network model construction method and system for information recommendation
CN115732034A (en) Identification method and system of spatial transcriptome cell expression pattern
CN108256569B (en) Object identification method under complex background and used computer technology
CN116050119A (en) Positive and negative graph segmentation multi-view clustering method based on binary representation
CN114882288A (en) Multi-view image classification method based on hierarchical image enhancement stacking self-encoder
CN114417938A (en) Electromagnetic target classification method using knowledge vector embedding
CN109978066B (en) Rapid spectral clustering method based on multi-scale data structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant