CN116863177A - Object view distillation method for general multi-view object clustering - Google Patents

Object view distillation method for general multi-view object clustering

Info

Publication number
CN116863177A
Authority
CN
China
Prior art keywords
view
network
clustering
student network
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310700264.7A
Other languages
Chinese (zh)
Inventor
刘文静
李海龙
许志伟
王钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202310700264.7A priority Critical patent/CN116863177A/en
Publication of CN116863177A publication Critical patent/CN116863177A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An object view distillation method for general multi-view object clustering collects initial multi-view data of a plurality of samples; constructs a self-encoder, a teacher network, a student network and a knowledge distillation module; the encoder projects each view of a sample into a latent representation and builds a low-dimensional latent space; a teacher network is trained with the multi-view data; a student network is trained with the multi-view data, and the dark knowledge produced by knowledge distillation serves as a new self-supervision signal that guides fine-tuning of the student network; the whole original multi-view dataset is then fed to the overall network, the student network yields the cluster probability distribution of every view, and the per-view probabilities are combined by a weighted sum to obtain the final clustering result. The invention uses knowledge distillation to address the misguidance of model training by over-confident pseudo labels and the correction of inaccurate features in multi-view clustering algorithms, and significantly improves clustering performance.

Description

Object view distillation method for general multi-view object clustering
Technical Field
The invention belongs to the technical field of artificial intelligence and image clustering, and particularly relates to an object view distillation method for general multi-view object clustering.
Background
Data in the real world are mostly collected from different sensors or obtained from different feature extractors. If the different modalities of the data, or the different viewing angles of an image, are fully exploited, a better visual model of the data can be established, which serves the goals of analysis and clustering. Multi-view clustering is a multi-stage clustering approach that aims to group visual objects into different clusters, improving the effectiveness of the model and facilitating downstream tasks such as object detection and action recognition. To achieve this goal, it is critical to explore the common semantics shared across views and to make full use of the pseudo labels obtained through self-supervised learning. However, multi-view clustering has drawbacks and limitations when applied to multiple modalities or multiple views. In practice, the samples of the different views contain more features, and their distribution is disturbed by noise points and missing data. If a linear separation of the conventional representation is used as the pseudo label, the pseudo labels become over-confident (i.e., low-entropy predictions), which in turn misleads model training and ultimately leads to inaccurate clustering. Avoiding the destructive effect of false pseudo labels and correcting inaccurate features during feature learning is therefore a critical task in multi-view clustering.
To address the pseudo-label problem in clustering, multi-stage deep multi-view clustering algorithms have attracted increasing attention, but their performance is limited by the following drawbacks: 1) although pseudo labels provide an explicit indicator for self-supervised learning, representing multi-view instances with pseudo labels ignores intra-cluster and inter-cluster associations, which degrades the feature representation and harms the multi-view clustering result; 2) multi-view data samples contain different features, and their distribution is affected by the completeness of the multi-view data, so inaccurate features compromise the accuracy of what is learned during feature learning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an object view distillation method for general multi-view object clustering, which solves the problems of pseudo labels misguiding model training and of correcting inaccurate features in multi-view clustering algorithms, and improves the multi-view clustering effect.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
an object view distillation method for general multi-view object clustering comprises the following steps:
step 1, collecting initial multi-view object data of N samples, wherein the number of view data of each sample is V;
step 2, constructing a self-encoder, a teacher network, a student network and a knowledge distillation module; the self-encoder consists of an encoder and a decoder shared by all views, the encoder projects the views X_1, X_2, ..., X_V of each sample into the latent representations Z_1, Z_2, ..., Z_V respectively and constructs a low-dimensional latent space; the decoder maps the latent representations back into views;
step 3, training the teacher network with the multi-view data output by the encoder; based on contrastive learning, the teacher network linearly separates the learned high-dimensional features into pseudo labels;
step 4, training the student network with the multi-view data output by the encoder; the student network extracts multi-view features, projects the original features into feature spaces at different levels, and learns the common semantics by maximizing the mutual information across these feature spaces;
step 5, using knowledge distillation to convert the pseudo labels generated by the teacher network into dark knowledge (of dimension k), which serves as a new self-supervision signal that provides an optimization direction for the student network and guides fine-tuning until learning is completed;
step 6, feeding the whole original multi-view dataset to the overall network, wherein the predictor in the student-network branch yields the cluster probability distribution of every view, and the per-view probabilities are combined by a weighted sum to obtain the final clustering result; the overall network is composed of the self-encoder, the knowledge distillation module, the teacher network and the student network.
Compared with the prior art, the invention has the beneficial effects that:
in the multi-view clustering at the present stage, multi-view data introduces more features, so that the features are difficult to be represented by excessively trusted pseudo tags, and the existing multi-stage clustering method is difficult to adapt to the multi-view clustering scene due to more noise.
The invention explores the application of knowledge distillation in multi-view clustering, provides a multi-view knowledge distillation technology, converts the excessively self-trusted pseudo tag into dark knowledge, and reduces the influence of the pseudo tag on multi-view feature learning. Furthermore, contrast methods are used to learn multi-view semantics in different levels of feature space. In the low-dimensional potential space, mutual information is directly maximized by using invariant information clustering, and in the Gao Weizi space, the lower bound of the mutual information is improved according to the stationary points related to the negative sample size. This may correspondingly improve the self-supervised learning multiview representation performance of multiview clustering.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of the model structure of the present invention.
FIG. 3 is a diagram of the teacher network structure of the present invention.
FIG. 4 is a diagram of the student network structure of the present invention.
FIG. 5 is a schematic diagram of the clustering flow of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The concepts and parameters involved in the present invention are presented as follows:
A sample typically has multiple views, which may or may not be complete. A given dataset contains the multi-view data of multiple samples, where multi-view data refers to views of the same sample from different angles, or views of the same sample from the same angle in different modalities (e.g., an RGB image or a depth map). In the various view data, noise points and missing data are likely to interfere; to cope with this, the object view distillation method for general multi-view object clustering can still maintain good clustering performance when the pseudo labels guiding training are over-confident.
Referring to fig. 1, the complete flow of the present invention is as follows:
step 1, collecting an initial multi-view dataset of a number of samples,where each view takes N samples, i.e. the initial multi-view object data comprises N samples. The view data number for each sample is V, V e { 1..v }. D (D) v Representing sample X of view v v K is the number of categories to be clustered.
The object of the present invention is to cluster all N samples into k clusters. Define a set of N samples { X ] 1 ,X 2 ,...,X V (wherein X is 1 Representing the number of views in the sample as 1, X 2 Representing the number of views in the sample as 2, X v The number of views in the sample is denoted v. In this step, the views of the sample may be different angle views of the same sample, or may be views of the same sample at the same angle and in different modes.
Step 2, construct the self-encoder, the teacher network, the student network and the knowledge distillation module.
The model structure is shown schematically in FIG. 2 and consists of a self-encoder, a teacher network, a student network and knowledge distillation. The self-encoder consists of an encoder and a decoder shared by all views; the encoder projects the views X_1, X_2, ..., X_V of each sample into the latent representations Z_1, Z_2, ..., Z_V respectively and constructs a low-dimensional latent space; the decoder maps the latent representations back into views. The encoder and decoder of view v are denoted f_v and g_v. Encoder f_1 projects X_1 into the latent representation Z_1, encoder f_2 projects X_2 into the latent representation Z_2, and so on. The self-encoder of the v-th view is built from Fc_512 fully connected layers, where Fc_512 denotes a fully connected layer with 512 neurons, and each layer is followed by a ReLU layer.
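A minimal PyTorch sketch of such a per-view self-encoder is shown below; the class name, the number of Fc_512 layers and the latent dimension are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn

class ViewAutoEncoder(nn.Module):
    """Per-view autoencoder built from 512-unit fully connected (Fc_512) layers,
    each followed by ReLU, as described for f_v / g_v above. The exact layer
    count and the latent dimension are assumptions for illustration."""
    def __init__(self, input_dim: int, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(          # f_v: X_v -> Z_v
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(          # g_v: Z_v -> reconstructed X_v
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x_v: torch.Tensor):
        z_v = self.encoder(x_v)                # latent representation z_n^v
        x_hat_v = self.decoder(z_v)            # reconstruction of x_n^v
        return z_v, x_hat_v
```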
As shown in FIG. 3, the teacher network constructs an independent high-dimensional subspace with its contrastive module, indirectly raises the lower bound of the mutual information through contrastive learning in that high-dimensional subspace, and linearly separates the learned high-dimensional features into pseudo labels. The teacher network has two linear layers with a ReLU activation function between them.
As shown in FIG. 4, the student network consists of a student network wp, a student network ws and a contrastive module. The student network wp converts the features of the student network ws into probability distributions and uses them as soft labels for distillation; the student network ws builds an independent subspace in which a high-dimensional, hierarchical common representation is captured by contrastive learning; and, for the latent representations Z_1, Z_2, ..., Z_V in the original low-dimensional feature space, mutual information is maximized pairwise. Because maximizing mutual information operates between two views, discovering as much of the information shared by the two views as possible, the mutual information is maximized in pairs. The probability distribution of the student network's output features is compared with the teacher network's dark-knowledge output to compute the KL divergence.
The knowledge distillation module aims to overcome the over-confidence of the pseudo labels output by the teacher network. Knowledge distillation takes the k-dimensional features output by the teacher network and converts the one-dimensional pseudo labels into k-dimensional dark knowledge by adjusting the temperature and applying a Softmax activation function. Since the dark knowledge contains base-level information not contained in the pseudo labels, the dark knowledge obtained by the final distillation is used as the ground truth, i.e. as a new self-supervision signal that guides feature learning.
In this embodiment, knowledge distillation is introduced on top of the teacher network to extract the dark knowledge of the pseudo-label distribution, and the dark knowledge acts as a self-supervision signal providing more accurate guidance to the student network. The superiority of the invention was evaluated on eight public datasets: 1) Scene contains 4485 images from 15 different indoor and outdoor scene categories, with PHOG and GIST features. 2) MNIST-USPS contains 5000 digit image samples in two different styles, the USPS images being 256-dimensional and the MNIST images 784-dimensional. 3) BDGP consists of 5 categories and 2500 samples, with 500 samples per category, each sample represented by visual and textual features. 4) Fashion has 10 categories in total, providing 60,000 28 x 28 pixel images and labels for training and 10,000 28 x 28 pixel images and labels for testing. 5) Caltech consists of 9144 images distributed over 102 categories. To evaluate the robustness of the invention with respect to the number of views, Caltech, a multi-view RGB image dataset, was decomposed into Caltech-2V, Caltech-3V, Caltech-4V and Caltech-5V. Detailed statistics of the datasets are summarized in Table 3-1.
Table 3-1 Summary of the datasets
Step 3, train the teacher network with the multi-view data output by the encoder.
The multi-view data are represented as X_1, X_2, ..., X_V. X_1 passes through encoder f_1 to obtain the latent representation Z_1, X_2 passes through encoder f_2 to obtain the latent representation Z_2, and X_v passes through encoder f_v to obtain the latent representation Z_v. Z_1, Z_2 and Z_v are the latent representations of the first view, the second view and the v-th view, respectively. By way of example, training for 300 epochs may be used so that the self-encoder provides better-quality initialization parameters for the training in step 4.
Based on Z_1, Z_2 and Z_v, three objective functions requiring further optimization are constructed:
i) A deep auto-encoder is built to capture the salient features of the data. By minimizing the reconstruction loss
ℓ_rec = Σ_{v=1}^{V} Σ_{n=1}^{N} || x_n^v - g_v(f_v(x_n^v)) ||_2^2,
the auto-encoder converts the heterogeneous multi-view data into cluster-friendly latent representations, wherein:
ℓ_rec is the evaluation index of how well the auto-encoder converts the heterogeneous multi-view data into latent representations, so the smaller its value the better. For view v, f_v(·) is the encoder and g_v(·) is the decoder, x_n^v denotes the n-th feature vector, the learned latent representation is defined as Z_v with z_n^v = f_v(x_n^v) denoting the n-th latent representation, and the reconstruction of Z_v is g_v(z_n^v). This design allows the self-encoder to maintain the diversity of the respective views, avoid trivial solutions and prevent model collapse, and is the basis for improving multi-view clustering performance.
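A sketch of the reconstruction objective under these definitions follows (PyTorch; the function name and the per-sample averaging are illustrative assumptions):

```python
import torch

def reconstruction_loss(views, autoencoders):
    """Sum of squared reconstruction errors over all V views (the l_rec term above).
    `views` is a list of (N, D_v) tensors and `autoencoders` a matching list of
    per-view autoencoder modules; both names are illustrative."""
    loss = torch.tensor(0.0)
    for x_v, ae in zip(views, autoencoders):
        _, x_hat_v = ae(x_v)                      # g_v(f_v(x_n^v))
        loss = loss + torch.sum((x_v - x_hat_v) ** 2)
    return loss / views[0].shape[0]               # average over the N samples (assumed)
```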
In order for the model to perform feature learning effectively, the teacher and student networks project the low-dimensional representations {Z_1, Z_2, ..., Z_V} into the high-dimensional spaces {t_1, t_2, ..., t_V} and {y_1, y_2, ..., y_V} respectively, at different levels. To achieve effective feature learning at the different levels, objective functions are proposed to learn the common semantics:
ii) Minimizing InfoNCE at the high-dimensional level can be regarded as indirectly maximizing a lower bound of the mutual information, whereas the present invention maximizes the mutual information directly between the different views at the low-dimensional level, which is called invariant information clustering and can be expressed as
ℓ_iic = - Σ_{v=1}^{V} Σ_{v'≠v} I(z_n^v, z_n^{v'}),
where I denotes the mutual information, ℓ_iic denotes the (negated) maximized mutual information, and z_n^{v'} denotes the n-th latent representation of the learned v'-th view. As shown in FIG. 3, z_n^v and z_n^{v'} are treated approximately as two independent discrete distributions, from which their joint probability distribution P is obtained. Thus, I is computed directly as
I(z^v, z^{v'}) = Σ_i Σ_j P_ij ln( P_ij / (P_i · P_j) ),
where P_ij is the joint probability of the two distributions and P_i, P_j are their marginals.
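The invariant-information-clustering term can be computed roughly as follows (a generic IIC sketch assuming (N, k) soft cluster assignments for the two views; names are illustrative and this is not necessarily the invention's exact implementation):

```python
import torch

def iic_mutual_information(z_v: torch.Tensor, z_vp: torch.Tensor, eps: float = 1e-9):
    """Invariant-information-clustering estimate of I(Z^v; Z^v') for two views.
    z_v and z_vp are (N, k) soft assignments (rows sum to 1). Returns I; negate
    it to use as a loss to minimize."""
    p = z_v.t() @ z_vp / z_v.shape[0]       # (k, k) empirical joint distribution
    p = (p + p.t()) / 2.0                   # symmetrise, as in standard IIC
    p = p.clamp(min=eps)
    pi = p.sum(dim=1, keepdim=True)         # marginal of Z^v
    pj = p.sum(dim=0, keepdim=True)         # marginal of Z^v'
    return torch.sum(p * (torch.log(p) - torch.log(pi) - torch.log(pj)))
```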
iii) As shown in FIG. 3, the teacher network uses a feature learning method whose goal is to provide supervision signals for optimizing the student network while providing the high-dimensional features {t_1, t_2, ..., t_V} used for linear separation. Contrastive learning helps to better fit the probability distributions and to learn the mutual information at different levels. Given a sample pair t_n^v and t_n^{v'}, the symmetric cross-entropy loss to optimize is
ℓ(t_n^v, t_n^{v'}) = - log [ exp( sim(t_n^v, t_n^{v'}) / τ_t ) / Σ_{m=1}^{N} exp( sim(t_n^v, t_m^{v'}) / τ_t ) ],
where τ_t is the teacher-network temperature parameter controlling the softness of the distribution and sim(·,·) is the cosine similarity. Taking all views of the dataset into account, the optimization target of the teacher network is
ℓ_tea = (1 / 2N) Σ_{v=1}^{V} Σ_{v'≠v} Σ_{n=1}^{N} [ ℓ(t_n^v, t_n^{v'}) + ℓ(t_n^{v'}, t_n^v) ],
where {t_1, t_2, ..., t_V} are the high-dimensional features obtained when the teacher network projects the latent representations of the low-dimensional feature space, {y_1, y_2, ..., y_V} are the high-dimensional features obtained when the student network projects the latent representations of the low-dimensional feature space, (t_n^v, t_n^{v'}) is a sample pair, and ℓ_tea is the optimization target of the teacher network.
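A generic sketch of such a symmetric, temperature-scaled contrastive loss follows (an InfoNCE-style stand-in; the exact negative-pair construction used by the invention may differ, and the default temperature is an assumption):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(t_v: torch.Tensor, t_vp: torch.Tensor, tau_t: float = 0.5):
    """Symmetric cross-entropy contrastive loss between the high-dimensional
    features t^v and t^v' of the same batch, with temperature tau_t.
    Positives are the diagonal pairs (same sample, different view)."""
    t_v = F.normalize(t_v, dim=1)
    t_vp = F.normalize(t_vp, dim=1)
    logits = t_v @ t_vp.t() / tau_t                    # (N, N) cosine similarities
    targets = torch.arange(t_v.shape[0], device=t_v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```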
Step 4, train the student network with the multi-view data output by the encoder.
Through this step, the student network is trained to predict which of the N x N possible pairs in a batch actually occur. To this end, the student network wp learns a multi-view embedding-space feature matrix by maximizing the cosine similarity of the N positive sample pairs (q_n^v, q_n^{v'}) on the diagonal while minimizing the cosine similarity of the embeddings of the (N^2 - N) negative sample pairs. The pairwise similarity in the feature matrix is measured by cosine similarity:
sim(q_n^v, q_m^{v'}) = (q_n^v)(q_m^{v'})^T / ( ||q_n^v|| · ||q_m^{v'}|| ),
where n, m ∈ [1, N], v, v' ∈ [1, V], (q_n^v, q_m^{v'}) denotes a sample pair, T denotes the transpose, q_n^v and q_m^{v'} are the probability distributions obtained when the student network wp converts the embeddings h_n^v and h_m^{v'}, and h_n^v and h_m^{v'} are the embeddings obtained by feeding the latent representations z_n^v and z_m^{v'} to the student network ws.
As shown in FIG. 4, the student network uses the same feature learning method as the teacher network; in particular, the student network requires a regularization term to prevent model collapse. Similarly, to optimize the pairwise similarity, without loss of generality, given a sample pair y_n^v and y_n^{v'}, the symmetric cross-entropy loss to optimize is
ℓ(y_n^v, y_n^{v'}) = - log [ exp( sim(y_n^v, y_n^{v'}) / τ_s ) / Σ_{m=1}^{N} exp( sim(y_n^v, y_m^{v'}) / τ_s ) ],
where ℓ_stu denotes the optimization target of the student network, τ_s is the student-network temperature parameter controlling the softness of the distribution, and (y_n^v, y_n^{v'}) is a sample pair. The invention aims to identify all the positive pairs of the entire dataset, so the contrastive loss over sample pairs is summed over all views and extended to V ≥ 2:
ℓ_stu = (1 / 2N) Σ_{v=1}^{V} Σ_{v'≠v} Σ_{n=1}^{N} [ ℓ(y_n^v, y_n^{v'}) + ℓ(y_n^{v'}, y_n^v) ].
To the above equation an additional entropy balance term can be added,
H = Σ_{v=1}^{V} Σ_{j=1}^{k} s_j^v log s_j^v, with s_j^v = (1/N) Σ_{n=1}^{N} y_{nj}^v.
This regularization term avoids trivial solutions and prevents all sample points from collapsing into the same class.
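One common way to realize such an entropy balance regularizer is sketched below (the exact sign and weighting used by the invention are not reproduced here; names are illustrative):

```python
import torch

def entropy_balance(y_views, eps: float = 1e-9):
    """Entropy regularisation term: it discourages collapsing all samples into one
    cluster by keeping each view's average cluster-assignment distribution spread
    out. `y_views` is a list of (N, k) probability matrices."""
    reg = torch.tensor(0.0)
    for y_v in y_views:
        s_v = y_v.mean(dim=0).clamp(min=eps)   # average assignment per cluster, s_j^v
        reg = reg + torch.sum(s_v * torch.log(s_v))
    return reg
```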
Step 5, use knowledge distillation to convert the pseudo labels generated by the teacher network into dark knowledge, so that the dark knowledge serves as a new self-supervision signal that provides the optimization direction for the student network and guides fine-tuning until learning is completed.
To better exploit the learned common semantics for clustering, a knowledge distillation module is added between the two independent student and teacher subspaces for fine-tuning. The invention does not directly use the soft labels output by the teacher network as the distribution to be distilled, because that probability distribution does not contain significant clustering information. First, the clustering information contained in the high-dimensional features is used to improve the clustering effect of the semantic labels, and the new cluster centers C are obtained by optimizing the following target:
min_{C^v} Σ_{n=1}^{N} min_{j ∈ {1,...,k}} || t_n^v - c_j^v ||_2^2,
where θ is a parameter of the teacher network, t_n^v has dimensionality d_v, C^v = {c_1^v, ..., c_k^v} denotes the new cluster centers of each view over the dataset, and t_n^v is the n-th high-dimensional representation. This step is carried out efficiently with the K-means algorithm, so t_n can be linearly separated according to the cluster centers C to obtain the V groups of pseudo labels P^v. A Softmax activation function is superimposed on the last layer of the predictor, and p_{nm}^v is defined as the probability that the n-th sample is clustered into the m-th cluster, so there are also V groups of probability distributions. With P^{*v} as the obtained dark knowledge, the KL-divergence distillation model is
ℓ_kd = Σ_{v=1}^{V} KL( P^{*v} || softmax( y^v / τ_d ) ),
where y^v is an output sample of the student network, τ_d is the distillation factor, u is the introduced distribution, here Gaussian, and P^{*v} is the obtained dark knowledge. The softmax function makes y^v a relatively sharp distribution while the dark knowledge is a relatively smooth distribution; the KL divergence sets the two against each other and thereby effectively prevents the model from collapsing.
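A sketch of temperature-scaled KL-divergence distillation between the dark knowledge and the student prediction follows (the default temperature and the τ_d² scaling convention are assumptions borrowed from standard distillation practice, not values fixed by the invention):

```python
import torch
import torch.nn.functional as F

def distillation_loss(dark_knowledge: torch.Tensor,
                      student_logits: torch.Tensor,
                      tau_d: float = 4.0):
    """KL-divergence distillation of the teacher's dark knowledge P^{*v} (already a
    smoothed probability distribution of shape (N, k)) into the student's view
    prediction y^v, softened by the distillation temperature tau_d."""
    student_log_prob = F.log_softmax(student_logits / tau_d, dim=1)
    # tau_d**2 is the usual temperature-scaling factor from classic distillation;
    # it is a convention, not something specified by the patent text.
    return F.kl_div(student_log_prob, dark_knowledge,
                    reduction='batchmean') * tau_d ** 2
```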
Step 6, as shown in FIG. 5, the original multi-view dataset is fed to the overall network, the predictor in the student-network branch yields the cluster probability distribution of every view, and the per-view probabilities are combined by a weighted sum to obtain the final clustering result.
The objective function of the overall network combines the objectives defined above (the reconstruction, teacher contrastive, student contrastive and distillation losses).
The whole original multi-view dataset is fed to the overall network, the student network wp is used as the predictor in the student-network branch, the cluster probability distributions of all views are obtained, and the per-view probabilities are combined by a weighted sum to obtain the final clustering result. The invention uses knowledge distillation to address the misguidance of model training by pseudo labels and the correction of inaccurate features in multi-view clustering algorithms, and significantly improves clustering performance.
Regarding the selection of evaluation indices, three indices are used to evaluate clustering performance: accuracy (ACC), normalized mutual information (NMI) and purity (PUR). The higher the values of these indices, the better the clustering performance.
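The three indices can be computed with standard tooling, for instance as follows (a conventional sketch using the Hungarian algorithm for ACC and scikit-learn for NMI, not code from the invention):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC: best one-to-one mapping between predicted clusters and ground-truth
    classes, found with the Hungarian algorithm."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[row, col].sum() / y_true.size

def purity(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """PUR: each predicted cluster is credited with its majority class."""
    total = 0
    for c in np.unique(y_pred):
        labels = y_true[y_pred == c]
        total += np.bincount(labels).max()
    return total / y_true.size

# NMI is available directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```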
Table 3-2 Ablation experiments on the Caltech-2V dataset
Table 3-2 shows the loss components and experimental results for the five variants. Six schemes were designed on four datasets with different numbers of views, and the following was observed: a) every loss plays an indispensable role in the object view distillation method for general multi-view object clustering; b) after the knowledge distillation method is introduced into (1), (3), (5) and (6), there is an obvious improvement, which further proves that the method effectively relieves the over-confidence problem of the pseudo labels and thereby improves clustering performance; c) in (2), adding self-distillation to (4) results in model degradation; d) comparing (1) and (6) shows that optimizing the mutual-information loss brings a large improvement, proving the effectiveness of the proposed method of maximizing mutual information at different levels; e) the four observations above hold for all datasets, which also demonstrates the robustness of the method of the present invention.
Table 3-3 Clustering performance on different datasets
Data set ACC NMI PUR
Scene 0.428 0.432 0.448
MNIST-USPS 0.996 0.987 0.996
BDGP 0.971 0.971 0.991
Fashion 0.993 0.982 0.993
Caltech-2V 0.619 0.533 0.619
Caltech-3V 0.650 0.575 0.663
Caltech-4V 0.809 0.695 0.809
Caltech-5V 0.824 0.709 0.824
Tables 3-3 describe the clustering performance of the present invention at different scale data sets, listing the clustering performance of all methods on eight data sets, from which we obtained the following observations: our object view distillation method for generic multi-view object-oriented clustering achieves optimal performance on all datasets. On the other hand, the invention uses the dark knowledge instead of the pseudo tags to provide more accurate guidance for self-supervised clustering, thereby obtaining excellent clustering results.

Claims (8)

1. An object view distillation method for general multi-view object clustering is characterized by comprising the following steps:
step 1, collecting initial multi-view object data of N samples, wherein the number of view data of each sample is V;
step 2, constructing a self-encoder, a teacher network, a student network and knowledge distillation; the self-encoder consists of an encoder and a decoder shared by all views, the encoder projects the views X_1, X_2, ..., X_V of each sample into the latent representations Z_1, Z_2, ..., Z_V respectively and constructs a low-dimensional latent space; the decoder maps the latent representations back into views;
step 3, training a teacher network by utilizing the multi-view data output by the encoder;
step 4, training a student network by utilizing the multi-view data output by the encoder;
step 5, using knowledge distillation to convert the pseudo labels generated by the teacher network into dark knowledge, which provides an optimization direction for the student network and guides fine-tuning until learning is completed;
and step 6, feeding the original multi-view dataset to the overall network, wherein the predictor in the student-network branch obtains the cluster probability distribution of every view, and the per-view probabilities are combined by a weighted sum to obtain the final clustering result.
2. The method according to claim 1, wherein in step 1, the views of a sample are views of the same sample from different angles, or views of the same sample from the same angle in different modalities.
3. The method of claim 1, wherein the teacher network constructs an independent high-dimensional subspace with its contrastive module, indirectly raises the lower bound of the mutual information by contrastive learning in the high-dimensional subspace, and linearly separates the learned high-dimensional features into pseudo labels;
the student network consists of a student network wp, a student network ws and a contrastive module, wherein the student network wp converts the features of the student network ws into probability distributions and uses them as soft labels for distillation, the contrastive module constructs an independent subspace in which a high-dimensional, hierarchical common representation is captured by contrastive learning, and for the latent representations Z_1, Z_2, ..., Z_V in the original low-dimensional feature space the mutual information is maximized pairwise;
the knowledge distillation takes the k-dimensional features output by the teacher network and converts the one-dimensional pseudo labels into k-dimensional dark knowledge by adjusting the temperature and adding a Softmax activation function; the dark knowledge contains base-level information not contained in the pseudo labels, and the dark knowledge obtained by the final distillation is used as the ground truth and as a self-supervision signal to guide feature learning.
4. The object view distillation method for general multi-view object clustering according to claim 3, wherein in step 3 a deep auto-encoder is constructed, and by minimizing the reconstruction loss
ℓ_rec = Σ_{v=1}^{V} Σ_{n=1}^{N} || x_n^v - g_v(f_v(x_n^v)) ||_2^2
the auto-encoder converts the heterogeneous multi-view data into cluster-friendly latent representations, wherein:
ℓ_rec is the evaluation index of how well the auto-encoder converts the heterogeneous multi-view data into latent representations; for the v-th view f_v(·) is the encoder and g_v(·) is the decoder, x_n^v denotes the n-th feature vector, the learned latent representation is defined as Z_v, z_n^v denotes the n-th latent representation, and g_v(z_n^v) is the reconstruction from Z_v;
maximizing the mutual information between the different views at the low-dimensional level is called invariant information clustering, which is expressed as
ℓ_iic = - Σ_{v=1}^{V} Σ_{v'≠v} I(z_n^v, z_n^{v'}),
where ℓ_iic denotes the (negated) maximized mutual information, I denotes the mutual information, and z_n^{v'} denotes the n-th latent representation of the learned v'-th view.
5. The method of claim 3, wherein the teacher network uses a feature learning method to provide supervision signals for the optimization of the student network while providing the high-dimensional features {t_1, t_2, ..., t_V} used for linear separation; given a sample pair t_n^v and t_n^{v'}, the symmetric cross-entropy loss to optimize is
ℓ(t_n^v, t_n^{v'}) = - log [ exp( sim(t_n^v, t_n^{v'}) / τ_t ) / Σ_{m=1}^{N} exp( sim(t_n^v, t_m^{v'}) / τ_t ) ],
where τ_t is the teacher-network temperature parameter controlling the softness of the distribution; taking all views of the dataset into account, the optimization target of the teacher network is
ℓ_tea = (1 / 2N) Σ_{v=1}^{V} Σ_{v'≠v} Σ_{n=1}^{N} [ ℓ(t_n^v, t_n^{v'}) + ℓ(t_n^{v'}, t_n^v) ],
where {t_1, t_2, ..., t_V} are the high-dimensional features obtained when the teacher network projects the latent representations of the low-dimensional feature space, {y_1, y_2, ..., y_V} are the high-dimensional features obtained when the student network projects the latent representations of the low-dimensional feature space, (t_n^v, t_n^{v'}) is a sample pair, and ℓ_tea is the optimization target of the teacher network.
6. The object view distillation method for general multi-view object clustering according to claim 3, wherein the student network is trained to predict which of the N x N possible pairs in a batch actually occur; the student network wp learns a multi-view embedding-space feature matrix by maximizing the cosine similarity of the N positive sample pairs on the diagonal while minimizing the cosine similarity of the embeddings of the (N^2 - N) negative sample pairs; the pairwise similarity in the feature matrix is measured by cosine similarity:
sim(q_n^v, q_m^{v'}) = (q_n^v)(q_m^{v'})^T / ( ||q_n^v|| · ||q_m^{v'}|| ),
where (q_n^v, q_m^{v'}) denotes a sample pair, T denotes the transpose, q_n^v and q_m^{v'} are the probability distributions obtained when the student network wp converts the embeddings h_n^v and h_m^{v'}, and h_n^v and h_m^{v'} are the embeddings obtained by feeding the latent representations z_n^v and z_m^{v'} to the student network ws;
the student network and the teacher network use the same feature learning method; given a sample pair y_n^v and y_n^{v'}, the symmetric cross-entropy loss to optimize is
ℓ(y_n^v, y_n^{v'}) = - log [ exp( sim(y_n^v, y_n^{v'}) / τ_s ) / Σ_{m=1}^{N} exp( sim(y_n^v, y_m^{v'}) / τ_s ) ],
and the contrastive loss over sample pairs is computed over all views and extended to V ≥ 2:
ℓ_stu = (1 / 2N) Σ_{v=1}^{V} Σ_{v'≠v} Σ_{n=1}^{N} [ ℓ(y_n^v, y_n^{v'}) + ℓ(y_n^{v'}, y_n^v) ],
where ℓ_stu is the optimization target of the student network, τ_s is the student-network temperature parameter controlling the softness of the distribution, and (y_n^v, y_n^{v'}) is a sample pair.
7. The object view distillation method for general multi-view object clustering according to claim 3, wherein in step 5 the clustering information contained in the high-dimensional features is used to improve the clustering effect of the semantic labels, and the new cluster centers C are obtained by optimizing the following target:
min_{C^v} Σ_{n=1}^{N} min_{j ∈ {1,...,k}} || t_n^v - c_j^v ||_2^2,
where θ is a parameter of the teacher network, C^v = {c_1^v, ..., c_k^v} denotes the new cluster centers of each view, t_n^v is the n-th high-dimensional representation, and d_v is the dimensionality of t_n;
t_n is linearly separated according to the cluster centers C to obtain the V groups of pseudo labels P^v; with P^{*v} as the obtained dark knowledge, the KL-divergence distillation model is
ℓ_kd = Σ_{v=1}^{V} KL( P^{*v} || softmax( y^v / τ_d ) ),
where y^v is an output sample of the student network, τ_d is the distillation factor, u is the introduced distribution, here Gaussian, and P^{*v} is the obtained dark knowledge.
8. The object view distillation method for general multi-view object clustering according to claim 3, wherein in step 6 the objective function of the overall network combines the reconstruction, teacher, student and distillation objectives defined above;
the whole original multi-view dataset is fed to the overall network, the student network wp is used as the predictor in the student-network branch, the cluster probability distributions of all views are obtained, and the per-view probabilities are combined by a weighted sum to obtain the final clustering result.
CN202310700264.7A 2023-06-14 2023-06-14 Object view distillation method for general multi-view object clustering Pending CN116863177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310700264.7A CN116863177A (en) 2023-06-14 2023-06-14 Object view distillation method for general multi-view object clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310700264.7A CN116863177A (en) 2023-06-14 2023-06-14 Object view distillation method for general multi-view object clustering

Publications (1)

Publication Number Publication Date
CN116863177A true CN116863177A (en) 2023-10-10

Family

ID=88225878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310700264.7A Pending CN116863177A (en) 2023-06-14 2023-06-14 Object view distillation method for general multi-view object clustering

Country Status (1)

Country Link
CN (1) CN116863177A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292162A (en) * 2023-11-27 2023-12-26 烟台大学 Target tracking method, system, equipment and medium for multi-view image clustering
CN117292162B (en) * 2023-11-27 2024-03-08 烟台大学 Target tracking method, system, equipment and medium for multi-view image clustering
CN117542057A (en) * 2024-01-09 2024-02-09 南京信息工程大学 Multi-view clustering method based on relationship among modular network modeling views
CN117542057B (en) * 2024-01-09 2024-04-05 南京信息工程大学 Multi-view clustering method based on relationship among modular network modeling views
CN117726884A (en) * 2024-02-09 2024-03-19 腾讯科技(深圳)有限公司 Training method of object class identification model, object class identification method and device
CN117726884B (en) * 2024-02-09 2024-05-03 腾讯科技(深圳)有限公司 Training method of object class identification model, object class identification method and device

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
Wang et al. Deep multi-view subspace clustering with unified and discriminative learning
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
Gao et al. SAR image change detection based on multiscale capsule network
CN116863177A (en) Object view distillation method for general multi-view object clustering
Li et al. Instance-aware distillation for efficient object detection in remote sensing images
CN113095442B (en) Hail identification method based on semi-supervised learning under multi-dimensional radar data
CN110766042B (en) Multi-mark feature selection method and device based on maximum correlation minimum redundancy
Fang et al. Confident learning-based domain adaptation for hyperspectral image classification
Huo et al. A batch-mode active learning algorithm using region-partitioning diversity for SVM classifier
CN108845974A (en) Linear dimension reduction method is supervised using the having for separation probability of minimax probability machine
CN113128600A (en) Structured depth incomplete multi-view clustering method
CN110990498A (en) Data fusion method based on FCM algorithm
CN112115806B (en) Remote sensing image scene accurate classification method based on Dual-ResNet small sample learning
Wu et al. Cost-sensitive latent space learning for imbalanced PolSAR image classification
CN116503636A (en) Multi-mode remote sensing image classification method based on self-supervision pre-training
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN116468948A (en) Incremental learning detection method and system for supporting detection of unknown urban garbage
CN113409351B (en) Unsupervised field self-adaptive remote sensing image segmentation method based on optimal transmission
Qi et al. TCNet: A novel triple-cooperative network for video object detection
CN115131563A (en) Interactive image segmentation method based on weak supervised learning
Ong et al. Enhanced symbol recognition based on advanced data augmentation for engineering diagrams
Pang et al. Real-time tracking based on deep feature fusion
Sun et al. Robust multi-feature spectral clusteirng for hyperspectral band selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination