CN112784902B - Image classification method with missing data in mode - Google Patents
Publication number: CN112784902B (application CN202110095029.2A, China). Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis).
Classifications: G06F18/23 — pattern recognition, clustering techniques; G06N3/045 — neural networks, combinations of networks; G06N3/084 — learning methods, backpropagation, e.g. using gradient descent; Y02D10/00 — energy efficient computing.
Abstract
The invention discloses a bimodal clustering method for data with missing modalities. Based on autoencoders, the method learns a modality-specific representation for each modality through an intra-modal reconstruction loss, learns a consistent representation across modalities through a cross-modal contrastive learning loss, and recovers the missing modalities while discarding cross-modal inconsistent information through a cross-modal dual prediction loss, thereby further improving consistency. Data recovery and consistency learning are handled in a unified way, yielding a better clustering effect.
Description
Technical Field
The invention relates to the field of big data analysis, and in particular to an image classification method for data with missing modalities.
Background
Multi-modal data clustering is now widely applied across many fields. In commodity recommendation, massive product images are combined with textual attributes to learn semantic feature representations of images and better match recommendations to user needs. In multi-modal dialogue with intelligent customer service, clustering that fuses vision and language enables automatic responses to users with text, pictures, or video. The success of these multi-modal techniques largely stems from consistency learning on multi-modal data, i.e., exploring and exploiting the inherent dependencies and invariances of data across different modalities. Consistency learning, however, presumes complete multi-modal data, in which every sample covers all modalities and no modal data is missing. Owing to the complexity of data acquisition environments, real data often have missing modalities; for example, in an online conference, some video frames may lose the visual or audio signal because of sensor damage. In medical diagnosis, a patient usually undergoes only part of the possible examinations rather than all of them, and diagnosing the etiology from this partial information is essentially a multi-modal clustering problem with missing data. With current technology, clustering real multi-modal data requires completing the data in advance to guarantee the completeness of the objects to be clustered. Existing completion methods mainly target the similarity between samples rather than the missing data samples themselves, e.g., double-ended alignment incomplete multi-modal clustering (DAIMC), partial multi-modal clustering (PVC), and incomplete multi-modal visual data grouping (IMG) based on matrix factorization.
Incomplete multi-modal data clustering methods fall broadly into two categories. The first is shallow-model methods: for example, the DAIMC method proposed by Menglei Hu et al. models high-order correlations among modalities through low-rank matrix factorization and combines relevant prior information to exploit the consistent information across modalities, thereby realizing multi-modal subspace learning. The second is deep-learning methods: for example, the DM2C method proposed by Yangbangyan Jiang et al. first obtains a modality-specific representation of each modality with an autoencoder, then uses Cycle Generative Adversarial Networks to generate the missing modality data from the complete modality data, and concatenates the modality-specific representations of the modalities to obtain a common representation.
Moreover, almost all existing methods treat data recovery and consistency learning as two independent problems or steps, lacking a unified theoretical treatment; examples include deep mixed-modal clustering (DM2C) and adversarial incomplete multi-modal clustering (AIMC), both based on generative adversarial networks. Under modal data missingness, a data clustering technique that unifies data completion and consistency learning therefore has high application prospects and practical value. For instance, classifying images with missing modality data currently relies mainly on manual work and consumes substantial human resources.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the image classification method for data with missing modalities solves the problem that, in the prior art, classifying images with missing modality data requires a large amount of human resources.
To achieve the aim of the invention, the following technical scheme is adopted:
The image classification method for data with missing modalities comprises the following steps:
S1, feeding the two modality data of each sample having both modalities into the corresponding autoencoders to obtain the corresponding hidden representations;
S2, computing the cross-modal contrastive learning loss and the intra-modal reconstruction loss from the hidden representations of the two modalities;
S3, back-propagating through the current autoencoders according to the cross-modal contrastive learning loss and the intra-modal reconstruction loss to update their parameters and weights;
S4, judging whether the number of back-propagation rounds has reached a threshold; if so, proceeding to step S5, otherwise returning to step S1;
S5, computing the cross-modal contrastive learning loss, the intra-modal reconstruction loss, and the cross-modal dual prediction loss from the current hidden representations of the two modalities;
S6, back-propagating through the current autoencoders according to the latest cross-modal contrastive learning loss, cross-modal dual prediction loss, and intra-modal reconstruction loss to update their parameters and weights;
S7, judging whether the current autoencoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of samples having both modalities, the samples having only the first modality, and the samples having only the second modality, as a bimodal data set with missing data, into the latest autoencoders to obtain the hidden representations of this data set;
S9, obtaining, by the dual mapping, the representation of the missing modality for the hidden representations of the samples having only the first modality and for those of the samples having only the second modality;
S10, concatenating the different modality representations of each sample as its common representation, and clustering the common representations to complete bimodal clustering with missing data.
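The two-stage training schedule implied by steps S1-S7 can be sketched as follows; `warmup` reflects the threshold of 100 back-propagation rounds given in the detailed description, and the function name is ours, not the patent's.

```python
def active_losses(step, warmup=100):
    """Which loss terms drive back-propagation at a given training step.

    Steps S1-S4 (stage 1): contrastive + reconstruction losses only.
    Steps S5-S7 (stage 2): the cross-modal dual prediction loss joins in
    once the back-propagation count reaches the threshold (100 here).
    """
    if step < warmup:
        return ("l_cl", "l_rec")
    return ("l_cl", "l_pre", "l_rec")
```

Note that stage 2 runs until convergence rather than for a fixed count, so this helper only captures which terms are active, not the stopping rule.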
Further, the autoencoder in step S1 comprises an encoder and a decoder. The encoder comprises, connected in sequence, a first fully connected layer, a first normalization layer, a first activation function, a second fully connected layer, a second normalization layer, a second activation function, a third fully connected layer, a third normalization layer, a third activation function, a fourth fully connected layer, and a fourth activation function. The input dimension of the first fully connected layer is the dimension of the input modality data; the output dimensions of the first, second, and third fully connected layers are 1024; the first, second, and third activation functions are all ReLU; the output dimension of the fourth fully connected layer is 128, and the fourth activation function is Softmax.
The decoder comprises, connected in sequence, a fifth fully connected layer, a fourth normalization layer, a fifth activation function, a sixth fully connected layer, a fifth normalization layer, a sixth activation function, a seventh fully connected layer, a sixth normalization layer, a seventh activation function, an eighth fully connected layer, a seventh normalization layer, and an eighth activation function. The input dimension of the fifth fully connected layer is 128; the output dimensions of the fifth, sixth, and seventh fully connected layers are 1024; the fifth through eighth activation functions are all ReLU; and the output dimension of the eighth fully connected layer is the dimension of the input modality data.
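A minimal numpy sketch of the encoder/decoder shapes described above. The normalization layers are omitted for brevity, the weights are random, and the input dimension 1984 is an arbitrary stand-in for a real feature dimension, so this only illustrates the layer widths and activations, not the patent's trained network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def make_layers(dims, rng):
    """Random weights/biases for a stack of fully connected layers."""
    return [(0.01 * rng.standard_normal((i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def encode(x, layers):
    """Three FC(1024)+ReLU blocks, then FC(128)+Softmax, as in the encoder."""
    for k, (w, b) in enumerate(layers):
        x = x @ w + b
        x = softmax(x) if k == len(layers) - 1 else relu(x)
    return x

def decode(z, layers):
    """Three FC(1024)+ReLU blocks, then FC back to the input dimension (ReLU)."""
    for w, b in layers:
        z = relu(z @ w + b)
    return z

rng = np.random.default_rng(0)
d_in = 1984                                  # assumed input feature dimension
enc = make_layers([d_in, 1024, 1024, 1024, 128], rng)
dec = make_layers([128, 1024, 1024, 1024, d_in], rng)

x = rng.standard_normal((4, d_in))           # a batch of 4 samples
z = encode(x, enc)                           # hidden representation (4, 128)
x_hat = decode(z, dec)                       # reconstruction (4, 1984)
```

Because the encoder ends in Softmax, each row of `z` is a probability vector, which is what the over-clustering interpretation of the contrastive loss later relies on.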
Further, in step S2, the cross-modal contrastive learning loss is obtained from the hidden representations of the two modalities according to the formula
ℓ_cl = − Σ_{t=1..m} [ I(z_t^(1), z_t^(2)) + α ( H(z_t^(1)) + H(z_t^(2)) ) ]
where m is the total number of samples for which both modalities exist and t indexes the t-th sample; I(·,·) denotes mutual information; z_t^(1) is the hidden representation of the first modality data of the t-th sample having both modalities, and z_t^(2) is that of the second modality data; H(·) denotes information entropy; and α is the balance parameter of the entropy terms.
Further, in step S2, the intra-modal reconstruction loss is obtained from the hidden representations of the two modalities according to the formula
ℓ_rec = Σ_{t=1..m} Σ_{v=1..2} || x_t^(v) − g^(v)( f^(v)( x_t^(v) ) ) ||²
where m is the total number of samples for which both modalities exist and t indexes the t-th sample; x_t^(v) is the v-th modality data of the t-th sample; f^(v)(·) and g^(v)(·) are the current encoder and decoder for the v-th modality, respectively; and ||·||² is the squared norm.
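A direct numpy transcription of ℓ_rec under the assumption of a squared (Frobenius) norm; the function and variable names are ours, and the identity encoders in the toy check stand in for real trained networks.

```python
import numpy as np

def reconstruction_loss(xs, encoders, decoders):
    """l_rec: sum over complete samples and over both modalities v of
    ||x^(v) - g^(v)(f^(v)(x^(v)))||^2, with each modality given as an
    (m, d_v) matrix holding the m samples that have both modalities."""
    total = 0.0
    for x_v, f_v, g_v in zip(xs, encoders, decoders):
        total += float(np.sum((x_v - g_v(f_v(x_v))) ** 2))
    return total

# Toy check: an identity encoder/decoder pair reconstructs perfectly,
# while a decoder that shifts its input by 1 incurs 1 per element.
x1 = np.ones((3, 5))
x2 = np.zeros((3, 4))
ident = lambda a: a
loss_perfect = reconstruction_loss([x1, x2], [ident, ident], [ident, ident])
loss_shifted = reconstruction_loss([x1, x2], [ident, ident],
                                   [lambda a: a + 1.0, ident])
```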
Further, the specific method of step S3 is as follows: take the value of ℓ_cl + 0.1 ℓ_rec as the current loss, back-propagate through the current autoencoders, and update their parameters and weights, where ℓ_cl is the cross-modal contrastive learning loss and ℓ_rec is the intra-modal reconstruction loss.
Further, in step S5, the cross-modal dual prediction loss is obtained from the current hidden representations of the two modalities according to the formula
ℓ_pre = || G^(1)(Z_1) − Z_2 ||² + || G^(2)(Z_2) − Z_1 ||²
where Z_1 is the set of hidden representations of all first-modality data of the samples having both modalities and Z_2 is the set of hidden representations of all second-modality data of those samples; G^(1)(Z_1) is the prediction of Z_2 from Z_1 and G^(2)(Z_2) is the prediction of Z_1 from Z_2; G^(1)(·) and G^(2)(·) constitute a dual mapping; and ||·||² is the squared norm.
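ℓ_pre in numpy, under our reading of the dual mapping (G^(1) predicts Z_2 from Z_1 and G^(2) predicts Z_1 from Z_2); the toy scaling maps are illustrative only.

```python
import numpy as np

def dual_prediction_loss(z1, z2, g1, g2):
    """l_pre = ||G^(1)(Z_1) - Z_2||^2 + ||G^(2)(Z_2) - Z_1||^2 over the
    hidden representations of the samples having both modalities."""
    return float(np.sum((g1(z1) - z2) ** 2) + np.sum((g2(z2) - z1) ** 2))

z1 = np.full((2, 3), 2.0)
z2 = np.full((2, 3), 4.0)
double = lambda z: 2.0 * z       # toy mapping: multiply by 2
halve = lambda z: 0.5 * z        # its exact inverse
loss_exact = dual_prediction_loss(z1, z2, double, halve)   # dual maps invert each other
loss_off = dual_prediction_loss(z1, z2, double, double)    # second map is wrong
```

When the two maps are exact inverses on the data the loss vanishes, which is precisely the sense in which they form a dual pair.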
Further, the specific method of step S6 is as follows: take the value of ℓ_cl + 0.1 ℓ_pre + 0.1 ℓ_rec as the current loss, back-propagate through the current autoencoders, and update their parameters and weights, where ℓ_cl is the cross-modal contrastive learning loss, ℓ_pre is the cross-modal dual prediction loss, and ℓ_rec is the intra-modal reconstruction loss.
Further, the specific method of step S8 is as follows: according to the formulas
Z_1 = f^(1)(X_1), Z_2 = f^(2)(X_2), Z^(1) = f^(1)(X^(1)), Z^(2) = f^(2)(X^(2))
obtain the hidden representations of the bimodal data set with missing data, comprising the hidden representation Z_1 of the sample set X_1 of first-modality data of samples having both modalities, the hidden representation Z_2 of the sample set X_2 of second-modality data of samples having both modalities, the hidden representation Z^(1) of the sample set X^(1) having only the first modality, and the hidden representation Z^(2) of the sample set X^(2) having only the second modality, where f^(1)(·) is the encoder of the latest autoencoder for the first modality data and f^(2)(·) is the encoder of the latest autoencoder for the second modality data.
Further, the specific method of step S9 is as follows: according to the formulas
Ẑ^(2) = G^(1)(Z^(1)), Ẑ^(1) = G^(2)(Z^(2))
obtain the representation Ẑ^(2) of the missing modality for the hidden representation Z^(1) of the sample set having only the first modality, and the representation Ẑ^(1) of the missing modality for the hidden representation Z^(2) of the sample set having only the second modality, where G^(1)(·) is the mapping corresponding to the first modality, G^(2)(·) is the mapping corresponding to the second modality, and G^(1)(·) and G^(2)(·) constitute a dual mapping.
Further, in step S10, the different modality representations of each sample are concatenated as its common representation as follows: [Z_1, Z_2] is the common representation of the samples having both modalities; [Z^(1), Ẑ^(2)] is the common representation of the samples having only the first modality; and [Ẑ^(1), Z^(2)] is the common representation of the samples having only the second modality.
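Steps S9 and S10 combined in a small numpy helper: the missing side is filled in by the dual map and the two sides are concatenated (the function name and toy maps are ours):

```python
import numpy as np

def common_representation(z1, z2, g1=None, g2=None):
    """Concatenate [z1, z2]; if one modality's representation is missing
    (passed as None), predict it with the corresponding dual map first."""
    if z1 is None:
        z1 = g2(z2)      # recover the first modality from the second (step S9)
    if z2 is None:
        z2 = g1(z1)      # recover the second modality from the first (step S9)
    return np.concatenate([z1, z2], axis=1)

za = np.ones((3, 4))
zb = np.zeros((3, 4))
full = common_representation(za, zb)                            # both present
only_first = common_representation(za, None, g1=lambda z: z + 1.0)
```

Every sample thus ends up with a representation of the same width regardless of which modality was missing, which is what allows a single clustering pass over the whole data set in step S10.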
The beneficial effects of the invention are as follows: based on autoencoders, the invention learns a modality-specific representation for each modality through the intra-modal reconstruction loss, learns a consistent representation across modalities through the cross-modal contrastive learning loss, and recovers the missing modalities while discarding cross-modal inconsistent information through the cross-modal dual prediction loss, further improving consistency. Data recovery and consistency learning are handled in a unified way, yielding a better clustering effect.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a block diagram of a model of the present invention;
FIG. 3 is a comparison chart of accuracy as the missing rate varies from 0 to 0.8 in Example 1;
FIG. 4 is a comparison chart of normalized mutual information as the missing rate varies from 0 to 0.8 in Example 1;
FIG. 5 is a comparison chart of the adjusted Rand index as the missing rate varies from 0 to 0.8 in Example 1;
FIG. 6 is a comparison chart of accuracy as the missing rate varies from 0 to 0.8 in Example 2;
FIG. 7 is a comparison chart of normalized mutual information as the missing rate varies from 0 to 0.8 in Example 2;
FIG. 8 is a comparison chart of the adjusted Rand index as the missing rate varies from 0 to 0.8 in Example 2.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, all inventions making use of the inventive concept fall within the protection of the spirit and scope of the invention as defined by the appended claims.
As shown in FIG. 1 and FIG. 2, the image classification method for data with missing modalities comprises steps S1 to S10 as set forth above, using the autoencoder structure, the loss functions, the dual mapping, and the common-representation construction described above. In step S4, the threshold on the number of back-propagation rounds is set to 100.
In the specific implementation, the entropy is regularized and the parameter α is fixed at 10. The design of the cross-modal contrastive learning loss has two advantages: on the one hand, from information theory, the information entropy is the average amount of information conveyed by an event, and a larger entropy corresponds to a more informative representation; on the other hand, maximizing H(z^(1)) and H(z^(2)) avoids the trivial solution of assigning all samples to the same cluster. To construct I(z^(1), z^(2)), the joint probability distribution p(z, z') of the variables z and z' is defined first. Since a Softmax activation function is stacked at the last layer of each encoder, z^(1) and z^(2) can be regarded as over-clustering probabilities, i.e., as the distributions of two discrete cluster-assignment variables z and z' over D classes, where D is the dimension of z^(1) and z^(2). The joint probability p(z, z') is thus given by the D×D matrix
P = (1/m) Σ_{t=1..m} z_t^(1) (z_t^(2))^T.
Let P_d and P_d' denote the marginal probability distributions P(z = d) and P(z' = d'), obtained by summing the d-th row and the d'-th column of the joint probability matrix P, respectively. For discrete variables, the cross-modal contrastive learning loss function can then be redefined as
ℓ_cl = − Σ_{d=1..D} Σ_{d'=1..D} P_dd' log( P_dd' / (P_d P_d') ) + α Σ_{d=1..D} P_d log P_d + α Σ_{d'=1..D} P_d' log P_d',
where P_dd' is the element in row d and column d' of P.
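The discrete loss can be computed in a few lines of numpy. The joint-matrix construction P = Z1ᵀZ2/m is our reconstruction of the lost formula, and the small `eps` guard against log(0) is an implementation detail, not from the patent.

```python
import numpy as np

def contrastive_loss(z1, z2, alpha=10.0, eps=1e-12):
    """-I(z, z') - alpha*(H(z) + H(z')) computed from the joint matrix P.

    z1, z2: (m, D) soft cluster-assignment matrices (rows sum to 1,
    e.g. Softmax outputs), one row per sample with both modalities.
    """
    m = z1.shape[0]
    P = z1.T @ z2 / m                     # joint distribution, D x D
    P = P / P.sum()                       # guard against numerical drift
    Pd = P.sum(axis=1, keepdims=True)     # marginal P(z = d),  shape (D, 1)
    Pd_ = P.sum(axis=0, keepdims=True)    # marginal P(z' = d'), shape (1, D)
    mi = np.sum(P * np.log((P + eps) / (Pd @ Pd_ + eps)))
    h1 = -np.sum(Pd * np.log(Pd + eps))
    h2 = -np.sum(Pd_ * np.log(Pd_ + eps))
    return float(-(mi + alpha * (h1 + h2)))

# Perfectly aligned one-hot assignments over D = 4 clusters: the mutual
# information and both entropies all equal log 4.
z = np.eye(4)
loss = contrastive_loss(z, z)
```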
To infer the missing modality, the invention proposes a dual prediction mechanism. Specifically, in a latent space parameterized by a neural network, the representation Z_i of a particular modality can be predicted from Z_j by minimizing the conditional entropy H(Z_i | Z_j), where (i, j) = (1, 2) or (2, 1); that is, Z_i is completely determined by Z_j if and only if H(Z_i | Z_j) = 0. A common way to optimize this objective is to introduce a variational distribution Q(Z_i | Z_j) and maximize the lower bound E[log Q(Z_i | Z_j)], since −H(Z_i | Z_j) = E[log P(Z_i | Z_j)] ≥ E[log Q(Z_i | Z_j)].
The variational distribution Q may be of any type, such as a Gaussian, categorical, or Laplace distribution. In particular, taking Q to be the Gaussian distribution N(Z_i | G^(j)(Z_j), σ_i), with variance matrix σ_i, and ignoring the constant terms of the Gaussian, maximizing E[log Q(Z_i | Z_j)] is equivalent to minimizing ||G^(j)(Z_j) − Z_i||². For given bimodal data, this yields the cross-modal dual prediction loss ℓ_pre = ||G^(1)(Z_1) − Z_2||² + ||G^(2)(Z_2) − Z_1||².
It should be noted that without the intra-modal reconstruction loss, the dual prediction loss would admit a trivial solution in which Z_1 and Z_2 collapse to the same constant. After the model converges, the representation of the missing modality corresponding to a sample set having only one modality is easily predicted by the dual mapping.
After the whole model has been trained to convergence on the data with both complete modalities, the entire data set is fed directly into the network, which performs missing-modality completion and infers the corresponding representations. The representations of the different modalities are then concatenated into a common representation, which is clustered with a traditional clustering method such as k-means to complete the bimodal clustering with missing data. Applying the method to any two modalities in the same way, it generalizes directly to multi-modal clustering.
The mapping models G^(2) and G^(1) share the same six-layer network structure:
First layer: fully connected, input 128, output 128, followed by a batch normalization layer (BatchNorm1d); the activation function is ReLU.
Second layer: fully connected, input 128, output 256, followed by BatchNorm1d; the activation function is ReLU.
Third layer: fully connected, input 256, output 128, followed by BatchNorm1d; the activation function is ReLU.
Fourth layer: fully connected, input 128, output 256, followed by BatchNorm1d; the activation function is ReLU.
Fifth layer: fully connected, input 256, output 128, followed by BatchNorm1d; the activation function is ReLU.
Sixth layer: fully connected, input 128, output 128; the activation function is Softmax.
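The six-layer mapping network can be sketched in PyTorch as follows; this is an illustration of the layer sizes listed above, not the authors' released code:

```python
import torch
import torch.nn as nn

def make_mapping_net():
    """Six-layer mapping network G per the layer sizes above:
    five Linear+BatchNorm1d+ReLU blocks, then Linear+Softmax."""
    dims = [(128, 128), (128, 256), (256, 128),
            (128, 256), (256, 128)]
    layers = []
    for d_in, d_out in dims:
        layers += [nn.Linear(d_in, d_out),
                   nn.BatchNorm1d(d_out),
                   nn.ReLU()]
    layers += [nn.Linear(128, 128), nn.Softmax(dim=1)]
    return nn.Sequential(*layers)

# Toy usage: map a batch of 128-dim hidden representations.
g = make_mapping_net()
out = g(torch.randn(4, 128))
```

Because the last activation is Softmax, each output row is a non-negative vector summing to one, matching the 128-dimensional output described above.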
In one embodiment of the invention, the Caltech101-20 dataset is used, containing 2386 pictures from 20 object categories; 2 extracted image features (HOG and GIST) serve as the 2 modalities. The experimental data category information and sample number distribution are shown in Table 1.
Table 1: experimental data category information and sample quantity distribution
Experiments were performed at different missing rates, defined as η = (n − m)/n, where n is the size of the dataset and m is the number of samples with complete modalities. To verify the superiority of this scheme (COMPLETER), we compared it with 10 other multi-modal clustering techniques: partial multi-modal clustering (PVC), incomplete multi-modal grouping (IMG), the Unified Embedding Alignment Framework (UEAF), doubly aligned incomplete multi-modal clustering (DAIMC), spectral-perturbation incomplete multi-modal clustering (PIC), efficient regularized incomplete multi-modal clustering (EERIMVC), Deep Canonical Correlation Analysis (DCCA), the deep canonically correlated autoencoder (DCCAE), binary multi-modal clustering (BMVC), and the autoencoder-in-autoencoder network (AE²-Nets).
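For reproduction purposes, a missing-data split at rate η can be simulated with a simple mask. The helper below is a hypothetical illustration of the definition η = (n − m)/n, assuming each incomplete sample loses exactly one of its two modalities:

```python
import numpy as np

def make_missing_mask(n, eta, rng):
    """Simulate a bimodal dataset at missing rate eta = (n - m) / n:
    m samples keep both modalities; the remaining n - m samples each
    lose one modality at random (illustrative helper)."""
    m = int(round(n * (1 - eta)))          # samples with both modalities
    mask = np.ones((n, 2), dtype=bool)     # True = modality present
    incomplete = rng.permutation(n)[: n - m]
    half = len(incomplete) // 2
    mask[incomplete[:half], 0] = False     # these lose modality 1
    mask[incomplete[half:], 1] = False     # these lose modality 2
    return mask

# Toy usage at the dataset size used in the embodiment (n = 2386).
rng = np.random.default_rng(0)
mask = make_missing_mask(2386, 0.5, rng)
eta = (len(mask) - int(mask.all(axis=1).sum())) / len(mask)
```

Recomputing η from the mask recovers the requested 0.5, and every sample keeps at least one modality, as the method requires.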
The test results at missing rate η = 0.5 are shown in Table 2.
Table 2: Test results at missing rate η = 0.5
The test results at missing rate η = 0 are shown in Table 3.
Table 3: Test results at missing rate η = 0
As can be seen from Tables 2 and 3, compared with the other clustering methods, the present method achieves a large improvement on both normalized mutual information (NMI) and the adjusted Rand index (ARI), which means that in practical applications the object picture data can be clustered correctly, avoiding the large amount of human effort otherwise spent on picture classification.
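The two reported metrics can be computed with scikit-learn. The toy example below only illustrates that both scores are invariant to a permutation of cluster labels (the values are illustrative, not the patent's results):

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score)

# NMI and ARI compare a predicted partition against ground truth;
# renaming the clusters does not change either score.
truth = [0, 0, 1, 1, 2, 2]
pred  = [1, 1, 0, 0, 2, 2]   # same grouping, permuted label names
nmi = normalized_mutual_info_score(truth, pred)
ari = adjusted_rand_score(truth, pred)
```

A perfect grouping under any relabelling yields NMI = ARI = 1, which is why these permutation-invariant indices are standard for evaluating clustering.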
As shown in Figs. 3, 4 and 5, to further explore the effectiveness of the method, we varied the missing rate η from 0 to 0.8 on Caltech101-20 in intervals of 0.1. From the results in Figs. 3-5 it can be observed that: i) COMPLETER (the present method) is significantly better than all comparison methods under all missing-rate settings; ii) as the missing rate increases, the comparison methods' performance degrades far more than ours. For example, at η = 0, COMPLETER and PIC achieve NMI of 0.6806 and 0.6793 respectively, whereas as the missing rate increases COMPLETER becomes significantly better than PIC.
In another embodiment of the invention, the Scene-15 dataset is used, containing 4485 pictures from 15 scene categories; 2 extracted image features (PHOG and GIST) serve as the 2 modalities. The experimental data category information and sample number distribution are shown in Table 4.
Table 4: experimental data category information and sample quantity distribution
The experimental results at missing rate η = 0.5 are shown in Table 5.
Table 5: Experimental results at missing rate η = 0.5
The experimental results at missing rate η = 0 are shown in Table 6.
Table 6: Experimental results at missing rate η = 0
As can be seen from Tables 5 and 6, compared with the other clustering methods, the present method achieves a large improvement on both accuracy and normalized mutual information, which means that in practical applications the object picture data can be clustered correctly, avoiding substantial human effort for picture classification. Moreover, the method performs best both with and without missing data.
As shown in Figs. 6, 7 and 8, to further investigate the effectiveness of the method, experiments varied the missing rate η from 0 to 0.8 in intervals of 0.1. From the results in Figs. 6-8, it can be observed that COMPLETER (the present method) is significantly better than all comparison methods under almost all missing-rate settings.
In summary, the method is based on autoencoders: it learns a modality-specific representation of each modality's data through the intra-modality reconstruction loss, learns a consistent cross-modality representation through the cross-modal contrastive learning loss, and recovers the missing modality while discarding cross-modality inconsistencies through the cross-modal dual prediction loss, further improving consistency. Data recovery and consistency learning are handled in a unified way, yielding a better clustering effect.
Claims (10)
1. The image classification method with the missing data in the mode is characterized by comprising the following steps of:
s1, respectively sending the two modalities' data of an image sample in which both modalities are present into the corresponding self-encoders to obtain the corresponding hidden representations; wherein the two modalities are GIST and one of HOG and PHOG;
s2, respectively acquiring corresponding cross-modal contrast learning loss and intra-modal reconstruction loss according to hidden representations corresponding to the two modal data;
s3, carrying out counter propagation on the current self-encoder according to the cross-modal contrast learning loss and intra-modal reconstruction loss to update parameters and weights of the current self-encoder;
s4, judging whether the counter propagation times reach a threshold value, if so, entering a step S5, otherwise, returning to the step S1;
s5, obtaining corresponding cross-modal contrast learning loss, intra-modal reconstruction loss and cross-modal dual prediction loss according to the current latest hidden representation corresponding to the two modal data;
s6, carrying out back propagation on the current self-encoder according to the current latest cross-modal contrast learning loss, cross-modal dual prediction loss and intra-modal reconstruction loss to update parameters and weights of the current self-encoder;
s7, judging whether the current self-encoder converges or not, if so, entering a step S8, otherwise, returning to the step S5;
s8, sending a set of image samples with two modes at the same time, an image sample with only a first mode and an image sample with only a second mode as two-mode data sets with missing data to a current latest self-encoder to obtain hidden representations corresponding to the two-mode data sets with the missing data;
s9, acquiring a representation of a missing mode corresponding to the hidden representation corresponding to the image sample set only with the first mode and a representation of a missing mode corresponding to the hidden representation corresponding to the image sample set only with the second mode in the two-mode data set based on dual mapping;
and S10, splicing different mode representations corresponding to each image sample, using the spliced different mode representations as common representations, clustering the common representations, and completing two-mode clustering of missing data, namely realizing image classification of the mode with the missing data.
2. The method of classifying images in which there is missing data in a modality according to claim 1, wherein in step S1 the self-encoder includes an encoder and a decoder; the encoder includes a first full connection layer, a first batch normalization layer, a first activation function, a second full connection layer, a second batch normalization layer, a second activation function, a third full connection layer, a third batch normalization layer, a third activation function, a fourth full connection layer, and a fourth activation function, which are sequentially connected; the input dimension of the first full connection layer is the dimension of the input modal data; the output dimensions of the first full connection layer, the second full connection layer and the third full connection layer are 1024; the first activation function, the second activation function and the third activation function are all ReLU; the output dimension of the fourth full connection layer is 128, and the fourth activation function is Softmax;
the decoder comprises a fifth full connection layer, a fourth batch normalization layer, a fifth activation function, a sixth full connection layer, a fifth batch normalization layer, a sixth activation function, a seventh full connection layer, a sixth batch normalization layer, a seventh activation function, an eighth full connection layer, a seventh batch normalization layer and an eighth activation function which are sequentially connected; the input dimension of the fifth full connection layer is 128; the output dimensions of the fifth full connection layer, the sixth full connection layer and the seventh full connection layer are 1024; the fifth activation function, the sixth activation function, the seventh activation function and the eighth activation function are ReLU; the output dimension of the eighth full connection layer is the dimension of the input modal data.
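The encoder/decoder layout described in claim 2 can be sketched in PyTorch as follows. This is an illustrative reading of the claim with a hypothetical 300-dimensional input modality; the patent itself does not publish code:

```python
import torch
import torch.nn as nn

def make_autoencoder(input_dim):
    """Encoder/decoder pair per the layer description above:
    three Linear+BatchNorm1d+ReLU blocks, then Linear(1024, 128)
    with Softmax in the encoder; the decoder mirrors this back to
    input_dim, with BatchNorm1d and ReLU after every Linear layer."""
    def block(d_in, d_out):
        return [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()]
    encoder = nn.Sequential(
        *block(input_dim, 1024), *block(1024, 1024), *block(1024, 1024),
        nn.Linear(1024, 128), nn.Softmax(dim=1))
    decoder = nn.Sequential(
        *block(128, 1024), *block(1024, 1024), *block(1024, 1024),
        *block(1024, input_dim))
    return encoder, decoder

# Toy usage with a hypothetical 300-dimensional modality.
enc, dec = make_autoencoder(300)
x = torch.randn(4, 300)
z = enc(x)          # 128-dim Softmax hidden representation
x_rec = dec(z)      # reconstruction back in the input space
```

The Softmax output makes each hidden representation a distribution over 128 units, which is what the mutual-information-based contrastive loss of claim 3 operates on.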
3. The method for classifying images with missing data in a mode according to claim 1, wherein the specific method for acquiring corresponding cross-mode contrast learning loss according to hidden representations corresponding to two mode data in step S2 is as follows:
according to the formula:

ℓ_cl = −Σ_{t=1}^{m} [ I(z_t^(1), z_t^(2)) + α (H(z_t^(1)) + H(z_t^(2))) ]

the cross-modal contrastive learning loss ℓ_cl is acquired; where m is the total number of image samples having both modalities, and t denotes the t-th image sample; I(·,·) denotes mutual information; z_t^(1) is the hidden representation corresponding to the first modality's data in the t-th image sample having both modalities, and z_t^(2) is the hidden representation corresponding to the second modality's data in the t-th image sample having both modalities; H(·) denotes information entropy; α is the balance parameter of the entropy.
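One hedged reading of this loss, treating the Softmax outputs as soft assignment distributions and computing mutual information from their batch-level joint distribution (as in the cited COMPLETER paper), can be sketched as follows; the default value of `alpha` is illustrative:

```python
import numpy as np

def contrastive_loss(z1, z2, alpha=9.0):
    """Sketch of the cross-modal contrastive loss: negative mutual
    information between the two modalities' soft representations,
    plus an entropy regularizer weighted by alpha (assumed form)."""
    eps = 1e-12
    p = z1.T @ z2 / len(z1)            # joint over representation units
    p = (p + p.T) / 2.0                # symmetrize
    p1 = p.sum(axis=1, keepdims=True)  # marginal, modality 1
    p2 = p.sum(axis=0, keepdims=True)  # marginal, modality 2
    mi = np.sum(p * (np.log(p + eps) - np.log(p1 * p2 + eps)))
    ent = (-np.sum(p1 * np.log(p1 + eps))
           - np.sum(p2 * np.log(p2 + eps)))
    return -(mi + alpha * ent)

# Toy check: perfectly aligned one-hot assignments over two units
# carry exactly log(2) nats of mutual information.
z = np.vstack([np.eye(2), np.eye(2)])   # 4 samples, 2 units
loss = contrastive_loss(z, z, alpha=0.0)
```

Minimizing this loss pushes the two modalities' representations toward high mutual information, which is the consistency objective the claim describes.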
4. The method for classifying images with missing data in a mode according to claim 1, wherein the specific method for acquiring the reconstruction loss in the corresponding mode according to the hidden representations corresponding to the two mode data in step S2 is as follows:
according to the formula:

ℓ_rec = Σ_{v=1}^{2} Σ_{t=1}^{m} || x_t^(v) − g^(v)(f^(v)(x_t^(v))) ||²

the intra-modality reconstruction loss ℓ_rec is acquired; where m is the total number of image samples having both modalities, and t denotes the t-th image sample; x_t^(v) denotes the v-th modality's data in the t-th image sample; f^(v) and g^(v) denote the encoder and the decoder corresponding to the v-th modality's data, respectively; ||·|| is a norm.
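The intra-modality reconstruction loss can be sketched directly from its definition; the encoder/decoder arguments here are arbitrary callables standing in for f^(v) and g^(v):

```python
import numpy as np

def reconstruction_loss(xs, encoders, decoders):
    """Intra-modality reconstruction loss: each modality's samples
    are encoded then decoded, and the per-sample squared errors are
    averaged and summed over modalities (sketch of the definition)."""
    total = 0.0
    for x, f, g in zip(xs, encoders, decoders):
        total += np.mean(np.sum((x - g(f(x))) ** 2, axis=1))
    return total

# Toy check: an identity encoder/decoder pair reconstructs the
# input perfectly, so the loss is zero.
x = np.arange(12.0).reshape(4, 3)
ident = lambda v: v
loss = reconstruction_loss([x, x], [ident, ident], [ident, ident])
```

With a decoder that is off by 1 in every coordinate, each 3-dimensional sample contributes a squared error of 3, so the single-modality loss becomes 3.0.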
5. The method for classifying images having missing data in a modality according to claim 1, wherein the specific method of step S3 is as follows: the sum of the cross-modal contrastive learning loss and the intra-modality reconstruction loss is used as the current loss to back-propagate through the current self-encoder and update the parameters and weights of the current self-encoder.
6. The method for classifying images with missing data according to claim 1, wherein the specific method for obtaining the corresponding cross-modal dual prediction loss according to the current latest hidden representation corresponding to the two modal data in step S5 is as follows:
according to the formula:

ℓ_pre = || G^(1)(Z^(1)) − Z^(2) ||² + || G^(2)(Z^(2)) − Z^(1) ||²

the cross-modal dual prediction loss ℓ_pre is acquired; where Z^(1) is the set of hidden representations corresponding to all first-modality data in the image samples having both modalities, and Z^(2) is the set of hidden representations corresponding to all second-modality data in the image samples having both modalities; G^(1) is the mapping from Z^(1) to Z^(2), G^(2) is the mapping from Z^(2) to Z^(1), and G^(1) and G^(2) constitute a dual mapping; ||·|| is a norm.
7. The method for classifying images having missing data in a modality according to claim 1, wherein the specific method of step S6 is as follows:
the formula is given byThe calculated result of the self-encoder is used as the current loss to carry out back propagation on the current self-encoder, and the parameters and the weights of the current self-encoder are updated; wherein->Learning loss for cross-modal contrast; />Predicting loss for cross-modal pair; />Is intra-modal reconstruction loss.
8. The method for classifying images having missing data in a modality according to claim 1, wherein the specific method of step S8 is as follows:
according to the formulas:

Z^(1) = f^(1)(X^(1)),  Z^(2) = f^(2)(X^(2)),  Ẑ^(1) = f^(1)(X̂^(1)),  Ẑ^(2) = f^(2)(X̂^(2))

the hidden representations corresponding to the bimodal dataset with missing data are obtained, including the hidden representation Z^(1) corresponding to the first-modality data X^(1) of the image sample set having both modalities, the hidden representation Z^(2) corresponding to the second-modality data X^(2) of the image sample set having both modalities, the hidden representation Ẑ^(1) corresponding to the image sample set X̂^(1) in which only the first modality is present, and the hidden representation Ẑ^(2) corresponding to the image sample set X̂^(2) in which only the second modality is present; where f^(1) denotes the encoder in the latest self-encoder corresponding to the 1st modality's data, and f^(2) denotes the encoder in the latest self-encoder corresponding to the 2nd modality's data.
9. The method for classifying images having missing data in a modality according to claim 8, wherein the specific method of step S9 is as follows:
according to the formulas:

Ẑ′^(2) = G^(1)(Ẑ^(1)),  Ẑ′^(1) = G^(2)(Ẑ^(2))

the representation Ẑ′^(2) of the missing modality corresponding to the hidden representation Ẑ^(1) of the image sample set in which only the first modality is present, and the representation Ẑ′^(1) of the missing modality corresponding to the hidden representation Ẑ^(2) of the image sample set in which only the second modality is present, are respectively acquired; where G^(1) denotes the mapping corresponding to the first modality, G^(2) denotes the mapping corresponding to the second modality, and G^(1) and G^(2) constitute a dual mapping.
10. The method for classifying images with missing data according to claim 9, wherein the specific method for stitching the different modal representations corresponding to each image sample and using them as a common representation in step S10 is as follows: for each image sample, the representations of the different modalities are concatenated to form the common representation of that sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110095029.2A CN112784902B (en) | 2021-01-25 | 2021-01-25 | Image classification method with missing data in mode |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112784902A CN112784902A (en) | 2021-05-11 |
CN112784902B true CN112784902B (en) | 2023-06-30 |
Family
ID=75758853
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657272B (en) * | 2021-08-17 | 2022-06-28 | 山东建筑大学 | Micro video classification method and system based on missing data completion |
CN114742132A (en) * | 2022-03-17 | 2022-07-12 | 湖南工商大学 | Deep multi-view clustering method, system and equipment based on common difference learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8255739B1 (en) * | 2008-06-30 | 2012-08-28 | American Megatrends, Inc. | Achieving data consistency in a node failover with a degraded RAID array |
CN106202281A (en) * | 2016-06-28 | 2016-12-07 | 广东工业大学 | A kind of multi-modal data represents learning method and system |
WO2017122785A1 (en) * | 2016-01-15 | 2017-07-20 | Preferred Networks, Inc. | Systems and methods for multimodal generative machine learning |
WO2018232378A1 (en) * | 2017-06-16 | 2018-12-20 | Markable, Inc. | Image processing system |
CN112001437A (en) * | 2020-08-19 | 2020-11-27 | 四川大学 | Modal non-complete alignment-oriented data clustering method |
Non-Patent Citations (2)
Title |
---|
COMPLETER: Incomplete Multi-view Clustering via Contrastive Prediction;Yijie Lin 等;《2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;11169-11178 * |
Multi-modal feature adaptive clustering method based on deep neural networks; Jing Mingmin; Computer Applications and Software (No. 10); 262-269 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||