CN112784902B - Image classification method for data with missing modalities - Google Patents

Image classification method for data with missing modalities

Info

Publication number
CN112784902B
CN112784902B (application CN202110095029.2A)
Authority
CN
China
Prior art keywords
data
modal
mode
modality
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110095029.2A
Other languages
Chinese (zh)
Other versions
CN112784902A (en)
Inventor
彭玺 (Xi Peng)
林义杰 (Yijie Lin)
杨谋星 (Mouxing Yang)
李云帆 (Yunfan Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202110095029.2A
Publication of CN112784902A
Application granted
Publication of CN112784902B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a bi-modal clustering method for data with missing modalities. Based on self-encoders, the method learns a modality-specific representation of each modality data through an intra-modality reconstruction loss, learns a consistent representation across modalities through a cross-modal contrastive learning loss, and recovers the missing modality while discarding inconsistent information between modalities through a cross-modal dual prediction loss, further improving consistency. Data recovery and consistency learning are processed in a unified manner, and the clustering effect is better.

Description

Image classification method for data with missing modalities
Technical Field
The invention relates to the field of big data analysis, and in particular to an image classification method for data with missing modalities.
Background
At present, multi-modal data clustering techniques are widely applied in many fields. In commodity recommendation, massive commodity images are combined with text attributes to learn semantic feature representations of the images and improve how well recommendations match user needs; in multi-modal dialogue with intelligent customer service, vision-and-language multi-modal clustering makes it possible to respond to the user automatically with text, pictures or video. The success of these multi-modal techniques is largely due to consistency learning on multi-modal data, i.e., exploring and exploiting the inherent dependencies and invariances of the data across different modalities. However, consistency learning presumes the completeness of the multi-modal data, namely that every sample covers all modalities and no modality data is missing. In practice, because of the complexity of data acquisition environments, modalities are often missing from real data; for example, in an online conference, some video frames may lose the visual or audio signal because of sensor damage. In medical diagnosis, a patient usually does not undergo every physical examination but only some of them, and using this partial examination information for etiological diagnosis is essentially a multi-modal clustering problem with missing data. With current technology, clustering real multi-modal data requires completing the data in advance to guarantee the completeness of the objects to be clustered. Existing completion methods mainly target the similarity between samples rather than the missing data samples themselves, e.g., double-aligned incomplete multi-modal clustering (DAIMC), partial multi-modal clustering (PVC), and incomplete multi-modal visual data grouping (IMG) based on matrix factorization.
Incomplete multi-modal data clustering methods can be broadly divided into two categories. The first is based on shallow models; for example, the DAIMC method proposed by Menglei Hu et al. models the high-order correlation among multiple modalities through low-rank matrix factorization and combines relevant prior information to effectively exploit the consistent information among modalities for multi-modal subspace learning. The second is based on deep learning; for example, the DM2C method proposed by Yangbangyan Jiang et al. first obtains a modality-specific representation of each modality through self-encoders, then adopts Cycle Generative Adversarial Networks to generate the missing modality data from the complete modality data, and concatenates the modality-specific representations of each modality to obtain a common representation.
Moreover, almost all existing methods treat data recovery and consistency learning as two independent problems or steps, lacking a unified theoretical understanding; examples include deep mixed-modal clustering (DM2C) and adversarial incomplete multi-modal clustering (AIMC), both based on generative adversarial networks. Therefore, under the condition of missing modality data, a data clustering technique that unifies data completion and consistency learning has a very high application prospect and practical value. For example, classifying images with missing modality data currently depends mainly on manual work and requires a large amount of human resources.
Disclosure of Invention
Aiming at the above defects in the prior art, the image classification method for data with missing modalities provided by the invention solves the problem that, in the prior art, classifying images with missing modality data requires a large amount of human resources.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The image classification method for data with missing modalities comprises the following steps:
S1, respectively feeding the two modality data of the samples having both modalities into the corresponding self-encoders to obtain the corresponding hidden representations;
S2, respectively acquiring the corresponding cross-modal contrastive learning loss and intra-modality reconstruction loss according to the hidden representations corresponding to the two modality data;
S3, back-propagating through the current self-encoders according to the cross-modal contrastive learning loss and intra-modality reconstruction loss to update the parameters and weights of the current self-encoders;
S4, judging whether the number of back-propagations reaches a threshold; if so, proceeding to step S5, otherwise returning to step S1;
S5, acquiring the corresponding cross-modal contrastive learning loss, intra-modality reconstruction loss and cross-modal dual prediction loss according to the current latest hidden representations corresponding to the two modality data;
S6, back-propagating through the current self-encoders according to the current latest cross-modal contrastive learning loss, cross-modal dual prediction loss and intra-modality reconstruction loss to update the parameters and weights of the current self-encoders;
S7, judging whether the current self-encoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of samples having both modalities, the samples having only the first modality and the samples having only the second modality, as a bi-modal data set with missing data, into the current latest self-encoders to obtain the hidden representations corresponding to the bi-modal data set with missing data;
S9, based on the dual mapping, acquiring the representation of the missing modality corresponding to the hidden representation of the sample set having only the first modality and the representation of the missing modality corresponding to the hidden representation of the sample set having only the second modality in the bi-modal data set;
S10, concatenating the different modality representations corresponding to each sample, using the result as the common representation, and clustering the common representations to complete bi-modal clustering with missing data.
Further, the self-encoder in step S1 includes an encoder and a decoder, where the encoder includes a first full-connection layer, a first normalization layer, a first activation function, a second full-connection layer, a second normalization layer, a second activation function, a third full-connection layer, a third normalization layer, a third activation function, a fourth full-connection layer, and a fourth activation function that are sequentially connected; the input dimension of the first full connection layer is the dimension of the input modal data; the output dimensions of the first full-connection layer, the second full-connection layer and the third full-connection layer are 1024; the first activation function, the second activation function and the third activation function are all ReLU; the output dimension of the fourth full connection layer is 128, and the fourth activation function is Softmax;
the decoder comprises a fifth full-connection layer, a fourth normalization layer, a fifth activation function, a sixth full-connection layer, a fifth normalization layer, a sixth activation function, a seventh full-connection layer, a sixth normalization layer, a seventh activation function, an eighth full-connection layer, a seventh normalization layer and an eighth activation function which are sequentially connected; the input dimension of the fifth full-connection layer is 128, the output dimensions of the fifth full-connection layer, the sixth full-connection layer and the seventh full-connection layer are 1024, and the fifth activation function, the sixth activation function, the seventh activation function and the eighth activation function are ReLU; the output dimension of the eighth fully connected layer is the dimension of the input modal data.
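For illustration, the encoder/decoder pair described above can be written as the following minimal sketch, assuming PyTorch; the class and helper names (`ModalityAutoencoder`, `mlp_block`) and the `input_dim` argument are hypothetical, while the layer sizes and activations follow the text.

```python
import torch.nn as nn

def mlp_block(in_dim, out_dim):
    # fully-connected layer followed by batch normalization and ReLU
    return nn.Sequential(nn.Linear(in_dim, out_dim),
                         nn.BatchNorm1d(out_dim),
                         nn.ReLU())

class ModalityAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden=1024, code=128):
        super().__init__()
        # encoder: three 1024-unit blocks, then a 128-dim Softmax code
        self.encoder = nn.Sequential(
            mlp_block(input_dim, hidden),
            mlp_block(hidden, hidden),
            mlp_block(hidden, hidden),
            nn.Linear(hidden, code),
            nn.Softmax(dim=1),
        )
        # decoder mirrors the encoder; per the text it also ends with BatchNorm + ReLU
        self.decoder = nn.Sequential(
            mlp_block(code, hidden),
            mlp_block(hidden, hidden),
            mlp_block(hidden, hidden),
            mlp_block(hidden, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)
```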
Further, in step S2, the specific method for obtaining the corresponding cross-modal contrastive learning loss according to the hidden representations corresponding to the two modality data is as follows:
according to the formula

$$\ell_{cl} = -\frac{1}{m}\sum_{t=1}^{m}\left[ I\left(z_t^{(1)}, z_t^{(2)}\right) + \alpha\left( H\left(z_t^{(1)}\right) + H\left(z_t^{(2)}\right) \right) \right]$$

the cross-modal contrastive learning loss $\ell_{cl}$ is acquired; where $m$ is the total number of samples for which both modalities exist and $t$ denotes the $t$-th sample; $I(\cdot,\cdot)$ represents mutual information; $z_t^{(1)}$ is the hidden representation corresponding to the first modality data in the $t$-th sample having both modalities, and $z_t^{(2)}$ is the hidden representation corresponding to the second modality data in the $t$-th sample having both modalities; $H(\cdot)$ represents information entropy; $\alpha$ is the balance parameter of the entropy.
Further, in step S2, the specific method for obtaining the corresponding intra-modality reconstruction loss according to the hidden representations corresponding to the two modality data is as follows:
according to the formula

$$\ell_{rec} = \sum_{v=1}^{2}\sum_{t=1}^{m}\left\| x_t^{(v)} - g^{(v)}\left(f^{(v)}\left(x_t^{(v)}\right)\right) \right\|_2^2$$

the intra-modality reconstruction loss $\ell_{rec}$ is acquired; where $m$ is the total number of samples for which both modalities exist and $t$ denotes the $t$-th sample; $x_t^{(v)}$ represents the $v$-th modality data in the $t$-th sample; $f^{(v)}(\cdot)$ and $g^{(v)}(\cdot)$ represent the encoder and decoder currently corresponding to the $v$-th modality data, respectively; $\|\cdot\|_2$ is the $\ell_2$ norm.
Further, the specific method of step S3 is as follows:
the value computed by $\ell_{cl} + 0.1\,\ell_{rec}$ is used as the current loss to back-propagate through the current self-encoders and update their parameters and weights; where $\ell_{cl}$ is the cross-modal contrastive learning loss and $\ell_{rec}$ is the intra-modality reconstruction loss.
Further, in step S5, the specific method for obtaining the corresponding cross-modal dual prediction loss according to the current latest hidden representations corresponding to the two modality data is as follows:
according to the formula

$$\ell_{pre} = \left\| G^{(1)}\left(Z_1\right) - Z_2 \right\|_2^2 + \left\| G^{(2)}\left(Z_2\right) - Z_1 \right\|_2^2$$

the cross-modal dual prediction loss $\ell_{pre}$ is acquired; where $Z_1$ is the set of hidden representations corresponding to all first modality data in the samples having both modalities; $Z_2$ is the set of hidden representations corresponding to all second modality data in the samples having both modalities; $G^{(1)}(Z_1)$ is the mapping of $Z_1$ toward $Z_2$ and $G^{(2)}(Z_2)$ is the mapping of $Z_2$ toward $Z_1$; $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ constitute a dual mapping; $\|\cdot\|_2$ is the $\ell_2$ norm.
Further, the specific method of step S6 is as follows:
the value computed by $\ell_{cl} + 0.1\,\ell_{pre} + 0.1\,\ell_{rec}$ is used as the current loss to back-propagate through the current self-encoders and update their parameters and weights; where $\ell_{cl}$ is the cross-modal contrastive learning loss, $\ell_{pre}$ is the cross-modal dual prediction loss, and $\ell_{rec}$ is the intra-modality reconstruction loss.
Further, the specific method of step S8 is as follows:
according to the formulas

$$Z_1 = f^{(1)}\left(X_1\right), \quad Z_2 = f^{(2)}\left(X_2\right), \quad Z^{(1)} = f^{(1)}\left(X^{(1)}\right), \quad Z^{(2)} = f^{(2)}\left(X^{(2)}\right)$$

the hidden representations corresponding to the bi-modal data set with missing data are obtained, comprising the hidden representation $Z_1$ corresponding to the first modality data $X_1$ of the samples having both modalities, the hidden representation $Z_2$ corresponding to the second modality data $X_2$ of the samples having both modalities, the hidden representation $Z^{(1)}$ corresponding to the sample set $X^{(1)}$ in which only the first modality exists, and the hidden representation $Z^{(2)}$ corresponding to the sample set $X^{(2)}$ in which only the second modality exists; where $f^{(1)}(\cdot)$ represents the encoder of the latest self-encoder corresponding to the first modality data, and $f^{(2)}(\cdot)$ represents the encoder of the latest self-encoder corresponding to the second modality data.
Further, the specific method of step S9 is as follows:
according to the formulas

$$\hat{Z}^{(2)} = G^{(1)}\left(Z^{(1)}\right), \quad \hat{Z}^{(1)} = G^{(2)}\left(Z^{(2)}\right)$$

the representation $\hat{Z}^{(2)}$ of the missing modality corresponding to the hidden representation $Z^{(1)}$ of the sample set in which only the first modality exists, and the representation $\hat{Z}^{(1)}$ of the missing modality corresponding to the hidden representation $Z^{(2)}$ of the sample set in which only the second modality exists, are respectively acquired; $G^{(1)}(\cdot)$ represents the mapping corresponding to the first modality, $G^{(2)}(\cdot)$ represents the mapping corresponding to the second modality, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ constitute a dual mapping.
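A sketch of the inference pass of steps S8 and S9, assuming the autoencoder exposes its encoder as `.encoder` as in the earlier sketch; all names are hypothetical.

```python
import torch

@torch.no_grad()
def infer_representations(ae1, ae2, G1, G2, X1, X2, X1_only, X2_only):
    Z1 = ae1.encoder(X1)            # complete samples, first modality (S8)
    Z2 = ae2.encoder(X2)            # complete samples, second modality
    Z1_only = ae1.encoder(X1_only)  # samples where only the first modality exists
    Z2_only = ae2.encoder(X2_only)  # samples where only the second modality exists
    Z2_hat = G1(Z1_only)            # predicted representation of missing modality 2 (S9)
    Z1_hat = G2(Z2_only)            # predicted representation of missing modality 1
    return [(Z1, Z2), (Z1_only, Z2_hat), (Z1_hat, Z2_only)]
```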
Further, in step S10, the specific method for concatenating the different modality representations corresponding to each sample into a common representation is as follows:
$[Z_1; Z_2]$ is used as the common representation of the samples in which both modalities exist; $[Z^{(1)}; \hat{Z}^{(2)}]$ is used as the common representation of the samples in which only the first modality exists; $[\hat{Z}^{(1)}; Z^{(2)}]$ is used as the common representation of the samples in which only the second modality exists.
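Step S10 then clusters the concatenated common representations; a sketch using scikit-learn's k-means, an assumed clustering backend consistent with the k-means mentioned in the detailed description:

```python
import torch
from sklearn.cluster import KMeans

def cluster_common_representation(pairs, n_clusters):
    # concatenate the two modality representations of every sample (dim 1),
    # then stack the three subsets (dim 0) and run k-means on the result
    common = torch.cat([torch.cat(p, dim=1) for p in pairs], dim=0)
    return KMeans(n_clusters=n_clusters).fit_predict(common.cpu().numpy())

# e.g. labels = cluster_common_representation(infer_representations(...), n_clusters=20)
```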
The beneficial effects of the invention are as follows: based on self-encoders, the invention learns a modality-specific representation of each modality data through the intra-modality reconstruction loss, learns a consistent representation across modalities through the cross-modal contrastive learning loss, recovers the missing modality and discards inconsistent information between modalities through the cross-modal dual prediction loss, thereby further improving consistency; data recovery and consistency learning are processed in a unified manner, and the clustering effect is better.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a block diagram of a model of the present invention;
FIG. 3 is a comparison chart of accuracy as the missing rate varies from 0 to 0.8 in example 1;
FIG. 4 is a comparison chart of normalized mutual information as the missing rate varies from 0 to 0.8 in example 1;
FIG. 5 is a comparison chart of the adjusted Rand index as the missing rate varies from 0 to 0.8 in example 1;
FIG. 6 is a comparison chart of accuracy as the missing rate varies from 0 to 0.8 in example 2;
FIG. 7 is a comparison chart of normalized mutual information as the missing rate varies from 0 to 0.8 in example 2;
FIG. 8 is a comparison chart of the adjusted Rand index as the missing rate varies from 0 to 0.8 in example 2.
Detailed Description
The following description of the embodiments of the invention is provided to facilitate understanding by those skilled in the art, but it should be understood that the invention is not limited to the scope of the embodiments; to those skilled in the art, any invention that makes use of the inventive concept falls within the protection defined by the appended claims.
As shown in FIG. 1 and FIG. 2, the image classification method for data with missing modalities includes the following steps:
S1, respectively feeding the two modality data of the samples having both modalities into the corresponding self-encoders to obtain the corresponding hidden representations;
S2, respectively acquiring the corresponding cross-modal contrastive learning loss and intra-modality reconstruction loss according to the hidden representations corresponding to the two modality data;
S3, back-propagating through the current self-encoders according to the cross-modal contrastive learning loss and intra-modality reconstruction loss to update the parameters and weights of the current self-encoders;
S4, judging whether the number of back-propagations reaches a threshold; if so, proceeding to step S5, otherwise returning to step S1; the threshold is 100;
S5, acquiring the corresponding cross-modal contrastive learning loss, intra-modality reconstruction loss and cross-modal dual prediction loss according to the current latest hidden representations corresponding to the two modality data;
S6, back-propagating through the current self-encoders according to the current latest cross-modal contrastive learning loss, cross-modal dual prediction loss and intra-modality reconstruction loss to update the parameters and weights of the current self-encoders;
S7, judging whether the current self-encoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of samples having both modalities, the samples having only the first modality and the samples having only the second modality, as a bi-modal data set with missing data, into the current latest self-encoders to obtain the hidden representations corresponding to the bi-modal data set with missing data;
S9, based on the dual mapping, acquiring the representation of the missing modality corresponding to the hidden representation of the sample set having only the first modality and the representation of the missing modality corresponding to the hidden representation of the sample set having only the second modality in the bi-modal data set;
S10, concatenating the different modality representations corresponding to each sample, using the result as the common representation, and clustering the common representations to complete bi-modal clustering with missing data.
The self-encoder in the step S1 comprises an encoder and a decoder, wherein the encoder comprises a first full-connection layer, a first normalization layer, a first activation function, a second full-connection layer, a second normalization layer, a second activation function, a third full-connection layer, a third normalization layer, a third activation function, a fourth full-connection layer and a fourth activation function which are sequentially connected; the input dimension of the first full connection layer is the dimension of the input modal data; the output dimensions of the first full-connection layer, the second full-connection layer and the third full-connection layer are 1024; the first activation function, the second activation function and the third activation function are all ReLU; the output dimension of the fourth full connection layer is 128, and the fourth activation function is Softmax;
the decoder comprises a fifth full-connection layer, a fourth normalization layer, a fifth activation function, a sixth full-connection layer, a fifth normalization layer, a sixth activation function, a seventh full-connection layer, a sixth normalization layer, a seventh activation function, an eighth full-connection layer, a seventh normalization layer and an eighth activation function which are sequentially connected; the input dimension of the fifth full-connection layer is 128, the output dimensions of the fifth full-connection layer, the sixth full-connection layer and the seventh full-connection layer are 1024, and the fifth activation function, the sixth activation function, the seventh activation function and the eighth activation function are ReLU; the output dimension of the eighth fully connected layer is the dimension of the input modal data.
In step S2, the specific method for obtaining the corresponding cross-modal contrastive learning loss according to the hidden representations corresponding to the two modality data is as follows: according to the formula

$$\ell_{cl} = -\frac{1}{m}\sum_{t=1}^{m}\left[ I\left(z_t^{(1)}, z_t^{(2)}\right) + \alpha\left( H\left(z_t^{(1)}\right) + H\left(z_t^{(2)}\right) \right) \right]$$

the cross-modal contrastive learning loss $\ell_{cl}$ is acquired; where $m$ is the total number of samples for which both modalities exist and $t$ denotes the $t$-th sample; $I(\cdot,\cdot)$ represents mutual information; $z_t^{(v)}$ is the hidden representation corresponding to the $v$-th modality data in the $t$-th sample, $v \in \{1, 2\}$, i.e., $z_t^{(1)}$ is the hidden representation corresponding to the first modality data in the $t$-th sample having both modalities, and $z_t^{(2)}$ is the hidden representation corresponding to the second modality data; $H(\cdot)$ represents information entropy; $\alpha$ is the balance parameter of the entropy.
In step S2, the specific method for obtaining the corresponding intra-modality reconstruction loss according to the hidden representations corresponding to the two modality data is as follows: according to the formula

$$\ell_{rec} = \sum_{v=1}^{2}\sum_{t=1}^{m}\left\| x_t^{(v)} - g^{(v)}\left(f^{(v)}\left(x_t^{(v)}\right)\right) \right\|_2^2$$

the intra-modality reconstruction loss $\ell_{rec}$ is acquired; where $m$ is the total number of samples for which both modalities exist and $t$ denotes the $t$-th sample; $x_t^{(v)}$ represents the $v$-th modality data in the $t$-th sample; $f^{(v)}(\cdot)$ and $g^{(v)}(\cdot)$ represent the encoder and decoder currently corresponding to the $v$-th modality data, respectively; $\|\cdot\|_2$ is the $\ell_2$ norm.
The specific method of step S3 is as follows: the value computed by $\ell_{cl} + 0.1\,\ell_{rec}$ is used as the current loss to back-propagate through the current self-encoders and update their parameters and weights; where $\ell_{cl}$ is the cross-modal contrastive learning loss and $\ell_{rec}$ is the intra-modality reconstruction loss.
In step S5, the specific method for obtaining the corresponding cross-modal dual prediction loss according to the current latest hidden representations corresponding to the two modality data is as follows: according to the formula

$$\ell_{pre} = \left\| G^{(1)}\left(Z_1\right) - Z_2 \right\|_2^2 + \left\| G^{(2)}\left(Z_2\right) - Z_1 \right\|_2^2$$

the cross-modal dual prediction loss $\ell_{pre}$ is acquired; where $Z_1$ is the set of hidden representations corresponding to all first modality data in the samples having both modalities; $Z_2$ is the set of hidden representations corresponding to all second modality data in the samples having both modalities; $G^{(1)}(Z_1)$ is the mapping of $Z_1$ toward $Z_2$ and $G^{(2)}(Z_2)$ is the mapping of $Z_2$ toward $Z_1$; $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ constitute a dual mapping; $\|\cdot\|_2$ is the $\ell_2$ norm.
The specific method of step S6 is as follows: the value computed by $\ell_{cl} + 0.1\,\ell_{pre} + 0.1\,\ell_{rec}$ is used as the current loss to back-propagate through the current self-encoders and update their parameters and weights; where $\ell_{cl}$ is the cross-modal contrastive learning loss, $\ell_{pre}$ is the cross-modal dual prediction loss, and $\ell_{rec}$ is the intra-modality reconstruction loss.
The specific method of step S8 is as follows: according to the formulas

$$Z_1 = f^{(1)}\left(X_1\right), \quad Z_2 = f^{(2)}\left(X_2\right), \quad Z^{(1)} = f^{(1)}\left(X^{(1)}\right), \quad Z^{(2)} = f^{(2)}\left(X^{(2)}\right)$$

the hidden representations corresponding to the bi-modal data set with missing data are obtained, comprising the hidden representation $Z_1$ corresponding to the first modality data $X_1$ of the samples having both modalities, the hidden representation $Z_2$ corresponding to the second modality data $X_2$ of the samples having both modalities, the hidden representation $Z^{(1)}$ corresponding to the sample set $X^{(1)}$ in which only the first modality exists, and the hidden representation $Z^{(2)}$ corresponding to the sample set $X^{(2)}$ in which only the second modality exists; where $f^{(1)}(\cdot)$ represents the encoder of the latest self-encoder corresponding to the first modality data, and $f^{(2)}(\cdot)$ represents the encoder of the latest self-encoder corresponding to the second modality data.
The specific method of step S9 is as follows: according to the formulas

$$\hat{Z}^{(2)} = G^{(1)}\left(Z^{(1)}\right), \quad \hat{Z}^{(1)} = G^{(2)}\left(Z^{(2)}\right)$$

the representation $\hat{Z}^{(2)}$ of the missing modality corresponding to the hidden representation $Z^{(1)}$ of the sample set in which only the first modality exists, and the representation $\hat{Z}^{(1)}$ of the missing modality corresponding to the hidden representation $Z^{(2)}$ of the sample set in which only the second modality exists, are respectively acquired; $G^{(1)}(\cdot)$ represents the mapping corresponding to the first modality, $G^{(2)}(\cdot)$ represents the mapping corresponding to the second modality, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ constitute a dual mapping.
In step S10, the specific method for concatenating the different modality representations corresponding to each sample into a common representation is as follows: $[Z_1; Z_2]$ is used as the common representation of the samples in which both modalities exist; $[Z^{(1)}; \hat{Z}^{(2)}]$ is used as the common representation of the samples in which only the first modality exists; $[\hat{Z}^{(1)}; Z^{(2)}]$ is used as the common representation of the samples in which only the second modality exists.
In the specific implementation, the entropy is regularized and the parameter $\alpha$ is fixed to 10. The design of the cross-modal contrastive learning loss has two advantages: on the one hand, from information theory, the information entropy is the average amount of information conveyed by an event, and a larger entropy corresponds to a more informative representation; on the other hand, maximizing $H(z_t^{(1)})$ and $H(z_t^{(2)})$ avoids the trivial solution in which all samples are assigned to the same cluster. To construct $I(z_t^{(1)}, z_t^{(2)})$, the joint probability distribution $p(z, z')$ of the variables $z$ and $z'$ can be defined first. Since a Softmax activation function is stacked at the last layer of the encoder, $z_t^{(1)}$ and $z_t^{(2)}$ can be regarded as over-clustering probabilities, i.e., they can be understood as the distributions of two discrete cluster-assignment variables $z$ and $z'$ over $D$ classes, $D$ being the dimension of $z_t^{(1)}$ and $z_t^{(2)}$. The joint probability $p(z, z')$ is thus defined by the matrix

$$P = \frac{1}{m} \sum_{t=1}^{m} z_t^{(1)} \left(z_t^{(2)}\right)^{\top}$$

Let $P_d$ and $P_{d'}$ denote the marginal probability distributions $P(z = d)$ and $P(z' = d')$, respectively, which can be obtained by summing the $d$-th row and the $d'$-th column of the joint probability matrix $P$. For discrete variables, the cross-modal contrastive learning loss function can then be redefined as

$$\ell_{cl} = -\sum_{d=1}^{D} \sum_{d'=1}^{D} P_{dd'} \log \frac{P_{dd'}}{P_d\, P_{d'}} + \alpha \left( \sum_{d=1}^{D} P_d \log P_d + \sum_{d'=1}^{D} P_{d'} \log P_{d'} \right)$$

where $P_{dd'}$ is the element in row $d$ and column $d'$ of $P$.
To infer the missing modality, the invention proposes a dual prediction mechanism. Specifically, in a latent space parameterized by a neural network, the representation $Z^i$ of a particular modality can be predicted from $Z^j$ by minimizing the conditional entropy $H(Z^i \mid Z^j)$, where $i = 1, j = 2$ or $i = 2, j = 1$; that is, $Z^i$ is completely determined by $Z^j$ if and only if the conditional entropy $H(Z^i \mid Z^j) = 0$. A common approach to optimizing this objective is to introduce a variational distribution $Q(Z^i \mid Z^j)$ and maximize the lower bound

$$-H\left(Z^i \mid Z^j\right) \ge \mathbb{E}_{p\left(Z^i, Z^j\right)}\left[ \log Q\left(Z^i \mid Z^j\right) \right]$$

The variational distribution $Q$ may be of any type, such as a Gaussian distribution, a categorical distribution or a Laplace distribution. In particular, the method takes $Q$ to be the Gaussian distribution $N\left(Z^i \mid G^{(j)}\left(Z^j\right), \sigma_i\right)$, where $\sigma_i$ is the variance matrix. Ignoring the constants of the Gaussian distribution in the derivation, maximizing $\mathbb{E}\left[\log Q\left(Z^i \mid Z^j\right)\right]$ is equivalent to minimizing $\left\| G^{(j)}\left(Z^j\right) - Z^i \right\|_2^2$. For given bi-modal data, the cross-modal dual prediction loss

$$\ell_{pre} = \left\| G^{(1)}\left(Z_1\right) - Z_2 \right\|_2^2 + \left\| G^{(2)}\left(Z_2\right) - Z_1 \right\|_2^2$$

is obtained. It should be noted that without the intra-modality reconstruction loss, the dual prediction loss alone could lead to a trivial solution, namely $Z_1$ and $Z_2$ both collapsing to the same constant. After the model converges, the representation of the missing modality corresponding to the hidden representation of a sample set with only one modality can easily be predicted through the dual mapping.
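Under the Gaussian choice of $Q$, the derivation reduces to two mean-squared-error terms; as a sketch:

```python
def dual_prediction_loss(z1, z2, G1, G2):
    # || G1(Z1) - Z2 ||^2 + || G2(Z2) - Z1 ||^2: each prediction network plays
    # the role of the Gaussian mean G^(j)(Z^j) in the variational bound above
    return ((G1(z1) - z2) ** 2).sum() + ((G2(z2) - z1) ** 2).sum()
```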
After the whole model has been trained to convergence on the data whose two modalities are complete, the whole data set is fed directly into the network model, which performs missing-modality completion and infers the corresponding representations. The representations of the different modalities are then concatenated to obtain the common representation, and the common representation is clustered with a traditional clustering method such as k-means, which completes bi-modal clustering with missing data. Since the scheme applies to any two modalities, it can be directly generalized to multi-modal clustering.
The mapping models $G^{(2)}$ and $G^{(1)}$ adopt the same network structure, with 6 layers:
First layer: a full-connection layer with input 128 and output 128, followed by a batch normalization layer BatchNorm1d; the activation function is ReLU.
Second layer: a full-connection layer with input 128 and output 256, followed by a batch normalization layer BatchNorm1d; the activation function is ReLU.
Third layer: a full-connection layer with input 256 and output 128, followed by a batch normalization layer BatchNorm1d; the activation function is ReLU.
Fourth layer: a full-connection layer with input 128 and output 256, followed by a batch normalization layer BatchNorm1d; the activation function is ReLU.
Fifth layer: a full-connection layer with input 256 and output 128, followed by a batch normalization layer BatchNorm1d; the activation function is ReLU.
Sixth layer: a full-connection layer with input 128 and output 128; the activation function is Softmax.
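Under the same PyTorch assumption, the 6-layer mapping network reads as follows (the factory function name is hypothetical):

```python
import torch.nn as nn

def make_prediction_network():
    dims = [(128, 128), (128, 256), (256, 128), (128, 256), (256, 128)]
    layers = []
    for d_in, d_out in dims:  # layers 1-5: Linear + BatchNorm1d + ReLU
        layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()]
    layers += [nn.Linear(128, 128), nn.Softmax(dim=1)]  # layer 6
    return nn.Sequential(*layers)

G1, G2 = make_prediction_network(), make_prediction_network()  # the dual mapping pair
```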
In one embodiment of the invention, the dataset Caltech-101-20 is used, which contains 2386 pictures from 20 object categories; 2 extracted image features, HOG and GIST, are used as the 2 modalities. The experimental data category information and sample number distribution are shown in Table 1.
Table 1: experimental data category information and sample quantity distribution
(table reproduced as an image in the original publication; values not recoverable)
Experiments were performed under different missing rates, defined as η = (n − m)/n, where n is the size of the data set and m is the number of samples with complete modalities. To verify the superiority of this scheme, we compared this method (COMPLETER) with 10 other multi-modal clustering techniques, namely partial multi-modal clustering (PVC), incomplete multi-modal visual data grouping (IMG), the unified embedding alignment framework (UEAF), double-aligned incomplete multi-modal clustering (DAIMC), spectral-perturbation incomplete multi-modal clustering (PIC), efficient regularized incomplete multi-modal clustering (EERIMVC), deep canonical correlation analysis (DCCA), the deep canonically correlated autoencoder (DCCAE), binary multi-modal clustering (BMVC), and autoencoder-in-autoencoder networks (AE²-Nets).
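For reproducing this protocol, the missing rate η = (n − m)/n can be simulated by randomly marking n − m samples as incomplete. How the dropped modality is chosen is not specified in the text, so splitting the incomplete samples evenly between the two modalities is an assumption in this sketch:

```python
import numpy as np

def split_by_missing_rate(n, eta, seed=0):
    rng = np.random.default_rng(seed)
    m = int(round((1 - eta) * n))  # samples that keep both modalities
    idx = rng.permutation(n)
    complete = idx[:m]
    dropped = idx[m:]
    only_first = dropped[: len(dropped) // 2]   # these lose the second modality
    only_second = dropped[len(dropped) // 2:]   # these lose the first modality
    return complete, only_first, only_second
```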
The test results at missing rate η = 0.5 are shown in Table 2.
Table 2: test results at missing rate η = 0.5
(table reproduced as an image in the original publication; values not recoverable)
The test results at missing rate η = 0 are shown in Table 3.
Table 3: test results at missing rate η = 0
(table reproduced as an image in the original publication; values not recoverable)
As can be seen from Tables 2 and 3, compared with the other clustering methods, this method achieves a large improvement in the two indices of normalized mutual information and adjusted Rand index, which means that in practical applications the object picture data can be clustered correctly, avoiding the consumption of a large amount of human resources for picture classification.
As shown in FIGS. 3, 4 and 5, to further explore the effectiveness of our method, we varied the missing rate η from 0 to 0.8 on Caltech101-20 at intervals of 0.1. From the results in FIGS. 3-5 it can be observed that: i) COMPLETER (the present method) is significantly better than all comparison methods under all missing-rate settings; ii) as the missing rate increases, the performance of the comparison methods degrades much more than that of our method. For example, at η = 0, COMPLETER and PIC achieve NMI of 0.6806 and 0.6793, respectively, whereas with increasing missing rate COMPLETER becomes significantly better than PIC.
In another embodiment of the invention, the Scene-15 dataset is used, which contains 4485 pictures from 15 scene categories; 2 extracted image features, PHOG and GIST, are used as the 2 modalities. The experimental data category information and sample number distribution are shown in Table 4.
Table 4: experimental data category information and sample quantity distribution
(table reproduced as an image in the original publication; values not recoverable)
The experimental results at missing rate η = 0.5 are shown in Table 5.
Table 5: experimental results at missing rate η = 0.5
(table reproduced as an image in the original publication; values not recoverable)
The experimental results at missing rate η = 0 are shown in Table 6.
Table 6: experimental results at missing rate η = 0
(table reproduced as an image in the original publication; values not recoverable)
As can be seen from Tables 5 and 6, compared with the other clustering methods, this method achieves a large improvement in the two indices of accuracy and normalized mutual information, which means that in practical applications the object picture data can be clustered correctly, avoiding the consumption of a large amount of human resources for picture classification. Meanwhile, the method performs best both with and without missing data.
As shown in FIGS. 6, 7 and 8, to further investigate the effectiveness of the method, experiments were performed with the missing rate η varying from 0 to 0.8 at intervals of 0.1. From the results in FIGS. 6-8 it can be observed that COMPLETER (the present method) is significantly better than all comparison methods under almost all missing-rate settings.
In summary, based on self-encoders, the method learns a modality-specific representation of each modality data through the intra-modality reconstruction loss, learns a consistent representation across modalities through the cross-modal contrastive learning loss, recovers the missing modality and discards inconsistent information between modalities through the cross-modal dual prediction loss, thereby further improving consistency; data recovery and consistency learning are processed in a unified manner, and the clustering effect is better.

Claims (10)

1. An image classification method for data with missing modalities, characterized by comprising the following steps:
S1, respectively feeding the two modality data of the image samples having both modalities into the corresponding self-encoders to obtain the corresponding hidden representations; wherein the two modalities are one of HOG and PHOG, and GIST;
S2, respectively acquiring the corresponding cross-modal contrastive learning loss and intra-modality reconstruction loss according to the hidden representations corresponding to the two modality data;
S3, back-propagating through the current self-encoders according to the cross-modal contrastive learning loss and intra-modality reconstruction loss to update the parameters and weights of the current self-encoders;
S4, judging whether the number of back-propagations reaches a threshold; if so, proceeding to step S5, otherwise returning to step S1;
S5, acquiring the corresponding cross-modal contrastive learning loss, intra-modality reconstruction loss and cross-modal dual prediction loss according to the current latest hidden representations corresponding to the two modality data;
S6, back-propagating through the current self-encoders according to the current latest cross-modal contrastive learning loss, cross-modal dual prediction loss and intra-modality reconstruction loss to update the parameters and weights of the current self-encoders;
S7, judging whether the current self-encoders have converged; if so, proceeding to step S8, otherwise returning to step S5;
S8, feeding the set of image samples having both modalities, the image samples having only the first modality and the image samples having only the second modality, as a bi-modal data set with missing data, into the current latest self-encoders to obtain the hidden representations corresponding to the bi-modal data set with missing data;
S9, based on the dual mapping, acquiring the representation of the missing modality corresponding to the hidden representation of the image sample set having only the first modality and the representation of the missing modality corresponding to the hidden representation of the image sample set having only the second modality in the bi-modal data set;
S10, concatenating the different modality representations corresponding to each image sample, using the result as the common representation, and clustering the common representations to complete bi-modal clustering with missing data, thereby realizing image classification of data with a missing modality.
2. The image classification method for data with missing modalities according to claim 1, wherein the self-encoder in step S1 includes an encoder and a decoder; the encoder includes a first full-connection layer, a first batch normalization layer, a first activation function, a second full-connection layer, a second batch normalization layer, a second activation function, a third full-connection layer, a third batch normalization layer, a third activation function, a fourth full-connection layer and a fourth activation function which are sequentially connected; the input dimension of the first full-connection layer is the dimension of the input modality data; the output dimensions of the first, second and third full-connection layers are all 1024; the first, second and third activation functions are all ReLU; the output dimension of the fourth full-connection layer is 128, and the fourth activation function is Softmax;
the decoder comprises a fifth full-connection layer, a fourth batch normalization layer, a fifth activation function, a sixth full-connection layer, a fifth batch normalization layer, a sixth activation function, a seventh full-connection layer, a sixth batch normalization layer, a seventh activation function, an eighth full-connection layer, a seventh batch normalization layer and an eighth activation function which are sequentially connected; the input dimension of the fifth full-connection layer is 128, the output dimensions of the fifth, sixth and seventh full-connection layers are all 1024, and the fifth, sixth, seventh and eighth activation functions are all ReLU; the output dimension of the eighth full-connection layer is the dimension of the input modality data.
3. The image classification method for data with missing modalities according to claim 1, wherein the specific method for acquiring the corresponding cross-modal contrastive learning loss according to the hidden representations corresponding to the two modality data in step S2 is:
according to the formula

$$\ell_{cl} = -\frac{1}{m}\sum_{t=1}^{m}\left[ I\left(z_t^{(1)}, z_t^{(2)}\right) + \alpha\left( H\left(z_t^{(1)}\right) + H\left(z_t^{(2)}\right) \right) \right]$$

the cross-modal contrastive learning loss $\ell_{cl}$ is acquired; where $m$ is the total number of image samples having both modalities and $t$ denotes the $t$-th image sample; $I(\cdot,\cdot)$ represents mutual information; $z_t^{(1)}$ is the hidden representation corresponding to the first modality data in the $t$-th image sample having both modalities, and $z_t^{(2)}$ is the hidden representation corresponding to the second modality data in the $t$-th image sample having both modalities; $H(\cdot)$ represents information entropy; $\alpha$ is the balance parameter of the entropy.
4. The image classification method for data with missing modalities according to claim 1, wherein the specific method for acquiring the corresponding intra-modality reconstruction loss according to the hidden representations corresponding to the two modality data in step S2 is:
according to the formula

$$\ell_{rec} = \sum_{v=1}^{2}\sum_{t=1}^{m}\left\| x_t^{(v)} - g^{(v)}\left(f^{(v)}\left(x_t^{(v)}\right)\right) \right\|_2^2$$

the intra-modality reconstruction loss $\ell_{rec}$ is acquired; where $m$ is the total number of image samples having both modalities and $t$ denotes the $t$-th image sample; $x_t^{(v)}$ represents the $v$-th modality data in the $t$-th image sample; $f^{(v)}(\cdot)$ and $g^{(v)}(\cdot)$ represent the encoder and decoder corresponding to the $v$-th modality data, respectively; $\|\cdot\|_2$ is the $\ell_2$ norm.
5. The image classification method for data with missing modalities according to claim 1, wherein the specific method of step S3 is:
the value computed by $\ell_{cl} + 0.1\,\ell_{rec}$ is used as the current loss to back-propagate through the current self-encoders and update their parameters and weights; where $\ell_{cl}$ is the cross-modal contrastive learning loss and $\ell_{rec}$ is the intra-modality reconstruction loss.
6. The image classification method for data with missing modalities according to claim 1, wherein the specific method for acquiring the corresponding cross-modal dual prediction loss according to the current latest hidden representations corresponding to the two modality data in step S5 is:
according to the formula

$$\ell_{pre} = \left\| G^{(1)}\left(Z_1\right) - Z_2 \right\|_2^2 + \left\| G^{(2)}\left(Z_2\right) - Z_1 \right\|_2^2$$

the cross-modal dual prediction loss $\ell_{pre}$ is acquired; where $Z_1$ is the set of hidden representations corresponding to all first modality data in the image samples having both modalities; $Z_2$ is the set of hidden representations corresponding to all second modality data in the image samples having both modalities; $G^{(1)}(Z_1)$ is the mapping of $Z_1$ toward $Z_2$ and $G^{(2)}(Z_2)$ is the mapping of $Z_2$ toward $Z_1$; $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ constitute a dual mapping; $\|\cdot\|_2$ is the $\ell_2$ norm.
7. The image classification method for data with missing modalities according to claim 1, wherein the specific method of step S6 is:
the value computed by $\ell_{cl} + 0.1\,\ell_{pre} + 0.1\,\ell_{rec}$ is used as the current loss to back-propagate through the current self-encoders and update their parameters and weights; where $\ell_{cl}$ is the cross-modal contrastive learning loss, $\ell_{pre}$ is the cross-modal dual prediction loss, and $\ell_{rec}$ is the intra-modality reconstruction loss.
8. The image classification method for data with missing modalities according to claim 1, wherein the specific method of step S8 is:
according to the formulas

$$Z_1 = f^{(1)}\left(X_1\right), \quad Z_2 = f^{(2)}\left(X_2\right), \quad Z^{(1)} = f^{(1)}\left(X^{(1)}\right), \quad Z^{(2)} = f^{(2)}\left(X^{(2)}\right)$$

the hidden representations corresponding to the bi-modal data set with missing data are obtained, comprising the hidden representation $Z_1$ corresponding to the first modality data $X_1$ of the image samples having both modalities, the hidden representation $Z_2$ corresponding to the second modality data $X_2$ of the image samples having both modalities, the hidden representation $Z^{(1)}$ corresponding to the image sample set $X^{(1)}$ in which only the first modality exists, and the hidden representation $Z^{(2)}$ corresponding to the image sample set $X^{(2)}$ in which only the second modality exists; where $f^{(1)}(\cdot)$ represents the encoder of the latest self-encoder corresponding to the first modality data, and $f^{(2)}(\cdot)$ represents the encoder of the latest self-encoder corresponding to the second modality data.
9. The image classification method for data with missing modalities according to claim 8, wherein the specific method of step S9 is:
according to the formulas

$$\hat{Z}^{(2)} = G^{(1)}\left(Z^{(1)}\right), \quad \hat{Z}^{(1)} = G^{(2)}\left(Z^{(2)}\right)$$

the representation $\hat{Z}^{(2)}$ of the missing modality corresponding to the hidden representation $Z^{(1)}$ of the image sample set in which only the first modality exists, and the representation $\hat{Z}^{(1)}$ of the missing modality corresponding to the hidden representation $Z^{(2)}$ of the image sample set in which only the second modality exists, are respectively acquired; $G^{(1)}(\cdot)$ represents the mapping corresponding to the first modality, $G^{(2)}(\cdot)$ represents the mapping corresponding to the second modality, and $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ constitute a dual mapping.
10. The image classification method for data with missing modalities according to claim 9, wherein the specific method in step S10 for concatenating the different modality representations corresponding to each image sample into a common representation is:
$[Z_1; Z_2]$ is used as the common representation of the image samples in which both modalities exist; $[Z^{(1)}; \hat{Z}^{(2)}]$ is used as the common representation of the image samples in which only the first modality exists; $[\hat{Z}^{(1)}; Z^{(2)}]$ is used as the common representation of the image samples in which only the second modality exists.
CN202110095029.2A 2021-01-25 2021-01-25 Image classification method for data with missing modalities Active CN112784902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110095029.2A CN112784902B (en) 2021-01-25 2021-01-25 Image classification method for data with missing modalities

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110095029.2A CN112784902B (en) 2021-01-25 2021-01-25 Image classification method for data with missing modalities

Publications (2)

Publication Number Publication Date
CN112784902A CN112784902A (en) 2021-05-11
CN112784902B true CN112784902B (en) 2023-06-30

Family

ID=75758853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110095029.2A Active CN112784902B (en) 2021-01-25 2021-01-25 Image classification method for data with missing modalities

Country Status (1)

Country Link
CN (1) CN112784902B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657272B (en) * 2021-08-17 2022-06-28 山东建筑大学 Micro video classification method and system based on missing data completion
CN114742132A (en) * 2022-03-17 2022-07-12 湖南工商大学 Deep multi-view clustering method, system and equipment based on common difference learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8255739B1 (en) * 2008-06-30 2012-08-28 American Megatrends, Inc. Achieving data consistency in a node failover with a degraded RAID array
WO2017122785A1 (en) * 2016-01-15 2017-07-20 Preferred Networks, Inc. Systems and methods for multimodal generative machine learning
CN106202281A (en) * 2016-06-28 2016-12-07 广东工业大学 A kind of multi-modal data represents learning method and system
WO2018232378A1 (en) * 2017-06-16 2018-12-20 Markable, Inc. Image processing system
CN112001437A (en) * 2020-08-19 2020-11-27 四川大学 Modal non-complete alignment-oriented data clustering method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
COMPLETER: Incomplete Multi-view Clustering via Contrastive Prediction; Yijie Lin et al.; 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 11169-11178 *
Multi-modal feature adaptive clustering method based on deep neural networks; Jing Mingmin; Computer Applications and Software (No. 10); 262-269 *

Also Published As

Publication number Publication date
CN112784902A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN111310707B (en) Bone-based graph annotation meaning network action recognition method and system
CN108399428B (en) Triple loss function design method based on trace ratio criterion
JP5282658B2 (en) Image learning, automatic annotation, search method and apparatus
WO2019015246A1 (en) Image feature acquisition
CN108460356A (en) A kind of facial image automated processing system based on monitoring system
CN112784902B (en) Image classification method for data with missing modalities
CN114398961A (en) Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN110263801B (en) Image processing model generation method and device and electronic equipment
WO2022042043A1 (en) Machine learning model training method and apparatus, and electronic device
WO2021018245A1 (en) Image classification method and apparatus
CN111339942A (en) Method and system for recognizing skeleton action of graph convolution circulation network based on viewpoint adjustment
CN110738102A (en) face recognition method and system
CN112084891B (en) Cross-domain human body action recognition method based on multi-modal characteristics and countermeasure learning
WO2021051987A1 (en) Method and apparatus for training neural network model
CN109325513B (en) Image classification network training method based on massive single-class images
CN113343974A (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN110889335B (en) Human skeleton double interaction behavior identification method based on multichannel space-time fusion network
Asmai et al. Mosquito larvae detection using deep learning
CN110414541A (en) The method, equipment and computer readable storage medium of object for identification
CN115761905A (en) Diver action identification method based on skeleton joint points
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
CN112084913B (en) End-to-end human body detection and attribute identification method
CN113378938A (en) Edge transform graph neural network-based small sample image classification method and system
CN113378934B (en) Small sample image classification method and system based on semantic perception map neural network

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant