CN116186317B - Cross-modal cross-guidance-based image-text retrieval method and system - Google Patents

Cross-modal cross-guidance-based image-text retrieval method and system

Info

Publication number
CN116186317B
CN116186317B
Authority
CN
China
Prior art keywords
image
text
cross
semantic
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310436332.3A
Other languages
Chinese (zh)
Other versions
CN116186317A (en)
Inventor
丁运来
董军宇
李岳尊
于佳傲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202310436332.3A priority Critical patent/CN116186317B/en
Publication of CN116186317A publication Critical patent/CN116186317A/en
Application granted granted Critical
Publication of CN116186317B publication Critical patent/CN116186317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of artificial intelligence and discloses a cross-modal cross-guidance-based image-text retrieval method and system. The method comprises the following steps: image data and text data are input; feature extraction and shared semantic learning for the two different modalities, images and texts, are performed by using a cross-modal cross-guidance network model constructed based on a self-distillation algorithm, so as to complete the training of the model. The cross-modal cross-guidance network model comprises a teacher network and a student network; the student network comprises an image branch and a text branch, the teacher network has the same structure as the student network, and cross-modal cross guidance is performed between the teacher network and the student network. Finally, the image or text to be retrieved is input into the trained cross-modal cross-guidance network model to extract the corresponding features, the similarity of the image or text to be retrieved is calculated, and retrieval is performed according to the similarity score to obtain the optimal retrieval result. The method and the system achieve cross-modal semantic alignment and improve retrieval accuracy.

Description

Cross-modal cross-guidance-based image-text retrieval method and system
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a cross-modal cross-guidance-based image-text retrieval method and system.
Background
Cross-modal retrieval is a task of retrieving between different types of data (e.g., images, text, audio, etc.); the main challenge is how to span the "heterogeneous gap" between different modality data, i.e. how to understand and establish the relationships between different types of data.
Cross-modal image-text retrieval is a subtask of cross-modal retrieval that focuses on retrieving between images and text. Early methods used hash-based retrieval, learning hash codes for the image and text modalities and mapping them into a binary Hamming space. Such retrieval is fast, but it loses accuracy in the binarization process and does not fully mine the relationships between modalities.
With the development of computer vision and natural language processing, and especially the advent of deep learning models such as Faster R-CNN (a two-stage object detection model) and the Transformer (a deep learning model built on the self-attention mechanism), image and text features can be extracted more finely. This provides new possibilities for cross-modal image-text retrieval and the prospect of solving the problems of the earlier approaches.
Existing cross-modal image-text retrieval methods can be broadly divided into one-to-one matching and many-to-many matching. One-to-one matching, also called visual semantic embedding (Visual Semantic Embedding), first extracts image and text features with a feature extractor, then contextualizes and aggregates the extracted features with a feature aggregator, and finally maps them into the same joint embedding space, where the matching score is measured with cosine similarity. Its advantage is that image and text features can be extracted in parallel and stored offline for retrieval; however, it lacks interaction between the modalities, and computing the similarity only from the embeddings of the last layer leads to poor retrieval accuracy. Many-to-many matching first extracts image and text features to obtain segment-level representations, such as image regions and the words of the text description, and then processes the image-text segments with an attention mechanism to obtain hidden-layer representations. In this process the image and text features interact and fuse, so that the hidden layer can learn a function for measuring cross-modal similarity. This kind of method first calculates the similarity of the local representations and then integrates them to obtain the overall similarity. However, it cannot achieve cross-modal alignment at a higher semantic level, and relying solely on cross-attention between image regions and text words also incurs a large amount of computation and mismatching. The method of the present invention effectively solves these problems.
Disclosure of Invention
Aiming at the defects existing in the prior art, the invention provides an image-text retrieval method and system based on cross-modal cross guidance. By using a semantic prototype shared by the image and text modalities together with modality-specific semantic decoders, cross-modal local semantics can be captured effectively; cross-modal cross guidance is realized through a teacher and a student network, achieving fine-grained alignment between images and texts. In addition, the invention provides a plug-and-play self-distillation method based on optimal transport, which alleviates the lack of paired labels in multi-modal datasets and achieves accurate matching of images and texts. Extensive experiments on different mainstream image-text retrieval benchmarks show significant performance improvements and demonstrate the effectiveness of the invention.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides an image-text retrieval method based on cross-modal cross guidance, which comprises the following steps:
s1, inputting image data and text data of a batch;
s2, performing feature extraction and shared semantic learning of two different mode data of images and texts by using a cross-mode cross-guidance network model constructed based on a self-distillation algorithm, and completing training of the model:
The cross-modal cross-guidance network model comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module, and the teacher network has the same structure as the student network, so that the teacher network has the same images and text branches and corresponding modules;
during training, the student network and the teacher network firstly conduct feature extraction on the image and text data to obtain local features of the image and the text; secondly, respectively inputting the local features of the image and the text to corresponding semantic decoders, extracting the semantic features of the corresponding image and the text from the local features of the image and the text through a learnable shared semantic prototype, and calculating the similarity between the semantic features of the image and the text; then, semantic features of the image and the text are processed through a self-attention module respectively to obtain global features of the image and the text, and similarity between the global features of the image and the text is calculated; finally, according to the calculated similarity, the distribution of the teacher network is used as the real distribution to guide the distribution of the student network;
S3, inputting the images or texts to be searched into a trained cross-modal cross-guidance network model to extract corresponding features, calculating the similarity of the images or texts to be searched by using the extracted features, and searching according to the similarity score to obtain an optimal search result.
Further, when the cross-modal cross-guidance network model is trained, the input batch of image and text data is first subjected to data enhancement and then fed into the student network and the teacher network respectively for training; the teacher network and the student network have the same structure, and the teacher model and the student model guide each other's learning during training so as to achieve a better parameter fit; in the validation stage, the model can accurately extract image and text features and match and retrieve the corresponding images and texts; the training comprises the following specific steps:
s21, extracting local features:
the image branch and the text branch extract the characteristics of the regional level of the image and the characteristics of the word level of the text through an image encoder and a text encoder respectively to obtain the local characteristics of the image
$V$ and the text local features $T$.
S22, cross-modal sharing semantic learning:
a set of learnable shared semantic prototypes is designed
, denoted $C$; the aligned semantics of the image and the text are captured by means of a semantic decoder structure, yielding the image semantic features $U^{v}$ and the text semantic features $U^{t}$, and the similarity between them is calculated, denoted $S^{p}$.
S23, self-attention processing:
image semantic features output in step S22
$U^{v}$ and the text semantic features $U^{t}$ are respectively subjected to self-attention processing, and a sparsity loss is used to constrain the attention weights obtained by the image and text semantic features through the self-attention modules, giving the image global feature $g^{v}$ and the text global feature $g^{t}$; at the same time, the similarity between the image global feature $g^{v}$ and the text global feature $g^{t}$ is calculated again, denoted $S^{g}$.
S24, teacher and student network cross guidance:
image semantic features
$U^{v}$ and text semantic features $U^{t}$ pass through the respective self-attention modules to obtain the image global feature $g^{v}$ and the text global feature $g^{t}$; for a matched image-text pair, the aligned global features should also attend to the aligned local semantics, so cross-modal cross guidance can be performed using a relative entropy loss: the image-text similarities of the teacher and student networks are computed as in S22 and S23, and the distribution of the teacher network is then used as the real distribution to guide the distribution of the student network;
The above process is iterated until all image and text data participating in training have been input into the network model, and the network model parameters are adjusted through back propagation, wherein the parameters of the teacher network do not participate in back propagation for gradient update; the teacher network and the student network guide each other's learning while the loss is minimized, so that the true similarity relationship between images and texts is learned.
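The training iteration described above can be summarized in code. The following is a minimal sketch, not the patent's implementation: it assumes a `student` module and a structurally identical `teacher` module that each return the semantic-level and global-level similarity matrices for a batch, and a `losses` callable that combines the loss terms defined later; all names are illustrative.

```python
import torch

def train_step(student, teacher, images, texts, losses, optimizer, m=0.999):
    """One iteration: forward both networks, optimize the student, EMA-update the teacher."""
    s_sem, s_glo = student(images, texts)          # student similarity matrices (semantic, global)
    with torch.no_grad():
        t_sem, t_glo = teacher(images, texts)      # teacher similarities carry no gradient
    loss = losses(s_sem, s_glo, t_sem, t_glo)      # total loss combining the terms defined below
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                          # teacher parameters follow an exponential moving average
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)
    return loss.item()
```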
Further, the image-text retrieval method based on cross-modal cross guidance further comprises a step S25 of applying an optimal transportation algorithm, which specifically comprises the following steps: the teacher network and the student network each measure the similarity between the image and text data and output two paired similarity matrices $S$ and $\hat{S}$; the similarity matrix $S$ calculated by the student network is used to compute the triplet loss that optimizes the student network, while the similarity matrix $\hat{S}$ calculated by the teacher network is modeled as an optimal transportation problem, which is solved to obtain the optimal solution $\hat{P}$ of the optimal transportation problem, giving the most accurate image-text matching relation; the optimal transport self-distillation loss is then calculated from $\hat{P}$ and $S$, realizing the guidance of the student network.
Further, the specific steps of cross-modal shared semantic learning in step S22 are as follows:
Given the image local features $V$ and the text local features $T$, the aligned semantics between them are captured by learning with a shared semantic prototype and a semantic decoder; the shared semantic prototype is a set of learnable vectors that are randomly initialized and shared among the data of different modalities, defined as:
$$C=\{c_{1},c_{2},\ldots,c_{K}\},$$
wherein $C$ represents all shared semantic prototypes, $c_{k}$ indicates the $k$-th shared semantic prototype, and $K$ represents the number of shared semantic prototypes. The shared semantic prototypes and the local features of one modality serve as inputs to a semantic decoder; taking the image branch as an example, the image local features $V$ and the shared semantic prototypes $C$ pass through the image semantic decoder to obtain the image semantic features $U^{v}$. A semantic decoder consists of $L$ identical attention layers stacked together; in the $l$-th layer, the output of the previous layer $U^{v}_{l-1}$ together with the shared semantic prototypes $C$ attends to the image local features $V$ through a multi-head attention mechanism, capturing specific image semantics and outputting the image semantic features $U^{v}_{l}$ updated in the current layer and the attention weight matrix $A^{v}_{l}$; finally, the output of the whole image semantic decoder $U^{v}$ is obtained.
Similarly, the text semantic features $U^{t}$ of the text branch obtained through the text semantic decoder can be obtained. Then, the triplet loss based on hard negative mining is used to optimize the similarity matrix $S^{p}$, and a diversity regularization loss is introduced on the last-layer attention weights $A^{v}_{L}$ and $A^{t}_{L}$, so that the diversity regularization losses of the image branch and the text branch can be obtained.
Further, the specific steps of the self-attention processing in step S23 are as follows: the image semantic features $U^{v}$ and the text semantic features $U^{t}$ output in step S22 are input to the self-attention modules respectively to further learn and align the cross-modal semantics; for the image semantic features $U^{v}$, the processing of the self-attention module is expressed by the following formulas:
$$\bar{g}^{v}=\mathrm{L2Norm}\Big(\frac{1}{K}\sum_{k=1}^{K}u^{v}_{k}\Big),$$
$$\hat{u}^{v}_{k}=\mathrm{L2Norm}\big(u^{v}_{k}\big),$$
$$a^{v}_{k}=\frac{\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k}/\tau_{s}\big)}{\sum_{k'=1}^{K}\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k'}/\tau_{s}\big)},$$
$$g^{v}=\sum_{k=1}^{K}a^{v}_{k}\,\hat{u}^{v}_{k},$$
wherein $\tau_{s}$ represents the temperature parameter of the student network, $u^{v}_{k}$ represents the $k$-th image semantic feature, $K$ is the number of image semantic features, $\bar{g}^{v}$ and $g^{v}$ respectively represent the average-pooled image global feature and the attention-weighted image global feature, L2Norm represents L2 normalization, $\hat{u}^{v}_{k}$ and $\hat{u}^{v}_{k'}$ represent the results of L2-normalizing the $k$-th and $k'$-th image semantic features, $a^{v}$ represents the $K$ weights obtained by processing the image semantic features through the self-attention module, $a^{v}_{k}$ represents the $k$-th of these weights, and softmax refers to the softmax function; likewise, for the text semantic features $U^{t}$, the average-pooled text global feature $\bar{g}^{t}$ and the attention-weighted text global feature $g^{t}$ are obtained.
Further, the specific steps of the teacher and student network cross guidance in step S24 are as follows: the image-text similarities of the teacher and student networks are computed according to steps S22 and S23, and the distribution of the teacher network is then used as the real distribution to guide the distribution of the student network; for a matched image-text pair, the aligned global features should also attend to the aligned local semantics, and cross-modal guidance can be performed using the distillation loss:
$$q^{t}_{k}=\frac{\exp\big(\tilde{g}^{t}\cdot\tilde{u}^{t}_{k}/\tau_{t}\big)}{\sum_{k'=1}^{K}\exp\big(\tilde{g}^{t}\cdot\tilde{u}^{t}_{k'}/\tau_{t}\big)},\qquad q^{v}_{k}=\frac{\exp\big(\tilde{g}^{v}\cdot\tilde{u}^{v}_{k}/\tau_{t}\big)}{\sum_{k'=1}^{K}\exp\big(\tilde{g}^{v}\cdot\tilde{u}^{v}_{k'}/\tau_{t}\big)},$$
$$p^{v}_{k}=\frac{\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k}/\tau_{s}\big)}{\sum_{k'=1}^{K}\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k'}/\tau_{s}\big)},\qquad p^{t}_{k}=\frac{\exp\big(\bar{g}^{t}\cdot\hat{u}^{t}_{k}/\tau_{s}\big)}{\sum_{k'=1}^{K}\exp\big(\bar{g}^{t}\cdot\hat{u}^{t}_{k'}/\tau_{s}\big)},$$
$$\mathcal{L}_{dis}=KL\big(q^{t}\,\|\,p^{v}\big)+KL\big(q^{v}\,\|\,p^{t}\big),$$
wherein $\mathcal{L}_{dis}$ indicates the distillation loss, $KL(q^{t}\,\|\,p^{v})$ represents the KL divergence between the real distribution $q^{t}$ of the text in the teacher network and the estimated distribution $p^{v}$ of the image in the student network, and $KL(q^{v}\,\|\,p^{t})$ represents the KL divergence between the real distribution $q^{v}$ of the image in the teacher network and the estimated distribution $p^{t}$ of the text in the student network; the distribution of the teacher is used as the real distribution to guide the distribution of the student. $\tilde{u}^{v}_{k}$ and $\tilde{u}^{t}_{k}$ represent the results of L2-normalizing the $k$-th image and text semantic features in the teacher network, $\hat{u}^{v}_{k}$ and $\hat{u}^{t}_{k}$ represent the results of L2-normalizing the $k$-th image and text semantic features in the student network, $\tilde{g}^{v}$ and $\tilde{g}^{t}$ denote the average-pooled image and text global features in the teacher network, and $\tau_{t}$ represents the temperature parameter of the teacher network, whose value is greater than the temperature parameter $\tau_{s}$ of the student network.
Further, the self-distillation step based on optimal transportation of step S25 is as follows: first, the paired labels $Z\in\{0,1\}^{B\times B}$ are assigned, $B\times B$ being the size of one batch of input images and texts; from $Z$ and $\hat{S}$, an optimal transportation problem is constructed using the formula:
$$W(\mu,\nu)=\sup_{\pi\in\Pi(\mu,\nu)}\mathbb{E}_{(x,y)\sim\pi}\big[S(x,y)\big],\qquad \hat{P}=\arg\max_{P\in\Pi(a,b)}\sum_{i=1}^{B}\sum_{j=1}^{B}P_{ij}\big(Z_{ij}+\hat{S}_{ij}\big),$$
wherein $W(\mu,\nu)$ represents the optimal transportation problem, sup represents the upper bound, $\mu$ and $\nu$ represent two probability distributions, $\Pi(\mu,\nu)$ represents the set of all joint probability distributions from $\mu$ to $\nu$, $\pi\in\Pi(\mu,\nu)$ is a joint probability distribution, $x$ and $y$ represent elements of the image set and the text set, and $S(x,y)$ is the similarity between them; the optimal transportation problem aims at finding a joint probability distribution $\pi$ whose marginal distributions are $\mu$ and $\nu$ and whose expected benefit $\mathbb{E}_{(x,y)\sim\pi}[S(x,y)]$ is maximal, so $\sup_{\pi\in\Pi(\mu,\nu)}$ means maximizing the expected benefit over $\pi$ to find the optimal solution. In the discrete form, max represents maximization, $\Pi(a,b)$ represents the set of joint probability distributions from $a$ to $b$, $a$ and $b$ represent weight vectors, $Z_{ij}$ represents the label corresponding to the $i$-th image and the $j$-th text, $\hat{S}_{ij}$ represents the similarity value of the $i$-th image and the $j$-th text in the teacher network, and the batch size of the images and texts is $B$;
the optimal solution $\hat{P}$ of the optimal transportation problem equation is obtained by solving, and the optimal transport self-distillation loss is expressed as follows:
$$\mathcal{L}_{otsd}=\mathcal{L}_{kl}\big(\hat{P}^{p},S^{p}\big)+\mathcal{L}_{kl}\big(\hat{P}^{g},S^{g}\big),$$
wherein $\mathcal{L}_{otsd}$ indicates the optimal transport self-distillation loss, $\mathcal{L}_{kl}$ represents the relative entropy loss, $S^{p}$ and $\hat{P}^{p}$ respectively represent the similarity matrix between the image and text semantic features and the corresponding optimal transport solution matrix, and $S^{g}$ and $\hat{P}^{g}$ respectively represent the similarity matrix between the image and text global features and the corresponding optimal transport solution matrix.
Further, during model training, the total loss function of the whole network model is expressed as:
$$\mathcal{L}=\mathcal{L}^{p}_{trip}+\mathcal{L}^{g}_{trip}+\lambda_{1}\mathcal{L}_{div}+\lambda_{2}\mathcal{L}_{spa}+\lambda_{3}\mathcal{L}_{dis}+\lambda_{4}\mathcal{L}_{otsd},$$
wherein $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ and $\lambda_{4}$ are hyperparameters, $\mathcal{L}^{p}_{trip}$ represents the triplet loss under the similarity of the image and text semantic features, $\mathcal{L}^{g}_{trip}$ represents the triplet loss under the similarity of the image and text global features, $\mathcal{L}_{div}$ represents the diversity regularization loss, $\mathcal{L}_{spa}$ represents the sparsity loss, $\mathcal{L}_{dis}$ indicates the distillation loss, and $\mathcal{L}_{otsd}$ indicates the optimal transport self-distillation loss.
The invention also provides an image-text retrieval system based on cross-modal cross guidance, which is used for realizing the above image-text retrieval method based on cross-modal cross guidance. The system comprises a data preprocessing module, a cross-modal cross-guidance network and a loss function module. The data preprocessing module is used for processing the image or text data to be retrieved as the input of the teacher network and the student network. The cross-modal cross-guidance network comprises a teacher network and a student network; the student network comprises an image branch and a text branch, the image branch comprises an image encoder, an image semantic decoder and an image self-attention module for processing the input image data, and the text branch comprises a text encoder, a text semantic decoder and a text self-attention module for processing the input text data; the structure of the teacher network is the same as that of the student network, and the teacher network and the student network perform cross-modal cross guidance. The loss function module is used for calculating the triplet loss, the diversity regularization loss, the sparsity loss, the distillation loss and the optimal transport self-distillation loss.
Compared with the prior art, the invention has the advantages that:
Existing image-text retrieval methods mainly rely on local image and text features to measure cross-modal similarity, so cross-modal semantic alignment of the matched features becomes critical; however, due to the heterogeneous differences between modalities, such alignment is extremely difficult. To solve this problem, the invention proposes to learn a cross-modal cross-guided consensus by using a modality-shared semantic prototype and modality-specific semantic decoders. Unlike existing methods, the invention proposes a new paradigm that does not align multi-modal local features directly but uses the shared semantic prototype as a bridge and focuses on the specific contents of different modalities through the semantic decoders; in this process, cross-modal semantic alignment is achieved naturally and the accuracy of cross-modal retrieval is improved. In addition, the invention designs a new self-distillation method based on optimal transport, which alleviates the lack of paired labels in multi-modal datasets. Numerous experimental results demonstrate the effectiveness and versatility of both designs of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an overall structure diagram of the cross-modal cross-guided image-text retrieval of the invention;
FIG. 2 is a structure diagram of the student network for cross-modal cross-guided image-text retrieval of the invention;
FIG. 3 is a structural diagram of a semantic decoder according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
Example 1
With reference to fig. 1, the embodiment provides a cross-modal cross-guidance-based image-text retrieval method, which comprises the following steps:
s1, inputting image data and text data of a batch.
S2, performing feature extraction and shared semantic learning of two different mode data of images and texts by using a cross-mode cross-guidance network model constructed based on a self-distillation algorithm, and completing training of the model.
The cross-modal cross-guidance network model comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts. The image branch comprises an image encoder, an image semantic decoder and an image self-attention module, and the text branch comprises a text encoder, a text semantic decoder and a text self-attention module, wherein the structures and principles of the image semantic decoder and the text semantic decoder are the same, and the structures and principles of the image self-attention module and the text self-attention module are the same. The teacher's network has the same structure as the student's network, so the teacher's network has the same image and text branches and corresponding modules.
During training, the student network and the teacher network firstly perform feature extraction on the image and text data to obtain local features of the image
$V$ and the text local features $T$. Next, the local features of the image and the text are input to the corresponding semantic decoders; through the learnable shared semantic prototypes $C$, the corresponding image semantic features $U^{v}$ and text semantic features $U^{t}$ are extracted from the image and text local features. The image semantic features and the text semantic features achieve a preliminary alignment, and their similarity can be calculated, denoted $S^{p}$. To further align the image semantic features and the text semantic features, the invention designs a self-attention module: the image and text semantic features are processed by the self-attention modules respectively and refined into the image global feature and the text global feature, and the similarity between the image global feature and the text global feature is denoted $S^{g}$. Finally, according to the calculated similarities, the distribution of the teacher network is used as the real distribution to guide the distribution of the student network.
As a preferred implementation, the teacher network and the student network each measure the similarity between the image and text data and output two paired similarity matrices $S$ and $\hat{S}$. The similarity matrix $S$ of the student network is used to compute the triplet loss that optimizes the student network, while the similarity matrix $\hat{S}$ of the teacher network is modeled as an optimal transportation problem; solving it gives the optimal solution of the optimal transportation problem, also called the optimal transportation plan, $\hat{P}\in\mathbb{R}^{B\times B}$, where $B\times B$ is the size of one batch of input images and texts. The optimal transport self-distillation loss is calculated from $\hat{P}$ and $S$ to guide the student network.
In addition, the teacher network provides additional cross-modal guidance for the student network. The parameters of the teacher network model do not compute gradients; they come from an exponential moving average of the student network parameters, updated as follows:
$$\theta_{t}\leftarrow m\,\theta_{t}+(1-m)\,\theta_{s},$$
wherein $m$ represents the momentum update weight, $\theta_{t}$ represents the parameters of the teacher model, and $\theta_{s}$ represents the parameters of the student model.
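A minimal sketch of this momentum update, assuming teacher and student are torch modules with parameters in the same order; the weight value is illustrative.

```python
import torch

@torch.no_grad()
def momentum_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.999):
    """theta_t <- m * theta_t + (1 - m) * theta_s; the teacher receives no gradients."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)
```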
S3, inputting the images or texts to be searched into a trained cross-modal cross-guidance network model to extract corresponding features, calculating the similarity of the images or texts to be searched by using the extracted features, and searching according to the similarity score to obtain an optimal search result.
As a preferred implementation mode, when the cross-modal cross-guidance network model is trained, the input image and text data of a batch are subjected to data enhancement and then are respectively input into a student network and a teacher network for training. The teacher network and the student network have the same structure, and the teacher model and the student model guide learning mutually in the training process so as to achieve a better parameter fitting effect. In the verification stage, the model can accurately extract image and text features and match and retrieve corresponding images and text.
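A minimal sketch of the retrieval stage (step S3) under the assumption that the trained branches produce global features for the query and the gallery; the helper names are hypothetical, not the patent's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 5):
    """query_feat: (d,); gallery_feats: (N, d). Rank gallery items by cosine similarity."""
    q = F.normalize(query_feat, dim=-1)                 # L2-normalize the query global feature
    g = F.normalize(gallery_feats, dim=-1)              # L2-normalize the gallery global features
    scores = g @ q                                      # cosine similarity scores, shape (N,)
    return torch.topk(scores, k=min(top_k, g.size(0)))  # best matches by similarity score

# usage: values, indices = retrieve(text_global_feature, image_global_features)
```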
The specific steps of model training are described below.
S21, extracting local features:
image branches and text branches are respectively extracted through a visual encoder and a text encoder to obtain image local features
$V$ and the text local features $T$, and map them to the same dimension.
As a preferred embodiment, as shown in fig. 2, at feature extraction time, given the $B$ image-text pairs of a batch, data enhancement is realized by randomly deleting or replacing words of the text and regions of the image. A Faster R-CNN model based on bottom-up and top-down attention is used as the image encoder and a Transformer-based pre-trained BERT model as the text encoder to extract their local features respectively, obtaining the image local features $V$ and the text local features $T$:
$$V=\{v_{1},v_{2},\ldots,v_{M}\},$$
$$T=\{t_{1},t_{2},\ldots,t_{N}\},$$
wherein $v_{m}$ represents the $m$-th image region, $M$ represents the number of region features obtained by the image through the image encoder, $t_{n}$ represents the $n$-th text word, and $N$ represents the number of word features obtained by the text through the text encoder. Before the image region features and text word features are processed further, a fully connected layer or a multi-layer perceptron is used to scale and unify the feature dimensions, so that the final feature dimensions of the image and the text are the same; since the output dimensions of the two encoders differ, the image and text local features need to be mapped to the same dimension $d$, i.e. $v_{m}\in\mathbb{R}^{d}$ and $t_{n}\in\mathbb{R}^{d}$.
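A sketch of the dimension-unification step, assuming the Faster R-CNN region features and the BERT word features have already been extracted (the input dimensions shown are common defaults, not values fixed by the patent).

```python
import torch
import torch.nn as nn

class LocalFeatureProjector(nn.Module):
    """Maps pre-extracted region features (B, M, d_v) and word features (B, N, d_t)
    to a common dimension d, giving the local features V and T."""
    def __init__(self, d_v: int = 2048, d_t: int = 768, d: int = 1024):
        super().__init__()
        self.img_proj = nn.Linear(d_v, d)   # fully connected layer for image regions
        self.txt_proj = nn.Linear(d_t, d)   # fully connected layer for text words

    def forward(self, regions: torch.Tensor, words: torch.Tensor):
        V = self.img_proj(regions)          # (B, M, d) image local features
        T = self.txt_proj(words)            # (B, N, d) text local features
        return V, T
```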
S22, cross-modal sharing semantic learning:
a set of learnable shared semantic prototypes is designed
, denoted $C$; the aligned semantics of the image and the text are captured by means of a semantic decoder structure, yielding the image semantic features $U^{v}$ and the text semantic features $U^{t}$, and their similarity is then calculated. Specifically: the image local features $V$ and the text local features $T$ are input to the image semantic decoder and the text semantic decoder respectively; through the learnable shared semantic prototypes $C$, the corresponding image semantic features $U^{v}$ and text semantic features $U^{t}$ are extracted from the image and text local features, and the similarity between $U^{v}$ and $U^{t}$ is calculated, denoted $S^{p}$. On the basis of the obtained similarity matrix, a triplet loss is used for optimization, and a diversity regularization loss is used to constrain the attention weights obtained by the image and text local features through the semantic decoders.
As a preferred embodiment, the specific steps of cross-modal shared semantic learning are as follows:
Given the image local features $V$ and the text local features $T$, the shared semantic prototypes and the semantic decoders are used to learn and capture the aligned semantics between them; a shared semantic prototype is a set of learnable vectors that are randomly initialized and shared among the data of different modalities, defined as:
$$C=\{c_{1},c_{2},\ldots,c_{K}\},$$
wherein $C$ represents all shared semantic prototypes, $c_{k}\in\mathbb{R}^{d}$ indicates the $k$-th shared semantic prototype, and $K$ represents the number of shared semantic prototypes. The shared semantic prototypes and the local features of one modality serve as inputs to a semantic decoder; taking the image branch as an example here, the image semantic features obtained after the image local features $V$ and the shared semantic prototypes $C$ pass through the image semantic decoder can be expressed as:
$$U^{v}=\mathrm{SemanticDec}(C,V),$$
where $U^{v}$ represents the image semantic features that the shared semantic prototypes $C$ mine from the image local features $V$, and SemanticDec represents the semantic decoder structure. As shown in fig. 3, a semantic decoder consists of $L$ identical attention layers stacked together. In the $l$-th layer, the output of the previous layer $U^{v}_{l-1}$ together with the shared semantic prototypes $C$ attends to the image local features $V$ through a multi-head attention mechanism (MHA), capturing specific image semantics and outputting the updated image semantic features $\tilde{U}^{v}_{l}$ and the attention weight matrix $A^{v}_{l}$, which can be described with the following formulas:
$$\tilde{U}^{v}_{l},\ A^{v}_{l}=\mathrm{MHA}\big(Q,K,\mathrm{Val}\big),$$
$$Q=\big(U^{v}_{l-1}+C\big)\,W^{Q},\qquad K=V\,W^{K},\qquad \mathrm{Val}=V\,W^{V},$$
wherein $W^{Q}$, $W^{K}$ and $W^{V}$ are learnable parameters, and $Q$, $K$ and $\mathrm{Val}$ are the query, key and value inputs of the attention mechanism, whose principle is the prior art and is not described in detail here. In this embodiment, $Q_{i}$, $K_{i}$ and $\mathrm{Val}_{i}$ represent the $i$-th uniform slice of $Q$, $K$ and $\mathrm{Val}$, with $i=1,\ldots,h$, where $h$ is the number of attention heads, so the output of the $i$-th head is:
$$\mathrm{head}_{i}=\mathrm{softmax}\Big(\frac{Q_{i}K_{i}^{\top}}{\sqrt{d/h}}\Big)\,\mathrm{Val}_{i}.$$
The outputs of all the heads are combined, which can be expressed as:
$$\mathrm{MHA}\big(Q,K,\mathrm{Val}\big)=\mathrm{Concat}\big(\mathrm{head}_{1},\ldots,\mathrm{head}_{h}\big)\,W^{O},$$
wherein $\mathrm{head}_{i}$ is the output of the $i$-th head and $W^{O}$ is a learnable projection.
Finally, the output of the $l$-th layer of the semantic decoder is:
$$U^{v}_{l}=\mathrm{LayerNorm}\big(U^{v}_{l-1}+\mathrm{Dropout}(\tilde{U}^{v}_{l})\big),$$
wherein LayerNorm represents layer normalization, $U^{v}_{l-1}$ represents the output of the semantic decoder layer before the $l$-th layer, and Dropout means random inactivation of neurons; $A^{v}_{l}$ reflects the attention of different shared semantic prototypes to different image local features. $U^{v}_{0}$ is set to an all-zero matrix, and with $U^{v}$ denoting the output of the whole semantic decoder, there is:
$$U^{v}=U^{v}_{L}.$$
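A sketch of one way to realize the semantic decoder described above: the previous output plus the shared prototypes forms the query, the local features serve as key and value, and a residual connection with Dropout and LayerNorm gives the layer output. It builds on torch.nn.MultiheadAttention, so the internal projections may differ from the patent's exact formulation.

```python
import torch
import torch.nn as nn

class SemanticDecoderLayer(nn.Module):
    def __init__(self, d: int = 1024, heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.drop = nn.Dropout(p_drop)

    def forward(self, u_prev, prototypes, local_feats):
        q = u_prev + prototypes                            # query: previous output + shared prototypes
        upd, attn = self.mha(q, local_feats, local_feats)  # attend to the local features
        u = self.norm(u_prev + self.drop(upd))             # residual connection + LayerNorm
        return u, attn                                     # updated semantics and attention weights

class SemanticDecoder(nn.Module):
    def __init__(self, num_layers: int = 2, K: int = 20, d: int = 1024):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(K, d) * 0.02)   # shared semantic prototypes C
        self.layers = nn.ModuleList([SemanticDecoderLayer(d) for _ in range(num_layers)])

    def forward(self, local_feats):                        # (B, M, d) image or (B, N, d) text features
        c = self.prototypes.unsqueeze(0).expand(local_feats.size(0), -1, -1)
        u = torch.zeros_like(c)                            # U_0 is an all-zero matrix
        attn = None
        for layer in self.layers:
            u, attn = layer(u, c, local_feats)
        return u, attn                                     # semantic features and last-layer attention
```

In the patent the prototypes are shared across the two modality-specific decoders, so in practice the prototype parameter would be created once and passed to both decoders; it is kept inside the module here only to keep the sketch short.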
similarly, text semantic features obtained by the text branches through the semantic decoder can be obtained:
$$U^{t}=\mathrm{SemanticDec}(C,T),$$
where SemanticDec represents the semantic decoder structure. During training, $C$ updates its own parameters layer by layer by attending to the semantics of a specific modality, so as to adaptively capture diversified semantics from a large amount of input data. At the same time, $U^{v}_{l}$ and $U^{t}_{l}$ always attend to the shared semantic prototypes $C$, and the residual connection on each layer avoids semantic drift between different modalities. Thus $U^{v}$ and $U^{t}$ learn a cross-modal consensus through the corresponding shared semantic prototypes $C$ to achieve alignment of the image and text modalities, and the amount of this cross-modal consensus represents the overall similarity of the image and the text. The similarity between the $i$-th image and the $j$-th text of a batch can be calculated as:
$$S^{p}_{ij}=\frac{1}{K}\sum_{k=1}^{K}\mathrm{L2Norm}\big(u^{v}_{i,k}\big)\cdot\mathrm{L2Norm}\big(u^{t}_{j,k}\big),$$
wherein $u^{v}_{i,k}$ indicates the $k$-th image semantic feature of the $i$-th image (there are $K$ in total), $u^{t}_{j,k}$ indicates the $k$-th text semantic feature of the $j$-th text (there are $K$ in total), L2Norm is L2 normalization, and the similarity is calculated with the cosine similarity method. The similarity matrix obtained by the semantic decoder modules for the images and texts of the whole batch is $S^{p}\in\mathbb{R}^{B\times B}$, and this similarity matrix can be optimized using a triplet loss based on hard negative mining, expressed as:
$$\mathcal{L}^{p}_{trip}=\sum_{(i,t)\in\mathcal{B}}\Big(\big[\mathrm{margin}-S^{p}(i,t)+S^{p}(i,\hat{t})\big]_{+}+\big[\mathrm{margin}-S^{p}(i,t)+S^{p}(\hat{i},t)\big]_{+}\Big).$$
The learning target of the triplet loss is a relative distance: through learning, the distance between the reference sample and the negative sample becomes far greater than the distance between the reference sample and the positive sample. In the present invention, the reference sample and the positive sample refer to the text matched with an image or the image matched with a text. Here $\mathcal{L}^{p}_{trip}$ represents the triplet loss under the similarity of the image and text semantic features in one batch, $(i,t)$ denotes a positively matched image-text pair in the image-text semantic feature similarity matrix, $(i,\hat{t})$ and $(\hat{i},t)$ denote unmatched image-text pairs in the similarity matrix, margin represents the boundary parameter in the triplet loss, and $[\cdot]_{+}=\max(\cdot,0)$. Only the hard negative samples, i.e. the unmatched samples with the largest image-text similarity, $\hat{t}=\arg\max_{t'\neq t}S^{p}(i,t')$ and $\hat{i}=\arg\max_{i'\neq i}S^{p}(i',t)$, are used within a mini-batch rather than summing over all negative samples, which lets the model learn from more challenging negative samples and thereby improves the robustness and generalization performance of the model.
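A sketch of the hard-negative triplet loss over a B x B similarity matrix whose diagonal holds the matched pairs; the margin default mirrors the value used later in this embodiment.

```python
import torch

def triplet_loss_hard_negative(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """sim: (B, B) image-text similarity matrix; diagonal entries are the matched pairs."""
    B = sim.size(0)
    pos = sim.diag()                                                 # S(i, t) for matched pairs, shape (B,)
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg_inf = torch.finfo(sim.dtype).min
    hardest_txt = sim.masked_fill(mask, neg_inf).max(dim=1).values   # hardest negative text per image
    hardest_img = sim.masked_fill(mask, neg_inf).max(dim=0).values   # hardest negative image per text
    cost_txt = (margin - pos + hardest_txt).clamp(min=0)
    cost_img = (margin - pos + hardest_img).clamp(min=0)
    return (cost_txt + cost_img).sum()
```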
In addition, in order to avoid feature redundancy and to ensure the diversity of the different semantics, a diversity regularization loss is introduced on the last-layer attention weights $A^{v}_{L}$; for the image branch it can be expressed as:
$$\mathcal{L}^{v}_{div}=\big\|\bar{A}^{v}_{L}\,\big(\bar{A}^{v}_{L}\big)^{\top}-I\big\|_{F}^{2},$$
wherein $\mathcal{L}^{v}_{div}$ represents the diversity regularization loss of the image branch, $I$ is the identity matrix, $\bar{A}^{v}_{L}$ is the result of L2-normalizing the attention weights $A^{v}_{L}$, $\big(\bar{A}^{v}_{L}\big)^{\top}$ is the transposed matrix of $\bar{A}^{v}_{L}$, $\|\cdot\|_{F}$ represents the Frobenius norm of the matrix, and the exponent 2 denotes the square.
Using the same procedure, the diversity regularization loss $\mathcal{L}^{t}_{div}$ of the text branch can be obtained, so the diversity regularization loss of the model as a whole, $\mathcal{L}_{div}$, is:
$$\mathcal{L}_{div}=\mathcal{L}^{v}_{div}+\mathcal{L}^{t}_{div}.$$
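A sketch of the diversity regularization term for one branch, assuming the last-layer attention weights have shape (B, K, M); the text branch is handled identically and the two results are summed.

```python
import torch
import torch.nn.functional as F

def diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, K, M) last-layer attention weights of the semantic decoder."""
    a = F.normalize(attn, p=2, dim=-1)                  # L2-normalize each prototype's attention row
    eye = torch.eye(a.size(1), device=a.device).unsqueeze(0)
    gram = torch.bmm(a, a.transpose(1, 2))              # (B, K, K) pairwise correlations
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()   # squared Frobenius norm, averaged over the batch
```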
s23, self-attention processing:
image semantic features output in step S22
$U^{v}$ and the text semantic features $U^{t}$ are input to the self-attention modules respectively, the relationship between the global features and the local semantic features of the images and texts is explored, and the cross-modal semantics are further learned and aligned. Meanwhile, the similarity is calculated again for the image and text global features obtained after the self-attention processing, denoted $S^{g}$. A triplet loss is used for optimization, and a sparsity constraint is imposed on the attention weights obtained through the self-attention modules.
As a preferred embodiment, the specific steps of step S23 are as follows: the image semantic features $U^{v}$ and the text semantic features $U^{t}$ are input to the self-attention modules respectively; for the image semantic features $U^{v}$, the processing of the self-attention module can be expressed by the following formulas:
$$\bar{g}^{v}=\mathrm{L2Norm}\Big(\frac{1}{K}\sum_{k=1}^{K}u^{v}_{k}\Big),$$
$$\hat{u}^{v}_{k}=\mathrm{L2Norm}\big(u^{v}_{k}\big),$$
$$a^{v}_{k}=\frac{\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k}/\tau_{s}\big)}{\sum_{k'=1}^{K}\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k'}/\tau_{s}\big)},$$
$$g^{v}=\sum_{k=1}^{K}a^{v}_{k}\,\hat{u}^{v}_{k},$$
wherein $\tau_{s}$ represents the temperature parameter of the student network, $u^{v}_{k}$ represents the $k$-th image semantic feature, $K$ is the number of image semantic features, $\bar{g}^{v}$ and $g^{v}$ respectively represent the average-pooled image global feature and the attention-weighted image global feature, L2Norm represents L2 normalization, $\hat{u}^{v}_{k}$ and $\hat{u}^{v}_{k'}$ represent the results of L2-normalizing the $k$-th and $k'$-th image semantic features, $a^{v}$ represents the $K$ weights obtained by processing the image semantic features through the self-attention module, $a^{v}_{k}$ represents the $k$-th of these weights, and softmax refers to the softmax function; likewise, for the text semantic features $U^{t}$, the average-pooled text global feature $\bar{g}^{t}$ and the attention-weighted text global feature $g^{t}$ are obtained. The constraint using the sparsity loss can be expressed as:
$$\mathcal{L}_{spa}=H\big(a^{v}\big)+H\big(a^{t}\big),\qquad H(a)=-\sum_{k=1}^{K}a_{k}\log a_{k},$$
wherein $\mathcal{L}_{spa}$ represents the sparsity loss, $a^{t}$ represents the weights of the text semantic features processed by the self-attention module, and $H(a^{v})$ and $H(a^{t})$ respectively denote the entropy regularization of the image weights and the text weights. From this, the similarity between the global features of the $i$-th image and the $j$-th text of a batch can be calculated as:
$$S^{g}_{ij}=\mathrm{L2Norm}\big(g^{v}_{i}\big)\cdot\mathrm{L2Norm}\big(g^{t}_{j}\big),$$
wherein $g^{v}_{i}$ indicates the attention-weighted global feature of the $i$-th image, $g^{t}_{j}$ indicates the attention-weighted global feature of the $j$-th text, and L2Norm represents L2 normalization. The similarity matrix between the global features of the images and texts of the whole batch obtained by the self-attention modules is $S^{g}\in\mathbb{R}^{B\times B}$, and this similarity matrix can be optimized using a triplet loss based on hard negative mining, expressed as:
$$\mathcal{L}^{g}_{trip}=\sum_{(i,t)\in\mathcal{B}}\Big(\big[\mathrm{margin}-S^{g}(i,t)+S^{g}(i,\hat{t})\big]_{+}+\big[\mathrm{margin}-S^{g}(i,t)+S^{g}(\hat{i},t)\big]_{+}\Big),$$
wherein $\mathcal{L}^{g}_{trip}$ represents the triplet loss under the global feature similarity of images and texts in one batch, $(i,t)$ denotes a positively matched image-text pair in the image-text global feature similarity matrix, $(i,\hat{t})$ and $(\hat{i},t)$ denote unmatched image-text pairs in the similarity matrix, margin represents the boundary parameter in the triplet loss, and only hard negative samples are used within a mini-batch.
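A sketch of the self-attention pooling and the entropy-based sparsity term under the reconstruction above; tau_s is the student temperature and the small epsilon is added only for numerical stability.

```python
import torch
import torch.nn.functional as F

def attention_pool(u: torch.Tensor, tau_s: float = 0.2):
    """u: (B, K, d) semantic features -> attention-weighted global feature and weights."""
    u_hat = F.normalize(u, dim=-1)                          # L2-normalize each semantic feature
    g_bar = F.normalize(u.mean(dim=1), dim=-1)              # average-pooled global feature, (B, d)
    logits = torch.einsum('bd,bkd->bk', g_bar, u_hat) / tau_s
    a = logits.softmax(dim=-1)                              # attention weights, (B, K)
    g = torch.einsum('bk,bkd->bd', a, u_hat)                # attention-weighted global feature
    return g, a

def sparsity_loss(a_img: torch.Tensor, a_txt: torch.Tensor) -> torch.Tensor:
    """Entropy regularization of the image and text attention weights."""
    entropy = lambda a: -(a * (a + 1e-8).log()).sum(dim=-1).mean()
    return entropy(a_img) + entropy(a_txt)
```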
S24, teacher and student network cross guidance:
image semantic features
$U^{v}$ and text semantic features $U^{t}$ pass through the self-attention modules to obtain the image global feature $g^{v}$ and the text global feature $g^{t}$ respectively. For a matched image-text pair, the aligned global features should also attend to the aligned local semantics, and the relative entropy loss can be used for cross-modal cross guidance. The image-text similarities of the teacher and student networks are computed as in S22 and S23, and the distribution of the teacher network is then used as the real distribution to guide the distribution of the student network.
As a preferred embodiment, the specific steps of step S24 are as follows:
For a matched image-text pair, the aligned global features should also attend to the aligned local semantics, and the distillation loss can be used for cross-modal guidance:
$$q^{t}_{k}=\frac{\exp\big(\tilde{g}^{t}\cdot\tilde{u}^{t}_{k}/\tau_{t}\big)}{\sum_{k'=1}^{K}\exp\big(\tilde{g}^{t}\cdot\tilde{u}^{t}_{k'}/\tau_{t}\big)},\qquad q^{v}_{k}=\frac{\exp\big(\tilde{g}^{v}\cdot\tilde{u}^{v}_{k}/\tau_{t}\big)}{\sum_{k'=1}^{K}\exp\big(\tilde{g}^{v}\cdot\tilde{u}^{v}_{k'}/\tau_{t}\big)},$$
$$p^{v}_{k}=\frac{\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k}/\tau_{s}\big)}{\sum_{k'=1}^{K}\exp\big(\bar{g}^{v}\cdot\hat{u}^{v}_{k'}/\tau_{s}\big)},\qquad p^{t}_{k}=\frac{\exp\big(\bar{g}^{t}\cdot\hat{u}^{t}_{k}/\tau_{s}\big)}{\sum_{k'=1}^{K}\exp\big(\bar{g}^{t}\cdot\hat{u}^{t}_{k'}/\tau_{s}\big)},$$
$$\mathcal{L}_{dis}=KL\big(q^{t}\,\|\,p^{v}\big)+KL\big(q^{v}\,\|\,p^{t}\big),$$
wherein $\mathcal{L}_{dis}$ indicates the distillation loss, $KL(q^{t}\,\|\,p^{v})$ represents the KL divergence between the real distribution $q^{t}$ of the text in the teacher network and the estimated distribution $p^{v}$ of the image in the student network, and $KL(q^{v}\,\|\,p^{t})$ represents the KL divergence between the real distribution $q^{v}$ of the image in the teacher network and the estimated distribution $p^{t}$ of the text in the student network; the distribution of the teacher is used as the real distribution to guide the distribution of the student. $\tilde{u}^{v}_{k}$ and $\tilde{u}^{t}_{k}$ represent the results of L2-normalizing the $k$-th image and text semantic features in the teacher network, $\hat{u}^{v}_{k}$ and $\hat{u}^{t}_{k}$ represent the results of L2-normalizing the $k$-th image and text semantic features in the student network, $\tilde{g}^{v}$ and $\tilde{g}^{t}$ denote the average-pooled image and text global features in the teacher network, and $\tau_{t}$ represents the temperature parameter of the teacher network, whose value is greater than the temperature parameter $\tau_{s}$ of the student network.
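A sketch of the cross-guidance term under the distribution form reconstructed above (attention logits of the global feature over the K shared semantics); the exact way the logits are formed in the patent's figures is not recoverable, so this is an assumption. The teacher logits are detached and use the larger teacher temperature.

```python
import torch
import torch.nn.functional as F

def cross_guidance_loss(logits_img_s, logits_txt_s, logits_img_t, logits_txt_t,
                        tau_s: float, tau_t: float):
    """logits_*: (B, K) affinity of the global feature to each shared semantic.
    The teacher text distribution guides the student image distribution and vice versa."""
    q_txt = (logits_txt_t.detach() / tau_t).softmax(dim=-1)   # real distribution: teacher text
    q_img = (logits_img_t.detach() / tau_t).softmax(dim=-1)   # real distribution: teacher image
    log_p_img = (logits_img_s / tau_s).log_softmax(dim=-1)    # student image estimation
    log_p_txt = (logits_txt_s / tau_s).log_softmax(dim=-1)    # student text estimation
    kl = lambda q, log_p: F.kl_div(log_p, q, reduction='batchmean')
    return kl(q_txt, log_p_img) + kl(q_img, log_p_txt)
```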
S25, self-distillation based on optimal transportation:
the teacher network and the student network respectively measure the similarity of two data of images and texts, output two paired similarity matrixes, model the similarity matrixes calculated by the teacher network into an optimal transportation problem, obtain the most accurate image text matching relation, and realize the guidance of the student network through the optimal transportation self-distillation loss; this is a self-distilling process, since the teacher network and the student network are identical in structure.
As a preferred embodiment, the specific steps of step S25 are as follows:
The teacher network and the student network each measure the similarity between the image and text data and output two paired similarity matrices $S$ and $\hat{S}$. The similarity matrix $S$ calculated by the student network is used to optimize the student network, while the similarity matrix $\hat{S}$ calculated by the teacher network is modeled as an optimal transportation problem to further guide $S$. To maximize the matching score, the paired labels $Z\in\{0,1\}^{B\times B}$ are first assigned, $B\times B$ being the size of one batch of input images and texts; from $Z$ and $\hat{S}$, an optimal transportation problem can be constructed using the formula:
$$W(\mu,\nu)=\sup_{\pi\in\Pi(\mu,\nu)}\mathbb{E}_{(x,y)\sim\pi}\big[S(x,y)\big],\qquad \hat{P}=\arg\max_{P\in\Pi(a,b)}\sum_{i=1}^{B}\sum_{j=1}^{B}P_{ij}\big(Z_{ij}+\hat{S}_{ij}\big),$$
wherein $W(\mu,\nu)$ represents the optimal transportation problem, sup represents the upper bound, $\mu$ and $\nu$ represent two probability distributions, $\Pi(\mu,\nu)$ represents the set of all joint probability distributions from $\mu$ to $\nu$, $\pi\in\Pi(\mu,\nu)$ is a joint probability distribution, $x$ and $y$ represent elements of the image set and the text set, and $S(x,y)$ is the similarity between them; the optimal transportation problem aims at finding a joint probability distribution $\pi$ whose marginal distributions are $\mu$ and $\nu$ and whose expected benefit $\mathbb{E}_{(x,y)\sim\pi}[S(x,y)]$ is maximal, so $\sup_{\pi\in\Pi(\mu,\nu)}$ means maximizing the expected benefit over $\pi$ to find the optimal solution. In the discrete form, max represents maximization, $\Pi(a,b)$ represents the set of joint probability distributions from $a$ to $b$, $a$ and $b$ represent weight vectors, $Z_{ij}$ represents the label corresponding to the $i$-th image and the $j$-th text, $\hat{S}_{ij}$ represents the similarity value of the $i$-th image and the $j$-th text in the teacher network, and the batch size of the images and texts is $B$.
Optimal solution to the optimal transportation problem equation
$\hat{P}$ is obtained as follows. Based on the image-text similarity matrix $\hat{S}$ obtained by the teacher network, the labels $Z$ and the coupling $P$, and on the premise that $P\in\Pi(a,b)$, the marginals of $P$ can be constrained by
$$\sum_{j=1}^{B}P_{ij}=a_{i},\qquad \sum_{i=1}^{B}P_{ij}=b_{j},$$
wherein $a_{i}$ indicates how much text the $i$-th image should be assigned and $b_{j}$ indicates how many images the $j$-th text should be assigned. In the absence of more prior information, each image and each text is considered to have the same assignment weight, namely:
$$a=b=\frac{1}{B}\mathbf{1}_{B}.$$
Thus the optimal solution $\hat{P}\in\mathbb{R}^{B\times B}$ can be obtained. The optimal solution $\hat{P}$ is multiplied by $B$ so that each row and each column is a probability distribution, and the guidance of $\hat{P}$ for $S$ is realized using the following formula:
$$\mathcal{L}_{kl}\big(\hat{P},S\big)=KL\Big(B\hat{P}\ \Big\|\ \mathrm{softmax}\big(S/\tau_{o}\big)\Big),$$
wherein $\mathcal{L}_{kl}$ represents the relative entropy loss, $KL(\cdot\,\|\,\cdot)$ indicates the KL divergence between the two distributions, $S$ is the similarity matrix calculated from the image and text semantic features of the student network, softmax refers to the softmax function, and $\tau_{o}$ represents the temperature parameter in the optimal transport. The optimal transport self-distillation loss is then expressed as follows:
$$\mathcal{L}_{otsd}=\mathcal{L}_{kl}\big(\hat{P}^{p},S^{p}\big)+\mathcal{L}_{kl}\big(\hat{P}^{g},S^{g}\big),$$
wherein $\mathcal{L}_{otsd}$ indicates the optimal transport self-distillation loss, $S^{p}$ and $\hat{P}^{p}$ respectively represent the similarity matrix between the image and text semantic features and the corresponding optimal transport solution matrix, and $S^{g}$ and $\hat{P}^{g}$ respectively represent the similarity matrix between the image and text global features and the corresponding optimal transport solution matrix.
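A sketch of the optimal-transport self-distillation step under the reconstruction above: uniform marginals a = b = 1/B, a benefit matrix built from Z and the teacher similarity, and a Sinkhorn-style entropic solver standing in for the exact OT solver (the patent does not fix a particular solver). The plan is scaled by B and used to guide the student similarity through a KL term.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def solve_transport(benefit: torch.Tensor, n_iters: int = 50, eps: float = 0.05):
    """Entropic (Sinkhorn) approximation of the plan maximizing the expected benefit
    with uniform marginals; rows and columns of the result sum to 1/B."""
    B = benefit.size(0)
    K = ((benefit - benefit.max()) / eps).exp()     # shifted for numerical stability
    a = torch.full((B,), 1.0 / B, device=benefit.device)
    u = torch.ones_like(a)
    v = torch.ones_like(a)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = a / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)      # transport plan P_hat

def ot_self_distillation(sim_student: torch.Tensor, sim_teacher: torch.Tensor,
                         labels: torch.Tensor, tau_o: float = 0.1) -> torch.Tensor:
    """labels: (B, B) pairwise label matrix Z; sim_*: (B, B) similarity matrices."""
    plan = solve_transport(sim_teacher.detach() + labels)   # P_hat from Z and the teacher similarity
    target = plan * plan.size(0)                            # multiply by B: each row is a distribution
    log_p = (sim_student / tau_o).log_softmax(dim=1)
    return F.kl_div(log_p, target, reduction='batchmean')
```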
In model training, the total loss function of the entire network model is expressed as:
$$\mathcal{L}=\mathcal{L}^{p}_{trip}+\mathcal{L}^{g}_{trip}+\lambda_{1}\mathcal{L}_{div}+\lambda_{2}\mathcal{L}_{spa}+\lambda_{3}\mathcal{L}_{dis}+\lambda_{4}\mathcal{L}_{otsd},$$
wherein $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ and $\lambda_{4}$ are hyperparameters, $\mathcal{L}^{p}_{trip}$ represents the triplet loss under the similarity of the image and text semantic features, $\mathcal{L}^{g}_{trip}$ represents the triplet loss under the similarity of the image and text global features, $\mathcal{L}_{div}$ represents the diversity regularization loss, $\mathcal{L}_{spa}$ represents the sparsity loss, $\mathcal{L}_{dis}$ indicates the distillation loss, and $\mathcal{L}_{otsd}$ indicates the optimal transport self-distillation loss. In this embodiment, the dimension $d$ is set to 1024, the number of shared semantic prototypes $K$ to 20, the number of semantic decoder layers $L$ to 2, and the boundary parameter of the triplet loss to 0.2; the temperature parameters $\tau_{s}$, $\tau_{t}$ and $\tau_{o}$ are set to 0.2, 0.1, respectively, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ and $\lambda_{4}$ are set to 0.2, 0.1, 2.0 and 1.0, respectively.
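A sketch of combining the individual terms into the total objective with the hyperparameter values of this embodiment; the individual loss values are assumed to come from functions like the sketches above.

```python
def total_loss(l_trip_sem, l_trip_glo, l_div, l_spa, l_dis, l_otsd,
               lambdas=(0.2, 0.1, 2.0, 1.0)):
    """L = L_trip^p + L_trip^g + l1*L_div + l2*L_spa + l3*L_dis + l4*L_otsd."""
    l1, l2, l3, l4 = lambdas
    return l_trip_sem + l_trip_glo + l1 * l_div + l2 * l_spa + l3 * l_dis + l4 * l_otsd
```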
The above process is iterated until all image and text data participating in training have been input into the network model, and the network model parameters are adjusted through back propagation, wherein the parameters of the teacher network do not participate in back propagation for gradient update; the teacher network and the student network guide each other's learning while the loss is minimized, so that the true similarity relationship between images and texts is learned.
Example 2
The embodiment provides an image-text retrieval system based on cross-modal cross guidance, which comprises a data preprocessing module, a cross-modal cross-guidance network and a loss function module.
The data preprocessing module is used to process the image or text data to be retrieved as the input of the teacher network and the student network.
The cross-modal cross-guidance network comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module and are used for processing input image data, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module and are used for processing input text data, the structure of the teacher network is the same as that of the student network, and the teacher network and the student network perform cross-modal cross-guidance.
The loss function module is used for calculating triplet loss, diversity regularization loss, sparsity loss, distillation loss and optimal transportation self-distillation loss.
The functional implementation and data processing procedure of each module can be found in the description of embodiment 1 and are not repeated here.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed, and that various changes, modifications, additions and substitutions can be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (9)

1. The image-text retrieval method based on cross-modal cross guidance is characterized by comprising the following steps of:
S1, inputting image data and text data of a batch;
S2, performing feature extraction and shared semantic learning on the two different modalities of data, images and texts, by using a cross-modal cross-guidance network model constructed based on a self-distillation algorithm, and completing the training of the model:
the cross-modal cross-guidance network model comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module, and the teacher network has the same structure as the student network, so that the teacher network has the same images and text branches and corresponding modules;
during training, the student network and the teacher network firstly conduct feature extraction on the image and text data to obtain local features of the image and the text; secondly, respectively inputting the local features of the image and the text to corresponding semantic decoders, extracting the semantic features of the corresponding image and the text from the local features of the image and the text through a learnable shared semantic prototype, and calculating the similarity between the semantic features of the image and the text; then, semantic features of the image and the text are processed through a self-attention module respectively to obtain global features of the image and the text, and similarity between the global features of the image and the text is calculated; finally, according to the calculated similarity, the distribution of the teacher network is used as the real distribution to guide the distribution of the student network;
S3, inputting the images or texts to be searched into a trained cross-modal cross-guidance network model to extract corresponding features, calculating the similarity of the images or texts to be searched by using the extracted features, and searching according to the similarity score to obtain an optimal search result.
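As a minimal sketch of the retrieval step S3, assuming the trained model exposes image and text encoders and that similarity is taken as a cosine score over the extracted global features (method names such as encode_image and encode_text are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_image, candidate_texts, model, top_k=5):
    """Encode one image query and a gallery of texts, then rank texts by similarity."""
    img_feat = model.encode_image(query_image)        # (1, d) global image feature
    txt_feat = model.encode_text(candidate_texts)     # (n, d) global text features
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    scores = img_feat @ txt_feat.t()                  # cosine similarity scores, shape (1, n)
    return scores.topk(top_k, dim=-1).indices         # indices of the best-matching texts
```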
2. The image-text retrieval method based on cross-modal cross-guidance according to claim 1, wherein the cross-modal cross-guidance network model is characterized in that when training, the input image and text data of a batch are subjected to data enhancement and then are respectively input into a student network and a teacher network for training; the teacher network and the student network have the same structure, and the teacher model and the student model guide learning mutually in the training process so as to achieve a better parameter fitting effect; in the verification stage, the model can accurately extract the image and text characteristics, and match and search the corresponding images and texts; the training comprises the following specific steps:
S21, extracting local features:
The image branch and the text branch extract region-level features of the image and word-level features of the text through the image encoder and the text encoder, respectively, obtaining the image local features $V$ and the text local features $T$;
S22, cross-modal sharing semantic learning:
designs a group of learnable shared semantic sourceA kind of electronic device with a display unit
Figure QLYQS_3
Capturing the semantics of the alignment of the image and the text by means of a semantic decoder structure, resulting in the image semantic features +.>
Figure QLYQS_4
And text semantic feature->
Figure QLYQS_5
And calculates the similarity, expressed as +.>
Figure QLYQS_6
S23, self-attention processing:
image semantic features output in step S22
Figure QLYQS_7
And text semantic feature->
Figure QLYQS_8
Respectively performing self-attention processing, and using sparsity loss to restrict attention weights obtained by image semantic features and text semantic features through a self-attention module to obtain image global features ∈ ->
Figure QLYQS_9
And text Global feature->
Figure QLYQS_10
At the same time, the global feature +.>
Figure QLYQS_11
And text Global feature->
Figure QLYQS_12
Again calculate the similarity, expressed as +.>
Figure QLYQS_13
S24, teacher and student network cross guidance:
image semantic features
Figure QLYQS_14
And text semantic feature->
Figure QLYQS_15
Obtaining global image features through self-attention modules respectively
Figure QLYQS_16
And text Global feature->
Figure QLYQS_17
For matched image text pairs, the aligned global features should also pay attention to aligned local semantics, cross-modal cross guidance can be performed by using relative entropy loss, similarity between images and texts of teacher and student networks is obtained through calculation according to S22 and S23, and then distribution of the teacher network is used as real distribution to guide distribution of the student networks;
The above process is iterated until all image and text data participating in training have been input into the network model, and the network parameters are adjusted through back propagation; the parameters of the teacher network do not receive gradient updates through back propagation. The teacher network and the student network guide each other's learning, and as the loss is driven to its minimum the model learns the true similarity relationship between images and texts.
3. The image-text retrieval method based on cross-modal cross guidance according to claim 2, further comprising a step S25 of applying an optimal transportation algorithm, specifically as follows: the teacher network and the student network each measure the similarity between the image data and the text data and output a pair of similarity matrices $S^{stu}$ and $S^{tea}$; the similarity matrix $S^{stu}$ calculated by the student network is used to compute the triplet loss that optimizes the student network, while the similarity matrix $S^{tea}$ calculated by the teacher network is modeled as an optimal transportation problem, which is solved to obtain the optimal solution $Q$ of the optimal transportation problem, giving the most accurate image-text matching relationship; the optimal-transport self-distillation loss is then calculated from $Q$ and $S^{stu}$, realizing the guidance of the student network.
4. The image-text retrieval method based on cross-modal cross guidance according to claim 2, wherein the specific steps of cross-modal shared semantic learning in step S22 are as follows:

Given the image local features $V$ and the text local features $T$, the aligned semantics between them are captured through learning with a shared semantic prototype and a semantic decoder; the shared semantic prototype is a set of learnable vectors that is randomly initialized and shared among the data of different modalities, and is defined as:

$$C = \{c_1, c_2, \dots, c_K\}$$

where $C$ denotes all the shared semantic prototypes, $c_k$ denotes the $k$-th shared semantic prototype, and $K$ denotes the number of shared semantic prototypes; the shared semantic prototypes and the local features of one modality are taken as inputs to the semantic decoder; taking the image branch as an example, the image local features $V$ and the shared semantic prototypes $C$ pass through the image semantic decoder to obtain the image semantic features $V_{sem}$; the semantic decoder is a stack of $N$ identical attention layers, and in the $l$-th layer the output of the previous layer $V_{sem}^{(l-1)}$ together with the shared semantic prototypes $C$ attends to the image local features $V$ through a multi-head attention mechanism, capturing specific image semantics and outputting the image semantic features $V_{sem}^{(l)}$ updated at the current layer together with the attention weight matrix $A^{(l)}$; finally, the output of the whole image semantic decoder $V_{sem}$ is obtained;

Similarly, the text semantic features $T_{sem}$ of the text branch are obtained through the text semantic decoder; then the triplet loss based on hard negative mining is used to optimize the similarity matrix $S_{sem}$; a diversity regularization loss is imposed on the attention weights $A^{(N)}$ of the last layer, so that the diversity regularization losses of the image branch and the text branch are obtained.
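One way to realize the semantic decoder just described, with the shared prototypes attending to the local features of one modality through $N$ stacked multi-head attention layers, is sketched below. The layer internals (how the previous output and the prototypes are combined into the query, residual connections, normalization) are not fixed by the claim and are assumptions; identifiers are illustrative.

```python
import torch
import torch.nn as nn

class SemanticDecoder(nn.Module):
    """N stacked attention layers in which the shared prototypes attend to local features."""
    def __init__(self, dim=1024, num_layers=2, num_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, prototypes, local_feats):
        # prototypes: (K, dim) shared semantic prototypes C; local_feats: (B, n, dim)
        B = local_feats.size(0)
        sem = prototypes.unsqueeze(0).expand(B, -1, -1)         # start from the shared prototypes
        attn = None
        for layer in self.layers:
            query = sem + prototypes                            # previous output combined with C (assumed sum)
            sem, attn = layer(query, local_feats, local_feats)  # queries attend to the local features
        return sem, attn                                        # V_sem (or T_sem) and last-layer weights A^(N)

# The prototypes are created once and shared by the image and text decoders:
shared_prototypes = nn.Parameter(torch.randn(20, 1024))         # K = 20 prototypes of dimension d = 1024
image_decoder, text_decoder = SemanticDecoder(), SemanticDecoder()
```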
5. The image-text retrieval method based on cross-modal cross guidance according to claim 2, wherein the specific steps of the self-attention processing in step S23 are as follows: the image semantic features $V_{sem}$ and the text semantic features $T_{sem}$ output in step S22 are input to the self-attention module, respectively, to further learn and align the cross-modal semantics; for the image semantic features $V_{sem}$, the processing by the self-attention module is expressed by the following formulas:
$$\bar{v} = \frac{1}{K}\sum_{i=1}^{K} v_i$$

$$\hat{v}_i = \mathrm{L2Norm}(v_i),\quad i = 1,\dots,K$$

$$w_i = \mathrm{softmax}_i\!\left(\frac{1}{\tau_s}\sum_{j=1}^{K}\hat{v}_i^{\top}\hat{v}_j\right)$$

$$v_{glo} = \sum_{i=1}^{K} w_i\, v_i$$
where $\tau_s$ denotes the temperature parameter of the student network, $v_i$ denotes the $i$-th image semantic feature, $K$ is the number of image semantic features, $\bar{v}$ and $v_{glo}$ denote the average-merged image global feature and the attention-weighted image global feature, respectively, L2Norm denotes L2 normalization, $\hat{v}_i$ denotes the result of L2-normalizing the $i$-th image semantic feature, $\hat{v}_j$ denotes the result of L2-normalizing the $j$-th image semantic feature, $w$ denotes the weights of the image semantic features produced by the self-attention module, and $w_i$ denotes the weight of the $i$-th image semantic feature produced by the self-attention module, with $K$ weights in total; softmax refers to the softmax function; likewise, for the text semantic features $T_{sem}$, the average-merged text global feature $\bar{t}$ and the attention-weighted text global feature $t_{glo}$ are obtained.
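A compact sketch of this self-attention pooling for one modality, assuming the weights are obtained by softmax over the aggregated pairwise similarities of the L2-normalised semantic features scaled by the student temperature (this exact weighting is an assumption consistent with the symbols described above; identifiers are illustrative):

```python
import torch
import torch.nn.functional as F

def self_attention_pool(sem_feats: torch.Tensor, tau_s: float):
    """sem_feats: (B, K, d) semantic features of one modality (image or text).
    Returns the average-merged global feature, the attention-weighted global feature
    and the attention weights w that the sparsity loss constrains."""
    avg_glo = sem_feats.mean(dim=1)                                # average-merged global feature
    normed = F.normalize(sem_feats, dim=-1)                        # L2-normalised semantic features
    pair_sim = normed @ normed.transpose(1, 2)                     # (B, K, K) pairwise similarities
    weights = torch.softmax(pair_sim.sum(dim=-1) / tau_s, dim=-1)  # (B, K) attention weights
    attn_glo = (weights.unsqueeze(-1) * sem_feats).sum(dim=1)      # attention-weighted global feature
    return avg_glo, attn_glo, weights
```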
6. The image-text retrieval method based on cross-modal cross guidance according to claim 2, wherein the specific steps of the teacher-student network cross guidance in step S24 are as follows: according to steps S22 and S23, the similarity between the images and texts of the teacher and student networks is computed, and the distribution of the teacher network is then used as the real distribution to guide the distribution of the student network; for matched image-text pairs, cross-modal guidance is performed using the distillation loss:
$$\mathcal{L}_{dis} = \mathrm{KL}\big(p_{t}^{tea}\,\|\,q_{v}^{stu}\big) + \mathrm{KL}\big(p_{v}^{tea}\,\|\,q_{t}^{stu}\big)$$

$$p_{t,i}^{tea} = \mathrm{softmax}_i\!\left(\frac{1}{\tau_t}\sum_{j=1}^{K}(\hat{t}_i^{tea})^{\top}\hat{t}_j^{tea}\right)$$

$$p_{v,i}^{tea} = \mathrm{softmax}_i\!\left(\frac{1}{\tau_t}\sum_{j=1}^{K}(\hat{v}_i^{tea})^{\top}\hat{v}_j^{tea}\right)$$

$$q_{v,i}^{stu} = \mathrm{softmax}_i\!\left(\frac{1}{\tau_s}\sum_{j=1}^{K}(\hat{v}_i^{stu})^{\top}\hat{v}_j^{stu}\right)$$

$$q_{t,i}^{stu} = \mathrm{softmax}_i\!\left(\frac{1}{\tau_s}\sum_{j=1}^{K}(\hat{t}_i^{stu})^{\top}\hat{t}_j^{stu}\right)$$
where $\mathcal{L}_{dis}$ denotes the distillation loss; $\mathrm{KL}(p_{t}^{tea}\,\|\,q_{v}^{stu})$ denotes the KL divergence between the real text distribution $p_{t}^{tea}$ in the teacher network and the estimated image distribution $q_{v}^{stu}$ in the student network; $\mathrm{KL}(p_{v}^{tea}\,\|\,q_{t}^{stu})$ denotes the KL divergence between the real image distribution $p_{v}^{tea}$ in the teacher network and the estimated text distribution $q_{t}^{stu}$ in the student network, the teacher's distribution being used as the real distribution to guide the student's distribution; $\hat{v}_i^{tea}$ denotes the L2-normalized $i$-th image semantic feature in the teacher network, $\hat{t}_i^{tea}$ denotes the L2-normalized $i$-th text semantic feature in the teacher network, $\hat{v}_i^{stu}$ denotes the L2-normalized $i$-th image semantic feature in the student network, $\hat{t}_i^{stu}$ denotes the L2-normalized $i$-th text semantic feature in the student network, and $\tau_t$ denotes the temperature parameter of the teacher network, whose value is greater than the temperature parameter $\tau_s$ of the student network.
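A sketch of this cross guidance, assuming (as in the formulas above) that each distribution is the softmax over the temperature-scaled self-similarities of the L2-normalised semantic features of one branch; all identifiers are illustrative, and $\tau_t$ must be chosen larger than $\tau_s$:

```python
import torch
import torch.nn.functional as F

def branch_distribution(sem_feats: torch.Tensor, tau: float) -> torch.Tensor:
    """(B, K, d) semantic features -> (B, K) distribution over the K shared semantics."""
    normed = F.normalize(sem_feats, dim=-1)
    logits = (normed @ normed.transpose(1, 2)).sum(dim=-1) / tau
    return torch.softmax(logits, dim=-1)

def distillation_loss(v_tea, t_tea, v_stu, t_stu, tau_t, tau_s):
    """L_dis = KL(teacher text || student image) + KL(teacher image || student text)."""
    p_t = branch_distribution(t_tea, tau_t).detach()   # teacher text distribution (real)
    p_v = branch_distribution(v_tea, tau_t).detach()   # teacher image distribution (real)
    q_v = branch_distribution(v_stu, tau_s)            # student image distribution (estimated)
    q_t = branch_distribution(t_stu, tau_s)            # student text distribution (estimated)
    def kl(p, q):
        return (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(-1).mean()
    return kl(p_t, q_v) + kl(p_v, q_t)
```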
7. The image-text retrieval method based on cross-modal cross guidance according to claim 3, wherein the optimal-transport-based self-distillation of step S25 proceeds as follows: first, a pairwise label matrix $Z \in \mathbb{R}^{B \times B}$ is assigned, where $B \times B$ is the size of one batch of input images and texts; from $Z$ and the similarity matrix $S$, an optimal transportation problem is constructed using the following formula:
$$\mathrm{OT}(\mu,\nu) = \sup_{\gamma\in\Pi(\mu,\nu)} \mathbb{E}_{(x,y)\sim\gamma}\big[Z(x,y)\,S(x,y)\big] = \max_{Q\in\Pi(\mu,\nu)} \sum_{i=1}^{B}\sum_{j=1}^{B} Q_{ij}\, Z_{ij}\, S_{ij}$$
where $\mathrm{OT}(\mu,\nu)$ denotes the optimal transportation problem; sup denotes the supremum (least upper bound); $\mu$ and $\nu$ denote two probability distributions; $\Pi(\mu,\nu)$ denotes the set of all joint probability distributions from $\mu$ to $\nu$; $\gamma$ is a joint probability distribution of $\mu$ and $\nu$; $x$ and $y$ denote elements of the two sets, images and texts, and $S$ is the similarity matrix between the two; the optimal transportation problem aims to find a joint probability distribution $\gamma$ whose marginal distributions are $\mu$ and $\nu$ and whose expected benefit $\mathbb{E}_{(x,y)\sim\gamma}[\,\cdot\,]$ is maximal, so $\sup_{\gamma\in\Pi(\mu,\nu)}$ denotes maximizing the expected benefit over $\gamma$ to find the optimal solution; max denotes maximization, with $\Pi(\mu,\nu)$ the set of joint probability distributions from $\mu$ to $\nu$ and $\mu$, $\nu$ acting as the weight vectors; $Z_{ij}$ denotes the label corresponding to the $i$-th image and the $j$-th text, $S_{ij}$ denotes the similarity value corresponding to the $i$-th image and the $j$-th text, and the numbers of images and texts in one batch are both $B$;
The optimal solution $Q$ of the optimal transportation problem is obtained by solving the above formula, and the optimal-transport self-distillation loss is then expressed as follows:

$$\mathcal{L}_{ot} = \mathcal{L}_{KL}(Q_{sem}, S_{sem}) + \mathcal{L}_{KL}(Q_{glo}, S_{glo})$$

where $\mathcal{L}_{ot}$ denotes the optimal-transport self-distillation loss, $\mathcal{L}_{KL}$ denotes the relative-entropy loss, $S_{sem}$ and $Q_{sem}$ denote the similarity matrix between the image and text semantic features and the corresponding optimal-transport solution matrix, respectively, and $S_{glo}$ and $Q_{glo}$ denote the similarity matrix between the image and text global features and the corresponding optimal-transport solution matrix, respectively.
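The claim does not fix a particular solver for the transportation problem; a common choice is the entropy-regularised Sinkhorn iteration, sketched below under the assumptions of uniform marginals $\mu = \nu = \mathbf{1}/B$ and a benefit matrix built from $Z$ and $S$ (the regularisation strength and iteration count are illustrative):

```python
import torch

def sinkhorn_plan(Z: torch.Tensor, S: torch.Tensor, eps: float = 0.05, iters: int = 50):
    """Approximate the transport plan Q maximising <Q, Z * S> with uniform marginals."""
    B = S.size(0)
    benefit = Z * S                                   # benefit of matching image i with text j
    K = torch.exp(benefit / eps)                      # Gibbs kernel (maximisation form)
    r = torch.full((B,), 1.0 / B, device=S.device)    # row marginal mu
    c = torch.full((B,), 1.0 / B, device=S.device)    # column marginal nu
    u = torch.ones(B, device=S.device)
    v = torch.ones(B, device=S.device)
    for _ in range(iters):                            # alternate scaling to satisfy the marginals
        u = r / (K @ v).clamp_min(1e-8)
        v = c / (K.t() @ u).clamp_min(1e-8)
    return torch.diag(u) @ K @ torch.diag(v)          # transport plan Q
```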
8. The image-text retrieval method based on cross-modal cross guidance according to claim 7, wherein the total loss function of the whole network model is expressed as:

$$\mathcal{L}_{total} = \mathcal{L}_{tri}^{sem} + \mathcal{L}_{tri}^{glo} + \lambda_{1}\mathcal{L}_{div} + \lambda_{2}\mathcal{L}_{spa} + \lambda_{3}\mathcal{L}_{dis} + \lambda_{4}\mathcal{L}_{ot}$$

where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ and $\lambda_{4}$ are hyper-parameters; $\mathcal{L}_{tri}^{sem}$ denotes the triplet loss under the similarity of the image and text semantic features; $\mathcal{L}_{tri}^{glo}$ denotes the triplet loss under the similarity of the image and text global features; $\mathcal{L}_{div}$ denotes the diversity regularization loss; $\mathcal{L}_{spa}$ denotes the sparsity loss; $\mathcal{L}_{dis}$ denotes the distillation loss; and $\mathcal{L}_{ot}$ denotes the optimal-transport self-distillation loss.
9. An image-text retrieval system based on cross-modal cross guidance, characterized by comprising a data preprocessing module, a cross-modal cross-guidance network and a loss function module, wherein the data preprocessing module preprocesses the image or text data to be retrieved as the input of the teacher network and the student network; the cross-modal cross-guidance network comprises a teacher network and a student network, the student network comprises two branches, an image branch and a text branch, the image branch comprises an image encoder, an image semantic decoder and an image self-attention module and is used for processing the input image data, the text branch comprises a text encoder, a text semantic decoder and a text self-attention module and is used for processing the input text data, the structure of the teacher network is the same as that of the student network, and the teacher network and the student network perform cross-modal cross guidance; the loss function module is used for calculating the triplet loss, the diversity regularization loss, the sparsity loss, the distillation loss and the optimal-transport self-distillation loss.