CN116186317B

CN116186317B - Cross-modal cross-guidance-based image-text retrieval method and system

Info

Publication number: CN116186317B
Application number: CN202310436332.3A
Authority: CN
Inventors: 丁运来; 董军宇; 李岳尊; 于佳傲
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2023-04-23
Filing date: 2023-04-23
Publication date: 2023-06-30
Anticipated expiration: 2043-04-23
Also published as: CN116186317A

Abstract

The invention belongs to the technical field of artificial intelligence, and discloses a cross-modal cross-guidance-based image-text retrieval method and a cross-modal cross-guidance-based image-text retrieval system, wherein the cross-modal cross-guidance-based image-text retrieval method comprises the following steps of: inputting image data and text data; performing feature extraction and shared semantic learning of two different mode data of images and texts by using a cross-mode cross-guidance network model constructed based on a self-distillation algorithm to complete training of the model, wherein the cross-mode cross-guidance network model comprises a teacher network and a student network, the student network comprises two branches of images and texts, the teacher network has the same structure as the student network, and cross-mode cross-guidance is performed between the teacher network and the student network; finally, inputting the images or texts to be searched into a trained cross-modal cross-guidance network model to extract corresponding features, calculating the similarity of the images or texts to be searched, and searching according to the similarity score to obtain an optimal searching result; by the method and the device, cross-modal semantic alignment is realized, and retrieval accuracy is improved.

Description

Cross-modal cross-guidance-based image-text retrieval method and system

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a cross-modal cross-guidance-based image-text retrieval method and system.

Background

Cross-modal retrieval is a task of retrieving between different types of data (e.g., images, text, audio, etc.); the main challenge is how to span the "heterogeneous gap" between different modality data, i.e. how to understand and establish the relationships between different types of data.

Cross-modal teletext retrieval is a subtask of cross-modal retrieval, which is mainly focused on retrieving between images and text. Early methods used hash retrieval by learning the hash of images and text modalities and mapping them into a hamming binarized space. This search is fast, but loses accuracy in the binarization process, and does not fully mine the relationships between modalities.

With the development of computer vision and natural language processing tasks, especially the advent of deep learning models such as fast-RCNN (a two-stage object detection model) and transducer (a deep learning model of self-attention mechanism), image and text features can be extracted more finely. This provides new possibilities for cross-modal teletext retrieval, hopefully solving the problems existing in earlier approaches.

The existing cross-modal image-text retrieval method can be mainly divided into one-to-one matching and many-to-many matching. One-to-one matching is also called visual semantic embedding (Visual Semantic Embedding), and the main steps are to extract features of image text first using a feature extractor, then to contextualize and aggregate the extracted features using a feature aggregator, and finally to map them to the same joint embedding space and measure their matching score using cosine similarity. The method has the advantages that the image and text features can be extracted in parallel, and then the features are saved offline for retrieval, however, the method lacks interaction between modes, and the retrieval accuracy is poor due to the fact that the similarity is calculated only by means of embedding obtained by the last layer. The method mainly comprises the steps of firstly extracting features of images and texts to obtain segment-level feature representations, such as regions of the images and words described by the texts, and then carrying out feature processing of image-text segments by combining an attention mechanism to obtain hidden layer representations. In the process, the image-text characteristics are interacted and fused, so that the hidden layer can learn a function for measuring the cross-modal similarity. This method first calculates the similarity of the local representations and then integrates to obtain the overall similarity. However, cross-modal alignment cannot be achieved at a higher semantic level, and relying solely on cross-attention between image regions and text words can also result in a significant amount of computation and mismatching. The method of the present invention effectively solves these problems.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a graph-text retrieval method and a system based on cross-modal cross-guidance, which utilize two modes of images and texts to share a semantic prototype and a modal specific semantic decoder, can effectively capture cross-modal local semantics, realize cross-modal cross-guidance through a teacher and student network, and realize fine granularity alignment between images and texts; in addition, the invention provides a plug-and-play self-distillation method based on optimal transportation, which relieves the defect of label by multi-mode data integration and realizes accurate matching of images and texts; extensive experiments on different mainstream image text retrieval benchmarks obtain remarkable performance improvement, and prove the effectiveness of the invention.

In order to solve the technical problems, the invention adopts the following technical scheme:

the invention provides a graph-text retrieval method based on cross-modal cross guidance, which comprises the following steps:

s1, inputting image data and text data of a batch;

s2, performing feature extraction and shared semantic learning of two different mode data of images and texts by using a cross-mode cross-guidance network model constructed based on a self-distillation algorithm, and completing training of the model:

The cross-modal cross-guidance network model comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module, and the teacher network has the same structure as the student network, so that the teacher network has the same images and text branches and corresponding modules;

during training, the student network and the teacher network firstly conduct feature extraction on the image and text data to obtain local features of the image and the text; secondly, respectively inputting the local features of the image and the text to corresponding semantic decoders, extracting the semantic features of the corresponding image and the text from the local features of the image and the text through a learnable shared semantic prototype, and calculating the similarity between the semantic features of the image and the text; then, semantic features of the image and the text are processed through a self-attention module respectively to obtain global features of the image and the text, and similarity between the global features of the image and the text is calculated; finally, according to the calculated similarity, the distribution of the teacher network is used as the real distribution to guide the distribution of the student network;

S3, inputting the images or texts to be searched into a trained cross-modal cross-guidance network model to extract corresponding features, calculating the similarity of the images or texts to be searched by using the extracted features, and searching according to the similarity score to obtain an optimal search result.

Further, when the cross-modal cross-guidance network model is trained, the input image and text data of one batch are subjected to data enhancement and then are respectively input into a student network and a teacher network for training; the teacher network and the student network have the same structure, and the teacher model and the student model guide learning mutually in the training process so as to achieve a better parameter fitting effect; in the verification stage, the model can accurately extract the image and text characteristics, and match and search the corresponding images and texts; the training comprises the following specific steps:

s21, extracting local features:

the image branch and the text branch extract the characteristics of the regional level of the image and the characteristics of the word level of the text through an image encoder and a text encoder respectively to obtain the local characteristics of the image

And text local feature->

；

S22, cross-modal sharing semantic learning:

a set of learnable shared semantic prototypes is designed

Capturing the semantics of the alignment of the image and the text by means of a semantic decoder structure, resulting in the image semantic features +.>

And text semantic feature->

And calculates the similarity, expressed as +.>

；

S23, self-attention processing:

image semantic features output in step S22

And text semantic feature->

Respectively performing self-attention processing, and using sparsity loss to restrict attention weights obtained by image semantic features and text semantic features through a self-attention module to obtain image global features ∈ ->

And text Global feature->

At the same time, the global feature +.>

And text Global feature->

Again calculate the similarity, tableShown as +.>

；

S24, teacher and student network cross guidance:

image semantic features

And text semantic feature->

Obtaining global image features through self-attention modules respectively

And text Global feature->

For matched image text pairs, the aligned global features should also pay attention to aligned local semantics, cross-modal cross guidance can be performed by using relative entropy loss, similarity between images and texts of teacher and student networks is obtained through calculation according to S22 and S23, and then distribution of the teacher network is used as real distribution to guide distribution of the student networks;

And iterating the process until all the images and text data which participate in training are input into the network model, and adjusting network model parameters through back propagation, wherein the parameters of the teacher network do not participate in back propagation for gradient update, the teacher network and the student network conduct mutual guidance learning, and meanwhile loss is reduced to the minimum, and the similarity relation between the real images and texts is learned.

Further, the image-text retrieval method based on cross-modal cross guidance further comprises a step of applying an optimal transportation algorithm in step S25, and the method specifically comprises the following steps: the teacher network and the student network measure the similarity of two data of the image and the text respectively, and output two paired similarity matrixes

、/>

The method comprises the steps of carrying out a first treatment on the surface of the Similarity matrix calculated by student network>

For calculating the triple loss to optimize the student network, and calculating the similarity matrix of teacher network>

Modeling into an optimal transportation problem, and solving to obtain an optimal solution of the optimal transportation problem>

Obtaining the most accurate matching relation of the image text by +.>

And (3) with

And calculating the optimal transportation self-distillation loss, and realizing the guidance of the student network.

Further, the specific steps of cross-modal sharing semantic learning in step S22 are as follows:

Given image local features

And text local feature->

Pairs Ji Yuyi between them are learning captured using a shared semantic prototype, which is a set of vectors that can be learned, and a semantic decoder, which is randomly initialized and shared among data of different modalities, the shared semantic prototype being defined as:

；

wherein the method comprises the steps of

Representing all shared semantic prototypes ∈ ->

Indicate->

Personal shared semantic prototype->

Representing the number of shared semantic prototypes; sharing semantic prototypes and local features of one modality as inputs to a semantic decoder, image branching for example, image local features +.>

And shared semantic prototype->

Obtaining the image semantic feature after passing through the image semantic decoder>

A semantic decoder consists of +.>

The same attention layer is stacked, in +.>

In a layer, by outputting the previous layer

And shared semantic prototype->

Together, the local features of the image are focused by a multi-head focusing mechanism>

Capturing a specific image semantic feature and outputting the image semantic feature updated in the current time period +.>

And attention weight matrix->

Finally, the output of the whole image semantic decoder is obtained>

；

Similarly, text semantic features of the text branches obtained by the text semantic decoder can be obtained

The method comprises the steps of carrying out a first treatment on the surface of the Then, the triple loss based on hard negative-sample mining is used to optimize the similarity matrix +.>

，

Attention weighting at last layer

The diversity regularization loss is introduced, so that the diversity regularization loss of the image branches and the diversity regularization loss of the text branches can be obtained.

Further, the specific steps of the self-attention processing in step S23 are as follows: image semantic features output in step S22

And text semantic feature->

Input to the self-attention module, respectively, to further learn and align cross-modal semantics; for image semantic features->

The process through the self-attention module is expressed by the following formula:

；

；

；

；

wherein the method comprises the steps of

Temperature parameter representing student network, ++>

Indicate->

Individual image semantic features,/->

For the number of semantic features of an image, +.>

And->

Respectively representing the image global feature after average merging and the attention weighted image global feature, L2Norm represents L2 normalization, +.>

Representation->

The result of the normalization of the individual image semantic features L2,

indicate->

Results of normalization of individual image semantic features L2, < >>

Self-attention module for representing semantic features of imageProcessing the obtained weight, ++>

Indicate->

Weights obtained by processing semantic features of images through self-attention modules and sharing

Individual, softmax refers to the softmax function; likewise, for text semantic features +.>

The text global feature +.>

And text global feature after attention weighting +.>

。

Further, the specific steps of the teacher and student network cross guidance in step S24 are as follows: according to the steps S22 and S23, calculating to obtain the similarity between images and texts of teacher and student networks, and then using the distribution of the teacher network as the real distribution to guide the distribution of the student networks; for matched pairs of image text, the aligned global features should also be noted for aligned local semantics as well, and cross-modal guidance can be performed using distillation loss:

；

；

；

；

；

wherein the method comprises the steps of

Indicating loss of distillation->

Representing the true distribution of text in a teacher network +.>

And image estimation distribution in student network +.>

KL divergence between->

Representing true distribution of images in teacher networks

And text estimation distribution in student network +.>

KL divergence between, using teacher's distribution as the true distribution to guide student's distribution,/->

Representing +.>

The result of the L2 normalization of the individual image semantic features,

representing +.>

Results of L2 normalization of individual text semantic features,/- >

Representing in a student network

Results of L2 normalization of individual image semantic features,/->

Representing +.>

Results of L2 normalization of individual text semantic features,/->

Representing a temperature parameter of the teacher network with a value greater than the temperature parameter of the student network>

。

Further, the self-distillation step based on the optimal transportation of step S25 is as follows: first of all, pairs of labels are assigned

B x B is the size of the input image and text of a batch, by Z and +.>

Constructing an optimal transportation problem using the formula:

；

wherein the method comprises the steps of

Representing optimal transport problems, sup represents the upper bound, < +.>

、/>

Representing two probabilitiesDistribution of->

Representing from->

To->

Is a set of all joint probability distributions, +.>

Is->

Is a joint probability distribution of +.>

Two aggregate elements representing image and text +.>

A similarity matrix between the two; the optimal transport problem aims at finding a joint probability distribution +.>

So that the edge distribution thereof is +.>

And->

And expected benefit +.>

Maximum, so->

Representing maximize +.>

Desired benefit->

To find the optimal solution; max represents maximization,/->

Representing the slave->

To->

Is a set of joint probability distributions ∈ >

And->

Representing weight vector, ++>

Represents +.>

Image and->

Labels corresponding to the individual texts->

Representing +.>

Image and->

The corresponding similarity value of the texts, and the sizes of the images and the texts of one batch are B;

optimal solution for solving the optimal transportation problem equation

The optimal transport self-distillation loss is expressed as follows:

；

wherein the method comprises the steps of

Indicating optimal transport self-distillation loss, < >>

Representing the relative entropy loss, < >>

，/>

Similarity matrix and corresponding optimal transport solution matrix between semantic features of image and text respectively are represented, and +.>

，/>

And respectively representing a similarity matrix and a corresponding optimal transportation solution matrix between the image and the text global features.

Further, during model training, the total loss function of the whole network model is expressed as:

；

wherein the method comprises the steps of

、/>

、/>

、/>

Belongs to super-parameters (Foliumet)>

Representing the triple loss under the similarity of semantic features of images and texts; />

Representing the triple loss under the similarity of the global features of the image and the text; />

Representing diversity regularization loss; />

Representing sparsity loss; />

Indicating distillation loss; />

Indicating optimal transport self-distillation losses.

The invention also provides a graph and text retrieval system based on cross-modal cross-guidance, which is used for realizing the graph and text retrieval method based on cross-modal cross-guidance, and comprises a data preprocessing module, a cross-modal cross-guidance network and a loss function module, wherein the data preprocessing module is used for processing data of images or text data to be retrieved as input of a teacher network and a student network; the cross-modal cross-direction network comprises a teacher network and a student network, the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module and are used for processing input image data, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module and are used for processing input text data, the structure of the teacher network is the same as that of the student network, and the teacher network and the student network perform cross-modal cross-direction; the loss function module is used for calculating triplet loss, diversity regularization loss, sparsity loss, distillation loss and optimal transportation self-distillation loss.

Compared with the prior art, the invention has the advantages that:

the existing image and text retrieval method mainly relies on local image and text features to measure cross-modal similarity, so cross-modal semantic alignment for matching the features becomes critical, however, due to the heterogeneous difference between different modes, the cross-modal semantic alignment is extremely difficult, in order to solve the problem, the invention proposes to learn cross-modal cross-guidance consensus by using a modal shared semantic prototype and a modal specific semantic decoder, unlike the existing method, the invention proposes a new paradigm which does not directly align multi-modal local features, but uses the shared semantic prototype as a bridge, focuses on specific contents of different modalities through a semantic decoder, and in the process, the cross-modal semantic alignment can be realized very naturally, and the accuracy of the cross-modal retrieval is improved; in addition, the invention designs a new self-distillation method based on optimal transmission, which alleviates the defects of paired labels in the multi-mode data set; numerous experimental results demonstrate the effectiveness and versatility of both designs of the present invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a cross-modal cross-guided graph-text retrieval overall structure diagram of the invention;

FIG. 2 is a diagram of a cross-modal cross-guided graph-text retrieval student network structure of the invention;

fig. 3 is a structural diagram of a semantic decoder according to the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific examples.

Example 1

With reference to fig. 1, the embodiment provides a cross-modal cross-guidance-based image-text retrieval method, which comprises the following steps:

s1, inputting image data and text data of a batch.

S2, performing feature extraction and shared semantic learning of two different mode data of images and texts by using a cross-mode cross-guidance network model constructed based on a self-distillation algorithm, and completing training of the model.

The cross-modal cross-guidance network model comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts. The image branch comprises an image encoder, an image semantic decoder and an image self-attention module, and the text branch comprises a text encoder, a text semantic decoder and a text self-attention module, wherein the structures and principles of the image semantic decoder and the text semantic decoder are the same, and the structures and principles of the image self-attention module and the text self-attention module are the same. The teacher's network has the same structure as the student's network, so the teacher's network has the same image and text branches and corresponding modules.

During training, the student network and the teacher network firstly perform feature extraction on the image and text data to obtain local features of the image

And text local feature->

. Secondly, local features of the image and text are input to the corresponding semantic decoder via a learnable shared semantic prototype +.>

Extracting corresponding image semantic features from image and text local features>

And text semantic feature->

. Image processing apparatusThe semantic features and the text semantic features realize preliminary alignment, and similarity calculation can be performed to obtain +.>

. In order to facilitate further realizing the alignment of the image semantic features and the text semantic features, the invention designs a self-attention module, the image and the text are processed by the self-attention module respectively, the image semantic features and the text semantic features are processed into image global features and text global features in a refined way, and the similarity between the image global features and the text global features is expressed as>

. And finally, according to the calculated similarity, using the distribution of the teacher network as the real distribution to guide the distribution of the student network.

As a preferred implementation mode, the teacher network and the student network measure the similarity of two data of the image and the text respectively and output two paired similarity matrixes

、/>

. Similarity matrix of student network>

For computing triplet loss to optimize student network, and similarity matrix of teacher network +.>

Is modeled as an optimal transportation problem, and the optimal solution of the optimal transportation problem is obtained by solving, which is also called an optimal transportation scheme->

，/>

B x B is the size of the input image and text of a lot, by +.>

And->

The optimal transport self-distillation loss is calculated to guide the student's network.

In addition, the teacher network provides additional cross-modal guidance for the student network, the parameters of the teacher network model do not calculate gradients, and the parameters come from the exponential sliding updating of the student network parameters in the following updating mode:

；

wherein the method comprises the steps of

Weights representing momentum updates +.>

Parameters representing teacher model->

Parameters representing the student model.

As a preferred implementation mode, when the cross-modal cross-guidance network model is trained, the input image and text data of a batch are subjected to data enhancement and then are respectively input into a student network and a teacher network for training. The teacher network and the student network have the same structure, and the teacher model and the student model guide learning mutually in the training process so as to achieve a better parameter fitting effect. In the verification stage, the model can accurately extract image and text features and match and retrieve corresponding images and text.

The specific steps of model training are described below.

S21, extracting local features:

image branches and text branches are respectively extracted through a visual encoder and a text encoder to obtain image local features

And text local feature->

And map them to the same dimension.

As a preferred embodiment, as shown in fig. 2, at the time of feature extraction, given B image and text pairs in a batch, the data enhancement is realized by randomly deleting or replacing the words of the text and the area of the image therein. Extracting their local features using a Faster-RCNN model based on bottom-up and top-down attention as an image encoder and a Transformer-based pre-trained BERT model as a text encoder, respectively, to obtain image local features

And text local feature->

：

；

；

Wherein the image local features

，/>

Represents the M-th image area, M represents M area characteristics obtained by the image through an image encoder, and the text is localCharacteristics->

，/>

Representing the nth text word, N representing the N word features of the text through the text encoder. Before the image region features and text word features are processed in the follow-up process, the full-connection layer or the multi-layer perceptron is used for carrying out the scaling unification of feature dimensions, so that the final feature dimensions of the image and the final feature dimensions of the text are the same. Due to the dimension +. >

The local features of the image, text, need to be mapped to the same dimension d, i.e. +.>

、/>

。

S22, cross-modal sharing semantic learning:

a set of learnable shared semantic prototypes is designed

And text semantic feature->

And calculates the similarity, specifically: image local feature->

And text local feature->

Are respectively input to the image semantic solutionEncoder and text semantic decoder through a learnable shared semantic prototype +.>

And text semantic features

Computing image semantic features->

And text semantic feature->

Similarity between them, expressed as +.>

。

And optimizing by using the triplet loss on the basis of obtaining the similarity matrix, and using diversity regularization loss constraint on the attention weight obtained by the image and text local features through the semantic decoder.

As a preferred embodiment, the specific steps of cross-modal sharing semantic learning are as follows:

given image local features

And text local feature->

The shared semantic prototypes and semantic decoders are used to learn capture pairs Ji Yuyi between them; a shared semantic prototype is a set of learnable vectors that are randomly initialized and shared among data of different modalities, the shared semantic prototype being defined as:

；

Wherein the method comprises the steps of

Representing all shared semantic prototypes, and +.>

，/>

Indicate->

Personal shared semantic prototype->

Representing the number of shared semantic prototypes; sharing semantic prototypes and local features of one modality as inputs to a semantic decoder, here exemplified by image branches, image local features +.>

And shared semantic prototype->

The image semantic features obtained after passing through the image semantic decoder can be expressed as:

；

here, the

Representing shared semantic prototype->

Local features in the image->

Semantic features of the image mined. SemanticDec represents a semantic decoder structure, as shown in FIG. 3, a semantic decoder consisting of +.>

The same attention layer is stacked, in +.>

In layers, by the output of the previous layer +.>

And shared semantic prototype->

Together, the local features of the image are focused by a multi-head focusing Mechanism (MHA)>

Capturing a specific image semantic feature and outputting an updated image semantic feature +.>

And attention weight matrix->

The following formula may be used to describe:

；

；

wherein the method comprises the steps of

Is a parameter that can be learned, and has:

；

；

；

wherein the method comprises the steps of

、/>

、/>

For the input of the attention mechanism, the principle is the prior art, and will not be described in detail here, in this embodiment +.>

、/>

、/>

Representation- >

、/>

、/>

I-th uniform slice of>

，/>

Is the number of attention heads, so +.>

The output of the individual head is:

；

the outputs of all the heads are combined and can be expressed as:

；

wherein the method comprises the steps of

Is->

Outputting individual heads;

finally, the first

The output of the layer semantic decoder is:

；

；

wherein LayerNorm represents the layer standardization,

representation->

The output of the semantic decoder layer before the layer Dropout means random inactivation of neurons,/for neurons>

Reflecting the attention of different shared semantic prototypes to the local features of different images, will +.>

Set to all 0 matrix, ">

Expressed as the output of the whole semantic decoder, then there are:

；

similarly, text semantic features obtained by the text branches through the semantic decoder can be obtained:

；

SemanticDec represents the semantic decoder structure that, during training,

by focusing on semantics of a specific modality, own parameters are updated layer by layer to adaptively capture diversified semantics from a large amount of input data. At the same time (I)>

Also always focus on the shared semantic prototype +.>

By->

Residual connection on each layer avoids semantic drift between different modalities. Thus->

And->

Is through corresponding shared semantic prototype +.>

The cross-modal consensus is learned to achieve alignment of the two modalities of the image and the text, and the amount of the cross-modal consensus represents the overall similarity of the image and the text. It is possible to calculate the +.sup.th of a batch >

Image and->

Similarity between the individual texts:

；

wherein the method comprises the steps of

Indicate->

Individual image semantic features, together->

Personal (S)>

Indicate->

Individual text semantic features, together->

L2Norm is L2 standardized, the similarity is calculated by adopting a cosine similarity method, and a similarity matrix obtained by a semantic decoder module of images and texts of the whole batch is +.>

The similarity matrix may be optimized using a triplet penalty based on hard negative-sample mining, expressed as:

；

the triplet is lost, the learning target is the relative distance, and the distance between the reference sample and the negative sample is far greater than the distance between the reference sample and the positive sample through learning. In the present invention, the reference sample and the positive sample refer to text matched with an image or an image matched with text. Wherein the method comprises the steps of

Representing the loss of triples under the similarity of semantic features of images and texts in one batch, and +.>

Image text pairs representing positive matches in image, text semantic feature similarity matrix, ++>

And

representing unmatched image text pairs in the image and text semantic feature similarity matrix, wherein margin represents boundary parameters in triplet loss; />

Representing that only hard negative samples (hard negative samples refer to samples that do not match the largest value in similarity of image text) are used in a small batch process, rather than summing all negative samples, allows the model to learn more challenging negative samples, thereby improving the robustness and generalization performance of the model.

In addition, in order to avoid feature redundancy, the diversity of different semantics is ensured, and the attention weight is at the last layer

The diversity regularization penalty introduced above, for image branches, can be expressed as:

；

wherein the method comprises the steps of

A diversity regularization penalty representing image branches, < +.>

Is an identity matrix>

Is attention weight +.>

Results after L2 normalization; />

Is->

Transposed matrix of>

The Frobenius norm of the matrix is represented, and 2 is represented as square.

Using the same procedure, the diversity regularization penalty of the text branches can be obtained

So the diversity regularization loss of the model as a whole +.>

The method comprises the following steps:

。

s23, self-attention processing:

image semantic features output in step S22

And text semantic feature->

The images are respectively input to a self-attention module, the relation between the global features and the local semantic features of the images and the texts is explored, and the cross-modal semantics are further learned and aligned. Meanwhile, the similarity is calculated again for the image and text global features after self-attention processing, and is expressed as +.>

. Using triplet loss optimization, andsparsity constraint is performed on the attention weight obtained through the self-attention module.

As a preferred embodiment, the specific steps of step S23 are as follows: characterizing image semantics

And text semantic feature->

Respectively input to the self-attention module, for the semantic features of the image +.>

The process of going through the self-attention module process can be expressed by the following formula:

；

；

；/>

；

wherein the method comprises the steps of

Temperature parameter representing student network, ++>

Indicate->

Individual image semantic features,/->

For the number of semantic features of an image, +.>

And->

Representation->

The result of the normalization of the individual image semantic features L2,

indicate->

Results of normalization of individual image semantic features L2, < >>

Weights representing semantic features of the image processed by the self-attention module, < ->

Indicate->

The text global feature +.>

And text global feature after attention weighting +.>

The method comprises the steps of carrying out a first treatment on the surface of the Constraint using sparsity loss can be expressed as:

；

wherein the method comprises the steps of

Representing sparsity loss, < >>

Weights representing text semantic features processed by self-attention module, +.>

、/>

Entropy regularization of image weights and text weights is shown separately, from which the +.sup.th of a batch can be calculated >

Image and->

The similarity between the global features of the individual texts is given by:

；

wherein the method comprises the steps of

Indicate->

Image global feature after attention weighting of the individual images,/-, for example>

Indicate->

Global features of the text after the individual text is weighted by attention, L2Norm representing the L2 normalization; the similarity matrix between the global semantic features of the images and texts of the whole batch obtained by the self-attention module is +.>

；

wherein the method comprises the steps of

Representing the triplet loss under the global feature similarity of images and texts in one batch,

image text pairs representing positive matches in the image, text global feature similarity matrix, ++>

And

image text pairs which represent mismatching in the image, text global feature similarity matrix, margin represents boundary parameters in the triplet loss, ++>

Indicating that only hard negative samples are used in a small batch process. />

S24, teacher and student network cross guidance:

image semantic features

And text semantic feature->

Through the process ofThe self-attention module obtains the global features of the image respectively

And text Global feature->

. For matched pairs of image text, the aligned global features should also take care of the aligned local semantics. The relative entropy loss can be used for cross-modal cross-guidance. And (4) calculating according to S22 and S23 to obtain the similarity between images and texts of the teacher and student networks, and then guiding the distribution of the student networks by using the distribution of the teacher network as the actual distribution.

As a preferred embodiment, the specific steps of step S24 are as follows:

for matched pairs of image text, the aligned global features should also take care of the aligned local semantics. The loss of distillation can be used for cross-modal guidance:

；

；

；

；

；

wherein the method comprises the steps of

Indicating loss of distillation->

Representing the true distribution of text in a teacher network +.>

And image estimation distribution in student network +.>

KL divergence between->

Representing the true distribution of images in a teacher network>

And text estimation distribution in student network +.>

Representing +.>

The result of the L2 normalization of the individual image semantic features,

representing +.>

Results of L2 normalization of individual text semantic features,/->

Representing in a student network

Results of L2 normalization of individual image semantic features,/->

Representing +.>

Results of L2 normalization of individual text semantic features,/->

。

S25, self-distillation based on optimal transportation:

the teacher network and the student network respectively measure the similarity of two data of images and texts, output two paired similarity matrixes, model the similarity matrixes calculated by the teacher network into an optimal transportation problem, obtain the most accurate image text matching relation, and realize the guidance of the student network through the optimal transportation self-distillation loss; this is a self-distilling process, since the teacher network and the student network are identical in structure.

As a preferred embodiment, the specific steps of step S25 are as follows:

the teacher network and the student network measure the similarity of two data of the image and the text respectively, and output two paired similarity matrixes

、/>

Similarity matrix calculated by student network>

For optimizing student network by calculating similarity matrix of teacher network>

Modeling optimal transport problem to further guide +.>

To maximize the matching score, pairs of tags are first assigned/>

B x B is the size of the input image and text of a batch, by Z and +.>

Can construct an optimal transportation problem using the formula:

；

wherein the method comprises the steps of

Representing optimal transport problems, sup represents the upper bound, < +.>

、/>

Representing two probability distributions +.>

Representing from->

To->

Is a set of all joint probability distributions, +.>

Is->

Is a joint probability distribution of +.>

Two aggregate elements representing image and text +.>

So that the edge distribution thereof is +.>

And->

And expected benefit +.>

Maximum, so->

Representing maximize +. >

Desired benefit->

To find the optimal solution; max represents maximization,/->

Representing the slave->

To->

Is a set of joint probability distributions ∈>

And->

Representing weight vector, ++>

Represents +.>

Image and->

Labels corresponding to the individual texts->

Representing +.>

Image and->

And the corresponding similarity value of the texts, and the sizes of the images and the texts of one batch are B.

Optimal solution to the optimal transportation problem equation

Image text similarity matrix obtained by teacher based network

、/>

And->

In the presence of->

On the premise of (1) can be applied to->

And->

Constraint is carried out by

，/>

，/>

，/>

Indicate->

How much text the individual images should be assigned to, < >>

Indicate->

How many images each text should be assigned, in the absence of more a priori information, each image and text is considered to have the same assigned weight, namely:

；

thus, the optimal solution can be obtained

，/>

For optimal solution->

Multiplying it by B so that each row and each column is a probability distribution and realizing +.>

For->

Is a guide of (3). The formula is used as follows:

；/>

wherein the method comprises the steps of

Representing the relative entropy loss, < >>

Indicating the KL divergence between the two distributions, < - >

Similarity matrix obtained by calculating semantic features of student network images and texts, wherein softmax refers to softmax function and +.>

Representing the temperature parameters in the optimal transport, the optimal transport self-distillation losses are represented as follows:

；

wherein the method comprises the steps of

Indicating optimal transport self-distillation loss, < >>

，/>

，/>

Representing similarity matrix and corresponding optimal transportation solution matrix among the images and the text global features.

In model training, the total loss function of the entire network model is expressed as:

；

wherein the method comprises the steps of

、/>

、/>

、/>

Belongs to super-parameters (Foliumet)>

Representing diversity regularization loss; />

Representing sparsity loss; />

Indicating distillation loss; />

Indicating optimal transport self-distillation losses. Dimension d in this embodiment is set to 1024, the number of shared semantic prototypes +.>

20 number of layers of semantic decoder +.>

Set to 2, the boundary parameter for triplet loss set to 0.2, temperature parameter +.>

、/>

、/>

0.2, 0.1, respectively. />

、/>

、/>

、/>

Set to 0.2, 0.1, 2.0 and 1.0, respectively.

Example 2

The embodiment provides a graph-text retrieval system based on cross-modal cross guidance, which comprises a data preprocessing module, a cross-modal cross guidance network and a loss function module.

The data preprocessing module is used for data of images or text data to be retrieved, and the data preprocessing module is used for inputting teacher networks and student networks.

The cross-modal cross-guidance network comprises a teacher network and a student network, wherein the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module and are used for processing input image data, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module and are used for processing input text data, the structure of the teacher network is the same as that of the student network, and the teacher network and the student network perform cross-modal cross-guidance.

The loss function module is used for calculating triplet loss, diversity regularization loss, sparsity loss, distillation loss and optimal transportation self-distillation loss.

The functional implementation and data processing procedure of each module can be referred to in the description of embodiment 1, and are not described here.

It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed, but to limit the invention to the particular embodiments disclosed, and that various changes, modifications, additions and substitutions can be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims

1. The image-text retrieval method based on cross-modal cross guidance is characterized by comprising the following steps of:

s1, inputting image data and text data of a batch;

2. The image-text retrieval method based on cross-modal cross-guidance according to claim 1, wherein the cross-modal cross-guidance network model is characterized in that when training, the input image and text data of a batch are subjected to data enhancement and then are respectively input into a student network and a teacher network for training; the teacher network and the student network have the same structure, and the teacher model and the student model guide learning mutually in the training process so as to achieve a better parameter fitting effect; in the verification stage, the model can accurately extract the image and text characteristics, and match and search the corresponding images and texts; the training comprises the following specific steps:

s21, extracting local features:

And text local feature->

；

S22, cross-modal sharing semantic learning:

designs a group of learnable shared semantic sourceA kind of electronic device with a display unit

And text semantic feature->

And calculates the similarity, expressed as +.>

；

S23, self-attention processing:

image semantic features output in step S22

And text semantic feature->

And text Global feature->

At the same time, the global feature +.>

And text Global feature->

Again calculate the similarity, expressed as +.>

；

S24, teacher and student network cross guidance:

image semantic features

And text semantic feature->

Obtaining global image features through self-attention modules respectively

And text Global feature->

3. The method for text retrieval based on cross-modal cross-guidance according to claim 2, further comprising the step of applying an optimal transportation algorithm in step S25, specifically as follows: the teacher network and the student network measure the similarity of two data of the image and the text respectively, and output two paired similarity matrixes

、/>

The method comprises the steps of carrying out a first treatment on the surface of the Obtaining the most accurate matching relation of the image text by +.>

And->

4. The method for searching pictures and texts based on cross-modal cross-guidance according to claim 2, wherein the specific steps of cross-modal sharing semantic learning in step S22 are as follows:

Given image local features

And text local feature->

；

wherein the method comprises the steps of

Representing all shared semantic prototypes ∈ ->

Indicate->

Personal shared semantic prototype->

And shared semantic prototype->

A semantic decoder consists of +.>

The same attention layer is stacked, in +.>

In layers, by the output of the previous layer +.>

And shared semantic prototype->

And attention weight matrix->

Finally, the output of the whole image semantic decoder is obtained>

；

；

Then, the triple loss based on hard negative-sample mining is used to optimize the similarity matrix

；

Attention weighting at last layer

5. The method for retrieving images and texts based on cross-modal cross guidance according to claim 2, wherein the specific steps of the self-attention processing in step S23 are as follows: image semantic features output in step S22

And text semantic features

；

；

；

；

wherein the method comprises the steps of

Temperature parameter representing student network, ++>

Indicate->

Individual image semantic features,/->

For the number of semantic features of an image, +.>

And->

Representation->

Results of normalization of individual image semantic features L2, < >>

Indicate->

Results of normalization of individual image semantic features L2, < >>

Weights representing semantic features of the image processed by the self-attention module, < - >

Indicate->

Weights of semantic features of the images obtained by processing the self-attention module are shared by +.>

The text global feature +.>

And text global feature after attention weighting +.>

。

6. The method for searching images and texts based on cross-modal cross guidance according to claim 2, wherein the specific steps of the teacher and student network cross guidance in the step S24 are as follows: according to the steps S22 and S23, calculating to obtain the similarity between images and texts of teacher and student networks, and then using the distribution of the teacher network as the real distribution to guide the distribution of the student networks; for matched pairs of image text, cross-modal guidance was performed using distillation loss:

；

；

；

；

；

wherein the method comprises the steps of

Indicating loss of distillation->

Representing the true distribution of text in a teacher network +.>

And in a student networkImage estimation distribution +.>

KL divergence between->

Representing the true distribution of images in a teacher network>

And text estimation distribution in student network +.>

Representing +.>

The result of the L2 normalization of the individual image semantic features,

Representing +.>

Results of L2 normalization of individual text semantic features,/->

Representing in a student network

Results of L2 normalization of individual image semantic features,/->

Representing +.>

Results of L2 normalization of individual text semantic features,/->

。

7. A cross-modal cross-guided teletext retrieval method according to claim 3, wherein step S25 is based on a self-distillation step of optimal transport as follows: first of all, pairs of labels are assigned

Wherein B x B is the size of the input image and text of a batch, by Z and +.>

Constructing an optimal transportation problem using the formula:

；

wherein the method comprises the steps of

Representing optimal transport problems, sup represents the upper bound, < +.>

、/>

Representing two probability distributions +.>

Representing from->

To->

Is a set of all joint probability distributions, +.>

Is->

Is a joint probability distribution of +.>

Two aggregate elements representing image and text +.>

So that the edge distribution thereof is +.>

And->

And expected benefit +. >

Maximum, so->

Representing maximize +.>

Expected benefits from

To find the optimal solution; max represents maximization,/->

Representing the slave->

To->

Is a set of joint probability distributions ∈>

And->

Representing weight vector, ++>

Represents +.>

Image and->

Labels corresponding to the individual texts->

Representing +.>

Image and->

Similarity value corresponding to each text, and large image and text of one batchThe small values are B;

optimal solution for solving the optimal transportation problem equation

The optimal transport self-distillation loss is expressed as follows:

；

wherein the method comprises the steps of

Indicating optimal transport self-distillation loss, < >>

Representing the relative entropy loss, < >>

，/>

，/>

8. The method for cross-modal cross-guided graph retrieval of claim 7, wherein the overall loss function of the whole network model is expressed as:

；

wherein the method comprises the steps of

、/>

、/>

、/>

Belongs to super-parameters (Foliumet)>

Representing the triple loss under the similarity of semantic features of images and texts; / >

Representing diversity regularization loss;

representing sparsity loss; />

Indicating distillation loss; />

Indicating optimal transport self-distillation losses.

9. The image-text retrieval system based on cross-modal cross-guidance is characterized by comprising a data preprocessing module, a cross-modal cross-guidance network and a loss function module, wherein the data preprocessing module is used for inputting images or text data to be retrieved as an input of a teacher network and a student network; the cross-modal cross-direction network comprises a teacher network and a student network, the student network comprises two branches of images and texts, the image branches comprise an image encoder, an image semantic decoder and an image self-attention module and are used for processing input image data, the text branches comprise a text encoder, a text semantic decoder and a text self-attention module and are used for processing input text data, the structure of the teacher network is the same as that of the student network, and the teacher network and the student network perform cross-modal cross-direction; the loss function module is used for calculating triplet loss, diversity regularization loss, sparsity loss, distillation loss and optimal transportation self-distillation loss.