CN114461821A - Cross-modal image-text inter-searching method based on self-attention reasoning - Google Patents

Cross-modal image-text inter-searching method based on self-attention reasoning Download PDF

Info

Publication number
CN114461821A
CN114461821A CN202210184249.7A
Authority
CN
China
Prior art keywords
attention
text
image
cross
reasoning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210184249.7A
Other languages
Chinese (zh)
Inventor
李召
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210184249.7A priority Critical patent/CN114461821A/en
Publication of CN114461821A publication Critical patent/CN114461821A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/434Query formulation using image data, e.g. images, photos, pictures taken by a user
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a cross-modal image-text inter-searching method based on self-attention reasoning, belonging to the field of cross-modal retrieval. The proposed self-attention reasoning model mainly comprises three modules. The first part extracts image saliency features bottom-up with a pre-trained backbone network and extracts text-branch features with a word-embedding plus sequence-model structure. The second part designs a self-attention reasoning module that considers each bounding box's contribution to the overall semantics and the cohesion between semantics, further eliminating the negative influence of irrelevant semantics. The third part designs an interactive attention module between the two branches, so that corresponding image-text pairs carry greater weight in the subsequent similarity evaluation. Experiments show that, compared with traditional methods, the method achieves higher matching accuracy and faster retrieval.

Description

Cross-modal image-text inter-searching method based on self-attention reasoning
Technical Field
The invention belongs to the intersection of vision and language, is applied to cross-modal retrieval tasks between images and text, and particularly relates to a cross-modal image-text inter-searching method based on self-attention reasoning.
Background
With the rapid development of network technologies, especially emerging social platforms and mobile applications, the internet is flooded with a huge amount of multimodal information (text, images, video, audio, etc.). Accordingly, users' demands on human-computer interaction, namely the search function, have also changed: many platforms are no longer limited to matching within a single modality but provide cross-modal matching. When users search for information by submitting a query of any modality, they can obtain search results in various forms, and the different modalities of the data provide more comprehensive, complementary information. Image-text matching is an important branch of multimodal matching and is applied in many fields, such as image-text retrieval, image captioning, visual question answering, and image knowledge reasoning.
The research significance and value of image-text retrieval algorithms are reflected in the following aspects. First, they have practical application value: the most common applications are search engines and recommendation systems, and a great deal of research and deployment has been carried out on e-commerce platforms and social networking sites. Image-text retrieval lets users obtain more of the information they are interested in; for example, when shopping, a few simple text descriptions can retrieve recommendations of matching clothing images, and a photo of a dish taken on the spot can retrieve the corresponding recipe, making daily life more convenient. Second, image-text retrieval can also be applied to surveillance and security. For example, text-based person re-identification searches for pictures from a textual description; on the surface it only applies a unidirectional text-to-image search, but it is essentially image-text retrieval, and in pursuit scenarios it can help law enforcement quickly lock onto an initial target person and accelerate case solving. In addition, image-text retrieval can support new media, for example matching literary works or media creations with appropriate illustration pictures.
With the geometric growth of information on the internet and the development of big data and computing power, deep learning has abundant data to build on, and image-text retrieval algorithms based on deep learning have begun to develop rapidly. The challenges also grow, however: how to retrieve images and text quickly from massive data containing interference and noise, and how to improve the extraction and measurement of image and text features so that the model matches well, are the key problems faced by image-text retrieval.
Disclosure of Invention
The invention aims to design a cross-modal image-text inter-searching method based on self-attention reasoning, in order to bridge the semantic gap between image vision and natural language and achieve better semantic understanding. Based on deep learning, the method improves the accuracy and speed of cross-modal retrieval between images and texts and can be applied in practical scenarios.
In order to achieve the above object, the present invention provides a cross-modal image-text inter-searching method based on self-attention reasoning, comprising:
acquiring a dataset to obtain paired original image data and text annotation data, and dividing them into a training set, a validation set and a test set;
for each data pair in the training set, respectively extracting the initial image features output by a pre-trained model and the feature embeddings after text encoding;
self-attention reasoning: mapping the two modal features into the same latent common space, reasoning about the internal coupling relations within the image branch, and using a self-attention mechanism to calculate each local bounding box's contribution to the whole image and fuse them into a new image representation;
designing a cross attention layer to obtain the representation of each modality's semantics in the other modality's semantic space, calculating the similarity, training with a triplet loss function, and finally achieving cross-modal semantic alignment;
model validation: validating the trained model on the validation set to select the optimal model;
model evaluation: evaluating the optimal model on the test set to obtain its retrieval accuracy;
performing bidirectional image and text retrieval with the final optimal model.
Further, a Faster R-CNN whose backbone network is ResNet-101 pre-trained on the Visual Genome dataset is used to perform bottom-up detection and extract image features.
Further, an attention reasoning module inside the image branch obtains the importance of each bounding box, i.e. local image feature, to the global features as well as the internal relations among the bounding boxes, filters out irrelevant semantics, and infers and fuses the corresponding weight coefficients to obtain the final embedded representation of the image.
Further, word embedding is used to account for the relations between words: each word is embedded into a vector of the same dimension through an embedding matrix, and the text features are finally encoded by a bidirectional gated recurrent unit and mapped into the same common subspace as the image.
Further, a cross attention layer computes the interaction attention matrix between each local image region and each word, obtains the feature representation of the visual semantics in the text semantic space and that of the text in the visual semantic space, and computes the correlation between the two modalities with a cosine function.
Further, the cross-modal matching model is trained with a max-hinge loss, calculated as follows:
Loss = [λ - S(I,T) + S(I,T′)]+ + [λ - S(I,T) + S(I′,T)]+
where λ denotes the margin and [x]+ = max(x, 0). S(I,T) denotes the similarity between a matched image-text pair, while S(I,T′) and S(I′,T) denote the similarities with the hardest negative samples T′ = argmax_{p≠T} S(I,p) and I′ = argmax_{q≠I} S(q,T).
The features of the image or text to be queried are extracted and input into the optimal model to obtain a matching-score matrix between the image and all texts, or between the text and all images; sorting the scores of the matrix from largest to smallest gives the retrieval result.
In the specific training stage, the model is trained under the PyTorch framework on an Nvidia GTX 2080Ti GPU and an Intel Core i7-9700k CPU, using the Adam training strategy.
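As a point of reference, a minimal PyTorch setup consistent with the hardware and Adam strategy described above is sketched below. The stand-in model, the learning rate and the placeholder objective are illustrative assumptions, not values or components taken from this patent.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the matching model; the real model combines the
# image and text branches described in the detailed steps below.
model = nn.Linear(2048, 1024)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Adam training strategy as stated above; the learning rate is an assumed value.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# One illustrative optimization step on dummy region features.
features = torch.randn(32, 36, 2048, device=device)
loss = model(features).pow(2).mean()   # placeholder objective, not the patent's loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```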
Compared with the prior art, the technical scheme eliminates the negative effects of irrelevant information through self-attention reasoning and obtains a more accurate feature expression for the visual branch. On this basis, to make the correspondence between the modalities more pronounced, the cross attention layer between the two modalities further matches the feature expression of the visual semantics in the text semantic space with that of the text in the visual semantic space. The resulting multimodal features are both semantically general and accurate, which effectively improves the accuracy and stability of the cross-modal retrieval method.
Drawings
FIG. 1 is a schematic flow chart of the cross-modal image-text inter-searching method based on self-attention reasoning provided by the present invention;
FIG. 2 is a schematic diagram of an overall network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the self-attention reasoning network structure according to an embodiment of the present invention;
FIG. 4 is a cross-attention structure diagram according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.
Referring to FIGS. 1 and 2, the invention designs a cross-modal image-text inter-searching method based on self-attention reasoning, whose concrete implementation process is as follows:
Step 1: acquire the dataset to obtain paired original image data and text annotation data, and divide them into a training set, a validation set and a test set.
The multimodal dataset used for training, validation and testing contains paired images and texts, with each picture corresponding to five related text items. This embodiment divides the dataset into a training set, a validation set and a test set.
Step 2: for each data pair in the training set, extract the initial image features output by the pre-trained model and the feature embeddings after text encoding, respectively.
Image feature extraction here focuses on regions at the object and instance level. A Faster R-CNN with ResNet-101 pre-trained on the Visual Genome dataset as its backbone network is used to perform bottom-up detection. The network predicts attribute classes and instance classes, extracts the salient regions of the original image features, and keeps the top 36 bounding boxes by confidence as the image embedding. For each bounding box i, f_i is defined as the average-pooled feature of the region, with dimension 2048. Through a fully connected layer, f_i is mapped into a latent common space to obtain the final K-dimensional vector.
v_i = W_v f_i + b_v
In the formula, W_v and b_v are the parameters of the fully connected layer. V = {v_1, v_2, …, v_n}, with v_i ∈ R^K, is the transformed representation of the image features in the common space, where n is the number of bounding boxes per image and each bounding box feature is K-dimensional.
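For illustration, a minimal sketch of this projection step is given below; the batch size, the embedding dimension K = 1024 and the module name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RegionProjection(nn.Module):
    """Maps 2048-d Faster R-CNN region features f_i into the K-d common space,
    v_i = W_v f_i + b_v, via a single fully connected layer."""
    def __init__(self, feat_dim=2048, embed_dim=1024):  # K = 1024 is an assumed value
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, region_feats):
        # region_feats: (batch, 36, 2048) average-pooled bounding-box features
        return self.fc(region_feats)                     # (batch, 36, K)

# Example: project a batch of 8 images with 36 detected regions each.
regions = torch.randn(8, 36, 2048)
v = RegionProjection()(regions)                          # v: (8, 36, 1024)
```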
To connect the visual and linguistic domains, the text should be mapped into the same K-dimensional semantic vector space as the image regions. Given a sentence, the simplest approach would be to map each word separately, but each word would then be a one-hot vector encoding its index in the dictionary, independent of the others and carrying no relatedness. Word embedding is therefore used to take the associations between words into account: through an embedding matrix W_e, each word is embedded into a 300-dimensional vector, x_i = W_e w_i, i ∈ [1, m]. Then, instead of a simple linear mapping, which cannot effectively describe the contextual connections between words, a bidirectional GRU is used to encode the text features and map them into the same joint subspace as the image.
t_i = (GRU_f(x_i) + GRU_b(x_i)) / 2, i ∈ [1, m]
In the formula, GRU_f denotes the forward GRU and GRU_b the backward GRU. The final text features are obtained by averaging the sentence information from the two directions.
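A minimal sketch of this text branch, under assumed vocabulary size and embedding dimension, might look as follows; the class name and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Word embedding followed by a bidirectional GRU; the forward and backward
    hidden states are averaged to give K-dimensional word features t_i."""
    def __init__(self, vocab_size=10000, word_dim=300, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)   # x_i = W_e w_i
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        x = self.embed(word_ids)                          # (batch, m, 300)
        h, _ = self.gru(x)                                # (batch, m, 2K)
        half = h.size(-1) // 2
        fwd, bwd = h[..., :half], h[..., half:]
        return (fwd + bwd) / 2                            # t_i: (batch, m, K)

tokens = torch.randint(0, 10000, (8, 20))                 # a batch of 20-word sentences
t = TextEncoder()(tokens)                                 # (8, 20, 1024)
```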
Step 3: referring to the self-attention reasoning shown in FIG. 3, the two modal features are mapped into the same latent common space, the internal coupling relations within the image branch are inferred, and a self-attention mechanism is used to calculate each local bounding box's contribution to the whole image and fuse them into a new image representation.
The network can train its parameters to assign weights according to each bounding box's contribution to the "semantics" of the whole image. Moreover, "semantic cohesion" may exist between the bounding boxes, with different regions ultimately pointing to the same meaning, so their features are re-fused and irrelevant parts are removed. First, the flattened image representation output by the preceding feature-extraction network is fed into two attention modules S1 and S2 connected in series. Second, the response at a given position in the sequence is computed as the weighted average, in the embedding space, of all other positions. Then, a shortcut connection in the style of a residual network is applied to avoid the adverse effects of increasing the number of layers. Finally, a fully connected layer maps the image features into the joint subspace again.
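A minimal sketch of one such self-attention block is given below, under assumed dimensions; the scaled-dot-product form of the weighting and the module name are assumptions, not the patent's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Re-expresses each region as a weighted average of all regions in the
    embedding space, adds a residual shortcut, then maps back via a FC layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, v):                                  # v: (batch, n_regions, dim)
        q, k, val = self.query(v), self.key(v), self.value(v)
        attn = F.softmax(q @ k.transpose(1, 2) / v.size(-1) ** 0.5, dim=-1)
        refined = attn @ val                               # weighted average over regions
        return self.out(refined + v)                       # residual shortcut, then FC

v = torch.randn(8, 36, 1024)
v_hat = SelfAttentionBlock()(SelfAttentionBlock()(v))      # two stacked blocks S1, S2
```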
Step 4: a cross attention layer is designed to obtain the representation of each modality's semantics in the other modality's semantic space; the similarity is calculated and the model is trained with a triplet loss function, finally achieving cross-modal semantic alignment.
A cross-modal cross-attention mechanism is adopted to learn the degree of focus between text and image and to align local features. For each original region v_i and its self-attention-refined representation v̂_i, an attention score a_ij with each word t_j is computed, which determines how strongly the current region attends to each word. Referring to FIG. 4, taking v_1 and v̂_1 as an example, a weight is calculated for every word in the whole sentence, yielding a region-attended representation of the sentence semantics that is used to measure similarity. By analogy, the similarity matrices between all image regions and the sentence are obtained. The normalized scores α_ij and α̂_ij denote the attention between the i-th region and the j-th word, computed from v_i and v̂_i respectively. A gating mechanism then fuses these two attention weights: the fused score β_ij is obtained by passing concat(α_ij, α̂_ij) through fully connected layers with parameters W_1 and W_2, where concat denotes vector concatenation. The sentence semantics corresponding to the current region is defined as the weighted combination of the words with the fused attention weights, s_i = Σ_j β_ij t_j. On this basis, the final image embedding V̂ and text embedding T are obtained. The semantic relevance between image I and text T is measured as the average relevance between every region feature in the image embedding and its corresponding sentence semantic vector:
S(I,T) = (1/n) Σ_{i=1}^{n} r(v̂_i, s_i)
where r(·,·) denotes the cosine similarity function.
The emphasis in training with the loss function is on handling "hard" negative samples. To improve computational efficiency, only the hard negatives within a mini-batch of stochastic gradient descent are typically considered, rather than summing over all negative samples. The max-hinge loss is used here to achieve cross-modal image-text alignment. Within a mini-batch, queries can be made in both the image-to-text and text-to-image directions.
Loss = [λ - S(I,T) + S(I,T′)]+ + [λ - S(I,T) + S(I′,T)]+
where λ denotes the margin and [x]+ = max(x, 0). S(I,T) denotes the similarity between a matched image-text pair, while S(I,T′) and S(I′,T) denote the similarities with the hardest negative samples T′ = argmax_{p≠T} S(I,p) and I′ = argmax_{q≠I} S(q,T).
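A hedged sketch of this max-hinge loss with hardest in-batch negatives follows; it assumes the batch similarity matrix has matched pairs on its diagonal, and the margin value 0.2 is an assumption.

```python
import torch

def hinge_triplet_loss(scores, margin=0.2):
    """scores: (n, n) similarity matrix S between every image and caption in a
    mini-batch, with matched pairs on the diagonal. Returns the max-hinge loss
    over the hardest negative caption T' and the hardest negative image I'."""
    n = scores.size(0)
    pos = scores.diag().view(n, 1)
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> hardest caption T'
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # caption -> hardest image I'
    mask = torch.eye(n, dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()

loss = hinge_triplet_loss(torch.randn(128, 128))
```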
Step 5: model validation. The trained model is validated on the validation set to select the optimal model. Steps 2 to 4 are iterated, an evaluation on the validation set is performed every 1000 iterations, and the best-performing model is kept.
Step 6: model evaluation. The optimal model is evaluated on the test set to obtain its retrieval accuracy.
Specifically, the preprocessed test set is input into the trained optimal model, and the text-to-image and image-to-text retrieval accuracies are evaluated at the same time. The retrieved results are compared with the ground-truth annotations, and the recall rate is calculated as the evaluation index of the model.
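For reference, a sketch of the Recall@K metric used in this kind of evaluation follows; the assumption that captions 5i to 5i+4 belong to image i reflects the five-captions-per-image datasets described below, and the function name is hypothetical.

```python
import torch

def recall_at_k(scores, k=1):
    """scores: (n_images, n_captions) similarity matrix, where captions
    5i .. 5i+4 are the ground-truth matches of image i. Returns the fraction
    of images whose top-K retrieved captions contain a ground-truth match."""
    n_images = scores.size(0)
    topk = scores.topk(k, dim=1).indices                  # top-K caption indices per image
    hits = 0
    for i in range(n_images):
        gt = torch.arange(i * 5, i * 5 + 5)
        if (topk[i].unsqueeze(1) == gt).any():
            hits += 1
    return hits / n_images                                # image-to-text Recall@K

r1 = recall_at_k(torch.rand(1000, 5000), k=1)
```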
Step 7: bidirectional image and text retrieval is performed with the final optimal model.
The features of the image or text to be queried are extracted and input into the optimal model to obtain a matching-score matrix between the image and all texts, or between the text and all images; sorting the scores of the matrix from largest to smallest gives the retrieval result.
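A sketch of this retrieval step in the image-to-text direction is given below; it reuses the image_text_similarity sketch from step 4, and the text-to-image direction simply swaps the roles of query and gallery. The function name and top_k value are illustrative assumptions.

```python
import torch

def retrieve_texts_for_image(img_regions, candidate_texts, top_k=10):
    """img_regions: (36, K) region features of the query image;
    candidate_texts: list of (m_i, K) word-feature tensors, one per caption.
    Scores every caption, then sorts the scores from largest to smallest."""
    scores = torch.stack([image_text_similarity(img_regions, t) for t in candidate_texts])
    order = torch.argsort(scores, descending=True)        # ranked retrieval result
    return order[:top_k], scores[order[:top_k]]
```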
The proposed method is validated on the popular Flickr30K and MSCOCO datasets, which are widely used in image-text matching, image retrieval and image captioning tasks. Flickr30K contains 31000 pictures, each with five corresponding sentence entries; 1000 pictures are used as the validation set, 1000 as the test set, and the rest are used for training. MSCOCO contains 123287 images, each likewise corresponding to five textual descriptions, of which 113287 images are used as the training set, 5000 as the validation set and 5000 as the test set. Experiments show that the method has certain advantages over traditional methods.
In summary, the negative effects of irrelevant information can be eliminated through self-attention reasoning to obtain a more accurate feature expression for the visual branch. On this basis, to make the correspondence between the modalities more pronounced, the cross attention layer between the two modalities further matches the feature expression of the visual semantics in the text semantic space with that of the text in the visual semantic space. The resulting multimodal features are both semantically general and accurate, which effectively improves the accuracy and stability of the cross-modal retrieval method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention should not be limited to the embodiments shown herein, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A cross-modal image-text inter-searching method based on self-attention reasoning, characterized by comprising the following steps:
acquiring a dataset to obtain paired original image data and text annotation data, and dividing them into a training set, a validation set and a test set;
for each data pair in the training set, respectively extracting the initial image features output by a pre-trained model and the feature embeddings after text encoding;
self-attention reasoning: mapping the two modal features into the same latent common space, reasoning about the internal coupling relations within the image branch, and using a self-attention mechanism to calculate each local bounding box's contribution to the whole image and fuse them into a new image representation;
designing a cross attention layer to obtain the representation of each modality's semantics in the other modality's semantic space, calculating the similarity, training with a triplet loss function, and finally achieving cross-modal semantic alignment;
model validation: validating the trained model on the validation set to select the optimal model;
model evaluation: evaluating the optimal model on the test set to obtain its retrieval accuracy;
performing bidirectional image and text retrieval with the final optimal model.
2. The cross-modal image-text inter-searching method based on self-attention reasoning as claimed in claim 1, wherein bottom-up detection is performed to extract image features using a Faster R-CNN whose backbone network is ResNet-101 pre-trained on the Visual Genome dataset.
3. The cross-modal image-text inter-searching method based on self-attention reasoning according to claim 2, wherein the importance of each bounding box, i.e. local image feature, to the global features and the internal relations among the bounding boxes are obtained through an attention reasoning module inside the image branch, irrelevant semantics are filtered out, and the corresponding weight coefficients are inferred and fused to obtain the final embedded representation of the image.
4. The cross-modal image-text inter-searching method based on self-attention reasoning as claimed in claim 1, wherein word embedding is used to account for the relations between words: each word is embedded into a vector of the same dimension through an embedding matrix, and the text features are finally encoded by a bidirectional gated recurrent unit and mapped into the same common subspace as the image.
5. The cross-modal image-text inter-searching method based on self-attention reasoning as claimed in any one of claims 1-4, wherein an interaction attention matrix between each local image region and each word is calculated through a cross attention layer, the feature representation of the visual semantics in the text semantic space and that of the text in the visual semantic space are obtained, and a cosine function is used to calculate the correlation between the two modalities.
6. The cross-modal image-text inter-searching method based on self-attention reasoning according to claim 5, characterized in that the cross-modal matching model is trained with a max-hinge loss, calculated as follows:
Loss = [λ - S(I,T) + S(I,T′)]+ + [λ - S(I,T) + S(I′,T)]+
wherein S(I,T) denotes the similarity between a matched image-text pair, S(I,T′) and S(I′,T) denote the similarities with the hardest negative samples T′ = argmax_{p≠T} S(I,p) and I′ = argmax_{q≠I} S(q,T), λ denotes the margin, and [x]+ = max(x, 0).
7. The cross-modal image-text inter-searching method based on self-attention reasoning according to claim 1, characterized in that the image-text retrieval step specifically comprises: extracting the features of the image or text to be queried, inputting them into the optimal model to obtain a matching-score matrix between the image and all texts or between the text and all images, and sorting the scores of the matrix from largest to smallest to obtain the retrieval result.
CN202210184249.7A 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning Pending CN114461821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210184249.7A CN114461821A (en) 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210184249.7A CN114461821A (en) 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning

Publications (1)

Publication Number Publication Date
CN114461821A true CN114461821A (en) 2022-05-10

Family

ID=81414959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210184249.7A Pending CN114461821A (en) 2022-02-24 2022-02-24 Cross-modal image-text inter-searching method based on self-attention reasoning

Country Status (1)

Country Link
CN (1) CN114461821A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024001100A1 (en) * 2022-06-30 2024-01-04 苏州元脑智能科技有限公司 Method and apparatus for processing text, and device and non-volatile readable storage medium
CN115017358A (en) * 2022-08-09 2022-09-06 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN115017358B (en) * 2022-08-09 2022-11-04 南京理工大学 Cross-modal retrieval method and system for multi-modal interaction
CN116150418A (en) * 2023-04-20 2023-05-23 南京邮电大学 Image-text matching method and system based on mixed focusing attention mechanism

Similar Documents

Publication Publication Date Title
CN112905827B (en) Cross-modal image-text matching method, device and computer readable storage medium
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN114461821A (en) Cross-modal image-text inter-searching method based on self-attention reasoning
CN109190117A (en) A kind of short text semantic similarity calculation method based on term vector
CN110543557A (en) construction method of medical intelligent question-answering system based on attention mechanism
Li et al. Measuring and predicting tag importance for image retrieval
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN113656660B (en) Cross-modal data matching method, device, equipment and medium
KR102059743B1 (en) Method and system for providing biomedical passage retrieval using deep-learning based knowledge structure construction
CN114595306A (en) Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN116450883A (en) Video moment retrieval method based on video content fine granularity information
Yu et al. Question classification based on MAC-LSTM
Renjit et al. CUSAT NLP@ AILA-FIRE2019: Similarity in Legal Texts using Document Level Embeddings.
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Khurram et al. Dense-captionnet: a sentence generation architecture for fine-grained description of image semantics
CN116244448A (en) Knowledge graph construction method, device and system based on multi-source data information
Verma et al. Automatic image caption generation using deep learning
CN114004236B (en) Cross-language news event retrieval method integrating knowledge of event entity
Liu et al. Improved Chinese sentence semantic similarity calculation method based on multi-feature fusion
CN112269892B Interactive phrase localization and identification method based on multi-level unified multi-modality
CN112749566B (en) Semantic matching method and device for English writing assistance
CN112883182A (en) Question-answer matching method and device based on machine reading
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship
CN115292533A (en) Cross-modal pedestrian retrieval method driven by visual positioning

Legal Events

Date Code Title Description
PB01 Publication