CN114492646A - Image-text matching method based on cross-modal mutual attention mechanism - Google Patents

Image-text matching method based on cross-modal mutual attention mechanism Download PDF

Info

Publication number
CN114492646A
CN114492646A (application CN202210105762.2A)
Authority
CN
China
Prior art keywords
text
image
similarity
matching
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210105762.2A
Other languages
Chinese (zh)
Inventor
赵海英
魏莱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INTERNATIONAL STUDIES UNIVERSITY
Beijing University of Posts and Telecommunications
Original Assignee
BEIJING INTERNATIONAL STUDIES UNIVERSITY
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INTERNATIONAL STUDIES UNIVERSITY, Beijing University of Posts and Telecommunications filed Critical BEIJING INTERNATIONAL STUDIES UNIVERSITY
Priority to CN202210105762.2A
Publication of CN114492646A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses an image-text matching method based on a cross-modal mutual attention mechanism, which comprises the following steps: extracting image semantic features with an object detection network and text semantic features with a Chinese pre-trained model; calculating, through an image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and calculating a text vector supervised by image attention; calculating the cosine similarity between the image-attention-supervised text vector and each local image region, and obtaining the similarity between the complete text and the complete image by summation and average pooling; and calculating the predicted image-text similarity. The invention can infer the similarity between a whole image and a complete sentence, output a similarity value between the image and the sentence, and realize fine-grained image-text matching that aligns local image regions with words.

Description

Image-text matching method based on cross-modal mutual attention mechanism
Technical Field
The invention relates to the field of cross-modal image-text matching, in particular to an image-text matching method based on a cross-modal mutual attention mechanism.
Background
Image-text matching studies how to measure the similarity of images and text at the semantic level. It is closely related to multi-modal alignment, cross-modal retrieval, and similar fields, lies at the intersection of natural language processing and computer vision, and is often applied to knowledge retrieval, image-text annotation, and related tasks. In early research, because image features and text features were extracted in completely different ways, only the bag-of-words representation of the text and the visual bag-of-words of the image could be crudely aligned; later, methods based on canonical correlation analysis (CCA) appeared in cross-modal retrieval, linearly mapping the different modalities into the same space so that their correlation could be compared and ranked, with the top-ranked results taken as the image-text matching result.
However, these early image-text matching methods captured semantic relevance poorly. In recent years, with the development of deep learning, image features have been extracted with networks such as CNNs, ResNet, and VGG, and object detection networks such as Faster R-CNN and YOLO can accurately locate object regions of interest in an image, greatly improving the ability to capture image semantic features.
With the rise of deep learning for extracting image and text semantic features, image-text matching research turned to mapping the extracted image and text features into the same semantic space and then computing the similarity of the encoded feature vectors to obtain the matching result. The main image-text similarity measures include cosine distance, Jaccard distance, Euclidean distance, and Hamming distance. However, these similarity calculations only measure the distance between encoding vectors and do not take semantic distance or fine-grained matching alignment into account.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides an image-text matching method based on a cross-modal mutual attention mechanism, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
An image-text matching method based on a cross-modal mutual attention mechanism comprises the following steps:
S1, extracting image semantic features with the Faster R-CNN object detection network, and extracting text semantic features with the BERT Chinese pre-trained model;
S2, constructing an image-text similarity calculation module, calculating, through the image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and calculating a text vector supervised by image attention;
S3, calculating the cosine similarity between the image-attention-supervised text vector and each local image region, and obtaining the similarity between the complete text and the complete image by summation and average pooling;
and S4, training to obtain an image-text matching prediction model, and feeding the extracted feature encodings of the image and text data into the image-text matching prediction model to obtain the predicted image-text similarity.
Further, the step of extracting image semantic features with the Faster R-CNN object detection network in S1 further comprises the following steps:
extracting the region-of-interest feature V_r and the POS feature V_p of the image with the Faster R-CNN object detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
Further, the extraction of text semantic features with the BERT Chinese pre-trained model in S1 further comprises the following steps:
segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, feeding T_w into a bidirectional GRU network, and extracting the text feature T_l;
extracting the semantic feature T_s from the text feature T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic feature.
Further, when the image-text similarity calculation module is constructed in S2, the image semantic feature V_s and the text semantic feature T_s are fed into the image-text similarity calculation module to obtain the similarity between V_s and T_s.
Further, the calculating of the text vector supervised by image attention in S2 further includes the following steps:
dividing the image into k local regions according to the number k of regions of interest, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity between V_i and W_j as

s_{ij} = \frac{v_i^T w_j}{\|v_i\|\,\|w_j\|}, \quad i \in [1, k], \ j \in [1, n]

and normalizing it as

\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}

wherein [x]_+ \equiv \max(x, 0);
calculating the attention weight A_{ij} with the softmax function:

A_{ij} = \frac{\exp(\bar{s}_{ij})}{\sum_{j=1}^{n} \exp(\bar{s}_{ij})}

and obtaining the attended text vector a_i by a weighted sum over the words:

a_i = \sum_{j=1}^{n} A_{ij} w_j

In the above formulas, s_{ij} is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of local regions into which the image is divided according to its number of regions of interest, n is the number of words into which the sentence text is divided, and A_{ij} is the weight of the jth word computed from its degree of correlation with the image region.
Further, the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further includes the following steps:
calculating the cosine similarity R(v_i, a_i) between the attended text vector a_i and the local image region v_i, then summing the cosine similarities and average-pooling them to obtain the similarity S_AVG(I, T) between the complete text and the complete image.
Further, the similarity S_AVG(I, T) between the complete text and the complete image is calculated as

R(v_i, a_i) = \frac{v_i^T a_i}{\|v_i\|\,\|a_i\|}

S_{AVG}(I, T) = \frac{1}{k} \sum_{i=1}^{k} R(v_i, a_i)

wherein v_i is the ith local image feature, a_i is the text feature weighted by image attention supervision, and k is the number of local image blocks.
Further, the training in S4 to obtain the image-text matching prediction model and the feeding of the extracted feature encodings of the image and text data into the model to obtain the predicted image-text similarity further comprise the following steps:
constructing negative examples by random pairing, training on the data with a Triplet Loss function as the training objective of the image-text matching task, and obtaining the image-text matching prediction model after training;
extracting the feature encodings of the image and text data with Faster R-CNN and the bidirectional GRU respectively, and feeding the extracted feature encodings into the image-text matching prediction model to obtain the predicted image-text similarity.
Further, when training data by constructing a negative example by using a random pairing method, the negative example construction strategy further includes the following steps:
randomly selecting one from the text set as a negative example for 40% of the image data;
for 60% of the image data, selecting from the text set a text that shares entity words with the actually matching text as the negative example.
Further, the image-text matching task adopts a Triplet Loss function as the training objective, and the Triplet Loss function is

L = [a - S(I, T) + S(I, \hat{T})]_+ + [a - S(I, T) + S(\hat{I}, T)]_+

wherein [x]_+ \equiv \max(x, 0), S(I, T) is the image-text similarity from S2, \hat{T} is a negative text, \hat{I} is a negative image, and a is a constant margin.
The invention has the beneficial effects that: the fine-grained image-text matching based on the cross-modal mutual attention mechanism introduces the ROI features and POS features output by an object detection network when extracting image features, which both partitions the image into regions and enhances its semantic features, and introduces a self-supervised BERT Chinese pre-trained model when extracting text features to further extract semantic information. Finally, the method introduces a cross-modal attention mechanism to align local regions of interest in the image with the words in the text, obtains the complete image-text similarity by summation and average pooling, and thus infers the complete image-text similarity from local alignments, achieving a better image-text matching effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of a teletext matching method based on a cross-modal mutual attention mechanism according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, the accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the embodiments and, together with the description, serve to explain their principles of operation and to enable those of ordinary skill in the art to understand the embodiments and advantages of the invention. The drawings are not to scale, and like reference numerals generally refer to like elements.
According to an embodiment of the invention, an image-text matching method based on a cross-modal mutual attention mechanism is provided. To account for semantic distance and fine-grained matching, a cross-modal mutual attention mechanism and object detection are introduced. The attention involved in image-text matching comprises visual attention and textual attention: fine-grained information is obtained by computing word vectors after segmenting the text and by using object detection to extract ROI (region of interest) features and POS (position in the image) features of the image; the cross-modal mutual attention module then computes a text vector supervised by the image, and finally the similarity between this image-supervised text vector and the image features is calculated. This adds semantic distance information to image-text matching while realizing fine-grained alignment of words to ROI regions, and experiments show that this image-text matching scheme produces good matching results.
The method is built around a cross-modal mutual attention module: local image features are extracted with the Faster R-CNN object detection algorithm, contextual text features are extracted from the sentence with a bidirectional GRU (gated recurrent unit), and the encoded local region features of the image and word features of the sentence are fed into the cross-modal mutual attention module. The aim is to map words and image regions into a common embedding space and then infer the similarity between the whole image and the whole sentence by aligning words with image regions. Finally, a similarity value between the image and the sentence is output, realizing a fine-grained image-text matching method that aligns local image regions with words.
The invention is further described below with reference to the drawings and the detailed description. As shown in Fig. 1, in an embodiment of the invention, an image-text matching method based on a cross-modal mutual attention mechanism includes the following steps:
S1, extracting image semantic features with the Faster R-CNN object detection network, and extracting text semantic features with the BERT Chinese pre-trained model;
wherein the step of extracting image semantic features with the Faster R-CNN object detection network in S1 further comprises the following steps:
given the characteristics of the image data, extracting the ROI (region of interest) feature V_r and the POS (position) feature V_p of the image with the Faster R-CNN object detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
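For illustration only, the following minimal Python sketch shows one way V_r and V_p could be concatenated into V_s; the five-dimensional position encoding, the feature dimensions, and the function name are assumptions not specified by the invention, and the ROI features are presumed to come from a detector such as Faster R-CNN.

import numpy as np

def build_image_features(roi_feats, boxes, image_w, image_h):
    """Illustrative sketch: concatenate ROI features V_r with position (POS) features V_p.

    roi_feats: (k, d) array of per-region features, e.g. from a Faster R-CNN detector
    boxes:     (k, 4) array of [x1, y1, x2, y2] coordinates of the k regions of interest
    Returns:   (k, d + 5) array V_s; the 5-dim position encoding is an assumption."""
    x1, y1, x2, y2 = boxes.T
    pos = np.stack([
        x1 / image_w, y1 / image_h,                     # normalized top-left corner
        x2 / image_w, y2 / image_h,                     # normalized bottom-right corner
        (x2 - x1) * (y2 - y1) / (image_w * image_h),    # relative region area
    ], axis=1)                                          # POS feature V_p, shape (k, 5)
    return np.concatenate([roi_feats, pos], axis=1)     # image semantic feature V_s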
The extraction of text semantic features with the BERT Chinese pre-trained model in S1 further comprises the following steps:
given the characteristics of the Chinese text data, segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, feeding T_w into a bidirectional GRU network, and extracting the text feature T_l;
extracting the semantic feature T_s from the text feature T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic feature.
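A minimal sketch of the text branch (jieba segmentation, word embedding T_w, bidirectional GRU producing T_l) is given below for illustration only; the vocabulary handling and dimensions are assumptions, and the subsequent BERT Chinese refinement step that yields T_s is omitted for brevity.

import jieba
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative sketch of the text pipeline: jieba -> word embedding T_w -> bi-GRU -> T_l."""
    def __init__(self, vocab, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.vocab = vocab                                  # word -> index map (assumed given)
        self.embed = nn.Embedding(len(vocab), embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, sentence):
        words = list(jieba.cut(sentence))                   # word list from Chinese segmentation
        ids = torch.tensor([[self.vocab.get(w, 0) for w in words]])
        t_w = self.embed(ids)                               # word vectors T_w, shape (1, n, embed_dim)
        t_l, _ = self.gru(t_w)                              # contextual text features T_l, (1, n, 2*hidden_dim)
        return words, t_l.squeeze(0)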
S2, constructing a graph-text similarity calculation module, calculating cos similarity (cosine similarity) between a local image area and words in a text through the graph-text similarity calculation module to obtain attention weight, and calculating a text vector supervised by image attention;
wherein, when the image-text similarity calculation module is constructed in the step S2, the image semantic feature V is usedsAnd text semantic feature TsThe image semantic feature V is obtained by an input image-text similarity calculation modulesSemantic feature T with textsThe similarity between them.
The step of calculating the text vector supervised by image attention in S2 further includes the steps of:
dividing the image into k local regions according to the number k of ROIs, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity between V_i and W_j, so as to measure the importance of each image region to each word, as

s_{ij} = \frac{v_i^T w_j}{\|v_i\|\,\|w_j\|}, \quad i \in [1, k], \ j \in [1, n]

and normalizing it as

\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}

wherein [x]_+ \equiv \max(x, 0);
calculating the attention weight A_{ij} with the softmax function:

A_{ij} = \frac{\exp(\bar{s}_{ij})}{\sum_{j=1}^{n} \exp(\bar{s}_{ij})}

and obtaining the attended text vector a_i by a weighted sum over the words:

a_i = \sum_{j=1}^{n} A_{ij} w_j

In the above formulas, s_{ij} is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of local regions into which the image is divided according to its number of ROIs, n is the number of words into which the sentence text is divided, and A_{ij} is the attention weight of the jth word computed from its degree of correlation with the image region.
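A minimal sketch of this cross-modal attention computation is given below for illustration only; the normalization direction, the optional smoothing factor in the softmax, and the function name are assumptions in line with common practice.

import torch
import torch.nn.functional as F

def attended_text_vectors(v, w, smooth=1.0):
    """Illustrative sketch of S2: region-word cosine similarity -> normalization -> softmax -> weighted sum.

    v: (k, d) local image region features v_i
    w: (n, d) word features w_j
    Returns a: (k, d), where a_i = sum_j A_ij * w_j is the text vector attended by region i."""
    s = F.normalize(v, dim=1) @ F.normalize(w, dim=1).t()           # s_ij, shape (k, n)
    s = s.clamp(min=0)                                              # [x]_+ = max(x, 0)
    s_bar = s / (s.pow(2).sum(dim=0, keepdim=True).sqrt() + 1e-8)   # normalize over the k regions
    A = F.softmax(smooth * s_bar, dim=1)                            # attention weights A_ij over the n words
    return A @ w                                                    # attended text vectors a_i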
S3, calculating cos similarity of the text vector supervised by the image attention and the local image, and obtaining the similarity between the complete text and the complete image through summation and average pooling;
wherein the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further includes the steps of:
by focusing on text vectors in attention
Figure BDA0003493448780000077
And cos similarity of local region of image
Figure BDA0003493448780000078
Calculating, summing the cos similarity and carrying out average pooling treatment to obtain the similarity S of the complete text and the complete imageAVG(I,T)。
Similarity S of the complete text and the complete imageAVGThe calculation formula of (I, T) is as follows:
Figure BDA0003493448780000079
Figure BDA00034934487800000710
in the formulae ViIs the ith local image feature, aiIs a text feature weighted by image attention supervision, k being a total of k image local blocks.
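Continuing the sketch above, the global similarity S_AVG(I, T) could then be computed as follows; this is illustrative only, and the function name is an assumption.

def image_text_similarity(v, a):
    """Illustrative sketch of S3: S_AVG(I, T) = mean over regions of R(v_i, a_i).

    v: (k, d) local image region features; a: (k, d) attended text vectors from attended_text_vectors."""
    r = F.cosine_similarity(v, a, dim=1)   # R(v_i, a_i), one value per local region
    return r.mean()                        # sum and average-pool over the k regions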
And S4, training to obtain an image-text matching prediction model, and inputting the extracted feature codes of the image-text data into the image-text matching prediction model to obtain a prediction result of the image-text similarity.
Wherein, in S4, the training to obtain the image-text matching prediction model and the feeding of the extracted feature encodings of the image and text data into the model to obtain the predicted image-text similarity further comprise the following steps:
constructing negative examples by random pairing and training on the data, with a Triplet Loss function generally adopted as the training objective of the image-text matching task; the training objective is to reduce the value of the loss function, and the image-text matching prediction model is obtained after training;
extracting the feature encodings of the image and text data with Faster R-CNN and the bidirectional GRU respectively, and feeding them into the image-text matching prediction model to obtain the predicted image-text similarity. The result can be used for tasks such as image-text retrieval and image annotation.
When the random pairing method is adopted to construct the negative examples to train the data, the negative example construction strategy further comprises the following steps:
randomly drawing one from the text set for 40% of the image data as a negative example;
for 60% of the image data, selecting from the text set a text that shares entity words with the actually matching text as the negative example.
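For illustration only, this negative-example strategy could be sketched as follows; the entity_index structure and all helper names are assumptions.

import random

def sample_negative_text(matched_text, matched_entities, text_pool, entity_index, p_random=0.4):
    """Illustrative sketch of the negative-example construction strategy.

    matched_entities: entity words appearing in the ground-truth caption
    entity_index:     assumed map from entity word -> list of captions containing that word
    With probability 0.4 a random caption is drawn; otherwise a harder negative
    sharing an entity word with the true caption is drawn."""
    if random.random() < p_random:
        candidates = [t for t in text_pool if t != matched_text]
    else:
        candidates = [t for ent in matched_entities
                      for t in entity_index.get(ent, [])
                      if t != matched_text]
        if not candidates:                  # fall back to a random negative if no entity overlap exists
            candidates = [t for t in text_pool if t != matched_text]
    return random.choice(candidates)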
The image-text matching task adopts a Triplet Loss function as the training objective; the Triplet Loss function is

L = [a - S(I, T) + S(I, \hat{T})]_+ + [a - S(I, T) + S(\hat{I}, T)]_+

wherein [x]_+ \equiv \max(x, 0), S(I, T) is the image-text similarity from S2, S(I, \hat{T}) is the similarity of the image with a negative text \hat{T}, S(\hat{I}, T) is the similarity of the text with a negative image \hat{I}, and a is a constant margin. The training objective is to reduce the value of the loss function, and the image-text matching prediction model is obtained after training.
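A minimal sketch of this hinge-based Triplet Loss follows for illustration only; the margin value and the function name are assumptions, and the inputs are expected to be scalar tensors such as those returned by image_text_similarity above.

def triplet_loss(s_pos, s_neg_text, s_neg_image, margin=0.2):
    """Illustrative sketch: L = [a - S(I,T) + S(I,T^)]_+ + [a - S(I,T) + S(I^,T)]_+.

    s_pos:       S(I, T) for the matched pair
    s_neg_text:  S(I, T^) for the image paired with a negative text
    s_neg_image: S(I^, T) for the text paired with a negative image"""
    return ((margin - s_pos + s_neg_text).clamp(min=0)
            + (margin - s_pos + s_neg_image).clamp(min=0))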
In conclusion, the fine-grained image-text matching based on the cross-modal mutual attention mechanism introduces the ROI features and POS features output by the object detection network when extracting image features, which both partitions the image into regions and enhances its semantic features, and introduces a self-supervised BERT Chinese pre-trained model when extracting text features to further extract semantic information. Finally, the method introduces a cross-modal attention mechanism to align local regions of interest in the image with the words in the text, obtains the complete image-text similarity by summation and average pooling, and thus infers the complete image-text similarity from local alignments, achieving a better image-text matching effect.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. An image-text matching method based on a cross-modal mutual attention mechanism, characterized by comprising the following steps:
S1, extracting image semantic features with an object detection network, and extracting text semantic features with a Chinese pre-trained model;
S2, constructing an image-text similarity calculation module, calculating, through the image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and calculating a text vector supervised by image attention;
S3, calculating the cosine similarity between the image-attention-supervised text vector and each local image region, and obtaining the similarity between the complete text and the complete image by summation and average pooling;
and S4, training to obtain an image-text matching prediction model, and feeding the extracted feature encodings of the image and text data into the image-text matching prediction model to obtain the predicted image-text similarity.
2. The method for image-text matching based on a cross-modal mutual attention mechanism according to claim 1, wherein the step of extracting semantic features of the image by using a target detection network in S1 further comprises the following steps:
extracting the region-of-interest feature V_r and the POS feature V_p of the image with the object detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
3. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 1, wherein the extraction of text semantic features with a Chinese pre-trained model in S1 further comprises the following steps:
segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, feeding T_w into a bidirectional GRU network, and extracting the text feature T_l;
extracting the semantic feature T_s from the text feature T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic feature.
4. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 3, wherein, when the image-text similarity calculation module is constructed in S2, the image semantic feature V_s and the text semantic feature T_s are fed into the image-text similarity calculation module to obtain the similarity between V_s and T_s.
5. The image-text matching method according to claim 1, wherein the step of calculating the text vector supervised by image attention in S2 further comprises the following steps:
dividing the image into k local regions according to the number k of regions of interest, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity between V_i and W_j as

s_{ij} = \frac{v_i^T w_j}{\|v_i\|\,\|w_j\|}, \quad i \in [1, k], \ j \in [1, n]

and normalizing it as

\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}

wherein [x]_+ \equiv \max(x, 0);
calculating the attention weight A_{ij} with the softmax function:

A_{ij} = \frac{\exp(\bar{s}_{ij})}{\sum_{j=1}^{n} \exp(\bar{s}_{ij})}

and obtaining the attended text vector a_i by a weighted sum over the words:

a_i = \sum_{j=1}^{n} A_{ij} w_j

wherein s_{ij} is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of local regions into which the image is divided according to its number of regions of interest, n is the number of words into which the sentence text is divided, and A_{ij} is the weight of the jth word computed from its degree of correlation with the image region.
6. The method for matching graphics and text based on cross-modal mutual attention mechanism according to claim 5, wherein the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further comprises the steps of:
calculating the cosine similarity R(v_i, a_i) between the attended text vector a_i and the local image region v_i, then summing the cosine similarities and average-pooling them to obtain the similarity S_AVG(I, T) between the complete text and the complete image.
7. The method according to claim 6, wherein the similarity S_AVG(I, T) between the complete text and the complete image is calculated as

R(v_i, a_i) = \frac{v_i^T a_i}{\|v_i\|\,\|a_i\|}

S_{AVG}(I, T) = \frac{1}{k} \sum_{i=1}^{k} R(v_i, a_i)

wherein v_i is the ith local image feature, a_i is the text feature weighted by image attention supervision, and k is the number of local image blocks.
8. The method as claimed in claim 1, wherein the step of training in S4 to obtain a prediction model for matching graphics and text, and inputting the extracted feature codes of the graphics and text data into the prediction model for matching graphics and text, and obtaining the prediction result of the similarity between graphics and text further comprises the steps of:
constructing negative examples by random pairing, training on the data with a Triplet Loss function as the training objective of the image-text matching task, and obtaining the image-text matching prediction model after training;
extracting the feature encodings of the image and text data with Faster R-CNN and the bidirectional GRU respectively, and feeding the extracted feature encodings into the image-text matching prediction model to obtain the predicted image-text similarity.
9. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 8, wherein when training data by constructing negative examples by using a random pairing method, the negative example construction strategy further comprises the following steps:
randomly drawing one from the text set for 40% of the image data as a negative example;
selecting, for 60% of the image data, a text from the text set that shares entity words with the actually matching text as the negative example.
10. The method of claim 9, wherein the image-text matching task uses a Triplet Loss function as the training objective, and the Triplet Loss function is

L = [a - S(I, T) + S(I, \hat{T})]_+ + [a - S(I, T) + S(\hat{I}, T)]_+

wherein [x]_+ \equiv \max(x, 0), S(I, T) is the image-text similarity from S2, \hat{T} is a negative text, \hat{I} is a negative image, and a is a constant margin.
CN202210105762.2A 2022-01-28 2022-01-28 Image-text matching method based on cross-modal mutual attention mechanism Pending CN114492646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210105762.2A CN114492646A (en) 2022-01-28 2022-01-28 Image-text matching method based on cross-modal mutual attention mechanism

Publications (1)

Publication Number Publication Date
CN114492646A (en) 2022-05-13

Family

ID=81476580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210105762.2A Pending CN114492646A (en) 2022-01-28 2022-01-28 Image-text matching method based on cross-modal mutual attention mechanism

Country Status (1)

Country Link
CN (1) CN114492646A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098644A (en) * 2022-07-14 2022-09-23 平安科技(深圳)有限公司 Image and text matching method and device, electronic equipment and storage medium
CN116150418A (en) * 2023-04-20 2023-05-23 南京邮电大学 Image-text matching method and system based on mixed focusing attention mechanism
CN116150418B (en) * 2023-04-20 2023-07-07 南京邮电大学 Image-text matching method and system based on mixed focusing attention mechanism
CN116958706A (en) * 2023-08-11 2023-10-27 中国矿业大学 Controllable generation method for image diversified description based on part-of-speech tagging
CN116958706B (en) * 2023-08-11 2024-05-14 中国矿业大学 Controllable generation method for image diversified description based on part-of-speech tagging
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination