Image-text matching method based on a region-enhanced network with subject constraint
Technical Field
The invention relates to an image-text matching method and belongs to the technical field of computer vision and natural language processing.
Background
A key issue in image-text matching is measuring the semantic similarity between an image and a text. Existing matching methods can be roughly classified into global semantic matching methods and local semantic matching methods. The former takes the whole image and the whole text as research objects and learns their overall correspondence; the latter infers the similarity of image and text content by aligning visual regions with text words.
The global semantic matching method projects the image and the text into a common space and learns their correspondence at the global scope. As an initial effort, Kiros et al. learn image and text representations using a CNN and an LSTM, respectively, to learn a joint embedding space with a triplet ranking loss. On this basis, Wu et al. propose an online learning method that preserves bi-directional relative similarity to learn image-text correspondence. However, they do not consider the feature distribution of a single modality. Thus, Zheng et al. propose a two-path CNN model for visual-text embedding learning and add an instance loss to account for the intra-modal data distribution. Some work has focused on improving the optimization function. For example, Vendrov et al. propose an objective function that learns an ordered representation preserving the partially ordered structure of the visual-semantic hierarchy. Zhang et al. further improve the ability to learn discriminative image-text embeddings with a cross-modal projection classification loss and a cross-modal projection matching loss. However, representations of pixel-level images often lack high-level semantic information. Huang et al. propose learning semantic concepts and organizing them in the correct semantic order to improve the image representation. Meanwhile, Li et al. reason about visual representations by capturing objects and their semantic relationships. Although these studies make great progress in image-text alignment, they lack fine local analysis of image-text pairs.
The local semantic matching method aims to realize local semantic matching and to find the correspondence between visual regions and text words. Karpathy et al. first learned the relationship between all region-word pairs by computing their similarity. However, each region-word pair is of different importance when calculating the global similarity score. In recent years, many researchers have designed embedding networks based on attention mechanisms, selectively focusing on regions or words to learn corresponding information. One of the most typical works is the dual attention network proposed by Nam et al., which co-locates key regions and words through multiple steps. Similarly, Ji et al. introduce a saliency model to locate salient regions, enhancing the discrimination of visual representations for image-sentence matching. Based on this idea, Wang et al. propose a method to adjust attention according to context and sequentially aggregate local similarities using a multi-modal LSTM. Ding et al. propose an iterative matching method with recurrent attention memory, which obtains the correspondence between images and text by multi-step comparison. In addition, Lee et al. design a stacked cross-attention network to infer image-text matches by attending to words associated with regions or regions associated with words.
However, these methods treat all image regions equally and do not consider their different complexity. Furthermore, an image-text match inferred by fine-grained alignment alone is likely to distort the true meaning of the original image, resulting in a mismatch. Unlike existing methods, we employ a region-enhanced network to refine fine-grained region-word alignment. In addition, we propose a subject constraint module to summarize the central subject of an image, constraining the semantic deviation of the original image.
Disclosure of Invention
The invention aims to solve the problem that, in image-text matching methods based on a stacked attention mechanism, the relations among different regions in an image are rarely considered and all regions are treated uniformly. Moreover, paying too much attention to the alignment of region-word pairs may distort the true meaning of the original image.
The technical scheme adopted for solving the technical problems is as follows:
S1, constructing a region enhancement module for the image, which gives different weights to different regions according to their contribution to the image.
S2, combining the enhanced features from S1, and adaptively reassigning the similarity of region-word pairs according to the learned weights.
S3, constructing an image subject constraint module, which summarizes the central subject of the image to constrain the semantic deviation of the original image.
S4, combining the networks in S2 and S3 to construct a region-enhanced network architecture with subject constraint.
S5, training and performing image-text matching based on the region-enhanced network with subject constraint.
First, for a local feature X ∈ R^{d×m}, we apply the average pooling and max pooling operations along the horizontal dimension, and then concatenate them to generate an effective feature through a convolution operation:
Where σ refers to a sigmoid function and f represents a convolution operation.
The feature is then embedded into two new feature maps F and G, where F, G ∈ R^{d×m}, and the attention weight of each region is calculated:
where η_ij measures the effect of the j-th position on the i-th position, and m represents the number of regions in the image. The more similar the feature representations of two regions, the greater the correlation between them, and the more they contribute to the meaning of the image. Finally, the output of the region enhancement module is:
In this way, the module mines the weights of different regions in the image and strengthens the representation capability of the image. In addition, the region enhancement module can also be used as a weight distribution scheme for adaptively distinguishing region-word similarity.
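For illustration only, the following PyTorch sketch shows one possible form of the region enhancement module consistent with the description above; the convolution kernel size, the choice of pooling axis, the softmax form of η_ij, and the residual connection are assumptions of this sketch, not the exact formulation of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionEnhancement(nn.Module):
    """Sketch of a region enhancement module (assumed form).

    Input:  X of shape (batch, d, m), i.e. d-dimensional features for m regions.
    Output: enhanced features of the same shape, re-weighted by region importance.
    """

    def __init__(self, d):
        super().__init__()
        # f(.): a convolution over the concatenated average/max statistics.
        self.conv = nn.Conv1d(2, 1, kernel_size=3, padding=1)
        # Two embeddings producing the new feature maps F and G.
        self.proj_f = nn.Linear(d, d)
        self.proj_g = nn.Linear(d, d)

    def forward(self, X):
        # Average and max pooling along the feature dimension, one statistic per region.
        avg = X.mean(dim=1, keepdim=True)                           # (batch, 1, m)
        mx = X.max(dim=1, keepdim=True).values                      # (batch, 1, m)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # (batch, 1, m)
        Xw = X * w                                                   # re-weighted regions

        # Embed into the two new feature maps F and G (each d x m per sample).
        Fm = self.proj_f(Xw.transpose(1, 2))                         # (batch, m, d)
        Gm = self.proj_g(Xw.transpose(1, 2))                         # (batch, m, d)

        # eta_ij: effect of region j on region i (assumed softmax over dot products).
        eta = F.softmax(torch.bmm(Fm, Gm.transpose(1, 2)), dim=-1)   # (batch, m, m)

        # Aggregate regions by their mutual correlation and keep a residual connection.
        out = torch.bmm(eta, Xw.transpose(1, 2)).transpose(1, 2) + X
        return out                                                   # (batch, d, m)
```

In the overall method, a module of this kind would be applied to the region features extracted by the bottom-up attention model before region-word similarities are computed.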
The fine-grained alignment of the present invention attends differently to regions and words, which serve as context for each other when inferring similarity. Thus, the cross-modal attention mechanism can be divided into two types of attention modules: image-to-text (I2T) and text-to-image (T2I). Unlike approaches that employ the I2T and T2I attention mechanisms separately, we add them to obtain a more adequate local alignment.
For the I2T attention module:
First, we infer the importance of all words to each region and then determine the importance of image regions to the sentence. To achieve this goal, a similarity matrix of region-word pairs is computed:
The weight of each word to the i-th region is expressed as:
where α_it is a scale factor that controls the flatness of the attention distribution. The text-level attention feature L_i is obtained by a weighted combination of the word representations:
Then, taking the L_i of each region as its context, the relevance of the i-th region to the corresponding text-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
For the T2I attention module:
Similarly, we first infer the importance of all regions to each word and then determine the importance of each word to the image attention vector. The similarity matrix S_ti of all region-word pairs is measured using the following equation:
The weight of each region to the t-th word is expressed as:
where α_ti is a scale factor that controls the flatness of the attention distribution. The image-level attention feature L_t is obtained by a weighted combination of image region features:
Then, taking the L_t of each word as its context, the relevance of the t-th word to the corresponding image-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
Finally, the visual-semantic similarity of the image X and the text Y is calculated by combining the two directions:
r(X, Y) = r_i2t(X, Y) + r_t2i(X, Y)    (14)
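For illustration only, the sketch below shows one possible form of the bidirectional cross-attention similarity described above; the cosine similarities, the temperature-scaled softmax standing in for the scale factors α_it and α_ti (with placeholder values), and the averaging of local relevance scores are assumptions made because the intermediate formulas are not reproduced here in text form.

```python
import torch.nn.functional as F


def i2t_similarity(X, Y, alpha=9.0):
    """I2T direction: attend over the words of Y for each region of X (assumed form).

    X: (m, d) region features, Y: (n, d) word features, alpha: placeholder for
    the scale factor controlling the flatness of the attention distribution.
    """
    Xn, Yn = F.normalize(X, dim=-1), F.normalize(Y, dim=-1)
    S = Xn @ Yn.t()                               # region-word similarities, (m, n)
    attn = F.softmax(alpha * S, dim=1)            # weight of each word for region i
    L = attn @ Y                                  # text-level attention feature L_i, (m, d)
    r_local = F.cosine_similarity(X, L, dim=-1)   # relevance of region i to L_i
    return r_local.mean()                         # r_i2t(X, Y)


def t2i_similarity(X, Y, alpha=4.0):
    """T2I direction: attend over the regions of X for each word of Y (assumed form)."""
    Xn, Yn = F.normalize(X, dim=-1), F.normalize(Y, dim=-1)
    S = Yn @ Xn.t()                               # word-region similarities, (n, m)
    attn = F.softmax(alpha * S, dim=1)            # weight of each region for word t
    L = attn @ X                                  # image-level attention feature L_t, (n, d)
    r_local = F.cosine_similarity(Y, L, dim=-1)   # relevance of word t to L_t
    return r_local.mean()                         # r_t2i(X, Y)


def visual_semantic_similarity(X, Y):
    # Eq. (14): add the two directions for a fuller local alignment.
    return i2t_similarity(X, Y) + t2i_similarity(X, Y)
```

During training, a similarity of this form would typically be plugged into a ranking objective over matched and mismatched image-text pairs.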
the theme constraint module aims at summarizing the theme of the image and constraining the semantic deviation of the original information of the image so as to help the model to understand the image correctly. Specifically, given a local feature X ε R d×m We first aggregate the region information of the feature map using the average pooling and max pooling operations to generate two different context descriptors, xavg and Xmax. The output eigenvectors are then element-level summed. Calculating the topic attention weight:
θ = σ(f([X_avg + X_max]))    (15)
where σ refers to the sigmoid function and f represents a convolution operation. This attention focuses on "what" is meaningful in an image. The subject features are then generated as follows:
where the multiplication is performed element-wise. In order to refine the subject features and avoid deviation from the original features, the subject feature I is updated as follows:
g_i = sigmoid(W_g b_i + b_g)    (17)
o_i = tanh(W_o b_i + b_o)    (18)
where W_g, W_o, b_g, and b_o are learned parameters. g_i is used to select the most prominent information. Finally, the subject representation of the whole image is gradually updated in the hidden state I, and the final subject feature is obtained:
I = GRU(g_i * x_i + (1 - g_i) * o_i)    (19)
After processing by the subject constraint module, the subject of the image, I, is summarized and the deviation of the original features is constrained. For text, we use a text encoder to map the text sentence to a semantic vector T ∈ R^d that has the same dimension as I, and then a similarity score is calculated between the image and the text.
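For illustration only, the following sketch outlines one possible form of the subject constraint module following Eqs. (15)-(19); treating b_i as the subject-attended region feature, the convolution used for f, and the per-region GRU update are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class SubjectConstraint(nn.Module):
    """Sketch of a subject constraint module (assumed form).

    X: (batch, d, m) region features -> I: (batch, d) subject feature.
    """

    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # f(.) in Eq. (15)
        self.W_g = nn.Linear(d, d)                              # gate in Eq. (17)
        self.W_o = nn.Linear(d, d)                              # candidate in Eq. (18)
        self.gru = nn.GRU(d, d, batch_first=True)               # update in Eq. (19)

    def forward(self, X):
        # Two context descriptors over the regions, summed element-wise (Eq. 15).
        x_avg = X.mean(dim=2)                                    # (batch, d)
        x_max = X.max(dim=2).values                              # (batch, d)
        theta = torch.sigmoid(
            self.conv((x_avg + x_max).unsqueeze(1))).squeeze(1)  # (batch, d)

        # Subject-attended region features (Eq. 16): element-wise re-weighting.
        b = (X * theta.unsqueeze(2)).transpose(1, 2)             # (batch, m, d)

        # Gated refinement of each region feature (Eqs. 17-18).
        g = torch.sigmoid(self.W_g(b))
        o = torch.tanh(self.W_o(b))
        x = X.transpose(1, 2)                                    # (batch, m, d)
        refined = g * x + (1.0 - g) * o                          # GRU input (Eq. 19)

        # The hidden state is updated step by step over the regions; the final
        # hidden state is taken as the subject feature I of the whole image.
        _, h_n = self.gru(refined)                               # h_n: (1, batch, d)
        return h_n.squeeze(0)                                    # I: (batch, d)
```

The resulting vector I is compared with the sentence vector T produced by the text encoder to compute the image-text similarity score.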
The image-text matching method based on the region-enhanced network with subject constraint comprises a region enhancement module and a subject constraint module, which are combined into the region-enhanced network with subject constraint.
Finally, the training setup of the region-enhanced network with subject constraint is as follows:
In our implementation, all experiments were performed using the PyTorch framework with Python 3.6 on a computer with an Nvidia Tesla P100 GPU. For each sentence, the word embedding size is set to 300 dimensions, and words are encoded into 1024-dimensional vectors using a bi-directional GRU. For image preprocessing, a bottom-up attention model is adopted to extract region features; each image feature vector is set to 1024 dimensions, the same as the text feature dimension. Our model is trained with the Adam optimizer for 20 epochs on the MSCOCO dataset and 30 epochs on the Flickr30k dataset. The learning rate is set to 0.0005 on the MSCOCO dataset and 0.0002 on the Flickr30k dataset. Furthermore, the parameters β and ε are both set to 0.5, and the parameters λ and μ are set to 20 and 0.2, respectively.
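For illustration only, the following sketch shows how the training configuration described above might be assembled in PyTorch; the MatchingModel stub is a placeholder for the full region-enhanced network with subject constraint, and only the dimensions, optimizer, learning rates, and epoch counts are taken from the description above.

```python
import torch
import torch.nn as nn


class MatchingModel(nn.Module):
    """Placeholder standing in for the full matching network (illustrative only)."""

    def __init__(self, word_dim=300, embed_dim=1024):
        super().__init__()
        self.text_proj = nn.Linear(word_dim, embed_dim)

    def forward(self, img_feats, word_embs):
        # A real model would compute r(X, Y) and a matching loss here.
        txt = self.text_proj(word_embs).mean(dim=1)
        img = img_feats.mean(dim=1)
        return (1.0 - torch.cosine_similarity(img, txt)).mean()


LR = {"MSCOCO": 5e-4, "Flickr30k": 2e-4}       # learning rates from the description
EPOCHS = {"MSCOCO": 20, "Flickr30k": 30}       # training epochs from the description


def train(loader, dataset="MSCOCO"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = MatchingModel().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR[dataset])
    for epoch in range(EPOCHS[dataset]):
        for img_feats, word_embs in loader:
            optimizer.zero_grad()
            loss = model(img_feats.to(device), word_embs.to(device))
            loss.backward()
            optimizer.step()
```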
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a region-enhanced network for image-text matching, which gives different weights to different regions in an image according to their contribution to the image. The region-word similarities are then adaptively reassigned according to the learned weights, improving the accuracy of image-text matching.
2. The invention provides a subject constraint module which summarizes the central subject of an image, helps the model understand the image correctly, avoids semantic deviation from the original image, and further constrains the correspondence between the image and the text.
Drawings
Fig. 1 is a schematic diagram of an image-text matching method based on a region-enhanced network with subject constraints.
FIG. 2 is a schematic diagram of a region enhancement module.
FIG. 3 is a schematic diagram of a model of the region-enhanced network with cross-attention.
FIG. 4 is a schematic diagram of a model of a subject constraint module.
Fig. 5 and 6 are graphs comparing the image-text matching results of the region-enhanced network with subject constraint with those of other networks on the MSCOCO and Flickr30K datasets.
Fig. 7 and 8 are visual result diagrams of image matching text and text matching images.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The invention is further illustrated in the following figures and examples.
FIG. 1 is a schematic diagram of the architecture of the region-enhanced network with subject constraint. As shown in FIG. 1, the framework of the whole image-text matching method is mainly composed of two parts: region enhancement (upper) and subject constraint (lower).
FIG. 2 is a schematic diagram of the region enhancement module. As shown in FIG. 2, given the input local feature X ∈ R^{d×m}, we first apply the average pooling and max pooling operations along the horizontal dimension, and then concatenate them to generate an effective feature through a convolution operation:
where σ refers to the sigmoid function and f represents a convolution operation. The feature is then embedded into two new feature maps F and G, where F, G ∈ R^{d×m}, and the attention weight of each region is calculated:
where η_ij measures the effect of the j-th position on the i-th position, and m represents the number of regions in the image. The more similar the feature representations of two regions, the greater the correlation between them, and the more they contribute to the meaning of the image. Finally, the output of the region enhancement module is:
In this way, the module mines the weights of different regions in the image and strengthens the representation capability of the image. In addition, the region enhancement module can also be used as a weight distribution scheme for adaptively distinguishing region-word similarity.
FIG. 3 is a schematic diagram of the region-enhanced network with cross-attention. As shown in FIG. 3, the fine-grained alignment of the present invention attends differently to image regions and words, which serve as context for each other when inferring similarity. Thus, the cross-modal attention mechanism can be divided into two types of attention modules: image-to-text (I2T) and text-to-image (T2I). Unlike approaches that employ the I2T and T2I attention mechanisms separately, we add them to obtain a more adequate local alignment.
For the I2T attention module:
First, we infer the importance of all words to each region and then determine the importance of image regions to the sentence. To achieve this goal, a similarity matrix of region-word pairs is computed:
The weight of each word to the i-th region is expressed as:
where α_it is a scale factor that controls the flatness of the attention distribution. The text-level attention feature L_i is obtained by a weighted combination of the word representations:
Then, taking the L_i of each region as its context, the relevance of the i-th region to the corresponding text-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
For the T2I attention module:
Similarly, we first infer the importance of all regions to each word and then determine the importance of each word to the image attention vector. The similarity matrix S_ti of all region-word pairs is measured using the following equation:
The weight of each region to the t-th word is expressed as:
where α_ti is a scale factor that controls the flatness of the attention distribution. The image-level attention feature L_t is obtained by a weighted combination of image region features:
Then, taking the L_t of each word as its context, the relevance of the t-th word to the corresponding image-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
Finally, the visual-semantic similarity of the image X and the text Y is calculated by combining the two directions:
r(X, Y) = r_i2t(X, Y) + r_t2i(X, Y)    (14)
FIG. 4 is a schematic diagram of the subject constraint module. As shown in FIG. 4, given a local feature X ∈ R^{d×m}, we first aggregate the region information of the feature map using average pooling and max pooling operations to generate two different context descriptors, X_avg and X_max. Then, element-wise summation is performed on the two output feature vectors, and the subject attention weight is calculated:
θ = σ(f([X_avg + X_max]))    (15)
where σ refers to the sigmoid function and f represents a convolution operation. This attention focuses on "what" is meaningful in an image. The subject features are then generated as follows:
where the multiplication is performed element-wise. In order to refine the subject features and avoid deviation from the original features, the subject feature I is updated as follows:
g_i = sigmoid(W_g b_i + b_g)    (17)
o_i = tanh(W_o b_i + b_o)    (18)
where W_g, W_o, b_g, and b_o are learned parameters. g_i is used to select the most prominent information. Finally, the subject representation of the whole image is gradually updated in the hidden state I, and the final subject feature is obtained:
I = GRU(g_i * x_i + (1 - g_i) * o_i)    (19)
After processing by the subject constraint module, the subject of the image, I, is summarized and deviation from the original features is avoided. For text, we use a text encoder to map the text sentence to a semantic vector T ∈ R^d that has the same dimension as I, and then a similarity score is calculated between the image and the text.
Fig. 5 and 6 are graphs comparing the image-text matching results of the region-enhanced network with subject constraint with those of other networks on the MSCOCO and Flickr30K datasets. As shown in Fig. 5 and 6, the image-text matching results based on the region-enhanced network with subject constraint are more accurate than those of other models.
Fig. 7 and 8 are visual result diagrams of image matching text and text matching images. As shown in fig. 7, given an image, corresponding text can be matched based on the region-enhanced network model with subject constraints. Given text, corresponding pictures can be matched based on the region-enhanced network model with subject constraints, as shown in fig. 8.
The invention provides an image-text matching method based on a region-enhanced network with subject constraint, which designs a region-enhanced network that infers potential correspondences by assigning different weights to image regions and reassigning the similarity of region-word pairs. A subject constraint module is provided that constrains the deviation of the original image semantics by summarizing the subject of the image. Extensive experiments on the MSCOCO and Flickr30K datasets show that this model has a positive effect on image-text matching. In future work, we will continue to explore how to better learn the semantic correspondence between images and text.
Finally, the details of the above examples of the invention are provided only for illustrating the invention, and any modifications, improvements, substitutions, etc. of the above embodiments should be included in the scope of the claims of the invention.