CN112084358B - Image-text matching method based on area strengthening network with subject constraint - Google Patents

Image-text matching method based on area strengthening network with subject constraint

Info

Publication number
CN112084358B
CN112084358B CN202010918759.3A CN202010918759A CN112084358B CN 112084358 B CN112084358 B CN 112084358B CN 202010918759 A CN202010918759 A CN 202010918759A CN 112084358 B CN112084358 B CN 112084358B
Authority
CN
China
Prior art keywords
image
region
text
attention
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010918759.3A
Other languages
Chinese (zh)
Other versions
CN112084358A (en)
Inventor
吴杰
吴春雷
王雷全
路静
段海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202010918759.3A priority Critical patent/CN112084358B/en
Publication of CN112084358A publication Critical patent/CN112084358A/en
Application granted granted Critical
Publication of CN112084358B publication Critical patent/CN112084358B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image-text matching method based on a region-enhanced network with subject constraints. This task has attracted considerable interest because it relates different modalities. Existing methods mainly aggregate the similarities between region-word pairs to find the correspondence between regions and words. However, these methods rarely consider the relationships among the different regions in an image and treat all regions equally. Furthermore, focusing too heavily on region-word alignment may misinterpret the image. The invention investigates, for the first time, the correspondence between images and text on the basis of a region-enhanced network with subject constraints. A region-enhanced network with cross attention is designed, which infers fine-grained correspondence by considering the relationships between regions and reassigning region-word similarities. A subject constraint module is further proposed to summarize the central topic of the image and constrain deviation from the original image semantics. Extensive experiments on MSCOCO and Flickr30K demonstrate the effectiveness of the model.

Description

Image-text matching method based on area strengthening network with subject constraint
Technical Field
The invention relates to an image-text matching method and belongs to the technical fields of computer vision and natural language processing.
Background
A key issue in image-text matching is measuring the semantic similarity between an image and a text. Existing matching methods can be roughly classified into global semantic matching methods and local semantic matching methods. The former take the whole image and the whole text as the objects of study and learn their overall correspondence; the latter infer the similarity of the image content by aligning visual regions with textual words.
The global semantic matching method projects the image and the text into a common space and learns their correspondence at the global level. As an initial effort, Kiros et al. learned image and text representations using a CNN and an LSTM, respectively, to obtain a joint embedding space with a triplet ranking loss. On this basis, Wu et al. proposed an online learning method that preserves bidirectional relative similarity to learn image-text correspondence. However, these methods do not consider the feature distribution within a single modality. Zheng et al. therefore proposed a two-path CNN model for visual-text embedding learning and added an instance loss to account for the intra-modal data distribution. Some work has focused on improving the optimization objective. For example, Vendrov et al. proposed an objective function that learns an ordered representation preserving the partial-order structure of the visual-semantic hierarchy. Zhang et al. further improved the ability to learn discriminative image-text embeddings with a cross-modal projection classification loss and a cross-modal projection matching loss. However, pixel-level image representations often lack high-level semantic information. Huang et al. proposed learning semantic concepts and organizing them in the correct semantic order to improve the image representation. Meanwhile, Li et al. reasoned about visual representations by capturing objects and their semantic relationships. Although these studies have made great progress in image-text alignment, they lack fine-grained local analysis of image-text pairs.
The local semantic matching method aims to achieve local semantic matching by finding the correspondence between visual regions and textual words. Karpathy et al. first learned the relationships between all region-word pairs by computing their similarities, but each region-word pair is of different importance when computing the global similarity score. In recent years, many researchers have designed attention-based embedding networks that selectively focus on regions or words to learn the corresponding information. One of the most typical works is the dual attention network proposed by Nam et al., which co-locates key regions and words through multiple steps. Similarly, Ji et al. introduced a saliency model to locate salient regions, enhancing the discriminability of visual representations for image-sentence matching. Following this idea, Wang et al. proposed a method that adjusts attention according to context and sequentially aggregates local similarities with a multi-modal LSTM. Ding et al. proposed an iterative matching method with recurrent attention memory, which obtains the correspondence between images and text through multi-step comparison. In addition, Lee et al. designed a stacked cross-attention network to infer image-text matches by attending to the words associated with regions or to the regions associated with words.
However, these methods treat all image regions equally and do not consider their differing complexity. Furthermore, image-text matching inferred by fine-grained alignment alone is likely to distort the true meaning of the original image, resulting in mismatches. Unlike existing methods, we employ a region-enhanced network to refine the fine-grained region-word alignment. In addition, we propose a topic constraint module to summarize the central topic of an image and constrain deviation from its original semantics.
Disclosure of Invention
The invention aims to solve the problem that image-text matching methods based on stacked attention mechanisms rarely consider the relationships among different regions in an image and treat all regions uniformly. Moreover, paying too much attention to the alignment of region-word pairs may distort the true meaning of the original image.
The technical scheme adopted to solve the above technical problem is as follows:
S1, constructing a region enhancement module for the image, and assigning different weights to different regions according to their contribution to the image.
S2, combining the enhanced features from S1, and adaptively reassigning the region-word pair similarities according to the learned weights.
S3, constructing an image topic constraint module that summarizes the central topic of the image to constrain the semantic deviation from the original image.
S4, combining the networks of S2 and S3 to construct the region-enhanced network architecture with subject constraints.
S5, training and performing image-text matching based on the region-enhanced network with subject constraints.
First, given the local features X ∈ R^(d×m), we apply average pooling and max pooling along the horizontal dimension, concatenate the results, and generate an effective feature map through a convolution operation.
Here σ denotes the sigmoid function and f denotes a convolution operation.
The resulting feature map is then embedded into two new feature maps F and G, where F, G ∈ R^(d×m), and the attention weight of each region is calculated.
Here η_ij measures the effect of the j-th position on the i-th position, and m denotes the number of regions in the image; the more similar the feature representations of two regions are, the greater the correlation between them and the greater their importance to the image. Finally, the output of the region enhancement module is obtained.
This output mines the weights of the different regions in the image and strengthens the representation capability of the image. In addition, the region enhancement module can also serve as a weight assignment scheme for adaptively distinguishing region-word similarities.
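For illustration only, the following PyTorch-style sketch shows one possible realization of the region enhancement module described above. The concrete choices (linear embeddings for the feature maps F and G, a softmax over the region-region affinities, and a residual connection) are assumptions made for the sketch, since the corresponding formulas are only available as images in the original filing; it is not the definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionEnhancement(nn.Module):
    """Sketch of the region enhancement module: weights each of the m region
    features (dimension d) by its estimated contribution to the whole image."""

    def __init__(self, dim=1024):
        super().__init__()
        # 1-D convolution over the concatenated avg/max pooled descriptors
        self.conv = nn.Conv1d(2, 1, kernel_size=3, padding=1)
        # two linear maps producing the feature maps F and G used for the
        # region-region attention weights (linear embeddings are an assumption)
        self.embed_f = nn.Linear(dim, dim)
        self.embed_g = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, m, d) -- m region features of dimension d
        avg = x.mean(dim=2, keepdim=True)              # average pooling over d
        mx, _ = x.max(dim=2, keepdim=True)             # max pooling over d
        pooled = torch.cat([avg, mx], dim=2)           # (batch, m, 2)
        gate = torch.sigmoid(self.conv(pooled.transpose(1, 2)))  # (batch, 1, m)
        x_tilde = x * gate.transpose(1, 2)             # gated region features

        f = self.embed_f(x_tilde)                      # feature map F
        g = self.embed_g(x_tilde)                      # feature map G
        # eta[i, j]: effect of region j on region i (softmax is an assumption)
        eta = F.softmax(torch.bmm(f, g.transpose(1, 2)), dim=-1)  # (batch, m, m)
        # enhanced output: attention-weighted sum plus residual (assumption)
        return torch.bmm(eta, x_tilde) + x
```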
The fine-grained alignment in the present invention attends differently to regions and words, which serve as context for each other when inferring similarity. The cross-modal attention mechanism can therefore be divided into two types of attention modules: image-to-text (I2T) and text-to-image (T2I). Unlike approaches that employ the I2T and T2I attention mechanisms separately, we add them together to obtain a more adequate local alignment.
For the I2T attention module:
first, we infer the importance of all words to each region and then determine the importance of image regions to sentences. To achieve this goal, a similarity matrix of region-word pairs is computed:
the weight of each word to the i-th zone is expressed as:
in the formula ,αit To control the scale factor of the attention distribution flatness. The text-level attention feature Li is obtained by a weighted combination of word representations:
then, the relevance of the ith region to the corresponding text level vector is calculated using the Li of each region as a context:
the similarity of the image X and the sentence Y is calculated as follows:
for the T2I attention module:
Similarly, we first infer the importance of all regions to each word and then determine the importance of each word to the image attention vector. The similarity matrix S_ti of all region-word pairs is measured using the following equation.
The weight of each region to the t-th word is expressed as
Here α_ti is a scale factor controlling the flatness of the attention distribution. The image-level attention feature L_t is obtained by a weighted combination of the image region features:
Then, taking the L_t of each word as context, the relevance of the t-th word to its corresponding image-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
Finally, the visual-semantic similarity of the image X and the text Y is calculated by combining the two directions:
r(X, Y) = r_i2t(X, Y) + r_t2i(X, Y) (14)
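For illustration only, the following sketch shows how the bidirectional cross attention culminating in Eq. (14) could be computed for one image-sentence pair. The cosine similarity matrix, the temperature-scaled softmax (with alpha playing the role of the scale factors α_it and α_ti), and the simple averaging over regions and words are assumptions made for the sketch, since the intermediate formulas are only available as images in the original filing.

```python
import torch
import torch.nn.functional as F


def xattn_score(regions, words, alpha=9.0):
    """Sketch of the cross-attention similarity r(X, Y) = r_i2t + r_t2i.

    regions: (m, d) region features of one image
    words:   (n, d) word features of one sentence
    alpha:   assumed scale factor controlling attention flatness
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # region-word similarity matrix (cosine similarity is an assumption)
    s = regions @ words.t()                        # (m, n)

    # I2T: attend over words for every region, then compare region to L_i
    attn_i2t = F.softmax(alpha * s, dim=1)         # weight of each word per region
    L_i = attn_i2t @ words                         # text-level attention features
    r_i2t = F.cosine_similarity(regions, L_i, dim=-1).mean()

    # T2I: attend over regions for every word, then compare word to L_t
    attn_t2i = F.softmax(alpha * s.t(), dim=1)     # weight of each region per word
    L_t = attn_t2i @ regions                       # image-level attention features
    r_t2i = F.cosine_similarity(words, L_t, dim=-1).mean()

    # Eq. (14): combine the two directions
    return r_i2t + r_t2i
```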
the theme constraint module aims at summarizing the theme of the image and constraining the semantic deviation of the original information of the image so as to help the model to understand the image correctly. Specifically, given a local feature X ε R d×m We first aggregate the region information of the feature map using the average pooling and max pooling operations to generate two different context descriptors, xavg and Xmax. The output eigenvectors are then element-level summed. Calculating the topic attention weight:
θ = σ(f([X_avg + X_max])) (15)
Here σ denotes the sigmoid function and f denotes a convolution operation. This attention focuses on "what" is meaningful in an image. The topic feature is then generated as follows:
The operation applied here is element-wise multiplication. To refine the topic feature and avoid deviation from the original features, the topic feature I is updated as follows:
g_i = sigmoid(W_g b_i + b_g) (17)
o_i = tanh(W_o b_i + b_o) (18)
Here W_g, W_o, b_g, and b_o are learned parameters, and g_i is used to select the most prominent information. Finally, the topic representation of the whole image is gradually updated in the hidden state I, giving the final topic feature:
I = GRU(g_i * x_i + (1 - g_i) * o_i) (19)
after the theme constraint module is processed, the theme of the image I is summarized, and the deviation of the original characteristics is constrained. For text, we use a text encoder to map text sentences to a semantic vector space T ε R that has the same dimension as I d A similarity score is then calculated for the image and text.
The image-text matching method based on the area strengthening network with the theme constraint comprises an area strengthening module, a theme constraint module and an area strengthening network with the theme constraint.
Finally, the training method based on the regional reinforcement network with the subject constraint is as follows:
In our implementation, all experiments were performed with the PyTorch framework under Python 3.6, on a computer with an Nvidia Tesla P100 GPU. For each sentence, the word embedding size is set to 300 dimensions, and words are encoded into 1024-dimensional vectors using a bidirectional GRU. For image preprocessing, a bottom-up attention model is used to extract region features; each image feature vector is set to 1024 dimensions, the same dimension as the text features. Our model is trained with the Adam optimizer for 20 epochs on the MSCOCO dataset and 30 epochs on the Flickr30K dataset. The learning rate is set to 0.0005 on MSCOCO and 0.0002 on Flickr30K. In addition, the parameters β and ε are both set to 0.5, and the parameters λ and μ are set to 20 and 0.2, respectively.
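For illustration only, the following sketch shows a training setup consistent with the settings above. The learning rates, epoch counts, and dimensions come from the text; treating μ = 0.2 as the margin of a hinge-based bidirectional triplet ranking loss is an assumption, and the full matching network is replaced by a placeholder module.

```python
import torch
import torch.nn as nn

def triplet_ranking_loss(scores, margin=0.2):
    """scores: (B, B) similarity matrix between B images and B sentences.
    Hinge-based bidirectional ranking loss (assumed form of the objective)."""
    diag = scores.diag().view(-1, 1)
    cost_s = (margin + scores - diag).clamp(min=0)       # image -> negative captions
    cost_im = (margin + scores - diag.t()).clamp(min=0)  # caption -> negative images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.sum() + cost_im.sum()

model = nn.Linear(1024, 1024)  # placeholder for the full matching network
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)  # 0.0002 for Flickr30K
num_epochs = 20  # 30 for Flickr30K
```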
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a region-enhanced network for image-text matching, which assigns different weights to different regions in an image according to their contribution to the image. The region-word similarities are then adaptively reassigned according to the learned weights, improving image-text matching accuracy.
2. The invention provides a topic constraint module, which summarizes the central topic of an image, helps the model understand the image correctly, avoids deviation from the original image semantics, and further constrains the correspondence between the image and the text.
Drawings
Fig. 1 is a schematic diagram of an image-text matching method based on a region-enhanced network with subject constraints.
FIG. 2 is a schematic diagram of a region enhancement module.
FIG. 3 is a schematic diagram of a model of a regional augmentation network with cross-attention.
FIG. 4 is a schematic diagram of a model of a subject constraint module.
Fig. 5 and 6 are graphs comparing results on MSCOCO and Flickr30K datasets based on image-text matching of a region enhanced network with subject constraints with image-text matching of other networks.
Fig. 7 and 8 are visual result diagrams of image matching text and text matching images.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The invention is further illustrated in the following figures and examples.
FIG. 1 is a schematic diagram of the architecture based on a region-enhanced network with subject constraints. As shown in FIG. 1, the framework for image-text matching mainly consists of two parts: region enhancement (upper) and subject constraint (lower).
FIG. 2 is a schematic diagram of the region enhancement module. As shown in FIG. 2, given the input local features X ∈ R^(d×m), we first apply average pooling and max pooling along the horizontal dimension, concatenate the results, and generate an effective feature map through a convolution operation.
Here σ denotes the sigmoid function and f denotes a convolution operation. The resulting feature map is then embedded into two new feature maps F and G, where F, G ∈ R^(d×m), and the attention weight of each region is calculated.
Here η_ij measures the effect of the j-th position on the i-th position, and m denotes the number of regions in the image; the more similar the feature representations of two regions are, the greater the correlation between them and the greater their importance to the image. Finally, the output of the region enhancement module is obtained.
This output mines the weights of the different regions in the image and strengthens the representation capability of the image. In addition, the region enhancement module can also serve as a weight assignment scheme for adaptively distinguishing region-word similarities.
FIG. 3 is a schematic diagram of the region-enhanced network with cross attention. As shown in FIG. 3, the fine-grained alignment in the present invention attends differently to image regions and words, which serve as context for each other when inferring similarity. The cross-modal attention mechanism can therefore be divided into two types of attention modules: image-to-text (I2T) and text-to-image (T2I). Unlike approaches that employ the I2T and T2I attention mechanisms separately, we add them together to obtain a more adequate local alignment.
For the I2T attention module:
first, we infer the importance of all words to each region and then determine the importance of image regions to sentences. To achieve this goal, a similarity matrix of region-word pairs is computed:
the weight of each word to the i-th zone is expressed as:
in the formula ,αit To control the scale factor of the attention distribution flatness. The text-level attention feature Li is obtained by a weighted combination of word representations:
then, the relevance of the ith region to the corresponding text level vector is calculated using the Li of each region as a context:
the similarity of the image X and the sentence Y is calculated as follows:
for the T2I attention module:
Similarly, we first infer the importance of all regions to each word and then determine the importance of each word to the image attention vector. The similarity matrix S_ti of all region-word pairs is measured using the following equation.
The weight of each region to the t-th word is expressed as
Here α_ti is a scale factor controlling the flatness of the attention distribution. The image-level attention feature L_t is obtained by a weighted combination of the image region features:
Then, taking the L_t of each word as context, the relevance of the t-th word to its corresponding image-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
Finally, the visual-semantic similarity of the image X and the text Y is calculated by combining the two directions:
r(X, Y) = r_i2t(X, Y) + r_t2i(X, Y) (14)
FIG. 4 is a schematic diagram of the subject (topic) constraint module. As shown in FIG. 4, given the local features X ∈ R^(d×m), we first aggregate the region information of the feature map using average pooling and max pooling to generate two different context descriptors, X_avg and X_max. The resulting feature vectors are then summed element-wise, and the topic attention weight is calculated:
θ = σ(f([X_avg + X_max])) (15)
Here σ denotes the sigmoid function and f denotes a convolution operation. This attention focuses on "what" is meaningful in an image. The topic feature is then generated as follows:
The operation applied here is element-wise multiplication. To refine the topic feature and avoid deviation from the original features, the topic feature I is updated as follows:
g_i = sigmoid(W_g b_i + b_g) (17)
o_i = tanh(W_o b_i + b_o) (18)
Here W_g, W_o, b_g, and b_o are learned parameters, and g_i is used to select the most prominent information. Finally, the topic representation of the whole image is gradually updated in the hidden state I, giving the final topic feature:
I = GRU(g_i * x_i + (1 - g_i) * o_i) (19)
after the theme constraint module is processed, the theme of the image I is summarized, so that the original is avoidedOffset of features. For text, we use a text encoder to map text headers to a semantic vector space T ε R that has the same dimension as I d A similarity score is then calculated for the image and text.
Fig. 5 and 6 are graphs comparing results on MSCOCO and Flickr30K datasets based on image-text matching of a region enhanced network with subject constraints with image-text matching of other networks. As shown in fig. 5 and 6, the image-text matching results based on the area-enhanced network with subject constraints are more accurate than other models.
Fig. 7 and 8 are visual result diagrams of image matching text and text matching images. As shown in fig. 7, given an image, corresponding text can be matched based on the region-enhanced network model with subject constraints. Given text, corresponding pictures can be matched based on the region-enhanced network model with subject constraints, as shown in fig. 8.
The invention provides an image-text matching method based on a region-enhanced network with subject constraints. The designed region-enhanced network infers latent correspondences by assigning different weights to image regions and reassigning the similarities of region-word pairs. A topic constraint module is also proposed, which constrains deviation from the original image semantics by summarizing the topic of the image. Extensive experiments on the MSCOCO and Flickr30K datasets show that the model has a positive effect on image-text matching. In future work, we will continue to explore how to better learn the semantic correspondence between images and text.
Finally, the details of the above examples of the invention are provided only for illustrating the invention, and any modifications, improvements, substitutions, etc. of the above embodiments should be included in the scope of the claims of the invention.

Claims (5)

1. An image-text matching method based on a region-enhanced network with subject constraints, the method comprising the steps of:
s1, constructing a region strengthening module of an image, and giving different weights to different regions according to the contribution degree of the regions to the image;
s2, combining the reinforcement features in the S1, and adaptively reassigning the similarity of the region-word pairs according to the learned weight;
s3, constructing an image theme constraint module, summarizing semantic deviation of a central theme constraint original image of the image, wherein the specific process is as follows:
the object of the topic constraint module is to summarize the topic of the image and to constrain the semantic deviation of the original information of the image, thereby helping the model to understand the image correctly; in particular, given the local features X ∈ R^(d×m), we first aggregate the region information of the feature map using average pooling and max pooling operations to generate two different context descriptors, X_avg and X_max, and then sum the output feature vectors element-wise to calculate the topic attention weight:
θ = σ(f([X_avg + X_max])) (15)
where σ refers to the sigmoid function and f represents a convolution operation; the attention focuses on "what" is meaningful for an image, and the topic feature is then generated as follows:
the operation applied here is element-wise multiplication; in order to refine the topic feature and avoid bias of the original features, we update the topic feature I as follows:
g_i = sigmoid(W_g b_i + b_g) (17)
o_i = tanh(W_o b_i + b_o) (18)
wherein W_g, W_o, b_g, and b_o are learned parameters, g_i is used to select the most prominent information, and finally, the topic representation of the whole image is gradually updated in the hidden state I to obtain the final topic feature:
I = GRU(g_i * x_i + (1 - g_i) * o_i) (19)
after processing by the topic constraint module, the topic of the image is summarized in I and the deviation of the original features is constrained; for text, a text encoder is used to map the text sentence to a semantic vector space T ∈ R^d with the same dimension as I, and then the similarity scores between the images and the texts are calculated:
s4, constructing an area strengthening network architecture based on the constraint of the subject by combining the network in the S2 and the network in the S3;
s5, training and image-text matching based on the area strengthening network with the subject constraint.
2. The image-text matching method based on the area strengthening network with subject constraints according to claim 1, wherein the specific process of S1 is:
the region enhancement module is derived from a self-attention mechanism, which enhances the representation capability of the regions by assigning them different weights according to their contribution to the image; the detailed operation is described below:
given the local features X ∈ R^(d×m), we first apply average pooling and max pooling operations along the horizontal dimension, then concatenate them and generate an effective feature through a convolution operation,
where σ refers to a sigmoid function, f represents a convolution operation,
the resulting feature map is then embedded into two new feature maps F and G, where F, G ∈ R^(d×m), and then the attention weight of the region is calculated,
wherein η_ij measures the effect of the j-th position on the i-th position, m represents the number of regions in the image, and the more similar the feature representations of two regions are, the greater the correlation between them and the greater their importance to the image; finally, the output of the region enhancement module is:
wherein the output mines the weights of the different regions in the image and strengthens the representation capability of the image.
3. The image-text matching method based on the area strengthening network with subject constraints according to claim 1, wherein the specific process of S2 is:
fine-grained alignment attends differently to regions and words, which serve as context for each other when inferring similarity, so the cross-modal cross-attention mechanism can be divided into two classes of attention modules, image-to-text (I2T) and text-to-image (T2I); unlike approaches that use the I2T and T2I attention mechanisms separately, we add them together to obtain a more adequate local alignment,
for the I2T attention module:
first, we infer the importance of all words to each region and then determine the importance of image regions to sentences, to achieve this goal, calculate the similarity matrix for the region-word pairs:
the weight of each word with respect to the i-th region is expressed as:
where α_it is a scale factor controlling the flatness of the attention distribution, and the text-level attention feature L_i is derived by a weighted combination of word representations:
then, taking the L_i of each region as context, the relevance of the i-th region to the corresponding text-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
for the T2I attention module:
similarly, we first infer the importance of all regions to each word, then determine the importance of each word to the image attention vector, and measure the similarity matrix S_ti of all region-word pairs using the following equation,
the weight of each region to the t-th word is expressed as
where α_ti is a scale factor controlling the flatness of the attention distribution, and the image-level attention feature L_t is obtained by a weighted combination of image region features:
then, taking the L_t of each word as context, the relevance of the t-th word to the corresponding image-level vector is calculated:
the similarity of the image X and the sentence Y is calculated as follows:
finally, the visual-semantic similarity of the image X and the text Y is calculated by combining the two directions:
r(X, Y) = r_i2t(X, Y) + r_t2i(X, Y) (14).
4. the image-text matching method based on the area strengthening network with subject constraints according to claim 1, wherein the specific process of S4 is:
the image-text matching method based on the area strengthening network with the theme constraint comprises an area strengthening module, a theme constraint module and an area strengthening network with the theme constraint.
5. The image-text matching method based on the area strengthening network with subject constraints according to claim 1, wherein the specific process of S5 is:
the training method based on the area strengthening network with the subject constraint is as follows:
in our implementation, all experiments were performed with the PyTorch framework under Python 3.6 on a computer with an Nvidia Tesla P100 GPU; for each sentence, the word embedding size is set to 300 dimensions and words are encoded into 1024-dimensional vectors using a bidirectional GRU; for image preprocessing, a bottom-up attention model is used to extract region features, each image feature vector is set to 1024 dimensions, and the image feature dimension is the same as that of the text; our model is trained with the Adam optimizer for 20 epochs on the MSCOCO dataset and 30 epochs on the Flickr30k dataset; the learning rate is set to 0.0005 on the MSCOCO dataset and 0.0002 on the Flickr30k dataset; in addition, the parameters β and ε are both set to 0.5, and the parameters λ and μ are set to 20 and 0.2, respectively.
CN202010918759.3A 2020-09-04 2020-09-04 Image-text matching method based on area strengthening network with subject constraint Active CN112084358B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010918759.3A CN112084358B (en) 2020-09-04 2020-09-04 Image-text matching method based on area strengthening network with subject constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010918759.3A CN112084358B (en) 2020-09-04 2020-09-04 Image-text matching method based on area strengthening network with subject constraint

Publications (2)

Publication Number Publication Date
CN112084358A CN112084358A (en) 2020-12-15
CN112084358B true CN112084358B (en) 2023-10-27

Family

ID=73731650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010918759.3A Active CN112084358B (en) 2020-09-04 2020-09-04 Image-text matching method based on area strengthening network with subject constraint

Country Status (1)

Country Link
CN (1) CN112084358B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN113902764A (en) * 2021-11-19 2022-01-07 东北大学 Semantic-based image-text cross-modal retrieval method
CN114547235B (en) * 2022-01-19 2024-04-16 西北大学 Construction method of image text matching model based on priori knowledge graph

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104584013A (en) * 2012-08-27 2015-04-29 Microsoft Corporation Semantic query language
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954425B2 (en) * 2010-06-08 2015-02-10 Microsoft Corporation Snippet extraction and ranking

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104584013A (en) * 2012-08-27 2015-04-29 Microsoft Corporation Semantic query language
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN111242197A (en) * 2020-01-07 2020-06-05 中国石油大学(华东) Image and text matching method based on double-view-domain semantic reasoning network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image captioning model based on a dual-path refined attention mechanism; Cong Luwen; Computer Systems & Applications (05); full text *

Also Published As

Publication number Publication date
CN112084358A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
Zhang et al. Improved deep hashing with soft pairwise similarity for multi-label image retrieval
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN112084358B (en) Image-text matching method based on area strengthening network with subject constraint
Klein et al. Associating neural word embeddings with deep image representations using fisher vectors
Bu et al. Learning high-level feature by deep belief networks for 3-D model retrieval and recognition
CN114511906A (en) Cross-modal dynamic convolution-based video multi-modal emotion recognition method and device and computer equipment
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
Liao et al. Intelligent generative structural design method for shear wall building based on “fused-text-image-to-image” generative adversarial networks
CN111242197B (en) Image text matching method based on double-view semantic reasoning network
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN108154156B (en) Image set classification method and device based on neural topic model
CN113191357A (en) Multilevel image-text matching method based on graph attention network
CN107305543B (en) Method and device for classifying semantic relation of entity words
CN111460824A (en) Unmarked named entity identification method based on anti-migration learning
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116821391A (en) Cross-modal image-text retrieval method based on multi-level semantic alignment
Hu et al. Sketch-a-segmenter: Sketch-based photo segmenter generation
CN106021402A (en) Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval
Choi CNN output optimization for more balanced classification
CN110569355A (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113869005A (en) Pre-training model method and system based on sentence similarity
CN117150069A (en) Cross-modal retrieval method and system based on global and local semantic comparison learning
Zhang et al. Weighted score-level feature fusion based on Dempster–Shafer evidence theory for action recognition
CN112861848B (en) Visual relation detection method and system based on known action conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant