CN114492646A - Image-text matching method based on cross-modal mutual attention mechanism - Google Patents

Image-text matching method based on cross-modal mutual attention mechanism Download PDF

Info

Publication number
CN114492646A
CN114492646A (application CN202210105762.2A)
Authority
CN
China
Prior art keywords
text
image
similarity
matching
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210105762.2A
Other languages
Chinese (zh)
Inventor
赵海英
魏莱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INTERNATIONAL STUDIES UNIVERSITY
Beijing University of Posts and Telecommunications
Original Assignee
BEIJING INTERNATIONAL STUDIES UNIVERSITY
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INTERNATIONAL STUDIES UNIVERSITY, Beijing University of Posts and Telecommunications filed Critical BEIJING INTERNATIONAL STUDIES UNIVERSITY
Priority to CN202210105762.2A
Publication of CN114492646A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses an image-text matching method based on a cross-modal mutual attention mechanism, which comprises the following steps: extracting image semantic features with an object detection network and text semantic features with a Chinese pre-trained model; calculating, through an image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and calculating a text vector supervised by image attention; calculating the cosine similarity between the image-attention-supervised text vector and each local image region, and obtaining the similarity between the complete text and the complete image by summation and average pooling; and calculating the predicted image-text similarity. The invention can infer the similarity between a whole image and a complete sentence, output a similarity value between the image and the sentence, and realize fine-grained image-text matching that aligns local image regions with words.

Description

Image-text matching method based on cross-modal mutual attention mechanism
Technical Field
The invention relates to the field of cross-modal image-text matching, in particular to an image-text matching method based on a cross-modal mutual attention mechanism.
Background
Image-text matching studies how to measure the similarity of images and text at the semantic level. It is closely related to multi-modal alignment, cross-modal retrieval, and similar fields, lies at the intersection of natural language processing and computer vision, and is often applied to knowledge retrieval, image-text annotation, and related tasks. In early research, because image features and text features were extracted in completely different ways, only the bag-of-words representation of the text and the visual bag-of-words of the image could be crudely aligned; later, methods based on canonical correlation analysis (CCA) appeared in cross-modal retrieval, linearly mapping the different modalities into the same space so that their correlation could be compared and ranked, with the top-ranked results taken as the image-text matching result.
However, these early image-text matching methods captured semantic relevance poorly. In recent years, with the development of deep learning, image features have been extracted with networks such as CNNs, ResNet, and VGG, and object detection networks such as Faster R-CNN and YOLO can accurately locate object regions of interest in an image, greatly improving the ability to capture image semantic features.
With the rise of deep learning for extracting image and text semantic features, image-text matching research turned to mapping the extracted image and text features into the same semantic space and then computing the similarity of the encoded feature vectors to obtain the matching result. The main image-text similarity measures include cosine distance, Jaccard distance, Euclidean distance, and Hamming distance. However, these similarity calculations only measure the distance between encoding vectors and do not take semantic distance or fine-grained matching alignment into account.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides an image-text matching method based on a cross-modal mutual attention mechanism, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
An image-text matching method based on a cross-modal mutual attention mechanism comprises the following steps:
S1, extracting image semantic features with the Faster R-CNN object detection network, and extracting text semantic features with the BERT Chinese pre-trained model;
S2, constructing an image-text similarity calculation module, calculating, through the image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and calculating a text vector supervised by image attention;
S3, calculating the cosine similarity between the image-attention-supervised text vector and each local image region, and obtaining the similarity between the complete text and the complete image by summation and average pooling;
and S4, training to obtain an image-text matching prediction model, and feeding the extracted feature encodings of the image and text data into the image-text matching prediction model to obtain the predicted image-text similarity.
Further, the step of extracting image semantic features with the Faster R-CNN object detection network in S1 further comprises the following steps:
extracting the region-of-interest feature V_r and the POS feature V_p of the image with the Faster R-CNN object detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
Further, the extraction of text semantic features with the BERT Chinese pre-trained model in S1 further comprises the following steps:
segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, feeding T_w into a bidirectional GRU network, and extracting the text feature T_l;
extracting the semantic feature T_s from the text feature T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic feature.
Further, when the image-text similarity calculation module is constructed in S2, the image semantic feature V_s and the text semantic feature T_s are fed into the image-text similarity calculation module to obtain the similarity between V_s and T_s.
Further, the calculating of the text vector supervised by image attention in S2 further includes the following steps:
dividing the image into k local regions according to the number k of regions of interest, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity between V_i and W_j as

s_{ij} = \frac{v_i^T w_j}{\|v_i\|\,\|w_j\|}, \quad i \in [1, k], \ j \in [1, n]

and normalizing it as

\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}

wherein [x]_+ \equiv \max(x, 0);
calculating the attention weight A_{ij} with the softmax function:

A_{ij} = \frac{\exp(\bar{s}_{ij})}{\sum_{j=1}^{n} \exp(\bar{s}_{ij})}

and obtaining the attended text vector a_i by a weighted sum over the words:

a_i = \sum_{j=1}^{n} A_{ij} w_j

In the above formulas, s_{ij} is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of local regions into which the image is divided according to its number of regions of interest, n is the number of words into which the sentence text is divided, and A_{ij} is the weight of the jth word computed from its degree of correlation with the image region.
Further, the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further includes the following steps:
calculating the cosine similarity R(v_i, a_i) between the attended text vector a_i and the local image region v_i, then summing the cosine similarities and average-pooling them to obtain the similarity S_AVG(I, T) between the complete text and the complete image.
Further, the similarity S_AVG(I, T) between the complete text and the complete image is calculated as

R(v_i, a_i) = \frac{v_i^T a_i}{\|v_i\|\,\|a_i\|}

S_{AVG}(I, T) = \frac{1}{k} \sum_{i=1}^{k} R(v_i, a_i)

wherein v_i is the ith local image feature, a_i is the text feature weighted by image attention supervision, and k is the number of local image blocks.
Further, the training in S4 to obtain the image-text matching prediction model and the feeding of the extracted feature encodings of the image and text data into the model to obtain the predicted image-text similarity further comprise the following steps:
constructing negative examples by random pairing, training on the data with a Triplet Loss function as the training objective of the image-text matching task, and obtaining the image-text matching prediction model after training;
extracting the feature encodings of the image and text data with Faster R-CNN and the bidirectional GRU respectively, and feeding the extracted feature encodings into the image-text matching prediction model to obtain the predicted image-text similarity.
Further, when training data by constructing a negative example by using a random pairing method, the negative example construction strategy further includes the following steps:
randomly selecting one from the text set as a negative example for 40% of the image data;
for 60% of the image data, selecting from the text set a text that shares entity words with the actually matching text as the negative example.
Further, the image-text matching task adopts a Triplet Loss function as the training objective, and the Triplet Loss function is

L = [a - S(I, T) + S(I, \hat{T})]_+ + [a - S(I, T) + S(\hat{I}, T)]_+

wherein [x]_+ \equiv \max(x, 0), S(I, T) is the image-text similarity from S2, \hat{T} is a negative text, \hat{I} is a negative image, and a is a constant margin.
The invention has the beneficial effects that: the fine-grained image-text matching based on the cross-modal mutual attention mechanism introduces the ROI features and POS features output by an object detection network when extracting image features, which both partitions the image into regions and enhances its semantic features, and introduces a self-supervised BERT Chinese pre-trained model when extracting text features to further extract semantic information. Finally, the method introduces a cross-modal attention mechanism to align local regions of interest in the image with the words in the text, obtains the complete image-text similarity by summation and average pooling, and thus infers the complete image-text similarity from local alignments, achieving a better image-text matching effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of a teletext matching method based on a cross-modal mutual attention mechanism according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, the accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the embodiments and, together with the description, serve to explain their principles of operation and to enable those of ordinary skill in the art to understand the embodiments and advantages of the invention. The drawings are not to scale, and like reference numerals generally refer to like elements.
According to an embodiment of the invention, an image-text matching method based on a cross-modal mutual attention mechanism is provided. To account for semantic distance and fine-grained matching, a cross-modal mutual attention mechanism and object detection are introduced. The attention involved in image-text matching comprises visual attention and textual attention: fine-grained information is obtained by computing word vectors after segmenting the text and by using object detection to extract ROI (region of interest) features and POS (position in the image) features of the image; the cross-modal mutual attention module then computes a text vector supervised by the image, and finally the similarity between this image-supervised text vector and the image features is calculated. This adds semantic distance information to image-text matching while realizing fine-grained alignment of words to ROI regions, and experiments show that this image-text matching scheme produces good matching results.
The method is built around a cross-modal mutual attention module: local image features are extracted with the Faster R-CNN object detection algorithm, contextual text features are extracted from the sentence with a bidirectional GRU (gated recurrent unit), and the encoded local region features of the image and word features of the sentence are fed into the cross-modal mutual attention module. The aim is to map words and image regions into a common embedding space and then infer the similarity between the whole image and the whole sentence by aligning words with image regions. Finally, a similarity value between the image and the sentence is output, realizing a fine-grained image-text matching method that aligns local image regions with words.
The invention is further described below with reference to the drawings and the detailed description. As shown in Fig. 1, in an embodiment of the invention, an image-text matching method based on a cross-modal mutual attention mechanism includes the following steps:
S1, extracting image semantic features with the Faster R-CNN object detection network, and extracting text semantic features with the BERT Chinese pre-trained model;
wherein the step of extracting image semantic features with the Faster R-CNN object detection network in S1 further comprises the following steps:
given the characteristics of the image data, extracting the ROI (region of interest) feature V_r and the POS (position) feature V_p of the image with the Faster R-CNN object detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
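For illustration only, the following minimal Python sketch shows one way V_r and V_p could be concatenated into V_s; the five-dimensional position encoding, the feature dimensions, and the function name are assumptions not specified by the invention, and the ROI features are presumed to come from a detector such as Faster R-CNN.

import numpy as np

def build_image_features(roi_feats, boxes, image_w, image_h):
    """Illustrative sketch: concatenate ROI features V_r with position (POS) features V_p.

    roi_feats: (k, d) array of per-region features, e.g. from a Faster R-CNN detector
    boxes:     (k, 4) array of [x1, y1, x2, y2] coordinates of the k regions of interest
    Returns:   (k, d + 5) array V_s; the 5-dim position encoding is an assumption."""
    x1, y1, x2, y2 = boxes.T
    pos = np.stack([
        x1 / image_w, y1 / image_h,                     # normalized top-left corner
        x2 / image_w, y2 / image_h,                     # normalized bottom-right corner
        (x2 - x1) * (y2 - y1) / (image_w * image_h),    # relative region area
    ], axis=1)                                          # POS feature V_p, shape (k, 5)
    return np.concatenate([roi_feats, pos], axis=1)     # image semantic feature V_s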
The extraction of text semantic features with the BERT Chinese pre-trained model in S1 further comprises the following steps:
given the characteristics of the Chinese text data, segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, feeding T_w into a bidirectional GRU network, and extracting the text feature T_l;
extracting the semantic feature T_s from the text feature T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic feature.
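A minimal sketch of the text branch (jieba segmentation, word embedding T_w, bidirectional GRU producing T_l) is given below for illustration only; the vocabulary handling and dimensions are assumptions, and the subsequent BERT Chinese refinement step that yields T_s is omitted for brevity.

import jieba
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative sketch of the text pipeline: jieba -> word embedding T_w -> bi-GRU -> T_l."""
    def __init__(self, vocab, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.vocab = vocab                                  # word -> index map (assumed given)
        self.embed = nn.Embedding(len(vocab), embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, sentence):
        words = list(jieba.cut(sentence))                   # word list from Chinese segmentation
        ids = torch.tensor([[self.vocab.get(w, 0) for w in words]])
        t_w = self.embed(ids)                               # word vectors T_w, shape (1, n, embed_dim)
        t_l, _ = self.gru(t_w)                              # contextual text features T_l, (1, n, 2*hidden_dim)
        return words, t_l.squeeze(0)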
S2, constructing a graph-text similarity calculation module, calculating cos similarity (cosine similarity) between a local image area and words in a text through the graph-text similarity calculation module to obtain attention weight, and calculating a text vector supervised by image attention;
wherein, when the image-text similarity calculation module is constructed in the step S2, the image semantic feature V is usedsAnd text semantic feature TsThe image semantic feature V is obtained by an input image-text similarity calculation modulesSemantic feature T with textsThe similarity between them.
The step of calculating the text vector supervised by image attention in S2 further includes the steps of:
dividing the image into k local regions according to the number k of ROIs, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity between V_i and W_j, so as to measure the importance of each image region to each word, as

s_{ij} = \frac{v_i^T w_j}{\|v_i\|\,\|w_j\|}, \quad i \in [1, k], \ j \in [1, n]

and normalizing it as

\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}

wherein [x]_+ \equiv \max(x, 0);
calculating the attention weight A_{ij} with the softmax function:

A_{ij} = \frac{\exp(\bar{s}_{ij})}{\sum_{j=1}^{n} \exp(\bar{s}_{ij})}

and obtaining the attended text vector a_i by a weighted sum over the words:

a_i = \sum_{j=1}^{n} A_{ij} w_j

In the above formulas, s_{ij} is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of local regions into which the image is divided according to its number of ROIs, n is the number of words into which the sentence text is divided, and A_{ij} is the attention weight of the jth word computed from its degree of correlation with the image region.
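A minimal sketch of this cross-modal attention computation is given below for illustration only; the normalization direction, the optional smoothing factor in the softmax, and the function name are assumptions in line with common practice.

import torch
import torch.nn.functional as F

def attended_text_vectors(v, w, smooth=1.0):
    """Illustrative sketch of S2: region-word cosine similarity -> normalization -> softmax -> weighted sum.

    v: (k, d) local image region features v_i
    w: (n, d) word features w_j
    Returns a: (k, d), where a_i = sum_j A_ij * w_j is the text vector attended by region i."""
    s = F.normalize(v, dim=1) @ F.normalize(w, dim=1).t()           # s_ij, shape (k, n)
    s = s.clamp(min=0)                                              # [x]_+ = max(x, 0)
    s_bar = s / (s.pow(2).sum(dim=0, keepdim=True).sqrt() + 1e-8)   # normalize over the k regions
    A = F.softmax(smooth * s_bar, dim=1)                            # attention weights A_ij over the n words
    return A @ w                                                    # attended text vectors a_i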
S3, calculating cos similarity of the text vector supervised by the image attention and the local image, and obtaining the similarity between the complete text and the complete image through summation and average pooling;
wherein the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further includes the steps of:
by focusing on text vectors in attention
Figure BDA0003493448780000077
And cos similarity of local region of image
Figure BDA0003493448780000078
Calculating, summing the cos similarity and carrying out average pooling treatment to obtain the similarity S of the complete text and the complete imageAVG(I,T)。
Similarity S of the complete text and the complete imageAVGThe calculation formula of (I, T) is as follows:
Figure BDA0003493448780000079
Figure BDA00034934487800000710
in the formulae ViIs the ith local image feature, aiIs a text feature weighted by image attention supervision, k being a total of k image local blocks.
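Continuing the sketch above, the global similarity S_AVG(I, T) could then be computed as follows; this is illustrative only, and the function name is an assumption.

def image_text_similarity(v, a):
    """Illustrative sketch of S3: S_AVG(I, T) = mean over regions of R(v_i, a_i).

    v: (k, d) local image region features; a: (k, d) attended text vectors from attended_text_vectors."""
    r = F.cosine_similarity(v, a, dim=1)   # R(v_i, a_i), one value per local region
    return r.mean()                        # sum and average-pool over the k regions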
And S4, training to obtain an image-text matching prediction model, and inputting the extracted feature codes of the image-text data into the image-text matching prediction model to obtain a prediction result of the image-text similarity.
Wherein, in S4, the training to obtain the image-text matching prediction model and the feeding of the extracted feature encodings of the image and text data into the model to obtain the predicted image-text similarity further comprise the following steps:
constructing negative examples by random pairing and training on the data, with a Triplet Loss function generally adopted as the training objective of the image-text matching task; the training objective is to reduce the value of the loss function, and the image-text matching prediction model is obtained after training;
extracting the feature encodings of the image and text data with Faster R-CNN and the bidirectional GRU respectively, and feeding them into the image-text matching prediction model to obtain the predicted image-text similarity. The result can be used for tasks such as image-text retrieval and image annotation.
When the random pairing method is adopted to construct the negative examples to train the data, the negative example construction strategy further comprises the following steps:
randomly drawing one from the text set for 40% of the image data as a negative example;
for 60% of the image data, selecting from the text set a text that shares entity words with the actually matching text as the negative example.
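For illustration only, this negative-example strategy could be sketched as follows; the entity_index structure and all helper names are assumptions.

import random

def sample_negative_text(matched_text, matched_entities, text_pool, entity_index, p_random=0.4):
    """Illustrative sketch of the negative-example construction strategy.

    matched_entities: entity words appearing in the ground-truth caption
    entity_index:     assumed map from entity word -> list of captions containing that word
    With probability 0.4 a random caption is drawn; otherwise a harder negative
    sharing an entity word with the true caption is drawn."""
    if random.random() < p_random:
        candidates = [t for t in text_pool if t != matched_text]
    else:
        candidates = [t for ent in matched_entities
                      for t in entity_index.get(ent, [])
                      if t != matched_text]
        if not candidates:                  # fall back to a random negative if no entity overlap exists
            candidates = [t for t in text_pool if t != matched_text]
    return random.choice(candidates)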
The image-text matching task adopts a Triplet Loss function as the training objective; the Triplet Loss function is

L = [a - S(I, T) + S(I, \hat{T})]_+ + [a - S(I, T) + S(\hat{I}, T)]_+

wherein [x]_+ \equiv \max(x, 0), S(I, T) is the image-text similarity from S2, S(I, \hat{T}) is the similarity of the image with a negative text \hat{T}, S(\hat{I}, T) is the similarity of the text with a negative image \hat{I}, and a is a constant margin. The training objective is to reduce the value of the loss function, and the image-text matching prediction model is obtained after training.
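A minimal sketch of this hinge-based Triplet Loss follows for illustration only; the margin value and the function name are assumptions, and the inputs are expected to be scalar tensors such as those returned by image_text_similarity above.

def triplet_loss(s_pos, s_neg_text, s_neg_image, margin=0.2):
    """Illustrative sketch: L = [a - S(I,T) + S(I,T^)]_+ + [a - S(I,T) + S(I^,T)]_+.

    s_pos:       S(I, T) for the matched pair
    s_neg_text:  S(I, T^) for the image paired with a negative text
    s_neg_image: S(I^, T) for the text paired with a negative image"""
    return ((margin - s_pos + s_neg_text).clamp(min=0)
            + (margin - s_pos + s_neg_image).clamp(min=0))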
In conclusion, the fine-grained image-text matching based on the cross-modal mutual attention mechanism introduces the ROI features and POS features output by the object detection network when extracting image features, which both partitions the image into regions and enhances its semantic features, and introduces a self-supervised BERT Chinese pre-trained model when extracting text features to further extract semantic information. Finally, the method introduces a cross-modal attention mechanism to align local regions of interest in the image with the words in the text, obtains the complete image-text similarity by summation and average pooling, and thus infers the complete image-text similarity from local alignments, achieving a better image-text matching effect.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. An image-text matching method based on a cross-modal mutual attention mechanism, characterized by comprising the following steps:
S1, extracting image semantic features with an object detection network, and extracting text semantic features with a Chinese pre-trained model;
S2, constructing an image-text similarity calculation module, calculating, through the image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and calculating a text vector supervised by image attention;
S3, calculating the cosine similarity between the image-attention-supervised text vector and each local image region, and obtaining the similarity between the complete text and the complete image by summation and average pooling;
and S4, training to obtain an image-text matching prediction model, and feeding the extracted feature encodings of the image and text data into the image-text matching prediction model to obtain the predicted image-text similarity.
2. The method for image-text matching based on a cross-modal mutual attention mechanism according to claim 1, wherein the step of extracting semantic features of the image by using a target detection network in S1 further comprises the following steps:
extracting the region-of-interest feature V_r and the POS feature V_p of the image with the object detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
3. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 1, wherein the extraction of text semantic features with a Chinese pre-trained model in S1 further comprises the following steps:
segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, feeding T_w into a bidirectional GRU network, and extracting the text feature T_l;
extracting the semantic feature T_s from the text feature T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic feature.
4. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 3, wherein, when the image-text similarity calculation module is constructed in S2, the image semantic feature V_s and the text semantic feature T_s are fed into the image-text similarity calculation module to obtain the similarity between V_s and T_s.
5. The image-text matching method according to claim 1, wherein the step of calculating the text vector supervised by image attention in S2 further comprises the following steps:
dividing the image into k local regions according to the number k of regions of interest, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity between V_i and W_j as

s_{ij} = \frac{v_i^T w_j}{\|v_i\|\,\|w_j\|}, \quad i \in [1, k], \ j \in [1, n]

and normalizing it as

\bar{s}_{ij} = \frac{[s_{ij}]_+}{\sqrt{\sum_{i=1}^{k} [s_{ij}]_+^2}}

wherein [x]_+ \equiv \max(x, 0);
calculating the attention weight A_{ij} with the softmax function:

A_{ij} = \frac{\exp(\bar{s}_{ij})}{\sum_{j=1}^{n} \exp(\bar{s}_{ij})}

and obtaining the attended text vector a_i by a weighted sum over the words:

a_i = \sum_{j=1}^{n} A_{ij} w_j

wherein s_{ij} is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of local regions into which the image is divided according to its number of regions of interest, n is the number of words into which the sentence text is divided, and A_{ij} is the weight of the jth word computed from its degree of correlation with the image region.
6. The method for matching graphics and text based on cross-modal mutual attention mechanism according to claim 5, wherein the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further comprises the steps of:
calculating the cosine similarity R(v_i, a_i) between the attended text vector a_i and the local image region v_i, then summing the cosine similarities and average-pooling them to obtain the similarity S_AVG(I, T) between the complete text and the complete image.
7. The method according to claim 6, wherein the similarity S_AVG(I, T) between the complete text and the complete image is calculated as

R(v_i, a_i) = \frac{v_i^T a_i}{\|v_i\|\,\|a_i\|}

S_{AVG}(I, T) = \frac{1}{k} \sum_{i=1}^{k} R(v_i, a_i)

wherein v_i is the ith local image feature, a_i is the text feature weighted by image attention supervision, and k is the number of local image blocks.
8. The method as claimed in claim 1, wherein the step of training in S4 to obtain a prediction model for matching graphics and text, and inputting the extracted feature codes of the graphics and text data into the prediction model for matching graphics and text, and obtaining the prediction result of the similarity between graphics and text further comprises the steps of:
constructing negative examples by random pairing, training on the data with a Triplet Loss function as the training objective of the image-text matching task, and obtaining the image-text matching prediction model after training;
extracting the feature encodings of the image and text data with Faster R-CNN and the bidirectional GRU respectively, and feeding the extracted feature encodings into the image-text matching prediction model to obtain the predicted image-text similarity.
9. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 8, wherein when training data by constructing negative examples by using a random pairing method, the negative example construction strategy further comprises the following steps:
randomly drawing one from the text set for 40% of the image data as a negative example;
selecting, for 60% of the image data, a text from the text set that shares entity words with the actually matching text as the negative example.
10. The method of claim 9, wherein the image-text matching task uses a Triplet Loss function as the training objective, and the Triplet Loss function is

L = [a - S(I, T) + S(I, \hat{T})]_+ + [a - S(I, T) + S(\hat{I}, T)]_+

wherein [x]_+ \equiv \max(x, 0), S(I, T) is the image-text similarity from S2, \hat{T} is a negative text, \hat{I} is a negative image, and a is a constant margin.
CN202210105762.2A 2022-01-28 2022-01-28 Image-text matching method based on cross-modal mutual attention mechanism Pending CN114492646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210105762.2A CN114492646A (en) 2022-01-28 2022-01-28 Image-text matching method based on cross-modal mutual attention mechanism

Publications (1)

Publication Number Publication Date
CN114492646A (en) 2022-05-13

Family

ID=81476580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210105762.2A Pending CN114492646A (en) 2022-01-28 2022-01-28 Image-text matching method based on cross-modal mutual attention mechanism

Country Status (1)

Country Link
CN (1) CN114492646A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098644A (en) * 2022-07-14 2022-09-23 平安科技(深圳)有限公司 Image and text matching method and device, electronic equipment and storage medium
CN116150418A (en) * 2023-04-20 2023-05-23 南京邮电大学 Image-text matching method and system based on mixed focusing attention mechanism
CN116150418B (en) * 2023-04-20 2023-07-07 南京邮电大学 Image-text matching method and system based on mixed focusing attention mechanism
CN116958706A (en) * 2023-08-11 2023-10-27 中国矿业大学 Controllable generation method for image diversified description based on part-of-speech tagging
CN116958706B (en) * 2023-08-11 2024-05-14 中国矿业大学 Controllable generation method for image diversified description based on part-of-speech tagging
CN117611245A (en) * 2023-12-14 2024-02-27 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities
CN117611245B (en) * 2023-12-14 2024-05-31 浙江博观瑞思科技有限公司 Data analysis management system and method for planning E-business operation activities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination