CN114492646A - Image-text matching method based on cross-modal mutual attention mechanism - Google Patents
Image-text matching method based on cross-modal mutual attention mechanism
- Publication number
- CN114492646A (Application CN202210105762.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- similarity
- matching
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an image-text matching method based on a cross-modal mutual attention mechanism, which comprises the following steps: extracting image semantic features with a target detection network and text semantic features with a Chinese pre-trained model; calculating, through an image-text similarity calculation module, the cosine similarity between local image regions and the words in the text to obtain attention weights, and computing the text vector supervised by image attention; calculating the cosine similarity between this image-attention-supervised text vector and the local image regions, and obtaining the similarity between the complete text and the complete image through summation and average pooling; and calculating the prediction result of the image-text similarity. The invention can infer the similarity between a whole image and a complete sentence, output a similarity value between the image and the sentence, and realize fine-grained image-text matching that aligns local image regions to words.
Description
Technical Field
The invention relates to the field of cross-modal image-text matching, in particular to an image-text matching method based on a cross-modal mutual attention mechanism.
Background
Image-text matching studies how to measure the similarity of images and text at the semantic level. It is closely related to multi-modal alignment, cross-modal retrieval, and similar fields, lies at the intersection of natural language processing and computer vision, and is often applied to knowledge retrieval, image-text annotation, and related tasks. In early research, because image features were extracted in a completely different way from text features, only the bag-of-words of the text could be crudely aligned with the visual bag-of-words of the image. Approaches based on canonical correlation analysis (CCA) then appeared in the cross-modal retrieval field: different modalities are linearly mapped into the same space, their correlations are ranked, and the top-ranked items are taken as the image-text matching result.
However, these early image-text matching methods capture semantic relevance poorly. In recent years, with the development of deep learning, image features have come to be extracted with networks such as CNNs, ResNet, and VGG, while networks such as Faster R-CNN and YOLO have emerged in the object detection field. These can accurately locate the regions of interest of objects in an image, greatly improving the ability to acquire image semantic features.
With the rise of deep learning for extracting image and text semantic features, image-text matching research turned to mapping the extracted image and text features into the same semantic space and then computing the similarity of the feature encoding vectors to obtain the matching result. The main image-text similarity measures include the cosine distance, Jaccard distance, Euclidean distance, and Hamming distance. Their defect is that such similarity calculations only measure the distance between encoding vectors and lack consideration of semantic distance and fine-grained matching alignment.
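These measures can be made concrete with a short sketch (illustrative only, not code from the patent); the toy binary vectors below stand in for image and text encoding vectors:

```python
import numpy as np

u = np.array([1.0, 0.0, 1.0, 1.0])
v = np.array([1.0, 1.0, 0.0, 1.0])

cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))           # angle between vectors
euclidean = np.linalg.norm(u - v)                                  # straight-line distance
jaccard = 1 - np.sum(np.minimum(u, v)) / np.sum(np.maximum(u, v))  # overlap (binary vectors here)
hamming = np.sum(u != v)                                           # number of differing positions

print(round(float(cosine), 3), round(float(euclidean), 3),
      round(float(jaccard), 3), int(hamming))  # 0.667 1.414 0.5 2
```

All four compare only the raw encoding vectors, which is exactly the limitation the patent addresses with fine-grained cross-modal attention.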
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides an image-text matching method based on a cross-modal mutual attention mechanism, so as to overcome the technical problems in the prior related art.
Therefore, the invention adopts the following specific technical scheme:
An image-text matching method based on a cross-modal mutual attention mechanism comprises the following steps:
S1, extracting image semantic features with the Faster R-CNN target detection network, and extracting text semantic features with the BERT Chinese pre-trained model;
S2, constructing an image-text similarity calculation module, calculating the cosine similarity between local image regions and the words in the text through the module to obtain attention weights, and computing the text vector supervised by image attention;
S3, calculating the cosine similarity between the image-attention-supervised text vector and the local image regions, and obtaining the similarity between the complete text and the complete image through summation and average pooling;
S4, training an image-text matching prediction model, and inputting the extracted feature codes of the image-text data into the model to obtain the prediction result of the image-text similarity.
Further, extracting image semantic features with the Faster R-CNN target detection network in S1 further comprises the following steps:
extracting the region-of-interest feature V_r and the POS (position) feature V_p of the image with the Faster R-CNN target detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
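The splicing of V_r and V_p can be sketched as follows (an illustration with assumed feature dimensions; the real dimensions come from the detector and are not stated in the patent):

```python
import numpy as np

k, d_r, d_p = 4, 2048, 4  # 4 regions of interest; dimensions are assumptions, not from the patent
V_r = np.random.default_rng(0).normal(size=(k, d_r))  # ROI features from the detector
V_p = np.random.default_rng(1).normal(size=(k, d_p))  # POS features, e.g. box coordinates

V_s = np.concatenate([V_r, V_p], axis=1)  # splice per region to form image semantic features
print(V_s.shape)  # (4, 2052)
```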
Further, extracting text semantic features with the BERT Chinese pre-trained model in S1 further comprises the following steps:
segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, inputting T_w into a bidirectional GRU network, and extracting the text features T_l;
extracting the semantic features T_s from the text features T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic features.
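The word-vector and bidirectional-GRU stage can be sketched with a toy numpy GRU (random weights purely for illustration; in the patent's pipeline the segmenter jieba supplies the word list and a trained GRU supplies real features):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 6  # toy embedding and hidden sizes (assumptions)

def init_gru():
    """One direction's GRU parameters: update gate z, reset gate r, candidate h."""
    shapes = {"Wz": (d_h, d_in), "Uz": (d_h, d_h), "Wr": (d_h, d_in),
              "Ur": (d_h, d_h), "Wh": (d_h, d_in), "Uh": (d_h, d_h)}
    return {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_pass(p, xs):
    """Run the GRU over word vectors xs (n, d_in); return all hidden states (n, d_h)."""
    h, out = np.zeros(d_h), []
    for x in xs:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)
        h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
        h = (1 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)

fwd, bwd = init_gru(), init_gru()
T_w = rng.normal(size=(3, d_in))  # word vectors T_w for 3 segmented words
# bidirectional: forward states concatenated with reversed backward states -> T_l
T_l = np.concatenate([gru_pass(fwd, T_w), gru_pass(bwd, T_w[::-1])[::-1]], axis=1)
print(T_l.shape)  # (3, 12)
```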
Further, when the image-text similarity calculation module is constructed in S2, the image semantic feature V_s and the text semantic feature T_s are input into the module to obtain the similarity between V_s and T_s.
Further, calculating the text vector supervised by image attention in S2 further comprises the following steps:
dividing the image into k local regions according to the number k of regions of interest, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity of V_i and W_j by the following formula:

S_ij = (v_i · w_j) / (||v_i|| ||w_j||)

calculating the weight A_ij through the softmax function by the following formula:

A_ij = exp(S_ij) / Σ_{j=1}^{n} exp(S_ij)

and obtaining the attended text vector a_i by the weighted summation of words, by the following formula:

a_i = Σ_{j=1}^{n} A_ij · w_j

In the formulas, S_ij is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of regions the image is divided into according to the number k of regions of interest, n is the number of words the sentence text is divided into, and A_ij is the weight computed for the jth word according to its degree of correlation with the image.
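The three formulas above can be sketched together as one attention routine (an illustrative numpy version on random toy features; `attended_text_vectors` is a hypothetical name, not from the patent):

```python
import numpy as np

def attended_text_vectors(V, W):
    """V: (k, d) local image features; W: (n, d) word features.
    Returns (k, d): one image-attention-supervised text vector a_i per region."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = Vn @ Wn.T                                         # S_ij: cosine similarity, (k, n)
    A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)  # softmax over the n words
    return A @ W                                          # a_i = sum_j A_ij * w_j

rng = np.random.default_rng(0)
a = attended_text_vectors(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
print(a.shape)  # (4, 8): one attended text vector per image region
```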
Further, obtaining the similarity between the complete text and the complete image through summation and average pooling in S3 further comprises the following steps:
calculating the cosine similarity between the attended text vector a_i and each local image region v_i, summing these cosine similarities, and performing average pooling to obtain the similarity S_AVG(I, T) between the complete text and the complete image.
Further, the similarity S_AVG(I, T) between the complete text and the complete image is calculated by the following formula:

S_AVG(I, T) = (1/k) * Σ_{i=1}^{k} cos(v_i, a_i)

In the formula, v_i is the ith local image feature, a_i is the text feature weighted under image attention supervision, and k is the number of local image blocks.
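The summation and average-pooling step can be sketched as follows (illustrative; feeding identical region and text vectors shows the upper bound of 1.0):

```python
import numpy as np

def s_avg(V, A_text):
    """Average-pooled cosine similarity between each region v_i and its attended
    text vector a_i: S_AVG(I, T) = (1/k) * sum_i cos(v_i, a_i)."""
    cos = np.sum(V * A_text, axis=1) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(A_text, axis=1))
    return cos.mean()

rng = np.random.default_rng(1)
V = rng.normal(size=(4, 8))
print(round(float(s_avg(V, V)), 4))  # identical vectors -> 1.0
```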
Further, training the image-text matching prediction model in S4 and inputting the extracted feature codes of the image-text data into the model to obtain the prediction result of the image-text similarity further comprises the following steps:
constructing negative examples by random pairing and training on the data, with the Triplet Loss function as the training target of the image-text matching task, and obtaining the image-text matching prediction model after training;
extracting the feature codes of the image-text data with Faster R-CNN and the bidirectional GRU respectively, and inputting them into the image-text matching prediction model to obtain the prediction result of the image-text similarity.
Further, when constructing negative examples by random pairing to train the data, the negative example construction strategy further comprises the following steps:
for 40% of the image data, randomly drawing one text from the text set as the negative example;
for 60% of the image data, selecting from the text set a text that contains the same entity words as the actually matching text as the negative example.
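The 40%/60% negative-example strategy can be sketched as follows (illustrative; the `entities` mapping is an assumed helper standing in for entity-word extraction, which the patent does not specify):

```python
import random

def sample_negative(pos_text, corpus, entities, p_hard=0.6, rng=random.Random(0)):
    """40% of the time draw a random text; 60% of the time prefer a 'hard'
    negative that shares an entity word with the true caption pos_text."""
    candidates = [t for t in corpus if t != pos_text]
    if rng.random() < p_hard:
        hard = [t for t in candidates if entities[t] & entities[pos_text]]
        if hard:
            return rng.choice(hard)
    return rng.choice(candidates)

corpus = ["a dog runs on grass", "a dog sleeps", "a red car"]
ents = {"a dog runs on grass": {"dog"}, "a dog sleeps": {"dog"}, "a red car": {"car"}}
neg = sample_negative("a dog runs on grass", corpus, ents)
print(neg in corpus and neg != "a dog runs on grass")  # True
```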
Further, the image-text matching task adopts the Triplet Loss function as the training target, and the Triplet Loss function formula is as follows:

L(I, T) = [a - S(I, T) + S(I, T̂)]_+ + [a - S(I, T) + S(Î, T)]_+

where [x]_+ is identical to max(x, 0), S(I, T) is the image-text similarity in S2, T̂ is the text negative example, Î is the image negative example, and a is a constant margin.
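The Triplet Loss can be sketched on scalar similarities (illustrative; the margin value 0.2 is an assumption, since the patent only calls the margin a constant a):

```python
def triplet_loss(s_pos, s_neg_text, s_neg_img, margin=0.2):
    """[x]_+ = max(x, 0); pushes the positive pair's similarity above both
    negative pairs' similarities by at least the margin."""
    hinge = lambda x: max(x, 0.0)
    return hinge(margin - s_pos + s_neg_text) + hinge(margin - s_pos + s_neg_img)

print(triplet_loss(0.9, 0.1, 0.2))            # 0.0: positive already beats both negatives
print(round(triplet_loss(0.3, 0.4, 0.1), 2))  # 0.3: the text negative still violates the margin
```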
The invention has the beneficial effects that: the fine-grained image-text matching based on the cross-modal mutual attention mechanism introduces the ROI and POS features output by a target detection network when extracting image features, which both partitions the image and enhances its semantic features, and introduces a BERT Chinese pre-trained model for self-supervision when extracting text features, further distilling semantic information. Finally, the method introduces a cross-modal attention mechanism to align local regions of interest in the image with the words in the text, obtains the complete image-text similarity through summation and average pooling, and thus infers the complete similarity from local alignments, achieving a better image-text matching effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of an implementation of a teletext matching method based on a cross-modal mutual attention mechanism according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, the drawings, which form a part of this disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments and, together with the description, serve to explain their principles of operation and to enable others of ordinary skill in the art to understand the various embodiments and advantages of the invention. The figures are not to scale, and like reference numerals generally refer to like elements.
According to an embodiment of the invention, an image-text matching method based on a cross-modal mutual attention mechanism is provided. To account for semantic distance and fine-grained matching, a cross-modal mutual attention mechanism and an object detection technique are introduced. The attention involved in image-text matching comprises visual attention and text attention. Fine-grained information is obtained by computing word vectors after text segmentation and by extracting the ROI (region of interest) and POS (position in the image) features of the image with object detection. A cross-modal mutual attention module then computes the text vector supervised by the image, and finally the similarity between this image-supervised text vector and the image features is calculated. This adds semantic distance information to image-text matching while realizing fine-grained alignment of words to ROI regions; experiments show that this image-text matching scheme achieves good matching results.
The method is built on a cross-modal mutual attention module. Local image features are extracted with the Faster R-CNN object detection algorithm, and contextual text features are extracted from each sentence with a bidirectional GRU (gated recurrent unit). The local region features of the image and the word features of the sentence are encoded and input into the cross-modal mutual attention module, which maps words and image regions into a common embedding space and infers the similarity between the whole image and the whole sentence by aligning them. Finally, a similarity value between the image and the sentence is output, realizing fine-grained image-text matching that aligns image regions to words.
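Mapping both modalities into a common embedding space before comparison can be sketched as follows (illustrative random projections stand in for the learned encoders, so the printed similarity is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt, d_common = 10, 12, 8  # toy dimensions (assumptions)
P_img = rng.normal(scale=0.1, size=(d_common, d_img))  # learned projection matrices,
P_txt = rng.normal(scale=0.1, size=(d_common, d_txt))  # random here for illustration

def to_common(x, P):
    """Project a modality-specific feature into the shared space and L2-normalize,
    so the dot product of two projected features is their cosine similarity."""
    y = P @ x
    return y / np.linalg.norm(y)

v = to_common(rng.normal(size=d_img), P_img)  # an image-region embedding
w = to_common(rng.normal(size=d_txt), P_txt)  # a word embedding
print(round(float(v @ w), 4))  # cosine similarity of the two unit vectors
```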
The invention will now be further described with reference to the drawings and the detailed description. As shown in fig. 1, in an embodiment of the invention, an image-text matching method based on a cross-modal mutual attention mechanism comprises the following steps:
S1, extracting image semantic features with the Faster R-CNN target detection network, and extracting text semantic features with the BERT Chinese pre-trained model;
Wherein, extracting image semantic features with the Faster R-CNN target detection network in S1 further comprises the following steps:
according to the characteristics of the image data, extracting the ROI (region of interest) feature V_r and the POS feature V_p of the image with the Faster R-CNN target detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
Extracting text semantic features with the BERT Chinese pre-trained model in S1 further comprises the following steps:
according to the characteristics of the Chinese text data, segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, inputting T_w into a bidirectional GRU network, and extracting the text features T_l;
extracting the semantic features T_s from the text features T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic features.
S2, constructing a graph-text similarity calculation module, calculating cos similarity (cosine similarity) between a local image area and words in a text through the graph-text similarity calculation module to obtain attention weight, and calculating a text vector supervised by image attention;
wherein, when the image-text similarity calculation module is constructed in the step S2, the image semantic feature V is usedsAnd text semantic feature TsThe image semantic feature V is obtained by an input image-text similarity calculation modulesSemantic feature T with textsThe similarity between them.
Calculating the text vector supervised by image attention in S2 further comprises the following steps:
dividing the image into k local regions according to the ROI number k, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity of V_i and W_j, so as to measure the importance of each image region to each word, by the following formula:

S_ij = (v_i · w_j) / (||v_i|| ||w_j||)

calculating the weight A_ij through the softmax function by the following formula:

A_ij = exp(S_ij) / Σ_{j=1}^{n} exp(S_ij)

In the formulas, S_ij is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of regions the image is divided into according to the number k of regions of interest, n is the number of words the sentence text is divided into, and A_ij is the weight computed for the jth word according to its degree of correlation with the image.
S3, calculating cos similarity of the text vector supervised by the image attention and the local image, and obtaining the similarity between the complete text and the complete image through summation and average pooling;
wherein the step of obtaining the similarity between the complete text and the complete image through the summation and average pooling in S3 further includes the steps of:
by focusing on text vectors in attentionAnd cos similarity of local region of imageCalculating, summing the cos similarity and carrying out average pooling treatment to obtain the similarity S of the complete text and the complete imageAVG(I,T)。
Similarity S of the complete text and the complete imageAVGThe calculation formula of (I, T) is as follows:
in the formulae ViIs the ith local image feature, aiIs a text feature weighted by image attention supervision, k being a total of k image local blocks.
S4, training an image-text matching prediction model, and inputting the extracted feature codes of the image-text data into the model to obtain the prediction result of the image-text similarity.
Wherein, training the image-text matching prediction model in S4 and inputting the extracted feature codes of the image-text data into the model to obtain the prediction result of the image-text similarity further comprises the following steps:
constructing negative examples by random pairing and training on the data, generally with the Triplet Loss function as the training target of the image-text matching task; the training objective is to reduce the value of the loss function, and the image-text matching prediction model is obtained after training;
extracting the feature codes of the image-text data with Faster R-CNN and the bidirectional GRU respectively, and inputting them into the image-text matching prediction model to obtain the prediction result of the image-text similarity. The result can be used for tasks such as image-text retrieval and image annotation.
When constructing negative examples by random pairing to train the data, the negative example construction strategy further comprises the following steps:
for 40% of the image data, randomly drawing one text from the text set as the negative example;
for 60% of the image data, selecting from the text set a text that contains the same entity words as the actually matching text as the negative example.
The image-text matching task adopts the Triplet Loss function as the training target, and the Triplet Loss function formula is as follows:

L(I, T) = [a - S(I, T) + S(I, T̂)]_+ + [a - S(I, T) + S(Î, T)]_+

where [x]_+ is identical to max(x, 0), S(I, T) is the image-text similarity in S2, T̂ is the text negative example, Î is the image negative example, and a is a constant margin. The training objective is to reduce the value of the loss function, and the image-text matching prediction model is obtained after training.
In conclusion, the fine-grained image-text matching based on the cross-modal mutual attention mechanism introduces the ROI and POS features output by the target detection network when extracting image features, which both partitions the image and enhances its semantic features, and introduces a BERT Chinese pre-trained model for self-supervision when extracting text features, further distilling semantic information. Finally, the method introduces a cross-modal attention mechanism to align local regions of interest in the image with the words in the text, obtains the complete image-text similarity through summation and average pooling, and thus infers the complete similarity from local alignments, achieving a better image-text matching effect.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. An image-text matching method based on a cross-modal mutual attention mechanism, characterized by comprising the following steps:
S1, extracting image semantic features with a target detection network, and extracting text semantic features with a Chinese pre-trained model;
S2, constructing an image-text similarity calculation module, calculating the cosine similarity between local image regions and the words in the text through the module to obtain attention weights, and computing the text vector supervised by image attention;
S3, calculating the cosine similarity between the image-attention-supervised text vector and the local image regions, and obtaining the similarity between the complete text and the complete image through summation and average pooling;
S4, training an image-text matching prediction model, and inputting the extracted feature codes of the image-text data into the model to obtain the prediction result of the image-text similarity.
2. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 1, wherein extracting image semantic features with a target detection network in S1 further comprises the following steps:
extracting the region-of-interest feature V_r and the POS feature V_p of the image with the target detection network;
concatenating V_r and V_p to obtain the image semantic feature V_s.
3. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 1, wherein extracting text semantic features with a Chinese pre-trained model in S1 further comprises the following steps:
segmenting the Chinese text with the Chinese word segmentation tool jieba to obtain a word list;
obtaining word vectors T_w through word embedding, inputting T_w into a bidirectional GRU network, and extracting the text features T_l;
extracting the semantic features T_s from the text features T_l with the BERT Chinese pre-trained network, and taking T_s as the final text semantic features.
4. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 3, wherein, when the image-text similarity calculation module is constructed in S2, the image semantic feature V_s and the text semantic feature T_s are input into the module to obtain the similarity between V_s and T_s.
5. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 1, wherein calculating the text vector supervised by image attention in S2 further comprises the following steps:
dividing the image into k local regions according to the number k of regions of interest, and denoting one local region as V_i;
dividing the text into n words according to the number n of words, and denoting one word as W_j;
calculating the cosine similarity of V_i and W_j by the following formula:

S_ij = (v_i · w_j) / (||v_i|| ||w_j||)

calculating the weight A_ij through the softmax function by the following formula:

A_ij = exp(S_ij) / Σ_{j=1}^{n} exp(S_ij)

In the formulas, S_ij is the similarity between the ith local image region and the jth word, v_i is the ith local image feature vector, w_j is the jth word feature vector, k is the number of regions the image is divided into according to the number k of regions of interest, n is the number of words the sentence text is divided into, and A_ij is the weight computed for the jth word according to its degree of correlation with the image.
6. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 5, wherein obtaining the similarity between the complete text and the complete image through summation and average pooling in S3 further comprises the following steps:
calculating the cosine similarity between the attended text vector a_i and each local image region v_i, summing these cosine similarities, and performing average pooling to obtain the similarity S_AVG(I, T) between the complete text and the complete image.
7. The method according to claim 6, wherein the similarity S_AVG(I, T) between the complete text and the complete image is calculated by the following formula:

S_AVG(I, T) = (1/k) * Σ_{i=1}^{k} cos(v_i, a_i)

In the formula, v_i is the ith local image feature, a_i is the text feature weighted under image attention supervision, and k is the number of local image blocks.
8. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 1, wherein training the image-text matching prediction model in S4 and inputting the extracted feature codes of the image-text data into the model to obtain the prediction result of the image-text similarity further comprises the following steps:
constructing negative examples by random pairing and training on the data, with the Triplet Loss function as the training target of the image-text matching task, and obtaining the image-text matching prediction model after training;
extracting the feature codes of the image-text data with Faster R-CNN and the bidirectional GRU respectively, and inputting them into the image-text matching prediction model to obtain the prediction result of the image-text similarity.
9. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 8, wherein, when constructing negative examples by random pairing to train the data, the negative example construction strategy further comprises the following steps:
for 40% of the image data, randomly drawing one text from the text set as the negative example;
for 60% of the image data, selecting from the text set a text that contains the same entity words as the actually matching text as the negative example.
10. The image-text matching method based on the cross-modal mutual attention mechanism according to claim 9, wherein the image-text matching task adopts the Triplet Loss function as the training target, and the Triplet Loss function formula is as follows:

L(I, T) = [a - S(I, T) + S(I, T̂)]_+ + [a - S(I, T) + S(Î, T)]_+

where [x]_+ is identical to max(x, 0), S(I, T) is the image-text similarity in S2, T̂ is the text negative example, Î is the image negative example, and a is a constant margin.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210105762.2A | 2022-01-28 | 2022-01-28 | Image-text matching method based on cross-modal mutual attention mechanism |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114492646A | 2022-05-13 |
Family
ID=81476580
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210105762.2A | CN114492646A (en), Pending | 2022-01-28 | 2022-01-28 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114492646A (en) |
2022
- 2022-01-28: CN application CN202210105762.2A filed; patent CN114492646A (en) active, Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115098644A (en) * | 2022-07-14 | 2022-09-23 | 平安科技(深圳)有限公司 | Image and text matching method and device, electronic equipment and storage medium |
CN116150418A (en) * | 2023-04-20 | 2023-05-23 | 南京邮电大学 | Image-text matching method and system based on mixed focusing attention mechanism |
CN116150418B (en) * | 2023-04-20 | 2023-07-07 | 南京邮电大学 | Image-text matching method and system based on mixed focusing attention mechanism |
CN116958706A (en) * | 2023-08-11 | 2023-10-27 | 中国矿业大学 | Controllable generation method for image diversified description based on part-of-speech tagging |
CN116958706B (en) * | 2023-08-11 | 2024-05-14 | 中国矿业大学 | Controllable generation method for image diversified description based on part-of-speech tagging |
CN117611245A (en) * | 2023-12-14 | 2024-02-27 | 浙江博观瑞思科技有限公司 | Data analysis management system and method for planning E-business operation activities |
CN117611245B (en) * | 2023-12-14 | 2024-05-31 | 浙江博观瑞思科技有限公司 | Data analysis management system and method for planning E-business operation activities |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |