CN114911914A - Cross-modal image-text retrieval method - Google Patents
Cross-modal image-text retrieval method
- Publication number
- CN114911914A (application CN202210433101.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- modal
- visual
- features
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/532—Query formulation, e.g. graphical querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention provides a cross-modal image-text retrieval method, belonging to the technical field of machine learning. The method inputs the visual and text features of a sample into a unified multi-modal Transformer inference network to acquire intra-modal and inter-modal interactive inference information. The training process of the network is divided into two stages: in the first stage, a twin multi-modal Transformer encoder encodes visual and text information to obtain intra-modal context information; in the second stage, the fused visual and text information is input into the same multi-modal Transformer encoder to obtain inter-modal interactive context information. Finally, the model adopts a novel adaptive similarity fusion mechanism to perform cross-modal image-text similarity matching calculation and output the result. The invention alleviates the technical problem that existing methods in the field of image-text retrieval acquire intra-modal and inter-modal interaction information separately, allowing the features of different modalities to interact with and complement each other.
Description
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a cross-modal image-text retrieval method based on a unified multi-modal Transformer inference network.
Background
Cross-modal image-text retrieval is a fundamental and promising research direction in the field of multi-modal machine learning. The task aims to retrieve corresponding results in one modality given a query data sample of another modality submitted by a user. Today, with the popularization of the Internet, cross-modal image-text retrieval is becoming part of the daily life and work of Internet users. In order to fully utilize the high-level semantic information of multiple modalities, many existing methods attempt to mine the interactive semantic information within or between modalities.
However, the inventors have found that existing methods still share a common disadvantage: intra-modal and inter-modal interactive inference information is acquired separately, which severs the fine-grained interrelations among different modalities, and these interrelations are critical to the accuracy of cross-modal retrieval.
Therefore, research on a model capable of simultaneously acquiring intra-modal and inter-modal interactive inference information for visual and text features has strong academic and practical value.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal image-text retrieval method. By jointly modeling the acquisition of visual and text multi-modal interactive context information, the method fully mines the associations and complementarity among fine-grained features of different modalities, and can effectively improve the accuracy of cross-modal image-text retrieval.
In order to achieve the purpose, the invention adopts the following technical scheme:
a cross-modal image-text retrieval method comprises the following steps:
s1, extracting visual features of the image and text features of the sentence by using a deep neural network;
s2, designing a unified multi-mode Transformer inference network, and modeling different modal data in a unified mode; the unified multi-mode Transformer reasoning network respectively extracts intra-modal interaction information and inter-modal interaction information of the vision and the text, and calculates the similarity of the vision characteristic and the text characteristic through a self-adaptive similarity fusion module;
and S3, training the unified multi-modal Transformer inference network, inputting the visual features and the text features extracted in step S1 into the trained unified multi-modal Transformer inference network to obtain the similarity of the visual features and the text features, and outputting the image-text retrieval result.
Further, in step S1, the specific way of extracting the visual features of the image is as follows:
aiming at each input image, adopting a pre-training model of the Faster-R-CNN network on a Visual Genome data set as a Visual feature extractor to extract the features of the region of interest;
for each selected region of interest, a position embedding vector is constructed from its bounding box, where (x_l, y_l) and (x_r, y_r) represent the coordinates of the upper-left and lower-right corners of the region, and W and H represent the width and height of the input image;
the visual features of the image are represented as V = {v_1, ···, v_k}, v_i ∈ ℝ^D, where v_i is a visual feature vector, ℝ^D is the vector space, k is the number of regions of interest, and D is the visual feature dimension.
Further, in step S1, the specific way of extracting the text features of the sentence is as follows:
for each input sentence, a group of word vectors is extracted with the pre-trained language model BERT and then input into a fully connected layer to obtain a group of D-dimensional text features, represented as T = {t_1, ···, t_n}, t_i ∈ ℝ^D, where t_i is a text feature vector, ℝ^D is the vector space, n is the number of words in the sentence, and D is the text feature dimension.
Further, in step S2, a self-attention Transformer encoder is used to extract visual and textual intra-modal interaction information and inter-modal interaction information.
Further, in step S3, a pre-training-fine-tuning two-stage training method is used to train the unified multi-modal Transformer inference network, where a first training stage generates intra-modal context representation information of a visual or text single mode, and a second training stage generates interaction information between the visual and text modes.
Further, the training process of the first training stage is as follows: for an input image or text, a group of visual or text features is extracted with the twin intra-modal encoder module intra-MMTN, according to the formulas {v̄_1, ···, v̄_k} = intra-MMTN(v_1, ···, v_k) and {t̄_1, ···, t̄_n} = intra-MMTN(t_1, ···, t_n), where {v_1, ···, v_k} and {t_1, ···, t_n} represent the fine-grained fragment features of the input, i.e. the region features of the image or the word features of the text, and {v̄_1, ···, v̄_k} and {t̄_1, ···, t̄_n} represent the output image or text context representation information; the encoders of the twin intra-modal encoder module share parameters.
Further, the training process of the second training stage is as follows: on the basis of the first-stage training, the fragment features {v_1, ···, v_k} and {t_1, ···, t_n} are concatenated and input into the inter-modal encoder module inter-MMTN; the inter-modal encoder module loads the pre-trained model of the twin intra-modal encoder module and is then trained by fine-tuning, according to the formula {ṽ_1, ···, ṽ_k, t̃_1, ···, t̃_n} = inter-MMTN(v_1, ···, v_k, t_1, ···, t_n), where ṽ_i and t̃_j represent the inter-modal interaction context representation information.
Further, the specific way for the adaptive similarity fusion module to calculate the similarity between the visual features and the text features is as follows:
for a set of visual context representations {v̄_1, ···, v̄_k} and text context representations {t̄_1, ···, t̄_n}, where v̄_i, t̄_j ∈ ℝ^D, a cross-modal fine-grained matching degree matrix A is defined by A_ij = (W_v v̄_i)^T (W_t t̄_j), where element A_ij represents the semantic similarity of the i-th visual context representation and the j-th text context representation, and W_v and W_t are network parameters;
defining the weighted sum of the cross-modal fine-grained matching degree matrix A as the global image-text similarity S(I, T), where the weight of each fine-grained similarity is obtained by applying a softmax activation over the column sums of A: S(I, T) = Σ_j w_j s_j with s_j = Σ_i A_ij and w_j = softmax_j(λ s_j), where λ is the temperature coefficient of the softmax activation function.
The invention has the beneficial effects that:
1) The invention alleviates the technical problem that existing methods in the field of image-text retrieval acquire intra-modal and inter-modal interaction information separately, allowing the features of different modalities to interact with and complement each other.
2) The method divides the training of the model into two stages, pre-training and fine-tuning, and can efficiently handle the cross-modal retrieval problem. It uses no external knowledge and trains the model only on a cross-modal image-text retrieval dataset, which avoids high computation cost and alleviates the training bias caused by dependence on big data.
3) The invention designs a new adaptive multi-modal data fusion method, which can effectively improve the accuracy of image-text retrieval.
Drawings
FIG. 1 is a diagram showing a structure of a Transformer module.
FIG. 2 is a flow chart of the present invention.
FIG. 3 is a schematic diagram of a unified multi-modal Transformer inference network according to the present invention.
FIG. 4 is a schematic diagram of the adaptive similarity fusion module of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and specific examples, but the invention is not limited thereto.
A cross-modal image-text retrieval method uses a deep neural network to extract the visual features of a sample image and the text features of a query sentence; designs a unified multi-modal Transformer inference network MMTN that models data of different modalities in a unified manner and extracts intra-modal and inter-modal interaction information for vision and text respectively; and designs a novel adaptive similarity fusion module ASA (adaptive similarity aggregation) that dynamically fuses the generated fine-grained cross-modal context representation information, calculates the cross-modal similarity, and outputs the image-text retrieval result.
Wherein, the extraction process about the visual characteristics of the sample image comprises the following steps:
For each input image, Faster R-CNN with ResNet-101 as the backbone network is adopted as the visual feature extractor. 2048-dimensional ROI (region of interest) features are extracted with a model of this extractor pre-trained on the Visual Genome dataset, and the 36 ROI features with the highest confidence are selected.
For each selected ROI, a position embedding vector is constructed from its bounding box, where (x_l, y_l) and (x_r, y_r) represent the coordinates of the upper-left and lower-right corners of the region, and W and H represent the width and height of the input image. The position vector of the ROI is then projected to 1024 dimensions. Finally, the visual features of the image are represented as V = {v_1, ···, v_k}, v_i ∈ ℝ^D.
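The normalization step above can be sketched as follows. The exact components of the position vector are not reproduced in this text (the formula image is omitted), so the five-element layout below (normalized corner coordinates plus an area ratio) and the helper name are assumptions:

```python
import numpy as np

def roi_position_embedding(box, W, H):
    """Hedged sketch: normalize an ROI's corner coordinates by the image size.

    box = (x_l, y_l, x_r, y_r) as in the text. The 5-element layout
    (normalized corners plus relative area) is an assumed, typical choice,
    since the patent's own formula image is not reproduced here.
    """
    x_l, y_l, x_r, y_r = box
    area = (x_r - x_l) * (y_r - y_l) / (W * H)  # relative area of the region
    return np.array([x_l / W, y_l / H, x_r / W, y_r / H, area])

# Toy ROI inside a 200x400 image.
emb = roi_position_embedding((10, 20, 110, 220), W=200, H=400)
```

In practice this vector would then be projected to 1024 dimensions by a learned linear layer, as the text states.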
The extraction process of the text features of the query statement comprises the following steps:
For each input sentence, a group of word vectors is extracted with the pre-trained language model BERT and then input into a fully connected layer to obtain a group of D-dimensional text features, represented as T = {t_1, ···, t_n}.
The generation process of the context interaction information about the visual and text different modalities comprises the following steps:
A self-attention Transformer encoder is used to extract the intra-modal and inter-modal interactive inference information of vision and text.
In addition, the method also adopts a pre-training-fine-tuning two-stage training method to train the MMTN network. The first training phase generates intra-modal context representation information for visual or textual single modalities, and the second training phase generates interaction information between visual and textual modalities.
Further, the training process of the first training stage is as follows: for an input image or text, a group of visual or text features is extracted with the twin encoder module intra-MMTN, according to the formulas {v̄_1, ···, v̄_k} = intra-MMTN(v_1, ···, v_k) and {t̄_1, ···, t̄_n} = intra-MMTN(t_1, ···, t_n), where {v_1, ···, v_k} and {t_1, ···, t_n} represent the fine-grained fragment features of the input (the region features of the image or the word features of the text), and {v̄_1, ···, v̄_k} and {t̄_1, ···, t̄_n} represent the output image or text context representation information.
The training process of the second training stage is as follows: on the basis of the first-stage training, the fragment features {v_1, ···, v_k} and {t_1, ···, t_n} are concatenated and input into the inter-MMTN module, according to the formula {ṽ_1, ···, ṽ_k, t̃_1, ···, t̃_n} = inter-MMTN(v_1, ···, v_k, t_1, ···, t_n), where ṽ_i and t̃_j represent the inter-modal interaction context representation information.
The cross-mode matching calculation process of the adaptive similarity fusion module ASA comprises the following steps:
For a set of visual context representations {v̄_1, ···, v̄_k} and text context representations {t̄_1, ···, t̄_n}, where v̄_i, t̄_j ∈ ℝ^D, the invention defines a cross-modal fine-grained matching degree matrix A by A_ij = (W_v v̄_i)^T (W_t t̄_j), where A_ij represents the semantic similarity of the i-th visual context representation and the j-th text context representation.
The global image-text similarity is defined as the weighted sum of the cross-modal fine-grained matching degree matrix A, where the weight of each fine-grained similarity is obtained by applying a softmax activation over the column sums of A: S(I, T) = Σ_j w_j s_j with s_j = Σ_i A_ij and w_j = softmax_j(λ s_j), where λ is the temperature coefficient of the softmax activation function. In general, the adaptive similarity fusion process can be expressed as S(I, T) = ASA({v̄_i}, {t̄_j}).
With the rapid development of Internet technology, multi-modal data such as images, videos, and texts are becoming important media for human perception of the world, so enabling a computer to accurately understand multi-modal data and realize cross-modal retrieval is a research subject of great practical value. The Transformer adopted by the method is a neural network, widely used in recent years, that takes the self-attention mechanism as its core. A Transformer encoder maps a set of sequence vectors {x_1, ···, x_n} to another set of sequence vectors {x̄_1, ···, x̄_n}, where the self-attention mechanism enables the network to capture implicit semantic associations between the input sequence vectors. Specifically, the attention mechanism of the Transformer module comprises a query vector Q, a key vector K, and a value vector V, whose corresponding feature matrices have dimensions d_q, d_k, and d_v. Each query vector Q serves as an anchor that assigns weights to the value vectors V according to the semantic relevance between the key vectors K and the query vector Q, so the output for Q after self-attention processing can be regarded as a convex combination of the value vectors V. The Transformer can thus be regarded as a feature enhancement operation based on the self-attention mechanism, with the matrix form Attention(Q, K, V) = softmax(QK^T / √d_k) V. The structure of the Transformer module is shown in FIG. 1. The method adopts the Transformer module as the key component of the MMTN network; the encoder has 2 layers, each containing 4 attention heads. The steps for implementing cross-modal image-text retrieval are shown in FIG. 2.
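The scaled dot-product attention formula above can be illustrated with a minimal NumPy sketch; the function name and toy dimensions are illustrative, not from the patent:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, as in the formula above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise semantic relevance
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                   # convex combination of V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query vectors of dimension d_k = 8
K = rng.normal(size=(6, 8))   # 6 key vectors
V = rng.normal(size=(6, 8))   # 6 value vectors
out, w = scaled_dot_product_attention(Q, K, V)
```

Because each row of `w` is a probability distribution, each output row is indeed a convex combination of the value vectors, as the text describes.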
Authoritative datasets for the cross-modal image-text retrieval task include MS-COCO, Flickr30k, and others. Taking the MS-COCO dataset as an example, it contains 123287 images, each with 5 description sentences. In this embodiment the dataset is split as follows: 5000 images are used as the validation set, 5000 images as the test set, and the rest as the training set. The evaluation metric is the recall rate R@K, with K = 1, 5, 10.
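Given a similarity matrix between queries and candidates, R@K can be computed as below. This is the standard definition of the metric, sketched with toy data rather than the MS-COCO protocol's 5-captions-per-image detail:

```python
import numpy as np

def recall_at_k(sim, gt, k):
    """Fraction of queries whose ground-truth item is among the top-k
    candidates ranked by similarity. sim: (n_queries, n_candidates)."""
    topk = np.argsort(-sim, axis=1)[:, :k]   # indices of the k best candidates
    hits = [gt[i] in topk[i] for i in range(sim.shape[0])]
    return float(np.mean(hits))

# Toy similarity matrix: 3 queries, 3 candidates.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.5, 0.6, 0.1]])
gt = np.array([0, 2, 0])                     # correct candidate for each query
r1 = recall_at_k(sim, gt, 1)                 # queries 0 and 1 hit at rank 1
r2 = recall_at_k(sim, gt, 2)                 # all three hit within rank 2
```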
The cross-modal graph-text retrieval method is described in detail as follows:
s1, in this embodiment, the original two-dimensional image needs to be preprocessed, and the embodiment adopts ResNet-101 and Faster-R-CNN networks to extract 2048-dimensional ROI features of the two-dimensional image. The Faster-R-CNN network obtains a series of ROI features with different confidence levels, and the embodiment selects 36 features with the highest confidence levels.
For each selected ROI feature, a position embedding vector is constructed from its bounding box, where (x_l, y_l) and (x_r, y_r) represent the coordinates of the upper-left and lower-right corners of the region, and W and H represent the width and height of the input image. The position vector of the ROI is then projected to 1024 dimensions. Finally, the visual features of the image are represented as V = {v_1, ···, v_k}, v_i ∈ ℝ^D.
In this embodiment, the original text data is also preprocessed: a BERT network is used to extract a 768-dimensional word vector for each word in the text, and these vectors are then input into the fully connected layer to obtain a group of 2048-dimensional text features, represented as T = {t_1, ···, t_n}.
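The projection of 768-dimensional BERT word vectors to 2048-dimensional text features is an ordinary fully connected layer. A sketch, with random weights standing in for the learned parameters and the sentence length chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(42)
n_words, d_bert, d_out = 12, 768, 2048       # 768-dim BERT vectors -> 2048-dim features

word_vectors = rng.normal(size=(n_words, d_bert))    # stand-in for BERT outputs
W_fc = rng.normal(scale=0.02, size=(d_bert, d_out))  # learned in practice
b_fc = np.zeros(d_out)

# One affine map applied to every word vector: T = {t_1, ..., t_n}, t_i in R^2048.
text_features = word_vectors @ W_fc + b_fc
```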
The visual and textual features obtained in the above steps are used as input data from the attention Transformer encoder.
S2, a self-attention Transformer encoder is adopted to extract the intra-modal and inter-modal interactive inference information of vision and text. This embodiment employs a pre-training/fine-tuning two-stage strategy to train the MMTN network. The first stage generates intra-modal context representation information for the visual or text single modality, and the second stage generates interaction information between the visual and text modalities, as shown in FIG. 3.
The training process of the first training stage is as follows: for an input image and text, a group of visual or text features is extracted with the parameter-sharing twin encoder module intra-MMTN (Multi-Modal Transformer), according to the formulas {v̄_1, ···, v̄_k} = intra-MMTN(v_1, ···, v_k) and {t̄_1, ···, t̄_n} = intra-MMTN(t_1, ···, t_n), where {v_1, ···, v_k} and {t_1, ···, t_n} represent the fine-grained fragment features of the input (the region features of the image or the word features of the text), and {v̄_1, ···, v̄_k} and {t̄_1, ···, t̄_n} represent the output image or text context representation information.
The second training stage builds on the first; it is essentially a fine-tuning of the model obtained in the first, pre-training stage. Specifically, the fragment features {v_1, ···, v_k} and {t_1, ···, t_n} are concatenated and input into the inter-MMTN module, according to the formula {ṽ_1, ···, ṽ_k, t̃_1, ···, t̃_n} = inter-MMTN(v_1, ···, v_k, t_1, ···, t_n), where ṽ_i and t̃_j represent the inter-modal interaction context representation information.
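The two-stage flow (twin intra-modal encoding with shared parameters, then concatenation into one inter-modal encoder initialized from the first stage) can be sketched schematically. Here `encode` is a placeholder for the Transformer encoder stack, not the patent's actual implementation, and the dimensions are toy values:

```python
import numpy as np

def encode(x, W):
    """Placeholder for a Transformer encoder: any shape-preserving map."""
    return np.tanh(x @ W)

rng = np.random.default_rng(1)
D = 16
W_shared = rng.normal(scale=0.1, size=(D, D))  # twin encoders share parameters

v = rng.normal(size=(36, D))   # region features  {v_1, ..., v_k}, k = 36
t = rng.normal(size=(12, D))   # word features    {t_1, ..., t_n}, n = 12

# Stage 1: twin intra-modal encoding with one shared weight matrix.
v_ctx = encode(v, W_shared)
t_ctx = encode(t, W_shared)

# Stage 2: concatenate the fragment features and encode them jointly,
# initializing the inter-modal encoder from the stage-1 parameters
# ("load the pre-trained model, then fine-tune").
joint = np.concatenate([v, t], axis=0)       # [v_1..v_k ; t_1..t_n]
W_inter = W_shared.copy()                    # start from the pre-trained weights
joint_ctx = encode(joint, W_inter)           # W_inter is fine-tuned during training
```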
S3, the cross-modal matching calculation process of the adaptive similarity fusion module ASA is as follows: for a set of visual context features {v̄_1, ···, v̄_k} and text context features {t̄_1, ···, t̄_n}, where v̄_i, t̄_j ∈ ℝ^D, the invention defines a cross-modal fine-grained matching degree matrix A by A_ij = (W_v v̄_i)^T (W_t t̄_j), where A_ij represents the semantic similarity of the i-th visual context representation and the j-th text context representation, as shown in FIG. 4.
The global image-text similarity is defined as the weighted sum of the cross-modal fine-grained matching degree matrix A, where the weight of each fine-grained similarity is obtained by applying a softmax activation over the column sums of A: S(I, T) = Σ_j w_j s_j with s_j = Σ_i A_ij and w_j = softmax_j(λ s_j), where λ is the temperature coefficient of the softmax activation function; in this embodiment λ takes the value 20. In general, the adaptive similarity fusion process can be expressed as S(I, T) = ASA({v̄_i}, {t̄_j}).
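A sketch of the adaptive similarity aggregation described above, with temperature λ = 20 as in this embodiment. The patent's formula images are not reproduced in this text, so the bilinear form for A and the exact aggregation rule below are assumptions consistent with the surrounding description:

```python
import numpy as np

rng = np.random.default_rng(7)
k, n, D = 5, 4, 8                            # toy numbers of regions, words, dims
v_ctx = rng.normal(size=(k, D))              # visual context representations
t_ctx = rng.normal(size=(n, D))              # text context representations
W_v = rng.normal(scale=0.1, size=(D, D))     # network parameters (learned)
W_t = rng.normal(scale=0.1, size=(D, D))

# Cross-modal fine-grained matching matrix: A_ij = (W_v v_i)^T (W_t t_j).
A = (v_ctx @ W_v) @ (t_ctx @ W_t).T          # shape (k, n)

lam = 20.0                                   # softmax temperature, as in the text
col = A.sum(axis=0)                          # per-word aggregated similarity s_j
z = lam * col
w = np.exp(z - z.max())                      # stable softmax over columns
w /= w.sum()                                 # adaptive fusion weights, sum to 1
S = float(w @ col)                           # global image-text similarity S(I, T)
```

The softmax weights let the model emphasize the text fragments that match the image best, which is the "adaptive" part of the fusion.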
The training process uses a triplet loss to train the model.
the training learning rate is 0.00001.
In summary, the invention inputs the visual and text features of a sample into the unified multi-modal Transformer inference network MMTN to acquire intra-modal and inter-modal interactive inference information. The training process of the network is divided into two stages: in the first stage, a twin multi-modal Transformer encoder encodes visual and text information to obtain intra-modal context information; in the second stage, the fused visual and text information is input into the same multi-modal Transformer encoder to obtain inter-modal interactive context information. Finally, the model adopts a novel adaptive similarity fusion mechanism to perform cross-modal image-text similarity matching calculation and output the result. The invention adopts a Transformer encoder to uniformly model the intra-modal and inter-modal interaction information of vision and text, and deeply mines the relevance and complementarity of different modalities as a whole.
Variations and modifications to the above-described embodiments may also occur to those skilled in the art, which fall within the scope of the invention as disclosed and taught herein. Therefore, the present invention is not limited to the above-mentioned embodiments, and any obvious improvement, replacement or modification made by those skilled in the art based on the present invention is within the protection scope of the present invention. Furthermore, although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
Claims (8)
1. A cross-modal image-text retrieval method is characterized by comprising the following steps:
s1, extracting visual features of the image and text features of the sentence by using a deep neural network;
s2, designing a unified multi-mode Transformer reasoning network, and modeling different modal data in a unified mode; the unified multi-mode Transformer reasoning network respectively extracts intra-modal interaction information and inter-modal interaction information of the vision and the text, and calculates the similarity of the vision characteristic and the text characteristic through a self-adaptive similarity fusion module;
and S3, training the unified multi-modal Transformer inference network, inputting the visual features and the text features extracted in step S1 into the trained unified multi-modal Transformer inference network to obtain the similarity of the visual features and the text features, and outputting the image-text retrieval result.
2. The cross-modal image-text retrieval method according to claim 1, wherein in step S1, the specific way of extracting the visual features of the image is as follows:
aiming at each input image, a pre-training model of an Faster-R-CNN network on a Visual Genome data set is used as a Visual feature extractor to extract the features of the region of interest;
for each selected region of interest, a position embedding vector is constructed from its bounding box, where (x_l, y_l) and (x_r, y_r) represent the coordinates of the upper-left and lower-right corners of the region, and W and H represent the width and height of the input image; the visual features of the image are represented as V = {v_1, ···, v_k}, v_i ∈ ℝ^D, where k is the number of regions of interest and D is the visual feature dimension.
3. The cross-modal image-text retrieval method according to claim 1, wherein in step S1, the specific way of extracting the text features of the sentence is:
for each input sentence, a group of word vectors is extracted with the pre-trained language model BERT and then input into a fully connected layer to obtain a group of D-dimensional text features, represented as T = {t_1, ···, t_n}, t_i ∈ ℝ^D, where t_i is a text feature vector, ℝ^D is the vector space, n is the number of words in the sentence, and D is the text feature dimension.
4. The cross-modal image-text retrieval method according to claim 1, wherein in step S2, a self-attention Transformer encoder is used to extract the intra-modal and inter-modal interaction information of vision and text.
5. The cross-modal image-text retrieval method according to claim 1, wherein in step S3, a pre-training/fine-tuning two-stage training method is used to train the unified multi-modal Transformer inference network, wherein the first training stage generates intra-modal context representation information of the visual or text single modality, and the second training stage generates interaction information between the visual and text modalities.
6. The cross-modal image-text retrieval method according to claim 5, wherein the training process of the first training stage is: for an input image or text, a group of visual or text features is extracted with the twin intra-modal encoder module intra-MMTN, according to the formulas {v̄_1, ···, v̄_k} = intra-MMTN(v_1, ···, v_k) and {t̄_1, ···, t̄_n} = intra-MMTN(t_1, ···, t_n), where {v_1, ···, v_k} and {t_1, ···, t_n} represent the fine-grained fragment features of the input, i.e. the region features of the image or the word features of the text, and {v̄_1, ···, v̄_k} and {t̄_1, ···, t̄_n} represent the output image or text context representation information; the encoders of the twin intra-modal encoder module share parameters.
7. The cross-modal image-text retrieval method according to claim 5, wherein the training process of the second training stage is: on the basis of the first-stage training, the fragment features {v_1, ···, v_k} and {t_1, ···, t_n} are concatenated and input into the inter-modal encoder module inter-MMTN; the inter-modal encoder module loads the pre-trained model of the twin intra-modal encoder module and is then trained by fine-tuning, according to the formula {ṽ_1, ···, ṽ_k, t̃_1, ···, t̃_n} = inter-MMTN(v_1, ···, v_k, t_1, ···, t_n), where ṽ_i and t̃_j represent the inter-modal interaction context representation information.
8. The cross-modal image-text retrieval method according to claim 1, wherein the specific way in which the adaptive similarity fusion module calculates the similarity of the visual features and the text features is:
for a set of visual context representations {v̄_1, ···, v̄_k} and text context representations {t̄_1, ···, t̄_n}, where v̄_i, t̄_j ∈ ℝ^D, a cross-modal fine-grained matching degree matrix A is defined by A_ij = (W_v v̄_i)^T (W_t t̄_j), where element A_ij represents the semantic similarity of the i-th visual context representation and the j-th text context representation, and W_v and W_t are network parameters;
defining the weighted sum of the cross-modal fine-grained matching degree matrix A as the global image-text similarity S(I, T), where the weight of each fine-grained similarity is obtained by applying a softmax activation over the column sums of A: S(I, T) = Σ_j w_j s_j with s_j = Σ_i A_ij and w_j = softmax_j(λ s_j), where λ is the temperature coefficient of the softmax activation function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210433101.2A CN114911914A (en) | 2022-04-24 | 2022-04-24 | Cross-modal image-text retrieval method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210433101.2A CN114911914A (en) | 2022-04-24 | 2022-04-24 | Cross-modal image-text retrieval method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114911914A true CN114911914A (en) | 2022-08-16 |
Family
ID=82764765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210433101.2A Pending CN114911914A (en) | 2022-04-24 | 2022-04-24 | Cross-modal image-text retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911914A (en) |
- 2022-04-24: application CN202210433101.2A filed; published as CN114911914A, legal status: Pending
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115203377A (en) * | 2022-09-09 | 2022-10-18 | 北京澜舟科技有限公司 | Model enhancement training method and system based on retrieval and storage medium |
CN115661594A (en) * | 2022-10-19 | 2023-01-31 | 海南港航控股有限公司 | Image-text multi-mode feature representation method and system based on alignment and fusion |
CN115661594B (en) * | 2022-10-19 | 2023-08-18 | 海南港航控股有限公司 | Image-text multi-mode feature representation method and system based on alignment and fusion |
CN115797655A (en) * | 2022-12-13 | 2023-03-14 | 南京恩博科技有限公司 | Character interaction detection model, method, system and device |
CN115797655B (en) * | 2022-12-13 | 2023-11-07 | 南京恩博科技有限公司 | Character interaction detection model, method, system and device |
CN116486420B (en) * | 2023-04-12 | 2024-01-12 | 北京百度网讯科技有限公司 | Entity extraction method, device and storage medium of document image |
CN116486420A (en) * | 2023-04-12 | 2023-07-25 | 北京百度网讯科技有限公司 | Entity extraction method, device and storage medium of document image |
CN116258145A (en) * | 2023-05-06 | 2023-06-13 | 华南师范大学 | Multi-mode named entity recognition method, device, equipment and storage medium |
CN116258145B (en) * | 2023-05-06 | 2023-07-25 | 华南师范大学 | Multi-mode named entity recognition method, device, equipment and storage medium |
CN116431847A (en) * | 2023-06-14 | 2023-07-14 | 北京邮电大学 | Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure |
CN116431847B (en) * | 2023-06-14 | 2023-11-14 | 北京邮电大学 | Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure |
CN117093692A (en) * | 2023-08-23 | 2023-11-21 | 广东技术师范大学 | Multi-granularity image-text matching method and system based on depth fusion |
CN117669738A (en) * | 2023-12-20 | 2024-03-08 | 苏州元脑智能科技有限公司 | Engine updating method, processing method, device, equipment, medium and robot |
CN117669738B (en) * | 2023-12-20 | 2024-04-26 | 苏州元脑智能科技有限公司 | Engine updating method, processing method, device, equipment, medium and robot |
CN117609527A (en) * | 2024-01-16 | 2024-02-27 | 合肥人工智能与大数据研究院有限公司 | Cross-modal data retrieval optimization method based on vector database |
CN117688193A (en) * | 2024-02-01 | 2024-03-12 | 湘江实验室 | Picture and text unified coding method, device, computer equipment and medium |
CN117876651A (en) * | 2024-03-13 | 2024-04-12 | 浪潮电子信息产业股份有限公司 | Visual positioning method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114911914A (en) | Cross-modal image-text retrieval method | |
CN112905827B (en) | Cross-modal image-text matching method, device and computer readable storage medium | |
CN112966127A (en) | Cross-modal retrieval method based on multilayer semantic alignment | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN112687388B (en) | Explanatory intelligent medical auxiliary diagnosis system based on text retrieval | |
WO2023065617A1 (en) | Cross-modal retrieval system and method based on pre-training model and recall and ranking | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
CN113010700B (en) | Image text cross-modal retrieval method based on category information alignment | |
CN116611024A (en) | Multi-modal Transformer sarcasm detection method based on fact and sentiment incongruity |
CN115424096B (en) | Multi-view zero-sample image identification method | |
CN108536735A (en) | Multi-modal lexical representation method and system based on multi-channel autoencoder |
CN116450834A (en) | Archive knowledge graph construction method based on multi-mode semantic features | |
CN114970517A (en) | Visual question and answer oriented method based on multi-modal interaction context perception | |
CN113423004A (en) | Video subtitle generating method and system based on decoupling decoding | |
CN117648429B (en) | Question-answering method and system based on multi-mode self-adaptive search type enhanced large model | |
CN116450877A (en) | Image text matching method based on semantic selection and hierarchical alignment | |
CN117421591A (en) | Multi-modal characterization learning method based on text-guided image block screening | |
CN114564768A (en) | End-to-end intelligent plane design method based on deep learning | |
CN112528989B (en) | Description generation method for semantic fine granularity of image | |
Jiang et al. | Hadamard product perceptron attention for image captioning | |
CN116414988A (en) | Graph convolution aspect emotion classification method and system based on dependency relation enhancement | |
CN117010407A (en) | Multi-mode emotion analysis method based on double-flow attention and gating fusion | |
CN116662591A (en) | Robust visual question-answering model training method based on contrast learning | |
CN116561305A (en) | False news detection method based on multiple modes and transformers | |
CN116662924A (en) | Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||