CN114911914A - Cross-modal image-text retrieval method - Google Patents

Cross-modal image-text retrieval method

Info

Publication number
CN114911914A
Authority
CN
China
Prior art keywords
text
modal
visual
features
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210433101.2A
Other languages
Chinese (zh)
Inventor
冀中
王耀东
陈珂鑫
王港
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
CETC 54 Research Institute
Original Assignee
Tianjin University
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University, CETC 54 Research Institute filed Critical Tianjin University
Priority to CN202210433101.2A priority Critical patent/CN114911914A/en
Publication of CN114911914A publication Critical patent/CN114911914A/en
Pending legal-status Critical Current

Classifications

    • G06F16/332 Query formulation (information retrieval of unstructured textual data)
    • G06F16/38 Retrieval characterised by using metadata (unstructured textual data)
    • G06F16/532 Query formulation, e.g. graphical querying (still image data)
    • G06F16/5846 Retrieval using metadata automatically derived from the content using extracted text (still image data)
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/82 Image or video recognition or understanding using neural networks

Abstract

The invention provides a cross-modal image-text retrieval method and belongs to the technical field of machine learning. The method inputs the visual and text features of a sample into a unified multi-modal Transformer inference network to acquire intra-modal and inter-modal interactive inference information. The training process of the network is divided into two stages: in the first stage, a twin multi-modal Transformer encoder encodes the visual and text information to obtain intra-modal context information; in the second stage, the fused visual and text information is input into the same multi-modal Transformer encoder to obtain inter-modal interactive context information. Finally, the model adopts a novel adaptive similarity fusion mechanism to perform cross-modal image-text similarity matching and outputs the result. The invention alleviates the technical problem that intra-modal and inter-modal interaction information is acquired separately in existing image-text retrieval methods, and enables features of different modalities to interact with and complement each other.

Description

Cross-modal image-text retrieval method
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a cross-modal image-text retrieval method based on a unified multi-modal Transformer inference network.
Background
Cross-modal image-text retrieval is a fundamental and promising research direction in the field of multi-modal machine learning. The task aims to retrieve corresponding data of another modality given a query sample of one modality submitted by a user. With the popularization of the Internet, cross-modal image-text retrieval is becoming part of the daily life and work of Internet users. In order to fully utilize the high-level semantic information of multiple modalities, many existing methods attempt to mine the interactive semantic information within or between modalities.
However, the inventors have found that existing methods share a common shortcoming: intra-modal and inter-modal interactive inference information is acquired separately, which severs the fine-grained interrelations among different modalities, and these interrelations are critical to the accuracy of cross-modal retrieval.
Therefore, research on a model capable of simultaneously acquiring intra-modal and inter-modal interactive reasoning information of visual and text features has strong academic and practical value.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cross-modal image-text retrieval method. Through joint modeling of the visual and textual multi-modal interactive context information, the method fully mines the associations and complementarity among fine-grained features of different modalities, and can effectively improve the accuracy of cross-modal image-text retrieval.
In order to achieve the purpose, the invention adopts the following technical scheme:
a cross-modal image-text retrieval method comprises the following steps:
s1, extracting visual features of the image and text features of the sentence by using a deep neural network;
S2, designing a unified multi-modal Transformer inference network, and modeling data of different modalities in a unified manner; the unified multi-modal Transformer inference network extracts the intra-modal and inter-modal interaction information of vision and text respectively, and calculates the similarity of the visual features and the text features through an adaptive similarity fusion module;
S3, training the unified multi-modal Transformer inference network, inputting the visual features and the text features extracted in step S1 into the trained unified multi-modal Transformer inference network to obtain the similarity of the visual features and the text features, and outputting the image-text retrieval result.
Further, in step S1, the visual features of the image are extracted as follows:
for each input image, a Faster R-CNN model pre-trained on the Visual Genome dataset is adopted as the visual feature extractor to extract the features of the regions of interest;
for each selected region of interest, a position embedding vector is constructed from its corner coordinates normalized by the image size, where $(x_l, y_l)$ and $(x_r, y_r)$ denote the coordinates of the upper-left and lower-right corners of the region, and $W$ and $H$ denote the width and height of the input image;
the visual features of the image are represented as $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{k\times D}$, where $v$ denotes a visual feature vector, $\mathbb{R}^{k\times D}$ is the vector space, $k$ is the number of regions of interest, and $D$ is the visual feature dimension.
Further, in step S1, the text features of the sentence are extracted as follows:
for each input sentence, a group of word vectors is extracted with the pre-trained language model BERT and then fed into a fully connected layer to obtain a group of text features of dimension $D$, represented as $T=\{t_1,\dots,t_n\}\in\mathbb{R}^{n\times D}$, where $t$ denotes a text feature vector, $\mathbb{R}^{n\times D}$ is the vector space, $n$ is the number of text tokens, and $D$ is the text feature dimension.
Further, in step S2, a self-attention Transformer encoder is used to extract visual and textual intra-modal interaction information and inter-modal interaction information.
Further, in step S3, a pre-training-fine-tuning two-stage training method is used to train the unified multi-modal Transformer inference network, where a first training stage generates intra-modal context representation information of a visual or text single mode, and a second training stage generates interaction information between the visual and text modes.
Further, the training process of the first training stage is as follows: for an input image or text, a group of visual or text features is extracted with the twin intra-modal encoder module intra-MMTN, according to:
$$\{\hat{v}_1,\dots,\hat{v}_k\}=\mathrm{intra\text{-}MMTN}(\{v_1,\dots,v_k\}),$$
$$\{\hat{t}_1,\dots,\hat{t}_n\}=\mathrm{intra\text{-}MMTN}(\{t_1,\dots,t_n\}),$$
where $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ denote the fine-grained segment features of the input, i.e. the region features of the image or the word features of the text, and $\{\hat{v}_1,\dots,\hat{v}_k\}$ and $\{\hat{t}_1,\dots,\hat{t}_n\}$ denote the output image or text context representation information; the encoders of the twin intra-modal encoder module share parameters.
Further, the training process of the second training stage is as follows: on the basis of the first-stage training, the segment features $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ are concatenated and input into the inter-modal encoder module inter-MMTN, which loads the pre-trained model of the twin intra-modal encoder module and is then trained by fine-tuning, according to:
$$\{\tilde{v}_1,\dots,\tilde{v}_k,\tilde{t}_1,\dots,\tilde{t}_n\}=\mathrm{inter\text{-}MMTN}([v_1,\dots,v_k;\,t_1,\dots,t_n]),$$
where $\{\tilde{v}_1,\dots,\tilde{v}_k\}$ and $\{\tilde{t}_1,\dots,\tilde{t}_n\}$ denote the inter-modal interaction context representation information.
Further, the adaptive similarity fusion module calculates the similarity of the visual features and the text features as follows:
for a group of visual context representations $\{\hat{v}_1,\dots,\hat{v}_k\}$ and text context representations $\{\hat{t}_1,\dots,\hat{t}_n\}$, where $\hat{v}_i\in\mathbb{R}^{D}$ and $\hat{t}_j\in\mathbb{R}^{D}$, a cross-modal fine-grained matching degree matrix $A$ is defined as
$$A_{ij}=(W_v\hat{v}_i)^{\top}(W_t\hat{t}_j),$$
where the element $A_{ij}$ represents the semantic similarity of the $i$-th visual context representation and the $j$-th text context representation, and $W_v$ and $W_t$ are network parameters;
the weighted sum over the cross-modal fine-grained matching degree matrix $A$ is defined as the global image-text similarity $S(I,T)$, where each fine-grained similarity weight is obtained by applying a softmax activation function to the aggregated column elements of $A$:
$$S(I,T)=\sum_{j=1}^{n}\frac{\exp(\lambda a_j)}{\sum_{j'=1}^{n}\exp(\lambda a_{j'})}\,a_j,\qquad a_j=\sum_{i=1}^{k}A_{ij},$$
where $\lambda$ is the temperature coefficient of the softmax activation function.
The beneficial effects of the invention are:
1) The invention alleviates the technical problem that intra-modal and inter-modal interaction information is acquired separately in existing image-text retrieval methods, and enables features of different modalities to interact with and complement each other.
2) The method divides model training into a pre-training stage and a fine-tuning stage, can efficiently handle the cross-modal retrieval problem, and uses no external knowledge: only the cross-modal image-text retrieval dataset is used to train the model, which avoids high computation cost and alleviates the training bias caused by dependence on big data.
3) The invention designs a new adaptive multi-modal data fusion scheme, which can effectively improve the accuracy of image-text retrieval.
Drawings
FIG. 1 is a diagram showing a structure of a Transformer module.
FIG. 2 is a flow chart of the present invention.
FIG. 3 is a schematic diagram of a unified multi-modal Transformer inference network according to the present invention.
Fig. 4 is a schematic diagram of an adaptive similarity fusion module according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and specific examples, but the invention is not limited thereto.
A cross-modal image-text retrieval method uses deep neural networks to extract the visual features of a sample image and the text features of a query sentence; a unified multi-modal Transformer inference network MMTN is designed to model data of different modalities in a unified manner and to extract the intra-modal and inter-modal interaction information of vision and text respectively; a novel adaptive similarity fusion module ASA (adaptive similarity aggregation) is also designed, which dynamically fuses the generated fine-grained cross-modal context representation information, calculates their similarity, and outputs the image-text retrieval result.
The extraction process of the visual features of the sample image comprises the following steps:
For each input image, Faster R-CNN with ResNet-101 as the backbone network is adopted as the visual feature extractor. A model of this extractor pre-trained on the Visual Genome dataset is used to extract 2048-dimensional ROI (region of interest) features, and the 36 ROI features with the highest confidence are selected.
For each selected ROI, a position embedding vector is constructed from the corner coordinates of the region normalized by the image size, where $(x_l, y_l)$ and $(x_r, y_r)$ denote the coordinates of the upper-left and lower-right corners of the region, and $W$ and $H$ denote the width and height of the input image. The position vector of the ROI is then converted into 1024 dimensions. Finally, the visual features of the image are represented as $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{k\times D}$.
The extraction process of the text features of the query sentence comprises the following steps:
For each input sentence, a group of word vectors is extracted with the pre-trained language model BERT and then fed into a fully connected layer to obtain a group of text features of dimension $D$, represented as $T=\{t_1,\dots,t_n\}\in\mathbb{R}^{n\times D}$.
The generation process of the context interaction information of the visual and text modalities comprises the following steps:
A self-attention Transformer encoder is used to extract the intra-modal and inter-modal interactive inference information of vision and text.
In addition, a pre-training and fine-tuning two-stage method is adopted to train the MMTN network. The first training stage generates intra-modal context representation information of the visual or text single modality, and the second training stage generates interaction information between the visual and text modalities.
Further, the training process of the first training stage is as follows: for an input image or text, a group of visual or text features is extracted with the twin encoder module intra-MMTN, according to:
$$\{\hat{v}_1,\dots,\hat{v}_k\}=\mathrm{intra\text{-}MMTN}(\{v_1,\dots,v_k\}),$$
$$\{\hat{t}_1,\dots,\hat{t}_n\}=\mathrm{intra\text{-}MMTN}(\{t_1,\dots,t_n\}),$$
where $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ denote the fine-grained segment features of the input (the region features of an image or the word features of a text), and $\{\hat{v}_1,\dots,\hat{v}_k\}$ and $\{\hat{t}_1,\dots,\hat{t}_n\}$ denote the output image or text context representation information.
The training process of the second training stage is as follows: on the basis of the first-stage training, the segment features $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ are concatenated and input into the inter-MMTN module, according to:
$$\{\tilde{v}_1,\dots,\tilde{v}_k,\tilde{t}_1,\dots,\tilde{t}_n\}=\mathrm{inter\text{-}MMTN}([v_1,\dots,v_k;\,t_1,\dots,t_n]),$$
where $\{\tilde{v}_1,\dots,\tilde{v}_k\}$ and $\{\tilde{t}_1,\dots,\tilde{t}_n\}$ denote the inter-modal interaction context representation information.
The cross-modal matching calculation process of the adaptive similarity fusion module ASA comprises the following steps:
for a group of visual context representations $\{\hat{v}_1,\dots,\hat{v}_k\}$ and text context representations $\{\hat{t}_1,\dots,\hat{t}_n\}$, where $\hat{v}_i\in\mathbb{R}^{D}$ and $\hat{t}_j\in\mathbb{R}^{D}$, the invention defines a cross-modal fine-grained matching degree matrix $A$:
$$A_{ij}=(W_v\hat{v}_i)^{\top}(W_t\hat{t}_j),$$
where $A_{ij}$ represents the semantic similarity of the $i$-th visual context representation and the $j$-th text context representation, and $W_v$ and $W_t$ are network parameters.
The global image-text similarity is defined as a weighted sum over the cross-modal fine-grained matching degree matrix $A$, where each fine-grained similarity weight is obtained by applying a softmax activation function to the aggregated column elements of $A$:
$$S(I,T)=\sum_{j=1}^{n}\frac{\exp(\lambda a_j)}{\sum_{j'=1}^{n}\exp(\lambda a_{j'})}\,a_j,\qquad a_j=\sum_{i=1}^{k}A_{ij},$$
where $\lambda$ is the temperature coefficient of the softmax activation function. In general, the adaptive similarity fusion process can be expressed as $S(I,T)=\mathrm{ASA}(\{\hat{v}_i\},\{\hat{t}_j\})$.
with the rapid development of internet technology, multimodality data such as images, videos and texts are becoming important media for human perception of the world, so how to enable a computer to accurately understand multimodality data and realize cross-modality retrieval is a research subject with great practical value. The Transformer adopted by the method is a novel neural network which is widely used in recent years and takes a self-attention mechanism as a core. The Transformer encoder may vector a set of sequences
Figure BDA00036117405500000712
Figure BDA00036117405500000713
Mapping to another set of sequence vectors
Figure BDA00036117405500000714
Wherein the self-attention mechanism enables the network to capture implicit semantic associations between input sequence vectors. Specifically, the transform module attention mechanism comprises a query vector Q, a key vector K, and a value vector V, and its corresponding feature matrix can be represented as d q ,d k ,d v Dimension. Each query vector Q is taken as oneThe anchor vector assigns a weight to the value vector V according to the semantic relevance between the key vector K and the query vector Q, so that the convex combination of the value vector V can be considered as the output of the query vector Q after self-attention processing. The Transformer can be regarded as a feature enhancement operation mode using the self-attention mechanism, and the matrix operation formula is as follows:
Figure BDA0003611740550000081
the structure of the Transformer module is shown in FIG. 1. The method adopts a Transformer module as a key component of the MMTN network, an encoder adopts 2 layers, and each layer comprises 4 attention heads. The steps for implementing the cross-modality retrieval of graphics and text are shown in fig. 2.
Authoritative datasets for the cross-modal image-text retrieval task include MS-COCO, Flickr30k and others. Taking the MS-COCO dataset as an example, it contains 123,287 images, each with 5 descriptive sentences. In this embodiment, the dataset is split as follows: 5000 images are used as the validation set, 5000 images as the test set, and the rest as the training set. The evaluation metric is the recall rate R@K, with K = 1, 5, 10.
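As an illustration of the R@K metric, the sketch below computes recall at K from a similarity matrix between queries and candidates; it assumes one ground-truth candidate per query (the MS-COCO protocol with 5 captions per image would repeat this per caption) and is not taken from the patent itself.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, gt: np.ndarray, k: int) -> float:
    """sim[i, j]: similarity of query i to candidate j; gt[i]: index of the correct candidate."""
    ranks = np.argsort(-sim, axis=1)                      # candidates sorted by descending similarity
    hits = (ranks[:, :k] == gt[:, None]).any(axis=1)      # correct candidate within the top K?
    return float(hits.mean())

sim = np.random.rand(100, 100)                            # toy similarity matrix
gt = np.arange(100)                                       # candidate i is correct for query i
print([recall_at_k(sim, gt, k) for k in (1, 5, 10)])
```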
The cross-modal image-text retrieval method is described in detail as follows:
s1, in this embodiment, the original two-dimensional image needs to be preprocessed, and the embodiment adopts ResNet-101 and Faster-R-CNN networks to extract 2048-dimensional ROI features of the two-dimensional image. The Faster-R-CNN network obtains a series of ROI features with different confidence levels, and the embodiment selects 36 features with the highest confidence levels.
For each selected ROI feature, a position embedding vector is constructed from the corner coordinates of the region normalized by the image size, where $(x_l, y_l)$ and $(x_r, y_r)$ denote the coordinates of the upper-left and lower-right corners of the region, and $W$ and $H$ denote the width and height of the input image. The position vector of the ROI is then converted into 1024 dimensions. Finally, the visual features of the image are represented as $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{k\times D}$.
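One possible realization of this step is sketched below: the four corner coordinates are normalized by the image width and height, projected to 1024 dimensions with a linear layer, and fused with the 2048-dimensional ROI appearance feature. The exact form of the position vector and of the fusion is not spelled out in the text, so the `RoiEmbedding` module here is only an assumed illustration.

```python
import torch
import torch.nn as nn

class RoiEmbedding(nn.Module):
    """Assumed sketch: normalized box coordinates -> 1024-d position vector, fused with the ROI feature."""
    def __init__(self, roi_dim: int = 2048, pos_dim: int = 1024):
        super().__init__()
        self.pos_proj = nn.Linear(4, pos_dim)               # (x_l/W, y_l/H, x_r/W, y_r/H) -> 1024-d
        self.out_proj = nn.Linear(roi_dim + pos_dim, roi_dim)

    def forward(self, roi_feat, boxes, img_w, img_h):
        # roi_feat: (k, 2048); boxes: (k, 4) with (x_l, y_l, x_r, y_r)
        norm = boxes / torch.tensor([img_w, img_h, img_w, img_h], dtype=boxes.dtype)
        pos = self.pos_proj(norm)
        return self.out_proj(torch.cat([roi_feat, pos], dim=-1))

emb = RoiEmbedding()
v = emb(torch.randn(36, 2048), torch.rand(36, 4) * 480, img_w=640, img_h=480)  # toy boxes
print(v.shape)  # torch.Size([36, 2048])
```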
In this embodiment, the original text data also needs to be preprocessed: a BERT network is used to extract a 768-dimensional word vector for each word in the text, and the word vectors are then fed into a fully connected layer to obtain a group of 2048-dimensional text features, represented as $T=\{t_1,\dots,t_n\}\in\mathbb{R}^{n\times D}$.
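The text branch described above could be realized as in the following sketch with the Hugging Face `transformers` library; the specific checkpoint `bert-base-uncased` (which produces 768-dimensional word vectors) and the use of a single linear layer are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
proj = nn.Linear(768, 2048)                        # map 768-d word vectors to 2048-d text features

inputs = tokenizer("a man riding a horse on the beach", return_tensors="pt")
with torch.no_grad():
    word_vecs = bert(**inputs).last_hidden_state   # (1, n, 768) word vectors
text_feats = proj(word_vecs)                       # (1, n, 2048), i.e. T in R^{n x D}
print(text_feats.shape)
```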
The visual and textual features obtained in the above steps are used as the input data of the self-attention Transformer encoder.
S2, a self-attention Transformer encoder is adopted to extract the intra-modal and inter-modal interactive inference information of vision and text. This embodiment employs a pre-training and fine-tuning two-stage strategy to train the MMTN network. The first stage generates intra-modal context representation information of the visual or text single modality, and the second stage generates interaction information between the visual and text modalities, as shown in FIG. 3.
The training process of the first training stage is as follows: for an input image and text, a group of visual or text features is extracted with the parameter-sharing twin encoder module intra-MMTN (Multi-Modal Transformer Network), according to:
$$\{\hat{v}_1,\dots,\hat{v}_k\}=\mathrm{intra\text{-}MMTN}(\{v_1,\dots,v_k\}),$$
$$\{\hat{t}_1,\dots,\hat{t}_n\}=\mathrm{intra\text{-}MMTN}(\{t_1,\dots,t_n\}),$$
where $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ denote the fine-grained segment features of the input (the region features of an image or the word features of a text), and $\{\hat{v}_1,\dots,\hat{v}_k\}$ and $\{\hat{t}_1,\dots,\hat{t}_n\}$ denote the output image or text context representation information.
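The parameter sharing of the twin intra-MMTN encoder can be illustrated by applying one and the same encoder instance to both modalities, as in the sketch below; the PyTorch implementation and the layer sizes are assumptions for illustration, not the patent's own code.

```python
import torch
import torch.nn as nn

# One shared ("twin") self-attention encoder processes either modality,
# so the visual branch and the textual branch use identical parameters.
layer = nn.TransformerEncoderLayer(d_model=2048, nhead=4, batch_first=True)
intra_mmtn = nn.TransformerEncoder(layer, num_layers=2)

v = torch.randn(8, 36, 2048)        # region features {v_1, ..., v_k}
t = torch.randn(8, 20, 2048)        # word features  {t_1, ..., t_n}

v_ctx = intra_mmtn(v)               # intra-modal visual context representations
t_ctx = intra_mmtn(t)               # intra-modal textual context representations
```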
The second training stage is based on the first-stage training; it is essentially a fine-tuning of the model obtained in the first, pre-training stage. Specifically, the segment features $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ are concatenated and input into the inter-MMTN module, according to:
$$\{\tilde{v}_1,\dots,\tilde{v}_k,\tilde{t}_1,\dots,\tilde{t}_n\}=\mathrm{inter\text{-}MMTN}([v_1,\dots,v_k;\,t_1,\dots,t_n]),$$
where $\{\tilde{v}_1,\dots,\tilde{v}_k\}$ and $\{\tilde{t}_1,\dots,\tilde{t}_n\}$ denote the inter-modal interaction context representation information.
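The second stage, concatenating the two feature sequences and fine-tuning an encoder initialized from the first-stage weights, could look like the sketch below; the state-dict hand-over shown here is one plausible way to load the pre-trained intra-modal model and is an assumption for illustration.

```python
import torch
import torch.nn as nn

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model=2048, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

intra_mmtn = make_encoder()                            # stand-in for the stage-one pre-trained encoder
inter_mmtn = make_encoder()
inter_mmtn.load_state_dict(intra_mmtn.state_dict())    # initialize stage two from stage one, then fine-tune

v = torch.randn(8, 36, 2048)                           # region features
t = torch.randn(8, 20, 2048)                           # word features
joint = torch.cat([v, t], dim=1)                       # concatenated sequence of length k + n

ctx = inter_mmtn(joint)                                # inter-modal context representations
v_ctx, t_ctx = ctx[:, :36], ctx[:, 36:]                # split back into visual / textual parts
```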
S3, the cross-modal matching calculation process of the adaptive similarity fusion module ASA is as follows: for a group of visual context features $\{\hat{v}_1,\dots,\hat{v}_k\}$ and text context features $\{\hat{t}_1,\dots,\hat{t}_n\}$, where $\hat{v}_i\in\mathbb{R}^{D}$ and $\hat{t}_j\in\mathbb{R}^{D}$, the invention defines a cross-modal fine-grained matching degree matrix $A$:
$$A_{ij}=(W_v\hat{v}_i)^{\top}(W_t\hat{t}_j),$$
where $A_{ij}$ represents the semantic similarity of the $i$-th visual context representation and the $j$-th text context representation, as shown in FIG. 4.
The global image-text similarity is defined as a weighted sum over the cross-modal fine-grained matching degree matrix $A$, where each fine-grained similarity weight is obtained by applying a softmax activation function to the aggregated column elements of $A$:
$$S(I,T)=\sum_{j=1}^{n}\frac{\exp(\lambda a_j)}{\sum_{j'=1}^{n}\exp(\lambda a_{j'})}\,a_j,\qquad a_j=\sum_{i=1}^{k}A_{ij},$$
where $\lambda$ is the temperature coefficient of the softmax activation function; in this embodiment it takes the value 20. In general, the adaptive similarity fusion process can be expressed as $S(I,T)=\mathrm{ASA}(\{\hat{v}_i\},\{\hat{t}_j\})$.
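A minimal sketch of the adaptive similarity aggregation as reconstructed above (bilinear fine-grained matching matrix, column aggregation, temperature-softmax weighted sum) is given below; since the original formulas are only available as figure references, the exact aggregation details are an assumption.

```python
import torch
import torch.nn as nn

class ASA(nn.Module):
    """Adaptive similarity aggregation sketch (assumed reconstruction of the ASA module)."""
    def __init__(self, dim: int = 2048, temperature: float = 20.0):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v
        self.w_t = nn.Linear(dim, dim, bias=False)   # W_t
        self.temperature = temperature               # lambda

    def forward(self, v_ctx: torch.Tensor, t_ctx: torch.Tensor) -> torch.Tensor:
        # v_ctx: (k, D) visual context features; t_ctx: (n, D) text context features
        a = self.w_v(v_ctx) @ self.w_t(t_ctx).T      # fine-grained matching matrix A, shape (k, n)
        col = a.sum(dim=0)                           # aggregate each column over the k regions
        weights = torch.softmax(self.temperature * col, dim=0)
        return (weights * col).sum()                 # global image-text similarity S(I, T)

asa = ASA()
score = asa(torch.randn(36, 2048), torch.randn(20, 2048))
print(score.item())
```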
For each image-text pair, two similarity scores can be obtained from the ASA module, one from the intra-modal context representations and one from the inter-modal context representations.
The training process uses a triplet ranking loss to train the model:
$$L=\sum\big[\alpha-S(I,T)+S(I,T^{-})\big]_{+}+\big[\alpha-S(I,T)+S(I^{-},T)\big]_{+},$$
where $(I,T)$ is a matched image-text pair, $T^{-}$ and $I^{-}$ are negative samples, $\alpha$ is the margin, and $[x]_{+}=\max(x,0)$.
the training learning rate is 0.00001.
In summary, the invention inputs the visual and text features of a sample into a unified multi-modal Transformer inference network MMTN to acquire intra-modal and inter-modal interactive inference information. The training process of the network is divided into two stages: in the first stage, a twin multi-modal Transformer encoder encodes the visual and text information to obtain intra-modal context information; in the second stage, the fused visual and text information is input into the same multi-modal Transformer encoder to obtain inter-modal interactive context information. Finally, the model adopts a novel adaptive similarity fusion mechanism to perform cross-modal image-text similarity matching and outputs the result. The invention adopts a Transformer encoder to uniformly model the intra-modal and inter-modal interaction information of vision and text, and deeply mines the relevance and complementarity of different modalities as a whole.
Variations and modifications to the above-described embodiments may also occur to those skilled in the art, which fall within the scope of the invention as disclosed and taught herein. Therefore, the present invention is not limited to the above-mentioned embodiments, and any obvious improvement, replacement or modification made by those skilled in the art based on the present invention is within the protection scope of the present invention. Furthermore, although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (8)

1. A cross-modal image-text retrieval method is characterized by comprising the following steps:
s1, extracting visual features of the image and text features of the sentence by using a deep neural network;
S2, designing a unified multi-modal Transformer inference network, and modeling data of different modalities in a unified manner; the unified multi-modal Transformer inference network extracts the intra-modal and inter-modal interaction information of vision and text respectively, and calculates the similarity of the visual features and the text features through an adaptive similarity fusion module;
S3, training the unified multi-modal Transformer inference network, inputting the visual features and the text features extracted in step S1 into the trained unified multi-modal Transformer inference network to obtain the similarity of the visual features and the text features, and outputting the image-text retrieval result.
2. The cross-modal image-text retrieval method of claim 1, wherein in step S1, the visual features of the image are extracted as follows:
for each input image, a Faster R-CNN model pre-trained on the Visual Genome dataset is used as the visual feature extractor to extract the features of the regions of interest;
for each selected region of interest, a position embedding vector is constructed from its corner coordinates normalized by the image size, where $(x_l, y_l)$ and $(x_r, y_r)$ denote the coordinates of the upper-left and lower-right corners of the region, and $W$ and $H$ denote the width and height of the input image;
the visual features of the image are represented as $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{k\times D}$, where $v$ denotes a visual feature vector, $\mathbb{R}^{k\times D}$ is the vector space, $k$ is the number of regions of interest, and $D$ is the visual feature dimension.
3. The cross-modal image-text retrieval method of claim 1, wherein in step S1, the text features of the sentence are extracted as follows:
for each input sentence, a group of word vectors is extracted with the pre-trained language model BERT and then fed into a fully connected layer to obtain a group of text features of dimension $D$, represented as $T=\{t_1,\dots,t_n\}\in\mathbb{R}^{n\times D}$, where $t$ denotes a text feature vector, $\mathbb{R}^{n\times D}$ is the vector space, $n$ is the number of text tokens, and $D$ is the text feature dimension.
4. The cross-modal image-text retrieval method of claim 1, wherein in step S2, a self-attention Transformer encoder is used to extract the intra-modal and inter-modal interaction information of vision and text.
5. The cross-modal image-text retrieval method of claim 1, wherein in step S3, the unified multi-modal Transformer inference network is trained using a pre-training and fine-tuning two-stage training method, wherein the first training stage generates intra-modal context representation information of the visual or text single modality, and the second training stage generates interaction information between the visual and text modalities.
6. The cross-modal image-text retrieval method of claim 5, wherein the training process of the first training stage is: for an input image or text, a group of visual or text features is extracted with the twin intra-modal encoder module intra-MMTN, according to:
$$\{\hat{v}_1,\dots,\hat{v}_k\}=\mathrm{intra\text{-}MMTN}(\{v_1,\dots,v_k\}),$$
$$\{\hat{t}_1,\dots,\hat{t}_n\}=\mathrm{intra\text{-}MMTN}(\{t_1,\dots,t_n\}),$$
where $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ denote the fine-grained segment features of the input, i.e. the region features of the image or the word features of the text, and $\{\hat{v}_1,\dots,\hat{v}_k\}$ and $\{\hat{t}_1,\dots,\hat{t}_n\}$ denote the output image or text context representation information; the encoders of the twin intra-modal encoder module share parameters.
7. The cross-modal image-text retrieval method of claim 5, wherein the training process of the second training stage is: on the basis of the first-stage training, the segment features $\{v_1,\dots,v_k\}$ and $\{t_1,\dots,t_n\}$ are concatenated and input into the inter-modal encoder module inter-MMTN, which loads the pre-trained model of the twin intra-modal encoder module and is then trained by fine-tuning, according to:
$$\{\tilde{v}_1,\dots,\tilde{v}_k,\tilde{t}_1,\dots,\tilde{t}_n\}=\mathrm{inter\text{-}MMTN}([v_1,\dots,v_k;\,t_1,\dots,t_n]),$$
where $\{\tilde{v}_1,\dots,\tilde{v}_k\}$ and $\{\tilde{t}_1,\dots,\tilde{t}_n\}$ denote the inter-modal interaction context representation information.
8. The cross-modal image-text retrieval method of claim 1, wherein the adaptive similarity fusion module calculates the similarity of the visual features and the text features as follows:
for a group of visual context representations $\{\hat{v}_1,\dots,\hat{v}_k\}$ and text context representations $\{\hat{t}_1,\dots,\hat{t}_n\}$, where $\hat{v}_i\in\mathbb{R}^{D}$ and $\hat{t}_j\in\mathbb{R}^{D}$, a cross-modal fine-grained matching degree matrix $A$ is defined as
$$A_{ij}=(W_v\hat{v}_i)^{\top}(W_t\hat{t}_j),$$
where the element $A_{ij}$ represents the semantic similarity of the $i$-th visual context representation and the $j$-th text context representation, and $W_v$ and $W_t$ are network parameters;
the weighted sum over the cross-modal fine-grained matching degree matrix $A$ is defined as the global image-text similarity $S(I,T)$, where each fine-grained similarity weight is obtained by applying a softmax activation function to the aggregated column elements of $A$:
$$S(I,T)=\sum_{j=1}^{n}\frac{\exp(\lambda a_j)}{\sum_{j'=1}^{n}\exp(\lambda a_{j'})}\,a_j,\qquad a_j=\sum_{i=1}^{k}A_{ij},$$
where $\lambda$ is the temperature coefficient of the softmax activation function.
CN202210433101.2A 2022-04-24 2022-04-24 Cross-modal image-text retrieval method Pending CN114911914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210433101.2A CN114911914A (en) 2022-04-24 2022-04-24 Cross-modal image-text retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210433101.2A CN114911914A (en) 2022-04-24 2022-04-24 Cross-modal image-text retrieval method

Publications (1)

Publication Number Publication Date
CN114911914A true CN114911914A (en) 2022-08-16

Family

ID=82764765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210433101.2A Pending CN114911914A (en) 2022-04-24 2022-04-24 Cross-modal image-text retrieval method

Country Status (1)

Country Link
CN (1) CN114911914A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115203377A (en) * 2022-09-09 2022-10-18 北京澜舟科技有限公司 Model enhancement training method and system based on retrieval and storage medium
CN115661594A (en) * 2022-10-19 2023-01-31 海南港航控股有限公司 Image-text multi-mode feature representation method and system based on alignment and fusion
CN115661594B (en) * 2022-10-19 2023-08-18 海南港航控股有限公司 Image-text multi-mode feature representation method and system based on alignment and fusion
CN115797655A (en) * 2022-12-13 2023-03-14 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN116486420B (en) * 2023-04-12 2024-01-12 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image
CN116486420A (en) * 2023-04-12 2023-07-25 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image
CN116258145A (en) * 2023-05-06 2023-06-13 华南师范大学 Multi-mode named entity recognition method, device, equipment and storage medium
CN116258145B (en) * 2023-05-06 2023-07-25 华南师范大学 Multi-mode named entity recognition method, device, equipment and storage medium
CN116431847A (en) * 2023-06-14 2023-07-14 北京邮电大学 Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
CN116431847B (en) * 2023-06-14 2023-11-14 北京邮电大学 Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure
CN117093692A (en) * 2023-08-23 2023-11-21 广东技术师范大学 Multi-granularity image-text matching method and system based on depth fusion
CN117669738A (en) * 2023-12-20 2024-03-08 苏州元脑智能科技有限公司 Engine updating method, processing method, device, equipment, medium and robot
CN117669738B (en) * 2023-12-20 2024-04-26 苏州元脑智能科技有限公司 Engine updating method, processing method, device, equipment, medium and robot
CN117609527A (en) * 2024-01-16 2024-02-27 合肥人工智能与大数据研究院有限公司 Cross-modal data retrieval optimization method based on vector database
CN117688193A (en) * 2024-02-01 2024-03-12 湘江实验室 Picture and text unified coding method, device, computer equipment and medium
CN117876651A (en) * 2024-03-13 2024-04-12 浪潮电子信息产业股份有限公司 Visual positioning method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN114911914A (en) Cross-modal image-text retrieval method
CN112905827B (en) Cross-modal image-text matching method, device and computer readable storage medium
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN112687388B (en) Explanatory intelligent medical auxiliary diagnosis system based on text retrieval
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN113010700B (en) Image text cross-modal retrieval method based on category information alignment
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN115424096B (en) Multi-view zero-sample image identification method
CN108536735A (en) Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN116450834A (en) Archive knowledge graph construction method based on multi-mode semantic features
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN113423004A (en) Video subtitle generating method and system based on decoupling decoding
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN116450877A (en) Image text matching method based on semantic selection and hierarchical alignment
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN114564768A (en) End-to-end intelligent plane design method based on deep learning
CN112528989B (en) Description generation method for semantic fine granularity of image
Jiang et al. Hadamard product perceptron attention for image captioning
CN116414988A (en) Graph convolution aspect emotion classification method and system based on dependency relation enhancement
CN117010407A (en) Multi-mode emotion analysis method based on double-flow attention and gating fusion
CN116662591A (en) Robust visual question-answering model training method based on contrast learning
CN116561305A (en) False news detection method based on multiple modes and transformers
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination