CN114911914A

CN114911914A - Cross-modal image-text retrieval method

Info

Publication number: CN114911914A
Application number: CN202210433101.2A
Authority: CN
Inventors: 冀中; 王耀东; 陈珂鑫; 王港
Original assignee: Tianjin University; CETC 54 Research Institute
Current assignee: Tianjin University; CETC 54 Research Institute
Priority date: 2022-04-24
Filing date: 2022-04-24
Publication date: 2022-08-16

Abstract

The invention provides a cross-modal image-text retrieval method, and belongs to the technical field of machine learning. The method inputs the visual and text characteristics of a sample into a unified multi-mode transform inference network to acquire the interactive inference information in and among the modes. The training process of the network is divided into two stages, wherein in the first stage, a twin multi-modal Transformer encoder is used for encoding visual and text information to obtain intra-modal context information; and in the second stage, the fused visual information and text information are input into the same multi-mode Transformer encoder to obtain the interactive context information among the modes. And finally, the model adopts a novel self-adaptive similarity fusion mechanism to perform cross-mode image-text similarity matching calculation and output a result. The invention alleviates the technical problem of mutual split of intra-modal interaction information and inter-modal interaction information of the method in the field of image-text retrieval, and enables the characteristics of different modalities to interact and complement each other.

Description

Cross-modal image-text retrieval method

Technical Field

The invention belongs to the field of machine learning, and particularly relates to a cross-modal image-text retrieval method based on a unified multi-modal Transformer inference network.

Background

Cross-modal graph retrieval is a foundation and promising research direction in the field of multi-modal machine learning. The task aims to obtain a corresponding retrieval result of data of another modality through a query data sample of a certain modality submitted by a user. Today, the popularization of the internet, cross-modal image-text retrieval is becoming the basis of daily life and work of vast internet users. In order to fully utilize the high-level semantic information of multiple modalities, many methods exist to attempt to drill down into the interactive semantic information within or between modalities.

However, the inventors have found that the existing methods still have a common disadvantage that the acquisition of intra-modality and inter-modality mutual inference information is split, so that fine-grained interrelations among different modalities are split, and the fine-grained interrelations are very critical to the accuracy of cross-modality retrieval.

Therefore, the research on the model capable of simultaneously acquiring the intra-modal and inter-modal interactive reasoning information of the visual and text characteristics has strong academic value and practical value.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a cross-modal image-text retrieval method. The method has the advantages that the association and complementarity among fine-grained features of different modes are fully excavated through the combined modeling of the visual and text multi-mode interactive context information acquisition, and the accuracy of cross-mode image-text retrieval can be effectively improved.

In order to achieve the purpose, the invention adopts the following technical scheme:

a cross-modal image-text retrieval method comprises the following steps:

s1, extracting visual features of the image and text features of the sentence by using a deep neural network;

s2, designing a unified multi-mode Transformer inference network, and modeling different modal data in a unified mode; the unified multi-mode Transformer reasoning network respectively extracts intra-modal interaction information and inter-modal interaction information of the vision and the text, and calculates the similarity of the vision characteristic and the text characteristic through a self-adaptive similarity fusion module;

and S3, training the unified multi-mode Transformer inference network, inputting the visual features and the text features extracted in the step S1 into the trained unified multi-mode Transformer inference network, obtaining the similarity of the visual features and the text features, and outputting a picture and text retrieval result.

Further, in step S1, the specific way of extracting the visual features of the image is as follows:

aiming at each input image, adopting a pre-training model of the Faster-R-CNN network on a Visual Genome data set as a Visual feature extractor to extract the features of the region of interest;

for each region of interest selected, using a vector

Represents its position-embedded vector, where (x) _l ，y _l ) And (x) _r ，y _r ) Respectively representing the coordinates of the upper left corner and the lower right corner of the region, and W and H respectively representing the width and the height of the input image;

characterizing the visual characteristics of an imageIs shown as

Where v represents a visual feature vector,

is the vector space, k is the number of regions of interest and D is the visual feature dimension.

Further, in step S1, the specific way of extracting the text features of the sentence is as follows:

for each input sentence, a group of word vectors are extracted by adopting a pre-trained language model BERT, and then the word vectors are input into a full-connection layer to obtain a group of text features with dimension D, which are expressed as

Where, t represents the text feature vector,

and the vector space is adopted, n is the number of text participles, and D is the text characteristic dimension.

Further, in step S2, a self-attention Transformer encoder is used to extract visual and textual intra-modal interaction information and inter-modal interaction information.

Further, in step S3, a pre-training-fine-tuning two-stage training method is used to train the unified multi-modal Transformer inference network, where a first training stage generates intra-modal context representation information of a visual or text single mode, and a second training stage generates interaction information between the visual and text modes.

Further, the training process of the first training stage is as follows: for an input image or text, respectively extracting a group of visual or text features by utilizing an intra-MMTN coder module in a twin mode, wherein the formula is as follows:

and

wherein { v ₁ ，··，v _k And t ₁ ，·，t _k Represents the fine-grained segment features of the input, i.e. the region features of the image or the word features of the text,

and

image or text context representation information representing the output; the encoders of the encoder modules within the twin modality share parameters.

Further, the training process of the second training stage is as follows: on the basis of the first stage training, the segment features { v } ₁ ，·，v _k And t ₁ ，·，t _k Splicing is carried out, the input signals are input into an inter-modal encoder module inter-MMTN, the inter-modal encoder module reads a pre-trained model of a twin-modal inner encoder module and then trains in a fine tuning mode, and the formula is as follows:

wherein

And

representing inter-modality interaction context representation information.

Further, the specific way for the adaptive similarity fusion module to calculate the similarity between the visual features and the text features is as follows:

representing information for a set of visual contexts

And text context representation information

Wherein

Defining a cross-modal fine-grained matching degree matrix A:

element A in A _ij Representing semantic similarity of the ith visual context representation information and the jth textual context representation information, W _v And W _t Is a network parameter;

defining the weighted sum of the cross-modal fine-grained matching degree matrix A as global image-text similarity S (I, T), wherein each fine-grained similarity is obtained by carrying out softmax activation function operation on the weighted sum of the column elements of the matrix A, and the formula is as follows:

where λ is the temperature coefficient of the softmax activation function.

The invention has the beneficial effects that:

1) the invention alleviates the technical problem of mutual split of intra-modal interaction information and inter-modal interaction information of the method in the field of image-text retrieval, and enables the characteristics of different modalities to interact and complement each other.

2) The method divides the training of the model into two stages of pre-training and fine-tuning, can efficiently process the problem of cross-modal retrieval, does not use external knowledge, only uses a cross-modal image-text retrieval data set to train the model, avoids higher calculation cost and relieves training deviation caused by dependence on big data.

3) The invention designs a new self-adaptive multi-mode data fusion mode, which can effectively improve the accuracy of image-text retrieval.

Drawings

FIG. 1 is a diagram showing a structure of a Transformer module.

FIG. 2 is a flow chart of the present invention.

FIG. 3 is a schematic diagram of a unified multi-modal Transformer inference network according to the present invention.

Fig. 4 is a schematic diagram of an adaptive similarity fusion module according to the present invention.

Detailed Description

The invention will now be described in further detail with reference to the drawings and specific examples, but the invention is not limited thereto.

A cross-modal image-text retrieval method uses a deep neural network to extract visual features of a sample image and text features of a query sentence; designing a unified multi-mode Transformer inference network MMTN, modeling different modal data in a unified mode, and extracting interaction information in visual and text modes and interaction information between the modes respectively; a novel adaptive similarity fusion module ASA (adaptive similarity aggregation) is also designed, generated fine-grained cross-modal context representation information is dynamically fused, the similarity of the cross-modal context representation information is calculated, and a picture and text retrieval result is output.

Wherein, the extraction process about the visual characteristics of the sample image comprises the following steps:

for each input image, the fast-R-CNN with ResNet-101 as the backbone network is adopted as the visual feature extractor. Extracting 2048-dimensional ROI (region of interest) features by utilizing a pre-training model of the extractor on a Visual Genome data set, and selecting 36 features with highest ROI feature confidence degrees.

For each ROI selected, use a vector

Represents its position-embedded vector, where (x) _l ，y _l ) And (x) _r ，y _r ) Respectively representing the upper left and lower right corner coordinates of the region, and W and H respectively representing the width and height of the input image. Then, the product is processedThe position vector of the ROI is converted into 1024 dimensions. Finally, the visual features of the image will be represented as

The extraction process of the text features of the query statement comprises the following steps:

for each input sentence, a group of word vectors are extracted by adopting a pre-trained language model BERT, and then the word vectors are input into a full-connection layer to obtain a group of text example characteristics D expressed as text example characteristics

The generation process of the context interaction information about the visual and text different modalities comprises the following steps:

a self-attention encoder Transformer is used to extract intra-modal and inter-modal mutual inference information for vision and text.

In addition, the method also adopts a pre-training-fine-tuning two-stage training method to train the MMTN network. The first training phase generates intra-modal context representation information for visual or textual single modalities, and the second training phase generates interaction information between visual and textual modalities.

Further, the training process of the first training stage is as follows: for an input image or text, respectively extracting a group of visual or text features by utilizing a twin encoder module intra-MMTN, wherein the formula is as follows:

and

wherein { v ₁ ，·，v _k And t ₁ ，·，t _k Denotes fine particles of an inputDegree segment features (regional features of an image or word features of a text),

and

representing the output image or text context representation information.

The training process of the second training stage is as follows: on the basis of the first stage training, the segment features { v ₁ ，·，v _k And t ₁ ，·，t _k Splicing, and inputting the spliced result into an inter-MMTN module, wherein the formula is as follows:

wherein

And

representing inter-modality interaction context representation information.

The cross-mode matching calculation process of the adaptive similarity fusion module ASA comprises the following steps:

representing information for a set of visual contexts

And text context representation information

Wherein

The invention defines a cross-modal fine-grained matching degree matrix A:

wherein A is _ij And expressing the semantic similarity of the ith visual context expression information and the jth text context expression information.

The global image-text similarity is defined as the weighted sum of a cross-mode fine-grained matching degree matrix A, each fine-grained similarity is obtained by carrying out softmax activation function operation on the weighted sum of the column elements of the matrix A, and the formula is as follows:

where λ is the temperature coefficient of the softmax activation function. In general, the adaptive similarity fusion process can be expressed as:

with the rapid development of internet technology, multimodality data such as images, videos and texts are becoming important media for human perception of the world, so how to enable a computer to accurately understand multimodality data and realize cross-modality retrieval is a research subject with great practical value. The Transformer adopted by the method is a novel neural network which is widely used in recent years and takes a self-attention mechanism as a core. The Transformer encoder may vector a set of sequences

Mapping to another set of sequence vectors

Wherein the self-attention mechanism enables the network to capture implicit semantic associations between input sequence vectors. Specifically, the transform module attention mechanism comprises a query vector Q, a key vector K, and a value vector V, and its corresponding feature matrix can be represented as d _q ，d _k ，d _v Dimension. Each query vector Q is taken as oneThe anchor vector assigns a weight to the value vector V according to the semantic relevance between the key vector K and the query vector Q, so that the convex combination of the value vector V can be considered as the output of the query vector Q after self-attention processing. The Transformer can be regarded as a feature enhancement operation mode using the self-attention mechanism, and the matrix operation formula is as follows:

the structure of the Transformer module is shown in FIG. 1. The method adopts a Transformer module as a key component of the MMTN network, an encoder adopts 2 layers, and each layer comprises 4 attention heads. The steps for implementing the cross-modality retrieval of graphics and text are shown in fig. 2.

The cross-modal teletext retrieval task authority data set comprises MS-COCO, Flickr30k and the like. Taking the MS-COCO dataset as an example, which contains 123287 images, each image has 5 description sentences, in this embodiment, the experiment of the dataset is divided into: 5000 images were used as the validation set, 5000 images as the test set, and the rest as the training set. The experimental index adopts recall rate R @ K, and K is 1, 5 and 10.

The cross-modal graph-text retrieval method is described in detail as follows:

s1, in this embodiment, the original two-dimensional image needs to be preprocessed, and the embodiment adopts ResNet-101 and Faster-R-CNN networks to extract 2048-dimensional ROI features of the two-dimensional image. The Faster-R-CNN network obtains a series of ROI features with different confidence levels, and the embodiment selects 36 features with the highest confidence levels.

Wherein, for each selected ROI feature, a vector is used

Indicates its position embedded in a vector, in which (x) is _l ，y _l ) And (x) _r ，y _r ) Respectively representing the upper left and lower right corner coordinates of the region, and W and H respectively representing the width and height of the input image. The position vector of the ROI is then converted into 1024 dimensions. Finally, the visual features of the image will be represented as

In this embodiment, the original text data also needs to be preprocessed, in this embodiment, a BERT network is used to extract 768-dimensional word vectors of each word in the text, and then the 768-dimensional word vectors are input into the full-link layer to obtain a group of 2048-dimensional text example features, which are expressed as 2048-dimensional text example features

The visual and textual features obtained in the above steps are used as input data from the attention Transformer encoder.

And S2, extracting intra-modal and inter-modal interaction inference information of vision and text by adopting a self-attention encoder Transformer. The present embodiment employs a pre-training-fine tuning two-stage strategy to train the MMTN network. The first stage generates intra-modal context representation information for visual or textual single modalities and the second stage generates interaction information between visual and textual modalities as shown in fig. 3.

The training process of the first training stage is as follows: for an input image and a text, a group of visual or text features are respectively extracted by using a parameter-sharing twin encoder module intra-MMTN (Multi-Modal Transformer), and the specific formula is as follows:

and

wherein { v ₁ ，·，v _k And t ₁ ，·，t _k Denotes fine-grained segment characteristics of the input(regional features of images or word features of text),

and

representing the output image or text context representation information.

The second stage of training is based on the first stage of training, which essentially is the fine tuning of the model obtained in the first pre-training stage. The specific process is to characterize the fragment { v } ₁ ，·，v _k And t ₁ ，·，t _k Splicing, and inputting the spliced result into an inter-MMTN module, wherein the formula is as follows:

wherein

And

representing inter-modality interaction context representation information.

S3, the cross-mode matching calculation process of the adaptive similarity fusion module ASA is as follows: for a set of visual context features

And text context features

Wherein

The invention defines a cross-modal fine-grained matching degree matrix A:

wherein A is _ij Indicating semantic similarity of the ith visual context representation information and the jth textual context representation information, as shown in fig. 4.

where λ is the temperature coefficient of the softmax activation function, an embodiment of the disclosure takes the value of 20. In general, the adaptive similarity fusion process can be expressed as:

for each image-text pair, two similarity scores can be obtained from ASA, and the formula is:

and

the training process utilizes the triplet losses to train the model:

the training learning rate is 0.00001.

In a word, the invention inputs the visual and text characteristics of the sample into a unified multi-mode Transformer inference network MMTN to acquire the interactive inference information in and among the modes. The training process of the network is divided into two stages, wherein in the first stage, a twin multi-modal Transformer encoder is used for encoding visual and text information to obtain intra-modal context information; and in the second stage, the fused visual information and text information are input into the same multi-mode Transformer encoder to obtain the interactive context information between the modes. And finally, the model adopts a novel self-adaptive similarity fusion mechanism to perform cross-mode image-text similarity matching calculation and output a result. The invention adopts a Transformer encoder to uniformly model the interactive information in visual and text modes and among the modes, and deeply excavates the relevance and the complementarity of different modes as a whole.

Variations and modifications to the above-described embodiments may also occur to those skilled in the art, which fall within the scope of the invention as disclosed and taught herein. Therefore, the present invention is not limited to the above-mentioned embodiments, and any obvious improvement, replacement or modification made by those skilled in the art based on the present invention is within the protection scope of the present invention. Furthermore, although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A cross-modal image-text retrieval method is characterized by comprising the following steps:

s2, designing a unified multi-mode Transformer reasoning network, and modeling different modal data in a unified mode; the unified multi-mode Transformer reasoning network respectively extracts intra-modal interaction information and inter-modal interaction information of the vision and the text, and calculates the similarity of the vision characteristic and the text characteristic through a self-adaptive similarity fusion module;

2. The cross-modality image-text retrieval method of claim 1, wherein in step S1, the specific way of extracting the visual features of the image is as follows:

aiming at each input image, a pre-training model of an Faster-R-CNN network on a Visual Genome data set is used as a Visual feature extractor to extract the features of the region of interest;

for each region of interest selected, using a vector

Represents its position-embedded vector, where (x) _l ，y _l ) And (x) _r ，y _r ) Respectively representing the coordinates of the upper left corner and the lower right corner of the area, and W and H respectively representing the width and the height of the input image;

representing visual features of an image as

Where v represents a visual feature vector,

3. The cross-modal teletext retrieval method according to claim 1, wherein in step S1, the specific way of extracting the text features of the sentence is:

Where, t represents the text feature vector,

4. The cross-modal teletext retrieval method of claim 1, wherein in step S2, a self-attention Transformer encoder is used to extract intra-modal and inter-modal interaction information for both visual and textual content.

5. The cross-modal teletext retrieval method of claim 1, wherein in step S3, the unified multi-modal Transformer inference network is trained using a pre-training-fine-tuning two-stage training method, wherein a first training stage generates intra-modal context representation information of a visual or textual single modality, and a second training stage generates interaction information between the visual and textual modalities.

6. A cross-modality teletext retrieval method according to claim 5, wherein the training process of the first training stage is: for an input image or text, respectively extracting a group of visual or text features by utilizing an intra-MMTN coder module in a twin mode, wherein the formula is as follows:

and

wherein { v ₁ ，·，v _k And t ₁ ，·，t _k } tableThe fine-grained segment features of the input, i.e. the region features of the image or the word features of the text,

and

7. A cross-modality teletext retrieval method according to claim 5, wherein the training process of the second training stage is: on the basis of the first stage training, the segment features { v } ₁ ，·，v _k And t ₁ ，·，t _k Splicing is carried out, the input signals are input into an inter-modal encoder module inter-MMTN, the inter-modal encoder module reads a pre-trained model of a twin-modal inner encoder module and then trains in a fine tuning mode, and the formula is as follows:

wherein

And

representing inter-modality interaction context representation information.

8. The cross-modal teletext retrieval method according to claim 1, wherein the specific way in which the adaptive similarity fusion module calculates the similarity of the visual features and the text features is:

representing information for a set of visual contexts

And text context representation information

Wherein

Defining a cross-modal fine-grained matching degree matrix A:

where λ is the temperature coefficient of the softmax activation function.