CN113656539A - Cross-modal retrieval method based on feature separation and reconstruction - Google Patents

Cross-modal retrieval method based on feature separation and reconstruction

Info

Publication number: CN113656539A
Application number: CN202110859387.6A
Authority: CN (China)
Prior art keywords: text, image, information, modal, reconstruction
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113656539B (en)
Inventors: 邬向前, 卜巍, 张力
Current Assignee: Harbin Institute of Technology
Original Assignee: Harbin Institute of Technology
Application filed by Harbin Institute of Technology
Priority/filing date: 2021-07-28 (application CN202110859387.6A)
Publication date: 2021-11-16 (CN113656539A)
Grant date: 2023-08-18 (CN113656539B)


Classifications

    • G06F16/334: Query execution (information retrieval of unstructured textual data; querying; query processing)
    • G06F16/383: Retrieval characterised by using metadata automatically derived from the content (unstructured textual data)
    • G06F16/53: Querying (information retrieval of still image data)
    • G06F16/583: Retrieval characterised by using metadata automatically derived from the content (still image data)
    • G06N20/20: Ensemble learning (machine learning)
    • G06N3/045: Combinations of networks (neural network architectures)
    • G06N3/08: Learning methods (neural networks)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method based on feature separation and reconstruction, which comprises the following steps: step one, obtaining a visual representation; step two, converting the word sequence into a text representation; step three, performing linear transformations through visual and text multilayer perceptrons to respectively obtain feature vectors of the visual space and the text space; step four, decomposing the feature vectors of the different modal spaces into modal information, semantic information and specific information; step five, separating the modal information, semantic information and specific information from the visual/text representations by using a feature separation module, thereby obtaining the modal information, semantic information and specific information of the visual representation and of the text representation; step six, combining the three different kinds of information of the image to reconstruct the image; and step seven, reconstructing the text from the three different kinds of information of the text. The method of the invention performs cross-modal retrieval well and obtains competitive results on multiple databases.

Description

Cross-modal retrieval method based on feature separation and reconstruction
Technical Field
The invention relates to a cross-modal retrieval method, in particular to a cross-modal retrieval method based on feature separation and reconstruction.
Background
With the rapid development of multimedia, there is a great deal of information on the Internet, such as images, text, video and audio. It is becoming increasingly difficult to manually obtain useful information across different modalities in such massive data. Naturally, a powerful method is needed to help us obtain the text, images or videos we need. Cross-modal retrieval takes data of one modality as a query and retrieves related data of another modality. For example, text can be used to retrieve images of interest (as in Google Image Search), or an image can be used to retrieve the corresponding text. Of course, the modalities are not limited to images and text; other modalities such as voice, physical signals and video can also take part in cross-modal retrieval.
Disclosure of Invention
In order to better perform cross-modal retrieval, the invention provides a cross-modal retrieval method based on feature separation and reconstruction.
The purpose of the invention is realized by the following technical scheme:
a cross-modal retrieval method based on feature separation and reconstruction comprises the following steps:
step one, for the image part of an image-text pair, using a ResNet152 network as the basic image network of the image branch, taking the image in the image-text pair as the input of the image branch, and extracting image features directly from the penultimate fully-connected layer to obtain a visual representation v;
step two, for the text part of the image-text pair, using word encoding to encode each token into a word vector, and then using a GRU as the basic text network of the text branch to convert the word sequence into a text representation l;
step three, after respectively obtaining the visual representation v and the text representation l, carrying out linear transformations through visual and text multilayer perceptrons to respectively obtain the feature vectors of the visual space and the text space;
step four, decomposing the feature vectors of different modal spaces into modal information, semantic information and specific information, wherein:
(1) modal information mo, characterizing the source of the feature vector;
(2) semantic information se representing high-level semantics represented by the feature vector;
(3) specific information sp representing information specific to different modal characteristics;
step five, separating the modal information, the semantic information and the specific information from the visual/text representations by using a feature separation module to obtain the modal information (v_mo, l_mo), semantic information (v_se, l_se) and specific information (v_sp, l_sp) of the visual representation v and the text representation l;
step six, adopting the generator and the discriminator of DCGAN as the generator G and the discriminator D_1 of image reconstruction, respectively, introducing a discriminator D_2 to determine whether the generated image is consistent in content with the real image, and combining the three different kinds of information of the image (v_mo; v_se; v_sp) to carry out image reconstruction;
step seven, decoding the three different kinds of information of the text (l_mo; l_se; l_sp) by using an RNN to reconstruct the text and finally generate a complete sentence.
Compared with the prior art, the invention has the following advantages:
1. the invention decomposes feature vectors from different modal spaces into three parts: modality information, semantic information, and specific information.
2. The invention introduces feature separation in the traditional cross-modal retrieval task to deal with information asymmetry between different modalities, and supervises different parts of feature vectors by using different loss functions.
3. The invention also introduces image reconstruction and text reconstruction tasks that respectively combine the three different kinds of information of the image and of the text, improves the performance of the cross-modal retrieval task through multi-task joint learning, and has good robustness.
4. The method of the invention can well carry out cross-modal retrieval and obtain competitive results on a plurality of databases.
Drawings
FIG. 1 is a cross-modal search flow chart based on feature separation and reconstruction in accordance with the present invention;
FIG. 2 is a diagram illustrating the reconstruction of an input image and text from a given image text pair according to the present invention;
FIG. 3 is the visualized result of text retrieval for a given image query on the MS-COCO dataset by the method of the present invention;
FIG. 4 is the visualized result of image retrieval for a given text query on the MS-COCO dataset by the method of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a cross-modal retrieval method based on feature separation and reconstruction, and fig. 1 shows the overall structure of the whole network, which can be roughly divided into three parts, specifically as follows:
the first part is: a multi-modal feature extraction network.
The invention follows the common space learning approach to extract the features of the image and the text. Formally, given an arbitrary image-text pair (x, y), where x is the image and y = (w_0, ..., w_t, ..., w_{T-1}) is a text sentence, w_0, w_t and w_{T-1} respectively denote the one-hot codes (tokens) of the 1st, (t+1)-th and T-th words, and T denotes the sentence length. The text is encoded by embedding each token into a distributed representation W_e w_t ∈ R^K, where K = 300 and W_e is the word2vec embedding matrix. This variable-length word sequence is then converted into a meaningful, fixed-size text representation l ∈ R^d using a GRU as the basic text network (only the output of the last step of the GRU is needed as the text representation of the entire sentence), where d = 3072. As for image encoding, in order to accommodate variable-size images and benefit from the performance of very deep architectures, the invention relies on the full convolutional residual network ResNet-152 (pre-trained on ImageNet) as the basic image network and extracts image features directly from the penultimate fully-connected layer FC7 to obtain the visual representation v ∈ R^d. For ResNet152, the dimension of the image embedding is 2048.
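By way of a non-limiting illustration, this feature extraction can be sketched in Python with PyTorch as follows; the use of torchvision, the pooled 2048-dimensional ResNet-152 feature standing in for the "FC7" feature, the single-layer GRU (the invention uses four stacked 3072-dimensional layers), and a trainable embedding standing in for the word2vec matrix W_e are all assumptions of this sketch, and the invention's customized ResNet152 variant with additional modules is not reproduced here.

import torch
import torch.nn as nn
import torchvision.models as models

class ImageTextEncoders(nn.Module):
    """Sketch of the basic image and text networks (layer choices are assumptions)."""
    def __init__(self, vocab_size, word_dim=300, embed_dim=3072):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        # Drop the final classification layer; keep the pooled 2048-d feature.
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.img_fc = nn.Linear(2048, embed_dim)             # map the image feature to the joint dimension d = 3072
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # stands in for the word2vec matrix W_e (K = 300)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, images, token_ids):
        v = self.cnn(images).flatten(1)   # (B, 2048) image feature
        v = self.img_fc(v)                # (B, 3072) visual representation v
        w = self.word_emb(token_ids)      # (B, T, 300) embedded word sequence
        _, h = self.gru(w)                # h: (1, B, 3072), last step of the GRU
        l = h[-1]                         # (B, 3072) text representation l
        return v, l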
In the invention, the ResNet152 network comprises six convolution modules, Pool5, Rec5b, Rec4f, Rec3d, Rec2c and Pool1, six additional modules, Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A, connected after the respective convolution modules, and 3 fully-connected layers connected in series after all convolution modules, wherein Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A have the same structure: 3 convolutional layers, 1 Concate layer, 1 Correlation layer, 1 sigmoid layer, and 3 fully-connected layers.
The second part is that: the specific structure of the feature separation module.
After the visual representation and the text representation are obtained separately, the invention decomposes the feature vectors of the different modality spaces into three parts, owing to the asymmetry between different modalities: (1) modal information, which characterizes the source of the feature vector. The modal information of feature vectors from the same modality should be as close as possible; conversely, the modal information of feature vectors from different modalities should be as far apart as possible. (2) Semantic information, which characterizes the high-level semantics represented by the feature vector. The semantic information of feature vectors from different modalities that have the same or related semantics (i.e. a pair of samples in the database) should be as close to each other as possible; conversely, it should be as far apart as possible for unmatched samples. The invention uses the image semantic information vector and the text semantic information vector to carry out cross-modal retrieval. (3) Specific information, which characterizes information specific to the features of each modality. For example, the feature vectors of the image modality carry image-specific detail information (the details of different images differ significantly), which is not present in the feature vectors of the corresponding text modality. The invention uses a feature separation module to separate the modal information, semantic information and specific information from the visual/text representation.
Specifically, the invention inputs the visual representation v into MLP_v and the text representation l into MLP_l, applies linear transformations with the learnable parameters of the multilayer perceptrons followed by normalization, and finally obtains the modal information (v_mo, l_mo), semantic information (v_se, l_se) and specific information (v_sp, l_sp) of the visual representation v and the text representation l, as shown in the following formulas:

(v_mo, v_se, v_sp) = Norm(MLP_v(v; θ_MLP_v))

(l_mo, l_se, l_sp) = Norm(MLP_l(l; θ_MLP_l))

where MLP_v denotes the visual multilayer perceptron with learnable parameters θ_MLP_v, whose outputs are the visual modal information, visual semantic information and visual specific information before normalization Norm(·); and MLP_l denotes the text multilayer perceptron with learnable parameters θ_MLP_l, whose outputs are the text modal information, text semantic information and text specific information before normalization.
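By way of a non-limiting illustration, a minimal feature separation module matching the formulas above can be sketched as follows; the single linear layer per branch, the equal three-way split of the output, and the choice of L2 normalization for Norm(·) are assumptions of the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSeparation(nn.Module):
    """Splits each modality's feature vector into modal / semantic / specific parts (layout assumed)."""
    def __init__(self, dim=3072):
        super().__init__()
        self.mlp_v = nn.Linear(dim, 3 * dim)   # visual multilayer perceptron MLP_v (depth assumed)
        self.mlp_l = nn.Linear(dim, 3 * dim)   # text multilayer perceptron MLP_l (depth assumed)

    @staticmethod
    def _split_norm(z):
        z_mo, z_se, z_sp = torch.chunk(z, 3, dim=-1)
        # The patent only states "normalization"; L2 normalization is assumed here.
        return F.normalize(z_mo, dim=-1), F.normalize(z_se, dim=-1), F.normalize(z_sp, dim=-1)

    def forward(self, v, l):
        v_mo, v_se, v_sp = self._split_norm(self.mlp_v(v))
        l_mo, l_se, l_sp = self._split_norm(self.mlp_l(l))
        return (v_mo, v_se, v_sp), (l_mo, l_se, l_sp)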
And a third part: image and text reconstruction.
The invention constructs image reconstruction and text reconstruction tasks that respectively combine the three different kinds of information of the image and of the text, and improves the performance of the cross-modal retrieval task through multi-task joint learning. The goal of image reconstruction is to encourage the three different kinds of information in the visual representation to generate an image that resembles the real image. A generative adversarial network (GAN), which consists of a generator and a discriminator, can be used to generate the image. The invention adopts the generator and the discriminator of DCGAN as the generator G and the discriminator D_1 of image reconstruction, respectively. Since D_1 can only distinguish real images from fake images, it cannot guarantee that the generated image is consistent in content with the real image. The invention therefore also introduces a discriminator D_2 to determine whether the generated image is consistent in content with the real image, as shown in the following formulas:

x' = G([v_mo; v_se; v_sp]; θ_G)

p = D_1(n; θ_D1)

q = D_2(n; θ_D2)

where x' denotes the image generated by the generator, n denotes an image input to the discriminators, p and q denote the probabilities, predicted by D_1 and D_2 respectively, that n belongs to x', and θ denotes the corresponding learnable parameters.
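By way of a non-limiting illustration, the generator/discriminator wiring can be sketched as follows; the DCGAN-style layer stacks, the 32x32 output resolution, the equal 1024-dimensional parts, and feeding D_2 the generated image concatenated with the real image as its content reference are assumptions of this sketch, not details given by the patent.

import torch
import torch.nn as nn

class ImageReconstruction(nn.Module):
    """x' = G([v_mo; v_se; v_sp]); D_1 judges real/fake, D_2 judges content consistency (wiring assumed)."""
    def __init__(self, part_dim=1024, img_channels=3):
        super().__init__()
        z_dim = 3 * part_dim
        self.G = nn.Sequential(  # stand-in for the DCGAN generator
            nn.ConvTranspose2d(z_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
        )
        self.D1 = self._disc(img_channels)       # real vs. fake
        self.D2 = self._disc(img_channels * 2)   # content consistency: sees a (candidate, reference) pair

    @staticmethod
    def _disc(in_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 1, 4, 1, 0), nn.Sigmoid(),
        )

    def forward(self, v_mo, v_se, v_sp, real_img):
        # real_img is assumed to be resized to the generator's 32x32 output size.
        z = torch.cat([v_mo, v_se, v_sp], dim=-1)[:, :, None, None]   # (B, 3*part_dim, 1, 1)
        x_fake = self.G(z)                                            # reconstructed image x'
        p = self.D1(x_fake)                                           # per-patch probability of being real
        q = self.D2(torch.cat([x_fake, real_img], dim=1))             # per-patch probability of content match
        return x_fake, p, q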
The goal of text reconstruction is to encourage the three different kinds of information of the text representation to generate a real sentence. The invention uses an RNN to decode the three different kinds of information of the text and finally generates a complete sentence, as shown in the following formula:

y' = FC_2(RNN([l_mo; l_se; l_sp], W_e; θ_RNN); θ_FC2)

where y' denotes the probability distribution of the reconstructed sentence, W_e denotes the word2vec embedding matrix, θ_RNN denotes the learnable parameters of the RNN, FC_2 denotes a fully-connected layer, and θ_FC2 denotes the learnable parameters of FC_2.
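By way of a non-limiting illustration, such an RNN decoder can be sketched as follows; initializing the hidden state from [l_mo; l_se; l_sp], teacher forcing with the ground-truth sentence, and a trainable embedding standing in for W_e are assumptions of the sketch.

import torch
import torch.nn as nn

class TextReconstruction(nn.Module):
    """Decodes [l_mo; l_se; l_sp] into a per-step word probability distribution y' (conditioning assumed)."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=3072, part_dim=1024):
        super().__init__()
        self.init_fc = nn.Linear(3 * part_dim, hidden_dim)   # maps the concatenated text parts to the initial state
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # stands in for W_e
        self.rnn = nn.RNN(word_dim, hidden_dim, batch_first=True)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)         # FC_2: hidden state -> vocabulary logits

    def forward(self, l_mo, l_se, l_sp, target_tokens):
        h0 = torch.tanh(self.init_fc(torch.cat([l_mo, l_se, l_sp], dim=-1))).unsqueeze(0)
        emb = self.word_emb(target_tokens)   # teacher forcing with the ground-truth sentence
        out, _ = self.rnn(emb, h0)
        logits = self.fc2(out)               # (B, T, vocab); a softmax over the last axis gives y'
        return logits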
In the invention, a loss function L is used as the optimization objective to train the whole network (including the basic image network, the basic text network, the feature separation module, the image reconstruction network and the text reconstruction network), as shown in the following formula:

L = L_mo + L_se + L_sp + L_img-re + L_cap-re

where the modal loss L_mo, the semantic loss L_se (a triplet ranking loss with hard negative sample mining), the specific loss L_sp, the image reconstruction loss L_img-re and the text reconstruction loss L_cap-re are defined as follows.
[Equations defining the modal loss L_mo]

where N denotes the number of sample pairs in the database, v_mo^i denotes the i-th visual modal information feature, v̄_mo denotes the mean vector of the v_mo^i, l_mo^i denotes the i-th text modal information feature, l̄_mo denotes the mean vector of the l_mo^i, and ||·||_2 denotes the two-norm.
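The equations for L_mo appear only as figures in the original text; purely as an assumed illustration consistent with the symbols listed above (same-modality modal features pulled toward their mean, the two modality means kept apart by a hinge with an assumed margin), a sketch could be:

import torch

def modal_loss(v_mo, l_mo, margin=1.0):
    """Assumed form of L_mo: intra-modality compactness plus inter-modality separation of the means."""
    v_mean = v_mo.mean(dim=0, keepdim=True)   # mean vector of the visual modal features
    l_mean = l_mo.mean(dim=0, keepdim=True)   # mean vector of the text modal features
    intra = (v_mo - v_mean).norm(dim=-1).mean() + (l_mo - l_mean).norm(dim=-1).mean()
    inter = torch.relu(margin - (v_mean - l_mean).norm())   # hinge keeping the two modality means apart
    return intra + inter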
[Equations defining the semantic loss L_se, a triplet ranking loss with hard negative sample mining]

where s(·, ·) denotes the cosine similarity of two vectors, v_se^j denotes the j-th visual semantic information feature, and l_se^j denotes the j-th text semantic information feature.
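The equations for L_se likewise appear only as figures; since the text describes it as a triplet ranking loss with hard negative sample mining over cosine similarities, a sketch in that spirit (the margin value and sum reduction are assumptions) is:

import torch
import torch.nn.functional as F

def semantic_loss(v_se, l_se, margin=0.2):
    """Triplet ranking loss with hardest negatives over cosine similarities; margin value assumed."""
    v = F.normalize(v_se, dim=-1)
    t = F.normalize(l_se, dim=-1)
    scores = v @ t.t()                      # similarity matrix s(v_se^i, l_se^j)
    pos = scores.diag().view(-1, 1)         # similarities of the matching pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # image query, text negatives
    cost_t2i = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text query, image negatives
    return cost_i2t.max(dim=1)[0].sum() + cost_t2i.max(dim=0)[0].sum()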
[Equation defining the specific loss L_sp]

where v_sp^i denotes the i-th visual specific information feature, l_sp^i denotes the i-th text specific information feature, and T denotes transposition.
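The equation for L_sp also appears only as a figure; given that it involves the paired visual and text specific features and a transposition, one assumed illustration is an orthogonality-style penalty on their inner products:

import torch

def specific_loss(v_sp, l_sp):
    """Assumed form of L_sp: penalize overlap between paired visual and text specific features."""
    inner = (v_sp * l_sp).sum(dim=-1)   # (v_sp^i)^T l_sp^i for each pair i
    return (inner ** 2).mean()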
[Equations defining the image reconstruction loss L_img-re]

where x denotes an image drawn from the database, x' denotes an image produced by the image generator G, and x'' denotes an image produced by G with part of the information used for x' removed.
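The adversarial objectives are likewise only available as figures; an assumed, standard instantiation in which D_1 separates real from generated images and D_2 separates content-matched from content-mismatched pairs is:

import torch

def d_losses(p_real, p_fake, q_match, q_mismatch):
    """Assumed GAN losses: D_1 distinguishes real/fake, D_2 distinguishes matched/mismatched content."""
    eps = 1e-8
    loss_d1 = -(torch.log(p_real + eps).mean() + torch.log(1 - p_fake + eps).mean())
    loss_d2 = -(torch.log(q_match + eps).mean() + torch.log(1 - q_mismatch + eps).mean())
    return loss_d1, loss_d2

def g_loss(p_fake, q_fake_match):
    """The generator tries to fool both discriminators."""
    eps = 1e-8
    return -(torch.log(p_fake + eps).mean() + torch.log(q_fake_match + eps).mean())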
[Equation defining the text reconstruction loss L_cap-re]

where w_t is the one-hot code of a word in the sentence y, and p(w_t) is the probability distribution over w_t generated by the RNN.
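As an assumed illustration of the caption reconstruction term and of combining the five terms into the overall objective L (equal weighting assumed, since no weights are stated):

import torch
import torch.nn.functional as F

def caption_loss(logits, target_tokens):
    """Cross-entropy between the decoder's word distributions and the ground-truth words w_t."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))

def total_loss(l_mo, l_se, l_sp, l_img_re, l_cap_re):
    """L = L_mo + L_se + L_sp + L_img-re + L_cap-re (equal weighting assumed)."""
    return l_mo + l_se + l_sp + l_img_re + l_cap_re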
The experimental results are as follows:
the dimension of word embedding entered into the GRU by the present invention is set to 300. The present invention does not limit sentence length because it has little impact on the memory of the graphics processor. Both the GRU and RNN of the present invention are trained ab initio with 3072-dimensional four stacked hidden layers. The invention sets the dimension of the joint embedding space to 3072. The present invention trains 30 epochs using an ADAM optimizer. The learning rate for the first 15 epochs was 0.0002, and the last 15 epochs reduced the learning rate to 0.00002. For test set evaluation, the present invention solves the overfitting problem by selecting the model checkpoint that performs best on the validation set. The best model is selected based on the total number of recalls in the validation set.
To verify the performance of the proposed cross-modal retrieval method, the invention evaluates it on two standard public databases, MS-COCO and Flickr30K.
Fig. 2 shows the reconstruction results of the method of the invention given input image-text pairs. The first and second columns show the image-text pairs input into the network, and the third and fourth columns show the corresponding reconstruction results of the invention. As can be seen from Fig. 2, the method of the invention can reconstruct similar images and texts from the input image-text pair.
Table 1 shows the quantitative evaluation results, under the evaluation criteria R@1, R@5 and R@10, of the method of the invention and the 15 best existing cross-modal retrieval methods on the MS-COCO dataset. Due to hardware limitations, the invention could not use the BatchSize adopted by most methods in their experiments, such as 100, 128 or 160; the invention selects VSE++ as the baseline and re-runs its code under the existing hardware conditions (BatchSize 24). The method of the invention yields significant improvements in image-to-text retrieval (+3.4 R@1, +1.9 R@5 and +1.0 R@10) and text-to-image retrieval (+1.7 R@1, +1.2 R@5 and +0.3 R@10). As for the remaining results in Table 1, several comparison methods give better performance because they use larger BatchSizes, whereas the invention sets BatchSize to 24 on a single RTX 2080Ti (11G); setting the same BatchSize as those methods would require more GPUs. It can be expected that the method of the invention would achieve even better performance on the cross-modal retrieval task with BatchSize set to 64, 128, 160, etc.
TABLE 1 comparison of the experimental results with the best current cross-modal search results on the MS-COCO database
Table 2 shows the quantitative evaluation results of the method of the invention on the Flickr30K dataset. The invention selects VSE++ as the baseline and re-runs its code under the existing hardware conditions (BatchSize 24). The method of the invention yields significant improvements in image-to-text retrieval (+3.3 R@1, +2.7 R@5 and +2.0 R@10) and text-to-image retrieval (+3.2 R@1, +2.5 R@5 and +1.3 R@10). It can be expected that the method of the invention would achieve even better performance on the cross-modal retrieval task with BatchSize set to 64, 128, 160, etc.
TABLE 2 comparison of the results of the experiment with the best current results of cross-modality search on the Flickr30K database
FIG. 3 shows visualized results of text retrieval by the invention for given image queries on the MS-COCO dataset, and FIG. 4 shows visualized results of image retrieval by the invention for given text queries on the MS-COCO dataset. For each image query, the top 5 retrieved texts are presented, ranked according to the similarity scores predicted by the method of the invention. For each text query, the top 3 retrieved images, ranked by similarity score, are presented. True matching samples are marked in blue and false matching samples in red.
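Since retrieval ranks candidates of the other modality by the similarity of the semantic information vectors, and cosine similarity is the similarity used for the semantic loss (its use at retrieval time is assumed here), the R@K evaluation can be sketched as follows; the sketch assumes a one-to-one pairing between queries and gallery items, whereas MS-COCO in fact provides five captions per image.

import torch
import torch.nn.functional as F

def recall_at_k(query_se, gallery_se, ks=(1, 5, 10)):
    """Ranks gallery items for each query by cosine similarity; assumes query i matches gallery item i."""
    q = F.normalize(query_se, dim=-1)
    g = F.normalize(gallery_se, dim=-1)
    ranks = (q @ g.t()).argsort(dim=1, descending=True)           # best-to-worst gallery indices per query
    target = torch.arange(q.size(0)).unsqueeze(1)                 # ground-truth index of each query
    hit_pos = (ranks == target).float().argmax(dim=1)             # position of the true match in the ranking
    return {k: (hit_pos < k).float().mean().item() for k in ks}   # fraction of queries with a hit in the top k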

Claims (4)

1. A cross-modal retrieval method based on feature separation and reconstruction is characterized by comprising the following steps:
step one, for the image part of an image-text pair, using a ResNet152 network as the basic image network of the image branch, taking the image in the image-text pair as the input of the image branch, and extracting image features directly from the penultimate fully-connected layer to obtain a visual representation v;
step two, for the text part of the image-text pair, using word encoding to encode each token into a word vector, and then using a GRU as the basic text network of the text branch to convert the word sequence into a text representation l;
step three, after respectively obtaining the visual representation v and the text representation l, carrying out linear transformations through visual and text multilayer perceptrons to respectively obtain the feature vectors of the visual space and the text space;
step four, decomposing the feature vectors of different modal spaces into modal information, semantic information and specific information, wherein:
(1) modal information mo, characterizing the source of the feature vector;
(2) semantic information se representing high-level semantics represented by the feature vector;
(3) specific information sp representing information specific to different modal characteristics;
step five, separating the modal information, the semantic information and the specific information from the visual/text representations by using a feature separation module to obtain the modal information (v_mo, l_mo), semantic information (v_se, l_se) and specific information (v_sp, l_sp) of the visual representation v and the text representation l;
step six, adopting the generator and the discriminator of DCGAN as the generator G and the discriminator D_1 of image reconstruction, respectively, introducing a discriminator D_2 to determine whether the generated image is consistent in content with the real image, and combining the three different kinds of information of the image (v_mo; v_se; v_sp) to carry out image reconstruction;
step seven, decoding the three different kinds of information of the text (l_mo; l_se; l_sp) by using an RNN to reconstruct the text and finally generate a complete sentence.
2. The method of claim 1, wherein the ResNet152 network comprises six convolution modules, Pool5, Rec5b, Rec4f, Rec3d, Rec2c and Pool1, six additional modules, Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A, connected after the respective convolution modules, and 3 fully-connected layers connected in series after all convolution modules, wherein Pool5_A, Rec5b_A, Rec4f_A, Rec3d_A, Rec2c_A and Pool1_A have the same structure: 3 convolutional layers, 1 Concate layer, 1 Correlation layer, 1 sigmoid layer, and 3 fully-connected layers.
3. The cross-modal retrieval method based on feature separation and reconstruction as claimed in claim 1, wherein in the sixth step, the image reconstruction formulas are as follows:

x' = G([v_mo; v_se; v_sp]; θ_G)

p = D_1(n; θ_D1)

q = D_2(n; θ_D2)

where x' denotes the image generated by the generator, n denotes an image input to the discriminators, p and q denote the probabilities, predicted by D_1 and D_2 respectively, that n belongs to x', and θ denotes the corresponding learnable parameters.
4. The cross-modal retrieval method based on feature separation and reconstruction as claimed in claim 1, wherein in the seventh step, the formula for text reconstruction is as follows:

y' = FC_2(RNN([l_mo; l_se; l_sp], W_e; θ_RNN); θ_FC2)

where y' denotes the probability distribution of the reconstructed sentence, W_e denotes the word2vec embedding matrix, θ_RNN denotes the learnable parameters of the RNN, FC_2 denotes a fully-connected layer, and θ_FC2 denotes the learnable parameters of FC_2.
CN202110859387.6A 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction Active CN113656539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110859387.6A CN113656539B (en) 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110859387.6A CN113656539B (en) 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction

Publications (2)

Publication Number Publication Date
CN113656539A true CN113656539A (en) 2021-11-16
CN113656539B CN113656539B (en) 2023-08-18

Family

ID=78478872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110859387.6A Active CN113656539B (en) 2021-07-28 2021-07-28 Cross-modal retrieval method based on feature separation and reconstruction

Country Status (1)

Country Link
CN (1) CN113656539B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784118A (en) * 2017-11-14 2018-03-09 北京林业大学 A kind of Video Key information extracting system semantic for user interest
WO2020042597A1 (en) * 2018-08-31 2020-03-05 深圳大学 Cross-modal retrieval method and system
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN110807122A (en) * 2019-10-18 2020-02-18 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint
CN111428071A (en) * 2020-03-26 2020-07-17 电子科技大学 Zero-sample cross-modal retrieval method based on multi-modal feature synthesis
CN112800292A (en) * 2021-01-15 2021-05-14 南京邮电大学 Cross-modal retrieval method based on modal specificity and shared feature learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
田薪: "Research on image-text cross-modal retrieval based on deep hash learning", China Master's Theses Full-text Database (Information Science and Technology), vol. 2021, no. 2, pages 138-2644 *

Also Published As

Publication number Publication date
CN113656539B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Chen et al. Shallowing deep networks: Layer-wise pruning based on feature representations
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113312500A (en) Method for constructing event map for safe operation of dam
CN110298395B (en) Image-text matching method based on three-modal confrontation network
CN111400494B (en) Emotion analysis method based on GCN-Attention
Landthaler et al. Extending full text search for legal document collections using word embeddings
Zhang et al. Learning implicit class knowledge for RGB-D co-salient object detection with transformers
CN112687388A (en) Interpretable intelligent medical auxiliary diagnosis system based on text retrieval
JP2022547163A (en) Spatio-temporal interactions for video comprehension
Chen et al. HCP-MIC at VQA-Med 2020: Effective Visual Representation for Medical Visual Question Answering.
Li et al. Adapting clip for phrase localization without further training
CN111858984A (en) Image matching method based on attention mechanism Hash retrieval
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN116662502A (en) Method, equipment and storage medium for generating financial question-answer text based on retrieval enhancement
CN113807307B (en) Multi-mode joint learning method for video multi-behavior recognition
Zhao et al. Deeply supervised active learning for finger bones segmentation
Xia et al. Destruction and reconstruction learning for facial expression recognition
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Mohamed et al. ImageCLEF 2020: An approach for Visual Question Answering using VGG-LSTM for Different Datasets.
CN112329441A (en) Legal document reading model and construction method
CN116524915A (en) Weak supervision voice-video positioning method and system based on semantic interaction
CN113656539B (en) Cross-modal retrieval method based on feature separation and reconstruction
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
CN116227486A (en) Emotion analysis method based on retrieval and contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant