CN112818157B - Combined query image retrieval method based on multi-order adversarial feature learning - Google Patents

Combined query image retrieval method based on multi-order adversarial feature learning

Info

Publication number
CN112818157B
CN112818157B
Authority
CN
China
Prior art keywords
image
features
text
fusion
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110185641.9A
Other languages
Chinese (zh)
Other versions
CN112818157A (en)
Inventor
纪守领
付之笑
董建锋
张旭鸿
何源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202110185641.9A
Publication of CN112818157A
Application granted
Publication of CN112818157B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a combined query image retrieval method based on multi-order adversarial feature learning, comprising the following steps: first, image features are obtained with a pre-trained feature extraction module and text features with an LSTM network, and the features of the two modalities are fused under self-attention guidance; high-order features are then generated from the low-order features by bilinear fusion; next, the similarity relations between features are learned with a triplet loss, and a discriminator is set against the retrieval network so that their competition further promotes the fusion of the features; finally, the discriminator and the retrieval network are trained jointly in an end-to-end manner, realizing efficient combined query image retrieval. By exploiting deep learning techniques and borrowing the idea of adversarial games, the invention greatly improves the performance and efficiency of combined query image retrieval.

Description

Combined query image retrieval method based on multi-order adversarial feature learning
Technical Field
The invention relates to the technical field of machine learning and combined query image retrieval, and in particular to a combined query image retrieval method based on multi-order adversarial feature learning.
Background
With the rapid development of information technology and the popularization of mobile networked devices, people can easily access massive and diverse picture resources on the network. Faced with such huge amounts of data, an efficient and accurate image retrieval method and system become indispensable whenever a picture that meets a user's requirements has to be found. The rapidly growing total number of pictures brings a large-scale increase in the number of similar images, which greatly reduces retrieval accuracy, so existing image retrieval techniques face enormous pressure and challenges. The mainstream text-to-image and image-to-image retrieval modes each have their own limitations: the expressive power of pure text is limited and information is lost when an idea is converted into language, while a pure picture cannot indicate the desired direction of change, so the search results may still contain a large number of unwanted pictures. Combined query image retrieval is one way of solving these problems: it accepts a reference image and a modifying text simultaneously as input, retaining the information of the image while satisfying the modification requirements of the text. A new image retrieval method that combines a picture query with a modifying text is therefore becoming a research and development trend.
In recent years, machine learning methods have shown excellent performance in the field of image retrieval. They mainly use a convolutional neural network (CNN) to extract picture features and a recurrent neural network (RNN) to extract text features, train the neural network model with metric learning, and complete image-to-image and text-to-image retrieval by comparing the similarity between the query features and the target image features. There are also graph-based methods, neighbor-analysis-based methods, and so on.
Existing combined query image retrieval methods generally extract features of the image and of the text separately and then fuse them; the fusion features, which contain information of both modalities, are compared for similarity with the features of the candidate pictures to retrieve the target picture. Current methods still have shortcomings: they do not make full use of multi-scale features, although features of different scales often contain information specific to each level, and their image-text fusion schemes are relatively simple, so retrieval efficiency is low.
Disclosure of Invention
Aiming at the shortcomings of existing methods, the invention provides a combined query image retrieval method based on multi-order adversarial feature learning. The method first fuses low-order features into high-order features and sets up a discriminator to compete against the retrieval network, so that the competition further promotes the fusion of the image-text features and realizes efficient combined query image retrieval. Compared with existing methods, the obtained features contain richer information and the correlations between different levels, and the information of the image and text modalities is fused more tightly.
A combined query image retrieval method based on multi-order adversarial feature learning uses a retrieval network to obtain the multi-level features and high-order features of the candidate images, fuses the reference image with the modified-text features to obtain multi-level image-text fusion features and a high-order image-text fusion feature, concatenates the multi-level and high-order image-text fusion features, computes the cosine similarity one by one with the concatenated multi-level and high-order features of the candidate images, sorts the candidates by similarity, and returns the sorted candidate images as the retrieval result of the query image. The retrieval network comprises a feature extraction module, a self-attention fusion module and a bilinear fusion module, and is constructed and trained by the following steps:
(1) Features are extracted from the reference image, the modified text and the target image with the feature extraction module to obtain the initial features of the two modalities, where the initial features of the reference image and of the target image comprise image features output by multiple levels of the feature extraction module.
(2) The initial text features obtained in step (1) are fused with the features of the reference image at the different levels with the self-attention fusion module to obtain the multi-level image-text fusion features.
(3) The multi-level image-text fusion features of step (2) are fused across levels with the bilinear fusion module to obtain the high-order image-text fusion feature, and the multi-level initial features of the target image of step (1) are likewise fused across levels to obtain the high-order image feature.
(4) The multi-level image-text fusion features of step (2) are compared with the multi-level features of the target image of step (1), and the high-order image-text fusion feature of step (3) is compared with the high-order image feature of the target image, for similarity learning.
(5) The multi-level initial features, the multi-level image-text fusion features and the high-order image-text fusion feature of the reference image, the text feature, and the multi-level initial features and high-order image feature of the target image are input into a discriminator, which judges whether the multi-level image-text fusion features or the target-image features at the different levels, relative to the corresponding reference-image features, satisfy the modification requirement of the text, for adversarial learning. The retrieval network is finally trained in an end-to-end manner. The reference-image features comprise the multi-level initial features of the reference image and the high-order image feature obtained by fusing those multi-level initial features across levels.
Further preferably, the initial features comprise image features of three levels (low, middle and high), and the high-order feature is the fusion of the middle-level and high-level features.
Further preferably, in step (1), the feature extraction module comprises an LSTM network and a MobileNet convolutional neural network, wherein:
the LSTM network extracts features from the input text to obtain the initial text features;
a pre-trained MobileNet or ResNet18 convolutional neural network extracts features from the reference image and from the paired target image, and the initial image features of the different levels are taken from the lower, middle and upper layers of the network.
Further preferably, the self-attention fusion module comprises a convolution layer, a self-attention network and a linear layer, and step (2) specifically comprises the following sub-steps:
(2-1) the initial features of each level of the reference image are connected with the text feature, and a preliminary fusion feature is obtained with the convolution layer;
(2-2) the preliminary fusion feature is further learned into the image-text fusion feature with the self-attention network and the linear layer.
Further preferably, the bilinear fusion module comprises a plurality of linear layers, and the method for obtaining the high-order features in step (3) comprises the following sub-steps:
(3-1) the multi-level image-text fusion features of step (2) and the multi-level features of the target image of step (1) are each mapped into the same dimension with linear layers;
(3-2) a dot product is taken over the image-text fusion features of the levels mapped in step (3-1), and a linear layer then yields the high-order image-text fusion feature; a dot product is likewise taken over the target-image features of the levels mapped in step (3-1), and a linear layer then yields the high-order image feature of the target image.
Further preferably, in step (4), the multi-level image-text fusion features obtained in step (2) are compared with the multi-level features of the target image obtained in step (1), and the high-order image-text fusion feature obtained in step (3) is compared with the high-order image feature of the target image, so that the model learns the similarity relations between features of different levels and orders. The triplet loss is expressed as:

L_tri = Σ_j [ max(0, d(φ_j, ψ_j) - d(φ_j, ψ_j^-) + m) + max(0, d(φ_j, ψ_j) - d(φ_j^-, ψ_j) + m)
            + max(0, d(g(ψ_j), h) - d(g(ψ_j^-), h) + m) + max(0, d(g(ψ_j), h) - d(g(ψ_j), h^-) + m) ]

where φ_j denotes the image-text fusion feature, ψ_j the matched initial feature or high-order image feature of the target image, g(·) the linear mapping applied to the image feature, h the modified-text feature, the superscript '-' marks negative samples, d(·,·) denotes the Euclidean distance, m denotes the margin, and j runs over the different levels and the high order. The first two terms push the image-text fusion feature of each level and the high-order image-text fusion feature φ_j closer than the negative sample φ_j^- to the matched initial features and high-order image feature ψ_j of the target image; the latter two terms align the mapped image features with their corresponding modified-text feature h.
Further preferably, the adversarial training method using the discriminator in step (5) comprises the following steps:
(5-1) The initial features and the high-order image feature of the input reference image are subtracted from the corresponding image features of the target image, and the difference is transformed by a linear layer.
(5-2) The input modified-text feature is transformed by another linear layer; the transformed text feature is multiplied with the feature obtained from the linear-layer transform of step (5-1), the entries of the product are summed, and an activation function yields the discriminator's predicted value, which judges whether the target-image feature, compared with the reference image, satisfies the requirement of the modified text. The real loss function is expressed as:

L_real = E[ max(0, 1 - D(x_j, ψ_j, h)) ]

where E[·] denotes averaging within the batch, D denotes the discriminator's predicted value, and x_j denotes the reference-image features, comprising the initial features and the high-order image feature obtained with the same high-order feature construction as for the target image.
(5-3) For the shuffled initial and high-order image features of the reference image and the correspondingly shuffled image features of the target image, the discriminator judges, by the method of steps (5-1)-(5-2), that the wrongly paired image features do not satisfy the requirement of the modified text. The error loss function is expressed as:

L_err = E[ max(0, 1 + D(x_j^π, ψ_j^π, h)) ]

where x_j^π and ψ_j^π denote the shuffled x_j and ψ_j.
(5-4) The discriminator's judgement of the multi-level image-text fusion features determines both the generation loss of the retrieval network and the adversarial loss of the discriminator. The adversarial loss of the discriminator is the fusion loss, expressed as:

L_fuse = E[ max(0, 1 + D(x_j, φ_j, h)) ]

and the generation loss of the retrieval network is:

L_gen = E[ max(0, 1 - D(x_j, φ_j, h)) ]

(5-5) The losses calculated in the above steps, together with the triplet loss, supervise the end-to-end training of the model.
Further preferably, the multi-level features and the high-order features of the candidate images obtained by the retrieval network are concatenated to construct a candidate feature library.
The invention has the following beneficial effects: by exploiting deep learning techniques and borrowing the idea of adversarial games, the invention strengthens the tightness of cross-modal feature fusion, obtains feature representations with richer semantic information, and thereby improves the performance and efficiency of combined query image retrieval to a great extent.
Drawings
FIG. 1 is a schematic diagram of the multi-order adversarial representation learning network architecture of the present invention;
FIG. 2 is a schematic diagram of the self-attention network structure of the present invention;
FIG. 3 is a schematic diagram of the bilinear fusion module structure of the present invention;
FIG. 4 is a schematic diagram of the discriminator network structure of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a combined query image retrieval method based on multi-order adversarial feature learning, which comprises the following steps:
(1) Features are extracted from the reference image, the modified text and the target image with different feature extraction methods to obtain the initial features of the two modalities.
(1-1) Given the input reference image X_s and target image X_t, image features x_i are extracted with a MobileNet or ResNet18 network model pre-trained on the ImageNet dataset, where i denotes the level of the network; in practice the model extracts features from its lower, middle and upper layers.
(1-2) Given the input modified text T, the words are first converted into word-embedding vectors {w_1, w_2, ..., w_n} with a simple vocabulary and an embedding layer, where w_n denotes the embedding vector of the n-th word. Text features are then extracted with an LSTM network model, which can understand the contextual relations of the words in a sentence; the hidden-layer size is 1024, and the final output vector, after being mapped by a fully connected layer, is taken as the text feature vector h of dimension l = 512.
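For illustration only, the two encoders in (1-1) and (1-2) could be sketched in PyTorch roughly as below; the class names, the ResNet18 split points and the embedding dimension are assumptions made for the example rather than details stated in the patent:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """Multi-level image encoder: returns low-, mid- and high-level feature maps
    from a ResNet18 pre-trained on ImageNet (split points are an assumed choice)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.low, self.mid, self.high = backbone.layer2, backbone.layer3, backbone.layer4

    def forward(self, img):
        x = self.stem(img)
        f_low = self.low(x)          # low-level feature map
        f_mid = self.mid(f_low)      # mid-level feature map
        f_high = self.high(f_mid)    # high-level feature map
        return f_low, f_mid, f_high

class TextEncoder(nn.Module):
    """Embeds the modified text and encodes it with an LSTM (hidden size 1024),
    then maps the last output to a 512-d text feature h, as described above."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=1024, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        out, _ = self.lstm(self.embed(tokens))
        return self.fc(out[:, -1, :])          # final LSTM output -> 512-d feature
```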
(2) The above feature extraction yields the initial features of the image and of the text. These features, however, are simply extracted by the CNN and the LSTM models, contain only information of their own modality, and cannot be used directly for retrieval, so the image and text features have to be fused to represent the target image. The invention adopts a self-attention (SA) method so that the model learns which parts of the features to keep and which to discard. The specific steps are as follows:
(2-1) The text feature is copied and expanded to the size of the matched image feature, connected with it, and passed through a 1×1 convolution layer to obtain a preliminary image-text fusion feature. This fusion is still coarse and the network cannot easily decide which information of the two modalities to keep or drop, so an attention module is further used to raise the weights of the preserved and modified parts and lower the weights of the unneeded parts.
(2-2) The structure of the self-attention fusion module used in this embodiment is shown in FIG. 2. The module processes the preliminary image-text fusion feature as follows: three 1×1 convolution layers map the feature to the query, key and value hidden spaces respectively, giving Q_i, K_i and V_i, where i denotes the feature level; the reshaped Q_i and K_i have size n_i × c_i, with n_i = h_i × w_i, and h_i, w_i, c_i denoting the height, width and channel number of the i-th level feature map. Matrix multiplication followed by softmax activation then yields the self-attention matrix A_i of size n_i × n_i:

A_i = softmax(Q_i K_i^T)

(2-3) The self-attention matrix is multiplied with V_i, and the product is passed through a 1×1 convolution layer to obtain the output feature, i.e., the refined image-text fusion feature that keeps the required parts and modifies the designated parts according to the attention weights.
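A minimal sketch of such a self-attention fusion module for a single feature level, assuming PyTorch and treating the channel bookkeeping (tiling the text feature, identical query/key/value widths) as illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusion(nn.Module):
    """Fuses one level of image features with the (spatially tiled) text feature,
    then refines the fusion with self-attention as in (2-1) to (2-3)."""
    def __init__(self, img_channels, text_dim):
        super().__init__()
        self.pre_fuse = nn.Conv2d(img_channels + text_dim, img_channels, kernel_size=1)
        self.q = nn.Conv2d(img_channels, img_channels, kernel_size=1)
        self.k = nn.Conv2d(img_channels, img_channels, kernel_size=1)
        self.v = nn.Conv2d(img_channels, img_channels, kernel_size=1)
        self.out = nn.Conv2d(img_channels, img_channels, kernel_size=1)

    def forward(self, img_feat, text_feat):
        b, c, h, w = img_feat.shape
        # (2-1) tile the text feature over the spatial grid and pre-fuse with a 1x1 conv
        text_map = text_feat[:, :, None, None].expand(-1, -1, h, w)
        fused = self.pre_fuse(torch.cat([img_feat, text_map], dim=1))
        # (2-2) project to query / key / value and compute the n x n attention matrix
        q = self.q(fused).flatten(2).transpose(1, 2)   # (b, n, c), n = h*w
        k = self.k(fused).flatten(2)                   # (b, c, n)
        v = self.v(fused).flatten(2).transpose(1, 2)   # (b, n, c)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # (b, n, n)
        # (2-3) apply the attention to the values and refine with a final 1x1 conv
        refined = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return self.out(refined)                       # refined image-text fusion feature
```

One module of this kind would be instantiated per feature level, with img_channels set to that level's channel count.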
(3) The multi-level refined image-text fusion features obtained so far lie in different feature spaces and represent their respective styles. To make better use of the multi-level feature representation and to explore the interaction between different levels, the invention uses a bilinear fusion method (the structure of the bilinear fusion module is shown in FIG. 3) to combine the mid-level and high-level features into a high-order feature, which produces a richer representation and lets the model learn the correlation between the mid-level and high-level features. The specific method is as follows:
(3-1) Since the feature dimensions of the mid and high levels are 512 and 1024 respectively, they first have to be mapped into the same dimensional space before they can be fused. Two linear layers R and S therefore map the mid-level and high-level features into the same dimension.
(3-2) An element-wise product preliminarily fuses the two levels, and a third linear layer P then extracts finer feature information and transforms it into the final high-order feature:

φ_MH = P( R(φ_M) ⊙ S(φ_H) )

where φ_M and φ_H denote the mid-level and high-level features and ⊙ denotes the element-wise product.
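A corresponding sketch of the bilinear fusion step, with the joint and output dimensions chosen arbitrarily for the example:

```python
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    """Combines a mid-level (512-d) and a high-level (1024-d) feature into one
    high-order feature via two projections, an element-wise product and a final
    linear layer, following the description above."""
    def __init__(self, mid_dim=512, high_dim=1024, joint_dim=1024, out_dim=1024):
        super().__init__()
        self.R = nn.Linear(mid_dim, joint_dim)    # maps the mid-level feature
        self.S = nn.Linear(high_dim, joint_dim)   # maps the high-level feature
        self.P = nn.Linear(joint_dim, out_dim)    # extracts the final high-order feature

    def forward(self, f_mid, f_high):
        joint = self.R(f_mid) * self.S(f_high)    # element-wise product fuses the two levels
        return self.P(joint)                      # high-order feature
```

Applied once to the image-text fusion features and once to the target-image features, this yields the two high-order features compared in step (4).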
(4) So that the model learns the similarity relations between features, i.e., similar features are close and dissimilar features are far apart, every level is trained with a bidirectional triplet loss. With φ_j denoting the image-text fusion feature, ψ_j the matched target-image feature, g(ψ_j) the linearly mapped image feature, h the modified-text feature, a superscript '-' marking negative samples, d(·,·) the Euclidean distance and m the margin (0.2 in this embodiment), the loss over the levels j ∈ {L, M, H, MH} (low, mid, high and high-order) is:

L_tri = Σ_j [ max(0, d(φ_j, ψ_j) - d(φ_j, ψ_j^-) + m) + max(0, d(φ_j, ψ_j) - d(φ_j^-, ψ_j) + m)
            + max(0, d(g(ψ_j), h) - d(g(ψ_j^-), h) + m) + max(0, d(g(ψ_j), h) - d(g(ψ_j), h^-) + m) ]

The first two terms push the image-text fusion feature of each level and the high-order image-text fusion feature φ_j closer than the negative sample φ_j^- to the matched initial features and high-order image feature ψ_j of the target image; the latter two terms align the mapped image features with the corresponding modified-text feature h, so that the learned fusion features carry semantics closer to those of the modified text.
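The fusion-target part of this loss could be written as the following sketch (only the first two terms are shown; treating the other samples of the batch as negatives is an assumed sampling strategy):

```python
import torch.nn.functional as F

def triplet_term(anchor, positive, negative, margin=0.2):
    """One hinge term: pull the anchor closer to the positive than to the negative."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def multi_order_triplet_loss(fused, target, fused_neg, target_neg, margin=0.2):
    """Bidirectional triplet loss summed over the levels {low, mid, high, high-order};
    each argument is a dict mapping a level name to a (batch, dim) tensor."""
    loss = 0.0
    for j in fused:
        loss = loss + triplet_term(fused[j], target[j], target_neg[j], margin)  # vs. negative target
        loss = loss + triplet_term(target[j], fused[j], fused_neg[j], margin)   # vs. negative fusion
    return loss
```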
(5) To further promote the fusion of the reference picture and the modified text, the invention designs a discriminator network that competes against the retrieval network and computes an adversarial hinge loss. The specific steps are as follows:
(5-1) The reference-image features, the text feature and the target-image features are input into the discriminator to learn the correct pairing relation. The discriminator first subtracts the reference-image feature from the target-image feature at each level and applies a linear-layer transform, multiplies the result with the text feature transformed by another linear layer, sums the values of the product, and applies a tanh activation to obtain the predicted value:

D(x_j, ψ_j, h) = tanh( sum( W_1(ψ_j - x_j) ⊙ W_2 h ) )

The predicted value expresses the degree to which the difference between the target-image feature ψ_j and the reference-image feature x_j agrees with the modified text h. The real loss is calculated as

L_real = E[ max(0, 1 - D(x_j, ψ_j, h)) ]

where E[·] denotes averaging within the batch, D denotes the discriminator's predicted value, W_1 and W_2 denote the two linear layers, and x_j denotes the reference-image features, comprising the pooled initial features and the high-order image feature obtained with the same high-order feature construction as for the target image (right half of FIG. 1).
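An illustrative PyTorch sketch of such a discriminator for one feature level; the feature dimensions are assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Predicts how well the difference between a target-image feature and a
    reference-image feature matches the modified-text feature, following (5-1)."""
    def __init__(self, img_dim=1024, text_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)    # W1: transforms the feature difference
        self.text_proj = nn.Linear(text_dim, hidden_dim)  # W2: transforms the text feature

    def forward(self, ref_feat, tgt_feat, text_feat):
        diff = self.img_proj(tgt_feat - ref_feat)         # (batch, hidden)
        txt = self.text_proj(text_feat)                   # (batch, hidden)
        return torch.tanh((diff * txt).sum(dim=1))        # scalar prediction per sample
```

The same module is reused in (5-2) and (5-3), with shuffled image features and with the fusion features taking the place of the target features.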
(5-2) By the same method as in (5-1), the shuffled reference-image features, the text feature and the shuffled target-image features are input into the discriminator to learn the wrong pairing relation. The error loss is calculated as

L_err = E[ max(0, 1 + D(x_j^π, ψ_j^π, h)) ]

where x_j^π and ψ_j^π denote the shuffled x_j and ψ_j; because of the shuffling, the image feature difference can no longer satisfy the requirement of the modified text.
(5-3) By the same method as in (5-1), the reference-image features, the text feature and the image-text fusion features are input into the discriminator to calculate the fusion loss

L_fuse = E[ max(0, 1 + D(x_j, φ_j, h)) ]

The discrimination loss finally minimized by the discriminator is

L_dis = L_real + L_err + L_fuse

(5-4) Contrary to the training target of the discriminator, the retrieval network itself acts as the 'generator' of the fusion features and improves itself by minimizing the generation loss

L_gen = E[ max(0, 1 - D(x_j, φ_j, h)) ]

As the retrieval network and the discriminator are optimized towards opposite targets, the fused image and text features come to agree better with the target image.
(5-5) The final loss function of the whole network is a weighted sum of the retrieval loss and the adversarial loss. The generation-side loss is

L_G = L_tri + α · L_gen

and the discrimination-side loss is

L_D = L_tri + α · L_dis

where α is a hyper-parameter set to 0.01. During actual training, L_G is used to train the retrieval part of the network and L_D is used to train the discriminator, alternately, so that the performance of both the retrieval network and the discriminator improves.
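Putting the pieces together, the alternating adversarial training of (5-1) to (5-5) could look roughly like the sketch below; retrieval_net, train_loader, the learning rate and the single-level simplification are assumptions, and triplet_term and Discriminator refer to the earlier sketches:

```python
import torch
import torch.nn.functional as F

def hinge_real(d):   # the discriminator should score correct pairs high
    return F.relu(1.0 - d).mean()

def hinge_fake(d):   # ... and shuffled pairs / fusion features low
    return F.relu(1.0 + d).mean()

alpha = 0.01                                                   # adversarial weight, as stated above
opt_R = torch.optim.Adam(retrieval_net.parameters(), lr=1e-4)  # retrieval network (assumed module)
opt_D = torch.optim.Adam(discriminator.parameters(), lr=1e-4)  # Discriminator sketched earlier

for ref_img, text, tgt_img in train_loader:                    # hypothetical data loader
    # retrieval_net is assumed to return, for one level (for brevity), the fusion
    # feature, the target feature, the pooled reference feature and the text feature h.
    fused, tgt, ref, h = retrieval_net(ref_img, text, tgt_img)
    perm = torch.randperm(ref.size(0))                         # indices for mismatched pairs

    # 1) update the retrieval network ("generator"): triplet loss + generation loss
    loss_tri = triplet_term(fused, tgt, tgt[perm]) + triplet_term(tgt, fused, fused[perm])
    loss_R = loss_tri + alpha * hinge_real(discriminator(ref, fused, h))
    opt_R.zero_grad(); loss_R.backward(); opt_R.step()

    # 2) update the discriminator: real + error + fusion hinge losses
    #    (the triplet term has no gradient w.r.t. the discriminator, so it is omitted here)
    fused, tgt, ref, h = [t.detach() for t in retrieval_net(ref_img, text, tgt_img)]
    loss_D = (hinge_real(discriminator(ref, tgt, h))                # correct pairs
              + hinge_fake(discriminator(ref[perm], tgt[perm], h))  # shuffled pairs
              + hinge_fake(discriminator(ref, fused, h)))           # fusion features
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```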
(6) Through the training of step (5), the model has learned how to fuse the image and text features. Given a reference image and a modified text, the model retrieves the target image that satisfies the modification requirement from the candidate pictures by the following steps:
and (6-1) inputting the candidate pictures into a retrieval network without fusion to obtain middle-high-level features and high-level features, and splicing the middle-high-level features and the high-level features together to construct a candidate feature library.
(6-2) inputting the given reference image and the modified text into a retrieval network for fusion to obtain middle-high-level and high-level fusion characteristics, splicing the fusion characteristics together, calculating cosine similarity with the candidate characteristics one by one, sorting according to the similarity, and returning the sorted result as a retrieval result.
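For illustration, the retrieval stage in (6-1) and (6-2) might be sketched as follows; encode_image, fuse and candidate_loader are assumed interfaces of the trained retrieval network rather than names from the patent:

```python
import torch
import torch.nn.functional as F

def build_candidate_library(retrieval_net, candidate_loader):
    """(6-1) Encode every candidate image (no fusion) and concatenate its
    multi-level and high-order features into one vector per candidate."""
    feats = []
    with torch.no_grad():
        for imgs in candidate_loader:
            level_feats = retrieval_net.encode_image(imgs)   # assumed: list of per-level features
            feats.append(torch.cat(level_feats, dim=1))
    return torch.cat(feats, dim=0)                           # (num_candidates, dim)

def query(retrieval_net, ref_img, text, library, topk=10):
    """(6-2) Fuse one reference image with its modified text, concatenate the
    multi-level and high-order fusion features, rank candidates by cosine similarity."""
    with torch.no_grad():
        fusion_feats = retrieval_net.fuse(ref_img, text)     # assumed: list of per-level fusions
        q = torch.cat(fusion_feats, dim=1)                   # query vector, batch size 1
    sims = F.cosine_similarity(q, library)                   # similarity to every candidate
    return sims.topk(topk).indices                           # indices of the best-ranked candidates
```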
The performance comparison between the image retrieval network constructed by the method of the invention and other commonly used image retrieval methods on the Fashion200k dataset is shown in Table 1; the results show that the method of the invention greatly improves the performance and efficiency of combined query image retrieval.
TABLE 1 Comparison of retrieval network performance of different methods
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (6)

1. A combined query image retrieval method based on multi-order adversarial feature learning, characterized in that a retrieval network is used to obtain multi-level features and high-order features of candidate images; a reference image is fused with modified-text features to obtain multi-level image-text fusion features and a high-order image-text fusion feature; the multi-level and high-order image-text fusion features are concatenated, and the cosine similarity with the concatenated multi-level and high-order features of the candidate images is computed one by one; the candidates are sorted by similarity, and the sorted candidate images are returned as the retrieval result of the query image; the retrieval network comprises a feature extraction module, a self-attention fusion module and a bilinear fusion module; the retrieval network is constructed and trained by the following steps:
(1) performing feature extraction on the reference image, the modified text and the target image with the feature extraction module to obtain initial features of the two modalities, wherein the initial features of the reference image and of the target image comprise image features output by multiple levels of the feature extraction module;
(2) fusing the initial text features obtained in step (1) with the features of the reference image at the different levels with the self-attention fusion module to obtain the multi-level image-text fusion features;
(3) fusing the multi-level image-text fusion features of step (2) across levels with the bilinear fusion module to obtain the high-order image-text fusion feature, and fusing the multi-level initial features of the target image of step (1) across levels to obtain the high-order image feature;
(4) comparing the multi-level image-text fusion features obtained in step (2) with the multi-level features of the target image obtained in step (1), and the high-order image-text fusion feature obtained in step (3) with the high-order image feature of the target image, for similarity learning;
(5) inputting the multi-level initial features, the multi-level image-text fusion features and the high-order image-text fusion feature of the reference image, the text feature, and the multi-level initial features and high-order image feature of the target image into a discriminator, which judges whether the image-text fusion features of the different levels or the target-image features, relative to the corresponding reference-image features, satisfy the modification requirement of the text, for adversarial learning; finally training the retrieval network in an end-to-end manner; the reference-image features comprising the multi-level initial features of the reference image and the high-order image feature obtained by fusing those multi-level initial features across levels; the adversarial training method using the discriminator comprising the following steps:
(5-1) subtracting the initial features and the high-order image feature of the input reference image from the corresponding image features of the target image and transforming the difference with a linear layer;
(5-2) transforming the input modified-text feature with a linear layer, multiplying the transformed text feature with the feature obtained from the linear-layer transform of step (5-1), summing the entries of the product and applying an activation function to obtain the discriminator's predicted value, which judges whether the target-image feature, compared with the reference image, satisfies the requirement of the modified text; the real loss function being expressed as:

L_real = E[ max(0, 1 - D(x_j, ψ_j, h)) ]

where E[·] denotes averaging within the batch, D denotes the discriminator's predicted value, ψ_j denotes the target-image features, h denotes the modified-text feature, and x_j denotes the reference-image features, comprising the initial features and the high-order image feature obtained with the same high-order feature construction as for the target image;
(5-3) for the shuffled initial and high-order image features of the reference image and the correspondingly shuffled image features of the target image, the discriminator judging, by the method of steps (5-1)-(5-2), that the wrongly paired image features do not satisfy the requirement of the modified text, the error loss function being expressed as:

L_err = E[ max(0, 1 + D(x_j^π, ψ_j^π, h)) ]

where x_j^π and ψ_j^π denote the shuffled x_j and ψ_j;
(5-4) the discriminator's judgement of the multi-level image-text fusion features determining both the generation loss of the retrieval network and the adversarial loss of the discriminator, the adversarial loss of the discriminator being the fusion loss, expressed as:

L_fuse = E[ max(0, 1 + D(x_j, φ_j, h)) ]

and the generation loss of the retrieval network being:

L_gen = E[ max(0, 1 - D(x_j, φ_j, h)) ]

where φ_j denotes the image-text fusion feature;
(5-5) the losses calculated by the above steps, together with the triplet loss, supervising the end-to-end training of the model.
2. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein in step (1) the feature extraction module comprises an LSTM network and a MobileNet convolutional neural network, wherein:
carrying out feature extraction on an input text by using an LSTM network to obtain initial features of the text;
and respectively extracting the features of the reference image and the paired target images by utilizing a pre-trained MobileNet or ResNet18 convolutional neural network, and obtaining the initial image features of different layers from the lower layer, the middle layer and the upper layer of the network.
3. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein the self-attention fusion module comprises a convolution layer, a self-attention network and a linear layer, and step (2) specifically comprises the following sub-steps:
(2-1) connecting the initial features of each layer of the reference image with the text features, and obtaining preliminary fusion features by using a convolution layer;
and (2-2) learning the preliminary fusion features to further image-text fusion features by utilizing a self-attention network and a linear layer.
4. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein the bilinear fusion module comprises a plurality of linear layers, and the method for obtaining the high-order features in step (3) comprises the following sub-steps:
(3-1) mapping the multilevel image-text fusion characteristics in the step (2) and the multilevel characteristics of the target image in the step (1) into the same dimensionality by using linear layers respectively;
(3-2) performing dot product on the image-text fusion characteristics of the plurality of layers mapped in the step (3-1), and then obtaining high-order image-text fusion characteristics by using a linear layer; and (4) performing dot product on the features of the target images of the plurality of layers mapped in the step (3-1), and then using a linear layer to obtain the high-order image features of the target images.
5. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein in step (4) the multi-level image-text fusion features obtained in step (2) are compared with the multi-level features of the target image obtained in step (1), and the high-order image-text fusion feature obtained in step (3) is compared with the high-order image feature of the target image, using a defined triplet loss, so that the model learns the similarity relations between features of different levels and orders, the triplet loss being expressed as:

L_tri = Σ_j [ max(0, d(φ_j, ψ_j) - d(φ_j, ψ_j^-) + m) + max(0, d(φ_j, ψ_j) - d(φ_j^-, ψ_j) + m)
            + max(0, d(g(ψ_j), h) - d(g(ψ_j^-), h) + m) + max(0, d(g(ψ_j), h) - d(g(ψ_j), h^-) + m) ]

where φ_j denotes the image-text fusion feature, ψ_j the matched initial feature or high-order image feature of the target image, g(·) the linear mapping applied to the image feature, h the modified-text feature, the superscript '-' marks negative samples, d(·,·) denotes the Euclidean distance, m denotes the margin, and j runs over the different levels and the high order; the first two terms push the image-text fusion feature of each level and the high-order image-text fusion feature φ_j closer than the negative sample φ_j^- to the matched initial features and high-order image feature ψ_j of the target image; the latter two terms align the mapped image features with their corresponding modified-text feature h.
6. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein the multi-level features and the high-order features of the candidate images obtained by the retrieval network are concatenated to construct a candidate feature library.
CN202110185641.9A 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning Active CN112818157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110185641.9A CN112818157B (en) 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110185641.9A CN112818157B (en) 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning

Publications (2)

Publication Number Publication Date
CN112818157A (en) 2021-05-18
CN112818157B (en) 2022-09-16

Family

ID=75865297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110185641.9A Active CN112818157B (en) 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning

Country Status (1)

Country Link
CN (1) CN112818157B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723986A (en) * 2022-03-16 2022-07-08 平安科技(深圳)有限公司 Text image matching method, device, equipment and storage medium
CN115905610B (en) * 2023-03-08 2023-05-26 成都考拉悠然科技有限公司 Combined query image retrieval method of multi-granularity attention network
CN117932099A (en) * 2024-03-21 2024-04-26 大连海事大学 Multi-mode image retrieval method based on modified text feedback

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110275972A (en) * 2019-06-17 2019-09-24 浙江工业大学 A kind of case retrieval methods based on content introducing dual training

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319686B (en) * 2018-02-01 2021-07-30 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space
CN109145974B (en) * 2018-08-13 2022-06-24 广东工业大学 Multilevel image feature fusion method based on image-text matching
CN110298395B (en) * 2019-06-18 2023-04-18 天津大学 Image-text matching method based on three-modal confrontation network
CN110516085B (en) * 2019-07-11 2022-05-17 西安电子科技大学 Image text mutual retrieval method based on bidirectional attention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110275972A (en) * 2019-06-17 2019-09-24 浙江工业大学 A kind of case retrieval methods based on content introducing dual training

Also Published As

Publication number Publication date
CN112818157A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112818157B (en) Combined query image retrieval method based on multi-order adversarial feature learning
WO2023093574A1 (en) News event search method and system based on multi-level image-text semantic alignment model
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN110599592B (en) Three-dimensional indoor scene reconstruction method based on text
CN113657450A (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN112860930B (en) Text-to-commodity image retrieval method based on hierarchical similarity learning
CN111159367A (en) Information processing method and related equipment
CN114328807A (en) Text processing method, device, equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN116049450A (en) Multi-mode-supported image-text retrieval method and device based on distance clustering
CN114048295A (en) Cross-modal retrieval method and system for data processing
CN111368176B (en) Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN111680264A (en) Multi-document reading understanding method
CN114926742A (en) Loop detection and optimization method based on second-order attention mechanism
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
Jin et al. Discriminant zero-shot learning with center loss
CN113779283A (en) Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN111339258A (en) University computer basic exercise recommendation method based on knowledge graph
CN116628192A (en) Text theme representation method based on Seq2Seq-Attention
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant