CN115905610B - Combined query image retrieval method of multi-granularity attention network


Info

Publication number: CN115905610B
Authority: CN (China)
Prior art keywords: image, text, features, granularity, attention network
Legal status: Active (granted)
Application number: CN202310213360.9A, filed 2023-03-08 (priority date 2023-03-08) by Chengdu Koala Youran Technology Co ltd
Other versions: CN115905610A (application publication, 2023-04-04); CN115905610B granted and published 2023-05-26
Other languages: Chinese (zh)
Inventors: 徐行, 李申珅, 沈复民, 申恒涛
Current assignee: Chengdu Koala Youran Technology Co ltd


Abstract

The invention discloses a combined query image retrieval method of a multi-granularity attention network, relates to the field of cross-modal retrieval in computer vision, and solves the technical problems of existing models, namely that the image regions to be preserved and to be modified in the learned target image overlap and that the semantic information of multi-granularity images and text is not fully utilized. The invention first uses an image feature extractor to extract image features at different semantic levels and a text feature extractor to extract text features, then fuses the image features of different semantic levels through a cross-layer interaction module, obtains comparatively accurate preserved and modified regions in the target image through self-contrastive learning, and finally completes combined query image retrieval by computing cosine similarities and ranking them from high to low. At the same time, image retrieval is completed with a combined query image retrieval method based on cross-modal attention preservation, so that semantic information at different levels is utilized more fully.

Description

Combined query image retrieval method of multi-granularity attention network
Technical Field
The invention relates to the field of cross-modal retrieval in computer vision, in particular to a combined query image retrieval method of a multi-granularity attention network.
Background
Combined query image retrieval is an extension task in the field of image retrieval. Unlike conventional content-based image retrieval and image-text matching tasks, a query in combined query image retrieval contains both the image and the text modality rather than a single-modality input. Conventional image retrieval requires the user to describe the query with images or text alone, which limits the user's ability to express an accurate search intent. Combined query image retrieval allows the user to modify image content with text information on top of querying with an image, expressing the search intent flexibly and comprehensively so as to refine the retrieval results. The goal of the task is to modify the specific content of the reference image according to the semantic information of the modification text and the reference image, and then to find, among all candidate images, the target images that are similar to the reference image after modification according to the modification text. Because of its practicality, combined query image retrieval is widely applied in fields such as product recommendation and interactive image retrieval.
With the rapid development of hardware, deep neural networks have become the benchmark models for a variety of tasks. Existing combined query image retrieval methods based on deep neural networks mainly follow three technical routes:
1. Combined query image retrieval based on a large-scale pre-training model: methods of this type use additional prior knowledge learned from other image-text corpora to initialize the model parameters and help the model learn the target image. They exploit the additional data together with fine-grained and coarse-grained image features to improve retrieval accuracy.
2. Combined query image retrieval based on feature fusion: methods of this type obtain image and text feature representations through image and text encoders, use a designed attention module or various network structures to screen out the key features in the text and the image, fuse the screened image and text features into a unified image-text feature representation, and finally compute the cosine similarity between the resulting image-text features and the target image features to measure the similarity between the candidate images and the fused feature representation.
3. Combined query image retrieval based on co-training: to reduce model complexity and improve the efficiency of the combined query image retrieval model, methods of this type learn the part to be modified in the target image through an image-text matching strategy and learn the part to be preserved in the reference image through a content-based image retrieval strategy.
Current combined query image retrieval methods are mainly based on feature fusion; with a designed attention mechanism and network structure, such methods can effectively improve the accuracy of query results.
However, in practical application the existing combined query image retrieval methods still have the following problems: the image parts that the model learns to preserve and to modify in the target image overlap, and the semantic information of multi-granularity images and text is not fully utilized. These deficiencies can reduce the quality of image retrieval results.
Disclosure of Invention
The aim of the invention is to solve the above technical problems by providing a combined query image retrieval method of a multi-granularity attention network.
The invention adopts the following technical scheme to realize this aim:
The method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrast learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: feature matching at the similarity level is performed by defining a first loss function L_bbc, and specifically comprises the following steps (an illustrative sketch of this computation is given after the step list):
step S61: calculating the cosine similarity between the image region features to be preserved in the target image and in the reference image obtained in step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in step S61 and step S62 to obtain a per-granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: according to the first loss function L_bbc defined in step S6, training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit by using an AdamW optimizer;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
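As a concrete illustration of steps S61–S64, the following is a minimal PyTorch-style sketch of how the per-granularity similarity scores could be combined into the final similarity score matrix. The tensor names, the pooling of region features into single vectors, and the use of matrix products for batched cosine similarity are illustrative assumptions, not the implementation disclosed by the patent.

```python
import torch
import torch.nn.functional as F

def score_matrix(text_feats, ref_pres_feats, tgt_pres_feats, tgt_mod_feats):
    """Combine per-granularity similarities into one batch score matrix.

    Each argument is a list with one tensor per granularity level, shaped
    (batch, dim). Rows index queries and columns index candidate targets.
    """
    total = 0.0
    for t, rp, tp, tm in zip(text_feats, ref_pres_feats, tgt_pres_feats, tgt_mod_feats):
        t, rp, tp, tm = (F.normalize(x, dim=-1) for x in (t, rp, tp, tm))
        sim_mod = t @ tm.t()    # step S62: modification text vs. modified target regions
        sim_pres = rp @ tp.t()  # step S61: preserved reference vs. preserved target regions
        total = total + sim_mod + sim_pres  # steps S63-S64: sum within and across levels
    return total  # (batch, batch) final similarity score matrix
```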
As an optional technical solution, the step S2 specifically includes:
step S21: removing non-alphabetic characters from the text in step S1 through a text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the text preprocessed in step S21, and then encoding the words in the text into word vectors using a GloVe pre-trained corpus;
step S23: encoding the whole sentence of word vectors from step S22 into text features through a long short-term memory network or a bidirectional gated recurrent network (see the sketch below).
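A minimal sketch of the text-encoding pipeline of steps S21–S23 is given below, assuming a PyTorch embedding layer initialized from GloVe vectors and a bidirectional gated recurrent network; the tokenization rule, the mean pooling and the layer sizes are assumptions made only for illustration.

```python
import re
import torch
import torch.nn as nn

def preprocess(text: str) -> list:
    # Step S21: drop non-alphabetic characters (specials become spaces); step S22: split into words.
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()

class TextEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, hidden: int = 512):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix loaded from a GloVe file (assumed available).
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.rnn = nn.GRU(glove_weights.size(1), hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Step S23: encode the whole sentence into a single text feature vector.
        out, _ = self.rnn(self.embed(token_ids))   # (batch, seq_len, 2 * hidden)
        return out.mean(dim=1)                     # pooling choice is an assumption
```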
As an optional technical solution, the step S4 specifically includes:
step S41: the target image and the reference image in the data set of step S1 are first resized to 256 × 256 pixels, and data enhancement is then performed using random cropping and random horizontal flipping;
step S42: constructing a multi-granularity attention network, and inputting each pair of data-enhanced reference image and corresponding target image from step S41 into the multi-granularity attention network to obtain reference image and target image features of different granularities;
step S43: inputting the text features from step S2 into the multi-granularity attention network to obtain text features of different granularities (see the sketch after this list).
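One possible realization of steps S41–S43 is sketched below, using torchvision transforms for the preprocessing and the intermediate stages of a ResNet-50 as stand-ins for the different granularity levels; the backbone, the crop size and the pooling are assumptions, since the patent does not name them.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Step S41: resize to 256 x 256, then random crop and random horizontal flip
# (the 224-pixel crop size is an assumption).
augment = T.Compose([
    T.Resize((256, 256)),
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

class MultiGranularityBackbone(torch.nn.Module):
    """Step S42: return image features at several semantic levels."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, images: torch.Tensor) -> list:
        feats, x = [], self.stem(images)
        for stage in self.stages:
            x = stage(x)
            feats.append(x.mean(dim=(2, 3)))  # one pooled feature vector per granularity level
        return feats
```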
As an optional technical solution, the step S5 specifically includes:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used to screen the features to be preserved and to be modified;
step S52: multiplying, element by element, the attention weights obtained in step S51 with the reference image features and target image features of different granularities obtained in step S4 to obtain the image region features to be modified and to be preserved in the target image;
As an optional technical solution, the image region features obtained in step S52 may be optimized by a second loss function L_att;
specifically: the features to be modified among the target image features obtained in step S4 are taken as positive samples and the features to be preserved are taken as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limit;
the second loss function L_att is described in detail below.
The two formulas defining L_att are rendered as images in the original publication and are not reproduced here. In these formulas, Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the full-size +/− signs denote ordinary addition and subtraction of the terms on either side, while the subscript + at the end of the second formula denotes taking 0 whenever the value is less than 0; Sim denotes the cosine-similarity operation; t denotes the text semantic information; F_S denotes the original text features; L_mi denotes the learnable text features of the i-th sample; the remaining image-rendered symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level, the image features to be modified in the target image of the i-th sample, and the feature-space size; a_i denotes the per-level weights, and i is the index of the granularity level at which the feature lies.
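Since the formulas themselves are only available as images, the following LaTeX sketch is a plausible reconstruction of L_att from the symbol descriptions above — a weighted sum over granularity levels of a contrastive, triplet-style term in which the modified features act as positives and the preserved features as negatives. It is an assumption, not the patent's verbatim formula:

\[
L_{att}=\sum_{i} a_i\, L_c\!\left(L_{m_i},\, \hat{T}^{mod}_{i},\, \hat{R}^{pres}_{i}\right),
\qquad
L_c(t,\,p,\,n)=\left[\operatorname{Sim}(t,\,n)-\operatorname{Sim}(t,\,p)+m\right]_{+}
\]

Here p (the target-image features to be modified) is the positive sample, n (the features to be preserved) is the negative sample, and m is a margin; the margin, the exact arguments of L_c and the hat notation are assumptions.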
As an optional technical solution, in step S6 the first loss function L_bbc is defined as follows:
The two formulas defining L_bbc are rendered as images in the original publication and are not reproduced here. In these formulas, one symbol denotes, for the j-th training sample, the sum of the similarity score between the text features and the modified image features and the similarity score between the reference image and the target image; a group of symbols denotes, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image; the analogous symbols with index i denote the same quantities for the i-th sample; a further symbol denotes the similarity, at the i-th granularity level, between the modification text and the region features to be modified in the target image; another denotes the similarity, at the i-th granularity level, between the region features to be preserved in the reference image and in the target image, with the corresponding symbols denoting, in order, the image features to be preserved in the reference image and in the target image at that granularity level; finally, one symbol denotes a learnable parameter, j indexes the j-th sample, N denotes the number of samples in one batch of the training data set, exp denotes the exponential function, and log denotes the logarithmic function.
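The formulas for L_bbc are likewise only available as images. Based on the description — a per-pair score that sums two cosine similarities, a learnable scaling parameter, and exp/log over a batch of N samples — a plausible batch-based classification form, given as an assumption rather than the verbatim formula, is:

\[
s^{i}_{jk}=\operatorname{Sim}\!\left(L_{m_j},\, \hat{T}^{mod}_{k,i}\right)+\operatorname{Sim}\!\left(\hat{R}^{pres}_{j,i},\, \hat{T}^{pres}_{k,i}\right),
\qquad
L_{bbc}=-\frac{1}{N}\sum_{i}\sum_{j=1}^{N}\log\frac{\exp\!\left(\kappa\, s^{i}_{jj}\right)}{\sum_{k=1}^{N}\exp\!\left(\kappa\, s^{i}_{jk}\right)}
\]

where κ is the learnable scaling parameter; summing over the granularity levels i and the placement of κ are assumptions.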
As an optional technical solution, step S8 specifically includes: performing image retrieval using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit, and then selecting the image with the highest score in the similarity score matrix obtained in step S6 as the output result.
The beneficial effects of the invention are as follows:
1. the invention can more fully utilize visual and text semantic information with different granularities, so that the network model has robustness for the semantic information with different granularities.
2. The invention designs a combined query image retrieval method of a multi-granularity attention network with mutual exclusion limit for image retrieval; the multi-granularity attention network with mutual exclusion limit adds a mutual exclusion constraint on the attention so as to optimize the information that the model learns to preserve and to modify in the target image, thereby improving the accuracy of image retrieval.
Drawings
FIG. 1 is a flow chart of an implementation of a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion constraint of the present invention in an embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, this embodiment describes a combined query image retrieval method of a multi-granularity attention network. First, text features are extracted with a text encoder; then, reference image and target image features of different granularities are extracted with the multi-granularity attention network; next, an attention module with mutual exclusion limit is constructed, through which image features to be preserved and to be modified are obtained that are as mutually exclusive as possible; finally, feature matching is performed at the similarity level, and the retrieval result is obtained from the resulting similarity matrix.
One key contribution of this embodiment is to fully mine image features of different granularities through the multi-granularity attention network for combined query image retrieval, so that the network obtains more accurate feature representations for semantic information of different granularities. Another core innovation is the mutual exclusion limit added to the attention module: the attention module with mutual exclusion limit ensures that the information to be preserved and the information to be modified in the image features attended by the model are as mutually exclusive as possible, leading to more accurate query results. Compared with existing methods, this embodiment fully mines visual and linguistic information of different granularities, so that the model can capture semantic information of different granularities; more importantly, by adding the mutual exclusion limit at the attention level, it resolves the problem that the information to be preserved and the information to be modified overlap in the target image features learned by the model, so that the model attends accurately to the image features to be preserved and to be modified, thereby improving the quality of the image retrieval results.
Example 2:
The method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrast learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: feature matching at the similarity level is performed by defining a first loss function L_bbc, and specifically comprises the following steps:
step S61: calculating the cosine similarity between the image region features to be preserved in the target image and in the reference image obtained in step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in step S61 and step S62 to obtain a per-granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: according to the first loss function L_bbc defined in step S6, training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit by using an AdamW optimizer;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
Example 3:
A combined query image retrieval method of a multi-granularity attention network: first, text features are extracted with a text encoder; then, reference image and target image features of different granularities are extracted with the multi-granularity attention network; next, an attention module with mutual exclusion limit is constructed, through which image features to be preserved and to be modified are obtained that are as mutually exclusive as possible; finally, feature matching is performed at the similarity level, and the retrieval result is obtained from the resulting similarity matrix.
Step S1: selecting a training data set;
in this example, three common data sets, fashionIQ, shoes and Fashion200K, were selected for the experiment.
The FashionIQ dataset is a natural-language-based interactive fashion retrieval dataset containing 77,684 clothing-related images, which fall into three categories: dresses, coats and shirts. In terms of dataset partitioning, it includes approximately 18,000 triples (modification text, reference image and target image) for training. The validation and test sets each contain approximately 12,000 query pairs, each query consisting of a reference image and modification text, where the modification text consists of two manually annotated descriptions.
The images of the Shoes dataset were initially collected from the Internet, and natural language descriptions were later added for the combined query image retrieval task. It consists of approximately 10,000 images, of which approximately 9,000 are used for training and 4,700 for testing.
The Fashion200K dataset is a large-scale fashion retrieval dataset containing about 200,000 images, of which about 170,000 are used for training and about 30,000 for testing. Since the dataset has no labeled modification text, a pair of images whose textual descriptions differ in only one word is taken as the reference image and the target image, and the modification text is then constructed from the differing words, i.e., in the form "replace sth. with sth.".
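A small sketch of how such modification text could be built from two descriptions that differ in a single word is given below; the function name and the whitespace tokenization are assumptions.

```python
from typing import Optional

def build_modification_text(ref_desc: str, tgt_desc: str) -> Optional[str]:
    """Form 'replace X with Y' from two descriptions differing in exactly one word."""
    ref_words, tgt_words = ref_desc.split(), tgt_desc.split()
    if len(ref_words) != len(tgt_words):
        return None
    diffs = [(r, t) for r, t in zip(ref_words, tgt_words) if r != t]
    if len(diffs) != 1:
        return None  # only pairs that differ in exactly one word are usable
    old, new = diffs[0]
    return f"replace {old} with {new}"
```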
Step S2: constructing the network structure of a text encoder, and obtaining the text features of the modification text in the data set of step S1 by using the text encoder, where the text encoder is a long short-term memory network (followed by a fully connected layer) or a bidirectional gated recurrent network;
the specific content of the steps is as follows:
step S21: removing non-alphabetic characters from the text in the data set of step S1 through a text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the text preprocessed in step S21, and then encoding the words in the text into word vectors using a GloVe pre-trained corpus;
step S23: encoding the whole sentence of word vectors from step S22 into text features through a long short-term memory network or a bidirectional gated recurrent network;
step S3: the method comprises the steps of constructing a multi-granularity attention network with mutual exclusion limit, wherein a text encoder is arranged in the multi-granularity attention network for extracting semantic information with different granularities, and three attention modules with mutual exclusion and identical structures are used for optimizing target image characteristics with different granularities.
Step S4: constructing a multi-granularity attention network, and extracting images and text features with different granularities by using the multi-granularity attention network for the reference image in the data set in the step S1 and the text features acquired in the step S2;
the specific content of the steps is as follows:
step S41: the target image and the reference image in the data set of step S1 are first resized to 256 × 256 pixels, and data enhancement is then performed by random cropping and random horizontal flipping.
Step S42: constructing a multi-granularity attention network, and inputting each pair of reference images and corresponding target images subjected to data enhancement in the step S41 into the multi-granularity attention network to obtain reference images and target image features with different granularities;
step S43: the text features obtained in step S23 are input into the multi-granularity attention network, and text features of different granularities are obtained through three fully connected layers.
Step S5: constructing an attention module with mutual exclusion limit, and generating the image region features to be preserved and to be modified in the reference image and the target image by passing the text and image features of different granularities extracted by the multi-granularity attention network in step S4 through the attention module with mutual exclusion limit;
the specific content of the steps is as follows:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used to screen the features to be preserved and to be modified;
step S52: multiplying, element by element, the weights obtained in step S51 with the reference image and target image features of different granularities obtained in step S4 to obtain the features to be modified and to be preserved in the target image, as sketched below;
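A minimal sketch, under assumptions, of the attention module of steps S51–S52 is given below: a multi-layer perceptron maps the text features to two sets of attention weights — one for the features to be preserved and one for the features to be modified — which gate the image features element by element. The two-headed design and the sigmoid activation are illustrative choices, not the implementation stated in the patent.

```python
import torch
import torch.nn as nn

class ExclusiveAttention(nn.Module):
    """Text-conditioned gating of image features (steps S51-S52)."""

    def __init__(self, text_dim: int, img_dim: int):
        super().__init__()
        # Step S51: a multi-layer perceptron produces the screening attention weights.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, img_dim),
            nn.ReLU(),
            nn.Linear(img_dim, 2 * img_dim),
        )

    def forward(self, text_feat: torch.Tensor, img_feat: torch.Tensor):
        w_pres, w_mod = self.mlp(text_feat).sigmoid().chunk(2, dim=-1)
        preserved = w_pres * img_feat   # step S52: element-wise gating -> features to preserve
        modified = w_mod * img_feat     # step S52: element-wise gating -> features to modify
        return preserved, modified
```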
step S53: the image region features obtained in step S52 can be optimized by a second loss function L_att;
specifically: the features to be modified among the target image features obtained in step S4 are taken as positive samples and the features to be preserved are taken as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limit;
the second loss function L_att is described in detail below.
The two formulas defining L_att are rendered as images in the original publication and are not reproduced here. In these formulas, Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the full-size +/− signs denote ordinary addition and subtraction of the terms on either side, while the subscript + at the end of the second formula denotes taking 0 whenever the value is less than 0; Sim denotes the cosine-similarity operation; t denotes the text semantic information; F_S denotes the original text features; L_mi denotes the learnable text features of the i-th sample; the remaining image-rendered symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level, the image features to be modified in the target image of the i-th sample, and the feature-space size; a_i denotes the per-level weights, and i is the index of the granularity level at which the feature lies.
Step S6: constructing a multi-granularity attention network with mutual exclusion limit, reserving and modifying image features of a specific area by using text features in the step S3 and image features in the step S4, and then calculating a similarity score;
the specific content of the steps is as follows:
step S61: calculating cosine similarity between the image features to be reserved in the target image and the reference image obtained in the step S5;
step S62: calculating cosine similarity between the image features and text features of the target image to be modified, which are obtained in the step S5;
step S63: adding the similarity scores obtained in the step S61 and the step S62 to obtain a granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: training a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limitation by using an AdamW optimizer according to the loss function defined in the step S6;
the initial learning rate of the AdamW optimizer is set to 0.0005, and the learning rate is attenuated by half every 10 rounds of training by using the weight attenuation strategy, and after 20 rounds of training, the learning rate is attenuated by half every 5 rounds of training, and the whole model is trained for 50, 100 and 150 cycles on the training sets of the three data sets respectively.
Further, in step S6, the first loss function L_bbc is defined as follows:
The two formulas defining L_bbc are rendered as images in the original publication and are not reproduced here. In these formulas, one symbol denotes, for the j-th training sample, the sum of the similarity score between the text features and the modified image features and the similarity score between the reference image and the target image; a group of symbols denotes, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image; the analogous symbols with index i denote the same quantities for the i-th sample; a further symbol denotes the similarity, at the i-th granularity level, between the modification text and the region features to be modified in the target image; another denotes the similarity, at the i-th granularity level, between the region features to be preserved in the reference image and in the target image, with the corresponding symbols denoting, in order, the image features to be preserved in the reference image and in the target image at that granularity level; finally, one symbol denotes a learnable parameter, j indexes the j-th sample, N denotes the number of samples in one batch of the training data set, exp denotes the exponential function, and log denotes the logarithmic function.
Step S8: image retrieval is performed using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit, in order to verify the effectiveness of the trained model.
When image retrieval is carried out in step S8, the image with the highest score in the similarity score matrix obtained in step S6 is selected as the output result.
Example 4:
This example uses the Recall@10, Recall@50 and MeanR metrics on the FashionIQ dataset to evaluate the designed network, and uses the Recall@1 (R@1), Recall@10 (R@10), Recall@50 (R@50) and MeanR metrics on the Shoes and Fashion200K datasets to evaluate the model. The Recall@K metric is defined as the percentage of all test samples for which the correct target image appears in the first K returned results.
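Recall@K as defined here can be computed directly from the similarity score matrix of step S6; a small sketch follows, in which the variable names and the top-K bookkeeping are assumptions.

```python
import torch

def recall_at_k(scores: torch.Tensor, target_idx: torch.Tensor, k: int) -> float:
    """Percentage of test queries whose correct target appears in the top-K results.

    scores: (num_queries, num_candidates) similarity matrix from step S6.
    target_idx: (num_queries,) index of the ground-truth target image for each query.
    """
    topk = scores.topk(k, dim=1).indices                  # (num_queries, k)
    hits = (topk == target_idx.unsqueeze(1)).any(dim=1)   # is the target among the top K?
    return 100.0 * hits.float().mean().item()
```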
The model performance comparisons of our method MANME and other methods on the FashionIQ dataset are shown in table 1:
TABLE 1 (rendered as an image in the original publication; the values are not reproduced here)
The model performance versus results on the Shoes dataset are shown in table 2:
TABLE 2 (rendered as an image in the original publication; the values are not reproduced here)
The model performance versus results on the Fashion200K dataset are shown in table 3:
TABLE 3 (rendered as an image in the original publication; the values are not reproduced here)
As can be seen from the tables, the invention is significantly better than all existing methods on every accuracy metric used for evaluation on the FashionIQ and Shoes datasets. On the large-scale Fashion200K dataset, the invention is significantly better than current methods on every accuracy metric used for evaluation except Recall@1. This shows that the combined query image retrieval method based on the multi-granularity attention network with mutual exclusion limit fully mines the semantic information of different granularities in the text and image features, and that the attention module with mutual exclusion limit ensures that the parts to be preserved and to be modified in the target image obtained by the model do not overlap, so that more accurate target image features are learned and the image retrieval becomes more accurate.
Explanation of the models in Tables 1, 2 and 3:
JVSM: a joint visual-semantic matching embedding method for language-guided retrieval;
TRIG: the first method proposed for the combined query task;
VAL: an image retrieval method with text feedback via visual-linguistic attention learning;
ARTEMIS: attention-based explicit matching and implicit similarity for retrieval with text;
ComposeAE: compositional learning of image-text queries for image retrieval;
CoSMo: content-style modulation for image retrieval with text feedback;
DCNet: bidirectional learning for interactive image retrieval;
SAC: semantic attention composition for text-conditioned image retrieval;
TCIR: text-conditioned image retrieval using style and content features;
CIRPLANT: image retrieval on real-life images using a pre-trained vision-and-language model;
FashionVLP: a vision-language transformer for fashion retrieval with feedback;
CLVC-Net: a comprehensive linguistic-visual composition network for image retrieval;
GSCMR: geometry-sensitive cross-modal reasoning for image retrieval with combined queries.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (7)

1. A combined query image retrieval method of a multi-granularity attention network, characterized in that the method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrast learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: performing feature matching at the similarity level by defining a first loss function L_bbc, the feature matching specifically comprising the following steps:
step S61: calculating cosine similarity between the image region features to be reserved in the target image and the reference image obtained in the step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in the step S61 and the step S62 to obtain a granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: according to the first loss function L_bbc defined in step S6, training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit by using an AdamW optimizer;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
2. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S2 specifically comprises:
step S21: removing non-alphabetic characters from the text in the step S1 through text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the text preprocessed in step S21, and then encoding the words in the text into word vectors using a GloVe pre-trained corpus;
step S23: encoding the whole sentence of word vectors from step S22 into text features through a long short-term memory network or a bidirectional gated recurrent network.
3. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S4 specifically comprises:
step S41: first resizing the target image and the reference image in the data set of step S1 to 256 × 256 pixels, and then performing data enhancement by using random cropping and random horizontal flipping;
step S42: constructing a multi-granularity attention network, and inputting each pair of reference images and corresponding target images subjected to data enhancement in the step S41 into the multi-granularity attention network to obtain reference images and target image features with different granularities;
step S43: and (3) inputting the text features in the step S2 into a multi-granularity attention network to obtain text features with different granularities.
4. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S5 specifically comprises:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used to screen the features to be preserved and to be modified;
step S52: multiplying, element by element, the attention weights obtained in step S51 with the reference image features and target image features of different granularities obtained in step S4 to obtain the image region features to be modified and to be preserved in the target image.
5. The method for searching the combined query image of the multi-granularity attention network as claimed in claim 4, wherein the image region features obtained in step S52 are optimized by a second loss function L_att;
specifically: the features to be modified among the target image features obtained in step S4 are taken as positive samples and the features to be preserved are taken as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limit;
the second loss function L_att is given by two formulas that are rendered as images in the original publication and are not reproduced here; in these formulas, Σ is the summation symbol, L_c(·) denotes a third loss function constructed using contrastive learning, the full-size +/− signs denote ordinary addition and subtraction of the terms on either side, the subscript + at the end of the second formula denotes taking 0 whenever the value is less than 0, Sim denotes the cosine-similarity operation, t denotes the text semantic information, F_S denotes the original text features, L_mi denotes the learnable text features of the i-th sample, the remaining image-rendered symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level, the image features to be modified in the target image of the i-th sample, and the feature-space size, a_i denotes the per-level weights, and i is the index of the granularity level at which the feature lies.
6. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein in step S6 the first loss function L_bbc is given by two formulas that are rendered as images in the original publication and are not reproduced here; in these formulas, one symbol denotes, for the j-th training sample, the sum of the similarity score between the text features and the modified image features and the similarity score between the reference image and the target image; a group of symbols denotes, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image; the analogous symbols with index i denote the same quantities for the i-th sample; a further symbol denotes the similarity, at the i-th granularity level, between the modification text and the region features to be modified in the target image; another denotes the similarity, at the i-th granularity level, between the region features to be preserved in the reference image and in the target image, with the corresponding symbols denoting, in order, the image features to be preserved in the reference image and in the target image at that granularity level; finally, one symbol denotes a learnable parameter, j indexes the j-th sample, N denotes the number of samples in one batch of the training data set, exp denotes the exponential function, and log denotes the logarithmic function.
7. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S8 specifically comprises: and (3) performing image retrieval by using a trained combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, and then selecting an image with the highest score in the similarity score matrix obtained in the step S6 as an output result.


