CN115905610B - Combined query image retrieval method of multi-granularity attention network


Info

Publication number: CN115905610B
Authority: CN (China)
Prior art keywords: image, text, features, granularity, attention network
Legal status: Active (granted)
Application number: CN202310213360.9A, filed 2023-03-08 (priority date 2023-03-08) by Chengdu Koala Youran Technology Co ltd
Other versions: CN115905610A (application publication, 2023-04-04); CN115905610B granted and published 2023-05-26
Other languages: Chinese (zh)
Inventors: 徐行, 李申珅, 沈复民, 申恒涛
Current assignee: Chengdu Koala Youran Technology Co ltd


Abstract

The invention discloses a combined query image retrieval method of a multi-granularity attention network, relates to the field of cross-modal retrieval in computer vision, and solves the technical problems of existing models, namely that the image regions to be preserved and to be modified in the learned target image overlap and that the semantic information of multi-granularity images and text is not fully utilized. The invention first uses an image feature extractor to extract image features at different semantic levels and a text feature extractor to extract text features, then fuses the image features of different semantic levels through a cross-layer interaction module, obtains comparatively accurate preserved and modified regions in the target image through self-contrastive learning, and finally completes combined query image retrieval by computing cosine similarities and ranking them from high to low. At the same time, image retrieval is completed with a combined query image retrieval method based on cross-modal attention preservation, so that semantic information at different levels is utilized more fully.

Description

Combined query image retrieval method of multi-granularity attention network
Technical Field
The invention relates to the field of cross-modal retrieval in computer vision, in particular to a combined query image retrieval method of a multi-granularity attention network.
Background
Combined query image retrieval is an extension task in the field of image retrieval. Unlike conventional content-based image retrieval and image-text matching tasks, a query in combined query image retrieval contains both the image and the text modality rather than a single-modality input. Conventional image retrieval requires the user to describe the query with images or text alone, which limits the user's ability to express an accurate search intent. Combined query image retrieval allows the user to modify image content with text information on top of querying with an image, expressing the search intent flexibly and comprehensively so as to refine the retrieval results. The goal of the task is to modify the specific content of the reference image according to the semantic information of the modification text and the reference image, and then to find, among all candidate images, the target images that are similar to the reference image after modification according to the modification text. Because of its practicality, combined query image retrieval is widely applied in fields such as product recommendation and interactive image retrieval.
With the rapid development of hardware, deep neural networks have become the benchmark models for a variety of tasks. Existing combined query image retrieval methods based on deep neural networks mainly follow three technical routes:
1. Combined query image retrieval based on a large-scale pre-training model: methods of this type use additional prior knowledge learned from other image-text corpora to initialize the model parameters and help the model learn the target image. They exploit the additional data together with fine-grained and coarse-grained image features to improve retrieval accuracy.
2. Combined query image retrieval based on feature fusion: methods of this type obtain image and text feature representations through image and text encoders, use a designed attention module or various network structures to screen out the key features in the text and the image, fuse the screened image and text features into a unified image-text feature representation, and finally compute the cosine similarity between the resulting image-text features and the target image features to measure the similarity between the candidate images and the fused feature representation.
3. Combined query image retrieval based on co-training: to reduce model complexity and improve the efficiency of the combined query image retrieval model, methods of this type learn the part to be modified in the target image through an image-text matching strategy and learn the part to be preserved in the reference image through a content-based image retrieval strategy.
Current combined query image retrieval methods are mainly based on feature fusion; with a designed attention mechanism and network structure, such methods can effectively improve the accuracy of query results.
However, in practical application the existing combined query image retrieval methods still have the following problems: the image parts that the model learns to preserve and to modify in the target image overlap, and the semantic information of multi-granularity images and text is not fully utilized. These deficiencies can reduce the quality of image retrieval results.
Disclosure of Invention
The aim of the invention is to solve the above technical problems by providing a combined query image retrieval method of a multi-granularity attention network.
The invention adopts the following technical scheme to realize this aim:
The method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrast learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: feature matching at the similarity level is performed by defining a first loss function L_bbc, and specifically comprises the following steps (an illustrative sketch of this computation is given after the step list):
step S61: calculating the cosine similarity between the image region features to be preserved in the target image and in the reference image obtained in step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in step S61 and step S62 to obtain a per-granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: according to the first loss function L_bbc defined in step S6, training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit by using an AdamW optimizer;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
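As a concrete illustration of steps S61–S64, the following is a minimal PyTorch-style sketch of how the per-granularity similarity scores could be combined into the final similarity score matrix. The tensor names, the pooling of region features into single vectors, and the use of matrix products for batched cosine similarity are illustrative assumptions, not the implementation disclosed by the patent.

```python
import torch
import torch.nn.functional as F

def score_matrix(text_feats, ref_pres_feats, tgt_pres_feats, tgt_mod_feats):
    """Combine per-granularity similarities into one batch score matrix.

    Each argument is a list with one tensor per granularity level, shaped
    (batch, dim). Rows index queries and columns index candidate targets.
    """
    total = 0.0
    for t, rp, tp, tm in zip(text_feats, ref_pres_feats, tgt_pres_feats, tgt_mod_feats):
        t, rp, tp, tm = (F.normalize(x, dim=-1) for x in (t, rp, tp, tm))
        sim_mod = t @ tm.t()    # step S62: modification text vs. modified target regions
        sim_pres = rp @ tp.t()  # step S61: preserved reference vs. preserved target regions
        total = total + sim_mod + sim_pres  # steps S63-S64: sum within and across levels
    return total  # (batch, batch) final similarity score matrix
```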
As an optional technical solution, the step S2 specifically includes:
step S21: removing non-alphabetic characters from the text in step S1 through a text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the text preprocessed in step S21, and then encoding the words in the text into word vectors using a GloVe pre-trained corpus;
step S23: encoding the whole sentence of word vectors from step S22 into text features through a long short-term memory network or a bidirectional gated recurrent network (see the sketch below).
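A minimal sketch of the text-encoding pipeline of steps S21–S23 is given below, assuming a PyTorch embedding layer initialized from GloVe vectors and a bidirectional gated recurrent network; the tokenization rule, the mean pooling and the layer sizes are assumptions made only for illustration.

```python
import re
import torch
import torch.nn as nn

def preprocess(text: str) -> list:
    # Step S21: drop non-alphabetic characters (specials become spaces); step S22: split into words.
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()

class TextEncoder(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, hidden: int = 512):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix loaded from a GloVe file (assumed available).
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.rnn = nn.GRU(glove_weights.size(1), hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Step S23: encode the whole sentence into a single text feature vector.
        out, _ = self.rnn(self.embed(token_ids))   # (batch, seq_len, 2 * hidden)
        return out.mean(dim=1)                     # pooling choice is an assumption
```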
As an optional technical solution, the step S4 specifically includes:
step S41: the target image and the reference image in the data set of step S1 are first resized to 256 × 256 pixels, and data enhancement is then performed using random cropping and random horizontal flipping;
step S42: constructing a multi-granularity attention network, and inputting each pair of data-enhanced reference image and corresponding target image from step S41 into the multi-granularity attention network to obtain reference image and target image features of different granularities;
step S43: inputting the text features from step S2 into the multi-granularity attention network to obtain text features of different granularities (see the sketch after this list).
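One possible realization of steps S41–S43 is sketched below, using torchvision transforms for the preprocessing and the intermediate stages of a ResNet-50 as stand-ins for the different granularity levels; the backbone, the crop size and the pooling are assumptions, since the patent does not name them.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet50

# Step S41: resize to 256 x 256, then random crop and random horizontal flip
# (the 224-pixel crop size is an assumption).
augment = T.Compose([
    T.Resize((256, 256)),
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

class MultiGranularityBackbone(torch.nn.Module):
    """Step S42: return image features at several semantic levels."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, images: torch.Tensor) -> list:
        feats, x = [], self.stem(images)
        for stage in self.stages:
            x = stage(x)
            feats.append(x.mean(dim=(2, 3)))  # one pooled feature vector per granularity level
        return feats
```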
As an optional technical solution, the step S5 specifically includes:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used to screen the features to be preserved and to be modified;
step S52: multiplying, element by element, the attention weights obtained in step S51 with the reference image features and target image features of different granularities obtained in step S4 to obtain the image region features to be modified and to be preserved in the target image;
As an optional technical solution, the image region features obtained in step S52 may be optimized by a second loss function L_att;
specifically: the features to be modified among the target image features obtained in step S4 are taken as positive samples and the features to be preserved are taken as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limit;
the second loss function L_att is described in detail below.
The two formulas defining L_att are rendered as images in the original publication and are not reproduced here. In these formulas, Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the full-size +/− signs denote ordinary addition and subtraction of the terms on either side, while the subscript + at the end of the second formula denotes taking 0 whenever the value is less than 0; Sim denotes the cosine-similarity operation; t denotes the text semantic information; F_S denotes the original text features; L_mi denotes the learnable text features of the i-th sample; the remaining image-rendered symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level, the image features to be modified in the target image of the i-th sample, and the feature-space size; a_i denotes the per-level weights, and i is the index of the granularity level at which the feature lies.
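Since the formulas themselves are only available as images, the following LaTeX sketch is a plausible reconstruction of L_att from the symbol descriptions above — a weighted sum over granularity levels of a contrastive, triplet-style term in which the modified features act as positives and the preserved features as negatives. It is an assumption, not the patent's verbatim formula:

\[
L_{att}=\sum_{i} a_i\, L_c\!\left(L_{m_i},\, \hat{T}^{mod}_{i},\, \hat{R}^{pres}_{i}\right),
\qquad
L_c(t,\,p,\,n)=\left[\operatorname{Sim}(t,\,n)-\operatorname{Sim}(t,\,p)+m\right]_{+}
\]

Here p (the target-image features to be modified) is the positive sample, n (the features to be preserved) is the negative sample, and m is a margin; the margin, the exact arguments of L_c and the hat notation are assumptions.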
As an optional technical solution, in step S6 the first loss function L_bbc is defined as follows:
The two formulas defining L_bbc are rendered as images in the original publication and are not reproduced here. In these formulas, one symbol denotes, for the j-th training sample, the sum of the similarity score between the text features and the modified image features and the similarity score between the reference image and the target image; a group of symbols denotes, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image; the analogous symbols with index i denote the same quantities for the i-th sample; a further symbol denotes the similarity, at the i-th granularity level, between the modification text and the region features to be modified in the target image; another denotes the similarity, at the i-th granularity level, between the region features to be preserved in the reference image and in the target image, with the corresponding symbols denoting, in order, the image features to be preserved in the reference image and in the target image at that granularity level; finally, one symbol denotes a learnable parameter, j indexes the j-th sample, N denotes the number of samples in one batch of the training data set, exp denotes the exponential function, and log denotes the logarithmic function.
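The formulas for L_bbc are likewise only available as images. Based on the description — a per-pair score that sums two cosine similarities, a learnable scaling parameter, and exp/log over a batch of N samples — a plausible batch-based classification form, given as an assumption rather than the verbatim formula, is:

\[
s^{i}_{jk}=\operatorname{Sim}\!\left(L_{m_j},\, \hat{T}^{mod}_{k,i}\right)+\operatorname{Sim}\!\left(\hat{R}^{pres}_{j,i},\, \hat{T}^{pres}_{k,i}\right),
\qquad
L_{bbc}=-\frac{1}{N}\sum_{i}\sum_{j=1}^{N}\log\frac{\exp\!\left(\kappa\, s^{i}_{jj}\right)}{\sum_{k=1}^{N}\exp\!\left(\kappa\, s^{i}_{jk}\right)}
\]

where κ is the learnable scaling parameter; summing over the granularity levels i and the placement of κ are assumptions.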
As an optional technical solution, step S8 specifically includes: performing image retrieval using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit, and then selecting the image with the highest score in the similarity score matrix obtained in step S6 as the output result.
The beneficial effects of the invention are as follows:
1. the invention can more fully utilize visual and text semantic information with different granularities, so that the network model has robustness for the semantic information with different granularities.
2. The invention designs a combined query image retrieval method of a multi-granularity attention network with mutual exclusion limit for image retrieval; the multi-granularity attention network with mutual exclusion limit adds a mutual exclusion constraint on the attention so as to optimize the information that the model learns to preserve and to modify in the target image, thereby improving the accuracy of image retrieval.
Drawings
FIG. 1 is a flow chart of an implementation of a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion constraint of the present invention in an embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, this embodiment describes a combined query image retrieval method of a multi-granularity attention network. First, text features are extracted with a text encoder; then, reference image and target image features of different granularities are extracted with the multi-granularity attention network; next, an attention module with mutual exclusion limit is constructed, through which image features to be preserved and to be modified are obtained that are as mutually exclusive as possible; finally, feature matching is performed at the similarity level, and the retrieval result is obtained from the resulting similarity matrix.
One key contribution of this embodiment is to fully mine image features of different granularities through the multi-granularity attention network for combined query image retrieval, so that the network obtains more accurate feature representations for semantic information of different granularities. Another core innovation is the mutual exclusion limit added to the attention module: the attention module with mutual exclusion limit ensures that the information to be preserved and the information to be modified in the image features attended by the model are as mutually exclusive as possible, leading to more accurate query results. Compared with existing methods, this embodiment fully mines visual and linguistic information of different granularities, so that the model can capture semantic information of different granularities; more importantly, by adding the mutual exclusion limit at the attention level, it resolves the problem that the information to be preserved and the information to be modified overlap in the target image features learned by the model, so that the model attends accurately to the image features to be preserved and to be modified, thereby improving the quality of the image retrieval results.
Example 2:
The method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrast learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: feature matching at the similarity level is performed by defining a first loss function L_bbc, and specifically comprises the following steps:
step S61: calculating the cosine similarity between the image region features to be preserved in the target image and in the reference image obtained in step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in step S61 and step S62 to obtain a per-granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: according to the first loss function L_bbc defined in step S6, training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit by using an AdamW optimizer;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
Example 3:
A combined query image retrieval method of a multi-granularity attention network: first, text features are extracted with a text encoder; then, reference image and target image features of different granularities are extracted with the multi-granularity attention network; next, an attention module with mutual exclusion limit is constructed, through which image features to be preserved and to be modified are obtained that are as mutually exclusive as possible; finally, feature matching is performed at the similarity level, and the retrieval result is obtained from the resulting similarity matrix.
Step S1: selecting a training data set;
in this example, three common data sets, fashionIQ, shoes and Fashion200K, were selected for the experiment.
The FashionIQ dataset is a natural-language-based interactive fashion retrieval dataset containing 77,684 clothing-related images, which fall into three categories: dresses, coats and shirts. In terms of dataset partitioning, it includes approximately 18,000 triples (modification text, reference image and target image) for training. The validation and test sets each contain approximately 12,000 query pairs, each query consisting of a reference image and modification text, where the modification text consists of two manually annotated descriptions.
The images of the Shoes dataset were initially collected from the Internet, and natural language descriptions were later added for the combined query image retrieval task. It consists of approximately 10,000 images, of which approximately 9,000 are used for training and 4,700 for testing.
The Fashion200K dataset is a large-scale fashion retrieval dataset containing about 200,000 images, of which about 170,000 are used for training and about 30,000 for testing. Since the dataset has no labeled modification text, a pair of images whose textual descriptions differ in only one word is taken as the reference image and the target image, and the modification text is then constructed from the differing words, i.e., in the form "replace sth. with sth.".
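A small sketch of how such modification text could be built from two descriptions that differ in a single word is given below; the function name and the whitespace tokenization are assumptions.

```python
from typing import Optional

def build_modification_text(ref_desc: str, tgt_desc: str) -> Optional[str]:
    """Form 'replace X with Y' from two descriptions differing in exactly one word."""
    ref_words, tgt_words = ref_desc.split(), tgt_desc.split()
    if len(ref_words) != len(tgt_words):
        return None
    diffs = [(r, t) for r, t in zip(ref_words, tgt_words) if r != t]
    if len(diffs) != 1:
        return None  # only pairs that differ in exactly one word are usable
    old, new = diffs[0]
    return f"replace {old} with {new}"
```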
Step S2: constructing the network structure of a text encoder, and obtaining the text features of the modification text in the data set of step S1 by using the text encoder, where the text encoder is a long short-term memory network (followed by a fully connected layer) or a bidirectional gated recurrent network;
the specific content of the steps is as follows:
step S21: removing non-alphabetic characters from the text in the data set of step S1 through a text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the text preprocessed in step S21, and then encoding the words in the text into word vectors using a GloVe pre-trained corpus;
step S23: encoding the whole sentence of word vectors from step S22 into text features through a long short-term memory network or a bidirectional gated recurrent network;
step S3: the method comprises the steps of constructing a multi-granularity attention network with mutual exclusion limit, wherein a text encoder is arranged in the multi-granularity attention network for extracting semantic information with different granularities, and three attention modules with mutual exclusion and identical structures are used for optimizing target image characteristics with different granularities.
Step S4: constructing a multi-granularity attention network, and extracting images and text features with different granularities by using the multi-granularity attention network for the reference image in the data set in the step S1 and the text features acquired in the step S2;
the specific content of the steps is as follows:
step S41: the target image and the reference image in the data set of step S1 are first resized to 256 × 256 pixels, and data enhancement is then performed by random cropping and random horizontal flipping.
Step S42: constructing a multi-granularity attention network, and inputting each pair of reference images and corresponding target images subjected to data enhancement in the step S41 into the multi-granularity attention network to obtain reference images and target image features with different granularities;
step S43: the text features obtained in step S23 are input into the multi-granularity attention network, and text features of different granularities are obtained through three fully connected layers.
Step S5: constructing an attention module with mutual exclusion limit, and generating the image region features to be preserved and to be modified in the reference image and the target image by passing the text and image features of different granularities extracted by the multi-granularity attention network in step S4 through the attention module with mutual exclusion limit;
the specific content of the steps is as follows:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used to screen the features to be preserved and to be modified;
step S52: multiplying, element by element, the weights obtained in step S51 with the reference image and target image features of different granularities obtained in step S4 to obtain the features to be modified and to be preserved in the target image, as sketched below;
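A minimal sketch, under assumptions, of the attention module of steps S51–S52 is given below: a multi-layer perceptron maps the text features to two sets of attention weights — one for the features to be preserved and one for the features to be modified — which gate the image features element by element. The two-headed design and the sigmoid activation are illustrative choices, not the implementation stated in the patent.

```python
import torch
import torch.nn as nn

class ExclusiveAttention(nn.Module):
    """Text-conditioned gating of image features (steps S51-S52)."""

    def __init__(self, text_dim: int, img_dim: int):
        super().__init__()
        # Step S51: a multi-layer perceptron produces the screening attention weights.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, img_dim),
            nn.ReLU(),
            nn.Linear(img_dim, 2 * img_dim),
        )

    def forward(self, text_feat: torch.Tensor, img_feat: torch.Tensor):
        w_pres, w_mod = self.mlp(text_feat).sigmoid().chunk(2, dim=-1)
        preserved = w_pres * img_feat   # step S52: element-wise gating -> features to preserve
        modified = w_mod * img_feat     # step S52: element-wise gating -> features to modify
        return preserved, modified
```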
step S53: the image region features obtained in step S52 can be optimized by a second loss function L_att;
specifically: the features to be modified among the target image features obtained in step S4 are taken as positive samples and the features to be preserved are taken as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limit;
the second loss function L_att is described in detail below.
The two formulas defining L_att are rendered as images in the original publication and are not reproduced here. In these formulas, Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the full-size +/− signs denote ordinary addition and subtraction of the terms on either side, while the subscript + at the end of the second formula denotes taking 0 whenever the value is less than 0; Sim denotes the cosine-similarity operation; t denotes the text semantic information; F_S denotes the original text features; L_mi denotes the learnable text features of the i-th sample; the remaining image-rendered symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level, the image features to be modified in the target image of the i-th sample, and the feature-space size; a_i denotes the per-level weights, and i is the index of the granularity level at which the feature lies.
Step S6: constructing a multi-granularity attention network with mutual exclusion limit, reserving and modifying image features of a specific area by using text features in the step S3 and image features in the step S4, and then calculating a similarity score;
the specific content of the steps is as follows:
step S61: calculating cosine similarity between the image features to be reserved in the target image and the reference image obtained in the step S5;
step S62: calculating cosine similarity between the image features and text features of the target image to be modified, which are obtained in the step S5;
step S63: adding the similarity scores obtained in the step S61 and the step S62 to obtain a granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: training a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limitation by using an AdamW optimizer according to the loss function defined in the step S6;
the initial learning rate of the AdamW optimizer is set to 0.0005, and the learning rate is attenuated by half every 10 rounds of training by using the weight attenuation strategy, and after 20 rounds of training, the learning rate is attenuated by half every 5 rounds of training, and the whole model is trained for 50, 100 and 150 cycles on the training sets of the three data sets respectively.
Further, in step S6, the first loss function L_bbc is defined as follows:
The two formulas defining L_bbc are rendered as images in the original publication and are not reproduced here. In these formulas, one symbol denotes, for the j-th training sample, the sum of the similarity score between the text features and the modified image features and the similarity score between the reference image and the target image; a group of symbols denotes, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image; the analogous symbols with index i denote the same quantities for the i-th sample; a further symbol denotes the similarity, at the i-th granularity level, between the modification text and the region features to be modified in the target image; another denotes the similarity, at the i-th granularity level, between the region features to be preserved in the reference image and in the target image, with the corresponding symbols denoting, in order, the image features to be preserved in the reference image and in the target image at that granularity level; finally, one symbol denotes a learnable parameter, j indexes the j-th sample, N denotes the number of samples in one batch of the training data set, exp denotes the exponential function, and log denotes the logarithmic function.
Step S8: image retrieval is performed using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit, in order to verify the effectiveness of the trained model.
When image retrieval is carried out in step S8, the image with the highest score in the similarity score matrix obtained in step S6 is selected as the output result.
Example 4:
This example uses the Recall@10, Recall@50 and MeanR metrics on the FashionIQ dataset to evaluate the designed network, and uses the Recall@1 (R@1), Recall@10 (R@10), Recall@50 (R@50) and MeanR metrics on the Shoes and Fashion200K datasets to evaluate the model. The Recall@K metric is defined as the percentage of all test samples for which the correct target image appears in the first K returned results.
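Recall@K as defined here can be computed directly from the similarity score matrix of step S6; a small sketch follows, in which the variable names and the top-K bookkeeping are assumptions.

```python
import torch

def recall_at_k(scores: torch.Tensor, target_idx: torch.Tensor, k: int) -> float:
    """Percentage of test queries whose correct target appears in the top-K results.

    scores: (num_queries, num_candidates) similarity matrix from step S6.
    target_idx: (num_queries,) index of the ground-truth target image for each query.
    """
    topk = scores.topk(k, dim=1).indices                  # (num_queries, k)
    hits = (topk == target_idx.unsqueeze(1)).any(dim=1)   # is the target among the top K?
    return 100.0 * hits.float().mean().item()
```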
The model performance comparisons of our method MANME and other methods on the FashionIQ dataset are shown in table 1:
TABLE 1 (rendered as an image in the original publication; the values are not reproduced here)
The model performance versus results on the Shoes dataset are shown in table 2:
TABLE 2 (rendered as an image in the original publication; the values are not reproduced here)
The model performance versus results on the Fashion200K dataset are shown in table 3:
TABLE 3 (rendered as an image in the original publication; the values are not reproduced here)
As can be seen from the tables, the invention is significantly better than all existing methods on every accuracy metric used for evaluation on the FashionIQ and Shoes datasets. On the large-scale Fashion200K dataset, the invention is significantly better than current methods on every accuracy metric used for evaluation except Recall@1. This shows that the combined query image retrieval method based on the multi-granularity attention network with mutual exclusion limit fully mines the semantic information of different granularities in the text and image features, and that the attention module with mutual exclusion limit ensures that the parts to be preserved and to be modified in the target image obtained by the model do not overlap, so that more accurate target image features are learned and the image retrieval becomes more accurate.
Explanation of the models in Tables 1, 2 and 3:
JVSM: a joint visual-semantic matching embedding method for language-guided retrieval;
TRIG: the first method proposed for the combined query task;
VAL: an image retrieval method with text feedback via visual-linguistic attention learning;
ARTEMIS: attention-based explicit matching and implicit similarity for retrieval with text;
ComposeAE: compositional learning of image-text queries for image retrieval;
CoSMo: content-style modulation for image retrieval with text feedback;
DCNet: bidirectional learning for interactive image retrieval;
SAC: semantic attention composition for text-conditioned image retrieval;
TCIR: text-conditioned image retrieval using style and content features;
CIRPLANT: image retrieval on real-life images using a pre-trained vision-and-language model;
FashionVLP: a vision-language transformer for fashion retrieval with feedback;
CLVC-Net: a comprehensive linguistic-visual composition network for image retrieval;
GSCMR: geometry-sensitive cross-modal reasoning for image retrieval with combined queries.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (7)

1. A combined query image retrieval method of a multi-granularity attention network, characterized in that the method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrast learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: performing feature matching at the similarity level by defining a first loss function L_bbc, the feature matching specifically comprising the following steps:
step S61: calculating cosine similarity between the image region features to be reserved in the target image and the reference image obtained in the step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in the step S61 and the step S62 to obtain a granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: according to the first loss function L_bbc defined in step S6, training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limit by using an AdamW optimizer;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
2. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S2 specifically comprises:
step S21: removing non-alphabetic characters from the text in the step S1 through text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the text preprocessed in step S21, and then encoding the words in the text into word vectors using a GloVe pre-trained corpus;
step S23: encoding the whole sentence of word vectors from step S22 into text features through a long short-term memory network or a bidirectional gated recurrent network.
3. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S4 specifically comprises:
step S41: first resizing the target image and the reference image in the data set of step S1 to 256 × 256 pixels, and then performing data enhancement by using random cropping and random horizontal flipping;
step S42: constructing a multi-granularity attention network, and inputting each pair of reference images and corresponding target images subjected to data enhancement in the step S41 into the multi-granularity attention network to obtain reference images and target image features with different granularities;
step S43: and (3) inputting the text features in the step S2 into a multi-granularity attention network to obtain text features with different granularities.
4. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S5 specifically comprises:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used to screen the features to be preserved and to be modified;
step S52: multiplying, element by element, the attention weights obtained in step S51 with the reference image features and target image features of different granularities obtained in step S4 to obtain the image region features to be modified and to be preserved in the target image.
5. The method for searching the combined query image of the multi-granularity attention network as claimed in claim 4, wherein the image region features obtained in step S52 are optimized by a second loss function L_att;
specifically: the features to be modified among the target image features obtained in step S4 are taken as positive samples and the features to be preserved are taken as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limit;
the second loss function L_att is given by two formulas that are rendered as images in the original publication and are not reproduced here; in these formulas, Σ is the summation symbol, L_c(·) denotes a third loss function constructed using contrastive learning, the full-size +/− signs denote ordinary addition and subtraction of the terms on either side, the subscript + at the end of the second formula denotes taking 0 whenever the value is less than 0, Sim denotes the cosine-similarity operation, t denotes the text semantic information, F_S denotes the original text features, L_mi denotes the learnable text features of the i-th sample, the remaining image-rendered symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level, the image features to be modified in the target image of the i-th sample, and the feature-space size, a_i denotes the per-level weights, and i is the index of the granularity level at which the feature lies.
6. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein in step S6 the first loss function L_bbc is given by two formulas that are rendered as images in the original publication and are not reproduced here; in these formulas, one symbol denotes, for the j-th training sample, the sum of the similarity score between the text features and the modified image features and the similarity score between the reference image and the target image; a group of symbols denotes, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image; the analogous symbols with index i denote the same quantities for the i-th sample; a further symbol denotes the similarity, at the i-th granularity level, between the modification text and the region features to be modified in the target image; another denotes the similarity, at the i-th granularity level, between the region features to be preserved in the reference image and in the target image, with the corresponding symbols denoting, in order, the image features to be preserved in the reference image and in the target image at that granularity level; finally, one symbol denotes a learnable parameter, j indexes the j-th sample, N denotes the number of samples in one batch of the training data set, exp denotes the exponential function, and log denotes the logarithmic function.
7. The method for searching the combined query image of the multi-granularity attention network according to claim 1, wherein the step S8 specifically comprises: and (3) performing image retrieval by using a trained combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limit, and then selecting an image with the highest score in the similarity score matrix obtained in the step S6 as an output result.


