CN115905610B - Combined query image retrieval method of multi-granularity attention network - Google Patents
- Publication number: CN115905610B
- Application number: CN202310213360.9A
- Authority: CN (China)
- Prior art keywords: image, text, features, granularity, attention network
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management (section Y: general tagging of new technological developments and cross-sectional technologies; Y02: technologies for mitigation or adaptation against climate change; Y02D: climate change mitigation technologies in information and communication technologies)
Abstract
The invention discloses a combined query image retrieval method of a multi-granularity attention network, relates to the field of cross-modal retrieval in computer vision, and solves technical problems of existing models, such as overlap between the image regions to be preserved and the regions to be modified in the learned target image, and insufficient use of the semantic information of multi-granularity images and text. The invention first uses an image feature extractor to extract image features at different semantic levels and extracts text features through a text feature extractor, then fuses the image features of different semantic levels through a cross-layer interaction module, obtains relatively accurate preserved and modified regions in the target image through self-contrastive learning, and finally completes combined query image retrieval by computing cosine similarity and ranking from high to low. Image retrieval is completed using a combined query image retrieval method based on cross-modal attention preservation, so that semantic information at different levels is used more fully.
Description
Technical Field
The invention relates to the field of cross-modal retrieval in computer vision, in particular to a combined query image retrieval method of a multi-granularity attention network.
Background
Combined query image retrieval is an extension task in the field of image retrieval. Unlike conventional content-based image retrieval and image-text matching tasks, a query in combined query image retrieval contains both image and text modalities rather than a single-modality input. Conventional image retrieval requires the user to describe the query requirement with images or text alone, which limits the user's ability to express an accurate search intent. Combined query image retrieval allows a user to modify the image content with text information on top of querying with an image, expressing the search intent flexibly and comprehensively so as to optimize the retrieval results. The goal of the task is to modify specific content of the reference image according to the semantic information of the modification text and the reference image, and then to find, among all candidate images, target images that are similar to the reference image as modified by the modification text. Because of its practicality, combined query image retrieval has wide application in fields such as product recommendation and interactive image retrieval.
With the rapid development of hardware facilities, deep neural networks have become a benchmark model for various tasks. The existing combined query image retrieval method based on the deep neural network mainly comprises the following three technical routes:
1. A combined query image retrieval method based on a large-scale pre-trained model: this method uses additional prior knowledge learned from other image-text corpora to initialize model parameters and help the model learn target images. The method uses additional data and image features of both fine and coarse granularity to improve retrieval accuracy.
2. The combined query image retrieval method based on feature fusion comprises the following steps: the combined query image retrieval method based on feature fusion obtains image and text feature representation through an image and text encoder, utilizes a designed attention module or various network structures to screen out key features in the text and the image, fuses the screened image features and text features into a unified image-text feature representation, and finally calculates cosine similarity by using the independent image-text features and target image features to measure similarity between candidate images and the fused feature representation.
3. A combined query image retrieval method based on co-training: in order to reduce the complexity of the model and improve the efficiency of the combined query image retrieval model, this method learns the parts to be modified in the target image through an image-text matching strategy, and learns the parts to be preserved in the reference image through a content-based image retrieval strategy.
The current combined query image retrieval method is mainly a combined query image retrieval method based on feature fusion, and the method can effectively improve the accuracy of query results through a designed attention mechanism and a network structure.
However, in practical application, the existing combined query image retrieval method still has the following problems: the image parts which need to be reserved and modified in the target image learned by the model are overlapped, and semantic information of the images and texts with multiple granularities is not fully utilized. The above deficiencies may reduce the quality of the image retrieval result.
Disclosure of Invention
The invention aims at: in order to solve the technical problems, the invention provides a combined query image retrieval method of a multi-granularity attention network.
The invention adopts the following technical scheme for realizing the purposes:
the method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limitation, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrastive learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting reference image features and target image features of different granularities from the data in step S1 and text features of different granularities from the text features in step S2;
step S5: constructing an attention module with mutual exclusion limitation, which uses the image features and text features of different granularities extracted in step S4 to generate the image region features that need to be preserved and modified in the reference image and the target image;
step S6: performing feature matching at the similarity level by defining a first loss function L_bbc, wherein the feature matching specifically comprises the following steps:
step S61: calculating cosine similarity between the image region features to be reserved in the target image and the reference image obtained in the step S5;
step S62: calculating cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in the step S61 and the step S62 to obtain a granularity similarity score;
step S64: adding the similarity scores of the different granularity planes obtained in the step S63 to obtain a final similarity score matrix;
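For illustration, the similarity computation of steps S61 to S64 can be sketched as follows. This is a minimal NumPy sketch under assumptions: the function names, array shapes, and the use of plain lists of per-granularity feature matrices are illustrative, not the patent's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (N, d) and rows of b (N, d).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def similarity_matrix(text_feats, ref_keep, tgt_keep, tgt_mod):
    # Each argument is a list with one (N, d) array per granularity layer.
    total = 0.0
    for t, rk, tk, tm in zip(text_feats, ref_keep, tgt_keep, tgt_mod):
        keep_score = cosine_sim(rk, tk)         # step S61: preserved regions, reference vs target
        mod_score = cosine_sim(t, tm)           # step S62: text vs regions to be modified
        total = total + keep_score + mod_score  # steps S63/S64: sum per layer, then across layers
    return total

rng = np.random.default_rng(0)
N, d, layers = 4, 8, 3
feats = [[rng.normal(size=(N, d)) for _ in range(layers)] for _ in range(4)]
S = similarity_matrix(*feats)  # final (N, N) similarity score matrix
```

At retrieval time the row of `S` for a query is ranked from high to low, and the highest-scoring candidate is returned (step S8).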
step S7: training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limitation by using an AdamW optimizer according to the first loss function L_bbc defined in step S6;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
As an optional technical solution, the step S2 specifically includes:
step S21: removing non-alphabetic characters from the text in the step S1 through text preprocessing operation, and replacing special characters with spaces;
step S22: firstly word segmentation processing is carried out on the text obtained through pretreatment in the step S21, and then a GloVE pre-training corpus is used for encoding words in the text into word vectors;
step S23: the word vectors in step S22 are encoded into text features representing the whole sentence through a long short-term memory network or a bidirectional gated recurrent network.
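The preprocessing and tokenization of steps S21 and S22 can be sketched as follows; mapping the resulting tokens to GloVe vectors and encoding the sequence with an LSTM/GRU (steps S22–S23) would follow. The exact character rules are an assumption based on the description.

```python
import re

def preprocess(text):
    # Step S21: remove non-alphabetic characters, replacing special characters with spaces.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    # Step S22 (first half): whitespace word segmentation; a GloVe lookup table
    # would then map each token to a pre-trained word vector.
    return cleaned.lower().split()
```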
As an optional technical solution, the step S4 specifically includes:
step S41: the target image and the reference image in the dataset of step S1 are first resized to 256×256 pixels, and then data enhancement is performed using random cropping and random horizontal flipping;
step S42: constructing a multi-granularity attention network, and inputting each pair of reference images and corresponding target images subjected to data enhancement in the step S41 into the multi-granularity attention network to obtain reference images and target image features with different granularities;
step S43: and (3) inputting the text features in the step S2 into a multi-granularity attention network to obtain text features with different granularities.
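The data enhancement of step S41 can be sketched as follows; the 224×224 crop size and the 50% flip probability are common conventions and are assumptions here, since the patent only names the operations.

```python
import numpy as np

def augment(img, crop=224, rng=None):
    # img: (256, 256, 3) array, as produced by the resize in step S41.
    rng = rng if rng is not None else np.random.default_rng()
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)   # random crop
    left = rng.integers(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                # random horizontal flip
        out = out[:, ::-1]
    return out
```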
As an optional technical solution, the step S5 specifically includes:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain attention weights used to screen the features to be preserved and the features to be modified;
step S52: multiplying the attention weight obtained in the step S51 and the reference image features and target image features with different granularities obtained in the step S4 element by element to obtain the image region features to be modified and reserved in the target image;
as an alternative solution, the image region features obtained in step S52 may be optimized by a second loss function L_att;
specifically: the features to be modified in the target image features obtained in step S4 are taken as positive samples and the features to be preserved as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limitation;
The second loss function L_att is shown in detail below:

L_att = Σ_i a_i · L_c(L_mi, Î_i^mod, Î_i^pre)

L_c(t, x⁺, x⁻) = [ Sim(t, x⁻) − Sim(t, x⁺) ]_+

where Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the subscript + denotes taking 0 when the value is less than 0; Sim denotes the cosine similarity operation; t denotes text semantic information derived from the original text features F_S; L_mi denotes the learnable text features of the i-th sample; Î_i^pre denotes the image features to be preserved in the reference image at the i-th granularity layer; Î_i^mod denotes the image features to be modified in the target image of the i-th sample; a_i denotes the weights of the different granularity layers; and i denotes the index of the granularity layer at which the features are located.
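A hinge-style self-contrast loss of this kind, with the features to be modified as positives and the features to be preserved as negatives, can be sketched as follows. The margin-free hinge form and the function names are assumptions reconstructed from the symbol descriptions, not the patent's exact formula.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two 1-D feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l_att(text, mod_feats, keep_feats, weights):
    # Per granularity layer: hinge pushing the text embedding toward the
    # features to be modified (positive) and away from the features to be
    # preserved (negative); [.]_+ takes 0 when the value is negative.
    loss = 0.0
    for t, p, n, a in zip(text, mod_feats, keep_feats, weights):
        loss += a * max(cos(t, n) - cos(t, p), 0.0)
    return loss
```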
As an alternative solution, in step S6 the first loss function L_bbc is defined by the following formula:

L_bbc = −(1/N) · Σ_{i=1..N} log( exp(λ·κ_{i,i}) / Σ_{j=1..N} exp(λ·κ_{i,j}) )

κ_{i,j} = Sim(L_mi, Î_j^mod) + Sim(Î_i^ref, Î_j^keep)

where κ_{i,j} denotes the sum of the similarity score between the text features of the i-th training sample and the modified image features of the j-th sample and the similarity score between the reference image and the target image; L_mi, Î_j^mod, Î_i^ref and Î_j^keep denote, in order, the learnable text features, the image features to be modified in the target image, the reference image features to be preserved, and the image features to be preserved in the target image; Sim(L_mi, Î_i^mod) denotes the similarity at the i-th granularity layer between the modification text and the region features to be modified in the target image; Sim(Î_i^ref, Î_i^keep) denotes the similarity at the i-th granularity layer between the region features to be preserved in the reference image and in the target image; λ denotes a learnable parameter; j denotes the j-th sample; N denotes the total number of samples in one batch of the training dataset; exp denotes the exponential function; and log denotes the logarithmic function.
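A batch-based classification loss of this form — row-wise softmax cross-entropy over the batch similarity matrix, with the diagonal as the matched pairs — can be sketched as follows. Treating λ as a fixed scalar here is a simplification; in the model it is a learnable parameter.

```python
import numpy as np

def l_bbc(K, lam=10.0):
    # K: (N, N) similarity score matrix; K[i, i] is the matched pair's score.
    logits = lam * K
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))  # cross-entropy on the matched pairs
```

When the matched pairs dominate their rows the loss approaches 0; with a completely uninformative similarity matrix it equals log N.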
As an optional technical solution, the step S8 specifically includes: performing image retrieval using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limitation, and then selecting the image with the highest score in the similarity score matrix obtained in step S6 as the output result.
The beneficial effects of the invention are as follows:
1. the invention can more fully utilize visual and text semantic information with different granularities, so that the network model has robustness for the semantic information with different granularities.
2. The invention designs a combined query image retrieval method of a multi-granularity attention network with mutual exclusion limitation for image retrieval; the mutual exclusion limitation added to the attention optimizes the model's learning of which information in the target image needs to be preserved and which needs to be modified, thereby improving the accuracy of image retrieval.
Drawings
FIG. 1 is a flow chart of an implementation of a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion constraint of the present invention in an embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, this embodiment describes a combined query image retrieval method of a multi-granularity attention network. First, text features are extracted with a text encoder; then reference image and target image features of different granularities are extracted with the multi-granularity attention network; next, an attention module with mutual exclusion limitation is constructed, through which the image features to be preserved and to be modified are made as mutually exclusive as possible; finally, feature matching is performed at the similarity level, and the retrieval result is obtained from the resulting similarity matrix.
One key contribution of this embodiment is to fully mine image features of different granularities through the multi-granularity attention network for combined query image retrieval, so that the network obtains more accurate feature representations for semantic information of different granularities. Another core innovation is the mutual exclusion limitation added to the attention module: the attention module with mutual exclusion limitation ensures that the information to be preserved and the information to be modified in the image features attended to by the model are as mutually exclusive as possible, yielding more accurate query results. Compared with existing methods, this method fully mines visual and language information of different granularities so that the model can capture semantic information of different granularities; more importantly, by adding the mutual exclusion limitation at the attention level, it successfully solves the problem that the information to be preserved and the information to be modified overlap in the target image features learned by the model, so that the model can accurately attend to the image features to be preserved and to be modified, thereby improving the quality of the image retrieval results.
Example 2:
the method is realized by adopting a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion limitation, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module and a self-contrastive learning module for preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion limit, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion limit;
step S4: constructing a multi-granularity attention network, and extracting reference image features and target image features with different granularities in the step S1 and text features with different granularities in the step 2;
step S5: constructing an attention module with mutual exclusion limitation, which is used for generating the image features and text features with different granularities extracted in the step S4 to obtain the image region features which need to be reserved and modified in the reference image and the target image;
step S6: feature matching of the similarity level is performed by defining a first loss function L bbc The feature matching method specifically comprises the following steps:
step S61: calculating cosine similarity between the image region features to be reserved in the target image and the reference image obtained in the step S5;
step S62: calculating cosine similarity between the image region characteristics of the target image which are required to be modified and the text characteristics obtained in the step (2);
step S63: adding the similarity scores obtained in the step S61 and the step S62 to obtain a granularity similarity score;
step S64: adding the similarity scores of the different granularity planes obtained in the step S63 to obtain a final similarity score matrix;
step S7: training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion limitation by using an AdamW optimizer according to the first loss function L_bbc defined in step S6;
step S8: image retrieval is performed using the trained multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints to verify the performance of the multi-granularity attention network-based combined query image retrieval model with mutual exclusion constraints.
Example 3:
a combined query image retrieval method of a multi-granularity attention network: first, text features are extracted with a text encoder; then reference image and target image features of different granularities are extracted with the multi-granularity attention network; next, an attention module with mutual exclusion limitation is constructed, through which the image features to be preserved and to be modified are made as mutually exclusive as possible; finally, feature matching is performed at the similarity level, and the retrieval result is obtained from the resulting similarity matrix.
Step S1: selecting a training data set;
In this example, three common datasets, FashionIQ, Shoes and Fashion200K, were selected for the experiments.
The FashionIQ dataset is a natural language-based interactive fashion search dataset, which contains 77,684 images related to clothing, and can be specifically classified into three categories: one-piece dress, coat and shirt. In terms of dataset partitioning, it includes approximately 18,000 triples (modified text, reference image and target image) for training. The data sets for verification and testing each contained approximately 12,000 pairs of queries, the queries including reference images and modified text, wherein the modified text consisted of two artificially annotated descriptions.
The images of the Shoes dataset were initially collected from the internet, after which a natural language description was added for the task of retrieval for the combined query image. It consists of approximately 10,000 images, of which approximately 9,000 are used for training and 4,700 are used for testing.
The Fashion200K dataset is a large-scale fashion search dataset containing about 200,000 images, of which about 170,000 are used for training and about 30,000 for testing. Since the dataset has no modification-text labels, a pair of images whose text descriptions differ by only one word is used as the reference image and target image, and the modification text is then constructed from the differing words, in the form "replace sth. with sth.".
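The Fashion200K modification-text construction described above can be sketched as follows; the function name and the exact "replace X with Y" phrasing template are assumptions based on the description.

```python
def build_modification_text(ref_desc, tgt_desc):
    # Fashion200K has no modification-text labels: keep only image pairs whose
    # descriptions differ in exactly one word, and phrase the difference as
    # "replace <old> with <new>".
    ref_words, tgt_words = ref_desc.split(), tgt_desc.split()
    if len(ref_words) != len(tgt_words):
        return None
    diffs = [(r, t) for r, t in zip(ref_words, tgt_words) if r != t]
    if len(diffs) != 1:
        return None  # discard pairs that do not differ by a single word
    old, new = diffs[0]
    return f"replace {old} with {new}"
```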
Step S2: constructing a network structure of a text encoder, and for the modified text in the data set in the step S1, acquiring text characteristics by using the text encoder, wherein the text encoder is a long-short-term memory network (a full-connection layer is connected later) or a bidirectional gating circulation network;
the specific content of the steps is as follows:
step S21: removing non-alphabetic characters from the text in the data set in the step S1 through text preprocessing operation, and replacing special characters with spaces;
step S22: firstly word segmentation processing is carried out on the text obtained through pretreatment in the step S21, and then a GloVE pre-training corpus is used for encoding words in the text into word vectors;
step S23: the word vectors in step S22 are encoded into text features representing the whole sentence through a long short-term memory network or a bidirectional gated recurrent network;
Step S3: constructing the multi-granularity attention network with mutual exclusion limitation, wherein a text encoder is arranged in the multi-granularity attention network to extract semantic information of different granularities, and three structurally identical attention modules with mutual exclusion limitation are used to optimize target image features of different granularities.
Step S4: constructing a multi-granularity attention network, and extracting images and text features with different granularities by using the multi-granularity attention network for the reference image in the data set in the step S1 and the text features acquired in the step S2;
the specific content of the steps is as follows:
step S41: the target image and the reference image in the dataset of step S1 are first resized to 256×256 pixels, and then data enhancement is performed using random cropping and random horizontal flipping.
Step S42: constructing a multi-granularity attention network, and inputting each pair of reference images and corresponding target images subjected to data enhancement in the step S41 into the multi-granularity attention network to obtain reference images and target image features with different granularities;
step S43: the text features obtained in step S23 are input into the multi-granularity attention network, and text features of different granularities are obtained through three fully connected layers.
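The idea of extracting features at several granularity layers (step S42) can be sketched as follows. This is a toy stand-in: real multi-granularity features would come from different stages of the image backbone, whereas here coarser granularities are simulated by progressive average pooling; all names and the three-layer setup are assumptions.

```python
import numpy as np

def multi_granularity_features(feat_map, layers=3):
    # feat_map: (C, H, W) convolutional feature map.
    feats = []
    fm = feat_map
    for _ in range(layers):
        feats.append(fm.mean(axis=(1, 2)))  # one (C,) descriptor per granularity
        # 2x2 average pooling to move to the next, coarser granularity
        c, h, w = fm.shape
        fm = fm[:, : h // 2 * 2, : w // 2 * 2].reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return feats
```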
Step S5: constructing an attention module with mutual exclusion limit, and generating image area features to be reserved and modified in a reference image and a target image by using the text and image features with different granularities extracted from the multi-granularity attention network in the step S4 through the attention module with mutual exclusion limit;
the specific content of the steps is as follows:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain attention weights used to screen the features to be preserved and the features to be modified;
step S52: multiplying the weight obtained in the step S51 and the characteristics of the reference image and the target image with different granularities obtained in the step S4 element by element to obtain the characteristics which need to be modified and reserved in the target image;
step S53: the image region features obtained in step S52 may be optimized by a second loss function L_att;
specifically: the features to be modified in the target image features obtained in step S4 are taken as positive samples and the features to be preserved as negative samples for defining the second loss function L_att, thereby constructing the attention module with mutual exclusion limitation;
The second loss function L_att is shown in detail below:

L_att = Σ_i a_i · L_c(L_mi, Î_i^mod, Î_i^pre)

L_c(t, x⁺, x⁻) = [ Sim(t, x⁻) − Sim(t, x⁺) ]_+

where Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the subscript + denotes taking 0 when the value is less than 0; Sim denotes the cosine similarity operation; t denotes text semantic information derived from the original text features F_S; L_mi denotes the learnable text features of the i-th sample; Î_i^pre denotes the image features to be preserved in the reference image at the i-th granularity layer; Î_i^mod denotes the image features to be modified in the target image of the i-th sample; a_i denotes the weights of the different granularity layers; and i denotes the index of the granularity layer at which the features are located.
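The attention weighting of steps S51 and S52 can be sketched as follows. A single linear layer with a sigmoid stands in for the multi-layer perceptron, and the complementary split between "modify" and "preserve" weights is an assumption; all variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(text_feat, img_feat, W, b):
    # Step S51: map text features to per-channel attention weights
    # (one-layer stand-in for the multi-layer perceptron).
    w_mod = sigmoid(text_feat @ W + b)  # weights for regions to be modified
    w_keep = 1.0 - w_mod                # assumed complementary weights to preserve
    # Step S52: element-wise multiplication with the image features.
    return w_mod * img_feat, w_keep * img_feat

rng = np.random.default_rng(1)
d = 8
text, img = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(d, d)), np.zeros(d)
mod, keep = attend(text, img, W, b)
```

With complementary weights, the two outputs partition the image features, which is one simple way to encourage the mutual exclusion the patent imposes via L_att.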
Step S6: constructing a multi-granularity attention network with mutual exclusion limit, reserving and modifying image features of a specific area by using text features in the step S3 and image features in the step S4, and then calculating a similarity score;
the specific content of the steps is as follows:
step S61: calculating the cosine similarity between the image features to be preserved in the target image and in the reference image obtained in step S5;
step S62: calculating the cosine similarity between the image features to be modified in the target image obtained in step S5 and the text features;
step S63: adding the similarity scores obtained in steps S61 and S62 to obtain a per-granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
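The scoring procedure of steps S61 to S64 can be sketched as follows; the tensor shapes and function names are illustrative assumptions, not the patent's exact implementation:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(ref_pres, tgt_pres, text_mod, tgt_mod):
    """Score matrix for one granularity level (steps S61-S63).
    ref_pres, tgt_pres: (N, D) preserved features of reference/target images
    text_mod, tgt_mod:  (N, D) modification-text and modified-target features
    Returns an (N, N) matrix: query i vs. candidate j."""
    s_pres = F.normalize(ref_pres, dim=-1) @ F.normalize(tgt_pres, dim=-1).T  # S61
    s_mod = F.normalize(text_mod, dim=-1) @ F.normalize(tgt_mod, dim=-1).T    # S62
    return s_pres + s_mod                                                     # S63

def final_score(levels):
    """Step S64: sum the per-granularity score matrices."""
    return torch.stack([similarity_matrix(*lv) for lv in levels]).sum(0)

levels = [tuple(torch.randn(3, 5) for _ in range(4)) for _ in range(2)]
scores = final_score(levels)
```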
step S7: training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion constraint using the AdamW optimizer, according to the loss function defined in step S6;
The initial learning rate of the AdamW optimizer is set to 0.0005. Following the decay strategy, the learning rate is halved every 10 training epochs; after 20 epochs, it is halved every 5 epochs. The whole model is trained for 50, 100, and 150 epochs on the training sets of the three datasets, respectively.
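Under one reading of this schedule (halve every 10 epochs, then every 5 epochs after epoch 20), the optimizer setup could look like the sketch below; the linear layer is a placeholder for the retrieval model:

```python
import torch

def lr_factor(epoch: int) -> float:
    """Decay schedule as we read the description: halve every 10 epochs,
    and after epoch 20 halve every 5 epochs (an interpretation, not verbatim)."""
    if epoch < 20:
        return 0.5 ** (epoch // 10)
    return 0.25 * 0.5 ** ((epoch - 20) // 5)

model = torch.nn.Linear(4, 4)  # placeholder for the full retrieval model
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)
# in training: loss.backward(); opt.step(); ... ; sched.step() once per epoch
```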
wherein the two score symbols denote the similarity score between the text features and the modified image features of the j-th training sample and the similarity score between the reference image and the target image, and the subsequent symbols denote, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image;
likewise, the i-indexed score symbols denote the sum of the similarity scores between the text features and the modified image features of the i-th training sample and the similarity score between the reference image and the target image, and the subsequent symbols denote, in order, the learnable text features of the i-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image;
one term denotes the similarity between the modification text and the region features to be modified in the target image at the i-th granularity level;
another term denotes the similarity between the region features to be preserved in the reference image and in the target image at the i-th granularity level, with the following symbols denoting, in order, the image features to be preserved in the reference image and in the target image at the i-th granularity level;
a further symbol denotes a learnable parameter; j denotes the j-th sample; N denotes the total number of samples in one batch of the training dataset; exp denotes the exponential function; and log denotes the logarithmic function.
Step S8: image retrieval is performed using a trained multi-granularity query image retrieval model based on a multi-granularity attention network with mutual exclusion constraints to verify the effectiveness of the trained multi-granularity query image retrieval model based on the multi-granularity attention network with mutual exclusion constraints.
And when the image retrieval is carried out in the step S8, selecting the image with the highest score in the similarity score matrix obtained in the step S6 as an output result.
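Selecting the highest-scoring candidate from the similarity matrix can be sketched as:

```python
import torch

def retrieve(score_matrix: torch.Tensor, k: int = 1) -> torch.Tensor:
    """For each query row, return the indices of the top-k scoring
    candidate images from the final similarity matrix."""
    return score_matrix.topk(k, dim=1).indices

scores = torch.tensor([[0.1, 0.9, 0.3],
                       [0.8, 0.2, 0.5]])
assert retrieve(scores, 1).squeeze(1).tolist() == [1, 0]  # best match per query
```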
Example 4:
This example evaluates our designed network using the Recall@10 (R@10), Recall@50 (R@50), and MeanR metrics on the FashionIQ dataset, and using the Recall@1 (R@1), Recall@10 (R@10), Recall@50 (R@50), and MeanR metrics on the Shoes and Fashion200K datasets. The Recall@K metric is defined as the percentage of test samples for which the correct target image appears among the first K returned results.
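A minimal implementation of the Recall@K metric as defined here might look like:

```python
import numpy as np

def recall_at_k(score_matrix: np.ndarray, targets: np.ndarray, k: int) -> float:
    """Percentage of queries whose correct target index appears in the
    top-K ranked candidates of its row of the score matrix."""
    topk = np.argsort(-score_matrix, axis=1)[:, :k]   # rank candidates descending
    hits = (topk == targets[:, None]).any(axis=1)     # correct target in top-K?
    return 100.0 * hits.mean()

scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.3, 0.5],
                   [0.4, 0.6, 0.1]])
targets = np.array([0, 2, 0])
assert recall_at_k(scores, targets, 2) == 100.0  # all 3 targets in the top 2
```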
The model performance comparisons of our method MANME and other methods on the FashionIQ dataset are shown in table 1:
TABLE 1
The model performance versus results on the Shoes dataset are shown in table 2:
TABLE 2
The model performance versus results on the Fashion200K dataset are shown in table 3:
TABLE 3
As the tables show, our invention significantly outperforms all existing methods on every accuracy metric used for evaluation on the FashionIQ and Shoes datasets. On the large-scale Fashion200K dataset, it significantly outperforms current methods on all evaluation metrics except Recall@1. This demonstrates that the combined query image retrieval method based on the multi-granularity attention network with mutual exclusion constraint fully mines semantic information of different granularities in the text and image features, and that the attention module with mutual exclusion constraint ensures that the parts to be preserved and to be modified in the target image do not overlap, so that the model learns more accurate target image features and retrieval becomes more accurate.
Explanation about the model in tables 1, 2 and 3:
JVSM: joint visual-semantic matching embeddings for language-guided retrieval;
TIRG: the first method proposed for the combined query task;
VAL: image search with text feedback via visual-linguistic attention learning;
ARTEMIS: attention-based explicit matching and implicit similarity for text-driven retrieval;
ComposeAE: compositional learning of image-text queries for image retrieval;
CoSMo: content-style modulation for image retrieval with text feedback;
DCNet: dual compositional learning in interactive image retrieval;
SAC: semantic attention composition for text-conditioned image retrieval;
TCIR: text-conditioned image retrieval using style and content features;
CIRPLANT: image retrieval on real-life images using a pre-trained vision-and-language model;
FashionVLP: vision-language transformer for fashion retrieval with feedback;
CLVC-Net: comprehensive linguistic-visual composition network for image retrieval;
GSCMR: geometry-sensitive cross-modal reasoning for image retrieval with combined queries.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Claims (7)
1. A combined query image retrieval method of a multi-granularity attention network, characterized in that the method is implemented using a combined query image retrieval model based on a multi-granularity attention network with mutual exclusion constraint, wherein the model comprises an image feature extraction module, a text feature extraction module, a cross-layer interaction module, and a self-contrastive learning module for feature preservation, and the method comprises the following steps:
step S1: acquiring a data set for training, wherein the data set comprises a text, a target image and a reference image;
step S2: constructing a network structure of a text encoder, and acquiring text characteristics of the text in the step S1 by using the text encoder;
step S3: constructing a multi-granularity attention network structure with mutual exclusion constraint, wherein the network structure comprises a multi-granularity attention network and three attention modules with mutual exclusion constraint;
step S4: constructing the multi-granularity attention network, and extracting the reference image features and target image features of different granularities from step S1 and the text features of different granularities from step S2;
step S5: constructing the attention module with mutual exclusion constraint, which processes the image features and text features of different granularities extracted in step S4 to obtain the image region features to be preserved and to be modified in the reference image and the target image;
step S6: performing feature matching at the similarity level by defining a first loss function L_bbc, which specifically comprises:
step S61: calculating the cosine similarity between the image region features to be preserved in the target image and in the reference image obtained in step S5;
step S62: calculating the cosine similarity between the image region features to be modified in the target image and the text features obtained in step S2;
step S63: adding the similarity scores obtained in steps S61 and S62 to obtain a per-granularity similarity score;
step S64: adding the similarity scores of the different granularity levels obtained in step S63 to obtain the final similarity score matrix;
step S7: training the combined query image retrieval model based on the multi-granularity attention network with mutual exclusion constraint using the AdamW optimizer, according to the first loss function L_bbc defined in step S6;
step S8: performing image retrieval using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion constraint, so as to verify its performance.
2. The combined query image retrieval method of a multi-granularity attention network according to claim 1, wherein step S2 specifically comprises:
step S21: removing non-alphabetic characters from the text in step S1 through a text preprocessing operation, and replacing special characters with spaces;
step S22: performing word segmentation on the preprocessed text from step S21, and then encoding the words in the text into word vectors using the GloVe pre-trained corpus;
step S23: encoding the word vectors from step S22 into sentence-level text features through a long short-term memory network or a bidirectional gated recurrent network.
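A minimal sketch of the cleaning and tokenization steps of this text pipeline (the exact cleaning rules are assumptions; GloVe encoding and the recurrent encoder are omitted):

```python
import re

def preprocess(text: str) -> list[str]:
    """Drop non-alphabetic characters, replace special characters with
    spaces, lowercase, and split into words (rules are assumptions)."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()

assert preprocess("Is red, sleeveless!") == ["is", "red", "sleeveless"]
```

Each resulting token would then be looked up in the GloVe vocabulary before being fed to the sentence encoder.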
3. The combined query image retrieval method of a multi-granularity attention network according to claim 1, wherein step S4 specifically comprises:
step S41: resizing the target images and reference images in the dataset of step S1 to 256 × 256 pixels, and then performing data augmentation using random cropping and random horizontal flipping;
step S42: constructing the multi-granularity attention network, and inputting each pair of augmented reference image and corresponding target image from step S41 into the multi-granularity attention network to obtain reference image features and target image features of different granularities;
step S43: inputting the text features from step S2 into the multi-granularity attention network to obtain text features of different granularities.
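The augmentation in step S41 might be sketched in pure tensor operations as follows; the 224-pixel crop size is an assumption, since the claim only specifies the 256 × 256 resize:

```python
import torch

def augment(img: torch.Tensor, crop: int = 224) -> torch.Tensor:
    """Random crop plus random horizontal flip on a (C, 256, 256) tensor.
    A minimal sketch of step S41; crop size 224 is an assumption."""
    _, h, w = img.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    img = img[:, top:top + crop, left:left + crop]  # random crop
    if torch.rand(1).item() < 0.5:
        img = img.flip(-1)                          # random horizontal flip
    return img

out = augment(torch.zeros(3, 256, 256))
```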
4. The combined query image retrieval method of a multi-granularity attention network according to claim 1, wherein step S5 specifically comprises:
step S51: inputting the text features obtained in step S2 into a multi-layer perceptron to obtain the attention weights used for selecting the features to be preserved and to be modified;
step S52: multiplying the attention weights obtained in step S51 element-wise with the reference image features and target image features of different granularities obtained in step S4, so as to obtain the image region features to be modified and to be preserved in the target image.
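The element-wise weighting of step S52 can be illustrated as below, where a single attention map splits the region features into complementary "modify" and "preserve" parts (the shapes and the complementary 1 − a weighting are assumptions):

```python
import torch

def split_features(region_feats: torch.Tensor, attn: torch.Tensor):
    """Split region features into 'to modify' and 'to preserve' parts.

    region_feats: (B, R, D) image region features at one granularity level
    attn:         (B, R, 1) attention weights in [0, 1] from the text MLP
    Names and shapes are illustrative, not the patent's exact ones."""
    modify = region_feats * attn          # regions the text asks to change
    preserve = region_feats * (1 - attn)  # complementary, non-overlapping part
    return modify, preserve

feats = torch.randn(2, 4, 8)
attn = torch.sigmoid(torch.randn(2, 4, 1))
mod, pres = split_features(feats, attn)
```

Using complementary weights is one simple way to make the two feature sets mutually exclusive by construction.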
5. The combined query image retrieval method of a multi-granularity attention network according to claim 4, wherein the image region features obtained in step S52 are optimized through a second loss function L_att;
specifically: among the target image features obtained in step S4, the features to be modified serve as positive samples and the features to be preserved serve as negative samples when defining the second loss function L_att, thereby constructing the attention module with mutual exclusion constraint;
the second loss function L_att is defined as follows,
where Σ is the summation symbol; L_c(·) denotes a third loss function constructed using contrastive learning; the ± sign in the two formulas indicates that both the addition and subtraction variants are evaluated and their results summed; the subscript + at the end of the second formula denotes clipping at zero, i.e. values below 0 are set to 0; Sim denotes the cosine similarity operation; t denotes the text semantic information; F_S denotes the original text features; L_mi denotes the learnable text features of the i-th sample; the two image-feature symbols denote, in order, the image features to be preserved in the reference image at the i-th granularity level and the image features to be modified in the target image of the i-th sample; |·| denotes the size of the space; a_i denotes the weights of the different granularity levels; and i denotes the granularity level at which a feature is located.
6. The combined query image retrieval method of a multi-granularity attention network according to claim 1, wherein in step S6 the first loss function L_bbc is defined by the following formula,
wherein the two score symbols denote the similarity score between the text features and the modified image features of the j-th training sample and the similarity score between the reference image and the target image, and the subsequent symbols denote, in order, the learnable text features of the j-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image;
likewise, the i-indexed score symbols denote the sum of the similarity scores between the text features and the modified image features of the i-th training sample and the similarity score between the reference image and the target image, and the subsequent symbols denote, in order, the learnable text features of the i-th sample, the image features to be modified in the target image, the reference image, and the image features to be preserved in the target image;
one term denotes the similarity between the modification text and the region features to be modified in the target image at the i-th granularity level;
another term denotes the similarity between the region features to be preserved in the reference image and in the target image at the i-th granularity level, with the following symbols denoting, in order, the image features to be preserved in the reference image and in the target image at the i-th granularity level;
7. The combined query image retrieval method of a multi-granularity attention network according to claim 1, wherein step S8 specifically comprises: performing image retrieval using the trained combined query image retrieval model based on the multi-granularity attention network with mutual exclusion constraint, and then selecting the image with the highest score in the similarity score matrix obtained in step S6 as the output result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310213360.9A CN115905610B (en) | 2023-03-08 | 2023-03-08 | Combined query image retrieval method of multi-granularity attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115905610A CN115905610A (en) | 2023-04-04 |
CN115905610B true CN115905610B (en) | 2023-05-26 |
Family
ID=85744778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310213360.9A Active CN115905610B (en) | 2023-03-08 | 2023-03-08 | Combined query image retrieval method of multi-granularity attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115905610B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818157A (en) * | 2021-02-10 | 2021-05-18 | 浙江大学 | Combined query image retrieval method based on multi-order confrontation characteristic learning |
CN112860930A (en) * | 2021-02-10 | 2021-05-28 | 浙江大学 | Text-to-commodity image retrieval method based on hierarchical similarity learning |
CN114048340A (en) * | 2021-11-15 | 2022-02-15 | 电子科技大学 | Hierarchical fusion combined query image retrieval method |
CN114998615A (en) * | 2022-04-28 | 2022-09-02 | 南京信息工程大学 | Deep learning-based collaborative significance detection method |
CN115033670A (en) * | 2022-06-02 | 2022-09-09 | 西安电子科技大学 | Cross-modal image-text retrieval method with multi-granularity feature fusion |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11699275B2 (en) * | 2020-06-17 | 2023-07-11 | Tata Consultancy Services Limited | Method and system for visio-linguistic understanding using contextual language model reasoners |
Non-Patent Citations (3)
Title |
---|
Clothing retrieval based on image and text combination; Zongbao Liang et al.; 2021 7th International Conference on Systems and Informatics (ICSAI); 1-6 *
Research on image retrieval based on multimodal queries: the fashion domain as an example; Wen Haokun; China Master's Theses Full-text Database, Information Science and Technology; I138-1326 *
Deep-learning-based feature extraction and its application in image retrieval; Wu Bin; China Master's Theses Full-text Database, Information Science and Technology; I138-1036 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||