CN112818157B - Combined query image retrieval method based on multi-order adversarial feature learning - Google Patents

Combined query image retrieval method based on multi-order adversarial feature learning

Info

Publication number
CN112818157B
CN112818157B
Authority
CN
China
Prior art keywords
image
features
text
fusion
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110185641.9A
Other languages
Chinese (zh)
Other versions
CN112818157A (en)
Inventor
纪守领
付之笑
董建锋
张旭鸿
何源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202110185641.9A
Publication of CN112818157A
Application granted
Publication of CN112818157B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a combined query image retrieval method based on multi-order adversarial feature learning, comprising the following steps: first, image features are obtained with a pre-trained feature extraction module and text features with an LSTM network, and the features of the two modalities are fused under self-attention guidance; high-order features are then generated from the low-order features by bilinear fusion; next, the similarity relations between features are learned with a triplet loss, and a discriminator is set against the retrieval network so that their competition further promotes the fusion of the features; finally, the discriminator and the retrieval network are trained jointly in an end-to-end manner, realizing efficient combined query image retrieval. By exploiting deep learning techniques and borrowing the idea of adversarial games, the invention greatly improves the performance and efficiency of combined query image retrieval.

Description

Combined query image retrieval method based on multi-order adversarial feature learning
Technical Field
The invention relates to the technical field of machine learning and combined query image retrieval, and in particular to a combined query image retrieval method based on multi-order adversarial feature learning.
Background
With the rapid development of information technology and the popularization of mobile networked devices, people can easily access massive and diverse picture resources on the network. Faced with such huge amounts of data, an efficient and accurate image retrieval method and system become indispensable whenever a picture that meets a user's requirements has to be found. The rapidly growing total number of pictures brings a large-scale increase in the number of similar images, which greatly reduces retrieval accuracy, so existing image retrieval techniques face enormous pressure and challenges. The mainstream text-to-image and image-to-image retrieval modes each have their own limitations: the expressive power of pure text is limited and information is lost when an idea is converted into language, while a pure picture cannot indicate the desired direction of change, so the search results may still contain a large number of unwanted pictures. Combined query image retrieval is one way of solving these problems: it accepts a reference image and a modifying text simultaneously as input, retaining the information of the image while satisfying the modification requirements of the text. A new image retrieval method that combines a picture query with a modifying text is therefore becoming a research and development trend.
In recent years, machine learning methods have shown excellent performance in the field of image retrieval. They mainly use a convolutional neural network (CNN) to extract picture features and a recurrent neural network (RNN) to extract text features, train the neural network model with metric learning, and complete image-to-image and text-to-image retrieval by comparing the similarity between the query features and the target image features. There are also graph-based methods, neighbor-analysis-based methods, and so on.
Existing combined query image retrieval methods generally extract features of the image and of the text separately and then fuse them; the fusion features, which contain information of both modalities, are compared for similarity with the features of the candidate pictures to retrieve the target picture. Current methods still have shortcomings: they do not make full use of multi-scale features, although features of different scales often contain information specific to each level, and their image-text fusion schemes are relatively simple, so retrieval efficiency is low.
Disclosure of Invention
Aiming at the shortcomings of existing methods, the invention provides a combined query image retrieval method based on multi-order adversarial feature learning. The method first fuses low-order features into high-order features and sets up a discriminator to compete against the retrieval network, so that the competition further promotes the fusion of the image-text features and realizes efficient combined query image retrieval. Compared with existing methods, the obtained features contain richer information and the correlations between different levels, and the information of the image and text modalities is fused more tightly.
A combined query image retrieval method based on multi-order adversarial feature learning uses a retrieval network to obtain the multi-level features and high-order features of the candidate images, fuses the reference image with the modified-text features to obtain multi-level image-text fusion features and a high-order image-text fusion feature, concatenates the multi-level and high-order image-text fusion features, computes the cosine similarity one by one with the concatenated multi-level and high-order features of the candidate images, sorts the candidates by similarity, and returns the sorted candidate images as the retrieval result of the query image. The retrieval network comprises a feature extraction module, a self-attention fusion module and a bilinear fusion module, and is constructed and trained by the following steps:
(1) Features are extracted from the reference image, the modified text and the target image with the feature extraction module to obtain the initial features of the two modalities, where the initial features of the reference image and of the target image comprise image features output by multiple levels of the feature extraction module.
(2) The initial text features obtained in step (1) are fused with the features of the reference image at the different levels with the self-attention fusion module to obtain the multi-level image-text fusion features.
(3) The multi-level image-text fusion features of step (2) are fused across levels with the bilinear fusion module to obtain the high-order image-text fusion feature, and the multi-level initial features of the target image of step (1) are likewise fused across levels to obtain the high-order image feature.
(4) The multi-level image-text fusion features of step (2) are compared with the multi-level features of the target image of step (1), and the high-order image-text fusion feature of step (3) is compared with the high-order image feature of the target image, for similarity learning.
(5) The multi-level initial features, the multi-level image-text fusion features and the high-order image-text fusion feature of the reference image, the text feature, and the multi-level initial features and high-order image feature of the target image are input into a discriminator, which judges whether the multi-level image-text fusion features or the target-image features at the different levels, relative to the corresponding reference-image features, satisfy the modification requirement of the text, for adversarial learning. The retrieval network is finally trained in an end-to-end manner. The reference-image features comprise the multi-level initial features of the reference image and the high-order image feature obtained by fusing those multi-level initial features across levels.
Further preferably, the initial features comprise image features of three levels (low, middle and high), and the high-order feature is the fusion of the middle-level and high-level features.
Further preferably, in step (1), the feature extraction module comprises an LSTM network and a MobileNet convolutional neural network, wherein:
the LSTM network extracts features from the input text to obtain the initial text features;
a pre-trained MobileNet or ResNet18 convolutional neural network extracts features from the reference image and from the paired target image, and the initial image features of the different levels are taken from the lower, middle and upper layers of the network.
Further preferably, the self-attention fusion module comprises a convolution layer, a self-attention network and a linear layer, and step (2) specifically comprises the following sub-steps:
(2-1) the initial features of each level of the reference image are connected with the text feature, and a preliminary fusion feature is obtained with the convolution layer;
(2-2) the preliminary fusion feature is further learned into the image-text fusion feature with the self-attention network and the linear layer.
Further preferably, the bilinear fusion module comprises a plurality of linear layers, and the method for obtaining the high-order features in step (3) comprises the following sub-steps:
(3-1) the multi-level image-text fusion features of step (2) and the multi-level features of the target image of step (1) are each mapped into the same dimension with linear layers;
(3-2) a dot product is taken over the image-text fusion features of the levels mapped in step (3-1), and a linear layer then yields the high-order image-text fusion feature; a dot product is likewise taken over the target-image features of the levels mapped in step (3-1), and a linear layer then yields the high-order image feature of the target image.
Further preferably, in step (4), the multi-level image-text fusion features obtained in step (2) are compared with the multi-level features of the target image obtained in step (1), and the high-order image-text fusion feature obtained in step (3) is compared with the high-order image feature of the target image, so that the model learns the similarity relations between features of different levels and orders. The triplet loss is expressed as:

L_tri = Σ_j [ max(0, d(φ_j, ψ_j) - d(φ_j, ψ_j^-) + m) + max(0, d(φ_j, ψ_j) - d(φ_j^-, ψ_j) + m)
            + max(0, d(g(ψ_j), h) - d(g(ψ_j^-), h) + m) + max(0, d(g(ψ_j), h) - d(g(ψ_j), h^-) + m) ]

where φ_j denotes the image-text fusion feature, ψ_j the matched initial feature or high-order image feature of the target image, g(·) the linear mapping applied to the image feature, h the modified-text feature, the superscript '-' marks negative samples, d(·,·) denotes the Euclidean distance, m denotes the margin, and j runs over the different levels and the high order. The first two terms push the image-text fusion feature of each level and the high-order image-text fusion feature φ_j closer than the negative sample φ_j^- to the matched initial features and high-order image feature ψ_j of the target image; the latter two terms align the mapped image features with their corresponding modified-text feature h.
Further preferably, the adversarial training method using the discriminator in step (5) comprises the following steps:
(5-1) The initial features and the high-order image feature of the input reference image are subtracted from the corresponding image features of the target image, and the difference is transformed by a linear layer.
(5-2) The input modified-text feature is transformed by another linear layer; the transformed text feature is multiplied with the feature obtained from the linear-layer transform of step (5-1), the entries of the product are summed, and an activation function yields the discriminator's predicted value, which judges whether the target-image feature, compared with the reference image, satisfies the requirement of the modified text. The real loss function is expressed as:

L_real = E[ max(0, 1 - D(x_j, ψ_j, h)) ]

where E[·] denotes averaging within the batch, D denotes the discriminator's predicted value, and x_j denotes the reference-image features, comprising the initial features and the high-order image feature obtained with the same high-order feature construction as for the target image.
(5-3) For the shuffled initial and high-order image features of the reference image and the correspondingly shuffled image features of the target image, the discriminator judges, by the method of steps (5-1)-(5-2), that the wrongly paired image features do not satisfy the requirement of the modified text. The error loss function is expressed as:

L_err = E[ max(0, 1 + D(x_j^π, ψ_j^π, h)) ]

where x_j^π and ψ_j^π denote the shuffled x_j and ψ_j.
(5-4) The discriminator's judgement of the multi-level image-text fusion features determines both the generation loss of the retrieval network and the adversarial loss of the discriminator. The adversarial loss of the discriminator is the fusion loss, expressed as:

L_fuse = E[ max(0, 1 + D(x_j, φ_j, h)) ]

and the generation loss of the retrieval network is:

L_gen = E[ max(0, 1 - D(x_j, φ_j, h)) ]

(5-5) The losses calculated in the above steps, together with the triplet loss, supervise the end-to-end training of the model.
Further preferably, the multi-level features and the high-order features of the candidate images obtained by the retrieval network are concatenated to construct a candidate feature library.
The invention has the following beneficial effects: by exploiting deep learning techniques and borrowing the idea of adversarial games, the invention strengthens the tightness of cross-modal feature fusion, obtains feature representations with richer semantic information, and thereby improves the performance and efficiency of combined query image retrieval to a great extent.
Drawings
FIG. 1 is a schematic diagram of the multi-order adversarial representation learning network architecture of the present invention;
FIG. 2 is a schematic diagram of the self-attention network structure of the present invention;
FIG. 3 is a schematic diagram of the bilinear fusion module structure of the present invention;
FIG. 4 is a schematic diagram of the discriminator network structure of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention provides a combined query image retrieval method based on multi-order adversarial feature learning, which comprises the following steps:
(1) Features are extracted from the reference image, the modified text and the target image with different feature extraction methods to obtain the initial features of the two modalities.
(1-1) Given the input reference image X_s and target image X_t, image features x_i are extracted with a MobileNet or ResNet18 network model pre-trained on the ImageNet dataset, where i denotes the level of the network; in practice the model extracts features from its lower, middle and upper layers.
(1-2) Given the input modified text T, the words are first converted into word-embedding vectors {w_1, w_2, ..., w_n} with a simple vocabulary and an embedding layer, where w_n denotes the embedding vector of the n-th word. Text features are then extracted with an LSTM network model, which can understand the contextual relations of the words in a sentence; the hidden-layer size is 1024, and the final output vector, after being mapped by a fully connected layer, is taken as the text feature vector h of dimension l = 512.
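For illustration only, the two encoders in (1-1) and (1-2) could be sketched in PyTorch roughly as below; the class names, the ResNet18 split points and the embedding dimension are assumptions made for the example rather than details stated in the patent:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """Multi-level image encoder: returns low-, mid- and high-level feature maps
    from a ResNet18 pre-trained on ImageNet (split points are an assumed choice)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.low, self.mid, self.high = backbone.layer2, backbone.layer3, backbone.layer4

    def forward(self, img):
        x = self.stem(img)
        f_low = self.low(x)          # low-level feature map
        f_mid = self.mid(f_low)      # mid-level feature map
        f_high = self.high(f_mid)    # high-level feature map
        return f_low, f_mid, f_high

class TextEncoder(nn.Module):
    """Embeds the modified text and encodes it with an LSTM (hidden size 1024),
    then maps the last output to a 512-d text feature h, as described above."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=1024, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        out, _ = self.lstm(self.embed(tokens))
        return self.fc(out[:, -1, :])          # final LSTM output -> 512-d feature
```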
(2) The above feature extraction yields the initial features of the image and of the text. These features, however, are simply extracted by the CNN and the LSTM models, contain only information of their own modality, and cannot be used directly for retrieval, so the image and text features have to be fused to represent the target image. The invention adopts a self-attention (SA) method so that the model learns which parts of the features to keep and which to discard. The specific steps are as follows:
(2-1) The text feature is copied and expanded to the size of the matched image feature, connected with it, and passed through a 1×1 convolution layer to obtain a preliminary image-text fusion feature. This fusion is still coarse and the network cannot easily decide which information of the two modalities to keep or drop, so an attention module is further used to raise the weights of the preserved and modified parts and lower the weights of the unneeded parts.
(2-2) The structure of the self-attention fusion module used in this embodiment is shown in FIG. 2. The module processes the preliminary image-text fusion feature as follows: three 1×1 convolution layers map the feature to the query, key and value hidden spaces respectively, giving Q_i, K_i and V_i, where i denotes the feature level; the reshaped Q_i and K_i have size n_i × c_i, with n_i = h_i × w_i, and h_i, w_i, c_i denoting the height, width and channel number of the i-th level feature map. Matrix multiplication followed by softmax activation then yields the self-attention matrix A_i of size n_i × n_i:

A_i = softmax(Q_i K_i^T)

(2-3) The self-attention matrix is multiplied with V_i, and the product is passed through a 1×1 convolution layer to obtain the output feature, i.e., the refined image-text fusion feature that keeps the required parts and modifies the designated parts according to the attention weights.
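A minimal sketch of such a self-attention fusion module for a single feature level, assuming PyTorch and treating the channel bookkeeping (tiling the text feature, identical query/key/value widths) as illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusion(nn.Module):
    """Fuses one level of image features with the (spatially tiled) text feature,
    then refines the fusion with self-attention as in (2-1) to (2-3)."""
    def __init__(self, img_channels, text_dim):
        super().__init__()
        self.pre_fuse = nn.Conv2d(img_channels + text_dim, img_channels, kernel_size=1)
        self.q = nn.Conv2d(img_channels, img_channels, kernel_size=1)
        self.k = nn.Conv2d(img_channels, img_channels, kernel_size=1)
        self.v = nn.Conv2d(img_channels, img_channels, kernel_size=1)
        self.out = nn.Conv2d(img_channels, img_channels, kernel_size=1)

    def forward(self, img_feat, text_feat):
        b, c, h, w = img_feat.shape
        # (2-1) tile the text feature over the spatial grid and pre-fuse with a 1x1 conv
        text_map = text_feat[:, :, None, None].expand(-1, -1, h, w)
        fused = self.pre_fuse(torch.cat([img_feat, text_map], dim=1))
        # (2-2) project to query / key / value and compute the n x n attention matrix
        q = self.q(fused).flatten(2).transpose(1, 2)   # (b, n, c), n = h*w
        k = self.k(fused).flatten(2)                   # (b, c, n)
        v = self.v(fused).flatten(2).transpose(1, 2)   # (b, n, c)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # (b, n, n)
        # (2-3) apply the attention to the values and refine with a final 1x1 conv
        refined = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return self.out(refined)                       # refined image-text fusion feature
```

One module of this kind would be instantiated per feature level, with img_channels set to that level's channel count.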
(3) The multi-level refined image-text fusion features obtained so far lie in different feature spaces and represent their respective styles. To make better use of the multi-level feature representation and to explore the interaction between different levels, the invention uses a bilinear fusion method (the structure of the bilinear fusion module is shown in FIG. 3) to combine the mid-level and high-level features into a high-order feature, which produces a richer representation and lets the model learn the correlation between the mid-level and high-level features. The specific method is as follows:
(3-1) Since the feature dimensions of the mid and high levels are 512 and 1024 respectively, they first have to be mapped into the same dimensional space before they can be fused. Two linear layers R and S therefore map the mid-level and high-level features into the same dimension.
(3-2) An element-wise product preliminarily fuses the two levels, and a third linear layer P then extracts finer feature information and transforms it into the final high-order feature:

φ_MH = P( R(φ_M) ⊙ S(φ_H) )

where φ_M and φ_H denote the mid-level and high-level features and ⊙ denotes the element-wise product.
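A corresponding sketch of the bilinear fusion step, with the joint and output dimensions chosen arbitrarily for the example:

```python
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    """Combines a mid-level (512-d) and a high-level (1024-d) feature into one
    high-order feature via two projections, an element-wise product and a final
    linear layer, following the description above."""
    def __init__(self, mid_dim=512, high_dim=1024, joint_dim=1024, out_dim=1024):
        super().__init__()
        self.R = nn.Linear(mid_dim, joint_dim)    # maps the mid-level feature
        self.S = nn.Linear(high_dim, joint_dim)   # maps the high-level feature
        self.P = nn.Linear(joint_dim, out_dim)    # extracts the final high-order feature

    def forward(self, f_mid, f_high):
        joint = self.R(f_mid) * self.S(f_high)    # element-wise product fuses the two levels
        return self.P(joint)                      # high-order feature
```

Applied once to the image-text fusion features and once to the target-image features, this yields the two high-order features compared in step (4).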
(4) So that the model learns the similarity relations between features, i.e., similar features are close and dissimilar features are far apart, every level is trained with a bidirectional triplet loss. With φ_j denoting the image-text fusion feature, ψ_j the matched target-image feature, g(ψ_j) the linearly mapped image feature, h the modified-text feature, a superscript '-' marking negative samples, d(·,·) the Euclidean distance and m the margin (0.2 in this embodiment), the loss over the levels j ∈ {L, M, H, MH} (low, mid, high and high-order) is:

L_tri = Σ_j [ max(0, d(φ_j, ψ_j) - d(φ_j, ψ_j^-) + m) + max(0, d(φ_j, ψ_j) - d(φ_j^-, ψ_j) + m)
            + max(0, d(g(ψ_j), h) - d(g(ψ_j^-), h) + m) + max(0, d(g(ψ_j), h) - d(g(ψ_j), h^-) + m) ]

The first two terms push the image-text fusion feature of each level and the high-order image-text fusion feature φ_j closer than the negative sample φ_j^- to the matched initial features and high-order image feature ψ_j of the target image; the latter two terms align the mapped image features with the corresponding modified-text feature h, so that the learned fusion features carry semantics closer to those of the modified text.
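The fusion-target part of this loss could be written as the following sketch (only the first two terms are shown; treating the other samples of the batch as negatives is an assumed sampling strategy):

```python
import torch.nn.functional as F

def triplet_term(anchor, positive, negative, margin=0.2):
    """One hinge term: pull the anchor closer to the positive than to the negative."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def multi_order_triplet_loss(fused, target, fused_neg, target_neg, margin=0.2):
    """Bidirectional triplet loss summed over the levels {low, mid, high, high-order};
    each argument is a dict mapping a level name to a (batch, dim) tensor."""
    loss = 0.0
    for j in fused:
        loss = loss + triplet_term(fused[j], target[j], target_neg[j], margin)  # vs. negative target
        loss = loss + triplet_term(target[j], fused[j], fused_neg[j], margin)   # vs. negative fusion
    return loss
```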
(5) To further promote the fusion of the reference picture and the modified text, the invention designs a discriminator network that competes against the retrieval network and computes an adversarial hinge loss. The specific steps are as follows:
(5-1) The reference-image features, the text feature and the target-image features are input into the discriminator to learn the correct pairing relation. The discriminator first subtracts the reference-image feature from the target-image feature at each level and applies a linear-layer transform, multiplies the result with the text feature transformed by another linear layer, sums the values of the product, and applies a tanh activation to obtain the predicted value:

D(x_j, ψ_j, h) = tanh( sum( W_1(ψ_j - x_j) ⊙ W_2 h ) )

The predicted value expresses the degree to which the difference between the target-image feature ψ_j and the reference-image feature x_j agrees with the modified text h. The real loss is calculated as

L_real = E[ max(0, 1 - D(x_j, ψ_j, h)) ]

where E[·] denotes averaging within the batch, D denotes the discriminator's predicted value, W_1 and W_2 denote the two linear layers, and x_j denotes the reference-image features, comprising the pooled initial features and the high-order image feature obtained with the same high-order feature construction as for the target image (right half of FIG. 1).
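An illustrative PyTorch sketch of such a discriminator for one feature level; the feature dimensions are assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Predicts how well the difference between a target-image feature and a
    reference-image feature matches the modified-text feature, following (5-1)."""
    def __init__(self, img_dim=1024, text_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)    # W1: transforms the feature difference
        self.text_proj = nn.Linear(text_dim, hidden_dim)  # W2: transforms the text feature

    def forward(self, ref_feat, tgt_feat, text_feat):
        diff = self.img_proj(tgt_feat - ref_feat)         # (batch, hidden)
        txt = self.text_proj(text_feat)                   # (batch, hidden)
        return torch.tanh((diff * txt).sum(dim=1))        # scalar prediction per sample
```

The same module is reused in (5-2) and (5-3), with shuffled image features and with the fusion features taking the place of the target features.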
(5-2) By the same method as in (5-1), the shuffled reference-image features, the text feature and the shuffled target-image features are input into the discriminator to learn the wrong pairing relation. The error loss is calculated as

L_err = E[ max(0, 1 + D(x_j^π, ψ_j^π, h)) ]

where x_j^π and ψ_j^π denote the shuffled x_j and ψ_j; because of the shuffling, the image feature difference can no longer satisfy the requirement of the modified text.
(5-3) By the same method as in (5-1), the reference-image features, the text feature and the image-text fusion features are input into the discriminator to calculate the fusion loss

L_fuse = E[ max(0, 1 + D(x_j, φ_j, h)) ]

The discrimination loss finally minimized by the discriminator is

L_dis = L_real + L_err + L_fuse

(5-4) Contrary to the training target of the discriminator, the retrieval network itself acts as the 'generator' of the fusion features and improves itself by minimizing the generation loss

L_gen = E[ max(0, 1 - D(x_j, φ_j, h)) ]

As the retrieval network and the discriminator are optimized towards opposite targets, the fused image and text features come to agree better with the target image.
(5-5) The final loss function of the whole network is a weighted sum of the retrieval loss and the adversarial loss. The generation-side loss is

L_G = L_tri + α · L_gen

and the discrimination-side loss is

L_D = L_tri + α · L_dis

where α is a hyper-parameter set to 0.01. During actual training, L_G is used to train the retrieval part of the network and L_D is used to train the discriminator, alternately, so that the performance of both the retrieval network and the discriminator improves.
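Putting the pieces together, the alternating adversarial training of (5-1) to (5-5) could look roughly like the sketch below; retrieval_net, train_loader, the learning rate and the single-level simplification are assumptions, and triplet_term and Discriminator refer to the earlier sketches:

```python
import torch
import torch.nn.functional as F

def hinge_real(d):   # the discriminator should score correct pairs high
    return F.relu(1.0 - d).mean()

def hinge_fake(d):   # ... and shuffled pairs / fusion features low
    return F.relu(1.0 + d).mean()

alpha = 0.01                                                   # adversarial weight, as stated above
opt_R = torch.optim.Adam(retrieval_net.parameters(), lr=1e-4)  # retrieval network (assumed module)
opt_D = torch.optim.Adam(discriminator.parameters(), lr=1e-4)  # Discriminator sketched earlier

for ref_img, text, tgt_img in train_loader:                    # hypothetical data loader
    # retrieval_net is assumed to return, for one level (for brevity), the fusion
    # feature, the target feature, the pooled reference feature and the text feature h.
    fused, tgt, ref, h = retrieval_net(ref_img, text, tgt_img)
    perm = torch.randperm(ref.size(0))                         # indices for mismatched pairs

    # 1) update the retrieval network ("generator"): triplet loss + generation loss
    loss_tri = triplet_term(fused, tgt, tgt[perm]) + triplet_term(tgt, fused, fused[perm])
    loss_R = loss_tri + alpha * hinge_real(discriminator(ref, fused, h))
    opt_R.zero_grad(); loss_R.backward(); opt_R.step()

    # 2) update the discriminator: real + error + fusion hinge losses
    #    (the triplet term has no gradient w.r.t. the discriminator, so it is omitted here)
    fused, tgt, ref, h = [t.detach() for t in retrieval_net(ref_img, text, tgt_img)]
    loss_D = (hinge_real(discriminator(ref, tgt, h))                # correct pairs
              + hinge_fake(discriminator(ref[perm], tgt[perm], h))  # shuffled pairs
              + hinge_fake(discriminator(ref, fused, h)))           # fusion features
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```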
(6) Through the training of step (5), the model has learned how to fuse the image and text features. Given a reference image and a modified text, the model retrieves the target image that satisfies the modification requirement from the candidate pictures by the following steps:
and (6-1) inputting the candidate pictures into a retrieval network without fusion to obtain middle-high-level features and high-level features, and splicing the middle-high-level features and the high-level features together to construct a candidate feature library.
(6-2) inputting the given reference image and the modified text into a retrieval network for fusion to obtain middle-high-level and high-level fusion characteristics, splicing the fusion characteristics together, calculating cosine similarity with the candidate characteristics one by one, sorting according to the similarity, and returning the sorted result as a retrieval result.
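For illustration, the retrieval stage in (6-1) and (6-2) might be sketched as follows; encode_image, fuse and candidate_loader are assumed interfaces of the trained retrieval network rather than names from the patent:

```python
import torch
import torch.nn.functional as F

def build_candidate_library(retrieval_net, candidate_loader):
    """(6-1) Encode every candidate image (no fusion) and concatenate its
    multi-level and high-order features into one vector per candidate."""
    feats = []
    with torch.no_grad():
        for imgs in candidate_loader:
            level_feats = retrieval_net.encode_image(imgs)   # assumed: list of per-level features
            feats.append(torch.cat(level_feats, dim=1))
    return torch.cat(feats, dim=0)                           # (num_candidates, dim)

def query(retrieval_net, ref_img, text, library, topk=10):
    """(6-2) Fuse one reference image with its modified text, concatenate the
    multi-level and high-order fusion features, rank candidates by cosine similarity."""
    with torch.no_grad():
        fusion_feats = retrieval_net.fuse(ref_img, text)     # assumed: list of per-level fusions
        q = torch.cat(fusion_feats, dim=1)                   # query vector, batch size 1
    sims = F.cosine_similarity(q, library)                   # similarity to every candidate
    return sims.topk(topk).indices                           # indices of the best-ranked candidates
```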
The performance comparison between the image retrieval network constructed by the method of the invention and other commonly used image retrieval methods on the Fashion200k dataset is shown in Table 1; the results show that the method of the invention greatly improves the performance and efficiency of combined query image retrieval.
TABLE 1 Comparison of retrieval network performance of different methods
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (6)

1. A combined query image retrieval method based on multi-order adversarial feature learning, characterized in that a retrieval network is used to obtain multi-level features and high-order features of candidate images; a reference image is fused with modified-text features to obtain multi-level image-text fusion features and a high-order image-text fusion feature; the multi-level and high-order image-text fusion features are concatenated, and the cosine similarity with the concatenated multi-level and high-order features of the candidate images is computed one by one; the candidates are sorted by similarity, and the sorted candidate images are returned as the retrieval result of the query image; the retrieval network comprises a feature extraction module, a self-attention fusion module and a bilinear fusion module; the retrieval network is constructed and trained by the following steps:
(1) performing feature extraction on the reference image, the modified text and the target image with the feature extraction module to obtain initial features of the two modalities, wherein the initial features of the reference image and of the target image comprise image features output by multiple levels of the feature extraction module;
(2) fusing the initial text features obtained in step (1) with the features of the reference image at the different levels with the self-attention fusion module to obtain the multi-level image-text fusion features;
(3) fusing the multi-level image-text fusion features of step (2) across levels with the bilinear fusion module to obtain the high-order image-text fusion feature, and fusing the multi-level initial features of the target image of step (1) across levels to obtain the high-order image feature;
(4) comparing the multi-level image-text fusion features obtained in step (2) with the multi-level features of the target image obtained in step (1), and the high-order image-text fusion feature obtained in step (3) with the high-order image feature of the target image, for similarity learning;
(5) inputting the multi-level initial features, the multi-level image-text fusion features and the high-order image-text fusion feature of the reference image, the text feature, and the multi-level initial features and high-order image feature of the target image into a discriminator, which judges whether the image-text fusion features of the different levels or the target-image features, relative to the corresponding reference-image features, satisfy the modification requirement of the text, for adversarial learning; finally training the retrieval network in an end-to-end manner; the reference-image features comprising the multi-level initial features of the reference image and the high-order image feature obtained by fusing those multi-level initial features across levels; the adversarial training method using the discriminator comprising the following steps:
(5-1) subtracting the initial features and the high-order image feature of the input reference image from the corresponding image features of the target image and transforming the difference with a linear layer;
(5-2) transforming the input modified-text feature with a linear layer, multiplying the transformed text feature with the feature obtained from the linear-layer transform of step (5-1), summing the entries of the product and applying an activation function to obtain the discriminator's predicted value, which judges whether the target-image feature, compared with the reference image, satisfies the requirement of the modified text; the real loss function being expressed as:

L_real = E[ max(0, 1 - D(x_j, ψ_j, h)) ]

where E[·] denotes averaging within the batch, D denotes the discriminator's predicted value, ψ_j denotes the target-image features, h denotes the modified-text feature, and x_j denotes the reference-image features, comprising the initial features and the high-order image feature obtained with the same high-order feature construction as for the target image;
(5-3) for the shuffled initial and high-order image features of the reference image and the correspondingly shuffled image features of the target image, the discriminator judging, by the method of steps (5-1)-(5-2), that the wrongly paired image features do not satisfy the requirement of the modified text, the error loss function being expressed as:

L_err = E[ max(0, 1 + D(x_j^π, ψ_j^π, h)) ]

where x_j^π and ψ_j^π denote the shuffled x_j and ψ_j;
(5-4) the discriminator's judgement of the multi-level image-text fusion features determining both the generation loss of the retrieval network and the adversarial loss of the discriminator, the adversarial loss of the discriminator being the fusion loss, expressed as:

L_fuse = E[ max(0, 1 + D(x_j, φ_j, h)) ]

and the generation loss of the retrieval network being:

L_gen = E[ max(0, 1 - D(x_j, φ_j, h)) ]

where φ_j denotes the image-text fusion feature;
(5-5) the losses calculated by the above steps, together with the triplet loss, supervising the end-to-end training of the model.
2. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein in step (1) the feature extraction module comprises an LSTM network and a MobileNet convolutional neural network, wherein:
carrying out feature extraction on an input text by using an LSTM network to obtain initial features of the text;
and respectively extracting the features of the reference image and the paired target images by utilizing a pre-trained MobileNet or ResNet18 convolutional neural network, and obtaining the initial image features of different layers from the lower layer, the middle layer and the upper layer of the network.
3. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein the self-attention fusion module comprises a convolution layer, a self-attention network and a linear layer, and step (2) specifically comprises the following sub-steps:
(2-1) connecting the initial features of each layer of the reference image with the text features, and obtaining preliminary fusion features by using a convolution layer;
and (2-2) learning the preliminary fusion features to further image-text fusion features by utilizing a self-attention network and a linear layer.
4. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein the bilinear fusion module comprises a plurality of linear layers, and the method for obtaining the high-order features in step (3) comprises the following sub-steps:
(3-1) mapping the multilevel image-text fusion characteristics in the step (2) and the multilevel characteristics of the target image in the step (1) into the same dimensionality by using linear layers respectively;
(3-2) performing dot product on the image-text fusion characteristics of the plurality of layers mapped in the step (3-1), and then obtaining high-order image-text fusion characteristics by using a linear layer; and (4) performing dot product on the features of the target images of the plurality of layers mapped in the step (3-1), and then using a linear layer to obtain the high-order image features of the target images.
5. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein in step (4) the multi-level image-text fusion features obtained in step (2) are compared with the multi-level features of the target image obtained in step (1), and the high-order image-text fusion feature obtained in step (3) is compared with the high-order image feature of the target image, using a defined triplet loss, so that the model learns the similarity relations between features of different levels and orders, the triplet loss being expressed as:

L_tri = Σ_j [ max(0, d(φ_j, ψ_j) - d(φ_j, ψ_j^-) + m) + max(0, d(φ_j, ψ_j) - d(φ_j^-, ψ_j) + m)
            + max(0, d(g(ψ_j), h) - d(g(ψ_j^-), h) + m) + max(0, d(g(ψ_j), h) - d(g(ψ_j), h^-) + m) ]

where φ_j denotes the image-text fusion feature, ψ_j the matched initial feature or high-order image feature of the target image, g(·) the linear mapping applied to the image feature, h the modified-text feature, the superscript '-' marks negative samples, d(·,·) denotes the Euclidean distance, m denotes the margin, and j runs over the different levels and the high order; the first two terms push the image-text fusion feature of each level and the high-order image-text fusion feature φ_j closer than the negative sample φ_j^- to the matched initial features and high-order image feature ψ_j of the target image; the latter two terms align the mapped image features with their corresponding modified-text feature h.
6. The combined query image retrieval method based on multi-order adversarial feature learning as claimed in claim 1, wherein the multi-level features and the high-order features of the candidate images obtained by the retrieval network are concatenated to construct a candidate feature library.
CN202110185641.9A 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning Active CN112818157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110185641.9A CN112818157B (en) 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110185641.9A CN112818157B (en) 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning

Publications (2)

Publication Number Publication Date
CN112818157A (en) 2021-05-18
CN112818157B (en) 2022-09-16

Family

ID=75865297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110185641.9A Active CN112818157B (en) 2021-02-10 2021-02-10 Combined query image retrieval method based on multi-order adversarial feature learning

Country Status (1)

Country Link
CN (1) CN112818157B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723986A (en) * 2022-03-16 2022-07-08 平安科技(深圳)有限公司 Text image matching method, device, equipment and storage medium
CN115905610B (en) * 2023-03-08 2023-05-26 成都考拉悠然科技有限公司 Combined query image retrieval method of multi-granularity attention network
CN117932099A (en) * 2024-03-21 2024-04-26 大连海事大学 Multi-mode image retrieval method based on modified text feedback

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110275972A (en) * 2019-06-17 2019-09-24 浙江工业大学 A kind of case retrieval methods based on content introducing dual training

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319686B (en) * 2018-02-01 2021-07-30 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space
CN109145974B (en) * 2018-08-13 2022-06-24 广东工业大学 Multilevel image feature fusion method based on image-text matching
CN110298395B (en) * 2019-06-18 2023-04-18 天津大学 Image-text matching method based on three-modal confrontation network
CN110516085B (en) * 2019-07-11 2022-05-17 西安电子科技大学 Image text mutual retrieval method based on bidirectional attention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992686A (en) * 2019-02-24 2019-07-09 复旦大学 Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110275972A (en) * 2019-06-17 2019-09-24 浙江工业大学 A kind of case retrieval methods based on content introducing dual training

Also Published As

Publication number Publication date
CN112818157A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112818157B (en) Combined query image retrieval method based on multi-order adversarial feature learning
WO2023093574A1 (en) News event search method and system based on multi-level image-text semantic alignment model
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN110599592B (en) Three-dimensional indoor scene reconstruction method based on text
CN113657450A (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN112860930B (en) Text-to-commodity image retrieval method based on hierarchical similarity learning
CN111159367A (en) Information processing method and related equipment
CN114328807A (en) Text processing method, device, equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN116049450A (en) Multi-mode-supported image-text retrieval method and device based on distance clustering
CN114048295A (en) Cross-modal retrieval method and system for data processing
CN111368176B (en) Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN111680264A (en) Multi-document reading understanding method
CN114926742A (en) Loop detection and optimization method based on second-order attention mechanism
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
Jin et al. Discriminant zero-shot learning with center loss
CN113779283A (en) Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN111339258A (en) University computer basic exercise recommendation method based on knowledge graph
CN116628192A (en) Text theme representation method based on Seq2Seq-Attention
CN113723111B (en) Small sample intention recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant