CN117235114A - Retrieval method based on cross-modal semantics and mixed counterfactual training
- Publication number: CN117235114A
- Application number: CN202311224075.3A
- Authority: CN (China)
- Priority/filing date: 2023-09-20
- Legal status: Pending
Classifications
- Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a retrieval method based on cross-modal semantics and mixed counterfactual training, which comprises the following steps: A. acquiring representations of the reference image I_R, the target image I_T and the query text T_Q; B. establishing a cross-modal representation modification module and a representation absorption synthesis module, and modeling the visual-language representation in three-level cascaded reasoning; C. constructing mixed counterfactual samples; D. modeling the global-local combination, capturing local-global information at different scales and from different modalities, and capturing the implicit visual modifications and preservations in the reference image from a final composite representation derived from the bottom-up hierarchical combination; E. retrieving the multi-modal combination by learning matching pairs, adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image; F. judging the retrieval result with a loss function. The invention remedies the defects of the prior art and improves image retrieval accuracy.
Description
Technical Field
The invention relates to the technical field of video image retrieval in intelligent security systems, and in particular to a retrieval method based on cross-modal semantics and mixed counterfactual training.
Background
To obtain the combined representation for multi-modal retrieval, existing approaches rely primarily on cross-modal interaction and fusion of global semantics and local features, or on simple separate/cascaded feature-learning schemes that encode the reference image and the query text with their respective encoders. From the perspective of deep understanding, however, such combinations cannot capture the intrinsic relationship between the informative features of the bottom layers and the abstract semantics of the top layers:
(i) detailed local features of image regions carry descriptive cues but lack a global semantic understanding of the query;
(ii) abstract global semantics are learned through increasing abstraction along the hierarchical order of the encoder, but lack the concrete properties rooted at different local locations.
Simply exploiting these features without properly modeling the global-local combination may therefore lead to degraded solutions.
Disclosure of Invention
The invention aims to provide a retrieval method based on cross-modal semantics and mixed counterfactual training, which remedies the defects of the prior art and improves image retrieval accuracy.
To solve the above technical problems, the invention adopts the following technical scheme.
A retrieval method based on cross-modal semantics and mixed counterfactual training comprises the following steps:
A. establishing a multi-granularity visual representation module and a global-local text embedding module to obtain representations of the reference image I_R, the target image I_T and the query text T_Q at different layers;
B. establishing a cross-modal representation modification module and a representation absorption synthesis module to form a bottom-up cross-modal semantic synthesis module, and modeling the visual-language representation in three-level cascaded reasoning;
C. constructing mixed counterfactual samples;
D. establishing a local feature module, a global-local feature absorption module and a global semantic module; modeling the global-local combination with the representation absorption synthesis module, so that the designed transformer-based modification and absorption blocks capture local-global information across layers, at different scales and from different modalities, from bottom-level local features to top-level global semantics; capturing the implicit visual modifications and preservations in the reference image, which vary with the text modifier, from a final composite representation derived from the bottom-up hierarchical combination;
E. learning matching pairs M((I_R, T_Q), I_T) to retrieve the multi-modal combination, then adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image by maximizing the similarity

k(φ(I_R, T_Q), ψ(I_T)),

where θ denotes the parameterized excitation, k(·,·) computes similarity, and φ(·) and ψ(·) are the composite encoder and the image encoder, respectively;
F. judging the retrieval result with a loss function.
Preferably, in step A:
The multi-granularity visual representation module uses a vision transformer to generate a discriminative representation of the visual content of the image; the shallow layers of the vision transformer extract basic syntactic information, while its deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation.
The global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, adds the special tokens [cls] and [end] to the sub-word token sequence, and divides the query text into local word embeddings and a global sentence embedding; for the local word embeddings, the word embedding output of the first layer is taken; as BERT proceeds, contextual token and self-attention interactions are performed over multiple steps, the last BERT layer then represents the global information of the token word embeddings in the given text, and the token word embedding representations are concatenated to form the global sentence embedding.
Preferably, in step B:
The cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer and a soft attention layer. To self-discover the latent region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, the module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting R_R under the query pattern, thereby learning image retrieval more effectively. Both embeddings are fed into a self-attention layer with layer normalization and residual connection,

R̂_R = L̃_n(J(R_R)), R̂_Q = L̃_n(J(R_Q)),

where L_n and L̃_n denote layer normalization and its residual-connected form, PSA denotes the pyramid pooling attention operation, MSA denotes the multi-head self-attention operation, and J(·) applies PSA when its input belongs to the visual modality v and MSA otherwise; the self-attention thus yields the self-attended representations R̂_R of the reference modality and R̂_Q of the query modality.

MSA self-attention captures non-local correlations for feature transformation. On top of the self-attention, a pyramid pooling cross attention CSA is introduced, and the bidirectional cross attention with layer normalization and residual connection yields

R̃_{R→Q} = L̃_n(CSA(R̂_R, R̂_Q)), R̃_{Q→R} = L̃_n(CSA(R̂_Q, R̂_R)).

A soft attention layer then processes the latent relationships between R̃_{R→Q} and R̃_{Q→R} in terms of image transformation and preservation, so that the final cross-modal composition is

SOA_{Q→R}(R̃_{R→Q}, R̃_{Q→R}),

where SOA_{Q→R}(·) is the soft attention operation.
the characterization absorption synthesis module comprises a self-attention layer and a residual attention layer, and is constructed according to a hierarchical sequence; for absorbing meaningful information from the local feature space and then creating an information composition to enhance robustness of subsequent query target matches; characterization at local and global level, R respectively L ∈R Nc×dc And R is G ∈R Nc×dc Characterization of the absorption synthesis moduleRealization of slave R L Absorbs meaningful and distinguishing information as R G Generating a priori knowledge guidance of the synthesis characterization; the absorption process is as follows,
after self-attention modeling, use is made ofAnd->Generating an intermediate embedding, the global semantic representation obtaining a higher weight, and +.>As a priori guidance for the generation of the combination, the demander is then known to ask for meaningful information, via a residual attention mechanism, as shown in the following equation,
wherein, [ ·, ]]Representing series operation, T l Representing a nonlinear transformation;
fusion embedding performs layer normalization of residual connection. And input into the feedforward layer to obtain the final absorption characteristic representation R AC Such a combination represents R AC Absorbs useful knowledge from the local features, improves the accuracy of query matching targets,
wherein,
preferably, in step D,
the local feature module locally represents the correction module by using a transmembrane stateAnd->Learning local feature combination->
Global-local feature absorption module combines from local featuresAbsorption meaningful and discriminative information that plays a role in guiding in advance from the local feature layer to robustness to subsequent synthesis-target matching, deriving global-local absorption combinations using a token absorption synthesis module +.>Is a potential embedding of (a);
the global semantic composition module models final composition from visual and linguistic domainsThrough aggregation of intermediate representationsIs a basic semantic and query text global semantic potential vector +.>To update the output of the stream.
Preferably, in step F, the loss function consists of a bidirectional triplet loss, a reconstruction loss and a domain alignment loss.

The bidirectional triplet loss is derived from second-degree contrastive negative samples and constructs the fine-grained query-target correspondence between the input query and the target image, ensuring that the composition semantically matches the target representation with high similarity. It is defined as

L_tri(X, Y, m) = max(0, ||X^+ - Y||_2 - ||X^- - Y||_2 + m)

L_bid(C_que, C_tar, m, m_a) = λ_q · L_tri(C_que, C_tar, m) + λ_i · L_tri(C_tar, C_que, m_a)

where X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ||·||_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar. The adaptive margin m_a is controlled by the hyperparameter α: when s_qt approaches 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training.

The reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the latent embedding C_tar and the query text embedding, respectively; the goal is to balance the utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination, with λ_img and λ_text as pre-training hyperparameters.

The domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, aligning the representation distributions of the different domains with optimal transport (OT) to bridge the gap between them. First, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; then each feature is assigned, with different weights, to the features of the other modality to predict the correspondence, and the Wasserstein distance W_d is used to match the synthesized domain to the target domain. The alignment loss is

L_ali = λ_a · W_d(C, T).
the beneficial effects brought by adopting the technical scheme are as follows:
(i) And a mode characterization modification module and a characterization absorption synthesis module are provided for implicit bottom-up semantic synthesis modeling. The key idea of this step is to achieve complementary synergy between the bottom-up visual representations by utilizing complementary global local representations of the different encoder layers, thereby achieving selective modification of the relevant image features and ensuring preservation of unaltered features, which is critical to an accurate retrieval method.
(ii) A mixed counterfactual training strategy for plug and play (note: mixed counterfactual is presented here instead of counterfactual). The strategy aims at facilitating the construction of a fine-grained query-target correspondence by a retrieval model to achieve robust image retrieval. The policy can be used as a plug-and-play component to increase the query sensitivity of the retrieval model. In particular, three new different types of counterfactual samples are constructed, image independent, text independent and context preserving. The mixed sample realizes an explicit bidirectional corresponding learning mechanism, is beneficial to establishing one-to-one matching of the combined query and the expected image, and reduces the prediction uncertainty of the model on the similar query.
(iii) The method is characterized by designing a bottom-up cross-modal semantic synthesis and hierarchical combined reasoning, combining cross-granularity semantic update learning and understanding composite image text representation, gradually digesting information flows from vision and language from two new perspectives corresponding to an implicit bottom-up visual representation synthesis and an explicit fine-granularity query target structure, and achieving the challenging task of solving content-based image retrieval. The combination can effectively capture hidden visual modification and storage according to different text modifiers, thereby achieving the effect superior to the prior image retrieval technology.
Drawings
FIG. 1 is an example of multimodal retrieval;
FIG. 2 is the content-based image retrieval framework.
Detailed Description
One embodiment of the present invention comprises the following steps:
A. A multi-granularity visual representation module and a global-local text embedding module are established to obtain representations of the reference image I_R, the target image I_T and the query text T_Q at different layers.
The multi-granularity visual representation module uses a vision transformer to generate a discriminative representation of the visual content of the image; the shallow layers of the vision transformer extract basic syntactic information, while its deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation.
global local text embedding module uses BERT to query text T Q Marking into M subword marking sequences, and then placing special marking positions [ cls ]]And [ end ]]Leading the text sub-word mark sequence, and dividing the query text into local word embedding and global sentence embedding; for partial word embedding, the word embedding output of the first layer isWith the advancement of BERT, the contextual markup and self-attention interaction is performed in multiple steps, and then the last layer of BERT represents global information of the tag word embedding in a given text, and the tag word embedding representations are connected to form a global sentence embedding->
B. A cross-modal representation modification module and a representation absorption synthesis module are established to form a bottom-up cross-modal semantic synthesis module, and the visual-language representation is modeled in three-level cascaded reasoning.
The cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer and a soft attention layer. To self-discover the latent region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, the module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting R_R under the query pattern, thereby learning image retrieval more effectively. Both embeddings are fed into a self-attention layer with layer normalization and residual connection,

R̂_R = L̃_n(J(R_R)), R̂_Q = L̃_n(J(R_Q)),

where L_n and L̃_n denote layer normalization and its residual-connected form, PSA denotes the pyramid pooling attention operation, MSA denotes the multi-head self-attention operation, and J(·) applies PSA when its input belongs to the visual modality v and MSA otherwise; the self-attention thus yields the self-attended representations R̂_R of the reference modality and R̂_Q of the query modality.

MSA self-attention captures non-local correlations for feature transformation. On top of the self-attention, a pyramid pooling cross attention CSA is introduced, and the bidirectional cross attention with layer normalization and residual connection yields

R̃_{R→Q} = L̃_n(CSA(R̂_R, R̂_Q)), R̃_{Q→R} = L̃_n(CSA(R̂_Q, R̂_R)).

A soft attention layer then processes the latent relationships between R̃_{R→Q} and R̃_{Q→R} in terms of image transformation and preservation, so that the final cross-modal composition is

SOA_{Q→R}(R̃_{R→Q}, R̃_{Q→R}),

where SOA_{Q→R}(·) is the soft attention operation.
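A minimal sketch of such a modification block follows (illustrative only): plain multi-head self-attention stands in for the pyramid pooling attention PSA, and a learned sigmoid gate stands in for the soft attention SOA; the pooling of the query-side stream is an assumption made to reconcile sequence lengths.

```python
import torch
import torch.nn as nn

class CrossModalModify(nn.Module):
    """Sketch: per-modality self-attention with LayerNorm + residual,
    bidirectional cross-attention, then a gate over modify vs. preserve."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.csa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.gate = nn.Linear(2 * dim, dim)           # stand-in for soft attention SOA

    def forward(self, R_R, R_Q):                      # (B, Nk, d), (B, Mk, d)
        R_hat_R = self.ln1(R_R + self.msa(R_R, R_R, R_R)[0])  # self-attended reference
        R_hat_Q = self.ln1(R_Q + self.msa(R_Q, R_Q, R_Q)[0])  # self-attended query
        R2Q = self.ln2(R_hat_R + self.csa(R_hat_R, R_hat_Q, R_hat_Q)[0])  # R attends to Q
        Q2R = self.ln2(R_hat_Q + self.csa(R_hat_Q, R_hat_R, R_hat_R)[0])  # Q attends to R
        pooled_Q2R = Q2R.mean(dim=1, keepdim=True).expand_as(R2Q)
        # gate decides, per reference token, what to modify vs. preserve
        w = torch.sigmoid(self.gate(torch.cat([R2Q, pooled_Q2R], dim=-1)))
        return w * R2Q + (1 - w) * pooled_Q2R         # final cross-modal composition

mod = CrossModalModify()
comp = mod(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
```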
the characterization absorption synthesis module comprises a self-attention layer and a residual attention layer, and is constructed according to a hierarchical sequence; for absorbing meaningful information from the local feature space and then creating an information composition to enhance robustness of subsequent query target matches; characterization at local and global level, R respectively L ∈R Nc×dc And R is G ∈R Nc×dc Characterizing the absorption synthesis module to achieve the slave R L Absorbs meaningful and distinguishing information as R G Generating a priori knowledge guidance of the synthesis characterization; the absorption process is as follows,
after self-attention modeling, use is made ofAnd->Generating an intermediate embedding, the global semantic representation obtaining a higher weight, and +.>As a priori guidance for the generation of the combination, the demander is then known to ask for meaningful information, via a residual attention mechanism, as shown in the following equation,
wherein, [ ·, ]]Representing series operation, T l Representing a nonlinear transformation;
fusion embedding performs layer normalization of residual connection. And input into the feedforward layer to obtain the final absorption characteristic representation R AC Such a combination represents R AC Absorbs useful knowledge from the local features, improves the accuracy of query matching targets,
wherein,
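A minimal sketch of this absorption step is given below (illustrative only; the layer widths and the GELU nonlinearity used for T_l are assumptions).

```python
import torch
import torch.nn as nn

class AbsorbSynthesize(nn.Module):
    """Sketch: global tokens (the prior) query the local tokens through
    residual attention; concatenation + nonlinearity stands in for T_l([.,.]),
    followed by LayerNorm with residual and a feed-forward layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.res_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_l = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())
        self.ln = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, R_L, R_G):                      # local (B, Nl, d), global (B, Ng, d)
        R_L = R_L + self.self_attn(R_L, R_L, R_L)[0]  # self-attention modeling
        R_G = R_G + self.self_attn(R_G, R_G, R_G)[0]
        absorbed = R_G + self.res_attn(R_G, R_L, R_L)[0]  # global absorbs local knowledge
        mixed = self.t_l(torch.cat([absorbed, R_G], dim=-1))  # intermediate embedding R_M
        return self.ffn(self.ln(mixed + R_G))         # final absorbed representation R_AC

m = AbsorbSynthesize()
r_ac = m(torch.randn(2, 49, 256), torch.randn(2, 8, 256))
```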
C. Mixed counterfactual samples are constructed, comprising the following steps.
C1. Constructing image-independent and text-independent counterfactual samples.
Given a reference image I_R and its corresponding query text T_Q, BERT is used as a pre-trained bidirectional language model to find, by language similarity, the texts with minimum relevance and their corresponding images; pairing these texts and images with the original reference image and query text forms the image-independent and text-independent counterfactual samples.
C2. Constructing context-preserving counterfactual samples.
First, words of the query text T_Q are masked according to a priori known attributes, preliminary candidate words are generated and replaced by random words to obtain k_1 samples; these are then fed into BERT together with the original T_Q to compute semantic similarity; finally, the top k_2 texts are selected as context-preserving queries for the reference image, where the selected context-preserving negative samples are ranked by the BERT-measured probability P_s and k_2 is the number selected in the second stage.
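The two-stage construction of context-preserving samples can be sketched as follows (illustrative only): the attribute-word set, the replacement vocabulary, and the mean-pooled BERT sentence vectors with cosine similarity are assumptions standing in for the probability P_s.

```python
import random
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def sent_vec(s: str) -> torch.Tensor:
    enc = tok(s, return_tensors="pt")
    return bert(**enc).last_hidden_state.mean(dim=1).squeeze(0)

def context_preserving_negatives(query: str, attribute_words, vocab, k1=20, k2=5):
    words = query.split()
    # stage 1: mask attribute words and replace them with random words (k1 samples)
    candidates = [" ".join(random.choice(vocab) if w in attribute_words else w
                           for w in words)
                  for _ in range(k1)]
    # stage 2: rank candidates by BERT semantic similarity to the original T_Q
    q = sent_vec(query)
    scored = sorted(candidates,
                    key=lambda c: torch.cosine_similarity(q, sent_vec(c), dim=0).item(),
                    reverse=True)
    return scored[:k2]                                # top-k2 context-preserving queries

negs = context_preserving_negatives(
    "replace the red dress with a blue one",
    attribute_words={"red", "blue"},
    vocab=["green", "yellow", "striped", "long"])
```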
D. A local feature module, a global-local feature absorption module and a global semantic module are established. The global-local combination is modeled with the representation absorption synthesis module, so that the designed transformer-based modification and absorption blocks capture local-global information across layers, at different scales and from different modalities, from bottom-level local features to top-level global semantics; the implicit visual modifications and preservations in the reference image, which vary with the text modifier, are captured from the final composite representation derived from the bottom-up hierarchical combination.
The local feature module uses the cross-modal local representation modification module on the local visual and text representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for the robustness of the subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the latent embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and language domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation and the latent vector of the query text's global semantics.
E. Matching pairs M((I_R, T_Q), I_T) are learned to retrieve the multi-modal combination; a θ-parameterized excitation is then added, and the learned image-text composite representation is finally aligned uniquely with the visual representation of the ground-truth target image by maximizing the similarity

k(φ(I_R, T_Q), ψ(I_T)),

where θ denotes the parameterized excitation, k(·,·) computes similarity, and φ(·) and ψ(·) are the composite encoder and the image encoder, respectively.
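One plausible reading of this objective, with cosine similarity as k(·,·) and θ as a learned scaling of the score, is sketched below; the patent does not fix the form of the excitation, so the temperature-style parameter is an assumption.

```python
import torch
import torch.nn.functional as F

# theta: a learnable scalar excitation applied to the similarity score (assumed form)
theta = torch.nn.Parameter(torch.ones(1))

def match_score(phi_out: torch.Tensor, psi_out: torch.Tensor) -> torch.Tensor:
    """k(phi(I_R, T_Q), psi(I_T)) with a theta-scaled cosine similarity."""
    return theta * F.cosine_similarity(phi_out, psi_out, dim=-1)

score = match_score(torch.randn(4, 256), torch.randn(4, 256))  # one score per pair
```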
F. The retrieval result is judged with the loss function.
The loss function consists of a bidirectional triplet loss, a reconstruction loss and a domain alignment loss.
The bidirectional triplet loss is derived from second-degree contrastive negative samples and constructs the fine-grained query-target correspondence between the input query and the target image, ensuring that the composition semantically matches the target representation with high similarity. It is defined as

L_tri(X, Y, m) = max(0, ||X^+ - Y||_2 - ||X^- - Y||_2 + m)

L_bid(C_que, C_tar, m, m_a) = λ_q · L_tri(C_que, C_tar, m) + λ_i · L_tri(C_tar, C_que, m_a)

where X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ||·||_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar. The adaptive margin m_a is controlled by the hyperparameter α: when s_qt approaches 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training.
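A sketch of these two losses follows (illustrative only): the exponential schedule for the adaptive margin m_a is an assumption, since the text only fixes its limiting behaviour at s_qt near 0.

```python
import torch
import torch.nn.functional as F

def l_tri(pos, neg, anchor, m):
    """L_tri(X, Y, m) = max(0, ||X+ - Y||_2 - ||X- - Y||_2 + m), averaged over the batch."""
    return F.relu((pos - anchor).norm(dim=-1) - (neg - anchor).norm(dim=-1) + m).mean()

def l_bid(c_que, c_tar, c_que_neg, c_tar_neg, m=0.2,
          lambda_q=1.0, lambda_i=1.0, alpha=1.0, m_max=0.4, m_min=0.05):
    # adaptive margin m_a: maximal when query/target similarity s_qt is near 0,
    # minimal otherwise (the exact decay in s_qt is assumed here)
    s_qt = F.cosine_similarity(c_que, c_tar, dim=-1).mean().clamp(min=0)
    m_a = m_min + (m_max - m_min) * torch.exp(-alpha * s_qt)
    return (lambda_q * l_tri(c_que, c_que_neg, c_tar, m)
            + lambda_i * l_tri(c_tar, c_tar_neg, c_que, m_a))

loss = l_bid(torch.randn(8, 256), torch.randn(8, 256),
             torch.randn(8, 256), torch.randn(8, 256))
```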
reconstruction loss L res Constraint C que Mapping of vision and language, defined by R img And R is text Representation, respectively with potential embedding C tar Andalignment, the goal is to embed a balanced utilization of text and images by reconstructing the specification and enhancing the combination,
wherein lambda is img And lambda (lambda) text Super parameters for pre-training;
the domain alignment penalty is then a further learning of the fine-grained semantic correspondence of the combined domain and the target image domain, where the representation distributions of the different domains are aligned using the optimal transmission OT to make up the gap between them. First, a cost matrix c between the characteristic distribution of the synthesized domain and the target domain is calculated m Secondly, classifying each feature into the features from another mode according to different weights to complete the prediction of the corresponding relation, utilizing Wasserstein Distance to correspond the synthesized domain and the target domain, wherein WassersteinDistance and alignment loss are shown in the following formula,
L ali =λ a W d (C,T)。
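The OT alignment can be sketched with entropic-regularized Sinkhorn iterations (illustrative only; ε and the iteration count are assumed hyperparameters, and uniform marginals are assumed over the token features of each domain).

```python
import torch

def sinkhorn_wasserstein(C_feats, T_feats, eps=0.05, iters=50):
    """Entropic-regularized Wasserstein distance between composed-domain and
    target-domain token features, computed with Sinkhorn scaling updates."""
    cost = torch.cdist(C_feats, T_feats, p=2)         # cost matrix c_m between distributions
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                    # uniform marginals (assumed)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):                            # alternating Sinkhorn updates
        u = mu / (K @ (nu / (K.t() @ u)))
    v = nu / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)        # soft correspondence weights
    return (plan * cost).sum()                        # approximate W_d(C, T)

lambda_a = 0.1
L_ali = lambda_a * sinkhorn_wasserstein(torch.randn(49, 256), torch.randn(49, 256))
```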
in the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and description merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (6)
1. A retrieval method based on cross-modal semantics and mixed counterfactual training, characterized by comprising the following steps:
A. establishing a multi-granularity visual representation module and a global-local text embedding module to obtain representations of the reference image I_R, the target image I_T and the query text T_Q at different layers;
B. establishing a cross-modal representation modification module and a representation absorption synthesis module to form a bottom-up cross-modal semantic synthesis module, and modeling the visual-language representation in three-level cascaded reasoning;
C. constructing mixed counterfactual samples;
D. establishing a local feature module, a global-local feature absorption module and a global semantic module; modeling the global-local combination with the representation absorption synthesis module, so that the designed transformer-based modification and absorption blocks capture local-global information across layers, at different scales and from different modalities, from bottom-level local features to top-level global semantics; capturing the implicit visual modifications and preservations in the reference image, which vary with the text modifier, from a final composite representation derived from the bottom-up hierarchical combination;
E. learning matching pairs M((I_R, T_Q), I_T) to retrieve the multi-modal combination, then adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image by maximizing the similarity

k(φ(I_R, T_Q), ψ(I_T)),

where θ denotes the parameterized excitation, k(·,·) computes similarity, and φ(·) and ψ(·) are the composite encoder and the image encoder, respectively;
F. judging the retrieval result with a loss function.
2. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 1, characterized in that, in step A:
the multi-granularity visual representation module uses a vision transformer to generate a discriminative representation of the visual content of the image; the shallow layers of the vision transformer extract basic syntactic information, while its deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation;
the global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, adds the special tokens [cls] and [end] to the sub-word token sequence, and divides the query text into local word embeddings and a global sentence embedding; for the local word embeddings, the word embedding output of the first layer is taken; as BERT proceeds, contextual token and self-attention interactions are performed over multiple steps, the last BERT layer then represents the global information of the token word embeddings in the given text, and the token word embedding representations are concatenated to form the global sentence embedding.
3. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 2, characterized in that, in step B:
the cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer and a soft attention layer; to self-discover the latent region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, the module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting R_R under the query pattern, thereby learning image retrieval more effectively; both embeddings are fed into a self-attention layer with layer normalization and residual connection,

R̂_R = L̃_n(J(R_R)), R̂_Q = L̃_n(J(R_Q)),

where L_n and L̃_n denote layer normalization and its residual-connected form, PSA denotes the pyramid pooling attention operation, MSA denotes the multi-head self-attention operation, and J(·) applies PSA when its input belongs to the visual modality v and MSA otherwise; the self-attention thus yields the self-attended representations R̂_R of the reference modality and R̂_Q of the query modality;
MSA self-attention captures non-local correlations for feature transformation; on top of the self-attention, a pyramid pooling cross attention CSA is introduced, and the bidirectional cross attention with layer normalization and residual connection yields

R̃_{R→Q} = L̃_n(CSA(R̂_R, R̂_Q)), R̃_{Q→R} = L̃_n(CSA(R̂_Q, R̂_R));

a soft attention layer then processes the latent relationships between R̃_{R→Q} and R̃_{Q→R} in terms of image transformation and preservation, so that the final cross-modal composition is

SOA_{Q→R}(R̃_{R→Q}, R̃_{Q→R}),

where SOA_{Q→R}(·) is the soft attention operation;
the representation absorption synthesis module comprises a self-attention layer and a residual attention layer, constructed in hierarchical order; it absorbs meaningful information from the local feature space and then creates an information composition to enhance the robustness of the subsequent query-target matching; given the local-level and global-level representations R_L ∈ R^{N_c×d_c} and R_G ∈ R^{N_c×d_c}, the representation absorption synthesis module absorbs meaningful and discriminative information from R_L as prior knowledge guidance for R_G to generate the synthesized representation; the absorption process is as follows:
after self-attention modeling, an intermediate embedding is generated from the self-attended local and global representations R̂_L and R̂_G; the global semantic representation receives a higher weight, and R̂_G serves as the prior guidance for generating the combination, which then queries meaningful information from the local features through a residual attention mechanism,

R_M = T_l([R̂_L, R̂_G]),

where [·,·] denotes the concatenation operation and T_l denotes a nonlinear transformation;
the fused embedding undergoes layer normalization with residual connection and is fed into a feed-forward layer to obtain the final absorbed representation R_AC; such a combined representation R_AC absorbs useful knowledge from the local features and improves the accuracy of matching queries to targets.
4. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 3, characterized in that, in step C, constructing the mixed counterfactual samples comprises the following steps:
C1. constructing image-independent and text-independent counterfactual samples: given a reference image I_R and its corresponding query text T_Q, BERT is used as a pre-trained bidirectional language model to find, by language similarity, the texts with minimum relevance and their corresponding images; pairing these texts and images with the original reference image and query text forms the image-independent and text-independent counterfactual samples;
C2. constructing context-preserving counterfactual samples: first, words of the query text T_Q are masked according to a priori known attributes, preliminary candidate words are generated and replaced by random words to obtain k_1 samples; these are then fed into BERT together with the original T_Q to compute semantic similarity; finally, the top k_2 texts are selected as context-preserving queries for the reference image, where the selected context-preserving negative samples are ranked by the BERT-measured probability P_s and k_2 is the number selected in the second stage.
5. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 4, characterized in that, in step D:
the local feature module uses the cross-modal local representation modification module on the local visual and text representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for the robustness of the subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the latent embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and language domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation and the latent vector of the query text's global semantics.
6. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 5, characterized in that, in step F, the loss function consists of a bidirectional triplet loss, a reconstruction loss and a domain alignment loss;
the bidirectional triplet loss is derived from second-degree contrastive negative samples and constructs the fine-grained query-target correspondence between the input query and the target image, ensuring that the composition semantically matches the target representation with high similarity; it is defined as

L_tri(X, Y, m) = max(0, ||X^+ - Y||_2 - ||X^- - Y||_2 + m)

L_bid(C_que, C_tar, m, m_a) = λ_q · L_tri(C_que, C_tar, m) + λ_i · L_tri(C_tar, C_que, m_a)

where X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ||·||_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar; the adaptive margin m_a is controlled by the hyperparameter α: when s_qt approaches 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training;
the reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the latent embedding C_tar and the query text embedding, respectively; the goal is to balance the utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination, with λ_img and λ_text as pre-training hyperparameters;
the domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, aligning the representation distributions of the different domains with optimal transport (OT) to bridge the gap between them; first, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; then each feature is assigned, with different weights, to the features of the other modality to predict the correspondence, and the Wasserstein distance W_d is used to match the synthesized domain to the target domain, giving the alignment loss

L_ali = λ_a · W_d(C, T).
Priority and Publication
- Application CN202311224075.3A, filed 2023-09-20; published as CN117235114A (pending) on 2023-12-15.

Cited By
- CN117520590A (granted as CN117520590B): Ocean cross-modal image-text retrieval method, system, equipment and storage medium.
- CN117649461A (granted as CN117649461B): Interactive image generation method and system based on space layout and use method thereof.
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination