CN117235114A - Retrieval method based on cross-modal semantics and mixed counterfactual training - Google Patents

Retrieval method based on cross-modal semantics and mixed counterfactual training

Info

Publication number
CN117235114A
CN117235114A
Authority
CN
China
Prior art keywords
text
global
module
semantic
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311224075.3A
Other languages
Chinese (zh)
Inventor
曾地荣
吴伟华
叶桔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Huazhen Information Technology Co ltd
Original Assignee
Jiangsu Huazhen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Huazhen Information Technology Co ltd filed Critical Jiangsu Huazhen Information Technology Co ltd
Priority to CN202311224075.3A
Publication of CN117235114A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a retrieval method based on cross-modal semantics and mixed counterfactual training, comprising the following steps: A. acquiring features of a reference image I_R, a target image I_T, and a query text T_Q; B. establishing a cross-modal representation modification module and a representation absorption synthesis module, and modeling visual-language representations in three-level cascaded reasoning; C. constructing mixed counterfactual samples; D. modeling the global-local combination and capturing local-global information at different scales and across modalities; capturing the implicit visual modification and preservation in the reference image from a final composite representation derived from the bottom-up hierarchical combination; E. learning matched pairs for combined multi-modal retrieval, adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image; F. judging the retrieval result using a loss function. The invention remedies the defects of the prior art and improves image retrieval accuracy.

Description

Retrieval method based on cross-modal semantics and mixed counterfactual training
Technical Field
The invention relates to the technical field of video image retrieval in intelligent security systems, and in particular to a retrieval method based on cross-modal semantics and mixed counterfactual training.
Background
To obtain a combined representation for multi-modal retrieval, existing approaches rely primarily on cross-modal interaction and fusion of global semantics with local features, or on simple separate/cascaded feature learning schemes that encode the reference image and the query text with their respective encoders. However, such combinations cannot capture the intrinsic relationship between low-level informative features and top-level abstract semantics. From the perspective of deep understanding:
(i) Detailed local features of image regions carry descriptive terms but lack a global semantic understanding of the query.
(ii) Abstract global semantics are learned through increasing abstraction along the hierarchical order of the encoder, but lack the concrete properties rooted at different local locations.
Simply exploiting the above features without properly modeling the global-local combination may therefore lead to degraded solutions.
Disclosure of Invention
The invention aims to provide a retrieval method based on cross-modal semantics and mixed counterfactual training, which remedies the defects of the prior art and improves image retrieval accuracy.
In order to solve the above technical problems, the invention adopts the following technical scheme.
A retrieval method based on cross-modal semantics and mixed counterfactual training comprises the following steps:
A. A multi-granularity visual representation module and a global-local text embedding module are established to obtain features of the reference image I_R, the target image I_T, and the query text T_Q at different layers;
B. A cross-modal representation modification module and a representation absorption synthesis module are established to form a bottom-up cross-modal semantic synthesis module, modeling visual-language representations in three-level cascaded reasoning;
C. Mixed counterfactual samples are constructed;
D. A local feature module, a global-local feature absorption module, and a global semantic module are established; the representation absorption synthesis module models the global-local combination so that the designed transformer-based modification and absorption blocks capture, in a cross-layer manner, local-global information at different scales and across modalities, from bottom-layer local features to top-layer global semantics; the final composite representation derived from the bottom-up hierarchical combination captures the implicit visual modification and preservation in the reference image according to different text modifiers;
E. A matched pair M((I_R, T_Q), I_T) is learned for combined multi-modal retrieval; a θ-parameterized excitation is then added, and finally the learned image-text composite representation is uniquely aligned with the visual representation of the ground-truth target image, as shown in the following formula,
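(The referenced formula is published as an image; the following LaTeX reconstruction from the definitions below is an assumption, not the patent's verbatim equation.)

$$M\big((I_R, T_Q),\, I_T\big) \;=\; k\big(\Phi_{\theta}(I_R, T_Q),\, \Psi(I_T)\big)$$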
wherein θ denotes the parameterized excitation, k(·,·) denotes the similarity, and Φ(·) and Ψ(·) are the composite encoder and the image encoder, respectively;
F. The retrieval result is judged using the loss function.
Preferably, in step A,
the multi-granularity visual representation module generates a discriminative representation of the visual content of the image using a vision transformer; shallow layers of the vision transformer extract basic syntactic information, while deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation;
the global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, prepends the special tokens [cls] and [end] to the sub-word sequence, and divides the query text into local word embeddings and a global sentence embedding; the local word embeddings are taken from the word embedding output of the first layer; as BERT proceeds, contextual tokens interact through self-attention over multiple steps, the last BERT layer then represents the global information of the token embeddings in the given text, and these token embedding representations are concatenated to form the global sentence embedding;
Preferably, in step B,
the cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer, and a soft attention layer; to self-discover the potential region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, this module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting the query pattern R_R, thereby learning image retrieval more effectively; the above are input into a self-attention layer with layer normalization and residual connection, as shown in the following formula,
wherein L_n denotes layer normalization with residual connection, PSA denotes the pyramid pooling self-attention operation, MSA denotes the multi-head self-attention operation, and J(·) judges whether the input belongs to the visual modality v; self-attention yields the self-attended representations of the reference and query modalities;
MSA self-attention captures non-local correlations for feature transformation; on this basis, pyramid pooling cross-attention (CSA) is introduced, and the bidirectional cross-attention with layer normalization and residual connection is obtained, as shown in the following formula,
soft attention is then applied to process the potential relationships in terms of image transformation and preservation, and the final cross-modal composition is:
wherein SOA_{Q→R}(·) is the soft attention operation;
the representation absorption synthesis module comprises a self-attention layer and a residual attention layer, constructed in hierarchical order; it absorbs meaningful information from the local feature space and then creates an informative composition to enhance the robustness of subsequent query-target matching; given the local-level and global-level representations R_L ∈ R^{N_c×d_c} and R_G ∈ R^{N_c×d_c}, the representation absorption synthesis module absorbs meaningful and discriminative information from R_L as prior knowledge guidance for R_G to generate the synthesized representation; the absorption process is as follows,
after self-attention modeling, an intermediate embedding is generated; the global semantic representation obtains a higher weight and serves as prior guidance for generating the combination; the residual attention mechanism then queries for meaningful information, as shown in the following formula,
wherein [·,·] denotes the concatenation operation and T_l denotes a nonlinear transformation;
the fused embedding undergoes layer normalization with residual connection and is input into a feedforward layer to obtain the final absorbed feature representation R_AC; such a combined representation R_AC absorbs useful knowledge from the local features and improves the accuracy of matching the query to the target.
preferably, in step D,
the local feature module applies the cross-modal representation modification module to the local visual and textual representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for robust subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the potential embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and linguistic domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation with the global semantic latent vector of the query text.
Preferably, in step F, the loss function consists of a bi-directional ternary loss, a reconstruction loss and a domain alignment loss;
the bidirectional ternary (triplet) loss draws on second-order contrastive negative samples to construct a fine-grained query-target correspondence between the input query and the target image, ensuring semantic matching between the composition and target representations of high similarity; the bidirectional ternary loss is defined as follows,
$$L_{tri}(X, Y, m) = \max\big(0,\; \|X^{+} - Y\|_{2} - \|X^{-} - Y\|_{2} + m\big)$$
$$L_{bid}(C_{que}, C_{tar}, m, m_a) = \lambda_q\, L_{tri}(C_{que}, C_{tar}, m) + \lambda_i\, L_{tri}(C_{tar}, C_{que}, m_a)$$
wherein X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ‖·‖_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar; the α-adaptive margin value m_a is a hyperparameter: when s_qt is near 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training;
the reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the potential target embedding C_tar and the text embedding respectively; the goal is a balanced utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination,
wherein λ_img and λ_text are pre-training hyperparameters;
the domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, where the representation distributions of the different domains are aligned using optimal transport (OT) to bridge the gap between them. First, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; second, each feature is assigned, with different weights, to features from the other modality to complete the prediction of the correspondence, and the Wasserstein distance is used to match the synthesized domain to the target domain; the Wasserstein distance and the alignment loss are shown in the following formula,
$$L_{ali} = \lambda_a\, W_d(C, T)$$
the beneficial effects brought by adopting the technical scheme are as follows:
(i) And a mode characterization modification module and a characterization absorption synthesis module are provided for implicit bottom-up semantic synthesis modeling. The key idea of this step is to achieve complementary synergy between the bottom-up visual representations by utilizing complementary global local representations of the different encoder layers, thereby achieving selective modification of the relevant image features and ensuring preservation of unaltered features, which is critical to an accurate retrieval method.
(ii) A mixed counterfactual training strategy for plug and play (note: mixed counterfactual is presented here instead of counterfactual). The strategy aims at facilitating the construction of a fine-grained query-target correspondence by a retrieval model to achieve robust image retrieval. The policy can be used as a plug-and-play component to increase the query sensitivity of the retrieval model. In particular, three new different types of counterfactual samples are constructed, image independent, text independent and context preserving. The mixed sample realizes an explicit bidirectional corresponding learning mechanism, is beneficial to establishing one-to-one matching of the combined query and the expected image, and reduces the prediction uncertainty of the model on the similar query.
(iii) The method is characterized by designing a bottom-up cross-modal semantic synthesis and hierarchical combined reasoning, combining cross-granularity semantic update learning and understanding composite image text representation, gradually digesting information flows from vision and language from two new perspectives corresponding to an implicit bottom-up visual representation synthesis and an explicit fine-granularity query target structure, and achieving the challenging task of solving content-based image retrieval. The combination can effectively capture hidden visual modification and storage according to different text modifiers, thereby achieving the effect superior to the prior image retrieval technology.
Drawings
FIG. 1 is an example of multimodal retrieval;
FIG. 2 is the content-based image retrieval framework.
Detailed Description
One embodiment of the present invention comprises the steps of:
A. A multi-granularity visual representation module and a global-local text embedding module are established to obtain features of the reference image I_R, the target image I_T, and the query text T_Q at different layers;
the multi-granularity visual representation module generates a discriminative representation of the visual content of the image using a vision transformer; shallow layers of the vision transformer extract basic syntactic information, while deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation;
the global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, prepends the special tokens [cls] and [end] to the sub-word sequence, and divides the query text into local word embeddings and a global sentence embedding; the local word embeddings are taken from the word embedding output of the first layer; as BERT proceeds, contextual tokens interact through self-attention over multiple steps, the last BERT layer then represents the global information of the token embeddings in the given text, and these token embedding representations are concatenated to form the global sentence embedding;
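A minimal PyTorch sketch of these two encoder modules follows. The checkpoint names, the shallow-layer index, the projection width, and the use of mean pooling in place of the concatenation step are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from transformers import ViTModel, BertTokenizer, BertModel

class MultiGranularityVisual(nn.Module):
    """Multi-granularity visual representation (sketch)."""
    def __init__(self, d_model=512, shallow_layer=3):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.shallow_layer = shallow_layer
        hidden = self.vit.config.hidden_size
        # Linear projections map the extracted features into a shared semantic space.
        self.local_proj = nn.Linear(hidden, d_model)
        self.global_proj = nn.Linear(hidden, d_model)

    def forward(self, pixel_values):
        out = self.vit(pixel_values=pixel_values, output_hidden_states=True)
        # Shallow block: basic, spatially grounded patch tokens (local features).
        local = self.local_proj(out.hidden_states[self.shallow_layer][:, 1:, :])
        # Deep block: [CLS] token of the last layer as the global semantics.
        global_sem = self.global_proj(out.last_hidden_state[:, 0, :])
        return local, global_sem

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def embed_query(text: str):
    """Global-local text embedding (sketch); BERT's native [CLS]/[SEP]
    stand in for the [cls]/[end] markers described above."""
    enc = tokenizer(text, return_tensors="pt")            # M sub-word tokens
    out = bert(**enc, output_hidden_states=True)
    local_words = out.hidden_states[1]                    # first-layer word embeddings
    global_sentence = out.last_hidden_state.mean(dim=1)   # pooled global embedding
    return local_words, global_sentence
```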
B. A cross-modal representation modification module and a representation absorption synthesis module are established to form a bottom-up cross-modal semantic synthesis module, modeling visual-language representations in three-level cascaded reasoning;
the cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer, and a soft attention layer; to self-discover the potential region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, this module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting the query pattern R_R, thereby learning image retrieval more effectively; the above are input into a self-attention layer with layer normalization and residual connection, as shown in the following formula,
wherein L_n denotes layer normalization with residual connection, PSA denotes the pyramid pooling self-attention operation, MSA denotes the multi-head self-attention operation, and J(·) judges whether the input belongs to the visual modality v; self-attention yields the self-attended representations of the reference and query modalities;
MSA self-attention captures non-local correlations for feature transformation; on this basis, pyramid pooling cross-attention (CSA) is introduced, and the bidirectional cross-attention with layer normalization and residual connection is obtained, as shown in the following formula,
soft attention is then applied to process the potential relationships in terms of image transformation and preservation, and the final cross-modal composition is:
wherein SOA_{Q→R}(·) is the soft attention operation;
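The following sketch approximates the cross-modal representation modification module; standard multi-head attention stands in for the pyramid pooling attention (PSA/CSA), and the sigmoid gate is an assumed form of the soft attention SOA, so treat this as an approximation rather than the claimed design.

```python
import torch
import torch.nn as nn

class CrossModalModification(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.norm_r = nn.LayerNorm(d)
        self.norm_q = nn.LayerNorm(d)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.soft_gate = nn.Linear(d, 1)

    def forward(self, R, Q):
        # Self-attention with layer normalization and residual connection.
        r = R + self.self_attn(self.norm_r(R), self.norm_r(R), self.norm_r(R))[0]
        q = Q + self.self_attn(self.norm_q(Q), self.norm_q(Q), self.norm_q(Q))[0]
        # Bidirectional cross-attention between reference and query tokens.
        r2q = r + self.cross_attn(r, q, q)[0]   # reference attends to query
        q2r = q + self.cross_attn(q, r, r)[0]   # query attends to reference
        # Soft attention (assumed gating form) fuses the two directions
        # into the final cross-modal composition.
        gate = torch.sigmoid(self.soft_gate(r2q))
        return gate * r2q + (1 - gate) * q2r.mean(dim=1, keepdim=True)
```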
the representation absorption synthesis module comprises a self-attention layer and a residual attention layer, constructed in hierarchical order; it absorbs meaningful information from the local feature space and then creates an informative composition to enhance the robustness of subsequent query-target matching; given the local-level and global-level representations R_L ∈ R^{N_c×d_c} and R_G ∈ R^{N_c×d_c}, the representation absorption synthesis module absorbs meaningful and discriminative information from R_L as prior knowledge guidance for R_G to generate the synthesized representation; the absorption process is as follows,
after self-attention modeling, an intermediate embedding is generated; the global semantic representation obtains a higher weight and serves as prior guidance for generating the combination; the residual attention mechanism then queries for meaningful information, as shown in the following formula,
wherein [·,·] denotes the concatenation operation and T_l denotes a nonlinear transformation;
the fused embedding undergoes layer normalization with residual connection and is input into a feedforward layer to obtain the final absorbed feature representation R_AC; such a combined representation R_AC absorbs useful knowledge from the local features and improves the accuracy of matching the query to the target;
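A sketch of the absorption process under the same simplifications: the global tokens R_G query the local tokens R_L through residual attention (absorbing discriminative local information), and a feedforward layer with residual then yields R_AC. Layer sizes are assumptions.

```python
import torch.nn as nn

class AbsorptionSynthesis(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, R_L, R_G):
        # Global semantics serve as the query (prior guidance); local features
        # supply the keys/values whose information is absorbed.
        absorbed = R_G + self.attn(self.norm(R_G), self.norm(R_L), self.norm(R_L))[0]
        # Feedforward layer with residual produces the final representation R_AC.
        return absorbed + self.ffn(self.norm(absorbed))
```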
C. Mixed counterfactual samples are constructed, comprising the following steps:
C1. constructing image-independent and text-independent counterfactual samples;
given a reference image I_R and the corresponding query text T_Q, BERT is used as a pre-trained bidirectional language model to find the texts of minimum relevance according to language similarity, together with their corresponding images; combined with the original reference image and query text, these texts and images form the image-independent and text-independent counterfactual samples;
C2. constructing context-preserving counterfactual samples;
first, words of the query text T_Q are masked according to a priori known attributes; preliminary candidate words are generated and replaced by random words to obtain k_1 samples; these are then fed into BERT together with the original T_Q to compute semantic similarity, and finally the top k_2 texts with respect to the reference image are selected as context-preserving queries, as shown in the following,
wherein the selected samples denote the context-preserving negatives, P_s denotes the probability measured by BERT, and k_2 is the number selected in the second stage;
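A sketch of the two-stage construction in step C2: mask an attribute word, substitute a random vocabulary token to get k_1 candidates, then rank the candidates by BERT similarity to the original query and keep the top k_2. The scoring details (mean-pooled cosine similarity) are assumptions.

```python
import random
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def _sentence_vec(s: str) -> torch.Tensor:
    return bert(**tok(s, return_tensors="pt")).last_hidden_state.mean(dim=1)

def context_preserving_negatives(query: str, attr_positions, k1=20, k2=5):
    words = query.split()
    candidates = []
    for _ in range(k1):
        w = list(words)
        # Replace one a-priori attribute word with a random vocabulary token,
        # keeping the rest of the sentence (the context) intact.
        w[random.choice(attr_positions)] = tok.convert_ids_to_tokens(
            random.randrange(tok.vocab_size))
        candidates.append(" ".join(w))
    q_vec = _sentence_vec(query)
    # Rank candidates by semantic similarity to the original query and keep
    # the top k2 as hard, context-preserving counterfactual negatives.
    scored = sorted(
        ((torch.cosine_similarity(q_vec, _sentence_vec(c)).item(), c)
         for c in candidates),
        reverse=True)
    return [c for _, c in scored[:k2]]
```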
D. A local feature module, a global-local feature absorption module, and a global semantic module are established; the representation absorption synthesis module models the global-local combination so that the designed transformer-based modification and absorption blocks capture, in a cross-layer manner, local-global information at different scales and across modalities, from bottom-layer local features to top-layer global semantics; the final composite representation derived from the bottom-up hierarchical combination captures the implicit visual modification and preservation in the reference image according to different text modifiers;
the local feature module applies the cross-modal representation modification module to the local visual and textual representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for robust subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the potential embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and linguistic domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation with the global semantic latent vector of the query text;
E. A matched pair M((I_R, T_Q), I_T) is learned for combined multi-modal retrieval; a θ-parameterized excitation is then added, and finally the learned image-text composite representation is uniquely aligned with the visual representation of the ground-truth target image, as shown in the formula given above,
wherein θ denotes the parameterized excitation, k(·,·) denotes the similarity, and Φ(·) and Ψ(·) are the composite encoder and the image encoder, respectively;
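As an illustration of this matching step, the sketch below scores candidates with cosine similarity standing in for k(·,·); treating the excitation as already folded into the composite representation is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def match_score(composite: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # k(Phi(I_R, T_Q), Psi(I_T)) realized as cosine similarity (assumption).
    return F.cosine_similarity(composite, target, dim=-1)

def retrieve(composite: torch.Tensor, gallery: torch.Tensor) -> int:
    # Rank all candidate target images and return the index of the best match.
    scores = match_score(composite.unsqueeze(0), gallery)   # gallery: (N, d)
    return int(torch.argmax(scores))
```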
F. The retrieval result is judged using the loss function;
the loss function consists of a bidirectional ternary loss, a reconstruction loss and a domain alignment loss;
the bidirectional ternary (triplet) loss draws on second-order contrastive negative samples to construct a fine-grained query-target correspondence between the input query and the target image, ensuring semantic matching between the composition and target representations of high similarity; the bidirectional ternary loss is defined as follows (a code sketch follows this paragraph),
$$L_{tri}(X, Y, m) = \max\big(0,\; \|X^{+} - Y\|_{2} - \|X^{-} - Y\|_{2} + m\big)$$
$$L_{bid}(C_{que}, C_{tar}, m, m_a) = \lambda_q\, L_{tri}(C_{que}, C_{tar}, m) + \lambda_i\, L_{tri}(C_{tar}, C_{que}, m_a)$$
wherein X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ‖·‖_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar; the α-adaptive margin value m_a is a hyperparameter: when s_qt is near 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training;
the reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the potential target embedding C_tar and the text embedding respectively; the goal is a balanced utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination,
wherein λ_img and λ_text are pre-training hyperparameters;
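A sketch of the bidirectional ternary loss defined above; the exponential form of the α-adaptive margin m_a is an assumption chosen to match the stated behavior (maximal when s_qt is near 0, shrinking otherwise).

```python
import torch
import torch.nn.functional as F

def triplet(x_pos, x_neg, y, m):
    # L_tri(X, Y, m) = max(0, ||X+ - Y||_2 - ||X- - Y||_2 + m)
    return F.relu((x_pos - y).norm(dim=-1) - (x_neg - y).norm(dim=-1) + m).mean()

def bidirectional_loss(c_que, c_que_neg, c_tar, c_tar_neg,
                       m=0.2, lam_q=1.0, lam_i=1.0, alpha=10.0):
    # Adaptive margin m_a, largest when the query-target similarity s_qt
    # is near 0 (assumed functional form).
    s_qt = F.cosine_similarity(c_que, c_tar, dim=-1).mean()
    m_a = m * torch.exp(-alpha * s_qt)
    return (lam_q * triplet(c_que, c_que_neg, c_tar, m)
            + lam_i * triplet(c_tar, c_tar_neg, c_que, m_a))
```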
the domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, where the representation distributions of the different domains are aligned using optimal transport (OT) to bridge the gap between them. First, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; second, each feature is assigned, with different weights, to features from the other modality to complete the prediction of the correspondence, and the Wasserstein distance is used to match the synthesized domain to the target domain; the Wasserstein distance and the alignment loss are shown in the following formula,
$$L_{ali} = \lambda_a\, W_d(C, T)$$
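A sketch of the domain alignment computation: the cost matrix c_m is built between the two feature sets, and entropic-regularized Sinkhorn iterations approximate W_d(C, T); the regularization strength and iteration count are assumptions.

```python
import torch

def sinkhorn_wasserstein(C_feats, T_feats, eps=0.1, iters=50):
    cost = torch.cdist(C_feats, T_feats, p=2)        # cost matrix c_m
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                   # uniform source weights
    nu = torch.full((m,), 1.0 / m)                   # uniform target weights
    K = torch.exp(-cost / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):                           # Sinkhorn fixed point
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)         # transport plan
    return (plan * cost).sum()                       # approximate W_d(C, T)

def alignment_loss(C_feats, T_feats, lam_a=1.0):
    # L_ali = lambda_a * W_d(C, T)
    return lam_a * sinkhorn_wasserstein(C_feats, T_feats)
```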
in the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. A retrieval method based on cross-modal semantics and mixed counterfactual training, characterized by comprising the following steps:
A. A multi-granularity visual representation module and a global-local text embedding module are established to obtain features of the reference image I_R, the target image I_T, and the query text T_Q at different layers;
B. A cross-modal representation modification module and a representation absorption synthesis module are established to form a bottom-up cross-modal semantic synthesis module, modeling visual-language representations in three-level cascaded reasoning;
C. Mixed counterfactual samples are constructed;
D. A local feature module, a global-local feature absorption module, and a global semantic module are established; the representation absorption synthesis module models the global-local combination so that the designed transformer-based modification and absorption blocks capture, in a cross-layer manner, local-global information at different scales and across modalities, from bottom-layer local features to top-layer global semantics; the final composite representation derived from the bottom-up hierarchical combination captures the implicit visual modification and preservation in the reference image according to different text modifiers;
E. A matched pair M((I_R, T_Q), I_T) is learned for combined multi-modal retrieval; a θ-parameterized excitation is then added, and finally the learned image-text composite representation is uniquely aligned with the visual representation of the ground-truth target image, as shown in the following formula,
wherein θ denotes the parameterized excitation, k(·,·) denotes the similarity, and Φ(·) and Ψ(·) are the composite encoder and the image encoder, respectively;
F. The retrieval result is judged using the loss function.
2. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 1, characterized in that: in step A,
the multi-granularity visual representation module generates a discriminative representation of the visual content of the image using a vision transformer; shallow layers of the vision transformer extract basic syntactic information, while deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation;
the global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, prepends the special tokens [cls] and [end] to the sub-word sequence, and divides the query text into local word embeddings and a global sentence embedding; the local word embeddings are taken from the word embedding output of the first layer; as BERT proceeds, contextual tokens interact through self-attention over multiple steps, the last BERT layer then represents the global information of the token embeddings in the given text, and these token embedding representations are concatenated to form the global sentence embedding.
3. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 2, characterized in that: in step B,
the cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer, and a soft attention layer; to self-discover the potential region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, this module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting the query pattern R_R, thereby learning image retrieval more effectively; the above are input into a self-attention layer with layer normalization and residual connection, as shown in the following formula,
wherein L_n denotes layer normalization with residual connection, PSA denotes the pyramid pooling self-attention operation, MSA denotes the multi-head self-attention operation, and J(·) judges whether the input belongs to the visual modality v; self-attention yields the self-attended representations of the reference and query modalities;
MSA self-attention captures non-local correlations for feature transformation; on this basis, pyramid pooling cross-attention (CSA) is introduced, and the bidirectional cross-attention with layer normalization and residual connection is obtained, as shown in the following formula,
soft attention is then applied to process the potential relationships in terms of image transformation and preservation, and the final cross-modal composition is:
wherein SOA_{Q→R}(·) is the soft attention operation;
the representation absorption synthesis module comprises a self-attention layer and a residual attention layer, constructed in hierarchical order; it absorbs meaningful information from the local feature space and then creates an informative composition to enhance the robustness of subsequent query-target matching; given the local-level and global-level representations R_L ∈ R^{N_c×d_c} and R_G ∈ R^{N_c×d_c}, the representation absorption synthesis module absorbs meaningful and discriminative information from R_L as prior knowledge guidance for R_G to generate the synthesized representation; the absorption process is as follows,
after self-attention modeling, an intermediate embedding is generated; the global semantic representation obtains a higher weight and serves as prior guidance for generating the combination; the residual attention mechanism then queries for meaningful information, as shown in the following formula,
wherein [·,·] denotes the concatenation operation and T_l denotes a nonlinear transformation;
the fused embedding undergoes layer normalization with residual connection and is input into a feedforward layer to obtain the final absorbed feature representation R_AC; such a combined representation R_AC absorbs useful knowledge from the local features and improves the accuracy of matching the query to the target.
4. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 3, characterized in that: in step C, constructing the mixed counterfactual samples comprises the following steps:
C1. constructing image-independent and text-independent counterfactual samples;
given a reference image I_R and the corresponding query text T_Q, BERT is used as a pre-trained bidirectional language model to find the texts of minimum relevance according to language similarity, together with their corresponding images; combined with the original reference image and query text, these texts and images form the image-independent and text-independent counterfactual samples;
C2. constructing context-preserving counterfactual samples;
first, words of the query text T_Q are masked according to a priori known attributes; preliminary candidate words are generated and replaced by random words to obtain k_1 samples; these are then fed into BERT together with the original T_Q to compute semantic similarity, and finally the top k_2 texts with respect to the reference image are selected as context-preserving queries, as shown in the following,
wherein the selected samples denote the context-preserving negatives, P_s denotes the probability measured by BERT, and k_2 is the number selected in the second stage.
5. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 4, characterized in that: in step D,
the local feature module applies the cross-modal representation modification module to the local visual and textual representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for robust subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the potential embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and linguistic domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation with the global semantic latent vector of the query text.
6. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 5, characterized in that: in step F, the loss function consists of a bidirectional ternary loss, a reconstruction loss, and a domain alignment loss;
the bidirectional ternary loss draws on second-order contrastive negative samples to construct a fine-grained query-target correspondence between the input query and the target image, ensuring semantic matching between the composition and target representations of high similarity; the bidirectional ternary loss is defined as follows,
$$L_{tri}(X, Y, m) = \max\big(0,\; \|X^{+} - Y\|_{2} - \|X^{-} - Y\|_{2} + m\big)$$
$$L_{bid}(C_{que}, C_{tar}, m, m_a) = \lambda_q\, L_{tri}(C_{que}, C_{tar}, m) + \lambda_i\, L_{tri}(C_{tar}, C_{que}, m_a)$$
wherein X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ‖·‖_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar; the α-adaptive margin value m_a is a hyperparameter: when s_qt is near 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training;
the reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the potential target embedding C_tar and the text embedding respectively; the goal is a balanced utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination,
wherein λ_img and λ_text are pre-training hyperparameters;
the domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, where the representation distributions of the different domains are aligned using optimal transport (OT) to bridge the gap between them; first, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; second, each feature is assigned, with different weights, to features from the other modality to complete the prediction of the correspondence, and the Wasserstein distance is used to match the synthesized domain to the target domain; the Wasserstein distance and the alignment loss are shown in the following formula,
$$L_{ali} = \lambda_a\, W_d(C, T)$$
CN202311224075.3A 2023-09-20 2023-09-20 Retrieval method based on cross-modal semantics and mixed counterfactual training Pending CN117235114A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311224075.3A CN117235114A (en) 2023-09-20 2023-09-20 Retrieval method based on cross-modal semantics and mixed counterfactual training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311224075.3A CN117235114A (en) 2023-09-20 2023-09-20 Retrieval method based on cross-modal semantics and mixed counterfactual training

Publications (1)

Publication Number Publication Date
CN117235114A true CN117235114A (en) 2023-12-15

Family

ID=89090733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311224075.3A Pending CN117235114A (en) 2023-09-20 2023-09-20 Retrieval method based on cross-modal semantic and mixed inverse fact training

Country Status (1)

Country Link
CN (1) CN117235114A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117520590A (en) * 2024-01-04 2024-02-06 武汉理工大学三亚科教创新园 Ocean cross-modal image-text retrieval method, system, equipment and storage medium
CN117520590B (en) * 2024-01-04 2024-04-26 武汉理工大学三亚科教创新园 Ocean cross-modal image-text retrieval method, system, equipment and storage medium
CN117649461A (en) * 2024-01-29 2024-03-05 吉林大学 Interactive image generation method and system based on space layout and use method thereof
CN117649461B (en) * 2024-01-29 2024-05-07 吉林大学 Interactive image generation method and system based on space layout and use method thereof

Similar Documents

Publication Publication Date Title
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN117235114A (en) Retrieval method based on cross-modal semantics and mixed counterfactual training
CN109992686A (en) Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN106407333A (en) Artificial intelligence-based spoken language query identification method and apparatus
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN113486669B (en) Semantic recognition method for emergency rescue input voice
US20240119716A1 (en) Method for multimodal emotion classification based on modal space assimilation and contrastive learning
CN117391051B (en) Emotion-fused common attention network multi-modal false news detection method
Li et al. Adapting clip for phrase localization without further training
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Ueda et al. Switching text-based image encoders for captioning images with text
Zhu et al. Unpaired image captioning by image-level weakly-supervised visual concept recognition
CN115311465A (en) Image description method based on double attention models
CN116452688A (en) Image description generation method based on common attention mechanism
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph
CN114626454A (en) Visual emotion recognition method integrating self-supervision learning and attention mechanism
CN113192030A (en) Remote sensing image description generation method and system
CN114625830B (en) Chinese dialogue semantic role labeling method and system
CN116227594A (en) Construction method of high-credibility knowledge graph of medical industry facing multi-source data
Ding et al. A Novel Discrimination Structure for Assessing Text Semantic Similarity
CN117746441B (en) Visual language understanding method, device, equipment and readable storage medium
Zhang et al. A Multi-Layer Attention Network for Visual Commonsense Reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination