CN117235114A - Retrieval method based on cross-modal semantics and mixed counterfactual training
- Publication number: CN117235114A
- Application number: CN202311224075.3A
- Authority: CN (China)
- Priority/filing date: 2023-09-20
- Legal status: Pending
Classifications
- Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a retrieval method based on cross-modal semantics and mixed counterfactual training, which comprises the following steps: A. acquiring representations of the reference image I_R, the target image I_T and the query text T_Q; B. establishing a cross-modal representation modification module and a representation absorption synthesis module, and modeling the visual-language representation in three-level cascaded reasoning; C. constructing mixed counterfactual samples; D. modeling the global-local combination, capturing local-global information at different scales and from different modalities, and capturing the implicit visual modifications and preservations in the reference image from a final composite representation derived from the bottom-up hierarchical combination; E. retrieving the multi-modal combination by learning matching pairs, adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image; F. judging the retrieval result with a loss function. The invention remedies the defects of the prior art and improves image retrieval accuracy.
Description
Technical Field
The invention relates to the technical field of video image retrieval in intelligent security systems, and in particular to a retrieval method based on cross-modal semantics and mixed counterfactual training.
Background
To obtain the combined representation for multi-modal retrieval, existing approaches rely primarily on cross-modal interaction and fusion of global semantics and local features, or on simple separate/cascaded feature-learning schemes that encode the reference image and the query text with their respective encoders. From the perspective of deep understanding, however, such combinations cannot capture the intrinsic relationship between the informative features of the bottom layers and the abstract semantics of the top layers:
(i) detailed local features of image regions carry descriptive cues but lack a global semantic understanding of the query;
(ii) abstract global semantics are learned through increasing abstraction along the hierarchical order of the encoder, but lack the concrete properties rooted at different local locations.
Simply exploiting these features without properly modeling the global-local combination may therefore lead to degraded solutions.
Disclosure of Invention
The invention aims to provide a retrieval method based on cross-modal semantics and mixed counterfactual training, which remedies the defects of the prior art and improves image retrieval accuracy.
To solve the above technical problems, the invention adopts the following technical scheme.
A retrieval method based on cross-modal semantics and mixed counterfactual training comprises the following steps:
A. establishing a multi-granularity visual representation module and a global-local text embedding module to obtain representations of the reference image I_R, the target image I_T and the query text T_Q at different layers;
B. establishing a cross-modal representation modification module and a representation absorption synthesis module to form a bottom-up cross-modal semantic synthesis module, and modeling the visual-language representation in three-level cascaded reasoning;
C. constructing mixed counterfactual samples;
D. establishing a local feature module, a global-local feature absorption module and a global semantic module; modeling the global-local combination with the representation absorption synthesis module, so that the designed transformer-based modification and absorption blocks capture local-global information across layers, at different scales and from different modalities, from bottom-level local features to top-level global semantics; capturing the implicit visual modifications and preservations in the reference image, which vary with the text modifier, from a final composite representation derived from the bottom-up hierarchical combination;
E. learning matching pairs M((I_R, T_Q), I_T) to retrieve the multi-modal combination, then adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image by maximizing the similarity

k(φ(I_R, T_Q), ψ(I_T)),

where θ denotes the parameterized excitation, k(·,·) computes similarity, and φ(·) and ψ(·) are the composite encoder and the image encoder, respectively;
F. judging the retrieval result with a loss function.
Preferably, in step A:
The multi-granularity visual representation module uses a vision transformer to generate a discriminative representation of the visual content of the image; the shallow layers of the vision transformer extract basic syntactic information, while its deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation.
The global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, adds the special tokens [cls] and [end] to the sub-word token sequence, and divides the query text into local word embeddings and a global sentence embedding; for the local word embeddings, the word embedding output of the first layer is taken; as BERT proceeds, contextual token and self-attention interactions are performed over multiple steps, the last BERT layer then represents the global information of the token word embeddings in the given text, and the token word embedding representations are concatenated to form the global sentence embedding.
Preferably, in step B:
The cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer and a soft attention layer. To self-discover the latent region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, the module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting R_R under the query pattern, thereby learning image retrieval more effectively. Both embeddings are fed into a self-attention layer with layer normalization and residual connection,

R̂_R = L̃_n(J(R_R)), R̂_Q = L̃_n(J(R_Q)),

where L_n and L̃_n denote layer normalization and its residual-connected form, PSA denotes the pyramid pooling attention operation, MSA denotes the multi-head self-attention operation, and J(·) applies PSA when its input belongs to the visual modality v and MSA otherwise; the self-attention thus yields the self-attended representations R̂_R of the reference modality and R̂_Q of the query modality.

MSA self-attention captures non-local correlations for feature transformation. On top of the self-attention, a pyramid pooling cross attention CSA is introduced, and the bidirectional cross attention with layer normalization and residual connection yields

R̃_{R→Q} = L̃_n(CSA(R̂_R, R̂_Q)), R̃_{Q→R} = L̃_n(CSA(R̂_Q, R̂_R)).

A soft attention layer then processes the latent relationships between R̃_{R→Q} and R̃_{Q→R} in terms of image transformation and preservation, so that the final cross-modal composition is

SOA_{Q→R}(R̃_{R→Q}, R̃_{Q→R}),

where SOA_{Q→R}(·) is the soft attention operation.
the characterization absorption synthesis module comprises a self-attention layer and a residual attention layer, and is constructed according to a hierarchical sequence; for absorbing meaningful information from the local feature space and then creating an information composition to enhance robustness of subsequent query target matches; characterization at local and global level, R respectively L ∈R Nc×dc And R is G ∈R Nc×dc Characterization of the absorption synthesis moduleRealization of slave R L Absorbs meaningful and distinguishing information as R G Generating a priori knowledge guidance of the synthesis characterization; the absorption process is as follows,
after self-attention modeling, use is made ofAnd->Generating an intermediate embedding, the global semantic representation obtaining a higher weight, and +.>As a priori guidance for the generation of the combination, the demander is then known to ask for meaningful information, via a residual attention mechanism, as shown in the following equation,
wherein, [ ·, ]]Representing series operation, T l Representing a nonlinear transformation;
fusion embedding performs layer normalization of residual connection. And input into the feedforward layer to obtain the final absorption characteristic representation R AC Such a combination represents R AC Absorbs useful knowledge from the local features, improves the accuracy of query matching targets,
wherein,
preferably, in step D,
the local feature module locally represents the correction module by using a transmembrane stateAnd->Learning local feature combination->
Global-local feature absorption module combines from local featuresAbsorption meaningful and discriminative information that plays a role in guiding in advance from the local feature layer to robustness to subsequent synthesis-target matching, deriving global-local absorption combinations using a token absorption synthesis module +.>Is a potential embedding of (a);
the global semantic composition module models final composition from visual and linguistic domainsThrough aggregation of intermediate representationsIs a basic semantic and query text global semantic potential vector +.>To update the output of the stream.
Preferably, in step F, the loss function consists of a bidirectional triplet loss, a reconstruction loss and a domain alignment loss.

The bidirectional triplet loss is derived from second-degree contrastive negative samples and constructs the fine-grained query-target correspondence between the input query and the target image, ensuring that the composition semantically matches the target representation with high similarity. It is defined as

L_tri(X, Y, m) = max(0, ||X^+ - Y||_2 - ||X^- - Y||_2 + m)

L_bid(C_que, C_tar, m, m_a) = λ_q · L_tri(C_que, C_tar, m) + λ_i · L_tri(C_tar, C_que, m_a)

where X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ||·||_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar. The adaptive margin m_a is controlled by the hyperparameter α: when s_qt approaches 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training.

The reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the latent embedding C_tar and the query text embedding, respectively; the goal is to balance the utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination, with λ_img and λ_text as pre-training hyperparameters.

The domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, aligning the representation distributions of the different domains with optimal transport (OT) to bridge the gap between them. First, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; then each feature is assigned, with different weights, to the features of the other modality to predict the correspondence, and the Wasserstein distance W_d is used to match the synthesized domain to the target domain. The alignment loss is

L_ali = λ_a · W_d(C, T).
the beneficial effects brought by adopting the technical scheme are as follows:
(i) And a mode characterization modification module and a characterization absorption synthesis module are provided for implicit bottom-up semantic synthesis modeling. The key idea of this step is to achieve complementary synergy between the bottom-up visual representations by utilizing complementary global local representations of the different encoder layers, thereby achieving selective modification of the relevant image features and ensuring preservation of unaltered features, which is critical to an accurate retrieval method.
(ii) A mixed counterfactual training strategy for plug and play (note: mixed counterfactual is presented here instead of counterfactual). The strategy aims at facilitating the construction of a fine-grained query-target correspondence by a retrieval model to achieve robust image retrieval. The policy can be used as a plug-and-play component to increase the query sensitivity of the retrieval model. In particular, three new different types of counterfactual samples are constructed, image independent, text independent and context preserving. The mixed sample realizes an explicit bidirectional corresponding learning mechanism, is beneficial to establishing one-to-one matching of the combined query and the expected image, and reduces the prediction uncertainty of the model on the similar query.
(iii) The method is characterized by designing a bottom-up cross-modal semantic synthesis and hierarchical combined reasoning, combining cross-granularity semantic update learning and understanding composite image text representation, gradually digesting information flows from vision and language from two new perspectives corresponding to an implicit bottom-up visual representation synthesis and an explicit fine-granularity query target structure, and achieving the challenging task of solving content-based image retrieval. The combination can effectively capture hidden visual modification and storage according to different text modifiers, thereby achieving the effect superior to the prior image retrieval technology.
Drawings
FIG. 1 is an example of multimodal retrieval;
FIG. 2 is the content-based image retrieval framework.
Detailed Description
One embodiment of the present invention comprises the following steps:
A. A multi-granularity visual representation module and a global-local text embedding module are established to obtain representations of the reference image I_R, the target image I_T and the query text T_Q at different layers.
The multi-granularity visual representation module uses a vision transformer to generate a discriminative representation of the visual content of the image; the shallow layers of the vision transformer extract basic syntactic information, while its deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation.
global local text embedding module uses BERT to query text T Q Marking into M subword marking sequences, and then placing special marking positions [ cls ]]And [ end ]]Leading the text sub-word mark sequence, and dividing the query text into local word embedding and global sentence embedding; for partial word embedding, the word embedding output of the first layer isWith the advancement of BERT, the contextual markup and self-attention interaction is performed in multiple steps, and then the last layer of BERT represents global information of the tag word embedding in a given text, and the tag word embedding representations are connected to form a global sentence embedding->
B. A cross-modal representation modification module and a representation absorption synthesis module are established to form a bottom-up cross-modal semantic synthesis module, and the visual-language representation is modeled in three-level cascaded reasoning.
The cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer and a soft attention layer. To self-discover the latent region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, the module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting R_R under the query pattern, thereby learning image retrieval more effectively. Both embeddings are fed into a self-attention layer with layer normalization and residual connection,

R̂_R = L̃_n(J(R_R)), R̂_Q = L̃_n(J(R_Q)),

where L_n and L̃_n denote layer normalization and its residual-connected form, PSA denotes the pyramid pooling attention operation, MSA denotes the multi-head self-attention operation, and J(·) applies PSA when its input belongs to the visual modality v and MSA otherwise; the self-attention thus yields the self-attended representations R̂_R of the reference modality and R̂_Q of the query modality.

MSA self-attention captures non-local correlations for feature transformation. On top of the self-attention, a pyramid pooling cross attention CSA is introduced, and the bidirectional cross attention with layer normalization and residual connection yields

R̃_{R→Q} = L̃_n(CSA(R̂_R, R̂_Q)), R̃_{Q→R} = L̃_n(CSA(R̂_Q, R̂_R)).

A soft attention layer then processes the latent relationships between R̃_{R→Q} and R̃_{Q→R} in terms of image transformation and preservation, so that the final cross-modal composition is

SOA_{Q→R}(R̃_{R→Q}, R̃_{Q→R}),

where SOA_{Q→R}(·) is the soft attention operation.
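A minimal sketch of such a modification block follows (illustrative only): plain multi-head self-attention stands in for the pyramid pooling attention PSA, and a learned sigmoid gate stands in for the soft attention SOA; the pooling of the query-side stream is an assumption made to reconcile sequence lengths.

```python
import torch
import torch.nn as nn

class CrossModalModify(nn.Module):
    """Sketch: per-modality self-attention with LayerNorm + residual,
    bidirectional cross-attention, then a gate over modify vs. preserve."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.csa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.gate = nn.Linear(2 * dim, dim)           # stand-in for soft attention SOA

    def forward(self, R_R, R_Q):                      # (B, Nk, d), (B, Mk, d)
        R_hat_R = self.ln1(R_R + self.msa(R_R, R_R, R_R)[0])  # self-attended reference
        R_hat_Q = self.ln1(R_Q + self.msa(R_Q, R_Q, R_Q)[0])  # self-attended query
        R2Q = self.ln2(R_hat_R + self.csa(R_hat_R, R_hat_Q, R_hat_Q)[0])  # R attends to Q
        Q2R = self.ln2(R_hat_Q + self.csa(R_hat_Q, R_hat_R, R_hat_R)[0])  # Q attends to R
        pooled_Q2R = Q2R.mean(dim=1, keepdim=True).expand_as(R2Q)
        # gate decides, per reference token, what to modify vs. preserve
        w = torch.sigmoid(self.gate(torch.cat([R2Q, pooled_Q2R], dim=-1)))
        return w * R2Q + (1 - w) * pooled_Q2R         # final cross-modal composition

mod = CrossModalModify()
comp = mod(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
```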
the characterization absorption synthesis module comprises a self-attention layer and a residual attention layer, and is constructed according to a hierarchical sequence; for absorbing meaningful information from the local feature space and then creating an information composition to enhance robustness of subsequent query target matches; characterization at local and global level, R respectively L ∈R Nc×dc And R is G ∈R Nc×dc Characterizing the absorption synthesis module to achieve the slave R L Absorbs meaningful and distinguishing information as R G Generating a priori knowledge guidance of the synthesis characterization; the absorption process is as follows,
after self-attention modeling, use is made ofAnd->Generating an intermediate embedding, the global semantic representation obtaining a higher weight, and +.>As a priori guidance for the generation of the combination, the demander is then known to ask for meaningful information, via a residual attention mechanism, as shown in the following equation,
wherein, [ ·, ]]Representing series operation, T l Representing a nonlinear transformation;
fusion embedding performs layer normalization of residual connection. And input into the feedforward layer to obtain the final absorption characteristic representation R AC Such a combination represents R AC Absorbs useful knowledge from the local features, improves the accuracy of query matching targets,
wherein,
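A minimal sketch of this absorption step is given below (illustrative only; the layer widths and the GELU nonlinearity used for T_l are assumptions).

```python
import torch
import torch.nn as nn

class AbsorbSynthesize(nn.Module):
    """Sketch: global tokens (the prior) query the local tokens through
    residual attention; concatenation + nonlinearity stands in for T_l([.,.]),
    followed by LayerNorm with residual and a feed-forward layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.res_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_l = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())
        self.ln = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, R_L, R_G):                      # local (B, Nl, d), global (B, Ng, d)
        R_L = R_L + self.self_attn(R_L, R_L, R_L)[0]  # self-attention modeling
        R_G = R_G + self.self_attn(R_G, R_G, R_G)[0]
        absorbed = R_G + self.res_attn(R_G, R_L, R_L)[0]  # global absorbs local knowledge
        mixed = self.t_l(torch.cat([absorbed, R_G], dim=-1))  # intermediate embedding R_M
        return self.ffn(self.ln(mixed + R_G))         # final absorbed representation R_AC

m = AbsorbSynthesize()
r_ac = m(torch.randn(2, 49, 256), torch.randn(2, 8, 256))
```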
C. Mixed counterfactual samples are constructed, comprising the following steps.
C1. Constructing image-independent and text-independent counterfactual samples.
Given a reference image I_R and its corresponding query text T_Q, BERT is used as a pre-trained bidirectional language model to find, by language similarity, the texts with minimum relevance and their corresponding images; pairing these texts and images with the original reference image and query text forms the image-independent and text-independent counterfactual samples.
C2. Constructing context-preserving counterfactual samples.
First, words of the query text T_Q are masked according to a priori known attributes, preliminary candidate words are generated and replaced by random words to obtain k_1 samples; these are then fed into BERT together with the original T_Q to compute semantic similarity; finally, the top k_2 texts are selected as context-preserving queries for the reference image, where the selected context-preserving negative samples are ranked by the BERT-measured probability P_s and k_2 is the number selected in the second stage.
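The two-stage construction of context-preserving samples can be sketched as follows (illustrative only): the attribute-word set, the replacement vocabulary, and the mean-pooled BERT sentence vectors with cosine similarity are assumptions standing in for the probability P_s.

```python
import random
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def sent_vec(s: str) -> torch.Tensor:
    enc = tok(s, return_tensors="pt")
    return bert(**enc).last_hidden_state.mean(dim=1).squeeze(0)

def context_preserving_negatives(query: str, attribute_words, vocab, k1=20, k2=5):
    words = query.split()
    # stage 1: mask attribute words and replace them with random words (k1 samples)
    candidates = [" ".join(random.choice(vocab) if w in attribute_words else w
                           for w in words)
                  for _ in range(k1)]
    # stage 2: rank candidates by BERT semantic similarity to the original T_Q
    q = sent_vec(query)
    scored = sorted(candidates,
                    key=lambda c: torch.cosine_similarity(q, sent_vec(c), dim=0).item(),
                    reverse=True)
    return scored[:k2]                                # top-k2 context-preserving queries

negs = context_preserving_negatives(
    "replace the red dress with a blue one",
    attribute_words={"red", "blue"},
    vocab=["green", "yellow", "striped", "long"])
```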
D. A local feature module, a global-local feature absorption module and a global semantic module are established. The global-local combination is modeled with the representation absorption synthesis module, so that the designed transformer-based modification and absorption blocks capture local-global information across layers, at different scales and from different modalities, from bottom-level local features to top-level global semantics; the implicit visual modifications and preservations in the reference image, which vary with the text modifier, are captured from the final composite representation derived from the bottom-up hierarchical combination.
The local feature module uses the cross-modal local representation modification module on the local visual and text representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for the robustness of the subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the latent embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and language domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation and the latent vector of the query text's global semantics.
E. Matching pairs M((I_R, T_Q), I_T) are learned to retrieve the multi-modal combination; a θ-parameterized excitation is then added, and the learned image-text composite representation is finally aligned uniquely with the visual representation of the ground-truth target image by maximizing the similarity

k(φ(I_R, T_Q), ψ(I_T)),

where θ denotes the parameterized excitation, k(·,·) computes similarity, and φ(·) and ψ(·) are the composite encoder and the image encoder, respectively.
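One plausible reading of this objective, with cosine similarity as k(·,·) and θ as a learned scaling of the score, is sketched below; the patent does not fix the form of the excitation, so the temperature-style parameter is an assumption.

```python
import torch
import torch.nn.functional as F

# theta: a learnable scalar excitation applied to the similarity score (assumed form)
theta = torch.nn.Parameter(torch.ones(1))

def match_score(phi_out: torch.Tensor, psi_out: torch.Tensor) -> torch.Tensor:
    """k(phi(I_R, T_Q), psi(I_T)) with a theta-scaled cosine similarity."""
    return theta * F.cosine_similarity(phi_out, psi_out, dim=-1)

score = match_score(torch.randn(4, 256), torch.randn(4, 256))  # one score per pair
```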
F. The retrieval result is judged with the loss function.
The loss function consists of a bidirectional triplet loss, a reconstruction loss and a domain alignment loss.
The bidirectional triplet loss is derived from second-degree contrastive negative samples and constructs the fine-grained query-target correspondence between the input query and the target image, ensuring that the composition semantically matches the target representation with high similarity. It is defined as

L_tri(X, Y, m) = max(0, ||X^+ - Y||_2 - ||X^- - Y||_2 + m)

L_bid(C_que, C_tar, m, m_a) = λ_q · L_tri(C_que, C_tar, m) + λ_i · L_tri(C_tar, C_que, m_a)

where X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ||·||_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar. The adaptive margin m_a is controlled by the hyperparameter α: when s_qt approaches 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training.
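A sketch of these two losses follows (illustrative only): the exponential schedule for the adaptive margin m_a is an assumption, since the text only fixes its limiting behaviour at s_qt near 0.

```python
import torch
import torch.nn.functional as F

def l_tri(pos, neg, anchor, m):
    """L_tri(X, Y, m) = max(0, ||X+ - Y||_2 - ||X- - Y||_2 + m), averaged over the batch."""
    return F.relu((pos - anchor).norm(dim=-1) - (neg - anchor).norm(dim=-1) + m).mean()

def l_bid(c_que, c_tar, c_que_neg, c_tar_neg, m=0.2,
          lambda_q=1.0, lambda_i=1.0, alpha=1.0, m_max=0.4, m_min=0.05):
    # adaptive margin m_a: maximal when query/target similarity s_qt is near 0,
    # minimal otherwise (the exact decay in s_qt is assumed here)
    s_qt = F.cosine_similarity(c_que, c_tar, dim=-1).mean().clamp(min=0)
    m_a = m_min + (m_max - m_min) * torch.exp(-alpha * s_qt)
    return (lambda_q * l_tri(c_que, c_que_neg, c_tar, m)
            + lambda_i * l_tri(c_tar, c_tar_neg, c_que, m_a))

loss = l_bid(torch.randn(8, 256), torch.randn(8, 256),
             torch.randn(8, 256), torch.randn(8, 256))
```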
reconstruction loss L res Constraint C que Mapping of vision and language, defined by R img And R is text Representation, respectively with potential embedding C tar Andalignment, the goal is to embed a balanced utilization of text and images by reconstructing the specification and enhancing the combination,
wherein lambda is img And lambda (lambda) text Super parameters for pre-training;
the domain alignment penalty is then a further learning of the fine-grained semantic correspondence of the combined domain and the target image domain, where the representation distributions of the different domains are aligned using the optimal transmission OT to make up the gap between them. First, a cost matrix c between the characteristic distribution of the synthesized domain and the target domain is calculated m Secondly, classifying each feature into the features from another mode according to different weights to complete the prediction of the corresponding relation, utilizing Wasserstein Distance to correspond the synthesized domain and the target domain, wherein WassersteinDistance and alignment loss are shown in the following formula,
L ali =λ a W d (C,T)。
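The OT alignment can be sketched with entropic-regularized Sinkhorn iterations (illustrative only; ε and the iteration count are assumed hyperparameters, and uniform marginals are assumed over the token features of each domain).

```python
import torch

def sinkhorn_wasserstein(C_feats, T_feats, eps=0.05, iters=50):
    """Entropic-regularized Wasserstein distance between composed-domain and
    target-domain token features, computed with Sinkhorn scaling updates."""
    cost = torch.cdist(C_feats, T_feats, p=2)         # cost matrix c_m between distributions
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                    # uniform marginals (assumed)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):                            # alternating Sinkhorn updates
        u = mu / (K @ (nu / (K.t() @ u)))
    v = nu / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)        # soft correspondence weights
    return (plan * cost).sum()                        # approximate W_d(C, T)

lambda_a = 0.1
L_ali = lambda_a * sinkhorn_wasserstein(torch.randn(49, 256), torch.randn(49, 256))
```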
in the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
The foregoing has shown and described the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and description merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (6)
1. A retrieval method based on cross-modal semantics and mixed counterfactual training, characterized by comprising the following steps:
A. establishing a multi-granularity visual representation module and a global-local text embedding module to obtain representations of the reference image I_R, the target image I_T and the query text T_Q at different layers;
B. establishing a cross-modal representation modification module and a representation absorption synthesis module to form a bottom-up cross-modal semantic synthesis module, and modeling the visual-language representation in three-level cascaded reasoning;
C. constructing mixed counterfactual samples;
D. establishing a local feature module, a global-local feature absorption module and a global semantic module; modeling the global-local combination with the representation absorption synthesis module, so that the designed transformer-based modification and absorption blocks capture local-global information across layers, at different scales and from different modalities, from bottom-level local features to top-level global semantics; capturing the implicit visual modifications and preservations in the reference image, which vary with the text modifier, from a final composite representation derived from the bottom-up hierarchical combination;
E. learning matching pairs M((I_R, T_Q), I_T) to retrieve the multi-modal combination, then adding a θ-parameterized excitation, and finally uniquely aligning the learned image-text composite representation with the visual representation of the ground-truth target image by maximizing the similarity

k(φ(I_R, T_Q), ψ(I_T)),

where θ denotes the parameterized excitation, k(·,·) computes similarity, and φ(·) and ψ(·) are the composite encoder and the image encoder, respectively;
F. judging the retrieval result with a loss function.
2. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 1, characterized in that, in step A:
the multi-granularity visual representation module uses a vision transformer to generate a discriminative representation of the visual content of the image; the shallow layers of the vision transformer extract basic syntactic information, while its deep layers extract more complex semantic information; to further improve feature quality, a linear projection maps the extracted features to the global semantic representation;
the global-local text embedding module uses BERT to tokenize the query text T_Q into a sequence of M sub-word tokens, adds the special tokens [cls] and [end] to the sub-word token sequence, and divides the query text into local word embeddings and a global sentence embedding; for the local word embeddings, the word embedding output of the first layer is taken; as BERT proceeds, contextual token and self-attention interactions are performed over multiple steps, the last BERT layer then represents the global information of the token word embeddings in the given text, and the token word embedding representations are concatenated to form the global sentence embedding.
3. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 2, characterized in that, in step B:
the cross-modal representation modification module consists of a self-attention layer, a bidirectional cross-attention layer and a soft attention layer; to self-discover the latent region-to-region relationships necessary for learning the transformation, given the embedding R_R ∈ R^{N_k×d} of the reference modality R and the embedding R_Q ∈ R^{M_k×d} of the query modality Q, the module learns a combined embedding conditioned on the reference token R and the query token Q, obtained by selectively suppressing and highlighting R_R under the query pattern, thereby learning image retrieval more effectively; both embeddings are fed into a self-attention layer with layer normalization and residual connection,

R̂_R = L̃_n(J(R_R)), R̂_Q = L̃_n(J(R_Q)),

where L_n and L̃_n denote layer normalization and its residual-connected form, PSA denotes the pyramid pooling attention operation, MSA denotes the multi-head self-attention operation, and J(·) applies PSA when its input belongs to the visual modality v and MSA otherwise; the self-attention thus yields the self-attended representations R̂_R of the reference modality and R̂_Q of the query modality;
MSA self-attention captures non-local correlations for feature transformation; on top of the self-attention, a pyramid pooling cross attention CSA is introduced, and the bidirectional cross attention with layer normalization and residual connection yields

R̃_{R→Q} = L̃_n(CSA(R̂_R, R̂_Q)), R̃_{Q→R} = L̃_n(CSA(R̂_Q, R̂_R));

a soft attention layer then processes the latent relationships between R̃_{R→Q} and R̃_{Q→R} in terms of image transformation and preservation, so that the final cross-modal composition is

SOA_{Q→R}(R̃_{R→Q}, R̃_{Q→R}),

where SOA_{Q→R}(·) is the soft attention operation;
the representation absorption synthesis module comprises a self-attention layer and a residual attention layer, constructed in hierarchical order; it absorbs meaningful information from the local feature space and then creates an information composition to enhance the robustness of the subsequent query-target matching; given the local-level and global-level representations R_L ∈ R^{N_c×d_c} and R_G ∈ R^{N_c×d_c}, the representation absorption synthesis module absorbs meaningful and discriminative information from R_L as prior knowledge guidance for R_G to generate the synthesized representation; the absorption process is as follows:
after self-attention modeling, an intermediate embedding is generated from the self-attended local and global representations R̂_L and R̂_G; the global semantic representation receives a higher weight, and R̂_G serves as the prior guidance for generating the combination, which then queries meaningful information from the local features through a residual attention mechanism,

R_M = T_l([R̂_L, R̂_G]),

where [·,·] denotes the concatenation operation and T_l denotes a nonlinear transformation;
the fused embedding undergoes layer normalization with residual connection and is fed into a feed-forward layer to obtain the final absorbed representation R_AC; such a combined representation R_AC absorbs useful knowledge from the local features and improves the accuracy of matching queries to targets.
4. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 3, characterized in that, in step C, constructing the mixed counterfactual samples comprises the following steps:
C1. constructing image-independent and text-independent counterfactual samples: given a reference image I_R and its corresponding query text T_Q, BERT is used as a pre-trained bidirectional language model to find, by language similarity, the texts with minimum relevance and their corresponding images; pairing these texts and images with the original reference image and query text forms the image-independent and text-independent counterfactual samples;
C2. constructing context-preserving counterfactual samples: first, words of the query text T_Q are masked according to a priori known attributes, preliminary candidate words are generated and replaced by random words to obtain k_1 samples; these are then fed into BERT together with the original T_Q to compute semantic similarity; finally, the top k_2 texts are selected as context-preserving queries for the reference image, where the selected context-preserving negative samples are ranked by the BERT-measured probability P_s and k_2 is the number selected in the second stage.
5. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 4, characterized in that, in step D:
the local feature module uses the cross-modal local representation modification module on the local visual and text representations to learn the local feature combination;
the global-local feature absorption module absorbs meaningful and discriminative information from the local feature combination, which provides prior guidance from the local feature layer for the robustness of the subsequent synthesis-target matching, and uses the representation absorption synthesis module to derive the latent embedding of the global-local absorption combination;
the global semantic composition module models the final composition from the visual and language domains, updating the output of the stream by aggregating the basic semantics of the intermediate representation and the latent vector of the query text's global semantics.
6. The retrieval method based on cross-modal semantics and mixed counterfactual training according to claim 5, characterized in that, in step F, the loss function consists of a bidirectional triplet loss, a reconstruction loss and a domain alignment loss;
the bidirectional triplet loss is derived from second-degree contrastive negative samples and constructs the fine-grained query-target correspondence between the input query and the target image, ensuring that the composition semantically matches the target representation with high similarity; it is defined as

L_tri(X, Y, m) = max(0, ||X^+ - Y||_2 - ||X^- - Y||_2 + m)

L_bid(C_que, C_tar, m, m_a) = λ_q · L_tri(C_que, C_tar, m) + λ_i · L_tri(C_tar, C_que, m_a)

where X^+ and X^- are positive and negative samples, λ_q and λ_i are weight hyperparameters, ||·||_2 is the L_2 distance, and s_qt denotes the similarity between C_que and C_tar; the adaptive margin m_a is controlled by the hyperparameter α: when s_qt approaches 0, m_a takes its maximum value, and otherwise its minimum, achieving the adaptive effect of the counterfactual training;
the reconstruction loss L_res constrains the mapping of C_que to vision and language, represented by R_img and R_text, which are aligned with the latent embedding C_tar and the query text embedding, respectively; the goal is to balance the utilization of text and image embeddings by regularizing the reconstruction and enhancing the combination, with λ_img and λ_text as pre-training hyperparameters;
the domain alignment loss further learns the fine-grained semantic correspondence between the combined domain and the target image domain, aligning the representation distributions of the different domains with optimal transport (OT) to bridge the gap between them; first, the cost matrix c_m between the feature distributions of the synthesized domain and the target domain is computed; then each feature is assigned, with different weights, to the features of the other modality to predict the correspondence, and the Wasserstein distance W_d is used to match the synthesized domain to the target domain, giving the alignment loss

L_ali = λ_a · W_d(C, T).
Priority and Publication
- Application CN202311224075.3A, filed 2023-09-20; published as CN117235114A (pending) on 2023-12-15.

Cited By
- CN117520590A (granted as CN117520590B): Ocean cross-modal image-text retrieval method, system, equipment and storage medium.
- CN117649461A (granted as CN117649461B): Interactive image generation method and system based on space layout and use method thereof.
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination