CN114330279B - Cross-modal semantic consistency recovery method - Google Patents
Cross-modal semantic consistency recovery method
- Publication number: CN114330279B
- Application number: CN202111638661.3A
- Authority: CN (China)
- Prior art keywords: attention, matrix, head, image, sentence
- Prior art date: 2021-12-29
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a cross-modal semantic consistency recovery method. The method first obtains the basic features and context features of the text modality and the image modality, then converts the context features of the two modalities into a semantic common space through linear projection to obtain attention information of cross-modal ordered positions, and finally sorts the out-of-order sentences using the position-ordered attention information, thereby completing coherence recovery of the out-of-order sentences.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a cross-modal semantic consistency recovery method.
Background
Coherence modeling has long been an important research topic and is widely studied in the field of natural language processing. It aims to organize a group of sentences into a coherent text that forms a logically consistent sequence. Although research has made progress, current work on semantic coherence modeling is still confined to the single modality of text. Existing semantic coherence analysis and recovery methods are single-modal: for a group of sentences in the text modality, they usually adopt an encoder-decoder architecture and use a pointer network for sequence prediction.
Semantic coherence, which initially measured whether text is linguistically meaningful, can be extended to a broader sense for evaluating the logical, ordered, and consistent relationships of elements in various modalities. For human beings, coherence modeling is a natural and essential capability for perceiving the world, one that enables us to understand and perceive the world as a whole; coherence modeling of information is therefore very important for promoting human perception and understanding of the physical world.
The current mainstream single-modal semantic coherence analysis and recovery method is an autoregressive attention-based method: basic sentence feature vectors are extracted with a Bi-LSTM; inspired by the attention mechanism, a Transformer variant structure without position encoding is adopted to extract a reliable paragraph representation and eliminate the influence of the sentence input order, yielding the sentence features within the paragraph; the paragraph feature obtained by average pooling initializes the hidden state of a recurrent neural network decoder; and a pointer network recursively predicts the composition of the ordered, coherent paragraph using greedy search or beam search, completing single-modal semantic coherence analysis and recovery.
Existing semantic coherence modeling work mainly focuses on the single modality of text. During encoding, the basic sentence feature vectors are extracted with a bidirectional long short-term memory network, the context features of the sentences are extracted with a self-attention mechanism, and the paragraph feature is obtained by average pooling. During decoding, a pointer network architecture is adopted as the decoder: the decoder consists of long short-term memory units, the basic sentence feature vectors serve as decoder inputs, the first-step input is a zero vector, and the paragraph feature serves as the initial hidden state. Although existing methods can effectively handle single-modal semantic coherence analysis and recovery and improve performance within a single modality, they ignore information integration and semantic consistency across modalities and lack cross-modal information.
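The autoregressive baseline described above can be summarized in a few lines. The sketch below is illustrative only: the module sizes, the mean-pooled sentence encoding, and the dot-product pointer scores are assumptions, not details taken from the prior-art systems.

```python
# Minimal sketch (assumption-labelled) of the single-modal baseline: Bi-LSTM
# sentence encoding, mean-pooled paragraph state, and an LSTM pointer-network
# decoder that greedily picks one sentence per step.
import torch
import torch.nn as nn

d = 512                       # feature size (the embodiment below uses d = 512)
m = 5                         # number of shuffled sentences
word_emb = torch.randn(m, 20, d)          # fake word embeddings: m sentences x 20 tokens

encoder = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
sent_feats, _ = encoder(word_emb)
sent_feats = sent_feats.mean(dim=1)       # (m, d) basic sentence features

paragraph = sent_feats.mean(dim=0)        # average pooling -> paragraph feature

decoder = nn.LSTMCell(d, d)
h, c = paragraph.unsqueeze(0), torch.zeros(1, d)
inp = torch.zeros(1, d)                   # first-step input is a zero vector
mask = torch.zeros(m, dtype=torch.bool)

order = []
for _ in range(m):                        # autoregressive greedy decoding
    h, c = decoder(inp, (h, c))
    scores = sent_feats @ h.squeeze(0)    # pointer scores over sentences
    scores[mask] = float("-inf")          # exclude already-placed sentences
    nxt = int(scores.argmax())
    order.append(nxt)
    mask[nxt] = True
    inp = sent_feats[nxt].unsqueeze(0)    # feed the chosen sentence back in

print("predicted order:", order)
```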
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a cross-modal semantic consistency recovery method which, based on the semantic consistency between the text modality and the image modality, effectively utilizes cross-modal information to guide semantic coherence recovery in the text modality.
In order to achieve the above object, the present invention provides a cross-modal semantic consistency recovery method, which is characterized by comprising the following steps:
(1) Let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics;
(2) Acquiring the basic features of the text modality and the image modality;
(2.1) Acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory (Bi-LSTM) network: input X into the Bi-LSTM network, which outputs the basic sentence features {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1×d;
(2.2) Acquire the basic features of the ordered, coherent images with a convolutional neural network: input Y into the convolutional neural network, which outputs the basic image features {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1×d;
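Step (2) can be sketched as follows. The Bi-LSTM encoder mirrors step (2.1); for step (2.2) the patent only says "a convolutional neural network", so the ResNet-18 backbone, the fake token embeddings, and the mean-pooling over LSTM outputs are assumptions for illustration.

```python
# Sketch of step (2): basic features for both modalities, each as one 1 x d
# vector per element, with d = 512 as in the embodiment.
import torch
import torch.nn as nn
from torchvision.models import resnet18

d, m, n = 512, 5, 5

# (2.1) Bi-LSTM over token embeddings -> one 1 x d vector per sentence.
tokens = torch.randn(m, 20, d)                 # fake token embeddings
bilstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
S, _ = bilstm(tokens)
S = S.mean(dim=1)                              # (m, d) sentence basic features

# (2.2) CNN over images -> one 1 x d vector per image.
images = torch.randn(n, 3, 224, 224)           # fake image batch
cnn = resnet18(weights=None)
cnn.fc = nn.Linear(cnn.fc.in_features, d)      # project to the shared size d
G = cnn(images)                                # (n, d) image basic features

print(S.shape, G.shape)                        # torch.Size([5, 512]) twice
```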
(3) Acquiring the context features of the text modality and the image modality;
(3.1) Acquire the context features of the text modality with a Transformer variant structure from which position embedding is removed;
(3.1.1) Splice the basic features of the sentences to obtain a matrix S of dimension m×d;
(3.1.2) Using the h-head attention layer of a Transformer, first map the basic features S to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = S·W_k^Q, K_k = S·W_k^K, V_k = S·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention:

head_k = softmax(Q_k·K_k^T / √d_k)·V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the context features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the context feature of the i-th sentence;
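The position-free multi-head attention of step (3.1.2) can be sketched as follows; because no position embedding is added, each sentence's context feature does not depend on the slot in which the shuffled sentence was fed in. h = 4 and d = 512 follow the embodiment below; the feed-forward sizes and single-layer depth are assumptions.

```python
# Sketch of step (3.1.2): an h-head self-attention layer with no position
# embedding, computing head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k per head.
import math
import torch
import torch.nn as nn

d, h = 512, 4
dk = d // h                                    # per-head dimension d_k

class OrderInvariantAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)  # stacks W_k^Q for all h heads
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, S):                      # S: (m, d) basic features
        m = S.size(0)
        q = self.wq(S).view(m, h, dk).transpose(0, 1)   # (h, m, dk)
        k = self.wk(S).view(m, h, dk).transpose(0, 1)
        v = self.wv(S).view(m, h, dk).transpose(0, 1)
        att = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(dk), dim=-1)
        heads = (att @ v).transpose(0, 1).reshape(m, d) # concat head_1..head_h
        return self.ffn(heads)                          # context features

ctx = OrderInvariantAttention()(torch.randn(5, d))
print(ctx.shape)                               # torch.Size([5, 512])
```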
(3.2) Acquire the context features of the image modality with a Transformer variant structure in which position embedding is retained;
(3.2.1) Splice the basic features of the images to obtain a matrix G of dimension n×d;
(3.2.2) For the basic feature g_j of each image in G, construct a compact position embedding, denoted p_j;
in g_j, the even dimensions are projection-embedded as p_{j,2l} = sin(j/10000^{2l/d}) and the odd dimensions as p_{j,2l+1} = cos(j/10000^{2l/d});
where p_{j,2l} and p_{j,2l+1} denote the values after projection embedding of the even and odd dimensions respectively, l is a constant, and 2l+1 ∈ [1, d];
after projection embedding over all dimensions, the compact position p_j is obtained;
finally, the compact positions p_j of all images are spliced to obtain the position embedding matrix P_Y of dimension n×d;
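The sinusoidal embedding used here (and again for sentence positions in step (5.1)) follows the patent's formulas directly; the sketch below uses 0-based indices.

```python
# Sketch of steps (3.2.2) and (5.1): even dimensions use
# sin(pos / 10000^(2l/d)), odd dimensions the matching cosine.
import torch

def position_embedding(length: int, d: int) -> torch.Tensor:
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # j or i
    l2 = torch.arange(0, d, 2, dtype=torch.float32)                # 2l
    angle = pos / torch.pow(10000.0, l2 / d)
    P = torch.zeros(length, d)
    P[:, 0::2] = torch.sin(angle)        # p_{j,2l}
    P[:, 1::2] = torch.cos(angle)        # p_{j,2l+1}
    return P                             # (length, d) position embedding matrix

P_img = position_embedding(5, 512)       # image positions P_Y, step (3.2.2)
P_sent = position_embedding(5, 512)      # sentence positions P_X, step (5.1)
print(P_img.shape)                       # torch.Size([5, 512])
```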
(3.2.3) Add the basic features G and the position embedding P_Y, then use the h-head attention layer of the Transformer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (G + P_Y)·W_k^Q, K_k = (G + P_Y)·W_k^K, V_k = (G + P_Y)·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the context features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the context feature of the j-th image;
(4) Acquiring the attention information of cross-modal ordered positions;
(4.1) Convert the context features of the two modalities into a semantic common space through linear projection;
(4.1.1) Linearly project the context features of the two modalities:

c'_i = ReLU(c_i·W_1 + b_1), r'_j = ReLU(r_j·W_2 + b_2)

where W_1 and W_2 are weight parameters, b_1 and b_2 are bias terms, and ReLU(·) is the rectified linear activation function;
(4.1.2) Semantic common-space conversion: splice the linearly projected context features c'_i to obtain the semantic representation matrix M_X of dimension m×d in the text modality, and splice the linearly projected context features r'_j to obtain the semantic representation matrix M_Y of dimension n×d in the image modality;
(4.2) Calculate the semantic correlation Corr between the two modalities from M_X and M_Y;
(4.3) Convert the position embeddings of the ordered images in the image modality into attention information in the text modality by using the semantic correlation between the two modalities;
(4.3.1) Obtain the implicit position information e_i of each sentence in the text modality with an attention mechanism:

α = softmax(Corr)

where the attention weights α act on the image position embedding matrix P_Y to yield the implicit position information {e_1, e_2, …, e_m};
(4.3.2) Splice the sentence context features c_i, then add the implicit position information e_i to obtain the sentence context features F with ordered-position attention information, of dimension m×d;
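Step (4) can be sketched end to end as follows. The dot-product form of Corr and the weighted sum over image position embeddings are assumptions: the patent names the ingredients (W_1, W_2, ReLU, softmax, the image position embeddings) but its formula images are not reproduced in the text.

```python
# Sketch of step (4): project both modalities into a common space, compute
# the cross-modal correlation, and turn ordered image positions into
# position attention for the sentences.
import torch
import torch.nn as nn

d, m, n = 512, 5, 5
S_ctx = torch.randn(m, d)          # sentence context features (step 3.1)
I_ctx = torch.randn(n, d)          # image context features (step 3.2)
P_img = torch.randn(n, d)          # image position embeddings P_Y (step 3.2.2)

proj_s = nn.Sequential(nn.Linear(d, d), nn.ReLU())   # ReLU(c W_1 + b_1)
proj_i = nn.Sequential(nn.Linear(d, d), nn.ReLU())   # ReLU(r W_2 + b_2)

M_x = proj_s(S_ctx)                # semantic representation, text modality
M_y = proj_i(I_ctx)                # semantic representation, image modality

corr = M_x @ M_y.t()               # (m, n) correlation Corr (assumed dot product)
alpha = torch.softmax(corr, dim=-1)            # attention over image positions
E = alpha @ P_img                  # implicit position information per sentence
F = S_ctx + E                      # context features with ordered-position attention
print(F.shape)                     # torch.Size([5, 512])
```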
(5) Coherence recovery of the out-of-order sentences;
(5.1) For the basic feature s_i of each sentence in S, construct a compact position embedding, denoted p_i;
in s_i, the even dimensions are projection-embedded as p_{i,2l} = sin(i/10000^{2l/d}) and the odd dimensions as p_{i,2l+1} = cos(i/10000^{2l/d});
where p_{i,2l} and p_{i,2l+1} denote the values after projection embedding of the even and odd dimensions respectively, l is a constant, and 2l+1 ∈ [1, d];
after projection embedding over all dimensions, the compact position p_i is obtained;
finally, the compact positions p_i of all sentences are spliced to obtain the position embedding matrix P_X of dimension m×d;
(5.2) Use the h-head attention layer of the Transformer to map the position embedding matrix P_X to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P_X·W_k^Q, K_k = P_X·W_k^K, V_k = P_X·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the interaction features between the sentence positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the interaction feature of the i-th sentence position;
(5.3) Acquire the attention feature of each sentence with respect to each position through a multi-head mutual-attention module;
(5.3.1) Splice the interaction features o_i of the sentence positions to obtain a matrix O of dimension m×d;
(5.3.2) Using the h-head attention layer of the Transformer, first map the matrix O to a query matrix Q_k, and then map the matrix F of sentence context features with ordered-position attention information to a key matrix K_k and a value matrix V_k:

Q_k = O·W_k^Q, K_k = F·W_k^K, V_k = F·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {z_1, z_2, …, z_i, …, z_m}, where z_i denotes the attention feature of the sentences with respect to the i-th position;
(5.4) Calculate the probability of each sentence at each position;
(5.4.1) Calculate the probabilities of the i-th sentence over the m positions, where the attention value of the sentence at the i-th position is ω_i:

ptr_i = softmax(ω_i)

where W_p and W_b are the weight matrices and u is the column weight vector used in computing ω_i;
similarly, the probabilities of the i-th sentence over the m positions are calculated by this formula and recorded as the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
(5.4.2) Take the position probability with the maximum probability value in the position probability set as the final probability of the i-th sentence's position, denoted Ptr_i; in the same way, obtain the final probability of each sentence's position, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
(5.5) Sort the out-of-order sentences according to the position probabilities;
starting from the first position, select from the set {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} the sentence with the maximum probability value and place it at the first position, set the probability value of the already-ranked sentence to zero, and repeat until the m-th position is ranked, thereby completing coherence recovery of the out-of-order sentences.
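Steps (5.4) and (5.5) can be sketched as follows. The additive scoring form u^T·tanh(W_p·z + W_b·f) is an assumption consistent with the named parameters W_p, W_b and u; the patent's exact formula for ω is not reproduced in the text.

```python
# Sketch of steps (5.4)-(5.5): pointer-style scores between position features
# and sentence features, softmax-normalised, then greedy selection with
# masking of already-placed sentences.
import torch
import torch.nn as nn

d, m = 512, 5
Z = torch.randn(m, d)              # attention features per position (step 5.3)
F = torch.randn(m, d)              # sentence features with position attention (step 4)

W_p, W_b = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
u = nn.Linear(d, 1, bias=False)    # column weight vector u

# omega[i, j]: attention value of sentence j at position i
omega = u(torch.tanh(W_p(Z).unsqueeze(1) + W_b(F).unsqueeze(0))).squeeze(-1)
ptr = torch.softmax(omega, dim=-1) # ptr_i = softmax(omega_i), one row per position

order, placed = [], torch.zeros(m, dtype=torch.bool)
for i in range(m):                 # greedy selection with mask exclusion
    probs = ptr[i].masked_fill(placed, 0.0)   # zero out already-ranked sentences
    j = int(probs.argmax())
    order.append(j)
    placed[j] = True
print("recovered order:", order)
```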
The object of the invention is achieved as follows:
The cross-modal semantic consistency recovery method of the invention first obtains the basic features and context features of the text modality and the image modality, then converts the context features of the two modalities into a semantic common space through linear projection to obtain attention information of cross-modal ordered positions, and finally sorts the out-of-order sentences using the position-ordered attention information, thereby completing coherence recovery of the out-of-order sentences.
Meanwhile, the cross-modal semantic consistency recovery method of the invention has the following beneficial effects:
(1) The cross-modal semantic coherence analysis and recovery method of the invention effectively extracts the features of elements in different modalities, makes full use of cross-modal position information to assist and promote semantic coherence analysis and recovery within a single modality, and predicts and recovers the element at each position in parallel, thereby improving both the speed and the accuracy of the task;
(2) The invention effectively connects the text modality and the image modality, which share similar semantics, across modalities, benefiting semantic coherence analysis and the introduction of position attention information from the ordered, coherent modality.
Drawings
FIG. 1 is a flow chart of a cross-modal semantic consistency recovery method of the present invention;
Detailed Description
The following description of the embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flowchart of a cross-modal semantic consistency recovery method according to the present invention.
In this embodiment, as shown in FIG. 1, the cross-modal semantic consistency recovery method of the invention includes the following steps:
S1, let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics, and the images are used to assist in restoring the text into an ordered paragraph.
S2, acquiring the basic features of the text modality and the image modality;
S2.1, acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory (Bi-LSTM) network: input X into the Bi-LSTM network, which outputs the basic sentence features {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1×d, with d set to 512;
S2.2, acquire the basic features of the ordered, coherent images with a convolutional neural network: input Y into the convolutional neural network, which outputs the basic image features {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1×d;
S3, acquiring the context features of the text modality and the image modality;
S3.1, in order to exploit contextual semantic relations, the context features of the text modality are acquired with a Transformer variant structure from which position embedding is removed, using a scaled dot-product self-attention mechanism to exploit the context information.
S3.1.1, splice the basic features of the sentences to obtain a matrix S of dimension m×d;
S3.1.2, using the h-head attention layer of a Transformer, first map the basic features S to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = S·W_k^Q, K_k = S·W_k^K, V_k = S·W_k^V

where k ∈ [1, h] denotes the k-th attention head, W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k, and h is set to 4;

each attention head then performs scaled dot-product attention:

head_k = softmax(Q_k·K_k^T / √d_k)·V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the context features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the context feature of the i-th sentence;
S3.2, in order to model the coherent semantic information of the images, the context features of the image modality are acquired with a Transformer variant structure in which position embedding is retained;
S3.2.1, splice the basic features of the images to obtain a matrix G of dimension n×d;
S3.2.2, for the basic feature g_j of each image in G, construct a compact position embedding, denoted p_j;
in g_j, the even dimensions are projection-embedded as p_{j,2l} = sin(j/10000^{2l/d}) and the odd dimensions as p_{j,2l+1} = cos(j/10000^{2l/d});
where p_{j,2l} and p_{j,2l+1} denote the values after projection embedding of the even and odd dimensions respectively, l is a constant, and 2l+1 ∈ [1, d];
after projection embedding over all dimensions, the compact position p_j is obtained;
finally, the compact positions p_j of all images are spliced to obtain the position embedding matrix P_Y of dimension n×d;
S3.2.3, add the basic features G and the position embedding P_Y, then use the h-head attention layer of the Transformer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (G + P_Y)·W_k^Q, K_k = (G + P_Y)·W_k^K, V_k = (G + P_Y)·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the context features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the context feature of the j-th image;
S4, acquiring the attention information of cross-modal ordered positions;
S4.1, in order to exploit the cross-modal order information from the image modality, the semantic consistency between the two modalities is connected through a cross-modal position attention module. First, the context features of the two modalities are converted into a semantic common space through linear projection;
S4.1.1, linearly project the context features of the two modalities:

c'_i = ReLU(c_i·W_1 + b_1), r'_j = ReLU(r_j·W_2 + b_2)

where W_1 and W_2 are weight parameters, b_1 and b_2 are bias terms, and ReLU(·) is the rectified linear activation function;
S4.1.2, semantic common-space conversion: splice the linearly projected context features c'_i to obtain the semantic representation matrix M_X of dimension m×d in the text modality, and splice the linearly projected context features r'_j to obtain the semantic representation matrix M_Y of dimension n×d in the image modality;
S4.2, calculate the semantic correlation Corr between the two modalities from M_X and M_Y;
S4.3, convert the position embeddings of the ordered images in the image modality into attention information in the text modality by using the semantic correlation between the two modalities;
S4.3.1, obtain the implicit position information e_i of each sentence in the text modality with an attention mechanism:

α = softmax(Corr)

where the attention weights α act on the image position embedding matrix P_Y to yield the implicit position information {e_1, e_2, …, e_m};
S4.3.2, splice the sentence context features c_i, then add the implicit position information e_i to obtain the sentence context features F with ordered-position attention information, of dimension m×d;
S5, coherence recovery of the out-of-order sentences;
S5.1, for the basic feature s_i of each sentence in S, construct a compact position embedding, denoted p_i;
in s_i, the even dimensions are projection-embedded as p_{i,2l} = sin(i/10000^{2l/d}) and the odd dimensions as p_{i,2l+1} = cos(i/10000^{2l/d});
where p_{i,2l} and p_{i,2l+1} denote the values after projection embedding of the even and odd dimensions respectively, l is a constant, and 2l+1 ∈ [1, d];
finally, the compact positions p_i of all sentences are spliced to obtain the position embedding matrix P_X of dimension m×d;
S5.2, use the h-head attention layer of the Transformer to map the position embedding matrix P_X to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P_X·W_k^Q, K_k = P_X·W_k^K, V_k = P_X·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the interaction features between the sentence positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the interaction feature of the i-th sentence position;
S5.3, acquire the attention feature of each sentence with respect to each position through a multi-head mutual-attention module;
S5.3.1, splice the interaction features o_i of the sentence positions to obtain a matrix O of dimension m×d;
S5.3.2, using the h-head attention layer of the Transformer, first map the matrix O to a query matrix Q_k, and then map the matrix F of sentence context features with ordered-position attention information to a key matrix K_k and a value matrix V_k:

Q_k = O·W_k^Q, K_k = F·W_k^K, V_k = F·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {z_1, z_2, …, z_i, …, z_m}, where z_i denotes the attention feature of the sentences with respect to the i-th position;
S5.4, calculate the probability of each sentence at each position;
S5.4.1, calculate the probabilities of the i-th sentence over the m positions, where the attention value of the sentence at the i-th position is ω_i:

ptr_i = softmax(ω_i)

where W_p and W_b are the weight matrices and u is the column weight vector used in computing ω_i;
similarly, the probabilities of the i-th sentence over the m positions are calculated by this formula and recorded as the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
S5.4.2, take the position probability with the maximum probability value in the position probability set as the final probability of the i-th sentence's position, denoted Ptr_i; in the same way, obtain the final probability of each sentence's position, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
S5.5, sort the out-of-order sentences according to the position probabilities;
starting from the first position, select from the set {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} the sentence with the maximum probability value and place it at the first position, set the probability value of the already-ranked sentence to zero, and repeat until the m-th position is ranked, thereby completing coherence recovery of the out-of-order sentences.
In this embodiment, the invention is evaluated on several common datasets, including the SIND and TACoS visual storytelling and story understanding corpora, whose data contain both text and pictures. Perfect Match Ratio (PMR), accuracy (Acc) and the τ metric are adopted as evaluation indicators. PMR measures whether the positions of all elements are predicted correctly as a whole. Acc is a looser indicator that measures the accuracy of absolute position prediction for individual elements. The τ metric measures the relative order between all pairs of elements in the prediction and is closer to human judgment.
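The three indicators can be sketched as follows, under the definitions usual for sentence ordering (the patent describes them in prose only): PMR counts perfectly recovered sequences, Acc counts correctly placed single elements, and τ scores pairwise relative order.

```python
# Sketch of the PMR, Acc and tau evaluation indicators for predicted orders.
from itertools import combinations

def pmr(preds: list[list[int]], golds: list[list[int]]) -> float:
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def acc(preds: list[list[int]], golds: list[list[int]]) -> float:
    hits = total = 0
    for p, g in zip(preds, golds):
        hits += sum(a == b for a, b in zip(p, g))
        total += len(g)
    return hits / total

def kendall_tau(pred: list[int], gold: list[int]) -> float:
    pos = {s: i for i, s in enumerate(pred)}
    pairs = list(combinations(gold, 2))
    concordant = sum(pos[a] < pos[b] for a, b in pairs)
    return 2 * concordant / len(pairs) - 1

print(pmr([[0, 1, 2]], [[0, 1, 2]]),          # 1.0
      acc([[0, 2, 1]], [[0, 1, 2]]),          # 0.333...
      kendall_tau([0, 2, 1], [0, 1, 2]))      # 0.333...
```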
Sentence coherence recovery is performed with the method of the invention and with existing methods; the experimental results are shown in Table 1, where LSTM+PtrNet is the long short-term memory network plus pointer network method, AON-UM is the single-modal autoregressive attention recovery method, AON-CM is the cross-modal autoregressive attention recovery method, NAD-UM is the single-modal non-autoregressive recovery method, NAD-CM1 is the cross-modal non-autoregressive method without position embedding and position attention, NAD-CM2 is the cross-modal non-autoregressive method without position attention, NAD-CM3 is the cross-modal non-autoregressive method without position embedding, NACON (no exl) is the method of the invention without greedy selection and mask exclusion, and NACON is the method of the invention. The experimental results show that the performance of the cross-modal semantic coherence analysis and recovery method greatly surpasses that of the existing single-modal methods. The comparison with NAD-CM1, NAD-CM2 and NAD-CM3 on the various evaluation indicators demonstrates the effectiveness of the cross-modal position attention information. In addition, the performance is also significantly improved compared with AON-CM and NACON (no exl), verifying the effectiveness of the greedy-selection and mask-exclusion inference scheme of the coherence recovery method designed by the invention.
Table 1 shows the experimental results on the SIND and TACoS datasets.
although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited to the scope of the embodiments, and various changes may be made apparent to those skilled in the art as long as they are within the spirit and scope of the present invention as defined and defined by the appended claims, and all matters of the invention which utilize the inventive concepts are protected.
Claims (1)
1. A cross-modal semantic consistency recovery method is characterized by comprising the following steps:
(1) Let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics;
(2) Acquiring the basic features of the text modality and the image modality;
(2.1) Acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory (Bi-LSTM) network: input X into the Bi-LSTM network, which outputs the basic sentence features {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1×d;
(2.2) Acquire the basic features of the ordered, coherent images with a convolutional neural network: input Y into the convolutional neural network, which outputs the basic image features {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1×d;
(3) Acquiring the context features of the text modality and the image modality;
(3.1) Acquire the context features of the text modality with a Transformer variant structure from which position embedding is removed;
(3.1.1) Splice the basic features of the sentences to obtain a matrix S of dimension m×d;
(3.1.2) Using the h-head attention layer of a Transformer, first map the basic features S to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = S·W_k^Q, K_k = S·W_k^K, V_k = S·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention:

head_k = softmax(Q_k·K_k^T / √d_k)·V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the context features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the context feature of the i-th sentence;
(3.2) Acquire the context features of the image modality with a Transformer variant structure in which position embedding is retained;
(3.2.1) Splice the basic features of the images to obtain a matrix G of dimension n×d;
(3.2.2) For the basic feature g_j of each image in G, construct a compact position embedding, denoted p_j;
in g_j, the even dimensions are projection-embedded as p_{j,2l} = sin(j/10000^{2l/d}) and the odd dimensions as p_{j,2l+1} = cos(j/10000^{2l/d});
where p_{j,2l} and p_{j,2l+1} denote the values after projection embedding of the even and odd dimensions respectively, l is a constant, and 2l+1 ∈ [1, d];
after projection embedding over all dimensions, the compact position p_j is obtained;
finally, the compact positions p_j of all images are spliced to obtain the position embedding matrix P_Y of dimension n×d;
(3.2.3) Add the basic features G and the position embedding P_Y, then use the h-head attention layer of the Transformer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (G + P_Y)·W_k^Q, K_k = (G + P_Y)·W_k^K, V_k = (G + P_Y)·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the context features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the context feature of the j-th image;
(4) Acquiring the attention information of cross-modal ordered positions;
(4.1) Convert the context features of the two modalities into a semantic common space through linear projection;
(4.1.1) Linearly project the context features of the two modalities:

c'_i = ReLU(c_i·W_1 + b_1), r'_j = ReLU(r_j·W_2 + b_2)

where W_1 and W_2 are weight parameters, b_1 and b_2 are bias terms, and ReLU(·) is the rectified linear activation function;
(4.1.2) Semantic common-space conversion;
splice the linearly projected context features c'_i to obtain the semantic representation matrix M_X of dimension m×d in the text modality;
splice the linearly projected context features r'_j to obtain the semantic representation matrix M_Y of dimension n×d in the image modality;
(4.2) Calculate the semantic correlation Corr between the two modalities from M_X and M_Y;
(4.3) Convert the position embeddings of the ordered images in the image modality into attention information in the text modality by using the semantic correlation between the two modalities;
(4.3.1) Obtain the implicit position information e_i of each sentence in the text modality with an attention mechanism:

α = softmax(Corr)

where the attention weights α act on the image position embedding matrix P_Y to yield the implicit position information {e_1, e_2, …, e_m};
(4.3.2) Splice the sentence context features c_i, then add the implicit position information e_i to obtain the sentence context features F with ordered-position attention information, of dimension m×d;
(5) Coherence recovery of the out-of-order sentences;
(5.1) For the basic feature s_i of each sentence in S, construct a compact position embedding, denoted p_i;
in s_i, the even dimensions are projection-embedded as p_{i,2l} = sin(i/10000^{2l/d}) and the odd dimensions as p_{i,2l+1} = cos(i/10000^{2l/d});
where p_{i,2l} and p_{i,2l+1} denote the values after projection embedding of the even and odd dimensions respectively, l is a constant, and 2l+1 ∈ [1, d];
after projection embedding over all dimensions, the compact position p_i is obtained;
finally, the compact positions p_i of all sentences are spliced to obtain the position embedding matrix P_X of dimension m×d;
(5.2) Use the h-head attention layer of the Transformer to map the position embedding matrix P_X to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P_X·W_k^Q, K_k = P_X·W_k^K, V_k = P_X·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the interaction features between the sentence positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the interaction feature of the i-th sentence position;
(5.3) Acquire the attention feature of each sentence with respect to each position through a multi-head mutual-attention module;
(5.3.1) Splice the interaction features o_i of the sentence positions to obtain a matrix O of dimension m×d;
(5.3.2) Using the h-head attention layer of the Transformer, first map the matrix O to a query matrix Q_k, and then map the matrix F of sentence context features with ordered-position attention information to a key matrix K_k and a value matrix V_k:

Q_k = O·W_k^Q, K_k = F·W_k^K, V_k = F·W_k^V

where k ∈ [1, h] denotes the k-th attention head, and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d×d_k;

each attention head then performs scaled dot-product attention head_k = softmax(Q_k·K_k^T / √d_k)·V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;

finally, the outputs {head_1, head_2, …, head_h} of all attention heads are concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {z_1, z_2, …, z_i, …, z_m}, where z_i denotes the attention feature of the sentences with respect to the i-th position;
(5.4) Calculate the probability of each sentence at each position;
(5.4.1) Calculate the probabilities of the i-th sentence over the m positions, where the attention value of the sentence at the i-th position is ω_i:

ptr_i = softmax(ω_i)

where W_p and W_b are the weight matrices and u is the column weight vector used in computing ω_i;
similarly, the probabilities of the i-th sentence over the m positions are calculated by this formula and recorded as the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
(5.4.2) Take the position probability with the maximum probability value in the position probability set as the final probability of the i-th sentence's position, denoted Ptr_i; in the same way, obtain the final probability of each sentence's position, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
(5.5) Sort the out-of-order sentences according to the position probabilities;
starting from the first position, select from the set {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} the sentence with the maximum probability value and place it at the first position, set the probability value of the already-ranked sentence to zero, and repeat until the m-th position is ranked, thereby completing coherence recovery of the out-of-order sentences.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202111638661.3A (CN114330279B) | 2021-12-29 | 2021-12-29 | Cross-modal semantic consistency recovery method |
Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202111638661.3A (CN114330279B) | 2021-12-29 | 2021-12-29 | Cross-modal semantic consistency recovery method |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN114330279A | 2022-04-12 |
| CN114330279B | 2023-04-18 |
Family

ID=81016638

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202111638661.3A (granted as CN114330279B, active) | Cross-modal semantic consistency recovery method | 2021-12-29 | 2021-12-29 |
Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN114330279B (en) |
Citations (6)

| Publication number | Priority date | Publication date | Title |
| --- | --- | --- | --- |
| CN108897852A | 2018-06-29 | 2018-11-27 | Judgment method, device and equipment of conversation content continuity |
| CN110472242A | 2019-08-05 | 2019-11-19 | Text processing method, device and computer-readable storage medium |
| CN111951207A | 2020-08-25 | 2020-11-17 | Image quality enhancement method based on deep reinforcement learning and semantic loss |
| CN112966127A | 2021-04-07 | 2021-06-15 | Cross-modal retrieval method based on multilayer semantic alignment |
| CN112991350A | 2021-02-18 | 2021-06-18 | RGB-T image semantic segmentation method based on modal difference reduction |
| CN113378546A | 2021-06-10 | 2021-09-10 | Non-autoregressive sentence sequencing method |
Family Cites Families (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| US8443337B2 | 2008-03-11 | 2013-05-14 | Intel Corporation | Methodology and tools for tabled-based protocol specification and model generation |
| US11816565B2 | 2019-10-16 | 2023-11-14 | Apple Inc. | Semantic coherence analysis of deep neural networks |

2021-12-29: application CN202111638661.3A filed in China; patent CN114330279B granted and active.
Non-Patent Citations (2)

| Title |
| --- |
| Zhiliang Wu et al., "DAPC-Net: Deformable Alignment and Pyramid Context Completion Networks for Video Inpainting," IEEE Signal Processing Letters, 2021, vol. 28, pp. 1145-1149. |
| Li Jingyu et al., "Document-level machine translation based on a joint attention mechanism" (基于联合注意力机制的篇章级机器翻译), Journal of Chinese Information Processing (中文信息学报), 2019, vol. 33, no. 12, pp. 45-53. |
Also Published As

| Publication number | Publication date |
| --- | --- |
| CN114330279A | 2022-04-12 |
Legal Events

| Date | Code | Title |
| --- | --- | --- |
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |