CN114330279B - Cross-modal semantic consistency recovery method - Google Patents

Cross-modal semantic consistency recovery method

Info

Publication number: CN114330279B
Authority: CN (China)
Prior art keywords: attention, matrix, head, image, sentence
Legal status: Active (granted)
Application number: CN202111638661.3A
Other languages: Chinese (zh)
Other versions: CN114330279A
Inventors: 杨阳, 史文浩, 宾燚
Current Assignee: University of Electronic Science and Technology of China
Original Assignee: University of Electronic Science and Technology of China
Application filed by: University of Electronic Science and Technology of China
Priority/filing date: 2021-12-29 (priority to CN202111638661.3A)
Publication of CN114330279A: 2022-04-12
Publication of CN114330279B (application granted): 2023-04-18

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a cross-modal semantic consistency recovery method. The method first obtains the basic features and contextual features of a text modality and an image modality, then converts the contextual features of the two modalities into a common semantic space through linear projection to obtain the attention information of the cross-modal ordered positions, and finally orders the out-of-order sentences using the attention information of the ordered positions, thereby completing the coherence recovery of the out-of-order sentences.

Description

Cross-modal semantic consistency recovery method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a cross-modal semantic consistency recovery method.
Background
Coherence modeling has long been an important research topic and is widely studied in natural language processing; its goal is to organize a group of sentences into a coherent text that forms a logically consistent sequence. Although some progress has been made, current research on semantic coherence modeling remains restricted to the single modality of text. Existing semantic coherence analysis and recovery methods are single-modal: for a group of sentences in the text modality, an encoder-decoder architecture is usually adopted and a pointer network is used for sequence prediction.
Semantic coherence, which originally measured whether a text is linguistically meaningful, can be extended to a broader notion that evaluates the logical, ordered and consistent relationships among elements of various modalities. For humans, coherence modeling is a natural and essential capability for perceiving the world, enabling us to understand and perceive the world as a whole, so coherence modeling of information is very important for promoting the perception and understanding of the physical world.
The current mainstream single-modal semantic coherence analysis and recovery method is an autoregressive attention-based method. Basic sentence feature vectors are extracted with a Bi-LSTM. Inspired by the attention mechanism, a Transformer variant without position encoding is adopted to extract a reliable paragraph representation and eliminate the influence of the sentence input order, yielding the sentence features within the paragraph; the paragraph feature obtained by average pooling initializes the hidden state of a recurrent neural network decoder, and a pointer network recursively predicts the composition of the ordered, coherent paragraph using greedy search or beam search, completing single-modal semantic coherence analysis and recovery.
Existing semantic coherence modeling work therefore focuses mainly on the single modality of text. During encoding, basic sentence feature vectors are extracted with a bidirectional long short-term memory network, contextual sentence features are extracted with a self-attention mechanism, and the paragraph feature is obtained by average pooling. During decoding, a pointer-network architecture is adopted as the decoder: the decoder consists of LSTM units, the basic sentence feature vectors serve as decoder inputs, the input vector of the first step is a zero vector, and the paragraph feature serves as the initial hidden state, as sketched below. Although existing methods can effectively solve single-modal semantic coherence analysis and recovery and further improve performance within a single modality, they ignore the information integration and semantic consistency among multiple modalities and lack cross-modal information.
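For readers less familiar with this prior-art pipeline, the following minimal sketch illustrates the autoregressive pointer-network decoding described above (greedy variant). It assumes PyTorch; the module names (cell, W_p, W_b, u) and the additive scoring form are illustrative assumptions, not the exact prior-art implementation.

```python
# Minimal sketch of single-modal autoregressive pointer-network decoding.
# Assumes PyTorch; all module and variable names are illustrative.
import torch
import torch.nn as nn

def greedy_pointer_decode(sent_feats, paragraph_feat, cell, W_p, W_b, u):
    """sent_feats: (m, d) basic sentence features; paragraph_feat: (d,) mean-pooled paragraph feature."""
    m, d = sent_feats.shape
    h = paragraph_feat.unsqueeze(0)          # hidden state initialised with the paragraph feature
    c = torch.zeros_like(h)
    inp = torch.zeros(1, d)                  # the input vector of the first step is a zero vector
    mask = torch.zeros(m, dtype=torch.bool)
    order = []
    for _ in range(m):
        h, c = cell(inp, (h, c))
        scores = u(torch.tanh(W_p(sent_feats) + W_b(h)))   # (m, 1) additive attention over sentences
        scores = scores.squeeze(-1).masked_fill(mask, float("-inf"))
        nxt = int(scores.argmax())                          # greedy choice (beam search also possible)
        order.append(nxt)
        mask[nxt] = True
        inp = sent_feats[nxt].unsqueeze(0)                  # feed the chosen sentence back in
    return order

# Typical module shapes (assumed):
# cell = nn.LSTMCell(d, d); W_p = nn.Linear(d, d); W_b = nn.Linear(d, d); u = nn.Linear(d, 1)
```

Beam search would keep the k best partial orders at each step instead of the single argmax.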
Disclosure of Invention
The invention aims to overcome the above shortcomings of the prior art and to provide a cross-modal semantic consistency recovery method that, based on the semantic consistency between the text modality and the image modality, effectively uses cross-modal information to guide semantic coherence recovery in the text modality.
In order to achieve the above object, the present invention provides a cross-modal semantic consistency recovery method, which is characterized by comprising the following steps:
(1) Let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics;
(2) Acquire the basic features of the text modality and the image modality;
(2.1) Acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory (Bi-LSTM) network: X is input into the Bi-LSTM, which outputs the basic features of the out-of-order sentences {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1 × d;
(2.2) Acquire the basic features of the ordered, coherent images with a convolutional neural network: Y is input into the CNN, which outputs the basic features of the ordered, coherent images {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1 × d;
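As an illustration of step (2), the sketch below shows one way the two basic-feature encoders could look. The embedding size, the ResNet-18 backbone and the mean pooling over word states are assumptions made for the sake of a runnable example; the patent only specifies a Bi-LSTM for sentences, a CNN for images, and a 1 × d feature per element.

```python
# Illustrative sketch of step (2): a Bi-LSTM sentence encoder and a CNN image encoder,
# each producing one d-dimensional basic feature per element. Hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, d=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, d // 2, batch_first=True, bidirectional=True)

    def forward(self, token_ids):            # token_ids: (m, T) word indices of m sentences
        h, _ = self.bilstm(self.emb(token_ids))
        return h.mean(dim=1)                 # (m, d) sentence basic features s_i (mean pooling is an assumption)

class ImageEncoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        backbone = models.resnet18(weights=None)          # backbone chosen for illustration only
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(backbone.fc.in_features, d)

    def forward(self, images):               # images: (n, 3, H, W)
        feats = self.cnn(images).flatten(1)
        return self.proj(feats)              # (n, d) image basic features g_j
```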
(3) Acquire the contextual features of the text modality and the image modality;
(3.1) Acquire the contextual features of the text modality with a Transformer variant from which position embedding is removed;
(3.1.1) Concatenate the basic features of the sentences into a matrix S = [s_1; s_2; …; s_m] of dimension m × d;
(3.1.2) Using the h-head attention layer of the Transformer, S is first mapped to a query matrix Q_k = S W_k^Q, a key matrix K_k = S W_k^K and a value matrix V_k = S W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the contextual features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the contextual feature of the i-th sentence;
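The multi-head layer of step (3.1.2) is standard scaled dot-product attention applied without any position embedding. The NumPy sketch below mirrors the formulas above; the random weight initialisation and the simple two-layer feed-forward network are placeholders for trained parameters, not the patent's exact architecture.

```python
# Sketch of step (3.1): position-embedding-free multi-head self-attention followed by
# concatenation of the heads and a feed-forward network.
import numpy as np

def multi_head_self_attention(S, h=4):
    m, d = S.shape
    dk = d // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, dk)) / np.sqrt(d) for _ in range(3))
        Q, K, V = S @ Wq, S @ Wk, S @ Wv
        A = Q @ K.T / np.sqrt(dk)
        A = np.exp(A - A.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)          # softmax over key positions
        heads.append(A @ V)                             # head_k, shape (m, dk)
    H = np.concatenate(heads, axis=-1)                  # concatenation of all heads, (m, d)
    W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    return np.maximum(H @ W1, 0.0) @ W2                 # feed-forward network -> contextual features
```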
(3.2) Acquire the contextual features of the image modality with a Transformer variant in which position embedding is retained;
(3.2.1) Concatenate the basic features of the images into a matrix G = [g_1; g_2; …; g_n] of dimension n × d;
(3.2.2) For each image basic feature g_j in G, compute a compact position embedding, denoted p_j;
the projection embedding of the even dimensions is p_{j,2l} = sin(j / 10000^{2l/d}); the projection embedding of the odd dimensions is p_{j,2l+1} = cos(j / 10000^{2l/d}),
where p_{j,2l} and p_{j,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d];
after the projection embedding of all dimensions of g_j, the compact position p_j is obtained;
finally, the compact positions p_j of all images are concatenated into a position-embedding matrix P = [p_1; p_2; …; p_n] of dimension n × d;
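The compact position embedding of step (3.2.2) is the usual sinusoidal encoding; a direct NumPy transcription of the two formulas follows (d is assumed even).

```python
# Sketch of the sinusoidal position embedding of step (3.2.2): even dimensions use sine,
# odd dimensions use cosine, exactly as in the formulas above.
import numpy as np

def sinusoidal_position_embedding(n, d=512):
    P = np.zeros((n, d))
    pos = np.arange(1, n + 1)[:, None]                  # positions j = 1..n
    dims = np.arange(0, d, 2)[None, :]                  # even dimension indices 2l
    P[:, 0::2] = np.sin(pos / 10000 ** (dims / d))      # p_{j,2l}
    P[:, 1::2] = np.cos(pos / 10000 ** (dims / d))      # p_{j,2l+1}
    return P                                            # position-embedding matrix, (n, d)
```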
(3.2.3) The basic features G and the position embedding P are added, and the sum is mapped by the h-head attention layer of the Transformer to a query matrix Q_k = (G + P) W_k^Q, a key matrix K_k = (G + P) W_k^K and a value matrix V_k = (G + P) W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the contextual features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the contextual feature of the j-th image;
(4) Acquire the attention information of the cross-modal ordered positions;
(4.1) Convert the contextual features of the two modalities into a common semantic space through linear projection;
(4.1.1) Linearly project the contextual features of the two modalities:
c'_i = ReLU(W_1 c_i + b_1),  r'_j = ReLU(W_2 r_j + b_2),
where W_1, W_2 are weight parameters, b_1, b_2 are bias terms and ReLU(·) is the rectified linear activation function;
(4.1.2) Common semantic space conversion: the linearly projected contextual features c'_1, …, c'_m are concatenated into the semantic representation matrix H_X of the text modality, and the linearly projected contextual features r'_1, …, r'_n are concatenated into the semantic representation matrix H_Y of the image modality;
(4.2) Compute the semantic correlation Corr between the two modalities:
Corr = H_X H_Y^T;
(4.3) Using the semantic correlation between the two modalities, the position embeddings of the ordered images in the image modality are converted into attention information in the text modality;
(4.3.1) Obtain the implicit position information of each sentence in the text modality with an attention mechanism:
α = softmax(Corr),  E = α P,
where E = [e_1; e_2; …; e_m] and e_i is the implicit position information of the i-th sentence;
(4.3.2) The contextual features of the sentences are concatenated into a matrix C = [c_1; c_2; …; c_m], and the implicit position information E is added to it to obtain the sentence contextual features with ordered position attention information Z = C + E, of dimension m × d;
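Putting step (4) together, the sketch below projects both modalities into the common space, computes the correlation, and adds the image-position attention to the sentence contextual features. The dot-product form of Corr and the matrix shapes follow the reconstruction given above and should be read as assumptions rather than the patent's literal formulas.

```python
# Sketch of step (4): common-space projection, cross-modal correlation, and conversion of the
# ordered image position embeddings into sentence-level position attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_position_attention(C, R, P, W1, b1, W2, b2):
    """C: (m, d) sentence contextual features; R: (n, d) image contextual features;
    P: (n, d) image position embeddings; W1, W2, b1, b2: projection parameters."""
    Hx = np.maximum(C @ W1 + b1, 0.0)        # ReLU linear projection of the text modality
    Hy = np.maximum(R @ W2 + b2, 0.0)        # ReLU linear projection of the image modality
    Corr = Hx @ Hy.T                         # (m, n) semantic correlation (assumed dot-product form)
    alpha = softmax(Corr, axis=-1)
    E = alpha @ P                            # implicit position information for each sentence
    return C + E                             # Z: sentence contextual features with ordered
                                             # position attention information, (m, d)
```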
(5) Recover the coherence of the out-of-order sentences;
(5.1) For each sentence basic feature s_i in S, compute a compact position embedding, denoted p_i;
the projection embedding of the even dimensions is p_{i,2l} = sin(i / 10000^{2l/d}); the projection embedding of the odd dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d}),
where p_{i,2l} and p_{i,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d];
after the projection embedding of all dimensions, the compact position p_i is obtained;
finally, the compact positions p_i of all sentences are concatenated into a position-embedding matrix P' = [p_1; p_2; …; p_m] of dimension m × d;
(5.2) Using the h-head attention layer of the Transformer, the position-embedding matrix P' is first mapped to a query matrix Q_k = P' W_k^Q, a key matrix K_k = P' W_k^K and a value matrix V_k = P' W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the interaction features between sentence positions {f_1, f_2, …, f_i, …, f_m}, where f_i denotes the interaction feature of the i-th sentence position;
(5.3) Acquire the attention feature of each sentence with respect to the positions through a multi-head mutual attention module;
(5.3.1) Concatenate the interaction features of the sentence positions into a matrix F = [f_1; f_2; …; f_m] of dimension m × d;
(5.3.2) Using the h-head attention layer of the Transformer, F is first mapped to a query matrix Q_k = F W_k^Q, and the sentence contextual feature matrix Z with ordered position attention information is mapped to a key matrix K_k = Z W_k^K and a value matrix V_k = Z W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the attention feature with respect to the i-th position;
(5.4) Compute the probability of the position of each sentence;
(5.4.1) Compute the probability that the i-th sentence lies at each of the m positions, where the vector of attention values of the i-th sentence over the m positions is ω_i, with components
ω_{i,i'} = u^T tanh(W_p o_{i'} + W_b z_i),  ptr_i = softmax(ω_i),
where z_i is the i-th row of Z, W_p and W_b are weight matrices and u is a column weight vector;
the probability of every sentence at the m positions is computed in the same way, giving the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
(5.4.2) The position probability with the largest probability value in ptr_i is taken as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, the final probability of the position of each sentence is obtained, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
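The position probabilities of step (5.4) can be sketched as follows. The additive pointer score u^T tanh(W_p o_{i'} + W_b z_i) is a reconstruction (the original formula is only available as an image in the source document), so treat its exact form as an assumption.

```python
# Sketch of step (5.4): pointer-style position scores for every sentence over the m positions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def position_probabilities(O, Z, W_p, W_b, u):
    """O: (m, d) attention features about the m positions; Z: (m, d) sentence features with
    ordered position attention; W_p, W_b: (d, d) weight matrices; u: (d,) column weight vector."""
    m = Z.shape[0]
    ptr = np.zeros((m, m))
    for i in range(m):                                   # i-th sentence
        scores = np.tanh(O @ W_p.T + Z[i] @ W_b.T) @ u   # omega_i over the m positions
        ptr[i] = softmax(scores)
    return ptr                                           # ptr[i, p] = P(sentence i at position p)
```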
(5.5) Sort the out-of-order sentences according to the position probabilities;
starting from the first position, the sentence with the largest probability value in {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} is selected and placed at that position; the probability value of the sentence that has been ordered is then set to zero, and this step is repeated until the m-th position has been filled, completing the coherence recovery of the out-of-order sentences.
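Step (5.5) is greedy selection with mask exclusion. The sketch below operates on the full m × m probability matrix from the previous sketch rather than on the per-sentence maxima Ptr_i, which is a slight simplification of the procedure described above.

```python
# Sketch of step (5.5): greedy selection with mask exclusion over the position probabilities.
import numpy as np

def greedy_order(ptr):
    """ptr: (m, m) matrix, ptr[i, p] = probability that sentence i belongs at position p."""
    m = ptr.shape[0]
    probs = ptr.copy()
    order = [-1] * m
    for p in range(m):                       # from the first position to the m-th
        i = int(np.argmax(probs[:, p]))      # sentence with the largest probability for position p
        order[p] = i
        probs[i, :] = 0.0                    # mask exclusion: an ordered sentence cannot be reused
    return order                             # order[p] = index of the sentence placed at position p

# Example:
# greedy_order(np.array([[0.1, 0.7, 0.2], [0.6, 0.2, 0.2], [0.3, 0.1, 0.6]])) -> [1, 0, 2]
```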
The object of the invention is achieved as follows:
The cross-modal semantic consistency recovery method of the invention first obtains the basic features and contextual features of the text modality and the image modality, then converts the contextual features of the two modalities into a common semantic space through linear projection to obtain the attention information of the cross-modal ordered positions, and finally orders the out-of-order sentences using the attention information of the ordered positions, thereby completing the coherence recovery of the out-of-order sentences.
Meanwhile, the cross-modal semantic consistency recovery method of the invention has the following beneficial effects:
(1) The cross-modal semantic coherence analysis and recovery method of the invention can effectively extract the features of elements in different modalities, makes full use of cross-modal position information to assist and promote semantic coherence analysis and recovery within a single modality, and predicts and recovers the element at each position in parallel, further improving both the speed and the accuracy of the task;
(2) The invention effectively connects the text modality and the image modality, which have similar semantics, across modalities, which benefits semantic coherence analysis and the introduction of position attention information from the ordered, coherent modality.
Drawings
FIG. 1 is a flow chart of a cross-modal semantic consistency recovery method of the present invention;
Detailed Description
The following description of specific embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be noted that, in the following description, detailed descriptions of well-known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flowchart of a cross-modal semantic consistency recovery method according to the present invention.
In this embodiment, as shown in FIG. 1, the cross-modal semantic consistency recovery method of the invention includes the following steps:
S1, let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics, and the images are used to assist in recovering the text into ordered, coherent paragraphs.
S2, acquire the basic features of the text modality and the image modality;
S2.1, acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory network: X is input into the Bi-LSTM, which outputs the basic features of the out-of-order sentences {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1 × d, with d = 512;
S2.2, acquire the basic features of the ordered, coherent images with a convolutional neural network: Y is input into the CNN, which outputs the basic features of the ordered, coherent images {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1 × d;
S3, acquire the contextual features of the text modality and the image modality;
S3.1, to exploit the contextual semantic relations, the contextual features of the text modality are acquired with a Transformer variant from which position embedding is removed, in which a scaled dot-product self-attention mechanism is used to exploit the contextual information;
S3.1.1, concatenate the basic features of the sentences into a matrix S = [s_1; s_2; …; s_m] of dimension m × d;
S3.1.2, using the h-head attention layer of the Transformer, S is first mapped to a query matrix Q_k = S W_k^Q, a key matrix K_k = S W_k^K and a value matrix V_k = S W_k^V, where k ∈ [1, h] denotes the k-th attention head, W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k, and h = 4;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the contextual features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the contextual feature of the i-th sentence;
S3.2, to model the coherent semantic information of the images, the contextual features of the image modality are acquired with a Transformer variant in which position embedding is retained;
S3.2.1, concatenate the basic features of the images into a matrix G = [g_1; g_2; …; g_n] of dimension n × d;
S3.2.2, for each image basic feature g_j in G, compute a compact position embedding, denoted p_j: the projection embedding of the even dimensions is p_{j,2l} = sin(j / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{j,2l+1} = cos(j / 10000^{2l/d}), where p_{j,2l} and p_{j,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions of g_j, the compact position p_j is obtained; finally, the compact positions p_j of all images are concatenated into a position-embedding matrix P = [p_1; p_2; …; p_n] of dimension n × d;
S3.2.3, the basic features G and the position embedding P are added, and the sum is mapped by the h-head attention layer of the Transformer to a query matrix Q_k = (G + P) W_k^Q, a key matrix K_k = (G + P) W_k^K and a value matrix V_k = (G + P) W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads is concatenated and passed through a feed-forward network to obtain the contextual features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the contextual feature of the j-th image;
S4, acquire the attention information of the cross-modal ordered positions;
S4.1, to exploit the cross-modal order information from the image modality, the semantic consistency between the two modalities is connected through a cross-modal position attention module;
first, the contextual features of the two modalities are converted into a common semantic space through linear projection;
S4.1.1, linearly project the contextual features of the two modalities: c'_i = ReLU(W_1 c_i + b_1), r'_j = ReLU(W_2 r_j + b_2), where W_1, W_2 are weight parameters, b_1, b_2 are bias terms and ReLU(·) is the rectified linear activation function;
S4.1.2, common semantic space conversion: the linearly projected contextual features c'_1, …, c'_m are concatenated into the semantic representation matrix H_X in the text modality, and the linearly projected contextual features r'_1, …, r'_n are concatenated into the semantic representation matrix H_Y in the image modality;
S4.2, compute the semantic correlation Corr between the two modalities: Corr = H_X H_Y^T;
S4.3, using the semantic correlation between the two modalities, the position embeddings of the ordered images in the image modality are converted into attention information in the text modality;
S4.3.1, obtain the implicit position information of each sentence in the text modality with an attention mechanism: α = softmax(Corr), E = α P, where E = [e_1; e_2; …; e_m] and e_i is the implicit position information of the i-th sentence;
S4.3.2, the contextual features of the sentences are concatenated into a matrix C = [c_1; c_2; …; c_m], and the implicit position information E is added to it to obtain the sentence contextual features with ordered position attention information Z = C + E, of dimension m × d;
S5, recover the coherence of the out-of-order sentences;
S5.1, for each sentence basic feature s_i in S, compute a compact position embedding, denoted p_i: the projection embedding of the even dimensions is p_{i,2l} = sin(i / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d}), where p_{i,2l} and p_{i,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions, the compact position p_i is obtained; finally, the compact positions p_i of all sentences are concatenated into a position-embedding matrix P' = [p_1; p_2; …; p_m] of dimension m × d;
S5.2, using the h-head attention layer of the Transformer, the position-embedding matrix P' is mapped to a query matrix Q_k = P' W_k^Q, a key matrix K_k = P' W_k^K and a value matrix V_k = P' W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads is concatenated and passed through a feed-forward network to obtain the interaction features between sentence positions {f_1, f_2, …, f_i, …, f_m}, where f_i denotes the interaction feature of the i-th sentence position;
S5.3, acquire the attention feature of each sentence with respect to the positions through a multi-head mutual attention module;
S5.3.1, concatenate the interaction features of the sentence positions into a matrix F = [f_1; f_2; …; f_m] of dimension m × d;
S5.3.2, using the h-head attention layer of the Transformer, F is first mapped to a query matrix Q_k = F W_k^Q, and the sentence contextual feature matrix Z with ordered position attention information is mapped to a key matrix K_k = Z W_k^K and a value matrix V_k = Z W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads is concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the attention feature with respect to the i-th position;
S5.4, compute the probability of the position of each sentence;
S5.4.1, compute the probability that the i-th sentence lies at each of the m positions, where the vector of attention values of the i-th sentence over the m positions is ω_i, with components ω_{i,i'} = u^T tanh(W_p o_{i'} + W_b z_i), and ptr_i = softmax(ω_i), where z_i is the i-th row of Z, W_p and W_b are weight matrices and u is a column weight vector; the probability of every sentence at the m positions is computed in the same way, giving the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
S5.4.2, the position probability with the largest probability value in ptr_i is taken as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, the final probability of the position of each sentence is obtained, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
S5.5, sort the out-of-order sentences according to the position probabilities: starting from the first position, the sentence with the largest probability value in {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} is selected and placed at that position; the probability value of the sentence that has been ordered is then set to zero, and this step is repeated until the m-th position has been filled, completing the coherence recovery of the out-of-order sentences.
In this example, the invention is applied to several common datasets, including SIND and TACoS, two visual storytelling and story-understanding corpora whose data are available in both text and picture form. The invention adopts the perfect match ratio (PMR), the accuracy (Acc) and the τ metric as evaluation indicators, as sketched below. The perfect match ratio (PMR) measures the performance of position prediction over the sequence as a whole. The accuracy (Acc) is a looser indicator that measures the accuracy of absolute position prediction for individual elements. The τ metric measures the relative order of all pairs of elements in the prediction and is closer to human judgment.
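For reference, the three evaluation indicators can be computed as in the sketch below, where predicted and gold are lists of sentence-index sequences; the exact tie-handling conventions used in the experiments are not specified in the patent, so this is only an illustrative implementation.

```python
# Sketch of the three evaluation metrics mentioned above: perfect match ratio (PMR),
# positional accuracy (Acc) and Kendall's tau over all element pairs.
from itertools import combinations

def pmr(predicted, gold):
    """Fraction of sequences whose predicted order matches the gold order exactly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def accuracy(predicted, gold):
    """Fraction of elements placed at their correct absolute position."""
    total = sum(len(g) for g in gold)
    correct = sum(sum(pi == gi for pi, gi in zip(p, g)) for p, g in zip(predicted, gold))
    return correct / total

def kendall_tau(predicted, gold):
    """Average Kendall's tau: agreement on the relative order of all element pairs."""
    taus = []
    for p, g in zip(predicted, gold):
        rank = {e: r for r, e in enumerate(p)}
        pairs = list(combinations(g, 2))
        if not pairs:                        # single-element sequences contribute perfect agreement
            taus.append(1.0)
            continue
        concordant = sum(rank[a] < rank[b] for a, b in pairs)
        taus.append((2 * concordant - len(pairs)) / len(pairs))
    return sum(taus) / len(taus)
```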
Sentence coherence recovery was carried out with the method of the invention and with existing methods; the experimental results are shown in Table 1, where LSTM+PtrNet is the long short-term memory network plus pointer network method, AON-UM is the single-modal autoregressive attention recovery method, AON-CM is the cross-modal autoregressive attention recovery method, NAD-UM is the single-modal non-autoregressive recovery method, NAD-CM1 is the cross-modal non-autoregressive method without position embedding and position attention, NAD-CM2 is the cross-modal non-autoregressive method without position attention, NAD-CM3 is the cross-modal non-autoregressive method without position embedding, NACON (no exl) is the variant that does not adopt greedy selection and mask exclusion, and NACON is the method of the invention. The experimental results show that the performance of the cross-modal semantic coherence analysis and recovery method is far superior to that of existing single-modal methods. The comparison with NAD-CM1, NAD-CM2 and NAD-CM3 on the various evaluation indicators demonstrates the effectiveness of the cross-modal position attention information. In addition, compared with AON-CM and NACON (no exl), the performance is also significantly improved, which verifies the effectiveness of the greedy-selection and mask-exclusion inference mode of the coherence recovery method designed by the invention.
Table 1 shows the experimental results on the SIND and TACoS datasets.
[Table 1 is provided as an image in the original publication; the numerical values are not reproduced here.]
Although illustrative embodiments of the invention have been described above to facilitate understanding of the invention by those skilled in the art, it should be understood that the invention is not limited to the scope of these embodiments. For those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the invention as defined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims (1)

1. A cross-modal semantic consistency recovery method, characterized by comprising the following steps:
(1) letting the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; letting a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality being assumed to have similar semantics;
(2) acquiring the basic features of the text modality and the image modality;
(2.1) acquiring the basic features of the out-of-order sentences with a bidirectional long short-term memory network: X is input into the Bi-LSTM, which outputs the basic features of the out-of-order sentences {s_1, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1 × d;
(2.2) acquiring the basic features of the ordered, coherent images with a convolutional neural network: Y is input into the CNN, which outputs the basic features of the ordered, coherent images {g_1, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1 × d;
(3) acquiring the contextual features of the text modality and the image modality;
(3.1) acquiring the contextual features of the text modality with a Transformer variant from which position embedding is removed;
(3.1.1) concatenating the basic features of the sentences into a matrix S = [s_1; …; s_m] of dimension m × d;
(3.1.2) using the h-head attention layer of the Transformer, mapping S to a query matrix Q_k = S W_k^Q, a key matrix K_k = S W_k^K and a value matrix V_k = S W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads head_1, …, head_h and passing it through a feed-forward network to obtain the contextual features of the out-of-order sentences {c_1, …, c_m}, where c_i denotes the contextual feature of the i-th sentence;
(3.2) acquiring the contextual features of the image modality with a Transformer variant in which position embedding is retained;
(3.2.1) concatenating the basic features of the images into a matrix G = [g_1; …; g_n] of dimension n × d;
(3.2.2) computing, for each image basic feature g_j in G, a compact position embedding p_j: the projection embedding of the even dimensions is p_{j,2l} = sin(j / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{j,2l+1} = cos(j / 10000^{2l/d}), where p_{j,2l} and p_{j,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions, the compact position p_j is obtained; finally concatenating the compact positions p_j of all images into a position-embedding matrix P = [p_1; …; p_n] of dimension n × d;
(3.2.3) adding the basic features G and the position embedding P, and mapping the sum with the h-head attention layer of the Transformer to a query matrix Q_k = (G + P) W_k^Q, a key matrix K_k = (G + P) W_k^K and a value matrix V_k = (G + P) W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads and passing it through a feed-forward network to obtain the contextual features of the ordered, coherent images {r_1, …, r_n}, where r_j denotes the contextual feature of the j-th image;
(4) acquiring the attention information of the cross-modal ordered positions;
(4.1) converting the contextual features of the two modalities into a common semantic space through linear projection;
(4.1.1) linearly projecting the contextual features of the two modalities: c'_i = ReLU(W_1 c_i + b_1), r'_j = ReLU(W_2 r_j + b_2), where W_1, W_2 are weight parameters, b_1, b_2 are bias terms and ReLU(·) is the rectified linear activation function;
(4.1.2) common semantic space conversion: concatenating the linearly projected contextual features c'_1, …, c'_m into the semantic representation matrix H_X in the text modality, and concatenating the linearly projected contextual features r'_1, …, r'_n into the semantic representation matrix H_Y in the image modality;
(4.2) computing the semantic correlation Corr between the two modalities: Corr = H_X H_Y^T;
(4.3) converting the position embeddings of the ordered images in the image modality into attention information in the text modality using the semantic correlation between the two modalities;
(4.3.1) obtaining the implicit position information of each sentence in the text modality with an attention mechanism: α = softmax(Corr), E = α P, where E = [e_1; …; e_m] and e_i is the implicit position information of the i-th sentence;
(4.3.2) concatenating the contextual features of the sentences into a matrix C = [c_1; …; c_m] and adding the implicit position information E to it to obtain the sentence contextual features with ordered position attention information Z = C + E, of dimension m × d;
(5) recovering the coherence of the out-of-order sentences;
(5.1) computing, for each sentence basic feature s_i in S, a compact position embedding p_i: the projection embedding of the even dimensions is p_{i,2l} = sin(i / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d}), where p_{i,2l} and p_{i,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions, the compact position p_i is obtained; finally concatenating the compact positions p_i of all sentences into a position-embedding matrix P' = [p_1; …; p_m] of dimension m × d;
(5.2) using the h-head attention layer of the Transformer, mapping the position-embedding matrix P' to a query matrix Q_k = P' W_k^Q, a key matrix K_k = P' W_k^K and a value matrix V_k = P' W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads and passing it through a feed-forward network to obtain the interaction features between sentence positions {f_1, …, f_m}, where f_i denotes the interaction feature of the i-th sentence position;
(5.3) acquiring the attention feature of each sentence with respect to the positions through a multi-head mutual attention module;
(5.3.1) concatenating the interaction features of the sentence positions into a matrix F = [f_1; …; f_m] of dimension m × d;
(5.3.2) using the h-head attention layer of the Transformer, mapping F to a query matrix Q_k = F W_k^Q and mapping the sentence contextual feature matrix Z to a key matrix K_k = Z W_k^K and a value matrix V_k = Z W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads and passing it through a feed-forward network to obtain the attention features of the sentences with respect to the positions {o_1, …, o_m}, where o_i denotes the attention feature with respect to the i-th position;
(5.4) computing the probability of the position of each sentence;
(5.4.1) computing the probability that the i-th sentence lies at each of the m positions, where the vector of attention values of the i-th sentence over the m positions is ω_i, with components ω_{i,i'} = u^T tanh(W_p o_{i'} + W_b z_i), and ptr_i = softmax(ω_i), where z_i is the i-th row of Z, W_p and W_b are weight matrices and u is a column weight vector; computing the probability of every sentence at the m positions in the same way to obtain the position probability set {ptr_1, …, ptr_m};
(5.4.2) taking the position probability with the largest probability value in ptr_i as the final probability of the position of the i-th sentence, denoted Ptr_i; obtaining, in the same way, the final probability of the position of each sentence, denoted {Ptr_1, …, Ptr_m};
(5.5) sorting the out-of-order sentences according to the position probabilities: starting from the first position, selecting from {Ptr_1, …, Ptr_m} the sentence with the largest probability value and placing it at that position, setting the probability value of the ordered sentence to zero, and repeating this step until the m-th position has been filled, thereby completing the coherence recovery of the out-of-order sentences.
CN202111638661.3A 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method Active CN114330279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111638661.3A CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method


Publications (2)

Publication Number Publication Date
CN114330279A CN114330279A (en) 2022-04-12
CN114330279B (en) 2023-04-18

Family

ID=81016638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111638661.3A Active CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method

Country Status (1)

Country Link
CN (1) CN114330279B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8443337B2 (en) * 2008-03-11 2013-05-14 Intel Corporation Methodology and tools for tabled-based protocol specification and model generation
US11816565B2 (en) * 2019-10-16 2023-11-14 Apple Inc. Semantic coherence analysis of deep neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897852A (en) * 2018-06-29 2018-11-27 北京百度网讯科技有限公司 Judgment method, device and the equipment of conversation content continuity
CN110472242A (en) * 2019-08-05 2019-11-19 腾讯科技(深圳)有限公司 A kind of text handling method, device and computer readable storage medium
CN111951207A (en) * 2020-08-25 2020-11-17 福州大学 Image quality enhancement method based on deep reinforcement learning and semantic loss
CN112991350A (en) * 2021-02-18 2021-06-18 西安电子科技大学 RGB-T image semantic segmentation method based on modal difference reduction
CN112966127A (en) * 2021-04-07 2021-06-15 北方民族大学 Cross-modal retrieval method based on multilayer semantic alignment
CN113378546A (en) * 2021-06-10 2021-09-10 电子科技大学 Non-autoregressive sentence sequencing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhiliang Wu et al., "DAPC-Net: Deformable Alignment and Pyramid Context Completion Networks for Video Inpainting," IEEE Signal Processing Letters, vol. 28, 2021, pp. 1145-1149. *
李京谕 et al., "Document-level machine translation based on a joint attention mechanism," Journal of Chinese Information Processing, vol. 33, no. 12, 2019, pp. 45-53. *

Also Published As

Publication number Publication date
CN114330279A (en) 2022-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant