CN114330279B - Cross-modal semantic consistency recovery method - Google Patents

Cross-modal semantic consistency recovery method

Info

Publication number: CN114330279B
Authority: CN (China)
Prior art keywords: attention, matrix, head, image, sentence
Legal status: Active (granted)
Application number: CN202111638661.3A
Other languages: Chinese (zh)
Other versions: CN114330279A
Inventors: 杨阳, 史文浩, 宾燚
Current Assignee: University of Electronic Science and Technology of China
Original Assignee: University of Electronic Science and Technology of China
Application filed by: University of Electronic Science and Technology of China
Priority/filing date: 2021-12-29 (priority to CN202111638661.3A)
Publication of CN114330279A: 2022-04-12
Publication of CN114330279B (application granted): 2023-04-18

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a cross-modal semantic consistency recovery method. The method first obtains the basic features and contextual features of a text modality and an image modality, then converts the contextual features of the two modalities into a common semantic space through linear projection to obtain the attention information of the cross-modal ordered positions, and finally orders the out-of-order sentences using the attention information of the ordered positions, thereby completing the coherence recovery of the out-of-order sentences.

Description

Cross-modal semantic consistency recovery method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a cross-modal semantic consistency recovery method.
Background
Coherence modeling has long been an important research topic and is widely studied in natural language processing; its goal is to organize a group of sentences into a coherent text that forms a logically consistent sequence. Although some progress has been made, current research on semantic coherence modeling remains restricted to the single modality of text. Existing semantic coherence analysis and recovery methods are single-modal: for a group of sentences in the text modality, an encoder-decoder architecture is usually adopted and a pointer network is used for sequence prediction.
Semantic coherence, which originally measured whether a text is linguistically meaningful, can be extended to a broader notion that evaluates the logical, ordered and consistent relationships among elements of various modalities. For humans, coherence modeling is a natural and essential capability for perceiving the world, enabling us to understand and perceive the world as a whole, so coherence modeling of information is very important for promoting the perception and understanding of the physical world.
The current mainstream single-modal semantic coherence analysis and recovery method is an autoregressive attention-based method. Basic sentence feature vectors are extracted with a Bi-LSTM. Inspired by the attention mechanism, a Transformer variant without position encoding is adopted to extract a reliable paragraph representation and eliminate the influence of the sentence input order, yielding the sentence features within the paragraph; the paragraph feature obtained by average pooling initializes the hidden state of a recurrent neural network decoder, and a pointer network recursively predicts the composition of the ordered, coherent paragraph using greedy search or beam search, completing single-modal semantic coherence analysis and recovery.
Existing semantic coherence modeling work therefore focuses mainly on the single modality of text. During encoding, basic sentence feature vectors are extracted with a bidirectional long short-term memory network, contextual sentence features are extracted with a self-attention mechanism, and the paragraph feature is obtained by average pooling. During decoding, a pointer-network architecture is adopted as the decoder: the decoder consists of LSTM units, the basic sentence feature vectors serve as decoder inputs, the input vector of the first step is a zero vector, and the paragraph feature serves as the initial hidden state, as sketched below. Although existing methods can effectively solve single-modal semantic coherence analysis and recovery and further improve performance within a single modality, they ignore the information integration and semantic consistency among multiple modalities and lack cross-modal information.
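For readers less familiar with this prior-art pipeline, the following minimal sketch illustrates the autoregressive pointer-network decoding described above (greedy variant). It assumes PyTorch; the module names (cell, W_p, W_b, u) and the additive scoring form are illustrative assumptions, not the exact prior-art implementation.

```python
# Minimal sketch of single-modal autoregressive pointer-network decoding.
# Assumes PyTorch; all module and variable names are illustrative.
import torch
import torch.nn as nn

def greedy_pointer_decode(sent_feats, paragraph_feat, cell, W_p, W_b, u):
    """sent_feats: (m, d) basic sentence features; paragraph_feat: (d,) mean-pooled paragraph feature."""
    m, d = sent_feats.shape
    h = paragraph_feat.unsqueeze(0)          # hidden state initialised with the paragraph feature
    c = torch.zeros_like(h)
    inp = torch.zeros(1, d)                  # the input vector of the first step is a zero vector
    mask = torch.zeros(m, dtype=torch.bool)
    order = []
    for _ in range(m):
        h, c = cell(inp, (h, c))
        scores = u(torch.tanh(W_p(sent_feats) + W_b(h)))   # (m, 1) additive attention over sentences
        scores = scores.squeeze(-1).masked_fill(mask, float("-inf"))
        nxt = int(scores.argmax())                          # greedy choice (beam search also possible)
        order.append(nxt)
        mask[nxt] = True
        inp = sent_feats[nxt].unsqueeze(0)                  # feed the chosen sentence back in
    return order

# Typical module shapes (assumed):
# cell = nn.LSTMCell(d, d); W_p = nn.Linear(d, d); W_b = nn.Linear(d, d); u = nn.Linear(d, 1)
```

Beam search would keep the k best partial orders at each step instead of the single argmax.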
Disclosure of Invention
The invention aims to overcome the above shortcomings of the prior art and to provide a cross-modal semantic consistency recovery method that, based on the semantic consistency between the text modality and the image modality, effectively uses cross-modal information to guide semantic coherence recovery in the text modality.
In order to achieve the above object, the present invention provides a cross-modal semantic consistency recovery method, which is characterized by comprising the following steps:
(1) Let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics;
(2) Acquire the basic features of the text modality and the image modality;
(2.1) Acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory (Bi-LSTM) network: X is input into the Bi-LSTM, which outputs the basic features of the out-of-order sentences {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1 × d;
(2.2) Acquire the basic features of the ordered, coherent images with a convolutional neural network: Y is input into the CNN, which outputs the basic features of the ordered, coherent images {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1 × d;
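As an illustration of step (2), the sketch below shows one way the two basic-feature encoders could look. The embedding size, the ResNet-18 backbone and the mean pooling over word states are assumptions made for the sake of a runnable example; the patent only specifies a Bi-LSTM for sentences, a CNN for images, and a 1 × d feature per element.

```python
# Illustrative sketch of step (2): a Bi-LSTM sentence encoder and a CNN image encoder,
# each producing one d-dimensional basic feature per element. Hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, d=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, d // 2, batch_first=True, bidirectional=True)

    def forward(self, token_ids):            # token_ids: (m, T) word indices of m sentences
        h, _ = self.bilstm(self.emb(token_ids))
        return h.mean(dim=1)                 # (m, d) sentence basic features s_i (mean pooling is an assumption)

class ImageEncoder(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        backbone = models.resnet18(weights=None)          # backbone chosen for illustration only
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(backbone.fc.in_features, d)

    def forward(self, images):               # images: (n, 3, H, W)
        feats = self.cnn(images).flatten(1)
        return self.proj(feats)              # (n, d) image basic features g_j
```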
(3) Acquire the contextual features of the text modality and the image modality;
(3.1) Acquire the contextual features of the text modality with a Transformer variant from which position embedding is removed;
(3.1.1) Concatenate the basic features of the sentences into a matrix S = [s_1; s_2; …; s_m] of dimension m × d;
(3.1.2) Using the h-head attention layer of the Transformer, S is first mapped to a query matrix Q_k = S W_k^Q, a key matrix K_k = S W_k^K and a value matrix V_k = S W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the contextual features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the contextual feature of the i-th sentence;
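The multi-head layer of step (3.1.2) is standard scaled dot-product attention applied without any position embedding. The NumPy sketch below mirrors the formulas above; the random weight initialisation and the simple two-layer feed-forward network are placeholders for trained parameters, not the patent's exact architecture.

```python
# Sketch of step (3.1): position-embedding-free multi-head self-attention followed by
# concatenation of the heads and a feed-forward network.
import numpy as np

def multi_head_self_attention(S, h=4):
    m, d = S.shape
    dk = d // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, dk)) / np.sqrt(d) for _ in range(3))
        Q, K, V = S @ Wq, S @ Wk, S @ Wv
        A = Q @ K.T / np.sqrt(dk)
        A = np.exp(A - A.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)          # softmax over key positions
        heads.append(A @ V)                             # head_k, shape (m, dk)
    H = np.concatenate(heads, axis=-1)                  # concatenation of all heads, (m, d)
    W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    return np.maximum(H @ W1, 0.0) @ W2                 # feed-forward network -> contextual features
```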
(3.2) Acquire the contextual features of the image modality with a Transformer variant in which position embedding is retained;
(3.2.1) Concatenate the basic features of the images into a matrix G = [g_1; g_2; …; g_n] of dimension n × d;
(3.2.2) For each image basic feature g_j in G, compute a compact position embedding, denoted p_j;
the projection embedding of the even dimensions is p_{j,2l} = sin(j / 10000^{2l/d}); the projection embedding of the odd dimensions is p_{j,2l+1} = cos(j / 10000^{2l/d}),
where p_{j,2l} and p_{j,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d];
after the projection embedding of all dimensions of g_j, the compact position p_j is obtained;
finally, the compact positions p_j of all images are concatenated into a position-embedding matrix P = [p_1; p_2; …; p_n] of dimension n × d;
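The compact position embedding of step (3.2.2) is the usual sinusoidal encoding; a direct NumPy transcription of the two formulas follows (d is assumed even).

```python
# Sketch of the sinusoidal position embedding of step (3.2.2): even dimensions use sine,
# odd dimensions use cosine, exactly as in the formulas above.
import numpy as np

def sinusoidal_position_embedding(n, d=512):
    P = np.zeros((n, d))
    pos = np.arange(1, n + 1)[:, None]                  # positions j = 1..n
    dims = np.arange(0, d, 2)[None, :]                  # even dimension indices 2l
    P[:, 0::2] = np.sin(pos / 10000 ** (dims / d))      # p_{j,2l}
    P[:, 1::2] = np.cos(pos / 10000 ** (dims / d))      # p_{j,2l+1}
    return P                                            # position-embedding matrix, (n, d)
```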
(3.2.3) The basic features G and the position embedding P are added, and the sum is mapped by the h-head attention layer of the Transformer to a query matrix Q_k = (G + P) W_k^Q, a key matrix K_k = (G + P) W_k^K and a value matrix V_k = (G + P) W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the contextual features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the contextual feature of the j-th image;
(4) Acquire the attention information of the cross-modal ordered positions;
(4.1) Convert the contextual features of the two modalities into a common semantic space through linear projection;
(4.1.1) Linearly project the contextual features of the two modalities:
c'_i = ReLU(W_1 c_i + b_1),  r'_j = ReLU(W_2 r_j + b_2),
where W_1, W_2 are weight parameters, b_1, b_2 are bias terms and ReLU(·) is the rectified linear activation function;
(4.1.2) Common semantic space conversion: the linearly projected contextual features c'_1, …, c'_m are concatenated into the semantic representation matrix H_X of the text modality, and the linearly projected contextual features r'_1, …, r'_n are concatenated into the semantic representation matrix H_Y of the image modality;
(4.2) Compute the semantic correlation Corr between the two modalities:
Corr = H_X H_Y^T;
(4.3) Using the semantic correlation between the two modalities, the position embeddings of the ordered images in the image modality are converted into attention information in the text modality;
(4.3.1) Obtain the implicit position information of each sentence in the text modality with an attention mechanism:
α = softmax(Corr),  E = α P,
where E = [e_1; e_2; …; e_m] and e_i is the implicit position information of the i-th sentence;
(4.3.2) The contextual features of the sentences are concatenated into a matrix C = [c_1; c_2; …; c_m], and the implicit position information E is added to it to obtain the sentence contextual features with ordered position attention information Z = C + E, of dimension m × d;
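Putting step (4) together, the sketch below projects both modalities into the common space, computes the correlation, and adds the image-position attention to the sentence contextual features. The dot-product form of Corr and the matrix shapes follow the reconstruction given above and should be read as assumptions rather than the patent's literal formulas.

```python
# Sketch of step (4): common-space projection, cross-modal correlation, and conversion of the
# ordered image position embeddings into sentence-level position attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_position_attention(C, R, P, W1, b1, W2, b2):
    """C: (m, d) sentence contextual features; R: (n, d) image contextual features;
    P: (n, d) image position embeddings; W1, W2, b1, b2: projection parameters."""
    Hx = np.maximum(C @ W1 + b1, 0.0)        # ReLU linear projection of the text modality
    Hy = np.maximum(R @ W2 + b2, 0.0)        # ReLU linear projection of the image modality
    Corr = Hx @ Hy.T                         # (m, n) semantic correlation (assumed dot-product form)
    alpha = softmax(Corr, axis=-1)
    E = alpha @ P                            # implicit position information for each sentence
    return C + E                             # Z: sentence contextual features with ordered
                                             # position attention information, (m, d)
```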
(5) Recover the coherence of the out-of-order sentences;
(5.1) For each sentence basic feature s_i in S, compute a compact position embedding, denoted p_i;
the projection embedding of the even dimensions is p_{i,2l} = sin(i / 10000^{2l/d}); the projection embedding of the odd dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d}),
where p_{i,2l} and p_{i,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d];
after the projection embedding of all dimensions, the compact position p_i is obtained;
finally, the compact positions p_i of all sentences are concatenated into a position-embedding matrix P' = [p_1; p_2; …; p_m] of dimension m × d;
(5.2) Using the h-head attention layer of the Transformer, the position-embedding matrix P' is first mapped to a query matrix Q_k = P' W_k^Q, a key matrix K_k = P' W_k^K and a value matrix V_k = P' W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the interaction features between sentence positions {f_1, f_2, …, f_i, …, f_m}, where f_i denotes the interaction feature of the i-th sentence position;
(5.3) Acquire the attention feature of each sentence with respect to the positions through a multi-head mutual attention module;
(5.3.1) Concatenate the interaction features of the sentence positions into a matrix F = [f_1; f_2; …; f_m] of dimension m × d;
(5.3.2) Using the h-head attention layer of the Transformer, F is first mapped to a query matrix Q_k = F W_k^Q, and the sentence contextual feature matrix Z with ordered position attention information is mapped to a key matrix K_k = Z W_k^K and a value matrix V_k = Z W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism:
head_k = softmax(Q_k K_k^T / √d_k) V_k,
where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the attention feature with respect to the i-th position;
(5.4) Compute the probability of the position of each sentence;
(5.4.1) Compute the probability that the i-th sentence lies at each of the m positions, where the vector of attention values of the i-th sentence over the m positions is ω_i, with components
ω_{i,i'} = u^T tanh(W_p o_{i'} + W_b z_i),  ptr_i = softmax(ω_i),
where z_i is the i-th row of Z, W_p and W_b are weight matrices and u is a column weight vector;
the probability of every sentence at the m positions is computed in the same way, giving the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
(5.4.2) The position probability with the largest probability value in ptr_i is taken as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, the final probability of the position of each sentence is obtained, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
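The position probabilities of step (5.4) can be sketched as follows. The additive pointer score u^T tanh(W_p o_{i'} + W_b z_i) is a reconstruction (the original formula is only available as an image in the source document), so treat its exact form as an assumption.

```python
# Sketch of step (5.4): pointer-style position scores for every sentence over the m positions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def position_probabilities(O, Z, W_p, W_b, u):
    """O: (m, d) attention features about the m positions; Z: (m, d) sentence features with
    ordered position attention; W_p, W_b: (d, d) weight matrices; u: (d,) column weight vector."""
    m = Z.shape[0]
    ptr = np.zeros((m, m))
    for i in range(m):                                   # i-th sentence
        scores = np.tanh(O @ W_p.T + Z[i] @ W_b.T) @ u   # omega_i over the m positions
        ptr[i] = softmax(scores)
    return ptr                                           # ptr[i, p] = P(sentence i at position p)
```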
(5.5) Sort the out-of-order sentences according to the position probabilities;
starting from the first position, the sentence with the largest probability value in {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} is selected and placed at that position; the probability value of the sentence that has been ordered is then set to zero, and this step is repeated until the m-th position has been filled, completing the coherence recovery of the out-of-order sentences.
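Step (5.5) is greedy selection with mask exclusion. The sketch below operates on the full m × m probability matrix from the previous sketch rather than on the per-sentence maxima Ptr_i, which is a slight simplification of the procedure described above.

```python
# Sketch of step (5.5): greedy selection with mask exclusion over the position probabilities.
import numpy as np

def greedy_order(ptr):
    """ptr: (m, m) matrix, ptr[i, p] = probability that sentence i belongs at position p."""
    m = ptr.shape[0]
    probs = ptr.copy()
    order = [-1] * m
    for p in range(m):                       # from the first position to the m-th
        i = int(np.argmax(probs[:, p]))      # sentence with the largest probability for position p
        order[p] = i
        probs[i, :] = 0.0                    # mask exclusion: an ordered sentence cannot be reused
    return order                             # order[p] = index of the sentence placed at position p

# Example:
# greedy_order(np.array([[0.1, 0.7, 0.2], [0.6, 0.2, 0.2], [0.3, 0.1, 0.6]])) -> [1, 0, 2]
```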
The object of the invention is achieved as follows:
The cross-modal semantic consistency recovery method of the invention first obtains the basic features and contextual features of the text modality and the image modality, then converts the contextual features of the two modalities into a common semantic space through linear projection to obtain the attention information of the cross-modal ordered positions, and finally orders the out-of-order sentences using the attention information of the ordered positions, thereby completing the coherence recovery of the out-of-order sentences.
Meanwhile, the cross-modal semantic consistency recovery method of the invention has the following beneficial effects:
(1) The cross-modal semantic coherence analysis and recovery method of the invention can effectively extract the features of elements in different modalities, makes full use of cross-modal position information to assist and promote semantic coherence analysis and recovery within a single modality, and predicts and recovers the element at each position in parallel, further improving both the speed and the accuracy of the task;
(2) The invention effectively connects the text modality and the image modality, which have similar semantics, across modalities, which benefits semantic coherence analysis and the introduction of position attention information from the ordered, coherent modality.
Drawings
FIG. 1 is a flow chart of a cross-modal semantic consistency recovery method of the present invention;
Detailed Description
The following description of specific embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be noted that, in the following description, detailed descriptions of well-known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flowchart of a cross-modal semantic consistency recovery method according to the present invention.
In this embodiment, as shown in FIG. 1, the cross-modal semantic consistency recovery method of the invention includes the following steps:
S1, let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality are assumed to have similar semantics, and the images are used to assist in recovering the text into ordered, coherent paragraphs.
S2, acquire the basic features of the text modality and the image modality;
S2.1, acquire the basic features of the out-of-order sentences with a bidirectional long short-term memory network: X is input into the Bi-LSTM, which outputs the basic features of the out-of-order sentences {s_1, s_2, …, s_i, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1 × d, with d = 512;
S2.2, acquire the basic features of the ordered, coherent images with a convolutional neural network: Y is input into the CNN, which outputs the basic features of the ordered, coherent images {g_1, g_2, …, g_j, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1 × d;
S3, acquire the contextual features of the text modality and the image modality;
S3.1, to exploit the contextual semantic relations, the contextual features of the text modality are acquired with a Transformer variant from which position embedding is removed, in which a scaled dot-product self-attention mechanism is used to exploit the contextual information;
S3.1.1, concatenate the basic features of the sentences into a matrix S = [s_1; s_2; …; s_m] of dimension m × d;
S3.1.2, using the h-head attention layer of the Transformer, S is first mapped to a query matrix Q_k = S W_k^Q, a key matrix K_k = S W_k^K and a value matrix V_k = S W_k^V, where k ∈ [1, h] denotes the k-th attention head, W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k, and h = 4;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads head_1, …, head_h is concatenated and passed through a feed-forward network to obtain the contextual features of the out-of-order sentences {c_1, c_2, …, c_i, …, c_m}, where c_i denotes the contextual feature of the i-th sentence;
S3.2, to model the coherent semantic information of the images, the contextual features of the image modality are acquired with a Transformer variant in which position embedding is retained;
S3.2.1, concatenate the basic features of the images into a matrix G = [g_1; g_2; …; g_n] of dimension n × d;
S3.2.2, for each image basic feature g_j in G, compute a compact position embedding, denoted p_j: the projection embedding of the even dimensions is p_{j,2l} = sin(j / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{j,2l+1} = cos(j / 10000^{2l/d}), where p_{j,2l} and p_{j,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions of g_j, the compact position p_j is obtained; finally, the compact positions p_j of all images are concatenated into a position-embedding matrix P = [p_1; p_2; …; p_n] of dimension n × d;
S3.2.3, the basic features G and the position embedding P are added, and the sum is mapped by the h-head attention layer of the Transformer to a query matrix Q_k = (G + P) W_k^Q, a key matrix K_k = (G + P) W_k^K and a value matrix V_k = (G + P) W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads is concatenated and passed through a feed-forward network to obtain the contextual features of the ordered, coherent images {r_1, r_2, …, r_j, …, r_n}, where r_j denotes the contextual feature of the j-th image;
S4, acquire the attention information of the cross-modal ordered positions;
S4.1, to exploit the cross-modal order information from the image modality, the semantic consistency between the two modalities is connected through a cross-modal position attention module;
first, the contextual features of the two modalities are converted into a common semantic space through linear projection;
S4.1.1, linearly project the contextual features of the two modalities: c'_i = ReLU(W_1 c_i + b_1), r'_j = ReLU(W_2 r_j + b_2), where W_1, W_2 are weight parameters, b_1, b_2 are bias terms and ReLU(·) is the rectified linear activation function;
S4.1.2, common semantic space conversion: the linearly projected contextual features c'_1, …, c'_m are concatenated into the semantic representation matrix H_X in the text modality, and the linearly projected contextual features r'_1, …, r'_n are concatenated into the semantic representation matrix H_Y in the image modality;
S4.2, compute the semantic correlation Corr between the two modalities: Corr = H_X H_Y^T;
S4.3, using the semantic correlation between the two modalities, the position embeddings of the ordered images in the image modality are converted into attention information in the text modality;
S4.3.1, obtain the implicit position information of each sentence in the text modality with an attention mechanism: α = softmax(Corr), E = α P, where E = [e_1; e_2; …; e_m] and e_i is the implicit position information of the i-th sentence;
S4.3.2, the contextual features of the sentences are concatenated into a matrix C = [c_1; c_2; …; c_m], and the implicit position information E is added to it to obtain the sentence contextual features with ordered position attention information Z = C + E, of dimension m × d;
S5, recover the coherence of the out-of-order sentences;
S5.1, for each sentence basic feature s_i in S, compute a compact position embedding, denoted p_i: the projection embedding of the even dimensions is p_{i,2l} = sin(i / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d}), where p_{i,2l} and p_{i,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions, the compact position p_i is obtained; finally, the compact positions p_i of all sentences are concatenated into a position-embedding matrix P' = [p_1; p_2; …; p_m] of dimension m × d;
S5.2, using the h-head attention layer of the Transformer, the position-embedding matrix P' is mapped to a query matrix Q_k = P' W_k^Q, a key matrix K_k = P' W_k^K and a value matrix V_k = P' W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads is concatenated and passed through a feed-forward network to obtain the interaction features between sentence positions {f_1, f_2, …, f_i, …, f_m}, where f_i denotes the interaction feature of the i-th sentence position;
S5.3, acquire the attention feature of each sentence with respect to the positions through a multi-head mutual attention module;
S5.3.1, concatenate the interaction features of the sentence positions into a matrix F = [f_1; f_2; …; f_m] of dimension m × d;
S5.3.2, using the h-head attention layer of the Transformer, F is first mapped to a query matrix Q_k = F W_k^Q, and the sentence contextual feature matrix Z with ordered position attention information is mapped to a key matrix K_k = Z W_k^K and a value matrix V_k = Z W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
the interaction information of each attention head is then extracted through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally, the interaction information of all attention heads is concatenated and passed through a feed-forward network to obtain the attention features of the sentences with respect to the positions {o_1, o_2, …, o_i, …, o_m}, where o_i denotes the attention feature with respect to the i-th position;
S5.4, compute the probability of the position of each sentence;
S5.4.1, compute the probability that the i-th sentence lies at each of the m positions, where the vector of attention values of the i-th sentence over the m positions is ω_i, with components ω_{i,i'} = u^T tanh(W_p o_{i'} + W_b z_i), and ptr_i = softmax(ω_i), where z_i is the i-th row of Z, W_p and W_b are weight matrices and u is a column weight vector; the probability of every sentence at the m positions is computed in the same way, giving the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
S5.4.2, the position probability with the largest probability value in ptr_i is taken as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, the final probability of the position of each sentence is obtained, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
S5.5, sort the out-of-order sentences according to the position probabilities: starting from the first position, the sentence with the largest probability value in {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} is selected and placed at that position; the probability value of the sentence that has been ordered is then set to zero, and this step is repeated until the m-th position has been filled, completing the coherence recovery of the out-of-order sentences.
In this example, the invention is applied to several common datasets, including SIND and TACoS, two visual storytelling and story-understanding corpora whose data are available in both text and picture form. The invention adopts the perfect match ratio (PMR), the accuracy (Acc) and the τ metric as evaluation indicators, as sketched below. The perfect match ratio (PMR) measures the performance of position prediction over the sequence as a whole. The accuracy (Acc) is a looser indicator that measures the accuracy of absolute position prediction for individual elements. The τ metric measures the relative order of all pairs of elements in the prediction and is closer to human judgment.
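For reference, the three evaluation indicators can be computed as in the sketch below, where predicted and gold are lists of sentence-index sequences; the exact tie-handling conventions used in the experiments are not specified in the patent, so this is only an illustrative implementation.

```python
# Sketch of the three evaluation metrics mentioned above: perfect match ratio (PMR),
# positional accuracy (Acc) and Kendall's tau over all element pairs.
from itertools import combinations

def pmr(predicted, gold):
    """Fraction of sequences whose predicted order matches the gold order exactly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def accuracy(predicted, gold):
    """Fraction of elements placed at their correct absolute position."""
    total = sum(len(g) for g in gold)
    correct = sum(sum(pi == gi for pi, gi in zip(p, g)) for p, g in zip(predicted, gold))
    return correct / total

def kendall_tau(predicted, gold):
    """Average Kendall's tau: agreement on the relative order of all element pairs."""
    taus = []
    for p, g in zip(predicted, gold):
        rank = {e: r for r, e in enumerate(p)}
        pairs = list(combinations(g, 2))
        if not pairs:                        # single-element sequences contribute perfect agreement
            taus.append(1.0)
            continue
        concordant = sum(rank[a] < rank[b] for a, b in pairs)
        taus.append((2 * concordant - len(pairs)) / len(pairs))
    return sum(taus) / len(taus)
```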
Sentence coherence recovery was carried out with the method of the invention and with existing methods; the experimental results are shown in Table 1, where LSTM+PtrNet is the long short-term memory network plus pointer network method, AON-UM is the single-modal autoregressive attention recovery method, AON-CM is the cross-modal autoregressive attention recovery method, NAD-UM is the single-modal non-autoregressive recovery method, NAD-CM1 is the cross-modal non-autoregressive method without position embedding and position attention, NAD-CM2 is the cross-modal non-autoregressive method without position attention, NAD-CM3 is the cross-modal non-autoregressive method without position embedding, NACON (no exl) is the variant that does not adopt greedy selection and mask exclusion, and NACON is the method of the invention. The experimental results show that the performance of the cross-modal semantic coherence analysis and recovery method is far superior to that of existing single-modal methods. The comparison with NAD-CM1, NAD-CM2 and NAD-CM3 on the various evaluation indicators demonstrates the effectiveness of the cross-modal position attention information. In addition, compared with AON-CM and NACON (no exl), the performance is also significantly improved, which verifies the effectiveness of the greedy-selection and mask-exclusion inference mode of the coherence recovery method designed by the invention.
Table 1 shows the experimental results on the SIND and TACoS datasets.
[Table 1 is provided as an image in the original publication; the numerical values are not reproduced here.]
Although illustrative embodiments of the invention have been described above to facilitate understanding of the invention by those skilled in the art, it should be understood that the invention is not limited to the scope of these embodiments. For those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the invention as defined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims (1)

1. A cross-modal semantic consistency recovery method, characterized by comprising the following steps:
(1) letting the out-of-order sentences whose semantic coherence is to be recovered in the text modality be X = {x_1, x_2, …, x_i, …, x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences; letting a group of ordered, coherent images in the image modality be Y = {y_1, y_2, …, y_j, …, y_n}, where y_j denotes the j-th image and n is the number of images; the text modality and the image modality being assumed to have similar semantics;
(2) acquiring the basic features of the text modality and the image modality;
(2.1) acquiring the basic features of the out-of-order sentences with a bidirectional long short-term memory network: X is input into the Bi-LSTM, which outputs the basic features of the out-of-order sentences {s_1, …, s_m}, where s_i denotes the basic feature of the i-th sentence and has dimension 1 × d;
(2.2) acquiring the basic features of the ordered, coherent images with a convolutional neural network: Y is input into the CNN, which outputs the basic features of the ordered, coherent images {g_1, …, g_n}, where g_j denotes the basic feature of the j-th image and has dimension 1 × d;
(3) acquiring the contextual features of the text modality and the image modality;
(3.1) acquiring the contextual features of the text modality with a Transformer variant from which position embedding is removed;
(3.1.1) concatenating the basic features of the sentences into a matrix S = [s_1; …; s_m] of dimension m × d;
(3.1.2) using the h-head attention layer of the Transformer, mapping S to a query matrix Q_k = S W_k^Q, a key matrix K_k = S W_k^K and a value matrix V_k = S W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads head_1, …, head_h and passing it through a feed-forward network to obtain the contextual features of the out-of-order sentences {c_1, …, c_m}, where c_i denotes the contextual feature of the i-th sentence;
(3.2) acquiring the contextual features of the image modality with a Transformer variant in which position embedding is retained;
(3.2.1) concatenating the basic features of the images into a matrix G = [g_1; …; g_n] of dimension n × d;
(3.2.2) computing, for each image basic feature g_j in G, a compact position embedding p_j: the projection embedding of the even dimensions is p_{j,2l} = sin(j / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{j,2l+1} = cos(j / 10000^{2l/d}), where p_{j,2l} and p_{j,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions, the compact position p_j is obtained; finally concatenating the compact positions p_j of all images into a position-embedding matrix P = [p_1; …; p_n] of dimension n × d;
(3.2.3) adding the basic features G and the position embedding P, and mapping the sum with the h-head attention layer of the Transformer to a query matrix Q_k = (G + P) W_k^Q, a key matrix K_k = (G + P) W_k^K and a value matrix V_k = (G + P) W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads and passing it through a feed-forward network to obtain the contextual features of the ordered, coherent images {r_1, …, r_n}, where r_j denotes the contextual feature of the j-th image;
(4) acquiring the attention information of the cross-modal ordered positions;
(4.1) converting the contextual features of the two modalities into a common semantic space through linear projection;
(4.1.1) linearly projecting the contextual features of the two modalities: c'_i = ReLU(W_1 c_i + b_1), r'_j = ReLU(W_2 r_j + b_2), where W_1, W_2 are weight parameters, b_1, b_2 are bias terms and ReLU(·) is the rectified linear activation function;
(4.1.2) common semantic space conversion: concatenating the linearly projected contextual features c'_1, …, c'_m into the semantic representation matrix H_X in the text modality, and concatenating the linearly projected contextual features r'_1, …, r'_n into the semantic representation matrix H_Y in the image modality;
(4.2) computing the semantic correlation Corr between the two modalities: Corr = H_X H_Y^T;
(4.3) converting the position embeddings of the ordered images in the image modality into attention information in the text modality using the semantic correlation between the two modalities;
(4.3.1) obtaining the implicit position information of each sentence in the text modality with an attention mechanism: α = softmax(Corr), E = α P, where E = [e_1; …; e_m] and e_i is the implicit position information of the i-th sentence;
(4.3.2) concatenating the contextual features of the sentences into a matrix C = [c_1; …; c_m] and adding the implicit position information E to it to obtain the sentence contextual features with ordered position attention information Z = C + E, of dimension m × d;
(5) recovering the coherence of the out-of-order sentences;
(5.1) computing, for each sentence basic feature s_i in S, a compact position embedding p_i: the projection embedding of the even dimensions is p_{i,2l} = sin(i / 10000^{2l/d}) and the projection embedding of the odd dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d}), where p_{i,2l} and p_{i,2l+1} denote the values of the even-dimension and odd-dimension projection embeddings respectively, l is a constant and 2l+1 ∈ [1, d]; after the projection embedding of all dimensions, the compact position p_i is obtained; finally concatenating the compact positions p_i of all sentences into a position-embedding matrix P' = [p_1; …; p_m] of dimension m × d;
(5.2) using the h-head attention layer of the Transformer, mapping the position-embedding matrix P' to a query matrix Q_k = P' W_k^Q, a key matrix K_k = P' W_k^K and a value matrix V_k = P' W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads and passing it through a feed-forward network to obtain the interaction features between sentence positions {f_1, …, f_m}, where f_i denotes the interaction feature of the i-th sentence position;
(5.3) acquiring the attention feature of each sentence with respect to the positions through a multi-head mutual attention module;
(5.3.1) concatenating the interaction features of the sentence positions into a matrix F = [f_1; …; f_m] of dimension m × d;
(5.3.2) using the h-head attention layer of the Transformer, mapping F to a query matrix Q_k = F W_k^Q and mapping the sentence contextual feature matrix Z to a key matrix K_k = Z W_k^K and a value matrix V_k = Z W_k^V, where k ∈ [1, h] denotes the k-th attention head and W_k^Q, W_k^K, W_k^V are the weight matrices of the k-th attention head, each of dimension d × d_k;
then extracting the interaction information of each attention head through the attention mechanism: head_k = softmax(Q_k K_k^T / √d_k) V_k, where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition;
finally concatenating the interaction information of all attention heads and passing it through a feed-forward network to obtain the attention features of the sentences with respect to the positions {o_1, …, o_m}, where o_i denotes the attention feature with respect to the i-th position;
(5.4) computing the probability of the position of each sentence;
(5.4.1) computing the probability that the i-th sentence lies at each of the m positions, where the vector of attention values of the i-th sentence over the m positions is ω_i, with components ω_{i,i'} = u^T tanh(W_p o_{i'} + W_b z_i), and ptr_i = softmax(ω_i), where z_i is the i-th row of Z, W_p and W_b are weight matrices and u is a column weight vector; computing the probability of every sentence at the m positions in the same way to obtain the position probability set {ptr_1, …, ptr_m};
(5.4.2) taking the position probability with the largest probability value in ptr_i as the final probability of the position of the i-th sentence, denoted Ptr_i; obtaining, in the same way, the final probability of the position of each sentence, denoted {Ptr_1, …, Ptr_m};
(5.5) sorting the out-of-order sentences according to the position probabilities: starting from the first position, selecting from {Ptr_1, …, Ptr_m} the sentence with the largest probability value and placing it at that position, setting the probability value of the ordered sentence to zero, and repeating this step until the m-th position has been filled, thereby completing the coherence recovery of the out-of-order sentences.
CN202111638661.3A 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method Active CN114330279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111638661.3A CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method


Publications (2)

Publication Number Publication Date
CN114330279A CN114330279A (en) 2022-04-12
CN114330279B (en) 2023-04-18

Family

ID=81016638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111638661.3A Active CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method

Country Status (1)

Country Link
CN (1) CN114330279B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8443337B2 (en) * 2008-03-11 2013-05-14 Intel Corporation Methodology and tools for tabled-based protocol specification and model generation
US11816565B2 (en) * 2019-10-16 2023-11-14 Apple Inc. Semantic coherence analysis of deep neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897852A (en) * 2018-06-29 2018-11-27 北京百度网讯科技有限公司 Judgment method, device and the equipment of conversation content continuity
CN110472242A (en) * 2019-08-05 2019-11-19 腾讯科技(深圳)有限公司 A kind of text handling method, device and computer readable storage medium
CN111951207A (en) * 2020-08-25 2020-11-17 福州大学 Image quality enhancement method based on deep reinforcement learning and semantic loss
CN112991350A (en) * 2021-02-18 2021-06-18 西安电子科技大学 RGB-T image semantic segmentation method based on modal difference reduction
CN112966127A (en) * 2021-04-07 2021-06-15 北方民族大学 Cross-modal retrieval method based on multilayer semantic alignment
CN113378546A (en) * 2021-06-10 2021-09-10 电子科技大学 Non-autoregressive sentence sequencing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhiliang Wu et al., "DAPC-Net: Deformable Alignment and Pyramid Context Completion Networks for Video Inpainting," IEEE Signal Processing Letters, vol. 28, 2021, pp. 1145-1149. *
李京谕 et al., "Document-level machine translation based on a joint attention mechanism," Journal of Chinese Information Processing, vol. 33, no. 12, 2019, pp. 45-53. *

Also Published As

Publication number Publication date
CN114330279A (en) 2022-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant