CN115346158A - Video description method based on coherence attention mechanism and double-stream decoder - Google Patents

Video description method based on coherence attention mechanism and double-stream decoder

Info

Publication number
CN115346158A
Authority
CN
China
Prior art keywords
decoder
attention mechanism
video
self
coherent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211060223.8A
Other languages
Chinese (zh)
Inventor
钟忺
巫世峰
张瑾
黄文心
孙志新
刘文璇
刘静
张晓梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongqian Liyuan Engineering Consulting Co ltd
Original Assignee
Zhongqian Liyuan Engineering Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongqian Liyuan Engineering Consulting Co ltd filed Critical Zhongqian Liyuan Engineering Consulting Co ltd
Priority to CN202211060223.8A
Publication of CN115346158A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video description method based on a coherence attention mechanism and a dual-stream decoder, which mainly comprises a coherence attention mechanism algorithm and a dual-stream-decoder video description generation algorithm. The coherence attention mechanism re-models the attention mechanism by extracting the temporal features of its weights, thereby building coherence into the attention computation. The dual-stream-decoder video description generation algorithm adds a self-learning decoder on top of a traditional decoder in order to exploit the semantic information of previously generated words. The invention can generate video descriptions that are more coherent and have more accurate and richer semantics.

Description

Video description method based on a coherence attention mechanism and a dual-stream decoder
Technical Field
The invention relates to a video description generation algorithm based on a coherence attention mechanism and a dual-stream decoder, and in particular to a coherence attention mechanism algorithm and a dual-stream-decoder video description generation algorithm. It belongs to the field of deep learning at the intersection of computer vision and natural language processing.
Background
Video captioning is a cross-modal task that describes the main content of a video in compact natural language. With the development of deep learning in recent years, video captioning methods commonly adopt an encoder-decoder architecture: the encoder typically uses convolutional neural networks to extract the appearance and motion features of the video, and the decoder uses a long short-term memory (LSTM) recurrent neural network or a Transformer to generate sentences. Attention mechanisms are widely used to extract the visual features most relevant to the word being generated at the current time step. However, previous methods model neither how each frame's attended feature changes dynamically over time nor how different frames interact after passing through the attention mechanism. In the decoder stage, the inputs of the LSTM or Transformer are the visual features obtained through the attention mechanism and the ground-truth word of the previous time step, so the semantic information of the words generated by the model itself is not fully exploited.
Disclosure of Invention
The invention aims at the above problems and proposes a video description generation algorithm based on a coherence attention mechanism and a dual-stream decoder. The coherence attention mechanism algorithm fuses intra-frame and inter-frame coherence information into the attention mechanism, using the dependencies of attended features within and between frames across time steps to improve attention. In the dual-stream-decoder video description generation algorithm, a self-learning decoder is added alongside the traditional decoder; its inputs are the visual features extracted by the attention mechanism and the semantic features of the word generated by the video description model at the previous time step. In the testing stage, the semantic information of the traditional decoder and of the self-learning decoder is fused proportionally. The added self-learning decoder is designed to fully exploit the semantic information of previously generated words.
First, a video with its description sentences is sampled to obtain video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network. The appearance and motion features are then concatenated and fed into a bidirectional long short-term memory recurrent neural network. A coherence attention mechanism is then designed to extract, from the visual features, the coherent visual features corresponding to the word currently being generated. Finally, the extracted visual features are fed into a traditional decoder and a self-learning decoder respectively to generate sentences describing the video. Both decoders use a long short-term memory recurrent neural network to generate sentences: at the current time step, the traditional decoder takes the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, while the self-learning decoder takes the visual features extracted by the coherence attention mechanism and the semantic features of the word it generated at the previous time step.
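As an illustrative sketch only (not the patent's reference implementation), the snippet below shows one way the per-frame appearance features could be obtained from an off-the-shelf pre-trained ResNet-152; the ResNeXt-101 3D motion features are assumed to be precomputed, and the function name, tensor shapes, and feature dimensions are the editor's assumptions.

```python
import torch
import torch.nn as nn
import torchvision

def extract_appearance_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (n, 3, 224, 224) sampled, normalized video frames.
    Returns (n, 2048) appearance features from a pre-trained ResNet-152."""
    backbone = torchvision.models.resnet152(
        weights=torchvision.models.ResNet152_Weights.IMAGENET1K_V1)
    # Drop the classification head; keep everything up to global average pooling.
    feature_extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()
    with torch.no_grad():
        feats = feature_extractor(frames)        # (n, 2048, 1, 1)
    return feats.flatten(1)                      # (n, 2048)

if __name__ == "__main__":
    dummy_frames = torch.randn(50, 3, 224, 224)  # 50 equally spaced frames, as in the patent
    appearance = extract_appearance_features(dummy_frames)
    motion = torch.randn(50, 2048)               # placeholder for precomputed ResNeXt-101 3D features
    fused_input = torch.cat([appearance, motion], dim=1)  # per-frame concatenation of 2D and 3D features
    print(fused_input.shape)                     # torch.Size([50, 4096])
```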
The technical solution adopted by the invention is a video description method based on a coherence attention mechanism and a dual-stream decoder, which specifically comprises the following steps.
A video description method based on a coherence attention mechanism and a dual-stream decoder is characterized in that:
a video with its description sentences is sampled to obtain video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network;
the appearance features and motion features are concatenated and fed into a bidirectional long short-term memory recurrent neural network;
coherent visual features corresponding to the currently generated word are extracted from the visual features by a coherence attention mechanism; and the extracted visual features are fed into a traditional decoder and a self-learning decoder respectively to generate sentences describing the video.
In the above video description method, the video/sentence pairs are preprocessed as follows: the video is sampled at equal intervals into 50 frames; appearance features are extracted with the pre-trained 2D convolutional neural network ResNet-152 and motion features with the pre-trained 3D convolutional neural network ResNeXt-101; punctuation marks are removed from the sentences and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, sentences longer than 20 are truncated and shorter sentences are padded with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words not present in the word bank are replaced with the <unk> symbol.
In the above video description method, the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1). For each video frame, the forward output and the backward output of LSTM1 are concatenated to give the frame feature x_i, and the video features are expressed as V = {x_1, x_2, …, x_n}.
In the above video description method, the coherence attention mechanism algorithm comprises:
Step 1: the fused features sampled and extracted from the video are expressed as V = {x_1, x_2, …, x_n}, and the weighted visual features obtained by the coherence attention mechanism at the previous time step are expressed as c_{t-1} = {α_{1,t-1}x_1, α_{2,t-1}x_2, …, α_{n,t-1}x_n}.
Step 2: a recurrent neural network is used to model the temporal dependency of each frame's attended feature; for the i-th frame at time step t, the recurrent network produces a coherence feature for that frame, and parameter sharing among {LSTM_1, LSTM_2, …, LSTM_n} models the interaction between video frames.
Step 3: in the traditional attention mechanism, the weighted feature of the video frames is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, i.e. the video features; h_{t-1} is the query, i.e. the hidden state of the previous time step; and f_att is the scoring function that computes the similarity weights. The coherence attention mechanism adds intra-frame and inter-frame coherence information on top of the traditional attention mechanism by concatenating the features that fuse intra-frame and inter-frame information into the attention computation.
In the above video description method, a self-learning decoder is added alongside the traditional decoder in order to exploit the semantic information of previously generated words. Both the traditional decoder and the self-learning decoder use a recurrent neural network to generate sentences. The inputs of the traditional decoder are the visual features obtained through the attention mechanism and the previous word of the ground-truth description; the inputs of the self-learning decoder are the visual features and the semantic features of the word generated at the previous time step, namely the probability distribution over the whole word bank produced at the previous time step. In the testing stage, the information of the traditional decoder and of the self-learning decoder is fused at a certain ratio to generate the final word.
In the above video description method, the traditional decoder generates sentences with a long short-term memory recurrent neural network (LSTM3); its inputs are the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, and the output of LSTM3 is used to predict the word at the current time step.
In the above video description method, the self-learning decoder generates sentences with a long short-term memory recurrent neural network (LSTM4); its inputs are the visual features extracted by the coherence attention mechanism and the semantic features of the word generated by the model at the previous time step, i.e. the probability distribution over the word bank, mapped by the trainable parameters W_e. The output of LSTM4 is used to predict the word at the current time step.
In the above video description method, the traditional decoder and the self-learning decoder are trained jointly with a cross-entropy loss function. A training hyper-parameter λ is introduced to control the relative importance of the two decoders, and the final loss function is the weighted combination of the traditional decoder's loss and the self-learning decoder's loss.
The invention has the following advantages. Taking video description generation as the research object, the invention designs a video description generation algorithm based on a coherence attention mechanism and a dual-stream decoder. The coherence attention mechanism exploits the weight information that the attention mechanism assigns to different video frames across time steps, and guides the attention mechanism to select more coherent visual features according to the temporal dependency within frames and the interaction between frames, so that more coherent sentences are generated. The dual-stream-decoder video description generation algorithm adds a self-learning decoder to the traditional decoder; unlike the traditional decoder, whose input is the word from the ground-truth sentence, the input of the self-learning decoder is the probability distribution over the word bank produced by the model at the previous time step, so the semantic information of the words generated by the model is exploited more fully. In the testing stage, the final word is generated by proportionally fusing the information of the two decoders, producing sentences with more accurate and richer semantics.
Drawings
The invention will now be further described with reference to the following examples.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a network framework of the method of the present invention.
Detailed Description
For the purpose of making the present invention more apparent, its technical solutions and advantages will be described in detail below with reference to the accompanying drawings.
The invention mainly comprises the following steps.
Step 1: load and preprocess the video/sentence pairs, including extracting the appearance features and motion features of the video and constructing a dictionary from the sentences.
Step 2: fuse temporal information into the video features extracted in step 1 using a bidirectional long short-term memory recurrent neural network.
Step 3: use the coherence attention mechanism to extract the coherent visual features corresponding to the word generated at the current time step.
Step 4: feed the video features extracted in step 3 and the previous word of the ground-truth sentence into the traditional decoder to generate sentences; the decoder network is an LSTM.
Step 5: feed the video features extracted in step 3 and the semantic features of the word generated by the model at the previous time step into the self-learning decoder to generate sentences; the decoder network is an LSTM.
Step 6: jointly train the traditional decoder and the self-learning decoder with a cross-entropy loss function, introducing a training hyper-parameter to control their relative importance.
Step 7: in the testing stage, generate the final sentence by proportionally fusing the information of the traditional decoder and the self-learning decoder.
Referring to FIG. 1 and FIG. 2, the video description method based on the coherence attention mechanism and the dual-stream decoder according to the present invention proceeds as follows.
Step 1: preprocess the video/sentence pairs. The video is sampled at equal intervals into 50 frames; appearance features are extracted with the pre-trained 2D convolutional neural network ResNet-152 and motion features with the pre-trained 3D convolutional neural network ResNeXt-101. Punctuation marks are removed from the sentences and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, sentences longer than 20 are truncated and shorter sentences are padded with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words not present in the word bank are replaced with the <unk> symbol.
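A minimal, framework-free sketch of the sentence preprocessing in step 1; the function names, the whitespace tokenization, and whether the <start>/<end> tokens count toward the fixed length are the editor's assumptions.

```python
import string
from collections import Counter

MAX_LEN = 20          # fixed sentence length used in the patent
MIN_FREQ = 2          # words occurring fewer than 2 times are dropped

def tokenize(sentence: str) -> list[str]:
    # Remove punctuation and convert all letters to lower case.
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    return sentence.lower().split()

def build_vocab(sentences: list[str]) -> dict[str, int]:
    counts = Counter(w for s in sentences for w in tokenize(s))
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, freq in counts.items():
        if freq >= MIN_FREQ:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence: str, vocab: dict[str, int]) -> list[int]:
    # Truncate long sentences, wrap with <start>/<end>, map unknown words to <unk>, pad short ones.
    words = ["<start>"] + tokenize(sentence)[:MAX_LEN] + ["<end>"]
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    ids += [vocab["<pad>"]] * (MAX_LEN + 2 - len(ids))
    return ids
```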
Step 2: concatenate the 2D appearance features and 3D motion features extracted by the pre-trained networks and feed them into a bidirectional long short-term memory recurrent neural network (LSTM1). For each video frame, the forward output and the backward output of LSTM1 are concatenated to give the frame feature x_i, and the video features are expressed as V = {x_1, x_2, …, x_n}.
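The sketch below shows how the concatenated 2D/3D frame features could be passed through a bidirectional LSTM (LSTM1), with each frame feature x_i already concatenating the forward and backward hidden states; the hidden size of 512 and the batch handling are the editor's assumptions.

```python
import torch
import torch.nn as nn

n_frames, feat_dim, hidden_dim = 50, 4096, 512   # 4096 = 2048 (ResNet-152) + 2048 (ResNeXt-101)

# LSTM1: bidirectional LSTM over the concatenated appearance + motion features.
lstm1 = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                bidirectional=True, batch_first=True)

fused = torch.randn(1, n_frames, feat_dim)       # (batch, n, 4096) concatenated per-frame features
outputs, _ = lstm1(fused)                        # (1, n, 2 * hidden_dim)

# Each x_i concatenates the forward and backward hidden states of frame i.
V = outputs.squeeze(0)                           # V = {x_1, ..., x_n}, shape (n, 1024)
print(V.shape)
```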
Step 3: use the coherence attention mechanism to extract the visual features corresponding to the word generated at the current time step. The visual feature extracted by traditional attention is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, here the video features; h_{t-1} is the query, here the hidden state of the previous time step; and f_att is the scoring function that computes the similarity weights.
The coherence attention mechanism adds, on top of the traditional attention mechanism, a coherence feature that fuses global attention-weight information. For each frame, a recurrent network {LSTM2_1, LSTM2_2, …, LSTM2_n} with shared parameters takes the attention-weighted frame features as input, where α_{i,t-1} is the weight assigned to the i-th frame by the attention mechanism at time step t-1, and the resulting coherence feature is incorporated into the output of the coherence attention mechanism.
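A hedged sketch of one possible reading of the coherence attention mechanism: a parameter-shared LSTM cell tracks, for every frame, how its attention-weighted feature evolves across decoding steps, and the resulting coherence feature is concatenated with the frame feature when the attention scores are computed. The exact equations are published only as images, so the class name CoherentAttention, the additive scoring function, and the use of a single shared nn.LSTMCell are the editor's assumptions.

```python
import torch
import torch.nn as nn

class CoherentAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, coh_dim: int):
        super().__init__()
        # LSTM2: one cell whose parameters are shared across all frames, modelling how each
        # frame's attended feature evolves over time steps.
        self.coherence_rnn = nn.LSTMCell(feat_dim, coh_dim)
        # Additive attention over [frame feature ; coherence feature] and the decoder state.
        self.w_v = nn.Linear(feat_dim + coh_dim, hidden_dim)
        self.w_h = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, V, h_prev, alpha_prev, coh_state):
        """V: (n, feat_dim) frame features; h_prev: (hidden_dim,) decoder state;
        alpha_prev: (n,) previous attention weights; coh_state: per-frame LSTM state."""
        # Update the per-frame coherence features from the previously attended features.
        attended_prev = alpha_prev.unsqueeze(1) * V                   # {alpha_{i,t-1} x_i}
        s, c = self.coherence_rnn(attended_prev, coh_state)           # shared parameters across frames
        # Score each frame using both its feature and its coherence feature.
        keys = self.w_v(torch.cat([V, s], dim=1)) + self.w_h(h_prev)  # (n, hidden_dim)
        alpha = torch.softmax(self.score(torch.tanh(keys)).squeeze(1), dim=0)  # (n,)
        c_t = (alpha.unsqueeze(1) * V).sum(dim=0)                     # attended visual feature
        return c_t, alpha, (s, c)

# Usage with illustrative shapes.
n, feat_dim, hidden_dim, coh_dim = 50, 1024, 512, 512
att = CoherentAttention(feat_dim, hidden_dim, coh_dim)
V = torch.randn(n, feat_dim)
h_prev = torch.randn(hidden_dim)
alpha_prev = torch.full((n,), 1.0 / n)
coh_state = (torch.zeros(n, coh_dim), torch.zeros(n, coh_dim))
c_t, alpha, coh_state = att(V, h_prev, alpha_prev, coh_state)
print(c_t.shape, alpha.shape)
```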
Step 4: the traditional decoder generates sentences with a long short-term memory recurrent neural network (LSTM3). Its inputs are the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, and the output of LSTM3 is used to predict the word at the current time step.
Step 5: the self-learning decoder generates sentences with a long short-term memory recurrent neural network (LSTM4). Its inputs are the visual features extracted by the coherence attention mechanism and the semantic features of the word generated by the model at the previous time step, i.e. the probability distribution over the word bank, mapped by the trainable parameters W_e. The output of LSTM4 is used to predict the word at the current time step.
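A sketch of the two decoding streams in steps 4 and 5 under the editor's assumptions: both are LSTM cells fed with the attended visual feature, but the traditional decoder (LSTM3) receives the embedding of the previous ground-truth word, while the self-learning decoder (LSTM4) receives a projection W_e of the probability distribution it produced at the previous step. W_e comes from the patent text; the class names, dimensions, and single-step interface are hypothetical.

```python
import torch
import torch.nn as nn

class TraditionalDecoder(nn.Module):
    """LSTM3: input = [attended visual feature ; embedding of previous ground-truth word]."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, c_t, prev_word_id, state):
        x = torch.cat([c_t, self.embed(prev_word_id)], dim=-1)
        h, c = self.lstm(x.unsqueeze(0), state)
        return self.out(h).squeeze(0), (h, c)          # logits over the word bank

class SelfLearningDecoder(nn.Module):
    """LSTM4: input = [attended visual feature ; W_e * previous predicted distribution]."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.W_e = nn.Linear(vocab_size, embed_dim, bias=False)  # trainable mapping of the soft distribution
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, c_t, prev_probs, state):
        x = torch.cat([c_t, self.W_e(prev_probs)], dim=-1)
        h, c = self.lstm(x.unsqueeze(0), state)
        logits = self.out(h).squeeze(0)
        return logits, torch.softmax(logits, dim=-1), (h, c)

# Example single step: the traditional stream sees the ground-truth previous word,
# while the self-learning stream sees its own previous probability distribution.
feat_dim, embed_dim, hidden_dim, vocab_size = 1024, 512, 512, 10000
trad = TraditionalDecoder(feat_dim, embed_dim, hidden_dim, vocab_size)
selfd = SelfLearningDecoder(feat_dim, embed_dim, hidden_dim, vocab_size)
c_t = torch.randn(feat_dim)
logits_trad, trad_state = trad.step(c_t, torch.tensor(1), None)               # 1 = <start>
uniform = torch.full((vocab_size,), 1.0 / vocab_size)
logits_self, probs_self, self_state = selfd.step(c_t, uniform, None)
```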
Step 6: jointly train the traditional decoder and the self-learning decoder with a cross-entropy loss function. A training hyper-parameter λ is introduced to control the relative importance of the two decoders, and the final loss function is the weighted combination of the traditional decoder's loss and the self-learning decoder's loss.
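The published loss formula is available only as an image; the sketch below assumes a simple weighted sum of the two cross-entropy terms, with λ weighting the self-learning stream, which is one plausible reading rather than the patent's exact definition.

```python
import torch
import torch.nn.functional as F

def joint_loss(trad_logits, self_logits, target_ids, lam=0.5, pad_id=0):
    """trad_logits, self_logits: (T, vocab_size); target_ids: (T,) ground-truth word ids.
    lam is the training hyper-parameter controlling the two decoders' relative importance
    (the exact weighting scheme in the patent is not recoverable from the text)."""
    loss_trad = F.cross_entropy(trad_logits, target_ids, ignore_index=pad_id)
    loss_self = F.cross_entropy(self_logits, target_ids, ignore_index=pad_id)
    return loss_trad + lam * loss_self
```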
Step 7: in the testing stage, the final sentence is generated by proportionally fusing the information of the traditional decoder and the self-learning decoder. At time step t, the semantic features of the word are obtained by fusing the information extracted by the traditional decoder with the information extracted by the self-learning decoder, where the test hyper-parameter γ controls the fusion ratio of the two.
Step 8: map the fused word semantic features obtained in step 7 to a probability distribution over the word bank, take the word whose index has the maximum probability as the word generated at the current time step, and repeat the above steps until the <end> symbol is generated.
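A sketch of the test-stage proportional fusion and greedy decoding in steps 7 and 8, assuming the fused distribution is a convex combination controlled by γ; the END_ID value, the helper name, and the combination form are the editor's assumptions.

```python
import torch

def fuse_and_decode_step(trad_logits, self_logits, gamma):
    """Proportionally fuse the two decoders' outputs and pick the arg-max word."""
    p_trad = torch.softmax(trad_logits, dim=-1)    # information from the traditional decoder
    p_self = torch.softmax(self_logits, dim=-1)    # information from the self-learning decoder
    p = (1.0 - gamma) * p_trad + gamma * p_self    # gamma: test-time fusion ratio (assumed convex combination)
    return p, int(p.argmax())

# Greedy decoding loop: repeat until <end> is produced or the maximum length is reached.
END_ID, MAX_LEN, vocab_size = 2, 20, 10000
words, gamma = [], 0.3
for t in range(MAX_LEN):
    trad_logits = torch.randn(vocab_size)          # stand-ins for the real decoder outputs at step t
    self_logits = torch.randn(vocab_size)
    probs, word_id = fuse_and_decode_step(trad_logits, self_logits, gamma)
    if word_id == END_ID:
        break
    words.append(word_id)
print(words)
```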
The technical solutions of the coherence attention mechanism algorithm and the dual-stream-decoder video description generation algorithm proposed in the present invention have thus been described in detail with reference to the accompanying drawings. The coherence attention mechanism algorithm proposed by the invention can extract more coherent visual features, and the dual-stream-decoder video description generation algorithm can more fully exploit the semantic information of previously generated words.
It should be understood by those skilled in the art that the scope of the present invention is not limited to these specific embodiments. The coherence attention mechanism proposed in the present invention can be applied to different types of attention mechanisms, and the dual-stream decoder is not limited to long short-term memory recurrent neural networks; it remains effective when applied to other network structures that model sequence dependencies (such as the Transformer). Equivalent changes or substitutions of the related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical solutions after such changes or substitutions fall within the protection scope of the invention.

Claims (8)

1. A video description method based on a coherence attention mechanism and a dual-stream decoder, characterized in that:
a video with its description sentences is sampled to obtain video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network;
the appearance features and motion features are concatenated and fed into a bidirectional long short-term memory recurrent neural network;
coherent visual features corresponding to the currently generated word are extracted from the visual features by a coherence attention mechanism; and the extracted visual features are fed into a traditional decoder and a self-learning decoder respectively to generate sentences describing the video.
2. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the video/sentence pairs are preprocessed as follows: the video is sampled at equal intervals into 50 frames; appearance features are extracted with the pre-trained 2D convolutional neural network ResNet-152 and motion features with the pre-trained 3D convolutional neural network ResNeXt-101; punctuation marks are removed from the sentences and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, sentences longer than 20 are truncated and shorter sentences are padded with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words not present in the word bank are replaced with the <unk> symbol.
3. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1); for each video frame, the forward output and the backward output of LSTM1 are concatenated to give the frame feature x_i, and the video features are expressed as V = {x_1, x_2, …, x_n}.
4. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the coherence attention mechanism algorithm comprises:
step 1: the fused features sampled and extracted from the video are expressed as V = {x_1, x_2, …, x_n}, and the weighted visual features obtained by the coherence attention mechanism at the previous time step are expressed as c_{t-1} = {α_{1,t-1}x_1, α_{2,t-1}x_2, …, α_{n,t-1}x_n};
step 2: a recurrent neural network is used to model the temporal dependency of each frame's attended feature; for the i-th frame at time step t, the recurrent network produces a coherence feature for that frame, and parameter sharing among {LSTM_1, LSTM_2, …, LSTM_n} models the interaction between video frames;
step 3: in the traditional attention mechanism, the weighted feature of the video frames is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, i.e. the video features, h_{t-1} is the query, i.e. the hidden state of the previous time step, and f_att is the scoring function that computes the similarity weights; the coherence attention mechanism adds intra-frame and inter-frame coherence information on top of the traditional attention mechanism by concatenating the features that fuse intra-frame and inter-frame information into the attention computation.
5. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that a self-learning decoder is added alongside the traditional decoder in order to exploit the semantic information of previously generated words; both the traditional decoder and the self-learning decoder use a recurrent neural network to generate sentences; the inputs of the traditional decoder are the visual features obtained through the attention mechanism and the previous word of the ground-truth description; the inputs of the self-learning decoder are the visual features and the semantic features of the word generated at the previous time step, namely the probability distribution over the whole word bank produced at the previous time step; and in the testing stage the information of the traditional decoder and of the self-learning decoder is fused at a certain ratio to generate the final word.
6. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the traditional decoder generates sentences with a long short-term memory recurrent neural network (LSTM3); its inputs are the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, and the output of LSTM3 is used to predict the word at the current time step.
7. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the self-learning decoder generates sentences with a long short-term memory recurrent neural network (LSTM4); its inputs are the visual features extracted by the coherence attention mechanism and the semantic features of the word generated by the model at the previous time step, i.e. the probability distribution over the word bank, mapped by the trainable parameters W_e; and the output of LSTM4 is used to predict the word at the current time step.
8. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the traditional decoder and the self-learning decoder are trained jointly with a cross-entropy loss function; a training hyper-parameter λ is introduced to control the relative importance of the two decoders, and the final loss function is the weighted combination of the traditional decoder's loss and the self-learning decoder's loss.
CN202211060223.8A 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder Pending CN115346158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211060223.8A CN115346158A (en) 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211060223.8A CN115346158A (en) 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder

Publications (1)

Publication Number Publication Date
CN115346158A true CN115346158A (en) 2022-11-15

Family

ID=83955002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211060223.8A Pending CN115346158A (en) 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder

Country Status (1)

Country Link
CN (1) CN115346158A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089654A (en) * 2023-04-07 2023-05-09 杭州东上智能科技有限公司 Audio supervision-based transferable audio-visual text generation method and system
CN116089654B (en) * 2023-04-07 2023-07-07 杭州东上智能科技有限公司 Audio supervision-based transferable audio-visual text generation method and system

Similar Documents

Publication Publication Date Title
CN110598221B (en) Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network
CN111897949B (en) Guided text abstract generation method based on Transformer
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
CN110738057B (en) Text style migration method based on grammar constraint and language model
CN109492227A (en) It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN112115247B (en) Personalized dialogue generation method and system based on long-short-time memory information
CN110929030A (en) Text abstract and emotion classification combined training method
CN111078866B (en) Chinese text abstract generation method based on sequence-to-sequence model
CN110083710A (en) It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN112183058B (en) Poetry generation method and device based on BERT sentence vector input
CN111581383A (en) Chinese text classification method based on ERNIE-BiGRU
CN110738062A (en) GRU neural network Mongolian Chinese machine translation method
CN110851575B (en) Dialogue generating system and dialogue realizing method
CN112309528B (en) Medical image report generation method based on visual question-answering method
CN109992775A (en) A kind of text snippet generation method based on high-level semantics
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN111125333A (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN114708474A (en) Image semantic understanding algorithm fusing local and global features
CN115346158A (en) Video description method based on coherence attention mechanism and double-stream decoder
Mathur et al. A scaled‐down neural conversational model for chatbots
CN113360601A (en) PGN-GAN text abstract model fusing topics
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN112989845B (en) Chapter-level neural machine translation method and system based on routing algorithm
CN115719072A (en) Chapter-level neural machine translation method and system based on mask mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination