CN115346158A - Video description method based on coherence attention mechanism and double-stream decoder - Google Patents

Video description method based on coherence attention mechanism and double-stream decoder

Info

Publication number
CN115346158A
Authority
CN
China
Prior art keywords
decoder
attention mechanism
video
self
coherent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211060223.8A
Other languages
Chinese (zh)
Inventor
钟忺
巫世峰
张瑾
黄文心
孙志新
刘文璇
刘静
张晓梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongqian Liyuan Engineering Consulting Co ltd
Original Assignee
Zhongqian Liyuan Engineering Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongqian Liyuan Engineering Consulting Co ltd filed Critical Zhongqian Liyuan Engineering Consulting Co ltd
Priority to CN202211060223.8A
Publication of CN115346158A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video description method based on a coherence attention mechanism and a dual-stream decoder, which mainly comprises a coherence attention mechanism algorithm and a dual-stream-decoder video description generation algorithm. The coherence attention mechanism re-models the attention mechanism by extracting the temporal features of its weights, thereby building coherence into the attention computation. The dual-stream-decoder video description generation algorithm adds a self-learning decoder on top of a traditional decoder in order to exploit the semantic information of previously generated words. The invention can generate video descriptions that are more coherent and have more accurate and richer semantics.

Description

Video description method based on a coherence attention mechanism and a dual-stream decoder
Technical Field
The invention relates to a video description generation algorithm based on a coherence attention mechanism and a dual-stream decoder, and in particular to a coherence attention mechanism algorithm and a dual-stream-decoder video description generation algorithm. It belongs to the field of deep learning at the intersection of computer vision and natural language processing.
Background
Video captioning is a cross-modal task that describes the main content of a video in compact natural language. With the development of deep learning in recent years, video captioning methods commonly adopt an encoder-decoder architecture: the encoder typically uses convolutional neural networks to extract the appearance and motion features of the video, and the decoder uses a long short-term memory (LSTM) recurrent neural network or a Transformer to generate sentences. Attention mechanisms are widely used to extract the visual features most relevant to the word being generated at the current time step. However, previous methods model neither how each frame's attended feature changes dynamically over time nor how different frames interact after passing through the attention mechanism. In the decoder stage, the inputs of the LSTM or Transformer are the visual features obtained through the attention mechanism and the ground-truth word of the previous time step, so the semantic information of the words generated by the model itself is not fully exploited.
Disclosure of Invention
The invention aims at the above problems and proposes a video description generation algorithm based on a coherence attention mechanism and a dual-stream decoder. The coherence attention mechanism algorithm fuses intra-frame and inter-frame coherence information into the attention mechanism, using the dependencies of attended features within and between frames across time steps to improve attention. In the dual-stream-decoder video description generation algorithm, a self-learning decoder is added alongside the traditional decoder; its inputs are the visual features extracted by the attention mechanism and the semantic features of the word generated by the video description model at the previous time step. In the testing stage, the semantic information of the traditional decoder and of the self-learning decoder is fused proportionally. The added self-learning decoder is designed to fully exploit the semantic information of previously generated words.
First, a video with its description sentences is sampled to obtain video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network. The appearance and motion features are then concatenated and fed into a bidirectional long short-term memory recurrent neural network. A coherence attention mechanism is then designed to extract, from the visual features, the coherent visual features corresponding to the word currently being generated. Finally, the extracted visual features are fed into a traditional decoder and a self-learning decoder respectively to generate sentences describing the video. Both decoders use a long short-term memory recurrent neural network to generate sentences: at the current time step, the traditional decoder takes the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, while the self-learning decoder takes the visual features extracted by the coherence attention mechanism and the semantic features of the word it generated at the previous time step.
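As an illustrative sketch only (not the patent's reference implementation), the snippet below shows one way the per-frame appearance features could be obtained from an off-the-shelf pre-trained ResNet-152; the ResNeXt-101 3D motion features are assumed to be precomputed, and the function name, tensor shapes, and feature dimensions are the editor's assumptions.

```python
import torch
import torch.nn as nn
import torchvision

def extract_appearance_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (n, 3, 224, 224) sampled, normalized video frames.
    Returns (n, 2048) appearance features from a pre-trained ResNet-152."""
    backbone = torchvision.models.resnet152(
        weights=torchvision.models.ResNet152_Weights.IMAGENET1K_V1)
    # Drop the classification head; keep everything up to global average pooling.
    feature_extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()
    with torch.no_grad():
        feats = feature_extractor(frames)        # (n, 2048, 1, 1)
    return feats.flatten(1)                      # (n, 2048)

if __name__ == "__main__":
    dummy_frames = torch.randn(50, 3, 224, 224)  # 50 equally spaced frames, as in the patent
    appearance = extract_appearance_features(dummy_frames)
    motion = torch.randn(50, 2048)               # placeholder for precomputed ResNeXt-101 3D features
    fused_input = torch.cat([appearance, motion], dim=1)  # per-frame concatenation of 2D and 3D features
    print(fused_input.shape)                     # torch.Size([50, 4096])
```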
The technical solution adopted by the invention is a video description method based on a coherence attention mechanism and a dual-stream decoder, which specifically comprises the following steps.
A video description method based on a coherence attention mechanism and a dual-stream decoder is characterized in that:
a video with its description sentences is sampled to obtain video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network;
the appearance features and motion features are concatenated and fed into a bidirectional long short-term memory recurrent neural network;
coherent visual features corresponding to the currently generated word are extracted from the visual features by a coherence attention mechanism; and the extracted visual features are fed into a traditional decoder and a self-learning decoder respectively to generate sentences describing the video.
In the above video description method, the video/sentence pairs are preprocessed as follows: the video is sampled at equal intervals into 50 frames; appearance features are extracted with the pre-trained 2D convolutional neural network ResNet-152 and motion features with the pre-trained 3D convolutional neural network ResNeXt-101; punctuation marks are removed from the sentences and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, sentences longer than 20 are truncated and shorter sentences are padded with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words not present in the word bank are replaced with the <unk> symbol.
In the above video description method, the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1). For each video frame, the forward output and the backward output of LSTM1 are concatenated to give the frame feature x_i, and the video features are expressed as V = {x_1, x_2, …, x_n}.
In the above video description method, the coherence attention mechanism algorithm comprises:
Step 1: the fused features sampled and extracted from the video are expressed as V = {x_1, x_2, …, x_n}, and the weighted visual features obtained by the coherence attention mechanism at the previous time step are expressed as c_{t-1} = {α_{1,t-1}x_1, α_{2,t-1}x_2, …, α_{n,t-1}x_n}.
Step 2: a recurrent neural network is used to model the temporal dependency of each frame's attended feature; for the i-th frame at time step t, the recurrent network produces a coherence feature for that frame, and parameter sharing among {LSTM_1, LSTM_2, …, LSTM_n} models the interaction between video frames.
Step 3: in the traditional attention mechanism, the weighted feature of the video frames is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, i.e. the video features; h_{t-1} is the query, i.e. the hidden state of the previous time step; and f_att is the scoring function that computes the similarity weights. The coherence attention mechanism adds intra-frame and inter-frame coherence information on top of the traditional attention mechanism by concatenating the features that fuse intra-frame and inter-frame information into the attention computation.
In the above video description method, a self-learning decoder is added alongside the traditional decoder in order to exploit the semantic information of previously generated words. Both the traditional decoder and the self-learning decoder use a recurrent neural network to generate sentences. The inputs of the traditional decoder are the visual features obtained through the attention mechanism and the previous word of the ground-truth description; the inputs of the self-learning decoder are the visual features and the semantic features of the word generated at the previous time step, namely the probability distribution over the whole word bank produced at the previous time step. In the testing stage, the information of the traditional decoder and of the self-learning decoder is fused at a certain ratio to generate the final word.
In the above video description method, the traditional decoder generates sentences with a long short-term memory recurrent neural network (LSTM3); its inputs are the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, and the output of LSTM3 is used to predict the word at the current time step.
In the above video description method, the self-learning decoder generates sentences with a long short-term memory recurrent neural network (LSTM4); its inputs are the visual features extracted by the coherence attention mechanism and the semantic features of the word generated by the model at the previous time step, i.e. the probability distribution over the word bank, mapped by the trainable parameters W_e. The output of LSTM4 is used to predict the word at the current time step.
In the above video description method, the traditional decoder and the self-learning decoder are trained jointly with a cross-entropy loss function. A training hyper-parameter λ is introduced to control the relative importance of the two decoders, and the final loss function is the weighted combination of the traditional decoder's loss and the self-learning decoder's loss.
The invention has the following advantages. Taking video description generation as the research object, the invention designs a video description generation algorithm based on a coherence attention mechanism and a dual-stream decoder. The coherence attention mechanism exploits the weight information that the attention mechanism assigns to different video frames across time steps, and guides the attention mechanism to select more coherent visual features according to the temporal dependency within frames and the interaction between frames, so that more coherent sentences are generated. The dual-stream-decoder video description generation algorithm adds a self-learning decoder to the traditional decoder; unlike the traditional decoder, whose input is the word from the ground-truth sentence, the input of the self-learning decoder is the probability distribution over the word bank produced by the model at the previous time step, so the semantic information of the words generated by the model is exploited more fully. In the testing stage, the final word is generated by proportionally fusing the information of the two decoders, producing sentences with more accurate and richer semantics.
Drawings
The invention will now be further described with reference to the following examples.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a network framework of the method of the present invention.
Detailed Description
For the purpose of making the present invention more apparent, its technical solutions and advantages will be described in detail below with reference to the accompanying drawings.
The invention mainly comprises the following steps.
Step 1: load and preprocess the video/sentence pairs, including extracting the appearance features and motion features of the video and constructing a dictionary from the sentences.
Step 2: fuse temporal information into the video features extracted in step 1 using a bidirectional long short-term memory recurrent neural network.
Step 3: use the coherence attention mechanism to extract the coherent visual features corresponding to the word generated at the current time step.
Step 4: feed the video features extracted in step 3 and the previous word of the ground-truth sentence into the traditional decoder to generate sentences; the decoder network is an LSTM.
Step 5: feed the video features extracted in step 3 and the semantic features of the word generated by the model at the previous time step into the self-learning decoder to generate sentences; the decoder network is an LSTM.
Step 6: jointly train the traditional decoder and the self-learning decoder with a cross-entropy loss function, introducing a training hyper-parameter to control their relative importance.
Step 7: in the testing stage, generate the final sentence by proportionally fusing the information of the traditional decoder and the self-learning decoder.
Referring to FIG. 1 and FIG. 2, the video description method based on the coherence attention mechanism and the dual-stream decoder according to the present invention proceeds as follows.
Step 1: preprocess the video/sentence pairs. The video is sampled at equal intervals into 50 frames; appearance features are extracted with the pre-trained 2D convolutional neural network ResNet-152 and motion features with the pre-trained 3D convolutional neural network ResNeXt-101. Punctuation marks are removed from the sentences and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, sentences longer than 20 are truncated and shorter sentences are padded with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words not present in the word bank are replaced with the <unk> symbol.
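A minimal, framework-free sketch of the sentence preprocessing in step 1; the function names, the whitespace tokenization, and whether the <start>/<end> tokens count toward the fixed length are the editor's assumptions.

```python
import string
from collections import Counter

MAX_LEN = 20          # fixed sentence length used in the patent
MIN_FREQ = 2          # words occurring fewer than 2 times are dropped

def tokenize(sentence: str) -> list[str]:
    # Remove punctuation and convert all letters to lower case.
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    return sentence.lower().split()

def build_vocab(sentences: list[str]) -> dict[str, int]:
    counts = Counter(w for s in sentences for w in tokenize(s))
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, freq in counts.items():
        if freq >= MIN_FREQ:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence: str, vocab: dict[str, int]) -> list[int]:
    # Truncate long sentences, wrap with <start>/<end>, map unknown words to <unk>, pad short ones.
    words = ["<start>"] + tokenize(sentence)[:MAX_LEN] + ["<end>"]
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    ids += [vocab["<pad>"]] * (MAX_LEN + 2 - len(ids))
    return ids
```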
Step 2: concatenate the 2D appearance features and 3D motion features extracted by the pre-trained networks and feed them into a bidirectional long short-term memory recurrent neural network (LSTM1). For each video frame, the forward output and the backward output of LSTM1 are concatenated to give the frame feature x_i, and the video features are expressed as V = {x_1, x_2, …, x_n}.
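The sketch below shows how the concatenated 2D/3D frame features could be passed through a bidirectional LSTM (LSTM1), with each frame feature x_i already concatenating the forward and backward hidden states; the hidden size of 512 and the batch handling are the editor's assumptions.

```python
import torch
import torch.nn as nn

n_frames, feat_dim, hidden_dim = 50, 4096, 512   # 4096 = 2048 (ResNet-152) + 2048 (ResNeXt-101)

# LSTM1: bidirectional LSTM over the concatenated appearance + motion features.
lstm1 = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                bidirectional=True, batch_first=True)

fused = torch.randn(1, n_frames, feat_dim)       # (batch, n, 4096) concatenated per-frame features
outputs, _ = lstm1(fused)                        # (1, n, 2 * hidden_dim)

# Each x_i concatenates the forward and backward hidden states of frame i.
V = outputs.squeeze(0)                           # V = {x_1, ..., x_n}, shape (n, 1024)
print(V.shape)
```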
Step 3: use the coherence attention mechanism to extract the visual features corresponding to the word generated at the current time step. The visual feature extracted by traditional attention is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, here the video features; h_{t-1} is the query, here the hidden state of the previous time step; and f_att is the scoring function that computes the similarity weights.
The coherence attention mechanism adds, on top of the traditional attention mechanism, a coherence feature that fuses global attention-weight information. For each frame, a recurrent network {LSTM2_1, LSTM2_2, …, LSTM2_n} with shared parameters takes the attention-weighted frame features as input, where α_{i,t-1} is the weight assigned to the i-th frame by the attention mechanism at time step t-1, and the resulting coherence feature is incorporated into the output of the coherence attention mechanism.
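A hedged sketch of one possible reading of the coherence attention mechanism: a parameter-shared LSTM cell tracks, for every frame, how its attention-weighted feature evolves across decoding steps, and the resulting coherence feature is concatenated with the frame feature when the attention scores are computed. The exact equations are published only as images, so the class name CoherentAttention, the additive scoring function, and the use of a single shared nn.LSTMCell are the editor's assumptions.

```python
import torch
import torch.nn as nn

class CoherentAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, coh_dim: int):
        super().__init__()
        # LSTM2: one cell whose parameters are shared across all frames, modelling how each
        # frame's attended feature evolves over time steps.
        self.coherence_rnn = nn.LSTMCell(feat_dim, coh_dim)
        # Additive attention over [frame feature ; coherence feature] and the decoder state.
        self.w_v = nn.Linear(feat_dim + coh_dim, hidden_dim)
        self.w_h = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, V, h_prev, alpha_prev, coh_state):
        """V: (n, feat_dim) frame features; h_prev: (hidden_dim,) decoder state;
        alpha_prev: (n,) previous attention weights; coh_state: per-frame LSTM state."""
        # Update the per-frame coherence features from the previously attended features.
        attended_prev = alpha_prev.unsqueeze(1) * V                   # {alpha_{i,t-1} x_i}
        s, c = self.coherence_rnn(attended_prev, coh_state)           # shared parameters across frames
        # Score each frame using both its feature and its coherence feature.
        keys = self.w_v(torch.cat([V, s], dim=1)) + self.w_h(h_prev)  # (n, hidden_dim)
        alpha = torch.softmax(self.score(torch.tanh(keys)).squeeze(1), dim=0)  # (n,)
        c_t = (alpha.unsqueeze(1) * V).sum(dim=0)                     # attended visual feature
        return c_t, alpha, (s, c)

# Usage with illustrative shapes.
n, feat_dim, hidden_dim, coh_dim = 50, 1024, 512, 512
att = CoherentAttention(feat_dim, hidden_dim, coh_dim)
V = torch.randn(n, feat_dim)
h_prev = torch.randn(hidden_dim)
alpha_prev = torch.full((n,), 1.0 / n)
coh_state = (torch.zeros(n, coh_dim), torch.zeros(n, coh_dim))
c_t, alpha, coh_state = att(V, h_prev, alpha_prev, coh_state)
print(c_t.shape, alpha.shape)
```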
Step 4: the traditional decoder generates sentences with a long short-term memory recurrent neural network (LSTM3). Its inputs are the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, and the output of LSTM3 is used to predict the word at the current time step.
Step 5: the self-learning decoder generates sentences with a long short-term memory recurrent neural network (LSTM4). Its inputs are the visual features extracted by the coherence attention mechanism and the semantic features of the word generated by the model at the previous time step, i.e. the probability distribution over the word bank, mapped by the trainable parameters W_e. The output of LSTM4 is used to predict the word at the current time step.
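A sketch of the two decoding streams in steps 4 and 5 under the editor's assumptions: both are LSTM cells fed with the attended visual feature, but the traditional decoder (LSTM3) receives the embedding of the previous ground-truth word, while the self-learning decoder (LSTM4) receives a projection W_e of the probability distribution it produced at the previous step. W_e comes from the patent text; the class names, dimensions, and single-step interface are hypothetical.

```python
import torch
import torch.nn as nn

class TraditionalDecoder(nn.Module):
    """LSTM3: input = [attended visual feature ; embedding of previous ground-truth word]."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, c_t, prev_word_id, state):
        x = torch.cat([c_t, self.embed(prev_word_id)], dim=-1)
        h, c = self.lstm(x.unsqueeze(0), state)
        return self.out(h).squeeze(0), (h, c)          # logits over the word bank

class SelfLearningDecoder(nn.Module):
    """LSTM4: input = [attended visual feature ; W_e * previous predicted distribution]."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.W_e = nn.Linear(vocab_size, embed_dim, bias=False)  # trainable mapping of the soft distribution
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, c_t, prev_probs, state):
        x = torch.cat([c_t, self.W_e(prev_probs)], dim=-1)
        h, c = self.lstm(x.unsqueeze(0), state)
        logits = self.out(h).squeeze(0)
        return logits, torch.softmax(logits, dim=-1), (h, c)

# Example single step: the traditional stream sees the ground-truth previous word,
# while the self-learning stream sees its own previous probability distribution.
feat_dim, embed_dim, hidden_dim, vocab_size = 1024, 512, 512, 10000
trad = TraditionalDecoder(feat_dim, embed_dim, hidden_dim, vocab_size)
selfd = SelfLearningDecoder(feat_dim, embed_dim, hidden_dim, vocab_size)
c_t = torch.randn(feat_dim)
logits_trad, trad_state = trad.step(c_t, torch.tensor(1), None)               # 1 = <start>
uniform = torch.full((vocab_size,), 1.0 / vocab_size)
logits_self, probs_self, self_state = selfd.step(c_t, uniform, None)
```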
Step 6: jointly train the traditional decoder and the self-learning decoder with a cross-entropy loss function. A training hyper-parameter λ is introduced to control the relative importance of the two decoders, and the final loss function is the weighted combination of the traditional decoder's loss and the self-learning decoder's loss.
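The published loss formula is available only as an image; the sketch below assumes a simple weighted sum of the two cross-entropy terms, with λ weighting the self-learning stream, which is one plausible reading rather than the patent's exact definition.

```python
import torch
import torch.nn.functional as F

def joint_loss(trad_logits, self_logits, target_ids, lam=0.5, pad_id=0):
    """trad_logits, self_logits: (T, vocab_size); target_ids: (T,) ground-truth word ids.
    lam is the training hyper-parameter controlling the two decoders' relative importance
    (the exact weighting scheme in the patent is not recoverable from the text)."""
    loss_trad = F.cross_entropy(trad_logits, target_ids, ignore_index=pad_id)
    loss_self = F.cross_entropy(self_logits, target_ids, ignore_index=pad_id)
    return loss_trad + lam * loss_self
```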
Step 7: in the testing stage, the final sentence is generated by proportionally fusing the information of the traditional decoder and the self-learning decoder. At time step t, the semantic features of the word are obtained by fusing the information extracted by the traditional decoder with the information extracted by the self-learning decoder, where the test hyper-parameter γ controls the fusion ratio of the two.
Step 8: map the fused word semantic features obtained in step 7 to a probability distribution over the word bank, take the word whose index has the maximum probability as the word generated at the current time step, and repeat the above steps until the <end> symbol is generated.
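A sketch of the test-stage proportional fusion and greedy decoding in steps 7 and 8, assuming the fused distribution is a convex combination controlled by γ; the END_ID value, the helper name, and the combination form are the editor's assumptions.

```python
import torch

def fuse_and_decode_step(trad_logits, self_logits, gamma):
    """Proportionally fuse the two decoders' outputs and pick the arg-max word."""
    p_trad = torch.softmax(trad_logits, dim=-1)    # information from the traditional decoder
    p_self = torch.softmax(self_logits, dim=-1)    # information from the self-learning decoder
    p = (1.0 - gamma) * p_trad + gamma * p_self    # gamma: test-time fusion ratio (assumed convex combination)
    return p, int(p.argmax())

# Greedy decoding loop: repeat until <end> is produced or the maximum length is reached.
END_ID, MAX_LEN, vocab_size = 2, 20, 10000
words, gamma = [], 0.3
for t in range(MAX_LEN):
    trad_logits = torch.randn(vocab_size)          # stand-ins for the real decoder outputs at step t
    self_logits = torch.randn(vocab_size)
    probs, word_id = fuse_and_decode_step(trad_logits, self_logits, gamma)
    if word_id == END_ID:
        break
    words.append(word_id)
print(words)
```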
The technical solutions of the coherence attention mechanism algorithm and the dual-stream-decoder video description generation algorithm proposed in the present invention have thus been described in detail with reference to the accompanying drawings. The coherence attention mechanism algorithm proposed by the invention can extract more coherent visual features, and the dual-stream-decoder video description generation algorithm can more fully exploit the semantic information of previously generated words.
It should be understood by those skilled in the art that the scope of the present invention is not limited to these specific embodiments. The coherence attention mechanism proposed in the present invention can be applied to different types of attention mechanisms, and the dual-stream decoder is not limited to long short-term memory recurrent neural networks; it remains effective when applied to other network structures that model sequence dependencies (such as the Transformer). Equivalent changes or substitutions of the related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical solutions after such changes or substitutions fall within the protection scope of the invention.

Claims (8)

1. A video description method based on a coherence attention mechanism and a dual-stream decoder, characterized in that:
a video with its description sentences is sampled to obtain video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network;
the appearance features and motion features are concatenated and fed into a bidirectional long short-term memory recurrent neural network;
coherent visual features corresponding to the currently generated word are extracted from the visual features by a coherence attention mechanism; and the extracted visual features are fed into a traditional decoder and a self-learning decoder respectively to generate sentences describing the video.
2. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the video/sentence pairs are preprocessed as follows: the video is sampled at equal intervals into 50 frames; appearance features are extracted with the pre-trained 2D convolutional neural network ResNet-152 and motion features with the pre-trained 3D convolutional neural network ResNeXt-101; punctuation marks are removed from the sentences and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, sentences longer than 20 are truncated and shorter sentences are padded with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words not present in the word bank are replaced with the <unk> symbol.
3. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1); for each video frame, the forward output and the backward output of LSTM1 are concatenated to give the frame feature x_i, and the video features are expressed as V = {x_1, x_2, …, x_n}.
4. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the coherence attention mechanism algorithm comprises:
step 1: the fused features sampled and extracted from the video are expressed as V = {x_1, x_2, …, x_n}, and the weighted visual features obtained by the coherence attention mechanism at the previous time step are expressed as c_{t-1} = {α_{1,t-1}x_1, α_{2,t-1}x_2, …, α_{n,t-1}x_n};
step 2: a recurrent neural network is used to model the temporal dependency of each frame's attended feature; for the i-th frame at time step t, the recurrent network produces a coherence feature for that frame, and parameter sharing among {LSTM_1, LSTM_2, …, LSTM_n} models the interaction between video frames;
step 3: in the traditional attention mechanism, the weighted feature of the video frames is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, i.e. the video features, h_{t-1} is the query, i.e. the hidden state of the previous time step, and f_att is the scoring function that computes the similarity weights; the coherence attention mechanism adds intra-frame and inter-frame coherence information on top of the traditional attention mechanism by concatenating the features that fuse intra-frame and inter-frame information into the attention computation.
5. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that a self-learning decoder is added alongside the traditional decoder in order to exploit the semantic information of previously generated words; both the traditional decoder and the self-learning decoder use a recurrent neural network to generate sentences; the inputs of the traditional decoder are the visual features obtained through the attention mechanism and the previous word of the ground-truth description; the inputs of the self-learning decoder are the visual features and the semantic features of the word generated at the previous time step, namely the probability distribution over the whole word bank produced at the previous time step; and in the testing stage the information of the traditional decoder and of the self-learning decoder is fused at a certain ratio to generate the final word.
6. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the traditional decoder generates sentences with a long short-term memory recurrent neural network (LSTM3); its inputs are the visual features extracted by the coherence attention mechanism and the previous word of the ground-truth sentence, and the output of LSTM3 is used to predict the word at the current time step.
7. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the self-learning decoder generates sentences with a long short-term memory recurrent neural network (LSTM4); its inputs are the visual features extracted by the coherence attention mechanism and the semantic features of the word generated by the model at the previous time step, i.e. the probability distribution over the word bank, mapped by the trainable parameters W_e; and the output of LSTM4 is used to predict the word at the current time step.
8. The video description method based on a coherence attention mechanism and a dual-stream decoder according to claim 1, characterized in that the traditional decoder and the self-learning decoder are trained jointly with a cross-entropy loss function; a training hyper-parameter λ is introduced to control the relative importance of the two decoders, and the final loss function is the weighted combination of the traditional decoder's loss and the self-learning decoder's loss.
CN202211060223.8A 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder Pending CN115346158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211060223.8A CN115346158A (en) 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211060223.8A CN115346158A (en) 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder

Publications (1)

Publication Number Publication Date
CN115346158A true CN115346158A (en) 2022-11-15

Family

ID=83955002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211060223.8A Pending CN115346158A (en) 2022-08-31 2022-08-31 Video description method based on coherence attention mechanism and double-stream decoder

Country Status (1)

Country Link
CN (1) CN115346158A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089654A (en) * 2023-04-07 2023-05-09 杭州东上智能科技有限公司 Audio supervision-based transferable audio-visual text generation method and system
CN116089654B (en) * 2023-04-07 2023-07-07 杭州东上智能科技有限公司 Audio supervision-based transferable audio-visual text generation method and system

Similar Documents

Publication Publication Date Title
CN110598221B (en) Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network
CN111897949B (en) Guided text abstract generation method based on Transformer
CN108416065B (en) Hierarchical neural network-based image-sentence description generation system and method
CN110738057B (en) Text style migration method based on grammar constraint and language model
CN109492227A (en) It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN112115247B (en) Personalized dialogue generation method and system based on long-short-time memory information
CN110929030A (en) Text abstract and emotion classification combined training method
CN111078866B (en) Chinese text abstract generation method based on sequence-to-sequence model
CN110083710A (en) It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN112183058B (en) Poetry generation method and device based on BERT sentence vector input
CN111581383A (en) Chinese text classification method based on ERNIE-BiGRU
CN110738062A (en) GRU neural network Mongolian Chinese machine translation method
CN110851575B (en) Dialogue generating system and dialogue realizing method
CN112309528B (en) Medical image report generation method based on visual question-answering method
CN109992775A (en) A kind of text snippet generation method based on high-level semantics
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN111125333A (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN114708474A (en) Image semantic understanding algorithm fusing local and global features
CN115346158A (en) Video description method based on coherence attention mechanism and double-stream decoder
Mathur et al. A scaled‐down neural conversational model for chatbots
CN113360601A (en) PGN-GAN text abstract model fusing topics
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN112989845B (en) Chapter-level neural machine translation method and system based on routing algorithm
CN115719072A (en) Chapter-level neural machine translation method and system based on mask mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination