CN115346158A - Video description method based on coherence attention mechanism and double-stream decoder - Google Patents
- Publication number
- CN115346158A CN115346158A CN202211060223.8A CN202211060223A CN115346158A CN 115346158 A CN115346158 A CN 115346158A CN 202211060223 A CN202211060223 A CN 202211060223A CN 115346158 A CN115346158 A CN 115346158A
- Authority
- CN
- China
- Prior art keywords
- decoder
- attention mechanism
- video
- self
- coherent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video description method based on a coherent attention mechanism and a dual-stream decoder, which mainly comprises a coherent attention mechanism algorithm and a dual-stream-decoder video description generation algorithm. The coherent attention mechanism re-models the attention weights to extract their temporal characteristics and thereby builds coherence into the attention mechanism; the dual-stream-decoder video description generation algorithm adds a self-learning decoder to the conventional decoder in order to exploit the semantic information of previously generated words. The invention can generate more coherent video descriptions with more accurate and richer semantics.
Description
Technical Field
The invention relates to a video description generation algorithm based on a coherent attention mechanism and a dual-stream decoder; it further relates to a coherent attention mechanism algorithm and a dual-stream-decoder video description generation algorithm, and belongs to the field of deep learning at the intersection of computer vision and natural language processing.
Background
Video captioning is a cross-modal task that describes the main content of a video in compact natural language. With the development of deep learning in recent years, video captioning methods have commonly adopted an encoder-decoder architecture: the encoder typically uses convolutional neural networks to extract the appearance and motion features of the video, and the decoder uses a long short-term memory (LSTM) recurrent neural network or a Transformer to generate sentences. Attention mechanisms are widely used to extract the visual features, fed to the decoder, that are most relevant to the word being generated at the current time step. However, previous methods model neither the dynamic evolution of each video frame's attention across time steps nor the interaction among different frames after attention. In the decoder stage, the input of the LSTM or Transformer is the visual feature obtained through the attention mechanism together with the ground-truth word of the previous time step, so the semantic information of the words generated by the model itself is not fully exploited.
Disclosure of Invention
The invention aims to address the above problems by proposing a video description generation algorithm based on a coherent attention mechanism and a dual-stream decoder. The coherent attention mechanism algorithm fuses intra-frame and inter-frame coherence information into the attention mechanism, improving it with the intra-frame and inter-frame dependencies that attention exhibits across time steps. The dual-stream-decoder video description generation algorithm adds a self-learning decoder to the conventional decoder; its input is the visual feature extracted by the attention mechanism together with the semantic feature of the word generated by the video description model at the previous time step. In the testing stage, the semantic information of the conventional decoder and of the self-learning decoder is fused proportionally; the added self-learning decoder is designed to fully exploit the semantic information of previously generated words.
Firstly, a video annotated with description sentences is sampled into video frames; appearance features are extracted with a pre-trained 2D convolutional neural network and motion features with a pre-trained 3D convolutional neural network. The appearance and motion features are then concatenated and fed into a bidirectional long short-term memory recurrent neural network. Next, a coherent attention mechanism is designed to extract, from the visual features, the coherent visual features corresponding to the word currently being generated. Finally, the extracted visual features are fed into a conventional decoder and a self-learning decoder, respectively, to generate sentences describing the video. Both decoders use long short-term memory recurrent neural networks: at each time step, the conventional decoder receives the visual features extracted by the coherent attention mechanism and the ground-truth word of the previous time step, while the self-learning decoder receives the visual features extracted by the coherent attention mechanism and the semantic features of the word it generated at the previous time step.
The technical scheme adopted by the invention is a video description method based on a coherent attention mechanism and a dual-stream decoder, which specifically comprises the following steps.
A video description method based on a coherent attention mechanism and a dual-stream decoder is characterized in that:
sampling a video containing a description sentence to obtain a video frame, extracting appearance characteristics by using a pre-trained 2D convolutional neural network, and extracting action characteristics by using a pre-trained 3D convolutional neural network;
cascading and inputting the appearance characteristics and the action characteristics into a bidirectional long-short term memory cyclic neural network;
extracting, based on the coherent attention mechanism, the coherent visual features corresponding to the word currently being generated; and inputting the extracted visual features into a conventional decoder and a self-learning decoder, respectively, to generate sentences describing the video.
In the above video description method, the video/sentence pairs are preprocessed as follows: 50 frames are sampled from each video at equal intervals; appearance features are extracted with a pre-trained 2D convolutional neural network (ResNet-152) and motion features with a pre-trained 3D convolutional neural network (ResNeXt-101); punctuation marks are removed from each sentence and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, with longer sentences truncated and shorter sentences padded at the end with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words absent from the word bank are replaced with the <unk> symbol.
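The preprocessing described above can be sketched as follows (a minimal Python illustration; the helper names are ours, while the frequency threshold of 2, the fixed length of 20, and the <pad>/<start>/<end>/<unk> symbols follow the text):

```python
from collections import Counter

MAX_LEN = 20  # fixed sentence length from the method

def build_vocab(sentences, min_freq=2):
    """Build the word bank, dropping words that occur fewer than min_freq times."""
    counts = Counter(w for s in sentences for w in s)
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]
    vocab += sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate(vocab)}

def preprocess(sentence):
    """Lower-case, strip punctuation, add <start>/<end>, fix length to MAX_LEN."""
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    words = ["<start>"] + [w for w in words if w] + ["<end>"]
    words = words[:MAX_LEN]                       # truncate long sentences
    words += ["<pad>"] * (MAX_LEN - len(words))   # pad short sentences
    return words

def encode(words, vocab):
    """Map words to indices, replacing out-of-vocabulary words with <unk>."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]
```

With a one-word vocabulary threshold met only by "a", any rarer word is encoded as <unk>.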
In the above video description method, the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1), whose forward output for the i-th frame is h_i^f and whose backward output is h_i^b.
The forward and backward outputs of each video frame are concatenated as x_i = [h_i^f ; h_i^b], where [· ; ·] denotes concatenation, giving the video feature expression V = {x_1, x_2, …, x_n}.
In the above video description method, the coherent attention mechanism algorithm comprises:
Step 1, the fused features sampled and extracted from the video are expressed as V = {x_1, x_2, …, x_n}, and the weighted visual features obtained by the coherent attention mechanism at the previous time step are c_{t-1} = {α_{1,t-1}x_1, α_{2,t-1}x_2, …, α_{n,t-1}x_n}.
Step 2, a recurrent neural network is used to construct the temporal dependency of each frame after the attention mechanism; for the i-th frame at time step t, the output of the recurrent network is h̃_{i,t} = LSTM2_i(α_{i,t-1}x_i, h̃_{i,t-1}), and parameter sharing among {LSTM2_1, LSTM2_2, …, LSTM2_n} constructs the interaction among video frames.
Step 3, in the conventional attention mechanism, the weighted feature over the video frames is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, i.e. the video features; h_{t-1} is the query, i.e. the hidden state of the previous time step; and f_att is the scoring function that computes the similarity weights. The coherent attention mechanism adds intra-frame and inter-frame coherence information on this basis, expressed as c_t = f_att([V ; H̃_t], h_{t-1}), where [· ; ·] denotes concatenation and H̃_t = {h̃_{1,t}, h̃_{2,t}, …, h̃_{n,t}} is the feature fusing intra-frame and inter-frame information.
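The coherent attention computation can be illustrated with a small pure-Python sketch (dot-product scoring and a simple recurrent averaging update stand in for the learned f_att and the shared LSTM2; all names here are illustrative):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_rnn_step(prev_state, weighted_frame):
    # Stand-in for the shared LSTM2 update that tracks each frame's
    # attention history across time steps (a real model uses an LSTM cell).
    return [0.5 * p + 0.5 * w for p, w in zip(prev_state, weighted_frame)]

def coherent_attention(frames, coh_states, query):
    """Score each frame on its visual feature concatenated with its coherence
    feature, then return the attended feature c_t and the new weights."""
    scores = []
    for x, h in zip(frames, coh_states):
        feat = x + h  # list concatenation plays the role of [x_i ; h~_{i,t}]
        scores.append(sum(f * q for f, q in zip(feat, query)))
    alphas = softmax(scores)
    dim = len(frames[0])
    c_t = [sum(a * x[d] for a, x in zip(alphas, frames)) for d in range(dim)]
    return c_t, alphas
```

After each step, every frame's coherence state would be updated with its newly weighted feature, e.g. `toy_rnn_step(coh_states[i], [alphas[i] * v for v in frames[i]])`.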
In the above video description method, a self-learning decoder is added alongside the conventional decoder in order to exploit the semantic information of previously generated words. Both decoders use a recurrent neural network to generate sentences. The input of the conventional decoder is the visual feature obtained through the attention mechanism and the ground-truth word of the previous time step; the input of the self-learning decoder is the semantic feature of the word it generated at the previous time step, specifically the probability distribution over the whole word bank produced at the previous time step. In the testing stage, the information of the conventional decoder and of the self-learning decoder is fused in a fixed ratio to generate the final word.
In the above video description method, the conventional decoder generates sentences with a long short-term memory recurrent neural network; its input is the visual feature c_t extracted by the coherent attention mechanism and the ground-truth word w_{t-1} of the previous time step, and the output of LSTM3 is expressed as h_t^c = LSTM3([c_t ; e(w_{t-1})], h_{t-1}^c), where e(w_{t-1}) is the embedding of w_{t-1}.
In the above video description method, the self-learning decoder generates sentences; its input is the visual feature extracted by the coherent attention mechanism and the semantic feature of the word generated by the model at the previous time step, and the output of LSTM4 is expressed as h_t^s = LSTM4([c_t ; W_e p_{t-1}], h_{t-1}^s), where p_{t-1} is the semantic feature of the generated word, i.e. the probability distribution over the word bank, c_t is the visual feature extracted by the coherent attention mechanism, and W_e is a trainable mapping parameter.
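The self-learning decoder's input W_e p_{t-1} can be read as an expected word embedding under the previous step's distribution; a minimal sketch (the function name and the row-per-word layout of the embedding matrix are our illustrative choices):

```python
def semantic_feature(prob_dist, embedding_rows):
    """Compute W_e p_{t-1}: the expectation of the word-embedding rows
    under the vocabulary distribution produced at the previous step."""
    dim = len(embedding_rows[0])
    return [sum(p * row[d] for p, row in zip(prob_dist, embedding_rows))
            for d in range(dim)]
```

With a one-hot distribution this reduces to an ordinary embedding lookup; a soft distribution blends the embeddings of all candidate words.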
In the above video description method, the conventional decoder and the self-learning decoder are trained jointly with a cross-entropy loss function, and a training hyper-parameter controls their relative importance. The final loss function is defined as L = L_conv + λ·L_self, where L_conv is the loss function of the conventional decoder, L_self is the loss function of the self-learning decoder, and λ is the training hyper-parameter controlling the importance of the two.
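The joint objective L = L_conv + λ·L_self can be sketched in a few lines of Python (assuming per-step probability vectors and ground-truth word indices; the helper names are ours):

```python
import math

def cross_entropy(step_probs, target_ids):
    """Average negative log-likelihood of the target words over time steps."""
    nll = -sum(math.log(p[t]) for p, t in zip(step_probs, target_ids))
    return nll / len(target_ids)

def joint_loss(conv_probs, self_probs, target_ids, lam=0.5):
    """L = L_conv + lambda * L_self; lam weighs the self-learning stream."""
    return (cross_entropy(conv_probs, target_ids)
            + lam * cross_entropy(self_probs, target_ids))
```

Setting lam=0 recovers the conventional decoder's loss alone, so λ interpolates between single-stream and dual-stream training pressure.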
The invention has the following advantages. Taking video description generation as its object of study, the invention designs a video description generation algorithm based on a coherent attention mechanism and a dual-stream decoder. The coherent attention mechanism exploits the weight information that the attention mechanism assigns to different video frames across time steps, and guides attention to select more coherent visual features according to each frame's temporal dependency across time steps and the interaction among frames, thereby generating coherent sentences. The dual-stream-decoder video description generation algorithm adds a self-learning decoder to the conventional decoder; unlike the conventional decoder, whose input is the words of the ground-truth sentence, the self-learning decoder's input is the probability distribution over the word bank generated by the model at the previous time step, so the semantic information of the words generated by the model is exploited more fully. In the testing stage, the final word is generated by proportionally fusing the information of the two decoders, yielding sentences with more accurate and richer semantics.
Drawings
The invention will now be further described with reference to the following examples.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a network framework of the method of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, they are described in detail below with reference to the accompanying drawings.
The invention mainly comprises the following steps:
Step 1, carrying out data loading pretreatment on a video/sentence pair, including extracting appearance features and action features in a video, and constructing a dictionary through sentences.
And 2, fusing the time sequence information in the video features extracted in the step 1 by utilizing a bidirectional long-short term memory recurrent neural network.
And 3, extracting, with the coherent attention mechanism, the coherent visual features corresponding to the word generated at the current time step.
And 4, inputting the video features extracted in step 3 together with the ground-truth word of the previous time step into the conventional decoder to generate sentences; the network used by this decoder is an LSTM.
And 5, inputting the video features extracted in step 3 together with the semantic features of the word generated by the model at the previous time step into the self-learning decoder to generate sentences; the network used by this decoder is an LSTM.
And 6, jointly training the traditional decoder and the self-learning decoder by using a cross entropy loss function, and introducing training hyperparameters to control the importance degrees of the traditional decoder and the self-learning decoder.
And 7, in the testing stage, the information of the traditional decoder and the self-learning decoder is combined in a proportional fusion mode to generate a final sentence.
Referring to fig. 1 and fig. 2, the steps of the video description method based on the coherent attention mechanism and the dual-stream decoder according to the present invention are as follows:
Step 1, preprocessing the video/sentence pairs: 50 frames are sampled from each video at equal intervals; appearance features are extracted with a pre-trained 2D convolutional neural network (ResNet-152) and motion features with a pre-trained 3D convolutional neural network (ResNeXt-101); punctuation marks are removed from each sentence and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, with longer sentences truncated and shorter sentences padded at the end with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words absent from the word bank are replaced with the <unk> symbol.
Step 2, the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1), whose forward output for the i-th frame is h_i^f and whose backward output is h_i^b.
The forward and backward outputs of each video frame are concatenated as x_i = [h_i^f ; h_i^b], where [· ; ·] denotes concatenation, giving the video feature expression V = {x_1, x_2, …, x_n}.
Step 3, extracting, with the coherent attention mechanism, the visual features corresponding to the word generated at the current time step. The visual feature extracted by conventional attention is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, here the video features; h_{t-1} is the query, here the hidden state of the previous time step; and f_att is the scoring function that computes the similarity weights.
The coherent attention mechanism adds, on top of the conventional attention mechanism, a coherence feature fusing global attention weight information, and its output is c_t = f_att([V ; H̃_t], h_{t-1}), where H̃_t = {h̃_{1,t}, h̃_{2,t}, …, h̃_{n,t}} is the coherence feature, with h̃_{i,t} = LSTM2_i(α_{i,t-1}x_i, h̃_{i,t-1}) and parameters shared among {LSTM2_1, LSTM2_2, …, LSTM2_n}; α_{i,t-1} is the weight assigned to the i-th frame by the attention mechanism at time step t-1.
Step 4, the conventional decoder generates sentences with a long short-term memory recurrent neural network; its input is the visual feature c_t extracted by the coherent attention mechanism and the ground-truth word w_{t-1} of the previous time step, and the output of LSTM3 is expressed as h_t^c = LSTM3([c_t ; e(w_{t-1})], h_{t-1}^c), where e(w_{t-1}) is the embedding of w_{t-1}.
Step 5, the self-learning decoder generates sentences; its input is the visual feature extracted by the coherent attention mechanism and the semantic feature of the word generated by the model at the previous time step, and the output of LSTM4 is expressed as h_t^s = LSTM4([c_t ; W_e p_{t-1}], h_{t-1}^s), where p_{t-1} is the semantic feature of the generated word, i.e. the probability distribution over the word bank, c_t is the visual feature extracted by the coherent attention mechanism, and W_e is a trainable mapping parameter.
Step 6, the conventional decoder and the self-learning decoder are trained jointly with a cross-entropy loss function, and a training hyper-parameter is introduced to control their relative importance; the final loss function is defined as L = L_conv + λ·L_self, where L_conv is the loss function of the conventional decoder, L_self is the loss function of the self-learning decoder, and λ is the training hyper-parameter controlling the importance of the two.
And 7, in the testing stage, the final sentence is generated by proportionally fusing the information of the conventional decoder and the self-learning decoder. At time step t, the semantic feature of the word is expressed as s_t = γ·s_t^c + (1-γ)·s_t^s, where s_t^c is the information extracted from the conventional decoder, s_t^s is the information extracted from the self-learning decoder, and γ is the test hyper-parameter controlling the fusion ratio of the two.
And 8, mapping the semantic word feature obtained in step 7 to probability values distributed over the word bank, taking the word at the index with the maximum probability as the word generated at the current time step, and repeating until the <end> symbol is generated.
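Steps 7 and 8 — proportional fusion followed by arg-max decoding until <end> — can be sketched as follows (the convex-combination form of the fusion with weight γ, and all function names, are our assumptions):

```python
def fuse(conv_dist, self_dist, gamma=0.7):
    # Proportional fusion of the two decoders' vocabulary distributions;
    # gamma is the test-time ratio controlling the two streams.
    return [gamma * c + (1.0 - gamma) * s for c, s in zip(conv_dist, self_dist)]

def greedy_decode(step_fn, vocab, max_len=20):
    """Pick the arg-max word each step; stop at <end> (or after max_len steps).
    step_fn(prev_word) returns the two decoders' distributions for this step."""
    inv = {i: w for w, i in vocab.items()}
    words, word = [], "<start>"
    for _ in range(max_len):
        dist = fuse(*step_fn(word))  # fused distribution for this time step
        word = inv[max(range(len(dist)), key=dist.__getitem__)]
        if word == "<end>":
            break
        words.append(word)
    return words
```

In a full model, `step_fn` would run LSTM3 and LSTM4 for one step and softmax their outputs; here it is a stub supplying precomputed distributions.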
The technical solutions of the coherent attention mechanism algorithm and the dual-stream-decoder video description generation algorithm of the present invention have thus been described in detail with reference to the accompanying drawings. The proposed coherent attention mechanism algorithm extracts more coherent visual features, and the dual-stream-decoder video description generation algorithm makes fuller use of the semantic information of previously generated words.
It should be understood by those skilled in the art that the scope of the present invention is not limited to these specific embodiments: the proposed coherent attention mechanism can be applied to different types of attention mechanisms, and the dual-stream decoder is not limited to long short-term memory recurrent neural networks but remains effective when applied to other network structures that model sequence dependencies (such as the Transformer). Equivalent changes or substitutions of the related technical features may be made by those skilled in the art without departing from the principle of the invention, and the resulting technical schemes fall within the protection scope of the invention.
Claims (8)
1. A video description method based on a coherent attention mechanism and a dual-stream decoder, characterized in that:
sampling a video containing a description sentence to obtain a video frame, extracting appearance characteristics by using a pre-trained 2D convolutional neural network, and extracting action characteristics by using a pre-trained 3D convolutional neural network;
cascading and inputting the appearance characteristics and the action characteristics into a bidirectional long-short term memory cyclic neural network;
extracting, based on the coherent attention mechanism, the coherent visual features corresponding to the word currently being generated; and inputting the extracted visual features into a conventional decoder and a self-learning decoder, respectively, to generate sentences describing the video.
2. The video description method based on the coherent attention mechanism and the dual-stream decoder as claimed in claim 1, wherein the video/sentence pairs are preprocessed as follows: 50 frames are sampled from each video at equal intervals; appearance features are extracted with a pre-trained 2D convolutional neural network (ResNet-152) and motion features with a pre-trained 3D convolutional neural network (ResNeXt-101); punctuation marks are removed from each sentence and all letters are converted to lower case; word frequencies are counted and words occurring fewer than 2 times are removed to construct the word bank; the sentence length is fixed at 20, with longer sentences truncated and shorter sentences padded at the end with <pad> symbols; a <start> symbol is added at the beginning of each sentence and an <end> symbol at the end; and words absent from the word bank are replaced with the <unk> symbol.
3. The video description method based on the coherent attention mechanism and the dual-stream decoder as claimed in claim 1, wherein the 2D appearance features and 3D motion features extracted by the pre-trained networks are concatenated and fed into a bidirectional long short-term memory recurrent neural network (LSTM1), whose forward output for the i-th frame is h_i^f and whose backward output is h_i^b, concatenated as x_i = [h_i^f ; h_i^b], where [· ; ·] denotes concatenation;
4. The video description method based on the coherent attention mechanism and the dual-stream decoder as claimed in claim 1, wherein the coherent attention mechanism algorithm comprises:
Step 1, the fused features sampled and extracted from the video are expressed as V = {x_1, x_2, …, x_n}, and the weighted visual features obtained by the coherent attention mechanism at the previous time step are c_{t-1} = {α_{1,t-1}x_1, α_{2,t-1}x_2, …, α_{n,t-1}x_n};
Step 2, a recurrent neural network is used to construct the temporal dependency of each frame after the attention mechanism; for the i-th frame at time step t, the output of the recurrent network is h̃_{i,t} = LSTM2_i(α_{i,t-1}x_i, h̃_{i,t-1}), and parameter sharing among {LSTM2_1, LSTM2_2, …, LSTM2_n} constructs the interaction among video frames;
Step 3, in the conventional attention mechanism, the weighted feature over the video frames is expressed as c_t = f_att(V, h_{t-1}), where V is the Value in the attention mechanism, i.e. the video features, h_{t-1} is the query, i.e. the hidden state of the previous time step, and f_att is the scoring function that computes the similarity weights; the coherent attention mechanism adds intra-frame and inter-frame coherence information on this basis, expressed as c_t = f_att([V ; H̃_t], h_{t-1}), where [· ; ·] denotes concatenation and H̃_t = {h̃_{1,t}, h̃_{2,t}, …, h̃_{n,t}} is the feature fusing intra-frame and inter-frame information.
5. The video description method based on the coherent attention mechanism and the dual-stream decoder as claimed in claim 1, wherein a self-learning decoder is added alongside the conventional decoder to exploit the semantic information of previously generated words; both decoders use a recurrent neural network to generate sentences; the input of the conventional decoder is the visual feature obtained through the attention mechanism and the ground-truth word of the previous time step; the input of the self-learning decoder is the semantic feature of the word generated at the previous time step, specifically the probability distribution over the whole word bank produced at the previous time step; and in the testing stage the information of the conventional decoder and of the self-learning decoder is fused in a fixed ratio to generate the final word.
6. The method as claimed in claim 1, wherein the conventional decoder generates sentences with a long short-term memory recurrent neural network; its input is the visual feature c_t extracted by the coherent attention mechanism and the ground-truth word w_{t-1} of the previous time step, and the output of LSTM3 is expressed as h_t^c = LSTM3([c_t ; e(w_{t-1})], h_{t-1}^c), where e(w_{t-1}) is the embedding of w_{t-1}.
7. The method as claimed in claim 1, wherein the self-learning decoder generates sentences; its input is the visual feature extracted by the coherent attention mechanism and the semantic feature of the word generated by the model at the previous time step, and the output of LSTM4 is expressed as h_t^s = LSTM4([c_t ; W_e p_{t-1}], h_{t-1}^s), where p_{t-1} is the semantic feature of the generated word, i.e. the probability distribution over the word bank, c_t is the visual feature extracted by the coherent attention mechanism, and W_e is a trainable mapping parameter.
8. The method as claimed in claim 1, wherein a cross-entropy loss function is used to jointly train the conventional decoder and the self-learning decoder, a training hyper-parameter is introduced to control their relative importance, and the final loss function is defined as L = L_conv + λ·L_self, where L_conv is the loss function of the conventional decoder, L_self is the loss function of the self-learning decoder, and λ is the training hyper-parameter controlling the importance of the two.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211060223.8A CN115346158A (en) | 2022-08-31 | 2022-08-31 | Video description method based on coherence attention mechanism and double-stream decoder |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211060223.8A CN115346158A (en) | 2022-08-31 | 2022-08-31 | Video description method based on coherence attention mechanism and double-stream decoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115346158A true CN115346158A (en) | 2022-11-15 |
Family
ID=83955002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211060223.8A Pending CN115346158A (en) | 2022-08-31 | 2022-08-31 | Video description method based on coherence attention mechanism and double-stream decoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115346158A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116089654A (en) * | 2023-04-07 | 2023-05-09 | 杭州东上智能科技有限公司 | Audio supervision-based transferable audio-visual text generation method and system |
CN116089654B (en) * | 2023-04-07 | 2023-07-07 | 杭州东上智能科技有限公司 | Audio supervision-based transferable audio-visual text generation method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110598221B (en) | Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network | |
CN111897949B (en) | Guided text abstract generation method based on Transformer | |
CN108416065B (en) | Hierarchical neural network-based image-sentence description generation system and method | |
CN110738057B (en) | Text style migration method based on grammar constraint and language model | |
CN109492227A (en) | It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations | |
CN109472024A (en) | A kind of file classification method based on bidirectional circulating attention neural network | |
CN112115247B (en) | Personalized dialogue generation method and system based on long-short-time memory information | |
CN110929030A (en) | Text abstract and emotion classification combined training method | |
CN111078866B (en) | Chinese text abstract generation method based on sequence-to-sequence model | |
CN110083710A (en) | It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure | |
CN112183058B (en) | Poetry generation method and device based on BERT sentence vector input | |
CN111581383A (en) | Chinese text classification method based on ERNIE-BiGRU | |
CN110738062A (en) | GRU neural network Mongolian Chinese machine translation method | |
CN110851575B (en) | Dialogue generating system and dialogue realizing method | |
CN112309528B (en) | Medical image report generation method based on visual question-answering method | |
CN109992775A (en) | A kind of text snippet generation method based on high-level semantics | |
CN113657123A (en) | Mongolian aspect level emotion analysis method based on target template guidance and relation head coding | |
CN111125333A (en) | Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism | |
CN114708474A (en) | Image semantic understanding algorithm fusing local and global features | |
CN115346158A (en) | Video description method based on coherence attention mechanism and double-stream decoder | |
Mathur et al. | A scaled‐down neural conversational model for chatbots | |
CN113360601A (en) | PGN-GAN text abstract model fusing topics | |
CN115840815A (en) | Automatic abstract generation method based on pointer key information | |
CN112989845B (en) | Chapter-level neural machine translation method and system based on routing algorithm | |
CN115719072A (en) | Chapter-level neural machine translation method and system based on mask mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |