CN110866510A - Video description system and method based on key frame detection - Google Patents
- Publication number
- CN110866510A CN110866510A CN201911145738.6A CN201911145738A CN110866510A CN 110866510 A CN110866510 A CN 110866510A CN 201911145738 A CN201911145738 A CN 201911145738A CN 110866510 A CN110866510 A CN 110866510A
- Authority
- CN
- China
- Prior art keywords
- video
- network
- description
- key frame
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video description system and method based on key frame detection. The system comprises a sampling module, a key frame selection network, and a video frame description network. The invention also relates to a video description method based on key frame detection, comprising the following steps: S1, extracting video frames from the video to be described by equal-interval sampling; S2, selecting key frames containing distinct information from the extracted video frames with the key frame selection network; and S3, sending the selected key frames into the video frame description network to generate a description text. By adding the key frame selection network before the video frame description network, all video frames are first screened so that only key frames containing distinct information are kept. Most redundant video frames are excluded in this process, which greatly reduces the workload of the video frame description network, suppresses the generation of redundant information, reduces noise interference, and improves the processing efficiency of the system.
Description
Technical Field
The invention relates to the technical field of video processing, and in particular to a video description system and method based on key frame detection.
Background
The video description task amounts to translating video content into a passage of natural language. Early video description methods solved the task bottom-up: a set of sentence templates was predefined, the words composing a sentence were classified by part of speech, descriptive words for the images were obtained through attribute learning, object recognition, and similar techniques, and a language model then combined the predicted words to fit the predefined templates. This approach is also known as the S-V-O (subject-verb-object) method. With the development of neural networks and deep learning, current video description methods are built on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) and adopt an encoder-decoder structure: the video content is first encoded into a global representation vector, which a decoder then translates into natural language. One active branch of this encoding-decoding framework weights the input features with an attention mechanism, learning to highlight objects automatically. In image description, attention is usually directed at spatial regions; in video description, attention usually operates along the time dimension, automatically focusing on the most salient frames when generating each word of the output sequence.
Existing models typically sample a fixed number of video frames at equal intervals during the encoding stage, which selects many frames with repetitive, redundant visual information. The decoding RNN then automatically selects the most relevant temporal segments using the local temporal structure of the video, the global temporal structure, or both. This process incurs significant computational overhead: for a medium-sized deep classification model, extracting the visual features of a single frame requires millions of floating-point operations, so the computation spent is largely wasted relative to the benefit obtained. Moreover, frames are chosen only by naive sampling, with no content-aware selection. Since the events in adjacent seconds of video change little over time, the temporal redundancy between adjacent frames is never resolved, and there is no guarantee that equally spaced frames carry meaningful information. Such redundancy and noise can make the model overly sensitive to noise and prone to overfitting the video content.
In summary, attention-based methods, especially temporal attention, sample frames at equal intervals and operate under the assumption that the entire video content has been observed, which is impractical in some real applications.
Disclosure of Invention
To address these drawbacks, the invention aims to provide a video description system based on key frame detection, and a corresponding video description method, both with a small computational cost.
The technical scheme adopted by the invention is as follows:
a video description system based on key frame detection comprises a sampling module, a key frame selection network and a video frame description network, wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
Specifically, the key frame selection network is built on a convolutional neural network; the video frame description network is based on an encoder-decoder structure, where the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.
The invention also relates to a video description method based on key frame detection, which comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network;
and S3, sending the screened key frames into a video frame description network to generate a description text.
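The three steps above can be sketched as a single pipeline. This is an illustrative sketch only, not the patented implementation: `select_key_frames` and `describe` are hypothetical placeholders standing in for the two trained networks.

```python
def describe_video(frames, interval, select_key_frames, describe):
    """Sketch of steps S1-S3: sample, screen, then caption.

    select_key_frames and describe are stand-ins for the trained
    key frame selection network and video frame description network.
    """
    sampled = frames[::interval]         # S1: equal-interval sampling
    keys = select_key_frames(sampled)    # S2: keep frames with distinct info
    return describe(keys)                # S3: generate the description text
```

With stub callables, `describe_video(list(range(20)), 4, ...)` first reduces the 20 frames to the 5 equally spaced ones before any further screening happens.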
Preferably, the key frame selection network of the present invention is based on a convolutional neural network, and the key frame screening step includes:
s21, sequentially sending all the video frames into a key frame selection network, and obtaining the feature vectors corresponding to the video frames after convolution processing;
S22, comparing the feature vector of the current video frame with the feature vector of the reference video frame from the previous comparison to obtain a difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and becomes the reference video frame for the next comparison; when the non-difference probability is larger, the current video frame is discarded and the existing reference video frame is kept for the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
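The screening loop of steps S21-S24 can be sketched as follows. This is a sketch under stated assumptions, not the exact patented procedure: `classify` is a placeholder for the binary classification network, the feature vectors stand in for the convolutional features of step S21, and the first frame is assumed to be kept as the initial reference, a detail the text leaves open.

```python
def select_key_frames(features, classify):
    """Key-frame screening loop of steps S21-S24 (illustrative sketch).

    features: per-frame feature vectors in temporal order (step S21
              would obtain these from a convolutional network).
    classify: stand-in for the binary classification network of step
              S23; maps a difference vector to (p_diff, p_same).
    Returns the indices of the retained key frames.
    """
    if not features:
        return []
    keys = [0]           # the first frame starts as the reference frame
    ref = features[0]
    for i in range(1, len(features)):
        # Step S22: difference vector between current and reference frame
        diff = [a - b for a, b in zip(features[i], ref)]
        p_diff, p_same = classify(diff)
        if p_diff > p_same:
            keys.append(i)       # keep as key frame (step S23) ...
            ref = features[i]    # ... and use it as the next reference
        # otherwise discard the frame; the reference is unchanged
    return keys
```

Because the reference frame only advances when a frame is kept, a run of near-identical frames collapses to a single key frame, which is exactly the redundancy removal the method aims for.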
Preferably, the video frame description network is based on an encoder-decoder structure: the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism. The video frame description step includes:
sending the key frames into the video frame description network: a convolutional neural network first produces a feature vector for each key frame; the feature vectors are then fed into a recurrent neural network to obtain a global representation vector of the video; finally, the global representation vector is decoded to obtain a word probability distribution at each time step, the word with the largest probability is selected as the candidate word, and the description text of the video is generated from the candidate words.
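The final decoding rule, selecting the highest-probability word at each time step, is plain greedy decoding. A minimal sketch, assuming the per-step probability distributions have already been produced by the decoder:

```python
def greedy_decode(step_probs, vocab):
    """Pick the most probable word at each time step (greedy search).

    step_probs: one probability distribution over the vocabulary per
                time step (assumed to come from the decoder).
    vocab:      index-to-word list.
    """
    words = []
    for probs in step_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        words.append(vocab[best])
    return " ".join(words)
```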
Preferably, the establishment of the video frame selection network and the video frame description network according to the present invention comprises the following steps:
building the network structure: the video frame selection network is built on a convolutional neural network, and the video frame description network on an encoder-decoder structure, where the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism;
Acquiring original data: extracting video frames from the acquired video to be described in an equally-spaced sampling mode, manually marking each video frame, and simultaneously dividing the video frames into a training set and a test set;
making a word list: using NLTK to filter and tokenize the manual annotations of the video frames, and building a vocabulary;
pre-training the video frame description network: the network is pre-trained with a cross-entropy loss, computing the cross entropy between each generated description and its ground-truth annotation and taking the sum as the total loss;
training the key frame selection network: with the pre-trained video frame description network serving as the environment, the key frame selection network is trained with a reinforcement learning algorithm.
Performing combined training: and performing joint training on the key frame selection network and the video frame description network.
Preferably, pre-training the video frame description network according to the present invention comprises the following steps:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
acquiring a feature vector of the video frame through a convolutional neural network;
sending the feature vectors of the video frames into a recurrent neural network to obtain a global representation vector of the video;
sending the global representation vector into the decoder to obtain a word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
supervised learning is performed based on the candidate words and the manually established labels.
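The pre-training loss described above, the sum over time steps of the cross entropy between the predicted word distribution and the ground-truth word, can be written in a few lines. A minimal sketch, not the exact implementation:

```python
import math

def caption_loss(step_probs, target_ids):
    """Total pre-training loss: the sum over time steps of the cross
    entropy between the predicted word distribution and the ground-truth
    word, i.e. sum over t of -log p_t(w_t)."""
    return sum(-math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
```

A perfectly confident correct prediction contributes zero loss; the loss grows as probability mass is placed on the wrong words.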
Preferably, the key frame selection network of the present invention is trained on the basis of the pre-trained video frame description network, and the step of training the key frame selection network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
sending the video frames into the key frame selection network to screen out key frames, in combination with an evaluation system;
sending the screened key frames into a trained video frame description network to obtain candidate words;
and the evaluation system rewards and optimizes the key frame selection network based on how well the candidate words produced by the video frame description network match the manual annotations.
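The reward-driven update can be illustrated with one REINFORCE step for a per-frame Bernoulli keep/discard policy. This is a generic reinforcement-learning sketch under an assumed parameterization (independent per-frame logits), not the patent's exact training procedure; the reward would come from the evaluation system's caption score.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reinforce_step(logits, actions, reward, baseline=0.0, lr=0.5):
    """One REINFORCE update for a per-frame Bernoulli keep/discard policy.

    logits:  per-frame policy parameters; keep probability = sigmoid(logit)
    actions: sampled 0/1 keep decisions of one episode
    reward:  caption-quality score returned by the evaluation system
    For a Bernoulli policy the gradient of log pi(a) w.r.t. the logit is
    (a - sigmoid(logit)), so decisions taken in high-reward episodes
    become more likely.
    """
    advantage = reward - baseline
    return [theta + lr * advantage * (a - sigmoid(theta))
            for theta, a in zip(logits, actions)]
```

After an episode with positive reward, the logit of a kept frame rises (keeping it becomes more likely) and the logit of a discarded frame falls.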
The invention has the following advantages:
1. A key frame selection network is added before the video frame description network: all video frames are first screened by the selection network so that only key frames containing distinct information remain, which removes most repetitive, redundant video frames before they reach the description network. This greatly reduces the workload of the video frame description network, suppresses the generation of redundant information, reduces noise interference, and improves the processing efficiency of the system;
2. The key frame selection network is independent of the video frame description network and can be enabled or bypassed according to actual usage conditions, making the system more flexible;
3. The method of the invention uses the key frame selection network to ignore frames with similar content and keep frames that differ substantially, thereby eliminating redundancy, minimizing the amount of computation, reducing noise interference, preventing overfitting, and yielding accurate description results.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described here show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of the video description method according to the present invention.
Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments so that those skilled in the art can better understand and implement it. The embodiments, however, are not to be construed as limiting the present invention, and the embodiments and their technical features may be combined with one another provided no conflict arises.
It should be understood that the terms "first", "second", and the like in the description of the embodiments are used only to distinguish between descriptions and do not imply any sequential or chronological order. "Plurality" in the embodiments of the present invention means two or more.
The term "and/or" in the embodiments of the present invention merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, B alone, or both A and B. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
Example one
The embodiment provides a video description system based on key frame detection, comprising a sampling module, a key frame selection network, and a video frame description network. The key frame selection network is built on a convolutional neural network; the video frame description network is based on an encoder-decoder structure, where the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism. Wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
Example two
The embodiment provides a video description method based on key frame detection, which comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network; the key frame selection network is established based on a convolutional neural network, and specifically, the key frame screening comprises the following steps:
s21, sequentially sending all the video frames into a key frame selection network, and obtaining the feature vectors corresponding to the video frames after convolution processing;
S22, comparing the feature vector of the current video frame with the feature vector of the reference video frame from the previous comparison to obtain a difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and becomes the reference video frame for the next comparison; when the non-difference probability is larger, the current video frame is discarded and the existing reference video frame is kept for the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
And S3, sending the screened key frames into the video frame description network to generate a description text. The video frame description network is based on an encoder-decoder structure: the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism. Specifically, the video frame description step includes: sending the key frames into the video frame description network, obtaining a feature vector of each key frame through the convolutional neural network, feeding the feature vectors into the recurrent neural network to obtain a global representation vector of the video, decoding the global representation vector to obtain a word probability distribution at each time step, selecting the word with the largest probability as the candidate word, and thereby generating the description text of the video.
The establishment of the video frame selection network and the video frame description network in this embodiment includes the following steps:
S1, building the network structure: the video frame selection network is built on a convolutional neural network, and the video frame description network on an encoder-decoder structure, where the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.
S2, acquiring original data: extracting video frames from the acquired video to be described in an equally-spaced sampling mode, manually marking each video frame, and simultaneously dividing the video frames into a training set and a test set;
S3, building a vocabulary: using NLTK to filter and tokenize the manual annotations of the video frames, and building a word list;
S4, pre-training the video frame description network: the network is pre-trained with a cross-entropy loss, computing the cross entropy between each generated description and its ground-truth annotation and taking the sum as the total loss. Specifically, the step of pre-training the video frame description network includes:
s41, extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
s42, acquiring the feature vector of the video frame through a convolutional neural network;
S43, sending the feature vectors of the video frames into a recurrent neural network to obtain a global representation vector of the video;
S44, sending the global representation vector into the decoder to obtain a word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
and S45, performing supervised learning based on the candidate words and the manually established labels.
S5, training the key frame selection network: the network training key frame selection network is described based on the pre-trained video frames, and specifically, the network training key frame selection network comprises the following steps:
s51, extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
S52, sending the video frames into the key frame selection network to screen out key frames, in combination with an evaluation system;
s53, sending the screened key frames into a trained video frame description network to obtain candidate words;
and S54, the evaluation system rewards and optimizes the key frame selection network based on how well the candidate words produced by the video frame description network match the manual annotations.
S6, joint training: and performing joint training on the key frame selection network and the video frame description network. After two stages of video frame description network pre-training and fixed video frame description network training key frame selection network, the video frame description network and the key frame selection network are both well pre-trained, but because the video frame description network uses all sampled video frames as input during training, and only part of video frames are selected to be sent into the video frame description network after the video frame description network and the key frame selection network are added, a difference exists between the video frame description network and the key frame description network, and the key frame description network and the video frame description network are combined through joint training. In each iteration, the selection key frame is passed forward, and when the codec is trained, the video frame selection is treated as a fixed selection, and the backward propagation and enhancement gradient update are performed normally.
The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and its scope is not limited to them. Equivalent substitutions or modifications made by those skilled in the art on the basis of the invention all fall within its protection scope, which is defined by the claims.
Claims (8)
1. A video description system based on key frame detection, characterized by: the system comprises a sampling module, a key frame selection network and a video frame description network, wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
2. The video description system based on key frame detection according to claim 1, characterized in that: the key frame selection network is built on a convolutional neural network; the video frame description network is based on an encoder-decoder structure, where the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.
3. A video description method based on key frame detection is characterized in that: the method comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network;
and S3, sending the screened key frames into a video frame description network to generate a description text.
4. The video description method based on key frame detection according to claim 3, characterized in that: the key frame selection network is based on a convolutional neural network, and the key frame screening step comprises:
s21, sequentially sending all the video frames into a key frame selection network, and obtaining the feature vectors corresponding to the video frames after convolution processing;
S22, comparing the feature vector of the current video frame with the feature vector of the reference video frame from the previous comparison to obtain a difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and becomes the reference video frame for the next comparison; when the non-difference probability is larger, the current video frame is discarded and the existing reference video frame is kept for the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
5. The video description method based on key frame detection according to claim 4, characterized in that: the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, the decoder uses a bidirectional LSTM combined with an attention mechanism, and the video frame description step comprises:
sending the key frames into the video frame description network: a convolutional neural network first produces a feature vector for each key frame; the feature vectors are then fed into a recurrent neural network to obtain a global representation vector of the video; finally, the global representation vector is decoded to obtain a word probability distribution at each time step, the word with the largest probability is selected as the candidate word, and the description text of the video is generated from the candidate words.
6. The video description method based on key frame detection according to claim 5, characterized in that: the establishment of the video frame selection network and the video frame description network comprises the following steps:
building the network structure: the video frame selection network is built on a convolutional neural network, and the video frame description network on an encoder-decoder structure, where the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism;
Acquiring original data: extracting video frames from the acquired video to be described in an equally-spaced sampling mode, manually marking each video frame, and simultaneously dividing the video frames into a training set and a test set;
making a word list: using NLTK to filter and tokenize the manual annotations of the video frames, and building a vocabulary;
pre-training the video frame description network: the network is pre-trained with a cross-entropy loss, computing the cross entropy between each generated description and its ground-truth annotation and taking the sum as the total loss;
training the key frame selection network: with the pre-trained video frame description network serving as the environment, the key frame selection network is trained with a reinforcement learning algorithm.
Performing combined training: and performing joint training on the key frame selection network and the video frame description network.
7. The video description method based on key frame detection according to claim 6, characterized in that: the step of pre-training the video frame description network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
acquiring a feature vector of the video frame through a convolutional neural network;
sending the feature vectors of the video frames into a recurrent neural network to obtain a global representation vector of the video;
sending the global representation vector into the decoder to obtain a word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
supervised learning is performed based on the candidate words and the manually established labels.
8. The video description method based on key frame detection according to claim 7, characterized in that: the key frame selection network is trained on the basis of the pre-trained video frame description network, and the step of training the key frame selection network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
sending the video frames into a key frame selection network to screen out key frames, and combining an evaluation system;
sending the screened key frames into a trained video frame description network to obtain candidate words;
and the evaluation system carries out reward optimization on the key frame selection network based on the matching degree of the candidate words and the artificial labels in the video frame description network.
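The reinforcement-learning step in claim 8 can be sketched with a REINFORCE-style update, where the frozen description network plus evaluation system act as the environment and the match score is the reward. Everything here is an illustrative assumption: the per-frame Bernoulli selection policy, the identity features, and the toy reward are not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_key_frames(frame_feats, w):
    """Selection policy: per-frame keep probability via a sigmoid,
    then sample a keep/drop decision for each frame."""
    logits = frame_feats @ w
    probs = 1.0 / (1.0 + np.exp(-logits))
    actions = (rng.random(len(probs)) < probs).astype(float)
    return actions, probs

def reinforce_update(frame_feats, w, reward_fn, lr=0.1, baseline=0.0):
    """One REINFORCE step: the environment (description network +
    evaluation system) returns a reward for the sampled selection; the
    policy is nudged along the log-probability gradient, scaled by
    (reward - baseline)."""
    actions, probs = select_key_frames(frame_feats, w)
    reward = reward_fn(actions)
    grad = frame_feats.T @ (actions - probs)  # grad of log p for Bernoulli choices
    return w + lr * (reward - baseline) * grad, reward

# toy environment: reward is the fraction of frames whose keep/drop
# decision matches a fixed target (stands in for the caption match score)
def toy_reward(actions):
    target = np.array([1.0, 0.0, 1.0, 0.0])
    return float((actions == target).mean())

feats = np.eye(4)   # one indicator feature per frame, for illustration
w = np.zeros(4)
for _ in range(200):
    w, r = reinforce_update(feats, w, toy_reward, baseline=0.5)
```

In the method itself, the reward would come from scoring the description network's candidate words against the manual labels (e.g. a captioning metric), and the joint-training stage would then fine-tune both networks together.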
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911145738.6A CN110866510A (en) | 2019-11-21 | 2019-11-21 | Video description system and method based on key frame detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110866510A true CN110866510A (en) | 2020-03-06 |
Family
ID=69655367
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911145738.6A Pending CN110866510A (en) | 2019-11-21 | 2019-11-21 | Video description system and method based on key frame detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110866510A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180144208A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
CN109409221A (en) * | 2018-09-20 | 2019-03-01 | 中国科学院计算技术研究所 | Video content description method and system based on frame selection |
CN109559799A (en) * | 2018-10-12 | 2019-04-02 | 华南理工大学 | The construction method and the model of medical image semantic description method, descriptive model |
CN110418210A (en) * | 2019-07-12 | 2019-11-05 | 东南大学 | A kind of video presentation generation method exported based on bidirectional circulating neural network and depth |
Non-Patent Citations (2)
Title |
---|
YANGYU CHEN ET AL.: "Less Is More: Picking Informative Frames for Video Captioning", 《ECCV 2018: COMPUTER VISION – ECCV 2018》 * |
JI Zhong et al.: "Video summarization based on decoder attention mechanism" (in Chinese), 《Journal of Tianjin University (Science and Technology)》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111556377A (en) * | 2020-04-24 | 2020-08-18 | 珠海横琴电享科技有限公司 | Short video labeling method based on machine learning |
CN111259874B (en) * | 2020-05-06 | 2020-07-28 | 成都派沃智通科技有限公司 | Campus security video monitoring method based on deep learning |
CN111259874A (en) * | 2020-05-06 | 2020-06-09 | 成都派沃智通科技有限公司 | Campus security video monitoring method based on deep learning |
CN112949501B (en) * | 2021-03-03 | 2023-12-08 | 安徽省科亿信息科技有限公司 | Method for learning availability of object from teaching video |
CN112949501A (en) * | 2021-03-03 | 2021-06-11 | 安徽省科亿信息科技有限公司 | Method for learning object availability from teaching video |
CN113792183A (en) * | 2021-09-17 | 2021-12-14 | 咪咕数字传媒有限公司 | Text generation method and device and computing equipment |
CN113792183B (en) * | 2021-09-17 | 2023-09-08 | 咪咕数字传媒有限公司 | Text generation method and device and computing equipment |
CN114786052A (en) * | 2022-04-29 | 2022-07-22 | 同方知网数字出版技术股份有限公司 | Academic live video fast browsing method based on key frame extraction |
CN115018840A (en) * | 2022-08-08 | 2022-09-06 | 珠海市南特金属科技股份有限公司 | Method, system and device for detecting cracks of precision casting |
CN115018840B (en) * | 2022-08-08 | 2022-11-18 | 珠海市南特金属科技股份有限公司 | Method, system and device for detecting cracks of precision casting |
WO2024063571A1 (en) * | 2022-09-22 | 2024-03-28 | Samsung Electronics Co., Ltd. | Method and apparatus for vision-language understanding |
CN115495615A (en) * | 2022-11-15 | 2022-12-20 | 浪潮电子信息产业股份有限公司 | Method, device, equipment, storage medium and terminal for mutual detection of video and text |
CN115495615B (en) * | 2022-11-15 | 2023-02-28 | 浪潮电子信息产业股份有限公司 | Method, device, equipment, storage medium and terminal for mutual detection of video and text |
CN117177006A (en) * | 2023-09-01 | 2023-12-05 | 湖南广播影视集团有限公司 | CNN algorithm-based short video intelligent manufacturing method |
CN117177006B (en) * | 2023-09-01 | 2024-07-16 | 湖南广播影视集团有限公司 | CNN algorithm-based short video intelligent manufacturing method |
CN117809218A (en) * | 2023-12-29 | 2024-04-02 | 浙江博观瑞思科技有限公司 | Electronic shop descriptive video processing system and method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866510A (en) | Video description system and method based on key frame detection | |
CN111488807B (en) | Video description generation system based on graph rolling network | |
Perarnau et al. | Invertible conditional gans for image editing | |
EP3745305B1 (en) | Video description generation method and device, video playing method and device, and storage medium | |
CN105183720B (en) | Machine translation method and device based on RNN model | |
CN111159454A (en) | Picture description generation method and system based on Actor-Critic generation type countermeasure network | |
CN113011202B (en) | End-to-end image text translation method, system and device based on multitasking training | |
EP3885966B1 (en) | Method and device for generating natural language description information | |
CN110688927B (en) | Video action detection method based on time sequence convolution modeling | |
US20230244704A1 (en) | Sequenced data processing method and device, and text processing method and device | |
CN110083702B (en) | Aspect level text emotion conversion method based on multi-task learning | |
CN108563622B (en) | Absolute sentence generation method and device with style diversity | |
CN108538283B (en) | Method for converting lip image characteristics into voice coding parameters | |
CN113657115B (en) | Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion | |
CN114339450B (en) | Video comment generation method, system, device and storage medium | |
CN113035311A (en) | Medical image report automatic generation method based on multi-mode attention mechanism | |
Bilkhu et al. | Attention is all you need for videos: Self-attention based video summarization using universal transformers | |
CN114373028A (en) | Method and device for generating picture and electronic equipment | |
CN111340006B (en) | Sign language recognition method and system | |
CN115810068A (en) | Image description generation method and device, storage medium and electronic equipment | |
Zaoad et al. | An attention-based hybrid deep learning approach for bengali video captioning | |
CN115269836A (en) | Intention identification method and device | |
CN111008329A (en) | Page content recommendation method and device based on content classification | |
CN112131429A (en) | Video classification method and system based on depth prediction coding network | |
CN116956953A (en) | Translation model training method, device, equipment, medium and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200306 |