CN110866510A - Video description system and method based on key frame detection

Video description system and method based on key frame detection

Info

Publication number
CN110866510A
Authority
CN
China
Prior art keywords
video
network
description
key frame
key
Prior art date
2019-11-21
Legal status
Pending
Application number
CN201911145738.6A
Other languages
Chinese (zh)
Inventor
尹晓雅 (Yin Xiaoya)
李锐 (Li Rui)
于治楼 (Yu Zhilou)
Current Assignee
Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Original Assignee
Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Priority date
2019-11-21
Filing date
2019-11-21
Publication date
2020-03-06
Application filed by Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Priority to CN201911145738.6A
Publication of CN110866510A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a video description system and method based on key frame detection. The system comprises a sampling module, a key frame selection network and a video frame description network. The invention also relates to a video description method based on key frame detection, comprising the following steps: S1, extracting video frames from the video to be described by equal-interval sampling; S2, selecting key frames containing distinct information from the extracted video frames with the key frame selection network; and S3, sending the selected key frames to the video frame description network to generate a description text. By adding a key frame selection network in front of the video frame description network, all sampled video frames are first screened so that only key frames containing distinct information are retained. This excludes most redundant video frames, greatly reduces the amount of data the video frame description network must process, reduces the generation of redundant information, lowers noise interference and improves the processing efficiency of the system.

Description

Video description system and method based on key frame detection
Technical Field
The invention relates to the technical field of video processing, and in particular to a video description system and method based on key frame detection.
Background
The video description task is analogous to translating video content into a passage of natural language. Early video description methods mainly solved the task bottom-up: several sentence templates were predefined, the words composing a sentence were classified by part of speech, descriptive words for the images were obtained through attribute learning, object recognition and similar techniques, and the predicted words were then assembled by a language model matched to the predefined sentence templates. This approach is also known for short as the S-V-O (subject-verb-object) method. With the development of neural networks and deep learning, current video description approaches are based on convolutional neural networks (CNN) and recurrent neural networks (RNN) and adopt an encoder-decoder structure: the video content is first encoded into a global representation vector, and a decoder then decodes this representation vector into natural language. One popular branch of the encoder-decoder framework weights the input features with an attention mechanism, learning to automatically highlight salient objects. For image description tasks, attention is usually expressed over spatial regions. For video description tasks, attention is usually applied along the time dimension, automatically focusing on the most relevant frames when generating each word of the output sequence.
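For background illustration only (this formula is not part of the original disclosure), the soft temporal attention referred to above is typically computed as a normalized relevance weight per sampled frame, with the decoder state scoring each frame feature:

```latex
% Standard soft temporal attention, shown for background; the symbols are
% assumptions, not the patent's notation. h_{t-1}: decoder hidden state,
% v_i: feature of the i-th sampled frame, c_t: attended context at step t.
e_{t,i} = w^{\top}\tanh\!\left(W_h h_{t-1} + W_v v_i + b\right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T}\exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{T}\alpha_{t,i}\, v_i
```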
Existing models typically sample a fixed number of video frames at equal intervals during the encoding stage, which yields many frames carrying repetitive, redundant visual information. The decoding RNN then automatically selects the most relevant temporal segments using the local temporal structure of the video, the global temporal structure, or both. This process involves significant computational overhead: for a medium-sized deep classification model, extracting visual features from a single frame-sized image requires millions of floating point operations, so computational resources are clearly wasted relative to the benefit obtained. Moreover, the frames considered are chosen only by simple sampling rather than deliberately selected. Because the events occurring within adjacent seconds of video change little over time, the temporal redundancy present in adjacent frames is not resolved, and there is no guarantee that frames obtained by equal-interval sampling contain meaningful information. Such redundancy and noise are likely to make the model overly sensitive to noise and to overfit the video content.
In summary, attention-based methods, and temporal attention in particular, sample frames at equal intervals and operate only after the entire video content has been observed, which makes them unsuitable for some practical applications.
Disclosure of Invention
To address the above shortcomings, the invention aims to provide a video description system based on key frame detection and a video description method based on key frame detection, both requiring only a small amount of computation.
The technical scheme adopted by the invention is as follows:
a video description system based on key frame detection comprises a sampling module, a key frame selection network and a video frame description network, wherein:
the sampling module is used for extracting video frames from a video to be described by equal-interval sampling (a minimal sampling sketch is given after this list);
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
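A minimal sketch of the sampling module is given below, assuming OpenCV is available; the function name, frame count and use of cv2 are illustrative assumptions rather than the patent's specified implementation.

```python
# Minimal sketch of equal-interval frame sampling (assumes OpenCV; names are illustrative).
import cv2

def sample_frames(video_path, num_frames=30):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)          # equal sampling interval
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)   # jump to the sampled index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```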
Specifically, the key frame selection network is built on the basis of a convolutional neural network; the video frame description network is based on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism.
The invention also relates to a video description method based on key frame detection, which comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network;
and S3, sending the screened key frames into a video frame description network to generate a description text.
Preferably, the key frame selection network of the present invention is based on a convolutional neural network, and the key frame screening step comprises:
S21, sequentially sending all the video frames into the key frame selection network and obtaining, after convolution processing, the feature vector corresponding to each video frame;
S22, comparing the feature vector of the current video frame with the feature vector of the video frame at the previous moment to obtain the difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and serves as the comparison video frame in the next comparison; when the difference probability is not larger than the non-difference probability, the current video frame is discarded and the comparison video frame of the current comparison is still used in the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed (an illustrative code sketch of this screening loop is given below).
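A minimal PyTorch sketch of the screening loop follows. The backbone layers, feature dimension and the rule that the first frame is always kept as the initial comparison frame are assumptions made for illustration; the patent does not specify a concrete architecture.

```python
# Minimal PyTorch sketch of the key frame screening loop (steps S21-S24).
# The backbone layers, feature dimension and the choice to always keep the
# first frame are illustrative assumptions; the patent does not fix them.
import torch
import torch.nn as nn

class KeyFrameSelector(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Small stand-in for the convolutional feature extractor (S21).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Binary classification network over the difference feature vector (S23):
        # outputs [non-difference probability, difference probability].
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 2), nn.Softmax(dim=-1))

    def forward(self, frames):                       # frames: (T, 3, H, W)
        keep, reference = [], None
        for t in range(frames.size(0)):
            feat = self.backbone(frames[t:t + 1])    # S21: per-frame feature vector
            if reference is None:
                keep.append(t)                       # first frame starts the comparison
                reference = feat
                continue
            diff = feat - reference                  # S22: difference feature vector
            p_same, p_diff = self.classifier(diff)[0]
            if p_diff > p_same:                      # S23: retain as key frame
                keep.append(t)
                reference = feat                     # new comparison video frame
            # otherwise the frame is discarded and the reference is unchanged
        return keep                                  # S24: indices of the key frames
```

For example, `KeyFrameSelector()(torch.randn(30, 3, 224, 224))` would return the indices of the frames retained from 30 sampled frames.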
Preferably, the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, the decoder uses a bidirectional LSTM combined with an attention mechanism, and the video frame description step comprises:
sending the key frames into the video frame description network; the feature vector of each key frame is first obtained through the convolutional neural network, the key frame feature vectors are then sent into the recurrent neural network to obtain a global representation vector of the video, and finally the global representation vector is sent into the decoder to be decoded, yielding a word probability distribution at each time step; the word with the largest probability is selected as the candidate word, and the description text of the video is generated accordingly.
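The following PyTorch sketch illustrates such an encoder-decoder description network under stated assumptions: per-key-frame CNN features are taken as precomputed inputs, a GRU stands in for the recurrent encoder, attention is applied once over the frame features for brevity, and all dimensions and names are illustrative rather than the patent's implementation.

```python
# PyTorch sketch of the video frame description network (encoder-decoder).
# Dimensions, the GRU encoder, the single global attention step and greedy
# word selection are illustrative assumptions, not the patent's implementation.
import torch
import torch.nn as nn

class CaptionNet(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hid_dim=512, emb_dim=256):
        super().__init__()
        # Encoder: per-key-frame CNN features (feat_dim) are assumed precomputed;
        # a GRU aggregates them into a global video representation.
        self.temporal_encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        # Decoder: bidirectional LSTM over word embeddings concatenated with an
        # attention-pooled context over the encoder outputs.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTM(emb_dim + hid_dim, hid_dim,
                               batch_first=True, bidirectional=True)
        self.attn = nn.Linear(hid_dim, 1)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim); captions: (B, L) token ids
        enc_out, _ = self.temporal_encoder(frame_feats)        # (B, T, hid_dim)
        alpha = torch.softmax(self.attn(enc_out), dim=1)       # attention over frames
        context = (alpha * enc_out).sum(dim=1, keepdim=True)   # (B, 1, hid_dim)
        emb = self.embed(captions)                              # (B, L, emb_dim)
        ctx = context.expand(-1, emb.size(1), -1)                # repeat per step
        dec_out, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec_out)          # (B, L, vocab_size): word logits per step
```

At inference, taking the argmax over the vocabulary dimension at each step corresponds to selecting the word with the largest probability as the candidate word.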
Preferably, the establishment of the key frame selection network and the video frame description network according to the present invention comprises the following steps:
Building the network structure: the key frame selection network is built on the basis of a convolutional neural network, and the video frame description network is built on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism;
Acquiring raw data: video frames are extracted from the acquired videos to be described by equal-interval sampling, each video frame is manually annotated, and the data are divided into a training set and a test set;
Making a word list: the manual annotations of the video frames are filtered and tokenized with nltk to build a word list (a small vocabulary-building sketch is given after this list);
Pre-training the video frame description network: the video frame description network is pre-trained with a cross entropy loss function; the cross entropy between each generated language description and its ground-truth label is computed, and the sum of these cross entropies is taken as the total loss;
Training the key frame selection network: with the pre-trained video frame description network as the environment, the key frame selection network is trained with a reinforcement learning algorithm;
Performing joint training: the key frame selection network and the video frame description network are trained jointly.
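As a small sketch of the word-list step referenced above, nltk's `word_tokenize` is the only real API used; the frequency threshold and the special tokens are illustrative assumptions.

```python
# Sketch of word-list construction from the manual annotations using nltk.
# The frequency threshold and special tokens are illustrative assumptions.
# (nltk's 'punkt' tokenizer models must be downloaded once: nltk.download('punkt'))
from collections import Counter
from nltk.tokenize import word_tokenize

def build_vocab(captions, min_count=2):
    counter = Counter()
    for caption in captions:
        counter.update(word_tokenize(caption.lower()))            # segment each annotation
    kept = [w for w, c in counter.items() if c >= min_count]      # screen out rare words
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
    for w in sorted(kept):
        vocab[w] = len(vocab)
    return vocab
```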
Preferably, pre-training the video frame description network according to the present invention comprises:
extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
obtaining the feature vectors of the video frames through a convolutional neural network;
sending the video frame feature vectors into a recurrent neural network to obtain a global representation vector of the video;
sending the global representation vector into the decoder to be decoded, obtaining the word probability distribution at each time step and selecting the word with the largest probability as the candidate word;
performing supervised learning based on the candidate words and the manually established labels.
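A hedged sketch of one pre-training step with the summed cross entropy loss is shown below; it assumes the illustrative CaptionNet above, teacher forcing and padded caption tensors, none of which are specified by the patent.

```python
# Sketch of one pre-training step: summed cross entropy between the generated
# description and the ground-truth label. Assumes the illustrative CaptionNet
# above, teacher forcing and padded caption tensors (pad_id) of integer dtype.
import torch.nn as nn

def pretrain_step(model, optimizer, frame_feats, captions, pad_id=0):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")
    logits = model(frame_feats, captions[:, :-1])        # predict the next word
    targets = captions[:, 1:]                             # shifted ground truth
    loss = criterion(logits.reshape(-1, logits.size(-1)), # sum of per-word losses
                     targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```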
Preferably, the method of the present invention trains the key frame selection network on the basis of the pre-trained video frame description network, and the step of training the key frame selection network comprises:
extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
sending the video frames into the key frame selection network, in combination with an evaluation system, to screen out key frames;
sending the screened key frames into the trained video frame description network to obtain candidate words;
and the evaluation system performing reward-based optimization of the key frame selection network according to the degree to which the candidate words produced by the video frame description network match the manual labels (an illustrative policy-gradient sketch is given below).
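A minimal REINFORCE-style sketch of this reward-based training follows. The patent only states that a reinforcement learning algorithm is used with the pre-trained description network as the environment and an evaluation system providing the reward; the specific policy-gradient algorithm, the Bernoulli keep/discard policy, the `generate` helper and the reward function are all assumptions.

```python
# REINFORCE-style sketch of reward-based training of the key frame selection
# network, with the pre-trained description network as the environment. The
# Bernoulli keep/discard policy, the `generate` helper and `reward_fn` (e.g. a
# sentence-match score against the manual label) are assumptions.
import torch

def rl_step(selector, caption_net, optimizer, frame_feats, reward_fn, reference):
    # selector is assumed here to output per-frame keep probabilities, shape (T,)
    keep_prob = selector(frame_feats)
    policy = torch.distributions.Bernoulli(probs=keep_prob)
    actions = policy.sample()                         # 1 = keep frame as key frame
    key_feats = frame_feats[actions.bool()]           # screened key frames
    with torch.no_grad():
        caption = caption_net.generate(key_feats.unsqueeze(0))   # assumed helper
    reward = reward_fn(caption, reference)            # evaluation system's score
    # Policy gradient: reinforce selections that led to a well-matched description.
    loss = -(policy.log_prob(actions).sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```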
The invention has the following advantages:
1. A key frame selection network is added in front of the video frame description network. All sampled video frames are first sent to the key frame selection network for screening, so that key frames containing distinct information are selected and most repetitive, redundant video frames are excluded. Only the key frames are then sent to the video frame description network for processing, which greatly reduces the amount of data the description network must process, reduces the generation of redundant information, lowers noise interference and improves the processing efficiency of the system;
2. The key frame selection network is set up independently of the video frame description network and can be enabled or disabled according to the actual conditions of use, which makes the system more flexible;
3. The method of the invention uses the key frame selection network to ignore frames with similar content and retain frames that differ significantly, thereby eliminating redundancy, minimizing the amount of computation, reducing noise interference, preventing overfitting and yielding accurate description results.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of the video description method according to the present invention.
Detailed Description
The present invention is further described in the following with reference to the drawings and the specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not to be construed as limiting the present invention, and the embodiments and the technical features of the embodiments can be combined with each other without conflict.
It is to be understood that the terms "first", "second" and the like in the description of the embodiments of the invention are used to distinguish between similar descriptions and do not necessarily describe a particular sequence or chronological order. "A plurality of" in the embodiments of the present invention means two or more.
The term "and/or" in the embodiments of the present invention merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that B exists alone, or that A and B exist at the same time. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
Example one
The embodiment provides a video description system based on key frame detection, comprising a sampling module, a key frame selection network and a video frame description network. The key frame selection network is built on the basis of a convolutional neural network; the video frame description network is based on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism. Wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
Example two
The embodiment provides a video description method based on key frame detection, which comprises the following steps:
S1, extracting video frames from the video to be described by equal-interval sampling;
S2, selecting key frames containing distinct information from the extracted video frames with the key frame selection network; the key frame selection network is built on the basis of a convolutional neural network, and specifically, the key frame screening comprises the following steps:
S21, sequentially sending all the video frames into the key frame selection network and obtaining, after convolution processing, the feature vector corresponding to each video frame;
S22, comparing the feature vector of the current video frame with the feature vector of the video frame at the previous moment to obtain the difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and serves as the comparison video frame in the next comparison; when the difference probability is not larger than the non-difference probability, the current video frame is discarded and the comparison video frame of the current comparison is still used in the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
S3, sending the screened key frames into the video frame description network to generate the description text. The video frame description network is based on an encoder-decoder structure; the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism. Specifically, the video frame description step comprises: sending the key frames into the video frame description network, obtaining the feature vector of each key frame through the convolutional neural network, sending the key frame feature vectors into the recurrent neural network to obtain a global representation vector of the video, sending the global representation vector into the decoder to be decoded to obtain the word probability distribution at each time step, selecting the word with the largest probability as the candidate word, and thereby generating the description text of the video.
The establishment of the key frame selection network and the video frame description network in this embodiment comprises the following steps:
S1, building the network structure: the key frame selection network is built on the basis of a convolutional neural network, and the video frame description network is built on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism.
S2, acquiring raw data: video frames are extracted from the acquired videos to be described by equal-interval sampling, each video frame is manually annotated, and the data are divided into a training set and a test set;
S3, making a word list: the manual annotations of the video frames are filtered and tokenized with nltk to build a word list;
S4, pre-training the video frame description network: the video description network is pre-trained with a cross entropy loss function; the cross entropy between each generated language description and its ground-truth label is computed, and the sum of these cross entropies is taken as the total loss. Specifically, the step of pre-training the video frame description network comprises:
S41, extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
S42, obtaining the feature vectors of the video frames through a convolutional neural network;
S43, sending the video frame feature vectors into a recurrent neural network to obtain a global representation vector of the video;
S44, sending the global representation vector into the decoder to be decoded to obtain the word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
S45, performing supervised learning based on the candidate words and the manually established labels.
S5, training the key frame selection network: the key frame selection network is trained on the basis of the pre-trained video frame description network. Specifically, training the key frame selection network comprises the following steps:
S51, extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
S52, sending the video frames into the key frame selection network, in combination with an evaluation system, to screen out key frames;
S53, sending the screened key frames into the trained video frame description network to obtain candidate words;
S54, the evaluation system performing reward-based optimization of the key frame selection network according to the degree to which the candidate words produced by the video frame description network match the manual labels.
S6, joint training: the key frame selection network and the video frame description network are trained jointly. After the two stages of pre-training the video frame description network and training the key frame selection network with the description network fixed, both networks are well pre-trained. However, because the video frame description network used all of the sampled video frames as input during pre-training, whereas only part of the video frames are sent into it once the key frame selection network is added, a mismatch exists between the two networks, and joint training is used to combine them. In each iteration, the key frame selection is carried out in the forward pass; when the encoder-decoder is trained, the selected frames are treated as a fixed selection, and back-propagation and the reinforcement gradient update proceed as usual.
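A hedged sketch of one joint-training iteration, combining the illustrative pieces above, is shown below; the module names, the `generate` helper and the reward function are assumptions, and the selector is assumed to output per-frame keep probabilities rather than hard indices.

```python
# Sketch of one joint-training iteration (step S6): the key frame selection is
# made in the forward pass, the encoder-decoder is updated on that fixed
# selection with cross entropy, and the selector receives the reinforcement
# gradient. Module names, `generate` and `reward_fn` are assumptions.
import torch
import torch.nn as nn

def joint_step(selector, caption_net, sel_opt, cap_opt,
               frame_feats, captions, reward_fn, reference, pad_id=0):
    # Forward pass: sample a key frame selection from the selector policy.
    keep_prob = selector(frame_feats)
    policy = torch.distributions.Bernoulli(probs=keep_prob)
    actions = policy.sample()
    key_feats = frame_feats[actions.bool()].unsqueeze(0)        # (1, K, feat_dim)

    # 1) Train the encoder-decoder, treating the selection as fixed.
    logits = caption_net(key_feats, captions[:, :-1])
    ce = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")
    cap_loss = ce(logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    cap_opt.zero_grad()
    cap_loss.backward()
    cap_opt.step()

    # 2) Update the selector with the reinforcement (policy-gradient) signal.
    with torch.no_grad():
        caption = caption_net.generate(key_feats)                # assumed helper
        reward = reward_fn(caption, reference)
    sel_loss = -(policy.log_prob(actions).sum() * reward)
    sel_opt.zero_grad()
    sel_loss.backward()
    sel_opt.step()
    return cap_loss.item(), reward
```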
The above-mentioned embodiments are merely preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited thereto. Equivalent substitutions or modifications made by those skilled in the art on the basis of the present invention all fall within the protection scope of the present invention, which is defined by the claims.

Claims (8)

1. A video description system based on key frame detection, characterized by: the system comprises a sampling module, a key frame selection network and a video frame description network, wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
2. The video description system based on key frame detection according to claim 1, characterized in that: the key frame selection network is built on the basis of a convolutional neural network, the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.
3. A video description method based on key frame detection is characterized in that: the method comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network;
and S3, sending the screened key frames into a video frame description network to generate a description text.
4. The video description method based on key frame detection according to claim 3, characterized in that: the key frame selection network is based on a convolutional neural network, and the key frame screening step comprises:
S21, sequentially sending all the video frames into the key frame selection network and obtaining, after convolution processing, the feature vector corresponding to each video frame;
S22, comparing the feature vector of the current video frame with the feature vector of the video frame at the previous moment to obtain the difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and serves as the comparison video frame in the next comparison; when the difference probability is not larger than the non-difference probability, the current video frame is discarded and the comparison video frame of the current comparison is still used in the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
5. The video description method based on key frame detection according to claim 4, characterized in that: the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, the decoder uses a bidirectional LSTM combined with an attention mechanism, and the video frame description step comprises:
sending the key frames into the video frame description network; the feature vector of each key frame is first obtained through the convolutional neural network, the key frame feature vectors are then sent into the recurrent neural network to obtain a global representation vector of the video, and finally the global representation vector is sent into the decoder to be decoded, obtaining a word probability distribution at each time step; the word with the largest probability is selected as the candidate word, and the description text of the video is generated accordingly.
6. The video description method based on key frame detection according to claim 5, characterized in that: the establishment of the key frame selection network and the video frame description network comprises the following steps:
building the network structure: the key frame selection network is built on the basis of a convolutional neural network, and the video frame description network is built on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism;
acquiring raw data: video frames are extracted from the acquired videos to be described by equal-interval sampling, each video frame is manually annotated, and the data are divided into a training set and a test set;
making a word list: the manual annotations of the video frames are filtered and tokenized with nltk to build a word list;
pre-training the video frame description network: the video frame description network is pre-trained with a cross entropy loss function; the cross entropy between each generated language description and its ground-truth label is computed, and the sum is taken as the total loss;
training the key frame selection network: with the pre-trained video frame description network as the environment, the key frame selection network is trained with a reinforcement learning algorithm;
performing joint training: the key frame selection network and the video frame description network are trained jointly.
7. The video description method based on key frame detection according to claim 6, characterized in that: the step of pre-training the video frame description network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
acquiring a feature vector of the video frame through a convolutional neural network;
sending the video frame feature vectors into a recurrent neural network to obtain a global representation vector of the video;
sending the global representation vector into the decoder to be decoded to obtain the word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
supervised learning is performed based on the candidate words and the manually established labels.
8. The video description method based on key frame detection according to claim 7, characterized in that: the key frame selection network is trained on the basis of the pre-trained video frame description network, and the step of training the key frame selection network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
sending the video frames into the key frame selection network, in combination with an evaluation system, to screen out key frames;
sending the screened key frames into a trained video frame description network to obtain candidate words;
and the evaluation system carrying out reward-based optimization of the key frame selection network according to the degree to which the candidate words produced by the video frame description network match the manual labels.
CN201911145738.6A, filed 2019-11-21 (priority 2019-11-21): Video description system and method based on key frame detection. Status: Pending. Published as CN110866510A (en).

Priority Applications / Applications Claiming Priority (1)

Application Number: CN201911145738.6A
Priority Date: 2019-11-21
Filing Date: 2019-11-21
Title: Video description system and method based on key frame detection

Publications (1)

Publication Number: CN110866510A (en)
Publication Date: 2020-03-06

Family

ID=69655367

Family Applications (1)

Application Number: CN201911145738.6A
Status: Pending
Publication: CN110866510A (en)
Title: Video description system and method based on key frame detection

Country Status (1)

Country: CN
Link: CN110866510A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180144208A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Adaptive attention model for image captioning
CN109409221A (en) * 2018-09-20 2019-03-01 中国科学院计算技术研究所 Video content description method and system based on frame selection
CN109559799A (en) * 2018-10-12 2019-04-02 华南理工大学 The construction method and the model of medical image semantic description method, descriptive model
CN110418210A (en) * 2019-07-12 2019-11-05 东南大学 A kind of video presentation generation method exported based on bidirectional circulating neural network and depth

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANGYU CHEN et al.: "Less Is More: Picking Informative Frames for Video Captioning", ECCV 2018: Computer Vision - ECCV 2018 *
JI ZHONG et al.: "Video summarization based on a decoder attention mechanism" (基于解码器注意力机制的视频摘要), Journal of Tianjin University (Science and Technology) (天津大学学报(自然科学与工程技术版)) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111556377A (en) * 2020-04-24 2020-08-18 珠海横琴电享科技有限公司 Short video labeling method based on machine learning
CN111259874B (en) * 2020-05-06 2020-07-28 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN112949501B (en) * 2021-03-03 2023-12-08 安徽省科亿信息科技有限公司 Method for learning availability of object from teaching video
CN112949501A (en) * 2021-03-03 2021-06-11 安徽省科亿信息科技有限公司 Method for learning object availability from teaching video
CN113792183A (en) * 2021-09-17 2021-12-14 咪咕数字传媒有限公司 Text generation method and device and computing equipment
CN113792183B (en) * 2021-09-17 2023-09-08 咪咕数字传媒有限公司 Text generation method and device and computing equipment
CN114786052A (en) * 2022-04-29 2022-07-22 同方知网数字出版技术股份有限公司 Academic live video fast browsing method based on key frame extraction
CN115018840A (en) * 2022-08-08 2022-09-06 珠海市南特金属科技股份有限公司 Method, system and device for detecting cracks of precision casting
CN115018840B (en) * 2022-08-08 2022-11-18 珠海市南特金属科技股份有限公司 Method, system and device for detecting cracks of precision casting
WO2024063571A1 (en) * 2022-09-22 2024-03-28 Samsung Electronics Co., Ltd. Method and apparatus for vision-language understanding
CN115495615A (en) * 2022-11-15 2022-12-20 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN115495615B (en) * 2022-11-15 2023-02-28 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN117177006A (en) * 2023-09-01 2023-12-05 湖南广播影视集团有限公司 CNN algorithm-based short video intelligent manufacturing method
CN117177006B (en) * 2023-09-01 2024-07-16 湖南广播影视集团有限公司 CNN algorithm-based short video intelligent manufacturing method
CN117809218A (en) * 2023-12-29 2024-04-02 浙江博观瑞思科技有限公司 Electronic shop descriptive video processing system and method

Similar Documents

Publication Publication Date Title
CN110866510A (en) Video description system and method based on key frame detection
CN111488807B (en) Video description generation system based on graph rolling network
Perarnau et al. Invertible conditional gans for image editing
EP3745305B1 (en) Video description generation method and device, video playing method and device, and storage medium
CN105183720B (en) Machine translation method and device based on RNN model
CN111159454A (en) Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN113011202B (en) End-to-end image text translation method, system and device based on multitasking training
EP3885966B1 (en) Method and device for generating natural language description information
CN110688927B (en) Video action detection method based on time sequence convolution modeling
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN110083702B (en) Aspect level text emotion conversion method based on multi-task learning
CN108563622B (en) Absolute sentence generation method and device with style diversity
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN113657115B (en) Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion
CN114339450B (en) Video comment generation method, system, device and storage medium
CN113035311A (en) Medical image report automatic generation method based on multi-mode attention mechanism
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN114373028A (en) Method and device for generating picture and electronic equipment
CN111340006B (en) Sign language recognition method and system
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
Zaoad et al. An attention-based hybrid deep learning approach for bengali video captioning
CN115269836A (en) Intention identification method and device
CN111008329A (en) Page content recommendation method and device based on content classification
CN112131429A (en) Video classification method and system based on depth prediction coding network
CN116956953A (en) Translation model training method, device, equipment, medium and program product

Legal Events

Code: Description
PB01: Publication (application publication date: 2020-03-06)
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication