CN110866510A - Video description system and method based on key frame detection

Video description system and method based on key frame detection

Info

Publication number
CN110866510A
Authority
CN
China
Prior art keywords
video
network
description
key frame
key
Prior art date
2019-11-21
Legal status
Pending
Application number
CN201911145738.6A
Other languages
Chinese (zh)
Inventor
尹晓雅 (Yin Xiaoya)
李锐 (Li Rui)
于治楼 (Yu Zhilou)
Current Assignee
Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Original Assignee
Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Priority date
2019-11-21
Filing date
2019-11-21
Publication date
2020-03-06
Application filed by Shandong Inspur Artificial Intelligence Research Institute Co Ltd
Priority to CN201911145738.6A
Publication of CN110866510A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks

Abstract

The invention discloses a video description system and method based on key frame detection. The system comprises a sampling module, a key frame selection network and a video frame description network. The invention also relates to a video description method based on key frame detection, comprising the following steps: S1, extracting video frames from the video to be described by equal-interval sampling; S2, selecting key frames containing distinct information from the extracted video frames with the key frame selection network; and S3, sending the selected key frames to the video frame description network to generate a description text. By adding a key frame selection network in front of the video frame description network, all sampled video frames are first screened so that only key frames containing distinct information are retained. This excludes most redundant video frames, greatly reduces the amount of data the video frame description network must process, reduces the generation of redundant information, lowers noise interference and improves the processing efficiency of the system.

Description

Video description system and method based on key frame detection
Technical Field
The invention relates to the technical field of video processing, and in particular to a video description system and method based on key frame detection.
Background
The video description task is analogous to translating video content into a passage of natural language. Early video description methods mainly solved the task bottom-up: several sentence templates were predefined, the words composing a sentence were classified by part of speech, descriptive words for the images were obtained through attribute learning, object recognition and similar techniques, and the predicted words were then assembled by a language model matched to the predefined sentence templates. This approach is also known for short as the S-V-O (subject-verb-object) method. With the development of neural networks and deep learning, current video description approaches are based on convolutional neural networks (CNN) and recurrent neural networks (RNN) and adopt an encoder-decoder structure: the video content is first encoded into a global representation vector, and a decoder then decodes this representation vector into natural language. One popular branch of the encoder-decoder framework weights the input features with an attention mechanism, learning to automatically highlight salient objects. For image description tasks, attention is usually expressed over spatial regions. For video description tasks, attention is usually applied along the time dimension, automatically focusing on the most relevant frames when generating each word of the output sequence.
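For background illustration only (this formula is not part of the original disclosure), the soft temporal attention referred to above is typically computed as a normalized relevance weight per sampled frame, with the decoder state scoring each frame feature:

```latex
% Standard soft temporal attention, shown for background; the symbols are
% assumptions, not the patent's notation. h_{t-1}: decoder hidden state,
% v_i: feature of the i-th sampled frame, c_t: attended context at step t.
e_{t,i} = w^{\top}\tanh\!\left(W_h h_{t-1} + W_v v_i + b\right), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T}\exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{T}\alpha_{t,i}\, v_i
```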
Existing models typically sample a fixed number of video frames at equal intervals during the encoding stage, which yields many frames carrying repetitive, redundant visual information. The decoding RNN then automatically selects the most relevant temporal segments using the local temporal structure of the video, the global temporal structure, or both. This process involves significant computational overhead: for a medium-sized deep classification model, extracting visual features from a single frame-sized image requires millions of floating point operations, so computational resources are clearly wasted relative to the benefit obtained. Moreover, the frames considered are chosen only by simple sampling rather than deliberately selected. Because the events occurring within adjacent seconds of video change little over time, the temporal redundancy present in adjacent frames is not resolved, and there is no guarantee that frames obtained by equal-interval sampling contain meaningful information. Such redundancy and noise are likely to make the model overly sensitive to noise and to overfit the video content.
In summary, attention-based methods, and temporal attention in particular, sample frames at equal intervals and operate only after the entire video content has been observed, which makes them unsuitable for some practical applications.
Disclosure of Invention
To address the above shortcomings, the invention aims to provide a video description system based on key frame detection and a video description method based on key frame detection, both requiring only a small amount of computation.
The technical scheme adopted by the invention is as follows:
a video description system based on key frame detection comprises a sampling module, a key frame selection network and a video frame description network, wherein:
the sampling module is used for extracting video frames from a video to be described by equal-interval sampling (a minimal sampling sketch is given after this list);
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
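A minimal sketch of the sampling module is given below, assuming OpenCV is available; the function name, frame count and use of cv2 are illustrative assumptions rather than the patent's specified implementation.

```python
# Minimal sketch of equal-interval frame sampling (assumes OpenCV; names are illustrative).
import cv2

def sample_frames(video_path, num_frames=30):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)          # equal sampling interval
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)   # jump to the sampled index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```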
Specifically, the key frame selection network is built on the basis of a convolutional neural network; the video frame description network is based on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism.
The invention also relates to a video description method based on key frame detection, which comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network;
and S3, sending the screened key frames into a video frame description network to generate a description text.
Preferably, the key frame selection network of the present invention is based on a convolutional neural network, and the key frame screening step comprises:
S21, sequentially sending all the video frames into the key frame selection network and obtaining, after convolution processing, the feature vector corresponding to each video frame;
S22, comparing the feature vector of the current video frame with the feature vector of the video frame at the previous moment to obtain the difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and serves as the comparison video frame in the next comparison; when the difference probability is not larger than the non-difference probability, the current video frame is discarded and the comparison video frame of the current comparison is still used in the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed (an illustrative code sketch of this screening loop is given below).
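A minimal PyTorch sketch of the screening loop follows. The backbone layers, feature dimension and the rule that the first frame is always kept as the initial comparison frame are assumptions made for illustration; the patent does not specify a concrete architecture.

```python
# Minimal PyTorch sketch of the key frame screening loop (steps S21-S24).
# The backbone layers, feature dimension and the choice to always keep the
# first frame are illustrative assumptions; the patent does not fix them.
import torch
import torch.nn as nn

class KeyFrameSelector(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Small stand-in for the convolutional feature extractor (S21).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Binary classification network over the difference feature vector (S23):
        # outputs [non-difference probability, difference probability].
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 2), nn.Softmax(dim=-1))

    def forward(self, frames):                       # frames: (T, 3, H, W)
        keep, reference = [], None
        for t in range(frames.size(0)):
            feat = self.backbone(frames[t:t + 1])    # S21: per-frame feature vector
            if reference is None:
                keep.append(t)                       # first frame starts the comparison
                reference = feat
                continue
            diff = feat - reference                  # S22: difference feature vector
            p_same, p_diff = self.classifier(diff)[0]
            if p_diff > p_same:                      # S23: retain as key frame
                keep.append(t)
                reference = feat                     # new comparison video frame
            # otherwise the frame is discarded and the reference is unchanged
        return keep                                  # S24: indices of the key frames
```

For example, `KeyFrameSelector()(torch.randn(30, 3, 224, 224))` would return the indices of the frames retained from 30 sampled frames.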
Preferably, the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, the decoder uses a bidirectional LSTM combined with an attention mechanism, and the video frame description step comprises:
sending the key frames into the video frame description network; the feature vector of each key frame is first obtained through the convolutional neural network, the key frame feature vectors are then sent into the recurrent neural network to obtain a global representation vector of the video, and finally the global representation vector is sent into the decoder to be decoded, yielding a word probability distribution at each time step; the word with the largest probability is selected as the candidate word, and the description text of the video is generated accordingly.
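The following PyTorch sketch illustrates such an encoder-decoder description network under stated assumptions: per-key-frame CNN features are taken as precomputed inputs, a GRU stands in for the recurrent encoder, attention is applied once over the frame features for brevity, and all dimensions and names are illustrative rather than the patent's implementation.

```python
# PyTorch sketch of the video frame description network (encoder-decoder).
# Dimensions, the GRU encoder, the single global attention step and greedy
# word selection are illustrative assumptions, not the patent's implementation.
import torch
import torch.nn as nn

class CaptionNet(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hid_dim=512, emb_dim=256):
        super().__init__()
        # Encoder: per-key-frame CNN features (feat_dim) are assumed precomputed;
        # a GRU aggregates them into a global video representation.
        self.temporal_encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        # Decoder: bidirectional LSTM over word embeddings concatenated with an
        # attention-pooled context over the encoder outputs.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTM(emb_dim + hid_dim, hid_dim,
                               batch_first=True, bidirectional=True)
        self.attn = nn.Linear(hid_dim, 1)
        self.out = nn.Linear(2 * hid_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim); captions: (B, L) token ids
        enc_out, _ = self.temporal_encoder(frame_feats)        # (B, T, hid_dim)
        alpha = torch.softmax(self.attn(enc_out), dim=1)       # attention over frames
        context = (alpha * enc_out).sum(dim=1, keepdim=True)   # (B, 1, hid_dim)
        emb = self.embed(captions)                              # (B, L, emb_dim)
        ctx = context.expand(-1, emb.size(1), -1)                # repeat per step
        dec_out, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec_out)          # (B, L, vocab_size): word logits per step
```

At inference, taking the argmax over the vocabulary dimension at each step corresponds to selecting the word with the largest probability as the candidate word.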
Preferably, the establishment of the key frame selection network and the video frame description network according to the present invention comprises the following steps:
Building the network structure: the key frame selection network is built on the basis of a convolutional neural network, and the video frame description network is built on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism;
Acquiring raw data: video frames are extracted from the acquired videos to be described by equal-interval sampling, each video frame is manually annotated, and the data are divided into a training set and a test set;
Making a word list: the manual annotations of the video frames are filtered and tokenized with nltk to build a word list (a small vocabulary-building sketch is given after this list);
Pre-training the video frame description network: the video frame description network is pre-trained with a cross entropy loss function; the cross entropy between each generated language description and its ground-truth label is computed, and the sum of these cross entropies is taken as the total loss;
Training the key frame selection network: with the pre-trained video frame description network as the environment, the key frame selection network is trained with a reinforcement learning algorithm;
Performing joint training: the key frame selection network and the video frame description network are trained jointly.
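As a small sketch of the word-list step referenced above, nltk's `word_tokenize` is the only real API used; the frequency threshold and the special tokens are illustrative assumptions.

```python
# Sketch of word-list construction from the manual annotations using nltk.
# The frequency threshold and special tokens are illustrative assumptions.
# (nltk's 'punkt' tokenizer models must be downloaded once: nltk.download('punkt'))
from collections import Counter
from nltk.tokenize import word_tokenize

def build_vocab(captions, min_count=2):
    counter = Counter()
    for caption in captions:
        counter.update(word_tokenize(caption.lower()))            # segment each annotation
    kept = [w for w, c in counter.items() if c >= min_count]      # screen out rare words
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
    for w in sorted(kept):
        vocab[w] = len(vocab)
    return vocab
```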
Preferably, pre-training the video frame description network according to the present invention comprises:
extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
obtaining the feature vectors of the video frames through a convolutional neural network;
sending the video frame feature vectors into a recurrent neural network to obtain a global representation vector of the video;
sending the global representation vector into the decoder to be decoded, obtaining the word probability distribution at each time step and selecting the word with the largest probability as the candidate word;
performing supervised learning based on the candidate words and the manually established labels.
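A hedged sketch of one pre-training step with the summed cross entropy loss is shown below; it assumes the illustrative CaptionNet above, teacher forcing and padded caption tensors, none of which are specified by the patent.

```python
# Sketch of one pre-training step: summed cross entropy between the generated
# description and the ground-truth label. Assumes the illustrative CaptionNet
# above, teacher forcing and padded caption tensors (pad_id) of integer dtype.
import torch.nn as nn

def pretrain_step(model, optimizer, frame_feats, captions, pad_id=0):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")
    logits = model(frame_feats, captions[:, :-1])        # predict the next word
    targets = captions[:, 1:]                             # shifted ground truth
    loss = criterion(logits.reshape(-1, logits.size(-1)), # sum of per-word losses
                     targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```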
Preferably, the method of the present invention trains the key frame selection network on the basis of the pre-trained video frame description network, and the step of training the key frame selection network comprises:
extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
sending the video frames into the key frame selection network, in combination with an evaluation system, to screen out key frames;
sending the screened key frames into the trained video frame description network to obtain candidate words;
and the evaluation system performing reward-based optimization of the key frame selection network according to the degree to which the candidate words produced by the video frame description network match the manual labels (an illustrative policy-gradient sketch is given below).
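A minimal REINFORCE-style sketch of this reward-based training follows. The patent only states that a reinforcement learning algorithm is used with the pre-trained description network as the environment and an evaluation system providing the reward; the specific policy-gradient algorithm, the Bernoulli keep/discard policy, the `generate` helper and the reward function are all assumptions.

```python
# REINFORCE-style sketch of reward-based training of the key frame selection
# network, with the pre-trained description network as the environment. The
# Bernoulli keep/discard policy, the `generate` helper and `reward_fn` (e.g. a
# sentence-match score against the manual label) are assumptions.
import torch

def rl_step(selector, caption_net, optimizer, frame_feats, reward_fn, reference):
    # selector is assumed here to output per-frame keep probabilities, shape (T,)
    keep_prob = selector(frame_feats)
    policy = torch.distributions.Bernoulli(probs=keep_prob)
    actions = policy.sample()                         # 1 = keep frame as key frame
    key_feats = frame_feats[actions.bool()]           # screened key frames
    with torch.no_grad():
        caption = caption_net.generate(key_feats.unsqueeze(0))   # assumed helper
    reward = reward_fn(caption, reference)            # evaluation system's score
    # Policy gradient: reinforce selections that led to a well-matched description.
    loss = -(policy.log_prob(actions).sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```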
The invention has the following advantages:
1. A key frame selection network is added in front of the video frame description network. All sampled video frames are first sent to the key frame selection network for screening, so that key frames containing distinct information are selected and most repetitive, redundant video frames are excluded. Only the key frames are then sent to the video frame description network for processing, which greatly reduces the amount of data the description network must process, reduces the generation of redundant information, lowers noise interference and improves the processing efficiency of the system;
2. The key frame selection network is set up independently of the video frame description network and can be enabled or disabled according to the actual conditions of use, which makes the system more flexible;
3. The method of the invention uses the key frame selection network to ignore frames with similar content and retain frames that differ significantly, thereby eliminating redundancy, minimizing the amount of computation, reducing noise interference, preventing overfitting and yielding accurate description results.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of the video description method according to the present invention.
Detailed Description
The present invention is further described in the following with reference to the drawings and the specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not to be construed as limiting the present invention, and the embodiments and the technical features of the embodiments can be combined with each other without conflict.
It is to be understood that the terms "first", "second" and the like in the description of the embodiments of the invention are used to distinguish between similar descriptions and do not necessarily describe a particular sequence or chronological order. "A plurality of" in the embodiments of the present invention means two or more.
The term "and/or" in the embodiments of the present invention merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that B exists alone, or that A and B exist at the same time. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
Example one
The embodiment provides a video description system based on key frame detection, comprising a sampling module, a key frame selection network and a video frame description network. The key frame selection network is built on the basis of a convolutional neural network; the video frame description network is based on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism. Wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
Example two
The embodiment provides a video description method based on key frame detection, which comprises the following steps:
S1, extracting video frames from the video to be described by equal-interval sampling;
S2, selecting key frames containing distinct information from the extracted video frames with the key frame selection network; the key frame selection network is built on the basis of a convolutional neural network, and specifically, the key frame screening comprises the following steps:
S21, sequentially sending all the video frames into the key frame selection network and obtaining, after convolution processing, the feature vector corresponding to each video frame;
S22, comparing the feature vector of the current video frame with the feature vector of the video frame at the previous moment to obtain the difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and serves as the comparison video frame in the next comparison; when the difference probability is not larger than the non-difference probability, the current video frame is discarded and the comparison video frame of the current comparison is still used in the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
S3, sending the screened key frames into the video frame description network to generate the description text. The video frame description network is based on an encoder-decoder structure; the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism. Specifically, the video frame description step comprises: sending the key frames into the video frame description network, obtaining the feature vector of each key frame through the convolutional neural network, sending the key frame feature vectors into the recurrent neural network to obtain a global representation vector of the video, sending the global representation vector into the decoder to be decoded to obtain the word probability distribution at each time step, selecting the word with the largest probability as the candidate word, and thereby generating the description text of the video.
The establishment of the key frame selection network and the video frame description network in this embodiment comprises the following steps:
S1, building the network structure: the key frame selection network is built on the basis of a convolutional neural network, and the video frame description network is built on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism.
S2, acquiring raw data: video frames are extracted from the acquired videos to be described by equal-interval sampling, each video frame is manually annotated, and the data are divided into a training set and a test set;
S3, making a word list: the manual annotations of the video frames are filtered and tokenized with nltk to build a word list;
S4, pre-training the video frame description network: the video description network is pre-trained with a cross entropy loss function; the cross entropy between each generated language description and its ground-truth label is computed, and the sum of these cross entropies is taken as the total loss. Specifically, the step of pre-training the video frame description network comprises:
S41, extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
S42, obtaining the feature vectors of the video frames through a convolutional neural network;
S43, sending the video frame feature vectors into a recurrent neural network to obtain a global representation vector of the video;
S44, sending the global representation vector into the decoder to be decoded to obtain the word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
S45, performing supervised learning based on the candidate words and the manually established labels.
S5, training the key frame selection network: the key frame selection network is trained on the basis of the pre-trained video frame description network. Specifically, training the key frame selection network comprises the following steps:
S51, extracting video frames at equal intervals from the videos of the training set and manually establishing labels for the extracted video frames;
S52, sending the video frames into the key frame selection network, in combination with an evaluation system, to screen out key frames;
S53, sending the screened key frames into the trained video frame description network to obtain candidate words;
S54, the evaluation system performing reward-based optimization of the key frame selection network according to the degree to which the candidate words produced by the video frame description network match the manual labels.
S6, joint training: the key frame selection network and the video frame description network are trained jointly. After the two stages of pre-training the video frame description network and training the key frame selection network with the description network fixed, both networks are well pre-trained. However, because the video frame description network used all of the sampled video frames as input during pre-training, whereas only part of the video frames are sent into it once the key frame selection network is added, a mismatch exists between the two networks, and joint training is used to combine them. In each iteration, the key frame selection is carried out in the forward pass; when the encoder-decoder is trained, the selected frames are treated as a fixed selection, and back-propagation and the reinforcement gradient update proceed as usual.
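A hedged sketch of one joint-training iteration, combining the illustrative pieces above, is shown below; the module names, the `generate` helper and the reward function are assumptions, and the selector is assumed to output per-frame keep probabilities rather than hard indices.

```python
# Sketch of one joint-training iteration (step S6): the key frame selection is
# made in the forward pass, the encoder-decoder is updated on that fixed
# selection with cross entropy, and the selector receives the reinforcement
# gradient. Module names, `generate` and `reward_fn` are assumptions.
import torch
import torch.nn as nn

def joint_step(selector, caption_net, sel_opt, cap_opt,
               frame_feats, captions, reward_fn, reference, pad_id=0):
    # Forward pass: sample a key frame selection from the selector policy.
    keep_prob = selector(frame_feats)
    policy = torch.distributions.Bernoulli(probs=keep_prob)
    actions = policy.sample()
    key_feats = frame_feats[actions.bool()].unsqueeze(0)        # (1, K, feat_dim)

    # 1) Train the encoder-decoder, treating the selection as fixed.
    logits = caption_net(key_feats, captions[:, :-1])
    ce = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")
    cap_loss = ce(logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    cap_opt.zero_grad()
    cap_loss.backward()
    cap_opt.step()

    # 2) Update the selector with the reinforcement (policy-gradient) signal.
    with torch.no_grad():
        caption = caption_net.generate(key_feats)                # assumed helper
        reward = reward_fn(caption, reference)
    sel_loss = -(policy.log_prob(actions).sum() * reward)
    sel_opt.zero_grad()
    sel_loss.backward()
    sel_opt.step()
    return cap_loss.item(), reward
```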
The above-mentioned embodiments are merely preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited thereto. Equivalent substitutions or modifications made by those skilled in the art on the basis of the present invention all fall within the protection scope of the present invention, which is defined by the claims.

Claims (8)

1. A video description system based on key frame detection, characterized by: the system comprises a sampling module, a key frame selection network and a video frame description network, wherein:
the sampling module is used for extracting video frames from a video to be described in an equally-spaced sampling mode;
a key frame selection network for selecting key frames with different information from the obtained video frames;
and the video frame description network generates a video description text based on the key frames.
2. The video description system based on key frame detection according to claim 1, characterized in that: the key frame selection network is built on the basis of a convolutional neural network, the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, and the decoder uses a bidirectional LSTM combined with an attention mechanism.
3. A video description method based on key frame detection is characterized in that: the method comprises the following steps:
s1, extracting video frames from the video to be described by adopting an equal-interval sampling mode;
s2, selecting key frames containing different information from the extracted video frames based on the key frame selection network;
and S3, sending the screened key frames into a video frame description network to generate a description text.
4. The video description method based on key frame detection according to claim 3, characterized in that: the key frame selection network is based on a convolutional neural network, and the key frame screening step comprises:
S21, sequentially sending all the video frames into the key frame selection network and obtaining, after convolution processing, the feature vector corresponding to each video frame;
S22, comparing the feature vector of the current video frame with the feature vector of the video frame at the previous moment to obtain the difference feature vector between the two;
S23, sending the difference feature vector into a binary classification network to obtain a difference probability and a non-difference probability; when the difference probability is larger than the non-difference probability, the current video frame is retained as a key frame and serves as the comparison video frame in the next comparison; when the difference probability is not larger than the non-difference probability, the current video frame is discarded and the comparison video frame of the current comparison is still used in the next comparison;
and S24, repeating steps S22-S23 until all the video frames have been processed.
5. The video description method based on key frame detection according to claim 4, characterized in that: the video frame description network is based on an encoder-decoder structure, the encoder uses a convolutional neural network and a recurrent neural network for feature extraction, the decoder uses a bidirectional LSTM combined with an attention mechanism, and the video frame description step comprises:
sending the key frames into the video frame description network; the feature vector of each key frame is first obtained through the convolutional neural network, the key frame feature vectors are then sent into the recurrent neural network to obtain a global representation vector of the video, and finally the global representation vector is sent into the decoder to be decoded, obtaining a word probability distribution at each time step; the word with the largest probability is selected as the candidate word, and the description text of the video is generated accordingly.
6. The video description method based on key frame detection according to claim 5, characterized in that: the establishment of the key frame selection network and the video frame description network comprises the following steps:
building the network structure: the key frame selection network is built on the basis of a convolutional neural network, and the video frame description network is built on an encoder-decoder structure in which the encoder uses a convolutional neural network and a recurrent neural network for feature extraction and the decoder uses a bidirectional LSTM combined with an attention mechanism;
acquiring raw data: video frames are extracted from the acquired videos to be described by equal-interval sampling, each video frame is manually annotated, and the data are divided into a training set and a test set;
making a word list: the manual annotations of the video frames are filtered and tokenized with nltk to build a word list;
pre-training the video frame description network: the video frame description network is pre-trained with a cross entropy loss function; the cross entropy between each generated language description and its ground-truth label is computed, and the sum is taken as the total loss;
training the key frame selection network: with the pre-trained video frame description network as the environment, the key frame selection network is trained with a reinforcement learning algorithm;
performing joint training: the key frame selection network and the video frame description network are trained jointly.
7. The video description method based on key frame detection according to claim 6, characterized in that: the step of pre-training the video frame description network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
acquiring a feature vector of the video frame through a convolutional neural network;
sending the video frame feature vectors into a recurrent neural network to obtain a global representation vector of the video;
sending the global representation vector into the decoder to be decoded to obtain the word probability distribution at each time step, and selecting the word with the largest probability as the candidate word;
supervised learning is performed based on the candidate words and the manually established labels.
8. The video description method based on key frame detection according to claim 7, characterized in that: the key frame selection network is trained on the basis of the pre-trained video frame description network, and the step of training the key frame selection network comprises:
extracting video frames at equal intervals from the videos of the training set, and manually establishing labels for the extracted video frames;
sending the video frames into the key frame selection network, in combination with an evaluation system, to screen out key frames;
sending the screened key frames into a trained video frame description network to obtain candidate words;
and the evaluation system carrying out reward-based optimization of the key frame selection network according to the degree to which the candidate words produced by the video frame description network match the manual labels.
CN201911145738.6A, filed 2019-11-21 (priority 2019-11-21): Video description system and method based on key frame detection. Status: Pending. Published as CN110866510A (en).

Priority Applications / Applications Claiming Priority (1)

Application Number: CN201911145738.6A
Priority Date: 2019-11-21
Filing Date: 2019-11-21
Title: Video description system and method based on key frame detection

Publications (1)

Publication Number: CN110866510A (en)
Publication Date: 2020-03-06

Family

ID=69655367

Family Applications (1)

Application Number: CN201911145738.6A
Status: Pending
Publication: CN110866510A (en)
Title: Video description system and method based on key frame detection

Country Status (1)

Country: CN
Link: CN110866510A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180144208A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Adaptive attention model for image captioning
CN109409221A (en) * 2018-09-20 2019-03-01 中国科学院计算技术研究所 Video content description method and system based on frame selection
CN109559799A (en) * 2018-10-12 2019-04-02 华南理工大学 The construction method and the model of medical image semantic description method, descriptive model
CN110418210A (en) * 2019-07-12 2019-11-05 东南大学 A kind of video presentation generation method exported based on bidirectional circulating neural network and depth

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANGYU CHEN et al.: "Less Is More: Picking Informative Frames for Video Captioning", ECCV 2018: Computer Vision - ECCV 2018 *
JI ZHONG et al.: "Video summarization based on a decoder attention mechanism" (基于解码器注意力机制的视频摘要), Journal of Tianjin University (Science and Technology) (天津大学学报(自然科学与工程技术版)) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111556377A (en) * 2020-04-24 2020-08-18 珠海横琴电享科技有限公司 Short video labeling method based on machine learning
CN111259874B (en) * 2020-05-06 2020-07-28 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN112949501B (en) * 2021-03-03 2023-12-08 安徽省科亿信息科技有限公司 Method for learning availability of object from teaching video
CN112949501A (en) * 2021-03-03 2021-06-11 安徽省科亿信息科技有限公司 Method for learning object availability from teaching video
CN113792183A (en) * 2021-09-17 2021-12-14 咪咕数字传媒有限公司 Text generation method and device and computing equipment
CN113792183B (en) * 2021-09-17 2023-09-08 咪咕数字传媒有限公司 Text generation method and device and computing equipment
CN114786052A (en) * 2022-04-29 2022-07-22 同方知网数字出版技术股份有限公司 Academic live video fast browsing method based on key frame extraction
CN115018840A (en) * 2022-08-08 2022-09-06 珠海市南特金属科技股份有限公司 Method, system and device for detecting cracks of precision casting
CN115018840B (en) * 2022-08-08 2022-11-18 珠海市南特金属科技股份有限公司 Method, system and device for detecting cracks of precision casting
WO2024063571A1 (en) * 2022-09-22 2024-03-28 Samsung Electronics Co., Ltd. Method and apparatus for vision-language understanding
CN115495615A (en) * 2022-11-15 2022-12-20 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN115495615B (en) * 2022-11-15 2023-02-28 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN117177006A (en) * 2023-09-01 2023-12-05 湖南广播影视集团有限公司 CNN algorithm-based short video intelligent manufacturing method
CN117177006B (en) * 2023-09-01 2024-07-16 湖南广播影视集团有限公司 CNN algorithm-based short video intelligent manufacturing method
CN117809218A (en) * 2023-12-29 2024-04-02 浙江博观瑞思科技有限公司 Electronic shop descriptive video processing system and method

Similar Documents

Publication Publication Date Title
CN110866510A (en) Video description system and method based on key frame detection
CN111488807B (en) Video description generation system based on graph rolling network
Perarnau et al. Invertible conditional gans for image editing
EP3745305B1 (en) Video description generation method and device, video playing method and device, and storage medium
CN105183720B (en) Machine translation method and device based on RNN model
CN111159454A (en) Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN113011202B (en) End-to-end image text translation method, system and device based on multitasking training
EP3885966B1 (en) Method and device for generating natural language description information
CN110688927B (en) Video action detection method based on time sequence convolution modeling
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
CN110083702B (en) Aspect level text emotion conversion method based on multi-task learning
CN108563622B (en) Absolute sentence generation method and device with style diversity
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN113657115B (en) Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion
CN114339450B (en) Video comment generation method, system, device and storage medium
CN113035311A (en) Medical image report automatic generation method based on multi-mode attention mechanism
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN114373028A (en) Method and device for generating picture and electronic equipment
CN111340006B (en) Sign language recognition method and system
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
Zaoad et al. An attention-based hybrid deep learning approach for bengali video captioning
CN115269836A (en) Intention identification method and device
CN111008329A (en) Page content recommendation method and device based on content classification
CN112131429A (en) Video classification method and system based on depth prediction coding network
CN116956953A (en) Translation model training method, device, equipment, medium and program product

Legal Events

Code: Description
PB01: Publication (application publication date: 2020-03-06)
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication