CN114913466A - Video key frame extraction method based on double-flow information and sparse representation - Google Patents
- Publication number: CN114913466A (application CN202210616931.9A)
- Authority: CN (China)
- Prior art keywords: matrix, video, sparse, dual-stream, key frame
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V10/513 — Extraction of image or video features: sparse representations
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a video key frame extraction method based on dual-stream information and sparse representation, which comprises the following steps: splitting a video file to be extracted into image frames, and constructing a video spatial-stream matrix and a video temporal-stream matrix from the image frames; obtaining a dual-stream information matrix from the spatial-stream and temporal-stream matrices, and performing feature extraction on the dual-stream information matrix to obtain a dual-stream feature matrix; inputting the dual-stream feature matrix into a sparse representation model, calculating a sparse coefficient matrix, and obtaining key frame indexes from the sparse coefficient matrix; and extracting the key frames from the video file through the key frame indexes. The method can efficiently extract a small set of key frames from a video, reducing the number of extracted key frames, lowering the key frame compression ratio, and improving the computation speed of the key frame extraction algorithm.
Description
Technical Field
The invention relates to the technical field of computer-vision key frame extraction, and in particular to a video key frame extraction method based on dual-stream information and sparse representation.
Background
With the rapid development of information technology and the wide adoption of the internet and multimedia, large volumes of video are generated in people's daily life and work. Because video data are complex and varied in structure, video summarization has become a central research topic in video understanding: it effectively improves video retrieval efficiency and eases video storage. Key frame extraction is a core problem within video summarization. Video data are huge and contain much redundant information, so extracting video key frame information effectively is very important. A video signal is a continuous sequence of images, rich in content, expressive, and large in information volume, but adjacent frames are strongly correlated and highly redundant. Key frame extraction selects a small number of the most informative frames to approximately represent the original video, relieving the burden of high-dimensional video signal processing and improving the efficiency of video understanding.
Existing image feature extraction methods typically either pass the raw frame information through a neural network and fuse the results at the end, or stack the frames as a whole and feed them into a convolutional neural network, in order to learn spatio-temporal structure. Such methods learn local appearance information well, but they capture the motion between consecutive video frames poorly. Dual-stream information represents both the spatial content and the motion of the original video frames; applying it to key frame extraction effectively improves how well the model learns from video frames, strengthens the model's understanding of the video, and raises the accuracy of the extracted key frames.
Key frame extraction based on sparse models has attracted considerable attention for its outstanding advantages: simplicity and a well-developed mathematical foundation. Sparse representations of signals have mature mathematical formulations: describing a signal through a dictionary and a sparse coefficient matrix yields a more compact representation of a complex signal and thus better signal-processing performance. Notably, the effectiveness of sparse-representation-based key frame extraction depends largely on the sparsity constraint, which is typically based on the L1 norm. Sparse Modeling Representative Selection (SMRS) uses the L1 norm to compute a sparse coefficient matrix whose non-zero rows correspond to key frames. Although SMRS can obtain key frames effectively, existing sparse-representation-based methods draw only on the spatial information of the video and do not fully consider the motion of objects in it, so the key frames they extract are not fully effective.
Therefore, it is necessary to develop more efficient methods to obtain better video key frames.
Disclosure of Invention
The invention aims to provide a video key frame extraction method based on dual-stream information and sparse representation that solves the problems of the prior art: it improves the computational accuracy of the key frame extraction algorithm, reduces the number of extracted key frames, and lowers the compression ratio.
In order to achieve the purpose, the invention provides the following scheme:
a video key frame extraction method based on dual-stream information and sparse representation comprises the following steps:
splitting a video file to be extracted to obtain image frames, and respectively constructing a video spatial-stream matrix and a video temporal-stream matrix based on the image frames;
acquiring a dual-stream information matrix through the video spatial-stream matrix and the video temporal-stream matrix, and performing feature extraction on the dual-stream information matrix to acquire a dual-stream feature matrix;
inputting the dual-stream feature matrix into a sparse representation model, calculating a sparse coefficient matrix, and acquiring key frame indexes based on the sparse coefficient matrix;
and extracting the key frames in the video file to be extracted through the key frame indexes.
Preferably, constructing the video spatial-stream matrix comprises:
extracting the pixel points in each image frame, arranging them in sequence to obtain the spatial-stream feature vector of each image frame, and combining the spatial-stream feature vectors of all image frames to form the video spatial-stream matrix.
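As a minimal illustration (not the patent's own code), the per-frame flattening described above can be sketched in Python with NumPy. The frame-per-column layout is an assumption, chosen to match the feature matrix Y = [y_1, …, y_N] used later in the description:

```python
import numpy as np

def spatial_stream_matrix(frames):
    # frames: list of equally sized H x W grayscale arrays.
    # Each frame's pixels are read left-to-right, top-to-bottom
    # (row-major) into one spatial-stream feature vector; the
    # vectors are stacked so that column i corresponds to frame i.
    return np.stack([np.asarray(f).reshape(-1) for f in frames], axis=1)

frames = [np.arange(6).reshape(2, 3), np.ones((2, 3))]
S = spatial_stream_matrix(frames)
print(S.shape)  # (6, 2): 6 pixels per frame, 2 frames
```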
Preferably, constructing the video temporal-stream matrix comprises:
extracting the pixel points in the image frames, computing the optical flow between image frames with the Farneback method from the motion of the pixel points across frames, arranging and combining each frame's optical flow in sequence to obtain the temporal-stream feature vector of each image frame, and combining the temporal-stream feature vectors of all image frames to form the video temporal-stream matrix.
Preferably, acquiring the dual-stream information matrix comprises:
concatenating the video spatial-stream matrix and the video temporal-stream matrix column-wise to obtain the dual-stream information matrix.
Preferably, obtaining the dual-stream feature matrix comprises: passing the dual-stream information matrix through a VGG16 network to obtain the dual-stream feature matrix.
Preferably, the sparse representation model is:

f(C) = ||C||_{2,1}   s.t.   ||Y - YC||_F <= τ

wherein f(C) is the sparse representation model function, Y is the dual-stream feature matrix, C is the sparse coefficient matrix, τ is the constraint parameter of the sparse matrix, ||·|| denotes a norm operation, ||·||_F denotes the Frobenius norm, and s.t. denotes the constraint condition.
Preferably, the sparse coefficient matrix is calculated from the sparse representation model as:

C = argmin_C ||C||_{2,1}   s.t.   ||Y - YC||_F <= τ

wherein C is the sparse coefficient matrix, Y is the dual-stream feature matrix, τ is the constraint parameter of the sparse matrix, and ||·|| denotes a norm operation.
Preferably, the solution of the sparse coefficient matrix takes the form:

C = Γ [ I_k  Δ ; 0  0 ]

a block matrix whose first k rows (after permutation) are non-zero, wherein Γ denotes a permutation matrix, I_k is the k-dimensional identity matrix, and the elements of Δ lie in [0, 1).
Preferably, obtaining the key frame indexes based on the sparse coefficient matrix comprises: extracting the non-zero rows of the sparse coefficient matrix; the indexes of these non-zero rows are the key frame indexes.
Preferably, the method further comprises evaluating the quality of the extracted key frames, using the video key frame compression ratio and the F-measure as evaluation indexes, wherein the F-measure measures the accuracy of the extracted key frames;
the video key frame compression ratio is calculated as:

summary length = N_select / N_whole × 100%

wherein summary length is the video key frame compression ratio, N_select is the number of extracted key frames, and N_whole is the total number of frames in the video.
The invention has the beneficial effects that:
the video key frame extraction method based on double-stream information and sparse representation can efficiently extract fewer key frames in one video, reduce the number of the extracted key frames, reduce the compression rate of key frame extraction, well represent the information of the original video, comprehensively extract spatial information and motion information in the video frames by utilizing the double-stream information of the video, and represent the learning method through sparse modeling to obtain accurate key frames and improve the calculation speed of a key frame extraction algorithm; the method is simple to operate, video signals are input, and the video key frames can be obtained through the sequential steps of the video key frame extraction method based on the double-current information and sparse representation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a video key frame extraction method based on dual-stream information and sparse representation according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a video key frame extraction model based on dual-stream information and sparse representation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1-2, the present invention provides a video key frame extraction method based on dual-stream information and sparse representation, which specifically comprises the following.
the video is divided into image frames, each frame is expressed into each row of a matrix to form a video spatial stream matrix, optical streams of the video images are obtained through the video image frames by a Farneback method, and the optical streams of each frame are taken as the rows to form the video temporal stream matrix. And then splicing the spatial stream matrix and the time stream matrix according to columns to obtain a double-stream matrix of the video, and extracting the matrix through VGG16 to obtain a double-stream characteristic matrix Y of the video. The original video can be represented in a matrix form capable of sparse representation by constructing a feature matrix of the video.
In this embodiment, the data set consists of video signals. Each video is composed of a number of video frames, adjacent video frames carry redundant information, and each video frame is an image containing information.
The pixel points of each video frame are extracted, and from the motion of pixel points between frames the optical flow of the video is obtained with the Farneback method. The pixel points are arranged in sequence, left to right and top to bottom, into a spatial-stream feature vector for the corresponding video frame, and the spatial-stream feature vectors of all frames are combined to form the video spatial-stream matrix. Likewise, the optical flow of each frame is arranged, left to right and top to bottom, into the temporal-stream feature vector of the corresponding video frame, and the temporal-stream vectors of all video frames are combined to form the video temporal-stream matrix. The spatial-stream matrix and the temporal-stream matrix are concatenated column-wise to obtain the video's dual-stream matrix, which is passed through VGG16 to obtain the video feature matrix
Y = [y_1, y_2, …, y_i, …, y_N], where y_i (1 ≤ i ≤ N) is the feature vector of the i-th video frame.
A sparse representation model is then constructed; from it a sparse coefficient matrix is obtained, which allows the key frames of the video to be extracted efficiently.
Initialize the matrices Y and D, where Y is the video feature matrix and D is the dictionary matrix. The sparse coefficient matrix C is calculated by substituting Y and D into equation (1):

f(C) = ||C||_{2,1}   s.t.   ||Y - DC||_F <= τ      (1)

wherein f(C) is the sparse representation model function, Y is the video feature matrix, C is the sparse coefficient matrix, D is the dictionary matrix, τ is the constraint parameter of the sparse matrix (generally taken greater than 0), and ||·|| denotes a norm operation.
The original signal matrix can be represented by a linear combination of a few columns of the dictionary: if only the second and fourth rows of the sparse coefficient matrix are non-zero, then the original signal matrix equals the product of the second and fourth columns of the dictionary with the second and fourth rows of the sparse coefficient matrix.
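This row/column relationship can be checked with a tiny NumPy example (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 4))   # dictionary: 4 columns
C = np.zeros((4, 3))              # sparse coefficient matrix
C[1] = [1.0, 0.5, 2.0]            # only the second row ...
C[3] = [0.2, 1.0, 0.0]            # ... and the fourth row are non-zero
Y = D @ C                         # original signal matrix

# Y is reproduced exactly from columns 2 and 4 of D
# and rows 2 and 4 of C:
Y_check = D[:, [1, 3]] @ C[[1, 3], :]
assert np.allclose(Y, Y_check)
```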
Replacing the dictionary matrix D in equation (1) with the video dual-stream signal matrix Y itself yields the sparse representation model of equation (2):

f(C) = ||C||_{2,1}   s.t.   ||Y - YC||_F <= τ      (2)

wherein Y is the video matrix signal, C is the sparse coefficient matrix, and τ is the constraint parameter of the sparse matrix, generally taken greater than 0. By introducing the video feature matrix, the constructed sparse representation model is applied to key frame extraction.
A sparse coefficient matrix is calculated as shown in fig. 2, where the non-zero rows of the sparse coefficient matrix represent the indices of the key frames.
Since the product of the dictionary (i.e., the video matrix) and the sparse coefficient matrix reproduces the video signal matrix, finding the key frames can be cast as an optimized sparse representation problem: solve for the sparse coefficient matrix that satisfies the sparsity constraint. In this embodiment, the sparse coefficient matrix C is computed by equation (3):

C = argmin_C ||C||_{2,1}   s.t.   ||Y - YC||_F <= τ      (3)

wherein Y is the video feature matrix, C is the sparse coefficient matrix, τ is the sparsity constraint coefficient (generally greater than 0), and ||·|| denotes a norm function.
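The patent does not spell out a solver, so the following is a hedged sketch only: it solves the closely related penalized form min_C ½||Y - YC||_F² + λ||C||_{2,1} by proximal gradient descent with row-wise group soft-thresholding, which likewise drives most rows of C to zero:

```python
import numpy as np

def row_soft_threshold(C, t):
    # proximal operator of t * sum_i ||C[i, :]||_2:
    # shrink each row's l2 norm by t, zeroing small rows
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    return C * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def solve_sparse_coefficients(Y, lam=0.5, n_iter=300):
    # proximal gradient for min_C 0.5*||Y - Y C||_F^2 + lam*||C||_{2,1}
    N = Y.shape[1]
    C = np.zeros((N, N))
    G = Y.T @ Y
    step = 1.0 / np.linalg.norm(G, 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = G @ C - G                # gradient of the smooth term
        C = row_soft_threshold(C - step * grad, step * lam)
    return C

rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 12))
C = solve_sparse_coefficients(Y)
print(C.shape)  # (12, 12)
```

With step size 1/L the objective decreases monotonically, so the returned C is never worse than the all-zero start.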
Solving equation (3) yields the result C of equation (4):

C = Γ [ I_k  Δ ; 0  0 ]      (4)

a block matrix whose first k rows (after permutation) are non-zero, wherein Γ denotes a permutation matrix, I_k is the k-dimensional identity matrix, and the elements of Δ lie in [0, 1).
The computed C is a sparse coefficient matrix in which most elements are zero; its non-zero rows give the key frame indexes of the video. If the i-th and j-th rows of C are non-zero, the key frame index set of the video is {i, j}, and the corresponding key frames are the i-th and j-th video frame images. Binding key frames to their indexes in this way locates them quickly and accurately, improving both the accuracy and the efficiency of key frame extraction.
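A minimal sketch of turning the sparse coefficient matrix into key frames (the numerical tolerance is an assumption, not from the patent):

```python
import numpy as np

def extract_keyframes(frames, C, tol=1e-6):
    # rows of C with non-negligible l2 norm are the key frame indexes;
    # the corresponding frames are returned as the video summary
    idx = np.where(np.linalg.norm(C, axis=1) > tol)[0]
    return idx, [frames[i] for i in idx]

C = np.zeros((5, 5))
C[1, :] = 0.7   # rows 2 and 4 (indexes 1 and 3) are non-zero,
C[3, :] = 0.2   # so frames 1 and 3 are the key frames
frames = ["f0", "f1", "f2", "f3", "f4"]
idx, keyframes = extract_keyframes(frames, C)
print(idx, keyframes)  # [1 3] ['f1', 'f3']
```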
To test the effect of the extracted key frames, a test set is used for verification, with the video key frame compression ratio and the F-measure as evaluation indexes. Given a video, its key frame compression ratio (summary length) is defined by equation (5):

summary length = N_select / N_whole × 100%      (5)

wherein N_select is the number of extracted key frames and N_whole is the total number of frames in the video. The key frame compression ratio is the percentage of the video's frames that are extracted as key frames; a smaller value indicates stronger compression.
Meanwhile, the F-measure is adopted in this embodiment to measure the accuracy of the extracted key frames. It is defined by equation (6):

F = 2PR / (P + R)      (6)

wherein P and R are the precision and recall, respectively. A higher F-measure means the extracted key frames better reflect the content of the original video.
To fully extract the motion information hidden in the video frames, the invention introduces optical flow features, extracted from the original images of the video frames, and exploits the sparsity of the sparse representation of the video signal to generate a few (sparse) video key frames that express the motion features. The invention therefore summarizes video data containing substantial motion particularly well. In addition, given only the original video as input, accurate video key frames are obtained through the steps above without any additional operations, making the method convenient and simple.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
Claims (10)
1. A video key frame extraction method based on dual-stream information and sparse representation, characterized by comprising the following steps:
splitting a video file to be extracted to obtain image frames, and respectively constructing a video spatial-stream matrix and a video temporal-stream matrix based on the image frames;
acquiring a dual-stream information matrix through the video spatial-stream matrix and the video temporal-stream matrix, and performing feature extraction on the dual-stream information matrix to acquire a dual-stream feature matrix;
inputting the dual-stream feature matrix into a sparse representation model, calculating a sparse coefficient matrix, and acquiring key frame indexes based on the sparse coefficient matrix;
and extracting the key frames in the video file to be extracted through the key frame indexes.
2. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 1, wherein constructing the video spatial-stream matrix comprises:
extracting the pixel points in each image frame, arranging them in sequence to obtain the spatial-stream feature vector of each image frame, and combining the spatial-stream feature vectors of all image frames to form the video spatial-stream matrix.
3. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 2, wherein constructing the video temporal-stream matrix comprises:
extracting the pixel points in the image frames, computing the optical flow between image frames with the Farneback method from the motion of the pixel points across frames, arranging and combining each frame's optical flow in sequence to obtain the temporal-stream feature vector of each image frame, and combining the temporal-stream feature vectors of all image frames to form the video temporal-stream matrix.
4. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 1, wherein acquiring the dual-stream information matrix comprises:
concatenating the video spatial-stream matrix and the video temporal-stream matrix column-wise to obtain the dual-stream information matrix.
5. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 1, wherein obtaining the dual-stream feature matrix comprises: passing the dual-stream information matrix through a VGG16 network to obtain the dual-stream feature matrix.
6. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 1, wherein the sparse representation model is:

f(C) = ||C||_{2,1}   s.t.   ||Y - YC||_F <= τ

wherein f(C) is the sparse representation model function, Y is the dual-stream feature matrix, C is the sparse coefficient matrix, τ is the constraint parameter of the sparse matrix, ||·|| denotes a norm operation, ||·||_F denotes the Frobenius norm, and s.t. denotes the constraint condition.
7. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 6, wherein the sparse coefficient matrix is calculated from the sparse representation model as:

C = argmin_C ||C||_{2,1}   s.t.   ||Y - YC||_F <= τ

wherein C is the sparse coefficient matrix, Y is the dual-stream feature matrix, τ is the constraint parameter of the sparse matrix, and ||·|| denotes a norm operation.
8. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 7, wherein the solution of the sparse coefficient matrix takes the form:

C = Γ [ I_k  Δ ; 0  0 ]

a block matrix whose first k rows (after permutation) are non-zero, wherein Γ denotes a permutation matrix, I_k is the k-dimensional identity matrix, and the elements of Δ lie in [0, 1).
9. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 1, wherein obtaining the key frame indexes based on the sparse coefficient matrix comprises: extracting the non-zero rows of the sparse coefficient matrix; the indexes of these non-zero rows are the key frame indexes.
10. The method for extracting video key frames based on dual-stream information and sparse representation according to claim 1, further comprising evaluating the quality of the extracted key frames, using the video key frame compression ratio and the F-measure as evaluation indexes, wherein the F-measure measures the accuracy of the extracted key frames;
the video key frame compression ratio is calculated as:

summary length = N_select / N_whole × 100%

wherein summary length is the video key frame compression ratio, N_select is the number of extracted key frames, and N_whole is the total number of frames in the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210616931.9A CN114913466A (en) | 2022-06-01 | 2022-06-01 | Video key frame extraction method based on double-flow information and sparse representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114913466A true CN114913466A (en) | 2022-08-16 |
Family
ID=82770118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210616931.9A Pending CN114913466A (en) | 2022-06-01 | 2022-06-01 | Video key frame extraction method based on double-flow information and sparse representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114913466A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116758494A (en) * | 2023-08-23 | 2023-09-15 | 深圳市科灵通科技有限公司 | Intelligent monitoring method and system for vehicle-mounted video of internet-connected vehicle
CN116758494B (en) * | 2023-08-23 | 2023-12-22 | 深圳市科灵通科技有限公司 | Intelligent monitoring method and system for vehicle-mounted video of internet-connected vehicle
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |