CN107911755B - Multi-video abstraction method based on sparse self-encoder

Multi-video abstraction method based on sparse self-encoder

Info

Publication number
CN107911755B
CN107911755B (application CN201711113383.3A)
Authority
CN
China
Prior art keywords
frame
video
subsets
topic
encoder
Prior art date
Legal status
Active
Application number
CN201711113383.3A
Other languages
Chinese (zh)
Other versions
CN107911755A (en)
Inventor
冀中 (Ji Zhong)
马亚茹 (Ma Yaru)
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201711113383.3A
Publication of CN107911755A
Application granted
Publication of CN107911755B


Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A sparse auto-encoder based multi-video summarization method, comprising: extracting visual features of the video frames; inputting the visual features of the video frames into a sparse auto-encoder, which learns, respectively: the compressed representation of the video frames, i.e. the activations of the hidden-layer neurons, the connection weight between the input layer and the hidden layer, and the connection weight between the hidden layer and the output layer; generating a weight curve from the learned connection weight between the input layer and the hidden layer; selecting each local maximum of the weight curve to form the key frame set; and ordering the key frames to produce the summary. Aiming at the characteristics of existing multi-video summarization data sets, a sparse auto-encoder based multi-video summarization technique suited to those characteristics is designed, so that the specific information of the data is fully utilized with the assistance of effective prior information.

Description

Multi-video abstraction method based on sparse self-encoder
Technical Field
The invention relates to a multi-video summarization method, in particular to a multi-video summarization method based on a sparse self-encoder.
Background
With the rapid development of information technology, video data is emerging in large quantities and has become one of the important ways for people to acquire information. However, the dramatic increase in the number of videos brings a large amount of redundant and repetitive information, which makes it difficult for a user to quickly acquire the desired information. A technology that can integrate and analyze massive video data under the same topic is therefore urgently needed, to meet people's demand for browsing the main information of videos quickly and accurately and to improve their information acquisition capability. As one of the effective ways to solve this problem, multi-video summarization has attracted increasing attention from researchers over the past decades. Multi-video summarization is a content-based video data compression technology: it analyzes and integrates multiple videos on related topics under the same event, extracts their main content, and presents the extracted content to the user according to a certain logical relationship. Currently, a multi-video summary is mainly evaluated from three aspects: 1) coverage; 2) novelty; 3) importance. Coverage means that the extracted content can cover the main content of the multiple videos on the same topic. Novelty means that duplicate, redundant information is removed from the multi-video summary. Importance means that important key shots are extracted from the video set according to some prior information, so as to capture the important content of the multiple videos.
Although many single-video summarization methods have been proposed, research on multi-video summarization is sparser and still at a preliminary stage, mainly for two reasons. 1) The first is the diversity of topics under the same event and the cross-correlation of topics between videos. Topic diversity means that multiple videos of the same event emphasize different information and contain several sub-topics. Topic cross-correlation means that the content of videos under the same event overlaps: the videos share similar content but differ in the amount of information they carry. 2) The second is that, for the same content, the audio information, text information, and visual information expressed by different videos may differ greatly. These factors make it difficult to study multi-video summarization with traditional single-video summarization methods.
In the past decades, multi-video summarization methods have been proposed for the characteristics of multi-video data sets. The multi-video summarization method based on complex-graph clustering is a relatively classical one. It constructs a complex graph from the keywords of the videos' transcript information and the videos' key frames, and realizes the summary using a graph clustering algorithm on that basis. But the method is mainly aimed at news videos, and it loses its meaning for video sets without transcript information. In addition, because the content contained in multiple videos under the same topic is both diverse and redundant, clustering only satisfies the maximum-coverage condition on video content; moreover, the clustering effect is poor when only the visual information of the videos is used, and although combining other modalities helps to a certain extent, it brings higher complexity.
A multi-video collection carries information in multiple modalities, such as the text, visual, and audio information of a video. The Balanced AV-MMR (Balanced Audio-Video Maximal Marginal Relevance) algorithm combines visual and audio information and, following the idea of maximal marginal relevance, designs a multi-video summarization algorithm that iteratively selects key shots.
In recent years, novel methods have been proposed. Among them, realizing multi-video summarization via the visual co-occurrence characteristics (Visual Co-occurrence) of videos is one. This method observes that important visual concepts recur across multiple videos under the same topic; based on this characteristic it proposes a maximal biclique finding algorithm (Maximal Biclique Finding) and extracts the sparse co-occurrence patterns of the videos, thereby realizing multi-video summarization. However, the method only suits specific data sets, and it loses its significance for video sets with little repetition across videos.
In addition, in order to exploit more related information, researchers have proposed using sensors such as the GPS and compass on a mobile phone to acquire information such as the geographic position during video shooting, thereby helping to determine the important information in a video and generate the multi-video summary. Web-image prior information has also been used as auxiliary information in this field to better realize multi-video summarization. At present, owing to the complexity of multi-video data, research on multi-video summarization has not yet achieved ideal results. How to better utilize the information of multi-video data to better realize multi-video summarization has therefore become a research hotspot. To this end, it is proposed herein to implement multi-video summarization using the sparse auto-encoder algorithm (Sparse Autoencoder). The sparse autoencoder is an unsupervised deep learning framework with a three-layer network structure. It learns a compressed representation of the input data by approximating the output to the input through the idea of nonlinear reconstruction. Using this idea, the invention designs an algorithm for extracting key frames and a bottom-up sorting algorithm for the extracted key frames, so that the presentation of the key frames is more logical and the readability of the summary is improved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a sparse self-encoder-based multi-video abstraction method which can effectively utilize video theme related information and improve video browsing efficiency of a user.
The technical scheme adopted by the invention is as follows: a multi-video summarization method based on a sparse self-encoder comprises the following steps:
1) extracting the visual features of the video frames, and expressing them as $X = \{x_1, x_2, \dots, x_i, \dots, x_n\}$, $x_i \in \mathbb{R}^m$, where $x_i$ represents the visual feature of the $i$-th frame;
2) inputting the visual features of the video frames into a sparse self-encoder, and learning through the sparse self-encoder to obtain, respectively: the compressed representation of the video frames, i.e. the activations of the hidden-layer neurons, the connection weight $W^{(1)}$ between the input layer and the hidden layer, and the connection weight $W^{(2)}$ between the hidden layer and the output layer;
3) using the obtained connection weight $W^{(1)}$ between the input layer and the hidden layer to generate a weight curve, i.e. the 2-norm of each column of $W^{(1)}$, formulated as

$$s_i = \left\| W^{(1)}_{\cdot i} \right\|_2, \quad i = 1, 2, \dots, n$$
4) Selecting each local maximum of the weight curve as a key frame set;
5) and sequencing the key frames to realize the abstract.
The visual feature of the video frame in the step 1) is one of a depth feature, a color feature and a visual bag-of-words feature.
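As a concrete illustration of step 1), the following is a minimal NumPy sketch of the color-feature option (one normalized RGB histogram per frame); the function name, the number of bins, and the frame format are illustrative assumptions, not part of the invention:

```python
import numpy as np

def color_histogram_features(frames, bins=8):
    """Sketch of step 1) with color features: one normalized RGB
    histogram per frame. `frames` is a list of HxWx3 uint8 arrays;
    `bins` per channel is an illustrative choice."""
    feats = []
    for frame in frames:
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256),) * 3)
        hist = hist.ravel()
        feats.append(hist / hist.sum())   # normalize to unit mass
    return np.stack(feats)                # X of shape (n, bins**3); rows are x_i
```

Depth features (e.g. activations of a pretrained CNN) or visual bag-of-words features would replace this function while keeping the same output layout.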
The local processing in step 4) is to divide the video frame index corresponding to the abscissa of the weight curve into a plurality of local spaces according to a set interval, and to use the frame corresponding to the maximum value of the weight curve in each local space as a key frame.
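Steps 3) and 4) admit a short sketch: compute the weight curve as the per-frame column norms of $W^{(1)}$ and take one local maximum per window. This assumes the arrangement described below in which each input-layer neuron corresponds to one video frame, so $W^{(1)}$ has one column per frame; the window length stands in for the "set interval" and is an illustrative choice:

```python
import numpy as np

def select_key_frames(W1, interval=30):
    """Sketch of steps 3)-4): weight curve from the 2-norm of each
    column of W^(1) (one column per frame), then one local maximum
    per window of `interval` frames."""
    curve = np.linalg.norm(W1, axis=0)            # s_i = ||W1[:, i]||_2
    key_frames = [start + int(np.argmax(curve[start:start + interval]))
                  for start in range(0, curve.size, interval)]
    return key_frames, curve
```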
Step 5) comprises the following steps:
(1) dividing a key frame set containing k elements into k subsets;
(2) respectively calculating the temporal relevance between every two of the k subsets to obtain the temporal relevance vector $F_{chro}$ of the k subsets; the temporal relevance between each two subsets, i.e. any one element of the vector $F_{chro}$, is calculated as:

$$f_{chro}(A \succ B) = \begin{cases} 1, & V(a_l) = V(b_1)\ \text{and}\ N(a_l) < N(b_1) \\ \dfrac{1}{1 + \left| T(a_l) - T(b_1) \right|}, & V(a_l) \neq V(b_1) \\ 0, & \text{otherwise} \end{cases}$$
wherein A and B represent any two of the k subsets; $a_l$ represents the last frame in set A, and $b_1$ represents the first frame in set B; $T(a_l)$ represents the time of frame $a_l$; $V(a_l) = V(b_1)$ indicates that frame $a_l$ and frame $b_1$ belong to the same video, while $V(a_l) \neq V(b_1)$ indicates that they do not belong to the same video; $N(a_l) < N(b_1)$ indicates that frame $a_l$ appears earlier than frame $b_1$ in the same video; $f_{chro}(A \succ B)$ represents the temporal relevance of ranking set A before set B;
(3) calculating the topic closeness between every two of the k subsets to obtain the topic closeness vector $F_{topic}$ of the k subsets; the topic closeness between each two subsets, i.e. any one element of the vector $F_{topic}$, is calculated as:

$$f_{topic}(A \succ B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} sim(a, b)$$
where $sim(a, b)$ denotes the cosine similarity between an arbitrary frame $a$ belonging to set A and an arbitrary frame $b$ belonging to set B, and $f_{topic}(A \succ B)$ represents the topic closeness of ranking set A before set B;
(4) superposing the temporal relevance vector and the topic closeness vector to obtain the relevance vector F of the k subsets, calculated as:

$$F = F_{chro} + F_{topic}$$
The key frames are then sorted according to the relevance vector of the k subsets: first, the two subsets with the maximum relevance are selected and combined into a new set, and then the remaining subsets are combined pairwise in order of relevance to form several new sets;
(5) repeating the calculation of the steps (2), (3) and (4) on all the generated new sets until all the video frames are contained in one set, and ending the iteration;
(6) sequencing the video frames in the set obtained in step (5) according to their index order to realize the summary.
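The sorting procedure of step 5) can be sketched as the following bottom-up merge; the case form of $f_{chro}$ follows the definitions above, while the exact constants and the one-merge-per-iteration greedy loop (rather than merging all remaining pairs at once) are simplifying assumptions:

```python
import numpy as np

def order_key_frames(key_frames, video_id, frame_no, time, feats):
    """Sketch of step 5): bottom-up merging of key-frame subsets by
    F = F_chro + F_topic. video_id[i], frame_no[i], time[i] give the
    video, frame number, and time of frame i; feats[i] is its feature."""
    def f_chro(A, B):                      # temporal relevance of A before B
        al, b1 = A[-1], B[0]               # last frame of A, first frame of B
        if video_id[al] == video_id[b1]:
            return 1.0 if frame_no[al] < frame_no[b1] else 0.0
        return 1.0 / (1.0 + abs(time[al] - time[b1]))

    def f_topic(A, B):                     # mean pairwise cosine similarity
        return float(np.mean([np.dot(feats[a], feats[b])
                              / (np.linalg.norm(feats[a]) * np.linalg.norm(feats[b]))
                              for a in A for b in B]))

    sets = [[f] for f in key_frames]       # (1) k singleton subsets
    while len(sets) > 1:                   # (5) iterate until one set remains
        best, pair = -np.inf, None
        for i in range(len(sets)):         # (2)-(4) score every ordered pair
            for j in range(len(sets)):
                if i != j:
                    score = f_chro(sets[i], sets[j]) + f_topic(sets[i], sets[j])
                    if score > best:
                        best, pair = score, (i, j)
        i, j = pair                        # merge the most relevant pair, A before B
        merged = sets[i] + sets[j]
        sets = [s for k, s in enumerate(sets) if k not in (i, j)] + [merged]
    return sets[0]                         # (6) final presentation order
```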
The multi-video summarization method based on a sparse self-encoder of the invention is designed for the characteristics of existing multi-video summarization data sets, so that it fully utilizes the specific information of the data with the assistance of effective prior information. Its main advantages are as follows:
(1) Novelty: the sparse self-encoder method is applied to multi-video summarization for the first time, and a bottom-up key frame ordering algorithm is provided.
(2) Effectiveness: compared with a typical clustering method and with the minimum sparse reconstruction method used in single-video summarization, the sparse self-encoder based multi-video summarization method of the invention is shown to perform significantly better, and is therefore more suitable for the multi-video summarization problem.
(3) Practicability: the method is simple and feasible, and can be used in the field of multimedia information processing.
Drawings
Fig. 1 is a flow chart of a sparse auto-encoder based multi-video summarization method of the present invention.
Detailed Description
The following describes a sparse auto-encoder based multi-video summarization method according to the present invention in detail with reference to the following embodiments and the accompanying drawings.
The invention discloses a multi-video summarization method based on a sparse self-encoder, which aims to obtain a compressed representation of the input video frames, provide short and essential video content to the user, and improve the user's video browsing efficiency. A sparse self-encoder can be viewed as a nonlinear reconstruction process that obtains a compressed representation of the input data. The invention therefore applies the sparse self-encoder to multi-video summarization and designs a key frame selection algorithm based on the learned compressed representation. Here the input-layer neurons represent the set of video frames, and the hidden-layer neurons represent the compressed representation of the input data to be learned, also called a dictionary, used to reconstruct the input vectors. The output layer has the same number of neurons as the input layer and represents an approximation of the input.
The sparse self-encoder is a three-layer neural network with one hidden layer, and is an unsupervised deep learning algorithm. The algorithm attempts to approximate an identity function, i.e. to make the output approximately equal to the input. In order to reproduce the input signal as faithfully as possible at the output, the self-encoder must capture the most important factors that represent the input data, finding the principal components that can represent the original information. This process can be viewed as automatically obtaining a compressed representation of the input data.
Sparsity can be explained simply as follows: if a neuron is considered activated when its output is close to 1 and inhibited otherwise, then the constraint that each neuron be inhibited most of the time is called the sparsity constraint.
The specific principle of the sparse self-encoder is as follows:
given a set of input video frames X ═ X1,x2,....,xn},xi∈RmCharacterizing visual features of video frames, W(1)Representing the connection weight of the input layer and the hidden layer, W(2)Connection weight, h, representing hidden and output layers(W,b)(x) Representing the output. As used herein
Figure BDA0001464186350000041
Represents the output of the ith neuron of layer 2, i.e. hidden layer:
Figure BDA0001464186350000042
the activation function f is a sigmoid function, and a nonlinear element is introduced, as shown in formula (2):
Figure BDA0001464186350000043
the goal of the self-encoder is to approximate the input by the output, so its objective function is:
Figure BDA0001464186350000044
where b denotes a bias vector.
An auto-encoder typically requires the number $s_2$ of hidden-layer units to be smaller than the number of input-layer neurons; when the number of hidden units approaches, or is even larger than, the number of input-layer neurons, the self-encoder can still learn a compressed representation of the input data by imposing a sparsity condition on the hidden layer. The invention adopts the KL divergence to control sparsity, with the specific expression:

$$\sum_{j=1}^{s_2} KL(\rho \,\|\, \hat{\rho}_j) \quad (4)$$

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \quad (5)$$
here

$$\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} a_j^{(2)}(x_i)$$

represents the average activation of the $j$-th neuron of the hidden layer, and $\rho$ is the sparsity parameter, a constant close to zero.
The total cost function is then expressed as follows:

$$J_{sparse}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} KL(\rho \,\|\, \hat{\rho}_j) \quad (6)$$
where β is an adjustable parameter for balancing the two terms.
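To make formulas (1)-(6) concrete, here is a minimal NumPy sketch of the sparse auto-encoder cost and its gradients; the function name, the parameter defaults, and the plain gradient-descent usage are illustrative assumptions rather than the invention's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(W1, b1, W2, b2, X, rho=0.05, beta=3.0):
    """Cost J_sparse of formulas (1)-(6) and its gradients.
    Each column of X is one training sample; in the patent's setup the
    input-layer neurons index the video frames."""
    n = X.shape[1]
    A2 = sigmoid(W1 @ X + b1)                 # hidden activations, formulas (1)-(2)
    A3 = sigmoid(W2 @ A2 + b2)                # output, approximates X
    rho_hat = A2.mean(axis=1, keepdims=True)  # average activation per hidden unit

    recon = 0.5 * np.sum((A3 - X) ** 2) / n   # reconstruction term, formula (3)
    kl = np.sum(rho * np.log(rho / rho_hat)   # KL sparsity penalty, formulas (4)-(5)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    cost = recon + beta * kl                  # total cost, formula (6)

    # Backpropagation; the sparsity term enters the hidden-layer delta.
    d3 = (A3 - X) * A3 * (1 - A3)
    sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    d2 = (W2.T @ d3 + sparse_grad) * A2 * (1 - A2)
    return cost, (d2 @ X.T / n, d2.mean(axis=1, keepdims=True),
                  d3 @ A2.T / n, d3.mean(axis=1, keepdims=True))

# Usage sketch: a few gradient-descent steps on random data.
rng = np.random.default_rng(0)
n_in, s2, n_samples = 120, 30, 400            # illustrative sizes
X = rng.random((n_in, n_samples))
W1 = 0.01 * rng.standard_normal((s2, n_in)); b1 = np.zeros((s2, 1))
W2 = 0.01 * rng.standard_normal((n_in, s2)); b2 = np.zeros((n_in, 1))
for _ in range(100):
    cost, (gW1, gb1, gW2, gb2) = sparse_autoencoder_cost(W1, b1, W2, b2, X)
    W1 -= 0.5 * gW1; b1 -= 0.5 * gb1; W2 -= 0.5 * gW2; b2 -= 0.5 * gb2
```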
As shown in fig. 1, the sparse auto-encoder based multi-video summarization method of the present invention comprises the following steps:
1) extracting the visual features of the video frames, and expressing them as $X = \{x_1, x_2, \dots, x_i, \dots, x_n\}$, $x_i \in \mathbb{R}^m$, where $x_i$ represents the visual feature of the $i$-th frame; the visual feature of a video frame is one of a depth feature, a color feature, and a visual bag-of-words feature.
2) inputting the visual features of the video frames into a sparse self-encoder, and learning through the sparse self-encoder to obtain, respectively: the compressed representation of the video frames, i.e. the activations of the hidden-layer neurons, the connection weight $W^{(1)}$ between the input layer and the hidden layer, and the connection weight $W^{(2)}$ between the hidden layer and the output layer;
3) using the obtained connection weight $W^{(1)}$ between the input layer and the hidden layer to generate a weight curve, i.e. the 2-norm of each column of $W^{(1)}$, formulated as

$$s_i = \left\| W^{(1)}_{\cdot i} \right\|_2, \quad i = 1, 2, \dots, n$$
4) Selecting each local maximum of the weight curve as a key frame set;
the local means that the video frame index corresponding to the abscissa of the weight curve is divided into a plurality of local spaces according to a set interval, and a frame corresponding to the maximum value of the weight curve in each local space is used as a key frame.
5) Sorting the key frames to realize the abstract, comprising the following steps:
(1) dividing a key frame set containing k elements into k subsets;
(2) respectively calculating the temporal relevance between every two of the k subsets to obtain the temporal relevance vector $F_{chro}$ of the k subsets; the temporal relevance between each two subsets, i.e. any one element of the vector $F_{chro}$, is calculated as:

$$f_{chro}(A \succ B) = \begin{cases} 1, & V(a_l) = V(b_1)\ \text{and}\ N(a_l) < N(b_1) \\ \dfrac{1}{1 + \left| T(a_l) - T(b_1) \right|}, & V(a_l) \neq V(b_1) \\ 0, & \text{otherwise} \end{cases}$$
wherein A and B represent any two of the k subsets; $a_l$ represents the last frame in set A, and $b_1$ represents the first frame in set B; $T(a_l)$ represents the time of frame $a_l$; $V(a_l) = V(b_1)$ indicates that frame $a_l$ and frame $b_1$ belong to the same video, while $V(a_l) \neq V(b_1)$ indicates that they do not belong to the same video; $N(a_l) < N(b_1)$ indicates that frame $a_l$ appears earlier than frame $b_1$ in the same video; $f_{chro}(A \succ B)$ represents the temporal relevance of ranking set A before set B;
(3) calculating the topic closeness between every two of the k subsets to obtain the topic closeness vector $F_{topic}$ of the k subsets; the topic closeness between each two subsets, i.e. any one element of the vector $F_{topic}$, is calculated as:

$$f_{topic}(A \succ B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} sim(a, b)$$
where $sim(a, b)$ denotes the cosine similarity between an arbitrary frame $a$ belonging to set A and an arbitrary frame $b$ belonging to set B, and $f_{topic}(A \succ B)$ represents the topic closeness of ranking set A before set B;
(4) superposing the temporal relevance vector and the topic closeness vector to obtain the relevance vector F of the k subsets, calculated as:

$$F = F_{chro} + F_{topic}$$
The key frames are then sorted according to the relevance vector of the k subsets: first, the two subsets with the maximum relevance are selected and combined into a new set, and then the remaining subsets are combined pairwise in order of relevance to form several new sets;
(5) repeating the calculation of the steps (2), (3) and (4) on all the generated new sets until all the video frames are contained in one set, and ending the iteration;
(6) sequencing the video frames in the set obtained in step (5) according to their index order to realize the summary.

Claims (3)

1. A multi-video summarization method based on a sparse self-encoder is characterized by comprising the following steps:
1) extracting the visual features of the video frames, and expressing them as $X = \{x_1, x_2, \dots, x_i, \dots, x_n\}$, $x_i \in \mathbb{R}^m$; $x_i$ represents the visual feature of the $i$-th frame;
2) inputting the visual features of the video frames into a sparse self-encoder, and learning through the sparse self-encoder to obtain, respectively: the compressed representation of the video frames, i.e. the activations of the hidden-layer neurons, the connection weight $W^{(1)}$ between the input layer and the hidden layer, and the connection weight $W^{(2)}$ between the hidden layer and the output layer;
3) using the obtained connection weight $W^{(1)}$ between the input layer and the hidden layer to generate a weight curve, i.e. the 2-norm of each column of $W^{(1)}$, formulated as

$$s_i = \left\| W^{(1)}_{\cdot i} \right\|_2, \quad i = 1, 2, \dots, n$$
4) Selecting each local maximum of the weight curve as a key frame set;
5) sorting the key frames to realize the abstract, comprising the following steps:
(1) dividing a key frame set containing k elements into k subsets;
(2) respectively calculating the temporal relevance between every two of the k subsets to obtain the temporal relevance vector $F_{chro}$ of the k subsets; the temporal relevance between each two subsets, i.e. any one element of the vector $F_{chro}$, is calculated as:

$$f_{chro}(A \succ B) = \begin{cases} 1, & V(a_l) = V(b_1)\ \text{and}\ N(a_l) < N(b_1) \\ \dfrac{1}{1 + \left| T(a_l) - T(b_1) \right|}, & V(a_l) \neq V(b_1) \\ 0, & \text{otherwise} \end{cases}$$
wherein A and B represent any two of the k subsets; $a_l$ represents the last frame in set A, and $b_1$ represents the first frame in set B; $T(a_l)$ represents the time of frame $a_l$; $V(a_l)$ represents the video in which frame $a_l$ lies, and $N(a_l)$ represents the rank, i.e. the frame number, of frame $a_l$ in that video; $V(b_1)$ and $N(b_1)$ are defined likewise for frame $b_1$; $V(a_l) = V(b_1)$ indicates that frame $a_l$ and frame $b_1$ belong to the same video, while $V(a_l) \neq V(b_1)$ indicates that they do not belong to the same video; $N(a_l) < N(b_1)$ indicates that frame $a_l$ appears earlier than frame $b_1$ in the same video; $f_{chro}(A \succ B)$ represents the temporal relevance of ranking set A before set B;
(3) calculating the topic closeness between every two of the k subsets to obtain the topic closeness vector $F_{topic}$ of the k subsets; the topic closeness between each two subsets, i.e. any one element of the vector $F_{topic}$, is calculated as:

$$f_{topic}(A \succ B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} sim(a, b)$$
where $sim(a, b)$ denotes the cosine similarity between an arbitrary frame $a$ belonging to set A and an arbitrary frame $b$ belonging to set B, and $f_{topic}(A \succ B)$ represents the topic closeness of ranking set A before set B;
(4) superposing the temporal relevance vector and the topic closeness vector to obtain the relevance vector F of the k subsets, calculated as:

$$F = F_{chro} + F_{topic}$$
The key frames are then sorted according to the relevance vector of the k subsets: first, the two subsets with the maximum relevance are selected and combined into a new set, and then the remaining subsets are combined pairwise in order of relevance to form several new sets;
(5) repeating the calculation of the steps (2), (3) and (4) on all the generated new sets until all the video frames are contained in one set, and ending the iteration;
(6) sequencing the video frames in the set obtained in step (5) according to their index order to realize the summary.
2. The sparse auto-encoder based multi-video summarization method according to claim 1, wherein the visual feature of the video frame of step 1) is one of a depth feature, a color feature and a visual bag-of-words feature.
3. The sparse auto-encoder based multi-video summarization method according to claim 1, wherein "local" in step 4) means that the video frame indices on the abscissa of the weight curve are divided into a plurality of local spaces at a set interval, and the frame corresponding to the maximum of the weight curve within each local space is taken as the key frame.
CN201711113383.3A 2017-11-10 2017-11-10 Multi-video abstraction method based on sparse self-encoder Active CN107911755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711113383.3A CN107911755B (en) 2017-11-10 2017-11-10 Multi-video abstraction method based on sparse self-encoder


Publications (2)

Publication Number Publication Date
CN107911755A (en) 2018-04-13
CN107911755B (en) 2020-10-20

Family

ID=61844876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711113383.3A Active CN107911755B (en) 2017-11-10 2017-11-10 Multi-video abstraction method based on sparse self-encoder

Country Status (1)

Country Link
CN (1) CN107911755B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857906B (en) * 2019-01-10 2023-04-07 天津大学 Multi-video abstraction method based on query unsupervised deep learning
CN110110636B (en) * 2019-04-28 2021-03-02 清华大学 Video logic mining device and method based on multi-input single-output coding and decoding model
CN110298270B (en) * 2019-06-14 2021-12-31 天津大学 Multi-video abstraction method based on cross-modal importance perception
CN113008559B (en) * 2021-02-23 2022-02-22 西安交通大学 Bearing fault diagnosis method and system based on sparse self-encoder and Softmax

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN104331442A (en) * 2014-10-24 2015-02-04 华为技术有限公司 Video classification method and device
CN106993240A (en) * 2017-03-14 2017-07-28 天津大学 Many video summarization methods based on sparse coding
CN107203636A (en) * 2017-06-08 2017-09-26 天津大学 Many video summarization methods based on the main clustering of hypergraph

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8204359B2 (en) * 2007-03-20 2012-06-19 At&T Intellectual Property I, L.P. Systems and methods of providing modified media content


Also Published As

Publication number Publication date
CN107911755A (en) 2018-04-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant