CN107943990B

CN107943990B - Multi-video abstraction method based on prototype analysis technology with weight

Info

Publication number: CN107943990B
Application number: CN201711249015.1A
Authority: CN
Inventors: 冀中; 江俊杰
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2020-02-14
Anticipated expiration: 2037-12-01
Also published as: CN107943990A

Abstract

The invention relates to the technical field of video processing, and provides a multi-video abstraction technology based on a prototype analysis method with weight, which is suitable for the characteristics, so that the specific information of data is fully utilized under the assistance of effective prior information. The invention adopts the technical scheme that a multi-video abstract method based on a weighted prototype analysis technology is characterized in that firstly, a graph model with weights is used for modeling the relation between video frames, so as to obtain a weight matrix required by the weighted prototype analysis; then, a weighted prototype analysis is used to obtain key frames, and a video summary with a given length is generated. The invention is mainly applied to video processing occasions.

Description

Multi-video abstraction method based on prototype analysis technology with weight

Technical Field

The invention relates to the technical field of video processing, in particular to a multi-video abstraction method based on a prototype analysis technology with weight.

Background

With the rapid development of information technology, video data is emerging in large quantities, and becomes one of important ways for people to acquire information. However, due to the dramatic increase in the number of videos, redundant and repetitive information occurs in a large amount of video data, which makes it difficult for a user to quickly acquire desired information. Therefore, under the circumstances, a technology capable of integrating and analyzing mass video data under the same theme is urgently needed to meet the requirement that people want to browse main information of videos quickly and accurately and improve the information acquisition capability of people. The multiple video summarization technique has attracted increasing researchers' attention over the past several decades as one of the effective ways to solve the above-mentioned problems. The multi-video abstraction technology is a content-based video data compression technology, and aims to analyze and integrate a plurality of videos of related topics under the same event, extract main contents in the plurality of videos, and present the extracted contents to a user according to a certain logical relationship. Currently, the multi-video summary is mainly analyzed from three aspects: 1) coverage rate; 2) novelty; 3) the importance of which. Coverage refers to the fact that the extracted video content can cover the main content of multiple videos on the same topic. Redundancy refers to removing duplicate, redundant information in a multi-video summary. The importance refers to extracting important key shots in a video set according to some prior information so as to extract important contents in a plurality of videos.

Although many single video summarizations have been proposed, the research on the multi-video summarization method is less and still in the preliminary stage. This is mainly due to two reasons: 1) one is due to the diversity of multiple video topics under the same event and the cross-correlation of topics between videos. The theme diversity means that information emphasis points of a plurality of videos in the same event are different and a plurality of sub-themes are provided. The topic cross-over refers to that the content of the videos under the same event has cross-over, and the videos have similar content and different information content. 2) Secondly, the audio information, text information and visual information may have a large difference due to the audio information expressed by the multiple video data to the same content. These reasons make the study of multiple video summaries difficult with traditional single video summaries.

In the past decades, methods for multi-video summarization have been proposed for the features of multi-video data sets. The multi-video summarization method based on complex graph clustering is a relatively classical method. The method comprises the steps of constructing a complex graph by extracting key words of corresponding script information of the video and key frames of the video, and realizing abstract by utilizing a graph clustering algorithm on the basis. However, the method mainly aims at news videos, the method loses significance for video sets without video script information, in addition, because the content of a plurality of videos under the same theme has diversity and redundancy, the maximum coverage condition of the video content is met only by using a clustering method, the clustering effect is poor only by using the visual information of the videos aiming at multi-video abstract, and the complexity is higher although the method is combined with other modes to have certain help.

The multi-video abstract has information of multiple modalities, such as text information, visual information, audio information, and the like of a video. The Balanced AV-MMR (Balanced Audio Video maximum local retrieval) is a multi-Video summarization technology which effectively utilizes Video multi-modal information, and analyzes visual information, Audio information and semantic information in the visual information and the Audio information of a Video, wherein the semantic information comprises Audio, human face, time characteristics and the like which have important significance for Video summarization. The method effectively utilizes the multi-modal information of the video, but the extracted video abstract does not achieve a good effect.

In recent years, novel methods have been proposed. Among them, the realization of multi-video summarization by using the visual co-occurrence characteristics (visualCo-occurrence) of video is a novel method. The method considers that important visual concepts are frequently repeatedly appeared in a plurality of videos under the same theme, and provides a maximum binary search algorithm (maximum binary matching) according to the characteristic to extract a sparse co-occurrence mode of the videos, so that multi-video abstraction is realized. However, the method is only suitable for a specific data set, and the method loses significance for a video set with small repeatability in the video.

In addition, in order to utilize more related information, related researchers have proposed that sensors such as a GPS and a compass on a mobile phone are used to acquire information such as a geographic position during a mobile phone video shooting process, and thus assist in determining important information in a video and generating a multi-video summary. In addition, the prior information of the webpage picture is used as auxiliary information in the field, and multi-video abstraction is better realized. At present, due to the complexity of multi-video data, the research of multi-video summarization does not achieve the ideal effect. Therefore, how to better utilize the information of the multi-video data to better realize the multi-video summarization becomes a hot spot of research of relevant scholars at present. To this end, it is proposed herein to implement multiple video summarization using a prototype Analysis technique (Archetypal Analysis).

Archetypal Analysis (AA) treats each data point in a dataset as a blended result of a set of single, observable archetypes, while archetypes themselves are limited to a sparse mixture of data points in a dataset, and are typically located at the boundaries of a dataset. AA models are widely used in different fields, such as economics, astrophysics and pattern recognition. The usefulness of AA models for feature extraction and dimensionality reduction is exploited by machine learning algorithms in various fields, such as computer vision, neural images, chemistry, text mining and collaborative filtering.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a multi-video abstraction technology which is suitable for the characteristic and is based on a prototype analysis method with weight, so that the specific information of data is fully utilized under the assistance of effective prior information. The invention adopts the technical scheme that a multi-video abstract method based on a weighted prototype analysis technology is characterized in that firstly, a graph model with weights is used for modeling the relation between video frames, so as to obtain a weight matrix required by the weighted prototype analysis; then, a weighted prototype analysis is used to obtain key frames, and a video summary with a given length is generated.

The method comprises the following specific steps of obtaining a weight matrix required by prototype analysis with weight:

constructing a weighted simple graph, giving l videos under the same event, preprocessing the videos to obtain n candidate key frames, and expressing the candidate key frames as feature vectors X ═ { f ═ f₁,f₂,f₃,...,f_n},f_i∈R^m，f_iThe method comprises the steps of representing m-dimensional feature vectors of an ith candidate key frame, constructing a visual similarity graph G (X, E, W) by taking the candidate key frame as a vertex, wherein X represents the vertex, E represents a connecting edge between video frames, W represents a visual connecting weight of the edge, and firstly calculating cosine similarity A (f) between the video frames in order to calculate W_i,f_j) The calculation formula is as in equation (1):

where sim (i, j) denotes the cosine similarity between the ith frame and the jth network image;

constructing a graph model with weights, additionally adding a weight to a connecting edge between video frames of cross videos by utilizing the similarity between the videos, and designing a weight matrix W to present the relation_vThe specific calculation is as in equation (2):

where v (f) denotes a video containing frame f, sim (v (f)_i),v(f_j) Is indicative of containing frame f_iAnd comprises a frame f_jThe similarity refers to cosine similarity obtained according to text information of the video, the expression given above only adds weight to the connecting edge between frames crossing the video, and the weight of the connecting edge between frames in the video remains unchanged;

calculating the average similarity between the video frame and all the network images, and using the similarity as an importance standard of the video frame, wherein the specific calculation mode is as shown in formula (3):

wherein g is_jRepresents the j network image, sim (f)_i,g_j) Representing video frames f_iAnd g_jCosine similarity of (d);

the calculation of the connection weight matrix W of the edges of the constructed weighted graph model is shown in equation (4):

the specific steps in one example are as follows:

1) extracting visual features of the video frames and the network images based on the query and text features corresponding to the videos: visual features of a video frame are denoted X ═ f₁,f₂,f₃,...,f_n},f_i∈R^mVisual characteristics of the network image are expressed as { g }₁,g₂,...,g_k},g_k∈R^m，g_kRepresenting the m-dimensional feature vector of the k network image, and the text feature of the video is represented as t₁,t₂,...,t_l},t_a∈R^d，t_aText features representing the a-th video;

2) construct a weighted complete graph: in order to model the correlation relationship between video frames, a weighted simple graph G is constructed by regarding the video frames as vertexes, and a matrix W is solved by using formulas (1) - (4);

3) using the weight matrix W obtained in step 2 as the weight of prototype analysis problem, and using formula

Constructing an input matrix

4) At a given point

Performing prototype analysis with weight, and alternately obtaining optimal solution matrixes P and Q by using an estimation algorithm, wherein P represents a coefficient matrix of prototype reconstruction input, and Q represents a coefficient matrix of input reconstruction prototype;

5) according to the formula

Calculate the importance score S for each prototype_i；

6) Sorting the prototypes in a descending manner, and selecting the prototypes with the importance scores larger than a certain threshold value epsilon.

7) Starting from the prototype with the maximum importance score, selecting a video frame corresponding to the row number corresponding to the maximum element score in the column of Q corresponding to the prototype, judging the similarity between the frame and all the previously selected frames, and if the similarity is greater than a threshold value, not including the frame in the abstract; if all prototypes do not reach the length of the abstract after the iteration of the process, performing a next round of selection process, and selecting a key frame by selecting the row number corresponding to the second maximum value from each column of Q; the above process is then iterated until the required digest length is met.

The invention has the characteristics and beneficial effects that:

the invention mainly aims at the characteristics of the existing multi-video abstract data set, designs a multi-video abstract technology which is suitable for the characteristics and is based on a prototype analysis method with weight, and makes full use of the specific information of the data under the assistance of effective prior information. The main advantages are mainly as follows:

(1) the novelty is as follows: the prototype analysis method with the weight is applied to the multi-video abstract facing the query for the first time. And the text information of the video and the network image information based on the query are introduced into the multi-video abstract together by utilizing the weighted graph model to model the relationship between the video frames.

(2) Effectiveness: compared with a typical clustering method and a minimum sparse reconstruction method applied to single video abstraction, the performance of the multi-video abstraction method based on prototype analysis with weight designed by the invention is obviously superior to that of the clustering method and the minimum sparse reconstruction method, so that the method is more suitable for the multi-video abstraction problem.

(3) The practicability is as follows: the method is simple and feasible and can be used in the field of multimedia information processing.

Description of the drawings:

fig. 1 is a flow chart of video key shot extraction based on a weighted prototype analysis method provided by the present invention.

Detailed Description

The invention aims at the characteristics of more redundant information and repeated information of multimedia video data, combines visual information, text information and other prior information related to subjects of videos, and utilizes the prototype analysis idea to improve the traditional multi-video summarization method, thereby achieving the purposes of effectively utilizing the related information of the video subjects and improving the video browsing efficiency of users.

The method provided by the invention mainly comprises the following steps: 1) a weighted graph model is first designed for constructing associations between sentences. 2) And then designing a key frame selection method suitable for the characteristics of the query-oriented multi-video abstract data set by utilizing a prototype analysis technology with weight.

Archetypal Analysis (AA) treats each data point in a dataset as a blended result of a set of single, observable archetypes, while archetypes themselves are limited to a sparse mixture of data points in a dataset, and are typically located at the boundaries of a dataset.

Given an n × m matrix X ═ f₁,f₂,...f_i,...,f_n},f_i∈R^mAnd z < n, the prototype analysis problem factors the matrix W into two random matrices P ∈ R^n×zAnd Q ∈ R^n×zP represents the coefficient matrix of the prototype reconstruction input and Q represents the coefficient matrix of the input reconstruction prototype, as follows:

X≈PA with A＝X^TQ (4)

the prototype analysis algorithm first initializes matrices P and Q to compute prototype matrix a, and then updates P and Q with equation (5) until the residual sum of squares RSS converges to a sufficiently small value or the maximum number of iterations is reached.

The prototype analysis problem described above treats all video frames as frames with the same weight, and each data point (video frame) and its corresponding residual are normalized by the same weight to obtain a prototype using equation (5). In the multi-video summary, the video frames are not identical, and there is an importance point between them. The invention will therefore use weighted prototype analysis to obtain key frames.

Firstly, the invention utilizes a weighted graph model to model the relationship between video frames, thereby obtaining a weight matrix required by weighted prototype analysis.

In order to model the relationship between video frames, the invention constructs a simple graph with weights. Giving l videos under the same event, preprocessing the videos to obtain n candidate key frames X-f₁,f₂,f₃,...,f_n},f_i∈R^m. The invention takes the candidate key frames as the vertexes to construct a visual similarity graph G ═ X, E and W, wherein X represents the vertexes, E represents the connecting edges between the video frames, and W represents the visual connecting weight of the edges. To calculate W, the present invention first calculates the cosine similarity A (f) between video frames_i,f_j) The calculation formula is as in equation (1):

where sim (i, j) denotes the cosine similarity between the ith frame and the jth network image.

It is observed that distinguishing the inter-frame similarity relationship within a video from the inter-frame similarity relationship between videos helps to improve the quality of the multi-video summary. Therefore, in order to reflect the influence of the relationship between videos on the similarity relationship between frames, a weighted graph model is constructed. The invention will take advantage of the similarity between videos to add an additional weight to the connecting edge between video frames across videos. To present this relationship, the present invention designs a weight matrix W_vThe specific calculation is as in equation (2):

where v (f) represents a video containing frame f. sim (v (f)_i),v(f_j) Is indicative of containing frame f_iAnd comprises a frame f_jThe similarity refers to cosine similarity obtained according to text information of the videos. The expression given above only adds weights across the connecting edges between frames of the video, while the connecting edge weights of frames within the video remain unchanged.

Recently, as more and more user-generated information available on websites, such as images, videos, etc., is available, it is a natural idea to use this external information as an aid to generate summaries. We treat the query image as a priori information to obtain important content of the video. The query image is uploaded as complementary information to the video after being carefully selected by the user, thus presenting the main content of the event in a more semantic way and having less redundant and noisy information relative to the video. All this indicates that the query picture as a priori information facilitates the generation of the multi-video summary. Therefore, the invention firstly calculates the average similarity between the video frame and all the network images, and takes the similarity as the importance standard of the video frame, and the specific calculation mode is as shown in formula (3):

wherein g is_jRepresents the j network image, sim (f)_i,g_j) Representing video frames f_iAnd g_jCosine similarity of (1), W_qRepresenting video frames f_iThe sum of cosine similarities with all network images.

The calculation of the connection weight matrix W of the edges of the thus constructed weighted graph model is as shown in equation (4):

after the weight matrix W is obtained, the method obtains the key frame by applying a prototype analysis technology with weight. The weighted prototype analysis problem can be considered as a minimization problem:

this problem can also be rewritten as:

therefore, the multi-video abstract method based on Weighted archetype analysis mainly comprises three stages of early-stage data preparation, solving of a weight matrix required by the archetype analysis by using a Weighted graph model and solving of the Weighted archetype analysis.

FIG. 1 depicts a flow diagram for extracting key shots in a video using a weighted-based prototype analysis method in conjunction with web page image prior information. The main idea of this method is to soft-cluster (soft-clustered) video frames into weighted prototypes (archetypes), then sort the video frames according to the prototypes, and select the video frames arranged in front as key frames, to generate a video summary of a given length. The method comprises the following specific steps:

1) extracting video frames and query-based network graphsVisual features of the image and corresponding textual features of the video. Visual features of a video frame are denoted X ═ f₁,f₂,f₃,...,f_n},f_i∈R^mVisual characteristics of the network image are expressed as { g }₁,g₂,...,g_k},g_j∈R^mThe text feature of the video is denoted as t₁,t₂,...,t_l},t_a∈R^d。

2) A weighted full graph is constructed. In order to model the correlation between video frames, the invention takes the video frames as vertices to construct a weighted simple graph G ═ X, E, W, and solves the matrix W using equations (1) - (4).

Constructing an input matrix

4) At a given point

And performing weighted prototype analysis, and alternately obtaining optimal solutions P and Q by using an estimation algorithm.

5) According to the formula

Calculate the importance score S for each prototype_i。

7) Starting from the prototype with the maximum importance score, selecting a video frame corresponding to the row number corresponding to the maximum element score in the column of Q corresponding to the prototype, judging the similarity between the frame and all the previously selected frames, and if the similarity is greater than a certain threshold value, excluding the frame in the summary. And if all prototypes do not reach the length of the abstract after the iteration of the process, performing a next round of selection process, and selecting the row number corresponding to the second maximum value from each column of Q to select the key frame. The above process is then iterated until the required digest length is met.

Claims

1. A multi-video abstraction method based on weighted prototype analysis technology is characterized in that firstly, a weighted graph model is used for modeling the relation between video frames, so as to obtain a weight matrix required by weighted prototype analysis; secondly, acquiring key frames by utilizing prototype analysis with weight to generate a video abstract with a given length; the specific steps of obtaining the weight matrix required by the prototype analysis with the weight are as follows:

W＝A⊙W_v⊙W_q(4)。

2. the method for multi-video summarization based on weighted prototype analysis technique according to claim 1, comprising the following steps:

Constructing an input matrix

4) At a given point

5) according to the formula

Calculate the importance score S for each prototype_i；

6) Sorting the prototypes in a descending manner, and selecting the prototypes with the importance scores larger than a certain threshold value epsilon;