CN106993240B - Multi-video abstraction method based on sparse coding - Google Patents
- Publication number: CN106993240B
- Application number: CN201710151147.4A
- Authority: CN (China)
- Prior art keywords: video, sub, frame, vector, videos
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The invention relates to video processing and aims to provide a multi-video summarization technique based on sparse coding that clusters videos (i.e., detects their sub-topics), ranks key shots, and generates a multi-video summary. The technical scheme is as follows: first, a multi-graph model is constructed from the text and visual information of the videos, and the videos are clustered by a graph-cut method, which detects the sub-topics; second, within each sub-topic, a sparse coding method associates the video frames with the web images retrieved for that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time to form the multi-video summary. The invention is mainly applied to video processing.
Description
Technical Field
The invention relates to video processing, in particular to a multi-video summarization method based on sparse coding.
Background
With the rapid development of information technology, video data is emerging in huge quantities and has become one of the important ways for people to acquire information. However, the dramatic increase in the number of videos brings a large amount of redundant and repeated information, which makes it difficult for users to quickly obtain what they need. A technology that can integrate and analyze massive video data under the same topic is therefore urgently needed, so that people can browse the main information of videos quickly and accurately and improve their ability to acquire information. As one of the effective ways to solve this problem, multi-video summarization has attracted increasing attention from researchers over the past decades. Multi-video summarization is a content-based video data compression technology: it analyzes and integrates multiple videos on related topics under the same event, extracts their main content, and presents the extracted content to the user according to a certain logical relationship. Currently, multi-video summaries are mainly evaluated from three aspects: 1) coverage; 2) novelty; 3) importance. Coverage means that the extracted content can cover the main content of the multiple videos on the same topic. Novelty means that duplicate, redundant information is removed from the summary. Importance means that important key shots are extracted from the video set according to prior information, so that the important content of the multiple videos is captured.
Although many single-video summarization methods have been proposed, research on multi-video summarization is sparse and still at a preliminary stage, mainly for two reasons. 1) Multiple videos of the same event exhibit topic diversity and topic overlap: topic diversity means that the videos emphasize different information and contain several sub-topics, while topic overlap means that the videos share similar content yet differ in the amount of information they carry. 2) Different videos expressing the same content may differ greatly in their audio, text, and visual information. These factors make multi-video summarization difficult to study with traditional single-video techniques.
Over the past decades, multi-video summarization methods have been proposed for the characteristics of multi-video data sets. The method based on complex-graph clustering is a relatively classic one: it constructs a complex graph from the keywords of the videos' transcript information and the videos' key frames, and produces the summary with a graph clustering algorithm. However, this method mainly targets news video and is of no use for video sets without transcript information. In addition, because multiple videos on the same topic are both diverse and redundant in content, clustering alone can hardly satisfy the maximum-coverage requirement; using only the visual information of the videos yields poor clustering results for multi-video summarization, and although combining other modalities helps somewhat, it brings high complexity.
A multi-video summary can draw on information from multiple modalities, such as the text, visual, and audio information of a video. Balanced AV-MMR (Balanced Audio-Video Maximal Marginal Relevance) is a multi-video summarization technique that makes effective use of multi-modal information: it analyzes the visual information, the audio information, and the semantic information within them, including audio, human faces, and temporal features that matter for video summarization. Although the method exploits the multi-modal information of the videos effectively, the extracted summaries still do not achieve a good effect.
In recent years, novel methods have been proposed. Among them, exploiting the visual co-occurrence of videos is one: it observes that important visual concepts recur across multiple videos under the same topic, proposes a maximal biclique finding algorithm based on this observation, and extracts the sparse co-occurrence patterns of the videos to realize multi-video summarization. However, the method suits only specific data sets and loses its meaning for video sets with little repeated content.
In addition, to exploit more related information, researchers have proposed using sensors on mobile phones, such as GPS and compass, to record information such as geographic position during shooting, thereby helping to determine the important information in a video and generate a multi-video summary. Web images have also been used as prior auxiliary information to better realize multi-video summarization. At present, because of the complexity of multi-video data, research on multi-video summarization has not reached an ideal effect; how to better exploit the information in multi-video data has therefore become a research hot spot.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a multi-video summarization technique based on sparse coding that clusters videos (i.e., detects their sub-topics), ranks key shots, and generates a multi-video summary. The technical scheme is as follows: first, a multi-graph model is constructed from the text and visual information of the videos, and the videos are clustered by a graph-cut method, which detects the sub-topics; second, within each sub-topic, a sparse coding method associates the video frames with the web images retrieved for that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time to form the multi-video summary.
The sparse coding method is specifically as follows. Given the set of video frames for a particular event, let X = {x_1, x_2, …, x_N} denote the feature set of the N video frames and Z = {z_1, z_2, …, z_L} denote the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d. The candidate video frames X are used as the basis vector group, and each video frame x_i and each web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning the expression scores a_i of the video frames. Each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame; with the aid of the web-image prior, it conveys how much each frame contributes to reconstructing the topic space. The objective function is therefore constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\;\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\|a\|_1$$

$$\text{s.t.}\;a_j\ge 0\;\text{for}\;j\in\{1,\dots,N\},\;\gamma>0 \tag{2}$$

where the coefficient a_j is the expression score of the j-th frame, all target vectors x_i and z_i share the same coefficient vector a = {a_1, a_2, …, a_N}, and ‖a‖_1 is the regularization term; its addition yields a sparse expression vector.
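As a rough illustration only (not the patented implementation; the function name and array layouts are assumptions), the objective — the reconstruction error of both the video frames and the web images from the candidate-frame basis, plus the L1 sparsity term — can be evaluated as:

```python
import numpy as np

def expression_cost(X, Z, a, gamma):
    """Cost of an expression-score vector under a formula-(2)-style objective.

    X : (N, d) array - candidate video-frame features (also the basis).
    Z : (L, d) array - web-image features for the sub-topic.
    a : (N,)  array - non-negative expression scores, shared by all targets.
    """
    recon = a @ X                          # shared reconstruction sum_j a_j * x_j, shape (d,)
    frame_err = np.sum((X - recon) ** 2)   # frames vs. the reconstruction
    image_err = np.sum((Z - recon) ** 2)   # web images vs. the reconstruction
    return frame_err + image_err + gamma * np.abs(a).sum()
```

Because all targets share one coefficient vector, a sparse `a` directly marks the few frames that best span the topic space.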
In a specific example, the objective function for extracting key shots is constructed as formula (2), and the sparse vector A is then obtained with a coordinate gradient descent algorithm. The specific process is as follows:

First, the vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i (i = 1, 2, …, N) is computed; then the expression score a_i whose partial derivative is largest is selected and updated by the soft-threshold method; finally, this process is iterated until the change in the value of the cost function is smaller than a threshold or the number of iterations reaches a set value, yielding the expression vector A.

Finally, a summary of a given length is generated. Given the summary time length l_k for the k-th sub-topic, the shots are selected by solving the following optimization problem:

$$\max_{\mu^k}\;\sum_{i=1}^{s_k}a_i^k\mu_i^k\quad\text{s.t.}\;\sum_{i=1}^{s_k}l_i^k\mu_i^k\le l_k,\;\mu_i^k\in\{0,1\}$$

where s_k is the number of shots under the k-th sub-topic, a_i^k is the importance (expression) score of the i-th shot under the k-th sub-topic, l_i^k is the time length of the i-th shot, and μ^k is the selection vector: μ_i^k = 1 indicates that the i-th shot is selected, and μ_i^k = 0 that it is not. This optimization problem is a classic knapsack problem, and solving it by dynamic programming realizes the multi-video summary.
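The shot-selection step can be sketched as a standard 0/1 knapsack solved by dynamic programming (a sketch under the assumption that shot lengths are integer seconds; `select_shots` and its signature are illustrative, not taken from the patent):

```python
def select_shots(scores, lengths, budget):
    """0/1 knapsack by dynamic programming: pick shots maximising the total
    expression score subject to the summary-length budget.

    scores  : importance/expression score per shot.
    lengths : integer time length per shot.
    budget  : total summary length l_k for the sub-topic.
    Returns the selection vector mu (0/1 per shot) and the best total score.
    """
    n = len(scores)
    # best[i][c] = best score using the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]            # skip shot i
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take shot i
    # backtrack to recover the selection vector mu
    mu, c = [0] * n, budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            mu[i - 1] = 1
            c -= lengths[i - 1]
    return mu, best[n][budget]
```

For example, shots with scores (6, 10, 12) and lengths (1, 2, 3) under a 5-second budget select the second and third shots for a total score of 22.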
The invention has the following characteristics and beneficial effects:

The invention addresses the shortcomings of existing multi-video summarization methods by designing a sparse-coding-based method suited to the characteristics of multi-video data, making full use of the data's specific information with the assistance of effective prior information. Its advantages are mainly reflected in:

(1) Novelty: the multi-graph model is applied to video clustering, making full use of the multi-modal information of the videos and better detecting the sub-topics of a multi-video set. On this basis, the method of extracting key shots with sparse coding is proposed for the first time.

(2) Effectiveness: compared with a typical clustering method and with the minimum sparse reconstruction method used for single-video summarization, the proposed sparse-coding-based multi-video summarization method performs markedly better, and is therefore better suited to the multi-video summarization problem.

(3) Practicability: the method is simple and feasible and can be used in the field of multimedia information processing.
Description of the drawings:
FIG. 1 is a flow chart of the sparse-coding-based video key shot extraction provided by the invention;

FIG. 2 is a flow chart of the coordinate gradient descent algorithm for solving the objective function.
Detailed Description
Aiming at the large amount of redundant and repeated information in multimedia video data, the invention combines the visual information, text information, and other topic-related prior information of the videos and uses the idea of sparse coding to improve the traditional multi-video summarization method, so as to effectively exploit topic-related information and improve the efficiency with which users browse videos.
The invention aims to provide a multi-video summarization technique based on sparse coding. According to the characteristics of multi-video data, the invention first constructs a multi-graph model from the text and visual information of the videos and clusters the videos, i.e., detects their sub-topics, by a graph-cut method. Then, under each sub-topic, a sparse coding method links the video frames with the web images retrieved for that sub-topic to obtain the important key shots. Finally, the key shots are ordered by video upload time to form the multi-video summary.
The method provided by the invention mainly comprises: introducing web images as auxiliary information and designing a sparse coding method suited to the characteristics of multi-video summary data sets to obtain the key shots (frames) of multiple videos; on this basis, the key shots (frames) are ordered using the upload time of the videos.
Multi-video summarization aims to compress a long video set into a short summary set and help users quickly grasp its main information. Multiple video sets of the same event generally exhibit topic diversity, topic overlap, and similar characteristics, so simply applying single-video summarization methods to multiple videos is infeasible. Web images retrieved by sub-topic keywords can generally be considered to reflect the important content of the topic: each image is uploaded by a user, mostly taken from downloaded videos of the related topic, and therefore reflects user interest. Compared with video frames, such images capture the topic from a typical viewpoint, carry richer semantic information, and contain less noise. The specific principle of the method is as follows:
sparse coding aims at selecting a set of basis vectorsTo reconstruct the k input vectors xjTo minimize the reconstruction error, the following equation is formulated:
wherein a isijRepresenting the ith input vector xiAnd the jth base vectorWhere the first term represents the input vector xiAnd the base vector groupThe reconstruction error of (1). The second term in the equation guarantees that the reconstruction coefficient matrix a ═ is (a)ij) Gamma is the regularization coefficient, balancing the first and second terms.
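A minimal sketch of evaluating this general sparse coding objective (the function name and the matrix layout of X, B, and A are illustrative assumptions, not from the patent):

```python
import numpy as np

def sparse_coding_cost(X, B, A, gamma):
    """Formula-(1)-style objective: reconstruction error of the k input
    vectors from the basis, plus an L1 penalty on the coefficients.

    X : (k, d) input vectors, B : (n, d) basis vectors,
    A : (k, n) reconstruction coefficients, gamma : regularization weight.
    """
    recon_err = np.sum((X - A @ B) ** 2)  # first term: reconstruction error
    sparsity = gamma * np.abs(A).sum()    # second term: keeps A sparse
    return recon_err + sparsity
```

The γ weight trades off fidelity against sparsity: large γ drives most coefficients in A to zero, so each input is explained by only a few basis vectors.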
In the present invention, given a set of video frames for a particular event, X = {x_1, x_2, …, x_N} denotes the feature set of the N video frames and Z = {z_1, z_2, …, z_L} denotes the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d. The essence of multi-video summarization is to select a certain number of frames to reconstruct the topic space of the original videos. The method combines the web-image prior, constructs an objective function using the idea of sparse coding, and learns the common pattern between the video frames and the web images retrieved by sub-topic keywords. The invention directly uses the candidate video frames X as the basis vector group, and each video frame x_i and each web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning the expression scores a_i of the video frames. Each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame; with the aid of the web-image prior, it conveys how much each frame contributes to reconstructing the topic space. The objective function is constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\;\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\|a\|_1$$

$$\text{s.t.}\;a_j\ge 0\;\text{for}\;j\in\{1,\dots,N\},\;\gamma>0 \tag{2}$$

where the coefficient a_j is the expression score of the j-th frame, all target vectors share the same coefficient vector a = {a_1, a_2, …, a_N}, and the added regularization term yields a sparse expression vector.
The present invention will be described in further detail with reference to the following drawings and specific examples.
Fig. 1 depicts the flow of extracting key shots from the videos with the sparse coding method, combined with the web-image prior under a sub-topic. The following process extracts key shots for one sub-topic; the key shots of the other sub-topics of the same event are extracted in the same way.

First, the video frames under each sub-topic and the sub-topic-based web image features are extracted. Given the set of video frames for a certain sub-topic of a particular event, X = {x_1, x_2, …, x_N} denotes the N-frame feature set and Z = {z_1, z_2, …, z_L} denotes the L-image web feature set, where x_i ∈ R^d and z_i ∈ R^d. Each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame, which represents the contribution of the i-th frame in the reconstruction.
Second, the objective function for extracting key shots is constructed. Based on the idea of sparse coding, the invention directly uses the candidate video frames X as the basis vector group, and each video frame x_i and web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning the expression scores a_i of the video frames. The goal is to simultaneously minimize the reconstruction error between the input vectors x_i, z_i and the basis vector group X = {x_1, x_2, …, x_N}, i.e., to learn the important shots of the videos with the assistance of the web images retrieved by sub-topic keywords; the objective function is formula (2). The sparse vector A is then obtained with a coordinate gradient descent algorithm, as follows:

First, the vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i (i = 1, 2, …, N) is computed; then the expression score a_i whose partial derivative is largest is selected and updated by the soft-threshold method. Finally, this process is iterated until the change in the value of the cost function is smaller than a threshold or the number of iterations reaches a set value, yielding the expression vector A.
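The iteration above can be sketched as greedy coordinate descent with soft-thresholding (a sketch only: the step size, the |gradient| tie-breaking, and the function name are illustrative choices, not details from the patent):

```python
import numpy as np

def solve_scores(X, Z, gamma=0.1, max_iter=200, tol=1e-8):
    """Greedy coordinate descent for the shared, non-negative expression-score
    vector: initialise at zero, repeatedly update the coordinate with the
    largest partial derivative via a soft-threshold step, stop when the cost
    change falls below tol."""
    N, d = X.shape
    L = Z.shape[0]

    def cost(a):
        r = a @ X  # shared reconstruction in R^d
        return np.sum((X - r) ** 2) + np.sum((Z - r) ** 2) + gamma * a.sum()

    a = np.zeros(N)  # initialise A as the zero vector
    # conservative step from a coordinate-wise Lipschitz bound on the gradient
    step = 1.0 / (2 * (N + L) * np.max(np.sum(X ** 2, axis=1)) + 1e-12)
    prev = cost(a)
    for _ in range(max_iter):
        r = a @ X
        resid = (N + L) * r - X.sum(axis=0) - Z.sum(axis=0)
        grad = 2 * X @ resid                   # partial derivatives of the smooth part
        j = int(np.argmax(np.abs(grad)))       # coordinate with largest |derivative|
        a[j] = max(0.0, a[j] - step * (grad[j] + gamma))  # soft-threshold + clip at 0
        cur = cost(a)
        if abs(prev - cur) < tol:              # cost change below threshold: stop
            break
        prev = cur
    return a
```

With the clip at zero, the L1 term reduces to the plain sum of scores, and the soft-threshold step both shrinks and sparsifies the expression vector.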
Finally, a summary of a given length is generated. All the key shots of the sub-topics are extracted according to the above process. Given the summary time length l_k for the k-th sub-topic, the invention selects shots by solving the following optimization problem:

$$\max_{\mu^k}\;\sum_{i=1}^{s_k}a_i^k\mu_i^k\quad\text{s.t.}\;\sum_{i=1}^{s_k}l_i^k\mu_i^k\le l_k,\;\mu_i^k\in\{0,1\}$$

where s_k is the number of shots under the k-th sub-topic, a_i^k is the importance (expression) score of the i-th shot (frame) under the k-th sub-topic, l_i^k is the time length of the i-th shot, and μ^k is the selection vector: μ_i^k = 1 indicates that the i-th shot is selected, and μ_i^k = 0 that it is not. This optimization problem is a classic knapsack problem, and solving it by dynamic programming realizes the multi-video summary.
Claims (1)
1. A multi-video summarization method based on sparse coding, characterized in that a multi-graph model is constructed from the text information and visual information of the videos, and the videos are clustered by a graph-cut method, i.e., the sub-topics of the videos are detected; secondly, under each sub-topic, a sparse coding method associates the video frames with the web images retrieved for that sub-topic to obtain the key shots; finally, the key shots are ordered by video upload time to realize the multi-video summary; the sparse coding method is specifically as follows: given the set of video frames for a particular event, X = {x_1, x_2, …, x_N} denotes the feature set of the N video frames and Z = {z_1, z_2, …, z_L} denotes the feature set of the L web images, where x_i ∈ R^d and z_i ∈ R^d; the candidate video frames X serve as the basis vector group, and each video frame x_i and each web image z_i retrieved by sub-topic keyword search serve jointly as input vectors for learning an expression score a_i of each video frame; each frame x_i corresponds to a coefficient variable a_i, called the expression score of the i-th frame, which conveys, with the aid of the web-image prior, the contribution of each frame to reconstructing the topic space; the objective function is constructed as follows, with a regularization term added to ensure sparsity:

$$\min_{a}\;\sum_{i=1}^{N}\Big\|x_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\sum_{i=1}^{L}\Big\|z_i-\sum_{j=1}^{N}a_j x_j\Big\|_2^2+\gamma\|a\|_1$$

$$\text{s.t.}\;a_j\ge 0\;\text{for}\;j\in\{1,\dots,N\},\;\gamma>0 \tag{1}$$

wherein the coefficient a_j is the expression score of the j-th frame, all target vectors x_i share the same coefficient vector a = {a_1, a_2, …, a_N}, and ‖a‖_1 is the regularization term whose addition yields a sparse expression vector;

the sparse vector A is obtained with a coordinate gradient descent algorithm, as follows: first, the sparse vector A is initialized to the zero vector, and the partial derivative of the objective function with respect to each expression score a_i, i = 1, 2, …, N, is computed; then the expression score a_i whose partial derivative is largest is selected and updated by the soft-threshold method; finally, this process is iterated until the change in the value of the cost function is smaller than a threshold or the number of iterations reaches a set value, yielding the sparse vector A;

finally, a summary of a given length is generated: all the key shots of the sub-topics are extracted according to the above process, and given the summary time length l_k for the k-th sub-topic, the shots are selected by solving the following optimization problem:

$$\max_{\mu^k}\;\sum_{i=1}^{s_k}a_i^k\mu_i^k\quad\text{s.t.}\;\sum_{i=1}^{s_k}l_i^k\mu_i^k\le l_k,\;\mu_i^k\in\{0,1\}$$

wherein s_k is the number of shots under the k-th sub-topic, a_i^k is the importance or expression score of the i-th shot/frame under the k-th sub-topic, l_i^k is the time length of the i-th shot, and μ^k is the selection vector: μ_i^k = 1 indicates that the i-th shot is selected, otherwise it is not selected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710151147.4A CN106993240B (en) | 2017-03-14 | 2017-03-14 | Multi-video abstraction method based on sparse coding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106993240A CN106993240A (en) | 2017-07-28 |
CN106993240B true CN106993240B (en) | 2020-10-16 |
Family
ID=59411588
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710151147.4A Active CN106993240B (en) | 2017-03-14 | 2017-03-14 | Multi-video abstraction method based on sparse coding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106993240B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107911755B (en) * | 2017-11-10 | 2020-10-20 | 天津大学 | Multi-video abstraction method based on sparse self-encoder |
CN107943990B (en) * | 2017-12-01 | 2020-02-14 | 天津大学 | Multi-video abstraction method based on prototype analysis technology with weight |
CN109348287B (en) * | 2018-10-22 | 2022-01-28 | 深圳市商汤科技有限公司 | Video abstract generation method and device, storage medium and electronic equipment |
CN110110636B (en) * | 2019-04-28 | 2021-03-02 | 清华大学 | Video logic mining device and method based on multi-input single-output coding and decoding model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7783459B2 (en) * | 2007-02-21 | 2010-08-24 | William Marsh Rice University | Analog system for computing sparse codes |
CN104113789A (en) * | 2014-07-10 | 2014-10-22 | 杭州电子科技大学 | On-line video abstraction generation method based on depth learning |
CN106034264A (en) * | 2015-03-11 | 2016-10-19 | 中国科学院西安光学精密机械研究所 | Coordination-model-based method for obtaining video abstract |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xue et al. | Advancing high-resolution video-language representation with large-scale video transcriptions | |
Gabeur et al. | Multi-modal transformer for video retrieval | |
Otani et al. | Learning joint representations of videos and sentences with web image search | |
CN106993240B (en) | Multi-video abstraction method based on sparse coding | |
CN107943990B (en) | Multi-video abstraction method based on prototype analysis technology with weight | |
CN107203636B (en) | Multi-video abstract acquisition method based on hypergraph master set clustering | |
CN101739428B (en) | Method for establishing index for multimedia | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
CN107911755B (en) | Multi-video abstraction method based on sparse self-encoder | |
Liu et al. | Attention guided deep audio-face fusion for efficient speaker naming | |
Momeni et al. | Automatic dense annotation of large-vocabulary sign language videos | |
Choi et al. | Contextually customized video summaries via natural language | |
CN114048351A (en) | Cross-modal text-video retrieval method based on space-time relationship enhancement | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
Liang et al. | Informedia@ trecvid 2016 med and avs | |
Varol et al. | Scaling up sign spotting through sign language dictionaries | |
CN113408619B (en) | Language model pre-training method and device | |
CN112883229B (en) | Video-text cross-modal retrieval method and device based on multi-feature-map attention network model | |
Li et al. | Image-text alignment and retrieval using light-weight transformer | |
Zhang et al. | SOR-TC: Self-attentive octave ResNet with temporal consistency for compressed video action recognition | |
Xu et al. | Multi-guiding long short-term memory for video captioning | |
CN110826397B (en) | Video description method based on high-order low-rank multi-modal attention mechanism | |
CN116977701A (en) | Video classification model training method, video classification method and device | |
Jiang | Web-scale multimedia search for internet video content | |
CN109857906B (en) | Multi-video abstraction method based on query unsupervised deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |