CN104156423A

CN104156423A - Multiscale video key frame extraction method based on integer programming

Info

Publication number: CN104156423A
Application number: CN201410384972.5A
Authority: CN
Inventors: 聂秀山; 柴彦娥; 马林元
Original assignee: Individual
Current assignee: Shandong University of Finance and Economics
Priority date: 2014-08-06
Filing date: 2014-08-06
Publication date: 2014-11-19
Anticipated expiration: 2034-08-06
Also published as: CN104156423B

Abstract

The invention provides a multiscale video key frame extraction method based on integer programming. The method comprises the following steps: (1) video graph modeling, in which the video is modeled as an undirected weighted graph; (2) video content partitioning, in which a scale factor is set and the video frames are divided into several parts by normalized cut; (3) key frame selection, in which a key frame set is obtained by integer programming according to the scale factor. Compared with the prior art, the method starts from the essence of key frame extraction and selects key frames using normalized cut theory and integer programming, so that the video content is represented to the greatest extent. Because a scale factor is provided, the user can interactively determine the approximate number of key frames by choosing different scales, which meets differing user needs.

Description

Multiscale video key frame extraction method based on integer programming

Technical field

The present invention relates to a video key frame extraction method, and in particular to a multiscale video key frame extraction method based on integer programming. It belongs to the technical field of video and multimedia signal processing.

Background art

With the rapid development of computer and information technology, and of multimedia technology in particular, video content has become increasingly rich, and video, as an information-dense and highly expressive media format, has long been an important carrier for exchanging information. In addition, with rapid advances in hardware, software and network technology, the volume of video resources has grown sharply, and more and more people watch video on computers, mobile phones and other devices. Such large volumes of video data urgently require efficient ways of managing video content so that users get a better multimedia experience. Representing a video segment by key frames is a common approach to video management: the user only needs to browse a few key frames to understand the content of the video. Key frame extraction has therefore been studied continuously. On the other hand, because video data is growing geometrically, video retrieval is becoming ever more important in multimedia processing. Traditional video retrieval relies mainly on text annotation, which is labor-intensive, inefficient and highly subjective. An automatic, objective and comprehensive alternative, content-based video retrieval, has therefore become a research focus in recent years. An important step in content-based video retrieval is to extract key frames from the video sequence and to use them as an index for retrieving the original content. Key frame extraction thus plays an important role in content-based video retrieval.

Current video key frame extraction methods fall roughly into two classes. The first class comprises sampling-based methods, which obtain key frames by random or uniform sampling. Although simple and fast, these methods may leave some important video segments without any key frame, or select repeated key frames from other segments. The second class comprises methods based on shot segmentation, which divide the video into several shots and then take the first or last frame of each shot as a key frame. These methods are limited by the accuracy of shot segmentation, and the key frames they obtain cannot fully reflect the content of each shot.

The number of key frames is also an important issue. Selecting key frames is, in essence, selecting frames that can represent the video content. With too many key frames, the video content is reflected in more detail, but the computational cost of video retrieval increases and the purpose of key frames (to represent the video concisely) is partly lost; with too few key frames, the video content cannot be fully reflected. Moreover, the number of key frames chosen by existing techniques is mostly fixed. In sampling-based methods, uniform sampling generally takes one frame per fixed time interval, and random sampling generally presets the total number of key frames. In shot-segmentation-based methods, once the shots have been determined, the number of key frames is determined as well. In other words, existing methods yield an essentially fixed number of key frames for a given video.

Summary of the invention

To address the shortcomings of existing video key frame extraction techniques, the present invention provides a key frame selection method that represents the video content to the greatest extent while allowing the user to set the number of key frames interactively. Compared with the prior art, the invention starts from the essence of key frame extraction and selects key frames using normalized cut theory and integer programming. It not only represents the video content as fully as possible but also provides a scale factor: by choosing different scales, the user interactively determines the approximate number of key frames. The invention is referred to as a multiscale video key frame extraction method based on integer programming.

The technical solution adopted by the present invention is as follows:

A multiscale video key frame extraction method based on integer programming, characterized in that the method comprises the following steps:

(1) video graph modeling: the video is modeled as an undirected weighted graph;

(2) video content partitioning: a scale factor s is set, the scale factor being chosen by the user as needed in order to determine the number of key frames, and the video sequence is divided into s parts according to content using normalized cut theory;

(3) integer programming modeling: integer programming modeling is performed on the video graph of the partitioned video sequence, and the key frames are selected.

Preferably, step (1) is implemented as follows:

1. each video frame is abstracted as a vertex in a high-dimensional space, and the lines connecting vertices serve as edges, forming a graph in the high-dimensional space;

2. SURF (Speeded-Up Robust Features) features are extracted from the video frames, and a function of the distance between the feature points of different frames is used as the edge weight, turning the graph abstracted from the video into a weighted graph.

Preferably, step (3) is implemented as follows:

For the video graph obtained in step (1), a label is first defined for each vertex: the label is 1 if the video frame corresponding to the vertex is selected as a key frame, and 0 otherwise. The objective of the integer program is to maximize the sum of the labels of all vertices, subject to two constraints. The first ensures that the vertices selected as key frames are not connected to one another; the second ensures that every part of the video graph contains at least one vertex whose label is 1. The solution of the integer program is an optimal label assignment, and the set of vertices labeled 1 is the key frame set.

Preferably, the distance function used in step (1) is a function that makes the weight inversely related to the distance.

The above method first models the video as a graph, constructs the weights from a distance function on SURF features, divides the video into several parts using normalized cut theory, performs integer programming modeling on the video graph, and selects graph vertices as video key frames.

The present invention extracts key frames that represent the video content and also allows the number of key frames to be adjusted interactively. Compared with the prior art, it takes full account of both the distinctiveness and the representativeness of video content: key frames are selected from video segments with different content, which ensures that the content is represented while avoiding repeated key frame content. At the same time, the number of key frames can be adjusted through the scale factor. When the user only needs an overview of the video, a smaller scale factor can be set to obtain fewer key frames; when more detailed content is needed, a larger scale factor can be set to obtain more key frames. Traditional key frame techniques do not offer this capability.

Brief description of the drawings

Fig. 1 is a schematic diagram of the steps of the present invention.

Fig. 2 is a schematic diagram of the SURF features of a video frame.

Fig. 3 is a schematic diagram of integer programming modeling on the video graph.

Fig. 4 is a key frame extraction example: (a) original video frames; (b) key frames at different scales.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings.

The method of the present invention follows the flow shown in Fig. 1 and comprises the following specific steps:

(1) Video graph modeling

1. The video is represented by an undirected graph G = (V, E), where V and E denote the vertex set and the edge set of the graph, respectively. Each video frame corresponds to a graph vertex, and the lines connecting the vertices form the edge set.

2. Defining the edge weights. The edge weights of the graph express the relation between different video frames. The present invention defines the weight as a function of the Hausdorff distance between the Speeded-Up Robust Features (SURF) of different frames. SURF features are interest points in an image, typically points that attract human visual attention such as corners and blobs. They are repeatable and reliable, resist interference such as rotation, translation, illumination changes and noise, and are therefore highly robust; SURF features can also be retrieved quickly and efficiently. Fig. 2 is a schematic diagram of the SURF features of a video frame. The specific method is as follows: for each video frame, the determinant of the Hessian matrix is computed at every point x = (x, y) to decide whether the point is a feature point. The Hessian matrix is defined as:

$$H(x, \sigma) = \begin{pmatrix} L_{xx}(x, \sigma) & L_{xy}(x, \sigma) \\ L_{xy}(x, \sigma) & L_{yy}(x, \sigma) \end{pmatrix} \tag{1}$$

where L_xx(x, σ), L_xy(x, σ) and L_yy(x, σ) are the convolutions of the second-order partial derivatives of the Gaussian function with the image at point x = (x, y), and σ denotes the scale at x = (x, y). The determinant of the Hessian matrix is:

$$\det(H) = L_{xx} L_{yy} - L_{xy}^{2} \tag{2}$$

If the determinant of the Hessian matrix at a point is positive, the point is a local extremum. Non-maximum suppression is then used to search for feature points across scales. Finally, the orientation of each feature point is determined from Haar wavelet responses accumulated over a sector region, and the feature vector is constructed.
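As a rough illustration of equations (1) and (2), the sketch below evaluates the Hessian determinant at a pixel with plain central finite differences. This is a simplification assumed for the example: SURF convolves the image with Gaussian second-derivative kernels at scale σ, whereas here the synthetic test image is already a smooth blob and no further smoothing is applied. The names `gaussian_blob` and `hessian_det` are hypothetical.

```python
import math

def gaussian_blob(size=7, sigma=1.5):
    """Build a small synthetic image containing one bright blob at the centre."""
    c = size // 2
    return [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def hessian_det(img, x, y):
    """Return L_xx * L_yy - L_xy**2 at an interior pixel (x, y), as in equation (2).

    Second derivatives are estimated with central finite differences, which
    stand in for the Gaussian-derivative convolution of equation (1).
    """
    l_xx = img[y][x + 1] - 2 * img[y][x] + img[y][x - 1]
    l_yy = img[y + 1][x] - 2 * img[y][x] + img[y - 1][x]
    l_xy = (img[y + 1][x + 1] - img[y + 1][x - 1]
            - img[y - 1][x + 1] + img[y - 1][x - 1]) / 4.0
    return l_xx * l_yy - l_xy ** 2

img = gaussian_blob()
centre = len(img) // 2
det_centre = hessian_det(img, centre, centre)
# A positive determinant marks a candidate local extremum (blob-like point).
is_candidate = det_centre > 0
```

On a linear intensity ramp all second differences vanish, so the same function returns zero there and the pixel is not a candidate.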

For convenience of calculation, the present invention takes the same number of feature points from every frame. The edge weight w_ij between graph vertices i and j is defined as follows:

$$w_{ij} = e^{-H(i,j)} \tag{3}$$

where H(i, j) is the Hausdorff distance between the feature point sets of the two frames. This functional form is adopted to suit video content partitioning: for different video frames, the larger the edge weight, the smaller the distance between the corresponding graph vertices and the more similar the content of the two frames. To further improve efficiency, edges whose weight is below a set threshold are removed. The distance weight may also be defined by any other function that makes the weight inversely related to the distance.
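The weight definition in equation (3) can be sketched as follows. This is a minimal example under stated assumptions: each frame is represented by a small list of feature vectors (toy 2-D tuples here, standing in for SURF descriptors), the Hausdorff distance is taken in its usual symmetric form with Euclidean point distances, and the pruning threshold `min_weight` is a hypothetical parameter name.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest neighbour in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance H(i, j) between two feature-point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def edge_weights(frames, min_weight=0.0):
    """Return {(i, j): w_ij} with w_ij = exp(-H(i, j)), as in equation (3).

    Edges whose weight falls below `min_weight` (an assumed pruning
    threshold) are dropped, mirroring the optional edge-removal step.
    """
    weights = {}
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            w = math.exp(-hausdorff(frames[i], frames[j]))
            if w >= min_weight:
                weights[(i, j)] = w
    return weights

# Toy data: two feature vectors per frame.
frames = [
    [(0.0, 0.0), (1.0, 0.0)],   # frame 0
    [(0.0, 0.1), (1.0, 0.1)],   # frame 1, nearly identical to frame 0
    [(5.0, 5.0), (6.0, 5.0)],   # frame 2, very different content
]
w = edge_weights(frames)
pruned = edge_weights(frames, min_weight=0.5)
```

Frames 0 and 1 end up with a weight close to 1, while both pairs involving frame 2 receive weights near 0 and disappear once a pruning threshold such as 0.5 is applied.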

(2) Video content partitioning

Video content partitioning divides the video sequence, in a reasonable way and according to content, into two parts or M parts. From the viewpoint of video graph modeling, this is equivalent to a graph partitioning problem: given a graph G = (V, E), how should its vertex set be divided into disjoint subsets so that the partition is best? The simplest approach is to require that, after two or M disjoint vertex sets have been delineated, the sum of the weights of the edges between the vertex sets be minimal; this is the MinCut (minimum cut) problem. However, MinCut may separate a single vertex that lies far from the majority of points from all the other vertices, forming two classes, which is clearly undesirable for classification. In practice, we want not only the total weight of the cut edges to be minimal but also the M vertex sets to be of comparable size, which matches the human visual sense of a cluster. Normalized cut achieves exactly this, giving a good partition of the graph vertices and hence of the video content.
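For reference, the normalized cut criterion in its standard form (as introduced by Shi and Malik) is given below. The patent text does not spell the criterion out, so the notation is an assumption of this note; w_ij is the edge weight between vertices i and j.

```latex
\mathrm{cut}(A,B) = \sum_{i \in A,\; j \in B} w_{ij}, \qquad
\mathrm{assoc}(A,V) = \sum_{i \in A,\; j \in V} w_{ij}

\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)}
                   + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}
```

When A is a single vertex, assoc(A, V) equals cut(A, B), so the first ratio is 1 and the partition scores poorly. This is how the criterion penalizes the lopsided splits that plain MinCut tends to produce.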

Normalized cut is a recursive bisection process. The graph vertex set V is first decomposed into disjoint sets A and B with A ∪ B = V and A ∩ B = ∅; this is the first level. Bisection then continues on A and B until the required number of regions is reached. The number of regions expresses how finely the video content is subdivided, and it is the scale factor s of this scheme. When fewer key frames are needed the user can set a smaller scale factor, and when more key frames are needed the user can set a larger one. The scale factor determines the minimum number of key frames at that scale, i.e. the minimum content precision at that scale. Let M_s be the number of key frames at scale s; then the following holds:

$$M_s \geq s \tag{4}$$

Partitioning the video sequence by normalized cut depends entirely on the weight matrix of the mapped graph, and in the present invention the weight matrix depends entirely on the local feature representation of the video, with no restriction on temporal order. Similar frames therefore fall into the same video segment even when they are not in the same time period, which effectively avoids redundancy in the selected key frames.
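The recursive bisection into s parts can be sketched on a tiny graph as follows. Several choices here are assumptions made for illustration: the best two-way split is found by exhaustive search (practical implementations of normalized cut usually solve a generalized eigenvalue problem instead), the largest remaining part is the one split next, and the function names are hypothetical.

```python
from itertools import combinations

def weight(w, i, j):
    """Symmetric lookup of w_ij in a dict keyed by (min, max) vertex pairs."""
    return 0.0 if i == j else w.get((min(i, j), max(i, j)), 0.0)

def ncut(w, a, b):
    """Normalized cut value of splitting the vertex set a ∪ b into a and b."""
    nodes = a | b
    cut = sum(weight(w, i, j) for i in a for j in b)
    assoc_a = sum(weight(w, i, j) for i in a for j in nodes)
    assoc_b = sum(weight(w, i, j) for i in b for j in nodes)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b

def best_bipartition(w, nodes):
    """Exhaustively search all two-way splits; feasible only for tiny graphs."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    for r in range(len(rest)):            # `first` always stays on side a
        for extra in combinations(rest, r):
            a = {first, *extra}
            b = set(nodes) - a
            score = ncut(w, a, b)
            if best is None or score < best[0]:
                best = (score, a, b)
    return best[1], best[2]

def partition(w, nodes, s):
    """Bisect repeatedly until `s` parts exist (largest part is split first)."""
    nodes = set(nodes)
    if not 1 <= s <= len(nodes):
        raise ValueError("s must lie between 1 and the number of vertices")
    parts = [nodes]
    while len(parts) < s:
        parts.sort(key=len, reverse=True)
        a, b = best_bipartition(w, parts.pop(0))
        parts += [a, b]
    return sorted(sorted(p) for p in parts)

# Frames 0 and 1 are similar, frames 2 and 3 are similar, cross links are weak.
w_demo = {(0, 1): 0.9, (2, 3): 0.8,
          (0, 2): 0.1, (0, 3): 0.1, (1, 2): 0.1, (1, 3): 0.1}
two_parts = partition(w_demo, range(4), 2)
three_parts = partition(w_demo, range(4), 3)
```

With scale factor s = 2 the toy graph splits into the two groups of similar frames; raising s to 3 subdivides one of those groups further, which corresponds to requesting a finer scale.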

(3) Integer programming modeling

The key frame selection problem can be treated as approximately equivalent to selecting an independent set of the video graph. An independent set U of a graph is a subset of the vertex set V such that no two vertices i ∈ U, j ∈ U are joined by an edge. In the video graph, every pair of vertices is joined by an edge and only the weights differ, so the independent set of a video graph is defined here as follows: U is a subset of the vertex set V such that, for any vertices i ∈ U, j ∈ U, the weight w_ij is less than a threshold θ. The threshold is set to the mean of all edge weights. Finding a maximum independent set is a classical NP-hard problem in graph theory, and the present invention uses integer programming to obtain an approximate solution.
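The thresholded notion of independence described above can be checked with a few lines of code. This is a minimal sketch; the helper names `mean_threshold` and `is_independent` are hypothetical, and the weights are made-up toy values.

```python
from itertools import combinations

def mean_threshold(w):
    """θ, taken as the mean of all edge weights."""
    return sum(w.values()) / len(w)

def is_independent(w, subset, theta):
    """True if every pair of vertices in `subset` has weight w_ij below θ."""
    return all(w[(i, j)] < theta for i, j in combinations(sorted(subset), 2))

# Complete graph on three frames; frames 0 and 1 are very similar.
w = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}
theta = mean_threshold(w)   # roughly 0.4 for these toy weights
```

Under this threshold the sets {0, 2} and {1, 2} are independent, while {0, 1} is not because its two frames are too similar to both serve as key frames.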

The general idea of the integer program is as follows. For the video graph obtained above, a label is defined for each vertex: the label is 1 if the video frame corresponding to the vertex is selected as a key frame, and 0 otherwise. The objective of the integer program is to maximize the sum of the labels of all vertices, subject to two constraints. The first ensures that the vertices selected as key frames are not connected to one another; the second ensures that every part of the video graph contains at least one vertex whose label is 1. The solution of the integer program is an optimal label assignment, and the set of vertices labeled 1 is the key frame set.

Let U be a maximum independent set of the video graph and N the number of vertices of the video graph. At scale s the video sequence is divided into s parts, the k-th part being denoted M_k. The variables A_i and d_ij are defined as follows:

$$A_i = \begin{cases} 1, & v_i \in U \\ 0, & v_i \notin U \end{cases} \qquad d_{ij} = \begin{cases} 1, & v_i \in M_{k_1},\ v_j \in M_{k_2},\ k_1 \neq k_2 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

The integer programming model is defined as follows:

$$\begin{aligned} \max\ & \sum_{i} A_i \\ \text{s.t.}\ & A_i + A_j \leq 1, \quad \text{if } w_{ij} > \theta \\ & \sum_{i,j}^{C_N^2} d_{ij} \geq C_s^2 \end{aligned} \tag{6}$$

Here C_n^k denotes the number of combinations. Constraint 1 states that two vertices with high similarity (a large edge weight between them) cannot both be selected into the independent set; the present invention calls this the distinctiveness constraint. Constraint 2 states that, at scale s, at least one vertex in every part of the video graph is selected into the independent set; the present invention calls this the representativeness constraint. Fig. 3 illustrates constraint 2 for s = 4. Suppose a video graph has 5 vertices in total, i.e. N = 5, so that the vertex pairs give rise to C_N^2 values of d_ij. As shown in Fig. 3(a), all of them are 1 except d_12 = 0 (vertices v_1 and v_2 lie in the same part). To guarantee that each part has at least one frame selected as a key frame, as shown in Fig. 3(b), at least C_s^2 of the d_ij must have the value 1.
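The counts in the Fig. 3 setting can be reproduced with a short calculation. The sketch below evaluates d_ij from equation (5) over all vertex pairs, as the Fig. 3(a) description does; how the constraint couples d_ij to the selected labels A_i is not spelled out in the text, so this only checks the combinatorial quantities. The list `part_of` and the function name are hypothetical.

```python
from itertools import combinations
from math import comb

def count_cross_part_pairs(part_of):
    """Sum of d_ij over all vertex pairs; d_ij = 1 when i and j lie in different parts."""
    return sum(1 for i, j in combinations(range(len(part_of)), 2)
               if part_of[i] != part_of[j])

# Fig. 3 setting: N = 5 vertices, s = 4 parts, v_1 and v_2 share a part.
part_of = [0, 0, 1, 2, 3]
s = 4
total_pairs = comb(len(part_of), 2)            # number of d_ij values, C_N^2
cross_pairs = count_cross_part_pairs(part_of)  # every pair except (v_1, v_2)
required = comb(s, 2)                          # right-hand side of constraint 2, C_s^2
```

For this layout the number of cross-part pairs comfortably exceeds the required C_s^2.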

To simplify the model, e_ij is next defined as follows:

$$e_{ij} = \begin{cases} 1, & w_{ij} \geq \theta \\ 0, & w_{ij} < \theta \end{cases} \tag{7}$$

Programming model (6) can then be converted into the standard integer programming model (8):

$$\begin{aligned} \min\ & -\sum_{i} A_i \\ \text{s.t.}\ & A_i + A_j \leq 2 - e_{ij} \\ & -\sum_{i,j}^{C_N^2} d_{ij} \leq -C_s^2 \end{aligned} \tag{8}$$

The solution of this integer programming model is a maximum independent set of the graph onto which the video is mapped; the key frame set of the corresponding video is obtained from the mapping.
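To make the model concrete, the sketch below solves a tiny instance by exhaustive search over all 0/1 label vectors. Two points are assumptions of this example rather than statements of the patented procedure: the representativeness constraint is read in its verbal form (every part contains at least one selected vertex), and brute force stands in for an integer programming solver, which would be needed for any realistic number of frames. The function name `select_key_frames` is hypothetical.

```python
from itertools import product

def select_key_frames(n, e, part_of):
    """Brute-force the 0/1 labels A_i on a tiny graph.

    n        number of vertices (frames)
    e        {(i, j): 0 or 1}, with e_ij = 1 when w_ij >= θ (equation (7));
             pairs that are absent are treated as e_ij = 0
    part_of  part_of[i] is the index of the part containing vertex i
    """
    parts = set(part_of)
    best = None
    for labels in product((0, 1), repeat=n):
        # Distinctiveness: A_i + A_j <= 2 - e_ij for every listed pair.
        if any(labels[i] + labels[j] > 2 - e_ij for (i, j), e_ij in e.items()):
            continue
        # Representativeness, read as "every part keeps at least one frame".
        chosen_parts = {part_of[i] for i in range(n) if labels[i]}
        if chosen_parts != parts:
            continue
        # Objective: maximize the sum of labels.
        if best is None or sum(labels) > sum(best):
            best = labels
    return None if best is None else [i for i in range(n) if best[i]]

# Five frames in three parts; frames 0/1 and frames 2/3 are near-duplicates.
part_of = [0, 0, 1, 1, 2]
e = {(0, 1): 1, (2, 3): 1}
key_frames = select_key_frames(5, e, part_of)
```

In this toy instance one frame is kept from each pair of near-duplicates together with the single frame of the third part. If two similar frames are the only members of two different parts, both constraints cannot be met and the function returns `None`.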

Fig. 4 shows a simulation experiment with the method of the invention. Fig. 4(a) is part of the frame sequence of the short video "indi009.mpg", and Fig. 4(b) shows the key frames extracted by this method at different scales s. It can be seen that the key frames reflect the essential content of the video.

Claims

1. A multiscale video key frame extraction method based on integer programming, characterized in that the method comprises the following steps:

(1) video graph modeling: the video is modeled as an undirected weighted graph;

2. The multiscale video key frame extraction method based on integer programming according to claim 1, characterized in that step (1) is implemented as follows:

2. SURF (Speeded-Up Robust Features) features of the video frames are extracted, and a function of the distance between the feature points of different frames is used as the edge weight, turning the graph abstracted from the video into a weighted graph.

3. The multiscale video key frame extraction method based on integer programming according to claim 1, characterized in that step (3) is implemented as follows:

4. The multiscale video key frame extraction method based on integer programming according to claim 2, characterized in that the distance function used in step (1) is a function that makes the weight inversely related to the distance.