CN103824284A - Key frame extraction method based on visual attention model and system - Google Patents

Key frame extraction method based on visual attention model and system

Info

Publication number
CN103824284A
Authority
CN
China
Prior art keywords
key
saliency
shot
frame
time domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410039072.7A
Other languages
Chinese (zh)
Other versions
CN103824284B (en)
Inventor
纪庆革
赵杰
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
National Sun Yat Sen University
Original Assignee
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd, National Sun Yat Sen University filed Critical Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Priority to CN201410039072.7A priority Critical patent/CN103824284B/en
Publication of CN103824284A publication Critical patent/CN103824284A/en
Application granted granted Critical
Publication of CN103824284B publication Critical patent/CN103824284B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a key frame extraction method and system based on a visual attention model. In the spatial domain, the method performs saliency detection by filtering the global contrast with binomial coefficients and extracts the target region with an adaptive threshold; this preserves the boundary of the salient target region well and yields uniform saliency inside the region. In the temporal domain, motion saliency is defined: target motion is estimated with homography matrices, key points are used in place of whole targets for saliency detection, the spatial-domain saliency data are fused in, and a boundary-extension method based on an energy function is proposed to obtain a bounding box as the temporal salient target region. Finally, the salient target region is used to reduce the richness of the video content, and a shot-adaptive method combined with on-line clustering is used to extract key frames.

Description

Key frame extraction method and system based on a visual attention model
Technical field
The present invention relates to the field of video analysis technology, and in particular to a key frame extraction method and system based on a visual attention model.
Background technology
With the rapid development of Internet technology we have entered an era of information explosion, and network applications and multimedia technology are advancing quickly and are widely used. Video, as a common carrier of network information, is vivid and intuitive and has strong appeal and expressive power, so it is used in every field and the volume of video data grows massively. Taking the well-known video site YouTube as an example, about 60 hours of video are uploaded by users every minute (figure taken on January 23, 2012), and the trend is still upward. How to store, manage and access massive video resources quickly and effectively has become a major issue for current video applications. Because video is correlated in time, under the traditional approach a user must browse a whole segment from beginning to end to grasp its content; irrelevant videos consume a great deal of the user's time and also waste a large amount of network bandwidth. We therefore need to attach auxiliary information to videos to help users filter them better. Mature systems currently rely on traditional textual labels: videos are classified manually and annotated with titles, descriptions and other textual semantics. Faced with massive video collections, this is not only labor-intensive, but different people understand a video differently, and others cannot judge from the author's labels whether a video matches their own interests.
Therefore, there is an urgent need for an automated way to summarize video effectively.
Summary of the invention
In order to overcome the deficiencies of the prior art, the present invention first provides a video key frame extraction method based on a visual attention model; with this method, key frames that represent a video shot well can be obtained effectively.
Another object of the present invention is to provide a video key frame extraction system based on a visual attention model.
To achieve these goals, the technical scheme of the present invention is:
A video key frame extraction method based on a visual attention model, comprising:
in the spatial domain, performing saliency detection by filtering the global contrast with binomial coefficients, and extracting the target region with an adaptive threshold; this preserves the boundary of the salient target region well and keeps the saliency inside the region relatively uniform;
in the temporal domain, defining motion saliency, estimating target motion with homography matrices, using key points in place of targets for saliency detection, fusing the spatial-domain saliency data, and proposing a boundary-extension method based on an energy function to obtain a bounding box as the temporal salient target region;
using the salient target region to reduce the richness of the video content, and extracting key frames with a shot-adaptive method combined with on-line clustering.
A video key frame extraction system based on a visual attention model, the system comprising a salient region extraction module and a key frame extraction module.
Specifically, the salient region extraction module comprises:
a spatial-domain salient region extraction module, for extracting the salient region in the spatial domain;
a temporal-domain key point saliency acquisition module, for extracting the saliency values of key points in the temporal domain;
a fusion module, for fusing the spatial-domain salient region with the temporal-domain key points and finally obtaining the salient region.
The key frame extraction module comprises:
a static-shot key frame extraction module, for key frame extraction from static shots;
a dynamic-shot key frame extraction module, for key frame extraction from dynamic shots;
a shot adaptation module, for switching control between the static-shot and dynamic-shot key frame extraction modules.
Compared with the prior art, the beneficial effect of the present invention is that it summarizes video automatically and effectively obtains key frames that represent the video shot well.
Accompanying drawing explanation
Fig. 1 is the key frame extraction flowchart for static shots of the present invention.
Fig. 2 is the key frame extraction flowchart for dynamic shots of the present invention.
Fig. 3 is the shot-adaptive key frame extraction flowchart of the present invention.
Embodiment
The present invention is further described in detail below with reference to the accompanying drawings.
An embodiment of the video key frame extraction method based on a visual attention model disclosed by the invention is as follows:
First, in the spatial domain, saliency detection is performed by filtering the global contrast with binomial coefficients, and the target region is extracted with an adaptive threshold. The specific method is as follows:
(11) The binomial coefficients are constructed from Pascal's triangle, and the normalization factor of the N-th layer is 2^N. The 4th layer is selected, so the filter kernel is B_4 = (1/16)[1 4 6 4 1];
(12) Let I be the original stimulus intensity, $\bar{I}$ the mean of the surrounding stimulus intensities, and $I_{B_4}$ the convolution of I with $B_4$. The stimulus at each pixel is represented as a vector in the CIELAB color space, and the contrast between two stimuli is the Euclidean distance between their CIELAB vectors, so the saliency detected for pixel (x, y) is
$$S(x, y) = \lVert I_{B_4}(x, y) - \bar{I} \rVert \qquad (1)$$
(13) After the saliency measurement set $S_s = (s_{11}, s_{12}, \ldots, s_{NM})$ is obtained, the target region is extracted with an adaptive threshold, where $s_{ij}$ (0 ≤ i ≤ N, 0 ≤ j ≤ M) is the saliency of pixel (i, j) and M and N are the width and height of the image, respectively.
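A minimal sketch of the spatial saliency of formula (1), assuming OpenCV and NumPy; taking $\bar{I}$ as the global mean Lab vector is an assumption made here, since the text only specifies "the mean of the surrounding stimulus intensities".

```python
import cv2
import numpy as np

def spatial_saliency(bgr):
    """Sketch of formula (1): saliency as the CIELAB distance between the
    binomially smoothed image and the mean stimulus intensity."""
    # 4th-layer binomial kernel from Pascal's triangle, normalized by 2^4
    b4 = np.array([1, 4, 6, 4, 1], dtype=np.float32) / 16.0

    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)

    # I_B4: convolve each channel with the separable binomial filter
    i_b4 = cv2.sepFilter2D(lab, ddepth=-1, kernelX=b4, kernelY=b4)

    # Ī: taken here as the global mean Lab vector (assumption, see above)
    i_bar = lab.reshape(-1, 3).mean(axis=0)

    # S(x, y) = || I_B4(x, y) - Ī ||  (Euclidean distance in CIELAB)
    return np.linalg.norm(i_b4 - i_bar, axis=2)
```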
Specifically, the adaptive-threshold extraction of the target region is realized as follows:
(21) Define the global saliency detection formula for pixel (x, y):
$$S_g(x, y) = \frac{1}{A} \sum_{i=0}^{N} \sum_{j=0}^{M} \lVert I_{B_4}(x, y) - I(i, j) \rVert \qquad (2)$$
where A is the detected area, $I_{B_4}(x, y)$ is the stimulus intensity of pixel (x, y) after the original image is filtered with $B_4$, I(i, j) is the original stimulus intensity of pixel (i, j), and M and N are the width and height of the image, respectively;
(22) Histograms are used to accelerate the computation. The original stimulus intensity I is mapped into the stimulus space $I_{B_4}(I)$, and the saliency of the stimulus $I_{B_4}(I)$ finally perceived by the user is
$$S(I_{B_4}(I)) = \frac{1}{(m-1)\, D(I_{B_4}(I))} \sum_{i=1}^{m} \Big( D(I_{B_4}(I)) - \lVert I_{B_4}(I) - I_{B_4}(I_i) \rVert \Big)\, S_g(I_{B_4}(I)) \qquad (3)$$
where $D(I_{B_4}(I)) = \sum_{i=1}^{m} \lVert I_{B_4}(I) - I_{B_4}(I_i) \rVert$ is the total distance between the stimulus $I_{B_4}(I)$ and its m nearest stimuli, and m is a manually set parameter; in this embodiment m = 8;
(23) Foreground and background regions are designated by varying the threshold $T_s$, and the threshold that minimizes the energy function is taken as the optimal threshold. The energy function with threshold $T_s$ is defined as
$$E(I, T_s, \lambda, \sigma) = \lambda \sum_{n=1}^{N} f(T_s, S_n)\, S_n + V(I, T_s, \sigma) \qquad (4)$$
where $S_n$ is obtained from formula (2), λ is the weight of the salient-target energy (λ = 1.0 in this embodiment), N is the total number of pixels of the image, $f(T_s, S_n) = \max(0, \operatorname{sign}(S_n - T_s))$, and $V(I, T_s, \sigma)$ measures the similarity to the surrounding stimuli: the point pairs Pair are formed by each salient point under the current $T_s$ and the pixels of its 8-neighborhood, and
$$V(I, T_s, \sigma) = \sum_{\{p, q\} \in Pair} \frac{1}{dist(p, q)}\, e^{-\lVert I_p - I_q \rVert / 2\sigma^2}$$
where dist(p, q) is the spatial distance between the two points and σ is a manually set parameter; in this embodiment σ = 10.0.
Therefore, given an image and its saliency map, $T_s$ is estimated by minimizing the energy function; a pixel is labeled 1 when it belongs to the salient target and 0 otherwise. The parameters λ and σ must be set manually in advance.
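Under the definitions above, the threshold search of step (23) might look like the sketch below. The coarse grid of candidate thresholds, the 4-neighborhood approximation of the 8-neighborhood pairs, and the unit pixel distance dist(p, q) = 1 are assumptions made here for brevity.

```python
import numpy as np

def energy(sal, lab, t_s, lam=1.0, sigma=10.0):
    """Sketch of formula (4) for one candidate threshold t_s.
    sal: saliency map S_g, lab: float CIELAB image."""
    fg = sal > t_s                      # f(T_s, S_n) = 1 where S_n > T_s
    e_target = lam * sal[fg].sum()      # first term of formula (4)

    # V(I, T_s, sigma): similarity between each salient pixel and its
    # neighbours; only the 4-neighbourhood is used here (assumption)
    v = 0.0
    h, w = sal.shape
    ys, xs = np.nonzero(fg)
    for y, x in zip(ys, xs):
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                diff = np.linalg.norm(lab[y, x] - lab[ny, nx])
                v += np.exp(-diff / (2 * sigma ** 2))  # dist(p, q) = 1 here
    return e_target + v

def adaptive_threshold(sal, lab, candidates=None):
    """Pick the threshold minimizing the energy over a coarse grid of
    candidate values (the grid itself is an assumption)."""
    if candidates is None:
        candidates = np.linspace(sal.min(), sal.max(), 16)
    return min(candidates, key=lambda t: energy(sal, lab, t))
```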
Then, in the temporal domain, motion saliency is defined: target motion is estimated with homography matrices, key points are used in place of targets for saliency detection, the spatial-domain saliency data are then fused in, and a boundary-extension method based on an energy function is proposed to obtain a bounding box as the temporal salient target region. The specific method is as follows:
(31) Given an image, the key points of the image are obtained with the FAST (Features from Accelerated Segment Test) feature point detection algorithm, which has good real-time performance;
(32) given two adjacent frames, FLANN (Fast Library for Approximate Nearest Neighbors) is used for fast key point matching;
(33) the motion of the key points is described with homography matrices H. Since a single H describes only one form of motion and a video segment contains various forms of motion, multiple H are needed to describe the different motions. In this embodiment the RANSAC algorithm is applied iteratively to obtain a series of homography estimates H = {H_1, H_2, ..., H_n};
(34) the temporal saliency of a key point is defined as
$$S_t(p_m) = \frac{A_m}{W \times H} \sum_{i=1}^{n} A_i\, D(p_m, H_i) \qquad (5)$$
where $A_m$ is the distribution area of all key points in motion state $H_m$, and W and H are the width and height of the video image;
(35) the spatial-domain saliency values and the obtained temporal saliency values of the key points are fused;
(36) a boundary-extension method based on the energy function is used to obtain a bounding box as the temporal salient target region.
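Steps (31)-(34) can be sketched roughly as below. ORB descriptors are computed on FAST corners so that FLANN (with an LSH index) has something to match; assigning a point's motion state via RANSAC inlier sets, measuring the area $A_i$ by the bounding rectangle of a state's key points, and taking $D(p_m, H_i)$ as the reprojection error under $H_i$ are all assumptions, since the patent does not spell them out.

```python
import cv2
import numpy as np

def match_keypoints(img1, img2):
    """FAST-based keypoints + ORB descriptors, matched with FLANN (LSH index)."""
    orb = cv2.ORB_create(nfeatures=1000)   # ORB detects FAST corners internally
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    flann = cv2.FlannBasedMatcher(
        dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),  # FLANN_INDEX_LSH
        dict(checks=50))
    matches = [m[0] for m in flann.knnMatch(des1, des2, k=2)
               if len(m) == 2 and m[0].distance < 0.8 * m[1].distance]
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])
    return src, dst

def estimate_motions(src, dst, max_models=5, min_inliers=8):
    """Iteratively fit homographies with RANSAC, removing inliers each round,
    to obtain the set H = {H_1, ..., H_n} of motion states."""
    models, groups = [], []
    while len(src) >= min_inliers and len(models) < max_models:
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is None or mask.sum() < min_inliers:
            break
        inl = mask.ravel().astype(bool)
        models.append(H)
        groups.append((src[inl], dst[inl]))
        src, dst = src[~inl], dst[~inl]   # fit the next motion on the remainder
    return models, groups

def temporal_saliency(models, groups, frame_shape):
    """Sketch of formula (5). A_i is the bounding-rectangle area of a motion
    state's keypoints; D(p_m, H_i) is the reprojection error of p_m under H_i
    (both assumed forms)."""
    h, w = frame_shape[:2]
    areas = [cv2.boundingRect(g[0])[2] * cv2.boundingRect(g[0])[3] for g in groups]
    saliency = {}
    for m, (pts, dsts) in enumerate(groups):
        for p, d in zip(pts, dsts):
            acc = 0.0
            for i, H in enumerate(models):
                proj = H @ np.array([p[0], p[1], 1.0])
                proj = proj[:2] / proj[2]
                acc += areas[i] * np.linalg.norm(proj - d)
            saliency[tuple(p)] = areas[m] / (w * h) * acc
    return saliency
```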
Specifically, the fusion of the spatial-domain saliency values with the obtained temporal saliency values of the key points is realized as follows:
(41) a motion-saliency contrast is defined from the key-point temporal saliency value $S_t$ obtained by formula (5) and the mean of the key-point temporal saliency values;
(42) the motion saliency should target objects that also have strong discriminability in the spatial domain, so the range over which the temporal saliency $S_t$ is accumulated must be restricted: letting $p_i$ be the i-th key point of $S_t$, $p_i$ must satisfy a condition relating its spatial saliency to the mean spatial saliency value;
(43) a temporal-domain weight and a spatial-domain weight are defined, and the temporal and spatial saliency values of the key points satisfying (42) are added with these weights.
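Because the weight expressions of steps (41)-(43) appear only in figures not reproduced in this text, the sketch below leaves them as free parameters w_t and w_s; approximating the condition of step (42) by requiring a key point's spatial saliency to exceed the spatial mean is likewise an assumption.

```python
import numpy as np

def fuse_saliency(spatial_sal, keypoint_temporal_sal, w_t=0.5, w_s=0.5):
    """Weighted fusion of spatial and temporal saliency at the key points.
    spatial_sal: 2-D spatial saliency map S_s.
    keypoint_temporal_sal: dict {(x, y): S_t} from formula (5).
    w_t, w_s: temporal/spatial weights (placeholders; the patent defines
    them in figures that are not reproduced here)."""
    s_mean = spatial_sal.mean()          # mean spatial saliency value
    fused = {}
    for (x, y), s_t in keypoint_temporal_sal.items():
        s_s = spatial_sal[int(y), int(x)]
        # step (42): keep only key points that are also salient in the
        # spatial domain (above the spatial mean here, an assumed form)
        if s_s >= s_mean:
            fused[(x, y)] = w_t * s_t + w_s * s_s
    return fused
```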
Specifically, the temporal salient target region is extracted as follows:
The spatially salient key point p is taken as the seed point, and the seed region uses a rectangular bounding box B. Let $b_i$ be the four sides of the bounding box B, with i ∈ {1, 2, 3, 4} numbering the top, bottom, left and right sides. The boundary-extension algorithm is as follows:
Initialization: all four corners of the bounding box B are set to the position of the key point p, so that p is an interior point of B.
Step 1: in increasing order from i = 1, compute the saliency energy $E_{outer}(i)$ on the outer boundary of $b_i$ and the saliency energy $E_{inner}(i)$ on its inner boundary, where the energy is computed as in formula (4); then compute the weight w(i) by which the boundary may be extended outward, where $l_i$ is the length of the i-th side of the current bounding box B.
Step 2: if w(i) ≥ ε, the i-th side is extended outward by one pixel unit. ε is the threshold for the extension decision and must be set in advance; in the experiments here it is set to $0.8\,T_s'$, where $T_s'$ is the mean spatial saliency inside the bounding box.
Step 3: if no new side was extended in Step 2, terminate the algorithm and output the bounding box B; otherwise, repeat Step 1 and Step 2.
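A sketch of the boundary-extension loop follows. The outward-extension weight w(i) is taken here as the energy difference per unit side length, (E_outer(i) − E_inner(i)) / l_i, which is an assumed form since the original expression is given only in a figure; strip_energy is assumed to evaluate the energy of formula (4) on a one-pixel-wide strip.

```python
def extend_bounding_box(seed, strip_energy, img_w, img_h, eps):
    """Sketch of the boundary extension around a seed key point.
    strip_energy(x0, y0, x1, y1): assumed callable evaluating the energy of
    formula (4) on the one-pixel strip [x0, x1) x [y0, y1)."""
    x, y = seed
    left, top, right, bottom = x, y, x + 1, y + 1   # box initialized at p

    def side_strips(i):
        # (outer strip, inner strip, side length) for side i = 0..3
        # ordered top, bottom, left, right (sides 1..4 in the text)
        if i == 0:
            return (left, top - 1, right, top), (left, top, right, top + 1), right - left
        if i == 1:
            return (left, bottom, right, bottom + 1), (left, bottom - 1, right, bottom), right - left
        if i == 2:
            return (left - 1, top, left, bottom), (left, top, left + 1, bottom), bottom - top
        return (right, top, right + 1, bottom), (right - 1, top, right, bottom), bottom - top

    grown = True
    while grown:
        grown = False
        for i in range(4):
            # skip sides that already touch the image border
            if (i == 0 and top == 0) or (i == 1 and bottom == img_h) \
               or (i == 2 and left == 0) or (i == 3 and right == img_w):
                continue
            outer, inner, length = side_strips(i)
            # assumed form of the extension weight (original is in a figure)
            w = (strip_energy(*outer) - strip_energy(*inner)) / length
            if w >= eps:                 # eps plays the role of 0.8 * T_s'
                grown = True
                if i == 0:
                    top -= 1
                elif i == 1:
                    bottom += 1
                elif i == 2:
                    left -= 1
                else:
                    right += 1
    return left, top, right, bottom
```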
Finally, the salient target region is used to reduce the richness of the video content, and key frames are extracted with a shot-adaptive method combined with on-line clustering. The specific method is as follows:
(51) The RGB color space of the salient region is converted to the HSV color space, and the H (hue) and S (saturation) components are used to compute a hue-saturation histogram. Let $H_p(i)$ be the i-th bin value of the hue-saturation histogram of the salient target region of frame p. This embodiment uses the Bhattacharyya distance to measure the visual distance between two frames:
$$D_{sal}(p, q) = \sqrt{1 - \frac{\sum_i \sqrt{H_p(i)\, H_q(i)}}{\sqrt{\sum_i H_p(i) \sum_i H_q(i)}}};$$
(52) key frames are extracted with a shot-adaptive method combined with on-line clustering, with the clustering mode for static shots as the primary mode and the clustering mode for dynamic shots as the auxiliary mode. For a static shot, on-line clustering is performed on the basis of the hue-saturation histogram of the salient region, and any frame in a cluster is chosen as the key frame. For a dynamic shot, the salient moving target is first tracked, the tracking of the salient moving target then serves as the basis for on-line clustering, and the position information of the salient target serves as the basis for extracting the key frame from a cluster.
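A minimal sketch of the frame-distance computation of step (51), assuming OpenCV's H and S channel ranges and a 30×32 bin layout (the bin counts are not specified in the patent).

```python
import cv2
import numpy as np

def hue_sat_histogram(bgr_region, h_bins=30, s_bins=32):
    """Hue-saturation histogram of a salient target region (bin counts assumed)."""
    hsv = cv2.cvtColor(bgr_region, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def visual_distance(h_p, h_q):
    """Bhattacharyya distance D_sal(p, q) between two hue-saturation histograms."""
    bc = np.sum(np.sqrt(h_p * h_q)) / np.sqrt(np.sum(h_p) * np.sum(h_q))
    return np.sqrt(max(0.0, 1.0 - bc))
```

For histograms prepared this way, OpenCV's cv2.compareHist(h_p, h_q, cv2.HISTCMP_BHATTACHARYYA) computes essentially the same distance.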
Specifically, as shown in Fig. 1, the on-line clustering of a static shot is realized by the following steps:
Initialization: compute the hue-saturation histogram $f_1$ of the first frame of the static shot, set the initial cluster count N = 1, and take $f_1$ as the centroid vector $C_1$ of cluster $Cell_1$, i.e. $C_1 = f_1$.
S11: if the current frame p belongs to a static shot, compute the hue-saturation histogram $H_p$ of the current frame.
S12: compute the visual distance between p and each cluster centroid and find the minimum, $m = \arg\min_n \{ D_{sal}(p, C_n) \mid 1 \le n \le N \}$, where m is the index of the cluster.
S13: compare $D_{sal}(p, C_m)$ with the threshold $\varepsilon_c$. When $D_{sal}(p, C_m) \le \varepsilon_c$, p is assigned to cluster $Cell_m$ and $H_p$ replaces the centroid of $Cell_m$; otherwise a new cluster $Cell_{N+1}$ is created with $H_p$ as its centroid vector $C_{N+1}$, and finally the cluster count is updated to N = N + 1.
S14: repeat S11, S12 and S13 for all static-shot frames.
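The on-line clustering of Fig. 1 reduces to a few lines; hue_sat_histogram and visual_distance are the sketches given above, and the threshold ε_c is left as a parameter.

```python
import numpy as np

def cluster_static_shot(frames, eps_c):
    """On-line clustering of static-shot frames (S11-S14). Returns the list of
    clusters as frame-index lists; any frame of a cluster may be the key frame."""
    centroids, clusters = [], []
    for idx, frame in enumerate(frames):
        h_p = hue_sat_histogram(frame)     # sketch from step (51) above
        if not centroids:
            centroids.append(h_p)          # C_1 = f_1
            clusters.append([idx])
            continue
        dists = [visual_distance(h_p, c) for c in centroids]
        m = int(np.argmin(dists))          # nearest cluster centroid
        if dists[m] <= eps_c:
            clusters[m].append(idx)
            centroids[m] = h_p             # newest member replaces the centroid
        else:
            centroids.append(h_p)          # new cluster Cell_{N+1}
            clusters.append([idx])
    return clusters
```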
Specifically, as shown in Fig. 2, the key frame extraction for a dynamic shot is realized by the following steps:
Initialization: obtain the first frame of the dynamic shot.
S21: obtain the tracking target region, initialize or resample the particles, and fetch the next video frame; if the frame is empty, terminate.
S22: obtain the FAST feature vectors and match them with the FLANN algorithm, updating the feature vector weights; if there are not enough feature vectors, terminate.
S23: update the weight of each particle, compute the key frame weight and the target region, and jump back to S21.
The key frame extraction system based on a visual attention model disclosed by the invention comprises a salient region extraction module and a key frame extraction module.
The salient region extraction module comprises:
a spatial-domain salient region extraction module, for extracting the salient region in the spatial domain;
a temporal-domain key point saliency acquisition module, for extracting the saliency values of key points in the temporal domain;
a fusion module, for fusing the spatial-domain salient region with the temporal-domain key points and finally obtaining the salient region.
The key frame extraction module comprises:
a static-shot key frame extraction module, for key frame extraction from static shots;
a dynamic-shot key frame extraction module, for key frame extraction from dynamic shots;
a shot adaptation module, for switching control between the static-shot and dynamic-shot key frame extraction modules.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. It should be understood that the present invention is not limited to the implementations described herein; these implementations are described to help those skilled in the art practice the invention. Any person skilled in the art can easily make further improvements and refinements without departing from the spirit and scope of the present invention; therefore any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the claims of the present invention.

Claims (10)

1. A key frame extraction method based on a visual attention model, for extracting key frames from a video, characterized by comprising:
in the spatial domain, performing saliency detection by filtering the global contrast with binomial coefficients, and extracting the target region with an adaptive threshold;
in the temporal domain, defining motion saliency, estimating target motion with homography matrices, using key points in place of targets for saliency detection, fusing the spatial-domain saliency data, and proposing a boundary-extension method based on an energy function to obtain a bounding box as the temporal salient target region;
using the salient target region to reduce the richness of the video content, and extracting key frames with a shot-adaptive method combined with on-line clustering.
2. The method according to claim 1, characterized in that, in the spatial domain, saliency detection is performed by filtering the global contrast with binomial coefficients and the target region is extracted with an adaptive threshold, the specific method being as follows:
(11) the binomial coefficients are constructed from Pascal's triangle, and the normalization factor of the N-th layer is 2^N; the 4th layer is selected, giving the filter kernel B_4 = (1/16)[1 4 6 4 1];
(12) let I be the original stimulus intensity, $\bar{I}$ the mean of the surrounding stimulus intensities, and $I_{B_4}$ the convolution of I with $B_4$; the stimulus at each pixel is represented as a vector in the CIELAB color space, and the contrast between two stimuli is the Euclidean distance between their CIELAB vectors, so the saliency detected for pixel (x, y) is
$$S(x, y) = \lVert I_{B_4}(x, y) - \bar{I} \rVert \qquad (1)$$
(13) after the saliency measurement set $S_s = (s_{11}, s_{12}, \ldots, s_{NM})$ is obtained, the target region is extracted with an adaptive threshold, where $s_{ij}$ is the saliency of pixel (i, j), 0 ≤ i ≤ N, 0 ≤ j ≤ M, and M and N are the width and height of the image, respectively.
3. The method according to claim 2, characterized in that the adaptive-threshold extraction of the target region is realized as follows:
(21) define the global saliency detection formula for pixel (x, y):
$$S_g(x, y) = \frac{1}{A} \sum_{i=0}^{N} \sum_{j=0}^{M} \lVert I_{B_4}(x, y) - I(i, j) \rVert \qquad (2)$$
where A is the detected area, $I_{B_4}(x, y)$ is the stimulus intensity of pixel (x, y) after the original image is filtered with $B_4$, I(i, j) is the original stimulus intensity of pixel (i, j), and M and N are the width and height of the image, respectively;
(22) histograms are used to accelerate the computation; the original stimulus intensity I is mapped into the stimulus space $I_{B_4}(I)$, and the saliency of the stimulus $I_{B_4}(I)$ finally perceived by the user is
$$S(I_{B_4}(I)) = \frac{1}{(m-1)\, D(I_{B_4}(I))} \sum_{i=1}^{m} \Big( D(I_{B_4}(I)) - \lVert I_{B_4}(I) - I_{B_4}(I_i) \rVert \Big)\, S_g(I_{B_4}(I)) \qquad (3)$$
where $D(I_{B_4}(I)) = \sum_{i=1}^{m} \lVert I_{B_4}(I) - I_{B_4}(I_i) \rVert$ is the total distance between the stimulus $I_{B_4}(I)$ and its m nearest stimuli;
(23) foreground and background regions are designated by varying the threshold $T_s$, and the threshold that minimizes the energy function is taken as the optimal threshold; the energy function with threshold $T_s$ is defined as
$$E(I, T_s, \lambda, \sigma) = \lambda \sum_{n=1}^{N} f(T_s, S_n)\, S_n + V(I, T_s, \sigma) \qquad (4)$$
where $S_n$ is obtained from formula (2), λ is the weight of the salient-target energy, N is the total number of pixels of the image, $f(T_s, S_n) = \max(0, \operatorname{sign}(S_n - T_s))$, and $V(I, T_s, \sigma)$ measures the similarity to the surrounding stimuli: the point pairs Pair are formed by each salient point under the current $T_s$ and the pixels of its 8-neighborhood,
$$V(I, T_s, \sigma) = \sum_{\{p, q\} \in Pair} \frac{1}{dist(p, q)}\, e^{-\lVert I_p - I_q \rVert / 2\sigma^2}$$
dist(p, q) is the spatial distance between the two points, and σ is a control parameter;
given an image and its saliency map, $T_s$ is estimated by minimizing the energy function; a pixel is labeled 1 when it belongs to the salient target and 0 otherwise.
4. The method according to claim 1, characterized in that, in the temporal domain, motion saliency is defined, target motion is estimated with homography matrices, key points are used in place of targets for saliency detection, the spatial-domain saliency data are then fused in, and a boundary-extension method based on an energy function is proposed to obtain a bounding box as the temporal salient target region, the specific method being as follows:
(31) given an image, the key points of the image are obtained with the FAST feature point detection algorithm, which has good real-time performance;
(32) given two adjacent frames, FLANN is used for fast key point matching;
(33) the motion of the key points is described with multiple homography matrices H; the RANSAC algorithm is applied iteratively to obtain a series of homography estimates H = {H_1, H_2, ..., H_n};
(34) the temporal saliency of a key point is defined as
$$S_t(p_m) = \frac{A_m}{W \times H} \sum_{i=1}^{n} A_i\, D(p_m, H_i) \qquad (5)$$
where $A_m$ is the distribution area of all key points in motion state $H_m$, and W and H are the width and height of the video image;
(35) the spatial-domain saliency values and the obtained temporal saliency values of the key points are fused;
(36) a boundary-extension method based on the energy function is used to obtain a bounding box as the temporal salient target region.
5. The method according to claim 4, characterized in that the fusion of the spatial-domain saliency values with the obtained temporal saliency values of the key points is realized as follows:
(41) a motion-saliency contrast is defined from the key-point temporal saliency value $S_t$ obtained by formula (5) and the mean of the key-point temporal saliency values;
(42) letting $p_i$ be the i-th key point of $S_t$, $p_i$ must satisfy a condition relating its spatial saliency to the mean spatial saliency value;
(43) a temporal-domain weight and a spatial-domain weight are defined, and the temporal and spatial saliency values of the key points satisfying step (42) are added with these weights.
6. The method according to claim 4, characterized in that the temporal salient target region is extracted as follows:
the spatially salient key point p is taken as the seed point, and the seed region uses a rectangular bounding box B; let $b_i$ be the four sides of the bounding box B, with i ∈ {1, 2, 3, 4} numbering the top, bottom, left and right sides; the boundary-extension algorithm is as follows:
Initialization: all four corners of the bounding box B are set to the position of the key point p, so that p is an interior point of B;
Step 1: in increasing order from i = 1, compute the saliency energy $E_{outer}(i)$ on the outer boundary of $b_i$ and the saliency energy $E_{inner}(i)$ on its inner boundary, where the energy is computed as in formula (4); then compute the weight w(i) by which the boundary is extended outward, where $l_i$ is the length of the i-th side of the current bounding box B;
Step 2: if w(i) ≥ ε, the i-th side is extended outward by one pixel unit; ε is a preset threshold for the extension decision, set to $0.8\,T_s'$, where $T_s'$ is the mean spatial saliency inside the bounding box;
Step 3: if no new side was extended in Step 2, terminate the algorithm and output the bounding box B; otherwise, repeat Step 1 and Step 2.
7. The method according to claim 1, characterized in that the salient target region is used to reduce the richness of the video content and key frames are extracted with a shot-adaptive method combined with on-line clustering, the specific method being as follows:
(51) the RGB color space of the salient region is converted to the HSV color space, and the H and S components are used to compute a hue-saturation histogram; let $H_p(i)$ be the i-th bin value of the hue-saturation histogram of the salient target region of frame p; the Bhattacharyya distance is used to measure the visual distance between two frames p and q:
$$D_{sal}(p, q) = \sqrt{1 - \frac{\sum_i \sqrt{H_p(i)\, H_q(i)}}{\sqrt{\sum_i H_p(i) \sum_i H_q(i)}}};$$
(52) key frames are extracted with a shot-adaptive method combined with on-line clustering, with the clustering mode for static shots as the primary mode and the clustering mode for dynamic shots as the auxiliary mode;
for a static shot, on-line clustering is performed on the basis of the hue-saturation histogram of the salient region, and any frame in a cluster is chosen as the key frame;
for a dynamic shot, the salient moving target is first tracked, the tracking of the salient moving target then serves as the basis for on-line clustering, and the position information of the salient target serves as the basis for extracting the key frame from a cluster.
8. The method according to claim 7, characterized in that the on-line clustering of a static shot is realized by the following steps:
Initialization: compute the hue-saturation histogram $f_1$ of the first frame of the static shot, set the initial cluster count N = 1, and take $f_1$ as the centroid vector $C_1$ of cluster $Cell_1$, i.e. $C_1 = f_1$;
S11: if the current frame p belongs to a static shot, compute the hue-saturation histogram $H_p$ of the current frame;
S12: compute the visual distance between p and each cluster centroid and find the minimum, $m = \arg\min_n \{ D_{sal}(p, C_n) \mid 1 \le n \le N \}$, where m is the index of the cluster;
S13: compare $D_{sal}(p, C_m)$ with the threshold $\varepsilon_c$; when $D_{sal}(p, C_m) \le \varepsilon_c$, p is assigned to cluster $Cell_m$ and $H_p$ replaces the centroid of $Cell_m$; otherwise a new cluster $Cell_{N+1}$ is created with $H_p$ as its centroid vector $C_{N+1}$, and finally the cluster count is updated to N = N + 1;
S14: repeat S11, S12 and S13 for all static-shot frames.
9. The method according to claim 7, characterized in that the key frame extraction for a dynamic shot is realized by the following steps:
Initialization: obtain the first frame of the dynamic shot;
S21: obtain the tracking target region, initialize or resample the particles, and fetch the next video frame; if the frame is empty, terminate;
S22: obtain the FAST feature vectors and match them with the FLANN algorithm, updating the feature vector weights; if there are not enough feature vectors, terminate;
S23: update the weight of each particle, compute the key frame weight and the target region, and jump back to S21.
10. A key frame extraction system based on a visual attention model, characterized by comprising a salient region extraction module and a key frame extraction module;
the salient region extraction module comprises:
a spatial-domain salient region extraction module, for extracting the salient region in the spatial domain;
a temporal-domain key point saliency acquisition module, for extracting the saliency values of key points in the temporal domain;
a fusion module, for fusing the spatial-domain salient region with the temporal-domain key points and finally obtaining the salient region;
the key frame extraction module comprises:
a static-shot key frame extraction module, for key frame extraction from static shots;
a dynamic-shot key frame extraction module, for key frame extraction from dynamic shots;
a shot adaptation module, for switching control between the static-shot and dynamic-shot key frame extraction modules.
CN201410039072.7A 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system Expired - Fee Related CN103824284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410039072.7A CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410039072.7A CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Publications (2)

Publication Number Publication Date
CN103824284A true CN103824284A (en) 2014-05-28
CN103824284B CN103824284B (en) 2017-05-10

Family

ID=50759326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410039072.7A Expired - Fee Related CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Country Status (1)

Country Link
CN (1) CN103824284B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598908A (en) * 2014-09-26 2015-05-06 浙江理工大学 Method for recognizing diseases of crop leaves
CN104778721A (en) * 2015-05-08 2015-07-15 哈尔滨工业大学 Distance measuring method of significant target in binocular image
CN105472380A (en) * 2015-11-19 2016-04-06 国家新闻出版广电总局广播科学研究院 Compression domain significance detection algorithm based on ant colony algorithm
CN106210444A (en) * 2016-07-04 2016-12-07 石家庄铁道大学 Kinestate self adaptation key frame extracting method
CN107967476A (en) * 2017-12-05 2018-04-27 北京工业大学 A kind of method that image turns sound
CN110197107A (en) * 2018-08-17 2019-09-03 平安科技(深圳)有限公司 Micro- expression recognition method, device, computer equipment and storage medium
CN110322474A (en) * 2019-07-11 2019-10-11 史彩成 A kind of image motive target real-time detection method based on unmanned aerial vehicle platform
CN110399847A (en) * 2019-07-30 2019-11-01 北京字节跳动网络技术有限公司 Extraction method of key frame, device and electronic equipment
CN111191650A (en) * 2019-12-30 2020-05-22 北京市新技术应用研究所 Object positioning method and system based on RGB-D image visual saliency
CN111493935A (en) * 2020-04-29 2020-08-07 中国人民解放军总医院 Artificial intelligence-based automatic prediction and identification method and system for echocardiogram
CN112418012A (en) * 2020-11-09 2021-02-26 武汉大学 Video abstract generation method based on space-time attention model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030184579A1 (en) * 2002-03-29 2003-10-02 Hong-Jiang Zhang System and method for producing a video skim
EP2207111A1 (en) * 2009-01-08 2010-07-14 Thomson Licensing SA Method and apparatus for generating and displaying a video abstract
CN102088597A (en) * 2009-12-04 2011-06-08 成都信息工程学院 Method for estimating video visual salience through dynamic and static combination
CN102695056A (en) * 2012-05-23 2012-09-26 中山大学 Method for extracting compressed video key frames

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030184579A1 (en) * 2002-03-29 2003-10-02 Hong-Jiang Zhang System and method for producing a video skim
EP2207111A1 (en) * 2009-01-08 2010-07-14 Thomson Licensing SA Method and apparatus for generating and displaying a video abstract
CN102088597A (en) * 2009-12-04 2011-06-08 成都信息工程学院 Method for estimating video visual salience through dynamic and static combination
CN102695056A (en) * 2012-05-23 2012-09-26 中山大学 Method for extracting compressed video key frames

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NAVEED EJAZ ET AL.: "Efficient visual attention based framework for extracting key frames from videos", SIGNAL PROCESSING: IMAGE COMMUNICATION, 17 October 2012 (2012-10-17), pages 34 - 44 *
YUN ZHAI ET AL.: "Visual attention detection in video sequences using spatiotemporal cues", 《PROCEEDINGS OF THE 14TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》, 31 October 2006 (2006-10-31) *
蒋鹏 et al.: "Adaptive video key frame extraction based on a visual attention model" (基于视觉注意模型的自适应视频关键帧提取), Journal of Image and Graphics (中国图象图形学报), vol. 14, no. 8, 31 August 2009 (2009-08-31) *
贾云得: "Machine Vision" (机器视觉), 30 April 2000, section "Gaussian filter design" (高斯滤波器设计), page 76 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598908B (en) * 2014-09-26 2017-11-28 浙江理工大学 A kind of crops leaf diseases recognition methods
CN104598908A (en) * 2014-09-26 2015-05-06 浙江理工大学 Method for recognizing diseases of crop leaves
CN104778721A (en) * 2015-05-08 2015-07-15 哈尔滨工业大学 Distance measuring method of significant target in binocular image
CN104778721B (en) * 2015-05-08 2017-08-11 广州小鹏汽车科技有限公司 The distance measurement method of conspicuousness target in a kind of binocular image
CN105472380A (en) * 2015-11-19 2016-04-06 国家新闻出版广电总局广播科学研究院 Compression domain significance detection algorithm based on ant colony algorithm
CN106210444A (en) * 2016-07-04 2016-12-07 石家庄铁道大学 Kinestate self adaptation key frame extracting method
CN106210444B (en) * 2016-07-04 2018-10-30 石家庄铁道大学 Motion state self adaptation key frame extracting method
CN107967476B (en) * 2017-12-05 2021-09-10 北京工业大学 Method for converting image into sound
CN107967476A (en) * 2017-12-05 2018-04-27 北京工业大学 A kind of method that image turns sound
CN110197107A (en) * 2018-08-17 2019-09-03 平安科技(深圳)有限公司 Micro- expression recognition method, device, computer equipment and storage medium
CN110322474A (en) * 2019-07-11 2019-10-11 史彩成 A kind of image motive target real-time detection method based on unmanned aerial vehicle platform
CN110399847A (en) * 2019-07-30 2019-11-01 北京字节跳动网络技术有限公司 Extraction method of key frame, device and electronic equipment
CN110399847B (en) * 2019-07-30 2021-11-09 北京字节跳动网络技术有限公司 Key frame extraction method and device and electronic equipment
CN111191650A (en) * 2019-12-30 2020-05-22 北京市新技术应用研究所 Object positioning method and system based on RGB-D image visual saliency
CN111191650B (en) * 2019-12-30 2023-07-21 北京市新技术应用研究所 Article positioning method and system based on RGB-D image visual saliency
CN111493935A (en) * 2020-04-29 2020-08-07 中国人民解放军总医院 Artificial intelligence-based automatic prediction and identification method and system for echocardiogram
CN112418012A (en) * 2020-11-09 2021-02-26 武汉大学 Video abstract generation method based on space-time attention model
CN112418012B (en) * 2020-11-09 2022-06-07 武汉大学 Video abstract generation method based on space-time attention model

Also Published As

Publication number Publication date
CN103824284B (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN103824284A (en) Key frame extraction method based on visual attention model and system
CN110111335B (en) Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning
CN103578119B (en) Target detection method in Codebook dynamic scene based on superpixels
CN104094279B (en) Large-range-first cross-camera visual target re-identification method
CN106997597B (en) It is a kind of based on have supervision conspicuousness detection method for tracking target
CN102567731B (en) Extraction method for region of interest
CN101315663B (en) Nature scene image classification method based on area dormant semantic characteristic
Lo et al. Assessment of photo aesthetics with efficiency
CN103020985B (en) A kind of video image conspicuousness detection method based on field-quantity analysis
CN101477633B (en) Method for automatically estimating visual significance of image and video
CN101470809B (en) Moving object detection method based on expansion mixed gauss model
CN103020992B (en) A kind of video image conspicuousness detection method based on motion color-associations
CN103208115B (en) Based on the saliency method for detecting area of geodesic line distance
CN104103082A (en) Image saliency detection method based on region description and priori knowledge
CN102088597B (en) Method for estimating video visual salience through dynamic and static combination
CN103632153B (en) Region-based image saliency map extracting method
CN102142147A (en) Device and method for analyzing site content as well as device and method for detecting and tracking target
CN103226824B (en) Maintain the video Redirectional system of vision significance
CN109544561A (en) Cell mask method, system and device
Yi et al. Realistic action recognition with salient foreground trajectories
CN103020614A (en) Human movement identification method based on spatio-temporal interest point detection
CN108829711A (en) A kind of image search method based on multi-feature fusion
CN103578107A (en) Method for interactive image segmentation
CN103077383B (en) Based on the human motion identification method of the Divisional of spatio-temporal gradient feature
CN103218829B (en) A kind of foreground extracting method being adapted to dynamic background

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170510