CN103824284B - Key frame extraction method based on visual attention model and system - Google Patents

Key frame extraction method based on visual attention model and system

Info

Publication number
CN103824284B
CN103824284B (application CN201410039072.7A)
Authority
CN
China
Prior art keywords
key frame
shot
saliency
temporal domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410039072.7A
Other languages
Chinese (zh)
Other versions
CN103824284A (en)
Inventor
纪庆革
赵杰
刘勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
National Sun Yat Sen University
Original Assignee
Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd, National Sun Yat Sen University filed Critical Guangzhou Zhongda Nansha Technology Innovation Industrial Park Co Ltd
Priority to CN201410039072.7A
Publication of CN103824284A
Application granted
Publication of CN103824284B
Legal status: Expired - Fee Related

Abstract

The invention discloses a key frame extraction method and system based on a visual attention model. In the spatial domain, the method filters global contrast with binomial coefficients to detect saliency and extracts the target region with an adaptive threshold; this preserves the boundary of the salient target region well and yields uniform saliency within the region. In the temporal domain, the method defines motion saliency, estimates target motion with homography matrices, and uses key points in place of the full target for saliency detection. The spatial saliency data are then fused with the temporal data, and a boundary-extension method based on an energy function is proposed to obtain a bounding box that serves as the salient target region of the temporal domain. Finally, the method uses the salient target region to reduce the richness of the video and extracts key frames with a shot-adaptive method based on online clustering.

Description

Key frame extraction method and system based on a visual attention model
Technical field
The present invention relates to the field of video analysis technology, and in particular to a key frame extraction method and system based on a visual attention model.
Background technology
With the rapid development of Internet technology we have entered an era of information explosion, and network applications and multimedia technology are advancing quickly and are widely used. As a common carrier of network information, video is vivid and intuitive and has strong appeal and expressiveness, so it is used in many fields and video data are growing massively. Taking the well-known video website YouTube as an example, about 60 hours of video are uploaded by users every minute (data as of January 23, 2012), and the trend is still upward. How to store, manage and access massive video resources quickly and effectively has therefore become a major issue for current video applications. Because video has temporal continuity, under the traditional approach a user must browse a whole segment of video from beginning to end to grasp its content; irrelevant videos occupy a great deal of the user's time and waste considerable network bandwidth. Auxiliary information therefore needs to be added to videos to help users filter them. Mature systems currently rely on traditional textual labels: videos are classified manually and given artificial semantics such as titles and descriptions. Faced with massive video collections, this task is not only labor-intensive, but different people also understand the same video differently, and others cannot judge from the author's textual labels whether a video matches their own interests.
Therefore, an automated way to summarize videos effectively is urgently needed.
Summary of the invention
To overcome the deficiencies of the prior art, the present invention first provides a video key frame extraction method based on a visual attention model, with which key frames that represent a video shot well can be obtained effectively.
A further object of the present invention is to provide a video key frame extraction system based on a visual attention model.
To achieve these goals, the technical scheme of the present invention is:
A video key frame extraction method based on a visual attention model, including:
In the spatial domain, detecting saliency by filtering global contrast with binomial coefficients, and extracting the target region with an adaptive threshold; this preserves the boundary of the salient target region well and makes the saliency within the region more uniform.
In the temporal domain, defining motion saliency, estimating target motion with homography matrices, using key points in place of the target for saliency detection, fusing the spatial saliency data, and obtaining a bounding box as the salient target region of the temporal domain by a proposed boundary-extension method based on an energy function;
Reducing the richness of the video by means of the salient target region, and extracting key frames with a shot-adaptive method based on online clustering.
A video key frame extraction system based on a visual attention model, the system including a salient region extraction module and a key frame extraction module;
Specifically, the salient region extraction module includes:
a spatial salient region extraction module, for extracting the salient region in the spatial domain;
a temporal key point saliency acquisition module, for extracting the saliency values of the key points in the temporal domain;
a fusion module, for fusing the spatial salient region with the temporal key points to obtain the final salient region.
The key frame extraction module includes:
a static shot key frame extraction module, for key frame extraction from static shots;
a dynamic shot key frame extraction module, for key frame extraction from dynamic shots;
a shot adaptation module, for switching between the static shot key frame extraction module and the dynamic shot key frame extraction module.
Compared with the prior art, the beneficial effects of the present invention are: a video can be summarized automatically, and key frames that represent the video shots well can be obtained effectively.
Description of the drawings
Fig. 1 is a flowchart of key frame extraction for static shots according to the present invention.
Fig. 2 is a flowchart of key frame extraction for dynamic shots according to the present invention.
Fig. 3 is a flowchart of shot-adaptive key frame extraction according to the present invention.
Specific embodiment
The present invention is further described in detail below with reference to the accompanying drawings.
A specific embodiment of the video key frame extraction method based on a visual attention model disclosed by the present invention is as follows:
First, in the spatial domain, saliency is detected by filtering global contrast with binomial coefficients, and the target region is extracted with an adaptive threshold. The concrete method is as follows:
(11) The binomial coefficients are constructed from Pascal's triangle, and the normalization factor of the Nth layer is 2^N. The 4th layer is selected, so the filter coefficients are B4 = (1/16)[1 4 6 4 1];
(12) Let I be the original stimulus intensity, Ī the mean of the surrounding stimulus intensities, and I_B4 the convolution of I with B4. The strength of a pixel's stimulus is measured with the vector form of the CIELAB color space, and the contrast of a stimulus is the Euclidean distance between two CIELAB vectors, so the stimulus of pixel (x, y) is detected as
S(x, y) = ||I_B4(x, y) − Ī||    (1)
(13) After the saliency measurement set S_s = (s11, s12, ..., sNM) is obtained, the target region is extracted with an adaptive threshold, where sij (0 ≤ i ≤ N, 0 ≤ j ≤ M) is the saliency of pixel (i, j) and M and N are the width and height of the image, respectively.
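By way of illustration only, steps (11)-(12) can be sketched as follows, assuming an OpenCV/NumPy environment; the function name, the use of the global image mean for Ī, and the separable application of B4 are assumptions of this sketch, not the patent's reference implementation.

```python
import cv2
import numpy as np

# 4th layer of Pascal's triangle, normalised by 2^4 (step (11))
B4 = np.array([1, 4, 6, 4, 1], dtype=np.float32) / 16.0

def spatial_saliency(bgr_image):
    # Work in CIELAB so that contrast is the Euclidean distance between colour vectors.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Separable binomial filtering gives the smoothed stimulus I_B4.
    smoothed = cv2.sepFilter2D(lab, -1, B4, B4)
    # Mean stimulus Ibar, taken here over the whole image (an assumption of this sketch).
    mean_lab = lab.reshape(-1, 3).mean(axis=0)
    # S(x, y) = ||I_B4(x, y) - Ibar||   (formula (1))
    return np.linalg.norm(smoothed - mean_lab, axis=2)

# saliency = spatial_saliency(cv2.imread("frame.png"))
```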
Specifically, adaptive-threshold extraction of the target region is realized by the following method:
(21) The global saliency of pixel (x, y) is defined as
Sg(x, y) = (1/A) Σ_{i=0}^{N} Σ_{j=0}^{M} ||I_B4(x, y) − I(i, j)||    (2)
where A is the area of detection, I_B4(x, y) is the stimulus intensity of pixel (x, y) after the original image has been filtered with B4, I(i, j) is the original stimulus intensity of pixel (i, j), and M and N are the width and height of the image, respectively;
(22) The computation is accelerated with a histogram: the original stimulus intensities I are mapped into the stimulus space I_B4, and the saliency of the stimulus I_B4(I) finally perceived by the user is
S(I_B4(I)) = [1 / ((m − 1) D(I_B4(I)))] Σ_{i=1}^{m} ( D(I_B4(I)) − ||I_B4(I) − I_B4(I_i)|| ) Sg(I_B4(I_i))    (3)
where D(I_B4(I)) is the distance between the stimulus I_B4(I) and its m nearest stimuli, and m is a manual control parameter, taken as 8 in this embodiment;
(23) The threshold Ts is varied to assign foreground and background regions, and the threshold that yields the minimum energy function is then taken as the optimal threshold. The energy function with Ts as threshold is defined as
E(I, Ts, λ, σ) = λ Σ_{n=1}^{N} f(Ts, Sn) Sn + V(I, Ts, σ)    (4)
where Sn is obtained from formula (2), λ is the weight of the salient-target energy (λ = 1.0 in this embodiment), N is the total number of pixels of the image, f(Ts, Sn) = max(0, sign(Sn − Ts)), and V(I, Ts, σ) measures the similarity of a pixel to its surrounding stimuli: the pixels judged salient under the current Ts and their 8-neighbourhoods form the pair set Pair over which V is computed, Dist(p, q) is the spatial distance between two points, and σ is a manual control parameter, taken as 10.0 in this embodiment.
Therefore, given an image and its saliency map, Ts is estimated by minimizing the energy function; a pixel is labeled 1 when it belongs to the salient target and 0 otherwise, and the parameters λ and σ must be set manually in advance.
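As an illustration of step (23), the threshold search can be sketched as below; because the exact form of the smoothness term V is only described informally above, it is passed in as a callable, and the candidate grid is an assumption of this sketch.

```python
import numpy as np

def select_threshold(saliency, smoothness_term, lam=1.0, n_candidates=64):
    """Scan candidate thresholds Ts and return the one minimising formula (4)."""
    candidates = np.linspace(saliency.min(), saliency.max(), n_candidates)
    best_ts, best_energy = candidates[0], np.inf
    for ts in candidates:
        foreground = saliency > ts                        # f(Ts, Sn) = max(0, sign(Sn - Ts))
        data_term = lam * saliency[foreground].sum()      # lambda * sum of f(Ts, Sn) * Sn
        energy = data_term + smoothness_term(foreground)  # + V(I, Ts, sigma)
        if energy < best_energy:
            best_ts, best_energy = ts, energy
    return best_ts

# Pixels with saliency > best_ts are labelled 1 (salient target), the rest 0.
```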
Next, in the temporal domain, motion saliency is defined, target motion is estimated with homography matrices, key points are used in place of the target for saliency detection, the spatial saliency data are then fused, and a boundary-extension method based on an energy function is proposed to obtain a bounding box as the salient target region of the temporal domain. The concrete method is as follows:
(31) Given an image, the key points of the image are obtained with the FAST (Features from Accelerated Segment Test) feature point detection algorithm, which has good real-time performance;
(32) Given two adjacent frames, fast corresponding-point matching is carried out with FLANN (Fast Library for Approximate Nearest Neighbors);
(33) A homography matrix (Homography Matrix) H is used to describe the motion of the key points. Because a single H describes only one form of motion, while a video segment may contain several forms of motion, multiple H are needed to describe the different motions. In this embodiment the RANSAC algorithm is applied iteratively to obtain a series of homography estimates H = {H1, H2, ..., Hn} (an illustrative sketch of steps (31)-(33) is given after this list);
(34) The temporal saliency of a key point is defined as
St(p_m) = (A_m / (W × H)) Σ_{i=1}^{n} A_i D(p_m, H_i)    (5)
where A_m is the area covered by all key points of motion state H_m, and W and H are the width and height of the video image;
(35) The spatial saliency values are fused with the temporal saliency values obtained for the key points;
(36) A bounding box is obtained as the salient target region of the temporal domain with the method based on energy-function boundary extension.
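The keypoint detection, matching and multiple-homography estimation of steps (31)-(33) can be sketched with OpenCV as below. The patent only names FAST and FLANN, so the use of ORB descriptors for matching, the LSH index parameters and the iteration limits are assumptions of this sketch.

```python
import cv2
import numpy as np

def estimate_motions(prev_gray, curr_gray, max_models=5, min_inliers=8):
    """FAST keypoints, FLANN matching, and iterative RANSAC homography fitting."""
    fast = cv2.FastFeatureDetector_create()
    orb = cv2.ORB_create()   # descriptors for matching (an assumption; the text names only FAST + FLANN)
    kp1 = fast.detect(prev_gray, None)
    kp2 = fast.detect(curr_gray, None)
    kp1, des1 = orb.compute(prev_gray, kp1)
    kp2, des2 = orb.compute(curr_gray, kp2)

    # FLANN with an LSH index for binary descriptors.
    flann = cv2.FlannBasedMatcher(dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1), {})
    matches = flann.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    homographies = []
    while len(src) >= min_inliers and len(homographies) < max_models:
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is None or mask.sum() < min_inliers:
            break
        homographies.append(H)                  # one motion state H_m
        keep = mask.ravel() == 0                # drop its inliers and refit on the rest
        src, dst = src[keep], dst[keep]
    return homographies
```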
Specifically, the fusion of the spatial saliency values with the temporal saliency values of the key points is realized by the following method:
(41) A contrast of motion saliency is defined from the temporal saliency value St of each key point, obtained from formula (5), and the mean of the key points' temporal saliency values;
(42) The motion saliency should single out targets that remain well discriminated in the spatial domain, so the range over which the temporal saliency St is gathered must be limited: if pi is the i-th key point of St, then pi must satisfy the condition defined with respect to the mean spatial saliency value;
(43) A temporal weight and a spatial weight are defined, and for the key points satisfying (42) the temporal and spatial saliency values are added with these weights.
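Since the contrast measure, the selection rule of step (42) and the weight formulas of step (43) are shown only in the original figures, the fusion can only be sketched generically here; the selection rule and the externally supplied weights w_t and w_s below are assumptions of this sketch.

```python
import numpy as np

def fuse_saliency(spatial_vals, temporal_vals, w_t, w_s):
    """Weighted fusion of spatial and temporal key-point saliency (steps (41)-(43))."""
    spatial_vals = np.asarray(spatial_vals, dtype=np.float32)
    temporal_vals = np.asarray(temporal_vals, dtype=np.float32)
    fused = spatial_vals.copy()
    # Step (42): only key points that stay discriminative in the spatial domain receive
    # the temporal contribution; the rule used here is an assumed placeholder.
    selected = spatial_vals >= spatial_vals.mean()
    fused[selected] = w_t * temporal_vals[selected] + w_s * spatial_vals[selected]
    return fused
```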
Specifically, salient-target region extraction in the temporal domain is realized by the following method:
A salient key point p in the spatial domain is taken as the seed point, and the seed region is a rectangular bounding box B. Let bi denote the four sides of the bounding box B, numbered i ∈ {1, 2, 3, 4} for top, bottom, left and right. The boundary-extension algorithm is as follows:
Initialization: all four vertices of the bounding box B are set to the position of the key point p, so that p is an interior point of B.
Step 1: in increasing order from i = 1, compute the saliency energy E_outer(i) on the outer boundary of bi and the saliency energy E_inner(i) on its inner boundary, the energy being computed as in formula (4); then compute the outward-extension weight w(i) of the border from E_outer(i), E_inner(i) and li, where li is the length of the i-th side of the current bounding box B.
Step 2: if w(i) ≥ ε, the i-th side is extended outward by one pixel. ε is the threshold for the extension decision and must be set in advance; in the experiments of this embodiment it is set to 0.8 Ts', where Ts' is the mean spatial saliency inside the bounding box.
Step 3: if no side was extended in step 2, stop the algorithm and output the bounding box B; otherwise, repeat steps 1 and 2.
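The boundary-extension loop can be sketched as follows; the strip energies are abstracted into a callable, and because the exact formula for w(i) is not reproduced in the text, the (E_outer − E_inner) / l_i ratio used here is an assumption of this sketch.

```python
def extend_bounding_box(seed, strip_energy, img_w, img_h, eps):
    """seed = (x, y) salient key point p; strip_energy(box, side, 'outer'|'inner') returns
    the saliency energy of the one-pixel strip just outside/inside that side of the box."""
    x, y = seed
    box = {"top": y, "bottom": y, "left": x, "right": x}   # initialised to the key point p
    while True:
        grown = False
        for side in ("top", "bottom", "left", "right"):
            length = box["right"] - box["left"] + 1 if side in ("top", "bottom") \
                else box["bottom"] - box["top"] + 1
            # Assumed extension weight w(i) built from E_outer(i), E_inner(i) and l_i.
            w = (strip_energy(box, side, "outer") - strip_energy(box, side, "inner")) / length
            if w < eps:
                continue                                    # this side is not extended
            if side == "top" and box["top"] > 0:
                box["top"] -= 1; grown = True
            elif side == "bottom" and box["bottom"] < img_h - 1:
                box["bottom"] += 1; grown = True
            elif side == "left" and box["left"] > 0:
                box["left"] -= 1; grown = True
            elif side == "right" and box["right"] < img_w - 1:
                box["right"] += 1; grown = True
        if not grown:
            return box                                      # no side extended: output bounding box B
```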
Finally, the richness of the video is reduced by means of the salient target region, and key frames are extracted with the shot-adaptive method based on online clustering. The concrete method is as follows:
(51) The RGB color of the salient region is converted to the HSV color space, and the H (hue) and S (saturation) components are used to compute a hue-saturation histogram. Let Hp(i) be the i-th bin of the hue-saturation histogram of the salient target region of frame p; in this embodiment the Bhattacharyya distance is used to measure the visual distance between two frames (an illustrative sketch follows step (52)).
(52) Key frames are extracted with the shot-adaptive method based on online clustering, with the clustering mode for static shots as the main mode and the clustering mode for dynamic shots as the supplement. For a static shot, online clustering is performed on the basis of the hue-saturation histogram of the salient region, and any frame of a cluster is chosen as a key frame. For a dynamic shot, the salient moving target is tracked first, the tracking of the salient moving target is then used as the basis for online clustering, and the position information of the salient target is used as the basis for extracting key frames from the clusters.
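Step (51) can be illustrated with OpenCV's histogram utilities as below; the bin counts and normalization are assumptions of this sketch.

```python
import cv2

def hs_histogram(bgr_region, h_bins=30, s_bins=32):
    """Hue-saturation histogram of a salient region (step (51))."""
    hsv = cv2.cvtColor(bgr_region, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 1, cv2.NORM_MINMAX)
    return hist

def visual_distance(hist_p, hist_q):
    """Bhattacharyya distance used as the visual distance D_sal between two frames."""
    return cv2.compareHist(hist_p, hist_q, cv2.HISTCMP_BHATTACHARYYA)
```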
Specifically, as shown in Fig. 1, online clustering of a static shot is realized by the following steps:
Initialization: first compute the hue-saturation histogram f1 of the first frame of the static shot, set the initial cluster count N = 1, and take f1 as the centroid vector C1 of cluster Cell1, i.e. C1 = f1.
S11: if the current frame p belongs to a static shot, compute the hue-saturation histogram Hp of the current frame.
S12: compute the visual distance between p and the centroid of every cluster, and find the cluster with the minimum visual distance Dsal(p, Cm), where m is the index of that cluster.
S13: compare Dsal(p, Cm) with the threshold εc. When Dsal(p, Cm) ≤ εc, p is assigned to cluster Cellm and Hp then replaces the centroid of Cellm; otherwise a new cluster CellN+1 is added, Hp is taken as its centroid vector CN+1, and the cluster count is updated to N = N + 1.
S14: repeat S11, S12 and S13 for all frames of static shots.
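An illustrative version of the static-shot online clustering (S11-S14), reusing hs_histogram and visual_distance from the previous sketch; returning the first frame of each cluster as its key frame is one of the admissible choices ("any frame of a cluster").

```python
import numpy as np

def cluster_static_shot(frames, eps_c):
    centroids = []                    # centroid histogram C_m of each cluster
    members = []                      # frame indices belonging to each cluster
    for idx, frame in enumerate(frames):
        h = hs_histogram(frame)                               # S11
        if centroids:
            dists = [visual_distance(h, c) for c in centroids]
            m = int(np.argmin(dists))                          # S12: nearest cluster Cell_m
            if dists[m] <= eps_c:
                members[m].append(idx)                         # S13: join Cell_m ...
                centroids[m] = h                               # ... and replace its centroid
                continue
        centroids.append(h)                                    # S13: otherwise open Cell_{N+1}
        members.append([idx])
    return [cluster[0] for cluster in members]                 # one key frame per cluster
```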
Specifically, as shown in Fig. 2, key frame extraction for a dynamic shot is realized by the following steps:
Initialization: obtain the first frame of the dynamic shot.
S21: obtain the tracked target region and initialize or resample the particles; fetch the next frame of the video, and if the frame is empty, terminate.
S22: obtain the FAST feature vectors, match them with the FLANN algorithm, and update the feature vector weights; if the feature vectors are insufficient, terminate.
S23: update the weight of each particle, compute the key frame weight and the target region, and jump back to S21.
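The dynamic-shot loop (S21-S23) is described above only at the level of a particle-filter tracking outline, so it is sketched here with a placeholder tracker interface; none of the method names below correspond to a real library API.

```python
def process_dynamic_shot(first_frame, next_frame, tracker):
    """Skeleton of S21-S23; `tracker` is a hypothetical particle-filter tracker object."""
    region = tracker.init_target(first_frame)          # initialisation on the first frame
    key_frame_candidates = []
    while True:
        frame = next_frame()                            # S21: fetch the next frame
        if frame is None:
            break                                       # empty frame: shot is finished
        tracker.resample_particles(region)              # S21: initialise / resample particles
        features = tracker.match_fast_flann(frame)      # S22: FAST features matched via FLANN
        if not features:
            break                                       # not enough feature matches
        region, weight = tracker.update_particles(frame, features)   # S23
        key_frame_candidates.append((frame, weight, region))
    return key_frame_candidates
```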
The key frame extraction system based on a visual attention model disclosed by the present invention includes a salient region extraction module and a key frame extraction module.
The salient region extraction module includes:
a spatial salient region extraction module, for extracting the salient region in the spatial domain;
a temporal key point saliency acquisition module, for extracting the saliency values of the key points in the temporal domain;
a fusion module, for fusing the spatial salient region with the temporal key points to obtain the final salient region.
The key frame extraction module includes:
a static shot key frame extraction module, for key frame extraction from static shots;
a dynamic shot key frame extraction module, for key frame extraction from dynamic shots;
a shot adaptation module, for switching between the static shot key frame extraction module and the dynamic shot key frame extraction module.
The above is only a preferred embodiment of the present invention and is not intended to limit its protection scope. It should be understood that the present invention is not limited to the implementations described here, which are provided to help those skilled in the art practice the invention. Any person skilled in the art can readily make further improvements and refinements without departing from the spirit and scope of the invention, so any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (8)

1. A key frame extraction method based on a visual attention model, for extracting key frames from a video, characterized by including:
in the spatial domain, detecting saliency by filtering global contrast with binomial coefficients, and extracting the target region with an adaptive threshold;
in the temporal domain, defining motion saliency, estimating target motion with homography matrices, using key points in place of the target for saliency detection, fusing the spatial saliency data, and obtaining a bounding box as the salient target region of the temporal domain by a proposed boundary-extension method based on an energy function;
reducing the richness of the video by means of the salient target region, and extracting key frames with a shot-adaptive method based on online clustering;
in the spatial domain, saliency detection by filtering global contrast with binomial coefficients and extraction of the target region with an adaptive threshold are carried out as follows:
(11) the binomial coefficients are constructed from Pascal's triangle, and the normalization factor of the Nth layer is 2^N; the 4th layer is selected, so the filter coefficients are B4 = (1/16)[1 4 6 4 1];
(12) let I be the original stimulus intensity, Ī the mean of the surrounding stimulus intensities, and I_B4 the convolution of I with B4; the strength of a pixel's stimulus is measured with the vector form of the CIELAB color space, and the contrast of a stimulus is the Euclidean distance between two CIELAB vectors, so the stimulus of pixel (x, y) is detected as
S(x, y) = ||I_B4(x, y) − Ī||    (1)
(13) after the saliency measurement set S_s = (s11, s12, ..., sNM) is obtained, the target region is extracted with an adaptive threshold, where sij is the saliency of pixel (i, j), 0 ≤ i ≤ N, 0 ≤ j ≤ M, and M and N are the width and height of the image, respectively; adaptive-threshold extraction of the target region is realized by the following method:
(21) the global saliency of pixel (x, y) is defined as
Sg(x, y) = (1/A) Σ_{i=0}^{N} Σ_{j=0}^{M} ||I_B4(x, y) − I(i, j)||    (2)
where A is the area of detection, I_B4(x, y) is the stimulus intensity of pixel (x, y) after the original image has been filtered with B4, I(i, j) is the original stimulus intensity of pixel (i, j), and M and N are the width and height of the image, respectively;
(22) the computation is accelerated with a histogram: the original stimulus intensities I are mapped into the stimulus space I_B4, and the saliency of the stimulus I_B4(I) finally perceived by the user is
S(I_B4(I)) = [1 / ((m − 1) D(I_B4(I)))] Σ_{i=1}^{m} ( D(I_B4(I)) − ||I_B4(I) − I_B4(I_i)|| ) Sg(I_B4(I_i))    (3)
where D(I_B4(I)) is the distance between the stimulus I_B4(I) and its m nearest stimuli;
(23) the threshold Ts is varied to assign foreground and background regions, and the threshold that yields the minimum energy function is then taken as the optimal threshold; the energy function with Ts as threshold is defined as
E(I, Ts, λ, σ) = λ Σ_{n=1}^{N} f(Ts, Sn) Sn + V(I, Ts, σ)    (4)
where Sn is obtained from formula (2), λ is the weight of the salient-target energy, N is the total number of pixels of the image, f(Ts, Sn) = max(0, sign(Sn − Ts)), and V(I, Ts, σ) measures the similarity of a pixel to its surrounding stimuli: the pixels judged salient under the current Ts and their 8-neighbourhoods form the pair set Pair over which V is computed, Dist(p, q) is the spatial distance between two points, and σ is a control parameter.
2. The method according to claim 1, characterized in that, in the temporal domain, motion saliency is defined, target motion is estimated with homography matrices, key points are used in place of the target for saliency detection, the spatial saliency data are then fused, and a bounding box is obtained as the salient target region of the temporal domain by the proposed boundary-extension method based on an energy function, the concrete method being as follows:
(31) given an image, the key points of the image are obtained with the FAST feature point detection algorithm, which has good real-time performance;
(32) given two adjacent frames, fast corresponding-point matching is carried out with FLANN;
(33) the motion of the key points is described with multiple homography matrices H; using the RANSAC algorithm iteratively, a series of homography estimates H = {H1, H2, ..., Hn} is obtained;
(34) the temporal saliency of a key point is defined as
St(p_m) = (A_m / (W × H)) Σ_{i=1}^{n} A_i D(p_m, H_i)    (5)
where A_m is the area covered by all key points of motion state H_m, and W and H are the width and height of the video image;
(35) a bounding box is obtained as the salient target region of the temporal domain with the method based on energy-function boundary extension.
3. The method according to claim 2, characterized in that the fusion of the spatial saliency values with the temporal saliency values of the key points is realized by the following method:
(41) a contrast of motion saliency is defined from the temporal saliency value St of each key point, obtained from formula (5), and the mean of the key points' temporal saliency values;
(42) let pi be the i-th key point of St; pi must satisfy the condition defined with respect to the mean spatial saliency value;
(43) a temporal weight and a spatial weight are defined, and for the key points satisfying step (42) the temporal and spatial saliency values are added with these weights.
4. The method according to claim 2, characterized in that salient-target region extraction in the temporal domain is realized by the following method:
a salient key point p in the spatial domain is taken as the seed point, and the seed region is a rectangular bounding box B; let bi denote the four sides of the bounding box B, numbered i ∈ {1, 2, 3, 4} for top, bottom, left and right; the boundary-extension algorithm is as follows:
initialization: all four vertices of the bounding box B are set to the position of the key point p, so that p is an interior point of B;
step 1: in increasing order from i = 1, compute the saliency energy E_outer(i) on the outer boundary of bi and the saliency energy E_inner(i) on its inner boundary, the energy being computed as in formula (4); then compute the outward-extension weight w(i) of the border from E_outer(i), E_inner(i) and li, where li is the length of the i-th side of the current bounding box B;
step 2: if w(i) ≥ ε, the i-th side is extended outward by one pixel; ε is the preset threshold for the extension decision and is set to 0.8 Ts', where Ts' is the mean spatial saliency inside the bounding box;
step 3: if no side was extended in step 2, stop the algorithm and output the bounding box B; otherwise, repeat steps 1 and 2.
5. The method according to claim 1, characterized in that the richness of the video is reduced by means of the salient target region and key frames are extracted with the shot-adaptive method based on online clustering, the concrete method being as follows:
(51) the RGB color of the salient region is converted to the HSV color space, and the H and S components are used to compute a hue-saturation histogram; let Hp(i) be the i-th bin of the hue-saturation histogram of the salient target region of frame p; the Bhattacharyya distance is used to measure the visual distance between two frames p and q;
(52) key frames are extracted with the shot-adaptive method based on online clustering, with the clustering mode for static shots as the main mode and the clustering mode for dynamic shots as the supplement;
for a static shot, online clustering is performed on the basis of the hue-saturation histogram of the salient region, and any frame of a cluster is chosen as a key frame;
for a dynamic shot, the salient moving target is tracked first, the tracking of the salient moving target is then used as the basis for online clustering, and the position information of the salient target is used as the basis for extracting key frames from the clusters.
6. The method according to claim 5, characterized in that online clustering of a static shot is realized by the following steps:
initialization: first compute the hue-saturation histogram f1 of the first frame of the static shot, set the initial cluster count N = 1, and take f1 as the centroid vector C1 of cluster Cell1, i.e. C1 = f1;
S11: if the current frame p belongs to a static shot, compute the hue-saturation histogram Hp of the current frame;
S12: compute the visual distance between p and the centroid of every cluster, and find the cluster with the minimum visual distance Dsal(p, Cm), where m is the index of that cluster;
S13: compare Dsal(p, Cm) with the threshold εc; when Dsal(p, Cm) ≤ εc, p is assigned to cluster Cellm and Hp then replaces the centroid of Cellm; otherwise a new cluster CellN+1 is added, Hp is taken as its centroid vector CN+1, and the cluster count is updated to N = N + 1;
S14: repeat S11, S12 and S13 for all frames of static shots.
7. The method according to claim 5, characterized in that key frame extraction for a dynamic shot is realized by the following steps:
initialization: obtain the first frame of the dynamic shot;
S21: obtain the tracked target region and initialize or resample the particles; fetch the next frame of the video, and if the frame is empty, terminate;
S22: obtain the FAST feature vectors, match them with the FLANN algorithm, and update the feature vector weights; if the feature vectors are insufficient, terminate;
S23: update the weight of each particle, compute the key frame weight and the target region, and jump back to S21.
8. A system applying the key frame extraction method based on a visual attention model according to any one of claims 1 to 7, characterized by including a salient region extraction module and a key frame extraction module;
the salient region extraction module includes:
a spatial salient region extraction module, for extracting the salient region in the spatial domain;
a temporal key point saliency acquisition module, for extracting the saliency values of the key points in the temporal domain;
a fusion module, for fusing the spatial salient region with the temporal key points to obtain the final salient region;
the key frame extraction module includes:
a static shot key frame extraction module, for key frame extraction from static shots;
a dynamic shot key frame extraction module, for key frame extraction from dynamic shots;
a shot adaptation module, for switching between the static shot key frame extraction module and the dynamic shot key frame extraction module.
CN201410039072.7A 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system Expired - Fee Related CN103824284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410039072.7A CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410039072.7A CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Publications (2)

Publication Number Publication Date
CN103824284A CN103824284A (en) 2014-05-28
CN103824284B true CN103824284B (en) 2017-05-10

Family

ID=50759326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410039072.7A Expired - Fee Related CN103824284B (en) 2014-01-26 2014-01-26 Key frame extraction method based on visual attention model and system

Country Status (1)

Country Link
CN (1) CN103824284B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598908B (en) * 2014-09-26 2017-11-28 浙江理工大学 A kind of crops leaf diseases recognition methods
CN104778721B (en) * 2015-05-08 2017-08-11 广州小鹏汽车科技有限公司 The distance measurement method of conspicuousness target in a kind of binocular image
CN105472380A (en) * 2015-11-19 2016-04-06 国家新闻出版广电总局广播科学研究院 Compression domain significance detection algorithm based on ant colony algorithm
CN106210444B (en) * 2016-07-04 2018-10-30 石家庄铁道大学 Motion state self adaptation key frame extracting method
CN107967476B (en) * 2017-12-05 2021-09-10 北京工业大学 Method for converting image into sound
CN110197107A (en) * 2018-08-17 2019-09-03 平安科技(深圳)有限公司 Micro- expression recognition method, device, computer equipment and storage medium
CN110322474B (en) * 2019-07-11 2021-06-01 史彩成 Image moving target real-time detection method based on unmanned aerial vehicle platform
CN110399847B (en) * 2019-07-30 2021-11-09 北京字节跳动网络技术有限公司 Key frame extraction method and device and electronic equipment
CN111191650B (en) * 2019-12-30 2023-07-21 北京市新技术应用研究所 Article positioning method and system based on RGB-D image visual saliency
CN111493935B (en) * 2020-04-29 2021-01-15 中国人民解放军总医院 Artificial intelligence-based automatic prediction and identification method and system for echocardiogram
CN112418012B (en) * 2020-11-09 2022-06-07 武汉大学 Video abstract generation method based on space-time attention model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2207111A1 (en) * 2009-01-08 2010-07-14 Thomson Licensing SA Method and apparatus for generating and displaying a video abstract
CN102088597A (en) * 2009-12-04 2011-06-08 成都信息工程学院 Method for estimating video visual salience through dynamic and static combination
CN102695056A (en) * 2012-05-23 2012-09-26 中山大学 Method for extracting compressed video key frames

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7263660B2 (en) * 2002-03-29 2007-08-28 Microsoft Corporation System and method for producing a video skim

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2207111A1 (en) * 2009-01-08 2010-07-14 Thomson Licensing SA Method and apparatus for generating and displaying a video abstract
CN102088597A (en) * 2009-12-04 2011-06-08 成都信息工程学院 Method for estimating video visual salience through dynamic and static combination
CN102695056A (en) * 2012-05-23 2012-09-26 中山大学 Method for extracting compressed video key frames

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Efficient visual attention based framework for extracting key frames from videos; Naveed Ejaz et al.; Signal Processing: Image Communication; 2012-10-17; pp. 34-44 *
Visual attention detection in video sequences using spatiotemporal cues; Yun Zhai et al.; Proceedings of the 14th ACM International Conference on Multimedia; 2006-10-31; pp. 816-821, sections 1.2-4 *
Adaptive video key frame extraction based on a visual attention model (基于视觉注意模型的自适应视频关键帧提取); Jiang Peng et al.; Journal of Image and Graphics (中国图象图形学报); 2009-08-31; Vol. 14, No. 8; pp. 1651-1653, sections 2-4 *

Also Published As

Publication number Publication date
CN103824284A (en) 2014-05-28

Similar Documents

Publication Publication Date Title
CN103824284B (en) Key frame extraction method based on visual attention model and system
CN110111335B (en) Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning
CN104809187B (en) A kind of indoor scene semanteme marking method based on RGB D data
CN106997597B (en) It is a kind of based on have supervision conspicuousness detection method for tracking target
CN108961349A (en) A kind of generation method, device, equipment and the storage medium of stylization image
CN108803617A (en) Trajectory predictions method and device
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN103578119A (en) Target detection method in Codebook dynamic scene based on superpixels
CN102256065B (en) Automatic video condensing method based on video monitoring network
CN106937120B (en) Object-based monitor video method for concentration
CN109255357B (en) RGBD image collaborative saliency detection method
CN107798313A (en) A kind of human posture recognition method, device, terminal and storage medium
CN103226708A (en) Multi-model fusion video hand division method based on Kinect
Hua et al. Depth estimation with convolutional conditional random field network
CN107767416A (en) The recognition methods of pedestrian's direction in a kind of low-resolution image
CN110222760A (en) A kind of fast image processing method based on winograd algorithm
CN113223042B (en) Intelligent acquisition method and equipment for remote sensing image deep learning sample
CN107506792A (en) A kind of semi-supervised notable method for checking object
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
Jiang et al. Sparse attention module for optimizing semantic segmentation performance combined with a multi-task feature extraction network
CN107027051A (en) A kind of video key frame extracting method based on linear dynamic system
CN104008374B (en) Miner's detection method based on condition random field in a kind of mine image
CN103336830A (en) Image search method based on structure semantic histogram
CN108961196A (en) A kind of 3D based on figure watches the conspicuousness fusion method of point prediction attentively
CN108986103A (en) A kind of image partition method merged based on super-pixel and more hypergraphs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 2017-05-10)