Summary of the invention
The technical problem to be solved by this invention is to overcome the deficiencies of existing video summarization methods by providing a plot-based video abstract extraction method that selects suitable summary segments according to plot development relations, which both conforms to human logical thinking and helps guarantee the integrity of the film's content.
The plot-based video abstract extraction method of the present invention comprises the following steps:
Step A: perform key frame, shot, and scene detection on the original video;
Step B: detect highlight scenes from the detected scenes according to the video storyline;
Step C: select summary segments from the highlight scenes according to actual requirements, and splice them in temporal order to generate the summary of the original video.
The detection of the highlight scenes comprises:
Dialogue scene detection: first, detect the scenes containing alternately appearing face shots based on face information, as candidate dialogue scenes; then, select from the candidate dialogue scenes those that contain speech; these are the dialogue scenes;
Action scene detection: a scene is regarded as an action scene when it satisfies the following three conditions simultaneously: the frame count of each shot in the scene is less than 25, the average activity intensity of each shot exceeds 200, and the average audio energy of each shot exceeds 100;
Suspense scene detection: a scene is a suspense scene when it satisfies the following three conditions simultaneously: the average illumination intensity of the scene is less than 50; the audio energy envelope of the first several shots of the scene does not exceed 5, while the change of the audio energy envelope between some two shots exceeds 50; the activity intensity of the first several shots of the scene does not exceed 5, while the change of the activity intensity between some two shots exceeds 100.
Further, the dialogue scene detection also comprises the detection of emotional dialogue scenes: extract the average fundamental frequency and the short-term intensity variation of each dialogue scene, and select those dialogue scenes in which both exceed predetermined thresholds; these are the emotional dialogue scenes.
Further, the action scene detection also comprises:
Gunfight scene detection: select the action scenes in which the orange, yellow, and red color features all exceed predetermined thresholds as gunfight scenes;
Fight scene detection: select the action scenes containing shouting audio features as fight scenes;
Chase scene detection: select the action scenes containing screeching and whistling audio features as chase scenes.
Preferably, step C specifically comprises the following sub-steps:
Step C1: compute the progress intensity between any two highlight scenes according to the following formula:

PIF(AS_u, AS_v) = α·TT_n(AS_u, AS_v) + β·ST_n(AS_u, AS_v) + γ·RT_n(AS_u, AS_v)
In the formula, PIF(AS_u, AS_v) denotes the progress intensity between two different scenes AS_u and AS_v; TT_n(AS_u, AS_v), ST_n(AS_u, AS_v), and RT_n(AS_u, AS_v) are respectively the normalized forms of the temporal transition intensity TT(AS_u, AS_v), the spatial transition intensity ST(AS_u, AS_v), and the rhythm transition intensity RT(AS_u, AS_v) between AS_u and AS_v; α, β, and γ are weighting coefficients satisfying α + β + γ = 1. Specifically:
The temporal transition intensity TT(AS_u, AS_v) is computed according to the following formula:
In the formula, N(AS_u, Sh_l, Kf_p) is the number of faces appearing in key frame p of the last shot l in scene AS_u, N(AS_v, Sh_w, Kf_q) is the number of faces appearing in key frame q of the first shot w in scene AS_v, and P and Q are respectively the numbers of key frames in shots l and w;
The spatial transition intensity ST(AS_u, AS_v) is computed according to the following formula:
In the formula, RA(p), GA(p), BA(p), and LA(p) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame p of the last shot l in scene AS_u; RA(q), GA(q), BA(q), and LA(q) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame q of the first shot w in scene AS_v; P and Q are respectively the numbers of key frames in shots l and w;
The rhythm transition intensity RT(AS_u, AS_v) is computed according to the following formula:
In the formula, Len(Sh_m) is the number of frames in the m-th shot Sh_m of scene AS_u, Len(Sh_n) is the number of frames in the n-th shot Sh_n of scene AS_v, and M and N are respectively the numbers of shots in scenes AS_u and AS_v;
Step C2: sort the progress intensities in descending order, and select all highlight scenes corresponding to the K largest progress intensities as candidate summary segments; the value of K is less than or equal to the total number of highlight scenes detected in step B;
Step C3: select the final summary segments from the candidate summary segments, and splice them in temporal order to generate the summary of the original video.
When selecting the final summary segments from the candidate summary segments, the candidate summary segments may be used directly as the final summary segments, or segments may be selected at random according to the required summary length. To allow the video segments in the final summary to be presented to the audience more smoothly, the present invention further rejects candidate summary segments that are too short to convey useful information, and adjusts the candidate summary segments according to the completeness of speech, specifically as follows:
First, highlight scenes shorter than 1 second are rejected from the candidate summary segments. Then, complete-sentence detection is performed on each remaining candidate summary segment, and the candidate summary segments are adjusted according to the detection result: if the boundary of a complete sentence exceeds the boundary of a candidate summary segment, the boundary of that candidate summary segment is extended to the boundary of the complete sentence. The adjusted candidate summary segments are the final summary segments.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention selects suitable summary segments according to plot development relations to generate the video summary, which both conforms to human logical thinking and helps guarantee the integrity of the film's content. In addition, compared with low-level or mid-level audiovisual features, plot development features express high-level semantic meaning; therefore, a summary generated by this method can be regarded as closer to a semantic description of the video content.
Embodiment
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings:
The purpose of this invention is to provide a plot-based video abstract extraction method. Its implementation approach is as follows: first, analyze the film structure using temporal correlation, including effective segmentation of shots and scenes; then, analyze the content of the scenes of interest and extract audiovisual expressive features to realize plot analysis; finally, according to the transition intensities between scenes, generate a movie summary that conforms to human viewing habits.
A preferred implementation of the plot-based video abstract extraction method of the present invention specifically comprises the following steps:
Step A: perform key frame, shot, and scene detection on the original video.
1. Shot segmentation
A shot is the elementary unit of video data; therefore, dividing video data into meaningful shots is the first step in extracting a summary. From an image-processing perspective, shot segmentation is the process of clustering frames taken in the same place into the same class. Various existing shot detection methods may be adopted; for example, Zhuang et al. propose an unsupervised method for shot boundary detection (Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In Proceedings of IEEE International Conference on Image Processing, 1998:866-870); Boreczky et al. adopt hidden Markov models to solve the shot boundary problem (J. Boreczky, L. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998:3741-3744); Lienhart uses neural networks for shot boundary detection (R. Lienhart. Reliable dissolve detection. In Proceedings of IEEE International Conference on Storage and Retrieval for Media Databases, 2001:219-230). To make shot detection more accurate, this embodiment uses the following method for shot segmentation. First, candidate shot boundaries are detected: the initial boundary of a candidate shot is determined using the difference of content information between frames, and on this basis the exact boundary of the candidate shot is determined according to the difference of content information in the neighborhood of the initial shot boundary. Second, the transition type (gradual, abrupt, etc.) of each true candidate shot is determined according to the two-dimensional entropy feature of the frames, while invalid candidate shots generated by situations such as rapid object motion, camera shake, and flashlights are removed.
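The first stage of the shot-segmentation method above (candidate boundary detection from the inter-frame content difference) can be sketched as follows. This is a minimal illustration: the gray-level histogram metric, the bin count, and the threshold value are assumptions for demonstration, not the exact content-difference measure of the embodiment.

```python
import numpy as np

def histogram_diff(frame_a, frame_b, bins=16):
    """Normalized L1 distance between the gray-level histograms of two frames."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    n = frame_a.size
    return np.abs(ha - hb).sum() / (2 * n)  # value in [0, 1]

def candidate_shot_boundaries(frames, threshold=0.4):
    """Mark frame index i as a candidate boundary when the content
    difference between frame i-1 and frame i exceeds the threshold."""
    return [i for i in range(1, len(frames))
            if histogram_diff(frames[i - 1], frames[i]) > threshold]

# Synthetic demo: 10 dark frames followed by 10 bright frames.
rng = np.random.default_rng(0)
dark = [rng.integers(0, 40, (24, 32)) for _ in range(10)]
bright = [rng.integers(200, 256, (24, 32)) for _ in range(10)]
print(candidate_shot_boundaries(dark + bright))  # -> [10]
```

In the embodiment the detected candidate boundary would then be refined within its neighborhood and filtered by the two-dimensional entropy feature.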
2. Scene detection
Compared with content analysis at the shot level, the content information at the scene level is more meaningful and more complete. This is because, from an image-processing perspective, the scene detection process may be defined as clustering the shots that have spatio-temporal correlation into the same scene. Various existing scene detection methods may be used in the present invention; for example, Yeung et al. propose a scene transition graph method to detect scene boundaries (M. Yeung, B. Yeo, and B. Liu. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, 1998, 71(1):94-109); Tavanapong et al. detect scene boundaries in combination with film-making theory (W. Tavanapong and J. Zhou. Shot clustering technique for story browsing. IEEE Transactions on Multimedia, 2004, 6(4):517-526); Zhai et al. adopt the Markov chain Monte Carlo method to solve the scene boundary detection problem (Y. Zhai and M. Shah. A general framework for temporal video scene segmentation. In Proceedings of IEEE International Conference on Computer Vision, 2005:1111-1116). This embodiment preferably adopts the following method for scene detection: first, under a time-window constraint mechanism (in order to guarantee that shots from the same scene are not divided into different scenes, and to prevent shots from different scenes from being mistakenly merged into the same scene), determine the spatial correlation of the semantic content information between shots; then, on the basis of the existing spatio-temporal correlation, accurately establish the scene boundaries according to the difference of semantic content information between shots. This method introduces a time-constraint mechanism into scene detection, which avoids under-segmentation or over-segmentation of scenes, thereby obtaining accurate scene segments.
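The time-window grouping idea above can be illustrated with a minimal sketch. The scalar shot feature, the similarity function, the window size, and the threshold are all placeholders standing in for the semantic content information of the embodiment:

```python
def group_shots_into_scenes(shot_features, window=3, sim_threshold=0.8):
    """Greedy time-window grouping: a shot joins the current scene if it is
    sufficiently similar to any of the last `window` shots of that scene."""
    def similarity(a, b):
        # Toy similarity on scalar features, mapped into [0, 1].
        return 1.0 - min(1.0, abs(a - b))
    scenes = [[0]]
    for i in range(1, len(shot_features)):
        recent = scenes[-1][-window:]
        if any(similarity(shot_features[i], shot_features[j]) >= sim_threshold
               for j in recent):
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes

# Shots 0-3 share similar content; shots 4-6 differ sharply.
print(group_shots_into_scenes([0.1, 0.15, 0.12, 0.1, 0.9, 0.95, 0.92]))
# -> [[0, 1, 2, 3], [4, 5, 6]]
```

The window bound is what prevents shots far apart in time from being merged into one scene even if their content happens to be similar.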
Step B: detect highlight scenes from the detected scenes according to the video storyline.
The plot is essential to the effective management of films and the semantic understanding of movie content, and the highlight plots (scenes) within it are the core of the whole movie content. Building a video summary from highlight scenes can therefore better embody the core content of the video. The present invention detects the three most representative kinds of highlight scenes, namely dialogue scenes, action scenes, and suspense scenes, by analyzing the audiovisual features of the video, as follows:
1. Detection of dialogue scenes
Dialogue scenes in a film often convey important information and help the viewer understand the development of the plot. The present invention first uses face detection to find the scenes containing alternately appearing face shots, as candidate dialogue scenes; then, an audio analysis method (for example a hidden Markov model) is used to distinguish speech from other audio, and the scenes containing speech are selected from the candidate dialogue scenes; these are the dialogue scenes.
Among the different dialogue scenes, emotional dialogue scenes more easily attract the viewer's attention and have an important impact on the development of the whole plot. Therefore, it is necessary to detect emotional dialogue scenes among the general dialogue scenes. The present invention adopts two typical audio features to identify emotional dialogue scenes: the average fundamental frequency and the short-term intensity variation. Specifically: extract the average fundamental frequency and the short-term intensity variation of each dialogue scene, and select those dialogue scenes in which both exceed predetermined thresholds; these are the emotional dialogue scenes.
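The emotional-dialogue selection rule reduces to a double-threshold test. In this sketch the threshold values and the scene record layout are illustrative assumptions; the patent states only that both features must exceed predetermined thresholds:

```python
def emotional_dialogue_scenes(scenes, f0_threshold=220.0, intensity_threshold=12.0):
    """Select dialogue scenes whose average fundamental frequency AND
    short-term intensity variation both exceed the given thresholds."""
    return [s["name"] for s in scenes
            if s["avg_f0"] > f0_threshold
            and s["intensity_var"] > intensity_threshold]

dialogues = [
    {"name": "D1", "avg_f0": 180.0, "intensity_var": 15.0},  # calm voice
    {"name": "D2", "avg_f0": 250.0, "intensity_var": 20.0},  # raised pitch and volume
    {"name": "D3", "avg_f0": 240.0, "intensity_var": 8.0},   # high pitch, flat volume
]
print(emotional_dialogue_scenes(dialogues))  # -> ['D2']
```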
2. Detection of action scenes
In action, war, and adventure videos, many action scenes often appear, such as gunfight scenes, fight scenes, and chase scenes.
A scene is regarded as an action scene when it satisfies the following three conditions simultaneously: the frame count of each shot in the scene is less than 25, the average activity intensity of each shot exceeds 200, and the average audio energy of each shot exceeds 100. On this basis, action scenes can be further divided into the following three more familiar kinds.
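The three action-scene conditions translate directly into a predicate. The per-shot record layout below is an assumption; the numeric thresholds (25 frames, 200, 100) are the ones stated above:

```python
def is_action_scene(shots):
    """A scene is an action scene when every shot has fewer than 25 frames,
    average activity intensity above 200, and average audio energy above 100."""
    return all(s["frames"] < 25 for s in shots) \
        and all(s["activity"] > 200 for s in shots) \
        and all(s["audio_energy"] > 100 for s in shots)

fast_cut = [{"frames": 18, "activity": 260, "audio_energy": 140},
            {"frames": 22, "activity": 310, "audio_energy": 125}]
slow_talk = [{"frames": 90, "activity": 40, "audio_energy": 60}]
print(is_action_scene(fast_cut), is_action_scene(slow_talk))  # -> True False
```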
(1) Gunfight scenes. From daily experience and film-making theory, pictures of gunfire, explosions, and bleeding often appear in gunfight scenes. Through careful analysis of color histograms, we find that the most significant colors of these three kinds of pictures are, respectively, orange, yellow, and red. Therefore, we identify gunfight scenes by pre-processing the colors: the action scenes in which the orange, yellow, and red color features all exceed predetermined thresholds are selected as gunfight scenes.
(2) Fight scenes and chase scenes. By scrutinizing the audio information of these two kinds of scenes, we find that each has its own unique audio characteristics: fight scenes usually contain shouting, while chase scenes often contain screeching and whistling. Therefore, an audio analysis method (for example a hidden Markov model) can be adopted to distinguish these unique kinds of audio information, thereby distinguishing fight scenes from chase scenes: the action scenes containing shouting audio features are selected as fight scenes, and the action scenes containing screeching and whistling audio features are selected as chase scenes.
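The routing of an action scene into the three sub-classes can be sketched as follows. The color-fraction representation, the threshold value, and the audio tag names are illustrative stand-ins for the color pre-processing and audio analysis described above:

```python
def classify_action_scene(scene, color_threshold=0.3):
    """Route an action scene into gunfight / fight / chase using the cues
    named in the text; feature extraction is assumed done upstream."""
    c = scene["colors"]  # fractions of orange / yellow / red pixels
    if all(c[k] > color_threshold for k in ("orange", "yellow", "red")):
        return "gunfight"
    if "shouting" in scene["audio_tags"]:
        return "fight"
    if {"screeching", "whistling"} & set(scene["audio_tags"]):
        return "chase"
    return "generic action"

scene = {"colors": {"orange": 0.1, "yellow": 0.2, "red": 0.15},
         "audio_tags": ["screeching", "engine"]}
print(classify_action_scene(scene))  # -> chase
```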
3. Detection of suspense scenes
Many suspense scenes appear in horror films and detective films. A scene is called a suspense scene when it satisfies the following three conditions simultaneously:
(1) the average illumination intensity of the scene is less than 50;
(2) the audio energy envelope of the first several shots of the scene does not exceed 5, while the change of the audio energy envelope between some two shots exceeds 50;
(3) the activity intensity of the first several shots of the scene does not exceed 5, while the change of the activity intensity between some two shots exceeds 100.
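The three suspense conditions can likewise be expressed as a predicate. The number of opening shots checked and the data layout are assumptions, since the text says only "the first several shots":

```python
def is_suspense_scene(scene, leading=3):
    """Apply the three suspense conditions: low average illumination, quiet
    opening shots with a sharp audio jump somewhere, and still opening shots
    with a sharp activity jump somewhere."""
    energies = scene["audio_energy"]  # per-shot audio energy envelope
    activity = scene["activity"]      # per-shot activity intensity
    cond1 = scene["avg_illumination"] < 50
    cond2 = all(e <= 5 for e in energies[:leading]) and \
        any(abs(b - a) > 50 for a, b in zip(energies, energies[1:]))
    cond3 = all(a <= 5 for a in activity[:leading]) and \
        any(abs(b - a) > 100 for a, b in zip(activity, activity[1:]))
    return cond1 and cond2 and cond3

dark_quiet_then_jump = {"avg_illumination": 30,
                        "audio_energy": [2, 3, 1, 90, 85],
                        "activity": [1, 2, 2, 150, 140]}
print(is_suspense_scene(dark_quiet_then_jump))  # -> True
```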
Step C: select summary segments from the highlight scenes according to actual requirements, and splice them in temporal order to generate the summary of the original video.
A video summary can be generated directly from the highlight scenes obtained in step B; for example, a suitable subset of highlight scenes can be selected at random according to the required summary duration, or a certain class of highlight scenes can be selected according to the main theme of the video. The present invention further screens the highlight scenes according to the transition intensity between highlight plots, so as to better describe the development of the plot through the changes between plots. Step C specifically comprises the following sub-steps:
Step C1: compute the progress intensity between any two highlight scenes.
The transition types between highlight plots comprise the following three kinds: temporal transition, spatial transition, and rhythm transition. From our daily experience and film editing principles, the less the correlation between two plots, the greater the corresponding transition intensity. Therefore, the scene transition intensity here is not only the leading indicator for evaluating the development of the movie plot, but also an important basis for generating the movie summary.
(1) Generally speaking, the temporal transition between two different scenes can be described by the corresponding numbers of faces. The temporal transition intensity between two scenes AS_u and AS_v is expressed as:
In the formula, N(AS_u, Sh_l, Kf_p) is the number of faces appearing in key frame p of the last shot l in scene AS_u, N(AS_v, Sh_w, Kf_q) is the number of faces appearing in key frame q of the first shot w in scene AS_v, and P and Q are respectively the numbers of key frames in shots l and w.
If the following inequality holds, there is a temporal transition between scenes AS_u and AS_v:
(2) A spatial transition indicates that the same actor appears in two different scenes, which can be determined by judging the change of the background region. The computing formula of the spatial transition intensity is as follows:
In the formula, RA(p), GA(p), BA(p), and LA(p) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame p of the last shot l in scene AS_u; RA(q), GA(q), BA(q), and LA(q) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame q of the first shot w in scene AS_v; P and Q are respectively the numbers of key frames in shots l and w.
If the following inequality holds, there is a spatial transition between scenes AS_u and AS_v:
(3) The rhythm transition acts on the shot durations and represents the tension or relaxation of the atmosphere. The computing formula of the rhythm transition intensity between scenes AS_u and AS_v is as follows:
In the formula, Len(Sh_m) is the number of frames in the m-th shot Sh_m of scene AS_u, Len(Sh_n) is the number of frames in the n-th shot Sh_n of scene AS_v, and M and N are respectively the numbers of shots in scenes AS_u and AS_v.
When the following inequality holds, there is a rhythm transition between scenes AS_u and AS_v:
The present invention adopts a progress intensity function (Progress Intensity Function, PIF) to describe the plot progression between two scenes AS_u and AS_v. Its expression is as follows:

PIF(AS_u, AS_v) = α·TT_n(AS_u, AS_v) + β·ST_n(AS_u, AS_v) + γ·RT_n(AS_u, AS_v)    (7)
In the formula, PIF(AS_u, AS_v) denotes the progress intensity between two different scenes AS_u and AS_v; TT_n(AS_u, AS_v), ST_n(AS_u, AS_v), and RT_n(AS_u, AS_v) are respectively the normalized forms of the temporal transition intensity TT(AS_u, AS_v), the spatial transition intensity ST(AS_u, AS_v), and the rhythm transition intensity RT(AS_u, AS_v) between AS_u and AS_v; α, β, and γ are weighting coefficients satisfying α + β + γ = 1.
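The weighted combination of formula (7) can be sketched as follows. The min-max normalization and the weight values are illustrative choices, since the patent does not specify how the three intensities are normalized:

```python
def normalize(values):
    """Min-max normalize a dict of pairwise intensities into [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in values.items()}

def progress_intensity(tt, st, rt, alpha=0.4, beta=0.3, gamma=0.3):
    """PIF(AS_u, AS_v) = a*TT_n + b*ST_n + c*RT_n with a + b + c = 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    tt_n, st_n, rt_n = normalize(tt), normalize(st), normalize(rt)
    return {pair: alpha * tt_n[pair] + beta * st_n[pair] + gamma * rt_n[pair]
            for pair in tt}

# Raw pairwise intensities for three scenes (illustrative values).
tt = {("AS1", "AS2"): 2.0, ("AS1", "AS3"): 6.0, ("AS2", "AS3"): 4.0}
st = {("AS1", "AS2"): 10.0, ("AS1", "AS3"): 30.0, ("AS2", "AS3"): 20.0}
rt = {("AS1", "AS2"): 1.0, ("AS1", "AS3"): 5.0, ("AS2", "AS3"): 3.0}
pif = progress_intensity(tt, st, rt)
print(max(pif, key=pif.get))  # -> ('AS1', 'AS3')
```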
Step C2: sort the progress intensities in descending order, and select all highlight scenes corresponding to the K largest progress intensities as candidate summary segments; the value of K is less than or equal to the total number of highlight scenes detected in step B.
Because the required summary duration is usually short, a subset of highlight scenes whose progress intensities with respect to the other highlight scenes are the largest can be selected as candidate summary segments according to the required summary duration.
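Step C2 then amounts to sorting the pairwise intensities and keeping every highlight scene that appears in the K strongest pairs, for example (a sketch with assumed data structures):

```python
def select_candidates(pif, K):
    """Sort pairwise progress intensities in descending order and keep
    every highlight scene appearing in the K strongest pairs."""
    top = sorted(pif.items(), key=lambda kv: kv[1], reverse=True)[:K]
    chosen = []
    for (u, v), _ in top:
        for s in (u, v):
            if s not in chosen:
                chosen.append(s)
    return chosen

pif = {("AS1", "AS2"): 0.9, ("AS2", "AS3"): 0.7, ("AS1", "AS3"): 0.2}
print(select_candidates(pif, K=2))  # -> ['AS1', 'AS2', 'AS3']
```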
Step C3: select the final summary segments from the candidate summary segments, and splice them in temporal order to generate the summary of the original video.
The candidate summary segments in fact basically meet the summary requirements, so they can be used directly as the final summary segments and spliced in temporal order to generate the video summary. Fig. 2 shows an example of generating a video summary by this scheme. In this example, four highlight scenes, KS-1, KS-2, KS-3, and KS-4, are obtained by step B, with durations of 2 seconds, 3 seconds, 3 seconds, and 4 seconds respectively. Their pairwise progress intensities are shown in Fig. 1. According to the descending order of progress intensity, the final summary segments can be determined from the given summary length. For example, when the given summary length is 7 seconds, KS-1 and KS-3 should be selected, as shown in Fig. 2; when the given summary length is 10 seconds, KS-1, KS-3, and KS-4 should be selected, as shown in Fig. 3.
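The length-constrained selection illustrated by the KS example can be sketched as a greedy walk over the scenes in descending progress-intensity order; the ranking order below is assumed for the demonstration:

```python
def pick_by_length(ranked_scenes, durations, budget):
    """Walk the scenes in descending progress-intensity order, take each
    one whose duration still fits within the summary length budget, then
    splice the picks back into temporal order."""
    picked, used = [], 0.0
    for name in ranked_scenes:
        if used + durations[name] <= budget:
            picked.append(name)
            used += durations[name]
    return sorted(picked)  # scene names encode temporal order here

durations = {"KS-1": 2, "KS-2": 3, "KS-3": 3, "KS-4": 4}
ranked = ["KS-3", "KS-1", "KS-4", "KS-2"]  # assumed intensity ranking
print(pick_by_length(ranked, durations, budget=7))   # -> ['KS-1', 'KS-3']
print(pick_by_length(ranked, durations, budget=10))  # -> ['KS-1', 'KS-3', 'KS-4']
```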
To allow the video segments in the final summary to be presented to the audience more smoothly, the present invention further rejects candidate summary segments that are too short to convey useful information, and adjusts the candidate summary segments according to the completeness of speech.
The goal of the summary is to generate highly compressed video segments that contain as much useful plot information as possible and can be presented to the audience smoothly. For each generated summary segment to be meaningful, no summary segment can be too short; statistical research finds that a video sequence shorter than 1 second cannot convey any useful information. Therefore, the present invention first directly rejects candidate summary segments with a duration of less than 1 second as useless video segments.
To present the video segments to viewers as smoothly as possible, the remaining candidate summary segments also need to be suitably adjusted according to sentence completeness, as follows:
Complete-sentence detection is performed on each remaining candidate summary segment, and the candidate summary segments are adjusted accordingly. Complete-sentence detection can adopt various existing methods; for example, Schreiner proposes a complete-sentence detection method based on the modulation spectrum (O. Schreiner. Modulation spectrum for pitch and speech pause detection. In Proceedings of the European Conference on Speech Communication and Technology, 2003); Liu et al. adopt the conditional random field method to detect the boundaries of complete sentences (Y. Liu, A. Stolcke, E. Shriberg, and M. Harper. Using Conditional Random Fields for Sentence Boundary Detection in Speech. Annual Meeting of the Association for Computational Linguistics, 2005); Szczurowska et al. use Kohonen networks to detect sentence boundary completeness (I. Szczurowska, W., and E. Smolka. Speech nonfluency detection using Kohonen networks. Neural Computing and Applications, 2009, 18(7):677-687). This embodiment adopts the following method for complete-sentence detection: use the audio energy and the second-order zero-crossing rate to detect pause segments from the continuous speech sequence; adopt a minimum pause duration and a minimum sentence duration to smooth the segmentation result of the previous step; and use longer pauses to detect sentence segments.
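The pause-detection stage of the complete-sentence detector can be sketched as follows. The energy threshold, frame size, and minimum pause duration are illustrative, and the second-order zero-crossing rate used by the embodiment is omitted for brevity:

```python
import numpy as np

def detect_pauses(signal, sr, frame_ms=20, energy_thresh=0.01, min_pause_ms=200):
    """Label each frame as a pause when its short-term energy falls below the
    threshold, then keep only pauses at least `min_pause_ms` long (the
    smoothing step)."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    quiet = [float(np.mean(f ** 2)) < energy_thresh for f in frames]
    min_frames = max(1, min_pause_ms // frame_ms)
    pauses, start = [], None
    for i, q in enumerate(quiet + [False]):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if i - start >= min_frames:
                pauses.append((start * frame_ms, i * frame_ms))
            start = None
    return pauses  # list of (start_ms, end_ms)

sr = 8000
t = np.arange(sr) / sr  # one second of synthetic audio
speech = 0.5 * np.sin(2 * np.pi * 200 * t)
speech[3200:6400] = 0.0  # 400 ms of silence in the middle
print(detect_pauses(speech, sr))  # -> [(400, 800)]
```

Longer pauses found this way would then serve as the sentence-segment boundaries.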
The candidate summary segments are adjusted according to the detection result: if the boundary of a complete sentence exceeds the boundary of a candidate summary segment, the boundary of that candidate summary segment is extended to the boundary of the complete sentence. There are two cases in which the boundary of a complete sentence exceeds the boundary of a candidate summary segment: in one, the complete sentence exceeds the candidate summary segment boundary on one side only; in the other, it exceeds both the front and back boundaries (that is, the complete sentence covers the candidate summary segment); in both cases the boundary of the candidate summary segment is adjusted to the boundary of the complete sentence. The adjusted candidate summary segments are the final summary segments. Assembling the final summary segments in temporal order yields the video summary.
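The boundary-adjustment rule (extend a candidate segment so that any complete sentence it overlaps is fully included, covering both the one-sided and the full-coverage case) can be sketched as:

```python
def adjust_to_sentence(segment, sentences):
    """Extend a candidate summary segment so that every complete sentence
    it overlaps is fully included."""
    start, end = segment
    for s_start, s_end in sentences:
        overlaps = s_start < end and s_end > start
        if overlaps:
            start = min(start, s_start)
            end = max(end, s_end)
    return (start, end)

sentences = [(1.0, 4.0), (5.0, 9.0)]  # detected complete sentences (seconds)
print(adjust_to_sentence((3.0, 6.0), sentences))  # -> (1.0, 9.0)
print(adjust_to_sentence((4.2, 4.8), sentences))  # -> (4.2, 4.8)
```

The second call shows that a segment overlapping no sentence is left unchanged.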
Compared with existing video abstract extraction methods, the present invention selects suitable summary segments according to plot development relations to generate the video summary, which better conforms to human logical thinking, helps guarantee the integrity of the film's content, and accurately embodies the main plot of the video.