Summary of the invention
The technical problem to be solved by this invention is to overcome the deficiencies of existing video summarization methods by providing a plot-based video abstract extraction method that selects suitable summary segments according to plot development relations, which both conforms to human logical thinking and helps guarantee the integrity of the film's content.
The plot-based video abstract extraction method of the present invention comprises the following steps:
Step A: perform key frame, shot, and scene detection on the original video;
Step B: detect highlight scenes from the detected scenes according to the video storyline;
Step C: select summary segments from the highlight scenes according to actual requirements, and splice them in temporal order to generate the summary of the original video.
The detection of the highlight scenes comprises:
Dialogue scene detection: first, detect the scenes containing alternately appearing face shots based on face information, as candidate dialogue scenes; then, select from the candidate dialogue scenes those that contain speech; these are the dialogue scenes;
Action scene detection: a scene is regarded as an action scene when it satisfies the following three conditions simultaneously: the frame count of each shot in the scene is less than 25, the average activity intensity of each shot exceeds 200, and the average audio energy of each shot exceeds 100;
Suspense scene detection: a scene is a suspense scene when it satisfies the following three conditions simultaneously: the average illumination intensity of the scene is less than 50; the audio energy envelope of the first several shots of the scene does not exceed 5, while the change of the audio energy envelope between some two shots exceeds 50; the activity intensity of the first several shots of the scene does not exceed 5, while the change of the activity intensity between some two shots exceeds 100.
Further, the dialogue scene detection also comprises the detection of emotional dialogue scenes: extract the average fundamental frequency and the short-term intensity variation of each dialogue scene, and select those dialogue scenes in which both exceed predetermined thresholds; these are the emotional dialogue scenes.
Further, the action scene detection also comprises:
Gunfight scene detection: select the action scenes in which the orange, yellow, and red color features all exceed predetermined thresholds as gunfight scenes;
Fight scene detection: select the action scenes containing shouting audio features as fight scenes;
Chase scene detection: select the action scenes containing screeching and whistling audio features as chase scenes.
Preferably, step C specifically comprises the following sub-steps:
Step C1: compute the progress intensity between any two highlight scenes according to the following formula:

PIF(AS_u, AS_v) = α·TT_n(AS_u, AS_v) + β·ST_n(AS_u, AS_v) + γ·RT_n(AS_u, AS_v)
In the formula, PIF(AS_u, AS_v) denotes the progress intensity between two different scenes AS_u and AS_v; TT_n(AS_u, AS_v), ST_n(AS_u, AS_v), and RT_n(AS_u, AS_v) are respectively the normalized forms of the temporal transition intensity TT(AS_u, AS_v), the spatial transition intensity ST(AS_u, AS_v), and the rhythm transition intensity RT(AS_u, AS_v) between AS_u and AS_v; α, β, and γ are weighting coefficients satisfying α + β + γ = 1. Specifically:
The temporal transition intensity TT(AS_u, AS_v) is computed according to the following formula:
In the formula, N(AS_u, Sh_l, Kf_p) is the number of faces appearing in key frame p of the last shot l in scene AS_u, N(AS_v, Sh_w, Kf_q) is the number of faces appearing in key frame q of the first shot w in scene AS_v, and P and Q are respectively the numbers of key frames in shots l and w;
The spatial transition intensity ST(AS_u, AS_v) is computed according to the following formula:
In the formula, RA(p), GA(p), BA(p), and LA(p) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame p of the last shot l in scene AS_u; RA(q), GA(q), BA(q), and LA(q) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame q of the first shot w in scene AS_v; P and Q are respectively the numbers of key frames in shots l and w;
The rhythm transition intensity RT(AS_u, AS_v) is computed according to the following formula:
In the formula, Len(Sh_m) is the number of frames in the m-th shot Sh_m of scene AS_u, Len(Sh_n) is the number of frames in the n-th shot Sh_n of scene AS_v, and M and N are respectively the numbers of shots in scenes AS_u and AS_v;
Step C2: sort the progress intensities in descending order, and select all highlight scenes corresponding to the K largest progress intensities as candidate summary segments; the value of K is less than or equal to the total number of highlight scenes detected in step B;
Step C3: select the final summary segments from the candidate summary segments, and splice them in temporal order to generate the summary of the original video.
When selecting the final summary segments from the candidate summary segments, the candidate summary segments may be used directly as the final summary segments, or segments may be selected at random according to the required summary length. To allow the video segments in the final summary to be presented to the audience more smoothly, the present invention further rejects candidate summary segments that are too short to convey useful information, and adjusts the candidate summary segments according to the completeness of speech, specifically as follows:
First, highlight scenes shorter than 1 second are rejected from the candidate summary segments. Then, complete-sentence detection is performed on each remaining candidate summary segment, and the candidate summary segments are adjusted according to the detection result: if the boundary of a complete sentence exceeds the boundary of a candidate summary segment, the boundary of that candidate summary segment is extended to the boundary of the complete sentence. The adjusted candidate summary segments are the final summary segments.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention selects suitable summary segments according to plot development relations to generate the video summary, which both conforms to human logical thinking and helps guarantee the integrity of the film's content. In addition, compared with low-level or mid-level audiovisual features, plot development features express high-level semantic meaning; therefore, a summary generated by this method can be regarded as closer to a semantic description of the video content.
Embodiment
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings:
The purpose of this invention is to provide a plot-based video abstract extraction method. Its implementation approach is as follows: first, analyze the film structure using temporal correlation, including effective segmentation of shots and scenes; then, analyze the content of the scenes of interest and extract audiovisual expressive features to realize plot analysis; finally, according to the transition intensities between scenes, generate a movie summary that conforms to human viewing habits.
A preferred implementation of the plot-based video abstract extraction method of the present invention specifically comprises the following steps:
Step A: perform key frame, shot, and scene detection on the original video.
1. Shot segmentation
A shot is the elementary unit of video data; therefore, dividing video data into meaningful shots is the first step in extracting a summary. From an image-processing perspective, shot segmentation is the process of clustering frames taken in the same place into the same class. Various existing shot detection methods may be adopted; for example, Zhuang et al. propose an unsupervised method for shot boundary detection (Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In Proceedings of IEEE International Conference on Image Processing, 1998:866-870); Boreczky et al. adopt hidden Markov models to solve the shot boundary problem (J. Boreczky, L. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998:3741-3744); Lienhart uses neural networks for shot boundary detection (R. Lienhart. Reliable dissolve detection. In Proceedings of IEEE International Conference on Storage and Retrieval for Media Databases, 2001:219-230). To make shot detection more accurate, this embodiment uses the following method for shot segmentation. First, candidate shot boundaries are detected: the initial boundary of a candidate shot is determined using the difference of content information between frames, and on this basis the exact boundary of the candidate shot is determined according to the difference of content information in the neighborhood of the initial shot boundary. Second, the transition type (gradual, abrupt, etc.) of each true candidate shot is determined according to the two-dimensional entropy feature of the frames, while invalid candidate shots generated by situations such as rapid object motion, camera shake, and flashlights are removed.
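The first stage of the shot-segmentation method above (candidate boundary detection from the inter-frame content difference) can be sketched as follows. This is a minimal illustration: the gray-level histogram metric, the bin count, and the threshold value are assumptions for demonstration, not the exact content-difference measure of the embodiment.

```python
import numpy as np

def histogram_diff(frame_a, frame_b, bins=16):
    """Normalized L1 distance between the gray-level histograms of two frames."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    n = frame_a.size
    return np.abs(ha - hb).sum() / (2 * n)  # value in [0, 1]

def candidate_shot_boundaries(frames, threshold=0.4):
    """Mark frame index i as a candidate boundary when the content
    difference between frame i-1 and frame i exceeds the threshold."""
    return [i for i in range(1, len(frames))
            if histogram_diff(frames[i - 1], frames[i]) > threshold]

# Synthetic demo: 10 dark frames followed by 10 bright frames.
rng = np.random.default_rng(0)
dark = [rng.integers(0, 40, (24, 32)) for _ in range(10)]
bright = [rng.integers(200, 256, (24, 32)) for _ in range(10)]
print(candidate_shot_boundaries(dark + bright))  # -> [10]
```

In the embodiment the detected candidate boundary would then be refined within its neighborhood and filtered by the two-dimensional entropy feature.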
2. Scene detection
Compared with content analysis at the shot level, the content information at the scene level is more meaningful and more complete. This is because, from an image-processing perspective, the scene detection process may be defined as clustering the shots that have spatio-temporal correlation into the same scene. Various existing scene detection methods may be used in the present invention; for example, Yeung et al. propose a scene transition graph method to detect scene boundaries (M. Yeung, B. Yeo, and B. Liu. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, 1998, 71(1):94-109); Tavanapong et al. detect scene boundaries in combination with film-making theory (W. Tavanapong and J. Zhou. Shot clustering technique for story browsing. IEEE Transactions on Multimedia, 2004, 6(4):517-526); Zhai et al. adopt the Markov chain Monte Carlo method to solve the scene boundary detection problem (Y. Zhai and M. Shah. A general framework for temporal video scene segmentation. In Proceedings of IEEE International Conference on Computer Vision, 2005:1111-1116). This embodiment preferably adopts the following method for scene detection: first, under a time-window constraint mechanism (in order to guarantee that shots from the same scene are not divided into different scenes, and to prevent shots from different scenes from being mistakenly merged into the same scene), determine the spatial correlation of the semantic content information between shots; then, on the basis of the existing spatio-temporal correlation, accurately establish the scene boundaries according to the difference of semantic content information between shots. This method introduces a time-constraint mechanism into scene detection, which avoids under-segmentation or over-segmentation of scenes, thereby obtaining accurate scene segments.
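The time-window grouping idea above can be illustrated with a minimal sketch. The scalar shot feature, the similarity function, the window size, and the threshold are all placeholders standing in for the semantic content information of the embodiment:

```python
def group_shots_into_scenes(shot_features, window=3, sim_threshold=0.8):
    """Greedy time-window grouping: a shot joins the current scene if it is
    sufficiently similar to any of the last `window` shots of that scene."""
    def similarity(a, b):
        # Toy similarity on scalar features, mapped into [0, 1].
        return 1.0 - min(1.0, abs(a - b))
    scenes = [[0]]
    for i in range(1, len(shot_features)):
        recent = scenes[-1][-window:]
        if any(similarity(shot_features[i], shot_features[j]) >= sim_threshold
               for j in recent):
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes

# Shots 0-3 share similar content; shots 4-6 differ sharply.
print(group_shots_into_scenes([0.1, 0.15, 0.12, 0.1, 0.9, 0.95, 0.92]))
# -> [[0, 1, 2, 3], [4, 5, 6]]
```

The window bound is what prevents shots far apart in time from being merged into one scene even if their content happens to be similar.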
Step B: detect highlight scenes from the detected scenes according to the video storyline.
The plot is essential to the effective management of films and the semantic understanding of movie content, and the highlight plots (scenes) within it are the core of the whole movie content. Building a video summary from highlight scenes can therefore better embody the core content of the video. The present invention detects the three most representative kinds of highlight scenes, namely dialogue scenes, action scenes, and suspense scenes, by analyzing the audiovisual features of the video, as follows:
1. Detection of dialogue scenes
Dialogue scenes in a film often convey important information and help the viewer understand the development of the plot. The present invention first uses face detection to find the scenes containing alternately appearing face shots, as candidate dialogue scenes; then, an audio analysis method (for example a hidden Markov model) is used to distinguish speech from other audio, and the scenes containing speech are selected from the candidate dialogue scenes; these are the dialogue scenes.
Among the different dialogue scenes, emotional dialogue scenes more easily attract the viewer's attention and have an important impact on the development of the whole plot. Therefore, it is necessary to detect emotional dialogue scenes among the general dialogue scenes. The present invention adopts two typical audio features to identify emotional dialogue scenes: the average fundamental frequency and the short-term intensity variation. Specifically: extract the average fundamental frequency and the short-term intensity variation of each dialogue scene, and select those dialogue scenes in which both exceed predetermined thresholds; these are the emotional dialogue scenes.
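The emotional-dialogue selection rule reduces to a double-threshold test. In this sketch the threshold values and the scene record layout are illustrative assumptions; the patent states only that both features must exceed predetermined thresholds:

```python
def emotional_dialogue_scenes(scenes, f0_threshold=220.0, intensity_threshold=12.0):
    """Select dialogue scenes whose average fundamental frequency AND
    short-term intensity variation both exceed the given thresholds."""
    return [s["name"] for s in scenes
            if s["avg_f0"] > f0_threshold
            and s["intensity_var"] > intensity_threshold]

dialogues = [
    {"name": "D1", "avg_f0": 180.0, "intensity_var": 15.0},  # calm voice
    {"name": "D2", "avg_f0": 250.0, "intensity_var": 20.0},  # raised pitch and volume
    {"name": "D3", "avg_f0": 240.0, "intensity_var": 8.0},   # high pitch, flat volume
]
print(emotional_dialogue_scenes(dialogues))  # -> ['D2']
```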
2. Detection of action scenes
In action, war, and adventure videos, many action scenes often appear, such as gunfight scenes, fight scenes, and chase scenes.
A scene is regarded as an action scene when it satisfies the following three conditions simultaneously: the frame count of each shot in the scene is less than 25, the average activity intensity of each shot exceeds 200, and the average audio energy of each shot exceeds 100. On this basis, action scenes can be further divided into the following three more familiar kinds.
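The three action-scene conditions translate directly into a predicate. The per-shot record layout below is an assumption; the numeric thresholds (25 frames, 200, 100) are the ones stated above:

```python
def is_action_scene(shots):
    """A scene is an action scene when every shot has fewer than 25 frames,
    average activity intensity above 200, and average audio energy above 100."""
    return all(s["frames"] < 25 for s in shots) \
        and all(s["activity"] > 200 for s in shots) \
        and all(s["audio_energy"] > 100 for s in shots)

fast_cut = [{"frames": 18, "activity": 260, "audio_energy": 140},
            {"frames": 22, "activity": 310, "audio_energy": 125}]
slow_talk = [{"frames": 90, "activity": 40, "audio_energy": 60}]
print(is_action_scene(fast_cut), is_action_scene(slow_talk))  # -> True False
```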
(1) Gunfight scenes. From daily experience and film-making theory, pictures of gunfire, explosions, and bleeding often appear in gunfight scenes. Through careful analysis of color histograms, we find that the most significant colors of these three kinds of pictures are, respectively, orange, yellow, and red. Therefore, we identify gunfight scenes by pre-processing the colors: the action scenes in which the orange, yellow, and red color features all exceed predetermined thresholds are selected as gunfight scenes.
(2) Fight scenes and chase scenes. By scrutinizing the audio information of these two kinds of scenes, we find that each has its own unique audio characteristics: fight scenes usually contain shouting, while chase scenes often contain screeching and whistling. Therefore, an audio analysis method (for example a hidden Markov model) can be adopted to distinguish these unique kinds of audio information, thereby distinguishing fight scenes from chase scenes: the action scenes containing shouting audio features are selected as fight scenes, and the action scenes containing screeching and whistling audio features are selected as chase scenes.
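The routing of an action scene into the three sub-classes can be sketched as follows. The color-fraction representation, the threshold value, and the audio tag names are illustrative stand-ins for the color pre-processing and audio analysis described above:

```python
def classify_action_scene(scene, color_threshold=0.3):
    """Route an action scene into gunfight / fight / chase using the cues
    named in the text; feature extraction is assumed done upstream."""
    c = scene["colors"]  # fractions of orange / yellow / red pixels
    if all(c[k] > color_threshold for k in ("orange", "yellow", "red")):
        return "gunfight"
    if "shouting" in scene["audio_tags"]:
        return "fight"
    if {"screeching", "whistling"} & set(scene["audio_tags"]):
        return "chase"
    return "generic action"

scene = {"colors": {"orange": 0.1, "yellow": 0.2, "red": 0.15},
         "audio_tags": ["screeching", "engine"]}
print(classify_action_scene(scene))  # -> chase
```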
3. Detection of suspense scenes
Many suspense scenes appear in horror films and detective films. A scene is called a suspense scene when it satisfies the following three conditions simultaneously:
(1) the average illumination intensity of the scene is less than 50;
(2) the audio energy envelope of the first several shots of the scene does not exceed 5, while the change of the audio energy envelope between some two shots exceeds 50;
(3) the activity intensity of the first several shots of the scene does not exceed 5, while the change of the activity intensity between some two shots exceeds 100.
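The three suspense conditions can likewise be expressed as a predicate. The number of opening shots checked and the data layout are assumptions, since the text says only "the first several shots":

```python
def is_suspense_scene(scene, leading=3):
    """Apply the three suspense conditions: low average illumination, quiet
    opening shots with a sharp audio jump somewhere, and still opening shots
    with a sharp activity jump somewhere."""
    energies = scene["audio_energy"]  # per-shot audio energy envelope
    activity = scene["activity"]      # per-shot activity intensity
    cond1 = scene["avg_illumination"] < 50
    cond2 = all(e <= 5 for e in energies[:leading]) and \
        any(abs(b - a) > 50 for a, b in zip(energies, energies[1:]))
    cond3 = all(a <= 5 for a in activity[:leading]) and \
        any(abs(b - a) > 100 for a, b in zip(activity, activity[1:]))
    return cond1 and cond2 and cond3

dark_quiet_then_jump = {"avg_illumination": 30,
                        "audio_energy": [2, 3, 1, 90, 85],
                        "activity": [1, 2, 2, 150, 140]}
print(is_suspense_scene(dark_quiet_then_jump))  # -> True
```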
Step C: select summary segments from the highlight scenes according to actual requirements, and splice them in temporal order to generate the summary of the original video.
A video summary can be generated directly from the highlight scenes obtained in step B; for example, a suitable subset of highlight scenes can be selected at random according to the required summary duration, or a certain class of highlight scenes can be selected according to the main theme of the video. The present invention further screens the highlight scenes according to the transition intensity between highlight plots, so as to better describe the development of the plot through the changes between plots. Step C specifically comprises the following sub-steps:
Step C1: compute the progress intensity between any two highlight scenes.
The transition types between highlight plots comprise the following three kinds: temporal transition, spatial transition, and rhythm transition. From our daily experience and film editing principles, the less the correlation between two plots, the greater the corresponding transition intensity. Therefore, the scene transition intensity here is not only the leading indicator for evaluating the development of the movie plot, but also an important basis for generating the movie summary.
(1) Generally speaking, the temporal transition between two different scenes can be described by the corresponding numbers of faces. The temporal transition intensity between two scenes AS_u and AS_v is expressed as:
In the formula, N(AS_u, Sh_l, Kf_p) is the number of faces appearing in key frame p of the last shot l in scene AS_u, N(AS_v, Sh_w, Kf_q) is the number of faces appearing in key frame q of the first shot w in scene AS_v, and P and Q are respectively the numbers of key frames in shots l and w.
If the following inequality holds, there is a temporal transition between scenes AS_u and AS_v:
(2) A spatial transition indicates that the same actor appears in two different scenes, which can be determined by judging the change of the background region. The computing formula of the spatial transition intensity is as follows:
In the formula, RA(p), GA(p), BA(p), and LA(p) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame p of the last shot l in scene AS_u; RA(q), GA(q), BA(q), and LA(q) denote respectively the mean values of red, green, blue, and luminance in the background region of key frame q of the first shot w in scene AS_v; P and Q are respectively the numbers of key frames in shots l and w.
If the following inequality holds, there is a spatial transition between scenes AS_u and AS_v:
(3) The rhythm transition acts on the shot durations and represents the tension or relaxation of the atmosphere. The computing formula of the rhythm transition intensity between scenes AS_u and AS_v is as follows:
In the formula, Len(Sh_m) is the number of frames in the m-th shot Sh_m of scene AS_u, Len(Sh_n) is the number of frames in the n-th shot Sh_n of scene AS_v, and M and N are respectively the numbers of shots in scenes AS_u and AS_v.
When the following inequality holds, there is a rhythm transition between scenes AS_u and AS_v:
The present invention adopts a progress intensity function (Progress Intensity Function, PIF) to describe the plot progression between two scenes AS_u and AS_v. Its expression is as follows:

PIF(AS_u, AS_v) = α·TT_n(AS_u, AS_v) + β·ST_n(AS_u, AS_v) + γ·RT_n(AS_u, AS_v)    (7)
In the formula, PIF(AS_u, AS_v) denotes the progress intensity between two different scenes AS_u and AS_v; TT_n(AS_u, AS_v), ST_n(AS_u, AS_v), and RT_n(AS_u, AS_v) are respectively the normalized forms of the temporal transition intensity TT(AS_u, AS_v), the spatial transition intensity ST(AS_u, AS_v), and the rhythm transition intensity RT(AS_u, AS_v) between AS_u and AS_v; α, β, and γ are weighting coefficients satisfying α + β + γ = 1.
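The weighted combination of formula (7) can be sketched as follows. The min-max normalization and the weight values are illustrative choices, since the patent does not specify how the three intensities are normalized:

```python
def normalize(values):
    """Min-max normalize a dict of pairwise intensities into [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in values.items()}

def progress_intensity(tt, st, rt, alpha=0.4, beta=0.3, gamma=0.3):
    """PIF(AS_u, AS_v) = a*TT_n + b*ST_n + c*RT_n with a + b + c = 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    tt_n, st_n, rt_n = normalize(tt), normalize(st), normalize(rt)
    return {pair: alpha * tt_n[pair] + beta * st_n[pair] + gamma * rt_n[pair]
            for pair in tt}

# Raw pairwise intensities for three scenes (illustrative values).
tt = {("AS1", "AS2"): 2.0, ("AS1", "AS3"): 6.0, ("AS2", "AS3"): 4.0}
st = {("AS1", "AS2"): 10.0, ("AS1", "AS3"): 30.0, ("AS2", "AS3"): 20.0}
rt = {("AS1", "AS2"): 1.0, ("AS1", "AS3"): 5.0, ("AS2", "AS3"): 3.0}
pif = progress_intensity(tt, st, rt)
print(max(pif, key=pif.get))  # -> ('AS1', 'AS3')
```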
Step C2: sort the progress intensities in descending order, and select all highlight scenes corresponding to the K largest progress intensities as candidate summary segments; the value of K is less than or equal to the total number of highlight scenes detected in step B.
Because the required summary duration is usually short, a subset of highlight scenes whose progress intensities with respect to the other highlight scenes are the largest can be selected as candidate summary segments according to the required summary duration.
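Step C2 then amounts to sorting the pairwise intensities and keeping every highlight scene that appears in the K strongest pairs, for example (a sketch with assumed data structures):

```python
def select_candidates(pif, K):
    """Sort pairwise progress intensities in descending order and keep
    every highlight scene appearing in the K strongest pairs."""
    top = sorted(pif.items(), key=lambda kv: kv[1], reverse=True)[:K]
    chosen = []
    for (u, v), _ in top:
        for s in (u, v):
            if s not in chosen:
                chosen.append(s)
    return chosen

pif = {("AS1", "AS2"): 0.9, ("AS2", "AS3"): 0.7, ("AS1", "AS3"): 0.2}
print(select_candidates(pif, K=2))  # -> ['AS1', 'AS2', 'AS3']
```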
Step C3: select the final summary segments from the candidate summary segments, and splice them in temporal order to generate the summary of the original video.
The candidate summary segments in fact basically meet the summary requirements, so they can be used directly as the final summary segments and spliced in temporal order to generate the video summary. Fig. 2 shows an example of generating a video summary by this scheme. In this example, four highlight scenes, KS-1, KS-2, KS-3, and KS-4, are obtained by step B, with durations of 2 seconds, 3 seconds, 3 seconds, and 4 seconds respectively. Their pairwise progress intensities are shown in Fig. 1. According to the descending order of progress intensity, the final summary segments can be determined from the given summary length. For example, when the given summary length is 7 seconds, KS-1 and KS-3 should be selected, as shown in Fig. 2; when the given summary length is 10 seconds, KS-1, KS-3, and KS-4 should be selected, as shown in Fig. 3.
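The length-constrained selection illustrated by the KS example can be sketched as a greedy walk over the scenes in descending progress-intensity order; the ranking order below is assumed for the demonstration:

```python
def pick_by_length(ranked_scenes, durations, budget):
    """Walk the scenes in descending progress-intensity order, take each
    one whose duration still fits within the summary length budget, then
    splice the picks back into temporal order."""
    picked, used = [], 0.0
    for name in ranked_scenes:
        if used + durations[name] <= budget:
            picked.append(name)
            used += durations[name]
    return sorted(picked)  # scene names encode temporal order here

durations = {"KS-1": 2, "KS-2": 3, "KS-3": 3, "KS-4": 4}
ranked = ["KS-3", "KS-1", "KS-4", "KS-2"]  # assumed intensity ranking
print(pick_by_length(ranked, durations, budget=7))   # -> ['KS-1', 'KS-3']
print(pick_by_length(ranked, durations, budget=10))  # -> ['KS-1', 'KS-3', 'KS-4']
```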
To allow the video segments in the final summary to be presented to the audience more smoothly, the present invention further rejects candidate summary segments that are too short to convey useful information, and adjusts the candidate summary segments according to the completeness of speech.
The goal of the summary is to generate highly compressed video segments that contain as much useful plot information as possible and can be presented to the audience smoothly. For each generated summary segment to be meaningful, no summary segment can be too short; statistical research finds that a video sequence shorter than 1 second cannot convey any useful information. Therefore, the present invention first directly rejects candidate summary segments with a duration of less than 1 second as useless video segments.
To present the video segments to viewers as smoothly as possible, the remaining candidate summary segments also need to be suitably adjusted according to sentence completeness, as follows:
Complete-sentence detection is performed on each remaining candidate summary segment, and the candidate summary segments are adjusted accordingly. Complete-sentence detection can adopt various existing methods; for example, Schreiner proposes a complete-sentence detection method based on the modulation spectrum (O. Schreiner. Modulation spectrum for pitch and speech pause detection. In Proceedings of the European Conference on Speech Communication and Technology, 2003); Liu et al. adopt the conditional random field method to detect the boundaries of complete sentences (Y. Liu, A. Stolcke, E. Shriberg, and M. Harper. Using Conditional Random Fields for Sentence Boundary Detection in Speech. Annual Meeting of the Association for Computational Linguistics, 2005); Szczurowska et al. use Kohonen networks to detect sentence boundary completeness (I. Szczurowska, W., and E. Smolka. Speech nonfluency detection using Kohonen networks. Neural Computing and Applications, 2009, 18(7):677-687). This embodiment adopts the following method for complete-sentence detection: use the audio energy and the second-order zero-crossing rate to detect pause segments from the continuous speech sequence; adopt a minimum pause duration and a minimum sentence duration to smooth the segmentation result of the previous step; and use longer pauses to detect sentence segments.
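The pause-detection stage of the complete-sentence detector can be sketched as follows. The energy threshold, frame size, and minimum pause duration are illustrative, and the second-order zero-crossing rate used by the embodiment is omitted for brevity:

```python
import numpy as np

def detect_pauses(signal, sr, frame_ms=20, energy_thresh=0.01, min_pause_ms=200):
    """Label each frame as a pause when its short-term energy falls below the
    threshold, then keep only pauses at least `min_pause_ms` long (the
    smoothing step)."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    quiet = [float(np.mean(f ** 2)) < energy_thresh for f in frames]
    min_frames = max(1, min_pause_ms // frame_ms)
    pauses, start = [], None
    for i, q in enumerate(quiet + [False]):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if i - start >= min_frames:
                pauses.append((start * frame_ms, i * frame_ms))
            start = None
    return pauses  # list of (start_ms, end_ms)

sr = 8000
t = np.arange(sr) / sr  # one second of synthetic audio
speech = 0.5 * np.sin(2 * np.pi * 200 * t)
speech[3200:6400] = 0.0  # 400 ms of silence in the middle
print(detect_pauses(speech, sr))  # -> [(400, 800)]
```

Longer pauses found this way would then serve as the sentence-segment boundaries.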
The candidate summary segments are adjusted according to the detection result: if the boundary of a complete sentence exceeds the boundary of a candidate summary segment, the boundary of that candidate summary segment is extended to the boundary of the complete sentence. There are two cases in which the boundary of a complete sentence exceeds the boundary of a candidate summary segment: in one, the complete sentence exceeds the candidate summary segment boundary on one side only; in the other, it exceeds both the front and back boundaries (that is, the complete sentence covers the candidate summary segment); in both cases the boundary of the candidate summary segment is adjusted to the boundary of the complete sentence. The adjusted candidate summary segments are the final summary segments. Assembling the final summary segments in temporal order yields the video summary.
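The boundary-adjustment rule (extend a candidate segment so that any complete sentence it overlaps is fully included, covering both the one-sided and the full-coverage case) can be sketched as:

```python
def adjust_to_sentence(segment, sentences):
    """Extend a candidate summary segment so that every complete sentence
    it overlaps is fully included."""
    start, end = segment
    for s_start, s_end in sentences:
        overlaps = s_start < end and s_end > start
        if overlaps:
            start = min(start, s_start)
            end = max(end, s_end)
    return (start, end)

sentences = [(1.0, 4.0), (5.0, 9.0)]  # detected complete sentences (seconds)
print(adjust_to_sentence((3.0, 6.0), sentences))  # -> (1.0, 9.0)
print(adjust_to_sentence((4.2, 4.8), sentences))  # -> (4.2, 4.8)
```

The second call shows that a segment overlapping no sentence is left unchanged.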
Compared with existing video abstract extraction methods, the present invention selects suitable summary segments according to plot development relations to generate the video summary, which better conforms to human logical thinking, helps guarantee the integrity of the film's content, and accurately embodies the main plot of the video.