A method for detecting movie action scenes based on story-line development model analysis
Technical field
The present invention relates to the field of video analysis and detection, and in particular to a method for detecting movie action scenes based on story-line development model analysis.
Background technology
With the rapid development of digital media and networks, people can enjoy a wide variety of multimedia information without leaving home. The film industry has become an active force in video production worldwide, with roughly 4,500 new films appearing every year. How to effectively analyze, manage, query, and retrieve this massive amount of information has therefore become a pressing problem.
In movies, action scenes usually feature strong dramatic conflict and intense visual effects; they are therefore a focus of film editing and the scenes that audiences find most interesting. Detecting the action scenes in a film requires methods of movie content analysis. Current movie content analysis methods fall into two main categories:
1. detecting specific semantic events by analyzing film-editing rules and extracting low-level video/audio features;
2. detecting specific semantic events by analyzing the factors that change human attention and extracting low-level video/audio features.
Early work on content-based video analysis focused on structuring video and retrieving similar segments, mainly for videos with fairly regular structure such as news and sports. However, because of the "semantic gap" between low-level visual and auditory features and high-level human semantics, this style of video analysis and retrieval cannot satisfy human needs. As research on video semantic analysis has deepened and found application in structured videos such as news and sports, researchers have paid increasing attention to semantic event retrieval in movies. But a movie is a highly artistic production with a complex plot and latent editing patterns; existing methods describe specific semantics incompletely, which leads to low accuracy in semantic event detection and fails to reflect the development of the movie plot comprehensively.
In general, two key factors currently limit the accuracy and robustness of semantic event detection in movies:
1. a deep understanding of film-editing techniques and human attention models, and the mining of low-level video/audio features that characterize these factors;
2. the combination of the objective film-editing patterns and the subjective viewer perception to build a reasonable model that characterizes the development of the plot.
Summary of the invention
The object of the present invention is to overcome the defect of existing movie content analysis methods, which consider the semantic events of a movie only from the angle of editing technique or only from the angle of human perception and therefore achieve low detection accuracy, and to provide a movie content analysis method that fuses the two angles of film editing and human perception to detect movie action scenes with high accuracy and robustness, helping editors with the editing of a film and helping viewers select the scenes in a film that interest them.
To achieve this object, the invention provides a method for detecting movie action scenes based on story-line development model analysis, carried out in the following order of steps:
Step 10): extract the video frames from the original video and preprocess them, obtaining the key frames, the shots, the scenes in the shots, and the macroblock motion vectors of the video images;
Step 20): count the number of video frames contained in each shot, thereby obtaining the shot length;
Step 30): use the macroblock motion vectors obtained in step 10) to compute the average motion intensity of each shot;
Step 40): use the shot length obtained in step 20) and the shot average motion intensity obtained in step 30) to compute the film-editing factor;
Step 50): extract the audio frames from the original video, compute the short-time audio energy of each audio frame, and compute the shot average audio energy as the mean of the short-time audio energies within the shot;
Step 60): compute the shot average motion dispersion;
Step 70): use the shot average audio energy obtained in step 50) and the shot average motion dispersion obtained in step 60) to compute the human perception factor;
Step 80): build the story-line development model from the film-editing factor obtained in step 40) and the human perception factor obtained in step 70), and generate the story-line development curve in temporal order;
Step 90): detect the action scenes in the movie according to the story-line development curve obtained in step 80).
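The step sequence above can be sketched end to end as follows. This is a minimal illustrative sketch, not part of the claimed method: it assumes the per-shot low-level features of steps 20-60 (shot length, motion intensity, audio energy, motion dispersion) have already been extracted, and all function names and toy values are assumptions.

```python
# Sketch of steps 40-90: normalize per-shot features, fuse them into the
# story-line development model M(n), and flag a scene as an action scene
# when enough of its M(n) values exceed a threshold.

def normalize(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def plot_model(lengths, motion, audio, dispersion,
               alpha=0.5, beta=0.5, gamma=0.5, lam=0.5, phi=0.5, psi=0.5):
    s, m = normalize(lengths), normalize(motion)        # step 40: editing factor inputs
    a, d = normalize(audio), normalize(dispersion)      # step 70: perception factor inputs
    p1 = [alpha * si + beta * mi for si, mi in zip(s, m)]
    p2 = [gamma * ai + lam * di for ai, di in zip(a, d)]
    return [phi * x + psi * y for x, y in zip(p1, p2)]  # step 80: M(n)

def is_action_scene(m_values, th1, th2=2):
    # step 90: count shots in the scene whose M(n) exceeds th1
    return sum(v > th1 for v in m_values) > th2
```

A usage example would compute M = plot_model(...) over all shots and then apply is_action_scene to the slice of M belonging to each scene.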
In the above technical scheme, in step 30), computing the average motion intensity of a shot specifically comprises the following steps:
Step 31): compute the energy of the motion vector of each macroblock in all P frames of the shot;
Step 32): build a template that assigns different weights to the foreground and background macroblocks of the image;
Step 33): compute the motion intensity of each P frame of the shot according to the weights assigned in step 32);
Step 34): divide the sum of the motion intensities of all P frames of the shot by the number of P frames, obtaining the shot average motion intensity.
In step 31), the energy of the motion vector of a macroblock in a P frame is computed as:
E(i,j) = x(i,j)^2 + y(i,j)^2
where (x(i,j), y(i,j)) is the motion vector of the macroblock and i, j denote the position of the macroblock.
Step 32) specifically comprises the following operations:
Step 32-1): take the left and right edges of the P-frame image as its background region, and compute the mean of the motion-vector energies within the background, obtaining the background motion intensity of the P frame;
Step 32-2): compute the mean μ and the variance σ of the background motion intensities of all P frames of the shot, and obtain the template threshold Th from the result, where the threshold is computed as:
Th = μ + a*σ
where a is an empirical value set to 3;
Step 32-3): using the threshold Th obtained in step 32-2), build the template, which assigns different weights to the different macroblocks of the image:
Weight(x(i,j), y(i,j)) = 2 if E(i,j) > Th, and 1 otherwise
where Weight(x(i,j), y(i,j)) is the weight assigned to the corresponding macroblock.
In step 33), computing the motion intensity of a P frame means summing, over all macroblocks of the frame, the product of the motion-vector energy of each macroblock and its corresponding weight:
MI(P_w) = Σ_{i=1..M} Σ_{j=1..N} Weight(x(i,j), y(i,j)) * E(i,j)
where M and N denote the numbers of macroblocks in the horizontal and vertical directions, respectively, and P_w denotes the w-th P frame of the shot.
In the above technical scheme, in step 40), the shot length and the shot average motion intensity are normalized before the film-editing factor is computed.
In the above technical scheme, the film-editing factor is computed as the product of the shot length and α plus the product of the shot average motion intensity and β, where α and β both take the value 0.5.
Step 60) specifically comprises the following operations:
Step 61): compute the directions of the motion vectors of the macroblocks whose weight in the template built in step 32) is "2";
Step 62): divide the two-dimensional plane into the four subspaces [-90, 0), [0, 90), [90, 180), [180, 270), and compute the four-bin direction histogram of each frame, denoted H[i] (i = 1, 2, 3, 4);
Step 63): compute the motion dispersion of each frame, denoted MD, from the direction histogram;
Step 64): from the results of step 63), compute the shot average motion dispersion MD_AVE as:
MD_AVE = (1/Q) Σ_{w=1..Q} MD(P_w)
where Q denotes the number of P frames in the shot.
In the above technical scheme, in step 70), the shot average audio energy and the shot average motion dispersion are normalized before the human perception factor is computed.
In the above technical scheme, the human perception factor is computed as the product of the shot average audio energy and γ plus the product of the shot average motion dispersion and λ, where γ and λ both take the value 0.5.
In the above technical scheme, step 80) comprises the following steps:
Step 81): linearly fuse the film-editing factor and the human perception factor to build the story-line development model; the model is the product of the film-editing factor and φ plus the product of the human perception factor and ψ, where φ and ψ both take the value 0.5;
Step 82): as the shot number n varies continuously, the story-line development model built in step 81) forms the story-line development curve.
The story-line development curve is smoothed with a Gaussian template.
The coefficients of the Gaussian template are: window size 9, standard deviation 1.5.
In the above technical scheme, step 90) comprises: first, using the scene information obtained from the video preprocessing of step 10), counting the number N of values of the story-line development curve within a scene that exceed a first threshold; then judging whether N is greater than a second threshold: if it is, the scene is an action scene; otherwise it is not.
The first threshold is one third of the maximum of all local peaks of the story-line development curve over the shots, and the second threshold is 2.
The advantage of the invention is that it jointly considers visual and auditory factors from the two angles of film editing and human perception to build a story-line development model that simulates the development and change of the plot, thereby achieving accurate detection of action scenes in movies.
Description of drawings
Fig. 1 is the flow chart of the method of the present invention for detecting movie action scenes based on story-line development model analysis;
Fig. 2 is a schematic diagram of the image background region used when building the template in an embodiment of the invention;
Fig. 3(a) is a schematic diagram of a story-line development curve;
Fig. 3(b) is a schematic diagram of the story-line development curve after Gaussian smoothing.
Embodiment
The present invention is described in further detail below with reference to the drawings and a specific embodiment:
As shown in Fig. 1, the method of the present invention for detecting movie action scenes based on story-line development model analysis specifically comprises the following steps:
Step 10: preprocess the movie video. A complete movie video contains video frames, which carry the images, and audio frames, which carry the sound; this step mainly preprocesses the video-frame part of the movie. The preprocessing specifically comprises the following steps:
Step 11: segment the video into shots, where a shot in the present invention is a sequence of consecutive video frames with similar content;
Step 12: extract the key frames of each shot, where a key frame is the video frame that best represents the features of the shot; each shot contains at least one key frame;
Step 13: segment the movie into scenes, where a scene is a group of consecutive shots with similar content;
Step 14: extract the macroblock motion vectors of the video images.
The preprocessing operations of steps 11 to 14 all belong to mature prior art. Steps 11-13 are described in detail in reference 1: Yueting Zhuang, Yong Rui, Thomas S. Huang et al., "Adaptive key frame extraction using unsupervised clustering", ICIP 1998, and Zeeshan Rasheed, Mubarak Shah, "Detection and Representation of Scenes in Videos", IEEE Transactions on Multimedia, Vol. 7, No. 6, December 2005. Step 14 can be realized by extracting the motion vectors from the MPEG compressed domain.
Step 20: count the number of video frames contained in each shot, thereby obtaining the shot length.
Step 30: compute the average motion intensity of each shot. The shot average motion intensity is computed as follows:
Step 31: compute the energy of the motion vector of each macroblock in all P frames of the shot. In the MPEG compressed domain, a P frame is a forward-predicted frame. The motion-vector energy of a macroblock in a P frame is computed as:
E(i,j) = x(i,j)^2 + y(i,j)^2
where (x(i,j), y(i,j)) is the motion vector of the macroblock and i, j denote the position of the macroblock.
Step 32: build a template that assigns different weights to foreground and background macroblocks, as follows:
Step 32-1: compute the background motion intensity of each P frame. As shown in Fig. 2, the shaded region of the figure, denoted F, is taken as the background part of the P frame; computing the background motion intensity then means computing the mean of the motion-vector energies over the shaded region F, denoted BE_MVAVE(P_k) and computed as:
BE_MVAVE(P_k) = mean of E(i,j) over the macroblocks of region F
Step 32-2: compute the mean μ and the variance σ of the background motion intensities of all P frames of the video, and obtain the template threshold from the result, where the threshold is computed as:
Th = μ + a*σ
where a is an empirical value set to 3.
Step 32-3: using the threshold Th obtained in step 32-2, build the following template:
Weight(x(i,j), y(i,j)) = 2 if E(i,j) > Th, and 1 otherwise
where Weight(x(i,j), y(i,j)) is the weight assigned to the corresponding macroblock.
Step 33: compute the motion intensity of each P frame of the shot according to the weights assigned in step 32. Suppose a given P frame is the w-th P frame of the shot; its motion intensity is then computed as:
MI(P_w) = Σ_{i=1..M} Σ_{j=1..N} Weight(x(i,j), y(i,j)) * E(i,j)
where M and N denote the numbers of macroblocks in the horizontal and vertical directions, respectively.
Step 34: compute the shot average motion intensity from the results of step 33, i.e., divide the sum of the motion intensities of all P frames of the shot by the number of P frames:
MI_AVE = (1/Q) Σ_{w=1..Q} MI(P_w)
where Q denotes the number of P frames in the shot.
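Steps 31-34 above can be sketched as follows, assuming each P frame is given as a grid of macroblock motion vectors (x, y); the function names are illustrative assumptions, and the threshold Th is computed from background-region energies as in step 32-2.

```python
# Sketch of steps 31-34: motion-vector energy, foreground/background
# weighting (weight 2 above Th, weight 1 otherwise), per-frame motion
# intensity, and the shot average over its Q P frames.
import statistics

def mv_energy(vx, vy):
    return vx * vx + vy * vy                       # step 31: motion-vector energy

def frame_motion_intensity(frame_mvs, threshold):
    # steps 32-33: sum weight * energy over all macroblocks of the frame
    total = 0.0
    for row in frame_mvs:
        for (vx, vy) in row:
            e = mv_energy(vx, vy)
            total += (2 if e > threshold else 1) * e
    return total

def shot_mean_motion_intensity(p_frames, background_energies, a=3.0):
    # step 32-2: Th = mu + a * sigma over the background-region energies
    mu = statistics.mean(background_energies)
    sigma = statistics.pstdev(background_energies)
    th = mu + a * sigma
    # step 34: average the frame intensities over the Q P frames of the shot
    return sum(frame_motion_intensity(f, th) for f in p_frames) / len(p_frames)
```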
Step 40: normalize the shot length obtained in step 20 and the shot average motion intensity obtained in step 30, and then build the film-editing factor from the two normalized features. Let s(n) denote the shot length, m(n) the shot average motion intensity, n the shot number, and P1(n) the film-editing factor; the factor is then computed as:
P1(n) = α*s(n) + β*m(n)
α = β = 0.5
The resulting film-editing factor represents the influence of editing technique on the detection of movie action scenes.
Step 50: extract the audio frames from the movie video, compute the short-time audio energy of each audio frame, and compute the shot average audio energy as the mean of the short-time audio energies within the shot.
The short-time audio energy in this step is the sum of the energies of all samples of an audio short-time frame; its computation is documented in reference 2: Bai Liang, Hu Yaali, "Feature analysis and extraction for audio automatic classification", Proc. of IEEE International Conference on Systems, Man and Cybernetics, vol. 1, pp. 767-772, 2005.
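The short-time energy computation of step 50 can be sketched as follows, taking the energy of a frame to be the sum of its squared samples as described above; the frame length and sample values are illustrative.

```python
# Sketch of step 50: short-time energy per audio frame, then the
# shot-level feature as the mean over the frames of the shot.

def short_time_energy(samples):
    # sum of squared sample values of one short-time frame
    return sum(s * s for s in samples)

def shot_average_audio_energy(audio, frame_len):
    # split the shot's audio samples into frames and average their energies
    frames = [audio[i:i + frame_len] for i in range(0, len(audio), frame_len)]
    energies = [short_time_energy(f) for f in frames]
    return sum(energies) / len(energies)
```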
Step 60: compute the shot average motion dispersion, which represents the average complexity of the changes in the visual content of the shot. The computation comprises the following steps:
Step 61: compute the directions of the motion vectors of the macroblocks whose weight in the template built in step 32 is "2"; the direction is computed as:
θ = arctan(y(i,j) / x(i,j)), adjusted to the quadrant of the motion vector
where θ represents the direction of the motion vector.
Step 62: divide the two-dimensional plane into the four subspaces [-90, 0), [0, 90), [90, 180), [180, 270), and compute the four-bin direction histogram of each frame, denoted H[i] (i = 1, 2, 3, 4). Computing the four-bin direction histogram means computing, for each subspace, the ratio of the number of motion vectors whose angle falls within that subspace to the total number of motion vectors.
Step 63: compute the motion dispersion of each frame, denoted MD, from the direction histogram.
Step 64: from the results of step 63, compute the shot average motion dispersion MD_AVE as:
MD_AVE = (1/Q) Σ_{w=1..Q} MD(P_w)
where Q denotes the number of P frames in the shot.
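The direction histogram of steps 61-62 can be sketched as follows, assuming the foreground motion vectors are given as (x, y) pairs; angles are binned into the four subspaces [-90, 0), [0, 90), [90, 180), [180, 270), and each bin holds the fraction of vectors falling in it.

```python
# Sketch of steps 61-62: quadrant-aware direction of each motion vector,
# binned into four 90-degree subspaces as a normalized histogram.
import math

def direction_histogram(motion_vectors):
    bins = [0, 0, 0, 0]
    edges = [(-90, 0), (0, 90), (90, 180), (180, 270)]
    for vx, vy in motion_vectors:
        theta = math.degrees(math.atan2(vy, vx))   # in (-180, 180]
        if theta < -90:
            theta += 360                           # map into [-90, 270)
        for i, (lo, hi) in enumerate(edges):
            if lo <= theta < hi:
                bins[i] += 1
                break
    n = len(motion_vectors)
    return [b / n for b in bins] if n else bins
```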
Step 70: normalize the shot average audio energy obtained in step 50 and the shot average motion dispersion obtained in step 60, and then build the human perception factor from the two normalized features. Let a(n) denote the shot average audio energy, d(n) the shot average motion dispersion, n the shot number, and P2(n) the human perception factor; the factor is then computed as:
P2(n) = γ*a(n) + λ*d(n)
γ = λ = 0.5
The resulting human perception factor represents the influence of human attention on the detection of movie action scenes.
Step 80: build the story-line development model from the film-editing factor obtained in step 40 and the human perception factor obtained in step 70. The story-line development model characterizes the importance of the content of each structural unit of the video and the strength with which it attracts human attention. The model is built as follows:
Step 81: linearly fuse the film-editing factor and the human perception factor to build the story-line development model, computed as:
M(n) = φ*P1(n) + ψ*P2(n)
φ = ψ = 0.5
Step 82: as the shot number n varies continuously, the story-line development model forms the story-line development curve. The curve reflects the importance of the different shots within the whole video: it compares the importance of the units of the video in temporal order, exhibits the differences in importance between shots, and shows the development and change of the plot. Fig. 3(a) is an example of a story-line development curve.
Step 83: smooth the story-line development curve with a Gaussian template, whose coefficients are: window size 9, standard deviation 1.5. Fig. 3(b) is the curve of Fig. 3(a) after Gaussian smoothing.
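The Gaussian smoothing of step 83 can be sketched as follows, with window size 9 and standard deviation 1.5 as stated; replicating the edge values at the borders is an assumption of this sketch, which is implemented directly rather than with a signal-processing library.

```python
# Sketch of step 83: smooth the story-line curve M(n) with a normalized
# Gaussian window (size 9, sigma 1.5), replicating values at the edges.
import math

def gaussian_kernel(size=9, sigma=1.5):
    half = size // 2
    w = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-half, half + 1)]
    s = sum(w)
    return [x / s for x in w]          # normalize so the weights sum to 1

def smooth(values, size=9, sigma=1.5):
    k = gaussian_kernel(size, sigma)
    half = size // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for j, kj in enumerate(k):
            idx = min(max(i + j - half, 0), len(values) - 1)  # replicate edges
            acc += kj * values[idx]
        out.append(acc)
    return out
```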
Step 90: detect the action scenes in the movie. First, using the scene segmentation information obtained in the video preprocessing, count the number N of values of M(n) within each scene that exceed a threshold Th1; then judge whether N is greater than a threshold Th2: if it is, the scene is an action scene; otherwise it is a non-action scene. Th1 and Th2 are determined by experiment; in the present embodiment, Th1 is one third of the maximum of all local peaks of the M(n) values over the shots, and Th2 is 2.
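The detection rule of step 90 can be sketched as follows, assuming each scene is given as the list of smoothed M(n) values of its shots; treating the concatenation of the scene curves as the full M(n) curve when locating local peaks is a simplification of this sketch.

```python
# Sketch of step 90: Th1 is one third of the maximum local peak of the
# M(n) curve, Th2 is fixed at 2; a scene is an action scene when more
# than Th2 of its M(n) values exceed Th1.

def local_peaks(values):
    return [v for i, v in enumerate(values)
            if 0 < i < len(values) - 1 and values[i - 1] < v > values[i + 1]]

def detect_action_scenes(scene_curves):
    all_values = [v for curve in scene_curves for v in curve]
    peaks = local_peaks(all_values)
    th1 = max(peaks) / 3 if peaks else 0.0
    th2 = 2
    return [sum(v > th1 for v in curve) > th2 for curve in scene_curves]
```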