CN105005772A - Video scene detection method - Google Patents

Video scene detection method

Info

Publication number
CN105005772A
CN105005772A
Authority
CN
China
Prior art keywords
video
formula
vector
represent
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510427821.8A
Other languages
Chinese (zh)
Other versions
CN105005772B (en)
Inventor
童云海
杨亚鸣
丁宇辰
郜渊源
蒋云飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201510427821.8A
Publication of CN105005772A
Application granted
Publication of CN105005772B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a video scene detection method in which a computer, rather than manual inspection, examines video data and recognizes the scenes it contains. The method comprises an offline process for training a discrimination model and a video scene detection process. The offline training process includes: extracting features, comprising semantic features and spatio-temporal features, from each video in a training sample set; annotating the feature vectors with category labels to obtain a sample set; and iteratively training on the sample set within a multiple kernel learning framework to obtain an offline model. The detection process includes: connecting to a surveillance video source; sampling the video to obtain a short clip; extracting the same features from the clip; and loading the offline model to classify the features and obtain a detection result. The method enables a computer to recognize scenes in video in place of manual inspection, improves detection efficiency, lowers cost, and facilitates data storage and retrieval.

Description

Video scene detection method
Technical field
The present invention relates to video information analysis technology, and in particular to a video scene detection method.
Background technology
Video surveillance systems are now ubiquitous and play an irreplaceable role in maintaining public order, solving criminal cases, and similar tasks. In the field of video surveillance, recognizing abnormal scenes is very important: accurately detecting scenes that endanger public safety, such as group brawls, or detecting illegal street vending, is significant for social administration and city management.
A video surveillance system comprises front-end cameras, transmission equipment, and a monitoring platform. The cameras capture video signals, which are compressed and sent through the transmission equipment to the platform, where the data is stored and analyzed for incidents such as anomalies. Surveillance video is characterized by large data volume and high information redundancy; monitoring and processing such video manually is not only time-consuming and laborious, its accuracy also cannot be guaranteed.
With the development of computer vision, computers can recognize objects such as people, animals, and vehicles in images, and are gradually taking over some simple tasks from human operators. However, scene recognition in the prior art mainly targets static images. Compared with a static image, video has a time dimension and contains background change information and the motion information of target objects, and is therefore more complex to process. At present, video data is mostly monitored and processed manually to find the abnormal scenes it contains, which is time-consuming, costly, and inefficient; accuracy cannot be guaranteed, and it is difficult to store the analysis results efficiently for later retrieval and reuse.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a video scene detection method that replaces manual inspection of video data with a computer to find the abnormal scenes therein, which can greatly improve detection efficiency, reduce cost, and facilitate later storage and retrieval of the data.
The technical solution provided by the invention is as follows:
A video scene detection method in which a computer, instead of manual inspection, examines video data and identifies the scenes it contains; the detection method comprises an offline discrimination-model training process and a video scene detection process:
1) the offline discrimination-model training process performs the following operations:
11) prepare a training video sample set;
12) extract features, in vector form, from each video in the training sample set, comprising semantic feature extraction and spatio-temporal feature extraction;
13) annotate the feature vectors with category labels to obtain a sample set, each sample containing a semantic feature vector, a spatio-temporal feature vector, and the corresponding category label;
14) iteratively train on the sample set of step 13) within a multiple kernel learning framework to obtain an offline model;
2) the video scene detection process performs the following operations:
21) connect to the surveillance video source to be examined;
22) set a sampling mode and sample the video to obtain a short clip; this short clip is the detection target;
23) extract features from the short clip of step 22), comprising a semantic feature vector and a spatio-temporal feature vector, by the same extraction method as step 12) of the training process;
24) load the offline model into the multiple kernel learning framework, classify the features, determine whether the given scene is present, and obtain the detection result.
In a further aspect of the above video scene detection method, the training video samples of step 11) comprise two classes: a set of videos that contain the street-vending scene and a set of videos that do not.
In step 12), features are extracted from each video in the training sample set by a semantic feature extraction process and a spatio-temporal feature extraction process.
The semantic feature extraction process specifically comprises the following steps:
121a) for each video, compute the score of every frame with a key-frame extraction method and choose the m highest-scoring frames as key frames; the score is computed as follows:
score(f_k) = α * (Sdiff(f_k) - Min_Sdiff) / (Max_Sdiff - Min_Sdiff) + β * (MoValue(f_k) - Min_MoValue) / (Max_MoValue - Min_MoValue) (Formula 1)
Sdiff(f_k) = Σ_{i,j} |I_k(i,j) - I_{k-1}(i,j)| (Formula 2)
MoValue(f_k) = Σ_{i=1}^{N_k} ((v_k^x(i))^2 + (v_k^y(i))^2) (Formula 3)
In Formulas 1 to 3, f_k denotes the k-th frame of the video sequence; score(f_k) is its score; Sdiff(f_k) is the difference between this frame and the previous frame; α and β are weights; Max_Sdiff and Min_Sdiff are the maximum and minimum differences between adjacent frames; v_k^x(i) and v_k^y(i) are the horizontal and vertical optical-flow components of pixel i in frame k; N_k is the number of pixels in frame k; MoValue(f_k) is the optical-flow intensity of frame k; Max_MoValue and Min_MoValue are the maximum and minimum optical-flow intensities over all frames;
121b) for each of the m chosen frames, extract picture-level semantic features with the Dartmouth Classeme feature extraction method to obtain the frame's semantic feature vector;
121c) concatenate the m real-valued feature vectors obtained from the m frames into one m*2659-dimensional vector, which serves as the semantic feature vector of the video.
In an embodiment of the invention, the m frames of step 121a) are three frames. The spatio-temporal feature extraction process specifically comprises the following steps:
122a) extract MoSIFT features from each training video with the MoSIFT feature extraction method;
122b) generate a visual dictionary from all MoSIFT features in the video set;
122c) use the visual dictionary to Fisher-vector-encode each video, obtaining a 2*D*K-dimensional Fisher vector;
122d) apply principal component analysis (PCA) to the Fisher vector to obtain a low-dimensional vector, which is the spatio-temporal feature vector of the video.
In step 122b), the visual dictionary is specifically generated with a Gaussian mixture model.
In a further aspect of the above video scene detection method, the multiple kernel learning framework of step 14) is that of the Shogun toolkit, and kernel functions are combined by linear weighting, expressed as Formula 9:
K(x_i, x_j) = Σ_{k=1}^{S} β_k K_k(x_i, x_j) (Formula 9)
In Formula 9, K_k(x_i, x_j) is the k-th kernel function; β_k is the weight of the k-th kernel; x_i and x_j are the features of video samples i and j corresponding to that kernel;
Two polynomial kernels are chosen as kernel functions, corresponding to the semantic features and the spatio-temporal features respectively; the polynomial kernel is given by Formula 10:
K(x, x_i) = ((x · x_i) + 1)^d (Formula 10)
In Formula 10, x and x_i are vectors of the input space; d is the degree;
The constrained optimization problem of multiple kernel learning is expressed as:
min (1/2) (Σ_{k=1}^{S} ||w_k||_2 / β_k)^2 + C Σ_{i=1}^{N} ξ_i
s.t. y_i (Σ_{k=1}^{S} w_k · φ_k(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0 (Formula 11)
In Formula 11, N is the number of input vectors; ξ_i is the slack variable of vector i; S is the number of kernels; w_k is the weight vector of the separating hyperplane corresponding to the k-th kernel; C is the penalty factor; in the constraint, y_i is the class of the vector (1 or -1), φ_k is the high-dimensional mapping corresponding to the k-th kernel, and b is the offset.
The multiple kernel learning model is specifically solved by the Lagrangian transformation, yielding the objective function:
min_β max_α J(α, β) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j Σ_{k=1}^{S} β_k K_k(x_i, x_j)
s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^{N} α_i y_i = 0,
β ∈ Δ_p,  Δ_p = { β ∈ R_+^S : ||β||_p ≤ 1 } (Formula 12)
In Formula 12, N is the number of input vectors; x_i and x_j are input-space vectors; α_i and α_j are the corresponding weights, obtained by learning; y_i and y_j are the corresponding classes; S is the number of kernels; β_k is the weight of the k-th kernel, also obtained by learning; in the constraints, C is the penalty factor and p is the regularization norm.
In an embodiment of the invention, the degree d of the polynomial kernel in Formula 10 is 2.
The video sampling of step 22) is performed either per time interval or per frame interval: time-based sampling takes a 10-second sample every t seconds to form a short clip; frame-based sampling takes one frame every k frames until 240 frames are collected to form a short clip; this short clip is the detection target.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides a video scene detection method that replaces manual inspection of video data with a computer. The method extracts video semantic features based on an external knowledge base, uses a key-frame extraction algorithm that considers both background and motion information, and solves the video scene detection problem by multiple kernel learning. It comprises an offline discrimination-model training process and a video scene detection process, and by identifying the scenes in a video it can find the abnormal scenes therein. The technical solution greatly improves detection efficiency, reduces cost, and facilitates later storage and retrieval of the data.
Brief description of the drawings
Fig. 1 is a flow diagram of the training process by which the invention obtains the offline discrimination model.
Fig. 2 is a flow diagram of the video scene detection process provided by the invention.
Detailed description of the embodiments
The invention is further described below through embodiments in conjunction with the drawings, without limiting its scope in any way.
The invention provides a video scene detection method in which a computer, instead of manual inspection, examines video data and identifies the scenes it contains; the detection method comprises an offline discrimination-model training process and a video scene detection process:
1) the offline discrimination-model training process performs the following operations:
11) prepare a training video sample set;
12) extract features, in vector form, from each video in the training sample set, comprising semantic feature extraction and spatio-temporal feature extraction;
13) annotate the feature vectors with category labels to obtain a sample set, each sample containing a semantic feature vector, a spatio-temporal feature vector, and the corresponding category label;
14) iteratively train on the sample set of step 13) within a multiple kernel learning framework to obtain an offline model;
2) the video scene detection process performs the following operations:
21) connect to the surveillance video source to be examined;
22) set a sampling mode and sample the video to obtain a short clip; this short clip is the detection target;
23) extract features from the short clip of step 22), comprising a semantic feature vector and a spatio-temporal feature vector, by the same extraction method as step 12) of the training process;
24) load the offline model into the multiple kernel learning framework, classify the features, determine whether the given scene is present, and obtain the detection result.
This embodiment uses surveillance video to detect whether a street-vending scene is present in the video. The detection method comprises an offline discrimination-model training process and a video scene detection process.
1) Offline discrimination-model training process: train the discrimination model offline using the training video samples.
11) Prepare the training video samples;
In this embodiment, the training video samples comprise two classes: a set of videos containing the street-vending scene and a set of videos not containing it;
12) Extract features, comprising semantic features and spatio-temporal features, from each video in the training sample set;
The features characterizing a video comprise semantic features and spatio-temporal features, in vector form; two feature vectors are extracted for each video: a semantic feature vector, characterizing the semantics, and a spatio-temporal feature vector (spanning the space and time dimensions), characterizing the spatio-temporal properties.
121) The semantic feature extraction process specifically comprises:
121a) for each video, compute the score of every frame with the key-frame extraction method and choose the m highest-scoring frames as key frames; the score is computed as follows:
score(f_k) = α * (Sdiff(f_k) - Min_Sdiff) / (Max_Sdiff - Min_Sdiff) + β * (MoValue(f_k) - Min_MoValue) / (Max_MoValue - Min_MoValue) (Formula 1)
Sdiff(f_k) = Σ_{i,j} |I_k(i,j) - I_{k-1}(i,j)| (Formula 2)
MoValue(f_k) = Σ_{i=1}^{N_k} ((v_k^x(i))^2 + (v_k^y(i))^2) (Formula 3)
In Formulas 1 to 3, f_k denotes the k-th frame of the video sequence; score(f_k) is its score; Sdiff(f_k) is the difference between this frame and the previous frame (the difference in pixel values between the two frames; for an RGB color image, the average of the R, G, and B channel differences); α and β are weights; Max_Sdiff and Min_Sdiff are the maximum and minimum differences between adjacent frames; v_k^x(i) and v_k^y(i) are the horizontal and vertical optical-flow components of pixel i in frame k; N_k is the number of pixels in frame k; MoValue(f_k) is the optical-flow intensity of frame k; Max_MoValue and Min_MoValue are the maximum and minimum optical-flow intensities over all frames.
This key-frame extraction method selects key frames by considering both scene-change information and motion information. The present embodiment sets m = 3, i.e., the three highest-scoring frames are chosen as key frames; a minimal sketch of the scoring follows.
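The sketch below illustrates the scoring of Formulas 1 to 3 under stated assumptions: OpenCV is used for decoding and for Farneback dense optical flow (the patent does not name a flow algorithm), the weights alpha and beta are hypothetical, and the frame difference is computed on grayscale rather than averaged RGB channels for brevity.

```python
import cv2
import numpy as np

def keyframe_scores(video_path, alpha=0.5, beta=0.5):
    """Score every frame by Formulas 1-3; alpha/beta are hypothetical weights."""
    cap = cv2.VideoCapture(video_path)
    prev, sdiff, movalue = None, [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Formula 2: sum of absolute pixel differences with the previous frame
            sdiff.append(np.abs(gray.astype(np.int32) - prev.astype(np.int32)).sum())
            # Formula 3: optical-flow intensity = sum of squared flow components
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            movalue.append((flow ** 2).sum())
        prev = gray
    cap.release()
    sdiff, movalue = np.array(sdiff), np.array(movalue)
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    # Formula 1: weighted sum of the two min-max-normalized terms
    return alpha * norm(sdiff) + beta * norm(movalue)

# m = 3: take the three highest-scoring frames as key frames, e.g.
# keyframes = np.argsort(keyframe_scores("clip.avi"))[-3:]
```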
121b) for each of the m chosen frames, extract picture-level semantic features with the Dartmouth Classeme feature extraction method to obtain the frame's semantic feature vector;
The Classeme feature extraction method is a semantic extraction tool based on an external knowledge base; it produces a descriptor that expresses image attributes. The Classeme attribute descriptor covers 2659 image attributes (i.e., 2659 dimensions) corresponding to 2659 concepts, including objects (e.g., basketball, bicycle), persons (e.g., football player, boy), and places (e.g., swimming pool, outdoors). Each frame thus yields a 2659-dimensional real-valued vector.
121c) concatenate the m real-valued vectors obtained from the m frames into one m*2659-dimensional vector, which serves as the semantic feature vector of the video;
122) The spatio-temporal feature extraction process specifically comprises:
122a) extract MoSIFT features from each training video;
The training videos include both videos containing the street-vending scene and videos not containing it. The present embodiment uses the MoSIFT feature extraction method; the document (M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," CMU-CS-09-161, Carnegie Mellon University, 2009) describes the process of extracting MoSIFT features. A MoSIFT feature is a spatio-temporal feature that considers both the spatial and the temporal dimension; each generated feature has 256 dimensions, a count denoted D.
Extracting MoSIFT features from a training video comprises two steps: first detecting interest points, then building descriptors for them.
Interest-point detection comprises finding local extrema as candidate interest points and deciding whether each candidate qualifies as an interest point:
A multi-scale difference-of-Gaussians pyramid is built and its local extrema are taken as candidate interest points; the difference of Gaussians is computed as:
D(x, y, kδ) = L(x, y, kδ) - L(x, y, (k-1)δ) (Formula 4)
In Formula 4, x and y are pixel coordinates in the image; kδ is the standard deviation of the Gaussian at pyramid layer k; L(x, y, kδ) is the convolution of that Gaussian with the image; L(x, y, (k-1)δ) is the convolution at layer k-1; D(x, y, kδ) is the difference result at layer k.
Optical-flow analysis then judges whether each candidate point carries sufficient motion information, i.e., whether its motion intensity is large enough, to decide whether it is an interest point.
Once the interest points are obtained, the MoSIFT feature extraction method combines a SIFT (scale-invariant feature transform) descriptor with an optical-flow descriptor into a 256-dimensional description of the point. SIFT is a classical image feature with scale invariance that describes an interest point with a 128-dimensional real-valued vector; the optical flow is described in a similar manner, and the two are combined into a single 256-dimensional real-valued vector. A sketch of the interest-point test follows.
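The sketch below illustrates Formula 4 and the optical-flow gate on candidate points, assuming OpenCV; the Gaussian scales, flow algorithm, and motion threshold are hypothetical choices, and the search for local extrema across the pyramid is omitted.

```python
import cv2
import numpy as np

def dog_layer(img, k_sigma, delta):
    """Formula 4: difference of two adjacent Gaussian-blurred layers (k_sigma > delta)."""
    l_k  = cv2.GaussianBlur(img, (0, 0), k_sigma)
    l_k1 = cv2.GaussianBlur(img, (0, 0), k_sigma - delta)
    return l_k.astype(np.float32) - l_k1.astype(np.float32)

def motion_gate(prev_gray, gray, candidates, thresh=1.0):
    """Keep only candidate extrema whose optical-flow magnitude is large enough."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)          # per-pixel flow magnitude
    return [(x, y) for (x, y) in candidates if mag[y, x] > thresh]
```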
122b) generate a visual dictionary from all MoSIFT features in the video set;
The method generates the visual dictionary with a Gaussian mixture model, where K denotes the size of the dictionary. The main idea of the mixture model is to assume that the distribution of MoSIFT feature points is a linear superposition of K Gaussian distributions; the method takes K = 64. The mathematical form of the mixture model is:
P(y|θ) = Σ_{k=1}^{K} α_k φ(y|θ_k) (Formula 5)
In Formula 5, P(y|θ) is the probability distribution of the MoSIFT features; α_k is the weight of each Gaussian component; K is the size of the visual dictionary; y is a MoSIFT feature vector; θ is the parameter set of the distribution; θ_k is the parameter set of the k-th Gaussian.
122c) use the visual dictionary to Fisher-vector-encode each video, obtaining a 2*D*K-dimensional Fisher vector, as sketched below;
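The sketch below illustrates steps 122b) and 122c) under stated assumptions: scikit-learn's GaussianMixture stands in for the GMM fit, and the Fisher vector keeps only the gradients with respect to the means and variances (which yields the 2*D*K dimensions), omitting the power and L2 normalization often applied in practice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_dictionary(all_mosift, K=64):
    """Step 122b): fit a K-component diagonal GMM over all MoSIFT descriptors."""
    return GaussianMixture(n_components=K, covariance_type='diag').fit(all_mosift)

def fisher_vector(gmm, X):
    """Step 122c): encode one video's descriptors X (n, D) as a 2*D*K vector."""
    q = gmm.predict_proba(X)                    # (n, K) soft assignments
    n = X.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (X - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        w = np.sqrt(gmm.weights_[k])
        g_mu  = (q[:, k:k+1] * diff).sum(axis=0) / (n * w)                  # D dims
        g_sig = (q[:, k:k+1] * (diff**2 - 1)).sum(axis=0) / (n * w * np.sqrt(2))
        parts.extend([g_mu, g_sig])
    return np.concatenate(parts)                # 2 * 256 * 64 = 32768 dims
```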
122d) apply principal component analysis (PCA) to the Fisher vector to obtain a low-dimensional vector, which is the spatio-temporal feature vector of the video;
The 2*D*K-dimensional Fisher vector above is a 32768-dimensional vector. PCA applies the idea of dimensionality reduction: it transforms many variables into a few synthetic variables, the principal components, which retain most of the information of the original variables. In this method, PCA is applied to the Fisher vectors as follows:
Denote the Fisher-vector dimensionality by p, and let x_i = (x_i1, x_i2, ..., x_ip)^T, i = 1, 2, ..., N, form the feature matrix, where x_ij is the j-th feature value of the i-th sample. The feature matrix is transformed as follows:
Z_ij = (x_ij - x̄_j) / s_j,  i = 1, 2, ..., N;  j = 1, 2, ..., p (Formula 6)
where Z_ij is the entry in row i, column j of the standardized matrix Z; x̄_j and s_j are the mean and standard deviation of column j; N is the number of samples.
The correlation matrix R of Z is then computed:
R = Z^T Z / (N - 1) (Formula 7)
and the characteristic equation of R is solved:
|R - λI_p| = 0 (Formula 8)
In Formula 8, R is the correlation matrix; I_p is the identity matrix; λ is an eigenvalue.
Solving Formula 8 yields p characteristic roots; the method keeps M = 1168 principal components and finally projects the original feature matrix onto the M principal directions to obtain the final spatio-temporal features, as sketched below.
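The following sketch carries out Formulas 6 to 8 directly with NumPy; it is standard PCA on standardized data, so for a 32768-square correlation matrix a truncated SVD would replace the full eigendecomposition in practice.

```python
import numpy as np

def pca_project(X, M=1168):
    """Project the feature matrix X (N, p) onto its top-M principal directions."""
    # Formula 6: Z_ij = (x_ij - mean_j) / s_j  (epsilon guards constant columns)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) + 1e-12)
    # Formula 7: correlation matrix R = Z^T Z / (N - 1)
    R = Z.T @ Z / (X.shape[0] - 1)
    # Formula 8: eigenvalues/eigenvectors of the symmetric matrix R
    eigvals, eigvecs = np.linalg.eigh(R)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:M]]
    return Z @ top                              # (N, M) spatio-temporal features
```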
13) Annotate the feature vectors with category labels to obtain a sample set, each sample containing the two feature vectors and the corresponding category label;
In this embodiment, the category labeling is as follows: videos containing the street-vending scene are labeled 1, denoting positive examples, and videos not containing it are labeled -1, denoting negative examples; this yields a sample set in which each sample contains two feature vectors and the corresponding category label;
14) Iteratively train on the above training sample set within the multiple kernel learning framework;
The invention adopts the multiple kernel learning framework of the Shogun toolkit and combines kernel functions by linear weighting, with the following formula:
K(x_i, x_j) = Σ_{k=1}^{S} β_k K_k(x_i, x_j) (Formula 9)
In Formula 9, K_k(x_i, x_j) is the k-th kernel function; β_k is the weight of the k-th kernel; x_i and x_j are the features of video samples i and j corresponding to that kernel. The method chooses two polynomial kernels: one corresponds to the semantic features and the other to the spatio-temporal features. The polynomial kernel is given by:
K(x, x_i) = ((x · x_i) + 1)^d (Formula 10)
In Formula 10, x and x_i are vectors of the input space; d is the degree, which is 2 in this method. The kernel combination is sketched below.
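The sketch below expresses Formulas 9 and 10 in NumPy; the weights beta are left as free parameters here, whereas MKL training (next) learns them.

```python
import numpy as np

def poly_kernel(X1, X2, d=2):
    """Formula 10: degree-d polynomial kernel between two sets of row vectors."""
    return (X1 @ X2.T + 1.0) ** d

def combined_kernel(sem1, st1, sem2, st2, beta=(0.5, 0.5)):
    """Formula 9: linear combination of the semantic and spatio-temporal kernels."""
    return beta[0] * poly_kernel(sem1, sem2) + beta[1] * poly_kernel(st1, st2)
```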
The constrained optimization problem of multiple kernel learning can be expressed as:
min (1/2) (Σ_{k=1}^{S} ||w_k||_2 / β_k)^2 + C Σ_{i=1}^{N} ξ_i
s.t. y_i (Σ_{k=1}^{S} w_k · φ_k(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0 (Formula 11)
In Formula 11, N is the number of input vectors; ξ_i is the slack variable of vector i; S is the number of kernels; w_k is the weight vector of the separating hyperplane corresponding to the k-th kernel; C is the penalty factor; in the constraint, y_i is the class of the vector (1 or -1), φ_k is the high-dimensional mapping corresponding to the k-th kernel, and b is the offset.
As with an SVM, the multiple kernel learning model adopted by this method can be solved through its Lagrangian dual; the objective of the primal-dual optimization is:
min_β max_α J(α, β) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j Σ_{k=1}^{S} β_k K_k(x_i, x_j)
s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^{N} α_i y_i = 0,
β ∈ Δ_p,  Δ_p = { β ∈ R_+^S : ||β||_p ≤ 1 } (Formula 12)
In Formula 12, N is the number of input vectors; x_i and x_j are input-space vectors; α_i and α_j are the corresponding weights, obtained by learning; y_i and y_j are the corresponding classes; S is the number of kernels; β_k is the weight of the k-th kernel, also obtained by learning; in the constraints, C is the penalty factor and p is the regularization norm. The method sets p = 2 and C = 8; a simplified training sketch follows.
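The patent trains with Shogun's MKL solver, which learns α and β jointly per Formulas 11 and 12. As a simplified stand-in rather than the actual solver, the sketch below fixes β and trains an ordinary SVM on the precomputed combined kernel with scikit-learn, reusing combined_kernel from the previous sketch.

```python
from sklearn.svm import SVC

def train_model(sem, st, y, beta=(0.5, 0.5), C=8.0):
    """Approximate the MKL training step with a fixed-beta precomputed-kernel SVM."""
    K_train = combined_kernel(sem, st, sem, st, beta)   # (N, N) Gram matrix
    clf = SVC(C=C, kernel='precomputed')
    clf.fit(K_train, y)                                 # y holds labels in {+1, -1}
    return clf
```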
15) The offline model is obtained through the multi-kernel training;
The resulting offline model consists of the unknown parameter values obtained by training, chiefly the support-vector samples and their weights, and the kernel functions and their corresponding weights;
2) Video scene detection process
21) connect to the surveillance video source to be examined;
22) set the sampling mode and sample the video to obtain a short clip; this short clip is the detection target;
The sampling modes are time-based and frame-based: time-based sampling takes a 10-second sample every t seconds to form a short clip; frame-based sampling takes one frame every k frames until 240 frames are collected to form a short clip; this short clip is the detection target. The frame-based mode is sketched below.
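A minimal sketch of the frame-based sampling mode, assuming OpenCV for decoding; k is a configuration parameter.

```python
import cv2

def sample_clip_every_k_frames(video_path, k, n_frames=240):
    """Take one frame every k frames until 240 frames form one short clip."""
    cap = cv2.VideoCapture(video_path)
    clip, idx = [], 0
    while len(clip) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % k == 0:
            clip.append(frame)
        idx += 1
    cap.release()
    return clip                 # the short clip used as the detection target
```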
23) extract the semantic features and spatio-temporal features from the above short clip; the extraction procedure is identical to that of the training process;
24) load the offline model into the multiple kernel learning framework, classify the features, determine whether the given scene is present, and obtain the detection result;
The discriminant function is:
f(x) = sign( Σ_{i=1}^{N} α_i y_i Σ_{k=1}^{S} β_k K_k(x_i, x) + b ) (Formula 13)
In Formula 13, all parameters other than x have the same meaning as in the formulas above; x denotes the semantic and spatio-temporal features extracted from the short clip. A discriminant value f(x) of 1 indicates that the video segment contains the given scene; a value of -1 indicates that it does not. A sketch of this step follows.
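Continuing the simplified fixed-β stand-in above, Formula 13 reduces to scoring the new clip's kernel row against the stored training samples; with the scikit-learn substitute this is a prediction on the precomputed kernel.

```python
def detect(clf, sem_train, st_train, sem_new, st_new, beta=(0.5, 0.5)):
    """Formula 13 (approximated): +1 means the given scene is present, -1 absent."""
    K_new = combined_kernel(sem_new, st_new, sem_train, st_train, beta)
    return clf.predict(K_new)   # kernel rows between new clip(s) and training set
```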
It should be noted that the embodiments are published to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what the embodiments disclose; the scope of protection of the invention is defined by the claims.

Claims (10)

1. A video scene detection method in which a computer, instead of manual inspection, examines video data and identifies the scenes it contains; the detection method comprises an offline discrimination-model training process and a video scene detection process:
1) the offline discrimination-model training process performs the following operations:
11) preparing a training video sample set;
12) extracting features, in vector form, from each video in the training sample set, comprising a semantic feature vector and a spatio-temporal feature vector;
13) annotating the feature vectors with category labels to obtain a sample set, each sample containing a semantic feature vector, a spatio-temporal feature vector, and the corresponding category label;
14) iteratively training on the sample set of step 13) within a multiple kernel learning framework to obtain an offline model;
2) the video scene detection process performs the following operations:
21) connecting to the surveillance video source to be examined;
22) setting a sampling mode and sampling the video to obtain a short clip, which is the detection target;
23) extracting features from the short clip of step 22), comprising a semantic feature vector and a spatio-temporal feature vector, by the same extraction method as step 12) of the training process;
24) loading the offline model into the multiple kernel learning framework, classifying the features to determine whether the given scene is present, and obtaining the detection result.
2. The video scene detection method of claim 1, wherein the training video samples of step 11) comprise two classes: a set of videos containing the street-vending scene and a set of videos not containing it.
3. The video scene detection method of claim 1, wherein step 12) extracts features from each video in the training sample set by a semantic feature extraction process and a spatio-temporal feature extraction process.
4. The video scene detection method of claim 3, wherein the semantic feature extraction process specifically comprises the steps of:
121a) for each video, computing the score of every frame with a key-frame extraction method and choosing the m highest-scoring frames as key frames, the score being computed as follows:
score(f_k) = α * (Sdiff(f_k) - Min_Sdiff) / (Max_Sdiff - Min_Sdiff) + β * (MoValue(f_k) - Min_MoValue) / (Max_MoValue - Min_MoValue) (Formula 1)
Sdiff(f_k) = Σ_{i,j} |I_k(i,j) - I_{k-1}(i,j)| (Formula 2)
MoValue(f_k) = Σ_{i=1}^{N_k} ((v_k^x(i))^2 + (v_k^y(i))^2) (Formula 3)
in Formulas 1 to 3, f_k denotes the k-th frame of the video sequence; score(f_k) is its score; Sdiff(f_k) is the difference between this frame and the previous frame; α and β are weights; Max_Sdiff and Min_Sdiff are the maximum and minimum differences between adjacent frames; v_k^x(i) and v_k^y(i) are the horizontal and vertical optical-flow components of pixel i in frame k; N_k is the number of pixels in frame k; MoValue(f_k) is the optical-flow intensity of frame k; Max_MoValue and Min_MoValue are the maximum and minimum optical-flow intensities over all frames;
121b) for each of the m chosen frames, extracting picture-level semantic features with the Dartmouth Classeme feature extraction method to obtain the frame's semantic feature vector;
121c) concatenating the m real-valued feature vectors obtained from the m frames into one m*2659-dimensional vector as the semantic feature vector of the video.
5. The video scene detection method of claim 4, wherein the m frames of step 121a) are three frames.
6. The video scene detection method of claim 3, wherein the spatio-temporal feature extraction process specifically comprises the steps of:
122a) extracting MoSIFT features from each training video with the MoSIFT feature extraction method;
122b) generating a visual dictionary from all MoSIFT features in the video set;
122c) using the visual dictionary to Fisher-vector-encode each video, obtaining a 2*D*K-dimensional Fisher vector;
122d) applying principal component analysis (PCA) to the Fisher vector to obtain a low-dimensional vector, which is the spatio-temporal feature vector of the video.
7. The video scene detection method of claim 6, wherein step 122b) generates the visual dictionary with a Gaussian mixture model.
8. The video scene detection method of claim 1, wherein the multiple kernel learning framework of step 14) is that of the Shogun toolkit, and kernel functions are combined by linear weighting, expressed as Formula 9:
K(x_i, x_j) = Σ_{k=1}^{S} β_k K_k(x_i, x_j) (Formula 9)
in Formula 9, K_k(x_i, x_j) is the k-th kernel function; β_k is the weight of the k-th kernel; x_i and x_j are the features of video samples i and j corresponding to that kernel;
two polynomial kernels are chosen as kernel functions, corresponding to the semantic features and the spatio-temporal features respectively; the polynomial kernel is given by Formula 10:
K(x, x_i) = ((x · x_i) + 1)^d (Formula 10)
in Formula 10, x and x_i are vectors of the input space; d is the degree;
the constrained optimization problem of multiple kernel learning is expressed as:
min (1/2) (Σ_{k=1}^{S} ||w_k||_2 / β_k)^2 + C Σ_{i=1}^{N} ξ_i
s.t. y_i (Σ_{k=1}^{S} w_k · φ_k(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0 (Formula 11)
in Formula 11, N is the number of input vectors; ξ_i is the slack variable of vector i; S is the number of kernels; w_k is the weight vector of the separating hyperplane corresponding to the k-th kernel; C is the penalty factor; in the constraint, y_i is the class of vector i (1 or -1), φ_k is the high-dimensional mapping corresponding to the k-th kernel, and b is the offset;
the multiple kernel learning model is solved by the Lagrangian transformation, yielding the objective function:
min_β max_α J(α, β) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j Σ_{k=1}^{S} β_k K_k(x_i, x_j)
s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^{N} α_i y_i = 0,
β ∈ Δ_p,  Δ_p = { β ∈ R_+^S : ||β||_p ≤ 1 } (Formula 12)
in Formula 12, N is the number of input vectors; x_i and x_j are input-space vectors; α_i and α_j are the corresponding weights, obtained by learning; y_i and y_j are the corresponding classes; S is the number of kernels; β_k is the weight of the k-th kernel, also obtained by learning; in the constraints, C is the penalty factor and p is the regularization norm.
9. The video scene detection method of claim 7, wherein the degree d of the polynomial kernel in Formula 10 is 2.
10. The video scene detection method of claim 1, wherein the video sampling of step 22) is performed either per time interval or per frame interval: time-based sampling takes a 10-second sample every t seconds to form a short clip; frame-based sampling takes one frame every k frames until 240 frames are collected to form a short clip; said short clip serves as the detection target.
CN201510427821.8A 2015-07-20 2015-07-20 Video scene detection method Active CN105005772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510427821.8A CN105005772B (en) 2015-07-20 2015-07-20 Video scene detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510427821.8A CN105005772B (en) 2015-07-20 2015-07-20 Video scene detection method

Publications (2)

Publication Number Publication Date
CN105005772A 2015-10-28
CN105005772B (en) 2018-06-12

Family

ID=54378437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510427821.8A Active CN105005772B (en) 2015-07-20 2015-07-20 Video scene detection method

Country Status (1)

Country Link
CN (1) CN105005772B (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110166744A * 2019-04-28 2019-08-23 南京师范大学 Illegal street-stall monitoring method based on video geo-fencing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Football item detecting system with four-layer structure in sports video and realization method thereof
CN102473291A (en) * 2009-07-20 2012-05-23 汤姆森特许公司 Method for detecting and adapting video processing for far-view scenes in sports video
CN102509084A (en) * 2011-11-18 2012-06-20 中国科学院自动化研究所 Multi-examples-learning-based method for identifying horror video scene
US8489627B1 (en) * 2008-08-28 2013-07-16 Adobe Systems Incorporated Combined semantic description and visual attribute search
CN103679192A (en) * 2013-09-30 2014-03-26 中国人民解放军理工大学 Image scene type discrimination method based on covariance features
CN104184925A (en) * 2014-09-11 2014-12-03 刘鹏 Video scene change detection method


Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844239B * 2016-03-23 2019-03-29 北京邮电大学 Riot and terror video detection method based on CNN and LSTM
CN105844239A (en) * 2016-03-23 2016-08-10 北京邮电大学 Method for detecting riot and terror videos based on CNN and LSTM
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN107273863B (en) * 2017-06-21 2019-07-23 天津师范大学 A kind of scene character recognition method based on semantic stroke pond
CN107273863A (en) * 2017-06-21 2017-10-20 天津师范大学 A kind of scene character recognition method based on semantic stroke pond
CN107239801A (en) * 2017-06-28 2017-10-10 安徽大学 Video attribute represents that learning method and video text describe automatic generation method
CN107239801B (en) * 2017-06-28 2020-07-28 安徽大学 Video attribute representation learning method and video character description automatic generation method
CN109241811A (en) * 2017-07-10 2019-01-18 南京原觉信息科技有限公司 Scene analysis method based on image spiral line and the scene objects monitoring system using this method
CN109241811B (en) * 2017-07-10 2021-04-09 南京原觉信息科技有限公司 Scene analysis method based on image spiral line and scene target monitoring system using same
CN107766838B (en) * 2017-11-08 2021-06-01 央视国际网络无锡有限公司 Video scene switching detection method
CN107766838A (en) * 2017-11-08 2018-03-06 央视国际网络无锡有限公司 A kind of switching detection method of video scene
CN108229336B (en) * 2017-12-13 2021-06-04 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device, program, and medium
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN108197566B (en) * 2017-12-29 2022-03-25 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108197566A (en) * 2017-12-29 2018-06-22 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108647264B (en) * 2018-04-28 2020-10-13 北京邮电大学 Automatic image annotation method and device based on support vector machine
CN108647264A (en) * 2018-04-28 2018-10-12 北京邮电大学 A kind of image automatic annotation method and device based on support vector machines
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN110969066A (en) * 2018-09-30 2020-04-07 北京金山云网络技术有限公司 Live video identification method and device and electronic equipment
CN110969066B (en) * 2018-09-30 2023-10-10 北京金山云网络技术有限公司 Live video identification method and device and electronic equipment
CN109218721A (en) * 2018-11-26 2019-01-15 南京烽火星空通信发展有限公司 A kind of mutation video detecting method compared based on frame
CN110126846A (en) * 2019-05-24 2019-08-16 北京百度网讯科技有限公司 Representation method, device, system and the storage medium of Driving Scene
CN110532990A (en) * 2019-09-04 2019-12-03 上海眼控科技股份有限公司 The recognition methods of turn signal use state, device, computer equipment and storage medium
WO2022012002A1 (en) * 2020-07-15 2022-01-20 Zhejiang Dahua Technology Co., Ltd. Systems and methods for video analysis

Also Published As

Publication number Publication date
CN105005772B (en) 2018-06-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant