CN105005772A - Video scene detection method - Google Patents

Video scene detection method

Info

Publication number
CN105005772A
CN105005772A
Authority
CN
China
Prior art keywords
video
formula
vector
represent
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510427821.8A
Other languages
Chinese (zh)
Other versions
CN105005772B (en)
Inventor
童云海
杨亚鸣
丁宇辰
郜渊源
蒋云飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201510427821.8A
Publication of CN105005772A
Application granted
Publication of CN105005772B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a video scene detection method in which a computer, rather than manual inspection, examines video data and recognizes the scenes it contains. The method comprises an offline process for training a discrimination model and a video scene detection process. The offline training process includes: extracting features, comprising semantic features and spatio-temporal features, from each video in a training sample set; annotating the feature vectors with category labels to obtain a sample set; and iteratively training on the sample set within a multiple kernel learning framework to obtain an offline model. The detection process includes: connecting to a surveillance video source; sampling the video to obtain a short clip; extracting the same features from the clip; and loading the offline model to classify the features and obtain a detection result. The method enables a computer to recognize scenes in video in place of manual inspection, improves detection efficiency, lowers cost, and facilitates data storage and retrieval.

Description

Video scene detection method
Technical field
The present invention relates to video information analysis technology, and in particular to a video scene detection method.
Background technology
Video surveillance systems are now ubiquitous and play an irreplaceable role in maintaining public order, solving criminal cases, and similar tasks. In the field of video surveillance, recognizing abnormal scenes is very important: accurately detecting scenes that endanger public safety, such as group brawls, or detecting illegal street vending, is significant for social administration and city management.
A video surveillance system comprises front-end cameras, transmission equipment, and a monitoring platform. The cameras capture video signals, which are compressed and sent through the transmission equipment to the platform, where the data is stored and analyzed for incidents such as anomalies. Surveillance video is characterized by large data volume and high information redundancy; monitoring and processing such video manually is not only time-consuming and laborious, its accuracy also cannot be guaranteed.
With the development of computer vision, computers can recognize objects such as people, animals, and vehicles in images, and are gradually taking over some simple tasks from human operators. However, scene recognition in the prior art mainly targets static images. Compared with a static image, video has a time dimension and contains background change information and the motion information of target objects, and is therefore more complex to process. At present, video data is mostly monitored and processed manually to find the abnormal scenes it contains, which is time-consuming, costly, and inefficient; accuracy cannot be guaranteed, and it is difficult to store the analysis results efficiently for later retrieval and reuse.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a video scene detection method that replaces manual inspection of video data with a computer to find the abnormal scenes therein, which can greatly improve detection efficiency, reduce cost, and facilitate later storage and retrieval of the data.
The technical solution provided by the invention is as follows:
A video scene detection method in which a computer, instead of manual inspection, examines video data and identifies the scenes it contains; the detection method comprises an offline discrimination-model training process and a video scene detection process:
1) the offline discrimination-model training process performs the following operations:
11) prepare a training video sample set;
12) extract features, in vector form, from each video in the training sample set, comprising semantic feature extraction and spatio-temporal feature extraction;
13) annotate the feature vectors with category labels to obtain a sample set, each sample containing a semantic feature vector, a spatio-temporal feature vector, and the corresponding category label;
14) iteratively train on the sample set of step 13) within a multiple kernel learning framework to obtain an offline model;
2) the video scene detection process performs the following operations:
21) connect to the surveillance video source to be examined;
22) set a sampling mode and sample the video to obtain a short clip; this short clip is the detection target;
23) extract features from the short clip of step 22), comprising a semantic feature vector and a spatio-temporal feature vector, by the same extraction method as step 12) of the training process;
24) load the offline model into the multiple kernel learning framework, classify the features, determine whether the given scene is present, and obtain the detection result.
In a further aspect of the above video scene detection method, the training video samples of step 11) comprise two classes: a set of videos that contain the street-vending scene and a set of videos that do not.
In step 12), features are extracted from each video in the training sample set by a semantic feature extraction process and a spatio-temporal feature extraction process.
The semantic feature extraction process specifically comprises the following steps:
121a) for each video, compute the score of every frame with a key-frame extraction method and choose the m highest-scoring frames as key frames; the score is computed as follows:
score(f_k) = α * (Sdiff(f_k) - Min_Sdiff) / (Max_Sdiff - Min_Sdiff) + β * (MoValue(f_k) - Min_MoValue) / (Max_MoValue - Min_MoValue) (Formula 1)
Sdiff(f_k) = Σ_{i,j} |I_k(i,j) - I_{k-1}(i,j)| (Formula 2)
MoValue(f_k) = Σ_{i=1}^{N_k} ((v_k^x(i))^2 + (v_k^y(i))^2) (Formula 3)
In Formulas 1 to 3, f_k denotes the k-th frame of the video sequence; score(f_k) is its score; Sdiff(f_k) is the difference between this frame and the previous frame; α and β are weights; Max_Sdiff and Min_Sdiff are the maximum and minimum differences between adjacent frames; v_k^x(i) and v_k^y(i) are the horizontal and vertical optical-flow components of pixel i in frame k; N_k is the number of pixels in frame k; MoValue(f_k) is the optical-flow intensity of frame k; Max_MoValue and Min_MoValue are the maximum and minimum optical-flow intensities over all frames;
121b) for each of the m chosen frames, extract picture-level semantic features with the Dartmouth Classeme feature extraction method to obtain the frame's semantic feature vector;
121c) concatenate the m real-valued feature vectors obtained from the m frames into one m*2659-dimensional vector, which serves as the semantic feature vector of the video.
In an embodiment of the invention, the m frames of step 121a) are three frames. The spatio-temporal feature extraction process specifically comprises the following steps:
122a) extract MoSIFT features from each training video with the MoSIFT feature extraction method;
122b) generate a visual dictionary from all MoSIFT features in the video set;
122c) use the visual dictionary to Fisher-vector-encode each video, obtaining a 2*D*K-dimensional Fisher vector;
122d) apply principal component analysis (PCA) to the Fisher vector to obtain a low-dimensional vector, which is the spatio-temporal feature vector of the video.
In step 122b), the visual dictionary is specifically generated with a Gaussian mixture model.
In a further aspect of the above video scene detection method, the multiple kernel learning framework of step 14) is that of the Shogun toolkit, and kernel functions are combined by linear weighting, expressed as Formula 9:
K(x_i, x_j) = Σ_{k=1}^{S} β_k K_k(x_i, x_j) (Formula 9)
In Formula 9, K_k(x_i, x_j) is the k-th kernel function; β_k is the weight of the k-th kernel; x_i and x_j are the features of video samples i and j corresponding to that kernel;
Two polynomial kernels are chosen as kernel functions, corresponding to the semantic features and the spatio-temporal features respectively; the polynomial kernel is given by Formula 10:
K(x, x_i) = ((x · x_i) + 1)^d (Formula 10)
In Formula 10, x and x_i are vectors of the input space; d is the degree;
The constrained optimization problem of multiple kernel learning is expressed as:
min (1/2) (Σ_{k=1}^{S} ||w_k||_2 / β_k)^2 + C Σ_{i=1}^{N} ξ_i
s.t. y_i (Σ_{k=1}^{S} w_k · φ_k(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0 (Formula 11)
In Formula 11, N is the number of input vectors; ξ_i is the slack variable of vector i; S is the number of kernels; w_k is the weight vector of the separating hyperplane corresponding to the k-th kernel; C is the penalty factor; in the constraint, y_i is the class of the vector (1 or -1), φ_k is the high-dimensional mapping corresponding to the k-th kernel, and b is the offset.
The multiple kernel learning model is specifically solved by the Lagrangian transformation, yielding the objective function:
min_β max_α J(α, β) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j Σ_{k=1}^{S} β_k K_k(x_i, x_j)
s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^{N} α_i y_i = 0,
β ∈ Δ_p,  Δ_p = { β ∈ R_+^S : ||β||_p ≤ 1 } (Formula 12)
In Formula 12, N is the number of input vectors; x_i and x_j are input-space vectors; α_i and α_j are the corresponding weights, obtained by learning; y_i and y_j are the corresponding classes; S is the number of kernels; β_k is the weight of the k-th kernel, also obtained by learning; in the constraints, C is the penalty factor and p is the regularization norm.
In an embodiment of the invention, the degree d of the polynomial kernel in Formula 10 is 2.
The video sampling of step 22) is performed either per time interval or per frame interval: time-based sampling takes a 10-second sample every t seconds to form a short clip; frame-based sampling takes one frame every k frames until 240 frames are collected to form a short clip; this short clip is the detection target.
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides a video scene detection method that replaces manual inspection of video data with a computer. The method extracts video semantic features based on an external knowledge base, uses a key-frame extraction algorithm that considers both background and motion information, and solves the video scene detection problem by multiple kernel learning. It comprises an offline discrimination-model training process and a video scene detection process, and by identifying the scenes in a video it can find the abnormal scenes therein. The technical solution greatly improves detection efficiency, reduces cost, and facilitates later storage and retrieval of the data.
Brief description of the drawings
Fig. 1 is a flow diagram of the training process by which the invention obtains the offline discrimination model.
Fig. 2 is a flow diagram of the video scene detection process provided by the invention.
Detailed description of the embodiments
The invention is further described below through embodiments in conjunction with the drawings, without limiting its scope in any way.
The invention provides a video scene detection method in which a computer, instead of manual inspection, examines video data and identifies the scenes it contains; the detection method comprises an offline discrimination-model training process and a video scene detection process:
1) the offline discrimination-model training process performs the following operations:
11) prepare a training video sample set;
12) extract features, in vector form, from each video in the training sample set, comprising semantic feature extraction and spatio-temporal feature extraction;
13) annotate the feature vectors with category labels to obtain a sample set, each sample containing a semantic feature vector, a spatio-temporal feature vector, and the corresponding category label;
14) iteratively train on the sample set of step 13) within a multiple kernel learning framework to obtain an offline model;
2) the video scene detection process performs the following operations:
21) connect to the surveillance video source to be examined;
22) set a sampling mode and sample the video to obtain a short clip; this short clip is the detection target;
23) extract features from the short clip of step 22), comprising a semantic feature vector and a spatio-temporal feature vector, by the same extraction method as step 12) of the training process;
24) load the offline model into the multiple kernel learning framework, classify the features, determine whether the given scene is present, and obtain the detection result.
This embodiment uses surveillance video to detect whether a street-vending scene is present in the video. The detection method comprises an offline discrimination-model training process and a video scene detection process.
1) Offline discrimination-model training process: train the discrimination model offline using the training video samples.
11) Prepare the training video samples;
In this embodiment, the training video samples comprise two classes: a set of videos containing the street-vending scene and a set of videos not containing it;
12) Extract features, comprising semantic features and spatio-temporal features, from each video in the training sample set;
The features characterizing a video comprise semantic features and spatio-temporal features, in vector form; two feature vectors are extracted for each video: a semantic feature vector, characterizing the semantics, and a spatio-temporal feature vector (spanning the space and time dimensions), characterizing the spatio-temporal properties.
121) The semantic feature extraction process specifically comprises:
121a) for each video, compute the score of every frame with the key-frame extraction method and choose the m highest-scoring frames as key frames; the score is computed as follows:
score(f_k) = α * (Sdiff(f_k) - Min_Sdiff) / (Max_Sdiff - Min_Sdiff) + β * (MoValue(f_k) - Min_MoValue) / (Max_MoValue - Min_MoValue) (Formula 1)
Sdiff(f_k) = Σ_{i,j} |I_k(i,j) - I_{k-1}(i,j)| (Formula 2)
MoValue(f_k) = Σ_{i=1}^{N_k} ((v_k^x(i))^2 + (v_k^y(i))^2) (Formula 3)
In Formulas 1 to 3, f_k denotes the k-th frame of the video sequence; score(f_k) is its score; Sdiff(f_k) is the difference between this frame and the previous frame (the difference in pixel values between the two frames; for an RGB color image, the average of the R, G, and B channel differences); α and β are weights; Max_Sdiff and Min_Sdiff are the maximum and minimum differences between adjacent frames; v_k^x(i) and v_k^y(i) are the horizontal and vertical optical-flow components of pixel i in frame k; N_k is the number of pixels in frame k; MoValue(f_k) is the optical-flow intensity of frame k; Max_MoValue and Min_MoValue are the maximum and minimum optical-flow intensities over all frames.
This key-frame extraction method selects key frames by considering both scene-change information and motion information. The present embodiment sets m = 3, i.e., the three highest-scoring frames are chosen as key frames; a minimal sketch of the scoring follows.
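The sketch below illustrates the scoring of Formulas 1 to 3 under stated assumptions: OpenCV is used for decoding and for Farneback dense optical flow (the patent does not name a flow algorithm), the weights alpha and beta are hypothetical, and the frame difference is computed on grayscale rather than averaged RGB channels for brevity.

```python
import cv2
import numpy as np

def keyframe_scores(video_path, alpha=0.5, beta=0.5):
    """Score every frame by Formulas 1-3; alpha/beta are hypothetical weights."""
    cap = cv2.VideoCapture(video_path)
    prev, sdiff, movalue = None, [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Formula 2: sum of absolute pixel differences with the previous frame
            sdiff.append(np.abs(gray.astype(np.int32) - prev.astype(np.int32)).sum())
            # Formula 3: optical-flow intensity = sum of squared flow components
            flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            movalue.append((flow ** 2).sum())
        prev = gray
    cap.release()
    sdiff, movalue = np.array(sdiff), np.array(movalue)
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    # Formula 1: weighted sum of the two min-max-normalized terms
    return alpha * norm(sdiff) + beta * norm(movalue)

# m = 3: take the three highest-scoring frames as key frames, e.g.
# keyframes = np.argsort(keyframe_scores("clip.avi"))[-3:]
```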
121b) for each of the m chosen frames, extract picture-level semantic features with the Dartmouth Classeme feature extraction method to obtain the frame's semantic feature vector;
The Classeme feature extraction method is a semantic extraction tool based on an external knowledge base; it produces a descriptor that expresses image attributes. The Classeme attribute descriptor covers 2659 image attributes (i.e., 2659 dimensions) corresponding to 2659 concepts, including objects (e.g., basketball, bicycle), persons (e.g., football player, boy), and places (e.g., swimming pool, outdoors). Each frame thus yields a 2659-dimensional real-valued vector.
121c) concatenate the m real-valued vectors obtained from the m frames into one m*2659-dimensional vector, which serves as the semantic feature vector of the video;
122) The spatio-temporal feature extraction process specifically comprises:
122a) extract MoSIFT features from each training video;
The training videos include both videos containing the street-vending scene and videos not containing it. The present embodiment uses the MoSIFT feature extraction method; the document (M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," CMU-CS-09-161, Carnegie Mellon University, 2009) describes the process of extracting MoSIFT features. A MoSIFT feature is a spatio-temporal feature that considers both the spatial and the temporal dimension; each generated feature has 256 dimensions, a count denoted D.
Extracting MoSIFT features from a training video comprises two steps: first detecting interest points, then building descriptors for them.
Interest-point detection comprises finding local extrema as candidate interest points and deciding whether each candidate qualifies as an interest point:
A multi-scale difference-of-Gaussians pyramid is built and its local extrema are taken as candidate interest points; the difference of Gaussians is computed as:
D(x, y, kδ) = L(x, y, kδ) - L(x, y, (k-1)δ) (Formula 4)
In Formula 4, x and y are pixel coordinates in the image; kδ is the standard deviation of the Gaussian at pyramid layer k; L(x, y, kδ) is the convolution of that Gaussian with the image; L(x, y, (k-1)δ) is the convolution at layer k-1; D(x, y, kδ) is the difference result at layer k.
Optical-flow analysis then judges whether each candidate point carries sufficient motion information, i.e., whether its motion intensity is large enough, to decide whether it is an interest point.
Once the interest points are obtained, the MoSIFT feature extraction method combines a SIFT (scale-invariant feature transform) descriptor with an optical-flow descriptor into a 256-dimensional description of the point. SIFT is a classical image feature with scale invariance that describes an interest point with a 128-dimensional real-valued vector; the optical flow is described in a similar manner, and the two are combined into a single 256-dimensional real-valued vector. A sketch of the interest-point test follows.
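The sketch below illustrates Formula 4 and the optical-flow gate on candidate points, assuming OpenCV; the Gaussian scales, flow algorithm, and motion threshold are hypothetical choices, and the search for local extrema across the pyramid is omitted.

```python
import cv2
import numpy as np

def dog_layer(img, k_sigma, delta):
    """Formula 4: difference of two adjacent Gaussian-blurred layers (k_sigma > delta)."""
    l_k  = cv2.GaussianBlur(img, (0, 0), k_sigma)
    l_k1 = cv2.GaussianBlur(img, (0, 0), k_sigma - delta)
    return l_k.astype(np.float32) - l_k1.astype(np.float32)

def motion_gate(prev_gray, gray, candidates, thresh=1.0):
    """Keep only candidate extrema whose optical-flow magnitude is large enough."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)          # per-pixel flow magnitude
    return [(x, y) for (x, y) in candidates if mag[y, x] > thresh]
```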
122b) generate a visual dictionary from all MoSIFT features in the video set;
The method generates the visual dictionary with a Gaussian mixture model, where K denotes the size of the dictionary. The main idea of the mixture model is to assume that the distribution of MoSIFT feature points is a linear superposition of K Gaussian distributions; the method takes K = 64. The mathematical form of the mixture model is:
P(y|θ) = Σ_{k=1}^{K} α_k φ(y|θ_k) (Formula 5)
In Formula 5, P(y|θ) is the probability distribution of the MoSIFT features; α_k is the weight of each Gaussian component; K is the size of the visual dictionary; y is a MoSIFT feature vector; θ is the parameter set of the distribution; θ_k is the parameter set of the k-th Gaussian.
122c) use the visual dictionary to Fisher-vector-encode each video, obtaining a 2*D*K-dimensional Fisher vector, as sketched below;
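The sketch below illustrates steps 122b) and 122c) under stated assumptions: scikit-learn's GaussianMixture stands in for the GMM fit, and the Fisher vector keeps only the gradients with respect to the means and variances (which yields the 2*D*K dimensions), omitting the power and L2 normalization often applied in practice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_dictionary(all_mosift, K=64):
    """Step 122b): fit a K-component diagonal GMM over all MoSIFT descriptors."""
    return GaussianMixture(n_components=K, covariance_type='diag').fit(all_mosift)

def fisher_vector(gmm, X):
    """Step 122c): encode one video's descriptors X (n, D) as a 2*D*K vector."""
    q = gmm.predict_proba(X)                    # (n, K) soft assignments
    n = X.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (X - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        w = np.sqrt(gmm.weights_[k])
        g_mu  = (q[:, k:k+1] * diff).sum(axis=0) / (n * w)                  # D dims
        g_sig = (q[:, k:k+1] * (diff**2 - 1)).sum(axis=0) / (n * w * np.sqrt(2))
        parts.extend([g_mu, g_sig])
    return np.concatenate(parts)                # 2 * 256 * 64 = 32768 dims
```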
122d) apply principal component analysis (PCA) to the Fisher vector to obtain a low-dimensional vector, which is the spatio-temporal feature vector of the video;
The 2*D*K-dimensional Fisher vector above is a 32768-dimensional vector. PCA applies the idea of dimensionality reduction: it transforms many variables into a few synthetic variables, the principal components, which retain most of the information of the original variables. In this method, PCA is applied to the Fisher vectors as follows:
Denote the Fisher-vector dimensionality by p, and let x_i = (x_i1, x_i2, ..., x_ip)^T, i = 1, 2, ..., N, form the feature matrix, where x_ij is the j-th feature value of the i-th sample. The feature matrix is transformed as follows:
Z_ij = (x_ij - x̄_j) / s_j,  i = 1, 2, ..., N;  j = 1, 2, ..., p (Formula 6)
where Z_ij is the entry in row i, column j of the standardized matrix Z; x̄_j and s_j are the mean and standard deviation of column j; N is the number of samples.
The correlation matrix R of Z is then computed:
R = Z^T Z / (N - 1) (Formula 7)
and the characteristic equation of R is solved:
|R - λI_p| = 0 (Formula 8)
In Formula 8, R is the correlation matrix; I_p is the identity matrix; λ is an eigenvalue.
Solving Formula 8 yields p characteristic roots; the method keeps M = 1168 principal components and finally projects the original feature matrix onto the M principal directions to obtain the final spatio-temporal features, as sketched below.
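The following sketch carries out Formulas 6 to 8 directly with NumPy; it is standard PCA on standardized data, so for a 32768-square correlation matrix a truncated SVD would replace the full eigendecomposition in practice.

```python
import numpy as np

def pca_project(X, M=1168):
    """Project the feature matrix X (N, p) onto its top-M principal directions."""
    # Formula 6: Z_ij = (x_ij - mean_j) / s_j  (epsilon guards constant columns)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) + 1e-12)
    # Formula 7: correlation matrix R = Z^T Z / (N - 1)
    R = Z.T @ Z / (X.shape[0] - 1)
    # Formula 8: eigenvalues/eigenvectors of the symmetric matrix R
    eigvals, eigvecs = np.linalg.eigh(R)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:M]]
    return Z @ top                              # (N, M) spatio-temporal features
```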
13) Annotate the feature vectors with category labels to obtain a sample set, each sample containing the two feature vectors and the corresponding category label;
In this embodiment, the category labeling is as follows: videos containing the street-vending scene are labeled 1, denoting positive examples, and videos not containing it are labeled -1, denoting negative examples; this yields a sample set in which each sample contains two feature vectors and the corresponding category label;
14) Iteratively train on the above training sample set within the multiple kernel learning framework;
The invention adopts the multiple kernel learning framework of the Shogun toolkit and combines kernel functions by linear weighting, with the following formula:
K(x_i, x_j) = Σ_{k=1}^{S} β_k K_k(x_i, x_j) (Formula 9)
In Formula 9, K_k(x_i, x_j) is the k-th kernel function; β_k is the weight of the k-th kernel; x_i and x_j are the features of video samples i and j corresponding to that kernel. The method chooses two polynomial kernels: one corresponds to the semantic features and the other to the spatio-temporal features. The polynomial kernel is given by:
K(x, x_i) = ((x · x_i) + 1)^d (Formula 10)
In Formula 10, x and x_i are vectors of the input space; d is the degree, which is 2 in this method. The kernel combination is sketched below.
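The sketch below expresses Formulas 9 and 10 in NumPy; the weights beta are left as free parameters here, whereas MKL training (next) learns them.

```python
import numpy as np

def poly_kernel(X1, X2, d=2):
    """Formula 10: degree-d polynomial kernel between two sets of row vectors."""
    return (X1 @ X2.T + 1.0) ** d

def combined_kernel(sem1, st1, sem2, st2, beta=(0.5, 0.5)):
    """Formula 9: linear combination of the semantic and spatio-temporal kernels."""
    return beta[0] * poly_kernel(sem1, sem2) + beta[1] * poly_kernel(st1, st2)
```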
The constrained optimization problem of multiple kernel learning can be expressed as:
min (1/2) (Σ_{k=1}^{S} ||w_k||_2 / β_k)^2 + C Σ_{i=1}^{N} ξ_i
s.t. y_i (Σ_{k=1}^{S} w_k · φ_k(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0 (Formula 11)
In Formula 11, N is the number of input vectors; ξ_i is the slack variable of vector i; S is the number of kernels; w_k is the weight vector of the separating hyperplane corresponding to the k-th kernel; C is the penalty factor; in the constraint, y_i is the class of the vector (1 or -1), φ_k is the high-dimensional mapping corresponding to the k-th kernel, and b is the offset.
As with an SVM, the multiple kernel learning model adopted by this method can be solved through its Lagrangian dual; the objective of the primal-dual optimization is:
min_β max_α J(α, β) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j Σ_{k=1}^{S} β_k K_k(x_i, x_j)
s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^{N} α_i y_i = 0,
β ∈ Δ_p,  Δ_p = { β ∈ R_+^S : ||β||_p ≤ 1 } (Formula 12)
In Formula 12, N is the number of input vectors; x_i and x_j are input-space vectors; α_i and α_j are the corresponding weights, obtained by learning; y_i and y_j are the corresponding classes; S is the number of kernels; β_k is the weight of the k-th kernel, also obtained by learning; in the constraints, C is the penalty factor and p is the regularization norm. The method sets p = 2 and C = 8; a simplified training sketch follows.
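The patent trains with Shogun's MKL solver, which learns α and β jointly per Formulas 11 and 12. As a simplified stand-in rather than the actual solver, the sketch below fixes β and trains an ordinary SVM on the precomputed combined kernel with scikit-learn, reusing combined_kernel from the previous sketch.

```python
from sklearn.svm import SVC

def train_model(sem, st, y, beta=(0.5, 0.5), C=8.0):
    """Approximate the MKL training step with a fixed-beta precomputed-kernel SVM."""
    K_train = combined_kernel(sem, st, sem, st, beta)   # (N, N) Gram matrix
    clf = SVC(C=C, kernel='precomputed')
    clf.fit(K_train, y)                                 # y holds labels in {+1, -1}
    return clf
```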
15) The offline model is obtained through the multi-kernel training;
The resulting offline model consists of the unknown parameter values obtained by training, chiefly the support-vector samples and their weights, and the kernel functions and their corresponding weights;
2) Video scene detection process
21) connect to the surveillance video source to be examined;
22) set the sampling mode and sample the video to obtain a short clip; this short clip is the detection target;
The sampling modes are time-based and frame-based: time-based sampling takes a 10-second sample every t seconds to form a short clip; frame-based sampling takes one frame every k frames until 240 frames are collected to form a short clip; this short clip is the detection target. The frame-based mode is sketched below.
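A minimal sketch of the frame-based sampling mode, assuming OpenCV for decoding; k is a configuration parameter.

```python
import cv2

def sample_clip_every_k_frames(video_path, k, n_frames=240):
    """Take one frame every k frames until 240 frames form one short clip."""
    cap = cv2.VideoCapture(video_path)
    clip, idx = [], 0
    while len(clip) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % k == 0:
            clip.append(frame)
        idx += 1
    cap.release()
    return clip                 # the short clip used as the detection target
```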
23) extract the semantic features and spatio-temporal features from the above short clip; the extraction procedure is identical to that of the training process;
24) load the offline model into the multiple kernel learning framework, classify the features, determine whether the given scene is present, and obtain the detection result;
The discriminant function is:
f(x) = sign( Σ_{i=1}^{N} α_i y_i Σ_{k=1}^{S} β_k K_k(x_i, x) + b ) (Formula 13)
In Formula 13, all parameters other than x have the same meaning as in the formulas above; x denotes the semantic and spatio-temporal features extracted from the short clip. A discriminant value f(x) of 1 indicates that the video segment contains the given scene; a value of -1 indicates that it does not. A sketch of this step follows.
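Continuing the simplified fixed-β stand-in above, Formula 13 reduces to scoring the new clip's kernel row against the stored training samples; with the scikit-learn substitute this is a prediction on the precomputed kernel.

```python
def detect(clf, sem_train, st_train, sem_new, st_new, beta=(0.5, 0.5)):
    """Formula 13 (approximated): +1 means the given scene is present, -1 absent."""
    K_new = combined_kernel(sem_new, st_new, sem_train, st_train, beta)
    return clf.predict(K_new)   # kernel rows between new clip(s) and training set
```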
It should be noted that the embodiments are published to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should therefore not be limited to what the embodiments disclose; the scope of protection of the invention is defined by the claims.

Claims (10)

1. A video scene detection method in which a computer, instead of manual inspection, examines video data and identifies the scenes it contains; the detection method comprises an offline discrimination-model training process and a video scene detection process:
1) the offline discrimination-model training process performs the following operations:
11) preparing a training video sample set;
12) extracting features, in vector form, from each video in the training sample set, comprising a semantic feature vector and a spatio-temporal feature vector;
13) annotating the feature vectors with category labels to obtain a sample set, each sample containing a semantic feature vector, a spatio-temporal feature vector, and the corresponding category label;
14) iteratively training on the sample set of step 13) within a multiple kernel learning framework to obtain an offline model;
2) the video scene detection process performs the following operations:
21) connecting to the surveillance video source to be examined;
22) setting a sampling mode and sampling the video to obtain a short clip, which is the detection target;
23) extracting features from the short clip of step 22), comprising a semantic feature vector and a spatio-temporal feature vector, by the same extraction method as step 12) of the training process;
24) loading the offline model into the multiple kernel learning framework, classifying the features to determine whether the given scene is present, and obtaining the detection result.
2. The video scene detection method of claim 1, wherein the training video samples of step 11) comprise two classes: a set of videos containing the street-vending scene and a set of videos not containing it.
3. The video scene detection method of claim 1, wherein step 12) extracts features from each video in the training sample set by a semantic feature extraction process and a spatio-temporal feature extraction process.
4. The video scene detection method of claim 3, wherein the semantic feature extraction process specifically comprises the steps of:
121a) for each video, computing the score of every frame with a key-frame extraction method and choosing the m highest-scoring frames as key frames, the score being computed as follows:
score(f_k) = α * (Sdiff(f_k) - Min_Sdiff) / (Max_Sdiff - Min_Sdiff) + β * (MoValue(f_k) - Min_MoValue) / (Max_MoValue - Min_MoValue) (Formula 1)
Sdiff(f_k) = Σ_{i,j} |I_k(i,j) - I_{k-1}(i,j)| (Formula 2)
MoValue(f_k) = Σ_{i=1}^{N_k} ((v_k^x(i))^2 + (v_k^y(i))^2) (Formula 3)
in Formulas 1 to 3, f_k denotes the k-th frame of the video sequence; score(f_k) is its score; Sdiff(f_k) is the difference between this frame and the previous frame; α and β are weights; Max_Sdiff and Min_Sdiff are the maximum and minimum differences between adjacent frames; v_k^x(i) and v_k^y(i) are the horizontal and vertical optical-flow components of pixel i in frame k; N_k is the number of pixels in frame k; MoValue(f_k) is the optical-flow intensity of frame k; Max_MoValue and Min_MoValue are the maximum and minimum optical-flow intensities over all frames;
121b) for each of the m chosen frames, extracting picture-level semantic features with the Dartmouth Classeme feature extraction method to obtain the frame's semantic feature vector;
121c) concatenating the m real-valued feature vectors obtained from the m frames into one m*2659-dimensional vector as the semantic feature vector of the video.
5. The video scene detection method of claim 4, wherein the m frames of step 121a) are three frames.
6. The video scene detection method of claim 3, wherein the spatio-temporal feature extraction process specifically comprises the steps of:
122a) extracting MoSIFT features from each training video with the MoSIFT feature extraction method;
122b) generating a visual dictionary from all MoSIFT features in the video set;
122c) using the visual dictionary to Fisher-vector-encode each video, obtaining a 2*D*K-dimensional Fisher vector;
122d) applying principal component analysis (PCA) to the Fisher vector to obtain a low-dimensional vector, which is the spatio-temporal feature vector of the video.
7. The video scene detection method of claim 6, wherein step 122b) generates the visual dictionary with a Gaussian mixture model.
8. The video scene detection method of claim 1, wherein the multiple kernel learning framework of step 14) is that of the Shogun toolkit, and kernel functions are combined by linear weighting, expressed as Formula 9:
K(x_i, x_j) = Σ_{k=1}^{S} β_k K_k(x_i, x_j) (Formula 9)
in Formula 9, K_k(x_i, x_j) is the k-th kernel function; β_k is the weight of the k-th kernel; x_i and x_j are the features of video samples i and j corresponding to that kernel;
two polynomial kernels are chosen as kernel functions, corresponding to the semantic features and the spatio-temporal features respectively; the polynomial kernel is given by Formula 10:
K(x, x_i) = ((x · x_i) + 1)^d (Formula 10)
in Formula 10, x and x_i are vectors of the input space; d is the degree;
the constrained optimization problem of multiple kernel learning is expressed as:
min (1/2) (Σ_{k=1}^{S} ||w_k||_2 / β_k)^2 + C Σ_{i=1}^{N} ξ_i
s.t. y_i (Σ_{k=1}^{S} w_k · φ_k(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0 (Formula 11)
in Formula 11, N is the number of input vectors; ξ_i is the slack variable of vector i; S is the number of kernels; w_k is the weight vector of the separating hyperplane corresponding to the k-th kernel; C is the penalty factor; in the constraint, y_i is the class of vector i (1 or -1), φ_k is the high-dimensional mapping corresponding to the k-th kernel, and b is the offset;
the multiple kernel learning model is solved by the Lagrangian transformation, yielding the objective function:
min_β max_α J(α, β) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j Σ_{k=1}^{S} β_k K_k(x_i, x_j)
s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^{N} α_i y_i = 0,
β ∈ Δ_p,  Δ_p = { β ∈ R_+^S : ||β||_p ≤ 1 } (Formula 12)
in Formula 12, N is the number of input vectors; x_i and x_j are input-space vectors; α_i and α_j are the corresponding weights, obtained by learning; y_i and y_j are the corresponding classes; S is the number of kernels; β_k is the weight of the k-th kernel, also obtained by learning; in the constraints, C is the penalty factor and p is the regularization norm.
9. The video scene detection method of claim 7, wherein the degree d of the polynomial kernel in Formula 10 is 2.
10. The video scene detection method of claim 1, wherein the video sampling of step 22) is performed either per time interval or per frame interval: time-based sampling takes a 10-second sample every t seconds to form a short clip; frame-based sampling takes one frame every k frames until 240 frames are collected to form a short clip; said short clip serves as the detection target.
CN201510427821.8A 2015-07-20 2015-07-20 Video scene detection method Active CN105005772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510427821.8A CN105005772B (en) 2015-07-20 2015-07-20 Video scene detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510427821.8A CN105005772B (en) 2015-07-20 2015-07-20 Video scene detection method

Publications (2)

Publication Number Publication Date
CN105005772A 2015-10-28
CN105005772B (en) 2018-06-12

Family

ID=54378437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510427821.8A Active CN105005772B (en) 2015-07-20 2015-07-20 Video scene detection method

Country Status (1)

Country Link
CN (1) CN105005772B (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110166744A * 2019-04-28 2019-08-23 南京师范大学 Illegal street-stall monitoring method based on video geo-fencing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Football item detecting system with four-layer structure in sports video and realization method thereof
CN102473291A (en) * 2009-07-20 2012-05-23 汤姆森特许公司 Method for detecting and adapting video processing for far-view scenes in sports video
CN102509084A (en) * 2011-11-18 2012-06-20 中国科学院自动化研究所 Multi-examples-learning-based method for identifying horror video scene
US8489627B1 (en) * 2008-08-28 2013-07-16 Adobe Systems Incorporated Combined semantic description and visual attribute search
CN103679192A (en) * 2013-09-30 2014-03-26 中国人民解放军理工大学 Image scene type discrimination method based on covariance features
CN104184925A (en) * 2014-09-11 2014-12-03 刘鹏 Video scene change detection method


Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844239B * 2016-03-23 2019-03-29 北京邮电大学 Riot and terror video detection method based on CNN and LSTM
CN105844239A (en) * 2016-03-23 2016-08-10 北京邮电大学 Method for detecting riot and terror videos based on CNN and LSTM
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN107273863B (en) * 2017-06-21 2019-07-23 天津师范大学 A kind of scene character recognition method based on semantic stroke pond
CN107273863A (en) * 2017-06-21 2017-10-20 天津师范大学 A kind of scene character recognition method based on semantic stroke pond
CN107239801A (en) * 2017-06-28 2017-10-10 安徽大学 Video attribute represents that learning method and video text describe automatic generation method
CN107239801B (en) * 2017-06-28 2020-07-28 安徽大学 Video attribute representation learning method and video character description automatic generation method
CN109241811A (en) * 2017-07-10 2019-01-18 南京原觉信息科技有限公司 Scene analysis method based on image spiral line and the scene objects monitoring system using this method
CN109241811B (en) * 2017-07-10 2021-04-09 南京原觉信息科技有限公司 Scene analysis method based on image spiral line and scene target monitoring system using same
CN107766838B (en) * 2017-11-08 2021-06-01 央视国际网络无锡有限公司 Video scene switching detection method
CN107766838A (en) * 2017-11-08 2018-03-06 央视国际网络无锡有限公司 A kind of switching detection method of video scene
CN108229336B (en) * 2017-12-13 2021-06-04 北京市商汤科技开发有限公司 Video recognition and training method and apparatus, electronic device, program, and medium
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
CN108197566B (en) * 2017-12-29 2022-03-25 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108197566A (en) * 2017-12-29 2018-06-22 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN108647264B (en) * 2018-04-28 2020-10-13 北京邮电大学 Automatic image annotation method and device based on support vector machine
CN108647264A (en) * 2018-04-28 2018-10-12 北京邮电大学 A kind of image automatic annotation method and device based on support vector machines
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN110969066A (en) * 2018-09-30 2020-04-07 北京金山云网络技术有限公司 Live video identification method and device and electronic equipment
CN110969066B (en) * 2018-09-30 2023-10-10 北京金山云网络技术有限公司 Live video identification method and device and electronic equipment
CN109218721A (en) * 2018-11-26 2019-01-15 南京烽火星空通信发展有限公司 A kind of mutation video detecting method compared based on frame
CN110126846A (en) * 2019-05-24 2019-08-16 北京百度网讯科技有限公司 Representation method, device, system and the storage medium of Driving Scene
CN110532990A (en) * 2019-09-04 2019-12-03 上海眼控科技股份有限公司 The recognition methods of turn signal use state, device, computer equipment and storage medium
WO2022012002A1 (en) * 2020-07-15 2022-01-20 Zhejiang Dahua Technology Co., Ltd. Systems and methods for video analysis

Also Published As

Publication number Publication date
CN105005772B (en) 2018-06-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant