CN102034096A - Video event recognition method based on top-down motion attention mechanism - Google Patents


Info

Publication number
CN102034096A
CN102034096A (application CN201010591513A; granted publication CN102034096B)
Authority
CN
China
Prior art keywords
video
motion
interest
point
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010591513
Other languages
Chinese (zh)
Other versions
CN102034096B (en)
Inventor
胡卫明
李莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN 201010591513 priority Critical patent/CN102034096B/en
Publication of CN102034096A publication Critical patent/CN102034096A/en
Application granted granted Critical
Publication of CN102034096B publication Critical patent/CN102034096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a video event recognition method based on a top-down motion attention mechanism, which comprises the following steps: 1, detecting, on a computer, the interest points of each frame of every video in a video set using a difference-of-Gaussians detector, the video set comprising a training video set and a test video set; 2, extracting scale-invariant feature transform (SIFT) descriptor features and optical flow features at the detected interest points of each frame; 3, building an appearance vocabulary and a motion vocabulary; 4, learning, on the training video set, the probability of each motion word with respect to each event class and building an attention histogram based on motion information; 5, computing the similarity between videos in the video set using the Earth Mover's Distance and generating a kernel matrix; and 6, training a support vector machine classifier with the obtained kernel matrix to obtain classifier parameters, classifying the test video set, and outputting the classification results.

Description

Video event recognition method based on a top-down motion attention mechanism
Technical field
The present invention relates to the field of computer application technology, and in particular to video event recognition methods.
Background technology
In recent years, with the rapid development of the Internet, the spread of video compression technology, DVD, WebTV, and third-generation mobile communication (3G) technologies, and especially the construction of broadband networks, people have ever more opportunities to access video information interactively, and a number of video portal websites have emerged, such as Youku and Tudou in China and YouTube abroad. Video producers around the world, such as television stations, film studios, and advertising companies, continuously produce new audio-visual material, and digital capture devices such as digital cameras and camcorders have entered ordinary households, so digital video media have begun to fill people's living space in large quantities.
How to enable people to quickly locate, conveniently obtain, and effectively manage the useful information contained in video is a problem demanding a prompt solution; in essence, the problem is how to use computer technology to manage and express video content effectively. Video content understanding has therefore become an international research focus, and many researchers have begun to use video data processing techniques to extract the implicit, useful, and understandable semantic information in video, so as to realize video content understanding. Video information has its own characteristics, namely large data volume and poor structure, so the problems brought by the explosion of video information are also very serious. In many fields, collected video information lies idle because large amounts of video data cannot be processed effectively.
Event recognition has always been one of the main tasks of TRECVID. With the continuous enrichment of multimedia information on the network, content-based multimedia retrieval has attracted more and more attention. At present, the greatest problem faced by content-based information retrieval is the "semantic gap" between low-level features and high-level semantics. Video event detection and recognition combines computer vision techniques with content-based multimedia retrieval: it fuses various cues from context and relevant domain knowledge to perform reasoning, and establishes the connection between low-level features and high-level semantics on the basis of events. By building an event-based semantic description of video, we can carry out higher-level semantic analysis of multimedia video and establish efficient indexing and retrieval mechanisms. Previous video analysis was limited to videos under fixed cameras or strictly controlled databases such as Weizmann, KTH, and IXMAS. Unlike such videos, the videos in event detection all come from real-world sources such as news broadcasts, sports videos, and films, which makes event detection face many challenges: disordered motion, complex backgrounds, occlusion of targets, illumination changes, geometric deformation of targets, and so on.
A video event is usually described from two aspects: what happens (what) and how it happens (how). "What" usually refers to the appearance features of the video frames, for example people, objects, and buildings; "how" usually refers to the dynamic features of the video, i.e., the motion features. Motion information is unique to video data; it represents how the video content develops and changes over time and plays an important role in describing and understanding video content. How to fuse these two aspects effectively is also a very challenging problem. At present, however, effective methods for describing events are still lacking, mainly because existing methods consider only one aspect of an event, either "what" or "how"; in particular, some methods exploit only the distribution of motion, which is not robust on real videos. Existing work on fusing the two aspects is scarce, and traditional fusion methods such as early fusion and late fusion are basically bottom-up: they blindly combine the two aspects of an event and are not task-driven.
Summary of the invention
(1) Technical problem to be solved
In order to overcome the interference of background information with the classification process in the prior art, which makes the extracted features poorly targeted and the recognition accuracy low, the purpose of the present invention is to provide a video event recognition method based on a top-down motion attention mechanism that fuses the static and dynamic features of video.
(2) Technical solution
To achieve the above object, the present invention provides a video event recognition method based on a top-down motion attention mechanism, whose technical solution comprises the following steps:
Step S1: using a difference-of-Gaussians detector, detect on a computer the interest points of each frame of every video in a video set, the video set comprising a training video set and a test video set;
Step S2: extract appearance features and motion features at the detected interest points of each frame, the appearance features being scale-invariant feature transform (SIFT) descriptor features and the motion features being optical flow features;
Step S3: cluster the obtained SIFT descriptor features and optical flow features, and build an appearance vocabulary and a motion vocabulary respectively;
Step S4: learn, on the training video set, the probability of each motion word with respect to each event class and build an attention histogram based on motion information;
Step S5: using the motion-attention histogram features of the video set, compute with the Earth Mover's Distance the similarities between training videos and between training videos and test videos, and generate a kernel matrix;
Step S6: train a support vector machine classifier with the obtained kernel matrix to obtain the classifier parameters, classify the test video set with the trained support vector machine classifier model, and output the classification results of the test video set.
Wherein, the interest points of each frame are extracted using one of the following detectors: Harris corners, Harris-Laplace interest points, Hessian-Laplace interest points, Harris-affine interest points, Hessian-affine interest points, maximally stable extremal regions, speeded-up robust features (SURF) interest points, grid points, or the difference-of-Gaussians detector.
Wherein, the step of building the attention histogram based on motion information comprises:
Step S41: each frame I_i of a video in the video set is represented by

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^m_{d_j}) \, \delta(w^v_{d_j}, w^v),

where n(\cdot) is the histogram representation of the i-th frame I_i, w^v is an appearance word, w^m is a motion word, C is the class label of an event, c \in {1, 2, ...}, P(C = c | w^m_{d_j}) is the probability that motion word w^m_{d_j} belongs to class c, \delta is the indicator function, and w^m_{d_j} and w^v_{d_j} are respectively the motion and appearance word indices of interest point d_j;
Step S42: two types of attention histograms are built, for motion magnitude and motion orientation:

the motion-magnitude attention histogram based on visual words (MMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Mag}_{d_j} is the motion-magnitude word index of interest point d_j;

the motion-orientation attention histogram based on visual words (OMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Orient}_{d_j} is the motion-orientation word index of interest point d_j;

Step S43: considering the magnitude and orientation of the optical flow together, the motion attention histogram based on the bag of visual words (MOMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v).
Wherein, for each class c \in C in the training video set, the probability P(C = c | w^m) of each motion word w^m with respect to each class is obtained by the Bayes rule:

P(C = c | w^m) = \frac{P(w^m | C = c) P(C = c)}{P(w^m)},

P(w^m | C = c) = \frac{1}{\|T_{c+}\|} \sum_{w_{d_j} \in T_{c+}} \delta(w^m_{d_j}, w^m),

P(w^m) = \frac{1}{\|T_c\|} \sum_{w_{d_j} \in T_c} \delta(w^m_{d_j}, w^m),

where T_{c+} is the set of all training videos belonging to class c, T_c is the set of all training samples, and \|\cdot\| denotes the number of interest points.
Wherein, the Earth Mover's Distance is used to measure the distance between two video sequences of the video set; any two videos P and Q are expressed respectively as

P = {(p_1, w_{p_1}), (p_2, w_{p_2}), ..., (p_m, w_{p_m})},   Q = {(q_1, w_{q_1}), (q_2, w_{q_2}), ..., (q_n, w_{q_n})},

where p_i and q_j are the histogram features of video P and video Q, w_{p_i} and w_{q_j} are the weights of the i-th frame of video P and the j-th frame of video Q, and m and n are the numbers of frames of video P and video Q; the similarity D(P, Q) of video P and video Q is computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between video P and video Q, which is solved by a linear programming problem.
(3) Beneficial effects
As can be seen from the above technical solution, the present invention has the following advantages:
1. In the video recognition method provided by the invention, the choice of interest point detector is varied and the choice of local features at the interest points is also flexible; if faster and more robust interest point detection methods or local feature extraction methods appear in the future, they can easily be added to the system, further improving system performance.
2. The number of interest points extracted directly from a video is usually very large and includes complex background information, which seriously disturbs subsequent processing and reduces classification accuracy. Because the video recognition method provided by the invention uses the human attention mechanism to select interest points, highlighting those interest points that contribute most to event recognition, it significantly reduces the interference of background information with the classification process, makes the extracted features more targeted, and can significantly improve recognition accuracy.
3. Traditional feature fusion methods such as early fusion and late fusion are all bottom-up; we use the human attention mechanism to fuse the static and dynamic features of video in a top-down manner, and the fusion efficiency is significantly improved.
The present invention fuses the appearance and motion features of video in a top-down manner according to the human attention mechanism. This fusion method requires no parameter setting, combines well the advantages of early fusion and late fusion, and significantly improves recognition efficiency. The invention also overcomes the shortcoming of traditional event recognition methods that require techniques such as background subtraction, target tracking, and detection, and has good application prospects.
Description of drawings
Fig. 1 is the flowchart of the video event recognition method based on the top-down motion attention mechanism of the present invention;
Fig. 1a-Fig. 1d show examples of interest point detection and optical flow on video frame images according to the present invention;
Fig. 2 is a system architecture diagram of the present invention.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
The execution environment of the present invention is a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory, on which an efficient video event recognition algorithm was implemented in Matlab and C; other execution environments can also be adopted and are not described here.
The overall framework of the system scheme of the present invention is shown in Fig. 2. The video event recognition task based on the top-down motion attention mechanism is realized on a computer and contains five main modules:
Interest point detection module 1: this module divides the video database into a training set (training videos) and a test set (test videos), and uses the difference-of-Gaussians detector to detect the interest points of each frame of the training and test videos.
Feature extraction module 2: its input is connected to the output of the interest point detection module 1; its main function is to extract, on the basis of the interest point detection module 1, the SIFT descriptor feature and the optical flow feature of each interest point.
Vocabulary building module 3: its input is connected to the output of the feature extraction module 2; it clusters the obtained SIFT descriptors and optical flow features on the training data and builds an appearance vocabulary and a motion vocabulary respectively;
Motion-information-based attention histogram building module 4: its input is connected to the outputs of the feature extraction module 2 and the vocabulary building module 3; according to the training data, it computes the probability of each motion word in the motion vocabulary with respect to each specific event class, and obtains the attention histogram based on motion information from these probabilities and the appearance words in the appearance vocabulary.
Classification module 5: its input is connected to the output of the motion-information-based attention histogram building module 4; it receives the motion-attention histogram features of the videos, computes the similarity of any two videos with the Earth Mover's Distance, generates the kernel matrix, trains a support vector machine classifier on the training set to obtain the classifier parameters, classifies the test set with the trained support vector machine classifier model, and outputs the classification results of the test video set. The events in our recognition task are "Exiting Car, Handshaking, Running, Demonstration Or Protest, Walking, Riot, Dancing, Shooting, People Marching".
The flowchart of the video event recognition method based on the top-down motion attention mechanism is shown in Fig. 1. Detailed explanations of each technical issue involved in the technical solution of this invention are given below.
(1) Interest point detection
There are many choices for the interest point extraction method, such as: Harris corners (Harris), Harris-Laplace interest points (Harris Laplace), Hessian-Laplace interest points (Hessian Laplace), Harris-affine interest points (Harris Affine), Hessian-affine interest points (Hessian Affine), maximally stable extremal regions (Maximally Stable Extremal Regions, MSER), speeded-up robust features (Speeded Up Robust Features, SURF) interest points, grid points (Grid), and so on.
A video V is denoted V = {I_i}, i \in {1, 2, ..., N}. For each frame I_i of the video, local extrema in the difference-of-Gaussians (DoG, Difference of Gaussians) scale space are detected as interest points.
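As an illustration only, the following is a minimal sketch of the per-frame interest point detection described above, using OpenCV's SIFT detector (whose keypoint stage is a DoG scale-space extrema detector); the video file name and the feature count are illustrative assumptions, not values fixed by the invention.

```python
import cv2

def detect_interest_points(frame_bgr, n_features=500):
    """Detect interest points in one frame as DoG scale-space extrema
    (the keypoint stage of OpenCV's SIFT detector)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=n_features)
    return sift.detect(gray, None)

# Iterate over the frames I_i of one video V (file name is hypothetical).
cap = cv2.VideoCapture("example_video.avi")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    keypoints = detect_interest_points(frame)
    # keypoints[j].pt is the (x, y) location of interest point d_j in this frame
cap.release()
```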
(2) Feature extraction
Next, local image features are extracted at the interest points. Alternative local feature extraction methods include: scale-invariant feature transform (Scale Invariant Feature Transform, SIFT), speeded-up robust features (Speeded Up Robust Features, SURF), shape context descriptors (Shape Context, SC), and so on.
We use 128-dimensional SIFT to represent the appearance feature of an interest point and, for the detected interest points, compute the optical flow of this sparse feature set with the iterative pyramidal Lucas-Kanade method. Fig. 1a to Fig. 1d give examples of detected interest points and optical flow vectors on some video frames.
With the k-means clustering method or another clustering method, the detected interest points are clustered separately according to their appearance and motion features into two vocabularies: w^m (motion words) and w^v (appearance words); each cluster center is defined as a word.
Under a polar coordinate system the optical flow can be represented by its magnitude Mag and orientation Orient; in the two-dimensional motion field every motion vector contains both magnitude and orientation cues. The magnitude reflects the spatial amplitude of the motion, and the orientation reflects the trend of the motion. We therefore have two types of motion words: motion-magnitude words w^{Mag} and motion-orientation words w^{Orient}.
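As a rough sketch of this feature extraction and vocabulary construction, the code below computes 128-dimensional SIFT descriptors and sparse pyramidal Lucas-Kanade optical flow at the interest points of a frame, then clusters pooled features into vocabularies with k-means; OpenCV and scikit-learn stand in for the unspecified implementation, and the vocabulary sizes are illustrative assumptions.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def frame_features(gray, next_gray, sift):
    """128-D SIFT descriptors (appearance) and sparse pyramidal Lucas-Kanade
    optical flow (motion) at the DoG interest points of frame `gray`."""
    kps = sift.detect(gray, None)
    kps, desc = sift.compute(gray, kps)                    # 128-D SIFT per interest point
    pts = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
    nxt_pts, status, _ = cv2.calcOpticalFlowPyrLK(gray, next_gray, pts, None)
    flow = (nxt_pts - pts).reshape(-1, 2)                  # per-point optical flow vectors
    mag = np.linalg.norm(flow, axis=1)                     # Mag: motion magnitude
    orient = np.arctan2(flow[:, 1], flow[:, 0])            # Orient: motion orientation
    return desc, mag, orient

def build_vocabulary(samples, n_words):
    """Cluster pooled feature samples; each cluster center is one word."""
    return KMeans(n_clusters=n_words, n_init=10).fit(samples)

# e.g. appearance_vocab  = build_vocabulary(all_sift_descriptors, n_words=1000)
#      magnitude_vocab   = build_vocabulary(all_magnitudes.reshape(-1, 1), n_words=20)
#      orientation_vocab = build_vocabulary(all_orientations.reshape(-1, 1), n_words=20)
```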
(3) Building the attention histogram based on motion information
As can be seen from Fig. 1a-Fig. 1d, the number of interest points extracted from a video frame is often very large and includes complex background information and information irrelevant to our event recognition task; the presence of such information seriously disturbs the subsequent processing. The present invention uses the human attention mechanism to select and weight the interest points. Biological and psychological research has shown that humans always actively focus their attention on specific regions that produce novel stimuli or the stimuli people expect, called the focus of attention or salient regions. Visual saliency comprises bottom-up and top-down modes: the former is data-driven, and the latter is knowledge- or task-driven. The top-down attention mechanism is used to highlight those interest points that contribute much to event recognition and to ignore, as far as possible, the interest points irrelevant to the recognition task.
Each frame I_i of a video can be represented by

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^m_{d_j}) \, \delta(w^v_{d_j}, w^v),

where C is the class label of an event, c \in {1, 2, ...}, \delta is the indicator function, and w^m_{d_j} and w^v_{d_j} are respectively the motion and appearance word indices of interest point d_j;
From the above formula we can see that the function of the scale-invariant feature transform (SIFT) descriptor feature is to describe the "what" aspect of an event, while the motion feature serves two purposes: on the one hand it describes the "how" aspect of the event, and on the other hand it acts as an attention cue that guides recognition of the corresponding event class.
Two types of attention histograms can be built, for motion magnitude and motion orientation:

the motion-magnitude attention histogram based on visual words (MMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Mag}_{d_j} is the motion-magnitude word index of interest point d_j;

the motion-orientation attention histogram based on visual words (OMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Orient}_{d_j} is the motion-orientation word index of interest point d_j;

if the magnitude and orientation of the optical flow are considered together, the motion attention histogram based on the bag of visual words (MOMA-BoW) is

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),
and for each video event class c \in C, the probability of each motion word with respect to each class can be obtained by the Bayes rule:

P(C = c | w^m) = \frac{P(w^m | C = c) P(C = c)}{P(w^m)},

P(w^m | C = c) = \frac{1}{\|T_{c+}\|} \sum_{w_{d_j} \in T_{c+}} \delta(w^m_{d_j}, w^m),

P(w^m) = \frac{1}{\|T_c\|} \sum_{w_{d_j} \in T_c} \delta(w^m_{d_j}, w^m),

where T_{c+} is the set of all videos belonging to class c, T_c is the set of all training samples, and \|\cdot\| denotes the number of interest points.
As can be seen from the formula of the attention histogram based on motion information, the motion information is implicit in the representation of the video and also serves as the weight of the appearance (SIFT) features. In particular, for a given motion word, its probabilities with respect to different event classes are different; that is, the same motion word contributes differently to the recognition of different classes. For example, when we classify the event "Running", among all detected interest points the motion words that really describe the "run" action should be given larger weights. On the other hand, for events such as "Riot", motion information is not very relevant; the probability of each motion word with respect to such a class is then essentially the same, and the bag-of-words model degenerates to its most basic form.
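A possible sketch of this construction is given below, assuming that appearance, magnitude, and orientation word indices have already been assigned to every interest point; array shapes and helper names are illustrative, not part of the invention.

```python
import numpy as np

def motion_word_probabilities(train_videos, n_classes, n_motion_words):
    """Estimate P(C=c | w^m) by the Bayes rule from the motion-word indices of
    all interest points in the training videos.
    `train_videos` is a list of (motion_word_indices, class_label) pairs."""
    counts = np.zeros((n_classes, n_motion_words))
    for motion_words, label in train_videos:
        counts[label] += np.bincount(motion_words, minlength=n_motion_words)
    class_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    p_w_given_c = counts / class_totals                         # P(w^m | C=c)
    p_c = counts.sum(axis=1) / counts.sum()                     # P(C=c)
    p_w = counts.sum(axis=0) / counts.sum()                     # P(w^m)
    # Bayes rule: P(C=c | w^m) = P(w^m | C=c) P(C=c) / P(w^m)
    return (p_w_given_c * p_c[:, None]) / np.maximum(p_w, 1e-12)

def moma_bow_histogram(app_words, mag_words, orient_words, c,
                       p_c_given_mag, p_c_given_orient, n_app_words):
    """MOMA-BoW histogram n(w^v | I_i, C=c): each interest point votes for its
    appearance word, weighted by P(C=c | w^Mag) * P(C=c | w^Orient)."""
    weights = p_c_given_mag[c, mag_words] * p_c_given_orient[c, orient_words]
    return np.bincount(app_words, weights=weights, minlength=n_app_words)
```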
(4) Event recognition
Given a video V, after the motion attention histogram feature p_i of the i-th frame based on the bag of visual words is obtained, the video can be expressed as V = {(p_1, w_{p_1}), (p_2, w_{p_2}), ..., (p_m, w_{p_m})}, where the weight w_{p_i} of the i-th frame satisfies \sum_{i=1}^{m} w_{p_i} = 1; here the default value 1/m is adopted. The Earth Mover's Distance (EMD) is used to measure the distance between two video sequences. Any two videos P and Q can be expressed respectively as P = {(p_1, w_{p_1}), ..., (p_m, w_{p_m})} and Q = {(q_1, w_{q_1}), ..., (q_n, w_{q_n})}, where p_i and q_j are the histogram features of video P and video Q, w_{p_i} and w_{q_j} are the weights of the i-th frame of video P and the j-th frame of video Q, and m and n are the numbers of frames of video P and video Q. The similarity of video P and video Q can be computed by the formula below. The EMD is robust to temporal shift and duration variation: the former means that the start frames of one video may match the end frames of another video, and the latter means that one frame of a video may match multiple frames of another video.
The similarity of video P and video Q can be computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between videos P and Q, obtained by solving the linear programming problem

min  WORK(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}
s.t.
f_{ij} \geq 0,
\sum_{j=1}^{n} f_{ij} \leq w_{p_i},
\sum_{i=1}^{m} f_{ij} \leq w_{q_j},
\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{q_j} \right).
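The following is a small sketch of this EMD computation via a linear program, using scipy.optimize.linprog; it is one possible implementation under the stated constraints, not necessarily the implementation used in the embodiment.

```python
import numpy as np
from scipy.optimize import linprog

def emd_distance(P, Q, wp=None, wq=None):
    """Earth Mover's Distance between two videos given as per-frame histogram
    matrices P (m x d) and Q (n x d), with frame weights wp, wq
    (defaulting to uniform 1/m and 1/n)."""
    m, n = len(P), len(Q)
    wp = np.full(m, 1.0 / m) if wp is None else wp
    wq = np.full(n, 1.0 / n) if wq is None else wq
    # d_ij: Euclidean distance between frame histograms p_i and q_j (row-major flatten)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).ravel()

    # Inequality constraints: row sums of f <= wp, column sums of f <= wq
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([wp, wq])
    # Equality constraint: total flow = min(sum wp, sum wq)
    A_eq = np.ones((1, m * n))
    b_eq = [min(wp.sum(), wq.sum())]

    res = linprog(d, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    f = res.x
    return float(d @ f / f.sum())
```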
Next, a support vector machine is used as the classifier, with the "one-versus-rest" classification strategy.
Since 9 events need to be recognized, 9 classifiers are trained; in each classifier the samples of one event class serve as the positive class and the remaining samples as the negative class. The Earth Mover's Distance between videos is embedded into the Gaussian kernel function of the support vector machine classifier:
K(P, Q) = \exp\!\left( -\frac{D(P, Q)}{\lambda M} \right),

where M is a normalization factor, obtained from the average Earth Mover's Distance over all training data, and \lambda is a scale factor that can be determined empirically by cross-validation.
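A minimal sketch of the kernel construction and "one-versus-rest" SVM classification follows, assuming the pairwise EMD matrices have already been computed; scikit-learn's OneVsRestClassifier with a precomputed-kernel SVC stands in for the classifier described above, and the scale factor λ is left as a tunable parameter.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def emd_kernel(dist_matrix, mean_train_emd, lam=1.0):
    """Gaussian kernel K = exp(-D / (lam * M)), with M the mean EMD over the
    training data (normalization factor) and lam a cross-validated scale."""
    return np.exp(-dist_matrix / (lam * mean_train_emd))

def train_and_classify(D_train, y_train, D_test, lam=1.0):
    """D_train: pairwise EMD between training videos (square matrix);
    D_test: EMD between each test video and each training video;
    y_train: labels of the 9 event classes."""
    M = D_train.mean()
    K_train = emd_kernel(D_train, M, lam)          # (n_train, n_train)
    K_test = emd_kernel(D_test, M, lam)            # (n_test, n_train)
    clf = OneVsRestClassifier(SVC(kernel="precomputed"))
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```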
The above is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person familiar with this technology can conceive within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A video event recognition method based on a top-down motion attention mechanism, comprising the steps of:
Step S1: using a difference-of-Gaussians detector, detect on a computer the interest points of each frame of every video in a video set, the video set comprising a training video set and a test video set;
Step S2: extract appearance features and motion features at the detected interest points of each frame, the appearance features being scale-invariant feature transform (SIFT) descriptor features and the motion features being optical flow features;
Step S3: cluster the obtained SIFT descriptor features and optical flow features, and build an appearance vocabulary and a motion vocabulary respectively;
Step S4: calculate, on the training video set, the probability of each motion word with respect to each event class and build an attention histogram based on motion information;
Step S5: using the motion-attention histogram features of the video set, compute with the Earth Mover's Distance the similarities between training videos and between training videos and test videos, and generate a kernel matrix;
Step S6: train a support vector machine classifier with the obtained kernel matrix to obtain the classifier parameters, classify the test video set with the trained support vector machine classifier model, and output the classification results of the test video set.
2. The video event recognition method according to claim 1, characterized in that the interest points of each frame are extracted using one of the following detectors: Harris corners, Harris-Laplace interest points, Hessian-Laplace interest points, Harris-affine interest points, Hessian-affine interest points, maximally stable extremal regions, speeded-up robust features (SURF) interest points, grid points, or the difference-of-Gaussians detector.
3. The video event recognition method according to claim 1, characterized in that the step of building the attention histogram based on motion information comprises:
Step S41: each frame I_i of a video in the video set is represented by

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^m_{d_j}) \, \delta(w^v_{d_j}, w^v),

where n(\cdot) is the histogram representation of the i-th frame I_i, w^v is an appearance word, w^m is a motion word, C is the class label of an event, c \in {1, 2, ...}, P(C = c | w^m_{d_j}) is the probability that motion word w^m_{d_j} belongs to class c, \delta is the indicator function, and w^m_{d_j} and w^v_{d_j} are respectively the motion and appearance word indices of interest point d_j;
Step S42: two types of attention histograms are built, for motion magnitude and motion orientation:

the motion-magnitude attention histogram based on visual words (MMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Mag}_{d_j} is the motion-magnitude word index of interest point d_j;

the motion-orientation attention histogram based on visual words (OMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Orient}_{d_j} is the motion-orientation word index of interest point d_j;
Step S43: considering the magnitude and orientation of the optical flow together, the motion attention histogram based on the bag of visual words (MOMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v).
4. The video event recognition method according to claim 3, characterized in that, for each class c \in C in the training video set, the probability P(C = c | w^m) of each motion word w^m with respect to each class is obtained by the Bayes rule:

P(C = c | w^m) = \frac{P(w^m | C = c) P(C = c)}{P(w^m)},

P(w^m | C = c) = \frac{1}{\|T_{c+}\|} \sum_{w_{d_j} \in T_{c+}} \delta(w^m_{d_j}, w^m),

P(w^m) = \frac{1}{\|T_c\|} \sum_{w_{d_j} \in T_c} \delta(w^m_{d_j}, w^m),

where T_{c+} is the set of all training videos belonging to class c, T_c is the set of all training samples, and \|\cdot\| denotes the number of interest points.
5. The video event recognition method according to claim 1, characterized in that the Earth Mover's Distance is used to measure the distance between two video sequences of the video set; any two videos P and Q are expressed respectively as

P = {(p_1, w_{p_1}), (p_2, w_{p_2}), ..., (p_m, w_{p_m})},   Q = {(q_1, w_{q_1}), (q_2, w_{q_2}), ..., (q_n, w_{q_n})},

where p_i and q_j are the histogram features of video P and video Q, w_{p_i} and w_{q_j} are the weights of the i-th frame of video P and the j-th frame of video Q, and m and n are the numbers of frames of video P and video Q; the similarity D(P, Q) of video P and video Q is computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between video P and video Q, which is solved by a linear programming problem.
CN 201010591513 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism Active CN102034096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010591513 CN102034096B (en) 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010591513 CN102034096B (en) 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism

Publications (2)

Publication Number Publication Date
CN102034096A true CN102034096A (en) 2011-04-27
CN102034096B CN102034096B (en) 2013-03-06

Family

ID=43886959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010591513 Active CN102034096B (en) 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism

Country Status (1)

Country Link
CN (1) CN102034096B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163290A (en) * 2011-05-16 2011-08-24 天津大学 Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
CN102930302A (en) * 2012-10-18 2013-02-13 山东大学 On-line sequential extreme learning machine-based incremental human behavior recognition method
CN103077401A (en) * 2012-12-27 2013-05-01 深圳市赛为智能股份有限公司 Method and system for detecting context histogram abnormal behaviors based on light streams
CN103093236A (en) * 2013-01-15 2013-05-08 北京工业大学 Movable terminal porn filtering method based on analyzing image and semantics
CN103226713A (en) * 2013-05-16 2013-07-31 中国科学院自动化研究所 Multi-view behavior recognition method
CN103366370A (en) * 2013-07-03 2013-10-23 深圳市智美达科技有限公司 Target tracking method and device in video monitoring
CN103854016A (en) * 2014-03-27 2014-06-11 北京大学深圳研究生院 Human body behavior classification and identification method and system based on directional common occurrence characteristics
CN104200235A (en) * 2014-07-28 2014-12-10 中国科学院自动化研究所 Time-space local feature extraction method based on linear dynamic system
CN104657468A (en) * 2015-02-12 2015-05-27 中国科学院自动化研究所 Fast video classification method based on images and texts
WO2015078134A1 (en) * 2013-11-29 2015-06-04 华为技术有限公司 Video classification method and device
CN103116896B (en) * 2013-03-07 2015-07-15 中国科学院光电技术研究所 Visual saliency model based automatic detecting and tracking method
CN105512606A (en) * 2015-11-24 2016-04-20 北京航空航天大学 AR-model-power-spectrum-based dynamic scene classification method and apparatus
CN105528594A (en) * 2016-01-31 2016-04-27 江南大学 Incident identification method based on video signal
CN108268597A (en) * 2017-12-18 2018-07-10 中国电子科技集团公司第二十八研究所 A kind of moving-target activity probability map construction and behavior intension recognizing method
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN109670174A (en) * 2018-12-14 2019-04-23 腾讯科技(深圳)有限公司 A kind of training method and device of event recognition model
CN110288592A (en) * 2019-07-02 2019-09-27 中南大学 A method of the zinc flotation dosing state evaluation based on probability semantic analysis model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033347A1 (en) * 2001-05-10 2003-02-13 International Business Machines Corporation Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
JP2006012012A (en) * 2004-06-29 2006-01-12 Matsushita Electric Ind Co Ltd Event extraction device, and method and program therefor
CN1945628A (en) * 2006-10-20 2007-04-11 北京交通大学 Video frequency content expressing method based on space-time remarkable unit
CN101894276A (en) * 2010-06-01 2010-11-24 中国科学院计算技术研究所 Training method of human action recognition and recognition method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033347A1 (en) * 2001-05-10 2003-02-13 International Business Machines Corporation Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
JP2006012012A (en) * 2004-06-29 2006-01-12 Matsushita Electric Ind Co Ltd Event extraction device, and method and program therefor
CN1945628A (en) * 2006-10-20 2007-04-11 北京交通大学 Video frequency content expressing method based on space-time remarkable unit
CN101894276A (en) * 2010-06-01 2010-11-24 中国科学院计算技术研究所 Training method of human action recognition and recognition method

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163290B (en) * 2011-05-16 2012-08-01 天津大学 Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
CN102163290A (en) * 2011-05-16 2011-08-24 天津大学 Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
CN102930302A (en) * 2012-10-18 2013-02-13 山东大学 On-line sequential extreme learning machine-based incremental human behavior recognition method
CN102930302B (en) * 2012-10-18 2016-01-13 山东大学 Based on the incrementally Human bodys' response method of online sequential extreme learning machine
CN103077401A (en) * 2012-12-27 2013-05-01 深圳市赛为智能股份有限公司 Method and system for detecting context histogram abnormal behaviors based on light streams
CN103093236A (en) * 2013-01-15 2013-05-08 北京工业大学 Movable terminal porn filtering method based on analyzing image and semantics
CN103093236B (en) * 2013-01-15 2015-11-04 北京工业大学 A kind of pornographic filter method of mobile terminal analyzed based on image, semantic
CN103116896B (en) * 2013-03-07 2015-07-15 中国科学院光电技术研究所 Visual saliency model based automatic detecting and tracking method
CN103226713A (en) * 2013-05-16 2013-07-31 中国科学院自动化研究所 Multi-view behavior recognition method
CN103226713B (en) * 2013-05-16 2016-04-13 中国科学院自动化研究所 A kind of various visual angles Activity recognition method
CN103366370A (en) * 2013-07-03 2013-10-23 深圳市智美达科技有限公司 Target tracking method and device in video monitoring
CN103366370B (en) * 2013-07-03 2016-04-20 深圳市智美达科技股份有限公司 Method for tracking target in video monitoring and device
WO2015078134A1 (en) * 2013-11-29 2015-06-04 华为技术有限公司 Video classification method and device
US10002296B2 (en) 2013-11-29 2018-06-19 Huawei Technologies Co., Ltd. Video classification method and apparatus
CN103854016A (en) * 2014-03-27 2014-06-11 北京大学深圳研究生院 Human body behavior classification and identification method and system based on directional common occurrence characteristics
CN104200235A (en) * 2014-07-28 2014-12-10 中国科学院自动化研究所 Time-space local feature extraction method based on linear dynamic system
CN104657468B (en) * 2015-02-12 2018-07-31 中国科学院自动化研究所 The rapid classification method of video based on image and text
CN104657468A (en) * 2015-02-12 2015-05-27 中国科学院自动化研究所 Fast video classification method based on images and texts
CN105512606A (en) * 2015-11-24 2016-04-20 北京航空航天大学 AR-model-power-spectrum-based dynamic scene classification method and apparatus
CN105512606B (en) * 2015-11-24 2018-12-21 北京航空航天大学 Dynamic scene classification method and device based on AR model power spectrum
CN105528594B (en) * 2016-01-31 2019-01-22 江南大学 A kind of event recognition method based on vision signal
CN105528594A (en) * 2016-01-31 2016-04-27 江南大学 Incident identification method based on video signal
CN108268597A (en) * 2017-12-18 2018-07-10 中国电子科技集团公司第二十八研究所 A kind of moving-target activity probability map construction and behavior intension recognizing method
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN108764050B (en) * 2018-04-28 2021-02-26 中国科学院自动化研究所 Method, system and equipment for recognizing skeleton behavior based on angle independence
CN109670174A (en) * 2018-12-14 2019-04-23 腾讯科技(深圳)有限公司 A kind of training method and device of event recognition model
CN109670174B (en) * 2018-12-14 2022-12-16 腾讯科技(深圳)有限公司 Training method and device of event recognition model
CN110288592A (en) * 2019-07-02 2019-09-27 中南大学 A method of the zinc flotation dosing state evaluation based on probability semantic analysis model
CN110288592B (en) * 2019-07-02 2021-03-02 中南大学 Zinc flotation dosing state evaluation method based on probability semantic analysis model

Also Published As

Publication number Publication date
CN102034096B (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN102034096B (en) Video event recognition method based on top-down motion attention mechanism
Fenil et al. Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM
CN109034044B (en) Pedestrian re-identification method based on fusion convolutional neural network
Yang et al. STA-CNN: Convolutional spatial-temporal attention learning for action recognition
Pouyanfar et al. Automatic video event detection for imbalance data using enhanced ensemble deep learning
CN101894276B (en) Training method of human action recognition and recognition method
Gnouma et al. Stacked sparse autoencoder and history of binary motion image for human activity recognition
Soomro et al. Action localization in videos through context walk
Wang et al. Video event detection using motion relativity and feature selection
CN104268586A (en) Multi-visual-angle action recognition method
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
Jiang et al. An efficient attention module for 3d convolutional neural networks in action recognition
CN103886585A (en) Video tracking method based on rank learning
Kiruba et al. Hexagonal volume local binary pattern (H-VLBP) with deep stacked autoencoder for human action recognition
Pang et al. Predicting skeleton trajectories using a Skeleton-Transformer for video anomaly detection
Huang et al. Multilabel remote sensing image annotation with multiscale attention and label correlation
Yang et al. Bottom-up foreground-aware feature fusion for practical person search
Symeonidis et al. Neural attention-driven non-maximum suppression for person detection
Wang et al. Action recognition using linear dynamic systems
Sun et al. Exploiting deeply supervised inception networks for automatically detecting traffic congestion on freeway in China using ultra-low frame rate videos
Aakur et al. Action localization through continual predictive learning
Zhou et al. Learning semantic context feature-tree for action recognition via nearest neighbor fusion
Li et al. Video is graph: Structured graph module for video action recognition
Elharrouss et al. Mhad: multi-human action dataset
Ahmed Motion classification using CNN based on image difference

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant