CN102034096A - Video event recognition method based on top-down motion attention mechanism - Google Patents


Info

Publication number
CN102034096A
CN102034096A (application CN201010591513A; granted publication CN102034096B)
Authority
CN
China
Prior art keywords
video
motion
interest
point
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010591513
Other languages
Chinese (zh)
Other versions
CN102034096B (en)
Inventor
胡卫明
李莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN 201010591513 priority Critical patent/CN102034096B/en
Publication of CN102034096A publication Critical patent/CN102034096A/en
Application granted granted Critical
Publication of CN102034096B publication Critical patent/CN102034096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a video event recognition method based on a top-down motion attention mechanism, which comprises the following steps: 1, detecting, on a computer, the interest points of each frame of every video in a video set using a difference-of-Gaussians detector, the video set comprising a training video set and a test video set; 2, extracting scale-invariant feature transform (SIFT) descriptor features and optical flow features at the detected interest points of each frame; 3, building an appearance vocabulary and a motion vocabulary; 4, learning, on the training video set, the probability of each motion word with respect to each event class and building an attention histogram based on motion information; 5, computing the similarity between videos in the video set using the Earth Mover's Distance and generating a kernel matrix; and 6, training a support vector machine classifier with the obtained kernel matrix to obtain classifier parameters, classifying the test video set, and outputting the classification results.

Description

Video event recognition method based on a top-down motion attention mechanism
Technical field
The present invention relates to the field of computer application technology, and in particular to video event recognition methods.
Background technology
In recent years, with the rapid development of the Internet, the spread of video compression technology, DVD, WebTV, and third-generation mobile communication (3G) technologies, and especially the construction of broadband networks, people have ever more opportunities to access video information interactively, and a number of video portal websites have emerged, such as Youku and Tudou in China and YouTube abroad. Video producers around the world, such as television stations, film studios, and advertising companies, continuously produce new audio-visual material, and digital capture devices such as digital cameras and camcorders have entered ordinary households, so digital video media have begun to fill people's living space in large quantities.
How to enable people to quickly locate, conveniently obtain, and effectively manage the useful information contained in video is a problem demanding a prompt solution; in essence, the problem is how to use computer technology to manage and express video content effectively. Video content understanding has therefore become an international research focus, and many researchers have begun to use video data processing techniques to extract the implicit, useful, and understandable semantic information in video, so as to realize video content understanding. Video information has its own characteristics, namely large data volume and poor structure, so the problems brought by the explosion of video information are also very serious. In many fields, collected video information lies idle because large amounts of video data cannot be processed effectively.
Event recognition has always been one of the main tasks of TRECVID. With the continuous enrichment of multimedia information on the network, content-based multimedia retrieval has attracted more and more attention. At present, the greatest problem faced by content-based information retrieval is the "semantic gap" between low-level features and high-level semantics. Video event detection and recognition combines computer vision techniques with content-based multimedia retrieval: it fuses various cues from context and relevant domain knowledge to perform reasoning, and establishes the connection between low-level features and high-level semantics on the basis of events. By building an event-based semantic description of video, we can carry out higher-level semantic analysis of multimedia video and establish efficient indexing and retrieval mechanisms. Previous video analysis was limited to videos under fixed cameras or strictly controlled databases such as Weizmann, KTH, and IXMAS. Unlike such videos, the videos in event detection all come from real-world sources such as news broadcasts, sports videos, and films, which makes event detection face many challenges: disordered motion, complex backgrounds, occlusion of targets, illumination changes, geometric deformation of targets, and so on.
A video event is usually described from two aspects: what happens (what) and how it happens (how). "What" usually refers to the appearance features of the video frames, for example people, objects, and buildings; "how" usually refers to the dynamic features of the video, i.e., the motion features. Motion information is unique to video data; it represents how the video content develops and changes over time and plays an important role in describing and understanding video content. How to fuse these two aspects effectively is also a very challenging problem. At present, however, effective methods for describing events are still lacking, mainly because existing methods consider only one aspect of an event, either "what" or "how"; in particular, some methods exploit only the distribution of motion, which is not robust on real videos. Existing work on fusing the two aspects is scarce, and traditional fusion methods such as early fusion and late fusion are basically bottom-up: they blindly combine the two aspects of an event and are not task-driven.
Summary of the invention
(1) Technical problem to be solved
In order to overcome the interference of background information with the classification process in the prior art, which makes the extracted features poorly targeted and the recognition accuracy low, the purpose of the present invention is to provide a video event recognition method based on a top-down motion attention mechanism that fuses the static and dynamic features of video.
(2) Technical solution
To achieve the above object, the present invention provides a video event recognition method based on a top-down motion attention mechanism, whose technical solution comprises the following steps:
Step S1: using a difference-of-Gaussians detector, detect on a computer the interest points of each frame of every video in a video set, the video set comprising a training video set and a test video set;
Step S2: extract appearance features and motion features at the detected interest points of each frame, the appearance features being scale-invariant feature transform (SIFT) descriptor features and the motion features being optical flow features;
Step S3: cluster the obtained SIFT descriptor features and optical flow features, and build an appearance vocabulary and a motion vocabulary respectively;
Step S4: learn, on the training video set, the probability of each motion word with respect to each event class and build an attention histogram based on motion information;
Step S5: using the motion-attention histogram features of the video set, compute with the Earth Mover's Distance the similarities between training videos and between training videos and test videos, and generate a kernel matrix;
Step S6: train a support vector machine classifier with the obtained kernel matrix to obtain the classifier parameters, classify the test video set with the trained support vector machine classifier model, and output the classification results of the test video set.
Wherein, the interest points of each frame are extracted using one of the following detectors: Harris corners, Harris-Laplace interest points, Hessian-Laplace interest points, Harris-affine interest points, Hessian-affine interest points, maximally stable extremal regions, speeded-up robust features (SURF) interest points, grid points, or the difference-of-Gaussians detector.
Wherein, the step of building the attention histogram based on motion information comprises:
Step S41: each frame I_i of a video in the video set is represented by

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^m_{d_j}) \, \delta(w^v_{d_j}, w^v),

where n(\cdot) is the histogram representation of the i-th frame I_i, w^v is an appearance word, w^m is a motion word, C is the class label of an event, c \in {1, 2, ...}, P(C = c | w^m_{d_j}) is the probability that motion word w^m_{d_j} belongs to class c, \delta is the indicator function, and w^m_{d_j} and w^v_{d_j} are respectively the motion and appearance word indices of interest point d_j;
Step S42: two types of attention histograms are built, for motion magnitude and motion orientation:

the motion-magnitude attention histogram based on visual words (MMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Mag}_{d_j} is the motion-magnitude word index of interest point d_j;

the motion-orientation attention histogram based on visual words (OMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Orient}_{d_j} is the motion-orientation word index of interest point d_j;

Step S43: considering the magnitude and orientation of the optical flow together, the motion attention histogram based on the bag of visual words (MOMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v).
Wherein, for each class c \in C in the training video set, the probability P(C = c | w^m) of each motion word w^m with respect to each class is obtained by the Bayes rule:

P(C = c | w^m) = \frac{P(w^m | C = c) P(C = c)}{P(w^m)},

P(w^m | C = c) = \frac{1}{\|T_{c+}\|} \sum_{w_{d_j} \in T_{c+}} \delta(w^m_{d_j}, w^m),

P(w^m) = \frac{1}{\|T_c\|} \sum_{w_{d_j} \in T_c} \delta(w^m_{d_j}, w^m),

where T_{c+} is the set of all training videos belonging to class c, T_c is the set of all training samples, and \|\cdot\| denotes the number of interest points.
Wherein, the Earth Mover's Distance is used to measure the distance between two video sequences of the video set; any two videos P and Q are expressed respectively as

P = {(p_1, w_{p_1}), (p_2, w_{p_2}), ..., (p_m, w_{p_m})},   Q = {(q_1, w_{q_1}), (q_2, w_{q_2}), ..., (q_n, w_{q_n})},

where p_i and q_j are the histogram features of video P and video Q, w_{p_i} and w_{q_j} are the weights of the i-th frame of video P and the j-th frame of video Q, and m and n are the numbers of frames of video P and video Q; the similarity D(P, Q) of video P and video Q is computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between video P and video Q, which is solved by a linear programming problem.
(3) Beneficial effects
As can be seen from the above technical solution, the present invention has the following advantages:
1. In the video recognition method provided by the invention, the choice of interest point detector is varied and the choice of local features at the interest points is also flexible; if faster and more robust interest point detection methods or local feature extraction methods appear in the future, they can easily be added to the system, further improving system performance.
2. The number of interest points extracted directly from a video is usually very large and includes complex background information, which seriously disturbs subsequent processing and reduces classification accuracy. Because the video recognition method provided by the invention uses the human attention mechanism to select interest points, highlighting those interest points that contribute most to event recognition, it significantly reduces the interference of background information with the classification process, makes the extracted features more targeted, and can significantly improve recognition accuracy.
3. Traditional feature fusion methods such as early fusion and late fusion are all bottom-up; we use the human attention mechanism to fuse the static and dynamic features of video in a top-down manner, and the fusion efficiency is significantly improved.
The present invention fuses the appearance and motion features of video in a top-down manner according to the human attention mechanism. This fusion method requires no parameter setting, combines well the advantages of early fusion and late fusion, and significantly improves recognition efficiency. The invention also overcomes the shortcoming of traditional event recognition methods that require techniques such as background subtraction, target tracking, and detection, and has good application prospects.
Description of drawings
Fig. 1 is the flowchart of the video event recognition method based on the top-down motion attention mechanism of the present invention;
Fig. 1a-Fig. 1d show examples of interest point detection and optical flow on video frame images according to the present invention;
Fig. 2 is a system architecture diagram of the present invention.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
The execution environment of the present invention is a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory, on which an efficient video event recognition algorithm was implemented in Matlab and C; other execution environments can also be adopted and are not described here.
The overall framework of the system scheme of the present invention is shown in Fig. 2. The video event recognition task based on the top-down motion attention mechanism is realized on a computer and contains five main modules:
Interest point detection module 1: this module divides the video database into a training set (training videos) and a test set (test videos), and uses the difference-of-Gaussians detector to detect the interest points of each frame of the training and test videos.
Feature extraction module 2: its input is connected to the output of the interest point detection module 1; its main function is to extract, on the basis of the interest point detection module 1, the SIFT descriptor feature and the optical flow feature of each interest point.
Vocabulary building module 3: its input is connected to the output of the feature extraction module 2; it clusters the obtained SIFT descriptors and optical flow features on the training data and builds an appearance vocabulary and a motion vocabulary respectively;
Motion-information-based attention histogram building module 4: its input is connected to the outputs of the feature extraction module 2 and the vocabulary building module 3; according to the training data, it computes the probability of each motion word in the motion vocabulary with respect to each specific event class, and obtains the attention histogram based on motion information from these probabilities and the appearance words in the appearance vocabulary.
Classification module 5: its input is connected to the output of the motion-information-based attention histogram building module 4; it receives the motion-attention histogram features of the videos, computes the similarity of any two videos with the Earth Mover's Distance, generates the kernel matrix, trains a support vector machine classifier on the training set to obtain the classifier parameters, classifies the test set with the trained support vector machine classifier model, and outputs the classification results of the test video set. The events in our recognition task are "Exiting Car, Handshaking, Running, Demonstration Or Protest, Walking, Riot, Dancing, Shooting, People Marching".
The flowchart of the video event recognition method based on the top-down motion attention mechanism is shown in Fig. 1. Detailed explanations of each technical issue involved in the technical solution of this invention are given below.
(1) Interest point detection
There are many choices for the interest point extraction method, such as: Harris corners (Harris), Harris-Laplace interest points (Harris Laplace), Hessian-Laplace interest points (Hessian Laplace), Harris-affine interest points (Harris Affine), Hessian-affine interest points (Hessian Affine), maximally stable extremal regions (Maximally Stable Extremal Regions, MSER), speeded-up robust features (Speeded Up Robust Features, SURF) interest points, grid points (Grid), and so on.
A video V is denoted V = {I_i}, i \in {1, 2, ..., N}. For each frame I_i of the video, local extrema in the difference-of-Gaussians (DoG, Difference of Gaussians) scale space are detected as interest points.
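As an illustration only, the following is a minimal sketch of the per-frame interest point detection described above, using OpenCV's SIFT detector (whose keypoint stage is a DoG scale-space extrema detector); the video file name and the feature count are illustrative assumptions, not values fixed by the invention.

```python
import cv2

def detect_interest_points(frame_bgr, n_features=500):
    """Detect interest points in one frame as DoG scale-space extrema
    (the keypoint stage of OpenCV's SIFT detector)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nfeatures=n_features)
    return sift.detect(gray, None)

# Iterate over the frames I_i of one video V (file name is hypothetical).
cap = cv2.VideoCapture("example_video.avi")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    keypoints = detect_interest_points(frame)
    # keypoints[j].pt is the (x, y) location of interest point d_j in this frame
cap.release()
```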
(2) Feature extraction
Next, local image features are extracted at the interest points. Alternative local feature extraction methods include: scale-invariant feature transform (Scale Invariant Feature Transform, SIFT), speeded-up robust features (Speeded Up Robust Features, SURF), shape context descriptors (Shape Context, SC), and so on.
We use 128-dimensional SIFT to represent the appearance feature of an interest point and, for the detected interest points, compute the optical flow of this sparse feature set with the iterative pyramidal Lucas-Kanade method. Fig. 1a to Fig. 1d give examples of detected interest points and optical flow vectors on some video frames.
With the k-means clustering method or another clustering method, the detected interest points are clustered separately according to their appearance and motion features into two vocabularies: w^m (motion words) and w^v (appearance words); each cluster center is defined as a word.
Under a polar coordinate system the optical flow can be represented by its magnitude Mag and orientation Orient; in the two-dimensional motion field every motion vector contains both magnitude and orientation cues. The magnitude reflects the spatial amplitude of the motion, and the orientation reflects the trend of the motion. We therefore have two types of motion words: motion-magnitude words w^{Mag} and motion-orientation words w^{Orient}.
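As a rough sketch of this feature extraction and vocabulary construction, the code below computes 128-dimensional SIFT descriptors and sparse pyramidal Lucas-Kanade optical flow at the interest points of a frame, then clusters pooled features into vocabularies with k-means; OpenCV and scikit-learn stand in for the unspecified implementation, and the vocabulary sizes are illustrative assumptions.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def frame_features(gray, next_gray, sift):
    """128-D SIFT descriptors (appearance) and sparse pyramidal Lucas-Kanade
    optical flow (motion) at the DoG interest points of frame `gray`."""
    kps = sift.detect(gray, None)
    kps, desc = sift.compute(gray, kps)                    # 128-D SIFT per interest point
    pts = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
    nxt_pts, status, _ = cv2.calcOpticalFlowPyrLK(gray, next_gray, pts, None)
    flow = (nxt_pts - pts).reshape(-1, 2)                  # per-point optical flow vectors
    mag = np.linalg.norm(flow, axis=1)                     # Mag: motion magnitude
    orient = np.arctan2(flow[:, 1], flow[:, 0])            # Orient: motion orientation
    return desc, mag, orient

def build_vocabulary(samples, n_words):
    """Cluster pooled feature samples; each cluster center is one word."""
    return KMeans(n_clusters=n_words, n_init=10).fit(samples)

# e.g. appearance_vocab  = build_vocabulary(all_sift_descriptors, n_words=1000)
#      magnitude_vocab   = build_vocabulary(all_magnitudes.reshape(-1, 1), n_words=20)
#      orientation_vocab = build_vocabulary(all_orientations.reshape(-1, 1), n_words=20)
```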
(3) Building the attention histogram based on motion information
As can be seen from Fig. 1a-Fig. 1d, the number of interest points extracted from a video frame is often very large and includes complex background information and information irrelevant to our event recognition task; the presence of such information seriously disturbs the subsequent processing. The present invention uses the human attention mechanism to select and weight the interest points. Biological and psychological research has shown that humans always actively focus their attention on specific regions that produce novel stimuli or the stimuli people expect, called the focus of attention or salient regions. Visual saliency comprises bottom-up and top-down modes: the former is data-driven, and the latter is knowledge- or task-driven. The top-down attention mechanism is used to highlight those interest points that contribute much to event recognition and to ignore, as far as possible, the interest points irrelevant to the recognition task.
Each frame I_i of a video can be represented by

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^m_{d_j}) \, \delta(w^v_{d_j}, w^v),

where C is the class label of an event, c \in {1, 2, ...}, \delta is the indicator function, and w^m_{d_j} and w^v_{d_j} are respectively the motion and appearance word indices of interest point d_j;
From the above formula we can see that the function of the scale-invariant feature transform (SIFT) descriptor feature is to describe the "what" aspect of an event, while the motion feature serves two purposes: on the one hand it describes the "how" aspect of the event, and on the other hand it acts as an attention cue that guides recognition of the corresponding event class.
Two types of attention histograms can be built, for motion magnitude and motion orientation:

the motion-magnitude attention histogram based on visual words (MMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Mag}_{d_j} is the motion-magnitude word index of interest point d_j;

the motion-orientation attention histogram based on visual words (OMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Orient}_{d_j} is the motion-orientation word index of interest point d_j;

if the magnitude and orientation of the optical flow are considered together, the motion attention histogram based on the bag of visual words (MOMA-BoW) is

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),
and for each video event class c \in C, the probability of each motion word with respect to each class can be obtained by the Bayes rule:

P(C = c | w^m) = \frac{P(w^m | C = c) P(C = c)}{P(w^m)},

P(w^m | C = c) = \frac{1}{\|T_{c+}\|} \sum_{w_{d_j} \in T_{c+}} \delta(w^m_{d_j}, w^m),

P(w^m) = \frac{1}{\|T_c\|} \sum_{w_{d_j} \in T_c} \delta(w^m_{d_j}, w^m),

where T_{c+} is the set of all videos belonging to class c, T_c is the set of all training samples, and \|\cdot\| denotes the number of interest points.
As can be seen from the formula of the attention histogram based on motion information, the motion information is implicit in the representation of the video and also serves as the weight of the appearance (SIFT) features. In particular, for a given motion word, its probabilities with respect to different event classes are different; that is, the same motion word contributes differently to the recognition of different classes. For example, when we classify the event "Running", among all detected interest points the motion words that really describe the "run" action should be given larger weights. On the other hand, for events such as "Riot", motion information is not very relevant; the probability of each motion word with respect to such a class is then essentially the same, and the bag-of-words model degenerates to its most basic form.
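A possible sketch of this construction is given below, assuming that appearance, magnitude, and orientation word indices have already been assigned to every interest point; array shapes and helper names are illustrative, not part of the invention.

```python
import numpy as np

def motion_word_probabilities(train_videos, n_classes, n_motion_words):
    """Estimate P(C=c | w^m) by the Bayes rule from the motion-word indices of
    all interest points in the training videos.
    `train_videos` is a list of (motion_word_indices, class_label) pairs."""
    counts = np.zeros((n_classes, n_motion_words))
    for motion_words, label in train_videos:
        counts[label] += np.bincount(motion_words, minlength=n_motion_words)
    class_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    p_w_given_c = counts / class_totals                         # P(w^m | C=c)
    p_c = counts.sum(axis=1) / counts.sum()                     # P(C=c)
    p_w = counts.sum(axis=0) / counts.sum()                     # P(w^m)
    # Bayes rule: P(C=c | w^m) = P(w^m | C=c) P(C=c) / P(w^m)
    return (p_w_given_c * p_c[:, None]) / np.maximum(p_w, 1e-12)

def moma_bow_histogram(app_words, mag_words, orient_words, c,
                       p_c_given_mag, p_c_given_orient, n_app_words):
    """MOMA-BoW histogram n(w^v | I_i, C=c): each interest point votes for its
    appearance word, weighted by P(C=c | w^Mag) * P(C=c | w^Orient)."""
    weights = p_c_given_mag[c, mag_words] * p_c_given_orient[c, orient_words]
    return np.bincount(app_words, weights=weights, minlength=n_app_words)
```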
(4) Event recognition
Given a video V, after the motion attention histogram feature p_i of the i-th frame based on the bag of visual words is obtained, the video can be expressed as V = {(p_1, w_{p_1}), (p_2, w_{p_2}), ..., (p_m, w_{p_m})}, where the weight w_{p_i} of the i-th frame satisfies \sum_{i=1}^{m} w_{p_i} = 1; here the default value 1/m is adopted. The Earth Mover's Distance (EMD) is used to measure the distance between two video sequences. Any two videos P and Q can be expressed respectively as P = {(p_1, w_{p_1}), ..., (p_m, w_{p_m})} and Q = {(q_1, w_{q_1}), ..., (q_n, w_{q_n})}, where p_i and q_j are the histogram features of video P and video Q, w_{p_i} and w_{q_j} are the weights of the i-th frame of video P and the j-th frame of video Q, and m and n are the numbers of frames of video P and video Q. The similarity of video P and video Q can be computed by the formula below. The EMD is robust to temporal shift and duration variation: the former means that the start frames of one video may match the end frames of another video, and the latter means that one frame of a video may match multiple frames of another video.
The similarity of video P and video Q can be computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between videos P and Q, obtained by solving the linear programming problem

min  WORK(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}
s.t.
f_{ij} \geq 0,
\sum_{j=1}^{n} f_{ij} \leq w_{p_i},
\sum_{i=1}^{m} f_{ij} \leq w_{q_j},
\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{q_j} \right).
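The following is a small sketch of this EMD computation via a linear program, using scipy.optimize.linprog; it is one possible implementation under the stated constraints, not necessarily the implementation used in the embodiment.

```python
import numpy as np
from scipy.optimize import linprog

def emd_distance(P, Q, wp=None, wq=None):
    """Earth Mover's Distance between two videos given as per-frame histogram
    matrices P (m x d) and Q (n x d), with frame weights wp, wq
    (defaulting to uniform 1/m and 1/n)."""
    m, n = len(P), len(Q)
    wp = np.full(m, 1.0 / m) if wp is None else wp
    wq = np.full(n, 1.0 / n) if wq is None else wq
    # d_ij: Euclidean distance between frame histograms p_i and q_j (row-major flatten)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2).ravel()

    # Inequality constraints: row sums of f <= wp, column sums of f <= wq
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([wp, wq])
    # Equality constraint: total flow = min(sum wp, sum wq)
    A_eq = np.ones((1, m * n))
    b_eq = [min(wp.sum(), wq.sum())]

    res = linprog(d, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    f = res.x
    return float(d @ f / f.sum())
```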
Next, a support vector machine is used as the classifier, with the "one-versus-rest" classification strategy.
Since 9 events need to be recognized, 9 classifiers are trained; in each classifier the samples of one event class serve as the positive class and the remaining samples as the negative class. The Earth Mover's Distance between videos is embedded into the Gaussian kernel function of the support vector machine classifier:
K(P, Q) = \exp\!\left( -\frac{D(P, Q)}{\lambda M} \right),

where M is a normalization factor, obtained from the average Earth Mover's Distance over all training data, and \lambda is a scale factor that can be determined empirically by cross-validation.
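A minimal sketch of the kernel construction and "one-versus-rest" SVM classification follows, assuming the pairwise EMD matrices have already been computed; scikit-learn's OneVsRestClassifier with a precomputed-kernel SVC stands in for the classifier described above, and the scale factor λ is left as a tunable parameter.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def emd_kernel(dist_matrix, mean_train_emd, lam=1.0):
    """Gaussian kernel K = exp(-D / (lam * M)), with M the mean EMD over the
    training data (normalization factor) and lam a cross-validated scale."""
    return np.exp(-dist_matrix / (lam * mean_train_emd))

def train_and_classify(D_train, y_train, D_test, lam=1.0):
    """D_train: pairwise EMD between training videos (square matrix);
    D_test: EMD between each test video and each training video;
    y_train: labels of the 9 event classes."""
    M = D_train.mean()
    K_train = emd_kernel(D_train, M, lam)          # (n_train, n_train)
    K_test = emd_kernel(D_test, M, lam)            # (n_test, n_train)
    clf = OneVsRestClassifier(SVC(kernel="precomputed"))
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```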
The above is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person familiar with this technology can conceive within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A video event recognition method based on a top-down motion attention mechanism, comprising the steps of:
Step S1: using a difference-of-Gaussians detector, detect on a computer the interest points of each frame of every video in a video set, the video set comprising a training video set and a test video set;
Step S2: extract appearance features and motion features at the detected interest points of each frame, the appearance features being scale-invariant feature transform (SIFT) descriptor features and the motion features being optical flow features;
Step S3: cluster the obtained SIFT descriptor features and optical flow features, and build an appearance vocabulary and a motion vocabulary respectively;
Step S4: calculate, on the training video set, the probability of each motion word with respect to each event class and build an attention histogram based on motion information;
Step S5: using the motion-attention histogram features of the video set, compute with the Earth Mover's Distance the similarities between training videos and between training videos and test videos, and generate a kernel matrix;
Step S6: train a support vector machine classifier with the obtained kernel matrix to obtain the classifier parameters, classify the test video set with the trained support vector machine classifier model, and output the classification results of the test video set.
2. The video event recognition method according to claim 1, characterized in that the interest points of each frame are extracted using one of the following detectors: Harris corners, Harris-Laplace interest points, Hessian-Laplace interest points, Harris-affine interest points, Hessian-affine interest points, maximally stable extremal regions, speeded-up robust features (SURF) interest points, grid points, or the difference-of-Gaussians detector.
3. The video event recognition method according to claim 1, characterized in that the step of building the attention histogram based on motion information comprises:
Step S41: each frame I_i of a video in the video set is represented by

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^m_{d_j}) \, \delta(w^v_{d_j}, w^v),

where n(\cdot) is the histogram representation of the i-th frame I_i, w^v is an appearance word, w^m is a motion word, C is the class label of an event, c \in {1, 2, ...}, P(C = c | w^m_{d_j}) is the probability that motion word w^m_{d_j} belongs to class c, \delta is the indicator function, and w^m_{d_j} and w^v_{d_j} are respectively the motion and appearance word indices of interest point d_j;
Step S42: two types of attention histograms are built, for motion magnitude and motion orientation:

the motion-magnitude attention histogram based on visual words (MMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Mag}_{d_j} is the motion-magnitude word index of interest point d_j;

the motion-orientation attention histogram based on visual words (OMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v),

where w^{Orient}_{d_j} is the motion-orientation word index of interest point d_j;
Step S43: considering the magnitude and orientation of the optical flow together, the motion attention histogram based on the bag of visual words (MOMA-BoW) is expressed as

n(w^v | I_i, C = c) = \sum_{j=1}^{\|I_i\|} P(C = c | w^{Mag}_{d_j}) \, P(C = c | w^{Orient}_{d_j}) \, \delta(w^v_{d_j}, w^v).
4. The video event recognition method according to claim 3, characterized in that, for each class c \in C in the training video set, the probability P(C = c | w^m) of each motion word w^m with respect to each class is obtained by the Bayes rule:

P(C = c | w^m) = \frac{P(w^m | C = c) P(C = c)}{P(w^m)},

P(w^m | C = c) = \frac{1}{\|T_{c+}\|} \sum_{w_{d_j} \in T_{c+}} \delta(w^m_{d_j}, w^m),

P(w^m) = \frac{1}{\|T_c\|} \sum_{w_{d_j} \in T_c} \delta(w^m_{d_j}, w^m),

where T_{c+} is the set of all training videos belonging to class c, T_c is the set of all training samples, and \|\cdot\| denotes the number of interest points.
5. The video event recognition method according to claim 1, characterized in that the Earth Mover's Distance is used to measure the distance between two video sequences of the video set; any two videos P and Q are expressed respectively as

P = {(p_1, w_{p_1}), (p_2, w_{p_2}), ..., (p_m, w_{p_m})},   Q = {(q_1, w_{q_1}), (q_2, w_{q_2}), ..., (q_n, w_{q_n})},

where p_i and q_j are the histogram features of video P and video Q, w_{p_i} and w_{q_j} are the weights of the i-th frame of video P and the j-th frame of video Q, and m and n are the numbers of frames of video P and video Q; the similarity D(P, Q) of video P and video Q is computed by

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}},

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between video P and video Q, which is solved by a linear programming problem.
CN 201010591513 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism Active CN102034096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010591513 CN102034096B (en) 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010591513 CN102034096B (en) 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism

Publications (2)

Publication Number Publication Date
CN102034096A true CN102034096A (en) 2011-04-27
CN102034096B CN102034096B (en) 2013-03-06

Family

ID=43886959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010591513 Active CN102034096B (en) 2010-12-08 2010-12-08 Video event recognition method based on top-down motion attention mechanism

Country Status (1)

Country Link
CN (1) CN102034096B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163290A (en) * 2011-05-16 2011-08-24 天津大学 Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
CN102930302A (en) * 2012-10-18 2013-02-13 山东大学 On-line sequential extreme learning machine-based incremental human behavior recognition method
CN103077401A (en) * 2012-12-27 2013-05-01 深圳市赛为智能股份有限公司 Method and system for detecting context histogram abnormal behaviors based on light streams
CN103093236A (en) * 2013-01-15 2013-05-08 北京工业大学 Movable terminal porn filtering method based on analyzing image and semantics
CN103226713A (en) * 2013-05-16 2013-07-31 中国科学院自动化研究所 Multi-view behavior recognition method
CN103366370A (en) * 2013-07-03 2013-10-23 深圳市智美达科技有限公司 Target tracking method and device in video monitoring
CN103854016A (en) * 2014-03-27 2014-06-11 北京大学深圳研究生院 Human body behavior classification and identification method and system based on directional common occurrence characteristics
CN104200235A (en) * 2014-07-28 2014-12-10 中国科学院自动化研究所 Time-space local feature extraction method based on linear dynamic system
CN104657468A (en) * 2015-02-12 2015-05-27 中国科学院自动化研究所 Fast video classification method based on images and texts
WO2015078134A1 (en) * 2013-11-29 2015-06-04 华为技术有限公司 Video classification method and device
CN103116896B (en) * 2013-03-07 2015-07-15 中国科学院光电技术研究所 Visual saliency model based automatic detecting and tracking method
CN105512606A (en) * 2015-11-24 2016-04-20 北京航空航天大学 AR-model-power-spectrum-based dynamic scene classification method and apparatus
CN105528594A (en) * 2016-01-31 2016-04-27 江南大学 Incident identification method based on video signal
CN108268597A (en) * 2017-12-18 2018-07-10 中国电子科技集团公司第二十八研究所 A kind of moving-target activity probability map construction and behavior intension recognizing method
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN109670174A (en) * 2018-12-14 2019-04-23 腾讯科技(深圳)有限公司 A kind of training method and device of event recognition model
CN110288592A (en) * 2019-07-02 2019-09-27 中南大学 A method of the zinc flotation dosing state evaluation based on probability semantic analysis model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033347A1 (en) * 2001-05-10 2003-02-13 International Business Machines Corporation Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
JP2006012012A (en) * 2004-06-29 2006-01-12 Matsushita Electric Ind Co Ltd Event extraction device, and method and program therefor
CN1945628A (en) * 2006-10-20 2007-04-11 北京交通大学 Video frequency content expressing method based on space-time remarkable unit
CN101894276A (en) * 2010-06-01 2010-11-24 中国科学院计算技术研究所 Training method of human action recognition and recognition method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030033347A1 (en) * 2001-05-10 2003-02-13 International Business Machines Corporation Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
JP2006012012A (en) * 2004-06-29 2006-01-12 Matsushita Electric Ind Co Ltd Event extraction device, and method and program therefor
CN1945628A (en) * 2006-10-20 2007-04-11 北京交通大学 Video frequency content expressing method based on space-time remarkable unit
CN101894276A (en) * 2010-06-01 2010-11-24 中国科学院计算技术研究所 Training method of human action recognition and recognition method

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163290B (en) * 2011-05-16 2012-08-01 天津大学 Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
CN102163290A (en) * 2011-05-16 2011-08-24 天津大学 Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
CN102930302A (en) * 2012-10-18 2013-02-13 山东大学 On-line sequential extreme learning machine-based incremental human behavior recognition method
CN102930302B (en) * 2012-10-18 2016-01-13 山东大学 Based on the incrementally Human bodys' response method of online sequential extreme learning machine
CN103077401A (en) * 2012-12-27 2013-05-01 深圳市赛为智能股份有限公司 Method and system for detecting context histogram abnormal behaviors based on light streams
CN103093236A (en) * 2013-01-15 2013-05-08 北京工业大学 Movable terminal porn filtering method based on analyzing image and semantics
CN103093236B (en) * 2013-01-15 2015-11-04 北京工业大学 A kind of pornographic filter method of mobile terminal analyzed based on image, semantic
CN103116896B (en) * 2013-03-07 2015-07-15 中国科学院光电技术研究所 Visual saliency model based automatic detecting and tracking method
CN103226713A (en) * 2013-05-16 2013-07-31 中国科学院自动化研究所 Multi-view behavior recognition method
CN103226713B (en) * 2013-05-16 2016-04-13 中国科学院自动化研究所 A kind of various visual angles Activity recognition method
CN103366370A (en) * 2013-07-03 2013-10-23 深圳市智美达科技有限公司 Target tracking method and device in video monitoring
CN103366370B (en) * 2013-07-03 2016-04-20 深圳市智美达科技股份有限公司 Method for tracking target in video monitoring and device
WO2015078134A1 (en) * 2013-11-29 2015-06-04 华为技术有限公司 Video classification method and device
US10002296B2 (en) 2013-11-29 2018-06-19 Huawei Technologies Co., Ltd. Video classification method and apparatus
CN103854016A (en) * 2014-03-27 2014-06-11 北京大学深圳研究生院 Human body behavior classification and identification method and system based on directional common occurrence characteristics
CN104200235A (en) * 2014-07-28 2014-12-10 中国科学院自动化研究所 Time-space local feature extraction method based on linear dynamic system
CN104657468B (en) * 2015-02-12 2018-07-31 中国科学院自动化研究所 The rapid classification method of video based on image and text
CN104657468A (en) * 2015-02-12 2015-05-27 中国科学院自动化研究所 Fast video classification method based on images and texts
CN105512606A (en) * 2015-11-24 2016-04-20 北京航空航天大学 AR-model-power-spectrum-based dynamic scene classification method and apparatus
CN105512606B (en) * 2015-11-24 2018-12-21 北京航空航天大学 Dynamic scene classification method and device based on AR model power spectrum
CN105528594B (en) * 2016-01-31 2019-01-22 江南大学 A kind of event recognition method based on vision signal
CN105528594A (en) * 2016-01-31 2016-04-27 江南大学 Incident identification method based on video signal
CN108268597A (en) * 2017-12-18 2018-07-10 中国电子科技集团公司第二十八研究所 A kind of moving-target activity probability map construction and behavior intension recognizing method
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN108764050B (en) * 2018-04-28 2021-02-26 中国科学院自动化研究所 Method, system and equipment for recognizing skeleton behavior based on angle independence
CN109670174A (en) * 2018-12-14 2019-04-23 腾讯科技(深圳)有限公司 A kind of training method and device of event recognition model
CN109670174B (en) * 2018-12-14 2022-12-16 腾讯科技(深圳)有限公司 Training method and device of event recognition model
CN110288592A (en) * 2019-07-02 2019-09-27 中南大学 A method of the zinc flotation dosing state evaluation based on probability semantic analysis model
CN110288592B (en) * 2019-07-02 2021-03-02 中南大学 Zinc flotation dosing state evaluation method based on probability semantic analysis model

Also Published As

Publication number Publication date
CN102034096B (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN102034096B (en) Video event recognition method based on top-down motion attention mechanism
Fenil et al. Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM
CN109034044B (en) Pedestrian re-identification method based on fusion convolutional neural network
Yang et al. STA-CNN: Convolutional spatial-temporal attention learning for action recognition
Pouyanfar et al. Automatic video event detection for imbalance data using enhanced ensemble deep learning
CN101894276B (en) Training method of human action recognition and recognition method
Gnouma et al. Stacked sparse autoencoder and history of binary motion image for human activity recognition
Soomro et al. Action localization in videos through context walk
Wang et al. Video event detection using motion relativity and feature selection
CN104268586A (en) Multi-visual-angle action recognition method
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
Jiang et al. An efficient attention module for 3d convolutional neural networks in action recognition
CN103886585A (en) Video tracking method based on rank learning
Kiruba et al. Hexagonal volume local binary pattern (H-VLBP) with deep stacked autoencoder for human action recognition
Pang et al. Predicting skeleton trajectories using a Skeleton-Transformer for video anomaly detection
Huang et al. Multilabel remote sensing image annotation with multiscale attention and label correlation
Yang et al. Bottom-up foreground-aware feature fusion for practical person search
Symeonidis et al. Neural attention-driven non-maximum suppression for person detection
Wang et al. Action recognition using linear dynamic systems
Sun et al. Exploiting deeply supervised inception networks for automatically detecting traffic congestion on freeway in China using ultra-low frame rate videos
Aakur et al. Action localization through continual predictive learning
Zhou et al. Learning semantic context feature-tree for action recognition via nearest neighbor fusion
Li et al. Video is graph: Structured graph module for video action recognition
Elharrouss et al. Mhad: multi-human action dataset
Ahmed Motion classification using CNN based on image difference

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant