CN106529426A - Visual human behavior recognition method based on Bayesian model - Google Patents

Visual human behavior recognition method based on Bayesian model

Info

Publication number
CN106529426A
CN106529426A (application number CN201610921854.2A)
Authority
CN
China
Prior art keywords
parameter
video
pattern
training video
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610921854.2A
Other languages
Chinese (zh)
Inventor
胡卫明
杨双
原春锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201610921854.2A
Publication of CN106529426A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a visual human behavior recognition method based on a Bayesian model, the method comprising the following steps: extracting features from a training video to form a low-level representation of the human behavior in the training video; building a hierarchical Bayesian model from those features to extract human behavior patterns at different scales from the training video and obtain a representation of human behavior based on high-level semantic information; embedding a max-margin mechanism to enable learning of a discriminative hierarchical Bayesian model; and learning the parameters of the discriminative hierarchical Bayesian model to determine those parameters. The invention also relates to a visual human behavior recognizer trained by this method. By introducing the max-margin mechanism into the recognition model and uniting it with the preceding representation model into a single discriminative hierarchical Bayesian model, the method can cope effectively with complex behavior backgrounds and achieve robust behavior recognition.

Description

A visual human behavior recognition method based on a Bayesian model
Technical field
The present invention relates to the field of computer vision, and more particularly to a visual human behavior recognition method based on a Bayesian model.
Background art
Visual human behavior recognition is an important research problem in computer vision, with great practical value in settings such as intelligent surveillance, advanced human-computer interaction, and film and animation production. A typical visual human behavior recognition method comprises two steps: (1) expressing the human behavior information in a video in a form such as a vector or a graph, yielding a representation of the visual human behavior; (2) feeding the resulting representation into a suitable classification method, such as a support vector machine, to complete classification and recognition.
At present, in much of the research on visual human behavior analysis, most methods carry out these two steps independently, i.e., in sequence and in separate stages. Because representation and recognition are decoupled, such methods can guarantee neither that the learned representation is optimal for the recognition method designed in the later step, nor that the chosen recognition method makes optimal use of the representation obtained in the earlier step.
On the other hand, Bayesian models directly model the relations among the data and describe the data distribution from a statistical point of view; they can overcome the inability of the traditional bag-of-words model to express the latent structure of features, and can often learn the essential characteristics of the data, so they too are widely applied in visual human behavior analysis. However, most current Bayesian methods are purely generative and ignore discriminative information.
Meanwhile, in current human behavior analysis tasks, discriminative methods based on the max-margin criterion, such as support vector machine methods, are mostly adopted to perform classification and recognition. Because these methods directly take as their optimization objective a classification loss that measures classification quality, they achieve good recognition results on many recognition tasks, including behavior analysis. In addition, such discriminative methods attain the final goal of optimal classification performance directly by solving an optimization problem, and their implementations are mature, so they are widely applied.
However, these existing behavior recognition methods generally focus only on the representation stage or on the recognition stage alone and cannot form a unified learning framework, so the representation result and the recognition result cannot mutually reinforce and regulate each other; their range of application is therefore considerably restricted.
Summary of the invention
In view of the above problems in the prior art, the present invention proposes a visual human behavior recognition method based on a Bayesian model which can cope effectively with complex behavior backgrounds and thereby achieve robust behavior recognition.
The visual human behavior recognition method based on a Bayesian model according to the present invention comprises the following steps:
Step 1: extract features from the training video to form a low-level representation of the human behavior in the training video;
Step 2: build a hierarchical Bayesian model from the features to extract human behavior patterns at different scales from the training video, obtaining a representation of human behavior based on high-level semantic information;
Step 3: embed a max-margin mechanism to enable learning of a discriminative hierarchical Bayesian model;
Step 4: learn the parameters of the discriminative hierarchical Bayesian model to determine those parameters.
Further, step 1 specifically includes the following:
Step 1a: based on the pixel value changes of the pixels in the training video, detect the salient points of human behavior in the training video;
Step 1b: build a descriptor centered on each salient point, forming a description of the local region centered on that point;
Step 1c: cluster all descriptors to form the corresponding visual words and visual dictionary, then build histogram vectors under the bag-of-words model, forming the low-level representation of human behavior in the training video.
Preferably, the descriptor is a 3D-SIFT descriptor.
Further, step 2 specifically includes the following:
Step 2a: draw a training video d ∈ {1, ..., M} according to the prior distribution Uniform(M) with parameter M, where M is the number of all training videos;
Step 2b: according to the global behavior pattern distribution with parameter θ_d, draw a global behavior pattern z_{d,n} = k, k = 1, ..., K, from the drawn training video d, where K is the number of distinct global behavior patterns;
Step 2c: according to the local behavior pattern distribution with parameter τ_k, conditioned on the drawn global behavior pattern z_{d,n} = k, draw a local behavior pattern h_{d,n} = r, r = 1, ..., R, where R is the number of distinct local behavior patterns;
Step 2d: according to the visual word distribution with parameter φ_r, conditioned on the drawn local behavior pattern h_{d,n} = r, draw a visual word w_{d,n} ∈ {1, ..., V}.
Preferably, the parameters θ_d, τ_k, and φ_r are given, respectively, a K-dimensional Dirichlet prior with parameter α, an R-dimensional Dirichlet prior with parameter γ, and a V-dimensional Dirichlet prior with parameter β.
Preferably, the global behavior pattern distribution and/or the local behavior pattern distribution and/or the visual word distribution are multinomial distributions.
Further, step 3 specifically includes the following:
Step 3a: take the mean frequency of occurrence of the global behavior patterns in each training video as the representation of that training video;
Step 3b: feed the representation into a linear classifier with parameter η_c to obtain the value of the discriminant function, where c = 1, ..., C indexes the classes and C is the number of classes;
Step 3c: compute the loss ζ_{d,c} based on the max-margin criterion, where the class indicator is positive when the true class of the video is c and negative otherwise;
Step 3d: introduce a latent variable λ_{d,c} corresponding to the loss ζ_{d,c}, and express the loss ζ_{d,c} in the form of a mixture distribution.
Further, step 4 specifically includes the following:
Step 4a: assign to the global behavior pattern and the local behavior pattern of each visual word in the training video a random integer value in the interval [1, K] and [1, R], respectively;
Step 4b: compute the posterior distributions of h_{d,n} = r, z_{d,n} = k, λ_{d,c}, and η_{d,c} (where η_{d,c} denotes the d-th element of the variable η_c), and sample each of them repeatedly in turn until convergence or until a predetermined number of sampling iterations is reached;
Step 4c: estimate the parameters θ_d, τ_k, and φ_r from the statistics obtained after averaging samples from their posterior distributions;
Step 4d: record the associated statistics for the inference process on test videos.
Further, the method also includes a step 5 of recognizing a test video, the step 5 specifically including the following:
Step 5a: assign to the global behavior pattern and the local behavior pattern of each visual word in the test video a random integer value in the interval [1, K] and [1, R], respectively;
Step 5b: combining the parameter values and the training-video statistics obtained in step 4 above, sample the global behavior pattern z_{d,n} and the local behavior pattern h_{d,n} of each visual word in the test video until the convergence condition is met or a predetermined number of sampling iterations is reached;
Step 5c: compute the mean frequency of occurrence of all global behavior patterns in the test video as the representation of the test video;
Step 5d: using the learned discriminant function parameters η_c, compute the score of the test video for each class, and assign the test video to the class with the maximum score, completing the recognition.
Preferably, the features extracted in step 1 are local features.
In the visual human behavior recognition method based on a Bayesian model of the present invention, a max-margin mechanism is introduced into the recognition model and united with the preceding representation model into a single discriminative hierarchical Bayesian model, thereby achieving joint learning of the training video representation and the recognition model parameters; the method can cope effectively with complex behavior backgrounds and thus achieve robust behavior recognition.
Description of the drawings
Fig. 1 is a flowchart of the visual human behavior recognition method based on a Bayesian model of the present invention;
Fig. 2 is a schematic diagram of the hierarchical Bayesian model of the present invention.
Detailed description of the embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art will understand that these embodiments serve only to explain the technical principles of the present invention and are not intended to limit its scope.
The implementation of the visual human behavior recognition method based on a Bayesian model of the present invention is not restricted to any particular hardware or programming language; it can be realized in any language. For example, the method of the present invention can be carried out on a computer with a 2.83 GHz central processor and 4 GB of memory, using a working program written in Matlab in combination with VC++.
Fig. 1 is a flowchart of the visual human behavior recognition method based on a Bayesian model of the present invention, and Fig. 2 is a schematic diagram of the hierarchical Bayesian model of the present invention. The method comprises the following steps:
Step 1: extract features from the training video to form a low-level representation of the human behavior in the training video;
Step 2: build a hierarchical Bayesian model from the features to extract human behavior patterns at different scales from the training video, obtaining a representation of human behavior based on high-level semantic information;
Step 3: embed a max-margin mechanism to enable learning of a discriminative hierarchical Bayesian model;
Step 4: learn the parameters of the discriminative hierarchical Bayesian model to determine those parameters.
In step 1, the extracted features are preferably local features of the human behavior in the training video. Global features may also be used; however, compared with global features, local features are generally less sensitive to noise and more robust, so local features are preferred here.
Specifically, step 1 comprises the following steps:
Step 1a: based on the pixel value changes of the pixels in the training video, detect the salient points of human behavior in the training video;
Step 1b: build a descriptor centered on each salient point, forming a description of the local region centered on that point;
Step 1c: cluster all descriptors to form the corresponding visual words and visual dictionary, then build histogram vectors under the bag-of-words model, forming the low-level representation of human behavior in the training video.
In the human behavior representation by bag-of-words histogram vectors, each training video, because it contains multiple visual words, is regarded as a visual document d, where d ∈ {1, ..., M} and M is the total number of visual documents, i.e., the total number of videos; a visual word is denoted w_{d,n}, n ∈ {1, ..., N_d}, where N_d is the total number of visual words in the whole training video.
In the present invention, the 3D-SIFT descriptor is preferably used as the descriptor.
An ordinary SIFT (scale-invariant feature transform) descriptor computes its feature values from gradients in the spatial dimensions of an image. The 3D-SIFT used here extends the usual 2D SIFT descriptor from images to video, covering the two spatial dimensions and the temporal dimension (three dimensions in total, XYT), so it can better capture appearance characteristics over time. In the present invention the 3D-SIFT descriptor is therefore preferred over 2D SIFT and other descriptors.
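As a minimal illustration of the step-1 pipeline (local descriptors such as 3D-SIFT clustered into a visual dictionary, then bag-of-words histogram vectors), consider the following Python sketch. The descriptor extraction itself is abstracted away, and all function and parameter names here are hypothetical illustrations, not part of the patent:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bow_representation(descriptor_sets, dictionary_size=500):
    """Cluster local descriptors (e.g., 3D-SIFT) from all training videos into
    a visual dictionary, then express each video as a visual-word histogram.

    descriptor_sets: list of (N_i x D) arrays, one array per training video.
    Returns (histograms, kmeans), where histograms has shape (M x V).
    """
    all_descriptors = np.vstack(descriptor_sets)
    # Step 1c: cluster all descriptors; each cluster center is a visual word.
    kmeans = KMeans(n_clusters=dictionary_size, n_init=5).fit(all_descriptors)
    histograms = []
    for desc in descriptor_sets:
        words = kmeans.predict(desc)                    # w_{d,n} for video d
        hist = np.bincount(words, minlength=dictionary_size).astype(float)
        histograms.append(hist / max(hist.sum(), 1.0))  # normalized histogram
    return np.array(histograms), kmeans
```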
Specifically, step 2 comprises the following steps:
Step 2a: draw a training video d ∈ {1, ..., M} according to the prior distribution Uniform(M) with parameter M, where M is the number of all training videos;
Step 2b: according to the global behavior pattern distribution with parameter θ_d, draw a global behavior pattern z_{d,n} = k, k = 1, ..., K, from the drawn training video d, where K is the number of distinct global behavior patterns;
Step 2c: according to the local behavior pattern distribution with parameter τ_k, conditioned on the drawn global behavior pattern z_{d,n} = k, draw a local behavior pattern h_{d,n} = r, r = 1, ..., R, where R is the number of distinct local behavior patterns;
Step 2d: according to the visual word distribution with parameter φ_r, conditioned on the drawn local behavior pattern h_{d,n} = r, draw a visual word w_{d,n} ∈ {1, ..., V}.
Steps 2b to 2d are repeated N_d times, until every visual word in training video d has been generated, where N_d is the number of visual words in training video d.
Preferably, a uniform distribution is adopted as the prior distribution Uniform(M), so that each training video initially has an equal chance of being drawn. Other distributions could also be used, but the uniform distribution expresses an unbiased, equal treatment of all training videos and is generally more reasonable.
Preferably, the parameters θ_d, τ_k, and φ_r are given, respectively, a K-dimensional Dirichlet prior with parameter α, an R-dimensional Dirichlet prior with parameter γ, and a V-dimensional Dirichlet prior with parameter β.
Preferably, in step 2b, given the current training video, the global behavior pattern distribution is the multinomial distribution Mult(z_{d,n} | θ) with parameter θ.
In step 2c, the local behavior pattern distribution under each global behavior pattern is the multinomial distribution Mult(h_{d,n} | τ, z_{d,n}).
Preferably, in step 2d, the conditional distribution of the visual word is a multinomial distribution, abbreviated Mult(w_{d,n} | h_{d,n}, φ).
In the present invention, the above parameters θ_d, τ_k, and φ_r can be obtained as follows.
According to the prior distribution p(θ | α, D) of the parameter θ of the global behavior pattern distribution in the currently drawn training video d, draw the global behavior pattern distribution variable θ_d of the current training video d, where θ is an M × K matrix whose rows give the distribution of global behavior patterns in each training video, α is a K-dimensional vector giving the parameter of the Dirichlet prior obeyed by θ, and K is the number of all global behavior patterns.
According to the prior distribution p(τ | γ, z = k) of the local behavior pattern distribution parameter τ given the global behavior pattern distribution, draw the local behavior pattern distribution variable τ_k of the current training video d, where τ is a K × R matrix whose rows give the distribution of local behavior patterns under each global behavior pattern, γ is an R-dimensional vector giving the parameter of the Dirichlet prior obeyed by τ, and R is the number of all local behavior patterns.
According to the prior distribution p(φ | β, h = r) of the visual word distribution parameter φ given the local behavior pattern, draw the visual word distribution variable φ_r under the current local behavior pattern, where h = r indicates that the current local behavior pattern takes the value r, φ is an R × V matrix whose rows give the distribution of visual words under each local behavior pattern, β is a V-dimensional vector giving the parameter of the Dirichlet prior obeyed by φ_r, and V is the size of the dictionary formed by the visual words.
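To make the generative process of steps 2a to 2d concrete, the following Python sketch draws one synthetic corpus from the hierarchical model under the stated Dirichlet priors. The dimensions and variable names mirror the notation above (θ_d, τ_k, φ_r, z_{d,n}, h_{d,n}, w_{d,n}), but the sketch is an illustrative assumption, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, R, V, N_d = 10, 5, 8, 100, 50        # videos, global/local patterns, dictionary size, words per video
alpha, gamma, beta = 0.5, 0.5, 0.1         # symmetric Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=M)  # rows theta_d: global-pattern mixtures (M x K)
tau   = rng.dirichlet(gamma * np.ones(R), size=K)  # rows tau_k: local-pattern mixtures (K x R)
phi   = rng.dirichlet(beta  * np.ones(V), size=R)  # rows phi_r: visual-word distributions (R x V)

corpus = []
for d in range(M):                         # step 2a: each training video d in turn
    words = []
    for n in range(N_d):                   # repeat steps 2b-2d N_d times
        z = rng.choice(K, p=theta[d])      # step 2b: global behavior pattern z_{d,n}
        h = rng.choice(R, p=tau[z])        # step 2c: local behavior pattern h_{d,n}
        w = rng.choice(V, p=phi[h])        # step 2d: visual word w_{d,n}
        words.append((z, h, w))
    corpus.append(words)
```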
Specifically, step 3 comprises the following steps:
Step 3a: take the mean frequency of occurrence of the global behavior patterns in each training video as the representation of that training video;
Step 3b: feed the representation into a linear classifier with parameter η_c to obtain the value of the discriminant function, where c = 1, ..., C indexes the classes and C is the number of classes;
Step 3c: compute the loss ζ_{d,c} based on the max-margin criterion, where the class indicator is positive when the true class of the video is c and negative otherwise;
Step 3d: introduce a latent variable λ_{d,c} corresponding to the loss ζ_{d,c}, and express the loss ζ_{d,c} in the form of a mixture distribution.
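As a hedged sketch of the step-3 quantities, the Python below computes the linear discriminant scores from a video's mean global-pattern frequencies and a per-class hinge-style max-margin loss. The exact loss form (one-vs-rest with labels +1/-1) and the name zbar_d are assumptions consistent with the text, since the patent's own formulas survive only as figures:

```python
import numpy as np

def discriminant_scores(zbar_d, eta):
    """Step 3b: score of video d for each class c, eta_c^T zbar_d.
    zbar_d: (K,) mean frequency of each global behavior pattern in video d.
    eta:    (C, K) linear classifier parameters, one row eta_c per class."""
    return eta @ zbar_d

def max_margin_loss(zbar_d, eta, true_class, margin=1.0):
    """Step 3c: hinge loss zeta_{d,c} for each class, with label +1 for the
    true class and -1 otherwise (a one-vs-rest form assumed here)."""
    scores = discriminant_scores(zbar_d, eta)
    y = -np.ones(len(scores))
    y[true_class] = 1.0
    return np.maximum(0.0, margin - y * scores)  # zeta_{d,c}, one entry per class
```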
Specifically, step 4 comprises the following steps:
Step 4a: assign to the global behavior pattern and the local behavior pattern of each visual word in the training video a random integer value in the interval [1, K] and [1, R], respectively;
Step 4b: compute the posterior distributions of h_{d,n} = r, z_{d,n} = k, λ_{d,c}, and η_{d,c}, and sample each of them repeatedly in turn, until convergence or until a predetermined number of sampling iterations is reached.
Here, the posterior probability distribution of h_{d,n} = r is computed, and h_{d,n} is sampled from it, as follows:
In the formula, the superscript "-" indicates that the current (n-th) visual word is excluded from the statistics, and D denotes the whole training set. N_k^- denotes the number of visual words, other than the current one, whose global behavior pattern value is k; N_r^- denotes the number whose local behavior pattern value is r; N_{kr}^- denotes the number whose global behavior pattern is k and whose local behavior pattern is simultaneously r; and N_{rw}^- denotes the number whose local behavior pattern is r and whose own visual word value is simultaneously w.
The posterior probability distribution of z_{d,n} = k is computed in a similar way, and z_{d,n} is sampled from it:
Here, N_{dk}^- denotes the number of words in document d, excluding the current word, whose global behavior pattern is k; N_d^- denotes the number of all visual words in training video d except the current one; and η_{c,k} denotes the k-th element of the vector η_c.
The posterior probability of λ_{d,c} is computed as follows, and λ_{d,c} is sampled from it:
Here, GIG(x; q, b, g) indicates that the variable x obeys a generalized inverse Gaussian distribution with parameters q, b, and g.
Finally, the posterior distribution of the variable η is computed, and η is sampled from it.
The above steps are repeated, sampling the variables h_{d,n}, z_{d,n}, λ_{d,c}, and η_{d,c} in turn, until convergence or until a predetermined number of sampling iterations is reached. For example, sampling may be stopped when the relative change of the sampled variables falls below 1e-7, or the limit may simply be set at around 100 iterations.
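A skeletal Python rendering of this step-4 sampling loop may help. It maintains the count tables named above and spells out only the h_{d,n} update, using the collapsed-Gibbs form reconstructed earlier with symmetric hyperparameters; the z_{d,n}, λ_{d,c}, and η updates are left as stubs, and every name is a hypothetical stand-in rather than the patent's code:

```python
import numpy as np

def gibbs_train(corpus_words, z, h, K, R, V, gamma, beta, max_iters=100, rng=None):
    """corpus_words[d][n] = visual word id; z[d][n], h[d][n] = current global
    and local pattern assignments (step 4a: random integers in [0,K) / [0,R))."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Count tables N_kr, N_rw, N_k, N_r used throughout step 4.
    N_kr = np.zeros((K, R)); N_rw = np.zeros((R, V))
    N_k = np.zeros(K); N_r = np.zeros(R)
    for d, words in enumerate(corpus_words):
        for n, w in enumerate(words):
            N_kr[z[d][n], h[d][n]] += 1; N_rw[h[d][n], w] += 1
            N_k[z[d][n]] += 1; N_r[h[d][n]] += 1
    for _ in range(max_iters):
        for d, words in enumerate(corpus_words):
            for n, w in enumerate(words):
                k, r = z[d][n], h[d][n]
                # Exclude the current word from the "-" statistics.
                N_kr[k, r] -= 1; N_rw[r, w] -= 1; N_k[k] -= 1; N_r[r] -= 1
                # Collapsed-Gibbs conditional for h_{d,n} (reconstructed form).
                p = ((N_kr[k, :] + gamma) / (N_k[k] + R * gamma)
                     * (N_rw[:, w] + beta) / (N_r + V * beta))
                r = rng.choice(R, p=p / p.sum())
                h[d][n] = r
                N_kr[k, r] += 1; N_rw[r, w] += 1; N_k[k] += 1; N_r[r] += 1
                # The z_{d,n}, lambda_{d,c}, and eta updates would follow here.
    return N_kr, N_rw, N_k, N_r
```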
Step 4c: estimate the parameters θ_d, τ_k, and φ_r from the statistics obtained after averaging samples from their posterior distributions;
Step 4d: record the associated statistics for the inference process on test videos. Specifically, the parameters θ_d, τ_k, and φ_r are estimated from the statistics obtained after averaging samples from their posterior distributions, and the statistics at that point are recorded, including N_{kr}, N_{rw}, N_k, and N_r, which denote, respectively, the number of visual words whose local behavior pattern is r under global behavior pattern k, the number of words whose visual word value is w under local behavior pattern r, the number of visual words whose global behavior pattern value is k, and the number of words whose local behavior pattern value is r.
The method of the present invention also includes a step of recognizing a test video, which specifically includes the following steps:
Step 5a: assign to the global behavior pattern and the local behavior pattern of each visual word in the test video a random integer value in the interval [1, K] and [1, R], respectively;
Step 5b: combining the parameter values and the training-video statistics obtained in step 4 above, sample the global behavior pattern z_{d,n} and the local behavior pattern h_{d,n} of each visual word in the test video until the convergence condition is met or a predetermined number of sampling iterations is reached; here,
(a) the global behavior patterns in the test video are sampled from their conditional distribution, in which the quantities involved are: the global behavior pattern corresponding to the n-th visual word in test video d; the data possessed by the current test video d; the local behavior pattern corresponding to the n-th visual word in test video d; the n-th visual word in test video d itself; and, within test video d, the total number of visual words, the number of visual words whose global behavior pattern value is k, the number whose global behavior pattern is k and whose local behavior pattern is simultaneously r, and the corresponding totals over all global behavior patterns in the test video. N_{k,r} and N_k are the associated training-set statistics recorded in step 4.
(b) the local behavior patterns in the test video are sampled from their conditional distribution, in which the remaining quantities denote, respectively, the number of visual words in the test video whose local behavior pattern value is r, and the number of words whose own value is w when the local behavior pattern value is r. N_r and N_{r,w} are the associated training-set statistics recorded in step 4.
Steps (a) and (b) are repeated, sampling in turn the global behavior patterns and local behavior patterns corresponding to the visual words in the test video, until the convergence condition is met or a predetermined number of sampling iterations is reached. For example, sampling may be stopped when the relative change of the sampled variables falls below 1e-7, or the limit may simply be set at around 100 iterations.
Step 5c: compute the mean frequency of occurrence of all global behavior patterns in the test video as the representation of the test video;
Step 5d: using the learned discriminant function parameters η_c, compute the score of the test video for each class, and assign the test video to the class with the maximum score, completing the recognition.
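The following Python sketch, under the same assumed notation as the earlier sketches, illustrates steps 5c and 5d: the test video's representation is the mean frequency of each global behavior pattern over its words after the step-5b sampling has converged, and the predicted class is the argmax of the learned linear scores:

```python
import numpy as np

def classify_test_video(z_test, eta, K):
    """z_test: sampled global-pattern ids for the test video's visual words.
    eta: (C, K) discriminant function parameters learned in step 4."""
    counts = np.bincount(np.asarray(z_test), minlength=K).astype(float)
    zbar = counts / max(counts.sum(), 1.0)  # step 5c: mean pattern frequencies
    scores = eta @ zbar                     # step 5d: score for each class
    return int(np.argmax(scores)), scores
```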
Through this step of recognizing a test video, the discriminative performance of the model established by the above steps can be assessed, and the model can then be improved accordingly.
In the visual human behavior recognition method based on a Bayesian model of the present invention, a max-margin mechanism is introduced into the recognition model and united with the preceding representation model into a single discriminative hierarchical Bayesian model, thereby achieving joint learning of the training video representation and the recognition model parameters; the method can cope effectively with complex behavior backgrounds and thus achieve robust behavior recognition.
The technical solution of the present invention has thus been described with reference to the preferred embodiments shown in the accompanying drawings; however, those skilled in the art will readily understand that the scope of protection of the present invention is clearly not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions all fall within the scope of protection of the present invention.

Claims (10)

1. A visual human behavior recognition method based on a Bayesian model, the method comprising the following steps:
Step 1: extracting features from the training video to form a low-level representation of the human behavior in the training video;
Step 2: building a hierarchical Bayesian model from the features to extract human behavior patterns at different scales from the training video, obtaining a representation of human behavior based on high-level semantic information;
Step 3: embedding a max-margin mechanism to enable learning of a discriminative hierarchical Bayesian model;
Step 4: learning the parameters of the discriminative hierarchical Bayesian model to determine those parameters.
2. The method according to claim 1, characterized in that step 1 includes:
Step 1a: based on the pixel value changes of the pixels in the training video, detecting the salient points of human behavior in the training video;
Step 1b: building a descriptor centered on each salient point, forming a description of the local region centered on that point;
Step 1c: clustering all descriptors to form the corresponding visual words and visual dictionary, then building histogram vectors under the bag-of-words model, forming the low-level representation of human behavior in the training video.
3. The method according to claim 2, characterized in that the descriptor is a 3D-SIFT descriptor.
4. The method according to claim 2, characterized in that step 2 includes:
Step 2a: drawing a training video d ∈ {1, ..., M} according to the prior distribution Uniform(M) with parameter M, where M is the number of all training videos;
Step 2b: according to the global behavior pattern distribution with parameter θ_d, drawing a global behavior pattern z_{d,n} = k, k = 1, ..., K, from the drawn training video d, where K is the number of distinct global behavior patterns;
Step 2c: according to the local behavior pattern distribution with parameter τ_k, conditioned on the drawn global behavior pattern z_{d,n} = k, drawing a local behavior pattern h_{d,n} = r, r = 1, ..., R, where R is the number of distinct local behavior patterns;
Step 2d: according to the visual word distribution with parameter φ_r, conditioned on the drawn local behavior pattern h_{d,n} = r, drawing a visual word w_{d,n} ∈ {1, ..., V}.
5. The method according to claim 4, characterized in that the parameters θ_d, τ_k, and φ_r are given, respectively, a K-dimensional Dirichlet prior with parameter α, an R-dimensional Dirichlet prior with parameter γ, and a V-dimensional Dirichlet prior with parameter β.
6. The method according to claim 4, characterized in that the global behavior pattern distribution and/or the local behavior pattern distribution and/or the visual word distribution are multinomial distributions.
7. The method according to claim 4, characterized in that step 3 includes:
Step 3a: taking the mean frequency of occurrence of the global behavior patterns in each training video as the representation of that training video;
Step 3b: feeding the representation into a linear classifier with parameter η_c to obtain the value of the discriminant function, where c = 1, ..., C indexes the classes and C is the number of classes;
Step 3c: computing the loss ζ_{d,c} based on the max-margin criterion, where the class indicator is positive when the true class of the video is c and negative otherwise;
Step 3d: introducing a latent variable λ_{d,c} corresponding to the loss ζ_{d,c}, and expressing the loss ζ_{d,c} in the form of a mixture distribution.
8. The method according to claim 7, characterized in that step 4 includes:
Step 4a: assigning to the global behavior pattern and the local behavior pattern of each visual word in the training video a random integer value in the interval [1, K] and [1, R], respectively;
Step 4b: computing the posterior distributions of h_{d,n} = r, z_{d,n} = k, λ_{d,c}, and η_{d,c}, and sampling each of them repeatedly in turn until convergence or until a predetermined number of sampling iterations is reached;
Step 4c: estimating the parameters θ_d, τ_k, and φ_r from the statistics obtained after averaging samples from their posterior distributions;
Step 4d: recording the associated statistics for the inference process on test videos.
9. The method according to claim 8, characterized in that the method includes a step 5 of recognizing a test video, the step 5 including:
Step 5a: assigning to the global behavior pattern and the local behavior pattern of each visual word in the test video a random integer value in the interval [1, K] and [1, R], respectively;
Step 5b: combining the parameter values and the training-video statistics obtained in step 4 above, sampling the global behavior pattern z_{d,n} and the local behavior pattern h_{d,n} of each visual word in the test video until the convergence condition is met or a predetermined number of sampling iterations is reached;
Step 5c: computing the mean frequency of occurrence of all global behavior patterns in the test video as the representation of the test video;
Step 5d: using the learned discriminant function parameters η_c, computing the score of the test video for each class, and assigning the test video to the class with the maximum score, completing the recognition.
10. The method according to any one of claims 1 to 9, characterized in that the features extracted in step 1 are local features.
CN201610921854.2A 2016-10-21 2016-10-21 Visual human behavior recognition method based on Bayesian model Pending CN106529426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610921854.2A CN106529426A (en) 2016-10-21 2016-10-21 Visual human behavior recognition method based on Bayesian model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610921854.2A CN106529426A (en) 2016-10-21 2016-10-21 Visual human behavior recognition method based on Bayesian model

Publications (1)

Publication Number Publication Date
CN106529426A true CN106529426A (en) 2017-03-22

Family

ID=58291503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610921854.2A Pending CN106529426A (en) 2016-10-21 2016-10-21 Visual human behavior recognition method based on Bayesian model

Country Status (1)

Country Link
CN (1) CN106529426A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103854027A (en) * 2013-10-23 2014-06-11 北京邮电大学 Crowd behavior identification method
CN103942533A (en) * 2014-03-24 2014-07-23 河海大学常州校区 Urban traffic illegal behavior detection method based on video monitoring system
CN104598889A (en) * 2015-01-30 2015-05-06 北京信息科技大学 Human action recognition method and device
CN104881651A (en) * 2015-05-29 2015-09-02 南京信息工程大学 Figure behavior identification method based on random projection and Fisher vectors
CN104966052A (en) * 2015-06-09 2015-10-07 南京邮电大学 Attributive characteristic representation-based group behavior identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHUANG YANG et al.: "Multi-Feature Max-Margin Hierarchical Bayesian Model for Action Recognition", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 *

Similar Documents

Publication Publication Date Title
CN108717568B (en) A kind of image characteristics extraction and training method based on Three dimensional convolution neural network
CN110414377B (en) Remote sensing image scene classification method based on scale attention network
CN110348399B (en) Hyperspectral intelligent classification method based on prototype learning mechanism and multidimensional residual error network
CN109086773A (en) Fault plane recognition methods based on full convolutional neural networks
CN109034210A (en) Object detection method based on super Fusion Features Yu multi-Scale Pyramid network
CN109255340A (en) It is a kind of to merge a variety of face identification methods for improving VGG network
CN109815826A (en) The generation method and device of face character model
CN109711426A (en) A kind of pathological picture sorter and method based on GAN and transfer learning
CN104268593A (en) Multiple-sparse-representation face recognition method for solving small sample size problem
CN114841257B (en) Small sample target detection method based on self-supervision comparison constraint
CN105160400A (en) L21 norm based method for improving convolutional neural network generalization capability
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN105243139A (en) Deep learning based three-dimensional model retrieval method and retrieval device thereof
CN108985929A (en) Training method, business datum classification processing method and device, electronic equipment
CN104809469A (en) Indoor scene image classification method facing service robot
CN110443286A (en) Training method, image-recognizing method and the device of neural network model
CN104298974A (en) Human body behavior recognition method based on depth video sequence
CN109063719A (en) A kind of image classification method of co-ordinative construction similitude and category information
CN113158861B (en) Motion analysis method based on prototype comparison learning
CN112784929A (en) Small sample image classification method and device based on double-element group expansion
CN103942571A (en) Graphic image sorting method based on genetic programming algorithm
CN113569895A (en) Image processing model training method, processing method, device, equipment and medium
CN109598671A (en) Image generating method, device, equipment and medium
CN108416397A (en) A kind of Image emotional semantic classification method based on ResNet-GCN networks
CN110443232A (en) Method for processing video frequency and relevant apparatus, image processing method and relevant apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-03-22)