CN109583360B - Video human body behavior identification method based on spatio-temporal information and hierarchical representation - Google Patents

Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Info

Publication number
CN109583360B
CN109583360B (application CN201811418871.XA)
Authority
CN
China
Prior art keywords
video
frame
spatio
representation
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811418871.XA
Other languages
Chinese (zh)
Other versions
CN109583360A (en)
Inventor
吴昱焜
李仲泓
衣杨
沈金龙
佘滢
朱艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201811418871.XA priority Critical patent/CN109583360B/en
Publication of CN109583360A publication Critical patent/CN109583360A/en
Application granted granted Critical
Publication of CN109583360B publication Critical patent/CN109583360B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of artificial intelligence, in particular to a video human body behavior identification method based on spatio-temporal information and hierarchical representation. The method makes full use of the spatio-temporal information in a video: the hierarchical spatio-temporal bundle divides the video motion into several parts and thereby obtains a higher-dimensional representation of the motion. Traditional video representation methods ignore the mid- and high-level semantic information of the video, attend only to the occurrence frequency of features, and exploit only zero-order information; against these shortcomings, the video representation based on the hierarchical spatio-temporal bundle effectively suppresses the background noise of the video, bridges the semantic gap between low-level and high-level features, and captures higher-order, more complex motion structure information. The hierarchical spatio-temporal bundle method extracts a more complex and more expressive video representation in a higher dimension and effectively improves video recognition performance.

Description

Video human body behavior identification method based on spatio-temporal information and hierarchical representation
Technical Field
The invention relates to the field of artificial intelligence, in particular to a video human body behavior identification method based on spatio-temporal information and hierarchical representation.
Background
Video human behavior recognition is a front-line artificial intelligence technology: a computer automatically analyzes video content in order to recognize and classify it, with wide applications in intelligent surveillance, human-computer interaction and video content retrieval. Concretely, machine learning is used to extract features from a video data set with labeled categories, a classifier is trained on those features, and unknown videos are then classified. To obtain a high human behavior recognition rate, features with strong expressive power must first be extracted. Ideal features should, first, be robust to human appearance and size, scene illumination, shooting angle and the like; second, they should carry rich scene context information, so that videos of different motion categories can be distinguished effectively.
In terms of feature extraction, current human behavior recognition techniques include video representation methods based on low-level features, hierarchical features and deep features. Low-level-feature methods can be divided into representations based on global features and representations based on local features, such as spatio-temporal interest points and trajectory features. Hierarchical-feature methods can be divided into scene-context-based and spatio-temporal-segment-based video representations. These techniques currently suffer from the following shortcomings:
1) Separation of foreground motion from background motion
In a surveillance setting with a fixed background and little optical-flow variation, human behavior recognition works well. In natural scenes, however, video is affected by many factors such as viewpoint change, camera shake, illumination, background occlusion, and rapid, irregular motion of background clutter.
2) Foreground feature extraction difficulties
Human motion video shot in natural scenes inevitably contains background illumination changes and camera motion. If features are extracted carelessly, a large amount of background noise is mixed in, causing redundant information, reducing the effectiveness of the extracted features and degrading the recognition result.
3) Video representation construction
Even two videos of the same motion category exhibit different motion patterns; each individual performs the action at a different speed, and the same action category is shot in different scenes and from different angles.
Disclosure of Invention
The invention provides a video human body behavior identification method based on spatio-temporal information and hierarchical representation.
In order to realize the purpose of the invention, the technical scheme is as follows:
a video human body behavior identification method based on spatio-temporal information and hierarchical representation comprises the following steps:
step S1: based on the overall optical flow of the video clip with the camera motion compensated, extracting the foreground motion optical flow and forming compensated trajectories;
step S2: through key-frame selection, filtering the video to obtain discriminative key frames;
step S3: sampling the compensated trajectories and training a Gaussian mixture model;
step S4: selecting key frames to obtain the video key-frame set, and FV-encoding the compensated trajectories with the Gaussian mixture model to form the key trajectory set;
step S5: applying segment partitioning and a ranking model to the whole video, and executing steps S1-S4 on the resulting video segments to obtain their hierarchical spatio-temporal bundle features;
step S6: taking the hierarchical spatio-temporal bundle as the video representation and as the input of a classifier, and obtaining the video class label after SVM classification.
Preferably, the specific steps of step S1 are as follows:
step S101: simulating the camera motion with a six-parameter affine model;
step S102: for the i-th pixel p_i = (x_i, y_i) of a video frame, the affine optical flow vector w_A(p_i) is expressed by equation 3:

w_A(p_i) = (u_A(p_i), v_A(p_i))    (3)

where u_A(p_i) = c_1(i) + a_1(i)x_i + a_2(i)y_i is the horizontal affine optical flow component, v_A(p_i) = c_2(i) + a_3(i)x_i + a_4(i)y_i is the vertical affine optical flow component, and θ = [c_1, c_2, a_1, …, a_4]^T is the parameter vector of the six-parameter affine model, in which c_1, c_2 are the camera translation parameters and a_1, a_2, a_3, a_4 are the camera rotation and zoom parameters; the position of point p_i in the next video frame is p_i', as shown in equation 4 with θ = [c a], and the objective function to be solved is shown in equation 5, where m is the number of feature points in the video frame;

[Equation 4, rendered as an image in the original: the position p_i' of point p_i in the next frame under the affine model]

[Equation 5, rendered as an image in the original: the objective function over the m feature points]

where Γ(θ) is the real-world displacement of the object after camera compensation, i.e. with the camera motion removed;

step S103: based on the real-time incremental multi-resolution Motion2D algorithm, computing the affine model parameter vector θ by incremental parameter estimation, defining the overall optical flow vector of pixel p_i = (x_i, y_i) as w(p_i) = (u(p_i), v(p_i)), and obtaining the compensated optical flow w_F(p_i) as shown in equation 6:

w_F(p_i) = w(p_i) - w_A(p_i)    (6)

The improved dense trajectories obtained by tracking the compensated optical flow w_F are defined as the compensated trajectories.
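By way of illustration, the following minimal Python sketch shows how a compensated optical flow of the kind defined in equation 6 could be computed for one frame pair. It is not the implementation of the invention: it substitutes OpenCV's RANSAC-based affine estimation for the Motion2D algorithm named above, assumes 8-bit grayscale frames, and the function name compensated_flow is an illustrative assumption.

```python
# Illustrative sketch of step S1 (camera-motion compensation of optical flow).
# Uses OpenCV's RANSAC affine fit in place of Motion2D; frames are grayscale.
import cv2
import numpy as np

def compensated_flow(prev_gray, next_gray):
    """Return w_F = w - w_A for one frame pair (equation 6)."""
    # Dense overall optical flow w(p) between the two frames.
    w = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                     0.5, 3, 15, 3, 5, 1.2, 0)

    # Sparse feature correspondences used to fit the global affine (camera) motion.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good_prev = pts[status.flatten() == 1].reshape(-1, 2)
    good_next = nxt[status.flatten() == 1].reshape(-1, 2)

    # Six-parameter affine model (translation c1, c2; rotation/zoom a1..a4).
    A, _ = cv2.estimateAffine2D(good_prev, good_next, method=cv2.RANSAC)

    # Affine (camera) flow w_A(p) at every pixel: transformed position minus position.
    h, w_px = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w_px].astype(np.float32)
    u_a = A[0, 2] + A[0, 0] * xs + A[0, 1] * ys - xs
    v_a = A[1, 2] + A[1, 0] * xs + A[1, 1] * ys - ys
    w_a = np.dstack([u_a, v_a])

    # Compensated (foreground) flow, later tracked to build compensated trajectories.
    return w - w_a
```

Tracking feature points along w_F over successive frames, in the manner of improved dense trajectories, would then yield the compensated trajectories used in the following steps.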
Preferably, the specific steps of step S2 are as follows:
step S201: calculating the temporal saliency and the spatial saliency of each input video frame respectively;
step S202: linearly combining the two saliencies to compute the saliency of each pixel, and defining the saliency value of each video frame as the sum of the saliency values of all pixel points in that frame;
step S203: selecting the video frames whose saliency is above the average, and filtering out the video frames with lower saliency.
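A minimal sketch of the key-frame selection of step S2 follows. The patent does not fix concrete saliency measures, so this sketch substitutes optical-flow magnitude for temporal saliency and Sobel gradient magnitude for spatial saliency; the function name select_key_frames and the weight alpha are illustrative assumptions.

```python
# Illustrative sketch of step S2 (key-frame selection by combined saliency).
import cv2
import numpy as np

def select_key_frames(gray_frames, alpha=0.5):
    """Keep the frames whose saliency exceeds the average frame saliency."""
    scores = []
    for t in range(1, len(gray_frames)):
        prev, cur = gray_frames[t - 1], gray_frames[t]
        # Temporal saliency proxy: per-pixel optical-flow magnitude.
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        temporal = np.linalg.norm(flow, axis=2)
        # Spatial saliency proxy: gradient magnitude of the current frame.
        gx = cv2.Sobel(cur, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(cur, cv2.CV_32F, 0, 1)
        spatial = np.sqrt(gx ** 2 + gy ** 2)
        # Linear combination per pixel, summed over the frame (step S202).
        pixel_saliency = alpha * temporal + (1.0 - alpha) * spatial
        scores.append(pixel_saliency.sum())
    scores = np.array(scores)
    # Step S203: keep frames whose saliency is above the mean.
    return np.where(scores > scores.mean())[0] + 1
```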
Preferably, the specific steps of step S3 are as follows:
step S301: randomly sampling the compensated trajectories to construct a Gaussian mixture model GMM and create a visual vocabulary dictionary;
step S302: according to the compensated trajectory features parsed from the video frames (Trj-HOG, Trj-HOF and Trj-MBH), estimating the probability density function of the feature-space points using FV coding;
step S303: fitting all feature points with Gaussian distributions to obtain the features of the key trajectory set, the GMM generative model being expressed as equation 7:

p(x | θ) = Σ_{k=1}^{K} π_k ζ(x; μ_k, Σ_k)    (7)

where K is the number of Gaussian kernels, θ = {π_k, μ_k, Σ_k : k = 1, …, K} denotes its parameter model, π_k, μ_k and Σ_k represent the prior mode probability, the mean vector and the covariance matrix respectively, and ζ(x; μ_k, Σ_k) denotes a D-dimensional Gaussian distribution, D being the dimensionality of the compensated trajectory features after dimension reduction.
Preferably, the specific steps of step S4 are as follows:
step S401: for an input video X, selecting key frames to obtain the video key-frame set;
step S402: performing FV coding to obtain the key trajectory set representation TB; if frame i precedes frame i+1, the temporal ordering relation is defined as TB_{i+1} > TB_i, and the linear function of equation 1 is defined:

[Equation 1, rendered as an image in the original: a linear ranking function of the frame features TB, parameterized by w]

Equation 1 is an ordinal regression problem in which the label y of a pair (x, y) denotes a rank rather than a scalar category; x is defined as a key-frame feature of video X, y as the index of the frame where x lies, and P as the set of video frame feature pairs P = {(TB_i, TB_j) : TB_i > TB_j}. Under the framework of structural risk minimization and the max-margin algorithm, the constrained objective function is defined as equation 2, where C is the penalty factor, ξ_ij are the slack variables, and w represents the spatio-temporal structure information between the video frame features TB and serves as the representation of video X:

min_w (1/2)‖w‖² + C Σ_{(i,j)∈P} ξ_ij   s.t.   wᵀ(TB_i - TB_j) ≥ 1 - ξ_ij,  ξ_ij ≥ 0    (2)
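For illustration, the ranking step of equation 2 can be approximated by the standard pairwise reduction of a ranking SVM to a linear SVM, as in the sketch below; scikit-learn's LinearSVC (a LIBLINEAR wrapper) stands in for a dedicated ranking solver, and the function name rank_pool is an assumption.

```python
# Illustrative sketch of step S4's ranking model (equation 2): learn a vector w
# that orders the FV-encoded key-frame features TB in time, via the standard
# pairwise reduction of a ranking SVM to a binary linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def rank_pool(TB, C=1.0):
    """TB: (T, dim) array of per-key-frame FV features, in temporal order.
    Returns the normalized w, used as the video (or segment) representation."""
    diffs, labels = [], []
    T = len(TB)
    for j in range(T):
        for i in range(j + 1, T):        # frame i is later than frame j
            diffs.append(TB[i] - TB[j])  # constraint w^T (TB_i - TB_j) >= 1 - xi
            labels.append(1)
            diffs.append(TB[j] - TB[i])  # mirrored pair so the SVM sees two classes
            labels.append(-1)
    clf = LinearSVC(C=C, fit_intercept=False, max_iter=10000)
    clf.fit(np.asarray(diffs), np.asarray(labels))
    w = clf.coef_.ravel()
    return w / (np.linalg.norm(w) + 1e-12)
```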
Preferably, the specific steps of step S5 are as follows:
s501: dividing the input video into repeated fixed length of16The video clip of (1);
s502: executing the steps S1-S4 on the segmented segments to obtain segment representation of each video segment;
s503: the video segment is subjected to dimensionality reduction and whitening by PCA, and then is used as the input of a LIBLINEAR tool set to obtain parameters according to a formula (2)wNormalized to the final video representation, i.e., a hierarchical spatio-temporal bundle.
Compared with the prior art, the invention has the beneficial effects that:
the invention fully utilizes the spatiotemporal information in the video, and the hierarchy spatiotemporal bundle divides the video motion into a plurality of parts, thereby obtaining the higher-dimensional representation of the video motion. Aiming at the defects that the traditional video representation method ignores the semantic information of the middle and high layers of the video, only focuses on the occurrence frequency of the features, utilizes only 0-order information and the like, the video representation method based on the hierarchical space-time beam can effectively eliminate the background noise interference of the video, make up the semantic gap between the bottom-layer features and the high-layer features, and can capture higher-order and more complex motion structure information. The hierarchical space-time beam method can extract more complex and expressive video representation on a higher dimension, and can effectively improve the effect of video identification.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a histogram of the average recognition accuracy of the recognition method of the present invention on the Hollywood2 movie data set.
FIG. 3 is a confusion matrix on UCF Sports data set by the recognition method of the present invention.
Fig. 4 is a confusion matrix on the HMDB51 data set by the identification method of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
A video human body behavior identification method based on spatio-temporal information and hierarchical representation comprises the following steps:
step S1: based on the overall optical flow of the video clip with the camera motion compensated, extracting the foreground motion optical flow and forming compensated trajectories;
step S2: through key-frame selection, filtering the video to obtain discriminative key frames;
step S3: sampling the compensated trajectories and training a Gaussian mixture model;
step S4: selecting key frames to obtain the video key-frame set, and FV-encoding the compensated trajectories with the Gaussian mixture model to form the key trajectory set;
step S5: applying segment partitioning and a ranking model to the whole video, and executing steps S1-S4 on the resulting video segments to obtain their hierarchical spatio-temporal bundle features;
step S6: taking the hierarchical spatio-temporal bundle as the video representation and as the input of a classifier, and obtaining the video class label after SVM classification.
Preferably, the specific steps of step S1 are as follows:
step S101: simulating the camera motion with a six-parameter affine model;
step S102: for the i-th pixel p_i = (x_i, y_i) of a video frame, the affine optical flow vector w_A(p_i) is expressed by equation 3:

w_A(p_i) = (u_A(p_i), v_A(p_i))    (3)

where u_A(p_i) = c_1(i) + a_1(i)x_i + a_2(i)y_i is the horizontal affine optical flow component, v_A(p_i) = c_2(i) + a_3(i)x_i + a_4(i)y_i is the vertical affine optical flow component, and θ = [c_1, c_2, a_1, …, a_4]^T is the parameter vector of the six-parameter affine model, in which c_1, c_2 are the camera translation parameters and a_1, a_2, a_3, a_4 are the camera rotation and zoom parameters; the position of point p_i in the next video frame is p_i', as shown in equation 4 with θ = [c a], and the objective function to be solved is shown in equation 5, where m is the number of feature points in the video frame;

[Equation 4, rendered as an image in the original: the position p_i' of point p_i in the next frame under the affine model]

[Equation 5, rendered as an image in the original: the objective function over the m feature points]

where Γ(θ) is the real-world displacement of the object after camera compensation, i.e. with the camera motion removed;

step S103: based on the real-time incremental multi-resolution Motion2D algorithm, computing the affine model parameter vector θ by incremental parameter estimation, defining the overall optical flow vector of pixel p_i = (x_i, y_i) as w(p_i) = (u(p_i), v(p_i)), and obtaining the compensated optical flow w_F(p_i) as shown in equation 6:

w_F(p_i) = w(p_i) - w_A(p_i)    (6)

The improved dense trajectories obtained by tracking the compensated optical flow w_F are defined as the compensated trajectories.
Preferably, the specific steps of step S2 are as follows:
step S201: calculating the temporal saliency and the spatial saliency of each input video frame respectively;
step S202: linearly combining the two saliencies to compute the saliency of each pixel, and defining the saliency value of each video frame as the sum of the saliency values of all pixel points in that frame;
step S203: selecting the video frames whose saliency is above the average, and filtering out the video frames with lower saliency.
Preferably, the specific steps of step S3 are as follows:
step S301: randomly sampling the compensated trajectories to construct a Gaussian mixture model GMM and create a visual vocabulary dictionary;
step S302: according to the compensated trajectory features parsed from the video frames (Trj-HOG, Trj-HOF and Trj-MBH), estimating the probability density function of the feature-space points using FV coding;
step S303: fitting all feature points with Gaussian distributions to obtain the features of the key trajectory set, the GMM generative model being expressed as equation 7:

p(x | θ) = Σ_{k=1}^{K} π_k ζ(x; μ_k, Σ_k)    (7)

where K is the number of Gaussian kernels, θ = {π_k, μ_k, Σ_k : k = 1, …, K} denotes its parameter model, π_k, μ_k and Σ_k represent the prior mode probability, the mean vector and the covariance matrix respectively, and ζ(x; μ_k, Σ_k) denotes a D-dimensional Gaussian distribution, D being the dimensionality of the compensated trajectory features after dimension reduction.
Preferably, the specific steps of step S4 are as follows:
step S401: for an input video X, selecting key frames to obtain the video key-frame set;
step S402: performing FV coding to obtain the key trajectory set representation TB; if frame i precedes frame i+1, the temporal ordering relation is defined as TB_{i+1} > TB_i, and the linear function of equation 1 is defined:

[Equation 1, rendered as an image in the original: a linear ranking function of the frame features TB, parameterized by w]

Equation 1 is an ordinal regression problem in which the label y of a pair (x, y) denotes a rank rather than a scalar category; x is defined as a key-frame feature of video X, y as the index of the frame where x lies, and P as the set of video frame feature pairs P = {(TB_i, TB_j) : TB_i > TB_j}. Under the framework of structural risk minimization and the max-margin algorithm, the constrained objective function is defined as equation 2, where C is the penalty factor, ξ_ij are the slack variables, and w represents the spatio-temporal structure information between the video frame features TB and serves as the representation of video X:

min_w (1/2)‖w‖² + C Σ_{(i,j)∈P} ξ_ij   s.t.   wᵀ(TB_i - TB_j) ≥ 1 - ξ_ij,  ξ_ij ≥ 0    (2)
Preferably, the specific steps of step S5 are as follows:
s501: dividing the input video into repeated fixed length of16The video clip of (a);
s502: executing the steps S1-S4 on the segmented segments to obtain segment representation of each video segment;
s503: the video segments are subjected to PCA dimensionality reduction and whitening, and then used as input of LIBLINEAR toolset to obtain parameters according to formula (2)wAnd after regularization, the video is used as a final video representation, namely a hierarchical spatio-temporal bundle.
Example 2
As shown in fig. 1 and fig. 2, the experimental environment of this embodiment is a single-machine Linux system (Ubuntu 16.04 LTS) with a 32-core 2.10 GHz CPU, 64 GB of memory and 15 TB of disk capacity. The experimental code is mainly C++ and Matlab, using open APIs and class libraries such as OpenCV 2.4.9, LibSVM, VLFeat, CUDA 8.0 and CAFFE.
The experimental data sets of this example are three standard benchmarks: UCF Sports, Hollywood2 and HMDB51. The evaluation metric for Hollywood2 is the mean Average Precision (mAP), and the evaluation metric for the other two data sets is the mean Average Accuracy (mAA); mAA and mAP are computed as in equations 10 and 11.
mAA = (1/R) Σ_{r=1}^{R} AA_r    (10)

mAP = (1/R) Σ_{r=1}^{R} AP_r    (11)

where R is the total number of behavior classes, and AA_r and AP_r denote the recognition accuracy and the average precision of the r-th behavior class, respectively.
The confusion matrix is defined by equation 12: each row of the matrix corresponds to the classification results of one behavior class and sums to 1, and the diagonal elements give the proportion of correctly classified samples, i.e. the accuracy of that behavior class.

[Equation 12, rendered as an image in the original: the definition of the confusion matrix]
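As an illustration of how these metrics could be computed, the short sketch below builds a row-normalized confusion matrix and the mean average accuracy with scikit-learn; it is a generic evaluation routine under the definitions above, not code from the patent.

```python
# Illustrative sketch of the evaluation: row-normalized confusion matrix (equation 12)
# and mean average accuracy (equation 10) from true and predicted class labels.
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred, n_classes):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes))).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)     # each row sums to 1
    per_class_acc = np.diag(cm)             # AA_r: diagonal of the confusion matrix
    mAA = per_class_acc.mean()              # equation 10
    return cm, mAA
```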
Fig. 2 shows, as a histogram, the recognition results of the proposed method on the Hollywood2 movie data set, with a mean average precision of 66.71%. Figs. 3 and 4 show the confusion matrices of the proposed method on the UCF Sports and HMDB51 data sets, with average recognition rates of 89.17% and 65.63%, respectively. The experimental results show that the proposed recognition method achieves a good recognition effect and a clear improvement over existing methods.
It should be understood that the above-described embodiments are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1. A video human body behavior identification method based on spatio-temporal information and hierarchical representation is characterized by comprising the following steps:
step S1: based on the overall optical flow of the video clip with the camera motion compensated, extracting the foreground motion optical flow and forming compensated trajectories;
step S2: through key-frame selection, filtering the video to obtain discriminative key frames;
step S3: sampling the compensated trajectories and training a Gaussian mixture model;
step S4: selecting key frames to obtain the video key-frame set, and FV-encoding the compensated trajectories with the Gaussian mixture model to form the key trajectory set;
step S5: applying segment partitioning and a ranking model to the whole video, and executing steps S1-S4 on the resulting video segments to obtain their hierarchical spatio-temporal bundle features;
step S6: taking the hierarchical spatio-temporal bundle as the video representation and as the input of a classifier, and obtaining the video class label after SVM classification;
the specific steps of step S4 are as follows:
step S401: for an input video clip, selecting a key frame to obtain a video key frame set;
step S402: performing FV coding to obtain the key trajectory set representation TB; if frame i precedes frame i+1, defining the temporal ordering relation as TB_{i+1} > TB_i and defining the linear function of equation 1:

[Equation 1, rendered as an image in the original: a linear ranking function of the frame features TB, parameterized by w]

equation 1 is an ordinal regression problem in which P is defined as the set of video frame feature pairs P = {(TB_i, TB_j) : TB_i > TB_j}; under the framework of structural risk minimization and the max-margin algorithm, the constrained objective function is defined as equation 2, where C is the penalty factor, ξ_ij are the slack variables, and w represents the spatio-temporal structure information between the video frame features TB and serves as the representation of video X:

min_w (1/2)‖w‖² + C Σ_{(i,j)∈P} ξ_ij   s.t.   wᵀ(TB_i - TB_j) ≥ 1 - ξ_ij,  ξ_ij ≥ 0    (2)
2. The video human body behavior recognition method based on spatio-temporal information and hierarchical representation according to claim 1, wherein the specific steps of step S1 are as follows:
step S101: simulating the motion of a camera by adopting a six-parameter affine model;
step S102: for the i-th pixel p_i = (x_i, y_i) of a video frame, the affine optical flow vector w_A(p_i) is expressed by equation 3:

w_A(p_i) = (u_A(p_i), v_A(p_i))    (3)

where u_A(p_i) = c_1(i) + a_1(i)x_i + a_2(i)y_i is the horizontal affine optical flow component, v_A(p_i) = c_2(i) + a_3(i)x_i + a_4(i)y_i is the vertical affine optical flow component, and θ = [c_1, c_2, a_1, …, a_4]^T is the parameter vector of the six-parameter affine model, in which c_1, c_2 are the camera translation parameters and a_1, a_2, a_3, a_4 are the camera rotation and zoom parameters; the position of point p_i in the next video frame is p_i', as shown in equation 4, and the objective function to be solved is shown in equation 5, where m is the number of feature points in the video frame;

[Equation 4, rendered as an image in the original: the position p_i' of point p_i in the next frame under the affine model]

[Equation 5, rendered as an image in the original: the objective function over the m feature points]

where Γ(θ) is the real-world displacement of the object after camera compensation, with the camera motion removed;

step S103: based on the real-time incremental multi-resolution Motion2D algorithm, computing the affine model parameter vector θ by incremental parameter estimation, and defining the overall optical flow vector of pixel p_i = (x_i, y_i) as w(p_i) = (u(p_i), v(p_i)); the compensated optical flow w_F(p_i) is shown in equation 6:

w_F(p_i) = w(p_i) - w_A(p_i)    (6)

defining the improved dense trajectories obtained by tracking the compensated optical flow w_F as the compensated trajectories.
3. The video human body behavior recognition method based on spatio-temporal information and hierarchical representation according to claim 2, wherein the specific steps of step S2 are as follows:
step S201: calculating the temporal saliency and the spatial saliency of each input video frame respectively;
step S202: linearly combining the two saliencies to compute the saliency of each pixel, and defining the saliency value of each video frame as the sum of the saliency values of all pixel points in that frame;
step S203: selecting the video frames whose saliency is above the average, and filtering out the video frames with lower saliency.
4. The video human body behavior recognition method based on spatio-temporal information and hierarchical representation according to claim 3, wherein the specific steps of step S3 are as follows:
step S301: randomly sampling a compensation track to construct a Gaussian mixture model GMM and create a visual vocabulary dictionary;
step S302: estimating probability density functions corresponding to the feature space points by adopting FV coding according to the compensation track features, trj-HOG, trj-HOF and Trj-MBH, analyzed and obtained from the video frame;
step S303: fitting all feature points with Gaussian distributions to obtain the features of the key trajectory set, the GMM generative model being expressed as equation 7:

p(x | θ) = Σ_{k=1}^{K} π_k ζ(x; μ_k, Σ_k)    (7)

where K is the number of Gaussian kernels, θ = {π_k, μ_k, Σ_k : k = 1, …, K} denotes its parameter model, in which π_k, μ_k and Σ_k represent the prior mode probability, the mean vector and the covariance matrix respectively, and ζ(x; μ_k, Σ_k) denotes a D-dimensional Gaussian distribution, D being the dimensionality of the compensated trajectory features after dimension reduction.
5. The method for recognizing the human body behaviors based on the videos of the spatio-temporal information and the hierarchical representations according to claim 4, wherein the specific steps of the step S5 are as follows:
s501: dividing the input video into overlapping video segments of a fixed length of 16 frames;
s502: executing the steps S1 to S4 on each segment to obtain the segment representation of every video segment;
s503: reducing the dimensionality of the segment representations with PCA and whitening them, then using them as the input of the LIBLINEAR toolkit to obtain the parameter w according to equation (2); after normalization, w is used as the final video representation, namely the hierarchical spatio-temporal bundle.
CN201811418871.XA 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation Active CN109583360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811418871.XA CN109583360B (en) 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811418871.XA CN109583360B (en) 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Publications (2)

Publication Number Publication Date
CN109583360A CN109583360A (en) 2019-04-05
CN109583360B true CN109583360B (en) 2023-01-10

Family

ID=65924617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811418871.XA Active CN109583360B (en) 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Country Status (1)

Country Link
CN (1) CN109583360B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188733A (en) * 2019-06-10 2019-08-30 电子科技大学 Timing behavioral value method and system based on the region 3D convolutional neural networks

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529477A (en) * 2016-11-11 2017-03-22 中山大学 Video human behavior recognition method based on significant trajectory and time-space evolution information
CN106682258A (en) * 2016-11-16 2017-05-17 中山大学 Method and system for multi-operand addition optimization in high-level synthesis tool
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN107563345A (en) * 2017-09-19 2018-01-09 桂林安维科技有限公司 A kind of human body behavior analysis method based on time and space significance region detection
CN108256434A (en) * 2017-12-25 2018-07-06 西安电子科技大学 High-level semantic video behavior recognition methods based on confusion matrix
CN109508684A (en) * 2018-11-21 2019-03-22 中山大学 A kind of method of Human bodys' response in video


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video behavior recognition in natural environments based on adaptive feature fusion; Guo Zixin et al.; Chinese Journal of Computers (计算机学报); 2013-11-15; Vol. 36, No. 11; full text *

Also Published As

Publication number Publication date
CN109583360A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109508684B (en) Method for recognizing human behavior in video
Rössler et al. Faceforensics: A large-scale video dataset for forgery detection in human faces
Wang et al. Generative neural networks for anomaly detection in crowded scenes
CN106778854B (en) Behavior identification method based on trajectory and convolutional neural network feature extraction
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
Chung et al. You said that?
Wang et al. A robust and efficient video representation for action recognition
Islam et al. Efficient two-stream network for violence detection using separable convolutional lstm
Zhao et al. Dynamic texture recognition using local binary patterns with an application to facial expressions
CN106709419B (en) Video human behavior recognition method based on significant trajectory spatial information
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
Chen et al. End-to-end learning of object motion estimation from retinal events for event-based object tracking
Fernando et al. Exploiting human social cognition for the detection of fake and fraudulent faces via memory networks
Huang et al. Deepfake mnist+: a deepfake facial animation dataset
Vignesh et al. Abnormal event detection on BMTT-PETS 2017 surveillance challenge
Rong et al. Scene text recognition in multiple frames based on text tracking
CN108629301B (en) Human body action recognition method
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN113312973A (en) Method and system for extracting features of gesture recognition key points
Zhang et al. Contrastive spatio-temporal pretext learning for self-supervised video representation
CN113705490A (en) Anomaly detection method based on reconstruction and prediction
Hirschorn et al. Normalizing flows for human pose anomaly detection
Katircioglu et al. Self-supervised human detection and segmentation via background inpainting
CN109583360B (en) Video human body behavior identification method based on spatio-temporal information and hierarchical representation
CN105893967B (en) Human behavior classification detection method and system based on time sequence retention space-time characteristics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant