CN109583360B - Video human body behavior identification method based on spatio-temporal information and hierarchical representation - Google Patents

Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Info

Publication number
CN109583360B
CN109583360B (application CN201811418871.XA)
Authority
CN
China
Prior art keywords
video
frame
spatio
representation
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811418871.XA
Other languages
Chinese (zh)
Other versions
CN109583360A (en)
Inventor
吴昱焜
李仲泓
衣杨
沈金龙
佘滢
朱艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201811418871.XA priority Critical patent/CN109583360B/en
Publication of CN109583360A publication Critical patent/CN109583360A/en
Application granted granted Critical
Publication of CN109583360B publication Critical patent/CN109583360B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of artificial intelligence, in particular to a video human body behavior identification method based on spatio-temporal information and hierarchical representation. The method makes full use of the spatio-temporal information in a video: the hierarchical spatio-temporal bundle divides the video motion into several parts and thereby obtains a higher-dimensional representation of the motion. Traditional video representation methods ignore the mid- and high-level semantic information of the video, attend only to the occurrence frequency of features, and exploit only zero-order information; against these shortcomings, the video representation based on the hierarchical spatio-temporal bundle effectively suppresses the background noise of the video, bridges the semantic gap between low-level and high-level features, and captures higher-order, more complex motion structure information. The hierarchical spatio-temporal bundle method extracts a more complex and more expressive video representation in a higher dimension and effectively improves video recognition performance.

Description

Video human body behavior identification method based on spatio-temporal information and hierarchical representation
Technical Field
The invention relates to the field of artificial intelligence, in particular to a video human body behavior identification method based on spatio-temporal information and hierarchical representation.
Background
Video human behavior recognition is a front-line artificial intelligence technology: a computer automatically analyzes video content in order to recognize and classify it, with wide applications in intelligent surveillance, human-computer interaction and video content retrieval. Concretely, machine learning is used to extract features from a video data set with labeled categories, a classifier is trained on those features, and unknown videos are then classified. To obtain a high human behavior recognition rate, features with strong expressive power must first be extracted. Ideal features should, first, be robust to human appearance and size, scene illumination, shooting angle and the like; second, they should carry rich scene context information, so that videos of different motion categories can be distinguished effectively.
In terms of feature extraction, current human behavior recognition techniques include video representation methods based on low-level features, hierarchical features and deep features. Low-level-feature methods can be divided into representations based on global features and representations based on local features, such as spatio-temporal interest points and trajectory features. Hierarchical-feature methods can be divided into scene-context-based and spatio-temporal-segment-based video representations. These techniques currently suffer from the following shortcomings:
1) Separation of foreground motion from background motion
In a surveillance setting with a fixed background and little optical-flow variation, human behavior recognition works well. In natural scenes, however, video is affected by many factors such as viewpoint change, camera shake, illumination, background occlusion, and rapid, irregular motion of background clutter.
2) Foreground feature extraction difficulties
Human motion video shot in natural scenes inevitably contains background illumination changes and camera motion. If features are extracted carelessly, a large amount of background noise is mixed in, causing redundant information, reducing the effectiveness of the extracted features and degrading the recognition result.
3) Video representation construction
Even two videos of the same motion category exhibit different motion patterns; each individual performs the action at a different speed, and the same action category is shot in different scenes and from different angles.
Disclosure of Invention
The invention provides a video human body behavior identification method based on spatio-temporal information and hierarchical representation.
In order to realize the purpose of the invention, the technical scheme is as follows:
a video human body behavior identification method based on spatio-temporal information and hierarchical representation comprises the following steps:
step S1: based on the overall optical flow of the video clip with the camera motion compensated, extracting the foreground motion optical flow and forming compensated trajectories;
step S2: through key-frame selection, filtering the video to obtain discriminative key frames;
step S3: sampling the compensated trajectories and training a Gaussian mixture model;
step S4: selecting key frames to obtain the video key-frame set, and FV-encoding the compensated trajectories with the Gaussian mixture model to form the key trajectory set;
step S5: applying segment partitioning and a ranking model to the whole video, and executing steps S1-S4 on the resulting video segments to obtain their hierarchical spatio-temporal bundle features;
step S6: taking the hierarchical spatio-temporal bundle as the video representation and as the input of a classifier, and obtaining the video class label after SVM classification.
Preferably, the specific steps of step S1 are as follows:
step S101: simulating the camera motion with a six-parameter affine model;
step S102: for the i-th pixel p_i = (x_i, y_i) of a video frame, the affine optical flow vector w_A(p_i) is expressed by equation 3:

w_A(p_i) = (u_A(p_i), v_A(p_i))    (3)

where u_A(p_i) = c_1(i) + a_1(i)x_i + a_2(i)y_i is the horizontal affine optical flow component, v_A(p_i) = c_2(i) + a_3(i)x_i + a_4(i)y_i is the vertical affine optical flow component, and θ = [c_1, c_2, a_1, …, a_4]^T is the parameter vector of the six-parameter affine model, in which c_1, c_2 are the camera translation parameters and a_1, a_2, a_3, a_4 are the camera rotation and zoom parameters; the position of point p_i in the next video frame is p_i', as shown in equation 4 with θ = [c a], and the objective function to be solved is shown in equation 5, where m is the number of feature points in the video frame;

[Equation 4, rendered as an image in the original: the position p_i' of point p_i in the next frame under the affine model]

[Equation 5, rendered as an image in the original: the objective function over the m feature points]

where Γ(θ) is the real-world displacement of the object after camera compensation, i.e. with the camera motion removed;

step S103: based on the real-time incremental multi-resolution Motion2D algorithm, computing the affine model parameter vector θ by incremental parameter estimation, defining the overall optical flow vector of pixel p_i = (x_i, y_i) as w(p_i) = (u(p_i), v(p_i)), and obtaining the compensated optical flow w_F(p_i) as shown in equation 6:

w_F(p_i) = w(p_i) - w_A(p_i)    (6)

The improved dense trajectories obtained by tracking the compensated optical flow w_F are defined as the compensated trajectories.
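By way of illustration, the following minimal Python sketch shows how a compensated optical flow of the kind defined in equation 6 could be computed for one frame pair. It is not the implementation of the invention: it substitutes OpenCV's RANSAC-based affine estimation for the Motion2D algorithm named above, assumes 8-bit grayscale frames, and the function name compensated_flow is an illustrative assumption.

```python
# Illustrative sketch of step S1 (camera-motion compensation of optical flow).
# Uses OpenCV's RANSAC affine fit in place of Motion2D; frames are grayscale.
import cv2
import numpy as np

def compensated_flow(prev_gray, next_gray):
    """Return w_F = w - w_A for one frame pair (equation 6)."""
    # Dense overall optical flow w(p) between the two frames.
    w = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                     0.5, 3, 15, 3, 5, 1.2, 0)

    # Sparse feature correspondences used to fit the global affine (camera) motion.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=8)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good_prev = pts[status.flatten() == 1].reshape(-1, 2)
    good_next = nxt[status.flatten() == 1].reshape(-1, 2)

    # Six-parameter affine model (translation c1, c2; rotation/zoom a1..a4).
    A, _ = cv2.estimateAffine2D(good_prev, good_next, method=cv2.RANSAC)

    # Affine (camera) flow w_A(p) at every pixel: transformed position minus position.
    h, w_px = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w_px].astype(np.float32)
    u_a = A[0, 2] + A[0, 0] * xs + A[0, 1] * ys - xs
    v_a = A[1, 2] + A[1, 0] * xs + A[1, 1] * ys - ys
    w_a = np.dstack([u_a, v_a])

    # Compensated (foreground) flow, later tracked to build compensated trajectories.
    return w - w_a
```

Tracking feature points along w_F over successive frames, in the manner of improved dense trajectories, would then yield the compensated trajectories used in the following steps.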
Preferably, the specific steps of step S2 are as follows:
step S201: calculating the temporal saliency and the spatial saliency of each input video frame respectively;
step S202: linearly combining the two saliencies to compute the saliency of each pixel, and defining the saliency value of each video frame as the sum of the saliency values of all pixel points in that frame;
step S203: selecting the video frames whose saliency is above the average, and filtering out the video frames with lower saliency.
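A minimal sketch of the key-frame selection of step S2 follows. The patent does not fix concrete saliency measures, so this sketch substitutes optical-flow magnitude for temporal saliency and Sobel gradient magnitude for spatial saliency; the function name select_key_frames and the weight alpha are illustrative assumptions.

```python
# Illustrative sketch of step S2 (key-frame selection by combined saliency).
import cv2
import numpy as np

def select_key_frames(gray_frames, alpha=0.5):
    """Keep the frames whose saliency exceeds the average frame saliency."""
    scores = []
    for t in range(1, len(gray_frames)):
        prev, cur = gray_frames[t - 1], gray_frames[t]
        # Temporal saliency proxy: per-pixel optical-flow magnitude.
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        temporal = np.linalg.norm(flow, axis=2)
        # Spatial saliency proxy: gradient magnitude of the current frame.
        gx = cv2.Sobel(cur, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(cur, cv2.CV_32F, 0, 1)
        spatial = np.sqrt(gx ** 2 + gy ** 2)
        # Linear combination per pixel, summed over the frame (step S202).
        pixel_saliency = alpha * temporal + (1.0 - alpha) * spatial
        scores.append(pixel_saliency.sum())
    scores = np.array(scores)
    # Step S203: keep frames whose saliency is above the mean.
    return np.where(scores > scores.mean())[0] + 1
```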
Preferably, the specific steps of step S3 are as follows:
step S301: randomly sampling the compensated trajectories to construct a Gaussian mixture model GMM and create a visual vocabulary dictionary;
step S302: according to the compensated trajectory features parsed from the video frames (Trj-HOG, Trj-HOF and Trj-MBH), estimating the probability density function of the feature-space points using FV coding;
step S303: fitting all feature points with Gaussian distributions to obtain the features of the key trajectory set, the GMM generative model being expressed as equation 7:

p(x | θ) = Σ_{k=1}^{K} π_k ζ(x; μ_k, Σ_k)    (7)

where K is the number of Gaussian kernels, θ = {π_k, μ_k, Σ_k : k = 1, …, K} denotes its parameter model, π_k, μ_k and Σ_k represent the prior mode probability, the mean vector and the covariance matrix respectively, and ζ(x; μ_k, Σ_k) denotes a D-dimensional Gaussian distribution, D being the dimensionality of the compensated trajectory features after dimension reduction.
Preferably, the specific steps of step S4 are as follows:
step S401: for an input video X, selecting key frames to obtain the video key-frame set;
step S402: performing FV coding to obtain the key trajectory set representation TB; if frame i precedes frame i+1, the temporal ordering relation is defined as TB_{i+1} > TB_i, and the linear function of equation 1 is defined:

[Equation 1, rendered as an image in the original: a linear ranking function of the frame features TB, parameterized by w]

Equation 1 is an ordinal regression problem in which the label y of a pair (x, y) denotes a rank rather than a scalar category; x is defined as a key-frame feature of video X, y as the index of the frame where x lies, and P as the set of video frame feature pairs P = {(TB_i, TB_j) : TB_i > TB_j}. Under the framework of structural risk minimization and the max-margin algorithm, the constrained objective function is defined as equation 2, where C is the penalty factor, ξ_ij are the slack variables, and w represents the spatio-temporal structure information between the video frame features TB and serves as the representation of video X:

min_w (1/2)‖w‖² + C Σ_{(i,j)∈P} ξ_ij   s.t.   wᵀ(TB_i - TB_j) ≥ 1 - ξ_ij,  ξ_ij ≥ 0    (2)
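For illustration, the ranking step of equation 2 can be approximated by the standard pairwise reduction of a ranking SVM to a linear SVM, as in the sketch below; scikit-learn's LinearSVC (a LIBLINEAR wrapper) stands in for a dedicated ranking solver, and the function name rank_pool is an assumption.

```python
# Illustrative sketch of step S4's ranking model (equation 2): learn a vector w
# that orders the FV-encoded key-frame features TB in time, via the standard
# pairwise reduction of a ranking SVM to a binary linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def rank_pool(TB, C=1.0):
    """TB: (T, dim) array of per-key-frame FV features, in temporal order.
    Returns the normalized w, used as the video (or segment) representation."""
    diffs, labels = [], []
    T = len(TB)
    for j in range(T):
        for i in range(j + 1, T):        # frame i is later than frame j
            diffs.append(TB[i] - TB[j])  # constraint w^T (TB_i - TB_j) >= 1 - xi
            labels.append(1)
            diffs.append(TB[j] - TB[i])  # mirrored pair so the SVM sees two classes
            labels.append(-1)
    clf = LinearSVC(C=C, fit_intercept=False, max_iter=10000)
    clf.fit(np.asarray(diffs), np.asarray(labels))
    w = clf.coef_.ravel()
    return w / (np.linalg.norm(w) + 1e-12)
```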
Preferably, the specific steps of step S5 are as follows:
s501: dividing the input video into repeated fixed length of16The video clip of (1);
s502: executing the steps S1-S4 on the segmented segments to obtain segment representation of each video segment;
s503: the video segment is subjected to dimensionality reduction and whitening by PCA, and then is used as the input of a LIBLINEAR tool set to obtain parameters according to a formula (2)wNormalized to the final video representation, i.e., a hierarchical spatio-temporal bundle.
Compared with the prior art, the invention has the beneficial effects that:
the invention fully utilizes the spatiotemporal information in the video, and the hierarchy spatiotemporal bundle divides the video motion into a plurality of parts, thereby obtaining the higher-dimensional representation of the video motion. Aiming at the defects that the traditional video representation method ignores the semantic information of the middle and high layers of the video, only focuses on the occurrence frequency of the features, utilizes only 0-order information and the like, the video representation method based on the hierarchical space-time beam can effectively eliminate the background noise interference of the video, make up the semantic gap between the bottom-layer features and the high-layer features, and can capture higher-order and more complex motion structure information. The hierarchical space-time beam method can extract more complex and expressive video representation on a higher dimension, and can effectively improve the effect of video identification.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a histogram of the average recognition accuracy of the recognition method of the present invention on the Hollywood2 movie data set.
FIG. 3 is a confusion matrix on UCF Sports data set by the recognition method of the present invention.
Fig. 4 is a confusion matrix on the HMDB51 data set by the identification method of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the invention is further illustrated below with reference to the figures and examples.
Example 1
A video human body behavior identification method based on spatio-temporal information and hierarchical representation comprises the following steps:
step S1: based on the overall optical flow of the video clip with the camera motion compensated, extracting the foreground motion optical flow and forming compensated trajectories;
step S2: through key-frame selection, filtering the video to obtain discriminative key frames;
step S3: sampling the compensated trajectories and training a Gaussian mixture model;
step S4: selecting key frames to obtain the video key-frame set, and FV-encoding the compensated trajectories with the Gaussian mixture model to form the key trajectory set;
step S5: applying segment partitioning and a ranking model to the whole video, and executing steps S1-S4 on the resulting video segments to obtain their hierarchical spatio-temporal bundle features;
step S6: taking the hierarchical spatio-temporal bundle as the video representation and as the input of a classifier, and obtaining the video class label after SVM classification.
Preferably, the specific steps of step S1 are as follows:
step S101: simulating the camera motion with a six-parameter affine model;
step S102: for the i-th pixel p_i = (x_i, y_i) of a video frame, the affine optical flow vector w_A(p_i) is expressed by equation 3:

w_A(p_i) = (u_A(p_i), v_A(p_i))    (3)

where u_A(p_i) = c_1(i) + a_1(i)x_i + a_2(i)y_i is the horizontal affine optical flow component, v_A(p_i) = c_2(i) + a_3(i)x_i + a_4(i)y_i is the vertical affine optical flow component, and θ = [c_1, c_2, a_1, …, a_4]^T is the parameter vector of the six-parameter affine model, in which c_1, c_2 are the camera translation parameters and a_1, a_2, a_3, a_4 are the camera rotation and zoom parameters; the position of point p_i in the next video frame is p_i', as shown in equation 4 with θ = [c a], and the objective function to be solved is shown in equation 5, where m is the number of feature points in the video frame;

[Equation 4, rendered as an image in the original: the position p_i' of point p_i in the next frame under the affine model]

[Equation 5, rendered as an image in the original: the objective function over the m feature points]

where Γ(θ) is the real-world displacement of the object after camera compensation, i.e. with the camera motion removed;

step S103: based on the real-time incremental multi-resolution Motion2D algorithm, computing the affine model parameter vector θ by incremental parameter estimation, defining the overall optical flow vector of pixel p_i = (x_i, y_i) as w(p_i) = (u(p_i), v(p_i)), and obtaining the compensated optical flow w_F(p_i) as shown in equation 6:

w_F(p_i) = w(p_i) - w_A(p_i)    (6)

The improved dense trajectories obtained by tracking the compensated optical flow w_F are defined as the compensated trajectories.
Preferably, the specific steps of step S2 are as follows:
step S201: calculating the temporal saliency and the spatial saliency of each input video frame respectively;
step S202: linearly combining the two saliencies to compute the saliency of each pixel, and defining the saliency value of each video frame as the sum of the saliency values of all pixel points in that frame;
step S203: selecting the video frames whose saliency is above the average, and filtering out the video frames with lower saliency.
Preferably, the specific steps of step S3 are as follows:
step S301: randomly sampling the compensated trajectories to construct a Gaussian mixture model GMM and create a visual vocabulary dictionary;
step S302: according to the compensated trajectory features parsed from the video frames (Trj-HOG, Trj-HOF and Trj-MBH), estimating the probability density function of the feature-space points using FV coding;
step S303: fitting all feature points with Gaussian distributions to obtain the features of the key trajectory set, the GMM generative model being expressed as equation 7:

p(x | θ) = Σ_{k=1}^{K} π_k ζ(x; μ_k, Σ_k)    (7)

where K is the number of Gaussian kernels, θ = {π_k, μ_k, Σ_k : k = 1, …, K} denotes its parameter model, π_k, μ_k and Σ_k represent the prior mode probability, the mean vector and the covariance matrix respectively, and ζ(x; μ_k, Σ_k) denotes a D-dimensional Gaussian distribution, D being the dimensionality of the compensated trajectory features after dimension reduction.
Preferably, the specific steps of step S4 are as follows:
step S401: for an input video X, selecting key frames to obtain the video key-frame set;
step S402: performing FV coding to obtain the key trajectory set representation TB; if frame i precedes frame i+1, the temporal ordering relation is defined as TB_{i+1} > TB_i, and the linear function of equation 1 is defined:

[Equation 1, rendered as an image in the original: a linear ranking function of the frame features TB, parameterized by w]

Equation 1 is an ordinal regression problem in which the label y of a pair (x, y) denotes a rank rather than a scalar category; x is defined as a key-frame feature of video X, y as the index of the frame where x lies, and P as the set of video frame feature pairs P = {(TB_i, TB_j) : TB_i > TB_j}. Under the framework of structural risk minimization and the max-margin algorithm, the constrained objective function is defined as equation 2, where C is the penalty factor, ξ_ij are the slack variables, and w represents the spatio-temporal structure information between the video frame features TB and serves as the representation of video X:

min_w (1/2)‖w‖² + C Σ_{(i,j)∈P} ξ_ij   s.t.   wᵀ(TB_i - TB_j) ≥ 1 - ξ_ij,  ξ_ij ≥ 0    (2)
Preferably, the specific steps of step S5 are as follows:
s501: dividing the input video into repeated fixed length of16The video clip of (a);
s502: executing the steps S1-S4 on the segmented segments to obtain segment representation of each video segment;
s503: the video segments are subjected to PCA dimensionality reduction and whitening, and then used as input of LIBLINEAR toolset to obtain parameters according to formula (2)wAnd after regularization, the video is used as a final video representation, namely a hierarchical spatio-temporal bundle.
Example 2
As shown in fig. 1 and fig. 2, the experimental environment of this embodiment is a single-machine Linux system (Ubuntu 16.04 LTS) with a 32-core 2.10 GHz CPU, 64 GB of memory and 15 TB of disk capacity. The experimental code is mainly C++ and Matlab, using open APIs and class libraries such as OpenCV 2.4.9, LibSVM, VLFeat, CUDA 8.0 and CAFFE.
The experimental data sets of this example are three standard benchmarks: UCF Sports, Hollywood2 and HMDB51. The evaluation metric for Hollywood2 is the mean Average Precision (mAP), and the evaluation metric for the other two data sets is the mean Average Accuracy (mAA); mAA and mAP are computed as in equations 10 and 11.
mAA = (1/R) Σ_{r=1}^{R} AA_r    (10)

mAP = (1/R) Σ_{r=1}^{R} AP_r    (11)

where R is the total number of behavior classes, and AA_r and AP_r denote the recognition accuracy and the average precision of the r-th behavior class, respectively.
The confusion matrix is defined by equation 12: each row of the matrix corresponds to the classification results of one behavior class and sums to 1, and the diagonal elements give the proportion of correctly classified samples, i.e. the accuracy of that behavior class.

[Equation 12, rendered as an image in the original: the definition of the confusion matrix]
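As an illustration of how these metrics could be computed, the short sketch below builds a row-normalized confusion matrix and the mean average accuracy with scikit-learn; it is a generic evaluation routine under the definitions above, not code from the patent.

```python
# Illustrative sketch of the evaluation: row-normalized confusion matrix (equation 12)
# and mean average accuracy (equation 10) from true and predicted class labels.
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred, n_classes):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes))).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)     # each row sums to 1
    per_class_acc = np.diag(cm)             # AA_r: diagonal of the confusion matrix
    mAA = per_class_acc.mean()              # equation 10
    return cm, mAA
```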
Fig. 2 shows, as a histogram, the recognition results of the proposed method on the Hollywood2 movie data set, with a mean average precision of 66.71%. Figs. 3 and 4 show the confusion matrices of the proposed method on the UCF Sports and HMDB51 data sets, with average recognition rates of 89.17% and 65.63%, respectively. The experimental results show that the proposed recognition method achieves a good recognition effect and a clear improvement over existing methods.
It should be understood that the above-described embodiments are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1. A video human body behavior identification method based on spatio-temporal information and hierarchical representation is characterized by comprising the following steps:
step S1: based on the overall optical flow of the video clip with the camera motion compensated, extracting the foreground motion optical flow and forming compensated trajectories;
step S2: through key-frame selection, filtering the video to obtain discriminative key frames;
step S3: sampling the compensated trajectories and training a Gaussian mixture model;
step S4: selecting key frames to obtain the video key-frame set, and FV-encoding the compensated trajectories with the Gaussian mixture model to form the key trajectory set;
step S5: applying segment partitioning and a ranking model to the whole video, and executing steps S1-S4 on the resulting video segments to obtain their hierarchical spatio-temporal bundle features;
step S6: taking the hierarchical spatio-temporal bundle as the video representation and as the input of a classifier, and obtaining the video class label after SVM classification;
the specific steps of step S4 are as follows:
step S401: for an input video clip, selecting a key frame to obtain a video key frame set;
step S402: performing FV coding to obtain the key trajectory set representation TB; if frame i precedes frame i+1, defining the temporal ordering relation as TB_{i+1} > TB_i and defining the linear function of equation 1:

[Equation 1, rendered as an image in the original: a linear ranking function of the frame features TB, parameterized by w]

equation 1 is an ordinal regression problem in which P is defined as the set of video frame feature pairs P = {(TB_i, TB_j) : TB_i > TB_j}; under the framework of structural risk minimization and the max-margin algorithm, the constrained objective function is defined as equation 2, where C is the penalty factor, ξ_ij are the slack variables, and w represents the spatio-temporal structure information between the video frame features TB and serves as the representation of video X:

min_w (1/2)‖w‖² + C Σ_{(i,j)∈P} ξ_ij   s.t.   wᵀ(TB_i - TB_j) ≥ 1 - ξ_ij,  ξ_ij ≥ 0    (2)
2. The video human body behavior recognition method based on spatio-temporal information and hierarchical representation according to claim 1, wherein the specific steps of step S1 are as follows:
step S101: simulating the motion of a camera by adopting a six-parameter affine model;
step S102: for the i-th pixel p_i = (x_i, y_i) of a video frame, the affine optical flow vector w_A(p_i) is expressed by equation 3:

w_A(p_i) = (u_A(p_i), v_A(p_i))    (3)

where u_A(p_i) = c_1(i) + a_1(i)x_i + a_2(i)y_i is the horizontal affine optical flow component, v_A(p_i) = c_2(i) + a_3(i)x_i + a_4(i)y_i is the vertical affine optical flow component, and θ = [c_1, c_2, a_1, …, a_4]^T is the parameter vector of the six-parameter affine model, in which c_1, c_2 are the camera translation parameters and a_1, a_2, a_3, a_4 are the camera rotation and zoom parameters; the position of point p_i in the next video frame is p_i', as shown in equation 4, and the objective function to be solved is shown in equation 5, where m is the number of feature points in the video frame;

[Equation 4, rendered as an image in the original: the position p_i' of point p_i in the next frame under the affine model]

[Equation 5, rendered as an image in the original: the objective function over the m feature points]

where Γ(θ) is the real-world displacement of the object after camera compensation, with the camera motion removed;

step S103: based on the real-time incremental multi-resolution Motion2D algorithm, computing the affine model parameter vector θ by incremental parameter estimation, and defining the overall optical flow vector of pixel p_i = (x_i, y_i) as w(p_i) = (u(p_i), v(p_i)); the compensated optical flow w_F(p_i) is shown in equation 6:

w_F(p_i) = w(p_i) - w_A(p_i)    (6)

defining the improved dense trajectories obtained by tracking the compensated optical flow w_F as the compensated trajectories.
3. The video human body behavior recognition method based on spatio-temporal information and hierarchical representation according to claim 2, wherein the specific steps of step S2 are as follows:
step S201: calculating the temporal saliency and the spatial saliency of each input video frame respectively;
step S202: linearly combining the two saliencies to compute the saliency of each pixel, and defining the saliency value of each video frame as the sum of the saliency values of all pixel points in that frame;
step S203: selecting the video frames whose saliency is above the average, and filtering out the video frames with lower saliency.
4. The video human body behavior recognition method based on spatio-temporal information and hierarchical representation according to claim 3, wherein the specific steps of step S3 are as follows:
step S301: randomly sampling a compensation track to construct a Gaussian mixture model GMM and create a visual vocabulary dictionary;
step S302: estimating probability density functions corresponding to the feature space points by adopting FV coding according to the compensation track features, trj-HOG, trj-HOF and Trj-MBH, analyzed and obtained from the video frame;
step S303: fitting all feature points with Gaussian distributions to obtain the features of the key trajectory set, the GMM generative model being expressed as equation 7:

p(x | θ) = Σ_{k=1}^{K} π_k ζ(x; μ_k, Σ_k)    (7)

where K is the number of Gaussian kernels, θ = {π_k, μ_k, Σ_k : k = 1, …, K} denotes its parameter model, in which π_k, μ_k and Σ_k represent the prior mode probability, the mean vector and the covariance matrix respectively, and ζ(x; μ_k, Σ_k) denotes a D-dimensional Gaussian distribution, D being the dimensionality of the compensated trajectory features after dimension reduction.
5. The method for recognizing the human body behaviors based on the videos of the spatio-temporal information and the hierarchical representations according to claim 4, wherein the specific steps of the step S5 are as follows:
s501: dividing the input video into overlapping video segments of a fixed length of 16 frames;
s502: executing the steps S1 to S4 on each segment to obtain the segment representation of every video segment;
s503: reducing the dimensionality of the segment representations with PCA and whitening them, then using them as the input of the LIBLINEAR toolkit to obtain the parameter w according to equation (2); after normalization, w is used as the final video representation, namely the hierarchical spatio-temporal bundle.
CN201811418871.XA 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation Active CN109583360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811418871.XA CN109583360B (en) 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811418871.XA CN109583360B (en) 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Publications (2)

Publication Number Publication Date
CN109583360A CN109583360A (en) 2019-04-05
CN109583360B true CN109583360B (en) 2023-01-10

Family

ID=65924617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811418871.XA Active CN109583360B (en) 2018-11-26 2018-11-26 Video human body behavior identification method based on spatio-temporal information and hierarchical representation

Country Status (1)

Country Link
CN (1) CN109583360B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188733A (en) * 2019-06-10 2019-08-30 电子科技大学 Timing behavioral value method and system based on the region 3D convolutional neural networks

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529477A (en) * 2016-11-11 2017-03-22 中山大学 Video human behavior recognition method based on significant trajectory and time-space evolution information
CN106682258A (en) * 2016-11-16 2017-05-17 中山大学 Method and system for multi-operand addition optimization in high-level synthesis tool
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN107563345A (en) * 2017-09-19 2018-01-09 桂林安维科技有限公司 A kind of human body behavior analysis method based on time and space significance region detection
CN108256434A (en) * 2017-12-25 2018-07-06 西安电子科技大学 High-level semantic video behavior recognition methods based on confusion matrix
CN109508684A (en) * 2018-11-21 2019-03-22 中山大学 A kind of method of Human bodys' response in video


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video behavior recognition in natural environments based on adaptive feature fusion; Guo Zixin et al.; Chinese Journal of Computers (计算机学报); 2013-11-15; Vol. 36, No. 11; full text *

Also Published As

Publication number Publication date
CN109583360A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
CN109508684B (en) Method for recognizing human behavior in video
Rössler et al. Faceforensics: A large-scale video dataset for forgery detection in human faces
Wang et al. Generative neural networks for anomaly detection in crowded scenes
CN106778854B (en) Behavior identification method based on trajectory and convolutional neural network feature extraction
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
Chung et al. You said that?
Wang et al. A robust and efficient video representation for action recognition
Islam et al. Efficient two-stream network for violence detection using separable convolutional lstm
Zhao et al. Dynamic texture recognition using local binary patterns with an application to facial expressions
CN106709419B (en) Video human behavior recognition method based on significant trajectory spatial information
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
Chen et al. End-to-end learning of object motion estimation from retinal events for event-based object tracking
Fernando et al. Exploiting human social cognition for the detection of fake and fraudulent faces via memory networks
Huang et al. Deepfake mnist+: a deepfake facial animation dataset
Vignesh et al. Abnormal event detection on BMTT-PETS 2017 surveillance challenge
Rong et al. Scene text recognition in multiple frames based on text tracking
CN108629301B (en) Human body action recognition method
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN113312973A (en) Method and system for extracting features of gesture recognition key points
Zhang et al. Contrastive spatio-temporal pretext learning for self-supervised video representation
CN113705490A (en) Anomaly detection method based on reconstruction and prediction
Hirschorn et al. Normalizing flows for human pose anomaly detection
Katircioglu et al. Self-supervised human detection and segmentation via background inpainting
CN109583360B (en) Video human body behavior identification method based on spatio-temporal information and hierarchical representation
CN105893967B (en) Human behavior classification detection method and system based on time sequence retention space-time characteristics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant