CN110852182A - Depth video human body behavior recognition method based on three-dimensional space time sequence modeling - Google Patents

Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Info

Publication number
CN110852182A
CN110852182A (application number CN201910999089.XA)
Authority
CN
China
Prior art keywords
space
dimensional
time
human body
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910999089.XA
Other languages
Chinese (zh)
Other versions
CN110852182B (en)
Inventor
肖阳
王焱乘
曹治国
姜文祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201910999089.XA priority Critical patent/CN110852182B/en
Publication of CN110852182A publication Critical patent/CN110852182A/en
Application granted granted Critical
Publication of CN110852182B publication Critical patent/CN110852182B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, which belongs to the field of digital image recognition and comprises the following steps: marking the human body position in the depth image frame by frame; converting a depth image containing human body behaviors into three-dimensional space point cloud data; performing voxelization on the three-dimensional space point cloud data at different scales to obtain multi-scale three-dimensional tensors; uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time-series coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale, multi-time-period three-dimensional tensor space-time sequence; converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data and randomly sampling it to obtain human behavior space-time features; and inputting the human body behavior space-time features into a trained 3D target point cloud classification model for classification to obtain a behavior classification result. The invention can fully mine the three-dimensional information of depth images and realize efficient, robust recognition of various human behaviors.

Description

Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Technical Field
The invention belongs to the field of digital image recognition, and particularly relates to a depth video human body behavior recognition method based on three-dimensional space time sequence modeling.
Background
In the field of computer vision, human behavior recognition based on depth video has attracted more and more researchers and has become one of the research hotspots; the technology is widely applied to video monitoring, multimedia data analysis, human-computer interaction and the like.
At present, there are mainly three types of depth video behavior recognition methods: methods based on the human skeleton, methods based on the original depth map, and methods fusing skeleton and depth map. The recognition method based on the human skeleton is currently the most common: because it is free of interference from environmental noise, the human skeleton can simply and clearly describe the posture information of human motion, and it has achieved good results on existing behavior data sets. However, this method presupposes that the human skeleton information is estimated accurately; existing skeleton extraction techniques cannot always extract the skeleton completely and correctly, and in special environments the skeleton information is particularly difficult to obtain. The human behavior recognition method based on the depth image projects human behavior from the 3D temporal space onto a 2D plane for recognition, so more environment and subject information can be obtained; however, the 3D information of the human behavior is still not effectively extracted, and because environmental noise is prominent in the 2D plane, the spatio-temporal information of the behavior is difficult to mine and extract effectively, which places higher demands on the robustness and fitting ability of the algorithm model.
In general, existing depth video behavior recognition methods suffer from the technical problems that human skeleton information cannot always be extracted accurately, depth video information is not exploited to the fullest extent, and recognition accuracy is low because the methods are easily affected by environmental noise.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, and aims to solve the technical problems that existing depth video human behavior recognition methods operate on the two-dimensional plane, cannot effectively extract the three-dimensional information of the depth video, are easily influenced by environmental noise, and have low recognition accuracy.
In order to achieve the purpose, the invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, which comprises the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
(2) converting the pixel coordinates of the depth image into three-dimensional space point cloud data;
(3) performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
(4) uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
(5) converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
(6) and inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
Further, the step (1) specifically comprises:
(1.1) framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behaviors;
(1.2) each frame of depth image is represented as an A×B matrix, and the matrix values corresponding to non-human-body positions outside the labeling frame are set to 0; wherein the index of each matrix value corresponds to the pixel coordinate of that position, and each matrix value is the distance between the point at that pixel coordinate and the depth camera.
Further, according to the internal parameters of the depth camera, the correspondence between the pixel coordinates of the depth image in step (2) and the three-dimensional space point cloud in the world coordinate system is as follows:

x = (u − c_x)·z / f_x,  y = (v − c_y)·z / f_y,  z = d(u, v)

wherein u and v are the coordinate position of each pixel in the image, d(u, v) is the depth value at that pixel, f_x and f_y are the focal lengths of the depth camera, and c_x and c_y are the coordinates of the depth camera's principal point.
Further, the step (3) specifically comprises:
(3.1) setting a plurality of different voxel sizes, and uniformly dividing the space to obtain a plurality of space grids;
and (3.2) setting the voxel value corresponding to the space grid with the point cloud data as 1, and setting the voxel values corresponding to the other space grids as 0 to obtain the multi-scale three-dimensional tensors corresponding to different voxel sizes.
Further, the performing spatial time series coding on the three-dimensional tensor corresponding to each time period in the step (4) specifically includes:
(01) scoring the frame images according to the ranking function S(v_t; u) = u^T · v_t;
wherein u^T denotes the transpose of the parameter vector obtained by optimizing the ranking function, v_t = (1/t) · Σ_{τ=1}^{t} x_τ is the mean feature of the depth images up to the t-th frame, and x_t ∈ R^{W×H×D} is the three-dimensional tensor obtained by voxelizing the t-th frame depth image;
(02) optimizing the parameter u of the ranking function through a RankSVM, so that frame images later in the time series receive larger scores;
(03) converting the optimal value of the parameter u into a W×H×D tensor, and using it as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensor for that time period after space time-series coding; W, H and D denote the numbers of voxels along the X, Y and Z axes of the voxelized point cloud at that scale.
Further, the step (5) specifically comprises:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to their element values to obtain M high-dimensional point cloud data points (x, y, z, c_1, …, c_m), wherein m is the number of video segments obtained by temporally dividing the depth video, M is the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c denotes the motion information of the tensor value at the corresponding coordinate position;
(5.2) randomly selecting K of the M high-dimensional point cloud data points (x, y, z, c_1, …, c_m) as the human behavior space-time features.
Further, after the human body behavior space-time features are augmented by a rotation-and-translation data enhancement scheme, they are input into the trained 3D target point cloud classification model for classification.
Further, the 3D target point cloud classification model comprises a multilayer perceptron and a non-local mechanism NetVLAD network which are sequentially connected;
the multilayer perceptron is used for sampling and grouping the human behavior space-time characteristics and extracting the characteristics of each group of behavior space-time characteristics to obtain a plurality of groups of local characteristics;
the non-local mechanism NetVLAD network aggregates a plurality of groups of local characteristics to obtain the non-local characteristics.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The method effectively screens the spatial position of the human body by detecting human behaviors in the planar depth image, and converts the depth image containing human behaviors into a point cloud sequence so as to restore the spatial information of the behaviors and facilitate the subsequent acquisition of richer human behavior characteristics; on this basis, the point cloud sequence is voxelized and space time-series coded, so that the spatial geometric characteristics of the depth video are fully mined and the accuracy of human behavior recognition is effectively improved.
(2) The method adds a self-attention-based non-local region feature fusion module on top of the existing point cloud classification network PointNet++, and further improves the accuracy of human behavior recognition by combining global behavior features with salient local motion features.
(3) According to the invention, the input of the classification network is further data-augmented, namely the input point cloud is randomly rotated in space, so that the classification model is more robust to human behavior recognition under different viewing angles.
Drawings
Fig. 1 is a schematic flow chart of a depth video human body behavior recognition method based on three-dimensional spatial time sequence modeling according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating the physical meaning of converting an original depth image into a point cloud;
fig. 3 is a visualization result obtained after the point cloud sequence provided by the embodiment of the present invention is voxelized;
fig. 4 is a visualization result of a 3-dimensional space tensor obtained after a voxelized sequence is subjected to spatial time sequence coding according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a point cloud classification network structure according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a self-attention-based non-local region feature fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, which comprises human body target detection, conversion of the depth image into a point cloud, point cloud voxelization, time-series coding of the voxelized tensors, sampling of the coded tensors to obtain space-time features, and feeding the features into a point cloud classification network for training and testing. The method provided by the invention is specifically described below with reference to an example.
As shown in fig. 1, the depth video human body behavior recognition method based on three-dimensional spatial time series modeling provided by the embodiment of the present invention includes the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
specifically, framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behavior; and each frame of depth image is represented as an A-B matrix (A, B is the length and width of the depth image based on the number of pixels respectively), the index of each matrix value corresponds to the pixel coordinate of the position, and each matrix value corresponds to the distance between the position point of the pixel coordinate and the depth camera.
(2) Converting pixel coordinates of a depth image containing human body behaviors into three-dimensional space point cloud data;
Specifically, as shown in fig. 2, the camera coordinate point, namely the optical center, is set as the origin O of the world coordinate system, the center point of the depth image is O', a point M' in the depth image is converted into a point M in the world coordinate system, and the projection of M onto the z axis of the world coordinate system is the point A. Triangle OM'O' is similar to triangle OMA, which gives the mapping relation

|O'M'| / |OO'| = |AM| / |OA|

Substituting the camera internal parameters then yields the correspondence between the pixel coordinates of the depth image and the three-dimensional space point cloud in the world coordinate system:

x = (u − c_x)·z / f_x,  y = (v − c_y)·z / f_y,  z = d(u, v)

wherein u and v are the coordinate position of each pixel in the image, d(u, v) is the depth value at that pixel, f_x and f_y are the focal lengths of the depth camera, and c_x and c_y are the coordinates of the depth camera's principal point.
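The back-projection above can be sketched as follows, assuming a NumPy depth frame and known camera intrinsics f_x, f_y, c_x, c_y (an illustrative sketch of the standard pinhole relation, not the patented implementation itself):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (A, B) depth frame to an (M, 3) point cloud,
    keeping only pixels with a valid (non-zero) depth value."""
    v, u = np.nonzero(depth)          # pixel coordinates of valid depths
    z = depth[v, u]
    x = (u - cx) * z / fx             # x = (u - c_x) * z / f_x
    y = (v - cy) * z / fy             # y = (v - c_y) * z / f_y
    return np.stack([x, y, z], axis=1)
```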
(3) Performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
Specifically, the step (3) specifically includes: (3.1) setting a plurality of different voxel sizes, and uniformly dividing the space to obtain a plurality of space grids; (3.2) setting the voxel value corresponding to each space grid that contains point cloud data to 1 and the voxel values corresponding to the other space grids to 0, to obtain the multi-scale three-dimensional tensors corresponding to the different voxel sizes. Each frame of depth image obtained in step (2) is converted into point cloud data of M points (x, y, z); the maxima and minima of all point cloud sets along the x, y and z axes over a video segment are computed and denoted hx, hy, hz and lx, ly, lz, which gives the spatial extent of the human body behavior in the world coordinate system. With the voxel size set to a, the numbers of voxels of the depth behavior in 3D space are

W = ⌈(hx − lx)/a⌉,  H = ⌈(hy − ly)/a⌉,  D = ⌈(hz − lz)/a⌉
The visualization result of the depth image containing human body behaviors through point cloud and voxelization is shown in fig. 3.
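A possible sketch of this voxelization step, assuming per-frame (M, 3) point clouds and the per-video bounds hx, hy, hz, lx, ly, lz described above (names are illustrative):

```python
import numpy as np

def voxelize(points, voxel_size, bounds):
    """Binary occupancy tensor for one frame's point cloud.

    points     : (M, 3) array of (x, y, z)
    voxel_size : edge length a of each voxel
    bounds     : ((lx, ly, lz), (hx, hy, hz)) over the whole video
    """
    lo, hi = np.asarray(bounds[0]), np.asarray(bounds[1])
    dims = np.ceil((hi - lo) / voxel_size).astype(int)   # (W, H, D)
    idx = np.floor((points - lo) / voxel_size).astype(int)
    idx = np.clip(idx, 0, dims - 1)
    grid = np.zeros(dims, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1            # occupied -> 1
    return grid

# Multi-scale: one tensor sequence per voxel size, e.g.
# tensors = {a: [voxelize(p, a, bounds) for p in frames] for a in (0.05, 0.1)}
```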
(4) Uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
specifically, the performing spatial time series coding on the three-dimensional tensor corresponding to each time period in step (4) specifically includes:
(01) scoring the frame images according to the ranking function S(v_t; u) = u^T · v_t, wherein u^T denotes the transpose of the parameter vector obtained by optimizing the ranking function, v_t = (1/t) · Σ_{τ=1}^{t} x_τ is the mean feature of the depth images up to the t-th frame, and x_t ∈ R^{W×H×D} is the three-dimensional tensor obtained by voxelizing the t-th frame depth image; (02) optimizing the parameter u of the ranking function through a RankSVM, so that frame images later in the time series receive larger scores; (03) converting the optimal value of the parameter u into a W×H×D tensor and using it as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensor for that time period after space time-series coding; W, H and D denote the numbers of voxels along the x, y and z axes of the voxelized point cloud at that scale.
The present invention uses a structural-risk-minimization, maximum-margin optimization framework, and the objective optimization problem can be expressed as:

u* = argmin_u { (λ/2)·||u||² + (2 / (N(N−1))) · Σ_{i>j} max(0, 1 − S(v_i; u) + S(v_j; u)) }

The first term is a regularization term and the second term is a hinge-loss error penalty term. The above formula is provably a convex optimization problem, which can be solved using a RankSVM; the optimized parameter u* can serve as a new representation of the entire sequence of feature tensors. After resizing, the parameter u* becomes a 3-dimensional tensor feature of shape W×H×D, consistent with the feature dimensions of x_t.

Simplifying the above formula, let d denote the sought parameter u. Starting from u = 0, the first gradient step yields the first-order approximate solution

d* ∝ Σ_{i>j} (v_i − v_j)

which can be regrouped over frames as

d* ∝ Σ_{t=1}^{N} (2t − N − 1) · v_t

Expanding each running mean v_t in terms of the per-frame tensors x_t and summing the resulting coefficient series gives

α_t = 2(N − t + 1) − (N + 1)(H_N − H_{t−1})

wherein H_t = Σ_{i=1}^{t} 1/i is the t-th harmonic number (H_0 = 0). The W×H×D tensor feature finally to be obtained therefore becomes:

d* ∝ Σ_{t=1}^{N} α_t · x_t
in an embodiment of the present invention, α was usedtThe tensor eigensequence is processed 2(N-t +1), formula αt=2(T-t+1)-(T+1)(HT-Ht-1) The second item in (1) does not influence the coding effect, and the consumption of much time is reduced. Visualization of results after rankPooling is shown in FIG. 4, in which the original video is equally divided into four segments, each segment has 1/2 overlapping parts, and in addition to the original full sequence time segments, a total of 5 time-series 3-dimensional tensors can be obtained.
(5) Converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
specifically, the step (5) specifically includes:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to their element values to obtain M high-dimensional point cloud data points (x, y, z, c_1, …, c_m), wherein m is the number of video segments obtained by temporally dividing the depth video, M is the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c denotes the motion information of the tensor value at the corresponding coordinate (x, y, z) position;
More specifically, for the tensors obtained after coding, the indices represent spatial position information and the element values represent the space time-series information obtained by space time-series coding; a tensor value of 0 indicates that no motion information exists at the corresponding position. For the 3-dimensional tensors obtained over the several time periods, voxels whose values are 0 under the same index across all periods are screened out; the spatial information given by the indices and the motion information given by the tensor values are extracted and stored in the high-dimensional point cloud format (x, y, z, c_1, …, c_5), thereby obtaining M high-dimensional point cloud data points (x, y, z, c_1, …, c_m).
(5.2) randomly selecting K of the M high-dimensional point cloud data points (x, y, z, c_1, …, c_m) as the human behavior space-time features.
More specifically, if M is less than K, all M points are selected and (K − M) points are randomly re-drawn from them as repeated points, finally giving K point data as the input of the classification network. The value of K is set according to the input size of the network model and the typical magnitude of M; the value of K selected in the embodiment of the present invention is 2048.
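The extraction and sampling of steps (5.1)-(5.2) could be sketched as follows, assuming m coded tensors of equal shape (the K = 2048 default follows the embodiment; the helper itself is illustrative):

```python
import numpy as np

def spatio_temporal_points(coded, K=2048):
    """Turn m coded tensors (W, H, D) into K sampled points (x, y, z, c_1..c_m).

    coded : list of m space time-series coded tensors of equal shape
    """
    stack = np.stack(coded, axis=-1)                  # (W, H, D, m)
    mask = np.any(stack != 0, axis=-1)                # drop voxels with no motion
    xyz = np.argwhere(mask).astype(np.float32)        # (M, 3) voxel indices
    feats = stack[mask]                               # (M, m) motion information
    points = np.concatenate([xyz, feats], axis=1)     # (M, 3 + m)
    M = len(points)                                   # assumes M > 0
    if M >= K:
        sel = np.random.choice(M, K, replace=False)
    else:                                             # repeat (K - M) points
        sel = np.concatenate([np.arange(M),
                              np.random.choice(M, K - M, replace=True)])
    return points[sel]
```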
(6) And inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
Specifically, before inputting a point cloud classification model, the embodiment of the invention performs data enhancement on human behavior space-time characteristics by adopting a rotational translation data enhancement mode, wherein a rotational formula is as follows:
Figure BDA0002240714830000091
Figure BDA0002240714830000092
wherein R isx、RyThe rotation matrix of the point cloud in the world coordinate system around the x and y axes is represented, β and α represent the rotation degree, and the rotation process of the point cloud can be represented as x' ═ x (R) through matrix multiplicationx*Ry)TIn the embodiment of the invention, β ranges from-10 degrees to +10 degrees and α ranges from-45 degrees to +45 degrees are set, and after the rotating data of the data set is enhanced, the robustness of the model of the invention to behaviors under different visual angles can be improved.
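A minimal sketch of this rotation augmentation, assuming points of shape (K, 3 + m) whose first three columns are x, y, z (angle ranges follow the embodiment; the function name is illustrative):

```python
import numpy as np

def random_rotate(points, beta_range=10.0, alpha_range=45.0):
    """Rotate the xyz part of (K, 3+m) points: x' = x @ (Rx @ Ry).T."""
    b = np.radians(np.random.uniform(-beta_range, beta_range))    # about x axis
    a = np.radians(np.random.uniform(-alpha_range, alpha_range))  # about y axis
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(b), -np.sin(b)],
                   [0, np.sin(b),  np.cos(b)]])
    Ry = np.array([[ np.cos(a), 0, np.sin(a)],
                   [0, 1, 0],
                   [-np.sin(a), 0, np.cos(a)]])
    out = points.copy()
    out[:, :3] = points[:, :3] @ (Rx @ Ry).T
    return out
```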
On the basis of the weight-shared multilayer perceptron network adopted by the existing point cloud classification network PointNet, a self-attention-based non-local region feature fusion module is adopted to fuse the local features mapped into high-dimensional space, so that the relationships among point clouds can be further explored, the dependencies among point sets at different positions can be captured, and the commonality among the spatial positions of different human motions can be obtained, which enhances the discriminative power of the behavior features. The behavior classification model structure is shown in fig. 5 and comprises a multilayer perceptron and a non-local mechanism NetVLAD network connected in sequence; the model first samples and groups the input point cloud using a nearest-neighbor method, feeds each grouped point cloud into a weight-shared multilayer perceptron to obtain local features, and then aggregates the grouped local features with the non-local mechanism NetVLAD network.
The structure of the non-local mechanism NetVLAD network is shown in fig. 6. Let the input be N d-dimensional point features {x_i} and the VLAD parameters be K cluster centers {c_k}; the output of the final VLAD is a descriptive feature of dimension K×d, denoted V. The aggregation formula of the NetVLAD network is as follows:

V(j, k) = Σ_{i=1}^{N} a_k(x_i) · (x_i(j) − c_k(j))
wherein a_k(x_i) indicates whether the point x_i belongs to the k-th cluster, and can be approximated in softmax form as:

a_k(x_i) = exp(w_k^T x_i + b_k) / Σ_{k'} exp(w_{k'}^T x_i + b_{k'})

wherein w_k and b_k are all parameters that can be learned by the network, and w_k^T x_i + b_k expresses the distance between the point feature and the k-th cluster center point. The network thus obtains the aggregated feature V; on this basis, a non-local feature module is adopted to mine the correlation between the features obtained from the different cluster centers of the VLAD. The calculation formula of the non-local feature is as follows:
y_i = (1 / C(V)) · Σ_j f(V_i, V_j) · g(V_j)

Let the input V be a feature of shape K×C, where V_i, the feature obtained from one NetVLAD cluster center, is a column vector of length C, and i is used to indicate the position. f is used to compute the similarity between the feature vectors of two points, and g is a mapping function, which can be implemented by a multilayer perceptron.
The similarity measurement function f can be chosen in several forms, such as the Gaussian and the embedded Gaussian; here f adopts the embedded Gaussian form:

f(V_i, V_j) = exp( θ(V_i)^T · φ(V_j) )

wherein the functions θ and φ can be expressed by linear mapping (perceptron) functions:

θ(V_i) = W_θ · V_i,  φ(V_j) = W_φ · V_j

The final formula can therefore be expressed as:

y = softmax( θ(V) · φ(V)^T ) · g(V)
the output of the non-local NetVLAD module is used as the output of the non-local NetVLAD module, and the next stage of feature learning is carried out.
The embodiment of the invention uses the NTU RGBD behavior data set proposed by Nanyang Technological University, Singapore: features are extracted and coded according to steps (1)-(5), and the classification network of step (6) is then trained end to end, although the invention is not limited to this data set. The NTU RGBD120 data set is used in the training process, so the output has 120 classes, comprising 82 classes of daily behaviors, 12 classes of medical behaviors and 26 classes of multi-person interactive behaviors. The classification results show that the method provided by the invention can effectively acquire 3D human motion information and obtains the best performance on the classification task.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A depth video human body behavior recognition method based on three-dimensional space time sequence modeling is characterized by comprising the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
(2) converting the pixel coordinates of the depth image into three-dimensional space point cloud data;
(3) performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
(4) uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
(5) converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
(6) and inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
2. The method for recognizing the human body behavior based on the depth video of the three-dimensional space time sequence modeling as claimed in claim 1, wherein the step (1) specifically comprises:
(1.1) framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behaviors;
(1.2) each frame of depth image is represented as an A×B matrix, and the matrix values corresponding to non-human-body positions outside the labeling frame are set to 0; wherein the index of each matrix value corresponds to the pixel coordinate of that position, and each matrix value is the distance between the point at that pixel coordinate and the depth camera.
3. The method for recognizing human body behaviors through depth video based on three-dimensional space time sequence modeling as claimed in claim 1 or 2, wherein, according to the depth camera internal parameters, the correspondence between the pixel coordinates of the depth image in step (2) and the three-dimensional space point cloud in the world coordinate system is as follows:

x = (u − c_x)·z / f_x,  y = (v − c_y)·z / f_y,  z = d(u, v)

wherein u and v are the coordinate position of each pixel in the image, d(u, v) is the depth value at that pixel, f_x and f_y are the focal lengths of the depth camera, and c_x and c_y are the coordinates of the depth camera's principal point.
4. The method for recognizing the human body behaviors through the depth video based on the three-dimensional space time sequence modeling according to any one of claims 1 to 3, wherein the step (3) specifically comprises the following steps:
(3.1) setting a plurality of different voxel sizes, and uniformly dividing the space to obtain a plurality of space grids;
and (3.2) setting the voxel value corresponding to the space grid with the point cloud data as 1, and setting the voxel values corresponding to the other space grids as 0 to obtain the multi-scale three-dimensional tensors corresponding to different voxel sizes.
5. The method for recognizing human body behaviors through depth video based on three-dimensional time-series modeling according to any one of claims 1 to 4, wherein the step (4) of performing space time-series coding on the three-dimensional tensor corresponding to each time period specifically comprises:
(01) scoring the frame images according to the ranking function S(v_t; u) = u^T · v_t;
wherein u^T denotes the transpose of the parameter vector obtained by optimizing the ranking function, v_t = (1/t) · Σ_{τ=1}^{t} x_τ is the mean feature of the depth images up to the t-th frame, and x_t ∈ R^{W×H×D} is the three-dimensional tensor obtained by voxelizing the t-th frame depth image;
(02) optimizing the parameter u of the ranking function through a RankSVM, so that frame images later in the time series receive larger scores;
(03) converting the optimal value of the parameter u into a W×H×D tensor, and using it as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensor for that time period after space time-series coding; W, H and D denote the numbers of voxels along the X, Y and Z axes of the voxelized point cloud at that scale.
6. The method for recognizing the human body behavior based on the depth video of the three-dimensional time-series modeling according to any one of claims 1 to 5, wherein the step (5) specifically comprises the following steps:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to their element values to obtain M high-dimensional point cloud data points (x, y, z, c_1, …, c_m), wherein m is the number of video segments obtained by temporally dividing the depth video, M is the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c denotes the motion information at the corresponding coordinate (x, y, z);
(5.2) randomly selecting K of the M high-dimensional point cloud data points (x, y, z, c_1, …, c_m) as the human behavior space-time features.
7. The depth video human body behavior recognition method based on three-dimensional space time sequence modeling according to any one of claims 1 to 6, characterized in that the human body behavior space-time features are input into a trained 3D target point cloud classification model for classification after being augmented by a data enhancement mode of rotation and translation.
8. The method for recognizing human body behaviors through depth videos based on three-dimensional time-series modeling according to any one of claims 1 to 7, wherein the 3D target point cloud classification model comprises a multilayer perceptron and a non-local mechanism NetVLAD network which are connected in sequence;
the multilayer perceptron is used for sampling and grouping the human behavior space-time characteristics and extracting the characteristics of each group of behavior space-time characteristics to obtain a plurality of groups of local characteristics;
the non-local mechanism NetVLAD network aggregates a plurality of groups of local characteristics to obtain the non-local characteristics.
CN201910999089.XA 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling Active CN110852182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910999089.XA CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910999089.XA CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Publications (2)

Publication Number Publication Date
CN110852182A true CN110852182A (en) 2020-02-28
CN110852182B CN110852182B (en) 2022-09-20

Family

ID=69596732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910999089.XA Active CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Country Status (1)

Country Link
CN (1) CN110852182B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932822A (en) * 2020-07-11 2020-11-13 广州融康汽车科技有限公司 Passenger body position alarm device
CN112001298A (en) * 2020-08-20 2020-11-27 佳都新太科技股份有限公司 Pedestrian detection method, device, electronic equipment and storage medium
CN112215101A (en) * 2020-09-27 2021-01-12 武汉科技大学 Attention mechanism-based three-dimensional target identification method and system
CN112989930A (en) * 2021-02-04 2021-06-18 西安美格智联软件科技有限公司 Method, system, medium and terminal for automatically monitoring fire fighting channel blockage
CN113111760A (en) * 2021-04-07 2021-07-13 同济大学 Lightweight graph convolution human skeleton action identification method based on channel attention
CN113269218A (en) * 2020-12-30 2021-08-17 威创集团股份有限公司 Video classification method based on improved VLAD algorithm
CN113536997A (en) * 2021-07-01 2021-10-22 深圳中智明科智能科技有限公司 Intelligent security system and method based on image recognition and behavior analysis
CN113536892A (en) * 2021-05-13 2021-10-22 泰康保险集团股份有限公司 Gesture recognition method and device, readable storage medium and electronic equipment
CN115131562A (en) * 2022-07-08 2022-09-30 北京百度网讯科技有限公司 Three-dimensional scene segmentation method, model training method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955682A (en) * 2014-05-22 2014-07-30 深圳市赛为智能股份有限公司 Behavior recognition method and device based on SURF interest points
CN105894571A (en) * 2016-01-22 2016-08-24 冯歆鹏 Multimedia information processing method and device
US20190004533A1 (en) * 2017-07-03 2019-01-03 Baidu Usa Llc High resolution 3d point clouds generation from downsampled low resolution lidar 3d point clouds and camera images
CN109993103A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of Human bodys' response method based on point cloud data
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955682A (en) * 2014-05-22 2014-07-30 深圳市赛为智能股份有限公司 Behavior recognition method and device based on SURF interest points
CN105894571A (en) * 2016-01-22 2016-08-24 冯歆鹏 Multimedia information processing method and device
US20190004533A1 (en) * 2017-07-03 2019-01-03 Baidu Usa Llc High resolution 3d point clouds generation from downsampled low resolution lidar 3d point clouds and camera images
CN109993103A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of Human bodys' response method based on point cloud data
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG XIAO ET AL.: ""Action Recognition for Depth Video using Multi-view Dynamic Images"", 《ARXIV》 *
刘婷婷 et al.: ""Human Body Behavior Recognition Based on Multi-view Depth Motion Maps"" (多视角深度运动图的人体行为识别), 《中国图象图形学报》 (Journal of Image and Graphics) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932822A (en) * 2020-07-11 2020-11-13 广州融康汽车科技有限公司 Passenger body position alarm device
CN112001298A (en) * 2020-08-20 2020-11-27 佳都新太科技股份有限公司 Pedestrian detection method, device, electronic equipment and storage medium
CN112001298B (en) * 2020-08-20 2021-09-21 佳都科技集团股份有限公司 Pedestrian detection method, device, electronic equipment and storage medium
CN112215101A (en) * 2020-09-27 2021-01-12 武汉科技大学 Attention mechanism-based three-dimensional target identification method and system
CN113269218A (en) * 2020-12-30 2021-08-17 威创集团股份有限公司 Video classification method based on improved VLAD algorithm
CN112989930A (en) * 2021-02-04 2021-06-18 西安美格智联软件科技有限公司 Method, system, medium and terminal for automatically monitoring fire fighting channel blockage
CN113111760A (en) * 2021-04-07 2021-07-13 同济大学 Lightweight graph convolution human skeleton action identification method based on channel attention
CN113111760B (en) * 2021-04-07 2023-05-02 同济大学 Light-weight graph convolution human skeleton action recognition method based on channel attention
CN113536892A (en) * 2021-05-13 2021-10-22 泰康保险集团股份有限公司 Gesture recognition method and device, readable storage medium and electronic equipment
CN113536892B (en) * 2021-05-13 2023-11-21 泰康保险集团股份有限公司 Gesture recognition method and device, readable storage medium and electronic equipment
CN113536997A (en) * 2021-07-01 2021-10-22 深圳中智明科智能科技有限公司 Intelligent security system and method based on image recognition and behavior analysis
CN115131562A (en) * 2022-07-08 2022-09-30 北京百度网讯科技有限公司 Three-dimensional scene segmentation method, model training method and device and electronic equipment
CN115131562B (en) * 2022-07-08 2023-06-13 北京百度网讯科技有限公司 Three-dimensional scene segmentation method, model training method, device and electronic equipment

Also Published As

Publication number Publication date
CN110852182B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Shi et al. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network
Yang et al. Pixor: Real-time 3d object detection from point clouds
CN108549873B (en) Three-dimensional face recognition method and three-dimensional face recognition system
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
CN107742102B (en) Gesture recognition method based on depth sensor
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN110706248A (en) Visual perception mapping algorithm based on SLAM and mobile robot
CN106295568A (en) The mankind's naturalness emotion identification method combined based on expression and behavior bimodal
CN104182765A (en) Internet image driven automatic selection method of optimal view of three-dimensional model
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN113408584A (en) RGB-D multi-modal feature fusion 3D target detection method
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN112396655A (en) Point cloud data-based ship target 6D pose estimation method
Fei et al. Self-supervised learning for pre-training 3d point clouds: A survey
Wang et al. Paccdu: pyramid attention cross-convolutional dual unet for infrared and visible image fusion
CN114299339A (en) Three-dimensional point cloud model classification method and system based on regional correlation modeling
CN114639115A (en) 3D pedestrian detection method based on fusion of human body key points and laser radar
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
Kanaujia et al. Part segmentation of visual hull for 3d human pose estimation
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
Hou et al. Multi-modal feature fusion for 3D object detection in the production workshop
Lu et al. Multimode gesture recognition algorithm based on convolutional long short-term memory network
CN102663369A (en) Human motion tracking method on basis of SURF (Speed Up Robust Feature) high efficiency matching kernel

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant