CN110852182B - Depth video human body behavior recognition method based on three-dimensional space time sequence modeling - Google Patents

Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Info

Publication number
CN110852182B
CN110852182B (application CN201910999089.XA)
Authority
CN
China
Prior art keywords
space
dimensional
point cloud
human body
time sequence
Prior art date
Legal status
Active
Application number
CN201910999089.XA
Other languages
Chinese (zh)
Other versions
CN110852182A (en)
Inventor
Yang Xiao (肖阳)
Yancheng Wang (王焱乘)
Zhiguo Cao (曹治国)
Wenxiang Jiang (姜文祥)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910999089.XA
Publication of CN110852182A
Application granted
Publication of CN110852182B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, belonging to the field of digital image recognition and comprising the following steps: marking the human body position in the depth image frame by frame; converting the depth image containing human body behaviors into three-dimensional space point cloud data; voxelizing the three-dimensional space point cloud data at different scales to obtain multi-scale three-dimensional tensors; uniformly dividing the three-dimensional tensors of the same scale into a plurality of time periods, and performing spatial time-series coding on the three-dimensional tensors of each time period to obtain a multi-scale, multi-time-period three-dimensional tensor space-time sequence; converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data and sampling it randomly to obtain human behavior space-time features; and inputting the human behavior space-time features into a trained 3D target point cloud classification model for classification to obtain the behavior classification result. The invention fully mines the three-dimensional information of the depth image and achieves efficient and robust recognition of diverse human behaviors.

Description

Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Technical Field
The invention belongs to the field of digital image recognition, and particularly relates to a depth video human body behavior recognition method based on three-dimensional space time sequence modeling.
Background
In the field of computer vision, human behavior recognition based on depth video has attracted increasing attention from researchers and has become a research hotspot; the technology is widely applied to video surveillance, multimedia data analysis, human-computer interaction, and other areas.
At present, depth video behavior recognition methods fall into three main categories: methods based on the human skeleton, methods based on the original depth map, and methods fusing skeleton and depth map. The skeleton-based method is currently the most common: free from environmental noise, the human skeleton describes the posture of human motion simply and clearly, and such methods have achieved good results on existing behavior datasets. However, they presuppose that the human skeleton is accurately estimated; current skeleton extraction techniques cannot guarantee completely correct extraction, and in special environments skeleton information is difficult to obtain at all. The depth-image-based method projects human behavior from the 3D spatio-temporal space onto 2D planes for recognition, where more environment and subject information is available; yet the 3D information of the behavior is still not effectively extracted, and because environmental noise is prominent in the 2D plane, the spatio-temporal information of the behavior is difficult to mine and extract effectively, placing high demands on the robustness and fitting capability of the algorithm model.
In general, existing depth video behavior recognition methods suffer from inaccurate human skeleton extraction, under-exploited depth video information, and susceptibility to environmental noise, which together lead to low recognition accuracy.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, and aims to solve the technical problems that existing depth video behavior recognition methods, which recognize on two-dimensional planes, cannot effectively extract the three-dimensional information of the depth video, are easily affected by environmental noise, and therefore achieve low recognition accuracy.
In order to achieve the purpose, the invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, which comprises the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
(2) converting the pixel coordinates of the depth image into three-dimensional space point cloud data;
(3) performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
(4) uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
(5) converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
(6) and inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
Further, the step (1) specifically comprises:
(1.1) framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behaviors;
(1.2) each frame of depth image is represented as an A×B matrix, and matrix values corresponding to non-human body positions outside the labeling frame are set to 0; wherein the index of each matrix value corresponds to the pixel coordinate of the location, and each matrix value corresponds to the distance between the location point of the pixel coordinate and the depth camera.
Further, according to the internal parameters of the depth camera, the correspondence between the pixel coordinates of the depth image in step (2) and the three-dimensional space point cloud in the world coordinate system is:

$$x = \frac{(u - c_x)\,z}{f_x}, \qquad y = \frac{(v - c_y)\,z}{f_y}, \qquad z = d(u, v)$$

wherein $(u, v)$ is the coordinate position of each pixel in the image, $d(u, v)$ is the depth value recorded at that pixel, $f_x, f_y$ are the depth camera focal lengths, and $(c_x, c_y)$ is the depth camera center point.
Further, the step (3) specifically comprises:
(3.1) setting voxels of different sizes, and uniformly dividing the space to obtain a plurality of space grids;
and (3.2) setting the voxel value corresponding to the space grid with the point cloud data as 1, and setting the voxel values corresponding to the other space grids as 0 to obtain the multi-scale three-dimensional tensors corresponding to different voxel sizes.
Further, the performing spatial time series coding on the three-dimensional tensor corresponding to each time period in the step (4) specifically includes:
(01) scoring the frame images according to a ranking function $S(v_t; u) = u^T \cdot v_t$;
wherein $u^T$ represents the transpose of the parameter vector obtained by optimizing the ranking function, $v_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(x_\tau)$ represents the mean feature of the first $t$ frames, $x_t$ represents the $t$-th frame depth image, and $\psi(x_t)$ represents the three-dimensional tensor obtained by voxelizing the $t$-th frame depth image;
(02) optimizing the parameter $u$ of the ranking function through a RankSVM, so that frame images later in the time series receive higher scores;
(03) reshaping the optimal value of the parameter $u$ into a $W \times H \times D$ tensor, which serves as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensors of this time period after spatial time-series coding; $H, W, D$ respectively represent the number of voxels along the X axis, Y axis and Z axis of the three-dimensional space in which the point cloud is voxelized at this scale.
Further, the step (5) specifically comprises:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to the element values at those positions to obtain M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$, wherein m is the number of video segments obtained by temporal division of the depth video, M represents the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c represents the motion information of the tensor value at the corresponding coordinate position;
(5.2) randomly selecting K of the M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$ as the human behavior space-time features.
Further, after data enhancement is performed on the human behavior space-time features by means of rotation and translation, they are input into the trained 3D target point cloud classification model for classification.
Further, the 3D target point cloud classification model comprises a multilayer perceptron and a non-local mechanism NetVLAD network which are sequentially connected;
the multilayer perceptron is used for sampling and grouping the human behavior space-time characteristics and extracting the characteristics of each group of behavior space-time characteristics to obtain a plurality of groups of local characteristics;
the non-local mechanism NetVLAD network is used for aggregating the plurality of groups of local features to obtain non-local features.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The method effectively screens the spatial position of the human body by detecting human behavior in the planar depth image, and converts the depth image containing human behavior into a point cloud sequence, restoring the spatial information of the behavior and making richer behavior features available later; on this basis, the point cloud sequence is voxelized and spatially time-series coded, so that the spatial geometric characteristics of the depth video are fully mined and the accuracy of human behavior recognition is effectively improved.
(2) The method adds a self-attention-based non-local region feature fusion module on top of the existing point cloud classification network PointNet++, combining global behavior features with salient local motion features to further improve the accuracy of human behavior recognition.
(3) The input of the classification network is further data-enhanced: the input point cloud is randomly rotated by arbitrary angles in space, making the classification model more robust to human behaviors observed from different viewing angles.
Drawings
Fig. 1 is a schematic flow chart of a depth video human body behavior recognition method based on three-dimensional spatial time sequence modeling according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating the physical meaning of the point cloud corresponding to an original depth image;
fig. 3 is a visualization result obtained after the point cloud sequence provided by the embodiment of the present invention is voxelized;
fig. 4 is a visualization result of a 3-dimensional space tensor obtained after a voxelized sequence is subjected to spatial time sequence coding according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a point cloud classification network structure according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a self-attention-based non-local region feature fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a depth video human body behavior recognition method based on three-dimensional time sequence modeling, which comprises: human target detection, conversion of depth images into point clouds, point cloud voxelization, time-series coding of the voxelized tensors, sampling the coded tensors to obtain space-time features, and feeding the features into a point cloud classification network for training and testing. The method is described in detail below through an example.
As shown in fig. 1, the depth video human body behavior recognition method based on three-dimensional spatial time series modeling provided by the embodiment of the present invention includes the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
specifically, framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behavior; and each frame of depth image is represented as an A-B matrix (A, B is the length and width of the depth image based on the number of pixels respectively), the index of each matrix value corresponds to the pixel coordinate of the position, and each matrix value corresponds to the distance between the position point of the pixel coordinate and the depth camera.
(2) Converting pixel coordinates of a depth image containing human body behaviors into three-dimensional space point cloud data;
Specifically, as shown in fig. 2, the camera coordinate point, namely the optical center, is taken as the origin O of the world coordinate system, the depth image center point is O', a point M' in the depth image corresponds to a point M in the world coordinate system, and the projection of M on the world-coordinate z axis is the point A. The similar triangles OM'O' ∼ OMA give the mapping relation

$$\frac{O'M'}{AM} = \frac{OO'}{OA} = \frac{f}{z}$$

where f is the camera focal length and z the depth of the point M.
Further, according to the camera internal parameters, the correspondence between the pixel coordinates of the depth image and the three-dimensional space point cloud in the world coordinate system is:

$$x = \frac{(u - c_x)\,z}{f_x}, \qquad y = \frac{(v - c_y)\,z}{f_y}, \qquad z = d(u, v)$$

wherein $(u, v)$ is the coordinate position of each pixel in the image, $d(u, v)$ is the depth value recorded at that pixel, $f_x, f_y$ are the depth camera focal lengths, and $(c_x, c_y)$ is the depth camera center point.
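As an illustration of this conversion, here is a minimal Python/NumPy sketch of the back-projection formula above; the helper name is our own, and only pixels inside the labeling frame (nonzero depth after step (1)) are kept:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (A, B) depth image into an (N, 3) point cloud
    using x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth."""
    v, u = np.nonzero(depth)              # rows (v) and columns (u) with depth
    z = depth[v, u].astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```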
(3) Performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
Specifically, step (3) includes: (3.1) setting voxels of different sizes and uniformly dividing the space to obtain a plurality of space grids; (3.2) setting the voxel value of each space grid containing point cloud data to 1 and the voxel values of the remaining grids to 0, obtaining multi-scale three-dimensional tensors corresponding to the different voxel sizes. Each frame of depth image obtained in step (2) is converted into point cloud data (x, y, z); the maximum and minimum values of all point cloud sets along the x, y and z axes over a video segment are computed and denoted hx, hy, hz and lx, ly, lz, which give the spatial extent of the human behavior in the world coordinate system. With the voxel size set to a, the number of voxels occupied by the depth behavior in 3D space is

$$W = \left\lceil \frac{h_x - l_x}{a} \right\rceil, \qquad H = \left\lceil \frac{h_y - l_y}{a} \right\rceil, \qquad D = \left\lceil \frac{h_z - l_z}{a} \right\rceil$$
The visualization result of the depth image containing human body behaviors through point cloud and voxelization is shown in fig. 3.
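The voxelization of step (3) can be sketched as follows (a minimal Python/NumPy illustration under the notation above; the clipping of boundary points and the example voxel sizes are our own assumptions):

```python
import numpy as np

def voxelize(points, lo, hi, a):
    """points: (N, 3) point cloud; lo = (lx, ly, lz), hi = (hx, hy, hz);
    a: voxel size. Returns a binary W x H x D occupancy tensor."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    dims = np.ceil((hi - lo) / a).astype(int)       # (W, H, D)
    idx = np.floor((points - lo) / a).astype(int)
    idx = np.clip(idx, 0, dims - 1)                 # keep boundary points in range
    grid = np.zeros(dims, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1       # 1 where point cloud data exists
    return grid

# Multi-scale tensors are obtained by repeating this at several voxel sizes, e.g.:
# tensors = [voxelize(pts, lo, hi, a) for a in (0.05, 0.10)]
```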
(4) Uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
Specifically, the spatial time-series coding of the three-dimensional tensor corresponding to each time period in step (4) includes:
(01) scoring the frame images according to the ranking function $S(v_t; u) = u^T \cdot v_t$, wherein $u^T$ represents the transpose of the parameter vector obtained by optimizing the ranking function, $v_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(x_\tau)$ represents the mean feature of the first $t$ frames, $x_t$ represents the $t$-th frame depth image, and $\psi(x_t)$ represents the three-dimensional tensor obtained by voxelizing the $t$-th frame depth image;
(02) optimizing the parameter $u$ of the ranking function through a RankSVM, so that frame images later in the time series receive higher scores;
(03) reshaping the optimal value of the parameter $u$ into a $W \times H \times D$ tensor, which serves as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensors of this time period after spatial time-series coding; $H, W, D$ respectively represent the number of voxels along the x, y and z axes of the point cloud voxelized at this scale.
The present invention uses a structural risk minimization and maximum-margin optimization framework; the objective optimization problem can be expressed as:

$$u^* = \operatorname*{arg\,min}_{u} \; \frac{\lambda}{2}\|u\|^2 + \frac{2}{N(N-1)} \sum_{q > t} \max\{0,\; 1 - S(v_q; u) + S(v_t; u)\}$$

The first term is a regularization term and the second term is a hinge-loss error penalty term. The above formula is a convex optimization problem and can be solved with RankSVM; the optimized parameter $u^*$ can serve as a new representation of the entire feature tensor sequence. After reshaping, $u^*$ becomes a $W \times H \times D$ 3-dimensional tensor feature whose dimensions are consistent with those of $\psi(x_t)$.
Simplifying the above formula, let $d$ denote the parameter $u$ to be obtained. Starting from $\vec{d} = \vec{0}$, the first approximate solution given by a single gradient step is

$$d^* \propto \sum_{q > t} (v_q - v_t)$$

Expanding each time-averaged feature $v_t = \frac{1}{t}\sum_{\tau=1}^{t}\psi(x_\tau)$ and collecting the coefficient of each $\psi(x_t)$, one can therefore obtain

$$d^* \propto \sum_{t=1}^{N} \alpha_t \, \psi(x_t)$$

where, summing the series on the left,

$$\alpha_t = 2(N - t + 1) - (N + 1)(H_N - H_{t-1}), \qquad H_t = \sum_{i=1}^{t} \frac{1}{i}$$

The $W \times H \times D$ tensor feature finally desired therefore becomes:

$$d^* = \sum_{t=1}^{N} \alpha_t \, \psi(x_t)$$
In the present embodiment, $\alpha_t = 2(N - t + 1)$ is used to process the tensor feature sequence; that is, the second term of the formula $\alpha_t = 2(N - t + 1) - (N + 1)(H_N - H_{t-1})$ is omitted, which does not affect the coding effect and saves considerable time. The visualization of the results after rank pooling is shown in fig. 4: the original video is equally divided into four segments, adjacent segments overlapping by 1/2, and together with the original full-sequence time period a total of 5 time-series 3-dimensional tensors are obtained.
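A minimal Python/NumPy sketch of this simplified coding, using the truncated weights $\alpha_t = 2(N - t + 1)$ of this embodiment applied to the per-frame voxel tensors $\psi(x_t)$ of one time period (the function name is illustrative):

```python
import numpy as np

def rank_pool(voxel_seq):
    """voxel_seq: (N, W, H, D) sequence of per-frame voxel tensors psi(x_t).
    Returns the W x H x D spatially time-series-coded tensor."""
    seq = np.asarray(voxel_seq, dtype=np.float64)
    n = seq.shape[0]
    t = np.arange(1, n + 1)
    alpha = 2.0 * (n - t + 1)          # simplified alpha_t = 2(N - t + 1)
    # full weights would be: alpha - (n + 1) * (H_n - H_{t-1}), H_k = sum_{i<=k} 1/i
    return np.tensordot(alpha, seq, axes=(0, 0))
```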
(5) Converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
specifically, the step (5) specifically includes:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to the element values at those positions to obtain M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$, wherein m is the number of video segments obtained by temporal division of the depth video, M represents the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c represents the motion information at the corresponding coordinate (x, y, z);
more specifically, for the tensor obtained after coding, the tensor index represents spatial position information and the element value represents the spatial time-series information obtained by coding; a tensor value of 0 indicates that there is no motion information at the corresponding position. For the 3-dimensional tensors obtained over the several time periods, voxels whose values are all 0 at the same index are screened out, the spatial information of the tensor indices and the motion information of the tensor values are extracted, and the information is stored in the high-dimensional point cloud format $(x, y, z, c_1, \ldots, c_5)$, thereby obtaining M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$.
(5.2) K of the M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$ are randomly selected as the human behavior space-time features.
More specifically, if M is less than K, all M points are selected, and then (K − M) points are randomly drawn from the M points as repeated points, finally yielding K point data as the input of the classification network. The value of K is set according to the input size of the network model and the overall level of M; the embodiment of the present invention uses K = 2048.
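A minimal Python/NumPy sketch of step (5) under these conventions, assuming the m coded tensors of one scale are stacked into an (m, W, H, D) array (names are illustrative):

```python
import numpy as np

def tensors_to_features(coded, k=2048):
    """coded: (m, W, H, D) tensor space-time sequence -> (k, 3 + m) features."""
    mask = np.any(coded != 0, axis=0)           # screen out all-zero voxels
    xyz = np.argwhere(mask).astype(np.float64)  # (M, 3) spatial positions
    c = coded[:, mask].T                        # (M, m) motion information
    points = np.concatenate([xyz, c], axis=1)   # (M, 3 + m) high-dim point cloud
    m_pts = points.shape[0]
    if m_pts >= k:
        sel = np.random.choice(m_pts, k, replace=False)
    else:                                       # keep all M, repeat K - M points
        sel = np.concatenate([np.arange(m_pts),
                              np.random.choice(m_pts, k - m_pts)])
    return points[sel]
```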
(6) And inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
Specifically, before inputting a point cloud classification model, the embodiment of the invention performs data enhancement on human behavior space-time characteristics by adopting a rotational translation data enhancement mode, wherein a rotational formula is as follows:
Figure BDA0002240714830000091
Figure BDA0002240714830000092
wherein R is x 、R y A rotation matrix, beta and alpha table, representing the rotation of the point cloud around the x and y axes in the world coordinate systemThe degree of rotation is shown, and through matrix multiplication, the point cloud rotation process can be represented as: x ═ x (R) x *R y ) T In the embodiment of the invention, the beta range is-10 degrees to +10 degrees, and the alpha range is-45 degrees to +45 degrees. After the rotation data enhancement is carried out on the data set, the robustness of the model of the invention to behaviors under different visual angles can be improved.
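A minimal Python/NumPy sketch of this rotation enhancement (the angle ranges follow the embodiment; applying the rotation only to the spatial coordinates, not to the motion channels, is our reading):

```python
import numpy as np

def rotate_point_cloud(xyz):
    """xyz: (K, 3) spatial coordinates; returns X' = X (Rx Ry)^T."""
    beta = np.deg2rad(np.random.uniform(-10, 10))    # rotation about the x axis
    alpha = np.deg2rad(np.random.uniform(-45, 45))   # rotation about the y axis
    rx = np.array([[1, 0, 0],
                   [0, np.cos(beta), -np.sin(beta)],
                   [0, np.sin(beta),  np.cos(beta)]])
    ry = np.array([[ np.cos(alpha), 0, np.sin(alpha)],
                   [0, 1, 0],
                   [-np.sin(alpha), 0, np.cos(alpha)]])
    return xyz @ (rx @ ry).T
```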
On the basis of the weight-shared multilayer perceptron neural network adopted by the existing point cloud classification network PointNet++, a self-attention-based non-local region feature fusion module is adopted to fuse the local features mapped into the high-dimensional space. This further explores the relationships among point clouds, captures the dependencies among point sets at different positions, obtains the commonalities among spatial positions under different human motions, and strengthens the discriminative power of the behavior features. The behavior classification model structure is shown in fig. 5 and comprises a multilayer perceptron and a non-local mechanism NetVLAD network connected in sequence. The model first samples and groups the input point cloud using a nearest-neighbor method, feeds each grouped point cloud into a weight-shared multilayer perceptron to obtain local features, and then aggregates the groups of local features with the non-local mechanism NetVLAD network.
The structure of the non-local mechanism NetVLAD network is shown in fig. 6. Let the input be N d-dimensional point features $\{x_i\}$ and the VLAD parameters be K cluster centers $\{c_k\}$; the final VLAD output is a descriptive feature V of dimension $K \times d$. The aggregation formula of the NetVLAD network is as follows:

$$V(j, k) = \sum_{i=1}^{N} a_k(x_i)\,\big(x_i(j) - c_k(j)\big)$$

wherein $a_k(x_i)$ indicates whether the point belongs to the $k$-th cluster, and can be approximated in softmax form as:

$$a_k(x_i) = \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}$$
wherein $w_k$ and $b_k$ are parameters learnable by the network, and $w_k^T x_i + b_k$ encodes the distance between the point feature and the $k$-th cluster center point; the network thus obtains the aggregated feature V. On this basis, a non-local feature module is adopted to mine the correlation between the features obtained from the different cluster centers of the VLAD. The calculation formula of the non-local feature is:

$$y_i = \frac{1}{C(V)} \sum_{\forall j} f(V_i, V_j)\, g(V_j)$$
Let the input V be a feature of shape $K \times C$; $V_i$, a column vector of length C, is the feature obtained from one NetVLAD cluster center, and $i$ indicates the position. $f$ is used to calculate the similarity between the feature vectors of two points, and $g$ is a mapping function that can be implemented with a multilayer perceptron.
Various similarity measurement functions f can be chosen, such as the Gaussian and the embedded Gaussian measures; the invention adopts the embedded Gaussian form:

$$f(V_i, V_j) = e^{\theta(V_i)^T \phi(V_j)}$$
wherein the functions θ and φ are expressed by linear mapping (perceptron) functions:

$$\theta(V_i) = W_\theta V_i, \qquad \phi(V_j) = W_\phi V_j$$
The final formula can be expressed as follows:

$$y_i = \frac{\sum_{\forall j} e^{\theta(V_i)^T \phi(V_j)}\, g(V_j)}{\sum_{\forall j} e^{\theta(V_i)^T \phi(V_j)}}$$
the output of the non-local NetVLAD module is used as the output of the non-local NetVLAD module, and the next stage of feature learning is carried out.
The embodiment of the invention uses the NTU RGBD behavior dataset proposed by Nanyang Technological University, Singapore: features are extracted and coded according to steps (1)-(5), and the classification network of step (6) is then trained end to end; the invention is, however, not limited to this dataset. The NTU RGBD120 dataset is used during training, so the output covers 120 classes, comprising 82 classes of daily actions, 12 classes of medical actions, and 26 classes of multi-person interaction actions. The classification results show that the proposed method effectively captures 3D human motion information and achieves the best performance in classification.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A depth video human body behavior recognition method based on three-dimensional space time sequence modeling is characterized by comprising the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
(2) converting the pixel coordinates of the depth image into three-dimensional space point cloud data;
(3) performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor; the step (3) specifically comprises the following steps:
(3.1) setting voxels of different sizes, and uniformly dividing the space to obtain a plurality of space grids;
(3.2) setting the voxel value corresponding to the space grid with the point cloud data as 1, and setting the voxel values corresponding to the other space grids as 0 to obtain multi-scale three-dimensional tensors corresponding to different voxel sizes;
(4) uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence; performing space time sequence coding on the three-dimensional tensor corresponding to each time period in the step (4), specifically comprising:
(01) scoring the frame images according to a ranking function $S(v_t; u) = u^T \cdot v_t$;
wherein $u^T$ represents the transpose of the parameter vector obtained by optimizing the ranking function, $v_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(x_\tau)$ represents the mean feature of the first $t$ frames, $x_t$ represents the $t$-th frame depth image, and $\psi(x_t)$ represents the three-dimensional tensor obtained by voxelizing the $t$-th frame depth image;
(02) optimizing the parameter $u$ of the ranking function through a RankSVM, so that frame images later in the time series receive higher scores;
(03) reshaping the optimal value of the parameter $u$ into a $W \times H \times D$ tensor, which serves as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensors of this time period after spatial time-series coding; $H, W, D$ respectively represent the number of voxels of the voxelized point cloud along the X axis, Y axis and Z axis of the three-dimensional space;
(5) converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
(6) and inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
2. The method for recognizing the human body behavior based on the depth video of the three-dimensional space time sequence modeling as claimed in claim 1, wherein the step (1) specifically comprises:
(1.1) framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behaviors;
(1.2) each frame of depth image is represented as an A×B matrix, and matrix values corresponding to non-human body positions outside the labeling frame are set to 0; wherein the index of each matrix value corresponds to the pixel coordinate of the location, and each matrix value corresponds to the distance between the location point of the pixel coordinate and the depth camera.
3. The method for recognizing the human body behaviors through the depth video based on the three-dimensional time-series modeling as claimed in claim 1 or 2, wherein according to the depth camera internal parameters, the correspondence relationship between the pixel coordinates of the depth image in the step (2) and the three-dimensional space point cloud under the world coordinate system is as follows:
$$x = \frac{(u - c_x)\,z}{f_x}, \qquad y = \frac{(v - c_y)\,z}{f_y}, \qquad z = d(u, v)$$

wherein $(u, v)$ is the coordinate position of each pixel in the image, $d(u, v)$ is the depth value recorded at that pixel, $f_x, f_y$ are the depth camera focal lengths, and $(c_x, c_y)$ is the depth camera center point.
4. The method for recognizing the human body behavior based on the depth video of the three-dimensional space time sequence modeling as claimed in claim 1, wherein the step (5) specifically comprises:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to the element values at those positions to obtain M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$, wherein m is the number of video segments obtained by temporal division of the depth video, M represents the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c represents the motion information at the position of the corresponding coordinate (x, y, z);
(5.2) randomly selecting K of the M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$ as the human behavior space-time features.
5. The method for recognizing the human body behaviors through the depth video based on the three-dimensional time sequence modeling as claimed in claim 1, wherein after data enhancement is performed on the human body behavior space-time characteristics through a data enhancement mode of rotation and translation, a trained 3D target point cloud classification model is input for classification.
6. The method according to claim 5, wherein the 3D target point cloud classification model comprises a multilayer perceptron and a non-local mechanism NetVLAD network which are connected in sequence;
the multilayer perceptron is used for sampling and grouping the human behavior space-time characteristics and extracting the characteristics of each group of behavior space-time characteristics to obtain a plurality of groups of local characteristics;
the non-local mechanism NetVLAD network aggregates a plurality of groups of local features to obtain the non-local features.
CN201910999089.XA 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling Active CN110852182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910999089.XA CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910999089.XA CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Publications (2)

Publication Number Publication Date
CN110852182A (en) 2020-02-28
CN110852182B (en) 2022-09-20

Family

ID=69596732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910999089.XA Active CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Country Status (1)

Country Link
CN (1) CN110852182B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932822A (en) * 2020-07-11 2020-11-13 广州融康汽车科技有限公司 Passenger body position alarm device
CN112001298B (en) * 2020-08-20 2021-09-21 佳都科技集团股份有限公司 Pedestrian detection method, device, electronic equipment and storage medium
CN112215101A (en) * 2020-09-27 2021-01-12 武汉科技大学 Attention mechanism-based three-dimensional target identification method and system
CN113269218B (en) * 2020-12-30 2023-06-09 威创集团股份有限公司 Video classification method based on improved VLAD algorithm
CN112989930A (en) * 2021-02-04 2021-06-18 西安美格智联软件科技有限公司 Method, system, medium and terminal for automatically monitoring fire fighting channel blockage
CN113111760B (en) * 2021-04-07 2023-05-02 同济大学 Light-weight graph convolution human skeleton action recognition method based on channel attention
CN113536892B (en) * 2021-05-13 2023-11-21 泰康保险集团股份有限公司 Gesture recognition method and device, readable storage medium and electronic equipment
CN113536997B (en) * 2021-07-01 2022-11-22 深圳中智明科智能科技有限公司 Intelligent security system and method based on image recognition and behavior analysis
CN115131562B (en) * 2022-07-08 2023-06-13 北京百度网讯科技有限公司 Three-dimensional scene segmentation method, model training method, device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955682A (en) * 2014-05-22 2014-07-30 深圳市赛为智能股份有限公司 Behavior recognition method and device based on SURF interest points
CN105894571A (en) * 2016-01-22 2016-08-24 冯歆鹏 Multimedia information processing method and device
CN109993103A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of Human bodys' response method based on point cloud data
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474160B2 (en) * 2017-07-03 2019-11-12 Baidu Usa Llc High resolution 3D point clouds generation from downsampled low resolution LIDAR 3D point clouds and camera images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955682A (en) * 2014-05-22 2014-07-30 深圳市赛为智能股份有限公司 Behavior recognition method and device based on SURF interest points
CN105894571A (en) * 2016-01-22 2016-08-24 冯歆鹏 Multimedia information processing method and device
CN109993103A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of Human bodys' response method based on point cloud data
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Action Recognition for Depth Video using Multi-view Dynamic Images";Yang Xiao et al.;《arXiv》;20181227;第1-48页 *
"多视角深度运动图的人体行为识别";刘婷婷 等;《中国图象图形学报》;20190331;第400-409页 *

Also Published As

Publication number Publication date
CN110852182A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Shi et al. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network
Yang et al. Pixor: Real-time 3d object detection from point clouds
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
CN108491880B (en) Object classification and pose estimation method based on neural network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN110706248A (en) Visual perception mapping algorithm based on SLAM and mobile robot
CN104063702B (en) Three-dimensional gait recognition based on shielding recovery and partial similarity matching
CN106295568A (en) The mankind's naturalness emotion identification method combined based on expression and behavior bimodal
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN104182765A (en) Internet image driven automatic selection method of optimal view of three-dimensional model
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Wang et al. An overview of 3d object detection
CN112784736A (en) Multi-mode feature fusion character interaction behavior recognition method
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN112396655B (en) Point cloud data-based ship target 6D pose estimation method
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN111914643A (en) Human body action recognition method based on skeleton key point detection
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
Wang et al. Paccdu: pyramid attention cross-convolutional dual unet for infrared and visible image fusion
Fei et al. Self-supervised learning for pre-training 3d point clouds: A survey
CN114639115A (en) 3D pedestrian detection method based on fusion of human body key points and laser radar
CN114299339A (en) Three-dimensional point cloud model classification method and system based on regional correlation modeling
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
Kanaujia et al. Part segmentation of visual hull for 3d human pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant