CN110852182B - Depth video human body behavior recognition method based on three-dimensional space time sequence modeling - Google Patents

Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Info

Publication number
CN110852182B
CN110852182B (application CN201910999089.XA)
Authority
CN
China
Prior art keywords
space
dimensional
point cloud
human body
time sequence
Prior art date
Legal status
Active
Application number
CN201910999089.XA
Other languages
Chinese (zh)
Other versions
CN110852182A (en)
Inventor
Yang Xiao (肖阳)
Yancheng Wang (王焱乘)
Zhiguo Cao (曹治国)
Wenxiang Jiang (姜文祥)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910999089.XA
Publication of CN110852182A
Application granted
Publication of CN110852182B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, belonging to the field of digital image recognition and comprising the following steps: marking the human body position in the depth image frame by frame; converting the depth image containing human body behaviors into three-dimensional space point cloud data; voxelizing the three-dimensional space point cloud data at different scales to obtain multi-scale three-dimensional tensors; uniformly dividing the three-dimensional tensors of the same scale into a plurality of time periods, and performing spatial time-series coding on the three-dimensional tensors of each time period to obtain a multi-scale, multi-time-period three-dimensional tensor space-time sequence; converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data and sampling it randomly to obtain human behavior space-time features; and inputting the human behavior space-time features into a trained 3D target point cloud classification model for classification to obtain the behavior classification result. The invention fully mines the three-dimensional information of the depth image and achieves efficient and robust recognition of diverse human behaviors.

Description

Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Technical Field
The invention belongs to the field of digital image recognition, and particularly relates to a depth video human body behavior recognition method based on three-dimensional space time sequence modeling.
Background
In the field of computer vision, human behavior recognition based on depth video has attracted increasing attention from researchers and has become a research hotspot; the technology is widely applied to video surveillance, multimedia data analysis, human-computer interaction, and other areas.
At present, depth video behavior recognition methods fall into three main categories: methods based on the human skeleton, methods based on the original depth map, and methods fusing skeleton and depth map. The skeleton-based method is currently the most common: free from environmental noise, the human skeleton describes the posture of human motion simply and clearly, and such methods have achieved good results on existing behavior datasets. However, they presuppose that the human skeleton is accurately estimated; current skeleton extraction techniques cannot guarantee completely correct extraction, and in special environments skeleton information is difficult to obtain at all. The depth-image-based method projects human behavior from the 3D spatio-temporal space onto 2D planes for recognition, where more environment and subject information is available; yet the 3D information of the behavior is still not effectively extracted, and because environmental noise is prominent in the 2D plane, the spatio-temporal information of the behavior is difficult to mine and extract effectively, placing high demands on the robustness and fitting capability of the algorithm model.
In general, existing depth video behavior recognition methods suffer from inaccurate human skeleton extraction, under-exploited depth video information, and susceptibility to environmental noise, which together lead to low recognition accuracy.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, and aims to solve the technical problems that existing depth video behavior recognition methods, which recognize on two-dimensional planes, cannot effectively extract the three-dimensional information of the depth video, are easily affected by environmental noise, and therefore achieve low recognition accuracy.
In order to achieve the purpose, the invention provides a depth video human body behavior recognition method based on three-dimensional space time sequence modeling, which comprises the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
(2) converting the pixel coordinates of the depth image into three-dimensional space point cloud data;
(3) performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
(4) uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
(5) converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
(6) and inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
Further, the step (1) specifically comprises:
(1.1) framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behaviors;
(1.2) each frame of depth image is represented as an A×B matrix, and matrix values corresponding to non-human body positions outside the labeling frame are set to 0; wherein the index of each matrix value corresponds to the pixel coordinate of the location, and each matrix value corresponds to the distance between the location point of the pixel coordinate and the depth camera.
Further, according to the internal parameters of the depth camera, the correspondence between the pixel coordinates of the depth image in step (2) and the three-dimensional space point cloud in the world coordinate system is:

$$x = \frac{(u - c_x)\,z}{f_x}, \qquad y = \frac{(v - c_y)\,z}{f_y}, \qquad z = d(u, v)$$

wherein $(u, v)$ is the coordinate position of each pixel in the image, $d(u, v)$ is the depth value recorded at that pixel, $f_x, f_y$ are the depth camera focal lengths, and $(c_x, c_y)$ is the depth camera center point.
Further, the step (3) specifically comprises:
(3.1) setting voxels of different sizes, and uniformly dividing the space to obtain a plurality of space grids;
and (3.2) setting the voxel value corresponding to the space grid with the point cloud data as 1, and setting the voxel values corresponding to the other space grids as 0 to obtain the multi-scale three-dimensional tensors corresponding to different voxel sizes.
Further, the performing spatial time series coding on the three-dimensional tensor corresponding to each time period in the step (4) specifically includes:
(01) scoring the frame images according to a ranking function $S(v_t; u) = u^T \cdot v_t$;
wherein $u^T$ represents the transpose of the parameter vector obtained by optimizing the ranking function, $v_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(x_\tau)$ represents the mean feature of the first $t$ frames, $x_t$ represents the $t$-th frame depth image, and $\psi(x_t)$ represents the three-dimensional tensor obtained by voxelizing the $t$-th frame depth image;
(02) optimizing the parameter $u$ of the ranking function through a RankSVM, so that frame images later in the time series receive higher scores;
(03) reshaping the optimal value of the parameter $u$ into a $W \times H \times D$ tensor, which serves as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensors of this time period after spatial time-series coding; $H, W, D$ respectively represent the number of voxels along the X axis, Y axis and Z axis of the three-dimensional space in which the point cloud is voxelized at this scale.
Further, the step (5) specifically comprises:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to the element values at those positions to obtain M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$, wherein m is the number of video segments obtained by temporal division of the depth video, M represents the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c represents the motion information of the tensor value at the corresponding coordinate position;
(5.2) randomly selecting K of the M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$ as the human behavior space-time features.
Further, after data enhancement is performed on the human behavior space-time features by means of rotation and translation, they are input into the trained 3D target point cloud classification model for classification.
Further, the 3D target point cloud classification model comprises a multilayer perceptron and a non-local mechanism NetVLAD network which are sequentially connected;
the multilayer perceptron is used for sampling and grouping the human behavior space-time characteristics and extracting the characteristics of each group of behavior space-time characteristics to obtain a plurality of groups of local characteristics;
the non-local mechanism NetVLAD network is used for aggregating the plurality of groups of local features to obtain non-local features.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) The method effectively screens the spatial position of the human body by detecting human behavior in the planar depth image, and converts the depth image containing human behavior into a point cloud sequence, restoring the spatial information of the behavior and making richer behavior features available later; on this basis, the point cloud sequence is voxelized and spatially time-series coded, so that the spatial geometric characteristics of the depth video are fully mined and the accuracy of human behavior recognition is effectively improved.
(2) The method adds a self-attention-based non-local region feature fusion module on top of the existing point cloud classification network PointNet++, combining global behavior features with salient local motion features to further improve the accuracy of human behavior recognition.
(3) The input of the classification network is further data-enhanced: the input point cloud is randomly rotated by arbitrary angles in space, making the classification model more robust to human behaviors observed from different viewing angles.
Drawings
Fig. 1 is a schematic flow chart of a depth video human body behavior recognition method based on three-dimensional spatial time sequence modeling according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating the physical meaning of the point cloud corresponding to an original depth image;
fig. 3 is a visualization result obtained after the point cloud sequence provided by the embodiment of the present invention is voxelized;
fig. 4 is a visualization result of a 3-dimensional space tensor obtained after a voxelized sequence is subjected to spatial time sequence coding according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a point cloud classification network structure according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a self-attention-based non-local region feature fusion module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a depth video human body behavior recognition method based on three-dimensional time sequence modeling, which comprises: human target detection, conversion of depth images into point clouds, point cloud voxelization, time-series coding of the voxelized tensors, sampling the coded tensors to obtain space-time features, and feeding the features into a point cloud classification network for training and testing. The method is described in detail below through an example.
As shown in fig. 1, the depth video human body behavior recognition method based on three-dimensional spatial time series modeling provided by the embodiment of the present invention includes the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
specifically, framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behavior; and each frame of depth image is represented as an A-B matrix (A, B is the length and width of the depth image based on the number of pixels respectively), the index of each matrix value corresponds to the pixel coordinate of the position, and each matrix value corresponds to the distance between the position point of the pixel coordinate and the depth camera.
(2) Converting pixel coordinates of a depth image containing human body behaviors into three-dimensional space point cloud data;
Specifically, as shown in fig. 2, the camera coordinate point, namely the optical center, is taken as the origin O of the world coordinate system, the depth image center point is O', a point M' in the depth image corresponds to a point M in the world coordinate system, and the projection of M on the world-coordinate z axis is the point A. The similar triangles OM'O' ∼ OMA give the mapping relation

$$\frac{O'M'}{AM} = \frac{OO'}{OA} = \frac{f}{z}$$

where f is the camera focal length and z the depth of the point M.
Further, according to the camera internal parameters, the correspondence between the pixel coordinates of the depth image and the three-dimensional space point cloud in the world coordinate system is:

$$x = \frac{(u - c_x)\,z}{f_x}, \qquad y = \frac{(v - c_y)\,z}{f_y}, \qquad z = d(u, v)$$

wherein $(u, v)$ is the coordinate position of each pixel in the image, $d(u, v)$ is the depth value recorded at that pixel, $f_x, f_y$ are the depth camera focal lengths, and $(c_x, c_y)$ is the depth camera center point.
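As an illustration of this conversion, here is a minimal Python/NumPy sketch of the back-projection formula above; the helper name is our own, and only pixels inside the labeling frame (nonzero depth after step (1)) are kept:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (A, B) depth image into an (N, 3) point cloud
    using x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth."""
    v, u = np.nonzero(depth)              # rows (v) and columns (u) with depth
    z = depth[v, u].astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```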
(3) Performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor;
Specifically, step (3) includes: (3.1) setting voxels of different sizes and uniformly dividing the space to obtain a plurality of space grids; (3.2) setting the voxel value of each space grid containing point cloud data to 1 and the voxel values of the remaining grids to 0, obtaining multi-scale three-dimensional tensors corresponding to the different voxel sizes. Each frame of depth image obtained in step (2) is converted into point cloud data (x, y, z); the maximum and minimum values of all point cloud sets along the x, y and z axes over a video segment are computed and denoted hx, hy, hz and lx, ly, lz, which give the spatial extent of the human behavior in the world coordinate system. With the voxel size set to a, the number of voxels occupied by the depth behavior in 3D space is

$$W = \left\lceil \frac{h_x - l_x}{a} \right\rceil, \qquad H = \left\lceil \frac{h_y - l_y}{a} \right\rceil, \qquad D = \left\lceil \frac{h_z - l_z}{a} \right\rceil$$
The visualization result of the depth image containing human body behaviors through point cloud and voxelization is shown in fig. 3.
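The voxelization of step (3) can be sketched as follows (a minimal Python/NumPy illustration under the notation above; the clipping of boundary points and the example voxel sizes are our own assumptions):

```python
import numpy as np

def voxelize(points, lo, hi, a):
    """points: (N, 3) point cloud; lo = (lx, ly, lz), hi = (hx, hy, hz);
    a: voxel size. Returns a binary W x H x D occupancy tensor."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    dims = np.ceil((hi - lo) / a).astype(int)       # (W, H, D)
    idx = np.floor((points - lo) / a).astype(int)
    idx = np.clip(idx, 0, dims - 1)                 # keep boundary points in range
    grid = np.zeros(dims, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1       # 1 where point cloud data exists
    return grid

# Multi-scale tensors are obtained by repeating this at several voxel sizes, e.g.:
# tensors = [voxelize(pts, lo, hi, a) for a in (0.05, 0.10)]
```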
(4) Uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence;
Specifically, the spatial time-series coding of the three-dimensional tensor corresponding to each time period in step (4) includes:
(01) scoring the frame images according to the ranking function $S(v_t; u) = u^T \cdot v_t$, wherein $u^T$ represents the transpose of the parameter vector obtained by optimizing the ranking function, $v_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(x_\tau)$ represents the mean feature of the first $t$ frames, $x_t$ represents the $t$-th frame depth image, and $\psi(x_t)$ represents the three-dimensional tensor obtained by voxelizing the $t$-th frame depth image;
(02) optimizing the parameter $u$ of the ranking function through a RankSVM, so that frame images later in the time series receive higher scores;
(03) reshaping the optimal value of the parameter $u$ into a $W \times H \times D$ tensor, which serves as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensors of this time period after spatial time-series coding; $H, W, D$ respectively represent the number of voxels along the x, y and z axes of the point cloud voxelized at this scale.
The present invention uses a structural risk minimization and maximum-margin optimization framework; the objective optimization problem can be expressed as:

$$u^* = \operatorname*{arg\,min}_{u} \; \frac{\lambda}{2}\|u\|^2 + \frac{2}{N(N-1)} \sum_{q > t} \max\{0,\; 1 - S(v_q; u) + S(v_t; u)\}$$

The first term is a regularization term and the second term is a hinge-loss error penalty term. The above formula is a convex optimization problem and can be solved with RankSVM; the optimized parameter $u^*$ can serve as a new representation of the entire feature tensor sequence. After reshaping, $u^*$ becomes a $W \times H \times D$ 3-dimensional tensor feature whose dimensions are consistent with those of $\psi(x_t)$.
Simplifying the above formula, let $d$ denote the parameter $u$ to be obtained. Starting from $\vec{d} = \vec{0}$, the first approximate solution given by a single gradient step is

$$d^* \propto \sum_{q > t} (v_q - v_t)$$

Expanding each time-averaged feature $v_t = \frac{1}{t}\sum_{\tau=1}^{t}\psi(x_\tau)$ and collecting the coefficient of each $\psi(x_t)$, one can therefore obtain

$$d^* \propto \sum_{t=1}^{N} \alpha_t \, \psi(x_t)$$

where, summing the series on the left,

$$\alpha_t = 2(N - t + 1) - (N + 1)(H_N - H_{t-1}), \qquad H_t = \sum_{i=1}^{t} \frac{1}{i}$$

The $W \times H \times D$ tensor feature finally desired therefore becomes:

$$d^* = \sum_{t=1}^{N} \alpha_t \, \psi(x_t)$$
In the present embodiment, $\alpha_t = 2(N - t + 1)$ is used to process the tensor feature sequence; that is, the second term of the formula $\alpha_t = 2(N - t + 1) - (N + 1)(H_N - H_{t-1})$ is omitted, which does not affect the coding effect and saves considerable time. The visualization of the results after rank pooling is shown in fig. 4: the original video is equally divided into four segments, adjacent segments overlapping by 1/2, and together with the original full-sequence time period a total of 5 time-series 3-dimensional tensors are obtained.
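A minimal Python/NumPy sketch of this simplified coding, using the truncated weights $\alpha_t = 2(N - t + 1)$ of this embodiment applied to the per-frame voxel tensors $\psi(x_t)$ of one time period (the function name is illustrative):

```python
import numpy as np

def rank_pool(voxel_seq):
    """voxel_seq: (N, W, H, D) sequence of per-frame voxel tensors psi(x_t).
    Returns the W x H x D spatially time-series-coded tensor."""
    seq = np.asarray(voxel_seq, dtype=np.float64)
    n = seq.shape[0]
    t = np.arange(1, n + 1)
    alpha = 2.0 * (n - t + 1)          # simplified alpha_t = 2(N - t + 1)
    # full weights would be: alpha - (n + 1) * (H_n - H_{t-1}), H_k = sum_{i<=k} 1/i
    return np.tensordot(alpha, seq, axes=(0, 0))
```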
(5) Converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
specifically, the step (5) specifically includes:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to the element values at those positions to obtain M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$, wherein m is the number of video segments obtained by temporal division of the depth video, M represents the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c represents the motion information at the corresponding coordinate (x, y, z);
more specifically, for the tensor obtained after coding, the tensor index represents spatial position information and the element value represents the spatial time-series information obtained by coding; a tensor value of 0 indicates that there is no motion information at the corresponding position. For the 3-dimensional tensors obtained over the several time periods, voxels whose values are all 0 at the same index are screened out, the spatial information of the tensor indices and the motion information of the tensor values are extracted, and the information is stored in the high-dimensional point cloud format $(x, y, z, c_1, \ldots, c_5)$, thereby obtaining M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$.
(5.2) K of the M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$ are randomly selected as the human behavior space-time features.
More specifically, if M is less than K, all M points are selected, and then (K − M) points are randomly drawn from the M points as repeated points, finally yielding K point data as the input of the classification network. The value of K is set according to the input size of the network model and the overall level of M; the embodiment of the present invention uses K = 2048.
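A minimal Python/NumPy sketch of step (5) under these conventions, assuming the m coded tensors of one scale are stacked into an (m, W, H, D) array (names are illustrative):

```python
import numpy as np

def tensors_to_features(coded, k=2048):
    """coded: (m, W, H, D) tensor space-time sequence -> (k, 3 + m) features."""
    mask = np.any(coded != 0, axis=0)           # screen out all-zero voxels
    xyz = np.argwhere(mask).astype(np.float64)  # (M, 3) spatial positions
    c = coded[:, mask].T                        # (M, m) motion information
    points = np.concatenate([xyz, c], axis=1)   # (M, 3 + m) high-dim point cloud
    m_pts = points.shape[0]
    if m_pts >= k:
        sel = np.random.choice(m_pts, k, replace=False)
    else:                                       # keep all M, repeat K - M points
        sel = np.concatenate([np.arange(m_pts),
                              np.random.choice(m_pts, k - m_pts)])
    return points[sel]
```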
(6) And inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
Specifically, before inputting a point cloud classification model, the embodiment of the invention performs data enhancement on human behavior space-time characteristics by adopting a rotational translation data enhancement mode, wherein a rotational formula is as follows:
Figure BDA0002240714830000091
Figure BDA0002240714830000092
wherein R is x 、R y A rotation matrix, beta and alpha table, representing the rotation of the point cloud around the x and y axes in the world coordinate systemThe degree of rotation is shown, and through matrix multiplication, the point cloud rotation process can be represented as: x ═ x (R) x *R y ) T In the embodiment of the invention, the beta range is-10 degrees to +10 degrees, and the alpha range is-45 degrees to +45 degrees. After the rotation data enhancement is carried out on the data set, the robustness of the model of the invention to behaviors under different visual angles can be improved.
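A minimal Python/NumPy sketch of this rotation enhancement (the angle ranges follow the embodiment; applying the rotation only to the spatial coordinates, not to the motion channels, is our reading):

```python
import numpy as np

def rotate_point_cloud(xyz):
    """xyz: (K, 3) spatial coordinates; returns X' = X (Rx Ry)^T."""
    beta = np.deg2rad(np.random.uniform(-10, 10))    # rotation about the x axis
    alpha = np.deg2rad(np.random.uniform(-45, 45))   # rotation about the y axis
    rx = np.array([[1, 0, 0],
                   [0, np.cos(beta), -np.sin(beta)],
                   [0, np.sin(beta),  np.cos(beta)]])
    ry = np.array([[ np.cos(alpha), 0, np.sin(alpha)],
                   [0, 1, 0],
                   [-np.sin(alpha), 0, np.cos(alpha)]])
    return xyz @ (rx @ ry).T
```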
On the basis of the weight-shared multilayer perceptron neural network adopted by the existing point cloud classification network PointNet++, a self-attention-based non-local region feature fusion module is adopted to fuse the local features mapped into the high-dimensional space. This further explores the relationships among point clouds, captures the dependencies among point sets at different positions, obtains the commonalities among spatial positions under different human motions, and strengthens the discriminative power of the behavior features. The behavior classification model structure is shown in fig. 5 and comprises a multilayer perceptron and a non-local mechanism NetVLAD network connected in sequence. The model first samples and groups the input point cloud using a nearest-neighbor method, feeds each grouped point cloud into a weight-shared multilayer perceptron to obtain local features, and then aggregates the groups of local features with the non-local mechanism NetVLAD network.
The structure of the non-local mechanism NetVLAD network is shown in fig. 6. Let the input be N d-dimensional point features $\{x_i\}$ and the VLAD parameters be K cluster centers $\{c_k\}$; the final VLAD output is a descriptive feature V of dimension $K \times d$. The aggregation formula of the NetVLAD network is as follows:

$$V(j, k) = \sum_{i=1}^{N} a_k(x_i)\,\big(x_i(j) - c_k(j)\big)$$

wherein $a_k(x_i)$ indicates whether the point belongs to the $k$-th cluster, and can be approximated in softmax form as:

$$a_k(x_i) = \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}$$
wherein $w_k$ and $b_k$ are parameters learnable by the network, and $w_k^T x_i + b_k$ encodes the distance between the point feature and the $k$-th cluster center point; the network thus obtains the aggregated feature V. On this basis, a non-local feature module is adopted to mine the correlation between the features obtained from the different cluster centers of the VLAD. The calculation formula of the non-local feature is:

$$y_i = \frac{1}{C(V)} \sum_{\forall j} f(V_i, V_j)\, g(V_j)$$
Let the input V be a feature of shape $K \times C$; $V_i$, a column vector of length C, is the feature obtained from one NetVLAD cluster center, and $i$ indicates the position. $f$ is used to calculate the similarity between the feature vectors of two points, and $g$ is a mapping function that can be implemented with a multilayer perceptron.
Various similarity measurement functions f can be chosen, such as the Gaussian and the embedded Gaussian measures; the invention adopts the embedded Gaussian form:

$$f(V_i, V_j) = e^{\theta(V_i)^T \phi(V_j)}$$
wherein the functions θ and φ are expressed by linear mapping (perceptron) functions:

$$\theta(V_i) = W_\theta V_i, \qquad \phi(V_j) = W_\phi V_j$$
The final formula can be expressed as follows:

$$y_i = \frac{\sum_{\forall j} e^{\theta(V_i)^T \phi(V_j)}\, g(V_j)}{\sum_{\forall j} e^{\theta(V_i)^T \phi(V_j)}}$$
the output of the non-local NetVLAD module is used as the output of the non-local NetVLAD module, and the next stage of feature learning is carried out.
The embodiment of the invention uses the NTU RGBD behavior dataset proposed by Nanyang Technological University, Singapore: features are extracted and coded according to steps (1)-(5), and the classification network of step (6) is then trained end to end; the invention is, however, not limited to this dataset. The NTU RGBD120 dataset is used during training, so the output covers 120 classes, comprising 82 classes of daily actions, 12 classes of medical actions, and 26 classes of multi-person interaction actions. The classification results show that the proposed method effectively captures 3D human motion information and achieves the best performance in classification.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A depth video human body behavior recognition method based on three-dimensional space time sequence modeling is characterized by comprising the following steps:
(1) marking the human body position in the depth image frame by frame to obtain a depth image containing the human body behavior part;
(2) converting the pixel coordinates of the depth image into three-dimensional space point cloud data;
(3) performing voxelization on the three-dimensional space point cloud data in different scales to obtain a multi-scale three-dimensional tensor; the step (3) specifically comprises the following steps:
(3.1) setting voxels of different sizes, and uniformly dividing the space to obtain a plurality of space grids;
(3.2) setting the voxel value corresponding to the space grid with the point cloud data as 1, and setting the voxel values corresponding to the other space grids as 0 to obtain multi-scale three-dimensional tensors corresponding to different voxel sizes;
(4) uniformly dividing the three-dimensional tensor of the same scale into a plurality of time periods, and carrying out space time sequence coding on the three-dimensional tensor corresponding to each time period to obtain a multi-scale multi-time period three-dimensional tensor space-time sequence; performing space time sequence coding on the three-dimensional tensor corresponding to each time period in the step (4), specifically comprising:
(01) scoring the frame images according to a ranking function $S(v_t; u) = u^T \cdot v_t$;
wherein $u^T$ represents the transpose of the parameter vector obtained by optimizing the ranking function, $v_t = \frac{1}{t}\sum_{\tau=1}^{t} \psi(x_\tau)$ represents the mean feature of the first $t$ frames, $x_t$ represents the $t$-th frame depth image, and $\psi(x_t)$ represents the three-dimensional tensor obtained by voxelizing the $t$-th frame depth image;
(02) optimizing the parameter $u$ of the ranking function through a RankSVM, so that frame images later in the time series receive higher scores;
(03) reshaping the optimal value of the parameter $u$ into a $W \times H \times D$ tensor, which serves as the three-dimensional tensor space-time sequence of the same-scale three-dimensional tensors of this time period after spatial time-series coding; $H, W, D$ respectively represent the number of voxels of the voxelized point cloud along the X axis, Y axis and Z axis of the three-dimensional space;
(5) converting the three-dimensional tensor space-time sequence into high-dimensional space point cloud data, and randomly sampling the high-dimensional space point cloud data to obtain human behavior space-time characteristics;
(6) and inputting the human body behavior space-time characteristics into a trained 3D target point cloud classification model for classification to obtain a behavior classification result.
2. The method for recognizing the human body behavior based on the depth video of the three-dimensional space time sequence modeling as claimed in claim 1, wherein the step (1) specifically comprises:
(1.1) framing out the human behavior part in each frame of depth image by using human skeleton information to obtain a labeling frame containing human behaviors;
(1.2) each frame of depth image is represented as an A×B matrix, and matrix values corresponding to non-human body positions outside the labeling frame are set to 0; wherein the index of each matrix value corresponds to the pixel coordinate of the location, and each matrix value corresponds to the distance between the location point of the pixel coordinate and the depth camera.
3. The method for recognizing the human body behaviors through the depth video based on the three-dimensional time-series modeling as claimed in claim 1 or 2, wherein according to the depth camera internal parameters, the correspondence relationship between the pixel coordinates of the depth image in the step (2) and the three-dimensional space point cloud under the world coordinate system is as follows:
$$x = \frac{(u - c_x)\,z}{f_x}, \qquad y = \frac{(v - c_y)\,z}{f_y}, \qquad z = d(u, v)$$

wherein $(u, v)$ is the coordinate position of each pixel in the image, $d(u, v)$ is the depth value recorded at that pixel, $f_x, f_y$ are the depth camera focal lengths, and $(c_x, c_y)$ is the depth camera center point.
4. The method for recognizing the human body behavior based on the depth video of the three-dimensional space time sequence modeling as claimed in claim 1, wherein the step (5) specifically comprises:
(5.1) extracting the spatial position information corresponding to the indices of the three-dimensional tensor space-time sequence and the time-series information corresponding to the element values at those positions to obtain M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$, wherein m is the number of video segments obtained by temporal division of the depth video, M represents the number of point features carrying motion information in the three-dimensional tensor space-time sequence, and c represents the motion information at the position of the corresponding coordinate (x, y, z);
(5.2) randomly selecting K of the M pieces of high-dimensional point cloud data $(x, y, z, c_1, \ldots, c_m)$ as the human behavior space-time features.
5. The method for recognizing the human body behaviors through the depth video based on the three-dimensional time sequence modeling as claimed in claim 1, wherein after data enhancement is performed on the human body behavior space-time characteristics through a data enhancement mode of rotation and translation, a trained 3D target point cloud classification model is input for classification.
6. The method according to claim 5, wherein the 3D target point cloud classification model comprises a multilayer perceptron and a non-local mechanism NetVLAD network which are connected in sequence;
the multilayer perceptron is used for sampling and grouping the human behavior space-time characteristics and extracting the characteristics of each group of behavior space-time characteristics to obtain a plurality of groups of local characteristics;
the non-local mechanism NetVLAD network aggregates a plurality of groups of local features to obtain the non-local features.
CN201910999089.XA 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling Active CN110852182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910999089.XA CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910999089.XA CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Publications (2)

Publication Number Publication Date
CN110852182A (en) 2020-02-28
CN110852182B (en) 2022-09-20

Family

ID=69596732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910999089.XA Active CN110852182B (en) 2019-10-21 2019-10-21 Depth video human body behavior recognition method based on three-dimensional space time sequence modeling

Country Status (1)

Country Link
CN (1) CN110852182B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932822A (en) * 2020-07-11 2020-11-13 广州融康汽车科技有限公司 Passenger body position alarm device
CN112001298B (en) * 2020-08-20 2021-09-21 佳都科技集团股份有限公司 Pedestrian detection method, device, electronic equipment and storage medium
CN112215101A (en) * 2020-09-27 2021-01-12 武汉科技大学 Attention mechanism-based three-dimensional target identification method and system
CN113269218B (en) * 2020-12-30 2023-06-09 威创集团股份有限公司 Video classification method based on improved VLAD algorithm
CN112989930A (en) * 2021-02-04 2021-06-18 西安美格智联软件科技有限公司 Method, system, medium and terminal for automatically monitoring fire fighting channel blockage
CN113111760B (en) * 2021-04-07 2023-05-02 同济大学 Light-weight graph convolution human skeleton action recognition method based on channel attention
CN113536892B (en) * 2021-05-13 2023-11-21 泰康保险集团股份有限公司 Gesture recognition method and device, readable storage medium and electronic equipment
CN113536997B (en) * 2021-07-01 2022-11-22 深圳中智明科智能科技有限公司 Intelligent security system and method based on image recognition and behavior analysis
CN115131562B (en) * 2022-07-08 2023-06-13 北京百度网讯科技有限公司 Three-dimensional scene segmentation method, model training method, device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955682A (en) * 2014-05-22 2014-07-30 深圳市赛为智能股份有限公司 Behavior recognition method and device based on SURF interest points
CN105894571A (en) * 2016-01-22 2016-08-24 冯歆鹏 Multimedia information processing method and device
CN109993103A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of Human bodys' response method based on point cloud data
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474160B2 (en) * 2017-07-03 2019-11-12 Baidu Usa Llc High resolution 3D point clouds generation from downsampled low resolution LIDAR 3D point clouds and camera images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955682A (en) * 2014-05-22 2014-07-30 深圳市赛为智能股份有限公司 Behavior recognition method and device based on SURF interest points
CN105894571A (en) * 2016-01-22 2016-08-24 冯歆鹏 Multimedia information processing method and device
CN109993103A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of Human bodys' response method based on point cloud data
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Action Recognition for Depth Video using Multi-view Dynamic Images";Yang Xiao et al.;《arXiv》;20181227;第1-48页 *
"多视角深度运动图的人体行为识别";刘婷婷 等;《中国图象图形学报》;20190331;第400-409页 *

Also Published As

Publication number Publication date
CN110852182A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Shi et al. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network
Yang et al. Pixor: Real-time 3d object detection from point clouds
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
CN108491880B (en) Object classification and pose estimation method based on neural network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN110706248A (en) Visual perception mapping algorithm based on SLAM and mobile robot
CN104063702B (en) Three-dimensional gait recognition based on shielding recovery and partial similarity matching
CN106295568A (en) The mankind's naturalness emotion identification method combined based on expression and behavior bimodal
CN110751097B (en) Semi-supervised three-dimensional point cloud gesture key point detection method
CN104182765A (en) Internet image driven automatic selection method of optimal view of three-dimensional model
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Wang et al. An overview of 3d object detection
CN112784736A (en) Multi-mode feature fusion character interaction behavior recognition method
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN112396655B (en) Point cloud data-based ship target 6D pose estimation method
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN111914643A (en) Human body action recognition method based on skeleton key point detection
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
Wang et al. Paccdu: pyramid attention cross-convolutional dual unet for infrared and visible image fusion
Fei et al. Self-supervised learning for pre-training 3d point clouds: A survey
CN114639115A (en) 3D pedestrian detection method based on fusion of human body key points and laser radar
CN114299339A (en) Three-dimensional point cloud model classification method and system based on regional correlation modeling
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
Kanaujia et al. Part segmentation of visual hull for 3d human pose estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant