CN111178142A - Hand posture estimation method based on space-time context learning - Google Patents

Hand posture estimation method based on space-time context learning

Info

Publication number
CN111178142A
Authority
CN
China
Prior art keywords
network
hand
time
frame
train
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911235772.2A
Other languages
Chinese (zh)
Inventor
李玺 (Xi Li)
吴一鸣 (Yiming Wu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201911235772.2A priority Critical patent/CN111178142A/en
Publication of CN111178142A publication Critical patent/CN111178142A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a hand pose estimation method based on spatio-temporal context learning, which outputs the three-dimensional coordinates of the hand joint points in every frame given a sequence of continuous depth images. The method comprises the following steps: acquiring a continuous-frame depth image data set for training hand pose estimation and defining the algorithm target; modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively; fusing the outputs of the spatial and temporal models with a fusion network according to the input image; establishing a prediction model for hand pose estimation; and performing hand pose estimation on continuous-frame depth images with the prediction model. The invention is suitable for hand pose estimation in real videos and achieves good accuracy and robustness under a variety of complex conditions.

Description

Hand posture estimation method based on space-time context learning
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a hand posture estimation method based on space-time context learning.
Background
Hand pose estimation is the problem of locating the hand joint points relative to the camera in a given depth image containing a hand. It is widely used in human-computer interaction, augmented reality and virtual reality applications. Traditional methods fit a parameterized hand model by defining and optimizing an energy function, but such model-based methods are computationally expensive. With the development of deep neural networks in recent years, appearance-based methods discover the patterns of hand pose directly from data and consume fewer resources than model-based methods.
Owing to the effectiveness of statistical modeling, learning-based methods are increasingly applied to the hand pose estimation task. Existing appearance-based learning methods mainly adopt end-to-end deep neural network models that take a single frame or several frames of depth images containing a hand as input and output the predicted positions of the hand joint points. On the one hand, most current methods use either depth images or three-dimensional voxels as input, whereas the present invention considers these two inputs to be correlated and complementary; on the other hand, in real scenes consecutive depth frames are correlated, and modeling the context information in the temporal dimension improves the prediction accuracy of the network.
Disclosure of Invention
In order to solve the above problems, the present invention provides a hand pose estimation method based on spatio-temporal context learning. The method is built on deep neural networks: it extracts and effectively fuses features from the depth image and the three-dimensional voxel inputs, and it uses a recurrent neural network to model the relationship between the features of multiple frames in the temporal dimension, thereby improving hand pose estimation in multi-frame scenarios.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a hand posture estimation method based on space-time context learning comprises the following steps:
S1, acquiring a continuous-frame depth image data set for training hand pose estimation;
S2, modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively;
S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image;
S4, establishing a prediction model for hand pose estimation;
S5, performing hand pose estimation on continuous-frame depth images with the prediction model.
On the basis of the above technical scheme, each step can preferably be implemented as follows.
Preferably, in step S1, a continuous-frame depth image data set for training hand pose estimation is acquired; it comprises N training videos, each of which contains continuous-frame depth images (X_1, ..., X_T)^train and pre-labeled hand joint point positions (J_1, ..., J_T)^train.
Further, in step S2, the step of modeling the corresponding context information using the spatial network and the temporal network in the spatial and temporal dimensions includes:
S21, for the continuous-frame depth images (X_1, ..., X_T)^train, each image is scaled to 128 × 128, randomly rotated and flipped, and normalized so that its values lie in [-1, 1], giving the normalized depth images (I_1, ..., I_T)^train used as algorithm input; each normalized depth image is then converted, according to its depth values, into a 128 × 128 × 8 three-dimensional voxel representation (V_1, ..., V_T)^train, which is also used as algorithm input; and (J_1, ..., J_T)^train is subjected to the same rotation and flipping transformations as (X_1, ..., X_T)^train, yielding the transformed hand joint point positions (J'_1, ..., J'_T)^train.
S22, modeling spatial context information, and processing any frame depth image ItAnd a three-dimensional voxel representation VtPerforming spatial network operation Fspatio(. in the spatial network operation, for ItAnd VtTriple layer convolution operations all using addition of ReLU activation function per layer and best resultsLarge pooling operation for down-sampling to obtain features
Figure BDA0002304839760000032
And
Figure BDA0002304839760000033
two features were then fused using a hierarchical fusion approach with a total number of layers of 3, namely:
Figure BDA0002304839760000034
Figure BDA0002304839760000035
m=1,2
wherein: phi m, t represents the fusion characteristics of the mth layer,
Figure BDA0002304839760000036
and
Figure BDA0002304839760000037
is a full connection function of the mth layer,
Figure BDA0002304839760000038
and
Figure BDA0002304839760000039
all the parameters are the parameters of the m-th layer; returning the coordinates of the hand joint points by using a full-connection operation
Figure BDA00023048397600000310
Formally expressing the above spatial network operation as:
Figure BDA00023048397600000311
wherein: fspatio(. to) represents a spatial network operation, ΘspatioParameters in the spatial network;
S23, to model the temporal context information, a temporal network operation F_temp(·) is performed frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21. In the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation function) with max pooling, giving the features (ψ_1, ..., ψ_T), where the feature of the depth image I_t is ψ_t = H(I_t; θ_c), H(·) is the convolution operation and θ_c are the convolution parameters. A long short-term memory network (LSTM) then models the correlations of the features along the temporal dimension, giving the hidden features (h_1, ..., h_T); the hidden feature h_t at time t is calculated as:

i_t = σ(W_hi * h_{t-1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t-1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t-1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t-1}) + (i_t ⊙ tanh(W_hc * h_{t-1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)

wherein i_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, o_t is the output of the output gate at time t, c_t is the memory cell at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc and W_xc are weights, b_i, b_f, b_o and b_c are biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function. A fully connected operation on the concatenated feature [ψ_t, h_t] then regresses the hand joint point coordinates Ĵ_t^τ.
The temporal network operation is formally expressed as:

(Ĵ_1^τ, ..., Ĵ_T^τ) = F_temp(I_1, ..., I_T; Θ_temp)

wherein F_temp(·) denotes the temporal network operation and Θ_temp denotes the parameters of the temporal network.
Further, in step S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image specifically includes:
S31, performing a fusion network operation F_fusion(·) frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21. In the fusion network operation, down-sampling is first performed by three convolution layers (each followed by a ReLU activation function) with max pooling, followed by three fully connected layers; the weights w_1 and w_2 are then obtained through a sigmoid function, according to:

w_{1,t} = σ(F_fusion(I_t; Θ_fusion))
w_{2,t} = 1 - w_{1,t}

wherein w_{1,t} and w_{2,t} are the weights w_1 and w_2 for the t-th frame image, F_fusion(·) is the fusion network operation, Θ_fusion are the parameters of the fusion network, and σ(·) denotes the sigmoid function.
S32, the fused coordinates Ĵ_t are obtained by linear weighting:

Ĵ_t = w_{1,t} ⊙ Ĵ_t^s + w_{2,t} ⊙ Ĵ_t^τ

wherein ⊙ denotes element-wise multiplication.
Further, in step S4, establishing the prediction model for hand pose estimation specifically includes:
S41, establishing a deep convolutional neural network whose input is the continuous multi-frame depth images (X_1, ..., X_T) and whose output is the hand joint point positions (Ĵ_1, ..., Ĵ_T) in each frame, thereby constructing in the neural network a mapping F: (X_1, ..., X_T) → (Ĵ_1, ..., Ĵ_T), formulated as:

(Ĵ_1, ..., Ĵ_T) = F(X_1, ..., X_T; Θ)

wherein F is the deep convolutional neural network and Θ denotes its parameters;
S42, the loss function L of the neural network is:

L = Σ_{i=1}^{N} Σ_{t=1}^{T} || Ĵ_{t,i} - J'_{t,i} ||_2^2

wherein N is the number of videos and the subscript i denotes the corresponding quantity in the i-th video; the whole neural network is trained by minimizing the loss function L with the Adam optimization method and the back-propagation algorithm.
Further, in step S5, performing hand pose estimation on continuous-frame depth images using the prediction model includes: the continuous-frame depth images are subjected to the same scaling and normalization operations as the training images and then input into the trained deep convolutional neural network; the output (Ĵ_1, ..., Ĵ_T) is the predicted hand joint point coordinates.
Compared with existing hand pose estimation methods, the hand pose estimation method based on spatio-temporal context learning has the following beneficial effects:
Firstly, the hand pose estimation method based on spatio-temporal context learning identifies two key problems in hand pose estimation: extracting effective information from the depth image, and accurately regressing the hand coordinates from the extracted features. By addressing these two directions, hand pose estimation under complex conditions can be handled effectively.
Secondly, the hand pose estimation method of the invention models the spatial context and the temporal context with deep convolutional neural networks to extract effective information from the depth image. The spatial context network fuses the multi-modal information from the depth image and the three-dimensional voxel representation through a hierarchical fusion method and extracts more robust visual features; the temporal context network exploits the temporal order of the multi-frame images and uses a recurrent neural network to model the dependencies across frames.
Finally, the hand pose estimation method of the invention unifies the temporal and spatial contexts in a single framework with a fusion network: the weights of the temporal and spatial networks are learned adaptively from the input depth image, and the multiple outputs are effectively fused by linear weighting.
The hand pose estimation method based on spatio-temporal context learning can effectively improve the accuracy and efficiency of hand pose estimation in human-computer interaction, virtual reality and augmented reality, and therefore has good application value. For example, in a human-computer interaction scenario, the method can quickly and accurately estimate the positions of the hand joint points, so that a robot can be controlled with hand motions.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 shows results on hard-to-recognize images from the two data sets.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
Referring to fig. 1, in a preferred embodiment of the present invention, a hand pose estimation method based on spatiotemporal context learning includes the following steps:
S1, a continuous-frame depth image data set for training hand pose estimation is acquired; it comprises N training videos that meet the requirement on the number of training samples, each of which contains continuous-frame depth images (X_1, ..., X_T)^train and manually pre-labeled hand joint point positions (J_1, ..., J_T)^train, where X_t and J_t denote the t-th frame depth image and the hand joint point positions corresponding to that image, respectively.
The algorithm target is defined as: predicting the coordinates of the hand joint points in an arbitrary depth image.
S2, modeling the corresponding context information using a spatial network and a temporal network in spatial and temporal dimensions, respectively, including:
In the first step, for the continuous-frame depth images (X_1, ..., X_T)^train, each image is first scaled to 128 × 128, then randomly rotated and flipped, and finally normalized (scaled to between -1 and 1); after this processing, (X_1, ..., X_T)^train yields the normalized depth images (I_1, ..., I_T)^train used as algorithm input. Subsequently, according to the depth values, the normalized depth images (I_1, ..., I_T)^train are converted into 128 × 128 × 8 three-dimensional voxel representations (V_1, ..., V_T)^train, which are also used as algorithm input. In addition, because the original images are randomly rotated and flipped, the hand joint point positions (J_1, ..., J_T)^train change accordingly, so (J_1, ..., J_T)^train must undergo the same rotation and flipping transformations as (X_1, ..., X_T)^train, yielding the transformed hand joint point positions (J'_1, ..., J'_T)^train.
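For concreteness, a minimal preprocessing sketch is given below; it is an assumed implementation, not code taken from the embodiment. It resizes one depth frame to 128 × 128, normalizes the depth values to [-1, 1], and bins the normalized depths into a 128 × 128 × 8 occupancy grid as the voxel representation V_t. The function name preprocess_frame, the use of OpenCV for resizing, and the depth bounds d_min and d_max are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess_frame(depth, d_min=0.0, d_max=1500.0, size=128, n_bins=8):
    """Hypothetical preprocessing: resize, normalize to [-1, 1], voxelize into 8 depth bins."""
    d = cv2.resize(depth.astype(np.float32), (size, size))
    d = np.clip(d, d_min, d_max)
    norm = 2.0 * (d - d_min) / (d_max - d_min) - 1.0            # I_t, values in [-1, 1]
    # Assign every pixel's depth to one of n_bins slices -> V_t with shape (n_bins, size, size).
    bins = np.floor((norm + 1.0) / 2.0 * (n_bins - 1e-6)).astype(np.int64)
    voxel = np.zeros((n_bins, size, size), dtype=np.float32)
    rows, cols = np.indices((size, size))
    voxel[bins, rows, cols] = 1.0
    return norm[None], voxel                                    # (1, H, W) and (8, H, W)
```

The random rotation and flipping applied during training (and the matching transformation of the joint labels) would be applied before this step; they are omitted here for brevity.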
Secondly, for modeling spatial context information, for any frame of depth image ItAnd a three-dimensional voxel representation VtPerforming spatial network operation Fspatio(. in the spatial network operation, for ItAnd VtDownsampling by using three layers of convolution operation (adding a ReLU activation function to each layer) and maximum pooling operation to respectively obtain features
Figure BDA0002304839760000081
And
Figure BDA0002304839760000082
two features were then fused using a hierarchical fusion approach with a total number of layers of 3, namely:
Figure BDA0002304839760000083
Figure BDA0002304839760000084
m=1,2
wherein: phi is a0,tDenotes the fusion characteristic of layer 0,. phim,tIndicating the fusion characteristics of the m-th layer,
Figure BDA0002304839760000091
and
Figure BDA0002304839760000092
is a full connection function of the mth layer,
Figure BDA0002304839760000093
and
Figure BDA0002304839760000094
all the parameters are the parameters of the m-th layer; returning the coordinates of the hand joint points by using a full-connection operation
Figure BDA0002304839760000095
Formally expressing the above spatial network operation as:
Figure BDA0002304839760000096
wherein: fspatio(. to) represents a spatial network operation, ΘspatioParameters in the spatial network;
In the third step, to model the temporal context information, a temporal network operation F_temp(·) is performed frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in the previous step. In the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation function) with max pooling, giving the features (ψ_1, ..., ψ_T), where the feature of the depth image I_t is ψ_t = H(I_t; θ_c), H(·) is the convolution operation and θ_c are the convolution parameters. A long short-term memory network (LSTM) then models the correlations of the features along the temporal dimension, giving the hidden features (h_1, ..., h_T); the hidden feature h_t at time t is calculated as:

i_t = σ(W_hi * h_{t-1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t-1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t-1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t-1}) + (i_t ⊙ tanh(W_hc * h_{t-1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)

wherein i_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, o_t is the output of the output gate at time t, c_t is the memory cell at time t, h_t is the output of the LSTM layer at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc and W_xc are weights, b_i, b_f, b_o and b_c are biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function. A fully connected operation on the concatenated feature [ψ_t, h_t] then regresses the hand joint point coordinates Ĵ_t^τ.
The temporal network operation is formally expressed as:

(Ĵ_1^τ, ..., Ĵ_T^τ) = F_temp(I_1, ..., I_T; Θ_temp)

wherein F_temp(·) denotes the temporal network operation and Θ_temp denotes the parameters of the temporal network.
S3, fusing the input images using a fusion network as the output of the spatio-temporal model, specifically including:
the first step is to extract multiple frames of depth images (I) in the previous step1,...,IT)trainPerforming fusion network operation frame by frame FfusionIn the operation of the converged network, firstly carrying out down-sampling by using three layers of convolution operation (adding a ReLU activation function into each layer) and maximum pooling operation, and then carrying out three layers of full connection;
and obtaining the weight w through sigmoid function1And w2The two weights are obtained by the following formula:
w1,t=σ(Ffusion(It;Θfusion))
w2,t=1-w1,t
wherein w1,tAnd w2,tRespectively weight w of the t-th frame image1And w2;Ffusion(. is a converged network operation, ΘfusionTo fuse parameters in the network, σ (-) represents a sigmoid function.
Secondly, obtaining the fused coordinates in a linear weighting mode
Figure BDA0002304839760000103
Figure BDA0002304839760000104
where |, indicates a matrix element multiplication.
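One way the fusion network and the linear weighting could be realized is sketched below; the widths of the three fully connected layers are assumptions, and broadcasting the scalar weight w_{1,t} over all joint coordinates is an implementation choice, not something the embodiment specifies.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of F_fusion: predicts w_{1,t} from I_t and blends the spatial/temporal predictions."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten())
        # Three fully connected layers, as described above; final output is a single scalar.
        self.fc = nn.Sequential(nn.Linear(128 * 16 * 16, 256), nn.ReLU(),
                                nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img, j_spatial, j_temporal):
        w1 = torch.sigmoid(self.fc(self.encoder(img)))      # w_{1,t} in (0, 1), shape (B, 1)
        w2 = 1.0 - w1                                        # w_{2,t}
        return w1 * j_spatial + w2 * j_temporal              # fused J_t, broadcast over joint coordinates
```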
S4, establishing a prediction model of hand posture estimation specifically comprises:
firstly, establishing a deep convolution neural network, wherein the input of the neural network is continuous multiframe depth images (X)1,...,XT) Outputting the position of the joint point of the hand in each frame
Figure BDA0002304839760000111
Thereby constructing a map in a neural network
Figure BDA0002304839760000112
Is formulated as:
Figure BDA0002304839760000113
wherein:
Figure BDA0002304839760000114
is a deep convolutional neural network;
second, loss function of neural network
Figure BDA0002304839760000117
Comprises the following steps:
Figure BDA0002304839760000115
wherein: n is the number of videos used for training; the index i in the parameter indicates the corresponding parameter value in the ith video;
loss function using Adam optimization method and back propagation algorithm
Figure BDA0002304839760000118
And training the whole neural network.
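A minimal training-step sketch for the joint-regression loss and Adam optimization described above is given below; the wrapper module that combines the spatial, temporal and fusion networks, the batch layout and the learning rate are assumptions.

```python
import torch

def training_step(model, optimizer, frames, voxels, joints_gt):
    """One Adam/back-propagation step on a squared L2 joint-regression loss.

    frames:    (B, T, 1, 128, 128) normalized depth images of one video clip
    voxels:    (B, T, 8, 128, 128) corresponding voxel representations
    joints_gt: (B, T, 3 * n_joints) transformed ground-truth joint positions
    model:     a module combining the spatial, temporal and fusion networks sketched above
    """
    pred = model(frames, voxels)                             # (B, T, 3 * n_joints)
    loss = ((pred - joints_gt) ** 2).sum(dim=-1).mean()      # squared L2 error per frame, averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative optimizer setup: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```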
S5, using the prediction model to carry out hand posture estimation on the depth images of the continuous frames, and the specific steps comprise:
after the continuous frame depth image is subjected to the same scaling and normalization operations as the training image, the continuous frame depth image is input into a depth convolution neural network after the training is finished, and the result is output
Figure BDA0002304839760000116
I.e. the predicted hand joint point coordinates.
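At inference time the same preprocessing is applied and the trained network is simply run forward. The hypothetical usage sketch below follows the naming of the earlier sketches (preprocess_frame and a combined model taking image and voxel clips) and is not part of the original embodiment.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_video(model, depth_frames):
    """depth_frames: list of raw depth images of one video; returns per-frame joint coordinates."""
    model.eval()
    imgs, voxs = zip(*(preprocess_frame(f) for f in depth_frames))  # same scaling/normalization as training
    frames = torch.from_numpy(np.stack(imgs)).unsqueeze(0)          # (1, T, 1, 128, 128)
    voxels = torch.from_numpy(np.stack(voxs)).unsqueeze(0)          # (1, T, 8, 128, 128)
    joints = model(frames, voxels)                                   # (1, T, 3 * n_joints)
    return joints.squeeze(0).reshape(len(depth_frames), -1, 3)       # (T, n_joints, 3)
```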
It can be seen that, given continuous depth images, the above method outputs the three-dimensional coordinates of the hand joint points in each frame.
The above-described method is applied to specific examples so that those skilled in the art can better understand the effects of the present invention.
Examples
In this embodiment, experiments are performed using the method described above; the implementation follows the steps given above and is not repeated here, and only the experimental results are reported below.
NYU hand pose dataset: the data set contains ten video sequences for a total of 17604 frame depth images, eight of which contain 16008 frame depth images for training and two of which contain 1596 frame images for testing.
ICVL hand gesture data set: the data set contains three video sequences, for a total of 81009 frame depth images, one of which contains 72757 frame depth images for training and two of which contain 8252 frame images for testing.
Table 1 shows the comparison of evaluation indexes of the present embodiment on the NYU hand pose data set.

Method          Mean joint error (mm)
HeatMap [1]     21.02
DeepPrior [2]   19.73
Feedback [3]    15.97
DeepModel [4]   16.90
Lie-X [5]       14.51
CADSTN (ours)   14.83

TABLE 1
TABLE 2 (the corresponding comparison of evaluation indexes on the ICVL hand gesture data set) appears in the original document only as images and is not reproduced here.
The CADSTN is the method of the invention, and the other methods correspond to the following references:
[1] J. Tompson, M. Stein, Y. Lecun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Transactions on Graphics (ToG), vol. 33, no. 5, 2014.
[2] M. Oberweger, P. Wohlhart, and V. Lepetit, "Hands deep in deep learning for hand pose estimation," in CVWW, 2015.
[3] M. Oberweger, P. Wohlhart, and V. Lepetit, "Training a feedback loop for hand pose estimation," in ICCV, 2015.
[4] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei, "Model-based deep hand pose estimation," IJCAI, 2016.
[5] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng, "Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups," International Journal of Computer Vision, 2017.
[6] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim, "Latent regression forest: Structured estimation of 3d articulated hand posture," in CVPR, 2014.
[7] C. Wan, T. Probst, L. Van Gool, and A. Yao, "Crossing nets: Combining GANs and VAEs with a shared latent space for hand pose estimation," in CVPR, 2017.
[8] C. Wan, A. Yao, and L. Van Gool, "Direction matters: hand pose estimation from local surface normals," in ECCV, 2016.
the implementation results of the images with the parts of the two data sets being difficult to recognize are shown in fig. 2, wherein the first line and the third line are implementation results of the pre-labeled joint, and the second line and the fourth line are implementation results of the method of the present invention. By observing the implementation result, the method can still carry out robust joint point estimation in the depth image scene with self-occlusion and noise of the hand.
In the above embodiment, the hand pose estimation method based on spatio-temporal context learning of the present invention first models the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, then effectively fuses the multiple predictions with a fusion method, and thereby establishes a hand pose estimation model based on deep neural networks. Finally, the trained hand pose estimation model is used to predict the hand joint point positions of continuous-frame depth images.
Through the technical scheme, the hand posture estimation method based on the space-time context learning is developed based on the deep learning technology. The invention can model the dependency relationship between the pixels in time and space dimensions, and uniformly use the space-time context for estimating the positions of the hand joint points through the fusion network.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A hand posture estimation method based on space-time context learning is characterized by comprising the following steps:
S1, acquiring a continuous-frame depth image data set for training hand pose estimation;
S2, modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively;
S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image;
S4, establishing a prediction model for hand pose estimation;
S5, performing hand pose estimation on continuous-frame depth images with the prediction model.
2. The hand pose estimation method based on spatio-temporal context learning according to claim 1, wherein in step S1, a continuous-frame depth image data set for training hand pose estimation is acquired; it comprises N training videos, each of which contains continuous-frame depth images (X_1, ..., X_T)^train and pre-labeled hand joint point positions (J_1, ..., J_T)^train.
3. The hand pose estimation method based on spatio-temporal context learning according to claim 2, wherein in step S2, the modeling the corresponding context information using the spatial network and the temporal network in the spatial and temporal dimensions respectively comprises:
S21, for the continuous-frame depth images (X_1, ..., X_T)^train, each image is scaled to 128 × 128, randomly rotated and flipped, and normalized so that its values lie in [-1, 1], giving the normalized depth images (I_1, ..., I_T)^train used as algorithm input; each normalized depth image is then converted, according to its depth values, into a 128 × 128 × 8 three-dimensional voxel representation (V_1, ..., V_T)^train, which is also used as algorithm input; and (J_1, ..., J_T)^train is subjected to the same rotation and flipping transformations as (X_1, ..., X_T)^train, yielding the transformed hand joint point positions (J'_1, ..., J'_T)^train;
S22, modeling spatial context information, and processing any frame depth image ItAnd a three-dimensional voxel representation VtPerforming spatial network operation Fspatio(. in the spatial network operation, for ItAnd VtPerforming down-sampling by using three-layer convolution operation and maximum pooling operation of adding ReLU activation function into each layer to obtain features respectively
Figure FDA0002304839750000022
And
Figure FDA0002304839750000023
two features were then fused using a hierarchical fusion approach with a total number of layers of 3, namely:
Figure FDA0002304839750000024
Figure FDA0002304839750000025
m=1,2
wherein: phi is am,tIndicating the fusion characteristics of the m-th layer,
Figure FDA0002304839750000026
and
Figure FDA0002304839750000027
is a full connection function of the mth layer,
Figure FDA0002304839750000028
and
Figure FDA0002304839750000029
are all as followsFull connection layer parameters of m layers; returning the coordinates of the hand joint points by using a full-connection operation
Figure FDA00023048397500000210
Formally expressing the above spatial network operation as:
Figure FDA00023048397500000211
wherein: fspatio(. to) represents a spatial network operation, ΘspatioParameters in the spatial network;
S23, to model the temporal context information, a temporal network operation F_temp(·) is performed frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21; in the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation function) with max pooling, giving the features (ψ_1, ..., ψ_T), where the feature of the depth image I_t is ψ_t = H(I_t; θ_c), H(·) is the convolution operation and θ_c are the convolution parameters; a long short-term memory network (LSTM) then models the correlations of the features along the temporal dimension, giving the hidden features (h_1, ..., h_T); the hidden feature h_t at time t is calculated as:

i_t = σ(W_hi * h_{t-1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t-1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t-1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t-1}) + (i_t ⊙ tanh(W_hc * h_{t-1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)

wherein i_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, o_t is the output of the output gate at time t, c_t is the memory cell at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc and W_xc are weights, b_i, b_f, b_o and b_c are biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function; a fully connected operation on the concatenated feature [ψ_t, h_t] then regresses the hand joint point coordinates Ĵ_t^τ;
the temporal network operation is formally expressed as:

(Ĵ_1^τ, ..., Ĵ_T^τ) = F_temp(I_1, ..., I_T; Θ_temp)

wherein F_temp(·) denotes the temporal network operation and Θ_temp denotes the parameters of the temporal network.
4. The hand pose estimation method based on spatio-temporal context learning according to claim 3, wherein in step S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image specifically comprises:
S31, performing a fusion network operation F_fusion(·) frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21; in the fusion network operation, down-sampling is first performed by three convolution layers (each followed by a ReLU activation function) with max pooling, followed by three fully connected layers; the weights w_1 and w_2 are then obtained through a sigmoid function, according to:

w_{1,t} = σ(F_fusion(I_t; Θ_fusion))
w_{2,t} = 1 - w_{1,t}

wherein w_{1,t} and w_{2,t} are the weights w_1 and w_2 for the t-th frame image, F_fusion(·) is the fusion network operation, Θ_fusion are the parameters of the fusion network, and σ(·) denotes the sigmoid function;
S32, the fused coordinates Ĵ_t are obtained by linear weighting:

Ĵ_t = w_{1,t} ⊙ Ĵ_t^s + w_{2,t} ⊙ Ĵ_t^τ

wherein ⊙ denotes element-wise multiplication.
5. The hand pose estimation method based on spatio-temporal context learning according to claim 4, wherein in step S4, establishing the prediction model for hand pose estimation specifically comprises:
S41, establishing a deep convolutional neural network whose input is the continuous multi-frame depth images (X_1, ..., X_T) and whose output is the hand joint point positions (Ĵ_1, ..., Ĵ_T) in each frame, thereby constructing in the neural network a mapping F: (X_1, ..., X_T) → (Ĵ_1, ..., Ĵ_T), formulated as:

(Ĵ_1, ..., Ĵ_T) = F(X_1, ..., X_T; Θ)

wherein F is the deep convolutional neural network and Θ denotes its parameters;
S42, the loss function L of the neural network is:

L = Σ_{i=1}^{N} Σ_{t=1}^{T} || Ĵ_{t,i} - J'_{t,i} ||_2^2

wherein N is the number of videos and the subscript i denotes the corresponding quantity in the i-th video; the whole neural network is trained by minimizing the loss function L with the Adam optimization method and the back-propagation algorithm.
6. The hand pose estimation method based on spatio-temporal context learning according to claim 5, wherein in step S5, performing hand pose estimation on continuous-frame depth images using the prediction model comprises: the continuous-frame depth images are subjected to the same scaling and normalization operations as the training images and then input into the trained deep convolutional neural network; the output (Ĵ_1, ..., Ĵ_T) is the predicted hand joint point coordinates.
CN201911235772.2A 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning Pending CN111178142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911235772.2A CN111178142A (en) 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911235772.2A CN111178142A (en) 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning

Publications (1)

Publication Number Publication Date
CN111178142A true CN111178142A (en) 2020-05-19

Family

ID=70646492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911235772.2A Pending CN111178142A (en) 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning

Country Status (1)

Country Link
CN (1) CN111178142A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666917A (en) * 2020-06-19 2020-09-15 北京市商汤科技开发有限公司 Attitude detection and video processing method and device, electronic equipment and storage medium
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN112328156A (en) * 2020-11-12 2021-02-05 维沃移动通信有限公司 Input device control method and device and electronic device
CN113723233A (en) * 2021-08-17 2021-11-30 之江实验室 Student learning participation degree evaluation method based on layered time sequence multi-example learning
JP2022541709A (en) * 2020-06-19 2022-09-27 ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド Attitude detection and video processing method, device, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107437246A (en) * 2017-07-05 2017-12-05 浙江大学 A kind of common conspicuousness detection method based on end-to-end full convolutional neural networks
CN108460329A (en) * 2018-01-15 2018-08-28 任俊芬 A kind of face gesture cooperation verification method based on deep learning detection
US20190034714A1 (en) * 2016-02-05 2019-01-31 Delphi Technologies, Llc System and method for detecting hand gestures in a 3d space
CN109961005A (en) * 2019-01-28 2019-07-02 山东大学 A kind of dynamic gesture identification method and system based on two-dimensional convolution network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034714A1 (en) * 2016-02-05 2019-01-31 Delphi Technologies, Llc System and method for detecting hand gestures in a 3d space
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107437246A (en) * 2017-07-05 2017-12-05 浙江大学 A kind of common conspicuousness detection method based on end-to-end full convolutional neural networks
CN108460329A (en) * 2018-01-15 2018-08-28 任俊芬 A kind of face gesture cooperation verification method based on deep learning detection
CN109961005A (en) * 2019-01-28 2019-07-02 山东大学 A kind of dynamic gesture identification method and system based on two-dimensional convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIMING WU: "Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images", pages 1 - 4 *
刘冰 (Liu Bing): "深度核机器学习技术及应用" [Deep Kernel Machine Learning Technology and Applications], Beijing University of Technology Press, pages: 137 - 138 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111833400B (en) * 2020-06-10 2023-07-28 广东工业大学 Camera pose positioning method
CN111666917A (en) * 2020-06-19 2020-09-15 北京市商汤科技开发有限公司 Attitude detection and video processing method and device, electronic equipment and storage medium
JP2022541709A (en) * 2020-06-19 2022-09-27 ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド Attitude detection and video processing method, device, electronic device and storage medium
CN112328156A (en) * 2020-11-12 2021-02-05 维沃移动通信有限公司 Input device control method and device and electronic device
CN112328156B (en) * 2020-11-12 2022-05-17 维沃移动通信有限公司 Input device control method and device and electronic device
CN113723233A (en) * 2021-08-17 2021-11-30 之江实验室 Student learning participation degree evaluation method based on layered time sequence multi-example learning
CN113723233B (en) * 2021-08-17 2024-03-26 之江实验室 Student learning participation assessment method based on hierarchical time sequence multi-example learning

Similar Documents

Publication Publication Date Title
Siarohin et al. First order motion model for image animation
Zhang et al. Relational attention network for crowd counting
CN111178142A (en) Hand posture estimation method based on space-time context learning
CN107292912B (en) Optical flow estimation method based on multi-scale corresponding structured learning
CN113673307A (en) Light-weight video motion recognition method
Truong et al. Pdc-net+: Enhanced probabilistic dense correspondence network
Wang et al. What matters for 3d scene flow network
EP4060560B1 (en) Systems, methods, and storage media for generating synthesized depth data
Fooladgar et al. Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images
Guo et al. JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing
Sedai et al. A Gaussian process guided particle filter for tracking 3D human pose in video
Li et al. Face sketch synthesis using regularized broad learning system
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
Xu et al. Multi-scale spatial attention-guided monocular depth estimation with semantic enhancement
Wang et al. Adversarial learning for joint optimization of depth and ego-motion
Xu et al. RGB-T salient object detection via CNN feature and result saliency map fusion
Song et al. Contextualized CNN for scene-aware depth estimation from single RGB image
Afifi et al. Object depth estimation from a single image using fully convolutional neural network
Ukwuoma et al. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism
Ling et al. Human object inpainting using manifold learning-based posture sequence estimation
Zhang et al. DDF-HO: hand-held object reconstruction via conditional directed distance field
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
Gupta et al. End-to-end differentiable 6DoF object pose estimation with local and global constraints
Wang et al. Robust point cloud registration using geometric spatial refinement
Zhu Reconstruction of missing markers in motion capture based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination