CN111178142A - Hand posture estimation method based on space-time context learning - Google Patents

Hand posture estimation method based on space-time context learning

Info

Publication number
CN111178142A
Authority
CN
China
Prior art keywords
network
hand
time
frame
train
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911235772.2A
Other languages
Chinese (zh)
Inventor
李玺 (Xi Li)
吴一鸣 (Yiming Wu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201911235772.2A priority Critical patent/CN111178142A/en
Publication of CN111178142A publication Critical patent/CN111178142A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a hand pose estimation method based on spatio-temporal context learning, which outputs the three-dimensional coordinates of the hand joint points in every frame given a sequence of continuous depth images. The method comprises the following steps: acquiring a continuous-frame depth image data set for training hand pose estimation and defining the algorithm target; modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively; fusing the outputs of the spatial and temporal models with a fusion network according to the input image; establishing a prediction model for hand pose estimation; and performing hand pose estimation on continuous-frame depth images with the prediction model. The invention is suitable for hand pose estimation in real videos and achieves good accuracy and robustness under a variety of complex conditions.

Description

Hand posture estimation method based on space-time context learning
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a hand posture estimation method based on space-time context learning.
Background
Hand pose estimation is the problem of locating the hand joint points relative to the camera in a given depth image containing a hand. It is widely used in human-computer interaction, augmented reality and virtual reality applications. Traditional methods fit a parameterized hand model by defining and optimizing an energy function, but such model-based methods are computationally expensive. With the development of deep neural networks in recent years, appearance-based methods discover the patterns of hand pose directly from data and consume fewer resources than model-based methods.
Owing to the effectiveness of statistical modeling, learning-based methods are increasingly applied to the hand pose estimation task. Existing appearance-based learning methods mainly adopt end-to-end deep neural network models that take a single frame or several frames of depth images containing a hand as input and output the predicted positions of the hand joint points. On the one hand, most current methods use either depth images or three-dimensional voxels as input, whereas the present invention considers these two inputs to be correlated and complementary; on the other hand, in real scenes consecutive depth frames are correlated, and modeling the context information in the temporal dimension improves the prediction accuracy of the network.
Disclosure of Invention
In order to solve the above problems, the present invention provides a hand pose estimation method based on spatio-temporal context learning. The method is built on deep neural networks: it extracts and effectively fuses features from the depth image and the three-dimensional voxel inputs, and it uses a recurrent neural network to model the relationship between the features of multiple frames in the temporal dimension, thereby improving hand pose estimation in multi-frame scenarios.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a hand posture estimation method based on space-time context learning comprises the following steps:
S1, acquiring a continuous-frame depth image data set for training hand pose estimation;
S2, modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively;
S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image;
S4, establishing a prediction model for hand pose estimation;
S5, performing hand pose estimation on continuous-frame depth images with the prediction model.
On the basis of the above technical scheme, each step can preferably be implemented as follows.
Preferably, in step S1, a continuous-frame depth image data set for training hand pose estimation is acquired; it comprises N training videos, each of which contains continuous-frame depth images (X_1, ..., X_T)^train and pre-labeled hand joint point positions (J_1, ..., J_T)^train.
Further, in step S2, the step of modeling the corresponding context information using the spatial network and the temporal network in the spatial and temporal dimensions includes:
S21, for the continuous-frame depth images (X_1, ..., X_T)^train, each image is scaled to 128 × 128, randomly rotated and flipped, and normalized so that its values lie in [-1, 1], giving the normalized depth images (I_1, ..., I_T)^train used as algorithm input; each normalized depth image is then converted, according to its depth values, into a 128 × 128 × 8 three-dimensional voxel representation (V_1, ..., V_T)^train, which is also used as algorithm input; and (J_1, ..., J_T)^train is subjected to the same rotation and flipping transformations as (X_1, ..., X_T)^train, yielding the transformed hand joint point positions (J'_1, ..., J'_T)^train.
S22, modeling spatial context information, and processing any frame depth image ItAnd a three-dimensional voxel representation VtPerforming spatial network operation Fspatio(. in the spatial network operation, for ItAnd VtTriple layer convolution operations all using addition of ReLU activation function per layer and best resultsLarge pooling operation for down-sampling to obtain features
Figure BDA0002304839760000032
And
Figure BDA0002304839760000033
two features were then fused using a hierarchical fusion approach with a total number of layers of 3, namely:
Figure BDA0002304839760000034
Figure BDA0002304839760000035
m=1,2
wherein: phi m, t represents the fusion characteristics of the mth layer,
Figure BDA0002304839760000036
and
Figure BDA0002304839760000037
is a full connection function of the mth layer,
Figure BDA0002304839760000038
and
Figure BDA0002304839760000039
all the parameters are the parameters of the m-th layer; returning the coordinates of the hand joint points by using a full-connection operation
Figure BDA00023048397600000310
Formally expressing the above spatial network operation as:
Figure BDA00023048397600000311
wherein: fspatio(. to) represents a spatial network operation, ΘspatioParameters in the spatial network;
S23, to model the temporal context information, a temporal network operation F_temp(·) is performed frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21. In the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation function) with max pooling, giving the features (ψ_1, ..., ψ_T), where the feature of the depth image I_t is ψ_t = H(I_t; θ_c), H(·) is the convolution operation and θ_c are the convolution parameters. A long short-term memory network (LSTM) then models the correlations of the features along the temporal dimension, giving the hidden features (h_1, ..., h_T); the hidden feature h_t at time t is calculated as:

i_t = σ(W_hi * h_{t-1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t-1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t-1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t-1}) + (i_t ⊙ tanh(W_hc * h_{t-1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)

wherein i_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, o_t is the output of the output gate at time t, c_t is the memory cell at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc and W_xc are weights, b_i, b_f, b_o and b_c are biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function. A fully connected operation on the concatenated feature [ψ_t, h_t] then regresses the hand joint point coordinates Ĵ_t^τ.
The temporal network operation is formally expressed as:

(Ĵ_1^τ, ..., Ĵ_T^τ) = F_temp(I_1, ..., I_T; Θ_temp)

wherein F_temp(·) denotes the temporal network operation and Θ_temp denotes the parameters of the temporal network.
Further, in step S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image specifically includes:
S31, performing a fusion network operation F_fusion(·) frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21. In the fusion network operation, down-sampling is first performed by three convolution layers (each followed by a ReLU activation function) with max pooling, followed by three fully connected layers; the weights w_1 and w_2 are then obtained through a sigmoid function, according to:

w_{1,t} = σ(F_fusion(I_t; Θ_fusion))
w_{2,t} = 1 - w_{1,t}

wherein w_{1,t} and w_{2,t} are the weights w_1 and w_2 for the t-th frame image, F_fusion(·) is the fusion network operation, Θ_fusion are the parameters of the fusion network, and σ(·) denotes the sigmoid function.
S32, the fused coordinates Ĵ_t are obtained by linear weighting:

Ĵ_t = w_{1,t} ⊙ Ĵ_t^s + w_{2,t} ⊙ Ĵ_t^τ

wherein ⊙ denotes element-wise multiplication.
Further, in step S4, establishing the prediction model for hand pose estimation specifically includes:
S41, establishing a deep convolutional neural network whose input is the continuous multi-frame depth images (X_1, ..., X_T) and whose output is the hand joint point positions (Ĵ_1, ..., Ĵ_T) in each frame, thereby constructing in the neural network a mapping F: (X_1, ..., X_T) → (Ĵ_1, ..., Ĵ_T), formulated as:

(Ĵ_1, ..., Ĵ_T) = F(X_1, ..., X_T; Θ)

wherein F is the deep convolutional neural network and Θ denotes its parameters;
S42, the loss function L of the neural network is:

L = Σ_{i=1}^{N} Σ_{t=1}^{T} || Ĵ_{t,i} - J'_{t,i} ||_2^2

wherein N is the number of videos and the subscript i denotes the corresponding quantity in the i-th video; the whole neural network is trained by minimizing the loss function L with the Adam optimization method and the back-propagation algorithm.
Further, in step S5, performing hand pose estimation on continuous-frame depth images using the prediction model includes: the continuous-frame depth images are subjected to the same scaling and normalization operations as the training images and then input into the trained deep convolutional neural network; the output (Ĵ_1, ..., Ĵ_T) is the predicted hand joint point coordinates.
Compared with existing hand pose estimation methods, the hand pose estimation method based on spatio-temporal context learning has the following beneficial effects:
Firstly, the hand pose estimation method based on spatio-temporal context learning identifies two key problems in hand pose estimation: extracting effective information from the depth image, and accurately regressing the hand coordinates from the extracted features. By addressing these two directions, hand pose estimation under complex conditions can be handled effectively.
Secondly, the hand pose estimation method of the invention models the spatial context and the temporal context with deep convolutional neural networks to extract effective information from the depth image. The spatial context network fuses the multi-modal information from the depth image and the three-dimensional voxel representation through a hierarchical fusion method and extracts more robust visual features; the temporal context network exploits the temporal order of the multi-frame images and uses a recurrent neural network to model the dependencies across frames.
Finally, the hand pose estimation method of the invention unifies the temporal and spatial contexts in a single framework with a fusion network: the weights of the temporal and spatial networks are learned adaptively from the input depth image, and the multiple outputs are effectively fused by linear weighting.
The hand pose estimation method based on spatio-temporal context learning can effectively improve the accuracy and efficiency of hand pose estimation in human-computer interaction, virtual reality and augmented reality, and therefore has good application value. For example, in a human-computer interaction scenario, the method can quickly and accurately estimate the positions of the hand joint points, so that a robot can be controlled with hand motions.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 shows results on hard-to-recognize images from the two data sets.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
Referring to fig. 1, in a preferred embodiment of the present invention, a hand pose estimation method based on spatiotemporal context learning includes the following steps:
S1, a continuous-frame depth image data set for training hand pose estimation is acquired; it comprises N training videos that meet the requirement on the number of training samples, each of which contains continuous-frame depth images (X_1, ..., X_T)^train and manually pre-labeled hand joint point positions (J_1, ..., J_T)^train, where X_t and J_t denote the t-th frame depth image and the hand joint point positions corresponding to that image, respectively.
The algorithm target is defined as: predicting the coordinates of the hand joint points in an arbitrary depth image.
S2, modeling the corresponding context information using a spatial network and a temporal network in spatial and temporal dimensions, respectively, including:
In the first step, for the continuous-frame depth images (X_1, ..., X_T)^train, each image is first scaled to 128 × 128, then randomly rotated and flipped, and finally normalized (scaled to between -1 and 1); after this processing, (X_1, ..., X_T)^train yields the normalized depth images (I_1, ..., I_T)^train used as algorithm input. Subsequently, according to the depth values, the normalized depth images (I_1, ..., I_T)^train are converted into 128 × 128 × 8 three-dimensional voxel representations (V_1, ..., V_T)^train, which are also used as algorithm input. In addition, because the original images are randomly rotated and flipped, the hand joint point positions (J_1, ..., J_T)^train change accordingly, so (J_1, ..., J_T)^train must undergo the same rotation and flipping transformations as (X_1, ..., X_T)^train, yielding the transformed hand joint point positions (J'_1, ..., J'_T)^train.
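For concreteness, a minimal preprocessing sketch is given below; it is an assumed implementation, not code taken from the embodiment. It resizes one depth frame to 128 × 128, normalizes the depth values to [-1, 1], and bins the normalized depths into a 128 × 128 × 8 occupancy grid as the voxel representation V_t. The function name preprocess_frame, the use of OpenCV for resizing, and the depth bounds d_min and d_max are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess_frame(depth, d_min=0.0, d_max=1500.0, size=128, n_bins=8):
    """Hypothetical preprocessing: resize, normalize to [-1, 1], voxelize into 8 depth bins."""
    d = cv2.resize(depth.astype(np.float32), (size, size))
    d = np.clip(d, d_min, d_max)
    norm = 2.0 * (d - d_min) / (d_max - d_min) - 1.0            # I_t, values in [-1, 1]
    # Assign every pixel's depth to one of n_bins slices -> V_t with shape (n_bins, size, size).
    bins = np.floor((norm + 1.0) / 2.0 * (n_bins - 1e-6)).astype(np.int64)
    voxel = np.zeros((n_bins, size, size), dtype=np.float32)
    rows, cols = np.indices((size, size))
    voxel[bins, rows, cols] = 1.0
    return norm[None], voxel                                    # (1, H, W) and (8, H, W)
```

The random rotation and flipping applied during training (and the matching transformation of the joint labels) would be applied before this step; they are omitted here for brevity.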
Secondly, for modeling spatial context information, for any frame of depth image ItAnd a three-dimensional voxel representation VtPerforming spatial network operation Fspatio(. in the spatial network operation, for ItAnd VtDownsampling by using three layers of convolution operation (adding a ReLU activation function to each layer) and maximum pooling operation to respectively obtain features
Figure BDA0002304839760000081
And
Figure BDA0002304839760000082
two features were then fused using a hierarchical fusion approach with a total number of layers of 3, namely:
Figure BDA0002304839760000083
Figure BDA0002304839760000084
m=1,2
wherein: phi is a0,tDenotes the fusion characteristic of layer 0,. phim,tIndicating the fusion characteristics of the m-th layer,
Figure BDA0002304839760000091
and
Figure BDA0002304839760000092
is a full connection function of the mth layer,
Figure BDA0002304839760000093
and
Figure BDA0002304839760000094
all the parameters are the parameters of the m-th layer; returning the coordinates of the hand joint points by using a full-connection operation
Figure BDA0002304839760000095
Formally expressing the above spatial network operation as:
Figure BDA0002304839760000096
wherein: fspatio(. to) represents a spatial network operation, ΘspatioParameters in the spatial network;
In the third step, to model the temporal context information, a temporal network operation F_temp(·) is performed frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in the previous step. In the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation function) with max pooling, giving the features (ψ_1, ..., ψ_T), where the feature of the depth image I_t is ψ_t = H(I_t; θ_c), H(·) is the convolution operation and θ_c are the convolution parameters. A long short-term memory network (LSTM) then models the correlations of the features along the temporal dimension, giving the hidden features (h_1, ..., h_T); the hidden feature h_t at time t is calculated as:

i_t = σ(W_hi * h_{t-1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t-1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t-1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t-1}) + (i_t ⊙ tanh(W_hc * h_{t-1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)

wherein i_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, o_t is the output of the output gate at time t, c_t is the memory cell at time t, h_t is the output of the LSTM layer at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc and W_xc are weights, b_i, b_f, b_o and b_c are biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function. A fully connected operation on the concatenated feature [ψ_t, h_t] then regresses the hand joint point coordinates Ĵ_t^τ.
The temporal network operation is formally expressed as:

(Ĵ_1^τ, ..., Ĵ_T^τ) = F_temp(I_1, ..., I_T; Θ_temp)

wherein F_temp(·) denotes the temporal network operation and Θ_temp denotes the parameters of the temporal network.
S3, fusing the input images using a fusion network as the output of the spatio-temporal model, specifically including:
the first step is to extract multiple frames of depth images (I) in the previous step1,...,IT)trainPerforming fusion network operation frame by frame FfusionIn the operation of the converged network, firstly carrying out down-sampling by using three layers of convolution operation (adding a ReLU activation function into each layer) and maximum pooling operation, and then carrying out three layers of full connection;
and obtaining the weight w through sigmoid function1And w2The two weights are obtained by the following formula:
w1,t=σ(Ffusion(It;Θfusion))
w2,t=1-w1,t
wherein w1,tAnd w2,tRespectively weight w of the t-th frame image1And w2;Ffusion(. is a converged network operation, ΘfusionTo fuse parameters in the network, σ (-) represents a sigmoid function.
Secondly, obtaining the fused coordinates in a linear weighting mode
Figure BDA0002304839760000103
Figure BDA0002304839760000104
where |, indicates a matrix element multiplication.
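One way the fusion network and the linear weighting could be realized is sketched below; the widths of the three fully connected layers are assumptions, and broadcasting the scalar weight w_{1,t} over all joint coordinates is an implementation choice, not something the embodiment specifies.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of F_fusion: predicts w_{1,t} from I_t and blends the spatial/temporal predictions."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten())
        # Three fully connected layers, as described above; final output is a single scalar.
        self.fc = nn.Sequential(nn.Linear(128 * 16 * 16, 256), nn.ReLU(),
                                nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img, j_spatial, j_temporal):
        w1 = torch.sigmoid(self.fc(self.encoder(img)))      # w_{1,t} in (0, 1), shape (B, 1)
        w2 = 1.0 - w1                                        # w_{2,t}
        return w1 * j_spatial + w2 * j_temporal              # fused J_t, broadcast over joint coordinates
```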
S4, establishing a prediction model of hand posture estimation specifically comprises:
firstly, establishing a deep convolution neural network, wherein the input of the neural network is continuous multiframe depth images (X)1,...,XT) Outputting the position of the joint point of the hand in each frame
Figure BDA0002304839760000111
Thereby constructing a map in a neural network
Figure BDA0002304839760000112
Is formulated as:
Figure BDA0002304839760000113
wherein:
Figure BDA0002304839760000114
is a deep convolutional neural network;
second, loss function of neural network
Figure BDA0002304839760000117
Comprises the following steps:
Figure BDA0002304839760000115
wherein: n is the number of videos used for training; the index i in the parameter indicates the corresponding parameter value in the ith video;
loss function using Adam optimization method and back propagation algorithm
Figure BDA0002304839760000118
And training the whole neural network.
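A minimal training-step sketch for the joint-regression loss and Adam optimization described above is given below; the wrapper module that combines the spatial, temporal and fusion networks, the batch layout and the learning rate are assumptions.

```python
import torch

def training_step(model, optimizer, frames, voxels, joints_gt):
    """One Adam/back-propagation step on a squared L2 joint-regression loss.

    frames:    (B, T, 1, 128, 128) normalized depth images of one video clip
    voxels:    (B, T, 8, 128, 128) corresponding voxel representations
    joints_gt: (B, T, 3 * n_joints) transformed ground-truth joint positions
    model:     a module combining the spatial, temporal and fusion networks sketched above
    """
    pred = model(frames, voxels)                             # (B, T, 3 * n_joints)
    loss = ((pred - joints_gt) ** 2).sum(dim=-1).mean()      # squared L2 error per frame, averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative optimizer setup: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```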
S5, using the prediction model to carry out hand posture estimation on the depth images of the continuous frames, and the specific steps comprise:
after the continuous frame depth image is subjected to the same scaling and normalization operations as the training image, the continuous frame depth image is input into a depth convolution neural network after the training is finished, and the result is output
Figure BDA0002304839760000116
I.e. the predicted hand joint point coordinates.
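At inference time the same preprocessing is applied and the trained network is simply run forward. The hypothetical usage sketch below follows the naming of the earlier sketches (preprocess_frame and a combined model taking image and voxel clips) and is not part of the original embodiment.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_video(model, depth_frames):
    """depth_frames: list of raw depth images of one video; returns per-frame joint coordinates."""
    model.eval()
    imgs, voxs = zip(*(preprocess_frame(f) for f in depth_frames))  # same scaling/normalization as training
    frames = torch.from_numpy(np.stack(imgs)).unsqueeze(0)          # (1, T, 1, 128, 128)
    voxels = torch.from_numpy(np.stack(voxs)).unsqueeze(0)          # (1, T, 8, 128, 128)
    joints = model(frames, voxels)                                   # (1, T, 3 * n_joints)
    return joints.squeeze(0).reshape(len(depth_frames), -1, 3)       # (T, n_joints, 3)
```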
It can be seen that, given continuous depth images, the above method outputs the three-dimensional coordinates of the hand joint points in each frame.
The above-described method is applied to specific examples so that those skilled in the art can better understand the effects of the present invention.
Examples
In this embodiment, experiments are performed using the method described above; the implementation follows the steps given above and is not repeated here, and only the experimental results are reported below.
NYU hand pose dataset: the data set contains ten video sequences for a total of 17604 frame depth images, eight of which contain 16008 frame depth images for training and two of which contain 1596 frame images for testing.
ICVL hand gesture data set: the data set contains three video sequences, for a total of 81009 frame depth images, one of which contains 72757 frame depth images for training and two of which contain 8252 frame images for testing.
Table 1 shows the comparison of evaluation indexes of the present embodiment on the NYU hand pose data set.

Method          Mean joint error (mm)
HeatMap [1]     21.02
DeepPrior [2]   19.73
Feedback [3]    15.97
DeepModel [4]   16.90
Lie-X [5]       14.51
CADSTN (ours)   14.83

TABLE 1
TABLE 2 (the corresponding comparison of evaluation indexes on the ICVL hand gesture data set) appears in the original document only as images and is not reproduced here.
The CADSTN is the method of the invention, and the other methods correspond to the following references:
[1] J. Tompson, M. Stein, Y. Lecun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Transactions on Graphics (ToG), vol. 33, no. 5, 2014.
[2] M. Oberweger, P. Wohlhart, and V. Lepetit, "Hands deep in deep learning for hand pose estimation," in CVWW, 2015.
[3] M. Oberweger, P. Wohlhart, and V. Lepetit, "Training a feedback loop for hand pose estimation," in ICCV, 2015.
[4] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei, "Model-based deep hand pose estimation," IJCAI, 2016.
[5] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng, "Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups," International Journal of Computer Vision, 2017.
[6] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim, "Latent regression forest: Structured estimation of 3d articulated hand posture," in CVPR, 2014.
[7] C. Wan, T. Probst, L. Van Gool, and A. Yao, "Crossing nets: Combining GANs and VAEs with a shared latent space for hand pose estimation," in CVPR, 2017.
[8] C. Wan, A. Yao, and L. Van Gool, "Direction matters: hand pose estimation from local surface normals," in ECCV, 2016.
the implementation results of the images with the parts of the two data sets being difficult to recognize are shown in fig. 2, wherein the first line and the third line are implementation results of the pre-labeled joint, and the second line and the fourth line are implementation results of the method of the present invention. By observing the implementation result, the method can still carry out robust joint point estimation in the depth image scene with self-occlusion and noise of the hand.
In the above embodiment, the hand pose estimation method based on spatio-temporal context learning of the present invention first models the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, then effectively fuses the multiple predictions with a fusion method, and thereby establishes a hand pose estimation model based on deep neural networks. Finally, the trained hand pose estimation model is used to predict the hand joint point positions of continuous-frame depth images.
Through the technical scheme, the hand posture estimation method based on the space-time context learning is developed based on the deep learning technology. The invention can model the dependency relationship between the pixels in time and space dimensions, and uniformly use the space-time context for estimating the positions of the hand joint points through the fusion network.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A hand posture estimation method based on space-time context learning is characterized by comprising the following steps:
S1, acquiring a continuous-frame depth image data set for training hand pose estimation;
S2, modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively;
S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image;
S4, establishing a prediction model for hand pose estimation;
S5, performing hand pose estimation on continuous-frame depth images with the prediction model.
2. The hand pose estimation method based on spatio-temporal context learning according to claim 1, wherein in step S1, a continuous-frame depth image data set for training hand pose estimation is acquired; it comprises N training videos, each of which contains continuous-frame depth images (X_1, ..., X_T)^train and pre-labeled hand joint point positions (J_1, ..., J_T)^train.
3. The hand pose estimation method based on spatio-temporal context learning according to claim 2, wherein in step S2, the modeling the corresponding context information using the spatial network and the temporal network in the spatial and temporal dimensions respectively comprises:
S21, for the continuous-frame depth images (X_1, ..., X_T)^train, each image is scaled to 128 × 128, randomly rotated and flipped, and normalized so that its values lie in [-1, 1], giving the normalized depth images (I_1, ..., I_T)^train used as algorithm input; each normalized depth image is then converted, according to its depth values, into a 128 × 128 × 8 three-dimensional voxel representation (V_1, ..., V_T)^train, which is also used as algorithm input; and (J_1, ..., J_T)^train is subjected to the same rotation and flipping transformations as (X_1, ..., X_T)^train, yielding the transformed hand joint point positions (J'_1, ..., J'_T)^train;
S22, modeling spatial context information, and processing any frame depth image ItAnd a three-dimensional voxel representation VtPerforming spatial network operation Fspatio(. in the spatial network operation, for ItAnd VtPerforming down-sampling by using three-layer convolution operation and maximum pooling operation of adding ReLU activation function into each layer to obtain features respectively
Figure FDA0002304839750000022
And
Figure FDA0002304839750000023
two features were then fused using a hierarchical fusion approach with a total number of layers of 3, namely:
Figure FDA0002304839750000024
Figure FDA0002304839750000025
m=1,2
wherein: phi is am,tIndicating the fusion characteristics of the m-th layer,
Figure FDA0002304839750000026
and
Figure FDA0002304839750000027
is a full connection function of the mth layer,
Figure FDA0002304839750000028
and
Figure FDA0002304839750000029
are all as followsFull connection layer parameters of m layers; returning the coordinates of the hand joint points by using a full-connection operation
Figure FDA00023048397500000210
Formally expressing the above spatial network operation as:
Figure FDA00023048397500000211
wherein: fspatio(. to) represents a spatial network operation, ΘspatioParameters in the spatial network;
S23, to model the temporal context information, a temporal network operation F_temp(·) is performed frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21; in the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation function) with max pooling, giving the features (ψ_1, ..., ψ_T), where the feature of the depth image I_t is ψ_t = H(I_t; θ_c), H(·) is the convolution operation and θ_c are the convolution parameters; a long short-term memory network (LSTM) then models the correlations of the features along the temporal dimension, giving the hidden features (h_1, ..., h_T); the hidden feature h_t at time t is calculated as:

i_t = σ(W_hi * h_{t-1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t-1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t-1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t-1}) + (i_t ⊙ tanh(W_hc * h_{t-1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)

wherein i_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, o_t is the output of the output gate at time t, c_t is the memory cell at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc and W_xc are weights, b_i, b_f, b_o and b_c are biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function; a fully connected operation on the concatenated feature [ψ_t, h_t] then regresses the hand joint point coordinates Ĵ_t^τ;
the temporal network operation is formally expressed as:

(Ĵ_1^τ, ..., Ĵ_T^τ) = F_temp(I_1, ..., I_T; Θ_temp)

wherein F_temp(·) denotes the temporal network operation and Θ_temp denotes the parameters of the temporal network.
4. The hand pose estimation method based on spatio-temporal context learning according to claim 3, wherein in step S3, fusing the outputs of the spatial and temporal models with a fusion network according to the input image specifically comprises:
S31, performing a fusion network operation F_fusion(·) frame by frame on the multi-frame depth images (I_1, ..., I_T)^train obtained in S21; in the fusion network operation, down-sampling is first performed by three convolution layers (each followed by a ReLU activation function) with max pooling, followed by three fully connected layers; the weights w_1 and w_2 are then obtained through a sigmoid function, according to:

w_{1,t} = σ(F_fusion(I_t; Θ_fusion))
w_{2,t} = 1 - w_{1,t}

wherein w_{1,t} and w_{2,t} are the weights w_1 and w_2 for the t-th frame image, F_fusion(·) is the fusion network operation, Θ_fusion are the parameters of the fusion network, and σ(·) denotes the sigmoid function;
S32, the fused coordinates Ĵ_t are obtained by linear weighting:

Ĵ_t = w_{1,t} ⊙ Ĵ_t^s + w_{2,t} ⊙ Ĵ_t^τ

wherein ⊙ denotes element-wise multiplication.
5. The hand pose estimation method based on spatio-temporal context learning according to claim 4, wherein in step S4, establishing the prediction model for hand pose estimation specifically comprises:
S41, establishing a deep convolutional neural network whose input is the continuous multi-frame depth images (X_1, ..., X_T) and whose output is the hand joint point positions (Ĵ_1, ..., Ĵ_T) in each frame, thereby constructing in the neural network a mapping F: (X_1, ..., X_T) → (Ĵ_1, ..., Ĵ_T), formulated as:

(Ĵ_1, ..., Ĵ_T) = F(X_1, ..., X_T; Θ)

wherein F is the deep convolutional neural network and Θ denotes its parameters;
S42, the loss function L of the neural network is:

L = Σ_{i=1}^{N} Σ_{t=1}^{T} || Ĵ_{t,i} - J'_{t,i} ||_2^2

wherein N is the number of videos and the subscript i denotes the corresponding quantity in the i-th video; the whole neural network is trained by minimizing the loss function L with the Adam optimization method and the back-propagation algorithm.
6. The hand pose estimation method based on spatio-temporal context learning according to claim 5, wherein in step S5, performing hand pose estimation on continuous-frame depth images using the prediction model comprises: the continuous-frame depth images are subjected to the same scaling and normalization operations as the training images and then input into the trained deep convolutional neural network; the output (Ĵ_1, ..., Ĵ_T) is the predicted hand joint point coordinates.
CN201911235772.2A 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning Pending CN111178142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911235772.2A CN111178142A (en) 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911235772.2A CN111178142A (en) 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning

Publications (1)

Publication Number Publication Date
CN111178142A true CN111178142A (en) 2020-05-19

Family

ID=70646492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911235772.2A Pending CN111178142A (en) 2019-12-05 2019-12-05 Hand posture estimation method based on space-time context learning

Country Status (1)

Country Link
CN (1) CN111178142A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666917A (en) * 2020-06-19 2020-09-15 北京市商汤科技开发有限公司 Attitude detection and video processing method and device, electronic equipment and storage medium
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN112328156A (en) * 2020-11-12 2021-02-05 维沃移动通信有限公司 Input device control method and device and electronic device
CN113723233A (en) * 2021-08-17 2021-11-30 之江实验室 Student learning participation degree evaluation method based on layered time sequence multi-example learning
JP2022541709A (en) * 2020-06-19 2022-09-27 ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド Attitude detection and video processing method, device, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107437246A (en) * 2017-07-05 2017-12-05 浙江大学 A kind of common conspicuousness detection method based on end-to-end full convolutional neural networks
CN108460329A (en) * 2018-01-15 2018-08-28 任俊芬 A kind of face gesture cooperation verification method based on deep learning detection
US20190034714A1 (en) * 2016-02-05 2019-01-31 Delphi Technologies, Llc System and method for detecting hand gestures in a 3d space
CN109961005A (en) * 2019-01-28 2019-07-02 山东大学 A kind of dynamic gesture identification method and system based on two-dimensional convolution network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034714A1 (en) * 2016-02-05 2019-01-31 Delphi Technologies, Llc System and method for detecting hand gestures in a 3d space
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN107437246A (en) * 2017-07-05 2017-12-05 浙江大学 A kind of common conspicuousness detection method based on end-to-end full convolutional neural networks
CN108460329A (en) * 2018-01-15 2018-08-28 任俊芬 A kind of face gesture cooperation verification method based on deep learning detection
CN109961005A (en) * 2019-01-28 2019-07-02 山东大学 A kind of dynamic gesture identification method and system based on two-dimensional convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIMING WU: "Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images", pages 1 - 4 *
刘冰 (Liu Bing): "深度核机器学习技术及应用" [Deep Kernel Machine Learning Technology and Applications], Beijing University of Technology Press, pages: 137 - 138 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111833400A (en) * 2020-06-10 2020-10-27 广东工业大学 Camera position and posture positioning method
CN111833400B (en) * 2020-06-10 2023-07-28 广东工业大学 Camera pose positioning method
CN111666917A (en) * 2020-06-19 2020-09-15 北京市商汤科技开发有限公司 Attitude detection and video processing method and device, electronic equipment and storage medium
JP2022541709A (en) * 2020-06-19 2022-09-27 ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド Attitude detection and video processing method, device, electronic device and storage medium
CN112328156A (en) * 2020-11-12 2021-02-05 维沃移动通信有限公司 Input device control method and device and electronic device
CN112328156B (en) * 2020-11-12 2022-05-17 维沃移动通信有限公司 Input device control method and device and electronic device
CN113723233A (en) * 2021-08-17 2021-11-30 之江实验室 Student learning participation degree evaluation method based on layered time sequence multi-example learning
CN113723233B (en) * 2021-08-17 2024-03-26 之江实验室 Student learning participation assessment method based on hierarchical time sequence multi-example learning

Similar Documents

Publication Publication Date Title
Siarohin et al. First order motion model for image animation
Zhang et al. Relational attention network for crowd counting
CN111178142A (en) Hand posture estimation method based on space-time context learning
CN107292912B (en) Optical flow estimation method based on multi-scale corresponding structured learning
CN113673307A (en) Light-weight video motion recognition method
Truong et al. Pdc-net+: Enhanced probabilistic dense correspondence network
Wang et al. What matters for 3d scene flow network
EP4060560B1 (en) Systems, methods, and storage media for generating synthesized depth data
Fooladgar et al. Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images
Guo et al. JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing
Sedai et al. A Gaussian process guided particle filter for tracking 3D human pose in video
Li et al. Face sketch synthesis using regularized broad learning system
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
Xu et al. Multi-scale spatial attention-guided monocular depth estimation with semantic enhancement
Wang et al. Adversarial learning for joint optimization of depth and ego-motion
Xu et al. RGB-T salient object detection via CNN feature and result saliency map fusion
Song et al. Contextualized CNN for scene-aware depth estimation from single RGB image
Afifi et al. Object depth estimation from a single image using fully convolutional neural network
Ukwuoma et al. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism
Ling et al. Human object inpainting using manifold learning-based posture sequence estimation
Zhang et al. DDF-HO: hand-held object reconstruction via conditional directed distance field
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
Gupta et al. End-to-end differentiable 6DoF object pose estimation with local and global constraints
Wang et al. Robust point cloud registration using geometric spatial refinement
Zhu Reconstruction of missing markers in motion capture based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination