CN111178142A - Hand posture estimation method based on space-time context learning - Google Patents
- Publication number
- Publication number: CN111178142A; application number: CN201911235772.2A
- Authority
- CN
- China
- Prior art keywords
- network
- hand
- time
- frame
- train
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention discloses a hand pose estimation method based on spatio-temporal context learning, which outputs the three-dimensional coordinates of the hand joint points in each frame given a sequence of depth images. The method specifically comprises the following steps: acquiring a continuous-frame depth image data set for training hand pose estimation and defining the algorithm target; modeling the corresponding context information in the spatial and temporal dimensions with a spatial network and a temporal network, respectively; fusing the outputs of the spatio-temporal models with a fusion network conditioned on the input image; establishing a prediction model for hand pose estimation; and performing hand pose estimation on continuous-frame depth images with the prediction model. The invention can be applied to hand pose estimation in real videos and remains accurate and robust under various complex conditions.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a hand posture estimation method based on space-time context learning.
Background
Hand pose estimation is defined as the following problem: given a depth image containing a hand, find the positions of the hand joint points relative to the camera. Hand pose estimation is commonly used in human-computer interaction, augmented reality, and virtual reality applications. Traditional methods fit a parameterized hand model by defining and optimizing an energy function, but such model-based methods are computationally expensive. With the recent development of deep neural networks, appearance-based methods instead discover hand pose patterns from data and consume fewer resources than model-based methods.
Owing to the effectiveness of statistical modeling, learning-based methods are increasingly applied to hand pose estimation. Existing appearance-based learning methods mainly adopt end-to-end deep neural network models that take one or more depth frames containing a hand as input and output the predicted hand joint positions. On the one hand, most current methods use either depth images or three-dimensional voxels as input, whereas the present invention considers the two inputs correlated and complementary; on the other hand, in real scenes successive depth frames are correlated, and modeling context information in the temporal dimension improves the prediction accuracy of the network.
Disclosure of Invention
In order to solve the above problems, the present invention provides a hand pose estimation method based on spatio-temporal context learning. The method is based on deep neural networks: it extracts and effectively fuses features from the depth-image and three-dimensional-voxel inputs, and uses a recurrent neural network to model the relationship between the features of multiple frames in the temporal dimension, improving hand pose estimation in multi-frame scenes.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a hand posture estimation method based on space-time context learning comprises the following steps:
s1, acquiring a continuous frame depth image data set for training hand posture estimation;
s2, respectively modeling corresponding context information by using a space network and a time network in space and time dimensions;
s3, fusing the output of the time-space model by using a fusion network according to the input image;
s4, establishing a hand posture estimation prediction model;
and S5, performing hand posture estimation on the depth images of the continuous frames by using the prediction model.
Based on the technical scheme, the steps can be realized in the following preferred mode.
Preferably, in step S1, a continuous-frame depth image data set for training hand pose estimation is obtained, comprising N training videos, each of which contains continuous-frame depth images (X_1, …, X_T)_train and pre-labeled hand joint point locations (J_1, …, J_T)_train.
Further, in step S2, the step of modeling the corresponding context information using the spatial network and the temporal network in the spatial and temporal dimensions includes:
S21, for the continuous-frame depth images (X_1, …, X_T)_train, scaling each image to 128 × 128, applying random rotation and flipping, and normalizing the depth values to the range [−1, 1] to obtain normalized depth images (I_1, …, I_T)_train as one algorithm input; then converting each normalized depth image, according to its depth values, into a 128 × 128 × 8 three-dimensional voxel representation (V_1, …, V_T)_train as a second algorithm input; and applying to (J_1, …, J_T)_train the same rotation and flipping transforms applied to (X_1, …, X_T)_train to obtain the transformed hand joint positions;
S22, modeling spatial context information by applying the spatial network operation F_spatio(·) to any frame's depth image I_t and three-dimensional voxel representation V_t. In the spatial network operation, I_t and V_t are each down-sampled by three convolution layers (each followed by a ReLU activation) with max pooling, yielding features f_t^I and f_t^V respectively. The two features are then fused by a hierarchical fusion method with 3 layers in total, for m = 1, 2, where φ_{m,t} denotes the fusion feature of the m-th layer, computed from the previous fusion feature by the m-th-layer fully connected functions with the corresponding m-th-layer parameters; a fully connected operation then regresses the hand joint coordinates Ĵ_t^spatio.
The above spatial network operation is expressed formally as:
Ĵ_t^spatio = F_spatio(I_t, V_t; Θ_spatio)
where F_spatio(·) denotes the spatial network operation and Θ_spatio the parameters of the spatial network;
S23, modeling temporal context information by applying the temporal network operation F_temp(·) frame by frame to the multi-frame depth images (I_1, …, I_T)_train obtained in S21. In the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation) with max pooling, yielding features (ψ_1, …, ψ_T), where the feature of depth image I_t is ψ_t = H(I_t; θ_c), H(·) being the convolution operation and θ_c its parameters. A long short-term memory network (LSTM) then models correlations along the temporal dimension, producing hidden-layer features (h_1, …, h_T); the hidden feature h_t at time t is computed as:
i_t = σ(W_hi * h_{t−1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t−1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t−1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t−1}) + (i_t ⊙ tanh(W_hc * h_{t−1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t and o_t are the outputs of the input, forget and output gates at time t, c_t is the final memory at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc, W_xc all represent weights, b_i, b_f, b_o, b_c all represent biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function. A fully connected operation applied to the concatenated feature [ψ_t, h_t] regresses the hand joint coordinates Ĵ_t^temp;
The temporal network operation is expressed formally as:
(Ĵ_1^temp, …, Ĵ_T^temp) = F_temp(I_1, …, I_T; Θ_temp)
where F_temp(·) denotes the temporal network operation and Θ_temp the parameters of the temporal network.
Further, in step S3, fusing the outputs of the spatio-temporal models with a fusion network conditioned on the input image specifically comprises:
S31, applying the fusion network operation frame by frame to the multi-frame depth images (I_1, …, I_T)_train extracted in S21; in the fusion network operation, down-sampling is first performed by three convolution layers (each followed by a ReLU activation) with max pooling, followed by three fully connected layers; the weights w_1 and w_2 are then obtained through a sigmoid function:
w_{1,t} = σ(F_fusion(I_t; Θ_fusion))
w_{2,t} = 1 − w_{1,t}
where w_{1,t} and w_{2,t} are the weights w_1 and w_2 for the t-th frame, F_fusion(·) is the fusion network operation, Θ_fusion the parameters of the fusion network, and σ(·) the sigmoid function. The two predictions are then fused by linear weighting:
Ĵ_t = w_{1,t} ⊙ Ĵ_t^spatio + w_{2,t} ⊙ Ĵ_t^temp
where ⊙ denotes element-wise multiplication.
Further, in step S4, establishing the prediction model for hand pose estimation specifically comprises:
S41, building a deep convolutional neural network whose input is the continuous multi-frame depth images (X_1, …, X_T) and whose output is the hand joint positions Ĵ_t in each frame, thereby constructing a mapping in the neural network. The training objective is a loss function L formulated over the N training videos, where N is the number of videos and the index i in the parameters denotes the corresponding value for the i-th video; the loss function L is minimized over the whole neural network using the Adam optimization method and the back-propagation algorithm.
Further, in step S5, performing hand pose estimation on continuous-frame depth images with the prediction model comprises: applying to the continuous-frame depth images the same scaling and normalization operations as for the training images, feeding them into the trained deep convolutional neural network, and taking the output Ĵ_t as the predicted hand joint coordinates.
Compared with the existing hand posture estimation method, the hand posture estimation method based on the space-time context learning has the following beneficial effects:
Firstly, the hand pose estimation method based on spatio-temporal context learning identifies two important problems in hand pose estimation: extracting effective information from the depth image, and accurately regressing hand coordinates from the extracted features. By addressing both, hand pose estimation under complex conditions can be handled effectively.
Secondly, the hand pose estimation method of the invention models the spatial context and the temporal context based on the deep convolutional neural network to extract the effective information in the depth image. The spatial context network effectively fuses multi-mode information from depth images and three-dimensional voxel expression through a hierarchical fusion method, and more robust visual expression features are extracted; the time context network utilizes the time sequence among the images of multiple frames and uses a recurrent neural network to model the corresponding relation among the multiple frames.
Finally, the hand pose estimation method of the invention uses a fusion network to unify the temporal and spatial contexts in a single framework: the weights of the temporal and spatial networks are learned adaptively from the input depth image, and the multiple outputs are effectively fused by linear weighting.
The hand posture estimation method based on the space-time context learning can effectively improve the accuracy and efficiency of hand posture estimation in human-computer interaction, virtual reality and augmented reality, and has good application value. For example, in the application scene of human-computer interaction, the hand posture estimation method can quickly and accurately estimate the joint point position of the hand, so that the robot can be controlled by using the hand action.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 shows results on images from the two data sets that are difficult to recognize.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover alternatives, modifications, equivalents and alternatives which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
Referring to fig. 1, in a preferred embodiment of the present invention, a hand pose estimation method based on spatiotemporal context learning includes the following steps:
S1, acquiring a continuous-frame depth image data set for training hand pose estimation, comprising N training videos that satisfy the required number of training samples, each training video containing continuous-frame depth images (X_1, …, X_T)_train and manually pre-labeled hand joint point locations (J_1, …, J_T)_train, where X_t and J_t denote the t-th depth frame and the hand joint positions corresponding to that frame, respectively.
The algorithm target is defined as: predicting the hand joint coordinates in an arbitrary depth image.
S2, modeling the corresponding context information using a spatial network and a temporal network in spatial and temporal dimensions, respectively, including:
First, the continuous-frame depth images (X_1, …, X_T)_train are scaled to 128 × 128, randomly rotated and flipped, and finally normalized (scaled to between −1 and 1), yielding the normalized depth images (I_1, …, I_T)_train as one algorithm input. The normalized depth images are then converted, according to their depth values, into 128 × 128 × 8 three-dimensional voxel representations (V_1, …, V_T)_train as a second algorithm input. In addition, since the random rotation and flipping of the original images also change the hand joint positions (J_1, …, J_T)_train, the same rotation and flipping transforms applied to (X_1, …, X_T)_train must be applied to (J_1, …, J_T)_train, obtaining the transformed hand joint positions.
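As an illustration, the preprocessing above (resize to 128 × 128, normalize to [−1, 1], quantize depth into 8 bins) might be sketched as follows. The depth range d_min/d_max, the nearest-neighbour resizing, and the one-hot bin occupancy are assumptions for the sketch, not values specified in the text:

```python
import numpy as np

def normalize_depth(depth, out_size=128, d_min=0.0, d_max=1500.0):
    """Resize a raw depth image to out_size x out_size (nearest neighbour)
    and scale its values into [-1, 1]; the depth range is an assumed parameter."""
    h, w = depth.shape
    ys = np.arange(out_size) * h // out_size
    xs = np.arange(out_size) * w // out_size
    resized = depth[np.ix_(ys, xs)]
    return 2.0 * (np.clip(resized, d_min, d_max) - d_min) / (d_max - d_min) - 1.0

def voxelize(norm_depth, n_bins=8):
    """Turn a normalized depth image into a 128 x 128 x 8 occupancy grid by
    assigning each pixel to the depth bin containing its value."""
    size = norm_depth.shape[0]
    bins = np.clip(((norm_depth + 1.0) / 2.0 * n_bins).astype(int), 0, n_bins - 1)
    voxels = np.zeros((size, size, n_bins), dtype=np.float32)
    rows, cols = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    voxels[rows, cols, bins] = 1.0
    return voxels

depth = np.random.default_rng(0).uniform(0.0, 1500.0, size=(240, 320))
I_t = normalize_depth(depth)   # normalized depth-image input
V_t = voxelize(I_t)            # three-dimensional voxel input
```

This makes concrete how the two inputs are derived from the same frame, which is the basis for treating them as correlated and complementary.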
Second, to model spatial context information, the spatial network operation F_spatio(·) is applied to any frame's depth image I_t and three-dimensional voxel representation V_t. In the spatial network operation, I_t and V_t are each down-sampled by three convolution layers (each followed by a ReLU activation) with max pooling, yielding features f_t^I and f_t^V respectively. The two features are then fused by a hierarchical fusion method with 3 layers in total, for m = 1, 2, where φ_{0,t} denotes the fusion feature of layer 0 and φ_{m,t} the fusion feature of the m-th layer, computed from φ_{m−1,t} by the m-th-layer fully connected functions with the corresponding m-th-layer parameters; a final fully connected operation regresses the hand joint coordinates Ĵ_t^spatio.
The above spatial network operation is expressed formally as:
Ĵ_t^spatio = F_spatio(I_t, V_t; Θ_spatio)
where F_spatio(·) denotes the spatial network operation and Θ_spatio the parameters of the spatial network;
Third, to model temporal context information, the temporal network operation F_temp(·) is applied frame by frame to the multi-frame depth images (I_1, …, I_T)_train obtained in the previous step. In the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation) with max pooling, yielding features (ψ_1, …, ψ_T), where the feature of depth image I_t is ψ_t = H(I_t; θ_c), H(·) being the convolution operation and θ_c its parameters. A long short-term memory network (LSTM) then models correlations along the temporal dimension, producing hidden-layer features (h_1, …, h_T); the hidden feature h_t at time t is computed as:
i_t = σ(W_hi * h_{t−1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t−1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t−1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t−1}) + (i_t ⊙ tanh(W_hc * h_{t−1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t and o_t are the outputs of the input, forget and output gates at time t, c_t is the final memory at time t, h_t is the output of the LSTM layer at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc, W_xc all represent weights, b_i, b_f, b_o, b_c all represent biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function. A fully connected operation applied to the concatenated feature [ψ_t, h_t] regresses the hand joint coordinates Ĵ_t^temp;
The temporal network operation is expressed formally as:
(Ĵ_1^temp, …, Ĵ_T^temp) = F_temp(I_1, …, I_T; Θ_temp)
where F_temp(·) denotes the temporal network operation and Θ_temp the parameters of the temporal network.
S3, fusing the input images using a fusion network as the output of the spatio-temporal model, specifically including:
the first step is to extract multiple frames of depth images (I) in the previous step1,...,IT)trainPerforming fusion network operation frame by frame FfusionIn the operation of the converged network, firstly carrying out down-sampling by using three layers of convolution operation (adding a ReLU activation function into each layer) and maximum pooling operation, and then carrying out three layers of full connection;
and obtaining the weight w through sigmoid function1And w2The two weights are obtained by the following formula:
w1,t=σ(Ffusion(It;Θfusion))
w2,t=1-w1,t
wherein w1,tAnd w2,tRespectively weight w of the t-th frame image1And w2;Ffusion(. is a converged network operation, ΘfusionTo fuse parameters in the network, σ (-) represents a sigmoid function.
where |, indicates a matrix element multiplication.
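A minimal sketch of this adaptive linear weighting follows. Here fusion_logit stands in for F_fusion(I_t; Θ_fusion), i.e. whatever the fusion sub-network produces from the input frame before the sigmoid; the toy joint vectors are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_predictions(j_spatial, j_temporal, fusion_logit):
    """Adaptively fuse the spatial- and temporal-network joint predictions
    with weights w_{1,t} = sigmoid(logit) and w_{2,t} = 1 - w_{1,t}."""
    w1 = sigmoid(fusion_logit)                 # w_{1,t}
    w2 = 1.0 - w1                              # w_{2,t}
    return w1 * j_spatial + w2 * j_temporal    # linear weighting of the outputs

j_s = np.array([10.0, 20.0, 30.0])             # toy spatial-network prediction
j_t = np.array([14.0, 16.0, 34.0])             # toy temporal-network prediction
fused = fuse_predictions(j_s, j_t, 0.0)        # logit 0 -> equal weights 0.5/0.5
```

Because the weights are produced from the input frame itself, the network can lean on the spatial prediction for clean frames and on the temporal prediction when a frame is noisy or occluded.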
S4, establishing a prediction model of hand posture estimation specifically comprises:
firstly, establishing a deep convolution neural network, wherein the input of the neural network is continuous multiframe depth images (X)1,...,XT) Outputting the position of the joint point of the hand in each frameThereby constructing a map in a neural network
Is formulated as:
wherein: n is the number of videos used for training; the index i in the parameter indicates the corresponding parameter value in the ith video;
loss function using Adam optimization method and back propagation algorithmAnd training the whole neural network.
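The source does not reproduce the loss formula, so the following toy sketch only illustrates the training mechanics named in this step: a mean-squared-error regression loss on joint coordinates minimized with standard Adam updates. The linear model, hyper-parameters, and dimensions are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def adam_step(w, grad, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One update of the standard Adam optimizer (common default
    hyper-parameters, not values taken from the text)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy regression standing in for joint-coordinate regression: fit W so that
# W @ x approximates the labeled joints j under a mean-squared-error loss.
x = rng.standard_normal((16, 100))     # 100 feature vectors
W_true = rng.standard_normal((6, 16))
j = W_true @ x                         # "ground-truth joint coordinates"
W = np.zeros((6, 16))
m, v = np.zeros_like(W), np.zeros_like(W)
losses = []
for t in range(1, 501):
    err = W @ x - j
    losses.append(np.mean(err ** 2))   # MSE loss on joint coordinates
    grad = 2.0 * err @ x.T / x.shape[1]
    W, m, v = adam_step(W, grad, m, v, t)
```

In the patent the gradient would come from back-propagation through the full spatio-temporal network rather than this closed-form linear case.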
S5, using the prediction model to carry out hand posture estimation on the depth images of the continuous frames, and the specific steps comprise:
after the continuous frame depth image is subjected to the same scaling and normalization operations as the training image, the continuous frame depth image is input into a depth convolution neural network after the training is finished, and the result is outputI.e. the predicted hand joint point coordinates.
It can be seen that, given continuous depth images, the above method outputs the three-dimensional coordinates of the hand joint points in each frame.
The above-described method is applied to specific examples so that those skilled in the art can better understand the effects of the present invention.
Examples
In this embodiment, experiments are performed based on the above method; the implementation follows the steps described above and is not repeated here. Only the experimental results are shown below.
NYU hand pose dataset: the data set contains ten video sequences for a total of 17604 frame depth images, eight of which contain 16008 frame depth images for training and two of which contain 1596 frame images for testing.
ICVL hand gesture data set: the data set contains three video sequences, for a total of 81009 frame depth images, one of which contains 72757 frame depth images for training and two of which contain 8252 frame images for testing.
Table 1 shows the comparison of evaluation indexes of this example on the NYU hand pose data set:

Method | Mean joint error (mm)
---|---
HeatMap [1] | 21.02
DeepPrior [2] | 19.73
Feedback [3] | 15.97
DeepModel [4] | 16.90
Lie-X [5] | 14.51
CADSTN | 14.83

TABLE 1
TABLE 2
CADSTN denotes the method of the invention; the other methods correspond to the following references:
[1] J. Tompson, M. Stein, Y. Lecun, and K. Perlin, "Real-time continuous pose recovery of human hands using convolutional networks," ACM Transactions on Graphics (ToG), vol. 33, no. 5, 2014.
[2] M. Oberweger, P. Wohlhart, and V. Lepetit, "Hands deep in deep learning for hand pose estimation," in CVWW, 2015.
[3] M. Oberweger, P. Wohlhart, and V. Lepetit, "Training a feedback loop for hand pose estimation," in ICCV, 2015.
[4] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei, "Model-based deep hand pose estimation," IJCAI, 2016.
[5] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng, "Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups," International Journal of Computer Vision, 2017.
[6] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim, "Latent regression forest: Structured estimation of 3d articulated hand posture," in CVPR, 2014.
[7] C. Wan, T. Probst, L. Van Gool, and A. Yao, "Crossing nets: Combining GANs and VAEs with a shared latent space for hand pose estimation," in CVPR, 2017.
[8] C. Wan, A. Yao, and L. Van Gool, "Direction matters: hand pose estimation from local surface normals," in ECCV, 2016.
Implementation results on hard-to-recognize images from the two data sets are shown in Fig. 2, where the first and third rows show the pre-labeled joints and the second and fourth rows show the results of the method of the invention. The results show that the method still performs robust joint point estimation in depth-image scenes with hand self-occlusion and noise.
In the above embodiment, the hand pose estimation method based on spatio-temporal context learning according to the present invention first models corresponding context information in spatial and temporal dimensions using a spatial network and a temporal network, and then effectively merges a plurality of predictions by using a merging method, and establishes a hand pose estimation model based on a deep neural network. And finally, predicting the positions of hand joint points of the continuous frame depth images by using the trained hand posture estimation model.
Through the technical scheme, the hand posture estimation method based on the space-time context learning is developed based on the deep learning technology. The invention can model the dependency relationship between the pixels in time and space dimensions, and uniformly use the space-time context for estimating the positions of the hand joint points through the fusion network.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (6)
1. A hand posture estimation method based on space-time context learning is characterized by comprising the following steps:
s1, acquiring a continuous frame depth image data set for training hand posture estimation;
s2, respectively modeling corresponding context information by using a space network and a time network in space and time dimensions;
s3, fusing the output of the time-space model by using a fusion network according to the input image;
s4, establishing a hand posture estimation prediction model;
and S5, performing hand posture estimation on the depth images of the continuous frames by using the prediction model.
2. The hand pose estimation method based on spatio-temporal context learning according to claim 1, wherein in step S1, a continuous-frame depth image data set for training hand pose estimation is obtained, comprising N training videos, each of which contains continuous-frame depth images (X_1, …, X_T)_train and pre-labeled hand joint point locations (J_1, …, J_T)_train.
3. The hand pose estimation method based on spatio-temporal context learning according to claim 2, wherein in step S2, the modeling the corresponding context information using the spatial network and the temporal network in the spatial and temporal dimensions respectively comprises:
S21, for the continuous-frame depth images (X_1, …, X_T)_train, scaling each image to 128 × 128, applying random rotation and flipping, and normalizing the depth values to the range [−1, 1] to obtain normalized depth images (I_1, …, I_T)_train as one algorithm input; then converting each normalized depth image, according to its depth values, into a 128 × 128 × 8 three-dimensional voxel representation (V_1, …, V_T)_train as a second algorithm input; and applying to (J_1, …, J_T)_train the same rotation and flipping transforms applied to (X_1, …, X_T)_train to obtain the transformed hand joint positions;
S22, modeling spatial context information by applying the spatial network operation F_spatio(·) to any frame's depth image I_t and three-dimensional voxel representation V_t; in the spatial network operation, I_t and V_t are each down-sampled by three convolution layers (each followed by a ReLU activation) with max pooling, yielding features f_t^I and f_t^V respectively; the two features are then fused by a hierarchical fusion method with 3 layers in total, for m = 1, 2, where φ_{m,t} denotes the fusion feature of the m-th layer, computed from the previous fusion feature by the m-th-layer fully connected functions with the corresponding m-th-layer fully connected parameters; a fully connected operation then regresses the hand joint coordinates Ĵ_t^spatio;
the above spatial network operation being expressed formally as Ĵ_t^spatio = F_spatio(I_t, V_t; Θ_spatio), where F_spatio(·) denotes the spatial network operation and Θ_spatio the parameters of the spatial network;
S23, modeling temporal context information by applying the temporal network operation F_temp(·) frame by frame to the multi-frame depth images (I_1, …, I_T)_train obtained in S21; in the temporal network operation, down-sampling is performed by three convolution layers (each followed by a ReLU activation) with max pooling, yielding features (ψ_1, …, ψ_T), where the feature of depth image I_t is ψ_t = H(I_t; θ_c), H(·) being the convolution operation and θ_c its parameters; an LSTM then models correlations along the temporal dimension, producing hidden-layer features (h_1, …, h_T), the hidden feature h_t at time t being computed as:
i_t = σ(W_hi * h_{t−1} + W_xi * ψ_t + b_i)
f_t = σ(W_hf * h_{t−1} + W_xf * ψ_t + b_f)
o_t = σ(W_ho * h_{t−1} + W_xo * ψ_t + b_o)
c_t = (f_t ⊙ c_{t−1}) + (i_t ⊙ tanh(W_hc * h_{t−1} + W_xc * ψ_t + b_c))
h_t = o_t ⊙ tanh(c_t)
where i_t, f_t and o_t are the outputs of the input, forget and output gates at time t, c_t is the final memory at time t, W_hi, W_xi, W_hf, W_xf, W_ho, W_xo, W_hc, W_xc all represent weights, b_i, b_f, b_o, b_c all represent biases, * and ⊙ denote matrix multiplication and element-wise multiplication respectively, and σ(·) denotes the sigmoid function; a fully connected operation applied to the concatenated feature [ψ_t, h_t] regresses the hand joint coordinates Ĵ_t^temp;
the temporal network operation being expressed formally as (Ĵ_1^temp, …, Ĵ_T^temp) = F_temp(I_1, …, I_T; Θ_temp), where F_temp(·) denotes the temporal network operation and Θ_temp the parameters of the temporal network.
4. The hand pose estimation method based on spatio-temporal context learning according to claim 3, wherein in step S3, fusing the outputs of the spatio-temporal models with a fusion network according to the input images specifically comprises:
S31, applying the fusion network operation frame by frame to the multi-frame depth images (I_1, …, I_T)_train extracted in S21; in the fusion network operation, down-sampling is first performed by three convolution layers (each followed by a ReLU activation) with max pooling, followed by three fully connected layers; the weights w_1 and w_2 are then obtained through a sigmoid function:
w_{1,t} = σ(F_fusion(I_t; Θ_fusion))
w_{2,t} = 1 − w_{1,t}
where w_{1,t} and w_{2,t} are the weights w_1 and w_2 for the t-th frame, F_fusion(·) is the fusion network operation, Θ_fusion the parameters of the fusion network, and σ(·) the sigmoid function; the two predictions being fused by linear weighting as Ĵ_t = w_{1,t} ⊙ Ĵ_t^spatio + w_{2,t} ⊙ Ĵ_t^temp, where ⊙ denotes element-wise multiplication.
5. The hand pose estimation method based on spatio-temporal context learning according to claim 4, wherein in step S4, establishing the prediction model of hand pose estimation specifically comprises:
S41, building a deep convolutional neural network whose input is the continuous multi-frame depth images (X_1, …, X_T) and whose output is the hand joint positions Ĵ_t in each frame, thereby constructing a mapping in the neural network formulated over the N training videos, where N is the number of videos and the index i in the parameters denotes the corresponding value for the i-th video.
6. The hand pose estimation method based on spatio-temporal context learning according to claim 5, wherein in step S5, performing hand pose estimation on continuous-frame depth images using the prediction model comprises: applying to the continuous-frame depth images the same scaling and normalization operations as for the training images, feeding them into the trained deep convolutional neural network, and taking the output Ĵ_t as the predicted hand joint coordinates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911235772.2A CN111178142A (en) | 2019-12-05 | 2019-12-05 | Hand posture estimation method based on space-time context learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111178142A true CN111178142A (en) | 2020-05-19 |
Family
ID=70646492
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911235772.2A Pending CN111178142A (en) | 2019-12-05 | 2019-12-05 | Hand posture estimation method based on space-time context learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178142A (en) |
History
- 2019-12-05: Application CN201911235772.2A filed in China (published as CN111178142A); legal status: Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190034714A1 (en) * | 2016-02-05 | 2019-01-31 | Delphi Technologies, Llc | System and method for detecting hand gestures in a 3d space |
CN107066935A (en) * | 2017-01-25 | 2017-08-18 | 网易(杭州)网络有限公司 | Hand gestures method of estimation and device based on deep learning |
CN107437246A (en) * | 2017-07-05 | 2017-12-05 | 浙江大学 | A kind of common conspicuousness detection method based on end-to-end full convolutional neural networks |
CN108460329A (en) * | 2018-01-15 | 2018-08-28 | 任俊芬 | A kind of face gesture cooperation verification method based on deep learning detection |
CN109961005A (en) * | 2019-01-28 | 2019-07-02 | 山东大学 | A kind of dynamic gesture identification method and system based on two-dimensional convolution network |
Non-Patent Citations (2)
Title |
---|
YIMING WU: "Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images", pages 1 - 4 * |
LIU Bing: "Deep Kernel Machine Learning Technology and Applications" (《深度核机器学习技术及应用》), Beijing University of Technology Press, pages: 137 - 138 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111833400A (en) * | 2020-06-10 | 2020-10-27 | 广东工业大学 | Camera position and posture positioning method |
CN111833400B (en) * | 2020-06-10 | 2023-07-28 | 广东工业大学 | Camera pose positioning method |
CN111666917A (en) * | 2020-06-19 | 2020-09-15 | 北京市商汤科技开发有限公司 | Attitude detection and video processing method and device, electronic equipment and storage medium |
JP2022541709A (en) * | 2020-06-19 | 2022-09-27 | ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド | Attitude detection and video processing method, device, electronic device and storage medium |
CN112328156A (en) * | 2020-11-12 | 2021-02-05 | 维沃移动通信有限公司 | Input device control method and device and electronic device |
CN112328156B (en) * | 2020-11-12 | 2022-05-17 | 维沃移动通信有限公司 | Input device control method and device and electronic device |
CN113723233A (en) * | 2021-08-17 | 2021-11-30 | 之江实验室 | Student learning participation degree evaluation method based on layered time sequence multi-example learning |
CN113723233B (en) * | 2021-08-17 | 2024-03-26 | 之江实验室 | Student learning participation assessment method based on hierarchical time sequence multi-example learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Siarohin et al. | First order motion model for image animation | |
Zhang et al. | Relational attention network for crowd counting | |
CN111178142A (en) | Hand posture estimation method based on space-time context learning | |
CN107292912B (en) | Optical flow estimation method based on multi-scale corresponding structured learning | |
CN113673307A (en) | Light-weight video motion recognition method | |
Truong et al. | Pdc-net+: Enhanced probabilistic dense correspondence network | |
Wang et al. | What matters for 3d scene flow network | |
EP4060560B1 (en) | Systems, methods, and storage media for generating synthesized depth data | |
Fooladgar et al. | Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images | |
Guo et al. | JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing | |
Sedai et al. | A Gaussian process guided particle filter for tracking 3D human pose in video | |
Li et al. | Face sketch synthesis using regularized broad learning system | |
CN112232134A (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
Xu et al. | Multi-scale spatial attention-guided monocular depth estimation with semantic enhancement | |
Wang et al. | Adversarial learning for joint optimization of depth and ego-motion | |
Xu et al. | RGB-T salient object detection via CNN feature and result saliency map fusion | |
Song et al. | Contextualized CNN for scene-aware depth estimation from single RGB image | |
Afifi et al. | Object depth estimation from a single image using fully convolutional neural network | |
Ukwuoma et al. | Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism | |
Ling et al. | Human object inpainting using manifold learning-based posture sequence estimation | |
Zhang et al. | DDF-HO: hand-held object reconstruction via conditional directed distance field | |
CN116189306A (en) | Human behavior recognition method based on joint attention mechanism | |
Gupta et al. | End-to-end differentiable 6DoF object pose estimation with local and global constraints | |
Wang et al. | Robust point cloud registration using geometric spatial refinement | |
Zhu | Reconstruction of missing markers in motion capture based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||