CN115588237A - Three-dimensional hand posture estimation method based on monocular RGB image - Google Patents

Three-dimensional hand posture estimation method based on monocular RGB image

Info

Publication number
CN115588237A
Authority
CN
China
Prior art keywords
joint
joint point
features
feature
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211255461.4A
Other languages
Chinese (zh)
Inventor
叶中付
田瑞田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211255461.4A priority Critical patent/CN115588237A/en
Publication of CN115588237A publication Critical patent/CN115588237A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses a monocular RGB image-based three-dimensional hand posture estimation method and designs a feature enhancement scheme that explicitly introduces the intrinsic skeleton structure of the hand and adaptively enhances the features of each joint point to be estimated using association information, ultimately improving the accuracy of hand pose estimation. The designed method works as follows: first, a convolutional neural network extracts joint-level and bone-level semantic features from the input hand image, and a feature fusion module aggregates the two kinds of features across semantics; a feature adaptive enhancement module then uses the association information to adaptively enhance the relevant features of each joint point; the enhanced features pass through an output layer to produce a two-dimensional joint heatmap and a relative depth map, which a multi-stage optimization strategy continuously refines to estimate more accurate two-dimensional hand joint coordinates and relative depths; finally, the camera parameters are used to compute the final three-dimensional hand joint coordinates.

Description

Three-dimensional hand posture estimation method based on monocular RGB image
Technical Field
The invention relates to a method for estimating hand pose from monocular RGB (red, green, blue) images that exploits the inherent association information among hand joint points, and belongs to the field of image processing.
Background
Three-dimensional hand pose estimation methods can be divided, according to the data source, into depth-image-based and RGB-image-based approaches. Because RGB cameras are low-cost and low-power, RGB-based solutions are more popular in many vision applications than depth-based ones. Compared with a depth image, however, a monocular RGB image suffers from inherent drawbacks such as depth ambiguity and sensitivity to illumination, so estimating a three-dimensional hand pose from a monocular RGB image remains a challenging problem.
Three-dimensional pose estimation from monocular RGB images can generally be divided into heatmap-detection-based methods, two-dimensional-to-three-dimensional pose lifting methods, and parametric model fitting methods. Heatmap-detection-based methods differ from regression methods in that they do not directly predict the three-dimensional position of each keypoint; instead they predict a three-dimensional Gaussian distribution for each joint point in a heatmap representation, from which the three-dimensional keypoint coordinates can easily be recovered by post-processing. Because many robust two-dimensional pose estimation methods already exist, many researchers focus on lifting two-dimensional poses to three-dimensional poses: the pipeline first estimates an accurate two-dimensional hand pose from the monocular RGB image and then regresses the final three-dimensional coordinates with a low-complexity regression model. Parametric-model-based methods differ from both of the above in that they approximate the hand shape with a parametric model; most studies adopt MANO-based modeling, mapping pose and shape parameters to a triangular mesh that represents the hand surface. Because the parametric model embeds rich structural priors of the human hand, most of this work integrates the hand model as a differentiable layer of a neural network and automatically learns to regress the MANO parameters to estimate the final keypoints.
Although existing two-dimensional hand pose estimation methods achieve high accuracy, monocular RGB images lack inherent depth information and the back-projection of a two-dimensional pose to a three-dimensional pose is not unique. Previous work treats all joint points equally, ignoring the differences between joint points and the contextual relations among hand joints, even though the joint association relations can alleviate depth ambiguity to some extent and improve prediction accuracy. It is therefore necessary to study a more robust method for three-dimensional pose estimation from monocular RGB images in order to obtain accurate pose estimates.
Disclosure of Invention
The technical problem solved by the invention: to overcome the shortcomings of the prior art, a hand pose estimation method based on monocular RGB images is provided so as to obtain accurate hand joint point coordinates. The method converts the three-dimensional hand pose estimation task into two subtasks: estimating the two-dimensional positions of the joint points in the image plane and estimating the three-dimensional relative depths of the joint points in space. The designed network consists of a visual feature extraction module, a semantic feature aggregation module and a joint point feature enhancement module, and a multi-stage optimization strategy is adopted to improve the accuracy of the final prediction.
The purpose of the invention is realized by the following technical scheme:
The invention discloses a monocular RGB image-based three-dimensional hand pose estimation method, which comprises the following steps:
Step 1: constructing a hand three-dimensional pose estimation network model, wherein the network model consists of a visual feature extraction module, a semantic feature aggregation module and a joint point feature adaptive enhancement module;
Step 2: inputting a monocular RGB image frame centered on the hand; the visual feature extraction module generates a joint point localization information map and a bone association information map, and a pre-trained ResNet18 is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed and supervised with the joint point localization information map and the bone association information map to obtain joint-level features containing hand joint point localization information and bone-level features containing joint point association information;
Step 3: sending the obtained joint-level features and bone-level features to the semantic feature aggregation module, which adaptively fuses the captured bone-level and joint-level features across semantics to obtain aggregated semantic features containing both;
Step 4: sending the aggregated semantic features to the joint point feature adaptive enhancement module, which explicitly introduces the intrinsic skeleton structure of the hand and uses the association information to adaptively enhance the relevant features of each hand joint point to be detected, obtaining the enhanced features of each joint point;
Step 5: for the enhanced features of each joint point, obtaining a predicted two-dimensional heatmap and a predicted relative depth map of the joint point through an output layer; continuously refining the joint point two-dimensional heatmaps and relative depth maps with a multi-stage iterative optimization method; obtaining the plane coordinates and relative depth values of the joint points through a decoding function; and finally computing the three-dimensional coordinates of the hand joint points from the camera parameters, thereby completing the hand pose estimation.
Further, step 2 is specifically implemented as follows: first, a joint point localization information map and a bone association information map conforming to a Gaussian distribution are generated for supervised training of the network; then ResNet18 pre-trained on ImageNet is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed, supervised by the joint point localization information map and the bone association information map, and simultaneously outputs bone-level features F_b and joint-level features F_j to provide richer semantic information.
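For illustration, the following is a minimal NumPy sketch of how such Gaussian supervision maps could be generated. The map size, the Gaussian sigma, the function names joint_heatmap and bone_map, and the example coordinates are illustrative assumptions, not values fixed by this disclosure.

```python
import numpy as np

def joint_heatmap(uv, size=64, sigma=2.0):
    """Gaussian joint-localization map centred on the 2-D joint position uv = (u, v)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - uv[0]) ** 2 + (ys - uv[1]) ** 2) / (2.0 * sigma ** 2))

def bone_map(uv_a, uv_b, size=64, sigma=2.0, steps=32):
    """Bone-association map: a Gaussian ridge along the segment joining two connected joints."""
    m = np.zeros((size, size), dtype=np.float32)
    for t in np.linspace(0.0, 1.0, steps):
        p = (1.0 - t) * np.asarray(uv_a) + t * np.asarray(uv_b)
        m = np.maximum(m, joint_heatmap(p, size, sigma))
    return m

# Example: supervision maps for one joint and one bone (coordinates are illustrative).
H_joint = joint_heatmap((20.5, 31.0))          # joint-level target
H_bone = bone_map((20.5, 31.0), (40.0, 12.0))  # skeleton-level target
```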
Further, in step 3, the semantic feature aggregation module adaptively fuses the captured bone-level features F_b and joint-level features F_j across semantics to obtain a fused feature, denoted F_agg, that contains both semantics. The process is as follows:
(31) First, the bone-level features F_b and joint-level features F_j output by the hourglass-based binary decoder are concatenated, and the weights of the corresponding features are obtained through a 3×3 convolution and a sigmoid activation function:
W_b, W_j = σ(conv_{3×3}(cat(F_b, F_j), θ_1))
where θ_1 denotes the parameters to be learned by the network, F_b and F_j denote the bone-level and joint-level features before the semantic aggregation module, W_b and W_j denote the learned weights of the bone-level and joint-level features, σ denotes the sigmoid activation function, and cat denotes the concatenation (splicing) operation;
(32) The bone-level feature branch and the joint-level feature branch of the hourglass-based binary decoder each perform a cross-semantic adaptive feature fusion operation with a residual connection structure: using the joint-level weight W_j and the bone-level weight W_b obtained in step (31), the cross-semantic fused features are computed [formula given only as an image in the original], where F̂_b and F̂_j denote the residual-connected bone-level and joint-level features and ⊙ denotes the dot (element-wise) product;
(33) Finally, the cross-semantically fused bone features F̂_b and joint-level features F̂_j are concatenated and fed into a 1×1 convolution to obtain the final aggregated semantic feature F_agg:
F_agg = conv_{1×1}(cat(F̂_b, F̂_j), θ_2)
where θ_2 denotes the parameters to be learned by the network and c denotes the feature dimension.
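The following PyTorch-style sketch illustrates one possible implementation of this semantic feature aggregation module. The class name SemanticAggregation and the channel count are illustrative, and because the cross-semantic fusion formula of step (32) appears only as an image in the original, the residual cross-gating form used here is an assumption.

```python
import torch
import torch.nn as nn

class SemanticAggregation(nn.Module):
    """Sketch of the cross-semantic aggregation of bone-level (F_b) and joint-level (F_j) features.
    The residual cross-gating in forward() is an assumption; the exact fusion formula is given
    only as an image in the original document."""
    def __init__(self, channels=256):
        super().__init__()
        # One 3x3 conv predicts both weight maps W_b and W_j from the concatenated features.
        self.weight_conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3, padding=1)
        self.out_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)  # final 1x1 aggregation

    def forward(self, f_b, f_j):
        w = torch.sigmoid(self.weight_conv(torch.cat([f_b, f_j], dim=1)))
        w_b, w_j = torch.chunk(w, 2, dim=1)            # W_b, W_j
        f_b_hat = f_b + w_j * f_j                      # assumed residual cross-fusion
        f_j_hat = f_j + w_b * f_b
        return self.out_conv(torch.cat([f_b_hat, f_j_hat], dim=1))  # aggregated feature F_agg

# Usage with illustrative shapes:
# agg = SemanticAggregation(256)
# f_agg = agg(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))  # -> (1, 256, 32, 32)
```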
Further, the joint point feature adaptive enhancement module of step 4 is implemented as follows:
First, a joint association structure matrix of size J×N is constructed, where J denotes the number of hand joint points and N denotes the number of joint points defined as having association information with each joint point. The semantic feature F_agg aggregated by the semantic aggregation module is then divided into J groups along the feature dimension, ensuring that each different joint point is assigned a unique feature f_j, j = 0, …, J-1, where "unique" means that the feature assigned to each joint point differs among all J joint points, C denotes the total feature dimension of all joint points, and c denotes the feature dimension of each joint point.
The formulas of the joint point feature enhancement module are as follows:
f_j^i = conv_{1×1}(cat(avgpool(f_j), avgpool(f_i)), θ)
followed by two further formulas, given only as images in the original, that derive the weight coefficients w_j^i from the association information f_j^i and aggregate the weighted associated features into the enhanced feature f̂_j, where f_j denotes the original feature of the joint point to be estimated, f_i, i = 0, …, N-1 denote the features of the N joint points having an association relationship with the joint point to be estimated, θ denotes the parameters to be learned by the network, f_j^i, i = 0, …, N-1 denote the learned association information between the joint point to be estimated and its associated joint points, w_j^i, i = 0, …, N-1 denote the learned weight coefficients of the N associated joint points, and f̂_j denotes the enhanced feature of the joint point to be estimated.
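A PyTorch-style sketch of this adaptive enhancement module is given below. The 1×1-convolution association descriptor follows the formula above; the class name JointFeatureEnhancement, the shared convolution parameters, the sigmoid weighting, the residual aggregation and the parent-chain adjacency are assumptions, since the remaining formulas and the exact skeleton definition appear only as images in the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFeatureEnhancement(nn.Module):
    """Sketch of the adaptive joint-feature enhancement module.  The association descriptor
    follows f_j^i = conv1x1(cat(avgpool(f_j), avgpool(f_i)), theta); the gating and the
    residual aggregation below are assumptions."""
    def __init__(self, num_joints=21, joint_dim=16, neighbours=None):
        super().__init__()
        self.J, self.c = num_joints, joint_dim
        # Illustrative adjacency: each joint attends to its kinematic parent (joint 0 to itself).
        self.neighbours = neighbours or [[0]] + [[j - 1] for j in range(1, num_joints)]
        self.assoc = nn.Conv2d(2 * joint_dim, joint_dim, kernel_size=1)  # f_j^i descriptor

    def forward(self, f_agg):                        # f_agg: (B, J*c, H, W)
        parts = torch.chunk(f_agg, self.J, dim=1)    # per-joint features f_j, each (B, c, H, W)
        out = []
        for j, f_j in enumerate(parts):
            g_j = F.adaptive_avg_pool2d(f_j, 1)      # avgpool(f_j)
            f_hat = f_j
            for i in self.neighbours[j]:
                g_i = F.adaptive_avg_pool2d(parts[i], 1)
                a_ji = self.assoc(torch.cat([g_j, g_i], dim=1))  # association information f_j^i
                w_ji = torch.sigmoid(a_ji)                       # assumed weight coefficients w_j^i
                f_hat = f_hat + w_ji * parts[i]                  # assumed residual aggregation
            out.append(f_hat)
        return out                                    # enhanced features, one tensor per joint
```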
Further, in step 5, the multi-stage iterative optimization method is implemented as follows: from the enhanced joint-level feature f̂_j, a two-dimensional heatmap and a relative depth map are predicted independently for each joint point to be detected [formula given only as an image in the original], where θ denotes the parameters to be learned and H_j, D_j denote the predicted two-dimensional heatmap and relative depth map of the joint point, respectively.
The predicted joint point heatmap H_j and depth map D_j are concatenated with the enhanced semantic feature f̂_j of the joint point and sent to the next stage of the network to learn refined joint point heatmaps and depth maps; a more accurate pose estimation result can then be learned from the refined features. In the corresponding formulas, given only as images in the original, F^t, H^t and D^t denote the semantic features, two-dimensional heatmaps and depth maps of the joint points to be detected at stage t, φ^{t+1} denotes the feature aggregation process of stage t+1, and Φ^{t+1} denotes the generation process of the stage-(t+1) two-dimensional heatmaps and depth maps, whose structure is consistent with the joint point feature enhancement module of stage t and which can capture longer-range association information.
The two-dimensional joint point heatmaps and relative depth maps predicted at each stage are decoded to obtain the uv coordinates of each joint point in the pixel plane and its relative depth z^{rel} [formulas given only as images in the original], where H̃_j denotes the normalized two-dimensional heatmap of the joint point to be estimated, u_j and v_j denote its two-dimensional coordinates in the pixel plane, and z_j^{rel} denotes its relative depth value.
Finally, the final coordinates of the hand joint points are computed from the camera parameters [formula given only as an image in the original], where z_root denotes the depth coordinate of the root joint and K denotes the camera intrinsic parameter matrix.
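As an illustration of the decoding and back-projection described above, the sketch below assumes a soft-argmax readout of the heatmap, a heatmap-weighted readout of the relative depth map, and a standard pinhole back-projection with the intrinsic matrix K. Since the exact decoding formulas appear only as images in the original, these choices and the function name decode_joint are assumptions.

```python
import torch

def decode_joint(H_j, D_j, K, z_root):
    """One plausible decoding for a single joint, assuming a soft-argmax readout and
    pinhole back-projection.
    H_j, D_j: (h, w) predicted heatmap and relative depth map.
    K: (3, 3) camera intrinsic matrix.  z_root: depth of the root joint."""
    h, w = H_j.shape
    p = torch.softmax(H_j.reshape(-1), dim=0).reshape(h, w)       # normalised heatmap
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    u = (p * xs).sum()                                            # expected pixel coordinates
    v = (p * ys).sum()
    z_rel = (p * D_j).sum()                                       # heatmap-weighted relative depth
    z = z_rel + z_root                                            # absolute depth (assumption)
    xyz = z * torch.linalg.inv(K) @ torch.stack([u, v, torch.tensor(1.0)])  # back-projection
    return (u, v, z_rel), xyz
```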
Compared with the prior art, the invention has the advantages that:
(1) Unlike prior methods that treat all hand joint points equally, the proposed method considers the differences between joint points, innovatively introduces the intrinsic skeleton structure of the hand, and adaptively extracts the relevant features of each joint point to be detected using its association information, thereby alleviating hand occlusion and depth ambiguity and improving the accuracy of hand pose estimation. The invention achieves this through the designed binary decoder, semantic aggregation module and joint point feature adaptive enhancement module. The binary decoder and the semantic aggregation module generate rich joint-level and bone-level semantics; when the two-dimensional plane coordinates and relative depth coordinates of the joint points are estimated, the joint point feature adaptive enhancement module adaptively extracts features relevant to the joint point to be predicted from the joint points having an association relationship with it. This means that each joint point is regressed within its own feature subspace while the features of associated joint points are fully exploited, and this decoupled prediction scheme greatly reduces the prediction difficulty and improves estimation accuracy.
(2) The invention also adopts a multi-stage iterative optimization strategy to capture long-range dependencies between hand joint points: guided by the coarse two-dimensional heatmaps and depth maps, it continuously refines the output heatmaps and depth maps over multiple iterations. The proposed iterative optimization strategy adaptively captures association information of different extents for joint points with different degrees of freedom, establishing relations among different joint points, alleviating depth ambiguity and self-occlusion, and ultimately improving the accuracy of hand pose estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic flow chart of the method for estimating a three-dimensional hand pose based on monocular RGB images according to the present invention;
FIG. 2 is a model diagram of hand three-dimensional pose estimation for monocular RGB images in accordance with the present invention;
FIG. 3 is a visual display of the estimation results of the present invention;
FIG. 4 is a comparison of the results of the present invention with other methods on the public STB dataset.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The invention discloses a robust three-dimensional hand pose estimation method based on monocular RGB images for obtaining accurate hand keypoint coordinates. It designs a feature enhancement method that explicitly introduces the intrinsic skeleton structure of the hand joint points, adaptively enhances the features of the joint points to be estimated using association information while suppressing the features of irrelevant joint points, and at the same time adopts a multi-stage optimization strategy that continuously refines the two-dimensional heatmaps and depth maps to estimate more accurate two-dimensional hand joint coordinates and relative depths, ultimately improving the accuracy of hand pose estimation.
As shown in fig. 1, the method of the present invention comprises the following steps:
Step 1: monocular three-dimensional hand pose estimation is treated as two subtasks: estimating the two-dimensional position information of the joint points in the image plane and estimating the three-dimensional relative depth information of the joint points in space. First, a joint point localization information map and a bone association information map conforming to a Gaussian distribution are generated for supervised training of the network, and ResNet18 pre-trained on ImageNet is adopted as the feature extraction encoder to obtain high-level abstract semantic features. A binary decoder is then designed based on the hourglass structure to simultaneously output bone-level features F_b and joint-level features F_j; the bone-level features F_b, which contain joint connection information, and the joint-level features F_j, which contain joint point localization information, complement each other, and the subsequent feature fusion module produces a fused feature F_agg rich in both semantics.
Step 2: the semantic feature fusion module adaptively fuses the captured bone-level features F_b and joint-level features F_j across semantics to obtain the fused feature F_agg containing both semantics. The process is as follows: first, the bone features F_b and joint point features F_j are concatenated, and the weights of the corresponding features are obtained through a 3×3 convolution and a sigmoid activation function:
W_b, W_j = σ(conv_{3×3}(cat(F_b, F_j), θ_1))
where θ_1 denotes the parameters to be learned by the network, F_b and F_j denote the bone and joint point features before aggregation, W_b and W_j denote the learned weights of the bone features and joint point features, σ denotes the sigmoid activation function, and cat denotes the concatenation operation.
Each branch then performs a cross-semantic adaptive feature fusion operation with a residual connection structure [formula given only as an image in the original], where F̂_b and F̂_j denote the bone and joint point features after residual connection and ⊙ denotes the dot (element-wise) product.
Finally, the fused features are concatenated along the feature dimension and fed into a 1×1 convolution to obtain the final aggregated semantic feature:
F_agg = conv_{1×1}(cat(F̂_b, F̂_j), θ_2)
where θ_2 denotes the parameters to be learned by the network, F_agg denotes the final fused feature, and cat denotes the concatenation operation.
Step 3: a joint point feature adaptive enhancement module is designed that explicitly introduces the intrinsic skeleton structure of the hand and uses the association information among joint points to extract context information for the joint point to be detected, enhancing relevant features and suppressing irrelevant ones. First, a joint association structure matrix of size J×N is constructed, where J denotes the number of hand joint points (J = 21) and N denotes the number of joint points defined as having association information with each joint point. The aggregated semantic feature F_agg is divided into J groups along the feature dimension so that each different joint point is assigned its own feature f_j, j = 0, …, J-1, where C denotes the total feature dimension of all joint points and c denotes the feature dimension of each joint point. When the two-dimensional plane coordinates and the relative depth coordinate of a joint point are estimated, the features related to the joint point j to be predicted are adaptively extracted from the N joint points having an association relationship with it, so that each joint point j is regressed within its own feature subspace; the features of associated joint points are fully exploited while the features of irrelevant joint points are suppressed, and this decoupled prediction scheme greatly reduces the prediction difficulty and improves estimation accuracy. The formulas of the joint point feature enhancement module are as follows:
f_j^i = conv_{1×1}(cat(avgpool(f_j), avgpool(f_i)), θ)
followed by two further formulas, given only as images in the original, that derive the weight coefficients w_j^i from the association information f_j^i and aggregate the weighted associated features into the enhanced feature f̂_j, where f_j denotes the original feature of the joint point to be estimated, f_i, i = 0, …, N-1 denote the features of the N joint points having an association relationship with the joint point to be estimated, θ denotes the parameters to be learned by the network, f_j^i denote the learned association information, w_j^i denote the learned weight coefficients of the N associated joint points, and f̂_j denotes the enhanced feature of the joint point to be estimated.
Step 4: a multi-stage optimization method continuously refines the joint point heatmaps and relative depth maps to obtain more accurate and robust pose estimation results. Specifically, from the enhanced joint-level feature f̂_j, a two-dimensional heatmap and a relative depth map are predicted independently for each joint point to be detected [formula given only as an image in the original], where θ denotes the parameters to be learned and H_j, D_j denote the predicted two-dimensional heatmap and relative depth map of the joint point, respectively.
Step 5: the predicted joint point heatmap H_j and depth map D_j are concatenated with the enhanced semantic feature f̂_j of the joint point and sent to the next stage of the network, which learns refined joint point heatmaps and depth maps; the joint point association relations are continuously expanded in subsequent optimization stages to capture longer-range dependencies, and a more accurate pose estimation result can then be learned from the refined features. In the corresponding formulas, given only as images in the original, F^t, H^t and D^t denote the semantic features, two-dimensional heatmaps and depth maps of the joint points to be detected at stage t, φ^{t+1} denotes the feature aggregation process of stage t+1, and Φ^{t+1} denotes the generation process of the stage-(t+1) two-dimensional heatmaps and depth maps, which has the same structure as the joint point feature enhancement module of stage t but captures longer-range association information.
The two-dimensional joint point heatmaps and relative depth maps predicted at each stage are decoded to obtain the uv coordinates of each joint point in the pixel plane and its relative depth z^{rel} [formulas given only as images in the original], where H̃_j denotes the normalized two-dimensional heatmap of the joint point to be estimated, u_j and v_j denote its two-dimensional coordinates in the pixel plane, and z_j^{rel} denotes its relative depth value.
Finally, the final coordinates of the hand joint points are computed from the camera parameters [formula given only as an image in the original], where z_root denotes the depth coordinate of the root joint and K denotes the camera intrinsic parameter matrix; the hand pose estimation process is thereby completed.
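A minimal PyTorch-style sketch of one refinement stage of this multi-stage strategy is shown below. The class name RefineStage, the concrete layer composition of the aggregation process φ and generation process Φ, the channel counts and the number of stages are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    """One refinement stage: aggregates the previous feature with the previous heatmap/depth
    predictions and re-predicts them.  Layer composition is an assumption."""
    def __init__(self, feat_ch=336, num_joints=21):   # feat_ch = 21 joints x 16 channels, illustrative
        super().__init__()
        self.aggregate = nn.Conv2d(feat_ch + 2 * num_joints, feat_ch, 3, padding=1)  # phi_{t+1}
        self.head = nn.Conv2d(feat_ch, 2 * num_joints, 1)                            # Phi_{t+1}

    def forward(self, feat, heat, depth):
        feat = torch.relu(self.aggregate(torch.cat([feat, heat, depth], dim=1)))
        heat, depth = torch.chunk(self.head(feat), 2, dim=1)
        return feat, heat, depth

# Iterative refinement over several stages (shapes are illustrative):
# feat = torch.randn(1, 336, 32, 32)
# heat, depth = torch.randn(1, 21, 32, 32), torch.randn(1, 21, 32, 32)
# stages = nn.ModuleList(RefineStage() for _ in range(3))
# for stage in stages:
#     feat, heat, depth = stage(feat, heat, depth)
```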
As shown in fig. 2, which shows the overall network framework of the invention: first, an RGB image centered on the hand is input; the feature encoder and the binary feature decoder in the left box produce bone-level features and joint-level features respectively, which the semantic aggregation module then integrates semantically. The right box shows the joint point feature adaptive enhancement module: the aggregated semantic features are divided equally into 21 groups, and for each joint point to be detected the relevant features are adaptively enhanced using the association information; a two-dimensional heatmap and a depth map are obtained through the output layer; an N-stage iterative optimization strategy then continuously refines the output heatmaps and depth maps; finally, the two-dimensional pixel coordinates and spatial depth coordinates of the 21 joint points are decoded, and the three-dimensional coordinates of the hand joint points are computed from the camera parameters. In fig. 2, encode denotes the encoder, decode the decoder, fusion the feature aggregation module, enhance the feature adaptive enhancement module, and refine the multi-stage iterative optimization stage.
As shown in fig. 3, which gives a visual display of the estimation results of the invention, the left column shows the input hand image with the predicted two-dimensional joint coordinates overlaid, the middle column shows the predicted joint coordinates, and the right column shows the ground-truth joint coordinates. Fig. 3 visually demonstrates the accuracy of the detection results of the invention. In fig. 3, clr + Pred_kp2D indicates the predicted two-dimensional joint coordinates drawn on the original image, Pred 3D indicates the predicted joint coordinates, and 3D annotation indicates the ground-truth joint coordinates. Keypoint denotes a predicted hand joint point, shown as a circle in the figure, and thumb, index, middle, ring and pinky label the corresponding fingers.
Fig. 4 compares the prediction results of the invention with those of other methods on the public STB dataset; the results of the invention are better than those of the other existing methods. In fig. 4, Error Thresholds denotes the error threshold in mm; 3D PCK denotes the proportion of correctly estimated keypoints, i.e., the proportion of detected keypoints whose normalized distance to the corresponding ground-truth label is smaller than the set threshold (the larger, the better); AUC denotes the area under the 3D PCK curve; Ours denotes the method of the present invention; Ge denotes the algorithm proposed by Ge et al.; Yang et al. denotes the algorithm proposed by Yang et al.; Iqbal denotes the algorithm proposed by Iqbal et al.; and Spurr et al. denotes the algorithm proposed by Spurr et al.
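For reference, the 3D PCK and AUC metrics used in fig. 4 can be computed as in the following sketch, which follows common practice for the STB benchmark; the function name pck3d_curve, the 20 to 50 mm threshold range and the millimetre units are assumptions made for illustration.

```python
import numpy as np

def pck3d_curve(pred, gt, thresholds_mm):
    """3D PCK: fraction of joints whose Euclidean error is below each threshold, and the
    area under that curve (AUC) normalised to [0, 1].
    pred, gt: (num_samples, num_joints, 3) arrays in millimetres."""
    err = np.linalg.norm(pred - gt, axis=-1).reshape(-1)          # per-joint errors
    pck = np.array([(err <= t).mean() for t in thresholds_mm])
    auc = np.trapz(pck, thresholds_mm) / (thresholds_mm[-1] - thresholds_mm[0])
    return pck, auc

# Example over a 20-50 mm threshold range commonly reported for the STB benchmark:
# pck, auc = pck3d_curve(pred_kpts, gt_kpts, np.linspace(20, 50, 31))
```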

Claims (5)

1. A monocular RGB image-based three-dimensional hand posture estimation method, characterized by comprising the following steps:
step 1: constructing a hand three-dimensional pose estimation network model, wherein the network model consists of a visual feature extraction module, a semantic feature aggregation module and a joint point feature adaptive enhancement module;
step 2: inputting a monocular RGB image frame centered on the hand; the visual feature extraction module generates a joint point localization information map and a bone association information map, and a pre-trained ResNet18 is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed and supervised with the joint point localization information map and the bone association information map to obtain joint-level features containing hand joint point localization information and bone-level features containing joint point association information;
step 3: sending the obtained joint-level features and bone-level features to the semantic feature aggregation module, which adaptively fuses the captured bone-level and joint-level features across semantics to obtain aggregated semantic features containing both;
step 4: sending the aggregated semantic features to the joint point feature adaptive enhancement module, which explicitly introduces the intrinsic skeleton structure of the hand and uses the association information to adaptively enhance the relevant features of each hand joint point to be detected, obtaining the enhanced features of each joint point;
step 5: for the enhanced features of each joint point, obtaining a predicted two-dimensional heatmap and a predicted relative depth map of the joint point through an output layer; continuously refining the joint point two-dimensional heatmaps and relative depth maps with a multi-stage iterative optimization method; obtaining the plane coordinates and relative depth values of the joint points through a decoding function; and finally computing the three-dimensional coordinates of the hand joint points from the camera parameters, thereby completing the hand pose estimation.
2. The method of claim 1, wherein step 2 is specifically implemented as follows: first, a joint point localization information map and a bone association information map conforming to a Gaussian distribution are generated for supervised training of the network; then ResNet18 pre-trained on ImageNet is adopted as the feature extraction encoder to obtain abstract semantic features of the image; a binary decoder based on the hourglass structure is designed, supervised by the joint point localization information map and the bone association information map, and simultaneously outputs bone-level features F_b and joint-level features F_j to provide richer semantic information.
3. The method of claim 1, wherein in step 3 the semantic feature aggregation module adaptively fuses the captured bone-level features F_b and joint-level features F_j across semantics to obtain a fused feature F_agg containing both semantics, as follows:
(31) first, the bone-level features F_b and joint-level features F_j output by the hourglass-based binary decoder are concatenated, and the weights of the corresponding features are obtained through a 3×3 convolution and a sigmoid activation function:
W_b, W_j = σ(conv_{3×3}(cat(F_b, F_j), θ_1))
where θ_1 denotes the parameters to be learned by the network, F_b and F_j denote the bone-level and joint-level features before the semantic aggregation module, W_b and W_j denote the learned weights of the bone-level and joint-level features, σ denotes the sigmoid activation function, and cat denotes the concatenation operation;
(32) the bone-level feature branch and the joint-level feature branch of the hourglass-based binary decoder each perform a cross-semantic adaptive feature fusion operation with a residual connection structure: using the joint-level weight W_j and the bone-level weight W_b obtained in step (31), the cross-semantic fused features are computed [formula given only as an image in the original], where F̂_b and F̂_j denote the residual-connected bone-level and joint-level features and ⊙ denotes the dot (element-wise) product;
(33) finally, the cross-semantically fused bone features F̂_b and joint-level features F̂_j are concatenated and fed into a 1×1 convolution to obtain the final aggregated semantic feature F_agg:
F_agg = conv_{1×1}(cat(F̂_b, F̂_j), θ_2)
where θ_2 denotes the parameters to be learned by the network and c denotes the feature dimension.
4. The method of claim 1, wherein the joint point feature adaptive enhancement module of step 4 is implemented as follows:
first, a joint association structure matrix of size J×N is constructed, where J denotes the number of hand joint points and N denotes the number of joint points defined as having association information with each joint point; the semantic feature F_agg aggregated by the semantic aggregation module is divided equally into J groups along the feature dimension, ensuring that each different joint point is assigned a unique feature f_j, where "unique" means that the feature assigned to each joint point differs among all J joint points, C denotes the total feature dimension of all joint points, and c denotes the feature dimension of each joint point;
the formulas of the joint point feature enhancement module are as follows:
f_j^i = conv_{1×1}(cat(avgpool(f_j), avgpool(f_i)), θ)
followed by two further formulas, given only as images in the original, that derive the weight coefficients w_j^i from the association information f_j^i and aggregate the weighted associated features into the enhanced feature f̂_j, where f_j denotes the original feature of the joint point to be estimated, f_i, i = 0, …, N-1 denote the features of the N joint points having an association relationship with the joint point to be estimated, θ denotes the parameters to be learned by the network, f_j^i denote the learned association information, w_j^i denote the learned weight coefficients of the N associated joint points, and f̂_j denotes the enhanced feature of the joint point to be estimated.
5. The method of claim 1, wherein in step 5 the multi-stage iterative optimization method is implemented as follows: from the enhanced joint-level feature f̂_j, a two-dimensional heatmap and a relative depth map are predicted independently for each joint point to be detected [formula given only as an image in the original], where θ denotes the parameters to be learned and H_j, D_j denote the predicted two-dimensional heatmap and relative depth map of the joint point, respectively;
the predicted joint point heatmap H_j and depth map D_j are concatenated with the enhanced semantic feature f̂_j of the joint point and sent to the next stage of the network to learn refined joint point heatmaps and depth maps; a more accurate pose estimation result is then learned from the refined features; in the corresponding formulas, given only as images in the original, F^t, H^t and D^t denote the semantic features, two-dimensional heatmaps and depth maps of the joint points to be detected at stage t, φ^{t+1} denotes the feature aggregation process of stage t+1, and Φ^{t+1} denotes the generation process of the stage-(t+1) two-dimensional heatmaps and depth maps, whose structure is consistent with the joint point feature enhancement module of stage t and which can capture longer-range association information;
the two-dimensional joint point heatmaps and relative depth maps predicted at each stage are decoded to obtain the uv coordinates of each joint point in the pixel plane and its relative depth z^{rel} [formulas given only as images in the original], where H̃_j denotes the normalized two-dimensional heatmap of the joint point to be estimated, u_j and v_j denote its two-dimensional coordinates in the pixel plane, and z_j^{rel} denotes its relative depth value;
finally, the final coordinates of the hand joint points are computed from the camera parameters [formula given only as an image in the original], where z_root denotes the depth coordinate of the root joint and K denotes the camera intrinsic parameter matrix.
CN202211255461.4A 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image Pending CN115588237A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211255461.4A CN115588237A (en) 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211255461.4A CN115588237A (en) 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image

Publications (1)

Publication Number Publication Date
CN115588237A true CN115588237A (en) 2023-01-10

Family

ID=84780008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211255461.4A Pending CN115588237A (en) 2022-10-13 2022-10-13 Three-dimensional hand posture estimation method based on monocular RGB image

Country Status (1)

Country Link
CN (1) CN115588237A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434335A (en) * 2023-03-30 2023-07-14 东莞理工学院 Method, device, equipment and storage medium for identifying action sequence and deducing intention
CN116434335B (en) * 2023-03-30 2024-04-30 东莞理工学院 Method, device, equipment and storage medium for identifying action sequence and deducing intention
CN116880687A (en) * 2023-06-07 2023-10-13 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm
CN116880687B (en) * 2023-06-07 2024-03-19 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination